Article

Facial Anonymization Model Evaluation Criteria: Development and Validation in Autonomous Vehicle Environments

1 Department of Information Security Engineering, Soonchunhyang University, Asan 31538, Republic of Korea
2 Department of Computer Science, Dankook University, Gyeonggi 16890, Republic of Korea
* Authors to whom correspondence should be addressed.
Appl. Sci. 2026, 16(6), 2979; https://doi.org/10.3390/app16062979
Submission received: 27 February 2026 / Revised: 15 March 2026 / Accepted: 17 March 2026 / Published: 19 March 2026
(This article belongs to the Special Issue Innovative Computer Vision and Deep Learning Applications)

Abstract

With the rapid advancement of autonomous driving technology and the commercialization of Human–Machine Interface (HMI) services, camera-based systems for external environment perception are being extensively deployed. While comprehensive camera systems enhance safety and convenience, they simultaneously raise serious privacy concerns by collecting facial and biometric information of Vulnerable Road Users (VRUs) and passengers. Although facial anonymization technology has emerged as a key solution, the field currently faces a fundamental challenge: the absence of unified performance evaluation criteria. Existing studies employ disparate evaluation metrics, making objective inter-model comparison and performance verification difficult. This study proposes quantitative evaluation metrics and corresponding evaluation criteria that enable systematic and objective assessment of facial anonymization model performance. Through large-scale experiments, we developed quantitative evaluation metrics encompassing facial landmark variations, visual similarity, and re-identification prevention capability, and derived specific threshold values based on statistical methodologies. Furthermore, to validate the proposed evaluation criteria, we conducted systematic empirical assessments using models that adopt different technical approaches. The validation experiments showed that the evaluation criteria proposed in this study can be applied across models with distinct technical characteristics. This research is expected to contribute to resolving the heterogeneous evaluation criteria issues in existing studies by providing unified evaluation criteria. It may also support the development of privacy protection technologies in autonomous driving environments.

1. Introduction

With the rapid advancement and commercialization of autonomous driving technology, vehicles are evolving beyond mere transportation means into complex intelligent systems. Modern autonomous vehicles utilize multiple sensors including cameras, Light Detection and Ranging (LiDAR), and radar to detect and analyze Vulnerable Road Users (VRUs) and surrounding objects in real-time for external environment perception. Simultaneously, Human–Machine Interface (HMI) systems within vehicles continuously operate through built-in cameras for driver and passenger state monitoring, personalized service provision, and safety management.
However, while camera-based systems significantly enhance the safety and convenience of autonomous driving, they simultaneously raise serious privacy concerns. External cameras collect personal information from various individuals including pedestrians, cyclists, and drivers of other vehicles, while internal cameras continuously record passengers’ private moments and personal characteristics. The risk of privacy invasion and identity exposure increases dramatically, particularly when collected video data is transmitted to cloud servers or shared with third parties for data analysis.
Privacy protection in autonomous driving environments has become a legal imperative beyond merely an ethical consideration. The European Union’s General Data Protection Regulation (GDPR) and California’s Consumer Privacy Act (CCPA) [1,2] are playing pioneering roles, and this movement is spreading globally. Countries worldwide are introducing stricter restrictions on the collection and processing of personally identifiable information. Examples include China’s Personal Information Protection Law (PIPL), Japan’s amended Act on the Protection of Personal Information (APPI), Brazil’s Lei Geral de Proteção de Dados (LGPD), and South Korea’s strengthened Personal Information Protection Act (PIPA) [3,4,5,6].
Significant progress is also being made in international standardization. ITU-T Study Group 17 (ITU-T SG17) is developing international standards for anonymization and protection of personal information collected in vehicle systems through X.af-sec [7], a security framework standardization effort for autonomous driving environments. ISO Technical Committee 204 (ISO/TC 204), Intelligent transport systems present privacy protection guidelines for ITS environments through the ISO 21177 series [8], and the Institute of Electrical and Electronics Engineers (IEEE) has standardized security services for vehicular communications through IEEE Std 1609.2 [9]. Furthermore, UNECE World Forum for Harmonization of Vehicle Regulations (WP.29) has established regulations on cybersecurity for autonomous vehicles (UN Regulations No. 155, 156) [10,11], implicitly addressing comprehensive data protection matters and specifying obligations for vehicle manufacturers.
These regulatory and standardization movements indicate that autonomous vehicle manufacturers face the complex challenge of maintaining system functionality while meeting multilayered regulatory requirements in the global market. According to guidelines published by the European Automobile Manufacturers Association (ACEA) [12], autonomous driving systems must comply with data collection minimization, purpose limitation, storage period limitation, and anonymization obligations.
Consequently, facial anonymization technology has emerged as a critical solution to address these privacy protection requirements. Facial anonymization, as discussed in this paper, refers to technology that removes or transforms the facial features that enable personal identification in video data, thereby protecting personal information while meeting functional system requirements. Accordingly, research on developing facial anonymization models utilizing various technical approaches, including computer vision, machine learning, and generative artificial intelligence, is actively progressing in both academia and industry.
However, the facial anonymization technology field currently faces a fundamental problem: the absence of objective evaluation criteria. Existing studies employ different evaluation metrics and standards depending on their respective research purposes and technical characteristics. Even when the same metric is used, threshold values and measurement methods vary across studies, which makes objective inter-model comparison and performance verification difficult. This ambiguity and inconsistency in evaluation criteria act as a major barrier hindering the advancement of facial anonymization technology. The absence of clear and consistent evaluation criteria makes it difficult for researchers to accurately assess the actual performance of their models and causes confusion in model selection and optimization processes for practical applications. Particularly in fields where safety and reliability are critical, such as autonomous driving, performance misjudgments due to unverified evaluation criteria can lead to serious consequences. This study aims to resolve heterogeneous evaluation issues in facial anonymization research by proposing privacy-oriented evaluation criteria that enable systematic and objective assessment of anonymization performance. More specifically, this study focuses on deriving such criteria and validating them across anonymization models with different technical approaches.
The expected contributions of this study are as follows. First, it establishes comprehensive quantitative evaluation metrics and statistically derived criteria to provide unified evaluation standards for facial anonymization research. Second, it constructs a unified evaluation framework that enables objective comparison across diverse technical approaches. Third, it empirically examines the effectiveness and discriminative power of the proposed evaluation criteria.
The remainder of this paper is organized as follows. Section 2 presents the background on facial anonymization algorithms and reviews related work, including technique families and typical evaluation patterns, as well as representative studies and metric usage in practice. Section 3 describes the generation of anonymized image datasets and preprocessing, the establishment of experimental items, and the experimentation process used to derive evaluation criteria. It presents the statistical evaluation criteria setting methodology and derives criteria for facial landmark reduction rate, facial similarity, and facial re-identification rate. Section 4 introduces validation experiments and validation models, and reports validation results on model discrimination capability, model characteristic reflection capability, and technical approach discrimination capability, followed by a comprehensive evaluation of each model. Finally, Section 5 summarizes the study and discusses limitations and future work.

2. Background & Related Work

This section briefly outlines representative technical components used in facial anonymization and then reviews prior studies with a focus on how evaluation metrics and evaluation practices differ across studies.

2.1. Background

This subsection briefly outlines representative technical components used in facial anonymization, focusing on computer vision, deep learning, and facial landmark extraction.

2.1.1. Computer Vision Algorithms

Early facial anonymization methods often relied on conventional computer vision techniques such as Haar cascades and Histogram of Oriented Gradients (HOG) [13,14] for face or feature detection. However, because these approaches have limitations in handling complex facial transformations, they are now more commonly used as auxiliary components than as standalone anonymization solutions.

2.1.2. Deep Learning Algorithms

Following early computer vision-based approaches, recent facial anonymization studies have increasingly adopted Generative Adversarial Network (GAN) [15]-based generative models for transforming or synthesizing facial content. Representative variants include StarGAN [16] for multi-domain image-to-image translation and StyleGAN [17] for flexible facial synthesis. Additional GAN-based variants have continued to be proposed in prior studies.

2.1.3. Facial Landmark Detection and Extraction Algorithms

Facial landmarks are often used in facial anonymization pipelines to represent key facial regions and support tasks such as alignment, preprocessing, and facial structure analysis [18]. Prior work has used a range of landmark-related methods, from classical feature-based approaches such as Scale-Invariant Feature Transform (SIFT), Speeded-Up Robust Features (SURF), Oriented FAST and Rotated BRIEF (ORB), and Accelerated KAZE (A-KAZE) [19] to more recent deep learning-based models such as the dlib 68-point model [20], the MediaPipe 468-point model [21], and the Multi-task Cascaded Convolutional Networks (MTCNN) 5-point model [22]. In this study, FaceBoxes [23] is used for face detection, while 3D Dense Face Alignment (3DDFA) [24], a 3D Morphable Model (3DMM) based method for estimating 3D facial landmarks from 2D facial images, is used for landmark extraction in the evaluation pipeline.

2.2. Related Work

This subsection reviews prior facial anonymization studies with a focus on how evaluation metrics have been selected, combined, and reported across studies. It first outlines representative technique families and their typical evaluation patterns and then examines recent studies to show how metric selection varies depending on the technical approach, application context, and experimental objective.

2.2.1. Technique Families and Typical Evaluation Patterns

Facial anonymization techniques can be grouped by technical paradigms, and each family tends to adopt a characteristic set of evaluation metrics. Accordingly, this subsection summarizes typical evaluation patterns by technique family, highlighting which metric families are commonly reported and why. As representative examples, we discuss evaluation patterns commonly reported in classical computer vision-based approaches and GAN-based generative approaches.
In technique families that rely on classical computer vision modules (e.g., detection- or tracking-oriented anonymization) [13,14], quantitative metrics such as Detection Rate (DR), False Positive Rate (FPR), and Receiver Operating Characteristic (ROC) curves are commonly used to evaluate system performance.
For GAN-based anonymization approaches that synthesize or translate facial content, metrics such as Inception Score (IS) [25] and Fréchet Inception Distance (FID) [26] are commonly used, while Precision and Recall (P&R) [27] and Perceptual Path Length (PPL) [17] are also frequently adopted in recent studies. Additional metrics for GAN evaluation have also been proposed. Borji emphasized that GAN evaluation remains an evolving field and that continued research is needed to improve these metrics and address their limitations [28].

2.2.2. Representative Studies and Metric Usage in Practice

This heterogeneity can also be observed in representative studies, where evaluation metrics are reported in different combinations depending on the technical focus of each proposed method. For example, Wang et al. evaluated a deepfake-based facial anonymization model using a broad set of metrics, including landmark quantity reduction, landmark regions, landmark extraction algorithms, and Area Under the Curve (AUC) values from AUC-ROC [29]. In this case, landmark-related metrics are used to capture changes in facial structure or representation, whereas AUC-based evaluation is used to assess discrimination performance.
Other studies have emphasized different evaluation targets and therefore adopted different metric sets. Hellmann et al. evaluated GANonymization using FID and Learned Perceptual Image Patch Similarity (LPIPS), which mainly reflect image quality and perceptual similarity [30]. Kuang et al. assessed their method using face recognition accuracy together with quality-related metrics, thereby combining identity-related evaluation with visual quality assessment [31]. Kim et al. proposed a semantic-aware GAN-based de-identification model for identity anonymization. They employed the Structural Similarity Index Measure (SSIM) and re-identification accuracy to evaluate structural similarity and residual identity-related information [32]. These examples show that even among recent learning-based approaches, different studies emphasize different aspects of anonymization, such as visual realism, structural preservation, or identity leakage.
Variation also appears when facial anonymization is studied in different application settings. Wen et al. proposed a face anonymization framework that incorporates a differential privacy mechanism [33]. Their study emphasized the privacy–utility trade-off and showed that privacy levels can be controlled through a privacy budget. Cao et al. systematically reviewed recent face de-identification methods and compared them in terms of privacy protection effectiveness and utility preservation, highlighting that evaluation emphasis varies across methods and application settings [34]. Such examples indicate that evaluation practice is shaped not only by model architecture, but also by the intended application context and experimental objective.
Taken together, these representative studies show that facial anonymization has been evaluated using different metric types, metric combinations, and evaluation focuses depending on the method and application context. Because the reported results are not expressed within a common evaluation basis, it is difficult to directly compare studies or to judge whether different methods have achieved a comparable level of privacy protection. This limitation motivates the need for evaluation criteria that can be referenced more consistently across diverse facial anonymization techniques.

3. Proposed Evaluation Metrics and Criteria

This section proposes quantitative evaluation metrics for systematically assessing the performance of facial anonymization techniques. While facial anonymization performance evaluation in existing research has primarily relied on individual researchers’ subjective criteria or fragmented metrics, this study aims to establish objective and reliable evaluation criteria through a statistically validated comprehensive evaluation framework. The experiments proceed in the sequence shown in Figure 1.

3.1. Generate Anonymized Image Datasets and Preprocessing

3.1.1. Generate Anonymized Image

We constructed an experimental dataset based on 50,000 face images randomly sampled from the CelebFaces Attributes Dataset (CelebA) [35,36], which we denote as the original dataset. CelebA was not collected directly from real autonomous driving environments, although it includes characteristics partially corresponding to autonomous driving-related imaging conditions, such as occlusion, side-view faces, and lighting variation. Using this same 50,000-image original CelebA set, we applied five anonymization approaches separately and generated five anonymized datasets, with each dataset corresponding to one approach.
Each anonymized image is paired with its corresponding original image, enabling consistent comparison across models using identical inputs. The datasets used in the experiment are summarized in Table 1.

3.1.2. Preliminary Experiments for Rounding Precision Setting

We include rounded variants to examine how limiting numerical precision affects anonymization outcomes in practical implementations. Since rounding can improve stability and reduce sensitivity to minor variations, we conduct preliminary experiments to select an appropriate precision level.
Preliminary experiments were conducted to determine the rounding precision applied to the respective function outputs when generating the 3D_round and Depth_round datasets.
First Decimal Place Rounding Experiment: For both the 3D function model and the Depth function model, first decimal place rounding caused excessive information loss. Experimental results showed that fine facial contour information was significantly lost, resulting in unnatural blocking artifacts, and problems occurred where even basic structural facial features were distorted.
Second Decimal Place Rounding Experiment: Second decimal place rounding yielded results that effectively transformed fine features necessary for personal identification while appropriately preserving basic facial structure and contours. In particular, while the basic shapes of eyes, nose, and mouth were maintained, individual-specific characteristic details were appropriately anonymized.
Third Decimal Place Rounding Experiment: Third decimal place rounding showed no statistically significant difference from the original images. The anonymization effect was minimal, with personal identification remaining possible.
Based on these preliminary results, we selected second decimal place rounding as it provides a clear anonymization effect while preserving overall facial contours and maintaining stable visual quality. Representative visual examples of rounding to one and two decimal places are shown in Figure 2.
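The rounding step described above can be sketched as follows. This is a minimal illustration, not the actual pipeline: the coordinate values below are hypothetical stand-ins for the normalized outputs of the 3D and Depth functions.

```python
import numpy as np

# Hypothetical normalized 3D landmark coordinates; in the actual pipeline
# these would come from the 3D / Depth function outputs.
landmarks = np.array([
    [0.41237, 0.55861, 0.12345],
    [0.41239, 0.55858, 0.12348],  # nearly identical fine-grained detail
])

# Rounding to the second decimal place collapses fine, identity-specific
# variation while preserving the coarse facial structure.
rounded = np.round(landmarks, 2)
print(rounded)

# The two nearly identical points collapse to the same coarse location.
assert np.array_equal(rounded[0], rounded[1])
```

Rounding to the third decimal place would leave the two example points distinguishable, which is consistent with the observation above that third-decimal rounding produces minimal anonymization effect.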

3.1.3. Dataset Preprocessing

All datasets were preprocessed with identical file naming systems to enable systematic comparison between original images and their corresponding anonymization results. Datasets were prepared for each anonymization model with the following configuration:
  • Original dataset: 50,000 original images extracted from CelebA
  • 3D dataset: 50,000 images from 3D function application results
  • 3D_round dataset: 50,000 images from 3D function + second decimal place rounding
  • Depth dataset: 50,000 images from Depth function application results
  • Depth_round dataset: 50,000 images from Depth function + second decimal place rounding
  • SynergyNet dataset: 50,000 images from SynergyNet model application results
Each dataset was normalized to a consistent image resolution (256 × 256 pixels) and unified to the same color space (RGB) and file format (PNG) to ensure experimental consistency.
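The normalization described above (256 × 256 resolution, RGB color space, PNG format) can be sketched with Pillow. This is an illustrative sketch rather than the study's actual preprocessing script; the synthetic grayscale image below stands in for a CelebA sample.

```python
from io import BytesIO
from PIL import Image

def normalize_image(img: Image.Image) -> bytes:
    """Convert to RGB, resize to 256x256, and re-encode as PNG,
    mirroring the dataset normalization described above."""
    img = img.convert("RGB").resize((256, 256))
    buf = BytesIO()
    img.save(buf, format="PNG")
    return buf.getvalue()

# Self-contained check with a synthetic image in place of a CelebA sample
# (178x218 is the aligned CelebA crop size).
sample = Image.new("L", (178, 218), color=128)
png_bytes = normalize_image(sample)

out = Image.open(BytesIO(png_bytes))
assert out.size == (256, 256) and out.mode == "RGB" and out.format == "PNG"
```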

3.2. Establish Experimental Items

This subsection proposes evaluation metrics for objectively measuring facial anonymization performance and corresponding evaluation criteria. The proposed evaluation metrics include facial landmark reduction rate, facial similarity rate, and re-identification rate.
  • Facial Landmark Reduction Rate: Facial anonymization may alter facial geometric structure in ways that affect the reliability and quantity of detected facial landmarks between original and anonymized images. Wang et al. reported that Deepfake faces tend to exhibit fewer detected feature points than real faces, particularly in specific facial regions, due to manipulation artifacts that introduce “feature point defects” [29]. Motivated by this observation, we hypothesize that anonymized images may yield a reduced number of reliably detected facial landmarks compared to their corresponding original images.
  • Facial Similarity: Effective facial anonymization should reduce how similar the anonymized face looks to the original face, because higher similarity can imply a higher risk of revealing identity-related information.
  • Facial Re-identification Rate: For effective facial anonymization, it should be difficult to re-identify faces from original images in anonymized facial images. The re-identification (re-ID) rate in this study refers to the proportion of anonymized facial images identified as the same person as in the original images.
In addition to the three proposed core evaluation metrics, model-specific supplementary metrics may be selectively incorporated to better reflect the technical characteristics of each anonymization approach. Examples of supplementary evaluation metrics are as follows:
  • Algorithm-Specific Model Evaluation Metrics: Evaluation methods can be selected according to the characteristics of the algorithm used by each facial anonymization model. Computer vision algorithms can be evaluated using Detection Rate, False Positive Rate, and ROC curve [13,14], while deep learning algorithms can be evaluated using IS, FID, Precision/Recall, PPL, NND, Memorization Assessment, and AUC [17,25,26,27,28].
  • Facial Landmark Region Anonymization Evaluation: A widely used method in recent facial anonymization is utilizing facial landmarks. The five most characteristic regions of the face are the face contour, left eye, right eye, nose, and mouth. Since re-identification from virtual faces to real faces is possible through all these facial landmark regions, it should be evaluated whether the five aforementioned facial landmark regions are necessarily anonymized in anonymized images.
  • Use of Facial Landmark Extraction Algorithms: To ensure the performance and reliability of facial landmark-based anonymization models, accurate landmark extraction must precede. Therefore, it should be evaluated whether the model accurately detects landmarks using reliable facial landmark extraction algorithms such as SIFT, SURF, ORB, and A-KAZE. This is because inaccurate landmark extraction can cause incompleteness in anonymization.
  • Implementation of Overfitting Prevention Measures: Artificial intelligence models can experience overfitting problems where they become excessively optimized to training data, resulting in degraded performance on new data in real environments. To ensure stable performance and generalization capability of anonymization models, it must be reviewed whether overfitting prevention techniques such as Batch Normalization [38], Dropout [39], and L1/L2 Regularization [40] are appropriately applied.
  • Training Dataset Diversity: Facial landmark regions (face contour, left eye, right eye, nose, mouth) can be occluded by hair, masks, sunglasses, or profile-facing behavior. Therefore, the training dataset should include sufficiently diverse situations, such as profile views, frontal views, and partial facial occlusions, so that the anonymization model can operate effectively even when key facial landmark regions are occluded in test images or videos.
  • Generalization Capability Verification: To demonstrate the practical and generalization capability of anonymization models, evaluation on external datasets not used in the training process or real-environment data is essential. The process of confirming whether the model consistently performs effective anonymization not only on training data but also on images from various environments and conditions should be included.

3.3. Experimentation and Derive Evaluation Criteria

3.3.1. Statistical Evaluation Criteria Setting Methodology

The establishment of anonymization metric evaluation criteria in this study is based on statistical methodologies widely recognized in machine learning model evaluation. Objective and reproducible evaluation criteria were derived by selectively applying the mean and median according to the distribution characteristics of experimental data.
In machine learning model performance evaluation, mean and median are established as the most fundamental and reliable statistics representing the central tendency of data [41,42]. Particularly, machine learning model evaluation research in genomics and bioinformatics emphasizes that appropriately selecting between mean and median according to data distribution characteristics is essential for reliable measurement of model performance [43].
In the facial de-identification field as well, mean and median are utilized as standard statistical metrics for quantitatively evaluating anonymization performance [34]. Following these prior studies, this study uses the mean when the data are approximately symmetric and the influence of outliers is limited. It uses the median when the distribution is asymmetric or when extreme values could distort the representative tendency of the data.
Furthermore, this study utilized skewness to quantitatively determine the symmetry of data distribution more accurately [44,45]. Skewness is calculated by the following equation and represents the degree of asymmetry in the distribution:
$$\gamma_1 = \frac{E\left[(X-\mu)^3\right]}{\sigma^3} = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^3}{\left(\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2\right)^{3/2}}$$
where μ is the mean, σ is the standard deviation, and n is the sample size. Generally, the smaller the absolute value of skewness, the closer the distribution is to symmetry, and the larger the absolute value, the greater the asymmetry. Following the criterion commonly used in machine learning model evaluation, this study selected the mean when $|\gamma_1| < 0.5$ and the median when $|\gamma_1| \geq 0.5$.
For each evaluation metric, we calculate skewness to determine the distribution shape objectively and then apply the most appropriate representative statistic. This procedure helps ensure that the proposed evaluation criteria are statistically grounded and aligned with standard evaluation practices in facial anonymization research.
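The selection rule above can be sketched in a few lines. For illustration only, the rule is applied here to the per-model summary values reported later in Tables 3 and 4; the study itself computes skewness over the full experimental data, so the skewness values produced by this toy input need not match the reported ones.

```python
import numpy as np

def skewness(x):
    """Sample skewness gamma_1, following the equation above."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    return (d ** 3).mean() / (d ** 2).mean() ** 1.5

def representative(x, threshold=0.5):
    """Mean when |gamma_1| < threshold, median otherwise."""
    x = np.asarray(x, dtype=float)
    return float(np.mean(x)) if abs(skewness(x)) < threshold else float(np.median(x))

# Per-model facial similarity values (cf. Table 4): near-symmetric -> mean.
similarity = [0.43, 0.56, 0.90, 1.00, 0.00]
# Per-model landmark reduction rates (cf. Table 3): one extreme value -> median.
reduction = [2.70, 4.46, 15.88, 13.86, 55.74]

print(round(representative(similarity), 2))  # -> 0.58 (mean)
print(representative(reduction))             # -> 13.86 (median)
```

Note that the rule reproduces the two thresholds derived in the following subsections: the skewed reduction-rate values yield the median 13.86, while the near-symmetric similarity values yield the mean 0.58.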

3.3.2. Evaluation Criteria for Facial Landmark Reduction Rate

The FaceBoxes library [23] was utilized for facial region detection, and experiments were conducted by extracting facial landmarks by applying the 3DDFA model [24] to detected facial regions. The experimental results are presented in Table 2.
Experimental results showed that an average of 68 facial landmarks were extracted from original images, while approximately 66 and 65 facial landmarks were extracted from 3D function and 3D function round model images, respectively. Additionally, 57 and 58 facial landmarks were extracted from Depth function and Depth function round model images, respectively, and 30 facial landmarks were extracted from SynergyNet model images. Experimental data analysis revealed that the data exhibited asymmetric distribution due to the extreme value of the SynergyNet model (30.0968). Since it is established that the median is a more appropriate representative value for asymmetric data with outliers in machine learning model evaluation, this study derived evaluation criteria using the median. The median of the entire anonymization dataset was calculated as approximately 58.5732, or about 59. Therefore, effective anonymization can be assessed when 59 or fewer facial landmarks are extracted from anonymized images.
However, models used for facial landmark extraction operate based on different numbers of facial landmarks. Due to this difference, objective comparison between models is difficult based solely on absolute landmark counts. Therefore, the facial landmark reduction rate is used to enable consistent evaluation regardless of differences across landmark extraction models.
The facial landmark reduction rate is calculated by the following formula:
$$\text{Reduction Rate}\,(\%) = \frac{\text{Original Landmark Count} - \text{Anonymized Landmark Count}}{\text{Original Landmark Count}} \times 100$$
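The formula above is a direct computation; as a sketch, using the landmark counts reported for the original images and the SynergyNet model (cf. Table 2):

```python
def landmark_reduction_rate(original_count: float, anonymized_count: float) -> float:
    """Facial landmark reduction rate (%) as defined in the formula above."""
    return (original_count - anonymized_count) / original_count * 100

# 68 landmarks in the original images vs. an average of 30.0968
# for the SynergyNet model (cf. Table 2).
rate = landmark_reduction_rate(68, 30.0968)
print(round(rate, 2))  # -> 55.74
```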
Table 3 shows the facial landmark reduction rate for each model using the results from Table 2.
Calculation results showed facial landmark reduction rates of approximately 2.70% and 4.46% for 3D function and 3D function round model images compared to original images, respectively. Additionally, Depth function and Depth function round model images showed facial landmark reductions of 15.88% and 13.86%, respectively, while SynergyNet model images exhibited a facial landmark reduction rate of 55.74%. Skewness calculation of the experimental data yielded γ1 = 1.89, indicating a strongly asymmetric distribution. As also shown in Figure 3, most values were concentrated in the lower reduction-rate range, whereas the SynergyNet result appeared as an isolated high value at 55.74%. This suggests that the SynergyNet result substantially influenced the overall distribution. Therefore, since the median is generally considered a more appropriate representative value for asymmetric data in machine learning model evaluation, this study derived the reference threshold using the median. The median of the entire anonymization dataset was calculated as approximately 13.86%. Therefore, the proposed reference threshold for facial landmark reduction rate was set at 13.86%, and anonymized images showing a facial landmark reduction rate of 13.86% or higher may be considered to have achieved a meaningful level of structural facial cue reduction.
However, this metric may reflect detector failure rather than actual privacy gain. Therefore, it should be cross-checked with other quantitative and qualitative indicators, including whether landmarks that were normally detected in the original images are no longer detected after anonymization. Conversely, some anonymization methods may preserve detectable landmarks while still reducing re-identification risk. Therefore, the facial landmark reduction rate can serve as a complementary metric and should be interpreted together with facial similarity and re-identification rate.

3.3.3. Evaluation Criteria for Facial Similarity

Facial similarity was quantified using the face recognition library (v19.24.99) [46]. The underlying face embedding implementation is based on dlib [47]. Prior to comparison, custom preprocessing was applied to the input images. For each anonymized image, the facial embedding was compared with that of its corresponding original image using L2 distance. A pair was classified as the same identity when the L2 distance was below the threshold of 0.6. Facial similarity was then calculated as the percentage of image pairs classified as the same identity across the dataset. A lower percentage indicates stronger anonymization. The experimental results are presented in Table 4.
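The thresholding and aggregation steps can be sketched with synthetic embeddings. The actual pipeline uses dlib-based face recognition embeddings; the 128-dimensional random vectors below are stand-ins used only to show how the 0.6 L2 threshold and the percentage computation interact.

```python
import numpy as np

def facial_similarity_rate(orig_emb, anon_emb, threshold=0.6):
    """Percentage of (original, anonymized) embedding pairs whose L2
    distance falls below the same-identity threshold (0.6)."""
    dists = np.linalg.norm(orig_emb - anon_emb, axis=1)
    return float((dists < threshold).mean() * 100)

rng = np.random.default_rng(0)
orig = rng.normal(size=(100, 128))  # stand-ins for 128-D face embeddings

# Strong anonymization: embeddings move far from the originals.
anon_far = orig + rng.normal(scale=1.0, size=orig.shape)
# Weak anonymization: embeddings barely change.
anon_near = orig + rng.normal(scale=0.01, size=orig.shape)

print(facial_similarity_rate(orig, anon_far))   # -> 0.0
print(facial_similarity_rate(orig, anon_near))  # -> 100.0
```

A lower rate indicates stronger anonymization, matching the interpretation used in the evaluation below.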
Experimental results in Table 4 showed facial similarity of 0.43% and 0.56% for the 3D function and 3D function round models, respectively. The Depth function and Depth function round models showed facial similarity of 0.90% and 1.00%, respectively, while the SynergyNet model showed 0.00%. The SynergyNet result of 0.00% represents an optimal anonymization outcome and is a natural value within the data distribution, so all data were retained when computing the mean. The skewness of the experimental data was γ1 = −0.21, indicating an almost symmetric distribution. As shown in Figure 4, the facial similarity values are distributed within the 0–1% range with no isolated extreme outliers. Following standard machine learning evaluation methodology, the mean was therefore selected as the representative value; the mean facial similarity computed from Table 4 was 0.58%. Effective anonymization can thus be assessed when the similarity between original and anonymized images is 0.58% or less.

3.3.4. Evaluation Criteria for Facial Re-Identification Rate

For effective facial anonymization, it should be impossible to re-identify faces from original images in anonymized facial images. The re-identification (re-ID) rate in this study refers to the proportion of anonymized facial images identified as the same person as in the original images.
Re-identification was evaluated using facenet-pytorch (v2.5.2) [48] with Inception-ResnetV1 pre-trained on VGGFace2 [49]. All input images were resized to 160 × 160 and normalized to the range of [−1, 1]. Face embeddings were compared using cosine similarity, and re-identification performance was evaluated using Rank-1 accuracy under a 1:N identification protocol. Because CelebA includes multiple images of the same person, identity overlap can inflate apparent re-identification performance if not properly controlled. To rigorously control for identity overlap in CelebA, a match was considered successful only when the anonymized image matched the exact corresponding original image with the same filename. Under this setting, the reported re-ID rates provide a strict lower-bound estimate of re-identification performance. The experimental results are presented in Table 5.
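The strict Rank-1 matching protocol described above can be sketched as follows. Embedding extraction with facenet-pytorch's InceptionResnetV1 is omitted; the sketch assumes the probe (anonymized) and gallery (original) embeddings are already computed and aligned by index, which corresponds to the same-filename pairing used in the paper. `rank1_reid_rate` is a hypothetical helper name.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Normalize each row to unit length so dot products equal cosine similarity."""
    x = np.asarray(x, dtype=float)
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def rank1_reid_rate(anon_embs, gallery_embs):
    """1:N identification: each anonymized probe is matched against the
    full gallery of original embeddings by cosine similarity, and counts
    as re-identified only if its top-ranked match is its own original
    (same index, i.e., the same filename in the strict protocol)."""
    probes = l2_normalize(anon_embs)
    gallery = l2_normalize(gallery_embs)
    sims = probes @ gallery.T                # cosine similarity matrix
    top1 = np.argmax(sims, axis=1)           # Rank-1 gallery index per probe
    hits = np.sum(top1 == np.arange(len(probes)))
    return 100.0 * hits / len(probes)
```

Because a near-miss (matching a different image of the same person) does not count as a hit under this pairing, the resulting rate is the strict lower-bound estimate described above.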
As shown in Table 5, facial re-identification rate evaluation results using FaceNet revealed facial re-ID rates of 23.73% and 23.65% for the 3D function and 3D function round models, respectively. Additionally, the depth function and depth function round models yielded facial re-ID rates of 22.12% and 22.46%, respectively, while the SynergyNet model produced a facial re-ID rate of 16.77%.
The skewness of the experimental data was calculated as γ1 = −0.43, indicating an approximately symmetric distribution. As also shown in Figure 5, the re-identification rates were distributed within the 16–24% range without an isolated extreme value. The relatively low value of the SynergyNet model (16.77%) may indicate stronger anonymization performance, but it was not treated as an extreme outlier in the observed distribution. Therefore, following a commonly used approach in machine learning model evaluation, the mean was used to derive the reference threshold.
The mean of the values in Table 5 was calculated as 21.75%. Therefore, a facial re-ID rate of 21.75% or less may be used as a reference threshold when evaluating facial re-identification using FaceNet.

3.3.5. Derive Evaluation Criteria

The final evaluation criteria for determining anonymization performance were derived by integrating the experimental results presented in Sections 3.3.2–3.3.4. The threshold for each metric was established based on the observed data distribution and its representative statistic (median or mean). An anonymization method is considered effective when it satisfies all of the following criteria:
  • Facial landmark reduction rate ≥ 13.86%
  • Facial similarity ≤ 0.58%
  • Re-identification rate (FaceNet) ≤ 21.75%
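The three criteria above can be applied mechanically as a pass/fail check. The following sketch encodes the derived thresholds; `evaluate` is a hypothetical helper, and the example values are the RTFS validation results reported in Section 4.

```python
# Derived thresholds (all values in percent) and their comparison direction.
THRESHOLDS = {
    "landmark_reduction": (13.86, "ge"),  # must be >= 13.86
    "facial_similarity": (0.58, "le"),    # must be <= 0.58
    "reid_rate": (21.75, "le"),           # must be <= 21.75
}

def evaluate(metrics):
    """Return per-criterion Met/Not Met flags and an overall verdict."""
    results = {}
    for name, (limit, mode) in THRESHOLDS.items():
        value = metrics[name]
        results[name] = value >= limit if mode == "ge" else value <= limit
    results["effective"] = all(results[name] for name in THRESHOLDS)
    return results

# RTFS validation results from Section 4
rtfs = evaluate({"landmark_reduction": 17.30,
                 "facial_similarity": 0.10,
                 "reid_rate": 0.013})
```

As noted in Section 4.3, these thresholds are intended to assess whether anonymization was achieved, not to rank high-performing models against one another.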

4. Validation

In this section, we empirically evaluate three representative anonymization models to verify the validity and practicality of the integrated evaluation criteria proposed in Section 3. The selected models adopt different technical approaches: a real-time 3D facial surface geometry model using MediaPipe (v0.10.21), a depth-based model using Delaunay triangulation, and a masking-based model using the face-alignment library (v1.4.1). By systematically applying the quantitative evaluation metrics proposed in Section 3 to 1000 images from the FFHQ (Flickr-Faces-HQ) dataset [50] anonymized by each model, we aim to demonstrate the practical value and academic contribution of the proposed evaluation criteria. Because FFHQ is independent from the dataset used for threshold derivation, it allows us to evaluate out-of-sample generalization and assess whether the proposed criteria remain valid under a different data distribution. The sequence of validation experiments is shown in Figure 6.

4.1. Validation Experiments

The core purpose of this validation experiment is to empirically verify whether the proposed evaluation criteria are valid for systematically and objectively distinguishing the performance of anonymization models with various technical approaches. In particular, this experiment examines whether the proposed criteria reflect the characteristics of different anonymization approaches and distinguish performance differences between models. Through this analysis, we assess whether the criteria are reliable for use in the anonymization field.

4.1.1. Experimental Method

For validation, 1000 original images were selected from the FFHQ dataset, and 1000 test images were generated with each anonymization model. To avoid the potential bias caused by identity overlap in the CelebA dataset used in Section 3, FFHQ was adopted as the validation dataset because it contains no duplicate identities. FFHQ is a high-resolution facial image collection provided by NVIDIA, containing 70,000 images at 1024 × 1024 pixels [50]. The dataset encompasses diverse age groups, ethnicities, genders, accessories, backgrounds, and lighting conditions, and is a standard benchmark widely used for training and evaluating generative models such as StyleGAN. Although FFHQ was not collected directly from real autonomous driving environments, it includes characteristics partially corresponding to autonomous driving-related imaging conditions, such as variation in lighting, background, and facial appearance.
In this validation experiment, three types of anonymization models were used: (1) the Real-Time Facial Surface (RTFS) model [21], (2) the Exploring Depth Information for Detecting Manipulated Face Videos (EDIDMFV) model [51], and (3) the Learning to Anonymize Faces for Privacy-Preserving Action Detection (LAFPAD) model [52].

4.1.2. Validation Models

  • RTFS Model: The RTFS model was implemented based on the real-time facial surface geometry estimation technology presented in the Google Research paper. The RTFS model used here generates a facial mesh composed of 468 3D vertices through MediaPipe FaceMesh and extracts depth information based on the z-coordinate values of each vertex [21]. It applies Delaunay triangulation to divide the facial surface into triangular patches and performs anonymization by applying gray-scale shading in the range 0–255, configured for the experiment, according to the depth value of each patch. This approach can effectively alter structural facial features through three-dimensional geometric transformation.
  • EDIDMFV Model: The EDIDMFV model performs facial anonymization utilizing depth information, based on the research in [51]. It extracts 468 3D landmarks through MediaPipe FaceMesh and generates a depth map by scaling z-coordinate information by a factor of 100. Using Delaunay triangulation, the model interpolates depth values for pixels inside each triangle with barycentric coordinates. It then applies a Gaussian blur with a 51 × 51 kernel and normalizes the result to the 0–1 range to generate a depth map. Finally, it achieves anonymization by applying a gray-scale mask with gamma correction (γ = 0.7, as configured for the experiment) to the facial region.
  • LAFPAD Model: The LAFPAD model extracts 68 two-dimensional facial landmarks through the face-alignment library based on the research in [52,53]. This model generates a facial mask with an extended forehead region based on facial contours and performs limited anonymization, excluding eye and mouth regions. Anonymization is performed by applying gray-scale values in the range of 80–200, as specified for the experiment, according to the y-coordinate changes from forehead to chin, reflecting design characteristics for preserving action detection performance.
These three selected models represent different technical paradigms: 3D mesh-based approach (RTFS), depth information interpolation technique (EDIDMFV), and limited region masking method (LAFPAD). This technical diversity is essential for objectively verifying the comprehensiveness and applicability of the proposed criteria, and the implementation of each model reflects the core ideas of the original research, ensuring the objectivity and reliability of the experiments.

4.2. Validation Experiment Results

Table 6 presents the experimental results for each model validated according to the facial landmark reduction rate, facial similarity, and facial re-identification rate evaluation criteria proposed in this study. For Facial Landmark Count Change, values are reported as rounded representative values, with raw mean values shown in parentheses.
Representative qualitative examples of the validation results are shown in Figure 7.

4.2.1. Verification of Landmark-Based Evaluation Metrics’ Model Discrimination Capability

For the landmark count change metric, RTFS (56.236) and EDIDMFV (55.012) met the threshold of 59 or less, whereas LAFPAD (65.756) did not. This pattern indicates that the mesh-based models and the 2D landmark-based model produced clearly different outcomes on the landmark-based criterion.
A similar pattern was observed for facial landmark reduction rate. RTFS (17.30%) and EDIDMFV (17.10%) exceeded the threshold of 13.86%, whereas LAFPAD (3.30%) remained below it. Numerically, these results are consistent with the broader geometric modification performed by the 3D mesh-based models and the more limited masking strategy used by LAFPAD.

4.2.2. Verification of Similarity and Re-Identification Evaluation Model Characteristic Reflection Capability

In the facial similarity evaluation, the EDIDMFV and LAFPAD models both achieved 0.00%, while the RTFS model showed a minimal similarity of 0.10%. All models met the threshold of 0.58% or less derived in Section 3.3.3, confirming that the proposed criteria appropriately reflect practical anonymization requirements. These results demonstrate that each model's anonymization approach is effective in removing visual similarity, while also indicating that the proposed criteria possess the precision to distinguish even subtle performance differences.
In the re-identification rate evaluation, the LAFPAD model (0.000%) achieved the best performance, followed by RTFS (0.013%) and EDIDMFV (0.015%) models. These results clearly reflect the differences in each model’s design purpose and technical approach. The LAFPAD model’s re-identification results suggest a strong masking effect based on face-alignment. In addition, all models remained well below the 21.75% threshold, indicating that the proposed criteria can still be used to evaluate strong anonymization results.

4.2.3. Demonstration of Evaluation Criteria’s Technical Approach Discrimination Capability

Taken together, the results suggest different strengths across the three models rather than a single uniform pattern of superiority. RTFS and EDIDMFV showed stronger performance on landmark-based metrics, whereas LAFPAD showed the lowest re-identification rate. This contrast indicates that the criteria capture different aspects of anonymization performance across technical approaches.
Accordingly, the validation results support the use of a multi-metric framework instead of relying on a single indicator. The criteria are useful because they reveal trade-offs among models and make those trade-offs interpretable in relation to each model’s technical design.

4.3. Comprehensive Evaluation of Each Model

Building on the metric-level results above, this subsection integrates the outcomes across all criteria for RTFS, EDIDMFV, and LAFPAD. The comprehensive evaluation summarized in Table 7 is intended to show how each model meets or does not meet the proposed criteria as a whole.
Met/Not Met is determined by comparing each measured value against the derived criteria (e.g., landmark reduction rate ≥ 13.86%, similarity match rate ≤ 0.58%, and re-identification rate ≤ 21.75%). It should be noted that these thresholds were intended to assess whether anonymization was achieved, rather than to serve as ranking criteria for fine-grained comparison among high-performing models. In addition, the comprehensive evaluation was conducted by including two supplementary metrics as described in Section 3.2: Facial Landmark Region Anonymization Evaluation and Use of Facial Landmark Extraction Algorithms.
As summarized in Table 7, RTFS and EDIDMFV met most of the landmark-related criteria, whereas LAFPAD did not meet those particular criteria. However, a “Not Met” result in these items should not be interpreted as overall inferiority; rather, it reflects the more limited masking scope of LAFPAD compared with the broader geometric modification used by the other two models.
At the same time, all three models satisfied the similarity and re-identification thresholds, and LAFPAD showed the lowest re-identification rate among them. This indicates that the criteria do not point to one universally superior model; instead, they reveal different strengths depending on which aspect of anonymization performance is emphasized.
The supplementary criteria in Table 7 further clarify this trade-off. RTFS and EDIDMFV satisfied the landmark-region anonymization criterion by covering all five regions, whereas LAFPAD covered only two regions. Overall, the integrated results suggest that the proposed framework is useful for comparing models with different design priorities, while also showing where each model’s strengths and limitations lie.

5. Conclusions

The widespread adoption of vehicle camera systems following the advancement of autonomous driving technology is increasing societal demands for privacy protection. Global privacy protection regulations including GDPR and CCPA, along with stringent requirements from international standardization organizations such as ITU-T, ISO, and IEEE, present autonomous vehicle manufacturers with the challenge of simultaneously achieving technical functionality and privacy protection. While facial anonymization technology has emerged as a key solution, limitations exist in objective performance comparison between models due to inconsistent evaluation metrics and dependence on relative evaluation in existing studies.
This study presented unified evaluation criteria for the facial anonymization research field by establishing comprehensive quantitative evaluation criteria based on statistical methodologies. We systematized the methodology for selective application of mean and median according to data distribution characteristics, and through large-scale experiments utilizing 50,000 images from the CelebA dataset, we derived specific threshold values including facial landmark reduction rate (≥13.86%), facial similarity (≤0.58%), and re-identification rate (≤21.75%). Furthermore, to validate the effectiveness of the proposed evaluation criteria, empirical verification was performed on models adopting different technical approaches including RTFS, EDIDMFV, and LAFPAD. Validation results confirmed the ability to quantitatively distinguish technical differences between 3D mesh-based models (RTFS: 17.30%, EDIDMFV: 17.10%) and 2D landmark-based models (LAFPAD: 3.30%) and accurately capture each model’s design purpose and strengths.
This study overcame the fragmented evaluation systems of existing studies by establishing unified evaluation criteria capable of objectively comparing various technical approaches. Additionally, by empirically validating the effectiveness and discriminative power of the proposed criteria, we provided a reference point for objective comparison and performance improvement in future related research.
As a follow-up to this study, we plan to further validate and refine the proposed criteria using datasets collected from actual vehicle-mounted camera environments and a broader range of generative model architectures. We also plan to incorporate downstream task utility evaluation, such as pedestrian detection, action recognition, and gaze-related analysis, to more comprehensively assess the practical privacy–utility trade-off in autonomous driving environments. Further work is needed toward an integrated evaluation framework that incorporates qualitative, user-experience-related criteria and that can evaluate the balance between real-time processing performance and anonymization accuracy. In particular, systematic research on bias issues that may arise during anonymization and the development of ethical criteria for the adequacy of privacy protection levels remain important research challenges. Ultimately, we expect the proposed evaluation criteria to be practically utilized by autonomous driving system developers to select and evaluate anonymization models that meet privacy protection requirements, contributing to the construction of safe and reliable autonomous driving environments.

Author Contributions

Conceptualization, C.K. and Y.L.; methodology, C.K.; investigation, C.K.; writing—original draft preparation, C.K.; writing—review and editing, D.J. and Y.S.; supervision, Y.L.; funding acquisition, Y.L.; visualization, D.J. and Y.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Korea Internet & Security Agency (KISA) grant funded by the Korean government (PIPC) (RS-2023-00258669). This work was supported by the Soonchunhyang University Research Fund.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The CelebFaces Attributes Dataset (CelebA) can be downloaded from: https://mmlab.ie.cuhk.edu.hk/projects/CelebA.html (accessed on 21 February 2026). The Flickr-Faces-HQ Dataset (FFHQ) can be downloaded from: https://github.com/NVlabs/ffhq-dataset (accessed on 21 February 2026).

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. European Union. General Data Protection Regulation (GDPR). 2016. Available online: https://eur-lex.europa.eu/eli/reg/2016/679/oj (accessed on 4 February 2026).
  2. California Legislature. California Consumer Privacy Act of 2018 (CCPA) (AB 375). 2018. Available online: https://leginfo.legislature.ca.gov/faces/codes_displayText.xhtml?division=3.&part=4.&lawCode=CIV&title=1.81.5 (accessed on 4 February 2026).
  3. National People’s Congress of China. Personal Information Protection Law of the People’s Republic of China (PIPL). 2021. Available online: http://www.npc.gov.cn/npc/c2/c30834/202108/t20210820_313088.htm (accessed on 4 February 2026).
  4. Japanese Government. Act on the Protection of Personal Information (APPI), as Amended. 2003. Available online: https://laws.e-gov.go.jp/law/415AC0000000057 (accessed on 4 February 2026).
  5. Brazilian Government. Lei Geral de Proteção de Dados Pessoais (LGPD) (Law No. 13,709/2018). 2018. Available online: https://www.planalto.gov.br/ccivil_03/_ato2015-2018/2018/lei/L13709compilado.htm (accessed on 4 February 2026).
  6. Ministry of the Interior and Safety; R.o.K. Personal Information Protection Act (PIPA), as Amended. 2023. Available online: https://elaw.klri.re.kr/eng_service/lawView.do?hseq=62389&lang=ENG (accessed on 4 February 2026).
  7. ITU-T Study Group 17. Draft Recommendation ITU-T X.af-sec: Evaluation Methodologies for Anonymization Techniques Using Face Images in Autonomous Vehicles (Under Study). 2026. Available online: https://www.itu.int/Itu-t/workprog/wp_item.aspx?isn=21790 (accessed on 4 February 2026).
  8. ISO 21177:2024; Intelligent Transport Systems—ITS Station Security Services for Secure Session Establishment and Authentication Between Trusted Devices. ISO: Geneva, Switzerland, 2024. Available online: https://www.iso.org/standard/87225.html (accessed on 4 February 2026).
  9. IEEE Std 1609.2-2022; IEEE Standard for Wireless Access in Vehicular Environments (WAVE)–Security Services for Applications and Management Messages. IEEE Standards Association: Piscataway, NJ, USA, 2022. Available online: https://standards.ieee.org/ieee/1609.2/10258/ (accessed on 4 February 2026).
  10. United Nations Economic Commission for Europe. UN Regulation No. 155: Cyber Security and Cyber Security Management System Requirements. 2021. Available online: https://unece.org/sites/default/files/2023-02/R155e%20%282%29.pdf (accessed on 4 February 2026).
  11. United Nations Economic Commission for Europe. UN Regulation No. 156: Software Update and Software Update Management System. 2021. Available online: https://unece.org/sites/default/files/2024-03/R156e%20%282%29.pdf (accessed on 4 February 2026).
  12. European Automobile Manufacturers’ Association (ACEA). ACEA Principles of Data Protection in Relation to Connected Vehicles and Services. 2016. Available online: https://www.acea.auto/files/ACEA_Principles_of_Data_Protection.pdf (accessed on 4 February 2026).
  13. Viola, P.; Jones, M. Rapid object detection using a boosted cascade of simple features. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2001, Kauai, HI, USA, 8–14 December 2001. [Google Scholar]
  14. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–26 June 2005; IEEE: New York, NY, USA, 2005. [Google Scholar]
  15. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Nets. In Proceedings of the 28th International Conference on Neural Information Processing Systems (NIPS 2014), Montréal, QC, Canada, 8–13 December 2014; pp. 2672–2680. [Google Scholar]
  16. Choi, Y.; Choi, M.; Kim, M.; Ha, J.W.; Kim, S.; Choo, J. StarGAN: Unified Generative Adversarial Networks for Multi-domain Image-to-Image Translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8789–8797. [Google Scholar]
  17. Karras, T.; Laine, S.; Aila, T. A Style-Based Generator Architecture for Generative Adversarial Networks. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019. [Google Scholar]
  18. Wu, Y.; Ji, Q. Facial Landmark Detection: A Literature Survey. Int. J. Comput. Vis. 2019, 127, 115–142. [Google Scholar] [CrossRef]
  19. Chien, H.-J.; Chuang, C.-C.; Klette, R. When to use what feature? SIFT, SURF, ORB, or A-KAZE features for monocular visual odometry. In Proceedings of the 2016 International Conference on Image and Vision Computing New Zealand (IVCNZ), Palmerston North, New Zealand, 21–22 November 2016. [Google Scholar]
  20. Kazemi, V.; Sullivan, J. One millisecond face alignment with an ensemble of regression trees. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014. [Google Scholar]
  21. Kartynnik, Y.; Ablavatski, A.; Grishchenko, I.; Grundmann, M. Real-Time Facial Surface Geometry from Monocular Video on Mobile GPUs. arXiv 2019, arXiv:1907.06724. [Google Scholar] [CrossRef]
  22. Zhang, K.; Zhang, Z.; Li, Z.; Qiao, Y. Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks. IEEE Signal Process. Lett. 2016, 23, 1499–1503. [Google Scholar] [CrossRef]
  23. Zhang, S.; Zhu, X.; Lei, Z.; Shi, H.; Wang, X.; Li, S.Z. FaceBoxes: A CPU Real-Time Face Detector with High Accuracy. arXiv 2017, arXiv:1708.05234. [Google Scholar]
  24. Guo, J.; Zhu, X.; Yang, Y.; Yang, F.; Lei, Z.; Li, S.Z. Towards Fast, Accurate and Stable 3D Dense Face Alignment. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2020; pp. 152–168. [Google Scholar]
  25. Salimans, T.; Goodfellow, I.; Zaremba, W.; Cheung, V.; Radford, A.; Chen, X. Improved Techniques for Training GANs. arXiv 2016, arXiv:1606.03498. [Google Scholar] [CrossRef]
  26. Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. arXiv 2017, arXiv:1706.08500. [Google Scholar]
  27. Kynkäänniemi, T.; Karras, T.; Laine, S.; Lehtinen, J.; Aila, T. Improved Precision and Recall Metric for Assessing Generative Models. arXiv 2019, arXiv:1904.06991. [Google Scholar] [CrossRef]
  28. Borji, A. Pros and Cons of GAN Evaluation Measures: New Developments. arXiv 2021, arXiv:2103.09396. [Google Scholar] [CrossRef]
  29. Wang, G.; Jiang, Q.; Jin, X.; Cui, X. FFR_FD: Effective and fast detection of DeepFakes via feature point defects. Information Sciences 2022, 596, 472–488. [Google Scholar] [CrossRef]
  30. Hellmann, F.; Mertes, S.; Benouis, M.; Hustinx, A.; Hsieh, T.-C.; Conati, C.; Krawitz, P.; André, E. GANonymization: A GAN-Based Face Anonymization Framework for Preserving Emotional Expressions. ACM Trans. Multimed. Comput. Commun. Appl. 2024, 21, 6. [Google Scholar] [CrossRef]
  31. Kuang, Z.; Yang, X.; Shen, Y.; Hu, C.; Yu, J. Facial Identity Anonymization via Intrinsic and Extrinsic Attention Distraction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 19–21 June 2024; pp. 12406–12415. [Google Scholar]
  32. Kim, H.; Pang, Z.; Zhao, L.; Su, X.; Lee, J.S. Semantic-aware deidentification generative adversarial networks for identity anonymization. Multimed. Tools Appl. 2023, 82, 15535–15551. [Google Scholar] [CrossRef]
  33. Wen, Y.; Liu, B.; Ding, M.; Xie, R.; Song, L. IdentityDP: Differential Private Identification Protection for Face Images. arXiv 2021, arXiv:2103.01745. [Google Scholar] [CrossRef]
  34. Cao, J.; Liu, B.; Chen, X.; Ding, M.; Xie, R.; Song, L.; Li, Z.; Zhang, W.; Wu, Y. Face De-identification: State-of-the-art Methods and Comparative Studies. arXiv 2024, arXiv:2411.09863. [Google Scholar] [CrossRef]
  35. Liu, Z.; Luo, P.; Wang, X.; Tang, X. Deep Learning Face Attributes in the Wild. arXiv 2014, arXiv:1411.7766. [Google Scholar]
  36. Liu, Z.; Luo, P.; Wang, X.; Tang, X. CelebA: Large-Scale CelebFaces Attributes Dataset. 2026. Available online: https://mmlab.ie.cuhk.edu.hk/projects/CelebA.html (accessed on 21 February 2026).
  37. Wu, C.-Y.; Xu, Q.; Neumann, U. Synergy between 3DMM and 3D Landmarks for Accurate 3D Facial Geometry. arXiv 2021, arXiv:2110.09772. [Google Scholar]
  38. Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML), Lille, France, 6–11 July 2015; PMLR: Lille, France, 2015; pp. 448–456. [Google Scholar]
  39. Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958. [Google Scholar]
  40. Ng, A.Y. Feature Selection, L1 vs. L2 Regularization, and Rotational Invariance. In Proceedings of the Twenty-First International Conference on Machine Learning, Banff, AB, Canada, 4–8 July 2004; ACM: New York, NY, USA, 2004; p. 78. [Google Scholar]
  41. Wilimitis, D.; Walsh, C.G. Practical Considerations and Applied Examples of Cross-Validation for Model Development and Evaluation in Health Care: Tutorial. JMIR AI 2023, 2, e49023. [Google Scholar] [CrossRef]
  42. Huber, P.J.; Ronchetti, E.M. Robust Statistics, 2nd ed.; John Wiley & Sons: Hoboken, NJ, USA, 2009. [Google Scholar]
  43. Miller, C.; Portlock, T.; Nyaga, D.M.; O'Sullivan, J.M. A review of model evaluation metrics for machine learning in genetics and genomics. Front. Bioinform. 2024, 4, 1457619. [Google Scholar] [CrossRef]
  44. Groeneveld, R.A.; Meeden, G. Measuring Skewness and Kurtosis. Statistician 1984, 33, 391–399. [Google Scholar] [CrossRef]
  45. Joanes, D.N.; Gill, C.A. Comparing Measures of Sample Skewness and Kurtosis. Statistician 1998, 47, 183–189. [Google Scholar] [CrossRef]
  46. Ageitgey. Face_Recognition. 2025. Available online: https://github.com/ageitgey/face_recognition (accessed on 11 October 2025).
  47. King, D.E. Dlib-ml: A Machine Learning Toolkit. J. Mach. Learn. Res. 2009, 10, 1755–1758. [Google Scholar]
  48. Schroff, F.; Kalenichenko, D.; Philbin, J. FaceNet: A Unified Embedding for Face Recognition and Clustering. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 815–823. [Google Scholar]
  49. Cao, Q.; Shen, L.; Xie, W.; Parkhi, O.M.; Zisserman, A. VGGFace2: A Dataset for Recognising Faces across Pose and Age. arXiv 2017, arXiv:1710.08092. [Google Scholar]
  50. Karras, T.; Laine, S.; Aila, T. Flickr-Faces-HQ Dataset (FFHQ). 2026. Available online: https://github.com/NVlabs/ffhq-dataset (accessed on 21 February 2026).
  51. Wang, H.; Li, S.; He, J.; Qian, Z.; Zhang, X.; Fan, S. Exploring Depth Information for Detecting Manipulated Face Videos. arXiv 2024, arXiv:2411.18572. [Google Scholar] [CrossRef]
  52. Ren, Z.; Lee, Y.J.; Ryoo, M.S. Learning to Anonymize Faces for Privacy Preserving Action Detection. arXiv 2018, arXiv:1803.11556. [Google Scholar] [CrossRef]
  53. Bulat, A.; Tzimiropoulos, G. How Far are We from Solving the 2D & 3D Face Alignment Problem? (and a Dataset of 230,000 3D Facial Landmarks). In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
Figure 1. Evaluation Criteria Derivation Experiment Process.
Figure 2. Representative visual examples from the preliminary rounding experiment for the 3D_round function: (a) rounding to one decimal place, showing visible blocking artifacts and loss of fine facial contour information; (b) rounding to two decimal places, showing smoother overall facial contours while reducing some individual-specific details.
Figure 3. Distribution of facial landmark reduction rates. Labels below the bars indicate interval bins; square brackets and parentheses denote inclusive and exclusive boundaries, respectively. Wavy marks indicate omitted x-axis intervals with no observations.
Figure 4. Distribution of facial similarity values. Labels below the bars indicate interval bins; square brackets and parentheses denote inclusive and exclusive boundaries, respectively.
Figure 5. Distribution of re-identification rates. Labels below the bars indicate interval bins; square brackets and parentheses denote inclusive and exclusive boundaries, respectively. Wavy marks indicate omitted x-axis intervals with no observations.
Figure 6. Evaluation Criteria Validation Experiment Process.
Figure 7. Representative qualitative examples of the original image and anonymized outputs from the validation models. (a) Original image. (b) RTFS result. (c) EDIDMFV result. (d) LAFPAD result.
Table 1. Experimental Datasets.

| Anonymization Dataset Name | Description |
| --- | --- |
| 3D Dataset | Dataset anonymized by applying the original function values of the 3D model, which performs anonymization based on three-dimensional structural information |
| 3D_round Dataset | Dataset anonymized by applying second-decimal-place rounding to the function values of the 3D model, which performs anonymization based on three-dimensional structural information |
| Depth Dataset | Dataset anonymized by applying the original function values of the Depth model, which performs anonymization based on depth information |
| Depth_round Dataset | Dataset anonymized by applying second-decimal-place rounding to the function values of the Depth model, which performs anonymization based on depth information |
| SynergyNet Dataset | Dataset anonymized by applying the SynergyNet model [37] |
Table 2. Changed Facial Landmark Count by Anonymization Dataset.

| Anonymization Dataset Name | Changed Facial Landmark Count |
| --- | --- |
| Original image Dataset | 68 |
| 3D Dataset | 66.1650 |
| 3D_round Dataset | 64.9685 |
| Depth Dataset | 57.1999 |
| Depth_round Dataset | 58.5732 |
| SynergyNet Dataset | 30.0968 |
Table 3. Facial Landmark Reduction Rate.

| Anonymization Dataset Name | Facial Landmark Reduction Rate |
| --- | --- |
| Original image Dataset | 0% |
| 3D Dataset | 2.70% |
| 3D_round Dataset | 4.46% |
| Depth Dataset | 15.88% |
| Depth_round Dataset | 13.86% |
| SynergyNet Dataset | 55.74% |
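The reduction rates in Table 3 follow directly from the average detected landmark counts in Table 2: the rate is the share of the 68 standard landmarks that a detector can no longer locate after anonymization. A minimal sketch of this computation (the function name and loop are illustrative, not the authors' code):

```python
# Sketch of the facial landmark reduction rate computation behind Table 3.
# Assumes the standard 68-point facial landmark scheme; the average detected
# counts below are taken from Table 2.

TOTAL_LANDMARKS = 68

def landmark_reduction_rate(detected_count: float,
                            total: int = TOTAL_LANDMARKS) -> float:
    """Percentage of the 68 landmarks no longer detected after anonymization."""
    return (total - detected_count) / total * 100.0

# Average detected landmark counts per anonymized dataset (Table 2).
detected_counts = {
    "3D": 66.1650,
    "3D_round": 64.9685,
    "Depth": 57.1999,
    "Depth_round": 58.5732,
    "SynergyNet": 30.0968,
}

for name, count in detected_counts.items():
    print(f"{name}: {landmark_reduction_rate(count):.2f}%")
```

Running this reproduces the percentages in Table 3, e.g. 2.70% for the 3D Dataset and 55.74% for the SynergyNet Dataset.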
Table 4. Similarity Evaluation Results between Original and Anonymized Images.

| Anonymization Dataset Name | Identical Face | Not Identical |
| --- | --- | --- |
| 3D Dataset | 0.43% | 99.57% |
| 3D_round Dataset | 0.56% | 99.44% |
| Depth Dataset | 0.90% | 99.10% |
| Depth_round Dataset | 1.00% | 99.00% |
| SynergyNet Dataset | 0.00% | 100.00% |
Table 5. Re-identification Rate Evaluation Results (FaceNet).

| Anonymization Dataset Name | Re-ID Rate |
| --- | --- |
| 3D Dataset | 23.73% |
| 3D_round Dataset | 23.65% |
| Depth Dataset | 22.12% |
| Depth_round Dataset | 22.46% |
| SynergyNet Dataset | 16.77% |
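The "identical face" and re-identification figures in Tables 4 and 5 rest on embedding comparison: a face recognizer such as FaceNet [48] maps each face to a vector, and two faces count as the same identity when their embeddings are sufficiently close. The following is an illustrative sketch, not the authors' pipeline; the toy 3-dimensional embeddings and the 0.6 distance threshold are assumptions for demonstration only.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def is_same_identity(emb_a, emb_b, threshold=0.6):
    """Treat two embeddings as the same identity when their distance
    falls below the threshold (0.6 is an assumed value)."""
    return euclidean_distance(emb_a, emb_b) < threshold

def re_identification_rate(original_embs, anonymized_embs, threshold=0.6):
    """Percentage of anonymized faces still matched to their originals."""
    matches = sum(
        is_same_identity(o, a, threshold)
        for o, a in zip(original_embs, anonymized_embs)
    )
    return matches / len(original_embs) * 100.0

# Toy example: the first anonymized face is barely perturbed and is still
# re-identified; the second is pushed far away in embedding space and is not.
originals  = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
anonymized = [(0.9, 0.1, 0.0), (0.0, 0.0, 1.0)]
print(re_identification_rate(originals, anonymized))  # 50.0
```

In practice the embeddings would come from a pretrained recognition network and the threshold would be calibrated on a verification set; lower re-identification rates indicate stronger anonymization.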
Table 6. Quantitative Evaluation Metric Results of Validation Experiment Models.

| Evaluation Metric | RTFS | EDIDMFV | LAFPAD |
| --- | --- | --- | --- |
| Facial Landmark Count Change | 56 (56.236) | 55 (55.012) | 66 (65.756) |
| Facial Landmark Reduction Rate | 17.30% | 17.10% | 3.30% |
| Facial Similarity Evaluation | 0.10% | 0.00% | 0.00% |
| Re-ID Rate | 0.013% | 0.015% | 0.000% |
Table 7. Comprehensive Evaluation Results of RTFS, EDIDMFV, and LAFPAD Models. Each model cell lists the measured value followed by whether the criterion was met.

| Evaluation Metric | Evaluation Criterion | RTFS | EDIDMFV | LAFPAD |
| --- | --- | --- | --- | --- |
| Facial landmark count change | ≤59 | 56 (56.236), Met | 55 (55.012), Met | 66 (65.756), Not Met |
| Facial landmark reduction rate | ≥13.86% | 17.30%, Met | 17.10%, Met | 3.30%, Not Met |
| Facial similarity evaluation | ≤0.58% | 0.10%, Met | 0.00%, Met | 0.00%, Met |
| Re-ID rate (FaceNet) | ≤21.75% | 0.013%, Met | 0.015%, Met | 0.000%, Met |
| Use of facial landmark extraction algorithm | Must utilize an extraction algorithm | MediaPipe-FaceMesh, Met | MediaPipe-FaceMesh, Met | face-alignment, Met |
| Facial landmark region anonymization | 5 regions (face contour, left eye, right eye, nose, mouth) | 5 regions, Met | 5 regions, Met | 2 regions (face contour, nose), Not Met |
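Applying the quantitative thresholds in Table 7 is mechanical once a model's metric values are measured. A hypothetical helper (the dictionary layout and function names are illustrative, not part of the paper) that reproduces the Met / Not Met judgments:

```python
# Quantitative evaluation criteria from Table 7; the checker structure is
# illustrative only.
CRITERIA = {
    "landmark_count_change": lambda v: v <= 59,       # changed landmark count <= 59
    "landmark_reduction_rate": lambda v: v >= 13.86,  # reduction rate >= 13.86%
    "facial_similarity": lambda v: v <= 0.58,         # "identical face" rate <= 0.58%
    "re_id_rate": lambda v: v <= 21.75,               # FaceNet re-ID rate <= 21.75%
}

def evaluate(measurements: dict) -> dict:
    """Map each quantitative metric to 'Met' or 'Not Met'."""
    return {
        name: "Met" if check(measurements[name]) else "Not Met"
        for name, check in CRITERIA.items()
    }

# LAFPAD's measured values from Table 7.
lafpad = {
    "landmark_count_change": 65.756,
    "landmark_reduction_rate": 3.30,
    "facial_similarity": 0.00,
    "re_id_rate": 0.000,
}
print(evaluate(lafpad))
# LAFPAD fails both landmark criteria but meets the similarity and re-ID
# criteria, matching the Achievement entries in Table 7.
```

The two qualitative criteria (use of a landmark extraction algorithm and the five anonymized regions) would be checked by inspection rather than thresholding.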
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Ko, C.; Jeon, D.; Song, Y.; Lee, Y. Facial Anonymization Model Evaluation Criteria: Development and Validation in Autonomous Vehicle Environments. Appl. Sci. 2026, 16, 2979. https://doi.org/10.3390/app16062979
