4.1. Limitations of StyleGAN3
Compared with earlier GAN models, StyleGAN3 demonstrates substantial improvements in geometric transformation adaptability, feature disentanglement capability, and training stability. Nevertheless, StyleGAN3 still encounters challenges in facial desensitization for autonomous driving, as evidenced by the desensitized images presented in Table 9.
Firstly, the StyleGAN3 model exhibits limitations in modeling identity-sensitive features distributed across facial regions, such as inter-eye distance and nasal bridge morphology. Due to its reliance on a fixed feature-space division mechanism, the model struggles to adapt to diverse autonomous driving scenarios. Furthermore, although the model supports driver status monitoring by retaining posture features and protects identity by removing personal identifiers, identifiable information may still remain in desensitized images under conditions such as occlusion or variable lighting. Secondly, the geometric transformation module of StyleGAN3 demonstrates poor robustness against the image acquisition challenges of autonomous driving, including extreme viewing angles, low illumination, and motion blur. Additionally, the occlusion completion strategy for face images depends heavily on the distribution of the training data: when encountering unlabeled or unseen occlusion patterns, the model is prone to generating semantically inconsistent pseudo-features, e.g., pupil misalignment, which can compromise the integrity of key image features.
Furthermore, convolutional discriminators typically struggle to model long-range spatial dependencies, such as the correlation between eyebrow shape and eye shape, or the consistency between skin tone and illumination. This may result in local noise or structural inconsistency in the generator during high-frequency detail synthesis. Moreover, because the loss function of StyleGAN3 imposes insufficient constraints on privacy protection and data utility, it can lead to extreme cases such as the excessive retention of identity features or the over-blurring of key features. These issues fail to satisfy the minimum-necessary desensitization principles mandated by industry regulations, especially in autonomous driving scenarios where computational resources are limited, and such shortcomings may significantly affect the convergence stability of the model and the accuracy of the generated images.
Overall, the aforementioned limitations of the StyleGAN3 model have impeded the secure utilization of facial data in autonomous driving. As a result, it is imperative to optimize and strengthen the StyleGAN3 model in key areas such as feature disentanglement granularity, environmental adaptability, adversarial training balance, and multimodal compatibility, thus enabling the development of a facial privacy protection solution that meets the rigorous demands of autonomous driving.
4.2. Optimization Design
A GAN is defined as a minimax game between the discriminator $D_\psi$ and the generator $G_\theta$. Given real data $x$ drawn from the real data distribution $p_D$ and generated fake data $G_\theta(z)$ drawn from the generator's data distribution $p_\theta$, the general representation of the GAN objective is as follows:
$$\mathcal{L}(\theta,\psi)=\mathbb{E}_{x\sim p_D}\!\left[f\!\left(D_\psi(x)\right)\right]+\mathbb{E}_{z\sim p_z}\!\left[f\!\left(-D_\psi\!\left(G_\theta(z)\right)\right)\right],\qquad f(t)=-\log\!\left(1+e^{-t}\right)$$
GAN training involves a min–max adversarial game, in which the generator minimizes the loss $\mathcal{L}$ while the discriminator maximizes it. Theoretically, the loss is convex with respect to the generated distribution $p_\theta$ when that distribution is optimized directly. In practice, however, the GAN loss pushes fake samples away from the decision boundary of $D_\psi$ rather than directly updating the density $p_\theta$, thereby triggering issues such as mode collapse, mode dropping, or non-convergence.
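As a concrete illustration, the following minimal PyTorch sketch expresses this objective in the softplus form implied by $f(t)=-\log(1+e^{-t})$; the function name, tensor arguments, and the non-saturating generator term are illustrative assumptions rather than the exact implementation used here.

```python
import torch
import torch.nn.functional as F

def gan_losses(d_real: torch.Tensor, d_fake: torch.Tensor):
    """Standard GAN losses in softplus form (sketch), matching f(t) = -log(1 + e^{-t}).

    d_real: discriminator scores D_psi(x) on real samples.
    d_fake: discriminator scores D_psi(G_theta(z)) on generated samples.
    The discriminator descends d_loss (i.e., maximizes L), while the generator
    descends the non-saturating g_loss, realizing the min-max game above.
    """
    # Discriminator: push real scores up, fake scores down.
    d_loss = F.softplus(-d_real).mean() + F.softplus(d_fake).mean()
    # Generator (non-saturating form): push fake scores up.
    g_loss = F.softplus(-d_fake).mean()
    return g_loss, d_loss
```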
GANs have drawbacks such as mode collapse, difficulty in training, and poor convergence [26]. To mitigate these limitations, researchers have developed multiple algorithms. Jolicoeur-Martineau et al. introduced the relativistic pairing GAN (RpGAN) to tackle mode dropping [27], and the formulation of RpGAN is represented as follows:
$$\mathcal{L}(\theta,\psi)=\mathbb{E}_{z\sim p_z,\;x\sim p_D}\!\left[f\!\left(D_\psi(x)-D_\psi\!\left(G_\theta(z)\right)\right)\right]$$
Although RpGAN differs little from the standard GAN in model architecture, there is a fundamental distinction in how the discriminator's evaluation shapes the topology of the loss landscape. In a standard GAN, the discriminator is tasked with distinguishing real data from fake data as two separate populations. When the real and fake data are separable by a single decision boundary, the GAN loss drives the generator to move all fake samples to the opposite side of this boundary; because the boundary is global rather than sample-specific, the generator can satisfy it while covering only part of the real distribution. This degradation is referred to as mode collapse or mode dropping.
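A sketch of the corresponding relativistic pairing loss, again under the assumption of softplus-form losses and illustrative function names, highlights how the discriminator now scores paired differences rather than isolated samples.

```python
import torch
import torch.nn.functional as F

def rpgan_losses(d_real: torch.Tensor, d_fake: torch.Tensor):
    """Relativistic pairing GAN losses (sketch).

    Each fake sample is judged only relative to a paired real sample via the
    difference D_psi(x) - D_psi(G_theta(z)), so a decision boundary is kept in
    the neighborhood of every real sample instead of one global boundary.
    """
    diff = d_real - d_fake             # per-pair relative score
    d_loss = F.softplus(-diff).mean()  # discriminator: real should outscore its paired fake
    g_loss = F.softplus(diff).mean()   # generator: close, and ideally invert, the gap
    return g_loss, d_loss
```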
RpGAN couples real and fake data so that a decision boundary is maintained within the neighborhood of each real sample, avoiding mode collapse. Sun et al. demonstrated that RpGAN does not exhibit the local minima characteristic of mode collapse. While RpGAN addresses the mode collapse issue of the GAN, its training dynamics remain to be fully resolved. The ultimate goal of RpGAN is to find an equilibrium point $(\theta^*,\psi^*)$ at which $p_{\theta^*}=p_D$ and $D_{\psi^*}$ is constant everywhere on the support of $p_D$. Sun noted that RpGAN possesses a non-increasing trajectory that is theoretically capable of reaching this global equilibrium under reasonable assumptions. However, the existence of such a trajectory does not ensure that gradient descent will reliably discover it: one study found that non-regularized RpGAN exhibits suboptimal performance and may fail to converge under gradient descent [28].
Experiments have shown that zero-centered gradient penalties, such as the widely adopted $R_1$ and $R_2$ penalties, can promote the convergence of GANs [29]. To address the non-convergence issue in RpGAN, the gradient penalties are introduced as follows:
$$R_1(\psi)=\frac{\gamma}{2}\,\mathbb{E}_{x\sim p_D}\!\left[\left\lVert\nabla_{x}D_\psi(x)\right\rVert^{2}\right],\qquad R_2(\theta,\psi)=\frac{\gamma}{2}\,\mathbb{E}_{x\sim p_\theta}\!\left[\left\lVert\nabla_{x}D_\psi(x)\right\rVert^{2}\right]$$
where $R_1$ penalizes the gradient norm of the discriminator $D_\psi$ on real data, while $R_2$ penalizes the gradient norm of $D_\psi$ on fake data. Huang et al. employed the StackedMNIST dataset and demonstrated that GANs and RpGANs rapidly diverge and exhibit unstable training, even though $R_1$ regularization alone theoretically allows for the local convergence of both losses; during divergence, the gradients of $D_\psi$ on fake samples explode [30]. Thus, $R_1$ alone cannot achieve globally convergent training.
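The two penalties differ only in which distribution the gradient norm is taken over, so a single helper can compute both; the sketch below assumes PyTorch autograd and a batch of images with gradients enabled.

```python
import torch

def zero_centered_gp(d_out: torch.Tensor, x: torch.Tensor, gamma: float = 10.0) -> torch.Tensor:
    """Zero-centered gradient penalty (gamma / 2) * E[ ||grad_x D_psi(x)||^2 ] (sketch).

    Acts as R1 when `x` holds real images and as R2 when `x` holds generated
    images; `x` must have requires_grad=True before the discriminator forward pass.
    """
    (grad,) = torch.autograd.grad(
        outputs=d_out.sum(), inputs=x, create_graph=True  # keep graph so the penalty is trainable
    )
    return (gamma / 2.0) * grad.square().sum(dim=[1, 2, 3]).mean()
```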
$R_1$ can be interpreted as the convolution of the density function of the real distribution $p_D$ with $\mathcal{N}(0,\gamma I)$, a Gaussian distribution with zero mean and covariance $\gamma I$, followed by an additional weight term and a Laplace error term. In the early stage of training, since the parameters $(\theta,\psi)$ are not close to the optimal equilibrium point $(\theta^*,\psi^*)$, regularizing $D_\psi$ on real data has minimal impact on how $D_\psi$ processes fake data. Similarly, $R_2$ involves convolving the generated distribution $p_\theta$ with $\mathcal{N}(0,\gamma I)$ and adding an extra weight term and a Laplace error term. The key difference is that $R_2$ penalizes the gradient norm on fake data rather than real data.
Therefore, $R_1$ and $R_2$ are used in the network to regularize real data and fake data simultaneously. When $\theta$ approaches $\theta^*$, the Laplace error terms of $R_1$ and $R_2$ cancel each other out, and the training of both the traditional GAN and the RpGAN becomes stable. With these penalties, the GAN still shows cases of mode collapse and mode dropping after training starts, whereas the RpGAN achieves full mode coverage with a decreasing $D_{\mathrm{KL}}$. Meanwhile, applying $R_1$ and $R_2$ in the network smooths both $p_D$ and $p_\theta$, enabling the model to learn more effectively than smoothing $p_D$ alone. Experiments have shown that, in this case, the discriminator can also keep $\mathbb{E}_{x\sim p_D}\!\left[\lVert\nabla_x D_\psi(x)\rVert\right]\approx\mathbb{E}_{x\sim p_\theta}\!\left[\lVert\nabla_x D_\psi(x)\rVert\right]$, and keeping the gradient norms on real and fake data roughly the same reduces the risk of discriminator overfitting. The final loss function is therefore $\mathcal{L}_{\mathrm{RpGAN}}+R_1+R_2$, with both $R_1$/$R_2$ regularization weights set to 10, in line with the original configuration of StyleGAN3.
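Putting the pieces together, a discriminator update under the combined objective $\mathcal{L}_{\mathrm{RpGAN}}+R_1+R_2$ can be sketched as follows, reusing the rpgan_losses and zero_centered_gp helpers from the earlier sketches; the generator and discriminator interfaces are simplified assumptions rather than the actual StyleGAN3 modules.

```python
import torch

def discriminator_step(D, G, real_imgs: torch.Tensor, z: torch.Tensor, gamma: float = 10.0) -> torch.Tensor:
    """One discriminator loss evaluation for L_RpGAN + R1 + R2 (sketch).

    `D` and `G` stand in for the improved StyleGAN3 discriminator and generator;
    gamma = 10 follows the regularization weight described above. The returned
    loss is backpropagated and stepped by the caller's optimizer.
    """
    real = real_imgs.detach().requires_grad_(True)  # leaf tensor for the R1 gradient
    fake = G(z).detach().requires_grad_(True)       # leaf tensor for the R2 gradient
    d_real, d_fake = D(real), D(fake)

    _, d_loss = rpgan_losses(d_real, d_fake)        # relativistic pairing term
    r1 = zero_centered_gp(d_real, real, gamma)      # R1: penalty on real data
    r2 = zero_centered_gp(d_fake, fake, gamma)      # R2: penalty on fake data
    return d_loss + r1 + r2
```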
The RpGAN framework constructs dynamic adversarial constraints through a relative probability evaluation, with $R_1$ regularization enforcing the Lipschitz continuity of the discriminator on real data and $R_2$ regularization suppressing gradient explosion in the generated-sample region. These three components collectively form a directional constraint matrix in the latent space, achieving the implicit modeling of mini-batch statistical features. This essentially subsumes the supervisory role of explicit mini-batch variance over feature diversity, thereby allowing the lazy regularization loss and path-length regularization loss of the original model to be removed.
4.3. Experimental Analysis
4.3.1. Ablation Experiment
To de-identify in-vehicle face images without exhausting server memory, the 1024 × 1024 × 3 face images used to pre-train the improved StyleGAN3 are downsampled so that the network input is 512 × 512 × 3. The resulting training dataset is packaged as the compressed archive 010000-512 × 512.zip, enabling the network to perform batch training on the processed face data.
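A minimal preprocessing sketch of this downsampling step is given below; the directory layout, file format, and use of Pillow are assumptions, and the actual pipeline then packs the resized images into the 010000-512 × 512.zip training archive.

```python
from pathlib import Path
from PIL import Image

def downsample_faces(src_dir: str, dst_dir: str, size: int = 512) -> None:
    """Resize 1024x1024x3 face crops to 512x512x3 before packing the training archive (sketch)."""
    Path(dst_dir).mkdir(parents=True, exist_ok=True)
    for img_path in sorted(Path(src_dir).glob("*.png")):
        img = Image.open(img_path).convert("RGB")
        img = img.resize((size, size), Image.LANCZOS)  # anti-aliased downsampling
        img.save(Path(dst_dir) / img_path.name)
```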
To ensure both a high level of privacy protection and usability in the face de-identification method, the loss function of StyleGAN3 is optimized by introducing the hybrid loss function and removing the lazy regularization loss and path-length regularization loss. Additionally, the 3 × 3 and 1 × 1 convolution modules in layers L0–L13 of the generator are replaced with 5 × 5 convolutions, whose larger receptive field allows the generator to learn that pixels in an image are correlated with their neighbors rather than independent.
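The kernel substitution can be illustrated with a plain convolution layer, as in the sketch below; the real StyleGAN3 synthesis layers use modulated convolutions, so nn.Conv2d here is only a stand-in for that machinery.

```python
import torch.nn as nn

def widen_kernel(conv: nn.Conv2d) -> nn.Conv2d:
    """Replace a 1x1 or 3x3 synthesis convolution with a 5x5 one (sketch).

    Padding of 2 keeps the spatial resolution unchanged while enlarging the
    receptive field so each output pixel depends on a wider neighborhood.
    """
    return nn.Conv2d(
        in_channels=conv.in_channels,
        out_channels=conv.out_channels,
        kernel_size=5,
        padding=2,
        bias=conv.bias is not None,
    )
```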
Therefore, the experiments examine the respective roles of the hybrid loss function and the 5 × 5 convolutional kernels in StyleGAN3. The configuration of the experimental models is presented in Table 10. Ablation experiments are conducted to compare the models across three metrics, i.e., FID, Euclidean distance, and SSIM, to evaluate the privacy protection and image quality of the improved StyleGAN3 model.
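For reference, the per-image privacy metrics can be computed as in the sketch below, assuming scikit-image for SSIM and face embeddings from an unspecified recognition model for the Euclidean distance; FID is computed separately over the whole image set with a standard implementation.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def privacy_metrics(original: np.ndarray, anonymized: np.ndarray,
                    orig_emb: np.ndarray, anon_emb: np.ndarray):
    """SSIM and Euclidean-distance metrics for one original/anonymized pair (sketch).

    `original`/`anonymized` are HxWx3 uint8 images; `orig_emb`/`anon_emb` are face
    embeddings of the same pair. Lower SSIM and a larger Euclidean distance
    indicate stronger de-identification.
    """
    structural = ssim(original, anonymized, channel_axis=-1)    # structural similarity of images
    identity_dist = float(np.linalg.norm(orig_emb - anon_emb))  # distance between identity embeddings
    return structural, identity_dist
```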
4.3.2. Analysis of Experimental Results
1. Generated Image Quality
Figure 7 shows the image-quality results after training on 2000 k images for the different experimental groups. Under the different improvement strategies, the StyleGAN3 models present clear differences in FID and generated image quality across the three experiments.
After optimizing the original StyleGAN3, the FID value gradually converges from an initial 524 to 48, indicating that adjustments to the basic architecture and training strategy effectively enhance the realism of the generated images. The StyleGAN3-l model, which further introduces the hybrid loss function, demonstrates a stronger convergence ability: its FID drops from 457 to 29 and stabilizes with minor fluctuations. RpGAN mitigates mode collapse between the generated and real distributions through probability density matching, while the R1 and R2 regularizations constrain the gradient smoothness of the discriminator and the manifold stability of the generator, respectively, enabling the generator to explore a more reasonable solution space during adversarial training. However, when the generator's receptive field is expanded with 5 × 5 convolutional kernels on top of StyleGAN3-l, the FID of the StyleGAN3-lo5 model unexpectedly increases to 453 and continues to diverge. This phenomenon stems from an imbalance between the surge in model complexity and insufficient discriminator capacity: the 5 × 5 kernels introduce high-frequency noise in low-resolution layers, disrupting the consistency of low-frequency facial structures, while the lengthened backpropagation path leads to abnormal gradient accumulation, ultimately causing model collapse.
2. Privacy Protection
The StyleGAN3 models under the different improvement strategies demonstrate a clear hierarchical progression in the privacy metrics, as shown in Table 11, which summarizes the results of controlled experiments isolating the impact of each modification. The original StyleGAN3 model yields an SSIM of 0.4485 and a Euclidean distance of 1.2332, indicating that the generated faces retain sharp facial contours, e.g., defined jawlines, nasal bridges, and fine skin textures, preserving features that enable identity correlation. This fidelity risks exposing personal attributes usable for re-identification in applications such as driver monitoring, where privacy is critical.
Introducing the hybrid loss function in StyleGAN3-l improves privacy by stabilizing adversarial training, reducing the SSIM to 0.3818 and increasing the Euclidean distance to 1.3106. Within this loss, the RpGAN term encourages the generator to avoid the principal component space of the real data, which is rich in identity-defining features such as facial shape, while the regularization terms ensure that perturbations conform to the facial manifold, thereby balancing structure preservation against feature obfuscation. As a result, the generated faces lack unique identifiers, e.g., rare mole patterns and ear shapes, while retaining sufficient anatomical coherence for non-private tasks such as emotion analysis. However, expanding the generator's receptive field with 5 × 5 kernels in StyleGAN3-lo5 disrupts this balance: the increased model complexity overwhelms the discriminator and the gradient dynamics collapse. The result is meaningless noise, indicating that aggressive structural changes without complementary adjustments, e.g., discriminator capacity upgrades and adaptive regularization, can destroy the adversarial equilibrium.