4.1. Limitations of StyleGAN3
Compared with earlier GAN models, StyleGAN3 demonstrates substantial improvements in geometric transformation adaptability, feature disentanglement capability, and training stability. Nevertheless, StyleGAN3 still encounters challenges in facial desensitization for autonomous driving, as evidenced by the desensitized images presented in Table 9.
Firstly, the StyleGAN3 model exhibits limitations in modeling identity-sensitive features distributed across facial regions, such as inter-eye distance and nasal bridge morphology. Due to its reliance on a fixed feature-space division mechanism, the model struggles to adapt to diverse autonomous driving scenarios. Furthermore, although the model supports driver status monitoring by retaining posture features and protects identity by removing personal identifiers, identifiable information may still remain in desensitized images under conditions such as occlusion or variable lighting. Secondly, the geometric transformation module of StyleGAN3 demonstrates poor robustness against the image acquisition challenges of autonomous driving, including extreme viewing angles, low illumination, and motion blur. Additionally, the occlusion completion strategy for face images depends heavily on the distribution of the training data: when encountering unlabeled or unseen occlusion patterns, the model is prone to generating semantically inconsistent pseudo-features, e.g., pupil misalignment, which can compromise the integrity of key image features.
Furthermore, convolutional discriminators typically struggle to model long-range spatial dependencies, such as the correlation between eyebrow shape and eye shape, or the consistency between skin tone and illumination. This may result in local noise or structural inconsistency in the generator during high-frequency detail synthesis. Moreover, because the loss function of StyleGAN3 imposes insufficient constraints on privacy protection and data utility, it can lead to extreme cases such as the excessive retention of identity features or the over-blurring of key features. These issues fail to satisfy the minimum-necessary desensitization principles mandated by industry regulations, especially in autonomous driving scenarios where computational resources are limited, and such shortcomings may significantly affect the convergence stability of the model and the accuracy of the generated images.
Overall, the aforementioned limitations of the StyleGAN3 model have impeded the secure utilization of facial data in autonomous driving. As a result, it is imperative to optimize and strengthen the StyleGAN3 model in key areas such as feature disentanglement granularity, environmental adaptability, adversarial training balance, and multimodal compatibility, thus enabling the development of a facial privacy protection solution that meets the rigorous demands of autonomous driving.
4.2. Optimization Design
A GAN is defined as a minimax game between the discriminator $D_\psi$ and the generator $G_\theta$. Given real data $x$ drawn from the real data distribution $p_D$ and generated fake data $G_\theta(z)$ drawn from the generator's data distribution $p_\theta$, the general representation of the GAN objective is as follows:
$$\mathcal{L}(\theta,\psi)=\mathbb{E}_{x\sim p_D}\!\left[f\!\left(D_\psi(x)\right)\right]+\mathbb{E}_{z\sim p_z}\!\left[f\!\left(-D_\psi\!\left(G_\theta(z)\right)\right)\right],\qquad f(t)=-\log\!\left(1+e^{-t}\right)$$
GAN training involves a min–max adversarial game, in which the generator minimizes the loss $\mathcal{L}$ while the discriminator maximizes it. Theoretically, the loss is convex with respect to the generated distribution $p_\theta$ when that distribution is optimized directly. In practice, however, the GAN loss pushes fake samples away from the decision boundary of $D_\psi$ rather than directly updating the density $p_\theta$, thereby triggering issues such as mode collapse, mode dropping, or non-convergence.
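As a concrete illustration, the following minimal PyTorch sketch expresses this objective in the softplus form implied by $f(t)=-\log(1+e^{-t})$; the function name, tensor arguments, and the non-saturating generator term are illustrative assumptions rather than the exact implementation used here.

```python
import torch
import torch.nn.functional as F

def gan_losses(d_real: torch.Tensor, d_fake: torch.Tensor):
    """Standard GAN losses in softplus form (sketch), matching f(t) = -log(1 + e^{-t}).

    d_real: discriminator scores D_psi(x) on real samples.
    d_fake: discriminator scores D_psi(G_theta(z)) on generated samples.
    The discriminator descends d_loss (i.e., maximizes L), while the generator
    descends the non-saturating g_loss, realizing the min-max game above.
    """
    # Discriminator: push real scores up, fake scores down.
    d_loss = F.softplus(-d_real).mean() + F.softplus(d_fake).mean()
    # Generator (non-saturating form): push fake scores up.
    g_loss = F.softplus(-d_fake).mean()
    return g_loss, d_loss
```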
GANs have drawbacks such as mode collapse, difficulty in training, and poor convergence [26]. To mitigate these limitations, researchers have developed multiple algorithms. Jolicoeur-Martineau et al. introduced the relativistic pairing GAN (RpGAN) to tackle mode dropping [27], and the formulation of RpGAN is represented as follows:
$$\mathcal{L}(\theta,\psi)=\mathbb{E}_{z\sim p_z,\;x\sim p_D}\!\left[f\!\left(D_\psi(x)-D_\psi\!\left(G_\theta(z)\right)\right)\right]$$
Although RpGAN differs little from the standard GAN in model architecture, there is a fundamental distinction in how the discriminator's evaluation shapes the topology of the loss landscape. In a standard GAN, the discriminator is tasked with distinguishing real data from fake data as two separate populations. When the real and fake data are separable by a single decision boundary, the GAN loss drives the generator to move all fake samples to the opposite side of this boundary; because the boundary is global rather than sample-specific, the generator can satisfy it while covering only part of the real distribution. This degradation is referred to as mode collapse or mode dropping.
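A sketch of the corresponding relativistic pairing loss, again under the assumption of softplus-form losses and illustrative function names, highlights how the discriminator now scores paired differences rather than isolated samples.

```python
import torch
import torch.nn.functional as F

def rpgan_losses(d_real: torch.Tensor, d_fake: torch.Tensor):
    """Relativistic pairing GAN losses (sketch).

    Each fake sample is judged only relative to a paired real sample via the
    difference D_psi(x) - D_psi(G_theta(z)), so a decision boundary is kept in
    the neighborhood of every real sample instead of one global boundary.
    """
    diff = d_real - d_fake             # per-pair relative score
    d_loss = F.softplus(-diff).mean()  # discriminator: real should outscore its paired fake
    g_loss = F.softplus(diff).mean()   # generator: close, and ideally invert, the gap
    return g_loss, d_loss
```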
RpGAN couples real and fake data so that a decision boundary is maintained within the neighborhood of each real sample, avoiding mode collapse. Sun et al. demonstrated that RpGAN does not exhibit the local minima characteristic of mode collapse. While RpGAN addresses the mode collapse issue of the GAN, its training dynamics remain to be fully resolved. The ultimate goal of RpGAN is to find an equilibrium point $(\theta^*,\psi^*)$ at which $p_{\theta^*}=p_D$ and $D_{\psi^*}$ is constant everywhere on the support of $p_D$. Sun noted that RpGAN possesses a non-increasing trajectory that is theoretically capable of reaching this global equilibrium under reasonable assumptions. However, the existence of such a trajectory does not ensure that gradient descent will reliably discover it: one study found that non-regularized RpGAN exhibits suboptimal performance and may fail to converge under gradient descent [28].
Experiments have shown that zero-centered gradient penalties, such as the widely adopted $R_1$ and $R_2$ penalties, can promote the convergence of GANs [29]. To address the non-convergence issue in RpGAN, the gradient penalties are introduced as follows:
$$R_1(\psi)=\frac{\gamma}{2}\,\mathbb{E}_{x\sim p_D}\!\left[\left\lVert\nabla_{x}D_\psi(x)\right\rVert^{2}\right],\qquad R_2(\theta,\psi)=\frac{\gamma}{2}\,\mathbb{E}_{x\sim p_\theta}\!\left[\left\lVert\nabla_{x}D_\psi(x)\right\rVert^{2}\right]$$
where $R_1$ penalizes the gradient norm of the discriminator $D_\psi$ on real data, while $R_2$ penalizes the gradient norm of $D_\psi$ on fake data. Huang et al. employed the StackedMNIST dataset and demonstrated that GANs and RpGANs rapidly diverge and exhibit unstable training, even though $R_1$ regularization alone theoretically allows for the local convergence of both losses; during divergence, the gradients of $D_\psi$ on fake samples explode [30]. Thus, $R_1$ alone cannot achieve globally convergent training.
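The two penalties differ only in which distribution the gradient norm is taken over, so a single helper can compute both; the sketch below assumes PyTorch autograd and a batch of images with gradients enabled.

```python
import torch

def zero_centered_gp(d_out: torch.Tensor, x: torch.Tensor, gamma: float = 10.0) -> torch.Tensor:
    """Zero-centered gradient penalty (gamma / 2) * E[ ||grad_x D_psi(x)||^2 ] (sketch).

    Acts as R1 when `x` holds real images and as R2 when `x` holds generated
    images; `x` must have requires_grad=True before the discriminator forward pass.
    """
    (grad,) = torch.autograd.grad(
        outputs=d_out.sum(), inputs=x, create_graph=True  # keep graph so the penalty is trainable
    )
    return (gamma / 2.0) * grad.square().sum(dim=[1, 2, 3]).mean()
```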
$R_1$ can be interpreted as the convolution of the density function of the real distribution $p_D$ with $\mathcal{N}(0,\gamma I)$, a Gaussian distribution with zero mean and covariance $\gamma I$, followed by an additional weight term and a Laplace error term. In the early stage of training, since the parameters $(\theta,\psi)$ are not close to the optimal equilibrium point $(\theta^*,\psi^*)$, regularizing $D_\psi$ on real data has minimal impact on how $D_\psi$ processes fake data. Similarly, $R_2$ involves convolving the generated distribution $p_\theta$ with $\mathcal{N}(0,\gamma I)$ and adding an extra weight term and a Laplace error term. The key difference is that $R_2$ penalizes the gradient norm on fake data rather than real data.
Therefore, $R_1$ and $R_2$ are used in the network to regularize real data and fake data simultaneously. When $\theta$ approaches $\theta^*$, the Laplace error terms of $R_1$ and $R_2$ cancel each other out, and the training of both the traditional GAN and the RpGAN becomes stable. With these penalties, the GAN still shows cases of mode collapse and mode dropping after training starts, whereas the RpGAN achieves full mode coverage with a decreasing $D_{\mathrm{KL}}$. Meanwhile, applying $R_1$ and $R_2$ in the network smooths both $p_D$ and $p_\theta$, enabling the model to learn more effectively than smoothing $p_D$ alone. Experiments have shown that, in this case, the discriminator can also keep $\mathbb{E}_{x\sim p_D}\!\left[\lVert\nabla_x D_\psi(x)\rVert\right]\approx\mathbb{E}_{x\sim p_\theta}\!\left[\lVert\nabla_x D_\psi(x)\rVert\right]$, and keeping the gradient norms on real and fake data roughly the same reduces the risk of discriminator overfitting. The final loss function is therefore $\mathcal{L}_{\mathrm{RpGAN}}+R_1+R_2$, with both $R_1$/$R_2$ regularization weights set to 10, in line with the original configuration of StyleGAN3.
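Putting the pieces together, a discriminator update under the combined objective $\mathcal{L}_{\mathrm{RpGAN}}+R_1+R_2$ can be sketched as follows, reusing the rpgan_losses and zero_centered_gp helpers from the earlier sketches; the generator and discriminator interfaces are simplified assumptions rather than the actual StyleGAN3 modules.

```python
import torch

def discriminator_step(D, G, real_imgs: torch.Tensor, z: torch.Tensor, gamma: float = 10.0) -> torch.Tensor:
    """One discriminator loss evaluation for L_RpGAN + R1 + R2 (sketch).

    `D` and `G` stand in for the improved StyleGAN3 discriminator and generator;
    gamma = 10 follows the regularization weight described above. The returned
    loss is backpropagated and stepped by the caller's optimizer.
    """
    real = real_imgs.detach().requires_grad_(True)  # leaf tensor for the R1 gradient
    fake = G(z).detach().requires_grad_(True)       # leaf tensor for the R2 gradient
    d_real, d_fake = D(real), D(fake)

    _, d_loss = rpgan_losses(d_real, d_fake)        # relativistic pairing term
    r1 = zero_centered_gp(d_real, real, gamma)      # R1: penalty on real data
    r2 = zero_centered_gp(d_fake, fake, gamma)      # R2: penalty on fake data
    return d_loss + r1 + r2
```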
The RpGAN framework constructs dynamic adversarial constraints through a relative probability evaluation, with $R_1$ regularization enforcing the Lipschitz continuity of the discriminator on real data and $R_2$ regularization suppressing gradient explosion in the generated-sample region. These three components collectively form a directional constraint matrix in the latent space, achieving the implicit modeling of mini-batch statistical features. This essentially subsumes the supervisory role of explicit mini-batch variance over feature diversity, thereby allowing the lazy regularization loss and path-length regularization loss of the original model to be removed.
4.3. Experimental Analysis
4.3.1. Ablation Experiment
To de-identify in-vehicle face images without exhausting server memory, the 1024 × 1024 × 3 face images used to pre-train the improved StyleGAN3 are downsampled so that the network input is 512 × 512 × 3. The resulting training dataset is packaged as the compressed archive 010000-512 × 512.zip, enabling the network to perform batch training on the processed face data.
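A minimal preprocessing sketch of this downsampling step is given below; the directory layout, file format, and use of Pillow are assumptions, and the actual pipeline then packs the resized images into the 010000-512 × 512.zip training archive.

```python
from pathlib import Path
from PIL import Image

def downsample_faces(src_dir: str, dst_dir: str, size: int = 512) -> None:
    """Resize 1024x1024x3 face crops to 512x512x3 before packing the training archive (sketch)."""
    Path(dst_dir).mkdir(parents=True, exist_ok=True)
    for img_path in sorted(Path(src_dir).glob("*.png")):
        img = Image.open(img_path).convert("RGB")
        img = img.resize((size, size), Image.LANCZOS)  # anti-aliased downsampling
        img.save(Path(dst_dir) / img_path.name)
```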
To ensure both a high level of privacy protection and usability in the face de-identification method, the loss function of StyleGAN3 is optimized by introducing the hybrid loss function and removing the lazy regularization loss and path-length regularization loss. Additionally, the 3 × 3 and 1 × 1 convolution modules in layers L0–L13 of the generator are replaced with 5 × 5 convolutions, whose larger receptive field allows the generator to learn that pixels in an image are correlated with their neighbors rather than independent.
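The kernel substitution can be illustrated with a plain convolution layer, as in the sketch below; the real StyleGAN3 synthesis layers use modulated convolutions, so nn.Conv2d here is only a stand-in for that machinery.

```python
import torch.nn as nn

def widen_kernel(conv: nn.Conv2d) -> nn.Conv2d:
    """Replace a 1x1 or 3x3 synthesis convolution with a 5x5 one (sketch).

    Padding of 2 keeps the spatial resolution unchanged while enlarging the
    receptive field so each output pixel depends on a wider neighborhood.
    """
    return nn.Conv2d(
        in_channels=conv.in_channels,
        out_channels=conv.out_channels,
        kernel_size=5,
        padding=2,
        bias=conv.bias is not None,
    )
```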
Therefore, the experiments examine the respective roles of the hybrid loss function and the 5 × 5 convolutional kernels in StyleGAN3. The configuration of the experimental models is presented in Table 10. Ablation experiments are conducted to compare the models across three metrics, i.e., FID, Euclidean distance, and SSIM, to evaluate the privacy protection and image quality of the improved StyleGAN3 model.
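For reference, the per-image privacy metrics can be computed as in the sketch below, assuming scikit-image for SSIM and face embeddings from an unspecified recognition model for the Euclidean distance; FID is computed separately over the whole image set with a standard implementation.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def privacy_metrics(original: np.ndarray, anonymized: np.ndarray,
                    orig_emb: np.ndarray, anon_emb: np.ndarray):
    """SSIM and Euclidean-distance metrics for one original/anonymized pair (sketch).

    `original`/`anonymized` are HxWx3 uint8 images; `orig_emb`/`anon_emb` are face
    embeddings of the same pair. Lower SSIM and a larger Euclidean distance
    indicate stronger de-identification.
    """
    structural = ssim(original, anonymized, channel_axis=-1)    # structural similarity of images
    identity_dist = float(np.linalg.norm(orig_emb - anon_emb))  # distance between identity embeddings
    return structural, identity_dist
```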
4.3.2. Analysis of Experimental Results
1. Generated Image Quality
Figure 7 shows the image-quality results after training on 2000 k images for the different experimental groups. Under the different improvement strategies, the StyleGAN3 models present clear differences in FID and generated image quality across the three experiments.
After optimizing the original StyleGAN3, the FID value gradually converges from an initial 524 to 48, indicating that adjustments to the basic architecture and training strategy effectively enhance the realism of the generated images. The StyleGAN3-l model, which further introduces the hybrid loss function, demonstrates a stronger convergence ability: its FID drops from 457 to 29 and stabilizes with minor fluctuations. RpGAN mitigates mode collapse between the generated and real distributions through probability density matching, while the R1 and R2 regularizations constrain the gradient smoothness of the discriminator and the manifold stability of the generator, respectively, enabling the generator to explore a more reasonable solution space during adversarial training. However, when the generator's receptive field is expanded with 5 × 5 convolutional kernels on top of StyleGAN3-l, the FID of the StyleGAN3-lo5 model unexpectedly increases to 453 and continues to diverge. This phenomenon stems from an imbalance between the surge in model complexity and insufficient discriminator capacity: the 5 × 5 kernels introduce high-frequency noise in low-resolution layers, disrupting the consistency of low-frequency facial structures, while the lengthened backpropagation path leads to abnormal gradient accumulation, ultimately causing model collapse.
2. Privacy Protection
The StyleGAN3 models under the different improvement strategies demonstrate a clear hierarchical progression in the privacy metrics, as shown in Table 11, which summarizes the results of controlled experiments isolating the impact of each modification. The original StyleGAN3 model yields an SSIM of 0.4485 and a Euclidean distance of 1.2332, indicating that the generated faces retain sharp facial contours, e.g., defined jawlines, nasal bridges, and fine skin textures, preserving features that enable identity correlation. This fidelity risks exposing personal attributes usable for re-identification in applications such as driver monitoring, where privacy is critical.
Introducing the hybrid loss function in StyleGAN3-l improves privacy by stabilizing adversarial training, reducing the SSIM to 0.3818 and increasing the Euclidean distance to 1.3106. Within this loss, the RpGAN term encourages the generator to avoid the principal component space of the real data, which is rich in identity-defining features such as facial shape, while the regularization terms ensure that perturbations conform to the facial manifold, thereby balancing structure preservation against feature obfuscation. As a result, the generated faces lack unique identifiers, e.g., rare mole patterns and ear shapes, while retaining sufficient anatomical coherence for non-private tasks such as emotion analysis. However, expanding the generator's receptive field with 5 × 5 kernels in StyleGAN3-lo5 disrupts this balance: the increased model complexity overwhelms the discriminator and the gradient dynamics collapse. The result is meaningless noise, indicating that aggressive structural changes without complementary adjustments, e.g., discriminator capacity upgrades and adaptive regularization, can destroy the adversarial equilibrium.