1. Introduction
Image data in traffic scenarios carries rich environmental information and behavioral characteristics. Generally, these data are regarded as the core input for perception and decision making in intelligent connected vehicles [
1], supporting applications including traffic condition understanding, driver assistance, and behavior analysis. However, images and videos captured by vehicle cameras, dashcams, and in-cabin monitoring often contain highly identifiable information, particularly license plates and faces [
2]. Once exchanged across domains or stored remotely, such data may be misused or maliciously re-identified, leading to precise tracking or privacy breaches [
3]. Therefore, to meet the visual information needs of intelligent mobility services while complying with privacy regulations, notably the General Data Protection Regulation (GDPR), effective desensitization of license plates and faces is vitally important throughout the data processing pipeline [
4]. However, given the massive scale of camera data, efficiently detecting and removing sensitive information remains a major challenge.
Traditionally, sensitive information is cleaned through manual annotation and processing: staff mark faces and license plates by hand and then occlude or blur them. This manual approach cannot scale to the growing volume of surveillance data. It is slow, prone to missed or inconsistent labels, and often fails to meet strict privacy requirements. With the maturing of natural language processing (NLP) and computer vision (CV), automatic image annotation (AIA) has rapidly developed into a key task in artificial intelligence (AI) [
5]. AI, pattern recognition, and CV are the primary technologies for AIA and are employed to analyze visual features of digital images and assign metadata as captions or keywords [
6]. Automatic annotation of objects in traffic scenes enables fast detection and classification of vehicles, pedestrians, and other traffic elements. This step forms the basis for later desensitization and directly determines the effectiveness of anonymization.
Existing AIA methods typically rely on deep learning algorithms that are not fully supervised. Their goal is to reduce the effort required to obtain labeled data and to lower model dependence on labels while maintaining annotation accuracy. Common approaches include semi-supervised learning [
7], transfer learning [
8], reinforcement learning [
9], and active learning [
10]. Transfer learning moves knowledge from one task or domain to another to reduce labeling needs for the new task. In image annotation, common practices are domain adaptation and pretraining with fine-tuning. For cross-domain annotation, Yang et al. [
11] proposed frequency domain adaptation, which reduces distribution gaps by swapping low-frequency components between the source and target domains. Semi-supervised learning, in turn, leverages unlabeled data to further strengthen model training. Tan et al. [
12] proposed a deep semi-supervised paradigm for powder bed defect segmentation that enables learning with limited annotations. However, the theoretical assumptions underlying many semi-supervised methods may not hold for real-world data distributions, which can cause the model to learn incorrect information.
Active learning shares similarities with semi-supervised learning, as both use unlabeled data to train and improve models. However, active learning is distinguished by its use of specific query strategies: more representative samples are selected from the unlabeled data and handed over to annotators for labeling, and the newly labeled samples form a new training set with which model training continues. Hoxha et al. [
13] proposed annotation cost-efficient active learning (ANNEAL), a cost-efficient active learning method for remote sensing built on content-based image retrieval. ANNEAL aims to build a small but informative training set composed of similar and dissimilar image pairs to learn a precise metric space. Reinforcement learning has also been applied to annotation tasks to optimize labeling strategies. A prominent line of inquiry formulates active learning within the framework of a Markov decision process (MDP), leveraging deep reinforcement learning to derive optimal sample selection policies. Jiu et al. [
14] introduced deep reinforcement active learning, a method that prioritizes unlabeled samples based on uncertainty and subsequently uses the ranked feature representations as state inputs for an actor–critic agent. By employing the deep deterministic policy gradient algorithm and utilizing label feedback, the agent acquires a dynamic sampling policy. This method can adaptively select samples for annotation based on the current model and data distribution, avoiding errors from fixed heuristic rules. Nevertheless, most of the methods above assume strong distribution consistency between unlabeled and labeled data or rely on fixed heuristic sample selection strategies. In real traffic scenarios, these assumptions often fail due to the high diversity of weather conditions, viewpoints, illumination, and imaging devices. This mismatch can lead to the accumulation of annotation errors and limited model generalization ability.
In recent years, foundation models have shown significant advantages in multimodal understanding and semantic reasoning. Their introduction into automated data annotation pipelines has partially alleviated issues such as high label noise and weak cross-scene generalization in traditional methods. This progress makes large-scale dataset construction with low human involvement feasible.
However, the improvement of automatic annotation capability also drives traffic data collection toward large-scale and highly automated processes. This trend further amplifies the potential risk of large-scale identification and dissemination of sensitive information, such as faces and license plates, in raw data. Without effective privacy protection mechanisms, automated annotation systems may increase the risk of personal privacy leakage while improving data utilization efficiency. Traditional desensitization methods, including pixel-level replacement and occlusion, struggle to balance privacy and visual usability. Pixelation and blurring can make sensitive content unrecognizable to human viewers, while automated models can still extract sensitive features. For this reason, generative methods have been proposed in recent years. Generative approaches modify or replace original image content to protect privacy while keeping the image visually coherent [
15]. Goodfellow et al. [
16] introduced generative adversarial networks (GANs), in which a generator network synthesizes realistic samples while an adversarial discriminator learns to distinguish them from real data, driving the generator toward the true data distribution. GANs are well suited to face deidentification because they can generate new samples that follow the learned input distribution.
Recently, diffusion models have achieved breakthroughs in image generation and editing. Their high generation quality and stable training have made them a new direction for image privacy research [
17,
18,
19]. Diffusion models generate high-fidelity images by gradually adding and removing noise [
20]. Compared with GANs, they offer more stable generation and greater diversity. They have been widely applied to anonymization, face swapping, and image restoration in privacy-related tasks [
21,
22]. In earlier research, Li et al. [
23] proposed latent diffusion-based face anonymization, a two-stage diffusion anonymization framework. This method first uses a face detection model to locate the face region and then uses a latent diffusion model (LDM) to perform generative inpainting on that region, thereby synthesizing an anonymized face image. Since the LDM samples and generates in latent space rather than pixel space, it better preserves image semantics and background consistency [
24]. Chen et al. [
25] introduced Diff-Privacy, which utilizes a custom image inversion module to map images into latent space for identity replacement while maintaining content fidelity. Shaheryar et al. [
26] proposed a dual-conditional diffusion-based identity anonymization method that integrates identity and non-identity features to achieve controllable, high-quality face deidentification.
Although generative anonymization and AIA have advanced significantly, their coupling in real traffic scenarios remains limited. Few works consider the temporal consistency of generative anonymization together with seamless integration into AIA and quantitative evaluation of the privacy–utility trade-off. There is therefore an urgent need for a solution that provides high-quality visual anonymization while ensuring downstream annotation accuracy and temporal consistency, so as to support large-scale data collection and semi-automatic annotation in real-world traffic scenarios. To meet this need, this study proposes an anonymization and AIA method for sensitive information in complex traffic environments. First, NullFace diffusion anonymization is applied to the collected face dataset to satisfy strict requirements on target fidelity and background realism. Then, visual workflow orchestration in the Dify engine is combined with Qwen3-VL and Grounding DINO to link the cleaning and detection steps. This enables text-driven data cleaning and open-vocabulary AIA and produces a dataset of sensitive traffic information with diverse scenes. Through empirical analysis on the constructed traffic scene dataset, we verified the applicability and annotation performance of this method in diverse scenarios. The comparison of the proposed method with existing methods is shown in
Table 1.
The remainder of this paper is organized as follows.
Section 2 presents the anonymization method.
Section 3 gives an overview of automatic annotation for sensitive information.
Section 4 describes the dataset and evaluation metrics.
Section 5 presents the experimental setup and results.
Section 6 concludes the paper and outlines future work.
3. Semi-AIA System Based on Multimodal Foundation Models
Traffic data processed by NullFace must be converted into a standard dataset with precise semantic labels before it can serve downstream training. Manual annotation is slow, costly, and inconsistent, and it cannot scale to large volumes. To address this, we designed and implemented a semi-automatic pipeline based on the Dify workflow engine. The system integrates state-of-the-art foundation models, including Qwen3-VL, Grounding DINO, and Deepseek-R1, to automate the full process from data cleaning to standardized output.
3.1. Data Cleaning by Visual Instruction Fine-Tuning
Data cleaning is the first line of defense for building a high-quality dataset. To quickly remove empty frames with no vehicles or no faces from massive surveillance video, the Qwen3-VL family of multimodal foundation models is applied as an intelligent filter. The model architecture is shown in
Figure 2.
Qwen3-VL adopts a deeply coupled architecture that combines a Vision Transformer and a large language model. This design provides strong visual understanding and instruction-following capability. The data cleaning task is formulated as a standardized visual question answering process. Each image is input to the model together with a carefully designed prompt. The prompt explicitly asks the model to determine whether a face is present in the image and strictly constrains the output to a JSON format, such as {"face": false} or {"face": true}. With this prompt design, the system can directly parse the returned JSON object. When all key fields are false, the sample is automatically marked as invalid and removed from the processing queue. This mechanism exploits the zero-shot generalization ability of multimodal large models. It avoids the cost of training a dedicated classifier and ensures that computational resources in the subsequent annotation stage are focused on valid data.
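The parsing and filtering rule described above can be illustrated with a minimal Python sketch. It is not the implementation used in the platform: the function names are hypothetical, the model reply is assumed to arrive as a plain string, and the choice to keep unparseable replies for manual review is an assumption added here for illustration.

```python
import json
from typing import Dict, List


def parse_cleaning_response(raw: str) -> Dict[str, bool]:
    """Parse the model reply into a dict of boolean presence flags."""
    # Replace typographic quotes, which are not valid JSON delimiters.
    cleaned = raw.strip().replace("\u201c", '"').replace("\u201d", '"')
    data = json.loads(cleaned)
    if not isinstance(data, dict) or not data:
        raise ValueError("expected a non-empty JSON object")
    for key, value in data.items():
        if not isinstance(value, bool):
            raise ValueError(f"field {key!r} must be true or false")
    return data


def is_valid_sample(flags: Dict[str, bool]) -> bool:
    """A sample is invalid only when all key fields are false."""
    return any(flags.values())


def filter_queue(replies: Dict[str, str]) -> List[str]:
    """Return the image names that stay in the processing queue."""
    kept = []
    for name, raw in replies.items():
        try:
            flags = parse_cleaning_response(raw)
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            # Assumption: malformed replies are kept for manual review.
            kept.append(name)
            continue
        if is_valid_sample(flags):
            kept.append(name)
    return kept
```

For example, `filter_queue({"a.jpg": '{"face": false}', "b.jpg": '{"face": true}'})` keeps only `b.jpg`.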
To support large-scale deployment, the platform uses the Qwen3-VL 30B A3B Instruct model. Since the data cleaning task only requires a basic binary presence judgment, the 30B-parameter model has clear capacity to spare in this scenario. Therefore, we adopt 4-bit quantized loading and multi-GPU parallel inference to improve resource efficiency. Notably, no missed valid samples were found during the subsequent manual review of the constructed dataset. This empirical result indicates that the 4-bit quantized model achieves very high recall and reliability in data cleaning tasks.
3.2. Open Vocabulary Detection and Annotation
Valid samples pass to the core automatic annotation stage. To overcome the limitations of traditional detectors such as You Only Look Once (YOLO) and Faster R-CNN, which are restricted to predefined categories, this study introduces the Grounding DINO model, which has open-set detection capabilities, as shown in
Figure 3. This lets users detect arbitrary classes with natural language prompts.
In our system, we call Grounding DINO via API and set the text prompt to the word face. The prompt directs the model to search for regions that match the face concept in feature space. The model outputs detection boxes, confidence scores, and short phrase labels. A postprocessing module then prepares these outputs for common training frameworks: it denormalizes box coordinates, applies a confidence threshold, and converts formats. The result is a standard YOLO annotation file containing the class ID, center coordinates, width, and height of each box. In practice, this method keeps recall high and processes each image in only a few seconds, far faster than manual labeling.
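A minimal Python sketch of such a postprocessing step is given below. It assumes, for illustration only, that each detection arrives as a dictionary with a normalized (cx, cy, w, h) box, a confidence score, and a phrase; the actual API response layout, the function name, and the threshold value of 0.35 are assumptions rather than details of the deployed module.

```python
from typing import Any, Dict, List


def to_yolo_lines(
    detections: List[Dict[str, Any]],
    img_w: int,
    img_h: int,
    class_ids: Dict[str, int],
    score_thr: float = 0.35,  # hypothetical threshold, not from the paper
) -> List[str]:
    """Convert detections into YOLO label lines: class cx cy w h (normalized)."""
    lines = []
    for det in detections:
        if det["score"] < score_thr:
            continue  # confidence filtering
        phrase = det["phrase"].strip().lower()
        if phrase not in class_ids:
            continue  # ignore phrases outside the label map
        cx, cy, w, h = det["box"]
        # Denormalize to pixel corner coordinates.
        x1, y1 = (cx - w / 2) * img_w, (cy - h / 2) * img_h
        x2, y2 = (cx + w / 2) * img_w, (cy + h / 2) * img_h
        # Clip to the image so boxes never extend past the borders.
        x1, x2 = max(0.0, x1), min(float(img_w), x2)
        y1, y2 = max(0.0, y1), min(float(img_h), y2)
        if x2 <= x1 or y2 <= y1:
            continue  # degenerate box after clipping
        # Convert back to the normalized center format expected by YOLO.
        ncx, ncy = (x1 + x2) / 2 / img_w, (y1 + y2) / 2 / img_h
        nw, nh = (x2 - x1) / img_w, (y2 - y1) / img_h
        lines.append(f"{class_ids[phrase]} {ncx:.6f} {ncy:.6f} {nw:.6f} {nh:.6f}")
    return lines
```

Writing one such line per retained box to a text file named after the image yields the usual YOLO label layout.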
3.3. Dify Workflow Orchestration and Human–Machine Collaboration
To make the model modules usable in production, we built a unified system using Dify workflow visual orchestration. The architecture is shown in
Figure 4. The workflow links cleaning and detection steps and also integrates Deepseek-R1 to improve interpretability.
After detection, the workflow calls Deepseek-R1 for secondary processing. Deepseek-R1 acts both as an information extractor and as a semantic explainer. It pulls key statistics from the raw Grounding DINO JSON and converts coordinate data into natural language descriptions including object count, object positions, YOLO labels, and the meaning of each label value. These descriptions provide intuitive reference information for manual review.
To balance throughput and label quality, the platform supports two operation modes. The first mode is an interactive review interface for human verification, as shown in
Figure 5. In this mode, the system retains Deepseek-R1 explanations and presents the image, annotation boxes, and semantic explanations to the operator in a question-and-answer style. Operators can quickly browse and inspect results and manually correct difficult or ambiguous samples. This mode takes about 14.92 s per image but substantially reduces human cognitive load. The second mode is an API batch mode for large-scale dataset construction. For maximum efficiency, this mode skips Deepseek-R1 explanation generation and runs only the cleaning and detection steps, which reduces the per-image time to about 6.12 s. At that rate, a single workflow instance can automatically label more than 14,000 images per day. Together, the two modes deliver a model-driven, human-reviewed data production paradigm.
Table 3 reports the timing of each workflow step for a 640 × 640 image. Grounding DINO inference and semantic extraction by Deepseek-R1 are the main sources of latency in the workflow. The latter is enabled only in the dialogue interface mode to generate interpretable descriptions. The system thus offers a configurable balance between interpretability and throughput, which supports data needs of different scales.
The proposed pipeline adopts a strictly sequential design. Only the Qwen3-VL model is deployed locally, while the Grounding DINO and Deepseek-R1 modules are accessed via external APIs. Although network communication latency between modules is not independently profiled, it is implicitly captured in the per-module runtime analysis presented in
Table 3, as all API processing and data transformation steps are implemented via explicit code nodes. Furthermore, since only the Qwen3-VL model runs on the local GPU, cross-module GPU memory sharing is not applicable in this deployment.
5. Experimental Results and Analysis
In the experiments, we first conducted consistency verification and hyperparameter sensitivity analysis to determine the parameters of the anonymization model. Second, to evaluate the effectiveness of the proposed anonymization model, we compared it with existing approaches on the WIDER FACE subset and the RealFace dataset to assess its ability to preserve non-identity facial attributes while removing personally identifiable information. Third, we validated the advantages of the proposed AIA method by training an object detection model on the anonymized dataset. Finally, we investigated the impact of anonymization on downstream performance by training object detection models on both the anonymized and original datasets.
5.1. Hardware Configuration and Scalability
The experiments were conducted on a Windows 11 operating system. The hardware configuration included an AMD Ryzen 9 9950X 16 Core Processor and an NVIDIA GeForce RTX 4090 D graphics card with 24 GB of video memory. The CUDA version was 11.1. All experiments were implemented using the PyTorch 1.8 deep learning framework under the Python 3.12 environment.
The proposed anonymization and annotation pipeline required 17.22 s to process a single 640 × 640 image. Of this time, the NullFace-based anonymization method required 2.30 s per image, and the semi-automated annotation platform, running in the interactive review mode with human verification, required approximately 14.92 s. When constructing large-scale datasets, an API-based batch processing mode can be adopted, which reduces the annotation time per image to 6.12 s. At this processing speed, the overall workflow can handle more than 420 images per hour, which meets the requirements of large-scale dataset applications. The models’ performance remains consistent on larger volumes, as each image is processed with the same learned capabilities.
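The timing figures above can be cross-checked with a few lines of Python. The sketch assumes strictly sequential processing on a single workflow instance and reads the hourly figure as anonymization plus batch-mode annotation, which is one interpretation consistent with the reported numbers; the constant and function names are illustrative.

```python
ANONYMIZATION_S = 2.30            # NullFace anonymization, per image
INTERACTIVE_ANNOTATION_S = 14.92  # interactive review mode, per image
BATCH_ANNOTATION_S = 6.12         # API batch mode, per image


def images_per_period(seconds_per_image: float, period_s: float) -> int:
    """Whole images completed in period_s under sequential processing."""
    return int(period_s // seconds_per_image)


# End-to-end time per image in each mode.
interactive_total = ANONYMIZATION_S + INTERACTIVE_ANNOTATION_S
batch_total = ANONYMIZATION_S + BATCH_ANNOTATION_S

print(round(interactive_total, 2))                    # 17.22
print(images_per_period(batch_total, 3600))           # 427 images per hour
print(images_per_period(BATCH_ANNOTATION_S, 86_400))  # 14117 labels per day
```

Under these assumptions, the batch pipeline completes 427 images per hour, and annotation alone reaches 14,117 images per day, in line with the figures of more than 420 per hour and more than 14,000 per day.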
5.2. Consistency Verification and Sensitivity Analysis
To verify the temporal consistency of anonymization, experiments were first conducted on the DH-FaceVid-1K dataset. The stability of anonymization results across video frame sequences was evaluated. A fixed random seed was used during anonymization to ensure consistent anonymized identities for the same target across different frames. This strategy maintained the same anonymized identity for the same subject throughout the video, as illustrated in
Figure 7. The high visual similarity across consecutive frames and the small differences in L2 (Euclidean) distance between faces before and after anonymization indicate that the anonymization results for the same target remain stable across adjacent frames.
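One way to quantify this kind of frame-to-frame stability is sketched below in Python. It assumes that a face recognition model has already produced one identity embedding per anonymized frame; the embedding source, the L2 normalization step, and the function name are assumptions for illustration and not necessarily the exact protocol used here.

```python
import numpy as np


def consecutive_l2_distances(embeddings: np.ndarray) -> np.ndarray:
    """L2 distance between identity embeddings of adjacent frames.

    embeddings: array of shape (num_frames, dim), one row per frame.
    Returns an array of shape (num_frames - 1,).
    """
    emb = np.asarray(embeddings, dtype=np.float64)
    if emb.ndim != 2 or emb.shape[0] < 2:
        raise ValueError("need a (num_frames, dim) array with at least 2 frames")
    # Normalize each embedding so distances are comparable across frames.
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    emb = emb / np.clip(norms, 1e-12, None)
    # Distance between frame t and frame t + 1.
    return np.linalg.norm(emb[1:] - emb[:-1], axis=1)
```

Small and roughly constant values along a sequence would indicate that the anonymized identity does not drift between adjacent frames.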
We then conducted a sensitivity analysis of the controllable anonymization strength parameter on the WIDER FACE and RealFace datasets to illustrate the degree of anonymization achieved by the model. As shown in
Figure 8, the original face and four anonymized versions generated with strength values of 0.25, 0.50, 0.75, and 1.0 are presented. Identity variation was quantified by computing the Euclidean distance between the anonymized faces and the original face using the ArcFace model. As the strength parameter increases, both visual inspection and the identity Euclidean distance indicate a larger deviation from the original identity.
Based on this result,
was selected as the identity strength coefficient for subsequent training. The remaining hyperparameter settings are summarized in
Table 5. NullFace adopts Stable Diffusion 1.5 as the pretrained generative model. The total number of diffusion sampling steps was set to 100 to balance generation quality and computational cost. The conditional guidance strength
was set to 10, which is a commonly used configuration in Stable Diffusion. This setting ensures that conditional information provides sufficient guidance during the denoising process and helps preserve the overall facial structure and visual consistency. The IP-Adapter conditional channel strength was set to 1 to maintain stable conditioning during cross-attention injection and to avoid unintended structural distortions caused by excessive amplification or suppression of conditional features. During denoising, the model skipped the first 70 steps and started reverse diffusion from an intermediate timestep with lower noise levels. This strategy preserves facial pose, illumination, and geometric structure while only moderately reconstructing identity-related features. In the sampling equation, the noise coefficient η was set to 1 to introduce moderate randomness. This enhances generation diversity and identity irreversibility, thereby further reducing the risk of identity recovery.
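The sampling settings described above can be collected in a small configuration object, sketched here in Python. The class and field names are hypothetical and do not correspond to the NullFace code base; only the values are taken from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnonymizationConfig:
    """Illustrative container for the reported sampling settings."""

    base_model: str = "Stable Diffusion 1.5"
    num_sampling_steps: int = 100   # total diffusion sampling steps
    skipped_steps: int = 70         # initial steps skipped during denoising
    guidance_scale: float = 10.0    # conditional guidance strength
    ip_adapter_scale: float = 1.0   # IP-Adapter conditional channel strength
    eta: float = 1.0                # noise coefficient in the sampling equation

    def __post_init__(self) -> None:
        if not 0 <= self.skipped_steps < self.num_sampling_steps:
            raise ValueError("skipped_steps must lie in [0, num_sampling_steps)")

    @property
    def executed_steps(self) -> int:
        """Number of reverse diffusion steps that are actually run."""
        return self.num_sampling_steps - self.skipped_steps


config = AnonymizationConfig()
print(config.executed_steps)  # 30
```

With 100 sampling steps and the first 70 skipped, only 30 reverse diffusion steps are executed per image, which is why the coarse structure of the input is largely retained.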
5.3. Comparison of Model Results Before and After Anonymization
To demonstrate the advantage of the NullFace anonymization method, we evaluated the proposed approach on the WIDER FACE dataset and RealFace dataset and compared it with two baselines, FAMS [
38] and Ciagan [
39].
Table 6 and
Table 7 present comparisons. First, in terms of the reidentification metric, all three anonymization methods reduce identity reidentification rates. Among them, NullFace achieves the lowest Re-ID scores on both datasets, indicating the most effective identity anonymization performance. For attribute preservation, head pose, expression consistency, and gaze direction error were evaluated. A head pose estimation model [
33] was used to predict facial orientations in both the generated images and the original images. The quaternion angular distance between these orientations was then computed. NullFace performs slightly worse than FAMS on the pitch angle metric but shows a clear improvement over Ciagan. In addition, NullFace exhibits more stable performance in expression consistency and gaze direction error. These results suggest that NullFace effectively removes identity information while preserving the original facial geometry and semantic attributes, which supports downstream detection tasks.
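The quaternion angular distance mentioned above has a standard closed form, sketched below in Python. The sketch assumes that the head pose estimator's output has already been converted to quaternions; that conversion and the function name are outside what is described here.

```python
import numpy as np


def quaternion_angular_distance(q1, q2) -> float:
    """Rotation angle in radians between two orientations given as quaternions.

    The absolute value of the dot product is used because q and -q
    represent the same rotation.
    """
    a = np.asarray(q1, dtype=np.float64)
    b = np.asarray(q2, dtype=np.float64)
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    if na == 0.0 or nb == 0.0:
        raise ValueError("quaternions must be non-zero")
    # Normalize, then clamp the dot product to guard against rounding error.
    dot = min(1.0, abs(float(np.dot(a / na, b / nb))))
    return float(2.0 * np.arccos(dot))
```

For instance, the identity orientation and a 90-degree rotation about one axis are separated by an angle of π/2, whereas a quaternion and its negation are separated by zero.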
We next measured image quality using FID. The proposed method yielded the lowest FID on both datasets, substantially outperforming Ciagan. We also measured the MUSIQ score difference between generated images and original images to assess quality retention. The NullFace model showed a strong ability to retain the original image quality. On the WIDER FACE dataset, methods such as FAMS produced higher MUSIQ scores but a larger MUSIQ distance from the original images because they performed visible enhancement. By contrast, owing to the inversion process, NullFace preserves the source image quality instead of applying unnatural enhancement.
Figure 9 and
Figure 10 show results under single-person conditions, including occluded and extreme-angle faces, from the WIDER FACE and RealFace datasets. The proposed method effectively anonymizes identity while preserving identity-independent details, including pose, expression, and background. The outputs are visually realistic. Compared with other methods, NullFace does not introduce obvious artifacts or geometric distortions, and it does not disrupt the scene layout, accessories, or other details relevant to downstream tasks. FAMS failed to preserve participant expressions. Ciagan produced less realistic outputs and missed anonymization targets.
As shown in
Figure 11 and
Figure 12, in the two-person scene of the WIDER FACE and RealFace datasets, NullFace can anonymize multiple participants in the same image and generate internally consistent and mutually distinguishable new identities for each participant. Across different viewpoints, each participant’s facial appearance remains coherent in expression and pose while background details are preserved. Other methods often mix faces or produce inconsistent poses and expressions for multiple people. NullFace remains stable in two-person scenes and produces high-quality anonymization, which demonstrates stronger adaptation to complex scenarios and better privacy protection.
5.4. Evaluation of Object Detection Models Under Different Annotation Methods
After anonymization, images were annotated using the foundation-model-based method described in
Section 3. To compare the impact of different AIA methods on detector training, the YOLOv8 model was trained using two annotation sets: manual labels and semi-automatic labels. Training used gradient descent with an initial learning rate of 0.01. Models were trained for 60 epochs with a batch size of 32 and an input image size of 512. In addition to the default augmentation, we applied rotation, shear, perspective transforms, and vertical flip. No additional pretraining was used. Data augmentation was disabled for the final 10 epochs. The early stopping patience was set to 30 epochs.
Table 8 reports the performance comparison for different annotation methods. AWS is the Amazon Web Services SageMaker Ground Truth annotation service. Using the default built-in algorithm with bounding box annotations, an automatic labeling task was constructed for face detection. Consistent with the proposed strategy, an IoU threshold of 0.5 was adopted as the matching criterion. As expected, the manually annotated dataset gave the best performance. The semi-automatic annotated dataset achieved results similar to manual annotation. The AWS annotation method performed the worst. These results indicate that the proposed annotation pipeline is clearly superior to the AWS labeling approach. Overall, precision values for all models were similar and above 90 percent, which is a strong score for detection tasks.
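The IoU matching criterion can be made concrete with a short Python sketch. Boxes are assumed to be (x1, y1, x2, y2) tuples in pixels, and the greedy one-to-one assignment in prediction order is an assumption made for illustration, since the exact matching procedure is not specified here.

```python
from typing import List, Sequence, Tuple

Box = Sequence[float]  # (x1, y1, x2, y2)


def iou_xyxy(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_boxes(preds: List[Box], gts: List[Box], thr: float = 0.5) -> List[Tuple[int, int]]:
    """Greedily pair each prediction with its best unmatched ground-truth box.

    A pair counts as a match only if its IoU is at least thr.
    Predictions are assumed to be sorted by descending confidence.
    """
    matches, used = [], set()
    for pi, pred in enumerate(preds):
        best_iou, best_gi = 0.0, -1
        for gi, gt in enumerate(gts):
            if gi in used:
                continue
            value = iou_xyxy(pred, gt)
            if value > best_iou:
                best_iou, best_gi = value, gi
        if best_gi >= 0 and best_iou >= thr:
            matches.append((pi, best_gi))
            used.add(best_gi)
    return matches
```

Matched pairs count as true positives, unmatched predictions as false positives, and unmatched ground-truth boxes as false negatives, from which precision and recall follow.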
Finally, we evaluated the annotated datasets with the trained detectors and show results in
Figure 13. Under the same training and evaluation settings, YOLOv8 maintained desirable detection performance on anonymized images. Although confidence scores dropped slightly after anonymization, the detector still localized targets accurately and produced high-confidence predictions. This finding indicates that the anonymization removed identity features while preserving head and face geometry and semantic context, keeping the data useful for downstream visual tasks.
6. Conclusions
This paper addresses privacy protection and annotation needs in real traffic data collection. We propose a unified framework that deeply couples generative anonymization with foundation-model-driven automatic annotation. The main conclusions are listed below.
First, the NullFace diffusion anonymization method shows superior performance in complex traffic scenes. Quantitative evaluations show that it achieves the lowest reidentification rate while effectively preserving head pose, facial expression, and gaze direction. In addition, NullFace attains the lowest FID and the smallest MUSIQ difference from the original images, indicating superior visual quality and quality retention. The method outperforms mainstream techniques in producing high-quality and realistic anonymized images that remain close to the original image distribution. It effectively removes identity features while preserving expression, head pose, and in-cabin background details, and it maintains consistency across scenarios.
Second, the semi-AIA system, which integrates Qwen3-VL, Grounding DINO, and Deepseek-R1, automates the full pipeline from data cleaning to standardized output. The system balances processing efficiency and label quality. Detection metrics for models trained on data from this pipeline match those trained on manual labels and are significantly better than results from the AWS auto-labeling service.
Finally, the tightly coupled anonymization and semi-AIA pipeline preserves downstream usability. YOLOv8 comparison experiments show that models trained on data processed and labeled by our method retain detection performance comparable to pre-anonymization levels. This indicates that the anonymization did not damage the geometric and semantic context that is critical for downstream vision tasks.
This work fills a gap in applying generative anonymization and automatic annotation to real traffic scenes. It provides a reference for future research on privacy protection and intelligent annotation for larger-scale, multimodal, and cross-domain traffic data.