Multimodal-Based Selective De-Identification Framework
Abstract
1. Introduction
- Proposal of a novel architecture that combines image and text encoders with a cross-modality decoder to perform fine-grained object grounding from natural language prompts in real-world applications.
- Demonstration of the effectiveness of zeroshot and referring-based object detection for selective de-identification, enabling precise identification of user-specified targets.
- Comparative analysis between zeroshot and referring-based approaches across multiple benchmark datasets, highlighting the advantages of prompt-driven methods.
2. Related Works
2.1. Selective De-Identification
2.1.1. Selective De-Identification Based on Object Detection
2.1.2. Selective De-Identification Based on Object Identification
- Feature similarity-based de-identification
- Fine-grained classification-based de-identification
2.1.3. Comparison of Feature Similarity-Based and Fine-Grained Classification-Based De-Identification
2.2. Multimodal-Based Open-Set Object Detection
- Image Backbone: This component processes the input image using a vision transformer (e.g., Swin Transformer or ResNet) to extract high-dimensional visual features. These features encode spatial and semantic information about objects and regions within the image.
- Text Backbone: Simultaneously, the input text (e.g., a prompt like “person wearing red shorts”) is processed using a language model (typically BERT or similar transformer-based encoder) to generate contextualized textual embeddings. These embeddings capture the semantic meaning and relationships within the phrase.
- Feature Enhancer: The extracted visual and textual features are then fused through a cross-modal feature enhancer. This module aligns the two modalities in a shared semantic space, allowing the model to reason about the relationship between image regions and textual descriptions.
- Language-Based Query Selection Module: From the fused features, this module selects a set of queries that are most relevant to the textual input. These queries act as anchors for object localization and are derived from the image features, guided by the semantics of the text.
- Cross-Modality Decoder: The selected queries are passed into a transformer-based decoder that performs iterative refinement. At each decoding layer, the model attends to both visual and textual features, updating the queries to better match the target objects described in the prompt.
- Final Output Layer: After several decoding iterations, the final layer produces refined queries that are used to predict bounding boxes around the detected objects. Additionally, the model outputs grounded textual phrases that correspond to each box, effectively linking visual regions to language (a minimal structural sketch of this pipeline follows this list).
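To make the data flow above concrete, the sketch below wires these components together in PyTorch. It is a minimal structural illustration, not the actual GroundingDINO implementation: the feature dimensions, number of queries, and plain multi-head attention blocks (where the published model uses more specialized attention) are assumptions chosen for readability.

```python
import torch
import torch.nn as nn

class CrossModalityDecoderLayer(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img_cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))

    def forward(self, queries, img_feats, txt_feats):
        # Self-attention among queries, then cross-attention to image and text features.
        queries = queries + self.self_attn(queries, queries, queries)[0]
        queries = queries + self.img_cross_attn(queries, img_feats, img_feats)[0]
        queries = queries + self.txt_cross_attn(queries, txt_feats, txt_feats)[0]
        return queries + self.ffn(queries)

class GroundingSketch(nn.Module):
    def __init__(self, dim=256, num_queries=900, num_layers=6):
        super().__init__()
        # Stand-ins for the vision/language backbones: project their token features to a shared dim.
        self.image_proj = nn.Linear(2048, dim)
        self.text_proj = nn.Linear(768, dim)
        self.enhancer = nn.MultiheadAttention(dim, 8, batch_first=True)  # simplified feature enhancer
        self.decoder = nn.ModuleList([CrossModalityDecoderLayer(dim) for _ in range(num_layers)])
        self.box_head = nn.Linear(dim, 4)  # (cx, cy, w, h), normalized
        self.num_queries = num_queries

    def forward(self, img_feats, txt_feats):
        img = self.image_proj(img_feats)
        txt = self.text_proj(txt_feats)
        # Feature enhancer: image tokens attend to the prompt tokens.
        img = img + self.enhancer(img, txt, txt)[0]
        # Language-guided query selection: keep the image tokens most similar to any prompt token.
        sim = (img @ txt.transpose(1, 2)).max(dim=-1).values
        k = min(self.num_queries, img.shape[1])
        topk = sim.topk(k, dim=1).indices
        queries = torch.gather(img, 1, topk.unsqueeze(-1).expand(-1, -1, img.shape[-1]))
        # Cross-modality decoder: iterative refinement against both modalities.
        for layer in self.decoder:
            queries = layer(queries, img, txt)
        boxes = self.box_head(queries).sigmoid()       # bounding boxes
        phrase_logits = queries @ txt.transpose(1, 2)  # grounding score per (query, text token) pair
        return boxes, phrase_logits

# Toy run with random backbone outputs: 100 image tokens (2048-d), 8 text tokens (768-d).
boxes, logits = GroundingSketch()(torch.randn(1, 100, 2048), torch.randn(1, 8, 768))
print(boxes.shape, logits.shape)  # torch.Size([1, 100, 4]) torch.Size([1, 100, 8])
```

The printed shapes show the essential outputs: one bounding box per selected query and a grounding score for every query/text-token pair, which is how predicted boxes are linked back to phrases in the prompt.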
3. Methodology
3.1. Multimodal Approaches for Selective De-Identification
3.1.1. Zeroshot-Based Selective De-Identification
3.1.2. Referring-Based Selective De-Identification
3.1.3. Comparison of Zeroshot and Referring-Based Selective De-Identification
3.2. Multimodal-Based Selective De-Identification (MSD) Framework
3.2.1. Framework Role and Objectives
3.2.2. Architectural Components
- Frame Layer: The system captures a sequence of video frames from a device or server, decodes and analyzes them to identify frames containing potentially sensitive objects based on user-defined prompts, and forwards only the relevant frames for further processing to conserve resources and protect privacy.
- Prompt-Guided Targeting: A natural language prompt (e.g., “person wearing red shirts”) provided by the user or system acts as a semantic filter that guides the framework to identify and focus on specific objects or individuals within the selected video frames.
- De-identification Model Layer: This module performs object grounding by integrating visual and textual features through a backbone for base-level extraction, a feature enhancer for refined discrimination, and a decoder that aligns prompt semantics with visual regions. It operates in either zeroshot mode for general detection or referring mode for prompt-guided localization.
- Masking Layer: After selective de-identification, the system performs a final object check to verify accuracy and catch any additional targets. It then applies a masking technique such as blurring, resampling, or inpainting, evaluates the severity of the visual content (e.g., accidents or crime scenes) to set the filtering intensity, and encodes the anonymized images or videos for display and distribution (a minimal end-to-end sketch of these layers follows this list).
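As referenced above, the following is a minimal end-to-end sketch of the layer sequence, assuming OpenCV for frame decoding, masking, and re-encoding. The detect_targets function is a hypothetical placeholder for the zeroshot/referring grounding model, and Gaussian blurring stands in for the full masking layer (which may also apply resampling, inpainting, or severity-based filtering).

```python
import cv2

def detect_targets(frame, prompt):
    """Hypothetical grounding call: return [(x1, y1, x2, y2), ...] for regions matching the prompt.
    Returns no boxes until a real zeroshot/referring detector is plugged in."""
    return []

def blur_regions(frame, boxes, ksize=(51, 51)):
    """Masking layer (simplified): Gaussian-blur every grounded region in place."""
    for (x1, y1, x2, y2) in boxes:
        roi = frame[y1:y2, x1:x2]
        if roi.size:
            frame[y1:y2, x1:x2] = cv2.GaussianBlur(roi, ksize, 0)
    return frame

def run_msd(video_path, prompt, out_path):
    cap = cv2.VideoCapture(video_path)                   # frame layer: decode the stream
    fourcc = cv2.VideoWriter_fourcc(*"mp4v")
    writer = None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if writer is None:
            fps = cap.get(cv2.CAP_PROP_FPS) or 30
            writer = cv2.VideoWriter(out_path, fourcc, fps, (frame.shape[1], frame.shape[0]))
        boxes = detect_targets(frame, prompt)            # de-identification model layer
        writer.write(blur_regions(frame, boxes))         # masking layer + encoding for distribution
    cap.release()
    if writer is not None:
        writer.release()

# Example: run_msd("input.mp4", "person wearing red shirts", "anonymized.mp4")
```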
3.2.3. Strengths and Advantages
- Prompt-Guided Precision: Enables fine-grained control over anonymization targets through natural language input.
- Multimodal Reasoning: Combines visual and textual cues to enhance semantic understanding and reduce false positives.
- Scalability and Flexibility: Supports diverse object types and prompt expressions without retraining for each new category.
- Robustness to Ambiguity: Demonstrates resilience in handling complex, ambiguous, or multi-meaning prompts through referring-based detection.
- Modular Design: Allows integration of various backbone models and anonymization strategies depending on application needs.
4. Experiments
4.1. Results and Discussion
- Object and face detection followed by de-identification of all detected entities (OD, OD+FD).
- Feature-based identification, where object and face detectors are used to extract feature vectors of specified targets, which are then compared against stored reference vectors to de-identify matched individuals (OD+FD+FID; see the matching sketch after this list).
- Fine-grained object classification, leveraging pretrained models to selectively de-identify predefined targets based on detailed attribute recognition (OD+OC).
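To clarify how the OD+FD+FID baseline decides which detections to anonymize, the sketch below compares a detected face embedding against stored reference vectors with cosine similarity; only matches above a threshold are de-identified. The 0.6 threshold and the 1280-dimensional MobileNetV2-style feature size are illustrative assumptions.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def is_deid_target(face_embedding, reference_embeddings, threshold=0.6):
    """De-identify a detected face only if it matches one of the stored reference vectors."""
    return any(cosine_similarity(face_embedding, ref) >= threshold
               for ref in reference_embeddings)

# Example with random 1280-d vectors (roughly a MobileNetV2-style global feature).
refs = [np.random.rand(1280) for _ in range(3)]
query = refs[0] + 0.01 * np.random.rand(1280)   # near-duplicate of a stored reference
print(is_deid_target(query, refs))              # True -> this face gets masked
```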
4.2. Prompt Sensitivity and Robustness to Complex Language for De-Identification
- Ambiguous prompts (e.g., “person near the car”)
- Multi-attribute prompts (e.g., “woman with a black bag and sunglasses”)
- Context-dependent prompts (e.g., “child standing behind the bench”)
- Prompt rephrasing using language models to improve clarity
- Confidence scoring for bounding box predictions (see the sketch after this list)
- Visual feedback mechanisms for user validation
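A minimal sketch of the confidence-scoring safeguard listed above: boxes whose grounding score clears a threshold are masked automatically, while low-confidence boxes are routed to the visual feedback mechanism for user validation. The 0.35 threshold and the (box, score) tuple format are assumptions for illustration.

```python
def filter_predictions(boxes_with_scores, threshold=0.35):
    """Split predictions into auto-maskable boxes and low-confidence boxes needing review."""
    confident = [(box, s) for box, s in boxes_with_scores if s >= threshold]
    needs_review = [(box, s) for box, s in boxes_with_scores if s < threshold]
    return confident, needs_review

# Example: one confident detection, one ambiguous one routed to visual feedback.
preds = [((120, 40, 260, 300), 0.81), ((400, 60, 520, 310), 0.22)]
keep, review = filter_predictions(preds)
print(len(keep), "masked automatically;", len(review), "flagged for user validation")
```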
4.3. Framework and Implementation
- Processing time: Frame-by-frame algorithm processing performance.
- Security: Accessibility of frames to external entities.
- Applicability: Ease of implementing and updating new algorithms.
- Resource distribution: Separation of rendering and algorithm computations.
- Multi-scaling: Scalability for service deployment.
| Mode | App. OS | Algorithm OS | File Share Interface | IPC Interface | Socket Interface | Characteristics |
|---|---|---|---|---|---|---|
| Stand-alone | Windows | Windows | processing time: middle; security: low; applicability: middle | processing time: high; security: high; applicability: middle | processing time: low; security: high; applicability: middle | resource distribution: low; multi-scaling: X |
| Stand-alone | Windows | Linux (WSL) | shared file updates not reflected in WSL | shared memory updates not reflected in WSL | | |
| Network | Windows | Windows | not supported | not supported | processing time: low; security: high; applicability: high | resource distribution: high; multi-scaling: O |
| Network | Windows | Linux | not supported | not supported | | |
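For the socket interface compared above, the sketch below shows a length-prefixed frame exchange between the rendering application and the algorithm process; the port number, framing protocol, and payload encoding are assumptions rather than the framework's actual wire format.

```python
import socket
import struct

def recv_exact(sock, n):
    """Read exactly n bytes (recv may return partial chunks)."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed mid-frame")
        buf += chunk
    return buf

def send_frame(sock, frame_bytes):
    # Length-prefixed framing so the receiver knows where each encoded frame ends.
    sock.sendall(struct.pack("!I", len(frame_bytes)) + frame_bytes)

def recv_frame(sock):
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, length)

# Algorithm-side server:
#   srv = socket.create_server(("0.0.0.0", 50007)); conn, _ = srv.accept()
#   raw = recv_frame(conn)          # decode, de-identify, then send_frame(conn, result)
# App-side client:
#   cli = socket.create_connection(("algorithm-host", 50007))
#   send_frame(cli, encoded_frame)  # anonymized frame returned via recv_frame(cli)
```

Because the two processes share only a TCP connection, the same exchange works whether the algorithm side runs on Windows or Linux, which is what makes the network mode's resource distribution and multi-scaling possible.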
5. Conclusions
Funding
Data Availability Statement
Conflicts of Interest
References
- Kim, D.-J.; Jeon, Y.-G. A study on video automatic selective de-identification. J. Digit. Contents Soc. 2023, 24, 725–734.
- Kim, D.-J.; Jeon, Y.-G.; Kim, J.-H. Automatic Selective De-identification Based on Abnormal Behavior Detection. J. Digit. Contents Soc. 2025, 26, 1069–1076.
- Rempe, M.; Heine, L.; Seibold, C.; Hörst, F.; Kleesiek, J. De-Identification of Medical Imaging Data: A Comprehensive Tool for Ensuring Patient Privacy. arXiv 2024, arXiv:2410.12402.
- Ultralytics YOLO Vision. Available online: https://github.com/ultralytics/ultralytics (accessed on 14 May 2025).
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the Computer Vision—ECCV, Amsterdam, The Netherlands, 8–16 October 2016.
- Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015.
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the Computer Vision—ECCV, Glasgow, UK, 23–28 August 2020.
- Zhang, Y.; Sun, P.; Jiang, Y.; Yu, D.; Weng, F.; Yuan, Z.; Luo, P.; Liu, W.; Wang, X. ByteTrack: Multi-Object Tracking by Associating Every Detection Box. In Proceedings of the Computer Vision—ECCV, Tel Aviv, Israel, 23–27 October 2022.
- Aharon, N.; Orfaig, R.; Bobrovsky, B.-Z. BoT-SORT: Robust Associations Multi-Pedestrian Tracking. arXiv 2022, arXiv:2206.14651.
- Ribaric, S.; Ariyaeeinia, A.; Pavesic, N. De-identification for privacy protection in multimedia content. Signal Process. Image Commun. 2016, 47, 131–151.
- Newton, E.; Sweeney, L.; Malin, B. Preserving privacy by de-identifying face images. IEEE Trans. Knowl. Data Eng. 2005, 17, 232–243.
- Boyle, M.; Edwards, C.; Greenberg, S. The effects of filtered video on awareness and privacy. In Proceedings of the ACM Conference on Computer Supported Cooperative Work, Philadelphia, PA, USA, 2–6 December 2000; pp. 1–10.
- Goodfellow, I.; Shlens, J.; Szegedy, C. Explaining and harnessing adversarial examples. In Proceedings of the International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015; pp. 1–11.
- Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. arXiv 2014, arXiv:1406.2661.
- Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. In Proceedings of the 34th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 6–12 December 2020; pp. 6840–6851.
- Deng, J.; Guo, J.; Yang, J.; Xue, N.; Kotsia, I.; Zafeiriou, S. ArcFace: Additive Angular Margin Loss for Deep Face Recognition. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019.
- Xu, Y.; Yan, W.; Yang, G.; Luo, J.; Li, T.; He, J. CenterFace: Joint Face Detection and Alignment Using Face as Point. Sci. Program. 2020, 2020, 7845384.
- Deng, J.; Guo, J.; Zhou, Y.; Yu, J.; Kotsia, I.; Zafeiriou, S. RetinaFace: Single-stage Dense Face Localisation in the Wild. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020.
- Zhang, K.; Zhang, Z.; Li, Z.; Qiao, Y. Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Networks. IEEE Signal Process. Lett. 2016, 23, 1499–1503.
- Gu, X.; Lin, T.-Y.; Kuo, W.; Cui, Y. Open-vocabulary Object Detection via Vision and Language Knowledge Distillation. In Proceedings of the Tenth International Conference on Learning Representations, Virtual, 25 April 2022.
- Zhou, X.; Girdhar, R.; Joulin, A.; Krähenbühl, P.; Misra, I. Detecting Twenty-thousand Classes using Image-level Supervision. In Proceedings of the Computer Vision—ECCV, Tel Aviv, Israel, 23–27 October 2022.
- Zhong, Y.; Yang, J.; Zhang, P.; Li, C.; Codella, N.; Li, L.H.; Zhou, L.; Dai, X.; Yuan, L.; Li, Y.; et al. RegionCLIP: Region-based Language-Image Pretraining. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022.
- Zang, Y.; Li, W.; Zhou, K.; Huang, C.; Loy, C.C. Open-Vocabulary DETR with Conditional Matching. In Proceedings of the Computer Vision—ECCV, Tel Aviv, Israel, 23–27 October 2022.
- Li, L.H.; Zhang, P.; Zhang, H.; Yang, J.; Li, C.; Zhong, Y.; Wang, L.; Yuan, L.; Zhang, L.; Hwang, J.-N.; et al. Grounded Language-Image Pre-training. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022.
- Yu, L.; Lin, Z.; Shen, X.; Yang, J.; Lu, X.; Bansal, M.; Berg, T.L. MAttNet: Modular Attention Network for Referring Expression Comprehension. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018.
- Chen, Y.-C.; Li, L.; Yu, L.; El Kholy, A.; Ahmed, F.; Gan, Z.; Cheng, Y.; Liu, J. UNITER: UNiversal Image-TExt Representation Learning. In Proceedings of the Computer Vision—ECCV, Glasgow, UK, 23–28 August 2020.
- Deng, J.; Yang, Z.; Chen, T.; Zhou, W.; Li, H. TransVG: End-to-End Visual Grounding with Transformers. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021.
- Du, Y.; Fu, Z.; Liu, Q.; Wang, Y. Visual Grounding with Transformers. In Proceedings of the 2022 IEEE International Conference on Multimedia and Expo (ICME), Taipei, Taiwan, 18–22 July 2022.
- Liu, S.; Zeng, Z.; Ren, T.; Li, F.; Zhang, H.; Yang, J.; Jiang, Q.; Li, C.; Yang, J.; Su, H.; et al. Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection. In Proceedings of the Computer Vision—ECCV 2024, Milan, Italy, 29 September–4 October 2024.
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021.
- Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019.
- Zhou, X.; Wang, D.; Krähenbühl, P. Objects as Points. arXiv 2019, arXiv:1904.07850.
- Kuznetsova, A.; Rom, H.; Alldrin, N.; Uijlings, J.; Krasin, I.; Pont-Tuset, J.; Kamali, S.; Popov, S.; Malloci, M.; Kolesnikov, A.; et al. The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale. Int. J. Comput. Vis. 2020, 128, 1956–1981.
- Plummer, B.A.; Wang, L.; Cervantes, C.M.; Caicedo, J.C.; Hockenmaier, J.; Lazebnik, S. Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015.
- Krishna, R.; Zhu, Y.; Groth, O.; Johnson, J.; Hata, K.; Kravitz, J.; Chen, S.; Kalantidis, Y.; Li, L.-J.; Shamma, D.A.; et al. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. Int. J. Comput. Vis. 2017, 123, 32–73.
- AI-Hub Dataset (Video for Vehicle Model/Year/License Plate Recognition). Available online: https://www.aihub.or.kr (accessed on 14 May 2025).

| Model | Key Feature | Output Feature Dim. (224 × 224 Input) | Strengths | Limitations | Application Domain |
|---|---|---|---|---|---|
| MobileNetV2 | Inverted residuals, depth-wise convolutions | 7 × 7 × 1280 | Lightweight, fast, mobile-friendly | Lower accuracy than larger models | Mobile apps, edge devices, IoT |
| ResNet-50/101 | Residual learning, deep architecture | 7 × 7 × 2048 | Stable training, high expressiveness | High computational cost | Object/person detection, feature matching |
| DenseNet121 | Dense layer connectivity, feature reuse | 7 × 7 × 1024 | Combines multi-level features, mitigates vanishing gradients | High memory usage | Medical imaging, fine-grained recognition |
| EfficientNet-B0~B7 | Compound scaling, efficient representation | B0: 7 × 7 × 1280 | High accuracy, low parameter count | Model size increases with scale | Autonomous driving, high-res classification |
| VGG16/19 | Sequential layers, simple structure | 7 × 7 × 512 | Easy to use, good for transfer learning | Many parameters, slower inference | Feature extraction, transfer learning |
| Model | Backbone | Pre-Training Data | Zeroshot (COCO 2017val) | COCO Fine-Tuned (2017val) |
|---|---|---|---|---|
| GroundingDINO T | Swin-T | O365 | 46.7 | 56.9 |
| GroundingDINO T | Swin-T | O365, GoldG | 48.1 | 57.1 |
| GroundingDINO T | Swin-T | O365, GoldG, Cap4M | 48.4 | 57.2 |
| GroundingDINO L | Swin-L | O365, OI, GoldG | 52.5 | 62.6 |
| Model | Backbone | Pre-Training Data | Fine-Tuned | RefCOCO val | RefCOCO testA | RefCOCO testB | RefCOCO+ val | RefCOCO+ testA | RefCOCO+ testB | RefCOCOg val | RefCOCOg test |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GroundingDINO T | Swin-T | O365, GoldG | No | 50.41 | 57.74 | 43.21 | 51.40 | 57.59 | 45.81 | 67.46 | 67.13 |
| GroundingDINO T | Swin-T | O365, GoldG, RefC | No | 79.98 | 74.88 | 59.29 | 66.81 | 69.91 | 56.09 | 71.06 | 72.07 |
| GroundingDINO T | Swin-T | O365, GoldG, RefC | Yes | 89.19 | 91.86 | 85.99 | 81.09 | 87.40 | 74.71 | 84.15 | 84.94 |
| GroundingDINO L | Swin-L | O365, OI, GoldG, Cap4M, COCO, RefC | Yes | 90.56 | 93.19 | 88.24 | 82.75 | 88.95 | 75.92 | 86.13 | 87.02 |
| Aspect | Zeroshot | Referring |
|---|---|---|
| Purpose | De-identify objects of classes not predefined during training | De-identify specific objects referred to by sentences or phrases |
| Input | Image + class name (text) | Image + natural language sentence |
| Output | Masking the location (bounding box) of objects of the given class | Masking the location (bounding box) of the object referred to in the sentence |
| Example | De-identify cats in an image even though no class named “cat” existed in the training data | Given the sentence “brown dog sitting on a chair”, de-identify the dog it refers to |
| Algorithm | Model | Target | Type | Pre-Training Data | COCO2017 (Test) | AI-HUB (Test) | RefCOCO (Test) |
|---|---|---|---|---|---|---|---|
| OD | YOLOv11x | Person | GC | COCO | 66.2 | - | 68.4 |
| OD | YOLOv11x | Emblem | GC | AI-HUB | - | 48.7 | - |
| OD+FD | YOLOv11x, Centerface | Face | GC | COCO, WIDERFACE | 64.8 | - | 66.6 |
| OD+FD+FID | YOLOv11x, Centerface, MobileNetV2 | Face | ISID | COCO, WIDERFACE | - | - | - |
| OD+OC | YOLOv11x, EfficientNet(B0) | Emblem (subclass) | ISID | AI-HUB | - | 42.1 | - |
| ZD | GroundingDINO B | Person | PGC | O365, OI, GoldG, Cap4M, COCO, RefC | 70.4 | - | 71.7 |
| ZD | GroundingDINO B | Face | PGC | O365, OI, GoldG, Cap4M, COCO, RefC | 62.8 | - | 62.1 |
| ZD | GroundingDINO B | Emblem | PGC | O365, OI, GoldG, Cap4M, COCO, RefC | - | 39.4 | - |
| RD | GroundingDINO B | Person | PGID | O365, OI, GoldG, Cap4M, COCO, RefC | - | - | 64.3 |
| RD | GroundingDINO B | Face | PGID | O365, OI, GoldG, Cap4M, COCO, RefC | - | - | 62.3 |
| Model | Role | Resolution | Average Inference Time (ms) |
|---|---|---|---|
| YOLOv11 | Object detection (Person, Emblem) | 1280 × 720 | 15.14 |
| Centerface | Face detection | 1280 × 720 | 12.02 |
| MobileNetV2 | Feature extraction | 128 × 128 | 3.32 |
| EfficientNet | Object classification (Emblem) | 128 × 128 | 7.04 |
| GroundingDINO | Zeroshot, referring object detection | 1280 × 720 | 40.18 |
