Article
Peer-Review Record

Multimodal-Based Selective De-Identification Framework

Electronics 2025, 14(19), 3896; https://doi.org/10.3390/electronics14193896
by Dae-Jin Kim
Reviewer 1:
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 2 September 2025 / Revised: 28 September 2025 / Accepted: 29 September 2025 / Published: 30 September 2025
(This article belongs to the Special Issue Recent Advances in Security and Privacy for Multimedia Systems)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The authors in this manuscript propose a multimodal-based selective de-identification framework that integrates diverse data modalities (such as text, images, and potentially audio or structured health records) to selectively anonymize personally identifiable information (PII) and protected health information (PHI) in sensitive datasets, particularly for applications in healthcare AI and privacy-preserving data sharing. The approach leverages techniques like rule-based redaction, machine learning classifiers, and transformer-based models to target specific elements for de-identification while minimizing information loss.

I recommend publishing this work, but the authors need to address the following issues:

  1. The evaluation metrics, such as re-identification risk scores and utility preservation (e.g., via downstream task performance like disease prediction accuracy), should be expanded with comparisons to baselines on diverse multimodal datasets beyond standard benchmarks.
  2. The novelty of the proposed method is not clearly explained. While selective de-identification across modalities is promising, the manuscript should delineate how it advances beyond existing multimodal frameworks (e.g., those using OCR for text-in-image extraction or unified anonymization pipelines) by quantifying improvements in precision-recall trade-offs for selective vs. full redaction.
  3. The computational efficiency analysis is insufficient; include runtime benchmarks for large-scale datasets and discuss deployment feasibility in resource-constrained environments like edge devices.
  4. There are multiple grammar and stylistic errors throughout the text. The authors need to thoroughly proofread the manuscript. For example, sentences like “the framework achieve better results” should read “the framework achieves better results” (subject-verb agreement). In another instance, phrases such as “Figure (1) is showing” should be corrected to “Figure 1 shows…”. There are also run-on sentences and inconsistent tense usage. For instance, the manuscript often omits articles (“propose framework” → “propose a framework”). Please review the entire text and fix these issues to improve readability.
  5. More recent works should be discussed. Please read the following related work and discuss it in your work: Multiview attention network with random interpolation resize for few-shot surface defect detection.

Comments for author File: Comments.pdf

Comments on the Quality of English Language
  1. There are multiple grammar and stylistic errors throughout the text. The authors need to thoroughly proofread the manuscript. For example, sentences like “the framework achieve better results” should read “the framework achieves better results” (subject-verb agreement). In another instance, phrases such as “Figure (1) is showing” should be corrected to “Figure 1 shows…”. There are also run-on sentences and inconsistent tense usage. For instance, the manuscript often omits articles (“propose framework” → “propose a framework”). Please review the entire text and fix these issues to improve readability.

Author Response

[Response to Reviewer Comments]
I sincerely thank the reviewer for the detailed and constructive feedback. Below, we address each point raised:

[Comments and Suggestions for Authors]
The authors in this manuscript propose a multimodal-based selective de-identification framework that integrates diverse data modalities (such as text, images, and potentially audio or structured health records) to selectively anonymize personally identifiable information (PII) and protected health information (PHI) in sensitive datasets, particularly for applications in healthcare AI and privacy-preserving data sharing. The approach leverages techniques like rule-based redaction, machine learning classifiers, and transformer-based models to target specific elements for de-identification while minimizing information loss. I recommend publishing this work, but the authors need to address the following issues:

(Comments 1) :
The evaluation metrics, such as re-identification risk scores and utility preservation (e.g., via downstream task performance like disease prediction accuracy), should be expanded with comparisons to baselines on diverse multimodal datasets beyond standard benchmarks.

(Response 1) :
Thank you for pointing this out. This point has already been addressed in the main body of the manuscript (Section 3.2.1, page 9, and Section 3.2.2, page 10); the details are outlined in the main text.
Table 2 presents evaluation results based on pretraining using datasets such as O365, OI, GoldG, and Cap4M.
De-identification was performed in a zero-shot setting, and Swin-T or Swin-L backbone models were employed.
Testing was conducted on the COCO validation dataset.
When fine-tuning was applied using the COCO training dataset, the average precision (AP) improved, demonstrating enhanced de-identification performance.
Table 3 shows evaluation results based on referring expression-based de-identification across multiple datasets.
Similar to the zero-shot approach, Swin-T or Swin-L backbones were used, and evaluation was conducted on the RefC validation and test datasets (RefCOCO, RefCOCO+, RefCOCOg) after pretraining.
Fine-tuning with the RefC training dataset led to overall performance improvements compared to the results prior to fine-tuning.
A comparative analysis was conducted across various multimodal datasets, and performance was assessed using average precision (AP) as the primary metric.
Additionally, in Section 4.1, Results and Discussion [pages 12–13], we compared conventional selective de-identification methods with the proposed multimodal-based approach, highlighting its innovation and superiority.
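For context, the zero-shot path described in this response can be sketched as follows, using the public GroundingDINO inference helpers (which the study's framework builds on). The config/checkpoint paths, thresholds, prompt, and Gaussian-blur masking below are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch: zero-shot, prompt-driven de-identification with GroundingDINO.
# Paths, thresholds, prompt, and the blur operator are illustrative assumptions.
import cv2
import torch
from groundingdino.util.inference import load_model, load_image, predict

model = load_model("GroundingDINO_SwinT_OGC.py", "groundingdino_swint_ogc.pth")
image_source, image = load_image("frame.jpg")   # image_source is an RGB array

# The natural-language prompt selects what to de-identify.
boxes, logits, phrases = predict(
    model=model, image=image,
    caption="face . license plate",
    box_threshold=0.35, text_threshold=0.25,
)

h, w, _ = image_source.shape
out = image_source.copy()
for box in boxes:                                # normalized (cx, cy, w, h)
    cx, cy, bw, bh = (box * torch.tensor([w, h, w, h])).tolist()
    x0, y0 = max(int(cx - bw / 2), 0), max(int(cy - bh / 2), 0)
    x1, y1 = int(cx + bw / 2), int(cy + bh / 2)
    roi = out[y0:y1, x0:x1]
    if roi.size:                                 # blur stands in for any mask
        out[y0:y1, x0:x1] = cv2.GaussianBlur(roi, (51, 51), 0)

cv2.imwrite("frame_deid.jpg", cv2.cvtColor(out, cv2.COLOR_RGB2BGR))
```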

(Comments 2) :
The novelty of the proposed method is not clearly explained. While selective de-identification across modalities is promising, the manuscript should delineate how it advances beyond existing multimodal frameworks (e.g., those using OCR for text-in-image extraction or unified anonymization pipelines) by quantifying improvements in precision-recall trade-offs for selective vs. full redaction.

(Response 2) :
Thank you for pointing this out. 
The Multimodal-based Selective De-Identification Framework presented in this study introduces several key advancements beyond existing approaches:
1. Prompt-Driven Selective De-Identification Across Modalities
Unlike conventional de-identification methods that apply blanket redaction or anonymization, our framework enables fine-grained, user-guided de-identification through natural language prompts. This allows selective targeting of sensitive entities (e.g., faces, license plates, medical terms) across image, video, and text modalities, offering greater flexibility and control.
2. Integration of Referring Expression Detection for Visual Targeting
We incorporate referring expression comprehension (REC) to identify and de-identify specific objects in visual scenes based on textual descriptions. This goes beyond OCR-based text-in-image extraction or rule-based pipelines by enabling context-aware, semantic-level targeting of visual elements.
3. Multimodal Fusion for Privacy-Aware Decision Making
We leverage multimodal fusion techniques to jointly analyze structured text, visual content, and metadata, enabling more accurate identification. This holistic approach improves de-identification precision while minimizing unnecessary information loss.
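To make the selective-vs-blanket distinction in point 1 concrete, a minimal, framework-agnostic illustration (our sketch, not the paper's code) might look like:

```python
# Illustration of the "selective" step: only entities named in the user's
# prompt are masked, whereas blanket redaction would mask every detection.
from dataclasses import dataclass

@dataclass
class Detection:
    phrase: str        # e.g. "face", "license plate", "brand logo"
    box: tuple         # (x0, y0, x1, y1) in pixels

def select_targets(detections, prompt_entities):
    """Keep only detections whose phrase matches a prompt-named entity."""
    wanted = {e.strip().lower() for e in prompt_entities}
    return [d for d in detections if d.phrase.lower() in wanted]

detections = [
    Detection("face", (40, 30, 120, 140)),
    Detection("license plate", (300, 400, 420, 440)),
    Detection("tree", (0, 0, 200, 500)),
]
targets = select_targets(detections, ["face", "license plate"])
assert [d.phrase for d in targets] == ["face", "license plate"]
```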

Chapter 4 presents the Experiments, and Section 4.1, Results and Discussion, compares traditional Vision AI-based selective de-identification methods with the proposed multimodal approach. Through this comparison, the section highlights the originality and distinctiveness of our proposed method. [Pages 11, 12, 13]

The manuscript thoroughly describes these advancements, presents the proposed methodology in detail, and summarizes the key contributions in Chapter 5 (Conclusions). [page 15]


(Comments 3) :
The computational efficiency analysis is insufficient; include runtime benchmarks for large-scale datasets and discuss deployment feasibility in resource-constrained environments like edge devices.

(Response 3) :
Thank you for pointing this out. However, we respectfully disagree with this comment. Real-time performance is not a critical factor in de-identification tasks.
Instead, we believe that considerations such as operating system environments, control center infrastructure, business models, and concurrent access scenarios are more relevant in determining the most efficient interface design. Accordingly, our study places greater emphasis on framework integration and system interoperability.
In practice, the municipal CCTV control centers that support our research operate over 5,000 connected cameras and manage more than one month of stored footage through a Video Management System (VMS). Their primary concern is not processing speed, but rather the ability to accurately de-identify subjects when police or citizens request specific footage. The priority is to locate the relevant time segment and ensure precise de-identification of the requested targets.
Our proposed framework follows a process in which accurate de-identification is performed first, followed by data reconstruction and distribution through an integrated interface. We have also implemented a prototype system based on this framework, confirming its flexible compatibility with existing operational environments.
Therefore, in the context of this study, we believe that runtime benchmarking is not an essential consideration.


(Comments 4) :
There are multiple grammar and stylistic errors throughout the text. The authors need to thoroughly proofread the manuscript. For example, sentences like “the framework achieve better results” should read “the framework achieves better results” (subject-verb agreement). In another instance, phrases such as “Figure (1) is showing” should be corrected to “Figure 1 shows…”. There are also run-on sentences and inconsistent tense usage. For instance, the manuscript often omits articles (“propose framework” → “propose a framework”). Please review the entire text and fix these issues to improve readability.

(Response 4) :
Thank you for pointing this out. I agree with this comment. As noted in our response to [Comments on the Quality of English Language], I have thoroughly revised the manuscript by addressing the issues pointed out.


(Comments 5) :
More recent works should be discussed. Please read the following related work and discuss it in your work: Multiview attention network with random interpolation resize for few-shot surface defect detection.

(Response 5) :
Thank you for pointing this out. However, the paper titled “Enhanced Multiview Attention Network with Random Interpolation Resize for Few-Shot Surface Defect Detection” introduces a novel model, ERNet, designed to detect surface defects effectively in industrial settings using limited training data under few-shot learning conditions.
While the approach is technically valuable, its relevance to our work is limited. Our study focuses on multimodal-based selective de-identification, which aims to anonymize personally identifiable information (PII) in images and videos through prompt-driven targeting. This involves privacy-preserving techniques for sensitive data, particularly in healthcare and surveillance contexts.
Given the distinct objectives and application domains, we believe this paper is not directly aligned with the scope of our research and may not be suitable for inclusion as a recent related work in our manuscript.


[Comments on the Quality of English Language]
(Comments 6) :
There are multiple grammar and stylistic errors throughout the text. The authors need to thoroughly proofread the manuscript. For example, sentences like “the framework achieve better results” should read “the framework achieves better results” (subject-verb agreement). In another instance, phrases such as “Figure (1) is showing” should be corrected to “Figure 1 shows…”. There are also run-on sentences and inconsistent tense usage. For instance, the manuscript often omits articles (“propose framework” → “propose a framework”). Please review the entire text and fix these issues to improve readability.

(Response 6) :
[page 1] Selected -> Selective
[page 3] a multimodal selective de-identification framework->a multimodal-based selective de-identification framework
[page 3] A typical object detection-based de-identification pipeline is illustrated in Figure 1. -> Figure 1 shows a typical object detection-based de-identification pipeline.
[page 4] The process of selective de-identification using face recognition is illustrated in Figure 2 -> Figure 2 shows the process of selective de-identification using face recognition.
[page 6] Figure 4 illustrates the selective de-identification -> Figure 4 shows the selective de-identification
[page 6] Figure 6 illustrates the expansion of object detection toward multimodal integration. -> Figure 6 shows the expansion of object detection toward multimodal integration.
[page 7] Figure 7 illustrates the GroundingDINO framework. -> Figure 7 shows the GroundingDINO framework.
[page 8] The final decoder layer outputs queries that are used to predict bounding boxes and extract the corresponding textual phrases. -> The final decoder layer outputs queries, which are then used to predict bounding boxes and extract corresponding textual phrases.
[page 9] Figure 8 illustrates the de-identification result ... -> Figure 8 shows the de-identification result ...
[page 10] Figure 9 illustrates a de-identification result ... -> Figure 9 shows the de-identification result ...
[page 11] This figure illustrates an example of applying .. -> This figure shows an example of applying
[page 14] Figure 11 illustrates the emulator program ... -> Figure 11 shows the emulator program 
[page 15] the proposed framework does not rely on predefined classes, offering robust natural language-based detection capabilities -> The proposed framework does not rely on predefined classes and offers robust detection capabilities based on natural language understanding.
 

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

This paper claims to employ a "multimodal" approach, but in reality, it only fuses image and text information, and the fusion mechanism relies on a pretrained GroundingDINO model. The alignment quality of the learned multimodal representations is not thoroughly analyzed. Have the authors considered introducing a cross-modal attention mechanism or a semantic alignment loss function to enhance fine-grained understanding of image-text interactions?

This research is currently only at the implementation stage, and it can hardly be called a complete study. Experimental evaluation, whether case studies or scientifically conducted specific evaluations, is a key component of academic journal content. The authors must first improve the research content as a prerequisite for submitting it to reviewers.

Secondly, Section 2.2, "Challenges and Research Directions," is somewhat abrupt and should be incorporated into the introduction or the beginning of the methodology to enhance logical coherence. Furthermore, the figure numbers and references are unclear in some areas of the paper, such as Figure 7 and Table 3. The authors should improve the alignment between the figures and the text. Most importantly, this study currently lacks comparative experiments with existing methods, such as the traditional YOLO+ face recognition pipeline, making it difficult to highlight the proposed method's superiority in terms of selectivity. Furthermore, the experimental section fails to mention computational efficiency, such as inference speed and memory usage, which are crucial in practical deployments. Furthermore, the authors need to further discuss the model's robustness to different linguistic cues, such as its ability to handle ambiguous or complex descriptions. Thanks. 

Author Response

[Response to Reviewer Comments]
I sincerely thank the reviewer for the detailed and constructive feedback. Below, we address each point raised:


(Comments 1) :
This paper claims to employ a "multimodal" approach, but in reality, it only fuses image and text information, and the fusion mechanism relies on a pretrained GroundingDINO model. The alignment quality of the learned multimodal representations is not thoroughly analyzed. Have the authors considered introducing a cross-modal attention mechanism or a semantic alignment loss function to enhance fine-grained understanding of image-text interactions?

(Response 1) :
Thank you for pointing this out. While our current implementation primarily fuses image and text modalities using a pretrained GroundingDINO model, our framework is designed with extensibility in mind. We acknowledge that the alignment quality of multimodal representations requires further analysis. As suggested, we are actively exploring the integration of cross-modal attention mechanisms and semantic alignment loss functions to enhance fine-grained image-text interactions. These enhancements will be included in future iterations of the study. 
The core contribution of this paper lies in its departure from traditional de-identification approaches, which primarily focus on blanket removal of sensitive information. In contrast, our proposed framework enables selective de-identification through natural language prompts, allowing users to specify which sensitive entities—such as faces, license plates, or medical terms—should be anonymized. This design supports flexible and fine-grained control across multiple modalities, including images, videos, and text.
Furthermore, our study incorporates referring expression comprehension (REC) to identify and de-identify specific objects within visual scenes based on textual descriptions. This goes beyond conventional OCR-based text extraction or rule-based pipelines, enabling context-aware and semantically driven visual targeting.
This novel approach leverages multimodal fusion of textual, visual, and metadata inputs to improve the accuracy of identification while minimizing unnecessary information loss. Compared to single-modality methods, our framework offers more precise and intelligent de-identification capabilities.
Additionally, in Section 4.1, Results and Discussion [pages 12–13], we compared conventional selective de-identification methods with the proposed multimodal-based approach, highlighting its innovation and superiority.
We kindly ask that these aspects be recognized as key innovations of our work.


(Comments 2) :
This research is currently only at the implementation stage, and it can hardly be called a complete study. Experimental evaluation, whether case studies or scientifically conducted specific evaluations, is a key component of academic journal content. The authors must first improve the research content as a prerequisite for submitting it to reviewers.

(Response 2) :
We fully agree that experimental validation is a critical component of academic research. The current manuscript is structured around a proof-of-concept implementation, and we have submitted it under the Methodology Article category to reflect its exploratory nature.
For scientific and quantitative evaluation, we employed GroundingDINO, a representative open-set detector, to perform zero-shot and referring-based selective de-identification. While it is certainly possible to conduct comparative studies using other open-set detectors such as Qwen2.5VL or Molmo, which are VLM-based object detection models, or even to design a new network architecture specifically for de-identification, we believe that conceptual validation must precede such developments to ensure meaningful progress in this field.
Until now, most de-identification research has focused on Vision AI-based approaches. In contrast, our study proposes a novel multimodal selective de-identification framework that leverages prompt-driven interactions, aligning with the current era of generative AI. This direction represents a creative and original approach that, to our knowledge, has not been explored in prior work.
We kindly ask that this innovative perspective be recognized. 
Additionally, in Section 4.1, Results and Discussion [pages 12–13], we compared conventional selective de-identification methods with the proposed multimodal-based approach, highlighting its innovation and superiority.
These improvements are already being prepared for subsequent studies.


(Comments 3) :
Secondly, Section 2.2, "Challenges and Research Directions," is somewhat abrupt and should be incorporated into the introduction or the beginning of the methodology to enhance logical coherence. 

(Response 3) :
Thank you for pointing this out. I agree with this comment regarding Section 2.2. To improve logical flow, we will relocate the “Challenges and Research Directions” section to the Introduction chapter. This restructuring will help readers better understand the motivation and context of the proposed framework. [page 2]


(Comments 4) :
Furthermore, the figure numbers and references are unclear in some areas of the paper, such as Figure 7 and Table 3. The authors should improve the alignment between the figures and the text. 

(Response 4) :
Thank you for pointing this out. I will carefully review and revise all figure and table references, including Figure 7 and Table 3, to ensure clarity and proper alignment with the text. [pages 7 and 9]


(Comments 5) :
Most importantly, this study currently lacks comparative experiments with existing methods, such as the traditional YOLO+ face recognition pipeline, making it difficult to highlight the proposed method's superiority in terms of selectivity. 

(Response 5) :
Thank you for pointing this out. In Section 4.1, Results and Discussion [pages 12–13], we compared conventional selective de-identification methods with the proposed multimodal-based approach, highlighting its innovation and superiority.


(Comments 6) :
Furthermore, the experimental section fails to mention computational efficiency, such as inference speed and memory usage, which are crucial in practical deployments. 

(Response 6) :
Thank you for pointing this out. However, we respectfully disagree with this comment. Real-time performance is not a critical factor in de-identification tasks.
Instead, we believe that considerations such as operating system environments, control center infrastructure, business models, and concurrent access scenarios are more relevant in determining the most efficient interface design. Accordingly, our study places greater emphasis on framework integration and system interoperability.
In practice, the municipal CCTV control centers that support our research operate over 5,000 connected cameras and manage more than one month of stored footage through a Video Management System (VMS). Their primary concern is not processing speed, but rather the ability to accurately de-identify subjects when police or citizens request specific footage. The priority is to locate the relevant time segment and ensure precise de-identification of the requested targets.
Our proposed framework follows a process in which accurate de-identification is performed first, followed by data reconstruction and distribution through an integrated interface. We have also implemented a prototype system based on this framework, confirming its flexible compatibility with existing operational environments.
Therefore, in the context of this study, we believe that runtime or resource benchmarking is not an essential consideration.


(Comments 7) :
Furthermore, the authors need to further discuss the model's robustness to different linguistic cues, such as its ability to handle ambiguous or complex descriptions. Thanks. 

(Response 7) :
Thank you for pointing this out. The observation regarding the model's ability to handle ambiguous or complex text prompts is a highly important perspective. From the standpoint of researching prompt-based de-identification, this is not a topic that can be addressed in a single study, but rather one that requires ongoing investigation. Therefore, we plan to continue exploring this area in future research and have presented it as a future research direction in Section 5 (Conclusions). [page 15]

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

Dear authors,

Your paper presents interesting research, but it is unclear and confusing in its current form. Here are some comments that will help you improve the manuscript:

  1. The paper lacks an easy-to-follow structure and clear organization. From the paper, it is not clear whether it is a review paper or a research paper. There is no Results section, the Methodology section is mixed with related work and comparison results, etc. Hence, the entire paper should be rewritten and reorganized.
  2. From your paper, it is not clear what your novelty is or how your proposed methodology compares with the SOTA.
  3. You need to improve the review of the SOTA and analyze more SOTA papers. The relevant ones should then be included, addressed, and compared with your idea. You should relate your idea to recent SOTA references to establish its motivation and niche.
  4. In lines 123-124 you stated: "A Faces Database is constructed for N participants who have provided consent for de-identification integration." This is not clear and should be rephrased. Fig. 2 suggests that this database is for all participants. If it is optional, changes to Fig. 2 should be made.
  5. Why identify faces and then de-identify them? Would it not be easier to blur face pixels at the input stage, as a form of pre-filtering?
  6. Please clearly state the scientific and other contributions of the paper at the end of the Introduction, before the organization of the paper.
  7. The paper needs more details, or publicly available code, in order to be reproduced independently.
  8. Figure and table captions should start with capital letter. E.g. "Figure 2. selective" should be "Figure 2. Selective", etc.

Therefore, I cannot recommend publication of the paper in this form.

Kind regards

Author Response

[Response to Reviewer Comments]
I sincerely thank the reviewer for the detailed and constructive feedback. Below, we address each point raised:

(Comments 1) :
The paper lacks an easy-to-follow structure and clear organization. From the paper, it is not clear whether it is a review paper or a research paper. There is no Results section, the Methodology section is mixed with related work and comparison results, etc. Hence, the entire paper should be rewritten and reorganized.
From your paper, it is not clear what your novelty is or how your proposed methodology compares with the SOTA.
You need to improve the review of the SOTA and analyze more SOTA papers. The relevant ones should then be included, addressed, and compared with your idea. You should relate your idea to recent SOTA references to establish its motivation and niche.

(Response 1) :
Thank you for pointing this out. Chapter 4 presents the Experiments, and Section 4.1, Results and Discussion, compares traditional Vision AI-based selective de-identification methods with the proposed multimodal approach. Through this comparison, the section highlights the originality and distinctiveness of our proposed method. [Pages 11, 12, 13]


(Comments 2) :
In lines 123-124 you stated: "A Faces Database is constructed for N participants who have provided consent for de-identification integration." This is not clear and should be rephrased. Fig. 2 suggests that this database is for all participants. If it is optional, changes to Fig. 2 should be made.

(Response 2) :
Thank you for pointing this out. Your observation is correct: the database is constructed by extracting feature vectors from the faces of all participants. We have revised the corresponding statement to clarify that it includes all participants. [page 4]
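As a minimal illustration of such a faces database, unit-normalized embeddings could be stored for all participants and matched by cosine similarity; the embedding model is left abstract here, and the names and the 0.6 threshold are illustrative assumptions rather than the paper's implementation.

```python
# Hedged sketch of the Faces Database idea: embeddings for all participants,
# cosine-similarity matching at query time. Threshold and names are assumed.
import numpy as np

def build_faces_db(participant_faces, embed):
    """participant_faces: {participant_id: face_image}; embed: image -> vector."""
    db = {}
    for pid, img in participant_faces.items():
        v = np.asarray(embed(img), dtype=np.float64)
        db[pid] = v / np.linalg.norm(v)          # store unit vectors
    return db

def match(face_vec, faces_db, threshold=0.6):
    """Best-matching participant id, or None if no similarity exceeds threshold."""
    v = np.asarray(face_vec, dtype=np.float64)
    v = v / np.linalg.norm(v)
    best_id, best_sim = None, threshold
    for pid, ref in faces_db.items():
        sim = float(v @ ref)                     # cosine similarity (unit vectors)
        if sim > best_sim:
            best_id, best_sim = pid, sim
    return best_id

# Selective use: keep the requested identity visible and de-identify the rest,
# e.g. blur a detected face whenever match(vec, db) != requested_id.
```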


(Comments 3) :
Why identify faces and then de-identify them? Would it not be easier to blur face pixels at the input stage, as a form of pre-filtering?

(Response 3) :
Thank you for your comment. However, we respectfully disagree with this point. Selective de-identification is a post-processing technology essential for secure data distribution and deployment. In practice, it is a critical component in CCTV monitoring centers, where images must be selectively de-identified before external release. When police or citizens request specific video segments, only the relevant targets should remain identifiable, while all other personal information must be de-identified to comply with privacy regulations.
Moreover, this technology is not limited to privacy protection—it has broader applicability across commercial domains, such as advertising. For instance, in contractual advertising scenarios, only authorized brand logos may be exposed, while others must be masked. This demonstrates the scalability and versatility of selective de-identification beyond personal data.
Therefore, target-specific identification and de-identification must be supported to ensure flexible and context-aware deployment. Our framework addresses this need by enabling precise control over which objects are de-identified, making it suitable for real-world applications across both public safety and commercial use cases.
 

(Comments 4) :
Please clearly state the scientific and other contributions of the paper at the end of the Introduction, before the organization of the paper.

(Response 4) :
Thank you for pointing this out.  Scientific and other contributions have been added at the end of the Introduction section [page 3].
"In addition to its technical contributions, this study offers a novel multimodal frame-work that integrates natural language understanding with visual reasoning, enabling fine-grained and context-aware de-identification across diverse modalities. The pro-posed framework architecture is designed for real-world deployment, supporting flex-ible integration with existing video management environments. Furthermore, the re-search provides a reproducible prototype implementation and outlines practical con-siderations for interface design, system scalability, and privacy compliance—thereby contributing both scientifically and operationally to the advancement of intelligent privacy-preserving technologies."


(Comments 5) :
The paper needs more details, or publicly available code, in order to be reproduced independently.

(Response 5) :
Thank you for pointing this out. The relevant content has been added to the end of Section 4.2, Framework and Implementation [page 14].
"To ensure that the proposed framework can be repeated independently, we have documented all implementation details with sufficient granularity. This includes system specifications, interface integration methods, model configurations, and dataset usage. The emulator setup, communication protocols, and deployment environment are described in a way that allows other researchers to replicate the system without relying on undocumented assumptions. This commitment to reproducibility not only strengthens the scientific validity of our work but also promotes broader adoption and extension of multimodal selective de-identification technologies in real-world applications."


(Comments 6) :
Figure and table captions should start with capital letter. E.g. "Figure 2. selective" should be "Figure 2. Selective", etc.

(Response 6) :
Thank you for pointing this out. We have applied initial capitalization to all figure and table captions.

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

The paper presents an interesting multimodal de-identification framework, demonstrating innovation, but some aspects need improvement. The section "Challenges and Research Directions" in Section 2, e.g., the discussion of multimodal technology trends at the end of page 2, feels somewhat out of place; the authors should integrate this into the introduction or the beginning of the methods section to enhance logical coherence and facilitate a smoother transition from the problem background to the proposed method.  Furthermore, the relationship between figures/tables and the text is unclear. For example, the description of the GroundingDINO framework in Figure 7 and the evaluation results in Table 3 are too brief and not sufficiently integrated with the method description.  The authors should add more explanation and analysis of these figures/tables in the corresponding paragraphs to improve readability. More importantly, while the authors compare their method with traditional approaches, they lack a comparison with other multimodal detection models, such as RegionCLIP and OV-DETR, hindering a comprehensive demonstration of the proposed method's advantages. Additionally, the paper omits metrics related to computational efficiency, such as inference speed and memory usage, which are crucial for real-world deployment. Although the authors emphasize that "real-time performance is not critical," as an academic paper, it should at least provide basic efficiency data for readers to assess scalability. Furthermore, the method relies on text prompts, but it lacks testing and discussion of its ability to handle complex, ambiguous, or multi-meaning language. For example, how robust is the method to complex prompts like "person wearing red shorts"? The discussion should include analysis of prompt sensitivity and error cases. Thanks. 

Author Response

(Comment 1):
The paper presents an interesting multimodal de-identification framework, demonstrating innovation, but some aspects need improvement. The section "Challenges and Research Directions" in Section 2, e.g., the discussion of multimodal technology trends at the end of page 2, feels somewhat out of place; the authors should integrate this into the introduction or the beginning of the methods section to enhance logical coherence and facilitate a smoother transition from the problem background to the proposed method.  

(Response 1):
Thank you for your valuable comment. We agree that the discussion on multimodal technology trends at the end of page 2 may disrupt the logical flow of the manuscript. To address this, we have revised the content accordingly, ensuring a smoother transition from the problem background to the proposed method for the reader. [Page 2]


(Comment 2):
Furthermore, the relationship between figures/tables and the text is unclear. For example, the description of the GroundingDINO framework in Figure 7 and the evaluation results in Table 3 are too brief and not sufficiently integrated with the method description.  The authors should add more explanation and analysis of these figures/tables in the corresponding paragraphs to improve readability. 

(Response 2):
Thank you for your valuable feedback. We acknowledge that the descriptions of Figure 7 and Table 3 were too brief and lacked integration with the method explanation. In the revised manuscript, we have expanded the corresponding paragraphs in Sections 2.2 and 3.1.2 to provide a more detailed analysis of the GroundingDINO framework and the evaluation results. These additions clarify the role of each component and how they contribute to the selective de-identification process, thereby improving readability and alignment between figures/tables and the text. [Pages 8, 9, 10, 11]


(Comment 3):
More importantly, while the authors compare their method with traditional approaches, they lack a comparison with other multimodal detection models, such as RegionCLIP and OV-DETR, hindering a comprehensive demonstration of the proposed method's advantages. 

(Response 3):
Thank you for your thoughtful comment. We fully agree that a comparative analysis with other multimodal detection models such as RegionCLIP and OV-DETR would strengthen the demonstration of our framework’s advantages. However, the primary aim of this study is not to benchmark multiple multimodal algorithms to identify the most optimized model for de-identification. Instead, our focus is on proposing a novel direction—transitioning from traditional Vision AI-based de-identification to a prompt-driven multimodal framework powered by generative AI technologies.
Our intention is to explore how natural language understanding and visual reasoning can be integrated to enable flexible, context-aware selective de-identification. While we acknowledge the value of comparative performance analysis, we consider it a meaningful extension that we plan to address in future work. We kindly ask for your understanding regarding this scope, and we hope this clarification helps better convey the intended contribution and direction of our research.


(Comment 4):
Additionally, the paper omits metrics related to computational efficiency, such as inference speed and memory usage, which are crucial for real-world deployment. Although the authors emphasize that "real-time performance is not critical," as an academic paper, it should at least provide basic efficiency data for readers to assess scalability. 

(Response 4):
Thank you for your valuable comment. We agree that providing computational efficiency metrics such as inference speed is essential for evaluating the scalability and practical applicability of the proposed framework. Although our study emphasizes that real-time performance is not a critical requirement for selective de-identification tasks, we recognize the importance of including baseline efficiency data for academic completeness.
To address this, we have added Table 6, which presents a comparative analysis of average inference speed across the five models used in our framework. The table includes YOLOv11 for general object detection, CenterFace for face detection, MobileNetV2 for feature extraction, EfficientNet for emblem classification, and GroundingDINO for zero-shot and referring object detection. The results show that while lightweight models such as MobileNetV2 and CenterFace achieve fast inference speeds (3.32 ms and 12.02 ms, respectively), GroundingDINO, due to its multimodal architecture, requires a higher processing time of 40.18 ms per frame at 1280×720 resolution.
These metrics provide readers with a clearer understanding of the trade-offs between model complexity and processing speed. Furthermore, we have documented the system specifications used for testing (Intel Xeon 2.59 GHz, 16 cores, 64 GB RAM, RTX5090) to ensure reproducibility and transparency.
[page 15, 16]
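For illustration, per-model average latencies of the kind reported in Table 6 can be measured with a simple harness like the sketch below. The harness itself and the commented model handles are assumptions; the 1280×720 input follows the response above.

```python
# Simple timing harness: average wall-clock inference latency per model.
import time
import numpy as np

def avg_inference_ms(infer_fn, frame, warmup=10, runs=100):
    """Average latency of infer_fn over `runs` calls, after warmup."""
    for _ in range(warmup):                      # stabilize caches / GPU clocks
        infer_fn(frame)
    start = time.perf_counter()
    for _ in range(runs):
        infer_fn(frame)
    return (time.perf_counter() - start) / runs * 1000.0

frame = np.zeros((720, 1280, 3), dtype=np.uint8)  # 1280x720 test frame
# Hypothetical handles, one per model row in Table 6:
# for name, fn in {"CenterFace": centerface_infer,
#                  "GroundingDINO": gdino_infer}.items():
#     print(name, round(avg_inference_ms(fn, frame), 2), "ms")
```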

(Comment 5):
Furthermore, the method relies on text prompts, but it lacks testing and discussion of its ability to handle complex, ambiguous, or multi-meaning language. For example, how robust is the method to complex prompts like "person wearing red shorts"? The discussion should include analysis of prompt sensitivity and error cases. Thanks. 

(Response 5):
Thank you for your insightful comment. Robustness to complex or ambiguous prompts is indeed a critical factor in prompt-based de-identification. In response, we have added a discussion on prompt sensitivity in Section 4.2, highlighting potential limitations and error scenarios. Furthermore, we propose future research directions aimed at improving semantic understanding and generalization capabilities for diverse prompt expressions. [pages 16, 17]

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

You mentioned "thereby contributing both scientifically and operationally to the advancement of intelligent privacy-preserving technologies." However, it is not explicitly stated (e.g., in bulleted form) what the actual scientific contributions are. Besides the scientific contributions, you should also state what the novelty is.

How is recognition of a car's registration plate or manufacturer emblem different from face recognition? There are several figures with cars. What is fundamentally different?

Section 3.1 describes previous related works. Hence, it fits better in Section 2.

It is not clear what the role of Section 3.2.2 is. Does it present results of the proposed framework? If so, they should be in the Results section (at least Table 3). GroundingDINO is known from the references; if this is not a novel result, then a reference is missing.

The same applies to Table 2.

Please clearly state, and add a figure of, the proposed method that is originally yours, not from the references. It is hard for readers to understand what you actually propose. What is your method, and does it have a name? This should be used to improve the Methodology section.

The paper in this form looks like a comparison of different methods without a new proposal. That could be valid, but it should be clearly stated in the abstract, results, and conclusions. In lines 402-403 you mention "our multimodal approach achieved an accuracy improvement of approximately average 3.7", but what "our" refers to is not clear from the table and text and should be highlighted. Furthermore, the paper's methodology should be built around "our approach".

The source of misunderstanding in this paper is the Methodology section. This section should be reorganized and clearly structured. It should contain the proposed framework, clearly presented and not mixed up with comparisons and other methods. Comparisons can instead find their place in the Related Works section or as a comparative analysis in the Results.

Author Response

(Comment 1):
You mentioned "thereby contributing both scientifically and operationally to the advancement of intelligent privacy-preserving technologies." However, it is not explicitly (e.g. in bulleted form) stated what are actual scientific contributions. Except scientific contributions, you should state what is novelty.

(Response 1):
Thank you for your valuable comment. To clarify our scientific contributions and novelty, we have revised the manuscript to explicitly list them in bullet form within the Introduction and Conclusion sections. The key scientific contributions include:
- Proposal of a prompt-driven multimodal selective de-identification framework integrating visual and textual modalities for real-world use.
- Demonstration of zero-shot and referring-based object grounding for fine-grained anonymization using natural language prompts.
- Comparative analysis of zero-shot and referring-based approaches across multiple datasets.
In terms of novelty, our work introduces a generative AI-based selective de-identification paradigm that moves beyond traditional Vision AI pipelines. This includes the use of language-guided object selection and multimodal reasoning for privacy-preserving tasks, which has not been previously explored in this context. [pages 2, 3]


(Comment 2):
How is recognition of car's registry or manufacturer sign different from face recognition? There are several figures with cars. What is fundamentally different?

(Response 2):
Thank you for your valuable comment. The fundamental differences in de-identification approaches based on the two recognition methods have been comparatively addressed and added to Section 2.1.3. [page 6]

(Comment 3):
Section 3.1. describes previous related works. Hence, it better fits to section 2.

(Response 3):
Thank you for the suggestion. We agree that Section 3.1, which reviews prior works, is more appropriate within the Related Works section. Accordingly, we have moved this content to Section 2 and reorganized the structure to improve clarity and logical flow. The Methodology section now focuses solely on the proposed framework and its components. [page 8, 9]


(Comment 4):
It is not clear what the role of Section 3.2.2 is. Does it present results of the proposed framework? If so, they should be in the Results section (at least Table 3). GroundingDINO is known from the references; if this is not a novel result, then a reference is missing.
The same applies to Table 2.

(Response 4):
Thank you for your comment. Tables 2 and 3 present the performance results of the proposed framework, which applies the outcomes of GroundingDINO's open-set object detection to selective de-identification. GroundingDINO is clearly cited in the references, allowing a clear distinction between existing models and the original contributions of this study. [pages 9, 10, 11]

(Comment 5):
Please clearly state, and add a figure of, the proposed method that is originally yours, not from the references. It is hard for readers to understand what you actually propose. What is your method, and does it have a name? This should be used to improve the Methodology section.
The paper in this form looks like a comparison of different methods without a new proposal. That could be valid, but it should be clearly stated in the abstract, results, and conclusions. In lines 402-403 you mention "our multimodal approach achieved an accuracy improvement of approximately average 3.7", but what "our" refers to is not clear from the table and text and should be highlighted. Furthermore, the paper's methodology should be built around "our approach".
The source of misunderstanding in this paper is the Methodology section. This section should be reorganized and clearly structured. It should contain the proposed framework, clearly presented and not mixed up with comparisons and other methods. Comparisons can instead find their place in the Related Works section or as a comparative analysis in the Results.

(Response 5):
We sincerely appreciate the reviewer’s insightful feedback regarding the clarity and structure of the proposed method. In response, we have made substantial revisions to the manuscript to clearly present and highlight our original contribution.
To address the reviewer’s concern, we have explicitly named our method as the Multimodal-based Selective De-identification (MSD) Framework. This framework is our original contribution and is designed to perform fine-grained, prompt-guided de-identification by integrating visual and textual modalities. The MSD framework moves beyond traditional rule-based or class-driven de-identification by leveraging natural language understanding and multimodal reasoning. And, we have added a new figure (Figure 11) that visually illustrates the architecture and data flow of the MSD Framework. This figure is entirely original and was created to help readers intuitively understand the structure and operation of our proposed method. The diagram includes key components such as the Frame Layer, Prompt-Guided Targeting, De-identification Model Layer, and Masking Layer, with a bottom-up data flow from video input to anonymized output.[page 11,12.,13]

 

Author Response File: Author Response.pdf

Round 3

Reviewer 3 Report

Comments and Suggestions for Authors

I hope your changes will help readers to easier follow your idea.
