Article
Peer-Review Record

Efficient Object-Related Scene Text Grouping Pipeline for Visual Scene Analysis in Large-Scale Investigative Data

Electronics 2026, 15(1), 12; https://doi.org/10.3390/electronics15010012 (registering DOI)
by Enrique Shinohara *, Jorge García, Luis Unzueta and Peter Leškovský
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3:
Submission received: 15 October 2025 / Revised: 10 December 2025 / Accepted: 16 December 2025 / Published: 19 December 2025
(This article belongs to the Special Issue Deep Learning-Based Scene Text Detection)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors
  1. The abstract should include quantitative performance metrics.
  2. The literature review in the Introduction should include recent works (from the past two years) on text grouping and discuss how the proposed method addresses the limitations of previous studies.
  3. The pipeline description (Fig. 1) is clear, but the three strategies (planar segmentation, multi-class instance segmentation, and promptable segmentation) lack detailed hyperparameters and implementation details. Please add pseudocode or a flowchart illustrating the bounding-box overlap computation and the grouping algorithm (e.g., the IoU threshold). If a zero-shot model (e.g., SAM) is used, please specify its version and prompt-engineering strategy. An ablation study must be included to justify the necessity of each strategy, and the computational complexity should be quantified (e.g., FPS on GPU).
  4. The customized dataset sampled from COCOTextV2 is interesting, but its scale appears preliminary (the dataset size is not specified).
  5. The comparative methods should be more comprehensive, including both recent and classical baselines.
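
To illustrate the kind of bounding-box overlap computation and grouping step that comment 3 asks the authors to document, here is a minimal Python sketch. It is not the authors' algorithm: the function names, the use of text-box coverage as the assignment score, and the 0.5 threshold are all assumptions made for illustration. Coverage (the fraction of a text box lying inside an object box) is used for assignment because a small text box inside a large object yields a low IoU; `iou` is included for comparison.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Sequence, Tuple

# Axis-aligned box as (x_min, y_min, x_max, y_max). Hypothetical convention.
Box = Tuple[float, float, float, float]


def area(box: Box) -> float:
    """Area of a box; degenerate or inverted boxes count as zero."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def intersection(a: Box, b: Box) -> float:
    """Area of the overlap between two boxes (zero if disjoint)."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    return area((x0, y0, x1, y1))


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes."""
    inter = intersection(a, b)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def coverage(text_box: Box, object_box: Box) -> float:
    """Fraction of the text box that lies inside the object box."""
    text_area = area(text_box)
    return intersection(text_box, object_box) / text_area if text_area > 0 else 0.0


def group_text_by_object(
    text_boxes: Sequence[Box],
    object_boxes: Sequence[Box],
    threshold: float = 0.5,  # assumed value, not taken from the paper
) -> Dict[Optional[int], List[int]]:
    """Assign each text box to the object that covers it best.

    Text boxes whose best coverage is below `threshold` are collected
    under the key None (no associated object).
    """
    groups: Dict[Optional[int], List[int]] = defaultdict(list)
    for t_idx, t_box in enumerate(text_boxes):
        best_obj, best_score = None, threshold
        for o_idx, o_box in enumerate(object_boxes):
            score = coverage(t_box, o_box)
            if score >= best_score:
                best_obj, best_score = o_idx, score
        groups[best_obj].append(t_idx)
    return dict(groups)


objects = [(0, 0, 100, 100), (200, 0, 300, 100)]
texts = [(10, 10, 40, 20), (50, 50, 90, 60), (210, 10, 250, 30), (400, 400, 420, 410)]
groups = group_text_by_object(texts, objects)
```

In this toy input the first two text boxes fall inside the first object, the third inside the second object, and the last one overlaps nothing, so it stays ungrouped. A sensitivity analysis of the kind requested elsewhere in this record would repeat the grouping over a range of `threshold` values.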

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

This work is intriguing and offers novel insights; however, it would benefit from addressing a few notable limitations.

It is good that the authors clearly describe the three distinct clustering strategies. However, they should provide parameter sensitivity analysis (e.g., IoU threshold) to demonstrate the stability and robustness of their methods.

The authors present a thoughtful comparison between geometric, semantic, and instantaneous-based strategies, but they should increase the dataset size to ensure statistical reliability and improve the generalizability of their results.

It is commendable that the authors adopt a class-agnostic SAM2 approach, yet they should better highlight its computational cost and memory footprint as a trade-off between accuracy and performance.

The authors effectively use multiple clustering metrics, although it would be beneficial if they justified the choice of these metrics and explained how each reflects different aspects of clustering performance.
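
To make concrete how different clustering metrics capture different aspects of grouping quality, here is a small, self-contained Python sketch. This record does not name the metrics used in the paper, so pairwise precision, pairwise recall, and the Rand index are assumed here purely as examples, and the function name is hypothetical. Pairwise precision penalizes merging text from different objects, pairwise recall penalizes splitting text from the same object, and the Rand index summarizes overall pair agreement.

```python
from itertools import combinations
from typing import Dict, Hashable, Sequence


def pairwise_scores(
    true_labels: Sequence[Hashable], pred_labels: Sequence[Hashable]
) -> Dict[str, float]:
    """Pair-counting clustering scores for two labelings of the same items."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("label sequences must have the same length")

    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_true and same_pred:
            tp += 1  # correctly grouped together
        elif same_pred:
            fp += 1  # wrongly merged (over-grouping)
        elif same_true:
            fn += 1  # wrongly split (under-grouping)
        else:
            tn += 1  # correctly kept apart

    total = tp + fp + fn + tn
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "rand_index": (tp + tn) / total if total else 1.0,
    }


# Five text instances: ground truth has two objects, the prediction
# moves the third instance into the wrong group.
scores = pairwise_scores([0, 0, 0, 1, 1], [0, 0, 1, 1, 1])
```

For this toy example the single misassigned instance produces two wrongly merged pairs and two wrongly split pairs, so precision and recall both drop to 0.5 while the Rand index stays at 0.6, which shows why reporting a single aggregate score can hide the type of error.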

The authors correctly separate the detection and grouping stages, but they should include an end-to-end evaluation to show how detection errors propagate through the pipeline in realistic settings.

It is good that the authors use models trained on COCO data. However, they should investigate domain shift effects by testing on unseen or specialized datasets to validate the robustness of the model in real-world applications.

The authors report runtime comparisons, which are useful, but they should provide more details on hardware configurations, inference settings, and throughput variations for reproducibility.

It is appreciated that the authors include quality visualizations, although they should provide more detailed error analysis to classify common failure cases such as obstruction, shadowing, or text overlap.

The authors acknowledge that text words were not integrated into the grouping process. It would be good if they discussed how incorporating semantic or linguistic cues could increase the accuracy of grouping.

It is good that the authors highlight the relevance of their work to law enforcement applications, but they should also briefly address ethical implications and data privacy concerns in their conclusion.

Comments on the Quality of English Language

Satisfactory

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

Main Question and Relevance

The research addresses how to efficiently and accurately associate detected text instances with specific, related objects within the vast collections of visual data used by Law Enforcement Agencies (LEAs). The goal is to develop a modular pipeline for contextual scene text grouping, a significant bottleneck in automated scene analysis.

This topic is highly relevant and original in its focused application. It addresses a critical gap because prior works primarily cluster text based on semantic or layout relationships, ignoring the crucial physical association between text and the objects they are written on (e.g., a logo on a product).

Contribution and Consistency of Conclusions

The paper’s key contribution is the proposal of a modular, efficient pipeline that uses three complementary strategies for robust object-related text grouping as indicated in Section 3 (Materials and Methods):

  1. 2D Planar Segmentation: Grouping text that shares the same flat surface.
  2. Multi-Class Instance Segmentation: Associating text with known object types (vehicles, banners, etc.).
  3. Prompt-Based Segmentation: Using flexible, prompt-based methods for complex or novel scenes.

This multi-strategy approach offers a more robust solution for contextualizing scene text in investigative datasets than single-method approaches. The conclusions are consistent with the arguments, as the paper demonstrates that contextual grouping is essential for improving data utility, and the proposed pipeline directly achieves this goal by combining cutting-edge computer vision techniques.

References and Figures

The references appear appropriate, citing foundational works in deep learning, object detection, and segmentation, while also including relevant works on scene text detection. This validates the claim that text-to-object association is a relatively unexplored area.

Regarding figures and tables, they are properly presented with clear captions and consistent numbering. The visual quality and layout are appropriate for publication, and the figures effectively complement the textual explanations. Figure 1 clearly outlines the methodology.

Recommended Minor Improvements:

  1. In the abstract, the authors state, “The testing dataset will be made publicly available upon acceptance.” It is strongly recommended to revise this to “The testing dataset will be made available upon request.”
  2. In Section 4.1 (Dataset), the authors mention using only 123 images, partially extracted from the COCO-Text dataset, selecting challenging and representative scenarios containing text in the wild. Please provide further justification for limiting the dataset to 123 images and clarify the specific criteria used for image selection.
  3. In Section 4.2 (Experimentation Setup), it is highly recommended that the authors describe the hardware and software configurations used in the experiments to enhance reproducibility.
  4. Quantitative Performance Benchmarking: The paper would benefit from a direct comparison between the proposed full, fused pipeline and state-of-the-art text clustering methods that do not incorporate object association. This comparison would more clearly demonstrate the added value of the contextual approach in terms of accuracy and investigative effectiveness.

Overall, the paper addresses an issue concerning the use of visual data by Law Enforcement Agencies (LEAs) and proposes a practical and feasible solution. The paper is recommended for publication in the journal after minor revisions.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

After careful checking, I find that the revisions did not improve the manuscript: key information is still missing, and the responses and improvements do not convince me.

Author Response

Thank you very much for the follow-up. We have re-examined the manuscript to address these reproducibility and methodology concerns. On reviewing your comment, we identified that our previous description of the pipeline likely caused a misunderstanding about how the strategies are applied.


The previous manuscript may have implied that the pipeline dynamically selects a strategy during runtime through an algorithm that determines whether the input image requires one of the three strategies. We have clarified the pipeline in Figure 3 of Section 3 by specifying that the “Chosen Strategy” is not determined programmatically but is rather an operational choice made by the end user (the Law Enforcement Agent). The purpose of this paper is to provide the comparative study that informs this choice.

To address the lack of technical detail needed for reproducibility, we have added a dedicated Algorithm 2 in Section 3.3 for the prompt engineering used in the third strategy, and Figure 3 now specifies which algorithm is used by each strategy. We have also updated the descriptions of all three strategies to state that we use the official pre-trained checkpoints and keep the default inference hyperparameters provided in their original reference implementations.

Reviewer 2 Report

Comments and Suggestions for Authors

Thanks to all the authors. I found the manuscript well revised, and it addresses all the reviewers' comments. I therefore have no further suggestions. However, a thorough proofread is required.

 

Comments on the Quality of English Language

Satisfactory

Author Response

Thank you for your review and positive feedback. We have thoroughly proofread the latest version of the manuscript.

Round 3

Reviewer 1 Report

Comments and Suggestions for Authors

First, the authors did not design any mechanism for strategy selection; instead, they merely integrated several methods and left the comparison and selection to be performed manually. Such an integration does not appear to solve any concrete problem and thus seems to be an action without meaningful purpose. Moreover, the method section contains lengthy textual descriptions with insufficient technical detail. The description of the methodology should combine text with formulas in order to clearly present the problem-solving process and the workflow. This is the most critical issue. Overall, the paper shows no innovation and lacks substantive contribution.

Author Response

Comment: First, the authors did not design any mechanism for strategy selection; instead, they merely integrated several methods and left the comparison and selection to be performed manually. Such an integration does not appear to solve any concrete problem and thus seems to be an action without meaningful purpose. Moreover, the method section contains lengthy textual descriptions with insufficient technical detail. The description of the methodology should combine text with formulas in order to clearly present the problem-solving process and the workflow. This is the most critical issue. Overall, the paper shows no innovation and lacks substantive contribution.

Response: We appreciate the comment and respectfully wish to clarify the specific problem we attempt to solve and the purpose of the proposed pipeline. In the context of police investigations, agents must analyse large collections of media files, such as CCTV, seized devices, and surveillance footage. Standard OCR tools usually extract text as a 'bag of words', with no context relating one text instance to another. If an investigator must manually verify every text instance, e.g., distinguishing a phone number painted on a vehicle from a street address on a wall, the tool loses its practical utility. The problem this paper attempts to solve is the automatic association of text instances based on the physical object they share. This generates richer data without the need for human intervention on every media file and therefore allows the processing of vast quantities of media data that would otherwise remain unprocessed.

With this context in mind, we argue that in forensic scenarios, 'black box' automation is considered a liability. Agents need to understand, validate, and justify the tools they use, and different investigative cases bring different constraints. Scenarios containing flat surfaces with heavy text, such as graffiti or banners, benefit from the geometric approach of Strategy 1, whereas real-time monitoring of common assets like vehicles requires the efficiency of Strategy 2 (Instance Segmentation). Deep forensic analysis requiring the association of text with unseen objects justifies the higher computational cost of Strategy 3 (Promptable Class-Agnostic Segmentation). Therefore, rather than treating the selection mechanism as a missing component, we consider a human-in-the-loop approach necessary for agents to make evidence-based choices.

To address the concerns about methodological detail, we have extended the explanation of the implementation of each strategy (lines 227-243, Section 3.1 for Strategy 1; lines 255-261, Section 3.2 for Strategy 2; and lines 280-295, Section 3.3 for Strategy 3). We have also increased the number of formulas and included three pseudocode listings (page 8, Section 3.1; page 10, Section 3.3; and page 11, Section 3.3) covering the clustering algorithms and prompt engineering, for better reproducibility.

Regarding the paper's innovation and contribution, we wish to reiterate the three contributions this work presents. First, we propose a pipeline that explicitly addresses the unexplored problem of object-level text grouping, a practical necessity for investigative analysis that complements the existing semantic or layout-based approaches. Second, to validate this approach, we carried out a comparative study that evaluates three fundamentally different strategies: a geometrical approach, semantic instance segmentation, and prompt-based zero-shot segmentation. The study delivers insights into their trade-offs (speed vs. accuracy vs. generalization), which are critical for investigative agents, who must actively choose the appropriate strategy according to each case's requirements. Finally, we have curated and manually annotated a dataset that associates text instances with objects, directly addressing the lack of labeled data for object-related text grouping.

Thank you again for the constructive critique. We hope this response shows that the work is an attempt to make meaningful advances in both deep learning-based scene text understanding and its application in forensic video analysis.
