Peer-Review Record

SplitGround: Long-Chain Reasoning Split via Modular Multi-Expert Collaboration for Training-Free Scene Knowledge-Guided Visual Grounding

Big Data Cogn. Comput. 2025, 9(8), 209; https://doi.org/10.3390/bdcc9080209

by Xilong Qin¹, Yue Hu¹,*, Wansen Wu², Xinmeng Li³ and Quanjun Yin¹

Reviewer 1: Pengcheng Cao

Reviewer 2: Alexey M. Vulfin

Reviewer 3: Irina Razveeva

Reviewer 4: Anonymous

Submission received: 11 June 2025 / Revised: 31 July 2025 / Accepted: 11 August 2025 / Published: 14 August 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This paper presents a novel scene knowledge-guided visual grounding (SK-VG) framework, namely SplitGround. The introduction and background sections clearly outline the importance of integrating scene knowledge with visual grounding tasks. The proposed method, along with its components (AAW and SCM), is logically explained and demonstrates significant performance improvements. The introduction of SplitGround, a training-free framework that decomposes complex reasoning processes, represents a significant advancement in the field. Additionally, the comprehensive evaluation on the SK-VG benchmark, including qualitative and quantitative analyses, substantiates the effectiveness of the proposed method. The paper is well-organized, with a clear explanation of the problem, proposed solution, experimental setup, and results. The logical flow enhances readability and comprehension.

Suggestions for Improvement:

Inclusion of Case Studies: While the paper mentions the application to UAV detection, it lacks detailed case studies that demonstrate the relevance and practical implications of the proposed framework in real-world UAV detection scenarios. Including such case studies would provide a clearer understanding of the framework's applicability and benefits in practical settings.
Elaboration on Relevance: The paper could further elaborate on the relevance of the SK-VG task to UAV detection. This would help readers appreciate the specific challenges and requirements of UAV detection that necessitate the use of the SplitGround framework.
Code Repository: To enhance the scientific soundness and reproducibility of the research, it would be beneficial to release the code repository associated with the SplitGround framework. Providing access to the code will enable other researchers to validate and build upon the work presented in this paper.

In my opinion, this paper can be considered for publication after minor revisions.

Comments for author File: Comments.pdf

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

Review of the article

«SplitGround: Long-Chain Reasoning Split via Modular Multi-Expert Collaboration for Training-Free Scene Knowledge-Guided Visual Grounding»

The paper discusses the problem of scene knowledge-guided visual grounding and addresses the complex long-chain reasoning challenges that arise during the model grounding process.

The hierarchical task decomposition module and the Agentic Annotation Workflow framework have been developed for visual justification, which does not require training, for a knowledge-oriented scene. The proposed system solves complex problems of long-chain reasoning in the process of substantiating the model output. A detailed analysis of works in the field of multimodal image processing has been carried out in order to describe complex scenes in a meaningful way.

The main conclusions about the applicability of the proposed framework (a collaboration platform that decomposes complex reasoning processes by combining an input query and an image with knowledge using two auxiliary modules) have practical significance and scientific novelty.

The proposed estimates of the experimental results are well-founded and described in detail.

The interpretation of the results of the computational experiment and the reliability of their analysis are sufficient. The proposed solution increases the accuracy score on the test dataset by 15.71% compared to existing solutions.

The given list of references to bibliographic sources reflects the depth of the study of the research problem.

The abstract, introduction and conclusions are presented correctly.

The work can be published in its current form.

Author Response

Thank you very much for your positive feedback. I truly appreciate your time and support in reviewing this manuscript.

Reviewer 3 Report

Comments and Suggestions for Authors

SUMMARY

The article submitted for review is relevant to modern science and is in line with the scope of the journal. This article presents SplitGround, a new training-free framework for visual reasoning based on scene knowledge.

The graphical accompaniment of the research process, as well as the pseudocode of the algorithms, are especially interesting.

The authors conducted experiments with the SK-VG dataset, evaluating the performance of SplitGround from both a qualitative and quantitative point of view. The results showed that the approach proposed in the article achieved competitive performance compared to supervised models.

The reviewer believes that this article can be considered for publication, but the authors need to address the reviewer's comments, which are listed below.

COMMENTS

The reviewer finds Figure 1 very interesting. Please add a more detailed description of the differences between SK-VG and traditional VG according to Figure 1.
When describing the dataset, the reviewer considers it necessary to add one scene as an example. At the same time, indicate the annotations of the scene knowledge and other available characteristics.
To understand the difficulty levels (easy/medium/hard), it is worth giving an example.
The reviewer believes that a more detailed description of the selected metrics for assessing the quality of models (accuracy, IoU) is required.
In the "Conclusion", the authors should note the most significant gaps in the scientific field that their research has filled.
In the "Conclusion", the authors should describe the possibilities for further improvement of the results, as well as note the practical significance of the developed approach.
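For reference, the two metrics named in the comment on evaluation above (accuracy, IoU) are commonly defined in visual grounding work as sketched below. This is a minimal illustration, not the authors' evaluation code: the box format (x1, y1, x2, y2), the 0.5 threshold, and the function names `iou` and `grounding_accuracy` are assumptions that do not appear in this record.

```python
from typing import Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlap rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(
    preds: Sequence[Box], gts: Sequence[Box], threshold: float = 0.5
) -> float:
    """Fraction of predictions whose IoU with the ground truth meets the threshold."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    hits = sum(iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(preds)
```

Under this definition a prediction counts as correct only when its overlap with the annotated box is large enough, so accuracy depends on the chosen threshold; a paper reporting accuracy would normally state that threshold explicitly.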

This article has good prospects, but the comments need to be addressed. General conclusion - Minor Revisions.

Comments for author File: Comments.pdf

Comments on the Quality of English Language

English editing is required.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 4 Report

Comments and Suggestions for Authors

The article presents SplitGround - a framework for solving the Scene Knowledge-guided Visual Grounding (SK-VG) task.

I have the following remarks for the manuscript:

The abstract of the article needs to be rewritten and improved. The authors mention that SK-VG is "also a static form of object detection task", which most likely should read "detection task for static objects", as "static form" makes no sense to me and is a term that is not widely used in the subject area.
The Introduction section directly presents SK-VG and then the SplitGround framework and its functionality. The Introduction should focus more on the grounding principles, use cases and disadvantages of the methods in the subject area, and should not directly present the newly developed solution. This should come after the Related Works section. After all, in Sections 2.1 and 2.2 the authors present visual grounding and SK-VG, but before that they present and discuss the solution that improves on them. The logical flow and the structure of the article are wrong!
The "emoticons" in Fig. 1 and Fig. 2 are not suitable for a scientific paper.
Some minor language issues are detected in the manuscript, like the improper use of the definite article "the" or the lack of it.
The text below Fig. 5 is too long and is more like a discussion on the presented results, which should be part of the main text of the article and not the descriptive text under the figure.
The most significant problem of the study is the lack of reasoning related to the use of the method in scenarios related to UAVs, not to mention the fact that the abbreviation UAVs is present only 4 times in the text, but the authors claim that this method is suitable for UAV-related search and rescue scenarios. What makes this method so suitable for use with UAVs? What are the problems of the other methods and their use specifically with UAVs? Can this method be used also for ground robots or other search and rescue platforms/tools? The provided examples and results (like the images in Fig. 5) are not related in any way to UAVs or aerial imagery at all.

The presented framework might indeed be a novel solution that improves over the available ones, but the way the article is structured and the framework is presented needs a lot of improvement.

Comments on the Quality of English Language

Some minor language issues have been detected, mainly in terms of the misuse of the definite article "the".

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 4 Report

Comments and Suggestions for Authors

The article has been revised and many of the remarks from my initial review have been acknowledged, but still there are many issues to be addressed:

The manuscript needs a detailed proofreading - apart from the punctuation issues (like missing commas, etc.), some parts of the text also require a retouch. For example, the sentence "To address above challenges, we propose ...." should be corrected to "To address the discussed challenges ...." or to "To address the above-mentioned challenges ....".
The BDCC journal is a prestigious international scientific journal and not a social media platform, an instant messaging application or a comic book. The use of emoticons in any scientific journal is not recommended and should be avoided, unless it is absolutely necessary.
The overall structure of the document and the arrangement of the figures in the text are not good. Usually, the figures are first mentioned in the text and then presented afterwards. In the document, Figure 1 is presented on Page 2, but is then referenced for the first time on Page 3. In a similar fashion, Figures 4 and 5 are first presented and then discussed in the text. The most evident case is with Figure 6 - the figure is mentioned for the first time in the middle of Section 4.4, but then is placed in the middle of Section 5 of the article, which is the Conclusion. Further to this, the way the paper is split into two by the Annexes is not good. This makes the reading of the manuscript difficult and challenging.

Comments on the Quality of English Language

The quality of the English language is above the average, but the manuscript requires a detailed proofreading to fix the numerous punctuation issues, as well as many of the expressions that have been used by the authors.

Author Response

Comments 1: The manuscript needs a detailed proofreading - apart from the punctuation issues (like missing commas, etc.), some parts of the text also require a retouch. For example, the sentence "To address above challenges, we propose ...." should be corrected to "To address the discussed challenges ...." or to "To address the above-mentioned challenges ....".

Response 1: Thank you for pointing this out. I have conducted a detailed check of the article's sentences. The revised sections have been marked in red within the PDF file.

Comments 2: The BDCC journal is a prestigious international scientific journal and not a social media platform, an instant messaging application or a comic book. The use of emoticons in any scientific journal is not recommended and should be avoided, unless it is absolutely necessary.

Response 2: I have modified the figures according to your requirements.

Comments 3: The overall structure of the document and the arrangement of the figures in the text are not good. Usually, the figures are first mentioned in the text and then presented afterwards. In the document, Figure 1 is presented on Page 2, but is then referenced for the first time on Page 3. In a similar fashion, Figures 4 and 5 are first presented and then discussed in the text. The most evident case is with Figure 6 - the figure is mentioned for the first time in the middle of Section 4.4, but then is placed in the middle of Section 5 of the article, which is the Conclusion. Further to this, the way the paper is split into two by the Annexes is not good. This makes the reading of the manuscript difficult and challenging.

Response 3: Thank you for your feedback. I have adjusted the positioning of the figures and appendix as requested, striving to follow the principle of 'figures are first mentioned in the text and then presented afterwards'. However, please note that Figure 1 was already mentioned in the first paragraph of the Introduction on Page 1, with Page 3 being its second mention. Therefore, the placement of Figure 1 remains unchanged.

Round 3

Reviewer 4 Report

Comments and Suggestions for Authors

Dear Authors,

Thank you for making all of the corrections and modifications. I acknowledge all of them and have no further comments and remarks.

Best Regards
