SplitGround: Long-Chain Reasoning Split via Modular Multi-Expert Collaboration for Training-Free Scene Knowledge-Guided Visual Grounding
Abstract
1. Introduction
- We propose SplitGround, a training-free solution for SK-VG tasks that harnesses the cross-modal grounding and knowledge-reasoning capabilities of VLMs.
- We design a novel framework featuring multi-expert collaboration for fine-grained image annotation and referring expression conversion. This decomposition of complex long-chain reasoning into stepwise processes enhances overall grounding performance.
- Comprehensive experiments on the SK-VG dataset demonstrate that our training-free approach establishes new SOTA results, significantly outperforming previous baselines. Further experimental analyses validate the effectiveness of the proposed design and highlight the adaptability of the method.
2. Related Work
2.1. Visual Grounding
2.2. Scene Knowledge-Guided Visual Grounding
2.3. VLMs for Detection Tasks
3. Method
3.1. Overview
- Agentic Annotation Workflow (AAW) (Section 3.2): Integrates scene knowledge into visual representations by annotating entities on I.
- Synonymous Conversion Mechanism (SCM) (Section 3.3): Translates context-dependent queries Q into entity-centric expressions using scene knowledge K.
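To make the two-stage decomposition concrete, the following minimal Python sketch shows how the AAW and SCM outputs could be composed before the final grounding call. The `VLM` callable, the abbreviated prompts, and the `splitground` helper are illustrative assumptions for this sketch; the actual prompts follow Appendix A.

```python
from typing import Callable

# Hypothetical VLM interface for this sketch: (text prompt, image path) -> text.
# Any GPT-4o-style chat API could be wrapped to match this signature.
VLM = Callable[[str, str], str]


def splitground(image_path: str, query: str, knowledge: str, vlm: VLM) -> str:
    """Compose the two SplitGround stages before the final grounding call."""
    # Stage 1 (AAW): fold the scene knowledge into the visual input by
    # annotating the named entities directly on the image.
    annotations = vlm(
        "Scene knowledge:\n" + knowledge +
        "\nMark every person named above with a labeled point on the image.",
        image_path,
    )

    # Stage 2 (SCM): rewrite the knowledge-dependent query into an
    # entity-centric referring expression (e.g., replace a role by a name).
    simple_query = vlm(
        "Scene knowledge:\n" + knowledge +
        "\nRewrite this query so it names its target directly: " + query,
        image_path,
    )

    # Final short-chain grounding call on the knowledge-enriched inputs.
    return vlm(
        "Annotations:\n" + annotations +
        "\nReturn the bounding box of: " + simple_query,
        image_path,
    )
```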
3.2. Agentic Annotation Workflow
Algorithm 1: Agentic Annotation Workflow
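A minimal Python sketch of the annotate-inspect loop summarized in Algorithm 1, under stated assumptions: the two-round default and the point-based Annotator follow the ablation in Section 4.3, while the prompt wording, the generic `VLM` callable, and the early-exit check are illustrative simplifications rather than the exact Appendix A prompts.

```python
from typing import Callable

VLM = Callable[[str, str], str]  # (text prompt, image path) -> text response


def agentic_annotation(image_path: str, knowledge: str, vlm: VLM,
                       rounds: int = 2) -> str:
    """Annotate-inspect loop of the AAW (sketch)."""
    annotations = ""
    for _ in range(rounds):
        # Annotator: place a labeled point on each person named in the
        # scene knowledge (points rather than bounding boxes, per the ablation).
        annotations = vlm(
            "Scene knowledge:\n" + knowledge
            + "\nCurrent annotations:\n" + annotations
            + "\nFor every person named in the knowledge, output one line "
              "'name: (x, y)' marking a point on that person.",
            image_path,
        )

        # Inspector: flag labels placed on the wrong person (e.g., two labels
        # landing on the same person) so the next round can correct them.
        report = vlm(
            "Check these point annotations against the image and knowledge:\n"
            + annotations + "\nList any wrongly placed labels, or reply 'OK'.",
            image_path,
        )
        if report.strip().upper().startswith("OK"):
            break  # annotations accepted; stop before the next round
    return annotations
```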
3.3. Synonymous Conversion Mechanism
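A minimal sketch of the conversion step, assuming a text-only expert callable: the name-based rewrite and the "the [type]:" fallback mirror the descriptions in Sections 3.3 and 4.3, while the prompt wording and the default `fallback_type` value are illustrative assumptions.

```python
from typing import Callable

LLM = Callable[[str], str]  # text-in, text-out expert


def synonymous_conversion(query: str, knowledge: str, llm: LLM,
                          fallback_type: str = "person") -> str:
    """Rewrite a knowledge-dependent query into an entity-centric expression
    (sketch). The 'the [type]:' prefix follows the ablation in Section 4.3."""
    name = llm(
        "Scene knowledge:\n" + knowledge
        + "\nQuery: " + query
        + "\nIf the target of the query can be identified by a name in the "
          "knowledge, answer with that name only; otherwise answer 'NONE'."
    ).strip()

    if name and name.upper() != "NONE":
        return name  # name-based transformation succeeded

    # Name-based transformation infeasible: keep the original query but
    # prefix the entity type so the grounding agent knows what to localize.
    return f"the {fallback_type}: {query}"
```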
4. Experiments and Discussion
4.1. Configurations
4.2. Comparative Study
4.3. Ablation Study
- One-round. Only a single round of annotation–inspection is performed in AAW.
- Three-round. The AAW module conducts an additional round of annotation–inspection (three rounds in total).
- Bbox. The Annotator employs bounding boxes for annotation rather than points.
- w/o the [type]. The SCM module retains the original query without adding “the [type]:” when name-based transformation is infeasible.
4.4. Case Study
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Appendix A. Details of Textual Prompt Design
- Prompt excerpt handling wrongly placed labels:
  "## Note
  <wrong_labels> labels are on the same person, please refer to the process of D in the example
  remember to find the wrongly labeled person as D in the example"
Appendix B. Additional Experiments and Discussion
Appendix B.1. Additional Discussion
- (a) Investigating the feasibility and effectiveness of directly applying chain-of-thought reasoning to solve SK-VG tasks;
- (b) Examining how annotation quality affects grounding accuracy by feeding the grounding agent image annotations of varying quality;
- (c) Evaluating how much the SplitGround modules reduce the grounding agent's dependence on scene knowledge by removing its knowledge inputs.
Configuration | Overall Acc | Easy | Medium | Hard
---|---|---|---|---
(a) SplitGround* | 77.4 | 79.20 | 72.18 | 79.43
(a) Original VLM | 73.0 | 77.88 | 63.16 | 74.49
(a) Chain-of-Thought | 48.4 | 54.42 | 43.61 | 43.26
(b) Wrong Annotation | 66.0 | 73.01 | 57.89 | 62.41
(b) No Annotation | 74.0 | 79.20 | 63.91 | 75.18
(b) AAW Annotation* | 77.4 | 79.20 | 72.18 | 79.43
(b) Human Annotation | 77.8 | 77.88 | 70.68 | 84.40
(c) Both Modules (Q)* | 73.0 | 75.66 | 65.41 | 75.89
(c) w/AAW (Q) | 60.2 | 67.70 | 57.14 | 51.06
(c) w/SCM (Q) | 38.0 | 56.19 | 32.33 | 14.18
(c) Original VLM (Q) | 49.6 | 64.16 | 45.86 | 29.79
- (a) Original VLM. Directly uses the VLM with raw query, knowledge, and image inputs for grounding, consistent with the setup in Table 2.
- (a) Chain-of-Thought. Performs reasoning through chain-of-thought to accomplish the SK-VG task, with the detailed prompt illustrated in Figure A4.
- (b) Wrong Annotation. The grounding agent receives images with incorrect annotations, where each sample is randomly provided with annotation information from other samples.
- (b) No Annotation. The grounding agent receives raw images without annotations, which is equivalent to removing the AAW module.
- (b) Human Annotation. Replaces AAW-generated annotations with human-annotated images for grounding, representing the most accurate annotation baseline.
- (c) w/AAW (Q). The grounding agent receives AAW-processed images and raw queries without knowledge input.
- (c) w/SCM (Q). The grounding agent receives raw images and SCM-processed queries without knowledge input.
- (c) Original VLM (Q). Directly uses the VLM with raw query and image inputs for grounding, excluding knowledge inputs.
Appendix B.2. IoU Results on SK-VG Dataset
Method | Input | Average IoU | Easy | Medium | Hard
---|---|---|---|---|---
TransVG [26] | | 0.2338 | 0.2337 | 0.2292 | 0.2388
 | | 0.2138 | 0.1980 | 0.2133 | 0.2417
MDETR [71] | | 0.3074 | 0.3598 | 0.2865 | 0.2381
 | | 0.2110 | 0.1892 | 0.2071 | 0.2529
OFA [4] | | 0.3457 | 0.4060 | 0.3160 | 0.2719
 | | 0.2547 | 0.2484 | 0.2521 | 0.2684
UNINEXT(H) [3] | | 0.4105 | 0.5202 | 0.3532 | 0.2801
 | | 0.2310 | 0.1924 | 0.2371 | 0.2917
Florence-2(L) [56] | | 0.3307 | 0.4317 | 0.3051 | 0.1819
 | | 0.0002 | 0.0001 | 0.0000 | 0.0005
Grounding DINO(B) [2] | | 0.2789 | 0.4231 | 0.1965 | 0.1147
 | | 0.0657 | 0.0503 | 0.1025 | 0.0758
SplitGround (ours) | | 0.6482 | 0.6422 | 0.6299 | 0.6778
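For reference, the IoU values above are the standard intersection over union between predicted and ground-truth boxes. The sketch below assumes axis-aligned boxes in (x1, y1, x2, y2) format; the 0.5 accuracy threshold mentioned in the comment is the common visual-grounding convention rather than a value stated in this appendix.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as
    (x1, y1, x2, y2); the coordinate convention is an assumption here."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (zero area if the boxes do not overlap).
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# Accuracy in the main tables counts a prediction as correct when its IoU with
# the ground truth exceeds a threshold (0.5 is the usual visual-grounding
# convention), while this appendix averages the raw IoU values.
print(box_iou((10, 10, 60, 60), (30, 30, 80, 80)))  # ~0.22
```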
References
- Chen, Z.; Zhang, R.; Song, Y.; Wan, X.; Li, G. Advancing Visual Grounding With Scene Knowledge: Benchmark and Method. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 15039–15049. [Google Scholar]
- Liu, S.; Zeng, Z.; Ren, T.; Li, F.; Zhang, H.; Yang, J.; Jiang, Q.; Li, C.; Yang, J.; Su, H.; et al. Grounding DINO: Marrying DINO with Grounded Pre-training for Open-Set Object Detection. In Proceedings of the Computer Vision-ECCV 2024-18th European Conference, Milan, Italy, 29 September–4 October 2024; Lecture Notes in Computer Science; Proceedings, Part XLVII. Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G., Eds.; Springer: Berlin/Heidelberg, Germany, 2024; Volume 15105, pp. 38–55. [Google Scholar] [CrossRef]
- Yan, B.; Jiang, Y.; Wu, J.; Wang, D.; Luo, P.; Yuan, Z.; Lu, H. Universal Instance Perception as Object Discovery and Retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023), Vancouver, BC, Canada, 17–24 June 2023; IEEE: Amsterdam, The Netherlands, 2023; pp. 15325–15336. [Google Scholar] [CrossRef]
- Wang, P.; Yang, A.; Men, R.; Lin, J.; Bai, S.; Li, Z. OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework. arXiv 2022, arXiv:2202.03052. [Google Scholar]
- Zhang, Y.; Ma, Z.; Gao, X.; Shakiah, S.; Gao, Q.; Chai, J. Groundhog Grounding Large Language Models to Holistic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, 16–22 June 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 14227–14238. [Google Scholar] [CrossRef]
- Liu, J.; Fang, W.; Love, P.E.; Hartmann, T.; Luo, H.; Wang, L. Detection and location of unsafe behaviour in digital images: A visual grounding approach. Adv. Eng. Inform. 2022, 53, 101688. [Google Scholar] [CrossRef]
- Cai, R.; Guo, Z.; Chen, X.; Li, J.; Tan, Y.; Tang, J. Automatic identification of integrated construction elements using open-set object detection based on image and text modality fusion. Adv. Eng. Inform. 2025, 64, 103075. [Google Scholar] [CrossRef]
- Hurst, A.; Lerer, A.; Goucher, A.P.; Perelman, A.; Ramesh, A.; Clark, A.; Kivlichan, I. GPT-4o System Card. arXiv 2024, arXiv:2410.21276. [Google Scholar] [CrossRef]
- Liu, H.; Li, C.; Li, Y.; Lee, Y.J. Improved Baselines with Visual Instruction Tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
- Bai, J.; Bai, S.; Yang, S.; Wang, S.; Tan, S.; Wang, P.; Lin, J.; Zhou, C.; Zhou, J. Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. arXiv 2023, arXiv:2308.12966. [Google Scholar]
- Wang, P.; Bai, S.; Tan, S.; Wang, S.; Fan, Z.; Bai, J.; Chen, K.; Liu, X.; Wang, J.; Ge, W.; et al. Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution. arXiv 2024, arXiv:2409.12191. [Google Scholar]
- Bai, S.; Chen, K.; Liu, X.; Wang, J.; Ge, W.; Song, S.; Dang, K.; Wang, P.; Wang, S.; Tang, J.; et al. Qwen2.5-VL Technical Report. arXiv 2025, arXiv:2502.13923. [Google Scholar] [CrossRef]
- Dorkenwald, M.; Barazani, N.; Snoek, C.G.M.; Asano, Y.M. PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs. arXiv 2024, arXiv:2402.08657. [Google Scholar] [CrossRef]
- Huang, L.; Yu, W.; Ma, W.; Zhong, W.; Feng, Z.; Wang, H.; Chen, Q.; Peng, W.; Feng, X.; Qin, B.; et al. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions. ACM Trans. Inf. Syst. 2025, 43, 42. [Google Scholar] [CrossRef]
- Wang, Y.; Luo, H.; Fang, W. An integrated approach for automatic safety inspection in construction: Domain knowledge with multimodal large language model. Adv. Eng. Inform. 2025, 65, 103246. [Google Scholar] [CrossRef]
- Hong, R.; Liu, D.; Mo, X.; He, X.; Zhang, H. Learning to Compose and Reason with Language Tree Structures for Visual Grounding. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 44, 684–696. [Google Scholar] [CrossRef]
- Liu, X.; Wang, Z.; Shao, J.; Wang, X.; Li, H. Improving Referring Expression Grounding with Cross-modal Attention-guided Erasing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–19 June 2019; pp. 1950–1959. [Google Scholar]
- Liu, D.; Zhang, H.; Zha, Z.J.; Feng, W. Learning to Assemble Neural Module Tree Networks for Visual Grounding. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
- Bajaj, M.; Wang, L.; Sigal, L. G3raphGround: Graph-Based Language Grounding. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4280–4289. [Google Scholar]
- Yang, S.; Li, G.; Yu, Y. Dynamic Graph Attention for Referring Expression Comprehension. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Republic of Korea, 27 October–2 November 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 4643–4652. [Google Scholar] [CrossRef]
- Yang, S.; Li, G.; Yu, Y. Graph-Structured Referring Expression Reasoning in the Wild. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, 13–19 June 2020; Computer Vision Foundation. IEEE: Piscataway, NJ, USA, 2020; pp. 9949–9958. [Google Scholar] [CrossRef]
- Huang, B.; Lian, D.; Luo, W.; Gao, S. Look Before You Leap: Learning Landmark Features for One-Stage Visual Grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 19–25 June 2021. [Google Scholar]
- Luo, G.; Zhou, Y.; Sun, X.; Cao, L.; Wu, C.; Deng, C.; Ji, R. Multi-Task Collaborative Network for Joint Referring Expression Comprehension and Segmentation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, 13–19 June 2020; Computer Vision Foundation. IEEE: Piscataway, NJ, USA, 2020; pp. 10031–10040. [Google Scholar] [CrossRef]
- Yang, Z.; Gong, B.; Wang, L.; Huang, W.; Yu, D.; Luo, J. A Fast and Accurate One-Stage Approach to Visual Grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
- Yang, Z.; Chen, T.; Wang, L.; Luo, J. Improving One-stage Visual Grounding by Recursive Sub-query Construction. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020. [Google Scholar]
- Deng, J.; Yang, Z.; Chen, T.; Zhou, W.; Li, H. TransVG: End-to-End Visual Grounding with Transformers. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Virtual, 11–17 October 2021; pp. 1749–1759. [Google Scholar]
- Zhou, Y.; Ji, R.; Luo, G.; Sun, X.; Su, J.; Ding, X.; Lin, C.; Tian, Q. A Real-Time Global Inference Network for One-Stage Referring Expression Comprehension. IEEE Trans. Neural Netw. Learn. Syst. 2023, 34, 134–143. [Google Scholar] [CrossRef] [PubMed]
- Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015), San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
- Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
- Chen, L.; Ma, W.; Xiao, J.; Zhang, H.; Chang, S. Ref-NMS: Breaking Proposal Bottlenecks in Two-Stage Referring Expression Grounding. In Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual, 2–9 February 2021; AAAI Press: Washington, DC, USA, 2021; pp. 1036–1044. [Google Scholar] [CrossRef]
- Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar] [CrossRef]
- Dai, M.; Yang, L.; Xu, Y.; Feng, Z.; Yang, W. SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion. Proc. Adv. Neural Inf. Process. Syst. 2024, 37, 121670–121698. [Google Scholar]
- Yang, Z.; Gan, Z.; Wang, J.; Hu, X.; Ahmed, F.; Liu, Z.; Lu, Y.; Wang, L. UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling. In Proceedings of the 17th European Conference on Computer Vision (ECCV 2022), Tel Aviv, Israel, 23–27 October 2022. [Google Scholar]
- Zhang, H.; Zhang, P.; Hu, X.; Chen, Y.C.; Li, L.H.; Dai, X.; Wang, L.; Yuan, L.; Hwang, J.N.; Gao, J. GLIPv2: Unifying Localization and Vision-Language Understanding. arXiv 2022, arXiv:2206.05836. [Google Scholar] [CrossRef]
- Kang, W.; Qu, M.; Wei, Y.; Yan, Y. ACTRESS: Active Retraining for Semi-supervised Visual Grounding. arXiv 2024, arXiv:2407.03251. [Google Scholar] [CrossRef]
- Kang, W.; Zhou, L.; Wu, J.; Sun, C.; Yan, Y. Visual Grounding with Attention-Driven Constraint Balancing. arXiv 2024, arXiv:2407.03243. [Google Scholar] [CrossRef]
- Qu, M.; Wu, Y.; Liu, W.; Gong, Q.; Liang, X.; Russakovsky, O.; Zhao, Y.; Wei, Y. SiRi: A Simple Selective Retraining Mechanism for Transformer-Based Visual Grounding. In Proceedings of the Computer Vision—ECCV 2022-17th European Conference, Tel Aviv, Israel, 23–27 October 2022; Proceedings, Part XXXV. Avidan, S., Brostow, G.J., Cissé, M., Farinella, G.M., Hassner, T., Eds.; Lecture Notes in Computer Science. Springer: Berlin/Heidelberg, Germany, 2022; Volume 13695, pp. 546–562. [Google Scholar] [CrossRef]
- Yang, L.; Xu, Y.; Yuan, C.; Liu, W.; Li, B.; Hu, W. Improving Visual Grounding with Visual-Linguistic Verification and Iterative Reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, 18–24 June 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 9489–9498. [Google Scholar] [CrossRef]
- Ye, J.; Tian, J.; Yan, M.; Yang, X.; Wang, X.; Zhang, J.; He, L.; Lin, X. Shifting More Attention to Visual Backbone: Query-modulated Refinement Networks for End-to-End Visual Grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, 18–24 June 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 15481–15491. [Google Scholar] [CrossRef]
- Deng, J.; Yang, Z.; Liu, D.; Chen, T.; Zhou, W.; Zhang, Y.; Li, H.; Ouyang, W. TransVG++: End-to-End Visual Grounding With Language Conditioned Vision Transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 13636–13652. [Google Scholar] [CrossRef]
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the Computer Vision-ECCV, Glasgow, UK, 23–28 August 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M., Eds.; Springer International Publishing: Cham, Switzerland, 2020; pp. 213–229. [Google Scholar]
- Zhang, H.; Li, F.; Liu, S.; Zhang, L.; Su, H.; Zhu, J.; Ni, L.M.; Shum, H. DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection. In Proceedings of the Eleventh International Conference on Learning Representations (ICLR 2023), Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
- Ren, T.; Jiang, Q.; Liu, S.; Zeng, Z.; Liu, W.; Gao, H.; Huang, H.; Ma, Z.; Jiang, X.; Chen, Y.; et al. Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection. arXiv 2024, arXiv:2405.10300. [Google Scholar] [CrossRef]
- Ren, T.; Chen, Y.; Jiang, Q.; Zeng, Z.; Xiong, Y.; Liu, W.; Ma, Z.; Shen, J.; Gao, Y.; Jiang, X.; et al. DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding. arXiv 2024, arXiv:2411.14347. [Google Scholar]
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning (ICML 2021), Virtual, 18–24 July 2021; Proceedings of Machine Learning Research. Meila, M., Zhang, T., Eds.; PMLR: Cambridge, MA, USA, 2021; Volume 139, pp. 8748–8763. [Google Scholar]
- Xiao, L.; Yang, X.; Peng, F.; Wang, Y.; Xu, C. HiVG: Hierarchical Multimodal Fine-grained Modulation for Visual Grounding. In Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October–1 November 2024; Cai, J., Kankanhalli, M.S., Prabhakaran, B., Boll, S., Subramanian, R., Zheng, L., Singh, V.K., César, P., Xie, L., Xu, D., Eds.; ACM: New York, NY, USA, 2024; pp. 5460–5469. [Google Scholar] [CrossRef]
- Kim, S.; Kang, M.; Kim, D.; Park, J.; Kwak, S. Extending CLIP’s Image-Text Alignment to Referring Image Segmentation. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Mexico City, Mexico, 4–7 June 2024; Volume 1: Long Papers, pp. 4611–4628. [Google Scholar] [CrossRef]
- Peng, F.; Yang, X.; Xiao, L.; Wang, Y.; Xu, C. SgVA-CLIP: Semantic-Guided Visual Adapting of Vision-Language Models for Few-Shot Image Classification. IEEE Trans. Multimed. 2024, 26, 3469–3480. [Google Scholar] [CrossRef]
- Wang, Z.; Lu, Y.; Li, Q.; Tao, X.; Guo, Y.; Gong, M.; Liu, T. CRIS: CLIP-Driven Referring Image Segmentation. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 11676–11685. [Google Scholar] [CrossRef]
- Minderer, M.; Gritsenko, A.; Stone, A.; Neumann, M.; Weissenborn, D.; Dosovitskiy, A.; Houlsby, N. Simple Open-Vocabulary Object Detection with Vision Transformers. arXiv 2022, arXiv:2205.06230. [Google Scholar] [CrossRef]
- Xiao, L.; Yang, X.; Peng, F.; Yan, M.; Wang, Y.; Xu, C. CLIP-VG: Self-paced Curriculum Adapting of CLIP for Visual Grounding. IEEE Trans. Multimedia 2023, 26, 4334–4347. [Google Scholar] [CrossRef]
- Jin, L.; Luo, G.; Zhou, Y.; Sun, X.; Jiang, G.; Shu, A.; Ji, R. RefCLIP: A Universal Teacher for Weakly Supervised Referring Expression Comprehension. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 1–10. [Google Scholar] [CrossRef]
- Sun, J.; Luo, G.; Zhou, Y.; Sun, X.; Jiang, G.; Wang, Z.; Ji, R. RefTeacher: A Strong Baseline for Semi-Supervised Referring Expression Comprehension. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 19144–19154. [Google Scholar] [CrossRef]
- Minderer, M.; Gritsenko, A.; Houlsby, N. Scaling Open-Vocabulary Object Detection. arXiv 2023, arXiv:2306.09683. [Google Scholar]
- Li, L.H.; Zhang, P.; Zhang, H.; Yang, J.; Li, C.; Zhong, Y.; Gao, J. Grounded Language-Image Pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
- Xiao, B.; Wu, H.; Xu, W.; Dai, X.; Hu, H.; Lu, Y.; Zeng, M.; Liu, C.; Yuan, L. Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 4818–4829. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.M.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is All you Need. In Proceedings of the Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
- Chen, Q.; Yin, X. Tailored vision-language framework for automated hazard identification and report generation in construction sites. Adv. Eng. Inform. 2025, 66, 103478. [Google Scholar] [CrossRef]
- Lin, L.; Zhang, S.; Fu, S.; Liu, Y. FD-LLM: Large language model for fault diagnosis of complex equipment. Adv. Eng. Inform. 2025, 65, 103208. [Google Scholar] [CrossRef]
- Zhang, H.; Li, H.; Li, F.; Ren, T.; Zou, X.; Liu, S.; Huang, S.; Gao, J.; Zhang, L.; Li, C.; et al. LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models. In Proceedings of the Computer Vision—ECCV 2024, Milan, Italy, 29 September–4 October 2024; pp. 19–35. [Google Scholar]
- Rasheed, H.; Maaz, M.; Shaji, S.; Shaker, A.; Khan, S.; Cholakkal, H.; Anwer, R.M.; Xing, E.; Yang, M.H.; Khan, F.S. GLaMM: Pixel Grounding Large Multimodal Model. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 13009–13018. [Google Scholar] [CrossRef]
- Wang, W.; Lv, Q.; Yu, W.; Hong, W.; Qi, J.; Wang, Y.; Ji, J.; Yang, Z.; Zhao, L.; Song, X.; et al. CogVLM: Visual Expert for Pretrained Language Models. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 9–15 December 2024; Volume 37, pp. 121475–121499. [Google Scholar]
- Hong, W.; Wang, W.; Lv, Q.; Xu, J.; Yu, W.; Ji, J.; Wang, Y.; Wang, Z.; Dong, Y.; Ding, M.; et al. CogAgent: A Visual Language Model for GUI Agents. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 14281–14290. [Google Scholar] [CrossRef]
- Yang, J.; Chen, X.; Qian, S.; Madaan, N.; Iyengar, M.; Fouhey, D.F.; Chai, J. LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent. In Proceedings of the IEEE International Conference on Robotics and Automation, ICRA 2024, Yokohama, Japan, 13–17 May 2024; IEEE: New York, NY, USA, 2024; pp. 7694–7701. [Google Scholar] [CrossRef]
- Wu, Q.; Bansal, G.; Zhang, J.; Wu, Y.; Li, B.; Zhu, E.; Jiang, L.; Zhang, X.; Zhang, S.; Liu, J.; et al. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversations. In Proceedings of the Conference on Language Models (COLM), Philadelphia, PA, USA, 7–9 October 2024. [Google Scholar]
- Schick, T.; Dwivedi-Yu, J.; Dessi, R.; Raileanu, R.; Lomeli, M.; Hambro, E.; Zettlemoyer, L.; Cancedda, N.; Scialom, T. Toolformer: Language Models Can Teach Themselves to Use Tools. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S., Eds.; Curran Associates, Inc.: New York, NY, USA, 2023; Volume 36, pp. 68539–68551. [Google Scholar]
- Shen, Y.; Song, K.; Tan, X.; Li, D.; Lu, W.; Zhuang, Y. HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S., Eds.; Curran Associates, Inc.: New York, NY, USA, 2023; Volume 36, pp. 38154–38180. [Google Scholar]
- Zhao, H.; Ge, W.; Chen, Y.C. LLM-Optic: Unveiling the Capabilities of Large Language Models for Universal Visual Grounding. arXiv 2024, arXiv:2405.17104. [Google Scholar] [CrossRef]
- Li, R.; Li, S.; Kong, L.; Yang, X.; Liang, J. SeeGround: See and Ground for Zero-Shot Open-Vocabulary 3D Visual Grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 14–15 June 2025. [Google Scholar]
- Shahriar, S.; Lund, B.D.; Mannuru, N.R.; Arshad, M.A.; Hayawi, K.; Bevara, R.V.K.; Mannuru, A.; Batool, L. Putting GPT-4o to the Sword: A Comprehensive Evaluation of Language, Vision, Speech, and Multimodal Proficiency. Appl. Sci. 2024, 14, 7782. [Google Scholar] [CrossRef]
- Kamath, A.; Singh, M.; LeCun, Y.; Misra, I.; Synnaeve, G.; Carion, N. MDETR–Modulated Detection for End-to-End Multi-Modal Understanding. arXiv 2021, arXiv:2104.12763. [Google Scholar]
Method | Venue | Model Task | w/o Training | Input | Overall Acc | Easy | Medium | Hard
---|---|---|---|---|---|---|---|---
TransVG [26] | ICCV’21 | VG | ✓ | | 23.29 | 23.32 | 22.87 | 23.71
 | | | | | 21.05 | 19.22 | 21.01 | 24.28
MDETR [71] | ICCV’21 | VG | ✓ | | 31.52 | 37.29 | 29.60 | 23.54
 | | | | | 20.48 | 17.24 | 20.79 | 25.77
OFA [4] | ICML’22 | multi-task | ✓ | | 35.33 | 41.91 | 31.95 | 27.44
 | | | | | 24.89 | 23.88 | 25.05 | 26.46
UNINEXT(H) [3] | CVPR’23 | multi-task | ✓ | | 41.86 | 54.29 | 35.18 | 27.27
 | | | | | 20.96 | 16.31 | 22.05 | 27.90
Florence-2(L) [56] | CVPR’24 | multi-task | ✓ | | 34.63 | 46.53 | 31.29 | 17.45
 | | | | | 0.02 | 0.00 | 0.00 | 0.06
Grounding DINO(B) [2] | ECCV’24 | open-set detection | ✓ | | 29.51 | 46.04 | 20.24 | 10.51
 | | | | | 1.44 | 1.23 | 2.04 | 1.34
KeViLI [1] | CVPR’23 | SK-VG | ✗ | | 30.01 | 33.75 | 26.55 | 27.14
LeViLM [1] | CVPR’23 | SK-VG | ✓ | | 7.55 | 13.08 | 4.38 | 1.26
 | | | ✗ | | 72.57 | 84.08 | 65.52 | 59.95
SplitGround | Ours | SK-VG | ✓ | | 72.99(+0.42)↑ | 73.15(−10.93)↓ | 70.19(+4.67)↑ | 75.66(+15.71)↑
Method | Easy | Medium | Hard
---|---|---|---
Original VLM | 73.38 | 65.81 | 65.10
+AAW | 71.63 | 69.31 | 72.27
+SCM | 74.14 | 65.65 | 67.16
SplitGround | 73.15 | 70.19 | 75.66
Configuration | Module | Test Acc
---|---|---
Default* | - | 72.99
One-round | AAW | 72.81(−0.18)↓
Three-round | AAW | 73.02(+0.03)↑
Bbox | AAW | 68.60(−4.39)↓
w/o the [type] | SCM | 71.84(−0.97)↓