SplitGround: Long-Chain Reasoning Split via Modular Multi-Expert Collaboration for Training-Free Scene Knowledge-Guided Visual Grounding
Abstract
1. Introduction
- We propose SplitGround, a training-free solution for SK-VG tasks that harnesses the cross-modal grounding and knowledge-reasoning capabilities of VLMs.
- We design a novel framework featuring multi-expert collaboration for fine-grained image annotation and referring expression conversion. This decomposition of complex long-chain reasoning into stepwise processes enhances overall grounding performance.
- Comprehensive experiments on the SK-VG dataset demonstrate that our training-free approach establishes new SOTA results, significantly outperforming previous baselines. Further experimental analyses validate the effectiveness of the proposed design and highlight the adaptability of the method.
2. Related Work
2.1. Visual Grounding
2.2. Scene Knowledge-Guided Visual Grounding
2.3. VLMs for Detection Tasks
3. Method
3.1. Overview
- Agentic Annotation Workflow (AAW) (Section 3.2): Integrates scene knowledge into visual representations by annotating entities on I.
- Synonymous Conversion Mechanism (SCM) (Section 3.3): Translates context-dependent queries Q into entity-centric expressions using scene knowledge K.
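To make the two-stage decomposition concrete, the following minimal Python sketch shows how the AAW and SCM outputs could be composed before the final grounding call. The `VLM` callable, the abbreviated prompts, and the `splitground` helper are illustrative assumptions for this sketch; the actual prompts follow Appendix A.

```python
from typing import Callable

# Hypothetical VLM interface for this sketch: (text prompt, image path) -> text.
# Any GPT-4o-style chat API could be wrapped to match this signature.
VLM = Callable[[str, str], str]


def splitground(image_path: str, query: str, knowledge: str, vlm: VLM) -> str:
    """Compose the two SplitGround stages before the final grounding call."""
    # Stage 1 (AAW): fold the scene knowledge into the visual input by
    # annotating the named entities directly on the image.
    annotations = vlm(
        "Scene knowledge:\n" + knowledge +
        "\nMark every person named above with a labeled point on the image.",
        image_path,
    )

    # Stage 2 (SCM): rewrite the knowledge-dependent query into an
    # entity-centric referring expression (e.g., replace a role by a name).
    simple_query = vlm(
        "Scene knowledge:\n" + knowledge +
        "\nRewrite this query so it names its target directly: " + query,
        image_path,
    )

    # Final short-chain grounding call on the knowledge-enriched inputs.
    return vlm(
        "Annotations:\n" + annotations +
        "\nReturn the bounding box of: " + simple_query,
        image_path,
    )
```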
3.2. Agentic Annotation Workflow
Algorithm 1: Agentic Annotation Workflow
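A minimal Python sketch of the annotate-inspect loop summarized in Algorithm 1, under stated assumptions: the two-round default and the point-based Annotator follow the ablation in Section 4.3, while the prompt wording, the generic `VLM` callable, and the early-exit check are illustrative simplifications rather than the exact Appendix A prompts.

```python
from typing import Callable

VLM = Callable[[str, str], str]  # (text prompt, image path) -> text response


def agentic_annotation(image_path: str, knowledge: str, vlm: VLM,
                       rounds: int = 2) -> str:
    """Annotate-inspect loop of the AAW (sketch)."""
    annotations = ""
    for _ in range(rounds):
        # Annotator: place a labeled point on each person named in the
        # scene knowledge (points rather than bounding boxes, per the ablation).
        annotations = vlm(
            "Scene knowledge:\n" + knowledge
            + "\nCurrent annotations:\n" + annotations
            + "\nFor every person named in the knowledge, output one line "
              "'name: (x, y)' marking a point on that person.",
            image_path,
        )

        # Inspector: flag labels placed on the wrong person (e.g., two labels
        # landing on the same person) so the next round can correct them.
        report = vlm(
            "Check these point annotations against the image and knowledge:\n"
            + annotations + "\nList any wrongly placed labels, or reply 'OK'.",
            image_path,
        )
        if report.strip().upper().startswith("OK"):
            break  # annotations accepted; stop before the next round
    return annotations
```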
3.3. Synonymous Conversion Mechanism
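A minimal sketch of the conversion step, assuming a text-only expert callable: the name-based rewrite and the "the [type]:" fallback mirror the descriptions in Sections 3.3 and 4.3, while the prompt wording and the default `fallback_type` value are illustrative assumptions.

```python
from typing import Callable

LLM = Callable[[str], str]  # text-in, text-out expert


def synonymous_conversion(query: str, knowledge: str, llm: LLM,
                          fallback_type: str = "person") -> str:
    """Rewrite a knowledge-dependent query into an entity-centric expression
    (sketch). The 'the [type]:' prefix follows the ablation in Section 4.3."""
    name = llm(
        "Scene knowledge:\n" + knowledge
        + "\nQuery: " + query
        + "\nIf the target of the query can be identified by a name in the "
          "knowledge, answer with that name only; otherwise answer 'NONE'."
    ).strip()

    if name and name.upper() != "NONE":
        return name  # name-based transformation succeeded

    # Name-based transformation infeasible: keep the original query but
    # prefix the entity type so the grounding agent knows what to localize.
    return f"the {fallback_type}: {query}"
```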
4. Experiments and Discussion
4.1. Configurations
4.2. Comparative Study
4.3. Ablation Study
- One-round. Only a single round of annotation–inspection is performed in AAW.
- Three-round. The AAW module conducts an additional round of annotation–inspection (three rounds in total).
- Bbox. The Annotator employs bounding boxes for annotation rather than points.
- w/o the [type]. The SCM module retains the original query without adding “the [type]:” when name-based transformation is infeasible.
4.4. Case Study
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Appendix A. Details of Textual Prompt Design
- Prompt excerpt handling wrongly placed labels:
  "## Note
  <wrong_labels> labels are on the same person, please refer to the process of D in the example
  remember to find the wrongly labeled person as D in the example"
Appendix B. Additional Experiments and Discussion
Appendix B.1. Additional Discussion
- (a) Investigating the feasibility and effectiveness of directly applying chain-of-thought reasoning to solve SK-VG tasks;
- (b) Examining how annotation quality affects grounding accuracy by feeding the grounding agent image annotations of varying quality;
- (c) Evaluating how much the SplitGround modules reduce the grounding agent's dependence on scene knowledge by removing its knowledge inputs.
Configuration | Overall Acc | Easy | Medium | Hard
---|---|---|---|---
(a) SplitGround* | 77.4 | 79.20 | 72.18 | 79.43
(a) Original VLM | 73.0 | 77.88 | 63.16 | 74.49
(a) Chain-of-Thought | 48.4 | 54.42 | 43.61 | 43.26
(b) Wrong Annotation | 66.0 | 73.01 | 57.89 | 62.41
(b) No Annotation | 74.0 | 79.20 | 63.91 | 75.18
(b) AAW Annotation* | 77.4 | 79.20 | 72.18 | 79.43
(b) Human Annotation | 77.8 | 77.88 | 70.68 | 84.40
(c) Both Modules (Q)* | 73.0 | 75.66 | 65.41 | 75.89
(c) w/AAW (Q) | 60.2 | 67.70 | 57.14 | 51.06
(c) w/SCM (Q) | 38.0 | 56.19 | 32.33 | 14.18
(c) Original VLM (Q) | 49.6 | 64.16 | 45.86 | 29.79
- (a) Original VLM. Directly uses the VLM with raw query, knowledge, and image inputs for grounding, consistent with the setup in Table 2.
- (a) Chain-of-Thought. Performs reasoning through chain-of-thought to accomplish the SK-VG task, with the detailed prompt illustrated in Figure A4.
- (b) Wrong Annotation. The grounding agent receives images with incorrect annotations, where each sample is randomly provided with annotation information from other samples.
- (b) No Annotation. The grounding agent receives raw images without annotations, which is equivalent to removing the AAW module.
- (b) Human Annotation. Replaces AAW-generated annotations with human-annotated images for grounding, representing the most accurate annotation baseline.
- (c) w/AAW (Q). The grounding agent receives AAW-processed images and raw queries without knowledge input.
- (c) w/SCM (Q). The grounding agent receives raw images and SCM-processed queries without knowledge input.
- (c) Original VLM (Q). Directly uses the VLM with raw query and image inputs for grounding, excluding knowledge inputs.
Appendix B.2. IoU Results on SK-VG Dataset
Method | Input | Average IoU | Easy | Medium | Hard
---|---|---|---|---|---
TransVG [26] | | 0.2338 | 0.2337 | 0.2292 | 0.2388
 | | 0.2138 | 0.1980 | 0.2133 | 0.2417
MDETR [71] | | 0.3074 | 0.3598 | 0.2865 | 0.2381
 | | 0.2110 | 0.1892 | 0.2071 | 0.2529
OFA [4] | | 0.3457 | 0.4060 | 0.3160 | 0.2719
 | | 0.2547 | 0.2484 | 0.2521 | 0.2684
UNINEXT(H) [3] | | 0.4105 | 0.5202 | 0.3532 | 0.2801
 | | 0.2310 | 0.1924 | 0.2371 | 0.2917
Florence-2(L) [56] | | 0.3307 | 0.4317 | 0.3051 | 0.1819
 | | 0.0002 | 0.0001 | 0.0000 | 0.0005
Grounding DINO(B) [2] | | 0.2789 | 0.4231 | 0.1965 | 0.1147
 | | 0.0657 | 0.0503 | 0.1025 | 0.0758
SplitGround (ours) | | 0.6482 | 0.6422 | 0.6299 | 0.6778
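For reference, the IoU values above are the standard intersection over union between predicted and ground-truth boxes. The sketch below assumes axis-aligned boxes in (x1, y1, x2, y2) format; the 0.5 accuracy threshold mentioned in the comment is the common visual-grounding convention rather than a value stated in this appendix.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as
    (x1, y1, x2, y2); the coordinate convention is an assumption here."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (zero area if the boxes do not overlap).
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# Accuracy in the main tables counts a prediction as correct when its IoU with
# the ground truth exceeds a threshold (0.5 is the usual visual-grounding
# convention), while this appendix averages the raw IoU values.
print(box_iou((10, 10, 60, 60), (30, 30, 80, 80)))  # ~0.22
```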
References
- Chen, Z.; Zhang, R.; Song, Y.; Wan, X.; Li, G. Advancing Visual Grounding With Scene Knowledge: Benchmark and Method. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 15039–15049. [Google Scholar]
- Liu, S.; Zeng, Z.; Ren, T.; Li, F.; Zhang, H.; Yang, J.; Jiang, Q.; Li, C.; Yang, J.; Su, H.; et al. Grounding DINO: Marrying DINO with Grounded Pre-training for Open-Set Object Detection. In Proceedings of the Computer Vision-ECCV 2024-18th European Conference, Milan, Italy, 29 September–4 October 2024; Lecture Notes in Computer Science; Proceedings, Part XLVII. Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G., Eds.; Springer: Berlin/Heidelberg, Germany, 2024; Volume 15105, pp. 38–55. [Google Scholar] [CrossRef]
- Yan, B.; Jiang, Y.; Wu, J.; Wang, D.; Luo, P.; Yuan, Z.; Lu, H. Universal Instance Perception as Object Discovery and Retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023), Vancouver, BC, Canada, 17–24 June 2023; IEEE: Amsterdam, The Netherlands, 2023; pp. 15325–15336. [Google Scholar] [CrossRef]
- Wang, P.; Yang, A.; Men, R.; Lin, J.; Bai, S.; Li, Z. OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework. arXiv 2022, arXiv:2202.03052. [Google Scholar]
- Zhang, Y.; Ma, Z.; Gao, X.; Shakiah, S.; Gao, Q.; Chai, J. Groundhog Grounding Large Language Models to Holistic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, 16–22 June 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 14227–14238. [Google Scholar] [CrossRef]
- Liu, J.; Fang, W.; Love, P.E.; Hartmann, T.; Luo, H.; Wang, L. Detection and location of unsafe behaviour in digital images: A visual grounding approach. Adv. Eng. Inform. 2022, 53, 101688. [Google Scholar] [CrossRef]
- Cai, R.; Guo, Z.; Chen, X.; Li, J.; Tan, Y.; Tang, J. Automatic identification of integrated construction elements using open-set object detection based on image and text modality fusion. Adv. Eng. Inform. 2025, 64, 103075. [Google Scholar] [CrossRef]
- Hurst, A.; Lerer, A.; Goucher, A.P.; Perelman, A.; Ramesh, A.; Clark, A.; Kivlichan, I. GPT-4o System Card. arXiv 2024, arXiv:2410.21276. [Google Scholar] [CrossRef]
- Liu, H.; Li, C.; Li, Y.; Lee, Y.J. Improved Baselines with Visual Instruction Tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
- Bai, J.; Bai, S.; Yang, S.; Wang, S.; Tan, S.; Wang, P.; Lin, J.; Zhou, C.; Zhou, J. Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. arXiv 2023, arXiv:2308.12966. [Google Scholar]
- Wang, P.; Bai, S.; Tan, S.; Wang, S.; Fan, Z.; Bai, J.; Chen, K.; Liu, X.; Wang, J.; Ge, W.; et al. Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution. arXiv 2024, arXiv:2409.12191. [Google Scholar]
- Bai, S.; Chen, K.; Liu, X.; Wang, J.; Ge, W.; Song, S.; Dang, K.; Wang, P.; Wang, S.; Tang, J.; et al. Qwen2.5-VL Technical Report. arXiv 2025, arXiv:2502.13923. [Google Scholar] [CrossRef]
- Dorkenwald, M.; Barazani, N.; Snoek, C.G.M.; Asano, Y.M. PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs. arXiv 2024, arXiv:2402.08657. [Google Scholar] [CrossRef]
- Huang, L.; Yu, W.; Ma, W.; Zhong, W.; Feng, Z.; Wang, H.; Chen, Q.; Peng, W.; Feng, X.; Qin, B.; et al. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions. ACM Trans. Inf. Syst. 2025, 43, 42. [Google Scholar] [CrossRef]
- Wang, Y.; Luo, H.; Fang, W. An integrated approach for automatic safety inspection in construction: Domain knowledge with multimodal large language model. Adv. Eng. Inform. 2025, 65, 103246. [Google Scholar] [CrossRef]
- Hong, R.; Liu, D.; Mo, X.; He, X.; Zhang, H. Learning to Compose and Reason with Language Tree Structures for Visual Grounding. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 44, 684–696. [Google Scholar] [CrossRef]
- Liu, X.; Wang, Z.; Shao, J.; Wang, X.; Li, H. Improving Referring Expression Grounding with Cross-modal Attention-guided Erasing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–19 June 2019; pp. 1950–1959. [Google Scholar]
- Liu, D.; Zhang, H.; Zha, Z.J.; Feng, W. Learning to Assemble Neural Module Tree Networks for Visual Grounding. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
- Bajaj, M.; Wang, L.; Sigal, L. G3raphGround: Graph-Based Language Grounding. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4280–4289. [Google Scholar]
- Yang, S.; Li, G.; Yu, Y. Dynamic Graph Attention for Referring Expression Comprehension. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Republic of Korea, 27 October–2 November 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 4643–4652. [Google Scholar] [CrossRef]
- Yang, S.; Li, G.; Yu, Y. Graph-Structured Referring Expression Reasoning in the Wild. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, 13–19 June 2020; Computer Vision Foundation. IEEE: Piscataway, NJ, USA, 2020; pp. 9949–9958. [Google Scholar] [CrossRef]
- Huang, B.; Lian, D.; Luo, W.; Gao, S. Look Before You Leap: Learning Landmark Features for One-Stage Visual Grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 19–25 June 2021. [Google Scholar]
- Luo, G.; Zhou, Y.; Sun, X.; Cao, L.; Wu, C.; Deng, C.; Ji, R. Multi-Task Collaborative Network for Joint Referring Expression Comprehension and Segmentation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, 13–19 June 2020; Computer Vision Foundation. IEEE: Piscataway, NJ, USA, 2020; pp. 10031–10040. [Google Scholar] [CrossRef]
- Yang, Z.; Gong, B.; Wang, L.; Huang, W.; Yu, D.; Luo, J. A Fast and Accurate One-Stage Approach to Visual Grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
- Yang, Z.; Chen, T.; Wang, L.; Luo, J. Improving One-stage Visual Grounding by Recursive Sub-query Construction. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020. [Google Scholar]
- Deng, J.; Yang, Z.; Chen, T.; Zhou, W.; Li, H. TransVG: End-to-End Visual Grounding with Transformers. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Virtual, 11–17 October 2021; pp. 1749–1759. [Google Scholar]
- Zhou, Y.; Ji, R.; Luo, G.; Sun, X.; Su, J.; Ding, X.; Lin, C.; Tian, Q. A Real-Time Global Inference Network for One-Stage Referring Expression Comprehension. IEEE Trans. Neural Netw. Learn. Syst. 2023, 34, 134–143. [Google Scholar] [CrossRef] [PubMed]
- Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015), San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
- Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
- Chen, L.; Ma, W.; Xiao, J.; Zhang, H.; Chang, S. Ref-NMS: Breaking Proposal Bottlenecks in Two-Stage Referring Expression Grounding. In Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual, 2–9 February 2021; AAAI Press: Washington, DC, USA, 2021; pp. 1036–1044. [Google Scholar] [CrossRef]
- Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar] [CrossRef]
- Dai, M.; Yang, L.; Xu, Y.; Feng, Z.; Yang, W. SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion. Proc. Adv. Neural Inf. Process. Syst. 2024, 37, 121670–121698. [Google Scholar]
- Yang, Z.; Gan, Z.; Wang, J.; Hu, X.; Ahmed, F.; Liu, Z.; Lu, Y.; Wang, L. UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling. In Proceedings of the 17th European Conference on Computer Vision (ECCV 2022), Tel Aviv, Israel, 23–27 October 2022. [Google Scholar]
- Zhang, H.; Zhang, P.; Hu, X.; Chen, Y.C.; Li, L.H.; Dai, X.; Wang, L.; Yuan, L.; Hwang, J.N.; Gao, J. GLIPv2: Unifying Localization and Vision-Language Understanding. arXiv 2022, arXiv:2206.05836. [Google Scholar] [CrossRef]
- Kang, W.; Qu, M.; Wei, Y.; Yan, Y. ACTRESS: Active Retraining for Semi-supervised Visual Grounding. arXiv 2024, arXiv:2407.03251. [Google Scholar] [CrossRef]
- Kang, W.; Zhou, L.; Wu, J.; Sun, C.; Yan, Y. Visual Grounding with Attention-Driven Constraint Balancing. arXiv 2024, arXiv:2407.03243. [Google Scholar] [CrossRef]
- Qu, M.; Wu, Y.; Liu, W.; Gong, Q.; Liang, X.; Russakovsky, O.; Zhao, Y.; Wei, Y. SiRi: A Simple Selective Retraining Mechanism for Transformer-Based Visual Grounding. In Proceedings of the Computer Vision—ECCV 2022-17th European Conference, Tel Aviv, Israel, 23–27 October 2022; Proceedings, Part XXXV. Avidan, S., Brostow, G.J., Cissé, M., Farinella, G.M., Hassner, T., Eds.; Lecture Notes in Computer Science. Springer: Berlin/Heidelberg, Germany, 2022; Volume 13695, pp. 546–562. [Google Scholar] [CrossRef]
- Yang, L.; Xu, Y.; Yuan, C.; Liu, W.; Li, B.; Hu, W. Improving Visual Grounding with Visual-Linguistic Verification and Iterative Reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, 18–24 June 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 9489–9498. [Google Scholar] [CrossRef]
- Ye, J.; Tian, J.; Yan, M.; Yang, X.; Wang, X.; Zhang, J.; He, L.; Lin, X. Shifting More Attention to Visual Backbone: Query-modulated Refinement Networks for End-to-End Visual Grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, 18–24 June 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 15481–15491. [Google Scholar] [CrossRef]
- Deng, J.; Yang, Z.; Liu, D.; Chen, T.; Zhou, W.; Zhang, Y.; Li, H.; Ouyang, W. TransVG++: End-to-End Visual Grounding With Language Conditioned Vision Transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 13636–13652. [Google Scholar] [CrossRef]
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the Computer Vision-ECCV, Glasgow, UK, 23–28 August 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M., Eds.; Springer International Publishing: Cham, Switzerland, 2020; pp. 213–229. [Google Scholar]
- Zhang, H.; Li, F.; Liu, S.; Zhang, L.; Su, H.; Zhu, J.; Ni, L.M.; Shum, H. DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection. In Proceedings of the Eleventh International Conference on Learning Representations (ICLR 2023), Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
- Ren, T.; Jiang, Q.; Liu, S.; Zeng, Z.; Liu, W.; Gao, H.; Huang, H.; Ma, Z.; Jiang, X.; Chen, Y.; et al. Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection. arXiv 2024, arXiv:2405.10300. [Google Scholar] [CrossRef]
- Ren, T.; Chen, Y.; Jiang, Q.; Zeng, Z.; Xiong, Y.; Liu, W.; Ma, Z.; Shen, J.; Gao, Y.; Jiang, X.; et al. DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding. arXiv 2024, arXiv:2411.14347. [Google Scholar]
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning (ICML 2021), Virtual, 18–24 July 2021; Proceedings of Machine Learning Research. Meila, M., Zhang, T., Eds.; PMLR: Cambridge, MA, USA, 2021; Volume 139, pp. 8748–8763. [Google Scholar]
- Xiao, L.; Yang, X.; Peng, F.; Wang, Y.; Xu, C. HiVG: Hierarchical Multimodal Fine-grained Modulation for Visual Grounding. In Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October–1 November 2024; Cai, J., Kankanhalli, M.S., Prabhakaran, B., Boll, S., Subramanian, R., Zheng, L., Singh, V.K., César, P., Xie, L., Xu, D., Eds.; ACM: New York, NY, USA, 2024; pp. 5460–5469. [Google Scholar] [CrossRef]
- Kim, S.; Kang, M.; Kim, D.; Park, J.; Kwak, S. Extending CLIP’s Image-Text Alignment to Referring Image Segmentation. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Mexico City, Mexico, 4–7 June 2024; Volume 1: Long Papers, pp. 4611–4628. [Google Scholar] [CrossRef]
- Peng, F.; Yang, X.; Xiao, L.; Wang, Y.; Xu, C. SgVA-CLIP: Semantic-Guided Visual Adapting of Vision-Language Models for Few-Shot Image Classification. IEEE Trans. Multimed. 2024, 26, 3469–3480. [Google Scholar] [CrossRef]
- Wang, Z.; Lu, Y.; Li, Q.; Tao, X.; Guo, Y.; Gong, M.; Liu, T. CRIS: CLIP-Driven Referring Image Segmentation. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 11676–11685. [Google Scholar] [CrossRef]
- Minderer, M.; Gritsenko, A.; Stone, A.; Neumann, M.; Weissenborn, D.; Dosovitskiy, A.; Houlsby, N. Simple Open-Vocabulary Object Detection with Vision Transformers. arXiv 2022, arXiv:2205.06230. [Google Scholar] [CrossRef]
- Xiao, L.; Yang, X.; Peng, F.; Yan, M.; Wang, Y.; Xu, C. CLIP-VG: Self-paced Curriculum Adapting of CLIP for Visual Grounding. IEEE Trans. Multimedia 2023, 26, 4334–4347. [Google Scholar] [CrossRef]
- Jin, L.; Luo, G.; Zhou, Y.; Sun, X.; Jiang, G.; Shu, A.; Ji, R. RefCLIP: A Universal Teacher for Weakly Supervised Referring Expression Comprehension. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 1–10. [Google Scholar] [CrossRef]
- Sun, J.; Luo, G.; Zhou, Y.; Sun, X.; Jiang, G.; Wang, Z.; Ji, R. RefTeacher: A Strong Baseline for Semi-Supervised Referring Expression Comprehension. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 19144–19154. [Google Scholar] [CrossRef]
- Minderer, M.; Gritsenko, A.; Houlsby, N. Scaling Open-Vocabulary Object Detection. arXiv 2023, arXiv:2306.09683. [Google Scholar]
- Li, L.H.; Zhang, P.; Zhang, H.; Yang, J.; Li, C.; Zhong, Y.; Gao, J. Grounded Language-Image Pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
- Xiao, B.; Wu, H.; Xu, W.; Dai, X.; Hu, H.; Lu, Y.; Zeng, M.; Liu, C.; Yuan, L. Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 4818–4829. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.M.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is All you Need. In Proceedings of the Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
- Chen, Q.; Yin, X. Tailored vision-language framework for automated hazard identification and report generation in construction sites. Adv. Eng. Inform. 2025, 66, 103478. [Google Scholar] [CrossRef]
- Lin, L.; Zhang, S.; Fu, S.; Liu, Y. FD-LLM: Large language model for fault diagnosis of complex equipment. Adv. Eng. Inform. 2025, 65, 103208. [Google Scholar] [CrossRef]
- Zhang, H.; Li, H.; Li, F.; Ren, T.; Zou, X.; Liu, S.; Huang, S.; Gao, J.; Zhang, L.; Li, C.; et al. LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models. In Proceedings of the Computer Vision—ECCV 2024, Milan, Italy, 29 September–4 October 2024; pp. 19–35. [Google Scholar]
- Rasheed, H.; Maaz, M.; Shaji, S.; Shaker, A.; Khan, S.; Cholakkal, H.; Anwer, R.M.; Xing, E.; Yang, M.H.; Khan, F.S. GLaMM: Pixel Grounding Large Multimodal Model. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 13009–13018. [Google Scholar] [CrossRef]
- Wang, W.; Lv, Q.; Yu, W.; Hong, W.; Qi, J.; Wang, Y.; Ji, J.; Yang, Z.; Zhao, L.; Song, X.; et al. CogVLM: Visual Expert for Pretrained Language Models. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 9–15 December 2024; Volume 37, pp. 121475–121499. [Google Scholar]
- Hong, W.; Wang, W.; Lv, Q.; Xu, J.; Yu, W.; Ji, J.; Wang, Y.; Wang, Z.; Dong, Y.; Ding, M.; et al. CogAgent: A Visual Language Model for GUI Agents. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 14281–14290. [Google Scholar] [CrossRef]
- Yang, J.; Chen, X.; Qian, S.; Madaan, N.; Iyengar, M.; Fouhey, D.F.; Chai, J. LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent. In Proceedings of the IEEE International Conference on Robotics and Automation, ICRA 2024, Yokohama, Japan, 13–17 May 2024; IEEE: New York, NY, USA, 2024; pp. 7694–7701. [Google Scholar] [CrossRef]
- Wu, Q.; Bansal, G.; Zhang, J.; Wu, Y.; Li, B.; Zhu, E.; Jiang, L.; Zhang, X.; Zhang, S.; Liu, J.; et al. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversations. In Proceedings of the Conference on Language Models (COLM), Philadelphia, PA, USA, 7–9 October 2024. [Google Scholar]
- Schick, T.; Dwivedi-Yu, J.; Dessi, R.; Raileanu, R.; Lomeli, M.; Hambro, E.; Zettlemoyer, L.; Cancedda, N.; Scialom, T. Toolformer: Language Models Can Teach Themselves to Use Tools. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S., Eds.; Curran Associates, Inc.: New York, NY, USA, 2023; Volume 36, pp. 68539–68551. [Google Scholar]
- Shen, Y.; Song, K.; Tan, X.; Li, D.; Lu, W.; Zhuang, Y. HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S., Eds.; Curran Associates, Inc.: New York, NY, USA, 2023; Volume 36, pp. 38154–38180. [Google Scholar]
- Zhao, H.; Ge, W.; Chen, Y.C. LLM-Optic: Unveiling the Capabilities of Large Language Models for Universal Visual Grounding. arXiv 2024, arXiv:2405.17104. [Google Scholar] [CrossRef]
- Li, R.; Li, S.; Kong, L.; Yang, X.; Liang, J. SeeGround: See and Ground for Zero-Shot Open-Vocabulary 3D Visual Grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 14–15 June 2025. [Google Scholar]
- Shahriar, S.; Lund, B.D.; Mannuru, N.R.; Arshad, M.A.; Hayawi, K.; Bevara, R.V.K.; Mannuru, A.; Batool, L. Putting GPT-4o to the Sword: A Comprehensive Evaluation of Language, Vision, Speech, and Multimodal Proficiency. Appl. Sci. 2024, 14, 7782. [Google Scholar] [CrossRef]
- Kamath, A.; Singh, M.; LeCun, Y.; Misra, I.; Synnaeve, G.; Carion, N. MDETR–Modulated Detection for End-to-End Multi-Modal Understanding. arXiv 2021, arXiv:2104.12763. [Google Scholar]
Method | Venue | Model Task | w/o Training | Input | Overall Acc | Easy | Medium | Hard
---|---|---|---|---|---|---|---|---
TransVG [26] | ICCV’21 | VG | ✓ | | 23.29 | 23.32 | 22.87 | 23.71
 | | | | | 21.05 | 19.22 | 21.01 | 24.28
MDETR [71] | ICCV’21 | VG | ✓ | | 31.52 | 37.29 | 29.60 | 23.54
 | | | | | 20.48 | 17.24 | 20.79 | 25.77
OFA [4] | ICML’22 | multi-task | ✓ | | 35.33 | 41.91 | 31.95 | 27.44
 | | | | | 24.89 | 23.88 | 25.05 | 26.46
UNINEXT(H) [3] | CVPR’23 | multi-task | ✓ | | 41.86 | 54.29 | 35.18 | 27.27
 | | | | | 20.96 | 16.31 | 22.05 | 27.90
Florence-2(L) [56] | CVPR’24 | multi-task | ✓ | | 34.63 | 46.53 | 31.29 | 17.45
 | | | | | 0.02 | 0.00 | 0.00 | 0.06
Grounding DINO(B) [2] | ECCV’24 | open-set detection | ✓ | | 29.51 | 46.04 | 20.24 | 10.51
 | | | | | 1.44 | 1.23 | 2.04 | 1.34
KeViLI [1] | CVPR’23 | SK-VG | ✗ | | 30.01 | 33.75 | 26.55 | 27.14
LeViLM [1] | CVPR’23 | SK-VG | ✓ | | 7.55 | 13.08 | 4.38 | 1.26
 | | | ✗ | | 72.57 | 84.08 | 65.52 | 59.95
SplitGround | Ours | SK-VG | ✓ | | 72.99(+0.42)↑ | 73.15(−10.93)↓ | 70.19(+4.67)↑ | 75.66(+15.71)↑
Method | Easy | Medium | Hard
---|---|---|---
Original VLM | 73.38 | 65.81 | 65.10
+AAW | 71.63 | 69.31 | 72.27
+SCM | 74.14 | 65.65 | 67.16
SplitGround | 73.15 | 70.19 | 75.66
Configuration | Module | Test Acc
---|---|---
Default* | - | 72.99
One-round | AAW | 72.81(−0.18)↓
Three-round | AAW | 73.02(+0.03)↑
Bbox | AAW | 68.60(−4.39)↓
w/o the [type] | SCM | 71.84(−0.97)↓