Multiple Large AI Models’ Consensus for Object Detection—A Survey
Abstract
1. Introduction
- Conceptual unification. We formalize the notion of Multi-LLM Consensus for object detection and relate it to ensemble and consensus learning traditions in artificial intelligence.
- Comprehensive taxonomy. We categorize existing consensus paradigms into prompt-level, reasoning-to-detection, box-level, and hybrid designs, linking them with representative algorithms such as MoA (Mixture of Agents) [11], LLM-Blender (Ensembling Large Language Models with Pairwise Ranking and Generative Fusion) [12], DetGPT [13], and ContextDET [14].
- Survey of fusion algorithms. We summarize data fusion techniques—including NMS (Non-Maximum Suppression), Soft-NMS, WBF, ALFA, and ProbEn—and discuss how they extend to multimodal, reasoning-guided detection.
- Evaluation framework. We outline appropriate datasets, metrics, and benchmarks for assessing consensus-based systems, emphasizing calibration, hallucination reduction, and inter-agent diversity.
- Challenges and outlook. We identify key open problems (vocabulary alignment, calibration, efficiency, and bias) and highlight emerging research trends such as consensus-aware training and collaborative perception ecosystems.
2. Background of Object Detection
2.1. Closed Vocabulary Methods
2.2. Open Vocabulary and Language-Guided Methods
3. LLM-Guided and Reasoning-Driven Object Detection
3.1. LLM-Guided Approaches
- The LLM receives a natural-language instruction and interprets it to produce a structured plan—a list of target object types or phrases, possibly with attributes (e.g., “detect red cars,” “count people sitting at a table”).
- The plan is executed by an open-vocabulary detector, which performs localization for each text query and returns bounding boxes and confidence scores.
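The two steps above can be sketched as follows. This is a minimal illustration, not the implementation of any cited system: the JSON key `targets`, the `Detection` record, the `min_score` threshold, and the stub `fake_detector` are all assumptions introduced here, and a real pipeline would call an actual LLM and an open-vocabulary detector in their place.

```python
import json
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Detection:
    label: str
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels
    score: float


def parse_plan(llm_reply: str) -> List[str]:
    """Step 1: turn the LLM's JSON reply into a clean list of text queries."""
    plan = json.loads(llm_reply)
    seen, queries = set(), []
    for query in plan.get("targets", []):  # "targets" is a hypothetical key
        query = query.strip().lower()
        if query and query not in seen:  # de-duplicate, keep the LLM's order
            seen.add(query)
            queries.append(query)
    return queries


def run_plan(image, queries: List[str],
             detector: Callable[[object, str], List[Detection]],
             min_score: float = 0.3) -> List[Detection]:
    """Step 2: localize each query and keep sufficiently confident boxes."""
    results = []
    for query in queries:
        results.extend(d for d in detector(image, query) if d.score >= min_score)
    return results


def fake_detector(image, query: str) -> List[Detection]:
    """Stand-in for an open-vocabulary detector (canned outputs only)."""
    canned = {
        "red car": [Detection("red car", (10, 20, 110, 80), 0.82)],
        "person": [Detection("person", (200, 40, 240, 160), 0.25)],
    }
    return canned.get(query, [])


reply = '{"targets": ["Red car", "person", "red car "]}'
queries = parse_plan(reply)
detections = run_plan(None, queries, fake_detector)
```

Keeping the plan as structured data rather than free text makes the hand-off to the detector explicit and lets the planning and localization stages be evaluated separately.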
3.2. Evaluation and Challenges
- Latency and cost. Running large models like GPT-4V or InternVL in the detection loop is computationally expensive, especially for real-time applications.
- Stability and determinism. LLM outputs vary with temperature sampling and prompt phrasing; inconsistent reasoning leads to inconsistent detections.
- Grounding accuracy. Many LLMs lack explicit spatial understanding and rely on external detectors for localization; this dependency may propagate detector biases.
- Calibration and confidence. Integrating probabilistic outputs from heterogeneous modules (LLM and detector) remains challenging.
4. Multi-LLM Consensus and Ensemble Reasoning
- Prompt-level consensus: models agree on semantic understanding before detection (shared class or phrase lists).
- Reasoning-to-detection consensus: models produce reasoning chains that inform separate detectors.
- Box-level consensus: final spatial outputs (bounding boxes) are merged geometrically or probabilistically.
- Hybrid consensus: combinations of the above stages in a unified, hierarchical pipeline.
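As a rough sketch of how two of these stages might be chained, the code below first votes on labels (prompt-level) and then merges boxes (box-level). The box merge is a simplified, WBF-inspired greedy clustering with confidence-weighted averaging; it is not the reference Weighted Boxes Fusion algorithm. The function names, the `min_votes` and `iou_thr` parameters, and the `(label, box, score)` tuple layout are assumptions made for illustration.

```python
from collections import Counter


def vote_labels(label_lists, min_votes):
    """Prompt-level consensus: keep labels proposed by at least min_votes agents."""
    counts = Counter(label for labels in label_lists for label in set(labels))
    return sorted(label for label, n in counts.items() if n >= min_votes)


def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def fuse_boxes(detections, iou_thr=0.5):
    """Box-level consensus over (label, box, score) tuples with positive scores.

    Detections are visited in descending score order; each joins the first
    cluster whose highest-scoring member has the same label and IoU >= iou_thr.
    """
    clusters = []
    for det in sorted(detections, key=lambda d: d[2], reverse=True):
        for cluster in clusters:
            lead = cluster[0]
            if lead[0] == det[0] and iou(lead[1], det[1]) >= iou_thr:
                cluster.append(det)
                break
        else:
            clusters.append([det])

    fused = []
    for cluster in clusters:
        total = sum(d[2] for d in cluster)
        # Confidence-weighted average of each coordinate.
        box = tuple(sum(d[1][i] * d[2] for d in cluster) / total for i in range(4))
        fused.append((cluster[0][0], box, total / len(cluster)))
    return fused


agent_labels = [["car", "person"], ["car", "dog"], ["car", "person"]]
agreed = vote_labels(agent_labels, min_votes=2)

boxes = [
    ("car", (0, 0, 10, 10), 0.9),
    ("car", (1, 1, 11, 11), 0.6),
    ("person", (50, 50, 60, 70), 0.7),
]
merged = fuse_boxes([b for b in boxes if b[0] in agreed])
```

In this toy run, "dog" is dropped at the prompt level because only one agent proposed it, and the two overlapping "car" boxes collapse into a single box pulled toward the more confident one.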
4.1. Prompt-Level Consensus
4.2. Reasoning-to-Detection Consensus
4.3. Box-Level Fusion Consensus
4.4. Hybrid and Hierarchical Consensus
4.5. Scaling and Efficiency
4.6. Challenges and Research Directions
5. Data Fusion Algorithms for Detection Consensus
5.1. Classical Fusion Strategies
5.2. Fusion of Heterogeneous Outputs
- Coordinate normalization: mapping all spatial outputs to the same image reference frame and resolution.
- Label harmonization: aligning class names using semantic embeddings (e.g., CLIP text space) to unify synonyms and resolve ambiguities (“bike” vs. “bicycle”).
- Confidence calibration: normalizing scores across models through temperature scaling or isotonic regression to avoid bias toward overconfident agents.
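The three preprocessing steps above might look like the following in practice. This is a sketch under stated assumptions: the `CANONICAL` lookup table is a hypothetical stand-in for embedding-based label matching (the text mentions CLIP text space, which is not used here so the example stays self-contained), and the per-agent `temperature` would normally be fitted on held-out data rather than chosen by hand.

```python
import math


def normalize_box(box, width, height):
    """Coordinate normalization: pixel (x1, y1, x2, y2) -> relative [0, 1] frame."""
    x1, y1, x2, y2 = box
    return (x1 / width, y1 / height, x2 / width, y2 / height)


# Hypothetical synonym table; replaces semantic-embedding matching for brevity.
CANONICAL = {"bike": "bicycle", "bicycle": "bicycle", "motorbike": "motorcycle"}


def harmonize_label(label):
    """Label harmonization: map synonyms to one canonical class name."""
    key = label.strip().lower()
    return CANONICAL.get(key, key)


def temperature_scale(score, temperature):
    """Confidence calibration: rescale a probability through its logit.

    temperature > 1 softens overconfident scores; temperature = 1 is a no-op.
    """
    eps = 1e-6
    p = min(max(score, eps), 1.0 - eps)  # avoid log(0) at the extremes
    logit = math.log(p / (1.0 - p))
    return 1.0 / (1.0 + math.exp(-logit / temperature))


def harmonize(detection, width, height, temperature):
    """Apply all three steps to one (label, box, score) detection."""
    label, box, score = detection
    return (
        harmonize_label(label),
        normalize_box(box, width, height),
        temperature_scale(score, temperature),
    )


raw = ("Bike", (64, 48, 320, 240), 0.9)
clean = harmonize(raw, width=640, height=480, temperature=2.0)
```

Applying these steps per agent before fusion means that downstream box merging compares like with like: the same coordinate frame, the same class names, and scores on a comparable scale.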
5.3. Hierarchical and Multi-Stage Fusion
5.4. Adaptive and Trust-Weighted Fusion
5.5. Uncertainty-Aware and Probabilistic Fusion
5.6. Efficiency Considerations
5.7. Integration in Multi AI Models’ Consensus Pipelines
6. Evaluation and Benchmarking of Consensus-Based Detection
6.1. Datasets and Benchmarks
6.2. Metrics for Spatial Accuracy
6.3. Metrics for Semantic and Reasoning Quality
6.4. Calibration and Uncertainty Metrics
6.5. Agreement and Diversity Measures
6.6. Hallucination, Robustness, Efficiency and Cost Evaluation
6.7. Best Practices for Evaluation
- Standardize prompts and schemas. Use identical instruction templates and JSON structures across agents to ensure comparability.
- Report both spatial and semantic metrics. mAP alone may obscure linguistic or reasoning improvements.
- Visualize qualitative results. Show consensus heatmaps and reasoning traces for interpretability.
- Measure diversity explicitly. Diversity metrics explain why certain ensembles succeed or fail.
- Include efficiency analysis. Present throughput and scaling trends alongside accuracy gains.
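One simple way to measure diversity explicitly is the mean pairwise disagreement between the label sets that agents propose for the same image. The sketch below uses Jaccard distance for this; the choice of Jaccard distance, the function names, and the agent names are assumptions made for illustration, and other diversity measures could be substituted.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two label collections (1.0 if both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_disagreement(agent_labels):
    """Average Jaccard distance over all pairs of agents.

    agent_labels maps an agent name to the labels it proposed for one image.
    Returns 0.0 when agents fully agree (or when fewer than two are given).
    """
    pairs = list(combinations(agent_labels.values(), 2))
    if not pairs:
        return 0.0
    return sum(1.0 - jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical outputs of three agents on the same image.
outputs = {
    "agent_a": ["car", "person"],
    "agent_b": ["car", "dog"],
    "agent_c": ["car", "person"],
}
diversity = mean_pairwise_disagreement(outputs)
```

Reporting such a score next to accuracy helps separate ensembles whose members make different errors from ensembles whose members are near-duplicates of one another.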
7. Challenges and Future Research Directions
7.1. Current Challenges
7.2. Emerging Research Trends
8. Conclusions
Author Contributions
Funding
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| AI | Artificial Intelligence |
| ALFA | Agglomerative Late Fusion Algorithm |
| AP | Average Precision |
| AR | Average Recall |
| CLIP | Contrastive Language–Image Pre-training |
| COCO | Common Objects in Context |
| CVinW | Computer Vision in the Wild |
| DETR | DEtection TRansformer |
| ECE | Expected Calibration Error |
| ELEVATER | Evaluating Language-Augmented Visual Models (benchmark/toolkit) |
| FPS | Frames Per Second |
| GLIP | Grounded Language-Image Pre-training |
| GPT-4V | GPT-4 with Vision |
| ICinW | Image Classification in the Wild |
| IoU | Intersection over Union |
| LaMI-DETR | Language Model Instruction DETR |
| LLaVA | Large Language and Vision Assistant |
| LLM | Large Language Model |
| LLM-Blender | Ensembling Large Language Models with Pairwise Ranking and Generative Fusion |
| LLMDet | Learning Strong Open-Vocabulary Object Detectors under LLM Supervision |
| LVIS | Large Vocabulary Instance Segmentation |
| MAD | Multi-Agent Debate |
| MMBench | Multimodal Benchmark |
| MME | Multimodal Evaluation |
| MLLM | Multimodal Large Language Model |
| MoA | Mixture of Agents |
| NMS | Non-Maximum Suppression |
| ODinW | Object Detection in the Wild |
| OV | Open-Vocabulary |
| OVOD | Open-Vocabulary Object Detection |
| OWL-ViT | Vision Transformer for Open-World Localization |
| POPE | Polling-based Object Probing Evaluation (hallucination benchmark for VLMs) |
| ProbEn | Probabilistic Ensembling |
| RefCOCO | Referring Expressions COCO |
| RefCOCO+ | Referring Expressions COCO+ |
| RefCOCOg | Referring Expressions COCOg |
| VLM | Vision–Language Model |
| VOC | PASCAL Visual Object Classes |
| VQA | Visual Question Answering |
| WBF | Weighted Boxes Fusion |
| YOLO-World | You Only Look Once (YOLO)—World (open-vocabulary variant) |
| Qwen-VL | Qwen Vision–Language |
| InternVL | Intern Vision–Language |
References
- OpenAI. GPT-4V(ision) System Card. 2023. Available online: https://openai.com/index/gpt-4v-system-card/ (accessed on 23 October 2025).
- Liu, H.; Li, C.; Li, Y.; Lee, Y.J. Improved Baselines with Visual Instruction Tuning. arXiv 2024, arXiv:2310.03744.
- LLaVA-NeXT: Improved Reasoning, OCR, and World Knowledge. 2024. Available online: https://llava-vl.github.io/blog/2024-01-30-llava-next/ (accessed on 4 October 2025).
- Chen, Z.; Wu, J.; Wang, W.; Su, W.; Chen, G.; Xing, S.; Zhong, M.; Zhang, Q.; Zhu, X.; Lu, L.; et al. InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks. arXiv 2024, arXiv:2312.14238.
- Yin, S.; Fu, C.; Zhao, S.; Li, K.; Sun, X.; Xu, T.; Chen, E. A Survey on Multimodal Large Language Models. Natl. Sci. Rev. 2024, 11, nwae403.
- Liu, H.; Xue, W.; Chen, Y.; Chen, D.; Zhao, X.; Wang, K.; Hou, L.; Li, R.; Peng, W. A Survey on Hallucination in Large Vision-Language Models. arXiv 2024, arXiv:2402.00253.
- Zhu, C.; Chen, L. A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 8954–8975.
- Solovyev, R.; Wang, W.; Gabruseva, T. Weighted Boxes Fusion: Ensembling Boxes from Different Object Detection Models. Image Vis. Comput. 2021, 107, 104117.
- Razinkov, E.; Saveleva, I.; Matas, J. ALFA: Agglomerative Late Fusion Algorithm for Object Detection. In Proceedings of the 2018 24th International Conference on Pattern Recognition (ICPR), Beijing, China, 20–24 August 2018; pp. 2594–2599.
- Chen, Y.T.; Shi, J.; Ye, Z.; Mertz, C.; Ramanan, D.; Kong, S. Multimodal Object Detection via Probabilistic Ensembling. In Proceedings of the Computer Vision—ECCV 2022, Tel Aviv, Israel, 23–27 October 2022; Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T., Eds.; Springer: Cham, Switzerland, 2022; pp. 139–158.
- Wang, J.; Wang, J.; Athiwaratkun, B.; Zhang, C.; Zou, J. Mixture-of-Agents Enhances Large Language Model Capabilities. arXiv 2024, arXiv:2406.04692.
- Jiang, D.; Ren, X.; Lin, B.Y. LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Long Papers), Toronto, ON, Canada, 9–14 July 2023.
- Pi, R.; Gao, J.; Diao, S.; Pan, R.; Dong, H.; Zhang, J.; Yao, L.; Han, J.; Xu, H.; Kong, L.; et al. DetGPT: Detect What You Need via Reasoning. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023.
- Zang, Y.; Li, W.; Han, J.; Zhou, K.; Loy, C.C. Contextual Object Detection with Multimodal Large Language Models. Int. J. Comput. Vis. 2025, 133, 825–843.
- Liu, L.; Ouyang, W.; Wang, X.; Fieguth, P.W.; Chen, J.; Liu, X.; Pietikäinen, M. Deep Learning for Generic Object Detection: A Survey. Int. J. Comput. Vis. 2020, 128, 261–318.
- Zou, Z.; Chen, K.; Shi, Z.; Guo, Y.; Ye, J. Object Detection in 20 Years: A Survey. Proc. IEEE 2023, 111, 257–276.
- Khan, S.; Naseer, M.; Hayat, M.; Zamir, S.W.; Khan, F.S.; Shah, M. Transformers in Vision: A Survey. ACM Comput. Surv. 2022, 54, 200:1–200:41.
- Wu, J.; Li, X.; Xu, S.; Yuan, H.; Ding, H.; Yang, Y.; Li, X.; Zhang, J.; Tong, Y.; Jiang, X.; et al. Towards Open Vocabulary Learning: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 5092–5113.
- Guo, K.; Huang, Y.; Jia, T.; Song, X.; Sun, S.; Wei, H.; Han, X.F.; Huang, S.; Strisciuglio, N.; Li, S. Visual Grounding in 2D and 3D: A Unified Perspective and Survey. Inf. Fusion 2026, 126, 103625.
- Chen, F.; Zhang, D.; Han, M.; Chen, X.; Shi, J.; Xu, S.; Xu, B. VLP: A Survey on Vision-Language Pre-training. Mach. Intell. Res. 2023, 20, 38–56.
- Zhang, J.; Huang, J.; Jin, S.; Lu, S. Vision-Language Models for Vision Tasks: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 5625–5644.
- Wu, J.; Gan, W.; Chen, Z.; Wan, S.; Yu, P.S. Multimodal Large Language Models: A Survey. In Proceedings of the 2023 IEEE International Conference on Big Data (BigData), Sorrento, Italy, 15–18 December 2023; pp. 2247–2256.
- Caffagni, D.; Cocchi, F.; Barsellotti, L.; Moratelli, N.; Sarto, S.; Baraldi, L.; Cornia, M.; Cucchiara, R. The Revolution of Multimodal Large Language Models: A Survey. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 11–16 August 2024; pp. 13590–13618.
- Bai, Z.; Wang, P.; Xiao, T.; He, T.; Han, Z.; Zhang, Z.; Shou, M.Z. Hallucination of Multimodal Large Language Models: A Survey. arXiv 2025, arXiv:2404.18930.
- Li, Y.; Du, Y.; Zhou, K.; Wang, J.; Zhao, W.X.; Wen, J.R. Evaluating Object Hallucination in Large Vision-Language Models. arXiv 2023, arXiv:2305.10355.
- Kaul, P.; Li, Z.; Yang, H.; Dukler, Y.; Swaminathan, A.; Taylor, C.J.; Soatto, S. THRONE: An Object-Based Hallucination Benchmark for the Free-Form Generations of Large Vision-Language Models. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 27218–27228.
- Guan, T.; Liu, F.; Wu, X.; Xian, R.; Li, Z.; Liu, X.; Wang, X.; Chen, L.; Huang, F.; Yacoob, Y.; et al. HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models. arXiv 2024, arXiv:2310.14566.
- Pham, N.; Schott, M. H-POPE: Hierarchical Polling-based Probing Evaluation of Hallucinations in Large Vision-Language Models. arXiv 2024, arXiv:2411.04077.
- Seth, A.; Manocha, D.; Agarwal, C. Towards a Systematic Evaluation of Hallucinations in Large-Vision Language Models. arXiv 2025, arXiv:2412.20622.
- Mienye, I.D.; Swart, T.G. Ensemble Large Language Models: A Survey. Information 2025, 16, 688.
- Chen, Z.; Li, J.; Chen, P.; Li, Z.; Sun, K.; Luo, Y.; Mao, Q.; Li, M.; Xiao, L.; Yang, D.; et al. Harnessing Multiple Large Language Models: A Survey on LLM Ensemble. arXiv 2025, arXiv:2502.18036.
- Ashiga, M.; Jie, W.; Wu, F.; Voskanyan, V.; Dinmohammadi, F.; Brookes, P.; Gong, J.; Wang, Z. Ensemble Learning for Large Language Models in Text and Code Generation: A Survey. arXiv 2025, arXiv:2503.13505.
- Wang, X.; Wei, J.; Schuurmans, D.; Le, Q.; Chi, E.; Narang, S.; Chowdhery, A.; Zhou, D. Self-Consistency Improves Chain of Thought Reasoning in Language Models. arXiv 2023, arXiv:2203.11171.
- Du, Y.; Li, S.; Torralba, A.; Tenenbaum, J.B.; Mordatch, I. Improving Factuality and Reasoning in Language Models through Multiagent Debate. arXiv 2023, arXiv:2305.14325.
- Tekin, S.F.; Ilhan, F.; Huang, T.; Hu, S.; Liu, L. LLM-TOPLA: Efficient LLM Ensemble by Maximising Diversity. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2024, Miami, FL, USA, 12–16 November 2024; pp. 11951–11966.
- Sun, G.; Kagrecha, A.; Manakul, P.; Woodland, P.; Gales, M. SkillAggregation: Reference-free LLM-Dependent Aggregation. arXiv 2024, arXiv:2410.10215.
- Omar, M.; Glicksberg, B.S.; Nadkarni, G.N.; Klang, E. Refining LLMs outputs with iterative consensus ensemble (ICE). Comput. Biol. Med. 2025, 196, 110731.
- Zheng, L.; Chiang, W.L.; Sheng, Y.; Zhuang, S.; Wu, Z.; Zhuang, Y.; Lin, Z.; Li, Z.; Li, D.; Xing, E.P.; et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv 2023, arXiv:2306.05685.
- Cui, Y.; Fu, H.; Zhang, H.; Wang, L.; Zuo, C. Free-MAD: Consensus-Free Multi-Agent Debate. arXiv 2025, arXiv:2509.11035.
- Chen, L.; Davis, J.Q.; Hanin, B.; Bailis, P.; Stoica, I.; Zaharia, M.; Zou, J. Are More LLM Calls All You Need? Towards Scaling Laws of Compound Inference Systems. arXiv 2024, arXiv:2403.02419.
- Nair-Kanneganti, A.; Chan, T.J.; Goldfinger, S.; Mackay, E.; Anthony, B.; Pouch, A. Increasing LLM response trustworthiness using voting ensembles. arXiv 2025, arXiv:2510.04048.
- Choi, H.K.; Zhu, X.; Li, S. Debate or Vote: Which Yields Better Decisions in Multi-Agent Large Language Models? arXiv 2025, arXiv:2508.17536.
- Feng, T.; Zhang, H.; Lei, Z.; Han, P.; Patwary, M.; Shoeybi, M.; Catanzaro, B.; You, J. FusionFactory: Fusing LLM Capabilities with Multi-LLM Log Data. arXiv 2025, arXiv:2507.10540.
- Zhang, W.; Zhang, Z.; Hu, D.; Xie, Y.; Li, K.; Liu, Q.; Zhang, Z. Ensemble Learning in Vision-Language Models: Challenges and Opportunities. Pattern Recognit. Lett. 2024, 181, 117–125.
- Rahman, M.; Chen, X.; Hossain, R. Object Detection Ensemble Methods: A Comparative Study. IEEE Access 2022, 10, 88256–88270.
- Gao, M.; Li, K.; Wu, L.; Chen, K. A Survey of Ensemble Learning for Object Detection. Pattern Recognit. Lett. 2023, 178, 44–56.
- Bodla, N.; Singh, B.; Chellappa, R.; Davis, L.S. Soft-NMS—Improving Object Detection with One Line of Code. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 5561–5569.
- Neubeck, A.; Van Gool, L. Efficient Non-Maximum Suppression. In Proceedings of the 18th International Conference on Pattern Recognition (ICPR), Hong Kong, China, 20–24 August 2006; Volume 3, pp. 850–855.
- Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
- Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
- Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 936–944.
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37.
- Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2999–3007.
- Law, H.; Deng, J. CornerNet: Detecting Objects as Paired Keypoints. In Proceedings of the Computer Vision—ECCV 2018, Munich, Germany, 8–14 September 2018; Springer: Heidelberg/Berlin, Germany, 2018; Volume 11218, pp. 765–781.
- Zhou, X.; Wang, D.; Krähenbühl, P. Objects as Points. arXiv 2019, arXiv:1904.07850.
- Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully Convolutional One-Stage Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9627–9636.
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the Computer Vision—ECCV 2020, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; Volume 12346, pp. 213–229.
- Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable Transformers for End-to-End Object Detection. arXiv 2021, arXiv:2010.04159.
- Li, F.; Zhang, H.; Liu, S.; Guo, J.; Ni, L.M.; Zhang, L. DN-DETR: Accelerate DETR Training by Introducing Query DeNoising. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 2239–2251.
- Zhang, H.; Li, F.; Liu, S.; Zhang, L.; Su, H.; Zhu, J.; Ni, L.M.; Shum, H.Y. DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection. arXiv 2022, arXiv:2203.03605.
- Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-time Object Detection. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 16965–16974.
- Lv, W.; Zhao, Y.; Chang, Q.; Huang, K.; Wang, G.; Liu, Y. RT-DETRv2: Improved Baseline with Bag-of-Freebies for Real-Time Detection Transformer. arXiv 2024, arXiv:2407.17140.
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. arXiv 2021, arXiv:2103.00020.
- Li, L.H.; Zhang, P.; Zhang, H.; Yang, J.; Li, C.; Zhong, Y.; Wang, L.; Yuan, L.; Zhang, L.; Hwang, J.N.; et al. Grounded Language-Image Pre-training. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 10955–10965.
- Minderer, M.; Gritsenko, A.; Stone, A.; Neumann, M.; Weissenborn, D.; Dosovitskiy, A.; Mahendran, A.; Arnab, A.; Dehghani, M.; Shen, Z.; et al. Simple Open-Vocabulary Object Detection. In Proceedings of the Computer Vision—ECCV 2022, Tel Aviv, Israel, 23–27 October 2022; Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T., Eds.; Springer: Cham, Switzerland, 2022; pp. 728–755.
- Minderer, M.; Gritsenko, A.; Houlsby, N. Scaling Open-Vocabulary Object Detection. arXiv 2024, arXiv:2306.09683.
- Liu, S.; Zeng, Z.; Ren, T.; Li, F.; Zhang, H.; Yang, J.; Jiang, Q.; Li, C.; Yang, J.; Su, H.; et al. Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection. arXiv 2024, arXiv:2303.05499.
- Ren, T.; Jiang, Q.; Liu, S.; Zeng, Z.; Liu, W.; Gao, H.; Huang, H.; Ma, Z.; Jiang, X.; Chen, Y.; et al. Grounding DINO 1.5: Advance the “Edge” of Open-Set Object Detection. arXiv 2024, arXiv:2405.10300.
- Cheng, T.; Song, L.; Ge, Y.; Liu, W.; Wang, X.; Shan, Y. YOLO-World: Real-Time Open-Vocabulary Object Detection. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 16901–16911.
- Bai, J.; Bai, S.; Yang, S.; Wang, S.; Tan, S.; Wang, P.; Lin, J.; Zhou, C.; Zhou, J. Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. arXiv 2023, arXiv:2308.12966.
- Du, P.; Wang, Y.; Sun, Y.; Wang, L.; Liao, Y.; Zhang, G.; Ding, E.; Wang, Y.; Wang, J.; Liu, S. LaMI-DETR: Open-Vocabulary Detection with Language Model Instruction. arXiv 2024, arXiv:2407.11335.
- Fu, S.; Yang, Q.; Mo, Q.; Yan, J.; Wei, X.; Meng, J.; Xie, X.; Zheng, W.S. LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models. arXiv 2025, arXiv:2501.18954.
- Liu, Y.; Duan, H.; Zhang, Y.; Li, B.; Zhang, S.; Zhao, W.; Yuan, Y.; Wang, J.; He, C.; Liu, Z.; et al. MMBench: Is Your Multi-modal Model an All-Around Player? In Proceedings of the Computer Vision—ECCV 2024, Milan, Italy, 29 September–4 October 2024; Springer: Heidelberg/Berlin, Germany, 2024.
- Fu, C.; Chen, P.; Shen, Y.; Qin, Y.; Zhang, M.; Lin, X.; Yang, J.; Zheng, X.; Li, K.; Sun, X.; et al. MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models. arXiv 2025, arXiv:2306.13394.
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2014; Springer: Heidelberg/Berlin, Germany, 2014; pp. 740–755.
- Gupta, A.; Dollár, P.; Girshick, R. LVIS: A Dataset for Large Vocabulary Instance Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 5356–5364.
- Li, C.; Liu, H.; Li, L.H.; Zhang, P.; Aneja, J.; Yang, J.; Jin, P.; Hu, H.; Liu, C.; Lee, Y.J.; et al. ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models. In Proceedings of the Advances in Neural Information Processing Systems 35 (NeurIPS 2022 Datasets and Benchmarks Track), Virtual, 28 November 2022.
- Yu, L.; Poirson, P.; Yang, S.; Berg, A.C.; Berg, T.L. Modeling Context in Referring Expressions. arXiv 2016, arXiv:1608.00272.
- Plummer, B.A.; Wang, L.; Cervantes, C.M.; Caicedo, J.C.; Hockenmaier, J.; Lazebnik, S. Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 2641–2649.
- Krishna, R.; Zhu, Y.; Groth, O.; Johnson, J.; Hata, K.; Kravitz, J.; Chen, S.; Kalantidis, Y.; Li, L.J.; Shamma, D.A.; et al. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. Int. J. Comput. Vis. (IJCV) 2017, 123, 32–73.
- Dietterich, T.G. Ensemble methods in machine learning. In Proceedings of the Multiple Classifier Systems, First International Workshop (MCS), Cagliari, Italy, 21–23 June 2000; Springer: Heidelberg/Berlin, Germany, 2000; pp. 1–15.
- Guo, C.; Pleiss, G.; Sun, Y.; Weinberger, K.Q. On Calibration of Modern Neural Networks. arXiv 2017, arXiv:1706.04599.
- Zhou, J.; He, X.; Sun, L.; Xu, J.; Chen, X.; Chu, Y.; Zhou, L.; Liao, X.; Zhang, B.; Afvari, S.; et al. Pre-trained multimodal large language model enhances dermatological diagnosis using SkinGPT-4. Nat. Commun. 2024, 15, 5649.
- Boostani, M.; Bánvölgyi, A.; Zouboulis, C.C.; Goldfarb, N.; Suppa, M.; Goldust, M.; Lőrincz, K.; Kiss, T.; Nádudvari, N.; Holló, P.; et al. Large language models in evaluating hidradenitis suppurativa from clinical images. JEADV 2025, 15, e1052–e1055.




| Category | Representative Surveys/Works | Main Object of Study | Open-Vocabulary | Hallucination | Ensemble | Box Fusion |
|---|---|---|---|---|---|---|
| Classic object-detection surveys | [15,16] | Closed-set deep object detection (two-stage, one-stage, anchor-free, transformer-based). | no | no | no | yes (mainly NMS and detector metrics) |
| Transformers in vision/detection | [17] | Vision transformers and DETR-style architectures across vision tasks, including detection. | no | no | no | no |
| Open-vocabulary detection/segmentation surveys | [7,18] | Open-vocabulary detection and segmentation, zero-shot generalization, language-guided recognition. | yes | partly | no | no |
| Visual grounding surveys | [19] | 2D/3D visual grounding and referring expression comprehension. | yes (grounding) | no | no | no |
| Vision–language pre-training/VLMs for vision tasks | [20,21] | Vision–language pre-training and VLMs applied to detection, segmentation, VQA, captioning, etc. | partly | partly | no | no |
| Multimodal LLM surveys (MLLMs) | [5,22,23] | Architectures, capabilities and benchmarks of multimodal LLMs. | partly | partly | no | no |
| Hallucination and LVLM hallucination surveys | [6,24] | Taxonomy, causes and mitigation of hallucination in VLMs/MLLMs. | no | yes | no | no |
| Hallucination benchmarks/evaluation tools | [25,26,27,28,29] | Benchmarks for diagnosing object-based and free-form hallucinations in LVLMs. | partly | yes | no | no |
| LLM-ensemble/multi-LLM surveys (text domain) | [30,31,32] | Surveys of ensemble techniques for LLMs (voting, stacking, mixture-of-experts, etc.). | no | no | yes (text) | no |
| Multi-LLM ensemble and debate frameworks (methods) | [11,12,33,34,35,36,37,38,39,40,41,42,43] | Mixture-of-agents, self-consistency, debate, judge-based selection, diversity- and skill-aware aggregation, compound inference scaling. | no | no | yes (core topic, text only) | no |
| VLM ensemble survey | [44] | Ensemble learning for vision–language models, including perception tasks. | partly | partly | yes (VLMs) | partly |
| Object-detection ensemble surveys and fusion algorithms | [8,9,10,45,46,47,48] | Ensemble learning for object detection; NMS variants, Weighted Boxes Fusion, ALFA, probabilistic ensembling. | no | no | no (multi-detector) | yes (box-level fusion is central) |
| This survey (ours) | – | Multi-LLM/VLM consensus for object detection: unified taxonomy (prompt/reasoning/box/hybrid), mapping of text-domain ensembles to detection pipelines, and joint treatment of fusion, evaluation, hallucination and calibration. | yes | yes (detection-specific) | yes (multi-LLM/VLM) | yes |
| System | Type | Open-Voc. | Inputs | Outputs | Highlights |
|---|---|---|---|---|---|
| Classical detectors | |||||
| Faster R-CNN (2015) [51] | Two-stage CNN | no | Image | Boxes + classes | RPN + ROI pooling; strong accuracy |
| RetinaNet (2017) [55] | One-stage CNN | no | Image | Boxes + classes | Focal Loss for class imbalance |
| FCOS (2019) [58] | One-stage anchor-free | no | Image | Boxes + classes | Per-pixel center + distances |
| DETR (2020) [59] | Transformer set prediction | no | Image | Boxes + classes | Bipartite matching, simple pipeline |
| Deformable DETR (2021) [60] | Transformer | no | Image | Boxes + classes | Sparse attention, faster convergence |
| RT-DETR [63]/v2 (2023/24) [64] | Transformer (real-time) | no | Image | Boxes + classes | Real-time set-prediction detector |
| Open-vocabulary/language-grounded detectors | |||||
| CLIP (2021) [65] | Vision–language encoder | yes (zero-shot cls.) | Image + text | Global logits (image–text) | Contrastive pretraining for OV features |
| GLIP (2022) [66] | OV detector (grounded pretrain) | yes | Image + text | Boxes + text-aligned labels | Unifies grounding + detection |
| OWL-ViT [67]/OWLv2 (2022/23) [68] | OV ViT detector | yes | Image + text | Boxes + labels | Zero-shot detection via text queries |
| Grounding-DINO (2023) [69] | OV transformer detector | yes | Image + text | Boxes + phrases | Strong grounding; open-set |
| Grounding-DINO 1.5 (2024) [70] | OV detector (improved) | yes | Image + text | Boxes + phrases | Better edge/open-set perf. |
| YOLO-World (2024) [71] | OV detector (real-time) | yes | Image + text | Boxes + labels | Real-time open-vocabulary |
| Multimodal LLMs (VLMs/MLLMs) | |||||
| LLaVA-NeXT (2024) [3] | MLLM (vision–language) | yes (reasoning) | Image + text | Free-form text; coords via prompting | Strong VQA/analysis; instr. tuned |
| Qwen-VL (2024) [72] | MLLM (vision–language) | yes (reasoning) | Image + text | Free-form; OCR; grounding-like | Versatile perception/localization |
| InternVL (2024) [4] | Vision–language foundation | yes (reasoning) | Image + text | Free-form; grounding-style outputs | Scaled multimodal pretraining |
| LLM-guided detection/reasoning-to-detection | |||||
| DetGPT (2023) [13] | LLM-planned detection | yes (via OV det.) | Image + instruction | Boxes + labels | LLM plans; OV det. localizes |
| ContextDET (2023) [14] | LLM context reasoning | yes | Image + context text | Boxes + labels | Context-aware cues for det. |
| LaMI-DETR (2024) [73] | Language-guided DETR | yes | Image + text | Boxes + labels | Instruction-level guidance |
| LLMDet (2025) [74] | LLM-supervised training | yes | Image + text (LLM labels) | Boxes + labels | Pseudo-labels from LLM for OV det. |
| Consensus/fusion (spatial stage) | |||||
| Soft-NMS (2017) [47] | Spatial fusion | — | Boxes and scores | Fused boxes | Score decay instead of suppression |
| WBF (2019/2021) [8] | Spatial fusion | — | Boxes and scores | Weighted fused boxes | Confidence-weighted averaging |
| ALFA (2019) [9] | Spatial fusion (clustering) | — | Boxes and scores | Cluster-based fusion | Agglomerative late fusion |
| ProbEn (2022) [10] | Probabilistic ensemble | — | Boxes and scores | Uncertainty-aware fusion | Bayesian modeling of boxes |
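
The two simplest spatial-stage entries above, score decay (Soft-NMS) and confidence-weighted averaging (WBF), can be sketched in a few lines. This is a minimal illustration rather than the reference implementation of either method: the function names `iou`, `soft_nms`, and `fuse_cluster`, the `(x1, y1, x2, y2)` box layout, and the default `sigma` and `score_thr` values are assumptions made for this example. The WBF part only averages boxes that are already grouped and omits the clustering and model-count rescaling of the full algorithm.

```python
import math


def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_thr=0.001):
    """Gaussian Soft-NMS: decay the scores of overlapping boxes instead of dropping them."""
    candidates = list(zip(boxes, scores))
    kept = []
    while candidates:
        # Pick the currently highest-scoring box.
        best = max(range(len(candidates)), key=lambda i: candidates[i][1])
        top_box, top_score = candidates.pop(best)
        kept.append((top_box, top_score))
        # Decay every remaining score according to its overlap with the picked box.
        decayed = []
        for box, score in candidates:
            score *= math.exp(-(iou(top_box, box) ** 2) / sigma)
            if score >= score_thr:  # assumed cut-off for negligible scores
                decayed.append((box, score))
        candidates = decayed
    return kept


def fuse_cluster(boxes, scores):
    """Confidence-weighted average of boxes that were already grouped together."""
    total = sum(scores)
    fused = tuple(
        sum(s * b[k] for b, s in zip(boxes, scores)) / total for k in range(4)
    )
    return fused, total / len(scores)  # fused box and mean confidence
```

In a multi-model setting, `fuse_cluster` would be applied to boxes from different detectors that overlap above some IoU threshold, while `soft_nms` would typically be run on the pooled detections of one class.
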

| Dataset/Benchmark | Domain | Task | P/R/B |
|---|---|---|---|
| 1. Object Detection and Segmentation | |||
| COCO [77] | General objects, everyday scenes | Detection, segmentation | B |
| LVIS [78] | Long-tailed object categories | Detection, instance segmentation | B |
| ODinW/ELEVATER [79] | Open-domain evaluation across multiple visual categories | Open-vocabulary detection | R + B |
| 2. Phrase Grounding and Region-Level Reasoning | |||
| RefCOCO/RefCOCO+/RefCOCOg [80] | Referring expressions in images | Phrase grounding, referring detection | P + R |
| Flickr30k Entities [81] | Image–sentence region correspondences | Region-to-phrase grounding | P + R |
| Visual Genome [82] | Scene graphs, dense region annotations | Grounding, region-level reasoning | P + R |
| 3. Hallucination and Robustness Benchmarks | |||
| POPE [25] | Object hallucination detection | Hallucination evaluation | P |
| H-POPE [28] | Holistic hallucination benchmark | Multi-object hallucination | P |
| THRONE [26] | Robustness and factual consistency | Compositional reasoning | P |
| Hallucinogen [29] | Hallucination error taxonomy | Diagnostic evaluation | P |
| HallusionBench [27] | Hallucination categories for LVLMs | Diagnostic perception evaluation | P |
| 4. General Multimodal Reasoning | |||
| MMBench [75] | Multimodal understanding across perception and logic | QA, captioning, perception | P + R |
| MME [76] | Comprehensive multimodal evaluation | Perception, grounding, reasoning | P + R |

| Method | Mechanism | Judge/Ranker | Call Cost | Aggregation Type | Fits Detection Stage | Pros/Cons |
|---|---|---|---|---|---|---|
| Self-Consistency [33] | Sample multiple CoT traces, then majority-vote on the final answer | No | k samples (1 model) | Voting (selection) | Prompt-level (classes/queries) | + Simple, robust; − needs deterministic schema |
| Mixture-of-Agents (MoA) [11] | Multiple models answer, then meta-consensus | Optional | N models | Voting/weighted | Prompt-level; can steer detectors | + Heterogeneity gains; − coordination cost |
| LLM-Blender [12] | Pairwise ranking, then generative fusion | Yes (PairRanker) | N + ranker + fuser | Rank-and-fuse (gen.) | Prompt-level (schema merging) | + Synthesizes best of candidates; − ranker overhead |
| LLM-as-a-Judge [38] | External LLM scores candidates | Yes | N + judge | Selection (scoring) | Prompt-level (JSON selection) | + Close to human prefs; − judge bias/cost |
| Multi-Agent Debate (MAD) [34] | Iterative critique/defense until convergence | No | Iterative rounds | Dialog-based consensus | Prompt-level, pre-detector (class list) | + Improves factuality; − latency |
| Free-MAD [39] | Debate without an explicit controller (decentralized) | No | Iterative rounds | Decentralized debate | Prompt-level | + Lower orchestration; − stability varies |
| LLM-TOPLA [35] | Diversity-maximizing ensemble pruning | No | N (pruned) | Diversity-aware voting | Prompt-level (class candidates) | + Better diversity/efficiency; − needs metrics |
| ICE (Iterative Consensus Ensemble) [37] | Iterative re-asking and cross-checking | No | Iterative N | Iterative voting/refinement | Prompt-level, label vetting (queries) | + Large gains in reliability; − more calls |
| SkillAggregation [36] | Reference-free, model-dependent weighting | No | N | Soft weighting (skills) | Prompt-level | + No labels needed; − weighting estimation |
| Voting Ensembles (Trustworthy) [41] | Strict/abstention voting for trust | No | N | Conservative vote | Prompt-level | + Higher trust; − lower coverage |
| FusionFactory [43] | Fuse multi-LLM logs (query/thought/model levels) | Optional | Offline + N | Multi-level fusion | Prompt-level (playbooks) | + Systematic log reuse; − infra req. |
| Compound Inference Scaling [40] | Analysis: vote/filter-vote scaling laws | No | N/A (guidance) | Design guidance | Design of consensus depth | + Predicts non-monotonic gains; − not a method |
| Survey (Text/Code Ensembles) [32] | Taxonomy of output-level ensembling | No | N/A | Survey | Method selection | + Broad guidance; − not executable |
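
As a rough illustration of the simplest rows in the table above (sample-and-vote in the style of Self-Consistency, and conservative voting with abstention), the following sketch aggregates prompt-level outputs before any detector is called. It is a hypothetical example, not code from any of the cited methods: the function names, the lowercase/strip normalization, and the default thresholds are assumptions made here.

```python
from collections import Counter


def vote_class_list(proposals, min_support=0.5):
    """Keep each class label proposed by at least `min_support` of the runs.

    `proposals` holds one list of class names per model (or per sampled run).
    """
    # Assumed normalization: case-insensitive, surrounding whitespace ignored.
    normalized = [{label.strip().lower() for label in p} for p in proposals]
    counts = Counter(label for labels in normalized for label in labels)
    needed = min_support * len(normalized)
    return sorted(label for label, n in counts.items() if n >= needed)


def vote_single_answer(answers, min_agreement=0.5):
    """Majority vote on one answer; return None (abstain) without enough agreement."""
    if not answers:
        return None
    winner, n = Counter(a.strip().lower() for a in answers).most_common(1)[0]
    # Strict inequality: an even split counts as disagreement and triggers abstention.
    return winner if n / len(answers) > min_agreement else None
```

The list returned by `vote_class_list` could then serve as the text queries for an open-vocabulary detector. Raising `min_support` or `min_agreement` trades coverage for trust, which is the trade-off noted for strict voting in the table.
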

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Iwanowski, M.; Gahbler, M. Multiple Large AI Models’ Consensus for Object Detection—A Survey. Appl. Sci. 2025, 15, 12961. https://doi.org/10.3390/app152412961

