Ensemble Large Language Models: A Survey
Abstract
1. Introduction
- A detailed classification of ensemble techniques for LLMs, including model-level, parameter-level, and task-specific approaches.
- An exploration of applications in key domains such as healthcare, legal AI, and education, where ensemble LLMs have demonstrated unique advantages.
- A discussion of challenges and future research directions.
2. Related Work
3. Overview of LLMs
4. Ensemble Techniques for LLMs
4.1. Model-Level Ensembles
4.1.1. Stacking
4.1.2. Bagging
4.1.3. Boosting
4.1.4. Voting Ensembles
4.2. Parameter-Level Ensembles
4.3. Task-Specific Ensembles
4.4. Knowledge Distillation and Model Compression
4.5. Mixture-of-Experts
4.6. Hybrid/Multi-Agent Ensembles
4.7. Ensemble Strategies for Multimodal LLMs
- Late fusion: Each model processes its modality independently, and the outputs are combined at the decision level. This method is especially effective when modalities are loosely coupled. For instance, in medical diagnostics, a vision model processes imaging data (e.g., chest X-rays) while an LLM interprets patient history or lab reports; the final decision emerges from a weighted or rule-based fusion of their outputs (see the first sketch following this list).
- Modality-specific ensembling: This strategy involves constructing separate ensembles within each modality, such as a group of image encoders and a group of language models. The intermediate representations from each modality are then aligned or jointly reasoned over using cross-modal attention or graph-based fusion. This is particularly effective in tasks like visual question answering (VQA) and cross-modal retrieval, where complementary strengths across modalities drive accuracy [65,66].
- Alignment-based fusion: More advanced approaches leverage alignment-based fusion, where outputs or embeddings from different modalities are projected into a common latent space. This is often accomplished using contrastive learning or joint transformer encoders. Let $x_v$ and $x_t$ represent the visual and textual inputs, respectively, and $f_v$, $f_t$ denote their modality-specific encoders. These are projected into a shared embedding space as $z_v = f_v(x_v)$ and $z_t = f_t(x_t)$. The ensemble prediction is then computed via joint inference, $\hat{y} = g(z_v, z_t)$, where $g$ is the joint fusion function; a code sketch of this formulation follows this list.
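To make the late-fusion strategy concrete, the following minimal Python sketch combines decision-level outputs from two hypothetical models (a vision model scoring an X-ray and an LLM scoring the accompanying notes) using fixed fusion weights. The label set, probabilities, and weights are illustrative placeholders, not values from any system discussed in this survey.

```python
import numpy as np

# Illustrative late (decision-level) fusion of two modality-specific models.
# Both models are assumed to emit probabilities over the same label set.
labels = ["normal", "pneumonia", "other"]

p_vision = np.array([0.20, 0.70, 0.10])  # hypothetical vision-model output (e.g., chest X-ray)
p_text = np.array([0.50, 0.40, 0.10])    # hypothetical LLM output (e.g., clinical notes)

def late_fusion(prob_vectors, weights):
    """Weighted average of per-model probability vectors."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()                      # normalize fusion weights
    fused = sum(w * p for w, p in zip(weights, prob_vectors))
    return fused / fused.sum()                             # keep a valid distribution

fused = late_fusion([p_vision, p_text], weights=[0.6, 0.4])
print(labels[int(np.argmax(fused))], fused.round(3))
```

A rule-based variant would replace the weighted average with, for example, deferring to the vision model whenever its top probability exceeds a confidence threshold.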
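The alignment-based formulation can likewise be sketched in code. The snippet below assumes that pretrained encoders $f_v$ and $f_t$ already produce fixed-size embeddings; it applies learned linear projections into a shared space and a small joint head $g$. All layer sizes and the class count are arbitrary illustrations rather than a reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignmentFusion(nn.Module):
    """Sketch of alignment-based fusion: project modality embeddings into a
    shared latent space, then jointly infer the prediction y_hat = g(z_v, z_t)."""

    def __init__(self, d_vision=512, d_text=768, d_shared=256, n_classes=3):
        super().__init__()
        self.proj_v = nn.Linear(d_vision, d_shared)  # maps f_v(x_v) into the shared space
        self.proj_t = nn.Linear(d_text, d_shared)    # maps f_t(x_t) into the shared space
        self.g = nn.Sequential(                      # joint inference head g
            nn.Linear(2 * d_shared, d_shared),
            nn.ReLU(),
            nn.Linear(d_shared, n_classes),
        )

    def forward(self, h_v, h_t):
        z_v = F.normalize(self.proj_v(h_v), dim=-1)  # shared-space visual embedding z_v
        z_t = F.normalize(self.proj_t(h_t), dim=-1)  # shared-space textual embedding z_t
        return self.g(torch.cat([z_v, z_t], dim=-1))

# Random stand-ins for the encoder outputs f_v(x_v) and f_t(x_t).
model = AlignmentFusion()
h_v = torch.randn(4, 512)
h_t = torch.randn(4, 768)
print(model(h_v, h_t).shape)  # torch.Size([4, 3])
```

In practice, the projections would typically be pretrained with a contrastive objective so that paired visual and textual embeddings lie close together in the shared space before the joint head is fit.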
4.8. Comparative Summary of Ensemble Strategies
5. Notable Ensemble LLM Applications
6. Challenges and Limitations
6.1. Scalability and Resource Requirements
6.2. Model Alignment and Coherence
6.3. Bias and Fairness Issues
6.4. Interpretability and Complexity
6.5. Sustainability Concerns
6.6. Reported Limitations and Failure Cases
7. Discussion and Future Directions
8. Conclusion
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
Abbreviation | Meaning |
---|---|
AI | Artificial Intelligence |
AMP | Antimicrobial Peptides |
BERT | Bidirectional Encoder Representations from Transformers |
BPE | Byte Pair Encoding |
DFPE | Diverse Fingerprint Ensemble |
EHR | Electronic Health Record |
FL | Federated Learning |
GPT | Generative Pretrained Transformer |
KL | Kullback–Leibler |
LLM | Large Language Model |
ML | Machine Learning |
NLP | Natural Language Processing |
QA | Question Answering |
QA-RF | Query Answering via Reformulation |
SFT | Supervised Fine-Tuning |
TOPLA | Task-Oriented Prompt-Level Aggregation |
References
- Raiaan, M.A.K.; Mukta, M.S.H.; Fatema, K.; Fahad, N.M.; Sakib, S.; Mim, M.M.J.; Ahmad, J.; Ali, M.E.; Azam, S. A Review on Large Language Models: Architectures, Applications, Taxonomies, Open Issues and Challenges. IEEE Access 2024, 12, 26839–26874. [Google Scholar] [CrossRef]
- Long, S.; Tan, J.; Mao, B.; Tang, F.; Li, Y.; Zhao, M.; Kato, N. A Survey on Intelligent Network Operations and Performance Optimization Based on Large Language Models. IEEE Commun. Surv. Tutor. 2025, 1. [Google Scholar] [CrossRef]
- Zhang, Q.; Ding, K.; Lv, T.; Wang, X.; Yin, Q.; Zhang, Y.; Yu, J.; Wang, Y.; Li, X.; Xiang, Z.; et al. Scientific Large Language Models: A Survey on Biological & Chemical Domains. ACM Comput. Surv. 2025, 57, 1–38. [Google Scholar] [CrossRef]
- Yenduri, G.; Ramalingam, M.; Selvi, G.C.; Supriya, Y.; Srivastava, G.; Maddikunta, P.K.R.; Raj, G.D.; Jhaveri, R.H.; Prabadevi, B.; Wang, W.; et al. Gpt (generative pre-trained transformer)—A comprehensive review on enabling technologies, potential applications, emerging challenges, and future directions. IEEE Access 2024, 12, 54608–54649. [Google Scholar] [CrossRef]
- Mienye, I.D.; Swart, T.G. ChatGPT in Education: A Review of Ethical Challenges and Approaches to Enhancing Transparency and Privacy. Procedia Comput. Sci. 2025, 254, 181–190. [Google Scholar] [CrossRef]
- Zhao, H.; Chen, H.; Yang, F.; Liu, N.; Deng, H.; Cai, H.; Wang, S.; Yin, D.; Du, M. Explainability for large language models: A survey. ACM Trans. Intell. Syst. Technol. 2024, 15, 1–38. [Google Scholar] [CrossRef]
- Sakib, M.; Mustajab, S.; Alam, M. Ensemble deep learning techniques for time series analysis: A comprehensive review, applications, open issues, challenges, and future directions. Clust. Comput. 2025, 28, 73. [Google Scholar] [CrossRef]
- Rane, N.; Choudhary, S.P.; Rane, J. Ensemble deep learning and machine learning: Applications, opportunities, challenges, and future directions. Stud. Med. Health Sci. 2024, 1, 18–41. [Google Scholar]
- Yu, Y.C.; Kuo, C.C.; Ye, Z.; Chang, Y.C.; Li, Y.S. Breaking the ceiling of the LLM community by treating token generation as a classification for ensembling. arXiv 2024, arXiv:2406.12585. [Google Scholar]
- Borah, A.; Mihalcea, R. Towards Implicit Bias Detection and Mitigation in Multi-Agent LLM Interactions. arXiv 2024, arXiv:2410.02584. [Google Scholar]
- Xu, Y.; Lu, J.; Zhang, J. Bridging the gap between different vocabularies for LLM ensemble. arXiv 2024, arXiv:2404.09492. [Google Scholar]
- Fang, C.; Li, X.; Fan, Z.; Xu, J.; Nag, K.; Korpeoglu, E.; Kumar, S.; Achan, K. LLM-ensemble: Optimal large language model ensemble method for e-commerce product attribute value extraction. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, Washington, DC, USA, 14–18 July 2024; pp. 2910–2914. [Google Scholar]
- Xian, Y.; Zeng, X.; Xuan, D.; Yang, D.; Li, C.; Fan, P.; Liu, P. Connecting Large Language Models with Blockchain: Advancing the Evolution of Smart Contracts from Automation to Intelligence. arXiv 2024, arXiv:2412.02263. [Google Scholar]
- Zhao, W.X.; Zhou, K.; Li, J.; Tang, T.; Wang, X.; Hou, Y.; Min, Y.; Zhang, B.; Zhang, J.; Dong, Z.; et al. A survey of large language models. arXiv 2023, arXiv:2303.18223. [Google Scholar] [PubMed]
- Minaee, S.; Mikolov, T.; Nikzad, N.; Chenaghlu, M.; Socher, R.; Amatriain, X.; Gao, J. Large language models: A survey. arXiv 2024, arXiv:2402.06196. [Google Scholar]
- Kalyan, K.S. A survey of GPT-3 family large language models including ChatGPT and GPT-4. Nat. Lang. Process. J. 2023, 6, 100048. [Google Scholar] [CrossRef]
- Wang, C.; Zhao, J.; Gong, J. A survey on large language models from concept to implementation. arXiv 2024, arXiv:2403.18969. [Google Scholar]
- Wan, Z.; Wang, X.; Liu, C.; Alam, S.; Zheng, Y.; Liu, J.; Qu, Z.; Yan, S.; Zhu, Y.; Zhang, Q.; et al. Efficient large language models: A survey. arXiv 2023, arXiv:2312.03863. [Google Scholar]
- Chang, Y.; Wang, X.; Wang, J.; Wu, Y.; Yang, L.; Zhu, K.; Chen, H.; Yi, X.; Wang, C.; Wang, Y.; et al. A survey on evaluation of large language models. ACM Trans. Intell. Syst. Technol. 2024, 15, 1–45. [Google Scholar] [CrossRef]
- Wang, Y.; Wang, M.; Manzoor, M.A.; Liu, F.; Georgiev, G.; Das, R.J.; Nakov, P. Factuality of large language models: A survey. arXiv 2024, arXiv:2402.02420. [Google Scholar]
- Bender, E.M.; Gebru, T.; McMillan-Major, A.; Shmitchell, S. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, Online, 3–10 March 2021; pp. 610–623. [Google Scholar]
- Bommasani, R.; Hudson, D.A.; Adeli, E.; Altman, R.; Arora, S.; von Arx, S.; Bernstein, M.S.; Bohg, J.; Bosselut, A.; Brunskill, E.; et al. On the opportunities and risks of foundation models. arXiv 2021, arXiv:2108.07258. [Google Scholar]
- Ganaie, M.A.; Hu, M.; Malik, A.K.; Tanveer, M.; Suganthan, P.N. Ensemble deep learning: A review. Eng. Appl. Artif. Intell. 2022, 115, 105151. [Google Scholar] [CrossRef]
- He, K.; Gan, C.; Li, Z.; Rekik, I.; Yin, Z.; Ji, W.; Gao, Y.; Wang, Q.; Zhang, J.; Shen, D. Transformers in medical image analysis. Intell. Med. 2023, 3, 59–78. [Google Scholar] [CrossRef]
- Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
- Anil, R.; Dai, A.M.; Firat, O.; Johnson, M.; Lepikhin, D.; Passos, A.; Shakeri, S.; Taropa, E.; Bailey, P.; Chen, Z.; et al. Palm 2 technical report. arXiv 2023, arXiv:2305.10403. [Google Scholar]
- Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open foundation and fine-tuned chat models. arXiv 2023, arXiv:2307.09288. [Google Scholar]
- Guo, D.; Zhu, Q.; Yang, D.; Xie, Z.; Dong, K.; Zhang, W.; Chen, G.; Bi, X.; Wu, Y.; Li, Y.; et al. DeepSeek-Coder: When the Large Language Model Meets Programming–The Rise of Code Intelligence. arXiv 2024, arXiv:2401.14196. [Google Scholar]
- Dong, X.; Yu, Z.; Cao, W.; Shi, Y.; Ma, Q. A survey on ensemble learning. Front. Comput. Sci. 2020, 14, 241–258. [Google Scholar] [CrossRef]
- Zhang, W.; Xu, Y.; Wang, A.; Chen, G.; Zhao, J. Fuse feeds as one: Cross-modal framework for general identification of AMPs. Briefings Bioinform. 2023, 24, bbad336. [Google Scholar] [CrossRef]
- Mienye, I.D.; Sun, Y. A survey of ensemble learning: Concepts, algorithms, applications, and prospects. IEEE Access 2022, 10, 99129–99149. [Google Scholar] [CrossRef]
- Lu, J.; Pang, Z.; Xiao, M.; Zhu, Y.; Xia, R.; Zhang, J. Merge, ensemble, and cooperate! A survey on collaborative strategies in the era of large language models. arXiv 2024, arXiv:2407.06089. [Google Scholar]
- Yang, H.; Li, M.; Zhou, H.; Xiao, Y.; Fang, Q.; Zhang, R. One LLM is not enough: Harnessing the power of ensemble learning for medical question answering. medRxiv 2023. [Google Scholar] [CrossRef]
- Ramesh, V.; Kumaresan, P. Stacked ensemble model for accurate crop yield prediction using machine learning techniques. Environ. Res. Commun. 2025, 7, 035006. [Google Scholar] [CrossRef]
- Matarazzo, A.; Torlone, R. A Survey on Large Language Models with some Insights on their Capabilities and Limitations. arXiv 2025, arXiv:2501.04040. [Google Scholar]
- Streefland, G.J.; Herrema, F.; Martini, M. A Gradient Boosting model to predict the milk production. Smart Agric. Technol. 2023, 6, 100302. [Google Scholar] [CrossRef]
- Huang, J.; Nezafati, K.; Villanueva-Miranda, I.; Gu, Z.; Navar, A.M.; Wanyan, T.; Zhou, Q.; Yao, B.; Rong, R.; Zhan, X.; et al. Large language models enabled multiagent ensemble method for efficient EHR data labeling. arXiv 2024, arXiv:2410.16543. [Google Scholar]
- Farr, D.; Manzonelli, N.; Cruickshank, I.; Starbird, K.; West, J. LLM chain ensembles for scalable and accurate data annotation. In Proceedings of the 2024 IEEE International Conference on Big Data (BigData), Washington, DC, USA, 15–18 December 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 2110–2118. [Google Scholar]
- Dogan, A.; Birant, D. A weighted majority voting ensemble approach for classification. In Proceedings of the 2019 4th International Conference on Computer Science and Engineering (UBMK), Samsun, Turkey, 11–15 September 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–6. [Google Scholar]
- Yang, E.; Shen, L.; Guo, G.; Wang, X.; Cao, X.; Zhang, J.; Tao, D. Model merging in LLMs, MLLMs, and beyond: Methods, theories, applications and opportunities. arXiv 2024, arXiv:2408.07666. [Google Scholar]
- Du, G.; Lee, J.; Li, J.; Jiang, R.; Guo, Y.; Yu, S.; Liu, H.; Goh, S.K.; Tang, H.K.; He, D.; et al. Parameter competition balancing for model merging. arXiv 2024, arXiv:2410.02396. [Google Scholar]
- Xiao, C.; Zhang, Z.; Song, C.; Jiang, D.; Yao, F.; Han, X.; Wang, X.; Wang, S.; Huang, Y.; Lin, G.; et al. Configurable foundation models: Building LLMs from a modular perspective. arXiv 2024, arXiv:2409.02877. [Google Scholar]
- Arun, A.; John, J.; Kumaran, S. Ensemble of Task-Specific Language Models for Brain Encoding. arXiv 2023, arXiv:2310.15720. [Google Scholar]
- Abimannan, S.; El-Alfy, E.S.M.; Chang, Y.S.; Hussain, S.; Shukla, S.; Satheesh, D. Ensemble multifeatured deep learning models and applications: A survey. IEEE Access 2023, 11, 107194–107217. [Google Scholar] [CrossRef]
- Campagner, A.; Ciucci, D.; Cabitza, F. Aggregation models in ensemble learning: A large-scale comparison. Inf. Fusion 2023, 90, 241–252. [Google Scholar] [CrossRef]
- Yang, C.; Zhu, Y.; Lu, W.; Wang, Y.; Chen, Q.; Gao, C.; Yan, B.; Chen, Y. Survey on knowledge distillation for large language models: Methods, evaluation, and application. ACM Trans. Intell. Syst. Technol. 2024. [Google Scholar] [CrossRef]
- Junaid, A.R. Empowering Compact Language Models with Knowledge Distillation. Authorea Prepr. 2025. [Google Scholar] [CrossRef]
- Dantas, P.V.; Sabino da Silva, W., Jr.; Cordeiro, L.C.; Carvalho, C.B. A comprehensive review of model compression techniques in machine learning. Appl. Intell. 2024, 54, 11804–11844. [Google Scholar] [CrossRef]
- Gou, J.; Yu, B.; Maybank, S.J.; Tao, D. Knowledge distillation: A survey. Int. J. Comput. Vis. 2021, 129, 1789–1819. [Google Scholar] [CrossRef]
- Sun, S.; Cheng, Y.; Gan, Z.; Liu, J. Patient knowledge distillation for BERT model compression. arXiv 2019, arXiv:1908.09355. [Google Scholar]
- Hu, C.; Li, X.; Liu, D.; Wu, H.; Chen, X.; Wang, J.; Liu, X. Teacher-student architecture for knowledge distillation: A survey. arXiv 2023, arXiv:2308.04268. [Google Scholar]
- Jiao, X.; Yin, Y.; Shang, L.; Jiang, X.; Chen, X.; Li, L.; Wang, F.; Liu, Q. TinyBERT: Distilling BERT for natural language understanding. arXiv 2019, arXiv:1909.10351. [Google Scholar]
- Bibi, U.; Mazhar, M.; Sabir, D.; Butt, M.F.U.; Hassan, A.; Ghazanfar, M.A.; Khan, A.A.; Abdul, W. Advances in Pruning and Quantization for Natural Language Processing. IEEE Access 2024, 12, 139113–139128. [Google Scholar] [CrossRef]
- Zhang, R.; He, J.; Luo, X.; Niyato, D.; Kang, J.; Xiong, Z.; Li, Y.; Sikdar, B. Toward democratized generative AI in next-generation mobile edge networks. IEEE Netw. 2025. [Google Scholar] [CrossRef]
- Semerikov, S.O.; Vakaliuk, T.A.; Kanevska, O.B.; Moiseienko, M.V.; Donchev, I.I.; Kolhatin, A.O. LLM on the edge: The new frontier. In Proceedings of the CEUR Workshop Proceedings, Barcelona, Spain, 7–10 April 2025; pp. 137–161. [Google Scholar]
- Fedus, W.; Zoph, B.; Shazeer, N. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. J. Mach. Learn. Res. 2022, 23, 1–39. [Google Scholar]
- Shazeer, N.; Mirhoseini, A.; Maziarz, K.; Davis, A.; Le, Q.; Hinton, G.; Dean, J. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv 2017, arXiv:1701.06538. [Google Scholar]
- Lewis, M.; Bhosale, S.; Dettmers, T.; Goyal, N.; Zettlemoyer, L. Base layers: Simplifying training of large, sparse models. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 6265–6274. [Google Scholar]
- Qu, N.; Wang, C.; Li, Z.; Liu, F.; Ji, Y. A distributed multi-agent deep reinforcement learning-aided transmission design for dynamic vehicular communication networks. IEEE Trans. Veh. Technol. 2023, 73, 3850–3862. [Google Scholar] [CrossRef]
- Wang, Z.; Zhu, Y.; Zhao, H.; Zheng, X.; Sui, D.; Wang, T.; Tang, W.; Wang, Y.; Harrison, E.; Pan, C.; et al. Colacare: Enhancing electronic health record modeling through large language model-driven multi-agent collaboration. In Proceedings of the ACM on Web Conference 2025, Sydney, Australia, 28 April–2 May 2025; pp. 2250–2261. [Google Scholar]
- Park, J.S.; O’Brien, J.; Cai, C.J.; Morris, M.R.; Liang, P.; Bernstein, M.S. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, San Francisco, CA, USA, 29 October–1 November 2023; pp. 1–22. [Google Scholar]
- Alayrac, J.B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. Flamingo: A visual language model for few-shot learning. Adv. Neural Inf. Process. Syst. 2022, 35, 23716–23736. [Google Scholar]
- Peng, Z.; Wang, W.; Dong, L.; Hao, Y.; Huang, S.; Ma, S.; Wei, F. Kosmos-2: Grounding multimodal large language models to the world. arXiv 2023, arXiv:2306.14824. [Google Scholar]
- Chen, X.; Wang, X.; Changpinyo, S.; Piergiovanni, A.J.; Padlewski, P.; Salz, D.; Goodman, S.; Grycner, A.; Mustafa, B.; Beyer, L.; et al. Pali: A jointly-scaled multilingual language-image model. arXiv 2022, arXiv:2209.06794. [Google Scholar]
- Baltrušaitis, T.; Ahuja, C.; Morency, L.P. Multimodal machine learning: A survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 423–443. [Google Scholar] [CrossRef]
- Li, J.; Li, D.; Savarese, S.; Hoi, S. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the International Conference on Machine Learning, PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 19730–19742. [Google Scholar]
- Huang, Y.; Feng, X.; Li, B.; Xiang, Y.; Wang, H.; Liu, T.; Qin, B. Ensemble Learning for Heterogeneous Large Language Models with Deep Parallel Collaboration. In Proceedings of the Thirty-Eighth Annual Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 9–15 December 2024. [Google Scholar]
- He, X.; Ban, Y.; Zou, J.; Wei, T.; Cook, C.B.; He, J. LLM-Forest for Health Tabular Data Imputation. arXiv 2024, arXiv:2410.21520. [Google Scholar]
- Lucas, M.M.; Yang, J.; Pomeroy, J.K.; Yang, C.C. Reasoning with large language models for medical question answering. J. Am. Med. Inform. Assoc. 2024, 31, 1964–1975. [Google Scholar] [CrossRef] [PubMed]
- Li, Z.; Wei, Q.; Huang, L.C.; Li, J.; Hu, Y.; Chuang, Y.S.; He, J.; Das, A.; Keloth, V.K.; Yang, Y.; et al. Ensemble pretrained language models to extract biomedical knowledge from literature. J. Am. Med. Inform. Assoc. 2024, 31, 1904–1911. [Google Scholar] [CrossRef]
- Wu, C.; Fang, W.; Dai, F.; Yin, H. A Model Ensemble Approach with LLM for Chinese Text Classification. In Proceedings of the China Health Information Processing Conference, Fuzhou, China, 15–17 November 2023; Springer: Berlin/Heidelberg, Germany, 2023; pp. 214–230. [Google Scholar]
- Gururajan, A.K.; Lopez-Cuena, E.; Bayarri-Planas, J.; Tormos, A.; Hinjos, D.; Bernabeu-Perez, P.; Arias-Duart, A.; Martin-Torres, P.A.; Urcelay-Ganzabal, L.; Gonzalez-Mallo, M.; et al. Aloe: A Family of Fine-tuned Open Healthcare LLMs. arXiv 2024, arXiv:2405.01886. [Google Scholar]
- Lai, Z.; Zhang, X.; Chen, S. Adaptive ensembles of fine-tuned transformers for llm-generated text detection. arXiv 2024, arXiv:2403.13335. [Google Scholar]
- Knafou, J.; Haas, Q.; Borissov, N.; Counotte, M.; Low, N.; Imeri, H.; Ipekci, A.M.; Buitrago-Garcia, D.; Heron, L.; Amini, P.; et al. Ensemble of deep learning language models to support the creation of living systematic reviews for the COVID-19 literature. Syst. Rev. 2023, 12, 94. [Google Scholar] [CrossRef] [PubMed]
- Gu, K.; Tuecke, E.; Katz, D.; Horesh, R.; Alvarez-Melis, D.; Yurochkin, M. CharED: Character-wise ensemble decoding for large language models. arXiv 2024, arXiv:2407.11009. [Google Scholar]
- Dhole, K.D.; Agichtein, E. Genqrensemble: Zero-shot LLM ensemble prompting for generative query reformulation. In Proceedings of the European Conference on Information Retrieval, Glasgow, UK, 24–28 March 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 326–335. [Google Scholar]
- Miah, M.S.U.; Kabir, M.M.; Sarwar, T.B.; Safran, M.; Alfarhood, S.; Mridha, M. A multimodal approach to cross-lingual sentiment analysis with ensemble of transformer and LLM. Sci. Rep. 2024, 14, 9603. [Google Scholar] [CrossRef]
- Luo, Y.; Xu, W.; Andersson, K.; Hossain, M.S.; Xu, D. FELLMVP: An Ensemble LLM Framework for Classifying Smart Contract Vulnerabilities. In Proceedings of the 2024 IEEE International Conference on Blockchain (Blockchain), Copenhagen, Denmark, 19–22 August 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 89–96. [Google Scholar]
- Huang, Z. An Ensemble LLM Framework of Text Recognition Based on BERT and BPE Tokenization. In Proceedings of the 2024 5th International Seminar on Artificial Intelligence, Networking and Information Technology (AINIT), Nanjing, China, 22–24 March 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1750–1754. [Google Scholar]
- Cohen, S.; Goldshlager, N.; Cohen-Inger, N.; Shapira, B.; Rokach, L. DFPE: A Diverse Fingerprint Ensemble for Enhancing LLM Performance. arXiv 2025, arXiv:2501.17479. [Google Scholar]
- Tekin, S.F.; Ilhan, F.; Huang, T.; Hu, S.; Liu, L. LLM-TOPLA: Efficient LLM ensemble by maximising diversity. arXiv 2024, arXiv:2410.03953. [Google Scholar]
- Cho, H.; Kang, S.; An, G.; Yoo, S. COSMosFL: Ensemble of Small Language Models for Fault Localisation. arXiv 2025, arXiv:2502.02908. [Google Scholar]
- Chen, K.; Wang, J.; Chen, Z.; Chen, K.; Chen, Y. LLM-Powered Ensemble Learning for Paper Source Tracing: A GPU-Free Approach. arXiv 2024, arXiv:2409.09383. [Google Scholar]
- Abburi, H.; Suesserman, M.; Pudota, N.; Veeramani, B.; Bowen, E.; Bhattacharya, S. Generative AI text classification using ensemble LLM approaches. arXiv 2023, arXiv:2309.07755. [Google Scholar]
- Yang, Z.; Xue, D.; Qian, S.; Dong, W.; Xu, C. LDRE: LLM-based Divergent Reasoning and Ensemble for Zero-Shot Composed Image Retrieval. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, Washington, DC, USA, 14–18 July 2024; pp. 80–90. [Google Scholar]
- Jiang, D.; Ren, X.; Lin, B.Y. LLM-blender: Ensembling large language models with pairwise ranking and generative fusion. arXiv 2023, arXiv:2306.02561. [Google Scholar]
- Chen, Z.; Li, J.; Chen, P.; Li, Z.; Sun, K.; Luo, Y.; Mao, Q.; Yang, D.; Sun, H.; Yu, P.S. Harnessing Multiple Large Language Models: A Survey on LLM Ensemble. arXiv 2025, arXiv:2502.18036. [Google Scholar]
- Celikyilmaz, A.; Clark, E.; Gao, J. Evaluation of text generation: A survey. arXiv 2020, arXiv:2006.14799. [Google Scholar]
- Mienye, I.D.; Swart, T.G.; Obaido, G. Fairness Metrics in AI Healthcare Applications: A Review. In Proceedings of the 2024 IEEE International Conference on Information Reuse and Integration for Data Science (IRI), San Jose, CA, USA, 7–9 August 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 284–289. [Google Scholar]
- Nasarian, E.; Alizadehsani, R.; Acharya, U.R.; Tsui, K.L. Designing interpretable ML system to enhance trust in healthcare: A systematic review to proposed responsible clinician-AI-collaboration framework. Inf. Fusion 2024, 108, 102412. [Google Scholar] [CrossRef]
- Patil, R.; Gudivada, V. A review of current trends, techniques, and challenges in large language models (LLMs). Appl. Sci. 2024, 14, 2074. [Google Scholar] [CrossRef]
- Roller, S.; Dinan, E.; Goyal, N.; Ju, D.; Williamson, M.; Liu, Y.; Xu, J.; Ott, M.; Shuster, K.; Smith, E.M.; et al. Recipes for building an open-domain chatbot. arXiv 2020, arXiv:2004.13637. [Google Scholar]
- Feng, H.; Fan, Y.; Liu, X.; Lin, T.E.; Yao, Z.; Wu, Y.; Huang, F.; Li, Y.; Ma, Q. Improving factual consistency of text summarization by adversarially decoupling comprehension and embellishment abilities of LLMs. arXiv 2023, arXiv:2310.19347. [Google Scholar]
- He, P.; Peng, B.; Lu, L.; Wang, S.; Mei, J.; Liu, Y.; Xu, R.; Awadalla, H.H.; Shi, Y.; Zhu, C.; et al. Z-code++: A pre-trained language model optimized for abstractive summarization. arXiv 2022, arXiv:2208.09770. [Google Scholar]
Comparative summary of ensemble strategies (Section 4.8):
Ensemble Type | Typical Examples | Advantages | Limitations |
---|---|---|---|
Model-level | Bagging, boosting, stacking, voting | High predictive performance, robust generalization, reduces variance | High computational cost, high latency, low interpretability |
Parameter-level | Snapshot ensembles, parameter averaging | Computational efficiency, improved scalability, moderate performance improvement | Limited model diversity, lower interpretability, moderate accuracy gains |
Task-specific | Prompt-based ensembles, domain-specific fine-tuned ensembles | High domain specialization, scalable, adaptable to targeted applications | Dependent on prompt and domain quality, less generalizable across domains |
Knowledge distillation | Student-teacher model training, distilled compact LLMs | High computational efficiency, lower inference latency, suitable for deployment in resource-constrained environments | Moderate performance reduction, requires careful distillation tuning |
Mixture-of-experts | Switch transformers, sparsely gated experts | Exceptional scalability, efficient resource utilization, capable of training very large models | Complex gating mechanisms, interpretability issues, expert load balancing |
Hybrid/multi-agent | Collaborative agent frameworks, generative agent ensembles | Effective in complex, multi-step reasoning tasks, interactive capabilities, adaptable in dynamic environments | High coordination complexity, synchronization overhead, limited interpretability |
Multimodal LLM ensembles | Late fusion, modality-specific ensembling, alignment-based fusion | Cross-modal reasoning, robust in heterogeneous data settings, excels in VQA and clinical support tasks | Modality alignment, synchronization, and scalability challenges |
Notable ensemble LLM applications (Section 5):
Application Domain | Reference | Year | Description |
---|---|---|---|
Model robustness enhancement | Cohen et al. [80] | 2025 | Introduced DFPE, a fingerprint-based ensemble for performance and diversity. |
Software fault localization | Cho et al. [82] | 2025 | Used ensembles of small LLMs to improve bug and fault detection in codebases. |
Medical question answering | Lucas et al. [69] | 2024 | Applied LLM ensembles for improved reasoning in medical QA tasks. |
Biomedical information extraction | Li et al. [70] | 2024 | Used an ensemble of pretrained LLMs to extract biomedical entities and relations from literature. |
EHR annotation | Huang et al. [37] | 2024 | Introduced a multi-agent LLM ensemble to streamline annotation of electronic health records. |
Health data imputation | He et al. [68] | 2024 | Proposed LLM-Forest to enhance imputation of missing values in structured health datasets. |
Open healthcare NLP | Gururajan et al. [72] | 2024 | Released Aloe, a family of fine-tuned open healthcare LLMs for ensemble clinical NLP. |
LLM output detection | Lai et al. [73] | 2024 | Deployed adaptive transformer ensembles to detect AI-generated clinical text. |
Paper provenance tracing | Chen et al. [83] | 2024 | Proposed a GPU-free ensemble LLM method to trace origins of academic publications. |
General NLP tasks | Huang et al. [67] | 2024 | Introduced DeePEn, a framework for combining heterogeneous LLMs with parallel token alignment.
LLM diversity optimization | Tekin et al. [81] | 2024 | Proposed LLM-TOPLA, which maximizes diversity among LLMs for more robust NLP outputs. |
Vocabulary alignment | Xu et al. [11] | 2024 | Presented techniques to harmonize vocabularies across ensemble LLMs for consistency. |
Cross-lingual sentiment analysis | Miah et al. [77] | 2024 | Created a multimodal LLM ensemble for sentiment analysis across multiple languages. |
Data annotation | Farr et al. [38] | 2024 | Proposed LLM chain ensembles to improve annotation efficiency and scale. |
Query reformulation | Dhole and Agichtein [76] | 2024 | Introduced GenQRensemble for zero-shot query rewriting using LLM ensembles. |
Product attribute extraction | Fang et al. [12] | 2024 | Built an ensemble model to extract attributes from e-commerce product data. |
Smart contract analysis | Luo et al. [78] | 2024 | Proposed FELLMVP, an LLM ensemble to detect vulnerabilities in smart contracts. |
Composed image retrieval | Yang et al. [85] | 2024 | Used LLM ensembles for zero-shot composed image retrieval via divergent reasoning. |
Text recognition | Huang [79] | 2024 | Combined BERT and BPE-based LLMs for improved optical text recognition. |
Code generation and math reasoning | Gu et al. [75] | 2024 | Proposed CharED, a character-wise ensemble decoding approach enhancing LLM performance in programming and mathematical tasks. |
Medical question answering | Yang et al. [33] | 2023 | Developed an ensemble pipeline combining LLMs to improve performance on medical QA benchmarks. |
Medical text classification | Wu et al. [71] | 2023 | Implemented LLM ensembles to classify Chinese medical texts for healthcare informatics. |
Systematic review triage | Knafou et al. [74] | 2023 | Used an ensemble of deep models to triage COVID-19 literature for systematic reviews. |
AI text detection | Abburi et al. [84] | 2023 | Used ensemble LLMs to classify AI-generated vs. human-written content. |
Ranking and fusion for QA | Jiang et al. [86] | 2023 | Proposed LLM-Blender using pairwise ranking and generative fusion for ensemble reasoning. |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).