Ensemble Large Language Models: A Survey

Mienye, Ibomoiye Domor; Swart, Theo G.

doi:10.3390/info16080688

by Ibomoiye Domor Mienye * and Theo G. Swart

Institute for Intelligent Systems, University of Johannesburg, Johannesburg 2006, South Africa

* Author to whom correspondence should be addressed.

Information 2025, 16(8), 688; https://doi.org/10.3390/info16080688

Submission received: 29 April 2025 / Revised: 24 July 2025 / Accepted: 5 August 2025 / Published: 13 August 2025

Abstract

Large language models (LLMs) have transformed the field of natural language processing (NLP), achieving state-of-the-art performance in tasks such as translation, summarization, and reasoning. Despite their impressive capabilities, challenges persist, including biases, limited interpretability, and resource-intensive training. Ensemble learning, a technique that combines multiple models to improve performance, presents a promising avenue for addressing these limitations in LLMs. This review explores the emerging field of ensemble LLMs, providing a comprehensive analysis of current methodologies, applications across diverse domains, and existing challenges. By reviewing ensemble strategies and evaluating their effectiveness, this paper highlights the potential of ensemble LLMs to enhance robustness and generalizability while proposing future research directions to advance the field.

Keywords: ensemble learning; GPT; LLMs; NLP; transformers

1. Introduction

Large language models (LLMs) have significantly advanced the field of natural language processing (NLP), achieving excellent performance across tasks such as machine translation, text summarization, and question answering [1,2,3]. These models leverage sophisticated transformer architectures and are trained on vast corpora, enabling them to generalize across diverse linguistic tasks [4]. Despite their widespread adoption, several limitations remain, including susceptibility to biases, high computational requirements, and challenges in interpretability [5,6]. These challenges become particularly pronounced in high-stakes domains such as healthcare and law, where model transparency, robustness, and fairness are critical to trust and adoption.

Ensemble learning, a well-established paradigm in machine learning (ML), combines multiple models to enhance predictive performance and robustness [7]. Traditional ensemble methods, such as bagging, boosting, and stacking, have demonstrated efficacy in mitigating overfitting and variance, particularly in complex modeling tasks [8]. When applied to LLMs, these principles offer opportunities to improve performance, reduce biases, and address task-specific limitations. Ensemble approaches can also provide fallback mechanisms for unreliable predictions, facilitate model specialization across subtasks, and offer better uncertainty quantification—an increasingly important factor for real-world NLP systems.

Recent studies have begun to explore ensemble approaches tailored for LLMs. For instance, Yu et al. [9] utilized ensembling techniques to enhance the precision of text generation tasks, while Borah and Mihalcea [10] focused on ensemble strategies to mitigate model biases. Xu et al. [11] addressed the challenge of vocabulary misalignment in heterogeneous LLM ensembles by proposing alignment mechanisms for more coherent outputs. Additionally, Fang et al. [12] developed an ensemble architecture for structured information extraction in e-commerce, demonstrating tangible improvements in attribute value prediction. Similarly, Xian et al. [13] applied LLM ensembles to smart contract analysis in blockchain, demonstrating their applicability in highly technical domains.

These efforts indicate a growing recognition of the potential of ensemble learning to address the various limitations of LLMs and extend their utility across a broader spectrum of tasks and environments. Nevertheless, the research landscape remains fragmented, with varied methodologies and inconsistent evaluation practices making it difficult to draw generalizable conclusions or best practices. This review aims to bridge this gap by systematically categorizing ensemble LLM methodologies, analyzing their applications across various domains, and identifying challenges and future research opportunities.

Furthermore, this survey distinguishes itself from prior reviews by offering a multi-dimensional synthesis of ensemble strategies for LLMs that goes beyond taxonomy, thereby giving both academic and applied audiences a practical roadmap for developing ensemble-based LLMs. Specifically, the contributions of this paper include:

A detailed classification of ensemble techniques for LLMs, including model-level, parameter-level, and task-specific approaches.
An exploration of applications in key domains such as healthcare, legal AI, and education, where ensemble LLMs have demonstrated unique advantages.
A discussion of challenges and future research directions.

The remainder of this paper is structured as follows. Section 2 presents a review of related works. Section 3 provides an overview of LLMs, and Section 4 presents ensemble learning techniques for building LLMs. Section 5 discusses applications, while Section 6 highlights key challenges and limitations. Section 7 outlines future research directions, emphasizing the potential for ensemble LLMs to address complex real-world problems. Finally, Section 8 concludes the review.

2. Related Work

The rapid advancement of LLMs has led to an extensive body of research, resulting in numerous surveys that explore different aspects of their development, capabilities, and applications. While some works have provided broad overviews of LLMs, others have focused on specific areas such as evaluation methodologies, efficiency improvements, and ethical considerations. This section reviews related surveys, highlighting their unique contributions and identifying the gaps that this review aims to address.

Zhao et al. [14] conducted a comprehensive meta-survey on the evolution of LLMs, tracing their progression from early statistical models to sophisticated neural architectures, and classifying over 150 studies by model architecture, pretraining, evaluation, and societal impact. Similarly, Minaee et al. [15] focused on prominent LLM families, including GPT, LLaMA, and PaLM, providing insights into their architectural design, contributions, and inherent limitations. Meanwhile, Kalyan [16] centered on OpenAI’s GPT-3 family, assessing their performance across diverse NLP tasks, such as data augmentation and content generation. Wang et al. [17] expanded the application view to include code interpretation, image captioning, and machine translation, demonstrating the cross-domain versatility of transformer-based LLMs.

Beyond applications, efficiency challenges have been a major focus in LLM research. Wan et al. [18] categorized existing research into model-centric, data-centric, and framework-centric approaches, presenting a systematic review of strategies designed to enhance computational efficiency while preserving model performance. These findings are consistent with the broader classification of LLM challenges by Zhao et al. [14], which includes compute scaling, ethical risks, and evaluation bottlenecks. These works collectively highlight the ongoing efforts to optimize LLMs while addressing their growing computational and ethical challenges.

Another critical aspect of LLM research is evaluation. Chang et al. [19] concentrated on the methodologies used to assess LLM capabilities, discussing different dimensions of evaluation such as “what to evaluate,” “where to evaluate,” and “how to evaluate.” Their work emphasized the need for robust benchmarking frameworks to measure LLM performance reliably. In a related study, Wang et al. [20] investigated the factual accuracy of LLMs, identifying key challenges and possible solutions for improving the reliability of generated content. Their analysis of automated factuality evaluation shed light on one of the most pressing concerns in LLM deployment—ensuring that these models generate truthful and verifiable information.

Ethical considerations have also been a subject of extensive research. Bender et al. [21] highlighted the risks associated with large-scale language models, emphasizing issues such as biases, misinformation, and the environmental impact of training massive models. Similarly, Bommasani et al. [22] examined the broader societal and industrial impact of foundation models, advocating for interdisciplinary collaborations to address both the opportunities and risks posed by LLMs.

Despite the breadth of existing surveys, there remains a critical gap in research focused on ensemble methods applied to LLMs. While ensemble learning has been extensively studied in traditional machine learning, its application to LLMs is still in its early stages. Most existing reviews discuss individual model architectures, evaluation strategies, and efficiency improvements, but not how ensemble techniques can be leveraged to enhance LLM performance across diverse tasks. Existing ensemble surveys [7,23] typically focus on traditional ML or vision models and rarely address transformer-based LLMs, fairness-aware ensembles, or multimodal ensemble strategies.

Therefore, this review addresses that gap by systematically categorizing ensemble methods tailored for LLMs, evaluating their effectiveness in practical domains such as healthcare, law, and education, and highlighting trade-offs related to interpretability, bias mitigation, and scalability. In doing so, this study offers new insights on ensemble diversity, multimodal fusion, and deployment in real-world conditions, areas that remain underexplored in prior literature.

Additionally, by concentrating on ensemble approaches, this review differentiates itself from prior surveys and provides a novel contribution to LLM research. Specifically, it offers a structured taxonomy of ensemble strategies, an in-depth analysis of their advantages and limitations, and a discussion on their practical applications in real-world scenarios. Given the growing demand for robust, scalable, and interpretable LLM solutions, this review serves as a timely resource for researchers and practitioners seeking to explore ensemble learning as a means to enhance LLM performance.

3. Overview of LLMs

LLMs constitute a significant advancement in artificial intelligence, specifically impacting natural language processing (NLP). The success of these models largely relies on the transformer architecture [24], depicted in Figure 1, which effectively captures long-range dependencies in text through its self-attention mechanism. Self-attention computes contextually relevant representations by dynamically weighing different parts of an input sequence, enabling complex linguistic relationships to be modeled efficiently. The mechanism is formally defined as

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V,$

(1)

where Q, K, and V are query, key, and value matrices derived from the input embeddings, and $d_{k}$ is the dimensionality of the key vectors. Training of LLMs typically involves two stages: unsupervised pretraining on extensive text corpora and subsequent fine-tuning on targeted datasets for task-specific performance [22].
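As a concrete reading of Equation (1), the following minimal sketch computes scaled dot-product attention for small matrices stored as Python lists. It is an illustration only: the function names and the toy values of Q, K, and V are assumptions introduced here, and real transformer implementations use batched tensor operations, multiple heads, and learned projections.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention, Equation (1).

    Q is n_q x d_k, K is n_k x d_k, V is n_k x d_v (lists of lists).
    """
    d_k = len(K[0])
    scale = math.sqrt(d_k)
    out = []
    for q in Q:
        # Similarity of this query to every key, divided by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        weights = softmax(scores)
        # Each output row is a weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Toy inputs (illustrative values, not taken from any model).
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
result = attention(Q, K, V)
```

Because the softmax weights sum to one, every output row is a convex combination of the rows of V; the first query, which matches the first key, is pulled toward the first value vector.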

Over recent years, several advanced LLMs have significantly influenced NLP research and application development. Prominent among these is GPT-4 by OpenAI, characterized by multimodal capabilities allowing simultaneous processing of text and images, thereby excelling in tasks such as conversational interactions and code assistance [25]. Google’s PaLM 2 similarly extended LLM functionality, emphasizing multilingual proficiency and enhanced reasoning, thus supporting global NLP tasks and applications requiring complex inference [26].

Meta’s LLaMA series has prioritized efficiency, offering strong performance with reduced computational requirements compared to contemporaries, thereby democratizing access to powerful language models. LLaMA 2 particularly stands out for its performance in question answering and content generation tasks [27]. More recently, DeepSeek has garnered attention by introducing specialized models tailored for coding, scientific tasks, and mathematical reasoning, thus enriching the landscape of task-specific LLM applications [28].

Collectively, these models have shaped the evolution of NLP methodologies, inspiring research into techniques such as ensemble methods that aim to overcome the computational inefficiency, bias, and interpretability challenges inherent in individual LLMs.

4. Ensemble Techniques for LLMs

Ensemble learning refers to a methodological framework that combines multiple models, often referred to as base learners or weak learners, to achieve superior predictive performance compared to individual models [29]. The fundamental principle underlying ensemble learning is the reduction of errors through the aggregation of diverse model predictions, reducing variance, bias, or both, to improve generalization and robustness [23]. This section presents ensemble learning techniques that can be utilized in building robust LLMs.

4.1. Model-Level Ensembles

Model-level ensembles capitalize on the diversity of multiple LLM architectures by aggregating their outputs with the aim of enhancing predictive accuracy and generalization [30]. The effectiveness of model-level ensembles stems from their ability to compensate for the weaknesses of individual models. By systematically integrating multiple perspectives and learned representations, ensembles improve robustness, generalization, and reliability, making them an essential strategy for deploying LLMs in real-world, high-stakes applications. The ensemble’s prediction $H(x)$ can be mathematically expressed as:

$H(x) = f\left(\{h_{i}(x)\}_{i=1}^{n}\right),$

(2)

where $h_{i}(x)$ is the prediction from the i-th model, and f is an aggregation function such as majority voting, weighted averaging, or a learned fusion mechanism [31]. Model-level ensembles can be applied in areas like machine translation, sentiment analysis, legal document processing, and medical diagnosis, because of their ability to integrate specialized models that effectively handle different aspects of the problem domain.

In machine translation, for example, different LLMs trained on diverse language pairs or optimized for grammatical correctness and contextual coherence can be combined to produce superior translations. Similarly, in sentiment analysis, an ensemble combining models fine-tuned on different social media, news, and review datasets can mitigate dataset bias and improve robustness across various domains [32]. In medical diagnosis, where accuracy is crucial, ensemble models that merge predictions from multiple LLMs fine-tuned on radiology reports, patient records, and medical literature can significantly enhance diagnostic precision, reducing false positives and negatives [33].

4.1.1. Stacking

Stacking employs a meta-model to optimally combine the predictions from base models, offering flexibility in integrating diverse architectures [34]. The meta-model is trained using the outputs of the base models as features, enabling it to learn dependencies among them. This hierarchical structure ensures that stacking benefits from the diversity of base models while overcoming their individual limitations.

For LLMs, stacking is particularly beneficial in tasks like document summarization or question answering, where combining models optimized for different objectives (e.g., fluency, coherence, factuality) produces superior outputs [35]. For example, a base model fine-tuned on grammatical accuracy can complement another focused on domain-specific knowledge, with the meta-model synthesizing their strengths. Furthermore, potential applications of stacking in LLMs include multi-domain chatbots, where individual base models specialize in various domains, such as medicine, law, and technology. The meta-model can adapt dynamically to user inputs, ensuring both accuracy and context relevance, as illustrated in Figure 2.
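The meta-model idea can be sketched in a few lines. The example below assumes two hypothetical base models whose held-out outputs (probability of the positive class) are already available, and it uses a logistic-regression meta-model trained by plain gradient descent; the data, the centring of features at 0.5, and the choice of logistic regression are all assumptions made for illustration, not a method prescribed by the surveyed works.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical held-out outputs of two base models and the true labels.
# The values are made up for illustration.
base_a = [0.9, 0.8, 0.2, 0.1]   # informative base model
base_b = [0.4, 0.6, 0.4, 0.6]   # nearly uninformative base model
labels = [1, 1, 0, 0]

# Meta-features: base-model outputs centred at 0.5.
features = [(a - 0.5, b - 0.5) for a, b in zip(base_a, base_b)]

def train_meta(features, labels, lr=0.5, steps=500):
    """Fit a logistic-regression meta-model by batch gradient descent."""
    w = [0.0, 0.0]
    bias = 0.0
    n = len(labels)
    for _ in range(steps):
        grad_w = [0.0, 0.0]
        grad_b = 0.0
        for x, y in zip(features, labels):
            err = sigmoid(w[0] * x[0] + w[1] * x[1] + bias) - y
            grad_w[0] += err * x[0] / n
            grad_w[1] += err * x[1] / n
            grad_b += err / n
        w = [w[0] - lr * grad_w[0], w[1] - lr * grad_w[1]]
        bias -= lr * grad_b
    return w, bias

def stacked_predict(w, bias, a, b):
    """Meta-model output for one pair of base-model probabilities."""
    return sigmoid(w[0] * (a - 0.5) + w[1] * (b - 0.5) + bias)

weights, bias = train_meta(features, labels)
stacked = [stacked_predict(weights, bias, a, b) for a, b in zip(base_a, base_b)]
```

In this toy setting the meta-model learns a larger weight for the informative base model than for the uninformative one. In practice, the meta-features should come from data the base models were not trained on, so that the meta-model does not simply learn to trust overfitted outputs.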

4.1.2. Bagging

Bagging, or bootstrap aggregating, enhances model robustness by training each base model on different bootstrap samples of the training data. This diversity reduces overfitting and variance, leading to more stable predictions [23]. The aggregated prediction is typically expressed as:

$H(x) = \frac{1}{n} \sum_{i=1}^{n} h_{i}(x),$

(3)

where $h_{i}(x)$ represents the prediction of the i-th model [23]. Bagging can be particularly effective for LLMs when fine-tuning them on large, noisy datasets, as it mitigates the risk of overfitting specific patterns. For example, in sentiment analysis, multiple LLMs can be fine-tuned on subsets of customer reviews, each capturing unique nuances in the data. Aggregating their predictions ensures a balanced analysis, minimizing the influence of outliers or biases in individual subsets. This approach ensures scalability and reliability, as shown in Figure 3.

Unlike stacking, where a meta-model learns to optimally combine the outputs of different base models, bagging aggregates independently trained models through simple averaging or voting without introducing an additional learning layer. This independence makes bagging exceptionally robust against overfitting, focusing purely on variance reduction rather than leveraging inter-model dependencies.
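The bagging procedure and Equation (3) can be illustrated with a deliberately small sketch. A least-squares slope through the origin stands in for an expensive fine-tuned LLM; that base learner, the toy data, and the function names are assumptions chosen only to keep the example runnable.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) examples with replacement."""
    return [rng.choice(data) for _ in data]

def fit_slope(sample):
    """Toy base learner: least-squares slope through the origin."""
    num = sum(x * y for x, y in sample)
    den = sum(x * x for x, _ in sample)
    return num / den

def bagging_fit(data, n_models, seed=0):
    """Train one base learner per bootstrap sample."""
    rng = random.Random(seed)
    return [fit_slope(bootstrap_sample(data, rng)) for _ in range(n_models)]

def bagging_predict(slopes, x):
    """Equation (3): average the base-model predictions h_i(x)."""
    return sum(s * x for s in slopes) / len(slopes)

# Noisy observations of roughly y = 2x (illustrative values).
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
slopes = bagging_fit(data, n_models=10)
prediction = bagging_predict(slopes, 2.0)
```

Each base learner sees a slightly different resampling of the data, so the individual slopes differ, while their average varies less than any single fit would.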

4.1.3. Boosting

Boosting improves predictive performance by sequentially training models, with each focusing on the errors made by its predecessor [36]. The final ensemble prediction is a weighted sum of individual models’ outputs:

$H(x) = \sum_{i=1}^{n} w_{i} h_{i}(x),$

(4)

where $w_{i}$ represents the weight of the i-th model, often proportional to its accuracy. Boosting can be particularly effective for LLMs in tasks that demand high precision, such as entity recognition or text classification. One advantage of boosting is its ability to tackle imbalanced datasets [37,38]. For instance, in legal document analysis, an LLM ensemble can iteratively refine classifications of minority classes (e.g., rare legal clauses), ensuring better handling of different legal terminology.
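To make the sequential reweighting concrete, the sketch below uses AdaBoost with one-dimensional decision stumps as one well-known instance of boosting. The choice of AdaBoost, the stump learner, and the toy data are assumptions made for illustration; the weights in Equation (4) correspond to `alpha` here, which grows as a stump's weighted error shrinks.

```python
import math

def stump_predict(x, threshold, polarity):
    """Decision stump: `polarity` left of the threshold, its negation otherwise."""
    return polarity if x < threshold else -polarity

def best_stump(xs, ys, weights):
    """Pick the stump with the lowest weighted training error."""
    best = None
    for threshold in [x - 0.5 for x in sorted(set(xs))]:
        for polarity in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(x, threshold, polarity) != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best

def adaboost(xs, ys, rounds):
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []  # entries are (alpha, threshold, polarity)
    for _ in range(rounds):
        err, threshold, polarity = best_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, threshold, polarity))
        # Up-weight the examples this stump misclassified.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def boosted_predict(ensemble, x):
    """Sign of the weighted sum in Equation (4)."""
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score > 0 else -1

# Labels in {+1, -1}; no single stump separates this pattern.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
ensemble = adaboost(xs, ys, rounds=3)
predictions = [boosted_predict(ensemble, x) for x in xs]
```

No single stump can fit the positive-negative-positive pattern, yet three boosted stumps classify all six points correctly, because each round concentrates weight on the examples its predecessors got wrong.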

4.1.4. Voting Ensembles

Voting ensembles combine predictions from multiple models using majority or weighted voting [39]. Voting is a straightforward yet powerful technique, especially when combining models with similar performance levels. For LLMs, voting can be used in question-answering systems, where models fine-tuned on diverse datasets independently generate answers. The ensemble selects the most agreed-upon answer, ensuring both accuracy and reliability. For classification tasks, the ensemble prediction is:

$H(x) = \arg\max_{k} \sum_{i=1}^{n} w_{i} \cdot \mathbb{1}\{h_{i}(x) = k\},$

(5)

where k represents class labels, $w_{i}$ is the weight for the i-th model, and $\mathbb{1}\{\cdot\}$ is an indicator function. Furthermore, voting ensembles can be used in applications like content moderation, where multiple LLMs classify text as harmful or benign.
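Equation (5) translates almost directly into code. The sketch below is a minimal illustration in which the answers and weights are invented; with no weights supplied it reduces to plain majority voting.

```python
from collections import defaultdict

def weighted_vote(predictions, weights=None):
    """Equation (5): return the label with the largest total weight."""
    if weights is None:
        weights = [1.0] * len(predictions)  # plain majority voting
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    # Ties go to the label that was seen first.
    return max(totals, key=totals.get)

# Hypothetical answers from three models to the same question.
answers = ["Paris", "Paris", "Lyon"]
majority = weighted_vote(answers)
weighted = weighted_vote(answers, [0.2, 0.2, 0.9])
```

Here `majority` is "Paris", while `weighted` is "Lyon" because the third model's weight of 0.9 exceeds the combined 0.4 of the other two. For free-form LLM answers, some normalization of the strings (or semantic clustering) would be needed before votes can be counted, which this sketch does not attempt.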

4.2. Parameter-Level Ensembles

Parameter-level ensembles involve training multiple variants of the same LLM architecture, each fine-tuned with different datasets or hyperparameters [40]. This technique improves robustness and generalization by incorporating diverse learned representations. Mathematically, given a set of models $\{h_{i}(x; θ_{i})\}$ with varying parameters $θ_{i}$, the ensemble prediction is:

$H(x) = \frac{1}{n} \sum_{i=1}^{n} h_{i}(x; θ_{i}).$

(6)

Parameter-level ensembles can adapt LLMs to specific domains by varying hyperparameters like learning rates, layer sizes, or optimization strategies [41,42]. This parameter diversity ensures that the ensemble captures different aspects of the input data, leading to improved task-specific performance.

4.3. Task-Specific Ensembles

Task-specific ensembles are tailored to meet the demands of specialized applications by integrating LLMs fine-tuned for distinct subtasks or domains [43]. This approach allows for the creation of systems capable of handling complex, multi-faceted workflows. For instance, in healthcare applications, models specialized in diagnosis, prognosis, and treatment recommendation can be combined to deliver holistic clinical insights. This ensemble approach ensures that each component model contributes its unique expertise to enhance overall system performance, addressing diverse healthcare scenarios effectively.

The success of task-specific ensembles relies heavily on the careful selection of models and the aggregation strategies used [44]. Ensuring compatibility between component models is critical, as each needs to complement the strengths and mitigate the limitations of others. Aggregation mechanisms, such as weighted voting or cascading predictions, are often used to consolidate outputs [45]. These methods are useful in fields like legal AI, where models fine-tuned for contract analysis and case law summarization can work together to offer a comprehensive solution. Such ensembles improve the system’s ability to navigate the complexity of legal terminologies and contextual requirements.

Moreover, task-specific ensembles offer a significant advantage in bridging domain-specific knowledge gaps, enabling the development of robust solutions for interdisciplinary challenges. In practice, this adaptability is evident in domains like finance and cybersecurity, where separate models fine-tuned for fraud detection, compliance, and risk analysis can be combined to enhance decision making.

4.4. Knowledge Distillation and Model Compression

Knowledge distillation is a model compression technique that uses ensembles of LLMs (teachers) to train smaller, more efficient models (students) without significantly sacrificing predictive performance [46,47]. In this framework, shown in Figure 4, the ensemble provides soft labels or probability distributions as guidance, capturing richer information than hard labels [48]. For instance, an ensemble trained in sentiment analysis might output probabilities such as $\{0.6, 0.3, 0.1\}$ for positive, neutral, and negative classes, respectively. These soft labels convey the uncertainty and relationships between classes, which the student model can learn to replicate.

The effectiveness of the knowledge distillation strategy often depends on the choice of teacher models. Ensembles comprising GPT-based models typically excel in general understanding and context generation, making them suitable for tasks requiring nuanced context preservation, such as summarization and dialogue generation. In contrast, ensembles involving BERT-based teachers are particularly effective for feature extraction tasks, including sentiment classification and named entity recognition, due to their capability in capturing detailed, token-level representations [49,50]. The objective of knowledge distillation is to minimize the discrepancy between the teacher ensemble’s output $H(x)$ and the student model’s output $h_{s}(x)$ [51]. This is achieved by optimizing the Kullback–Leibler (KL) divergence:

$\mathcal{L}_{KD} = \mathrm{KL}\left(H(x) \,\|\, h_{s}(x)\right),$

(7)

where KL measures how closely the student model’s predictions match the teacher’s probabilities [48,51]. This approach is essential in scenarios where deploying large ensembles is computationally infeasible. By distilling the knowledge from an ensemble into a single compact model, it becomes feasible to deploy high-performing models on resource-constrained devices such as mobile phones or edge computing platforms.
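The distillation objective in Equation (7) can be checked numerically on the sentiment example. The sketch below assumes that the ensemble output H(x) is the plain average of the teachers' probability vectors, which is one common choice but is not specified above; the teacher and student distributions are invented for illustration.

```python
import math

def average_distributions(distributions):
    """Assumed ensemble output H(x): mean of the teachers' probabilities."""
    n = len(distributions)
    return [sum(d[k] for d in distributions) / n
            for k in range(len(distributions[0]))]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions given as lists of probabilities."""
    return sum(pk * math.log(pk / max(qk, eps))
               for pk, qk in zip(p, q) if pk > 0)

# Hypothetical teacher outputs over (positive, neutral, negative).
teachers = [[0.7, 0.2, 0.1], [0.5, 0.4, 0.1]]
soft_labels = average_distributions(teachers)  # approximately [0.6, 0.3, 0.1]

student_close = [0.55, 0.35, 0.10]
student_far = [0.10, 0.20, 0.70]
loss_close = kl_divergence(soft_labels, student_close)
loss_far = kl_divergence(soft_labels, student_far)
```

The student whose distribution is close to the soft labels incurs a much smaller loss than the one that puts most of its mass on the negative class, and the loss is zero only when the two distributions coincide. Practical distillation setups often add a temperature to soften both distributions and combine this term with a standard cross-entropy loss on the hard labels; both are omitted here.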

Furthermore, different distillation methods can be broadly categorized into response-based and feature-based distillation. Response-based distillation focuses on transferring the output distributions (soft labels) directly from teacher to student models, as described previously. This method is effective for tasks requiring knowledge about class probabilities or output uncertainties. In contrast, feature-based distillation transfers intermediate representations or internal activations from the teacher models to guide the student model’s feature learning, offering deeper structural knowledge transfer. Feature-based distillation is particularly valuable for tasks demanding strong internal representations, such as semantic embedding extraction and language understanding tasks [50,52].

Model compression techniques often complement knowledge distillation to further optimize student models. Examples include pruning and quantization. Pruning involves removing less critical parameters from the student model to reduce its size without significantly affecting accuracy, while quantization involves representing model weights with lower-precision data types to save memory and computation [53]. Knowledge distillation and model compression are used in Edge AI to deploy compact LLMs for real-time applications like voice assistants and chatbots on smartphones [54,55].
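Pruning and quantization can both be illustrated on a flat list of weights. The sketch below uses magnitude pruning and symmetric 8-bit linear quantization, two simple and widely used variants chosen here as assumptions; the weight values are invented, and real toolchains operate per tensor or per channel with calibration data.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    k = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]

def quantize_int8(weights):
    """Symmetric linear quantization to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Map the stored integers back to approximate float weights."""
    return [v * scale for v in q]

weights = [0.02, -0.75, 0.40, -0.01, 1.27, 0.09]
pruned = magnitude_prune(weights, sparsity=0.5)
q, scale = quantize_int8(pruned)
restored = dequantize(q, scale)
```

With a sparsity of 0.5, the three smallest-magnitude weights are set to zero, and the surviving weights are stored as the integers -75, 40, and 127 together with a single scale factor of about 0.01. The reconstruction error per weight is bounded by half the scale.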

4.5. Mixture-of-Experts

Mixture-of-experts (MoE) is an ensemble-learning technique characterized by dynamically routing inputs to specialized expert models, each responsible for handling different subsets of the input data [56]. This routing mechanism, typically realized by a gating network, ensures efficient utilization of computational resources by activating only a subset of experts for each input, thus significantly enhancing scalability and reducing inference latency compared to traditional ensembles [57]. Formally, an MoE model comprises a set of expert networks $\{E_{1}, E_{2}, \dots, E_{n}\}$ and a gating network G, which determines the weights assigned to each expert for a given input x. The output y of the MoE can be mathematically expressed as follows:

$y = \sum_{i=1}^{n} G_{i}(x) \cdot E_{i}(x),$

(8)

where $G_{i}(x)$ represents the gating weight for the i-th expert, satisfying the condition $\sum_{i=1}^{n} G_{i}(x) = 1$. MoE architectures are particularly beneficial for LLMs, allowing the training and deployment of extraordinarily large models by partitioning parameters across experts [56]. For instance, Google’s Switch Transformer scales to trillions of parameters by activating only one expert per input token, significantly reducing computational costs while maintaining high performance [56]. Similarly, Meta’s BASE Layers leverage MoE techniques to specialize individual model components for distinct NLP tasks, thereby improving generalization and task adaptability [58].
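Sparse routing as in Equation (8) can be sketched with a linear gating network and top-k selection. Everything concrete below is an assumption for illustration: the experts are toy scalar functions, the gate weights are hand-picked, and the gate values are renormalized over the selected experts so that they still sum to one, which is one of several conventions used in practice.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gate(x, gate_weights, top_k):
    """Return {expert_index: G_i(x)} with only the top_k experts active."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Renormalize over the kept experts so the active gate weights sum to 1.
    probs = softmax([logits[i] for i in top])
    return dict(zip(top, probs))

def moe_forward(x, experts, gate_weights, top_k=1):
    """Equation (8), evaluating only experts with a non-zero gate weight."""
    gates = gate(x, gate_weights, top_k)
    y = sum(g * experts[i](x) for i, g in gates.items())
    return y, gates

# Hypothetical scalar-valued experts and hand-picked gate weights.
experts = [lambda x: sum(x), lambda x: x[0] - x[1], lambda x: 10.0 * x[0]]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
x = [2.0, 0.5]

y_top1, gates_top1 = moe_forward(x, experts, gate_weights, top_k=1)
y_top2, gates_top2 = moe_forward(x, experts, gate_weights, top_k=2)
```

With `top_k=1`, only the first expert runs, mirroring the one-expert-per-token routing described for the Switch Transformer; with `top_k=2`, the output is a convex mix of the first two experts, and the third expert is never evaluated. The saving comes from skipping inactive experts entirely, not from the gating arithmetic itself.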

4.6. Hybrid/Multi-Agent Ensembles

Hybrid or multi-agent ensembles represent an advanced form of ensemble learning, where multiple distinct LLM agents collaborate, interact, or coordinate their predictions to handle complex, multifaceted tasks [59]. Unlike traditional ensembles, multi-agent ensembles leverage inter-agent communication protocols, allowing agents to exchange intermediate representations, predictions, or queries, thereby significantly enhancing the reasoning and adaptability of the ensemble system [60,61].

In a typical multi-agent ensemble setup, a task T is decomposed into subtasks $\{T_{1}, T_{2}, \dots, T_{m}\}$, each handled by an individual agent or submodel $A_{i}$. The agents interact through a defined communication mechanism C, which facilitates information exchange and collaboration. The final prediction or output y can be expressed mathematically as an aggregation function over the predictions of interacting agents:

$y = f\left(C\left(A_{1}(x), A_{2}(x), \dots, A_{m}(x)\right)\right),$

(9)

where $f(\cdot)$ denotes a suitable aggregation or decision function, such as weighted voting, averaging, or a learned meta-model. Recent applications highlight the advantages of hybrid and multi-agent ensembles in domains requiring complex reasoning and interaction, such as healthcare, legal analysis, and conversational AI. For example, Wang et al. [60] proposed a multi-agent ensemble method leveraging multiple LLMs for electronic health record (EHR) annotation tasks. Their approach enabled each agent to specialize in distinct annotation subtasks, resulting in improved efficiency and annotation quality. Similarly, Park et al. [61] introduced generative agent ensembles where multiple LLM agents collaborate to simulate human-like interactive environments, significantly enhancing contextual coherence and reasoning abilities.
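The structure of Equation (9) can be made tangible with a toy pipeline in which rule-based functions stand in for LLM agents. This is a hypothetical sketch and not the method of Wang et al. [60] or Park et al. [61]: the agents, the shared "board" used as the communication mechanism C, and the clinical note are all invented for illustration.

```python
def tokens(note):
    """Lower-case the note and split it into simple tokens."""
    return note.lower().replace(",", " ").split()

def medication_agent(note, board):
    """A_1: tag known medication names (toy keyword lookup)."""
    known = {"aspirin", "metformin"}
    return [tok for tok in tokens(note) if tok in known]

def dosage_agent(note, board):
    """A_2: tag dosage mentions such as '500mg'."""
    return [tok for tok in tokens(note)
            if tok.endswith("mg") and tok[:-2].isdigit()]

def linking_agent(note, board):
    """A_3: read the other agents' outputs from the board and pair them up."""
    return list(zip(board["medication"], board["dosage"]))

def run_agents(note, agents):
    """Communication mechanism C: a shared board each agent can read and extend."""
    board = {}
    for name, agent in agents:
        board[name] = agent(note, board)
    return board

def aggregate(board):
    """Decision function f: merge agent outputs into one annotation record."""
    return {"medications": board["medication"],
            "dosages": board["dosage"],
            "links": board["linking"]}

note = "Patient takes metformin 500mg daily, aspirin 81mg as needed"
agents = [("medication", medication_agent),
          ("dosage", dosage_agent),
          ("linking", linking_agent)]
annotation = aggregate(run_agents(note, agents))
```

The third agent only works because it can read what the first two wrote, which is the feature that separates multi-agent ensembles from independent voting. In an LLM setting, each function would wrap a model call, and the board would carry prompts, intermediate answers, or critiques.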

4.7. Ensemble Strategies for Multimodal LLMs

Multimodal ensemble strategies represent a rapidly growing extension of ensemble learning, enabling LLMs to reason jointly across multiple data modalities, such as text, images, audio, and video. Notable examples of multimodal LLMs include GPT-4V, Flamingo, Kosmos-2, and PaLI, each designed to integrate cross-modal information for complex reasoning tasks [62,63,64]. Ensembling these models offers opportunities to enhance performance in interdisciplinary tasks, but also introduces novel challenges related to synchronization, alignment, and scalability. Some core techniques in this context include:

Late fusion: This is an approach where models independently process each modality and their outputs are combined at the decision level. This method is especially effective when modalities are loosely coupled. For instance, in medical diagnostics, a vision model processes imaging data (e.g., chest X-rays), while an LLM interprets patient history or lab reports; the final decision emerges from a weighted or rule-based fusion of their outputs.
Modality-specific ensembling: This strategy involves constructing separate ensembles within each modality, such as a group of image encoders and a group of language models. The intermediate representations from each modality are then aligned or jointly reasoned over using cross-modal attention or graph-based fusion. This is particularly effective in tasks like visual question answering (VQA) and cross-modal retrieval, where complementary strengths across modalities drive accuracy [65,66].
Alignment-based fusion: More advanced approaches leverage alignment-based fusion, where outputs or embeddings from different modalities are projected into a common latent space. This is often accomplished using contrastive learning or joint transformer encoders. Let $x^{(v)}$ and $x^{(t)}$ represent the visual and textual inputs, respectively, and $f_{v} (\cdot)$ , $f_{t} (\cdot)$ denote their modality-specific encoders. These are projected into a shared embedding space as $z_{v} = f_{v} (x^{(v)})$ and $z_{t} = f_{t} (x^{(t)})$ . The ensemble prediction is then computed via joint inference:

$H (x^{(v)}, x^{(t)}) = g (α z_{v} + (1 - α) z_{t}),$

(10)

where $α \in [0, 1]$ controls the fusion weight between modalities and $g (\cdot)$ is a task-specific decoder or classifier. This fusion mechanism is crucial in tasks where modalities must be tightly synchronized, such as radiology and report interpretation, video captioning, or scene understanding.
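The alignment-based fusion rule in Equation (10) is shown below on two-dimensional embeddings. The embeddings stand in for the encoder outputs $z_{v}$ and $z_{t}$, and the linear classifier used as the decoder g is a hypothetical placeholder; both are assumptions, since no specific encoders or task heads are prescribed above.

```python
def fuse(z_v, z_t, alpha):
    """Convex combination of visual and textual embeddings, Equation (10)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [alpha * v + (1 - alpha) * t for v, t in zip(z_v, z_t)]

def linear_classifier(z, weights, bias):
    """Hypothetical task head g: one linear score per class, then argmax."""
    scores = [sum(w * zi for w, zi in zip(row, z)) + b
              for row, b in zip(weights, bias)]
    return max(range(len(scores)), key=lambda k: scores[k])

z_v = [1.0, 0.0]  # stand-in for the visual embedding f_v(x^(v))
z_t = [0.0, 1.0]  # stand-in for the textual embedding f_t(x^(t))
weights = [[1.0, 0.0], [0.0, 1.0]]
bias = [0.0, 0.0]

pred_visual_heavy = linear_classifier(fuse(z_v, z_t, 0.8), weights, bias)
pred_text_heavy = linear_classifier(fuse(z_v, z_t, 0.2), weights, bias)
```

With α = 0.8 the fused vector is dominated by the visual embedding and the classifier picks class 0; with α = 0.2 the textual embedding dominates and it picks class 1. The element-wise sum only makes sense because both embeddings are assumed to live in the same aligned space, which is what contrastive pretraining or a joint encoder is meant to provide.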

Applications of multimodal ensembles are increasingly prominent in areas such as medical imaging, clinical decision support, and VQA. For example, Li et al. [66] demonstrated that ensembling vision-language models improves answer grounding and relevance in VQA benchmarks. Similarly, ensemble methods integrating EHRs and imaging modalities have been shown to outperform unimodal models in clinical outcome prediction [64].

4.8. Comparative Summary of Ensemble Strategies

This section synthesizes the ensemble strategies explored in the preceding sections by highlighting their comparative strengths, limitations, and ideal use cases. While the overarching goal of ensemble learning is to enhance prediction accuracy, robustness, and generalizability, each ensemble type comes with its own trade-offs across dimensions such as interpretability, computational efficiency, latency, and task alignment. At one end of the spectrum, model-level ensembles, including methods such as bagging, boosting, stacking, and voting, offer strong performance gains by aggregating diverse models trained independently. These strategies are especially suited for high-stakes domains like medical diagnosis or legal analytics, where reliability and robustness are critical. However, their reliance on multiple large models leads to high computational and storage costs, making them less practical for real-time or resource-constrained applications.

In contrast, parameter-level ensembles such as snapshot ensembles or parameter averaging reduce computational load by leveraging checkpoints of the same model architecture. These methods improve training efficiency and scalability but may not match the diversity—and hence performance—offered by model-level ensembles. Moreover, their interpretability remains low, given the underlying homogeneity in architecture and representation space. Meanwhile, task-specific ensembles, which leverage specialized models fine-tuned for subdomains or subtasks, strike a balance between scalability and adaptability. Their modularity allows for efficient integration of domain knowledge, making them suitable for interdisciplinary domains like healthcare, finance, and cybersecurity. Nevertheless, their performance is often sensitive to prompt engineering, domain-specific data availability, and compatibility among components.
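Parameter averaging, mentioned above, can be illustrated with a short sketch. The code assumes each checkpoint is a plain dictionary mapping a parameter name to a flat list of floats, which is a simplification of real tensor-valued checkpoints; `average_checkpoints` is a hypothetical helper and not the API of any particular framework.

```python
from typing import Dict, List

Checkpoint = Dict[str, List[float]]


def average_checkpoints(checkpoints: List[Checkpoint]) -> Checkpoint:
    """Element-wise mean of checkpoints that share one architecture."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    names = checkpoints[0].keys()
    for ckpt in checkpoints:
        if ckpt.keys() != names:
            raise ValueError("checkpoints must share the same parameter names")
    n = len(checkpoints)
    averaged: Checkpoint = {}
    for name in names:
        # Every checkpoint must store this parameter with the same size.
        if len({len(ckpt[name]) for ckpt in checkpoints}) != 1:
            raise ValueError(f"parameter {name!r} has inconsistent sizes")
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(col) / n for col in columns]
    return averaged
```

Because the result is a single set of weights, inference cost equals that of one model, which is the efficiency argument made above. The price is that all checkpoints must share an architecture, which limits diversity.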

Knowledge distillation strategies attempt to compress the knowledge of ensemble teachers into a smaller, deployable student model. They reduce latency and storage requirements, making them attractive for edge deployment scenarios. However, the student typically loses some accuracy relative to its teachers, and its quality depends on the quality and diversity of the teacher models. Additionally, the distillation process requires careful calibration of hyperparameters and loss functions.
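To make the role of the loss function and its hyperparameters concrete, the sketch below implements one commonly used formulation of ensemble distillation: the student matches the averaged, temperature-softened teacher distribution through a KL term, combined with a standard cross-entropy term on the hard label. This formulation, the function names, and the default values of `temperature` and `lam` are assumptions made for illustration; they are not taken from a specific study reviewed here.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax with temperature scaling."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_teacher_probs(teacher_logits: Sequence[Sequence[float]],
                           temperature: float) -> List[float]:
    """Average the softened output distributions of all teachers."""
    dists = [softmax(logits, temperature) for logits in teacher_logits]
    return [sum(col) / len(dists) for col in zip(*dists)]


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(student_logits: Sequence[float],
                      teacher_logits: Sequence[Sequence[float]],
                      hard_label: int,
                      temperature: float = 2.0,
                      lam: float = 0.5) -> float:
    """lam * T^2 * KL(teacher || student) + (1 - lam) * cross-entropy."""
    soft_targets = ensemble_teacher_probs(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    soft_loss = (temperature ** 2) * kl_divergence(soft_targets, student_soft)
    hard_loss = -math.log(softmax(student_logits)[hard_label] + 1e-12)
    return lam * soft_loss + (1.0 - lam) * hard_loss
```

The temperature and the mixing weight `lam` are the hyperparameters that usually need tuning: higher temperatures expose more of the teachers' relative class preferences, while `lam` trades fidelity to the teachers against fit to the labels.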

Emerging techniques like MoE ensembles provide a scalable solution by routing different inputs to specialized submodels. These methods enable the training of extremely large models without a proportional increase in inference cost. Still, they pose significant challenges in terms of expert balancing, routing complexity, and debugging due to the sparsity and conditional activation mechanisms used. Meanwhile, hybrid and multi-agent ensembles, comprising multiple autonomous models or agents that interact and reason jointly, are highly promising for complex reasoning and decision-making tasks. They offer adaptability in dynamic environments, particularly in domains requiring dialogue management, multi-step planning, or autonomous coordination. However, coordination overhead, synchronization bottlenecks, and low interpretability can hinder their adoption in real-world systems.
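The sparse routing idea behind MoE ensembles can be sketched as follows. The example uses top-k gating, a common routing scheme, with scalar inputs and toy experts; the names `route_top_k` and `moe_forward` and the callable `gate` are hypothetical. Only the selected experts are evaluated, which is why inference cost does not grow in proportion to the number of experts.

```python
import math
from typing import Callable, List, Sequence, Tuple


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(gate_logits: Sequence[float], k: int) -> List[Tuple[int, float]]:
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    if k < 1:
        raise ValueError("k must be at least 1")
    order = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)
    top = order[:k]
    # Renormalize the gate scores over the selected experts only.
    weights = softmax([gate_logits[i] for i in top])
    return list(zip(top, weights))


def moe_forward(x: float,
                gate: Callable[[float], Sequence[float]],
                experts: Sequence[Callable[[float], float]],
                k: int = 2) -> float:
    """Weighted sum of the outputs of the k experts chosen by the gate."""
    routed = route_top_k(gate(x), k)
    return sum(weight * experts[i](x) for i, weight in routed)
```

If the gate keeps favoring the same few experts, the remaining ones receive little traffic, which is the expert-balancing problem noted above.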

Finally, multimodal LLM ensembles, a rapidly evolving class, combine inputs from heterogeneous sources such as text, images, audio, and video. Strategies such as late fusion, modality-specific ensembling, and alignment-based fusion allow these systems to leverage complementary signals across modalities. They have shown great promise in visual question answering, clinical decision support, and cross-modal retrieval. Nevertheless, their development is constrained by challenges in modality alignment, synchronization, and scalability. A comparative summary of these ensemble strategies is provided in Table 1, offering a consolidated reference for selecting appropriate strategies based on specific deployment and task requirements.

5. Notable Ensemble LLM Applications

This section presents notable applications of ensemble LLMs, highlighting their impact across multiple domains. Huang et al. [67] introduced DEEPEN, an ensemble-learning framework designed to address the challenges posed by heterogeneous LLMs with varying vocabularies. Their approach employs a training-free method that maps probability distributions from individual models into a universal relative space, facilitating effective aggregation and improving performance across diverse NLP tasks. Similarly, Xu et al. [11] explored techniques to harmonize outputs from LLMs with different vocabularies, ensuring seamless integration in ensemble settings.

Ensemble LLM approaches have been explored in medical applications. Yang et al. [33] developed an ensemble learning pipeline that utilizes state-of-the-art LLMs to improve the accuracy of medical question-answering (QA) systems. Their method improved performance across various medical QA datasets, demonstrating the potential of ensemble methods in healthcare applications. In a related study, He et al. [68] introduced LLM-Forest, a novel ensemble model designed for imputing missing values in healthcare tabular data, thereby improving the reliability of medical datasets.

Furthermore, Lucas et al. [69] proposed an iterative ensemble reasoning approach to enhance the performance of LLMs. Their method focuses on refining the reasoning process of LLMs, leading to improved consistency and accuracy in medical QA tasks. This approach is particularly beneficial for less powerful models, such as GPT-3.5 Turbo and Med42-70B, demonstrating its potential to elevate the capabilities of various LLMs in the medical domain. Similarly, Li et al. [70] explored the extraction of biomedical entities and relations from literature by employing an ensemble of pretrained LLMs. Their study emphasizes the robustness of combining multiple models to achieve superior performance in named entity recognition (NER) and relation extraction (RE) tasks across diverse biomedical entities. The findings underscore the efficacy of task-specific model integration in biomedical information extraction.

Huang et al. [37] introduced a multi-agent ensemble method powered by LLMs to address the challenges of electronic health record (EHR) annotation. This innovative approach significantly reduces the time and effort required for labeling large-scale EHR data, automating the process with high accuracy and quality. The ensemble method also generalizes well to other text data-labeling tasks, such as social determinants of health (SDOH) identification, highlighting its versatility in healthcare data management.

Wu et al. [71] implemented an ensemble approach for classifying Chinese medical texts, effectively addressing the complexities inherent in medical text categorization. By integrating three distinct submodels, their method leverages the complementary strengths of each, resulting in enhanced classification performance. This approach demonstrates the value of model integration in tackling the nuanced challenges of medical text classification. Meanwhile, in the pursuit of open healthcare natural language processing (NLP) solutions, Gururajan et al. [72] released Aloe, a family of fine-tuned open healthcare LLMs. Aloe models are trained on leading base models and undergo alignment phases, setting new standards for ethical performance in healthcare LLMs. This initiative contributes significantly to the development of accessible and high-performing models in the healthcare sector.

Lai et al. [73] deployed adaptive ensembles of fine-tuned transformers to detect AI-generated clinical text. Their approach combines individual classifier models using adaptive ensemble algorithms, achieving notable improvements in accuracy and generalization ability. This method addresses the growing need for effective detection of AI-generated content in clinical settings. Additionally, Knafou et al. [74] utilized an ensemble of deep-learning language models to support the triage of COVID-19 literature for systematic reviews. Their study demonstrates the potential of deep-learning ensembles to efficiently perform literature triage, significantly outperforming standalone models. This approach offers valuable support for epidemiological curation and review processes.

In the e-commerce sector, accurate extraction of product attributes is crucial for enhancing customer experience. Fang et al. [12] proposed LLM-Ensemble, an ensemble method tailored for extracting product attribute values in e-commerce. By combining outputs from various LLMs, this approach improved the precision of attribute extraction, leading to better product recommendations and increased customer satisfaction.

Ensemble LLMs have also been explored in code generation and mathematical problem-solving tasks. Gu et al. [75] introduced CharED, a character-wise ensemble decoding method aimed at enhancing LLM performance in coding and mathematical reasoning tasks. Their approach combined outputs from multiple LLMs at the character level, utilizing their complementary strengths to achieve superior results in programming and mathematical logic applications.

In the field of query reformulation, Dhole and Agichtein [76] introduced GenQREnsemble, an ensemble-based prompting technique that leverages paraphrases of zero-shot instructions to generate multiple sets of keywords. This method enhances retrieval performance by effectively reformulating queries, showcasing the utility of ensemble prompting in information retrieval tasks.
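The general idea of ensemble prompting for query reformulation can be sketched as below. This is not the implementation of [76]: the merging rule (deduplicate keywords and append them to the original query) is an assumption, and `generate_keywords` is a hypothetical stand-in for an LLM call that maps an instruction and a query to a list of keywords.

```python
from typing import Callable, List, Sequence


def ensemble_reformulate(query: str,
                         instructions: Sequence[str],
                         generate_keywords: Callable[[str, str], List[str]]) -> str:
    """Expand a query with keywords produced under several instruction paraphrases."""
    merged: List[str] = []
    seen = set()
    for instruction in instructions:
        # One keyword set per paraphrased instruction.
        for keyword in generate_keywords(instruction, query):
            key = keyword.strip().lower()
            if key and key not in seen:
                seen.add(key)
                merged.append(key)
    if not merged:
        return query
    return query + " " + " ".join(merged)
```

Each paraphrase acts as one ensemble member; the union of their keyword sets is what provides the additional coverage for retrieval.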

Cross-lingual and multimodal applications have also benefited from ensemble LLM techniques. Miah et al. [77] explored a multimodal approach to cross-lingual sentiment analysis by combining transformer models and LLMs in an ensemble framework. Their method effectively handled multiple languages and modalities, leading to improved sentiment analysis across diverse linguistic contexts.

In the blockchain technology field, ensuring the security of smart contracts is vital. Luo et al. [78] presented FELLMVP, an ensemble LLM framework designed to accurately classify vulnerabilities in smart contracts. By integrating multiple LLMs, FELLMVP improved the detection and classification of potential security issues, contributing to more secure blockchain applications.

Furthermore, Farr et al. [38] introduced LLM Chain Ensembles, a method that enhances the scalability and accuracy of data-annotation processes by chaining multiple LLMs in an ensemble. This approach streamlines the annotation workflow, reducing manual effort and improving overall efficiency. Similarly, Huang et al. [79] proposed an ensemble framework that integrates BERT and byte pair encoding (BPE) tokenization to enhance text recognition tasks. Their method effectively combines the strengths of different encoding strategies, resulting in improved recognition accuracy.

Meanwhile, Cohen et al. [80] introduced DFPE, a diverse fingerprint ensemble method aimed at boosting LLM performance across various applications. By creating a diverse set of model “fingerprints,” DFPE effectively captures different aspects of data, leading to more robust and accurate LLM outputs. Similarly, Tekin et al. [81] proposed LLM-TOPLA, an ensemble approach that maximizes diversity among LLMs to improve overall performance in NLP tasks.

In software engineering, identifying the source of faults is a critical task. Cho et al. [82] presented an ensemble of small language models, called COSMosFL, designed to improve fault localization. Their approach combines the insights of multiple models, enhancing the precision and reliability of fault detection in large codebases.

Chen et al. [83] proposed a GPU-free ensemble LLM method for tracing the origins of academic publications. By leveraging closed-source LLMs to generate predicted reference sources, their method refines predictions through ensemble learning. Notably, this approach achieves commendable performance without the need for GPU resources, highlighting its accessibility and efficiency. Meanwhile, Abburi et al. [84] explored the classification of AI-generated versus human-written content using ensemble LLM approaches. Their work addresses the critical task of detecting AI-generated language, contributing to responsible usage of LLMs by ensuring the authenticity of textual content.

Yang et al. [85] proposed LDRE, an LLM-based divergent reasoning and ensemble method for zero-shot composed image retrieval. By capturing diverse possible semantics of composed targets, their approach enhances retrieval accuracy, demonstrating the applicability of LLM ensembles in multimodal tasks. Jiang et al. [86] introduced LLM-Blender, an ensembling framework designed to leverage the diverse strengths of multiple open-source LLMs. Consisting of PairRanker and GenFuser modules, LLM-Blender achieves consistently superior performance by distinguishing subtle differences between candidate outputs and generating refined responses. This framework exemplifies the benefits of model ensembling in achieving robust NLP outputs.

Collectively, these studies demonstrate the versatility and robustness of ensemble learning in enhancing LLM performance across a spectrum of applications. Table 2 summarizes the various ensemble LLM studies.

6. Challenges and Limitations

While ensemble LLMs offer numerous benefits, their development and deployment are accompanied by several challenges and limitations. This section explores key issues, including scalability, alignment, bias, and interpretability.

6.1. Scalability and Resource Requirements

The use of ensemble LLMs necessitates substantial computational and memory resources. Training, storing, and deploying multiple large models simultaneously can result in high costs and infrastructure demands [87]. In particular, the deployment of ensembles often requires distributed systems and specialized hardware, which can hinder scalability. For organizations with limited resources, these requirements may prove prohibitive. Additionally, real-time applications that depend on ensembles may encounter latency issues due to the increased computational overhead.

The trade-offs between accuracy and efficiency become even more pressing in resource-constrained or real-time environments. Practitioners must balance the added performance of ensemble models with increased inference latency, power consumption, and storage. For example, latency-sensitive settings such as conversational agents or emergency medical triage systems may experience degraded user experience or operational risk if ensemble responses are delayed. Techniques like model distillation, selective routing, and early-exit strategies can help reduce this burden, though they often come with performance compromises.

To address these concerns, several recent approaches advocate for hybrid ensembles using compact or quantized models that approximate the behavior of larger base models [48]. Similarly, asynchronous ensembling—where slower models update less frequently—can support near real-time throughput while preserving diversity. However, these optimizations must be carefully tuned, as overly aggressive pruning or quantization may reintroduce bias or degrade reliability.

6.2. Model Alignment and Coherence

Achieving alignment and coherence across multiple models in an ensemble is a persistent challenge. Individual models may produce outputs that conflict with one another, leading to inconsistencies, especially in tasks like text generation or summarization [88]. Techniques such as weighted voting or meta-model integration are often employed to address these inconsistencies, but they introduce additional complexity and may not completely resolve coherence issues. Ensuring seamless integration of outputs remains a critical hurdle for effective ensemble usage.
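Weighted voting, one of the techniques mentioned above, can be sketched for the case where each model returns a discrete answer (for example a class label or a multiple-choice option). The function name and the example weights are hypothetical; in practice the weights might be derived from each model's validation accuracy.

```python
from collections import defaultdict
from typing import Dict, Hashable, Sequence


def weighted_vote(predictions: Sequence[Hashable], weights: Sequence[float]) -> Hashable:
    """Return the answer with the largest total weight across models."""
    if not predictions or len(predictions) != len(weights):
        raise ValueError("need one weight per prediction")
    scores: Dict[Hashable, float] = defaultdict(float)
    for label, weight in zip(predictions, weights):
        scores[label] += weight
    # On a tie, the answer that was seen first wins.
    return max(scores, key=scores.get)
```

For example, `weighted_vote(["A", "B", "B"], [0.6, 0.25, 0.25])` returns `"A"`, because the single high-weight model outweighs the two lower-weight models that agree on `"B"`. Free-form generated text has no such discrete label space, which is one reason voting alone does not resolve coherence issues in generation tasks.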

6.3. Bias and Fairness Issues

Bias in ensemble LLMs remains a critical concern, as component models may reflect and reinforce the biases present in their training data. Although ensembles are often assumed to mitigate bias through diversity, this effect is not guaranteed. If the individual models are homogeneously biased—perhaps due to shared pretraining corpora or fine-tuning strategies—the ensemble may amplify rather than reduce these biases. This is particularly problematic in socially sensitive tasks such as hiring recommendation or healthcare triage.

Nevertheless, when constructed thoughtfully, model diversity can offer meaningful improvements. Empirical studies suggest that ensembles with heterogeneous architectures or data sources exhibit lower correlation in error patterns and reduced group-specific disparities [89]. Strategies to enhance diversity include using models fine-tuned on demographically balanced datasets, incorporating adversarial de-biasing techniques, or ensembling models with distinct inductive biases. Diversity-aware voting schemes, where disagreeing predictions are surfaced for human oversight, can also act as a fairness checkpoint.
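A diversity-aware voting scheme of the kind described above can be sketched as a majority vote that also reports how strongly the models agree and flags low-agreement cases for human review. The function name, the returned fields, and the 0.75 threshold are illustrative assumptions.

```python
from collections import Counter
from typing import Any, Dict, Hashable, Sequence


def vote_with_review_flag(predictions: Sequence[Hashable],
                          min_agreement: float = 0.75) -> Dict[str, Any]:
    """Majority vote that surfaces disagreement instead of hiding it."""
    if not predictions:
        raise ValueError("need at least one prediction")
    label, votes = Counter(predictions).most_common(1)[0]
    agreement = votes / len(predictions)
    return {
        "label": label,
        "agreement": agreement,
        # Route to a human reviewer when too few models agree.
        "needs_review": agreement < min_agreement,
    }
```

The threshold controls how much reviewer workload the system generates; it would normally be set from validation data rather than fixed in advance.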

Still, ensemble design alone cannot guarantee fairness. Bias detection and auditing must be integrated into all stages of development, including pretraining, fine-tuning, and post-deployment evaluation. Practitioners should prioritize transparent reporting on fairness metrics and test ensembles against demographic slices to ensure equitable performance. Without these safeguards, ensemble systems risk reinforcing systemic harms under the illusion of diversity-driven robustness.
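Testing an ensemble against demographic slices can start from something as simple as per-group accuracy and the gap between the best and worst group. The sketch below assumes that a group label is available for every example; the function names are hypothetical, and accuracy is only one of several fairness-relevant metrics.

```python
from collections import defaultdict
from typing import Dict, Hashable, Sequence


def accuracy_by_group(y_true: Sequence[Hashable],
                      y_pred: Sequence[Hashable],
                      groups: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Accuracy of the ensemble predictions within each demographic slice."""
    if not (len(y_true) == len(y_pred) == len(groups)):
        raise ValueError("inputs must have the same length")
    correct: Dict[Hashable, int] = defaultdict(int)
    total: Dict[Hashable, int] = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {group: correct[group] / total[group] for group in total}


def max_accuracy_gap(per_group: Dict[Hashable, float]) -> float:
    """Difference between the best- and worst-served group."""
    return max(per_group.values()) - min(per_group.values())
```

A large gap signals that aggregate accuracy is masking uneven performance, even when the ensemble as a whole looks strong.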

6.4. Interpretability and Complexity

Interpretability poses a significant barrier to ensemble adoption in high-stakes fields like healthcare and law, where transparency, auditability, and regulatory compliance are essential. Ensemble models—especially those involving black-box components—add layers of opacity that hinder explainability. Aggregated outputs may obscure how individual models contributed to a prediction, reducing the system’s accountability in decision-making processes.

Nonetheless, interpretability tools such as SHapley Additive exPlanations (SHAP), Integrated Gradients, and attention visualization have been adapted to ensemble settings. For instance, in ensemble clinical decision systems, SHAP has been used to generate feature-level explanations across multiple models to trace risk attribution in medical diagnoses. Attention-based LLM ensembles for legal summarization have also been analyzed via heatmap overlays to highlight clause-level influences from different models [90]. These techniques support more transparent auditing, though they require significant engineering effort to aggregate interpretability scores across diverse components.
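One simple way to aggregate interpretability scores across components is sketched below. It assumes that every model already provides one attribution score per shared input feature (for instance from SHAP or Integrated Gradients), rescales each model's scores to unit L1 norm so that no model dominates by scale alone, and then takes a weighted mean. The normalization choice and the function names are assumptions for illustration, not a method from the cited studies.

```python
from typing import List, Sequence


def normalize(attributions: Sequence[float]) -> List[float]:
    """Rescale attributions so their absolute values sum to 1."""
    total = sum(abs(a) for a in attributions)
    if total == 0:
        return [0.0] * len(attributions)
    return [a / total for a in attributions]


def aggregate_attributions(per_model: Sequence[Sequence[float]],
                           weights: Sequence[float]) -> List[float]:
    """Weighted mean of normalized per-model attributions, feature by feature."""
    if not per_model or len(per_model) != len(weights):
        raise ValueError("need one weight per model")
    if len({len(a) for a in per_model}) != 1:
        raise ValueError("all models must attribute over the same features")
    normalized = [normalize(a) for a in per_model]
    weight_sum = sum(weights)
    n_features = len(normalized[0])
    return [
        sum(w * attr[j] for w, attr in zip(weights, normalized)) / weight_sum
        for j in range(n_features)
    ]
```

Scores produced by different attribution methods are not necessarily comparable, so this kind of averaging is most defensible when all components are explained with the same method.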

To enhance transparency, practitioners can consider designing ensembles from inherently interpretable base models or using surrogate explanation models that approximate ensemble logic. Another emerging approach is explanation-aware ensembling, where interpretability metrics are integrated into model selection or voting weights. While these methods are promising, broader adoption will depend on standardized benchmarks for interpretability in ensemble contexts and regulatory incentives for explainable AI in law and medicine.

6.5. Sustainability Concerns

The resource-intensive nature of ensemble LLMs raises concerns about their environmental impact. Training and deploying ensembles require significant energy, contributing to a large carbon footprint [91]. This sustainability issue becomes even more relevant as the demand for LLMs grows, prompting the need for energy-efficient techniques and alternative approaches to mitigate the ecological implications of ensemble systems.

6.6. Reported Limitations and Failure Cases

Beyond fairness and interpretability, ensemble LLMs also suffer from limitations related to coherence, contradiction, and real-time performance. Model disagreement—where different LLMs in an ensemble produce conflicting or logically incompatible outputs—is common in tasks like multi-document summarization or dialogue generation. These contradictions reduce user trust and complicate downstream integration.

Several documented failures involve incoherent responses or hallucinations arising from unaligned token representations or divergent inference strategies across component models [92,93]. Aggregation mechanisms such as majority voting or response ranking can mitigate these issues but often fail under distribution shift or long-form reasoning. Moreover, mixture-of-experts systems—while more scalable—face expert load imbalance that undermines output consistency in dynamic scenarios.

In real-time or interactive environments, latency from multiple forward passes severely constrains usability. Deployment cases in customer service and emergency response have shown that ensemble-based LLMs introduce response delays exceeding acceptable limits [94]. As a result, practical deployments must weigh accuracy against throughput and resource cost, and may require hybrid setups using fast lightweight models for preliminary response with slower ensembles for fallback or arbitration.
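The hybrid setup described above can be sketched as a two-stage cascade: a fast model answers first, and the slower ensemble is invoked only when the fast model's confidence falls below a threshold. The callables `fast_model` and `slow_ensemble`, the confidence score, and the 0.8 threshold are hypothetical placeholders.

```python
from typing import Callable, Tuple


def cascade_answer(query: str,
                   fast_model: Callable[[str], Tuple[str, float]],
                   slow_ensemble: Callable[[str], str],
                   confidence_threshold: float = 0.8) -> Tuple[str, str]:
    """Return (answer, source), where source is 'fast' or 'ensemble'."""
    answer, confidence = fast_model(query)
    if confidence >= confidence_threshold:
        # Confident enough: skip the expensive ensemble call entirely.
        return answer, "fast"
    return slow_ensemble(query), "ensemble"
```

Average latency then depends on how often the fallback is triggered, so the threshold directly expresses the accuracy versus throughput trade-off.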

7. Discussion and Future Directions

Ensemble LLMs represent a promising frontier for improving performance and robustness across diverse NLP tasks. Despite their potential, realizing their full impact requires addressing several key areas that remain underexplored. Future research could focus on optimizing resource efficiency to mitigate the computational and memory demands associated with training and deploying ensembles. Techniques such as model compression, pruning, and distributed computing can play a critical role in ensuring that ensemble systems are accessible even to organizations with constrained resources, thereby broadening their applicability.

In addition to resource optimization, ensuring fairness in ensemble LLMs is crucial. While challenges related to bias are well-documented, targeted efforts are needed to design ensemble-specific mitigation strategies. These could involve fairness-aware training processes and robust evaluation frameworks that assess the equity of outputs across diverse user groups. By incorporating more representative datasets and exploring dynamic ensemble weighting techniques, it may be possible to balance performance with fairness in real-world applications.

Another important direction lies in improving the interpretability of ensemble systems. The inherent complexity of combining multiple models often obscures the decision-making process, especially in high-stakes applications like healthcare. Advances in explainability, such as attention-based mechanisms or feature attribution techniques, could help illuminate the contributions of individual models within an ensemble. This, in turn, would foster trust among users and stakeholders while meeting the ethical and regulatory requirements for transparent AI systems.

Beyond addressing existing challenges, new opportunities arise from extending ensemble LLMs to cross-modal applications. Integrating text-based models with vision, audio, or structured data modalities can enable complex multimodal tasks such as image captioning and audio-based sentiment analysis. Effective strategies for information fusion will be essential to harness the full potential of such systems, ensuring coherence and efficiency across modalities.

Finally, sustainability must remain a priority in the development of ensemble LLMs. The environmental impact of resource-intensive training and deployment processes calls for innovative solutions such as energy-efficient model architectures and knowledge distillation techniques. Researchers can align technological progress with the broader goals of sustainable AI by reducing the computational overhead without sacrificing performance.

8. Conclusion

This review has provided a comprehensive examination of ensemble LLMs, highlighting their methodologies, applications, challenges, and potential future directions. Ensemble techniques, rooted in the principles of model diversity and aggregation, have been shown to enhance performance, robustness, and generalization across a wide range of natural language processing tasks, including sentiment analysis, machine translation, and domain-specific applications such as healthcare and cybersecurity.

Key findings from this study indicate that ensemble LLMs can address limitations inherent in single models, such as biases, overfitting, and suboptimal generalization. By combining the strengths of individual models, ensembles can deliver improved accuracy, resilience to adversarial inputs, and adaptability to specific tasks. Additionally, the integration of ensemble techniques with emerging fields, such as cross-modal processing and sustainable AI, presents promising opportunities for advancing their impact.

Author Contributions

Conceptualization, I.D.M. and T.G.S.; methodology, I.D.M. and T.G.S.; validation, I.D.M. and T.G.S.; investigation, I.D.M. and T.G.S.; writing – original draft preparation, I.D.M.; writing – review and editing, I.D.M. and T.G.S.; visualization, I.D.M.; supervision, T.G.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new data were created or analyzed in this study.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AI	Artificial Intelligence
AMP	Antimicrobial Peptides
BERT	Bidirectional Encoder Representations from Transformers
BPE	Byte Pair Encoding
DFPE	Diverse Fingerprint Ensemble
EHR	Electronic Health Record
FL	Federated Learning
GPT	Generative Pretrained Transformer
KL	Kullback–Leibler
LLM	Large Language Model
ML	Machine Learning
NLP	Natural Language Processing
QA	Question Answering
QA-RF	Query Answering via Reformulation
SFT	Supervised Fine-Tuning
TOPLA	Task-Oriented Prompt-Level Aggregation

References

Raiaan, M.A.K.; Mukta, M.S.H.; Fatema, K.; Fahad, N.M.; Sakib, S.; Mim, M.M.J.; Ahmad, J.; Ali, M.E.; Azam, S. A Review on Large Language Models: Architectures, Applications, Taxonomies, Open Issues and Challenges. IEEE Access 2024, 12, 26839–26874. [Google Scholar] [CrossRef]
Long, S.; Tan, J.; Mao, B.; Tang, F.; Li, Y.; Zhao, M.; Kato, N. A Survey on Intelligent Network Operations and Performance Optimization Based on Large Language Models. IEEE Commun. Surv. Tutor. 2025, 1. [Google Scholar] [CrossRef]
Zhang, Q.; Ding, K.; Lv, T.; Wang, X.; Yin, Q.; Zhang, Y.; Yu, J.; Wang, Y.; Li, X.; Xiang, Z.; et al. Scientific Large Language Models: A Survey on Biological & Chemical Domains. ACM Comput. Surv. 2025, 57, 1–38. [Google Scholar] [CrossRef]
Yenduri, G.; Ramalingam, M.; Selvi, G.C.; Supriya, Y.; Srivastava, G.; Maddikunta, P.K.R.; Raj, G.D.; Jhaveri, R.H.; Prabadevi, B.; Wang, W.; et al. Gpt (generative pre-trained transformer)—A comprehensive review on enabling technologies, potential applications, emerging challenges, and future directions. IEEE Access 2024, 12, 54608–54649. [Google Scholar] [CrossRef]
Mienye, I.D.; Swart, T.G. ChatGPT in Education: A Review of Ethical Challenges and Approaches to Enhancing Transparency and Privacy. Procedia Comput. Sci. 2025, 254, 181–190. [Google Scholar] [CrossRef]
Zhao, H.; Chen, H.; Yang, F.; Liu, N.; Deng, H.; Cai, H.; Wang, S.; Yin, D.; Du, M. Explainability for large language models: A survey. ACM Trans. Intell. Syst. Technol. 2024, 15, 1–38. [Google Scholar] [CrossRef]
Sakib, M.; Mustajab, S.; Alam, M. Ensemble deep learning techniques for time series analysis: A comprehensive review, applications, open issues, challenges, and future directions. Clust. Comput. 2025, 28, 73. [Google Scholar] [CrossRef]
Rane, N.; Choudhary, S.P.; Rane, J. Ensemble deep learning and machine learning: Applications, opportunities, challenges, and future directions. Stud. Med. Health Sci. 2024, 1, 18–41. [Google Scholar]
Yu, Y.C.; Kuo, C.C.; Ye, Z.; Chang, Y.C.; Li, Y.S. Breaking the ceiling of the LLM community by treating token generation as a classification for ensembling. arXiv 2024, arXiv:2406.12585. [Google Scholar]
Borah, A.; Mihalcea, R. Towards Implicit Bias Detection and Mitigation in Multi-Agent LLM Interactions. arXiv 2024, arXiv:2410.02584. [Google Scholar]
Xu, Y.; Lu, J.; Zhang, J. Bridging the gap between different vocabularies for LLM ensemble. arXiv 2024, arXiv:2404.09492. [Google Scholar]
Fang, C.; Li, X.; Fan, Z.; Xu, J.; Nag, K.; Korpeoglu, E.; Kumar, S.; Achan, K. LLM-ensemble: Optimal large language model ensemble method for e-commerce product attribute value extraction. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, Washington DC, USA, 14–18 July 2024; pp. 2910–2914. [Google Scholar]
Xian, Y.; Zeng, X.; Xuan, D.; Yang, D.; Li, C.; Fan, P.; Liu, P. Connecting Large Language Models with Blockchain: Advancing the Evolution of Smart Contracts from Automation to Intelligence. arXiv 2024, arXiv:2412.02263. [Google Scholar]
Zhao, W.X.; Zhou, K.; Li, J.; Tang, T.; Wang, X.; Hou, Y.; Min, Y.; Zhang, B.; Zhang, J.; Dong, Z.; et al. A survey of large language models. arXiv 2023, arXiv:2303.18223. [Google Scholar] [PubMed]
Minaee, S.; Mikolov, T.; Nikzad, N.; Chenaghlu, M.; Socher, R.; Amatriain, X.; Gao, J. Large language models: A survey, 2024. arXiv 2024, arXiv:2402.06196. [Google Scholar]
Kalyan, K.S. A survey of GPT-3 family large language models including ChatGPT and GPT-4. Nat. Lang. Process. J. 2023, 6, 100048. [Google Scholar] [CrossRef]
Wang, C.; Zhao, J.; Gong, J. A survey on large language models from concept to implementation. arXiv 2024, arXiv:2403.18969. [Google Scholar]
Wan, Z.; Wang, X.; Liu, C.; Alam, S.; Zheng, Y.; Liu, J.; Qu, Z.; Yan, S.; Zhu, Y.; Zhang, Q.; et al. Efficient large language models: A survey. arXiv 2023, arXiv:2312.03863. [Google Scholar]
Chang, Y.; Wang, X.; Wang, J.; Wu, Y.; Yang, L.; Zhu, K.; Chen, H.; Yi, X.; Wang, C.; Wang, Y.; et al. A survey on evaluation of large language models. ACM Trans. Intell. Syst. Technol. 2024, 15, 1–45. [Google Scholar] [CrossRef]
Wang, Y.; Wang, M.; Manzoor, M.A.; Liu, F.; Georgiev, G.; Das, R.J.; Nakov, P. Factuality of large language models: A survey. arXiv 2024, arXiv:2402.02420. [Google Scholar]
Bender, E.M.; Gebru, T.; McMillan-Major, A.; Shmitchell, S. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, Online, 3–10 March 2021; pp. 610–623. [Google Scholar]
Bommasani, R.; Hudson, D.A.; Adeli, E.; Altman, R.; Arora, S.; von Arx, S.; Bernstein, M.S.; Bohg, J.; Bosselut, A.; Brunskill, E.; et al. On the opportunities and risks of foundation models. arXiv 2021, arXiv:2108.07258.
Ganaie, M.A.; Hu, M.; Malik, A.K.; Tanveer, M.; Suganthan, P.N. Ensemble deep learning: A review. Eng. Appl. Artif. Intell. 2022, 115, 105151.
He, K.; Gan, C.; Li, Z.; Rekik, I.; Yin, Z.; Ji, W.; Gao, Y.; Wang, Q.; Zhang, J.; Shen, D. Transformers in medical image analysis. Intell. Med. 2023, 3, 59–78.
Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901.
Anil, R.; Dai, A.M.; Firat, O.; Johnson, M.; Lepikhin, D.; Passos, A.; Shakeri, S.; Taropa, E.; Bailey, P.; Chen, Z.; et al. PaLM 2 technical report. arXiv 2023, arXiv:2305.10403.
Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open foundation and fine-tuned chat models. arXiv 2023, arXiv:2307.09288.
Guo, D.; Zhu, Q.; Yang, D.; Xie, Z.; Dong, K.; Zhang, W.; Chen, G.; Bi, X.; Wu, Y.; Li, Y.; et al. DeepSeek-Coder: When the Large Language Model Meets Programming–The Rise of Code Intelligence. arXiv 2024, arXiv:2401.14196.
Dong, X.; Yu, Z.; Cao, W.; Shi, Y.; Ma, Q. A survey on ensemble learning. Front. Comput. Sci. 2020, 14, 241–258.
Zhang, W.; Xu, Y.; Wang, A.; Chen, G.; Zhao, J. Fuse feeds as one: Cross-modal framework for general identification of AMPs. Briefings Bioinform. 2023, 24, bbad336.
Mienye, I.D.; Sun, Y. A survey of ensemble learning: Concepts, algorithms, applications, and prospects. IEEE Access 2022, 10, 99129–99149.
Lu, J.; Pang, Z.; Xiao, M.; Zhu, Y.; Xia, R.; Zhang, J. Merge, ensemble, and cooperate! A survey on collaborative strategies in the era of large language models. arXiv 2024, arXiv:2407.06089.
Yang, H.; Li, M.; Zhou, H.; Xiao, Y.; Fang, Q.; Zhang, R. One LLM is not enough: Harnessing the power of ensemble learning for medical question answering. medRxiv 2023.
Ramesh, V.; Kumaresan, P. Stacked ensemble model for accurate crop yield prediction using machine learning techniques. Environ. Res. Commun. 2025, 7, 035006.
Matarazzo, A.; Torlone, R. A Survey on Large Language Models with some Insights on their Capabilities and Limitations. arXiv 2025, arXiv:2501.04040.
Streefland, G.J.; Herrema, F.; Martini, M. A Gradient Boosting model to predict the milk production. Smart Agric. Technol. 2023, 6, 100302.
Huang, J.; Nezafati, K.; Villanueva-Miranda, I.; Gu, Z.; Navar, A.M.; Wanyan, T.; Zhou, Q.; Yao, B.; Rong, R.; Zhan, X.; et al. Large language models enabled multiagent ensemble method for efficient EHR data labeling. arXiv 2024, arXiv:2410.16543.
Farr, D.; Manzonelli, N.; Cruickshank, I.; Starbird, K.; West, J. LLM chain ensembles for scalable and accurate data annotation. In Proceedings of the 2024 IEEE International Conference on Big Data (BigData), Washington, DC, USA, 15–18 December 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 2110–2118.
Dogan, A.; Birant, D. A weighted majority voting ensemble approach for classification. In Proceedings of the 2019 4th International Conference on Computer Science and Engineering (UBMK), Samsun, Turkey, 11–15 September 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–6.
Yang, E.; Shen, L.; Guo, G.; Wang, X.; Cao, X.; Zhang, J.; Tao, D. Model merging in LLMs, MLLMs, and beyond: Methods, theories, applications and opportunities. arXiv 2024, arXiv:2408.07666.
Du, G.; Lee, J.; Li, J.; Jiang, R.; Guo, Y.; Yu, S.; Liu, H.; Goh, S.K.; Tang, H.K.; He, D.; et al. Parameter competition balancing for model merging. arXiv 2024, arXiv:2410.02396.
Xiao, C.; Zhang, Z.; Song, C.; Jiang, D.; Yao, F.; Han, X.; Wang, X.; Wang, S.; Huang, Y.; Lin, G.; et al. Configurable foundation models: Building LLMs from a modular perspective. arXiv 2024, arXiv:2409.02877.
Arun, A.; John, J.; Kumaran, S. Ensemble of Task-Specific Language Models for Brain Encoding. arXiv 2023, arXiv:2310.15720.
Abimannan, S.; El-Alfy, E.S.M.; Chang, Y.S.; Hussain, S.; Shukla, S.; Satheesh, D. Ensemble multifeatured deep learning models and applications: A survey. IEEE Access 2023, 11, 107194–107217.
Campagner, A.; Ciucci, D.; Cabitza, F. Aggregation models in ensemble learning: A large-scale comparison. Inf. Fusion 2023, 90, 241–252.
Yang, C.; Zhu, Y.; Lu, W.; Wang, Y.; Chen, Q.; Gao, C.; Yan, B.; Chen, Y. Survey on knowledge distillation for large language models: Methods, evaluation, and application. ACM Trans. Intell. Syst. Technol. 2024.
Junaid, A.R. Empowering Compact Language Models with Knowledge Distillation. Authorea Prepr. 2025.
Dantas, P.V.; Sabino da Silva, W., Jr.; Cordeiro, L.C.; Carvalho, C.B. A comprehensive review of model compression techniques in machine learning. Appl. Intell. 2024, 54, 11804–11844.
Gou, J.; Yu, B.; Maybank, S.J.; Tao, D. Knowledge distillation: A survey. Int. J. Comput. Vis. 2021, 129, 1789–1819.
Sun, S.; Cheng, Y.; Gan, Z.; Liu, J. Patient knowledge distillation for BERT model compression. arXiv 2019, arXiv:1908.09355.
Hu, C.; Li, X.; Liu, D.; Wu, H.; Chen, X.; Wang, J.; Liu, X. Teacher-student architecture for knowledge distillation: A survey. arXiv 2023, arXiv:2308.04268.
Jiao, X.; Yin, Y.; Shang, L.; Jiang, X.; Chen, X.; Li, L.; Wang, F.; Liu, Q. TinyBERT: Distilling BERT for natural language understanding. arXiv 2019, arXiv:1909.10351.
Bibi, U.; Mazhar, M.; Sabir, D.; Butt, M.F.U.; Hassan, A.; Ghazanfar, M.A.; Khan, A.A.; Abdul, W. Advances in Pruning and Quantization for Natural Language Processing. IEEE Access 2024, 12, 139113–139128.
Zhang, R.; He, J.; Luo, X.; Niyato, D.; Kang, J.; Xiong, Z.; Li, Y.; Sikdar, B. Toward democratized generative AI in next-generation mobile edge networks. IEEE Netw. 2025.
Semerikov, S.O.; Vakaliuk, T.A.; Kanevska, O.B.; Moiseienko, M.V.; Donchev, I.I.; Kolhatin, A.O. LLM on the edge: The new frontier. In Proceedings of the CEUR Workshop Proceedings, Barcelona, Spain, 7–10 April 2025; pp. 137–161.
Fedus, W.; Zoph, B.; Shazeer, N. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. J. Mach. Learn. Res. 2022, 23, 1–39.
Shazeer, N.; Mirhoseini, A.; Maziarz, K.; Davis, A.; Le, Q.; Hinton, G.; Dean, J. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv 2017, arXiv:1701.06538.
Lewis, M.; Bhosale, S.; Dettmers, T.; Goyal, N.; Zettlemoyer, L. BASE layers: Simplifying training of large, sparse models. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 6265–6274.
Qu, N.; Wang, C.; Li, Z.; Liu, F.; Ji, Y. A distributed multi-agent deep reinforcement learning-aided transmission design for dynamic vehicular communication networks. IEEE Trans. Veh. Technol. 2023, 73, 3850–3862.
Wang, Z.; Zhu, Y.; Zhao, H.; Zheng, X.; Sui, D.; Wang, T.; Tang, W.; Wang, Y.; Harrison, E.; Pan, C.; et al. ColaCare: Enhancing electronic health record modeling through large language model-driven multi-agent collaboration. In Proceedings of the ACM on Web Conference 2025, Sydney, Australia, 28 April–2 May 2025; pp. 2250–2261.
Park, J.S.; O’Brien, J.; Cai, C.J.; Morris, M.R.; Liang, P.; Bernstein, M.S. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, San Francisco, CA, USA, 29 October–1 November 2023; pp. 1–22.
Alayrac, J.B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. Flamingo: A visual language model for few-shot learning. Adv. Neural Inf. Process. Syst. 2022, 35, 23716–23736.
Peng, Z.; Wang, W.; Dong, L.; Hao, Y.; Huang, S.; Ma, S.; Wei, F. Kosmos-2: Grounding multimodal large language models to the world. arXiv 2023, arXiv:2306.14824.
Chen, X.; Wang, X.; Changpinyo, S.; Piergiovanni, A.J.; Padlewski, P.; Salz, D.; Goodman, S.; Grycner, A.; Mustafa, B.; Beyer, L.; et al. PaLI: A jointly-scaled multilingual language-image model. arXiv 2022, arXiv:2209.06794.
Baltrušaitis, T.; Ahuja, C.; Morency, L.P. Multimodal machine learning: A survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 423–443.
Li, J.; Li, D.; Savarese, S.; Hoi, S. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the International Conference on Machine Learning, PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 19730–19742.
Huang, Y.; Feng, X.; Li, B.; Xiang, Y.; Wang, H.; Liu, T.; Qin, B. Ensemble Learning for Heterogeneous Large Language Models with Deep Parallel Collaboration. In Proceedings of the Thirty-Eighth Annual Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 9–15 December 2024.
He, X.; Ban, Y.; Zou, J.; Wei, T.; Cook, C.B.; He, J. LLM-Forest for Health Tabular Data Imputation. arXiv 2024, arXiv:2410.21520.
Lucas, M.M.; Yang, J.; Pomeroy, J.K.; Yang, C.C. Reasoning with large language models for medical question answering. J. Am. Med. Inform. Assoc. 2024, 31, 1964–1975.
Li, Z.; Wei, Q.; Huang, L.C.; Li, J.; Hu, Y.; Chuang, Y.S.; He, J.; Das, A.; Keloth, V.K.; Yang, Y.; et al. Ensemble pretrained language models to extract biomedical knowledge from literature. J. Am. Med. Inform. Assoc. 2024, 31, 1904–1911.
Wu, C.; Fang, W.; Dai, F.; Yin, H. A Model Ensemble Approach with LLM for Chinese Text Classification. In Proceedings of the China Health Information Processing Conference, Fuzhou, China, 15–17 November 2023; Springer: Berlin/Heidelberg, Germany, 2023; pp. 214–230.
Gururajan, A.K.; Lopez-Cuena, E.; Bayarri-Planas, J.; Tormos, A.; Hinjos, D.; Bernabeu-Perez, P.; Arias-Duart, A.; Martin-Torres, P.A.; Urcelay-Ganzabal, L.; Gonzalez-Mallo, M.; et al. Aloe: A Family of Fine-tuned Open Healthcare LLMs. arXiv 2024, arXiv:2405.01886.
Lai, Z.; Zhang, X.; Chen, S. Adaptive ensembles of fine-tuned transformers for LLM-generated text detection. arXiv 2024, arXiv:2403.13335.
Knafou, J.; Haas, Q.; Borissov, N.; Counotte, M.; Low, N.; Imeri, H.; Ipekci, A.M.; Buitrago-Garcia, D.; Heron, L.; Amini, P.; et al. Ensemble of deep learning language models to support the creation of living systematic reviews for the COVID-19 literature. Syst. Rev. 2023, 12, 94.
Gu, K.; Tuecke, E.; Katz, D.; Horesh, R.; Alvarez-Melis, D.; Yurochkin, M. CharED: Character-wise ensemble decoding for large language models. arXiv 2024, arXiv:2407.11009.
Dhole, K.D.; Agichtein, E. GenQREnsemble: Zero-shot LLM ensemble prompting for generative query reformulation. In Proceedings of the European Conference on Information Retrieval, Glasgow, UK, 24–28 March 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 326–335.
Miah, M.S.U.; Kabir, M.M.; Sarwar, T.B.; Safran, M.; Alfarhood, S.; Mridha, M. A multimodal approach to cross-lingual sentiment analysis with ensemble of transformer and LLM. Sci. Rep. 2024, 14, 9603.
Luo, Y.; Xu, W.; Andersson, K.; Hossain, M.S.; Xu, D. FELLMVP: An Ensemble LLM Framework for Classifying Smart Contract Vulnerabilities. In Proceedings of the 2024 IEEE International Conference on Blockchain (Blockchain), Copenhagen, Denmark, 19–22 August 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 89–96.
Huang, Z. An Ensemble LLM Framework of Text Recognition Based on BERT and BPE Tokenization. In Proceedings of the 2024 5th International Seminar on Artificial Intelligence, Networking and Information Technology (AINIT), Nanjing, China, 22–24 March 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1750–1754.
Cohen, S.; Goldshlager, N.; Cohen-Inger, N.; Shapira, B.; Rokach, L. DFPE: A Diverse Fingerprint Ensemble for Enhancing LLM Performance. arXiv 2025, arXiv:2501.17479.
Tekin, S.F.; Ilhan, F.; Huang, T.; Hu, S.; Liu, L. LLM-TOPLA: Efficient LLM ensemble by maximising diversity. arXiv 2024, arXiv:2410.03953.
Cho, H.; Kang, S.; An, G.; Yoo, S. COSMosFL: Ensemble of Small Language Models for Fault Localisation. arXiv 2025, arXiv:2502.02908.
Chen, K.; Wang, J.; Chen, Z.; Chen, K.; Chen, Y. LLM-Powered Ensemble Learning for Paper Source Tracing: A GPU-Free Approach. arXiv 2024, arXiv:2409.09383.
Abburi, H.; Suesserman, M.; Pudota, N.; Veeramani, B.; Bowen, E.; Bhattacharya, S. Generative AI text classification using ensemble LLM approaches. arXiv 2023, arXiv:2309.07755.
Yang, Z.; Xue, D.; Qian, S.; Dong, W.; Xu, C. LDRE: LLM-based Divergent Reasoning and Ensemble for Zero-Shot Composed Image Retrieval. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, Washington, DC, USA, 14–18 July 2024; pp. 80–90.
Jiang, D.; Ren, X.; Lin, B.Y. LLM-Blender: Ensembling large language models with pairwise ranking and generative fusion. arXiv 2023, arXiv:2306.02561.
Chen, Z.; Li, J.; Chen, P.; Li, Z.; Sun, K.; Luo, Y.; Mao, Q.; Yang, D.; Sun, H.; Yu, P.S. Harnessing Multiple Large Language Models: A Survey on LLM Ensemble. arXiv 2025, arXiv:2502.18036.
Celikyilmaz, A.; Clark, E.; Gao, J. Evaluation of text generation: A survey. arXiv 2020, arXiv:2006.14799.
Mienye, I.D.; Swart, T.G.; Obaido, G. Fairness Metrics in AI Healthcare Applications: A Review. In Proceedings of the 2024 IEEE International Conference on Information Reuse and Integration for Data Science (IRI), San Jose, CA, USA, 7–9 August 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 284–289.
Nasarian, E.; Alizadehsani, R.; Acharya, U.R.; Tsui, K.L. Designing interpretable ML system to enhance trust in healthcare: A systematic review to proposed responsible clinician-AI-collaboration framework. Inf. Fusion 2024, 108, 102412.
Patil, R.; Gudivada, V. A review of current trends, techniques, and challenges in large language models (LLMs). Appl. Sci. 2024, 14, 2074.
Roller, S.; Dinan, E.; Goyal, N.; Ju, D.; Williamson, M.; Liu, Y.; Xu, J.; Ott, M.; Shuster, K.; Smith, E.M.; et al. Recipes for building an open-domain chatbot. arXiv 2020, arXiv:2004.13637.
Feng, H.; Fan, Y.; Liu, X.; Lin, T.E.; Yao, Z.; Wu, Y.; Huang, F.; Li, Y.; Ma, Q. Improving factual consistency of text summarization by adversarially decoupling comprehension and embellishment abilities of LLMs. arXiv 2023, arXiv:2310.19347.
He, P.; Peng, B.; Lu, L.; Wang, S.; Mei, J.; Liu, Y.; Xu, R.; Awadalla, H.H.; Shi, Y.; Zhu, C.; et al. Z-Code++: A pre-trained language model optimized for abstractive summarization. arXiv 2022, arXiv:2208.09770.

Figure 1. Transformer architecture [24].

Figure 2. Overview of the stacking architecture, where base models provide predictions used as features by a meta-learner.
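
The stacking flow described in the Figure 2 caption can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the survey's implementation: the three base "models" are hypothetical stand-in functions that return a score in [0, 1] instead of LLM outputs, and the meta-learner is assumed to be a small logistic regression trained by stochastic gradient descent on those scores.

```python
import math

# Hypothetical base models (stand-ins for LLMs): each maps x to a score in [0, 1].
def base_a(x):
    return 1.0 if x > 0.3 else 0.0

def base_b(x):
    return 1.0 if x > 0.7 else 0.0

def base_c(x):
    return min(max(x, 0.0), 1.0)

BASE_MODELS = [base_a, base_b, base_c]

def meta_features(x):
    """Base-model predictions become the input features of the meta-learner."""
    return [model(x) for model in BASE_MODELS]

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_meta_learner(xs, ys, epochs=500, lr=0.5):
    """Fit logistic-regression weights on the base-model outputs (assumed meta-learner)."""
    w, b = [0.0] * len(BASE_MODELS), 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            f = meta_features(x)
            err = sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b) - y
            w = [wi - lr * err * fi for wi, fi in zip(w, f)]
            b -= lr * err
    return w, b

def stacked_predict(x, w, b):
    f = meta_features(x)
    return 1 if sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b) >= 0.5 else 0

# Toy data: the true label is 1 when x > 0.5.
train_x = [i / 20 for i in range(21)]
train_y = [1 if x > 0.5 else 0 for x in train_x]
w, b = train_meta_learner(train_x, train_y)
```

In practice the meta-learner is usually fitted on held-out base predictions, so that it does not simply learn the base models' training-set behaviour; the toy example skips that split for brevity.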

Figure 3. Illustration of the bagging process, where models are trained on bootstrap samples and their predictions aggregated for final output.
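
The bagging process in the Figure 3 caption (bootstrap sampling followed by aggregation) can be illustrated as follows. This is a sketch only: the base learner is a hypothetical one-feature threshold classifier standing in for a fine-tuned model, and aggregation is assumed to be a simple majority vote.

```python
import random
from collections import Counter

def fit_stump(sample):
    """Hypothetical weak learner: choose the threshold with the fewest errors on the sample."""
    best_t, best_err = 0.0, len(sample) + 1
    for t, _ in sample:
        err = sum((x > t) != bool(y) for x, y in sample)
        if err < best_err:
            best_t, best_err = t, err
    return best_t

def bagging_fit(data, n_models=15, seed=0):
    rng = random.Random(seed)
    thresholds = []
    for _ in range(n_models):
        # Bootstrap sample: draw len(data) points with replacement.
        sample = [rng.choice(data) for _ in data]
        thresholds.append(fit_stump(sample))
    return thresholds

def bagging_predict(thresholds, x):
    # Aggregate the members' predictions by majority vote.
    votes = Counter(int(x > t) for t in thresholds)
    return votes.most_common(1)[0][0]

# Toy data: label is 1 when x > 0.5.
data = [(i / 20, int(i / 20 > 0.5)) for i in range(21)]
ensemble = bagging_fit(data)
```

Because each member sees a different resample, the learned thresholds differ slightly, and the vote averages out that variation.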

Figure 4. The standard teacher–student approach for knowledge distillation [49].
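
One common way to train the student in a teacher–student setup is the soft-target objective: a temperature-scaled KL term between teacher and student distributions combined with cross-entropy on the hard label. The sketch below assumes that formulation; the caption does not specify a loss, and the `temperature` and `alpha` defaults are illustrative choices.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """alpha * T^2 * KL(teacher || student) at temperature T
    plus (1 - alpha) * cross-entropy of the student on the hard label."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    ce = -math.log(softmax(student_logits)[label])
    return alpha * temperature ** 2 * kl + (1 - alpha) * ce
```

The `T^2` factor keeps the gradient scale of the soft-target term comparable across temperatures; a student whose logits match the teacher's has a KL term of zero.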

Table 1. Comparative summary of ensemble LLM strategies.

Ensemble Type	Typical Examples	Advantages	Limitations
Model-level	Bagging, boosting, stacking, voting	High predictive performance, robust generalization, reduces variance	High computational cost, high latency, low interpretability
Parameter-level	Snapshot ensembles, parameter averaging	Computational efficiency, improved scalability, moderate performance improvement	Limited model diversity, lower interpretability, moderate accuracy gains
Task-specific	Prompt-based ensembles, domain-specific fine-tuned ensembles	High domain specialization, scalable, adaptable to targeted applications	Dependent on prompt and domain quality, less generalizable across domains
Knowledge distillation	Student-teacher model training, distilled compact LLMs	High computational efficiency, lower inference latency, suitable for deployment in resource-constrained environments	Moderate performance reduction, requires careful distillation tuning
Mixture-of-experts	Switch transformers, sparsely gated experts	Exceptional scalability, efficient resource utilization, capable of training very large models	Complex gating mechanisms, interpretability issues, expert load balancing
Hybrid/multi-agent	Collaborative agent frameworks, generative agent ensembles	Effective in complex, multi-step reasoning tasks, interactive capabilities, adaptable in dynamic environments	High coordination complexity, synchronization overhead, limited interpretability
Multimodal LLM ensembles	Late fusion, modality-specific ensembling, alignment-based fusion	Cross-modal reasoning, robust in heterogeneous data settings, excels in VQA and clinical support tasks	Modality alignment, synchronization, and scalability challenges
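
As a concrete instance of the model-level row in Table 1, a voting ensemble over the final answers of several LLMs might look like the sketch below. The function name, the answer normalization (trimming and lower-casing), and the optional per-model weights are assumptions made for illustration.

```python
from collections import defaultdict

def weighted_vote(answers, weights=None):
    """Return the normalized answer with the largest total weight.

    answers: one final answer string per ensemble member.
    weights: optional per-member weights (e.g., validation accuracy);
             defaults to an unweighted majority vote.
    """
    if weights is None:
        weights = [1.0] * len(answers)
    scores = defaultdict(float)
    for answer, weight in zip(answers, weights):
        scores[answer.strip().lower()] += weight
    # On a tie, the answer that was seen first wins.
    return max(scores, key=scores.get)
```

For example, `weighted_vote(["Paris", "paris ", "Lyon"])` returns `"paris"`, while `weighted_vote(["A", "B", "B"], [3.0, 1.0, 1.0])` returns `"a"` because the first member outweighs the other two combined. The cost noted in the table still applies: every member must be queried before the vote can be taken.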

Table 2. Summary of ensemble LLM studies by specific application.

Application Domain	Reference	Year	Description
Model robustness enhancement	Cohen et al. [80]	2025	Introduced DFPE, a fingerprint-based ensemble for performance and diversity.
Software fault localization	Cho et al. [82]	2025	Used ensembles of small LLMs to improve bug and fault detection in codebases.
Medical question answering	Lucas et al. [69]	2024	Applied LLM ensembles for improved reasoning in medical QA tasks.
Biomedical information extraction	Li et al. [70]	2024	Used an ensemble of pretrained LLMs to extract biomedical entities and relations from literature.
EHR annotation	Huang et al. [37]	2024	Introduced a multi-agent LLM ensemble to streamline annotation of electronic health records.
Health data imputation	He et al. [68]	2024	Proposed LLM-Forest to enhance imputation of missing values in structured health datasets.
Open healthcare NLP	Gururajan et al. [72]	2024	Released Aloe, a family of fine-tuned open healthcare LLMs for ensemble clinical NLP.
LLM output detection	Lai et al. [73]	2024	Deployed adaptive transformer ensembles to detect AI-generated clinical text.
Paper provenance tracing	Chen et al. [83]	2024	Proposed a GPU-free ensemble LLM method to trace origins of academic publications.
General NLP tasks	Huang et al. [67]	2024	Introduced DEEPEN, a framework for combining heterogeneous LLMs with parallel token alignment.
LLM diversity optimization	Tekin et al. [81]	2024	Proposed LLM-TOPLA, which maximizes diversity among LLMs for more robust NLP outputs.
Vocabulary alignment	Xu et al. [11]	2024	Presented techniques to harmonize vocabularies across ensemble LLMs for consistency.
Cross-lingual sentiment analysis	Miah et al. [77]	2024	Created a multimodal LLM ensemble for sentiment analysis across multiple languages.
Data annotation	Farr et al. [38]	2024	Proposed LLM chain ensembles to improve annotation efficiency and scale.
Query reformulation	Dhole and Agichtein [76]	2024	Introduced GenQREnsemble for zero-shot query rewriting using LLM ensembles.
Product attribute extraction	Fang et al. [12]	2024	Built an ensemble model to extract attributes from e-commerce product data.
Smart contract analysis	Luo et al. [78]	2024	Proposed FELLMVP, an LLM ensemble to detect vulnerabilities in smart contracts.
Composed image retrieval	Yang et al. [85]	2024	Used LLM ensembles for zero-shot composed image retrieval via divergent reasoning.
Text recognition	Huang [79]	2024	Combined BERT and BPE-based LLMs for improved optical text recognition.
Code generation and math reasoning	Gu et al. [75]	2024	Proposed CharED, a character-wise ensemble decoding approach enhancing LLM performance in programming and mathematical tasks.
Medical question answering	Yang et al. [33]	2023	Developed an ensemble pipeline combining LLMs to improve performance on medical QA benchmarks.
Medical text classification	Wu et al. [71]	2023	Implemented LLM ensembles to classify Chinese medical texts for healthcare informatics.
Systematic review triage	Knafou et al. [74]	2023	Used an ensemble of deep models to triage COVID-19 literature for systematic reviews.
AI text detection	Abburi et al. [84]	2023	Used ensemble LLMs to classify AI-generated vs. human-written content.
Ranking and fusion for QA	Jiang et al. [86]	2023	Proposed LLM-Blender using pairwise ranking and generative fusion for ensemble reasoning.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

