1. Introduction
Recommender systems (RecSys) have become a cornerstone of modern digital platforms. RecSys enable personalized experiences across everyday aspects of life, from entertainment and social media to employment search and e-commerce. Interpreting user feedback represents a critical challenge in RecSys research. Research in sentiment analysis may prove extremely useful: on the one hand, it can contribute to understanding customer feedback and reviews about products users have already seen, purchased, or listened to; on the other hand, it can help predict user satisfaction with products they have not yet experienced [1].
This work focuses specifically on rating prediction from review text, a fundamental component of modern recommender systems. While a complete RecSys encompasses multiple interconnected modules (user profiling, item representation, ranking algorithms, cold-start handling), rating prediction serves as a critical building block that directly influences recommendation quality in collaborative filtering, hybrid systems, and content-based approaches [1,2]. Our investigation of LLM-based rating prediction and meta-model aggregation addresses a core RecSys challenge: accurately inferring user preferences from textual feedback, which subsequently feeds into downstream recommendation algorithms. This component-focused approach aligns with established RecSys literature, where specialized studies on rating prediction, review mining, and sentiment-aware recommendation have consistently contributed to advancing the field’s understanding of preference modeling [3,4].
In this direction, the rapid evolution of LLMs has caused a major shift in sentiment classification, moving from traditional feature-based machine learning approaches to more advanced, context-aware reasoning systems that can handle complex human language [2,3,4].
While recent approaches to RecSys sentiment analysis focus on enhancing model architectures or training data quality [5,6], an emerging direction explores ensemble approaches and meta-model aggregation strategies for combining multiple models [7]. Two evaluation approaches are relevant: independent LLMs operating individually on classification tasks, and meta-model aggregation systems that combine predictions from multiple independent LLMs through statistical or reasoning-based methods [8,9]. Despite growing interest in these aggregation strategies, systematic comparative evaluations of their effectiveness in real-world applications—such as sentiment prediction in RecSys—remain limited.
In contrast to prior studies that typically evaluate a small number of LLMs [10]—often fine-tuned for specific tasks—this study examines 12 leading pre-trained models across four major providers (OpenAI, Anthropic, Google, DeepSeek) in a zero-shot setting. This cross-company comparative design enables systematic analysis of architectural differences, provider-specific biases, and performance heterogeneity across model families, providing insights into which design paradigms (reasoning-optimized vs. chat-oriented architectures) best capture sentiment nuances without task-specific adaptation.
This focus on zero-shot evaluation aligns with a growing trend in the AI community, where labeled data for specific tasks is often limited, and contemporary LLMs are increasingly designed with advanced capabilities, including self-improvement through iterative refinement and instruction-following abilities based on their pre-trained knowledge [11,12]. Evaluating these models in their default, unmodified state is essential for establishing their baseline capabilities, which in turn determine their potential in real-world applications and their suitability for downstream adaptation via few-shot learning or instruction tuning.
This empirical evaluation examines all 12 pre-trained models in both standalone and aggregated configurations. In the aggregation approach, a GPT-based meta-model processes predictions from all base models alongside the original review text to produce a final classification. This experimental design enables direct comparison of individual model performance against meta-model aggregation strategies and traditional ensemble baselines (majority voting, mean aggregation), investigating whether reasoning-based aggregation provides measurable accuracy improvements over statistical combination methods.
To ensure reliable sentiment signals from authentic customer experiences, we created a balanced dataset of 5000 verified reviews from the Amazon Reviews ’23 dataset [13], where each LLM predicted sentiment ratings using combined title and text inputs in a zero-shot setting.
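For illustration, the following is a minimal sketch of how such a zero-shot rating query could be issued through an OpenAI-compatible chat API; the prompt wording, helper function, and model identifier are illustrative assumptions, not the exact configuration used in this study.

```python
# Illustrative sketch of a zero-shot rating query (prompt wording and model
# identifier are assumptions; the study's exact prompts may differ).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def predict_rating(title: str, text: str, model: str = "gpt-4.1") -> int:
    """Ask the model for a 1-5 star rating of a review, zero-shot."""
    prompt = (
        "You are rating Amazon product reviews. Given the review title and "
        "text, reply with a single integer from 1 (very negative) to "
        "5 (very positive) and nothing else.\n\n"
        f"Title: {title}\nReview: {text}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return int(response.choices[0].message.content.strip())
```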
By comparing standalone LLMs, traditional ensemble methods, and meta-model aggregation across key metrics—accuracy, precision, recall, F1-score, and computational cost—this study investigates which strategies provide measurable performance improvements and whether observed gains justify increased operational complexity.
The evaluation objectives of this research are threefold. First, we measure the performance of 12 leading LLMs in zero-shot review sentiment classification, establishing baseline capabilities without task-specific fine-tuning. Second, we empirically test whether meta-model aggregation with natural language reasoning capabilities yields accuracy improvements over standalone models and traditional ensemble baselines. Third, we investigate several research questions (RQs) that remain underexamined in the literature, particularly concerning model behavior, aggregation effectiveness, and practical trade-offs, including:
Which LLM achieves the highest accuracy in zero-shot sentiment classification?
At which rating levels (1–5) do LLMs most frequently make incorrect predictions, and at which levels do they perform most accurately?
How does meta-model aggregation compare to traditional ensemble baselines in terms of observed accuracy improvements?
Which LLMs contribute positively or negatively to the meta-model’s performance?
How does the meta-model handle independent models’ recommendations—specifically, how frequently does it accept (revise) versus disregard them, and are these decisions beneficial?
Which models most strongly influence the meta-model’s decisions, and which are the least trusted?
How does a model’s performance as a meta-model (in a meta-model aggregation system setup) compare to its performance when operating independently?
Does the meta-model primarily rely on majority voting, or does it reason directly over textual content? How often does each occur?
Which models behave as outliers, potentially disrupting the overall meta-model aggregation system?
Are pre-trained LLMs without fine-tuning capable of accurately capturing sentiment from user feedback?
Do observed accuracy improvements justify the added computational complexity and cost of meta-model aggregation?
How close are the predictions of different LLMs, and can we reduce costs by omitting models with similar outcomes?
How does the performance of zero-shot LLMs and the meta-model aggregation approach compare to traditional fine-tuned transformer models?
What patterns of agreement exist among the 12 base models, and how does the consensus level relate to prediction accuracy?
Finally, under what conditions do meta-models override the majority vote, and how does this relate to prediction accuracy?
The empirical findings provide evidence-based insights into the comparative effectiveness of different aggregation strategies for sentiment analysis in recommender systems, informing practitioners about potential benefits and limitations of meta-model aggregation approaches.
The rest of the paper is structured as follows: Section 2 summarizes the contemporary literature. Section 3 presents the dataset selection and preprocessing procedures and then introduces the model deployment and evaluation procedures. Section 4 presents the experimental findings, and Section 5 concludes with a discussion of the results.
4. Results
This section presents empirical findings from our comparative evaluation. The findings address the 15 research questions posed in the Introduction section. For each RQ, we restate the question (for the reader’s convenience), present the relevant findings, and then provide an answer to the RQ. In all result tables, boldface is used to mark the highest performance.
4.1. Individual Model Performance Analysis
RQ1: Which LLM achieves the highest accuracy in zero-shot sentiment classification?
The evaluation of the 12 models shows clear differences in their zero-shot sentiment classification performance (illustrated in Table 5). At the company level, the Claude models achieved the best performance overall on this task, while the Google models achieved the lowest. At the model level, Claude Sonnet 4.5 achieved the highest accuracy at 65.02%, followed closely by Claude Opus 4.1 at 64.48% and GPT-4.1 at 63.54%.
Moderate performance was achieved by the DeepSeek Chat (63.34%), GPT-5 (62.40%), and GPT-5 Mini (62.34%) models, while the worst results were observed for the Gemini 2.5 Flash model (56.86%).
An interesting finding is that, even though DeepSeek Reasoner required substantially more computational resources than DeepSeek Chat (cf. Section 4.12) due to its reasoning-based architecture, it failed to surpass the latter’s accuracy.
A similar situation was observed between GPT-5 and GPT-4.1, where GPT-5 (designed as a reasoning model) achieved 1.14 percentage points lower accuracy than the older GPT-4.1 model. This suggests that, for zero-shot sentiment classification, reasoning-oriented models may not offer an advantage over their chat-oriented counterparts, despite their higher associated costs.
On the other hand, when these models are incorporated into a meta-model aggregation system, we observe a significant improvement in performance (as shown in the last two rows of Table 5). More specifically, the GPT-5 meta-model achieved an accuracy of 71.40%, while the GPT-5 mini meta-model reached 70.32%, representing improvements of 10.15 and 9.07 percentage points, respectively, over the average accuracy of the individual models (61.25%).
4.2. Rating Prediction Challenges
RQ2: At which rating levels (1–5) do LLMs most frequently make incorrect predictions, and at which levels do they perform most accurately?
Most models showed a similar pattern of difficulty with 3-star ratings, with failure rates typically ranging from 55% to 71% (as shown in Table 6). Interestingly, DeepSeek Reasoner was the only model to deviate from this pattern, showing its highest failure rate on 2-star ratings (62.50%). This unique behavior indicates that DeepSeek Reasoner may process negative sentiment differently from the other models in our study.
Gemini 2.5 Flash produced the highest failure rate for 3-star ratings (77.4%), which correspond to a “fair” user opinion, whereas Claude Sonnet 4.5 produced the lowest (55.7%). The gap between these two extremes, a substantial 21.7 percentage points, shows a clear distinction in each model’s ability to understand “neutral” or “mediocre” sentiment expressions.
Therefore, the above measurements highlight a serious problem that all researchers face when studying sentiment analysis: accurately classifying neutral or mediocre sentiment, which can be determined using data collected from 3-star ratings on a 5-star rating system. To understand why models systematically fail on 3-star reviews, we conducted a qualitative analysis examining misclassified instances across three dimensions: linguistic ambiguity, rating scale interpretation, and contextual insufficiency.
Linguistic Ambiguity: 3-star reviews frequently contain hedging language (“decent but…”, “okay for the price”, “works fine, I guess”) that lacks the definitive sentiment markers present in extreme ratings. Many misclassified 3-star reviews contain contradictory statements (e.g., “Great quality but arrived late”—should be 3-star but predicted as 4-star due to positive “great quality” framing), where the coexistence of positive and negative phrases within single sentences creates feature-level ambiguity that models struggle to balance appropriately. Additionally, comparative qualifiers (“better than expected” without establishing baseline expectations) and sarcastic or ironic expressions that LLMs may interpret literally (“Oh, perfect, another broken charger”) further complicate accurate classification.
Rating Scale Interpretation: The semantic space between 2-star (“poor”), 3-star (“acceptable”), and 4-star (“good”) contains finer-grained distinctions than extreme ratings. When models misclassify 3-star reviews, they frequently predict extreme ratings (1-star or 5-star) rather than adjacent ratings, suggesting difficulty in calibrating “middling” sentiment (Table 7). The negative bias observed in Table 8 (mean bias of −0.39 to −0.81 for 3-star) suggests models apply a “negativity heuristic”—treating any criticism as evidence for lower ratings, even when reviews explicitly state “met basic expectations” or “acceptable quality.” This aligns with the anchoring effect, where models pre-trained predominantly on polarized sentiment (more common in training corpora) struggle with the nuanced midpoint calibration required for 3-star classification.
Contextual Insufficiency: Short-form reviews often lack sufficient context for models to disambiguate neutral sentiment. Reviews such as “Product is fine” (actual 3-star, predicted as 4-star) or “Didn’t work for me” (actual 3-star, predicted as 1-star) provide minimal information about whether dissatisfaction stems from personal preferences (warranting 3-star) or objective product failures (warranting 1-star).
As such, this finding has critical implications for how sentiment analysis systems are designed and indicates that developing the ability to detect and classify neutral sentiment expressions must receive greater attention. Potential mitigation strategies include: (1) incorporating lexical diversity and sentiment contrast features to flag ambiguous cases for human review, (2) fine-tuning models specifically on neutral sentiment with contrastive learning objectives that emphasize 2-vs.-3-vs.-4 distinctions, and (3) implementing confidence thresholds that escalate uncertain 3-star predictions to secondary verification.
On the other hand, all independent base models achieved their best performance when predicting values at the extremes of the rating scale (either 1-star or 5-star classifications), representing clear positive or negative sentiments. This concurs with the results of previous studies [15,67], which assert that ratings at the bounds of the scale are easier to predict.
Interestingly, while the models that achieved their best performance on 1-star predictions succeeded more frequently on the “easiest rating” cases, the models that excelled on 5-star predictions performed better, on average, on the “hardest rating” cases (i.e., had the lowest failure percentage there). Based on the combined results, the models that achieved higher accuracy on 5-star ratings, such as Claude Sonnet 4.5 and GPT-4.1, tended to exhibit stronger overall performance, thereby confirming the previous observation. Furthermore, we also observe significant differences between models from the same company. For example, while GPT-5 and GPT-5 Mini achieved their best performance on 5-star predictions, the GPT-4.1 model performed better on 1-star ratings. A similar situation occurred within the DeepSeek family, where the reasoning variant (DeepSeek Reasoner) proved more accurate on 1-star predictions than on 5-star ones, in contrast to DeepSeek Chat. This pattern strongly suggests that reasoning-oriented models are better at identifying negative sentiment cues, while chat-oriented models balance sentiment extremes more effectively.
To further investigate the magnitude and direction of prediction errors, we calculated the Mean Absolute Error (MAE) and Mean Error (Bias) for each model across different rating levels.
Table 7 presents the MAE for each model, highlighting the error magnitude on the easiest (1-star and 5-star) and hardest (3-star) ratings.
The results in Table 7 confirm that 3-star ratings are indeed the most challenging, with MAE values ranging from 0.632 to 1.037, significantly higher than the errors observed at the extremes. Conversely, the 1-star and 5-star ratings exhibit much lower MAE values, typically below 0.3, reinforcing the finding that models are more adept at identifying strong sentiments.
We also analyzed the systematic bias of the models using the Mean Error metric, as shown in Table 8.
Table 8 reveals a consistent negative bias across almost all models, indicating a general tendency to underestimate ratings. This is particularly pronounced for 3-star ratings, where models frequently predict lower values (1 or 2). The positive bias observed for 1-star ratings is expected, as predictions can only be equal to or higher than the true value. Similarly, the negative bias for 5-star ratings is due to the upper bound of the scale. However, the strong negative bias in the intermediate ratings suggests that models may be overly sensitive to negative sentiment cues, leading them to classify neutral or mixed reviews as more negative than they actually are.
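Both metrics reduce to simple per-level groupby computations. The following is a minimal pandas sketch, assuming a DataFrame with hypothetical columns true and pred holding the ground-truth and predicted star ratings:

```python
import pandas as pd

# df holds one model's predictions; 'true' and 'pred' are hypothetical column
# names for the ground-truth and predicted star ratings (1-5).
def error_by_rating(df: pd.DataFrame) -> pd.DataFrame:
    err = df["pred"] - df["true"]
    return pd.DataFrame({
        "MAE": err.abs().groupby(df["true"]).mean(),   # error magnitude per true rating
        "Bias": err.groupby(df["true"]).mean(),        # signed mean error (negative = underestimation)
    })
```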
4.3. Impact of Aggregation on System Performance
RQ3: How does meta-model aggregation compare to traditional ensemble baselines in terms of observed accuracy improvements?
When the LLMs are incorporated into a meta-model aggregation system, we observe significant improvements in sentiment analysis accuracy for both meta-model variants, albeit with notable differences in their performance. More specifically, the GPT-5 meta-model achieves an accuracy of 71.40%, representing a 10.15 percentage point increase over the average model performance (61.25%), while the GPT-5 mini meta-model achieves an accuracy of 70.32%, a 9.07 percentage point improvement.
The difference in accuracy between these two meta-models (1.08 percentage points) suggests that the more advanced GPT-5 model is slightly better at synthesizing and analyzing model recommendations. However, the GPT-5 Mini model delivers results comparable to those of the GPT-5 model. Therefore, we can conclude that the aggregation architecture itself, rather than the choice of meta-model, is the key factor driving the performance enhancement. This suggests that very good results can be achieved even with lightweight models, as long as the aggregation architecture remains intact.
Statistical Significance Validation: To rigorously validate that these improvements are not due to random variance in our 5000-sample test set, we performed comprehensive statistical significance testing using McNemar’s test for paired classifier comparisons. McNemar’s test is specifically designed for evaluating whether two classifiers have significantly different error rates on the same test set, making it ideal for our paired prediction scenario. All comparisons demonstrate highly statistically significant results: when comparing the GPT-5 meta-model against individual models, all 12 comparisons yielded p < 0.001 (chi-square test statistics ranging from χ2 = 110.04 to χ2 = 482.23), indicating that the meta-model’s superior performance is not attributable to sampling variance. Similarly, the GPT-5 mini meta-model achieved p < 0.001 for all individual model comparisons (χ2 = 79.47 to χ2 = 381.08). When compared against the majority voting baseline (62.64% accuracy), both meta-models demonstrated highly significant improvements: GPT-5 meta-model (χ2 = 279.19, p < 0.001) and GPT-5 mini meta-model (χ2 = 210.16, p < 0.001). Bootstrap confidence intervals (10,000 iterations) further confirm the reliability of these estimates: GPT-5 meta-model accuracy is 71.40% with 95% CI [70.16%, 72.66%], and GPT-5 mini meta-model is 70.32% with 95% CI [69.08%, 71.54%]. The narrow confidence intervals (margin of error ±1.23–1.25 percentage points) indicate that our 5000-sample test set provides sufficient statistical power to detect meaningful performance differences. These statistical tests conclusively establish that the meta-model improvements are both statistically significant and practically meaningful, addressing concerns about potential sampling variance in our experimental design.
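For reproducibility, the following is a compact sketch of both procedures, using statsmodels’ McNemar implementation and a NumPy bootstrap; the function names and array inputs are ours, not part of the original pipeline:

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def compare_classifiers(y_true, pred_a, pred_b):
    """McNemar's test on paired correctness indicators of two classifiers
    evaluated on the same test set (all inputs are NumPy arrays)."""
    a_ok, b_ok = pred_a == y_true, pred_b == y_true
    # 2x2 contingency table of (A correct/incorrect) x (B correct/incorrect)
    table = [[np.sum(a_ok & b_ok), np.sum(a_ok & ~b_ok)],
             [np.sum(~a_ok & b_ok), np.sum(~a_ok & ~b_ok)]]
    return mcnemar(table, exact=False, correction=True)  # chi-square version

def bootstrap_accuracy_ci(y_true, pred, n_boot=10_000, seed=0):
    """95% bootstrap confidence interval for accuracy (10,000 iterations)."""
    rng = np.random.default_rng(seed)
    correct = (pred == y_true).astype(float)
    n = len(correct)
    boots = [correct[rng.integers(0, n, n)].mean() for _ in range(n_boot)]
    return np.percentile(boots, [2.5, 97.5])
```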
4.4. LLM Influence Analysis
RQ4: Which LLMs contribute positively or negatively to the meta-model’s performance?
To rigorously assess the influence of each independent model on the meta-model’s decision-making process, we employ a classification scheme that evaluates recommendations relative to the meta-model’s standalone performance. This approach focuses on the potential influence of each model—counting all recommendations that provide better or worse estimates—rather than only those that successfully altered the final aggregated decision.
Specifically, we categorize each model’s recommendation as follows:
Positive Influence: The model provides a more accurate estimate than the meta-model (i.e., |Pred(A) − True| < |Pred(M) − True|). This includes cases where the model pulls the estimate closer to the ground truth, even if it overshoots or does not perfectly match the true value.
Negative Influence: The model provides a less accurate estimate than the meta-model (i.e., |Pred(A) − True| > |Pred(M) − True|). This captures cases where the model would mislead the meta-model further from reality.
Neutral Influence: The model provides an estimate with the same error magnitude as the meta-model (i.e., |Pred(A) − True| = |Pred(M) − True|). This typically occurs when the model and meta-model output the same rating, thereby reinforcing the meta-model’s initial belief (whether correct or incorrect).
This definition allows us to measure the intrinsic value of each model’s input, independent of the aggregation mechanism’s final output.
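A minimal sketch of this categorization, assuming NumPy arrays of base-model predictions, meta-model predictions, and ground-truth ratings:

```python
import numpy as np

def influence(pred_base: np.ndarray, pred_meta: np.ndarray,
              true: np.ndarray) -> dict:
    """Classify each base-model recommendation relative to the meta-model,
    following the definition above: positive if the base model is strictly
    closer to the ground truth, negative if strictly farther, neutral otherwise."""
    base_err = np.abs(pred_base - true)   # |Pred(A) - True|
    meta_err = np.abs(pred_meta - true)   # |Pred(M) - True|
    return {
        "positive": int(np.sum(base_err < meta_err)),
        "negative": int(np.sum(base_err > meta_err)),
        "neutral":  int(np.sum(base_err == meta_err)),
    }
```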
Based on this analysis, significant differences can be observed in the potential influence of individual models, as shown in Table 9. When GPT-5 serves as the meta-model, the models with the highest number of positive influences are Claude Opus 4.1 and Claude Sonnet 4.5, with 587 and 516, respectively. These models demonstrate a strong capacity to correct the meta-model’s errors. Interestingly, the GPT-4.1 model shows the lowest number of negative influences (288), suggesting it is the least likely to mislead the meta-model, followed closely by Claude Sonnet 4.5 (300).
When using the GPT-5 Mini as a meta-model, we observe a similar pattern, where the Claude Opus 4.1 achieves the highest number of positive influences (659), followed by Claude Sonnet 4.5 (594). In terms of negative influence, Claude Sonnet 4.5 and GPT-4.1 remain the safest options, with 355 and 361 negative influences, respectively.
The high positive influence counts for the Claude models across both meta-models highlight their complementary strength to the GPT-based meta-models. By frequently providing more accurate predictions when the GPT models make mistakes, they act as valuable correctors within the meta-model aggregation system. In contrast, models like Gemini 2.5 Flash show high negative influence counts (785 and 849), indicating a greater risk of introducing errors if their recommendations are followed blindly.
To systematically validate the relationship between standalone model performance and meta-model influence, we conducted a correlation analysis synthesizing the performance results from Table 5 with the influence metrics presented above. The analysis reveals exceptionally strong correlations between model accuracy and influence patterns: for the GPT-5 meta-model, standalone accuracy exhibits a very strong positive correlation with net influence (Pearson r = 0.974, p < 0.0001; Spearman ρ = 0.964, p < 0.0001) and an equally strong negative correlation with negative influence (Pearson r = −0.961, p < 0.0001). The GPT-5 mini meta-model shows nearly identical patterns (net influence: Pearson r = 0.978, p < 0.0001; negative influence: Pearson r = −0.952, p < 0.0001). These statistically significant correlations show that higher-performing standalone models consistently provide more beneficial guidance to the meta-model, while lower-performing models systematically introduce more errors.
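The correlation analysis itself reduces to a few SciPy calls; the input lists below are hypothetical placeholders for the per-model accuracies (Table 5) and net influence counts (Table 9):

```python
from scipy.stats import pearsonr, spearmanr

def accuracy_influence_correlation(accuracies, net_influence):
    """Correlate standalone accuracy with net influence (positive minus
    negative counts) across the 12 base models; inputs are equal-length
    lists in the same model order."""
    r, p_r = pearsonr(accuracies, net_influence)
    rho, p_rho = spearmanr(accuracies, net_influence)
    return {"pearson": (r, p_r), "spearman": (rho, p_rho)}
```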
Notably, the three highest-accuracy standalone models—Claude Sonnet 4.5 (65.02%), Claude Opus 4.1 (64.48%), and GPT-4.1 (63.54%)—also achieve the strongest net positive influences (+216, +213, and +129, respectively, for the GPT-5 meta-model) and highest positive influence ratios (63.24%, 61.08%, and 59.15%). Conversely, the three lowest-accuracy models—Gemini 2.5 Flash (56.86%), Gemini 2.5 Flash-Lite (58.44%), and Gemini 2.5 Pro (59.06%)—exhibit the most negative net influences (−486, −241, and −277), with positive influence ratios below 40%. This direct correspondence between standalone performance and meta-model contribution shows that the meta-model effectively recognizes and leverages high-quality predictions while appropriately discounting unreliable ones.
Two notable anomalies merit discussion: GPT-5 Mini (62.34% accuracy) shows a slightly negative net influence (−32) despite above-average performance, and DeepSeek Reasoner (60.22% accuracy) exhibits substantial negative influence (−156) despite moderate standalone performance. These anomalies suggest that prediction quality alone does not guarantee positive meta-model influence—factors such as prediction confidence calibration, output format consistency, and alignment with the meta-model’s reasoning style may also play critical roles.
Based on the above findings, we conclude that the effectiveness of a meta-model aggregation system relies heavily on the selection of models that not only perform well individually but also possess the specific capability to correct the meta-model’s weaknesses. The strong empirical correlation between standalone accuracy and meta-model influence provides quantitative validation that high-performing base models function as effective “error correctors” within the meta-model aggregation framework.
4.5. Meta-Model Decision-Making: Revisions and Independence
RQ5: How does the meta-model handle independent models’ recommendations—specifically, how frequently does it accept (revise) versus disregard them, and are these decisions beneficial?
Revision Behavior (Accepting Recommendations). The GPT-5 meta-model changed its original predictions, as directed by the independent models, 760 times (15.20%) during the course of this study. Of these 760 changes, 583 were correct (a success rate of more than 76%) and 133 were incorrect, yielding a positive revision ratio of approximately 4.4 correct adjustments for each incorrect one. The GPT-5 mini meta-model, on the other hand, changed its original predictions 788 times (15.76%), showing more active revision behavior. Of these 788 changes, 551 were correct (a success rate of approximately 70%) and 152 were incorrect. As such, the GPT-5 mini meta-model has a slightly lower, but still positive, revision ratio of approximately 3.6 correct adjustments per incorrect revision.
Disregarding Behavior (Independent Decision-Making). Beyond accepting recommendations, we also analyzed how often the meta-models completely disregard them. We define “disregarding” recommendations as instances where the meta-model’s final prediction does not match any of the ratings provided by the 12 base models. This represents a strong form of independence, in which the meta-model rejects the entire set of proposed values—including the recommendation of its own underlying model acting as a base model—and generates a unique conclusion based on its own reasoning over the source text (presumably still taking the individual models’ estimations into account to some extent).
More specifically, the GPT-5 meta-model disregarded its base models’ recommendations in 231 cases (4.62% of total predictions), while the GPT-5 mini meta-model did so in 175 cases (3.50% of total predictions). This indicates that the GPT-5 meta-model proved to be more confident in its independent decision-making, while the GPT-5 mini meta-model proved to be more conservative.
Considering the GPT-5 meta-model, all 231 disregard cases were correct (100% accuracy), while the GPT-5 Mini achieved 96.57% accuracy (169 correct and 6 incorrect cases). This behavior suggests that both meta-models can successfully operate independently of base models’ recommendations, with the GPT-5 meta-model showing superior judgment in knowing when to trust its independent analysis.
The GPT-5 mini meta-model demonstrated more active revision behavior (15.76% vs. 15.20%), indicating it is more likely to follow the recommendations of its independent models. However, the GPT-5 meta-model achieved both a higher improvement-to-error ratio (4.4 vs. 3.6) and perfect accuracy (100%) when disregarding all recommendations, demonstrating that the quality of meta-model decisions may be more important than the quantity of revisions in aggregation tasks.
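The following sketch shows how these two behaviors can be counted, assuming hypothetical arrays holding the base-model ratings and the meta-model’s standalone and post-aggregation predictions:

```python
import numpy as np

def revision_and_disregard(base_preds: np.ndarray, meta_initial: np.ndarray,
                           meta_final: np.ndarray) -> dict:
    """base_preds: (n_samples, 12) base-model ratings; meta_initial/meta_final:
    the meta-model's standalone vs. post-aggregation predictions (hypothetical
    arrays; the study logs these during the prediction phase)."""
    revised = meta_final != meta_initial                        # accepted a change
    # 'disregard' = final prediction matches none of the 12 base ratings
    disregarded = ~(base_preds == meta_final[:, None]).any(axis=1)
    return {"revisions": int(revised.sum()),
            "disregards": int(disregarded.sum())}
```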
4.6. Model Trust and Influence Patterns
RQ6: Which models most strongly influence the meta-model’s decisions, and which are least trusted?
Based on our experiments, the GPT-5 meta-model shows the strongest alignment with the predictions from its own base model (GPT-5), while showing the least reliance on the Gemini 2.5 Flash model. This behavior suggests an intrinsic connection with its own model family (the GPT family from OpenAI OpCo, LLC), while being more skeptical of models that may differ in their reasoning and result formulation processes.
Similarly, the GPT-5 mini meta-model also shows the strongest alignment with the predictions from its own model (GPT-5 Mini), while demonstrating the least reliance on the Gemini 2.5 Flash model. However, the key difference is that the GPT-5 mini meta-model takes the predictions from GPT-5 more seriously, indicating that, while it retains its decision-making autonomy, it values the results produced by its more advanced counterpart to a greater extent. It is worth noting that the meta-model prompt includes both the name and the rating of each model, ensuring that the meta-model is aware of which model generated each prediction.
The fact that both meta-models identify the Gemini 2.5 Flash model as the least trusted one indicates that, regardless of the meta-model’s sophistication, certain patterns of model reliability are universally recognized. This inter-agreement evaluation (derived from both the GPT-5 and GPT-5 mini meta-models) demonstrates the meta-models’ ability to identify less reliable predictions and adjust their importance appropriately, regardless of each meta-model’s particular trust and influence approach.
4.7. Comparative Analysis of Standalone Versus Aggregated Performance
RQ7: How does a model’s performance as a meta-model (in a meta-model aggregation system setup) compare to its performance when operating independently?
Based on the experiments, the GPT-5 model achieved an accuracy of 62.4% as a standalone model, while its accuracy increased to 71.4% when acting as a meta-model in a meta-model aggregation system (a relative increase of 14.4%). Similarly, the GPT-5 Mini model achieved an accuracy of 62.34% as a standalone model, while its accuracy also increased to 70.32% (a relative increase of 12.8%) when acting as a meta-model.
These results show the beneficial influence of the base models’ recommendations on both models, indicating that they can be positively influenced by the recommendations of other models in a meta-model aggregation setup.
4.8. Decision-Making Strategy Analysis
RQ8: Does the meta-model primarily rely on majority voting, or does it reason directly over textual content? How often does each occur?
The GPT-5 meta-model shows a predominant tendency to align with majority decisions, with 4292 cases matching the majority vote, compared to 708 cases of independent reasoning. When considering other central tendency metrics, the meta-model’s predictions align with the rounded mean of the models’ outputs in 4241 cases and with the rounded median in 4252 cases (Table 10). Importantly, its performance actually improves when diverging from the majority, achieving an accuracy of 77.68% in divergent cases compared to 70.36% when following the majority.
The GPT-5 mini meta-model shows a slightly different pattern, with 4227 cases matching the majority vote and 773 cases of independent reasoning. In terms of central tendency, it aligns with the rounded mean in 4229 cases and the rounded median in 4235 cases. However, its performance pattern inverts when diverging from the majority, achieving 68.69% accuracy in divergent cases compared to 70.62% when following the majority. This indicates that the GPT-5 Mini tends to make independent decisions more often than the GPT-5; however, these decisions are proven to be less successful than the ones from its more sophisticated counterpart.
The different behaviors reveal that the two meta-models used in our study exhibit a fundamental divergence in their decision-making capabilities. The GPT-5 meta-model is more adept at recognizing when the majority opinion may be misleading and, as a result, independent reasoning is more likely to yield the correct results. In contrast, the GPT-5 mini meta-model performs better when aligning with majority opinions, demonstrating a more conservative approach in its independent reasoning.
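As a sketch, the majority-alignment statistics above can be reproduced as follows, assuming an (n_samples × 12) array of base-model ratings; tie-breaking here follows SciPy’s mode (smallest value wins), which is an assumption rather than the study’s exact rule:

```python
import numpy as np
from scipy.stats import mode

def majority_analysis(base_preds: np.ndarray, meta_pred: np.ndarray,
                      true: np.ndarray) -> dict:
    """Compare the meta-model with the per-sample majority vote of the
    12 base models and split its accuracy by follow vs. diverge cases."""
    majority = mode(base_preds, axis=1, keepdims=False).mode
    follows = meta_pred == majority
    return {
        "follow_count": int(follows.sum()),
        "diverge_count": int((~follows).sum()),
        "follow_acc": float((meta_pred[follows] == true[follows]).mean()),
        "diverge_acc": float((meta_pred[~follows] == true[~follows]).mean()),
    }
```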
4.9. Outlier Model Behavior Analysis
RQ9: Which models behave as outliers, potentially disrupting the overall meta-model aggregation system?
When using the GPT-5 meta-model, we identified four models as potential system disruptors, due to their consistently lower agreement with the meta-model’s decisions combined with a stronger negative influence on system performance. These models are Gemini 2.5 Flash, Gemini 2.5 Flash Lite, Claude Haiku 4.5, and DeepSeek Reasoner.
On the other hand, when GPT-5 Mini served as the meta-model, we identified a different and larger set of five outlier models: Gemini 2.5 Pro, Gemini 2.5 Flash Lite, GPT-5 Nano, Claude Haiku 4.5 and Claude Opus 4.1. This broader identification of outliers suggests that the GPT-5 mini meta-model may be more sensitive to patterns of disagreement or may have a lower threshold for what constitutes disruptive behavior.
Notably, some models appear as outliers in both configurations, particularly the Gemini 2.5 Flash Lite and Claude Haiku 4.5, suggesting that these models may have inherent characteristics that make them more likely to diverge from the consensus, regardless of the meta-model. However, the difference in outlier identification between the meta-models is particularly interesting. The GPT-5 meta-model appears more tolerant of variation in model behavior, identifying fewer outliers and primarily focusing on models from the Gemini family and lighter versions of other model families.
4.10. Pre-Trained LLM Capability Assessment
RQ10: Are pre-trained LLMs without fine-tuning capable of accurately capturing sentiment from user feedback?
Based on the experiments, we observe an average accuracy of 61.25% across all models. The highest and lowest accuracies were achieved by the Claude Sonnet 4.5 and Gemini 2.5 Flash models, at 65.02% and 56.86%, respectively. This significant gap of 8.16 percentage points shows that pre-trained LLMs, without fine-tuning, exhibit varying capabilities when handling sentiment analysis tasks.
Additionally, both meta-model configurations produced similar and consistent results, showing that pre-trained LLMs possess innate sentiment analysis capabilities. Given their relatively high accuracy (greater than 60% in most instances), we can infer that the pre-trained LLMs developed a robust general language understanding that supports sentiment analysis despite the absence of task-specific fine-tuning.
Further, based on the results, we observe that the performance ranking of the LLMs is consistent across both meta-model configurations, with the Claude models performing best (Claude Sonnet 4.5 at 65.02% and Claude Opus 4.1 at 64.48%), followed by the OpenAI GPT models. Interestingly, the Google Gemini models demonstrated the greatest variability in performance.
Finally, as the performance order of the LLMs was constant, it is possible to infer that architectural and training choices made in developing the LLMs have a lasting impact on their capability to analyze sentiment, regardless of whether the LLM has been optimized for the particular task.
Also, there is a large performance gap between the average individual model performance of 61.25% and the two aggregated performances of 71.40% (GPT-5) and 70.32% (GPT-5 Mini). As such, while the pre-trained LLMs demonstrated a meaningful ability to analyze sentiment, this significant performance gap suggests that they can greatly benefit from aggregation. Thus, collaboration among pre-trained models can provide an effective method of overcoming limitations in sentiment analysis without the requirement of fine-tuning.
4.11. Cost-Effectiveness Analysis of Meta-Model Aggregation Approaches
RQ11: Do observed accuracy improvements justify the added computational complexity and cost of meta-model aggregation?
Figure 4 and Table 11 present a direct comparison of accuracy and the associated cost per model for 5000 sample predictions. We observe that the Gemini-2.5-Pro is the most expensive model while delivering relatively low accuracy, whereas the Claude Sonnet-4.5 achieves a better balance between accuracy and cost, making it more cost-effective in this evaluation.
For these models, the cost shown in Figure 4 and Table 11 reflects the expense of API calls for the meta-model itself. While not explicitly depicted in the figure, the total cost of a meta-model aggregation system includes the cost of the meta-model plus the cost of each individual model under aggregation. More specifically, for the GPT-5 meta-model, the total system cost (including all models’ API costs) is $130.45, while for the GPT-5 mini meta-model, the total system cost is $111.86.
Comparing these costs with the corresponding increases in prediction accuracy (up to a 10% improvement), the value of meta-model aggregation ultimately depends on how much a business prioritizes additional accuracy relative to its operational budget.
4.12. Model Similarity and Redundancy Analysis
RQ12: How close are the predictions of different LLMs, and can we reduce costs by omitting models with similar outcomes?
To address this question, we investigated the similarity between the predictions of the 12 LLMs using Normalized Mean Absolute Error (NMAE) and Agreement Rate. The analysis results, visualized as a pairwise similarity matrix in Figure 5, revealed that the most similar pair of models is Claude Sonnet 4.5 and Claude Opus 4.1, with an NMAE of 0.030 and an agreement rate of 88%. This high level of similarity suggests that these two models often provide near-identical predictions.
Based on this finding, we simulated a cost-reduction strategy where the more expensive model, Claude Opus 4.1, was omitted from the ensemble. To maintain the numerical balance of the voting system, the vote of the remaining similar model, Claude Sonnet 4.5, was doubled. The simulation results showed that the baseline accuracy of the 12-model ensemble (62.64%) was maintained, and even slightly improved to 62.80% in the reduced 11-model configuration. This finding suggests that identifying and removing redundant models is a viable strategy for optimizing the cost-efficiency of meta-model aggregation systems without compromising performance.
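A minimal sketch of this pruning simulation, assuming a hypothetical mapping from model names to prediction arrays, could look as follows:

```python
import numpy as np

def weighted_majority(preds: dict[str, np.ndarray],
                      weights: dict[str, int]) -> np.ndarray:
    """Weighted majority vote over 1-5 star predictions; models absent
    from 'weights' default to a weight of 1."""
    n = len(next(iter(preds.values())))
    counts = np.zeros((n, 5))
    for name, p in preds.items():
        for star in range(1, 6):
            counts[:, star - 1] += weights.get(name, 1) * (p == star)
    return counts.argmax(axis=1) + 1

# Pruning simulation sketch: drop Claude Opus 4.1 and double the vote of
# its near-duplicate, Claude Sonnet 4.5 ('preds' and 'true' are hypothetical).
# pruned = {k: v for k, v in preds.items() if k != "claude-opus-4.1"}
# acc = (weighted_majority(pruned, {"claude-sonnet-4.5": 2}) == true).mean()
```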
Table 12 presents the top 5 most similar model pairs identified in our analysis, from which the most prominent candidates for removal can be drawn. Interestingly, retaining GPT-4.1 could allow the removal of up to three other models (Claude Sonnet-4.5, GPT-5, and Claude-Opus-4.1), while another option would be to retain Claude Sonnet-4.5 and remove up to three different models (Claude-Opus-4.1, GPT-4.1, and DeepSeek-chat). The choice could take into account both cost and accuracy factors (Claude Sonnet-4.5 is more expensive and more accurate than GPT-4.1).
4.13. Comparison with Fine-Tuned Baseline
RQ13: How does the performance of zero-shot LLMs and the meta-model aggregation approach compare to traditional fine-tuned transformer models?
To contextualize the zero-shot LLM performance and aggregation effectiveness, we fine-tuned RoBERTa-base as a supervised learning baseline. Using the remaining 45,000 balanced reviews (excluding the 5000 test set), we performed hyperparameter tuning on Google Colab Pro (A100 GPU) with a stratified 80/20 train-validation split. Six configurations were tested (learning rates of 2 × 10−5, 3 × 10−5, and 5 × 10−5, crossed with batch sizes of 16 and 32) with a 3-epoch maximum and early stopping (patience = 2). The best configuration (learning rate 5 × 10−5, batch size 16) achieved 63.03% validation accuracy and 64.00% test accuracy.
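For reference, the following is a minimal sketch of this fine-tuning setup with the Hugging Face Trainer; the dataset variables are hypothetical placeholders, and some argument names may differ across transformers versions (e.g., evaluation_strategy vs. eval_strategy):

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          EarlyStoppingCallback, Trainer, TrainingArguments)

tok = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base",
                                                           num_labels=5)

def encode(batch):
    # Reviews tokenized from a 'text' field; max_length is an assumption.
    return tok(batch["text"], truncation=True, max_length=256)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    return {"accuracy": float((logits.argmax(-1) == labels).mean())}

# train_ds / val_ds: hypothetical Hugging Face Datasets with 'text' and
# 'label' (0-4) columns, from the stratified 80/20 split described above.
args = TrainingArguments(
    output_dir="roberta-ratings",
    learning_rate=5e-5,                  # best configuration from the search
    per_device_train_batch_size=16,
    num_train_epochs=3,
    evaluation_strategy="epoch",         # 'eval_strategy' in newer versions
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)
trainer = Trainer(
    model=model, args=args,
    train_dataset=train_ds.map(encode, batched=True),
    eval_dataset=val_ds.map(encode, batched=True),
    tokenizer=tok, compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()
```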
RoBERTa’s performance positioned it within the range of zero-shot LLMs (56.86–65.02%) but substantially below aggregation approaches. Specifically: (1) RoBERTa outperformed 7 of 12 zero-shot LLMs, (2) underperformed the top 5 zero-shot models (Claude Sonnet 4.5: 65.02%, Claude Opus 4.1: 64.48%, GPT-4.1: 63.54%), and (3) trailed the meta-models by 7.40 and 6.32 percentage points, respectively. Despite direct exposure to 36,000 training examples (7200 per rating), RoBERTa’s performance remained comparable to mid-tier zero-shot LLMs, suggesting fundamental challenges in rating prediction—particularly for neutral sentiment (2-star: 53% recall, 3-star: 53% recall)—that supervised fine-tuning alone cannot fully resolve. It is worth noting, though, that more extensive hyperparameter tuning (e.g., grid search or Bayesian optimization) could further improve these results.
This comparison reveals three key findings: (1) aggregation provides meaningful accuracy improvements even versus specialized fine-tuned models, (2) zero-shot LLM performance is competitive with supervised approaches despite lacking task-specific training, and (3) the 10.15 percentage point aggregation improvement compounds architectural benefits with model diversity. While RoBERTa offers substantially lower inference cost (milliseconds vs. seconds per prediction), the accuracy-cost trade-off depends on deployment context. Future work could explore whether advanced prompting techniques (Few-Shot, Chain-of-Thought) further widen performance gaps or whether hybrid approaches combining fine-tuned models with aggregation yield additional gains.
4.14. Model Agreement and Consensus Patterns
RQ14: What patterns of agreement exist among the 12 base models, and how does the consensus level relate to prediction accuracy?
To rigorously quantify inter-model agreement, we computed Fleiss’ Kappa across all 12 models, yielding κ = 0.5921 (moderate agreement according to the Landis & Koch interpretation guidelines [68]). This indicates that, while the models agree significantly beyond chance (Fleiss’ Kappa relates the observed agreement P̄ to the chance-expected agreement P̄e via κ = (P̄ − P̄e)/(1 − P̄e)), substantial disagreement persists across the ensemble. Agreement varied significantly by rating level: 1-star (κ = 0.5782) and (especially) 5-star (κ = 0.6294) reviews achieved higher consensus compared to neutral 3-star reviews (κ = 0.5542), consistent with the neutral sentiment challenge discussed in Section 4.2.
Pairwise Cohen’s Kappa analysis revealed model-specific agreement patterns. The highest-agreeing pairs were Claude Sonnet 4.5–Claude Opus 4.1 (κ = 0.849, 88.00% raw agreement) and GPT-4.1–Claude Sonnet 4.5 (κ = 0.836, 87.04%), suggesting architectural similarity within model families. The lowest-agreeing pairs involved Gemini 2.5 Flash with other models (κ = 0.582–0.671), indicating distinct prediction behavior. Individual model average kappa ranged from 0.659 (Gemini 2.5 Flash) to 0.768 (GPT-5), with higher average agreement correlating with individual model accuracy (Pearson’s r = 0.72, p < 0.01).
Consensus level analysis categorized predictions into five tiers based on majority vote strength: Unanimous (12/12 models agree: 43.36% of cases), Strong Majority (10–11/12: 25.84%), Majority (8–9/12: 18.58%), Weak Majority (6–7/12: 12.02%), and No Majority (<6/12: 0.20%). Prediction accuracy increased monotonically with consensus: unanimous cases achieved 80.12% accuracy versus 47.26% for standard majority and 30.00% for no-majority cases. This pattern is consistent with the aggregation rationale—high consensus provides signal quality, whereas disagreement flags ambiguous cases requiring meta-model reasoning.
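A sketch of how the agreement and consensus-tier statistics can be computed with statsmodels, assuming an (n_samples × 12) ratings matrix:

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

def agreement_analysis(base_preds: np.ndarray, true: np.ndarray) -> dict:
    """base_preds: (n_samples, 12) star ratings from the 12 base models."""
    counts, cats = aggregate_raters(base_preds)   # per-sample category counts
    kappa = fleiss_kappa(counts)
    strength = counts.max(axis=1)                 # size of the largest voting bloc
    majority = cats[counts.argmax(axis=1)]        # majority-vote label per sample
    tiers = {"unanimous": strength == 12,
             "strong_majority": (strength >= 10) & (strength < 12),
             "majority": (strength >= 8) & (strength < 10),
             "weak_majority": (strength >= 6) & (strength < 8),
             "no_majority": strength < 6}
    tier_acc = {name: float((majority[m] == true[m]).mean())
                for name, m in tiers.items() if m.any()}
    return {"fleiss_kappa": float(kappa), "tier_accuracy": tier_acc}
```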
Extreme divergence cases (predictions spanning 1–5 stars, n = 4, 0.08%) highlighted linguistic complexity. For example, the rating text “I’m used to new products being the same price, functionally, as used items. I’m just always blown away when something like this happens” (true: 2-star) elicited predictions from 1 to 5 stars due to sarcasm and implicit negativity contradicting surface-level neutral phrasing. Such cases underscore why pure majority voting (62.64% accuracy) underperforms meta-model reasoning (71.40%), as aggregation must identify when consensus itself is unreliable due to shared systematic errors across base models.
4.15. Meta-Model Override Behavior and Decision Triggers
RQ15: Under what conditions do meta-models override the majority vote, and how does this relate to prediction accuracy?
Meta-models exhibited distinct override behaviors with critical accuracy implications. GPT-5 overrode the majority in 14.38% of cases (719/5000), achieving 78.03% accuracy in those instances compared to 70.29% when following the majority—a 7.74 percentage point improvement. Conversely, GPT-5 Mini overrode more frequently (15.64%, 782/5000) but with lower success (69.18% accuracy when overriding versus 70.53% when following, a −1.35 percentage point deficit). This reveals that override frequency alone does not guarantee improvement; decision quality matters more than decision volume.
Override success varied systematically by rating level. GPT-5 achieved its highest override accuracy on extreme ratings: 95.12% (5-star, 78/82 overrides) and 90.77% (1-star, 59/65 overrides), versus 70.40% on neutral 3-star reviews (176/250 overrides). This aligns with the neutral sentiment challenge—meta-models struggle to correct majority errors when the ground truth itself is ambiguous. By contrast, extreme ratings provide clearer linguistic signals that enable confident deviation from consensus.
Override triggers analysis revealed that meta-models disproportionately override under high base-model disagreement. When GPT-5 overrode the majority, base models exhibited a mean prediction standard deviation of 0.408, compared to 0.215 when following (difference = 0.193, t-test p < 0.001). Similarly, GPT-5 Mini showed a mean disagreement of 0.421 (override) versus 0.210 (follow), a difference of 0.211 (p < 0.001). This indicates that disagreement among base models serves as an implicit confidence signal: meta-models are more likely to trust their independent analysis when the ensemble lacks strong consensus.
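The disagreement-trigger comparison can be sketched as follows; Welch’s t-test is used here as an assumption, since the exact test variant is not specified above:

```python
import numpy as np
from scipy.stats import ttest_ind

def override_trigger_test(base_preds: np.ndarray, meta_pred: np.ndarray,
                          majority: np.ndarray) -> dict:
    """Compare base-model disagreement (per-sample std of the 12 ratings)
    between override and follow cases."""
    disagreement = base_preds.std(axis=1)
    override = meta_pred != majority
    t, p = ttest_ind(disagreement[override], disagreement[~override],
                     equal_var=False)             # Welch's t-test (assumption)
    return {"mean_override": float(disagreement[override].mean()),
            "mean_follow": float(disagreement[~override].mean()),
            "t": float(t), "p": float(p)}
```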
Majority strength analysis quantified this pattern: when all 12 models agreed (unanimous majority), GPT-5 overrode in 4.5% of cases (97/2168) with 100% accuracy—exclusively correcting rare systematic errors. As the majority strength weakened, override rates increased: 6–7 models agreeing (weak majority) triggered 18.9% override rate (183/969), though accuracy dropped to 73.8%. This suggests optimal aggregation involves calibrated override thresholds: strong consensus should be trusted absent compelling contradictory evidence, whereas weak consensus warrants skeptical re-evaluation.
Qualitative override examples illustrate reasoning sophistication. Correct override (GPT-5): true rating = 2 stars, majority predicted 3 stars, GPT-5 predicted 2 stars. When processing the review “It comes with a lot of items related to the game. Very satisfied with the quality of items; however, the packaging is a different story. The box was damaged…”, GPT-5’s reasoning identified that the negative framing of the packaging issues outweighed the positive product mentions, whereas the majority of models averaged positive and negative signals toward neutrality. On the other hand, an incorrect override (GPT-5 Mini) occurred in a case where the true rating was 1 star and the majority predicted 1 star, but GPT-5 Mini overrode the decision, predicting 2 stars. Given the review text “Right off the bat, before I even use this thing, it’s missing the thumb stick covers…”, GPT-5 Mini’s reasoning treated the missing components as a moderate defect rather than a critical failure, misaligning with the reviewer’s intent. These cases demonstrate that meta-model override success depends on correctly weighting sentiment intensity, not merely detecting valence.
5. Discussion
This study evaluates meta-model aggregation systems versus independent LLMs in sentiment analysis for RecSys, uncovering key insights into performance, cost-efficiency, and practical implications of LLM aggregation.
5.1. Performance Enhancement Through Aggregation
The most important finding is the considerable improvement in performance achieved through meta-model aggregation. Both meta-models used in this study (GPT-5 and GPT-5 Mini) achieved considerable accuracy gains, measured at 10.15 and 9.07 percentage points in absolute terms (or 16.6% and 14.8% as relative improvements), respectively, compared to the average performance of the individual models (61.25%). The small (almost negligible) difference between the two meta-models indicates that effectiveness can be achieved even with lightweight, and thus cost-effective, models, meaning organizations need not deploy the most expensive models as meta-models.
5.2. The Neutral Sentiment Challenge
All 12 LLMs faced extreme challenges in predicting neutral (i.e., 3-star) ratings, showing a significant average failure rate of 64.83% (almost 2 out of every 3 predictions misclassified). This highlights the fundamental challenge of classifying neutral or mixed sentiment expressions in sentiment analysis research. Furthermore, the substantial range in failure rates (from 55.7% to 77.4%) indicates that certain architectural approaches handle neutral sentiment considerably better than others. Unlike extreme ratings, which carry clear linguistic markers of praise or strong dissatisfaction, 3-star reviews often contain balanced or contradictory statements that require a more nuanced understanding. Although meta-model aggregation helps address this challenge, it remains a critical area for future research.
5.3. Model Specialization and Reasoning Trade-Offs
Surprisingly, reasoning-oriented models performed worse than their chat-oriented counterparts. More specifically, DeepSeek Reasoner underperformed compared to DeepSeek Chat, while GPT-5 showed 1.14 percentage points lower accuracy than GPT-4.1. Based on these findings, we can conclude that the additional computational overhead of reasoning-focused architectures does not yield corresponding benefits in sentiment analysis tasks. This may be because reviews are typically short and conversational in style, more akin to a chat message than an essay or a complex document; their style and content may therefore be a better fit for conversational models. Furthermore, reasoning models appear to show a potential bias toward negative sentiment, which is inconsistent with the more balanced emotional distribution of real-world reviews. In conclusion, organizations should carefully consider whether the additional cost and latency of reasoning models are justified for specific use cases.
5.4. Meta-Model Decision-Making and Independent Models’ Influence
The meta-models demonstrated sophisticated behavior, which was superior to simple majority voting. More specifically, when the GPT-5 meta-model diverged from the consensus, it achieved an accuracy of 77.68%, whereas it achieved 70.36% accuracy when it followed the majority. This suggests that effective aggregation also involves critically evaluating potentially misleading consensus. Furthermore, the meta-models established implicit trust hierarchies, consistently identifying outlier models (Gemini 2.5 Flash Lite and Claude Haiku 4.5). Finally, the perfect accuracy (100%) achieved by GPT-5, when it ignored the models’ recommendations, indicates an advanced meta-cognitive ability to recognize when collective input is less reliable than its own analysis.
We also evaluated the reasoning behind the decision-making process using the reasoning text generated by the meta-models, which was systematically collected and stored in designated columns during the prediction phase. A comparison of the two models’ reasoning styles revealed significant differences.
The GPT-5 mini meta-model produces considerably more text than the GPT-5 meta-model, with a 62% greater mean reasoning length (91.15 vs. 56.44 words). In many instances, this additional text is devoted to the detailed examination of outliers, as indicated by the term “outlier” being referenced 1693 times, as opposed to 376 times for GPT-5.
A qualitative review of the reasoning logs revealed how the meta-models handled conflict resolution. For example, in cases where there was a split decision among the models, both the GPT-5 Mini and GPT-5 meta-models provided explicit comparisons of the reviewers’ textual information versus their own predictive assessments.
Example of Consensus (GPT-5): “All models unanimously predicted 1 star with no deviations. The review is strongly negative, citing safety concerns… and explicit non-recommendation… The severity of the issue… clearly aligns with a 1-star rating.”
Example of Conflict Resolution (GPT-5): “Models split evenly between 1-star and 2-star… The 2-star predictors… deviate from the strongly negative tone. The review reports failure after limited use… clear non-recommendation. Despite a brief ‘worked great’ at first, overall severity and dissatisfaction align with 1 star.”
These examples demonstrate that the meta-models do not merely aggregate votes but actively reason about the content to resolve ambiguities, explaining why certain models (e.g., those predicting 2 stars) might have been misled by specific phrases (e.g., “worked great”).
5.5. Cost–Benefit Analysis
Overall, the cost of the entire system for the 5000 predictions was $130.45 using GPT-5 and $111.86 using GPT-5 Mini as the aggregation model. That is substantially more than the cost of the individual LLMs, which ranged from $0.24 (DeepSeek Chat) to $43.97 (Gemini 2.5 Pro). In addition, the total execution time increased drastically, taking 7795 and 8044 min, respectively, compared to the time taken by each individual model (anywhere from 64 to 2609 min). On average, this translates into a latency of approximately 1.6 min per request. It is worth noting, however, that this latency can be greatly improved by making simultaneous calls to the base models. If the system were to call all of the LLMs in parallel, the total execution time would theoretically equal the time taken by the slowest model (DeepSeek Reasoner) plus the time taken by the meta-model to aggregate the results (approximately 784 min for GPT-5 and 1032 min for GPT-5 mini). The time taken to serve the 5000 requests would therefore be approximately 3393 min for GPT-5 and 3642 min for GPT-5 mini, with corresponding per-request latencies of 0.68 and 0.73 min, respectively. Although latency is higher, the accuracy improvement was approximately 10 percentage points, representing nearly 500 additional correct predictions, which matter greatly in applications where prediction accuracy directly influences business results. The meta-model aggregation method provides a “training-free” way to enhance performance without requiring the substantial labeled datasets or computational resources needed to fine-tune models. In summary, the accuracy gain may outweigh the cost increase in high-stakes applications, whereas in low-stakes applications, individual models provide the best value based on cost vs. benefit.
To contextualize these costs in production-scale RecSys environments, we calculate the cost-per-accuracy-point improvement: The GPT-5 meta-model achieves a 10.15% accuracy gain over the average individual model (61.25%) at $130.45 per 5000 reviews, yielding $12.86 per percentage point improvement. For the GPT-5 mini meta-model, the 9.07% improvement at $111.86 translates to $12.33 per percentage point. When compared to the best-performing individual model (Claude Sonnet 4.5 at 65.02% accuracy, costing $5.76), the meta-model’s marginal gain of 6.38 percentage points (from 65.02% to 71.40%) costs an additional $124.69, or $19.54 per marginal percentage point. These metrics reveal that the cost efficiency of aggregation depends heavily on the baseline being compared.
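These figures follow directly from the reported costs and accuracies; a compact re-computation is shown below (minor rounding differences relative to the rounded in-text values are expected).

```python
# Cost per percentage point of accuracy improvement, per 5000 reviews.
cases = [
    # (comparison, extra cost in USD, accuracy gain in percentage points)
    ("GPT-5 vs. average individual",      130.45,         10.15),
    ("GPT-5 mini vs. average individual", 111.86,          9.07),
    ("GPT-5 vs. best individual",         130.45 - 5.76,   71.40 - 65.02),
]
for label, cost, gain in cases:
    print(f"{label}: ${cost / gain:.2f} per percentage point")
# -> approximately $12.85, $12.33, and $19.54, respectively
```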
Regarding production scalability, processing millions of reviews daily would indeed incur substantial costs. Extrapolating to 1 million reviews (200× our test set), the GPT-5 meta-model would cost approximately $26,090 and the GPT-5 mini meta-model approximately $22,372. For organizations processing 10 million reviews daily, annual costs would exceed $95 million (GPT-5) or $81 million (GPT-5 mini), which is financially prohibitive for most real-world deployments. However, several practical deployment strategies can mitigate these costs: (1) Selective Aggregation: apply the full 12-model ensemble only to high-stakes scenarios (e.g., disputed reviews, flagged content, product launches), while using individual models or reduced ensembles for routine classification. (2) Model Pruning: as demonstrated in Section 4.12, removing redundant models (e.g., highly similar pairs like Claude Sonnet 4.5 and Claude Opus 4.1) maintains performance while reducing costs proportionally; a 6-model ensemble would halve aggregation expenses. (3) Confidence-Based Routing: implement a two-tier system in which individual models handle high-confidence predictions (e.g., >90% softmax probability), escalating only uncertain cases to aggregation, as sketched below. (4) Batch Optimization: amortize meta-model overhead across larger batches and leverage parallel API calls to reduce per-review latency from 1.6 min to under 1 min. (5) Hybrid Architectures: fine-tune smaller models (e.g., BERT, RoBERTa) on high-confidence meta-model predictions, creating cost-effective specialized classifiers for production while reserving aggregation for challenging cases.
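A minimal sketch of the confidence-based routing strategy (3) follows; the two callables and the threshold are illustrative assumptions rather than components of our evaluated pipeline.

```python
from typing import Callable, Tuple

CONFIDENCE_THRESHOLD = 0.90  # e.g., softmax probability cut-off from strategy (3)

def route_review(
    review: str,
    single_model: Callable[[str], Tuple[int, float]],  # returns (rating, confidence)
    ensemble: Callable[[str], int],                    # full 12-model aggregation path
) -> int:
    """Two-tier routing: trust high-confidence single-model predictions,
    escalating only uncertain reviews to the costly meta-model ensemble."""
    rating, confidence = single_model(review)
    if confidence >= CONFIDENCE_THRESHOLD:
        return rating              # fast, low-cost path
    return ensemble(review)        # expensive path reserved for hard cases
```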
In conclusion, while the current full-ensemble aggregation is economically challenging for large-scale continuous deployment, the architecture’s primary value lies in (a) research benchmarking to establish performance ceilings, (b) high-stakes applications where accuracy justifies cost (e.g., content moderation, fraud detection, medical sentiment analysis), and (c) training data generation where meta-model predictions can be used to fine-tune efficient production models. Organizations must carefully assess their accuracy-cost trade-offs: for applications where a 10% accuracy improvement translates to significant business value (e.g., reducing false negatives in safety-critical domains), aggregation becomes economically viable; for low-stakes sentiment analysis at scale, individual models or pruned ensembles offer better cost–benefit ratios.
5.6. Practical Implications
The main practical implications are the following:
Organizations should deploy multiple models instead of focusing on finding a single ‘optimal’ model, as lower-performing models can still provide valuable insights.
Lightweight meta-models (like GPT-5 mini) offer an attractive performance–cost balance and may prove most suitable for lower-stakes applications.
Due to the high failure rate in classifying neutral or mediocre opinions (i.e., 3-star ratings), production systems should implement additional safeguards when encountering these cases (e.g., higher confidence thresholds or human review queues).
Organizations should carefully assess the business value of accuracy improvements in relation to aggregation costs, depending on the application’s stakes.
5.7. Limitations and Future Directions
This study provides a comprehensive evaluation of meta-model aggregation for product review sentiment analysis using Amazon reviews. While our findings demonstrate clear performance improvements within this context, generalizability to other tasks and domains requires empirical validation. Specifically, our results are constrained to:
Task Scope: Our evaluation focuses exclusively on product review sentiment analysis—the task of predicting numerical star ratings (1–5) from textual reviews in an e-commerce context. This focused scope allows rigorous controlled evaluation within a single well-defined task. Future work could investigate whether meta-model aggregation provides similar benefits for other NLP tasks, including text summarization, question answering, dialogue generation, named entity recognition, or open-domain text classification.
Output Format Specificity: The task involves predicting discrete ordinal ratings on a bounded 1–5 scale, providing a clear quantitative evaluation metric. This structured output format enables rigorous statistical analysis and direct performance comparison. Extensions to other sentiment analysis formulations would be valuable future work, including: (a) open-ended sentiment description where models generate free-text sentiment explanations rather than scale-based ratings, (b) aspect-based sentiment analysis where multiple sentiment dimensions must be evaluated simultaneously (e.g., product quality, shipping speed, customer service), or (c) fine-grained emotion detection that extends beyond the positive-negative-neutral spectrum to identify specific emotional states (joy, anger, frustration, disappointment).
Data Source Specificity: All reviews originate from the Amazon e-commerce platform across five product categories (Fashion, Automotive, Books, Electronics, Video Games), providing substantial domain coverage within e-commerce contexts. This domain focus ensures consistent evaluation conditions and reduces confounding variables. Validation with datasets from different providers and of varying form or nature represents a natural extension of this work, with promising directions including: (a) movie reviews (e.g., MovieLens, IMDb) where sentiment may be expressed differently than in product reviews, (b) restaurant reviews (e.g., Yelp) which often emphasize subjective experiences like ambiance and service quality, (c) social media content (e.g., Twitter, Reddit) characterized by informal language, hashtags, and cultural references, or (d) news article sentiment where political bias and journalistic framing introduce distinct challenges.
Language Limitation: This study evaluates English-language reviews exclusively. The performance of zero-shot LLMs and the effectiveness of meta-model aggregation may differ substantially across languages due to: (a) training data imbalance, where English typically dominates pre-training corpora, (b) linguistic structure differences (e.g., sentiment expression in morphologically rich languages, context-dependent languages like Greek, Chinese and Japanese), and (c) cultural variation in how sentiment and product satisfaction are expressed textually. Multilingual evaluation and cross-lingual transfer experiments would be required to assess generalizability beyond English.
Temporal and Versioning Constraints: Our measurements represent performance snapshots using specific model versions available in late 2025 (Table 3). As acknowledged in our API-based evaluation limitations discussion, model providers continuously update their systems, which may affect both absolute performance levels and relative rankings. Additionally, future model architectures with improved reasoning capabilities, longer context windows, or domain-specific fine-tuning may alter the cost–benefit calculus of meta-model aggregation. Our findings should be interpreted as characterizing the specific model ecosystem evaluated rather than establishing permanent truths about LLM meta-model aggregation effectiveness.
The above scope decisions enable rigorous controlled evaluation within a well-defined context. Our findings demonstrate that meta-model aggregation with natural language reasoning outperforms traditional ensemble methods and individual models for Amazon product review sentiment prediction. These results establish a foundation for future work examining whether similar benefits extend to other tasks, domains, languages, or future model generations.
Data Contamination Considerations: We evaluate models on the Amazon Reviews ’23 dataset, a publicly available benchmark. While LLMs with 2024/2025 training cutoffs may have encountered these reviews during pre-training, the observed zero-shot accuracies (ranging from 56.86% to 65.02% for individual models) suggest limited memorization; extensive training on this specific test set would likely produce substantially higher performance approaching near-perfect accuracy. This performance range is consistent with genuine zero-shot evaluation, though as with all contemporary LLM studies on public datasets, complete certainty about training data overlap remains infeasible.
API-Based Evaluation Limitations: This study evaluates commercial API-based LLM services, documenting all model versions with timestamps (Table 3) and using temperature = 0 for deterministic sampling. API providers may modify underlying models, update endpoints, or adjust pricing structures over time. Our measurements represent performance indicators captured at a specific point in time (late 2025) using the model versions listed in Table 3, enabling rigorous comparative evaluation under identical conditions. This approach reflects realistic production deployment scenarios where organizations typically access LLMs through commercial APIs rather than hosting models locally. While API-based inference may exhibit residual non-determinism due to distributed computing architectures and inference optimizations, all models were evaluated under identical conditions within the same temporal window, ensuring valid comparative conclusions.
Ensemble-Based Nature of Approach: Our approach represents a specific instantiation of meta-model aggregation within the established ensemble learning paradigm, where the meta-model processes predictions through natural language reasoning rather than learned weights. This reasoning capability distinguishes our approach from traditional stacking methods while remaining grounded in ensemble learning principles. To clarify scope: we do not introduce novel agent architectures, autonomous goal-pursuit mechanisms, distributed decision-making protocols, or tool-use capabilities that characterize true agentic AI systems. Rather, the contribution lies in the empirical demonstration that meta-model reasoning provides measurable accuracy improvements over statistical aggregation methods when applied to sentiment classification.
Absence of Formal Agent Model: This work focuses on meta-model aggregation for sentiment analysis, positioning itself within ensemble learning and LLM evaluation literature rather than multi-agent systems paradigms. To clarify terminology: our use of “orchestration” in the title refers specifically to meta-model aggregation—the process by which a reasoning-capable LLM combines predictions from multiple independent models. This differs from multi-agent coordination in which autonomous agents negotiate, communicate, and collaborate. The base models function as independent zero-shot inference systems accessed via APIs, processing inputs and generating predictions without maintaining state or learning from experience. For completeness, we note that our system does not implement formal agent properties such as: (1) belief-desire-intention (BDI) models or other formal agent architectures that maintain internal mental states, (2) autonomous learning loops that enable adaptation without human intervention, (3) environment interaction cycles where agents perceive, reason, and act dynamically, or (4) dynamic goal adjustment mechanisms. This positioning clarifies that our contribution lies in the empirical evaluation of meta-model aggregation effectiveness rather than agent architecture innovation.
Text Normalization Trade-offs: Our preprocessing pipeline applied traditional text normalization (lowercase conversion, special character removal) to standardize inputs across 100M+ source reviews, reducing variance from inconsistent formatting artifacts. While modern LLMs are trained on raw text and can process natural formatting directly, normalization ensures consistent experimental conditions across all models and domains. An interesting direction for future work involves investigating whether preserving natural text formatting (capitalization for emphasis like “AMAZING product,” repeated punctuation indicating strong emotion like “terrible!!!”, or acronyms carrying sentiment weight like “OMG”) yields improved sentiment classification accuracy with contemporary LLMs.
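A minimal sketch of such a normalization step is shown below; the exact rules are simplified here to lowercasing, special-character removal, and whitespace collapsing, and are not a verbatim reproduction of our pipeline.

```python
import re

def normalize(text: str) -> str:
    """Traditional normalization: lowercase and strip special characters.
    Note the trade-off discussed above: emphasis cues are flattened."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # remove special characters
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

print(normalize("AMAZING product!!! Worked great... at first"))
# -> "amazing product worked great at first"
```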
Execution Time Measurement and Network Variability: Our execution time measurements capture the complete operational cost of obtaining predictions (network latency, server-side inference, and minimal JSON parsing overhead), providing realistic production-deployment estimates. All measurements were conducted from a single geographic location with consistent network infrastructure, ensuring fair within-study relative comparisons. These measurements serve as relative performance indicators; actual execution times may vary across different deployment environments due to network conditions. Future work comparing execution efficiency across models could explore controlled network conditions or isolated inference-only measurements to further reduce latency variance.
Prompting Methodology: Our study employs zero-shot prompting with structured JSON output for both individual models and the meta-model, prioritizing a fair comparative baseline across heterogeneous model families (GPT, Claude, Gemini, DeepSeek) while isolating the architectural contribution of aggregation from prompt optimization effects. This approach provides production realism and unbiased model comparison. The meta-model prompt elicits reasoning implicitly by requiring structured explanations of model deviations, consensus assessment, and decision justification, sharing conceptual similarities with Chain-of-Thought while embedding reasoning requirements within the task structure. Advanced prompting techniques—including explicit Chain-of-Thought (CoT) reasoning, Few-Shot exemplars, Self-Consistency with multiple sampling paths, and Self-Correction through iterative refinement—represent promising directions for future work. Preliminary literature suggests these techniques could improve individual model accuracy by 5–15 percentage points, which would correspondingly affect meta-model performance. Our zero-shot aggregation baseline establishes a foundation for systematically comparing future prompt-engineered approaches and investigating whether architectural and prompt optimization benefits are additive, multiplicative, or exhibit diminishing returns.
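For concreteness, the sketch below illustrates the general shape of such a zero-shot meta-model request with structured JSON output; the wording, field names, and example predictions are simplified illustrations rather than the exact production prompt.

```python
import json

# Illustrative base-model votes for one review (model names abbreviated).
base_predictions = {"gpt-5": 1, "claude-sonnet-4.5": 2, "gemini-2.5-pro": 1}

review = "Worked great at first, then failed after two weeks. Would not recommend."

meta_prompt = f"""You aggregate star-rating predictions (1-5) from independent models.
Assess consensus, explain any deviating models, and justify a final rating.

Review: "{review}"
Model predictions: {json.dumps(base_predictions)}

Respond only with JSON: {{"reasoning": "<explanation>", "final_rating": <1-5>}}"""

print(meta_prompt)
```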
Future research may consider:
Inclusion of datasets from a variety of sources to improve diversity;
Exploration of more advanced aggregation strategies, such as hierarchical aggregation, iterative refinement, and weighted voting;
Integration of fine-tuned models;
Evaluation of advanced prompting techniques (Few-Shot with rating exemplars to reduce 3-star errors, explicit CoT for improved transparency, Self-Consistency for multiple reasoning paths, multi-turn agent-to-agent communication);
Longitudinal tracking of aggregation effectiveness across model updates;
Investigation of user trust and enhancement of explainability in ensemble-based recommendations;
Investigation of raw text processing without normalization to preserve sentiment cues; and
Evaluation on datasets with verifiable post-training-cutoff timestamps to eliminate data contamination concerns.
5.8. Broader Implications
This empirical evaluation shows that meta-model aggregation can achieve substantial performance improvements over individual models. The findings are consistent with ensemble approaches in the literature; however, our approach operates at a semantic level (i.e., reasoning over explanations rather than aggregating probabilities). The results indicate that model diversity may provide more value than individual excellence, suggesting that relying solely on the best-performing models is not necessarily the best approach. A significant difference was observed between the average zero-shot performance of the individual models (61.25%) and that of the meta-model aggregation (71.40%). This shows that aggregation approaches can achieve strong results without task-specific training, which may benefit organizations with limited machine learning experience and/or no access to large amounts of labeled data. Furthermore, the aggregation behaviors exhibited by the meta-models, including synthesizing the perspectives of multiple models, overriding model consensus, and developing hierarchical trust relationships among models, resemble effective human team dynamics.