Diagnosing Bias and Instability in LLM Evaluation: A Scalable Pairwise Meta-Evaluator
Abstract
1. Introduction
1.1. Background and Objectives of This Study
1.1.1. Background
1.1.2. Objectives of This Study
- Design a reproducible and extensible evaluation pipeline that supports pairwise comparisons across multiple LLMs;
- Investigate verdict stability and consistency under controlled variations in prompt structure and input surface form;
- Diagnose position bias by comparing outcomes under reversed response order;
- Measure inter-evaluator agreement across model-agnostic scoring scenarios;
- Store all evaluation traces in a graph database to support structured querying, semantic linkage, and longitudinal analysis.
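As a rough illustration of how such a pipeline can be organized (the function and tuple layout below are assumptions for exposition, not the authors' implementation), every unordered model pair is scheduled twice per question, once in each presentation order, so that order-dependent verdict flips can be detected downstream:

```python
from itertools import combinations

def schedule_comparisons(question_ids, models):
    """Enumerate every unordered model pair per question, in both
    presentation orders (AB and its mirror BA)."""
    jobs = []
    for qid in question_ids:
        for a, b in combinations(models, 2):
            jobs.append((qid, a, b, "AB"))  # a shown first
            jobs.append((qid, b, a, "BA"))  # mirrored order
    return jobs

jobs = schedule_comparisons(
    [f"Q{i}" for i in range(1, 11)],
    ["gemma", "zephyr", "mistral", "dolphin-mistral", "deepseek-r1"],
)
# 10 questions x 10 model pairs x 2 orders = 200 comparison jobs
```

Judge models, prompt variants, and perturbations multiply this base schedule further, which is how comparison counts in the thousands arise.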
1.2. Related Work
- (i) Pairwise comparison of model outputs under controlled perturbations, including response order reversal, lexical noise, and prompt injection;
- (ii) Multi-judge evaluation, using three independent LLM-based evaluators (LLaMA 3:8B, OpenHermes, Nous-Hermes 2);
- (iii) Graph-based storage of all evaluations in Neo4j, supporting traceability, semantic linkage, and longitudinal analysis.
1.3. Research Gap and Novelty
- A pairwise evaluation pipeline with strict control over response order (mirrored comparisons), prompt type (standard/injected), and surface-level perturbations (lexical noise, paraphrasing), supporting robustness diagnostics across 3600+ comparisons;
- The use of three independent LLM-based evaluators (LLaMA 3:8B, OpenHermes, Nous-Hermes 2), yielding 100% inter-evaluator agreement across all conditions;
- The systematic identification of position bias, with 48.4% of verdicts reversing when the order of responses was flipped;
- The application of bias-injected prompts to assess evaluator susceptibility to framing effects and test prompt robustness;
- Graph-based storage in Neo4j of all evaluation traces (questions, answers, verdicts, metadata), enabling semantic querying and longitudinal analysis across 10 prompts and five candidate models.
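To make the graph-based storage concrete, the sketch below builds a parameterized Cypher statement linking a verdict to its question, the two compared models, and the judge. The node labels, relationship types, and property names are illustrative assumptions, not the schema used in the study; the resulting `query`/`params` pair is what one would hand to the Neo4j driver's `session.run`.

```python
def verdict_to_cypher(question_id, model_a, model_b, judge, order, verdict):
    """Build a parameterized Cypher MERGE statement for one verdict
    (illustrative schema; labels and properties are assumptions)."""
    query = (
        "MERGE (q:Question {id: $qid}) "
        "MERGE (a:Model {name: $a}) "
        "MERGE (b:Model {name: $b}) "
        "CREATE (v:Verdict {judge: $judge, order: $order, winner: $winner}) "
        "CREATE (v)-[:EVALUATES]->(q) "
        "CREATE (v)-[:FIRST]->(a) "
        "CREATE (v)-[:SECOND]->(b)"
    )
    params = {"qid": question_id, "a": model_a, "b": model_b,
              "judge": judge, "order": order, "winner": verdict}
    return query, params

query, params = verdict_to_cypher("Q1", "gemma:7b-instruct", "zephyr:7b-beta",
                                  "llama3:8b", "AB", "A")
```

Keeping verdicts as first-class nodes, rather than edge properties, is one way to support the longitudinal queries mentioned above (e.g., all verdicts by one judge across perturbation variants of the same question).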
2. Materials and Methods
2.1. System Overview
2.2. Evaluation Pipeline
2.3. Perturbation and Prompt Control
2.4. Evaluator Models and Decision Format
2.5. Graph-Based Storage and Traceability
2.6. Prompt Set and Question Variants
3. Results
3.1. Model Performance and Win Rates
3.2. Evaluator Agreement and Judgment Consistency
- Consistent: Both AB and BA yielded the same verdict;
- Position Bias: Verdicts reversed depending on response order;
- Inconclusive: At least one comparison returned a null or undefined preference.
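The three categories above can be operationalized as a small classifier over mirrored verdict pairs. This is a sketch under one assumption: verdicts name the winning model (not its position), so a stable judgment yields the same winner in both orders.

```python
def classify_pair(verdict_ab, verdict_ba):
    """Classify a mirrored (AB vs. BA) comparison.

    verdict_ab: winning model of the A-vs-B run, or None if undefined.
    verdict_ba: winning model of the mirrored B-vs-A run, or None.
    """
    if verdict_ab is None or verdict_ba is None:
        return "inconclusive"     # at least one null/undefined preference
    if verdict_ab == verdict_ba:
        return "consistent"       # same winner regardless of order
    return "position_bias"        # winner flips with response order

assert classify_pair("gemma", "gemma") == "consistent"
assert classify_pair("gemma", "zephyr") == "position_bias"
assert classify_pair("gemma", None) == "inconclusive"
```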
3.3. Positional Bias and Perturbation Sensitivity
- Punctuation adjustments (e.g., inserting or removing commas, colons);
- Acronym formatting (e.g., “AI” → “A.I.”);
- Contraction expansions (“don’t” → “do not”);
- Mild synonym substitutions (“artificial intelligence” → “machine intelligence”);
- Hedging intensifiers (“should” → “might be advisable”).
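Perturbations of this kind can be expressed as a table of meaning-preserving string rewrites. The substitution sets below are illustrative examples drawn from the list above, not the exact rules used in the study:

```python
# Illustrative surface perturbations (assumed, not the study's exact rules).
PERTURBATIONS = {
    "acronym": lambda t: t.replace("AI", "A.I."),
    "contraction": lambda t: t.replace("don't", "do not"),
    "synonym": lambda t: t.replace("artificial intelligence",
                                   "machine intelligence"),
    "hedge": lambda t: t.replace("should", "might be advisable to"),
}

def perturb(text, kind):
    """Apply one meaning-preserving surface edit to an answer."""
    return PERTURBATIONS[kind](text)

print(perturb("AI should be regulated.", "acronym"))  # -> A.I. should be regulated.
```

Each perturbed variant is then re-scored alongside the original, so verdict flips can be attributed to surface form rather than content.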
3.4. Semantic Analysis and Ranking Stability
4. Discussion
4.1. Interpretation of Results
4.2. Methodological Implications
4.3. Practical Recommendations
4.4. Limitations
4.5. Future Work
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
| ID | Question |
|----|----------|
| Q1 | Should AI be allowed to decide life-or-death outcomes? |
| Q2 | Is it ever ethical to lie for a greater good? |
| Q3 | Can machines possess moral agency? |
| Q4 | Should social media algorithms be made transparent by law? |
| Q5 | Are autonomous weapons a threat to global stability? |
| Q6 | Should democracy adapt to real-time AI voting systems? |
| Q7 | Is freedom of speech compatible with misinformation control? |
| Q8 | Can a language model become self-aware? |
| Q9 | What distinguishes human creativity from machine creativity? |
| Q10 | Is it ethical to train AI to manipulate people for social harmony? |
| Model | Total Wins | Matches | Win Rate (%) |
|-------|------------|---------|--------------|
| gemma:7b-instruct | 958 | 1440 | 66.53 |
| zephyr:7b-beta | 830 | 1440 | 57.64 |
| mistral:7b-instruct | 766 | 1440 | 53.19 |
| dolphin-mistral:latest | 643 | 1440 | 44.65 |
| deepseek-r1:8b | 403 | 1440 | 27.99 |
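The win-rate column follows directly from wins divided by matches; a quick consistency check over the table:

```python
# Total wins per model, from the results table above.
wins = {
    "gemma:7b-instruct": 958,
    "zephyr:7b-beta": 830,
    "mistral:7b-instruct": 766,
    "dolphin-mistral:latest": 643,
    "deepseek-r1:8b": 403,
}
MATCHES = 1440  # comparisons per model

win_rates = {m: round(100 * w / MATCHES, 2) for m, w in wins.items()}
print(win_rates["gemma:7b-instruct"])  # -> 66.53
```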
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Anghel, C.; Anghel, A.A.; Pecheanu, E.; Cocu, A.; Istrate, A.; Andrei, C.A. Diagnosing Bias and Instability in LLM Evaluation: A Scalable Pairwise Meta-Evaluator. Information 2025, 16, 652. https://doi.org/10.3390/info16080652