Human Evaluation of Large Language Models: A Review and Protocol Selection Framework
Abstract
1. Introduction
2. LLM Evaluation: A Three-Tiered Taxonomy
2.1. Tier 1: Automated Metrics
2.2. Tier 2: LLM-as-Judge Systems
2.3. Tier 3: Direct Human Evaluation
3. Human Evaluation: Who, What, & How
3.1. Who Evaluates
3.2. What Is Evaluated
3.3. How It Is Evaluated
- Rubric-based protocols: Evaluators score an output on multiple predefined dimensions, each with explicit criteria or anchors (see examples in Supplementary Materials File S1). This method can reduce ambiguity by distributing judgment across multiple explicitly defined dimensions [28,33,35].
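As a minimal sketch of how a rubric-based protocol might be represented, the snippet below stores each dimension with explicit score anchors and reports per-dimension means across evaluators. The dimension names, anchor wording, and 1-to-5 scale are hypothetical illustrations, not the rubrics in Supplementary File S1.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Dimension:
    name: str
    anchors: dict[int, str]  # score -> anchor text shown to the evaluator


# Hypothetical rubric: names, anchors, and scale are assumptions for illustration.
RUBRIC = [
    Dimension("accuracy", {1: "contains factual errors", 3: "minor imprecision", 5: "fully correct"}),
    Dimension("clarity", {1: "hard to follow", 3: "understandable with effort", 5: "clear and well organized"}),
    Dimension("completeness", {1: "omits key points", 3: "covers most points", 5: "covers all key points"}),
]


def validate_rating(rating: dict[str, int], rubric: list[Dimension] = RUBRIC) -> None:
    """Raise ValueError if a dimension is missing or a score is off the scale."""
    for dim in rubric:
        if dim.name not in rating:
            raise ValueError(f"missing score for dimension '{dim.name}'")
        low, high = min(dim.anchors), max(dim.anchors)
        if not low <= rating[dim.name] <= high:
            raise ValueError(f"score for '{dim.name}' must be in [{low}, {high}]")


def per_dimension_means(ratings: list[dict[str, int]],
                        rubric: list[Dimension] = RUBRIC) -> dict[str, float]:
    """Average each dimension separately instead of collapsing to one score."""
    for rating in ratings:
        validate_rating(rating, rubric)
    return {dim.name: mean(r[dim.name] for r in ratings) for dim in rubric}


ratings = [
    {"accuracy": 5, "clarity": 3, "completeness": 4},
    {"accuracy": 4, "clarity": 3, "completeness": 2},
]
summary = per_dimension_means(ratings)
```

Keeping the dimensions separate in the summary preserves the information that the rubric was designed to elicit; a single overall score can be derived later if a decision rule requires it.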
4. Reliability, Validity, and Human Judgment Errors
4.1. Reliability
4.2. Validity
4.3. Bias and Cognitive Constraints
5. Common Evaluator Failure Modes
6. Domain-Specific Evaluation
7. The STEP-V Framework
7.1. The STEP-V Dimensions
- Low-stakes applications include settings where evaluation errors have minimal real-world consequences, such as ranking outputs for a creative writing assistant, selecting between alternative marketing slogans, or tuning a chatbot’s tone for engagement. In these contexts, an incorrect evaluation may degrade user experience or preference alignment but is unlikely to cause harm, so automated metrics or lightweight human preference judgments are acceptable.
- Medium-stakes applications include tasks where evaluation errors can lead to meaningful but non-critical consequences, such as summarizing internal business documents, generating educational explanations for general audiences, or assisting with software development (e.g., code suggestions that are reviewed before deployment). Here, incorrect evaluations may reduce efficiency, introduce errors, or mislead users, but are typically mitigated by downstream human oversight. Hybrid evaluation strategies combining automated metrics, LLM-as-judge systems, and periodic human validation may be appropriate for these applications.
- High-stakes applications include domains where evaluation errors could result in significant harm, legal exposure, or safety risks, such as medical advice, clinical decision support, legal analysis, financial recommendations, or military planning. In these settings, an incorrect evaluation may falsely certify unsafe or incorrect outputs as acceptable; evaluation designs should therefore prioritize domain experts, multi-dimensional rubrics, conservative decision thresholds, and explicit handling of uncertainty and disagreement, with automated methods used only as support.
- Closed-ended tasks include outputs with a clearly defined correct answer or externally verifiable criterion, such as factual question answering (e.g., "What is the capital of France?"), code generation that must pass unit tests, extraction of structured information from text, or classification tasks with known labels. In these cases, correctness can be evaluated against a reference standard or objective rule, and disagreement is more likely to reflect error than interpretation.
- Open-ended tasks include outputs where multiple responses may be acceptable depending on context, goals, or audience, such as drafting an essay, providing medical counseling language, generating legal reasoning, summarizing complex documents, or offering strategic recommendations. Here, evaluation depends on subjective or partially observable constructs (e.g., usefulness, clarity, appropriateness), and disagreement may reflect legitimate differences in judgment rather than error.
- Low evaluator availability typically arises in domains requiring specialized expertise, such as board-certified physicians evaluating clinical recommendations, legal experts assessing statutory interpretation, or experienced military leaders assessing operational plans. In these cases, evaluators are scarce, expensive, and time-constrained, which places practical limits on sample size and necessitates careful prioritization of what is evaluated and how.
- Moderate evaluator availability may involve access to trained annotators or smaller, curated participant pools who can apply structured rubrics with some degree of consistency. These evaluators can support more nuanced judgments than crowdworkers but are still limited in scalability and domain expertise.
- High evaluator availability may include access to large pools of crowdworkers or end users, enabling scalable evaluation of fluency, relevance, or preference through pairwise comparisons or rating tasks. In some settings, LLM-as-judge systems may also be readily available, providing low-cost, high-throughput evaluation for structured or well-defined criteria, although they still require calibration.
- Early development focuses on rapid iteration and model improvement, where the goal is to identify obvious failure modes and compare alternative model versions. Evaluation at this stage may rely on automated metrics, small-scale human preference tests, or LLM-as-judge systems to provide quick, directional feedback.
- Pre-deployment validation involves a more rigorous assessment before a system is released into real-world use. Here, the goal is to establish that the model meets predefined performance, safety, and reliability thresholds for its intended application. This often requires structured human evaluation, domain-specific criteria, reliability reporting, and testing across diverse scenarios to ensure the system performs adequately under expected conditions.
- Ongoing production monitoring occurs after deployment and focuses on detecting performance drift, emerging failure modes, or changes in user interaction patterns over time. Evaluation at this stage is typically high-volume and continuous, relying on automated monitoring, anomaly detection, and selective human review of flagged or high-risk cases to maintain system quality and safety.
- Low-volume settings may involve evaluating a small number of outputs, such as benchmarking a new model on a curated test set, conducting expert review of clinical decision-support responses, or auditing a limited set of high-stakes cases. In these scenarios, evaluation can be intensive, allowing for detailed rubrics, multiple expert raters, and adjudication processes.
- Moderate-volume settings include situations where outputs are generated regularly but not at a massive scale, such as evaluating weekly batches of generated reports, internal tool outputs, or iterative model updates during development. Here, a combination of sampling strategies, structured human evaluation, and partial automation is often used to balance rigor with efficiency.
- High-volume settings involve continuous or large-scale output generation, such as monitoring millions of chatbot interactions, customer support responses, or real-time recommendation systems. In these contexts, it is infeasible to evaluate every output directly; instead, evaluation relies on automated metrics, statistical sampling, anomaly detection, and escalation pipelines that route a small subset of outputs (e.g., uncertain, high-risk, or outlier cases) to human review.
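The five dimensions described above can be recorded as a small profile object so that the reasoning behind a protocol choice is explicit and reviewable. The sketch below is an illustrative paraphrase of the bullet descriptions in this section, not the authors' decision procedure; the class names, field names, and the `suggest_components` helper are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    LOW = 1
    MEDIUM = 2  # the text says "medium" for stakes and "moderate" elsewhere
    HIGH = 3


@dataclass(frozen=True)
class StepVProfile:
    stakes: Level
    open_ended: bool               # task type: False means closed-ended
    evaluator_availability: Level
    purpose: str                   # "development", "pre_deployment", or "monitoring"
    volume: Level


def suggest_components(p: StepVProfile) -> list[str]:
    """Return candidate protocol components (hypothetical mapping of the bullets)."""
    parts: list[str] = []
    if p.stakes is Level.HIGH:
        parts += ["domain experts", "multi-dimensional rubric",
                  "conservative decision thresholds",
                  "automated methods as support only"]
    elif p.stakes is Level.MEDIUM:
        parts += ["automated metrics", "LLM-as-judge",
                  "periodic human validation"]
    else:
        parts += ["automated metrics or lightweight preference judgments"]

    if p.open_ended:
        parts.append("report disagreement rather than treating it as error")
    else:
        parts.append("score against a reference standard")

    if p.volume is Level.HIGH:
        parts.append("statistical sampling with escalation of flagged cases to human review")

    if p.stakes is Level.HIGH and p.evaluator_availability is Level.LOW:
        parts.append("prioritize which outputs scarce experts review")
    return parts


clinical = StepVProfile(Level.HIGH, True, Level.LOW, "pre_deployment", Level.LOW)
plan = suggest_components(clinical)
```

A rule table like this is only a starting point: the dimensions interact, and the failure conditions in Section 7.4 describe cases where no such mapping is sufficient.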
7.2. Design Logic
7.3. Worked Examples
7.4. When STEP-V Can Fail
- When the construct itself is poorly specified.
- When available evaluators share the same systematic biases.
- When high-stakes tasks lack access to minimally sufficient expert oversight.
- When disagreement is inappropriately collapsed into a single consensus score.
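The last failure condition can be made concrete with a small numerical illustration. In the sketch below, two sets of ratings share the same mean, but one reflects full agreement and the other an even split; reporting the score distribution and its entropy alongside the mean keeps that difference visible. The function name and the choice of Shannon entropy are assumptions made for illustration, not a method prescribed by the framework.

```python
from collections import Counter
from math import log2
from statistics import mean


def summarize(scores: list[int]) -> dict:
    """Report the mean together with the rating distribution and its entropy."""
    counts = Counter(scores)
    n = len(scores)
    # Shannon entropy in bits: 0.0 when all evaluators give the same score.
    entropy = sum((c / n) * log2(n / c) for c in counts.values())
    return {
        "mean": mean(scores),
        "distribution": dict(sorted(counts.items())),
        "entropy_bits": entropy,
    }


consensus = summarize([3, 3, 3, 3])  # every evaluator agrees
split = summarize([1, 1, 5, 5])      # evaluators are evenly divided
```

Both summaries have a mean of 3, yet the second has one bit of entropy and a bimodal distribution, which a single consensus score would hide.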
8. Methodological Guidance
9. Limitations and Future Directions
10. Conclusions
Supplementary Materials
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| AI | Artificial Intelligence |
| NLP | Natural Language Processing |
| LLM | Large Language Model |
| IRR | Inter-Rater Reliability |
| BLEU | Bilingual Evaluation Understudy |
| ROUGE | Recall-Oriented Understudy for Gisting Evaluation |
| METEOR | Metric for Evaluation of Translation with Explicit Ordering |
| BERT | Bidirectional Encoder Representations from Transformers |
| STEP-V | Stakes, Task-type, Evaluator availability, Purpose, Volume |
| CoT | Chain-of-Thought |
| RLHF | Reinforcement Learning from Human Feedback |
| SxS | Side-by-Side (comparison) |
| ELO | Elo Rating System |
References
- Annepaka, Y.; Pakray, P. Large Language Models: A Survey of Their Development, Capabilities, and Applications. Knowl. Inf. Syst. 2025, 67, 2967–3022.
- Minaee, S.; Mikolov, T.; Nikzad, N.; Chenaghlu, M.; Socher, R.; Amatriain, X.; Gao, J. Large Language Models: A Survey. arXiv 2025, arXiv:2402.06196.
- Luo, J.; Wu, B.; Luo, X.; Xiao, Z.; Jin, Y.; Tu, R.-C.; Yin, N.; Wang, Y.; Yuan, J.; Ju, W.; et al. A Survey on Efficient Large Language Model Training: From Data-Centric Perspectives. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vienna, Austria, 27 July–1 August 2025; Che, W., Nabende, J., Shutova, E., Pilehvar, M.T., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 30904–30920.
- Bohnet, B.; Dangovski, R.; Swersky, K.; Moore, S.; Chaudhry, A.; Kenealy, K.; Fiedel, N. A Comparative Analysis of LLM Adaptation: SFT, LoRA, and ICL in Data-Scarce Scenarios. arXiv 2025, arXiv:2511.00130.
- Celikyilmaz, A.; Clark, E.; Gao, J. Evaluation of Text Generation: A Survey. arXiv 2020, arXiv:2006.14799.
- Awasthi, R.; Bhattad, A.; Ramachandran, S.P.; Mishra, S.; Khanna, A.K.; Cywinski, J.B.; Maheshwari, K.; Mahapatra, D.; DiRosa, I.; Cohen, A.; et al. Human Evaluation of Large Language Models in Healthcare: Gaps, Challenges, and the Need for Standardization. npj Health Syst. 2025, 2, 40.
- Gu, J.; Jiang, X.; Shi, Z.; Tan, H.; Zhai, X.; Xu, C.; Li, W.; Shen, Y.; Ma, S.; Liu, H.; et al. A Survey on LLM-as-a-Judge. Innovation 2026.
- Belz, A.; Thomson, C.; Reiter, E.; Mille, S. Non-Repeatable Experiments and Non-Reproducible Results: The Reproducibility Crisis in Human Evaluation in NLP. In Proceedings of the Findings of the Association for Computational Linguistics, Toronto, ON, Canada, 9–14 July 2023; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 3676–3687.
- Schmidtová, P.; Calò, E.; Balloccu, S.; Gkatzia, D.; Huidrom, R.; Lango, M.; Same, F.; Zouhar, V.; Mahamood, S.; Dušek, O. Do My Eyes Deceive Me? A Survey of Human Evaluations of Hallucinations in NLG; Association for Computational Linguistics (ACL): Stroudsburg, PA, USA, 2025; pp. 60–79.
- Elangovan, A.; Liu, L.; Xu, L.; Bodapati, S.B.; Roth, D. ConSiDERS-The-Human Evaluation Framework: Rethinking Human Evaluation for Generative Large Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, 11–16 August 2024; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 1137–1160.
- Serapio-García, G.; Safdari, M.; Crepy, C.; Sun, L.; Fitz, S.; Romero, P.; Matarić, M. A Psychometric Framework for Evaluating and Shaping Personality Traits in Large Language Models. Nat. Mach. Intell. 2025, 7, 1954–1968.
- van der Lee, C.; Gatt, A.; van Miltenburg, E.; Wubben, S.; Krahmer, E. Best Practices for the Human Evaluation of Automatically Generated Text. In Proceedings of the 12th International Conference on Natural Language Generation, Tokyo, Japan, 29 October–1 November 2019; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 355–368.
- Gehman, S.; Gururangan, S.; Sap, M.; Choi, Y.; Smith, N.A. RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2020; Cohn, T., He, Y., Liu, Y., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 3356–3369.
- Liu, Y.; Iter, D.; Xu, Y.; Wang, S.; Xu, R.; Zhu, C. G-Eval: NLG Evaluation Using GPT-4 with Better Human Alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing; Bouamor, H., Pino, J., Bali, K., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 2511–2522.
- Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. Training Language Models to Follow Instructions with Human Feedback. In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates, Inc.: New York, NY, USA, 2022; Volume 35, pp. 27730–27744.
- Zheng, L.; Chiang, W.-L.; Sheng, Y.; Zhuang, S.; Wu, Z.; Zhuang, Y.; Lin, Z.; Li, Z.; Li, D.; Xing, E.; et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Adv. Neural Inf. Process. Syst. 2023, 36, 46595–46623.
- Wang, P.; Li, L.; Chen, L.; Cai, Z.; Zhu, D.; Lin, B.; Cao, Y.; Kong, L.; Liu, Q.; Liu, T.; et al. Large Language Models Are Not Fair Evaluators. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Ku, L.-W., Martins, A., Srikumar, V., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 9440–9450.
- Fu, X.; Liu, W. How Reliable Is Multilingual LLM-as-a-Judge? arXiv 2025, arXiv:2505.12201.
- Bavaresco, A.; Bernardi, R.; Bertolazzi, L.; Elliott, D.; Fernández, R.; Gatt, A.; Ghaleb, E.; Giulianelli, M.; Hanna, M.; Koller, A.; et al. LLMs Instead of Human Judges? A Large Scale Empirical Study Across 20 NLP Evaluation Tasks. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 238–255.
- Calderon, N.; Reichart, R.; Dror, R. The Alternative Annotator Test for LLM-as-a-Judge: How to Statistically Justify Replacing Human Annotators with LLMs. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 16051–16081.
- Waikar, S.S.; Betensky, R.A.; Emerson, S.C.; Bonventre, J.V. Imperfect Gold Standards for Biomarker Evaluation. Clin. Trials 2013, 10, 696–700.
- Xu, Q.; Walder, C.; Xu, C. Humanly Certifying Superhuman Classifiers. arXiv 2021, arXiv:2109.07867.
- Weber-Genzel, L.; Peng, S.; De Marneffe, M.-C.; Plank, B. VariErr NLI: Separating Annotation Error from Human Label Variation. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 2256–2269.
- Croxford, E.; Gao, Y.; Pellegrino, N.; Wong, K.; Wills, G.; First, E.; Liao, F.; Goswami, C.; Patterson, B.; Afshar, M. Current and Future State of Evaluation of Large Language Models for Medical Summarization Tasks. npj Health Syst. 2025, 2, 6.
- Kratzke, N.; Beuter, N.; Drews, A.; Janneck, M. PSYCH—Psychometric Assessment of Large Language Model Characters: An Exploration of the German Language. Analytics 2026, 5, 5.
- Jing, F.; Zhang, Y.; Gao, M.; Zhang, X.; Zhou, H. A Review of Federated Large Language Models for Industry 4.0. Sensors 2026, 26, 1116.
- Larraz, R. The Convergence of Artificial Intelligence and Industrial Transformation. In Prompt Engineering and the Transformation of Petroleum Refining: From Historical Innovation to Net Zero Systems; Larraz, R., Ed.; Springer Nature: Cham, Switzerland, 2026; pp. 1–9. ISBN 978-3-031-99728-0.
- Gao, Y.; Xiong, Y.; Gao, X.; Jia, K.; Pan, J.; Bi, Y.; Dai, Y.; Sun, J.; Wang, M.; Wang, H. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv 2024, arXiv:2312.10997.
- Chen, A.; Stanovsky, G.; Singh, S.; Gardner, M. Evaluating Question Answering Evaluation. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering; Fisch, A., Talmor, A., Jia, R., Seo, M., Choi, E., Chen, D., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 119–124.
- Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.Q.; Artzi, Y. BERTScore: Evaluating Text Generation with BERT. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26 April–1 May 2020.
- Zhou, C.; Liu, P.; Xu, P.; Iyer, S.; Sun, J.; Mao, Y.; Ma, X.; Efrat, A.; Yu, P.; Yu, L.; et al. LIMA: Less Is More for Alignment. Adv. Neural Inf. Process. Syst. 2023, 36, 55006–55021.
- Huang, L.; Yu, W.; Ma, W.; Zhong, W.; Feng, Z.; Wang, H.; Chen, Q.; Peng, W.; Feng, X.; Qin, B.; et al. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions. ACM Trans. Inf. Syst. 2025, 43, 42.
- Howcroft, D.M.; Belz, A.; Clinciu, M.; Gkatzia, D.; Hasan, S.A.; Mahamood, S.; Rieser, V. Twenty Years of Confusion in Human Evaluation: NLG Needs Evaluation Sheets and Standardised Definitions. In Proceedings of the 13th International Conference on Natural Language Generation; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 169–182.
- Singhal, K.; Azizi, S.; Tu, T.; Mahdavi, S.S.; Wei, J.; Chung, H.W.; Scales, N.; Tanwani, A.; Cole-Lewis, H.; Pfohl, S.; et al. Large Language Models Encode Clinical Knowledge. Nature 2023, 620, 172–180.
- Reiter, E. Types of NLG Evaluation: Which Is Right for Me? Available online: https://ehudreiter.com/2017/01/19/types-of-nlg-evaluation/ (accessed on 12 February 2026).
- Tam, T.Y.C.; Sivarajkumar, S.; Kapoor, S.; Stolyar, A.V.; Polanska, K.; McCarthy, K.R.; Osterhoudt, H.; Wu, X.; Visweswaran, S.; Fu, S.; et al. A Framework for Human Evaluation of Large Language Models in Healthcare Derived from Literature Review. npj Digit. Med. 2024, 7, 258.
- Klie, J.-C.; Eckart de Castilho, R.; Gurevych, I. Analyzing Dataset Annotation Quality Management in the Wild. Comput. Linguist. 2024, 50, 817–866.
- Lee, J.; Hockenmaier, J. Evaluating Step-by-Step Reasoning Traces: A Survey. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP 2025); Association for Computational Linguistics: Stroudsburg, PA, USA, 2025.
- Bhambri, S.; Biswas, U.; Kambhampati, S. Do Cognitively Interpretable Reasoning Traces Improve LLM Performance? arXiv 2025, arXiv:2508.16695.
- Brookhart, S.M. Appropriate Criteria: Key to Effective Rubrics. Front. Educ. 2018, 3, 22.
- Jamieson, S. Likert Scales: How to (Ab)Use Them. Med. Educ. 2004, 38, 1217–1218.
- Tourangeau, R.; Rips, L.J.; Rasinski, K. The Psychology of Survey Response; Cambridge University Press: Cambridge, UK, 2000; ISBN 978-0-521-57629-1.
- Fleiss, J.L.; Levin, B.; Paik, M.C. Statistical Methods for Rates and Proportions; John Wiley & Sons: Oxford, UK, 2003; ISBN 0-471-52629-0.
- Fleisig, E.; Blodgett, S.L.; Klein, D.; Talat, Z. The Perspectivist Paradigm Shift: Assumptions and Challenges of Capturing Human Labels. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers); Duh, K., Gomez, H., Bethard, S., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 2279–2292.
- Goto, T.; Sakai, Y.; Watanabe, T. Rethinking Evaluation Metrics for Grammatical Error Correction: Why Use a Different Evaluation Process than Human? In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers); Che, W., Nabende, J., Shutova, E., Pilehvar, M.T., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 1165–1172.
- Mohankumar, A.K.; Khapra, M. Active Evaluation: Efficient NLG Evaluation with Few Pairwise Comparisons. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Muresan, S., Nakov, P., Villavicencio, A., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 8761–8781.
- Artstein, R.; Poesio, M. Inter-Coder Agreement for Computational Linguistics. Comput. Linguist. 2008, 34, 555–596.
- Passonneau, R.J. Measuring Agreement on Set-Valued Items (MASI) for Semantic and Pragmatic Annotation. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC 2006); European Language Resources Association: Luxembourg, 2006; pp. 831–836.
- Kullback, S.; Leibler, R.A. On Information and Sufficiency. Ann. Math. Stat. 1951, 22, 79–86.
- Rubner, Y.; Tomasi, C.; Guibas, L.J. A Metric for Distributions with Applications to Image Databases. In Proceedings of the Sixth International Conference on Computer Vision; IEEE: New York, NY, USA, 1998; pp. 59–66.
- Orlikowski, M.; Pei, J.; Röttger, P.; Cimiano, P.; Jurgens, D.; Hovy, D. Beyond Demographics: Fine-Tuning Large Language Models to Predict Individuals’ Subjective Text Perceptions. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Che, W., Nabende, J., Shutova, E., Pilehvar, M.T., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 2092–2111.
- Xu, Y.; Jurgens, D. Beyond Consensus: Perspectivist Modeling and Evaluation of Annotator Disagreement in NLP. arXiv 2026, arXiv:2601.09065.
- Brunyé, T.T.; Balla, A.; Drew, T.; Elmore, J.G.; Kerr, K.F.; Shucard, H.; Weaver, D.L. From Image to Diagnosis: Characterizing Sources of Error in Histopathologic Interpretation. Mod. Pathol. 2023, 36, 100162.
- Myford, C.M.; Wolfe, E.W. Detecting and Measuring Rater Effects Using Many-Facet Rasch Measurement: Part I. J. Appl. Meas. 2003, 4, 386–422.
- Bradley, R.A.; Terry, M.E. Rank Analysis of Incomplete Block Designs: I. The Method of Paired Comparisons. Biometrika 1952, 39, 324–345.
- Shi, L.; Ma, C.; Liang, W.; Diao, X.; Ma, W.; Vosoughi, S. Judging the Judges: A Systematic Study of Position Bias in LLM-as-a-Judge. arXiv 2025.
- Tversky, A.; Kahneman, D. Judgment under Uncertainty: Heuristics and Biases. Biases in Judgments Reveal Some Heuristics of Thinking under Uncertainty. Science 1974, 185, 1124–1131.
- Krippendorff, K. Content Analysis: An Introduction to Its Methodology; SAGE Publications: Thousand Oaks, CA, USA, 2018; ISBN 978-1-5063-9567-8.
- Hogarth, R.M.; Einhorn, H.J. Order Effects in Belief Updating: The Belief-Adjustment Model. Cogn. Psychol. 1992, 24, 1–55.
- Chaiken, S. Heuristic versus Systematic Information Processing and the Use of Source versus Message Cues in Persuasion. J. Personal. Soc. Psychol. 1980, 39, 752–766.
- Petty, R.E.; Briñol, P. The Elaboration Likelihood and Metacognitive Models of Attitudes: Implications for Prejudice, the Self, and Beyond. In Dual-Process Theories of the Social Mind; The Guilford Press: New York, NY, USA, 2014; pp. 172–187. ISBN 978-1-4625-1439-7.
- Oppenheimer, D.M. Consequences of Erudite Vernacular Utilized Irrespective of Necessity: Problems with Using Long Words Needlessly. Appl. Cogn. Psychol. 2006, 20, 139–156.
- Asch, S.E. Effects of Group Pressure upon the Modification and Distortion of Judgments. In Groups, Leadership and Men; Research in Human Relations; Carnegie Press: Oxford, UK, 1951; pp. 177–190.
- Cialdini, R.B.; Goldstein, N.J. Social Influence: Compliance and Conformity. Annu. Rev. Psychol. 2004, 55, 591–621.
- Griffin, D.; Tversky, A. The Weighing of Evidence and the Determinants of Confidence. Cogn. Psychol. 1992, 24, 411–435.
- Moore, D.A.; Healy, P.J. The Trouble with Overconfidence. Psychol. Rev. 2008, 115, 502–517.
- Jiang, Z.; Araki, J.; Ding, H.; Neubig, G. How Can We Know When Language Models Know? On the Calibration of Language Models for Question Answering. Trans. Assoc. Comput. Linguist. 2021, 9, 962–977.
- Xiong, M.; Hu, Z.; Lu, X.; Li, Y.; Fu, J.; He, J.; Hooi, B. Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs. arXiv 2024, arXiv:2306.13063.
- Boksem, M.A.S.; Meijman, T.F.; Lorist, M.M. Effects of Mental Fatigue on Attention: An ERP Study. Cogn. Brain Res. 2005, 25, 107–116.
- Hopstaken, J.F.; van der Linden, D.; Bakker, A.B.; Kompier, M.A.J.; Leung, Y.K. Shifts in Attention during Mental Fatigue: Evidence from Subjective, Behavioral, Physiological, and Eye-Tracking Data. J. Exp. Psychol. Hum. Percept. Perform. 2016, 42, 878–889.
- Vinay, V. Failure Modes in LLM Systems: A System-Level Taxonomy for Reliable AI Applications. arXiv 2025, arXiv:2511.19933.
- Chi, M.T.H.; Feltovich, P.J.; Glaser, R. Categorization and Representation of Physics Problems by Experts and Novices. Cogn. Sci. 1981, 5, 121–152.
- Nourani, M.; King, J.T.; Ragan, E.D. The Role of Domain Expertise in User Trust and the Impact of First Impressions with Intelligent Systems. In Proceedings of AAAI Conference on Human Computation and Crowdsourcing (HCOMP); Association for the Advancement of Artificial Intelligence (AAAI): Washington, DC, USA, 2020.
- Mallen, A.; Asai, A.; Zhong, V.; Das, R.; Khashabi, D.; Hajishirzi, H. When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Rogers, A., Boyd-Graber, J., Okazaki, N., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 9802–9822.
- Henrich, J.; Heine, S.J.; Norenzayan, A. The Weirdest People in the World? Behav. Brain Sci. 2010, 33, 61–83.
- Markus, H.R.; Kitayama, S. Culture and the Self: Implications for Cognition, Emotion, and Motivation. In College Student Development and Academic Life; Routledge: Oxfordshire, UK, 1998.
- Bender, E.M.; Gebru, T.; McMillan-Major, A.; Shmitchell, S. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency; Association for Computing Machinery: New York, NY, USA, 2021; pp. 610–623.
- Bommasani, R.; Hudson, D.A.; Adeli, E.; Altman, R.; Arora, S.; von Arx, S.; Bernstein, M.S.; Bohg, J.; Bosselut, A.; Brunskill, E.; et al. On the Opportunities and Risks of Foundation Models. arXiv 2022, arXiv:2108.07258.
- DePaulo, B.M.; Lindsay, J.J.; Malone, B.E.; Muhlenbruck, L.; Charlton, K.; Cooper, H. Cues to Deception. Psychol. Bull. 2003, 129, 74–118.
- Pennycook, G.; Rand, D.G. Lazy, Not Biased: Susceptibility to Partisan Fake News Is Better Explained by Lack of Reasoning than by Motivated Reasoning. Cognition 2019, 188, 39–50.
- Sendak, M.P.; D’Arcy, J.; Kashyap, S.; Gao, M.; Nichols, M.; Corey, K.; Ratliff, W.; Balu, S. A Path for Translation of Machine Learning Products into Healthcare Delivery. EMJ Innov. 2020, 10, 19–00172.
- Hernandez-Boussard, T.; Bozkurt, S.; Ioannidis, J.P.A.; Shah, N.H. MINIMAR (MINimum Information for Medical AI Reporting): Developing Reporting Standards for Artificial Intelligence in Health Care. J. Am. Med. Inform. Assoc. 2020, 27, 2011–2015.
- Wiens, J.; Saria, S.; Sendak, M.; Ghassemi, M.; Liu, V.X.; Doshi-Velez, F.; Jung, K.; Heller, K.; Kale, D.; Saeed, M.; et al. Do No Harm: A Roadmap for Responsible Machine Learning for Health Care. Nat. Med. 2019, 25, 1337–1340.
- Rajpurkar, P.; Chen, E.; Banerjee, O.; Topol, E.J. AI in Health and Medicine. Nat. Med. 2022, 28, 31–38.
- Brunyé, T.T.; Mitroff, S.R.; Elmore, J.G. Artificial Intelligence and Computer-Aided Diagnosis in Diagnostic Decisions: 5 Questions for Medical Informatics and Human-Computer Interface Research. J. Am. Med. Inform. Assoc. 2026, 33, 543–550.
- Dahl, M.; Magesh, V.; Suzgun, M.; Ho, D.E. Large Legal Fictions: Profiling Legal Hallucinations in Large Language Models. J. Leg. Anal. 2024, 16, 64.
- Hu, Y.; Liu, H.; Wang, C.; Li, K.; Wu, T.-H.; Li, H.; Xu, X.; Huo, S.; Su, W.; Zheng, N.; et al. Evaluation of Large Language Models in Legal Applications: Challenges, Methods, and Future Directions. arXiv 2026, arXiv:2601.15267.
- Magesh, V.; Surani, F.; Dahl, M.; Suzgun, M.; Manning, C.D.; Ho, D.E. Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools. J. Empir. Leg. Stud. 2025, 22, 216–242.
- Vongpradit, P.; Imsombut, A.; Kongyoung, S.; Damrongrat, C.; Phaholphinyo, S.; Tanawong, T. SafeCultural: A Dataset for Evaluating Safety and Cultural Sensitivity in Large Language Models. In Proceedings of the 2024 8th International Conference on Information Technology (InCIT); IEEE: New York, NY, USA, 2024; pp. 740–745.
- Gligoric, K.; Zrnic, T.; Lee, C.; Candes, E.; Jurafsky, D. Can Unconfident LLM Annotations Be Used for Confident Conclusions? In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers); Chiruzzo, L., Ritter, A., Wang, L., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 3514–3533.
- Meister, N.; Guestrin, C.; Hashimoto, T. Benchmarking Distributional Alignment of Large Language Models. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers); Chiruzzo, L., Ritter, A., Wang, L., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 24–49.
- Bommasani, R.; Liang, P.; Lee, T. Holistic Evaluation of Language Models. Ann. N. Y. Acad. Sci. 2023, 1525, 140–146.
- Siro, C.; Aliannejadi, P.; Aliannejadi, M. Learning to Judge: LLMs Designing and Applying Evaluation Rubrics. In Proceedings of the Findings of the Association for Computational Linguistics: EACL 2026; Demberg, V., Inui, K., Marquez, L., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2026; pp. 6371–6389.
- Hashimoto, T.B.; Zhang, H.; Liang, P. Unifying Human and Statistical Evaluation for Natural Language Generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 1689–1701.
- Davidson, R.R. On Extending the Bradley-Terry Model to Accommodate Ties in Paired Comparison Experiments. J. Am. Stat. Assoc. 1970, 65, 317–328.
- Thurstone, L.L. A Law of Comparative Judgment. In Scaling; Routledge: Oxfordshire, UK, 1974.
- Belz, A.; Thomson, C. HEDS 3.0: The Human Evaluation Data Sheet Version 3.0. In Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM2); Arviv, O., Clinciu, M., Dhole, K., Dror, R., Gehrmann, S., Habba, E., Itzhak, I., Mille, S., Perlitz, Y., Santus, E., et al., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 60–81.

| Evaluator | Validity | Reliability | Cost | Scalability | Best Use Case |
|---|---|---|---|---|---|
| Domain Experts | Very High | Moderate | Very High | Low | High-stakes, domain-specific tasks (e.g., medicine, law, military, finance). |
| Trained Annotators | Moderate | Moderate | Moderate | Moderate | General-purpose quality assessment with rubrics and calibration. |
| Crowd Workers | Low to Moderate | Low to Moderate | Low | Very High | Fluency, relevance, basic comprehension, large low-stakes annotation. |
| End Users | High Ecological | Low | Low | Moderate | Satisfaction, usability, real-world utility in deployment context. |

| Failure Mode | Human Manifestation | LLM-as-Judge Manifestation | Design Implication |
|---|---|---|---|
| Order Effects | Primacy, recency, anchoring [57,59] | Position bias [17,56] | Randomize order and blind presentation. |
| Verbosity Bias | Longer responses perceived as better [60,61] | Preference for length over substance [16,17] | Control length or evaluate brevity separately. |
| Authority Bias | Deference to prestigious sources or tone [62] | Favoring authoritative-sounding outputs [16,17] | Remove source cues where possible. |
| Social Conformity | Herding toward prior judgments [63,64] | Reinforcement of prior signals [17] | Isolate evaluators and prevent score leakage. |
| Overconfidence | Excess certainty in ratings [65,66] | Excess certainty in judgments [67,68] | Require justification or confidence reporting. |
| Fatigue and Overload | Attention decline within & across sessions [69,70] | Degradation under context length, task complexity, repeated inference [71] | Limit session length and audit drift by position. |
| Subjectivity Failure | Low interrater reliability (IRR) on creative or cultural judgments [47,58] | Weak alignment with humans on subjective tasks [7,16] | Use multiple raters, preserve disagreement, and examine noise for signal. |
| Knowledge Gaps | Inability to detect specialized errors [53,72] | Weak domain-specific judgment [73,74] | Match evaluator expertise to task. |
| Cultural Bias | Ratings shaped by demographic or cultural background [75,76] | Bias inherited from training data [77,78] | Diversify evaluator pools and test subgroup effects. |
| Hallucination Trust | Failure to detect confident misinformation [79,80] | Scoring fabricated content as valid [74] | Add external verification on factual dimensions. |
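
Several design implications in the table above (randomizing order, blinding presentation, and auditing for position effects) can be illustrated with a small sketch. The Python below is only a minimal illustration under stated assumptions, not a protocol taken from this article: the function names, the `system_a`/`system_b` labels, and the trial layout are all hypothetical.

```python
import random
from collections import Counter


def build_blinded_trials(pairs, seed=0):
    """Randomize slot order per pair and hide system names (hypothetical helper).

    `pairs` is a list of (item_id, output_from_system_a, output_from_system_b).
    Returns (trials, key): trials carry only neutral labels for raters, while
    key records which system filled each slot and is kept away from raters.
    """
    rng = random.Random(seed)
    trials, key = [], {}
    for item_id, out_a, out_b in pairs:
        slots = [("system_a", out_a), ("system_b", out_b)]
        rng.shuffle(slots)  # independent random slot assignment per item
        trials.append({
            "item_id": item_id,
            "Response 1": slots[0][1],
            "Response 2": slots[1][1],
        })
        key[item_id] = (slots[0][0], slots[1][0])
    rng.shuffle(trials)  # also randomize the order in which items are shown
    return trials, key


def unblind(choices, key):
    """Map rater choices (1 or 2, keyed by item_id) back to system names."""
    return Counter(key[item_id][choice - 1] for item_id, choice in choices.items())


def first_position_rate(choices):
    """Share of judgments that picked the first-shown response.

    With randomized slots, a rate far from 0.5 hints at an order effect.
    """
    return sum(1 for c in choices.values() if c == 1) / len(choices)
```

In use, raters (human or LLM judges) would see only the `trials` entries, and their choices would be passed to `unblind` together with the withheld `key`. Because slot assignment is random, `first_position_rate` offers a rough check on order effects that is independent of which system is actually better; a formal test would be needed before drawing conclusions.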

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Brunyé, T.T. Human Evaluation of Large Language Models: A Review and Protocol Selection Framework. AI 2026, 7, 174. https://doi.org/10.3390/ai7050174

