Artificial Authority: The Promise and Perils of LLM Judges in Healthcare
Abstract
1. Introduction
1.1. Background
1.2. Research Objectives
- What methodologies are used to implement LLMs as evaluators?
- In which clinical application domains have LLM judges been studied, and with what outcomes?
- How well do LLM-based judges align with human clinicians across different evaluation tasks?
- What factors influence the performance and reliability of LLM judges?
- What limitations, risks, and ethical concerns accompany the use of LLMs as evaluators in healthcare?
2. LLM Judge Evaluation Architectures
2.1. G-EVAL Method
2.2. LLM-Judge Specialists
2.3. LLM Jury Method
3. Emerging Applications of LLM-as-a-Judge in Clinical AI
3.1. Clinical Summaries and Documentation
3.2. Medical Question-Answering
3.3. Clinical Conversation Evaluation
4. Cross-Study Thematic Analysis
4.1. LLM Judges Are Being Used to Operationalize a Wide Range of Constructs, from Factual Correctness to Interaction Quality
4.2. LLM Judges May Better Align with Clinicians on Concrete, Observable Dimensions than on Subjective or Affective Ones
4.3. Judge Performance May Increase When the Evaluation Is Structured
4.4. The Reasoning Process (Advanced Prompt Engineering) Imposed on the LLM May Improve Reliability and Agreement with Human Experts
4.5. LLM Judges Can Often Match Human Evaluations and Exceed Average Clinician Agreement, but Their Performance Is Bounded by Data and Deployment Context
5. Discussion
5.1. Interpretation of the Results
5.2. Ethical Considerations
5.3. Strengths and Limitations
5.4. Future Directions
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| AI | Artificial Intelligence |
| LLM | Large Language Model |
| LLM-as-a-Judge | Large Language Model as a Judge |
| EHR | Electronic Health Record |
| BLEU | Bilingual Evaluation Understudy |
| ROUGE | Recall-Oriented Understudy for Gisting Evaluation |
| BERTScore | Bidirectional Encoder Representations from Transformers Scoring |
| CoT | Chain-of-Thought |
| G-EVAL | Generative Evaluation Architecture using LLMs |
| PDSQI-9 | Provider Documentation Summarization Quality Instrument (9-item) |
| ICC | Intraclass Correlation Coefficient |
| RAG | Retrieval-Augmented Generation |
| BHC | Brief Hospital Course |
| SOAP | Subjective, Objective, Assessment, Plan |
| GENMOD | General Model for Multi-Section Note Generation |
| SPECMOD | Specialized Model for Independent Section Generation |
| κ (Kappa) | Cohen’s Kappa Statistic |
| QA | Question Answering |
| CDC | Centers for Disease Control and Prevention |
| MedQuAD | Medical Question Answering Dataset |
| WAI-O-S | Working Alliance Inventory–Observer Short Form |
| ORS | Outcome Rating Scale |
| PoLL | Panel of LLM Evaluators |
| F1 Score | Harmonic Mean of Precision and Recall |
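Two of the agreement statistics listed above have compact standard definitions, reproduced here for reference (these are the textbook formulas, not notation taken from any particular reviewed study):

```latex
% F1 score: harmonic mean of precision (P) and recall (R).
F_1 = \frac{2PR}{P + R}

% Cohen's kappa: chance-corrected inter-rater agreement, where p_o is the
% observed proportion of agreement and p_e the proportion expected by chance.
\kappa = \frac{p_o - p_e}{1 - p_e}
```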
References
- Peikos, G.; Kasela, P.; Pasi, G. Leveraging large language models for medical information extraction and query generation. In Proceedings of the IEEE/WIC International Conference on Web Intelligence and Intelligent Agent Technology, Bangkok, Thailand, 9–12 December 2024.
- Lin, C.; Kuo, C.-F. Roles and potential of large language models in healthcare: A comprehensive review. Biomed. J. 2025, 48, 100868.
- Aguero, D.; Nelson, S.D. The Potential Application of Large Language Models in Pharmaceutical Supply Chain Management. J. Pediatr. Pharmacol. Ther. 2024, 29, 200–205.
- McCoy, R.T.; Yao, S.; Friedman, D.; Hardy, M.; Griffiths, T.L. Embers of Autoregression: Understanding Large Language Models Through the Problem They Are Trained to Solve. arXiv 2023, arXiv:2309.13638.
- Brodeur, P.G.; Buckley, T.A.; Kanjee, Z.; Goh, E.; Ling, E.B.; Jain, P.; Cabral, S.; Abdulnour, R.-E.; Haimovich, A.D.; Freed, J.A. Superhuman performance of a large language model on the reasoning tasks of a physician. arXiv 2024, arXiv:2412.10849.
- Healy, J.; Kossoff, J.; Lee, M.; Hasford, C. Human-AI Collaboration in Clinical Reasoning: A UK Replication and Interaction Analysis. medRxiv 2025.
- Yang, X.; Li, T.; Wang, H.; Zhang, R.; Ni, Z.; Liu, N.; Zhai, H.; Zhao, J.; Meng, F.; Zhou, Z. Multiple large language models versus experienced physicians in diagnosing challenging cases with gastrointestinal symptoms. npj Digit. Med. 2025, 8, 85.
- Berman, E.; Sundberg Malek, H.; Bitzer, M.; Malek, N.; Eickhoff, C. Retrieval augmented therapy suggestion for molecular tumor boards: Algorithmic development and validation study. J. Med. Internet Res. 2025, 27, e64364.
- Salinas, M.P.; Sepúlveda, J.; Hidalgo, L.; Peirano, D.; Morel, M.; Uribe, P.; Rotemberg, V.; Briones, J.; Mery, D.; Navarrete-Dechent, C. A systematic review and meta-analysis of artificial intelligence versus clinicians for skin cancer diagnosis. npj Digit. Med. 2024, 7, 125.
- Brinker, T.J.; Hekler, A.; Enk, A.H.; Berking, C.; Haferkamp, S.; Hauschild, A.; Weichenthal, M.; Klode, J.; Schadendorf, D.; Holland-Letz, T. Deep neural networks are superior to dermatologists in melanoma image classification. Eur. J. Cancer 2019, 119, 11–17.
- Lim, J.I.; Regillo, C.D.; Sadda, S.R.; Ipp, E.; Bhaskaranand, M.; Ramachandra, C.; Solanki, K.; Dubiner, H.; Levy-Clarke, G.; Pesavento, R. Artificial intelligence detection of diabetic retinopathy: Subgroup comparison of the EyeArt system with ophthalmologists’ dilated examinations. Ophthalmol. Sci. 2023, 3, 100228.
- Printz, C. Artificial intelligence platform for oncology could assist in treatment decisions. Cancer 2017, 123, 905.
- Genovese, A.; Prabha, S.; Borna, S.; Gomez-Cabello, C.A.; Haider, S.A.; Trabilsy, M.; Tao, C.; Aziz, K.T.; Murray, P.M.; Forte, A.J. Artificial intelligence for patient support: Assessing retrieval-augmented generation for answering postoperative rhinoplasty questions. Aesthetic Surg. J. 2025, 45, 735–744.
- Gomez-Cabello, C.A.; Prabha, S.; Haider, S.A.; Genovese, A.; Collaco, B.G.; Wood, N.G.; Bagaria, S.; Forte, A.J. Comparative Evaluation of Advanced Chunking for Retrieval-Augmented Generation in Large Language Models for Clinical Decision Support. Bioengineering 2025, 12, 1194.
- Genovese, A.; Borna, S.; Gomez-Cabello, C.A.; Haider, S.A.; Prabha, S.; Trabilsy, M.; Forte, A.J. The Current Landscape of Artificial Intelligence in Plastic Surgery Education and Training: A Systematic Review. J. Surg. Educ. 2025, 82, 103519.
- McGowan, A.; Gui, Y.; Dobbs, M.; Shuster, S.; Cotter, M.; Selloni, A.; Goodman, M.; Srivastava, A.; Cecchi, G.A.; Corcoran, C.M. ChatGPT and Bard exhibit spontaneous citation fabrication during psychiatry literature search. Psychiatry Res. 2023, 326, 115334.
- Ajwani, R.; Javaji, S.R.; Rudzicz, F.; Zhu, Z. LLM-generated black-box explanations can be adversarially helpful. arXiv 2024, arXiv:2405.06800.
- Sariyar, M. Relevance of Grounding AI for Health Care. Stud. Health Technol. Inform. 2025, 328, 146–150.
- Kilov, D.; Hendy, C.; Guyot, S.Y.; Snoswell, A.J.; Lazar, S. Discerning What Matters: A Multi-Dimensional Assessment of Moral Competence in LLMs. arXiv 2025, arXiv:2506.13082.
- Zhao, S.; Zhang, Y.; Xiao, L.; Wu, X.; Jia, Y.; Guo, Z.; Wu, X.; Nguyen, C.-D.; Zhang, G.; Luu, A.T. Affective-ROPTester: Capability and Bias Analysis of LLMs in Predicting Retinopathy of Prematurity. arXiv 2025, arXiv:2507.05816.
- Krolik, J.; Mahal, H.; Ahmad, F.; Trivedi, G.; Saket, B. Towards leveraging large language models for automated medical Q&A evaluation. arXiv 2024, arXiv:2409.01941.
- Bedi, S.; Liu, Y.; Orr-Ewing, L.; Dash, D.; Koyejo, S.; Callahan, A.; Fries, J.A.; Wornow, M.; Swaminathan, A.; Lehmann, L.S. Testing and evaluation of health care applications of large language models: A systematic review. JAMA 2025, 333, 319–328.
- Croxford, E.; Gao, Y.; First, E.; Pellegrino, N.; Schnier, M.; Caskey, J.; Oguss, M.; Wills, G.; Chen, G.; Dligach, D. Automating evaluation of AI text generation in healthcare with a large language model (LLM)-as-a-judge. medRxiv 2025.
- Pan, Q.; Ashktorab, Z.; Desmond, M.; Cooper, M.S.; Johnson, J.; Nair, R.; Daly, E.; Geyer, W. Human-Centered Design Recommendations for LLM-as-a-judge. arXiv 2024, arXiv:2407.03479.
- Genovese, A. LLM-as-a-Judge-Workflow. 2025. Available online: https://BioRender.com/ddo2y59 (accessed on 16 December 2025).
- Liu, Y.; Iter, D.; Xu, Y.; Wang, S.; Xu, R.; Zhu, C. G-Eval: NLG evaluation using GPT-4 with better human alignment. arXiv 2023, arXiv:2303.16634.
- Kim, S.; Suk, J.; Longpre, S.; Lin, B.Y.; Shin, J.; Welleck, S.; Neubig, G.; Lee, M.; Lee, K.; Seo, M. Prometheus 2: An open source language model specialized in evaluating other language models. arXiv 2024, arXiv:2405.01535.
- Verga, P.; Hofstätter, S.; Althammer, S.; Su, Y.; Piktus, A.; Arkhangorodsky, A.; Xu, M.; White, N.; Lewis, P. Replacing judges with juries: Evaluating LLM generations with a panel of diverse models. arXiv 2024, arXiv:2404.18796.
- Croxford, E.; Gao, Y.; First, E.; Pellegrino, N.; Schnier, M.; Caskey, J.; Oguss, M.; Wills, G.; Chen, G.; Dligach, D. Evaluating clinical AI summaries with large language models as judges. npj Digit. Med. 2025, 8, 640.
- Li, Y. A Practical Survey on Zero-Shot Prompt Design for In-Context Learning. In Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing, Varna, Bulgaria, 4–6 September 2023.
- Chung, P.; Swaminathan, A.; Goodell, A.J.; Kim, Y.; Reincke, S.M.; Han, L.; Deverett, B.; Sadeghi, M.A.; Ariss, A.-B.; Ghanem, M. VeriFact: Verifying facts in LLM-generated clinical text with electronic health records. arXiv 2025, arXiv:2501.16672.
- Brake, N.; Schaaf, T. Comparing Two Model Designs for Clinical Note Generation; Is an LLM a Useful Evaluator of Consistency? arXiv 2024, arXiv:2404.06503.
- Diekmann, Y.; Fensore, C.; Carrillo-Larco, R.; Rosales, E.C.; Shiromani, S.; Pai, R.; Shah, M.; Ho, J. LLMs as Medical Safety Judges: Evaluating Alignment with Human Annotation in Patient-Facing QA. In Proceedings of the 24th Workshop on Biomedical Language Processing, Vienna, Austria, 1 August 2025.
- Li, A.; Lu, Y.; Song, N.; Zhang, S.; Ma, L.; Lan, Z. Understanding the therapeutic relationship between counselors and clients in online text-based counseling using LLMs. arXiv 2024, arXiv:2402.11958.
- Zheng, W.; Turner, L.; Kropczynski, J.; Ozer, M.; Nguyen, T.; Halse, S. LLM-as-a-Fuzzy-Judge: Fine-Tuning Large Language Models as a Clinical Evaluation Judge with Fuzzy Logic. arXiv 2025, arXiv:2506.11221.
- Ray, P.P. Toward Transparent AI-Enabled Patient Selection in Cosmetic Surgery by Integrating Reasoning and Medical LLMs. Aesthetic Plast. Surg. 2025, 49, 5641–5642.
- Ji, K.; Wu, Z.; Han, J.; Zhai, G.; Liu, J. Evaluating ChatGPT-4’s performance on oral and maxillofacial queries: Chain of Thought and standard method. Front. Oral Health 2025, 6, 1541976.
- Genovese, A. What Improves, Scales, and Limits LLM-Judges in Healthcare. 2026. Available online: https://BioRender.com/74iqinv (accessed on 1 January 2026).
- Gebreegziabher, S.A.; Chiang, C.; Wang, Z.; Ashktorab, Z.; Brachman, M.; Geyer, W.; Li, T.J.-J.; Gómez-Zará, D. MetricMate: An Interactive Tool for Generating Evaluation Criteria for LLM-as-a-Judge Workflow. In Proceedings of the 4th Annual Symposium on Human-Computer Interaction for Work, Amsterdam, The Netherlands, 23–25 June 2025; Association for Computing Machinery: New York, NY, USA, 2025.
- Krumdick, M.; Lovering, C.; Reddy, V.; Ebner, S.; Tanner, C. No Free Labels: Limitations of LLM-as-a-Judge Without Human Grounding. arXiv 2025, arXiv:2503.05061.
- Bedemariam, R.; Perez, N.; Bhaduri, S.; Kapoor, S.; Gil, A.; Conjar, E.; Itoku, I.; Theil, D.; Chadha, A.; Nayyar, N. Potential and Perils of Large Language Models as Judges of Unstructured Textual Data. arXiv 2025, arXiv:2501.08167.
- Xu, A.; Bansal, S.; Ming, Y.; Yavuz, S.; Joty, S. Does Context Matter? ContextualJudgeBench for Evaluating LLM-based Judges in Contextual Settings. arXiv 2025, arXiv:2503.15620.
- Kotek, H.; Dockum, R.; Sun, D. Gender bias and stereotypes in large language models. In Proceedings of the ACM Collective Intelligence Conference, Delft, The Netherlands, 6–9 November 2023.
- Bautista, Y.J.P.; Theran, C.; Aló, R.; Lima, V. Health disparities through generative AI models: A comparison study using a domain-specific large language model. In Proceedings of the Future Technologies Conference, San Francisco, CA, USA, 2–3 November 2023.


| Dimension | G-EVAL | LLM-Judge Specialists (Prometheus 2) | LLM Jury Method (PoLL) |
|---|---|---|---|
| Judging Architecture | Single LLM evaluator using form-filling paradigm, auto-generated chain-of-thought, and probability-weighted scoring | Single open-source LLM evaluator trained via weight merging of direct-assessment and pairwise-ranking models | Small heterogeneous panel of LLM evaluators (instantiated with three models from different families) |
| Target of Evaluation | NLG output quality across summarization and dialogue generation | General language generation quality using user-defined evaluation criteria | Question answering and conversational model outputs |
| Strengths | Structured evaluation, improved human alignment, fine-grained continuous scoring | Unified evaluation across formats, open-source and reproducible | Reduced evaluator bias and variance, improved robustness, substantially lower cost |
| Limitations | Prompt sensitivity, model-size dependence, bias favoring LLM-generated text | Limited scoring formats, indirect validation, unclear generalization beyond tested domains | Task-specific validation, unresolved panel selection strategy, untested performance on reasoning-heavy or clinical domains |
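As a concrete illustration of the first and third columns, the minimal Python sketch below implements the two scoring mechanics the table names: G-EVAL-style probability-weighted scoring and PoLL-style jury pooling. The function names, the 1–5 rating scale, and the stand-in judge callables are illustrative assumptions, not code from the cited papers; in practice each jury member would wrap a call to a model from a different family.

```python
import statistics
from typing import Callable

# G-EVAL-style probability-weighted score: instead of taking only the single
# most likely rating token, weight each point on the rating scale by the
# probability the judge model assigns to it, yielding a continuous score.
def probability_weighted_score(token_probs: dict[int, float]) -> float:
    """token_probs maps each rating (e.g., 1-5) to the judge's probability for it."""
    total = sum(token_probs.values())  # normalize in case the mass doesn't sum to 1
    return sum(rating * p for rating, p in token_probs.items()) / total

# PoLL-style jury: collect ratings from several heterogeneous judges and pool
# them (mean pooling shown here; voting is another common aggregation).
def jury_score(output: str, judges: list[Callable[[str], int]]) -> float:
    return statistics.mean(judge(output) for judge in judges)

if __name__ == "__main__":
    # Hypothetical judge distribution: most mass on 4, some on 3 and 5.
    probs = {1: 0.01, 2: 0.04, 3: 0.20, 4: 0.55, 5: 0.20}
    print(f"Probability-weighted score: {probability_weighted_score(probs):.2f}")  # ~3.89

    # Stand-in judges; real ones would query three different model families.
    panel = [lambda o: 4, lambda o: 3, lambda o: 4]
    print(f"Jury score: {jury_score('candidate summary', panel):.2f}")  # 3.67
```

Mean pooling is only one simple aggregation choice; voting schemes are another, and which pooling function behaves best is task-dependent.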
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.

