Comparative Evaluation and Performance of Large Language Models in Clinical Infection Control Scenarios: A Benchmark Study
Abstract
1. Introduction
2. Literature Review
3. Materials and Methods
3.1. Setting
3.2. Study Design
3.3. Study Hypothesis and Objectives
3.4. Statistical Analysis
4. Results
4.1. Pre-Evaluation of LLMs to Clinical Infection Control Scenarios
4.2. Internal Consistency and Reliability Analysis
4.3. Descriptive Statistics
4.4. Analysis of Variance Components
4.5. Pairwise Comparison of LLMs Under Open-Ended Question
4.6. Effect of Structured Prompting
4.7. Influence of Evaluator Characteristics
4.8. Qualitative Analysis of Deficiencies
5. Discussion
6. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
AFB | Acid-fast bacilli |
AI | Artificial intelligence |
CDC | Centers for Disease Control and Prevention |
CI | Confidence interval |
CPE | Carbapenemase-producing Enterobacterales |
ICO | Infection control officer |
ICN(s) | Infection control nurse(s) |
ICT | Infection control team |
IPC | Infection prevention and control |
LLM(s) | Large language model(s) |
MDRA | Multidrug-resistant Acinetobacter species |
MDROs | Multidrug-resistant organisms |
MRSA | Methicillin-resistant Staphylococcus aureus |
NaDCC | Sodium dichloroisocyanurate |
NTM | Nontuberculous mycobacteria |
SD | Standard deviation |
SNO | Senior nursing officer |
TB | Tuberculosis |
WHO | World Health Organization |
References
- Wong, S.-C.; Chau, P.-H.; So, S.Y.-C.; Lam, G.K.-M.; Chan, V.W.-M.; Yuen, L.L.-H.; Au Yeung, C.H.-Y.; Chen, J.H.-K.; Ho, P.-L.; Yuen, K.-Y.; et al. Control of healthcare-associated carbapenem-resistant Acinetobacter baumannii by enhancement of infection control measures. Antibiotics 2022, 11, 1076. [Google Scholar] [CrossRef]
- Wong, S.-C.; Yuen, L.-H.; Li, C.; Kwok, M.-T.; Chen, J.-K.; Cheng, V.-C. Proactive infection control measures to prevent nosocomial transmission of Candida auris in Hong Kong. J. Hosp. Infect. 2023, 134, 166–168. [Google Scholar] [CrossRef]
- Wong, S.-C.; Yip, C.C.-Y.; Chen, J.H.-K.; Yuen, L.L.-H.; AuYeung, C.H.-Y.; Chan, W.-M.; Chu, A.W.-H.; Leung, R.C.-Y.; Ip, J.D.; So, S.Y.-C.; et al. Investigation of air dispersal during a rhinovirus outbreak in a pediatric intensive care unit. Am. J. Infect. Control 2024, 52, 472–478. [Google Scholar] [CrossRef]
- Wong, S.-C.; Chen, J.H.-K.; Chau, P.-H.; Tam, W.-O.; Lam, G.K.-M.; Yuen, L.L.-H.; Chan, W.-M.; Chu, A.W.-H.; Ip, J.D.; Tsoi, H.-W.; et al. Tracking SARS-CoV-2 RNA in the air: Lessons from a COVID-19 outbreak in an infirmary unit. Am. J. Infect. Control 2025, 53, 348–356. [Google Scholar] [CrossRef]
- van der Werff, S.D.; van Rooden, S.M.; Henriksson, A.; Behnke, M.; Aghdassi, S.J.; van Mourik, M.S.; Nauclér, P. The future of healthcare-associated infection surveillance: Automated surveillance and using the potential of artificial intelligence. J. Intern. Med. 2025, 298, 54–77. [Google Scholar] [CrossRef]
- Wiemken, T.L.; Carrico, R.M. Assisting the infection preventionist: Use of artificial intelligence for health care–associated infection surveillance. Am. J. Infect. Control 2024, 52, 625–629. [Google Scholar] [CrossRef]
- Wong, Z.S.; Zhou, J.; Zhang, Q. Artificial intelligence for infectious disease big data analytics. Infect. Dis. Health 2019, 24, 44–48. [Google Scholar] [CrossRef] [PubMed]
- Pinto-de-Sá, R.; Sousa-Pinto, B.; Costa-de-Oliveira, S. Brave new world of artificial intelligence: Its use in antimicrobial stewardship—A systematic review. Antibiotics 2024, 13, 307. [Google Scholar] [CrossRef] [PubMed]
- Lee, A.L.H.; To, C.C.K.; Chan, R.C.K.; Wong, J.S.H.; Lui, G.C.Y.; Cheung, I.Y.Y.; Chow, V.C.Y.; Lai, C.K.C.; Ip, M.; Lai, R.W.M. Predicting antibiotic susceptibility in urinary tract infection with artificial intelligence—Model performance in a multi-centre cohort. JAC-Antimicrob. Resist. 2024, 6, dlae121. [Google Scholar] [CrossRef] [PubMed]
- Bhattacharjee, S.; Bhattacharya, S. Leveraging AI-driven nudge theory to enhance hand hygiene compliance: Paving the path for future infection control. Front. Public Health 2025, 12, 1522045. [Google Scholar] [CrossRef]
- Aldahlawi, S.A.; Almoallim, A.H.; Afifi, I.K. Artificial Intelligence and Hand Hygiene Accuracy: A New Era in Infection Control for Dental Practices. Clin. Exp. Dent. Res. 2025, 11, e70150. [Google Scholar] [CrossRef]
- Fitzpatrick, F.; Doherty, A.; Lacey, G. Using artificial intelligence in infection prevention. Curr. Treat. Options Infect. Dis. 2020, 12, 135–144. [Google Scholar] [CrossRef]
- El Arab, R.A.; Almoosa, Z.; Alkhunaizi, M.; Abuadas, F.H.; Somerville, J. Artificial intelligence in hospital infection prevention: An integrative review. Front. Public Health 2025, 13, 1547450. [Google Scholar] [CrossRef]
- Syrowatka, A.; Kuznetsova, M.; Alsubai, A.; Beckman, A.L.; Bain, P.A.; Craig, K.J.T.; Hu, J.; Jackson, G.P.; Rhee, K.; Bates, D.W. Leveraging artificial intelligence for pandemic preparedness and response: A scoping review to identify key use cases. NPJ Digit. Med. 2021, 4, 96. [Google Scholar] [CrossRef] [PubMed]
- Gawande, M.S.; Zade, N.; Kumar, P.; Gundewar, S.; Weerarathna, I.N.; Verma, P. The role of artificial intelligence in pandemic responses: From epidemiological modeling to vaccine development. Mol. Biomed. 2025, 6, 1. [Google Scholar] [CrossRef] [PubMed]
- Yang, L.; Lu, S.; Zhou, L. The implications of artificial intelligence on infection prevention and control: Current progress and future perspectives. China CDC Wkly. 2024, 6, 901. [Google Scholar] [CrossRef]
- Gaber, F.; Shaik, M.; Allega, F.; Bilecz, A.J.; Busch, F.; Goon, K.; Franke, V.; Akalin, A. Evaluating large language model workflows in clinical decision support for triage and referral and diagnosis. NPJ Digit. Med. 2025, 8, 263. [Google Scholar] [CrossRef] [PubMed]
- Chiu, E.K.-Y.; Sridhar, S.; Sai-Yin Wong, S.; Tam, A.R.; Choi, M.-H.; Lau, A.W.-T.; Wong, W.-C.; Chiu, K.H.-Y.; Ng, Y.-Z.; Yuen, K.-Y.; et al. Generative artificial intelligence models in clinical infectious disease consultations: A cross-sectional analysis among specialists and resident trainees. Healthcare 2025, 13, 744. [Google Scholar] [CrossRef]
- Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S. Gpt-4 technical report. arXiv 2023, arXiv:2303.08774. [Google Scholar] [CrossRef]
- Liu, A.; Feng, B.; Xue, B.; Wang, B.; Wu, B.; Lu, C.; Zhao, C.; Deng, C.; Zhang, C.; Ruan, C. DeepSeek-V3 Technical Report. arXiv 2024, arXiv:2412.19437. [Google Scholar]
- Comanici, G.; Bieber, E.; Schaekermann, M.; Pasupat, I.; Sachdeva, N.; Dhillon, I.; Blistein, M.; Ram, O.; Zhang, D.; Rosen, E. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv 2025, arXiv:2507.06261. [Google Scholar] [CrossRef]
- Montella, E.; Marino, M.R.; Scala, A.; Trunfio, T.A.; Triassi, M.; Improta, G. Machine learning algorithms to predict healthcare associated infections in a neonatal intensive care unit. In International Symposium on Biomedical and Computational Biology; Springer: Cham, Switzerland, 2022; pp. 420–427. [Google Scholar]
- Wang, M.; Li, W.; Wang, H.; Song, P. Development and validation of machine learning-based models for predicting healthcare-associated bacterial/fungal infections among COVID-19 inpatients: A retrospective cohort study. Antimicrob. Resist. Infect. Control 2024, 13, 42. [Google Scholar] [CrossRef]
- Goodman, K.E.; Heil, E.L.; Claeys, K.C.; Banoub, M.; Bork, J.T. Real-world antimicrobial stewardship experience in a large academic medical center: Using statistical and machine learning approaches to identify intervention “hotspots” in an antibiotic audit and feedback program. Open Forum Infect. Dis. 2022, 9, ofac289. [Google Scholar] [CrossRef]
- Barchitta, M.; Maugeri, A.; Favara, G.; Riela, P.M.; Gallo, G.; Mura, I.; Agodi, A.; Network, S.-U. A machine learning approach to predict healthcare-associated infections at intensive care unit admission: Findings from the SPIN-UTI project. J. Hosp. Infect. 2021, 112, 77–86. [Google Scholar] [CrossRef]
- Heidari, A.; Jafari Navimipour, N.; Unal, M.; Toumaj, S. Machine learning applications for COVID-19 outbreak management. Neural Comput. Appl. 2022, 34, 15313–15348. [Google Scholar] [CrossRef] [PubMed]
- Schwartz, I.S.; Link, K.E.; Daneshjou, R.; Cortés-Penfield, N. Black box warning: Large language models and the future of infectious diseases consultation. Clin. Infect. Dis. 2024, 78, 860–866. [Google Scholar] [CrossRef] [PubMed]
- Hospital Authority Annual Report 2022–2023. Available online: https://www.ha.org.hk/haho/ho/cc/HA_Annual_Report_2022-2023_en.pdf (accessed on 6 September 2024).
- Qualtrics. Available online: https://www.qualtrics.com/ (accessed on 10 August 2025).
- Koo, T.K.; Li, M.Y. A guideline of selecting and reporting intraclass correlation coefficients for reliability research. J. Chiropr. Med. 2016, 15, 155–163. [Google Scholar] [CrossRef]
- Rahman, A.; Mahir, S.H.; Tashrif, M.T.A.; Aishi, A.A.; Karim, M.A.; Kundu, D.; Debnath, T.; Moududi, M.A.A.; Eidmum, M. Comparative analysis based on deepseek, chatgpt, and google gemini: Features, techniques, performance, future prospects. arXiv 2025, arXiv:2503.04783. [Google Scholar] [CrossRef]
- Kramer, A.; Assadian, O.; Helfrich, J.; Krüger, C.; Pfenning, I.; Ryll, S.; Perner, A.; Loczenski, B. Questionnaire-based survey on structural quality of hospitals and nursing homes for the elderly, their staffing with infection control personal, and implementation of infection control measures in Germany. GMS Hyg. Infect. Control 2013, 8, Doc11. [Google Scholar] [PubMed]
- Schubert, L.; Strassl, R.; Burgmann, H.; Dvorak, G.; Karer, M.; Kundi, M.; Kussmann, M.; Lagler, H.; Lötsch, F.; Milacek, C.; et al. A Longitudinal seroprevalence study evaluating infection control and prevention strategies at a large tertiary care center with low COVID-19 incidence. Int. J. Environ. Res. Public Health 2021, 18, 4201. [Google Scholar] [CrossRef]
- Carlsen, B.; Glenton, C. What about N? A methodological study of sample-size reporting in focus group studies. BMC Med. Res. Methodol. 2011, 11, 26. [Google Scholar] [CrossRef] [PubMed]
Study (Author, Year) | Clinical Task | Key Contribution | Gap Addressed by Our Study |
---|---|---|---|
Wang et al. (2024) [23]; Goodman et al. (2022) [24]; Barchitta et al. (2021) [25] | HAI prediction and antimicrobial, stewardship | Demonstrated that traditional ML can automate risk stratification and optimize stewardship using structured clinical data (e.g., lab values, patient history). | Relies on structured data; not designed to handle the unstructured, dialogue-based nature of real-time clinical consultations. |
Heidari et al. (2022) [26] | COVID-19 outbreak management | Provided a broad overview of AI’s utility in epidemiological modeling, diagnosis, and resource allocation during a pandemic. | High-level review; does not benchmark the performance of specific models for granular, case-based clinical decision-making in IPC. |
Gaber et al. (2025) [17] | Triage, referral, and diagnosis | Benchmarked LLMs for general clinical decision support, showing they can assist in triage and diagnosis but performance varies. | Focuses on general emergency medicine; not specific to field of IPC. |
Chiu et al. (2025) [18]; Schwartz et al. (2024) [27] | Infectious disease consultations | Raised critical safety alarms. Highlighted the risk of factual inaccuracies (“confabulations”) and the need for expert oversight to prevent patient harm. | Identified the problem of safety and reliability but did not: (1) perform a head-to-head benchmark of the latest models, or (2) systematically test mitigation strategies like structured prompting. |
Introduction |
---|
You are an infection control officer in Hong Kong responsible for handling clinical queries regarding infection control. Your task is to carefully review the provided clinical scenario and obtain any missing details before formulating recommendations. |
Phase 1: Clarification and information gathering |
## Do not formulate recommendations in phase 1 of your response.
|
Phase 2: Formulating recommendations |
## This phase should only begin after you have received responses to your clarification questions.
|
Criterion | Definition | Low Score (1) Represents | High Score (10) Represents |
---|---|---|---|
Coherence | The quality of the writing being logical, consistent, and easy to understand. The text should be well-structured and free from contradictions. | The output is confusing, illogical, internally inconsistent, or very difficult to read and understand. | The output is exceptionally clear, logical, and well-structured. The reasoning flows naturally and is easy to follow. |
Conciseness | The output is concise and to the point, avoiding irrelevant information, redundancy, or excessive verbosity. | The output is overly long, repetitive, and contains significant irrelevant information that detracts from the main points. | The output is perfectly concise. The core message is delivered efficiently and without unnecessary filler. |
Usefulness and relevance | The output directly addresses the core clinical problem presented in the scenario and provides information that is pertinent and helpful to an infection control professional. | The output is irrelevant to the scenario, misses the key infection control issues, or provides generic information of no practical use. | The output is highly relevant, directly addresses the critical aspects of the scenario, and provides genuinely useful insights for an infection control nurse. |
Evidence quality | The degree to which the recommendations or the reasoning behind the questions align with established infection control principles, clinical guidelines (e.g., WHO, CDC), and sound scientific reasoning. | The output contains factually incorrect information, contradicts established guidelines, or provides advice that is not based on sound clinical evidence (“hallucination”). | The recommendations are fully aligned with best practices and established evidence-based guidelines. The reasoning is clinically and scientifically sound. |
Actionability | The recommendations are clear, specific, practical, and can be realistically implemented by an infection control nurse in a hospital setting. The steps are well-defined. | The recommendations are vague, abstract, impractical, or lack the specific details needed for implementation. | The recommendations are concrete, specific, and provide clear, step-by-step instructions that an infection control nurse could immediately act upon. |
Large Language Model | p-Value | |||||
---|---|---|---|---|---|---|
GPT-4.1 | DeepSeek V3 | Gemini 2.5 Pro Exp | Effect of Prompt | Interaction Effect | ||
Mean change in scoring effect of prompt | Coherence | −0.013 (−0.161, 0.136) | +0.010 (−0.138, 0.159) | +0.004 (−0.178, 0.187) | 0.991 | 0.987 |
Conciseness | −0.206 (−0.384, −0.028) | −0.123 (−0.301, 0.055) | −0.040 (−0.257, 0.178) | 0.066 | 0.613 | |
Usefulness and relevance | +0.060 (−0.123, 0.244) | +0.119 (−0.064, 0.302) | +0.092 (−0.133, 0.316) | 0.193 | 0.933 | |
Evidence quality | +0.765 (0.575–0.954) | +1.106 (0.916, 1.296) | +0.425 (0.193, 0.657) | <0.001 | 0.003 | |
Actionability | −0.046 (−0.221, 0.129) | +0.017 (−0.158, 0.192) | +0.052 (−0.162, 0.266) | 0.913 | 0.846 | |
Composite score | +0.560 (−0.194, 1.315) | +1.129 (0.375, 1.884) | +0.533 (−0.391, 1.457) | 0.019 | 0.661 |
(A) Comparison by profession | ||||
Profession | Adjusted Mean (Standard Error) | Mean Difference (95% CI) | p-Value | |
---|---|---|---|---|
Doctor | 38.83 (0.26) | 6.76 (6.26, 7.26) | <0.001 | |
Nurse | 32.06 (0.26) | |||
(B) Comparison by seniority within each profession | ||||
Profession | Seniority | Adjusted Mean (Standard Error) | Mean Difference (95% CI) | p-Value |
Doctor | Senior | 42.42 (0.37) | 7.19 (6.17, 9.21) | <0.001 |
Junior | 35.23 (0.37) | |||
Nurse | Senior | 32.08 (0.37) | 0.05 (−0.98, 1.08) | 0.915 |
Junior | 32.03 (0.37) |
Scenario | LLM | Identified Deficiency | Potential Harm and Severity | Correct Action Suggested by Expert Panel |
---|---|---|---|---|
9 (Positive AFB smear) | DeepSeek V3 | Poor clinical judgment: Recommended empiric TB treatment without considering NTM as a differential diagnosis. | High: Unnecessary exposure of the patient to anti-TB medications and their side effects. | Contact microbiologist or microbiology laboratory to arrange TB-PCR for confirmation. |
11 (Screening for Candida auris) | DeepSeek V3 | Factual inaccuracy: Suggested incorrect screening sites (stool/urine) instead of the standard axilla/groin swabs. | High: A false-negative screening result would lead to the premature discontinuation of contact precautions, creating a significant risk of silent transmission and outbreak of MDRO in the healthcare facility. | Screening using standard axillary and groin swabs. |
18 (Router installation) | GPT-4.1 | Incompleteness: Lacked a structured risk assessment and failed to ask for critical details (e.g., proximity to immunocompromised patients). Lack of specificity: Did not assign clear responsibilities or define specific environmental controls. | Moderate: Failure to assess risk could lead to construction dust (containing fungal spores like Aspergillus) being dispersed in a high-risk area, potentially causing severe invasive infections in immunocompromised patients. | Perform an infection control risk assessment and implement dust control measures appropriate for location and risk level. |
21 (Candida auris de-escalation) | Gemini 2.5 Pro Exp | Impracticality and lack of conciseness: Overwhelmed the user with dozens of nonessential questions and academic analysis instead of a direct, actionable answer. | Moderate: The confusing and noncommittal response could lead a frontline nurse to make an unsafe decision such as prematurely discontinuing contact precautions based on a single negative test, thereby risking transmission. | Provide a direct ‘yes/no’ answer to the de-escalation question based on hospital policy, followed by a concise rationale. |
22 (MDRA isolation from rectal swab) | GPT-4.1 | Factual inaccuracy and underestimation of risk: Incorrectly advised that single-room isolation was not required for MDRA, contradicting standard infection control policy. | High: This recommendation directly violates a core principle of MDRO management. Following this advice would create a high probability of nosocomial transmission to other vulnerable patients. | Single room isolation is required for patients colonized with MDRA. |
26 (Family member with tuberculosis) | Gemini 2.5 Pro Exp | Lack of conciseness and prioritization: Generated a long, complex document that obscured the key, actionable steps for the healthcare worker, failing to tailor the advice to their immediate needs. | Low: While the information may be technically correct, its poor presentation makes it difficult to use. The user may miss critical advice, leading to anxiety or missed opportunities for personal screening, but the advice itself is not directly harmful. | Prioritize the most critical actions for the healthcare worker and present them in a clear, easily readable list. |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Wong, S.-C.; Chiu, E.K.-Y.; Chiu, K.H.-Y.; Tam, A.R.; Chau, P.-H.; Choi, M.-H.; Ng, W.-Y.; Kwok, M.O.-T.; Chau, B.Y.; Ng, M.Y.-Z.; et al. Comparative Evaluation and Performance of Large Language Models in Clinical Infection Control Scenarios: A Benchmark Study. Healthcare 2025, 13, 2652. https://doi.org/10.3390/healthcare13202652
Wong S-C, Chiu EK-Y, Chiu KH-Y, Tam AR, Chau P-H, Choi M-H, Ng W-Y, Kwok MO-T, Chau BY, Ng MY-Z, et al. Comparative Evaluation and Performance of Large Language Models in Clinical Infection Control Scenarios: A Benchmark Study. Healthcare. 2025; 13(20):2652. https://doi.org/10.3390/healthcare13202652
Chicago/Turabian StyleWong, Shuk-Ching, Edwin Kwan-Yeung Chiu, Kelvin Hei-Yeung Chiu, Anthony Raymond Tam, Pui-Hing Chau, Ming-Hong Choi, Wing-Yan Ng, Monica Oi-Tung Kwok, Benny Yu Chau, Michael Yuey-Zhun Ng, and et al. 2025. "Comparative Evaluation and Performance of Large Language Models in Clinical Infection Control Scenarios: A Benchmark Study" Healthcare 13, no. 20: 2652. https://doi.org/10.3390/healthcare13202652
APA StyleWong, S.-C., Chiu, E. K.-Y., Chiu, K. H.-Y., Tam, A. R., Chau, P.-H., Choi, M.-H., Ng, W.-Y., Kwok, M. O.-T., Chau, B. Y., Ng, M. Y.-Z., Lam, G. K.-M., Wong, P. W.-C., Chung, T. W.-H., Sridhar, S., Ma, E. S.-K., Yuen, K.-Y., & Cheng, V. C.-C. (2025). Comparative Evaluation and Performance of Large Language Models in Clinical Infection Control Scenarios: A Benchmark Study. Healthcare, 13(20), 2652. https://doi.org/10.3390/healthcare13202652