1. Introduction
Generative artificial intelligence (GenAI), a branch of AI that includes large language models (LLMs), offers considerable promise across clinical medicine and the biomedical sciences. Clinical microbiologists and infectious disease (ID) physicians have traditionally been early adopters of emerging technologies, but the clinical integration of GenAI has been met with polarised opinions owing to incomplete understanding of LLM technologies and the opaque nature of GenAI [1,2]. Concerns about the consistency and situational awareness of LLM responses have been raised, highlighting potential risks to patient safety [3]. The propensity of LLMs to produce confabulated recommendations could preclude their safe clinical deployment [4]. Furthermore, ambiguous advice offered by LLMs might compromise the effectiveness of clinical management [5]. Despite these challenges, stakeholders and clinicians are encouraged to engage in thoughtful, constructive discussion about AI integration in medicine, where this nascent technology could enhance their ability to deliver optimal patient care [6,7]. The European Medicines Agency (EMA) stresses that the rapid advancement of LLM technologies introduces novel risks, particularly those stemming from non-transparent model architectures, potential biases, and data integrity concerns, which must be proactively managed. Its views provide a strong basis for the forthcoming draft regulation on AI in healthcare [8].
This cross-sectional study assessed the quality and safety of AI-generated responses to real-life clinical scenarios at an academic medical centre. Three leading foundational GenAI models—Claude 2, Gemini Pro, and GPT-4.0—were selected to benchmark the current capabilities of LLMs. These models underwent blinded evaluations by six clinical microbiologists and ID physicians across four critical domains: factual consistency, comprehensiveness, coherence, and potential medical harmfulness. The analysis included comparative evaluations between specialists and resident trainees, aiming to yield nuanced insights that reflect the broad spectrum of clinical experiences and varying degrees of expertise.
2. Materials and Methods
We included consecutive new in-patient clinical consultations attended between 13 October and 6 December 2023 by four clinical microbiologists, comprising two fellows (K.H.-Y.C. and T.W.-H.C.) and two resident trainees (E.K.-Y.C. and Y.-Z.N.), from the Department of Microbiology, Queen Mary Hospital (QMH), a university-affiliated teaching hospital and tertiary healthcare centre in Hong Kong with approximately 1700 hospital beds. Duplicated referrals and follow-up assessments were excluded. First attendance clinical notes were retrospectively extracted from the Department’s digital repository for analysis. The included clinical notes encompassed patients from internal medicine and surgery who were predominantly middle-aged to elderly and displayed a balanced gender representation overall.
Included clinical notes were pre-processed, standardised, and anonymised to generate unique clinical scenarios (Supplementary File S1, pp. 3–36). Patient-identifiable details were removed. Medical terminologies were standardised. Non-universal abbreviations were expanded into their full terms (e.g., from ‘c/st’ to ‘culture’). Measurements were presented using the International System of Units (e.g., ‘g/dL’ for haemoglobin levels). Clinically relevant dates were included for chronological structuring. Finally, clinical scenarios were categorised systematically into five sections: “basic demographics and underlying medical conditions”, “current admission”, “physical examination findings”, “investigation results”, and “antimicrobials and treatments”.
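To illustrate these pre-processing steps, the short Python sketch below expands abbreviations and redacts an identifier pattern; the mappings and the identifier pattern are hypothetical examples, and the study’s actual standardisation and anonymisation were performed as described above rather than with this code.

```python
# Illustrative sketch of scenario pre-processing (abbreviation expansion and
# simple redaction); the dictionary and regular expression below are
# hypothetical examples, not the study's curation procedure.
import re

ABBREVIATIONS = {"c/st": "culture", "hb": "haemoglobin"}  # example expansions

def preprocess(note: str) -> str:
    text = note
    # Expand non-universal abbreviations into their full terms.
    for abbr, full in ABBREVIATIONS.items():
        text = re.sub(rf"\b{re.escape(abbr)}\b", full, text, flags=re.IGNORECASE)
    # Remove an obvious patient identifier (illustrative pattern only).
    text = re.sub(r"\b[A-Z]{1,2}\d{6}\(?\d\)?", "[REDACTED ID]", text)
    return text

print(preprocess("Hb 9.8 g/dL; blood c/st pending (patient ID A123456(7))."))
```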
All clinical scenarios were processed using a default zero-shot prompt template developed specifically for this study (Figure 1) [9]. The prompt template was created to standardise the analytical framework and the model outputs. The prompt defined the behaviour of the chatbot, instructing it to act as “an artificial intelligence assistant with expert knowledge in clinical medicine, infectious disease, clinical microbiology, and virology” [10]. The template broke the analysis down into clinically meaningful segments and sub-tasks using the Performed-Chain of Thought (P-COT) prompting approach, with each task analysed sequentially through a logical, step-by-step framework [11,12,13]. At the end of the prompt, the models were instructed to adhere closely to the provided instructions, reinforcing the intended behaviour and the desired responses [14].
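As an illustration of this prompt structure, the following Python sketch assembles a zero-shot, stepwise prompt around the quoted system role; the sub-task wording and function names are illustrative assumptions, and the authoritative template is the one shown in Figure 1.

```python
# Minimal sketch of a zero-shot, stepwise prompt template (illustrative only;
# the study's actual template is shown in Figure 1).
SYSTEM_ROLE = (
    "You are an artificial intelligence assistant with expert knowledge in "
    "clinical medicine, infectious disease, clinical microbiology, and virology."
)

# Hypothetical sub-tasks to be analysed sequentially, step by step.
SUB_TASKS = [
    "Summarise the key clinical problems.",
    "Interpret the investigation results.",
    "Suggest the most likely diagnoses with reasoning.",
    "Recommend further investigations and antimicrobial management.",
]

def build_prompt(clinical_scenario: str) -> str:
    """Assemble the full prompt from the role, the scenario, and ordered sub-tasks."""
    steps = "\n".join(f"Step {i + 1}: {task}" for i, task in enumerate(SUB_TASKS))
    return (
        f"{SYSTEM_ROLE}\n\n"
        f"Clinical scenario:\n{clinical_scenario}\n\n"
        f"Work through the following steps in order, showing your reasoning:\n{steps}\n\n"
        "Adhere closely to these instructions when producing your response."
    )
```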
We accessed the chatbots through Poe (Quora, Mountain View, CA, USA), a subscription-based GenAI platform. Three foundational GenAI models were evaluated: Claude 2 (Anthropic, San Francisco, CA, USA), Gemini Pro (Google DeepMind, London, UK), and GPT-4.0 (OpenAI, San Francisco, CA, USA). Additionally, a Custom Chatbot based on GPT-4.0 (cGPT-4) was created using the “Create bot” feature on Poe. cGPT-4 was optimised using retrieval-augmented generation (RAG) to incorporate an external knowledge base drawn from four established clinical references [15]: Török, E., Moran, E., and Cooke, F. (2017). Oxford Handbook of Infectious Diseases and Microbiology. Oxford University Press [16]; Mitchell, R.N., Kumar, V., Abbas, A.K., and Aster, J.C. (2016). Pocket Companion to Robbins and Cotran Pathologic Basis of Disease (Robbins Pathology). Elsevier [17]; Sabatine, M.S. (2022). Pocket Medicine: The Massachusetts General Hospital Handbook of Internal Medicine. Lippincott Williams and Wilkins [18]; and Gilbert, D.N., Chambers, H.F., Saag, M.S., Pavia, A.T., and Boucher, H.W. (editors) (2022). The Sanford Guide to Antimicrobial Therapy 2022. Antimicrobial Therapy, Incorporated [19]. cGPT-4 was deployed as a private bot on the Poe platform and is accessible only to authorised users involved in this study; we complied with the relevant licence terms and terms of service. Domain-specific healthcare AI models, such as Med-PaLM 2 [20] or MEDITRON [21], were not included in the analysis because of limited access and proprietary restrictions.
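For readers unfamiliar with RAG, the sketch below illustrates the general retrieve-then-prompt pattern using a simple TF-IDF retriever; Poe’s “Create bot” feature performs retrieval internally, so this is a generic approximation rather than the platform’s implementation, and the reference excerpts are placeholders.

```python
# Illustrative sketch of the retrieval-augmented generation (RAG) pattern; not
# the Poe "Create bot" implementation, which handles retrieval internally.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_context(query: str, reference_chunks: list[str], top_k: int = 3) -> list[str]:
    """Return the reference passages most similar to the query."""
    vectoriser = TfidfVectorizer().fit(reference_chunks + [query])
    chunk_vectors = vectoriser.transform(reference_chunks)
    query_vector = vectoriser.transform([query])
    scores = cosine_similarity(query_vector, chunk_vectors).ravel()
    best = scores.argsort()[::-1][:top_k]
    return [reference_chunks[i] for i in best]

def build_rag_prompt(base_prompt: str, scenario: str, reference_chunks: list[str]) -> str:
    """Prepend retrieved reference passages to the standard prompt."""
    context = "\n---\n".join(retrieve_context(scenario, reference_chunks))
    return f"Reference excerpts:\n{context}\n\n{base_prompt}"

# Toy usage with placeholder reference excerpts (not text from the cited handbooks).
refs = [
    "Empirical therapy options for community-acquired pneumonia.",
    "Interpretation of positive blood culture results.",
    "Vancomycin dosing in renal impairment.",
]
print(retrieve_context("elderly patient with pneumonia and positive blood cultures", refs, top_k=2))
```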
Chatbot response variability was governed by the model temperature setting, which influences the creativity and predictability of outputs. A lower temperature value yields more rigid, deterministic responses, while a higher value allows more varied and inventive answers [22]. For this study, the temperature settings were left at the default values recommended by Poe; no model-specific adjustments were made, in order to limit user manipulation and operator-dependent bias and to reflect typical chatbot deployment scenarios. Claude 2 was set to a temperature of 0.5, and both GPT-4.0 and cGPT-4 were set to 0.35. The temperature setting for Gemini Pro was not disclosed by Poe at the time of assessment.
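For context, the snippet below shows how a temperature value is typically supplied when a chat model is called directly through the OpenAI Python SDK; this is an assumption for illustration only, as the present study accessed the models through Poe and made no direct API calls.

```python
# Illustrative only: direct API access was not used in this study, which relied
# on Poe's default temperatures (0.5 for Claude 2, 0.35 for GPT-4.0 and cGPT-4).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4",  # model identifier is illustrative
    messages=[
        {"role": "system", "content": "You are an AI assistant with expert knowledge in clinical microbiology."},
        {"role": "user", "content": "Clinical scenario and stepwise tasks go here."},
    ],
    temperature=0.35,  # lower values give more deterministic, less varied output
)
print(response.choices[0].message.content)
```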
The study included a dataset of 40 distinct real-life clinical scenarios, which were processed by four GenAI chatbots, producing a total of 160 AI-generated responses. To ensure objective assessment, all investigators except E.K.-Y.C. were blinded to the clinical scenarios and chatbot outputs. Dual-level randomisation was employed: the clinical scenarios were randomised before being inputted into the chatbots, and the corresponding AI-generated responses were further randomised before being subjected to human evaluation via the Qualtrics survey platform (Qualtrics, Provo, UT, USA). Within the platform, clinical scenarios and their corresponding chatbot responses were presented in random order, with all identifiers removed to maintain blinding.
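The dual-level randomisation and blinding can be summarised by the following illustrative sketch (not the study’s actual code), in which scenario order and response order are shuffled independently and each response receives a blinded identifier.

```python
# Illustrative sketch of dual-level randomisation: scenarios are shuffled before
# submission to the chatbots, and the pooled responses are shuffled again and
# assigned blinded IDs before evaluation.
import random

random.seed(2023)  # fixed seed shown only so the example is reproducible

scenarios = [f"scenario_{i:02d}" for i in range(1, 41)]
chatbots = ["GPT-4.0", "cGPT-4", "Gemini Pro", "Claude 2"]

random.shuffle(scenarios)  # first level: order of scenario input

responses = [(s, bot) for s in scenarios for bot in chatbots]  # 160 responses
random.shuffle(responses)  # second level: order of presentation to evaluators

blinded = {f"R{idx:03d}": pair for idx, pair in enumerate(responses, start=1)}
# Evaluators see only the blinded IDs (R001-R160); the key linking IDs to
# chatbots is withheld until analysis.
```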
Human evaluators were selected from the Department of Microbiology at the University of Hong Kong, the Department of Medicine (Infectious Disease Unit) at Queen Mary Hospital, and the Department of Medicine and Geriatrics (Infectious Disease Unit) at Princess Margaret Hospital. Our study design included three specialists (A.R.T., S.S.Y.W., and S.S., average clinical experience (avg. clinical exp.) = 19.3 years) and three resident trainees (A.W.T.L., M.H.C., and W.C.W., avg. clinical exp. = 5.3 years) to capture a broad spectrum of clinical perspectives. None of the evaluators had prior experience using chatbot technology in a clinical setting.
Written instructions were provided to the evaluators, clearly setting out the evaluation procedures and the definition of each domain. Evaluators were instructed to read each clinical scenario and its corresponding responses thoroughly before grading, and sample responses were demonstrated to ensure familiarity with the generated materials. AI-generated responses were systematically evaluated using a five-point Likert scale across four clinically relevant domains: factual consistency, comprehensiveness, coherence, and medical harmfulness [23]. Factual consistency was assessed by verifying the accuracy of the output information against the clinical data provided in the scenarios. Comprehensiveness measured how completely a response covered the information required to meet the objectives outlined in the prompt. Coherence evaluated how logically structured and clinically impactful the chatbot responses were. Medical harmfulness evaluated the potential of a response to cause patient harm (Supplementary File S2, p. 3, Table S1).
Descriptive statistics were reported. Internal consistency of the Likert scale items was evaluated using Cronbach’s alpha coefficient, which determined whether the included domains jointly reflected a single underlying construct, thus justifying the formulation of a composite score. Composite scores, ranging from 1 to 5, were calculated as the mean of the scores across the four domains. One-way analysis of variance (ANOVA) and Tukey’s honest significant difference (HSD) test were used to compare composite scores between chatbots. At the domain level, the Kruskal–Wallis H-test and post hoc Dunn’s multiple comparison tests were used for between-chatbot comparisons. Within-group analyses between specialist and resident trainee evaluators at the domain level were compared using the paired t-test [24]. Response lengths of the different models were compared using one-way ANOVA and further assessed with Tukey’s HSD to identify significant differences. In addition, we evaluated the frequency with which responses crossed critical thresholds (e.g., “insufficiently verified facts” in the factual consistency domain, or “significant incoherence” in the coherence domain), and computed prevalence ratios to compare the incidence rates of these occurrences across chatbots. We reported the Spearman correlation coefficients between the composite scores and the running costs of each GenAI model [25,26,27]. All statistical analyses were performed in R statistical software, version 4.3.3 (R Project for Statistical Computing), SPSS, version 29.0.1.0 (IBM Corporation, Armonk, NY, USA), and GraphPad Prism, version 10.2.0 (GraphPad Software Inc., San Diego, CA, USA). A p-value of less than 0.05 was considered statistically significant.
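The main steps of this pipeline are summarised in the Python sketch below; the study itself used R, SPSS, and GraphPad Prism, so the libraries shown and the synthetic ratings are illustrative assumptions rather than the analyses as actually run.

```python
# Illustrative Python equivalent of the statistical pipeline; the synthetic
# ratings and library choices are assumptions, not the study's data or code.
import numpy as np
import pandas as pd
import pingouin as pg
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd
import scikit_posthocs as sp

rng = np.random.default_rng(0)
domains = ["factual", "comprehensive", "coherence", "harmless"]
chatbots = ["GPT-4.0", "cGPT-4", "Gemini Pro", "Claude 2"]

# Synthetic long-format ratings: one row per evaluation entry (1-5 Likert).
scores = pd.DataFrame({
    "chatbot": np.repeat(chatbots, 240),
    **{d: rng.integers(1, 6, size=960) for d in domains},
})

# Internal consistency of the four Likert items (Cronbach's alpha).
alpha, _ = pg.cronbach_alpha(data=scores[domains])

# Composite score: mean of the four domain scores.
scores["composite"] = scores[domains].mean(axis=1)

# Between-chatbot comparison of composite scores: one-way ANOVA + Tukey's HSD.
groups = [g["composite"].to_numpy() for _, g in scores.groupby("chatbot")]
f_stat, p_anova = stats.f_oneway(*groups)
tukey = pairwise_tukeyhsd(scores["composite"], scores["chatbot"])

# Domain-level comparison: Kruskal-Wallis H-test + post hoc Dunn's test.
h_stat, p_kw = stats.kruskal(*[g["coherence"].to_numpy() for _, g in scores.groupby("chatbot")])
dunn = sp.posthoc_dunn(scores, val_col="coherence", group_col="chatbot", p_adjust="bonferroni")

# Spearman correlation between model-level composite scores and running costs.
model_means = scores.groupby("chatbot")["composite"].mean()
model_costs = pd.Series([0.06, 0.06, 0.03, 0.05], index=chatbots)  # hypothetical GBP per scenario
rho, p_rho = stats.spearmanr(model_means.loc[chatbots], model_costs)

print(f"alpha={alpha:.3f}, ANOVA p={p_anova:.3f}, KW p={p_kw:.3f}, rho={rho:.2f}")
```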
3. Results
In this study, 40 clinical scenarios were tested with 4 GenAI chatbots, generating 160 distinct responses. Each response was evaluated by 6 evaluators separately, amassing a total of 960 evaluation entries and providing a robust dataset for analysis. The mean response lengths in words were: GPT-4.0 (577.2 ± 81.2), Gemini Pro (537.8 ± 86.2), cGPT-4 (507.7 ± 80.2), and Claude 2 (439.5 ± 62.6; Supplementary File S2, p. 4, Table S2). GPT-4.0 produced longer responses than Gemini Pro (character count: p ≤ 0.001) and Claude 2 (word count: p < 0.001; character count: p ≤ 0.001; Supplementary File S2, pp. 5–6, Tables S3 and S4).
The overall Cronbach’s alpha coefficient for the Likert scale was found to be high (α = 0.881). Additionally, high internal consistencies were observed across chatbots: GPT-4.0 (α = 0.847), cGPT-4 (α = 0.891), Gemini Pro (α = 0.873), and Claude 2 (α = 0.894). These findings reaffirmed that the scale items reliably measured a unified construct and functioned similarly across all models, supporting the robustness of the evaluation tool.
Regarding overall model performance (Figure 2a; Supplementary File S2, p. 7, Table S5), the GPT-4.0-based models exhibited the highest mean composite scores (GPT-4.0: 4.121 ± 0.576; cGPT-4: 4.060 ± 0.667), followed by Claude 2 (3.919 ± 0.718) and Gemini Pro (3.890 ± 0.714). In between-chatbot comparisons (Figure 2b), GPT-4.0 had a significantly higher mean composite score than Gemini Pro (mean difference (MD) = 0.231, p = 0.001) and Claude 2 (MD = 0.202, p = 0.006). cGPT-4 also outperformed Gemini Pro (MD = 0.171, p = 0.03). No statistically significant difference was observed between GPT-4.0 and cGPT-4.
In within-group comparisons of composite scores between specialist and resident trainee evaluators, specialists awarded significantly higher scores than resident trainees for all chatbots (Supplementary File S2, p. 8, Table S6): GPT-4.0 (MD = 0.604, p < 0.001), cGPT-4 (MD = 0.742, p < 0.001), Gemini Pro (MD = 0.796, p < 0.001), and Claude 2 (MD = 0.867, p < 0.001). Specialists likewise awarded higher scores across all individual domains (p < 0.001; Supplementary File S2, p. 9, Table S7).
At the domain level (Figure 3), pairwise comparisons showed that GPT-4.0 scored significantly higher than Gemini Pro and Claude 2 for factual consistency (GPT-4.0 vs. Gemini Pro, mean rank difference (MRD) = 67.27, p = 0.02; GPT-4.0 vs. Claude 2, MRD = 67.60, p = 0.02), comprehensiveness (GPT-4.0 vs. Gemini Pro, MRD = 64.25, p = 0.04; GPT-4.0 vs. Claude 2, MRD = 65.84, p = 0.03), and lack of medical harm (GPT-4.0 vs. Gemini Pro, MRD = 69.79, p = 0.02; GPT-4.0 vs. Claude 2, MRD = 64.87, p = 0.04). For coherence, there was no statistically significant difference between GPT-4.0 and Claude 2, while cGPT-4 showed superior performance compared to Gemini Pro (MRD = 79.69, p = 0.004).
Incidence rates for each response category were calculated for comparison (Supplementary File S2, p. 10, Table S8). Concerning factual consistency, GPT-4.0 excelled, with 31.25% (95% confidence interval (CI) 25.42–37.08) of its responses rated as “fully verified facts”, higher than cGPT-4 (27.50%, 22.08–33.32), Claude 2 (24.58%, 19.17–29.58), and Gemini Pro (23.33%, 17.92–28.75). None of the models produced outputs regarded as “unverified or non-factual” (Figure 4a).
In terms of comprehensiveness, 79.58% (95% CI 74.17–85.00) of outputs from GPT-4.0 showed either “complete coverage” (22.08%, 16.67–27.08) or “extensive coverage” (57.50%, 51.25–63.33), whereas all other chatbots were rated below 70% for these two categories combined. Claude 2 performed worst, with 35.00% (95% CI 28.75–41.67) of its responses rated as “considerable coverage” (28.33%, 22.50–34.99), “partial coverage” (5.83%, 2.92–8.75), or “limited coverage” (0.83%, 0.00–2.08; Figure 4b).
Regarding coherence, cGPT-4 excelled with the highest percentage of “fully coherent” (30.42%, 95% CI 24.59–36.66) responses, compared to GPT-4.0 (27.92%, 22.50–33.33), Claude 2 (26.25%, 21.25–32.49), and Gemini Pro (23.75%, 18.33–29.58). When considering the combined categories of “fully coherent” and “minimally incoherent”, cGPT-4 was marginally better (85.00%, 95% CI 80.42–89.58) than GPT-4.0 (84.17%, 79.58–88.33) and Claude 2 (73.33%, 67.92–79.17). Gemini Pro showed the worst performance at 69.58% (63.34–75.42; Figure 4c).
Concerning medical harmfulness, over 60% of all AI-generated responses contained some degree of harm, ranging from “minimally harmful” and “mildly harmful” to “moderately harmful” and “severely harmful”: Claude 2 (70.42%, 95% CI 65.00–76.25), Gemini Pro (69.17%, 63.75–75.00), cGPT-4 (63.75%, 57.50–70.00), and GPT-4.0 (63.33%, 57.09–69.57). “Severely harmful” responses were documented for Gemini Pro (n = 3; 1.25%, 95% CI 0.00–2.91) and Claude 2 (n = 1; 0.42%, 0.00–1.25). Incidence rates of “harmless” responses were also lowest for these two models: Claude 2 (29.58%, 95% CI 23.75–35.83) and Gemini Pro (30.83%, 24.58–36.25; Figure 4d).
When comparing incidence rates between specialists and resident trainees (Supplementary File S2, pp. 11–12, Table S9), a greater proportion of responses were classified as “fully verified facts” by specialists (23.96%, 95% CI 21.04–26.66) than by resident trainees (2.71%, 1.77–3.85), indicating that specialists were approximately nine times more likely to regard responses as containing “fully verified facts”. For medical harmfulness, the proportion of responses rated as “harmless” was also higher among specialists (27.71%, 95% CI 24.79–30.63) than resident trainees (5.63%, 4.27–7.29), suggesting that specialists were approximately five times more likely to consider responses “harmless”.
4. Discussion
In this cross-sectional study, AI-generated responses from four GenAI chatbots—GPT-4.0, Custom Chatbot (based on GPT-4.0; cGPT-4), Gemini Pro, and Claude 2—were evaluated by specialists and resident trainees from the divisions of clinical microbiology or infectious diseases. Consistently, GPT-4.0-based models outperformed Gemini Pro and Claude 2. The performance of the RAG-enhanced cGPT-4 chatbot was comparable to that of the unenhanced GPT-4.0 chatbot, illustrating our incomplete understanding of LLM architecture and the nuances of model configurations and augmentations. Post hoc analysis revealed that direct references to the external knowledge base occurred infrequently, which may partly explain the similar composite scores. Ongoing refinements to the RAG process could help optimise integration of external contents in future iterations to achieve superior performances.
Alarmingly, fewer than two-fifths of AI-generated responses were deemed “harmless”. Despite the superior performance of GPT-4.0-based models, the substantial number of potentially harmful outputs from GenAI chatbots raises serious concerns. Harmful outputs included inaccurate diagnoses of infectious diseases, misinterpretations of investigation results, and inappropriate drug recommendations. These shortcomings were likely to cause direct patient harm and could erroneously divert attention and resources, both for the patient and the broader healthcare system. In their current state, none of the tested AI models should be considered safe for direct clinical deployment in the absence of human supervision. Additionally, resident trainees and medical students should be mindful of the limitations of GenAI. Teaching institutions must be vigilant in adopting AI as training tools.
Our findings also revealed a consistent trend in which specialists provided higher ratings than resident trainees. This variability was not viewed as a shortcoming; rather, it reflected the differences in clinical judgement that exist in everyday practice. By incorporating evaluators with varying levels of experience, we captured a realistic view of how AI performance may be interpreted in diverse clinical settings (Table 1). It is incumbent upon stakeholders and AI engineers to address potential inadequacies in human evaluation and oversight of AI-generated content, particularly within the critical domain of clinical medicine and patient care. While the current study did not explore the specific reasons for these differential rating patterns, future research could benefit from enhanced calibration procedures or weighted scoring to refine these insights.
The running costs of GenAI chatbots have decreased substantially over time. At the time of testing, GPT-4.0’s operating costs were GBP 0.0474 per 1000 tokens for input and GBP 0.0948 per 1000 tokens for output, with average costs for scenario input and output calculated to be GBP 0.0204 and GBP 0.0408, respectively. Within the subsequent six months, the average cost per 1000 tokens for input and output decreased by approximately 50% for GPT-4.0, while costs for Claude 2 remained unchanged. Notably, Gemini Pro has transitioned to a free service model. Currently, the operating costs for frontier models are comparable. As competition intensifies and the cost disparity between proprietary (GPT-4o, Gemini 2.5, and Claude 3) and open-source models (DeepSeek V3, Hangzhou DeepSeek Artificial Intelligence Basic Technology Research Co., Ltd., Hangzhou, China; Llama 3.1, Meta Platforms, Inc., Menlo Park, CA, USA) narrows, we anticipate that future iterations of GenAI systems will become increasingly attractive to healthcare providers. A detailed cost–benefit analysis incorporating factors such as scalability and integration costs would be a valuable direction for subsequent research.
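As a rough illustration of these figures, the calculation below back-derives the approximate token counts per scenario from the reported prices and average costs, and estimates the total processing cost for the 40-scenario dataset; the token counts are inferred estimates, not measured values.

```python
# Back-of-envelope cost calculation for GPT-4.0 at the time of testing
# (GBP 0.0474 per 1000 input tokens, GBP 0.0948 per 1000 output tokens).
input_price, output_price = 0.0474 / 1000, 0.0948 / 1000   # GBP per token

avg_input_cost, avg_output_cost = 0.0204, 0.0408            # GBP per scenario (reported)
approx_input_tokens = avg_input_cost / input_price          # ~430 tokens (inferred estimate)
approx_output_tokens = avg_output_cost / output_price       # ~430 tokens (inferred estimate)

cost_per_scenario = avg_input_cost + avg_output_cost
cost_40_scenarios = 40 * cost_per_scenario                   # ~GBP 2.45 for the full dataset
print(f"~{approx_input_tokens:.0f} input / {approx_output_tokens:.0f} output tokens per scenario; "
      f"GBP {cost_40_scenarios:.2f} for 40 scenarios")
```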
We emphasise that AI systems should complement rather than replace human clinical judgement. Given the observed limitations, initial AI deployment should occur under strict human supervision. Clinicians should use AI-generated outputs as supplementary information to support their decision-making, provided they receive adequate training on critically evaluating AI recommendations. This approach ensures that human expertise remains central to patient care. Looking ahead, integration of feedback systems will be essential, allowing clinicians to contribute their expert opinions to refine AI outputs. This iterative process can enhance the accuracy and reliability of AI-augmented healthcare, fostering a more effective integration between AI technologies and clinical expertise.
Several limitations were identified in this study. First, while our brief orientation session helped align the evaluators’ understanding of the domains to be assessed, enhanced calibration procedures, including more extensive training or periodic recalibration sessions, should be considered in subsequent studies to mitigate potential biases stemming from varying levels of clinical experience. Second, for fair comparison, standardised, complete, and verified data were used to create the case scenarios. However, we were mindful that the level of clinical detail and available patient data in these scenarios may not fully encapsulate the variability and nuances of real-life hospital settings. Since AI system performance is highly dependent on the quality of input data, AI-generated responses may be more constrained in actual clinical practice. As AI technology continues to advance rapidly, these models may achieve clinical safety and reliability in the near future, and it is important for stakeholders to stay informed about the latest developments to fully leverage AI’s potential in healthcare. Third, although our primary analysis focused on overall AI performance, preliminary observations suggested that case complexity may influence output quality. Future studies should consider stratifying scenarios by complexity or disease type to determine whether specific categories of clinical cases present unique challenges for LLMs. Lastly, AI-augmented healthcare delivery services should be evaluated against the standard of care through randomised controlled trials, enabling objective measurement of the clinical benefits and practicalities offered by AI.
Future research should prioritise comparative analyses between traditional clinical care and AI-enhanced healthcare delivery to unlock the full potential of AI technologies across diverse healthcare settings. From a patient engagement perspective, the multimodal capabilities of AI systems can significantly enhance doctor–patient communication, aiding the explanation of complex medical concepts through multimedia channels, thereby empowering patients, reinforcing their autonomy, and fostering better shared decision-making [28]. In terms of cross-specialty collaboration, AI could efficiently capture the entirety of the patient’s clinical journey across the full spectrum of the healthcare ecosystem, spanning primary, secondary, tertiary, and community care [29]. Integration of unstructured health data into the chronological profile of the patient could enable powerful insights into their health state, thereby facilitating timely and proactive health interventions. Additionally, real-time monitoring of communicable diseases and available healthcare resources (e.g., personal protective equipment (PPE), vaccines, treatments, and laboratory reagents) should be guided by big data and analysed by AI, allowing precise and equitable distribution of resources, effective management of supply chain constraints, and rapid public health interventions [30].