Abstract
Background: Large language models (LLMs) such as ChatGPT have evolved rapidly, with notable improvements in coherence, factual accuracy, and contextual relevance. However, their academic and clinical applicability remains under scrutiny. This study evaluates the temporal performance evolution of LLMs by comparing earlier model outputs (GPT-3.5 and GPT-4.0) with ChatGPT-4.5 across three domains: aesthetic surgery counseling, an academic discussion on base of thumb arthritis, and a systematic literature review. Methods: We replicated the methodologies of three previously published studies using identical prompts in ChatGPT-4.5. Each output was assessed against its predecessor using a nine-domain Likert-based rubric measuring factual accuracy, completeness, reference quality, clarity, clinical insight, scientific reasoning, bias avoidance, utility, and interactivity. Expert reviewers in plastic and reconstructive surgery independently scored and compared model outputs across versions. Results: ChatGPT-4.5 outperformed earlier versions across all domains. Reference quality improved most markedly (a score increase of +4.5), followed by factual accuracy (+2.5), scientific reasoning (+2.5), and utility (+2.5). In aesthetic surgery counseling, GPT-3.5 produced generic responses lacking clinical detail, whereas ChatGPT-4.5 offered tailored, structured, and psychologically sensitive advice. In academic writing, ChatGPT-4.5 eliminated reference hallucination, correctly applied evidence hierarchies, and demonstrated advanced reasoning. In the literature review, recall remained suboptimal, but precision, citation accuracy, and contextual depth improved substantially. Conclusion: ChatGPT-4.5 represents a major step forward in LLM capability, particularly in generating trustworthy academic and clinical content. While not yet suitable as a standalone decision-making tool, its outputs now support research planning and early-stage manuscript preparation. Persistent limitations include information recall and interpretive flexibility. Continued validation is essential to ensure ethical, effective use in scientific workflows.
1. Introduction
Over the past few years, the evolution of large language models (LLMs), supported by advanced machine learning techniques such as transformer-based architectures, reinforcement learning from human feedback (RLHF), large-scale unsupervised pretraining, and instruction tuning, has fundamentally transformed the way scientific research is conducted, written, and disseminated. These advancements have led to measurable improvements in coherence, factual accuracy, and contextual understanding [,]. Advanced machine learning techniques have propelled models such as ChatGPT from early experimental tools to sophisticated systems capable of generating complex, nuanced text. These models have been increasingly employed to support literature searches, data synthesis, and even the preliminary drafting of scientific manuscripts []. Early evaluations, however, revealed notable limitations, including superficial analyses, occasional factual inaccuracies, and instances of fabricated references. Such shortcomings raised important questions regarding AI-generated content’s reliability and academic utility [,,].
The primary aim of the present study is to conduct a comprehensive comparative analysis of LLM outputs by replicating and extending methodologies from three seminal studies. The first of these studies, published in May 2023 and based on outputs generated in January 2023, offered an early perspective on how LLMs performed when applied to clinically oriented research questions []. The second study, published in October 2024 with outputs generated in May 2023, further explored the applicability of LLMs in addressing specific medical queries []. The third study, a literature review with outputs generated in April 2024, examined the evolution of systematic literature search capabilities and highlighted improvements in content depth and reference accuracy []. In our investigation, we strictly adhere to the original experimental protocols by using the same set of prompts as employed in the previous studies. Because the original studies utilised ChatGPT alone, our analysis employs the latest version, ChatGPT-4.5, to ensure a direct and meaningful comparison. Our analysis is structured around several key performance dimensions. We first assess the accuracy of responses by examining their factual correctness and alignment with the current state of scientific knowledge. Next, we evaluate the completeness of the information provided by determining whether the responses comprehensively address all aspects of the queries. We also analyse the coherence and clarity of the generated text, focusing on logical flow and overall readability. Furthermore, we investigate the incidence of bias and errors, with particular attention to occurrences of fabricated or misleading references. Finally, we consider the practical utility of the outputs in the context of scientific and medical research, including their capacity to support literature reviews, data synthesis, and academic writing.
It is important to note that early iterations of LLMs, as reflected in studies [,], were characterised by a tendency toward generic or surface-level responses. These limitations were primarily attributable to constraints in training data and model architecture at that time and were empirically demonstrated in domains such as literature searching, where ChatGPT significantly underperformed against human researchers []. Recent advancements have addressed many of these issues. For example, ChatGPT-4.5 exhibits marked improvements in the depth and precision of its responses, offering more nuanced and contextually relevant information than its predecessors [,]. This study has significant implications for the integration of AI technologies into academic research. By systematically comparing historical outputs with those generated by the most advanced models available today, we aim to characterise the progress made over the past 12 to 24 months. Our findings are expected to illuminate both the benefits and the persistent challenges associated with using LLMs, informing future research directions and guiding best practices for ethical implementation [,]. Ultimately, this work seeks to bolster confidence in using AI-generated content while delineating the boundaries within which these technologies can most effectively support scientific inquiry.
2. Materials and Methods
This study was designed to evaluate the evolution in the performance of LLMs over time by replicating and extending the methodologies of the authors' three previously published studies. The research adopts a comparative, qualitative design, focusing on outputs generated by earlier versions of ChatGPT (GPT-3.5 and GPT-4.0) and comparing them to outputs produced by the most recent iteration, ChatGPT-4.5. The objective was to assess improvements in accuracy, completeness, reference reliability, clarity, and overall practical utility in scientific and medical research.
To ensure methodological rigour and consistency, we selected three benchmark studies in which earlier LLMs had been tasked with specific academic or clinical prompts. These included: (1) a rhinoplasty consultation simulation involving nine standardised patient-focused questions derived from the American Society of Plastic Surgeons’ checklist; (2) an academic conversation evaluating the use of implants in the management of base of thumb arthritis, structured as five iterative scientific prompts; and (3) a systematic literature search comparing ChatGPT and other AI platforms against human researcher performance in identifying high-level evidence related to trapeziometacarpal joint osteoarthritis.
In the present study, we recreated each original scenario using ChatGPT-4.5, replicating the prompts used in the prior investigations without modification. These prompts reflected the nature of the original tasks: (1) a structured set of nine patient-focused questions for aesthetic surgery counseling; (2) five sequential, evidence-based academic questions on the surgical management of base of thumb arthritis; and (3) a systematic literature search query formulated with predefined inclusion criteria and Boolean logic elements. All prompts were entered verbatim, in English, in a single ChatGPT-4.5 session under consistent conditions, without rephrasing or multiple response generation, to ensure consistency with the original studies and to eliminate variation attributable to model sampling or user behaviour. Responses were collected in full and preserved in their original form for blinded assessment.
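The original prompts were entered through the ChatGPT interface rather than programmatically; purely for illustration, the minimal sketch below shows how a comparable single-session, verbatim-prompt protocol might be scripted with the OpenAI Python client. The model identifier, example prompts, and sampling settings are assumptions made for the sketch and were not part of the original methodology.

```python
# Sketch of a single-session, verbatim-prompt protocol (illustrative only).
# Assumes the OpenAI Python client (>=1.0) is installed and OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical prompts for illustration; the prompts actually used in the
# original studies are reproduced in Table 1 and Table 2.
prompts = [
    "Am I a good candidate for rhinoplasty?",
    "What are the risks and the expected recovery after rhinoplasty?",
]

history = []  # one running conversation approximates a single chat session
for prompt in prompts:
    history.append({"role": "user", "content": prompt})
    response = client.chat.completions.create(
        model="gpt-4.5-preview",  # placeholder model identifier
        messages=history,
        n=1,                      # one response per prompt, no regeneration
        temperature=0,            # minimise sampling variation between runs
    )
    answer = response.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    print(answer)
```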
Each ChatGPT-4.5 output was evaluated using a predefined rubric built around the following core performance domains: (1) factual accuracy, defined as the correctness of scientific or clinical information provided; (2) completeness, reflecting the model's ability to address all aspects of the prompt; (3) reference quality, including the presence or absence of hallucinated or unverifiable citations; (4) clarity and coherence, judged by logical flow, grammatical accuracy, and readability; and (5) practical utility, particularly in the context of research drafting, literature review, or patient counseling. The prompts used in this study were identical to those reported in full in our previously published studies [,,] and are also displayed in the comparative tables within this manuscript (Table 1 and Table 2). No additional modifications or variations were introduced. All ChatGPT-4.5 outputs were generated in March 2025 in a single session per domain, ensuring consistent model behaviour during testing.

Table 1.
Aesthetic Surgery Advice and Counseling from Artificial Intelligence: A Rhinoplasty Consultation with ChatGPT.

Table 2.
Artificial or Augmented Authorship? A Conversation with a Chatbot on the Base of Thumb Arthritis. The references listed in this table are part of the AI-generated response and do not necessarily correspond to verifiable sources. They are reported verbatim to illustrate the AI output and are not included in the reference list.
A panel of four expert reviewers with over 50 years of combined clinical experience, including plastic and reconstructive surgeons and academic clinicians involved in the original studies, carried out the evaluation. Each reviewer independently scored the outputs using the predefined Likert-based rubric (0 = lowest, 5 = highest). The mean of the four reviewers' scores was calculated for each domain. In cases where individual scores differed by more than one point, the reviewers discussed the discrepancy and reached a consensus. While no formal inter-rater reliability coefficient was calculated, agreement was reached for all final scores. Scores were then collated and compared across models, and descriptive analyses were performed to identify patterns of improvement or persistent limitations. All outputs were anonymised before evaluation so that reviewers were blinded to the model version that had generated them.
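As a concrete illustration of the aggregation rule described above, the short sketch below computes per-domain means across four reviewers and flags any domain in which scores diverge by more than one point for consensus discussion. The domain names and scores shown are hypothetical examples, not study data.

```python
# Illustrative aggregation of four reviewers' Likert scores (0-5) per domain.
from statistics import mean

# Hypothetical scores; each list holds the four independent reviewers' ratings.
scores = {
    "factual accuracy":  [5, 5, 4, 5],
    "reference quality": [5, 4, 5, 5],
    "completeness":      [4, 3, 5, 4],  # spread of 2 points -> flagged
}

for domain, ratings in scores.items():
    flagged = max(ratings) - min(ratings) > 1  # discrepancy rule from the text
    note = "  -> discuss to consensus" if flagged else ""
    print(f"{domain}: mean = {mean(ratings):.2f}{note}")
```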
3. Results
The comparative analysis revealed marked improvements in the performance of LLMs across all evaluated domains when comparing earlier versions of ChatGPT (GPT-3.5 and GPT-4.0) to the most recent model, ChatGPT-4.5. Using a structured Likert scale assessment across nine performance dimensions, ChatGPT-4.5 consistently outperformed its predecessors in aesthetic surgery counseling, academic discussion on the base of thumb arthritis, and systematic literature review for trapeziometacarpal joint osteoarthritis.
In the aesthetic surgery domain, responses to nine standardised patient queries regarding rhinoplasty demonstrated significant enhancements in clarity, anatomical specificity, procedural depth, and psychological insight. As shown in Table 1a, GPT-3.5 offered generalised responses with limited surgical detail and a static communication tone, whereas ChatGPT-4.5 provided tailored, structured, and psychologically sensitive advice delivered in a dynamic, adaptive style. It included structured breakdowns of surgical candidacy, operative techniques, and postoperative expectations, using terminology and explanations appropriate for both laypersons and medically trained users. The model also addressed mental health considerations and lifestyle factors and engaged in two-way communication when prompted.
In the academic discussion surrounding the base of thumb arthritis, GPT-3.5 generated superficial content with fabricated or unverifiable references. It failed to contextualise evidence or adhere to academic referencing standards. Conversely, ChatGPT-4.5 produced content aligned with established evidence-based medicine frameworks, referencing Level 4 studies accurately and incorporating a meaningful critique of current limitations in the literature. Table 2 illustrates the improved accuracy, reference validity, and application of evidence hierarchies achieved with ChatGPT-4.5. It demonstrated improved scientific reasoning, appropriately ranked evidence quality, and suggested relevant multidisciplinary and innovative management strategies. The new model exhibited no hallucinated references and maintained terminological precision throughout.
The domain of systematic literature review further illustrated these improvements. In the original 2024 study, GPT-4.0 retrieved only one relevant publication compared to 23 identified by manual human search. When re-evaluated with ChatGPT-4.5 using the same prompts, the model successfully retrieved nine relevant studies, including seven that matched human-identified results. All references were verifiable, and the model correctly outlined inclusion criteria, study design, intervention details, and levels of evidence. While recall remained inferior to manual database search strategies, the overall precision, citation accuracy, and contextual summary quality significantly improved. As summarised in Table 3, ChatGPT-4.5 retrieved a greater proportion of relevant studies with higher citation precision.
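For context, if the 23 studies identified by the human searchers are treated as the reference set (an assumption made here purely for illustration), recall against that benchmark can be expressed as

\[
\text{recall}_{\text{ChatGPT-4.5}} = \frac{7}{23} \approx 0.30,
\qquad
\text{recall}_{\text{GPT-4.0}} = \frac{1}{23} \approx 0.04 .
\]

Precision cannot be derived from these figures alone, as it would require the total number of records each model returned; only recall is therefore illustrated here.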

Table 3.
Variations Between Iterations in Simulated Searches.
Aggregated Likert scale scores demonstrated that ChatGPT-4.5 achieved higher ratings in all assessed domains. These comparative scores are detailed in Table 4. The most significant improvements were observed in reference quality (+4.5), factual accuracy (+2.5), scientific reasoning (+2.5), and practical utility (+2.5), reflecting a substantial enhancement in academic and clinical relevance. A summary of key improvements across domains is provided in Table 5. Bias and error avoidance also improved markedly, mainly by eliminating hallucinated references, such as plausible-sounding articles with non-existent DOIs or incorrect author and journal combinations, and by more precisely delineating evidence limitations. All references generated by ChatGPT-4.5 were verifiable via PubMed or official publisher databases.

Table 4.
Comparative Likert Scale Analysis of Previous and Current Large Language Model Outputs Across Key Performance Domains.

Table 5.
Key Improvements (ChatGPT-4.5 versus GPT-3.5).
4. Discussion
This study provides a detailed comparative assessment of the performance evolution of LLMs across three core domains: patient counseling, academic discussion, and literature review. Our findings demonstrate that ChatGPT-4.5 substantially improves factual accuracy, scientific reasoning, reference validity, and practical utility when benchmarked against earlier iterations such as GPT-3.5 and GPT-4.0.
The observed performance differences can be partly explained by architectural and training advancements, including larger and more diverse datasets, extended context handling, and improved alignment through reinforcement learning from human feedback. In our tasks, GPT-3.5 often produced generic or inaccurate outputs, and GPT-4.0 improved contextual understanding but retained occasional gaps, whereas ChatGPT-4.5 showed greater factual accuracy, citation reliability, and adaptability. For example, in aesthetic counseling, it tailored advice to anatomical and psychological factors; in academic discussion, it avoided reference hallucinations and applied evidence hierarchies; and in literature reviews, it improved citation precision, though recall remained limited.
In the domain of aesthetic surgery counseling, previous work by Xie, Seth, Hunter-Smith, Rozen, Ross and Lee [] highlighted the capacity of GPT-3.5 to generate patient-centred information in response to rhinoplasty queries but noted key limitations, including superficial content, lack of procedural detail, and the absence of psychosocial context or individualisation of advice []. In the present study, ChatGPT-4.5 not only addressed the same patient queries with greater clarity and depth but also contextualised recommendations according to anatomical, functional, and psychological criteria. It provided segmented responses, incorporated clinical terminology relevant to plastic surgery, and demonstrated a nuanced understanding of surgical planning and postoperative considerations. This evolution indicates enhanced model training, greater access to validated clinical data, and more refined reinforcement learning processes. The addition of bidirectional interaction, offering to tailor responses based on user feedback, further reflects advancements in conversational adaptability, a domain in which prior versions underperformed.
The academic utility of LLMs in generating scholarly content was explored in our previous study on the base of thumb arthritis, where GPT-3.5 was tasked with synthesising surgical evidence and generating structured scientific commentary []. While it managed to provide general overviews of treatment modalities such as trapeziectomy, implant arthroplasty, and arthrodesis, the outputs were plagued by reference hallucination [], limited critical appraisal, and minimal engagement with the hierarchy of evidence. In contrast, ChatGPT-4.5 exhibited a clear grasp of evidence-based frameworks, correctly applying the Centre for Evidence-Based Medicine (CEBM) grading system and referencing valid studies without fabrication [,]. Furthermore, the model discussed the limitations of existing literature, including the predominance of level 4 evidence, short follow-up durations, and heterogeneity in outcome measures, issues that were previously unaddressed. This demonstrates an emerging capacity for synthetic reasoning and contextual judgement, which is critical for scientific writing and peer-reviewed publication preparation.
In our 2025 study evaluating the capacity of LLMs to conduct systematic literature searches [], GPT-4.0 was found to be cautious and free of hallucinated references, but it retrieved only a limited number of relevant articles and could not effectively replicate Boolean search logic or leverage structured databases. The re-application of the same methodology using ChatGPT-4.5 revealed marked improvements in literature identification, precision of citations, and structured presentation of study characteristics. Although the model underperformed relative to expert human reviewers in terms of recall and sensitivity, it demonstrated a significant reduction in false negatives and maintained high citation fidelity. Importantly, ChatGPT-4.5 showed improved awareness of inclusion/exclusion criteria and could discuss intervention details, follow-up durations, and levels of evidence with a degree of consistency previously unseen in LLMs. These findings indicate that while ChatGPT remains an adjunct rather than a replacement for systematic reviewers, its outputs have reached a level of maturity suitable for early-stage scoping reviews and as a supportive tool in research planning.
From a technical perspective, the most striking improvement lies in ChatGPT-4.5’s reference behaviour. Both Xie, Seth, Hunter-Smith, Rozen, Ross and Lee [] and Seth, Sinkjær Kenney, Bulloch, Hunter-Smith, Bo Thomsen and Rozen [] reported frequent hallucinations or misattributions in GPT-3.5 outputs, with some references being fabricated or improperly cited []. The absence of such errors in GPT-4.5 underscores the positive impact of improved training datasets and citation validation algorithms []. This has significant implications for the safe use of LLMs in academic and clinical contexts, where misinformation can have substantial downstream effects.
Furthermore, ChatGPT-4.5 showed improved completeness and cohesion across all evaluated tasks. These findings align with broader trends observed in LLM development, whereby enhancements in transformer architecture, dataset curation, and instruction tuning have allowed newer models to provide more contextually rich and logically consistent outputs [,,]. These technical gains translate into tangible benefits in healthcare applications, where practitioners may rely on AI-generated summaries for decision support, education, or documentation [].
Nevertheless, the study also identifies persistent limitations. ChatGPT-4.5, despite its improvements, still underperforms in information recall during complex literature searches and lacks the interpretive flexibility of human experts []. It also generalises when synthesising contentious or nuanced academic topics and occasionally omits landmark studies unless explicitly prompted. These shortcomings suggest that while LLMs are becoming increasingly valuable for scientific workflows, they should be used as assistive tools under appropriate supervision rather than autonomous knowledge sources. Similar results have been reported in independent evaluations of LLMs in healthcare, including a large systematic review analysing over 500 studies on healthcare applications of LLMs [], a clinical medicine-focused review mapping evaluation methods across multiple domains [], and a benchmarking study demonstrating that LLMs encode substantial clinical knowledge []. These works collectively underscore both the promise and the ongoing limitations of LLMs in medical contexts.
An important methodological consideration is that LLMs such as ChatGPT are trained primarily on large-scale public datasets, the composition and provenance of which are not fully transparent. As a result, their outputs may reflect the strengths and biases of these underlying sources, and the factual accuracy of generated content cannot be assumed without verification. In the context of medical research, inferences drawn from LLM outputs should therefore be treated as preliminary and always corroborated by peer-reviewed evidence. In this study, all AI-generated content was reviewed and validated by experienced clinicians to ensure clinical and academic accuracy before inclusion in the analysis. Another limitation of this study is that the evaluation was conducted in only three domains. While these scenarios were selected to represent diverse academic and clinical tasks, they do not encompass the full range of possible LLM applications in medicine. As such, the findings should be interpreted with caution when extrapolating to other specialties or task types. Future research incorporating a broader range of medical disciplines and prompt types would provide a more comprehensive assessment of LLM performance.
Finally, the results underscore the importance of ongoing validation and benchmarking of LLMs against real-world clinical and academic use cases. As these technologies evolve, researchers and clinicians must remain informed of their capabilities and limitations to ensure their ethical and effective integration into practice [] (Figure 1).

Figure 1.
Visual summary of ChatGPT-4.5’s improved performance in academic plastic surgery across counseling, scientific writing, and literature review.
5. Conclusions
This study provides compelling evidence that LLMs have undergone substantial performance improvements over the past 12 to 24 months. By directly comparing ChatGPT-4.5 to earlier versions, including GPT-3.5 and GPT-4.0, across clinically and academically relevant scenarios, we demonstrate enhanced accuracy, depth, reference validity, and overall practical utility in the newer model. These advancements position ChatGPT-4.5 as a significantly more reliable tool for supporting medical counseling, academic writing, and preliminary literature reviews. Despite these advantages, limitations remain—particularly in critical synthesis, information recall, and interpretive reasoning during complex academic tasks. As such, LLMs should be considered adjuncts rather than replacements for expert human input. Continued monitoring, validation, and responsible integration of LLMs into scientific and clinical workflows will be essential to ensure safe, ethical, and effective use. While this study focuses on ChatGPT, the observed temporal trends in output quality may reflect broader patterns applicable to other transformer-based language models with similar training trajectories. Ultimately, this comparative analysis highlights the promise of LLMs as evolving partners in scientific research and medical education while reinforcing the need for vigilance in their application. Future iterations will bring even greater utility as the technology matures, provided their development is guided by rigorous assessment and ethical oversight.
Author Contributions
Conceptualization, I.S., G.M. and B.L.; methodology, I.S., S.B. and G.M.; software, B.L.; validation, I.S., G.M. and J.N.; formal analysis, G.M.; investigation, I.S., B.L. and J.N.; resources, W.M.R. and R.J.R.; data curation, B.L.; writing—original draft preparation, G.M., S.B. and I.S.; writing—review and editing, G.M., I.S. and B.L.; visualization, B.L.; supervision, W.M.R. and R.C.; project administration, I.S. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
No new data were created or analysed in this study.
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Tan, S.; Xin, X.; Wu, D. ChatGPT in medicine: Prospects and challenges: A review article. Int. J. Surg. 2024, 110, 3701–3706. [Google Scholar] [CrossRef] [PubMed]
- Tangsrivimol, J.A.; Darzidehkalani, E.; Virk, H.U.H.; Wang, Z.; Egger, J.; Wang, M.; Hacking, S.; Glicksberg, B.S.; Strauss, M.; Krittanawong, C. Benefits, limits, and risks of ChatGPT in medicine. Front. Artif. Intell. 2025, 8, 1518049. [Google Scholar] [CrossRef] [PubMed]
- Xie, Y.; Seth, I.; Hunter-Smith, D.J.; Rozen, W.M.; Ross, R.; Lee, M. Aesthetic surgery advice and counseling from artificial intelligence: A rhinoplasty consultation with ChatGPT. Aesthetic Plast. Surg. 2023, 47, 1985–1993. [Google Scholar] [CrossRef] [PubMed]
- Chelli, M.; Descamps, J.; Lavoué, V.; Trojani, C.; Azar, M.; Deckert, M.; Raynier, J.-L.; Clowez, G.; Boileau, P.; Ruetsch-Chelli, C. Hallucination rates and reference accuracy of ChatGPT and Bard for systematic reviews: Comparative analysis. J. Med. Internet Res. 2024, 26, e53164. [Google Scholar] [CrossRef] [PubMed]
- Sallam, M. ChatGPT utility in healthcare education, research, and practice: Systematic review on the promising perspectives and valid concerns. Healthcare 2023, 11, 887. [Google Scholar] [CrossRef] [PubMed]
- Seth, I.; Kenney, P.S.; Bulloch, G.; Hunter-Smith, D.J.; Thomsen, J.B.; Rozen, W.M. Artificial or augmented authorship? A conversation with a chatbot on base of thumb arthritis. Plast. Reconstr. Surg. Glob. Open 2023, 11, e4999. [Google Scholar] [CrossRef] [PubMed]
- Seth, I.; Marcaccini, G.; Lim, K.; Castrechini, M.; Cuomo, R.; Ng, S.K.-H.; Ross, R.J.; Rozen, W.M. Management of Dupuytren’s disease: A multi-centric comparative analysis between experienced hand surgeons versus artificial intelligence. Diagnostics 2025, 15, 587. [Google Scholar] [CrossRef] [PubMed]
- Seth, I.; Lim, B.; Xie, Y.; Ross, R.J.; Cuomo, R.; Rozen, W.M. Artificial intelligence versus human researcher performance for systematic literature searches: A study focusing on the surgical management of base of thumb arthritis. Plast. Aesthetic Res. 2025, 12, 1. [Google Scholar] [CrossRef]
- Nematov, D. Progress, challenges, threats and prospects of ChatGPT in science and education: How will AI impact the academic environment? J. Adv. Artif. Intell. 2025, 3, 187–205. [Google Scholar] [CrossRef]
- Yang, J.J.; Hwang, S.-H. Transforming hematological research documentation with large language models: An approach to scientific writing and data analysis. Blood Res. 2025, 60, 15. [Google Scholar] [CrossRef] [PubMed]
- Kumar, I.; Yadav, N.; Verma, A. Navigating artificial intelligence in scientific manuscript writing: Tips and traps. Indian J. Radiol. Imaging. 2025, 35, S178–S186. [Google Scholar] [CrossRef] [PubMed]
- Marcaccini, G.; Seth, I.; Xie, Y.; Susini, P.; Pozzi, M.; Cuomo, R.; Rozen, W.M. Breaking bones, breaking barriers: ChatGPT, DeepSeek, and Gemini in hand fracture management. J. Clin. Med. 2025, 14, 1983. [Google Scholar] [CrossRef] [PubMed]
- On, S.W.; Cho, S.W.; Park, S.Y.; Ha, J.-W.; Yi, S.-M.; Park, I.-Y.; Byun, S.-H.; Yang, B.-E. Chat generative pre-trained transformer (ChatGPT) in oral and maxillofacial surgery: A narrative review on its research applications and limitations. J. Clin. Med. 2025, 14, 1363. [Google Scholar] [CrossRef] [PubMed]
- Wang, J.; Shue, K.; Liu, L.; Hu, G. Preliminary evaluation of ChatGPT model iterations in emergency department diagnostics. Sci. Rep. 2025, 15, 10426. [Google Scholar] [CrossRef] [PubMed]
- Sharma, A.; Rao, P.; Ahmed, M.Z.; Chaturvedi, K. Artificial intelligence in scientific writing: Opportunities and ethical considerations. Int. J. Res. Med. Sci. 2024, 13, 532–542. [Google Scholar] [CrossRef]
- Bedi, S.; Liu, Y.; Orr-Ewing, L.; Dash, D.; Koyejo, S.; Callahan, A.; Fries, J.A.; Wornow, M.; Swaminathan, A.; Lehmann, L.S.; et al. Testing and Evaluation of Health Care Applications of Large Language Models: A Systematic Review. JAMA 2025, 333, 319–328. [Google Scholar] [CrossRef] [PubMed] [PubMed Central]
- Shool, S.; Adimi, S.; Amleshi, R.S.; Bitaraf, E.; Golpira, R.; Tara, M. A systematic review of large language model (LLM) evaluations in clinical medicine. BMC Med. Inform. Decis. Mak. 2025, 25, 117. [Google Scholar] [CrossRef] [PubMed] [PubMed Central]
- Singhal, K.; Azizi, S.; Tu, T.; Mahdavi, S.S.; Wei, J.; Chung, H.W.; Tanwani, A.; Cole-Lewis, H.; Pfohl, S.; Payne, P.; et al. Large language models encode clinical knowledge. Nature 2023, 620, 172–180. [Google Scholar] [CrossRef] [PubMed]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).