Systematic Review

A Systematic Review of Large Language Models in Medical Specialties: Applications, Challenges and Future Directions

1 Department of Information Technology, College of Computing and Information Sciences, University of Technology and Applied Sciences, Ibri 511, Oman
2 School of Computing, Macquarie University, Sydney, NSW 2109, Australia
3 AI Applications Research Chair, University of Nizwa, Nizwa 616, Oman
4 Department of Computer Science, Faculty of Computers and Informatics, Zagazig University, Zagazig 44519, Egypt
5 The Anuradha and Vikas Sinha Department of Data Science, University of North Texas, Denton, TX 76203, USA
6 Department of Hematology, College of Medicine & Health Science, Sultan Qaboos University, Muscat 123, Oman
* Author to whom correspondence should be addressed.
Information 2025, 16(6), 489; https://doi.org/10.3390/info16060489
Submission received: 26 April 2025 / Revised: 22 May 2025 / Accepted: 7 June 2025 / Published: 12 June 2025

Abstract

This systematic review evaluates recent literature from January 2021 to March 2024 on large language model (LLM) applications across diverse medical specialties. Searching PubMed, Web of Science, and Scopus, we included 84 studies. LLMs were applied to tasks such as clinical natural language processing, medical decision support, education, and aiding diagnostic processes. While studies reported benefits such as improved efficiency and, in some specific NLP tasks, accuracy above 90%, significant challenges persist concerning reliability, ethical implications, and performance consistency; accuracy in broader diagnostic support applications showed substantial variability, falling as low as 3% in some cases. The overall risk of bias was low in 72 of the 84 reviewed studies. Key findings highlight substantial heterogeneity in LLM performance across different medical tasks and contexts, which, together with a lack of standardized methodologies, prevented meta-analysis. Future efforts should prioritize developing domain-specific LLMs using robust medical data and establishing rigorous validation standards to ensure their safe and effective clinical integration. Trial registration: PROSPERO (CRD42024561381).

1. Introduction

Rapid advancements in natural language processing (NLP) are shaping the domain of artificial intelligence (AI) and its subfield of large language models (LLMs) (e.g., GPT and BERT) [1,2]. LLMs can be defined as deep neural network models with transformer-based architectures trained on large text corpora, capable of understanding context, generating human-like text, and performing complex language tasks. A model's size, which determines its complexity and processing capability, ranges from millions to billions of parameters, hence the term "large". LLMs use complex deep learning (DL) architectures to autonomously learn linguistic patterns and semantics from unprecedentedly large training datasets, distinguishing them from approaches that rely heavily on predefined rules and feature engineering. This enables LLMs to demonstrate human-like capabilities in understanding and generating text, summarizing information, and comprehending contextual cues with remarkable accuracy. These capabilities have attracted growing interest in applying LLMs to healthcare and medicine.
With applications ranging from drug discovery and development, clinical decision support [3], patient care [4], and research and documentation [5] to medical education and licensing [6], LLMs hold significant potential for various medical specialties by improving diagnostic support accuracy [7], optimizing downstream tasks [8], and enhancing patient care [9]. LLMs also have the ability to continuously learn from ever-evolving medical knowledge, which ensures their adaptability and relevance in dynamic medical environments, thereby promising sustained development in clinical practice. The applications of LLMs in medicine are diverse, targeting a wide range of users including clinicians, medical researchers, educators, students, and patients. The use cases span general administrative and documentation support, specialized clinical decision support, medical educational tools, patient communication aids, and clinical NLP tasks. This review aims to synthesize the evidence across these varied applications within different medical specialties, highlighting the levels at which LLMs are being developed and evaluated, from foundational general-purpose models, through fine-tuned models, to medical-specific LLMs.
Even with increasing applications of LLMs in many medical specialties, the literature reports challenges and negative impacts of employing LLMs in medical practices and processes. These include accuracy issues [10], ethical and legal concerns [11,12], inconsistent reliability [7], lack of clinical knowledge [6], inability to handle specific tasks [13], and challenges related to integration and implementation [14]. The reported challenges pertain to all LLMs, including general-purpose and clinically trained models. This necessitates a rigorous evaluation of LLMs' performance and impact across different medical disciplines. Several systematic reviews on the use of LLMs for medical applications exist [15,16]. The key challenges discussed in this study are depicted in Figure 1.
Existing reviews are mostly limited to the use of LLMs in a single medical specialty, omit evaluation of LLM effectiveness, or offer limited or no discussion of the impact and challenges of LLMs across medical domains. Existing systematic reviews can be categorized into three types based on their focus: (1) studies focusing on a very narrow medical field (a single specialty), (2) studies focusing on a single task performed by LLMs, and (3) studies reviewing several LLMs across different medical specialties. All three classes can focus on a single LLM (e.g., GPT) or several LLMs. Examples of the first type, which reviewed LLM tasks within a single medical specialty, include [17], which studied the performance of LLMs in orthopedics, and [18], which systematically reviewed the ability of GPT-4 and Bard to pass the Fundamentals of Robotic Surgery (FRS) didactic test. The second type focuses on a single task performed by a single LLM, as in [19], which reviewed different versions of ChatGPT in medical licensing examinations worldwide, not limited to a specific specialty. An example of the third type is [20], where the authors reviewed the various applications of GPT across multiple medical specialties. The current study distinguishes itself from existing systematic reviews in several ways. First, its coverage extends to March 2024, whereas other studies have a shorter time span; for example, [20] only covered the period up to September 2023. Second, this study focuses exclusively on top-tier journal publications, specifically WoS Q1 journals, to ensure a focus on the highest-quality papers.
This systematic review considers both proprietary (e.g., GPT) and open-source (e.g., BERT) LLMs. This study aims to (1) assess the current uses of LLMs in medical specialties and the accuracy reported in original studies, (2) explore the most common challenges of LLMs in these medical specialties, and (3) identify potential future applications of LLMs in medical specialties. The quality of the included studies is assessed for potential biases.
This systematic review adheres to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. We systematically searched multiple databases, screened studies based on predefined criteria, extracted relevant data, and assessed the quality of included research using the QUADAS-AI tool to synthesize the applications, performance, challenges, and future directions of LLMs across various medical specialties. The detailed methodology is described in the Methods Section, and the flow of study identification and selection is presented in Figure 2, depicting the steps from data acquisition to the final inclusion of studies.
This systematic review aims to provide a comprehensive overview of LLM utilization across a wide array of medical specialties, synthesizing evidence on their applications, reported performance, and inherent challenges. The novelty of this work lies in its broad scope across 19 specialties, its specific focus on recent literature (2021–2024) reflecting the current advanced LLM era, and its detailed analysis of evaluation metrics and risk of bias using the QUADAS-AI tool. By mapping the landscape of LLM research in medicine, this review contributes to identifying current capabilities, critical gaps, and robust future research directions, offering valuable insights for researchers, clinicians, and policymakers navigating this rapidly advancing domain.
The remainder of this paper is organized as follows: Section 2 details the methodology used for this systematic review. Section 3 presents the main findings, including study characteristics, LLM applications, and performance. Section 4 discusses these findings, their implications, and ethical considerations. Section 5 presents future research directions. Section 6 outlines the limitations of this review, and Section 7 concludes the review.

2. Materials and Methods

This systematic review followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses statement (PRISMA [21]). The PRISMA flow diagram of the included studies is presented in Figure 2, and its checklist is listed in Supplementary Table S4. The review protocol was also registered at PROSPERO (CRD42024561381) and is available via https://www.crd.york.ac.uk/prospero/display_record.php?RecordID=561381 (accessed on 13 July 2024). We assessed existing literature in which LLMs were applied within a medical specialty to one of the following applications: diagnosis, medical education, decision support, or clinical NLP.

2.1. Search Strategy

A comprehensive search of PubMed, Web of Science (WoS), and Scopus was performed for journal articles in English published from 2021 to March 2024. The search query, a carefully constructed combination of keywords, field codes (e.g., TITLE-ABS-KEY), and Boolean operators (e.g., AND, OR, NOT) designed to systematically retrieve a particular set of scholarly literature from indexed journal databases such as Scopus, was ((“large language models” OR LLMs OR GPT-2 OR GPT-3 OR GPT-3.5 OR GPT-4 OR ChatGPT OR BERT OR LAMDA OR RoBERTa OR “Turing-NLG” OR “Turing Bletchley” OR XLNet OR “Transformer-XL” OR ERNIE OR ELECTRA OR ALBERT) AND (clinical OR disease OR healthcare OR medical OR patient OR diagnosis OR therapy OR treatment OR surgical OR oncology OR cardiology OR neurology OR psychological OR immunology OR genomics OR bioinformatics OR “public health”)), as explained in Listing 1 and detailed in Supplementary Table S6. Two investigators initially screened all retrieved records to remove duplicates and verify the eligibility criteria. Discrepancies were resolved by discussion and consensus or by a third investigator.
Listing 1. Categories and terms applied in the search queries.
Search concepts combined using “AND”
  • Large Language models
  • Medical Specialties
Search terms combined using “OR”
  • Different LLMs names and generic terms of LLMs
  • Different medical specialties as classified in [22]
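For illustration, the query structure in Listing 1 can be assembled programmatically. The sketch below is not the authors' actual tooling; it merely demonstrates how each concept group is OR-joined and the two groups are then AND-combined, using the terms given in the query above.

```python
# Illustrative sketch (not the authors' tooling): building the Boolean
# search query by OR-joining each concept group and AND-joining the groups.
llm_terms = [
    '"large language models"', "LLMs", "GPT-2", "GPT-3", "GPT-3.5", "GPT-4",
    "ChatGPT", "BERT", "LAMDA", "RoBERTa", '"Turing-NLG"',
    '"Turing Bletchley"', "XLNet", '"Transformer-XL"', "ERNIE", "ELECTRA",
    "ALBERT",
]
medical_terms = [
    "clinical", "disease", "healthcare", "medical", "patient", "diagnosis",
    "therapy", "treatment", "surgical", "oncology", "cardiology", "neurology",
    "psychological", "immunology", "genomics", "bioinformatics",
    '"public health"',
]

def or_group(terms):
    """Join the terms of one search concept with OR and wrap in parentheses."""
    return "(" + " OR ".join(terms) + ")"

# The two concept groups are combined with AND, as in Listing 1.
query = "(" + or_group(llm_terms) + " AND " + or_group(medical_terms) + ")"
print(query)
```

In practice, such a string would be pasted into a database's advanced search interface (e.g., combined with a field code like TITLE-ABS-KEY in Scopus).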

2.2. Inclusion and Exclusion Criteria

This review considered all peer-reviewed journal articles that studied the application of LLMs in medical specialties. The titles and abstracts of all retrieved records were rigorously screened. The inclusion criteria accepted any peer-reviewed journal article that applied one or more LLMs to a medical specialty application. The exclusion criteria were (1) non-English publications, (2) no use of LLMs, (3) journal article types other than original research (reviews, editorials, reports, commentaries, letters, viewpoints, reflections, and notes), and, to restrict the review to high-quality research, (4) studies not published in a Q1 WoS journal.

2.3. Data Extraction

The main outcomes of this systematic review were to identify the medical specialties where LLMs are utilized and to assess the LLMs' effectiveness and challenges. Data were extracted and recorded under the following study characteristics: authors, year of publication, country of affiliation, objective, LLM model, medical specialty, application (decision support, diagnoses, medical education, and clinical NLP) (available in Supplementary Table S1), and evaluation metrics (available in Supplementary Table S2). Data were independently extracted by four reviewers. Discrepancies were discussed and resolved either by consensus or by a fifth reviewer.

2.4. Quality Assessment

One of the inclusion criteria in this review was publication in a Q1 WoS journal, to distinguish studies by the quality of their publication source. We modified QUADAS-AI [23] to evaluate the risk of bias of the included studies. The heterogeneity of specialties and the rapid development of LLMs also posed challenges in establishing consistent quality criteria across studies; therefore, we could not perform a meta-analysis. The modified tool is presented in Supplementary Table S5.
In this systematic review, we employed the revised tool for the Quality Assessment of Diagnostic Accuracy Studies tailored for Artificial Intelligence (QUADAS-AI) [23] to evaluate the risk of bias and applicability of the included studies. The QUADAS-AI tool was chosen because it is specifically designed to assess diagnostic accuracy studies involving AI technologies, ensuring a systematic and standardized evaluation of potential biases unique to AI applications.
We made minor modifications to the QUADAS-AI tool to align it with the context of LLM applications across various medical specialties. These modifications included adapting the phrasing of certain signaling questions to broadly cover aspects such as data source, size and quality, LLM model development and validation transparency, and the rigor of performance evaluation, while maintaining the core domains of the original tool. Studies were rated as having a ’low’, ’high’, or ’unclear’ risk of bias across the tool’s domains. The detailed modified QUADAS-AI framework is available in Supplementary Table S5.

3. Results

3.1. Study Selection

The initial search retrieved 6806 records, of which 2095 duplicates were removed, 4448 records were excluded after screening, and 6 records could not be retrieved. The remaining 257 publications were fully reviewed for potential inclusion, of which only 84 studies met the inclusion criteria and were included in this review.
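As a quick check, the selection counts above are internally consistent; the minimal sketch below walks through the arithmetic (note that the 173 full-text exclusions are inferred from the stated numbers, not reported directly in the text).

```python
# PRISMA flow arithmetic for the study selection counts in Section 3.1.
retrieved = 6806       # records found in the initial search
duplicates = 2095      # duplicate records removed
screened_out = 4448    # records excluded at title/abstract screening
not_retrieved = 6      # records that could not be retrieved

full_text_reviewed = retrieved - duplicates - screened_out - not_retrieved
included = 84
excluded_full_text = full_text_reviewed - included  # inferred, not stated

print(full_text_reviewed)   # 257, matching the reported count
print(excluded_full_text)   # 173 full-text exclusions implied
```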

3.2. Study Characteristics

The characteristics of the included studies are analyzed in Supplementary Table S1, which provides a detailed summary of all reviewed studies. There was an upward trend in the publication rate from 2021 onward, with 41/84 publications published in 2023. This suggests increasing interest in using LLMs for medical specialties. The analysis showed that about 33% of the included studies were conducted by institutes located in North America (USA = 35, Canada = 2), while around 31% were affiliated with institutions in Asia (China = 16, Korea = 5, India = 2, Japan = 2, Taiwan = 2, Singapore = 2, UAE = 1). Around 23% of the studies were European affiliated (Germany = 6, Italy = 5, UK = 4, France = 2, Spain = 2, Finland = 1, Greece = 1, Portugal = 1, Netherlands = 1, Serbia = 1). A limited number of studies were affiliated with Australia (n = 3, 3%) and South America (Brazil = 1, 1%). These distributions highlight the global interest in utilizing LLMs in the medical field.

3.3. Medical Specialties

The included studies were categorized into one of 19 medical specialties as per specialty profiles of the Association of American Medical Colleges [22]. Several studies that crossed specialty boundaries or addressed broad medical practices (e.g., therapy recommendations spanning ophthalmology, orthopedics, and dermatology) were classified as multi-specialty and included in this study, as long as they met the other inclusion criteria (peer-reviewed, Q1 WoS journal, original research).
More studies (26%, n = 22) used LLMs in multi-specialty contexts than in any individual specialty. The individual specialties were Surgery (approx. 12%, n = 10), Neurology (approx. 10%, n = 8), Radiology (approx. 10%, n = 8), Emergency Medicine (7%, n = 6), Psychiatry (7%, n = 6), Epidemiology (approx. 6%, n = 5), Infectious Diseases (approx. 5%, n = 4), Pathology (approx. 4%, n = 3), Oncology (2%, n = 2), Cardiology (1%, n = 1), Endocrinology (1%, n = 1), ENT (1%, n = 1), Geriatrics (1%, n = 1), Immunology (1%, n = 1), Medical Genetics and Genomics (1%, n = 1), Ophthalmology (1%, n = 1), Pediatrics (1%, n = 1), Urology (1%, n = 1), and Nephrology (1%, n = 1).

3.4. Application of LLMs

We identified five main applications of the LLMs investigated in the literature: clinical NLP (approx. 37%, n = 31), decision support (approx. 24%, n = 20), medical education (approx. 18%, n = 15), diagnoses (approx. 18%, n = 15), and patient management and engagement (approx. 4%, n = 3).

3.5. LLMs Used

Studies used either a single LLM (82%, n = 69) or multiple LLMs (18%, n = 15). Over half of the studies used GPT-based LLMs (51%, n = 43), approx. 48% (n = 40) used BERT-based LLMs, while only approx. 10% (n = 8) used other types of LLMs.

3.6. Type of LLMs

The performance and reliability of LLMs depend significantly on training and fine-tuning. The reviewed studies used one of three types of LLMs. (1) General-purpose LLMs are trained on large general-purpose datasets rather than specific medical datasets. (2) Fine-tuned LLMs are pre-trained general-purpose LLMs further trained on specific tasks or domains using relevant smaller datasets [24]. (3) Medical-specific LLMs are developed by training a model from scratch on specific medical datasets derived from clinical notes, electronic health records (EHRs), and medical literature [25]. Most of the included studies used out-of-the-box general-purpose LLMs (57%, n = 48), followed by fine-tuned LLMs (approx. 36%, n = 30) and medical-specific LLMs trained from scratch (7%, n = 6).

3.7. Reported Impact

Most of the included studies reported impacts of applying LLMs to medical specialties, and all reported one or more positive impacts. Positive impacts included improved diagnosis accuracy (approx. 24%, n = 20), enhanced medical process efficiency (approx. 29%, n = 24), better decision-making support (21%, n = 18), support for medical education (15%, n = 13), and enhanced patient care (approx. 11%, n = 9). Reported negative impacts were accuracy issues (approx. 24%, n = 20), inconsistent reliability (51%, n = 43), lack of clinical knowledge (approx. 13%, n = 11), ethical issues (approx. 4%, n = 3), and inability to handle certain tasks (7%, n = 6), while no negative impacts were reported in 1% (n = 1) of the reviewed studies.

3.8. Performance Evaluation

Several evaluation metrics of LLM performance were reported, and each study used at least one. Over half of the included studies (approx. 55%, n = 46) provided accuracy-related metrics, the F1-score was reported in 25% of the included studies (n = 21), and precision and recall were used in 13% (n = 11) and approx. 12% (n = 10), respectively. Other reported evaluation measurements were AUC-ROC (approx. 11%, n = 9) and Likert scales (approx. 10%, n = 8). Several evaluation metrics are not included in Table 1 due to their limited usage across the reviewed papers and to keep the table concise given the large number of potential metrics. Detailed evaluation metrics are presented in Supplementary Table S2. The reviewed studies were highly heterogeneous in terms of medical specialties and LLM applications; therefore, a pooled analysis of specificity and sensitivity could not be performed. Each study was conducted with different datasets, metrics, and evaluation methodologies best suited to its context; hence, conducting such analyses would risk compromising integrity and accuracy. A descriptive synthesis was thus carried out to provide an overview.
The performance of LLMs on medical tasks was evaluated across 84 studies, categorized by their comparison approach: no comparison, comparison to existing tools/algorithms, and comparison to human professionals. A significant portion (27/84) presented LLM results without any comparative benchmark. Among the 34 studies comparing LLMs to other tools/algorithms, a clear majority (29/34) demonstrated superior LLM performance, while 3 showed lower performance and 2 achieved similar performance. In the 23 studies directly comparing LLMs to human professionals, a different pattern emerged: 11 favored human expertise, 6 indicated superior LLM performance, 5 showed equivalent performance, and 1 reported an uncertain comparison. This analysis reveals a complex landscape in which LLMs demonstrate considerable promise in some medical tasks, outperforming existing computational methods, but human expertise remains superior in a substantial number of cases. Collectively, across all comparative studies (n = 57), LLMs performed better in 35 studies, worse in 14 studies, achieved the same level of performance in 7 studies, and the comparison was uncertain in 1 instance. While a summary of the LLM accuracy in the reviewed studies is provided in Table 1, comprehensive analysis results are presented in Supplementary Table S2.
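The comparative counts above can be tallied to confirm they are internally consistent (a minimal sketch: the comparative subtotal is the 84 included studies minus the 27 without a benchmark).

```python
# Tally of comparative outcomes reported in Section 3.8.
vs_tools = {"better": 29, "worse": 3, "same": 2, "uncertain": 0}    # 34 studies vs. tools/algorithms
vs_humans = {"better": 6, "worse": 11, "same": 5, "uncertain": 1}   # 23 studies vs. human professionals

# Combine the two comparison groups into the collective totals.
collective = {k: vs_tools[k] + vs_humans[k] for k in vs_tools}

no_comparison = 27
comparative_total = sum(collective.values())

print(collective)                         # {'better': 35, 'worse': 14, 'same': 7, 'uncertain': 1}
print(comparative_total)                  # 57 comparative studies
print(comparative_total + no_comparison)  # 84 included studies in total
```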

3.9. Validation Approach

The LLM performance in the included studies was either validated (82%, n = 69) or not validated (approx. 18%, n = 15). Performance was compared against that of medical professionals or validated by human experts (33%, n = 28), or compared against other tools, including other LLMs or algorithms (approx. 49%, n = 41).

3.10. Quality Assessment Results

The quality and risk of bias were assessed using the modified tool, with disagreements resolved through consensus discussions. Each of the four domains was evaluated for risk of bias as low, high, or unclear. The overall risk of bias was low in 72 studies, high in 11 studies, and unclear in 1 study. The detailed results are presented in Supplementary Table S3.

4. Discussion

4.1. Principal Findings

4.1.1. Overview and Scope of the Review

This review comprehensively evaluates the effectiveness, impacts, and challenges of LLMs utilized across several medical specialties by reviewing the most relevant high-quality (SCIE-indexed Q1 journal) studies from the Web of Science, PubMed, and Scopus databases. While there have been many reviews on LLM applications in healthcare and medicine since 2021, only a few systematic reviews focus on evaluating the impact of LLM applications in medical specialties. Two reviews focused on evaluating the use of ChatGPT in gastroenterology [46,47], while others assessed the performance of ChatGPT in medical examinations [48,49,50,51]. This systematic review therefore stands out as a comprehensive survey and analysis of the included studies to (1) assess the performance of LLMs in medical applications based on reported evaluation metrics such as the F1-score, (2) identify the evidence that supports or hinders the use of LLMs in individual medical domains, and (3) identify future directions for applying LLMs in medical specialties. The findings of this review suggest that LLMs showed acceptably high accuracy in most of the reported applications. However, some concerns have been raised, such as hallucinations and false information [52].
This review showed an upward trend between 2021 and 2024 in publications focused on the application of LLMs in medical specialties. This correlates with the major releases and widespread use of new and advanced LLMs and reflects the increasing maturity of LLMs, especially ChatGPT, BERT, and domain-specific models, in performing medical-related tasks. The maturity and advancement of LLMs are expected to continue expanding their role in medical applications. Although this review covers literature up to March 2024, more recent perspectives continue to highlight these advancements, emphasizing the growing influence of generative AI in automating administrative tasks, supporting clinical decision-making, and even in novel areas such as multimodal data integration. The focus on both general-purpose LLMs such as GPT-4 and the push for domain-specific models remains a pertinent theme [53].
Our review indicates that LLM development in medicine occurs at multiple levels. Foundational models such as GPT and BERT, pre-trained on large general-purpose datasets, often serve as the base. These models are then frequently fine-tuned on specific medical corpora, such as clinical notes and biomedical literature, to adapt them to particular tasks or specialties. This process often improves their multi-task capabilities relevant to the medical domain, such as understanding clinical terminology, performing information extraction, generating coherent medical text, and even assisting in complex reasoning tasks. While this review focused on text-based applications, recent LLM developments suggest increasing integration with multi-modal data, including imaging, structured EHR data, and genomics, to provide more holistic decision support and diagnostic assistance.
Figure 3 provides a visual representation of the research studies sorted by country of affiliation. The United States comprises the largest share with 29 studies, followed by China with 16. Germany, Italy, the UK, and Korea contributed 4 studies each, and most of the remaining studies come from a larger number of countries that contributed just one or two each. The distribution illustrates a substantial imbalance: the two top-ranked countries (USA and China) account for a high proportion of studies, while the many countries with one or two studies reflect broad but proportionally small contributions from other parts of the world.

4.1.2. Performance Comparison Across Specialties

A total of 22 reviewed studies utilized LLMs in more than one specialty, reflecting wide adoption across different domains of medicine. The application reported in [10] provided therapy recommendations across multiple specialties: ophthalmology, orthopedics, and dermatology. This suggests the potential of LLMs to support multi-specialty use; the study found no significant difference in harmfulness ratings across the three specialties, although quality scores varied on specific questions. Surgery, neurology, and radiology were the leading specialties with 10, 8, and 8 studies, respectively. In neurosurgery, GPT-4 excelled on written board questions [54] but required caution for patient information [41]. In orthopedics, resident performance surpassed GPT-4 on exam questions [55], while LLMs showed promise for improving radiology report readability [56] and answering general patient queries [57].
LLMs can improve aspects of surgical education [54], assist in extracting evidence [58] and diagnoses from radiology reports [59,60], and aid in interpreting neurological data [61]. Emerging applications in psychiatry, emergency medicine, and epidemiology are increasing. LLMs may improve the understanding and diagnosis of psychiatric disorders [62]. LLMs can also help prioritize patient care by rapidly and correctly classifying patients in triage [63,64]. However, in these sensitive medical specialties, accuracy, privacy, and ethical concerns must be considered [65,66].

4.1.3. Key Areas of Utilization of LLMs

Of the studies included in this review, 31 focused on clinical NLP applications. This highlights the potential of LLMs to analyze large volumes of unstructured data, including clinical or nursing notes, medical records, and patient records from different medical specialties [67]. The capability of LLMs to handle unstructured data can also facilitate its conversion to structured data and its integration into electronic medical record systems and clinical decision support systems [68], which was reported in 20 reviewed studies, indicating growing interest in LLMs to assist medical staff. However, the performance of LLMs appears highly dependent on the type and structure of the input medical text. LLMs perform differently when processing structured EHR data [69,70,71], semi-structured radiology or pathology reports [25,60], unstructured clinical notes [72], medical exam questions [54], patient-generated text/queries [73], or scientific literature [74]. Challenges arise with long documents requiring segmentation or specialized architectures [41], and performance can degrade with variability in documentation styles or formats [75]. This suggests that LLM applicability is not uniform across all forms of medical text; therefore, careful attention to input data quality is needed when applying LLMs to clinical NLP and decision support systems. Medical education and diagnoses received similar focus, recognizing the potential of LLMs to enhance learning and training experiences and to assist in diagnostic support tasks. The small number of publications on patient management and engagement reflects either technical challenges in implementing LLMs for these applications or their untapped potential.
It is crucial to underscore that while LLMs are being explored for tasks related to diagnosis, their role is generally conceptualized as decision support tools to assist qualified healthcare professionals. The ultimate diagnostic responsibility remains with licensed physicians, and the deployment of LLMs in clinical practice must adhere to rigorous ethical guidelines and regulatory frameworks. Future research and deployment must navigate the evolving regulatory landscape to ensure patient safety and maintain professional accountability.

4.1.4. Dominant LLMs and Training Approaches

Although GPT and BERT are general-purpose LLMs, they collectively dominated the applications in medical specialties, suggesting that researchers leverage their capabilities without needing to train LLMs from scratch (entirely on medical datasets) or with only minimal fine-tuning. Both are perceived positively for their ease of use and utility in streamlining tasks such as drafting administrative letters [76], generating initial summaries [77], or providing general medical information [39,78].
The limited presence of other LLMs indicates the maturity and accessibility of GPT and BERT in comparison to emerging LLMs. Comparative studies often indicate performance improvements with newer model versions, such as GPT-4 over GPT-3.5 [54,79], although the choice of optimal model and training strategy appears task- and data-dependent [80,81]. This rapid evolution highlights the fast pace of development in LLM capabilities but also underscores that performance benchmarks quickly become outdated, necessitating continuous re-evaluation as LLMs are updated [31].
Another consideration is the computational requirements of GPT and BERT relative to available resources, which lead many medical applications to favor these two model families over emerging ones. The dominance of GPT and BERT also shaped how applications technically utilized LLMs. Most applications (48/84) used pre-trained general-purpose models, demonstrating the accessibility of ready-made LLMs for a wide range of medical applications without extensive computational resources or training on specific datasets. Fine-tuned models were applied in 30/84 studies, reflecting keen interest in customizing general-purpose LLMs (e.g., BERT) for domain-specific tasks, including automating clinical notes [82], extracting knowledge from clinical notes [72], and understanding mental health disorders [83]. Several studies also highlight the potential and limitations of zero-shot or few-shot learning approaches, particularly leveraging GPT [84,85,86]. Notably, prompt-based fine-tuning emerges as a technique capable of achieving competitive performance with significantly less training data [60].
Although developing and training LLMs from scratch optimizes a model for advanced medical applications, the small number of applications taking this route demonstrates the demanding computational and data requirements involved. This makes the approach less viable for medical researchers and professionals without adequate computing resources and expertise. However, emerging applications exist in diagnostics support [87], genomics data mining [88], and interpreting medical reports [89]. Such domain-specific LLMs (e.g., CancerBERT in [52], medBERT.de in [90], SurgicBERTa in [86]) frequently demonstrate superior performance compared to general-purpose LLMs [80], suggesting that specialized medical language benefits from targeted training.
While the majority of studies focus on English, many investigate LLM performance in other languages [6,90,91,92,93,94]. These studies reveal variable performance and specific challenges, such as the worse performance of GPT-3.5 on a Chinese national medical licensing exam compared with its English translation [6], the lack of adequate lexical resources and annotated corpora in Serbian [7], and the need for domain-specific German LLMs [90]. These challenges indicate that LLM effectiveness is not uniform across languages and is often constrained by the availability of high-quality, language-specific medical training data and resources [93].

4.1.5. Impact Assessment

The identified positive impact of using LLMs across medical specialties demonstrates their active contributions beyond theory. The majority of reported advantages revolve around improving medical process efficiency and advancing diagnostic accuracy. This suggests that LLMs can contribute to routine tasks, including optimizing administrative procedures and downstream tasks, allowing medical professionals to focus more on patient care. The positive impact of LLMs on diagnostic support reflects their capability to augment human expertise, potentially leading to faster and more accurate diagnoses [95]. However, a balance has to be maintained in which LLMs are studied to enhance, not replace, human expertise in diagnostics. Further, LLMs can provide valuable assistance to medical practitioners in clinical decision-making in specific situations, such as triage assessment, by offering evidence-based recommendations, judgment, and analytics [96]. A notable impact was also described for supporting medical education and patient engagement, indicating the potential of LLMs to provide and facilitate interactive learning experiences for both medical professionals and patients. LLMs have been proven to pass different medical examinations [11], reflecting their credibility and accuracy as educational tools. While patient care enhancement was the lowest-reported benefit, this may imply an indirect impact of LLMs through improved diagnostic accuracy and process efficiency. Ultimately, this review did not conduct a meta-analysis due to the heterogeneous nature of the compiled studies.
Several significant concerns were also reported by most of the reviewed studies. Forty-three studies reported inconsistent reliability, where LLMs may fail to maintain the same level of performance, produce different outputs for the same input, or vary in output quality. This poses major challenges for integrating LLMs into medical information systems where reliability is essential, and may consequently reduce the adoption of, and trust in, LLMs due to the potential risks. Another 20 studies reported one or more forms of accuracy issues, including unclear diagnoses and missing essential treatment recommendations [10]. This shows the severity of the implications for patient safety and treatment; therefore, training LLMs on diverse, high-quality medical datasets and continuing validation in real-world medical scenarios are crucial to ensure accuracy and consistent reliability. The reported lack of clinical knowledge is expected given the reliance on general-purpose LLMs that might not be adequately trained to perform medical tasks. Ethical concerns can be direct or associated with other considerations, and they pose major challenges due to the evolving nature of the technology, spanning patient privacy, transparency, model bias, and data security.

4.1.6. Evaluation Metrics and Performance Assessment

Evaluation metrics are crucial to assess the performance, reliability, and safety of LLMs. For instance, in machine learning, an evaluation metric for a classification task differs from that of a clustering task. In this review, the most commonly used evaluation metrics are accuracy, F1-score, precision, and recall, as shown in Supplementary Table S2, reflecting the prevalence of classification problems. Most studies aim to evaluate the overall correctness of LLM outputs. These metrics are simple and interpretable, facilitating the communication of results to the non-technical medical research community. However, over-reliance on accuracy alone may not capture LLM performance correctly, especially in the case of imbalanced datasets. Therefore, the F1-score and other metrics are used to provide a balanced evaluation. This is particularly important in medical contexts where both false positives and false negatives can have significant consequences.
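The imbalanced-dataset point can be made concrete with a small sketch. In the hypothetical screening scenario below (labels invented for illustration), a classifier that predicts every case as negative still achieves 90% accuracy, while recall and F1 expose the failure on the minority (positive) class.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Hypothetical screening data: 1 positive case in 10; the model predicts all negative.
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [0] * 10
acc, prec, rec, f1 = binary_metrics(y_true, y_pred)
# acc is 0.9 despite the single positive case being missed; rec and f1 are 0.0.
```

In a clinical setting the missed positive could be an undetected diagnosis, which is why the reviewed studies pair accuracy with F1, precision, and recall.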
Our synthesis of the reviewed studies reveals critical qualitative limitations that affect the reliability of LLMs in medical settings. Factual inaccuracies and hallucinations in LLM outputs are observed frequently [97]. These range from misinterpretations of clinical data to the fabrication of information or references [57]. Factors contributing to these issues include the complexity and specificity of the prompt [69], the quality and representativeness of the training data [72], limitations in higher-order clinical reasoning [98], temporal reasoning [45], and negation handling [99], and poor understanding of different medical terminologies [97]. Although mitigation strategies such as prompt engineering [100], hybrid model architectures [99], data augmentation [60], and rule-based post-processing [101] are explored, human oversight and validation remain decisive to ensure clinical safety and accuracy [41]. Advancements in techniques including retrieval-augmented generation (RAG), fine-tuning, and reinforcement learning are expected to enhance the performance and reliability of LLMs [4]. These findings suggest that while LLMs can enhance efficiency, their current limitations necessitate caution, particularly in complex medical tasks.
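The RAG idea referenced above is to ground the model's answer in retrieved reference material rather than its parametric memory, which mitigates hallucination. The sketch below uses naive word-overlap retrieval as a stand-in for the vector similarity search a production system would use; the knowledge-base snippets are hypothetical.

```python
def retrieve(query: str, documents: list[str], k: int = 1) -> list[str]:
    """Rank documents by word overlap with the query (a simple stand-in for
    embedding-based similarity search)."""
    q = set(query.lower().split())
    ranked = sorted(documents, key=lambda d: len(q & set(d.lower().split())), reverse=True)
    return ranked[:k]

def rag_prompt(query: str, documents: list[str]) -> str:
    """Prepend retrieved context so the LLM answers from vetted text."""
    context = "\n".join(retrieve(query, documents))
    return f"Context:\n{context}\n\nAnswer using only the context above.\nQuestion: {query}"

# Hypothetical snippets from a curated clinical knowledge base.
docs = [
    "Metformin is a first-line therapy for type 2 diabetes.",
    "Warfarin dosing requires regular INR monitoring.",
]
prompt = rag_prompt("What is first-line therapy for type 2 diabetes?", docs)
```

The instruction to answer "using only the context above" is what shifts the burden of factual accuracy from the model's training data to the curated source material.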
The F1-score was the second most used evaluation metric, utilized in 21 studies. While AUC-ROC usage is associated with the evaluation of LLMs in classification tasks, a five-point Likert scale is notably used to incorporate human-expert judgment in measuring the clarity, accuracy, and completeness of AI-generated clinical summaries against human-generated ones [102]. This review also identified several evaluation metrics used for specific tasks, including clustering medical records using the silhouette metric [83], gathering a range of opinions about LLM performance [102,103], and assessing the readability of generated text using Flesch–Kincaid [41] and FRE [104]. The variety of evaluation metrics reflects the wide spectrum of LLM applications in medical specialties, although correctness remains the dominant criterion. We highlight the importance of human interpretability in medical AI applications, i.e., the five-point Likert scale, which is not a common evaluation metric. Of note, a Likert scale is a rating instrument for collecting information about qualitative matters, such as research collaboration. Such scales typically use an ordered numeric range, such as 1 to 10, or ordinal systems, such as five stars, to operationalize subjective judgments [105].
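The readability metrics mentioned above are simple closed-form formulas over sentence, word, and syllable counts. The sketch below implements the Flesch Reading Ease (FRE) score, 206.835 - 1.015(words/sentences) - 84.6(syllables/words), with a deliberately naive syllable estimator (consecutive-vowel groups); the example sentences are invented for illustration, and real studies would use a dedicated readability library.

```python
import re

def flesch_reading_ease(text: str) -> float:
    """FRE = 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words).
    Higher scores indicate easier reading. Syllables are approximated as
    runs of consecutive vowels, which is crude but adequate for a demo."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words)
    return 206.835 - 1.015 * (len(words) / sentences) - 84.6 * (syllables / len(words))

simple = "The test is fine. You can go home now."
dense = "Postoperative pharmacological contraindications necessitate comprehensive evaluation."
# The dense, polysyllabic clinical sentence scores far lower than the simple one.
```

Studies applying such formulas to LLM output typically compare the score against target thresholds for patient-education material, which is how the complexity concerns in [106] were quantified.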
This review identified that LLMs achieved high accuracy in tasks such as interpreting standardized clinical documentation, coding of diagnoses, and extracting information from electronic health records. For instance, LLMs such as BERT and GPT-based models reported accuracies exceeding 90% in tasks involving clinical NLP for entity recognition and information extraction. These tasks often deal with structured or semi-structured data, where LLMs can effectively apply learned linguistic patterns.
Conversely, in more complex diagnostic support tasks that require specific clinical judgment or the integration of multimodal data, such as imaging and laboratory results, LLM performance was variable and often far less accurate, with reported accuracies as low as 3%. This highlights the importance of context and task complexity in evaluating LLM efficacy.

4.1.7. Validation Approaches

The validation of the methodologies in the reviewed studies revealed a thorough evaluation of LLMs from multiple perspectives, including comparison with both human experts and other tools. A total of 44 studies benchmarked their performance against existing tools, while 34 studies compared their LLMs' relative performance against medical experts. This suggests a rigorous benchmark in which the LLMs' potential is assessed against human expert performance. However, evaluations by clinical experts consistently raise significant concerns regarding the reliability, factual accuracy, and safety of LLM outputs [63]. For instance, readability assessments frequently find LLM-generated content unsuitable for patient education due to high complexity [106], whereas evaluations by laypeople rate LLM responses more favorably than experts do [32]. These discrepancies may reflect differing standards or awareness of clinical details. Moreover, human clinical expertise is essential not just for validating LLM outputs but throughout the development and application pipeline. Experts are crucial for curating and annotating training data [107], designing effective and clinically relevant prompts [85], defining annotation schemas [92], ingesting domain knowledge [84], interpreting specific or ambiguous LLM outputs [45], and ensuring outputs align with clinical standards [102].
Therefore, there is a need for more standardized validation methods in medical LLM research due to the diversity of approaches. It is noteworthy that several reviewed studies did not validate their methodologies. This lack of comparative validation could be attributed to the novelty of this research direction, where standardized validation methods are yet to be established. Such findings underscore the evolving nature of LLM applications in medicine and the ongoing challenges in developing robust evaluation frameworks.

4.1.8. Reported Challenges

Figure 1 presents a comprehensive description of several challenges in utilizing LLMs across medical specialties. The ethical and safety challenges include issues with patient privacy, model bias, and the potential spread of misinformation. The data-related challenges concern data quality, representativeness, and the limited availability of medical databases. Knowledge and reasoning challenges include poor understanding of medical terms and limited reasoning ability. Methodological and evaluation challenges include the lack of standardized methodologies or frameworks for validating LLMs. Integration and implementation challenges include integrating LLMs with existing clinical systems and the need for continued re-evaluation processes. Performance and reliability challenges comprise factual inaccuracy, hallucinations, and unreliable outcomes. Finally, human factor and usability challenges necessitate human interpretation and communication with patients, as well as the recognition that human experts remain the better option in some cases. These multi-faceted issues illustrate the complexities associated with utilizing LLMs safely and appropriately in healthcare.

5. Future Directions

Several key areas require further research to fully realize the potential of LLMs in medicine. Continued effort is needed to enhance LLM capabilities, particularly through domain-specific pre-training or fine-tuning using high-quality medical corpora. Addressing the scarcity of large-scale, high-quality, annotated medical datasets accessible for research is important. Novel LLMs are needed to better suit clinical data complexities, such as handling long temporal sequences or multimodal inputs. Improving reliability and safety should be prioritized to enhance LLM performance consistency, which will improve trust in, and adoption of, LLMs across specialties. This can be achieved by developing methods to mitigate factual errors, reduce harmful hallucinations, and address inherent biases.
The current focus on multi-specialty applications must be maintained and improved to effectively produce comprehensive output across medical domains [108]. This should be aided by expanding research efforts to explore specialties less represented in this review, including ophthalmology, cardiology, endocrinology, urology, immunology, geriatrics, and nephrology, or unrepresented specialties such as dermatology, anesthesiology, gynecology, obstetrics, and hematology. This can uncover new horizons for LLM applications across a wider range of medical specialties. Dermatology could benefit from multimodal LLMs that combine visual image analysis with text-based reasoning to improve diagnostic accuracy for skin conditions. LLMs could optimize anesthesiology medication dosing and monitoring protocols based on patient-specific factors. For gynecology and obstetrics, applications in prenatal risk assessment and personalized pregnancy monitoring represent promising directions. Hematology applications could focus on rare blood disorder identification and the development of personalized treatment protocols.
While several LLMs excel in specific medical examinations, others do not perform well. Future research may focus on developing fine-tuned LLMs to aid medical education and training, particularly for license examinations, board examinations, and medical knowledge assessments. The same applies to patient management, engagement, and communication, where LLMs can be utilized to personalize patient care at the individual or group level.
Beyond specialty-specific applications, we identified three cross-cutting application areas with exceptional potential: (1) multilingual LLMs to address healthcare requirements in non-English speaking populations; (2) LLM-powered clinical decision support systems that integrate with existing electronic health records for real-time guidance; and (3) patient-oriented LLMs designed specifically for health literacy improvement and treatment adherence support.
Finally, the establishment of robust, standardized evaluation methodologies and benchmarks specific to medical applications is crucial for transparently assessing and comparing LLM performance. The evaluation of LLM applications in medical specialties requires longitudinal studies to examine the long-term impact of LLMs on patients, medical professionals, and healthcare environments. The assessment should be extended to measure the adaptability of LLMs to existing and evolving medical practices and knowledge, and the trust placed in them.

6. Limitations

A limitation of the current systematic review is the extreme heterogeneity of the included studies, with considerable variation in medical specialty, dataset, metric, and evaluation methodology across specific contexts. This diversity did not allow for meta-analysis without compromising the integrity and accuracy of the results; thus, a descriptive synthesis was adopted. Furthermore, analyses of specificity and sensitivity could not be conducted, which is another limitation of this study. To address this, future research should develop standardized evaluation frameworks specific to LLM evaluation in healthcare, establish benchmark datasets for each medical specialty to enable direct and transparent comparison of different LLMs, and implement mixed methods that combine quantitative performance metrics with qualitative assessments of clinical utility to provide a more comprehensive evaluation of LLM applications in medicine. Finally, meta-analyses may be conducted within more homogeneous subsets of studies to provide more robust conclusions.
Due to the massive number of existing LLMs, the proposed search strategy combined broad, generic terms (such as "large language models") with specific, well-known LLMs (e.g., ChatGPT and BERT) to balance comprehensiveness and specificity in retrieval. We recognize that new LLMs constantly emerge in this rapidly changing field, which means our list of explicitly named models may not have been exhaustive. Future systematic reviews might consider refreshing existing search strategies iteratively while retaining generic terms to capture the larger literature.
Furthermore, our search strategy was restricted to articles published in Q1 Web of Science journals. While this approach was intended to capture high-impact and rigorously peer-reviewed research amid rapidly evolving AI technologies, it may have excluded studies published in lower-tier journals or in sources not indexed in Q1 of Web of Science at the time of our search. This could potentially lead to an under-representation of emerging research or work from smaller or more specialized fields in medicine. Future reviews should consider incorporating a broader range of sources to capture a more comprehensive understanding of LLMs in medical specialties.
An inherent limitation of this review is the potential bias introduced by the uneven representation of medical specialties and task complexities. Certain specialties, such as radiology and pathology, may be overrepresented due to data availability, well-defined tasks, research focus, and potential publication bias, whereas specialties with more complex or less structured data, or with less research and publication focus, may be underrepresented. This disparity could affect the generalizability of the findings. Future research should pursue a more balanced inclusion of specialties and consider analyses based on task complexity.

7. Conclusions

This systematic review highlights the dominance of GPT-based and BERT-based models in medical specialty studies. Reported applications include clinical NLP, clinical decision support, medical education, diagnostic support, patient management, and engagement. Most implementations rely on general-purpose or fine-tuned LLMs, while medical-specific models remain limited due to the complexity and resource demands of domain-specific training. A critical gap identified is the lack of standardized validation frameworks for LLMs in healthcare, which undermines reproducibility and trust. While LLMs offer transformative potential, their integration requires carefully balancing benefits such as efficiency and scalability against risks such as bias and hallucination. This balance necessitates interdisciplinary collaboration to address ethical, technical, and clinical constraints.
Our study highlighted the urgent need for a validation standard in LLM research, establishing rigorous, domain-specific benchmarks to evaluate LLM safety, reliability, and clinical relevance. Future LLM research in the clinical domain should be directed towards multimodal integration, designing architectures that fuse text, imaging, and genomic data to support complex tasks such as differential diagnosis or treatment personalization. The development of specialty-specific LLMs should therefore concentrate on using credible medical data for fine-tuning and on advanced transfer learning techniques. We believe that collaboration between medical and computer science researchers will significantly enhance LLMs' capabilities in medical specialties. However, further research is needed to address the limitations and challenges associated with LLMs in medical applications and to explore new avenues for improving their performance and adoption in clinical practice.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/info16060489/s1, Table S1: Summary of Included Studies; Table S2: Evaluation Metrics; Table S3: Risk of bias assessment; Table S4: PRISMA Checklist [21]; Table S5: Modified QUADAS-AI; Table S6: Query Strings.

Author Contributions

Conceptualization, A.M.A.; methodology, A.M.A., A.S. and A.S.A.; validation, A.M.A., A.S., V.H. and Y.Z.; formal analysis, A.M.A., A.S. and A.S.A.; investigation, A.M.A., A.S., V.H. and Y.Z.; data curation, A.M.A., A.S., V.H. and Y.Z.; writing—original draft preparation, A.M.A., A.S. and A.S.A.; writing—review and editing, A.M.A., A.S., A.S.A., S.A. and Q.Z.S.; supervision, A.M.A.; project administration, A.M.A.; funding acquisition, A.M.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Ministry of Higher Education, Research and Innovation, Oman as a part of the Block Funding Program grant number BFP/RGP/ICT/22/445.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request. All data generated or analyzed during this study are included in this published article (and its Supplementary Materials).

Conflicts of Interest

The authors declare no conflicts of interest. The funder played no role in study design, data collection, analysis and interpretation of data, or the writing of this manuscript.

Abbreviations

The following abbreviations are used in this manuscript:
AI: Artificial Intelligence
BERT: Bidirectional Encoder Representations from Transformers
CNN: Convolutional Neural Network
DL: Deep Learning
EHR: Electronic Health Records
FRS: Fundamentals of Robotic Surgery
LLM: Large Language Model
NLP: Natural Language Processing
PRISMA: Preferred Reporting Items for Systematic Reviews and Meta-Analyses
PROSPERO: International Prospective Register of Systematic Reviews
QUADAS-AI: Quality Assessment of Diagnostic Accuracy Studies tailored for Artificial Intelligence
WoS: Web of Science

References

  1. Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving language understanding by generative pre-training. arXiv 2018, arXiv:2012.11747v3. [Google Scholar]
  2. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  3. Sandmann, S.; Riepenhausen, S.; Plagwitz, L.; Varghese, J. Systematic analysis of ChatGPT, Google search and Llama 2 for clinical decision support tasks. Nat. Commun. 2024, 15, 2050. [Google Scholar] [CrossRef]
  4. Vrdoljak, J.; Boban, Z.; Vilović, M.; Kumrić, M.; Božić, J. A review of large language models in medical education, clinical decision support, and healthcare administration. Healthcare 2025, 13, 603. [Google Scholar] [CrossRef]
  5. Mishra, T.; Sutanto, E.; Rossanti, R.; Pant, N.; Ashraf, A.; Raut, A.; Uwabareze, G.; Oluwatomiwa, A.; Zeeshan, B. Use of large language models as artificial intelligence tools in academic research and publishing among global clinical researchers. Sci. Rep. 2024, 14, 31672. [Google Scholar] [CrossRef]
  6. Wang, X.; Gong, Z.; Wang, G.; Jia, J.; Xu, Y.; Zhao, J.; Fan, Q.; Wu, S.; Hu, W.; Li, X. ChatGPT Performs on the Chinese National Medical Licensing Examination. J. Med. Syst. 2023, 47, 86. [Google Scholar] [CrossRef]
  7. de Oliveira, J.M.; Antunes, R.S.; da Costa, C.A. SOAP classifier for free-text clinical notes with domain-specific pre-trained language models. Expert Syst. Appl. 2024, 245, 123046. [Google Scholar] [CrossRef]
  8. Kim, Y.; Kim, J.H.; Kim, Y.M.; Song, S.; Joo, H.J. Predicting medical specialty from text based on a domain-specific pre-trained BERT. Int. J. Med. Inform. 2023, 170, 104956. [Google Scholar] [CrossRef]
  9. Walker, H.L.; Ghani, S.; Kuemmerli, C.; Nebiker, C.A.; Müller, B.P.; Raptis, D.A.; Staubli, S.M. Reliability of Medical Information Provided by ChatGPT: Assessment Against Clinical Guidelines and Patient Information Quality Instrument. J. Med. Internet Res. 2023, 25, e47479. [Google Scholar] [CrossRef]
  10. Wilhelm, T.I.; Roos, J.; Kaczmarczyk, R. Large Language Models for Therapy Recommendations Across 3 Clinical Specialties: Comparative Study. J. Med. Internet Res. 2023, 25, e49324. [Google Scholar] [CrossRef]
  11. Mihalache, A.; Huang, R.S.; Popovic, M.M.; Muni, R.H. ChatGPT-4: An assessment of an upgraded artificial intelligence chatbot in the United States Medical Licensing Examination. Med. Teach. 2023, 46, 366–372. [Google Scholar] [CrossRef] [PubMed]
  12. Alkamli, S.; Al-Yahya, M.; Alyahya, K. Ethical and Legal Considerations of Large Language Models: A Systematic Review of the Literature. In Proceedings of the 2024 2nd International Conference on Foundation and Large Language Models (FLLM), Dubai, United Arab Emirates, 26–29 November 2024; pp. 576–586. [Google Scholar]
  13. Ilias, L.; Askounis, D. Context-aware attention layers coupled with optimal transport domain adaptation and multimodal fusion methods for recognizing dementia from spontaneous speech. Knowl. Based Syst. 2023, 277, 110834. [Google Scholar] [CrossRef]
  14. Pashangpour, S.; Nejat, G. The Future of Intelligent Healthcare: A Systematic Analysis and Discussion on the Integration and Impact of Robots Using Large Language Models for Healthcare. Robotics 2024, 13, 112. [Google Scholar] [CrossRef]
  15. Artsi, Y.; Sorin, V.; Konen, E.; Glicksberg, B.S.; Nadkarni, G.; Klang, E. Large language models for generating medical examinations: Systematic review. BMC Med. Educ. 2024, 24, 354. [Google Scholar] [CrossRef]
  16. Pressman, S.M.; Borna, S.; Gomez-Cabello, C.A.; Haider, S.A.; Haider, C.R.; Forte, A.J. Clinical and Surgical Applications of Large Language Models: A Systematic Review. J. Clin. Med. 2024, 13, 3041. [Google Scholar] [CrossRef]
  17. Zhang, C.; Liu, S.; Zhou, X.; Zhou, S.; Tian, Y.; Wang, S.; Xu, N.; Li, W. Examining the Role of Large Language Models in Orthopedics: Systematic Review. J. Med. Internet Res. 2024, 26, e59607. [Google Scholar] [CrossRef]
  18. Moglia, A.; Georgiou, K.; Cerveri, P.; Mainardi, L.; Satava, R.M.; Cuschieri, A. Large language models in healthcare: From a systematic review on medical examinations to a comparative analysis on fundamentals of robotic surgery online test. Artif. Intell. Rev. 2024, 57, 231. [Google Scholar] [CrossRef]
  19. Liu, M.; Okuhara, T.; Chang, X.; Shirabe, R.; Nishiie, Y.; Okada, H.; Kiuchi, T. Performance of ChatGPT across different versions in medical licensing examinations worldwide: Systematic review and meta-analysis. J. Med. Internet Res. 2024, 26, e60807. [Google Scholar] [CrossRef]
  20. Wang, L.; Wan, Z.; Ni, C.; Song, Q.; Li, Y.; Clayton, E.; Malin, B.; Yin, Z. Applications and Concerns of ChatGPT and Other Conversational Large Language Models in Health Care: Systematic Review. J. Med. Internet Res. 2024, 26, e22769. [Google Scholar] [CrossRef]
  21. Page, M.J.; McKenzie, J.E.; Bossuyt, P.M.; Boutron, I.; Hoffmann, T.C.; Mulrow, C.D.; Shamseer, L.; Tetzlaff, J.M.; Akl, E.A.; Brennan, S.E.; et al. The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. BMJ 2021, 372, n71. [Google Scholar] [CrossRef]
  22. Specialty Profiles. Available online: https://careersinmedicine.aamc.org/ (accessed on 1 April 2024).
  23. Sounderajah, V.; Ashrafian, H.; Rose, S.; Shah, N.H.; Ghassemi, M.; Golub, R.; Kahn, C.E., Jr.; Esteva, A.; Karthikesalingam, A.; Mateen, B.; et al. A quality assessment tool for artificial intelligence-centered diagnostic test accuracy studies: QUADAS-AI. Nat. Med. 2021, 27, 1663–1665. [Google Scholar] [CrossRef] [PubMed]
  24. Zhang, Y.; Chen, K.; Weng, Y.; Chen, Z.; Zhang, J.; Hubbard, R. An intelligent early warning system of analyzing Twitter data using machine learning on COVID-19 surveillance in the US. Expert Syst. Appl. 2022, 198, 116882. [Google Scholar] [CrossRef] [PubMed]
  25. Mitchell, J.R.; Szepietowski, P.; Howard, R.; Reisman, P.; Jones, J.D.; Lewis, P.; Fridley, B.L.; Rollison, D.E. A Question-and-Answer System to Extract Data From Free-Text Oncological Pathology Reports (CancerBERT Network): Development Study. J. Med. Internet Res. 2022, 24, e27210. [Google Scholar] [CrossRef]
  26. Li, D.; Kao, Y.; Tsai, S.; Bai, Y.; Yeh, T.; Chu, C.; Hsu, C.; Cheng, S.; Hsu, T.; Liang, C.; et al. Comparing the performance of ChatGPT GPT-4, Bard, and Llama-2 in the Taiwan Psychiatric Licensing Examination and in differential diagnosis with multi-center psychiatrists. Psychiatry Clin. Neurosci. 2024, 78, 347–352. [Google Scholar] [CrossRef]
  27. Ayoub, N.F.; Lee, Y.J.; Grimm, D.; Divi, V. Head-to-head comparison of ChatGPT versus Google search for medical knowledge acquisition. Otolaryngol. Neck Surg. 2024, 170, 1484–1491. [Google Scholar] [CrossRef]
  28. Wang, H.; Wu, W.; Dou, Z.; He, L.; Yang, L. Performance and exploration of ChatGPT in medical examination, records and education in Chinese: Pave the way for medical AI. Int. J. Med. Inform. 2023, 177, 105173. [Google Scholar] [CrossRef]
  29. Wang, C.; Liu, S.; Li, A.; Liu, J. Text Dialogue Analysis for Primary Screening of Mild Cognitive Impairment: Development and Validation Study. J. Med. Internet Res. 2023, 25, e51501. [Google Scholar] [CrossRef]
  30. Oon, M.L.; Syn, N.L.; Tan, C.L.; Tan, K.; Ng, S. Bridging bytes and biopsies: A comparative analysis of ChatGPT and histopathologists in pathology diagnosis and collaborative potential. Histopathology 2023, 84, 601–613. [Google Scholar] [CrossRef]
  31. Yun, J.Y.; Kim, D.J.; Lee, N.; Kim, E.K. A comprehensive evaluation of ChatGPT consultation quality for augmentation mammoplasty: A comparative analysis between plastic surgeons and laypersons. Int. J. Med. Inform. 2023, 179, 105219. [Google Scholar] [CrossRef]
  32. Scquizzato, T.; Semeraro, F.; Swindell, P.; Simpson, R.; Angelini, M.; Gazzato, A.; Sajjad, U.; Bignami, E.G.; Landoni, G.; Keeble, T.R.; et al. Testing ChatGPT ability to answer laypeople questions about cardiac arrest and cardiopulmonary resuscitation. Resuscitation 2024, 194, 110077. [Google Scholar] [CrossRef]
  33. Maillard, A.; Micheli, G.; Lefevre, L.; Guyonnet, C.; Poyart, C.; Canouï, E.; Belan, M.; Charlier, C. Can Chatbot Artificial Intelligence Replace Infectious Diseases Physicians in the Management of Bloodstream Infections? A Prospective Cohort Study. Clin. Infect. Dis. 2023, 78, 825–832. [Google Scholar] [CrossRef] [PubMed]
  34. Bushuven, S.; Bentele, M.; Bentele, S.; Gerber, B.; Bansbach, J.; Ganter, J.; Trifunovic-Koenig, M.; Ranisch, R. “ChatGPT, can you help me save my child’s life?”-Diagnostic Accuracy and Supportive Capabilities to lay rescuers by ChatGPT in prehospital Basic Life Support and Paediatric Advanced Life Support cases–an in-silico analysis. J. Med. Syst. 2023, 47, 123. [Google Scholar] [CrossRef] [PubMed]
  35. Mulyar, A.; Uzuner, O.; McInnes, B. MT-clinical BERT: Scaling clinical information extraction with multitask learning. J. Am. Med. Inform. Assoc. 2021, 28, 2108–2115. [Google Scholar] [CrossRef] [PubMed]
  36. Cai, X.; Liu, S.; Han, J.; Yang, L.; Liu, Z.; Liu, T. Chestxraybert: A pretrained language model for chest radiology report summarization. IEEE Trans. Multimed. 2021, 25, 845–855. [Google Scholar] [CrossRef]
  37. Wang, S.Y.; Huang, J.; Hwang, H.; Hu, W.; Tao, S.; Hernandez-Boussard, T. Leveraging weak supervision to perform named entity recognition in electronic health records progress notes to identify the ophthalmology exam. Int. J. Med. Inform. 2022, 167, 104864. [Google Scholar] [CrossRef]
  38. Hristidis, V.; Ruggiano, N.; Brown, E.L.; Ganta, S.R.R.; Stewart, S. ChatGPT vs Google for Queries Related to Dementia and Other Cognitive Decline: Comparison of Results. J. Med. Internet Res. 2023, 25, e48966. [Google Scholar] [CrossRef]
  39. Ghanem, Y.K.; Rouhi, A.D.; Al-Houssan, A.; Saleh, Z.; Moccia, M.C.; Joshi, H.; Dumon, K.R.; Hong, Y.; Spitz, F.; Joshi, A.R.; et al. Dr. Google to Dr. ChatGPT: Assessing the content and quality of artificial intelligence-generated medical information on appendicitis. Surg. Endosc. 2024, 38, 2887–2893. [Google Scholar] [CrossRef]
  40. Parker, G.; Spoelma, M.J. A chat about bipolar disorder. Bipolar Disord. 2023, 26, 249–254. [Google Scholar] [CrossRef]
  41. Mishra, A.; Begley, S.L.; Chen, A.; Rob, M.; Pelcher, I.; Ward, M.; Schulder, M. Exploring the Intersection of Artificial Intelligence and Neurosurgery: Let us be Cautious With ChatGPT. Neurosurgery 2023, 93, 1366–1373. [Google Scholar] [CrossRef]
  42. Shang, J.; Tang, X.; Sun, Y. PhaTYP: Predicting the lifestyle for bacteriophages using BERT. Briefings Bioinform. 2023, 24, bbac487. [Google Scholar] [CrossRef]
  43. Zhang, Y.; Zhu, G.; Li, K.; Li, F.; Huang, L.; Duan, M.; Zhou, F. HLAB: Learning the BiLSTM features from the ProtBert-encoded proteins for the class I HLA-peptide binding prediction. Briefings Bioinform. 2022, 23, bbac173. [Google Scholar] [CrossRef] [PubMed]
  44. Klein, A.Z.; Magge, A.; O’Connor, K.; Flores Amaro, J.I.; Weissenbacher, D.; Gonzalez Hernandez, G. Toward using Twitter for tracking COVID-19: A natural language processing pipeline and exploratory data set. J. Med. Internet Res. 2021, 23, e25314. [Google Scholar] [CrossRef] [PubMed]
  45. Percha, B.; Pisapati, K.; Gao, C.; Schmidt, H. Natural language inference for curation of structured clinical registries from unstructured text. J. Am. Med. Inform. Assoc. 2021, 29, 97–108. [Google Scholar] [CrossRef]
  46. Wong, M.; Lim, Z.W.; Pushpanathan, K.; Cheung, C.Y.; Wang, Y.X.; Chen, D.; Tham, Y.C. Review of emerging trends and projection of future developments in large language models research in ophthalmology. Br. J. Ophthalmol. 2023, 108, 1362–1370. [Google Scholar] [CrossRef]
  47. Klang, E.; Sourosh, A.; Nadkarni, G.N.; Sharif, K.; Lahat, A. Evaluating the role of ChatGPT in gastroenterology: A comprehensive systematic review of applications, benefits, and limitations. Ther. Adv. Gastroenterol. 2023, 16, 17562848231218618. [Google Scholar] [CrossRef]
  48. Younis, H.A.; Eisa, T.A.E.; Nasser, M.; Sahib, T.M.; Noor, A.A.; Alyasiri, O.M.; Salisu, S.; Hayder, I.M.; Younis, H.A. A Systematic Review and Meta-Analysis of Artificial Intelligence Tools in Medicine and Healthcare: Applications, Considerations, Limitations, Motivation and Challenges. Diagnostics 2024, 14, 109. [Google Scholar] [CrossRef]
  49. Levin, G.; Horesh, N.; Brezinov, Y.; Meyer, R. Performance of ChatGPT in medical examinations: A systematic review and a meta-analysis. BJOG Int. J. Obstet. Gynaecol. 2024, 131, 378–380. [Google Scholar] [CrossRef] [PubMed]
  50. Schopow, N.; Osterhoff, G.; Baur, D. Applications of the natural language processing tool ChatGPT in clinical practice: Comparative study and augmented systematic review. JMIR Med. Inform. 2023, 11, e48933. [Google Scholar] [CrossRef]
  51. Kao, H.J.; Chien, T.W.; Wang, W.C.; Chou, W.; Chow, J.C. Assessing ChatGPT’s capacity for clinical decision support in pediatrics: A comparative study with pediatricians using KIDMAP of Rasch analysis. Medicine 2023, 102, e34068. [Google Scholar] [CrossRef]
  52. Zhou, S.; Wang, N.; Wang, L.; Liu, H.; Zhang, R. CancerBERT: A cancer domain-specific language model for extracting breast cancer phenotypes from electronic health records. J. Am. Med. Inform. Assoc. 2022, 29, 1208–1216. [Google Scholar] [CrossRef]
  53. Shool, S.; Adimi, S.; Saboori Amleshi, R.; Bitaraf, E.; Golpira, R.; Tara, M. A systematic review of large language model (LLM) evaluations in clinical medicine. BMC Med. Inform. Decis. Mak. 2025, 25, 117. [Google Scholar] [CrossRef] [PubMed]
  54. Ali, R.; Tang, O.Y.; Connolly, I.D.; Zadnik Sullivan, P.L.; Shin, J.H.; Fridley, J.S.; Asaad, W.F.; Cielo, D.; Oyelese, A.A.; Doberstein, C.E.; et al. Performance of ChatGPT and GPT-4 on Neurosurgery Written Board Examinations. Neurosurgery 2023, 93, 1353–1365. [Google Scholar] [CrossRef] [PubMed]
  55. Massey, P.A.; Montgomery, C.; Zhang, A.S. Comparison of ChatGPT–3.5, ChatGPT-4, and orthopaedic resident performance on orthopaedic assessment examinations. JAAOS-J. Am. Acad. Orthop. Surg. 2023, 31, 1173–1179. [Google Scholar] [CrossRef]
  56. Butler, J.J.; Puleo, J.; Harrington, M.C.; Dahmen, J.; Rosenbaum, A.J.; Kerkhoffs, G.M.M.J.; Kennedy, J.G. From technical to understandable: Artificial Intelligence Large Language Models improve the readability of knee radiology reports. Knee Surgery Sport. Traumatol. Arthrosc. 2024, 32, 1077–1086. [Google Scholar] [CrossRef]
  57. Mika, A.P.; Martin, J.R.; Engstrom, S.M.; Polkowski, G.G.; Wilson, J.M. Assessing ChatGPT Responses to Common Patient Questions Regarding Total Hip Arthroplasty. J. Bone Jt. Surg. 2023, 105, 1519–1526. [Google Scholar] [CrossRef]
  58. Liu, H.; Zhang, Z.; Xu, Y.; Wang, N.; Huang, Y.; Yang, Z.; Jiang, R.; Chen, H. Use of BERT (Bidirectional Encoder Representations from Transformers)-Based Deep Learning Method for Extracting Evidences in Chinese Radiology Reports: Development of a Computer-Aided Liver Cancer Diagnosis Framework. J. Med. Internet Res. 2021, 23, e19689. [Google Scholar] [CrossRef]
  59. Datta, S.; Roberts, K. Fine-grained spatial information extraction in radiology as two-turn question answering. Int. J. Med. Inform. 2022, 158, 104628. [Google Scholar] [CrossRef] [PubMed]
  60. Tan, R.S.Y.C.; Lin, Q.; Low, G.H.; Lin, R.; Goh, T.C.; Chang, C.C.E.; Lee, F.F.; Chan, W.Y.; Tan, W.C.; Tey, H.J.; et al. Inferring cancer disease response from radiology reports using large language models with data augmentation and prompting. J. Am. Med. Inform. Assoc. 2023, 30, 1657–1664. [Google Scholar] [CrossRef]
  61. Mohamad, E.; Boutoleau-Bretonnière, C.; Chapelet, G. ChatGPT’s Dance with Neuropsychological Data: A case study in Alzheimer’s Disease. Ageing Res. Rev. 2023, 92, 102117. [Google Scholar]
  62. Otsuka, N.; Kawanishi, Y.; Doi, F.; Takeda, T.; Okumura, K.; Yamauchi, T.; Yada, S.; Wakamiya, S.; Aramaki, E.; Makinodan, M. Diagnosing psychiatric disorders from history of present illness using a large-scale linguistic model. Psychiatry Clin. Neurosci. 2023, 77, 597–604. [Google Scholar] [CrossRef]
  63. Zaboli, A.; Brigo, F.; Sibilio, S.; Mian, M.; Turcato, G. Human intelligence versus Chat-GPT: Who performs better in correctly classifying patients in triage? Am. J. Emerg. Med. 2024, 79, 44–47. [Google Scholar] [CrossRef]
  64. Lee, S.; Lee, J.; Park, J.; Park, J.; Kim, D.; Lee, J.; Oh, J. Deep learning-based natural language processing for detecting medical symptoms and histories in emergency patient triage. Am. J. Emerg. Med. 2024, 77, 29–38. [Google Scholar] [CrossRef] [PubMed]
  65. CB, A.; Mahesh, K.; Sanda, N. Ontology-based semantic data interestingness using BERT models. Connect. Sci. 2023, 35. [Google Scholar] [CrossRef]
  66. Qiao, Y.; Zhu, X.; Gong, H. BERT-Kcr: Prediction of lysine crotonylation sites by a transfer learning method with pre-trained BERT models. Bioinformatics 2021, 38, 648–654. [Google Scholar] [CrossRef] [PubMed]
  67. Vithanage, D.; Zhu, Y.; Zhang, Z.; Deng, C.; Yin, M.; Yu, P. Extracting Symptoms of Agitation in Dementia from Free-Text Nursing Notes Using Advanced Natural Language Processing. In MEDINFO 2023—The Future Is Accessible; IOS Press: Amsterdam, The Netherlands, 2024; pp. 700–704. [Google Scholar]
  68. Guo, Q.; Cao, S.; Yi, Z. A medical question answering system using large language models and knowledge graphs. Int. J. Intell. Syst. 2022, 37, 8548–8564. [Google Scholar] [CrossRef]
  69. Acharya, A.; Shrestha, S.; Chen, A.; Conte, J.; Avramovic, S.; Sikdar, S.; Anastasopoulos, A.; Das, S. Clinical risk prediction using language models: Benefits and considerations. J. Am. Med. Inform. Assoc. 2024, 31, 1856–1864. [Google Scholar] [CrossRef]
  70. Chen, Y.P.; Lo, Y.H.; Lai, F.; Huang, C.H. Disease Concept-Embedding Based on the Self-Supervised Method for Medical Information Extraction from Electronic Health Records and Disease Retrieval: Algorithm Development and Validation Study. J. Med. Internet Res. 2021, 23, e25113. [Google Scholar] [CrossRef]
  71. Haze, T.; Kawano, R.; Takase, H.; Suzuki, S.; Hirawa, N.; Tamura, K. Influence on the accuracy in ChatGPT: Differences in the amount of information per medical field. Int. J. Med. Inform. 2023, 180, 105283. [Google Scholar] [CrossRef]
  72. Xie, K.; Gallagher, R.S.; Conrad, E.C.; Garrick, C.O.; Baldassano, S.N.; Bernabei, J.M.; Galer, P.D.; Ghosn, N.J.; Greenblatt, A.S.; Jennings, T.; et al. Extracting seizure frequency from epilepsy clinic notes: A machine reading approach to natural language processing. J. Am. Med. Inform. Assoc. 2022, 29, 873–881. [Google Scholar] [CrossRef]
  73. Alkouz, B.; Al Aghbari, Z.; Al-Garadi, M.A.; Sarker, A. Deepluenza: Deep learning for influenza detection from Twitter. Expert Syst. Appl. 2022, 198, 116845. [Google Scholar] [CrossRef]
  74. Lu, Z.H.; Wang, J.X.; Li, X. Revealing opinions for COVID-19 questions using a context retriever, opinion aggregator, and question-answering model: Model development study. J. Med. Internet Res. 2021, 23, e22860. [Google Scholar] [CrossRef] [PubMed]
  75. Truhn, D.; Loeffler, C.M.; Müller-Franzes, G.; Nebelung, S.; Hewitt, K.J.; Brandner, S.; Bressem, K.K.; Foersch, S.; Kather, J.N. Extracting structured information from unstructured histopathology reports using generative pre-trained transformer 4 (GPT-4). J. Pathol. 2023, 262, 310–319. [Google Scholar] [CrossRef]
  76. Karakas, C.; Brock, D.; Lakhotia, A. Leveraging ChatGPT in the Pediatric Neurology Clinic: Practical Considerations for Use to Improve Efficiency and Outcomes. Pediatr. Neurol. 2023, 148, 157–163. [Google Scholar] [CrossRef] [PubMed]
  77. Liu, S.; Wright, A.P.; Patterson, B.L.; Wanderer, J.P.; Turer, R.W.; Nelson, S.D.; McCoy, A.B.; Sittig, D.F.; Wright, A. Using AI-generated suggestions from ChatGPT to optimize clinical decision support. J. Am. Med. Inform. Assoc. 2023, 30, 1237–1245. [Google Scholar] [CrossRef]
  78. Xue, Z.; Zhang, Y.; Gan, W.; Wang, H.; She, G.; Zheng, X. Quality and Dependability of ChatGPT and DingXiangYuan Forums for Remote Orthopedic Consultations: Comparative Analysis. J. Med. Internet Res. 2024, 26, e50882. [Google Scholar] [CrossRef] [PubMed]
  79. Wang, G.; Gao, K.; Liu, Q.; Wu, Y.; Zhang, K.; Zhou, W.; Guo, C. Potential and Limitations of ChatGPT 3.5 and 4.0 as a Source of COVID-19 Information: Comprehensive Comparative Analysis of Generative and Authoritative Information. J. Med. Internet Res. 2023, 25, e49771. [Google Scholar] [CrossRef]
  80. Ji, S.; Hölttä, M.; Marttinen, P. Does the magic of BERT apply to medical code assignment? A quantitative study. Comput. Biol. Med. 2021, 139, 104998. [Google Scholar] [CrossRef]
  81. Nimmi, K.; Janet, B.; Selvan, A.K.; Sivakumaran, N. Pre-trained ensemble model for identification of emotion during COVID-19 based on emergency response support system dataset. Appl. Soft Comput. 2022, 122, 108842. [Google Scholar] [CrossRef]
  82. Hartman, V.C.; Bapat, S.S.; Weiner, M.G.; Navi, B.B.; Sholle, E.T.; Campion, T.R. A method to automate the discharge summary hospital course for neurology patients. J. Am. Med. Inform. Assoc. 2023, 30, 1995–2003. [Google Scholar] [CrossRef]
  83. Kim, S.; Cha, J.; Kim, D.; Park, E. Understanding Mental Health Issues in Different Subdomains of Social Networking Services: Computational Analysis of Text-Based Reddit Posts. J. Med. Internet Res. 2023, 25, e49074. [Google Scholar] [CrossRef]
  84. Hu, D.; Liu, B.; Zhu, X.; Lu, X.; Wu, N. Zero-shot information extraction from radiological reports using ChatGPT. Int. J. Med. Inform. 2024, 183, 105321. [Google Scholar] [CrossRef]
  85. Datta, S.; Lee, K.; Paek, H.; Manion, F.J.; Ofoegbu, N.; Du, J.; Li, Y.; Huang, L.C.; Wang, J.; Lin, B.; et al. AutoCriteria: A generalizable clinical trial eligibility criteria extraction system powered by large language models. J. Am. Med. Inform. Assoc. 2023, 31, 375–385. [Google Scholar] [CrossRef]
  86. Bombieri, M.; Rospocher, M.; Ponzetto, S.P.; Fiorini, P. Machine understanding surgical actions from intervention procedure textbooks. Comput. Biol. Med. 2023, 152, 106415. [Google Scholar] [CrossRef] [PubMed]
  87. Laison, E.K.E.; Hamza Ibrahim, M.; Boligarla, S.; Li, J.; Mahadevan, R.; Ng, A.; Muthuramalingam, V.; Lee, W.Y.; Yin, Y.; Nasri, B.R. Identifying Potential Lyme Disease Cases Using Self-Reported Worldwide Tweets: Deep Learning Modeling Approach Enhanced With Sentimental Words Through Emojis. J. Med. Internet Res. 2023, 25, e47014. [Google Scholar] [CrossRef] [PubMed]
  88. Wang, X.F.; Yu, C.Q.; You, Z.H.; Qiao, Y.; Li, Z.W.; Huang, W.Z. An efficient circRNA-miRNA interaction prediction model by combining biological text mining and wavelet diffusion-based sparse network structure embedding. Comput. Biol. Med. 2023, 165, 107421. [Google Scholar] [CrossRef]
  89. Lu, Z.; Sim, J.A.; Wang, J.X.; Forrest, C.B.; Krull, K.R.; Srivastava, D.; Hudson, M.M.; Robison, L.L.; Baker, J.N.; Huang, I.C. Natural language processing and machine learning methods to characterize unstructured patient-reported outcomes: Validation study. J. Med. Internet Res. 2021, 23, e26777. [Google Scholar] [CrossRef]
  90. Bressem, K.K.; Papaioannou, J.M.; Grundmann, P.; Borchert, F.; Adams, L.C.; Liu, L.; Busch, F.; Xu, L.; Loyen, J.P.; Niehues, S.M.; et al. medBERT.de: A comprehensive German BERT model for the medical domain. Expert Syst. Appl. 2024, 237, 121598. [Google Scholar] [CrossRef]
  91. Chizhikova, M.; López-Úbeda, P.; Collado-Montañez, J.; Martín-Noguerol, T.; Díaz-Galiano, M.C.; Luna, A.; Ureña-López, L.A.; Martín-Valdivia, M.T. CARES: A Corpus for classification of Spanish Radiological reports. Comput. Biol. Med. 2023, 154, 106581. [Google Scholar] [CrossRef]
  92. Kaplar, A.; Stošović, M.; Kaplar, A.; Brković, V.; Naumović, R.; Kovačević, A. Evaluation of clinical named entity recognition methods for Serbian electronic health records. Int. J. Med. Inform. 2022, 164, 104805. [Google Scholar] [CrossRef]
  93. Karthikeyan, S.; de Herrera, A.G.S.; Doctor, F.; Mirza, A. An OCR Post-Correction Approach Using Deep Learning for Processing Medical Reports. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 2574–2581. [Google Scholar] [CrossRef]
  94. Homburg, M.; Meijer, E.; Berends, M.; Kupers, T.; Olde Hartman, T.; Muris, J.; de Schepper, E.; Velek, P.; Kuiper, J.; Berger, M.; et al. A Natural Language Processing Model for COVID-19 Detection Based on Dutch General Practice Electronic Health Records by Using Bidirectional Encoder Representations From Transformers: Development and Validation Study. J. Med. Internet Res. 2023, 25, e49944. [Google Scholar] [CrossRef] [PubMed]
  95. Fonseca, Â.; Ferreira, A.; Ribeiro, L.; Moreira, S.; Duque, C. Embracing the future—Is artificial intelligence already better? A comparative study of artificial intelligence performance in diagnostic accuracy and decision-making. Eur. J. Neurol. 2024, 31. [Google Scholar] [CrossRef]
  96. Gan, R.K.; Ogbodo, J.C.; Wee, Y.Z.; Gan, A.Z.; González, P.A. Performance of Google bard and ChatGPT in mass casualty incidents triage. Am. J. Emerg. Med. 2024, 75, 72–78. [Google Scholar] [CrossRef] [PubMed]
  97. Caruccio, L.; Cirillo, S.; Polese, G.; Solimando, G.; Sundaramurthy, S.; Tortora, G. Can ChatGPT provide intelligent diagnoses? A comparative study between predictive models and ChatGPT to define a new medical diagnostic bot. Expert Syst. Appl. 2024, 235, 121186. [Google Scholar] [CrossRef]
  98. Cuthbert, R.; Simpson, A.I. Artificial intelligence in orthopaedics: Can chat generative pre-trained transformer (ChatGPT) pass Section 1 of the fellowship of the royal college of surgeons (trauma & orthopaedics) examination? Postgrad. Med. J. 2023, 99, 1110–1114. [Google Scholar]
  99. Fu, S.; Thorsteinsdottir, B.; Zhang, X.; Lopes, G.S.; Pagali, S.R.; LeBrasseur, N.K.; Wen, A.; Liu, H.; Rocca, W.A.; Olson, J.E.; et al. A hybrid model to identify fall occurrence from electronic health records. Int. J. Med. Inform. 2022, 162, 104736. [Google Scholar] [CrossRef]
  100. Kaarre, J.; Feldt, R.; Keeling, L.E.; Dadoo, S.; Zsidai, B.; Hughes, J.D.; Samuelsson, K.; Musahl, V. Exploring the potential of ChatGPT as a supplementary tool for providing orthopaedic information. Knee Surg. Sport. Traumatol. Arthrosc. 2023, 31, 5190–5198. [Google Scholar] [CrossRef]
  101. Liu, J.; Gupta, S.; Chen, A.; Wang, C.K.; Mishra, P.; Dai, H.J.; Wong, Z.S.Y.; Jonnagaddala, J. OpenDeID Pipeline for Unstructured Electronic Health Record Text Notes Based on Rules and Transformers: Deidentification Algorithm Development and Validation Study. J. Med. Internet Res. 2023, 25, e48145. [Google Scholar] [CrossRef]
  102. Liu, S.; McCoy, A.B.; Wright, A.P.; Nelson, S.D.; Huang, S.S.; Ahmad, H.B.; Carro, S.E.; Franklin, J.; Brogan, J.; Wright, A. Why do users override alerts? Utilizing large language model to summarize comments and optimize clinical decision support. J. Am. Med. Inform. Assoc. 2024, 31, 1388–1396. [Google Scholar] [CrossRef]
  103. Song, H.; Xia, Y.; Luo, Z.; Liu, H.; Song, Y.; Zeng, X.; Li, T.; Zhong, G.; Li, J.; Chen, M.; et al. Evaluating the Performance of Different Large Language Models on Health Consultation and Patient Education in Urolithiasis. J. Med. Syst. 2023, 47. [Google Scholar] [CrossRef]
  104. Bellinger, J.R.; De La Chapa, J.S.; Kwak, M.W.; Ramos, G.A.; Morrison, D.; Kesser, B.W. BPPV Information on Google Versus AI (ChatGPT). Otolaryngol. Head Neck Surg. 2023, 170, 1504–1511. [Google Scholar] [CrossRef] [PubMed]
  105. Koczkodaj, W.W.; Kakiashvili, T.; Szymańska, A.; Montero-Marin, J.; Araya, R.; Garcia-Campayo, J.; Rutkowski, K.; Strzałka, D. How to reduce the number of rating scale items without predictability loss? Scientometrics 2017, 111, 581–593. [Google Scholar] [CrossRef] [PubMed]
  106. Campbell, D.J.; Estephan, L.E.; Sina, E.M.; Mastrolonardo, E.V.; Alapati, R.; Amin, D.R.; Cottrill, E.E. Evaluating ChatGPT responses on thyroid nodules for patient education. Thyroid 2024, 34, 371–377. [Google Scholar] [CrossRef] [PubMed]
  107. Li, J.; Zhou, Y.; Jiang, X.; Natarajan, K.; Pakhomov, S.V.; Liu, H.; Xu, H. Are synthetic clinical notes useful for real natural language processing tasks: A case study on clinical entity recognition. J. Am. Med. Inform. Assoc. 2021, 28, 2193–2201. [Google Scholar] [CrossRef]
  108. Tsoutsanis, P.; Tsoutsanis, A. Evaluation of Large language model performance on the Multi-Specialty Recruitment Assessment (MSRA) exam. Comput. Biol. Med. 2024, 168, 107794. [Google Scholar] [CrossRef]
Figure 1. Key challenges associated with the implementation of LLMs in medical specialties.
Figure 2. PRISMA flow diagram of the proposed systematic review.
Figure 3. Distribution of research papers by country of affiliation.
Table 1. Tasks of LLMs in medicine: summary of performance analysis and key findings.

| Application | Task Examples | Dataset Characteristics | Accuracy Range | Key Findings |
|---|---|---|---|---|
| Medical Education | Performance in different medical exams | Official/practice question sets (100–861 questions) | 29.4–90% | GPT-4 significantly outperforms other LLMs [26,27,28,29]. Human experts still outperform LLMs [30,31]. |
| Diagnostic Assistance | Disease diagnosis based on various inputs | Single to hundreds of questions/cases | 3–94% | Prompt design significantly impacts AI diagnostic performance [32]. LLMs are promising for primary screening but not for complex cases [33,34]. Significant safety concerns exist for using current LLMs in real time [32]. |
| Clinical Decision Support | Suggesting clinical risks, outcomes, and recommendations | 15 cases to 84 K patients/questions | 27–60% | Performance is highly model-dependent [35,36,37]. LLMs show potential in assisting decision support but often lack the reliability, accuracy, and reasoning of human experts [31]. |
| Patient Management and Engagement | Answering patient questions | Notes/reports (100–6600); sets of 10–40 questions | 56–80% | LLMs have the potential to answer medical questions, but patients are challenged by medical terms [38,39]. LLMs often fabricate citations and hallucinate [40]. Overall information quality is lower than that of human experts [41]. |
| Clinical NLP | Generating concise summaries; identifying clinical entities (problems, treatments, tests, medications) from unstructured text; classifying sentences/notes based on content | Clinical notes (100s–10Ks); clinical trial protocols (180) | 71% to >90% | Fine-tuning can substantially improve LLM performance [42,43,44]. Handling long documents, complex reasoning, and model hallucinations limit clinical extraction tasks [45]. |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Alkalbani, A.M.; Alrawahi, A.S.; Salah, A.; Haghighi, V.; Zhang, Y.; Alkindi, S.; Sheng, Q.Z. A Systematic Review of Large Language Models in Medical Specialties: Applications, Challenges and Future Directions. Information 2025, 16, 489. https://doi.org/10.3390/info16060489


