Review

The Role of Large Language Models in Improving Diagnostic-Related Groups Assignment and Clinical Decision Support in Healthcare Systems: An Example from Radiology and Nuclear Medicine

by Platon S. Papageorgiou 1, Rafail C. Christodoulou 2,*, Rafael Pitsillos 3, Vasileia Petrou 4, Georgios Vamvouras 5, Eirini Vasiliki Kormentza 1, Panayiotis J. Papagelopoulos 1 and Michalis F. Georgiou 6,*
1 First Department of Orthopaedics, University General Hospital Attikon, Medical School, National and Kapodistrian University of Athens, 12462 Athens, Greece
2 Department of Radiology, Stanford University School of Medicine, Stanford, CA 94305, USA
3 Neurophysiology Department, Cyprus Institute of Neurology and Genetics, 1683 Nicosia, Cyprus
4 Department of Medicine, University of Ioannina, 45110 Ioannina, Greece
5 Department of Mechanical Engineering, National Technical University of Athens, 15772 Zografou, Greece
6 Division of Nuclear Medicine, Department of Radiology, University of Miami, Miami, FL 33136, USA
* Authors to whom correspondence should be addressed.
Appl. Sci. 2025, 15(16), 9005; https://doi.org/10.3390/app15169005
Submission received: 19 July 2025 / Revised: 12 August 2025 / Accepted: 13 August 2025 / Published: 15 August 2025

Abstract

Large language models (LLMs) are rapidly transforming healthcare by automating tasks, streamlining administration, and enhancing clinical decision support. This rapid review assesses current and emerging applications of LLMs in diagnostic-related group (DRG) assignment and clinical decision support systems (CDSS), with emphasis on radiology and nuclear medicine. Evidence shows that LLMs, particularly those tailored for medical domains, improve efficiency and accuracy in DRG coding and radiology report generation, providing clinicians with actionable, context-sensitive insights by integrating diverse data sources. Advances like retrieval-augmented generation and multimodal architectures further increase reliability and minimize the incorrect or misleading outputs that AI models can generate, known as hallucinations. Despite these benefits, challenges remain regarding safety, explainability, bias, and regulatory compliance, necessitating ongoing validation and oversight. The review prioritizes recent, peer-reviewed literature on radiology and nuclear medicine to provide a practical synthesis for clinicians, administrators, and researchers. While LLMs show strong promise for enhancing DRG assignment and radiological decision-making, their integration into clinical workflows requires careful management. Ongoing technological advances and emerging evidence may quickly change the landscape, so findings should be interpreted in context. This review offers a timely overview of the evolving role of LLMs while recognizing the need for continuous re-evaluation.

1. Introduction

The rapid development of artificial intelligence (AI) has ushered in a new era for healthcare, with large language models (LLMs) showing remarkable capabilities in processing, interpreting, and generating complex medical information. LLMs, such as GPT-4 and ChatGPT, and domain-specific models, are poised to revolutionize clinical practice by automating administrative tasks, supporting informed decision-making, and enhancing patient engagement [1,2]. Their generative and natural language processing (NLP) abilities allow them to interpret unstructured clinical data, extract pertinent details, and provide guideline-based recommendations, making them valuable for managing text-based information in data-heavy specialties like radiology [1,3]. However, since AI models, especially those based on language, are less effective at analyzing raw numerical data and performing statistical computations, their utility in radiology is primarily limited to tasks involving text analysis and report generation, rather than quantitative image assessment or numerical data interpretation [4].
Recent studies suggest that LLMs may be able to facilitate routine administrative tasks, such as documentation, scheduling, and coding, which helps reduce burdens on healthcare professionals and thus could enable them to focus more on patient care [2,3]. One of the most promising uses is automating Diagnosis-Related Group (DRG) assignment. A DRG is a code based on a patient’s diagnosis, procedures, age, sex, and discharge status, intended to reflect the average resources needed to treat that patient, and its assignment is a vital process for hospital reimbursement and resource management. By utilizing advanced natural language understanding, LLMs could precisely analyze clinical notes and assign DRGs more efficiently, with potentially fewer errors, supporting medical institutions’ financial and operational stability [1,2].
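To make the mechanics concrete, the sketch below shows one minimal way an LLM could be prompted to assign a DRG from a discharge summary. It is a hypothetical Python illustration, not the method of any study cited here; call_llm stands in for whatever chat-completion API is used and returns a canned answer so the sketch runs end to end.

    import textwrap

    def call_llm(prompt: str) -> str:
        # Hypothetical stand-in for a chat-completion API call;
        # replace with your provider's client.
        return "291: HEART FAILURE & SHOCK WITH MCC"

    DRG_PROMPT = textwrap.dedent("""\
        You are a clinical coding assistant. Read the discharge summary
        (diagnoses, procedures, age, sex, discharge status) and return
        the single most likely MS-DRG as 'CODE: TITLE'.

        Discharge summary:
        {summary}
        """)

    def assign_drg(summary: str) -> str:
        # The model maps free-text clinical detail to a DRG label.
        return call_llm(DRG_PROMPT.format(summary=summary))

    print(assign_drg("78-year-old male admitted with acute decompensated heart failure."))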
LLMs are being integrated into clinical workflows in radiology to improve report generation, standardize communication, and provide decision support for imaging studies [3,5]. For example, Liu et al. introduced Radiology-GPT, a domain-specific LLM based on Alpaca-7B that outperformed general-purpose models in radiological diagnosis and reporting tasks on four of five evaluation metrics, the exception being relevance; the metrics, understandability, coherence, relevance, conciseness, and clinical utility of generated responses, were chosen for their direct relevance to clinical practice, demonstrating the value of customizing LLMs to specific medical data and requirements [5]. Each metric was scored on a scale from 1 to 5 by radiologists using a set of 10 random radiology reports [5]. Similarly, pilot studies in leading European hospitals have shown that LLMs can reduce reporting times, automate routine tasks, and ensure consistency and quality in radiology reports, all while maintaining compliance with data protection regulations [3]. Figure 1 illustrates the global distribution of healthcare LLM adoption in research, based on the data we found [4] [Figure 1].
Beyond administrative and reporting functions, LLMs are increasingly recognized for their potential in clinical decision support (CDS). By synthesizing information from diverse sources and providing contextually relevant recommendations, these models can assist clinicians in selecting appropriate imaging modalities and interpreting complex cases [6]. Comparative studies have found that LLMs trained on medical texts, such as Glass AI, can outperform general-purpose models like ChatGPT in predicting the most appropriate imaging studies for various clinical scenarios, further underscoring the benefits of domain adaptation and specialized training [6].
Despite these advances, integrating LLMs into healthcare systems presents ongoing challenges, including data privacy, ethical deployment, regulatory compliance, and rigorous evaluation and validation [1,2,3]. Addressing these issues will be essential for the responsible and effective adoption of LLMs in clinical practice.
The literature indicates that LLMs hold promise in many fields of medicine, including research, DRG assignment, and CDS in radiology and broader healthcare systems; Figure 2 presents the number of specific LLM models we found in large databases such as Web of Science, Scopus, and PubMed during our search [Figure 2]. LLMs can process large volumes of unstructured data, such as medical records, clinical notes, radiology and pathology reports, patient histories, and other narrative text common in healthcare; generate structured outputs; and apply their natural language processing skills to extract meaningful and relevant medical information, such as symptoms, diagnoses, medications, and treatment details, from these texts. These abilities make them transformative tools for the future of medical practice [1,2,3,5,6].

2. Scope of LLM Integration in Healthcare

2.1. LLM Applications in DRG Assignment

Large Language Models (LLMs), including general-purpose architectures like GPT-3.5 and GPT-4, and domain-specific implementations such as Radiology-GPT, are increasingly influencing the field of radiology by introducing advanced capabilities in natural language understanding and generation. Built upon transformer-based architectures and trained on massive datasets, these models are capable of parsing complex clinical language, making them particularly well-suited for a range of radiology applications [7,8].
One of the most impactful applications of LLMs is the automation of radiology report generation. These models can draft preliminary reports based on imaging findings, helping radiologists by saving time and reducing repetitive workload [7,9]. Early studies have demonstrated that LLM-generated reports can approximate the quality of expert-written texts, especially when fine-tuned on radiological corpora or guided with structured prompts [7]. This automation improves efficiency and contributes to the standardization of reporting language and format.
In line with this, LLMs support structured reporting (SR), which is increasingly advocated in modern radiology to improve report clarity, completeness, and interoperability. In their review, Busch et al. found a wide range of accuracy in SR of 25–100%, which demonstrates the potential but also the current limitations of LLMs [8]. Additionally, Alkalbani et al. reported an accuracy of 71–90% for NLP, which serves as the bridge between unstructured clinical text and structured data formats [10]. However, they also underline challenges, including performance and reliability issues, factual inaccuracy, hallucinations, and unreliable outcomes, and thus emphasize corresponding future research directions [10].
Furthermore, LLMs may facilitate the integration of imaging data with clinical and contextual information. When paired with patient histories, laboratory data, and other clinical records, these models can help contextualize imaging findings, suggest differential diagnoses, and even identify inconsistencies between clinical notes and image interpretations [4,7,10]. Sorin et al. applied ChatGPT-3.5 to clinical information from ten consecutive patients presented to a breast tumor board; in 70% of cases, the LLM’s recommendations aligned with those of the board [4]. Such assistance could be valuable to clinicians and thus translate into better outcomes for patients.
Another utility is information extraction from unstructured radiology texts. LLMs can identify and categorize clinical entities such as anatomical sites, pathological findings, and measurement data from free-text reports, thereby enabling secondary uses such as population health analysis, predictive modeling, and clinical research [7,8,9]. Their proficiency in natural language processing also supports multilingual report translation and terminology harmonization, which are essential in multinational healthcare settings [8,9].
However, applying these potential benefits in clinical settings requires careful model customization and prompt engineering [10]. General-purpose LLMs are not naturally familiar with specialized radiological terms and may produce hallucinations if not properly tuned. Fine-tuning these models on radiology-specific datasets and using domain-specific prompts are essential for improving their reliability and clinical use [9,10].
In their study, Wang et al. developed DRG-LLaMA, a fine-tuned large language model based on LLaMA, optimized for predicting DRGs from clinical discharge summaries in the MIMIC-IV dataset [11]. The model, particularly the 7-billion-parameter variant with a maximum input token length of 512, achieved a top-1 prediction accuracy of 52.0% and a macro-averaged area under the curve (AUC) of 0.986, demonstrating superior performance over established baselines such as ClinicalBERT and CAML [11]. Performance improved further with larger model sizes (up to 13 billion parameters) and extended input contexts (up to 1024 tokens), yielding a top-1 accuracy of 54.6%. When approached as a two-label classification task, separating base DRG and complication or comorbidity (CC)/major CC (MCC) status, the model attained top-1 accuracies of 67.8% for base DRG and 67.5% for CC/MCC, resulting in an overall DRG prediction accuracy of 51.5% via a mapping rule. The findings also revealed that prediction efficacy correlated positively with DRG frequency, training data volume, and input length. Error analysis, however, highlighted challenges such as inadequate clinical concept extraction and complexities in base DRG selection [11].
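As a minimal sketch of the two-label formulation described above, the fragment below combines a predicted base DRG with a predicted CC/MCC status through a lookup table. The entries shown are the familiar heart-failure MS-DRG family and are purely illustrative; the actual mapping rule in DRG-LLaMA covers the full DRG space.

    # Illustrative mapping from (base DRG, CC/MCC status) to a final MS-DRG.
    FINAL_DRG = {
        ("HEART FAILURE & SHOCK", "MCC"):  "291",
        ("HEART FAILURE & SHOCK", "CC"):   "292",
        ("HEART FAILURE & SHOCK", "NONE"): "293",
    }

    def combine(base_drg: str, cc_mcc: str) -> str | None:
        # Returns None when the pair is not an admissible combination,
        # which a production mapping rule must also handle.
        return FINAL_DRG.get((base_drg, cc_mcc))

    print(combine("HEART FAILURE & SHOCK", "MCC"))  # -> "291"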

2.2. Clinical Decision Support Systems Enhanced by LLMs

In healthcare settings, LLMs analyze vast and heterogeneous patient data, including symptoms, medical history, clinical notes, imaging, and laboratory results, to suggest differential diagnoses, recommend next steps, and summarize patient encounters with high efficiency and notable accuracy [12,13,14]. In their study on the use of large language model workflows in clinical decision support for triage, referral, and diagnosis, Gaber et al. employed an LLM-based workflow incorporating retrieval-augmented generation (RAG) and three models from the Claude family to assess 2000 medical cases covering a wide range of medical conditions. They found high levels of accuracy, especially when clinical data were used, in triage (57.7–82.8%), in predicting appropriate medical specialty referrals from patient data (76.9–87.5%), and in diagnosis prediction (67.4–82.5%) [13].
Recent research demonstrates that LLMs can approach specialist-level performance in generating diagnostic suggestions for various clinical scenarios and thus could become an important assisting tool for clinicians [13,15]. For example, Oumano et al. compared the accuracy of five different models, with and without RAG, on a set of 600 nuclear medicine technology board-examination-style questions across 15 nuclear medicine topics, and found the highest accuracy with two OpenAI models, which achieved scores of 0.787 and 0.783, respectively, when RAG was implemented [15]. Overall, RAG methods may improve reliability by reducing hallucinations and errors. For instance, Gaber et al. found that RAG-based CDSS may achieve accuracy similar to LLMs alone but have the advantage of drawing on external, trusted references, thereby reducing error rates [13,16].
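The sketch below illustrates the basic RAG loop underlying these systems: retrieve the most similar trusted documents by embedding similarity, then condition the model on them. The embed and call_llm functions are hypothetical placeholders (a toy embedding is used so the sketch runs), not components of the cited workflows.

    import numpy as np

    def embed(text: str) -> np.ndarray:
        # Hypothetical text-embedding model; a toy deterministic vector
        # here so the sketch runs without external services.
        rng = np.random.default_rng(abs(hash(text)) % (2**32))
        return rng.standard_normal(64)

    def call_llm(prompt: str) -> str:
        return "(model answer grounded in the retrieved context)"

    def rag_answer(question: str, corpus: list[str], k: int = 3) -> str:
        q = embed(question)
        docs = np.stack([embed(d) for d in corpus])
        # Cosine similarity between the question and each document.
        sims = docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q))
        context = "\n---\n".join(corpus[i] for i in np.argsort(sims)[::-1][:k])
        return call_llm(
            f"Answer using only this trusted context:\n{context}\n\nQuestion: {question}"
        )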
LLMs can also integrate with electronic health records (EHRs), enabling real-time extraction and summarization of relevant patient information [17]. Unstructured data, such as healthcare professionals’ narratives, includes important patient details; however, analyzing them is challenging due to their complexity and large volume [17]. Information Extraction (IE) involves converting unstructured text into structured data, such as entities (e.g., patient names and symptoms), relationships (e.g., between medications and diagnoses), or events (e.g., clinical procedures) [17]. The study by Vithanage et al. explores adapting generative LLMs, specifically Llama 3.1-8B, for clinical named entity recognition (NER), which is vital in automating IE, in nursing progress notes from residential aged care facilities. They compared zero-shot and few-shot learning approaches combined with parameter-efficient fine-tuning (PEFT) and RAG across two clinical areas: agitation in dementia and malnutrition risk factors. In their study, few-shot learning outperformed zero-shot learning, and PEFT significantly improved performance for both approaches. Notably, few-shot learning with RAG yielded better results than zero-shot learning with RAG, reaching an accuracy of up to 0.91, while zero-shot learning with PEFT performed comparably to few-shot learning with RAG, demonstrating the complementary strengths of PEFT and RAG [17]. Similarly, Tripathi et al. emphasize the potential of LLMs to streamline electronic medical records (EMRs), while raising privacy concerns regarding patient data security and compliance with regulations like the Health Insurance Portability and Accountability Act (HIPAA) [18].
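To illustrate the difference between the two prompting regimes compared by Vithanage et al., the snippet below assembles a zero-shot and a few-shot prompt for clinical NER. The note text, entity types, and output format are invented for illustration and are not taken from their study.

    # Zero-shot: task description only.
    ZERO_SHOT = (
        "Extract all malnutrition risk factors from the nursing note as a "
        'JSON list of {"entity": ..., "type": ...} objects.\n\n'
        "Note: {note}\nOutput:"
    )

    # Few-shot: same task, preceded by a worked example the model can imitate.
    FEW_SHOT = (
        "Extract all malnutrition risk factors as JSON.\n\n"
        'Note: "Resident refused lunch again, BMI now 17."\n'
        'Output: [{"entity": "refused lunch", "type": "reduced_intake"}, '
        '{"entity": "BMI 17", "type": "low_bmi"}]\n\n'
        "Note: {note}\nOutput:"
    )

    def build_prompt(note: str, few_shot: bool = True) -> str:
        # str.replace is used instead of str.format because the templates
        # contain literal JSON braces.
        template = FEW_SHOT if few_shot else ZERO_SHOT
        return template.replace("{note}", f'"{note}"')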
Despite these benefits, LLM-driven CDSS face challenges in ensuring patient safety, maintaining transparency, managing algorithmic bias, and safeguarding data privacy, notably in EHR [14,16,18]. Ongoing advancements, including domain-specific fine-tuning and the addition of explainable AI techniques, continue to strengthen the reliability and utility of LLMs in CDS, especially in complex fields like radiology and nuclear medicine [8,15,19].

2.3. Radiology-Specific Applications

Radiology stands to benefit significantly from LLMs because the specialty relies on SR and standardized terminology and produces extensive text data [Figure 3]. A primary transformative application is automated radiology report generation, where LLMs assist in drafting initial impressions, summarizing findings, and ensuring report consistency. When integrated with PACS or RIS systems, these models can reduce dictation time and enable radiologists to focus on more complex cases. Several domain-specific models, such as Radiology-GPT and Radiology-LLaMA, outperform general-purpose models in generating detailed, high-quality reports [19,20]. LLMs enhance error detection and quality assurance by examining past reports to find contradictions, missing findings, or inconsistent follow-ups. Their capacity to incorporate contextual information and evaluate coherence has demonstrated superior performance to traditional rule-based systems [21]. However, Hu et al. showed that a gap remains between LLM-generated impressions and those written by radiologists: the models performed well on completeness and correctness but poorly on conciseness and verisimilitude [21].
Additionally, LLMs improve SR by filling in standardized templates (such as BI-RADS, LI-RADS, and PI-RADS) using narrative descriptions or imaging features, which helps enhance interoperability and minimize variability [8,22]; a sketch of this pattern follows at the end of this section. Moreover, tools like MRScore have been developed to assess the clinical quality of AI-generated radiology reports through GPT-based (GPT-4 and GPT-4V) feedback systems, with a Mistral-7B-Instruct backbone used to calibrate the model [19]. In educational environments, LLMs are increasingly functioning as interactive radiology tutors. Research shows that fine-tuned LLMs can interpret cases, create board-style questions, and provide comprehensive feedback for radiology students, aiding learning and exam readiness [23,24].
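A minimal sketch of the template-filling pattern is shown below: the model is asked to populate a fixed JSON structure from narrative findings, and the output is parsed and checked before use. The field names are illustrative, not an official BI-RADS schema, and call_llm is a placeholder for any chat-completion API.

    import json

    TEMPLATE_PROMPT = (
        "Fill this JSON template from the narrative findings. Use null for "
        "anything not stated; do not invent values.\n"
        '{"breast_composition": null, "mass_present": null, '
        '"mass_location": null, "birads_category": null}\n\n'
        "Findings: {findings}\nReturn only JSON."
    )

    def fill_template(findings: str, call_llm) -> dict:
        raw = call_llm(TEMPLATE_PROMPT.replace("{findings}", findings))
        report = json.loads(raw)  # fail fast on malformed model output
        missing = {"birads_category", "mass_present"} - report.keys()
        if missing:
            raise ValueError(f"Template fields absent: {missing}")
        return report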

2.4. Nuclear Medicine Applications

In nuclear medicine, LLMs have begun to show value in interpretative and workflow tasks. For example, in PET/CT imaging, LLMs trained on oncologic datasets can summarize tracer uptake patterns, create impressions, and propose potential differentials. Domain-specific models, like fine-tuned RoBERTa for Deauville scoring, have demonstrated high accuracy in lymphoma assessments, often outperforming baseline models and sometimes matching human interpretation in pilot tests [25]. Integrating RAG with LLMs has also enhanced performance in nuclear medicine workflows. For example, Choi et al. developed a RAG-based large language model integrated with a database of over 211,000 PET imaging reports. The system clustered reports effectively in vector space by diagnosis and study type, achieved an 84.1% success rate in retrieving relevant similar cases (as judged by nuclear medicine physicians), and produced significantly higher appropriateness scores for suggested potential diagnoses with RAG than with the LLM alone [26]. Overall, the framework showed promise for aiding PET reporting by referencing past cases, supporting differential diagnoses, and enhancing educational and clinical decision-making in nuclear medicine [26].
LLMs also have the potential to be valuable for radioisotope therapy documentation, assisting with patient selection, dose planning, and procedural justification based on textual data from previous clinical notes, as Hirata et al. indicate in their review of LLMs in nuclear medicine [27]. Additionally, LLMs can identify adverse drug reactions and procedural complications from free-text reports to aid in post-therapy safety monitoring [28]. Emerging literature has also pointed toward the potential use of LLMs in radiopharmaceutical development and theranostics. Koller et al. selected a set of 197 research papers on theranostics and nuclear medicine, with a particular focus on Peptide Receptor Radionuclide Therapy, to improve the accuracy and relevance of LLM responses, and compared five LLMs augmented with both naive and advanced RAG techniques. Notably, GPT-4o and Claude 3 Opus achieved the highest scores, with accuracies above 0.8, although the authors stress that theranostics chatbots are not yet perfect and that future research should focus on providing accurate, reliable, and contextually relevant information as well as on decreasing the bias of these chatbots [29].
Even though evidence remains preliminary, early-stage applications suggest that LLMs may accelerate discovery pipelines and support therapeutic agent selection in nuclear medicine. As theranostics becomes increasingly central in precision oncology, future studies will likely explore how generative AI can contribute to drug design, target prediction, and individualized therapy protocols. Finally, LLMs are valuable in nuclear medicine education and board preparation. Evaluating GPT-3.5, GPT-4, and Bard on nuclear medicine board-style questions shows competitive performance, reinforcing their potential as digital tutors [30].
Future research on LLMs in radiology and healthcare should concentrate on five prioritized domains. To begin with, LLMs that integrate imaging, clinical notes, and laboratory data will significantly improve diagnostic precision, particularly in complex cases. Secondly, personalized decision support utilizing electronic health record (EHR)-integrated LLMs can facilitate context-aware imaging recommendations tailored to individual patient histories. Thirdly, efforts must be directed toward scalable deployment, including creating lightweight LLMs suitable for resource-constrained settings and implementing federated learning to ensure privacy-preserving applications. Furthermore, model explainability is crucial to foster clinician trust; techniques such as chain-of-thought prompting and uncertainty quantification may strengthen transparency. Lastly, establishing robust validation frameworks and regulatory guidelines is imperative to address challenges related to model drift, generalizability, and safety monitoring in real-world scenarios. Addressing these priorities will promote the safe, equitable, and clinically valuable integration of LLMs within radiology and nuclear medicine.

3. Implementation Challenges

3.1. Ethical and Safety Considerations

Deploying Large Language Models in healthcare settings presents significant patient safety risks that require careful consideration and mitigation strategies [31]. LLMs are prone to generating hallucinations: inaccurate or fabricated medical information that could adversely affect clinical decision-making and patient outcomes [32]. These hallucinations can manifest as incorrect patient information, inappropriate diagnostic conclusions, or misleading treatment recommendations, potentially leading to misdiagnosis, delayed care, or inappropriate interventions [1,4,10]. Additionally, although promising, the benefit of using LLMs in healthcare for patients and clinicians is not yet established and has not, for the moment, been demonstrated to offer a clear advantage over traditional methods [18].
In radiology and nuclear medicine, the consequences of LLM-generated errors are particularly concerning given the high-stakes nature of diagnostic imaging interpretation. Studies have documented hallucination rates ranging from 8% to 39.6% in medical-specialized models, with even sophisticated models like GPT-4 generating incorrect responses approximately 15% of the time [33]. Eliminating hallucinations represents a critical safety milestone, as demonstrated by retrieval-augmented generation approaches that have successfully reduced hallucination rates to zero in specific clinical applications [34].
The “black box” nature of many LLMs poses fundamental challenges for clinical implementation [35]. Healthcare providers require the ability to understand and explain AI-generated recommendations to ensure appropriate clinical decision-making and maintain patient trust. The lack of transparency creates several interconnected problems: healthcare providers cannot adequately explain AI-influenced decisions to patients, clinical validation becomes difficult without understanding the underlying reasoning, and accountability for AI-assisted decisions remains unclear [35].
Explainable AI (XAI) techniques are emerging as essential tools for addressing these transparency challenges [35]. However, current XAI approaches face significant limitations in healthcare contexts, where explanations must be technically accurate and clinically meaningful [36]. Developing interpretable models that maintain clinical relevance while providing transparent decision-making processes represents a critical research priority [36].
LLMs exhibit systematic biases that can perpetuate and amplify existing healthcare disparities. These biases manifest across multiple dimensions, including race, gender, socioeconomic status, and language proficiency [37]. In radiology and nuclear medicine, bias can result in differential diagnostic accuracy, inappropriate treatment recommendations, or unequal access to advanced imaging services across patient populations [38].
Algorithmic bias in DRG assignment systems poses particular risks for healthcare equity. Biased coding can lead to inappropriate reimbursement levels, resource allocation disparities, and systematic underpayment for care provided to marginalized populations [39]. The challenge is compounded by the fact that training data often reflect historical patterns of healthcare delivery that may embed discriminatory practices. Integrating LLMs into clinical workflows raises fundamental questions about patient autonomy and the adequacy of traditional informed consent processes [40]. Patients have varying preferences regarding disclosure of AI use in their healthcare, with studies indicating that many patients consider information about AI tools to be as crucial as traditional medical information.
The integration of LLMs into clinical decision-making creates complex accountability challenges. When AI-assisted decisions lead to adverse outcomes, determining responsibility among physicians, AI developers, healthcare institutions, and regulatory bodies becomes increasingly difficult. Current legal frameworks are inadequate for addressing these novel scenarios, creating uncertainty for healthcare providers and potentially hindering the adoption of beneficial AI technologies [41].

3.2. Economic Impact and Cost-Effectiveness

The deployment of LLMs may significantly affect healthcare professionals’ roles and responsibilities [42]. Concerns include the potential for deskilling, over-reliance on AI systems, and changes in the physician–patient relationship. The risk of automation bias, where healthcare providers inappropriately rely on AI-generated recommendations, poses challenges for maintaining clinical competence and professional judgment [43].
Training and education requirements for healthcare professionals using LLMs must address technical competencies and ethical considerations [43]. Developing guidelines for appropriate AI use, including when to accept, reject, or seek additional validation of AI-generated recommendations, represents an essential component of safe LLM deployment [42].

3.3. Technical Performance and Validation

Assessing the technical performance and validation of LLMs in CDS and DRG assignment is crucial for ensuring reliability and safety in healthcare settings [44]. LLMs are evaluated using comprehensive metrics, including accuracy, faithfulness, robustness, and generalizability, often benchmarked against curated clinical datasets and expert-generated standards. Domain-specific medical LLMs consistently outperform general-purpose models in factual accuracy and hallucination reduction, particularly in radiology and nuclear medicine applications [5,45,46].
Validation approaches follow established frameworks, such as the RELAINCE guidelines, for evaluating nuclear medicine AI. These propose four classes: promise, technical task-specific efficacy, clinical decision making, and post-deployment efficacy [46]. Specialty-specific metrics such as METRICS and MRScore assess both linguistic quality and clinical relevance of AI-generated content. The HealthBench framework represents a significant advancement, featuring 5000 multi-turn conversations evaluated by 262 physicians across 60 countries, using conversation-specific rubrics [47].
Validation typically proceeds through systematic stages: large-scale benchmarking using established datasets (MIMIC-IV, radiology boards), human-in-the-loop testing where clinicians review outputs, and real-world integration testing with electronic health records [48]. Cross-validation techniques and external validation protocols prevent overfitting and ensure generalizability across diverse clinical environments [49].
Persistent challenges include hallucination rates ranging from 8% to 39.6% in medical models, with advanced mitigation techniques such as chain-of-thought prompting and retrieval-augmented generation reducing rates to 1.47% in specific applications [50]. Generalizability remains problematic across different patient populations, healthcare systems, and clinical contexts, with local fine-tuning proving most effective for improving performance [47]. Continuous monitoring and iterative improvement are essential as medical knowledge evolves and models are updated [22,51]. Standardized, specialty-specific validation frameworks and ongoing clinician oversight remain necessary for trustworthy deployment in radiology, nuclear medicine, and broader healthcare CDSS [52].
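The core of the benchmarking described above can be reduced to a small harness like the one below, which scores model outputs against expert-labelled references and reports an unsupported-claim (hallucination) rate. The dataset, generator, and support-checking judge are placeholders, not any of the cited frameworks.

    from typing import Callable, Iterable, Tuple

    def hallucination_rate(
        cases: Iterable[Tuple[str, str]],
        generate: Callable[[str], str],
        is_supported: Callable[[str, str], bool],
    ) -> float:
        # cases: (input, reference) pairs. is_supported represents an
        # expert review or judge-model check of whether the generated
        # output is grounded in the reference.
        cases = list(cases)
        errors = sum(not is_supported(generate(x), ref) for x, ref in cases)
        return errors / len(cases)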

4. Future Directions and Research Priorities

4.1. Emerging LLM Technologies and Capabilities

The future of large language models (LLMs) in healthcare is closely linked with technological advancements prioritizing specialization and efficiency [53]. The current trajectory points toward developing and adopting domain-specific and multimodal models, tailored for applications in radiology, nuclear medicine, and other high-stakes clinical settings [1]. These models will be increasingly designed to integrate various forms of data (text, imaging, laboratory results, and genomics), allowing for more accurate and comprehensive CDS. Efficiency improvements will come from techniques such as model pruning, quantization, and knowledge distillation, resulting in smaller, more resource-efficient LLMs suitable for deployment even in resource-constrained settings. In parallel, RAG methods are being incorporated to markedly reduce hallucination rates and enhance factual accuracy.
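As one concrete example of these compression techniques, the PyTorch fragment below sketches standard (Hinton-style) knowledge distillation, where a small student model is trained against a larger teacher's softened output distribution. The hyperparameters are illustrative, not drawn from any cited work.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits: torch.Tensor,
                          teacher_logits: torch.Tensor,
                          labels: torch.Tensor,
                          T: float = 2.0, alpha: float = 0.5) -> torch.Tensor:
        # Soft target term: KL divergence between temperature-softened
        # teacher and student distributions, scaled by T^2 to keep
        # gradient magnitudes comparable across temperatures.
        soft = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)
        # Hard target term: ordinary cross-entropy on true labels.
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1 - alpha) * hard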
At the same time, explainable AI (XAI) techniques will improve transparency, user trust, and regulatory compliance by providing interpretable rationales for model outputs [27]. These advances will be complemented by the emergence of LLM-driven autonomous agents capable of executing multi-step workflows, automating documentation, triaging patients, and integrating seamlessly with electronic health records. Continuous learning mechanisms, including federated learning models, will be critical to adapting dynamically to new clinical guidelines and local health system realities, all while safeguarding patient privacy.
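A minimal sketch of the federated averaging step behind such privacy-preserving training is shown below: each site trains locally and shares only parameters, which are combined in proportion to local data volume. The arrays stand in for real model weights.

    import numpy as np

    def fed_avg(site_weights: list[np.ndarray], site_sizes: list[int]) -> np.ndarray:
        # Weighted average of per-site parameters (FedAvg); raw patient
        # records never leave the contributing hospital.
        total = sum(site_sizes)
        return sum(w * (n / total) for w, n in zip(site_weights, site_sizes))

    # e.g., three hospitals contributing updates from cohorts of different sizes
    global_w = fed_avg([np.ones(4), 2 * np.ones(4), 3 * np.ones(4)], [100, 300, 600])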

4.2. Integration with Precision Medicine and Personalized Care

LLMs are poised to become pivotal tools in precision medicine and personalized care by integrating and interpreting complex, heterogeneous patient data. Their ability to aggregate imaging, clinical notes, laboratory findings, and genomics will facilitate highly individualized diagnostic assessments, personalized therapeutic recommendations, and more effective risk stratification [54].
In practice, CDSS powered by these models can recommend personalized next steps, such as tailored imaging protocols or targeted therapies, while dynamically updating their knowledge base in response to evolving guidelines and accumulating clinical data [55]. Additionally, LLMs will contribute to patient engagement by generating tailored education materials and supporting conversational interfaces that address patient-specific concerns.
However, critical challenges remain, as robust validation across diverse populations is required to ensure generalizability, while ongoing efforts must address data privacy, bias reduction, and explainability. Interdisciplinary collaboration—engaging clinicians, computational scientists, ethicists, and patients—will play a vital role in developing clinically relevant, equitable, and trustworthy models in real-world settings [12,13].

4.3. Global Implementation Strategies

Wide-scale adoption of LLMs in healthcare will depend on coherent global implementation strategies prioritizing regulatory harmonization, equity, and capacity building. The current regulatory landscape for AI-enabled clinical tools is heterogeneous and lacks universally accepted safety, efficacy, and post-market monitoring standards, creating barriers to widespread deployment [56]. Addressing this requires establishing international consensus on evaluation protocols and continuous surveillance infrastructures to monitor model performance and safety as LLMs evolve. Scalable deployment strategies must support localization, such as language adaptation and fine-tuning to local data, while enabling use in low-resource environments through lightweight model architectures and federated learning solutions [57,58]. Ensuring equitable outcomes will require careful mitigation of algorithmic bias, inclusive dataset curation, and adaptation to region-specific healthcare challenges.
Additionally, systematic education and training for clinicians and patients will be essential to foster understanding, trust, and effective utilization of AI-driven systems. Ethical and legal frameworks must advance to clarify accountability, informed consent, and the appropriate integration of AI decision support into workflows, maintaining patient autonomy and public confidence. Finally, a commitment to ongoing real-world evaluation and iterative model improvement, anchored in transparent reporting of successes and failures, together with growing research on LLMs in healthcare, will underpin the safe and sustainable deployment of LLMs in global health systems [Figure 4].

5. Limitations of the Review

This review is subject to several limitations. It primarily relies on published English-language studies, potentially overlooking non-English and unpublished data, which may introduce selection bias. Most included articles focus on radiology and nuclear medicine, so findings might not generalize well to other specialties. The available evidence is mostly from retrospective analyses or technical validations, with limited prospective or real-world clinical trials involving LLMs in healthcare workflows. Significant heterogeneity in study designs, datasets, and evaluation criteria complicates direct comparisons and restricts the strength of recommendations. Additionally, eight references were drawn from arXiv, a preprint repository that may include non-peer-reviewed works, raising concerns about their reliability. Finally, the rapid evolution of AI means some findings could quickly become outdated, potentially altering the risk-benefit profile of deploying LLMs in clinical practice.

6. Conclusions

Large language models show promise for tasks like diagnostic-related group assignment, clinical decision support, and streamlining administrative work (e.g., documentation and coding), with positive results reported in radiology and nuclear medicine.
At the same time, the literature includes mixed or limited findings on advantages over traditional decision support, highlights implementation challenges (data quality, workflow fit, cost), and reports risks such as hallucinations, bias, and variable performance across settings.
Looking ahead, progress in multimodal architectures might broaden their usefulness for imaging and other data types, but safe adoption will depend on prospective, real-world assessments, transparent benchmarking against established systems, and clear governance around privacy, bias, and accountability. Safe and effective implementation will also require thorough validation and careful attention to ethical and regulatory issues. As technology and validation frameworks develop, multidisciplinary collaboration and continuous human oversight will be crucial to realizing the full potential of LLMs for patient care and healthcare system performance.

Author Contributions

Conceptualization, P.S.P., R.C.C. and M.F.G.; methodology, R.C.C.; software, R.P.; validation, R.C.C., G.V., V.P. and G.V.; formal analysis, R.C.C.; investigation, R.C.C. and P.S.P.; resources, M.F.G.; data curation, R.P.; writing—original draft preparation, R.C.C. and P.S.P.; writing—review and editing, P.S.P., R.C.C., R.P., V.P., G.V., E.V.K., P.J.P. and M.F.G.; visualization, G.V.; supervision, M.F.G.; project administration, M.F.G.; funding acquisition, M.F.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AI: Artificial Intelligence
AUC: Area Under the Curve
BI-RADS: Breast Imaging Reporting and Data System
CDS: Clinical Decision Support
CDSS: Clinical Decision Support System(s)
CT: Computed Tomography
DRG: Diagnosis-Related Group
DRG-LLaMA: Diagnosis-Related Group-LLaMA (LLM fine-tuned for DRG)
EHR: Electronic Health Record
EMR: Electronic Medical Record
GPT: Generative Pre-trained Transformer
GPT-3.5/GPT-4: Generative Pre-trained Transformer, Versions 3.5 and 4
HIPAA: Health Insurance Portability and Accountability Act
IE: Information Extraction
LLaMA: Large Language Model Meta AI
LI-RADS: Liver Imaging Reporting and Data System
LLM: Large Language Model
MRScore: Model-based Radiology Score (LLM-based radiology evaluation)
NLP: Natural Language Processing
PACS: Picture Archiving and Communication System
PET: Positron Emission Tomography
PI-RADS: Prostate Imaging Reporting and Data System
RAG: Retrieval-Augmented Generation
RIS: Radiology Information System
SR: Structured Reporting
XAI: Explainable Artificial Intelligence

References

  1. Maity, S.; Saikia, M.J. Large Language Models in Healthcare and Medical Applications: A Review. Bioengineering 2025, 12, 631. [Google Scholar] [CrossRef]
  2. Al-Garadi, M.; Mungle, T.; Ahmed, A.; Sarker, A.; Miao, Z.; Matheny, M.E. Large Language Models in Healthcare. arXiv 2025, arXiv:2503.04748. [Google Scholar] [CrossRef]
  3. Arnold, P.; Henkel, M.; Bamberg, F.; Kotter, E. Integration von Large Language Models in die Klinik: Revolution in der Analyse und Verarbeitung von Patientendaten zur Steigerung von Effizienz und Qualität in der Radiologie [Integration of large language models into the clinic: Revolution in analysing and processing patient data to increase efficiency and quality in radiology]. Radiologie 2025, 65, 243–248. [Google Scholar] [CrossRef]
  4. Meng, X.; Yan, X.; Zhang, K.; Liu, D.; Cui, X.; Yang, Y.; Zhang, M.; Cao, C.; Wang, J.; Wang, X.; et al. The application of large language models in medicine: A scoping review. iScience 2024, 27, 109713. [Google Scholar] [CrossRef] [PubMed]
  5. Liu, Z.; Li, Y.; Shu, P.; Zhong, A.; Jiang, H.; Pan, Y.; Yang, L.; Ju, C.; Wu, Z.; Ma, C.; et al. Radiology-GPT: A large language model for radiology. Meta Radiol. 2025, 3, 100153. [Google Scholar] [CrossRef]
  6. Zaki, H.A.; Aoun, A.; Munshi, S.; Abdel-Megid, H.; Nazario-Johnson, L.; Ahn, S.H. The Application of Large Language Models for Radiologic Decision Making. J. Am. Coll. Radiol. 2024, 21, 1072–1078. [Google Scholar] [CrossRef] [PubMed]
  7. D’Antonoli, T.A.; Stanzione, A.; Bluethgen, C.; Vernuccio, F.; Ugga, L.; Klontzas, M.E.; Cuocolo, R.; Cannella, R.; Koçak, B. Large language models in radiology: Fundamentals, applications, ethical considerations, risks, and future directions. Diagn. Interv. Radiol. 2024, 30, 80–90. [Google Scholar] [CrossRef]
  8. Busch, F.; Hoffmann, L.; dos Santos, D.P.; Makowski, M.R.; Saba, L.; Prucker, P.; Hadamitzky, M.; Navab, N.; Kather, J.N.; Truhn, D.; et al. Large language models for structured reporting in radiology: Past, present, and future. Eur. Radiol. 2024, 35, 2589–2602. [Google Scholar] [CrossRef]
  9. Yu, P.; Xu, H.; Hu, X.; Deng, C. Leveraging Generative AI and Large Language Models: A Comprehensive Roadmap for Healthcare Integration. Healthcare 2023, 11, 2776. [Google Scholar] [CrossRef]
  10. Alkalbani, A.M.; Alrawahi, A.S.; Salah, A.; Haghighi, V.; Zhang, Y.; Alkindi, S.; Sheng, Q.Z. A Systematic Review of Large Language Models in Medical Specialties: Applications, Challenges and Future Directions. Information 2025, 16, 489. [Google Scholar] [CrossRef]
  11. Wang, H.; Gao, C.; Dantona, C.; Hull, B.; Sun, J. DRG-LLaMA: Tuning LLaMA model to predict diagnosis-related group for hospitalized patients. NPJ Digit. Med. 2024, 7, 16. [Google Scholar] [CrossRef] [PubMed]
  12. Li, J.; Zhou, Z.; Lyu, H.; Wang, Z. Large language models-powered clinical decision support: Enhancing or replacing human expertise? Intell. Med. 2025, 5, 1–4. [Google Scholar] [CrossRef]
  13. Gaber, F.; Shaik, M.; Allega, F.; Bilecz, A.J.; Busch, F.; Goon, K.; Franke, V.; Akalin, A. Evaluating large language model workflows in clinical decision support for triage and referral and diagnosis. NPJ Digit. Med. 2025, 8, 263. [Google Scholar] [CrossRef]
  14. Vrdoljak, J.; Boban, Z.; Vilović, M.; Kumrić, M.; Božić, J. A Review of Large Language Models in Medical Education, Clinical Decision Support, and Healthcare Administration. Healthcare 2025, 13, 603. [Google Scholar] [CrossRef] [PubMed]
  15. Oumano, M.A.; Pickett, S.M. Comparison of Large Language Models’ Performance on 600 Nuclear Medicine Technology Board Examination-Style Questions. J. Nucl. Med. Technol. 2025. online ahead of print. [Google Scholar] [CrossRef]
  16. Gargari, O.K.; Habibi, G. Enhancing medical AI with retrieval-augmented generation: A mini narrative review. Digit. Health 2025, 11, 20552076251337177. [Google Scholar] [CrossRef] [PubMed]
  17. Vithanage, D.; Deng, C.; Wang, L.; Yin, M.; Alkhalaf, M.; Zhang, Z.; Zhu, Y.; Yu, P. Adapting Generative Large Language Models for Information Extraction from Unstructured Electronic Health Records in Residential Aged Care: A Comparative Analysis of Training Approaches. J. Health Inf. Res. 2025, 9, 191–219. [Google Scholar] [CrossRef]
  18. Tripathi, S.; Sukumaran, R.; Cook, T.S. Efficient healthcare with large language models: Optimizing clinical workflow and enhancing patient care. J. Am. Med. Inf. Assoc. 2024, 31, 1436–1440. [Google Scholar] [CrossRef]
  19. Liu, Y.; Wang, Z.; Li, Y.; Liang, X.; Liu, L.; Wang, L.; Zhou, L. MRScore: Evaluating Radiology Report Generation with LLM-based Reward System. arXiv 2024, arXiv:2404.17778. [Google Scholar] [CrossRef]
  20. Voinea, Ş.V.; Mămuleanu, M.; Teică, R.V.; Florescu, L.M.; Selișteanu, D.; Gheonea, I.A. GPT-Driven Radiology Report Generation with Fine-Tuned Llama 3. Bioengineering 2024, 11, 1043. [Google Scholar] [CrossRef]
  21. Hu, D.; Zhang, S.; Liu, Q.; Zhu, X.; Liu, B. Large Language Models in Summarizing Radiology Report Impressions for Lung Cancer in Chinese: Evaluation Study. J. Med. Internet Res. 2025, 27, e65547. [Google Scholar] [CrossRef]
  22. Li, H.; Wang, H.; Sun, X.; He, H.; Feng, J. Prompt-Guided Generation of Structured Chest X-Ray Report Using a Pre-trained LLM. arXiv 2024, arXiv:2404.11209. [Google Scholar] [CrossRef]
  23. Alkhaldi, A.; Alnajim, R.; Alabdullatef, L.; Alyahya, R.; Chen, J.; Zhu, D.; Alsinan, A.; Elhoseiny, M. MiniGPT-Med: Large Language Model as a General Interface for Radiology Diagnosis. arXiv 2024, arXiv:2407.04106. [Google Scholar] [CrossRef]
  24. Altalla’, B.; Ahmad, A.; Bitar, L.; Al-Bssol, M.; Al Omari, A.; Sultan, I. Radiology Report Annotation Using Generative Large Language Models: Comparative Analysis. Int. J. Biomed. Imaging 2025, 2025, 5019035. [Google Scholar] [CrossRef] [PubMed]
  25. Huemann, Z.; Lee, C.; Hu, J.; Cho, S.Y.; Bradshaw, T.J. Domain-adapted Large Language Models for Classifying Nuclear Medicine Reports. Radiol. Artif. Intell. 2023, 5, e220281. [Google Scholar] [CrossRef]
  26. Choi, H.; Lee, D.; Kang YKoo Suh, M. Empowering PET imaging reporting with retrieval-augmented large language models and reading reports database: A pilot single center study. Eur. J. Nucl. Med. Mol. Imaging 2025, 52, 2452–2462. [Google Scholar] [CrossRef]
  27. Hirata, K.; Matsui, Y.; Yamada, A.; Fujioka, T.; Yanagawa, M.; Nakaura, T.; Ito, R.; Ueda, D.; Fujita, S.; Tatsugami, F.; et al. Generative AI and large language models in nuclear medicine: Current status and future prospects. Ann. Nucl. Med. 2024, 38, 853–864. [Google Scholar] [CrossRef]
  28. Alberts, I.L.; Mercolli, L.; Pyka, T.; Prenosil, G.; Shi, K.; Rominger, A.; Afshar-Oromieh, A. Large language models (LLM) and ChatGPT: What will the impact on nuclear medicine be? Eur. J. Nucl. Med. Mol. Imaging 2023, 50, 1549–1552. [Google Scholar] [CrossRef]
  29. Koller, P.; Clement, C.; van Eijk, A.; Seifert, R.; Zhang, J.; Prenosil, G.; Sathekge, M.M.; Herrmann, K.; Baum, R.; Weber, W.A.; et al. Optimizing theranostics chatbots with context-augmented large language models. Theranostics 2025, 15, 5693–5704. [Google Scholar] [CrossRef] [PubMed]
  30. Vachatimanont, S.; Kingpetch, K. Exploring the capabilities and limitations of large language models in nuclear medicine knowledge with primary focus on GPT-3.5, GPT-4 and Google Bard. J. Med. Artif. Intell. 2024, 7, 5. [Google Scholar] [CrossRef]
  31. Liu, X.; Liu, H.; Yang, G.; Jiang, Z.; Cui, S.; Zhang, Z.; Wang, H.; Tao, L.; Sun, Y.; Song, Z.; et al. Medical large language model for diagnostic reasoning across specialties. Nat. Med. 2025, 31, 743–744. [Google Scholar] [CrossRef]
  32. Kim, Y.; Jeong, H.; Chen, S.; Li, S.S.; Lu, M.; Alhamoud, K.; Mun, J.; Grau, C.; Jung, M.; Gameiro, R. Medical Hallucinations in Foundation Models and Their Impact on Healthcare. arXiv 2025, arXiv:2503.05777. [Google Scholar] [CrossRef]
  33. Jones, N. AI hallucinations can’t be stopped—But these techniques can limit their damage. Nature 2025, 637, 778–780. [Google Scholar] [CrossRef] [PubMed]
  34. Asgari, E.; Montaña-Brown, N.; Dubois, M.; Khalil, S.; Balloch, J.; Yeung, J.A.; Pimenta, D. A framework to assess clinical safety and hallucination rates of LLMs for medical text summarisation. NPJ Digit. Med. 2025, 8, 274. [Google Scholar] [CrossRef]
  35. Kiseleva, A.; Kotzinos, D.; De Hert, P. Transparency of AI in Healthcare as a Multilayered System of Accountabilities: Between Legal Requirements and Technical Limitations. Front. Artif. Intell. 2022, 5, 879603. [Google Scholar] [CrossRef] [PubMed]
  36. Sadeghi, Z.; Alizadehsani, R.; Cifci, M.A.; Kausar, S.; Rehman, R.; Mahanta, P.; Bora, P.K.; Almasri, A.; Alkhawaldeh, R.S.; Hussain, S.; et al. A review of Explainable Artificial Intelligence in healthcare. Comput. Electr. Eng. 2024, 118, 109370. [Google Scholar] [CrossRef]
  37. Shen, Y.; Heacock, L.; Elias, J.; Hentel, K.D.; Reig, B.; Shih, G.; Moy, L. ChatGPT and Other Large Language Models Are Double-edged Swords. Radiology 2023, 307, e230163. [Google Scholar] [CrossRef] [PubMed]
  38. Yang, Y.; Liu, Y.; Liu, X.; Gulhane, A.; Mastrodicasa, D.; Wu, W.; Wang, E.J.; Sahani, D.; Patel, S. Demographic bias of expert-level vision-language foundation models in medical imaging. Sci. Adv. 2025, 11, eadq0305. [Google Scholar] [CrossRef]
  39. Yang, Y.; Liu, X.; Jin, Q.; Huang, F.; Lu, Z. Unmasking and quantifying racial bias of large language models in medical report generation. Commun. Med. 2024, 4, 176. [Google Scholar] [CrossRef]
  40. Rose, S.L.; Shapiro, D. An Ethically Supported Framework for Determining Patient Notification and Informed Consent Practices When Using Artificial Intelligence in Health Care. Chest 2024, 166, 572–578. [Google Scholar] [CrossRef]
  41. Smith, H.; Fotheringham, K. Artificial intelligence in clinical decision-making: Rethinking liability. Med. Law Int. 2020, 20, 131–154. [Google Scholar] [CrossRef]
  42. Abuadas, M.; Albikawi, Z.; Rayani, A. The impact of an AI-focused ethics education program on nursing students’ ethical awareness, moral sensitivity, attitudes, and generative AI adoption intention: A quasi-experimental study. BMC Nurs. 2025, 24, 720. [Google Scholar] [CrossRef] [PubMed]
  43. Aucouturier, E.; Grinbaum, A. Training Bioethics Professionals in AI Ethics: A Framework. J. Law Med. Ethic 2025, 53, 176–183. [Google Scholar] [CrossRef]
  44. Huang, Y.; Tang, K.; Chen, M.; Wang, B. A Comprehensive Survey on Evaluating Large Language Model Applications in the Medical Industry. arXiv 2024, arXiv:2404.15777. [Google Scholar] [CrossRef]
  45. Abbasian, M.; Khatibi, E.; Azimi, I.; Oniani, D.; Abad, Z.S.H.; Thieme, A.; Sriram, R.; Yang, Z.; Wang, Y.; Lin, B.; et al. Foundation metrics for evaluating effectiveness of healthcare conversations powered by generative AI. NPJ Digit. Med. 2024, 7, 82. [Google Scholar] [CrossRef] [PubMed]
  46. Chen, C.; Yu, J.; Chen, S.; Liu, C.; Wan, Z.; Bitterman, D.; Wang, F.; Shu, K. ClinicalBench: Can LLMs Beat Traditional ML Models in Clinical Prediction? arXiv 2024, arXiv:2411.06469. [Google Scholar] [CrossRef]
  47. Park, S.H.; Han, K. Methodologic Guide for Evaluating Clinical Performance and Effect of Artificial Intelligence Technology for Medical Diagnosis and Prediction. Radiology 2018, 286, 800–809. [Google Scholar] [CrossRef]
  48. Liu, F.; Li, Z.; Zhou, H.; Yin, Q.; Yang, J.; Tang, X.; Luo, C.; Zeng, M.; Jiang, H.; Gao, Y.; et al. Large Language Models Are Poor Clinical Decision-Makers: A Comprehensive Benchmark. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024; pp. 13696–13710. [Google Scholar] [CrossRef]
  49. Waldock, W.J.; Zhang, J.; Guni, A.; Nabeel, A.; Darzi, A.; Ashrafian, H. The Accuracy and Capability of Artificial Intelligence Solutions in Health Care Examinations and Certificates: Systematic Review and Meta-Analysis. J. Med. Internet Res. 2024, 26, e56532. [Google Scholar] [CrossRef]
  50. Takita, H.; Kabata, D.; Walston, S.L.; Tatekawa, H.; Saito, K.; Tsujimoto, Y.; Miki, Y.; Ueda, D. A systematic review and meta-analysis of diagnostic performance comparison between generative AI and physicians. NPJ Digit. Med. 2025, 8, 175. [Google Scholar] [CrossRef]
  51. Seo, J.; Choi, D.; Kim, T.; Cha, W.C.; Kim, M.; Yoo, H.; Oh, N.; Yi, Y.; Lee, K.H.; Choi, E. Evaluation Framework of Large Language Models in Medical Documentation: Development and Usability Study. J. Med. Internet Res. 2024, 26, e58329. [Google Scholar] [CrossRef] [PubMed]
  52. Rahman, S.; Jiang, L.Y.; Gabriel, S.; Aphinyanaphongs, Y.; Oermann, E.K.; Chunara, R. Generalization in Healthcare AI: Evaluation of a Clinical Large Language Model. arXiv 2024, arXiv:2402.10965. [Google Scholar] [CrossRef]
  53. Zhang, K.; Meng, X.; Yan, X.; Ji, J.; Liu, J.; Xu, H.; Zhang, H.; Liu, D.; Wang, J.; Wang, X.; et al. Revolutionizing Health Care: The Transformative Impact of Large Language Models in Medicine. J. Med. Internet Res. 2025, 27, e59069. [Google Scholar] [CrossRef]
  54. Liang, S.; Zhang, J.; Liu, X.; Huang, Y.; Shao, J.; Liu, X.; Li, W.; Wang, G.; Wang, C. The potential of large language models to advance precision oncology. EBioMedicine 2025, 115, 105695. [Google Scholar] [CrossRef]
  55. Aththanagoda, A.K.N.L.; Kulathilake, K.A.S.H.; Abdullah, N.A. Precision and Personalization: How Large Language Models Redefining Diagnostic Accuracy in Personalized Medicine—A Systematic Literature Review. IEEE J. Biomed. Health Inform. 2025. online ahead of print. [Google Scholar] [CrossRef]
  56. Dennstädt, F.; Hastings, J.; Putora, P.M.; Schmerder, M.; Cihoric, N. Implementing large language models in healthcare while balancing control, collaboration, costs and security. NPJ Digit. Med. 2025, 8, 143. [Google Scholar] [CrossRef] [PubMed]
  57. Kufel, J.; Bargieł, K.; Koźlik, M.; Czogalik, Ł.; Dudek, P.; Jaworski, A.; Magiera, M.; Bartnikowska, W.; Cebula, M.; Nawrat, Z.; et al. Usability of Mobile Solutions Intended for Diagnostic Images—A Systematic Review. Healthcare 2022, 10, 2040. [Google Scholar] [CrossRef] [PubMed]
  58. Sorin, V.; Klang, E.; Sklair-Levy, M.; Cohen, I.; Zippel, D.B.; Lahat, N.B.; Konen, E.; Barash, Y. Large language model (ChatGPT) as a support tool for breast tumor board. NPJ Breast Cancer 2023, 9, 44. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Worldwide distribution of countries implementing LLMs in healthcare, with nations shaded by their implementation level (basic, moderate, or advanced), based on research activity and publications during our search.
Figure 2. Bar chart displaying the frequency of specific LLM models (such as GPT-4, GPT-3.5, and ChatGPT) that we found in large databases like Web of Science, Scopus, and PubMed, as utilized in published healthcare research studies.
Figure 3. During our search, the majority of LLM healthcare applications we found were concentrated in the areas of medical knowledge assessment, CDS, and radiology, with smaller proportions devoted to administrative tasks, patient education, nuclear medicine, and other specialties.
Figure 4. Line chart showing the annual number of published studies on LLMs in healthcare from 2020 to 2024, highlighting the rapid growth in research activity.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
