Large Language Models for Cardiovascular Disease, Cancer, and Mental Disorders: A Review of Systematic Reviews
Abstract
1. Introduction
2. Methodology
3. Results
3.1. Literature Search Outcomes
3.2. Quality Assessment of Reviews and Original Studies
3.3. Characteristics of Individual Studies
3.4. Summary of LLM Performance
3.5. Benefits of Using LLMs
3.5.1. Detection and Screening
3.5.2. Risk Assessment and Triage
3.5.3. Clinical Reasoning and Decision Support
3.5.4. Patient Education and Participation
3.5.5. Summary, Documentation, and Research Support
3.6. Challenges of Using LLMs
3.6.1. Accuracy, Safety, and Efficacy of LLMs in Real-World Settings
3.6.2. Bias and Model Transparency
3.6.3. Data Privacy, Security, and Regulatory Challenges
3.6.4. Data Availability and Generalizability
3.6.5. Continuous Human Oversight and Ethical Governance Frameworks
3.6.6. Methodological Quality of LLM Studies
4. Discussion
4.1. Main Findings
4.2. Limitations
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Appendix A
| Study | Target Disease | Application | LLM Used | Data Sources | Accuracy | Key Outcomes |
|---|---|---|---|---|---|---|
| Kusunose et al. [42] | CVD | Question-answering based on Japanese hypertension guidelines | GPT 3.5 | Japanese Society of Hypertension Guidelines for the Management of Hypertension | 64.50% | ChatGPT performed well on clinical questions, but its performance on hypertension treatment guideline topics was less satisfactory |
| Rizwan and Sadiq [43] | CVD | Question-answering on treatment and management plans | GPT 4.0 | Hypothetical questions simulating clinical consultation | N/A | Of the 10 clinical scenarios presented to ChatGPT, eight were diagnosed correctly |
| Skalidis et al. [44] | CVD | Performance on European cardiology exams | GPT (Version N/A) | Exam questions from the ESC website, StudyPRN and Braunwald’s Heart Disease Review and Assessment | 58.80% | Results demonstrate that ChatGPT can pass the European cardiology exams |
| Van Bulck and Moons [45] | CVD | Question-answering on common cardiovascular diseases | GPT (Version N/A) | Virtual patient questions | N/A | Experts considered ChatGPT-generated responses trustworthy and valuable, with few considering them dangerous. Forty percent of the experts found ChatGPT responses more valuable than Google |
| Williams and Shambrook [46] | CVD | Question-answering on cardiovascular computed tomography | GPT 3.5 | Questions from the Society of Cardiovascular Computed Tomography 2023 program, as well as questions about high-risk plaque (HRP), quantitative plaque analysis, and how AI will transform cardiovascular CT | N/A | The answers to debate questions were plausible and provided reasonable summaries of the key debate points |
| Choi et al. [47] | Breast Cancer | Evaluation of the time and cost of developing LLM prompts tailored to extract clinical factors in breast cancer patients | GPT 3.5 | Data from surgical pathology and ultrasound reports of 2931 breast cancer patients who underwent radiotherapy from 2020 to 2022 | 87.70% | Developing and processing the prompts took 3.5 h and 15 min, respectively. Using the ChatGPT application programming interface cost US $65.8; factoring in the estimated wage, the total cost was US $95.4 |
| Griewing et al. [48] | Breast Cancer | Concordance with tumor board clinical decisions | GPT 3.5 | Fictitious patient data with clinical and diagnostic data | 50–95% | ChatGPT 3.5 can provide treatment recommendations for breast cancer patients that are consistent with the multidisciplinary tumor board decision making of a gynecologic oncology center in Germany |
| Haver et al. [49] | Breast Cancer | Question-answering on breast cancer prevention and screening | GPT 3.5 | 25 questions addressing fundamental concepts related to breast cancer prevention and screening | 88% | ChatGPT provided appropriate responses for most questions posed about breast cancer prevention and screening, as assessed by fellowship-trained breast radiologists |
| Lukac et al. [50] | Breast Cancer | Tumor board clinical decision support | GPT 3.5 | Tumor characteristics and age of 10 consecutive pretreatment patient cases | 64.20% | ChatGPT provided mostly general answers based on the inputs, generally in agreement with the decisions of the multidisciplinary tumor board (MDT) |
| Rao et al. [51] | Breast Cancer | Question-answering based on American College of Radiology recommendations | GPT 4.0, GPT 3.5 | Prompting mechanisms and clinical presentations based on the American College of Radiology | 88.9–98.4% | ChatGPT displays impressive accuracy in identifying the appropriateness of common imaging modalities for breast cancer screening and breast pain |
| Sorin et al. [52] | Breast Cancer | Tumor board clinical decision support | GPT 3.5 | Clinical and diagnostic data of 10 patients | 70% | The chatbot’s clinical recommendations were in line with those of the tumor board in 70% of cases |
| Abilkaiyrkyzy et al. [53] | Mental Disease | Mental illness detection using a chatbot | BERT | 219 E-DAIC participants | 69% | The chatbot effectively detects and classifies mental health issues and is highly usable, helping to reduce barriers to mental health care |
| Adarsh et al. [54] | Mental Disease | Depression sign detection using BERT | BERT-small | Social media texts | 63.60% | The enhanced BERT model accurately classifies depression severity from social media texts, capturing nuances better than comparator models |
| Alessa and Al-Khalifa [55] | Mental Disease | Mental health interventions for the elderly using LLM-supported CAs | ChatGPT, Google Cloud API | Records of interactions with the CA; results of the human experts’ assessment | N/A | The proposed ChatGPT-based system effectively serves as a companion for elderly individuals, helping to alleviate loneliness and social isolation. Preliminary evaluations showed that the system could generate relevant responses tailored to elderly personas |
| Beredo and Ong [56] | Mental Disease | Mental health interventions using CAs supported by LLMs | EREN, MHBot, PERMA | EmpatheticDialogues (24,850 conversations); Well-Being Conversations; Perma Lexica | N/A | This study demonstrated a hybrid conversation model combining generative and retrieval approaches to improve language fluency and empathetic response generation in chatbots. Tested through both automated metrics and human evaluation, the medium variation of the FTER model outperformed vanilla DialoGPT in perplexity, and the human-likeness, relevance, and empathetic qualities of responses were significantly enhanced, making VHope a more competent CA with empathetic abilities |
| Berrezueta-Guzman et al. [57] | Mental Disease | Evaluation of the efficacy of ChatGPT in ADHD therapy enhancement | ChatGPT | Evaluations from 10 attention deficit hyperactivity disorder (ADHD) therapy experts and interactions between therapists and the custom ChatGPT | N/A | The custom ChatGPT demonstrated strong capabilities in engaging language use, maintaining interest, promoting active participation, and fostering a positive atmosphere in ADHD therapy sessions, with high ratings in communication and language. Areas needing improvement included confidentiality and privacy, cultural and sensory sensitivity, and the handling of nonverbal cues |
| Blease et al. [58] | Mental Disease | Evaluation of psychiatrists’ perceptions of LLMs | ChatGPT, Bard, Bing AI | Survey responses from 138 APA members on LLM chatbot use in psychiatry | N/A | Over half of the psychiatrists surveyed used AI tools such as ChatGPT for clinical questions; nearly 70% agreed that these tools improve documentation efficiency, and almost 90% indicated a need for more training, while expressing mixed opinions on patient care impacts and privacy concerns |
| Bokolo and Liu [29] | Mental Disease | Depression detection from Twitter | RoBERTa, DeBERTa | 632,000 tweets | 97.48% | Transformer models like RoBERTa excel in depression detection from Twitter data, outperforming traditional ML approaches |
| Crasto et al. [59] | Mental Disease | Mental health interventions using CAs supported by LLMs | DialoGPT | Counselchat (includes illness tags); question–answer pairs from 100 college students | N/A | The DialoGPT model, which demonstrated higher perplexity and was preferred by 63% of college participants for its human-like and empathetic responses, was chosen as the most suitable system for addressing student mental health issues |
| Dai et al. [24] | Mental Disease | Psychiatric patient screening | BERT, DistilBERT, ALBERT, RoBERTa | 500 EHRs | Accuracy not reported, F1 0.830 | BERT models, especially with feature dependency, effectively classify psychiatric conditions from EHRs |
| Danner et al. [60] | Mental Disease | Detecting depression using LLMs through clinical interviews | BERT, GPT 3.5, ChatGPT 4 | DAIC-WOZ; Extended-DAIC; simulated data | 78% | In recognizing depression in text, GPT-3.5-turbo and ChatGPT-4 yielded F1 scores of 0.78 and 0.61 on the DAIC-WOZ dataset, respectively, while a custom BERT model, extended-trained on a larger dataset, achieved an F1 score of 0.82 on the Extended-DAIC dataset |
| Dergaa et al. [61] | Mental Disease | Simulated mental health assessments and interventions with ChatGPT | GPT 3.5 | Fictional patient data; 3 scenarios | N/A | ChatGPT showed limitations in complex medical scenarios, underlining its unpreparedness for standalone use in mental health practice |
| Diniz et al. [62] | Mental Disease | Detecting suicidal ideation using LLMs through Twitter texts | BERT model for Portuguese, Multilingual BERT (base), BERTimbau | Non-clinical user posts from the online social network Twitter | 95% | The Boamente system demonstrated effective text analysis for suicidal ideation with high privacy standards and actionable insights for mental health professionals. The best-performing model, BERTimbau Large (accuracy 0.955; precision 0.961; F-score 0.954; AUC 0.954), excelled in detecting suicidal tendencies, showing robust accuracy and recall in model evaluations |
| D’Souza et al. [27] | Mental Disease | Responding to psychiatric case vignettes with diagnostic and management strategies | GPT 3.5 | Fictional patient data from clinical case vignettes; 100 cases | 61% | ChatGPT 3.5 showed high competence in handling psychiatric case vignettes, with strong diagnostic and management strategy generation |
| Elyoseph et al. [63] | Mental Disease | Evaluating emotional awareness compared to general population norms | GPT 3.5 | Fictional scenarios from the LEAS; 750 participants | 85% | ChatGPT showed higher emotional awareness compared to the general population and improved over time |
| Elyoseph and Levkovich [64] | Mental Disease | Assessing suicide risk in fictional scenarios and comparing to professional evaluations | GPT 3.5 | Fictional patient data; text vignettes compared to 379 professionals | N/A | ChatGPT underestimated suicide risks compared to mental health professionals, indicating the need for human judgment in complex assessments |
| Elyoseph et al. [30] | Mental Disease | Evaluating prognosis in depression compared to other LLMs and professionals | GPT 3.5, GPT 4 | Fictional patient data; text vignettes compared to 379 professionals | N/A | ChatGPT 3.5 showed a more pessimistic prognosis in depression compared to other LLMs and mental health professionals |
| Esackimuthu et al. [65] | Mental Disease | Depression detection from social media text | ALBERT base v1 | Social media texts (LT-EDI-ACL2022 shared task) | 50% | ALBERT shows potential in detecting depression signs from social media texts but faces challenges due to complex human emotions |
| Farhat [66] | Mental Disease | Evaluation of ChatGPT as a complementary mental health resource | ChatGPT | Responses generated by ChatGPT | N/A | ChatGPT displayed significant inconsistencies and low reliability when providing mental health support for anxiety and depression, underlining the necessity of validation by medical professionals and cautious use in mental health contexts |
| Farruque et al. [67] | Mental Disease | Depression level detection modeling | MentalBERT (MBERT) | 13,387 Reddit samples | Accuracy not reported, F1 0.81 | MBERT enhanced with text excerpts significantly improves depression level classification from social media posts |
| Farruque et al. [68] | Mental Disease | Depression symptoms modeling from Twitter | BERT, Mental-BERT | 6077 tweets and 1500 annotated tweets | Accuracy not reported, F1 0.45 | Semi-supervised learning models, iteratively refined with Twitter data, improve depression symptom detection accuracy |
| Friedman and Ballentine [69] | Mental Disease | Evaluation of LLMs in data-driven discovery: correlating sentiment changes with psychoactive experiences | BERTowid, BERTiment | Erowid testimonials; drug receptor affinities; brain gene expression data; 58K annotated Reddit posts | N/A | This paper found that LLM methods can create unified and robust quantifications of subjective experiences across various psychoactive substances and timescales. The representations learned are evocative and mutually confirmatory, indicating significant potential for LLMs in characterizing psychoactivity |
| Ghanadian et al. [70] | Mental Disease | Suicidal ideation detection using LLMs through social media texts | ALBERT, DistilBERT, ChatGPT, Flan-T5, Llama | UMD dataset; synthetic datasets generated using LLMs such as Flan-T5 and Llama2 to augment the UMD dataset | 87% | The synthetic data-driven method achieved consistent F1-scores of 0.82, comparable to real-world data models yielding F1-scores between 0.75 and 0.87. When 30% of the real-world UMD dataset was combined with the synthetic data, performance improved significantly, reaching an F1-score of 0.88 on the UMD test set and highlighting the effectiveness of synthetic data in addressing data scarcity |
| Hadar-Shoval et al. [71] | Mental Disease | Differentiating emotional responses in BPD and SPD scenarios using mentalizing abilities | GPT 3.5 | Fictional patient data (BPD and SPD scenarios); AI-generated data | N/A | ChatGPT effectively differentiated emotional responses in BPD and SPD scenarios, showing tailored mentalizing abilities |
| Hayati et al. [72] | Mental Disease | Detecting depression in Malay dialect speech using LLMs | GPT 3 | Interviews with 53 adults fluent in Kuala Lumpur (KL), Pahang, or Terengganu Malay dialects | 73% | GPT-3 was tested on three dialectal Malay datasets (combined, KL, and non-KL) and performed best on the KL dataset with a max_example value of 10. Despite the promising results, the non-KL dataset showed the lowest performance, suggesting that larger or more homogeneous datasets may be necessary for improved accuracy in depression detection tasks |
| He et al. [73] | Mental Disease | Evaluation of LLM-supported CAs in counseling people with autism | ChatGPT | Publicly available data from the web-based medical consultation platform DXY | N/A | The study found that 46.86% of assessors preferred responses from physicians, 34.87% favored ChatGPT, and 18.27% favored ERNIE Bot. Physicians and ChatGPT showed higher accuracy and usefulness than ERNIE Bot, while ChatGPT outperformed both in empathy. The study concluded that while physicians’ responses were generally superior, LLMs like ChatGPT can provide valuable guidance and greater empathy, though further optimization and research are needed for clinical integration |
| Hegde et al. [74] | Mental Disease | Depression detection using supervised learning | Ensemble of ML classifiers, BERT | Social media text data | Accuracy not reported, F1 0.479 | BERT-based Transfer Learning model outperforms traditional ML classifiers in detecting depression from social media texts |
| Heston [75] | Mental Disease | Simulating depression scenarios and evaluating AI’s responses | GPT 3.5 | Fictional patient data; 25 conversational agents | N/A | ChatGPT-3.5 conversational agents recommended human support at critical points, highlighting the need for AI safety in mental health |
| de Hond et al. [76] | Mental Disease | Early depression risk detection in cancer patients | BERT | 16,159 cancer patients’ EHR data | Accuracy not reported, AUROC 0.74 | Machine learning models predict depression risk in cancer patients using EHRs, with structured data models performing best |
| Howard et al. [77] | Mental Disease | Detecting suicidal ideation using LLMs through social media texts | DeepMoji, Universal Sentence Encoder, GPT 1 | 1588 labeled posts from the Computational Linguistics and Clinical Psychology 2017 shared task | Accuracy not reported, F1 0.414 | The top-performing system, utilizing features derived from the GPT-1 model fine-tuned on over 150,000 unlabeled Reachout.com posts, achieved a new state-of-the-art macro-averaged F1 score of 0.572 on the CLPsych 2017 task without relying on metadata or preceding posts. However, error analysis indicated that this system frequently misses expressions of hopelessness |
| Hwang et al. [78] | Mental Disease | Generating psychodynamic formulations in psychiatry based on patient history | GPT 4 | Fictional patient data from published psychoanalytic literature; 1 detailed case | N/A | GPT-4 successfully created relevant and accurate psychodynamic formulations based on patient history |
| Ilias et al. [79] | Mental Disease | Stress and depression identification in social media | BERT, MentalBERT | Public datasets | Accuracy not reported, F1 0.73 | Extra-linguistic features improve calibration and performance of models in detecting stress and depression from texts |
| Janatdoust et al. [80] | Mental Disease | Depression signs detection from social media text | Ensemble of BERT, ALBERT, DistilBERT, RoBERTa | 16,632 social media comments | 61% | Ensemble models effectively classify depression signs from social media, utilizing multiple language models for improved accuracy |
| Kabir et al. [81] | Mental Disease | Depression severity detection from tweets | BERT, DistilBERT | 40,191 tweets | Accuracy not reported, AUROC 0.74–0.86 | Models effectively classify social media texts into depression severity categories, with high confidence and accuracy |
| Kumar et al. [82] | Mental Disease | Evaluation of GPT 3 in mental health intervention | GPT 3 | 209 participant responses, with 189 valid responses after filtering | N/A | Interaction with either of the chatbots improved participants’ intent to practice mindfulness again, while the tutorial video enhanced their overall experience of the exercise. These findings highlight the promise of LLM-based chatbots for awareness-related interventions and outline directions for further exploration |
| Lam et al. [83] | Mental Disease | Multi-modal depression detection | Transformer, 1D CNN | 189 DAIC-WOZ participants | Accuracy not reported, F1 0.87 | Multi-modal models combining text and audio data effectively detect depression, enhanced by data augmentation |
| Lau et al. [28] | Mental Disease | Depression severity assessment | Prefix-tuned LLM | 189 clinical interview transcripts | Accuracy not reported, RMSE 4.67 | LLMs with prefix-tuning significantly enhance depression severity assessment, surpassing traditional methods |
| Levkovich and Elyoseph [31] | Mental Disease | Diagnosing and treating depression, comparing GPT models with primary care physicians | GPT 3.5, GPT 4 | Fictional patient data from clinical case vignettes; repeated multiple times for consistency | N/A | ChatGPT aligned with guidelines for depression management, contrasting with primary care physicians and showing no gender or socioeconomic biases |
| Levkovich and Elyoseph [32] | Mental Disease | Evaluating suicide risk assessments by GPT models and mental health professionals | GPT 3.5, GPT 4 | Fictional patient data; text vignettes compared to 379 professionals | N/A | GPT 4’s evaluations of suicide risk were similar to mental health professionals, though with some overestimations and underestimations |
| Li et al. [84] | Mental Disease | Evaluating performance on psychiatric licensing exams and diagnostics | GPT 4, Bard, and Llama-2 | Fictional patient data in exam and clinical scenario questions; 24 experienced psychiatrists | 69% | GPT 4 outperformed the other models in psychiatric diagnostics, closely matching the capabilities of human psychiatrists |
| Liyanage et al. [85] | Mental Disease | Data augmentation for wellness dimension classification in Reddit posts | GPT 3.5 | Real patient data from Reddit posts; 3092 instances, post-augmentation 4376 records | 69% | ChatGPT models effectively augmented Reddit post data, significantly improving classification performance for wellness dimensions |
| Lossio-Ventura et al. [26] | Mental Disease | Evaluations of LLMs for sentiment analysis through social media texts | ChatGPT; Open Pre-Trained Transformers (OPT) | NIH Data Set; Stanford Data Set | 86% | This paper revealed high variability and disagreement among sentiment analysis tools when applied to health-related survey data. OPT and ChatGPT outperformed all other tools, and ChatGPT outperformed OPT, achieving 6% higher accuracy and a 4% to 7% higher F-measure |
| Lu et al. [86] | Mental Disease | Depression detection via conversation turn classification | BERT, transformer encoder | DAIC dataset | Accuracy not reported, F1 0.75 | Novel deep learning framework enhances depression detection from psychiatric interview data, improving interpretability |
| Ma et al. [87] | Mental Disease | Evaluation of mental health intervention CAs supported by LLMs | GPT 3 | 120 Reddit posts (2913 user comments) | N/A | The study highlighted that CAs like Replika, powered by LLMs, offered crucial mental health support by providing immediate, unbiased assistance and fostering self-discovery. However, they struggled with content filtering, consistency, user dependency, and social stigma, underscoring the importance of cautious use and improvement in mental wellness applications |
| Mazumdar et al. [88] | Mental Disease | Classifying mental health disorders and generating explanations | GPT 3, BERT-large, MentalBERT, ClinicBERT, and PsychBERT | Real patient data sourced from Reddit posts | 87% | GPT 3 outperformed other models in classifying mental health disorders and generating explanations, showing promise for AI-IoMT deployment |
| Metzler et al. [89] | Mental Disease | Detecting suicidal ideation using LLMs through Twitter texts | BERT, XLNet | 3202 English tweets | 88.50% | BERT achieved F1-scores of 0.93 for accurately labeling tweets as about suicide and 0.74 for off-topic tweets in the binary classification task. Its performance was similar to or exceeded human performance and matched that of state-of-the-art models on similar tasks |
| Owen et al. [90] | Mental Disease | Depression signal detection in Reddit posts | BERT, MentalBERT | Reddit datasets | Accuracy not reported, F1 0.64 | Effective identification of depressive signals in online forums, with potential for early intervention |
| Parker and Spoelma [91] | Mental Disease | Providing information on bipolar disorder and generating creative content | GPT 3 | N/A | N/A | GPT-3 provided basic material on bipolar disorders and creative song generation but lacked depth for scientific review |
| Perlis et al. [34] | Mental Disease | Evaluation of GPT 4 for clinical decision support in bipolar depression | GPT 4 turbo (gpt-4-1106-preview) | Recommendations generated by the augmented GPT-4 model and responses from clinicians treating bipolar disorder | 50.80% | This paper found that the augmented GPT 4 model had a Cohen’s kappa of 0.31 with expert consensus, identifying the optimal treatment in 50.8% of cases and placing it in the top 3 in 84.4% of cases. In contrast, the base model had a Cohen’s kappa of 0.09 and identified the optimal treatment in 23.4% of cases, highlighting the enhanced performance of the augmented model in aligning with expert recommendations |
| Poświata and Perełkiewicz [92] | Mental Disease | Depression sign detection using RoBERTa | RoBERTa, DepRoBERTa | Social media texts (LT-EDI-ACL2022 shared task) | Accuracy not reported, F1 0.583 | A RoBERTa and DepRoBERTa ensemble excels in classifying depression signs, securing top performance in a competitive environment |
| Pourkeyvan et al. [25] | Mental Disease | Mental health disorder prediction from Twitter | BERT models from Hugging Face | 11,890,632 tweets and 553 bio-descriptions | 97% | Superior detection of depression symptoms from social media, demonstrating the efficacy of advanced NLP models |
| Sadeghi et al. [93] | Mental Disease | Detecting depression using LLMs through interviews | GPT 3.5-Turbo, RoBERTa | E-DAIC (219 participants) | N/A | The study achieved its lowest error rates, a Mean Absolute Error (MAE) of 3.65 on the dev set and 4.26 on the test set, by fine-tuning DepRoBERTa with a specific prompt, outperforming manual methods and highlighting the potential of automated text analysis for depression detection |
| Schubert et al. [94] | Mental Disease | Evaluation of LLMs’ performance on neurology board-style examinations | ChatGPT 3.5, ChatGPT 4.0 | A question bank from an educational company with 2036 questions that resemble neurology board questions | 85% | ChatGPT 4.0 excelled over ChatGPT 3.5, achieving 85.0% accuracy versus 66.8%. It surpassed human performance in specific areas and exhibited high confidence in responses. Longer questions tended to result in more incorrect answers for both models |
| Senn et al. [95] | Mental Disease | Depression classification from interviews | BERT, RoBERTa, DistilBERT | 189 clinical interviews | Accuracy not reported, F1 0.93 | Ensembles of BERT models enhance depression detection robustness in clinical interviews |
| Singh and Motlicek [96] | Mental Disease | Depression level classification using BERT, RoBERTa, XLNet | Ensemble of BERT, RoBERTa, XLNet | Social media texts (LT-EDI-ACL2022 shared task) | 54% | The ensemble model accurately classifies depression levels from social media text, ranking highly in competitive settings |
| Sivamanikandan et al. [97] | Mental Disease | Depression level classification | DistilBERT, RoBERTa, ALBERT | Social media posts | Accuracy not reported, F1 0.457 | Transformer models classify depression levels effectively, with RoBERTa achieving the best performance |
| Spallek et al. [98] | Mental Disease | Providing educational material on mental health and substance use | GPT-4 | Real-world queries from mental health and substance use portals; 10 queries | N/A | GPT 4’s outputs were substandard compared to expert materials in terms of depth and adherence to communication guidelines |
| Stigall et al. [99] | Mental Disease | Emotion classification using LLMs through social media texts | EmoBERTTiny | A collection of publicly available datasets hosted on Kaggle and Huggingface | 93.14% (sentiment analysis), 85.46% (emotion analysis) | EmoBERTTiny outperformed pre-trained and state-of-the-art models in all metrics and computational efficiency, achieving 93.14% accuracy in sentiment analysis and 85.48% in emotion classification. It processes a 256-token context window in 8.04 ms post-tokenization and 154.23 ms total processing speed |
| Suri et al. [100] | Mental Disease | Depressive tendencies detection using multimodal data | BERT | 5997 tweets | 97% | Multimodal BERT frameworks significantly enhance detection of depressive tendencies from complex social media data |
| Tao et al. [101] | Mental Disease | Detecting anxiety and depression using LLMs through dialogs in real-life scenarios | ChatGPT | Speech data from nine Q&A tasks related to daily activities (75 patients with anxiety and 64 patients with depression) | 67.62% | This paper introduced a virtual interaction framework using LLMs to mitigate negative psychological states. Analysis of Q&A dialogs demonstrated ChatGPT’s potential in identifying depression and anxiety, and four language features, including prosodic features and speech rate, further improved classification |
| Tey et al. [102] | Mental Disease | Pre- and post-depressive detection from tweets | BERT, supplemented with emoji decoding | Over 3.5 million tweets | Accuracy not reported, F1 0.90 | Augmented BERT model classifies Twitter users into depressive categories, enhancing early depression detection |
| Toto et al. [103] | Mental Disease | Depression screening using audio and text | AudiBERT | 189 clinical interviews | Accuracy not reported, F1 0.92 | AudiBERT outperforms traditional and hybrid models in depression screening, utilizing multimodal data |
| Vajre et al. [104] | Mental Disease | Detecting mental health issues using LLMs through social media texts | PsychBERT | Twitter hashtags and subreddits (6 domains: anxiety, mental health, suicide, etc.) | Accuracy not reported, F1 0.63 | The study identified PsychBERT as the highest-performing model, achieving an F1 score of 0.98 in a binary classification task and 0.63 in a more challenging multi-class classification task, indicating its superiority in handling complex mental health-related data. PsychBERT’s explainability was enhanced using the Captum library, which confirmed its ability to accurately identify key phrases indicative of mental health issues |
| Verma et al. [105] | Mental Disease | Detecting depression using LLMs through textual data | RoBERTa | Mental health corpus | 96.86% | The study successfully used a RoBERTa-base model to detect depression with a high accuracy of 96.86%, showcasing the potential of AI in identifying mental health issues through linguistic analysis |
| Wan et al. [106] | Mental Disease | Family history identification in mood disorders | BERT–CNN | 12,006 admission notes | 97% | High accuracy in identifying family psychiatric history from EHRs, suggesting utility in understanding mood disorders |
| Wang et al. [107] | Mental Disease | Enhancing depression diagnosis and treatment through the use of LLMs | LLaMA-7B; ChatGLM-6B, Alpaca, LLMs + Knowledge | Chinese Incremental Pre-training Dataset | N/A | The study assessed LLMs’ performance in mental health, emphasizing safety, usability, and fluency and integrating mental health knowledge to improve model effectiveness, enabling more tailored dialogs for treatment while ensuring safety and usability |
| Wang et al. [33] | Mental Disease | Detecting depression using LLMs through microblogs | BERT, RoBERTa, XLNet | 13,993 microblogs collected from the Sina Weibo | Accuracy not reported, F1 0.424 | RoBERTa achieved the highest macro-averaged F1 score of 0.424 for depression classification, while BERT scored the highest micro-averaged F1 score of 0.856. Pretraining on an in-domain corpus improved model performance |
| Wei et al. [108] | Mental Disease | Evaluation of ChatGPT in psychiatry | ChatGPT | Theoretical analysis and literature reviews | N/A | The paper found ChatGPT useful in psychiatry, stressing ethical use and human oversight, while noting challenges in accuracy and bias, positioning AI as a supportive tool in care |
| Wu et al. [109] | Mental Disease | Expanding dataset of Post-Traumatic Stress Disorder (PTSD) using LLMs | GPT 3.5 Turbo | E-DAIC (219 participants) | Accuracy not reported, F1 0.63 | This paper demonstrated that two novel text augmentation frameworks using LLMs significantly improved PTSD diagnosis by addressing data imbalances in NLP tasks. The zero-shot approach, which generated new standardized transcripts, achieved the highest performance improvements, while the few-shot approach, which rephrased existing training samples, also surpassed the original dataset’s efficacy |
| Yongsatianchot et al. [110] | Mental Disease | Evaluation of LLMs’ perception of emotion | Text-davinci-003, ChatGPT, GPT 4 | Responses from three OpenAI LLMs to the Stress and Coping Process Questionnaire | N/A | The study applied the SCPQ to three OpenAI LLMs—davinci-003, ChatGPT, and GPT 4—and found that while their responses aligned with human dynamics of appraisal and coping, they did not vary across key appraisal dimensions as predicted and differed significantly in response magnitude. Notably, all models reacted more negatively than humans to negative scenarios, potentially influenced by their training processes |
| Zhang et al. [111] | Mental Disease | Detecting depression trends using LLMs through Twitter texts | RoBERTa, XLNet | 2575 Twitter users with depression identified via tweets and profiles | 78.90% | This study developed a fusion model that accurately classified depression among Twitter users with 78.9% accuracy. It identified key linguistic and behavioral indicators of depression and demonstrated that depressive users responded to the pandemic later than controls. The findings suggest the model’s effectiveness in noninvasively monitoring mental health trends during major events like COVID-19 |
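A large share of the mental disorder studies tabulated above share one technical pattern: a BERT-family encoder is fine-tuned to classify short texts (social media posts or interview transcripts) for signs of depression or suicidality. The sketch below is a minimal, hedged illustration of that pattern only, using the Hugging Face transformers API; the bert-base-uncased checkpoint, the binary label scheme, and the example posts are placeholder assumptions, not details taken from any reviewed study, which instead substitute domain checkpoints such as MentalBERT or RoBERTa and fine-tune on labeled data.

```python
# Minimal sketch of the encoder-classification pattern shared by many studies in
# the table above. Assumptions: bert-base-uncased checkpoint, two labels
# (0 = control, 1 = depression sign), and invented example posts.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "bert-base-uncased"  # reviewed studies swap in e.g. MentalBERT or RoBERTa
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

posts = [
    "I haven't slept properly in weeks and nothing feels worth doing.",  # hypothetical
    "Great hike with friends today, feeling refreshed.",                 # hypothetical
]
batch = tokenizer(posts, padding=True, truncation=True, max_length=256, return_tensors="pt")

model.eval()
with torch.no_grad():
    logits = model(**batch).logits  # shape: (num_posts, num_labels)
probs = logits.softmax(dim=-1)      # per-class probabilities
preds = probs.argmax(dim=-1)        # predicted label indices

# The classification head is randomly initialized until fine-tuned on labeled
# data (e.g., via the transformers Trainer), so these outputs are illustrative only.
```

After fine-tuning on labeled data, such models are typically evaluated with F1 rather than raw accuracy, as many rows above report, because depression-positive examples are usually the minority class.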
References
- Carchiolo, V.; Malgeri, M. Trends, Challenges, and Applications of Large Language Models in Healthcare: A Bibliometric and Scoping Review. Futur. Internet 2025, 17, 76. [Google Scholar] [CrossRef]
- Kasneci, E.; Sessler, K.; Küchemann, S.; Bannert, M.; Dementieva, D.; Fischer, F.; Gasser, U.; Groh, G.; Günnemann, S.; Hüllermeier, E.; et al. ChatGPT for Good? On Opportunities and Challenges of Large Language Models for Education. Learn. Individ. Differ. 2023, 103, 102274. [Google Scholar] [CrossRef]
- Volkmer, S.; Meyer-Lindenberg, A.; Schwarz, E. Large Language Models in Psychiatry: Opportunities and Challenges. Psychiatry Res. 2024, 339, 116026. [Google Scholar] [CrossRef]
- Quer, G.; Topol, E.J. The Potential for Large Language Models to Transform Cardiovascular Medicine. Lancet Digit. Health 2024, 6, e767–e771. [Google Scholar] [CrossRef] [PubMed]
- Iannantuono, G.M.; Bracken-Clarke, D.; Floudas, C.S.; Roselli, M.; Gulley, J.L.; Karzai, F. Applications of Large Language Models in Cancer Care: Current Evidence and Future Perspectives. Front. Oncol. 2023, 13, 1268915. [Google Scholar] [CrossRef]
- Vaswani, A.; Brain, G.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
- Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models Are Few-Shot Learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
- Ayers, J.W.; Poliak, A.; Dredze, M.; Leas, E.C.; Zhu, Z.; Kelley, J.B.; Faix, D.J.; Goodman, A.M.; Longhurst, C.A.; Hogarth, M.; et al. Comparing Physician and Artificial Intelligence Chatbot Responses to Patient Questions Posted to a Public Social Media Forum. JAMA Intern. Med. 2023, 183, 589–596. [Google Scholar] [CrossRef] [PubMed]
- Van Veen, D.; Van Uden, C.; Blankemeier, L.; Delbrouck, J.B.; Aali, A.; Bluethgen, C.; Pareek, A.; Polacin, M.; Reis, E.P.; Seehofnerová, A.; et al. Adapted Large Language Models Can Outperform Medical Experts in Clinical Text Summarization. Nat. Med. 2024, 30, 1134–1142. [Google Scholar] [CrossRef] [PubMed]
- Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018. [Google Scholar] [CrossRef]
- Yang, X.; Chen, A.; PourNejatian, N.; Shin, H.C.; Smith, K.E.; Parisien, C.; Compas, C.; Martin, C.; Costa, A.B.; Flores, M.G.; et al. A Large Language Model for Electronic Health Records. Npj Digit. Med. 2022, 5, 194. [Google Scholar] [CrossRef]
- Shool, S.; Adimi, S.; Saboori Amleshi, R.; Bitaraf, E.; Golpira, R.; Tara, M. A Systematic Review of Large Language Model (LLM) Evaluations in Clinical Medicine. BMC Med. Inform. Decis. Mak. 2025, 25, 117. [Google Scholar] [CrossRef]
- Hussain, W.; Khoriba, G.; Maity, S.; Jyoti Saikia, M. Large Language Models in Healthcare and Medical Applications: A Review. Bioengineering 2025, 12, 631. [Google Scholar] [CrossRef]
- Bedi, S.; Liu, Y.; Orr-Ewing, L.; Dash, D.; Koyejo, S.; Callahan, A.; Fries, J.A.; Wornow, M.; Swaminathan, A.; Lehmann, L.S.; et al. Testing and Evaluation of Health Care Applications of Large Language Models: A Systematic Review. JAMA 2024, 333, 319–328. [Google Scholar] [CrossRef] [PubMed]
- Moher, D.; Liberati, A.; Tetzlaff, J.; Altman, D.G.; Group, T.P. Preferred Reporting Items for Systematic Reviews and Meta-Analyses: The PRISMA Statement. PLoS Med. 2009, 6, e1000097. [Google Scholar] [CrossRef]
- Higgins, J.P.; Altman, D.G. Assessing Risk of Bias in Included Studies. In Cochrane Handbook for Systematic Reviews of Interventions; John Wiley & Sons, Ltd.: Chichester, UK; pp. 187–241. ISBN 9780470712184.
- Shea, B.J.; Reeves, B.C.; Wells, G.; Thuku, M.; Hamel, C.; Moran, J.; Moher, D.; Tugwell, P.; Welch, V.; Kristjansson, E.; et al. AMSTAR 2: A Critical Appraisal Tool for Systematic Reviews That Include Randomised or Non-Randomised Studies of Healthcare Interventions, or Both. BMJ 2017, 358, j4008. [Google Scholar] [CrossRef] [PubMed]
- Mohammadi, E.; Thelwall, M.; Haustein, S.; Larivière, V. Who Reads Research Articles? An Altmetrics Analysis of Mendeley User Categories. J. Assoc. Inf. Sci. Technol. 2015, 66, 1832–1846. [Google Scholar] [CrossRef]
- Sharma, A.; Medapalli, T.; Alexandrou, M.; Brilakis, E.; Prasad, A. Exploring the Role of ChatGPT in Cardiology: A Systematic Review of the Current Literature. Cureus 2024, 16, e58936. [Google Scholar] [CrossRef]
- Sorin, V.; Glicksberg, B.S.; Artsi, Y.; Barash, Y.; Konen, E.; Nadkarni, G.N.; Klang, E. Utilizing Large Language Models in Breast Cancer Management: Systematic Review. J. Cancer Res. Clin. Oncol. 2024, 150, 140. [Google Scholar] [CrossRef] [PubMed]
- Omar, M.; Soffer, S.; Charney, A.W.; Landi, I.; Nadkarni, G.N.; Klang, E. Applications of Large Language Models in Psychiatry: A Systematic Review. Front. Psychiatry 2024, 15, 1422807. [Google Scholar] [CrossRef]
- Guo, Z.; Lai, A.; Thygesen, J.H.; Farrington, J.; Keen, T.; Li, K. Large Language Models for Mental Health Applications: Systematic Review. JMIR Ment. Health 2024, 11, e57400. [Google Scholar] [CrossRef]
- Omar, M.; Levkovich, I. Exploring the Efficacy and Potential of Large Language Models for Depression: A Systematic Review. J. Affect. Disord. 2025, 371, 234–244. [Google Scholar] [CrossRef]
- Dai, H.J.; Su, C.H.; Lee, Y.Q.; Zhang, Y.C.; Wang, C.K.; Kuo, C.J.; Wu, C.S. Deep Learning-Based Natural Language Processing for Screening Psychiatric Patients. Front. Psychiatry 2021, 11, 533949. [Google Scholar] [CrossRef]
- Pourkeyvan, A.; Safa, R.; Sorourkhah, A. Harnessing the Power of Hugging Face Transformers for Predicting Mental Health Disorders in Social Networks. IEEE Access 2024, 12, 28025–28035. [Google Scholar] [CrossRef]
- Lossio-Ventura, J.A.; Weger, R.; Lee, A.Y.; Guinee, E.P.; Chung, J.; Atlas, L.; Linos, E.; Pereira, F. A Comparison of ChatGPT and Fine-Tuned Open Pre-Trained Transformers (OPT) Against Widely Used Sentiment Analysis Tools: Sentiment Analysis of COVID-19 Survey Data. JMIR Ment. Health 2024, 11, e50150. [Google Scholar] [CrossRef]
- Franco D’Souza, R.; Amanullah, S.; Mathew, M.; Surapaneni, K.M. Appraising the Performance of ChatGPT in Psychiatry Using 100 Clinical Case Vignettes. Asian J. Psychiatry 2023, 89, 103770. [Google Scholar] [CrossRef]
- Lau, C.; Zhu, X.; Chan, W.Y. Automatic Depression Severity Assessment with Deep Learning Using Parameter-Efficient Tuning. Front. Psychiatry 2023, 14, 1160291. [Google Scholar] [CrossRef]
- Bokolo, B.G.; Liu, Q. Advanced Comparative Analysis of Machine Learning and Transformer Models for Depression and Suicide Detection in Social Media Texts. Electronics 2024, 13, 3980. [Google Scholar] [CrossRef]
- Elyoseph, Z.; Levkovich, I.; Shinan-Altman, S. Assessing Prognosis in Depression: Comparing Perspectives of AI Models, Mental Health Professionals and the General Public. Fam. Med. Community Health 2024, 12, e002583. [Google Scholar] [CrossRef] [PubMed]
- Levkovich, I.; Elyoseph, Z. Identifying Depression and Its Determinants upon Initiating Treatment: ChatGPT versus Primary Care Physicians. Fam. Med. Community Health 2023, 11, e002391. [Google Scholar] [CrossRef] [PubMed]
- Levkovich, I.; Elyoseph, Z. Suicide Risk Assessments Through the Eyes of ChatGPT-3.5 Versus ChatGPT-4: Vignette Study. JMIR Ment. Health 2023, 10, e51232. [Google Scholar] [CrossRef] [PubMed]
- Wang, X.; Chen, S.; Li, T.; Li, W.; Zhou, Y.; Zheng, J.; Chen, Q.; Yan, J.; Tang, B. Depression Risk Prediction for Chinese Microblogs via Deep-Learning Methods: Content Analysis. JMIR Med. Inform. 2020, 8, e17958. [Google Scholar] [CrossRef]
- Perlis, R.H.; Goldberg, J.F.; Ostacher, M.J.; Schneck, C.D. Clinical Decision Support for Bipolar Depression Using Large Language Models. Neuropsychopharmacology 2024, 49, 1412–1416. [Google Scholar] [CrossRef]
- Vaidyam, A.N.; Wisniewski, H.; Halamka, J.D.; Kashavan, M.S.; Torous, J.B. Chatbots and Conversational Agents in Mental Health: A Review of the Psychiatric Landscape. Can. J. Psychiatry 2019, 64, 456–464. [Google Scholar] [CrossRef] [PubMed]
- Ahmed, S.K.; Hussein, S.; Aziz, T.A.; Chakraborty, S.; Islam, M.R.; Dhama, K. The Power of ChatGPT in Revolutionizing Rural Healthcare Delivery. Health Sci. Rep. 2023, 6, e1684. [Google Scholar] [CrossRef]
- Chatziisaak, D.; Burri, P.; Sparn, M.; Hahnloser, D.; Steffen, T.; Bischofberger, S. Concordance of ChatGPT Artificial Intelligence Decision-Making in Colorectal Cancer Multidisciplinary Meetings: Retrospective Study. BJS Open 2025, 9, zraf040. [Google Scholar] [CrossRef]
- Gibson, D.; Jackson, S.; Shanmugasundaram, R.; Seth, I.; Siu, A.; Ahmadi, N.; Kam, J.; Mehan, N.; Thanigasalam, R.; Jeffery, N.; et al. Evaluating the Efficacy of ChatGPT as a Patient Education Tool in Prostate Cancer: Multimetric Assessment. J. Med. Internet Res. 2024, 26, e55939. [Google Scholar] [CrossRef]
- Bracken, A.; Reilly, C.; Feeley, A.; Sheehan, E.; Merghani, K.; Feeley, I. Artificial Intelligence (AI)—Powered Documentation Systems in Healthcare: A Systematic Review. J. Med. Syst. 2025, 49, 28. [Google Scholar] [CrossRef] [PubMed]
- Banerjee, S.; Agarwal, A.; Singla, S. LLMs Will Always Hallucinate, and We Need to Live with This. In Intelligent Systems and Applications; Lecture Notes in Networks and Systems; Springer: Cham, Switzerland, 2025; Volume 1554, pp. 624–648. [Google Scholar] [CrossRef]
- Jiao, J.; Afroogh, S.; Xu, Y.; Phillips, C. Navigating LLM Ethics: Advancements, Challenges, and Future Directions. AI Ethics 2025, 5, 5795–5819. [Google Scholar] [CrossRef]
- Kusunose, K.; Kashima, S.; Sata, M. Evaluation of the Accuracy of ChatGPT in Answering Clinical Questions on the Japanese Society of Hypertension Guidelines. Circ. J. 2023, 87, 1030–1033. [Google Scholar] [CrossRef]
- Rizwan, A.; Sadiq, T. The Use of AI in Diagnosing Diseases and Providing Management Plans: A Consultation on Cardiovascular Disorders with ChatGPT. Cureus 2023, 15, e43106. [Google Scholar] [CrossRef] [PubMed]
- Skalidis, I.; Cagnina, A.; Luangphiphat, W.; Mahendiran, T.; Muller, O.; Abbe, E.; Fournier, S. ChatGPT Takes on the European Exam in Core Cardiology: An Artificial Intelligence Success Story? Eur. Heart J. Digit. Health 2023, 4, 279–281. [Google Scholar] [CrossRef]
- Van Bulck, L.; Moons, P. What If Your Patient Switches from Dr. Google to Dr. ChatGPT? A Vignette-Based Survey of the Trustworthiness, Value, and Danger of ChatGPT-Generated Responses to Health Questions. Eur. J. Cardiovasc. Nurs. 2024, 23, 95–98. [Google Scholar] [CrossRef]
- Williams, M.C.; Shambrook, J. How Will Artificial Intelligence Transform Cardiovascular Computed Tomography? A Conversation with an AI Model. J. Cardiovasc. Comput. Tomogr. 2023, 17, 281–283. [Google Scholar] [CrossRef] [PubMed]
- Choi, H.S.; Song, J.Y.; Shin, K.H.; Chang, J.H.; Jang, B.S. Developing Prompts from Large Language Model for Extracting Clinical Information from Pathology and Ultrasound Reports in Breast Cancer. Radiat. Oncol. J. 2023, 41, 209–216. [Google Scholar] [CrossRef]
- Griewing, S.; Gremke, N.; Wagner, U.; Lingenfelder, M.; Kuhn, S.; Boekhoff, J. Challenging ChatGPT 3.5 in Senology—An Assessment of Concordance with Breast Cancer Tumor Board Decision Making. J. Pers. Med. 2023, 13, 1502. [Google Scholar] [CrossRef]
- Haver, H.L.; Ambinder, E.B.; Bahl, M.; Oluyemi, E.T.; Jeudy, J.; Yi, P.H. Appropriateness of Breast Cancer Prevention and Screening Recommendations Provided by ChatGPT. Radiology 2023, 307, 4. [Google Scholar] [CrossRef]
- Lukac, S.; Dayan, D.; Fink, V.; Leinert, E.; Hartkopf, A.; Veselinovic, K.; Janni, W.; Rack, B.; Pfister, K.; Heitmeir, B.; et al. Evaluating ChatGPT as an Adjunct for the Multidisciplinary Tumor Board Decision-Making in Primary Breast Cancer Cases. Arch. Gynecol. Obstet. 2023, 308, 1831–1844. [Google Scholar] [CrossRef] [PubMed]
- Rao, A.; Kim, J.; Kamineni, M.; Pang, M.; Lie, W.; Dreyer, K.J.; Succi, M.D. Evaluating GPT as an Adjunct for Radiologic Decision Making: GPT-4 Versus GPT-3.5 in a Breast Imaging Pilot. J. Am. Coll. Radiol. 2023, 20, 990–997. [Google Scholar] [CrossRef]
- Sorin, V.; Klang, E.; Sklair-Levy, M.; Cohen, I.; Zippel, D.B.; Balint Lahat, N.; Konen, E.; Barash, Y. Large Language Model (ChatGPT) as a Support Tool for Breast Tumor Board. npj Breast Cancer 2023, 9, 44. [Google Scholar] [CrossRef] [PubMed]
- Abilkaiyrkyzy, A.; Laamarti, F.; Hamdi, M.; Saddik, A. El Dialogue System for Early Mental Illness Detection: Toward a Digital Twin Solution. IEEE Access 2024, 12, 2007–2024. [Google Scholar] [CrossRef]
- Adarsh, S.; Antony, B. SSN@LT-EDI-ACL2022: Transfer Learning Using BERT for Detecting Signs of Depression from Social Media Texts. In Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion, Dublin, Ireland, 27 May 2022. [Google Scholar] [CrossRef]
- Alessa, A.; Al-Khalifa, H. Towards Designing a ChatGPT Conversational Companion for Elderly People. In Proceedings of the 16th International Conference on PErvasive Technologies Related to Assistive Environments, Corfu, Greece, 5–7 July 2023; pp. 667–674. [Google Scholar] [CrossRef]
- Beredo, J.L.; Ong, E.C. A Hybrid Response Generation Model for an Empathetic Conversational Agent. In Proceedings of the 2022 International Conference on Asian Language Processing (IALP), Shenzhen, China, 27–28 October 2022; pp. 300–305. [Google Scholar] [CrossRef]
- Berrezueta-Guzman, S.; Kandil, M.; Martín-Ruiz, M.L.; Pau de la Cruz, I.; Krusche, S. Future of ADHD Care: Evaluating the Efficacy of ChatGPT in Therapy Enhancement. Healthcare 2024, 12, 683. [Google Scholar] [CrossRef]
- Blease, C.; Worthen, A.; Torous, J. Psychiatrists’ Experiences and Opinions of Generative Artificial Intelligence in Mental Healthcare: An Online Mixed Methods Survey. Psychiatry Res. 2024, 333, 115724. [Google Scholar] [CrossRef]
- Crasto, R.; Dias, L.; Miranda, D.; Kayande, D. CareBot: A Mental Health Chatbot. In Proceedings of the 2021 2nd International Conference for Emerging Technology (INCET), Belgaum, India, 21–23 May 2021. [Google Scholar] [CrossRef]
- Danner, M.; Hadzic, B.; Gerhardt, S.; Ludwig, S.; Uslu, I.; Shao, P.; Weber, T.; Shiban, Y.; Rätsch, M. Advancing Mental Health Diagnostics: GPT-Based Method for Depression Detection. In Proceedings of the 2023 62nd Annual Conference of the Society of Instrument and Control Engineers (SICE), Tsu, Japan, 6–9 September 2023; pp. 1290–1296. [Google Scholar] [CrossRef]
- Dergaa, I.; Fekih-Romdhane, F.; Hallit, S.; Loch, A.A.; Glenn, J.M.; Fessi, M.S.; Ben Aissa, M.; Souissi, N.; Guelmami, N.; Swed, S.; et al. ChatGPT Is Not Ready yet for Use in Providing Mental Health Assessment and Interventions. Front. Psychiatry 2023, 14, 1277756. [Google Scholar] [CrossRef] [PubMed]
- Diniz, E.J.S.; Fontenele, J.E.; de Oliveira, A.C.; Bastos, V.H.; Teixeira, S.; Rabêlo, R.L.; Calçada, D.B.; Dos Santos, R.M.; de Oliveira, A.K.; Teles, A.S. Boamente: A Natural Language Processing-Based Digital Phenotyping Tool for Smart Monitoring of Suicidal Ideation. Healthcare 2022, 10, 698. [Google Scholar] [CrossRef]
- Elyoseph, Z.; Hadar-Shoval, D.; Asraf, K.; Lvovsky, M. ChatGPT Outperforms Humans in Emotional Awareness Evaluations. Front. Psychol. 2023, 14, 1199058. [Google Scholar] [CrossRef] [PubMed]
- Elyoseph, Z.; Levkovich, I. Beyond Human Expertise: The Promise and Limitations of ChatGPT in Suicide Risk Assessment. Front. Psychiatry 2023, 14, 1213141. [Google Scholar] [CrossRef]
- Esackimuthu, S.; Shruthi, H.; Sivanaiah, R.; Angel Deborah, S.; Sakaya Milton, R.; Mirnalinee, T.T. SSN_MLRG3 @LT-EDI-ACL2022-Depression Detection System from Social Media Text Using Transformer Models. In Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion, Dublin, Ireland, 27 May 2022. [Google Scholar] [CrossRef]
- Farhat, F. ChatGPT as a Complementary Mental Health Resource: A Boon or a Bane. Ann. Biomed. Eng. 2024, 52, 1111–1114. [Google Scholar] [CrossRef]
- Farruque, N.; Zaïane, O.R.; Goebel, R.; Sivapalan, S. DeepBlues@LT-EDI-ACL2022: Depression Level Detection Modelling through Domain Specific BERT and Short Text Depression Classifiers. In Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion, Dublin, Ireland, 27 May 2022; pp. 167–171. [Google Scholar] [CrossRef]
- Farruque, N.; Goebel, R.; Sivapalan, S.; Zaïane, O.R. Depression Symptoms Modelling from Social Media Text: An LLM Driven Semi-Supervised Learning Approach. Lang. Resour. Eval. 2024, 58, 1013–1041. [Google Scholar] [CrossRef]
- Friedman, S.F.; Ballentine, G. Trajectories of Sentiment in 11,816 Psychoactive Narratives. Hum. Psychopharmacol. Clin. Exp. 2024, 39, e2889. [Google Scholar] [CrossRef] [PubMed]
- Ghanadian, H.; Nejadgholi, I.; Al Osman, H. Socially Aware Synthetic Data Generation for Suicidal Ideation Detection Using Large Language Models. IEEE Access 2024, 12, 14350–14363. [Google Scholar] [CrossRef]
- Hadar-Shoval, D.; Elyoseph, Z.; Lvovsky, M. The Plasticity of ChatGPT’s Mentalizing Abilities: Personalization for Personality Structures. Front. Psychiatry 2023, 14, 1234397. [Google Scholar] [CrossRef]
- Hayati, M.F.M.; Ali, M.A.M.; Rosli, A.N.M. Depression Detection on Malay Dialects Using GPT-3. In Proceedings of the 2022 IEEE-EMBS Conference on Biomedical Engineering and Sciences (IECBES), Kuala Lumpur, Malaysia, 7–9 December 2022; pp. 360–364. [Google Scholar] [CrossRef]
- He, W.; Zhang, W.; Jin, Y.; Zhou, Q.; Zhang, H.; Xia, Q. Physician Versus Large Language Model Chatbot Responses to Web-Based Questions From Autistic Patients in Chinese: Cross-Sectional Comparative Analysis. J. Med. Internet Res. 2024, 26, e54706. [Google Scholar] [CrossRef] [PubMed]
- Hegde, A.; Coelho, S.; Dashti, A.E.; Shashirekha, H.L. MUCS@Text-LT-EDI@ACL 2022: Detecting Sign of Depression from Social Media Text Using Supervised Learning Approach. In Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion, Dublin, Ireland, 27 May 2022. [Google Scholar] [CrossRef]
- Heston, T.F. Safety of Large Language Models in Addressing Depression. Cureus 2023, 15, e50729. [Google Scholar] [CrossRef]
- de Hond, A.; van Buchem, M.; Fanconi, C.; Roy, M.; Blayney, D.; Kant, I.; Steyerberg, E.; Hernandez-Boussard, T. Predicting Depression Risk in Patients with Cancer Using Multimodal Data: Algorithm Development Study. JMIR Med. Inform. 2024, 12, e51925. [Google Scholar] [CrossRef] [PubMed]
- Howard, D.; Maslej, M.M.; Lee, J.; Ritchie, J.; Woollard, G.; French, L. Transfer Learning for Risk Classification of Social Media Posts: Model Evaluation Study. J. Med. Internet Res. 2020, 22, e15371. [Google Scholar] [CrossRef]
- Hwang, G.; Lee, D.Y.; Seol, S.; Jung, J.; Choi, Y.; Her, E.S.; An, M.H.; Park, R.W. Assessing the Potential of ChatGPT for Psychodynamic Formulations in Psychiatry: An Exploratory Study. Psychiatry Res. 2024, 331, 115655. [Google Scholar] [CrossRef] [PubMed]
- Ilias, L.; Mouzakitis, S.; Askounis, D. Calibration of Transformer-Based Models for Identifying Stress and Depression in Social Media. IEEE Trans. Comput. Soc. Syst. 2024, 11, 1979–1990. [Google Scholar] [CrossRef]
- Janatdoust, M.; Ehsani-Besheli, F.; Zeinali, H. KADO@LT-EDI-ACL2022: BERT-Based Ensembles for Detecting Signs of Depression from Social Media Text. In Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion, Dublin, Ireland, 27 May 2022. [Google Scholar] [CrossRef]
- Kabir, M.; Ahmed, T.; Hasan, M.B.; Laskar, M.T.R.; Joarder, T.K.; Mahmud, H.; Hasan, K. DEPTWEET: A Typology for Social Media Texts to Detect Depression Severities. Comput. Hum. Behav. 2023, 139, 107503. [Google Scholar] [CrossRef]
- Kumar, H.; Wang, Y.; Shi, J.; Musabirov, I.; Farb, N.A.S.; Williams, J.J. Exploring the Use of Large Language Models for Improving the Awareness of Mindfulness. In Proceedings of the Extended Abstracts of the 2023 CHI Conference on Human Factors in Computing Systems, Hamburg, Germany, 23–28 April 2023; Volume 7. [Google Scholar] [CrossRef]
- Lam, G.; Dongyan, H.; Lin, W. Context-Aware Deep Learning for Multi-Modal Depression Detection. In Proceedings of the ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 3946–3950. [Google Scholar] [CrossRef]
- Li, D.J.; Kao, Y.C.; Tsai, S.J.; Bai, Y.M.; Yeh, T.C.; Chu, C.S.; Hsu, C.W.; Cheng, S.W.; Hsu, T.W.; Liang, C.S.; et al. Comparing the Performance of ChatGPT GPT-4, Bard, and Llama-2 in the Taiwan Psychiatric Licensing Examination and in Differential Diagnosis with Multi-Center Psychiatrists. Psychiatry Clin. Neurosci. 2024, 78, 347–352. [Google Scholar] [CrossRef]
- Liyanage, C.; Garg, M.; Mago, V.; Sohn, S. Augmenting Reddit Posts to Determine Wellness Dimensions Impacting Mental Health. In Proceedings of the 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks, Toronto, ON, Canada, 13 July 2023. [Google Scholar] [CrossRef]
- Lu, K.C.; Thamrin, S.A.; Chen, A.L.P. Depression Detection via Conversation Turn Classification. Multimed. Tools Appl. 2023, 82, 39393–39413. [Google Scholar] [CrossRef]
- Ma, Z.; Mei, Y.; Su, Z. Understanding the Benefits and Challenges of Using Large Language Model-Based Conversational Agents for Mental Well-Being Support. AMIA Annu. Symp. Proc. 2024, 2023, 1105–1114. [Google Scholar] [PubMed]
- Mazumdar, H.; Chakraborty, C.; Sathvik, M.; Mukhopadhyay, S.; Panigrahi, P.K. GPTFX: A Novel GPT-3 Based Framework for Mental Health Detection and Explanations. IEEE J. Biomed. Health Inform. 2023, 1–8. [Google Scholar] [CrossRef]
- Metzler, H.; Baginski, H.; Niederkrotenthaler, T.; Garcia, D. Detecting Potentially Harmful and Protective Suicide-Related Content on Twitter: Machine Learning Approach. J. Med. Internet Res. 2022, 24, e34705. [Google Scholar] [CrossRef]
- Owen, D.; Antypas, D.; Hassoulas, A.; Pardiñas, A.F.; Espinosa-Anke, L.; Collados, J.C. Enabling Early Health Care Intervention by Detecting Depression in Users of Web-Based Forums Using Language Models: Longitudinal Analysis and Evaluation. JMIR AI 2023, 2, e41205. [Google Scholar] [CrossRef]
- Parker, G.; Spoelma, M.J. A Chat about Bipolar Disorder. Bipolar Disord. 2024, 26, 249–254. [Google Scholar] [CrossRef]
- Poswiata, R.; Perełkiewicz, M. OPI@LT-EDI-ACL2022: Detecting Signs of Depression from Social Media Text Using RoBERTa Pre-Trained Language Models. In Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion, Dublin, Ireland, 27 May 2022; Association for Computational Linguistics: New York, NY, USA, 2022; pp. 276–282. [Google Scholar] [CrossRef]
- Sadeghi, M.; Egger, B.; Agahi, R.; Richer, R.; Capito, K.; Rupp, L.H.; Schindler-Gmelch, L.; Berking, M.; Eskofier, B.M. Exploring the Capabilities of a Language Model-Only Approach for Depression Detection in Text Data. In Proceedings of the 2023 IEEE EMBS International Conference on Biomedical and Health Informatics (BHI), Pittsburgh, PA, USA, 15–18 October 2023. [Google Scholar] [CrossRef]
- Schubert, M.C.; Wick, W.; Venkataramani, V. Performance of Large Language Models on a Neurology Board–Style Examination. JAMA Netw. Open 2023, 6, e2346721. [Google Scholar] [CrossRef] [PubMed]
- Senn, S.; Tlachac, M.L.; Flores, R.; Rundensteiner, E. Ensembles of BERT for Depression Classification. In Proceedings of the 2022 44th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), Glasgow, UK, 11–15 July 2022; IEEE: New York, NY, USA, 2022; Volume 2022, pp. 4691–4694. [Google Scholar] [CrossRef]
- Singh, M.; Motlicek, P. IDIAP Submission@LT-EDI-ACL2022: Detecting Signs of Depression from Social Media Text. In Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion, Dublin, Ireland, 27 May 2022. [Google Scholar] [CrossRef]
- Sivamanikandan, S.; Santhosh, V.; Sanjaykumar, N.; Jerin Mahibha, C.; Thenmozhi, D. ScubeMSEC@LT-EDI-ACL2022: Detection of Depression Using Transformer Models. In Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion, Dublin, Ireland, 27 May 2022. [Google Scholar] [CrossRef]
- Spallek, S.; Birrell, L.; Kershaw, S.; Devine, E.K.; Thornton, L. Can We Use ChatGPT for Mental Health and Substance Use Education? Examining Its Quality and Potential Harms. JMIR Med. Educ. 2023, 9, e51243. [Google Scholar] [CrossRef]
- Stigall, W.; Al Hafiz Khan, M.A.; Attota, D.; Nweke, F.; Pei, Y. Large Language Models Performance Comparison of Emotion and Sentiment Classification. In Proceedings of the 2024 ACM Southeast Conference, Marietta, GA, USA, 18–20 April 2024; pp. 60–68. [Google Scholar] [CrossRef]
- Suri, M.; Semwal, N.; Chaudhary, D.; Gorton, I.; Kumar, B. I Don’t Feel so Good! Detecting Depressive Tendencies Using Transformer-Based Multimodal Frameworks. In Proceedings of the 2022 5th International Conference on Machine Learning and Natural Language Processing, Sanya, China, 23–25 December 2022; pp. 360–365. [Google Scholar] [CrossRef]
- Tao, Y.; Yang, M.; Shen, H.; Yang, Z.; Weng, Z.; Hu, B. Classifying Anxiety and Depression through LLMs Virtual Interactions: A Case Study with ChatGPT. In Proceedings of the 2023 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Istanbul, Turkey, 5–8 December 2023; pp. 2259–2264. [Google Scholar] [CrossRef]
- Tey, W.L.; Goh, H.N.; Lim, A.H.L.; Phang, C.K. Pre- and Post-Depressive Detection Using Deep Learning and Textual-Based Features. Int. J. Technol. 2023, 14, 1334–1343. [Google Scholar] [CrossRef]
- Toto, E.; Tlachac, M.L.; Rundensteiner, E.A. AudiBERT: A Deep Transfer Learning Multimodal Classification Framework for Depression Screening. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management (CIKM ’21), Virtual Event, Queensland, Australia, 1–5 November 2021; Association for Computing Machinery: New York, NY, USA, 2021; pp. 4145–4154. [Google Scholar] [CrossRef]
- Vajre, V.; Naylor, M.; Kamath, U.; Shehu, A. PsychBERT: A Mental Health Language Model for Social Media Mental Health Behavioral Analysis. In Proceedings of the 2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Virtual, 9–12 December 2021; IEEE: New York, NY, USA, 2021; pp. 1077–1082. [Google Scholar] [CrossRef]
- Verma, S.; Vishal; Joshi, R.C.; Dutta, M.K.; Jezek, S.; Burget, R. AI-Enhanced Mental Health Diagnosis: Leveraging Transformers for Early Detection of Depression Tendency in Textual Data. In Proceedings of the 2023 15th International Congress on Ultra Modern Telecommunications and Control Systems and Workshops (ICUMT), Ghent, Belgium, 30 October–1 November 2023. [Google Scholar] [CrossRef]
- Wan, C.; Ge, X.; Wang, J.; Zhang, X.; Yu, Y.; Hu, J.; Liu, Y.; Ma, H. Identification and Impact Analysis of Family History of Psychiatric Disorder in Mood Disorder Patients with Pretrained Language Model. Front. Psychiatry 2022, 13, 861930. [Google Scholar] [CrossRef]
- Wang, X.; Liu, K.; Wang, C. Knowledge-Enhanced Pre-Training Large Language Model for Depression Diagnosis and Treatment. In Proceedings of the 2023 IEEE 9th International Conference on Cloud Computing and Intelligent Systems (CCIS), Dali, China, 12–13 August 2023; pp. 532–536. [Google Scholar] [CrossRef]
- Wei, Y.; Guo, L.; Lian, C.; Chen, J. ChatGPT: Opportunities, Risks and Priorities for Psychiatry. Asian J. Psychiatr. 2023, 90, 103808. [Google Scholar] [CrossRef]
- Wu, Y.; Chen, J.; Mao, K.; Zhang, Y. Automatic Post-Traumatic Stress Disorder Diagnosis via Clinical Transcripts: A Novel Text Augmentation with Large Language Models. In Proceedings of the 2023 IEEE Biomedical Circuits and Systems Conference (BioCAS), Toronto, ON, Canada, 19–21 October 2023. [Google Scholar] [CrossRef]
- Yongsatianchot, N.; Torshizi, P.G.; Marsella, S. Investigating Large Language Models’ Perception of Emotion Using Appraisal Theory. In Proceedings of the 2023 11th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW), Cambridge, MA, USA, 10–13 September 2023. [Google Scholar] [CrossRef]
- Zhang, Y.; Lyu, H.; Liu, Y.; Zhang, X.; Wang, Y.; Luo, J. Monitoring Depression Trends on Twitter During the COVID-19 Pandemic: Observational Study. JMIR Infodemiology 2021, 1, e26769. [Google Scholar] [CrossRef] [PubMed]




| AMSTAR 2 Criteria | Sharma et al. [19] | Sorin et al. [20] | Omar et al. [21] | Guo et al. [22] | Omar and Levkovich [23] |
|---|---|---|---|---|---|
| 1. Did the research questions and inclusion criteria for the review include the components of PICO? | N | N | N | N | N |
| 2. Did the report of the review contain an explicit statement that the review methods were established prior to the conduct of the review and did the report justify any significant deviations from the protocol? | N | N | PY | PY | PY |
| 3. Did the review authors explain their selection of the study designs for inclusion in the review? | N | N | N | N | N |
| 4. Did the review authors use a comprehensive literature search strategy? | PY | N | PY | PY | PY |
| 5. Did the review authors perform study selection in duplicate? | Y | Y | Y | Y | Y |
| 6. Did the review authors perform data extraction in duplicate? | Y | N | Y | Y | Y |
| 7. Did the review authors provide a list of excluded studies and justify the exclusions? | N | PY | N | Y | PY |
| 8. Did the review authors describe the included studies in adequate detail? | N | N | PY | PY | PY |
| 9. Did the review authors use a satisfactory technique for assessing the risk of bias (RoB) in individual studies that were included in the review? | N | Y | N | Y | Y |
| 10. Did the review authors report on the sources of funding for the studies included in the review? | N | N | N | N | N |
| 11. Did the review authors account for RoB in individual studies when interpreting/discussing the results of the review? | N | N | N | N | N |
| 12. Did the review authors provide a satisfactory explanation for, and discussion of, any heterogeneity observed in the results of the review? | Y | Y | Y | Y | Y |
| 13. Did the review authors report any potential sources of conflict of interest, including any funding they received for conducting the review? | Y | Y | Y | Y | Y |
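In the table above, Y = Yes, PY = Partial Yes, and N = No, the standard AMSTAR 2 response options. For readers who wish to summarize the grid programmatically, the following minimal Python sketch (illustrative only, not part of the original analysis) tallies the responses per review; the ratings are transcribed directly from the table:

```python
from collections import Counter

# AMSTAR 2 ratings per review for the 13 criteria listed above,
# transcribed from the table (Y = Yes, PY = Partial Yes, N = No).
ratings = {
    "Sharma et al. [19]":      ["N", "N", "N", "PY", "Y", "Y", "N", "N", "N", "N", "N", "Y", "Y"],
    "Sorin et al. [20]":       ["N", "N", "N", "N", "Y", "N", "PY", "N", "Y", "N", "N", "Y", "Y"],
    "Omar et al. [21]":        ["N", "PY", "N", "PY", "Y", "Y", "N", "PY", "N", "N", "N", "Y", "Y"],
    "Guo et al. [22]":         ["N", "PY", "N", "PY", "Y", "Y", "Y", "PY", "Y", "N", "N", "Y", "Y"],
    "Omar and Levkovich [23]": ["N", "PY", "N", "PY", "Y", "Y", "PY", "PY", "Y", "N", "N", "Y", "Y"],
}

# Print a per-review tally of Yes / Partial Yes / No responses.
for review, marks in ratings.items():
    counts = Counter(marks)
    print(f"{review}: Y={counts['Y']}, PY={counts['PY']}, N={counts['N']}")
```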
| Category | Detailed Summary |
|---|---|
| Clinical Applications | |
| Patient Engagement | |
| Technical Challenges | |
| Ethical & Regulatory Issues | |
| Methodological Gaps | |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.