Systematic Review

A Systematic Review of Large Language Models in Mental Health: Opportunities, Challenges, and Future Directions

by Evdokia Voultsiou and Lefteris Moussiades *
Department of Informatics, Democritus University of Thrace, 65404 Kavala, Greece
* Author to whom correspondence should be addressed.
Electronics 2026, 15(3), 524; https://doi.org/10.3390/electronics15030524
Submission received: 24 October 2025 / Revised: 7 January 2026 / Accepted: 19 January 2026 / Published: 26 January 2026

Abstract

This systematic review examines 205 studies on the use of Large Language Models (LLMs) in psychiatry, psychology, psychotherapy, and clinical workflows. Studies that directly evaluated at least one LLM in a mental health context were additionally included in an extended, detailed analysis. GPT-4 and GPT-3.5 were the most commonly assessed models. Although LLMs showed promising short-term performance across domains, most evaluations relied on small, non-longitudinal datasets and single-session testing, limiting generalizability. The evidence indicates rapid growth but significant methodological inconsistency, emphasizing the need for more diverse datasets, standardized evaluation, and long-term validation before clinical integration. This review also examines how LLMs are being incorporated into mental health practice, outlining key challenges, limitations, and emerging opportunities. Ethical, clinical, and technological considerations are proposed to guide responsible adoption. Given the complexity of mental health care, a multidisciplinary, human-centered approach remains essential to ensure that future LLM applications augment—rather than replace—professional expertise.

1. Introduction

Mental health has moved to the forefront of global concern because of its significance for humanity [1]. Internal and environmental forces, as illustrated in Figure 1, affect everyone’s mental health in multiple ways. At the same time, prejudice, poverty, bullying, inadequate healthcare, schooling, and limited public resources represent significant challenges for both individuals and communities [2,3].
Following the 2008 entry into force of the Convention on the Rights of Persons with Disabilities (CRPD), the World Health Organization (WHO) launched the Comprehensive Mental Health Action Plan in 2013 and later extended it to 2030, while the United Nations (UN) Human Rights Council and the United Nations General Assembly passed multiple resolutions promoting rights-based mental health reforms [4]. Despite global awareness, mental health receives only about 2% of health funding, and financed institutions have been accused of civil rights abuses [5].
Nine out of ten patients with severe mental health issues worldwide receive no care [6]. Half of all adults will develop a mental health issue during their lifetime [7], and 13% of the global population is currently affected [8]. One in four Americans, including high school students, experiences mental health issues, with an average delay of eleven years between symptom onset and intervention [9,10].
Significant deficiencies in the mental health system, including a global shortage of 5 million experts, leave many exposed to trauma and exhaustion without effective care [11,12,13]. Mental health workforce shortages vary worldwide; large U.S. populations face limited access with projected shortfalls of tens of thousands of psychiatrists, Canada has diagnosis wait times of up to six months, Australia relies heavily on emergency services and telehealth, Sub-Saharan Africa has about one psychiatrist per 500,000 people, and India has fewer than one psychiatrist per 100,000 [12,14].
Mental health services are hindered by insufficient funding, poor quality, and inaccessibility [6,11], necessitating scalable, cost-effective options integrated with existing systems, inclusive policies for marginalized groups, and sustainable, rights-based strategies [4,13,15,16].
Artificial intelligence (AI) has already established itself in mental health through diverse applications [17,18,19,20,21,22]. Building on this foundation, the adaptability of LLMs in education, assessment, and intervention offers renewed optimism for mental health treatment [16]. Fine-tuned LLMs have shown promising results in sentiment analysis, in summarizing provider–patient interactions, and in generating human-like language, thereby expanding their reach in healthcare and education [23,24]. Public views on ChatGPT in mental health are usually positive, although hazards must be considered [24]. Decades earlier, ELIZA, a conversational agent developed by Joseph Weizenbaum in the mid-1960s, had already demonstrated both the potential and the limitations of machine-driven discourse in therapy [1,25,26].
Modern LLMs perform strongly on mental health tasks and demonstrate physician-level competence on medical board exams, with psychiatry among their highest-scoring specialties [27,28].
Psychiatry, psychology, and psychotherapy work together to shape how we understand and support mental health.
Psychiatry is the medical side of mental health. Psychiatrists diagnose conditions and look at how the brain and body affect emotions and behavior. They can prescribe medication and manage complex disorders using medical knowledge and diagnostic manuals like the DSM [29].
Psychology focuses on how people think, feel, learn, and behave. It explains why mental health difficulties arise and how thoughts and emotions influence everyday life. Clinical psychology uses scientific methods and assessments to understand each person’s unique experiences [30].
Psychotherapy is the supportive, talk-based approach to improving mental health. Therapists help people cope with stress, change unhelpful patterns, and heal from difficult experiences using structured methods or psychodynamic therapy [31].
These three fields create a complete picture:
  • Psychiatry supports biological and medical needs.
  • Psychology explains mental and emotional processes.
  • Psychotherapy guides healing, growth, and long-term well-being.
This study investigates LLM-oriented projects in psychiatry, psychology, and psychotherapy, focusing on challenges and clinicians’ perspectives, by formulating the following research questions (RQs):
RQ1 (Psychiatry): What role could LLMs play in reshaping the overall direction and practice of psychiatry?
RQ2 (Psychology): How might LLMs influence the way psychology understands and studies human thought and behavior?
RQ3 (Psychotherapy): In what ways could LLMs transform therapeutic relationships and the experience of psychotherapy?
RQ4 (Challenges): What broader challenges—ethical, cultural, technical, and beyond—must be addressed for LLMs to be responsibly applied in mental health?
RQ5 (Clinical/Workforce): How can LLMs be integrated into clinical work in ways that support mental health professionals while preserving the human dimension of care?
Compared with existing systematic and scoping reviews on large language models (LLMs) in mental health—such as [32,33,34,35,36]—the present review provides the most extensive and methodologically detailed analysis to date. Whereas prior reviews typically include 20–40 studies and focus broadly on “mental health applications,” our review synthesizes 205 studies across psychology, psychiatry, psychotherapy, and the mental health workforce, offering deeper disciplinary resolution. Unlike previous work, this review systematically extracts empirical validation (“tested at least one LLM”), model types, datasets, populations, prompting strategies, RAG and fine-tuning usage, evaluation metrics, and long-term consequences (Table 1). This level of granularity, combined with domain-specific categorization and quantitative distribution of empirical vs. technical vs. review-type studies, distinguishes our review as the most comprehensive mapping of LLM evidence in clinical and psychological science to date.

2. Materials and Methods

This systematic review follows PRISMA guidelines [37], as it examines LLM integration in mental health, focusing on psychiatry, psychology, and psychotherapy. Five research questions (RQs) were formulated to examine their role, implementation challenges, and clinician acceptance. Relevant studies were retrieved from eight major databases and search engines: PubMed, Scopus, IEEE Xplore, JMIR Mental Health, Springer, MDPI, ScienceDirect, and Google Scholar. The search was performed between March 2025 and July 2025 to capture the most recent advances in the integration of LLMs into mental health contexts.

2.1. Eligibility Criteria

1. Inclusion: Only peer-reviewed studies published in English were included. Eligible studies originated from recognized academic sources and demonstrated clear relevance to both LLMs and mental health, with a specific focus on psychiatry, psychology, psychotherapy, and mental health clinical practice.
2. Exclusion: Studies were excluded if they were (a) not published in English; (b) not peer-reviewed or from non-academic sources; (c) editorials, opinion pieces, or conference abstracts without full text; (d) “grey literature” such as blogs, reports, or commentaries; or (e) studies that did not primarily focus on LLMs in mental health.
3. Limitations: Only English-language studies were included in this review. Although this introduces the possibility of language bias, the vast majority of peer-reviewed research on LLMs and mental health is published in English. Moreover, given that most LLMs are primarily trained and evaluated on English-language data, restricting the search to English ensures coverage of the most relevant and impactful research. Including studies in multiple languages would not only be impractical but also unlikely to significantly alter the findings, as non-English publications in this domain remain scarce. In addition, publication bias may exist because preprints and “grey literature” were excluded, and, given the rapid pace of LLM development, some studies published after July 2025 may not be captured.
4. Grouping: The predefined research questions (RQs) served as the main framework for grouping studies and synthesizing the extracted data. Rather than analyzing the literature in a purely descriptive manner, each study was mapped to one or more RQs according to its primary focus. This categorization allowed us to systematically organize the findings on the role of LLMs across domains—psychiatric applications (RQ1), psychological perspectives (RQ2), psychotherapeutic interventions (RQ3), broader implementation challenges (RQ4), and implications for the clinical workforce (RQ5). By aligning the synthesis with the RQs, we ensured that the evidence was not merely summarized but directly addressed the objectives of the review. This process facilitated the identification of thematic patterns and highlighted areas of consensus and divergence between studies, while improving the transparency and reproducibility of the review methodology.

2.2. Study Search Strategy and Selection Process

The literature search across the eight sources listed above used the following search queries (a programmatic sketch of how such queries can be composed follows the list):
1. [“Large Language Models”] IN [“Mental Health”];
2. [“Large Language Models”] IN [“Psychiatry”];
3. [“Large Language Models”] IN [“Psychology”];
4. [“Large Language Models”] IN [“Psychotherapy”];
5. [“Large Language Models”] IN [“Mental Health”] OR [“Psychiatry”] OR [“Psychology”] OR [“Psychotherapy”];
6. [“Large Language Models”] IN [“Mental Health Clinical Practice”] OR [“Psychiatrists”] OR [“Psychologists”] OR [“Psychotherapists”].
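Because each platform implements its own query syntax, the queries above were adapted per database. As a purely illustrative aid—not part of the original protocol—the following Python sketch shows how such boolean query strings can be composed programmatically; the AND/OR field syntax is an assumption and would need adjustment for each engine.

```python
# Illustrative sketch only: composes boolean query strings equivalent to
# queries 1-6 above. The AND/OR syntax is an assumption; each database
# (PubMed, Scopus, IEEE Xplore, ...) uses its own operators and fields.
CORE = '"Large Language Models"'
DOMAINS = ['"Mental Health"', '"Psychiatry"', '"Psychology"', '"Psychotherapy"']
WORKFORCE = ['"Mental Health Clinical Practice"', '"Psychiatrists"',
             '"Psychologists"', '"Psychotherapists"']

def combine(core: str, terms: list[str]) -> str:
    """Join a core phrase with OR-ed domain terms (queries 5 and 6)."""
    return f"{core} AND ({' OR '.join(terms)})"

queries = [f"{CORE} AND {d}" for d in DOMAINS]                 # queries 1-4
queries += [combine(CORE, DOMAINS), combine(CORE, WORKFORCE)]  # queries 5-6

for q in queries:
    print(q)
```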

2.3. Data Collection Process

The search covered English-language studies published between 2024 and 2025, using equivalent search terms adapted for each database. The initial search identified 12,672 records across PubMed, Scopus, IEEE Xplore, JMIR Mental Health, Springer, MDPI, ScienceDirect, and Google Scholar. After removing 502 duplicates, 12,170 records remained for title and abstract screening (Table 2). Two reviewers independently screened these records and excluded 11,570 as not relevant to LLMs in mental health or to the target domains of psychiatry, psychology, psychotherapy, or mental health clinical practice (Table 3). The remaining 600 articles were retrieved for full-text eligibility assessment. After full-text review, 395 articles were excluded for not meeting the inclusion criteria (e.g., not LLM-focused, not mental-health-related, or restricted by language), leaving 205 studies in the final synthesis. No automation technologies were employed during screening or data extraction; all decisions were made by human reviewers (Table 4). The entire study selection process is summarized in the PRISMA flow diagram (Figure 2).
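The record counts above are internally consistent; the following minimal check simply restates the PRISMA flow arithmetic in code, using the numbers reported in this section.

```python
# Sanity check of the PRISMA flow counts reported above.
identified = 12_672                 # records across the eight sources
screened = identified - 502         # after duplicate removal
assert screened == 12_170
full_text = screened - 11_570       # after title/abstract screening
assert full_text == 600
included = full_text - 395          # after full-text eligibility review
assert included == 205              # studies in the final synthesis
```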
For the synthesis, the process began with the sorting of the studies by their primary area of focus. Only the papers that placed LLMs at the center of their investigation—and that clearly fell within mental health, psychiatry, psychology, psychotherapy, or the mental health workforce—were kept in the final set. This first step ensured that the review stayed aligned with its scope.
From there, the analysis followed the structure of the coding and data extraction framework presented in Table 5. Each study was examined in terms of its design, the LLMs it used, the type of data it relied on, the tasks it addressed, and the metrics with which it was evaluated. Their key findings, limitations, ethical considerations, and overall relevance to the research questions were also noted. Looking at the studies through these common categories made it easier to compare them fairly and see recurring patterns—both in the strengths reported and in the challenges researchers faced.
Working in this two-step way—first by grouping studies into domains and then synthesizing what they revealed based on the coding framework—allowed the review to bring together a more transparent and more coherent picture of how LLMs are currently being used in mental health contexts and where the most important gaps remain.
Figure 3 summarizes the complete methodological workflow described in Section 2.1, Section 2.2 and Section 2.3, illustrating the end-to-end process from literature retrieval to LLM-oriented evidence synthesis.

2.4. Risk of Bias Assessment

Risk of bias refers to flaws in study design or conduct that may distort results. Two independent reviewers applied the Mixed Methods Appraisal Tool (MMAT) to evaluate each study, resolving any disagreements through consensus [38]. Apart from language bias (the included studies were exclusively in English), the studies were judged to have a low risk of bias, and inclusion was based on how closely they aligned with the RQs of this review.

2.5. Reporting Bias Assessment

To reduce reporting bias, multiple databases and supplementary sources were searched. Only English-language studies were included—potentially introducing language bias—although the vast majority of peer-reviewed research on LLMs and mental health is published in English. No formal statistical tool for assessing reporting bias was applied.

2.6. Strength and Consistency of the Evidence

All included studies were sourced from established and internationally recognized scientific databases—such as Scopus, PubMed, IEEE Xplore, and other peer-reviewed repositories—which ensured methodological rigor and academic credibility. The synthesis therefore reflects both the breadth and the quality of the available empirical evidence, offering a transparent interpretation of strengths, limitations, and convergence across studies. Furthermore, consistent findings emerging from independent investigations were treated as areas of stronger evidence, while discrepancies or single-study observations were interpreted more cautiously.

2.7. Evidence Synthesis and Confidence Assessment

To assess confidence in the evidence, the extracted studies were evaluated qualitatively along four dimensions: (i) methodological transparency, (ii) dataset quality, (iii) reproducibility, and (iv) ecological validity. Across domains, most studies scored low to moderate. Confidence was reduced by heterogeneous reporting standards, unclear prompt configurations, and the absence of model calibration metrics.
Only a minority of studies provided error breakdowns or sensitivity analyses. Furthermore, 93% of included studies used cross-sectional or single-session evaluations, which prevent conclusions regarding long-term safety, stability, or drift. As a result, the overall confidence in the evidence is moderate for low-risk tasks (e.g., summarization, psychoeducation) and low for high-risk clinical tasks (e.g., diagnosis, risk assessment, therapeutic decision-making).
This confidence profile should guide interpretation; the observed improvements reflect short-term performance in controlled settings, not validated clinical outcomes.

3. Evidence Across Mental Health Domains

LLMs are increasingly applied across healthcare [39,40,41], with particular relevance in mental health. While promising in psychiatry, psychology, and psychotherapy, these applications also raise challenges.
Table 6 provides an aggregated overview of empirical findings across mental health domains, summarizing dominant LLM architectures, task focus, methodological approaches, key strengths, and recurring limitations identified in the reviewed studies. The domain-specific analyses that follow expand on the patterns summarized in Table 6, providing detailed evidence and illustrative examples within each mental health context.
The findings reflect clinicians’ perspectives on their role, and Figure 4 illustrates the distribution of LLM-oriented studies across domains.

3.1. LLMs in Psychiatry

3.1.1. Diagnostic Application

LLMs empower patients and advance science [42,43,44], accelerating progress across diverse scientific fields, including knowledge evaluation, neurocognitive decoding, and beyond [45,46,47]. Empirical studies show that LLMs can detect depression and stress, enhance suicide prevention, and support early psychosis evaluation [27,32,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86]. In diagnostic benchmarks, LLMs often match or outperform classical machine learning baselines [87] and rival LLMs. In depression diagnostics, GPT-4 excelled in zero-shot settings, while GPT-3.5 improved through fine-tuning [52]. PHQ-8 (Patient Health Questionnaire-8) depression screening further revealed that GPT-4o achieved the highest overall prediction accuracy, with Llama 3 demonstrating superior performance in detecting anhedonia and Cohere excelling in identifying psychomotor symptoms [88]. Transformer-based models (RoBERTa, DistilBERT, DeBERTa, SqueezeBERT) were employed to identify depression in Twitter posts, yielding strong results, with RoBERTa attaining an accuracy of 98.1% [89]. Furthermore, DepGPT (GPT-3.5) showed robust efficacy in detecting depression among Bengali speakers [90]. In suicide-related tasks, GPT-2, Bio-ClinicalBERT, and MentalBERT achieved 97.7% accuracy in detecting suicidal ideation in Reddit posts [91], while in risk assessment, GPT-4 showed higher sensitivity than GPT-3.5, especially in recognizing male gender as a risk factor and in differentiating between Greek and South Korean cultural contexts [92]. Nonetheless, marginalized groups remain persistently under-represented in the bulk of these studies. Few-shot LLaMA-7B outperformed zero-shot methods for multi-diagnosis [93], while hybrid techniques combining ChatGPT, BERT, and the SHAP method reached 93.8% diagnostic accuracy, producing reliable, polite, and contextually relevant recommendations that assisted counselors in identifying and addressing depressive symptoms [94].
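Most of the diagnostic studies cited above use prompting-only pipelines. The following sketch illustrates what a zero-shot screening call of that general kind could look like; the client library usage is standard, but the model name, label set, and prompt wording are illustrative assumptions rather than the protocol of any cited study.

```python
# Hypothetical zero-shot screening sketch; prompt wording, labels, and the
# chosen model are illustrative assumptions, not reproduced from any study.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def zero_shot_screen(text: str) -> str:
    """Ask the model for a single screening label, with no exemplars."""
    prompt = (
        "You are assisting a research study on language-based screening.\n"
        "Classify the following text as 'depression-indicative' or "
        "'not-indicative'. Answer with the label only.\n\n"
        f"Text: {text}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model identifier
        messages=[{"role": "user", "content": prompt}],
        temperature=0,   # deterministic output for benchmark-style evaluation
    )
    return response.choices[0].message.content.strip()
```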

3.1.2. Evidence-Aligned Benchmarking and Retrieval-Augmented Generation (RAG)

In the evaluation of psychiatric multiple-choice questions, several GPT models were systematically assessed, revealing that newer models responded with greater accuracy and consistency and underscoring the need for calibrated confidence metrics prior to clinical application [94]. A RAG framework was introduced that compels LLMs to respond exclusively from vetted psychiatric clinical guidelines (for depression), thereby preventing hallucinations and guaranteeing traceable, evidence-based answers. Expert assessment indicated superior coherence, accuracy, and evidence quality, with open-source LLaMA models marginally outperforming GPT-3 [95]. Knowledge-enhanced pre-training and careful model choice can further strengthen clinical utility—for example, LLaMA models and lightweight BERT variants in treatment–outcome classification [96,97]. In a recent study, task-specific models—MentalLlama, Mistral, and MentalBART—summarized therapy sessions effectively, although clinical reliability remains limited: despite strong automatic metrics, the authors conclude that current LLMs are not yet reliable enough for clinical deployment [98].
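For readers unfamiliar with the technique, the sketch below outlines a minimal guideline-grounded retrieval step in the spirit of the RAG framework described above; the embedding model, the two-passage corpus, and the prompt template are placeholders, not the pipeline of [95].

```python
# Minimal retrieval-augmented generation sketch in the spirit of the
# guideline-grounded framework above [95]. The embedding model, corpus,
# and prompt are illustrative placeholders, not the authors' pipeline.
import numpy as np
from sentence_transformers import SentenceTransformer

guideline_passages = [
    "First-line treatment for moderate depression includes ...",
    "Suicide risk assessment should cover ideation, plan, and means ...",
]  # in practice: chunked, vetted clinical guideline text

encoder = SentenceTransformer("all-MiniLM-L6-v2")
passage_vecs = encoder.encode(guideline_passages, normalize_embeddings=True)

def retrieve(query: str, k: int = 1) -> list[str]:
    """Return the k guideline passages most similar to the query."""
    q = encoder.encode([query], normalize_embeddings=True)[0]
    scores = passage_vecs @ q                  # cosine similarity
    top = np.argsort(scores)[::-1][:k]
    return [guideline_passages[i] for i in top]

def grounded_prompt(question: str) -> str:
    """Constrain the model to answer only from retrieved guideline text."""
    context = "\n".join(retrieve(question))
    return (
        "Answer using ONLY the guideline excerpts below; if they do not "
        f"cover the question, say so.\n\nGuidelines:\n{context}\n\n"
        f"Question: {question}"
    )
```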

3.1.3. Ecological Assessment, Longitudinal Monitoring, and Symptom Inference

In DSM-5–based scenarios (Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition) and patient–doctor dialogues, GPT-3.5, AYA, and GPT-4 outperformed comparators [99]. These expert-vetted vignettes (questions and answers, patient–doctor dialogues, and first-person diaries) validly approximate real-world diagnostic encounters and constitute a suitable gold standard. In diary-style emotional writing, fine-tuned GPT-3.5 with Chain-of-Thought (CoT) achieved diagnostic accuracy of 90.2% and specificity up to 95.5%, while GPT-4 reached high recall at lower accuracy, highlighting the potential of user-generated diary text as a clinically useful ecological momentary assessment [100].
As text-based detectors [27,74,101], LLMs can monitor depressive symptoms through binary question answering, with promising zero-, one-, and few-shot accuracy for BERT, BART, and ALBERT [102], structure symptom narratives [58], and enhance symptom monitoring [103]. GPT-4-based support has been reported to reduce depression, anxiety, and substance use and to improve therapy adherence and mental health prevention [104], while ethological computational psychiatry tracks mood–cognition–behavior relationships using wearables, cellphones, computer vision, and LLMs [105].

3.1.4. Simulation-Based Training, Clinical Education, and Psychoeducational Support

For clinicians, LLM-based training simulations are significant [74,106,107,108], as they deliver immersive experiences with standardized patient dialogues and branching vignettes that unfold as the learner probes history, examines options, and orders tests. Cases are designed to elicit core clinical reasoning moves—problem representation, differential diagnosis, likelihood revision, and selection of guideline-concordant next steps—while requiring explicit risk assessments (suicide, self-harm, intoxication, violence) and disposition planning [74]. Additionally, general and preventive psychoeducation is reinforced by LLMs—both for the public and for psychiatric stakeholders—by delivering scalable, on-demand materials that explain symptoms, risk factors, and help-seeking pathways in accessible language [45,57,74,109]. In psychoeducation, ChatGPT delivered helpful motivational prompts and guided cognitive reappraisal, but its responses often lacked empathic attunement and nuanced validation [110]. As a result, perceived support improved for skills training, yet relational warmth and emotional containment remained limited.

3.1.5. Automated Clinical Documentation and Workflow Optimization

LLMs reduce administrative burdens, such as screening, text summarization, auto-generated notes or transcriptions, feature extraction, clinical transcript analysis, and psychiatric terminology detection [57,70,73,81,98,109,111,112,113,114,115,116,117]. Psychiatrists are relieved of overly complex bureaucracy in psychiatric evaluations and are empowered in clinical support and medical question answering [58,118,119]. Furthermore, RAG-based and context-aware chatbots provide accurate, evidence-based, workflow-integrated support, while computational text mining uncovers patterns in the psychiatric literature to guide these interventions [63,82,120,121,122,123].

3.1.6. AI-Augmented Therapeutic Interventions and Immersive Treatment Modalities

LLMs are highly transformative tools for clinical treatment [27,57,69,70,71,76,77,78,81,118,123,124,125,126], and diverse approaches such as cognitive restructuring, exposure exercises, Dialectical Behavior Therapy, family therapy, and counseling provide affordable and accessible therapeutic options [27,57,59,76,80,114,127,128,129,130] that are essential for equality and inclusion [131]. LLM-based virtual therapeutic avatars are intriguing yet largely unexplored, while avatar and exposure therapies have strengthened schizophrenia treatment [44]. Established AI approaches provide a foundation for LLM-oriented research. Integrating immersive technology with AI—such as meTAI—enables tailored, interactive therapy with real-time feedback [132]. In exposure therapy, virtual simulations provide safe, controlled environments for graded practice, although greater realism, effectiveness, and adaptability still require rigorous study [133]. Emerging directions include AI-based virtual therapists (e.g., Tess), animal-like social robots augmented with affect sensing, and game-informed, evidence-based interventions (cognitive–behavioral-therapy-aligned gamification for behavior change, social interaction, and self-regulation) within immersive platforms [44].

3.1.7. Personalized Mental Health Support and Emotion-Aware Interaction Systems

LLMs complement patient alignment [69,134], as self-regulation and self-literacy along the therapeutic and healing path are strongly advocated [45,67,70,74,75,79,125,135,136,137], especially when patients are empowered with enhanced therapeutic and non-clinical self-help chatbots that provide support beyond psychiatric sessions [71,114]. Indeed, studies indicate that personalization in mental health interventions accelerates patient progress [58,72,76,78,109,126,131]. Automated, customized CBT and dyadic therapist–patient interactions are two important areas that require individualized adaptation to each patient’s circumstances [73]. Emotion recognition is a critical component of digital mental health systems [109], and LLM-based conversational agents increasingly demonstrate emotion-aware interaction competencies that support more empathetic, human-aligned responses [133].

3.1.8. Clinical Decision Support and Predictive Analytics

LLMs contribute to clinical decision-making by analyzing unstructured notes, detecting risks, generating therapeutic summaries, and applying knowledge graphs (KGs), which currently reach only about 12% accuracy [27,58,59,67,70,115,118,126,138]. Decision explainability is essential [65], and simulation-based applications are strengthened through frameworks like CureFun [139]. The PSAT framework (process-knowledge-infused cross-attention) incorporates clinical guidelines (e.g., the Patient Health Questionnaire-9) to improve classification accuracy and provide clinician-level explanations [140].
New predictive and intervention technologies in psychiatry are showing strong potential. Crisis prediction models can flag early warning signs [109], and predictive analytics help identify vulnerable populations through large-scale data [68,73]. Digital tools now enable faster, more tailored crisis responses [45], while biometric and digital phenotyping data improve diagnosis and relapse prediction [58]. Combining clinical expertise with Generative AI-driven support strengthens timely interventions [130]. LLMs like GPT-4 are also being applied in psychiatric rehabilitation, showing promising results for AI-assisted recovery [141]. In forensic psychiatry, AI merges police and medical records for risk assessment and supports digital patient twins and offender simulations, while it simplifies expert reporting [142]. Similarly, in Africa, LLMs have been used to speed up jail data screening, personalize treatment plans, simulate court cases, and train specialists, who are in short supply [143].

3.1.9. Comparative Clinical Performance and Diagnostic Concordance with Human Clinicians

Although LLMs still have limitations in complex or high-risk cases, they can achieve performance comparable to that of psychiatrists in structured evaluations [51,144,145]. GPT-4 often produced results similar to human clinicians in emotion recognition and diagnostic reasoning [131,146]. The Thera-Turing Test confirmed that chatbot responses often appear human-like in fluency and coherence; however, they consistently underperform on empathy, with weaker affect labeling, validation, and context-sensitive warmth [147]. GPT-4, Claude, and Bard showed clinician-level alignment in schizophrenia and depression case vignettes—matching diagnostic impressions, severity ratings, and next-step recommendations—whereas GPT-3.5 often rated cases as more severe, less likely to recover, and better managed through conservative approaches [131,148]. Strong results were also reported in Electronic Health Record (EHR) term classification—including the extraction and normalization of psychiatric symptoms, diagnoses, and medications with robust handling of negation and temporality [115]—and in medication guidance for bipolar depression, where models produced guideline-concordant options [149]. GPT-4 Turbo achieved clinician-comparable accuracy when generating psychiatry vignettes for both adults and children [150]. In risk management, GPT-4 performed as well as clinicians [151], Claude 3.5 Sonnet scored highly on the Suicidal Ideation Response Inventory benchmark (SIRI-2) [152], and generative LLMs aligned with the Veterans Health Administration’s (VHA) risk stratification for veterans [153]. Comparative studies of depression, Post-Traumatic Stress Disorder (PTSD), and social phobia diagnosis revealed that ChatGPT-4, Gemini, Claude, and GPT-3.5 matched or slightly surpassed clinician ratings; however, these models perform less well in early or chronic schizophrenia and tend to forecast more conservative recovery [154]. Additional research has demonstrated that LLMs correlate well with specialists in schizophrenia, depression, and Obsessive–Compulsive Disorder (OCD) [155]. In OCD specifically, ChatGPT-4, Gemini Pro, and Llama 3 have achieved high accuracy rates [156].

3.1.10. Bias and Ethical Issues in Clinical Psychiatry LLM Studies

Value-based assessments of LLMs reveal cultural bias, emphasizing universalism and self-direction while downplaying power and security [157]. Evidence from other studies indicates that LLMs show notable agreement with clinicians in recovery predictions; nevertheless, their forecasts are systematically more conservative, reflecting a bias toward underestimating recovery speed and overemphasizing residual symptoms or the need for long-term management [148]. General evaluations found that LLM-generated responses were often empathetic but excessively verbose, frequently lacking the depth of lived-experience advice that patients value [158]. In contrast, the Mello LLM scored higher than humans on measures of empathy and emotional intelligence, raising important questions about the potential of these systems to simulate affective understanding beyond human benchmarks [159]. Despite this progress, concerns remain about bias [157], lack of transparency [160], and the need for regulatory (e.g., FDA) oversight [114] if LLMs are to reach underserved populations safely [72]. LLMs should complement, not replace, human empathy [71,109], while balancing accessibility and personalization [79]. Risks include misinformation, shallow clinical depth, privacy issues, and bias [58,76,126,134]. Improvements in engagement and safety have been reported [161], but LLMs remain unfit for autonomous psychiatric care [162] and continue to show treatment bias [160].

3.1.11. Overview of LLM Applications in Psychiatry

In the field of psychiatry and LLMs, the search identified 129 studies, which are categorized by type in Table 7. Approximately 86% of the studies are empirical (all types combined), while review-type studies (narrative, scoping, systematic, and rapid) comprise about 21%, and purely theoretical, opinion, or editorial pieces are few (~4%).
In 73 of the 129 psychiatry-related studies, complete data extraction was conducted because these articles evaluated at least one LLM on one or more clearly defined psychiatric tasks. Collectively, these studies form the methodological core of current psychiatric LLM research, with most published in 2024–2025, reflecting a rapidly accelerating field. GPT-based architectures (GPT-3.5, GPT-4, GPT-4 Turbo, GPT-4o) dominate in both usage frequency and performance, while LLaMA models, BERT-based transformers (BioClinicalBERT, MentalBERT), and seq2seq models such as T5 and BART appear less consistently or for specific classification tasks (Table 8). Despite this model diversity, most experiments rely on prompting-only pipelines, often enhanced with few-shot or chain-of-thought strategies, whereas the use of RAG systems or fine-tuned variants remains comparatively rare in psychiatry.
The datasets used across these studies are predominantly real but stem from non-longitudinal clinical sources, including electronic health records, psychiatric evaluation notes, PHQ-8/9 and PANSS assessments, DAIC-WOZ conversational data, therapy session transcripts, and social media corpora for suicide risk detection. However, these datasets are limited by small sample sizes, cultural imbalance, inconsistencies in annotation quality, lack of diagnostic diversity, and the absence of temporal follow-up—factors that restrict the ecological validity and generalizability of the findings. When compared with traditional baselines such as BERT, RoBERTa, DistilBERT, and classical ML models, GPT-4 and occasionally fine-tuned GPT-3.5 typically achieve superior performance in tasks such as mood disorder detection, OCD classification, and suicide risk estimation. The few studies using RAG-enhanced systems show advantages in consistency and reduced hallucination rates, especially when grounded in DSM-based or evidence-supported content.
Despite promising short-term results, none of the 73 studies evaluated LLMs in longitudinal or real-world clinical deployment. As a result, long-term stability, drift resistance, safety, and supervision requirements remain unknown. Current psychiatric LLM applications therefore remain developmental—useful for structured, controlled tasks but not yet validated for continuous clinical decision-making or high-risk scenarios. The combination of limited datasets, methodological heterogeneity, and the absence of temporal depth underscores the need for long-term, clinician-supervised evaluations before LLMs can be safely integrated into psychiatric care.

3.1.12. Critical Appraisal of Psychiatric Evidence

Most psychiatric studies rely on small, cross-sectional, or synthetic datasets, limiting generalizability and ecological validity. Designs often use vignette-based evaluations that oversimplify real clinical complexity. Prompting methods, baselines, and metrics vary widely, reducing comparability and reproducibility. While GPT-4 and similar models show strong short-term diagnostic performance, findings are inconsistent across cultural contexts and high-risk tasks. Few studies assess safety, calibration, drift, or long-term outcomes. Overall confidence in psychiatric evidence is moderate for low-risk applications but low for diagnostic or predictive clinical use.

3.2. LLMs in Psychology

3.2.1. Language-Based Psychological Signal Detection and Psychopathology Modeling

Mental health indicators have been identified through a taxonomy-based analysis of Reddit comments that mapped three latent language characteristics onto the Hierarchical Taxonomy of Psychopathology (HiTOP), a dimensional framework used in the original study to align language-derived signals with transdiagnostic psychopathology dimensions. RAG-enhanced LLMs outperformed traditional baselines such as Long Short-Term Memory (LSTM) networks and Recurrent Neural Networks (RNNs) in suicide risk detection from social media data, achieving 90.7% accuracy with 87% precision and 94% recall, and showing stable results across multiple runs [163]. Suicidal ideation detection benefited from artificial socially aware data generated by ChatGPT, Flan-T5, and LLaMA, which yielded stronger results than baselines trained on real-world data [164]. Over a three-week period, ChatGPT-derived language sentiment was used to predict depressive symptoms, and the predictions were compared with results from Linguistic Inquiry and Word Count (LIWC) dictionary approaches [165]. In addition, several GPT models were tested against dictionary techniques across twelve languages to determine which method best predicted sentiment, discrete emotions, offensiveness, and moral foundations [166].
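The accuracy, precision, and recall figures reported in such detection studies follow directly from confusion-matrix counts. The short example below illustrates the relationship using hypothetical counts chosen only for demonstration; they are not the data of [163].

```python
# Relationship between detection metrics and confusion-matrix counts.
# The counts are hypothetical, chosen only to illustrate the formulas.
def detection_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp),  # of flagged posts, share truly at risk
        "recall": tp / (tp + fn),     # of at-risk posts, share flagged
    }

print(detection_metrics(tp=94, fp=14, tn=87, fn=6))
# -> accuracy ≈ 0.90, precision ≈ 0.87, recall = 0.94
```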

3.2.2. Stress, Sleep, and Behavior Prediction Using LLM-Enhanced Models

In stress detection, biomedical-domain LLMs such as BioBERT achieved up to 90.32% accuracy, demonstrating the effectiveness of domain-specific pre-training in identifying stress-related indicators from clinical and textual data [167]. ChatGPT-3.5, combined with a smartphone-based system tracking daily habits, produced tailored and understandable sleep recommendations, while its fine-tuned Turbo variant, paired with stress prediction models, achieved 99.37% accuracy in predicting stress for context-aware chatbot responses [168,169]. Hybrid techniques combining psychological theory, sentiment analysis, and rule-based filtering performed best [170], while hybrid methods integrating BERT with Bayesian networks outperformed psychologists on vignette-based assessments [171]. Prompt engineering remains paramount: expert-guided few-shot prompts elicit the best performance from ChatGPT [172], whereas CoT prompting proved far more effective than zero-shot and few-shot techniques, improving empathy and the targeting of counseling recommendations [173] (a sketch contrasting these strategies follows).
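To make the contrast between these prompting strategies concrete, the sketch below shows one way few-shot and CoT prompts might be constructed for a stress-related counseling task; the exemplars and instructions are hypothetical, not drawn from the cited studies.

```python
# Illustrative contrast between few-shot and chain-of-thought (CoT) prompt
# construction for a stress/empathy-oriented counseling task. Exemplars
# and instructions are hypothetical, not taken from the cited studies.
FEW_SHOT_EXEMPLARS = [
    ("I can't sleep before every exam.", "stressed"),
    ("Had a calm, ordinary day today.", "not stressed"),
]

def few_shot_prompt(text: str) -> str:
    """Prepend labeled exemplars so the model imitates the label format."""
    shots = "\n".join(f"Text: {t}\nLabel: {l}" for t, l in FEW_SHOT_EXEMPLARS)
    return f"{shots}\nText: {text}\nLabel:"

def cot_prompt(text: str) -> str:
    """Ask for explicit step-by-step reasoning before the recommendation."""
    return (
        "Read the message, reason step by step about the writer's emotional "
        "state, then give one empathetic counseling suggestion.\n"
        f"Message: {text}\nLet's think step by step:"
    )
```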

3.2.3. Evaluating and Enhancing LLM Psychological Reasoning Through Prompting

LLMs are being used to reconstruct counseling conversations, with CPsyCoun enhancing the comprehensiveness and authenticity of the reconstructed discourse [174]. Beyond theoretical frameworks, benchmarking of LLMs on counseling encounters is further advanced through the CounseLLMe dataset, which offers standardized scenarios for evaluating conversational quality, empathy, and adherence to therapeutic practices [175]. LLMs are also being tested in adversarial extreme-behavior counseling cases, where DeepSeek, Wenxin Yiyan, and Claude showed stronger sensitivity than GPT-4 [176]. Furthermore, LLMs demonstrated strong zero-/few-shot capabilities in sentiment analysis, emotion recognition, and empathetic response generation, although fine-tuned models still perform better on complex psychological tasks [177].

3.2.4. LLMs as Companions and Social Agents: Perceptions of Empathy and Consciousness

LLMs increasingly act as companions and mentors. The vFerryman system integrates LLMs with calming virtual environments to promote emotional well-being [178], while Mello demonstrated higher empathy and emotional intelligence scores relative to human participants [159]. LLMs can serve as safe AI partners for roleplay, improving empathy, conflict resolution, and active listening [179]. Research has also shown that 67% of participants believe there is at least a slight chance that ChatGPT possesses phenomenal consciousness, with more frequent users attributing a greater likelihood of consciousness to the system [103].

3.2.5. Psychometrics, Psychological Assessment, and Personality Modeling with LLMs

A recent study proposed a unified psychometric framework for LLM-based assessment [180]. LLMs can predict the outcomes of psychological assessments with notable accuracy, offering insights that may complement traditional scales and, in some cases, reflect aspects of real-world functioning more closely [181]. The Chat-16PF system produced more compelling and precise personality-test results, which in turn increased users’ self-disclosure [182]. LLM surprisal scores successfully detected aphasia and also proved useful for classifying its subtypes [183].

3.2.6. LLM-Augmented Digital Mental Health Interventions and Mental Health Outcomes

LLMs are increasingly embedded in therapeutic chatbots that deliver CBT-based exercises, mindfulness prompts, and guided self-help interventions. A RAG-based and GPT-based system was tested in three randomized controlled trials (n = 326), revealing significant improvements in well-being and reductions in depression/anxiety through personalized recommendations and empathetic dialogues [184]. Other LLM-powered bots demonstrated increased engagement, better accessibility, and reductions in anxiety and depression symptoms, although concerns remain regarding unempathetic or context-inappropriate responses [58]. Large-scale reviews highlight the promise of Digital Mental Health Interventions (DMHIs), especially CBT. However, their effects are often short-lived, with many users dropping out—an issue that LLM-based approaches now seek to overcome through more personalized and engaging support [185].

3.2.7. Psychological Frameworks Applied to LLM Reasoning, Values, and Biases

LLMs have been examined through the lens of psychological theory. For instance, Schwartz’s Theory of Basic Values was applied to compare LLMs (Bard, Claude 2, GPT-3.5, GPT-4) with 53,000 humans, showing biases toward universalism and self-direction over power and security [157]. Dual Process Theory has likewise been proposed as a framework for addressing hallucinations by applying reasoning strategies such as Chain-of-Thought, Reflexion, and Tree-of-Thoughts [186]. Psychometric profiling indicates that LLMs exhibit a balanced distribution of Big Five traits and low levels of dark traits, although they continue to show moral and gender biases as well as conservative standards of moral conduct [187].

3.2.8. Ethical and Safety Risks in Psychological and Therapeutic LLM Applications

Despite these promising applications, risks are evident. Studies warn about the dangers of emotional manipulation, the loss of human connection, and the failure to provide a sufficient duty of care, emphasizing the importance of a framework that combines Responsible AI with the Ethics of Care [188]. Research also warns about overreliance, reduced real-life social interaction, and privacy breaches [189]. In addition, GPT-4 has been shown to be susceptible to emotionally distressing narratives, with therapeutic techniques only partially mitigating this vulnerability [190].

3.2.9. Overview of LLM Research in Psychology

In the field of psychology and LLMs, 28 studies were identified (Table 9). The area is strongly dominated by empirical, benchmark, and experimental work (64.3%), followed by a notable share of technical or engineering-focused studies (14.3%) that develop datasets, simulations, or LLM-based psychometric tools. Conceptual and theoretical papers (10.7%) and review-type works (≈11% combined) form a smaller portion of the literature, indicating that psychology research with LLMs is primarily driven by data-driven experimentation rather than high-level theoretical reflection.
Consistent with this trend, 18 of the 28 psychology-related studies underwent data extraction, as they tested at least one LLM on a specific psychological or cognitive task. Nearly all were published in 2024–2025, reflecting the rapidly accelerating growth of LLM research in psychology. The studies exhibit substantial model diversity, although GPT-based models clearly dominate, followed by BERT-family models and a smaller number of LLaMA and seq2seq architectures. When grouped by model family, the GPT lineage is the most frequently used, while BERT and LLaMA models appear mainly as secondary baselines (Table 10). Most studies rely heavily on prompting (≈89%), while RAG (≈28%) and fine-tuning (≈50%) appear more often than in the psychiatric domain, frequently in combination with prompting. All 18 studies use real datasets, typically grounded in psychological questionnaires, cognitive tasks, or annotated mental health corpora, although dataset inconsistencies, cultural biases, and lack of temporal depth remain common limitations. Only a minority of studies (≈6) are designed for direct psychological assessment, while most remain prototypes or conceptual explorations of cognitive or affective reasoning. Study duration is uniformly short, with single-session evaluations and no validated long-term deployment, limiting conclusions about longitudinal psychological reliability or safety.
Across the included psychology studies, dataset time spans were extremely narrow; most corpora represented single-session or single-day text (≈70%), with only a small minority using multi-week data (≈4–5 studies), and none incorporating month- or year-scale longitudinal trajectories. This limited temporal depth significantly constrains the ability to assess stability, drift, or symptom evolution over time.

3.2.10. Critical Appraisal of LLM-Based Psychological Research

Psychological studies frequently use narrow, short-form datasets and single-session evaluations, limiting temporal and cross-cultural validity. High reported performance in sentiment or stress prediction often reflects dataset artifacts or weak baselines rather than true psychological insight. Empathy, reasoning, and affective stability results are inconsistent, and studies rarely examine construct validity or reliability. The lack of longitudinal testing, theoretical grounding, and robust evaluation reduces confidence. Evidence is moderate for constrained computational tasks but low for interpretive or affective psychological functions.

3.3. LLMs in Psychotherapy

3.3.1. AI Psychotherapists and Therapeutic Text Analysis

AI psychotherapists (APTs) illustrate, in research initiatives, how LLMs are changing psychotherapy [191]. Large-scale transformer-based analysis of counseling texts improves client satisfaction and engagement and yields moderate symptom reductions [192]. Fine-tuning nBERT with the NRC Emotion Lexicon improved emotion identification and personalized therapy, while BERTopic modeling of psychotherapy transcripts strongly predicted symptom intensity and the therapeutic connection [193,194]. Culture-relevant visual resources that support emotion recognition and regulation have been created with generative AI to improve engagement [195]. Extrapolating this trajectory, ChatGPT-4 has outperformed professionals on a measure of social intelligence, suggesting therapeutic potential [196].

3.3.2. Therapeutic Chatbots and Adaptive Support Systems

Compared to earlier NLP systems, which often generate rigid or repetitive responses, LLM-based chatbots deliver interactions that adapt more fluidly to user input [197,198]. MindScape provides tailored journaling prompts for emotional management [199], while clinical platforms like CaiTI employ smart devices to incorporate CBT and Motivational Interviewing (MI) and increase involvement [200]. RAG-enabled agents make therapy support more effective by grounding responses in validated clinical knowledge [201,202]. Comparative studies show that AI psychotherapy is becoming more similar to human sessions, with therapists in some evaluations rating AI responses higher than human ones [203]. To enable systematic evaluation, frameworks such as Gabriel’s assessment model [204] and the BOLT behavioral system have been proposed [205]. Reviews highlight the scalability of GPT-4 chatbots across clinical care, supervision, and research [173,206] but stress that their role must remain complementary and aligned with clinical guidelines [177,207].

3.3.3. Integration into Clinical Practice and Research Workflows

Integration into health systems further illustrates their potential. LLMs improved National Health Service (NHS) Talking Therapies allocation, recovery rates, and wait times [208]. Research workflows benefited as GPT-3.5 and GPT-4 reached human-comparable sensitivity in systematic review screening [86]. In practice, LLMs contribute to low-intensity interventions, digital phenotyping, patient onboarding, medication support, and documentation [209]. Gender bias has also been documented: Gemma evaluated men negatively and tended to downplay women’s needs [210].

3.3.4. Model Adaptation, Fine-Tuning, and Therapeutic Quality

Dataset and modeling improvements extend LLM capabilities: ChatPal, trained on emotional-support datasets such as ExTES, demonstrates greater empathy and helpfulness than larger models [59]. Human evaluators have rated ChatGPT’s psychotherapy responses higher than those of human therapists in authenticity, professionalism, and practicality [211]. In group therapy, LLM-generated relational cues matched human ones across empathy dimensions [212]. Emotion-aware embedding fusion further improved affective-state detection and empathetic dialogue [193].

3.3.5. Clinical Trials and Therapy Session Summarization

Clinical trials demonstrate feasibility. A DialoGPT system improved with ChatGPT-3.5 satisfied patients and professionals in terms of conversational quality and relevance [213]. Task-specific fine-tuned models summarized therapy sessions more effectively than general-purpose LLMs, generating concise, clinically relevant outputs that highlighted treatment goals, therapeutic progress, and patient concerns with greater accuracy [121]. Within adolescent depression research, embedding models including BERT, MentalBERT, MentalLongformer, Llama 2, and Llama 3 achieved moderate-to-high effectiveness in the fine-grained classification of depressive symptoms [97].

3.3.6. LLM-Supported Psychotherapeutic Approaches

CBT remains a central focus in digital mental health research. LLMs show promise in mapping patient statements into ABCD pathways—linking activating events, beliefs, consequences, and disputations—yet they continue to lag behind supervised learning methods in accuracy and reliability [214]. Fine-tuned CBT-LLMs generate structured replies that follow CBT principles, ensuring patient statements are organized into coherent therapeutic steps [215]. When integrated with CBT knowledge bases, these models provide improved alignment and higher-quality discourse [216]. Although not intended as a primary source of information, ChatGPT-4 and Bard can be used to reframe problematic beliefs through the “Catch it, Check it, Change it” method [217]. Moreover, children have reported less fear and distress when supported by Socially Assistive Robots (SARs) [218]. At the same time, AI-led Socratic dialogue is exemplified by multi-agent frameworks such as Socrates 2.0 [219]. More extensive studies report that LLMs improve CBT across the stages of personalization, monitoring, and relapse prediction [220].
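Conceptually, the ABCD mapping task described above amounts to populating a small structured record from a patient statement. The sketch below shows one possible representation; the field names and the example are illustrative and not taken from the cited works.

```python
# One possible data structure for the CBT "ABCD" mapping task: linking
# Activating events, Beliefs, Consequences, and Disputations. Field names
# and the example are illustrative, not drawn from the cited studies.
from dataclasses import dataclass

@dataclass
class ABCDRecord:
    activating_event: str  # what happened
    belief: str            # the interpretation / automatic thought
    consequence: str       # resulting emotion or behavior
    disputation: str       # the reframing a therapist (or model) proposes

example = ABCDRecord(
    activating_event="My friend did not reply to my message.",
    belief="They must be angry with me; I always ruin friendships.",
    consequence="Anxiety; avoided contacting anyone all weekend.",
    disputation="A missed reply has many causes; one message does not "
                "determine a friendship.",
)
```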
Beyond CBT, other therapies are being explored. Psychodynamic psychotherapy has been empirically tested and has been shown to yield meaningful benefits in depression, anxiety, and personality disorders, with results comparable to other established therapies. However, no validated evidence yet exists for its mediation through LLMs [221]. MI has been further simulated for patient support and therapist training [222,223,224,225]. The use of LLM-assisted cognitive restructuring reduced the intensity of negative emotions in 67% of users, and it assisted 65% of users in overcoming negative beliefs [226]. Prompt engineering has improved the ability of LLMs to support problem-solving treatment by producing more structured and goal-directed responses. However, limitations remain, particularly in maintaining consistency across sessions and conveying empathy in ways comparable to human clinicians [227].

3.3.7. Multimodal, VR, and Robot-Assisted Therapeutic Systems

Hybrid and immersive systems increase options. Multimodal LLMs personalize psychotherapy in real time via text, facial gestures, and speech [228]. Virtual reality avatars powered by GPT-4 were rated empathic and safe for individuals with mild-to-moderate anxiety or depression [229]. Furthermore, SARs with LLM integration deliver CBT-based home therapies to alleviate psychological distress [218].

3.3.8. Ethical and Safety Issues in Psychotherapy-Oriented LLM Use

Risks and ethical issues persist. LLMs can provide empathy, validation, and psychoeducation, but they are also prone to giving inappropriate directive advice, neglecting contextual inquiry, and failing in emergencies, which rules them out as standalone agents [230]. Mixed-methods investigations support these findings and suggest that utility and trust, rather than privacy concerns, are the more influential drivers of adoption [231]. Ultimately, LLMs hold potential for psychotherapy and emotional support, but ethical oversight and empirically grounded validation remain essential [27].

3.3.9. Overview of LLM Research in Psychotherapy

In the psychotherapy domain, 43 studies were identified (Table 11). The field presents a mixed landscape of empirical work and conceptual analyses. Empirical and observational studies make up 37.2%, while pilot or feasibility designs and early simulated user studies account for smaller shares. Conceptual, narrative, and ethics-oriented papers are also prominent (23.3%), reflecting ongoing concerns about safety, therapeutic boundaries, and AI-mediated interaction. Only a minority of works represent in-depth clinical experimentation, and—similar to the clinical domain—no large randomized controlled trials currently exist.
Importantly, 32 of the 43 psychotherapy studies included data extraction, as they tested at least one LLM on a specific psychotherapeutic task. In psychotherapy, the extracted studies evaluated LLMs as supportive tools for therapeutic communication, session summarization, reflective responding, and structured cognitive–behavioral interventions. GPT-4 and GPT-3.5 are the most frequently tested models, while a broad range of additional architectures—such as GPT-3, GPT-2, LLaMA-2, multilingual variants, and more—appear across specific experimental settings. BERT-derived models (including BERT, nBERT, and MentalBERT) are commonly used as baselines for emotion classification, intent detection, and therapeutic-style identification, while seq2seq models such as Flan-T5, T5, and BART support tasks requiring structured generation or labeling (Table 12). The datasets used include therapist–client transcripts, DAIC-WOZ interactions, CBT/DBT dialogue templates, multilingual counseling corpora, and synthetic therapeutic exchanges; however, they are typically short, scripted, or limited to idealized therapeutic behaviors and lack longitudinal variability. Prompt-based evaluation remains the predominant methodological approach, with only a few studies employing fine-tuning and none testing RAG-enhanced therapeutic systems. Although GPT-4 consistently generates more coherent, empathic, and clinically aligned responses than traditional NLP baselines and earlier LLMs, no study evaluates multi-session consistency, long-term therapeutic alliance, or safety in high-risk situations. Consequently, current LLM applications in psychotherapy remain in their early stages, promising in controlled tasks but not yet suitable for autonomous therapeutic delivery or real-world clinical use.

3.3.10. Critical Appraisal of LLM-Based Psychotherapy Evidence

Evidence for LLMs in psychotherapy largely depends on simulated dialogues, scripted CBT/DBT templates, or small therapist–client transcripts that do not reflect real therapeutic processes. Evaluations rely heavily on prompting and subjective ratings, with limited standardization or reproducibility. The findings conflict across studies; some show high empathy and structure, while others reveal inconsistent guidance, weak crisis handling, or bias. No multi-session or long-term evaluations exist, and evidence remains preliminary. Confidence is moderate for supportive tasks and low for interactive therapeutic roles.

3.4. Risks, Limitations, and Implementation Barriers in Mental Health LLM Systems

These studies reveal the challenges of implementing LLMs in critical mental health domains, identifying multiple risks to safety, reliability, and ethical use, which are analyzed in detail under the following categories:
1. Human Oversight and Over-reliance: LLMs cannot substitute professional judgment, and excessive dependence on automated systems poses major risks. Overreliance without adequate human oversight degrades the quality of care, creating configuration errors, insufficient safety checks, and weak validation processes [16,63,153,193,232,233,234]. Clinical safety is threatened by poor decision-making, inadequate crisis response, and misinterpretation of patient needs [32,77,229]. For patients, overtrusting chatbots may foster harmful attachments or encourage the replacement of human relationships with anthropomorphized systems [109,229,235,236]. For clinicians, shifting responsibilities to automated tools risks weakening professional skills and judgment [76,124]. Repetitive or inaccurate responses from static models further reduce therapeutic meaning and undermine trust [109,188,198].
2. Biases and Data Limitations: LLMs inherit biases from the datasets on which they are trained, resulting in harmful stereotypes, under-representation of marginalized groups, cultural insensitivity, and misinterpretation of distress [49,151,230,235,237,238,239]. These limitations become especially problematic in multilingual or culturally diverse contexts, where fairness is compromised [27,89,109,138,188,238]. Unmitigated, such biases perpetuate stigma, misdiagnosis, and inequitable care [16,58,109,143,240]. Data-related challenges reinforce these issues; small and imbalanced samples reduce generalizability [124,193,239,241], while methodological weaknesses such as unvalidated scales and poor reproducibility limit the strength of evidence [27,46]. Most studies remain narrowly focused on depression and anxiety in small pilots without robust patient data [129,173], and cultural mismatches risk producing inequitable outcomes [109,240].
3. Privacy and Security: Applying AI to mental health contexts introduces significant privacy risks. Concerns include data breaches, a lack of informed consent, and misuse of confidential records [27,77,109,178,198,232,235]. Systems involving chatbots, emotion recognition, or continuous monitoring can be exposed to leakage, poisoning, and inference attacks. Without strict safeguards such as encryption, HIPAA/GDPR compliance, and ethical governance, LLMs may enable surveillance, manipulation, deepfakes, and digital exclusion [109].
4. Ethical Concerns: Ethical issues are particularly pronounced in mental health, where therapeutic interactions demand trust and transparency. Chatbots risk manipulating users, simulating false empathy, and engaging children without sufficient safeguards [16,32,236,238,242]. Broader ethical gaps include the absence of standardized guidelines, limited legal accountability, and a lack of clear duty of care from developers [49,63,188,193]. Concerns also include value misalignment of generative AI [109,157], the emotional manipulation of vulnerable users [188], and the erosion of embodied, non-verbal communication [71]. To address these issues, adopting ethics-of-care regulatory approaches and developing processes for aligning AI with human values is strongly recommended [109]. The Readiness Evaluation for AI–Mental Health Deployment and Implementation (READI) framework offers a structured methodology for ensuring responsible and ethical implementation [243].
5. Technical and Clinical Limitations: From a technical perspective, black-box embeddings, classifier errors, inconsistent outputs, hallucinations, and model drift significantly constrain safe use [58,155,169,206,244]. Additional barriers include costly integration of unreliable knowledge graphs [138], token limits that truncate therapeutic exchanges, and reduced consistency when models attempt emotionally attuned responses [193]. Clinical challenges further restrict applicability. Misclassification due to comorbidity, overlapping symptoms, or subtype complexity undermines accuracy [16,244]. False positives may trigger unnecessary interventions, while false negatives risk missing high-risk patients [89]. Evidence of safety is especially weak in children and adolescents, with reports of unsafe or inconsistent outputs due to poor robustness [68,188,198]. Empathy gaps, cultural variability, and unknown long-term impacts further raise concerns [77,109,245]. Finally, structural barriers—such as focusing narrowly on training therapists without addressing organizational limitations—restrict safe and equitable adoption [246].

3.5. The Role of the Mental Health Workforce

High emotional demands, structural inefficiencies, and chronic stressors generate poor care quality and workforce instability, causing burnout in 21–67% of social workers, therapists, and psychiatric nurses [12]. Over 70% of UK mental health practitioners are emotionally exhausted, leading to turnover and staffing shortages [12,83]. Secondary traumatic stress (STS), compassion fatigue, and moral discomfort lead to clinician burnout. Poor supervision, training, and peer support make early-career professionals more susceptible to disengagement and attrition [12,247].
In this setting, LLMs may reduce workload and improve training. ChatGPT generates professional training content 37.5% faster, simplifying educational materials and supervisory aids [248]. HAILEY helps non-experts in peer support learn and practice empathetic communication skills that complement therapist responses [249]. In clinical processes, LLMs are employed to review patient data, produce clinical notes, enhance digital health services, and create scientific content. Professionals stress the need for differentiated trust thresholds: note summarization, literature synthesis, and administrative documentation are considered safer uses than LLM-driven diagnosis and treatment recommendations [250].
Despite these benefits, mental health specialists' perspectives on LLM use depend on several factors. These systems should supplement, not substitute, expert judgment. Prolonged reliance on LLMs may erode physicians' skills in synthesizing patient histories and assessing diagnostic data, and incorrect or biased outputs may disproportionately affect less-experienced professionals, raising their duty and accountability [251,252]. These factors underline the importance of thorough monitoring, accuracy verification, and ethical practice. Teaching, peer support, and administrative relief are the strengths of LLM-based programs, whereas direct patient care and diagnostic decision-making require proper integration, shared responsibility, and constant human oversight.

3.5.1. Overview of LLM Applications for the Clinical Mental Health Workforce

In total, only five studies in the dataset focus on LLM-related projects explicitly designed for mental health professionals (Table 13). As a group, these studies represent a mix of empirical, pilot, technical, and conceptual work, reflecting the early and exploratory nature of LLM integration into clinical mental health practice.
Notably, in the clinical workforce domain, only two studies in the dataset evaluate LLMs directly, reflecting an early-stage and highly exploratory area of research (Table 14). These studies assess the performance of GPT-4, GPT-3.5, and a GPT-2–based transformer variant in tasks related to clinical documentation assistance, information extraction, and workflow support. The datasets used are narrow in scope—primarily short, de-identified clinical notes and small administrative text corpora—offering limited diagnostic variety and minimal real-world complexity. Although GPT-4 and GPT-3.5 consistently outperform the GPT-2–based model, the evaluations remain single-session and do not include longitudinal or real-world deployment testing. As a result, long-term stability, safety, model drift, and generalizability to complete clinical workflows remain unknown. Overall, current evidence positions LLMs in clinical workforce applications as promising but still preliminary tools, with insufficient empirical grounding for integration into routine clinical operations.

3.5.2. Overall Critical Appraisal and Confidence in Evidence

Across domains, evidence is constrained by small datasets, heterogeneous designs, weak reporting standards, and limited real-world evaluations. Many studies use simplified or synthetic settings that inflate performance and do not assess calibration, robustness, or safety. Contradictory findings highlight instability across demographic, contextual, and emotional scenarios. Reproducibility is limited due to missing prompts, baselines, and evaluation protocols. Overall confidence is moderate for low-risk uses, such as summarization and psychoeducation, and low for high-risk diagnostic or therapeutic tasks.

3.6. Cross-Model, Cross-Dataset, and Cross-Design Comparisons

Cross-Model Comparison: GPT-4 generally outperformed GPT-3.5 and BERT-based baselines across structured benchmarking tasks, but advantages disappeared in culturally sensitive or highly emotional tasks. LLaMA models showed competitive performance when fine-tuned, whereas prompting-only pipelines favored GPT-4. Domain-specific models (e.g., MentalBERT) performed best on narrow classification tasks but poorly on generative tasks, confirming that no single model is optimal across domains.
Cross-Dataset Comparison: EHR-based datasets supported higher precision but suffered from small sample sizes. Social-media datasets provided volume but low ecological validity. Therapy transcripts were limited but produced the most realistic assessments of conversational ability. Studies using synthetic data showed inflated performance compared to real data. Overall, dataset choice strongly influenced results.
Cross-Design Comparison: Prompt-only studies reported the highest apparent performance but lacked reproducibility. Fine-tuning improved consistency but required domain-aligned datasets. RAG improved factual accuracy and reduced hallucinations but remains underused. Vignette-based designs overstated clinical readiness, while real-world designs showed greater error variability.
Contradictions Across Studies: Across domains, contradictory findings emerged. Some studies reported clinician-level performance in diagnostic vignettes, whereas others found over-conservatism, misclassification of comorbidities, or failure in culturally diverse scenarios. These discrepancies highlight the fragility of current evidence and the dependence on narrowly controlled conditions.

4. Discussion

LLMs are transforming psychiatry (RQ1), where they provide sophisticated diagnostics, expedite documentation, support guideline-based decision-making, and deliver training simulations and psychoeducational tools [48,63,76,87,88,106]. These contributions enhance care, education, and personalization, though they demand prudent oversight.
In psychology and psychotherapy (RQ2–RQ3), LLMs advance large-scale language analysis, behavioral monitoring, and stress detection [48,167], as well as psychometric prediction [180]. They are also applied in empathetic discourse, counseling simulations, CBT- and MI-based therapies, and integration with immersive technologies such as VR and socially supportive robots. While these innovations hold promise for improving patient experience and therapeutic outcomes, concerns remain regarding bias, emotional manipulation, prejudice, over-directiveness, and inadequate crisis management [27,230].
Across all domains (RQ4), ethical, cultural, and technical challenges remain central. Studies warn of privacy issues, limited explainability, cultural mismatch, insufficient oversight, discrimination, and over-reliance [16,109]. Hallucinations, poor crisis response, weak validation, and small, imbalanced datasets further restrict safe deployment, underscoring the need for ethical governance and rigorous validation frameworks.
Finally, specialist perspectives (RQ5) reflect both enthusiasm and caution. Clinicians appreciate how documentation support, training, and peer assistance reduce the workload [248,249,250], yet they advise against replacement due to risks of overreliance, skill atrophy, liability, and erosion of therapeutic judgment [251,252]. Overall, specialists favor integration when it is cautious, controlled, and personalized.
This review identified several ethical, clinical, and technological foundations for responsible LLM use in mental health, which are briefly summarized in Table 15. GenAI4MH, ROBINS-I, and Responsible AI (from an ethics-of-care perspective) promote transparency, fairness, and human values. The Thera-Turing Test, BOLT, and READI clinical frameworks examine therapeutic dialogue quality, computational behavioral evaluation, and deployment readiness, respectively. Clinical standards, explainability, and simulation-based decision support are embedded in PSAT and CureFun. Together, these concepts provide a multi-layered platform for safe, effective, and ethical LLM integration into psychiatry, psychology, and psychotherapy.

Recommendations

The studies reviewed repeatedly converge on a consistent set of recommendations for the integration of LLMs in mental health. While each study addressed different aspects, their findings largely aligned and are presented below:
1. LLMs as Assistive Tools: LLMs should serve strictly as augmentative tools, always under professional oversight [47,69,126]. The studies consistently emphasize this safeguard as a way to reduce false positives, misclassifications, hallucinations, and harmful advice while preserving therapeutic rapport and trust [63,116,153,158]. Only applications developed and clinically validated by institutions or universities—such as FDA-approved digital therapeutics (Rejoyn, reSET, NightWare, EndeavorRx) or NHS-compliant tools (Limbic, Wysa)—should be considered trustworthy, since their LLM-driven responses undergo full clinical approval; moreover, the establishment of certified clinical validators within multidisciplinary teams, combining expertise in LLMs, AI, and mental health, is essential to ensure rigorous human oversight at every stage [253,254,255,256,257].
2. Augmentative Technologies: Mental health apps could be complemented by diverse technologies, such as two-stage or hybrid workflows for decision support [86], frozen model versions for reproducibility [187], memory-optimized architectures and adaptive attention mechanisms [193], prompt engineering—ranging from domain-specific instructions, role assignments, multi-turn dialogue, and multimodal CoT prompting to ethical prompts and structured multi-role frameworks [14,258,259,260,261]—and carefully curated datasets to guard against overfitting [63,116,176]. In addition, RAG, graph-driven context control, domain knowledge integration, and ensemble or hybrid models are suggested to mitigate hallucinations and improve diagnostic robustness [114,138,139,242,262]; a minimal retrieval-with-guardrail sketch follows this list. Finally, deployment innovations such as quantization, pruning, federated learning, error analysis, and safety defaults (e.g., “don’t know” responses, prosocial outputs, and ethical guardrails) aim to balance efficiency, scalability, and clinical reliability [87,121,154,162].
3. Ethical and Regulatory Frameworks: Sensitive data in mental health—ranging from EHRs and voice logs to therapy transcripts—demand robust ethical safeguards, such as the WHO’s mhGAP Intervention Guide and the NIMH Data Sharing and Ethics Guidelines [27,86]. Ethical frameworks must ensure that clinicians and patients jointly set the boundaries of use [153], incorporate safeguards such as guardrails, red-teaming, and ethical reward functions [76], and pursue continuous bias mitigation through inclusive datasets and fairness metrics such as equalized odds and demographic parity [45]; a worked example of these metrics is given after this list. Deployment must also involve crisis escalation protocols (Columbia Suicide Severity Rating Scale triggers, among others), sandbox testing with synthetic dialogues, and strict compliance with GDPR (General Data Protection Regulation), HIPAA (Health Insurance Portability and Accountability Act), and the EU AI Act [193,230]. Complementary strategies—such as evidence-based clinical integration, development of global data repositories, sustainability frameworks, and patient-centered design practices—are essential to guarantee trust, equity, and accountability in clinical practice.
4. Clinical Trials and Certified Clinical Validation: Clinical trials for LLMs in mental health should combine controlled lab studies with longitudinal real-world testing and metrics (such as engagement, dropout, adherence, or relapse-prevention rates), following clinical protocols such as FDA SaMD (Software as a Medical Device) guidance, the clinical evaluation document of the International Medical Device Regulators Forum (IMDRF/SaMD WG/N41 FINAL:2017), ICH-GCP (International Council for Harmonisation—Good Clinical Practice), and WHO guidelines [111,136,167,201]. Priorities include larger and more diverse samples, rigorous RCTs, pilot testing, active comparators (CBT vs. LLM-based CBT), cross-cultural evaluation, and safeguards against bias (ROBINS-AI) and hallucinations [263]. Trials should measure actual clinical improvement, test use in mental health and crisis care, and align with validated workflows and ethical standards [264]. Evaluation must be carried out using established mental health standards and scales (PHQ-9, the Generalized Anxiety Disorder-7 (GAD-7), or the Positive and Negative Syndrome Scale (PANSS)), while incorporating mechanisms to detect and mitigate systematic errors, false outputs, and performance degradation.
5. Collaboration and Training: Strong interdisciplinary collaboration and professional training are essential for safely advancing LLM use in psychotherapy, psychiatry, and psychology, and computer science perspectives should be integrated [263], with active collaboration between clinicians, AI developers, ethicists, policymakers, and patients to ensure usability, trust, and equitable outcomes [84,126,153,264]. Participatory design, red-teaming, and ethics-of-care evaluation committees [76,188] help align systems with therapeutic frameworks while monitoring long-term societal effects. Equally important is professional preparation. Clinicians must receive training on AI literacy, regulatory guidance, privacy, and ethical use [153,191,265]. Collaborative, multidisciplinary, and well-funded efforts—including minority-language and culturally adapted development [266]—are key to creating equitable, empathetic, and effective AI in psychiatry.
6. Privacy and Data Protection: Ensuring privacy requires strict compliance with HIPAA, GDPR, and PIPEDA (the Personal Information Protection and Electronic Documents Act), while going beyond legal minimums through the adoption of advanced safeguards [158,190,260]. This includes AES-256 encryption for data at rest, TLS 1.3 for data in transit, and the integration of multi-factor authentication mechanisms such as FIDO2/WebAuthn (Fast Identity Online 2/Web Authentication) hardware tokens, in addition to SFA (Single-Factor Authentication), 2FA (Two-Factor Authentication), and 3FA (Three-Factor Authentication) policies. Audit logging can be reinforced with immutable, append-only systems such as Apache Kafka or AWS CloudTrail, while patient-centered consent management frameworks like Kantara’s Consent Receipt or OneTrust allow users to control their data and enforce deletion requests [267,268]. Technical solutions include federated learning, implemented via TensorFlow Federated or PySyft; differential privacy approaches such as Google’s DP-SGD (Differentially Private Stochastic Gradient Descent) or Apple’s local DP, a toy version of which is sketched after this list; and on-device fine-tuning with efficient frameworks like llama.cpp or Mistral on Edge TPU. Blockchain infrastructures such as Hyperledger Fabric or Ethereum private chains have also been proposed to secure sensitive educational and clinical datasets [269]. Finally, continuous monitoring of privacy safeguards through observability stacks like Prometheus and Grafana is essential to prevent misuse and preserve user trust [195].
7. Transparency and Explainability: Improving transparency and explainability is vital to overcome the “black box” nature of LLMs and build trust among clinicians and patients [27,153,158,261]. Standards emphasize clear documentation, transparent reporting of datasets, reproducibility, and disclaimers about limitations. Frameworks including FUTURE-AI, CONSORT-AI/SPIRIT-AI, the WHO’s AI ethics guidance, READI, and the EU AI Act codify these expectations [270,271,272,273], while grounding outputs in source text enhances factual accuracy [88,118]. Extending transparency to training data, alignment strategies, and fairness audits ensures reproducibility and user confidence [274].
8. Regulation and Governance Compliance: Regulation must ensure that LLMs in mental health operate with open APIs and interoperability standards, reducing dependence on single vendors. For example, the NHS-approved Limbic Access chatbot is certified as a medical device and interoperates with NHS Talking Therapies systems, demonstrating how vendor lock-in can be avoided [232]. Applications must comply with clinical, ethical, and operational standards, drawing on frameworks such as the WHO’s Mental Health Action Plan (2013–2030), the APA (American Psychological Association) Principles for AI in Mental Health (2023), and the EU AI Act (2024). Professional oversight should reflect best practices seen in FDA-cleared digital therapeutics like Rejoyn (for major depressive disorder) and reSET/reSET-O (for substance and opioid use disorders), which mandate rigorous clinical validation before deployment [107,116,144]. Certification and auditing must also follow established models; the ORCHA (Organisation for the Review of Care and Health Apps) app library in the UK evaluates digital health tools for privacy, bias, and clinical safety, while GDPR compliance in Europe ensures patient data protection. Together, these frameworks emphasize that there must be transparent governance, bias mitigation, and accountability at both international and institutional levels to guarantee safe and trustworthy mental health interventions [275].
9. Human–LLM Role Clarity: Studies emphasize hybrid models where LLMs act as co-therapists or screening aids under human oversight [84,118,158,191,217,276,277], with safeguards such as limiting advice-giving, ensuring referral pathways, and reinforcing empathy [79,109,130,188,232]. Role clarity emerges along an ecosystem spectrum. In single-role models, LLMs act only as symptom recorders; in multifunctional models, they record, screen, flag risks, and draft reports; and in ecosystem integration models, they serve as infrastructure nodes linking wearables, peers, and caregivers into clinician dashboards [278,279,280]. Across these cases, human experts in the field of mental health remain the interpreters, ethical anchors, and final decision-makers.
10. Cultural Adaptation and Personalization: Culturally tailored datasets, multilingual safety checks, and equitable access strategies—such as CBT-informed prompts, empathy-focused outputs, and cross-lingual corpora—enhance socio-cultural sensitivity and reduce bias. Alignment with social determinants of health (SDOH) and cultural values ensures that LLMs evolve beyond one-size-fits-all toward equitable, context-aware mental health tools [62,63,84]. Examples such as HamRaz and EmoMent highlight the importance of culturally grounded corpora in training. Since every patient represents a unique combination of mindset, cultural background, and religious identity, each intervention requires personalization [281,282]. Moreover, the inclusion of marginalized groups is essential to broaden research samples and to support the development of interventions that are both inclusive and clinically meaningful.
11. Innovative Modalities: Combining LLMs with VR as a therapeutic medium [229], in psychometrics [187], in psychoeducation, or in developing self-help strategies is strongly advocated [211]. LLM-based simulations of therapies such as Acceptance and Commitment Therapy and Dialectical Behavior Therapy [27], embedded in multimodal integrations—combining wearables, speech and speech-to-text systems, Bayesian reasoning, emotion recognition, electroencephalography (EEG), and phenotyping—enhance personalization and empathy and should be further explored [89,145,174,183,201,245]. Although LLMs have become a dominant force in mental health, their combined application with other leading technologies—such as virtual reality, wearables, or neurophysiological monitoring—remains insufficiently explored. Advancing this integrative approach could expand the spectrum of intervention strategies and enable more personalized and innovative therapeutic solutions.
12. Feedback and Iterative Refinement: Deploying real-time dashboards with continuous patient feedback not only strengthens the therapeutic alliance [145,194] but also establishes the foundation for sustainability, success, and personalization in mental health systems. Feedback serves as a critical safeguard, reducing the risk of errors and preventing potentially harmful situations, while simultaneously ensuring that interventions remain adaptive and clinically appropriate. By systematically integrating patient insights, interventions can be redesigned in a more targeted manner, improving both clinical accuracy and overall inclusivity [145,192].
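To make the retrieval and safety-default patterns of Recommendation 2 more concrete, the following minimal Python sketch grounds candidate answers in a vetted psychoeducation corpus and returns a “don’t know” response with a referral when retrieval confidence is low. The corpus, threshold, and function names are illustrative assumptions rather than components reported in the reviewed studies; a full RAG pipeline would pass the retrieved snippet to an LLM as context instead of returning it verbatim.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical vetted psychoeducation snippets, standing in for a
# clinically approved knowledge base.
CORPUS = [
    "Sleep hygiene: keep regular sleep and wake times and avoid screens before bed.",
    "Grounding exercises such as paced breathing can reduce acute anxiety symptoms.",
    "PHQ-9 scores of 10 or more warrant follow-up with a qualified clinician.",
]

SIMILARITY_THRESHOLD = 0.25  # illustrative cutoff; would need clinical calibration

vectorizer = TfidfVectorizer().fit(CORPUS)
corpus_vectors = vectorizer.transform(CORPUS)

def answer_with_guardrail(query: str) -> str:
    """Return the best-matching vetted snippet, or a safe fallback response."""
    scores = cosine_similarity(vectorizer.transform([query]), corpus_vectors)[0]
    best = scores.argmax()
    if scores[best] < SIMILARITY_THRESHOLD:
        # Safety default: admit uncertainty and route the user to a human.
        return "I don't know. Please contact a qualified mental health professional."
    # In a full RAG pipeline, this snippet would be passed to the LLM as context.
    return CORPUS[best]

print(answer_with_guardrail("How can I improve my sleep?"))
print(answer_with_guardrail("Tell me about stock prices."))
```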
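The fairness metrics named in Recommendation 3 reduce to simple group-wise rate comparisons. The sketch below computes demographic parity and equalized-odds gaps for a hypothetical screening classifier; the group labels, outcomes, and predictions are fabricated solely for illustration.

```python
import numpy as np

# Hypothetical outputs of a screening model: 1 = flagged for follow-up.
group  = np.array(["A", "A", "A", "B", "B", "B", "B", "A"])
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 0, 1])

def demographic_parity_gap(pred, grp):
    """Absolute difference in positive-prediction rates between groups."""
    rate_a = pred[grp == "A"].mean()
    rate_b = pred[grp == "B"].mean()
    return abs(rate_a - rate_b)

def equalized_odds_gap(true, pred, grp):
    """Maximum difference in TPR and FPR between groups."""
    gaps = []
    for label in (1, 0):  # TPR when label == 1, FPR when label == 0
        rate_a = pred[(grp == "A") & (true == label)].mean()
        rate_b = pred[(grp == "B") & (true == label)].mean()
        gaps.append(abs(rate_a - rate_b))
    return max(gaps)

print(f"Demographic parity gap: {demographic_parity_gap(y_pred, group):.2f}")
print(f"Equalized odds gap:     {equalized_odds_gap(y_true, y_pred, group):.2f}")
```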
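Finally, the differential privacy approach cited in Recommendation 6 can be illustrated with a toy DP-SGD step: per-example gradients are clipped to a fixed L2 bound and Gaussian noise calibrated to that bound is added to their average. This is a didactic sketch with arbitrary constants, not a calibrated privacy implementation; real deployments derive the noise multiplier from a formal epsilon budget.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy per-example gradients from a hypothetical on-device fine-tuning step:
# 8 examples, 4 model parameters.
per_example_grads = rng.normal(size=(8, 4))

CLIP_NORM = 1.0         # per-example L2 clipping bound
NOISE_MULTIPLIER = 1.2  # illustrative; real systems derive this from a privacy budget

def dp_sgd_step(grads: np.ndarray, clip: float, sigma: float) -> np.ndarray:
    """Clip each example's gradient, average, and add calibrated Gaussian noise."""
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    clipped = grads * np.minimum(1.0, clip / norms)
    mean_grad = clipped.mean(axis=0)
    noise = rng.normal(scale=sigma * clip / len(grads), size=mean_grad.shape)
    return mean_grad + noise

print(dp_sgd_step(per_example_grads, CLIP_NORM, NOISE_MULTIPLIER))
```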

5. Conclusions

LLMs represent a state-of-the-art development in psychiatry, psychology, and psychotherapy. However, their practical implementation requires robust ethical, clinical, and technological frameworks that guarantee transparency, fairness, and oversight. Within a mental health ecosystem, LLMs can serve as valuable assistive tools, but their potential will only be fully realized through cautious integration, participatory design, and continuous feedback.
Limitations of the present systematic review are primarily related to the breadth, heterogeneity, and reporting practices of the primary studies included. As a review rather than an experimental evaluation, this work does not directly assess the performance of large language models. Instead, it reproduces and synthesizes findings reported in the original studies. The 205 studies included exhibit substantial variation in research design, data sources, model types, evaluation procedures, and clinical or psychological contexts, which may complicate the direct comparability of the studies themselves. In addition, essential methodological details—such as prompting strategies, use of fine-tuning or RAG, and sample characteristics—are not consistently reported across all studies, creating gaps in the completeness of data extraction. The review was limited to English-language publications, which may exclude relevant work in other languages. Finally, given the rapid development of LLMs, newly published studies may not be captured after the search cut-off date, and no formal quality appraisal or risk-of-bias assessment was applied, as existing tools are not yet tailored for AI-driven research in mental health contexts.

Future Directions

Future work on LLMs in mental health should focus on the following directions:
1. Domain- and disorder-specific models: Domain-focused models outperform generic ones in psychiatric contexts [34]. Deploying open-source, cost-effective, and validated LLMs that function as multifunctional recorders for relapse prediction, continuous monitoring, and multimodal integration is necessary. These models must be syndrome-based and culturally adaptable, validated against clinical scales such as PANSS for schizophrenia, PHQ-9 for depression, and the Hamilton Depression Rating Scale (HAM-D), ensuring outputs that are more clinically relevant and accurate than generic models [34,283,284,285].
2. Cross-disciplinary foundations: Collaboration should extend to experts from diverse scientific fields, such as philosophy or logotherapy [286], and even from less expected domains, to envision the implications of their disciplines for LLM-oriented research projects. Future research should also embed LLMs into psycholinguistics, where computational linguistic markers of narratives can help detect early signs of depression, PTSD, or psychosis. In psychoeducation, adaptive CBT- and ACT-based modules delivered via LLMs are gaining traction as scalable digital interventions [27,287]. Moreover, creative therapies may be enhanced by LLM-assisted art and music applications, provided these are developed with strict privacy-by-design safeguards aligned with GDPR/HIPAA and governed by ethical frameworks such as FUTURE-AI [288,289].
3. Developing universal frameworks and standards: Future research should move toward the development of an international framework tailored explicitly to LLMs in mental health. Building on the foundations of CONSORT-AI, SPIRIT-AI, DECIDE-AI, and FUTURE-AI [274,287,288,289,290,291], such a framework should address the unique challenges of generative models, including prompt design, hallucination monitoring, escalation pathways to clinicians, de-identification of sensitive dialogues, multimodal and longitudinal data integration, and equity across diverse populations. Establishing globally aligned standards will ensure that LLM-based mental health interventions are safe, transparent, and clinically valid across different contexts.
4. Global repository and benchmarking: Instead of isolated prototypes, international collaborations should establish a shared repository of validated models, datasets, and benchmarks (e.g., suicide risk triage, CBT session simulation), aligned with the WHO/ITU Global Initiative on AI for Health (GI-AI4H). This would allow researchers worldwide—including in low-resource settings—to evaluate models under standardized conditions, improving fairness and reproducibility [292].
5. Transitioning from simulated to real-world data: Future research should embed de-identification protocols within LLM pipelines to enable the privacy-preserving use of real patient data, including de-identified EHR transcripts and clinical interactions [293]; a minimal rule-based sketch follows this list. Beyond synthetic and depression-focused corpora such as DAIC-WOZ, diverse datasets covering multiple conditions, multimodal signals, and longitudinal trajectories are needed. Incorporating these resources will allow for the development of clinically robust, equitable, and generalizable mental health LLMs [294,295].
6. Long-term impact and equity: Long-term impact (24–36 months) and equity should be assessed through longitudinal studies aligned with international standards such as SPIRIT-AI, CONSORT-AI, FUTURE-AI, WHO mhGAP, PHQ-9, GAD-7, PANSS, and WHODAS 2.0 (World Health Organization Disability Assessment Schedule 2.0), for instance by developing an LLM-oriented clinical ecosystem in real mental healthcare settings to track therapeutic alliance, adherence, and relapse prevention, designed particularly for low-resource contexts to ensure equitable access and reduce disparities [296]. Equity must also be addressed by running long-term trials in low- and middle-income countries (LMICs), where datasets and infrastructures differ significantly [297].
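As a minimal illustration of the de-identification step proposed in direction 5, the sketch below shows a rule-based first pass that replaces common identifiers with typed placeholders before text reaches an LLM. The regular expressions are deliberately simplistic assumptions; validated clinical NER models and human review would be required in practice.

```python
import re

# Illustrative regex patterns for a first-pass scrub of clinical transcripts.
# Real pipelines would pair these with validated clinical NER and human review.
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "DATE":  re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
}

def deidentify(text: str) -> str:
    """Replace matched identifiers with typed placeholders before LLM processing."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

note = "Patient seen 03/12/2024; reach caregiver at 555-867-5309 or kin@example.com."
print(deidentify(note))
```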

Author Contributions

Conceptualization, L.M.; Methodology, E.V.; Validation, L.M.; Formal Analysis, E.V. and L.M.; Investigation, E.V. and L.M.; Writing—Original Draft Preparation, E.V.; Writing—Review and Editing, E.V. and L.M.; Supervision, L.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

All data supporting the findings of this study are available within the published articles included in the reference list.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Olawade, D.B.; Wada, O.Z.; Odetayo, A.; David-Olawade, A.C.; Asaolu, F.; Eberhardt, J. Enhancing mental health with Artificial Intelligence: Current trends and future prospects. J. Med. Surg. Public Health 2024, 3, 100099. [Google Scholar] [CrossRef]
  2. Gautam, S.; Jain, A.; Chaudhary, J.; Gautam, M.; Gaur, M.; Grover, S. Concept of mental health and mental well-being, it’s determinants and coping strategies. Indian J. Psychiatry 2024, 66, S231–S244. [Google Scholar] [CrossRef] [PubMed]
  3. Ford, K.; Freund, R.; Young Lives, University of Oxford. Young Lives Under Pressure: Protecting and Promoting Young People’s Mental Health at a Time of Global Crises. 2022. Available online: https://www.younglives.org.uk/sites/default/files/2022-11/YL-PolicyBrief-55-Sep22 Final.pdf (accessed on 4 April 2025).
  4. Funk, M.; World Health Organization. Guidance on Mental Health Policy and Strategic Action Plans. Module 1. Introduction, Purpose and Use of the Guidance; World Health Organization: Geneva, Switzerland, 2025; Available online: https://iris.who.int/bitstream/handle/10665/380465/9789240106796-eng.pdf?sequence=1 (accessed on 14 May 2025).
  5. Kestel, D.; Neira, M.; Azzi, M.; World Health Organization. Mental Health Atlas 2020. 2020. Available online: https://iris.who.int/bitstream/handle/10665/345946/9789240036703-eng.pdf?sequence=1&isAllowed=y (accessed on 18 April 2025).
  6. WHO (b). World Health Organization (WHO). New WHO Guidance Calls for Urgent Transformation of Mental Health Policies. 2025. Available online: https://www.who.int/news/item/25-03-2025-new-who-guidance-calls-for-urgent-transformation-of-mental-health-policies (accessed on 13 May 2025).
  7. McGrath, J.J.; Al-Hamzawi, A.; Alonso, J.; Altwaijri, Y.; Andrade, L.H.; Bromet, E.J.; Bruffaerts, R.; de Almeida, J.M.C.; Chardoul, S.; Chiu, W.T.; et al. Age of onset and cumulative risk of mental disorders: A cross-national analysis of population surveys from 29 countries. Lancet Psychiatry 2023, 10, 668–681. [Google Scholar] [CrossRef] [PubMed]
8. WHO (a). World Health Organization (WHO). Mental Disorders. 2022. Available online: https://www.who.int/news-room/fact-sheets/detail/mental-disorders/ (accessed on 22 April 2025).
  9. Centers for Disease Control and Prevention (CDC). Mental Health. 2024. Available online: https://www.cdc.gov/cdi/indicator-definitions/mental-health.html (accessed on 1 May 2025).
  10. Duckworth, K.; National Alliance on Mental Illness (NAMI). Mental Health By the Numbers. 2023. Available online: https://www.nami.org/about-mental-illness/mental-health-by-the-numbers/ (accessed on 10 May 2025).
  11. Sim, K.Y.H.; Choo, K.T.W. Envisioning an AI-Enhanced Mental Health Ecosystem. arXiv 2025, arXiv:2503.14883. [Google Scholar] [CrossRef]
12. Ballout, S. Trauma, Mental Health Workforce Shortages, and Health Equity: A Crisis in Public Health. Int. J. Environ. Res. Public Health 2025, 22, 620. [Google Scholar] [CrossRef]
13. Wainberg, M.L.; Scorza, P.; Shultz, J.M.; Helpman, L.; Mootz, J.J.; Johnson, K.A.; Neria, Y.; Bradford, J.M.; Oquendo, M.; Arbuckle, M. Challenges and Opportunities in Global Mental Health: A Research-to-Practice Perspective. Curr. Psychiatry Rep. 2017, 19, 28. [Google Scholar] [CrossRef] [PubMed]
  14. Satiani, A.; Niedermier, J.; Satiani, B.; Svendsen, D.P. Projected Workforce of Psychiatrists in the United States: A Population Analysis. Psychiatr. Serv. 2018, 69, 710–713. [Google Scholar] [CrossRef]
  15. Kazdin, A.E.; Blase, S.L. Rebooting Psychotherapy Research and Practice to Reduce the Burden of Mental Illness. Perspect. Psychol. Sci. 2011, 6, 21–37. [Google Scholar] [CrossRef]
  16. Lawrence, H.R.; Schneider, R.A.; Rubin, S.B.; Matarić, M.J.; McDuff, D.J.; Jones Bell, M. The Opportunities and Risks of Large Language Models in Mental Health. JMIR Ment. Health 2024, 11, e59479. [Google Scholar] [CrossRef]
  17. Hsu, S.L.; Shah, R.S.; Senthil, P.; Ashktorab, Z.; Dugan, C.; Geyer, W.; Yang, D. Helping the Helper: Supporting Peer Counselors via AI-Empowered Practice and Feedback. arXiv 2025, arXiv:2305.08982. [Google Scholar] [CrossRef]
  18. Thieme, A.; Hanratty, M.; Lyons, M.; Palacios, J.; Marques, R.F.; Morrison, C.; Doherty, G. Designing Human-centered AI for Mental Health: Developing Clinically Relevant Applications for Online CBT Treatment. ACM Trans. Comput.-Hum. Interact. 2023, 30, 1–50. [Google Scholar] [CrossRef]
  19. Tutun, S.; Johnson, M.E.; Ahmed, A.; Albizri, A.; Irgil, S.; Yesilkaya, I.; Ucar, E.N.; Sengun, T.; Harfouche, A. An AI-based Decision Support System for Predicting Mental Health Disorders. Inf. Syst. Front. 2023, 25, 1261–1276. [Google Scholar] [CrossRef]
  20. Afonso-Jaco, A.; Katz, B.F.G. Spatial Knowledge via Auditory Information for Blind Individuals: Spatial Cognition Studies and the Use of Audio-VR. Sensors 2022, 22, 4794. [Google Scholar] [CrossRef]
  21. Fulmer, R.; Joerin, A.; Gentile, B.; Lakerink, L.; Rauws, M. Using Psychological Artificial Intelligence (Tess) to Relieve Symptoms of Depression and Anxiety: Randomized Controlled Trial. JMIR Ment. Health 2018, 5, e64. [Google Scholar] [CrossRef]
  22. Fitzpatrick, K.K.; Darcy, A.; Vierhile, M. Delivering Cognitive Behavior Therapy to Young Adults With Symptoms of Depression and Anxiety Using a Fully Automated Conversational Agent (Woebot): A Randomized Controlled Trial. JMIR Ment. Health 2017, 4, e19. [Google Scholar] [CrossRef] [PubMed]
  23. Lee, S.S.; Li, N.; Kim, J. Conceptual model for Mexican teachers’ adoption of learning analytics systems: The integration of the information system success model and the technology acceptance model. Educ. Inf. Technol. 2024, 29, 13387–13412. [Google Scholar] [CrossRef]
24. Zhou, W. Chat GPT Integrated with Voice Assistant as Learning Oral Chat-based Constructive Communication to Improve Communicative Competence for EFL Learners. arXiv 2023, arXiv:2311.00718. [Google Scholar] [CrossRef]
  25. Bassett, C. The computational therapeutic: Exploring Weizenbaum’s ELIZA as a history of the present. AI Soc. 2019, 34, 803–812. [Google Scholar] [CrossRef]
  26. Anyoha, R.; Harvard University’s Science in the News (SITN). The History of Artificial Intelligence. 2017. Available online: https://sites.harvard.edu/sitn/2017/08/28/history-artificial-intelligence (accessed on 21 May 2025).
  27. Hua, Y.; Na, H.; Li, Z.; Liu, F.; Fang, X.; Clifton, D.; Torous, J. A scoping review of large language models for generative tasks in mental health care. Npj Digit. Med. 2025, 8, 230. [Google Scholar] [CrossRef]
  28. Katz, U.; Cohen, E.; Shachar, E.; Somer, J.; Fink, A.; Morse, E.; Shreiber, B.; Wolf, I. GPT versus Resident Physicians—A Benchmark Based on Official Board Scores. NEJM AI 2024, 1, 5. [Google Scholar] [CrossRef]
  29. American Psychiatric Association. Diagnostic and Statistical Manual of Mental Disorders, 5th ed.; American Psychiatric Publishing: Washington, DC, USA, 2013. [Google Scholar] [CrossRef]
  30. Henriques, G. A New Unified Theory of Psychology; Springer: New York, NY, USA, 2011. [Google Scholar] [CrossRef]
  31. Locher, C.; Meier, S.; Gaab, J. Psychotherapy: A World of Meanings. Front. Psychol. 2019, 10, 460. [Google Scholar] [CrossRef]
  32. Guo, Z.; Lai, A.; Thygesen, J.H.; Farrington, J.; Keen, T.; Li, K. Large Language Models for Mental Health Applications: Systematic Review. JMIR Ment. Health 2024, 11, e57400. [Google Scholar] [CrossRef]
  33. Hua, Y.; Liu, F.; Yang, K.; Li, Z.; Na, H.; Sheu, Y.; Zhou, P.; Moran, L.V.; Ananiadou, S.; Clifton, D.A.; et al. Large Language Models in Mental Health Care: A Scoping Review. arXiv 2025, arXiv:2401.02984. [Google Scholar] [CrossRef]
  34. Omar, M.; Soffer, S.; Charney, A.W.; Landi, I.; Nadkarni, G.N.; Klang, E. Applications of large language models in psychiatry: A systematic review. Front. Psychiatry 2024, 15, 1422807. [Google Scholar] [CrossRef]
  35. Brickman, D.; Gupta, M.; Oltmanns, J.R. Large Language Models for Psychological Assessment: A Comprehensive Overview. Adv. Methods Pract. Psychol. Sci. 2025, 8, 1–26. [Google Scholar] [CrossRef]
  36. Gautam, A.; Kellmeyer, P. Exploring the Credibility of Large Language Models for Mental Health Support: Protocol for a Scoping Review. JMIR Res. Protoc. 2025, 14, e62865. [Google Scholar] [CrossRef]
  37. Page, M.J.; McKenzie, J.E.; Bossuyt, P.M.; Boutron, I.; Hoffmann, T.C.; Mulrow, C.D.; Shamseer, L.; Tetzlaff, J.M.; Akl, E.A.; Brennan, S.E.; et al. The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. BMJ 2021, 372, n71. [Google Scholar] [CrossRef]
  38. Hong, Q.N.; Fàbregues, S.; Bartlett, G.; Boardman, F.; Cargo, M.; Dagenais, P.; Gagnon, M.P.; Griffiths, F.; Nicolau, B.; O’Cathain, A.; et al. The Mixed Methods Appraisal Tool (MMAT) version 2018 for information professionals and researchers. Educ. Inf. 2018, 34, 87–98. [Google Scholar] [CrossRef]
  39. Amugongo, L.M.; Mascheroni, P.; Brooks, S.; Doering, S.; Seidel, J. Retrieval augmented generation for large language models in healthcare: A systematic review. PLoS Digit. Health 2025, 4, e0000877. [Google Scholar] [CrossRef]
  40. He, K.; Mao, R.; Lin, Q.; Ruan, Y.; Lan, X.; Feng, M.; Cambria, E. A survey of large language models for healthcare: From data, technology, and applications to accountability and ethics. Inf. Fusion 2025, 118, 102963. [Google Scholar] [CrossRef]
  41. Zhang, K.; Meng, X.; Yan, X.; Ji, J.; Liu, J.; Xu, H.; Zhang, H.; Liu, D.; Wang, J.; Wang, X.; et al. Revolutionizing Health Care: The Transformative Impact of Large Language Models in Medicine. J. Med. Internet Res. 2025, 27, e59069. [Google Scholar] [CrossRef]
  42. Dergaa, I.; Fekih-Romdhane, F.; Hallit, S.; Loch, A.A.; Glenn, J.M.; Fessi, M.S.; Aissa, M.B.; Souissi, N.; Guelmami, N.; Swed, S.; et al. ChatGPT is not ready yet for use in providing mental health assessment and interventions. Front. Psychiatry 2024, 14, 1277756. [Google Scholar] [CrossRef]
  43. Han, Q.; Zhao, C. Unleashing the potential of chatbots in mental health: Bibliometric analysis. Front. Psychiatry 2025, 16, 1494355. [Google Scholar] [CrossRef]
  44. Vinod, K.D. The emergence of AI in mental health: A transformative journey. World J. Adv. Res. Rev. 2024, 22, 1867–1871. [Google Scholar] [CrossRef]
  45. Holmes, G.; Tang, B.; Gupta, S.; Venkatesh, S.; Christensen, H.; Whitton, A. Applications of Large Language Models in the Field of Suicide Prevention: Scoping Review. J. Med. Internet Res. 2025, 27, e63126. [Google Scholar] [CrossRef]
  46. Chang, Y.; Su, C.Y.; Liu, Y.C. Assessing the Performance of Chatbots on the Taiwan Psychiatry Licensing Examination Using the Rasch Model. Healthcare 2024, 12, 2305. [Google Scholar] [CrossRef]
  47. Wang, B.; Sun, Y.; Zi, Y.; Zhao, Y.; Qin, B. Scale-CoT: Integrating LLM with Psychiatric Scales for Analyzing Mental Health Issues on Social Media. In Proceedings of the 2024 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Lisbon, Portugal, 3–6 December 2024; IEEE: London, UK, 2024; pp. 2651–2658. [Google Scholar] [CrossRef]
  48. Bauer, B.; Norel, R.; Leow, A.; Rached, Z.A.; Wen, B.; Cecchi, G. Using Large Language Models to Understand Suicidality in a Social Media–Based Taxonomy of Mental Health Disorders: Linguistic Analysis of Reddit Posts. JMIR Ment. Health 2024, 11, e57234. [Google Scholar] [CrossRef]
  49. Li, R. Integrative diagnosis of psychiatric conditions using ChatGPT and fMRI data. BMC Psychiatry 2025, 25, 145. [Google Scholar] [CrossRef]
  50. Lan, K.; Jin, B.; Zhu, Z.; Chen, S.; Zhang, S.; Zhu, K.Q.; Wu, M. Depression Diagnosis Dialogue Simulation: Self-improving Psychiatrist with Tertiary Memory. arXiv 2024, arXiv:2409.15084. [Google Scholar]
  51. Ohse, J.; Hadžić, B.; Mohammed, P.; Peperkorn, N.; Danner, M.; Yorita, A.; Kubota, N.; Rätsch, M.; Shiban, Y. Zero-Shot Strike: Testing the generalisation capabilities of out-of-the-box LLM models for depression detection. Comput. Speech Lang. 2024, 88, 101663. [Google Scholar] [CrossRef]
52. Wiest, I.C.; Verhees, F.G.; Ferber, D.; Zhu, J.; Bauer, M.; Lewitzka, U.; Pfennig, A.; Mikolas, P.; Kather, J.N. Detection of Suicidality Through Privacy-Preserving Large Language Models. Br. J. Psychiatry 2024, 225, 532–537. [Google Scholar] [CrossRef]
  53. Englhardt, Z.; Ma, C.; Morris, M.E.; Chang, C.C.; Xu, X.O.; Qin, L.; McDuff, D.; Liu, X.; Patel, S.; Iyer, V. From Classification to Clinical Insights. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 2024, 8, 1–25. [Google Scholar] [CrossRef]
  54. Xu, X.; Yao, B.; Dong, Y.; Gabriel, S.; Yu, H.; Hendler, J.; Ghassemi, M.; Dey, A.K.; Wang, D. Mental-LLM: Leveraging Large Language Models for Mental Health Prediction via Online Text Data. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 2024, 8, 1–32. [Google Scholar] [CrossRef]
  55. Scherbakov, D.A.; Hubig, N.C.; Lenert, L.A.; Alekseyenko, A.V.; Obeid, J.S. Natural Language Processing and Social Determinants of Health in Mental Health Research: AI-Assisted Scoping Review. JMIR Ment. Health 2025, 12, e67192. [Google Scholar] [CrossRef]
  56. Edgcomb, J.B.; Saha, A.; Lee, J.J.; Ponce, C.G.; Tascione, E.M.; Montero, A.E.; Ryan, N.D. 1.53 Large Language Models to Extract Information on Suicide From Children’s Medical Records. J. Am. Acad. Child. Adolesc. Psychiatry 2024, 63, S175. [Google Scholar] [CrossRef]
  57. Jin, Y.; Liu, J.; Li, P.; Wang, B.; Yan, Y.; Zhang, H.; Ni, C.; Wang, J.; Li, Y.; Bu, Y.; et al. The Applications of Large Language Models in Mental Health: Scoping Review. J. Med. Internet Res. 2025, 27, e69284. [Google Scholar] [CrossRef]
  58. Pavlopoulos, A.; Rachiotis, T.; Maglogiannis, I. An Overview of Tools and Technologies for Anxiety and Depression Management Using AI. Appl. Sci. 2024, 14, 9068. [Google Scholar] [CrossRef]
  59. Wang, X.; Zhou, Y.; Zhou, G. The Application and Ethical Implication of Generative AI in Mental Health: Systematic Review. JMIR Ment. Health 2025, 12, e70610. [Google Scholar] [CrossRef]
  60. Priyadarshana, Y.H.P.P.; Senanayake, A.; Liang, Z.; Piumarta, I. Prompt engineering for digital mental health: A short review. Front Digit Health 2024, 6, 1410947. [Google Scholar] [CrossRef]
  61. Kambeitz, J.; Chakraborty, M.; Wenzel, J.; Schwed, L.; Menne, F.; König, A.; Ruef, A.; Bonivento, C.; Dwyer, D.; Brambilla, P. Automated Analysis of Linguistic Measures Using Verbal Fluency Test in Psychotic and Affective Disorders: Findings From the PRONIA Study. Biol. Psychiatry 2025, 97, S50. [Google Scholar] [CrossRef]
  62. Esmaeilzadeh, P. Decoding the cry for help: AI’s emerging role in suicide risk assessment. AI Ethics 2025, 5, 4645–4679. [Google Scholar] [CrossRef]
  63. Chen, C.C.; Chen, J.A.; Liang, C.S.; Lin, Y.H. Large language models may struggle to detect culturally embedded filicide-suicide risks. Asian J. Psychiatry 2025, 105, 104395. [Google Scholar] [CrossRef]
  64. Rosenman, G.; Wolf, L.; Hendler, T. LLM Questionnaire Completion for Automatic Psychiatric Assessment. arXiv 2024, arXiv:2406.06636. [Google Scholar] [CrossRef]
  65. Ludlow, C. Investigating the Capability of Large Language Models to Identify Causal Relations in Psychiatric Case Studies: A Methodological Proof of Concept for the Analysis of Psychological Case Formulations. 2025. Available online: https://osf.io/preprints/psyarxiv/wfmv8_v1 (accessed on 21 April 2025).
  66. Gargari, O.K.; Fatehi, F.; Mohammadi, I.; Firouzabadi, S.R.; Shafiee, A.; Habibi, G. Diagnostic accuracy of large language models in psychiatry. Asian J. Psychiatry 2024, 100, 104168. [Google Scholar] [CrossRef]
  67. Arbanas, G. ChatGPT and other Chatbots in Psychiatry. Arch. Psychiatry Res. 2024, 60, 137–142. [Google Scholar] [CrossRef]
  68. Alhuwaydi, A. Exploring the Role of Artificial Intelligence in Mental Healthcare: Current Trends and Future Directions—A Narrative Review for a Comprehensive Insight. Risk Manag. Health Policy 2024, 17, 1339–1348. [Google Scholar] [CrossRef]
  69. Ghorbanian Zolbin, M.; Kujala, S.; Huvila, I. Experiences and Expectations of Immigrant and Nonimmigrant Older Adults Regarding eHealth Services: Qualitative Interview Study. J. Med. Internet Res. 2025, 27, e64249. [Google Scholar] [CrossRef]
  70. Rodríguez Gatta, D.; Rotarou, E.S.; Banks, L.M.; Kuper, H. Healthcare access among people with and without disabilities: A cross-sectional analysis of the National Socioeconomic Survey of Chile. Public Health 2025, 241, 144–150. [Google Scholar] [CrossRef]
71. Rodado, J.; Crespo, F. Relational dimension versus Artificial Intelligence. Am. J. Psychoanal. 2024, 84, 268–284. [Google Scholar] [CrossRef]
  72. Bidargaddi, N.; Almirall, D.; Murphy, S.; Nahum-Shani, I.; Kovalcik, M.; Pituch, T.; Maaieh, H.; Strecher, V. To Prompt or Not to Prompt? A Microrandomized Trial of Time-Varying Push Notifications to Increase Proximal Engagement With a Mobile Health App. JMIR MHealth UHealth 2018, 6, e10123. [Google Scholar] [CrossRef]
  73. Banerjee, S.; Dunn, P.; Conard, S.; Ali, A. Mental Health Applications of Generative AI and Large Language Modeling in the United States. Int. J. Environ. Res. Public Health 2024, 21, 910. [Google Scholar] [CrossRef]
  74. Bzdok, D.; Thieme, A.; Levkovskyy, O.; Wren, P.; Ray, T.; Reddy, S. Data science opportunities of large language models for neuroscience and biomedicine. Neuron 2024, 112, 698–717. [Google Scholar] [CrossRef] [PubMed]
  75. Dart, M.; Ahmed, M. Evaluating Staff Attitudes, Intentions, and Behaviors Related to Cyber Security in Large Australian Health Care Environments: Mixed Methods Study. JMIR Hum. Factors 2023, 10, e48220. [Google Scholar] [CrossRef]
  76. Obradovich, N.; Khalsa, S.S.; Khan, W.U.; Suh, J.; Perlis, R.H.; Ajilore, O.; Paulus, M.P. Opportunities and risks of large language models in psychiatry. NPP—Digit. Psychiatry Neurosci. 2024, 2, 8. [Google Scholar] [CrossRef]
  77. Ni, Y.; Jia, F. A Scoping Review of AI-Driven Digital Interventions in Mental Health Care: Mapping Applications Across Screening, Support, Monitoring, Prevention, and Clinical Education. Healthcare 2025, 13, 1205. [Google Scholar] [CrossRef]
  78. Liu, Z.; Bao, Y.; Zeng, S.; Qian, R.; Deng, M.; Gu, A.; Li, J.; Wang, W.; Cai, W.; Li, W.; et al. Large Language Models in Psychiatry: Current Applications, Limitations, and Future Scope. Big Data Min. Anal. 2024, 7, 1148–1168. [Google Scholar] [CrossRef]
  79. Farhat, F. ChatGPT as a Complementary Mental Health Resource: A Boon or a Bane. Ann. Biomed. Eng. 2024, 52, 1111–1114. [Google Scholar] [CrossRef] [PubMed]
  80. Medina, J.C.; Andrade, R.R. Advancements in Artificial Intelligence for Health: A Rapid Review of AI-Based Mental Health Technologies Used in the Age of Large Language Models. In Bioinformatics and Biomedical Engineering; Springer: Berlin/Heidelberg, Germany, 2024; pp. 318–343. [Google Scholar] [CrossRef]
81. Malgaroli, M.; Schultebraucks, K.; Myrick, K.J.; Andrade Loch, A.; Ospina-Pinillos, L.; Choudhury, T.; Kotov, R.; De Choudhury, M.; Torous, J. Large language models for the mental health community: Framework for translating code to care. Lancet Digit. Health 2025, 7, e282–e285. [Google Scholar] [CrossRef]
  82. Sree, Y.B.; Sathvik, A.; Hema Akshit, D.S.; Kumar, O.; Pranav Rao, B.S. Retrieval-Augmented Generation Based Large Language Model Chatbot for Improving Diagnosis for Physical and Mental Health. In Proceedings of the 2024 6th International Conference on Electrical, Control and Instrumentation Engineering (ICECIE), Pattaya, Thailand, 23–23 November 2024; IEEE: London, UK, 2024; pp. 1–8. [Google Scholar] [CrossRef]
  83. Singh, A.; Ehtesham, A.; Mahmud, S.; Kim, J.H. Revolutionizing Mental Health Care through LangChain: A Journey with a Large Language Model. In Proceedings of the 2024 IEEE 14th Annual Computing and Communication Workshop and Conference (CCWC), Las Vegas, NV, USA, 8–10 January 2024; IEEE: London, UK, 2024; pp. 73–78. [Google Scholar] [CrossRef]
  84. Ferizaj, D.; Lalk, C.; Lahmann, N.; Strube-Lahmann, S.; Rubel, J. Identifying Yalom’s group therapeutic factors in anonymous mental health discussions on Reddit: A mixed-methods analysis using large language models, topic modeling and human supervision. Front. Psychiatry 2025, 16, 1503427. [Google Scholar] [CrossRef]
  85. Levkovich, I.; Shinan-Altman, S.; Elyoseph, Z. Can large language models be sensitive to culture suicide risk assessment? J. Cult. Cogn. Sci. 2024, 8, 275–287. [Google Scholar] [CrossRef]
  86. Matsui, K.; Utsumi, T.; Aoki, Y.; Maruki, T.; Takeshima, M.; Takaesu, Y. Human-Comparable Sensitivity of Large Language Models in Identifying Eligible Studies Through Title and Abstract Screening: 3-Layer Strategy Using GPT-3.5 and GPT-4 for Systematic Reviews. J. Med. Internet Res. 2024, 26, e52758. [Google Scholar] [CrossRef] [PubMed]
  87. Abdullah, M.; Negied, N. Detection and Prediction of Future Mental Disorder From Social Media Data Using Machine Learning, Ensemble Learning, and Large Language Models. IEEE Access 2024, 12, 120553–120569. [Google Scholar] [CrossRef]
  88. Teferra, B.G.; Perivolaris, A.; Hsiang, W.N.; Sidharta, C.K.; Rueda, A.; Parkington, K.; Wu, Y.; Soni, A.; Samavi, R.; Jetly, R.; et al. Leveraging large language models for automated depression screening. PLoS Digit. Health 2025, 4, e0000943. [Google Scholar] [CrossRef]
  89. Bokolo, B.G.; Liu, Q. Deep Learning-Based Depression Detection from Social Media: Comparative Evaluation of ML and Transformer Techniques. Electronics 2023, 12, 4396. [Google Scholar] [CrossRef]
90. Chowdhury, A.K.; Sujon, S.R.; Shafi, Md.S.S.; Ahmmad, T.; Ahmed, S.; Hasib, K.M.; Shah, F.M. Harnessing large language models over transformer models for detecting Bengali depressive social media text: A comprehensive study. Nat. Lang. Process J. 2024, 7, 100075. [Google Scholar] [CrossRef]
  91. Qorich, M.; El Ouazzani, R. Advanced deep learning and large language models for suicide ideation detection on social media. Prog. Artif. Intell. 2024, 13, 135–147. [Google Scholar] [CrossRef]
  92. Nanda, M.; Inkpen, D.; Dargel, A. Detecting Multiple Mental Health Disorders with Large Language Models. In Proceedings of the 2024 28th International Conference Information Visualisation (IV), Coimbra, Portugal, 22–26 July 2024; IEEE: London, UK, 2024; pp. 252–257. [Google Scholar] [CrossRef]
  93. Liu, Y.; Ding, X.; Peng, S.; Zhang, C. Leveraging ChatGPT to optimize depression intervention through explainable deep learning. Front. Psychiatry 2024, 15, 1383648. [Google Scholar] [CrossRef]
  94. Hanss, K.; Sarma, K.V.; Glowinski, A.L.; Krystal, A.; Saunders, R.; Halls, A.; Gorrell, S.; Reilly, E. Assessing the Accuracy Reliability of Large Language Models in Psychiatry Using Standardized Multiple-Choice Questions: Cross-Sectional Study. J. Med. Internet Res. 2025, 27, e69910. [Google Scholar] [CrossRef]
  95. Kharitonova, K.; Pérez-Fernández, D.; Gutiérrez-Hernando, J.; Gutiérrez-Fandiño, A.; Callejas, Z.; Griol, D. Incorporating evidence into mental health Q&A: A novel method to use generative language models for validated clinical content extraction. Behav. Inf. Technol. 2025, 44, 2333–2350. [Google Scholar] [CrossRef]
  96. Wang, X.; Liu, K.; Wang, C. Knowledge-enhanced Pre-training large language model for depression diagnosis and treatment. In Proceedings of the 2023 IEEE 9th International Conference on Cloud Computing and Intelligent Systems (CCIS), Dali, China, 12–13 August 2023; IEEE: London, UK, 2023; pp. 532–536. [Google Scholar] [CrossRef]
  97. Xin, A.W.; Nielson, D.M.; Krause, K.R.; Fiorini, G.; Midgley, N.; Pereira, F.; Lossio-Ventura, J.A. Using large language models to detect outcomes in qualitative studies of adolescent depression. J. Am. Med. Inform. Assoc. 2024, 33, 79–89. [Google Scholar] [CrossRef]
  98. Adhikary, P.K.; Srivastava, A.; Kumar, S.; Singh, S.M.; Manuja, P.; Gopinath, J.K.; Krishnan, V.; Gupta, S.K.; Deb, K.S.; Chakraborty, T. Exploring the Efficacy of Large Language Models in Summarizing Mental Health Counseling Sessions: Benchmark Study. JMIR Ment. Health 2024, 11, e57306. [Google Scholar] [CrossRef] [PubMed]
99. Gargari, O.K.; Habibi, G.; Nilchian, N.; Shafiee, A. Comparative analysis of large language models in psychiatry and mental health: A focus on GPT, AYA, and Nemotron-3-8B. Asian J. Psychiatry 2024, 99, 104148. [Google Scholar] [CrossRef] [PubMed]
  100. Shin, D.; Kim, H.; Lee, S.; Cho, Y.; Jung, W. Using Large Language Models to Detect Depression From User-Generated Diary Text Data as a Novel Approach in Digital Mental Health Screening: Instrument Validation Study. J. Med. Internet Res. 2024, 26, e54617. [Google Scholar] [CrossRef]
  101. Andibanbang, F.; Odunuga, K.V.; Osuji, C.I.; Adeniran, K.O.; Akinyemi, K.; Tyem, N.F.; Akinlade, H.O.; Asimiyu, A.A.; Oke, O.; Ariyo, T.S.; et al. AI-Powered Conversations: The Diagnostic Potential of Chatbots in Mental Health Care. Ann. Comput. 2025, 1, 19–25. [Google Scholar] [CrossRef]
  102. Pawar, D.; Phansalkar, S. A Binary Question Answering System for Diagnosing Mental Health Syndromes powered by Large Language Model with Custom-Built Dataset. In Proceedings of the 2024 IEEE 4th International Conference on ICT in Business Industry & Government (ICTBIG), Indore, India, 13–14 December 2024; IEEE: London, UK, 2024; pp. 1–8. [Google Scholar] [CrossRef]
  103. Colombatto, C.; Fleming, S.M. Folk psychological attributions of consciousness to large language models. Neurosci. Conscious. 2024, 2024, niae013. [Google Scholar] [CrossRef] [PubMed]
  104. Casu, M.; Triscari, S.; Battiato, S.; Guarnera, L.; Caponnetto, P. AI Chatbots for Mental Health: A Scoping Review of Effectiveness, Feasibility, and Applications. Appl. Sci. 2024, 14, 5889. [Google Scholar] [CrossRef]
  105. Monosov, I.E.; Zimmermann, J.; Frank, M.J.; Mathis, M.W.; Baker, J.T. Ethological computational psychiatry: Challenges and opportunities. Curr. Opin. Neurobiol. 2024, 86, 102881. [Google Scholar] [CrossRef]
  106. Pang, H.Y.M.; Meshkat, S.; Teferra, B.G.; Rueda, A.; Samavi, R.; Krishnan, S.; Doyle, T.; Rambhatla, S.; DeJong, S.; Sockalingam, S.; et al. Opportunities and Barriers of Generative Artificial Intelligence in the Training of Psychiatrists: A Competencies-Based Perspective. Acad. Psychiatry 2025, 49, 25–30. [Google Scholar] [CrossRef]
  107. Lee, Q.Y.; Chen, M.; Ong, C.W.; Ho, C.S.H. The role of generative artificial intelligence in psychiatric education—A scoping review. BMC Med. Educ. 2025, 25, 438. [Google Scholar] [CrossRef]
  108. Chen, S.; Wu, M.; Zhu, K.Q.; Lan, K.; Zhang, Z.; Cui, L. LLM-empowered Chatbots for Psychiatrist and Patient Simulation: Application and Evaluation. arXiv 2023, arXiv:2305.13614. [Google Scholar] [CrossRef]
  109. Asman, O.; Torous, J.; Tal, A. Responsible Design, Integration, and Use of Generative AI in Mental Health. JMIR Ment. Health 2025, 12, e70439. [Google Scholar] [CrossRef]
  110. Wang, L.; Bhanushali, T.; Huang, Z.; Yang, J.; Badami, S.; Hightow-Weidman, L. Evaluating Generative AI in Mental Health: Systematic Review of Capabilities and Limitations. JMIR Ment. Health 2025, 12, e70014. [Google Scholar] [CrossRef] [PubMed]
  111. Zhang, X.; Cui, W.; Wang, J.; Li, Y. Chat, Summary and Diagnosis: A LLM-Enhanced Conversational Agent for Interactive Depression Detection. In Proceedings of the 2024 4th International Conference on Industrial Automation, Robotics and Control Engineering (IARCE), Chengdu, China, 15–17 November 2024; IEEE: London, UK, 2024; pp. 343–348. [Google Scholar] [CrossRef]
112. Patra, B.G.; Lepow, L.A.; Kumar, P.K.R.J.; Vekaria, V.; Sharma, M.M.; Adekkanattu, P.; Fennessy, B.; Hynes, G.; Land, I.; Sanchez-Ruiz, J.A.; et al. Extracting Social Support and Social Isolation Information from Clinical Psychiatry Notes: Comparing a Rule-based NLP System and a Large Language Model. J. Am. Med. Inform. Assoc. 2024, 32, 218–226. [Google Scholar] [CrossRef] [PubMed]
  113. Kim, T.; Bae, S.; Kim, H.A.; Lee, S.W.; Hong, H.; Yang, C.; Kim, Y.H. MindfulDiary: Harnessing Large Language Model to Support Psychiatric Patients’ Journaling. In Proceedings of the CHI Conference on Human Factors in Computing Systems, New York, NY, USA, 11–16 May 2024; ACM: New York, NY, USA, 2024; pp. 1–20. [Google Scholar] [CrossRef]
  114. Fisher, C.E. The real ethical issues with AI for clinical psychiatry. Int. Rev. Psychiatry 2025, 37, 14–20. [Google Scholar] [CrossRef]
  115. Cardamone, N.C.; Olfson, M.; Schmutte, T.; Ungar, L.; Liu, T.; Cullen, S.W.; Williams, N.J.; Marcus, S.C. Classifying Unstructured Text in Electronic Health Records for Mental Health Prediction Models: Large Language Model Evaluation Study. JMIR Med. Inform. 2025, 13, e65454. [Google Scholar] [CrossRef]
  116. Bouguettaya, A.; Team, V.; Stuart, E.M.; Aboujaoude, E. AI-driven report-generation tools in mental healthcare: A review of commercial tools. Gen. Hosp. Psychiatry 2025, 94, 150–158. [Google Scholar] [CrossRef]
  117. Wu, Y.; Chen, J.; Mao, K.; Zhang, Y. Automatic Post-Traumatic Stress Disorder Diagnosis via Clinical Transcripts: A Novel Text Augmentation with Large Language Models. In Proceedings of the 2023 IEEE Biomedical Circuits and Systems Conference (BioCAS), Toronto, ON, Canada, 19–21 October 2023; IEEE: London, UK, 2023; pp. 1–5. [Google Scholar] [CrossRef]
  118. Volkmer, S.; Meyer-Lindenberg, A.; Schwarz, E. Large language models in psychiatry: Opportunities and challenges. Psychiatry Res. 2024, 339, 116026. [Google Scholar] [CrossRef]
119. Lopes, M.C. The future of the sleep field using large language models in mental health care. J. Clin. Sleep Med. 2025, 21, 1151–1152. [Google Scholar] [CrossRef]
  120. Shah, H.A.; Islam, A.; Tariq, Z.U.A.; Belhaouari, S.B.; Househ, M. Retrieval Augmented Generation System for Mental Health Information. Stud. Health Technol. Inform. 2025, 329, 693–697. [Google Scholar] [CrossRef] [PubMed]
121. Kumar, A.; Sharma, S.; Gupta, S.; Kumar, D. Mental Healthcare Chatbot Based on Custom Diagnosis Documents Using a Quantized Large Language Model. In Proceedings of the 2024 11th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO), Noida, India, 14–15 March 2024; IEEE: London, UK, 2024; pp. 1–6. [Google Scholar] [CrossRef]
  122. Amirian, S.; Kekre, A.; Loganathan, B.J.; Chavan, V.; Kandula, P.; Littlefield, N.; Franco, J.R.; Tafti, A.P.; Ebuenyi, I.K. Advancing psychosocial disability and psychosocial rehabilitation research through large language models and computational text mining. Camb. Prism. Glob. Ment. Health 2024, 11, e123. [Google Scholar] [CrossRef]
  123. James, L.J.; Genga, L.; Montagne, B.; Hagenaars, M.; Van Gorp, P. Caregiver’s Evaluation of LLM-Generated Treatment Goals for Patients with Severe Mental Illnesses. In Proceedings of the 17th International Conference on PErvasive Technologies Related to Assistive Environments, New York, NY, USA, 26–28 June 2024; ACM: New York, NY, USA, 2024; pp. 187–190. [Google Scholar] [CrossRef]
124. Alhuzali, H.; Alasmari, A. Pre-Trained Language Models for Mental Health: An Empirical Study on Arabic Q&A Classification. Healthcare 2025, 13, 985. [Google Scholar] [CrossRef]
125. Li, L.; Kong, S.; Zhao, H.; Li, C.; Teng, Y.; Wang, Y. Chain of Risks Evaluation (CORE): A framework for safer large language models in public mental health. Psychiatry Clin. Neurosci. 2025, 79, 299–305. [Google Scholar] [CrossRef]
  126. Cross, S.; Mangelsdorf, S.; Valentine, L.; O’Sullivan, S.; McEnery, C.; Scott, I.; Gilbertson, T.; Louis, S.; Myer, J.; Liu, P.; et al. Insights from fifteen years of real-world development, testing and implementation of youth digital mental health interventions. Internet Interv. 2025, 41, 100849. [Google Scholar] [CrossRef] [PubMed]
  127. Izumi, K.; Tanaka, H.; Shidara, K.; Adachi, H.; Kanayama, D.; Kudo, T.; Nakamura, S. Response Generation for Cognitive Behavioral Therapy with Large Language Models: Comparative Study with Socratic Questioning. arXiv 2024, arXiv:2401.15966. [Google Scholar] [CrossRef]
  128. Shankar, R.; Bundele, A.; Yap, A.; Mukhopadhyay, A. Development and feasibility testing of an AI-powered chatbot for early detection of caregiver burden: Protocol for a mixed methods feasibility study. Front. Psychiatry 2025, 16. [Google Scholar] [CrossRef]
  129. Khorev, V.; Kiselev, A.; Badarin, A.; Antipov, V.; Drapkina, O.; Kurkin, S.; Hramov, A. Review on the use of AI-based methods and tools for treating mental conditions and mental rehabilitation. Eur. Phys. J. Spec. Top. 2024, 234, 4139–4158. [Google Scholar] [CrossRef]
  130. Thapa, S.; Adhikari, S. GPT-4o and multimodal large language models as companions for mental wellbeing. Asian J. Psychiatry 2024, 99, 104157. [Google Scholar] [CrossRef]
  131. Elyoseph, Z.; Levkovich, I.; Shinan-Altman, S. Assessing prognosis in depression: Comparing perspectives of AI models, mental health professionals and the general public. Fam. Med. Community Health 2024, 12, e002583. [Google Scholar] [CrossRef]
  132. Wang, G.; Badal, A.; Jia, X.; Maltz, J.S.; Mueller, K.; Myers, K.J.; Niu, C.; Vannier, M.; Yan, P.; Yu, Z.; et al. Development of metaverse for intelligent healthcare. Nat. Mach. Intell. 2022, 4, 922–929. [Google Scholar] [CrossRef]
  133. Firth, J.; Torous, J.; López-Gil, J.F.; Linardon, J.; Milton, A.; Lambert, J.; Smith, L.; Jarić, I.; Fabian, H.; Vancampfort, D.; et al. From “online brains” to “online lives”: Understanding the individualized impacts of Internet use across psychological, cognitive and social dimensions. World Psychiatry 2024, 23, 176–190. [Google Scholar] [CrossRef]
  134. Ma, Y.; Zeng, Y.; Liu, T.; Sun, R.; Xiao, M.; Wang, J. Integrating large language models in mental health practice: A qualitative descriptive study based on expert interviews. Front Public Health 2024, 12, 1475867. [Google Scholar] [CrossRef]
135. Patias, I.; Miteva, D.; Peltekova, E.; Wright, M.; Gasteiger-Klicpera, B. Leveraging Large Language Models to Enhance Mental Health Literacy and Diversity Awareness in Adolescents: The me_HeLi-D Project. In Proceedings of the 2024 8th International Symposium on Innovative Approaches in Smart Technologies (ISAS), İstanbul, Türkiye, 6–7 December 2024; IEEE: London, UK, 2024; pp. 1–5. [Google Scholar] [CrossRef]
  136. Floridi, L.; Cowls, J.; Beltrametti, M.; Chatila, R.; Chazerand, P.; Dignum, V.; Luetge, C.; Madelin, R.; Pagallo, U.; Rossi, F.; et al. AI4People—An Ethical Framework for a Good AI Society: Opportunities, Risks, Principles, and Recommendations. Minds Mach. 2018, 28, 689–707. [Google Scholar] [CrossRef]
  137. Friedman, S.F.; Ballentine, G. Trajectories of sentiment in 11,816 psychoactive narratives. Hum. Psychopharmacol. Clin. Exp. 2023, 39, e2889. [Google Scholar] [CrossRef]
  138. Park, C.; Lee, H.; Lee, S.; Jeong, O. Synergistic Joint Model of Knowledge Graph and LLM for Enhancing XAI-Based Clinical Decision Support Systems. Mathematics 2025, 13, 949. [Google Scholar] [CrossRef]
  139. Li, Y.; Zeng, C.; Zhong, J.; Zhang, R.; Zhang, M.; Zou, L. Leveraging Large Language Model as Simulated Patients for Clinical Education. arXiv 2024, arXiv:2404.13066. [Google Scholar] [CrossRef]
  140. Dalal, S.; Tilwani, D.; Gaur, M.; Jain, S.; Shalin, V.L.; Sheth, A.P. A Cross Attention Approach to Diagnostic Explainability Using Clinical Practice Guidelines for Depression. IEEE J. Biomed. Health Inform. 2025, 29, 1333–1342. [Google Scholar] [CrossRef]
  141. Garg, S.; Sharma, S. Impact of Artificial Intelligence in Special Need Education to Promote Inclusive Pedagogy. Int. J. Inf. Educ. Technol. 2020, 10, 523–527. [Google Scholar] [CrossRef]
  142. Tortora, L. Beyond Discrimination: Generative AI Applications and Ethical Challenges in Forensic Psychiatry. Front. Psychiatry 2024, 15, 1346059. [Google Scholar] [CrossRef] [PubMed]
  143. Ogunwale, A.; Smith, A.; Fakorede, O.; Ogunlesi, A.O. Artificial intelligence and forensic mental health in Africa: A narrative review. Int. Rev. Psychiatry 2025, 37, 3–13. [Google Scholar] [CrossRef]
  144. Baydili, İ.; Tasci, B.; Tasci, G. Artificial Intelligence in Psychiatry: A Review of Biological and Behavioral Data Analyses. Diagnostics 2025, 15, 434. [Google Scholar] [CrossRef] [PubMed]
  145. Sun, J.; Lu, T.; Shao, X.; Han, Y.; Xia, Y.; Zheng, Y.; Wang, J.; Li, X.; Ravindran, A.; Fan, L.; et al. Practical AI application in psychiatry: Historical review and future directions. Mol. Psychiatry 2025, 30, 4399–4408. [Google Scholar] [CrossRef] [PubMed]
  146. Sezgin, E.; Chekeni, F.; Lee, J.; Keim, S. Clinical Accuracy of Large Language Models and Google Search Responses to Postpartum Depression Questions: Cross-Sectional Study. J. Med. Internet Res. 2023, 25, e49240. [Google Scholar] [CrossRef]
  147. Desage, C.; Bunge, B.; Bunge, E.L. A Revised Framework for Evaluating the Quality of Mental Health Artificial Intelligence-Based Chatbots. Procedia Comput. Sci. 2024, 248, 3–7. [Google Scholar] [CrossRef]
  148. Elyoseph, Z.; Levkovich, I. Comparing the Perspectives of Generative AI, Mental Health Experts, and the General Public on Schizophrenia Recovery: Case Vignette Study. JMIR Ment. Health 2024, 11, e53043. [Google Scholar] [CrossRef]
  149. Perlis, R.H.; Goldberg, J.F.; Ostacher, M.J.; Schneck, C.D. Clinical decision support for bipolar depression using large language models. Neuropsychopharmacology 2024, 49, 1412–1416. [Google Scholar] [CrossRef]
  150. Hanss, K.; Sarma, K.V.; Halls, A.; Gorrell, S.; Reilly, E. 4.29 Can Artificial Intelligence Make the Diagnosis? Evaluating the Accuracy of Large Language Models in Diagnosing Child and Adolescent Psychiatry Clinical Cases. J. Am. Acad. Child. Adolesc. Psychiatry 2024, 63, S239–S240. [Google Scholar] [CrossRef]
151. Lee, C.; Mohebbi, M.; O’Callaghan, E.; Winsberg, M. Large Language Models Versus Expert Clinicians in Crisis Prediction Among Telemental Health Patients: Comparative Study. JMIR Ment. Health 2024, 11, e58129. [Google Scholar] [CrossRef]
  152. McBain, R.K.; Cantor, J.H.; Zhang, L.A.; Baker, O.; Zhang, F.; Halbisen, A.; Kofner, A.; Breslau, J.; Stein, B.; Mehrotra, A.; et al. Competency of Large Language Models in Evaluating Appropriate Responses to Suicidal Ideation: Comparative Study. J. Med. Internet Res. 2025, 27, e67891. [Google Scholar] [CrossRef]
  153. Lauderdale, S.A.; Schmitt, R.; Wuckovich, B.; Dalal, N.; Desai, H.; Tomlinson, S. Effectiveness of generative AI-large language models’ recognition of veteran suicide risk: A comparison with human mental health providers using a risk stratification model. Front. Psychiatry 2025, 16, 9. [Google Scholar] [CrossRef] [PubMed]
  154. Levkovich, I. Evaluating Diagnostic Accuracy and Treatment Efficacy in Mental Health: A Comparative Analysis of Large Language Model Tools and Mental Health Professionals. Eur. J. Investig. Health Psychol. Educ. 2025, 15, 9. [Google Scholar] [CrossRef] [PubMed]
  155. Zhou, S.; Xu, Z.; Zhang, M.; Xu, C.; Guo, Y.; Zhan, Z.; Fang, Y.; Ding, S.; Wang, J.; Xu, K.; et al. Large Language Models for Disease Diagnosis: A Scoping Review. arXiv 2025, arXiv:2409.00097. [Google Scholar] [CrossRef]
  156. Kim, J.; Leonte, K.G.; Chen, M.L.; Torous, J.B.; Linos, E.; Pinto, A.; Rodriguez, C.I. Large language models outperform mental and medical health care professionals in identifying obsessive-compulsive disorder. Npj Digit. Med. 2024, 7, 193. [Google Scholar] [CrossRef]
  157. Hadar-Shoval, D.; Asraf, K.; Mizrachi, Y.; Haber, Y.; Elyoseph, Z. Assessing the Alignment of Large Language Models With Human Values for Mental Health Integration: Cross-Sectional Study Using Schwartz’s Theory of Basic Values. JMIR Ment. Health 2024, 11, e55988. [Google Scholar] [CrossRef] [PubMed]
  158. Saha, K.; Jain, Y.; De Choudhury, M. Linguistic Comparison of AI- and Human-Written Responses to Mental Health Queries. arXiv 2025, arXiv:2504.09271. [Google Scholar]
  159. George, S.B.; Binu Rajan, M.R.; Ebin, P. Mello: A Large Language Model for Mental Health Counselling Conversations. In Proceedings of the 2024 3rd International Conference for Advancement in Technology (ICONAT), Goa, India, 6–8 September 2024; IEEE: London, UK, 2024; pp. 1–6. [Google Scholar] [CrossRef]
  160. Bouguettaya, A.; Stuart, E.M.; Aboujaoude, E. Racial bias in AI-mediated psychiatric diagnosis and treatment: A qualitative comparison of four large language models. Npj Digit. Med. 2025, 8, 332. [Google Scholar] [CrossRef] [PubMed]
  161. Guan, M.Y.; Joglekar, M.; Wallace, E.; Jain, S.; Barak, B.; Helyar, A.; Dias, R.; Vallone, A.; Ren, H.; Wei, J.; et al. Deliberative Alignment: Reasoning Enables Safer Language Models. arXiv 2025, arXiv:2412.16339. [Google Scholar] [CrossRef]
  162. Grabb, D.; Lamparth, M.; Vasan, N. Risks from Language Models for Automated Mental Healthcare: Ethics and Structure for Implementation. arXiv 2024, arXiv:2406.11852. [Google Scholar]
163. Xu, Z.; Xu, J.; Luo, Y.; Zhang, K.; Zhang, J.; Zou, Y.; Liu, L. Utilizing Large Language Models for Psychological Assessment: Enhancing Suicide Risk Detection Through Social Media Analysis. In Proceedings of the 2024 6th International Conference on Frontier Technologies of Information and Computer (ICFTIC), Qingdao, China, 13–15 December 2024; IEEE: London, UK, 2024; pp. 1418–1421. [Google Scholar] [CrossRef]
  164. Ghanadian, H.; Nejadgholi, I.; Osman, H.A. Socially Aware Synthetic Data Generation for Suicidal Ideation Detection Using Large Language Models. IEEE Access 2024, 12, 14350–14363. [Google Scholar] [CrossRef]
  165. Hur, J.K.; Heffner, J.; Feng, G.W.; Joormann, J.; Rutledge, R.B. Language sentiment predicts changes in depressive symptoms. Proc. Natl. Acad. Sci. USA 2024, 121, e2321321121. [Google Scholar] [CrossRef]
  166. Rathje, S.; Mirea, D.M.; Sucholutsky, I.; Marjieh, R.; Robertson, C.E.; Van Bavel, J.J. GPT is an effective tool for multilingual psychological text analysis. Proc. Natl. Acad. Sci. USA 2024, 121, e2308950121. [Google Scholar] [CrossRef]
  167. Hasan, M.J.; Sultana, J.; Ahmed, S.; Momen, S. Early detection of occupational stress: Enhancing workplace safety with machine learning and large language models. PLoS ONE 2025, 20, e0323265. [Google Scholar] [CrossRef]
  168. Wahab, O.; Adda, M.; Zrira, N. SereniSens: A Multimodal AI Framework with LLMs for Stress Prediction through Sleep Biometrics. Procedia Comput. Sci. 2024, 251, 342–349. [Google Scholar] [CrossRef]
  169. Corda, E.; Massa, S.M.; Riboni, D. Context-Aware Behavioral Tips to Improve Sleep Quality via Machine Learning and Large Language Models. Future Internet 2024, 16, 46. [Google Scholar] [CrossRef]
170. Aich, A.; Quynh, A.; Osseyi, P.; Pinkham, A.; Harvey, P.; Curtis, B.; Depp, C.; Parde, N. Using LLMs to Aid Annotation and Collection of Clinically-Enriched Data in Bipolar Disorder and Schizophrenia. In Proceedings of the 10th Workshop on Computational Linguistics and Clinical Psychology (CLPsych), Albuquerque, NM, USA, 3 May 2025; pp. 181–192. Available online: https://aclanthology.org/2025.clpsych-1.15/ (accessed on 21 April 2025).
  171. Pavez, J.; Allende, H. A Hybrid System Based on Bayesian Networks and Deep Learning for Explainable Mental Health Diagnosis. Appl. Sci. 2024, 14, 8283. [Google Scholar] [CrossRef]
  172. Yang, K.; Ji, S.; Zhang, T.; Xie, Q.; Kuang, Z.; Ananiadou, S. Towards Interpretable Mental Health Analysis with Large Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023. [Google Scholar] [CrossRef]
  173. Huang, J.; Gu, S.; Hou, L.; Wu, Y.; Wang, X.; Yu, H.; Han, J. Large Language Models Can Self-Improve. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 1051–1068. [Google Scholar] [CrossRef]
  174. Zhang, C.; Li, R.; Tan, M.; Yang, M.; Zhu, J.; Yang, D.; Zhao, J.; Ye, G.; Li, C.; Hu, X. CPsyCoun: A Report-based Multi-turn Dialogue Reconstruction and Evaluation Framework for Chinese Psychological Counseling. arXiv 2024, arXiv:2405.16433. [Google Scholar]
  175. De Duro, E.S.; Improta, R.; Stella, M. Introducing CounseLLMe: A dataset of simulated mental health dialogues for comparing LLMs like Haiku, LLaMAntino and ChatGPT against humans. Emerg. Trends Drugs Addict. Health 2025, 5, 100170. [Google Scholar] [CrossRef]
  176. Zeng, Q.; Li, X.; Wang, S.; Liu, K. Adversarial Evaluation Algorithm for Detecting Extreme Behaviors of LLMs in Psychological Counseling Scenarios. In Proceedings of the 2025 2nd International Conference on Algorithms, Software Engineering and Network Security (ASENS), Guangzhou, China, 21–23 March 2025; IEEE: London, UK, 2025; pp. 412–415. [Google Scholar] [CrossRef]
  177. Zhang, Z.; Wang, J. Can AI replace psychotherapists? Exploring the future of mental health care. Front. Psychiatry 2024, 15, 1444382. [Google Scholar] [CrossRef] [PubMed]
  178. Wang, W.J. vFerryman: An Artificial Intelligence-Driven Personalized Companion Providing Calming Visuals and Social Interaction for Emotional Well-Being. Eng. Proc. 2025, 92, 22. [Google Scholar] [CrossRef]
  179. Yang, D.; Ziems, C.; Held, W.; Shaikh, O.; Bernstein, M.S.; Mitchell, J. Social Skill Training with Large Language Models. arXiv 2024, arXiv:2404.04204. [Google Scholar] [CrossRef]
  180. Li, Y.; Huang, Y.; Wang, H.; Zhang, X.; Zou, J.; Sun, L. Quantifying AI Psychology: A Psychometrics Benchmark for Large Language Models. arXiv 2024, arXiv:2406.17675. [Google Scholar] [CrossRef]
  181. Kjell, O.N.E.; Kjell, K.; Schwartz, H.A. Beyond rating scales: With targeted evaluation, large language models are poised for psychological assessment. Psychiatry Res. 2024, 333, 115667. [Google Scholar] [CrossRef]
  182. Liu, Z.; Kang, Y.; Li, X. Research on Psychological Test based on Large Language Model. In Proceedings of the 2024 3rd International Conference on Robotics, Artificial Intelligence and Intelligent Control (RAIIC), Mianyang, China, 5–7 July 2024; IEEE: London, UK, 2024; pp. 503–510. [Google Scholar] [CrossRef]
  183. Cong, Y.; LaCroix, A.N.; Lee, J. Clinical efficacy of pre-trained large language models through the lens of aphasia. Sci. Rep. 2024, 14, 15573. [Google Scholar] [CrossRef]
  184. Liu, I.; Liu, F.; Xiao, Y.; Huang, Y.; Wu, S.; Ni, S. Investigating the Key Success Factors of Chatbot-Based Positive Psychology Intervention with Retrieval- and Generative Pre-Trained Transformer (GPT)-Based Chatbots. Int. J. Hum.–Comput. Interact. 2025, 41, 341–352. [Google Scholar] [CrossRef]
  185. Huang, S.; Wang, Y.; Li, G.; Hall, B.J.; Nyman, T.J. Digital Mental Health Interventions for Alleviating Depression and Anxiety During Psychotherapy Waiting Lists: Systematic Review. JMIR Ment. Health 2024, 11, e56650. [Google Scholar] [CrossRef] [PubMed]
  186. Bellini-Leite, S.C. Dual Process Theory for Large Language Models: An overview of using Psychology to address hallucination and reliability issues. Adapt. Behav. 2024, 32, 329–343. [Google Scholar] [CrossRef]
  187. Pellert, M.; Lechner, C.M.; Wagner, C.; Rammstedt, B.; Strohmaier, M. AI Psychometrics: Assessing the Psychological Profiles of Large Language Models Through Psychometric Inventories. Perspect. Psychol. Sci. 2024, 19, 808–826. [Google Scholar] [CrossRef]
  188. Tavory, T. Regulating AI in Mental Health: Ethics of Care Perspective. JMIR Ment. Health 2024, 11, e58493. [Google Scholar] [CrossRef]
  189. Liu, J. ChatGPT: Perspectives from human–computer interaction and psychology. Front. Artif. Intell. 2024, 7, 1418869. [Google Scholar] [CrossRef]
  190. Ben-Zion, Z.; Witte, K.; Jagadish, A.K.; Duek, O.; Harpaz-Rotem, I.; Khorsandian, M.C.; Burrer, A.; Seifritz, E.; Homan, P.; Schulz, E.; et al. Assessing and alleviating state anxiety in large language models. Npj Digit. Med. 2025, 8, 132. [Google Scholar] [CrossRef] [PubMed]
  191. Jurblum, M.; Selzer, R. Potential promises and perils of artificial intelligence in psychotherapy—The AI Psychotherapist (APT). Australas. Psychiatry 2025, 33, 103–105. [Google Scholar] [CrossRef]
  192. Imel, Z.E.; Tanana, M.J.; Soma, C.S.; Hull, T.D.; Pace, B.T.; Stanco, S.C.; Creed, T.A.; Moyers, T.R.; Atkins, D.C. Outcomes in Mental Health Counseling From Conversational Content With Transformer-Based Machine Learning. JAMA Netw. Open 2024, 7, e2352590. [Google Scholar] [CrossRef]
  193. Rasool, A.; Shahzad, M.I.; Aslam, H.; Chan, V.; Arshad, M.A. Emotion-Aware Embedding Fusion in Large Language Models (Flan-T5, Llama 2, DeepSeek-R1, and ChatGPT 4) for Intelligent Response Generation. AI 2025, 6, 56. [Google Scholar] [CrossRef]
  194. Lalk, C.; Steinbrenner, T.; Kania, W.; Popko, A.; Wester, R.; Schaffrath, J.; Eberhardt, S.; Schwartz, B.; Lutz, W.; Rubel, J. Measuring Alliance and Symptom Severity in Psychotherapy Transcripts Using Bert Topic Modeling. Adm. Policy Ment. Health Ment. Health Serv. Res. 2024, 51, 509–524. [Google Scholar] [CrossRef]
  195. Sezgin, E.; McKay, I. Behavioral health and generative AI: A perspective on future of therapies and patient care. Npj Ment. Health Res. 2024, 3, 25. [Google Scholar] [CrossRef] [PubMed]
  196. Sufyan, N.S.; Fadhel, F.H.; Alkhathami, S.S.; Mukhadi, J.Y.A. Artificial intelligence and social intelligence: Preliminary comparison study between AI models and psychologists. Front. Psychol. 2024, 15, 1353022. [Google Scholar] [CrossRef] [PubMed]
  197. Raile, P. The usefulness of ChatGPT for psychotherapists and patients. Humanit. Soc. Sci. Commun. 2024, 11, 47. [Google Scholar] [CrossRef]
  198. Yang, F.; Wei, J.; Zhao, X.; An, R. Artificial Intelligence–Based Mobile Phone Apps for Child Mental Health: Comprehensive Review and Content Analysis. JMIR MHealth UHealth 2025, 13, e58597. [Google Scholar] [CrossRef] [PubMed]
  199. Nepal, S.; Pillai, A.; Campbell, W.; Massachi, T.; Choi, E.S.; Xu, X.; Kuc, J.; Huckins, J.; Holden, J.; Depp, C.; et al. Contextual AI Journaling: Integrating LLM and Time Series Behavioral Sensing Technology to Promote Self-Reflection and Well-being using the MindScape App. In Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, New York, NY, USA, 11–16 May 2024; ACM: New York, NY, USA, 2024; pp. 1–8. [Google Scholar] [CrossRef]
  200. Sargolzaei, P.; Rastogi, M.; Zaman, L. Advancing Mixed Reality Game Development: An Evaluation of a Visual Game Analytics Tool in Action-Adventure and FPS Genres. Proc. ACM Hum.-Comput. Interact. 2024, 8, 1–32. [Google Scholar] [CrossRef]
  201. Soman, G.; Judy, M.V.; Abou, A.M. Human guided empathetic AI agent for mental health support leveraging reinforcement learning-enhanced retrieval-augmented generation. Cogn. Syst. Res. 2025, 90, 101337. [Google Scholar] [CrossRef]
  202. Song, I.; Pendse, S.R.; Kumar, N.; De Choudhury, M. The Typing Cure: Experiences with Large Language Model Chatbots for Mental Health Support. arXiv 2025, arXiv:2401.14362. [Google Scholar] [CrossRef]
  203. Kuhail, M.A.; Alturki, N.; Thomas, J.; Alkhalifa, A.K.; Alshardan, A. Human-Human vs Human-AI Therapy: An Empirical Study. Int. J. Hum.–Comput. Interact. 2025, 41, 6841–6852. [Google Scholar] [CrossRef]
  204. Gabriel, S.; Puri, I.; Xu, X.; Malgaroli, M.; Ghassemi, M. Can AI Relate: Testing Large Language Model Response for Mental Health Support. arXiv 2024, arXiv:2405.12021. [Google Scholar] [CrossRef]
  205. Chiu, Y.Y.; Sharma, A.; Lin, I.W.; Althoff, T. A Computational Framework for Behavioral Assessment of LLM Therapists. arXiv 2024, arXiv:2401.00820. [Google Scholar] [CrossRef]
  206. Stade, E.C.; Stirman, S.W.; Ungar, L.H.; Boland, C.L.; Schwartz, H.A.; Yaden, D.B.; Sedoc, J.; DeRubeis, R.J.; Willer, R.; Eichstaedt, J.C. Large language models could change the future of behavioral healthcare: A proposal for responsible development and evaluation. Npj Ment. Health Res. 2024, 3, 12. [Google Scholar] [CrossRef]
  207. Yuan, A.; Garcia Colato, E.; Pescosolido, B.; Song, H.; Samtani, S. Improving Workplace Well-being in Modern Organizations: A Review of Large Language Model-based Mental Health Chatbots. ACM Trans. Manag. Inf. Syst. 2025, 16, 1–26. [Google Scholar] [CrossRef]
208. Rollwage, M.; Habicht, J.; Juechems, K.; Carrington, B.; Viswanathan, S.; Stylianou, M.; Hauser, T.U.; Harper, R. Correction: Using Conversational AI to Facilitate Mental Health Assessments and Improve Clinical Efficiency Within Psychotherapy Services: Real-World Observational Study. JMIR AI 2024, 3, e57869. [Google Scholar] [CrossRef] [PubMed]
  209. Montag, C.; Ali, R.; Al-Thani, D.; Hall, B.J. On artificial intelligence and global mental health. Asian J. Psychiatry 2024, 91, 103855. [Google Scholar] [CrossRef] [PubMed]
210. Rickman, S. Evaluating gender bias in large language models in long-term care. BMC Med. Inform. Decis. Mak. 2025, 25, 274. [Google Scholar] [CrossRef]
  211. Lopes, E.; Jain, G.; Carlbring, P.; Pareek, S. Talking Mental Health: A Battle of Wits Between Humans and AI. J. Technol. Behav. Sci. 2023, 9, 628–638. [Google Scholar] [CrossRef]
  212. Salman, S.; Richards, D. Using Large Language Models to Embed Relational Cues in the Dialogue of Collaborating Digital Twins. Systems 2025, 13, 353. [Google Scholar] [CrossRef]
  213. Isaranuwatchai, W.; Wang, Y.; Soboon, B.; Tungsanga, K.; Nakamura, R.; Wee, H.-L.; Botwright, S.; Theantawee, W.; Laoharuangchaiyot, J.; Mongkolchaipak, T.; et al. An empirical study looking at the potential impact of increasing cost-effectiveness threshold on reimbursement decisions in Thailand. Health Policy Technol. 2024, 13, 100927. [Google Scholar] [CrossRef]
  214. Jiang, M.; Yu, Y.J.; Zhao, Q.; Li, J.; Song, C.; Qi, H.; Zhai, W.; Luo, D.; Wang, X.; Fu, G.; et al. AI-Enhanced Cognitive Behavioral Therapy: Deep Learning and Large Language Models for Extracting Cognitive Pathways from Social Media Texts. arXiv 2024, arXiv:2404.11449. [Google Scholar] [CrossRef]
  215. Na, H. CBT-LLM: A Chinese Large Language Model for Cognitive Behavioral Therapy-based Mental Health Question Answering. arXiv 2024, arXiv:2403.16008. [Google Scholar]
  216. Shen, H.; Li, Z.; Yang, M.; Ni, M.; Tao, Y.; Yu, Z.; Zheng, W.; Xu, C.; Hu, B. Are Large Language Models Possible to Conduct Cognitive Behavioral Therapy? In Proceedings of the 2024 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Lisbon, Portugal, 3–6 December 2024; IEEE: London, UK, 2024; pp. 3695–3700. [Google Scholar] [CrossRef]
  217. Hodson, N.; Williamson, S. Can Large Language Models Replace Therapists? Evaluating Performance at Simple Cognitive Behavioral Therapy Tasks. JMIR AI 2024, 3, e52500. [Google Scholar] [CrossRef]
  218. Kian, M.J.; Zong, M.; Fischer, K.; Singh, A.; Velentza, A.M.; Sang, P.; Upadhyay, S.; Gupta, A.; Faruki, M.A.; Browning, W.; et al. Can an LLM-Powered Socially Assistive Robot Effectively and Safely Deliver Cognitive Behavioral Therapy? A Study With University Students. arXiv 2024, arXiv:2402.17937. [Google Scholar] [CrossRef]
  219. Held, P.; Pridgen, S.A.; Chen, Y.; Akhtar, Z.; Amin, D.; Pohorence, S. A Novel Cognitive Behavioral Therapy–Based Generative AI Tool (Socrates 2.0) to Facilitate Socratic Dialogue: Protocol for a Mixed Methods Feasibility Study. JMIR Res. Protoc. 2024, 13, e58195. [Google Scholar] [CrossRef] [PubMed]
  220. Jiang, M.; Zhao, Q.; Li, J.; Wang, F.; He, T.; Cheng, X.; Yang, B.X.; Ho, G.W.K.; Fu, G. A Generic Review of Integrating Artificial Intelligence in Cognitive Behavioral Therapy. arXiv 2024, arXiv:2407.19422. [Google Scholar] [CrossRef]
  221. Hwang, G.; Lee, D.Y.; Seol, S.; Jung, J.; Choi, Y.; Her, E.S.; An, M.H.; Park, R.W. Assessing the potential of ChatGPT for psychodynamic formulations in psychiatry: An exploratory study. Psychiatry Res. 2024, 331, 115655. [Google Scholar] [CrossRef] [PubMed]
  222. Han, G.; Liu, W.; Huang, X.; Borsari, B. Chain-of-Interaction: Enhancing Large Language Models for Psychiatric Behavior Understanding by Dyadic Contexts. In Proceedings of the 2024 IEEE 12th International Conference on Healthcare Informatics (ICHI), Orlando, FL, USA, 3–6 June 2024; IEEE: London, UK, 2024; pp. 392–401. [Google Scholar] [CrossRef]
  223. Lundahl, B.; Howey, W.; Dilanchian, A.; Garcia, M.J.; Patin, K.; Moleni, K.; Burke, B. Addressing Suicide Risk: A Systematic Review of Motivational Interviewing Infused Interventions. Res. Soc. Work Pract. 2024, 34, 158–168. [Google Scholar] [CrossRef]
  224. Claire, V.; PositivePsychology.com. What Is Motivational Interviewing? A Theory of Change. 2020. Available online: https://positivepsychology.com/motivational-interviewing-theory/ (accessed on 2 June 2025).
  225. Yosef, S.; Zisquit, M.; Cohen, B.; Brunstein, A.K.; Bar, K.; Friedman, D.; Association for Computational Linguistics. Assessing Motivational Interviewing Sessions with AI-Generated Patient Simulations. 2024. Available online: https://aclanthology.org/2024.clpsych-1.1/ (accessed on 2 June 2025).
  226. Sharma, A.; Rushton, K.; Lin, I.W.; Nguyen, T.; Althoff, T. Facilitating Self-Guided Mental Health Interventions Through Human-Language Model Interaction: A Case Study of Cognitive Restructuring. In Proceedings of the CHI Conference on Human Factors in Computing Systems, Honolulu, HI, USA, 11–16 May 2024; ACM: New York, NY, USA, 2024; pp. 1–29. [Google Scholar] [CrossRef]
227. Filienko, D.; Wang, Y.; El Jazmi, C.; Xie, S.; Cohen, T.; De Cock, M.; Yuwen, W. Toward Large Language Models as a Therapeutic Tool: Comparing Prompting Techniques to Improve GPT-Delivered Problem-Solving Therapy. arXiv 2024, arXiv:2409.00112. [Google Scholar]
  228. Yang, M.; Tao, Y.; Cai, H.; Hu, B. Behavioral Information Feedback With Large Language Models for Mental Disorders: Perspectives and Insights. IEEE Trans. Comput. Soc. Syst. 2024, 11, 3026–3044. [Google Scholar] [CrossRef]
229. Spiegel, B.M.R.; Liran, O.; Clark, A.; Samaan, J.S.; Khalil, C.; Chernoff, R.; Reddy, K.; Mehra, M. Feasibility of combining spatial computing and AI for mental health support in anxiety and depression. Npj Digit. Med. 2024, 7, 22. [Google Scholar] [CrossRef]
  230. Scholich, T.; Barr, M.; Wiltsey Stirman, S.; Raj, S. A Comparison of Responses from Human Therapists and Large Language Model–Based Chatbots to Assess Therapeutic Communication: Mixed Methods Study. JMIR Ment. Health 2025, 12, e69709. [Google Scholar] [CrossRef] [PubMed]
  231. He, Q.; Wang, J.; He, D. The Influence of Task and Group Disparities Over Users’ Attitudes Toward Using Large Language Models for Psychotherapy. Proc. Hum. Factors Ergon. Soc. Annu. Meet. 2024, 68, 1147–1152. [Google Scholar] [CrossRef]
  232. Uptegraft, C.; Black, K.C.; Gale, J.; Marshall, A.; He, S. The Elastic Electronic Health Record: A Five-Tiered Framework for Applying Artificial Intelligence to Electronic Health Record Maintenance, Configuration, and Use. JMIR AI 2025, 4, e66741. [Google Scholar] [CrossRef]
  233. Siddals, S.; Torous, J.; Coxon, A. “It happened to be the perfect thing”: Experiences of generative AI chatbots for mental health. Npj Ment. Health Res. 2024, 3, 48. [Google Scholar] [CrossRef] [PubMed]
  234. Park, C.; Lee, H.; Jeong, O.R. Leveraging Medical Knowledge Graphs and Large Language Models for Enhanced Mental Disorder Information Extraction. Future Internet 2024, 16, 260. [Google Scholar] [CrossRef]
  235. Bolpagni, M.; Gabrielli, S. Development of a Comprehensive Evaluation Scale for LLM-Powered Counseling Chatbots (CES-LCC) Using the eDelphi Method. Informatics 2025, 12, 33. [Google Scholar] [CrossRef]
  236. Shen, J.; DiPaola, D.; Ali, S.; Sap, M.; Park, H.W.; Breazeal, C. Empathy Toward Artificial Intelligence Versus Human Experiences and the Role of Transparency in Mental Health and Social Support Chatbot Design: Comparative Study. JMIR Ment. Health 2024, 11, e62679. [Google Scholar] [CrossRef]
  237. Coşkun, Ö.; Kıyak, Y.S.; Budakoğlu, I.İ. ChatGPT to generate clinical vignettes for teaching and multiple-choice questions for assessment: A randomized controlled experiment. Med. Teach. 2025, 47, 268–274. [Google Scholar] [CrossRef]
  238. Rządeczka, M.; Sterna, A.; Stolińska, J.; Kaczyńska, P.; Moskalewicz, M. The Efficacy of Conversational AI in Rectifying the Theory-of-Mind and Autonomy Biases: Comparative Analysis. JMIR Ment. Health 2025, 12, e64396. [Google Scholar] [CrossRef]
  239. Alasmari, A. A Scoping Review of Arabic Natural Language Processing for Mental Health. Healthcare 2025, 13, 963. [Google Scholar] [CrossRef] [PubMed]
  240. Connors, E.H.; Janse, P.; de Jong, K.; Bickman, L. The Use of Feedback in Mental Health Services: Expanding Horizons on Reach and Implementation. Adm. Policy Ment. Health Ment. Health Serv. Res. 2025, 52, 1–10. [Google Scholar] [CrossRef]
  241. Villarreal-Zegarra, D.; Reategui-Rivera, C.M.; García-Serna, J.; Quispe-Callo, G.; Lázaro-Cruz, G.; Centeno-Terrazas, G.; Galvez-Arevalo, R.; Escobar-Agreda, S.; Dominguez-Rodriguez, A.; Finkelstein, J. Self-Administered Interventions Based on Natural Language Processing Models for Reducing Depressive and Anxiety Symptoms: Systematic Review and Meta-Analysis. JMIR Ment. Health 2024, 11, e59560. [Google Scholar] [CrossRef]
  242. Waaler, P.N.; Hussain, M.; Molchanov, I.; Bongo, L.A.; Elvevåg, B. Prompt Engineering an Informational Chatbot for Education on Mental Health Using a Multiagent Approach for Enhanced Compliance With Prompt Instructions: Algorithm Development and Validation. JMIR AI 2025, 4, e69820. [Google Scholar] [CrossRef]
  243. Stade, E.C.; Eichstaedt, J.C.; Kim, J.P.; Wiltsey Stirman, S. Readiness evaluation for artificial intelligence-mental health deployment and implementation (READI): A review and proposed framework. Technol. Mind Behav. 2025, 6, 111–122. [Google Scholar] [CrossRef]
  244. Shewcraft, R.A.; Schwarz, J.; Micsinai Balan, M. Algorithmic Classification of Psychiatric Disorder–Related Spontaneous Communication Using Large Language Model Embeddings: Algorithm Development and Validation. JMIR AI 2025, 4, e67369. [Google Scholar] [CrossRef]
  245. Rasool, A.; Aslam, S.; Hussain, N.; Imtiaz, S.; Riaz, W. nBERT: Harnessing NLP for Emotion Recognition in Psychotherapy to Transform Mental Health Care. Information 2025, 16, 301. [Google Scholar] [CrossRef]
  246. Lewis, C.C.; Boyd, M.; Puspitasari, A.; Navarro, E.; Howard, J.; Kassab, H.; Hoffman, M.; Scott, K.; Lyon, K.; Lyon, S.; et al. Implementing Measurement-Based Care in Behavioral Health. JAMA Psychiatry 2019, 76, 324. [Google Scholar] [CrossRef]
  247. O’Connor, K.; Muller Neff, D.; Pitman, S. Burnout in mental health professionals: A systematic review and meta-analysis of prevalence and determinants. Eur. Psychiatry 2018, 53, 74–99. [Google Scholar] [CrossRef] [PubMed]
  248. Barish, G.; Marlotte, L.; Drayton, M.; Mogil, C.; Lester, P. Automatically Enriching Content for a Behavioral Health Learning Management System: A First Look. In Proceedings of the 9th World Congress on Electrical Engineering and Computer Systems and Science, London, UK, 3–5 August 2023. [Google Scholar] [CrossRef]
  249. Sharma, A.; Lin, I.W.; Miner, A.S.; Atkins, D.C.; Althoff, T. Human-AI Collaboration Enables More Empathic Conversations in Text-based Peer-to-Peer Mental Health Support. arXiv 2022, arXiv:2203.15144. [Google Scholar] [CrossRef]
  250. Lam, K. ChatGPT for low- and middle-income countries: A Greek gift? Lancet Reg. Health-West. Pac. 2023, 41, 100906. [Google Scholar] [CrossRef]
251. Dwivedi, Y.K.; Kshetri, N.; Hughes, L.; Slade, E.L.; Jeyaraj, A.; Kar, A.K.; Baabdullah, A.M.; Koohang, A.; Raghavan, R.; Ahuja, M.; et al. Opinion Paper: “So what if ChatGPT wrote it?” Multidisciplinary perspectives on opportunities, challenges and implications of generative conversational AI for research, practice and policy. Int. J. Inf. Manag. 2023, 71, 102642. [Google Scholar] [CrossRef]
  252. Shahsavar, Y.; Choudhury, A. User Intentions to Use ChatGPT for Self-Diagnosis and Health-Related Purposes: Cross-sectional Survey Study. JMIR Hum. Factors 2023, 10, e47564. [Google Scholar] [CrossRef]
  253. Rothman, B.; Slomkowski, M.; Speier, A.; Rush, A.J.; Trivedi, M.H.; Lawson, E.; Fahmy, M.; Carpenter, D.; Chen, D.; Forbes, A. Evaluating the Efficacy of a Digital Therapeutic (CT-152) as an Adjunct to Antidepressant Treatment in Adults With Major Depressive Disorder: Protocol for the MIRAI Remote Study. JMIR Res. Protoc. 2024, 13, e56960. [Google Scholar] [CrossRef]
  254. Mogk, J.; Idu, A.E.; Bobb, J.F.; Key, D.; Wong, E.S.; Palazzo, L.; Stefanik-Guizlo, K.; King, D.; Beatty, T.; Dorsey, C.N.; et al. Prescription Digital Therapeutics for Substance Use Disorder in Primary Care: Mixed Methods Evaluation of a Pilot Implementation Study. JMIR Form. Res. 2024, 8, e59088. [Google Scholar] [CrossRef]
  255. Oh, S.; Choi, J.; Han, D.H.; Kim, E. Effects of game-based digital therapeutics on attention deficit hyperactivity disorder in children and adolescents as assessed by parents or teachers: A systematic review and meta-analysis. Eur. Child Adolesc. Psychiatry 2024, 33, 481–493. [Google Scholar] [CrossRef]
  256. Habicht, J.; Viswanathan, S.; Carrington, B.; Hauser, T.U.; Harper, R.; Rollwage, M. Closing the accessibility gap to mental health treatment with a personalized self-referral chatbot. Nat. Med. 2024, 30, 595–602. [Google Scholar] [CrossRef] [PubMed]
  257. Davenport, N.D.; Werner, J.K. A randomized sham-controlled clinical trial of a novel wearable intervention for trauma-related nightmares in military veterans. J. Clin. Sleep Med. 2023, 19, 361–369. [Google Scholar] [CrossRef] [PubMed]
  258. Kong, H.; Moon, S. When LLM Therapists Become Salespeople: Evaluating Large Language Models for Ethical Motivational Interviewing. arXiv 2025, arXiv:2503.23566. [Google Scholar] [CrossRef]
259. Huang, Y.; Wang, W.; Zhou, J.; Zhang, L.; Lin, J.; Liu, H.; Hu, X.; Zhou, Z.; Dong, W. Integrative modeling enables ChatGPT to achieve average level of human counselors performance in mental health Q&A. Inf. Process. Manag. 2025, 62, 104152. [Google Scholar] [CrossRef]
  260. Hosseini, S.M.B.; Momeni Nezhad, M.J.; Hosseini, M.; Zolnoori, M. Optimizing Entity Recognition in Psychiatric Treatment Data with Large Language Models. Stud. Health Technol. Inform. 2025, 329, 784–788. [Google Scholar] [CrossRef]
  261. Leow, J.J.D.; Chua, H.N.; Jasser, M.B.; Issa, B.; Wong, R.T.K. Comparison of Depression Detection Between LLMs and Zero-Shot Learning Using DAD Dataset. In Proceedings of the 2025 21st IEEE International Colloquium on Signal Processing and Its Applications (CSPA), Pulau Pinang, Malaysia, 7–8 February 2025; IEEE: London, UK, 2025; pp. 295–300. [Google Scholar] [CrossRef]
  262. Gao, Y.; Li, R.; Croxford, E.; Caskey, J.; Patterson, B.W.; Churpek, M.; Miller, T.; Dligach, D.; Afshar, M. Leveraging Medical Knowledge Graphs Into Large Language Models for Diagnosis Prediction: Design Application Study. JMIR AI 2025, 4, e58670. [Google Scholar] [CrossRef] [PubMed]
  263. Ferrario, A.; Sedlakova, J.; Trachsel, M. The Role of Humanization and Robustness of Large Language Models in Conversational Artificial Intelligence for Individuals With Depression: A Critical Analysis. JMIR Ment. Health 2024, 11, e56569. [Google Scholar] [CrossRef]
264. Li, M.D.; Wang, Y. Editorial: Advances, opportunities and challenges of using modern AGI and AIGC technologies in depression and related disorders. Front. Psychiatry 2025, 16, 1625579. [Google Scholar] [CrossRef]
  265. Linardon, J.; Messer, M.; Anderson, C.; Liu, C.; McClure, Z.; Jarman, H.K.; Goldberg, S.B.; Torous, J. Role of large language models in mental health research: An international survey of researchers’ practices and perspectives. BMJ Ment. Health 2025, 28, e301787. [Google Scholar] [CrossRef] [PubMed]
  266. Qu, Y.; Du, P.; Che, W.; Wei, C.; Zhang, C.; Ouyang, W.; Bian, Y.; Xu, F.; Hu, B.; Du, K.; et al. Promoting interactions between cognitive science and large language models. Innovation 2024, 5, 100579. [Google Scholar] [CrossRef] [PubMed]
  267. Thazhath, M.B.; Michalak, J.; Hoang, T. Harpocrates: Privacy-Preserving and Immutable Audit Log for Sensitive Data Operations. arXiv 2022, arXiv:2211.04741. [Google Scholar] [CrossRef]
  268. Regueiro, C.; Seco, I.; Gutiérrez-Agüero, I.; Urquizu, B.; Mansell, J. A Blockchain-Based Audit Trail Mechanism: Design and Implementation. Algorithms 2021, 14, 341. [Google Scholar] [CrossRef]
  269. Diro, A.; Kaisar, S.; Saini, A.; Fatima, S.; Hiep, P.C.; Erba, F. Workplace security and privacy implications in the GenAI age: A survey. J. Inf. Secur. Appl. 2025, 89, 103960. [Google Scholar] [CrossRef]
  270. Doshi-Velez, F.; Kim, B. Towards A Rigorous Science of Interpretable Machine Learning. arXiv 2017, arXiv:1702.08608. [Google Scholar] [CrossRef]
  271. Gonzales, A.; Guruswamy, G.; Smith, S.R. Synthetic data in health care: A narrative review. PLoS Digit. Health 2023, 2, e0000082. [Google Scholar] [CrossRef]
  272. Liu, X.; Cruz Rivera, S.; Moher, D.; Calvert, M.J.; Denniston, A.K.; The SPIRIT-AI and CONSORT-AI Working Group; SPIRIT-AI and CONSORT-AI Steering Group; Chan, A.W.; Darzi, A.; Holmes, C.; et al. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: The CONSORT-AI extension. Nat. Med. 2020, 26, 1364–1374. [Google Scholar] [CrossRef]
  273. AI-READI Consortium; Writing Committee; Baxter, S.L.; De Sa, V.R.; Ferryman, K.; Jain, P.; Lee, C.S.; Li-Pook-Tha, J.; Liu, T.Y.A.; Owen, J.P.; et al. AI-READI: Rethinking AI data collection, preparation and sharing in diabetes research and beyond. Nat. Metab. 2024, 6, 2210–2212. [Google Scholar] [CrossRef] [PubMed]
  274. Taylor, N.; Kormilitzin, A.; Lorge, I.; Nevado-Holgado, A.; Cipriani, A.; Joyce, D.W. Model development for bespoke large language models for digital triage assistance in mental health care. Artif. Intell. Med. 2024, 157, 102988. [Google Scholar] [CrossRef]
  275. Babu, A.; Joseph, A.P. Digital wellness or digital dependency? a critical examination of mental health apps and their implications. Front. Psychiatry 2025, 16, 1581779. [Google Scholar] [CrossRef]
  276. Brega, J.R.F.; Rodello, I.A.; Dias, D.R.C.; Martins, V.F.; de Paiva Guimaraes, M. A virtual reality environment to support chat rooms for hearing impaired and to teach Brazilian Sign Language (LIBRAS). In Proceedings of the 2014 IEEE/ACS 11th International Conference on Computer Systems and Applications (AICCSA), Doha, Qatar, 10–13 November 2014; IEEE: London, UK, 2014; pp. 433–440. [Google Scholar] [CrossRef]
277. Levkovich, I.; Omar, M. Evaluating of BERT-based and Large Language Models for Suicide Detection, Prevention, and Risk Assessment: A Systematic Review. J. Med. Syst. 2024, 48, 113. [Google Scholar] [CrossRef] [PubMed]
  278. Wang, Y.; Wang, Y.; Xiao, Y.; Escamilla, L.; Augustine, B.; Crace, K.; Zhou, G.; Zhang, Y. Evaluating an LLM-Powered Chatbot for Cognitive Restructuring: Insights from Mental Health Professionals. arXiv 2025. [Google Scholar] [CrossRef]
  279. Nie, J.; Shao, H.; Fan, Y.; Shao, Q.; You, H.; Preindl, M.; Jian, X. LLM-based Conversational AI Therapist for Daily Functioning Screening and Psychotherapeutic Intervention via Everyday Smart Devices. arXiv 2024, arXiv:2403.10779. [Google Scholar] [CrossRef]
  280. Han, X. NeuroPal: A Clinically-Informed Multimodal LLM Assistant for Mental Health Combining Sleep Chronotherapy, Cognitive Behavioral Reframing, and Adaptive Phytochemical Intervention. arXiv 2025, arXiv:2505.06640. [Google Scholar]
  281. Abbasi, M.A.; Mirnezami, F.S.; Neshati, A.; Naderi, H. HamRaz: A Culture-Based Persian Conversation Dataset for Person-Centered Therapy Using LLM Agents. arXiv 2025, arXiv:2502.05982. [Google Scholar]
  282. Alhuzali, H.; Alasmari, A.; Alsaleh, H. MentalQA: An Annotated Arabic Corpus for Questions and Answers of Mental Healthcare. IEEE Access 2024, 12, 101155–101165. [Google Scholar] [CrossRef]
  283. Podichetty, J.T.; Silvola, R.M.; Rodriguez-Romero, V.; Bergstrom, R.F.; Vakilynejad, M.; Bies, R.R.; Stratford, R.E., Jr. Application of machine learning to predict reduction in total PANSS score and enrich enrollment in schizophrenia clinical trials. Clin. Transl. Sci. 2021, 14, 1864–1874. [Google Scholar] [CrossRef]
  284. Kroenke, K.; Spitzer, R.L.; Williams, J.B.W. The PHQ-9: Validity of a brief depression severity measure. J. Gen. Intern. Med. 2001, 16, 606–613. [Google Scholar] [CrossRef]
  285. Carrozzino, D.; Patierno, C.; Fava, G.A.; Guidi, J. The Hamilton Rating Scales for Depression: A Critical Review of Clinimetric Properties of Different Versions. Psychother. Psychosom. 2020, 89, 133–150. [Google Scholar] [CrossRef] [PubMed]
  286. Aiello-Puchol, A.; García-Alandete, J. A systematic review on the effects of logotherapy and meaning-centered therapy on psychological and existential symptoms in women with breast and gynecological cancer. Support. Care Cancer 2025, 33, 465. [Google Scholar] [CrossRef] [PubMed]
  287. Kho, J.J.; Song, S.; Tan, S.M.X.; Fitriyah, N.H.; Lokadjaja, M.C.; Yee, J.Y.; Yang, Z.; Chen, E.Y.H.; Lee, J.; Goh, W.W.B.; et al. Leveraging computational linguistics and machine learning for detection of ultra-high risk of mental health disorders in youths. Schizophrenia 2025, 11, 98. [Google Scholar] [CrossRef]
  288. Lekadir, K.; Frangi, A.F.; Porras, A.R.; Glocker, B.; Cintas, C.; Langlotz, C.P.; Weicken, E.; Asselbergs, F.; Prior, F.; Collins, G.S.; et al. FUTURE-AI: International consensus guideline for trustworthy deployable artificial intelligence in healthcare. BMJ 2025, 388, e081554. [Google Scholar] [CrossRef] [PubMed]
  289. Liu, J.; Yu, B. FLLMM: A Federated Large-small Language Model Collaboration Based Music Therapy for Mental Disease. In Proceedings of the 2024 3rd International Conference on Artificial Intelligence and Education, Xiamen, China, 22–24 November 2024; ACM: New York, NY, USA, 2024; pp. 724–729. [Google Scholar] [CrossRef]
  290. Cruz Rivera, S.; Liu, X.; Chan, A.W.; Denniston, A.K.; Calvert, M.J.; The SPIRIT-AI and CONSORT-AI Working Group; Darzi, A.; Holmes, C.; Yau, C.; Moher, D.; et al. Guidelines for clinical trial protocols for interventions involving artificial intelligence: The SPIRIT-AI extension. Nat. Med. 2020, 26, 1351–1363. [Google Scholar] [CrossRef]
  291. Vasey, B.; Nagendran, M.; Campbell, B.; Clifton, D.A.; Collins, G.S.; Denaxas, S.; Denniston, A.; Faes, L.; Geerts, B.; Ibrahim, M.; et al. Reporting guideline for the early-stage clinical evaluation of decision support systems driven by artificial intelligence: DECIDE-AI. Nat. Med. 2022, 28, 924–933. [Google Scholar] [CrossRef]
  292. Semenza, J.C.; Paz, S. Climate change and infectious disease in Europe: Impact, projection and adaptation. Lancet Reg. Health-Eur. 2021, 9, 100230. [Google Scholar] [CrossRef]
  293. Borger, T.; Mosteiro, P.; Kaya, H.; Rijcken, E.; Salah, A.A.; Scheepers, F.; Spruit, M. Federated learning for violence incident prediction in a simulated cross-institutional psychiatric setting. Expert Syst. Appl. 2022, 199, 116720. [Google Scholar] [CrossRef]
  294. Ringeval, F.; Schuller, B.; Valstar, M.; Gratch, J.; Cowie, R.; Scherer, S.; Mozgai, S.; Cummins, N.; Schmitt, M.; Pantic, M. AVEC 2017: Real-life Depression, and Affect Recognition Workshop and Challenge. In Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge, Mountain View, CA, USA, 23 October 2017; ACM: New York, NY, USA, 2017; pp. 3–9. [Google Scholar] [CrossRef]
  295. Kessler, R.C.; Berglund, P.; Demler, O.; Jin, R.; Merikangas, K.R.; Walters, E.E. Lifetime Prevalence and Age-of-Onset Distributions of DSM-IV Disorders in the National Comorbidity Survey Replication. Arch. Gen. Psychiatry 2005, 62, 593. [Google Scholar] [CrossRef] [PubMed]
  296. Wiegand, T.; Krishnamurthy, R.; Kuglitsch, M.; Lee, N.; Pujari, S.; Salathé, M.; Wenzel, M.; Xu, S. WHO and ITU establish benchmarking process for artificial intelligence in health. Lancet 2019, 394, 9–11. [Google Scholar] [CrossRef] [PubMed]
  297. Bolton, P.; West, J.; Whitney, C.; Jordans, M.J.D.; Bass, J.; Thornicroft, G. Expanding mental health services in low- and middle-income countries: A task-shifting framework for delivery of comprehensive, collaborative, and community-based care. Camb. Prism. Glob. Ment. Health 2023, 10, e16. [Google Scholar] [CrossRef]
Figure 1. Key factors affecting mental health. Adapted from [2].
Figure 2. PRISMA flow diagram of the study selection process.
Figure 3. End-to-end methodological workflow of the systematic review, highlighting LLM-related evidence extraction and synthesis.
Figure 4. Domain distribution of 205 included studies.
Table 1. Comparison of our review with existing studies.
| Criterion | Our Review | Existing Reviews |
| --- | --- | --- |
| 1. Total number of studies | 205 | 20–30 per review |
| 2. Scientific fields explored | Psychiatry, psychology, psychotherapy, mental health | Mostly mental health |
| 3. Empirical testing identification | Yes—explicit classification of studies (“Tested ≥ 1 LLM model”) | Not explicitly reported |
| 4. LLM details | GPT-4, GPT-4o, GPT-3.5, Claude, LLaMA, PaLM, Mistral, medical LLMs and more | Records the specific language models, model family, scale, and modalities used |
Table 2. Records identified per database, duplicates removed, and records screened.
| DATABASE | Records Identified | Duplicates Removed | Records Screened |
| --- | --- | --- | --- |
| PubMed | 169 | 52 | 117 |
| Scopus | 53 | 50 | 3 |
| IEEE Xplore | 106 | 33 | 73 |
| JMIR Mental Health | 552 | 68 | 484 |
| Springer | 66 | 34 | 32 |
| MDPI | 46 | 18 | 28 |
| ScienceDirect | 980 | 25 | 955 |
| Google Scholar | 10,700 | 222 | 10,478 |
| Total | 12,672 | 502 | 12,170 |
Table 3. Records excluded during initial screening.
| DATABASE | Records Excluded |
| --- | --- |
| PubMed | 17 |
| Scopus | 1 |
| IEEE Xplore | 50 |
| JMIR Mental Health | 315 |
| Springer | 25 |
| MDPI | 8 |
| ScienceDirect | 800 |
| Google Scholar | 10,354 |
| Total | 11,570 |
Table 4. PRISMA summary of study selection across screening stages.
| PRISMA Stage | Count |
| --- | --- |
| Records identified from databases | 12,672 |
| Duplicates removed | 502 |
| Records screened (title/abstract) | 12,170 |
| Records excluded (title/abstract) | 11,570 |
| Full-text articles assessed for eligibility | 600 |
| Full-text articles excluded (with reasons) | 395 |
| Studies included in the review | 205 |
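The counts in Tables 2–4 follow simple funnel arithmetic (records screened = records identified − duplicates removed, and so on down to the 205 included studies). The short Python sketch below, written for illustration and not part of the review’s published tooling, checks that the reported figures are internally consistent:

```python
# Illustrative consistency check for the PRISMA counts in Tables 2-4.
# All numbers are copied from the tables; only the checking logic is ours.

per_database = {
    # database: (records identified, duplicates removed, records screened)
    "PubMed": (169, 52, 117),
    "Scopus": (53, 50, 3),
    "IEEE Xplore": (106, 33, 73),
    "JMIR Mental Health": (552, 68, 484),
    "Springer": (66, 34, 32),
    "MDPI": (46, 18, 28),
    "ScienceDirect": (980, 25, 955),
    "Google Scholar": (10_700, 222, 10_478),
}

# Per-database arithmetic: screened = identified - duplicates.
for db, (identified, duplicates, screened) in per_database.items():
    assert identified - duplicates == screened, db

# Column totals should match the "Total" row of Table 2.
assert sum(i for i, _, _ in per_database.values()) == 12_672
assert sum(d for _, d, _ in per_database.values()) == 502
assert sum(s for _, _, s in per_database.values()) == 12_170

# PRISMA funnel (Table 4): screened -> excluded -> full-text -> included.
assert 12_170 - 11_570 == 600  # records advancing to full-text review
assert 600 - 395 == 205        # studies included in the review
print("All PRISMA counts are internally consistent.")
```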
Table 5. Coding and data extraction framework.
| Category | Data Item Extracted | Description/Purpose |
| --- | --- | --- |
| 1. Basic Study Information | Author(s), Year, Title, DOI | Identifies the study and provides essential bibliographic metadata. |
| 2. Study Type/Design | Empirical, technical studies, various types of reviews, RCT, observational, simulation, benchmark, qualitative, mixed-methods | Determines the methodological approach and the strength of evidence. |
| 3. Mental Health Domain | Psychiatry/Psychology/Psychotherapy/Mental health workforce | Used for thematic grouping and alignment with the research questions. |
| 4. LLM Model(s) | GPT-3.5, GPT-4, Claude, Gemini, LLaMA, Mistral, domain-specific models | Records the specific language models, model family, scale, and modalities used. |
| 5. Task Focus | Diagnosis, risk assessment, therapy support, CBT, MI, psychoeducation, documentation, benchmarking | Identifies the primary task addressed and characterizes the scope of LLM-based systems. |
| 6. Dataset/Source | Clinical notes, EHR, DAIC-WOZ, synthetic data, therapy transcripts | Evaluates data provenance, quality, and potential biases affecting results. |
| 7. Methods Used | Zero-shot and few-shot prompting; fine-tuning; RAG; model pipelines; evaluation workflows | Captures technical implementation details and methodological choices across studies. |
| 8. Evaluation Metrics | Accuracy, F1-score, sensitivity, specificity, AUC, BLEU, ROUGE, user satisfaction, expert ratings | Required for comparative analysis of model performance. |
| 9. Sample/Population | General adults; youth; clinicians; mixed populations | Assesses external validity and applicability of findings. |
| 10. Key Findings | Main outcomes and results | Summarizes each study’s contribution and performance. |
| 11. Limitations | Bias, hallucinations, dataset constraints, safety risks, cultural mismatch, limited duration | Enables critical synthesis across included studies. |
| 12. Long-Term or Real-World Use | Deployment; integration; workflow impact; long-term use | Describes real-world applicability, long-term feasibility, and clinical integration. |
| 13. Ethical and Safety Notes | Bias; safety issues; misuse risk; governance considerations | Records ethical and safety considerations essential for evaluating LLM deployment in mental health. |
| 14. Contribution to RQs | RQ1/RQ2/RQ3/RQ4/RQ5 | Maps each study to the relevant research questions and clarifies its analytical contribution. |
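To make the coding framework of Table 5 concrete, one extracted study can be represented as a typed record. The sketch below is our own illustration: the class, its field names, and all example values are hypothetical, not a schema published with the review.

```python
# Minimal sketch of the Table 5 coding framework as a typed record.
# Field names and the example values are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class ExtractionRecord:
    citation: str                  # 1. Basic study information (author, year, title, DOI)
    study_type: str                # 2. Study type/design
    domain: str                    # 3. Mental health domain
    llm_models: list[str]          # 4. LLM model(s) evaluated
    task_focus: str                # 5. Primary task focus
    dataset_source: str            # 6. Dataset/source
    methods: list[str]             # 7. Methods used (prompting, fine-tuning, RAG, ...)
    evaluation_metrics: list[str]  # 8. Evaluation metrics reported
    population: str                # 9. Sample/population
    key_findings: str              # 10. Key findings
    limitations: list[str]         # 11. Limitations
    real_world_use: str            # 12. Long-term or real-world use
    ethics_notes: str              # 13. Ethical and safety notes
    research_questions: list[str] = field(default_factory=list)  # 14. Contribution to RQs

# Hypothetical coded study, shown only to make the schema concrete:
example = ExtractionRecord(
    citation="Author et al., 2024 (placeholder DOI)",
    study_type="Cross-sectional empirical",
    domain="Psychiatry",
    llm_models=["GPT-4", "GPT-3.5"],
    task_focus="Depression screening from short texts",
    dataset_source="Synthetic vignettes",
    methods=["zero-shot prompting"],
    evaluation_metrics=["accuracy", "F1-score"],
    population="General adults",
    key_findings="Promising single-session screening performance",
    limitations=["small dataset", "no longitudinal follow-up"],
    real_world_use="Not deployed",
    ethics_notes="Bias and safety not assessed",
    research_questions=["RQ1", "RQ2"],
)
```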
Table 6. Aggregated empirical findings across mental health domains.

Mental Health Domain | Dominant LLM Architectures | Primary Task Focus | Predominant Methodological Approach | Key Reported Strengths | Recurring Limitations
Psychiatry | GPT-4, GPT-4o, GPT-3.5; BERT family (BioClinicalBERT, MentalBERT); LLaMA variants | Diagnosis support, risk assessment (suicide, depression), symptom detection, clinical decision support | Prompt-only pipelines (zero-/few-shot, CoT); limited fine-tuning; sparse use of RAG | High short-term diagnostic accuracy in controlled settings; competitive or clinician-level performance in structured benchmarks; strong performance in text-based symptom inference | Lack of longitudinal validation; small and culturally imbalanced datasets; limited robustness in high-risk scenarios; inconsistent calibration and explainability
Psychology | GPT-4, GPT-3.5; BERT, RoBERTa, DeBERTa; LLaMA, T5 (occasional) | Sentiment and emotion detection, stress and sleep prediction, psychometric inference, behavioral modeling | Hybrid approaches combining prompting, fine-tuning, and theory-informed pipelines; moderate use of RAG | Strong performance in language-based psychological signal detection; effective large-scale analysis of affective and behavioral data; promising psychometric prediction | Dataset artifacts and short time horizons; limited construct validity; inconsistent empathy and reasoning stability; minimal cross-cultural evaluation
Psychotherapy | GPT-4, GPT-3.5; BERT-derived baselines; task-specific fine-tuned models | CBT/DBT support, therapeutic dialogue generation, session summarization, reflective responding | Prompt-driven conversational agents; selective fine-tuning; minimal RAG integration | Fluent and structured therapeutic responses; improved accessibility and engagement; effective session summarization and guided self-help | Empathy variability; weak crisis handling; absence of multi-session or real-world clinical trials; safety and ethical concerns
Mental Health Workforce | GPT-4, GPT-3.5 | Clinical documentation, note summarization, training support, workflow assistance | Prompt-only assistance tools evaluated in pilot or feasibility studies | Reduced administrative burden; faster documentation and training content generation; clinician-perceived utility | Narrow task scope; limited empirical validation; no longitudinal deployment; unclear impact on clinical decision-making
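Because Table 6 identifies prompt-only pipelines (zero-/few-shot) as the predominant methodological approach across domains, a brief sketch of the distinction may be helpful. The task wording and example posts below are invented for illustration only; the resulting prompt strings are model-agnostic and could be sent to any chat-style LLM.

```python
# Illustrative zero-shot vs. few-shot prompt construction for text-based
# symptom detection, the approach most often reported in Table 6.
# The instruction, labels, and example posts are hypothetical.

TASK = ("Classify the post as 'depression-related' or 'not depression-related'. "
        "Answer with the label only.")

FEW_SHOT_EXAMPLES = [
    ("I haven't slept properly in weeks and nothing feels worth doing.",
     "depression-related"),
    ("Finally finished the marathon I trained all year for!",
     "not depression-related"),
]

def zero_shot_prompt(post: str) -> str:
    # Zero-shot: the instruction alone, no labeled demonstrations.
    return f"{TASK}\n\nPost: {post}\nLabel:"

def few_shot_prompt(post: str) -> str:
    # Few-shot: the same instruction preceded by labeled demonstrations.
    shots = "\n\n".join(f"Post: {p}\nLabel: {l}" for p, l in FEW_SHOT_EXAMPLES)
    return f"{TASK}\n\n{shots}\n\nPost: {post}\nLabel:"

print(few_shot_prompt("Lately I can't concentrate and I feel empty all day."))
```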
Table 7. Distribution of included LLM studies in psychiatry by study type.

Category | Number of Studies | % of Total
1. Primary Empirical | 44 | 34.10%
2. Cross-sectional Empirical | 3 | 2.30%
3. Comparative Empirical | 1 | 0.80%
4. Mixed (Bibliometric/Secondary Empirical) | 1 | 0.80%
5. Narrative/Literature/Perspective/Conceptual | 18 | 14.00%
6. Scoping Reviews | 6 | 4.70%
7. Systematic Reviews | 1 | 0.80%
8. Short/Rapid Reviews | 2 | 1.60%
9. Technical/Engineering Empirical Studies | 38 | 29.50%
10. Protocols/Methodology Papers | 10 | 7.80%
11. Other/Editorial/Opinion | 5 | 3.90%
Total | 129 | 100%
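The percentage columns in Tables 7, 9, 11, and 13 follow directly from each domain's study counts. The short sketch below reproduces the Table 7 column; counts are copied from the table, and the published figures may occasionally reflect slightly different rounding conventions.

```python
# Recomputes the "% of Total" column of Table 7 from the study counts.
# The same arithmetic underlies Tables 9, 11, and 13.
counts = {
    "Primary empirical": 44,
    "Cross-sectional empirical": 3,
    "Comparative empirical": 1,
    "Mixed (bibliometric/secondary empirical)": 1,
    "Narrative/literature/perspective/conceptual": 18,
    "Scoping reviews": 6,
    "Systematic reviews": 1,
    "Short/rapid reviews": 2,
    "Technical/engineering empirical studies": 38,
    "Protocols/methodology papers": 10,
    "Other/editorial/opinion": 5,
}
total = sum(counts.values())  # 129 psychiatry studies
for category, n in counts.items():
    print(f"{category}: {n} ({100 * n / total:.1f}%)")
```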
Table 8. Frequency distribution of LLMs used in included psychiatry studies.

Frequency Category | LLMs Identified in the Literature
Most Frequent | GPT-4, GPT-4 Turbo, GPT-4o, GPT-3.5, GPT-3, ChatGPT
Moderately Frequent | BERT, RoBERTa, DistilBERT, DeBERTa, SqueezeBERT, BioClinicalBERT, MentalBERT
Occasional | LLaMA-2, LLaMA-3, LLaMA-7B, LLaMAntino, T5, Flan-T5, BART
Rare/Specialized | Mello LLM, PSAT-LM, CureFun LLM, KG-enhanced LLMs, GPT-4o multimodal, NEUROPAL, hybrid Bayesian–Transformer systems
Table 9. Distribution of included LLM studies in psychology by study type.

Category | Number of Studies | % of Total
1. Empirical and Benchmark Experimental Studies | 18 | 64.30%
2. Technical/Engineering Studies (systems, datasets, simulations) | 4 | 14.30%
3. Conceptual, Theoretical, and Perspective Papers | 3 | 10.70%
4. Narrative and Literature Reviews | 2 | 7.10%
5. Systematic and Scoping Reviews | 1 | 3.60%
Total | 28 | 100%
Table 10. Frequency of LLM usage in psychology studies.

Frequency Category | LLMs
Most Frequent | GPT-4, GPT-3.5, GPT-4o
Moderately Frequent | BERT, RoBERTa, DeBERTa, DistilBERT, BioBERT, MentalBERT, Mental Longformer
Occasional | LLaMA, LLaMA-2, LLaMA-3, Flan-T5, T5
Rare/Specialized | BERTopic + embeddings, hybrid BERT + Bayesian models, multilingual GPT-based sentiment models
Table 11. Distribution of included LLM studies in psychotherapy by study type.

Category | Number of Studies | % of Total
1. Empirical, Experimental, and Observational Studies | 16 | 37.20%
2. Pilot and Feasibility Studies | 5 | 11.60%
3. Technical/Engineering Studies (LLM systems, chatbots, and frameworks) | 6 | 14.00%
4. Conceptual, Narrative, Perspective, and Ethics Papers | 10 | 23.30%
5. Systematic, Scoping, and Meta-Synthesis Reviews | 4 | 9.30%
6. Case and Qualitative Studies | 2 | 4.60%
Total | 43 | 100%
Table 12. Frequency of LLMs used in psychotherapy studies.

Frequency | LLMs
Most Frequent | GPT-4, GPT-3.5, GPT-4o
Moderately Frequent | BERT, mBERT, MentalBERT, Mental Longformer
Occasional | LLaMA-2, LLaMA-3, T5, Flan-T5, BART
Rare/Specialized LLM Systems | ChatPal, Socrates 2.0, therapy-summarization LLMs, emotion-fusion LLMs, VR GPT-4 avatars, robot-assisted LLMs (SAR)
Table 13. LLM studies for the mental health clinical workforce.

Category | Number of Studies | % of Total
1. Empirical and Observational Studies | 2 | 40%
2. Pilot and Feasibility Studies | 1 | 20%
3. Technical or System-Development Studies | 1 | 20%
4. Conceptual, Framework, and Ethics Papers | 1 | 20%
Total | 5 | 100%
Table 14. LLMs used in clinical mental health studies.

Frequency | LLMs
LLMs Used | GPT-4, GPT-3.5, GPT-4 Turbo
Table 15. Ethical, clinical, and technological frameworks for responsible adoption of LLMs in mental health.

Dimension | Frameworks | Framework Scope
Ethical | GenAI4MH; ROBINS-I; Responsible AI (Ethics of Care perspective) | Ensure fairness, transparency, inclusivity, and protection of human values in AI adoption. Provide governance principles and safeguard against bias, misuse, and cultural misalignment risks.
Clinical | Thera-Turing Test; BOLT; READI (Readiness Evaluation for AI-Mental Health Deployment and Implementation) | Evaluate therapeutic dialogue quality, computational behavioral assessment, and readiness for deployment in psychiatric/psychological contexts. Support evidence-based validation and safety in clinical use.
Technological | PSAT (ProcesS knowledge-infused cross-attention framework); CureFun | Integrate clinical guidelines into model design, improve explainability, enable simulation-based decision support, and enhance the reliability of LLMs in mental health applications.