Article

LLMs Underperform on Classifying Anxiety and Depression Using Therapy Conversations: A First-Step Benchmark

1 Department of Statistics, Stanford University, Stanford, CA 94305, USA
2 Department of Medicine, University of California, San Francisco (UCSF), San Francisco, CA 94143, USA
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Appl. Sci. 2026, 16(7), 3388; https://doi.org/10.3390/app16073388
Submission received: 5 February 2026 / Revised: 2 March 2026 / Accepted: 4 March 2026 / Published: 31 March 2026

Abstract

Anxiety and depression are among the most prevalent mental health conditions worldwide. Early and accurate automated detection from naturalistic conversations (e.g., those recorded with a remote chatbot) could eventually improve screening and, in turn, access to timely care. As a first step towards this goal, we aim to evaluate the efficacy of both traditional machine learning and large language models (LLMs) in classifying anxiety and depression from psychotherapy sessions using labels derived from clinician-annotated session metadata reflecting the primary presenting psychiatric concerns. While psychotherapy transcripts do not reflect the real-world domain of remote naturalistic conversation, we conduct this analysis as an “easy” starting point towards the eventual goal of building generalizable, clinician-assistive models that can infer mental health status from unstructured, non-directive conversations captured in the home setting as part of a remote digital assessment process. LLM underperformance on a psychotherapy benchmark would indicate that LLMs are most likely not yet ready to advance towards mental health classifications in more complex and less structured contexts, such as from remote conversations with a chatbot or family member. To study whether LLMs can classify anxiety and depression from psychotherapy transcripts, we fine-tuned both established transformer models (BERT, RoBERTa, Longformer) and more recent large models (Mistral-7B), trained a Support Vector Machine using engineered features, and assessed prompting GPT chatbots. We observe that (1) all machine learning approaches perform poorly and (2) state-of-the-art models fail to improve multi-label classification performance relative to traditional machine learning methods, indicating the current limitations of using LLMs for classification of psychiatric diagnoses from unstructured patient text as of 2026.

1. Introduction

With the rise of Large Language Models (LLMs) and their ability to facilitate conversations, researchers have been inspired to explore their potential for providing automated healthcare solutions [1,2,3,4,5]. With the increasing prevalence of mental health concerns globally, machine learning (ML) more broadly and LLMs in particular have the potential to detect mental health issues in a scalable and accessible manner.
There has been a recent increase in the use of natural language processing (NLP) techniques for classification of mental health conditions, with a majority of such research historically employing traditional ML methods like Support Vector Machines (SVMs), decision trees, and random forests [6,7,8]. With advancements in deep learning (DL) techniques and the advent of LLMs, neural networks that rely less on feature engineering and capture more syntactic nuance are increasingly being considered and evaluated for a wide range of mental health applications [9].
In this study, we aim to evaluate the efficacy of applying DL techniques for detecting anxiety and depression from long counseling and psychotherapy transcripts. The classification task relies on clinician-annotated session metadata reflecting primary presenting psychiatric concerns rather than formal or independently validated diagnoses, yielding a clinically-anchored, dialogue-level benchmark intended as a first step towards generalizable, clinician-assistive models for mental health inference from unstructured, non-directive real-world conversations. Given the recent success of LLMs on a wide range of language tasks and their ability to model longer context windows, we hypothesized that these models would outperform traditional ML approaches.
A novel aspect of this study is the use of psychotherapy transcripts to evaluate classification performance. Most studies utilize short or mixed-quality web text corpora such as tweets and Reddit posts as surrogates for mental health states [10,11,12,13,14]. In contrast, psychotherapy transcripts provide long-form, clinician-guided dialogue centered on the patient’s symptoms and treatment goals, yielding a higher signal-to-noise ratio and clinically grounded yet naturalistic conversational data. We evaluate LLM diagnostic performance on therapy session transcripts, which are clinically oriented and therefore should be easier to classify, to see how well LLMs can perform in a clearer setting as a stepping stone towards the eventual translational goal of applying them to messier, everyday speech.
For clinical data, most LLM-based studies have utilized electronic health records (EHRs) [15,16,17]. However, applying LLMs to naturalistic, clinically oriented conversations for mental health diagnosis is relatively underexplored (at least at the time when we conceptualized this project). Part of the bottleneck lies in the neural architectures’ limitation on input token sizes, and recent studies have suggested methods such as text segmentation, sliding windows, and architectural adjustments like incorporating convolutional layers to address this issue [18,19,20]. Unlike structured EHR notes, which are often curated summaries that may already contain clinician interpretation or diagnostic information, conversational transcripts more closely resemble real-world dialogue settings in which automated remote digital assessment systems would ultimately need to operate.
Thus, it is important to assess whether recent LLM architectures, especially those with extended context windows, can offer practical gains in modeling long, clinical conversations. Doing so not only helps validate their readiness for downstream psychiatric applications but also clarifies whether their increased complexity and resource demands are justified relative to simpler alternatives.

2. Materials and Methods

2.1. Dataset and Preprocessing

This study utilizes data sourced from Alexander Street Press, Counseling and Psychotherapy Transcripts, Volumes I and II, acquired from the Stanford Library [21,22]. The dataset comprises de-identified plain-text transcripts of therapy sessions addressing a diverse array of mental health issues with various therapeutic approaches. The sessions are conducted between a client and a mental health professional, with the conversation centered around the client’s diagnostic profile, making the dataset ideal for obtaining an upper bound of model performance in mental state detection using conversational texts. We accessed the data on 30 September 2023. The authors did not have access to information that could identify individual participants during or after data collection.
We applied several preprocessing steps to the transcripts: (1) non-ASCII characters were systematically eliminated; (2) descriptive elements such as “chuckles” and “laughter” were removed to restrict our discussion to natural language features; and (3) explicit mentions of the symptom words (e.g., “anxiety,” “depression”) were removed to avoid directly associating symptom labels with classification outcomes. The preprocessed dataset consisted of 3503 session records. Subsequently, we divided the 3503 psychotherapy sessions into an 80% training set and a 20% evaluation set for performance assessment. An overview of how the dataset is used for downstream model training is presented in Figure 1.
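The three preprocessing steps above can be sketched as follows. This is a minimal illustration, not the exact pipeline used: the bracketed-descriptor formats, the descriptor list, and the symptom-word patterns are assumptions for demonstration.

```python
import re

def preprocess_transcript(text, symptom_words=("anxiety", "depression")):
    """Sketch of the three cleaning steps described above (illustrative rules)."""
    # (1) systematically eliminate non-ASCII characters
    text = text.encode("ascii", errors="ignore").decode("ascii")
    # (2) remove descriptive elements; here assumed to appear in () or [] brackets
    text = re.sub(r"[\(\[](?:chuckles|laughter|laughs|silence)[\)\]]", " ", text, flags=re.I)
    # (3) mask explicit symptom mentions to avoid label leakage
    pattern = r"\b(?:" + "|".join(symptom_words) + r")\w*\b"
    text = re.sub(pattern, " ", text, flags=re.I)
    # collapse whitespace left behind by the removals
    return re.sub(r"\s+", " ", text).strip()
```

An 80/20 train/evaluation split of the resulting 3503 cleaned records can then be made with any standard splitting utility.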

2.2. Response Variables and Metrics

For each transcript in the dataset, the metadata includes a psychological issue column that labels the primary reason for the session. These labels reflect clinician-annotated session metadata indicating the primary presenting concern rather than independently verified formal clinical diagnoses. We treat this label as the ground-truth response variable. There are over 60 issues, but most suffer from severe class imbalance: in the majority, fewer than 5% of transcripts receive a positive label. We therefore focus on anxiety and depression, the most prevalent categories in the dataset. Each transcript may be labeled as having anxiety, depression, both, or neither. Accordingly, we formulate this task as a multi-label classification problem.
To evaluate the model performance, we employ accuracy, weighted F1, and weighted AUROC score. In the multi-label situation, a sample is considered correctly classified only when labels for both anxiety and depression are accurate. Accuracy is defined as the number of correct predictions divided by the total number of samples. For both anxiety and depression, the dataset contains more samples without the symptom than with the symptom. To address this class imbalance, we used the weighted F1 and weighted AUROC to assess model performance. More details for the evaluation metrics are described in Table A1.
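The exact-match accuracy and the positive-count-weighted averaging described above can be sketched as follows; the function and variable names are our own, and the weighting scheme follows the description in Table A1.

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

def multilabel_metrics(y_true, y_pred, y_score):
    """y_true, y_pred, y_score: (n, 2) arrays, columns = [anxiety, depression]."""
    # exact-match accuracy: a sample counts only if BOTH labels are correct
    acc = np.mean(np.all(y_true == y_pred, axis=1))
    # per-symptom metrics, averaged with weights = number of positive samples
    weights = y_true.sum(axis=0)
    f1 = np.average(
        [f1_score(y_true[:, k], y_pred[:, k]) for k in range(2)], weights=weights
    )
    auroc = np.average(
        [roc_auc_score(y_true[:, k], y_score[:, k]) for k in range(2)], weights=weights
    )
    return acc, f1, auroc
```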

2.3. ML Baselines

We established a performance baseline for traditional ML methods using radial basis kernel Support Vector Machines (RBF SVMs) with a feature matrix composed of normalized, stemmed Bag-of-Words (BoW) counts and features derived from established linguistic dictionaries. These included the average concreteness score, which evaluates the degree to which a word describes a perceptible concept, and eight basic emotions and sentiments, such as anger, sadness, and fear, calculated per sentence and averaged over each document [23,24]. The final feature matrix consists of 30,770 features.
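A minimal sketch of this baseline is shown below. The documents and labels are hypothetical, the vectorizer uses plain tokens rather than the stemmed vocabulary of the paper, and the dictionary-derived concreteness and emotion features would be appended to the BoW matrix in practice.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.svm import SVC

# Hypothetical transcripts and multi-label targets [anxiety, depression]
docs = [
    "i worry all the time",
    "i feel sad and empty",
    "work was fine today",
    "worry and sadness together",
]
y = [[1, 0], [0, 1], [0, 0], [1, 1]]

# BoW counts, L2-normalized, then one RBF SVM per label (multi-label setup)
model = make_pipeline(
    CountVectorizer(),
    Normalizer(),
    MultiOutputClassifier(SVC(kernel="rbf")),
)
model.fit(docs, y)
preds = model.predict(["i worry constantly"])  # shape (1, 2): [anxiety, depression]
```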

2.4. Transformer Finetuning

We approached DL methods under two schemas: truncation, and sub-document slicing followed by pooling. The truncation method processes the first 512 or 4096 tokens only (i.e., the longest accepted sequence length) from each sample to fine-tune multi-label classifiers using BERT, RoBERTa, and Longformer models [25,26,27]. These models were selected for their unique training and attention mechanisms, which are well-suited to handling varying text lengths and complexities. Additionally, we explored the efficacy of sub-document slicing and pooling by employing a boosting methodology in which classification results from sliced sub-samples were pooled, and either a majority vote or an OR construction (any true sub-document makes the entire document true) was used to determine the final predictions. Each model featured a classification head with linear, dropout, and tanh layers to produce logit outputs, facilitated by the PyTorch 2.2 framework and leveraging pre-trained models from Hugging Face [28,29]. Furthermore, we also fine-tuned the Mistral-7B-v0.1 (referred to below as Mistral) model with qLoRA [30,31], building on an existing open-source implementation for this multi-label task [32]. With qLoRA, we train only 0.58% of the total parameters, leveraging the pretrained model's capabilities while injecting task-specific information into the model. Mistral accepts longer sequence lengths and has significantly more parameters, ensuring more comprehensive coverage of transcript information while introducing more complexity. With Mistral, we truncated each document up to 8192 tokens, covering the complete transcript for most samples. The training process is illustrated in Figure 1 above.
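The sub-document slicing and the two pooling rules (majority vote and OR construction) can be sketched as follows; the window size and function names are illustrative assumptions, and in practice each slice would be classified by the fine-tuned transformer before pooling.

```python
def slice_tokens(tokens, window=512, stride=512):
    """Split a long token sequence into fixed-size sub-documents."""
    return [tokens[i:i + window] for i in range(0, len(tokens), stride)] or [tokens]

def pool_predictions(sub_preds, method="or"):
    """Combine per-slice binary predictions into one document-level label."""
    if method == "or":
        # OR construction: any positive slice makes the whole document positive
        return int(any(sub_preds))
    if method == "majority":
        # majority vote across slices
        return int(sum(sub_preds) * 2 > len(sub_preds))
    raise ValueError(f"unknown pooling method: {method}")
```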

2.5. GPT Evaluations

The advent of artificial intelligence (AI) systems facilitated by LLMs suggests exciting new possibilities in breaking down complex psychological texts using multi-billion-scale models. We explored the current capabilities of GPT-family models, including GPT-3.5-turbo-0125 (referred to below as GPT-3.5), GPT-4-turbo-2024-04-09 (referred to below as GPT-4), and GPT-4o-2024-05-13 (referred to below as GPT-4o), in classifying psychological symptoms from text transcripts through prompting using API access. We truncated transcripts to the maximum token size each GPT model allows at the time we accessed these models. More details can be found in Table A2. For the two newer models, GPT-4 and GPT-4o, the input token size limit successfully covers the entirety of many therapy session transcripts. Performance is assessed from two aspects: accuracy and stability. Accuracy is meant to assess whether GPT models can provide accurate psychiatric classifications, while stability is meant to evaluate models’ classification consistency. For accuracy assessment, 200 randomly selected transcripts are subject to binary classification of anxiety and depression under a designed prompt and function calling that restricts the output to binary (“Positive” for symptom present; “Negative” for symptom not present). Results are then pooled for multi-label accuracy. For stability assessment, a transcript is randomly sampled, and this single transcript is classified by each of the three GPT models under the same prompt 200 times. Each time, the multi-label classification outcome is recorded by pooling the labels for anxiety and depression together. This method serves as a parallel to concerns in psychiatry, where even human experts may deliver inconsistent labels for the same patient input, and we aim to evaluate whether current LLMs can achieve consistent classifications. A schematic of the prompt and parameters used in querying GPT models can be found in Figure 2.
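The stability assessment, which pools the two binary outputs into one of four multi-label combinations over repeated calls, can be sketched as follows. The run counts here are hypothetical stand-ins; in the actual experiment each run is a GPT API call on the same transcript.

```python
from collections import Counter

def pool_labels(anxiety, depression):
    """Pool two binary outputs into one of four multi-label combinations."""
    return (anxiety == "Positive", depression == "Positive")

# Hypothetical repeated classifications of one transcript (200 calls in the paper)
runs = [("Positive", "Negative")] * 150 + [("Negative", "Negative")] * 50
counts = Counter(pool_labels(a, d) for a, d in runs)

# A perfectly stable model concentrates all mass on a single combination
stability = max(counts.values()) / len(runs)
```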

3. Results

Table 1 summarizes the performance of each ML baseline and DL model. Our results suggest that DL approaches, including both truncation and sub-document slicing, do not show clear performance advantages over the traditional ML method. The DL systems show only modest numerical differences in accuracy relative to the SVM baseline (approximately 0.1 at most), and the performance disparities between traditional ML and DL approaches are not substantial, despite slight improvements in F1 and AUROC. This indicates that traditional ML models with meticulous feature engineering, which borrows from prior psychological research, may still achieve results comparable to large neural networks on this task. This finding contradicts our hypothesis that neural networks, which convert words into word embeddings, would better capture word-level patterns and sentence-level semantics. A ranking of model performances in each evaluation metric used is available in Figure 3.
Additionally, attention mechanisms that allow the model to associate words with their prior contexts did not seem to boost neural networks’ performance. However, the characteristics of lengthy texts may render the classification task challenging for neural network-based models, as these models must infer a close-to-binary answer from a vast amount of information that may contain noise. The performance of our DL models is consistent with existing literature, where a fine-tuned Clinical-Longformer tailored to long document classification tasks achieved an F1-score of 0.484 and an AUROC score of 0.762 in predicting acute kidney injury using electronic health records [33].
Although fine-tuning Mistral typically yields desirable results within 10 epochs, on this mental health task the model showed no significant learning throughout the training process. The best model also did not outperform SVM or the other DL approaches. The underperformance of Mistral has been reported in the literature, where it yielded only slightly higher performance than the baseline model on a stance classification problem regarding climate change activism [34]. The learning curve for Mistral is included in Appendix B for reference (Figure A1).
In an attempt to further improve model performance, we manually examined a random sample of 100 cases where DL models failed to produce correct labels. Since the clinical sessions are lengthy and available only upon request, we are not able to include specific examples. In summary, we observe that the failure cases often mix everyday small talk with discussion of the client's problems, or involve discussion of other people's issues, such as a family member suffering from mental health problems. Therefore, one possible explanation for the observed performance pattern is sparse information spread over a long context.
To test our hypothesis, we conducted further experiments to fine-tune the BERT model on a random segment of 512 adjacent tokens from each document and altered the pooling mechanism to OR construction. The results, illustrated in Table 2, suggest no difference compared to the truncation baseline, which may imply that information in a therapy session is uniformly scattered throughout the conversation. This aligns with previous literature on emotion classification, where limited improvement of ensemble methods on deep neural network classifiers was found [35,36]. Specifically, simple ensemble methods with deep neural networks may help to digest longer text but fail to provide substantial performance improvement. These results from various DL experiments reinforce our main finding that DL methods do not demonstrate consistent advantages over traditional approaches in this context.
Lastly, our explorations with GPTs (Figure 4) validated that complex models do not necessarily yield superior performance, specifically on mental health tasks that involve discerning emotions from scattered conversations. We assessed GPT models on both accuracy and stability, as both diagnostic accuracy and consistency are crucial if LLMs are to be integrated into clinical workflows. When given 200 randomly selected transcripts for multi-label classification, both GPT-3.5 and the newer GPT-4o achieved accuracy significantly lower than that of the other ML/DL models (Figure 4). Additionally, when tasked with repeatedly classifying the same transcript, the older GPT-3.5 model was highly unstable, with its predictions split across two of the four possible combinations of anxiety and depression diagnoses (Figure 4). Interestingly, this instability appears to be resolved in GPT-4o, with all 200 predictions yielding the same multi-label combination, which is indeed the true combination for that selected transcript (Figure 4). GPT-4 accuracy and stability were not reported due to a striking inconsistency observed in its prompting process. Under the same prompting scheme, function-calling strategy, and LLM parameter settings, the GPT-4 model was unable to provide a definitive classification outcome for many of the transcripts. One failure mode was ignoring the binary output restriction and returning both labels for a transcript, thereby classifying it as both suggesting and not suggesting the symptom of a mental health condition, which is clearly inconclusive. A few failed API calls to GPT-4 also involved the model echoing the input transcript verbatim, deviating entirely from the task of classifying psychological symptoms. Consequently, we could not obtain reliable accuracy and stability measures for the GPT-4 model.
Nevertheless, the low accuracy of even one of the newest GPT models suggests that current user-facing language models may not yet be capable of effectively performing psychiatric classification at this level of complexity, and variance in the response needs to be taken into consideration when using such models for clinically related tasks.

4. Discussion

4.1. Overview

Our findings indicate that LLMs and fine-tuned neural networks do not show clear or consistent performance advantages over traditional ML models in classifying lengthy therapeutic transcripts into mental state labels. The underperformance of LLMs on this diagnosis-centric, dialogue-level benchmark reveals that, as of 2026, current LLMs face challenges in automated mental state detection even in this idealized, clinically focused conversational setting. We evaluated several methods, including training an RBF-SVM model with feature engineering incorporating prior psychological knowledge, fine-tuning popular transformer architectures and more recent LLMs, and prompting LLMs to assess their ability to extract psychological states from long, complex psychotherapy transcripts. Techniques such as boosting through a majority vote and OR construction did not significantly improve model performance, and larger models also did not outperform smaller transformer-based models at classifying mental health labels from long conversational texts.
Because psychotherapy transcripts are diagnosis-relevant but remain conversational and context-rich, these results likely serve as an initial reference point for current models’ capability in detecting mental health labels from complex conversations. Although therapy conversations focus on mental health concerns, their length, narrative structure, and conversational noise mean they do not necessarily constitute an easier task than structured clinical documentation such as EHR notes. The underperformance of LLMs in classifying diagnosis-grounded transcripts suggests that they might not be well-suited for clinical use in less-structured or differently structured conversational settings, including everyday speech. Traditional ML methods, such as SVMs augmented with dictionary mappings from prior psychological research, could still be preferable due to lower computational requirements and greater analytical transparency compared to LLMs, which are crucial for diagnostic ML. Given the importance of model interpretability in helping clinicians understand the extracted information and the reasoning behind predictions [6], this relative transparency makes traditional ML approaches with careful feature engineering more practical in some clinical settings compared to DL and LLMs, particularly where understanding model behavior is important.
It is important to note that in interpreting these findings, ethical aspects relevant to automated mental health detection should be considered. Issues such as the sensitivity of psychotherapy transcripts for patient privacy, the risk of over-relying on automated outputs, and the importance of explainability for interpretability are central to how such results should be contextualized for potential clinical use, as pointed out by many recent works in real-life LLM applications [37,38,39].

4.2. Limitations

It is important to acknowledge several limitations in our study. Firstly, we evaluated the models using a single dataset. This could limit the generalizability of our findings to other psychotherapy transcripts or similar datasets. Although the dataset is large and of high transcription quality, it may not fully represent the diversity of demographics, symptoms, and therapeutic approaches present in broader mental health care. Additionally, the dataset was collected more than a decade ago. Shifts in the mental health landscape, evolving clinical practices, and changes in everyday language use may influence how present-day patients and clinicians communicate, potentially affecting model performance.
Another limitation is the absence of independently validated clinical diagnostic labels. Our labels are based on clinician-provided metadata indicating the primary psychological issue discussed in each session, rather than independently verified clinical diagnoses; however, these labels still provide a clinically informed reference point. Because psychotherapy conversations may involve comorbid or exploratory discussions, assigning a single primary issue can introduce label noise, as transcripts may contain meaningful content related to multiple conditions. A direct comparison with practicing clinicians, by evaluating the same transcripts, would provide a more meaningful benchmark. Despite this uncertainty, evaluating model behavior in such conditions remains informative. Moreover, even trained professionals may find it difficult to make reliable diagnostic judgments from transcripts alone, as real-world assessments typically draw on multimodal cues and repeated interactions.
Our analysis also focuses solely on English-language transcripts, limiting applicability to non-English-speaking populations and overlooking cultural-linguistic variation. Future work could explore multilingual approaches to improve robustness and cross-cultural relevance.
We also removed explicit diagnostic keywords (e.g., “anxiety” and “depression”) during preprocessing to reduce potential label leakage. While this encourages learning beyond explicit mentions, it may have reduced the semantic cues leveraged by Transformer-based models and could partially influence the observed LLM performance. Future work should examine the effect of preprocessing choices through controlled ablation studies comparing models trained with and without keyword filtering. This could help clarify whether performance differences stem from model limitations or altered linguistic context.
Finally, the rapid evolution of LLMs means that our findings are based on the most recent and popular models available at the time of the study. Future research is needed to continuously analyze new models as they emerge. Practical implementation challenges, such as integrating these models into existing workflows and ensuring user acceptance, also need to be addressed.

4.3. Future Work

Despite our negative results with LLMs, there are several promising directions for future research. First, expanding to datasets from multiple sources, capturing a wider range of therapeutic styles, clinician approaches, populations, and cultural contexts, would strengthen generalizability and test whether the observed trends hold across diverse settings. It is worthwhile to investigate additional DL architectures other than Transformer-based models to potentially improve performance on this task [40].
Second, expanding our framework to encompass a broader spectrum of mental health labels presents a compelling direction that would make classification more clinically relevant. The original dataset comprises a diverse array of over 60 mental health indicators, ranging from suicidal intent and sleep disturbances to hallucinations and mania, among others. Leveraging our framework to develop a more robust, symptom-rich multi-label classification system could help capture the interdependencies among psychological issues, potentially yielding more robust predictive performance and offsetting the instability of current DL models trained on very long text [6]. This could also deepen understanding of the relationships between various mental health manifestations.
Third, moving beyond text-only input to incorporate multimodal data such as audio or video recordings could better capture the nuances of psychological states. Mental health issues such as anxiety and depression are often reflected in body language and state of mind, which are difficult to capture from textual inputs alone [41]. While our dataset contained indications such as laughter or silence, we removed them during preprocessing to limit our input to natural language only. Vision-Language Models, however, can accept visual modalities such as facial expression and body language, which have shown promise in other precision brain health domains, such as digital autism diagnostics [42,43,44]. This multimodal approach could better capture the nuances of psychological states [6], potentially leading to more accurate and reliable assessments. Furthermore, establishing a human-in-the-loop approach that combines human expertise with language models and DL [45,46] has the potential not only to improve predictive performance but also to ensure ethical safeguards such as accountability, transparency, and responsible deployment.
Finally, in addition to these technical directions, future work should also address clinical priorities in model evaluation. In mental health classification, a false negative, meaning missing a patient in need, can be far more consequential than a false positive, which may simply result in additional low-risk assessment. Approaches such as cost-sensitive learning or threshold tuning could be applied to prioritize sensitivity where clinical safety is paramount [47].
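As one concrete instance of such threshold tuning, a sensitivity floor can be enforced by scanning candidate decision thresholds and keeping the most conservative one that still meets the recall requirement. This is a sketch under an assumed 0.95 sensitivity floor, not part of our experiments.

```python
import numpy as np

def tune_threshold(y_true, y_score, min_recall=0.95):
    """Return the largest threshold whose recall still meets the floor,
    prioritizing sensitivity (fewer false negatives) over precision."""
    thresholds = np.unique(y_score)[::-1]  # scan from strictest to loosest
    for t in thresholds:
        pred = (y_score >= t).astype(int)
        tp = int(((pred == 1) & (y_true == 1)).sum())
        fn = int(((pred == 0) & (y_true == 1)).sum())
        recall = tp / max(tp + fn, 1)
        if recall >= min_recall:
            return float(t)
    # fall back to the loosest threshold if the floor is unreachable
    return float(thresholds[-1])
```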
Overall, mental health diagnostics have the potential to be rapidly transformed by advances in AI. With the incorporation of richer datasets and multimodal features, continued progress in benchmark performance, such as on psychotherapy transcripts, can lay the foundation for the future development of safe and effective automated mental state diagnostic systems using more challenging benchmark datasets.

Author Contributions

J.S. performed the RoBERTa model fine-tuning and GPT prompting analysis and participated in the paper write-up. S.M. fine-tuned and analyzed the BERT and Mistral-7B models and participated in the paper write-up. Y.F. performed the Longformer fine-tuning, created figures, and participated in the paper write-up. P.W. provided supervision and manuscript editing. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data analyzed in this study are publicly available, but restrictions apply to the availability of these data. Due to access permissions from Stanford Libraries, we are not authorized to reshare this data. However, they are available at https://redivis.com/datasets/4ew0-9qer43ndg (accessed on 30 September 2023) and https://redivis.com/datasets/9tbt-5m36b443f (accessed on 30 September 2023) upon reasonable request and with permission of Stanford Libraries. The repository containing the code used during the current study can be found at the following GitHub link: https://github.com/ivysun14/Mental-Health-Prediction (accessed on 16 November 2025).

Acknowledgments

We would like to thank Stanford Libraries for providing access to the dataset used in this study.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AUROC: Area under the receiver operating characteristic curve
BERT: Bidirectional encoder representations from transformers
RoBERTa: Robustly optimized bidirectional encoder representations from transformers

Appendix A

Table A1. Selected evaluation metrics used to assess model performance. A sample is considered “accurately classified” only when the model classification of both anxiety and depression is correct. To calculate the weighted F1 score and weighted AUROC, we first calculate the metric for each individual symptom and then find the weighted average of metrics, with the weight being the number of samples with a positive diagnosis for each symptom.
Metric | Definition
Accuracy | The number of samples with both anxiety and depression predicted correctly divided by the total number of samples.
Precision | The number of true positives divided by the sum of true positives and false positives.
Recall | The number of true positives divided by the sum of true positives and false negatives.
F1 Score | Two times precision multiplied by recall, divided by the sum of precision and recall.
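The metric definitions in Table A1 can be sketched as plain functions. This is an illustrative sketch with made-up predictions, not the study's evaluation code; the weighted AUROC follows the same weighting scheme as the weighted F1 shown here.

```python
# Illustrative implementation of the Table A1 metrics. Labels are 0/1 for
# each of the two symptoms (anxiety, depression).

def precision_recall_f1(y_true, y_pred):
    """Binary precision, recall, and F1 from parallel 0/1 lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def subset_accuracy(pairs_true, pairs_pred):
    """A sample counts as correct only if BOTH labels match (Table A1)."""
    return sum(t == p for t, p in zip(pairs_true, pairs_pred)) / len(pairs_true)

def weighted_f1(true_by_label, pred_by_label):
    """Per-label F1, averaged with weights = number of positive samples per label."""
    f1s, weights = [], []
    for y_true, y_pred in zip(true_by_label, pred_by_label):
        _, _, f1 = precision_recall_f1(y_true, y_pred)
        f1s.append(f1)
        weights.append(sum(y_true))  # weight = count of positive diagnoses
    return sum(f * w for f, w in zip(f1s, weights)) / sum(weights)
```

For example, with anxiety labels `[1, 1, 0, 0]` predicted as `[1, 0, 0, 1]` and depression labels `[1, 0, 1, 0]` predicted as `[1, 0, 1, 1]`, the subset accuracy is 0.5 because only two of the four (anxiety, depression) pairs are fully correct.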
Table A2. Parameter configurations used for GPT-model prompting.
Model | Specific Model Version | Input Token Limit | Output Token Limit
GPT-3.5 | gpt-3.5-turbo-0125 | 16,385 | 4096
GPT-4 | gpt-4-turbo-2024-04-09 | 128,000 | 4096
GPT-4o | gpt-4o-2024-05-13 | 128,000 | 4096
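Transcripts longer than a model's input token limit (Table A2) must be shortened before prompting. A minimal sketch, assuming a whitespace-word approximation of tokens (real counts require the model's tokenizer, e.g., tiktoken) and a hypothetical `reserve` budget for the instruction prompt and the reply:

```python
# Hedged sketch: fit a transcript under a model's context window before
# prompting. Splitting on whitespace only approximates tokens; this is an
# assumption for illustration, not the authors' preprocessing code.

TOKEN_LIMITS = {  # input token limits from Table A2
    "gpt-3.5-turbo-0125": 16_385,
    "gpt-4-turbo-2024-04-09": 128_000,
    "gpt-4o-2024-05-13": 128_000,
}

def truncate_for_model(text: str, model: str, reserve: int = 512) -> str:
    """Keep roughly the first (limit - reserve) words, reserving room for
    the instruction prompt and the model's response."""
    budget = TOKEN_LIMITS[model] - reserve
    words = text.split()
    return " ".join(words[:budget])
```

In practice, one would count tokens with the model's own tokenizer and decide whether to truncate the head, the tail, or sample from the session, since the most diagnostic exchanges may occur anywhere in a transcript.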

Appendix B

Figure A1. Training and validation loss for Mistral-7B-v0.1.

References

  1. Hornstein, S.; Scharfenberger, J.; Lueken, U.; Wundrack, R.; Hilbert, K. Predicting recurrent chat contact in a psychological intervention for the youth using natural language processing. npj Digit. Med. 2024, 7, 132. [Google Scholar] [CrossRef] [PubMed]
  2. Swaminathan, A.; López, I.; Mar, R.A.G.; Heist, T.; McClintock, T.; Caoili, K.; Grace, M.; Rubashkin, M.; Boggs, M.N.; Chen, J.H.; et al. Natural language processing system for rapid detection and intervention of mental health crisis chat messages. npj Digit. Med. 2023, 6, 213. [Google Scholar] [CrossRef]
  3. Balan, R.; Dobrean, A.; Poetar, C.R. Use of automated conversational agents in improving young population mental health: A scoping review. npj Digit. Med. 2024, 7, 75. [Google Scholar] [CrossRef] [PubMed]
  4. Schäfer, S.K.; Von Boros, L.; Schaubruch, L.M.; Kunzler, A.M.; Lindner, S.; Koehler, F.; Werner, T.; Zappalà, F.; Helmreich, I.; Wessa, M.; et al. Digital interventions to promote psychological resilience: A systematic review and meta-analysis. npj Digit. Med. 2024, 7, 30. [Google Scholar] [CrossRef] [PubMed]
  5. Li, H.; Zhang, R.; Lee, Y.C.; Kraut, R.E.; Mohr, D.C. Systematic review and meta-analysis of AI-based conversational agents for promoting mental health and well-being. npj Digit. Med. 2023, 6, 236. [Google Scholar] [CrossRef]
  6. Zhang, T.; Schoene, A.M.; Ji, S.; Ananiadou, S. Natural language processing applied to mental illness detection: A narrative review. npj Digit. Med. 2022, 5, 46. [Google Scholar] [CrossRef]
  7. Bayramli, I.; Castro, V.; Barak-Corren, Y.; Madsen, E.M.; Nock, M.K.; Smoller, J.W.; Reis, B.Y. Predictive structured–unstructured interactions in EHR models: A case study of suicide prediction. npj Digit. Med. 2022, 5, 15. [Google Scholar] [CrossRef]
  8. Chancellor, S.; De Choudhury, M. Methods in predictive techniques for mental health status on social media: A critical review. npj Digit. Med. 2020, 3, 43. [Google Scholar] [CrossRef]
  9. Abd-alrazaq, A.; Alhuwail, D.; Schneider, J.; Toro, C.T.; Ahmed, A.; Alzubaidi, M.; Alajlani, M.; Househ, M. The performance of artificial intelligence-driven technologies in diagnosing mental disorders: An umbrella review. npj Digit. Med. 2022, 5, 87. [Google Scholar] [CrossRef]
  10. Ji, S.; Zhang, T.; Ansari, L.; Fu, J.; Tiwari, P.; Cambria, E. MentalBERT: Publicly Available Pretrained Language Models for Mental Healthcare. arXiv 2021, arXiv:2110.15621. [Google Scholar] [CrossRef]
  11. Salmi, S.; Mérelle, S.; Gilissen, R.; Van Der Mei, R.; Bhulai, S. Detecting changes in help seeker conversations on a suicide prevention helpline during the COVID-19 pandemic: In-depth analysis using encoder representations from transformers. BMC Public Health 2022, 22, 530. [Google Scholar] [CrossRef] [PubMed]
  12. Su, C.; Xu, Z.; Pathak, J.; Wang, F. Deep learning in mental health outcome research: A scoping review. Transl. Psychiatry 2020, 10, 116. [Google Scholar] [CrossRef] [PubMed]
  13. Mangalik, S.; Eichstaedt, J.C.; Giorgi, S.; Mun, J.; Ahmed, F.; Gill, G.; Ganesan, A.V.; Subrahmanya, S.; Soni, N.; Clouston, S.A.P.; et al. Robust language-based mental health assessments in time and space through social media. npj Digit. Med. 2024, 7, 109. [Google Scholar] [CrossRef] [PubMed]
  14. Kelley, S.W.; Mhaonaigh, C.N.; Burke, L.; Whelan, R.; Gillan, C.M. Machine learning of language use on Twitter reveals weak and non-specific predictions. npj Digit. Med. 2022, 5, 35. [Google Scholar] [CrossRef]
  15. Huang, J.; Yang, D.M.; Rong, R.; Nezafati, K.; Treager, C.; Chi, Z.; Wang, S.; Cheng, X.; Guo, Y.; Klesse, L.J.; et al. A critical assessment of using ChatGPT for extracting structured data from clinical notes. npj Digit. Med. 2024, 7, 106. [Google Scholar] [CrossRef]
  16. Guevara, M.; Chen, S.; Thomas, S.; Chaunzwa, T.L.; Franco, I.; Kann, B.H.; Moningi, S.; Qian, J.M.; Goldstein, M.; Harper, S.; et al. Large language models to identify social determinants of health in electronic health records. npj Digit. Med. 2024, 7, 6. [Google Scholar] [CrossRef]
  17. Yang, X.; Chen, A.; PourNejatian, N.; Shin, H.C.; Smith, K.E.; Parisien, C.; Compas, C.; Martin, C.; Costa, A.B.; Flores, M.G.; et al. A large language model for electronic health records. npj Digit. Med. 2022, 5, 194. [Google Scholar] [CrossRef]
  18. Fiok, K.; Karwowski, W.; Gutierrez, E.; Davahli, M.R.; Wilamowski, M.; Ahram, T. Revisiting Text Guide, a Truncation Method for Long Text Classification. Appl. Sci. 2021, 11, 8554. [Google Scholar] [CrossRef]
  19. Park, H.; Vyas, Y.; Shah, K. Efficient Classification of Long Documents Using Transformers. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Dublin, Ireland, 22–27 May 2022; pp. 702–709. [Google Scholar] [CrossRef]
  20. Zheng, Y.; Cai, R.; Maimaiti, M.; Abiderexiti, K. Chunk-BERT: Boosted keyword extraction for long scientific literature via BERT with chunking capabilities. In Proceedings of the 2023 IEEE 4th International Conference on Pattern Recognition and Machine Learning (PRML), Urumqi, China, 4–6 August 2023; pp. 385–392. [Google Scholar] [CrossRef]
  21. Alexander Street Press. Counseling and Psychotherapy Transcripts: Volume I [Full Text Data]; Alexander Street Press: Alexandria, VA, USA, 2023. [Google Scholar] [CrossRef]
  22. Alexander Street Press. Counseling and Psychotherapy Transcripts: Volume II [Full Text Data]; Alexander Street Press: Alexandria, VA, USA, 2023. [Google Scholar] [CrossRef]
  23. Brysbaert, M.; Warriner, A.B.; Kuperman, V. Concreteness ratings for 40 thousand generally known English word lemmas. Behav. Res. Methods 2014, 46, 904–911. [Google Scholar] [CrossRef]
  24. Mohammad, S.M.; Turney, P.D. Crowdsourcing a Word–Emotion Association Lexicon. Comput. Intell. 2013, 29, 436–465. [Google Scholar] [CrossRef]
  25. Beltagy, I.; Peters, M.E.; Cohan, A. Longformer: The Long-Document Transformer. arXiv 2020, arXiv:2004.05150. [Google Scholar] [CrossRef]
  26. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar] [CrossRef]
  27. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692. [Google Scholar] [CrossRef]
  28. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. arXiv 2019, arXiv:1912.01703. [Google Scholar] [CrossRef]
  29. Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, 16–20 November 2020; pp. 38–45. [Google Scholar] [CrossRef]
  30. Dettmers, T.; Pagnoni, A.; Holtzman, A.; Zettlemoyer, L. QLoRA: Efficient Finetuning of Quantized LLMs. arXiv 2023, arXiv:2305.14314. [Google Scholar] [CrossRef]
  31. Jiang, A.Q.; Sablayrolles, A.; Mensch, A.; Bamford, C.; Chaplot, D.S.; de las Casas, D.; Bressand, F.; Lengyel, G.; Lample, G.; Saulnier, L.; et al. Mistral 7B. arXiv 2023, arXiv:2310.06825. [Google Scholar] [CrossRef]
  32. Khalil, N. Brev dev Notebooks: Mistral Fine-Tuning. 2024. Available online: https://github.com/brevdev/notebooks/blob/main/mistral-finetune.ipynb (accessed on 3 February 2024).
  33. Li, Y.; Wehbe, R.M.; Ahmad, F.S.; Wang, H.; Luo, Y. A comparative study of pretrained language models for long clinical text. J. Am. Med Inform. Assoc. 2023, 30, 340–347. [Google Scholar] [CrossRef]
  34. Christodoulou, C. Nlpdame at climateactivism 2024: Mistral sequence classification with peft for hate speech, targets and stance event detection. In Proceedings of the 7th Workshop on Challenges and Applications of Automated Extraction of Socio-Political Events from Text (CASE 2024), St. Julian’s, Malta, 22 March 2024; pp. 96–104. [Google Scholar]
  35. Kamran, S.; Zall, R.; Hosseini, S.; Kangavari, M.; Rahmani, S.; Hua, W. EmoDNN: Understanding emotions from short texts through a deep neural network ensemble. Neural Comput. Appl. 2023, 35, 13565–13582. [Google Scholar] [CrossRef]
  36. Parvin, T.; Sharif, O.; Hoque, M.M. Multi-class Textual Emotion Categorization using Ensemble of Convolutional and Recurrent Neural Network. SN Comput. Sci. 2022, 3, 62. [Google Scholar] [CrossRef]
  37. Mandal, A.; Chakraborty, T.; Gurevych, I. Towards privacy-aware mental health AI models. Nat. Comput. Sci. 2025, 5, 863–874. [Google Scholar] [CrossRef]
  38. Tilala, M.H.; Chenchala, P.K.; Choppadandi, A.; Kaur, J.; Naguri, S.; Saoji, R.; Devaguptapu, B.; Tilala, M. Ethical considerations in the use of artificial intelligence and machine learning in health care: A comprehensive review. Cureus 2024, 16, e62443. [Google Scholar] [CrossRef] [PubMed]
  39. Hu, M.; Alkhairy, S.; Lee, I.; Pillich, R.T.; Fong, D.; Smith, K.; Bachelder, R.; Ideker, T.; Pratt, D. Evaluation of large language models for discovery of gene set function. Nat. Methods 2025, 22, 82–91. [Google Scholar] [CrossRef] [PubMed]
  40. Raiaan, M.A.K.; Mukta, M.S.H.; Fatema, K.; Fahad, N.M.; Sakib, S.; Mim, M.M.J.; Ahmad, J.; Ali, M.E.; Azam, S. A Review on Large Language Models: Architectures, Applications, Taxonomies, Open Issues and Challenges. IEEE Access 2024, 12, 26839–26874. [Google Scholar] [CrossRef]
  41. Moura, I.; Teles, A.; Viana, D.; Marques, J.; Coutinho, L.; Silva, F. Digital Phenotyping of Mental Health using multimodal sensing of multiple situations of interest: A Systematic Literature Review. J. Biomed. Inform. 2023, 138, 104278. [Google Scholar] [CrossRef]
  42. Washington, P.; Wall, D.P. A Review of and Roadmap for Data Science and Machine Learning for the Neuropsychiatric Phenotype of Autism. Annu. Rev. Biomed. Data Sci. 2023, 6, 211–228. [Google Scholar] [CrossRef]
  43. Perochon, S.; Di Martino, J.M.; Carpenter, K.L.H.; Compton, S.; Davis, N.; Eichner, B.; Espinosa, S.; Franz, L.; Krishnappa Babu, P.R.; Sapiro, G.; et al. Early detection of autism using digital behavioral phenotyping. Nat. Med. 2023, 29, 2489–2497. [Google Scholar] [CrossRef]
  44. Kline, A.; Wang, H.; Li, Y.; Dennis, S.; Hutch, M.; Xu, Z.; Wang, F.; Cheng, F.; Luo, Y. Multimodal machine learning in precision health: A scoping review. npj Digit. Med. 2022, 5, 171. [Google Scholar] [CrossRef]
  45. Washington, P. A perspective on crowdsourcing and human-in-the-loop workflows in precision health. J. Med. Internet Res. 2024, 26, e51138. [Google Scholar] [CrossRef]
  46. Van Veen, D.; Van Uden, C.; Blankemeier, L.; Delbrouck, J.B.; Aali, A.; Bluethgen, C.; Pareek, A.; Polacin, M.; Reis, E.P.; Seehofnerová, A.; et al. Adapted large language models can outperform medical experts in clinical text summarization. Nat. Med. 2024, 30, 1134–1142. [Google Scholar] [CrossRef]
  47. Araf, I.; Idri, A.; Chairi, I. Cost-sensitive learning for imbalanced medical data: A review. Artif. Intell. Rev. 2024, 57, 80. [Google Scholar] [CrossRef]
Figure 1. Overview of the methods. Psychotherapy transcripts are classified for anxiety and depression using: (1) a Support Vector Machine trained on stem-word frequencies augmented with a psychological dictionary mapping; (2) transformer-based models fine-tuned with text truncation and with subdocument splitting plus boosting; and (3) GPT prompting through API calls.
Figure 2. Schematic of the GPT prompt design. Psychotherapy transcripts are classified for anxiety and depression via GPT prompting. The prompt includes a description of the GPT model’s task, followed by the available classification classes, the meaning of assigning the text to each class, the text under investigation, and a space for the model to return its response. The binary response is enforced by defining a function, available through the API, that controls the LLM’s response format.
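The function-based response constraint described in Figure 2 can be sketched with the OpenAI function-calling ("tools") interface. The function name and field names below are illustrative assumptions, not the authors' exact schema, and the actual prompt wording in the study may differ.

```python
# Hedged sketch of a function-calling schema that constrains a GPT model to
# binary anxiety/depression labels, as described in Figure 2.

def build_classification_tool():
    """JSON-schema 'tool' definition forcing a structured binary response.
    The function name `classify_transcript` is hypothetical."""
    return {
        "type": "function",
        "function": {
            "name": "classify_transcript",
            "description": "Classify a psychotherapy transcript for anxiety and depression.",
            "parameters": {
                "type": "object",
                "properties": {
                    "anxiety": {
                        "type": "integer", "enum": [0, 1],
                        "description": "1 if the client presents with anxiety, else 0.",
                    },
                    "depression": {
                        "type": "integer", "enum": [0, 1],
                        "description": "1 if the client presents with depression, else 0.",
                    },
                },
                "required": ["anxiety", "depression"],
            },
        },
    }

def build_prompt(transcript: str) -> list:
    """Messages mirroring Figure 2: task description, classes and their
    meaning, then the text under investigation."""
    system = (
        "You are a clinical text classifier. Classify the transcript below for "
        "two conditions: anxiety and depression. For each, return 1 (present) "
        "or 0 (absent) via the classify_transcript function."
    )
    return [{"role": "system", "content": system},
            {"role": "user", "content": transcript}]
```

With the `openai` client, the messages and tool definition would be passed as `messages=build_prompt(text)` and `tools=[build_classification_tool()]`, so the model's reply arrives as structured arguments rather than free text.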
Figure 3. Performance ranking by model.
Figure 4. Accuracy and stability of GPT-3.5 and GPT-4o models. (a) Accuracy of GPT-3.5 and GPT-4o evaluated on 200 randomly selected psychotherapy transcripts for each model. (b) Stability of GPT-3.5 measured by repeatedly classifying the same randomly selected transcript under an identical prompting scheme 200 times, shown as the joint distribution of anxiety and depression predictions. (c) Stability of GPT-4o measured under the same protocol as in (b). The transcript selected for stability assessment has ground truth labels of positive for anxiety and positive for depression.
Table 1. Performance of deep learning/machine learning models.
Model | Learning Rate | Accuracy | Weighted F1 | Weighted AUROC
RBF SVM | N/A | 0.549 | 0.485 | 0.705
BERT | 5 × 10^−5 | 0.503 | 0.508 | 0.694
Boosted BERT | 5 × 10^−5 | 0.530 | 0.351 | 0.686
RoBERTa | 1 × 10^−5 | 0.561 | 0.517 | 0.713
Boosted RoBERTa | 1 × 10^−5 | 0.566 * | 0.542 | 0.756 *
Longformer | 1 × 10^−5 | 0.549 | 0.565 * | 0.681
Boosted Longformer | 2 × 10^−5 | 0.514 | 0.319 | 0.568
Mistral-7B-v0.1 | 5 × 10^−6 | 0.087 | 0.483 | 0.507
The highest score for each performance metric is marked with an asterisk (*).
Table 2. Performance of the best fine-tuned BERT-based models.
Model | Accuracy | Weighted F1 | Weighted AUROC
Default Truncation BERT | 0.503 | 0.508 | 0.694
Majority Vote | 0.483 | 0.475 | 0.647
OR Construction | 0.455 | 0.439 | 0.627
Random Truncation | 0.514 | 0.344 | 0.663
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Sun, J.; Ma, S.; Fan, Y.; Washington, P. LLMs Underperform on Classifying Anxiety and Depression Using Therapy Conversations: A First-Step Benchmark. Appl. Sci. 2026, 16, 3388. https://doi.org/10.3390/app16073388

