LLMs Underperform on Classifying Anxiety and Depression Using Therapy Conversations: A First-Step Benchmark
Abstract
1. Introduction
2. Materials and Methods
2.1. Dataset and Preprocessing
2.2. Response Variables and Metrics
2.3. ML Baselines
2.4. Transformer Finetuning
2.5. GPT Evaluations
3. Results
4. Discussion
4.1. Overview
4.2. Limitations
4.3. Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| AUROC | Area under the receiver operating characteristic curve |
| BERT | Bidirectional encoder representations from transformers |
| RoBERTa | Robustly optimized bidirectional encoder representations from transformers |
Appendix A
| Metric | Definition |
|---|---|
| Accuracy | The number of samples with both anxiety and depression predicted correctly divided by the total number of samples. |
| Precision | The number of true positives divided by the sum of true positives and false positives. |
| Recall | The number of true positives divided by the sum of true positives and false negatives. |
| F1 Score | Two times precision multiplied by recall, divided by the sum of precision and recall. |
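The metric definitions above can be made concrete with a small worked sketch. This is an illustrative implementation under the table's own definitions (note that accuracy here is the joint, per-sample definition: both labels must be correct); the function names and example data are ours, not from the paper.

```python
# Illustrative implementation of the metrics defined in the table above.
# Inputs are parallel lists of 0/1 labels.

def precision_recall_f1(y_true, y_pred):
    """Binary precision, recall, and F1 for one label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 = 2 * precision * recall / (precision + recall)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def joint_accuracy(anx_true, anx_pred, dep_true, dep_pred):
    """Accuracy as defined above: a sample counts as correct only when
    both the anxiety and the depression predictions are correct."""
    correct = sum(
        1 for at, ap, dt, dp in zip(anx_true, anx_pred, dep_true, dep_pred)
        if at == ap and dt == dp
    )
    return correct / len(anx_true)
```

For example, with anxiety predictions `[1, 0, 0, 0]` against truth `[1, 0, 1, 0]`, precision is 1.0 and recall is 0.5, so F1 is 2/3.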
| Model | Specific Model Version | Input Token Limit | Output Token Limit |
|---|---|---|---|
| GPT-3.5 | gpt-3.5-turbo-0125 | 16,385 | 4096 |
| GPT-4 | gpt-4-turbo-2024-04-09 | 128,000 | 4096 |
| GPT-4o | gpt-4o-2024-05-13 | 128,000 | 4096 |
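The input limits above constrain how much of a transcript fits in a single request. A minimal, hypothetical sketch of limit-aware truncation follows; the whitespace split is a stand-in for real BPE tokenization (e.g. via the `tiktoken` library, which yields different counts), and the assumption that prompt and reserved output tokens are subtracted from the context window is ours.

```python
# Hypothetical sketch: reserve room for the prompt and the output limit,
# then truncate transcript tokens to fit the model's context window.
# The whitespace split is a stand-in for real tokenization.

INPUT_LIMITS = {
    "gpt-3.5-turbo-0125": 16_385,
    "gpt-4-turbo-2024-04-09": 128_000,
    "gpt-4o-2024-05-13": 128_000,
}

def truncate_for_model(transcript: str, model: str,
                       prompt_tokens: int = 200,
                       output_limit: int = 4_096) -> str:
    """Keep only as many transcript tokens as the context budget allows."""
    budget = INPUT_LIMITS[model] - prompt_tokens - output_limit
    tokens = transcript.split()  # stand-in tokenizer
    return " ".join(tokens[:budget])
```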
Appendix B

| Model | Learning Rate | Accuracy | Weighted F1 | Weighted AUROC |
|---|---|---|---|---|
| RBF SVM | – | 0.549 | 0.485 | 0.705 |
| BERT | 5 × 10⁻⁵ | 0.503 | 0.508 | 0.694 |
| Boosted BERT | 5 × 10⁻⁵ | 0.530 | 0.351 | 0.686 |
| RoBERTa | 1 × 10⁻⁵ | 0.561 | 0.517 | 0.713 |
| Boosted RoBERTa | 1 × 10⁻⁵ | 0.566 | 0.542 | 0.756 |
| Longformer | 1 × 10⁻⁵ | 0.549 | 0.565 | 0.681 |
| Boosted Longformer | 2 × 10⁻⁵ | 0.514 | 0.319 | 0.568 |
| Mistral-7B-v0.1 | 5 × 10⁻⁶ | 0.087 | 0.483 | 0.507 |
| Model | Accuracy | Weighted F1 | Weighted AUROC |
|---|---|---|---|
| Default Truncation BERT | 0.503 | 0.508 | 0.694 |
| Majority Vote | 0.483 | 0.475 | 0.647 |
| OR Construction | 0.455 | 0.439 | 0.627 |
| Random Truncation | 0.514 | 0.344 | 0.663 |
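The strategies named in the table above (majority vote, OR construction, random truncation) are ways of combining chunk-level predictions when a transcript exceeds the encoder's input length. An illustrative sketch follows; the chunking scheme and the stand-in `classify` callable are our assumptions, not the paper's implementation.

```python
# Illustrative chunk-level aggregation for long-document classification.
# `classify` stands in for a fitted chunk classifier returning 0 or 1.

import random
from typing import Callable, List

def chunk(tokens: List[str], size: int) -> List[List[str]]:
    """Split a token list into consecutive fixed-size chunks."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def majority_vote(preds: List[int]) -> int:
    """Positive only if more than half of the chunks are positive."""
    return int(sum(preds) > len(preds) / 2)

def or_construction(preds: List[int]) -> int:
    """Positive if any chunk is positive."""
    return int(any(preds))

def random_truncation(tokens: List[str], size: int,
                      classify: Callable[[List[str]], int],
                      rng=random) -> int:
    """Classify a single randomly chosen chunk."""
    return classify(rng.choice(chunk(tokens, size)))
```

Majority vote trades recall for precision relative to the OR construction, which flags a transcript if any single chunk looks positive.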
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Sun, J.; Ma, S.; Fan, Y.; Washington, P. LLMs Underperform on Classifying Anxiety and Depression Using Therapy Conversations: A First-Step Benchmark. Appl. Sci. 2026, 16, 3388. https://doi.org/10.3390/app16073388

