Supervised Learning and Large Language Model Benchmarks on Mental Health Datasets: Cognitive Distortions and Suicidal Risks in Chinese Social Media
Abstract
1. Introduction
- 1. We introduce SocialCD-3K, the first Chinese multi-label cognitive distortion dataset, constructed from Burns’ classic Cognitive-Behavioral Therapy (CBT) framework, unlike prior simplified or inconsistent schemes.
- 2. We present SOS-HL-1K, a publicly available Chinese social media dataset for suicide risk classification, addressing the scarcity of non-English resources.
- 3. We provide a benchmark evaluating two supervised methods and eight LLMs across zero-shot, few-shot, and fine-tuning settings, offering a reference for future research.
- 4. We analyze performance gaps between supervised methods and LLMs on simple vs. complex psychological tasks, revealing the current limitations of LLMs in specialized domains.
2. Related Work
2.1. Text Sentiment Analysis
2.2. Large Language Model and Its Applications in Medical Domain
3. Methods
3.1. Baseline Supervised Learning Models
- LSAN [44]: The LSAN model uses label semantics to identify relationships between labels and documents, creating a label-specific document representation. It employs a self-attention mechanism to focus on this representation and uses an adaptive fusion strategy for multi-label text classification, proving effective in predicting low-frequency labels.
- BERT [45]: BERT uses a bidirectional approach, facilitated by the Transformer architecture [46], to understand the full context of words. It is pre-trained with a masked language model objective, predicting masked words. BERT excels in various NLP tasks, such as question answering and sentiment analysis, especially when fine-tuned on specific data.
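To make the multi-label setup of these baselines concrete, the sketch below shows a minimal BERT-style classification head with independent sigmoid outputs and binary cross-entropy loss. The class name, the `bert-base-chinese` checkpoint, and the 12-label setting are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class MultiLabelBertClassifier(nn.Module):
    """Illustrative multi-label head: 768-d pooled [CLS] vector -> dropout -> linear."""
    def __init__(self, num_labels: int = 12, dropout: float = 0.1):
        super().__init__()
        self.encoder = BertModel.from_pretrained("bert-base-chinese")
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        # Use the pooled [CLS] representation as the sentence vector.
        pooled = self.encoder(input_ids=input_ids, attention_mask=attention_mask).pooler_output
        return self.classifier(self.dropout(pooled))  # raw logits; sigmoid applied at inference

# Training treats each label independently with binary cross-entropy.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = MultiLabelBertClassifier(num_labels=12)
batch = tokenizer(["示例帖子文本"], padding=True, truncation=True, return_tensors="pt")
logits = model(batch["input_ids"], batch["attention_mask"])
loss = nn.BCEWithLogitsLoss()(logits, torch.zeros_like(logits))  # dummy targets for illustration
```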
3.2. Large Language Models
3.2.1. General LLMs: GPTs
- GPT-3.5: GPT-3.5 is an advanced iteration of the GPT-3 [52] language model, offering improvements in conversational capabilities. It is designed to provide more coherent and context-aware responses in conversational applications, reflecting ongoing developments in language modeling techniques.
- GPT-4 [47]: GPT-4 is a groundbreaking multimodal model that processes both images and text to generate text outputs. It performs at a human level on various benchmarks, including scoring in the top 10% on a simulated bar exam. Built on the Transformer architecture [46], GPT-4 is trained to predict tokens and undergoes post-training alignment for improved accuracy. Despite its capabilities, GPT-4 has limitations like occasional content hallucinations and a constrained context window.
3.2.2. Chinese LLMs
- ChatGLM2-6B: ChatGLM2-6B is an open-source bilingual model with 6.2 billion parameters, optimized for Chinese question-answering and dialogue. Trained on about 1 TB of Chinese and English text, it can be fine-tuned through various techniques like supervised learning and human feedback, allowing for diverse language processing applications.
- GLM-4 [49]: GLM-4 uses a bidirectional Transformer [46] structure with a masked-and-prefix language model pre-training strategy, combining the strengths of masked language modeling (MLM) and prefix language modeling (PLM). It employs Rotary Position Embedding (RoPE) to handle long text sequences and adaptive masking for varied tasks, and it supports multi-task learning.
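As a brief illustration of the Rotary Position Embedding mechanism named above, the snippet below applies position-dependent rotations to pairs of feature dimensions of a query or key matrix. This is a generic, simplified sketch rather than GLM-4's actual implementation, and the base frequency of 10,000 is the common default rather than a value reported here.

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate (even, odd) feature pairs of x by position-dependent angles.

    x: tensor of shape (seq_len, dim) with even dim, e.g. per-head queries or keys.
    """
    seq_len, dim = x.shape
    half = dim // 2
    # Frequencies fall off geometrically across feature pairs.
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.outer(torch.arange(seq_len, dtype=torch.float32), freqs)  # (seq_len, half)
    cos, sin = angles.cos(), angles.sin()
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    rotated = torch.stack(
        (x_even * cos - x_odd * sin, x_even * sin + x_odd * cos), dim=-1
    )
    return rotated.flatten(start_dim=-2)  # interleave back to (seq_len, dim)

q = torch.randn(8, 64)   # 8 positions, 64-dimensional head
q_rot = apply_rope(q)    # same shape, now position-aware
```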
3.3. LLM Prompt Strategies
3.3.1. Zero-Shot Prompting
- 1. Basic: A direct task statement without specific contextual emphasis. Prompt (English translation): “[Please perform a multi-label classification task to determine whether the post reflects any of the specified 12 cognitive distortions: (list of cognitive distortion categories)].”
- 2. Role-definition prompting: This strategy delineates the role of the LLM (in this case, a psychologist) and emphasizes the need for psychological expertise. Prompt (English translation): “[Assuming the role of a psychologist with professional psychological experience].”
- 3. Scene-definition prompting: Introduces the context of a social media environment, highlighting user identifiers to eliminate ambiguity. Prompt (English translation): “[Given the user ID u and the associated posts on social media, based on the post content].”
- 4. Hybrid prompting: A combination of both role and scene definitions, concatenated into a single integrated prompt (see the sketch after this list).
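To make these strategies concrete, the sketch below assembles the four zero-shot prompts programmatically. The English strings are paraphrases of the translations quoted above, and the function and variable names are illustrative assumptions rather than identifiers from the released code.

```python
ROLE = "Assuming the role of a psychologist with professional psychological experience, "
SCENE = ("given the user ID u and the associated posts on social media, "
         "based on the post content, ")
TASK = ("please perform a multi-label classification task to determine whether the post "
        "reflects any of the specified 12 cognitive distortions: {labels}.")

def build_zero_shot_prompt(strategy: str, labels: list[str]) -> str:
    """Assemble a zero-shot prompt: basic, role, scene, or hybrid (role + scene)."""
    task = TASK.format(labels=", ".join(labels))
    prefix = {
        "basic": "",
        "role": ROLE,
        "scene": SCENE,
        "hybrid": ROLE + SCENE,
    }[strategy]
    return prefix + task

# Example: the hybrid prompt for a (truncated) list of distortion categories.
distortions = ["all-or-nothing thinking", "over-generalization", "mental filter"]
print(build_zero_shot_prompt("hybrid", distortions))
```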
3.3.2. Few-Shot Prompting
- 1. Background knowledge: The LLM is provided with psychological definitions, supplemented by representative cases, followed by one of the four prompting strategies devised for zero-shot prompting. A prompt that integrates background knowledge and employs the hybrid strategy from zero-shot prompting reads as follows: “[There are the definitions of the 12 cognitive distortions (list of cognitive distortion definitions), and there are representative cases (list of cognitive distortion cases). Consider these cognitive distortion definitions and cases]”, followed by the hybrid prompt.
- 2. Prompting with n reference samples per category: Reference data are randomly selected from the training set for each category to construct prompts for the LLM, followed by one of the four prompting strategies devised for zero-shot prompting. A prompt that incorporates the reference data and employs the hybrid strategy reads as follows: “[There are some examples of the target task with the ground-truth labels (list of reference samples). Consider these cognitive distortion examples]”, followed by the hybrid prompt.
- 3. Background knowledge with n reference samples per category: This approach investigates whether richer explanations in few-shot prompting improve the LLM’s understanding of psychological tasks. It incorporates the psychological definitions, provides n reference samples per category for prompt construction, and then appends the hybrid strategy from zero-shot prompting (see the sketch after this list).
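A minimal sketch of the few-shot prompt assembly described above: it samples n labeled reference posts per category from a training set, optionally prepends the distortion definitions, and then appends a zero-shot prompt. The data structures and function names are assumptions for illustration, not the authors' code.

```python
import random

def build_few_shot_prompt(zero_shot_prompt: str,
                          train_set: dict[str, list[str]],
                          definitions: dict[str, str] | None = None,
                          n_per_category: int = 2,
                          seed: int = 0) -> str:
    """Prepend background knowledge and/or n reference samples per category to a zero-shot prompt."""
    rng = random.Random(seed)
    parts = []
    if definitions is not None:
        defs = "\n".join(f"- {label}: {text}" for label, text in definitions.items())
        parts.append("There are the definitions of the 12 cognitive distortions:\n" + defs)
    examples = []
    for label, posts in train_set.items():
        for post in rng.sample(posts, min(n_per_category, len(posts))):
            examples.append(f"Post: {post}\nLabel: {label}")
    parts.append("There are some examples of the target task with the ground-truth labels:\n\n"
                 + "\n\n".join(examples))
    parts.append("Consider these cognitive distortion definitions and examples. " + zero_shot_prompt)
    return "\n\n".join(parts)

# Usage: combine reference samples with the hybrid zero-shot prompt built earlier.
train_set = {"mental filter": ["示例帖子一", "示例帖子二"], "over-generalization": ["示例帖子三"]}
prompt = build_few_shot_prompt("please perform a multi-label classification task ...",
                               train_set, n_per_category=1)
```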
3.4. LLM Fine-Tuning for Downstream Task
4. Experiments and Results
4.1. Datasets and Evaluation Metrics
4.2. Experiment Design
- Prompt design: Initially, we assessed four prompting strategies in the zero-shot setting. Based on their performance, the top two strategies were then selected for further evaluation in the few-shot setting across the various LLMs.
- LLM ability: In our experiments, ChatGLM2-6B and GLM-130B performed poorly. We therefore used the latest model in the GLM series, GLM-4, to evaluate performance on the newly expanded dataset. For GPT-3.5, its token limit prevented us from providing five samples per category during few-shot prompting; consequently, we reserved that setting exclusively for GPT-4.
- LLM fine-tuning: OpenAI recently introduced a fine-tuning feature for GPT-3.5, and official reports suggest that, under certain conditions, a fine-tuned GPT-3.5 may outperform GPT-4 [58]. We therefore experimented with fine-tuning GPT-3.5. Since the current version of GPT-4 lacks fine-tuning capabilities, we were unable to assess its potential. Additionally, we explored the fine-tuning performance of three open-source Chinese LLMs on the cognitive distortion recognition task.
4.3. Implementation Details
- LSAN: We used Word2vec [59] to train 300-dimensional document embeddings, with label-text embeddings of the same dimensionality initialized randomly. The attention mechanism computed each word's contribution to the labels and produced label-specific document representations; dot products between these document and label vectors further refined the relationships. The two types of document representations were then fused using weighted combinations. For prediction, we employed a fully connected layer followed by ReLU and a sigmoid function, with cross-entropy as the loss function.
- BERT: We employed a Chinese pre-trained BERT model to extract 768-dimensional vectors from the sentences. To avoid overfitting, a dropout function [60] was applied to these sentence vectors. Subsequently, a fully connected layer was introduced for classification. The sigmoid function served as the activation function for the output layer.
- LLM-zero shot: Both GPT-3.5 and GPT-4 [47] are closed-source and accessed through OpenAI’s API. We used GPT-3.5-Turbo because it is one of the most capable and cost-effective models in the GPT-3.5 family, and GPT-4 for its advanced capabilities. For the GLM models, we deployed the open-source ChatGLM2-6B [48] on our server and tested the larger GLM-130B [48] via its official website due to deployment challenges. We averaged performance over five rounds of experiments for all models to minimize randomness. For GPT-3.5, GPT-4, ChatGLM2-6B, and GLM-4, these rounds were run with the temperature set to 0.1, 0.3, 0.5, 0.7, and 0.9. For GLM-130B, we could not adjust the temperature due to its limitations (a minimal sketch of this querying loop appears after this list).
- LLM-few shot: We conducted these experiments using the two prompt strategies with the best zero-shot performance, as determined by their F1-scores. Because each model has different input-token constraints, we selected different amounts of reference data according to the requirements of the corresponding model.
- LLM-fine-tuning: For the closed-source GPT-3.5, we used the API provided by OpenAI to fine-tune the GPT-3.5 Turbo model for the suicide risk and cognitive distortion tasks. In this experiment, the number of training epochs was set to 3; we did not explicitly specify the other hyperparameters, and OpenAI selected default values based on the dataset size. We also explored the performance of three open-source Chinese LLMs (Chinese-LLaMA-2-7B, Chinese-Alpaca-2-7B, and LLaMA2-Chinese-7b-Chat) on this fine-tuning task. These models were fine-tuned locally on an NVIDIA A100 GPU. Training was configured with a batch size of 8, a cosine learning-rate schedule, and a weight decay of 0.01, and was run for 5 epochs with model checkpoints saved every 300 steps. We used Low-Rank Adaptation (LoRA) for parameter-efficient fine-tuning, with a rank of 8, an alpha of 32, and a lora_dropout rate of 0.05; the LoRA modules were applied to the attention layers’ query and value projection matrices (q_proj and v_proj). To ensure full reproducibility, the complete configuration file for these experiments is available in our GitHub repository (a sketch of this setup follows this list).
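To make the zero-shot querying procedure concrete, the sketch below sends one request per temperature in the grid described above and collects the raw replies for later scoring. It is a minimal illustration, assuming the OpenAI Python client (v1) and a placeholder model name; the open-source GLM models would be queried analogously through their own interfaces, and reply parsing into labels is omitted.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
TEMPERATURES = [0.1, 0.3, 0.5, 0.7, 0.9]  # one round per temperature; results averaged afterwards

def query_rounds(system_prompt: str, post: str, model: str = "gpt-3.5-turbo") -> list[str]:
    """Return one raw model reply per temperature setting for a single post."""
    replies = []
    for t in TEMPERATURES:
        resp = client.chat.completions.create(
            model=model,
            temperature=t,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": post},
            ],
        )
        replies.append(resp.choices[0].message.content)
    return replies
```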
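For the three open-source Chinese models, the LoRA configuration reported above maps naturally onto Hugging Face PEFT and the Trainer API. The snippet below is a reconstruction under stated assumptions: the checkpoint name is a placeholder for one of the listed models, the learning-rate value is illustrative (the reported value is in the configuration file in the repository), and dataset loading and tokenization are omitted.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model

BASE = "hfl/chinese-alpaca-2-7b"  # placeholder for one of the three fine-tuned checkpoints
tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE)

# LoRA on the attention query/value projections, as described in the text.
lora_cfg = LoraConfig(r=8, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)

args = TrainingArguments(
    output_dir="./socialcd3k-lora",
    per_device_train_batch_size=8,
    num_train_epochs=5,
    lr_scheduler_type="cosine",
    learning_rate=1e-4,        # illustrative value; see the released configuration file
    weight_decay=0.01,
    save_steps=300,
)
# trainer = Trainer(model=model, args=args, train_dataset=tokenized_train)  # supply the tokenized split
# trainer.train()
```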
5. Results
5.1. Effect of Prompting Strategies on Task Performance
5.2. Temperature Sensitivity Analysis
6. Expert Evaluation and Feedback
7. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- World Health Organization. Depressive Disorder (Depression); World Health Organization: Geneva, Switzerland, 2023. [Google Scholar]
- Huang, Y.; Wang, Y.; Wang, H.; Liu, Z.; Yu, X.; Yan, J.; Yu, Y.; Kou, C.; Xu, X.; Lu, J.; et al. Prevalence of mental disorders in China: A cross-sectional epidemiological study. Lancet Psychiatry 2019, 6, 211–224. [Google Scholar] [CrossRef]
- World Health Organization. Suicide; World Health Organization: Geneva, Switzerland, 2023. [Google Scholar]
- Keles, B.; McCrae, N.; Grealish, A. A systematic review: The influence of social media on depression, anxiety and psychological distress in adolescents. Int. J. Adolesc. Youth 2020, 25, 79–93. [Google Scholar] [CrossRef]
- Robinson, J.; Cox, G.; Bailey, E.; Hetrick, S.; Rodrigues, M.; Fisher, S.; Herrman, H. Social media and suicide prevention: A systematic review. Early Interv. Psychiatry 2016, 10, 103–121. [Google Scholar] [CrossRef]
- Luxton, D.D.; June, J.D.; Fairall, J.M. Social media and suicide: A public health perspective. Am. J. Public Health 2012, 102, S195–S200. [Google Scholar] [CrossRef]
- Coppersmith, G.; Leary, R.; Crutchley, P.; Fine, A. Natural language processing of social media as screening for suicide risk. Biomed. Inform. Insights 2018, 10, 1178222618792860. [Google Scholar] [CrossRef] [PubMed]
- Nandwani, P.; Verma, R. A review on sentiment analysis and emotion detection from text. Soc. Netw. Anal. Min. 2021, 11, 81. [Google Scholar] [CrossRef] [PubMed]
- Acheampong, F.A.; Wenyu, C.; Nunoo-Mensah, H. Text-based emotion detection: Advances, challenges, and opportunities. Eng. Rep. 2020, 2, e12189. [Google Scholar] [CrossRef]
- Saunders, D. Domain adaptation and multi-domain adaptation for neural machine translation: A survey. J. Artif. Intell. Res. 2022, 75, 351–424. [Google Scholar] [CrossRef]
- Sharma, A.; Rushton, K.; Lin, I.W.; Wadden, D.; Lucas, K.G.; Miner, A.S.; Nguyen, T.; Althoff, T. Cognitive reframing of negative thoughts through human-language model interaction. arXiv 2023, arXiv:2305.02466. [Google Scholar] [CrossRef]
- Laparra, E.; Bethard, S.; Miller, T.A. Rethinking domain adaptation for machine learning over clinical language. JAMIA Open 2020, 3, 146–150. [Google Scholar] [CrossRef]
- He, T.; Fu, G.; Yu, Y.; Wang, F.; Li, J.; Zhao, Q.; Song, C.; Qi, H.; Luo, D.; Zou, H.; et al. Towards a psychological generalist AI: A survey of current applications of large language models and future prospects. arXiv 2023, arXiv:2312.04578. [Google Scholar] [CrossRef]
- Zhao, W.X.; Zhou, K.; Li, J.; Tang, T.; Wang, X.; Hou, Y.; Min, Y.; Zhang, B.; Zhang, J.; Dong, Z.; et al. A survey of large language models. arXiv 2023, arXiv:2303.18223. [Google Scholar]
- Thirunavukarasu, A.J.; Ting, D.S.J.; Elangovan, K.; Gutierrez, L.; Tan, T.F.; Ting, D.S.W. Large language models in medicine. Nat. Med. 2023, 29, 1930–1940. [Google Scholar] [CrossRef]
- Xu, X.; Yao, B.; Dong, Y.; Yu, H.; Hendler, J.; Dey, A.K.; Wang, D. Leveraging Large Language Models for Mental Health Prediction via Online Text Data. arXiv 2023, arXiv:2307.14385. [Google Scholar]
- Sharma, A.; Rushton, K.; Lin, I.; Wadden, D.; Lucas, K.; Miner, A.; Nguyen, T.; Althoff, T. Cognitive Reframing of Negative Thoughts through Human-Language Model Interaction. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, Canada, 9–14 July 2023; Rogers, A., Boyd-Graber, J., Okazaki, N., Eds.; pp. 9977–10000. [Google Scholar] [CrossRef]
- Wang, B.; Deng, P.; Zhao, Y.; Qin, B. C2D2 Dataset: A Resource for the Cognitive Distortion Analysis and Its Impact on Mental Health. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023. [Google Scholar]
- Lin, S.; Wang, Y.; Dong, J.; Ni, S. Detection and Positive Reconstruction of Cognitive Distortion Sentences: Mandarin Dataset and Evaluation. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 11–16 August 2024; Ku, L.W., Martins, A., Srikumar, V., Eds.; pp. 6686–6701. [Google Scholar] [CrossRef]
- Burns, D.D. Feeling Good; Signet Book: New York, NY, USA, 1981; Volume 4. [Google Scholar]
- Zhu, J.; Xu, A.; Tan, M.; Yang, M. XinHai@CLPsych 2024 Shared Task: Prompting Healthcare-oriented LLMs for Evidence Highlighting in Posts with Suicide Risk. In Proceedings of the 9th Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2024), St. Julians, Malta, 21 March 2024; Yates, A., Desmet, B., Prud’hommeaux, E., Zirikly, A., Bedrick, S., MacAvaney, S., Bar, K., Ireland, M., Ophir, Y., Eds.; pp. 238–246. [Google Scholar]
- Kim, L. A Baseline for Self-state Identification and Classification in Mental Health Data: CLPsych 2025 Task. In Proceedings of the 10th Workshop on Computational Linguistics and Clinical Psychology (CLPsych), St. Julians, Malta, 3 May 2025; pp. 218–224. [Google Scholar]
- Zheng, S.; Tao, Y.; Zhou, T. RSD-15K: A Large-Scale User-Level Annotated Dataset for Suicide Risk Detection on Social Media. arXiv 2025, arXiv:2507.11559. [Google Scholar] [CrossRef]
- Cao, L.; Zhang, H.; Feng, L.; Wei, Z.; Wang, X.; Li, N.; He, X. Latent Suicide Risk Detection on Microblog via Suicide-Oriented Word Embeddings and Layered Attention. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; Inui, K., Jiang, J., Ng, V., Wan, X., Eds.; pp. 1718–1728. [Google Scholar] [CrossRef]
- Huang, X.; Xing, L.; Brubaker, J.R.; Paul, M.J. Exploring timelines of confirmed suicide incidents through social media. In Proceedings of the 2017 IEEE International Conference on Healthcare Informatics (ICHI), Park City, UT, USA, 23–26 August 2017; pp. 470–477. [Google Scholar]
- Fu, G.; Song, C.; Li, J.; Ma, Y.; Chen, P.; Wang, R.; Yang, B.X.; Huang, Z. Distant supervision for mental health management in social media: Suicide risk classification system development study. J. Med. Internet Res. 2021, 23, e26119. [Google Scholar] [CrossRef]
- Singh, M.; Jakhar, A.K.; Pandey, S. Sentiment analysis on the impact of coronavirus in social life using the BERT model. Soc. Netw. Anal. Min. 2021, 11, 33. [Google Scholar] [CrossRef]
- Wan, F. Sentiment analysis of Weibo comments based on deep neural network. In Proceedings of the 2019 International Conference on Communications, Information System and Computer Engineering (CISCE), Haikou, China, 5–7 July 2019; pp. 626–630. [Google Scholar]
- Zhang, X.; Li, W.; Ying, H.; Li, F.; Tang, S.; Lu, S. Emotion detection in online social networks: A multilabel learning approach. IEEE Internet Things J. 2020, 7, 8133–8143. [Google Scholar] [CrossRef]
- Kaddour, J.; Harris, J.; Mozes, M.; Bradley, H.; Raileanu, R.; McHardy, R. Challenges and Applications of Large Language Models. arXiv 2023, arXiv:2307.10169. [Google Scholar] [CrossRef]
- Wei, J.; Tay, Y.; Bommasani, R.; Raffel, C.; Zoph, B.; Borgeaud, S.; Yogatama, D.; Bosma, M.; Zhou, D.; Metzler, D.; et al. Emergent abilities of large language models. arXiv 2022, arXiv:2206.07682. [Google Scholar] [CrossRef]
- Liebrenz, M.; Schleifer, R.; Buadze, A.; Bhugra, D.; Smith, A. Generating scholarly content with ChatGPT: Ethical challenges for medical publishing. Lancet Digit. Health 2023, 5, e105–e106. [Google Scholar] [CrossRef]
- Jeblick, K.; Schachtner, B.; Dexl, J.; Mittermeier, A.; Stüber, A.T.; Topalis, J.; Weber, T.; Wesp, P.; Sabel, B.; Ricke, J.; et al. ChatGPT makes medicine easy to swallow: An exploratory case study on simplified radiology reports. arXiv 2022, arXiv:2212.14882. [Google Scholar] [CrossRef] [PubMed]
- Surameery, N.M.S.; Shakor, M.Y. Use ChatGPT to solve programming bugs. Int. J. Inf. Technol. Comput. Eng. (IJITC) 2023, 3, 17–22. [Google Scholar]
- Kasneci, E.; Seßler, K.; Küchemann, S.; Bannert, M.; Dementieva, D.; Fischer, F.; Gasser, U.; Groh, G.; Günnemann, S.; Hüllermeier, E.; et al. ChatGPT for good? On opportunities and challenges of large language models for education. Learn. Individ. Differ. 2023, 103, 102274. [Google Scholar] [CrossRef]
- Yeo, Y.H.; Samaan, J.S.; Ng, W.H.; Ting, P.S.; Trivedi, H.; Vipani, A.; Ayoub, W.; Yang, J.D.; Liran, O.; Spiegel, B.; et al. Assessing the performance of ChatGPT in answering questions regarding cirrhosis and hepatocellular carcinoma. Clin. Mol. Hepatol. 2023, 29, 721–732. [Google Scholar] [CrossRef]
- Jiang, L.Y.; Liu, X.C.; Nejatian, N.P.; Nasir-Moin, M.; Wang, D.; Abidin, A.; Eaton, K.; Riina, H.A.; Laufer, I.; Punjabi, P.; et al. Health system-scale language models are all-purpose prediction engines. Nature 2023, 619, 357–362. [Google Scholar] [CrossRef]
- Bauer, B.; Norel, R.; Leow, A.; Rached, Z.A.; Wen, B.; Cecchi, G. Using Large Language Models to Understand Suicidality in a Social Media–Based Taxonomy of Mental Health Disorders: Linguistic Analysis of Reddit Posts. JMIR Ment. Health 2024, 11, e57234. [Google Scholar] [CrossRef]
- Koushik, L.; Vishruth, M.; Anand Kumar, M. Detecting suicide risk patterns using hierarchical attention networks with large language models. In Proceedings of the 9th Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2024), St. Julians, Malta, 21 March 2024; pp. 227–231. [Google Scholar]
- Alhamed, F.; Ive, J.; Specia, L. Using large language models (llms) to extract evidence from pre-annotated social media data. In Proceedings of the 9th Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2024), St. Julians, Malta, 21 March 2024; pp. 232–237. [Google Scholar]
- Chen, S.; Wu, M.; Zhu, K.Q.; Lan, K.; Zhang, Z.; Cui, L. LLM-empowered Chatbots for Psychiatrist and Patient Simulation: Application and Evaluation. arXiv 2023, arXiv:2305.13614. [Google Scholar] [CrossRef]
- Fu, G.; Zhao, Q.; Li, J.; Luo, D.; Song, C.; Zhai, W.; Liu, S.; Wang, F.; Wang, Y.; Cheng, L.; et al. Enhancing Psychological Counseling with Large Language Model: A Multifaceted Decision-Support System for Non-Professionals. arXiv 2023, arXiv:2308.15192. [Google Scholar] [CrossRef]
- Yang, K.; Ji, S.; Zhang, T.; Xie, Q.; Kuang, Z.; Ananiadou, S. Towards Interpretable Mental Health Analysis with ChatGPT. arXiv 2023, arXiv:2304.03347. [Google Scholar] [CrossRef]
- Xiao, L.; Huang, X.; Chen, B.; Jing, L. Label-specific document representation for multi-label text classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 466–475. [Google Scholar]
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998. [Google Scholar]
- OpenAI. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774. [Google Scholar] [CrossRef]
- Zeng, A.; Liu, X.; Du, Z.; Wang, Z.; Lai, H.; Ding, M.; Yang, Z.; Xu, Y.; Zheng, W.; Xia, X.; et al. GLM-130b: An open bilingual pre-trained model. arXiv 2022, arXiv:2210.02414. [Google Scholar]
- ZHIPU AI DevDay GLM-4. 2024. Available online: https://zhipuai.cn/en/devday (accessed on 23 April 2024).
- Cui, Y.; Yang, Z.; Yao, X. Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca. arXiv 2023, arXiv:2304.08177. [Google Scholar]
- Llama Chinese Community and AtomEcho. Llama2-Chinese-7b-Chat. Model Continuously Updated with New Parameters, Training Process Available at Llama.family. Deployment, Training, and Fine-Tuning Methods Detailed at GitHub Repository: Llama-Chinese. 2024. Available online: https://huggingface.co/FlagAlpha/Llama2-Chinese-7b-Chat (accessed on 10 April 2024).
- Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
- Wang, S.; Sun, Y.; Xiang, Y.; Wu, Z.; Ding, S.; Gong, W.; Feng, S.; Shang, J.; Zhao, Y.; Pang, C.; et al. Ernie 3.0 titan: Exploring larger-scale knowledge enhanced pre-training for language understanding and generation. arXiv 2021, arXiv:2112.12731. [Google Scholar]
- Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open foundation and fine-tuned chat models. arXiv 2023, arXiv:2307.09288. [Google Scholar] [CrossRef]
- Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-rank adaptation of large language models. arXiv 2021, arXiv:2106.09685. [Google Scholar]
- Lin, Z. How to write effective prompts for large language models. Nat. Hum. Behav. 2024, 8, 611–615. [Google Scholar] [CrossRef]
- Ding, N.; Qin, Y.; Yang, G.; Wei, F.; Yang, Z.; Su, Y.; Hu, S.; Chen, Y.; Chan, C.M.; Chen, W.; et al. Parameter-efficient fine-tuning of large-scale pre-trained language models. Nat. Mach. Intell. 2023, 5, 220–235. [Google Scholar] [CrossRef]
- OpenAI. GPT-3.5 Turbo Fine-Tuning and API Updates. 2023. Available online: https://openai.com/index/gpt-3-5-turbo-fine-tuning-and-api-updates (accessed on 10 April 2024).
- Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient estimation of word representations in vector space. arXiv 2013, arXiv:1301.3781. [Google Scholar] [CrossRef]
- Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958. [Google Scholar]
Dataset | Category | Count
---|---|---
SOS-HL-1K | High-risk | 601
 | Low-risk | 648
 | Summary: train = 999, test = 250, L = 1, avg. labels per post = 1, avg. post length = 47.79 | 
SocialCD-3K | All-or-nothing thinking | 77
 | Over-generalization | 141
 | Mental filter | 378
 | Disqualifying the positive | 27
 | Mind reading | 121
 | The fortune teller error | 652
 | Magnification | 321
 | Emotional reasoning | 16
 | Should statements | 84
 | Labeling and mislabeling | 1961
 | Blaming oneself | 188
 | Blaming others | 27
 | Summary: train = 2725, test = 682, L = 12, avg. labels per post = 1.71, avg. post length = 42.56 | 
Model Category | Model | Type | Sub-Type | Train Data | Micro F1-Score (%) | Macro Precision (%) | Macro Recall (%) | Macro F1-Score (%) | High-Risk Precision (%) | High-Risk Recall (%) | High-Risk F1-Score (%)
---|---|---|---|---|---|---|---|---|---|---|---
Supervised learning | LSAN | Train from scratch | - | 999 | 83.20 (±1.15) | 82.75 (±1.05) | 83.75 (±0.90) | 82.81 (±0.98) | 74.59 (±1.21) | 87.50 (±0.95) | 80.53 (±0.87) |
BERT | Fine-tuning | - | 999 | 87.20 (±0.61) | 87.41 (±0.65) | 85.54 (±0.78) | 86.23 (±0.68) | 88.42 (±0.75) | 77.78 (±1.02) | 82.76 (±0.56) | |
LLM | ChatGLM2-6B | Zero-shot | basic | 0 | 60.80 (±4.91) | 63.41 (±5.32) | 60.61 (±4.93) | 58.26 (±5.99) | 69.07 (±7.02) | 37.10 (±8.55) | 48.07 (±8.99) |
role | 0 | 58.56 (±4.20) | 60.94 (±5.81) | 58.38 (±4.18) | 56.19 (±3.97) | 65.77 (±9.00) | 35.81 (±4.25) | 46.15 (±4.48) | |||
scene | 0 | 60.56 (±3.22) | 61.51 (±3.18) | 60.44 (±3.24) | 59.49 (±3.60) | 64.52 (±3.57) | 45.16 (±6.57) | 53.01 (±5.33) | |||
hybrid | 0 | 61.20 (±6.74) | 62.39 (±7.34) | 61.08 (±6.75) | 60.04 (±7.12) | 65.68 (±9.61) | 46.13 (±10.90) | 53.74 (±9.47) | |||
Few-shot | background + scene | 0 | 57.44 (±6.28) | 57.98 (±6.48) | 57.36 (±6.34) | 56.47 (±6.59) | 58.56 (±6.23) | 47.26 (±16.11) | 51.45 (±10.71) | ||
background + hybrid | 0 | 62.80 (±16.62) | 64.29 (±18.11) | 62.86 (±16.65) | 62.23 (±16.60) | 60.37 (±13.17) | 70.64 (±23.99) | 64.41 (±17.14) | |||
+ scene | 24 | 65.68 (±6.41) | 65.88 (±6.33) | 65.63 (±6.43) | 65.48 (±6.54) | 67.19 (±6.20) | 59.52 (±9.43) | 63.04 (±7.88) | |||
+ hybrid | 24 | 61.04 (±8.50) | 61.75 (±9.08) | 60.95 (±8.50) | 60.38 (±8.58) | 64.29 (±11.40) | 49.20 (±9.18) | 55.56 (±9.56) | |||
background + + scene | 24 | 50.00 (±9.17) | 50.09 (±9.80) | 50.05 (±9.16) | 49.53 (±9.26) | 49.74 (±7.41) | 56.61 (±12.12) | 52.70 (±8.70) | |||
background + + hybrid | 24 | 60.64 (±9.29) | 61.57 (±9.99) | 60.74 (±9.26) | 59.51 (±10.15) | 58.91 (±7.92) | 73.23 (±11.86) | 64.78 (±7.59) | |||
+ scene | 60 | 55.28 (±8.32) | 55.83 (±11.10) | 55.05 (±8.36) | 50.78 (±10.80) | 57.71 (±16.53) | 26.77 (±14.14) | 36.14 (±16.74) | |||
+ hybrid | 60 | 52.16 (±5.99) | 51.27 (±8.31) | 51.94 (±6.04) | 47.57 (±8.85) | 50.60 (±12.52) | 24.84 (±13.85) | 32.60 (±15.48) | |||
background + + scene | 60 | 60.64 (±12.21) | 61.33 (±12.79) | 60.58 (±12.20) | 60.23 (±12.07) | 62.97 (±14.85) | 52.90 (±13.93) | 57.02 (±12.67) | |||
background + + hybrid | 60 | 58.56 (±5.52) | 59.11 (±5.37) | 58.47 (±5.59) | 57.41 (±6.39) | 60.14 (±4.83) | 47.10 (±16.21) | 51.88 (±11.30) | |||
GLM-130B | Zero-shot | basic | 0 | 58.32 (±2.37) | 69.53 (±3.65) | 58.62 (±2.35) | 51.69 (±3.88) | 54.58 (±1.49) | 95.81 (±2.57) | 69.52 (±1.36) | |
role | 0 | 59.68 (±2.79) | 69.34 (±3.52) | 59.96 (±2.77) | 54.16 (±4.13) | 55.51 (±1.77) | 94.84 (±2.18) | 70.02 (±1.66) | |||
scene | 0 | 58.96 (±1.43) | 68.04 (±4.36) | 59.24 (±1.44) | 53.53 (±1.07) | 55.05 (±0.75) | 93.87 (±3.20) | 69.40 (±1.46) | |||
hybrid | 0 | 62.80 (±0.49) | 74.79 (±2.02) | 63.07 (±0.49) | 57.97 (±0.85) | 57.37 (±0.35) | 97.42 (±1.75) | 72.20 (±0.47) | |||
Few-shot | background + role | 0 | 60.80 (±1.85) | 66.90 (±3.41) | 61.03 (±1.86) | 57.25 (±1.72) | 56.55 (±1.10) | 90.32 (±3.28) | 69.55 (±1.77) | ||
background + hybrid | 0 | 61.52 (±2.29) | 69.00 (±3.86) | 61.77 (±2.29) | 57.62 (±2.73) | 56.91 (±1.47) | 92.42 (±3.45) | 70.43 (±1.82) | |||
+ role | 24 | 55.36 (±1.28) | 58.17 (±2.47) | 55.58 (±1.30) | 51.78 (±0.88) | 53.18 (±0.74) | 83.23 (±3.29) | 64.89 (±1.53) | |||
+ hybrid | 24 | 58.80 (±0.57) | 63.91 (±1.71) | 59.03 (±0.58) | 55.02 (±0.73) | 55.30 (±0.32) | 88.39 (±3.05) | 68.02 (±0.97) | |||
background + + role | 24 | 61.60 (±1.10) | 64.44 (±1.05) | 61.77 (±1.09) | 59.80 (±1.32) | 57.84 (±0.87) | 83.38 (±0.44) | 68.30 (±0.67) | |||
background + + hybrid | 24 | 66.24 (±1.08) | 71.07 (±0.71) | 66.38 (±1.06) | 64.29 (±1.50) | 60.88 (±1.07) | 90.00 (±1.47) | 72.61 (±0.48) | |||
GPT-3.5 | Zero-shot | basic | 0 | 53.76 (±4.28) | 57.82 (±7.62) | 54.03 (±4.28) | 47.76 (±4.81) | 52.00 (±2.25) | 88.23 (±1.98) | 65.42 (±1.81) | |
role | 0 | 56.32 (±0.95) | 67.70 (±2.39) | 56.64 (±0.95) | 48.44 (±1.59) | 53.31 (±2.05) | 96.13 (±1.55) | 68.59 (±1.67) | |||
scene | 0 | 54.08 (±2.57) | 58.84 (±4.69) | 54.36 (±2.58) | 47.89 (±2.93) | 52.16 (±2.11) | 89.03 (±2.01) | 65.76 (±1.94) | |||
hybrid | 0 | 54.80 (±2.12) | 61.06 (±3.34) | 55.10 (±2.10) | 47.69 (±3.36) | 52.55 (±1.98) | 92.26 (±1.82) | 66.95 (±1.78) | |||
Few-shot | background + role | 0 | 58.24 (±1.69) | 63.49 (±2.95) | 58.48 (±1.69) | 54.17 (±2.04) | 54.90 (±1.07) | 88.55 (±3.84) | 67.76 (±1.63) | ||
background + hybrid | 0 | 58.80 (±1.67) | 64.06 (±2.06) | 59.04 (±1.67) | 54.85 (±2.11) | 55.27 (±1.14) | 89.03 (±1.22) | 68.19 (±1.09) | |||
+ role | 24 | 59.76 (±1.37) | 62.98 (±1.69) | 59.95 (±1.37) | 57.50 (±1.52) | 56.34 (±0.96) | 83.39 (±1.51) | 67.22 (±1.08) | |||
+ hybrid | 24 | 60.96 (±1.37) | 64.29 (±1.69) | 61.15 (±1.37) | 58.77 (±1.52) | 57.19 (±0.96) | 84.68 (±1.51) | 68.27 (±1.08) | |||
background + + role | 24 | 63.20 (±1.57) | 65.53 (±2.62) | 63.35 (±1.59) | 61.98 (±1.31) | 59.37 (±0.83) | 81.61 (±4.50) | 68.71 (±2.04) | |||
background + + hybrid | 24 | 62.08 (±2.25) | 64.86 (±3.31) | 62.25 (±2.26) | 60.46 (±2.14) | 58.26 (±1.44) | 82.90 (±4.13) | 68.41 (±2.26) | |||
Fine-tuning | role | 999 | 83.12 (±0.55) | 82.25 (±0.49) | 81.44 (±0.72) | 81.80 (±0.58) | 84.76 (±0.51) | 71.77 (±0.98) | 77.73 (±0.42) | ||
scene | 999 | 83.36 (±0.51) | 82.81 (±0.56) | 82.02 (±0.69) | 82.40 (±0.54) | 84.11 (±0.63) | 72.58 (±0.85) | 77.92 (±0.49) | |||
hybrid | 999 | 83.60 (±0.48) | 83.83 (±0.55) | 82.05 (±0.61) | 82.70 (±0.45) | 84.26 (±0.45) | 73.39 (±0.81) | 78.45 (±0.38) | |||
GPT-4 | Zero-shot | basic | 0 | 62.64 (±2.05) | 72.38 (±3.40) | 62.90 (±2.05) | 58.33 (±2.57) | 57.43 (±1.31) | 95.48 (±2.10) | 71.72 (±1.42) | |
role | 0 | 62.64 (±2.74) | 74.26 (±3.90) | 62.91 (±2.73) | 57.77 (±3.50) | 57.29 (±1.73) | 97.26 (±2.03) | 72.10 (±1.82) | |||
scene | 0 | 64.88 (±2.11) | 76.03 (±2.59) | 65.14 (±2.09) | 60.86 (±2.68) | 58.81 (±1.46) | 97.58 (±1.51) | 73.39 (±1.37) | |||
hybrid | 0 | 62.96 (±2.60) | 74.67 (±4.57) | 63.23 (±2.60) | 58.22 (±3.02) | 57.47 (±1.57) | 97.42 (±2.38) | 72.30 (±1.89) | |||
Few-shot | + scene | 24 | 66.88 (±1.21) | 74.28 (±1.59) | 67.10 (±1.21) | 64.32 (±1.52) | 60.70 (±0.93) | 94.35 (±1.51) | 73.87 (±0.83) | ||
+ hybrid | 24 | 64.00 (±3.36) | 66.96 (±4.09) | 64.16 (±3.35) | 62.52 (±3.60) | 59.77 (±2.60) | 84.19 (±4.51) | 69.87 (±2.75) | |||
background + + scene | 24 | 69.28 (±5.60) | 70.58 (±5.11) | 69.38 (±5.56) | 68.74 (±6.05) | 65.44 (±5.37) | 81.77 (±2.77) | 72.63 (±4.07) | |||
background + + hybrid | 24 | 68.96 (±1.99) | 69.85 (±1.78) | 69.04 (±1.97) | 68.65 (±2.14) | 65.65 (±2.48) | 78.87 (±2.70) | 71.60 (±1.36) | |||
+ scene | 60 | 67.04 (±1.56) | 73.13 (±1.97) | 67.24 (±1.55) | 64.87 (±1.98) | 61.11 (±1.28) | 92.42 (±2.32) | 73.56 (±1.09) | |||
background + + hybrid | 60 | 73.76 (±5.63) | 74.59 (±5.41) | 73.83 (±5.61) | 73.54 (±5.77) | 70.16 (±5.76) | 82.58 (±4.21) | 75.81 (±4.74) |
Model Category | Model | Type | Sub-Type | Train Data | Micro Precision (%) | Micro Recall (%) | Micro F1-Score (%) | Macro Precision (%) | Macro Recall (%) | Macro F1-Score (%)
---|---|---|---|---|---|---|---|---|---|---
Supervised learning | LSAN | Train from scratch | - | 2725 | 75.53 (±0.85) | 71.48 (±0.91) | 73.45 (±0.88) | 70.10 (±1.15) | 67.00 (±1.21) | 67.28 (±1.18) |
BERT | Fine-tuning | - | 2725 | 85.43 (±0.68) | 68.62 (±0.85) | 76.10 (±0.45) | 75.00 (±0.98) | 68.30 (±1.15) | 70.05 (±0.85) | |
LLM | GLM-4 | Zero-shot | basic | 0 | 17.39 (±2.41) | 46.95 (±2.95) | 25.38 (±2.68) | 16.52 (±3.11) | 45.15 (±3.65) | 21.65 (±3.28) |
role | 0 | 20.56 (±2.25) | 46.95 (±2.81) | 28.59 (±2.51) | 19.88 (±2.95) | 45.82 (±3.51) | 22.11 (±3.21) | |||
scene | 0 | 17.19 (±2.55) | 45.33 (±3.08) | 24.93 (±2.85) | 16.81 (±3.25) | 44.92 (±3.78) | 21.05 (±3.55) | |||
hybrid | 0 | 19.54 (±2.31) | 47.32 (±2.88) | 27.66 (±2.59) | 18.95 (±3.01) | 43.93 (±3.58) | 21.67 (±3.29) | |||
Few-shot | background + hybrid | 0 | 30.63 (±2.01) | 48.82 (±2.15) | 37.64 (±2.08) | 25.50 (±2.61) | 41.21 (±2.75) | 25.30 (±2.68) | ||
background + role | 0 | 28.76 (±2.11) | 47.20 (±2.28) | 35.74 (±2.18) | 27.01 (±2.71) | 43.43 (±2.88) | 25.65 (±2.78) | |||
+ hybrid | 24 | 32.25 (±1.85) | 43.21 (±2.05) | 36.94 (±1.95) | 23.96 (±2.45) | 37.05 (±2.65) | 23.05 (±2.55) | |||
+ role | 24 | 34.82 (±1.72) | 44.58 (±1.91) | 39.10 (±1.81) | 23.64 (±2.32) | 30.17 (±2.51) | 22.40 (±2.41) | |||
background + + hybrid | 24 | 41.74 (±1.55) | 47.82 (±1.48) | 44.57 (±1.51) | 27.04 (±2.05) | 40.96 (±1.98) | 25.59 (±2.01) | |||
background + + role | 24 | 41.91 (±1.51) | 45.83 (±1.62) | 43.78 (±1.58) | 28.75 (±2.01) | 34.27 (±2.22) | 27.03 (±2.08) | |||
+ hybrid | 60 | 26.32 (±2.58) | 34.74 (±2.71) | 29.95 (±2.65) | 23.11 (±3.28) | 29.98 (±3.41) | 18.80 (±3.35) | |||
+ role | 60 | 29.67 (±2.45) | 36.61 (±2.65) | 32.78 (±2.52) | 23.72 (±3.15) | 32.71 (±3.35) | 20.74 (±3.22) | |||
GPT-3.5 | Zero-shot | basic | 0 | 10.63 (±2.65) | 13.95 (±3.11) | 12.06 (±2.88) | 9.85 (±3.35) | 12.16 (±3.81) | 8.76 (±3.58) | |
role | 0 | 12.17 (±2.81) | 10.21 (±2.98) | 11.10 (±2.91) | 11.05 (±3.51) | 9.86 (±3.68) | 7.21 (±3.61) | |||
scene | 0 | 10.33 (±2.77) | 11.83 (±3.25) | 11.03 (±3.01) | 9.79 (±3.47) | 10.02 (±3.95) | 7.55 (±3.71) | |||
hybrid | 0 | 10.59 (±2.71) | 11.83 (±3.11) | 11.18 (±2.85) | 9.96 (±3.41) | 10.62 (±3.81) | 6.48 (±3.55) | |||
Few-shot | background + hybrid | 0 | 20.16 (±2.51) | 12.20 (±2.75) | 15.21 (±2.62) | 18.14 (±3.21) | 11.96 (±3.45) | 8.36 (±3.32) | ||
background + basic | 0 | 25.84 (±2.35) | 11.46 (±2.61) | 15.88 (±2.48) | 23.31 (±3.05) | 10.48 (±3.31) | 9.60 (±3.18) | |||
+ hybrid | 24 | 16.76 (±2.68) | 19.68 (±2.99) | 18.10 (±2.81) | 14.37 (±3.38) | 18.01 (±3.69) | 10.14 (±3.51) | |||
+ basic | 24 | 17.61 (±2.61) | 14.69 (±2.88) | 16.02 (±2.75) | 15.14 (±3.31) | 8.87 (±3.58) | 6.09 (±3.45) | |||
Fine-tuning | scene | 2725 | 72.03 (±0.78) | 71.86 (±0.85) | 71.95 (±0.65) | 63.81 (±0.95) | 68.55 (±1.05) | 64.98 (±0.85) | ||
role | 2725 | 69.88 (±0.95) | 70.49 (±1.01) | 70.18 (±0.98) | 61.05 (±1.15) | 66.92 (±1.21) | 62.73 (±1.18) | |||
hybrid | 2725 | 71.94 (±0.75) | 72.48 (±0.81) | 72.21 (±0.68) | 64.22 (±0.95) | 69.01 (±1.01) | 65.53 (±0.88) | |||
Chinese-Alpaca-2-7B | Fine-tuning | scene | 2725 | 72.61 (±0.61) | 71.07 (±0.88) | 71.83 (±0.55) | 63.10 (±0.81) | 68.03 (±1.08) | 64.51 (±0.75) | |
role | 2725 | 73.60 (±0.45) | 72.32 (±0.71) | 72.96 (±0.38) | 65.48 (±0.65) | 70.02 (±0.91) | 66.89 (±0.58) | |||
hybrid | 2725 | 72.02 (±0.58) | 69.95 (±0.95) | 70.97 (±0.49) | 62.83 (±0.78) | 67.59 (±1.15) | 63.81 (±0.69) | |||
Chinese-LLaMA-2-7B | Fine-tuning | scene | 2725 | 73.56 (±0.51) | 71.45 (±0.81) | 72.49 (±0.43) | 64.91 (±0.71) | 69.46 (±1.01) | 66.04 (±0.63) | |
role | 2725 | 72.82 (±0.65) | 68.83 (±1.11) | 70.77 (±0.58) | 61.33 (±0.85) | 68.50 (±1.31) | 63.75 (±0.78) | |||
hybrid | 2725 | 73.46 (±0.48) | 70.07 (±0.99) | 71.73 (±0.51) | 63.08 (±0.68) | 69.72 (±1.19) | 65.29 (±0.71) | |||
LLaMA2-Chinese- 7b-Chat | Fine-tuning | scene | 2725 | 66.22 (±0.88) | 61.85 (±1.21) | 63.96 (±1.05) | 56.19 (±1.08) | 59.84 (±1.41) | 57.23 (±1.25) | |
role | 2725 | 68.59 (±0.75) | 65.09 (±1.01) | 66.80 (±0.81) | 58.75 (±0.95) | 64.01 (±1.21) | 60.18 (±1.01) | |||
hybrid | 2725 | 69.44 (±0.69) | 64.59 (±1.05) | 66.93 (±0.79) | 59.52 (±0.89) | 63.88 (±1.25) | 60.95 (±0.99) | |||
GPT-4 | Zero-shot | basic | 0 | 17.13 (±2.02) | 59.65 (±2.25) | 26.61 (±2.11) | 16.43 (±2.52) | 58.79 (±2.75) | 22.95 (±2.61) | |
role | 0 | 17.30 (±1.98) | 57.53 (±2.15) | 26.61 (±2.05) | 16.62 (±2.48) | 56.51 (±2.65) | 22.45 (±2.55) | |||
scene | 0 | 18.29 (±2.18) | 42.71 (±2.58) | 25.62 (±2.35) | 17.25 (±2.68) | 41.69 (±3.08) | 21.73 (±2.85) | |||
hybrid | 0 | 17.41 (±2.01) | 56.41 (±2.21) | 26.61 (±2.09) | 16.93 (±2.51) | 55.72 (±2.71) | 20.74 (±2.59) | |||
Few-shot | background + hybrid | 0 | 30.68 (±1.15) | 53.67 (±1.25) | 39.04 (±1.18) | 28.97 (±1.55) | 50.16 (±1.65) | 27.35 (±1.58) | ||
background + basic | 0 | 31.38 (±1.11) | 56.66 (±1.19) | 40.39 (±1.13) | 28.10 (±1.51) | 55.14 (±1.59) | 28.58 (±1.53) | |||
+ hybrid | 24 | 34.62 (±1.05) | 40.35 (±1.25) | 37.26 (±1.15) | 23.37 (±1.45) | 37.45 (±1.65) | 22.78 (±1.55) | |||
+ basic | 24 | 37.56 (±0.98) | 40.97 (±1.21) | 39.19 (±1.09) | 25.59 (±1.38) | 43.23 (±1.61) | 27.14 (±1.49) | |||
background + + hybrid | 24 | 26.25 (±1.55) | 43.84 (±1.81) | 32.84 (±1.68) | 20.60 (±2.05) | 42.87 (±2.31) | 22.04 (±2.18) | |||
background + + basic | 24 | 37.56 (±1.01) | 40.97 (±1.25) | 39.19 (±1.11) | 27.93 (±1.41) | 36.12 (±1.65) | 23.65 (±1.51) | |||
+ hybrid | 60 | 26.37 (±1.48) | 47.82 (±1.65) | 34.00 (±1.55) | 20.86 (±1.98) | 48.03 (±2.15) | 23.70 (±2.05) | |||
+ basic | 60 | 29.77 (±1.35) | 53.80 (±1.51) | 38.33 (±1.41) | 22.42 (±1.85) | 53.05 (±2.01) | 26.04 (±1.91) |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).