From Patient Emotion Recognition to Provider Understanding: A Multimodal Data Mining Framework for Emotion-Aware Clinical Counseling Systems
Abstract
1. Introduction
1.1. Data Mining Challenges in Therapeutic Communication
1.2. From Patient Monitoring to Bidirectional Clinical Analysis Systems
1.3. Research Contributions
2. Materials and Methods
2.1. Human Annotation Protocol
2.2. Patient Side: Multi-Label Classification with Imbalance Handling
2.2.1. Problem Formulation
2.2.2. Frequency-Stratified Class Weighting
2.2.3. Dynamic Threshold Optimization
2.2.4. Multimodal Extension
2.3. Provider Side: Cross-Modal Attention for Real-World Data
2.3.1. YouTube Data Processing Pipeline
2.3.2. Controlled HOPE Dataset
2.3.3. Cross-Modal Attention Architecture
3. Results
3.1. Patient-Side Emotion Recognition Performance
3.2. Comparison with State-of-the-Art Imbalance Handling Methods
3.3. Provider-Side Behavior Recognition Performance
3.4. Interaction-Level Alignment Analysis
3.5. Cross-Dataset Transfer Analysis
3.6. Systematic Attention Pattern Analysis
3.7. Summary of Experimental Findings
3.8. Automated Annotation Quality Assessment
4. Discussion
4.1. Controlled Versus Real-World Data Quality
4.2. Class-Aware Optimization: Engineering Solution for Domain-Specific Imbalance
4.3. Fusion Strategy Selection for Heterogeneous Modalities
4.4. Scalability and Deployment Considerations
4.5. Methodological Perspective
4.6. Cross-Domain and Cultural Generalizability
4.7. Bidirectionality as Analytical Framework
4.8. Limitations and Future Directions
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| AI | Artificial Intelligence |
| ASR | Automatic Speech Recognition |
| BCE | Binary Cross-Entropy |
| DSM-5 | Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition |
| FFN | Feed-Forward Network |
| HOPE | Healing Opportunities in Psychotherapy Expressions |
| LLM | Large Language Model |
| MFCC | Mel-Frequency Cepstral Coefficient |
| MLP | Multilayer Perceptron |
| NLP | Natural Language Processing |
| PQS | Psychotherapy Process Q-Set |
| ReLU | Rectified Linear Unit |
References
1. Tsoumakas, G.; Katakis, I. Multi-label classification: An overview. Int. J. Data Warehous. Min. 2007, 3, 1–13.
2. Zhang, M.L.; Zhou, Z.H. A review on multi-label learning algorithms. IEEE Trans. Knowl. Data Eng. 2014, 26, 1819–1837.
3. Japkowicz, N.; Stephen, S. The class imbalance problem: A systematic study. Intell. Data Anal. 2002, 6, 429–449.
4. Kumar, V.; Lalotra, G.S.; Sasikala, P.; Rajput, D.S.; Kaluri, R.; Lakshmanna, K.; Shorfuzzaman, M.; Alsufyani, A.; Uddin, M. Addressing Binary Classification over Class Imbalanced Clinical Datasets Using Computationally Intelligent Techniques. Healthcare 2022, 10, 1293.
5. He, H.; Garcia, E.A. Learning from Imbalanced Data. IEEE Trans. Knowl. Data Eng. 2009, 21, 1263–1284.
6. Charte, F.; Rivera, A.J.; del Jesus, M.J.; Herrera, F. Addressing imbalance in multilabel classification: Measures and random resampling algorithms. Neurocomputing 2015, 163, 3–16.
7. Saito, T.; Rehmsmeier, M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE 2015, 10, e0118432.
8. Pillai, I.; Fumera, G.; Roli, F. Threshold optimisation for multi-label classifiers. Pattern Recognit. 2013, 46, 2055–2065.
9. Cui, Y.; Jia, M.; Lin, T.Y.; Song, Y.; Belongie, S. Class-Balanced Loss Based on Effective Number of Samples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9268–9277.
10. Baltrusaitis, T.; Ahuja, C.; Morency, L.P. Multimodal machine learning: A survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 423–443.
11. Soleymani, M.; Garcia, D.; Jou, B.; Schuller, B.; Chang, S.F.; Pantic, M. A survey of multimodal sentiment analysis. Image Vis. Comput. 2017, 65, 3–14.
12. Schuller, B.; Batliner, A.; Steidl, S.; Seppi, D. Recognising realistic emotions and affect in speech: State of the art and lessons learnt from the first challenge. Speech Commun. 2011, 53, 1062–1087.
13. Banse, R.; Scherer, K.R. Acoustic profiles in vocal emotion expression. J. Pers. Soc. Psychol. 1996, 70, 614–636.
14. Scherer, K.R.; Johnstone, T.; Klasmeyer, G. Vocal expression of emotion. In Handbook of Affective Sciences; Oxford University Press: New York, NY, USA, 2003; pp. 433–456.
15. El Ayadi, M.; Kamel, M.S.; Karray, F. Survey on speech emotion recognition: Features, classification schemes, and databases. Pattern Recognit. 2011, 44, 572–587.
16. Zadeh, A.; Chen, M.; Poria, S.; Cambria, E.; Morency, L.P. Tensor fusion network for multimodal sentiment analysis. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, 7–11 September 2017; pp. 1103–1114.
17. Poria, S.; Cambria, E.; Hazarika, D.; Majumder, N.; Zadeh, A.; Morency, L.P. Context-dependent sentiment analysis in user-generated videos. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, BC, Canada, 30 July–4 August 2017; Volume 1, pp. 873–883.
18. Calvo, R.A.; Milne, D.N.; Hussain, M.S.; Christensen, H. Natural language processing in mental health applications using non-clinical texts. Nat. Lang. Eng. 2017, 23, 649–685.
19. Guntuku, S.C.; Yaden, D.B.; Kern, M.L.; Ungar, L.H.; Eichstaedt, J.C. Detecting depression and mental illness on social media: An integrative review. Curr. Opin. Behav. Sci. 2017, 18, 43–49.
20. De Choudhury, M.; Counts, S.; Horvitz, E. Social media as a measurement tool of depression in populations. In Proceedings of the 5th Annual ACM Web Science Conference, Paris, France, 2–4 May 2013; pp. 47–56.
21. Yates, A.; Cohan, A.; Goharian, N. Depression and self-harm risk assessment in online forums. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, 7–11 September 2017; pp. 2968–2978.
22. Benton, A.; Mitchell, M.; Hovy, D. Multitask learning for mental health conditions with limited social media data. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, Valencia, Spain, 3–7 April 2017; Volume 1, pp. 152–162.
23. Coppersmith, G.; Dredze, M.; Harman, C.; Hollingshead, K. From ADHD to SAD: Analyzing the Language of Mental Health on Twitter through Self-Reported Diagnoses. In Proceedings of the 2nd Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality, Denver, CO, USA, 5 June 2015; pp. 1–10.
24. Resnik, P.; Armstrong, W.; Claudino, L.; Nguyen, T.; Nguyen, V.A.; Boyd-Graber, J. Beyond LDA: Exploring supervised topic modeling for depression-related language in Twitter. In Proceedings of the 2nd Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality, Denver, CO, USA, 5 June 2015; pp. 99–107.
25. Losada, D.E.; Crestani, F.; Parapar, J. eRISK 2017: CLEF lab on early risk prediction on the Internet: Experimental foundations. In International Conference of the Cross-Language Evaluation Forum for European Languages, Dublin, Ireland, 11–14 September 2017; Springer: Cham, Switzerland, 2017; pp. 346–360.
26. Shen, J.H.; Rudzicz, F. Detecting Anxiety through Reddit. In Proceedings of the Fourth Workshop on Computational Linguistics and Clinical Psychology—From Linguistic Signal to Clinical Reality, Vancouver, BC, Canada, 3 August 2017; pp. 58–65.
27. Zhang, T.; Schoene, A.M.; Ji, S.; Ananiadou, S. Natural language processing applied to mental illness detection: A narrative review. NPJ Digit. Med. 2022, 5, 46.
28. Cummins, N.; Scherer, S.; Krajewski, J.; Schnieder, S.; Epps, J.; Quatieri, T.F. A review of depression and suicide risk assessment using speech analysis. Speech Commun. 2015, 71, 10–49.
29. Low, D.M.; Bentley, K.H.; Ghosh, S.S. Automated assessment of psychiatric disorders using speech: A systematic review. Laryngoscope Investig. Otolaryngol. 2020, 5, 96–116.
30. Kessler, R.C.; Chiu, W.T.; Demler, O.; Walters, E.E. Prevalence, Severity, and Comorbidity of 12-Month DSM-IV Disorders in the National Comorbidity Survey Replication. Arch. Gen. Psychiatry 2005, 62, 617–627.
31. Brown, T.A.; Campbell, L.A.; Lehman, C.L.; Grisham, J.R.; Mancill, R.B. Current and lifetime comorbidity of the DSM-IV anxiety and mood disorders in a large clinical sample. J. Abnorm. Psychol. 2001, 110, 585–599.
32. Mineka, S.; Watson, D.; Clark, L.A. Comorbidity of anxiety and unipolar mood disorders. Annu. Rev. Psychol. 1998, 49, 377–412.
33. Elliott, R.; Bohart, A.C.; Watson, J.C.; Murphy, D. Therapist empathy and client outcome: An updated meta-analysis. Psychotherapy 2018, 55, 399–410.
34. Norcross, J.C.; Lambert, M.J. Psychotherapy relationships that work III. Psychotherapy 2018, 55, 303–315.
35. Greenberg, L.S.; Elliott, R. Empathy. Psychotherapy 2019, 56, 461–468.
36. Watson, J.C. Reassessing Rogers’ necessary and sufficient conditions of change. Psychotherapy 2007, 44, 268–273.
37. Horvath, A.O.; Del Re, A.C.; Flückiger, C.; Symonds, D. Alliance in individual psychotherapy. Psychotherapy 2011, 48, 9–16.
38. Martin, D.J.; Garske, J.P.; Davis, M.K. Relation of the therapeutic alliance with outcome and other variables: A meta-analytic review. J. Consult. Clin. Psychol. 2000, 68, 438–450.
39. Flückiger, C.; Del Re, A.C.; Wampold, B.E.; Horvath, A.O. The alliance in adult psychotherapy: A meta-analytic synthesis. Psychotherapy 2018, 55, 316–340.
40. Wampold, B.E.; Imel, Z.E. The Great Psychotherapy Debate: The Evidence for What Makes Psychotherapy Work, 2nd ed.; Routledge: New York, NY, USA, 2015.
41. Flemotomos, N.; Martinez, V.R.; Chen, Z.; Singla, K.; Ardulov, V.; Peri, R.; Imel, Z.E.; Atkins, D.C.; Narayanan, S. Automated quality assessment of cognitive behavioral therapy sessions through highly contextualized language representations. PLoS ONE 2021, 16, e0258639.
42. Flemotomos, N.; Martinez, V.R.; Gibson, J.; Atkins, D.C.; Creed, T.A.; Narayanan, S.S. Language features for automated evaluation of cognitive behavior psychotherapy sessions: A machine learning approach. Front. Psychol. 2021, 12, 702139.
43. Orlinsky, D.E.; Rønnestad, M.H. How Psychotherapists Develop: A Study of Therapeutic Work and Professional Growth; American Psychological Association: Washington, DC, USA, 2005.
44. Juslin, P.N.; Scherer, K.R. Vocal expression of affect. In The New Handbook of Methods in Nonverbal Behavior Research; Oxford University Press: Oxford, UK, 2005; pp. 65–135.
45. Cowie, R.; Cornelius, R.R. Describing the emotional states that are expressed in speech. Speech Commun. 2003, 40, 5–32.
46. Ambady, N.; Rosenthal, R. Thin slices of expressive behavior as predictors of interpersonal consequences: A meta-analysis. Psychol. Bull. 1992, 111, 256–274.
47. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357.
48. Boyd, K.; Eng, K.H.; Page, C.D. Area under the precision-recall curve: Point estimates and confidence intervals. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Nancy, France, 15–19 September 2014; Springer: Berlin/Heidelberg, Germany, 2014; pp. 451–466.
49. Imel, Z.E.; Atkins, D.C.; Caperton, D.D.; Takano, K.; Iijima, Y.; Walker, D.D.; Steyvers, M. Mental Health Counseling From Conversational Content with Transformer-Based Machine Learning. JAMA Netw. Open 2024, 7, e2351075.
50. Bredin, H.; Laurent, A. End-to-End Speaker Segmentation for Overlap-Aware Resegmentation. In Proceedings of Interspeech 2021, Brno, Czech Republic, 30 August–3 September 2021; pp. 3111–3115.
51. Radford, A.; Kim, J.W.; Xu, T.; Brockman, G.; McLeavey, C.; Sutskever, I. Robust Speech Recognition via Large-Scale Weak Supervision. In Proceedings of ICML 2023, Honolulu, HI, USA, 23–29 July 2023.
52. Anthropic. Claude 3 Model Family: Introducing the Next Generation of AI Assistants; Technical Report; Anthropic: San Francisco, CA, USA, 2024.
53. Tsai, Y.H.; Bai, S.; Liang, P.P.; Kolter, J.Z.; Morency, L.P.; Salakhutdinov, R. Multimodal Transformer for Unaligned Multimodal Language Sequences. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 6558–6569.
54. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is All You Need. In Advances in Neural Information Processing Systems, Proceedings of the Annual Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Neural Information Processing Systems Foundation: San Diego, CA, USA, 2017; Volume 30.
55. Chen, S.; Wang, C.; Chen, Z.; Wu, Y.; Liu, S.; Chen, Z.; Li, J.; Kanda, N.; Yoshioka, T.; Xiao, X.; et al. WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing. IEEE J. Sel. Top. Signal Process. 2022, 16, 1505–1518.
56. He, P.; Gao, J.; Chen, W. DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing. arXiv 2021, arXiv:2111.09543.
57. Jones, E.E. Therapeutic Action: A Guide to Psychoanalytic Therapy; Jason Aronson: Lanham, MD, USA, 2000.
58. Ablon, J.S.; Jones, E.E. How expert clinicians’ prototypes of an ideal treatment correlate with outcome in psychodynamic and cognitive-behavioral therapy. Psychother. Res. 1998, 8, 71–83.
59. Gadzicki, K.; Khamsehashari, R.; Zetzsche, C. Early vs Late Fusion in Multimodal Convolutional Neural Networks. In Proceedings of the 2020 IEEE 23rd International Conference on Information Fusion (FUSION), Rustenburg, South Africa, 6–9 July 2020; pp. 1–6.
60. Gratch, J.; Artstein, R.; Lucas, G.; Stratou, G.; Scherer, S.; Nazarian, A.; Wood, R.; Boberg, J.; DeVault, D.; Marsella, S.; et al. The Distress Analysis Interview Corpus of human and computer interviews. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC), Reykjavik, Iceland, 26–31 May 2014; pp. 3123–3128.
61. Jones, E.E.; Pulos, S.M. Comparing the process in psychodynamic and cognitive-behavioral therapies. J. Consult. Clin. Psychol. 1993, 61, 306–316.
62. Lamers, F.; van Oppen, P.; Comijs, H.C.; Smit, J.H.; Spinhoven, P.; van Balkom, A.J.; Nolen, W.A.; Zitman, F.G.; Beekman, A.T.; Penninx, B.W. Comorbidity patterns of anxiety and depressive disorders in a large cohort study: The Netherlands Study of Depression and Anxiety (NESDA). J. Clin. Psychiatry 2011, 72, 341–348.
63. Kring, A.M.; Bachorowski, J.A. Emotions and psychopathology. Cogn. Emot. 1999, 13, 575–599.
64. Gross, J.J.; Muñoz, R.F. Emotion regulation and mental health. Clin. Psychol. Sci. Pract. 1995, 2, 151–164.
65. Alsentzer, E.; Murphy, J.; Boag, W.; Weng, W.H.; Jin, D.; Naumann, T.; McDermott, M. Publicly Available Clinical BERT Embeddings. arXiv 2019, arXiv:1904.03323.
66. McFee, B.; Raffel, C.; Liang, D.; Ellis, D.P.; McVicar, M.; Battenberg, E.; Nieto, O. librosa: Audio and music signal analysis in python. In Proceedings of the 14th Python in Science Conference, Austin, TX, USA, 6–12 July 2015; pp. 18–25.
67. Goldberg, S.B.; Flemotomos, N.; Martinez, V.R.; Tanana, M.J.; Kuo, P.B.; Pace, B.T.; Villatte, J.L.; Georgiou, P.G.; Van Epps, J.; Imel, Z.E.; et al. Machine learning and natural language processing in psychotherapy research: Alliance as example use case. J. Couns. Psychol. 2020, 67, 438–448.
68. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL-HLT, Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186.
69. Pepino, L.; Riera, P.; Ferrer, L. Emotion Recognition from Speech Using wav2vec 2.0 Embeddings. arXiv 2021, arXiv:2104.03502.
70. Ericsson, K.A. Deliberate practice and acquisition of expert performance: A general overview. Acad. Emerg. Med. 2008, 15, 988–994.
71. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988.
72. Menon, A.K.; Jayasumana, S.; Rawat, A.S.; Jain, H.; Veit, A.; Kumar, S. Long-tail learning via logit adjustment. In Proceedings of the International Conference on Learning Representations, Virtual Event, 3–7 May 2021.
73. Hassan, A.A.; Hanafy, R.J.; Fouda, M.E. Automated Multi-Label Annotation for Mental Health Illnesses Using Large Language Models. arXiv 2024, arXiv:2412.03796.
74. Almeida, H.; Briand, A.; Meurs, M.J. Multimodal depression detection: A comparative study of machine learning models and feature fusion techniques. J. Biomed. Inform. 2024, 149, 104565.
75. Al Hanai, T.; Ghassemi, M.; Glass, J. Detecting Depression with Audio/Text Sequence Modeling of Interviews. In Proceedings of Interspeech 2018, Hyderabad, India, 2–6 September 2018; pp. 1716–1720.
76. Majumder, N.; Poria, S.; Hazarika, D.; Mihalcea, R.; Gelbukh, A.; Cambria, E. DialogueRNN: An attentive RNN for emotion detection in conversations. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 6818–6825.
77. Poria, S.; Hazarika, D.; Majumder, N.; Naik, G.; Cambria, E.; Mihalcea, R. MELD: A multimodal multi-party dataset for emotion recognition in conversations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 527–536.
78. Busso, C.; Bulut, M.; Lee, C.C.; Kazemzadeh, A.; Mower, E.; Kim, S.; Chang, J.N.; Lee, S.; Narayanan, S.S. IEMOCAP: Interactive emotional dyadic motion capture database. Lang. Resour. Eval. 2008, 42, 335–359.
79. Valstar, M.; Schuller, B.; Smith, K.; Almaev, T.; Eyben, F.; Krajewski, J.; Cowie, R.; Pantic, M. AVEC 2014: 3D dimensional affect and depression recognition challenge. In Proceedings of the 4th International Workshop on Audio/Visual Emotion Challenge, Orlando, FL, USA, 7 November 2014; pp. 3–10.
80. Franklin, J.C.; Ribeiro, J.D.; Fox, K.R.; Bentley, K.H.; Kleiman, E.M.; Huang, X.; Musacchio, K.M.; Jaroszewski, A.C.; Chang, B.P.; Nock, M.K. Risk factors for suicidal thoughts and behaviors: A meta-analysis of 50 years of research. Psychol. Bull. 2017, 143, 187–232.
81. Nock, M.K.; Borges, G.; Bromet, E.J.; Cha, C.B.; Kessler, R.C.; Lee, S. Suicide and suicidal behavior. Epidemiol. Rev. 2008, 30, 133–154.
82. Baer, R.A.; Crane, C.; Miller, E.; Kuyken, W. Doing no harm in mindfulness-based programs: Conceptual issues and empirical findings. Clin. Psychol. Rev. 2019, 71, 101–114.
83. Habibi, M.; Weber, L.; Neves, M.; Wiegandt, D.L.; Leser, U. Deep learning with word embeddings improves biomedical named entity recognition. Bioinformatics 2017, 33, i37–i48.
84. Johnson, A.E.; Pollard, T.J.; Shen, L.; Lehman, L.H.; Feng, M.; Ghassemi, M.; Moody, B.; Szolovits, P.; Celi, L.A.; Mark, R.G. MIMIC-III, a freely accessible critical care database. Sci. Data 2016, 3, 160035.
85. Gaur, M.; Alambo, A.; Sain, J.P.; Kursuncu, U.; Thirunarayan, K.; Kavuluru, R.; Sheth, A.; Welton, R.; Pathak, J. Knowledge-aware assessment of severity of suicide risk for early intervention. In Proceedings of the World Wide Web Conference, San Francisco, CA, USA, 13–17 May 2019; pp. 514–525.
86. Sharma, A.; Lin, I.W.; Miner, A.S.; Atkins, D.C.; Althoff, T. Towards Facilitating Empathic Conversations in Online Mental Health Support: A Reinforcement Learning Approach. In Proceedings of the Web Conference 2021, Ljubljana, Slovenia, 19–23 April 2021; pp. 194–205.
87. Pérez-Rosas, V.; Mihalcea, R.; Resnicow, K.; Singh, S.; An, L. Understanding and Predicting Empathic Behavior in Counseling Therapy. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, BC, Canada, 30 July–4 August 2017; pp. 1426–1435.
88. Wu, T.; Huang, Q.; Liu, Z.; Wang, Y.; Lin, D. Distribution-Balanced Loss for Multi-Label Classification in Long-Tailed Datasets. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 162–178.
89. Kenny, P.G.; Parsons, T.D.; Gratch, J.; Leuski, A.; Rizzo, A.A. Virtual patients for clinical therapist skills training. In Proceedings of the International Conference on Intelligent Virtual Agents, Philadelphia, PA, USA, 20–22 September 2007; Springer: Berlin/Heidelberg, Germany, 2007; pp. 197–210.
90. Rizzo, A.; Scherer, S.; DeVault, D.; Gratch, J.; Artstein, R.; Hartholt, A.; Lucas, G.M.; Dyck, M.; Stratou, G.; Morency, L.P.; et al. Detection and computational analysis of psychological signals using a virtual human interviewing agent. J. Pain Manag. 2016, 9, 311–321.

| Dataset | Domain | Scale | Label Structure | Modalities | Imbalance | Challenges | Annotation |
|---|---|---|---|---|---|---|---|
| CounselChat | Patient Emotions | 1482 interactions | 25 categories (42.2% multi-label) | Text | 60:1 | Multi-label co-occurrence, extreme imbalance | Three psychologists, Cohen’s κ = 0.72, Fleiss’ κ = 0.78 |
| DAIC-WOZ | Patient Emotions | 8400 utterances | 11 emotions (multi-label) | Text, Audio | Moderate | Multi-label, fusion, noise | Two psychologists, κ = 0.69 |
| HOPE Controlled | Provider Behaviors | 178 sessions, 12,500 utterances | 25 PQS dimensions | Text, Audio | Balanced | Context-dependent behaviors, prosody | Single psychologist, κ = 0.76 |
| HOPE YouTube | Provider Comm. | 330 sessions, 14,086 segments | 6 styles | Text, Audio | Variable | Real-world quality, automation | Automated (Claude Sonnet 4) |

| Dataset | Method | Annotators | Granularity | Reliability | Noise | Notes |
|---|---|---|---|---|---|---|
| CounselChat | Expert human | 3 psychologists | Interaction | κ = 0.78 | Minimal | Gold standard |
| DAIC-WOZ | Expert human | 2 psychologists | Utterance | κ = 0.69 | Low | Clinical context |
| HOPE Controlled | Expert human | 1 psychologist | Utterance | κ = 0.76 | Low | Single-annotator limitation |
| HOPE YouTube | Automated LLM | Claude Sonnet 4 | 10-s windows | κ = 0.61 (audit) | Moderate–High | Not comparable to controlled |
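The reliability values in the table above are chance-corrected agreement coefficients. As a minimal sketch of how Cohen's kappa is computed for two annotators, the following uses a hypothetical helper `cohen_kappa` and toy labels; neither comes from the study's annotation data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement expected from each annotator's marginal label rates.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical annotations, for illustration only.
annotator_1 = ["neutral", "neutral", "reflective", "empathetic", "neutral", "reflective"]
annotator_2 = ["neutral", "reflective", "reflective", "empathetic", "neutral", "neutral"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Kappa subtracts the agreement expected by chance from the observed agreement, so it is lower than raw percent agreement whenever a few labels dominate.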
| Component | CounselChat | DAIC-WOZ Early | DAIC-WOZ Late |
|---|---|---|---|
| Base encoder | ClinicalBERT | ClinicalBERT | ClinicalBERT + MLP |
| Fusion | N/A | Concatenation | Weighted averaging |
| Loss | Weighted BCE | Standard BCE | Independent BCE |
| Optimizer | AdamW, | AdamW, | Text: ; Audio: |
| Batch size | 8/16 | 8/16 | Text: 8/16; Audio: 16/32 |
| Epochs | 5, patience = 3 | 5 | Text: 5; Audio: 20 |
| Fusion weight | N/A | N/A | 0.75 |
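The late-fusion column describes weighted averaging of independently trained text and audio classifiers. The sketch below assumes that the fusion weight of 0.75 multiplies the text probabilities and the remaining 0.25 goes to audio. The table does not state which modality the weight applies to, and the function and variable names are hypothetical.

```python
def late_fusion(text_probs, audio_probs, text_weight=0.75):
    """Weighted average of per-class probabilities from two modalities."""
    if len(text_probs) != len(audio_probs):
        raise ValueError("both modalities must score the same classes")
    if not 0.0 <= text_weight <= 1.0:
        raise ValueError("text_weight must lie in [0, 1]")
    audio_weight = 1.0 - text_weight
    return [text_weight * t + audio_weight * a
            for t, a in zip(text_probs, audio_probs)]


# Hypothetical per-class probabilities for one utterance (three labels).
fused = late_fusion([0.90, 0.20, 0.60], [0.50, 0.40, 0.10])
```

Because each modality is scored independently, a weak audio model can only shift the fused probability by at most its share of the weight, which is one plausible reading of why late fusion tracks the text-only result so closely.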
| Component | Controlled HOPE | YouTube HOPE |
|---|---|---|
| Text Encoder | DeBERTa-v3-base (184M) | DeBERTa-v3-base |
| Audio Encoder | WavLM-base-plus (95M) | WavLM-base-plus |
| Shared Dimension | d = 256, h = 8 | d = 256, h = 8 |
| Training | Warmup, 2 epochs; fine-tune, 10 epochs | Single-stage, 20 epochs |
| Batch Size | 8 (effective 32) | 8 |
| Output Classes | 25 PQS dimensions | 6 communication styles |
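For orientation, a shared dimension of d = 256 split over h = 8 heads gives 32-dimensional heads under standard multi-head attention [54]. The block below writes out the generic scaled dot-product cross-attention with text as the query and audio as key and value. That direction, and the symbols $X_t$, $X_a$, $W_i^Q$, $W_i^K$, $W_i^V$, and $W^O$, are assumptions for illustration, not the architecture as specified by the authors.

```latex
d_k = \frac{d}{h} = \frac{256}{8} = 32

\mathrm{head}_i = \operatorname{softmax}\!\left(
    \frac{(X_t W_i^{Q})\,(X_a W_i^{K})^{\top}}{\sqrt{d_k}}
\right) X_a W_i^{V}, \qquad i = 1, \dots, h

\mathrm{CrossAttn}(X_t, X_a) =
    \operatorname{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^{O}
```

Here $X_t$ and $X_a$ stand for the text and audio token sequences after projection to the shared dimension, so each text position attends over all audio frames.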
| Configuration | Micro-F1 | Macro-F1 | Subset Acc | Key Finding |
|---|---|---|---|---|
| CounselChat Results (25 Emotion Categories) | | | | |
| Single-Label Baseline | 0.30 | 0.12 | N/A | Label information lost |
| Multi-Label (No Weights) | 0.13 | 0.12 | 0.00 | Rare class collapse |
| Multi-Label + Class Weights | 0.52 | 0.53 | 0.18 | Substantial gain |
| Multi-Label + Weights + Thresholds | 0.65 | 0.74 | 0.34 | Six-fold improvement |
| DAIC-WOZ Results (11 Emotion Categories, Multimodal) | | | | |
| Text Only | 0.87 | 0.55 | N/A | Strong text signal |
| Audio Only | 0.28 | 0.15 | N/A | Limited audio signal |
| Early Fusion (Concatenation) | 0.64 | 0.39 | N/A | Suboptimal fusion |
| Late Fusion (Weighted) | 0.88 | 0.55 | N/A | Text-dominant optimal |
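The step from the weighted model to the thresholded model reflects per-class decision thresholds, and the gap between Micro-F1 and Macro-F1 reflects how the two averages treat rare classes. The sketch below illustrates both ideas, assuming a plain grid search that maximizes each class's F1 on validation data. The grid, the function names, and the toy data are illustrative assumptions; the search procedure actually used in the study is not shown in this section.

```python
def f1_from_counts(tp, fp, fn):
    """F1 from true-positive, false-positive and false-negative counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(scores, labels, grid=None):
    """Grid-search the decision threshold that maximizes F1 for one class."""
    grid = grid or [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
    best_t, best_f1 = 0.5, -1.0
    for t in grid:
        tp = sum(s >= t and y == 1 for s, y in zip(scores, labels))
        fp = sum(s >= t and y == 0 for s, y in zip(scores, labels))
        fn = sum(s < t and y == 1 for s, y in zip(scores, labels))
        f1 = f1_from_counts(tp, fp, fn)
        if f1 > best_f1:  # ties keep the lowest threshold
            best_t, best_f1 = t, f1
    return best_t, best_f1


def micro_macro_f1(per_class_counts):
    """per_class_counts: one (tp, fp, fn) tuple per class."""
    macro = sum(f1_from_counts(*c) for c in per_class_counts) / len(per_class_counts)
    tp = sum(c[0] for c in per_class_counts)
    fp = sum(c[1] for c in per_class_counts)
    fn = sum(c[2] for c in per_class_counts)
    return f1_from_counts(tp, fp, fn), macro


# Toy rare class: no score reaches 0.5, so a fixed 0.5 cut-off predicts nothing.
rare_scores = [0.30, 0.22, 0.10, 0.05, 0.08, 0.12]
rare_labels = [1, 1, 0, 0, 0, 0]
threshold, f1 = best_threshold(rare_scores, rare_labels)

# Toy counts: one frequent class predicted well, one rare class predicted badly.
micro, macro = micro_macro_f1([(90, 10, 10), (1, 0, 9)])
```

In the toy counts, the micro average is dominated by the frequent class while the macro average is pulled down by the rare one, which is why the two columns can move in different directions.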
| Method | Macro-F1 | Micro-F1 | Key Mechanism | Δ Macro-F1 vs. Ours |
|---|---|---|---|---|
| Naive Multi-Label Baseline | 0.12 | 0.13 | Standard BCE, threshold = 0.5 | −0.62 |
| Focal Loss [71] | 0.61 | 0.72 | Hard example emphasis, γ = 2.0 | −0.13 |
| Class-Balanced Loss [9] | 0.65 | 0.76 | Effective sample number | −0.09 |
| Logit Adjustment [72] | 0.68 | 0.78 | Post hoc adjustment | −0.06 |
| Ours (Stratified + Thresholds) | 0.74 | 0.65 | Frequency-stratified, per-class | – |
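The baselines in the table above differ mainly in how they reweight the binary cross-entropy for each label. The sketch below shows a per-label class-weighted BCE and the focal loss of [71] with focusing parameter γ = 2.0 and without the optional α balancing term. The `pos_weight` argument and the helper names are illustrative assumptions, and the frequency-stratified weighting scheme itself is not reproduced here.

```python
import math


def weighted_bce(p, y, pos_weight=1.0, eps=1e-7):
    """Binary cross-entropy for one label, up-weighting the positive term."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))


def focal_loss(p, y, gamma=2.0, eps=1e-7):
    """Focal loss for one label: down-weights examples already predicted well."""
    p = min(max(p, eps), 1.0 - eps)
    p_t = p if y == 1 else 1.0 - p  # probability assigned to the true class
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


# Hypothetical predictions for a positive label.
easy = focal_loss(0.9, 1)   # confident and correct: heavily down-weighted
hard = focal_loss(0.3, 1)   # under-confident: keeps most of its loss
rare = weighted_bce(0.2, 1, pos_weight=5.0)  # rare-class positive, weight 5
```

With `gamma=0.0` the focal loss reduces to ordinary cross-entropy, which makes the two functions easy to compare on the same predictions.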
| Architecture | Micro-F1 | Macro-F1 | Cohen’s κ | Δ Macro-F1 vs. Full |
|---|---|---|---|---|
| Controlled HOPE (Human-Annotated, 25 PQS Dimensions) | | | | |
| BERT-base + Acoustic Features | 0.58 | 0.58 | 0.52 | −0.33 |
| ClinicalBERT + Acoustic Features | 0.62 | 0.62 | 0.56 | −0.29 |
| ClinicalBERT + WavLM (Early Fusion) | 0.71 | 0.71 | 0.64 | −0.20 |
| DeBERTa-v3 + Acoustic Features | 0.74 | 0.74 | 0.68 | −0.17 |
| DeBERTa-v3 + WavLM (Concatenation) | 0.79 | 0.79 | 0.74 | −0.12 |
| DeBERTa-v3 + WavLM (Late Fusion) | 0.82 | 0.82 | 0.78 | −0.09 |
| DeBERTa-v3 + WavLM (Cross-Attention) | 0.93 | 0.91 | 0.87 | – |
| YouTube HOPE (Automated Annotation, 6 Communication Styles) | | | | |
| DeBERTa-v3 + WavLM (Cross-Attention) | 0.86 | 0.71 | N/A | – |

| PQS Dimension | Precision | Recall | F1 | Support | Multimodal Signature |
|---|---|---|---|---|---|
| Warmth/Supportiveness | 0.96 | 0.93 | 0.94 | 156 | Affirming language + soft prosody |
| Empathy | 0.94 | 0.91 | 0.92 | 142 | Validating content + warm tone |
| Silence/Listening | 0.93 | 0.90 | 0.91 | 145 | Distinctive acoustic absence |
| Validation | 0.92 | 0.89 | 0.90 | 134 | Clear linguistic markers |
| Reassurance | 0.88 | 0.85 | 0.86 | 103 | Moderate complexity |
| Directiveness | 0.87 | 0.84 | 0.85 | 98 | Multiple communication styles |
| Interpretation | 0.82 | 0.79 | 0.80 | 87 | Context-dependent patterns |
| Challenge/Confrontation | 0.79 | 0.75 | 0.77 | 67 | Subtle, variable delivery |

| Communication Style | F1 | 95% CI | Support | Threshold | Characteristics |
|---|---|---|---|---|---|
| Neutral | 0.934 | [0.91, 0.95] | 327 | 0.25 | Majority class |
| Transitional | 0.834 | [0.79, 0.87] | 211 | 0.60 | Structural cues |
| Reflective | 0.833 | [0.78, 0.88] | 137 | 0.30 | Paraphrasing |
| Empathetic | 0.600 | [0.42, 0.78] | 12 | 0.85 | Rare; wide CI |
| Supportive | 0.561 | [0.39, 0.73] | 18 | 0.45 | Limited examples |
| Validating | 0.500 | [0.25, 0.75] | 8 | N/A | Extreme rarity |

| Configuration | Empathy F1 | Warmth F1 | Directiveness F1 | Macro-F1 (25 Dims) |
|---|---|---|---|---|
| Full Model (All Features) | 0.92 | 0.94 | 0.85 | 0.91 |
| - Pitch Features | 0.84 | 0.88 | 0.82 | 0.85 |
| - Energy Features | 0.88 | 0.90 | 0.74 | 0.83 |
| - MFCC Features | 0.90 | 0.92 | 0.83 | 0.88 |
| - All Prosody (Text Only) | 0.76 | 0.78 | 0.66 | 0.75 |

| Communication Style | Automated Frequency | Expert Frequency | Agreement (κ) | Error Pattern |
|---|---|---|---|---|
| Neutral | 327 | 298 | 0.78 | Overdetection |
| Transitional | 211 | 224 | 0.72 | Good alignment |
| Reflective | 137 | 151 | 0.65 | Moderate underdetection |
| Supportive | 18 | 29 | 0.52 | Underdetection |
| Empathetic | 12 | 23 | 0.41 | Substantial underdetection |
| Validating | 8 | 17 | 0.35 | Severe underdetection |
| Overall | 200 | 200 | 0.61 | Bias toward majority |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Mallarapu, S.; Liu, X.; Zargarian, P.; Mottaghian, S.F.; Suresha, R.; Jain, V.; Bayat, A. From Patient Emotion Recognition to Provider Understanding: A Multimodal Data Mining Framework for Emotion-Aware Clinical Counseling Systems. Computers 2026, 15, 161. https://doi.org/10.3390/computers15030161

