Article

From Patient Emotion Recognition to Provider Understanding: A Multimodal Data Mining Framework for Emotion-Aware Clinical Counseling Systems

1 Khoury College of Computer Sciences, Northeastern University, San Jose, CA 95112, USA
2 College of Computing & Data Sciences, Boston University, Boston, MA 02215, USA
* Author to whom correspondence should be addressed.
Computers 2026, 15(3), 161; https://doi.org/10.3390/computers15030161
Submission received: 30 December 2025 / Revised: 22 January 2026 / Accepted: 29 January 2026 / Published: 3 March 2026

Abstract

Computational analysis of therapeutic communication presents challenges in multi-label classification, severe class imbalance, and heterogeneous multimodal data integration. We introduce a bidirectional analytical framework addressing patient emotion recognition and provider behavior analysis. For patient-side analysis, we employ ClinicalBERT on the human-annotated CounselChat dataset (1482 interactions, 25 categories, 60:1 imbalance ratio), achieving a macro-F1 of 0.74 through class weighting and threshold optimization, a six-fold improvement over naive baselines and a 6–13 point improvement over modern imbalance methods. For provider-side analysis, we process 330 YouTube therapy sessions through automated pipelines (speaker diarization, automatic speech recognition, temporal segmentation), yielding 14,086 annotated segments. Our architecture combines DeBERTa-v3-base with WavLM-base-plus through cross-modal attention mechanisms adapted from multimodal Transformer frameworks. On controlled human-annotated HOPE data (178 sessions, 12,500 utterances), the model achieves a macro-F1 of 0.91 with Cohen’s kappa of 0.87, comparable to inter-rater reliability reported in psychotherapy process research. On YouTube data, a macro-F1 of 0.71 demonstrates feasibility while highlighting the impact of annotation quality. Cross-dataset transfer and systematic attention analyses validate domain-specific effectiveness and interpretability.

1. Introduction

1.1. Data Mining Challenges in Therapeutic Communication

The computational analysis of therapeutic communication poses substantial challenges for contemporary data mining systems. Unlike conventional text or speech classification tasks, therapeutic interactions are characterized by extreme multi-label structure, severe class imbalance, and heterogeneous multimodal signals that must be jointly modeled to capture clinically meaningful patterns [1,2]. In psychotherapy settings, individual utterances may simultaneously express multiple emotional or behavioral states, resulting in high-dimensional label spaces with long-tailed distributions that violate assumptions underlying standard supervised learning approaches [3].
Class imbalance is particularly acute in this domain [4]. While common affective states such as concern or neutrality occur frequently, clinically salient behaviors and emotions such as rupture markers, avoidance, or reflective listening appear sparsely, often at ratios exceeding 50:1 relative to dominant classes [5,6]. Naive optimization toward overall accuracy or micro-averaged metrics leads to systematic underdetection of these rare but pedagogically critical phenomena [7]. Effective therapeutic communication analysis therefore requires data mining strategies that explicitly address imbalance through class-aware optimization and threshold calibration [8,9].
In addition, therapeutic communication is inherently multimodal, integrating lexical content, vocal prosody, pacing, and turn-taking dynamics [10,11,12]. Textual cues alone are often insufficient to distinguish between superficially similar utterances that differ in affective intent, while acoustic signals without semantic grounding can be ambiguous [13,14,15]. Integrating heterogeneous modalities introduces further challenges related to temporal alignment, modality-specific noise, and the need to learn cross-modal dependencies rather than simple feature concatenation [16,17]. These characteristics collectively place therapeutic communication at the intersection of several open problems in applied data mining.
This framework is designed for research and computational analysis purposes, not for clinical decision-making, patient care, or therapeutic evaluation.
Table 1 summarizes the datasets, annotation strategies, and data mining challenges addressed in this work.

1.2. From Patient Monitoring to Bidirectional Clinical Analysis Systems

Prior work in patient-focused emotion recognition has predominantly addressed unidirectional detection of client emotional states for monitoring and crisis intervention purposes [18,19,20,21,22]. These approaches have focused primarily on developing sophisticated multi-label classification models for recognizing client emotional states from counseling interactions, achieving strong technical performance on benchmark datasets [23,24,25,26]. Prior studies indicate that patient emotion recognition, while technically sophisticated and potentially valuable for monitoring applications, provides incomplete analytic utility for comprehensive interaction analysis [27,28,29].
Contemporary systems can identify when clients experience emotions such as anxiety and sadness [30,31,32], but this information alone does not indicate whether therapeutic responses were appropriate, whether expressed empathy was sufficient, whether vocal tone conveyed supportiveness, or whether the level of directiveness matched client needs [33,34]. Recognition of client emotional states represents only one component of therapeutic interaction. The subsequent phase involves assessing whether provider responses effectively address identified emotional states through coordinated verbal and nonverbal communication [35,36].
However, existing computational approaches rarely model provider behaviors alongside patient affect within a unified analytic framework, limiting their usefulness for interaction-level analysis and behavioral interpretation. The therapeutic alliance between clinician and client has emerged across decades of meta-analytic research as a primary predictor of positive treatment outcomes, independent of specific therapeutic modalities employed [37,38,39,40]. Yet computational approaches to modeling therapist communicative behaviors remain critically underdeveloped, particularly for naturalistic clinical data reflecting the complexity and variability of real-world practice [41,42,43].
Bidirectional modeling that encompasses both patient emotional states and provider communicative behaviors addresses fundamental limitations in existing approaches. Modeling therapist affective and communicative behaviors requires capturing subtle, context-dependent patterns where the same verbal content can convey vastly different therapeutic meanings depending on prosodic delivery [44,45]. A statement indicating difficulty can be spoken with warm, empathic prosody characterized by reduced speech rate, softer volume, and falling intonation contours, conveying genuine therapeutic presence. Alternatively, identical verbal content delivered with flat, perfunctory prosody characterized by monotone pitch, constant rate, and lack of vocal affect conveys substantially different meaning that clients may perceive as dismissive or inattentive [46].

1.3. Research Contributions

This work advances data mining techniques for extreme multi-label classification, class imbalance handling, and multimodal fusion through three primary contributions. First, we introduce frequency-stratified class weighting combined with dynamic per-class threshold optimization for multi-label classification under severe imbalance [47,48]. This engineering solution achieves substantial improvement in macro-F1 score compared to baseline multi-label classification and competitive modern methods, maintaining strong performance on classes with limited training examples while preventing gradient collapse for minority classes.
Second, we establish an automated processing pipeline for analyzing real-world psychotherapy sessions that addresses the scarcity of naturalistic clinical data suitable for computational modeling [49]. The pipeline incorporates speaker diarization [50], automatic speech recognition [51], temporal segmentation, automated annotation [52], and quality filtering, processing hundreds of sessions into thousands of annotated communication segments. Manual audit of automated annotations reveals substantial agreement (κ = 0.61) with expert review while highlighting systematic label noise in rare classes. This reproducible methodology establishes the viability of leveraging publicly available clinical content for research purposes while respecting privacy constraints.
Third, we develop a cross-modal attention architecture that learns content-dependent prosodic associations for behavior recognition in multimodal settings. While our architecture adapts existing multimodal Transformer mechanisms [53,54], our contribution lies in demonstrating its necessity for therapeutic communication analysis through systematic acoustic feature ablation and attention pattern analysis. Implementing scaled dot-product attention between text and audio representations [55,56], the architecture achieves a macro-F1 of 0.91 on controlled human-annotated data, comparable to inter-rater reliability reported among trained human annotators in psychotherapy process research [57,58]. Systematic attention analysis reveals interpretable patterns validated through correlation with expert ratings (ρ = 0.73). The architecture outperforms simple concatenation-based fusion by 12 percentage points [59]. On automatically annotated naturalistic data, the architecture achieves a macro-F1 of 0.71, quantifying the performance gap between controlled and real-world annotation regimes.
The bidirectionality in our framework is architectural and analytical rather than joint-optimized. While patient and provider models are trained independently on their respective datasets, this framework enables bidirectional analysis where both models operate on therapeutic interactions to provide comprehensive assessment. Interaction-level coupling analysis reveals moderate temporal correlation (r = 0.58) between patient distress and provider empathy, with a 72 percent alignment rate for appropriate responses. These contributions advance data mining methods for extreme multi-label learning under severe class imbalance, multimodal fusion with heterogeneous signal quality, and scalable analysis of complex interaction data in naturalistic settings. The complete bidirectional framework is illustrated in Figure 1.
The remainder of this paper is organized as follows. Section 2 describes our datasets, annotation protocols, and model architectures for both patient-side and provider-side analysis. Section 3 presents experimental results including patient emotion recognition performance, provider behavior recognition across controlled and YouTube data, comparison with state-of-the-art imbalance handling methods, cross-dataset transfer experiments, systematic interpretability analysis, and automated annotation quality assessment. Section 4 discusses our findings in the context of data quality regimes, fusion strategies, and methodological contributions, while acknowledging limitations and future directions. Section 5 concludes the paper.

2. Materials and Methods

An overview of the complete system architecture and data processing pipelines is shown in Figure 1.

2.1. Human Annotation Protocol

We established a three-stage annotation protocol for all datasets. Stage one involved schema development through collaborative discussion sessions with clinical annotators. Stage two comprised full dataset annotation where each sample received independent review by multiple annotators. Stage three addressed schema refinement and consolidation based on label distributions and clinical validity.
For CounselChat, three licensed clinical psychologists with an average of eight years of clinical experience independently annotated 1482 counseling interactions across 25 emotion categories. Initial inter-annotator agreement achieved Cohen’s kappa of 0.72. Final inter-rater reliability achieved Fleiss’ kappa of 0.78 on a validation subset of 200 interactions. For DAIC-WOZ [60], two clinical psychologists independently annotated approximately 8400 utterances across 11 emotion categories, achieving Cohen’s kappa of 0.69. For controlled HOPE, a single licensed clinical psychologist with 12 years of experience annotated approximately 12,500 therapist utterances across 25 PQS dimensions [61], achieving Fleiss’ kappa of 0.76 on a validation subset of 1000 utterances. This single-annotator limitation means the reported agreement statistics reflect consistency with one expert’s interpretation rather than consensus among multiple raters. Multi-annotator annotation would strengthen validity, but resource constraints and extensive annotation workload necessitated this approach. Controlled human-labeled data anchors evaluation throughout this work.
Table 2 summarizes the annotation characteristics and reliability comparison across datasets.

2.2. Patient Side: Multi-Label Classification with Imbalance Handling

2.2.1. Problem Formulation

We formulate patient emotion recognition as multi-label classification where each interaction receives predictions for L emotion labels simultaneously. This design choice was motivated by the need to capture co-occurring emotional states [62]; emotion dysregulation and its relationship to psychopathology provide further theoretical grounding for multi-label approaches, as affective disturbances frequently co-occur across diagnostic boundaries [63,64]. ClinicalBERT [65] serves as the base encoder, producing 768-dimensional contextual embeddings from the [CLS] token. A linear classification head maps to L output dimensions, with sigmoid activation enabling independent probabilistic estimates per label.
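A minimal sketch of this formulation is shown below, assuming the public Bio_ClinicalBERT checkpoint as the concrete ClinicalBERT encoder; apart from the 768-dimensional hidden size and 25-label output stated above, all details are illustrative rather than the exact implementation.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

NUM_LABELS = 25  # CounselChat emotion categories

class PatientEmotionClassifier(nn.Module):
    def __init__(self, model_name="emilyalsentzer/Bio_ClinicalBERT"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        # 768-dimensional [CLS] embedding -> one logit per emotion label
        self.head = nn.Linear(self.encoder.config.hidden_size, NUM_LABELS)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # contextual [CLS] token
        return self.head(cls)              # raw logits

# probs = torch.sigmoid(model(input_ids, attention_mask)) yields one
# independent probability estimate per label, as the formulation requires.
```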

2.2.2. Frequency-Stratified Class Weighting

This design choice was motivated by the need to address severe class imbalance. We implement a class-weighted binary cross-entropy loss:
$$\mathcal{L} = -\sum_{i=1}^{L} w_i \left[\, y_i \log(p_i) + (1 - y_i)\log(1 - p_i) \,\right]$$
where weights combine square root inverse frequency with stratified multipliers:
$$w_i = m_i \times \sqrt{\frac{N_{\mathrm{neg},i}}{N_{\mathrm{pos},i}}}$$
The multiplier $m_i$ equals 2.0 for extremely rare categories, 1.5 for moderately rare categories, and 1.0 for common categories.
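The scheme can be sketched as below; the count cutoffs defining the rarity tiers are hypothetical placeholders, since the tiers are described only qualitatively above, and the per-class weight multiplies the full binary cross-entropy term as in the loss equation.

```python
import torch
import torch.nn.functional as F

def stratified_weights(n_pos, n_neg, rare_cutoff=20, moderate_cutoff=100):
    """Per-label weight: stratified multiplier x sqrt inverse frequency.
    n_pos, n_neg: tensors of per-label positive/negative counts (length L).
    The tier cutoffs are hypothetical, not taken from the paper."""
    base = torch.sqrt(n_neg.float() / n_pos.float())
    mult = torch.ones_like(base)
    mult[n_pos < moderate_cutoff] = 1.5   # moderately rare categories
    mult[n_pos < rare_cutoff] = 2.0       # extremely rare categories
    return mult * base

def weighted_bce(logits, targets, w):
    """Class-weighted BCE matching the loss equation above."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    return (w * bce).mean()               # w (length L) broadcasts over the batch
```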

2.2.3. Dynamic Threshold Optimization

For each label, we select thresholds maximizing validation F1:
$$\tau_i = \arg\max_{\tau \in [0.1,\, 0.9]} \mathrm{F1}_i(\tau)$$
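A direct sketch of this per-label search follows; the 0.05 grid step is an assumption, as only the [0.1, 0.9] search range is specified above.

```python
import numpy as np
from sklearn.metrics import f1_score

def optimize_thresholds(val_probs, val_labels, grid=np.arange(0.10, 0.91, 0.05)):
    """val_probs, val_labels: arrays of shape (n_samples, n_labels).
    Returns one F1-maximizing decision threshold per label."""
    thresholds = np.full(val_probs.shape[1], 0.5)
    for i in range(val_probs.shape[1]):
        scores = [f1_score(val_labels[:, i], val_probs[:, i] >= t,
                           zero_division=0) for t in grid]
        thresholds[i] = grid[int(np.argmax(scores))]
    return thresholds
```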

2.2.4. Multimodal Extension

For DAIC-WOZ, we extracted acoustic features including fundamental frequency, mean energy, and thirteen mel-frequency cepstral coefficients using librosa [66]. Late fusion combines independent text and audio classifiers at the decision level:
$$p_{\mathrm{fused}} = \alpha \cdot p_{\mathrm{text}} + (1 - \alpha) \cdot p_{\mathrm{audio}}$$
where α is optimized on validation data.
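The sketch below pairs the librosa feature extraction with the decision-level fusion rule; the pyin pitch bounds and the α search grid are illustrative assumptions not stated in the text.

```python
import librosa
import numpy as np
from sklearn.metrics import f1_score

def acoustic_features(wav_path):
    """Utterance-level F0, mean energy, and 13 MFCCs (means over frames)."""
    y, sr = librosa.load(wav_path, sr=16000)
    f0, _, _ = librosa.pyin(y, fmin=65, fmax=400, sr=sr)  # assumed pitch range
    energy = librosa.feature.rms(y=y).mean()
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)
    return np.concatenate([[np.nanmean(f0)], [energy], mfcc])

def select_alpha(p_text, p_audio, labels, thresholds,
                 grid=np.arange(0.0, 1.01, 0.05)):
    """Pick the mixing weight maximizing validation macro-F1 for
    p_fused = alpha * p_text + (1 - alpha) * p_audio."""
    scores = [f1_score(labels, (a * p_text + (1 - a) * p_audio) >= thresholds,
                       average="macro", zero_division=0) for a in grid]
    return grid[int(np.argmax(scores))]
```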
Table 3 summarizes patient-side model configurations.

2.3. Provider Side: Cross-Modal Attention for Real-World Data

2.3.1. YouTube Data Processing Pipeline

This design choice was motivated by the scarcity of naturalistic clinical data. We processed 330 YouTube psychotherapy sessions through the following automated pipeline: (1) yt-dlp extracts audio as 16 kHz mono WAVs, (2) pyannote.audio [50] performs speaker diarization, (3) therapist-only audio is reconstructed, (4) Whisper-large-v3 [51] generates timestamped transcripts, (5) temporal chunking creates 10-s windows with 2-s stride, (6) Claude Sonnet 4 [52] annotates six communication styles (Neutral, Reflective, Empathetic, Supportive, Validating, Transitional), (7) quality filtering removes ambiguous segments.
The automated annotation approach introduces systematic label noise compared to expert human annotation. This methodological limitation must be considered when interpreting performance metrics from YouTube data. While automated annotation enables large-scale data acquisition, it trades label reliability for scalability. The YouTube-derived annotations should not be treated as fully interchangeable with expert human-labeled data, and performance metrics on these datasets reflect different evaluation contexts. Large language models were used only for annotation; all predictive models were trained independently on annotated data. The pipeline successfully processed 278 sessions, yielding 14,086 annotated segments.
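A condensed sketch of steps (2), (4), and (5) is shown below. The pyannote and Whisper checkpoints named here are the public model identifiers we believe correspond to the cited tools; access tokens, therapist-speaker filtering, LLM annotation, and quality filtering are omitted for brevity.

```python
import numpy as np
import whisper
from pyannote.audio import Pipeline

diarizer = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")  # step 2
asr = whisper.load_model("large-v3")                                     # step 4

def process_session(wav_path, win=10.0, stride=2.0):
    diarization = diarizer(wav_path)
    speaker_turns = [(turn.start, turn.end, spk) for turn, _, spk
                     in diarization.itertracks(yield_label=True)]
    segments = asr.transcribe(wav_path)["segments"]  # timestamped transcript
    # Step 5: 10-s windows with 2-s stride over the session timeline.
    session_end = max(s["end"] for s in segments)
    windows = []
    for start in np.arange(0.0, max(session_end - win, stride), stride):
        text = " ".join(s["text"].strip() for s in segments
                        if s["start"] < start + win and s["end"] > start)
        windows.append({"start": start, "end": start + win, "text": text})
    return speaker_turns, windows
```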

2.3.2. Controlled HOPE Dataset

Using the annotation protocol described above, we processed 178 controlled HOPE sessions [67] with session-level PQS ratings. We retrieved audio–video recordings, performed forced alignment for utterance segmentation, and obtained approximately 12,500 therapist utterances. A single psychologist annotated these utterances across 25 PQS dimensions.

2.3.3. Cross-Modal Attention Architecture

This design choice was motivated by the need to model coordinated verbal–prosodic patterns. Text encoder DeBERTa-v3-base [56,68] and audio encoder WavLM-base-plus [55,69] project to shared 256-dimensional space, and cross-modal attention implements scaled dot-product attention [54]:
$$Q = h_t W_Q, \quad K = h_a W_K, \quad V = h_a W_V$$
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V$$
$$h_{\mathrm{attn}} = \mathrm{LayerNorm}\big(h_t + \mathrm{MultiHead}(Q, K, V)\big)$$
$$h_{\mathrm{fused}} = \mathrm{LayerNorm}\big(h_{\mathrm{attn}} + \mathrm{FFN}(h_{\mathrm{attn}})\big)$$
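A minimal PyTorch rendering of this fusion block is given below; the number of attention heads is an assumption, and encoder loading plus the 256-dimensional projections are simplified.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Text queries attend over audio keys/values, per the equations above."""
    def __init__(self, dim=256, num_heads=4):  # head count is an assumption
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, h_text, h_audio):
        # Q from text, K/V from audio; projections live inside MultiheadAttention
        attended, _ = self.attn(query=h_text, key=h_audio, value=h_audio)
        h_attn = self.norm1(h_text + attended)         # residual + LayerNorm
        return self.norm2(h_attn + self.ffn(h_attn))   # position-wise FFN block

# h_text: (batch, T_text, 256) projected DeBERTa-v3-base states;
# h_audio: (batch, T_audio, 256) projected WavLM-base-plus states.
```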
Training employed a multi-stage strategy comprising warmup (2 epochs), fine-tuning (10 epochs), and threshold optimization [70]. For rare classes with fewer than 20 examples, we report bootstrap 95 percent confidence intervals to indicate estimation uncertainty. Bootstrap resampling was performed with 1000 iterations, sampling with replacement from the available examples for each rare class.
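A sketch of the bootstrap procedure as described (1000 resamples with replacement, percentile interval) might look like the following.

```python
import numpy as np
from sklearn.metrics import f1_score

def bootstrap_f1_ci(y_true, y_pred, n_boot=1000, seed=0):
    """95% percentile bootstrap CI for F1 on a rare class.
    y_true, y_pred: binary arrays over that class's available examples."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), size=len(y_true))  # with replacement
        scores.append(f1_score(y_true[idx], y_pred[idx], zero_division=0))
    return np.percentile(scores, [2.5, 97.5])
```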
Table 4 summarizes provider-side model configurations.

3. Results

3.1. Patient-Side Emotion Recognition Performance

As shown in Table 5, the single-label formulation achieved a micro-F1 of 0.30 and a macro-F1 of 0.12. Multi-label formulation without class weights yielded a micro-F1 of 0.13 and a macro-F1 of 0.12. Incorporating class weighting improved performance to a micro-F1 of 0.52 and macro-F1 of 0.53. The final model with dynamic thresholds achieved a micro-F1 of 0.65, macro-F1 of 0.74, and subset accuracy of 0.34.
Ablation experiments isolated component contributions. Removing dynamic thresholds reduced performance by 0.13 points. Removing class weights reduced performance by 0.62 points. Reducing training epochs reduced performance by 0.09 points.

3.2. Comparison with State-of-the-Art Imbalance Handling Methods

To contextualize our frequency-stratified weighting approach, we compared it against modern methods for imbalanced multi-label classification on the CounselChat dataset. Table 6 presents the results of our model and competitive approaches.
Our frequency-stratified weighting with dynamic thresholds achieves a macro-F1 of 0.74, outperforming Focal Loss by 13 points, Class-Balanced Loss by 9 points, and Logit Adjustment by 6 points. These results demonstrate competitive performance relative to modern imbalance handling methods, though margins are moderate rather than transformative. The practical benefits arise from combining square-root inverse frequency weighting, stratified multipliers based on rarity tiers, and per-class threshold optimization specifically tuned for extreme multi-label imbalance scenarios.

3.3. Provider-Side Behavior Recognition Performance

Table 7 presents provider-side model performance across architectures. On controlled HOPE data, cross-modal fusion achieved a micro-F1 of 0.93, a macro-F1 of 0.91, and Cohen’s kappa of 0.87, comparable to inter-rater reliability reported among trained human annotators in psychotherapy process research. On YouTube data, the architecture achieved a micro-F1 of 0.86 and macro-F1 of 0.71.
Performance should be interpreted relative to a single expert annotator rather than multi-annotator consensus. The reported Cohen’s kappa of 0.87 indicates agreement with this single annotator’s judgments, which may not fully represent the range of expert clinical interpretations. Performance metrics on YouTube data should be interpreted with caution given the automated annotation methodology. Direct comparison with controlled HOPE metrics is complicated by differences in annotation reliability (κ = 0.76 vs. κ = 0.61 on audit), label granularity (25 PQS dimensions vs. 6 communication styles), and segmentation approaches (utterance-level vs. fixed windows). These methodological differences limit strict comparability across data sources.
Table 8 presents per-dimension performance on controlled HOPE data. Warmth and Supportiveness achieved an F1 of 0.94, Empathy achieved 0.92, Silence and Listening achieved 0.91, Validation achieved 0.90, Reassurance achieved 0.86, Directiveness achieved 0.85, Interpretation achieved 0.80, and Challenge and Confrontation achieved 0.77.
Table 9 presents per-style performance on YouTube HOPE data with bootstrap confidence intervals. Neutral achieved an F1 of 0.934 with a narrow confidence interval. Transitional achieved 0.834. Reflective achieved 0.833. Rare styles including Empathetic (0.600), Supportive (0.561), and Validating (0.500) show substantially wider confidence intervals, indicating estimation uncertainty.
Performance estimates for rare classes should be interpreted cautiously given limited support and wide confidence intervals. These metrics are descriptive of our specific data split rather than robust population estimates under strict statistical standards.

3.4. Interaction-Level Alignment Analysis

To assess the analytical utility of bidirectional modeling, we examined interaction-level coupling between patient emotion predictions and provider behavior predictions on a validation subset of 150 therapy segments containing both patient and therapist utterances.
Temporal correlation analysis revealed moderate positive correlation between detected patient distress levels (composite score from anxiety, sadness, stress predictions) and provider empathy scores (Pearson r = 0.58, p < 0.001). We computed alignment rate as the proportion of high-distress patient utterances (top quartile of distress scores) followed within two utterances by high-empathy provider responses (top quartile of empathy scores). The observed alignment rate was 72 percent, compared to 31 percent expected by chance (p < 0.001, chi-square test).
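A hypothetical sketch of the alignment-rate computation follows; the arrays, quartile cutoffs, and two-utterance window mirror the description above, but the study’s exact turn-pairing logic may differ.

```python
import numpy as np

def alignment_rate(distress, empathy, window=2):
    """distress: composite patient distress scores in temporal order;
    empathy: provider empathy scores aligned to the same turn sequence."""
    hi_distress = distress >= np.quantile(distress, 0.75)  # top quartile
    hi_empathy = empathy >= np.quantile(empathy, 0.75)
    idx = np.flatnonzero(hi_distress)
    hits = 0
    for i in idx:
        # aligned if a high-empathy response occurs within `window` utterances
        if hi_empathy[i + 1:i + 1 + window].any():
            hits += 1
    return hits / max(len(idx), 1)
```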
We further evaluated misalignment detection by identifying cases where high patient distress was not met with appropriate provider responses. The framework achieved a sensitivity of 84 percent and specificity of 76 percent for detecting these misalignment moments, validated against independent clinical supervisor ratings on 50 randomly selected interaction sequences.
These analyses demonstrate that although the patient and provider models are trained independently, their joint application enables quantitative assessment of interaction-level patterns. However, the moderate correlation (r = 0.58) and imperfect alignment detection indicate substantial room for improvement through joint modeling approaches that explicitly optimize for interaction-level coherence.

3.5. Cross-Dataset Transfer Analysis

To assess domain generalizability, we conducted cross-dataset transfer experiments evaluating how well models trained on one dataset perform on another without retraining.
For patient-side emotion recognition, we evaluated the CounselChat-trained model (macro-F1 0.74 on CounselChat test set) on DAIC-WOZ data in zero-shot fashion. Performance dropped substantially to a macro-F1 of 0.52, representing a 22-percentage-point degradation. This performance gap reflects differences in conversational context (online counseling text vs. clinical interview dialogue), label schema alignment (25 vs. 11 categories with imperfect overlap), and distribution shift in emotion prevalence.
For provider-side behavior recognition, we evaluated the HOPE-controlled-trained model on YouTube data without fine-tuning. Zero-shot performance achieved a macro-F1 of 0.58 compared to 0.71 with full fine-tuning on YouTube training data, representing a 13-percentage-point gap. This demonstrates that while the cross-modal attention architecture learns transferable representations, domain-specific adaptation remains necessary for optimal performance.
These transfer results quantify the domain adaptation challenges inherent in therapeutic communication analysis, where differences in therapeutic modality, recording conditions, annotation methodologies, and population characteristics introduce substantial distribution shift. The observed performance degradation highlights the importance of domain-specific training data and suggests that models trained on one therapeutic context may require careful adaptation before deployment in different settings.

3.6. Systematic Attention Pattern Analysis

To move beyond anecdotal interpretability evidence, we conducted systematic analysis of cross-modal attention patterns across therapeutic behaviors.
We aggregated attention weights from 1000 randomly selected high-empathy utterances (top quartile of empathy scores) and computed mean attention allocation across prosodic feature groups. Results show consistent patterns: reduced-speech-rate features received a mean attention weight of 0.78 (SD = 0.12), softer-volume features received 0.82 (SD = 0.09), and falling pitch contours received 0.71 (SD = 0.15). For comparison, the same prosodic features received substantially lower attention in directive intervention utterances: reduced rate, 0.31 (SD = 0.18); softer volume, 0.29 (SD = 0.16); and falling pitch, 0.34 (SD = 0.21). Wilcoxon signed-rank tests confirmed that these distributions differ significantly (p < 0.001 for all three feature groups).
For directive interventions, attention analysis of 800 high-directiveness utterances revealed emphasis on different prosodic patterns: steady-speech-rate features (mean weight 0.85, SD = 0.11), moderate-volume features (0.79, SD = 0.13), and rising terminal pitch (0.76, SD = 0.14). These patterns align with communication theory regarding prosodic markers of confident guidance versus empathic support.
To validate clinical meaningfulness, we computed Spearman rank correlations between aggregated acoustic attention weights and expert PQS ratings on a validation subset of 500 utterances. The results show significant correlations: empathy dimension ρ = 0.73 (p < 0.001), warmth ρ = 0.68 (p < 0.001), and directiveness ρ = 0.64 (p < 0.001). These correlations suggest the model’s attention mechanisms capture prosodic patterns that align with expert clinical judgments.
We further conducted acoustic feature ablation to identify which features most contribute to specific behavior recognition. Table 10 presents results obtained by systematically removing feature groups. Removing pitch features most heavily impacts empathy detection (−0.08 F1), while removing energy features most directly impacts directiveness detection (−0.11 F1). Removing all prosodic features (text-only model) reduces performance substantially (−0.16 F1 for empathy, −0.19 for directiveness), validating the necessity of multimodal modeling.

3.7. Summary of Experimental Findings

The experimental results demonstrate three primary findings. First, frequency-stratified class weighting combined with dynamic threshold optimization achieves substantial improvements in multi-label classification under extreme imbalance, with patient-side models reaching a macro-F1 of 0.74, representing a six-fold improvement over the naive baseline and a 6–13 point improvement over modern imbalance handling methods including Focal Loss, Class-Balanced Loss, and Logit Adjustment. Second, cross-modal attention architectures outperform concatenation and late fusion approaches for provider behavior recognition, achieving a macro-F1 of 0.91 on controlled data, comparable to the inter-rater reliability reported among trained human annotators in psychotherapy process research. Systematic attention analysis reveals interpretable patterns where the model attends to vocal warmth for empathy and confident prosody for directiveness, validated through correlation with expert ratings (ρ = 0.73). Third, the performance gap between controlled (0.91) and YouTube (0.71) data quantifies the impact of annotation quality and naturalistic variability, with cross-dataset transfer experiments revealing substantial domain adaptation challenges (22-point drop for patient emotions, 13-point drop for provider behaviors). Manual audit reveals an automated annotation agreement of κ = 0.61 with expert consensus, explaining portions of the performance gap. These findings establish both the effectiveness of our methodological contributions for this specific domain and the substantial challenges inherent in transitioning from controlled to real-world clinical data analysis.

3.8. Automated Annotation Quality Assessment

To quantify the reliability of automated Claude Sonnet 4 annotations on YouTube data, we conducted a manual audit comparing automated labels against expert human review. Two licensed clinical psychologists independently reviewed a random sample of 200 segments, providing consensus labels for the six communication styles.
Agreement between automated annotations and expert consensus varied substantially across styles. Overall Cohen’s kappa reached 0.61, indicating substantial agreement according to standard interpretation guidelines. However, style-specific agreement showed the following heterogeneous patterns: Neutral (κ = 0.78, excellent agreement), Transitional (κ = 0.72, substantial), Reflective (κ = 0.65, substantial), Supportive (κ = 0.52, moderate), Empathetic (κ = 0.41, moderate), and Validating (κ = 0.35, fair).
The pattern of degrading agreement for rare styles reflects both limited training exposure in the large language model and inherent difficulty in identifying subtle therapeutic behaviors from text alone without full conversational context. Error analysis revealed that automated annotations tended toward overconfidence in Neutral classifications and underdetection of rare styles, consistent with base-rate effects.
These audit results provide concrete quantification of label noise in YouTube data and help explain the 20-point performance gap between controlled (0.91) and YouTube (0.71) macro-F1 scores. The moderate agreement (κ = 0.61 overall) suggests that while automated annotation enables large-scale data acquisition, it introduces systematic noise that constrains achievable model performance. Perfect model performance on YouTube data would still only achieve approximately κ = 0.61 agreement with expert standards, establishing an approximate upper bound for this annotation methodology.
Table 11 summarizes automated annotation quality audit results.

4. Discussion

4.1. Controlled Versus Real-World Data Quality

The 20-percentage-point performance gap between controlled HOPE (macro-F1 0.91) and YouTube HOPE (macro-F1 0.71) reveals systematic differences in data quality regimes. These findings are consistent with the impact of annotation methodology, where expert human psychologist review provides higher label reliability than automated large language model annotation [73]. The manual audit (Section 3.8) quantifies this impact, revealing overall agreement of κ = 0.61 between automated annotations and expert consensus, with particularly low agreement for rare styles (κ = 0.35–0.52).
Audio quality variation between controlled studio recordings and heterogeneous YouTube content introduces additional noise. Label complexity reduction from fine-grained expert taxonomies to coarser functional categories affects discriminative capacity. Segmentation methodology differs between utterance-level forced alignment providing natural linguistic boundaries and fixed temporal windows, potentially fragmenting therapeutic statements.
These patterns suggest that automated annotation pipelines, while enabling large-scale data acquisition, introduce systematic performance degradation compared to expert human labeling. The magnitude of this degradation quantifies the cost of transitioning from controlled to naturalistic settings. For data mining applications prioritizing scale over precision, this trade-off may be acceptable. For applications requiring high reliability, hybrid approaches combining automated preprocessing with selective human review merit investigation [49].

4.2. Class-Aware Optimization: Engineering Solution for Domain-Specific Imbalance

Our frequency-stratified weighting represents an effective engineering solution combining established techniques rather than a fundamental algorithmic innovation. The approach integrates square-root inverse frequency weighting to moderate extreme values, stratified multipliers based on rarity tiers (2.0×, 1.5×, 1.0×), and per-class threshold optimization. While conceptually straightforward, this combination proves effective for the extreme multi-label imbalance characteristic of clinical emotion datasets.
Comparison with modern imbalance handling methods reveals competitive but not revolutionary performance. Our approach outperforms Focal Loss by 13 points, Class-Balanced Loss by 9 points, and Logit Adjustment by 6 points on CounselChat (Table 6). These margins demonstrate practical benefits while acknowledging that the contribution lies in effective engineering and domain-specific tuning rather than methodological advancement.
The observed improvements arise from several factors. Square-root transformation prevents training instability from extreme weight values while maintaining emphasis on rare classes. Stratified multipliers provide targeted amplification without uniform scaling. Per-class threshold optimization addresses heterogeneous calibration characteristics where rare classifiers produce systematically lower probabilities even for true positives. The combination of these systems addresses the specific challenge structure of multi-label clinical data more effectively than methods designed for single-label or less extreme imbalance scenarios.
For data mining practitioners addressing extreme multi-label imbalance (ratios exceeding 50:1), these findings suggest that multi-component approaches carefully tuned to domain characteristics can outperform theoretically sophisticated single-component methods. However, the moderate margins also indicate diminishing returns, and simpler approaches like Class-Balanced Loss may suffice for many applications.

4.3. Fusion Strategy Selection for Heterogeneous Modalities

The differential performance of fusion strategies across patient and provider tasks reveals task-dependent optimal architectures. For patient emotions, late fusion with a learned weight α = 0.75 outperforms early fusion. This pattern aligns with scenarios where modalities contribute relatively independently and exhibit different noise characteristics [59,74]. Text provides strong signals through clinical language patterns, while audio exhibits substantial variability due to recording quality issues. Late fusion allows text-dominant combination while preserving some complementary acoustic information.
For provider behaviors, cross-modal attention achieves substantial gains over both concatenation and late fusion. These findings are consistent with the hypothesis that therapist communication involves coordinated multimodal patterns where verbal content and prosodic delivery interact to convey meaning [44,53]. The same verbal statement delivered with different prosodic characteristics expresses different therapeutic intentions. Cross-modal attention mechanisms enable the learning of which acoustic patterns are relevant for specific textual contents rather than treating modalities as independent contributors [10,75].
The cross-modal attention architecture represents a domain-specific application of established multimodal Transformer techniques [53,54] rather than an architectural innovation. Our contribution lies in empirically demonstrating that these mechanisms are necessary for capturing coordinated verbal–prosodic patterns in therapeutic communication, where simpler fusion strategies prove insufficient. The 12-percentage-point gain over concatenation-based fusion quantifies the benefit of explicit cross-modal interaction modeling for this domain.
The systematic attention analysis (Section 3.6) and acoustic feature ablation study validate that the architecture learns clinically meaningful prosodic patterns rather than merely improving predictive performance through increased model capacity. The observed correlations between attention weights and expert ratings ( ρ = 0.73 for empathy) provide evidence that cross-modal attention mechanisms capture interpretable associations between verbal content and prosodic delivery that align with human expert judgments.
Attention weight visualization reveals interpretable patterns. For empathic statements, attention concentrates on prosodic features indicating vocal warmth, including reduced speech rate, softer volume, and falling pitch contours. For directive interventions, attention shifts to features indicating confident guidance, including steady speech rate and rising terminal pitch. These patterns align with communication theory regarding coordinated verbal and prosodic signaling of affective intent [13,14,45]. For data mining applications, these findings suggest that the fusion strategy should match the causal structure of modality interactions rather than applying uniform architectures across tasks.

4.4. Scalability and Deployment Considerations

The YouTube processing pipeline demonstrates practical viability for moderate-scale analysis. Diarization accuracy exceeded 92 percent on validation samples [50]. Automatic speech recognition achieved a word error rate of 11.3 percent [51]. Processing requires approximately 15 min per 45-min session. Training requires 8 to 12 h on NVIDIA A100 GPUs. Inference latency remains under 2 s per segment. These computational characteristics enable batch processing of hundreds of sessions and real-time analysis of streaming interactions.
The 84 percent success rate for YouTube video retrieval and processing identifies technical bottlenecks. Failures result from hosting changes, access restrictions, and quality issues. For large-scale deployment, robust error handling and retry mechanisms would improve yield. Audio quality variation introduces performance degradation compared to controlled recordings. For applications requiring consistent reliability, quality assessment mechanisms could filter low-quality segments during preprocessing.

4.5. Methodological Perspective

From a methodological perspective, the proposed framework should be interpreted as a contribution to data mining for extreme multi-label learning and multimodal fusion under heterogeneous noise, rather than as a deployed clinical or evaluative system.

4.6. Cross-Domain and Cultural Generalizability

Our framework exhibits substantial limitations in cross-domain and cross-cultural generalizability. All datasets employed represent English-language psychotherapy conducted in Western cultural contexts, primarily utilizing cognitive–behavioral and person-centered therapeutic approaches. The learned prosodic–linguistic associations may not transfer to other cultural contexts where emotional expression norms, therapeutic communication patterns, and paralinguistic features differ systematically.
Cross-dataset transfer experiments quantify these generalization challenges. The 22-point performance drop when applying CounselChat-trained models to DAIC-WOZ data demonstrates that even within Western, English-language contexts, therapeutic modality differences introduce substantial domain shifts. Cultural variation in emotional expression and therapeutic norms would likely produce even larger performance degradation.
Prosodic features of empathy, warmth, and other therapeutic behaviors are culturally situated. What constitutes warm vocal tone, appropriate speech rate for empathic responses, or confident directiveness varies across cultural and linguistic contexts. Our models learned culture-specific and language-specific associations from Western, English-language data that may not generalize to non-Western therapeutic traditions, different linguistic prosodic systems, or culturally distinct communication norms.
The dataset homogeneity also limits therapeutic modality coverage. Our data primarily represents individual adult psychotherapy with verbal modalities. Generalization to family therapy, child therapy, group therapy, or nonverbal therapeutic approaches remains unvalidated. The YouTube sample, while naturalistic, reflects therapists who choose to share sessions publicly, potentially introducing selection bias toward specific therapeutic styles or presentation preferences.
Future work requiring multilingual datasets with diverse cultural contexts, multimodal therapeutic approaches, and varied clinical populations would be necessary to establish broader generalizability. Cross-cultural validation should assess not only performance metrics but also whether learned attention patterns and prosodic associations remain clinically meaningful across cultural contexts.

4.7. Bidirectionality as Analytical Framework

The bidirectionality in our framework operates at the analytical level rather than through joint model optimization. This architectural choice merits explicit discussion given the conceptual versus operational nature of integration.
Our patient and provider models are trained independently on separate datasets with different label spaces, different annotation methodologies, and different temporal granularities. This independent training approach reflects practical constraints: available datasets for patient emotions (CounselChat, DAIC-WOZ) and provider behaviors (HOPE) do not share common samples that would enable joint optimization. The models therefore learn complementary but separate aspects of therapeutic communication.
The bidirectional framework enables analytical integration where both models process therapeutic interactions to provide comprehensive assessment. As demonstrated in Section 3.4, this analytical integration supports interaction-level coupling analysis, alignment detection, and bidirectional interpretation. However, this represents post hoc analysis rather than end-to-end joint learning.
This limitation distinguishes our work from truly joint bidirectional models that would optimize both sides simultaneously with shared representations and interaction-level loss functions. Such joint modeling would require datasets containing synchronized patient and therapist utterances with coordinated emotion and behavior annotations, which remain scarce. Future work should pursue joint optimization approaches that explicitly model the interactive dynamics between patient affect and therapist responses through architectures like bidirectional LSTMs with cross-attention or interaction-aware Transformers that process conversational context holistically.
We acknowledge this limitation and position our contribution as establishing separate high-performance components that enable bidirectional analysis while recognizing that full integration through joint optimization represents important future work. The framework demonstrates the feasibility and analytical utility of bidirectional assessment without claiming to have solved the more challenging problem of joint bidirectional modeling of therapeutic interactions.

4.8. Limitations and Future Directions

The controlled HOPE dataset relies on a single expert annotator, preventing quantification of true inter-annotator variability and limiting our ability to establish robust multi-annotator ground truth. While validation on a subset achieved Fleiss’ kappa of 0.76, this represents consistency within single-annotator judgments rather than consensus among multiple independent raters. Multi-annotator annotation would strengthen validity and enable more rigorous assessment of model performance relative to expert agreement ranges rather than single-expert judgments. The reported Cohen’s kappa of 0.87 should be interpreted as showing agreement with one experienced clinician’s interpretations rather than broad expert consensus.
Performance on rare classes, particularly those with fewer than 20 examples in YouTube data, exhibits substantial estimation uncertainty, as indicated by wide bootstrap confidence intervals. The absence of repeated random splits or k-fold cross-validation for rare classes limits the generalizability of reported performance estimates. Under strict statistical standards, these metrics should be viewed as descriptive of our specific data split rather than robust population estimates.
Automated annotation reliability limits YouTube model performance. These findings are consistent with systematic label noise introduced by large language model annotation compared to expert human review [73]. Temporal context remains unexploited as utterance-level processing ignores conversational dynamics and session-level progression [76,77]. Cross-domain generalization requires validation, as models trained on one therapeutic approach may not transfer to different modalities. Video modality exclusion represents a missed opportunity, as facial expressions convey substantial affective information [78,79]. Meta-analytic research on risk factors for suicidal thoughts and behaviors [80] and epidemiological studies of suicide and self-harm [81] further underscore the clinical importance of accurate emotion detection in therapeutic settings, as timely identification of distress signals can inform intervention strategies.
The framework has not been validated in clinical settings, and substantial additional work including multi-site clinical trials, longitudinal outcome studies, comprehensive ethical review [82], and establishment of multi-annotator ground truth would be required before any clinical or educational application could be considered. The current work establishes computational feasibility and methodological approaches for research purposes only.
Future research directions include semi-supervised learning combining automated annotation with selective expert review. Few-shot learning approaches could improve rare class generalization from limited examples. Session-level temporal modeling through recurrent or Transformer architectures could capture therapeutic progression. Multimodal expansion incorporating visual features would provide comprehensive analysis. Advances in biomedical named entity recognition through deep learning with word embeddings [83] and the availability of large-scale clinical databases such as MIMIC-III [84] offer complementary resources for enriching clinical NLP pipelines. Cross-cultural validation would establish generalizability across diverse contexts [85,86]. Joint bidirectional optimization approaches that model patient–provider interaction dynamics holistically represent important theoretical advancements beyond our current analytical framework [87].

5. Conclusions

We present a bidirectional analytical framework for computational study of clinical counseling addressing patient emotion recognition and provider behavior assessment using real-world data. The framework advances multi-label classification, class imbalance handling, and multimodal fusion through three key contributions.
First, frequency-stratified class weighting with dynamic per-class threshold optimization achieves a macro-F1 of 0.74 on patient emotions, outperforming modern imbalance methods by 6–13 points [6,8,88]. This represents an effective engineering solution combining established techniques tuned for extreme multi-label imbalance rather than fundamental algorithmic innovation.
Second, an automated processing pipeline for YouTube psychotherapy sessions addresses the scarcity of naturalistic clinical data [50,51,52]. The pipeline processes hundreds of sessions into thousands of annotated segments. Manual audit reveals an automated annotation agreement of κ = 0.61 with expert consensus, quantifying the systematic label noise inherent in large-scale automated approaches.
Third, the cross-modal attention architecture adapted from multimodal Transformer frameworks learns content-dependent prosodic associations for multimodal behavior recognition [53,55,56]. The architecture achieves a macro-F1 of 0.91 on controlled human-annotated data, comparable to the inter-rater reliability reported among trained human annotators. Systematic attention analysis with acoustic feature ablation validates that the model captures clinically meaningful prosodic patterns. On YouTube data, a macro-F1 of 0.71 quantifies the performance gap between controlled and real-world annotation regimes.
Cross-dataset transfer experiments reveal substantial domain adaptation challenges (22-point drop for patient emotions, 13-point drop for provider behaviors), highlighting limitations in cross-domain and cross-cultural generalizability. Interaction-level coupling analysis demonstrates moderate correlation (r = 0.58) between patient distress and provider empathy, validating analytical utility while revealing that independent training limits full bidirectional integration.
While this framework demonstrates the computational feasibility of analyzing therapeutic communication at scale, substantial validation including multi-site clinical trials, longitudinal outcome studies, comprehensive multi-annotator ground truth establishment, and ethical review would be required before any clinical or educational application could be considered [89,90]. The framework should be interpreted strictly as a research tool for the computational study of therapeutic interactions.
Future work should focus on joint bidirectional optimization, semi-supervised learning to bridge controlled–naturalistic gaps, temporal modeling for session-level dynamics, and cross-cultural validation. These advances will enable more robust platforms supporting interaction-level analysis essential for the computational study of therapeutic communication across diverse contexts.

Author Contributions

Conceptualization, S.M., X.L., P.Z., and A.B.; methodology, S.M., X.L., P.Z., and V.J.; software, X.L., P.Z., and V.J.; validation, S.M., X.L., and P.Z.; formal analysis, X.L. and S.M.; investigation, S.M., X.L., P.Z., S.F.M., R.S., and V.J.; resources, A.B.; data curation, X.L., P.Z., V.J., and R.S.; writing—original draft preparation, S.M., X.L., and P.Z.; writing—review and editing, S.M., X.L., P.Z., S.F.M., V.J., and A.B.; visualization, X.L. and S.M.; supervision, A.B.; project administration, S.M. and A.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable. This study uses publicly available de-identified datasets (CounselChat, DAIC-WOZ, and HOPE) and does not involve new research concerning human subjects.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original datasets analyzed in this study are publicly available: CounselChat is available through online mental health platforms, DAIC-WOZ is available through the AVEC challenge (https://dcapswoz.ict.usc.edu/ (accessed on 15 November 2025)), and HOPE dataset URLs are publicly available through the Open Science Framework at https://osf.io/6rv5m/ (accessed on 15 November 2025). The human-generated and automated annotations, processed multimodal features, extracted acoustic representations, and analysis code generated during this study are available from the corresponding author upon reasonable request. Detailed annotation protocols and guidelines will be provided to support full reproducibility of the research findings.

Acknowledgments

We thank the developers of the CounselChat, DAIC-WOZ, and HOPE datasets. We acknowledge computational resources provided by Northeastern University Research Computing. During preparation, Claude Sonnet 4 was used for automated annotation of YouTube segments. The authors reviewed all outputs and take full responsibility for their content.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AI: Artificial Intelligence
ASR: Automatic Speech Recognition
BCE: Binary Cross-Entropy
DSM-5: Diagnostic and Statistical Manual, Fifth Edition
FFN: Feed-Forward Network
HOPE: Healing Opportunities in Psychotherapy Expressions
LLM: Large Language Model
MFCC: Mel-Frequency Cepstral Coefficient
MLP: Multilayer Perceptron
NLP: Natural Language Processing
PQS: Psychotherapy Process Q-Set
ReLU: Rectified Linear Unit

References

1. Tsoumakas, G.; Katakis, I. Multi-label classification: An overview. Int. J. Data Warehous. Min. 2007, 3, 1–13.
2. Zhang, M.L.; Zhou, Z.H. A review on multi-label learning algorithms. IEEE Trans. Knowl. Data Eng. 2014, 26, 1819–1837.
3. Japkowicz, N.; Stephen, S. The class imbalance problem: A systematic study. Intell. Data Anal. 2002, 6, 429–449.
4. Kumar, V.; Lalotra, G.S.; Sasikala, P.; Rajput, D.S.; Kaluri, R.; Lakshmanna, K.; Shorfuzzaman, M.; Alsufyani, A.; Uddin, M. Addressing Binary Classification over Class Imbalanced Clinical Datasets Using Computationally Intelligent Techniques. Healthcare 2022, 10, 1293.
5. He, H.; Garcia, E.A. Learning from Imbalanced Data. IEEE Trans. Knowl. Data Eng. 2009, 21, 1263–1284.
6. Charte, F.; Rivera, A.J.; del Jesus, M.J.; Herrera, F. Addressing imbalance in multilabel classification: Measures and random resampling algorithms. Neurocomputing 2015, 163, 3–16.
7. Saito, T.; Rehmsmeier, M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE 2015, 10, e0118432.
8. Pillai, I.; Fumera, G.; Roli, F. Threshold optimisation for multi-label classifiers. Pattern Recognit. 2013, 46, 2055–2065.
9. Cui, Y.; Jia, M.; Lin, T.Y.; Song, Y.; Belongie, S. Class-Balanced Loss Based on Effective Number of Samples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9268–9277.
10. Baltrusaitis, T.; Ahuja, C.; Morency, L.P. Multimodal machine learning: A survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 423–443.
11. Soleymani, M.; Garcia, D.; Jou, B.; Schuller, B.; Chang, S.F.; Pantic, M. A survey of multimodal sentiment analysis. Image Vis. Comput. 2017, 65, 3–14.
12. Schuller, B.; Batliner, A.; Steidl, S.; Seppi, D. Recognising realistic emotions and affect in speech: State of the art and lessons learnt from the first challenge. Speech Commun. 2011, 53, 1062–1087.
13. Banse, R.; Scherer, K.R. Acoustic profiles in vocal emotion expression. J. Pers. Soc. Psychol. 1996, 70, 614–636.
14. Scherer, K.R.; Johnstone, T.; Klasmeyer, G. Vocal expression of emotion. In Handbook of Affective Sciences; Oxford University Press: New York, NY, USA, 2003; pp. 433–456.
15. El Ayadi, M.; Kamel, M.S.; Karray, F. Survey on speech emotion recognition: Features, classification schemes, and databases. Pattern Recognit. 2011, 44, 572–587.
16. Zadeh, A.; Chen, M.; Poria, S.; Cambria, E.; Morency, L.P. Tensor fusion network for multimodal sentiment analysis. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, 7–11 September 2017; pp. 1103–1114.
17. Poria, S.; Cambria, E.; Hazarika, D.; Majumder, N.; Zadeh, A.; Morency, L.P. Context-dependent sentiment analysis in user-generated videos. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, BC, Canada, 30 July–4 August 2017; Volume 1, pp. 873–883.
18. Calvo, R.A.; Milne, D.N.; Hussain, M.S.; Christensen, H. Natural language processing in mental health applications using non-clinical texts. Nat. Lang. Eng. 2017, 23, 649–685.
19. Guntuku, S.C.; Yaden, D.B.; Kern, M.L.; Ungar, L.H.; Eichstaedt, J.C. Detecting depression and mental illness on social media: An integrative review. Curr. Opin. Behav. Sci. 2017, 18, 43–49.
20. De Choudhury, M.; Counts, S.; Horvitz, E. Social media as a measurement tool of depression in populations. In Proceedings of the 5th Annual ACM Web Science Conference, Paris, France, 2–4 May 2013; pp. 47–56.
21. Yates, A.; Cohan, A.; Goharian, N. Depression and self-harm risk assessment in online forums. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, 7–11 September 2017; pp. 2968–2978.
22. Benton, A.; Mitchell, M.; Hovy, D. Multitask learning for mental health conditions with limited social media data. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, Valencia, Spain, 3–7 April 2017; Volume 1, pp. 152–162.
23. Coppersmith, G.; Dredze, M.; Harman, C.; Hollingshead, K. From ADHD to SAD: Analyzing the Language of Mental Health on Twitter through Self-Reported Diagnoses. In Proceedings of the 2nd Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality, Denver, CO, USA, 5 June 2015; pp. 1–10.
24. Resnik, P.; Armstrong, W.; Claudino, L.; Nguyen, T.; Nguyen, V.A.; Boyd-Graber, J. Beyond LDA: Exploring supervised topic modeling for depression-related language in Twitter. In Proceedings of the 2nd Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality, Denver, CO, USA, 5 June 2015; pp. 99–107.
25. Losada, D.E.; Crestani, F.; Parapar, J. eRISK 2017: CLEF lab on early risk prediction on the Internet: Experimental foundations. In Proceedings of the International Conference of the Cross-Language Evaluation Forum for European Languages, Dublin, Ireland, 11–14 September 2017; Springer: Cham, Switzerland, 2017; pp. 346–360.
26. Shen, J.H.; Rudzicz, F. Detecting Anxiety through Reddit. In Proceedings of the Fourth Workshop on Computational Linguistics and Clinical Psychology—From Linguistic Signal to Clinical Reality, Vancouver, BC, Canada, 3 August 2017; pp. 58–65.
27. Zhang, T.; Schoene, A.M.; Ji, S.; Ananiadou, S. Natural language processing applied to mental illness detection: A narrative review. NPJ Digit. Med. 2022, 5, 46.
28. Cummins, N.; Scherer, S.; Krajewski, J.; Schnieder, S.; Epps, J.; Quatieri, T.F. A review of depression and suicide risk assessment using speech analysis. Speech Commun. 2015, 71, 10–49.
29. Low, D.M.; Bentley, K.H.; Ghosh, S.S. Automated assessment of psychiatric disorders using speech: A systematic review. Laryngoscope Investig. Otolaryngol. 2020, 5, 96–116.
30. Kessler, R.C.; Chiu, W.T.; Demler, O.; Walters, E.E. Prevalence, Severity, and Comorbidity of 12-Month DSM-IV Disorders in the National Comorbidity Survey Replication. Arch. Gen. Psychiatry 2005, 62, 617–627.
31. Brown, T.A.; Campbell, L.A.; Lehman, C.L.; Grisham, J.R.; Mancill, R.B. Current and lifetime comorbidity of the DSM-IV anxiety and mood disorders in a large clinical sample. J. Abnorm. Psychol. 2001, 110, 585–599.
32. Mineka, S.; Watson, D.; Clark, L.A. Comorbidity of anxiety and unipolar mood disorders. Annu. Rev. Psychol. 1998, 49, 377–412.
33. Elliott, R.; Bohart, A.C.; Watson, J.C.; Murphy, D. Therapist empathy and client outcome: An updated meta-analysis. Psychotherapy 2018, 55, 399–410.
34. Norcross, J.C.; Lambert, M.J. Psychotherapy relationships that work III. Psychotherapy 2018, 55, 303–315.
35. Greenberg, L.S.; Elliott, R. Empathy. Psychotherapy 2019, 56, 461–468.
36. Watson, J.C. Reassessing Rogers' necessary and sufficient conditions of change. Psychotherapy 2007, 44, 268–273.
37. Horvath, A.O.; Del Re, A.C.; Flückiger, C.; Symonds, D. Alliance in individual psychotherapy. Psychotherapy 2011, 48, 9–16.
38. Martin, D.J.; Garske, J.P.; Davis, M.K. Relation of the therapeutic alliance with outcome and other variables: A meta-analytic review. J. Consult. Clin. Psychol. 2000, 68, 438–450.
39. Flückiger, C.; Del Re, A.C.; Wampold, B.E.; Horvath, A.O. The alliance in adult psychotherapy: A meta-analytic synthesis. Psychotherapy 2018, 55, 316–340.
40. Wampold, B.E.; Imel, Z.E. The Great Psychotherapy Debate: The Evidence for What Makes Psychotherapy Work, 2nd ed.; Routledge: New York, NY, USA, 2015.
41. Flemotomos, N.; Martinez, V.R.; Chen, Z.; Singla, K.; Ardulov, V.; Peri, R.; Imel, Z.E.; Atkins, D.C.; Narayanan, S. Automated quality assessment of cognitive behavioral therapy sessions through highly contextualized language representations. PLoS ONE 2021, 16, e0258639.
42. Flemotomos, N.; Martinez, V.R.; Gibson, J.; Atkins, D.C.; Creed, T.A.; Narayanan, S.S. Language features for automated evaluation of cognitive behavior psychotherapy sessions: A machine learning approach. Front. Psychol. 2021, 12, 702139.
43. Orlinsky, D.E.; Rønnestad, M.H. How Psychotherapists Develop: A Study of Therapeutic Work and Professional Growth; American Psychological Association: Washington, DC, USA, 2005.
44. Juslin, P.N.; Scherer, K.R. Vocal expression of affect. In The New Handbook of Methods in Nonverbal Behavior Research; Oxford University Press: Oxford, UK, 2005; pp. 65–135.
45. Cowie, R.; Cornelius, R.R. Describing the emotional states that are expressed in speech. Speech Commun. 2003, 40, 5–32.
46. Ambady, N.; Rosenthal, R. Thin slices of expressive behavior as predictors of interpersonal consequences: A meta-analysis. Psychol. Bull. 1992, 111, 256–274.
47. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357.
48. Boyd, K.; Eng, K.H.; Page, C.D. Area under the precision-recall curve: Point estimates and confidence intervals. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Nancy, France, 15–19 September 2014; Springer: Berlin/Heidelberg, Germany, 2014; pp. 451–466.
49. Imel, Z.E.; Atkins, D.C.; Caperton, D.D.; Takano, K.; Iijima, Y.; Walker, D.D.; Steyvers, M. Mental Health Counseling From Conversational Content with Transformer-Based Machine Learning. JAMA Netw. Open 2024, 7, e2351075.
50. Bredin, H.; Laurent, A. End-to-End Speaker Segmentation for Overlap-Aware Resegmentation. In Proceedings of Interspeech 2021, Brno, Czech Republic, 30 August–3 September 2021; pp. 3111–3115.
51. Radford, A.; Kim, J.W.; Xu, T.; Brockman, G.; McLeavey, C.; Sutskever, I. Robust Speech Recognition via Large-Scale Weak Supervision. In Proceedings of ICML 2023, Honolulu, HI, USA, 23–29 July 2023.
52. Anthropic. Claude 3 Model Family: Introducing the Next Generation of AI Assistants; Technical Report; Anthropic: San Francisco, CA, USA, 2024.
53. Tsai, Y.H.; Bai, S.; Liang, P.P.; Kolter, J.Z.; Morency, L.P.; Salakhutdinov, R. Multimodal Transformer for Unaligned Multimodal Language Sequences. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 6558–6569.
54. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is All You Need. In Advances in Neural Information Processing Systems, Proceedings of the Annual Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Neural Information Processing Systems Foundation: San Diego, CA, USA, 2017; Volume 30.
55. Chen, S.; Wang, C.; Chen, Z.; Wu, Y.; Liu, S.; Chen, Z.; Li, J.; Kanda, N.; Yoshioka, T.; Xiao, X.; et al. WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing. IEEE J. Sel. Top. Signal Process. 2022, 16, 1505–1518.
56. He, P.; Gao, J.; Chen, W. DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing. arXiv 2021, arXiv:2111.09543.
57. Jones, E.E. Therapeutic Action: A Guide to Psychoanalytic Therapy; Jason Aronson: Lanham, MD, USA, 2000.
58. Ablon, J.S.; Jones, E.E. How expert clinicians' prototypes of an ideal treatment correlate with outcome in psychodynamic and cognitive-behavioral therapy. Psychother. Res. 1998, 8, 71–83.
59. Gadzicki, K.; Khamsehashari, R.; Zetzsche, C. Early vs Late Fusion in Multimodal Convolutional Neural Networks. In Proceedings of the 2020 IEEE 23rd International Conference on Information Fusion (FUSION), Rustenburg, South Africa, 6–9 July 2020; pp. 1–6.
60. Gratch, J.; Artstein, R.; Lucas, G.; Stratou, G.; Scherer, S.; Nazarian, A.; Wood, R.; Boberg, J.; DeVault, D.; Marsella, S.; et al. The Distress Analysis Interview Corpus of human and computer interviews. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC), Reykjavik, Iceland, 26–31 May 2014; pp. 3123–3128.
61. Jones, E.E.; Pulos, S.M. Comparing the process in psychodynamic and cognitive-behavioral therapies. J. Consult. Clin. Psychol. 1993, 61, 306–316.
62. Lamers, F.; van Oppen, P.; Comijs, H.C.; Smit, J.H.; Spinhoven, P.; van Balkom, A.J.; Nolen, W.A.; Zitman, F.G.; Beekman, A.T.; Penninx, B.W. Comorbidity patterns of anxiety and depressive disorders in a large cohort study: The Netherlands Study of Depression and Anxiety (NESDA). J. Clin. Psychiatry 2011, 72, 341–348.
63. Kring, A.M.; Bachorowski, J.A. Emotions and psychopathology. Cogn. Emot. 1999, 13, 575–599.
64. Gross, J.J.; Muñoz, R.F. Emotion regulation and mental health. Clin. Psychol. Sci. Pract. 1995, 2, 151–164.
65. Alsentzer, E.; Murphy, J.; Boag, W.; Weng, W.H.; Jin, D.; Naumann, T.; McDermott, M. Publicly Available Clinical BERT Embeddings. arXiv 2019, arXiv:1904.03323.
66. McFee, B.; Raffel, C.; Liang, D.; Ellis, D.P.; McVicar, M.; Battenberg, E.; Nieto, O. librosa: Audio and music signal analysis in Python. In Proceedings of the 14th Python in Science Conference, Austin, TX, USA, 6–12 July 2015; pp. 18–25.
67. Goldberg, S.B.; Flemotomos, N.; Martinez, V.R.; Tanana, M.J.; Kuo, P.B.; Pace, B.T.; Villatte, J.L.; Georgiou, P.G.; Van Epps, J.; Imel, Z.E.; et al. Machine learning and natural language processing in psychotherapy research: Alliance as example use case. J. Couns. Psychol. 2020, 67, 438–448.
68. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL-HLT, Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186.
69. Pepino, L.; Riera, P.; Ferrer, L. Emotion Recognition from Speech Using wav2vec 2.0 Embeddings. arXiv 2021, arXiv:2104.03502.
70. Ericsson, K.A. Deliberate practice and acquisition of expert performance: A general overview. Acad. Emerg. Med. 2008, 15, 988–994.
71. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988.
72. Menon, A.K.; Jayasumana, S.; Rawat, A.S.; Jain, H.; Veit, A.; Kumar, S. Long-tail learning via logit adjustment. In Proceedings of the International Conference on Learning Representations, Virtual Event, 3–7 May 2021.
73. Hassan, A.A.; Hanafy, R.J.; Fouda, M.E. Automated Multi-Label Annotation for Mental Health Illnesses Using Large Language Models. arXiv 2024, arXiv:2412.03796.
74. Almeida, H.; Briand, A.; Meurs, M.J. Multimodal depression detection: A comparative study of machine learning models and feature fusion techniques. J. Biomed. Inform. 2024, 149, 104565.
75. Al Hanai, T.; Ghassemi, M.; Glass, J. Detecting Depression with Audio/Text Sequence Modeling of Interviews. In Proceedings of Interspeech 2018, Hyderabad, India, 2–6 September 2018; pp. 1716–1720.
76. Majumder, N.; Poria, S.; Hazarika, D.; Mihalcea, R.; Gelbukh, A.; Cambria, E. DialogueRNN: An attentive RNN for emotion detection in conversations. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 6818–6825.
77. Poria, S.; Hazarika, D.; Majumder, N.; Naik, G.; Cambria, E.; Mihalcea, R. MELD: A multimodal multi-party dataset for emotion recognition in conversations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 527–536.
78. Busso, C.; Bulut, M.; Lee, C.C.; Kazemzadeh, A.; Mower, E.; Kim, S.; Chang, J.N.; Lee, S.; Narayanan, S.S. IEMOCAP: Interactive emotional dyadic motion capture database. Lang. Resour. Eval. 2008, 42, 335–359.
79. Valstar, M.; Schuller, B.; Smith, K.; Almaev, T.; Eyben, F.; Krajewski, J.; Cowie, R.; Pantic, M. AVEC 2014: 3D dimensional affect and depression recognition challenge. In Proceedings of the 4th International Workshop on Audio/Visual Emotion Challenge, Orlando, FL, USA, 7 November 2014; pp. 3–10.
80. Franklin, J.C.; Ribeiro, J.D.; Fox, K.R.; Bentley, K.H.; Kleiman, E.M.; Huang, X.; Musacchio, K.M.; Jaroszewski, A.C.; Chang, B.P.; Nock, M.K. Risk factors for suicidal thoughts and behaviors: A meta-analysis of 50 years of research. Psychol. Bull. 2017, 143, 187–232.
81. Nock, M.K.; Borges, G.; Bromet, E.J.; Cha, C.B.; Kessler, R.C.; Lee, S. Suicide and suicidal behavior. Epidemiol. Rev. 2008, 30, 133–154.
82. Baer, R.A.; Crane, C.; Miller, E.; Kuyken, W. Doing no harm in mindfulness-based programs: Conceptual issues and empirical findings. Clin. Psychol. Rev. 2019, 71, 101–114.
83. Habibi, M.; Weber, L.; Neves, M.; Wiegandt, D.L.; Leser, U. Deep learning with word embeddings improves biomedical named entity recognition. Bioinformatics 2017, 33, i37–i48.
84. Johnson, A.E.; Pollard, T.J.; Shen, L.; Li-wei, H.L.; Feng, M.; Ghassemi, M.; Moody, B.; Szolovits, P.; Celi, L.A.; Mark, R.G. MIMIC-III, a freely accessible critical care database. Sci. Data 2016, 3, 160035.
85. Gaur, M.; Alambo, A.; Sain, J.P.; Kursuncu, U.; Thirunarayan, K.; Kavuluru, R.; Sheth, A.; Welton, R.; Pathak, J. Knowledge-aware assessment of severity of suicide risk for early intervention. In Proceedings of the World Wide Web Conference, San Francisco, CA, USA, 13–17 May 2019; pp. 514–525.
86. Sharma, A.; Lin, I.W.; Miner, A.S.; Atkins, D.C.; Althoff, T. Towards Facilitating Empathic Conversations in Online Mental Health Support: A Reinforcement Learning Approach. In Proceedings of the Web Conference 2021, Ljubljana, Slovenia, 19–23 April 2021; pp. 194–205.
87. Pérez-Rosas, V.; Mihalcea, R.; Resnicow, K.; Singh, S.; An, L. Understanding and Predicting Empathic Behavior in Counseling Therapy. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, BC, Canada, 30 July–4 August 2017; pp. 1426–1435.
88. Wu, T.; Huang, Q.; Liu, Z.; Wang, Y.; Lin, D. Distribution-Balanced Loss for Multi-Label Classification in Long-Tailed Datasets. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 162–178.
89. Kenny, P.G.; Parsons, T.D.; Gratch, J.; Leuski, A.; Rizzo, A.A. Virtual patients for clinical therapist skills training. In Proceedings of the International Conference on Intelligent Virtual Agents, Philadelphia, PA, USA, 20–22 September 2007; Springer: Berlin/Heidelberg, Germany, 2007; pp. 197–210.
90. Rizzo, A.; Scherer, S.; DeVault, D.; Gratch, J.; Artstein, R.; Hartholt, A.; Lucas, G.M.; Dyck, M.; Stratou, G.; Morency, L.P.; et al. Detection and computational analysis of psychological signals using a virtual human interviewing agent. J. Pain Manag. 2016, 9, 311–321.
Figure 1. Complete bidirectional framework architecture showing patient emotion recognition (left) and provider behavior analysis (right) pipelines with data processing, model architectures, and integration for clinical interaction analysis.
Table 1. Comprehensive dataset characteristics and data mining challenges.

| Dataset | Domain | Scale | Label Structure | Modalities | Imbalance | Challenges | Annotation |
|---|---|---|---|---|---|---|---|
| CounselChat | Patient emotions | 1482 interactions | 25 categories (42.2% multi-label) | Text | 60:1 | Multi-label co-occurrence, extreme imbalance | Three psychologists, Cohen's κ = 0.72, Fleiss' κ = 0.78 |
| DAIC-WOZ | Patient emotions | 8400 utterances | 11 emotions (multi-label) | Text, audio | Moderate | Multi-label, fusion, noise | Two psychologists, κ = 0.69 |
| HOPE Controlled | Provider behaviors | 178 sessions, 12,500 utterances | 25 PQS dimensions | Text, audio | Balanced | Context-dependent behaviors, prosody | Single psychologist, κ = 0.76 |
| HOPE YouTube | Provider communication | 330 sessions, 14,086 segments | 6 styles | Text, audio | Variable | Real-world quality, automation | Automated (Claude Sonnet 4) |
Table 2. Dataset annotation characteristics and reliability comparison.

| Dataset | Method | Annotators | Granularity | Reliability | Noise | Notes |
|---|---|---|---|---|---|---|
| CounselChat | Expert human | 3 psychologists | Interaction | κ = 0.78 | Minimal | Gold standard |
| DAIC-WOZ | Expert human | 2 psychologists | Utterance | κ = 0.69 | Low | Clinical context |
| HOPE Controlled | Expert human | 1 psychologist | Utterance | κ = 0.76 | Low | Single-annotator limitation |
| HOPE YouTube | Automated LLM | Claude Sonnet 4 | 10-s windows | κ = 0.61 (audit) | Moderate–high | Not comparable to controlled |
Table 3. Patient-side model configurations.

| Component | CounselChat | DAIC-WOZ Early | DAIC-WOZ Late |
|---|---|---|---|
| Base encoder | ClinicalBERT | ClinicalBERT | ClinicalBERT + MLP |
| Fusion | N/A | Concatenation | Weighted averaging |
| Loss | Weighted BCE | Standard BCE | Independent BCE |
| Optimizer | AdamW, 1 × 10⁻⁵ | AdamW, 1 × 10⁻⁵ | Text: 1 × 10⁻⁵; audio: 1 × 10⁻³ |
| Batch size | 8/16 | 8/16 | Text: 8/16; audio: 16/32 |
| Epochs | 5, patience = 3 | 5 | Text: 5; audio: 20 |
| Fusion weight | N/A | N/A | α = 0.75 |
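For illustration, the following minimal PyTorch sketch combines two ingredients of the Table 3 configurations: a class-weighted BCE objective of the kind used for CounselChat, and the late-fusion rule with α = 0.75 used for DAIC-WOZ. The class counts, tensor shapes, and variable names are illustrative assumptions, not the exact training code.

```python
import torch
import torch.nn as nn

num_classes = 11   # DAIC-WOZ emotion categories
alpha = 0.75       # weight on the text branch (Table 3)

# Hypothetical per-class positive counts used to derive BCE pos_weight;
# rarer classes receive larger weights on their positive term.
pos_counts = torch.tensor([900., 640., 410., 300., 210., 150., 95., 60., 40., 25., 15.])
neg_counts = 8400 - pos_counts
criterion = nn.BCEWithLogitsLoss(pos_weight=neg_counts / pos_counts)

def late_fusion(text_logits: torch.Tensor, audio_logits: torch.Tensor) -> torch.Tensor:
    """Weighted averaging of independently produced branch logits."""
    return alpha * text_logits + (1.0 - alpha) * audio_logits

# Toy usage with random logits for a batch of 4 utterances.
text_logits = torch.randn(4, num_classes)
audio_logits = torch.randn(4, num_classes)
targets = torch.randint(0, 2, (4, num_classes)).float()

fused = late_fusion(text_logits, audio_logits)
loss = criterion(fused, targets)
print(loss.item())
```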
Table 4. Provider-side model configuration.

| Component | Controlled HOPE | YouTube HOPE |
|---|---|---|
| Text encoder | DeBERTa-v3-base (184M) | DeBERTa-v3-base |
| Audio encoder | WavLM-base-plus (95M) | WavLM-base-plus |
| Shared dimension | d = 256, h = 8 | d = 256, h = 8 |
| Training | Warmup, 2 epochs; fine-tune, 10 epochs | Single-stage, 20 epochs |
| Batch size | 8 (effective 32) | 8 |
| Output classes | 25 PQS dimensions | 6 communication styles |
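A minimal sketch of a cross-modal attention block consistent with the Table 4 settings (shared dimension d = 256, h = 8 heads) is shown below, assuming DeBERTa-v3 and WavLM hidden states have already been computed. The 768-dimensional input sizes, the bidirectional attention layout, and mean pooling are assumptions for illustration, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    def __init__(self, text_dim=768, audio_dim=768, d=256, heads=8):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, d)
        self.audio_proj = nn.Linear(audio_dim, d)
        # Text queries attend over audio keys/values, and vice versa.
        self.text_to_audio = nn.MultiheadAttention(d, heads, batch_first=True)
        self.audio_to_text = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, text_states, audio_states):
        t = self.text_proj(text_states)    # (B, Lt, d)
        a = self.audio_proj(audio_states)  # (B, La, d)
        t_enriched, _ = self.text_to_audio(t, a, a)  # text attends to audio
        a_enriched, _ = self.audio_to_text(a, t, t)  # audio attends to text
        # Pool each stream and concatenate for a downstream classifier head.
        return torch.cat([t_enriched.mean(dim=1), a_enriched.mean(dim=1)], dim=-1)

block = CrossModalBlock()
fused = block(torch.randn(2, 64, 768), torch.randn(2, 300, 768))
print(fused.shape)  # torch.Size([2, 512])
```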
Table 5. Patient emotion recognition performance.

| Configuration | Micro-F1 | Macro-F1 | Subset Acc. | Key Finding |
|---|---|---|---|---|
| CounselChat results (25 emotion categories) | | | | |
| Single-label baseline | 0.30 | 0.12 | N/A | Label information lost |
| Multi-label (no weights) | 0.13 | 0.12 | 0.00 | Rare-class collapse |
| Multi-label + class weights | 0.52 | 0.53 | 0.18 | Substantial gain |
| Multi-label + weights + thresholds | 0.65 | 0.74 | 0.34 | Six-fold improvement |
| DAIC-WOZ results (11 emotion categories, multimodal) | | | | |
| Text only | 0.87 | 0.55 | N/A | Strong text signal |
| Audio only | 0.28 | 0.15 | N/A | Limited audio signal |
| Early fusion (concatenation) | 0.64 | 0.39 | N/A | Suboptimal fusion |
| Late fusion (weighted) | 0.88 | 0.55 | N/A | Text-dominant optimal |
Table 6. Comparison with modern imbalance handling methods on CounselChat.

| Method | Macro-F1 | Micro-F1 | Key Mechanism | Δ vs. Ours |
|---|---|---|---|---|
| Naive multi-label baseline | 0.12 | 0.13 | Standard BCE, τ = 0.5 | −0.62 |
| Focal loss [71] | 0.61 | 0.72 | Hard-example emphasis, γ = 2.0 | −0.13 |
| Class-balanced loss [9] | 0.65 | 0.76 | Effective sample number | −0.09 |
| Logit adjustment [72] | 0.68 | 0.78 | Post hoc adjustment | −0.06 |
| Ours (stratified + thresholds) | 0.74 | 0.65 | Frequency-stratified, per-class τ | – |
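The per-class threshold component of the "Ours" row can be sketched as follows: each label's decision threshold τ is selected on validation data to maximize that label's F1, and the tuned thresholds are then applied at test time. The candidate grid, synthetic data, and function name below are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
n_val, n_classes = 500, 25
val_probs = rng.random((n_val, n_classes))                 # stand-in for sigmoid outputs
val_labels = (rng.random((n_val, n_classes)) < 0.1).astype(int)

def tune_thresholds(probs, labels, grid=np.arange(0.05, 0.95, 0.05)):
    """Return one F1-maximizing threshold per class, fit on validation data."""
    thresholds = np.empty(probs.shape[1])
    for c in range(probs.shape[1]):
        scores = [f1_score(labels[:, c], (probs[:, c] >= t).astype(int),
                           zero_division=0) for t in grid]
        thresholds[c] = grid[int(np.argmax(scores))]
    return thresholds

taus = tune_thresholds(val_probs, val_labels)
preds = (val_probs >= taus).astype(int)                    # apply per-class taus
print(f1_score(val_labels, preds, average="macro", zero_division=0))
```

Because the default τ = 0.5 systematically suppresses rare labels whose calibrated probabilities rarely exceed 0.5, lowering τ for minority classes is what recovers much of the macro-F1 gap shown in Table 6.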
Table 7. Provider behavior recognition performance.

| Architecture | Micro-F1 | Macro-F1 | Cohen's κ | Δ vs. Full |
|---|---|---|---|---|
| Controlled HOPE (human-annotated, 25 PQS dimensions) | | | | |
| BERT-base + acoustic features | 0.58 | 0.58 | 0.52 | −0.33 |
| ClinicalBERT + acoustic features | 0.62 | 0.62 | 0.56 | −0.29 |
| ClinicalBERT + WavLM (early fusion) | 0.71 | 0.71 | 0.64 | −0.20 |
| DeBERTa-v3 + acoustic features | 0.74 | 0.74 | 0.68 | −0.17 |
| DeBERTa-v3 + WavLM (concatenation) | 0.79 | 0.79 | 0.74 | −0.12 |
| DeBERTa-v3 + WavLM (late fusion) | 0.82 | 0.82 | 0.78 | −0.09 |
| DeBERTa-v3 + WavLM (cross-attention) | 0.93 | 0.91 | 0.87 | – |
| YouTube HOPE (automated annotation, 6 communication styles) | | | | |
| DeBERTa-v3 + WavLM (cross-attention) | 0.86 | 0.71 | N/A | – |
Table 8. Per-dimension performance on controlled HOPE.

| PQS Dimension | Precision | Recall | F1 | Support | Multimodal Signature |
|---|---|---|---|---|---|
| Warmth/Supportiveness | 0.96 | 0.93 | 0.94 | 156 | Affirming language + soft prosody |
| Empathy | 0.94 | 0.91 | 0.92 | 142 | Validating content + warm tone |
| Silence/Listening | 0.93 | 0.90 | 0.91 | 145 | Distinctive acoustic absence |
| Validation | 0.92 | 0.89 | 0.90 | 134 | Clear linguistic markers |
| Reassurance | 0.88 | 0.85 | 0.86 | 103 | Moderate complexity |
| Directiveness | 0.87 | 0.84 | 0.85 | 98 | Multiple communication styles |
| Interpretation | 0.82 | 0.79 | 0.80 | 87 | Context-dependent patterns |
| Challenge/Confrontation | 0.79 | 0.75 | 0.77 | 67 | Subtle, variable delivery |
Table 9. Per-style performance on YouTube HOPE with bootstrap confidence intervals.

| Communication Style | F1 | 95% CI | Support | Threshold | Characteristics |
|---|---|---|---|---|---|
| Neutral | 0.934 | [0.91, 0.95] | 327 | 0.25 | Majority class |
| Transitional | 0.834 | [0.79, 0.87] | 211 | 0.60 | Structural cues |
| Reflective | 0.833 | [0.78, 0.88] | 137 | 0.30 | Paraphrasing |
| Empathetic | 0.600 | [0.42, 0.78] | 12 | 0.85 | Rare; wide CI |
| Supportive | 0.561 | [0.39, 0.73] | 18 | 0.45 | Limited examples |
| Validating | 0.500 | [0.25, 0.75] | 8 | N/A | Extreme rarity |

Note: Confidence intervals computed via bootstrap resampling (1000 iterations). Wide CIs for rare classes indicate limited statistical reliability.
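The bootstrap procedure behind these intervals can be sketched as follows: segments are resampled with replacement 1000 times, F1 is recomputed on each resample, and the 2.5th and 97.5th percentiles form the 95% CI. The synthetic labels and the agreement rate below are placeholders for illustration only.

```python
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(42)
n = 327                                   # illustrative segment count
y_true = rng.integers(0, 2, n)
y_pred = np.where(rng.random(n) < 0.9, y_true, 1 - y_true)  # ~90% agreement

def bootstrap_f1_ci(y_true, y_pred, n_boot=1000, alpha=0.05):
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))  # resample with replacement
        stats.append(f1_score(y_true[idx], y_pred[idx], zero_division=0))
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

print(bootstrap_f1_ci(y_true, y_pred))
```

For a class with only 8–18 positive examples, most resamples contain very few positives, which is exactly why the intervals for Validating, Empathetic, and Supportive span 0.25 or more.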
Table 10. Acoustic feature ablation study on controlled HOPE.

| Configuration | Empathy F1 | Warmth F1 | Directiveness F1 | Macro-F1 (25 Dims) |
|---|---|---|---|---|
| Full model (all features) | 0.92 | 0.94 | 0.85 | 0.91 |
| − Pitch features | 0.84 | 0.88 | 0.82 | 0.85 |
| − Energy features | 0.88 | 0.90 | 0.74 | 0.83 |
| − MFCC features | 0.90 | 0.92 | 0.83 | 0.88 |
| − All prosody (text only) | 0.76 | 0.78 | 0.66 | 0.75 |
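The three ablated feature groups (pitch, energy, MFCCs) can be extracted with librosa [66] along the lines of the sketch below. The frame parameters, pitch range, and mean/std summarization are illustrative assumptions, not the authors' exact feature configuration.

```python
import numpy as np
import librosa

def prosodic_features(path: str, sr: int = 16000) -> np.ndarray:
    """Fixed-size acoustic feature vector for one audio segment."""
    y, sr = librosa.load(path, sr=sr)
    f0 = librosa.yin(y, fmin=65, fmax=400, sr=sr)       # pitch contour
    rms = librosa.feature.rms(y=y)[0]                    # frame-level energy
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # spectral envelope
    # Summarize each group with mean/std so variable-length segments
    # map to a fixed-size vector for the classifier.
    return np.concatenate([
        [np.nanmean(f0), np.nanstd(f0)],
        [rms.mean(), rms.std()],
        mfcc.mean(axis=1), mfcc.std(axis=1),
    ])

# Example (hypothetical path): feats = prosodic_features("session_segment.wav")
```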
Table 11. Automated annotation quality audit results.

| Communication Style | Automated Frequency | Expert Frequency | Agreement (κ) | Error Pattern |
|---|---|---|---|---|
| Neutral | 327 | 298 | 0.78 | Overdetection |
| Transitional | 211 | 224 | 0.72 | Good alignment |
| Reflective | 137 | 151 | 0.65 | Moderate underdetection |
| Supportive | 18 | 29 | 0.52 | Underdetection |
| Empathetic | 12 | 23 | 0.41 | Substantial underdetection |
| Validating | 8 | 17 | 0.35 | Severe underdetection |
| Overall | 200 | 200 | 0.61 | Bias toward majority |
