Article

PAD-Guided Multimodal Hybrid Contrastive Emotion Recognition upon STEM-E2VA Dataset

1
College of Computer Science and Technology, Shanxi University of Electronic Science and Technology, Linfen 041000, China
2
College of Electronic Information Engineering, Taiyuan University of Technology, Jinzhong 030600, China
3
School of Computing, Newcastle University, Newcastle NE1 7RU, UK
*
Author to whom correspondence should be addressed.
Multimodal Technol. Interact. 2026, 10(4), 38; https://doi.org/10.3390/mti10040038
Submission received: 4 January 2026 / Revised: 7 March 2026 / Accepted: 31 March 2026 / Published: 2 April 2026

Abstract

Speech emotion recognition still faces three challenges: single-modal information has limited representational capacity, discrete emotion annotations struggle to capture continuous emotional transitions, and multimodal fusion methods remain affected by structural differences between modalities and by cross-sample misalignment. To address these, this study works from both the data and the model perspective. On the data side, we constructed STEM-E2VA, a Chinese multimodal database that synchronously collects four modalities: articulatory kinematics, acoustics, glottal signals, and video. It covers seven discrete emotion categories and adds continuous PAD annotation. By integrating discrete and continuous dimensional annotations, it better represents the distinction between strong and weak emotions under the same discrete emotion label. To handle bias in the PAD annotations, we used the SCL-90 psychological questionnaire to analyze annotators’ cognitive and emotional perceptions, thereby ensuring data reliability. On the model side, this paper proposes a PAD-aware multimodal supervised contrastive fusion network. It employs a PAD-enhanced hybrid contrastive loss function to optimize intra-modal and inter-modal feature alignment. Using a cross-attention mechanism combined with a GRU–Transformer network for temporal feature extraction, it achieves deep fusion of multimodal information and reduces inter-modal discrepancies and cross-class confusion. Experiments show that the proposed method achieves 85.47% accuracy in discrete sentiment recognition on STEM-E2VA, with a substantial reduction in RMSE for PAD dimension prediction. It also generalizes well to IEMOCAP, providing a novel framework for integrating discrete and continuous sentiment representations.

1. Introduction

Although research on speech emotion recognition has made progress, recognition accuracy still has room for improvement. On the one hand, single-modal information is susceptible to factors such as noise and individual variation in expression, so recognition results often exhibit limited accuracy [1]. This has led to the mainstream adoption of multimodal fusion methods, which integrate speech, text, visual, and other signals to extract more comprehensive emotional information. On the other hand, current emotion recognition technologies are primarily based on discrete models, which can only identify single categories such as happiness or sadness. These models struggle to capture intensity variations within the same emotion class under discrete labeling (e.g., a smile versus a hearty laugh) and continuous transitions between different emotions (e.g., crying from overwhelming joy). In contrast, the dimensional emotion model describes emotions continuously in a high-dimensional space, compensating for the limitations of discrete models. This approach, particularly in human–machine interaction, allows machines to simulate dynamic changes in human emotions [2]. Therefore, constructing a recognition model that integrates multimodal information, combines discrete and dimensional emotions, and aligns with human cognitive and emotional processing has become one of the key challenges in the field of emotion recognition [3].
Against this backdrop, multimodal emotion recognition has become a research hotspot. It aims to fuse heterogeneous information from multiple sensory sources to improve the accuracy and generalization of emotion modeling. In recent years, various fusion strategies have been proposed, such as tensor fusion [4], Multimodal Transformer [5], MISA (Modality-Invariant and Specific Representations) [6], HRG-SSA [7], and MTAMW [8], which integrates identity and modality attributes. These methods have achieved notable progress in modality-consistent modeling.
However, the inherent structural differences and semantic gaps between modalities present significant challenges for the unified representation learning of multimodal data. Existing RNN variants [9], Transformers [10], and BERT-based pre-trained models [11] have shown strong performance in modeling intra-modal dynamic interactions. Nevertheless, these mainstream approaches primarily focus on intra-sample modal fusion, often overlooking the need for semantic alignment and structural generalization across samples—especially across different sentiment categories. This failure to address underlying modal inconsistencies severely limits the transferability and generalization capabilities of sentiment recognition systems in complex real-world scenarios.
To alleviate the aforementioned challenges, recent research has introduced contrastive learning into multimodal sentiment recognition tasks. By constructing positive–negative sample pairs and designing contrastive loss functions, these models are encouraged to learn structurally sensitive and semantically consistent feature distributions, thereby effectively improving cross-modal alignment and cross-sample discrimination capabilities. Representative approaches, such as SimCLR [12], CLIP [13], MoCo [14], HNCL [15], CLRA [16], and MSHCL [17], typically adopt a two-stage architecture consisting of a ‘representation extractor + projection head’. This structure not only preserves strong feature representation capabilities but also enhances the model’s generalization and robustness.
To meet the requirements of multimodal fusion sentiment networks, current databases still present several constraints. Existing sentiment databases typically adopt either discrete emotion labels (e.g., CASIA’s five categories [18], EMO-DB’s seven categories [19]) or dimensional emotion annotations (e.g., VAM’s three-dimensional annotation [20], SEMAINE’s five-dimensional annotation [21]), while datasets that provide both discrete and dimensional labels remain relatively limited. In response to the limitations of unimodal resources in complex scenarios, multimodal corpora have been developed to support collaborative modeling. Some corpora mainly combine external expressive signals, such as audio and video in CMU-MOSI [22], or audio and video together with motion capture in IEMOCAP [23], which can be influenced by noise and emotional masking in audio–visual channels. Other corpora further incorporate physiological signals, such as Recola’s synchronous collection of audio, video, and physiological signals [24], and MGEED’s combination of optomyography, electroencephalogram, and electrocardiogram signals [25]. While incorporating these physiological modalities provides additional information, they often lack a tight integration with external emotional expressions (e.g., facial expressions and speech), which restricts the overall accuracy of emotion analysis. Furthermore, many multimodal corpora place more emphasis on audio–video and general physiological measurements (such as electrocardiograms and electroencephalograms), while signals related to speech production are less frequently collected. Therefore, for emotion analysis centered on speech, introducing supplementary speech production modalities is somewhat necessary.
Building on this motivation, we consider complementary modalities that are directly tied to speech production. In particular, articulatory kinematics and glottal activity are promising due to their close connection to speech production and emotional expression. Articulatory kinematics quantifies 3D displacements and velocities of articulators (e.g., tongue and lips) to characterize emotion-related changes in articulatory movement patterns and coordination [26], adding articulatory-state cues beyond acoustic features [27]. The glottal signal reflects periodic vocal-fold contact variation and captures emotion-related changes in vocal-fold vibration driven by laryngeal tension regulation [28], thereby providing source-related features from complex acoustic waveforms [29]. The characteristics of these two modalities also lay the foundation for their participation in emotion recognition and dimension prediction. We describe their roles in emotional representation through the pleasure–arousal–dominance (PAD) model [30]: P denotes valence (positive vs. negative), A denotes physiological activation/energy (calm to excited), and D denotes control/power (compliant to confident). Changes in A typically accompany physiological activation and are potentially reflected in articulatory movement dynamics and glottal tension regulation. Therefore, we propose the following hypothesis: both the articulatory kinematics features and the glottal measurement values are associated with A—the former provides the amplitude, speed and temporal organization of the movement, which helps to distinguish between high and low arousal levels, as well as the intensity difference of “strong–weak” in discrete emotion recognition, while the latter captures the glottal contact dynamics related to changes in laryngeal tension and excitation patterns, which are associated with different arousal levels.
However, databases that satisfy these dual requirements—simultaneous acquisition of articulatory kinematics–audio–glottal–video together with three-dimensional PAD affect annotation—remain limited. This data gap restricts joint modeling of discrete classification and dimensional regression, and constrains the fusion of complementary emotional signals such as acoustics, glottal source, and articulatory movement.
To fill this gap and leverage these distinct physiological mappings, this study constructs the STEM-E2VA database (Suzhou and Taiyuan Emotional dataset on Mandarin with Electromagnetic articulation, Electroglottography, Video, and Audio). This database realizes the quadri-modal synchronous collection of acoustic, glottal, video and articulatory kinematic signals. It contains seven discrete emotions (neutral, happy, angry, etc.) and an improved seven-point PAD annotation. In addition, this study uses the SCL-90 psychological scale to analyze the correlation between annotators’ psychological states and PAD annotation deviations, so as to improve annotation reliability.
Based on this high-quality multimodal dimensional emotion database, we further propose a multimodal contrastive fusion affect network guided by PAD emotional labels. The network consists of three key components: first, a customized multimodal feature encoding pathway that models acoustic, glottal, and articulatory kinematic signals to extract their respective features; second, a mixed contrastive learning loss (MMCL-PAD) is introduced to optimize both intra-modal and inter-modal representation learning; and third, a cross-attention fusion mechanism integrates multimodal information to support downstream tasks such as discrete sentiment recognition and dimensional sentiment prediction.
Based on all of the above, this study aims to address the core challenges in speech emotion recognition, including the limited representation capability of single-modal information, the difficulty of capturing continuous emotional transitions with discrete annotations, and the modal structural differences and cross-sample misalignment in multimodal fusion.
The main contributions of this work are threefold:
(1)
We constructed the Mandarin multimodal emotional speech corpus STEM-E2VA, featuring synchronized acoustics, glottal signals, articulatory kinematics, and video. It provides rich multimodal representations that support generation tasks (e.g., emotional speech synthesis) as well as research on speech production mechanisms, such as acoustic-to-articulatory mapping.
(2)
We refined the seven-point PAD annotation scheme for seven discrete emotions and verified its reliability through an SCL-90-based analysis of annotator bias. Leveraging these robust annotations, we propose the MMCL-PAD to bridge the gap between categorical and dimensional emotion representations, simultaneously supporting discrete emotion recognition and continuous PAD prediction.
(3)
We propose a PAD-guided multimodal hybrid contrastive framework that aligns and fuses heterogeneous modalities via cross-attention. Experimental results demonstrate that our framework achieves superior performance on the STEM-E2VA dataset and exhibits generalization capabilities on public benchmarks (e.g., IEMOCAP).
The remainder of this paper is organized as follows. Section 2 introduces the construction of STEM-E2VA, including data collection and preprocessing. Section 3 describes the PAD annotation procedure, statistical analyses of annotation behavior, and outlier-related processing. Section 4 presents the proposed PAD-guided multimodal contrastive fusion framework. Section 5 reports the experimental setup and results. Section 6 and Section 7 discuss findings, limitations, and future directions.

2. Construction of the STEM-E2VA Database

The STEM-E2VA database is a multimodal audio–visual emotional pronunciation database for Mandarin, jointly collected by Taiyuan University of Technology and Soochow University. It synchronously captures four modalities: acoustic, articulatory, glottal, and visual.

2.1. Selection of Subjects

To ensure the naturalness and authenticity of the emotional data, we recruited 22 native Mandarin-speaking participants (gender ratio 1:1, aged 21–27) who were not performing arts majors. All participants met the criteria for standard Mandarin proficiency, physical and mental health, and the absence of throat disorders. Prior to recording, participants completed the Symptom Checklist-90 (SCL-90) survey and underwent emotional expression training. After passing the evaluation, they signed an informed consent form [31] to ensure the scientific and standardized nature of the data collection process.

2.2. Emotion Selection and Induction

Given the absence of unified classification standards in affect recognition research, this paper adopts common emotions—anger, sadness, happiness, and neutrality—as benchmarks to enhance the study’s generalizability and comparability. Following Chen Ying’s methodology [32], the first three emotions are categorized into two intensity levels, ultimately establishing seven discrete categories as shown in Table 1. Emotional induction employed autobiographical recall [33] and imagined emotion induction [34], guiding participants to construct contextual scenarios or evoke genuine emotional experiences to obtain authentic emotional speech.

2.3. Text Design

The speech materials for this database were systematically designed to cover a broad phonetic context. The textual corpus primarily consists of simple finals, compound finals, nasal finals, and emotionally neutral sentences.
Based on the Source-Filter Theory [35], we developed specific strategies for the selection of syllabic finals. Simple finals such as /a/, /i/, and /u/ were chosen to analyze glottal source characteristics under quasi-static vocal tract configurations, as they are phonetically monophthongs with a relatively stable vocal tract shape that minimizes interference from complex articulatory movements and provides reliable baseline data for investigating vocal fold vibration modes and voice quality via glottal signals. Conversely, compound finals including /ai/ and /ou/ were introduced to capture dynamic articulatory properties; their inherent tongue and lip gliding movements can be accurately recorded by the high spatiotemporal resolution of the Electromagnetic Articulograph (EMA), yielding rich kinematic data for articulatory trajectory modeling [36]. Furthermore, Mandarin nasal finals such as /an/ and /ang/ were selected to ensure comprehensive phonetic coverage. With distinct spectral characteristics and articulation methods relative to oral monophthongs, they increase phonetic diversity and supply a more comprehensive combination of acoustic and articulatory features for multimodal modeling [37].
Sentence selection is equally critical for the construction of an emotional speech database. Appropriate sentence design is essential for eliciting natural and authentic emotional expression and serves as the foundation for effective speech emotion recognition [38]. Given that the same text must support the expression of multiple emotions, the content must strictly adhere to principles of “semantic neutrality” and “lack of emotional bias.” This database rigorously screened sentences across three dimensions:
  • Sentence Type: Only declarative sentences were used. To stabilize the F0 baseline and reduce prosodic variability, we avoided sentence-final modal particles and markers that imply specific intonation patterns (e.g., ma, ya, ba).
  • Content: Emotion-bearing words, intensifiers, and degree adverbs were excluded. We retained only objective, daily-life descriptive statements (e.g., task-oriented or factual sentences) to minimize lexical–semantic interference.
  • Duration: The sentence duration was strictly controlled to approximately 5 s at a normal speaking rate.

2.4. Multimodal Data Acquisition and Preprocessing

2.4.1. Data Acquisition System

During data acquisition, we used multi-device synchronization to acquire signals across four modalities. For articulatory kinematics, we used the three-dimensional electromagnetic articulograph EMA-AG501 [39] (Carstens Medizinelektronik GmbH, Bovenden, Germany), which records spatial coordinates at a 250 Hz sampling rate. Thirteen sensors were attached to the vocal organs and reference points, with seven primary sensors dedicated to tracking articulatory movements: four on the lips (upper lip, lower lip, left and right mouth corners, labeled L1–L4) and three on the tongue (tongue tip, tongue body, tongue root, labeled T1–T3). The sensor placement used for data collection is shown in Figure 1. Facial video was recorded with a Sony FDR-AX100 camera (4K, 3840 × 2160; 25 fps; 54 Mbps video bitrate; 56 Mbps total bitrate) (Sony, Tokyo, Japan). Glottal signals were recorded on the left channel of an EGG-D100 electroglottograph at 44.1 kHz. Audio was collected in two ways: on the right channel of the EGG-D100, and with a T.bone EM 9600 beamforming microphone positioned 30 cm from the participant (synchronized with the EMA). All audio signals were stored as stereo WAV files and resampled to 48 kHz with 16-bit depth.

2.4.2. Synchronization and Procedure

During the experiment, data collection was conducted by three operators: two controlled the EMA-AG501 and the EGG-D100 (Dion Engineering Ltd., Ruse, Bulgaria), respectively, and the third managed stimulus presentation and emotional guidance. Participants were seated under the EMA instrument and articulated prompts shown on a screen. A hybrid synchronization strategy was applied to ensure temporal consistency across modalities. The EMA-AG501 and the T.bone EM 9600 microphone (Thomann, Treppendorf, Germany) share a clock trigger through the synchronization box, which provides hardware synchronization. The glottal data were recorded at the same time as the other devices, but precise time alignment was performed afterwards. The EGG operates in dual-channel mode, and the synchronous audio signal on its right channel is matched against the audio signal from the microphone, which removes the recording delay between devices and yields precise alignment. The data collection scenario is shown in Figure 2.
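The post hoc alignment of the EGG recording to the microphone recording can be illustrated with a short sketch. The source does not state which matching method was used; cross-correlation is assumed here as one common choice, and the function names `best_lag` and `shift` are hypothetical. Because the glottal waveform (left channel) and the EGG-side audio (right channel) come from the same device, the lag estimated on the audio channel is assumed to apply to the glottal channel as well.

```python
def best_lag(reference, other, max_lag):
    """Return the lag (in samples) at which `other` best matches `reference`.

    A positive lag means `other` is delayed relative to `reference`.
    Brute-force cross-correlation; an assumed method, not the authors' stated one.
    """
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n, r in enumerate(reference):
            m = n + lag
            if 0 <= m < len(other):
                score += r * other[m]
        if score > best_score:
            best, best_score = lag, score
    return best


def shift(signal, lag):
    """Advance `signal` by `lag` samples (delay it if lag is negative).

    The output keeps the input length; vacated positions are zero-filled.
    """
    n = len(signal)
    if lag >= 0:
        return signal[lag:] + [0.0] * min(lag, n)
    return [0.0] * min(-lag, n) + signal[:lag]


# Toy example: the EGG-side audio is the microphone audio delayed by 5 samples.
mic_audio = [0.0] * 10 + [1.0, 2.0, 3.0, 2.0, 1.0] + [0.0] * 15
egg_audio = [0.0] * 15 + [1.0, 2.0, 3.0, 2.0, 1.0] + [0.0] * 10

lag = best_lag(mic_audio, egg_audio, max_lag=10)
aligned_egg_audio = shift(egg_audio, lag)
# The same `lag` would then be applied to the glottal (left) channel.
```

For full-length recordings an FFT-based correlation would be far faster than this quadratic loop; the loop is kept only to make the logic explicit.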
Given the different native sampling rates across modalities outlined in Table 2, we aligned model inputs by physical time duration rather than unifying sampling rates. During preprocessing, acoustic and glottal signals were resampled to 16 kHz for the Wav2Vec 2.0 encoder, while articulatory kinematics remained at 250 Hz to avoid interpolation errors and redundant computation. All data were segmented into fixed 3.4 s windows, corresponding to 54,400 samples at 16 kHz and 850 samples at 250 Hz, with minor duration differences adjusted by truncation or zero-padding. Although the input lengths differ, temporal alignment is achieved in the fusion module using cross-attention, which automatically learns the correspondence between sparse articulatory frames and dense acoustic frames, resolving resolution mismatches caused by different sampling rates (see Section 4.3).
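The fixed-duration windowing described above can be sketched as follows. This is a minimal illustration of aligning by physical time rather than by sample count; the helper names `window_length` and `fit_to_window` and the `pad_value` parameter are hypothetical, and a real EMA frame would be a vector of sensor coordinates rather than a single number.

```python
WINDOW_SECONDS = 3.4  # fixed window duration stated in the text


def window_length(sample_rate_hz, seconds=WINDOW_SECONDS):
    """Number of samples (or frames) that span `seconds` at the given rate."""
    return round(sample_rate_hz * seconds)


def fit_to_window(signal, sample_rate_hz, seconds=WINDOW_SECONDS, pad_value=0.0):
    """Truncate or zero-pad `signal` so it spans exactly one window."""
    target = window_length(sample_rate_hz, seconds)
    clipped = list(signal[:target])
    return clipped + [pad_value] * (target - len(clipped))


audio_len = window_length(16_000)  # acoustic / glottal input at 16 kHz
ema_len = window_length(250)       # articulatory kinematics at 250 Hz

long_audio = fit_to_window([0.1] * 60_000, 16_000)  # truncated
short_ema = fit_to_window([1.0, 2.0], 250)          # zero-padded
```

Both streams cover the same 3.4 s even though their lengths differ by a factor of 64, which is why the fusion stage has to learn the correspondence instead of relying on index equality.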
Our hybrid synchronization process provides robust temporal consistency: the hardware trigger (EMA-microphone) achieves synchronization with sub-millisecond precision, and the theoretical time resolution of waveform alignment is limited by the sampling interval of the reference audio (approximately 20–23 microseconds per sample). Nevertheless, owing to system and environmental factors, minor residual offsets are inevitable in practice. To address this, the framework uses a two-layer strategy to reduce the influence of such deviations. First, at the feature extraction level, the time granularity of the extracted features is coarser than the potential misalignment of the raw signals. Specifically, the EMA system records at 250 Hz (4 milliseconds per frame), while the feature stride of the Wav2Vec 2.0 acoustic encoder is approximately 20 milliseconds. Millisecond-level physical errors therefore usually fall within the same feature frame, which avoids strict frame-level misalignment. Second, at the architecture level, cross-attention fusion does not enforce frame-by-frame correspondence but learns a semantic “soft alignment”. By dynamically focusing on the local neighborhood of acoustic markers, this mechanism not only compensates for residual hardware deviations but also adapts to the inherent dynamic asynchrony of heterogeneous physiological modalities, which supports the robustness of downstream emotion recognition.

2.4.3. Data Preprocessing and Screening

In the video data, the EGG microphone and facial sensors partially obstruct the subjects’ faces, so the required micro-expression images would have to be extracted manually. We therefore left the video data unprocessed. Users can clip the segments they need from the video according to their own requirements.
To preprocess the articulatory kinematic data from the EMA-AG501, the original voltage amplitude data (.amp) are first converted to raw 3D spatial data (.rawpos). The raw positions and the voltage data are then filtered and head-calibrated to produce calibrated 3D data (.pos). The calibrated data are converted to ASCII format, and Visartico is used to inspect the articulator trajectories; abnormal recordings are excluded on visual criteria such as complete lip shape and continuous tongue movement. To quantify data stability, we use the root mean square error (RMSE, written eRMS below) as the key metric, where smaller values indicate greater stability. We extract 3D motion parameters for the upper and lower lips, the left and right mouth corners, and the tongue, define speech data with eRMS ≤ 10 mm as stable, and exclude all other data.
To obtain objective emotional data, we assessed the acoustic modalities of the speech database and selected samples with satisfactory speech quality and emotional expression. Six native Mandarin-speaking volunteers with normal hearing were recruited to perform subjective listening scores [40], evaluating speech clarity and emotional intensity. Neutral speech was not scored for emotional expression. In data analysis, samples with both speech clarity and emotional expression scores ≥ 2.5 were retained for further analysis.
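The retention rule above can be written as a small filter. This is only a sketch: the text does not say how the six listeners' scores were combined, so averaging across listeners is assumed here, and the function name `keep_sample` is hypothetical. Neutral speech is handled by omitting the emotion scores.

```python
from statistics import mean

SCORE_THRESHOLD = 2.5  # retention threshold stated in the text


def keep_sample(clarity_scores, emotion_scores=None, threshold=SCORE_THRESHOLD):
    """Decide whether a speech sample passes the subjective listening test.

    Scores are averaged over listeners (an assumption). Pass
    `emotion_scores=None` for neutral speech, which is not rated for emotion.
    """
    if mean(clarity_scores) < threshold:
        return False
    if emotion_scores is None:
        return True
    return mean(emotion_scores) >= threshold


kept = keep_sample([3, 3, 2, 3, 3, 2], [3, 3, 3, 2, 2, 3])
dropped = keep_sample([3, 3, 3, 3, 3, 3], [2, 2, 3, 2, 2, 3])
neutral_kept = keep_sample([3, 2, 3, 2, 3, 2])
```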
In this study, the EGG-D100 device recorded two-channel signals: the left channel recorded the glottal waveform, whose amplitude is positively correlated with the contact area of the vocal cords [41], and the right channel recorded the synchronous acoustic signal. The quality of the glottal signal varies with physiological characteristics: male subjects and subjects with a lower body fat percentage are more likely to yield stable signals, while the signals of female subjects are usually weaker and less stable, because the vocal cord skin of women is thinner, the vibration frequency is higher, and the movement speed is faster, which weakens the EGG current. Therefore, during screening, samples with undetectable, unstable, or missing glottal waveforms in the left channel were excluded.
The final dataset comprises 2415 emotional speech segments, 528 rhyme segments, and 24.5 h of video. Details are presented in Table 3.

3. Quantitative Labeling of PAD Dimensions and Data Analysis

Having completed the foundational collection and construction of the STEM-E2VA database, we introduced a dimension-based affect annotation system to comprehensively enhance its emotional expression dimension and adapt to multi-task modeling requirements. We also systematically evaluated and optimized annotation quality and annotator state using psychological analysis methods.

3.1. PAD Quantitative Model and Labeling

The PAD three-dimensional affect model was proposed by Russell et al. [42] in 1974 on the basis of psychological research. It divides the affective domain into three dimensions: pleasure, arousal, and dominance. Pleasure (denoted as P) is the positive or negative polarity of emotion, that is, whether the emotional state is positive or negative (pleasure and displeasure, respectively). Arousal (denoted as A) indicates the physiological activation level and alertness of emotion, reflecting whether an individual’s response to the environment is excited or subdued (arousal and nonarousal, respectively). Dominance (denoted as D) is the control and influence of emotion over others and the external environment, distinguishing whether an emotional state originates from the individual’s subjective expression or is generated by the influence of the objective environment (dominance and submissiveness, respectively).
In this study, we adapted the PAD emotion scale revised by the Chinese Academy of Sciences for annotation and improved it based on the nine-point semantic differential scale: referring to the research of Gong Xu et al. [43], we expanded the five-point difference scale to a seven-point semantic differential scale. In order to enhance the participants’ ability to comprehend the scoring system, the original seven-point numerical range (−3 to 3) was modified to 1 to 7. In addition, the four items across three dimensions were consolidated into a single configuration, as shown in Table 4.
In order to conduct the PAD dimension emotion annotation experiment, a total of 28 subjects (15 males and 13 females) were recruited in this study, none of whom had participated in the recording of the STEM-E2VA database. Prior to the commencement of the experiment, all participants completed the SCL-90 mental health questionnaire. The annotation process employed a WeChat mini-program, facilitating random playback of speech data, discrete scoring for P/A/D dimensions, automated recording of response times and replay counts, and incorporation of periodic rest mechanisms to ensure annotation quality.

3.2. Statistical Analysis of PAD Data

To evaluate the annotation consistency and spatial distribution of the labels given by 20 annotators to six target emotions (happiness, pleasure, anger, indifference, sadness, and sorrow) on the three dimensions of pleasure (P), arousal (A), and dominance (D), this section combines quantitative statistics with visualization to analyze the dispersion and spatial patterns of the data.
First, the intra-class standard deviation (Std) of each emotion in the P, A, and D dimensions was calculated to quantify how concentrated or dispersed the annotation results are. The results (Figure 3) show that the Std values of the emotions mostly fall between 0.90 and 1.50, indicating that the annotation data are generally concentrated. The exception is the pleasure dimension of the happiness emotion, whose Std is relatively high at 1.53. We speculate that outliers, reflecting individual differences among some annotators, increased the overall Std.
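The intra-class standard deviation can be computed as in the following sketch. The record layout `(emotion, P, A, D)` and the function name `intra_class_std` are hypothetical, and the sample standard deviation (n − 1 denominator) is assumed because the text does not specify which estimator was used.

```python
from collections import defaultdict
from statistics import stdev


def intra_class_std(annotations):
    """Per-emotion standard deviation of the P, A, and D scores.

    `annotations` is an iterable of (emotion, p, a, d) tuples on the
    1-7 scale. Each emotion needs at least two annotations.
    """
    grouped = defaultdict(lambda: {"P": [], "A": [], "D": []})
    for emotion, p, a, d in annotations:
        grouped[emotion]["P"].append(p)
        grouped[emotion]["A"].append(a)
        grouped[emotion]["D"].append(d)
    return {
        emotion: {dim: stdev(values) for dim, values in dims.items()}
        for emotion, dims in grouped.items()
    }


# Toy annotations, invented purely for illustration.
toy = [
    ("happiness", 4, 5, 5),
    ("happiness", 6, 5, 4),
    ("happiness", 7, 5, 6),
    ("happiness", 5, 5, 5),
]
std_table = intra_class_std(toy)
```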
To further understand how this dispersion appears in space, we average the 20 PAD label values for each sample and plot their distribution in three dimensions (Figure 4). The plot covers the three strong emotions: happiness, anger, and sadness. The scatter plot shows that the three emotions are well separated in PAD space and that the labeled PAD data are largely distributed around the center point; the 95% confidence ellipsoid further shows observable regional clustering. However, a small number of outliers remain, consistent with the dispersion results above. This indicates that individual differences among annotators affect data consistency and that further analysis is needed to improve data quality.

3.3. PAD Outlier Analysis

In the statistical analysis of the PAD data, the high dispersion of the annotation results and the outliers in the spatial distribution indicate systematic annotation bias. Such outliers may stem from two mechanisms. First, there may be an emotional cognitive bias between the annotators and the speakers: especially when distinguishing strong from weak emotional intensity, subjective annotations and objective expressions are prone to mismatch. Second, the annotators’ psychological states may interfere with and diminish their sensitivity to specific affect dimensions. Since outlying PAD annotations can affect downstream tasks, it is necessary to screen for outliers and analyze why they occur.

3.3.1. Correlation Analysis of Outliers and Psychological States of Annotators

To quantify the proportion of abnormal annotations for each annotator across the emotional dimensions, this paper first uses boxplots to establish the outlier threshold $t_i^d$ from the distribution of each emotion in the three dimensions: pleasure (P), activation (A), and dominance (D). An annotation is classified as an outlier when its value exceeds the upper or lower bound for the given affect dimension. The outlier rates of the 20 annotators in the three dimensions of each emotion were then computed as follows.
$$ r_i(c,d) = \frac{n_{\mathrm{outlier}}(i,c,d)}{n_{\mathrm{total}}(c)} $$
In the formula, $r_i(c,d)$ is the outlier rate of the i-th annotator on the d-th dimension (P/A/D) of the c-th emotion; $n_{\mathrm{outlier}}(i,c,d)$ is the number of times the annotator’s value exceeds the threshold in this emotion dimension; $n_{\mathrm{total}}(c)$ is the total number of annotations made by this annotator for this emotion; $c \in \{1, 2, \ldots, 6\}$ indexes the six target emotions; and $d \in \{1, 2, 3\}$ indexes the three emotional dimensions (pleasure, activation, dominance).
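The outlier rate can be illustrated with a short sketch for one emotion and one dimension. The text only says that thresholds come from boxplots; the conventional 1.5 × IQR whisker rule is assumed here, and the function names `boxplot_bounds` and `outlier_rate` are hypothetical.

```python
from statistics import quantiles


def boxplot_bounds(values, whisker=1.5):
    """Lower and upper outlier bounds from the quartiles of `values`.

    Uses the 1.5 * IQR convention, which is an assumption; the source
    does not state the whisker length.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - whisker * iqr, q3 + whisker * iqr


def outlier_rate(annotator_values, bounds):
    """r_i(c, d): share of one annotator's scores outside the bounds."""
    low, high = bounds
    n_outlier = sum(1 for v in annotator_values if v < low or v > high)
    return n_outlier / len(annotator_values)


# Toy scores for one emotion and one dimension, pooled over all annotators.
pooled_scores = [1, 5, 5, 5, 5, 6, 6, 6, 6, 6, 7, 7]
bounds = boxplot_bounds(pooled_scores)

# Scores given by a single annotator for the same emotion and dimension.
rate = outlier_rate([1, 5, 6, 7], bounds)
```

Repeating the call for each annotator, emotion, and dimension gives the full table of rates that the subsequent analysis relates to the SCL-90 factors.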
The psychological symptom state of annotators has been shown to directly influence the regulation of their autonomic nervous system and hypothalamic–pituitary–adrenal axis. Highly neurotic individuals therefore exhibit heightened sensitivity and more intense reactions to negative stimuli while responding more weakly to positive stimuli, which in turn influences their perception of emotions and their activation level [44]. Consequently, the three dimensions of the six emotions were categorized into high and low levels, as shown in Table 5, and the relationship between psychological symptoms and these high/low dimension criteria was analyzed directly.
A small outlier rate among annotators is acceptable. With the exception of high pleasure and low activation, the outlier rate of each category is relatively low and stable, as shown in Figure 5. Annotators show a notably higher outlier rate in the high-pleasure category: four annotators exceed an outlier rate of 30%, and the highest reaches 51% (that is, more than half of this annotator’s values exceeded the threshold). Each category is therefore analyzed in depth below to explore the underlying causes of the outliers.

3.3.2. Modeling of the Impact of Psychological Symptoms on the Accuracy of PAD

To investigate the relationship between the nine psychological symptom factors and the six high/low dimension criteria, we designated the nine SCL-90 factors (somatization, obsessive-compulsive symptoms, interpersonal sensitivity, depression, anxiety, hostility, phobic anxiety, paranoid ideation, and psychoticism) as independent variables, denoted $X_1$–$X_9$. The six high/low dimensional criteria, namely pleasure level (high/low), arousal level (high/low), and dominance level (high/low), denoted $Y_1$–$Y_6$, were taken as dependent variables for multiple linear regression analysis. Consequently, the preliminary model can be set as follows:
$$Y_i = \beta_0 + \sum_{k=1}^{9} \beta_k X_k + \mu_0 \tag{2}$$
In Equation (2), $\beta_0$ is the constant term; $\beta_1, \beta_2, \ldots, \beta_9$ are the regression coefficients; and $\mu_0$ is the residual, which is independent of the nine independent variables. Given that not all independent variables have statistically significant effects on the dependent variable, we used stepwise regression to test the significance of the nine SCL-90 factors (independent variables) for each of the six categories (dependent variables) in turn. Whenever an initially included independent variable became non-significant after a subsequent variable was introduced, it was removed. This ensures that only significant independent variables remain in the regression equation before new variables are introduced, while also eliminating those with high multicollinearity.
Finally, we used SPSS 23.0 to analyze the six categories; the results for high pleasantness are shown in Table 6. Stepwise regression was employed for model construction to identify significant predictors. Initially, Model 1 was developed with $X_1$ (somatization) as the sole predictor. Model 2 was subsequently established by adding $X_3$ (interpersonal sensitivity) to Model 1. Finally, $X_4$ (depression) was incorporated, resulting in the final Model 3, which included $X_1$, $X_3$, and $X_4$ as predictor variables.
In Table 6, $R^2$ is the coefficient of determination, a statistical measure of how well the regression fits the actual data points; the adjusted $R^2$ is the coefficient of determination corrected for the residual degrees of freedom, and it indicates the fitting accuracy of the model. As Table 6 shows, the adjusted $R^2$ of Model 3 is 0.762, the closest to 1 among the candidate models, which indicates the best fit. Therefore, Model 3 is used to derive the regression coefficients, as listed in Table 7. Both unstandardized and standardized coefficients are regression coefficients. Standardization removes the units of measurement so that the coefficients reflect the relative importance of the variables; it is used only for comparing independent variables. The unstandardized regression coefficients are used to explain the effect of the independent variables on the dependent variable. In Model 3, the three independent variables $X_1$, $X_3$, $X_4$ are significant (p < 0.050), the tolerance values in the collinearity statistics are not less than 0.100, and the Variance Inflation Factors (VIFs) are less than 10.000. Consequently, the validity of this model is substantiated. Performing the same stepwise regression analysis for each category, the regression equation for high pleasure can be expressed as follows:
$$Y_1 = 0.930 + 0.398X_1 - 0.328X_3 + 0.286X_4 \tag{3}$$
In Equation (3), the coefficients for somatization ($X_1$) and depression ($X_4$) are positive, indicating that more severe symptoms of these types lead to higher error rates in evaluating high pleasantness. Conversely, the coefficient for interpersonal sensitivity ($X_3$) is negative, so the opposite trend is observed. This finding suggests that psychological symptoms may diminish the brain’s stress response to specific emotions, which directly affects the accuracy of the annotations made in the PAD labeling experiment. We used the same method to analyze the other five dependent variables but found no results comparable to those for high pleasantness. The cause of the outliers in those categories remains unknown and is a direction for subsequent analysis and research.
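The diagnostics used in this subsection (least-squares coefficients, adjusted $R^2$, and VIF) can be sketched as follows. This is an illustration on synthetic data, not the SPSS procedure used in the study: the stepwise selection itself is omitted, and the sample values, the coefficients in `true_beta`, and the function names are assumptions.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit of y = b0 + X @ b; returns coefficients, R^2, adjusted R^2."""
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])  # prepend the intercept column
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    return beta, r2, adj_r2

def vif(X):
    """VIF of each column: 1 / (1 - R_j^2), where R_j^2 comes from
    regressing column j on the remaining columns."""
    factors = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        _, r2_j, _ = fit_ols(others, X[:, j])
        factors.append(1.0 / (1.0 - r2_j))
    return np.array(factors)

# Synthetic stand-in for 20 annotators and three symptom-factor scores
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
true_beta = np.array([0.5, 0.4, -0.3, 0.3])  # made-up intercept and slopes
y = true_beta[0] + X @ true_beta[1:] + rng.normal(scale=0.05, size=20)

beta, r2, adj_r2 = fit_ols(X, y)
vifs = vif(X)
print(beta.round(2), round(adj_r2, 3), vifs.round(2))
```

The adjusted $R^2$ penalizes each added predictor through the factor $(n-1)/(n-p-1)$, which is why it is the appropriate quantity for comparing Models 1 to 3 with different numbers of predictors.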

3.4. Outlier Processing

After completing the analysis of the relationship between the psychological state of the annotators and the outlier rate, it is still necessary to further clarify and handle those outlier individuals that notably affect data quality. We established the outlier determination threshold based on the preliminary statistical results to calculate the outlier rate. In the following stage of the research, we will systematically analyze the annotation distribution characteristics of different emotions across each dimension of PAD and precisely define the outlier determination criteria for each type of emotion in each dimension, so as to eliminate or adjust abnormal annotations before the subsequent model training, thereby enhancing the overall consistency and stability of the data.
The distribution of the six emotions across the three dimensions was examined based on the calculated average annotation values, combined with density estimation curves, box plots and jitter scatter plots. Figure 6 shows the raincloud plot for pleasantness. All six types of emotions follow an approximately normal distribution. Pleasure, indifferent, grief and sadness exhibit the highest concentration, while angry and happy are more dispersed, although both are still clearly concentrated within a specific range. Therefore, the outlier threshold $t_i^d$ is determined from the box plot. For example, a pleasure annotation for sadness that exceeds the threshold $t_6^1 = 3.3$ is regarded as an outlier. The outlier thresholds for the six emotions in the three dimensions are obtained in the same way.
Finally, to eliminate the effect of outliers on the data distribution, we uniformly clip the PAD annotation values that exceed the thresholds to the corresponding upper or lower bounds. We then calculate the mean and standard deviation (Std) of each dimension of the corrected data for each emotion category. These values serve as the representative vector of each emotion in the PAD space (see Table 8 for results). The results show that the mean vectors of the strong emotions “happy”, “angry” and “sad” are clearly separated from those of their corresponding weak emotions. Furthermore, the Std decreased for all emotions, indicating a more concentrated data distribution and improved annotation consistency. The corrected data form a dimensional emotion database with a clear structure and strong discrimination, which lays a foundation for subsequent speech emotion recognition modeling in the PAD space.
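The clipping step can be sketched in a few lines. The ratings and the lower bound below are hypothetical; only the upper bound of 3.3 echoes the sadness/pleasure example given above.

```python
import statistics

def clip_to_bounds(values, lower, upper):
    """Replace each value outside [lower, upper] with the nearest bound."""
    return [min(max(v, lower), upper) for v in values]

# Hypothetical pleasure ratings for one emotion, with assumed bounds
ratings = [2.0, 2.5, 2.2, 5.0, 2.4, 0.5]
lower, upper = 1.5, 3.3

clipped = clip_to_bounds(ratings, lower, upper)
std_before = statistics.pstdev(ratings)
std_after = statistics.pstdev(clipped)

print(clipped)                 # [2.0, 2.5, 2.2, 3.3, 2.4, 1.5]
print(std_after < std_before)  # True
```

Clipping keeps every annotation in the data set while pulling extreme values back to the bound, which is why the per-emotion standard deviation can only shrink or stay the same.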

3.5. Evaluation of Emotional Accuracy

To evaluate how well the different modalities in the STEM-E2VA database support emotion recognition, this section designs objective evaluation experiments based on benchmark classification models. The IS09 feature set (384-dimensional features and spectrogram features) [45] for acoustic and glottal signals is extracted using the openSMILE toolbox (audEERING GmbH, Gilching, Germany), along with 28-dimensional kinematic features comprising displacement and velocity of tongue and lip articulation. Validation was performed using Support Vector Machines (SVMs), Deep Neural Networks (DNNs), and Convolutional Neural Networks (CNNs), with 80% of the samples used for training and 20% for testing.
The experimental results are detailed in Table 9. Overall, the acoustic modality achieved the best recognition performance, reaching 85.28% with the DNN model, followed by the glottal signal modality. Although the recognition rate for the articulatory kinematic modality was lower, the results indicate that this modality still carries discriminative emotional information. This suggests that the articulatory kinematic modality holds independent supplementary value within multimodal fusion scenarios.
The confusion matrix is drawn in Figure 7. By comparing Figure 7b,c, it can be observed that there is confusion between happiness and pleasure, with happiness being more frequently identified as pleasure, and some instances of pleasure being identified as happiness. This pattern is empirically supported by their close proximity in the PAD space. As summarized in Table 8, the mean PAD vectors for happiness (5.71, 5.77, 5.68) and pleasure (5.26, 5.42, 5.33) differ by only ΔP = 0.45, ΔA = 0.35, and ΔD = 0.35, corresponding to a Euclidean distance of approximately 0.67 on the 1–7 scale. This close proximity suggests that the two categories occupy neighboring regions in the affective space, making their decision boundary inherently subtle.
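The quoted distance can be checked directly from the two mean PAD vectors. The snippet below is only an arithmetic check using the Table 8 values cited in the paragraph above.

```python
import math

# Mean (P, A, D) vectors as quoted from Table 8
happiness = (5.71, 5.77, 5.68)
pleasure = (5.26, 5.42, 5.33)

# Euclidean distance: sqrt(0.45^2 + 0.35^2 + 0.35^2) = sqrt(0.4475)
distance = math.dist(happiness, pleasure)
print(round(distance, 2))  # 0.67
```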
From Figure 7a–c,g–i, we observe notable confusion between indifference and sadness. This is consistent with their PAD annotations. As given in Table 8, the mean PAD vectors of indifference (2.81, 3.95, 4.14) and sadness (2.38, 2.24, 2.45) both indicate relatively low pleasure and arousal compared with high-energy emotions (e.g., anger or happiness). Such close proximity in dimensional annotation suggests that these categories occupy neighboring regions in the affective space, which in turn affects the separability and increases the likelihood of confusion.
Overall, this database achieves a high recognition rate through its foundational features and recognition network, demonstrating superior speech quality and clear differentiation between emotions. It provides strong support for affect recognition research.

4. A Multimodal Contrast Fusion Model Integrating PAD Perception

4.1. Overview of Research Tasks and Models

This study tackles the challenges of intermodal alignment and ambiguous boundaries between similar emotions in multimodal affect recognition. It proposes a multimodal contrastive learning model that integrates the PAD perception mechanism, aiming to unify discrete emotion classification and dimensional emotion perception through structural optimization, thereby improving recognition performance and generalization capabilities in complex environments.
As shown in Figure 8, the overall framework consists of three parts: multimodal feature encoding, PAD-guided hybrid contrastive learning optimization, and interactive attention fusion. Specifically, the model receives four parallel inputs: the acoustic speech signal ($u_a$), its augmented sample constructed through circular shifts ($u_a'$), the glottal signal ($u_g$), and the sequence of articulatory kinematics ($u_e$). In the feature encoding stage, the acoustic and glottal branches use Wav2Vec 2.0 XLSR as the backbone network; to balance the stability of the pre-trained representations against downstream adaptation, we freeze the feature encoder (7-layer CNN, 512 channels) and fine-tune the context encoder (Transformer layers). The outputs of these two branches are then aggregated through mean pooling to form fixed-length 768-dimensional feature vectors. The articulatory kinematics branch uses a GRU–Transformer to model temporal information, followed by a linear projection layer that maps the features to a unified dimension $d = 768$ consistent with the acoustic modality. This dimension alignment ensures that features from different sources can subsequently be fused and contrasted in a unified metric space.
In the first stage of training, hybrid contrastive learning optimization is introduced, integrating intra-modal and inter-modal contrastive losses and embedding the three-dimensional PAD labels as perceptual guidance. By dynamically adjusting the sample weights based on the PAD spatial distance $d$, the fuzzy boundaries between similar emotions are modeled effectively, and the discriminability and continuity of the embedding space are enhanced. In the second stage, the projection head is removed, only the optimized encoder is retained, and its output is sent to the multimodal fusion module. This module employs interactive attention and layer normalization, dynamically fusing the acoustic features with the phonatory and kinematic information as references. Finally, the fused representations are input to the MLP, which simultaneously outputs a classification probability vector over the six discrete emotions and regression results for the three PAD dimensions (pleasure, activation, dominance), achieving joint prediction.
In summary, the framework achieves unified multimodal semantic alignment and affective continuity modeling through feature encoding, contrastive optimization, and fusion output. While enhancing classification accuracy, it also strengthens the prediction of continuous dimensions such as pleasantness and arousal, thereby establishing a reliable basis for multimodal affective analysis.

4.2. Feature Extraction

To achieve effective modeling of multimodal affective information, this paper divides the feature extraction process into two stages: contrastive learning training for the acoustic–glottal modality and feature modeling for the articulatory kinematic modality.

4.2.1. Acoustic and Glottal Modal Characterization Learning Based on Wav2Vec2.0

This paper employs a dual-modal contrastive learning framework using acoustic speech signals and glottal signals as inputs. The acoustic modality encoder adapts a pre-trained wav2vec 2.0 model, partially fine-tuned for the contrastive learning task to better adapt to the feature distribution of the current task while preserving its original modeling capabilities. To enhance the model’s robustness and generalization ability in complex environments, a data augmentation strategy involving cyclic shifts is applied to the acoustic modality. For the glottal modality, an encoder based on wav2vec 2.0 is similarly employed, but an independent encoder instance is used to model the temporal characteristics of the glottal signal. Acoustic modality encoders share parameters between the augmentation and original branches to strengthen the model’s consistent modeling capability against environmental perturbations. Within the contrastive representation space, a lightweight multi-layer perceptron (MLP) serves as the projection head, mapping encoder outputs into representation vectors suitable for contrastive learning. Average pooling is uniformly applied as the feature aggregation strategy to transform encoder outputs into projection head inputs.

4.2.2. Modeling Articulation Kinematic Features Using GRU and Transformer Architectures

In the articulatory kinematics modality modeling phase, the model focuses on continuous articulatory motion trajectory data captured by sensors or image processing modules. This modality exhibits temporal dynamics, serving as a key feature source that reflects the physical process of articulation. To extract temporal structural information, a sequential modeling architecture consisting of a Gated Recurrent Unit (GRU) and Transformer is employed. First, the GRU module is used to model local temporal dependencies, capturing short-term dynamics in articulatory movements. Then, the Transformer, through its multi-head attention mechanism, captures long-range contextual dependencies to enhance the model’s ability to capture global semantics and rhythmic patterns of motion. This design enables deep temporal modeling of the kinematic modality, improving the system’s ability to model the complex mapping between articulatory actions and emotions, thus providing a stable and semantically rich foundation for multimodal speech understanding.

4.3. Multimodal Attention Fusion Mechanism

In multimodal emotion recognition tasks, the contribution of different modalities to emotional states varies. Traditional concatenation or weighted fusion methods struggle to dynamically adjust the interaction between modalities, resulting in the attenuation of important modal features and the underutilization of weaker modality information. To address this, this study proposes a multimodal fusion strategy based on the cross-attention mechanism. By guiding the interaction between different modalities, the model can dynamically focus on key modality segments based on task requirements, enabling deep semantic-level fusion of modalities. This approach effectively enhances the complementarity and discriminative power of multimodal features.
In this paper, the extracted acoustic speech features ($f_a$), glottal features ($f_g$), and articulatory kinematic features ($f_e$) are utilized. As illustrated in Figure 9, an interactive attention mechanism operates on these feature vectors, each possessing a dimensionality of 768. We employ a multi-head attention configuration with $h = 8$ parallel heads to capture diverse semantic interactions. In this cross-modal module, the acoustic features ($f_a$) are projected to serve as keys ($K$) and values ($V$), while the glottal ($f_g$) and articulatory ($f_e$) features act as queries ($Q$). After attention weighting and max-pooling, the features are concatenated to form a unified 2304-dimensional representation (3 × 768), which is then fed into the MLP for downstream tasks.
In the multimodal fusion stage, this paper adopts a mechanism similar to the multi-head self-attention mechanism in Transformers to effectively prevent overfitting. During fusion, multiple independent attention heads perform the scaled dot-product operation: each head computes the similarity between the query $Q$ and the key $K$ in its own subspace to obtain its attention distribution. These attention distributions are then used to weight the value $V$ and extract the features shared between modalities. The modal pairs ($f_a$, $f_g$) and ($f_a$, $f_e$) are fused with the glottal and articulatory modalities as queries $Q$ and the acoustic speech modality as keys $K$ and values $V$. Taking the articulatory features $f_e$ and the acoustic features $f_a$ as an example, single-head attention is calculated as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_K}}\right)V \tag{4}$$
In Equation (4), $d_K$ is the dimension of the key matrix $K$. Dividing the dot products by $\sqrt{d_K}$ prevents them from growing too large; large values would push the Softmax function into a region with very small gradients. The multi-head calculation is then:
$$\mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V}) \tag{5}$$
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_h) \tag{6}$$
In Equations (5) and (6), $i = 1, \ldots, h$ indexes the $h$ attention heads, and $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$ are the parameter matrices. The fused multimodal features are then obtained through a residual connection and layer normalization. Because these features still contain noise and redundant information, max pooling is used to highlight the key features while reducing computation, yielding the outputs $f_e$, $f_a$, $f_g$. Finally, the feature vectors of the three modalities are concatenated to obtain the final multimodal fusion feature $f_{aeg}$:
$$f_{aeg} = \mathrm{Concat}(f_e, f_a, f_g) \tag{7}$$
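The fusion path of Equations (4) to (7) can be sketched in NumPy as below. This is an illustrative sketch, not the authors’ implementation: the inputs are assumed to be frame sequences of shape (T, 768), the projection matrices are random stand-ins for learned weights, and pooling the acoustic branch directly is an assumption, since the text does not specify how the acoustic output is formed.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # stabilise the exponentials
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_cross_attention(query, context, w, num_heads=8):
    """query: (Tq, d) glottal or kinematic frames; context: (Tk, d) acoustic frames.
    w holds 'Wq', 'Wk', 'Wv' of shape (d, d); column blocks act as per-head matrices."""
    d_head = query.shape[-1] // num_heads
    q, k, v = query @ w["Wq"], context @ w["Wk"], context @ w["Wv"]
    heads = []
    for h in range(num_heads):
        cols = slice(h * d_head, (h + 1) * d_head)
        scores = q[:, cols] @ k[:, cols].T / np.sqrt(d_head)  # Eq. (4)
        heads.append(softmax(scores) @ v[:, cols])             # Eq. (5)
    return np.concatenate(heads, axis=-1)                      # Eq. (6)

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def fuse_branch(query, context, w):
    """Cross-attention, residual connection, layer norm, then max pooling over time."""
    attended = multi_head_cross_attention(query, context, w)
    return layer_norm(query + attended).max(axis=0)

rng = np.random.default_rng(0)
d = 768

def random_weights():
    # Random stand-ins for learned projection matrices
    return {name: rng.normal(scale=d ** -0.5, size=(d, d)) for name in ("Wq", "Wk", "Wv")}

f_a = rng.normal(size=(50, d))  # acoustic frames, used as keys and values
f_g = rng.normal(size=(50, d))  # glottal frames, used as queries
f_e = rng.normal(size=(40, d))  # kinematic frames, used as queries

out_g = fuse_branch(f_g, f_a, random_weights())
out_e = fuse_branch(f_e, f_a, random_weights())
out_a = f_a.max(axis=0)                         # assumption: acoustic branch pooled directly
f_aeg = np.concatenate([out_e, out_a, out_g])   # Eq. (7)
print(f_aeg.shape)  # (2304,)
```

Slicing the projected matrices into column blocks is equivalent to giving each head its own $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$, and the query and key sequences may have different lengths because the attention matrix has shape (Tq, Tk).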

4.4. Multi-Scale Contrast Loss Function Integrating PAD Perception

To explore the deep relationships between modalities and across emotion categories, and to address the problem of “false positive/negative samples” among similar emotion pairs in traditional contrastive learning, this paper introduces a multimodal hybrid contrastive loss with PAD-dimension perception (MMCL-PAD). Within the intra-modal and inter-modal contrastive learning framework, this method dynamically adjusts the temperature coefficient based on PAD spatial distances, achieving differentiated treatment of difficult samples. The overall loss consists of three components: intra-modal contrastive learning (IACL), inter-modal contrastive learning (IECL) and PAD-based dynamic constraints.

4.4.1. Intra-Modal Contrastive Learning Loss

In emotion recognition research, Intra-modal Contrastive Learning (IACL) learns the dynamic characteristics of a single modality and the relationships among different emotions in a supervised manner. This approach increases the discriminative power of the feature space. The intra-modal contrastive loss is defined as follows:
$$L_{\mathrm{IACL}}^{m} = -\log \frac{\sum_{i=1}^{M} \exp(z^{m} \cdot p_i^{m} / \tau)}{\sum_{i=1}^{M} \exp(z^{m} \cdot p_i^{m} / \tau) + \sum_{j=1}^{N} \exp(z^{m} \cdot n_j^{m} / \tau)} \tag{8}$$
In Equation (8), $m \in \{(a, a'), g\}$ denotes the acoustic modality $a$, its augmented version $a'$, and the glottal modality $g$; $z$, $p$, $n$ are the projection-head outputs of the acoustic features $f_a$, $f_{a'}$ and the glottal feature $f_g$, serving as the anchor, the positive samples and the negative samples, respectively; and $\cdot$ denotes the dot product of two vectors (measuring the similarity between two samples). Before the dot product, the representations are L2-normalized so that each pairwise similarity is a cosine similarity bounded in $[-1, 1]$. $\tau$ is the temperature coefficient. Thus the intra-modal contrastive loss in this paper can be expressed as:
$$L_{\mathrm{intra}} = L_{\mathrm{IACL}}^{a,a'} + L_{\mathrm{IACL}}^{g} \tag{9}$$
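The form of Equation (8) can be made concrete with a small NumPy sketch for a single anchor. The toy vectors and the function names are hypothetical; the default temperature of 0.07 is the value reported in Section 4.4.3.

```python
import numpy as np

def l2_normalize(x):
    """Scale each vector to unit length so that dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def contrastive_loss(anchor, positives, negatives, tau=0.07):
    """-log( sum_pos / (sum_pos + sum_neg) ), following the form of Eq. (8)."""
    z = l2_normalize(anchor)
    pos = np.exp(l2_normalize(positives) @ z / tau).sum()
    neg = np.exp(l2_normalize(negatives) @ z / tau).sum()
    return -np.log(pos / (pos + neg))

anchor = np.array([1.0, 0.0])
same_emotion = np.array([[0.9, 0.1], [1.0, 0.2]])    # close to the anchor
other_emotion = np.array([[-1.0, 0.0], [0.0, 1.0]])  # far from the anchor

well_separated = contrastive_loss(anchor, same_emotion, other_emotion)
confused = contrastive_loss(anchor, other_emotion, same_emotion)
print(well_separated < confused)  # True
```

The loss approaches zero when the positives dominate the denominator and grows when negatives are more similar to the anchor than the positives are, which is the behaviour the intra-modal term is meant to penalize.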

4.4.2. Inter-Modal Contrast Learning Loss

To overcome the limitations of single-modal models in cross-modal scenarios, inter-modal contrastive learning (IECL) explores the correlations and complementarities between different modalities and promotes modal fusion through contrastive learning. Unlike IACL, IECL defines positive sample pairs as samples from two different modalities with the same emotional label, and negative sample pairs as cross-modal samples with different emotional labels. Applying the same procedure gives the IECL loss:
$$L_{\mathrm{IECL}}^{a,g} = -\log \frac{\sum_{i=1}^{2M} \exp(z^{a} \cdot p_i^{g} / \tau)}{\sum_{i=1}^{2M} \exp(z^{a} \cdot p_i^{g} / \tau) + \sum_{j=1}^{2N} \exp(z^{a} \cdot n_j^{g} / \tau)} \tag{10}$$
The symbols in Equation (10) have the same meaning as in IACL, so the inter-modal contrastive loss in this paper can be expressed as:
$$L_{\mathrm{inter}} = L_{\mathrm{IECL}}^{a,g} + L_{\mathrm{IECL}}^{g,a} \tag{11}$$
The final multimodal mixed contrast loss (MMCL) consists of the IACL and IECL together as follows:
$$L_{\mathrm{MMCL}} = \lambda_1 L_{\mathrm{IACL}}^{a,a'} + \lambda_2 L_{\mathrm{IACL}}^{g} + \lambda_3 \left(L_{\mathrm{IECL}}^{a,g} + L_{\mathrm{IECL}}^{g,a}\right) \tag{12}$$

4.4.3. Introduction of Contrast Loss for PADs

There are similar pairs of emotions such as happiness and pleasure, and sadness and grief in the database, which may lead to misjudgments in contrastive learning and thus affect the classification performance of the model. Wang F. et al. [46] pointed out that the temperature coefficient in the InfoNCE loss determines the degree of attention the model pays to difficult negative samples: the smaller the temperature coefficient, the stronger the gradient penalty applied to highly similar negative samples, which is more conducive to improving the discriminative power of the representation. This paper first adopts an automated parameter optimization algorithm to determine the optimal temperature coefficient as τ = 0.07 and applies it to subsequent experiments.
However, a fixed temperature coefficient is difficult to adaptively adjust according to the similarity between samples and cannot simultaneously take into account highly similar and significantly different pairs of emotional samples. We found that the distance of similar emotions in the PAD space is closer, while the distance of significantly different emotions is farther. Therefore, this paper introduces PAD into contrastive learning and dynamically allocates the temperature coefficient for each pair of positive and negative samples based on the PAD space distance between the sample pairs to achieve real-time adjustment of the penalty intensity in a one-to-one manner. The formula for calculating the PAD space distance is as follows:
$$d = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2 + (z_i - z_j)^2} \tag{13}$$
In Equation (13), $x$, $y$, $z$ represent the values of P, A and D in the dimensional space. Within a training batch of size $K$, a $K \times K$ PAD distance matrix is formed over all sample pairs, whose diagonal elements are 0 because each sample has zero distance to itself. By replacing the temperature coefficient $\tau$ in the original intra-modal contrastive loss IACL with the PAD spatial distance $d$, the improved intra-modal contrastive loss is expressed as:
$$L_{\mathrm{IACL\text{-}PAD}}^{m} = -\log \frac{\sum_{i=1}^{M} \exp(z^{m} \cdot p_i^{m} / d)}{\sum_{i=1}^{M} \exp(z^{m} \cdot p_i^{m} / d) + \sum_{j=1}^{N} \exp(z^{m} \cdot n_j^{m} / d)} \tag{14}$$
Similarly, the contrastive loss of IECL undergoes the same substitution. Finally, the multimodal hybrid contrastive learning loss with PAD (MMCL-PAD) comprises the improved IACL and IECL together, denoted as:
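$$L_{\mathrm{MMCL\text{-}PAD}} = \lambda_1 L_{\mathrm{IACL\text{-}PAD}}^{a,a'} + \lambda_2 L_{\mathrm{IACL\text{-}PAD}}^{g} + \lambda_3 \left(L_{\mathrm{IECL\text{-}PAD}}^{a,g} + L_{\mathrm{IECL\text{-}PAD}}^{g,a}\right) \tag{15}$$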
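The PAD-guided substitution can be sketched as follows: the batch-level distance matrix of Equation (13) supplies one temperature per sample pair. This is one possible reading of the description, not the authors’ code. The floor `eps` is an assumption introduced here to avoid division by zero when two samples share identical PAD labels, and the embeddings, labels and PAD values are toy data.

```python
import numpy as np

def pad_distance_matrix(pad):
    """pad: (K, 3) array of (P, A, D) labels -> (K, K) Euclidean distances, Eq. (13)."""
    diff = pad[:, None, :] - pad[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def pad_contrastive_loss(z, labels, pad, anchor, eps=0.05):
    """Contrastive loss for one anchor, with each pair's temperature replaced by
    the PAD distance between the anchor and that sample (floored at eps, an assumption)."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    temps = np.maximum(pad_distance_matrix(pad)[anchor], eps)
    sims = np.exp(z @ z[anchor] / temps)
    others = np.arange(len(z)) != anchor  # exclude the anchor itself
    pos = sims[others & (labels == labels[anchor])].sum()
    neg = sims[others & (labels != labels[anchor])].sum()
    return -np.log(pos / (pos + neg))

# Toy batch of K = 4 samples: embeddings, discrete labels, made-up PAD annotations
z = np.array([[1.0, 0.1], [0.9, 0.2], [-1.0, 0.3], [0.1, 1.0]])
labels = np.array([0, 0, 1, 1])
pad = np.array([[6.0, 6.0, 5.5], [5.5, 5.0, 5.0], [2.0, 2.5, 2.5], [2.5, 3.0, 3.0]])

D = pad_distance_matrix(pad)
loss = pad_contrastive_loss(z, labels, pad, anchor=0)
print(D.shape, float(loss) > 0)  # (4, 4) True
```

Pairs that are close in PAD space receive a small temperature and therefore a sharper penalty, while distant pairs are treated more softly, which matches the role of the temperature described by Wang F. et al. [46].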

5. Experiments

To verify the effectiveness and advanced nature of the proposed PAD-guided multimodal contrastive fusion framework, this section conducts extensive evaluations on both discrete emotion classification and continuous dimensional prediction tasks. Through experimental analysis, this section verified the roles of each key component, including data processing, contrastive learning strategies, modal selection, and fusion architecture. It explored the contribution of each module to the overall model performance from multiple perspectives, thereby demonstrating the overall superiority of this framework in complex emotion recognition scenarios.

5.1. Experimental Setup

In the experiments of this paper, we used two datasets: the publicly available IEMOCAP [25] dataset and STEM-E2VA. IEMOCAP is an audio-visual emotional database recorded by the University of Southern California, containing approximately 12 h of audio–visual data. It was performed by 10 professional actors depicting various emotional states in scripted or improvised dialogs. Its discrete emotion categories include happiness, anger, neutrality, sadness, surprise, fear, disgust, frustration and excitement, and it also provides dimensional labels based on the PAD (pleasure–arousal–dominance) three-dimensional emotion model. For this experiment, we selected the happy, angry, neutral and sad categories, totaling 4490 samples. Since IEMOCAP does not contain glottal and articulatory kinematic data, this paper uses it only for some of the contrastive learning experiments.
To comprehensively evaluate the performance of the model on the different tasks, we adopt separate evaluation metrics for discrete emotion recognition and dimensional emotion prediction: classification accuracy is used for discrete emotion recognition, and the root mean square error (RMSE) is used for dimensional emotion prediction.
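For reference, the two metrics can be written out in a few lines. The labels and PAD values below are toy inputs, not results from the paper.

```python
import math

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class equals the true class."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def rmse(y_true, y_pred):
    """Root mean square error between true and predicted dimension values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

acc = accuracy([0, 1, 2, 2], [0, 1, 1, 2])      # 3 of 4 correct
err = rmse([5.0, 3.0, 4.0], [4.0, 3.0, 6.0])    # sqrt((1 + 0 + 4) / 3)
print(acc, round(err, 3))  # 0.75 1.291
```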
In order to systematically evaluate the effectiveness of the proposed framework and its core components, we designed a comprehensive set of experiments. First, to establish a reliable data foundation, we verified the effectiveness of the PAD outlier processing strategy. Second, we conducted a series of ablation studies to evaluate our core methodological designs. This involved analyzing the specific contributions of the multi-scale contrastive learning strategy and the embedded PAD perception mechanism, followed by a systematic quantification of the impact of different physiological modality combinations (acoustics, glottal, and articulatory kinematics). Finally, to demonstrate the superiority and generalization ability of our architecture, we designed multi-dimensional comparative experiments. We initially compared the proposed interactive attention mechanism against the traditional concatenation-based feature fusion method, and subsequently benchmarked the complete model against current mainstream emotion recognition frameworks to verify its overall robustness and advancement. All models were trained and evaluated on the constructed STEM-E2VA database and the publicly available IEMOCAP multimodal emotion database, maintaining uniform partitioning strategies to ensure the objectivity and consistency of the comparison results. The detailed hyperparameter configurations for each training stage are summarized in Table 10.

5.2. Effectiveness of PAD Outlier Processing

To quantify the necessity of the PAD abnormal value handling protocol described in Section 3.4, we conducted a comparative ablation study. We trained the proposed framework using both the unprocessed PAD labels and the processed labels, while keeping all other settings (including data splitting, model architecture, and hyperparameters) exactly the same. As reported in Table 11, using raw labels leads to lower classification accuracy and higher RMSE on the PAD regression tasks. We attribute this degradation to annotation outliers that perturb the PAD–distance relationships used in the MMCL-PAD objective, which in turn yields unstable or biased contrastive gradients during optimization. Overall, these results indicate that PAD outlier processing is an important component for improving training stability and enhancing the discriminative power of the proposed method.

5.3. Ablation Experiment

To assess the contribution of key components to the performance of emotion recognition, this section conducts a series of ablation experiments from two dimensions: contrastive learning strategies and modal combinations. Through gradually stripping away model components, the experiments quantified the specific roles of different physiological modalities in constructing the emotional feature space, thereby deeply verifying the necessity of each design stage and the effectiveness of the multi-scale contrastive learning strategy.

5.3.1. Comparison of Different Contrastive Learning Strategies

To explore whether the introduction of multi-scale contrast mechanisms within and between modalities has a positive effect on model performance, we compared the single-modal contrastive loss $L_{\mathrm{IACL}}^{a,a'}$ with the hybrid multimodal contrastive loss $L_{\mathrm{MMCL}}$. The experimental results are shown in Table 12.
It can be seen from Table 12 that, on the STEM-E2VA dataset, the hybrid intra-modal and inter-modal contrastive learning designed in this paper yields a substantial improvement over purely intra-modal learning. It increases the recognition accuracy by 2.11% in discrete emotion recognition, indicating that this design helps the model mine deeper emotion features and effectively fuse information within modalities, across modalities, and among classes. However, in dimensional prediction, the hybrid contrastive loss does not enable the model to mine more dimensional features. The main reason is that the pre-trained model adopted in this paper was trained on the discrete emotion recognition task, so it contains relatively little dimensional emotion information.
To further explore the impact of introducing the PAD dimension labels as a supervisory signal on model performance, we integrated them into the contrastive learning framework and compared model performance before and after the introduction (from $L_{IACL}^{a,a}$ to $L_{IACL\text{-}PAD}^{a,a}$, and from $L_{MMCL}$ to $L_{MMCL\text{-}PAD}$). The following experimental results and analysis demonstrate the crucial role that the PAD perception mechanism plays in enhancing representational ability and discrimination.
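To make the idea of a PAD-aware contrastive objective concrete, the following Python sketch shows a supervised contrastive loss whose temperature depends on the PAD distance between two samples. It is only an illustration under stated assumptions: the rule `tau = tau0 * (1 + alpha * d_PAD)`, the function names, and the toy data are hypothetical and are not the MMCL-PAD formulation defined earlier in the paper.

```python
import math

def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def pad_distance(p, q):
    """Euclidean distance between two PAD annotation vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def pad_aware_supcon(embeddings, labels, pads, tau0=0.1, alpha=1.0):
    """Supervised contrastive loss with a pairwise temperature.

    ASSUMPTION: tau_ij = tau0 * (1 + alpha * PAD distance) is a hypothetical
    rule chosen for illustration, not the paper's definition.
    """
    z = [normalize(e) for e in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without same-class samples contributes nothing
        logits = {}
        for k in range(n):
            if k == i:
                continue
            similarity = sum(a * b for a, b in zip(z[i], z[k]))
            tau = tau0 * (1.0 + alpha * pad_distance(pads[i], pads[k]))
            logits[k] = similarity / tau
        # log-sum-exp over all other samples, computed stably
        peak = max(logits.values())
        log_denom = peak + math.log(sum(math.exp(v - peak) for v in logits.values()))
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0

# Toy data: two classes with distinct embeddings and PAD annotations.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_pads = [[0.8, 0.6, 0.5], [0.7, 0.5, 0.5], [-0.6, 0.4, -0.2], [-0.5, 0.3, -0.2]]
loss_matched = pad_aware_supcon(toy_embeddings, [0, 0, 1, 1], toy_pads)
loss_shuffled = pad_aware_supcon(toy_embeddings, [0, 1, 0, 1], toy_pads)
print(loss_matched < loss_shuffled)
```

Setting `alpha=0` recovers an ordinary supervised contrastive loss with a fixed temperature, which makes the effect of the PAD term easy to isolate in an ablation.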
As can be seen from Table 13, on the IEMOCAP dataset, adding PAD dimension labels on top of the supervised contrastive loss with discrete labels increased the classification accuracy from 73.63% to 74.85%, an increase of 1.22%. Among the three PAD dimensions, the improvement was largest for arousal (A), with RMSE reduced by 0.04, while the improvement for dominance (D) was modest. Our analysis attributes this to the fact that the PAD annotation of the IEMOCAP dataset was carried out by three annotators: most of the annotated PAD values overlap or are similar, and therefore cannot effectively distinguish the differences between samples. Consequently, the contrastive loss incorporating PAD learns limited dimensional sentiment information.
On the STEM-E2VA dataset, moving from $L_{IACL}^{a,a}$ to $L_{IACL\text{-}PAD}^{a,a}$ within a single modality improved the discrete sentiment recognition rate by 1.89%. Moving from $L_{MMCL}$ to $L_{MMCL\text{-}PAD}$ for the hybrid multimodal losses increased the discrete sentiment recognition rate by 2.87%, with $L_{MMCL\text{-}PAD}$ achieving a recognition rate of 81.23%.
Compared with the IEMOCAP dataset, the model demonstrated more noticeable dimensional emotion modeling capabilities on the self-constructed STEM-E2VA dataset, particularly through a substantial decline in the RMSE of the pleasure (P) and arousal (A) dimensions. Taking the final improved scheme $L_{MMCL\text{-}PAD}$ as an example, the pleasure and arousal RMSE of the model on IEMOCAP were 0.202 and 0.213, respectively, while on STEM-E2VA they were 0.152 and 0.145, respectively, which is lower by 0.050 and 0.068. This indicates that the model can more fully mine the continuous variation features between dimensional emotions on the high-quality labeled STEM-E2VA dataset, thereby achieving better performance at the regression level. However, the model’s performance on dominance (D) is poor across both datasets, as it fails to adequately extract dominance-related information. This represents an area requiring further research in subsequent studies.
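The differences quoted above follow directly from the reported RMSE values. A minimal Python sketch of the metric and of that arithmetic is given below; the `rmse` helper and the variable names are illustrative and are not taken from the authors' code.

```python
import math

def rmse(predictions, targets):
    """Root-mean-square error between two equal-length sequences."""
    if len(predictions) != len(targets):
        raise ValueError("sequences must have the same length")
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared) / len(squared))

# RMSE values reported in the text for the MMCL-PAD loss.
iemocap = {"P": 0.202, "A": 0.213}
stem_e2va = {"P": 0.152, "A": 0.145}

# Absolute RMSE difference per dimension (IEMOCAP minus STEM-E2VA).
difference = {dim: round(iemocap[dim] - stem_e2va[dim], 3) for dim in iemocap}
print(difference)
```

Because the two datasets differ in language, annotation protocol, and label distribution, the difference is best read as a descriptive comparison rather than as a controlled effect size.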
To visualize the impact of loss strategies, we applied t-SNE to the STEM-E2VA dataset. As shown in Figure 10a, the baseline $L_{IACL}^{a,a}$ yields a fragmented distribution with noticeable overlap between similar emotions, highlighting the limitations of discrete supervisory signals. As illustrated in Figure 10b, introducing PAD constraints initiates clustering by capturing continuous emotional characteristics, while Figure 10c demonstrates that the multimodal hybrid loss further enhances inter-class separation. Ultimately, as depicted in Figure 10d, the proposed $L_{MMCL\text{-}PAD}$ achieves the optimal structure, with distinct boundaries even for fine-grained categories like “Happy” and “Pleasure”. This evolution from disorder to compact clustering confirms that $L_{MMCL\text{-}PAD}$ effectively handles fine-grained discrimination, consistent with the peak accuracy of 81.23% reported in Table 13.

5.3.2. Systematic Ablation of Modality Combinations

Considering that the articulatory kinematics and the glottal modalities carry different levels of emotional physiological information, this study formulates hypotheses regarding their roles in emotional representation. Both articulatory kinematic features and glottal measurements are associated with arousal (A): the former provide the amplitude, speed, and temporal organization of articulatory movements, which help to distinguish between high and low arousal levels as well as the “strong–weak” intensity differences in discrete emotion recognition, while the latter capture the glottal contact dynamics related to changes in laryngeal tension and excitation patterns, which are associated with different arousal levels. To verify these hypotheses and quantify the specific contributions of each physiological modality in emotional representation, this paper conducts modality ablation experiments on the STEM-E2VA dataset. Specifically, we take single-modal acoustics as the baseline and gradually integrate glottal and articulatory kinematic signals, while excluding the video modality, in order to focus strictly on the physiological mechanism of speech production and avoid confounding factors introduced by visual feature extractors.
As summarized in Table 14, adding either physiological modality improves performance over the acoustic-only baseline (78.14%). Specifically, after incorporating the glottal signal, the accuracy of discrete emotion recognition increased to 82.12%, and the prediction errors of the PAD dimensions were also reduced. Notably, a substantial error reduction occurs in the A dimension, where the RMSE decreases by approximately 25.31% in relative terms. This improvement indicates that glottal contact dynamics act as an important physiological proxy for arousal levels, capturing the changes in laryngeal tension and excitation modes that acoustics alone struggle to resolve.
Similarly, the introduction of articulatory kinematics also contributed to an improvement in accuracy, reaching 82.51%, and the RMSE in the A dimension decreased by 16.67%. The boost in discrete emotion accuracy indicates that the amplitude, speed, and temporal organization of articulatory movements provide critical evidence of articulatory effort. This aids the model in discriminating the “strong–weak” intensity differences among discrete emotional states, while also offering essential kinetic cues for distinguishing high and low arousal states.
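The relative reductions quoted for the A dimension are assumed here to follow the usual definition, 100 × (baseline − new) / baseline. The short Python sketch below illustrates that formula; the function name is hypothetical, and the input values are invented for illustration only and are not entries of Table 14.

```python
def relative_reduction(baseline_rmse, new_rmse):
    """Relative error reduction in percent: 100 * (baseline - new) / baseline."""
    if baseline_rmse <= 0:
        raise ValueError("baseline RMSE must be positive")
    return 100.0 * (baseline_rmse - new_rmse) / baseline_rmse

# Hypothetical RMSE values chosen only to illustrate the formula;
# they are NOT taken from Table 14.
example = relative_reduction(0.200, 0.150)
print(f"{example:.2f}%")
```

Reporting the relative rather than the absolute change makes reductions comparable across PAD dimensions whose baseline errors differ in magnitude.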
Furthermore, fusing all three modalities (acoustics + articulatory kinematics + glottal) achieves the best overall performance (85.47% accuracy) and the lowest RMSE across the PAD space. This synergistic reduction demonstrates that the vocal tract (articulatory kinematics) and the vocal source (glottal signals) provide mutually complementary physiological cues for affective representation.
Finally, it is worth noting that the inclusion of the glottal and articulatory kinematics modalities reduces the RMSE not only for the arousal (A) dimension but also for the pleasure (P) and dominance (D) dimensions to varying degrees. This observation suggests that these physiological streams likely encapsulate additional affective cues beyond mere physical activation or intensity. Although the specific underlying physiological mechanisms mapping these modalities to pleasure and dominance are not yet fully understood, this finding provides a promising avenue for future investigation into the broader emotional correlates of speech production.

5.4. Comparative Experiment

Following the ablation studies, this section evaluates the comprehensive performance of the proposed framework through multi-dimensional comparative experiments. We first compared the differences between the interactive fusion and the traditional feature concatenation strategies; subsequently, under the dual tasks of discrete sentiment classification and continuous PAD prediction, we systematically compared this model with the mainstream single-modal and multimodal benchmark methods to comprehensively verify its advancement and robustness.

5.4.1. Comparison of Multimodal Fusion Methods

In the multimodal contrastive learning stage, the trained acoustic feature $f_a$ and glottal feature $f_g$ are simply concatenated and used for training downstream tasks, without eliminating the redundant information between modalities. Therefore, after contrastive learning, the general contrastive fused features of acoustics and glottal signals are interactively fused with the articulatory kinematic features through the attention mechanism. This approach is then compared with simple concatenation to determine whether it improves the model’s representational capability across different tasks. The experimental results are shown in Table 15.
As can be seen from Table 15, in the emotion classification task, fusing the contrastive fused features with the articulatory kinematic features through the interactive attention mechanism yields an accuracy of 85.47%, which is 2.94% higher than concatenating the same features. Relative to the recognition rate of 81.23% in the contrastive training stage, concatenating the articulatory kinematic features raised the recognition rate to 82.53%. This indicates that the features obtained after contrastive training contain rich emotional information, and that the attention mechanism can weight the feature information relevant to different tasks. In the dimension prediction task, the RMSE values for the three PAD dimensions decreased by 0.058, 0.06 and 0.046, respectively. Among them, the prediction performance on arousal was the best, with RMSE dropping to 0.074, while the RMSE for pleasure reached 0.082. After the interactive fusion of features from the three modalities, the model is thus again able to extract dimensional emotion information.
To evaluate feature space optimization, we performed t-SNE visualization on the STEM-E2VA dataset. As illustrated in Figure 11a, the conventional concatenation method results in blurred boundaries and substantial overlap between semantically similar emotions (e.g., ‘Happy’ and ‘Pleased’). In contrast, Figure 11b demonstrates that the interactive attention fusion yields compact clusters with clear decision boundaries. As detailed in Table 15, this structural improvement is consistent with the accuracy increase from 82.53% to 85.47%. By leveraging multi-head attention to filter noise and integrate complementary acoustic–glottal–kinematic cues, the proposed mechanism constructs a discriminative feature space capable of resolving fine-grained PAD differences.
To further examine the interpretability of the proposed interactive attention fusion mechanism, we visualize the cross-attention weight matrices between the acoustic modality and the two physiological modalities. As shown in Figure 12, acoustic tokens serve as keys/values, while glottal and articulatory kinematics tokens serve as queries (weights are post-Softmax and averaged over eight heads). The resulting heatmaps exhibit clear near-diagonal bands, where attention mass concentrates around temporally corresponding acoustic segments and their local neighborhoods, whereas off-diagonal regions receive substantially lower weights. This pattern suggests that the model captures near-synchronous temporal correspondence across heterogeneous modalities, using the acoustic stream as an anchor to dynamically align physiological cues despite differences in sampling characteristics. Compared with naive feature concatenation, such cross-attentive alignment mitigates temporal mismatch during fusion and yields temporally coherent multimodal representations, which is consistent with the improved performance reported in Table 15. For visualization only, we apply a power-law normalization (PowerNorm, γ = 0.35) to the colormap to enhance contrast among low-magnitude attention weights.
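The visualization pipeline described above can be sketched in plain Python as follows. The sketch assumes scaled dot-product attention with physiological tokens as queries and acoustic tokens as keys, and it uses random vectors as stand-ins for the learned per-head projections; the helper names, token counts, and head dimension are illustrative assumptions, not the authors' implementation. The `power_norm` helper is intended to mirror the mapping applied by matplotlib's `PowerNorm` when its limits are taken from the data.

```python
import math
import random

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(queries, keys):
    """Post-softmax scaled dot-product weights: rows = queries, columns = keys."""
    scale = math.sqrt(len(keys[0]))
    return [softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
            for q in queries]

def average_heads(per_head):
    """Element-wise mean of the per-head weight matrices."""
    heads = len(per_head)
    rows, cols = len(per_head[0]), len(per_head[0][0])
    return [[sum(h[r][c] for h in per_head) / heads for c in range(cols)]
            for r in range(rows)]

def power_norm(matrix, gamma=0.35):
    """Rescale to [0, 1], then apply x ** gamma (for display only)."""
    flat = [x for row in matrix for x in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0
    return [[((x - lo) / span) ** gamma for x in row] for row in matrix]

def random_tokens(count, dim):
    """Random stand-ins for projected tokens (hypothetical data)."""
    return [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]

random.seed(0)
NUM_HEADS, PHYS_TOKENS, ACOUSTIC_TOKENS, HEAD_DIM = 8, 4, 6, 5
per_head = [attention_weights(random_tokens(PHYS_TOKENS, HEAD_DIM),
                              random_tokens(ACOUSTIC_TOKENS, HEAD_DIM))
            for _ in range(NUM_HEADS)]
averaged = average_heads(per_head)   # each row still sums to 1
display = power_norm(averaged)       # values to hand to a heatmap colormap
```

With γ below 1, small weights are stretched toward the upper part of the color range, which is why faint off-diagonal structure becomes visible without changing the underlying attention values.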

5.4.2. Unimodal Comparison on the Public IEMOCAP Dataset

To evaluate the robustness of our PAD-guided representation learning mechanism, we compare our acoustic-only baseline with several representative acoustic emotion recognition models, including P-TAPT [47], NCDE Classifier [48], and Emotion2vec [49].
We note that the original implementations of these baselines are primarily designed for discrete emotion classification. To facilitate a fair comparison across both discrete classification and continuous dimensional regression settings, we reproduce these models and extend them with a unified linear regression head on top of the learned acoustic representations. The resulting models are then fine-tuned on the IEMOCAP dataset using the RMSE objective to predict continuous pleasure, arousal, and dominance (PAD) values.
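As a rough illustration of what such a regression extension involves, the Python sketch below attaches a linear head with three outputs (P, A, D) to fixed feature vectors and trains it by gradient descent on a per-dimension RMSE objective. The features and targets are synthetic stand-ins, and the function names, learning rate, and dimensions are assumptions made for this sketch rather than the settings used in the reported experiments.

```python
import math
import random

PAD_DIMS = 3  # pleasure, arousal, dominance

def predict(weights, bias, feature):
    """Linear head: one output per PAD dimension."""
    return [sum(w * x for w, x in zip(row, feature)) + b
            for row, b in zip(weights, bias)]

def rmse_per_dim(preds, targets):
    """RMSE computed separately for P, A and D."""
    n = len(preds)
    return [math.sqrt(sum((p[d] - t[d]) ** 2 for p, t in zip(preds, targets)) / n)
            for d in range(PAD_DIMS)]

def train_step(weights, bias, features, targets, lr=0.05):
    """One gradient step on the per-dimension RMSE; returns the loss before the step."""
    preds = [predict(weights, bias, f) for f in features]
    losses = rmse_per_dim(preds, targets)
    n = len(features)
    for d in range(PAD_DIMS):
        # d(RMSE)/d(w_j) = sum(err * x_j) / (n * RMSE)
        scale = 1.0 / (n * max(losses[d], 1e-8))
        errors = [p[d] - t[d] for p, t in zip(preds, targets)]
        for j in range(len(weights[d])):
            weights[d][j] -= lr * scale * sum(e * f[j] for e, f in zip(errors, features))
        bias[d] -= lr * scale * sum(errors)
    return losses

# Synthetic stand-ins for acoustic embeddings and PAD labels (not real data).
random.seed(0)
DIM, N = 4, 32
true_w = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(PAD_DIMS)]
features = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(N)]
targets = [predict(true_w, [0.0] * PAD_DIMS, f) for f in features]

weights = [[0.0] * DIM for _ in range(PAD_DIMS)]
bias = [0.0] * PAD_DIMS
history = [train_step(weights, bias, features, targets) for _ in range(200)]
print("initial RMSE (P, A, D):", [round(v, 3) for v in history[0]])
print("final RMSE (P, A, D):", [round(v, 3) for v in history[-1]])
```

Using the same head and objective for every baseline keeps the regression comparison focused on the quality of the learned acoustic representations rather than on differences in the output layer.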
The re-evaluated results are summarized in Table 16. Overall, our method achieves the highest classification accuracy (74.85%), slightly outperforming Emotion2vec (74.48%), and yields lower prediction errors on the pleasure (RMSE: 0.202) and arousal (RMSE: 0.213) dimensions compared with the considered baselines. These results suggest that incorporating PAD guidance can benefit both discrete emotion recognition and continuous affect prediction, potentially enabling more fine-grained modeling of affective variations.

5.4.3. Multimodal Comparative Experiments on the STEM-E2VA Dataset

We benchmarked our proposed PAD-aware multimodal contrastive fusion framework against representative baseline methods, specifically, Multimodal Transformer (MulT) [5], MISA [6], MTAMW [8] and HRG-SSA [7].
As these models were originally developed for text–audio–visual settings, we adapt their input projection layers to support our physiological modalities (acoustics, glottal, and kinematics) while keeping the projected feature dimensions consistent across methods. In addition, to enable evaluation on continuous affect dimensions, we report PAD prediction errors (RMSE) for these baselines by attaching a regression output to the fused representations, following the same tri-modal inputs and evaluation protocol used for our method.
The comparative results are summarized in Table 17. Overall, our approach achieves higher discrete classification accuracy than the considered fusion baselines (e.g., 85.47% vs. 84.76% for CAFAS) and yields lower RMSE across the PAD dimensions. These findings suggest that the proposed interactive attention design and PAD-aware spatial constraints can better capture and align complementary information across physiological modalities, compared with conventional tensor-fusion or cross-modal attention mechanisms.

6. Discussion

To address the challenges of modality heterogeneity and temporal asynchrony in multimodal emotion recognition, as well as the issues associated with discrete labels lacking emotional continuity and the difficulty of jointly modeling dimensional and discrete annotations, this study constructs the STEM-E2VA Mandarin multimodal emotion database and proposes a multimodal supervised hybrid contrastive learning model that integrates PAD perception. The database achieves the synchronous collection of four modalities—articulatory kinematics, acoustics, glottal signals, and video—and incorporates seven types of discrete emotions with PAD three-dimensional annotations, providing a valuable addition to the development of multimodal sentiment databases in China.
Given that discrete sentiment labels are limited in expressing intensity and continuous variation, and dimensional sentiment models lack clear semantic boundaries at the sentiment category level, this study designs a hierarchical complementary annotation mechanism. This mechanism integrates seven discrete sentiment categories and three-dimensional PAD annotations, thereby enhancing the modeling capability for emotional intensity, intermediate states, and modal conflicts while maintaining discriminative power. Experiments demonstrate that this annotation approach improves sentiment recognition accuracy and robustness, while more comprehensively capturing sentiment categories, intensity, and transitional characteristics.
In terms of quality control for emotion annotation, the SCL-90 psychological scale was introduced to analyze the psychological characteristics of annotators. Through linear regression, it was found that high pleasure outliers were notably correlated with individual somatization, depression, interpersonal sensitivity and other psychological characteristics, revealing the impact of psychological state on annotation consistency. This psychological intervention mechanism provides a new approach to improving annotation credibility and solves the annotation deviation problem of traditional databases (such as CASIA [18] relying on subjective judgment and VAM [20] not considering psychological characteristics).
Methodologically, the study proposes the multimodal contrastive learning loss function MMCL-PAD, which incorporates spatial perception within the PAD framework. By adaptively adjusting temperature coefficients based on PAD spatial distances, it improves the model’s ability to differentiate between fine-grained emotions, such as ‘happy’ versus ‘pleasure’. MMCL-PAD combines intra-modal and inter-modal contrast strategies and introduces an affective dimension perception mechanism to dynamically model semantic distances between samples. This enhances the consistency and discriminative power of multimodal representations within the feature space. Experimental validation confirms its effectiveness: on the STEM-E2VA dataset, discrete emotion recognition accuracy reaches 81.23% (a 2.87% improvement over the baseline), while the RMSE for pleasure/arousal in dimensional affect prediction decreases to 0.152/0.145, respectively. Additionally, the GRU–Transformer architecture, combined with cross-attention mechanisms, elevates classification accuracy to 85.47%, further validating the advantages of multimodal feature fusion. To quantify modality contributions, ablation results show that, compared with the 78.14% acoustic baseline, the glottal signal yields a relative reduction of 25.31% in arousal prediction error, and articulatory kinematics increases recognition accuracy to 82.51% while reducing arousal RMSE by 16.67% in relative terms. Joint fusion of glottal, articulatory kinematics, and acoustic features raises accuracy to 85.47% and reduces PAD prediction errors in the evaluated settings. Compared with recent baseline models (e.g., MTAMW [8], Emotion2vec [44]), the proposed architecture exhibits superior performance in both discrete emotion classification and continuous PAD prediction.
To further verify the robustness of the proposed training strategy beyond a single cultural context, we evaluated the framework on both the STEM-E2VA (Mandarin) and IEMOCAP (English) datasets. It is worth noting that direct cross-corpus training (e.g., training on STEM-E2VA and testing on IEMOCAP) was not performed in this study because IEMOCAP lacks the specific physiological modalities (articulatory kinematics and glottal) required by our multimodal fusion architecture. The consistent performance improvements observed on both datasets (as detailed in Section 5) demonstrate the cross-corpora applicability and stability of our methodological approach.
Despite substantial research achievements, several areas warrant further exploration: current efforts focus on emotion state recognition and prediction, with potential future expansions into modeling emotional change mechanisms and active generation tasks to support human–machine emotional interaction scenarios, such as mental health interventions. Other areas of exploration include designing lightweight architectures to optimize latency issues in real-time deployment, and incorporating physiological modalities, such as electroencephalography and electromyography, to build a more comprehensive neuro-behavioral coupled affective computing framework.

7. Conclusions

This paper focuses on the construction and evaluation of the multimodal Chinese Mandarin sentiment dataset STEM-E2VA, exploring the relationship between similar emotions in discrete and dimensional sentiment frameworks. First, a multimodal Chinese Mandarin sentiment database is established, integrating data from a three-dimensional electromagnetic articulograph (EMA), an electroglottograph (EGG), and audio–video sources. The process of text corpus design, participant selection, and data collection is thoroughly documented. The emotional quality of the data is validated using deep learning models to ensure its effectiveness.
Next, an optimized PAD three-dimensional emotion model is introduced, and mathematical statistical methods (such as confidence intervals and standard deviations) are applied to analyze the data and address outliers. The study finds that outliers are predominantly concentrated in the high-pleasure range and exhibit notable correlations with somatization, depression, and interpersonal sensitivity. Processing these outliers provides a solid empirical foundation for the stable training and robust representation learning of subsequent models.
Finally, a multimodal dimensional contrastive fusion model based on the pretrained wav2vec2.0 model is proposed. This approach fine-tunes the model, which was pretrained for speech recognition, so that it is applicable to glottal and speech emotion recognition tasks. Compared to traditional multimodal fusion methods, the proposed model employs a hybrid contrastive learning loss within and across modalities, enabling it to effectively extract information between classes and samples. Experimental results show that incorporating the PAD contrastive loss and an interactive attention mechanism in the fusion stage enhances the model’s ability to learn general emotional features, thereby improving the performance of downstream tasks. On STEM-E2VA, our full model achieves 85.47% accuracy for discrete emotion recognition, with PAD regression errors of RMSE = 0.082/0.074/0.302 for P/A/D, respectively. Ablation results further verify that adding glottal and articulatory kinematics provides complementary cues beyond acoustics, improving both classification and PAD prediction.

Author Contributions

Conceptualization, S.D. and L.L.; methodology, S.D.; software, W.Z.; validation, L.L., H.L. and S.D.; formal analysis, W.Z.; investigation, T.Z.; resources, S.D.; data curation, L.L.; writing—original draft preparation, S.D.; writing—review and editing, W.Z., F.L. and H.L.; visualization, F.Z.; supervision, S.D.; project administration, S.D.; funding acquisition, S.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Youth Fund of the National Natural Science Foundation of China (No. 12004275), the Shanxi Scholarship Council of China (No. 2024-060), the Natural Science Foundation of Shanxi Province, China (No. 202403021211098), and the Startup Fund of Shanxi University of Electronic Science and Technology (No. 2025KJ016).

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy and ethical restrictions, as they contain sensitive multimodal information that could compromise the privacy of the participants.

Acknowledgments

Special thanks go to all recruited participants and Newcastle University for discussion and help.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Poria, S.; Cambria, E.; Bajpai, R.; Hussain, A. A review of affective computing: From unimodal analysis to multimodal fusion. Inf. Fusion 2017, 37, 98–125.
  2. Wu, W.; Li, H. Modeling artificial emotions in PAD emotional space and human-computer interaction experiments. J. Harbin Inst. Technol. 2019, 51, 29–37.
  3. Chen, M.-Y.; Chen, T.-H. Modeling public mood and emotion: Blog and news sentiment and socio-economic phenomena. Future Gener. Comput. Syst. 2019, 96, 692–699.
  4. Hou, M.; Tang, J.; Zhang, J.; Kong, W.; Zhao, Q. Deep multimodal multilinear fusion with high-order polynomial pooling. Adv. Neural Inf. Process. Syst. 2019, 32, 12136–12145.
  5. Tsai, Y.-H.H.; Bai, S.; Liang, P.P.; Kolter, J.Z.; Morency, L.-P.; Salakhutdinov, R. Multimodal transformer for unaligned multimodal language sequences. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 6558–6569.
  6. Hazarika, D.; Zimmermann, R.; Poria, S. MISA: Modality-invariant and modality-specific representations for multimodal sentiment analysis. In Proceedings of the 28th ACM International Conference on Multimedia; Association for Computing Machinery: New York, NY, USA, 2020; pp. 1122–1131.
  7. Ji, H.; Li, X.; Li, M.; Zhao, M.; Gao, C. Hybrid relational graphs with sentiment-laden semantic alignment for multimodal emotion recognition in conversation. In Proceedings of the 34th International Joint Conference on Artificial Intelligence (IJCAI); International Joint Conferences on Artificial Intelligence Organization: Marina Del Rey, CA, USA, 2025; pp. 2973–2981.
  8. Wang, Y.; He, J.; Wang, D.; Wang, Q.; Wan, B.; Luo, X. Multimodal transformer with adaptive modality weighting for multimodal sentiment analysis. Neurocomputing 2024, 572, 127181.
  9. Mai, S.; Xing, S.; Hu, H. Locally confined modality fusion network with a global perspective for multimodal human affective computing. IEEE Trans. Multimed. 2020, 22, 122–137.
  10. Hazmoune, S.; Bougamouza, F. Using transformers for multimodal emotion recognition: Taxonomies and state of the art review. Eng. Appl. Artif. Intell. 2024, 133, 108339.
  11. Yang, K.; Xu, H.; Gao, K. Cm-bert: Cross-modal bert for text-audio sentiment analysis. In Proceedings of the 28th ACM International Conference on Multimedia (MM ’20), Seattle, WA, USA, 12–16 October 2020; pp. 521–525.
  12. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning (ICML), Virtual, 12–18 July 2020; pp. 1597–1607. Available online: https://proceedings.mlr.press/v119/chen20j.html (accessed on 15 March 2025).
  13. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning (ICML), Virtual, 18–24 July 2021; pp. 8748–8763. Available online: https://proceedings.mlr.press/v139/radford21a.html (accessed on 28 March 2025).
  14. Chen, S.; Xie, S.; He, K. An empirical study of training self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Virtual, 11–17 October 2021; pp. 9640–9649.
  15. Liu, T.; Li, J.; Wu, J.; Du, B.; Zhan, Y.; Tao, D.; Wan, J. Facial expression recognition with heatmap neighbor contrastive learning. IEEE Trans. Multimed. 2025, 27, 4795–4807.
  16. Dai, S.; Li, M.; Wu, X.; Ju, X.; Li, X.; Yang, J.; Hu, D. Contrastive learning of EEG representation of brain area for emotion recognition. IEEE Trans. Instrum. Meas. 2025, 74, 2506913.
  17. Chang, J.; Zhang, Z.; Qian, Y.; Lin, P. Multi-scale hyperbolic contrastive learning for cross-subject EEG emotion recognition. IEEE Trans. Affect. Comput. 2025, 16, 1716–1731.
  18. Han, W.; Li, H.; Ruan, H.; Ma, L. Review on Speech Emotion Recognition. J. Softw. 2014, 25, 37–50. (In Chinese)
  19. Burkhardt, F.; Paeschke, A.; Rolfes, M.; Sendlmeier, W.F.; Weiss, B. A database of German emotional speech. In Proceedings of the Interspeech, Lisbon, Portugal, 4–8 September 2005; pp. 1517–1520.
  20. Grimm, M.; Kroschel, K.; Narayanan, S. The Vera am Mittag German audio-visual emotional speech database. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), Hannover, Germany, 23–26 June 2008; pp. 865–868.
  21. McKeown, G.; Valstar, M.; Cowie, R.; Pantic, M.; Schroder, M. The semaine database: Annotated multimodal records of emotionally colored conversations between a person and a limited agent. IEEE Trans. Affect. Comput. 2012, 3, 5–17.
  22. Zadeh, A.; Zellers, R.; Pincus, E.; Morency, L.P. MOSI: Multimodal Corpus of Sentiment Intensity and Subjectivity Analysis in Online Opinion Videos. arXiv 2016, arXiv:1606.06259.
  23. Busso, C.; Bulut, M.; Lee, C.C.; Kazemzadeh, A.; Mower, E.; Kim, S.; Chang, J.N.; Lee, S.; Narayanan, S.S. IEMOCAP: Interactive emotional dyadic motion capture database. Lang. Resour. Eval. 2008, 42, 335–359.
  24. Ringeval, F.; Sonderegger, A.; Sauer, J.; Lalanne, D. Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions. In Proceedings of the IEEE International Conference on Automatic Face & Gesture Recognition (FG), Shanghai, China, 22–26 April 2013; pp. 1–8.
  25. Wang, Y.; Yu, H.; Gao, W.; Xia, Y.; Nduka, C. MGEED: A multimodal genuine emotion and expression detection database. IEEE Trans. Affect. Comput. 2024, 15, 606–619.
  26. Lee, S.; Yildirim, S.; Kazemzadeh, A.; Narayanan, S. An articulatory study of emotional speech production. In Proceedings of the Interspeech, Lisbon, Portugal, 4–8 September 2005; pp. 497–500.
  27. Kim, J.; Toutios, A.; Lee, S.; Narayanan, S.S. A kinematic study of critical and non-critical articulators in emotional speech production. J. Acoust. Soc. Am. 2015, 137, 1411–1429.
  28. Gobl, C.; Chasaide, A.N. The role of voice quality in communicating emotion, mood and attitude. Speech Commun. 2003, 40, 189–212.
  29. Henrich, N.; d’Alessandro, C.; Doval, B.; Castellengo, M. On the use of the derivative of electroglottographic signals for characterization of nonpathological phonation. J. Acoust. Soc. Am. 2004, 115, 1321–1332.
  30. Mehrabian, A.; Russell, J.A. An Approach to Environmental Psychology; The MIT Press: Cambridge, MA, USA, 1974.
  31. Meenakshi, N.; Yarra, C.; Yamini, B.K.; Ghosh, P.K. Comparison of speech quality with and without sensors in electromagnetic articulograph AG 501 recording. In Proceedings of the Interspeech, Singapore, 14–18 September 2014; pp. 935–939.
  32. Chen, Y.; Xiao, Z. Design of speech database combining discrete labels and dimensional space. Tech. Acoust. 2018, 37, 380–387.
  33. Brewer, D.; Doughtie, E.B.; Lubin, B. Induction of mood and mood shift. J. Clin. Psychol. 1980, 36, 215–226.
  34. Mehrabian, A. Pleasure-arousal-dominance: A general framework for describing and measuring individual differences in temperament. Curr. Psychol. 1996, 14, 261–292.
  35. Fant, G. Acoustic Theory of Speech Production: With Calculations Based on X-Ray Studies of Russian Articulations; Mouton: The Hague, The Netherlands, 1960.
  36. Cai, Z.; Qin, X.; Cai, D.; Li, M.; Liu, X.; Zhong, H. The DKU-JNU-EMA Electromagnetic Articulography Database on Mandarin and Chinese Dialects with tandem feature based acoustic-to-articulatory inversion. In Proceedings of the 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), Taipei, Taiwan, 26–29 November 2018; pp. 235–239.
  37. Li, J.; Cheng, L. The acoustics properties of the nasals and nasalization in Standard Chinese. Inf. Technol. J. 2014, 13, 1793–1799.
  38. Zhang, H.-Y.; Huang, H.-M.; Li, W. Speech emotion database oriented to emotional change detection. Comput. Simul. 2021, 38, 448–455.
  39. Yunusova, Y.; Green, J.R.; Mefferd, A. Accuracy assessment for AG500, electromagnetic articulograph. J. Speech Lang. Hear. Res. 2009, 52, 547–555.
  40. Song, J.; Zhang, X.; Sun, Y.; Chang, J. Establishment of an emotional speech database based on the fuzzy comprehensive evaluation method. Mod. Electron. Technol. 2016, 39, 51–54.
  41. Roark, R.M.; Watson, B.C.; Baken, R.; Brown, D.J.; Thomas, J.M. Measures of vocal attack time for healthy young adults. J. Voice 2012, 26, 12–17.
  42. Russell, J.A.; Mehrabian, A. Evidence for a three-factor theory of emotions. J. Res. Pers. 1977, 11, 273–294.
  43. Gong, X.; Huang, Y.; Wang, Y. Revision of the Chinese Facial Affective Picture System. Chin. Ment. Health J. 2011, 25, 40–46.
  44. Ge, Q.; Zhou, X.; Liu, Y. Effect of neuroticism on depressive symptoms in officers and soldiers: The mediating effect of negative automatic thoughts and psychological stress response. J. Nav. Med. Univ. 2022, 43, 821–826. [Google Scholar]
  45. Schuller, B.; Steidl, S.; Batliner, A. The Interspeech 2009 emotion challenge. In Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), Brighton, UK, 6–10 September 2009; pp. 312–315. [Google Scholar]
  46. Wang, F.; Liu, H. Understanding the behaviour of contrastive loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 2495–2504. [Google Scholar]
  47. Chen, L.W.; Rudnicky, A. Exploring Wav2vec 2.0 fine-tuning for improved speech emotion recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 7592–7596. [Google Scholar]
  48. Wang, N.; Yang, D. Speech emotion recognition using fine-tuned Wav2vec2.0 and neural controlled differential equations classifier. PLoS ONE 2025, 20, e0318297. [Google Scholar] [CrossRef]
  49. Ma, Z.; Zheng, Z.; Ye, J.; Li, J.; Gao, Z.; Zhang, S.; Chen, X. emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 11–16 August 2024; pp. 15747–15760. [Google Scholar]
Figure 1. Articulatory sensor placement and participant setup for data acquisition. Participant gave permission for the use of his image. (a) Placement of EMA sensors on lips (L1–L4) and tongue (T1–T3). (b) Participant wearing EMA sensors in the recording session.
Figure 2. Schematic diagram of data acquisition.
Figure 3. Intra-class dispersion of PAD annotations. Standard deviations of pleasure, arousal, and dominance are computed per emotion category to quantify annotation consistency across annotators.
Figure 4. PAD-space visualization of strong emotions. Each point represents the mean PAD label of a sample; ellipsoids indicate 95% confidence regions for happy, anger, and sadness, illustrating cluster separability and remaining outliers.
Figure 5. Distribution of annotator outlier rates across the six high/low PAD dimension categories. The boxplots visualize the statistical dispersion of annotation error rates for the 20 annotators. Each dot represents an individual annotator.
Figure 6. Average distribution of the emotion categories. The raincloud plot (density estimate, box plot, and jittered scatter plot) shows the distribution of the degree of pleasure across the six emotion categories.
Figure 7. Confusion matrices of unimodal baselines on STEM-E2VA. Rows denote ground-truth emotion classes and columns denote predicted classes; panels (a–i) correspond to CNN/SVM/DNN models for acoustic, glottal, and kinematic modalities.
Figure 8. Overview of the PAD-guided multimodal hybrid contrastive framework. In this framework, Wav2Vec 2.0 encodes acoustic and glottal signals, and a GRU–Transformer models articulatory kinematics. These representations are optimized by MMCL-PAD and integrated using an interactive attention fusion module to achieve discrete classification and PAD regression.
Figure 9. Interactive attention fusion for multimodal integration. The cross-modal multi-head attention model integrates the interaction between acoustic features (as keys/values) and glottal/articulatory kinematics features (as queries). The representations after attention processing are aggregated and concatenated to form a unified fused feature.
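The query/key/value roles described in the Figure 9 caption can be sketched in a few lines. This is a simplified illustration only: it uses a single head with no learned projections, and the mean pooling used for aggregation, the toy token values, and all function names are assumptions, not the authors' implementation.

```python
import math


def softmax(scores):
    """Normalize one row of scores so that the weights sum to one."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention without learned projections."""
    scale = math.sqrt(len(keys[0]))
    weights = [
        softmax([sum(q * k for q, k in zip(query, key)) / scale for key in keys])
        for query in queries
    ]
    outputs = [
        [sum(w * value[j] for w, value in zip(row, values))
         for j in range(len(values[0]))]
        for row in weights
    ]
    return outputs, weights


def mean_pool(sequence):
    """Average token vectors into one vector (assumed aggregation step)."""
    return [sum(column) / len(sequence) for column in zip(*sequence)]


# Toy token sequences; real features would come from the modality encoders.
acoustic = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # keys and values
glottal = [[1.0, 0.0], [0.0, 1.0]]               # queries
kinematics = [[0.5, 0.5]]                        # queries

glottal_out, glottal_weights = cross_attention(glottal, acoustic, acoustic)
kinematics_out, kinematics_weights = cross_attention(kinematics, acoustic, acoustic)

# Aggregate each attended sequence, then concatenate into one fused vector.
fused = mean_pool(glottal_out) + mean_pool(kinematics_out)
```

Each row of `glottal_weights` and `kinematics_weights` sums to one, which is the same row normalization stated for the heatmaps in Figure 12.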
Figure 10. t-SNE visualization of learned multimodal representations on the STEM-E2VA dataset under different contrastive loss functions: (a) $L_{\mathrm{IACL}}^{a,a}$, (b) $L_{\mathrm{IACL\text{-}PAD}}^{a,a}$, (c) $L_{\mathrm{MMCL}}$, (d) $L_{\mathrm{IACL}}^{a,a}$. The points are colored according to their ground-truth emotion labels: red for happy, purple for pleasure, cyan for anger, dark blue for indifferent, green for grief, and yellow for sad.
Figure 11. t-SNE visualization of learned multimodal representations on the STEM-E2VA dataset under different fusion strategies: (a) concatenation method; (b) interactive attention fusion method. The points are colored according to their ground-truth emotion labels: red for happy, purple for pleasure, cyan for anger, dark blue for indifferent, green for grief, and yellow for sad.
Figure 12. Visualization of the cross-modal attention weights (averaged over 8 heads) in the interactive fusion module. The horizontal axis denotes the acoustic tokens (keys/values), while the vertical axis denotes the query tokens: glottal (left) and articulatory kinematics (right). Each row is normalized to sum to one. The prominent near-diagonal bands suggest that the model captures near-synchronous temporal correspondence and semantic alignment between heterogeneous modalities. (a) Attention weight heatmap for Glottal → Audio (average of 8 heads). (b) Attention weight heatmap for Kinematics → Audio (average of 8 heads).
Table 1. Discrete emotion categories in STEM-E2VA, comprising neutral and strong/weak intensities of happy, anger, and sad.

| Category | Neutral | Happy | Happy | Anger | Anger | Sad | Sad |
|---|---|---|---|---|---|---|---|
| Intensity | - | Strong | Weak | Strong | Weak | Strong | Weak |
| Label | - | Happy | Pleasure | Anger | Indifference | Grief | Sadness |

Table 2. Sampling rates, synchronization methods, and alignment strategies across modalities.

| Modality | Acquisition Sampling Rate | Synchronization Strategy | Preprocessing Rate | Input Sequence Length (per 3.4 s Window) | Alignment Mechanism |
|---|---|---|---|---|---|
| Acoustic | 48 kHz | Hardware Trigger (Sync Box) | 16 kHz | 54,400 samples | Implicit (via Wav2Vec 2.0 encoder) |
| Glottal | 44.1 kHz | Waveform Alignment (Right-channel) | 16 kHz | 54,400 samples | Implicit (via Wav2Vec 2.0 encoder) |
| Articulatory kinematics | 250 Hz | Hardware Trigger (Sync Box) | 250 Hz | ~850 samples | Explicit (via Cross-Attention) |

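
The input sequence lengths in Table 2 follow from multiplying the preprocessing rate by the 3.4 s window. A minimal sketch of that arithmetic is given here; the function and dictionary names are illustrative assumptions, not identifiers from the authors' code.

```python
def samples_per_window(rate_hz: float, window_s: float = 3.4) -> int:
    """Number of samples in one analysis window at the given rate."""
    return round(rate_hz * window_s)


# Preprocessing rates taken from Table 2.
PREPROCESSING_RATE_HZ = {
    "acoustic": 16_000,
    "glottal": 16_000,
    "articulatory_kinematics": 250,
}

window_lengths = {
    modality: samples_per_window(rate)
    for modality, rate in PREPROCESSING_RATE_HZ.items()
}

# Audio samples per articulatory frame after preprocessing (16 kHz / 250 Hz).
audio_samples_per_kinematic_frame = 16_000 // 250
```

At these rates one articulatory frame spans 64 audio samples, which is the rate mismatch that the alignment mechanisms in the last column have to bridge.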
Table 3. Composition of the STEM-E2VA corpus. Counts are reported for rhymes (single rhymes, compound rhymes, and nasal) and sentence prompts across the discrete categories listed in the table (neutral, happy, pleasure, anger, indifference, grief and sadness), together with the total video data duration.

| Data Type | Type of Corpus | Category | Number of Corpus | Total |
|---|---|---|---|---|
| Pronunciation data, acoustic data, laryngeal data | Rhymes | Single rhymes | 132 | 528 |
| | | Compound rhymes | 198 | |
| | | Nasal | 198 | |
| | Sentence | Neutral | 352 | 2415 |
| | | Happy | 335 | |
| | | Pleasure | 345 | |
| | | Anger | 345 | |
| | | Indifference | 352 | |
| | | Grief | 337 | |
| | | Sadness | 349 | |
| Video data | - | - | 24.5 h | 24.5 h |

Table 4. Simplified 7-point PAD annotation scale used in STEM-E2VA. Each dimension (pleasure, arousal, dominance) is rated from 1 to 7 using a semantic differential scheme adapted from the revised PAD scale.

| Meaning | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| Pleasure | Angry | Angry | Unpleasant | Neutral | Attentive | Concern | Interested |
| | Pejorative | Cold | Calm | Neutral | Mild | Friendly | Intimate |
| | Grief | Sad | Perturbed | Neutral | Satisfied | Satisfied | Delighted |
| | Infuriated | Sorrow | Boredom | Neutral | Pleasant | Happy | Excited |
| Arousal | Sleepy | Lazy | Absent-minded | Neutral | Advertent | Attentive | Sober |
| | Calm | Relaxed | Mild | Neutral | Interested | Cheerful | Excited |
| | Depressed | Relaxing | Peaceful | Neutral | Comfortable | Attention-grabbing | Interested |
| | Restrained | Scared | Worried | Neutral | Stunned | Surprised | Surprised |
| Dominance | Controlled | Anxious | Perturbed | Neutral | Satisfied | Exhilarating | Mastered |
| | Compliant | Accepted | Agreed | Neutral | Hopefully | Covetous | Dominative |
| | Humility | Shy | Dependent | Neutral | Confident | Conceited | Proud |
| | Affected | Helpless | Accepted | Neutral | Calmly | Attention-grabbing | Influential |

Table 5. High/low grouping of emotions. Emotions are categorized as high or low along each PAD dimension (pleasure, arousal, dominance).

| Dimension | Strength | Emotion |
|---|---|---|
| Pleasure | High | Happiness, Pleasure |
| | Low | Anger, Indifference, Grief, Sadness |
| Arousal | High | Happiness, Pleasure, Anger, Indifference |
| | Low | Grief, Sadness |
| Dominance | High | Happiness, Pleasure, Anger, Indifference |
| | Low | Grief, Sadness |

Table 6. Stepwise regression model fit for predicting high-pleasure outlier rates from SCL-90 factors. R, R², adjusted R², and the standard error of the estimate are reported for Models 1–3.

| Model | R | R² | Adjusted R² | Standard Error of the Estimate |
|---|---|---|---|---|
| 1 | 0.762 | 0.581 | 0.558 | 0.094 |
| 2 | 0.826 | 0.682 | 0.644 | 0.084 |
| 3 | 0.894 | 0.799 | 0.762 | 0.069 |

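
The adjusted R² column in Table 6 can be cross-checked against the standard formula, adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1). The sketch below assumes n = 20 observations (the number of annotators given in the Figure 5 caption) and p = 1, 2, 3 predictors for Models 1–3; both are assumptions, since this section does not state them explicitly.

```python
def adjusted_r2(r2: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted coefficient of determination for a linear regression."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


N_ANNOTATORS = 20  # assumed sample size (20 annotators, per Figure 5)

# (R squared, reported adjusted R squared) for Models 1-3 in Table 6.
TABLE_6 = {1: (0.581, 0.558), 2: (0.682, 0.644), 3: (0.799, 0.762)}

recomputed = {
    model: adjusted_r2(r2, N_ANNOTATORS, n_predictors=model)
    for model, (r2, _) in TABLE_6.items()
}
```

Under these assumptions the recomputed values agree with the reported ones to within rounding of the three-decimal R² inputs.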
Table 7. Regression coefficients for Model 3 explaining high-pleasure outlier rates. Unstandardized/standardized coefficients, t-statistics, p-values, and collinearity diagnostics (Tolerance, VIF) are reported.

| Model 3 | Unstandardized Coefficient | Standardized Coefficient | t | p | Tolerance | VIF |
|---|---|---|---|---|---|---|
| Constant | 0.093 | - | 3.304 | 0.004 | - | - |
| $X_1$ | 0.398 | 0.786 | 4.835 | <0.001 | 0.475 | 2.106 |
| $X_3$ | −0.328 | −0.790 | −4.146 | 0.001 | 0.345 | 2.895 |
| $X_4$ | 0.286 | 0.655 | 3.061 | 0.007 | 0.274 | 3.654 |

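
The two collinearity columns in Table 7 are reciprocals of each other, since the variance inflation factor is defined as VIF = 1/Tolerance. A short consistency check, with the dictionary name chosen here for illustration:

```python
def vif_from_tolerance(tolerance: float) -> float:
    """Variance inflation factor as the reciprocal of the tolerance."""
    return 1.0 / tolerance


# (Tolerance, reported VIF) for the Model 3 predictors in Table 7.
COLLINEARITY = {"X1": (0.475, 2.106), "X3": (0.345, 2.895), "X4": (0.274, 3.654)}

recomputed_vif = {
    name: vif_from_tolerance(tolerance)
    for name, (tolerance, _) in COLLINEARITY.items()
}
```

The small differences from the reported VIF values come from the tolerances being rounded to three decimals.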
Table 8. Representative PAD vectors for each emotion category after outlier processing. Mean and standard deviation (mean, Std) are reported for pleasure, arousal, and dominance.

| Emotion | Pleasure (Mean, Std) | Arousal (Mean, Std) | Dominance (Mean, Std) |
|---|---|---|---|
| Happy | (5.71, 1.01) | (5.77, 0.78) | (5.68, 0.76) |
| Pleasure | (5.26, 0.83) | (5.42, 0.65) | (5.33, 0.63) |
| Anger | (2.04, 0.84) | (6.06, 0.82) | (5.99, 0.79) |
| Indifferent | (2.81, 0.70) | (3.95, 1.12) | (4.14, 1.07) |
| Grief | (1.59, 0.64) | (2.92, 0.38) | (1.65, 0.72) |
| Sadness | (2.38, 0.60) | (2.24, 0.74) | (2.45, 0.79) |

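
One way to read the representative vectors in Table 8 is as reference points in PAD space. The sketch below is purely illustrative and is not the classifier used in the paper: it assigns a 7-point PAD rating to the category whose mean vector is closest in Euclidean distance. The function name and the example ratings are assumptions.

```python
import math

# Mean (pleasure, arousal, dominance) per category, copied from Table 8.
PAD_MEANS = {
    "Happy": (5.71, 5.77, 5.68),
    "Pleasure": (5.26, 5.42, 5.33),
    "Anger": (2.04, 6.06, 5.99),
    "Indifferent": (2.81, 3.95, 4.14),
    "Grief": (1.59, 2.92, 1.65),
    "Sadness": (2.38, 2.24, 2.45),
}


def nearest_emotion(pad_rating):
    """Return the category whose mean PAD vector is closest to the rating."""
    return min(PAD_MEANS, key=lambda name: math.dist(PAD_MEANS[name], pad_rating))


# Distance between the strong and weak variants of the same base emotion.
happy_vs_pleasure = math.dist(PAD_MEANS["Happy"], PAD_MEANS["Pleasure"])
anger_vs_indifferent = math.dist(PAD_MEANS["Anger"], PAD_MEANS["Indifferent"])
```

For example, `nearest_emotion((2.0, 6.0, 6.0))` returns `"Anger"`. The strong/weak distances show that the happy and pleasure means lie much closer together than the anger and indifferent means.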
Table 9. Single-modality emotion recognition baselines on STEM-E2VA. Accuracy (%) is reported for acoustic, glottal, and articulatory kinematics modalities using SVM, DNN, and CNN classifiers under the same train/test split.

| Modality Type | Model Type | Accuracy |
|---|---|---|
| Acoustic | SVM | 82.57% |
| | DNN | 85.28% |
| | CNN | 79.85% |
| Glottal | SVM | 72.28% |
| | DNN | 79.85% |
| | CNN | 65.42% |
| Articulatory kinematics | SVM | 53.57% |
| | DNN | 61.17% |
| | CNN | 52.28% |

Table 10. Hyperparameter configuration for the four-stage training pipeline. Stage 1 pre-trains feature encoders, Stage 2 performs contrastive learning, Stage 3 jointly trains the multimodal fusion model, and Stage 4 fine-tunes downstream classification/regression heads.

| Hyperparameter | Stage 1: Feature Extraction | Stage 2: Contrastive Learning | Stage 3: Multimodal Joint Training | Stage 4: Downstream Task |
|---|---|---|---|---|
| Batch size | 32 | 128 | 64 | 64 |
| Learning rate | 0.0001 | 0.001 | 0.001 | 0.001 |
| Optimizer | Adam | SGD (momentum = 0.8) | Adam | Adam |
| Weight decay | 0.0001 | 0.001 | 0.001 | 0.001 |
| LR scheduler | StepLR (step = 50, γ = 0.1) | StepLR (step = 60, γ = 0.1) | StepLR (step = 40, γ = 0.1) | StepLR (step = 50, γ = 0.1) |
| Epoch | 100 | 100 | 80 | 300 |

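
To make the StepLR rows of Table 10 concrete, the sketch below computes the learning rate that a step schedule produces at a given epoch, using the closed form lr = base_lr · γ^⌊epoch/step⌋. It is written in plain Python rather than with a deep learning framework, and the stage keys and function names are illustrative assumptions.

```python
def step_lr(base_lr: float, epoch: int, step: int, gamma: float = 0.1) -> float:
    """Learning rate at a 0-indexed epoch under a step decay schedule."""
    return base_lr * gamma ** (epoch // step)


# Learning rate, scheduler step, and epoch count per stage, from Table 10.
STAGES = {
    "stage1_feature_extraction": {"lr": 1e-4, "step": 50, "epochs": 100},
    "stage2_contrastive": {"lr": 1e-3, "step": 60, "epochs": 100},
    "stage3_joint_training": {"lr": 1e-3, "step": 40, "epochs": 80},
    "stage4_downstream": {"lr": 1e-3, "step": 50, "epochs": 300},
}


def final_lr(stage: str) -> float:
    """Learning rate in effect during the last epoch of a stage."""
    cfg = STAGES[stage]
    return step_lr(cfg["lr"], cfg["epochs"] - 1, cfg["step"])


final_rates = {stage: final_lr(stage) for stage in STAGES}
```

With these settings Stages 1 to 3 each decay once, while Stage 4 decays five times over its 300 epochs, so its last epochs run at a learning rate of roughly 1e-8.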
Table 11. Effect of PAD outlier processing on STEM-E2VA (original labels versus processed labels). Accuracy refers to discrete emotion classification; RMSE-P/A/D are the regression errors for pleasure, arousal, and dominance.

| Setup | Outlier Processing | Accuracy | RMSE-P | RMSE-A | RMSE-D |
|---|---|---|---|---|---|
| Baseline (Raw) | Unprocessed | 81.61% | 0.129 | 0.124 | 0.481 |
| Proposed | Processed | 85.47% | 0.082 | 0.074 | 0.302 |

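
The RMSE-P/A/D columns report one root-mean-square error per PAD dimension. A minimal sketch of that metric follows; the toy labels are made up for illustration, and the scaling of the PAD labels behind the reported numbers is not stated in this section.

```python
import math


def rmse_per_dimension(y_true, y_pred):
    """RMSE computed separately for pleasure, arousal, and dominance."""
    n = len(y_true)
    return tuple(
        math.sqrt(sum((t[d] - p[d]) ** 2 for t, p in zip(y_true, y_pred)) / n)
        for d in range(3)
    )


# Toy (P, A, D) targets and predictions, not data from the paper.
targets = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
predictions = [(0.0, 0.0, 0.0), (1.0, 1.0, 3.0)]

rmse_p, rmse_a, rmse_d = rmse_per_dimension(targets, predictions)
```

Here only the dominance prediction of the second sample is off (by 2), so `rmse_p` and `rmse_a` are zero and `rmse_d` equals the square root of 2.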
Table 12. Effect of multi-scale contrastive learning on STEM-E2VA. We compare intra-modal contrastive learning with the proposed hybrid intra- and inter-modal strategy. Accuracy and RMSE-P/A/D are reported.

| Dataset | Losses | Accuracy | RMSE-P | RMSE-A | RMSE-D |
|---|---|---|---|---|---|
| STEM-E2VA | $L_{\mathrm{IACL}}^{a,a}$ | 76.25% | 0.189 | 0.181 | 0.369 |
| | $L_{\mathrm{MMCL}}$ | 78.36% | 0.183 | 0.178 | 0.363 |

Table 13. Effect of incorporating PAD perception into contrastive learning. PAD-unaware and PAD-aware losses are compared on IEMOCAP and STEM-E2VA; accuracy and RMSE-P/A/D are reported.

| Dataset | Losses | Accuracy | RMSE-P | RMSE-A | RMSE-D |
|---|---|---|---|---|---|
| IEMOCAP | $L_{\mathrm{IACL}}^{a,a}$ | 73.63% | 0.236 | 0.253 | 0.402 |
| | $L_{\mathrm{IACL\text{-}PAD}}^{a,a}$ | 74.85% | 0.202 | 0.213 | 0.391 |
| STEM-E2VA | $L_{\mathrm{IACL}}^{a,a}$ | 76.25% | 0.189 | 0.181 | 0.369 |
| | $L_{\mathrm{IACL\text{-}PAD}}^{a,a}$ | 78.14% | 0.172 | 0.162 | 0.364 |
| | $L_{\mathrm{MMCL}}$ | 78.36% | 0.183 | 0.178 | 0.363 |
| | $L_{\mathrm{MMCL\text{-}PAD}}$ | 81.23% | 0.152 | 0.145 | 0.358 |

Table 14. Modality ablation on STEM-E2VA. Performance is reported for acoustics alone and with glottal and articulatory kinematics added progressively, to quantify each modality’s contribution; accuracy and RMSE-P/A/D are reported.

| Modality Combination | Accuracy | RMSE-P | RMSE-A | RMSE-D |
|---|---|---|---|---|
| Acoustics | 78.14% | 0.172 | 0.162 | 0.364 |
| Acoustics + glottal | 82.12% | 0.150 | 0.121 | 0.350 |
| Acoustics + articulatory kinematics | 82.51% | 0.144 | 0.135 | 0.342 |
| Acoustics + articulatory kinematics + glottal | 85.47% | 0.082 | 0.074 | 0.302 |

Table 15. Comparison of fusion methods on STEM-E2VA. Simple concatenation is compared with the proposed interactive attention fusion under the same encoders and training protocol; accuracy and RMSE-P/A/D are reported.

| Fusion Method | Accuracy | RMSE-P | RMSE-A | RMSE-D |
|---|---|---|---|---|
| Concatenation | 82.53% | 0.140 | 0.134 | 0.348 |
| Interactive Attention Fusion | 85.47% | 0.082 | 0.074 | 0.302 |

Table 16. Performance comparison with baseline methods on the IEMOCAP dataset (acoustic only). Accuracy and RMSE-P/A/D are reported.

| Dataset | Methods | Accuracy | RMSE-P | RMSE-A | RMSE-D |
|---|---|---|---|---|---|
| IEMOCAP | P-TAPT [47] | 74.30% | 0.214 | 0.225 | 0.403 |
| | NCDE Classifier [48] | 74.18% | 0.229 | 0.231 | 0.408 |
| | Emotion2vec [49] | 74.48% | 0.215 | 0.228 | 0.398 |
| | Ours | 74.85% | 0.202 | 0.213 | 0.391 |

Table 17. Comparison with multimodal fusion baselines on STEM-E2VA (acoustic, glottal, kinematic); accuracy and RMSE-P/A/D are reported.

| Dataset | Methods | Accuracy | RMSE-P | RMSE-A | RMSE-D |
|---|---|---|---|---|---|
| STEM-E2VA | MulT (Multimodal Transformer) [5] | 82.70% | 0.098 | 0.092 | 0.322 |
| | MISA (Modality-Invariant and Specific Representations) [6] | 82.95% | 0.094 | 0.086 | 0.315 |
| | HRG-SSA [7] | 83.09% | 0.094 | 0.086 | 0.315 |
| | MTAMW [8] | 84.76% | 0.089 | 0.081 | 0.307 |
| | Ours | 85.47% | 0.082 | 0.074 | 0.302 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Duan, S.; Zhang, W.; Li, L.; Zhu, T.; Zhao, F.; Li, F.; Liang, H. PAD-Guided Multimodal Hybrid Contrastive Emotion Recognition upon STEM-E2VA Dataset. Multimodal Technol. Interact. 2026, 10, 38. https://doi.org/10.3390/mti10040038

Chicago/Turabian Style

Duan, Shufei, Wenjie Zhang, Liangqi Li, Ting Zhu, Fangyu Zhao, Fujiang Li, and Huizhi Liang. 2026. "PAD-Guided Multimodal Hybrid Contrastive Emotion Recognition upon STEM-E2VA Dataset" Multimodal Technologies and Interaction 10, no. 4: 38. https://doi.org/10.3390/mti10040038
