1. Introduction
As a complex and multi-dimensional psychological phenomenon, human emotion plays a core role in individual behavior, cognitive processes, and social interaction [1,2]. An in-depth understanding and automatic recognition of emotional states are of great significance for the development of human–computer interaction systems with emotional intelligence, mental health monitoring, and personalized services [3,4,5]. Current emotion recognition research is mainly divided into discrete emotion classification (e.g., joy, sadness, anger) and dimensional emotion models (e.g., the Valence–Arousal model). Among them, Valence (P) describes the positive or negative degree of emotion, while Arousal (A) reflects the level of physiological activation. These dimensions provide a more refined and continuous characterization of emotion, better capturing subtle changes in human emotion [6,7].
Personality traits, as stable tendencies in individuals' thinking, emotional, and behavioral patterns, significantly influence the generation, experience, and regulation of emotions [8,9,10]. For example, Neuroticism is often associated with greater negative emotional experiences and arousal, while Extraversion may predict more positive emotional states and arousal [11]. The Big Five personality model has been widely accepted as a mainstream framework for describing personality structure due to its cross-cultural and cross-situational stability. This model covers five core dimensions: Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism [12,13,14,15,16,17,18]. Exploring how the Big Five personality traits affect individuals' emotional dynamics in the valence–arousal space helps to reveal the complex interaction mechanisms between personality and emotion [19,20,21].
In recent years, advances in deep learning have driven significant progress in affective computing [22]. In particular, emotion recognition based on multimodal physiological signals, such as the electroencephalogram (EEG), electrocardiogram (ECG), and galvanic skin response (GSR), has shown great potential because these signals capture genuine physiological responses [23]. The Bidirectional Gated Recurrent Unit (Bi-GRU) performs well on sequential data and has been widely applied to emotion recognition tasks involving speech, video, and physiological signals [24,25,26]. However, traditional emotion recognition models remain limited in capturing complex temporal dependencies and cross-modal information interaction [18]. To address these challenges, this study made a series of improvements to the base model TFACE, which uses a multi-head self-attention mechanism and Bidirectional Long Short-Term Memory (Bi-LSTM) to process temporal features. Because the multi-head self-attention mechanism in TFACE suffers from computational redundancy and an insufficient ability to focus on key emotional features, we replaced it with the Squeeze-and-Excitation (SE) mechanism, which improves feature expression by adaptively weighting channel features. The SE module recalibrates channel features, enabling the model to concentrate on the feature channels that contribute most to emotion recognition.
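The SE recalibration described above can be sketched in a few lines. The following is a minimal, framework-free illustration in plain NumPy with random weights and an assumed reduction ratio r = 4; it shows the squeeze–excite–rescale pattern, not the study's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def se_block(x: np.ndarray, w1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    """Squeeze-and-Excitation on a (channels, time) feature map.

    Squeeze: global average pooling over time yields one descriptor per channel.
    Excitation: a two-layer bottleneck (ReLU, then sigmoid) produces per-channel
    gates in (0, 1), which rescale the original features.
    """
    z = x.mean(axis=1)                       # squeeze: (C,)
    s = np.maximum(w1 @ z, 0.0)              # bottleneck + ReLU: (C // r,)
    gates = 1.0 / (1.0 + np.exp(-(w2 @ s)))  # sigmoid gates: (C,)
    return x * gates[:, None]                # recalibrate each channel

C, T, r = 16, 100, 4                         # channels, frames, reduction ratio (assumed)
w1 = rng.standard_normal((C // r, C)) * 0.1  # illustrative random weights
w2 = rng.standard_normal((C, C // r)) * 0.1
features = rng.standard_normal((C, T))
out = se_block(features, w1, w2)
print(out.shape)  # (16, 100)
```

Each output channel is the input channel multiplied by a single learned gate, so informative channels can be amplified and redundant ones suppressed without changing the temporal resolution.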
Meanwhile, to address the parameter redundancy and long training time of the Bi-LSTM unit in TFACE, we replaced it with a Bi-GRU unit. By simplifying the gating structure, Bi-GRU reduces the model's parameter count while retaining the ability to capture bidirectional context, making it better suited to the rapid modeling of video emotion frame sequences [24]. Compared with Bi-LSTM, Bi-GRU typically achieves similar performance with fewer parameters and faster training, improving the model's efficiency and robustness.
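The parameter saving is easy to quantify: per direction, an LSTM layer learns four gate weight sets versus three for a GRU. A back-of-the-envelope sketch (classic formulation with one bias vector per gate; the input and hidden sizes are arbitrary illustrative choices):

```python
def lstm_params(input_size: int, hidden_size: int) -> int:
    # 4 gates (input, forget, cell, output), each with input weights,
    # recurrent weights, and one bias vector
    return 4 * (hidden_size * (input_size + hidden_size) + hidden_size)

def gru_params(input_size: int, hidden_size: int) -> int:
    # 3 gates (reset, update, candidate) with the same per-gate structure
    return 3 * (hidden_size * (input_size + hidden_size) + hidden_size)

x, h = 512, 256                 # illustrative feature and hidden sizes
lstm = 2 * lstm_params(x, h)    # "2 *" accounts for the bidirectional pass
gru = 2 * gru_params(x, h)
print(lstm, gru, f"{1 - gru / lstm:.0%} fewer parameters")  # GRU uses 25% fewer
```

Under this formulation the saving is exactly 25% regardless of layer size, since the only difference is the 3-versus-4 gate count.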
This study used the improved model to extract emotion labels from emotion-inducing videos. The model was applied to the first 10 min of emotional video data from 30 participants to obtain the corresponding emotion labels. These discrete labels were then converted into continuous valence (P) and arousal (A) values and mapped onto the circumplex model of emotion [6,7,27]. To further explore the influence of the Big Five personality traits on valence and arousal in the circumplex model, this study fitted the data with continuous-time structural equation modeling (CTSEM). CTSEM is a powerful statistical method for analyzing dynamic processes that evolve over time and is well suited to emotional and personality variables with temporal dependence [11]. In addition, age and gender were included in the model as covariates to control for their potential influences [19]. For parameter estimation, this study selected the L-BFGS optimization algorithm. As a representative quasi-Newton method, L-BFGS reduces computational complexity by approximating the inverse Hessian matrix, strikes a good balance between efficiency and memory consumption, and is particularly suitable for the large-scale time-series data (19,262 observations) in the CTSEM analysis, allowing efficient estimation of the drift matrix and the moderating-effect matrix [28,29,30,31]. Compared with the traditional Newton method, L-BFGS balances computational efficiency against memory requirements and has therefore been widely used in machine learning and large-scale optimization problems [32,33,34,35].
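The label-to-dimension conversion described above can be sketched as a lookup from discrete frame labels to circumplex coordinates. The seven class names and their (P, A) coordinates below are placeholder assumptions for illustration only, not the mapping used in this study:

```python
# Illustrative sketch only: the circumplex coordinates below are placeholder
# values for seven common emotion classes, NOT the study's actual mapping.
CIRCUMPLEX = {               # (valence P, arousal A), each in [-1, 1]
    "happiness": ( 0.8,  0.5),
    "sadness":   (-0.7, -0.4),
    "neutral":   ( 0.0,  0.0),
    "anger":     (-0.6,  0.8),
    "surprise":  ( 0.4,  0.7),
    "disgust":   (-0.7,  0.3),
    "fear":      (-0.6,  0.6),
}

def labels_to_pa(frame_labels):
    """Convert per-frame discrete labels into a continuous P/A time series."""
    return [CIRCUMPLEX[label] for label in frame_labels]

series = labels_to_pa(["neutral", "happiness", "happiness", "surprise"])
p_values = [p for p, _ in series]
print(p_values)  # [0.0, 0.8, 0.8, 0.4]
```

Applied frame by frame, this turns a discrete label sequence into the two continuous trajectories that CTSEM then models.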
Previous research has established associations between Big Five personality traits and emotional states, yet most studies emphasize static correlations while lacking a precise characterization of emotional dynamics. Moreover, substantial gaps persist in the efficiency and accuracy of multimodal emotion recognition, as well as in the systematic investigation of personality–emotion regulatory mechanisms. Existing recognition models exhibit limitations in temporal feature extraction and attentional focusing, hindering efficient derivation of dynamic affective parameters.
This study systematically addresses these gaps through the following integrated approach. First, we developed an optimized TFace-Bi-GRU-SE deep learning architecture that improves video-based emotion recognition accuracy and inference efficiency by incorporating Bi-GRU modules to streamline temporal modeling and Squeeze-and-Excitation (SE) blocks to amplify salient channel features. Second, we applied this model to 10 min video recordings from 30 participants, generating continuous time-series trajectories for valence (P) and arousal (A). Third, integrating Big Five personality assessments with these affective trajectories, we employed CTSEM to examine how personality dimensions, age, and gender modulate core dynamic parameters, including autoregressive effects, cross-lagged relationships, and diffusion coefficients. Finally, through complementary analyses comprising group comparisons, correlation tests, robustness checks, and sensitivity evaluations, we elucidated specific mechanisms by which personality traits shape affective response patterns. Together, these steps establish a methodological framework for integrating deep learning-based emotion recognition with continuous-time dynamic modeling and generate preliminary hypotheses about personality–emotion associations that warrant investigation in future, adequately powered studies.
4. Discussion
This study found significant differences in the time decay of pleasure and arousal: the half-life of pleasure was approximately 12.3 s, with about 5% of the effect intensity remaining after 50 s, indicating a more persistent experience, whereas the half-life of arousal was only about 1.2 s, with the effect essentially disappearing after 10 s. Physiological arousal thus subsided roughly ten times faster than subjective pleasure. This finding is consistent with the research of Kuppens et al., suggesting that positive emotional experiences have a stronger self-sustaining capacity [54].
Based on emotion regulation theory, pleasure (P), as one of the core affect dimensions, is typically closely associated with an individual's overall well-being and cognitive evaluation of the environment. Statistical analyses revealed that changes in pleasure may involve complex cognitive processes, such as situational reappraisal and meaning construction. Given the relative slowness of these processes, pleasure exhibits greater persistence [55]. Arousal (A), by contrast, reflects an individual's immediate physiological responses to environmental stimuli (e.g., changes in heart rate and respiration), responses that are typically rapid and transient [56].
From an evolutionary psychology perspective, the ability to quickly identify environmental threats or opportunities and initiate physiological preparations is critical for survival. As a rapid response mechanism, arousal prompts organisms to generate an immediate fight-or-flight response; once the threat is eliminated or the opportunity is seized, physiological arousal levels decline rapidly to conserve energy [57]. Pleasure, however, is linked to long-term resource acquisition and environmental assessment (e.g., feelings of safety, comfort, and social connection). These experiences are stable and consequential for an individual's long-term well-being and adaptability, and thus require more extensive processing and maintenance [58]. The persistence of pleasure may play a pivotal role in emotion regulation and adaptive functioning, helping individuals recover from negative emotions and sustain a positive psychological state [59,60].
In the analysis of autoregressive effects, the autoregressive parameter for pleasure was βP_P = −0.056 (z = −4.05, p < 0.001), and for arousal was βA_A = −0.558 (z = −11.14, p < 0.001). Both parameters were significantly negative, with arousal returning to baseline more rapidly. This result is consistent with the study by Sosnowska et al., which suggests that individual differences exist in emotional dynamics and in the baseline attractor state of emotions, indicating that emotional states return to an intrinsic stable point (homeostasis) over time [61]. Collectively, the rapid initiation and dissipation of arousal, a physiological stress response, may reflect its strong autoregressive characteristics, enabling individuals to adapt to environmental changes efficiently [62].
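In a continuous-time model, a negative autoregressive (drift) coefficient a implies that a perturbation decays as exp(a·t), so the half-life is ln(2)/|a|. A minimal sketch using the estimates above recovers the half-lives reported earlier (≈12.4 s for pleasure, ≈1.2 s for arousal) and the residual effect fractions:

```python
import math

def half_life(drift: float) -> float:
    """Half-life of an exponentially decaying effect exp(drift * t), drift < 0."""
    return math.log(2) / abs(drift)

def remaining_fraction(drift: float, t: float) -> float:
    """Fraction of the initial effect remaining after t seconds."""
    return math.exp(drift * t)

# Autoregressive (drift) estimates reported in this study
drift_pleasure, drift_arousal = -0.056, -0.558

print(f"pleasure half-life: {half_life(drift_pleasure):.1f} s")   # 12.4 s
print(f"arousal  half-life: {half_life(drift_arousal):.2f} s")    # 1.24 s
print(f"pleasure remaining after 50 s: {remaining_fraction(drift_pleasure, 50):.1%}")
print(f"arousal  remaining after 10 s: {remaining_fraction(drift_arousal, 10):.2%}")
```

The 50 s pleasure residual computes to roughly 6%, in line with the "about 5%" figure quoted above, and the arousal effect falls below 1% well before 10 s.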
This study observed non-significant cross-effects between the two dimensions: the cross-effect of pleasure on arousal was estimated at 0.007 (z = 0.10, p = 0.920), and that of arousal on pleasure at −0.026 (z = −1.49, p = 0.136). These findings support the relative independence of pleasure and arousal in their dynamic changes. Tidikis et al. [63] noted that emotional valence and arousal exert distinct effects on task comprehension, suggesting they may influence cognitive processes via separate pathways. This independence holds important implications for emotion theory: Russell's circumplex model of affect conceptualizes emotions as a continuous space defined by two orthogonal dimensions, pleasure and arousal [64], and the current results align closely with this framework, confirming that pleasure and arousal are relatively independent core emotional dimensions rather than simply linearly correlated. This independence may reflect distinct neurophysiological mechanisms underlying emotional evaluation (pleasure) and physiological activation (arousal) [65,66]. For instance, high pleasure can co-occur with high arousal (e.g., excitement) or low arousal (e.g., calmness), and the independent variation in these dimensions suggests mediation by distinct neurophysiological pathways and cognitive mechanisms [67].
Regarding the diffusion parameters, the pleasure diffusion parameter was significant (estimate = 0.286, z = 18.11, p < 0.001), as was the arousal diffusion parameter (estimate = 0.339, z = 13.13, p < 0.001); the diffusion covariance parameter was also significant (estimate = 0.218, z = 2.75, p = 0.006). Diffusion parameters reflect the instantaneous variability of emotional states, i.e., the degree of random short-term fluctuation in emotions. Previous research has demonstrated bidirectional links between the dynamic interplay of positive/negative affect and physical activity [68], suggesting that multiple factors may influence the instantaneous variability of emotions. These findings indicate that pleasure (P) and arousal (A), as core emotional dimensions, exhibit distinct temporal dynamic patterns: pleasure is persistent, potentially contributing to the maintenance of a positive mindset and adaptive behaviors; arousal dissipates rapidly, which may facilitate stress response management. The dynamic independence and significant positive covariation of these two dimensions reflect the complexity and multi-layered regulation of the emotional system. Exploratory analyses further suggested that distinct personality dimensions may be differentially associated with emotional dynamic parameters. However, these patterns require replication in adequately powered studies before conclusions about personality's role can be drawn.
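The drift and diffusion structure described above corresponds to a bivariate Ornstein–Uhlenbeck process. The Euler–Maruyama sketch below simulates such a process; the drift diagonal uses the reported autoregressive estimates, the cross-effects are set to zero (consistent with their non-significance), and the diffusion entries are illustrative magnitudes only, not the study's exact diffusion matrix:

```python
import math
import random

def simulate(drift, diffusion, x0, dt=0.1, steps=500, rng=None):
    """Euler–Maruyama simulation of dX = drift @ X dt + diffusion dW (2-D)."""
    rng = rng or random.Random(42)
    x = list(x0)
    path = [tuple(x)]
    sq = math.sqrt(dt)
    for _ in range(steps):
        dw = [rng.gauss(0, sq), rng.gauss(0, sq)]   # Wiener increments
        x = [
            x[i] + dt * sum(drift[i][j] * x[j] for j in range(2))
                 + sum(diffusion[i][j] * dw[j] for j in range(2))
            for i in range(2)
        ]
        path.append(tuple(x))
    return path

# Drift diagonal from the reported autoregressive estimates; cross-effects
# were non-significant and are set to zero here.
DRIFT = [[-0.056, 0.0],
         [0.0, -0.558]]
# Illustrative diffusion magnitudes (NOT the study's exact diffusion matrix);
# the off-diagonal term mimics the positive diffusion covariance.
DIFFUSION = [[0.286, 0.0],
             [0.218, 0.339]]

path = simulate(DRIFT, DIFFUSION, x0=(1.0, 1.0))
```

With the noise switched off, the simulation reduces to pure exponential decay at the two reported rates, which makes the persistence contrast between the dimensions directly visible in the trajectories.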
The observed tenfold difference in decay rates between valence (half-life: 12.3 s) and arousal (half-life: 1.2 s) has direct implications for the design of affect-sensitive systems. Most existing video-based emotion recognition pipelines treat valence and arousal as synchronously evolving outputs and apply uniform temporal smoothing across both dimensions [7]. The present findings suggest that this assumption is empirically untenable: a smoothing window appropriate for valence, which persists across tens of seconds, would systematically over-smooth arousal dynamics that resolve within seconds, effectively erasing physiologically meaningful reactivity signals. This pattern is consistent with the hypothesis that dimension-specific temporal filtering may be warranted in affect-sensitive pipeline design. However, as it derives from a single exploratory sample (N = 30), replication in independent samples is necessary before design recommendations can be drawn.
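One way to operationalize dimension-specific filtering is an exponential moving average whose time constant differs per dimension. In the sketch below, the time constants are illustrative choices tied to the reported half-lives, not validated recommendations, and the 25 fps frame rate is an assumption:

```python
import math

def ema(series, dt, tau):
    """Exponential moving average with time constant tau (seconds)."""
    alpha = 1.0 - math.exp(-dt / tau)   # per-frame update weight
    out, state = [], series[0]
    for value in series:
        state += alpha * (value - state)
        out.append(state)
    return out

dt = 1 / 25                             # 25 fps frame rate (assumed)
TAU_VALENCE, TAU_AROUSAL = 12.3, 1.2    # time constants tied to the reported half-lives

raw_p = [0.0] * 25 + [1.0] * 250        # a 1 s baseline, then a 10 s step in valence
raw_a = [0.0] * 25 + [1.0] * 250        # the same step in arousal
smooth_p = ema(raw_p, dt, TAU_VALENCE)
smooth_a = ema(raw_a, dt, TAU_AROUSAL)
# Arousal tracks the step within a couple of seconds; valence lags far behind.
print(round(smooth_a[-1], 2), round(smooth_p[-1], 2))
```

Ten seconds after the step, the arousal filter has essentially converged while the valence filter has covered only about half the distance, illustrating why a single shared smoothing window cannot serve both dimensions.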
A secondary methodological contribution concerns the boundary conditions under which CTSEM can reliably recover dynamic parameters from AI-derived emotion labels. The present pipeline achieved a weighted accuracy of 64.96% (seed = 42; five-seed mean: 63.50 ± 0.98%), a level typical of current seven-class, video-based recognition systems, yet autoregressive parameters exhibited complete directional consistency across 50 bootstrap iterations and less than 9% deviation under a soft-label sensitivity analysis. This robustness is not incidental: the dense longitudinal design (mean T = 642 observations per person) provides redundancy that effectively averages out frame-level classification noise [69]. The practical implication is significant: researchers need not await near-perfect classifiers before applying CTSEM to AI-derived emotion data, provided that temporal sampling density is sufficient. A preliminary heuristic suggested by the present results is that T ≥ 600 observations per person may provide adequate noise-averaging capacity at classification accuracies in the 60–70% range. Although this threshold requires systematic empirical validation in future work, it constitutes a concrete and falsifiable starting point for field standardization.
Exploratory between-person moderation analyses revealed preliminary patterns, organized here by convergent evidence strength. Bootstrap resampling of the complete model (Supplementary Table S6) revealed heterogeneous stability across effects. Ten predictor–DRIFT associations achieved directional consistency ≥ 70%: Agreeableness–P_A (88%), Openness–A_A (84%), Conscientiousness–P_A (80%), Agreeableness–A_P (78%), Agreeableness–A_A (76%), Openness–A_P (74%), age–A_P (72%), Extraversion–A_A (72%), age–P_A (70%), and Extraversion–P_P (70%). These constitute the most replicable preliminary evidence. Effects with directional consistency below 70% are noted where relevant but should be treated as directionally unstable at this sample size.
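Directional consistency, as used here, is simply the share of bootstrap replicates whose estimate has the same sign as the full-sample point estimate. A minimal sketch (the replicate values are toy numbers for illustration, not the study's bootstrap output):

```python
def directional_consistency(bootstrap_estimates, point_estimate):
    """Share of bootstrap replicates matching the sign of the point estimate."""
    if point_estimate == 0:
        raise ValueError("point estimate has no sign")
    target = point_estimate > 0
    matches = sum((b > 0) == target for b in bootstrap_estimates)
    return matches / len(bootstrap_estimates)

# Toy replicates for illustration only
replicates = [0.12, 0.08, -0.03, 0.15, 0.09, 0.11, -0.01, 0.07, 0.10, 0.05]
print(directional_consistency(replicates, 0.10))  # 0.8
```

An 88% figure, for example, means 44 of 50 bootstrap replicates reproduced the sign of the full-sample estimate.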
Agreeableness and Openness showed the strongest convergent evidence across the CTSEM, bootstrap, and Bayesian analyses. Agreeableness selectively and positively modulated A_A (z = 7.62, p < 0.001) and P_A (z = 4.91, p < 0.001), with bootstrap directional consistency of 76% and 88%, respectively, and Bayesian analysis yielded moderate evidence for the Agreeableness–P_mean association (BF10 = 2.39, r = 0.40, p = 0.028). This pattern may reflect Agreeableness's role in transforming pleasantness into behavioral activation through positive social interactions [70,71]; individuals high in Agreeableness are more likely to receive social support and positive feedback in interpersonal relationships, thereby maintaining an elevated baseline of pleasantness [72]. Openness positively modulated A_A (z = 11.64, p < 0.001), with bootstrap directional consistency of 84%, and showed preliminary evidence of negative modulation of A_P (z = −4.65, p < 0.001; 74% directional consistency). Bayesian analysis provided the strongest evidence across all predictors for the Openness–P_mean association (BF10 = 18.31, r = 0.54, p = 0.002). Openness is typically associated with curiosity, imagination, and receptiveness to novel experiences, traits that may facilitate the generation and maintenance of positive emotions [73]. Individuals high in Openness demonstrate greater flexibility in adjusting their cognitive appraisals when confronted with emotional stimuli, for instance, reinterpreting adverse events as learning opportunities [74]; consequently, their emotional fluctuations may decay more rapidly [75].
Extraversion positively modulated P_P (70% directional consistency) and A_A (72% directional consistency), suggesting that individuals higher in Extraversion show greater emotional stability. Extraversion also significantly negatively modulated the bidirectional cross-effects between pleasure and arousal (both p < 0.001), indicating that highly extraverted individuals are better at decoupling these two dimensions, maintaining positive pleasure even during high-arousal states and preventing the spread of negative emotions [76,77]. This decoupling ability may stem from more robust social support systems and superior coping strategies [78]. Bootstrap directional consistency was 72% for Extraversion–A_A and 70% for Extraversion–P_P, providing modest corroborating support. Conscientiousness negatively modulated A_A (z = −6.87, p < 0.001) and P_A (z = −6.91, p < 0.001), with the latter achieving bootstrap directional consistency of 80%. Highly conscientious individuals may accelerate the dissipation of arousal through self-discipline and goal-directed behavior; Conscientiousness indirectly influences emotional dynamics by promoting goal setting and achievement [79,80]. Neuroticism demonstrated the opposite pattern to Extraversion: it positively modulated A_P (z = 6.03, p < 0.001) and P_A (z = 4.55, p < 0.001) while negatively modulating A_A (z = −4.71, p < 0.001), suggesting that highly neurotic individuals experience stronger bidirectional coupling between pleasure and arousal alongside faster arousal dissipation. Bayesian analysis independently supported the Neuroticism–A_mean association (BF10 = 2.75, r = 0.41, p = 0.023); however, bootstrap directional consistency for Neuroticism ranged from only 54% to 64% across DRIFT parameters, so these dynamic-parameter effects should be treated as tentative. Neuroticism is frequently associated with emotional instability, heightened sensitivity to harmful stimuli, and elevated physiological arousal levels [73,81].
Age-related effects suggested a more rapid emotional return to baseline with increasing age (negative modulation of P_P and A_A, both p < 0.001) alongside enhanced bidimensional coupling (positive modulation of A_P and P_A, both p < 0.001), aligning with socioemotional selectivity theory [82] and lifespan developmental perspectives [73,81]. A gender-moderated effect was observed in the Neuroticism–Arousal relationship, with stronger associations among females (r = 0.746, p = 0.008) than males (r = 0.129, p = 0.597) [83]. Bootstrap directional consistency analysis identified ten associations with ≥ 70% stability; these constitute the most replicable preliminary findings and warrant prioritized replication in adequately powered studies (N ≥ 61). Detailed results are presented in the Supplementary Materials. The present moderation analyses were conducted on N = 30 participants, which constrains between-person detection power to large effects (f2 ≥ 0.35, partial |r| ≥ 0.51). Bootstrap directional consistency and Bayes factor analysis were applied to stratify findings by evidence strength, retaining only effects with ≥ 70% directional stability for substantive interpretation; ten personality–DRIFT associations met this threshold. Expanding the CTSEM sample to N ≥ 61 would provide 80% power to detect medium-sized moderation effects (f2 ≥ 0.15), enabling more precise estimation of individual-difference effects on emotional dynamics. Future studies should recruit larger independent samples with complete personality assessments to confirm the preliminary patterns identified here and to characterize the full range of personality–DRIFT associations with adequate statistical resolution.
This study implemented rigorous quality-control measures in its sample design. Initially, 198 participants were recruited. After screening with the SCL-90 scale and video quality-control procedures, 174 valid samples were retained, yielding a retention rate of 87.88%, which meets the recommendations of Osborne and Overbay (2004) [84].
A total of 144 participants were allocated to deep learning model training and testing (70%/30% split), and 30 independent samples were used for the continuous-time structural equation modeling (CTSEM) analysis. This design ensures the independence of the CTSEM analysis, potentially mitigating the risks of data leakage and overfitting. Regarding the adequacy of the N = 30 sample size, there is methodological support. Hecht and Zitzmann (2020) demonstrated that increasing the number of time points (T) can partially compensate for a reduced number of individuals (N) [85]. This study provided 19,262 time-series observation points in total (a mean of 642 per individual), potentially furnishing sufficient data for CTSEM parameter estimation. Rodebaugh et al. (2022) demonstrated that dynamic patterns can be reliably detected even with few individuals, provided there are abundant within-individual time points [86]. Andriamiarana et al. (2023) showed, using Bayesian methods, that small samples can yield relatively stable estimates when appropriate estimation procedures are used [87]. The CTSEM multilevel modeling framework accounts for individual-level uncertainty, and the reported coefficient of variation (CV) quantifies inter-individual heterogeneity. Bayes factor analysis based on individual-level data (n = 30) provided strong evidence for the core finding (BF10 = 18.3). The deep learning model was pretrained on the RAVDESS dataset. The internal dataset (174 cases) was divided into training (70%) and testing (30%) sets at the subject level to ensure unbiased model evaluation. True_Ratio is defined as the percentage of frames in which the model's prediction matches the majority vote of three independent experts across 20% of the frames.
The dataset splitting follows the official fixed split scheme of DFEW rather than five-fold cross-validation, primarily to enable fair comparison with DFEW baseline studies and because of computational resource constraints. To acknowledge the stochasticity of the training process, the experiment was repeated with five random seeds (42, 68, 189, 618, and 719), and results are reported as mean ± standard deviation (WA: 63.50% ± 0.98%) to strengthen the robustness of the model results.
This study employed the TFace-Bi-GRU-SE deep learning model to generate continuous emotion time series, achieving a weighted accuracy (WA) of 63.50%. Approximately 36.5% of frame labels were misclassified, which may constitute the primary source of measurement error, with misclassifications predominantly occurring between similar emotion categories [88,89]. The error may exhibit both random-noise characteristics and systematic bias. Frame misclassification may cause deviations in single-frame pleasure (P) and arousal (A) values; when integrated across a continuous sequence, the error may propagate along the temporal dimension, leading to a regression dilution effect [88,90]. The impact of measurement error on CTSEM parameter estimation may vary: autoregressive parameters (P_P, A_A), based on individual temporal patterns, may be relatively robust to single-point measurement errors and better capture the temporal continuity of emotional states; cross-effect parameters (P_A, A_P), involving interactions between different emotional dimensions, may be more susceptible to interference, which may partially account for the non-significant cross-effects observed in this study [91]. Structural equation modeling (SEM) theory indicates that measurement error can attenuate the true strength of associations between latent variables, a phenomenon known as attenuation bias or regression dilution [92], potentially leading to underestimation of some path coefficients in CTSEM. This study used fuzzy-frame cleaning to reduce noise in the raw data, and the CTSEM model, by drawing on data across multiple time points, may smooth single-point errors. Robustness tests and measurement-error assessments collectively support the reliability of the core findings. Bootstrap resampling (50 iterations; Table 9) demonstrated that the directional consistency of the P_P and A_A autoregressive parameters reached 100%, with 95% confidence intervals entirely excluding zero. Subgroup validation (Supplementary Table S7) further confirmed that the signs of all core dynamic parameters remained stable across subgroups. Two complementary sensitivity analyses assessed robustness against measurement error: confusion-matrix-based structured noise injection (Table 10) showed that autoregressive parameters maintained negative signs under all perturbation conditions (directional consistency = 100%), while soft-label uncertainty propagation (Table 11) yielded estimates deviating less than 9% from baseline values, with negligible Monte Carlo variance (βP_P: ±0.005; βA_A: ±0.007; Figure 7). Collectively, these findings demonstrate that under a dense longitudinal design (mean = 642 observations per person), CTSEM can reliably recover core dynamic parameters from AI-derived emotion data of moderate classification accuracy. The cross-effects exhibited lower directional consistency (60–70%) in bootstrap and subgroup validation, warranting cautious interpretation; this may reflect the inherently weak nature of the cross-effects or the substantial impact of moderate classification accuracy on the estimation of weak effects. Future research can enhance measurement accuracy through multimodal fusion or attention mechanisms [93,94,95,96]. This study provides a methodological reference for AI-assisted emotion measurement, emphasizing the need for multidimensional validation (bootstrap resampling, subgroup analysis, and sensitivity testing).
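The regression dilution mechanism invoked above is easy to demonstrate: adding independent measurement noise to an autocorrelated series biases the estimated lag-1 coefficient toward zero by the factor var(signal)/(var(signal) + var(noise)). A Monte Carlo sketch with assumed parameters (an AR(1) latent signal with φ = 0.8 and unit-variance label noise, neither taken from the study):

```python
import random

def ar1_series(phi, n, rng, sd=1.0):
    """Simulate an AR(1) process x[t] = phi * x[t-1] + noise."""
    x = [rng.gauss(0, sd)]
    for _ in range(n - 1):
        x.append(phi * x[-1] + rng.gauss(0, sd))
    return x

def lag1_slope(x):
    """OLS slope of x[t] on x[t-1] (attenuated by any measurement noise)."""
    prev, curr = x[:-1], x[1:]
    mp = sum(prev) / len(prev)
    mc = sum(curr) / len(curr)
    var = sum((p - mp) ** 2 for p in prev)
    cov = sum((p - mp) * (c - mc) for p, c in zip(prev, curr))
    return cov / var

rng = random.Random(7)
true = ar1_series(phi=0.8, n=5000, rng=rng)      # latent emotion signal
noisy = [v + rng.gauss(0, 1.0) for v in true]    # + frame-level label noise (assumed SD)
print(round(lag1_slope(true), 2), round(lag1_slope(noisy), 2))
```

The noisy series recovers a clearly smaller lag-1 coefficient than the clean series, which is the attenuation pattern the text attributes to frame-level misclassification.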
The dynamic interaction between emotional valence and arousal was systematically analyzed using a continuous-time structural equation model, and the specific regulatory effects of the Big Five personality dimensions, age, and gender on these dynamic parameters were quantified. CTSEM, as a statistical method for analyzing temporally evolving processes, is well suited to time-dependent emotional and personality variables and can thereby reveal the complex interaction mechanisms between personality and emotions [11,43,97,98]. Integrating the continuous emotional dimension data extracted by deep learning models with Big Five personality scale data provides a multilevel, cross-domain analytical framework that deepens our understanding of how personality shapes emotional response patterns [99,100,101]. Methods that integrate multiple data modalities, such as physiological signals and behavioral data, are receiving increasing attention in affective computing [95,102].
The findings of this study are exploratory and insufficient to inform the design of personalized intervention strategies [103]. The primary contribution of this work is methodological: it demonstrates a pipeline for extracting continuous emotional dynamics from ecologically valid video data and subjects those dynamics to rigorous sensitivity testing. Whether the personality-moderation patterns identified here carry clinical relevance remains an open empirical question, contingent on replication with adequate sample sizes (N ≥ 61) and more refined measurement models. At the methodological level, this study offers a systematic evaluation of deep learning models as measurement tools and examines error-mitigation strategies, thereby laying a foundation for the application of AI-assisted emotion measurement in psychology and medicine.
To address concerns regarding sample-size generalizability, we validated the robustness of the findings through two supplementary analyses. Bootstrap resampling (n = 50), focused on the core emotional-dynamics DRIFT parameters (the P_P and A_A autoregressive coefficients), revealed stable estimates (coefficient of variation < 0.5, narrow 95% confidence intervals, 100% directional consistency), confirming that dense longitudinal data can provide reliable within-subject dynamic estimates even with small samples. Subgroup validation demonstrated that these parameters exhibited 100% directional consistency across subgroups defined by emotional expression intensity, video quality, and personality extremity, supporting the cross-group consistency of the core emotional dynamic patterns.
This study has several limitations. First, sample size constitutes the primary limitation. Although dense longitudinal data supported the estimation of core dynamic parameters, the CTSEM analysis included only 30 participants, limiting the power to detect between-person moderation effects to large effects (f2 ≥ 0.35). To address this constraint directly, moderation effects were reported according to convergent evidence strength: effects supported by both bootstrap directional consistency ≥ 70% and Bayesian corroboration were retained for interpretation, whereas directionally unstable effects (<70% consistency) were noted but excluded from substantive conclusions. This evidence-based filtering ensures that the reported findings are matched to the sample's actual detection capacity. Second, the sample design contains a flaw: the 144 participants in the model training/testing group did not undergo Big Five personality assessment, precluding verification of trait comparability between the two groups and limiting evaluation of the model's generalizability across personality groups. Third, the deep learning model's approximately 36.5% frame misclassification rate may lead to systematic underestimation of fine-grained dynamic parameters. Fourth, all participants were recruited from a Chinese population, resulting in high sample homogeneity and limiting the generalizability of the results across cultures and populations.
Future research should prioritize addressing the sample size limitation. We recommend recruiting at least 100 participants who complete both video recording and personality assessment, employing stratified sampling to ensure a representative distribution of personality traits, and matching personality traits between the training and validation groups to verify the exploratory findings of this study and enhance the generalizability of the results. Additionally, including participants from diverse cultural backgrounds would strengthen the cross-cultural applicability of the conclusions. Furthermore, multimodal emotion recognition techniques (such as integrating speech and physiological signals) can improve measurement accuracy. Future studies should first prioritize replication of the exploratory personality–DRIFT associations identified here in adequately powered independent samples (N ≥ 61). Only upon successful replication would it be appropriate to consider downstream applications such as personalized emotion regulation programs.