A Cross-Corpus Evaluation on Spontaneous and Dynamic Facial Expressions for Automated Emotion Classification

Bian, Yifan; Kim, Hyunwoo; Krumhuber, Eva G.

doi:10.3390/electronics15040849

by Yifan Bian, Hyunwoo Kim and Eva G. Krumhuber *

Department of Experimental Psychology, University College London, 26 Bedford Way, London WC1H 0AP, UK

* Author to whom correspondence should be addressed.

Electronics 2026, 15(4), 849; https://doi.org/10.3390/electronics15040849

Submission received: 13 November 2025 / Revised: 9 February 2026 / Accepted: 10 February 2026 / Published: 17 February 2026

(This article belongs to the Special Issue Advances of Artificial Intelligence and Vision Applications, 2nd Edition)

Abstract

The growing availability of facial expression databases (FEDBs) has accelerated the development of empathic AI systems designed to promote emotional awareness and well-being. However, most existing systems are trained solely on posed (acted), static databases featuring exaggerated and stereotypical displays. Such portrayals may not accurately represent real-world expressions, which are often subtle, heterogeneous, and ambiguous, raising concerns about how well these AI systems infer human emotions. Furthermore, the lack of cross-database evaluation has limited assessments of how well these systems generalize to diverse facial behaviors. To address these gaps, the present study evaluates five spontaneous and dynamic databases that provide more ecologically valid representations of affective responses observed in everyday life. We assessed the performance of a widely adopted affective computing system, AFFDEX (v1.0; iMotions, Copenhagen, Denmark), to examine how basic emotions are inferred from spontaneous facial movements. Results reveal substantial variability in decoding accuracy across emotion categories, database contexts, and demographic factors. Prototypical and complex expressions were decoded more accurately than subtle or heterogeneous ones, while ambiguous expressions that blend multiple affective signals impaired machine predictions. Together, these findings underscore the crucial need to train and validate affective computing systems using diverse FEDBs that encompass a wider spectrum of behaviors to improve robustness and real-world generalizability.

Keywords: affective computing; emotion recognition; spontaneous; dynamic; database

1. Introduction

Facial expressions are one of the most informative channels of human affect [1]. Automated facial expression analysis (AFEA) has emerged as a promising approach for extracting facial representations from images, videos, and depth data both inside and outside the laboratory [2,3]. AFEA serves as a vital processing component in a wide range of applications, including emotion-aware AI systems that enable more adaptive, empathetic, and engaging interactions by accounting for users’ affective states [4].

The field of affective computing has been revolutionized by the growing availability of large-scale datasets and increasingly sophisticated computational techniques, leading to a proliferation of AFEA architectures and toolkits [5]. However, the validity and robustness of these systems remain fundamentally constrained by the characteristics of the datasets on which they are trained and validated [6]. Many widely used datasets consist of static images of posed (acted) expressions captured at peak intensity, depicting singular emotions under highly standardized conditions [7,8,9]. While AFEA models have achieved promising results on such datasets [10,11,12], there is growing concern regarding the extent to which these models can generalize to real-life contexts, where facial expressions are often heterogeneous, subtle, and ambiguous [13,14].

The development of spontaneous FEDBs addresses key limitations of posed datasets by offering more realistic representations of affective behavior. Nevertheless, this greater realism also introduces significant challenges for affective computing [15]. Unlike posed displays, spontaneous facial expressions are highly heterogeneous and embedded within complex socioemotional contexts [16]. Consequently, recognition rates for spontaneous expressions are consistently lower and more variable than those reported for posed datasets, often fluctuating between 15% and 65% [17,18,19]. Accumulating empirical evidence suggests that this performance variability is strongly contingent on database-specific characteristics [20,21], including emotion-eliciting methods (e.g., film induction, autobiographical recall), demographic composition (e.g., age, gender), and technical factors (e.g., illumination, face-box size). As these characteristics vary substantially across FEDBs [22], single-database evaluations often yield incomplete or misleading estimates of model performance. Cross-corpus evaluations are therefore essential for systematically capturing variability in spontaneous expressions across diverse sociocultural contexts.

Despite the necessity for comprehensive validation, AFEA models continue to be benchmarked primarily on a limited set of databases that fail to encompass the full spectrum of real-world affective behavior [23,24]. Single-database studies often exhibit constrained demographic coverage and emotional contexts, thereby restricting model generalizability to underrepresented populations and real-world situations [25,26]. Cross-corpus research has demonstrated that models performing well within their training datasets often suffer from deteriorated performance when tested on unseen datasets, revealing algorithmic biases that remain obscured in single-corpus analyses [27,28]. Critically, however, most existing cross-corpus evaluations are still confined to posed databases [21,29,30]. It remains to be determined how robustly AFEA systems can generalize across diverse spontaneous FEDBs, which differ markedly in both emotion-eliciting contexts and demographic profiles [22].

Beyond database characteristics, empirical studies have identified several key facial parameters that systematically account for variation in AFEA performance. Prototypicality, the degree to which an expression resembles a theoretical prototype (e.g., Duchenne smiles for happiness), has been consistently shown to enhance recognition accuracy [31]. However, classifiers that rely heavily on prototypical patterns are likely to fail when confronted with spontaneous expressions that often deviate from prototypical configurations [32]. Complexity, reflecting the intensity and diversity of facial activity, can facilitate emotion detection by providing richer and more discriminative cues that increase classifier confidence [13]. At the same time, models trained predominantly on static, high-intensity expressions may struggle with fleeting, subtle facial movements. Ambiguity, the presence of facial cues signaling multiple emotional states, tends to hinder accurate classification [15]. Evaluating AFEA performance through the joint lens of these facial parameters provides a diagnostic framework for identifying model strengths and weaknesses, offering guidance for feature engineering and model development aimed at better capturing the variability inherent in real-life affective behavior.

Despite these insights, many existing AFEA models remain theoretically grounded in Basic Emotion Theory (BET), which assumes that emotions are encoded and decoded in distinct, prototypical, and integrated facial patterns [1]. As a result, these models are optimized to detect prototypical displays while largely disregarding emotional co-occurrence [33] and partial configurations [34]. This theoretical orientation likely constrains model performance when applied to spontaneous expressions characterized by heterogeneous, subtle, and ambiguous patterns. Although BET-derived models can achieve human-level accuracy in decoding posed portrayals [35], their generalizability to spontaneous expressions remains questionable. Beyond theoretical constraints, practical barriers also impede the evaluation and adoption of state-of-the-art AFEA models. Most models are proprietary or, when open-sourced, require substantial domain expertise for implementation, as they are often built using different programming languages and frameworks. Few are accompanied by comprehensive documentation of their data pipelines, including details of feature extraction, preprocessing, analysis, and visualization. This lack of accessibility limits their use to specialists and hinders broader adoption.

2. The Present Study

To address these limitations, we conducted a cross-corpus evaluation of AFEA for spontaneous emotion recognition, providing a comprehensive assessment of model performance across multiple databases. We employed a commercially available AFEA software package called AFFDEX (v1.0; iMotions) [36], which has been widely adopted in the research community and achieved competitive performance among commercial systems [35]. AFFDEX was used as a standardized benchmark, enabling systematic evaluation and comparison of machine performance across multiple datasets. This cross-corpus approach overcomes the limitations of single-database studies and proprietary tools, thereby enhancing research generalizability.

The present study integrates emotion classification with a detailed analysis of facial action units (AUs) [37] to elucidate the decision boundaries employed by the classifier. Building on this foundation, we derived higher-level facial parameters (prototypicality, complexity, and ambiguity) from AU patterns and emotion predictions to quantify the key structural and informative properties of each expression. These parameters were operationalized following the definitions and computational procedures outlined in [13], a methodology that has so far been validated mostly on posed expressions. This approach provides a more granular understanding of both the model’s architecture and the underlying theoretical assumptions (e.g., BET) guiding emotion classification, extending prior work that has largely focused on posed expressions to naturalistic contexts. In addition, we evaluated potential demographic disparities in recognition accuracy, which are often inadequately addressed during model training and evaluation.

Drawing from previous research, we expected classification accuracy to vary across databases and demographic groups, highlighting the necessity of cross-corpus validation for developing robust and generalizable AFEA systems. Moreover, we predicted that higher levels of prototypicality and complexity would be associated with increased recognition accuracy, whereas greater ambiguity would be associated with reduced performance [13].

3. Method

3.1. Stimulus Material

Spontaneous facial expressions in the form of video clips or image sequences were obtained from publicly available databases. The selection process was guided by several criteria to ensure a comprehensive cross-corpus evaluation. First, databases needed to be publicly accessible to enable reproducibility and standardization. Second, we prioritized those featuring multiple basic emotions (three to six categories). Third, databases had to contain spontaneous rather than posed expressions, captured using validated elicitation methods. Finally, we selected databases that provided sufficient sample sizes across emotion categories and demographic groups to ensure meaningful statistical comparisons.

This selection process yielded five databases, including BINED [38], BP4D [39], DISFA [40], EB+ [41], and Emognition [42]. Each database features five to six basic emotions for classification via AFFDEX. Although our initial aim was to evaluate all six basic emotions, anger was excluded from the analysis due to insufficient sample sizes across databases. Consequently, the present study focused on five basic emotions: happiness, sadness, fear, disgust, and surprise. Expressions in the selected databases were primarily elicited through video-induction techniques, wherein participants viewed emotionally evocative film clips. Such methods have been empirically validated for their efficacy in inducing targeted emotional states [43]. Several databases (BINED, BP4D, EB+) incorporated additional elicitation methods, including tactile stimulation and active engagement tasks, further enriching the diversity of recorded expressions.

The video recordings predominantly feature frontal views of participants, maintaining consistency in head orientation and facial visibility. Most recordings were captured at frame rates ranging from 20 to 60 frames per second, effectively preserving the dynamic properties of facial expressions. Video duration varied substantially across databases, reflecting differences in emotion elicitation and recording protocols employed, from brief expressions lasting less than one second to extended emotional episodes spanning several minutes. Resolution quality also varied, from medium (720 × 576) to high (1920 × 1080) definition. The recordings typically featured plain backgrounds and controlled lighting conditions to optimize facial visibility.

3.2. Sampling Procedure

To ensure consistent representation across heterogeneous databases, we implemented a stratified random sampling approach [44]. This method divides the overall population into distinct, homogeneous subsets (strata) prior to sampling. Stratification was based on emotion category to balance class representation and on gender, which was the only demographic variable consistently available and comparably coded across all databases. Other demographic metadata (e.g., age, cultural or ethnic background) were either missing or inconsistently reported, precluding reliable cross-corpus analysis.

Specifically, we randomly selected five portrayals of each emotion category for each gender. This procedure yielded 10 portrayals per emotion (5 male, 5 female) per database. The selection process culminated in a total of 250 spontaneous expressions, comprising 125 female and 125 male encoders (Table 1). Encoder identities were unique within each emotion category and database to prevent performance estimates from being biased by individual-specific expressive tendencies. The duration of selected stimuli ranged from 5 s to 2 min and 9 s, reflecting the heterogeneity characteristic of spontaneous emotional displays across different databases and elicitation contexts.
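The sampling scheme described above can be illustrated with a short Python sketch. This is not the authors' code: the function name `stratified_sample`, the dictionary keys, the fixed seed, and the toy catalogue of eight encoders per cell are all hypothetical and serve only to show how five portrayals per database, emotion, and gender could be drawn while keeping encoder identities unique within each emotion category and database.

```python
import random
from collections import defaultdict

DATABASES = ["BINED", "BP4D", "DISFA", "EB+", "Emognition"]
EMOTIONS = ["happiness", "sadness", "fear", "disgust", "surprise"]


def stratified_sample(clips, per_cell=5, seed=0):
    """Draw `per_cell` clips per (database, emotion, gender) stratum.

    Encoders are kept unique within each (database, emotion) pair.
    """
    rng = random.Random(seed)
    cells = defaultdict(list)
    for clip in clips:
        cells[(clip["database"], clip["emotion"], clip["gender"])].append(clip)

    selected = []
    used = defaultdict(set)  # (database, emotion) -> encoder ids already drawn
    for key in sorted(cells):
        database, emotion, _gender = key
        pool = cells[key][:]
        rng.shuffle(pool)
        picked = []
        for clip in pool:
            if clip["encoder"] in used[(database, emotion)]:
                continue  # skip encoders already sampled for this emotion
            used[(database, emotion)].add(clip["encoder"])
            picked.append(clip)
            if len(picked) == per_cell:
                break
        if len(picked) < per_cell:
            raise ValueError(f"Not enough unique encoders in stratum {key}")
        selected.extend(picked)
    return selected


# Hypothetical catalogue: 8 candidate clips per stratum (invented for illustration).
catalogue = [
    {"database": d, "emotion": e, "gender": g, "encoder": f"{d}-{g}-{i}"}
    for d in DATABASES
    for e in EMOTIONS
    for g in ("female", "male")
    for i in range(8)
]
sample = stratified_sample(catalogue)
n_female = sum(clip["gender"] == "female" for clip in sample)
print(len(sample), n_female)
```

With five databases, five emotions, two genders, and five clips per stratum, the sketch returns 250 clips, split evenly by gender, which matches the totals reported above.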

3.3. Machine Analysis

The selected facial videos were analyzed using AFFDEX (v1.0; iMotions) [36], a system that integrates advanced computer vision and machine learning algorithms to decode facial emotions. AFFDEX was trained and validated on one of the largest spontaneous facial expression databases, AM-FED [45], which comprises over 15 million frames of facial expressions elicited through emotionally evocative videos, thereby enhancing the system’s reliability for spontaneous emotion analysis. Moreover, AFFDEX demonstrated superior performance in emotion classification for spontaneous expressions in a cross-classifier validation study [35].

AFFDEX provides objective, continuous measurements for 19 facial AUs (AUs 1, 2, 4, 5, 6, 7, 9, 10, 12, 14, 15, 17, 18, 20, 24, 25, 26, 28, 43), capturing subtle expression changes that may not be readily perceptible to untrained observers. Additionally, AFFDEX generates probabilistic estimates for the six basic emotions by mapping AU patterns onto discrete emotion categories [1]. The outputs represent the probabilistic occurrence of each AU feature and emotion category on a scale from 0 (likely absent) to 100 (likely present). Due to variations in video length, output scores were averaged across all frames within each video clip.

To compute the accuracy of machine emotion classification, dummy variables were created to indicate whether the recognized emotions matched the ground truth labels. For each video, the recognized emotion was determined by averaging the scores across all frames for each emotion category and selecting the category with the highest average score. A classification was considered accurate if the recognized emotion matched the ground-truth label; otherwise, it was considered inaccurate.
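As a minimal sketch of this scoring rule, the snippet below averages frame-level scores per emotion, takes the highest-scoring category as the recognized emotion, and codes a hit as 1 and a miss as 0. The data layout, the function names, and the example scores are invented for illustration and do not reflect the AFFDEX export format. The sketch also assumes that only the five analysed categories compete for the top score, which the text does not state explicitly.

```python
from statistics import mean

EMOTIONS = ("happiness", "sadness", "fear", "disgust", "surprise")


def classify_clip(frames):
    """Average each emotion score over all frames; return (label, averages)."""
    averages = {e: mean(frame[e] for frame in frames) for e in EMOTIONS}
    # Ties resolve to the first category in EMOTIONS (an arbitrary choice).
    return max(averages, key=averages.get), averages


def score_clips(clips):
    """Return overall accuracy and the 0/1 dummy variable for every clip."""
    hits = [int(classify_clip(c["frames"])[0] == c["label"]) for c in clips]
    return mean(hits), hits


# Two invented clips with scores on the 0-100 scale.
clips = [
    {"label": "happiness", "frames": [
        {"happiness": 80, "sadness": 1, "fear": 2, "disgust": 3, "surprise": 10},
        {"happiness": 60, "sadness": 2, "fear": 1, "disgust": 5, "surprise": 20},
    ]},
    {"label": "fear", "frames": [
        {"happiness": 40, "sadness": 5, "fear": 30, "disgust": 2, "surprise": 10},
        {"happiness": 50, "sadness": 5, "fear": 20, "disgust": 2, "surprise": 15},
    ]},
]
accuracy, hits = score_clips(clips)
print(accuracy, hits)
```

In this toy input the second clip is labelled Fear but has a higher mean Happiness score, so it is coded as a miss.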

4. Results

4.1. Emotion Classification

Overall, the machine classifier achieved a mean accuracy of 34.4% across all conditions, significantly exceeding chance-level performance of 20% (1/5), t(249) = 4.78, p < 0.001, Cohen’s d = 0.30. We next examined the factors contributing to variation in machine performance using logistic regression, with Emotion, Database, and Gender specified as fixed effects. Bonferroni corrections were applied to post-hoc comparisons within each factor to control for multiple testing.
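The reported test statistic can be reproduced from the summary figures alone. The sketch below assumes that the one-sample t-test was run on the 0/1 accuracy variable for all 250 clips, and that 34.4% corresponds to 86 correct classifications (0.344 × 250); both are inferences from the text rather than details confirmed by the authors.

```python
import math

n, correct, chance = 250, 86, 0.20  # 86 / 250 = 34.4% correct (derived count)

p_hat = correct / n
# Sample SD of a 0/1 vector containing `correct` ones (n - 1 denominator).
sd = math.sqrt(p_hat * (1 - p_hat) * n / (n - 1))
t_stat = (p_hat - chance) / (sd / math.sqrt(n))
cohens_d = (p_hat - chance) / sd

print(f"t({n - 1}) = {t_stat:.2f}, d = {cohens_d:.2f}")
```

Under these assumptions the values round to t(249) = 4.78 and d = 0.30, in line with the figures reported above.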

Emotion strongly predicted classification accuracy (Figure 1) when controlling for Database and Gender (χ²(4) = 108.07, p < 0.001). The machine achieved markedly higher accuracy for Happiness (M = 0.94, CI [0.87, 1.01]) than for the other four emotions: Surprise (M = 0.24, CI [0.12, 0.36]), Disgust (M = 0.24, CI [0.12, 0.36]), Fear (M = 0.16, CI [0.06, 0.26]), and Sadness (M = 0.14, CI [0.04, 0.24]). Bonferroni-corrected pairwise comparisons indicated that Happiness classification differed significantly from all other emotions (all ps < 0.001), while classification accuracy for Sadness, Fear, Surprise, and Disgust did not differ significantly from one another (all ps > 0.05).

When controlling for Emotion and Gender, Database did not show a significant main effect on overall classification accuracy (χ²(4) = 3.06, p = 0.548), indicating comparable average performance across the five datasets (Figure 2): Emognition (M = 0.42, 95% CI [0.28, 0.56]), BP4D (M = 0.34, 95% CI [0.21, 0.47]), EB+ (M = 0.34, 95% CI [0.21, 0.47]), BINED (M = 0.32, 95% CI [0.19, 0.45]), and DISFA (M = 0.30, 95% CI [0.17, 0.43]).

We tested whether the pattern of emotion classification varied across databases by comparing a model including only main effects (Emotion + Database + Gender) with a model that additionally included the Emotion × Database interaction. The interaction was highly significant (χ²(16) = 64.27, p < 0.001), indicating that the recognizability of specific emotions varied depending on database characteristics. To probe this interaction, we conducted Bonferroni-corrected pairwise comparisons within each emotion across databases. For Fear, expressions from Emognition were classified significantly more accurately than those from BP4D, DISFA, and EB+ (ps < 0.05). For Surprise, expressions from BINED were classified more accurately than those from BP4D, EB+, and DISFA (ps < 0.05). In contrast, classification accuracy for Disgust, Happiness, and Sadness did not differ significantly across databases (ps > 0.05).

Testing for gender effects while controlling for Emotion and Database, we found that the machine showed significantly higher accuracy when classifying male faces compared to female faces (40.0% vs. 28.8%; χ²(1) = 5.98, p = 0.014), indicating a gender bias in the model’s performance.

We also tested Emotion × Gender (χ²(4) = 8.01, p = 0.091) and Database × Gender (χ²(4) = 7.47, p = 0.113) interactions, neither of which reached significance, suggesting that the gender bias was relatively consistent across emotions and databases. Nevertheless, exploratory post-hoc tests indicated that the gender bias was most pronounced for Disgust (male: 40%, female: 8%; t(48) = −2.780, p = 0.008, Cohen’s d = 0.79), while no other emotion showed a significant gender difference (ps > 0.05; see Figure 3).

To further assess machine performance, we constructed a confusion matrix (Figure 4) and computed classification performance metrics (precision, recall, F1-score, specificity, false positive rate, and negative predictive value) for each emotion category (Table 2).

With five emotion categories (excluding Anger), the expected chance rate for any specific misclassification pair was 25%. Most misclassification rates fell below this chance threshold, indicating that the classifier did not randomly confuse most emotion pairs. However, systematic misclassification patterns emerged for certain emotion pairs. The machine frequently misclassified portrayals of Surprise (44.9%), Fear (61.7%), and Disgust (58.7%) as ‘Happiness’. This resulted in a false positive rate of 42.5% across the 200 non-Happiness samples and a positive predictive value of only 35.6% for Happiness classifications, indicating that only about one in three ‘Happiness’ predictions was accurate. In addition, facial portrayals of Sadness were misclassified equally often as ‘Surprise’ and ‘Disgust’, each with a rate of 30.4%.
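The two Happiness metrics quoted above follow directly from the reported rates. In the sketch below, the counts are back-calculated from the text (94% of the 50 Happiness clips and 42.5% of the 200 other clips); they are derived values, not figures taken from the confusion matrix itself.

```python
n_happy, n_other = 50, 200  # 10 clips x 5 databases per emotion

true_pos = round(0.94 * n_happy)    # Happiness clips labelled 'Happiness'
false_pos = round(0.425 * n_other)  # other clips labelled 'Happiness'

false_positive_rate = false_pos / n_other
positive_predictive_value = true_pos / (true_pos + false_pos)

print(true_pos, false_pos)
print(f"FPR = {false_positive_rate:.1%}, PPV = {positive_predictive_value:.1%}")
```

The derived counts are 47 true positives and 85 false positives, so 47 of 132 ‘Happiness’ predictions (35.6%) were correct.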

4.2. FACS Analysis

Multiple linear regression was conducted to model the relationship between facial AUs and emotion classification. For each emotion category, a separate multiple regression model was fitted using ordinary least squares (OLS), with the magnitudes of emotion ratings as the dependent variable and AU ratings as predictors. AU predictors were standardized using z-score transformation prior to analysis to facilitate direct comparison of coefficient magnitudes across predictors. This yielded standardized regression coefficients representing the change in emotion prediction associated with a one standard deviation increase in AU intensity.
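A minimal sketch of this kind of model, assuming NumPy is available, is given below. The function name `standardized_ols`, the synthetic data, and the choice of the sample standard deviation (`ddof=1`) for the z-scores are assumptions made for illustration; the authors' software and exact settings are not specified in the text.

```python
import numpy as np


def standardized_ols(au_scores, emotion_scores):
    """OLS of emotion scores on z-scored AU predictors.

    Returns (intercept, betas, r_squared); each beta is the change in the
    emotion score per one-SD increase in the corresponding AU.
    """
    X = np.asarray(au_scores, dtype=float)
    y = np.asarray(emotion_scores, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # z-score each AU column
    design = np.column_stack([np.ones(len(y)), Z])    # add intercept column
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    centered = y - y.mean()
    r_squared = 1 - (residuals @ residuals) / (centered @ centered)
    return coef[0], coef[1:], r_squared


# Synthetic data only: three AU columns, the first dominating the outcome.
rng = np.random.default_rng(0)
au = rng.uniform(0, 100, size=(250, 3))
emotion = 0.8 * au[:, 0] + 0.1 * au[:, 1] + rng.normal(0, 5, 250)

intercept, betas, r2 = standardized_ols(au, emotion)
print(np.round(betas, 2), round(float(r2), 3))
```

Because the predictors are centred, the intercept equals the mean emotion score, and the relative size of the betas can be compared directly across AUs.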

Model validity was assessed through tests of multicollinearity and goodness-of-fit evaluation. Variance Inflation Factors (VIF) were computed to detect multicollinearity among predictors. All VIFs remained below the conservative threshold of 5 (range: 1.01–3.49), with the highest values observed for AU12 (VIF = 3.49) and AU6 (VIF = 3.43). This suggests acceptably low multicollinearity and adequate independence among AU predictors. All regression models were statistically significant (all ps < 0.001), with the majority of AU predictors reaching significance at p < 0.05. Model fit varied systematically across emotion categories. Happiness demonstrated excellent fit (R² = 0.917), as did Disgust (R² = 0.855). Surprise (R² = 0.694) and Sadness (R² = 0.589) showed good fit, while Fear demonstrated moderate fit (R² = 0.555). These results indicate that facial AUs collectively account for substantial variance in emotion predictions, with explanatory power varying by emotion category.

Having established model validity, we examined the specific patterns of AU contributions to each emotion. The estimated coefficients revealed distinct AU-emotion relationships across emotion categories (Table 3). For Happiness, AU12 (β = 24.42, p < 0.001) and AU6 (β = 3.28, p < 0.001) were the strongest predictors. Disgust was most strongly predicted by AU9 (β = 7.38, p < 0.001) and AU10 (β = 1.71, p < 0.001). Sadness showed the largest positive associations with AU4 (β = 4.52, p < 0.001) and AU1 (β = 1.79, p < 0.001). Surprise demonstrated strong positive associations with AU2 (β = 6.53, p < 0.001), AU26 (β = 2.38, p < 0.001), and AU25 (β = 2.08, p < 0.001). Overall, these patterns largely aligned with the theoretical prototypes proposed in Basic Emotion Theory [1], suggesting that the machine learning model classified emotions by detecting prototypical facial configurations. Fear was a notable exception: while AU5 was the dominant predictor (β = 6.93, p < 0.001), other prototypical fear AUs showed weaker or negative associations (e.g., AU2: β = −0.91, p < 0.001).

Further analysis examined the distribution of prediction weights across facial AUs for each emotion by calculating the proportion of total absolute coefficient magnitudes attributable to individual AUs (Figure 5). The results revealed considerable concentration of predictive weight within specific AUs for each emotion: AU12 contributed 74% of the total coefficient magnitudes for Happiness, AU9 contributed 62% for Disgust, AU5 contributed 61% for Fear, AU4 contributed 44% for Sadness, and AU2 contributed 44% for Surprise. In contrast to these high-weight predictors, several AUs (e.g., AU43, AU14, AU18) provided minimal contributions across all models (average absolute β < 0.05), suggesting they played limited roles in machine emotion classification.
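The weight-share computation described above amounts to dividing each absolute coefficient by the sum of absolute coefficients in the same model. The sketch below uses deliberately artificial coefficients and placeholder AU names, since the full coefficient table is not reproduced here; the function name `weight_shares` is likewise hypothetical.

```python
def weight_shares(betas):
    """Share of the total absolute coefficient magnitude carried by each AU."""
    total = sum(abs(b) for b in betas.values())
    return {au: abs(b) / total for au, b in betas.items()}


# Artificial coefficients, chosen only to make the arithmetic easy to follow.
toy_betas = {"AU_a": 6.0, "AU_b": -3.0, "AU_c": 1.0}
shares = weight_shares(toy_betas)
print({au: round(share, 2) for au, share in shares.items()})
```

Negative coefficients count toward the total through their absolute value, so an AU that strongly suppresses an emotion prediction still registers as carrying weight.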

4.3. Prototypicality, Ambiguity and Complexity

Separate one-way ANOVAs were conducted to examine the prototypicality, complexity, and ambiguity of spontaneous expressions across different emotions. Significant main effects were found for prototypicality (F(4, 245) = 50.138, p < 0.001, η²p = 0.450), complexity (F(4, 245) = 13.209, p < 0.001, η²p = 0.177), and ambiguity (F(4, 245) = 4.144, p < 0.01, η²p = 0.063). Happiness was the most prototypical emotion and exhibited the most expressive and distinctive facial patterns, whereas Sadness was the least prototypical yet the most subtle and ambiguous. An exploratory analysis quantified the prevalence of prototypical expression patterns by computing the ratio of prototypicality to complexity. Consistent with previous meta-analytic evidence [34], the analysis yielded a mean ratio of only 9.07%, indicating that over 90% of spontaneous expressions were dominated by non-prototypical patterns.
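The effect sizes reported for the three ANOVAs can be checked against their F statistics, because partial eta squared equals F × df_effect / (F × df_effect + df_error). The short sketch below applies this identity to the reported values; the function name is illustrative.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its dfs."""
    return f_value * df_effect / (f_value * df_effect + df_error)


reported_f = {"prototypicality": 50.138, "complexity": 13.209, "ambiguity": 4.144}
effect_sizes = {
    name: round(partial_eta_squared(f, 4, 245), 3) for name, f in reported_f.items()
}
print(effect_sizes)
```

The recovered values (0.450, 0.177, and 0.063) agree with the effect sizes reported above.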

Separate OLS regression analyses were then conducted to evaluate the predictive effects of facial prototypicality, complexity, and ambiguity on emotion classification. To facilitate direct comparison of effect sizes, these parameters were standardized (z-scored) prior to analysis. Figure 6 illustrates the influence of these parameters on emotion classification performance. Both prototypicality and complexity positively predicted the machine classification of target emotions (ps < 0.05). Specifically, each standard deviation increase in prototypicality and complexity was associated with a 26.33% and 6.79% rise in decoding accuracy, respectively. In contrast, ambiguity negatively predicted emotion recognition performance (p < 0.01), with each standard deviation increase in ambiguity associated with a 12.64% reduction in accuracy.

5. Discussion

The growing interest in ecologically valid facial expression stimuli has spurred the development of numerous spontaneous FEDBs over recent decades [22,46]. Despite this proliferation, research often benchmarks emotion classification performance against individual databases that may not capture the full diversity of real-world expressions [41]. The lack of cross-corpus validation can risk obscuring algorithmic biases that remain undetected in single-database studies [20]. To bridge this gap, the present study provides a systematic cross-corpus evaluation of AFFDEX, offering a comprehensive assessment of how the commercially available software navigates the nuances of spontaneous, dynamic facial behavior.

Consistent with prior work [20], our results corroborate the difficulty of inferring spontaneous emotions from facial behavior alone. Although the system performed reliably above chance, recognition accuracy varied considerably across emotion categories, database contexts, and demographic factors. Such sources of variability are often minimized in posed or single-database paradigms, resulting in homogeneous expression patterns and limited contextual diversity. Under such constrained conditions, model performance might be overestimated without adequately testing cross-domain generalizability [6]. The present findings demonstrate how evaluation outcomes vary when the affective computing system is assessed across spontaneous datasets that differ in emotion-eliciting methods and participant characteristics, thereby accounting for the variability inherent in real-life affective behavior [16]. Moreover, the observed gender disparities in classification accuracy emphasize the imperative of accounting for demographic diversity during model training and evaluation to mitigate the risk of perpetuating algorithmic biases in real-world applications.

Detailed FACS analysis revealed the model’s decision boundaries for emotion classification. Several AUs aligned with BET prototypes emerged as the most influential predictors driving the system’s classifications. While prior studies have shown that BET-derived models can achieve human-level accuracy for posed displays [35], the present study exposes critical pitfalls of operationalizing such prototypical patterns when decoding spontaneous expressions, which are inherently more heterogeneous. Moreover, the model disproportionately anchored its decisions on a small subset of highly weighted AUs, with limited sensitivity to their relational dependencies. This simplified AU-emotion mapping likely contributed to the high false-positive rates observed in the confusion matrix. For instance, AU12 was treated as a diagnostic cue for happiness, even in the absence of other markers associated with genuine happiness (e.g., AU6) or its co-occurrence with AUs indicative of other emotions. Such decision heuristics may inflate happiness recognition rates while constraining the model’s capacity to discriminate among other affective states.

Beyond the analysis of individual facial movements, this study further evaluated model performance using facial parameters that capture the global properties of expressions, offering a complementary perspective on the challenges of decoding spontaneous emotions. The findings reaffirm and extend prior work derived from posed portrayals by systematically documenting the heterogeneous, subtle, and ambiguous patterns of spontaneous expressions, while quantifying their impact on emotion classification accuracy [47]. Facial portrayals characterized by greater prototypicality and complexity were classified more accurately than heterogeneous or subtle expressions. In contrast, ambiguous expressions signaling multiple affective states were more prone to misclassification, reflecting elevated uncertainty in emotion categorization [13]. These findings validate the utility of the three facial parameters as scalable benchmarks for capturing meaningful variation in emotion expression and recognition. Future advances in emotion recognition techniques could benefit from database design or feature engineering approaches that explicitly accommodate variability across these dimensions, such as adopting soft-labeling annotation schemes that preserve distributional information about ambiguity and uncertainty [48], as well as developing probabilistic models capable of representing or decoding blended emotional states [49].

Collectively, these findings underscore the necessity of cross-corpus validation using spontaneous datasets as a more stringent evaluation criterion for affective computing systems. However, it is important to acknowledge that technical parameters, such as face-box size and video resolution, which vary across databases and can influence performance estimates [20], were not explicitly modeled in the current analyses. While our sampling strategy aimed to balance representativeness across databases differing in emotion-eliciting contexts and demographic composition, the limited number of portrayals per condition may have reduced statistical power and constrained generalizability. In addition, while the use of the AFFDEX system provides a standardized benchmark, emotion classifiers differ substantially in model architecture, theoretical assumptions, and training procedures, resulting in systematic differences in performance and bias [35]. Future studies should therefore incorporate systematic multi-classifier comparisons and benchmark computational systems against human perceptual judgements to establish convergent validity.

Author Contributions

Conceptualization, E.G.K., Y.B. and H.K.; methodology, Y.B. and H.K.; formal analysis, Y.B.; writing—original draft preparation, Y.B. and E.G.K.; writing—review and editing, Y.B., H.K. and E.G.K.; supervision, E.G.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author(s).

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Ekman, P. Basic Emotions. In Handbook of Cognition and Emotion; Dalgleish, T., Power, M.J., Eds.; John Wiley & Sons, Ltd.: Hoboken, NJ, USA, 2005; pp. 45–60.
2. Cheong, J.H.; Jolly, E.; Xie, T.; Byrne, S.; Kenney, M.; Chang, L.J. Py-Feat: Python facial expression analysis toolbox. Affect. Sci. 2023, 4, 781–796.
3. Chang, D.; Yin, Y.; Li, Z.; Tran, M.; Soleymani, M. LibreFace: An open-source toolkit for deep facial expression analysis. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 8205–8215.
4. Srinivasan, R.; González, B.S.M. The role of empathy for artificial intelligence accountability. J. Responsible Technol. 2022, 9, 100021.
5. Hu, J.; Mathur, L.; Liang, P.P.; Morency, L.P. OpenFace 3.0: A Lightweight Multitask System for Comprehensive Facial Behavior Analysis. arXiv 2025, arXiv:2506.02891.
6. Cohn, J.F.; Ertugrul, I.O.; Chu, W.S.; Girard, J.M.; Jeni, L.A.; Hammal, Z. Affective facial computing: Generalizability across domains. In Multimodal Behavior Analysis in the Wild; Academic Press: Cambridge, MA, USA, 2019; pp. 407–441.
7. Lucey, P.; Cohn, J.F.; Kanade, T.; Saragih, J.; Ambadar, Z.; Matthews, I. The extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition-Workshops, San Francisco, CA, USA, 13–18 June 2010; pp. 94–101.
8. Dawel, A.; Miller, E.J.; Horsburgh, A.; Ford, P. A systematic survey of face stimuli used in psychological research 2000–2020. Behav. Res. Methods 2022, 54, 1889–1901.
9. Pan, Z.; Tan, H.; Liu, S.; Fang, X. Beyond the basic six, static, and WEIRD: Exploring the range of emotions conveyed by facial expressions. J. Exp. Soc. Psychol. 2026, 122, 104836.
10. Kawulok, M.; Celebi, M.E.; Smolka, B. (Eds.) Advances in Face Detection and Facial Image Analysis; Springer: Cham, Switzerland, 2016; Volume 1.
11. Mollahosseini, A.; Hasani, B.; Mahoor, M.H. AffectNet: A database for facial expression, valence, and arousal computing in the wild. IEEE Trans. Affect. Comput. 2019, 10, 18–31.
12. Jia, S.; Wang, S.; Hu, C.; Webster, P.J.; Li, X. Detection of genuine and posed facial expressions of emotion: Databases and methods. Front. Psychol. 2021, 11, 580287.
13. Kim, H.; Küster, D.; Girard, J.M.; Krumhuber, E.G. Human and machine recognition of dynamic and static facial expressions: Prototypicality, ambiguity, and complexity. Front. Psychol. 2023, 14, 1221081.
14. Bian, Y.; Küster, D.; Liu, H.; Krumhuber, E.G. Understanding naturalistic facial expressions with deep learning and multimodal large language models. Sensors 2023, 24, 126.
15. Hassin, R.R.; Aviezer, H.; Bentin, S. Inherently Ambiguous: Facial Expressions of Emotions, in Context. Emot. Rev. 2013, 5, 60–65.
16. Barrett, L.F.; Adolphs, R.; Marsella, S.; Martinez, A.M.; Pollak, S.D. Emotional Expressions Reconsidered: Challenges to Inferring Emotion from Human Facial Movements. Psychol. Sci. Public Interest 2019, 20, 1–68.
17. Benitez-Quiroz, C.F.; Srinivasan, R.; Martinez, A.M. EmotioNet: An Accurate, Real-Time Algorithm for the Automatic Annotation of a Million Facial Expressions in the Wild. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 5562–5570.
18. Stöckli, S.; Schulte-Mecklenbeck, M.; Borer, S.; Samson, A.C. Facial expression analysis with AFFDEX and FACET: A validation study. Behav. Res. Methods 2018, 50, 1446–1460.
19. Tcherkassof, A.; Dupré, D. The emotion–facial expression link: Evidence from human and automatic expression recognition. Psychol. Res. 2021, 85, 2954–2969.
20. Krumhuber, E.G.; Küster, D.; Namba, S.; Skora, L. Human and machine validation of 14 databases of dynamic facial expressions. Behav. Res. Methods 2021, 53, 686–701.
21. Ryumina, E.; Dresvyanskiy, D.; Karpov, A. In search of a robust facial expressions recognition model: A large-scale visual cross-corpus study. Neurocomputing 2022, 514, 435–450.
22. Kim, H.; Bian, Y.; Krumhuber, E.G. A Review of 25 Spontaneous and Dynamic Facial Expression Databases of Basic Emotions. Affect. Sci. 2025, 6, 380–394.
23. Cowen, A.S.; Keltner, D. What the face displays: Mapping 28 emotions conveyed by naturalistic expression. Am. Psychol. 2020, 75, 349–364.
24. Srinivasan, R.; Martinez, A.M. Cross-Cultural and Cultural-Specific Production and Perception of Facial Expressions of Emotion in the Wild. IEEE Trans. Affect. Comput. 2021, 12, 707–721.
25. Cowen, A.S.; Keltner, D.; Schroff, F.; Jou, B.; Adam, H.; Prasad, G. Sixteen facial expressions occur in similar contexts worldwide. Nature 2021, 589, 251–257.
26. Dominguez-Catena, I.; Paternain, D.; Galar, M. Metrics for Dataset Demographic Bias: A Case Study on Facial Expression Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 5209–5226.
27. Ryumina, E.; Karpov, A. Facial Expression Recognition using Distance Importance Scores Between Facial Landmarks. In Proceedings of the 30th International Conference on Computer Graphics and Machine Vision (GraphiCon 2020), Part 2, St. Petersburg, Russia, 22–25 September 2020; pp. 1–10.
28. Bishay, M.; Preston, K.; Strafuss, M.; Page, G.; Turcot, J.; Mavadati, M. Affdex 2.0: A real-time facial expression analysis toolkit. In Proceedings of the 2023 IEEE 17th International Conference on Automatic Face and Gesture Recognition (FG), Waikoloa Beach, HI, USA, 5–8 January 2023; pp. 1–8.
29. Chaves, F.E.; de Araujo, T.P.; Maia, J.E.B. Facial expression recognition: A cross-database evaluation of features and classifiers. J. Intell. Comput. 2019, 10, 34.
30. Ramis, S.; Buades, J.M.; Perales, F.J.; Manresa-Yee, C. A novel approach to cross dataset studies in facial expression recognition. Multimed. Tools Appl. 2022, 81, 39507–39544.
31. Matsumoto, D.; Hwang, H.C. Judgments of subtle facial expressions of emotion. Emotion 2014, 14, 349–357.
32. Fernández-Dols, J.M.; Crivelli, C. Emotion and expression: Naturalistic studies. Emot. Rev. 2013, 5, 24–29.
33. Berrios, R.; Totterdell, P.; Kellett, S. Eliciting mixed emotions: A meta-analysis comparing models, types, and measures. Front. Psychol. 2015, 6, 428.
34. Durán, J.I.; Fernández-Dols, J.-M. Do emotions result in their predicted facial expressions? A meta-analysis of studies on the co-occurrence of expression and emotion. Emotion 2021, 21, 1550–1569.
35. Dupré, D.; Krumhuber, E.G.; Küster, D.; McKeown, G.J. A performance comparison of eight commercially available automatic classifiers for facial affect recognition. PLoS ONE 2020, 15, e0231968.
36. McDuff, D.; Mahmoud, A.; Mavadati, M.; Amr, M.; Turcot, J.; Kaliouby, R.E. AFFDEX SDK: A Cross-Platform Real-Time Multi-Face Expression Recognition Toolkit. In Proceedings of the 2016 CHI Conference Extended Abstracts on Human Factors in Computing Systems, San Jose, CA, USA, 7–12 May 2016; pp. 3723–3726.
37. Ekman, P.; Friesen, W.V.; Hager, J.C. The Facial Action Coding System: A Technique for the Measurement of Facial Movement; Consulting Psychologists Press: San Francisco, CA, USA, 2002.
38. Sneddon, I.; McRorie, M.; McKeown, G.; Hanratty, J. The Belfast induced natural emotion database. IEEE Trans. Affect. Comput. 2012, 3, 32–41.
39. Zhang, X.; Yin, L.; Cohn, J.F.; Canavan, S.; Reale, M.; Horowitz, A.; Liu, P.; Girard, J.M. BP4D-Spontaneous: A high-resolution spontaneous 3D dynamic facial expression database. Image Vis. Comput. 2014, 32, 692–706.
40. Mavadati, M.; Sanger, P.; Mahoor, M.H. Extended DISFA dataset: Investigating posed and spontaneous facial expressions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 1–8.
41. Ertugrul, I.O.; Cohn, J.F.; Jeni, L.A.; Zhang, Z.; Yin, L.; Ji, Q. Crossing domains for AU coding: Perspectives, approaches, and measures. IEEE Trans. Biom. Behav. Identity Sci. 2020, 2, 158–171.
42. Saganowski, S.; Komoszyńska, J.; Behnke, M.; Perz, B.; Kunc, D.; Klich, B.; Kaczmarek, Ł.D.; Kazienko, P. Emognition dataset: Emotion recognition with self-reports, facial expressions, and physiology using wearables. Sci. Data 2022, 9, 158.
43. Gross, J.J.; Levenson, R.W. Emotion elicitation using films. Cogn. Emot. 1995, 9, 87–108.
44. Iliyasu, R.; Etikan, I. Comparison of quota sampling and stratified random sampling. Biom. Biostat. Int. J. 2021, 10, 24–27.
45. McDuff, D.J.; Kaliouby, R.E.; Sénéchal, T.; Amr, M.; Cohn, J.F.; Picard, R.W. Affectiva-MIT Facial Expression Dataset (AM-FED): Naturalistic and Spontaneous Facial Expressions Collected “In-the-Wild”. In Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops, Portland, OR, USA, 23–28 June 2013; pp. 881–888.
46. Krumhuber, E.G.; Skora, L.; Küster, D.; Fou, L. A review of dynamic datasets for facial expression research. Emot. Rev. 2017, 9, 280–292.
47. Reisenzein, R.; Studtmann, M.; Horstmann, G. Coherence between emotion and facial expression: Evidence from laboratory experiments. Emot. Rev. 2013, 5, 16–23.
48. Cabitza, F.; Campagner, A.; Basile, V. Toward a perspectivist turn in ground truthing for predictive computing. Proc. AAAI Conf. Artif. Intell. 2023, 37, 6860–6868.
49. Cai, J.; Meng, Z.; Khan, A.S.; Li, Z.; O’Reilly, J.; Tong, Y. Probabilistic attribute tree structured convolutional neural networks for facial expression recognition in the wild. IEEE Trans. Affect. Comput. 2022, 14, 1927–1941.

Figure 1. Mean recognition accuracy of emotions. Error bars represent the standard errors of the mean. The dashed horizontal line indicates chance-level performance (0.20).

Figure 2. Mean recognition accuracy of databases. Error bars represent the standard errors of the mean. The dashed horizontal line indicates chance-level performance (0.20).

Figure 3. Mean recognition accuracy of emotion for each gender. Error bars represent the standard errors of the mean. Asterisks denote significant gender differences (p < 0.05), with Bonferroni corrections applied to each comparison. The dashed horizontal line indicates chance-level performance (0.20).

Figure 4. Confusion matrix across five emotion categories.

Figure 5. Distribution of prediction weights across the full set of AUs for each emotion. Values represent the proportion of the total absolute coefficient magnitudes attributable to each individual AU.

Figure 6. Predictive effects of prototypicality (left), complexity (middle), and ambiguity (right) on emotion classification. Regression lines depict the relationship between standardized predictor values and recognition accuracy; shaded areas represent 95% confidence intervals.

Table 1. Selection Characteristics of Facial Stimuli from Five Spontaneous FEDBs.

Database	Emotion					Gender
	Disgust	Fear	Happiness	Sadness	Surprise	Male	Female
BINED	10	10	10	10	10	50	50
BP4D	10	10	10	10	10	50	50
DISFA	10	10	10	10	10	50	50
EB+	10	10	10	10	10	50	50
Emognition	10	10	10	10	10	50	50
Total	50	50	50	50	50	125	125

Table 2. Classification Performance Metrics.

Emotion	Precision	Recall	Specificity	F1-Score	FPR	NPV
Disgust	0.293	0.240	0.855	0.264	0.145	0.818
Fear	0.364	0.160	0.930	0.222	0.070	0.816
Happiness	0.356	0.940	0.575	0.516	0.425	0.975
Sadness	0.700	0.140	0.985	0.233	0.015	0.821
Surprise	0.364	0.240	0.895	0.289	0.105	0.825

Note. FPR = False Positive Rate; NPV = Negative Predictive Value.
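
To make the definitions behind Table 2 explicit, the sketch below computes the six metrics from one-vs-rest confusion counts. The function name is hypothetical, and the counts used for Happiness (47 true positives, 85 false positives, 3 false negatives, 115 true negatives) are not reported in the article. They are back-derived here from the Happiness row of Table 2 under the assumption of 50 happiness portrayals among 250 stimuli in total, as listed in Table 1.

```python
def one_vs_rest_metrics(tp, fp, fn, tn):
    """Compute the Table 2 metrics for a single emotion category.

    tp, fp, fn, tn are one-vs-rest confusion counts; all denominators
    are assumed to be non-zero.
    """
    precision = tp / (tp + fp)      # correct among predicted positives
    recall = tp / (tp + fn)         # correct among actual positives
    specificity = tn / (tn + fp)    # correct among actual negatives
    f1 = 2 * precision * recall / (precision + recall)
    fpr = fp / (fp + tn)            # equals 1 - specificity
    npv = tn / (tn + fn)            # correct among predicted negatives
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "fpr": fpr,
        "npv": npv,
    }


# Assumed counts for Happiness, back-derived from Table 2 (not reported).
happiness = one_vs_rest_metrics(tp=47, fp=85, fn=3, tn=115)
for name, value in happiness.items():
    print(f"{name}: {value:.3f}")
```

Under these assumed counts the output agrees with the Happiness row to three decimals. The combination of very high recall and low precision reflects how often the other four emotions were labeled as happiness.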

Table 3. Relative Contribution of Facial AUs for Emotion Predictions.

Action Units		Emotion
		Disgust	Fear	Happiness	Sadness	Surprise
AU1	Inner brow raiser	0.233	−0.047	0.118	1.792	0.978
AU2	Outer brow raiser	0.042	−0.914	−0.984	−0.130	6.528
AU4	Brow lowerer	0.044	0.117	−0.226	4.519	−0.249
AU5	Upper lid raiser	−0.065	6.929	−0.028	−0.060	0.847
AU6	Cheek raiser	−0.196	0.324	3.283	0.059	−0.187
AU7	Lid tightener	0.606	0.138	−0.291	0.836	−0.044
AU9	Nose wrinkler	7.385	0.124	−0.649	−0.240	−0.495
AU10	Upper lip raiser	1.712	−0.124	−0.324	−0.359	−0.190
AU12	Lip corner puller	−0.714	−1.094	24.417	0.011	−0.187
AU14	Dimpler	0.013	−0.068	0.082	0.008	0.022
AU15	Lip corner depressor	0.022	−0.184	0.157	0.515	−0.208
AU17	Chin raiser	0.116	−0.468	−0.070	0.225	0.140
AU18	Lip pucker	0.013	−0.035	−0.051	−0.107	−0.012
AU20	Lip stretcher	−0.038	0.224	0.896	0.153	−0.094
AU24	Lip presser	0.030	−0.009	−0.044	−0.205	−0.072
AU25	Lips part	0.403	−0.054	0.779	−0.662	2.077
AU26	Jaw drop	−0.128	−0.210	−0.110	0.009	2.380
AU28	Lip suck	−0.187	0.116	−0.620	−0.314	0.042
AU43	Eye closure	−0.035	0.101	0.029	−0.034	−0.030

Note. AUs associated with theoretical prototypes are shown in bold. The regression coefficients represent the change in emotion ratings per one standard unit increase in AU activity.
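
The caption of Figure 5 describes prediction weights as each AU's share of the total absolute coefficient magnitude. As a minimal sketch of that normalization, the snippet below applies it to the Happiness column of Table 3. The function name is hypothetical, and it is an assumption that the values plotted in Figure 5 were obtained from exactly these coefficients in this way.

```python
# Happiness column of Table 3 (regression coefficient per AU).
happiness_coefs = {
    "AU1": 0.118, "AU2": -0.984, "AU4": -0.226, "AU5": -0.028,
    "AU6": 3.283, "AU7": -0.291, "AU9": -0.649, "AU10": -0.324,
    "AU12": 24.417, "AU14": 0.082, "AU15": 0.157, "AU17": -0.070,
    "AU18": -0.051, "AU20": 0.896, "AU24": -0.044, "AU25": 0.779,
    "AU26": -0.110, "AU28": -0.620, "AU43": 0.029,
}


def relative_weights(coefs):
    """Return each AU's share of the summed absolute coefficients."""
    total = sum(abs(v) for v in coefs.values())
    if total == 0:
        raise ValueError("all coefficients are zero")
    return {au: abs(v) / total for au, v in coefs.items()}


weights = relative_weights(happiness_coefs)

# List the three AUs with the largest share for this emotion.
top = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:3]
for au, share in top:
    print(f"{au}: {share:.3f}")
```

Because the shares use absolute values, they convey how strongly an AU drives the prediction but not the direction of its effect. The sign remains available in Table 3.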

