A Multimodal Transformer-Based Framework for Emotion Analysis in Multilingual Video Content

Yakut, Sehmus; Tuten, Yusuf Taha; Caglar, Eren; Aktas, Mehmet S.

doi:10.3390/computers15020077



¹ Department of Computer Engineering, Yildiz Technical University, 34220 Istanbul, Turkey

² R&D Center, Aktif Bank, 34220 Istanbul, Turkey

* Authors to whom correspondence should be addressed.

† This article is a revised and expanded version of a paper entitled “Facial Stress and Fatigue Recognition via Emotion Weighting: A Deep Learning Approach”, which was presented at the 25th International Conference on Computational Science and Its Applications (ICCSA 2025), Istanbul, Türkiye, 30 June–3 July 2025.

‡ These authors contributed equally to this work.

Computers 2026, 15(2), 77; https://doi.org/10.3390/computers15020077

Submission received: 13 November 2025 / Revised: 14 January 2026 / Accepted: 15 January 2026 / Published: 1 February 2026

(This article belongs to the Special Issue Computational Science and Its Applications 2025 (ICCSA 2025))

Abstract

This research addresses the challenge of inferring complex psychological states, including stress, fatigue, anxiety, cognitive load, and boredom, from facial expressions. We propose an interpretable, literature-informed emotion-weighting methodology that transforms the eight-emotion probability outputs of facial emotion recognition models into continuous estimates of these five psychological states using weights derived from the Valence–Arousal framework, providing a principled bridge between discrete emotion predictions and higher-level affective constructs. The proposed formulation is evaluated across six representative deep learning architectures—a baseline CNN (ResNet-50), a modern CNN (ConvNeXt), a hybrid attention-based model (DDAMFN), and three Transformer-based models (ViT, BEiT, and Swin). Our results demonstrate that strong performance on discrete FER tasks does not directly translate to consistent behavior in complex state inference; instead, architectures capable of preserving subtle and distributed affective cues yield more stable and interpretable state estimates, with DDAMFN and Vision Transformer models exhibiting the most consistent performance across the evaluated psychological states. These findings highlight the central role of the proposed emotion-weighting formulation and the importance of architecture selection beyond categorical accuracy in complex affective state analysis.

Keywords:

facial expression recognition (FER); computer vision; deep learning; CNN; transformer; emotion recognition; stress; fatigue; boredom; anxiety; cognitive load

1. Introduction

The analysis of human emotions has become increasingly critical in our technology-driven world, permeating diverse fields from human-computer interaction and healthcare to security and automotive safety. Facial Emotion Recognition (FER), powered by advances in deep learning, particularly neural networks, has emerged as a prominent area within affective computing. Neural networks, with their capacity to learn intricate patterns from visual data, have enabled significant strides in automatically recognizing basic emotions like happiness, sadness, anger, fear, disgust, and surprise from facial expressions. This progress has fueled a wide array of applications, including personalized user interfaces, mental health monitoring tools, and enhanced driver assistance systems designed to detect driver drowsiness, stress, cognitive overload, or distraction [1,2].

Despite these advancements in recognizing fundamental emotions, a significant gap remains in the direct detection of more complex and nuanced emotional states such as stress, fatigue, anxiety, cognitive load, and boredom. While current FER systems excel at identifying sub-emotions like anger or sadness, they do not inherently provide a direct measure of overarching states like stress, fatigue, or boredom, which are crucial indicators of well-being and performance in various real-life scenarios. This limitation stems from the fact that stress and fatigue (and related states) are not singular, basic emotions, but rather complex constructs often manifested through combinations of subtle facial cues and underlying emotional patterns that exist on continuous dimensional axes, such as the Valence-Arousal (V-A) space [3]. Therefore, simply identifying basic emotions, while valuable, is insufficient for directly assessing an individual’s stress, fatigue, or cognitive load level.

This paper addresses this critical literature gap by introducing a novel weighting-based formulation to predict a multi-dimensional vector of five complex psychological states: stress, fatigue, anxiety, cognitive load, and boredom, directly from facial expressions. Our approach leverages the outputs of existing FER models, specifically the predicted probabilities or percentages of basic emotions, and combines them using empirically derived weights. These weights, crucially, are not arbitrarily assigned but are informed by a comprehensive review of psychological, neurological, and sociological literature that elucidates the relationship between basic emotions and the complex states of stress, fatigue, and other low-arousal or high-load conditions [4,5]. By systematically assigning weights to emotions such as anger, fear, sadness, happiness, surprise, disgust, and neutral, we create a quantifiable method to translate the sub-emotion outputs of FER models into meaningful stress, fatigue, anxiety, load, and boredom scores. This formulation allows us to move beyond the recognition of isolated basic emotions and towards a more holistic understanding of emotional state as it relates to this full spectrum of complex states. To rigorously evaluate the effectiveness of this approach, we conduct a comparative study employing various deep learning models, including both Convolutional Neural Networks (CNNs) [6] and Transformer-based architectures, including modern benchmarks such as ConvNeXt [7] and hierarchical vision transformers like the Swin Transformer [8]. These models, when integrated with our weighting formulation, are assessed against a ground truth dataset of facial expressions annotated for all five complex states. This evaluation allows us to determine not only the validity of our proposed formulation but also to benchmark the performance of different deep learning architectures in the context of complex emotional state prediction.

Specifically, this research seeks to answer the following key questions: How effectively can a weighting-based formulation, utilizing sub-emotion outputs from FER models, predict complex emotional states like stress, fatigue, anxiety, cognitive load, and boredom? Which deep learning architectures, when combined with this formulation, demonstrate superior performance in facial expression-based complex state detection? And to what extent are empirically derived weights, grounded in psychological literature, valid in capturing the relationship between basic emotions and these multifaceted states as manifested in facial expressions?

To address these questions, our primary objectives are to develop and validate this novel weighting formulation, to rigorously evaluate the performance of diverse deep learning models within this framework, and to provide a comparative analysis of their effectiveness.

The key contribution of this paper lies in bridging the gap between basic emotion recognition and the assessment of a broader range of complex psychological states. We present a literature-informed and empirically evaluated method for estimating stress, fatigue, anxiety, cognitive load, and boredom from facial expressions, providing an interpretable formulation that extends the use of existing FER outputs beyond categorical emotion prediction. Rather than claiming direct applicability, the proposed approach offers a structured framework that may support future developments in application domains such as health monitoring, safety, and education technology (EdTech) [9,10], as well as well-being analysis.

The literature reviewed in Section 2 highlights a clear need for novel approaches that go beyond basic emotion classification and directly address the challenge of detecting complex psychological states from facial expressions. Current FER systems primarily output probabilities for discrete emotion categories but lack a mechanism to translate these outputs into a meaningful assessment of a multi-dimensional state vector encompassing stress, fatigue, anxiety, cognitive load, and boredom. To bridge this gap, a methodology is required that can effectively leverage the outputs of existing FER models to infer more complex psychological states.

While this research focuses on validating the proposed weighting formulation using hierarchical and attention-based features, future iterations of this framework will explore the integration of spatiotemporal models (e.g., 3D-CNNs and Video Transformers) to capture the dynamic temporal evolution of emotional states. Furthermore, subsequent studies will incorporate rigorous cross-dataset validation and fairness analysis as core components to address potential dataset-specific biases and ensure demographic equity in automated affective computing systems.

2. Related Works

Facial Emotion Recognition (FER) has established itself as a pivotal domain within affective computing, spurred by the growing necessity for systems capable of perceiving and reacting to human emotional states across various sectors, including automotive safety, healthcare, and human-computer interaction [11,12]. The integration of deep learning, specifically Convolutional Neural Networks (CNNs), has fundamentally transformed FER, allowing automated frameworks to attain high precision in classifying basic facial expressions [13]. Initial breakthroughs concentrated largely on categorizing discrete emotions—such as fear, happiness, sadness, anger, disgust, and surprise—utilizing benchmarks like AffectNet and FER-2013 for training and evaluation [3,13]. Consequently, CNN-based backbones, notably VGGNet [14] and ResNet [6], emerged as the standard, exhibiting strong capabilities in extracting spatial features pertinent to emotion classification.

In recent years, Transformer architectures, initially designed for natural language tasks, have demonstrated remarkable potential in computer vision, including the FER domain [15,16]. Vision Transformers (ViTs) and their variants utilize attention mechanisms to model long-range dependencies across images. This attribute is hypothesized to be especially advantageous for deciphering subtle facial expressions where emotional indicators are dispersed across disjoint facial regions [17,18]. Furthermore, hybrid architectures that merge the local feature extraction efficiency of CNNs with the global context modeling of Transformers have gained traction as a robust direction in FER research [19].

The evolution of these architectures has proceeded rapidly. Although Transformers recently rose to prominence, modernized CNNs, such as ConvNeXt [7], have reclaimed a competitive edge. Rather than incorporating attention modules, this was accomplished by systematically updating the ResNet design with Transformer-inspired principles. Key modifications included the adoption of large 7 × 7 depthwise convolution kernels, the implementation of inverted bottleneck structures (similar to MobileNetV2), and the substitution of BatchNorm with LayerNorm throughout the network. Specific adaptations like EmoNeXt [20] have tailored these contemporary ConvNet designs for FER, often integrating Spatial Transformer Networks (STN) to explicitly target salient facial zones, thereby reaching state-of-the-art accuracy. Concurrently, the quadratic computational costs associated with ViTs on high-resolution inputs were mitigated by hierarchical vision transformers like the Swin Transformer [8]. By computing self-attention within non-overlapping local windows and utilizing a “shifted window” technique for cross-window interaction, Swin achieves linear complexity while constructing a multi-scale feature hierarchy. This structure is particularly effective for FER, as facial expressions comprise features at varying scales (e.g., broad jaw movements vs. fine eye wrinkles). Consequently, architectures like Swin-FER [21,22] have shown robust performance in analyzing these intricate regional relationships, proving resilient in uncontrolled, ’in-the-wild’ environments.

Nevertheless, relying on static image analysis fails to capture the inherently dynamic essence of emotions, which are evolving processes rather than fixed snapshots [23]. This realization has steered the field toward spatiotemporal architectures. Early attempts included hybrid systems fusing 2D-CNN extractors with LSTMs to track the temporal progression of expressions [24], as well as end-to-end 3D-CNNs [25]. By convolving across both temporal and spatial dimensions, 3D models allow for motion feature extraction, which is essential for identifying transient, low-intensity events like micro-expressions. Contemporary methods for micro-expression detection employ techniques such as multi-scale 3D-ResNets to capture temporal resolution data [26], or squeeze-and-excitation fusion strategies to prioritize discriminative spatiotemporal channels [27]. Most recently, Video Transformers have redefined the learning paradigm [28]. Models such as TimeSformer [29] have proven that a purely convolution-free “divided space-time attention” mechanism can effectively model long-range temporal dependencies. This enables the analysis of an emotional state’s full trajectory, while self-supervised pre-training frameworks like SVFAP [30] facilitate the development of potent video-based affect models by learning from vast repositories of unlabeled video through audio-visual correspondence.

Parallel to the analysis of facial expressions, the domain of Talking Head Generation is advancing rapidly, prioritizing the synthesis of emotionally expressive and realistic facial animations. Generative approaches, such as recent deep learning-based talking head systems [31], reinforce the necessity of modeling and understanding the nuances of facial expressions and their intrinsic emotional content, albeit from a synthesis rather than analysis perspective [32].

Despite the progress in classifying basic emotions, a significant deficiency remains in the direct and robust detection of complex psychological states, specifically anxiety, cognitive load, boredom, stress, and fatigue based solely on facial expressions. While current FER frameworks can detect sub-emotions correlated with these conditions (e.g., fear, anger, sadness), they lack the inherent ability to measure these overarching constructs directly. This limitation is critical for real-world deployment, particularly in well-being monitoring and safety-critical contexts [1,2]. Research into detecting these complex states has historically relied on contextual data or physiological sensors, with limited investigation into methods that infer them directly from the nuanced spatiotemporal dynamics of facial imagery via advanced deep learning [33,34].

For example, anxiety detection has been explored using deep learning “face parsing” frameworks, which employ semantic segmentation to isolate specific regions (e.g., periocular muscle activity, forehead tension) linked to sustained negative affect [35]. Alternative approaches offer mathematical frameworks to define previously ambiguous states, mapping facial cues onto the continuous Valence-Arousal (V-A) space, where “nervousness” is objectively characterized as high arousal combined with negative valence [36,37]. Similarly, cognitive load assessment has traditionally depended on intrusive hardware. Eye-trackers monitor physiological proxies like microsaccade rate reduction [38] and task-evoked pupillometry [5]. While accurate, these techniques have motivated the shift toward non-invasive, vision-based methods that aim to predict subjective cognitive load from video streams by correlating visual features with self-reported mental effort [39].

Boredom presents a unique challenge. Often described as the “silent emotion” in sociological studies [4] due to its internal nature (defined often as a lack of meaning) and subtle external presentation, it is a complex state characterized by negative valence and low arousal [40]. It is a key factor influencing well-being and performance in areas ranging from online education [9,10] to manufacturing [41]. Consequently, automated boredom recognition has largely focused on multimodal systems utilizing invasive sensors like electrocardiograms (ECG), electroencephalograms (EEG), and galvanic skin response (GSR) [41,42]. Although accurate, the reliance on contact-based sensors restricts scalability. Thus, non-invasive vision-based approaches are gaining traction. The Facial Action Coding System (FACS) [43,44] provides a basis for this, linking boredom to a constellation of subtle muscle movements, or Action Units (AUs). Key AUs identified by academic and commercial systems include AU43 (Eyes Closed), AU24 (Lip Pressor), AU23 (Lip Tightener), and AU14 (Dimpler) [45,46,47]. Building on this, recent work has validated purely vision-based approaches by using deep learning on continuous facial landmark patterns to detect academic emotions, including boredom, in video streams [9,10,48,49].

Furthermore, a large portion of existing literature is restricted to single-dataset experiments, ignoring the critical issues of demographic bias [50,51] and cross-dataset generalization [52,53], which are prerequisite for creating equitable and robust real-world systems. This review emphasizes a clear necessity for novel methodologies that surpass basic emotion classification to address the detection of complex psychological states directly from facial data. Current FER systems typically output probability distributions for discrete emotion categories but lack the mechanism to translate these into meaningful assessments of multi-dimensional states such as fatigue, stress, anxiety, boredom, and cognitive load. To bridge this gap, a methodology is required that effectively leverages the outputs of established FER models—specifically the recognized sub-emotions—to infer these higher-level constructs.

Moreover, a scientifically grounded and systematic approach is needed to define the relationship between basic emotions and the multifaceted constructs of boredom, fatigue, and stress, moving beyond purely data-driven correlations toward a theoretically motivated and interpretable framework. This paper contributes to the field by introducing and empirically evaluating a novel, literature-informed weighting methodology that maps discrete emotion predictions to estimates of complex psychological states.

Prior research on large-scale computational systems and software engineering has laid essential conceptual foundations for modern learning-based architectures operating over complex and heterogeneous data pipelines. Early work on Grid computing and service-oriented architectures highlighted the challenges of scalability, interoperability, and performance management in distributed environments, emphasizing the need for modular and extensible system designs [54]. Systems such as VLab and iSERVO further demonstrated how loosely coupled services and portal-based interfaces enable collaborative, data-intensive workflows while maintaining architectural flexibility and long-term sustainability [55,56]. From a complementary software engineering perspective, structural concerns such as code cloning were shown to significantly impact maintainability and evolution of complex systems, motivating systematic, metric-based analysis approaches [57]. Collectively, these studies provide a critical architectural and methodological context for contemporary deep learning frameworks, including Transformer-based models, where scalability, modular design, and maintainability remain central challenges.

In this study, we evaluate the proposed formulation using a diverse suite of modern Transformer-based and CNN architectures to analyze facial features from static imagery. While fairness-aware evaluation, cross-dataset validation, and temporal modeling are vital for real-world deployment, they remain beyond the scope of this work and represent important avenues for future research. Specifically, extending this framework to video-based spatiotemporal models and conducting systematic demographic bias analyses constitute promising directions for further investigation.

3. Methodology

This research establishes a comprehensive framework for the non-invasive detection of stress and fatigue by interpreting facial expression dynamics. By synergizing advanced deep learning architectures with a scientifically grounded, weighted emotion-correlation formulation, our approach bridges the gap between discrete emotion recognition and physiological state inference. This section delineates the multi-stage pipeline, detailing the deep learning architectures deployed for Facial Expression Recognition (FER), the datasets curated for training and validation, and the derivation of the proposed stress and fatigue quantification metrics (see Figure 1).

3.1. Deep Learning Models for Emotion Recognition

To ensure robust feature extraction across varying facial nuances, we employed a diverse suite of deep learning models. This selection encompasses traditional Convolutional Neural Networks (CNNs), pure Transformer-based architectures, and hybrid networks. These models constitute the foundational perception module, generating probability distributions for eight basic emotions which serve as the inputs for our stress and fatigue inference logic.

3.1.1. ResNet-50 (CNN)

Serving as the benchmark for convolutional architectures, ResNet-50 [6] was selected for its proven stability in visual recognition tasks. Its core innovation, the deep residual learning framework, utilizes skip connections to mitigate the vanishing gradient problem inherent in deep networks. By learning residual functions with reference to layer inputs, ResNet-50 effectively captures hierarchical spatial features essential for facial expression classification, providing a strong baseline against newer Transformer-based approaches.

3.1.2. DDAMFN (Dual-Direction Attention Mixed Feature Network)

Representing hybrid architectures, the Dual-Direction Attention Mixed Feature Network (DDAMFN) [19] was integrated to combine the local feature extraction strengths of CNNs with the dependency modeling of attention mechanisms. DDAMFN is explicitly designed for FER tasks, employing a dual-direction attention head to capture both local facial landmarks and global spatial dependencies. To ensure consistent evaluation, this model was implemented using the original DDAMFN dual-attention structure.

3.1.3. ViT (Vision Transformer)

To evaluate the efficacy of pure attention-based mechanisms, we incorporated the Vision Transformer (ViT) [15]. Diverging from the inductive biases of CNNs, ViT treats image patches as sequences of tokens. We utilized the CLIP-ViT-Base variant (openai/clip-vit-base-patch16), which uses a 16 × 16 patch size. Unlike standard supervised pre-training, this model leverages representations learned from contrastive language-image pre-training, capturing global contextual cues that are critical for recognizing subtle emotional states.

3.1.4. BEiT (Bidirectional Encoder Representations from Image Transformers)

Building upon the Transformer paradigm, BEiT [16] was included to leverage self-supervised learning representations. We employed the BEiT-Base architecture (microsoft/beit-base-patch16-224-pt22k-ft22k), characterized by a 16 × 16 patch size and 224 × 224 input resolution. Notably, this model utilizes weights pre-trained and fine-tuned on the ImageNet-22k dataset. By learning to reconstruct masked patches, the model develops rich, bidirectional visual representations that enhance downstream emotion classification performance.

3.1.5. ConvNeXt

ConvNeXt [7] was employed to represent the modernization of convolutional networks. It integrates architectural choices popularized by Transformers while retaining the efficiency of standard convolutions. For this study, we utilized the ConvNeXt-Large variant (facebook/convnext-large-224), optimized for a 224 × 224 input resolution. This larger capacity model allows us to rigorously assess whether scaling up modern CNN designs can bridge the performance gap with Transformers in facial expression analysis.

3.1.6. Swin Transformer

The Swin Transformer [8] serves as a hierarchical vision transformer in our suite. We instantiated the Swin-Large architecture (microsoft/swin-large-patch4-window7-224). This configuration utilizes a patch size of 4, a window size of 7, and operates on a 224 × 224 input resolution. By introducing shifted windows for local attention, it effectively maintains global context through hierarchical layers, making it particularly adept at capturing both fine-grained micro-expressions and holistic facial configurations.
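To make the patch and window settings quoted above concrete, the following sketch counts the tokens produced at a 224 × 224 input and compares the number of query-key pairs under global attention with the number under non-overlapping window attention. It is an illustration only: the function names are ours, it looks only at the first Swin stage (later stages merge patches), and it ignores the class token used by ViT-style models.

```python
def grid_tokens(image_size: int, patch_size: int) -> int:
    """Number of patch tokens for a square image cut into square patches."""
    side = image_size // patch_size
    return side * side


def global_attention_pairs(num_tokens: int) -> int:
    """Query-key pairs when every token attends to every token."""
    return num_tokens * num_tokens


def windowed_attention_pairs(image_size: int, patch_size: int, window_size: int) -> int:
    """Query-key pairs when attention is restricted to non-overlapping windows."""
    side = image_size // patch_size
    num_windows = (side // window_size) ** 2
    tokens_per_window = window_size * window_size
    return num_windows * tokens_per_window * tokens_per_window


vit_tokens = grid_tokens(224, 16)   # 14 x 14 = 196 tokens (patch size 16)
swin_tokens = grid_tokens(224, 4)   # 56 x 56 = 3136 tokens (patch size 4)

full = global_attention_pairs(swin_tokens)
local = windowed_attention_pairs(224, 4, 7)
print(f"ViT/BEiT tokens: {vit_tokens}, Swin stage-1 tokens: {swin_tokens}")
print(f"global pairs: {full}, windowed pairs: {local}, ratio: {full / local:.0f}x")
```

With a fixed window size the windowed count grows linearly in the number of tokens, which is why small patches remain affordable in the hierarchical design.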

3.2. Training and Implementation Strategy

The experimental framework was standardized across all evaluated architectures, including ViT, BEiT, ResNet-50, DDAMFN, Swin Transformer, and ConvNeXt, to ensure a consistent performance baseline. Each model was trained for a maximum of 30 epochs on the FERPlus dataset to classify facial expressions into eight discrete categories. The optimization process utilized the AdamW optimizer with a learning rate of $1 \times 10^{- 4}$ and a batch size of 32. To mitigate overfitting, an early stopping strategy with a patience of 3 epochs was employed based on validation loss. Additionally, the model state achieving the lowest validation loss was saved and used for final evaluation.
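The stopping rule described above (at most 30 epochs, patience of 3 on validation loss, best checkpoint kept) can be sketched as follows. This is a minimal illustration rather than the authors' training script: the function name is hypothetical, the validation losses are supplied as a plain list instead of coming from a real training loop, and "improvement" is assumed to mean a strictly lower loss.

```python
from typing import Sequence, Tuple


def early_stopping_epoch(
    val_losses: Sequence[float], patience: int = 3, max_epochs: int = 30
) -> Tuple[int, int]:
    """Return (best_epoch, last_epoch_run), both 1-indexed.

    best_epoch is the epoch with the lowest validation loss seen so far,
    i.e. the point at which a checkpoint would be saved.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    last_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if epoch > max_epochs:
            break
        last_epoch = epoch
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch  # a real loop would save the model state here
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # patience exhausted: stop training
    return best_epoch, last_epoch


# Made-up validation losses, for illustration only.
losses = [1.00, 0.80, 0.70, 0.75, 0.72, 0.71, 0.60]
print(early_stopping_epoch(losses))  # (3, 6): best at epoch 3, stopped after epoch 6
```

Note that in this toy trace the loss at epoch 7 would have been lower; a patience-based rule trades that possibility for shorter training.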

To ensure statistical robustness and reproducibility, we employed a stratified 5-fold cross-validation protocol. All experiments were executed using fixed random seeds, with a global seed of 42 and per-fold seeds 42, 43, 44, 45, 46. Deterministic behavior was enforced where supported (e.g., torch.backends.cudnn.deterministic = True); any remaining non-deterministic operations stem from low-level GPU kernels and are consistent across all models.
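A stratified 5-fold split with a fixed seed can be sketched in plain Python as below. This is an assumption-laden illustration, not the code used in the study: the helper name is hypothetical, the round-robin assignment is one simple way to keep class proportions similar across folds, and the per-fold seeds are only listed, since the text does not say what each one seeds.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Sequence, Tuple

GLOBAL_SEED = 42
FOLD_SEEDS = [42, 43, 44, 45, 46]  # listed in the text; usage per fold is assumed


def stratified_kfold(
    labels: Sequence[Hashable], k: int = 5, seed: int = GLOBAL_SEED
) -> List[Tuple[List[int], List[int]]]:
    """Return k (train_indices, val_indices) pairs with per-class balance."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # local generator keeps the split reproducible
    folds: List[List[int]] = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin across folds.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)

    splits = []
    for i in range(k):
        val = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        splits.append((train, val))
    return splits


toy_labels = [0] * 10 + [1] * 5  # toy class labels, not FERPlus data
for fold_seed, (train, val) in zip(FOLD_SEEDS, stratified_kfold(toy_labels)):
    print(fold_seed, len(train), len(val))
```

Calling the function twice with the same seed returns identical splits, which is the property the fixed seeds are meant to guarantee.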

The experiments were run in the following software environment: Python 3.9.16; PyTorch 1.13.1; torchvision 0.14.1; timm 0.6.12; transformers 4.30.0; numpy 1.24.3; scipy 1.10.1; CUDA 11.7; and cuDNN 8.4.1.

We used Cross-Entropy Loss as the objective function for all classification experiments. This unified configuration ensures that observed performance variances are primarily attributable to the inherent architectural characteristics of each model.
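For reference, the cross-entropy objective for a single sample with eight emotion classes can be written out as below. The sketch uses only the standard library and hypothetical function names; an actual training run would rely on the framework's built-in loss. The softmax step also shows where the eight-way probability vector described in Section 3.1 comes from.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Negative log-probability assigned to the true class index."""
    return -math.log(softmax(logits)[target])


# Made-up logits for eight emotion classes; index 3 is the assumed true class.
logits = [0.2, -1.0, 0.5, 2.3, 0.0, -0.7, 0.1, -1.5]
print(round(cross_entropy(logits, target=3), 4))
```

An uninformative model that outputs equal scores for all eight classes incurs a loss of ln 8, which gives a useful reference point when reading training curves.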

All fine-tuning experiments were performed on an NVIDIA A100 GPU. Under this hardware configuration, training each model took several hours, with the exact duration depending on architectural complexity; all reported results were obtained from fully converged training runs.

3.3. Datasets

The experimental framework relies on two distinct datasets: the FERPlus dataset for training the emotion recognition backbones, and a custom-curated ground truth dataset for validating the stress and fatigue detection logic.

3.3.1. FERPlus

For model training, we utilized FERPlus [58], a rigorously re-annotated extension of the FER-2013 dataset. It contains $48 \times 48$ grayscale facial images categorized into eight emotions. FERPlus was selected for its improved label reliability compared to its predecessor, achieved through crowd-sourced voting which mitigates ambiguity in expression labeling. The use of grayscale imagery further enhances model robustness by enforcing learning based on structural facial deformations rather than chromatic variations.

To improve the generalization capability of the model and mitigate overfitting, several data augmentation techniques were integrated into the training pipeline.

Since all evaluated architectures were pretrained on large-scale RGB datasets, FERPlus images ($48 \times 48$, grayscale) were converted to three-channel inputs via channel replication. The images were then resized to $224 \times 224$ using bilinear interpolation to match the input resolution of the pretrained backbones.

During training, random horizontal flipping, random rotation ($\pm 10^{\circ}$), and random resized cropping (scale range $[0.8, 1.0]$) were applied to promote rotation- and scale-invariant feature learning. During evaluation, images were resized to $256 \times 256$ followed by a center crop of $224 \times 224$.

Finally, all inputs were normalized using ImageNet statistics. The data were split into training (80%), validation (15%), and test (5%) sets.
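The deterministic parts of this preprocessing can be sketched without any imaging library. The helpers below are hypothetical and operate on single values rather than full images; the mean and standard deviation are the commonly used ImageNet channel statistics, which we assume are the ones meant here, and the integer rounding in the split is likewise an assumption.

```python
from typing import Tuple

# Commonly used ImageNet channel statistics (assumed to match the paper's setup).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def replicate_and_normalize(gray_pixel: int) -> Tuple[float, float, float]:
    """Copy one 8-bit grayscale value into R, G, B and normalize each channel."""
    value = gray_pixel / 255.0  # scale to [0, 1]
    return tuple((value - m) / s for m, s in zip(IMAGENET_MEAN, IMAGENET_STD))


def center_crop_box(resized: int = 256, crop: int = 224) -> Tuple[int, int, int, int]:
    """(left, top, right, bottom) of the evaluation-time center crop."""
    offset = (resized - crop) // 2
    return offset, offset, offset + crop, offset + crop


def split_sizes(n: int, percents: Tuple[int, int, int] = (80, 15, 5)) -> Tuple[int, int, int]:
    """Train/validation/test sizes; the test set absorbs any rounding remainder."""
    train = n * percents[0] // 100
    val = n * percents[1] // 100
    return train, val, n - train - val


print(replicate_and_normalize(128))  # same gray value, three different outputs
print(center_crop_box())             # (16, 16, 240, 240)
print(split_sizes(1000))             # (800, 150, 50)
```

Because the three channels carry the same gray value but are normalized with different statistics, the replicated image is not channel-identical after normalization, which is expected with pretrained RGB backbones.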

3.3.2. Ground Truth Dataset for Validation

To rigorously validate the proposed stress and fatigue formulation, a ground truth dataset was constructed. To ensure statistical robustness and provide a comprehensive evaluation, the dataset consists of 1347 frames extracted at 5-s intervals from diverse video sources. In total, the dataset represents 1347 unique individuals. The age distribution spans a broad adult range, with a mean age of 34.7 years and a median age of 33 years.

The dataset includes a gender distribution of 52% male and 48% female participants. In terms of cultural diversity, subjects originate from multiple geographic regions at the continent level, including Europe (31%), Asia (29%), North America (21%), South America (9%), Africa (7%), and Oceania (3%).

Three independent evaluators with backgrounds in psychology and emotion recognition annotated the frames according to established physiological definitions. Final labels were assigned based on majority agreement.

–: Stress: Primarily manifested as a state of high-arousal mental strain, driven by angry and fear, while also incorporating elements of sad affect to represent persistent tension and agitation.
–: Fatigue: Defined as a state of significantly reduced physiological and mental energy, dominated by sad and neutral expressions, reflecting emotional depletion and a lack of active engagement.
–: Anxiety: Characterized as an apprehensive state of unease, heavily weighted by fear and supplemented by angry, signifying a combination of high-vigilance distress and internal restless energy.
–: Boredom: Defined as a state of psychological disengagement resulting from low stimulation, primarily correlated with neutral and sad states, which indicate a lack of environmental interest and diminished attentiveness.
–: Cognitive Load: Represents the intensity of mental processing and effort, derived from a combination of neutral (focused concentration), angry (processing strain), and surprise (reaction to unexpected complexity).
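The majority-agreement rule used to merge the three evaluators' annotations can be illustrated as follows. The function name and the handling of a three-way disagreement (returning `None` so the frame can be reviewed or discarded) are assumptions; the text does not specify how such cases were resolved.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_label(votes: Sequence[str]) -> Optional[str]:
    """Return the label chosen by a strict majority of annotators, else None."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None


print(majority_label(["stress", "stress", "fatigue"]))   # stress
print(majority_label(["stress", "fatigue", "boredom"]))  # None (no majority)
```

With three annotators, a strict majority means at least two matching votes, so only complete disagreement leaves a frame unlabeled under this sketch.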

3.4. Emotion-Based Formulation for Stress and Fatigue

A core contribution of this study is the derivation of high-level psychological states (stress and fatigue) from basic emotion probabilities. We postulate that these states are linear combinations of specific emotional expressions. Consequently, we developed a weighted summation formulation where each of the eight recognized emotions contributes to the total stress or fatigue score based on coefficients derived from affective computing and neuroscience literature.

3.5. Coefficient Assignment and Justification

The coefficients ($w \in [0, 1]$) presented in Table 1 were determined through an empirical review of the relationship between facial affect and physiological arousal.

To establish the final mapping between basic emotions and complex states, we conducted an empirical optimization process. Specifically, the coefficients were determined using a grid search algorithm, in which multiple combinations of weights were systematically explored. The candidate coefficients were categorized into small ( $w < 0.2$ ), medium ( $0.2 \leq w < 0.6$ ), and large ( $w \geq 0.6$ ) ranges. Through iterative evaluation of all grid configurations, we selected the optimal weight set that achieved the highest sensitivity to transitions in mental states while effectively minimizing noise, as detailed below.

–: Anger & Fear ( $w_{anxiety} = 0.7, w_{stress} = 0.6, w_{cl} = 0.45$ ): These high-arousal emotions are the primary drivers for Stress and Anxiety. While Fear is the dominant predictor for Anxiety, Anger (manifested as brow furrowing) serves as a critical indicator for Cognitive Load (CL) and Stress, representing the mental effort and tension associated with task-demands [59,60].
–: Sadness ( $w_{fatigue} = 0.8, w_{stress} = 0.4, w_{boredom} = 0.3$ ): Sadness is a unique low-arousal state. We found that a large coefficient ( $0.8$ ) is necessary to accurately model Fatigue, while medium values are optimal for representing the underlying emotional depletion in Boredom and chronic Stress [33,61].
–: Neutral ( $w_{boredom} = 0.65, w_{cl} = 0.55, w_{fatigue} = 0.4$ ): In our experiments, the Neutral state emerged as a high-weight component for passive states. It received a large coefficient for Boredom (0.65) and an upper-medium coefficient for Cognitive Load (0.55), reflecting the “blank stare” of disengagement and the intense focused concentration required during heavy information processing, respectively [62].
–: Surprise & Disgust ( $w_{cl} = 0.35, w_{stress} = 0.3$ ): These act as secondary modifiers. Surprise was optimized with a medium coefficient for Cognitive Load to capture reactions to unexpected data complexity, while Disgust serves as a proxy for aversion-related Stress [63].
–: Happiness ( $w \approx 0$ ): Consistently across all tests, Happiness acted as a stress buffer. We assigned minimal coefficients ( $0.05$ ) or zero to ensure that positive affect effectively neutralizes the accumulation of negative complex states, maintaining the model’s robustness against false positives [64].
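
The three weight ranges used to organize the grid can be made concrete with a small helper. The sketch below is illustrative only: the function name `weight_category` and the dictionary layout are assumptions introduced here, and only coefficient values quoted in the list above are used (the full set is given in Table 1 and is not reproduced).

```python
def weight_category(w: float) -> str:
    """Classify a coefficient into the small / medium / large ranges from the text."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("coefficients are defined on [0, 1]")
    if w < 0.2:
        return "small"
    if w < 0.6:
        return "medium"
    return "large"


# Coefficients quoted in the list above, keyed by (emotion, state).
# The container itself is hypothetical; it is not the authors' data structure.
quoted_weights = {
    ("sadness", "fatigue"): 0.8,
    ("sadness", "stress"): 0.4,
    ("sadness", "boredom"): 0.3,
    ("neutral", "boredom"): 0.65,
    ("neutral", "cognitive_load"): 0.55,
    ("neutral", "fatigue"): 0.4,
}

categories = {pair: weight_category(w) for pair, w in quoted_weights.items()}
for (emotion, state), label in categories.items():
    print(f"{emotion:8s} -> {state:15s}: {label}")
```

Under these ranges, the Neutral weight for Cognitive Load (0.55) falls in the medium band, just below the 0.6 boundary, while the Neutral weight for Boredom (0.65) is large.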

3.6. Emotion-Based Formulation for Anxiety, Cognitive Load, and Boredom

We extend this emotion-based formulation to derive scores for anxiety, cognitive load, and boredom. This method is designed to bridge the well-documented “semantic gap” identified in the literature: most FER systems successfully classify discrete emotions [13] but lack a clear, theoretically grounded mechanism for translating those outputs into more abstract, high-level psychological constructs. By utilizing the same expanded set of eight emotions, our formulation aims to capture a nuanced affective input vector that serves as a robust foundation for inferring this wider range of complex psychological states.

3.7. Coefficient Assignment and Justification for Anxiety, Cognitive Load, Boredom

To determine the coefficients for each emotion’s contribution to anxiety, cognitive load, and boredom, we again conducted a thorough review of existing literature. This review specifically targeted psychological and affective neuroscience literature that maps discrete emotion expressions onto continuous dimensional models, such as the Valence-Arousal (V-A) space. This V-A framework is particularly effective for modeling these complex states: anxiety (high-arousal, negative-valence) [36,37], boredom (low-arousal, negative-valence) [4,40], and cognitive load (which can manifest as high arousal/frustration or low arousal/disengagement) [5,38]. Table 2 details the coefficients used.

The rationale for these weights is as follows: Anger contributes to anxiety (0.3) via the ‘fight’ response and to cognitive load (0.45) as a marker of frustration [39]. Fear is the primary driver of anxiety (0.7), aligning with ‘flight’ responses and hyper-vigilance [35]. Sadness, a low-arousal negative state, receives a medium weight for boredom (0.3) [4]. Neutral expressions are pivotal; they serve as the primary indicator for boredom (0.65) due to passive disengagement, and for cognitive load (0.55) representing the ‘focused stare’ where facial mimicry is suppressed [5]. Surprise indicates moments of confusion or error detection, contributing moderately to cognitive load (0.35) [48]. Happiness is assigned a 0.0 coefficient, serving as an inhibitor for these negative states.

3.8. Score Calculation

The continuous stress and fatigue scores are computed for each frame t using the weighted sum of the probability vector P output by the FER models:

$$\text{Stress Score}_{t} = \sum_{i=1}^{8} \left( P_{i} \times \text{Stress Coeff}_{i} \right) \tag{1}$$

$$\text{Fatigue Score}_{t} = \sum_{i=1}^{8} \left( P_{i} \times \text{Fatigue Coeff}_{i} \right) \tag{2}$$

Here, $P_{i}$ denotes the $i$-th entry of $P$, that is, the normalized output (logit) of the model’s final layer for emotion $i$.

Score Calculation for Anxiety, Cognitive Load, Boredom

Utilizing these coefficients, the scores for the additional complex states are calculated for each frame using the same weighted summation approach.

$$\text{Anxiety Score}_{t} = \sum_{i=1}^{8} \left( P_{i} \times \text{Anxiety Coeff}_{i} \right) \tag{3}$$

$$\text{Cognitive Load Score}_{t} = \sum_{i=1}^{8} \left( P_{i} \times \text{Cognitive Load Coeff}_{i} \right) \tag{4}$$

$$\text{Boredom Score}_{t} = \sum_{i=1}^{8} \left( P_{i} \times \text{Boredom Coeff}_{i} \right) \tag{5}$$
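
The weighted summation can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the emotion ordering, the names `EMOTIONS`, `COEFFS`, `state_score`, and `all_state_scores`, and the example frame are assumptions. Only the weights stated in the rationale of Section 3.7 are filled in; any weight not stated there is set to 0.0 purely as a placeholder, and Table 2 holds the actual values.

```python
# Assumed output order of the FER model (alphabetical, as listed in the text).
EMOTIONS = ("anger", "contempt", "disgust", "fear",
            "happiness", "neutral", "sadness", "surprise")

# Weights quoted in Section 3.7. Missing entries default to 0.0, which is a
# placeholder assumption and NOT a value taken from Table 2.
COEFFS = {
    "anxiety": {"fear": 0.7, "anger": 0.3, "happiness": 0.0},
    "cognitive_load": {"neutral": 0.55, "anger": 0.45, "surprise": 0.35,
                       "happiness": 0.0},
    "boredom": {"neutral": 0.65, "sadness": 0.3, "happiness": 0.0},
}


def state_score(probs, coeffs):
    """Weighted sum over the eight emotions: sum_i P_i * coeff_i."""
    if len(probs) != len(EMOTIONS):
        raise ValueError("expected one probability per emotion")
    return sum(p * coeffs.get(e, 0.0) for e, p in zip(EMOTIONS, probs))


def all_state_scores(probs):
    """Score every state in COEFFS for a single frame."""
    return {state: state_score(probs, c) for state, c in COEFFS.items()}


# Hypothetical frame: made-up probabilities that sum to 1.
frame = (0.10, 0.00, 0.00, 0.20, 0.00, 0.50, 0.10, 0.10)
scores = all_state_scores(frame)
for state, value in scores.items():
    print(f"{state}: {value:.3f}")
```

Because each score is a plain dot product, the same function covers Equations (1) to (5); only the coefficient row changes.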

3.9. Thresholding for Classification

To transition from continuous scores to binary classification (e.g., Stressed/Not Stressed), we employed an empirical thresholding strategy. Given that different model architectures (CNN vs. Transformer) produce probability distributions with varying scales and characteristics, a universal threshold is statistically inappropriate.

To ensure the generalizability of the proposed framework, a 5-fold cross-validation protocol was adopted. The optimal threshold for each model was determined on the training folds and evaluated on the held-out test fold. The values presented in Tables 3–7 are the mean optimized thresholds across all folds of the 1347-frame dataset.
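
The threshold-selection protocol can be sketched as follows. The text does not state which criterion defines the "optimal" threshold or how folds are formed, so this sketch assumes F1 maximization on the training folds and a simple interleaved split; the function names and the toy data are hypothetical.

```python
def f1(y_true, y_pred):
    """F1-score for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(scores, labels):
    """Candidate threshold with the highest F1 (ties resolved to the lowest)."""
    best_t, best_f = 0.0, -1.0
    for t in sorted(set(scores)):
        preds = [1 if s >= t else 0 for s in scores]
        f = f1(labels, preds)
        if f > best_f:
            best_t, best_f = t, f
    return best_t


def cross_validated_threshold(scores, labels, k=5):
    """Fit a threshold on k-1 folds, test on the held-out fold, average both."""
    n = len(scores)
    thresholds, test_f1s = [], []
    for fold in range(k):
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        t = best_threshold([scores[i] for i in train_idx],
                           [labels[i] for i in train_idx])
        preds = [1 if scores[i] >= t else 0 for i in test_idx]
        thresholds.append(t)
        test_f1s.append(f1([labels[i] for i in test_idx], preds))
    return sum(thresholds) / k, sum(test_f1s) / k


# Toy data: 20 frames with scores 0.00 .. 0.95; frames 10..19 are positive.
toy_scores = [i / 20 for i in range(20)]
toy_labels = [1 if i >= 10 else 0 for i in range(20)]
mean_threshold, mean_f1 = cross_validated_threshold(toy_scores, toy_labels)
print(f"mean threshold = {mean_threshold:.3f}, mean test F1 = {mean_f1:.3f}")
```

Fitting the threshold only on training folds keeps the held-out fold free of selection bias, which is the point of reporting a mean threshold per model rather than a single tuned value.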

4. Experiments & Results

This section presents the experimental results evaluating our proposed methodology for stress and fatigue detection from facial expressions. This evaluation includes a comprehensive benchmark of six deep learning architectures, spanning foundational CNNs (ResNet-50), modern CNNs (ConvNeXt), attention-based hybrid models (DDAMFN), and various Transformer architectures (ViT, BEiT, Swin). The evaluation is structured to first assess the baseline sub-emotion recognition performance of all models using the FERPlus dataset, and subsequently, to validate the effectiveness of our emotion-based formulation for stress and fatigue detection using a human-annotated ground truth dataset.

4.1. Results of Sub-Emotion Recognition Using FERPlus

Prior to evaluating stress and fatigue detection, we assessed the baseline performance of each deep learning model in recognizing sub-emotions using the FERPlus dataset. The FERPlus test set, comprising approximately 3000 images, was used to measure the accuracy of each model in classifying the eight basic emotions: anger, contempt, disgust, fear, happiness, neutral, sadness, and surprise. Table 8 presents the emotion recognition accuracy achieved by each model.

As shown in Table 8, the baseline sub-emotion recognition performance varied significantly across architectures. The hybrid DDAMFN model achieved the highest accuracy (84.19%). Closely following were the modern CNN, ConvNeXt (83.24%), and the standard Vision Transformer, ViT (82.68%). This aligns with our literature review, where ConvNeXt was specifically designed to compete with Transformers by modernizing CNN principles [7]. The standard CNN (ResNet-50) and the hierarchical Swin Transformer both underperformed relative to the other models on this specific task (76.15% and 75.79%, respectively). This suggests that for static FER, the global attention of ViT [15] or the specialized design of DDAMFN may be more effective than the “shifted-window” approach of Swin, which is often optimized for more complex scene-parsing tasks [8].

These results indicate that high performance on static categorical emotion benchmarks is primarily associated with architectures that emphasize clear class-boundary discrimination rather than subtle feature continuity. In particular, DDAMFN and ViT appear to benefit from mechanisms that prioritize high-intensity emotional anchors, allowing them to excel when target labels are visually prototypical and discretely separable. Conversely, Swin’s weaker performance in this phase is consistent with its inductive bias toward hierarchical feature aggregation, which is not fully exploited in isolated static-image settings.

4.2. Experimental Results: Ground Truth Dataset Performance

To evaluate the effectiveness of our emotion-based formulation for all five complex psychological states, we assessed the performance of all six benchmark architectures when applied to a human-annotated ground truth dataset. This dataset comprises 1347 frames labeled by human evaluators for stress, fatigue, anxiety, cognitive load, and boredom, and provides a direct validation of how well our approach aligns with human perception of these complex emotional states. Performance was evaluated using a 5-fold cross-validation protocol, with 95% confidence intervals (CIs) estimated through non-parametric bootstrap resampling to ensure generalizability across data partitions.
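
A non-parametric bootstrap interval of this kind can be sketched as below. The text does not specify the bootstrap variant, the number of resamples, or how resampling interacts with the folds, so the percentile method, the 2000 resamples, the fixed seed, and the per-frame correctness vector are all assumptions made for illustration.

```python
import random


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample n items with replacement and record the mean of each resample.
    means = sorted(
        sum(values[rng.randrange(n)] for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical per-frame outcomes: 1 if the prediction matched the annotation.
correct = [1] * 80 + [0] * 20
accuracy = sum(correct) / len(correct)
lower, upper = bootstrap_ci(correct)
print(f"accuracy = {accuracy:.2f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
```

The same routine applies to any per-frame statistic; for metrics such as F1 that are not simple means, the resampled frames would be passed to the metric function instead of being averaged.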

This evaluation setting differs fundamentally from the FERPlus benchmark: rather than classifying peak, prototypical expressions, models are required to decode blended affective states that emerge from overlapping emotional cues. As a result, architectures reliant on globally dominant visual patterns show reduced stability, while models capable of capturing localized hierarchical dependencies demonstrate comparatively stronger performance.

Tables 9–13 present the Accuracy, Precision, Recall, and F1-scores for each state, as evaluated on this 1347-frame dataset. These results collectively confirm that performance on conventional FER benchmarks is not a sufficient indicator of downstream utility in cognitive-emotional inference, reinforcing the need for architectures that can operate on multi-dimensional affective structures rather than discrete categorical boundaries.

The results presented in Tables 9–13 do not point to a single universally superior model. Instead, the findings indicate that model performance is state-dependent. While Swin demonstrates competitive outcomes on several categories, it is not the dominant architecture overall, and the reported metrics reflect scenario-specific strengths rather than global superiority.

In accordance with the measured results, the DDAMFN architecture emerges as one of the most consistently competitive performers, particularly in tasks that require structured feature aggregation and discriminative spatial encoding. Its FERPlus accuracy, the highest among the evaluated models (Table 8), aligns with its stable performance on complex state inference, suggesting that its hybrid design preserves expressive cues that contribute to generalizable affective representation.

Similarly, the ViT model exhibits strong outcomes across multiple evaluated states, demonstrating that global attention mechanisms remain effective when subtle emotional indicators are distributed across non-localized regions. While ViT may not produce the highest score in every category, its balanced performance profile positions it as one of the strongest overall candidates in the reported results.

The results do not support an interpretation in which Swin is the highest-performing model. Instead, the comparative analysis indicates that no single architecture consistently outperforms the others across all emotional and cognitive states. Performance varies as a function of attention structure, spatial encoding behavior, and the arousal profile of the target construct.

Therefore, a more accurate interpretation is that DDAMFN and ViT represent the most robust general-purpose solutions in this framework, whereas Swin demonstrates situational advantages that are restricted to certain low-arousal or regionally localized affective states. This observation is consistent with the tabulated F1, Precision, and Recall values and reflects a model-behavior alignment grounded in measurable evidence rather than categorical assumptions.

In summary, the data supports a task-specific interpretation of architecture suitability: DDAMFN and ViT show the most reliable performance distributions, Swin exhibits specialized strengths rather than broad dominance, and BEiT follows as an intermediate alternative. Consequently, architectural selection for future work should be guided by the affective profile of the target domain rather than FERPlus-only outcomes or isolated peaks in model performance.

4.3. Discussion of ROC–AUC Results

The ROC–AUC results presented in Tables 14–18 offer a comparative evaluation of the six architectures on the 1347-frame dataset using 5-fold cross-validation. This protocol was employed to ensure statistical soundness, reduce sampling bias, and provide fold-level variability insight. Confidence intervals represent deviation across folds rather than single-run variation, offering a more stable estimate of performance consistency.
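
For reference, ROC–AUC can be computed directly from continuous state scores as the probability that a randomly chosen positive frame scores higher than a randomly chosen negative one, with ties counted as one half. The sketch below uses that pairwise definition and summarizes fold-level variability with a mean and sample standard deviation; the function names and the toy folds are assumptions, not the authors' evaluation code.

```python
from statistics import mean, stdev


def roc_auc(scores, labels):
    """ROC-AUC via pairwise comparison of positive and negative scores."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fold_summary(fold_scores, fold_labels):
    """Mean and sample standard deviation of the per-fold AUC values."""
    aucs = [roc_auc(s, y) for s, y in zip(fold_scores, fold_labels)]
    return mean(aucs), stdev(aucs)


# Two hypothetical folds of state scores with binary annotations.
fold_scores = [[0.1, 0.4, 0.35, 0.8], [0.2, 0.3, 0.6, 0.9]]
fold_labels = [[0, 0, 1, 1], [0, 0, 1, 1]]
auc_mean, auc_sd = fold_summary(fold_scores, fold_labels)
print(f"AUC = {auc_mean:.3f} +/- {auc_sd:.3f}")
```

Because AUC depends only on the ranking of scores, it is unaffected by the per-model thresholds and by differences in score scale between architectures.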

The findings indicate that DDAMFN and ViT consistently achieve the most reliable performance across all target states. Although Swin demonstrates competitive outcomes, its behavior is state-dependent and lacks the cross-condition generalization exhibited by the former models. ViT provides stable representational capacity, while DDAMFN benefits from architectural sensitivity to micro-expressions and localized saliency cues.

BEiT and ConvNeXt follow closely behind, maintaining balanced sensitivity-specificity trade-offs. Their results suggest dependable mid-range discriminative capability without exhibiting the peak consistency of ViT or DDAMFN. The baseline CNN (ResNet-50) continues to underperform, reinforcing the limitations of strictly convolutional pipelines for subtle affective state discrimination.

From an operational perspective, the models can be characterized as follows:

–: DDAMFN and ViT deliver the most robust and stable ROC–AUC profiles, demonstrating state-invariant performance consistency.
–: Swin performs competitively but lacks the uniformity required for broad deployment scenarios.
–: BEiT and ConvNeXt provide balanced separability without extreme variance.
–: The baseline CNN (ResNet-50) exhibits limited feature retention and insufficient separability for real-world inference.

Accordingly, DDAMFN and ViT emerge as the most suitable architectures for generalized deployment. Their broader stability, consistency across folds, and resistance to class-condition drift position them as the most reliable choices for affective state recognition pipelines.

4.4. Qualitative Analysis and Error Diagnosis

To further evaluate the practical reliability and interpretability of the proposed weighting-based framework, a qualitative analysis was conducted on representative frames from the 1347-frame ground truth dataset. This analysis focuses on decomposing the model’s decision-making process for both successful inferences and failure modes, providing a transparent view of how sub-emotion distributions translate into complex state scores.

4.4.1. Analysis of Successful Inferences

The efficacy of the hierarchical feature extraction mechanism is most evident in frames where complex states are manifested through clear, albeit nuanced, facial configurations. For instance, in successful stress detection cases involving the Swin Transformer, the model typically yields high probability scores for ‘Fear’ and ‘Anger’. In one representative frame, the sub-emotion distribution was recorded as Fear (0.42), Anger (0.38), and Disgust (0.10). Applying the stress coefficients (0.5, 0.6, and 0.3, respectively) resulted in a continuous Stress Score significantly above the optimized threshold, aligning with the human annotators’ consensus. This demonstrates that the model successfully captures the high-arousal negative affect that characterizes psychological strain.

Similarly, in successful boredom detection, the model shifts its focus toward ‘Neutral’ and ‘Sadness’ probabilities. A typical successful case showed a Neutral probability of 0.78 and a Sadness probability of 0.15. Given the weights assigned to these emotions in the boredom formulation (0.65 for Neutral and 0.3 for Sadness), the derived score correctly triggered a positive classification. These examples substantiate the interpretability of our framework, as the final decision can be directly traced back to the presence of specific, literature-justified sub-emotional cues.
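
The traceability described above can be illustrated by decomposing each score into its per-emotion terms. The sketch below uses only the probabilities and coefficients quoted in this subsection; the helper name `contributions` is hypothetical. The sums are partial, because the probabilities of the remaining emotions in these frames are not reported, and the per-model thresholds are not reproduced here.

```python
def contributions(probs, coeffs):
    """Per-emotion terms P_i * w_i (largest first) and their sum."""
    terms = {e: p * coeffs[e] for e, p in probs.items() if e in coeffs}
    ranked = sorted(terms.items(), key=lambda kv: kv[1], reverse=True)
    return ranked, sum(terms.values())


# Stress frame and stress coefficients as quoted in the text.
stress_frame = {"fear": 0.42, "anger": 0.38, "disgust": 0.10}
stress_coeffs = {"fear": 0.5, "anger": 0.6, "disgust": 0.3}

# Boredom frame and boredom coefficients as quoted in the text.
boredom_frame = {"neutral": 0.78, "sadness": 0.15}
boredom_coeffs = {"neutral": 0.65, "sadness": 0.3}

stress_ranked, stress_total = contributions(stress_frame, stress_coeffs)
boredom_ranked, boredom_total = contributions(boredom_frame, boredom_coeffs)

for name, ranked, total in (("stress", stress_ranked, stress_total),
                            ("boredom", boredom_ranked, boredom_total)):
    parts = ", ".join(f"{e}={v:.3f}" for e, v in ranked)
    print(f"{name}: {parts} -> partial score {total:.3f}")
```

With the quoted values, the stress frame yields 0.210 (Fear) + 0.228 (Anger) + 0.030 (Disgust) = 0.468, and the boredom frame yields 0.507 (Neutral) + 0.045 (Sadness) = 0.552, so the dominant term in each case is directly visible.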

4.4.2. Error Diagnosis and Failure Modes

Despite the overall high performance, an analysis of false positives (FPs) and false negatives (FNs) reveals critical environmental and physiological challenges that impact model reliability. These failure cases are categorized into the following primary error modes:

–: Blended Expressions and Ambiguity: False positives in the anxiety category often occur when a subject exhibits a “startle” response that includes high ‘Surprise’ and ‘Fear’ probabilities. While the weighting formulation is designed to distinguish transient arousal from chronic anxiety, high-intensity blended expressions can occasionally inflate the scores beyond the decision boundary. This suggests an inherent challenge in separating acute emotional reactions from prolonged psychological states using static frames alone.
–: Extreme Head Poses and Occlusions: False negatives, particularly in fatigue and cognitive load detection, are frequently associated with non-frontal head poses (e.g., subjects looking down or away from the camera). Such poses can obscure key regional features like eyelid drooping (AU43) or brow furrowing. When these localized cues are partially occluded, the hierarchical attention mechanism may revert to a ‘Neutral’ prediction with lower confidence, causing the weighted score to fall just below the classification threshold.
–: Environmental Factors (Lighting and Resolution): In scenarios with low-intensity ambient lighting or low-resolution video streams, subtle micro-expressions indicative of stress—such as slight lip tightening or nostril dilation—become less discernible. The resulting smoothing effect on the spatial feature map often leads to an underestimation of negative affect probabilities, highlighting a dependency on image quality for the detection of the most nuanced psychological states.

By identifying these error modes, we provide a structured diagnostic for future research, emphasizing the need for multi-view data and temporal modeling to enhance the robustness of vision-based complex state inference across unconstrained environments.

5. Discussion

This study investigates the feasibility of inferring complex psychological states—stress, fatigue, anxiety, cognitive load, and boredom—from facial expressions by transforming discrete emotion probabilities into continuous state scores using a literature-informed weighting formulation. The results demonstrate that this formulation is both effective and robust, while also revealing that architectural suitability for complex state inference differs substantially from performance trends observed in standard facial emotion recognition (FER) benchmarks.

A key observation is that high accuracy on categorical FER tasks does not reliably predict performance in complex psychological state detection. While FERPlus accuracy reflects a model’s ability to discriminate prototypical emotional expressions, the detection of stress, fatigue, and related states requires sensitivity to blended, low-intensity, and spatially distributed facial cues. Consequently, architectures optimized for sharp class separation do not necessarily exhibit superior downstream performance when emotions must be interpreted as components of higher-level affective constructs.

Across the evaluated models, DDAMFN and Vision Transformer (ViT) exhibit the most consistent and reliable performance profiles. DDAMFN’s hybrid design, which integrates convolutional locality with attention-based saliency modeling, enables effective capture of both micro-expressions and region-specific facial dynamics. This architectural balance is reflected in its stable F1-scores and ROC–AUC values across all five target states, particularly for stress, fatigue, and boredom, where subtle facial deformations play a critical role.

Similarly, ViT demonstrates strong generalization across multiple psychological states, especially anxiety and cognitive load. Its global self-attention mechanism facilitates the aggregation of non-localized facial cues, allowing the model to integrate distributed affective signals that may not manifest as dominant regional patterns. Although ViT does not always achieve the highest score for every state, its low variance across folds and balanced precision–recall behavior indicate strong representational robustness.

The observed stability of these models (DDAMFN and ViT) stems from their ability to preserve a balanced representational framework; while DDAMFN captures localized saliency cues through its hybrid design, ViT integrates distributed affective signals without the restrictive local biases common in traditional convolutional pipelines.

Other architectures show more state-dependent behavior. BEiT and ConvNeXt provide competitive but intermediate performance, suggesting that self-supervised pretraining and modernized convolutional designs contribute positively to affective representation, albeit without the consistency observed in DDAMFN and ViT. In contrast, the baseline CNN (ResNet-50) consistently underperforms, highlighting the limitations of purely convolutional pipelines for nuanced psychological state inference.

The Swin Transformer demonstrates competitive performance in certain scenarios but lacks uniform dominance across states. Its hierarchical window-based attention appears beneficial for specific low-arousal or spatially localized affective patterns; however, its overall performance varies depending on the target construct. This indicates that hierarchical feature aggregation alone is insufficient to guarantee general-purpose robustness for complex state detection from static facial imagery. This variability suggests that Swin’s inductive bias, while optimized for hierarchical scene parsing, may impose a window-based restriction that potentially overlooks subtle, non-localized emotional indicators distributed across the face, unlike the global attention approach of ViT.

It should be emphasized that the reported F1-scores and ROC–AUC values quantify the separability of inferred psychological state scores under the proposed emotion-weighting formulation, rather than absolute diagnostic accuracy based on independently annotated ground-truth labels.

Importantly, the proposed emotion-weighting formulation remains effective across all evaluated architectures, indicating that its success is not model-specific. By grounding the state inference process in established psychological theory and the Valence–Arousal framework, the formulation provides interpretable and stable mappings from sub-emotions to complex states. Qualitative analysis further confirms that correct predictions are driven by emotion combinations consistent with theoretical expectations—for example, fear and anger dominating anxiety and stress, or neutral and sadness contributing strongly to boredom and fatigue.

Despite these strengths, several limitations must be acknowledged. The reliance on static facial imagery restricts the ability to distinguish transient emotional reactions from sustained psychological states, contributing to certain misclassification patterns, particularly for anxiety and cognitive load. Moreover, the reliance on FERPlus introduces inherent limitations, as the dataset primarily consists of posed expressions captured under controlled conditions, which may not fully reflect spontaneous, culturally diverse, or context-dependent affective behavior encountered in real-world scenarios. Incorporating temporal modeling through video-based architectures is therefore a promising direction for future work. In addition, broader cross-dataset validation and fairness-aware evaluation are necessary to ensure robustness across demographic and environmental variations.

Finally, it is essential to emphasize that complex psychological states such as stress, fatigue, and cognitive load are inherently multidimensional phenomena influenced by physiological processes, contextual factors, and individual differences. While facial expressions provide informative and non-invasive cues, inferring such states solely from facial appearance remains fundamentally limited. Accordingly, the predictions produced by the proposed framework should be interpreted as probabilistic estimates rather than definitive assessments.

6. Conclusions

This study presented a unified and interpretable framework for inferring complex psychological states—stress, fatigue, anxiety, cognitive load, and boredom—from facial expressions. By transforming discrete emotion probability outputs of facial emotion recognition (FER) models into continuous state estimates using a literature-informed weighting formulation, the proposed approach bridges the gap between categorical emotion recognition and higher-level affective state inference.

Comprehensive experiments across six representative deep learning architectures demonstrate that performance on standard FER benchmarks is not a sufficient indicator of effectiveness in complex state detection. Instead, architectures capable of preserving subtle, blended, and spatially distributed affective cues yield more reliable results. Within this context, DDAMFN and Vision Transformer (ViT) emerge as the most consistent and robust models across all evaluated psychological states, exhibiting stable F1-scores and ROC–AUC values under cross-validation. Other architectures, including BEiT, ConvNeXt, and Swin Transformer, demonstrate competitive but state-dependent behavior, underscoring the importance of task-aligned architectural selection rather than reliance on baseline accuracy alone.

A central contribution of this work lies in the proposed emotion-weighting formulation itself. Grounded in established psychological theory and the Valence–Arousal framework, the formulation provides an interpretable and model-agnostic mechanism for mapping basic emotions to complex psychological constructs. The consistency of performance across diverse architectures indicates that the formulation captures meaningful affective structure rather than exploiting model-specific artifacts. This interpretability further distinguishes the approach from purely data-driven methods, offering transparency that is critical for deployment in real-world, human-centered applications.

Despite the promising results, this study also highlights the inherent limitations of inferring complex psychological states from facial expressions alone. States such as stress, fatigue, anxiety, and cognitive load are multidimensional and influenced by physiological, contextual, and individual factors that may not be fully expressed through facial behavior. Consequently, the outputs of the proposed system should be interpreted as probabilistic indicators rather than definitive diagnoses. Facial analysis is therefore best positioned as a complementary modality, ideally integrated with temporal dynamics, contextual information, or physiological signals, to enable more reliable and holistic affective state assessment.

In conclusion, this work advances the field of affective computing by demonstrating that complex psychological state inference from facial expressions is both feasible and interpretable when grounded in theory and supported by appropriate architectural choices. The proposed framework provides a practical foundation for future research toward multimodal, temporally aware, and fairness-conscious systems capable of robustly estimating human psychological states in real-world environments.

Author Contributions

Conceptualization, M.S.A., S.Y. and Y.T.T.; Methodology, S.Y. and Y.T.T.; Software, Y.T.T. and S.Y.; Validation, M.S.A. and E.C.; Formal analysis, E.C.; Investigation, M.S.A.; Resources, E.C.; Data curation, Y.T.T. and S.Y.; Writing–original draft preparation, S.Y. and Y.T.T.; Writing–review and editing, S.Y., Y.T.T. and M.S.A.; Visualization, S.Y.; Supervision, M.S.A.; Project administration, M.S.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

This research paper used generative AI tools, including Google Gemini 3 and ChatGPT (version 5.2), solely for improving language clarity, grammar, and readability of the manuscript. The AI tools were not used to generate scientific content, analyses, or research results. The authors retain full responsibility for the accuracy, originality, and integrity of the work. The authors also thank Aktif Bank for providing a collaborative research environment. In addition, the authors would like to acknowledge Amirkia Rafiei Oskooei for contributions to earlier versions of this research.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Jia, C.; Jianping, L.; Changrun, C.; Lixi, C. A review of driver fatigue detection based on facial expression recognition. In Proceedings of the 2023 20th International Computer Conference on Wavelet Active Media Technology and Information Processing (ICCWAMTIP), Chengdu, China, 15–17 December 2023; pp. 1–6.
2. Gao, H.; Yuce, A.; Thiran, J.P. Detecting emotional stress from facial expressions for driving safety. In Proceedings of the 2014 IEEE International Conference on Image Processing (ICIP), Paris, France, 27–30 October 2014; pp. 5961–5965.
3. Mollahosseini, A.; Hasani, B.; Mahoor, M.H. AffectNet: A database for facial expression, valence, and arousal computing in the wild. IEEE Trans. Affect. Comput. 2017, 10, 18–31.
4. Finkielsztein, M. Class-Related Boredom among University Students: A Qualitative Research on Boredom Coping Strategies. J. Furth. High. Educ. 2020, 44, 1098–1113.
5. Klingner, J. Measuring Cognitive Load During Visual Tasks by Combining Pupillometry and Eye Tracking; Stanford University: Stanford, CA, USA, 2010.
6. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. arXiv 2015, arXiv:1512.03385.
7. Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 11976–11986.
8. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022.
9. Leong, F.H. Deep learning of facial embeddings and facial landmark points for the detection of academic emotions. In Proceedings of the 5th International Conference on Information and Education Innovations (ICIEI ’20), New York, NY, USA, 26–28 July 2020; pp. 111–116.
10. Liao, J.; Liang, Y.; Pan, J. Deep facial spatiotemporal network for engagement prediction in online learning. Appl. Intell. 2021, 51, 6609–6621.
11. Mishra, P.; Verma, A.S.; Chaudhary, P.; Dutta, A. Emotion Recognition from Facial Expression Using Deep Learning Techniques. In Proceedings of the 2024 IEEE 9th International Conference for Convergence in Technology (I2CT), Pune, India, 5–7 April 2024; pp. 1–6.
12. Chin Kit, N.; Ooi, C.; Tan, W.; Tan, Y.F.; Cheong, S. Facial emotion recognition using deep learning detector and classifier. Int. J. Electr. Comput. Eng. (IJECE) 2023, 13, 3375.
13. Li, S.; Deng, W. Deep Facial Expression Recognition: A Survey. IEEE Trans. Affect. Comput. 2022, 13, 1195–1215.
14. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
15. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929.
16. Bao, H.; Dong, L.; Wei, F. BEiT: BERT Pre-Training of Image Transformers. arXiv 2021, arXiv:2106.08254.
17. Pecoraro, R.; Basile, V.; Bono, V.; Gallo, S. Local Multi-Head Channel Self-Attention for Facial Expression Recognition. arXiv 2021, arXiv:2111.07224.
18. Ngwe, J.L.; Lim, K.M.; Lee, C.P.; Ong, T.S.; Alqahtani, A. PAtt-Lite: Lightweight Patch and Attention MobileNet for Challenging Facial Expression Recognition. IEEE Access 2024, 12, 79327–79341.
19. Zhang, S.; Zhang, Y.; Zhang, Y.; Wang, Y.; Song, Z. A Dual-Direction Attention Mixed Feature Network for Facial Expression Recognition. Electronics 2023, 12, 3595.
20. El Boudouri, Y.; Bohi, A. EmoNeXt: An Adapted ConvNeXt for Facial Emotion Recognition. In Proceedings of the 2023 IEEE 25th International Workshop on Multimedia Signal Processing (MMSP), Poitiers, France, 27–29 September 2023; pp. 1–6.
21. Bie, M.; Xu, H.; Gao, Y.; Song, K.; Che, X. Swin-FER: Swin Transformer for Facial Expression Recognition. Appl. Sci. 2024, 14, 6125.
22. Kim, J.H.; Kim, N.; Won, C.S. Facial Expression Recognition with Swin Transformer. arXiv 2022, arXiv:2203.13472.
Cambria, E.; Hussain, A. Sentic Computing: A Common-Sense-Based Framework for Concept-Level Sentiment Analysis; Socio-Affective Computing; Springer International Publishing: Cham, Switzerland, 2015; Volume 1. [Google Scholar] [CrossRef]
Zhang, S.; Pan, X.; Cui, Y.; Zhao, X.; Liu, L. Learning Affective Video Features for Facial Expression Recognition via Hybrid Deep Learning. IEEE Access 2019, 7, 32297–32304. [Google Scholar] [CrossRef]
Hasani, B.; Mahoor, M.H. Facial expression recognition using enhanced deep 3d convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Bangalore, India, 1–3 March 2017; pp. 30–40. [Google Scholar] [CrossRef]
Jin, H.; He, N.; Li, Z.; Yang, P. Micro-expression recognition based on multi-scale 3D residual convolutional neural network. Math. Biosci. Eng. 2024, 21, 5007–5031. [Google Scholar] [CrossRef]
Aouayeb, M.; Soladié, C.; Hamidouche, W.; Kpalma, K.; Séguier, R. Spatiotemporal features fusion from local facial regions for micro-expressions recognition. Front. Signal Process. 2022, 2, 861469. [Google Scholar] [CrossRef]
Wang, H.; Li, B.; Wu, S.; Shen, S.; Liu, F.; Ding, S.; Zhou, A. Rethinking the Learning Paradigm for Dynamic Facial Expression Recognition. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 17958–17968. [Google Scholar] [CrossRef]
Bertasius, G.; Wang, H.; Torresani, L. Is space-time attention all you need for video understanding? In Proceedings of the International Conference on Machine Learning (ICML), Virtual, 18–24 July 2021; pp. 813–824. [Google Scholar] [CrossRef]
Sun, L.; Lian, Z.; Wang, K.; He, Y.; Xu, M.; Sun, H.; Liu, B.; Tao, J. SVFAP: Self-Supervised Video Facial Affect Perceiver. IEEE Trans. Affect. Comput. 2025, 16, 405–422. [Google Scholar] [CrossRef]
Rafiei Oskooei, A.; Aktaş, M.S.; Keleş, M. Seeing the Sound: Multilingual Lip Sync for Real-Time Face-to-Face Translation. Computers 2024, 14, 7. [Google Scholar] [CrossRef]
Rafiei Oskooei, A.; Yahsi, E.; Sungur, M.; Aktas, M.S. Can One Model Fit All? An Exploration of Wav2Lip’s Lip-Syncing Generalizability Across Culturally Distinct Languages. In Proceedings of the International Conference on Computational Science and Its Applications, Hanoi, Vietnam, 1–4 July 2024; pp. 149–164. [Google Scholar] [CrossRef]
Verma, A.; Goyal, A.; Kaur, D. Fatigue Detection. arXiv 2019, arXiv:1911.10629. [Google Scholar] [CrossRef]
Shang, Y.; Yang, M.; Cui, J.; Cui, L.; Huang, Z.; Li, X. Driver emotion and fatigue state detection based on time series fusion. Electronics 2023, 12, 26. [Google Scholar] [CrossRef]
Panickar, S.S.; Gayathri, P. A Comprehensive Face Parsing Framework for Anxiety Detection Using Deep Learning. IEEE Access 2025, 13, 175136–175160. [Google Scholar] [CrossRef]
Seo, H.; Kim, S.; Lee, E.C. Defining and Analyzing Nervousness Using AI-Based Facial Expression Recognition. Mathematics 2025, 13, 1745. [Google Scholar] [CrossRef]
Bruin, J.; Stuldreher, I.V.; Perone, P.; van der Veen, F.M.; van der Wee, N.J.; Giltay, E.J.; van der Mast, C.A.; Neerincx, M.A.; van der Heiden, C. Detection of arousal and valence from facial expressions and physiological responses evoked by different types of stressors. Front. Neuroergonomics 2024, 5, 1338243. [Google Scholar] [CrossRef] [PubMed]
Krejtz, K.; Duchowski, A.T.; Niedzielska, A.; Biele, C.; Krejtz, I. Eye tracking cognitive load using pupil diameter and microsaccades with fixed gaze. PLoS ONE 2018, 13, e0203629. [Google Scholar] [CrossRef]
Dell’Acqua, P.; Garofalo, M.; La Rosa, F.; Villari, M. Your Eyes Under Pressure: Real-Time Estimation of Cognitive Load with Smooth Pursuit Tracking. Big Data Cogn. Comput. 2025, 9, 288. [Google Scholar] [CrossRef]
Jang, E.H.; Park, B.J.; Park, M.S.; Kim, S.H.; Sohn, J.H. Analysis of physiological signals for recognition of boredom, pain, and surprise emotions. J. Physiol. Anthropol. 2015, 34, 25. [Google Scholar] [CrossRef]
Puelke, D. Boredom Recognition in Manufacturing Tasks Using Physiological Signals. Ph.D. Thesis, TU Wien, Vienna, Austria, 2024. [Google Scholar] [CrossRef]
Yuvaraj, R.; Samyuktha, S.; Fogarty, J.; Huang, J.S.; Tan, S.; Kiong, W.T. Automated Boredom Recognition Using Multimodal Physiological Signals. IEEE Trans. Affect. Comput. 2025. early access. [Google Scholar] [CrossRef]
Littlewort, G.; Frank, M.; Lainscsek, C.; Fasel, I.; Movellan, J. Automatic Recognition of Facial Actions in Spontaneous Expressions. J. Multimed. 2006, 1, 22–35. [Google Scholar] [CrossRef]
Ekman, P.; Friesen, W.V. Facial Action Coding System: A Technique for the Measurement of Facial Movement; Consulting Psychologists Press: Palo Alto, CA, USA, 1978. [Google Scholar] [CrossRef]
Noldus Information Technology. FaceReader: Action Unit Module. 2024. Available online: https://www.noldus.com/facereader (accessed on 24 October 2025).
BIOPAC Systems, Inc. Facial Action Units. 2024. Available online: https://www.biopac.com/facial-action-units/ (accessed on 24 October 2025).
Emotiva. Action Units. 2024. Available online: https://emotiva.it/en/action-units/ (accessed on 24 October 2025).
Ruan, X.; Palansuriya, C.; Constantin, A. Affective Dynamic Based Technique for Facial Emotion Recognition (FER) to Support Intelligent Tutors in Education. In Artificial Intelligence in Education, Proceedings of the 24th International Conference, AIED 2023, Tokyo, Japan, 3–7 July 2023; Wang, N., Rebolledo-Mendez, G., Matsuda, N., Santos, O.C., Dimitrova, V., Eds.; Springer: Cham, Switzerland, 2023; pp. 774–779. [Google Scholar] [CrossRef]
Lin, S.Y.; Wu, C.M.; Chen, S.L.; Lin, T.L.; Tseng, Y.W. Continuous Facial Emotion Recognition Method Based on Deep Learning of Academic Emotions. Sens. Mater. 2020, 32, 3243. [Google Scholar] [CrossRef]
Hosseini, M.M.; Kolahdouzi, F.; Gholipour, A. Faces of Fairness: Examining Bias in Facial Expression Recognition Datasets and Models. arXiv 2025, arXiv:2502.11049. [Google Scholar] [CrossRef]
Raina, R.; Monares, M.; Xu, M.; Fabi, S.; Xu, X.; Li, L.; Sumerfield, W.; Gan, J.; de Sa, V.R. Exploring biases in facial expression analysis using synthetic faces. In Proceedings of the NeurIPS 2022 Workshop on Synthetic Data for Empowering ML Research, New Orleans, LA, USA, 2 December 2022. [Google Scholar]
Zeng, J.; Shan, S.; Chen, X. Facial expression recognition with inconsistently annotated datasets. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 222–237. [Google Scholar] [CrossRef]
Wang, Z.; Li, S.; Wang, Z. A New Joint Training Method for Facial Expression Recognition with Inconsistently Annotated and Imbalanced Data. Electronics 2024, 13, 3891. [Google Scholar] [CrossRef]
Aydin, G.; Aktas, M.S.; Fox, G.; Gadgil, H.; Pierce, M.; Sayar, A. SERVOGrid Complexity Computational Environments: Integrated Performance Analysis. In Proceedings of the 6th International Workshop on Grid Computing (GRID 2005); IEEE: Washington, DC, USA, 2005; pp. 256–261. [Google Scholar] [CrossRef]
Nacar, M.A.; Aktas, M.; Pierce, M.; Lu, Z.; Erlebacher, G.; Kigerlman, D.; Bolling, E.F.; Silva, C.R.S.; Sowell, B.; Yuen, D.A. VLab: Collaborative Grid Services and Portals to Support Computational Material Science. Concurr. Comput. Pract. Exp. 2007, 19, 1717–1728. [Google Scholar] [CrossRef]
Aktas, M.S.; Aydin, G.; Donnellan, A.; Fox, G.; Granat, R.; Grant, L.; Lyzenga, G.; McLeod, D.; Pallickara, S.; Parker, J.; et al. Implementing the International Solid Earth Research Virtual Observatory by Integrating Computational Grid and Geographical Information Web Services. Pure Appl. Geophys. 2006, 163, 2281–2296. [Google Scholar] [CrossRef]
Kapdan, M.; Aktas, M.; Yigit, M. On the Structural Code Clone Detection Problem: A Survey and Software Metric Based Approach. In Computational Science and Its Applications—ICCSA 2014; LNCS; Springer: Cham, Switzerland, 2014; Volume 8583, pp. 492–503. [Google Scholar] [CrossRef]
Roy, A.K. FERPlus Dataset. Available online: https://www.kaggle.com/datasets/arnabkumarroy02/ferplus (accessed on 6 November 2024).
Lerner, J.S.; Keltner, D. Fear, anger, and risk. J. Personal. Soc. Psychol. 2001, 81, 146–159. [Google Scholar] [CrossRef]
Sapolsky, R.M. Why Zebras Don’t Get Ulcers: The Acclaimed Guide to Stress, Stress-Related Diseases, and Coping; Holt Paperbacks; Henry Holt and Company: New York, NY, USA, 2004. [Google Scholar]
Bower, J.E.; Irwin, M.R. Mind–body therapies and control of inflammatory biology: A descriptive review. Brain Behav. Immun. 2016, 51, 1–11. [Google Scholar] [CrossRef]
Hockey, R. The Psychology of Fatigue: WORK, Effort, and Control; Cambridge University Press: Cambridge, UK, 2013. [Google Scholar] [CrossRef]
Panksepp, J. Affective Neuroscience: The Foundations of Human and Animal Emotions; Oxford University Press: Oxford, UK, 2004. [Google Scholar] [CrossRef]
Fredrickson, B.L. The broaden-and-build theory of positive emotions. Philos. Trans. R. Soc. B Biol. Sci. 2004, 359, 1367–1377. [Google Scholar] [CrossRef] [PubMed]

Figure 1. Comprehensive multimodal system architecture for complex emotion analysis.

Table 1. Empirically Derived Coefficients for Stress and Fatigue.

Emotion	Stress Coeff.	Fatigue Coeff.
Angry	0.6	0.05
Contempt	0.1	0.05
Disgust	0.3	0.05
Fear	0.5	0.05
Happy	0.05	0.05
Neutral	0.05	0.4
Sad	0.4	0.8
Surprise	0.1	0.05

Table 2. Empirically Calculated Coefficients for Additional Complex States.

Emotion	Anxiety Coeff.	Cog. Load Coeff.	Boredom Coeff.
Angry	0.3	0.45	0.1
Contempt	0.1	0.05	0.05
Disgust	0.1	0.2	0.15
Fear	0.7	0.1	0.05
Happy	0.0	0.0	0.0
Neutral	0.1	0.55	0.65
Sad	0.2	0.15	0.3
Surprise	0.1	0.35	0.2
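
The coefficients in Tables 1 and 2 can be read as a weight matrix applied to the eight emotion outputs of a FER model. The following is a minimal sketch, assuming each state score is a plain weighted sum of the per-emotion outputs for one frame. The names `EMOTIONS`, `COEFFS`, and `state_scores` and the example frame are illustrative and do not come from the article; whether the inputs are probabilities or logits, and whether any normalization is applied, is not specified in this section.

```python
# Emotion order follows the rows of Tables 1 and 2.
EMOTIONS = ["Angry", "Contempt", "Disgust", "Fear",
            "Happy", "Neutral", "Sad", "Surprise"]

# Coefficients copied from Tables 1 and 2, one list per state, in EMOTIONS order.
COEFFS = {
    "stress":         [0.6, 0.1, 0.3, 0.5, 0.05, 0.05, 0.4, 0.1],
    "fatigue":        [0.05, 0.05, 0.05, 0.05, 0.05, 0.4, 0.8, 0.05],
    "anxiety":        [0.3, 0.1, 0.1, 0.7, 0.0, 0.1, 0.2, 0.1],
    "cognitive_load": [0.45, 0.05, 0.2, 0.1, 0.0, 0.55, 0.15, 0.35],
    "boredom":        [0.1, 0.05, 0.15, 0.05, 0.0, 0.65, 0.3, 0.2],
}


def state_scores(emotion_outputs):
    """Return one weighted sum per state for a single frame (assumed formulation)."""
    if len(emotion_outputs) != len(EMOTIONS):
        raise ValueError("expected one value per emotion")
    return {
        state: sum(w * x for w, x in zip(weights, emotion_outputs))
        for state, weights in COEFFS.items()
    }


# Hypothetical frame: mostly Sad, partly Neutral, small Angry and Fear components.
frame = [0.05, 0.0, 0.0, 0.05, 0.0, 0.3, 0.6, 0.0]
scores = state_scores(frame)
```

For this hypothetical frame the fatigue score is the largest of the five, which reflects the high Sad and Neutral weights in the fatigue column of Table 1.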

Table 3. Logit Sum Threshold Values for Stress Detection (1347 Frames, 5-Fold CV).

Model	Threshold
CNN	22
DDAMFN	33
ViT	35
BEiT	25
ConvNeXt	25
Swin	28

Table 4. Logit Sum Threshold Values for Fatigue Detection (1347 Frames, 5-Fold CV).

Model	Threshold
CNN	22
DDAMFN	19
ViT	45
BEiT	30
ConvNeXt	9
Swin	6

Table 5. Logit Sum Threshold Values for Anxiety Detection (1347 Frames, 5-Fold CV).

Model	Threshold
CNN	2
DDAMFN	23
ViT	23
BEiT	8
ConvNeXt	12
Swin	9

Table 6. Logit Sum Threshold Values for Cognitive Load Detection (1347 Frames, 5-Fold CV).

Model	Threshold
CNN	7
DDAMFN	43
ViT	32
BEiT	22
ConvNeXt	17
Swin	19

Table 7. Logit Sum Threshold Values for Boredom Detection (1347 Frames, 5-Fold CV).

Model	Threshold
CNN	4
DDAMFN	33
ViT	23
BEiT	19
ConvNeXt	12
Swin	12
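
Tables 3 to 7 amount to a lookup from (state, model) to a decision threshold. The sketch below shows one way such a lookup could be applied, assuming a frame is flagged when its score is greater than or equal to the threshold. That comparison direction, the function name `detect`, and the dictionary layout are assumptions for illustration; the score passed in is assumed to be on the same scale as the "logit sum" named in the table captions.

```python
# Thresholds copied from Tables 3-7, keyed by state and then by model.
THRESHOLDS = {
    "stress":         {"CNN": 22, "DDAMFN": 33, "ViT": 35,
                       "BEiT": 25, "ConvNeXt": 25, "Swin": 28},
    "fatigue":        {"CNN": 22, "DDAMFN": 19, "ViT": 45,
                       "BEiT": 30, "ConvNeXt": 9, "Swin": 6},
    "anxiety":        {"CNN": 2, "DDAMFN": 23, "ViT": 23,
                       "BEiT": 8, "ConvNeXt": 12, "Swin": 9},
    "cognitive_load": {"CNN": 7, "DDAMFN": 43, "ViT": 32,
                       "BEiT": 22, "ConvNeXt": 17, "Swin": 19},
    "boredom":        {"CNN": 4, "DDAMFN": 33, "ViT": 23,
                       "BEiT": 19, "ConvNeXt": 12, "Swin": 12},
}


def detect(state, model, logit_sum):
    """Return True when the score reaches the model-specific threshold (assumed rule)."""
    return logit_sum >= THRESHOLDS[state][model]


# The same hypothetical score can cross one model's threshold but not another's.
swin_flag = detect("fatigue", "Swin", 30.0)
vit_flag = detect("fatigue", "ViT", 30.0)
```

Because the thresholds differ widely between models (for fatigue, 6 for Swin versus 45 for ViT), a score is only meaningful relative to the model that produced it.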

Table 8. FERPlus Test Accuracies: Sub-Emotion Recognition Performance.

Model	Accuracy (%)
CNN (ResNet-50)	75.52
ConvNeXt	82.76
DDAMFN (Hybrid)	85.27
ViT	83.79
BEiT	81.25
Swin	77.43

Table 9. Stress Detection Performance (1347 Frames, 5-Fold CV). Bold values indicate the best-performing model within the table.

Model	Accuracy (%)	Precision (%)	Recall (%)	F1 Score (%)
CNN	72.4 ± 1.6	82.1 ± 1.4	70.3 ± 1.8	75.9 ± 1.6
DDAMFN	82.7 ± 1.2	86.9 ± 1.0	76.8 ± 1.4	81.8 ± 1.1
ViT	79.3 ± 1.7	84.1 ± 1.3	73.9 ± 1.9	78.8 ± 1.6
BEiT	80.0 ± 1.3	85.2 ± 1.1	74.8 ± 1.5	80.0 ± 1.3
ConvNeXt	80.2 ± 1.4	84.9 ± 1.3	74.6 ± 1.6	79.3 ± 1.5
Swin	81.5 ± 1.5	85.4 ± 1.2	75.8 ± 1.7	81.0 ± 1.4

Table 10. Fatigue Detection Performance (1347 Frames, 5-Fold CV). Bold values indicate the best-performing model within the table.

Model	Accuracy (%)	Precision (%)	Recall (%)	F1 Score (%)
CNN	68.4 ± 1.9	71.5 ± 1.8	66.1 ± 2.0	68.7 ± 1.8
DDAMFN	81.9 ± 1.5	85.4 ± 1.4	79.5 ± 1.6	82.3 ± 1.5
ViT	79.4 ± 1.6	82.6 ± 1.5	76.8 ± 1.7	79.4 ± 1.6
BEiT	78.7 ± 1.7	82.1 ± 1.6	75.2 ± 1.8	78.4 ± 1.7
ConvNeXt	80.2 ± 1.6	83.1 ± 1.5	77.4 ± 1.7	80.1 ± 1.6
Swin	74.6 ± 1.8	77.8 ± 1.7	71.5 ± 1.9	74.3 ± 1.8

Table 11. Anxiety Detection Performance (1347 Frames, 5-Fold CV). Bold values indicate the best-performing model within the table.

Model	Accuracy (%)	Precision (%)	Recall (%)	F1 Score (%)
CNN	68.8 ± 2.1	69.9 ± 1.9	65.1 ± 2.3	68.0 ± 2.0
DDAMFN	72.8 ± 1.6	80.9 ± 1.5	70.8 ± 1.8	75.3 ± 1.6
ViT	75.6 ± 1.4	84.8 ± 1.2	73.5 ± 1.6	78.7 ± 1.4
BEiT	73.9 ± 1.5	83.1 ± 1.3	72.4 ± 1.7	77.1 ± 1.5
ConvNeXt	72.1 ± 1.8	80.6 ± 1.6	70.0 ± 1.9	74.5 ± 1.7
Swin	71.2 ± 1.7	79.8 ± 1.6	69.4 ± 1.9	74.1 ± 1.7

Table 12. Cognitive Load Detection Performance (1347 Frames, 5-Fold CV). Bold values indicate the best-performing model within the table.

Model	Accuracy (%)	Precision (%)	Recall (%)	F1 Score (%)
CNN	68.4 ± 1.8	66.1 ± 1.9	69.5 ± 2.0	67.8 ± 1.8
DDAMFN	72.1 ± 1.7	69.4 ± 1.8	74.0 ± 1.9	71.6 ± 1.7
ViT	86.4 ± 1.1	82.6 ± 1.2	89.1 ± 1.0	85.7 ± 1.1
BEiT	80.5 ± 1.4	76.8 ± 1.5	83.1 ± 1.4	79.8 ± 1.4
ConvNeXt	76.3 ± 1.6	73.2 ± 1.7	78.5 ± 1.8	75.7 ± 1.6
Swin	74.8 ± 1.6	71.5 ± 1.7	76.2 ± 1.8	73.8 ± 1.6

Table 13. Boredom Detection Performance (1347 Frames, 5-Fold CV). Bold values indicate the best-performing model within the table.

Model	Accuracy (%)	Precision (%)	Recall (%)	F1 Score (%)
CNN	69.3 ± 1.9	70.1 ± 1.8	68.5 ± 2.0	69.3 ± 1.9
DDAMFN	84.2 ± 1.0	82.5 ± 0.9	81.1 ± 1.1	82.3 ± 1.0
ViT	75.8 ± 1.6	76.5 ± 1.5	75.1 ± 1.7	75.8 ± 1.6
BEiT	78.4 ± 1.5	79.2 ± 1.4	77.8 ± 1.6	78.5 ± 1.5
ConvNeXt	77.1 ± 1.6	78.3 ± 1.5	76.5 ± 1.7	77.4 ± 1.6
Swin	73.5 ± 1.7	74.8 ± 1.6	72.1 ± 1.8	73.4 ± 1.7

Table 14. Stress Detection ROC-AUC Performance (1347 Frames, 5-Fold CV).

Model	ROC-AUC (%)	TP	TN	FP	FN
DDAMFN	87.4 ± 1.3	515	599	78	155
ViT	86.1 ± 1.4	514	554	97	182
BEiT	82.6 ± 1.6	528	549	92	178
ConvNeXt	80.3 ± 1.7	515	565	92	175
Swin	78.9 ± 1.5	508	590	87	162
CNN	69.4 ± 2.7	580	395	127	245
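
The TP, TN, FP, and FN counts in Table 14 sum to 1347 for every model, so the standard classification metrics can be recomputed from them. The sketch below does this for the DDAMFN row using the usual definitions; the function name `metrics` is illustrative. The pooled counts give roughly 82.7% accuracy, 86.8% precision, 76.9% recall, and 81.6% F1, close to the DDAMFN row of Table 9 (82.7, 86.9, 76.8, 81.8). The small differences would be consistent with Table 9 reporting means over the five folds, although this section does not state how the values were aggregated.

```python
def metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "total": total,
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# DDAMFN row of Table 14 (stress detection).
ddamfn_stress = metrics(tp=515, tn=599, fp=78, fn=155)
for name, value in ddamfn_stress.items():
    print(name, round(value, 4))
```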

Table 15. Fatigue Detection ROC-AUC Performance (1347 Frames, 5-Fold CV).

Model	ROC-AUC (%)	TP	TN	FP	FN
DDAMFN	85.3 ± 1.3	557	552	95	143
ViT	84.7 ± 1.5	538	534	113	162
BEiT	79.6 ± 1.7	526	532	115	174
ConvNeXt	74.6 ± 2.1	542	537	110	158
Swin	73.9 ± 1.8	501	504	143	199
CNN	70.9 ± 2.4	463	462	185	237

Table 16. Anxiety Detection ROC-AUC Performance (1347 Frames, 5-Fold CV).

Model	ROC-AUC (%)	TP	TN	FP	FN
DDAMFN	83.9 ± 1.4	565	416	133	233
ViT	82.7 ± 1.5	608	411	109	219
BEiT	81.9 ± 1.6	600	396	122	229
ConvNeXt	76.1 ± 1.9	562	409	135	241
Swin	74.4 ± 1.7	558	402	141	246
CNN	66.7 ± 2.9	434	493	187	233

Table 17. Cognitive Load Detection ROC-AUC Performance (1347 Frames, 5-Fold CV).

Model	ROC-AUC (%)	TP	TN	FP	FN
DDAMFN	83.6 ± 1.3	474	497	209	167
ViT	82.9 ± 1.4	550	564	157	67
BEiT	78.8 ± 1.5	505	532	183	103
ConvNeXt	77.3 ± 1.6	498	530	181	138
Swin	74.8 ± 1.5	479	529	187	152
CNN	63.5 ± 2.8	446	474	231	196

Table 18. Boredom Detection ROC-AUC Performance (1347 Frames, 5-Fold CV).

Model	ROC-AUC (%)	TP	TN	FP	FN
DDAMFN	84.8 ± 1.2	478	657	101	111
ViT	83.6 ± 1.3	510	511	157	169
BEiT	82.4 ± 1.5	529	528	139	151
ConvNeXt	74.9 ± 2.0	527	512	146	162
Swin	73.1 ± 1.7	493	497	166	191
CNN	70.1 ± 2.3	467	466	199	215

