Dimensional Emotion-Guided Conditional Modulation for Context-Aware Multimodal Driver Affect Recognition

Shen, Wei; Mou, Xingang; Yi, Jing; Le, Songqing

doi:10.3390/app16094312

¹ School of Mechanical and Electronic Engineering, Wuhan University of Technology, Wuhan 430070, China

² Key Laboratory of Advanced Manufacturing Technology for High Performance Parts, Ministry of Education, Wuhan 430070, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(9), 4312; https://doi.org/10.3390/app16094312

Submission received: 19 March 2026 / Revised: 23 April 2026 / Accepted: 24 April 2026 / Published: 28 April 2026

Abstract

Driver emotion recognition constitutes a fundamental pillar of intelligent cockpit systems, playing a pivotal role in enhancing driving safety and optimizing human–machine interaction. Despite the integration of vehicle sensor data in recent multimodal approaches, conventional fusion paradigms frequently encounter performance degradation due to the inherent noise and weak semantic correlation between vehicle telemetry and emotional states. To address these challenges, this study introduces a Dimensional Emotion-Guided Multi-task (DEGM) framework, a novel architecture designed to explicitly formalize the asymmetric roles of visual and vehicular modalities. Rather than employing simplistic feature concatenation, the proposed method maps multivariate vehicle data into a continuous Valence–Arousal–Dominance (VAD) space to characterize latent emotional tendencies within specific driving contexts. These predicted dimensions subsequently serve as semantic priors to conditionally modulate global facial representations through a Feature-wise Linear Modulation (FiLM) mechanism, facilitating robust and interpretable cross-modal interaction. Furthermore, the framework adopts a multi-task learning strategy that jointly optimizes discrete emotion classification and continuous dimension regression, leveraging the latter as a structural regularizer to refine the latent feature space. Comprehensive evaluations on the public PPB driving emotion dataset demonstrate that the proposed DEGM achieves a competitive accuracy of 87.50% and a weighted F1-score of 0.8727. The results validate that our framework provides a lightweight and robust paradigm for context-aware affect sensing, demonstrating strong potential for practical deployment in intelligent transportation systems.

Keywords: driver emotion recognition; multi-task learning; context-aware; dimensional emotion; feature modulation; intelligent vehicles

1. Introduction

Driving constitutes a pervasive yet high-stakes activity in modern society, where safety is intrinsically linked to both the stability of public transportation systems and the physical well-being of individuals. A burgeoning body of literature suggests that road traffic crashes are profoundly influenced by drivers’ behavioral patterns and cognitive states, with emotional fluctuations playing a pivotal role in shaping perception, attention allocation, and decision-making processes [1,2]; for instance, states of heightened agitation, acute anger, or prolonged low arousal can significantly impair a driver’s risk assessment and environmental awareness, thereby escalating the probability of collisions. Consequently, achieving accurate and robust driver emotion recognition has become a cornerstone for the development of advanced driver assistance systems (ADAS) and the realization of intuitive human–machine interaction (HMI) within intelligent cockpits.

The primary challenges in driver affect recognition stem from the inherent complexity of driving environments and the nuanced nature of emotional expression. On the one hand, the cognitive load and behavioral constraints imposed by driving tasks often lead to the suppression of explicit emotional cues; thus, facial expressions in these contexts are frequently characterized by low intensity, rapid fluctuations, and transient durations [3]. On the other hand, real-world driving scenarios introduce multifaceted interferences—such as non-uniform illumination, abrupt head pose variations, and partial occlusions—which impede stable feature extraction. To mitigate these challenges, contemporary research has pivoted toward multimodal perception, integrating visual cues, speech, and physiological signals [4,5,6].

Among these modalities, physiological signals—including electroencephalography (EEG) and galvanic skin response—can directly reflect autonomic nervous system fluctuations. However, their acquisition typically necessitates contact-based or semi-invasive sensors, which increase deployment costs and may intrude upon natural driving behavior, thereby limiting their ecological validity [7]. Speech signals are similarly constrained, as driving is not inherently a continuous verbal task and vocal features are often confounded by semantic content and ambient noise. In contrast, facial morphology provides a rich, non-invasive channel for affective state inference, making it the primary modality for non-verbal emotion recognition in intelligent vehicle research [8].

Early studies predominantly relied on traditional machine learning frameworks with hand-crafted features to classify emotions from static imagery [9]. However, these approaches often lack the generalization and robustness required to handle the subtle and temporally dependent expressions typical of driving. The advent of deep learning has enabled the joint modeling of spatio-temporal features via Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN), such as CNN-LSTM architectures [10]. More recently, Transformer-based frameworks have gained traction due to their superior global modeling capabilities in video-based tasks [11]. Nevertheless, relying solely on visual modalities remains insufficient under adverse conditions, where performance degradation is frequently observed due to sensory noise or information loss.

Beyond explicit facial cues, vehicles generate continuous multivariate sensor streams—such as speed, acceleration, steering angle, and pedal position—that reflect the intricate interplay between driving behavior and the external environment. These signals are cost-effective to acquire and implicitly encode a driver’s psychological state; for example, erratic longitudinal control is often symptomatic of stress or agitation. However, vehicle signals possess “weak emotional semantics,” and treating them as an independent modality for emotion classification typically yields marginal gains. Consequently, the existing literature often relegates vehicle data to behavior analysis, lacking a systematic framework to formalize its role in emotion recognition.

To bridge these gaps, this study posits that vehicle signals are more effectively utilized as contextual perception priors—capturing the implicit constraints of the environment—rather than as an emotional modality equivalent to facial expressions. We propose a Dimensional Emotion-Guided Multi-task (DEGM) framework to formalize this asymmetric interaction. Within this architecture, vehicle temporal signals are modeled to predict continuous emotional dimensions (Valence–Arousal–Dominance, VAD) that characterize the driving context. These predicted dimensions then serve as a semantic compass to conditionally modulate facial video features via a Feature-wise Linear Modulation (FiLM) mechanism, aligning visual representations with the current driving context. Simultaneously, the framework adopts a multi-task learning strategy where continuous dimension regression acts as a structural regularizer to enhance the stability and discriminative power of discrete emotion classification.

The main contributions of this work are summarized as follows:

Asymmetric Multimodal Formalization: We propose a dimensional emotion-guided conditional modulation framework that explicitly models the asymmetric roles of facial and vehicle data. By mapping vehicle context into a continuous VAD space, we enable a more interpretable and robust cross-modal interaction compared to conventional symmetric fusion paradigms.
Multi-task Collaborative Optimization: We formulate driver emotion recognition as a joint optimization problem. This multi-task approach leverages dimensional regression as a structural regularizer, significantly improving classification robustness and feature consistency under noisy or weak contextual signals.
Hierarchical Spatio-Temporal Encoding: We introduce the Spatio-Temporal Aggregation and Projection Embedding (STAP-Embed) module for facial video encoding. By hierarchically aggregating short-term dynamics and long-term dependencies, STAP-Embed preserves fine-grained spatio-temporal cues essential for detecting subtle facial micro-expressions.

2. Related Work

2.1. Modality Evolution: From Physiological to Visual Cues

Driver emotion recognition (DER) aims to decode affective states within the intricate constraints of vehicular environments, typically leveraging internal physiological responses or external behavioral cues. Physiological signal-based methodologies, utilizing electroencephalography (EEG), electrocardiography (ECG), and heart rate variability (HRV), offer high sensitivity to autonomic nervous system fluctuations [12]. Nevertheless, their practical deployment remains hindered by the requirement for specialized, often intrusive, sensing hardware. As underscored by recent studies, multi-channel physiological sensors can obstruct natural driving maneuvers, thereby compromising the ecological validity and authenticity of the captured data [13,14].

In contrast, facial expressions serve as a non-invasive and information-dense channel for affective inference. Early research predominantly relied on static imagery, employing hand-crafted features coupled with shallow classifiers [15]. The subsequent paradigm shift toward deep learning, specifically Convolutional Neural Networks (CNNs), significantly enhanced the capacity to extract discriminative spatial features directly from raw pixels [16]. However, static analysis inherently fails to account for the temporal evolution of affect. This limitation is particularly critical in driving scenarios, where emotional manifestations—often characterized by low-intensity micro-expressions and rapid transitions—necessitate a nuanced understanding of temporal dynamics.

2.2. Spatio-Temporal Modeling for Facial Expressions

To encapsulate the dynamic nature of emotions, researchers have transitioned toward video-based recognition by integrating temporal modeling components. Hybrid architectures, such as CNNs combined with Bidirectional Long Short-Term Memory (BiLSTM) networks, have been widely adopted to capture sequential dependencies atop spatial representations [17]. Recently, Transformer architectures have gained prominence in the field of Dynamic Facial Expression Recognition (DFER) due to their superior global modeling capacity. By leveraging self-attention mechanisms, Transformers can establish long-range spatio-temporal dependencies, circumventing the localized focus of traditional recurrent models [18,19,20].

Despite these advancements, a recurring challenge remains: many Transformer-based approaches necessitate decomposing video streams into numerous fine-grained tokens. This process frequently leads to the erosion of local structural integrity while imposing a formidable computational burden—a critical drawback for real-time intelligent cockpit applications. This underscores the need for more efficient spatio-temporal aggregation strategies that preserve fine-grained expressive cues without excessive overhead.

2.3. Contextual Integration and Multimodal Fusion Paradigms

The integration of vehicle-derived telemetry—including steering angle, acceleration, and pedal position—as a complementary modality has emerged as a promising research frontier. Empirical evidence suggests that driving behaviors are deeply intertwined with a driver’s cognitive and emotional load [21,22]. However, vehicle signals possess weak emotional semantics and standalone vehicle-based emotion classification typically yields suboptimal performance, leaving the effective utilization of such data an open challenge.

Current multimodal fusion strategies are generally categorized into early (data-level), late (decision-level), and intermediate (feature-level) fusion. Early fusion often overlooks the inherent heterogeneity between visual and mechanical modalities, while late fusion tends to sacrifice rich cross-modal interactions [23,24]. Although attention-based intermediate fusion has improved flexibility by dynamically weighting different features, most existing frameworks treat all modalities as “symmetrically informative” sources [25,26,27]. Such paradigms fail to explicitly model the asymmetric role of the driving context.

Distinct from these conventional approaches, the proposed work does not treat vehicle signals as a parallel emotional modality. Instead, we redefine vehicle data as a contextual prior. By mapping these signals into a continuous dimensional emotion space (VAD) and utilizing the results to conditionally modulate visual features, our framework facilitates a more interpretable, context-aware, and robust interaction between the driver’s environment and their expressive cues.

3. Methodology

3.1. Overall Architecture

We propose a Dimensional Emotion-Guided Multi-task Learning Network (DEGM) for context-aware driver affect recognition, as illustrated in Figure 1. The proposed framework consists of four core components:

A facial video encoder based on the proposed STAP-Embed module,
A lightweight vehicle context encoder,
A dimensional emotion-guided conditional modulation module, and
A multi-task prediction head for discrete emotion classification and continuous emotion regression.

Unlike conventional multimodal affect recognition frameworks that treat different modalities as equally informative emotional sources, DEGM explicitly models the asymmetric contribution of facial video and vehicle signals in driving scenarios. Facial video serves as the primary carrier of affective expression, while vehicle signals are regarded as contextual cues reflecting driving conditions and behavioral tendencies rather than direct emotional manifestations.

Therefore, vehicle signals are not directly fused with facial features. Instead, they are first mapped into a continuous emotional dimension space (Valence–Arousal–Dominance), which captures the latent affective tendencies induced by driving context. The predicted dimensional representation then acts as a semantic guidance signal, modulating global facial features through a conditional FiLM-based mechanism. This design enables controlled cross-modal interaction while avoiding the noise amplification commonly observed in direct feature-level fusion.

On the facial side, we introduce STAP-Embed, a hierarchical spatio-temporal embedding strategy that aggregates fine-grained short-term facial dynamics within video segments and long-range temporal dependencies across segments. Compared with conventional frame-wise or token-heavy Transformer-based encoders, STAP-Embed significantly reduces temporal redundancy while preserving emotion-discriminative dynamics under subtle expression variations.

Finally, DEGM adopts a multi-task learning strategy, jointly optimizing discrete emotion classification and continuous dimensional regression. The dimensional regression task not only provides complementary supervision but also regularizes the feature space, encouraging consistency between categorical decisions and continuous affective trends. Through this unified architecture, DEGM achieves robust emotion recognition under weakly correlated and noisy contextual signals.

3.2. Facial Video Encoder

The facial video encoder aims to map an input facial video sequence into a compact spatio-temporal representation with strong emotion discriminability. Let the input facial video be denoted as

$$x_{f} \in \mathbb{R}^{T \times C \times H \times W} \tag{1}$$

The input video is first partitioned into a set of non-overlapping video blocks of size $t \times C \times h \times w$ according to predefined temporal and spatial scales. Each video block contains a short sequence of $t$ consecutive frames, which is designed to capture local temporal dynamics. These video blocks are then processed by the proposed STAP-Embed module. During the embedding stage, each frame within a video block is passed through a convolutional neural network-based feature extraction module to obtain spatial feature maps. This module consists of two standard convolutional layers followed by a StarBlock, which jointly extract low-level spatial structures and perform channel-wise feature projection. As a result, each frame is represented as a spatial feature vector of dimension $D_{s}$, and each block as a feature sequence $x \in \mathbb{R}^{t \times D_{s}}$.

To model short-term temporal variations within each video block, the spatial feature vectors are further encoded along the temporal dimension. Specifically, the feature sequence $x \in \mathbb{R}^{t \times D_{s}}$ is first fed into a bidirectional LSTM to capture fine-grained temporal evolution patterns. The output of the BiLSTM is subsequently refined by a multi-head self-attention module, which dynamically assigns different importance weights to individual time steps, thereby emphasizing frames that are more informative for emotion discrimination. After block-level temporal modeling, each video block is represented by a compact feature vector. All block-level features are then reorganized into a structured feature sequence $X \in \mathbb{R}^{n_{t} \times n_{h} \times n_{w} \times D}$, where $n_{t} = T / t$, $n_{h} = H / h$, and $n_{w} = W / w$. This sequence is fed into a hierarchical Transformer encoding architecture consisting of a spatial Transformer encoder followed by a temporal Transformer encoder.

The first stage is the spatial Transformer encoder, which aims to capture spatial correlations among video blocks within each temporal slice. For a given temporal index $t_{n} \in \{1, \dots, n_{t}\}$, the corresponding block embeddings are reshaped into a sequence $X_{t_{n}} \in \mathbb{R}^{(n_{h} \cdot n_{w}) \times D}$. A learnable spatial classification token $x_{cls_s} \in \mathbb{R}^{1 \times D}$ is prepended to the sequence, forming the spatial encoder input

$$Z_{t_{n}}^{(0)} = [x_{cls_s}; X_{t_{n}}] \in \mathbb{R}^{(n_{h} \cdot n_{w} + 1) \times D} \tag{2}$$

The resulting sequence is processed by a stack of $L_{s}$ standard Transformer layers. At the $l$-th layer ($l = 1, \dots, L_{s}$), the computations are defined as

$$Z_{t_{n}}^{\prime (l)} = \mathrm{MSA}(\mathrm{LN}(Z_{t_{n}}^{(l - 1)})) + Z_{t_{n}}^{(l - 1)} \tag{3}$$

$$Z_{t_{n}}^{(l)} = \mathrm{FFN}(\mathrm{LN}(Z_{t_{n}}^{\prime (l)})) + Z_{t_{n}}^{\prime (l)} \tag{4}$$

where $\mathrm{LN}$ denotes layer normalization, $\mathrm{MSA}$ represents multi-head self-attention, and $\mathrm{FFN}$ denotes the feed-forward network. This process is independently applied at each temporal index $t_{n}$, enabling the model to learn spatial configuration patterns, such as coordinated facial movements, that are informative for emotion expression. After $L_{s}$ layers, the final spatial representation for each time step is obtained from the corresponding spatial classification token $z_{t_{n}, cls} = Z_{t_{n}}^{(L_{s})}[0]$.

The second stage is the temporal Transformer encoder, which models long-term temporal dependencies across different time steps. All spatial classification tokens $z_{t_{n}, cls}$ are collected to form a temporal summary sequence $Y_{s} = [z_{1, cls}; z_{2, cls}; \dots; z_{n_{t}, cls}] \in \mathbb{R}^{n_{t} \times D}$. This representation reduces the sequence length from $n_{t} \cdot n_{h} \cdot n_{w}$ block tokens to $n_{t}$ summary tokens, enabling efficient temporal attention computation. A learnable temporal classification token $x_{cls_t} \in \mathbb{R}^{1 \times D}$ is prepended to the sequence, yielding the input to the temporal encoder

$$Z_{temp}^{(0)} = [x_{cls_t}; Y_{s}] \in \mathbb{R}^{(n_{t} + 1) \times D} \tag{5}$$

The sequence is then passed through $L_{t}$ Transformer layers with identical structures, where self-attention mechanisms model the temporal evolution of facial expressions over the entire video. Finally, the facial video is represented by a global feature vector

$$f_{face} \in \mathbb{R}^{D} \tag{6}$$

which serves as the core input for subsequent conditional modulation and multi-task prediction modules.
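
The effect of the hierarchical design on attention cost can be illustrated with a short calculation. The sketch below is only a rough count of query-key pairs per attention layer, comparing a hypothetical encoder that attends jointly over all block tokens with the spatial-then-temporal scheme described above. The clip length T = 32 matches the experiments, but the 112 × 112 crop size and the 4 × 16 × 16 block size are assumptions chosen for illustration, and the function names are hypothetical.

```python
def block_grid(T, H, W, t, h, w):
    """Number of blocks along time, height and width: n_t = T/t, n_h = H/h, n_w = W/w."""
    if T % t or H % h or W % w:
        raise ValueError("block sizes must divide the clip dimensions exactly")
    return T // t, H // h, W // w


def attention_pairs(n_t, n_h, n_w):
    """Query-key pairs per layer for joint attention vs. the hierarchical scheme."""
    joint = (n_t * n_h * n_w) ** 2            # every block token attends to every other
    spatial = n_t * (n_h * n_w + 1) ** 2      # one sequence per temporal slice, plus its cls token
    temporal = (n_t + 1) ** 2                 # n_t slice summaries plus the temporal cls token
    return joint, spatial + temporal


# T = 32 follows the 32-frame clips; H, W, t, h, w are illustrative assumptions.
n_t, n_h, n_w = block_grid(T=32, H=112, W=112, t=4, h=16, w=16)
joint, hierarchical = attention_pairs(n_t, n_h, n_w)
print(n_t, n_h, n_w)        # 8 7 7
print(joint, hierarchical)  # 153664 20081
```

Under these assumed sizes the hierarchical scheme evaluates roughly one eighth as many attention pairs per layer as joint attention would. The exact ratio depends on the block sizes actually used.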

3.3. Vehicle Context Encoder

The vehicle context encoder is designed to extract background information related to driving emotion from multi-channel vehicle sensor signals. Let the vehicle sensor input be denoted as

$$x_{v} \in \mathbb{R}^{S \times T_{v}} \tag{7}$$

where $S$ represents the number of sensor channels and $T_{v}$ denotes the temporal length of the signal.

Considering the characteristics of vehicle signals, including high noise levels, inconsistent sampling frequencies, and limited data scale, this work does not directly adopt deep temporal models for vehicle signal encoding. Instead, a statistics-based representation is constructed to obtain a robust and compact contextual description. Specifically, for each sensor channel, a fixed set of statistical descriptors is computed, including the mean, standard deviation, maximum value, minimum value, temporal change rate, and the proportion of outlier values. These statistics characterize driving behavior from both global distributional properties and dynamic variation perspectives.

The statistical features from all sensor channels are concatenated along the channel dimension to form a fixed-length feature vector

$$f_{stat} \in \mathbb{R}^{6S} \tag{8}$$

This process introduces no learnable parameters, which helps reduce model complexity and mitigates the risk of overfitting. The resulting statistical feature vector is then mapped through a lightweight multilayer perceptron to obtain a low-dimensional vehicle context representation

$$c_{v} = \phi(f_{stat}) \tag{9}$$

where $c_{v} \in \mathbb{R}^{C_{v}}$. The mapping network $\phi$ consists of linear layers followed by normalization, nonlinear activation, and Dropout. The output dimensionality is intentionally constrained to a relatively small size, ensuring that vehicle information participates in subsequent modeling primarily as a conditional contextual signal rather than as a dominant feature source.
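
The parameter-free statistical representation can be sketched in a few lines. The example below computes the six descriptors named above for each channel and concatenates them. The text does not define the temporal change rate or the outlier criterion, so the two definitions used here (mean absolute first difference, and the share of samples more than three standard deviations from the channel mean) are assumptions, as are the function names.

```python
import statistics


def channel_descriptors(signal, outlier_z=3.0):
    """Mean, std, max, min, change rate and outlier share for one sensor channel."""
    if len(signal) < 2:
        raise ValueError("need at least two samples per channel")
    mean = statistics.fmean(signal)
    std = statistics.pstdev(signal)
    # Assumed definition: mean absolute difference between consecutive samples.
    diffs = [abs(b - a) for a, b in zip(signal, signal[1:])]
    change_rate = statistics.fmean(diffs)
    # Assumed definition: share of samples further than outlier_z stds from the mean.
    if std > 0:
        outliers = sum(abs(x - mean) > outlier_z * std for x in signal) / len(signal)
    else:
        outliers = 0.0
    return [mean, std, max(signal), min(signal), change_rate, outliers]


def vehicle_statistics(channels):
    """Concatenate per-channel descriptors into one vector of length 6 * S."""
    features = []
    for signal in channels:
        features.extend(channel_descriptors(signal))
    return features


# Two toy channels (S = 2) give a 12-dimensional vector.
f_stat = vehicle_statistics([[0.0, 1.0, 2.0, 3.0], [5.0, 5.0, 5.0, 5.0]])
print(len(f_stat))
```

With the eight sensor channels reported in Section 4.1, the same procedure would yield a 48-dimensional vector before the mapping network is applied.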

3.4. Dimensional Emotion-Guided Conditional Modulation

To enable effective interaction between facial features and vehicle context, a dimensional emotion-guided conditional modulation (DEGCM) mechanism is introduced. The core idea of this module is to first predict continuous emotion dimensions from vehicle signals and then use the resulting dimensional information to conditionally modulate facial representations. Specifically, based on the vehicle statistical features, a regression branch is introduced to predict an emotion dimension vector

$$\hat{\mathbf{d}}_{A} = [\hat{v}, \hat{a}, \hat{d}] \tag{10}$$

where $\hat{v}$, $\hat{a}$, and $\hat{d}$ denote the predicted valence, arousal, and dominance dimensions, respectively, and the subscript $A$ marks the vehicle-only prediction, as in Equation (15). This predicted vector is interpreted as the expected emotional state implied by driving behavior and serves as semantic guidance for cross-modal interaction. During the modulation stage, the predicted emotion dimension vector is fed into a lightweight multilayer perceptron $h$ to generate modulation parameters with the same dimensionality as the facial feature representation,

$$[\gamma, \beta] = h(\hat{\mathbf{d}}_{A}) \tag{11}$$

where $\gamma, \beta \in \mathbb{R}^{D}$. A feature-wise linear modulation (FiLM) mechanism is then applied to transform the facial features:

$$f_{face}^{\prime} = \gamma \odot f_{face} + \beta \tag{12}$$

Through element-wise scaling and shifting, this operation guides facial representations toward emotional states indicated by vehicle context while preserving the intrinsic structure of the original features. As a result, cross-modal conditional modeling is achieved in a controlled and interpretable manner.
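
A minimal sketch of the modulation step in Equations (11) and (12) is given below. For brevity, the parameter generator is reduced to a single linear layer with hand-set weights, which is a simplifying stand-in for the lightweight multilayer perceptron and not the configuration used in the paper. The feature size D = 4 and all names are hypothetical.

```python
def linear(x, weights, bias):
    """Dense layer: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def film_parameters(vad, weights, bias):
    """Stand-in for h(.): map the 3 VAD values to 2*D numbers, split into gamma and beta."""
    out = linear(vad, weights, bias)
    d = len(out) // 2
    return out[:d], out[d:]


def film(face, gamma, beta):
    """Element-wise scale and shift: f' = gamma * f + beta."""
    return [g * f + b for g, f, b in zip(gamma, face, beta)]


D = 4
face = [0.5, -1.0, 2.0, 0.0]   # toy facial feature vector
vad = [0.3, 0.8, -0.2]         # toy vehicle-predicted valence, arousal, dominance

# With zero weights and bias (1, ..., 1, 0, ..., 0) the generator outputs gamma = 1 and
# beta = 0, so the facial features pass through unchanged regardless of the VAD input.
zero_w = [[0.0] * 3 for _ in range(2 * D)]
identity_b = [1.0] * D + [0.0] * D
gamma, beta = film_parameters(vad, zero_w, identity_b)
print(film(face, gamma, beta) == face)   # True

# A non-trivial modulation rescales and shifts every feature channel.
print(film(face, [2.0] * D, [1.0] * D))  # [2.0, -1.0, 5.0, 1.0]
```

The identity case shows why this form of interaction is easy to control: the context can only rescale and shift existing facial channels, so a generator that outputs values near γ = 1 and β = 0 leaves the visual representation essentially untouched.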

3.5. Multi-Task Learning Prediction Head

After conditional modulation, the resulting feature representation $f_{face}^{\prime}$ is jointly used for discrete emotion classification and continuous emotion dimension regression. In the classification branch, the feature vector is passed through a multilayer perceptron followed by a Softmax layer to produce a probability distribution over emotion categories. This branch is optimized using the cross-entropy loss function.

In the regression branch, the model predicts continuous emotion dimensions from the same feature representation. The output range is constrained by a Tanh activation function, and supervision is provided using the Smooth L1 loss. In addition, the emotion dimension prediction branch based solely on vehicle signals is also constrained by a regression loss, ensuring that vehicle context learning is explicitly guided by dimensional emotion supervision. The overall training objective is defined as a weighted combination of the classification loss and the regression losses:

$$L_{C} = \mathrm{CrossEntropyLoss}(\hat{y}_{c}, y_{c}) \tag{13}$$

$$L_{D} = \mathrm{SmoothL1Loss}(\hat{\mathbf{d}}, \mathbf{d}) \tag{14}$$

$$L_{A} = \mathrm{SmoothL1Loss}(\hat{\mathbf{d}}_{A}, \mathbf{d}) \tag{15}$$

$$\mathrm{Loss} = \alpha L_{C} + \lambda L_{D} + \mu L_{A} \tag{16}$$

where $\hat{y}_{c}$ and $y_{c}$ denote the predicted and ground-truth discrete emotion labels, respectively; $\hat{\mathbf{d}}$ and $\mathbf{d}$ represent the predicted and ground-truth emotion dimension vectors; $\hat{\mathbf{d}}_{A}$ denotes the emotion dimension prediction obtained from the vehicle-only branch; and $\alpha$, $\lambda$, $\mu$ are weighting coefficients that balance the optimization of different tasks.
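
The combined objective can be written out for a single sample as follows. This is a framework-independent sketch: the Smooth L1 threshold of 1.0 and the unit weighting coefficients are placeholder assumptions (the values used in the paper are not given in this section), and the function names are hypothetical.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of a softmax over `logits` against the integer class `target`."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


def smooth_l1(pred, target, delta=1.0):
    """Smooth L1 averaged over dimensions: quadratic below `delta`, linear above it."""
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        total += 0.5 * diff * diff / delta if diff < delta else diff - 0.5 * delta
    return total / len(pred)


def total_loss(logits, label, vad_fused, vad_vehicle, vad_true,
               alpha=1.0, lam=1.0, mu=1.0):
    """Weighted sum of the classification loss and the two regression losses."""
    l_c = cross_entropy(logits, label)       # discrete emotion classification
    l_d = smooth_l1(vad_fused, vad_true)     # VAD regression from modulated features
    l_a = smooth_l1(vad_vehicle, vad_true)   # VAD regression from the vehicle-only branch
    return alpha * l_c + lam * l_d + mu * l_a


# Toy sample: uniform logits over seven classes, perfect fused VAD, slightly off vehicle VAD.
loss = total_loss([0.0] * 7, 3, [0.2, 0.5, -0.1], [0.7, 0.5, -0.1], [0.2, 0.5, -0.1])
print(round(loss, 4))
```

Because the vehicle-only term has its own weight, the strength of the dimensional supervision on the context branch can be tuned separately from the regression on the modulated facial features.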

4. Experiments

4.1. Experimental Setup

All experiments were conducted on the publicly available PPB dataset [28]. This dataset contains psychological data, physiological signals, and driving behavior data from 40 participants across 240 driving tasks. Each sample is annotated with both discrete emotion labels and continuous emotion dimension values using the DES and SAM scales, covering seven categorical emotions (surprise, fear, disgust, sadness, anger, neutral, and happiness) and three emotion dimensions (valence, arousal, and dominance). The emotion distribution of each driver in the PPB dataset is approximately balanced across categories.

Prior to training, both facial video data and vehicle driving signals were preprocessed. Since the videos in PPB contain the driver’s upper body, face detection and cropping were first performed to isolate facial regions. Specifically, the lightweight MTCNN [29] face detection framework was employed to extract facial images from the video sequences. MTCNN is a three-stage cascade consisting of P-Net, R-Net, and O-Net. It follows a candidate-box-plus-classifier design that balances speed and accuracy.

For data augmentation and temporal sampling, each of the 240 original 30 s videos (30 FPS) was sampled at fixed intervals and divided into 32 segments. One frame was randomly selected from each segment to form a 32-frame input sequence. This process was repeated three times for each video, resulting in 720 facial image sequences. All sequences inherited the same discrete emotion label and dimensional emotion annotation as their corresponding original videos. Example sequences from different emotion categories are shown in Figure 2, which is intended as a qualitative illustration of representative samples rather than a standalone label-validation experiment.

The emotion labels are inherited from the official PPB annotations at the task/sequence level, which were generated using standardized DES and SAM protocols in the original dataset construction. Because driving-related facial expressions are often low-intensity and context-dependent, a single still frame may appear ambiguous (e.g., sadness vs. neutral). Therefore, category validity in this study is assessed primarily through sequence-level quantitative evaluation (Accuracy, F1, CCC, Pearson, RMSE) and confusion matrix analysis on the held-out test set, rather than subjective interpretation of isolated frames.
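
The segment-based frame sampling used for the facial sequences can be sketched as follows. The example assumes that a 30 s video at 30 FPS is treated as 900 frame indices split into 32 nearly equal consecutive segments. The exact segment boundaries and random number generator used by the authors are not specified, so this is one plausible reading, with a hypothetical function name.

```python
import random


def sample_clip(num_frames, num_segments=32, rng=random):
    """Pick one random frame index from each of `num_segments` consecutive segments."""
    bounds = [round(i * num_frames / num_segments) for i in range(num_segments + 1)]
    return [rng.randrange(bounds[i], bounds[i + 1]) for i in range(num_segments)]


rng = random.Random(0)            # fixed seed so the sketch is repeatable
num_frames = 30 * 30              # 30 s at 30 FPS
clips = [sample_clip(num_frames, 32, rng) for _ in range(3)]  # three draws per video

print(len(clips), len(clips[0]))  # 3 32
print(240 * len(clips))           # 720
```

Each draw keeps the frames in temporal order while varying which frame represents a segment, which is how three distinct 32-frame sequences are obtained from one video.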

The vehicle modality includes eight sensor signals, such as vehicle speed and acceleration. To eliminate scale differences among different sensor channels, Z-score normalization was applied. Then, we performed proportional sampling (20 Hz) on the vehicle data (60 Hz sampling rate) to generate 720 discrete driving sequences, each containing 360 frames, with emotion labels consistent with the corresponding facial video sequences. Ultimately, the ratio of facial image sequences to vehicle data sequences was 1:50. For model evaluation, we randomly split the entire dataset into 80% training data and 20% test data.

The proposed DEGM was implemented and trained using the PyTorch 2.2.2 framework on an NVIDIA GeForce RTX 4090 GPU from the cloud server operator “AutoDL”. The total number of training epochs was set to 200. A cosine annealing learning rate scheduler with warm-up was adopted, where the warm-up phase lasted for 10 epochs, the maximum learning rate was set to 3 × 10⁻⁵, and the minimum learning rate was set to 1 × 10⁻⁶. Model parameters were optimized using the Adam optimizer.
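
The learning rate schedule described above can be expressed as a small function of the epoch index. The sketch below assumes a linear warm-up that reaches the maximum rate at the end of the tenth epoch, followed by a cosine decay that reaches the minimum rate at the final epoch. The warm-up shape and the epoch-level (rather than step-level) update are assumptions, since the text only states the phase lengths and the two rate limits.

```python
import math


def learning_rate(epoch, total_epochs=200, warmup_epochs=10, lr_max=3e-5, lr_min=1e-6):
    """Learning rate for a 0-based epoch: linear warm-up, then cosine decay to lr_min."""
    if epoch < warmup_epochs:
        # Assumed warm-up shape: linear ramp that reaches lr_max at the last warm-up epoch.
        return lr_max * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs - 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


for epoch in (0, 9, 10, 105, 199):
    print(epoch, f"{learning_rate(epoch):.2e}")
```

In a training loop, the returned value would be written to the optimizer's parameter groups at the start of each epoch.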

4.2. Comparative Experiments

To comprehensively evaluate the effectiveness and applicability of the proposed DEGM framework for driver emotion recognition, systematic comparative experiments were conducted under identical data splits and experimental settings. DEGM was compared against a range of representative baseline methods, including models based on temporal modeling, attention mechanisms, and multimodal fusion strategies. These methods cover mainstream modeling paradigms in driver state analysis and affective computing, and collectively reflect the performance upper bounds of existing approaches from different perspectives of temporal dependency modeling and cross-modal interaction.

Considering that driver emotion recognition involves both discrete emotion category classification and continuous emotion intensity modeling, all methods were evaluated on two complementary tasks: Task 1 is discrete emotion classification, and Task 2 is continuous emotion dimension regression.

For Task 1, Accuracy and F1-score were adopted to assess classification performance. For Task 2, regression performance was evaluated using the Concordance Correlation Coefficient (CCC), Pearson Correlation Coefficient (PCC), and Root Mean Square Error (RMSE), which jointly measure prediction consistency, linear correlation, and absolute error magnitude.
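
For reference, the three regression metrics can be computed as in the sketch below, which uses population moments for CCC. Whether the reported values were computed per dimension over the whole test set before averaging is not stated here, so the example only illustrates the definitions on a toy series.

```python
import math


def _mean(xs):
    return sum(xs) / len(xs)


def rmse(pred, true):
    """Root mean square error."""
    return math.sqrt(_mean([(p - t) ** 2 for p, t in zip(pred, true)]))


def pearson(pred, true):
    """Pearson correlation coefficient."""
    mp, mt = _mean(pred), _mean(true)
    cov = sum((p - mp) * (t - mt) for p, t in zip(pred, true))
    var_p = sum((p - mp) ** 2 for p in pred)
    var_t = sum((t - mt) ** 2 for t in true)
    return cov / math.sqrt(var_p * var_t)


def ccc(pred, true):
    """Concordance correlation coefficient: 2*cov / (var_p + var_t + (mean gap)^2)."""
    n = len(pred)
    mp, mt = _mean(pred), _mean(true)
    cov = sum((p - mp) * (t - mt) for p, t in zip(pred, true)) / n
    var_p = sum((p - mp) ** 2 for p in pred) / n
    var_t = sum((t - mt) ** 2 for t in true) / n
    return 2 * cov / (var_p + var_t + (mp - mt) ** 2)


true = [0.0, 1.0, 2.0, 3.0]
pred = [0.5, 1.5, 2.5, 3.5]   # perfectly correlated but shifted by a constant 0.5
print(round(pearson(pred, true), 4), round(ccc(pred, true), 4), round(rmse(pred, true), 4))
```

The toy series shows why the metrics are reported together: a constant offset leaves the Pearson coefficient at 1 while lowering CCC to about 0.909 and producing an RMSE of 0.5.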

In addition to quantitative comparisons, confusion matrices and error distribution analyses were employed to visualize the prediction behaviors of different methods. These visualizations provide further insight into model-specific strengths and weaknesses across different emotion categories and emotional dimensions, thereby offering objective support for the subsequent experimental analysis.

In the comparative experiments, the proposed DEGM is systematically evaluated against multiple representative cross-modal driver emotion recognition methods on both the discrete emotion classification task (Task 1) and the continuous emotion dimension regression task (Task 2), and the results are shown in Table 1. As can be observed, DEGM achieves the best performance in the discrete classification task, obtaining the highest Accuracy of 87.50% and F1 score of 87.27%. Compared with ConvLSTM, DDEC, and MER-MFVA, DEGM improves classification accuracy by approximately 3–8 percentage points, and it also outperforms Former-DFER with an accuracy gain of about 7%, demonstrating a clear advantage in discrete emotion recognition.

This performance improvement is further illustrated by the confusion matrices shown in Figure 3. Compared with the other methods, DEGM exhibits a higher concentration along the main diagonal for most emotion categories, indicating more stable and consistent predictions. In particular, DEGM shows more concentrated prediction distributions for the SAD, AD, and DD categories, whereas several baseline methods suffer from noticeable inter-class confusion. For example, MER-MFVA and MMA-DFER present relatively high misclassification rates between the SAD and SD categories, while ConvLSTM and Former-DFER both exhibit insufficient recall for the FD category, resulting in more dispersed predictions.

For the continuous emotion dimension regression task, the compared methods show varying performance across different evaluation metrics. Former-DFER achieves the highest CCC-Avg (0.8219) and Pearson-Avg (0.8383), while DEGM attains a CCC-Avg of 0.8211 and a Pearson-Avg of 0.8333, which are within 0.001 and 0.005 of Former-DFER's values, respectively. In terms of error, DEGM obtains an RMSE-Avg of 0.2793, close to the 0.2782 achieved by Former-DFER.

A dimension-wise analysis reveals that the performance of different methods varies across the Valence, Arousal, and Dominance dimensions, as illustrated by the error distributions in Figure 4. In the Valence dimension, both DEGM and Former-DFER exhibit relatively concentrated error distributions with peaks close to zero, whereas ConvLSTM and MMA-DFER show more dispersed distributions with heavier tails. For the Arousal dimension, the error distributions are generally wider than those of Valence and Dominance, and several methods, including MMA-DFER and ConvLSTM, present long-tailed distributions in both positive and negative directions, indicating higher regression instability. In the Dominance dimension, DEGM and Former-DFER demonstrate more symmetric and concentrated error distributions, while MER-MFVA and DDEC show more pronounced shifts and dispersion.

Overall, the compared methods exhibit a trade-off between discrete emotion classification and continuous emotion dimension regression. Some approaches achieve relatively strong correlation-based metrics (e.g., CCC or Pearson) but limited classification accuracy; Former-DFER, for example, attains the highest CCC-Avg but an Accuracy of only 80.55%. In contrast, DEGM maintains competitive regression performance while achieving the best performance on the discrete emotion classification task.

4.3. Ablation Study

To systematically analyze the effectiveness of each key component in the proposed DEGM, a series of ablation experiments are conducted under the same experimental settings. Specifically, three aspects are investigated: (1) the impact of different cross-modal interaction strategies between facial video and vehicle context, (2) the contribution of continuous emotion dimension regression as an auxiliary task with different dimension combinations, and (3) the necessity of vehicle modality and multi-task learning through baseline component removal. These experiments aim to quantify the role of each design choice and provide deeper insights into how contextual modulation and dimensional supervision influence driver emotion recognition performance.

In the ablation experiments on cross-modal interaction strategies, different fusion and modulation schemes exhibit substantial performance differences on both the discrete classification task and the continuous emotion dimension regression task, indicating that the design of cross-modal interaction is essential for driver emotion recognition. As reported in Table 2, simple feature-level fusion methods (Add and Concat) achieve relatively strong performance on Task 1, with Concat reaching an Accuracy of 85.64% and an F1 score of 85.69%. However, the performance gains of these approaches on Task 2 are limited, with CCC-Avg values ranging from approximately 0.8168 to 0.8214, while RMSE remains around 0.284. The corresponding scatter plots in Figure 5 show that although the overall prediction trends are generally aligned with the ground truth, noticeable dispersion persists in the mid-to-high emotion ranges. This issue is particularly evident in the Dominance dimension, where a considerable number of predicted samples deviate from the diagonal, suggesting that simple fusion strategies struggle to accurately model fine-grained emotion intensity variations.
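The simple fusion baselines can be pictured with the sketch below. It is only an illustration under stated assumptions: the vectors are toy values, the function names are hypothetical, and the gated form shown (a sigmoid gate that interpolates between the two modalities) is one common formulation that may differ from the Gated variant evaluated in Table 2.

```python
import math


def fuse_add(face, vehicle):
    """Element-wise addition; both feature vectors must share one dimension."""
    if len(face) != len(vehicle):
        raise ValueError("Add fusion requires equal feature dimensions")
    return [f + v for f, v in zip(face, vehicle)]


def fuse_concat(face, vehicle):
    """Concatenation; the output dimension is the sum of both input dimensions."""
    return list(face) + list(vehicle)


def fuse_gated(face, vehicle, gate_logits):
    """Assumed gated form: g = sigmoid(logit), out = g * face + (1 - g) * vehicle.

    In a trained model the logits would come from a learned layer; here they
    are passed in directly to keep the sketch self-contained.
    """
    gates = [1.0 / (1.0 + math.exp(-z)) for z in gate_logits]
    return [g * f + (1.0 - g) * v for g, f, v in zip(gates, face, vehicle)]


# Hypothetical toy features.
face_feat = [1.0, 2.0]
vehicle_feat = [0.5, -1.0]

print(fuse_add(face_feat, vehicle_feat))
print(fuse_concat(face_feat, vehicle_feat))
print(fuse_gated(face_feat, vehicle_feat, gate_logits=[0.0, 0.0]))
```

In all three cases the vehicle features enter the representation as if they were emotional features in their own right, which is the design assumption that conditional modulation replaces.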

By contrast, the No-Guided FiLM method, which introduces conditional modulation without explicit emotion dimension guidance, matches Concat in classification Accuracy (85.64%) but shows a marked decline in continuous regression performance. Specifically, its CCC-Avg drops to 0.7698, and its RMSE increases to 0.3161. The scatter distributions reveal more dispersed predictions in the Valence and Dominance dimensions, with systematic offsets in the low-emotion range. This indicates that modulation mechanisms lacking explicit emotional semantic constraints have difficulty stably aligning facial representations with vehicle context.

The Cross-Attention method demonstrates relatively balanced performance across the three emotion dimensions, achieving CCC values of approximately 0.780, 0.882, and 0.801 for Valence, Arousal, and Dominance, respectively. Nevertheless, its overall performance remains inferior to that of DEGCM on both tasks (Table 2). As shown in the scatter plots, Cross-Attention achieves a better fitting trend in the Arousal dimension while still exhibiting relatively large variance in the Dominance dimension, suggesting that high-degree-of-freedom attention mechanisms may introduce instability when vehicle semantic cues are weak.

By comparison, DEGCM, which conditions the modulation on the predicted emotion dimensions, achieves the best results in Table 2 (Accuracy of 87.50%, F1 score of 87.27%, CCC-Avg of 0.8468, and RMSE-Avg of 0.2793). This demonstrates that emotion dimension guidance enables vehicle context to more effectively modulate facial features. It should be noted that the CCC value of the Arousal dimension is not the best among the compared strategies and shows a slight decline compared with some baseline methods. Additionally, a small number of outliers can be observed in the high-arousal range, indicating that dimension-guided modulation still faces challenges in modeling highly dynamic and transient arousal variations. Overall, this experiment confirms the necessity of cross-modal interaction for driver emotion modeling and shows that conditional modulation with emotion dimension constraints provides more stable modeling capability across most emotion dimensions.
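The idea of dimension-guided modulation can be sketched with the generic FiLM operation, in which a conditioning vector produces a per-channel scale (gamma) and shift (beta) that are applied to the features. The sketch below assumes the condition is a three-value Valence, Arousal, and Dominance prediction; the layer sizes, weights, and helper names are hypothetical and do not reproduce the authors' DEGCM module.

```python
def linear(weights, bias, x):
    """Dense layer y = W x + b on plain Python lists."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def film(features, condition, w_gamma, b_gamma, w_beta, b_beta):
    """Generic FiLM: out_c = gamma_c(condition) * features_c + beta_c(condition)."""
    gamma = linear(w_gamma, b_gamma, condition)
    beta = linear(w_beta, b_beta, condition)
    return [g * f + b for g, f, b in zip(gamma, features, beta)]


# Hypothetical sizes: a 2-channel facial feature and a 3-value VAD condition.
face_feat = [2.0, 4.0]
vad_pred = [0.5, -1.0, 0.0]  # assumed VAD estimate derived from vehicle data

# Hypothetical weights; a gamma bias of 1 means a zero condition leaves the scale at 1.
w_gamma = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
b_gamma = [1.0, 1.0]
w_beta = [[0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
b_beta = [0.0, 0.5]

modulated = film(face_feat, vad_pred, w_gamma, b_gamma, w_beta, b_beta)
neutral = film(face_feat, [0.0, 0.0, 0.0], w_gamma, b_gamma, w_beta, b_beta)
print(modulated, neutral)
```

Because the condition has only three interpretable components, the scale and shift applied to each facial channel can be traced back to a specific emotional dimension, in contrast to a FiLM variant conditioned on raw vehicle features.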

To quantitatively assess the impact of continuous emotion dimensions as auxiliary tasks on discrete emotion classification, further ablation experiments with different dimension combinations are conducted, as summarized in Table 3.

When Valence, Arousal, and Dominance are jointly optimized, the model achieves the best classification performance, with an Accuracy of 87.50% and an F1 score of 87.27%, clearly outperforming all dual-dimension and single-dimension configurations. Removing any one dimension results in consistent performance degradation, with Accuracy dropping to 80.55% (Valence + Arousal), 80.09% (Valence + Dominance), and 78.70% (Arousal + Dominance). Single-dimension configurations range from 74.07% (Valence only) to 79.62% (Dominance only) and also remain well below the full model, indicating that relying on partial emotion dimensions is insufficient to provide effective discriminative constraints for discrete emotion categories. As shown in Figure 6, this trend is particularly evident in the confusion matrices. Under three-dimension joint modeling, most emotion categories exhibit higher concentration along the main diagonal, whereas missing or reduced dimension settings lead to substantially increased inter-class confusion, especially among categories with similar emotion intensity or arousal levels. These results suggest that different emotion dimensions play complementary roles in driving scenarios, and their joint modeling provides a more complete continuous emotional structure to support classification.

The baseline ablation results are reported in Table 4, with corresponding confusion matrices shown in Figure 7. The complete model incorporating facial modality, vehicle modality, and multi-task learning achieves the best classification performance, with Accuracy and F1 score reaching 87.50% and 87.27%, respectively. When the vehicle modality is removed while retaining the dimension regression auxiliary task, classification performance decreases to 82.40%/82.28%, indicating that vehicle context provides effective complementary information for facial emotion features.

Conversely, when the vehicle modality is retained but the dimension regression task is removed, Accuracy and F1 score drop to 84.72% and 84.77%, respectively, demonstrating that continuous emotion dimension constraints also contribute positively to optimizing discriminative boundaries. When both the vehicle modality and the dimension regression auxiliary task are removed, leaving only facial modality for classification, performance further degrades to 77.31%/77.19%, representing the most severe decline. The confusion matrices further show that, in the absence of vehicle information or dimension constraints, inter-class confusion increases significantly, particularly among categories with similar arousal levels or emotion intensities. Overall, these results verify the complementary roles of vehicle modality and dimension regression auxiliary tasks in driver emotion recognition, with both jointly constituting key factors for improving classification performance and prediction stability.
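The role of the auxiliary task can be pictured with a minimal multi-task objective. The sketch below assumes a cross-entropy term for the discrete classes plus a weighted mean-squared-error term for the three dimensions; the weighting scheme, the loss choices, and the names `multitask_loss` and `reg_weight` are illustrative assumptions rather than the loss used by the authors. Setting the weight to zero corresponds to removing the dimension regression task.

```python
import math


def cross_entropy(logits, target_index):
    """Softmax cross-entropy for one sample, computed with the log-sum-exp trick."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target_index]


def mse(pred, target):
    """Mean squared error over the emotion dimensions."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def multitask_loss(logits, target_index, vad_pred, vad_true, reg_weight=1.0):
    """Assumed joint objective: classification loss + reg_weight * regression loss."""
    return cross_entropy(logits, target_index) + reg_weight * mse(vad_pred, vad_true)


# Hypothetical sample: 7 emotion classes and a 3-value VAD target.
logits = [0.0] * 7
vad_pred = [0.5, 0.0, 0.0]
vad_true = [0.0, 0.0, 0.0]

joint = multitask_loss(logits, 3, vad_pred, vad_true, reg_weight=1.0)
cls_only = multitask_loss(logits, 3, vad_pred, vad_true, reg_weight=0.0)
print(f"joint={joint:.4f}  classification-only={cls_only:.4f}")
```

With uniform logits the classification term equals the natural logarithm of the number of classes, so the difference between the two printed values is exactly the regression penalty.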

5. Discussion

The experimental results indicate that DEGM provides a clear performance advantage for discrete driver emotion recognition while maintaining competitive dimensional regression. As shown in Table 1, DEGM achieves 87.50 ± 0.11 Accuracy and 87.27 ± 0.16 F1-score, outperforming all baselines on Task 1. The gain over the strongest classification baseline (DDEC: 83.79 Accuracy, 83.73 F1) is +3.71 and +3.54 points, respectively, and the gain over Former-DFER is +6.95 points in Accuracy. These improvements suggest that explicitly modeling modality asymmetry is beneficial in driving scenarios where facial cues are subtle and contextual information is noisy.

At the same time, DEGM does not dominate every regression metric, which reveals an important trade-off. On Task 2, DEGM reports CCC-Avg = 0.8211, Pearson-Avg = 0.8333, and RMSE-Avg = 0.2793, remaining very close to Former-DFER (0.8219, 0.8383, and 0.2782). This near-parity implies that the proposed framework prioritizes robust category discrimination without sacrificing continuous affect modeling. In practical intelligent cockpit settings, such a balance is desirable, since decision-oriented emotion categories are often directly linked to downstream intervention logic, while dimensional outputs provide complementary trend information.

The ablation studies further explain why DEGM is effective. In the cross-modal interaction ablation (Table 2), DEGCM reaches the best overall profile (87.50/87.27 for classification, CCC-Avg = 0.8468, RMSE-Avg = 0.2793). Notably, removing dimensional guidance (No-Guided FiLM) causes a marked regression degradation (CCC-Avg drops from 0.8468 to 0.7698; RMSE rises from 0.2793 to 0.3161), while classification also decreases (Accuracy: 87.50 to 85.64; F1: 87.27 to 85.57). This pattern supports the core assumption of the method: vehicle signals are more effective as semantically constrained contextual priors than as directly fused emotional features.

The contribution of multi-dimensional supervision is also quantitatively consistent. In Table 3, full VAD supervision (V + A + D) gives the best classification result (87.50 Accuracy, 87.27 F1), while all dual- or single-dimension variants are lower (Accuracy ranges roughly from 74.07 to 80.55). This confirms that Valence, Arousal, and Dominance provide complementary constraints and that partial dimensional supervision cannot fully regularize class boundaries in this task.

Table 4 highlights the complementary roles of modality and task design. Starting from the full model (87.50/87.27), removing vehicle modality reduces performance to 82.40/82.28, removing the dimensional auxiliary task reduces it to 84.72/84.77, and removing both leads to the lowest result (77.31/77.19). The magnitude of these drops indicates that performance gains are not due to one isolated component; instead, they emerge from cooperative effects between contextual conditioning and multi-task regularization.
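The cooperative effect can be made concrete with a small calculation on the Table 4 accuracies. The sketch below compares the drop from removing both components with the sum of the two individual drops; the "interaction" term is an illustrative quantity introduced here, and since Table 4 reports no variance estimates it should be read as indicative only.

```python
# Accuracy (%) values taken from Table 4.
full = 87.50            # face + vehicle + dimension regression
no_vehicle = 82.40      # vehicle modality removed
no_regression = 84.72   # dimension regression task removed
face_only = 77.31       # both removed

drop_vehicle = full - no_vehicle
drop_regression = full - no_regression
drop_both = full - face_only

# If the two components contributed independently, removing both would cost
# roughly the sum of the individual drops. The excess is the interaction term.
interaction = drop_both - (drop_vehicle + drop_regression)

print(f"vehicle: {drop_vehicle:.2f}  regression: {drop_regression:.2f}  "
      f"both: {drop_both:.2f}  interaction: {interaction:.2f}")
```

Removing both components costs more than the two individual drops combined, which is consistent with the reading that the components reinforce each other.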

One interpretation of the confusion patterns is that residual errors are partly task-intrinsic rather than purely model-induced. As shown in Figure 3, DEGM improves concentration along the diagonal, especially for SAD, AD, and DD, but confusion still appears between visually adjacent low-intensity categories (e.g., SAD vs. SD, and weaker FD recall in some baselines). In real driving, emotional expressions are constrained by attention demands and behavioral suppression, so facial differences among neighboring categories can be subtle. Therefore, multimodal sequence-level modeling is necessary, but some ambiguity is expected even with improved architectures.

Several limitations remain. First, arousal is still the least stable dimension, consistent with wider error distributions and occasional high-arousal outliers, suggesting insufficient modeling of rapid transient dynamics. Second, the current vehicle branch is statistics-based, which improves robustness but may discard fine-grained temporal cues. Third, the present 80/20 random split is effective for benchmarking but may still overestimate generalization if subject characteristics overlap between train and test sets; stricter subject-independent protocols should be further reported.
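The difference between a random split and a subject-independent split can be illustrated as follows. In this minimal sketch the split is made over subject identifiers rather than over individual clips, so no driver contributes data to both partitions. The subject identifiers, clip names, and function name are hypothetical and do not reflect the actual PPB file layout.

```python
import random


def subject_independent_split(samples, test_fraction=0.2, seed=0):
    """Split (subject_id, sample) pairs so that no subject is in both sets."""
    subjects = sorted({subject_id for subject_id, _ in samples})
    rng = random.Random(seed)
    rng.shuffle(subjects)
    # Hold out whole subjects, at least one.
    n_test = max(1, round(len(subjects) * test_fraction))
    test_subjects = set(subjects[:n_test])
    train = [s for s in samples if s[0] not in test_subjects]
    test = [s for s in samples if s[0] in test_subjects]
    return train, test


# Hypothetical data: 10 subjects with 3 clips each.
samples = [(f"S{i:02d}", f"clip_{i}_{k}") for i in range(10) for k in range(3)]

train, test = subject_independent_split(samples, test_fraction=0.2, seed=0)
train_subjects = {subject_id for subject_id, _ in train}
test_subjects = {subject_id for subject_id, _ in test}
print(len(train), len(test), sorted(test_subjects))
```

Under a clip-level random split, by contrast, clips from the same driver can land on both sides, which is the overlap that may inflate reported generalization.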

Future work should therefore focus on (i) richer temporal encoders for vehicle streams and cross-modal alignment, (ii) adaptive or uncertainty-aware task weighting to improve arousal robustness, and (iii) more rigorous external validation, including cross-subject and cross-dataset protocols. These directions are essential for translating offline gains into reliable real-world deployment in safety-critical intelligent cockpit systems.

6. Conclusions

This study addresses the challenges of weak emotional expressiveness, strong contextual dependency, and imbalanced relevance among multimodal information in driving scenarios, and proposes DEGM. The proposed method explicitly distinguishes the asymmetric roles of facial video and vehicle signals in emotion recognition. Instead of treating vehicle data as an equivalent emotional modality, vehicle signals are modeled as contextual constraints, which guide the conditional modulation of facial representations through continuous emotion dimension prediction, enabling controlled and interpretable cross-modal interaction. In addition, a multi-task learning strategy that jointly optimizes discrete emotion classification and continuous emotion dimension regression is introduced, allowing the model to achieve a balanced trade-off between discriminative performance and continuous emotion modeling.

Comprehensive experiments conducted on the PPB dataset demonstrate that the proposed method achieves an Accuracy of 87.50% and an F1 score of 87.27% on the discrete emotion classification task, consistently outperforming multiple representative baseline methods. On the continuous emotion dimension regression task, DEGM maintains competitive performance across CCC, Pearson correlation coefficient, and RMSE metrics, while exhibiting more concentrated prediction error distributions. Further ablation studies validate three key findings. First, the design of cross-modal interaction has a significant impact on driver emotion modeling, with dimension-guided conditional modulation showing more stable alignment across most emotion dimensions. Second, continuous emotion dimensions—Valence, Arousal, and Dominance—play complementary roles in classification, and joint modeling consistently outperforms any subset configuration. Third, both vehicle modality information and the dimension regression auxiliary task make indispensable contributions to improving classification accuracy and prediction stability.

Although the proposed approach achieves consistent overall performance improvements, noticeable error fluctuations remain in the Arousal dimension, which is more sensitive to rapid and transient emotional changes. This suggests that the current conditional modulation mechanism has limitations in modeling highly dynamic emotional variations. Moreover, while the statistical modeling of vehicle signals enhances robustness, it may constrain expressive capacity in more complex or long-duration driving scenarios. Future work will explore finer-grained temporal modeling strategies and adaptive dimension weighting mechanisms to further improve the modeling of complex driving contexts and dynamic emotional transitions.

Author Contributions

Conceptualization, W.S. and X.M.; methodology, W.S. and X.M.; software, J.Y.; validation, S.L.; formal analysis, J.Y.; investigation, W.S.; resources, W.S.; data curation, X.M.; writing—original draft preparation, J.Y.; writing—review and editing, S.L.; visualization, S.L.; supervision, X.M.; project administration, W.S.; funding acquisition, W.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Major Science and Technology Project of Guangxi, grant number AA23062066.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

PPB is a publicly available third-party dataset, and our use complies with its license terms. We did not conduct original human-subject data collection; ethical approval and informed-consent procedures were handled in the original PPB study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Li, W.; Zeng, G.; Zhang, J.; Xu, Y.; Xing, Y.; Zhou, R.; Guo, G.; Shen, Y.; Cao, D.; Wang, F.-Y. CogEmoNet: A Cognitive-Feature-Augmented Driver Emotion Recognition Model for Smart Cockpit. IEEE Trans. Comput. Soc. Syst. 2022, 9, 667–678.
2. Huang, H.; Liu, J.; Yang, Y.; Wang, J. Risk Generation and Identification of Driver–Vehicle–Road Microtraffic System. ASCE-ASME J. Risk Uncertain. Eng. Syst. A Civ. Eng. 2022, 8, 04022029.
3. Hu, L.; Lu, T.; Li, G.; Zhang, X.; Cai, H. Automatic Generation of Intelligent Vehicle Testing Scenarios at Intersections Based on Natural Driving Datasets. IEEE Trans. Intell. Veh. 2024, 9, 5448–5460.
4. Liu, S.; Wang, X.; Zhao, L.; Li, B.; Hu, W.; Yu, J.; Zhang, Y.-D. 3DCANN: A Spatio-Temporal Convolution Attention Neural Network for EEG Emotion Recognition. IEEE J. Biomed. Health Inform. 2022, 26, 5321–5331.
5. Pan, D.; Zheng, H.; Xu, F.; Ouyang, Y.; Jia, Z.; Wang, C.; Zeng, H. MSFR-GCN: A Multi-Scale Feature Reconstruction Graph Convolutional Network for EEG Emotion and Cognition Recognition. IEEE Trans. Neural Syst. Rehabil. Eng. 2023, 31, 3245–3254.
6. Ahmed, M.R.; Islam, S.; Muzahidul Islam, A.K.M.; Shatabda, S. An Ensemble 1D-CNN-LSTM-GRU Model with Data Augmentation for Speech Emotion Recognition. Expert Syst. Appl. 2023, 218, 119633.
7. Ekman, P. Facial Expression and Emotion. Am. Psychol. 1993, 48, 384–392.
8. Jain, D.K.; Dutta, A.K.; Verdú, E.; Alsubai, S.; Sait, A.R.W. An Automated Hyperparameter Tuned Deep Learning Model Enabled Facial Emotion Recognition for Autonomous Vehicle Drivers. Image Vis. Comput. 2023, 133, 104659.
9. Saadi, I.; Cunningham, D.W.; Taleb-Ahmed, A.; Hadid, A.; Hillali, Y.E. Driver’s Facial Expression Recognition: A Comprehensive Survey. Expert Syst. Appl. 2024, 242, 122784.
10. Varma, H.; Ganapathy, N.; Deserno, T.M. Video-Based Driver Emotion Recognition Using Hybrid Deep Spatio-Temporal Feature Learning. In Proceedings of the Medical Imaging 2022: Imaging Informatics for Healthcare, Research, and Applications; SPIE: Bellingham, WA, USA, 2022; Volume 12037, pp. 57–63.
11. Xiang, G.; Yao, S.; Wu, X.; Deng, H.; Wang, G.; Liu, Y.; Li, F.; Peng, Y. Driver Multi-Task Emotion Recognition Network Based on Multi-Modal Facial Video Analysis. Pattern Recognit. 2025, 161, 111241.
12. How, T.-V.; Green, R.E.A.; Mihailidis, A. Towards PPG-Based Anger Detection for Emotion Regulation. J. NeuroEng. Rehabil. 2023, 20, 107–134.
13. Quiles Pérez, M.; Martínez Beltrán, E.T.; López Bernal, S.; Martínez Pérez, G.; Huertas Celdrán, A. Analyzing the Impact of Driving Tasks When Detecting Emotions through Brain–Computer Interfaces. Neural Comput. Appl. 2023, 35, 8883–8901.
14. Xiao, H.; Li, W.; Zeng, G.; Wu, Y.; Xue, J.; Zhang, J.; Li, C.; Guo, G. On-Road Driver Emotion Recognition Using Facial Expression. Appl. Sci. 2022, 12, 807–826.
15. Azman, A.; Raman, K.J.; Mhlanga, I.A.J.; Ibrahim, S.Z.; Yogarayan, S.; Abdullah, M.F.A.; Razak, S.F.A.; Amin, A.H.M.; Muthu, K.S. Real Time Driver Anger Detection. In Proceedings of the Information Science and Applications 2018; Kim, K.J., Baek, N., Eds.; Springer: Singapore, 2019; pp. 157–167.
16. Sudha, S.S.; Suganya, S.S. On-Road Driver Facial Expression Emotion Recognition with Parallel Multi-Verse Optimizer (PMVO) and Optical Flow Reconstruction for Partial Occlusion in Internet of Things (IoT). Meas. Sens. 2023, 26, 100711.
17. Du, G.; Wang, Z.; Gao, B.; Mumtaz, S.; Abualnaja, K.M.; Du, C. A Convolution Bidirectional Long Short-Term Memory Neural Network for Driver Emotion Recognition. IEEE Trans. Intell. Transp. Syst. 2021, 22, 4570–4578.
18. Zhao, Z.; Liu, Q. Former-DFER: Dynamic Facial Expression Recognition Transformer. In Proceedings of the 29th ACM International Conference on Multimedia; Association for Computing Machinery: New York, NY, USA, 2021; pp. 1553–1561.
19. Zhang, X.; Li, M.; Lin, S.; Xu, H.; Xiao, G. Transformer-Based Multimodal Emotional Perception for Dynamic Facial Expression Recognition in the Wild. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 3192–3203.
20. Arnab, A.; Dehghani, M.; Heigold, G.; Sun, C.; Lučić, M.; Schmid, C. ViViT: A Video Vision Transformer. arXiv 2021, arXiv:2103.15691.
21. Pavlidis, I.; Dcosta, M.; Taamneh, S.; Manser, M.; Ferris, T.; Wunderlich, R.; Akleman, E.; Tsiamyrtzis, P. Dissecting Driver Behaviors under Cognitive, Emotional, Sensorimotor, and Mixed Stressors. Sci. Rep. 2016, 6, 25651.
22. Shi, Y.; Boffi, M.; Piga, B.E.A.; Mussone, L.; Caruso, G. Perception of Driving Simulations: Can the Level of Detail of Virtual Scenarios Affect the Driver’s Behavior and Emotions? IEEE Trans. Veh. Technol. 2022, 71, 3429–3442.
23. Pan, B.; Hirota, K.; Jia, Z.; Zhao, L.; Jin, X.; Dai, Y. Multimodal Emotion Recognition Based on Feature Selection and Extreme Learning Machine in Video Clips. J. Ambient Intell. Hum. Comput. 2023, 14, 1903–1917.
24. Ding, T.; Zhang, K.; Gao, S.; Miao, X.; Xi, J. A Multimodal Driver Anger Recognition Method Based on Context-Awareness. IEEE Access 2024, 12, 118533–118550.
25. Mou, L.; Rastgoo, M.N.; Ma, L.; Huang, T.; Yin, B.; Jain, R. Driver Emotion Recognition with a Hybrid Attentional Multimodal Fusion Framework. IEEE Trans. Affect. Comput. 2023, 14, 2970–2981.
26. Yang, H.; Wu, J.; Hu, Z.; Lv, C. Real-Time Driver Cognitive Workload Recognition: Attention-Enabled Learning with Multimodal Information Fusion. IEEE Trans. Ind. Electron. 2024, 71, 4999–5009.
27. Xiang, G.; Yao, S.; Deng, H.; Wu, X.; Wang, X.; Xu, Q.; Yu, T.; Wang, K.; Peng, Y. A Multi-Modal Driver Emotion Dataset and Study: Including Facial Expressions and Synchronized Physiological Signals. Eng. Appl. Artif. Intell. 2024, 130, 107772.
28. Chumachenko, K.; Iosifidis, A.; Gabbouj, M. MMA-DFER: MultiModal Adaptation of Unimodal Models for Dynamic Facial Expression Recognition in-the-Wild. arXiv 2024, arXiv:2404.09010.
29. Zhang, K.; Zhang, Z.; Li, Z.; Yu, Q. Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks. IEEE Signal Process. Lett. 2016, 23, 1499–1503.
30. Dong, Z.; Hu, C.; Zhu, L.; Ji, X.; Lai, C.S. A Dual-Pathway Driver Emotion Classification Network Using Multitask Learning Strategy: A Joint Verification. IEEE Internet Things J. 2025, 12, 14897–14908.

Figure 1. Dimensional Emotion Guided Multi-task Learning Network.

Figure 2. Example sequences from different emotion categories.

Figure 3. Confusion matrices of different methods on the discrete emotion classification task. ND = Neutral, HD = Happy, SD = Surprise, SAD = Sadness, AD = Anger, DD = Disgust, FD = Fear.

Figure 4. Error distribution of all methods on regression tasks in different dimensions. (a) DEGM. (b) MER-MFVA. (c) DDEC. (d) MMA-DFER. (e) ConvLSTM. (f) Former-DFER.

Figure 5. Scatter plots of predictions for different cross-modal feature interaction methods on the continuous emotion dimension regression task. (a) DEGCM. (b) FiLM without dimensional guidance. (c) Concat. (d) Add. (e) Gated. (f) Cross-Attention.

Figure 6. Discrete classification confusion matrix under different dimensional prediction tasks. ND = Neutral, HD = Happy, SD = Surprise, SAD = Sadness, AD = Anger, DD = Disgust, FD = Fear.

Figure 7. Discrete classification confusion matrices under baseline ablation. (a) Complete model. (b) Multi-task model without vehicle modality. (c) Discrete classification model without the dimension regression task. (d) Discrete classification model without vehicle modality or the dimension regression task.

Table 1. Comparison of the performance of different methods on Task 1 and Task 2.

Method	Task 1: Acc (%)	Task 1: F1 Score (%)	Task 2: CCC-Avg	Task 2: Pearson-Avg	Task 2: RMSE-Avg
Former-DFER [18]	80.55 ± 0.13	80.79 ± 0.15	0.8219 ± 0.0011	0.8383 ± 0.0014	0.2782 ± 0.0007
ConvLSTM [25]	83.33 ± 0.11	83.51 ± 0.12	0.7540 ± 0.0017	0.7636 ± 0.0011	0.3281 ± 0.0009
MMA-DFER [28]	79.63 ± 0.14	79.85 ± 0.14	0.7567 ± 0.0021	0.7687 ± 0.0012	0.3154 ± 0.0006
DDEC [30]	83.79 ± 0.15	83.73 ± 0.15	0.7973 ± 0.0019	0.8035 ± 0.0021	0.2943 ± 0.0006
MER-MFVA [11]	82.87 ± 0.11	83.58 ± 0.11	0.7897 ± 0.0015	0.7960 ± 0.0012	0.3008 ± 0.0004
DEGM (ours)	87.50 ± 0.11	87.27 ± 0.16	0.8211 ± 0.0013	0.8333 ± 0.0021	0.2793 ± 0.0004

Table 2. Performance of different cross-modality feature interaction strategies in multi-task prediction.

Method	Task 1: Acc (%)	Task 1: F1 Score (%)	Task 2: CCC-Avg	Task 2: Pearson-Avg	Task 2: RMSE-Avg
Cross-Attention	81.94	81.73	0.8066	0.8103	0.2935
Gated	83.33	83.23	0.8073	0.8135	0.2899
Add	84.72	84.99	0.8214	0.8313	0.2840
Concat	85.64	85.69	0.8168	0.8225	0.2843
No-Guided FiLM	85.64	85.57	0.7698	0.7781	0.3161
DEGCM	87.50	87.27	0.8468	0.8533	0.2793

Table 3. Ablation study on the contribution of valence, arousal, and dominance signals.

Valence	Arousal	Dominance	Accuracy (%)	F1 Score (%)
✓	✓	✓	87.50	87.27
✓	✓	✗	80.55	80.33
✓	✗	✓	80.09	79.94
✗	✓	✓	78.70	78.45
✓	✗	✗	74.07	73.56
✗	✓	✗	79.16	79.06
✗	✗	✓	79.62	79.53

Table 4. Ablation study on the contribution of facial and vehicle modalities across two tasks.

Face	Task 1	Vehicle	Task 2	Accuracy (%)	F1 Score (%)
✓	✓	✓	✓	87.50	87.27
✓	✓	✗	✓	82.40	82.28
✓	✓	✓	✗	84.72	84.77
✓	✓	✗	✗	77.31	77.19


