Dimensional Emotion-Guided Conditional Modulation for Context-Aware Multimodal Driver Affect Recognition
Abstract
1. Introduction
- Asymmetric Multimodal Formalization: We propose a dimensional emotion-guided conditional modulation framework that explicitly models the asymmetric roles of facial and vehicle data. Mapping vehicle context into a continuous valence-arousal-dominance (VAD) space yields a more interpretable and robust cross-modal interaction than conventional symmetric fusion paradigms.
- Multi-task Collaborative Optimization: We formulate driver emotion recognition as a joint optimization of discrete emotion classification and continuous emotion regression. The dimensional regression task acts as a structural regularizer, improving classification robustness and feature consistency under noisy or weak contextual signals.
- Hierarchical Spatio-Temporal Encoding: We introduce the Spatio-Temporal Aggregation and Projection Embedding (STAP-Embed) module for facial video encoding. By hierarchically aggregating short-term dynamics and long-term dependencies, STAP-Embed preserves the fine-grained spatio-temporal cues needed to detect subtle facial micro-expressions.
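The joint objective named in the second contribution can be illustrated as a weighted sum of a classification loss and a regression loss. The snippet below is a minimal sketch and not the authors' implementation: the choice of mean squared error for the VAD regression term, the function names, and the weight `reg_weight` are all assumptions introduced here for illustration.

```python
import math


def softmax_cross_entropy(logits, label):
    """Cross-entropy for one sample, computed stably from raw logits."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def mse(pred, target):
    """Mean squared error over the (valence, arousal, dominance) outputs."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def multitask_loss(logits, label, vad_pred, vad_true, reg_weight=0.5):
    """Hypothetical joint loss: classification term plus weighted regression term."""
    cls = softmax_cross_entropy(logits, label)
    reg = mse(vad_pred, vad_true)
    return cls + reg_weight * reg, cls, reg


# Toy values, chosen only to exercise the functions.
total, cls, reg = multitask_loss(
    logits=[2.0, 0.5, -1.0], label=0,
    vad_pred=[0.6, 0.2, 0.1], vad_true=[0.5, 0.4, 0.1],
)
```

With `reg_weight=0`, the objective reduces to plain classification, which corresponds to the single-task setting that a multi-task ablation would compare against.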
2. Related Work
2.1. Modality Evolution: From Physiological to Visual Cues
2.2. Spatio-Temporal Modeling for Facial Expressions
2.3. Contextual Integration and Multimodal Fusion Paradigms
3. Methodology
3.1. Overall Architecture
- A facial video encoder based on the proposed STAP-Embed module,
- A lightweight vehicle context encoder,
- A dimensional emotion-guided conditional modulation module, and
- A multi-task prediction head for discrete emotion classification and continuous emotion regression.
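As a rough illustration of how these components could interact, the sketch below conditions a facial feature vector on a VAD estimate derived from vehicle context, using a feature-wise scale and shift. This is one plausible reading of "conditional modulation", not a description of the authors' module: the layer shapes, the `tanh` squashing, the residual `1 + gamma` form, the two-element context vector, and every weight value are hypothetical.

```python
import math


def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def vehicle_to_vad(context, w, b):
    """Project vehicle context to (valence, arousal, dominance) in [-1, 1]."""
    return [math.tanh(z) for z in linear(context, w, b)]


def modulate(face_feat, vad, w_gamma, b_gamma, w_beta, b_beta):
    """Scale and shift each facial feature channel using VAD-derived parameters."""
    gamma = linear(vad, w_gamma, b_gamma)
    beta = linear(vad, w_beta, b_beta)
    return [(1.0 + g) * f + b for f, g, b in zip(face_feat, gamma, beta)]


# Toy inputs: a 2-element vehicle context and a 4-channel facial feature.
context = [0.8, -0.3]
w_vad = [[0.5, 0.1], [0.2, 0.9], [-0.4, 0.3]]
b_vad = [0.0, 0.0, 0.0]
vad = vehicle_to_vad(context, w_vad, b_vad)

face_feat = [1.0, -2.0, 0.5, 0.0]
zero_w = [[0.0] * 3 for _ in range(4)]
zero_b = [0.0] * 4

# With all modulation weights at zero, the facial feature passes through unchanged.
identity = modulate(face_feat, vad, zero_w, zero_b, zero_w, zero_b)
```

The asymmetry is visible in the data flow: vehicle context never enters the output directly, it only sets the scale and shift applied to the facial representation.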
3.2. Facial Video Encoder
3.3. Vehicle Context Encoder
3.4. Dimensional Emotion-Guided Conditional Modulation
3.5. Multi-Task Learning Prediction Head
4. Experiments
4.1. Experimental Setup
4.2. Comparative Experiments
4.3. Ablation Study
5. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Li, W.; Zeng, G.; Zhang, J.; Xu, Y.; Xing, Y.; Zhou, R.; Guo, G.; Shen, Y.; Cao, D.; Wang, F.-Y. CogEmoNet: A Cognitive-Feature-Augmented Driver Emotion Recognition Model for Smart Cockpit. IEEE Trans. Comput. Soc. Syst. 2022, 9, 667–678.
2. Huang, H.; Liu, J.; Yang, Y.; Wang, J. Risk Generation and Identification of Driver–Vehicle–Road Microtraffic System. ASCE-ASME J. Risk Uncertain. Eng. Syst. A Civ. Eng. 2022, 8, 04022029.
3. Hu, L.; Lu, T.; Li, G.; Zhang, X.; Cai, H. Automatic Generation of Intelligent Vehicle Testing Scenarios at Intersections Based on Natural Driving Datasets. IEEE Trans. Intell. Veh. 2024, 9, 5448–5460.
4. Liu, S.; Wang, X.; Zhao, L.; Li, B.; Hu, W.; Yu, J.; Zhang, Y.-D. 3DCANN: A Spatio-Temporal Convolution Attention Neural Network for EEG Emotion Recognition. IEEE J. Biomed. Health Inform. 2022, 26, 5321–5331.
5. Pan, D.; Zheng, H.; Xu, F.; Ouyang, Y.; Jia, Z.; Wang, C.; Zeng, H. MSFR-GCN: A Multi-Scale Feature Reconstruction Graph Convolutional Network for EEG Emotion and Cognition Recognition. IEEE Trans. Neural Syst. Rehabil. Eng. 2023, 31, 3245–3254.
6. Ahmed, M.R.; Islam, S.; Muzahidul Islam, A.K.M.; Shatabda, S. An Ensemble 1D-CNN-LSTM-GRU Model with Data Augmentation for Speech Emotion Recognition. Expert Syst. Appl. 2023, 218, 119633.
7. Ekman, P. Facial Expression and Emotion. Am. Psychol. 1993, 48, 384–392.
8. Jain, D.K.; Dutta, A.K.; Verdú, E.; Alsubai, S.; Sait, A.R.W. An Automated Hyperparameter Tuned Deep Learning Model Enabled Facial Emotion Recognition for Autonomous Vehicle Drivers. Image Vis. Comput. 2023, 133, 104659.
9. Saadi, I.; Cunningham, D.W.; Taleb-Ahmed, A.; Hadid, A.; Hillali, Y.E. Driver’s Facial Expression Recognition: A Comprehensive Survey. Expert Syst. Appl. 2024, 242, 122784.
10. Varma, H.; Ganapathy, N.; Deserno, T.M. Video-Based Driver Emotion Recognition Using Hybrid Deep Spatio-Temporal Feature Learning. In Proceedings of the Medical Imaging 2022: Imaging Informatics for Healthcare, Research, and Applications; SPIE: Bellingham, WA, USA, 2022; Volume 12037, pp. 57–63.
11. Xiang, G.; Yao, S.; Wu, X.; Deng, H.; Wang, G.; Liu, Y.; Li, F.; Peng, Y. Driver Multi-Task Emotion Recognition Network Based on Multi-Modal Facial Video Analysis. Pattern Recognit. 2025, 161, 111241.
12. How, T.-V.; Green, R.E.A.; Mihailidis, A. Towards PPG-Based Anger Detection for Emotion Regulation. J. NeuroEng. Rehabil. 2023, 20, 107–134.
13. Quiles Pérez, M.; Martínez Beltrán, E.T.; López Bernal, S.; Martínez Pérez, G.; Huertas Celdrán, A. Analyzing the Impact of Driving Tasks When Detecting Emotions through Brain–Computer Interfaces. Neural Comput. Appl. 2023, 35, 8883–8901.
14. Xiao, H.; Li, W.; Zeng, G.; Wu, Y.; Xue, J.; Zhang, J.; Li, C.; Guo, G. On-Road Driver Emotion Recognition Using Facial Expression. Appl. Sci. 2022, 12, 807–826.
15. Azman, A.; Raman, K.J.; Mhlanga, I.A.J.; Ibrahim, S.Z.; Yogarayan, S.; Abdullah, M.F.A.; Razak, S.F.A.; Amin, A.H.M.; Muthu, K.S. Real Time Driver Anger Detection. In Proceedings of the Information Science and Applications 2018; Kim, K.J., Baek, N., Eds.; Springer: Singapore, 2019; pp. 157–167.
16. Sudha, S.S.; Suganya, S.S. On-Road Driver Facial Expression Emotion Recognition with Parallel Multi-Verse Optimizer (PMVO) and Optical Flow Reconstruction for Partial Occlusion in Internet of Things (IoT). Meas. Sens. 2023, 26, 100711.
17. Du, G.; Wang, Z.; Gao, B.; Mumtaz, S.; Abualnaja, K.M.; Du, C. A Convolution Bidirectional Long Short-Term Memory Neural Network for Driver Emotion Recognition. IEEE Trans. Intell. Transp. Syst. 2021, 22, 4570–4578.
18. Zhao, Z.; Liu, Q. Former-DFER: Dynamic Facial Expression Recognition Transformer. In Proceedings of the 29th ACM International Conference on Multimedia; Association for Computing Machinery: New York, NY, USA, 2021; pp. 1553–1561.
19. Zhang, X.; Li, M.; Lin, S.; Xu, H.; Xiao, G. Transformer-Based Multimodal Emotional Perception for Dynamic Facial Expression Recognition in the Wild. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 3192–3203.
20. Arnab, A.; Dehghani, M.; Heigold, G.; Sun, C.; Lučić, M.; Schmid, C. ViViT: A Video Vision Transformer. arXiv 2021, arXiv:2103.15691.
21. Pavlidis, I.; Dcosta, M.; Taamneh, S.; Manser, M.; Ferris, T.; Wunderlich, R.; Akleman, E.; Tsiamyrtzis, P. Dissecting Driver Behaviors under Cognitive, Emotional, Sensorimotor, and Mixed Stressors. Sci. Rep. 2016, 6, 25651.
22. Shi, Y.; Boffi, M.; Piga, B.E.A.; Mussone, L.; Caruso, G. Perception of Driving Simulations: Can the Level of Detail of Virtual Scenarios Affect the Driver’s Behavior and Emotions? IEEE Trans. Veh. Technol. 2022, 71, 3429–3442.
23. Pan, B.; Hirota, K.; Jia, Z.; Zhao, L.; Jin, X.; Dai, Y. Multimodal Emotion Recognition Based on Feature Selection and Extreme Learning Machine in Video Clips. J. Ambient Intell. Hum. Comput. 2023, 14, 1903–1917.
24. Ding, T.; Zhang, K.; Gao, S.; Miao, X.; Xi, J. A Multimodal Driver Anger Recognition Method Based on Context-Awareness. IEEE Access 2024, 12, 118533–118550.
25. Mou, L.; Rastgoo, M.N.; Ma, L.; Huang, T.; Yin, B.; Jain, R. Driver Emotion Recognition with a Hybrid Attentional Multimodal Fusion Framework. IEEE Trans. Affect. Comput. 2023, 14, 2970–2981.
26. Yang, H.; Wu, J.; Hu, Z.; Lv, C. Real-Time Driver Cognitive Workload Recognition: Attention-Enabled Learning with Multimodal Information Fusion. IEEE Trans. Ind. Electron. 2024, 71, 4999–5009.
27. Xiang, G.; Yao, S.; Deng, H.; Wu, X.; Wang, X.; Xu, Q.; Yu, T.; Wang, K.; Peng, Y. A Multi-Modal Driver Emotion Dataset and Study: Including Facial Expressions and Synchronized Physiological Signals. Eng. Appl. Artif. Intell. 2024, 130, 107772.
28. Chumachenko, K.; Iosifidis, A.; Gabbouj, M. MMA-DFER: MultiModal Adaptation of Unimodal Models for Dynamic Facial Expression Recognition in-the-Wild. arXiv 2024, arXiv:2404.09010.
29. Zhang, K.; Zhang, Z.; Li, Z.; Yu, Q. Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks. IEEE Signal Process. Lett. 2016, 23, 1499–1503.
30. Dong, Z.; Hu, C.; Zhu, L.; Ji, X.; Lai, C.S. A Dual-Pathway Driver Emotion Classification Network Using Multitask Learning Strategy: A Joint Verification. IEEE Internet Things J. 2025, 12, 14897–14908.







| Method | Task 1 Acc (%) | Task 1 F1 Score (%) | Task 2 CCC-Avg | Task 2 Pearson-Avg | Task 2 RMSE-Avg |
|---|---|---|---|---|---|
| Former-DFER [18] | 80.55 ± 0.13 | 80.79 ± 0.15 | 0.8219 ± 0.0011 | 0.8383 ± 0.0014 | 0.2782 ± 0.0007 |
| ConvLSTM [25] | 83.33 ± 0.11 | 83.51 ± 0.12 | 0.7540 ± 0.0017 | 0.7636 ± 0.0011 | 0.3281 ± 0.0009 |
| MMA-DFER [28] | 79.63 ± 0.14 | 79.85 ± 0.14 | 0.7567 ± 0.0021 | 0.7687 ± 0.0012 | 0.3154 ± 0.0006 |
| DDEC [30] | 83.79 ± 0.15 | 83.73 ± 0.15 | 0.7973 ± 0.0019 | 0.8035 ± 0.0021 | 0.2943 ± 0.0006 |
| MER-MFVA [11] | 82.87 ± 0.11 | 83.58 ± 0.11 | 0.7897 ± 0.0015 | 0.7960 ± 0.0012 | 0.3008 ± 0.0004 |
| DEGM (ours) | 87.50 ± 0.11 | 87.27 ± 0.16 | 0.8211 ± 0.0013 | 0.8333 ± 0.0021 | 0.2793 ± 0.0004 |
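
The Task 2 columns report the concordance correlation coefficient (CCC), the Pearson correlation, and the root-mean-square error (RMSE). For reference, the standard definitions of these metrics can be computed as in the sketch below. The helper names are illustrative, and reading the "-Avg" suffix as an unweighted mean over the valence, arousal, and dominance dimensions is an assumption made here, not something the table states.

```python
import math


def _mean(xs):
    return sum(xs) / len(xs)


def pearson(pred, true):
    """Pearson correlation: covariance normalised by both standard deviations."""
    mp, mt = _mean(pred), _mean(true)
    cov = sum((p - mp) * (t - mt) for p, t in zip(pred, true))
    ss_p = sum((p - mp) ** 2 for p in pred)
    ss_t = sum((t - mt) ** 2 for t in true)
    return cov / math.sqrt(ss_p * ss_t)


def ccc(pred, true):
    """Concordance correlation coefficient, using population variances."""
    n = len(pred)
    mp, mt = _mean(pred), _mean(true)
    var_p = sum((p - mp) ** 2 for p in pred) / n
    var_t = sum((t - mt) ** 2 for t in true) / n
    cov = sum((p - mp) * (t - mt) for p, t in zip(pred, true)) / n
    return 2 * cov / (var_p + var_t + (mp - mt) ** 2)


def rmse(pred, true):
    """Root-mean-square error between predictions and targets."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred))


def average_over_dims(metric, preds_by_dim, trues_by_dim):
    """Assumed meaning of '-Avg': unweighted mean of a metric over V, A, D."""
    scores = [metric(p, t) for p, t in zip(preds_by_dim, trues_by_dim)]
    return sum(scores) / len(scores)
```

Unlike Pearson correlation, CCC also penalises differences in mean and scale: for predictions `[1, 2, 3]` against targets `[2, 4, 6]`, the Pearson correlation is 1 while the CCC is 8/22, roughly 0.36.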

| Method | Task 1 Acc (%) | Task 1 F1 Score (%) | Task 2 CCC-Avg | Task 2 Pearson-Avg | Task 2 RMSE-Avg |
|---|---|---|---|---|---|
| Cross-Attention | 81.94 | 81.73 | 0.8066 | 0.8103 | 0.2935 |
| Gated | 83.33 | 83.23 | 0.8073 | 0.8135 | 0.2899 |
| Add | 84.72 | 84.99 | 0.8214 | 0.8313 | 0.2840 |
| Concat | 85.64 | 85.69 | 0.8168 | 0.8225 | 0.2843 |
| No-Guided FiLM | 85.64 | 85.57 | 0.7698 | 0.7781 | 0.3161 |
| DEGCM | 87.50 | 87.27 | 0.8468 | 0.8533 | 0.2793 |

| Valence | Arousal | Dominance | Accuracy (%) | F1 Score (%) |
|---|---|---|---|---|
| ✓ | ✓ | ✓ | 87.50 | 87.27 |
| ✓ | ✓ | ✗ | 80.55 | 80.33 |
| ✓ | ✗ | ✓ | 80.09 | 79.94 |
| ✗ | ✓ | ✓ | 78.70 | 78.45 |
| ✓ | ✗ | ✗ | 74.07 | 73.56 |
| ✗ | ✓ | ✗ | 79.16 | 79.06 |
| ✗ | ✗ | ✓ | 79.62 | 79.53 |

| Face | Task 1 | Vehicle | Task 2 | Accuracy (%) | F1 Score (%) |
|---|---|---|---|---|---|
| ✓ | ✓ | ✓ | ✓ | 87.50 | 87.27 |
| ✓ | ✓ | ✗ | ✓ | 82.40 | 82.28 |
| ✓ | ✓ | ✓ | ✗ | 84.72 | 84.77 |
| ✓ | ✓ | ✗ | ✗ | 77.31 | 77.19 |
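
The accuracy values in the table above can be turned into per-component gains with simple arithmetic. The snippet below is only a worked calculation on the reported numbers; the dictionary layout and variable names are introduced here for illustration.

```python
# Accuracy (%) from the ablation table, keyed by (vehicle branch, Task 2 head).
accuracy = {
    (True, True): 87.50,
    (False, True): 82.40,
    (True, False): 84.72,
    (False, False): 77.31,
}

baseline = accuracy[(False, False)]  # face + Task 1 only

# Gains in percentage points relative to the face-only, single-task baseline.
gain_task2 = round(accuracy[(False, True)] - baseline, 2)    # 5.09
gain_vehicle = round(accuracy[(True, False)] - baseline, 2)  # 7.41
gain_both = round(accuracy[(True, True)] - baseline, 2)      # 10.19

# The combined gain is smaller than the sum of the individual gains.
overlap = round(gain_task2 + gain_vehicle - gain_both, 2)    # 2.31
```

On these numbers, adding the vehicle branch alone yields a larger gain than adding the regression task alone, and the two gains are not additive, which would be consistent with the two components contributing partly overlapping information.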
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Shen, W.; Mou, X.; Yi, J.; Le, S. Dimensional Emotion-Guided Conditional Modulation for Context-Aware Multimodal Driver Affect Recognition. Appl. Sci. 2026, 16, 4312. https://doi.org/10.3390/app16094312

