Advances in Video Emotion Recognition: Challenges and Trends
Abstract
1. Introduction
- Comparative analysis of classic algorithms: This study systematically evaluated classic algorithms in the VER field, comparing and analyzing their performance on four datasets. This evaluation provides a benchmark reference for researchers selecting or refining algorithms for specific VER tasks, while highlighting unresolved challenges that require further innovation.
- Comprehensive analysis of challenges: This study comprehensively analyzed the pivotal challenges hindering the development of VER, categorizing them into three interdependent dimensions. Through this analysis, this study identified the root causes of current limitations, established a foundation for future research, and bridged theoretical insights with actionable strategies to advance the field.
- Actionable future research directions: This paper proposes actionable research directions that address the key challenges and thereby promote the development of the VER field. By systematically linking these directions to the identified challenges, the paper offers a clear roadmap for future innovation and guides researchers in designing precise and efficient VER systems.
2. Psychological Models for VER
3. VER Datasets
4. VER Algorithms
4.1. Handcrafted-Feature-Based Algorithms
4.1.1. Visual Emotion Features
4.1.2. Audio Emotion Features
4.1.3. Emotion Learning
4.2. Neural-Network-Based Algorithms
4.2.1. Two-Stage VER Algorithms
4.2.2. End-to-End VER Algorithms
5. Results
- LSTMs leverage a recurrent structure to process data sequentially, capturing long-term temporal dependencies. In the VER field, LSTMs learn contextual information from previous frames to infer emotions. However, for long videos, information from early frames may still be partially forgotten. Moreover, LSTMs are limited in spatial modeling, and their sequential processing cannot be parallelized across time steps, which slows training.
- CNNs use convolutional kernels to extract local features and build hierarchical abstractions via stacked layers, which enables them to learn video content efficiently. Their main limitation is the local receptive field, so capturing long-range dependencies requires deep stacking. In practice, kernel size and network depth must be designed carefully, and 3D convolutions are essential for modeling the temporal dimension of video data.
- Transformers employ self-attention to learn dependencies across all elements of a video, thereby establishing cross-temporal relationships. Their primary strength lies in modeling long-range dependencies. However, the quadratic complexity of self-attention limits long-sequence processing. In practice, local or sparse attention variants can reduce this cost, and positional encodings are critical for preserving temporal order. Minimal sketches of these three temporal backbones are given after this list.
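To ground the comparison above, the following minimal PyTorch-style sketch shows how each backbone family could be wired as a temporal head for VER: an LSTM and a Transformer encoder operating on precomputed per-frame features, and a small 3D CNN operating on raw clips. The class names, feature dimensions, and hyperparameters are illustrative assumptions and do not reproduce any of the surveyed algorithms.

```python
# Minimal sketches of the three temporal backbones discussed above.
# Dimensions, class names, and hyperparameters are illustrative only.
import torch
import torch.nn as nn


class LSTMHead(nn.Module):
    """Recurrent head: consumes a sequence of per-frame features."""
    def __init__(self, feat_dim=512, hidden=256, num_classes=8):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, x):            # x: (batch, T, feat_dim)
        _, (h, _) = self.lstm(x)     # h[-1]: last hidden state, (batch, hidden)
        return self.fc(h[-1])        # emotion logits


class Conv3DHead(nn.Module):
    """3D-convolutional head: consumes short raw clips (batch, 3, T, H, W)."""
    def __init__(self, num_classes=8):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),          # global spatio-temporal pooling
        )
        self.fc = nn.Linear(32, num_classes)

    def forward(self, clip):
        z = self.features(clip).flatten(1)    # (batch, 32)
        return self.fc(z)


class TransformerHead(nn.Module):
    """Self-attention head with learned positional encodings over frame features."""
    def __init__(self, feat_dim=512, num_classes=8, max_len=64):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(1, max_len, feat_dim))
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, x):                     # x: (batch, T, feat_dim), T <= max_len
        z = self.encoder(x + self.pos[:, : x.size(1)])
        return self.fc(z.mean(dim=1))         # average-pool over time, then classify


if __name__ == "__main__":
    frames = torch.randn(2, 16, 512)          # 2 videos, 16 frames, 512-d features
    clip = torch.randn(2, 3, 16, 112, 112)    # 2 raw clips
    print(LSTMHead()(frames).shape)           # torch.Size([2, 8])
    print(Conv3DHead()(clip).shape)           # torch.Size([2, 8])
    print(TransformerHead()(frames).shape)    # torch.Size([2, 8])
```

In a two-stage pipeline, the per-frame features would typically come from a pretrained visual or audio backbone of the kind surveyed in Section 4.2, with the head trained using a cross-entropy loss for categorical emotions or a regression loss for valence–arousal prediction.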
6. Discussions
6.1. Challenges
6.1.1. Gaps Between Emotional Representations and Labels
6.1.2. Large-Scale and High-Quality VER Datasets
6.1.3. Efficient Integration of Multiple Modalities
6.2. Future Work
7. Conclusions
Author Contributions
Funding
Conflicts of Interest
References
- Goel, S.; Jara-Ettinger, J.; Ong, D.C.; Gendron, M. Face and context integration in emotion inference is limited and variable across categories and individuals. Nat. Commun. 2024, 15, 2443. [Google Scholar] [CrossRef] [PubMed]
- Nomiya, H.; Shimokawa, K.; Namba, S.; Osumi, M.; Sato, W. An artificial intelligence model for sensing affective valence and arousal from facial images. Sensors 2025, 25, 1188. [Google Scholar] [CrossRef] [PubMed]
- Sun, W.; Yan, X.; Su, Y.; Wang, G.; Zhang, Y. MSDSANet: Multimodal emotion recognition based on multi-stream network and dual-scale attention network feature representation. Sensors 2025, 25, 2029. [Google Scholar] [CrossRef] [PubMed]
- Yi, Y.; Zhou, J.; Wang, H.; Tang, P.; Wang, M. Emotion recognition in user-generated videos with long-range correlation-aware network. IET Image Process. 2024, 18, 3288–3301. [Google Scholar] [CrossRef]
- Bose, D.; Hebbar, R.; Feng, T.; Somandepalli, K.; Xu, A.; Narayanan, S. MM-AU: Towards multimodal understanding of advertisement videos. In Proceedings of the ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 86–95. [Google Scholar]
- Antonov, A.; Kumar, S.S.; Wei, J.; Headley, W.; Wood, O.; Montana, G. Decoding viewer emotions in video ads. Sci. Rep. 2024, 14, 26382. [Google Scholar] [CrossRef]
- Khare, S.K.; Blanes-Vidal, V.; Nadimi, E.S.; Acharya, U.R. Emotion recognition and artificial intelligence: A systematic review (2014–2023) and research recommendations. Inf. Fusion 2024, 102, 102019. [Google Scholar] [CrossRef]
- Hanjalic, A.; Xu, L.Q. Affective video content representation and modeling. IEEE Trans. Multimed. 2005, 7, 143–154. [Google Scholar] [CrossRef]
- Jiang, Y.G.; Xu, B.; Xue, X. Predicting emotions in user-generated videos. In Proceedings of the AAAI Conference on Artificial Intelligence, Québec City, QC, Canada, 27–31 July 2014; pp. 73–79. [Google Scholar]
- Yi, Y.; Wang, H. Multi-modal learning for affective content analysis in movies. Multimed. Tools Appl. 2019, 78, 13331–13350. [Google Scholar] [CrossRef]
- Yi, Y.; Wang, H.; Li, Q. Affective video content analysis with adaptive fusion recurrent network. IEEE Trans. Multimed. 2020, 22, 2454–2466. [Google Scholar] [CrossRef]
- Yi, Y.; Wang, H.; Tang, P. Unified multi-stage fusion network for affective video content analysis. Electron. Lett. 2022, 58, 795–797. [Google Scholar] [CrossRef]
- Zhao, S.; Ma, Y.; Gu, Y.; Yang, J.; Xing, T.; Xu, P.; Hu, R.; Chai, H.; Keutzer, K. An End-to-End visual-audio attention network for emotion recognition in user-generated videos. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 303–311. [Google Scholar]
- Zhang, Z.; Zhao, P.; Park, E.; Yang, J. MART: Masked affective representation learning via masked temporal distribution distillation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–18 June 2024; pp. 12830–12840. [Google Scholar]
- Zhang, Z.; Wang, L.; Yang, J. Weakly supervised video emotion detection and prediction via cross-modal temporal erasing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 18888–18897. [Google Scholar]
- Lee, J.; Kim, S.; Kim, S.; Park, J.; Sohn, K. Context-aware emotion recognition networks. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 10143–10152. [Google Scholar]
- Cheng, H.; Tie, Y.; Qi, L.; Jin, C. Context-aware based visual-audio feature fusion for emotion recognition. In Proceedings of the International Joint Conference on Neural Networks, Shenzhen, China, 18–22 July 2021; pp. 1–8. [Google Scholar]
- Chen, C.; Wu, Z.; Jiang, Y.G. Emotion in context: Deep semantic feature fusion for video emotion recognition. In Proceedings of the ACM International Conference on Multimedia, Amsterdam, The Netherlands, 15–19 October 2016; pp. 127–131. [Google Scholar]
- Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
- Sun, J.J.; Liu, T.; Cowen, A.S.; Schroff, F.; Adam, H.; Prasad, G. EEV: A large-scale dataset for studying evoked expressions from video. arXiv 2021, arXiv:2001.05488. [Google Scholar]
- Pang, L.; Zhu, S.; Ngo, C.W. Deep multimodal learning for affective analysis and retrieval. IEEE Trans. Multimed. 2015, 17, 2008–2020. [Google Scholar] [CrossRef]
- Zhang, H.; Xu, M. Recognition of emotions in user-generated videos through frame-level adaptation and emotion intensity learning. IEEE Trans. Multimed. 2023, 25, 881–891. [Google Scholar] [CrossRef]
- Eyben, F.; Scherer, K.R.; Schuller, B.W.; Sundberg, J.; André, E.; Busso, C.; Devillers, L.Y.; Epps, J.; Laukka, P.; Narayanan, S.S.; et al. The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing. IEEE Trans. Affect. Comput. 2015, 7, 190–202. [Google Scholar] [CrossRef]
- Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA, 20–26 June 2005; pp. 886–893. [Google Scholar]
- Sural, S.; Qian, G.; Pramanik, S. Segmentation and histogram generation using the HSV color space for image retrieval. In Proceedings of the IEEE International Conference on Image Processing, Rochester, NY, USA, 22–25 September 2002; pp. 589–592. [Google Scholar]
- Xu, B.; Fu, Y.; Jiang, Y.; Li, B.; Sigal, L. Heterogeneous knowledge transfer in video emotion recognition, attribution and summarization. IEEE Trans. Affect. Comput. 2018, 9, 255–270. [Google Scholar] [CrossRef]
- Efron, B.; Hastie, T.; Johnstone, I.; Tibshirani, R. Least angle regression. Ann. Stat. 2004, 32, 407–499. [Google Scholar] [CrossRef]
- Ojala, T.; Pietikainen, M.; Maenpaa, T. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 971–987. [Google Scholar] [CrossRef]
- Baveye, Y.; Dellandrea, E.; Chamaret, C.; Chen, L. LIRIS-ACCEDE: A video database for affective content analysis. IEEE Trans. Affect. Comput. 2015, 6, 43–55. [Google Scholar] [CrossRef]
- Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar]
- Davis, S.B.; Mermelstein, P. Evaluation of acoustic parameters for monosyllabic word identification. J. Acoust. Soc. Am. 1978, 64, S180–S181. [Google Scholar] [CrossRef]
- Ji, X.; Dong, Z.; Zhou, G.; Lai, C.S.; Qi, D. MLG-NCS: Multimodal local–global neuromorphic computing system for affective video content analysis. IEEE Trans. Syst. Man Cybern. Syst. 2024, 54, 5137–5149. [Google Scholar] [CrossRef]
- Glorot, X.; Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Sardinia, Italy, 13–15 May 2010; pp. 249–256. [Google Scholar]
- Zhu, Y.; Chen, Z.; Wu, F. Multimodal deep denoise framework for affective video content analysis. In Proceedings of the ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 130–138. [Google Scholar]
- Zhu, Y.; Chen, Z.; Wu, F. Affective Video Content Analysis via Multimodal Deep Quality Embedding Network. IEEE Trans. Affect. Comput. 2022, 13, 1401–1415. [Google Scholar] [CrossRef]
- Gan, Q.; Wang, S.; Hao, L.; Ji, Q. A multimodal deep regression bayesian network for affective video content analyses. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5113–5122. [Google Scholar]
- Ou, Y.; Chen, Z.; Wu, F. Multimodal local-global attention network for affective video content analysis. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 1901–1914. [Google Scholar] [CrossRef]
- Tieleman, T. Training restricted Boltzmann machines using approximations to the likelihood gradient. In Proceedings of the International Conference on Machine Learning, Helsinki, Finland, 5–9 July 2008; pp. 1064–1071. [Google Scholar]
- Xie, S.; Sun, C.; Huang, J.; Tu, Z.; Murphy, K. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 305–321. [Google Scholar]
- Guo, Y.; Siddiqui, F.; Zhao, Y.; Chellappa, R.; Lo, S.Y. StimuVAR: Spatiotemporal stimuli-aware video affective reasoning with multimodal large language models. arXiv 2024, arXiv:2409.00304. [Google Scholar]
- Yu, H.F.; Huang, F.L.; Lin, C.J. Dual coordinate descent methods for logistic regression and maximum entropy models. Mach. Learn. 2011, 85, 41–75. [Google Scholar] [CrossRef]
- Li, X.; Wang, S.; Huang, X. Temporal enhancement for video affective content analysis. In Proceedings of the ACM International Conference on Multimedia, Melbourne, VIC, Australia, 28 October–1 November 2024; pp. 642–650. [Google Scholar]
- Shor, J.; Jansen, A.; Maor, R.; Lang, O.; Tuval, O.; Quitry, F.d.C.; Tagliasacchi, M.; Shavitt, I.; Emanuel, D.; Haviv, Y. Towards learning a universal non-semantic representation of speech. arXiv 2020, arXiv:2002.12764. [Google Scholar]
- Wang, L.; Xiong, Y.; Wang, Z.; Qiao, Y.; Lin, D.; Tang, X.; Gool, L.V. Temporal segment networks: Towards good practices for deep action recognition. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 20–36. [Google Scholar]
- Wang, S.; Li, X.; Zheng, F.; Pan, J.; Li, X.; Chang, Y.; Zhu, Z.; Li, Q.; Wang, J.; Xiao, Y. VAD: A video affective dataset with danmu. IEEE Trans. Affect. Comput. 2024, 15, 1889–1905. [Google Scholar] [CrossRef]
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
- Gouyon, F.; Pachet, F.; Delerue, O. On the use of zero-crossing rate for an application of classification of percussive sounds. In Proceedings of the COST G-6 Conference on Digital Audio Effects, Verona, Italy, 7–9 December 2000; p. 16. [Google Scholar]
- Plutchik, R. Emotion: Theory, Research, and Experience; Academic Press: Cambridge, MA, USA, 1980. [Google Scholar]
- Ekman, P. An Argument for Basic Emotions. Cogn. Emot. 1992, 6, 169–200. [Google Scholar] [CrossRef]
- Arifin, S.; Cheung, P.Y. Affective level video segmentation by utilizing the pleasure-arousal-dominance information. IEEE Trans. Multimed. 2008, 10, 1325–1341. [Google Scholar] [CrossRef]
- Zhao, S.; Yao, H.; Sun, X.; Xu, P.; Liu, X.; Ji, R. Video indexing and recommendation based on affective analysis of viewers. In Proceedings of the ACM International Conference on Multimedia, Scottsdale, AZ, USA, 28 November–1 December 2011; pp. 1473–1476. [Google Scholar]
- Cowen, A.S.; Keltner, D. Self-report captures 27 distinct categories of emotion bridged by continuous gradients. Proc. Natl. Acad. Sci. USA 2017, 114, E7900–E7909. [Google Scholar] [CrossRef] [PubMed]
- Koide-Majima, N.; Nakai, T.; Nishimoto, S. Distinct dimensions of emotion in the human brain and their representation on the cortical surface. NeuroImage 2020, 222, 117258. [Google Scholar] [CrossRef]
- Russell, J.A. A circumplex model of affect. J. Personal. Soc. Psychol. 1980, 39, 1161–1178. [Google Scholar] [CrossRef]
- Posner, J.; Russell, J.A.; Peterson, B.S. The circumplex model of affect: An integrative approach to affective neuroscience, cognitive development, and psychopathology. Dev. Psychopathol. 2005, 17, 715–734. [Google Scholar] [CrossRef] [PubMed]
- Koelstra, S.; Muhl, C.; Soleymani, M.; Lee, J.S.; Yazdani, A.; Ebrahimi, T.; Pun, T.; Nijholt, A.; Patras, I. DEAP: A database for emotion analysis using physiological signals. IEEE Trans. Affect. Comput. 2011, 3, 18–31. [Google Scholar] [CrossRef]
- Soleymani, M.; Lichtenauer, J.; Pun, T.; Pantic, M. A multimodal database for affect recognition and implicit tagging. IEEE Trans. Affect. Comput. 2012, 3, 42–55. [Google Scholar] [CrossRef]
- Dellandréa, E.; Huigsloot, M.; Chen, L.; Baveye, Y.; Xiao, Z.; Sjöberg, M. Datasets column: Predicting the emotional impact of movies. ACM SIGMultimedia Rec. 2019, 10, 6. [Google Scholar] [CrossRef]
- Li, C.; Wang, J.; Wang, H.; Zhao, M.; Li, W.; Deng, X. Visual-texual emotion analysis with deep coupled video and danmu neural networks. IEEE Trans. Multimed. 2020, 22, 1634–1646. [Google Scholar] [CrossRef]
- Sjöberg, M.; Baveye, Y.; Wang, H.; Quang, V.L.; Ionescu, B.; Dellandréa, E.; Schedl, M.; Demarty, C.H.; Chen, L. The MediaEval 2015 affective impact of movies task. In Proceedings of the MediaEval Workshop, Wurzen, Germany, 14–15 September 2015. [Google Scholar]
- Dellandréa, E.; Chen, L.; Baveye, Y.; Sjöberg, M.V.; Chamaret, C. The MediaEval 2016 emotional impact of movies task. In Proceedings of the MediaEval Workshop, Hilversum, The Netherlands, 20–21 October 2016. [Google Scholar]
- Dellandréa, E.; Huigsloot, M.; Chen, L.; Baveye, Y.; Sjöberg, M. The MediaEval 2017 emotional impact of movies task. In Proceedings of the MediaEval Workshop, Dublin, Ireland, 13–15 September 2017. [Google Scholar]
- Dellandréa, E.; Huigsloot, M.; Chen, L.; Baveye, Y.; Xiao, Z.; Sjöberg, M. The MediaEval 2018 emotional impact of movies task. In Proceedings of the MediaEval Workshop, Sophia Antipolis, France, 29–31 October 2018. [Google Scholar]
- Baveye, Y.; Dellandrea, E.; Chamaret, C.; Chen, L. Deep learning vs. kernel methods: Performance for emotion prediction in videos. In Proceedings of the International Conference on Affective Computing and Intelligent Interaction, Xi’an, China, 21–24 September 2015; pp. 77–83. [Google Scholar]
- Borth, D.; Chen, T.; Ji, R.; Chang, S.F. SentiBank: Large-scale ontology and classifiers for detecting sentiment and emotions in visual content. In Proceedings of the ACM International Conference on Multimedia, Barcelona, Spain, 21–25 October 2013; pp. 459–460. [Google Scholar]
- Thao, H.T.P.; Balamurali, B.; Roig, G.; Herremans, D. AttendAffectNet–Emotion prediction of movie viewers using multimodal fusion with self-attention. Sensors 2021, 21, 8356. [Google Scholar] [CrossRef]
- Oliva, A.; Torralba, A. Modeling the shape of the scene: A holistic representation of the spatial envelope. Int. J. Comput. Vis. 2001, 42, 145–175. [Google Scholar] [CrossRef]
- Shechtman, E.; Irani, M. Matching local self-similarities across images and videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, USA, 18–23 June 2007; pp. 1–8. [Google Scholar]
- Torresani, L.; Szummer, M.; Fitzgibbon, A. Efficient object category recognition using classemes. In Proceedings of the European Conference on Computer Vision, Heraklion, Crete, Greece, 5–11 September 2010; pp. 776–789. [Google Scholar]
- Li, L.J.; Su, H.; Fei-Fei, L.; Xing, E.P. Object bank: A high-level image representation for scene classification & semantic feature sparsification. In Proceedings of the Neural Information Processing Systems, Vancouver, BC, Canada, 6–11 December 2010; pp. 1378–1386. [Google Scholar]
- Yi, Y.; Wang, H. Motion keypoint trajectory and covariance descriptor for human action recognition. Vis. Comput. 2018, 34, 391–403. [Google Scholar] [CrossRef]
- Baecchi, C.; Uricchio, T.; Bertini, M.; Del Bimbo, A. Deep sentiment features of context and faces for affective video analysis. In Proceedings of the ACM on International Conference on Multimedia Retrieval, Bucharest, Romania, 6–9 June 2017; pp. 72–77. [Google Scholar]
- Chen, T.; Borth, D.; Darrell, T.; Chang, S.F. DeepSentiBank: Visual sentiment concept classification with deep convolutional neural networks. arXiv 2014, arXiv:1410.8586. [Google Scholar]
- Parkhi, O.M.; Vedaldi, A.; Zisserman, A. Deep face recognition. In Proceedings of the British Machine Vision Conference, Swansea, UK, 7–10 September 2015. [Google Scholar]
- Guo, X.; Zhong, W.; Ye, L.; Fang, L.; Heng, Y.; Zhang, Q. Global Affective Video Content Regression Based on Complementary Audio-Visual Features. In Proceedings of the International Conference on Multimedia Modeling, Daejeon, Republic of Korea, 5–8 January 2020; pp. 540–550. [Google Scholar]
- Eyben, F.; Weninger, F.; Gross, F.; Schuller, B. Recent developments in opensmile, the munich open-source multimedia feature extractor. In Proceedings of the ACM International Conference on Multimedia, Barcelona, Spain, 21–25 October 2013; pp. 835–838. [Google Scholar]
- Schuller, B.; Steidl, S.; Batliner, A.; Burkhardt, F.; Devillers, L.; Müller, C.A.; Narayanan, S.S. The INTERSPEECH 2010 paralinguistic challenge. In Proceedings of INTERSPEECH, Makuhari, Chiba, Japan, 26–30 September 2010. [Google Scholar]
- Schuller, B.; Steidl, S.; Batliner, A.; Vinciarelli, A.; Scherer, K.; Ringeval, F.; Chetouani, M.; Weninger, F.; Eyben, F.; Marchi, E.; et al. The INTERSPEECH 2013 computational paralinguistics challenge: Social signals, conflict, emotion, autism. In Proceedings of the Annual Conference of the International Speech Communication Association, Lyon, France, 25–29 August 2013. [Google Scholar]
- Aytar, Y.; Vondrick, C.; Torralba, A. Soundnet: Learning sound representations from unlabeled video. In Proceedings of the Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 892–900. [Google Scholar]
- Hershey, S.; Chaudhuri, S.; Ellis, D.P.; Gemmeke, J.F.; Jansen, A.; Moore, R.C.; Plakal, M.; Platt, D.; Saurous, R.A.; Seybold, B.; et al. CNN architectures for large-scale audio classification. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, New Orleans, LA, USA, 5–9 March 2017; pp. 131–135. [Google Scholar]
- Crammer, K.; Dekel, O.; Keshet, J.; Shalev-Shwartz, S.; Singer, Y. Online passive-aggressive algorithms. J. Mach. Learn. Res. 2006, 7, 551–585. [Google Scholar]
- Xu, M.; Jin, J.S.; Luo, S.; Duan, L. Hierarchical movie affective content analysis based on arousal and valence features. In Proceedings of the ACM International Conference on Multimedia, Vancouver, BC, Canada, 26–31 October 2008; pp. 677–680. [Google Scholar]
- Arandjelović, R.; Zisserman, A. Three things everyone should know to improve object retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 2911–2918. [Google Scholar]
- Wang, S.; Hao, L.; Ji, Q. Knowledge-augmented multimodal deep regression bayesian networks for emotion video tagging. IEEE Trans. Multimed. 2020, 22, 1084–1097. [Google Scholar] [CrossRef]
- Huynh, V.; Lee, G.S.; Yang, H.J.; Kim, S.H. Temporal convolution networks with positional encoding for evoked expression estimation. arXiv 2021, arXiv:2106.08596. [Google Scholar]
- Lin, K.; Wang, X.; Zheng, Z.; Zhu, L.; Yang, Y. Less is more: Sparse sampling for dense reaction predictions. arXiv 2021, arXiv:2106.01764. [Google Scholar]
- Yan, B.; Wang, L.; Gao, K.; Gao, B.; Liu, X.; Ban, C.; Yang, J.; Li, X. Multi-Granularity Network with Modal Attention for Dense Affective Understanding. arXiv 2021, arXiv:2106.09964. [Google Scholar]
- Tan, M.; Le, Q. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114. [Google Scholar]
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
- Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A. Inception-v4, inception-resnet and the impact of residual connections on learning. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; pp. 4278–4284. [Google Scholar]
- Ho, N.H.; Yang, H.J.; Kim, S.H.; Lee, G.; Yoo, S.B. Deep graph fusion based multimodal evoked expressions from large-scale videos. IEEE Access 2021, 9, 127068–127080. [Google Scholar] [CrossRef]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Mironica, I.; Ionescu, B.; Sjöberg, M.; Schedl, M.; Skowron, M. RFA at MediaEval 2015 affective impact of movies task: A multimodal approach. In Proceedings of the MediaEval Workshop, Wurzen, Germany, 14–15 September 2015. [Google Scholar]
- Thomas, T.; Domínguez, M.; Ptucha, R. Deep independent audio-visual affect analysis. In Proceedings of the IEEE Global Conference on Signal and Information Processing, Montreal, QC, Canada, 14–16 November 2017; pp. 1417–1421. [Google Scholar]
- Dai, Q.; Zhao, R.W.; Wu, Z.; Wang, X.; Gu, Z.; Wu, W.; Jiang, Y.G. Fudan-Huawei at MediaEval 2015: Detecting violent scenes and affective impact in movies with deep learning. In Proceedings of the MediaEval Workshop, Wurzen, Germany, 14–15 September 2015. [Google Scholar]
- Chakraborty, R.; Maurya, A.K.; Pandharipande, M.; Hassan, E.; Ghosh, H.; Kopparapu, S.K. TCS-ILAB-MediaEval 2015: Affective impact of movies and violent scene detection. In Proceedings of the MediaEval Workshop, Wurzen, Germany, 14–15 September 2015. [Google Scholar]
- Marin Vlastelica, P.; Hayrapetyan, S.; Tapaswi, M.; Stiefelhagen, R. KIT at MediaEval 2015–Evaluating visual cues for affective impact of movies task. In Proceedings of the MediaEval Workshop, Wurzen, Germany, 14–15 September 2015. [Google Scholar]
- Seddati, O.; Kulah, E.; Pironkov, G.; Dupont, S.; Mahmoudi, S.; Dutoit, T. UMons at MediaEval 2015 affective impact of movies task including violent scenes detection. In Proceedings of the MediaEval Workshop, Wurzen, Germany, 14–15 September 2015. [Google Scholar]
- Trigeorgis, G.; Coutinho, E.; Ringeval, F.; Marchi, E.; Zafeiriou, S.; Schuller, B. The ICL-TUM-PASSAU approach for the MediaEval 2015 affective impact of movies task. In Proceedings of the MediaEval Workshop, Wurzen, Germany, 14–15 September 2015. [Google Scholar]
- Lam, V.; Phan, S.; Le, D.D.; Satoh, S.; Duong, D.A. NII-UIT at MediaEval 2015 affective impact of movies task. In Proceedings of the MediaEval Workshop, Wurzen, Germany, 14–15 September 2015. [Google Scholar]
- Guo, J.; Song, B.; Zhang, P.; Ma, M.; Luo, W. Affective video content analysis based on multimodal data fusion in heterogeneous networks. Inf. Fusion 2019, 51, 224–232. [Google Scholar] [CrossRef]
- Zhang, J.; Zhao, Y.; Cai, L.; Tu, C.; Wei, W. Video affective effects prediction with multi-modal fusion and shot-long temporal context. arXiv 2019, arXiv:1909.01763. [Google Scholar]
- Wang, S.; Wang, C.; Chen, T.; Wang, Y.; Shu, Y.; Ji, Q. Video affective content analysis by exploring domain knowledge. IEEE Trans. Affect. Comput. 2019, 12, 1002–1017. [Google Scholar] [CrossRef]
- Huynh, V.T.; Yang, H.J.; Lee, G.S.; Kim, S.H. Prediction of evoked expression from videos with temporal position fusion. Pattern Recognit. Lett. 2023, 172, 245–251. [Google Scholar] [CrossRef]
- Peng, X.; Li, K.; Li, J.; Chen, G.; Guo, D. Multi-modality fusion for emotion recognition in videos. In Proceedings of the IJCAI Workshop on Micro-gesture Analysis for Hidden Emotion Understanding, Macau, China, 21–22 August 2023. [Google Scholar]
- Wei, J.; Yang, X.; Dong, Y. User-generated video emotion recognition based on key frames. Multimed. Tools Appl. 2021, 80, 14343–14361. [Google Scholar] [CrossRef]
- Qiu, H.; He, L.; Wang, F. Dual focus attention network for video emotion recognition. In Proceedings of the IEEE International Conference on Multimedia and Expo, London, UK, 6–10 July 2020; pp. 1–6. [Google Scholar]
- Tao, H.; Duan, Q. Hierarchical attention network with progressive feature fusion for facial expression recognition. Neural Netw. 2024, 170, 337–348. [Google Scholar] [CrossRef]
Acronym | Description |
---|---|
AFRN | Adaptive fusion recurrent network [11] |
CAER | Context-aware emotion recognition [16] |
CAF | Context-aware framework [17] |
CFN | Context fusion network [18] |
CNN | Convolutional Neural Network |
CTEN | Cross-modal temporal erasing network [15] |
DSIFT | Dense scale invariant feature transform [19] |
EEV | Evoked Expression in Video [20] |
E-MDBM | Enhanced multimodal deep Boltzmann machine [21] |
FAEIL | Frame-level adaptation and emotion intensity learning [22] |
FNN | Feedforward neural network |
GCN | Graph convolutional network |
GeMAPS | Geneva minimalistic acoustic parameter set [23] |
GRUs | Gated recurrent units |
HAFs | High-level audio features |
HMM | Hidden Markov model |
HOG | Histogram of oriented gradients [24] |
HSH | Hue–saturation histogram [25] |
HVFs | High-level visual features |
ITE | Image Transfer Encoding [26] |
LAFs | Low-level audio features |
LAR | Least-angle regression [27] |
LBP | Local binary pattern [28] |
LIRIS-ACCEDE | LIRIS annotated creative commons emotional database [29] |
LRCANet | Long-range correlation-aware network [4] |
LSTM | Long short-term memory [30] |
LVFs | Low-level visual features |
MART | Masked affective representation learning [14] |
MFCC | Mel-frequency cepstral coefficients [31] |
MKT | Motion keypoint trajectory |
MLG-S | Multimodal local–global system [32] |
MLLMs | Multimodal large language models |
MLP | Multi-layer perceptron [33] |
MM-AU | Multimodal ads understanding [5] |
MMDDN | Multimodal deep denoise network [34] |
MMDQEN | Multimodal deep quality embedding network [35] |
MMDRBN | Multimodal deep regression Bayesian network [36] |
MML | Multimodal learning [10] |
MMLGAN | Multimodal local–global attention network [37] |
MSE | Mean-square error |
PCC | Pearson correlation coefficient |
RBM | Restricted Boltzmann machine [38] |
RPN | Region proposal network |
S3D | Separable 3D CNN [39] |
StimuVAR | Stimuli-aware video affective reasoning [40] |
SVM | Support vector machine [41] |
TE | Temporal enhancement [42] |
TRILL | Triplet loss network [43] |
TSN | Temporal segment network [44] |
UMFN | Unified multi-stage fusion network [12] |
VAANet | Visual audio attention network [13] |
VAD | Video affective dataset [45] |
VER | Video emotion recognition |
VGG | Visual geometry group [46] |
ZCR | Zero-crossing rate [47] |
Dataset | Year | Emotion Model | Main Evaluation Metrics |
---|---|---|---|
DEAP [56] | 2012 | Valence–arousal–dominance | Accuracy |
MAHNOB-HCI [57] | 2012 | Valence–arousal–dominance | F1 |
VideoEmotion-8 [9] | 2014 | Plutchik | Accuracy |
LIRIS-ACCEDE [29,58] | 2015 | Valence–arousal | MSE, PCC, and accuracy |
Ekman-6 [26] | 2018 | Ekman | Accuracy |
CAER [16] | 2019 | 6 emotions | Accuracy |
Video-Danmu [59] | 2020 | 7 emotions | Precision and accuracy |
EEV [20] | 2021 | 5 emotions | PCC |
MM-AU [5] | 2023 | 3 emotions | F1 and accuracy |
VAD [45] | 2024 | Valence–arousal and 13 emotions | F1 and accuracy |
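As a concrete reference for the evaluation metrics listed above, the short sketch below computes MSE and PCC for dimensional (valence–arousal) predictions and accuracy and F1 for categorical predictions. The arrays are synthetic, and the macro averaging used for F1 is an assumption; individual benchmarks may define or average these metrics differently.

```python
# Illustrative computation of the evaluation metrics in the table above
# (MSE, PCC, accuracy, F1) on synthetic predictions.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import accuracy_score, f1_score, mean_squared_error

# Dimensional (valence-arousal) regression: continuous targets and predictions.
y_true = np.array([0.10, -0.35, 0.62, 0.05, -0.80])
y_pred = np.array([0.22, -0.30, 0.48, -0.02, -0.65])
mse = mean_squared_error(y_true, y_pred)
pcc, _ = pearsonr(y_true, y_pred)

# Categorical emotion classification: discrete labels.
labels = np.array([0, 2, 1, 2, 0, 1])
preds = np.array([0, 2, 2, 2, 0, 1])
acc = accuracy_score(labels, preds)
f1 = f1_score(labels, preds, average="macro")   # macro-F1 over emotion classes

print(f"MSE={mse:.4f}  PCC={pcc:.4f}  Acc={acc:.2f}  macro-F1={f1:.2f}")
```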
Method | Visual | Audio | Arousal (%) | Valence (%) |
---|---|---|---|---|
Mironica et al. [93] | ✓ | ✓ | 45.04 | 36.12 |
Thomas et al. [94] | ✓ | ✓ | 48.20 | 44.64 |
Dai et al. [95] | ✓ | ✓ | 48.84 | 41.78 |
Chakraborty et al. [96] | ✓ | ✓ | 48.95 | 35.66 |
Marin et al. [97] | ✓ | | 51.89 | 38.54 |
Seddati et al. [98] | ✓ | ✓ | 52.44 | 37.28 |
Trigeorgis et al. [99] | ✓ | ✓ | 55.72 | 41.48 |
Lam et al. [100] | ✓ | ✓ | 55.91 | 42.96 |
Baecchi et al. [72] | ✓ | | 55.98 | 45.31 |
MMDDN + MMCLF [34] | ✓ | ✓ | 56.75 | 45.03 |
MMDQEN [35] | ✓ | ✓ | 56.75 | 45.03 |
OFVGG + GeMAPS [101] | ✓ | ✓ | 57.00 | 40.83 |
MML [10] | ✓ | ✓ | 57.40 | 46.22 |
Zhang et al. [102] | ✓ | ✓ | 57.50 | 45.90 |
MLG-S [32] | ✓ | ✓ | 57.90 | 48.20 |
AFRN [11] | ✓ | ✓ | 58.22 | 48.61 |
Wang et al. [103] | ✓ | ✓ | 60.88 | 43.74 |
Method | Visual | Audio | PCC |
---|---|---|---|
Ho et al. [91] | ✓ | ✓ | 0.00819 |
MGN-MA [87] | ✓ | ✓ | 0.02292 |
Lin et al. [86] | ✓ | ✓ | 0.04430 |
TPF [104] | ✓ | ✓ | 0.05400 |
TCN [85] | ✓ | ✓ | 0.05477 |
Method | Visual | Audio | Accuracy (%) |
---|---|---|---|
StimuVAR [40] | ✓ | | 40.50 |
Peng et al. [105] | ✓ | ✓ | 43.20 |
MART [14] | ✓ | ✓ | 50.83 |
MMLGAN [37] | ✓ | ✓ | 51.14 |
ITE [26] | ✓ | ✓ | 52.60 |
CAF [17] | ✓ | ✓ | 52.70 |
KeyFrame [106] | ✓ | | 52.85 |
DFAN [107] | ✓ | | 53.34 |
VAANet [13] | ✓ | ✓ | 54.50 |
UMFN [12] | ✓ | ✓ | 55.80 |
CTEN [15] | ✓ | ✓ | 57.30 |
LRCANet [4] | ✓ | ✓ | 57.40 |
FAEIL [22] | ✓ | ✓ | 57.63 |
TE [42] | ✓ | ✓ | 59.39 |