Multi-Scale Parallel Enhancement Module with Cross-Hierarchy Interaction for Video Emotion Recognition
Abstract
1. Introduction
2. Related Work
2.1. Feature Grouping
2.2. Atrous Convolutions
3. Methodology
3.1. Visual-Audio Representation Extraction
3.2. Multi-Scale Parallel Synergistic Enhancement Attention
4. Experiment
4.1. Datasets
4.2. Implementation Details
4.3. Comparison with Other Methods
- (1) Low-scale contextual cues: alcohol bottles and cigarette packs on the table (coordinates [60:65, 95:100], activation intensity 0.41), reflecting environmental depression indicators.
- (2) Mid-scale expression composites: asymmetric lip-corner elevation (coordinates [40:50, 115:125], activation intensity 0.39), suggesting suppressed emotional leakage.
- (3) High-scale kinematic patterns: retracted right-shoulder motion (coordinates [20:30, 150:160], activation intensity 0.36), indicative of defensive posturing (a sketch of how such region readouts can be computed follows this list).
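To make these readouts concrete, the following is a minimal sketch of how a peak-activation window might be located in a per-scale attention map. The function name `top_activation_region`, the window size, and the map shapes are illustrative assumptions, not the paper's released code.

```python
import torch
import torch.nn.functional as F

def top_activation_region(attn_map: torch.Tensor, window: int = 10):
    """Locate the window with the highest mean activation in a 2D map.

    attn_map: (H, W) attention map for one hierarchy level, values in [0, 1].
    Returns ((y0, y1), (x0, x1), intensity) in the same [y0:y1, x0:x1]
    slice convention used in the qualitative analysis above.
    """
    # Score every (window x window) region by its mean activation.
    pooled = F.avg_pool2d(attn_map[None, None], kernel_size=window, stride=1)[0, 0]
    idx = int(pooled.argmax())
    y0, x0 = divmod(idx, pooled.shape[1])
    return (y0, y0 + window), (x0, x0 + window), float(pooled[y0, x0])

# Hypothetical per-scale maps -> "coordinates [...], activation intensity ..." readouts.
for name in ("low", "mid", "high"):
    ys, xs, s = top_activation_region(torch.rand(180, 180))
    print(f"{name}-scale peak at [{ys[0]}:{ys[1]}, {xs[0]}:{xs[1]}], intensity {s:.2f}")
```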
- (1) Emotional germination stage: At 00:48 in the video, the activation intensity of facial expressions and hand movements (pressing the lighter) rises (activation intensity 0.83), manifested as a slight frown and a slight drooping of the mouth corners (coordinates [20:30, 115:125] and [40:50, 90:100]); the photo region in the environment also begins to draw attention, marking the initial germination of sadness. This stage corresponds to the initial perception of emotion: the individual begins to build up the sad experience internally while outward behavior remains largely inhibited.
- (2) Emotional expression stage: At 00:56, attention concentrates on the photo at the center of the frame, and the corresponding local activation intensity rises markedly (activation intensity 0.79, coordinates [19:37]). Acting as an emotional trigger, the photo accelerates the outward expression of sadness, and the overall mood becomes heavier and more perceptible.
- (3) Emotional deepening stage: At 01:13, the activation intensity of the face and torso regions peaks (activation intensity 0.95), manifested as trembling mouth corners and tensed facial muscles (coordinates [42:53, 92:105]), accompanied by slight body shaking (activation intensity 0.84). The sadness has entered a deep-experience period: physiological and behavioral reactions intensify, reflecting the accumulation and release of emotional tension at its peak.
- (4) Emotional decline stage: At 01:47, the overall body activation intensity has dropped markedly (activation intensity 0.35): the drooping of the mouth corners weakens, the eyes close, movement amplitude shrinks, and the person leans back on the sofa in a relaxed posture. This stage reflects the natural subsiding of the emotion and a process of self-soothing, as the individual recovers from the peak of sadness and enters a period of regulation and calming (a toy sketch of segmenting such an intensity curve follows this list).
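The four stages above amount to segmenting a per-frame activation-intensity curve by its rise, peak, and decline. The rule and the 90% threshold below are a toy illustration under assumed inputs, not the paper's pipeline.

```python
import numpy as np

def segment_emotion_stages(intensity: np.ndarray, fps: float = 1.0):
    """Split a per-frame activation-intensity curve into coarse stages.

    Illustrative rule: frames within 90% of the maximum form the 'peak'
    stage; everything before it is the rise (germination/expression),
    everything after it is the decline.
    """
    high = intensity >= 0.9 * intensity.max()
    onset = int(np.argmax(high))                          # first near-peak frame
    offset = len(high) - 1 - int(np.argmax(high[::-1]))   # last near-peak frame
    return {"rise": (0.0, onset / fps),
            "peak": (onset / fps, offset / fps),
            "decline": (offset / fps, (len(intensity) - 1) / fps)}

# Intensities sampled at the moments reported above (00:48, 00:56, 01:13, 01:47).
print(segment_emotion_stages(np.array([0.83, 0.79, 0.95, 0.35])))
```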
4.4. Error Analysis
4.5. Ablation Study
4.6. Hyper-Parameter Analysis
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
1. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. arXiv 2018, arXiv:1709.01507.
2. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. arXiv 2018, arXiv:1807.06521.
3. Chen, L.; Zhang, H.; Xiao, J.; Nie, L.; Shao, J.; Liu, W.; Chua, T.-S. SCA-CNN: Spatial and Channel-Wise Attention in Convolutional Networks for Image Captioning. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 6298–6306.
4. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.-C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 4510–4520.
5. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90.
6. Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated Residual Transformations for Deep Neural Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 5987–5995.
7. Gao, S.-H.; Cheng, M.-M.; Zhao, K.; Zhang, X.-Y.; Yang, M.-H.; Torr, P. Res2Net: A New Multi-Scale Backbone Architecture. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 652–662.
8. Li, X.; Hu, X.; Yang, J. Spatial Group-wise Enhance: Improving Semantic Feature Learning in Convolutional Networks. arXiv 2019, arXiv:1905.09646.
9. Zhang, Q.-L.; Yang, Y.-B. SA-Net: Shuffle Attention for Deep Convolutional Neural Networks. arXiv 2021, arXiv:2102.00240.
10. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local Neural Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 7794–7803.
11. Misra, D.; Nalamada, T.; Arasanipalai, A.U.; Hou, Q. Rotate to Attend: Convolutional Triplet Attention Module. arXiv 2020, arXiv:2010.03045.
12. Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. arXiv 2015, arXiv:1411.4038.
13. Yu, F.; Koltun, V. Multi-Scale Context Aggregation by Dilated Convolutions. arXiv 2016, arXiv:1511.07122.
14. van den Oord, A.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A.; Kavukcuoglu, K. WaveNet: A Generative Model for Raw Audio. arXiv 2016, arXiv:1609.03499.
15. Zhang, H.; Zu, K.; Lu, J.; Zou, Y.; Meng, D. EPSANet: An Efficient Pyramid Squeeze Attention Block on Convolutional Neural Network. arXiv 2021, arXiv:2105.14447.
16. Hou, Q.; Zhou, D.; Feng, J. Coordinate Attention for Efficient Mobile Network Design. arXiv 2021, arXiv:2103.02907.
17. Hara, K.; Kataoka, H.; Satoh, Y. Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet? arXiv 2018, arXiv:1711.09577.
18. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. arXiv 2015, arXiv:1512.03385.
19. Xu, B.; Fu, Y.; Jiang, Y.-G.; Li, B.; Sigal, L. Heterogeneous Knowledge Transfer in Video Emotion Recognition, Attribution and Summarization. IEEE Trans. Affect. Comput. 2018, 9, 255–270.
20. Qiu, H.; He, L.; Wang, F. Dual Focus Attention Network for Video Emotion Recognition. In Proceedings of the 2020 IEEE International Conference on Multimedia and Expo (ICME), London, UK, 6–10 July 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1–6.
21. Arevalo, J.; Solorio, T.; Montes-y-Gómez, M.; González, F.A. Gated Multimodal Units for Information Fusion. arXiv 2017, arXiv:1702.01992.
22. Zadeh, A.; Chen, M.; Poria, S.; Cambria, E.; Morency, L.-P. Tensor Fusion Network for Multimodal Sentiment Analysis. arXiv 2017, arXiv:1707.07250.
23. Zhao, S.; Ma, Y.; Gu, Y.; Yang, J.; Xing, T.; Xu, P.; Hu, R.; Chai, H.; Keutzer, K. An End-to-End Visual-Audio Attention Network for Emotion Recognition in User-Generated Videos. Proc. AAAI Conf. Artif. Intell. 2020, 34, 303–311.
24. Zhang, Z.; Wang, L.; Yang, J. Weakly Supervised Video Emotion Detection and Prediction via Cross-Modal Temporal Erasing Network. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 18888–18897.
| Method | Backbone | Params (M) | FLOPs (G) |
|---|---|---|---|
| SE [1] | ResNet-101 | 49.30 | 7.84 |
| CBAM [2] | ResNet-101 | 49.30 | 7.84 |
| CA [16] | ResNet-101 | 44.80 | 7.84 |
| SA [9] | ResNet-101 | 43.32 | 7.83 |
| MPE (ours) | ResNet-101 | 43.26 | 7.85 |
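The parameter counts are straightforward to sanity-check against the shared backbone; the snippet below counts parameters for a vanilla torchvision ResNet-101 (about 44.5 M before any attention module is inserted). FLOPs would additionally require a profiler such as fvcore's FlopCountAnalysis; the exact figures above come from the paper, not from this snippet.

```python
from torchvision.models import resnet101

# Count parameters of the plain ResNet-101 backbone (no attention module).
model = resnet101()
params_m = sum(p.numel() for p in model.parameters()) / 1e6
print(f"ResNet-101 params: {params_m:.2f} M")  # ~44.55 M for the vanilla model
```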
| Dataset | [19] | [20] | [21] | [22] | [23] | [24] | Ours |
|---|---|---|---|---|---|---|---|
| VideoEmotion-8 | 52.6% | 53.3% | 53.7% | 54.2% | 54.5% | 57.3% | 57.6% |
| Ekman-6 | 55.6% | 57.3% | 54.2% | 54.3% | 55.3% | 58.2% | 61.25% |
| Number of Branches | Dilation Rates | VideoEmotion-8 | Ekman-6 |
|---|---|---|---|
| 0 | – | 55.02% | 56.28% |
| 1 | [1] | 55.39% | 56.96% |
| 2 | [1, 3] | 55.68% | 57.23% |
| 3 | [1, 3, 5] | 57.60% | 61.25% |
| 4 | [1, 3, 5, 7] | 56.23% | 60.64% |
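For reference, here is a minimal PyTorch sketch of parallel 3×3 atrous branches with the best-performing rates [1, 3, 5]: setting the padding equal to the dilation rate keeps every branch output at the input's spatial size so the maps can be fused. Module and variable names are illustrative, not the paper's code.

```python
import torch
import torch.nn as nn

class ParallelDilatedBranches(nn.Module):
    """Parallel 3x3 atrous branches, one per dilation rate; padding = dilation
    preserves the spatial size so branch outputs can be fused downstream."""

    def __init__(self, channels: int, dilations=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(channels, channels, kernel_size=3,
                       padding=d, dilation=d, bias=False)
             for d in dilations]
        )

    def forward(self, x: torch.Tensor):
        # One same-sized feature map per dilation rate.
        return [branch(x) for branch in self.branches]

feats = ParallelDilatedBranches(64)(torch.randn(2, 64, 56, 56))
print([tuple(f.shape) for f in feats])  # three maps of shape (2, 64, 56, 56)
```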
| Fusion Method | VideoEmotion-8 | Ekman-6 |
|---|---|---|
| Fixed average weight (1/3) | 55.02% | 56.28% |
| Maximum response fusion | 55.39% | 56.96% |
| Learnable static weights | 55.68% | 57.63% |
| Dynamic weight generation | 57.60% | 61.25% |
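The best-performing row generates fusion weights per sample rather than fixing them (1/3 each) or learning one static set. A minimal sketch under assumed shapes: a small shared gate scores each branch's pooled descriptor, and a softmax over branches produces the dynamic weights. The gate design is an assumption for illustration, not the paper's exact head.

```python
import torch
import torch.nn as nn

class DynamicBranchFusion(nn.Module):
    """Dynamic-weight fusion sketch: score each branch's pooled descriptor
    with a shared linear gate, softmax over branches, then weighted sum."""

    def __init__(self, channels: int):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.gate = nn.Linear(channels, 1)

    def forward(self, branch_feats):
        # branch_feats: list of B tensors, each (N, C, H, W).
        stacked = torch.stack(branch_feats, dim=1)             # (N, B, C, H, W)
        n, b, c = stacked.shape[:3]
        desc = self.pool(stacked.flatten(0, 1)).view(n, b, c)  # (N, B, C)
        weights = torch.softmax(self.gate(desc), dim=1)        # (N, B, 1)
        return (stacked * weights[..., None, None]).sum(dim=1) # (N, C, H, W)

# E.g., fusing three same-sized branch outputs into one feature map:
fused = DynamicBranchFusion(64)([torch.randn(2, 64, 56, 56) for _ in range(3)])
print(tuple(fused.shape))  # (2, 64, 56, 56)
```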
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Zhang, L.; Sun, Y.; Guan, J.; Kang, S.; Huang, J.; Zhong, X. Multi-Scale Parallel Enhancement Module with Cross-Hierarchy Interaction for Video Emotion Recognition. Electronics 2025, 14, 1886. https://doi.org/10.3390/electronics14091886