CAGNet: A Network Combining Multiscale Feature Aggregation and Attention Mechanisms for Intelligent Facial Expression Recognition in Human-Robot Interaction
Abstract
1. Introduction
2. State-of-the-Art
3. Methodology
3.1. CBAM Attention Mechanism
3.2. Global Average Pooling
3.3. The Proposed Network
4. Results
4.1. Experimental Data
4.2. Experimental Procedure
4.3. Experimental Analysis and Comparison
Experimental Evaluation
5. Discussion
5.1. Comparison of Different Attention Mechanisms
5.2. Performance of the Proposed Network Model
5.3. Expression Recognition Application
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Li, H.; Xiao, X.; Liu, X.; Guo, J.; Wen, G.; Liang, P. Heuristic objective for facial expression recognition. Vis. Comput. 2023, 39, 4709–4720.
- Hosney, R.; Talaat, F.M.; El-Gendy, E.M.; Saafan, M.M. AutYOLO-ATT: An attention-based YOLOv8 algorithm for early autism diagnosis through facial expression recognition. Neural Comput. Appl. 2024, 36, 1–21.
- Geng, Y.; Meng, H.; Dou, J. FEAIS: Facial emotion recognition enabled education aids IoT system for online learning. In Proceedings of the ICALT, Bucharest, Romania, 1–4 July 2022; pp. 403–407.
- Chen, L.; Wu, M.; Pedrycz, W.; Hirota, K. Weight-Adapted Convolution Neural Network for Facial Expression Recognition. In Emotion Recognition and Understanding for Emotional Human-Robot Interaction Systems; Springer: Berlin/Heidelberg, Germany, 2021; pp. 57–75.
- Putro, M.D.; Nguyen, D.L.; Jo, K.H. A fast CPU real-time facial expression detector using sequential attention network for human–robot interaction. IEEE Trans. Ind. Inform. 2022, 18, 7665–7674.
- Li, J.; Jin, K.; Zhou, D.; Kubota, N.; Ju, Z. Attention mechanism-based CNN for facial expression recognition. Neurocomputing 2020, 411, 340–350.
- Ge, H.; Zhu, Z.; Dai, Y.; Wang, B.; Wu, X. Facial expression recognition based on deep learning. Comput. Methods Programs Biomed. 2022, 215, 106621.
- Ojala, T.; Pietikainen, M.; Maenpaa, T. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 971–987.
- Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the CVPR, San Diego, CA, USA, 20–25 June 2005; Volume 1, pp. 886–893.
- Niu, L.; Zhao, Z.; Zhang, S. Extraction method for facial expression features based on Gabor feature fusion and LBP histogram. J. Shenyang Univ. Technol. 2016.
- Whitehill, J.; Omlin, C.W. Haar features for FACS AU recognition. In Proceedings of the FGR, Southampton, UK, 10–12 April 2006; p. 5.
- Viola, P.; Jones, M.J. Robust real-time face detection. Int. J. Comput. Vis. 2004, 57, 137–154.
- Zhang, X.; Wu, S.; Zhang, H.; Tong, X. Expression Recognition Based on Feature Fusion Method for Dimensionality Reduction. Comput. Digit. Eng. 2015, 43, 396–399.
- Lee, D.H.; Yoo, J.H. CNN Learning Strategy for Recognizing Facial Expressions. IEEE Access 2023, 11, 70865–70872.
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90.
- Chen, X.; Ma, H.; Wan, J.; Li, B.; Xia, T. Multi-view 3D object detection network for autonomous driving. In Proceedings of the CVPR, Honolulu, HI, USA, 21–26 July 2017; pp. 6526–6534.
- Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the CVPR, Boston, MA, USA, 7–12 June 2015; pp. 1–9.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the CVPR, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the CVPR, Honolulu, HI, USA, 21–26 July 2017; pp. 2261–2269.
- Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the ICML, Lille, France, 7–9 July 2015; Volume 37, pp. 448–456.
- Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. JMLR 2014, 15, 1929–1958.
- Zhang, W.; Zhang, X.; Tang, Y. Facial expression recognition based on improved residual network. IET Image Process. 2023, 17, 2005–2014.
- Xie, S.; Li, M.; Liu, S.; Tang, X. ResNet with attention mechanism and deformable convolution for facial expression recognition. In Proceedings of the ICICSP, Shanghai, China, 24–26 September 2021; pp. 389–393.
- Liang, H.; Ying, B.; Lei, Y.; Yu, Z.; Liu, L. A CNN-improved and channel-weighted lightweight human facial expression recognition method. J. Image Graph. 2022, 27, 3491–3502.
- Zhang, S.; Wang, W. Research on Facial Expression Recognition Based on Improved VGG Model. Mod. Inf. Technol. 2021, 5, 100–103.
- Zhao, J.; Feng, X.; Cao, W.; Pei, T.; Jia, W.Z.; Wang, R. T-SNet: Lightweight Facial Expression Recognition Based on Knowledge Distillation. Microelectron. Comput. 2025, 42, 38–47.
- Guo, X.; Ma, N.; Liu, W.; Sun, F.; Zhang, J.; Chen, Y.; Zang, G. Expression Recognition and Interaction of Pharyngeal Swab Collection Robot. Comput. Eng. Appl. 2022, 58, 125.
- Yang, L.; Yang, H.; Hu, B.B.; Wang, Y.; Lv, C. A robust driver emotion recognition method based on high-purity feature separation. IEEE Trans. Intell. Transp. Syst. 2023, 24, 15092–15104.
- Chudasama, V.; Kar, P.; Gudmalwar, A.; Shah, N.; Wasnik, P.; Onoe, N. M2FNet: Multi-modal fusion network for emotion recognition in conversation. In Proceedings of the CVPR, New Orleans, LA, USA, 19–24 June 2022; pp. 4652–4661.
- Zhang, S.; Yang, Y.; Chen, C.; Zhang, X.; Leng, Q.; Zhao, X. Deep learning-based multimodal emotion recognition from audio, visual, and text modalities: A systematic review of recent advancements and future prospects. Expert Syst. Appl. 2024, 237, 121692.
- Sánchez-Brizuela, G.; Cisnal, A.; de la Fuente-López, E.; Fraile, J.C.; Pérez-Turiel, J. Lightweight real-time hand segmentation leveraging MediaPipe landmark detection. Virtual Real. 2023, 27, 3125–3132.
- Liao, L.; Wu, S.; Song, C.; Fu, J. PH-CBAM: A parallel hybrid CBAM network with multi-feature extraction for facial expression recognition. Electronics 2024, 13, 3149.
- Liu, C.; Liu, X.; Chen, C.; Wang, Q. Soft thresholding squeeze-and-excitation network for pose-invariant facial expression recognition. Vis. Comput. 2023, 39, 2637–2652.
- Lin, M.; Chen, Q.; Yan, S. Network In Network. arXiv 2013, arXiv:1312.4400.
- Goodfellow, I.J.; Erhan, D.; Carrier, P.L.; Courville, A.; Mirza, M.; Hamner, B.; Cukierski, W.; Tang, Y.; Thaler, D.; Lee, D.H.; et al. Challenges in representation learning: A report on three machine learning contests. Neural Netw. 2015, 64, 59–63.
- Lucey, P.; Cohn, J.F.; Kanade, T.; Saragih, J.; Ambadar, Z.; Matthews, I. The extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression. In Proceedings of the CVPR, San Francisco, CA, USA, 13–18 June 2010; pp. 94–101.
- Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the CVPR, Seattle, WA, USA, 14–19 June 2020; pp. 11534–11542.
- Zhang, C.; Lin, G.; Liu, F.; Yao, R.; Shen, C. CANet: Class-agnostic segmentation networks with iterative refinement and attentive few-shot learning. In Proceedings of the CVPR, Long Beach, CA, USA, 16–20 June 2019; pp. 5217–5226.
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the CVPR, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141.
- Li, W.; Wang, J.; Zhang, L.; Zeng, X.; Chen, G.; Long, H. Research on facial expression recognition based on improved YOLOv8. In Proceedings of the MVAID 2024, SPIE, Kunming, China, 26–28 April 2024; Volume 13230, pp. 323–329.
- Kumar Tataji, K.N.; Kartheek, M.N.; Prasad, M.V. CC-CNN: A cross connected convolutional neural network using feature level fusion for facial expression recognition. Multimed. Tools Appl. 2024, 83, 27619–27645.
- Lin, E.; Wang, F.; Tan, X. Facial expression recognition of ShuffleNetV2 combined with ultra-lightweight dual attention modules. Electron. Meas. Technol. 2024, 47, 168–174.
- Boughida, A.; Kouahla, M.N.; Lafifi, Y. A novel approach for facial expression recognition based on Gabor filters and genetic algorithm. Evol. Syst. 2022, 13, 331–345.
Layer | Filter Size/Stride | Output Shape | Layer | Filter Size/Stride | Output Shape
---|---|---|---|---|---
Conv2d_1 | /1 | | max_pooling2d_3 | /2 |
Conv2d_2 | /1 | | Conv2d_7 | /1 |
max_pooling2d_1 | /2 | | Conv2d_8 | /1 |
CBAM_1 | - | | max_pooling2d_4 | /2 |
Conv2d_3 | /1 | | CBAM_2 | - |
Conv2d_4 | /1 | | GAP | - | 512
max_pooling2d_2 | /2 | | FC1 | 4096 | 4096
Conv2d_5 | /1 | | FC2 | 4096 | 4096
Conv2d_6 | /1 | | FC3 | 4096 | 7
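The classifier head listed in the table can be traced in plain numpy as a hedged sketch: the 512-channel global average pooling output and the 4096/4096/7 fully connected widths come from the table, while the final feature-map size and the ReLU activations are illustrative assumptions not stated in this excerpt.

```python
import numpy as np

def global_average_pool(feature_map: np.ndarray) -> np.ndarray:
    """Collapse an (H, W, C) feature map to a (C,) vector of per-channel means."""
    return feature_map.mean(axis=(0, 1))

def dense(x, w, b, relu=True):
    """Fully connected layer; ReLU activation is an assumption for FC1/FC2."""
    y = x @ w + b
    return np.maximum(y, 0.0) if relu else y

rng = np.random.default_rng(0)
fmap = rng.standard_normal((6, 6, 512))   # final conv feature map (spatial size assumed)

x = global_average_pool(fmap)             # (512,) as in the GAP row
w1, b1 = rng.standard_normal((512, 4096)) * 0.01, np.zeros(4096)
w2, b2 = rng.standard_normal((4096, 4096)) * 0.01, np.zeros(4096)
w3, b3 = rng.standard_normal((4096, 7)) * 0.01, np.zeros(7)

h = dense(dense(x, w1, b1), w2, b2)       # FC1 -> FC2
logits = dense(h, w3, b3, relu=False)     # FC3: 7 expression classes
print(logits.shape)                       # (7,)
```

The dimensions illustrate why GAP is used here: it removes the dependence on input spatial size, so FC1 always sees a 512-vector.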
Processing Method | FER2013 | CK+ |
---|---|---|
Rotation | ±10° | ±10° |
Width Shift | ±0.2 | ±0.05 |
Height Shift | ±0.2 | ±0.05 |
Horizontal Flip | True | True |
Shear Range | ±0.2 | ±0.2 |
Zoom Range | ±0.2 | ±0.2 |
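The augmentation settings above can be expressed as keyword dictionaries in the style of Keras's `ImageDataGenerator`; this is a hedged sketch, since the excerpt does not state which augmentation library was actually used.

```python
# Augmentation parameters from the table, as Keras-style keyword arguments.
FER2013_AUG = dict(
    rotation_range=10,        # ±10 degrees
    width_shift_range=0.2,    # fraction of image width
    height_shift_range=0.2,
    horizontal_flip=True,
    shear_range=0.2,
    zoom_range=0.2,
)

CK_PLUS_AUG = dict(
    rotation_range=10,
    width_shift_range=0.05,   # CK+ uses smaller shifts than FER2013
    height_shift_range=0.05,
    horizontal_flip=True,
    shear_range=0.2,
    zoom_range=0.2,
)

# Usage sketch (assuming TensorFlow/Keras is available):
#   from tensorflow.keras.preprocessing.image import ImageDataGenerator
#   fer_gen = ImageDataGenerator(**FER2013_AUG)
```

The smaller shift ranges for CK+ are consistent with its pre-aligned lab images, where large translations would push faces out of frame.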
Attention Mechanism | FER2013 Accuracy (%) | CK+ Accuracy (%) |
---|---|---|
ECANet | 68.20 | 96.27 |
CANet | 67.98 | 95.93 |
SENet | 68.54 | 96.95 |
CBAM | 71.52 | 97.97 |
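CBAM, the best performer in the comparison above, applies channel attention (a shared MLP over average- and max-pooled descriptors) followed by spatial attention (a 7x7 convolution over channel-wise average and max maps). The numpy sketch below is illustrative only: tensor sizes and the reduction ratio r=8 are assumptions, not values from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, w1, w2):
    """x: (H, W, C). Shared MLP (C -> C/r -> C) over avg- and max-pooled vectors."""
    avg = x.mean(axis=(0, 1))                   # (C,)
    mx = x.max(axis=(0, 1))                     # (C,)
    mlp = lambda v: np.maximum(v @ w1, 0) @ w2
    return sigmoid(mlp(avg) + mlp(mx))          # (C,) channel weights

def spatial_attention(x, kernel):
    """Channel-wise avg/max maps stacked, then a kxk 'same' convolution."""
    desc = np.stack([x.mean(axis=2), x.max(axis=2)], axis=2)  # (H, W, 2)
    k = kernel.shape[0]
    p = k // 2
    padded = np.pad(desc, ((p, p), (p, p), (0, 0)))
    H, W = x.shape[:2]
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(padded[i:i + k, j:j + k] * kernel)
    return sigmoid(out)                          # (H, W) spatial weights

def cbam(x, w1, w2, kernel):
    x = x * channel_attention(x, w1, w2)         # scale channels
    return x * spatial_attention(x, kernel)[..., None]  # then scale positions

rng = np.random.default_rng(0)
H, W, C, r = 8, 8, 32, 8                         # illustrative sizes
x = rng.standard_normal((H, W, C))
y = cbam(x,
         rng.standard_normal((C, C // r)) * 0.1,
         rng.standard_normal((C // r, C)) * 0.1,
         rng.standard_normal((7, 7, 2)) * 0.1)
print(y.shape)  # (8, 8, 32): attention reweights features without changing shape
```

Because both attention maps are sigmoid-gated multiplications, CBAM preserves the feature map's shape, which is why it can be dropped after any convolutional block in the table above.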
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Zhang, D.; Ma, W.; Shen, Z.; Ma, Q. CAGNet: A Network Combining Multiscale Feature Aggregation and Attention Mechanisms for Intelligent Facial Expression Recognition in Human-Robot Interaction. Sensors 2025, 25, 3653. https://doi.org/10.3390/s25123653