Open Access Article

Visual Scene-Aware Hybrid and Multi-Modal Feature Aggregation for Facial Expression Recognition

Department of Electronic Engineering, Inha University, 100 Inha-ro, Michuhol-gu, Incheon 22212, Korea
* Author to whom correspondence should be addressed.
This paper is an extended version of our paper published in Lee, M.K.; Choi, D.Y.; Kim, D.H.; Song, B.C. Visual Scene-aware Hybrid Neural Network Architecture for Video-based Facial Expression Recognition. In Proceedings of the 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019), Lille, France, 14–18 May 2019.
Sensors 2020, 20(18), 5184; https://doi.org/10.3390/s20185184
Received: 19 August 2020 / Revised: 7 September 2020 / Accepted: 9 September 2020 / Published: 11 September 2020
(This article belongs to the Section Intelligent Sensors)
Abstract: Facial expression recognition (FER) technology has made considerable progress with the rapid development of deep learning. However, conventional FER techniques are mainly designed and trained on videos artificially acquired in controlled environments, so they may not operate robustly on videos captured in the wild, where illumination and head pose vary widely. To solve this problem and improve the ultimate performance of FER, this paper proposes a new architecture that extends a state-of-the-art FER scheme with a multi-modal neural network that can effectively fuse image and landmark information. To this end, we propose three methods. First, to maximize the performance of the recurrent neural network (RNN) in the previous scheme, we propose a frame substitution module that replaces the latent features of less important frames with those of important frames based on inter-frame correlation. Second, we propose a method for extracting facial landmark features, likewise based on the correlation between frames. Third, we propose a new multi-modal fusion method that fuses video and facial landmark information at the feature level: attention derived from the characteristics of each modality is applied to that modality's own features before fusion. Experimental results show that the proposed method provides remarkable performance, with 51.4% accuracy on the wild AFEW dataset, 98.5% accuracy on the CK+ dataset, and 81.9% accuracy on the MMI dataset, outperforming state-of-the-art networks.
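The frame substitution module described in the abstract can be illustrated with a short sketch. The following is a hypothetical PyTorch rendering of the idea, not the authors' implementation: frame importance is estimated from inter-frame feature correlation, and the latent features of low-importance frames are overwritten with those of the most important frame. The function name, the keep_ratio parameter, and the use of cosine similarity are all assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def substitute_frames(features: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Hypothetical sketch of the frame substitution idea.

    features: (T, D) latent features for T video frames.
    Frames whose mean correlation with all frames is low are treated as
    less important; their features are replaced by the feature of the
    single most important frame.
    """
    # Pairwise cosine similarity between frame features: (T, T)
    normed = F.normalize(features, dim=1)
    corr = normed @ normed.t()

    # Importance score per frame = mean correlation with every frame
    importance = corr.mean(dim=1)

    # Keep the top-k most important frames; substitute the rest
    k = max(1, int(keep_ratio * features.size(0)))
    keep = importance.topk(k).indices
    best = importance.argmax()

    out = features.clone()
    mask = torch.ones(features.size(0), dtype=torch.bool)
    mask[keep] = False
    out[mask] = features[best]  # broadcast the best frame's feature
    return out
```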
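Similarly, the attention-based feature-level fusion could look roughly like the sketch below. Again, this is only an illustration under assumptions: the module name, the sigmoid attention heads, the feature dimensions, and the seven-class output are hypothetical choices, not details taken from the paper.

```python
import torch
import torch.nn as nn

class AttentiveFusion(nn.Module):
    """Hypothetical sketch: each modality's features are reweighted by an
    attention score computed from that modality itself, then the two
    streams are concatenated and classified."""

    def __init__(self, video_dim: int, landmark_dim: int, num_classes: int = 7):
        super().__init__()
        self.video_att = nn.Sequential(nn.Linear(video_dim, 1), nn.Sigmoid())
        self.lm_att = nn.Sequential(nn.Linear(landmark_dim, 1), nn.Sigmoid())
        self.classifier = nn.Linear(video_dim + landmark_dim, num_classes)

    def forward(self, video_feat: torch.Tensor, lm_feat: torch.Tensor) -> torch.Tensor:
        v = video_feat * self.video_att(video_feat)  # modality-specific attention
        l = lm_feat * self.lm_att(lm_feat)
        return self.classifier(torch.cat([v, l], dim=-1))  # feature-level fusion


# Example usage with random features (batch of 4; dimensions are illustrative)
model = AttentiveFusion(video_dim=512, landmark_dim=136)
logits = model(torch.randn(4, 512), torch.randn(4, 136))
```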
Keywords: facial expression recognition; multi-modal fusion; convolutional neural networks
MDPI and ACS Style

Lee, M.K.; Kim, D.H.; Song, B.C. Visual Scene-Aware Hybrid and Multi-Modal Feature Aggregation for Facial Expression Recognition. Sensors 2020, 20, 5184. https://doi.org/10.3390/s20185184
