Open Access Article
Symmetry 2019, 11(1), 52;

Deep Temporal–Spatial Aggregation for Video-Based Facial Expression Recognition

Institute of Intelligent Information Processing, Taizhou University, Taizhou 318000, China
School of Software Engineering, Institute of Big Data Science and Industry, Taiyuan University, Shanxi 030006, China
College of Information Science and Technology, Zhejiang Sci-Tech University, Hangzhou 310018, China
College of Computer Science and Technology, Civil Aviation University of China, Tianjin 300300, China
Guangxi Key Laboratory of Hybrid Computation and IC Design Analysis, Guangxi University for Nationalities, Nanning 530006, China
Author to whom correspondence should be addressed.
Received: 28 October 2018 / Revised: 28 December 2018 / Accepted: 30 December 2018 / Published: 5 January 2019

The proposed method has 30 streams, i.e., 15 spatial streams and 15 temporal streams, with each spatial stream paired with a corresponding temporal stream; this pairing relates the work to the concept of symmetry. Classifying video-based facial expressions is difficult owing to the gap between visual descriptors and emotions. To bridge this gap, a new video descriptor for facial expression recognition is presented that aggregates spatial and temporal convolutional features across the entire extent of a video. The designed framework integrates a state-of-the-art 30-stream network with a trainable spatial–temporal feature-aggregation layer and is end-to-end trainable for video-based facial expression recognition. It can therefore effectively avoid overfitting to the limited emotional video datasets, and the trainable aggregation strategy learns a better representation of an entire video. Different schemes for pooling spatial–temporal features are investigated, and the spatial and temporal streams are best aggregated by the proposed method. Extensive experiments on two public databases, BAUM-1s and eNTERFACE05, show that the framework achieves promising performance and outperforms state-of-the-art methods.
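The paper's trainable aggregation layer is part of an end-to-end deep network; as a rough illustration only, the fusion of per-stream features into a single video descriptor can be sketched as a weighted pooling over the 15 spatial and 15 temporal streams. The function below is a minimal NumPy sketch under that assumption; the names `aggregate_streams` and the softmax-normalized `weights` are illustrative inventions, not the authors' implementation.

```python
import numpy as np

def aggregate_streams(spatial_feats, temporal_feats, weights=None):
    """Fuse per-stream CNN features into one video descriptor.

    spatial_feats, temporal_feats: arrays of shape (n_streams, feat_dim),
    one pooled feature vector per stream. weights: optional raw fusion
    weights of length 2 * n_streams (trainable in the real framework);
    defaults to uniform weighting.
    """
    feats = np.concatenate([spatial_feats, temporal_feats], axis=0)  # (30, D)
    if weights is None:
        weights = np.zeros(len(feats))
    # Softmax-normalize the fusion weights so they sum to 1.
    w = np.exp(weights - np.max(weights))
    w /= w.sum()
    # Weighted sum over streams yields a single (D,) video descriptor.
    return (w[:, None] * feats).sum(axis=0)

# Example: 15 spatial + 15 temporal streams, 128-dim features each.
rng = np.random.default_rng(0)
spatial = rng.standard_normal((15, 128))
temporal = rng.standard_normal((15, 128))
descriptor = aggregate_streams(spatial, temporal)
print(descriptor.shape)  # (128,)
```

With uniform weights this reduces to mean pooling over all 30 streams; learning the weights end-to-end, as the paper proposes, lets the network emphasize the streams that are most informative for a given expression.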
Keywords: facial expression recognition; convolutional neural networks; temporal–spatial features; optical flow; feature aggregation

Figure 1

This is an open access article distributed under the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Cite This Article

MDPI and ACS Style

Pan, X.; Guo, W.; Guo, X.; Li, W.; Xu, J.; Wu, J. Deep Temporal–Spatial Aggregation for Video-Based Facial Expression Recognition. Symmetry 2019, 11, 52.

Note that from the first issue of 2016, MDPI journals use article numbers instead of page numbers.

Symmetry EISSN 2073-8994. Published by MDPI AG, Basel, Switzerland.