Deep Temporal–Spatial Aggregation for Video-Based Facial Expression Recognition
Abstract
Classifying facial expressions in video is difficult owing to the gap between visual descriptors and emotions. To bridge this gap, a new video descriptor for facial expression recognition is presented that aggregates spatial and temporal convolutional features across the entire extent of a video. The proposed framework integrates 30 state-of-the-art streams, 15 spatial and 15 temporal, with each spatial stream paired with a corresponding temporal stream; this pairing is what relates the work to the symmetry concept. A trainable spatial–temporal feature aggregation layer makes the framework end-to-end trainable for video-based facial expression recognition, which helps it avoid overfitting to the limited emotional video datasets and allows it to learn a better representation of an entire video. Different schemas for pooling spatial–temporal features are investigated, and the spatial and temporal streams are best aggregated by the proposed method. Extensive experiments on two public databases, BAUM-1s and eNTERFACE05, show that the framework performs promisingly and outperforms state-of-the-art strategies.
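The abstract's core idea, pooling per-stream spatial and temporal features through a trainable aggregation layer into a single video descriptor, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function name, the feature dimension, and the softmax-weighted pooling scheme are all assumptions chosen for illustration; the paper investigates several pooling schemas.

```python
import numpy as np

def aggregate_streams(spatial_feats, temporal_feats, weights):
    """Weighted aggregation of paired spatial/temporal stream features.

    spatial_feats, temporal_feats: (n_streams, feat_dim) arrays, one feature
    vector per stream. weights: (2 * n_streams,) trainable aggregation
    weights (a hypothetical stand-in for the paper's trainable layer).
    Returns a single (feat_dim,) video descriptor.
    """
    stacked = np.concatenate([spatial_feats, temporal_feats], axis=0)  # (2n, d)
    w = np.exp(weights) / np.exp(weights).sum()   # softmax-normalised weights
    return (w[:, None] * stacked).sum(axis=0)     # weighted pooling -> (d,)

# Toy example: 15 spatial + 15 temporal streams with 128-d features,
# mirroring the 30-stream layout described in the abstract.
rng = np.random.default_rng(0)
spatial = rng.normal(size=(15, 128))
temporal = rng.normal(size=(15, 128))
video_descriptor = aggregate_streams(spatial, temporal, np.zeros(30))
print(video_descriptor.shape)  # (128,)
```

With all-zero weights the softmax is uniform, so the descriptor reduces to the mean over all 30 stream features; in training, the weights would be learned end-to-end alongside the streams.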
Share & Cite This Article
Pan, X.; Guo, W.; Guo, X.; Li, W.; Xu, J.; Wu, J. Deep Temporal–Spatial Aggregation for Video-Based Facial Expression Recognition. Symmetry 2019, 11, 52.
Note that from the first issue of 2016, MDPI journals use article numbers instead of page numbers.