Facial Micro-Expression Recognition based on Deep Local-Holistic Network

Abstract: Micro-expression is a subtle, local, and brief facial movement. It can reveal the genuine emotions that people attempt to conceal.


Introduction

Facial micro-expression (micro-expression) is an involuntary and momentary facial expression with a brief duration of less than 500 ms [1]. It reflects one's genuine emotions that people are trying to conceal. In contrast to ordinary facial expressions, micro-expression is consciously suppressed but unconsciously leaked. Moreover, it has the two distinguishing features of short duration and low intensity. Compared with polygraph instruments that require equipment, micro-expression-based lie detection is unobtrusive, and individuals are less likely to counteract it. Therefore, micro-expressions have many potential applications in fields such as clinical diagnosis [2] and national security [3].

Micro-expression is difficult to detect with the naked eye and requires a trained professional to recognize [2]. In order to help people recognize micro-expressions, Ekman et al. developed the Facial Action Coding System (FACS) [4] and defined the muscle activity of facial expressions as action units (AUs). Meanwhile, they also developed the Micro Expression Training Tool (METT) [5]. Since then, micro-expression has received increasing attention from researchers. However, micro-expression analysis by humans is still very challenging, and many researchers have tried to develop automatic micro-expression recognition methods by employing computer vision techniques. Since 2013, Xiaolan Fu's group has built three spontaneous micro-expression databases: CASME I, CASME II, and CAS(ME)².

Based on these published databases, research on micro-expression recognition has gradually developed. There are two main types of approaches, i.e., recognition methods based on handcrafted features and methods based on deep-learning feature extraction. Due to the brief, subtle, and localized nature of micro-expressions, it is challenging for both handcrafted features and features learned by deep networks to fully represent micro-expressions. In addition, since the collection and labeling of micro-expressions are time-consuming and laborious, the total number of published micro-expression samples is only about 1000. Therefore, micro-expression recognition is a typical small sample size (SSS) problem. The sample size greatly limits the application of deep learning in this area. First, deep network models involve a large number of parameters, and training on small micro-expression samples may cause the model to overfit. Moreover, compared with algorithms for ordinary expression recognition, both the number of available training samples and the affordable number of network parameters are constrained by the SSS problem. Furthermore, due to the complicated characterization of micro-expressions themselves, even methods such as transfer learning with pre-training on other large-scale datasets do not achieve satisfactory results and thus cannot be applied in practice.

To address the difficulty of learning micro-expression features with deep networks under the small sample problem, we draw on the psychological cognitive attention mechanism. As shown in Fig. 1, the process of individual cognition of micro-expressions moves from global cognition to local-focused attention and finally to global decision making [10]. Inspired by this theory, we propose a Deep Local-Holistic Network (DLHN) with enhanced micro-expression feature extraction capability for micro-expression recognition. The architecture of the proposed method mainly includes two sub-networks: (1) a hierarchical convolutional recurrent network (HCRNN), which learns local and abundant features from the original frames of micro-expression video clips; and (2) a robust principal component analysis recurrent network (RPRNN), which extracts sparse information from the original frames by RPCA and then feeds this information to a deep learning model to extract holistic and sparse features. The two networks are trained separately and then fused for micro-expression recognition.

The rest of this paper is organized as follows: Section 2 reviews the related works on micro-expression recognition and the basic models applied in our method; Section 3 introduces our proposed algorithm in detail; Section 4 presents the experimental results; and Section 5 concludes the article.

Related Work

This section first introduces the related works on micro-expression recognition, and then briefly describes three algorithms employed in our proposed method: the deep convolutional neural network, the recurrent neural network, and Robust Principal Component Analysis.

Micro-Expression Recognition

In the early stages of the study, most methods adopted handcrafted features to identify micro-expressions. Polikovsky et al. [11] divided the face into specific regions and recognized the motion in each region by a 3D-gradient orientation histogram descriptor. Tomas Pfister et al. [12] designed the first spontaneous micro-expression database (SMIC) and used LBP-TOP [13] to extract dynamic and appearance features of micro-expressions.

Deep Convolutional Neural Network

The deep convolutional neural network (DCNN) is a hierarchical machine learning method containing multilevel nonlinear transformations. It is a classic and widely used structure with three prominent characteristics: local receptive fields, shared weights, and spatial or temporal subsampling. These features reduce temporal and spatial complexity and allow some degree of shift, scale, and distortion invariance when designed to process still images. It has been shown to outperform many other techniques [28].
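To make the weight-sharing property concrete, the following is a minimal PyTorch illustration (the framework and channel count are illustrative choices, not details from the paper): the parameter count of a convolutional layer depends only on its kernel size and channel numbers, not on the input size, so the same layer can be applied to regions of different sizes.

```python
import torch
import torch.nn as nn

# A 3x3 convolution with 1 input channel and 16 output channels.
conv = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3, padding=1)

# Parameter count = 16 * (1 * 3 * 3) weights + 16 biases = 160,
# regardless of the spatial size of the input image.
print(sum(p.numel() for p in conv.parameters()))  # 160

# The same layer can process ROIs of different sizes,
# e.g., a 112x33 eyebrow region or a 56x38 mouth region.
out_brow = conv(torch.randn(1, 1, 112, 33))   # -> (1, 16, 112, 33)
out_mouth = conv(torch.randn(1, 1, 56, 38))   # -> (1, 16, 56, 38)
```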

As introduced in Section 1, handcrafted micro-expression features are not sufficiently representational. Hence, we apply a DCNN to improve the discriminative ability for micro-expression by targeted learning in the local regions where micro-expressions frequently occur.

Recurrent Neural Network

The recurrent neural network (RNN) can be used to process sequential data by mapping an input sequence to a corresponding output sequence using hidden states. However, as the network gradually deepens, the problems of vanishing and exploding gradients arise. To solve this problem, the Long Short-Term Memory (LSTM) architecture was proposed [29], which uses memory cells with multiplicative gate units to process information. It has been shown to outperform the plain RNN in learning long sequences.

Besides, the RNN takes into account only the past context. To address this, the bidirectional RNN (BRNN) was created [30], which can exploit both past and future information. Subsequently, Graves et al. [31] proposed the bidirectional LSTM (BLSTM), which performs better than the LSTM in processing long contextual information with complex temporal dynamics.

Since micro-expressions are very subtle, it is not easy to distinguish them from neutral faces by a single frame. The movement pattern in the temporal sequence is an essential feature of micro-expressions. Therefore, we extract temporal features from the micro-expression sequence based on BRNN and BLSTM to enhance the classification performance.

Robust Principal Component Analysis

The work in [32] demonstrated that observed data can be separated efficiently and exactly into sparse and low-rank structures in high-dimensional spaces. Based on this, an idealized "robust principal component analysis" problem was proposed to recover a low-rank matrix A from highly corrupted measurements D:

D = A + E,

where A is the desired data lying in a low-rank subspace, and E is the error term, usually treated as noise.

According to the characteristics of micro-expression, i.e., short duration and low intensity, micro-expression data are sparse in both the spatial and temporal domains. In 2014, Wang et al. [17] treated E as the desired subtle motion information of micro-expression and A as noise for micro-expression recognition. Inspired by this idea, we adopt RPCA to obtain sparse information from micro-expression frames, and then feed the extracted information to RPRNN, which learns sparse and holistic micro-expression features.

HCRNN for Local Features

Four facial regions of interest (ROIs), i.e., the eyebrows, eyes, nose, and mouth, are used for local micro-expression feature extraction (Fig. 3a). First, the gray-scale micro-expression frames are cropped and normalized to a size of 112×112. Then the ROIs are determined based on facial landmarks. The ROI sizes of the eyebrow, eye, nose, and mouth regions are 112×33, 112×20, 56×32, and 56×38, respectively. Furthermore, considering the integrity of each ROI, adjacent ROIs may have overlapping portions.
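As a rough sketch of the cropping step, the snippet below carves fixed-size ROIs out of a normalized 112×112 face; the landmark-derived centers and the width/height interpretation of the ROI sizes are assumptions for illustration.

```python
import numpy as np

# ROI sizes from the text, interpreted here as (width, height) on a 112x112 face.
ROI_SIZES = {"brow": (112, 33), "eye": (112, 20), "nose": (56, 32), "mouth": (56, 38)}

def crop_roi(face: np.ndarray, center_xy: tuple, size_wh: tuple) -> np.ndarray:
    """Crop a fixed-size ROI around a landmark-derived center, clamped to the image."""
    h, w = face.shape[:2]
    cw, ch = size_wh
    x = int(np.clip(center_xy[0] - cw // 2, 0, w - cw))
    y = int(np.clip(center_xy[1] - ch // 2, 0, h - ch))
    return face[y:y + ch, x:x + cw]

# Placeholder centers; a real pipeline would derive these from detected landmarks.
face = np.zeros((112, 112), dtype=np.uint8)
centers = {"brow": (56, 25), "eye": (56, 42), "nose": (56, 65), "mouth": (56, 90)}
rois = {name: crop_roi(face, centers[name], ROI_SIZES[name]) for name in ROI_SIZES}
print({k: v.shape for k, v in rois.items()})
# {'brow': (33, 112), 'eye': (20, 112), 'nose': (32, 56), 'mouth': (38, 56)}
```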

As shown in the HCRNN block of Fig. 2, the CNN module consists of four HCNNs. For each branch, the input is the ROI gray-scale image, and the network contains four convolutional layers. All four HCNNs have the same structure, as listed in Table 1. The output sizes in the table refer to the tensor shapes generated by the four HCNNs. The CNN module is able to extract local spatial micro-expression features. For better visualization, Fig. 3b presents the feature maps of L4 in HCRNN.

In a micro-expression sequence, both the past and the future context are usually useful for prediction. Thus, a BRNN module [33] is adopted to process the temporal variation of micro-expressions. The number of neurons in each layer of the BRNN module is listed as follows: L5(30 × 4)-L7(60 × 3)-L8(60 × 3)-L10(90 × 2)-L11(90 × 2)-L12(80 × 1). First, the ROI features extracted by the CNN module are fed into the BRNN module at the L5 layer. Then, local temporal information is concatenated in the L6 layer and subsequently processed by two BLSTMs in the L7 layer (see the BRNN structure in Fig. 4). Similar steps to L6 and L7 are repeated in the L8 and L9 layers. A global temporal feature is obtained through the concatenation in the L10 layer and the BLSTM in the L11 layer. We classify micro-expressions by an FC layer in L12 of HCRNN and obtain probabilistic outputs by the softmax layer in L13 of HCRNN:

P_{h_i} = exp(h_i) / Σ_j exp(h_j),

where h_i is the output of L13 and i indexes the output units, i = 0, 1, ..., k. Finally, the HCRNN is trained using the cross-entropy loss function:

L = −Σ_j c_j log P_{h_j},

where c_j is the ground truth and P_{h_j} is the predicted probability of the output layer.
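The following is a minimal PyTorch sketch of the idea behind HCRNN, reduced to a single HCNN branch feeding one BLSTM over 16 frames; the channel widths, hidden sizes, and pooling-based feature aggregation are illustrative assumptions rather than the configuration given in Table 1.

```python
import torch
import torch.nn as nn

class HCNNBranch(nn.Module):
    """One local branch: four 3x3 conv layers with 2x2 pooling (channel sizes assumed)."""
    def __init__(self, out_dim=30):
        super().__init__()
        chans = [1, 8, 16, 32, 64]          # illustrative channel widths
        layers = []
        for cin, cout in zip(chans[:-1], chans[1:]):
            layers += [nn.Conv2d(cin, cout, 3, stride=1, padding=1),
                       nn.ReLU(inplace=True),
                       nn.MaxPool2d(2, stride=2)]
        self.features = nn.Sequential(*layers)
        self.proj = nn.Linear(64, out_dim)   # per-frame ROI feature

    def forward(self, x):                    # x: (batch*frames, 1, H, W)
        f = self.features(x)
        f = f.mean(dim=(2, 3))               # global average pooling over space
        return self.proj(f)

class LocalTemporalNet(nn.Module):
    """Per-frame branch features -> BLSTM over the clip -> class probabilities."""
    def __init__(self, n_classes=4, feat_dim=30, hidden=90):
        super().__init__()
        self.branch = HCNNBranch(feat_dim)
        self.blstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, clip):                 # clip: (batch, frames, 1, H, W)
        b, t = clip.shape[:2]
        f = self.branch(clip.flatten(0, 1)).view(b, t, -1)
        h, _ = self.blstm(f)
        return torch.softmax(self.fc(h[:, -1]), dim=-1)

model = LocalTemporalNet()
probs = model(torch.randn(2, 16, 1, 112, 33))   # e.g., an eyebrow-ROI sequence
print(probs.shape)                               # torch.Size([2, 4])
```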

RPRNN for Holistic Features

Due to the short duration and low intensity of micro-expression movement, micro-expression can be considered as sparse data. Hence, RPCA [15] is utilized to obtain sparse micro-expression information. In detail, for a gray-scale video clip V (h × w × n), where h and w are respectively the pixel height and width of each frame and n is the number of frames, we stack all frames as column vectors of a matrix D with h × w rows and n columns. The decomposition can be formulated as follows:

min_{A,B} rank(A) + λ∥B∥_0   s.t.   D = A + B,

where A is a low-rank matrix, B is a sparse matrix, rank(·) is the rank of the matrix, and ∥·∥_0 denotes the ℓ0-norm, which counts the number of nonzero elements in the matrix. This objective is non-convex. Wright et al. adopted the ℓ1-norm as a convex surrogate for the highly non-convex ℓ0-norm and the nuclear norm (the sum of singular values) to replace the non-convex rank, yielding the following convex optimization problem:

min_{A,B} ∥A∥_* + λ∥B∥_1   s.t.   D = A + B,

where ∥·∥_* denotes the nuclear norm, ∥·∥_1 denotes the ℓ1-norm, which is the sum of the absolute values of all matrix elements, and λ is a positive weighting parameter (λ > 0). Lin et al. [34] proposed the Augmented Lagrange Multiplier method (ALM), which includes two algorithms, exact ALM and inexact ALM, to solve linearly constrained convex optimization problems. The inexact ALM requires slightly fewer partial SVDs than the exact ALM and has the same convergence speed. Benefiting from this, we adopt the inexact ALM to obtain sparse micro-expression motion information from the original frames.
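A compact NumPy sketch of the inexact ALM iteration for this decomposition is given below; the parameter choices (λ = 1/√max(m, n), μ, ρ) follow common defaults and are assumptions, not values taken from the paper.

```python
import numpy as np

def rpca_inexact_alm(D, lam=None, tol=1e-7, max_iter=500):
    """Decompose D into a low-rank part A and a sparse part E via inexact ALM."""
    m, n = D.shape
    lam = lam if lam is not None else 1.0 / np.sqrt(max(m, n))
    Y = D / max(np.linalg.norm(D, 2), np.abs(D).max() / lam)   # dual variable init
    A = np.zeros_like(D)
    E = np.zeros_like(D)
    mu = 1.25 / np.linalg.norm(D, 2)
    rho = 1.5
    d_norm = np.linalg.norm(D, "fro")
    for _ in range(max_iter):
        # Sparse part: elementwise soft-thresholding.
        T = D - A + Y / mu
        E = np.sign(T) * np.maximum(np.abs(T) - lam / mu, 0.0)
        # Low-rank part: singular value thresholding.
        U, S, Vt = np.linalg.svd(D - E + Y / mu, full_matrices=False)
        A = (U * np.maximum(S - 1.0 / mu, 0.0)) @ Vt
        Z = D - A - E
        Y = Y + mu * Z
        mu = min(mu * rho, 1e7)
        if np.linalg.norm(Z, "fro") / d_norm < tol:
            break
    return A, E

# Stack n gray-scale frames (h x w each) as columns of D, then split.
frames = np.random.rand(16, 112, 112)          # placeholder clip
D = frames.reshape(16, -1).T                   # (h*w) rows, n columns
A, E = rpca_inexact_alm(D)
sparse_frames = E.T.reshape(frames.shape)      # per-frame sparse motion information
```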

The obtained sparse micro-expression images are fed into RPRNN to extract holistic features. The architecture of RPRNN is shown in the bottom block of Fig. 2. In order to learn high-level micro-expression representations, a deep BLSTM network is created by stacking multiple LSTM hidden layers. The holistically sparse features are extracted in L1 of RPRNN, and two FC layers are used to classify micro-expressions. Then, the emotion type of the micro-expression is estimated by the softmax layer:

P(r_i) = exp(r_i) / Σ_j exp(r_j),

where r_i is the i-th unit of the output layer. Finally, to avoid the overfitting problem, we combine the cross-entropy loss function with L2 regularization:

L = −Σ_i c_i log P(r_i) + λ Σ_θ θ²,

where P(r_i) is the predicted probability of the output layer and θ indexes the weight values.
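Below is a minimal PyTorch sketch of such a sub-network; the hidden size, the number of stacked BLSTM layers, and the regularization coefficient are illustrative assumptions, and the L2 term is added to the cross-entropy loss explicitly to mirror the regularized objective above.

```python
import torch
import torch.nn as nn

class RPRNNSketch(nn.Module):
    """Stacked BLSTM over sparse frames, followed by two FC layers."""
    def __init__(self, in_dim=112 * 112, hidden=128, n_layers=2, n_classes=4):
        super().__init__()
        self.blstm = nn.LSTM(in_dim, hidden, num_layers=n_layers,
                             batch_first=True, bidirectional=True)
        self.fc1 = nn.Linear(2 * hidden, 64)
        self.fc2 = nn.Linear(64, n_classes)

    def forward(self, x):                    # x: (batch, frames, h*w) sparse images
        h, _ = self.blstm(x)
        return self.fc2(torch.relu(self.fc1(h[:, -1])))   # class logits

model = RPRNNSketch()
logits = model(torch.randn(2, 16, 112 * 112))
labels = torch.tensor([0, 3])

# Cross-entropy (which applies the softmax) plus an explicit L2 penalty on all weights.
l2 = sum((p ** 2).sum() for p in model.parameters())
loss = nn.functional.cross_entropy(logits, labels) + 1e-4 * l2
loss.backward()
```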
Model Fusion

HCRNN and RPRNN are trained separately, and their probabilistic outputs are fused by a weighted sum:

P_i = a·P_{h_i} + (1 − a)·P_{r_i},

where a is the weight value, and P_{h_i} and P_{r_i} are the predicted probabilities of HCRNN and RPRNN, respectively. According to the experimental results, the model achieves the best performance when a equals 0.45; thus, we set a to 0.45.
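A short sketch of this late-fusion step (the weighted-sum form shown is the assumed combination rule, with placeholder probabilities standing in for the two sub-networks' outputs):

```python
import torch

a = 0.45                                          # fusion weight that worked best
p_h = torch.softmax(torch.randn(2, 4), dim=-1)    # HCRNN probabilities (placeholder)
p_r = torch.softmax(torch.randn(2, 4), dim=-1)    # RPRNN probabilities (placeholder)
p = a * p_h + (1 - a) * p_r                       # assumed weighted-sum fusion
pred = p.argmax(dim=-1)                           # final emotion class per sample
```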

Experiments

We use a dataset combining four spontaneous micro-expression databases (CASME I, CASME II, CAS(ME)², and SAMM) to assess the performance of our models. Table 2 presents the details of these four databases. However, the number of emotion classes differs among these databases, and micro-expression samples are labeled by different AU criteria. For example, the combination of AU1 and AU2 defines a micro-expression sample as disgust in CAS(ME)² and as surprise in CASME II. In order to alleviate the impact of the different encodings, we adopt the uniform AU encoding criterion proposed by Davison et al. [35]. Finally, we select 560 samples from the combined dataset and divide them into four emotion labels: negative, positive, surprise, and others. Specifically, negative consists of anger, disgust, sadness, and fear. Fig. 7a shows the sample size of each emotion category. In our experiments, we use a 10-fold cross-validation protocol on the combined dataset.
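A small sketch of the relabeling step, assuming per-sample emotion annotations; the grouping of negative emotions follows the text, while the remaining assignments (e.g., mapping happiness to positive) are illustrative assumptions.

```python
# Merge fine-grained emotion labels into the four classes used in our experiments.
NEGATIVE = {"anger", "disgust", "sadness", "fear"}

def to_four_classes(emotion: str) -> str:
    if emotion in NEGATIVE:
        return "negative"
    if emotion == "happiness":        # assumed mapping for positive samples
        return "positive"
    if emotion == "surprise":
        return "surprise"
    return "others"                   # everything else falls into the "others" class

print(to_four_classes("disgust"))     # negative
print(to_four_classes("surprise"))    # surprise
```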

Since the length of each video sample varies, we perform linear interpolation and extract 16 frames from each sample for the subsequent recognition task. The size of the face image is 112 × 112. For HCRNN, the face region is divided into four ROIs as the input of the CNN module. To guarantee the integrity of each part, the ROIs have overlapping areas, and the sizes of the brow, eye, nose, and mouth regions are 112 × 33, 112 × 20, 56 × 32, and 56 × 38, respectively. The convolution kernel size of HCNN is set to 3 × 3, and the size of the pooling kernel is 2 × 2. The strides of the convolution and pooling layers are set to 1 and 2, respectively. In the training stage, the learning rate adopts exponential decay with an initial value of 0.85. We update all weights in each iteration with mini-batch samples of size 45. The iteration curves in Fig. 5a show the trend of the loss and accuracy values on the testing set.
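A minimal sketch of the temporal normalization step, assuming straightforward per-pixel linear interpolation along the time axis to obtain a fixed-length clip:

```python
import numpy as np

def resample_clip(frames: np.ndarray, target_len: int = 16) -> np.ndarray:
    """Linearly interpolate a (n, h, w) clip along time to target_len frames."""
    n = frames.shape[0]
    src = np.arange(n, dtype=np.float64)
    dst = np.linspace(0, n - 1, target_len)
    flat = frames.reshape(n, -1).astype(np.float64)
    out = np.empty((target_len, flat.shape[1]))
    for j in range(flat.shape[1]):              # per-pixel temporal interpolation
        out[:, j] = np.interp(dst, src, flat[:, j])
    return out.reshape((target_len,) + frames.shape[1:])

clip = np.random.rand(23, 112, 112)             # a variable-length sample
print(resample_clip(clip).shape)                # (16, 112, 112)
```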

For RPRNN, the original micro-expression frames are processed by RPCA to obtain the sparse micro-expression images. Fig. 6 illustrates an example of micro-expression images processed by RPCA. Then the sparse images are fed to RPRNN to obtain holistic features. In this model, the decay scheme of the learning rate and the update mode of the weights are the same as in HCRNN, and the learning rate is initialized to 0.01. As in HCRNN, we update all weights in each iteration during the training stage. Fig. 5b plots the iteration curves showing the trend of the loss and accuracy values on the testing set. In the whole experiment, we employ a truncated normal distribution with zero mean and a standard deviation of 0.1 to initialize the weights, and initialize the biases to 0.1.
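A short PyTorch sketch of this initialization scheme, applied to a single linear layer for illustration:

```python
import torch.nn as nn

layer = nn.Linear(256, 4)
nn.init.trunc_normal_(layer.weight, mean=0.0, std=0.1)   # truncated normal weights
nn.init.constant_(layer.bias, 0.1)                       # constant bias of 0.1
```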

Our proposed DLHN consists of HCRNN and RPRNN. As introduced in Section 3.3, these two sub-networks are combined by the parameter a. We choose different values of a to evaluate the fusion network and conduct our experiments with 10-fold cross-validation. Table 3 shows the micro-expression recognition accuracy of the fusion network with different values of a. It can be seen that when a equals 0.45, the average accuracy of the fusion network is the highest. Therefore, a is set to 0.45 when we compare the performance of the proposed DLHN with current state-of-the-art (SOTA) methods on the combined dataset.

Figure 6. An example of RPCA on micro-expression images. Fig. 6a-6e are the original micro-expression images. Fig. 6f-6j are the corresponding extracted sparse information. Fig. 6k-6o are the enhanced display of Fig. 6f-6j, obtained by multiplying each pixel by 2.

For the comparison methods, among the handcrafted-feature-based approaches we choose the classical FDM features and LBP features [36], as well as the LBP variant LBP-SIP [37]. Among the deep learning methods, we choose the first-place method of the Micro-Expression Grand Challenge 2019 and two deep-learning-based methods with code released in the last two years, namely STSTNet [38], RCN (_a, _w, _s, and _f) [25], and Feature Refinement (FR) [27]. Moreover, we reproduced all these methods with the same data configuration. Table 4 shows the overall accuracy of all algorithms. The best traditional method for micro-expression recognition is LBP-TOP (4 × 4), which achieves 58.38% mean accuracy. The mean accuracies of HCRNN and RPRNN are 55.08% and 59.53%, respectively. The fusion model, i.e., DLHN, obtains the best performance by combining the local abundant features extracted by HCRNN and the holistic sparse features extracted by RPRNN, achieving 60.31% mean accuracy. Besides, RPRNN obtains the best performance in three folds (fold 7, fold 8, and fold 10), which demonstrates the efficiency of its holistic sparse spatio-temporal feature extraction capacity.

Furthermore, Fig. 7b illustrates the confusion matrix of our proposed DLHN for the four emotion categories. According to Fig. 7a, "negative" and "other" have more samples than "positive" and "surprise". Therefore, the recognition accuracy of "negative" and "other" is higher than that of the other two categories.

Conclusion

In this paper, we proposed a Deep Local-Holistic Network for micro-expression recognition. Specifically, HCRNN is designed to extract local and abundant information from the ROIs related to micro-expression. According to the sparse characteristic of micro-expression, we obtain sparse micro-expression information from the original images by RPCA, and utilize RPRNN to extract holistic and sparse features from the sparse images. The Deep Local-Holistic Network, formed by fusing HCRNN and RPRNN, captures local-holistic and sparse-abundant micro-expression information and boosts the performance of micro-expression recognition. Experimental results on the combined databases demonstrate that our proposed method outperforms several state-of-the-art algorithms.
The recognition performance of DLHN remains to be improved due to the limitations of the small sample problem and the unbalanced sample distribution. In future work, we will further investigate unsupervised learning as well as data augmentation methods to improve the performance of micro-expression recognition.

Data Availability Statement: The CASME I database is available at http://fu.psych.ac.cn/CASME/casme-en.php. The CASME II database is available at http://fu.psych.ac.cn/CASME/casme2-en.php. The CAS(ME)² database is available at http://fu.psych.ac.cn/CASME/cas(me)2-en.php. The SAMM database is available at http://www2.docm.mmu.ac.uk/STAFF/M.Yap/dataset.php.