Cost-Effective CNNs for Real-Time Micro-Expression Recognition

Abstract: Micro-Expression (ME) recognition is a hot topic in computer vision, as it offers a gateway to capturing and understanding people's everyday emotions. It is nonetheless a challenging problem, as MEs are typically transient (lasting less than 200 ms) and subtle. Recent advances in machine learning enable new and effective methods for solving diverse computer vision tasks. In particular, deep learning techniques applied to large datasets outperform classical machine learning approaches that rely on hand-crafted features. Even though available datasets of spontaneous MEs are scarce and much smaller, using off-the-shelf Convolutional Neural Networks (CNNs) still yields satisfactory classification results. However, these networks are heavy in terms of memory consumption and computational resources. This poses great challenges when deploying CNN-based solutions in applications such as driver monitoring or comprehension recognition in virtual classrooms, which demand fast and accurate recognition. As these networks were initially designed for tasks in other domains, they are over-parameterized and need to be optimized for ME recognition.
In this paper, we propose a new network based on the well-known ResNet18, which we optimize for ME classification in two ways. Firstly, we reduce the depth of the network by removing residual layers. Secondly, we introduce a more compact representation of the optical flow used as input to the network. We present extensive experiments and demonstrate that the proposed network obtains accuracies comparable to state-of-the-art methods while significantly reducing the required memory space. Our best classification accuracy reaches 60.17% on the challenging composite dataset containing 5 objective classes. Our method takes only 24.6 ms to classify a ME video clip (less than the duration of the shortest MEs, which last 40 ms). Our CNN design is suitable for real-time embedded applications with limited memory and computing resources.


Introduction
Emotion recognition has received much attention in the research community in recent years.
Among the several sub-fields of emotion analysis, studies of facial expression recognition are particularly active [1][2][3]. In contrast to traditional macro-expressions, people are less familiar with micro facial expressions [4,5], and even fewer know how to capture and recognize them.
A Micro-Expression (ME) is a rapid and involuntary facial expression that exposes a person's true emotion [6]. These subtle expressions usually take place when a person conceals his or her emotions in one of two scenarios: conscious suppression or unconscious repression. Conscious suppression happens when an individual deliberately prevents himself or herself from expressing genuine emotions. Conversely, unconscious repression occurs when the subject is not aware of his or her true emotions. In both cases, MEs reveal the subject's true emotions regardless of the subject's awareness. Intuitively, ME recognition has a vast number of potential applications across different sectors, such as security, neuromarketing [7], automobile driver monitoring [8], and lie and deceit detection [5].
Psychological research shows that facial MEs are generally transient (lasting less than 200 ms) and very subtle [9]. Their short duration and subtlety make them very challenging for humans to perceive and recognize. To enable better ME recognition by humans, Ekman and his team developed the Micro-Expression Training Tool (METT). Even with the help of this training tool, humans can barely achieve around 40% accuracy [10]. Moreover, human decisions are prone to be influenced by individual perception, which varies across subjects and over time, resulting in less objective results. Therefore, a bias-free and high-quality automatic system for facial ME recognition is highly sought after.
A number of earlier solutions for automating facial ME recognition have been based on geometric or appearance feature extraction methods. Specifically, geometric features encode geometric information of the face, such as the shapes and locations of facial landmarks. Appearance-based features, on the other hand, describe the skin texture of faces. Most existing methods [11,12] attempt to extract low-level features, such as the widely used Local Binary Pattern from Three Orthogonal Planes (LBP-TOP) [13][14][15], from different facial regions and simply concatenate them for ME recognition.
Nevertheless, the transient and subtle nature of MEs makes it inherently challenging for low-level features to effectively capture the essential movements of an ME. At the same time, these features can also be affected by irrelevant information or noise in video clips, which further weakens their discriminative power, especially on inactive facial regions with less dynamics [16].
Recently, more approaches based on mid-level and high-level features have been proposed. Among these methods, the pipeline combining optical flow and deep learning has demonstrated high effectiveness for ME recognition in comparison with traditional approaches. Studies applying deep learning to the ME classification problem usually consider well-known Convolutional Neural Networks (CNNs) such as ResNet [17] and VGG [18]. These studies re-purpose off-the-shelf CNNs by feeding them input data taken from the optical flow extracted from the MEs. While achieving good performance, these neural networks are quite heavy in terms of memory usage and computation.
In specific applications, for example automobile driver monitoring or student comprehension recognition in virtual education systems, fast and effective processing methods are necessary to capture emotional responses as quickly as possible. Meanwhile, thanks to great progress in parallel computing, parallelized image processing devices such as embedded systems are easily accessible and affordable. Already well adopted in diverse domains, these devices possess multiple strengths in terms of speed, embeddability, power consumption and flexibility. These advantages, however, often come at the cost of limited memory and computing power.
The objective of this work is to design an efficient and accurate ME recognition pipeline for embedded vision. First, our design builds on a thorough investigation of different CNN architectures. Next, different optical flow representations for CNN inputs are studied. Finally, our proposed pipeline achieves accuracy competitive with state-of-the-art ME recognition approaches while being real-time capable and using less memory. The paper is organized as follows. In Section 2, several recent related works are reviewed. Section 3 explains the proposed methodology for establishing cost-effective CNNs for fast ME recognition. Section 4 provides experimental results and performance evaluations. Lastly, Section 5 concludes the paper.

Related works
MEs begin at the onset (the first frame where the facial muscles start to contract), finish at the offset (the last frame, where the face returns to its neutral state), and reach their pinnacle at the apex frame (see Figure 1). Because of their very short duration and low intensity, ME recognition and analysis are considered difficult tasks. Earlier studies proposed using low-level features such as LBP-TOP to address these problems. LBP-TOP is a 3D descriptor extended from the traditional 2D LBP. It encodes the binary patterns between image pixels, as well as the temporal relationships between pixels and their neighboring frames. The resulting histograms are then concatenated to represent the temporal changes over entire videos. LBP-TOP has been widely adopted in several studies. Pfister et al. [13] applied LBP-TOP to spontaneous ME recognition. Yan et al. [14] achieved 63% ME recognition accuracy on their CASME II database using LBP-TOP. In addition, LBP-TOP has also been used to investigate differences between micro-facial movement sequences and neutral face sequences. Several studies aimed to extend the low-level features extracted by LBP-TOP, as they still could not reach satisfactory accuracy. For example, Liong et al. [19] proposed assigning different weights to local features, putting more attention on active facial regions. Wang et al. [11] studied the correlation between color and emotions by extracting LBP-TOP from the tensor independent color space (TICS).
Ruiz-Hernandez and Pietikäinen [20] used a re-parameterization of the second-order Gaussian jet on LBP-TOP, achieving promising ME recognition results on the SMIC database [21]. Considering that LBP-TOP contains redundant information, Wang et al. [22] proposed the LBP-Six Intersection Points (LBP-SIP) method, which is computationally more efficient and achieves higher accuracy on the CASME II database. We also note that the STCLQP (SpatioTemporal Completed Local Quantization Patterns) proposed by Huang et al. [23] achieved a substantial improvement for analyzing facial MEs.
As research over the years has shown that it is non-trivial for low-level features to effectively capture and encode the subtle dynamic patterns of MEs (especially in inactive regions), other methods have shifted to exploiting mid- or high-level features. He et al. [16] developed a novel multi-task mid-level feature learning method to enhance the discriminative power of the extracted low-level features. The mid-level feature representation is generated by learning a set of class-specific feature mappings.
Better recognition performance has been obtained with more available information and with features offering better discrimination and generalization abilities. A simple and efficient method known as Main Directional Mean Optical-flow (MDMO) was employed by Liu et al. [24]. They used optical flow to measure the subtle movement of facial regions of interest (ROIs) spotted based on the Facial Action Coding System (FACS). Oh et al. [25] applied the monogenic Riesz wavelet representation in order to amplify the subtle movements of MEs.
The aforementioned methods indicate that the majority of existing approaches rely heavily on hand-crafted features. Inherently, they are not easily transferable, as the process of feature crafting and selection depends heavily on domain knowledge and researchers' experience. In addition, methods based on hand-crafted features are not accurate enough to be applied in practice. Therefore, high-level feature descriptors that better describe different MEs and can be learned automatically are desired.
Recently, more and more vision tasks have shifted to deep CNN-based solutions due to their superior performance. Recent developments in ME recognition are also inspired by these advances, incorporating CNN models within the ME recognition framework.
Peng et al. [26] proposed a two-stream convolutional network, DTSCNN (Dual Temporal Scale Convolutional Neural Network), to address two issues: the overfitting caused by the small sizes of existing ME databases and the use of high-level features. DTSCNN has four notable characteristics: (i) separate features were first extracted from ME clips by two shallow networks and then fused; (ii) data augmentation and a higher drop-out ratio were applied in each network; (iii) two databases (CASME I and CASME II) were combined to train the network; (iv) the data fed to the networks were optical-flow images instead of raw RGB frames.
Khor et al. [27] studied two variants of an Enriched LRCN (Long-term Recurrent Convolutional Network) model for ME recognition. Spatial enrichment (SE) refers to the channel-wise stacking of gray-scale and optical flow images as a new input to the CNN. Temporal enrichment (TE), on the other hand, stacks the obtained features. Their TE model achieves better accuracy on a single database, while the SE model is more robust under the cross-domain protocol involving more databases.
Liong et al. [28] designed a Shallow Triple Stream Three-dimensional CNN (STSTNet). The model takes as input stacked optical flow images computed between the onset and apex frames (optical strain, horizontal and vertical flow fields), followed by three shallow convolution layers in parallel and a fusion layer. The proposed method is able to extract rich features from MEs while being computationally light, as the fused features are compact yet discriminative.
Our objective is to realize a fast and high-performance ME recognition pipeline for embedded vision applications under several constraints, such as embeddability, limited memory and restricted computing resources. Inspired by existing works [26,28], we explore different CNN architectures and several optical flow representations for CNN inputs in order to find cost-effective neural network architectures capable of recognizing MEs in real time.

Methodology
Studies applying deep learning to the ME classification problem [29][30][31][32] usually use pretrained CNNs such as ResNet [17] and VGG [18] and apply transfer learning to obtain ME features. In our work, we first select the off-the-shelf ResNet18 because it provides the best trade-off between accuracy and speed on the challenging ImageNet classification task and is recognized for this performance in transfer learning. ResNet [17] explicitly lets the stacked layers fit a residual mapping.
Namely, the stacked non-linear layers are made to fit the mapping F(x) := H(x) − x, where H(x) is the desired underlying mapping and x the initial activations. The original mapping is recast into F(x) + x by feedforward neural networks with shortcut connections. ResNet18 has 20 convolutional layers (CL): 17 successive CL and 3 branching ones. Residual links are used after each pair of successive convolutional units, and the number of kernels is doubled after each residual link. As ResNet18 is designed to extract features from RGB color images, it requires inputs to have 3 channels.
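As an illustration, the residual mapping F(x) + x can be sketched in a few lines of NumPy. This is a toy dense version for clarity only; a real ResNet block uses 3x3 convolutions, batch normalization and learned weights, and all names here are illustrative:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    # F(x): two stacked non-linear layers (plain matrix products for brevity)
    f = relu(x @ w1) @ w2
    # Shortcut connection: the block learns F(x); the output is relu(F(x) + x)
    return relu(f + x)

# With zero weights F(x) = 0, so the block reduces to the identity on
# non-negative inputs -- the property that makes deep residual nets trainable.
x = np.ones((1, 4))
w_zero = np.zeros((4, 4))
y = residual_block(x, w_zero, w_zero)
```

When F(x) contributes nothing, information still flows unchanged through the shortcut, which is why removing residual layers degrades the network gracefully rather than abruptly.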
To accelerate processing in the deep learning domain, the main current trend for decreasing the complexity of a CNN is to reduce its number of parameters. For example, Hui et al. [33] proposed the very compact LiteFlowNet, which is 30 times smaller in model size and 1.36 times faster at runtime than state-of-the-art CNNs for optical flow estimation. In [34], Rieger et al. explored parameter-reduced residual networks on in-the-wild datasets, targeting real-time head pose estimation; they experimented with various ResNet architectures with varying numbers of layers to handle different image sizes (including low-resolution images), and their optimized ResNet achieved state-of-the-art accuracy at real-time speed. Well-known CNNs are created for specific problems and are therefore over-calibrated when used in other contexts. ResNet18 was made for end-to-end object recognition: the dataset used for training contains more than a thousand images for each class and a thousand classes in total.
Given that (i) ME recognition studies consider at most 5 classes, and the datasets of spontaneous MEs are scarce and contain far fewer samples, and (ii) optical flow fields are higher-level features than raw colors and therefore require a shallower network, we reduce the architecture of ResNet18 by iteratively removing residual layers. This allows us to assess the influence of the network's depth on its classification capacity in our context, and therefore to estimate the relevant calibration of the network.
Figure 2 illustrates the reduction protocol: at each step, the last residual layer with two CL is removed and the previous one is connected to the fully connected layer. Only networks with an odd number of CL are therefore considered. As highlighted in Table 1, decreasing the number of CL has a significant impact on the number of learnable parameters of the network, which directly affects the forward propagation time. Once the network depth has been estimated, the dimensionality of the input has to be optimized. In our case, the CNNs take optical flow extracted between the onset and apex frames of ME video clips, since it is between these two moments that the motion is likely to be strongest.
The dimensionality of the inputs determines the complexity of the network that uses them, since the number of input channels dictates the number of filters used throughout all following layers of the CNN. The optical flow between the onset (Figure 3-a) and the apex (Figure 3-b) typically has a 3-channel representation so that it can be used in a pretrained architecture designed for 3-channel color images. This representation, however, may not be optimal for ME recognition.
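The effect of input dimensionality on the first layer can be quantified with a short helper. The filter count of 64 below is an assumed example, not the paper's actual configuration:

```python
def conv2d_params(in_ch, out_ch, k=3, bias=True):
    """Learnable parameters of one 2D convolutional layer: weights plus biases."""
    return out_ch * in_ch * k * k + (out_ch if bias else 0)

# First-layer cost for 3-, 2- and 1-channel optical-flow inputs (64 filters assumed)
costs = {c: conv2d_params(c, 64) for c in (3, 2, 1)}
```

Dropping from 3 input channels to 1 shrinks the first layer almost threefold, and, as Table 2 shows, the saving compounds through the following layers.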
Optical flow can be described as the change of structured patterns of light between successive frames, used to measure the movement of a pixel over a period of time. Optical flow estimation techniques are based on the brightness constancy assumption:

I(x, y, t) = I(x + δx, y + δy, t + δt)    (1)

where I(x, y, t) is the intensity of the pixel at position (x, y) at time t.
The optical flow is represented as a vector (Figure 3-c) indicating the direction and intensity of the motion. The projection of the vector on the horizontal axis corresponds to the Vx field (Figure 3-d), while its projection on the vertical axis is the Vy field (Figure 3-e). The magnitude M is the norm of the vector (Figure 3-f). Figure 4 illustrates this representation for one optical flow vector. The horizontal and vertical components Vx and Vy of the optical flow correspond to the spatial variation (δx, δy) obtained by minimizing the difference between the left and right terms of Equation 1.
In this paper, the optical flow is estimated by the Horn-Schunck method [35], which assumes that the flow is smooth over the entire image. The velocity field is estimated by minimizing the energy

E = ∬ [ (Ix Vx + Iy Vy + It)² + α² (‖∇Vx‖² + ‖∇Vy‖²) ] dx dy    (2)

where Ix, Iy and It are the spatial and temporal derivatives of the image intensity, and α is a regularization parameter that controls the degree of smoothness, usually selected heuristically. This energy is iteratively minimized until convergence using the update equations

Vx ← V̄x − Ix (Ix V̄x + Iy V̄y + It) / (α² + Ix² + Iy²)    (3)
Vy ← V̄y − Iy (Ix V̄x + Iy V̄y + It) / (α² + Ix² + Iy²)    (4)

where V̄x and V̄y are weighted averages of Vx and Vy in a neighbourhood.

When classifying MEs, the resulting matrices Vx, Vy and M are traditionally given as input to the CNN. Nonetheless, the third channel is inherently redundant, since M is computed from Vx and Vy: a 2-channel optical flow composed of the Vx and Vy fields already provides all the relevant information. Furthermore, we hypothesize that even a single-channel motion field could be descriptive enough. Hence, we created and evaluated networks taking as input the optical flow in a two-channel representation (Vx-Vy) and in a one-channel representation (M, Vx or Vy). The proposed networks begin with a number of CL determined by the depth optimization, each followed by batch normalization and ReLU. The networks end with a max-pooling layer and a fully connected layer. Figure 5 presents the architectures used, with one to four CL, according to the results of the experiments in Section 4. As illustrated in Table 2, a low-dimensional input leads to a significant reduction in the number of learnable parameters and therefore in the complexity of the system.
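The Horn-Schunck iteration can be sketched in NumPy as follows. This is a minimal illustration using simple finite differences and a four-neighbour average for V̄; the paper's actual implementation (MATLAB) and parameter choices may differ:

```python
import numpy as np

def horn_schunck(I1, I2, alpha=1.0, n_iter=100):
    """Estimate (Vx, Vy) between two gray frames; also returns the magnitude M."""
    I1, I2 = I1.astype(float), I2.astype(float)
    Ix = np.gradient(I1, axis=1)          # spatial derivatives
    Iy = np.gradient(I1, axis=0)
    It = I2 - I1                          # temporal derivative
    Vx, Vy = np.zeros_like(I1), np.zeros_like(I1)
    avg = lambda V: (np.roll(V, 1, 0) + np.roll(V, -1, 0) +
                     np.roll(V, 1, 1) + np.roll(V, -1, 1)) / 4.0
    for _ in range(n_iter):
        Vx_bar, Vy_bar = avg(Vx), avg(Vy)    # neighbourhood averages
        common = (Ix * Vx_bar + Iy * Vy_bar + It) / (alpha**2 + Ix**2 + Iy**2)
        Vx = Vx_bar - Ix * common
        Vy = Vy_bar - Iy * common
    M = np.sqrt(Vx**2 + Vy**2)            # redundant third channel
    return Vx, Vy, M
```

Note that M adds no information beyond Vx and Vy, which is the observation motivating the 2- and 1-channel input representations studied in this paper.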

Dataset and validation protocol presentation
Two ME databases are used in our experiments. CASME II (Chinese Academy of Sciences Micro-Expression) [14] is a comprehensive spontaneous ME database containing 247 video samples, collected from 26 Asian participants with an average age of 22.03 years. The Spontaneous Actions and Micro-Movements (SAMM) database [36] is more recent, consisting of 159 micro-movements (one video for each). These videos were collected spontaneously from a demographically diverse group of 32 participants with a mean age of 33.24 years and a balanced gender split. Originally intended for investigating micro-facial movements, SAMM initially collected the 7 basic emotions.
Both the CASME II and SAMM databases are recorded at a high-speed frame rate of 200 fps, and both are annotated with the "objective classes" provided in [37]. For this reason, the Facial MEs Grand Challenge 2018 [38] proposed combining all samples from both databases into a single composite dataset of 253 videos with five emotion classes. It should be noted that the class distribution is not well balanced.
Similar to [38], we applied the Leave-One-Subject-Out (LOSO) cross-validation protocol for ME classification, where one subject's data is used as the test set in each fold of the cross-validation. This better reproduces realistic scenarios in which the encountered subjects are not seen during training. In all experiments, recognition performance is measured by accuracy, i.e., the percentage of correctly classified video samples out of the total number of samples in the database.
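The LOSO protocol can be sketched as a generator over subject identifiers (an illustrative helper, not the paper's code):

```python
import numpy as np

def loso_folds(subject_ids):
    """One fold per subject: that subject's samples form the test set."""
    subject_ids = np.asarray(subject_ids)
    for s in np.unique(subject_ids):
        test = np.flatnonzero(subject_ids == s)
        train = np.flatnonzero(subject_ids != s)
        yield train, test

# Example: three subjects -> three folds; no subject is shared across splits
folds = list(loso_folds([1, 1, 2, 2, 2, 3]))
```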
The Horn-Schunck method [35] was selected to compute the optical flow. This algorithm is widely used for optical flow estimation in many recent studies by virtue of its robustness and efficiency.
Throughout all experiments, we train the CNN models with a mini-batch size of 64 for 150 epochs using RMSprop optimization. Feature extraction and classification are both handled by the CNN.
Simple data augmentation is applied to double the training set size. Specifically, for each ME video clip used for training, in addition to the optical flow between the onset and apex frames, we also include a second flow computed between the onset and apex+1 frames.
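This augmentation amounts to duplicating each clip's frame pair, as in the hypothetical helper below (the frame indices are made-up examples):

```python
def augment_pairs(onset, apex):
    """For one clip, use flow(onset, apex) and flow(onset, apex + 1)."""
    return [(onset, apex), (onset, apex + 1)]

# Two training clips given as (onset, apex) frame indices
pairs = [p for clip in [(0, 30), (0, 45)] for p in augment_pairs(*clip)]
```

Each clip contributes two optical-flow samples, doubling the training set without altering the labels.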

ResNet depth study
In order to find the ResNet depth which offers an optimal compromise between ME recognition performance and the number of learnable parameters, we tested different CNN depths using the method described in Section 3. The obtained accuracies are given in Table 3. We observe that the best score is achieved by ResNet8, which has seven CL. However, the scores achieved by different numbers of CL do not vary much, and beyond seven CL, adding more layers does not improve the accuracy of the model. The fact that accuracy does not increase with depth confirms that multiple successive CL are not necessary to achieve respectable accuracy. The most interesting observation is that, with a single CL, we achieve a score not far from the optimal one while the model is much more compact. This suggests that, instead of deep learning, a more "classical" approach exploiting shallow neural networks is an interesting avenue when considering portability and computational efficiency for embedded systems. This is the principal reason why we restrict the rest of our study to shallow CNNs.

CNN input study
In this subsection, we study the impact of optical flow representations on ME recognition performance. Two types of CNN have been investigated: one with a 1-channel input (Vx, Vy, or M) and the other using the 2-channel Vx-Vy pair. Because off-the-shelf CNNs typically take 3-channel inputs and are pre-trained accordingly, applying transfer learning to our models is non-trivial. Instead, we created custom CNNs and trained them from scratch. Table 4 shows the recognition accuracies of different configurations using a small number of CNN layers.
We observe that the Vx-Vy pair and Vy alone give the best results, both representations achieving 60.17% accuracy. The magnitude alone leads to a similar, slightly lower accuracy of 59.34%, while Vx gives the worst results overall, with a maximum score of 54.34%. This observation indicates that the most discriminative features for ME classification may be more dominant in the vertical movement than in the horizontal one. This assumption is plausible when considering the muscle movements involved in each known facial expression. To better visualize the differences in the high-level features present in Vx, Vy and the magnitude, we averaged all samples according to their classes. The result can be seen in Figure 6.

Classification analysis
In order to better understand the obtained results, we measured the cosine similarity of the features extracted by three CNNs: ResNet8 (Section 4.2), Vx-Vy-3 CL and Vy-3 CL (Section 4.3). The convolutional layers of a CNN are usually considered as feature extractors; only the last fully connected layer directly performs the classification task. The features just before classification can be represented as a vector. The cosine similarity between two vectors a and b is given by Equation 5:

cos(a, b) = (a · b) / (‖a‖ ‖b‖)    (5)

Cosine similarity values fall within the range [−1, 1].
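The measure of Equation 5 reduces to a one-liner in NumPy:

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(a, b) = a.b / (|a||b|): 1 for aligned, 0 for orthogonal, -1 for opposite."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Applied to the pre-classification feature vectors, it quantifies how similarly two networks represent the same sample, independent of feature magnitude.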

Performance evaluations
In this subsection, we evaluate our proposed method on three aspects: recognition accuracy, required memory space and processing speed. Since we obtain optimal results using the Vy field and a 3-layer CNN, further evaluations concentrate on this configuration. Evaluation of memory space: Table 9 summarizes the number of learnable parameters and filters used, according to the dimensionality of the network inputs. The minimum required memory space corresponds to the storage of 333,121 parameters, which is less than 3.12% of that of the off-the-shelf ResNet18.
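As a rough check, the memory figure can be derived from the parameter count, assuming single-precision (4-byte) storage per weight:

```python
def model_size_bytes(n_params, bytes_per_param=4):
    """Storage needed for float32 weights (excludes activations and overhead)."""
    return n_params * bytes_per_param

size = model_size_bytes(333_121)   # the reduced network's 333,121 parameters
```

That is about 1.3 MB of weights, small enough for the memory budgets of many embedded targets.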

Conclusion and future works
In this paper, we propose cost-efficient CNN architectures to recognize spontaneous MEs. We first investigated the depth of the well-known ResNet18 network and demonstrated that only a small number of layers is needed for our task. Based on this observation, we experimented with several representations at the network's input. Finally, we obtained an accuracy of 60.17% with a light CNN design consisting of 3 CL with the single-channel input Vy. This configuration reduces the number of learnable parameters by a factor of 32 in comparison with ResNet18. Moreover, we achieved a processing time of 24.6 ms, which is shorter than the shortest MEs (40 ms). Our study opens an interesting way to find the trade-off between speed and accuracy in ME recognition. While the results are encouraging, it should be noted that our method does not give better accuracy than the ones described in the literature; instead, a compromise has to be made between accuracy and processing time. By minimizing the computation, our proposed method obtains accuracy comparable to state-of-the-art systems while being compatible with the real-time constraints of embedded vision.
Several future works could further enhance both the speed and accuracy of our proposed ME recognition pipeline. These include more advanced data augmentation techniques to improve recognition performance. Moreover, new ways to automatically optimize the structure of a network to make it lighter have recently been presented, and other networks optimized for efficiency will also be explored. For example, MobileNet [40] uses depth-wise separable convolutions to build lightweight CNNs, and ShuffleNet [41] uses pointwise group convolutions to reduce the computational complexity of 1x1 convolutions, together with channel shuffling to help information flow across feature channels. Our next step is to analyze and integrate these new methodologies into our framework.

Figure 1 .
Figure 1.Example of a ME: the maximum movement intensity occurs at the apex frame.


Figure 2 .
Figure 2. Depth reduction of a deep neural network: in the initial network, each residual layer contains two CL (left); the last residual layer is removed (middle) to obtain a shallower network (right). onset

Figure 3 .
Figure 3. Optical flow is computed between the onset (a) and the apex (b): vectors obtained for a random sample of pixels (c), Vx field (d), Vy field (e) and magnitude field (f).

Figure 4 .
Figure 4. Visualisation of M, Vx and Vy for one optical flow vector.

Figure 5 .
Figure 5. Proposed networks composed of one to four (from left to right) CL for various representations of the optical flow as input.

Figure 6 .
Figure 6. Average optical flow obtained in the dataset per ME class. Studied classes are, from left to right: happiness, surprise, anger, disgust and sadness. We observe that Vx exhibits a non-negligible amount of noise, whereas the magnitude and Vy show clear regions of activity for each class, aligned with the muscles responsible for each facial expression.

Figure 8 .
Figure 8. Confusion matrix obtained by the work of [27].

Following several previous studies, we fed the CNNs with optical flow estimated from the onset and apex frames of MEs. Different flow representations (horizontal Vx, vertical Vy, magnitude M and the Vx-Vy pair) were tested and evaluated on a composite dataset (CASME II and SAMM) for the recognition of five objective classes. The results obtained with the Vy input alone are the most convincing, likely because this orientation is more suitable for describing ME motion and its variation between the different expression classes. Experimental results demonstrated that the proposed method achieves a recognition rate similar to state-of-the-art approaches.

Table 1 .
Number of CL and number of learnable parameters in the proposed architectures.

Table 2 .
Number of learnable parameters according to the dimensionality of the input of the network.

Table 3 .
Accuracies varied by the number of convolution layers (CL) and associated number of learnable parameters.

Table 4 .
Accuracies under various CNN architectures and optical flow representations.

Table 5 .
Cosine similarity for the 3 CL CNN with single-channel input Vy

Table 6 .
Cosine similarity for the 3 CL CNN with double-channel inputs (Vx-Vy)

Table 9 .
Number of learnable parameters and filters (in brackets) of various network architectures under different input dimensions.
Evaluation of processing speed: we used a mid-range computer with an Intel Xeon processor and an Nvidia GTX 1060 graphics card to carry out all the experiments. The complete pipeline is implemented in MATLAB 2018a with its deep learning toolbox. Our best-scoring model is the CNN with a single-channel input and three successive CL. It needs 12.8 ms to classify the vertical component Vy. The optical flow between two frames requires 11.8 ms to compute on our computer, leading to a total runtime of 24.6 ms to classify an ME video clip. To our knowledge, the proposed method outperforms most ME recognition systems in terms of processing speed.