Objective Classes for Micro-Facial Expression Recognition

Micro-expressions are brief spontaneous facial expressions that appear on a face when a person conceals an emotion, making them different to normal facial expressions in subtlety and duration. Currently, emotion classes within the CASME II dataset are based on Action Units and self-reports, creating conflicts during machine learning training. We will show that classifying expressions using Action Units, instead of predicted emotion, removes the potential bias of human reporting. The proposed classes are tested using LBP-TOP, HOOF and HOG 3D feature descriptors. The experiments are evaluated on two benchmark FACS coded datasets: CASME II and SAMM. The best result achieves 86.35% accuracy when classifying the proposed 5 classes on CASME II using HOG 3D, outperforming the result of the state-of-the-art 5-class emotional-based classification in CASME II. Results indicate that classification based on Action Units provides an objective method to improve micro-expression recognition.


Introduction
A micro-facial expression is revealed when someone attempts to conceal their true emotion [1,2]. When they consciously realise that a facial expression is occurring, the person may try to suppress the facial expression because showing the emotion may not be appropriate [3]. Once the suppression has occurred, the person may mask over the original facial expression and cause a micro-facial expression. In a high-stakes environment, these expressions tend to become more likely as there is more risk to showing the emotion.

Feature            CASME II             SAMM
Emotion Classes    5                    7
Mean Age (SD)      22.03 (SD = 1.60)    33.24 (SD = 11.32)
Ethnicities        1                    13
* This is the original amount of movements used in [23], however we use a larger set of 255 provided by the dataset.

(a) CASME II
CASME II (Chinese Academy of Sciences Micro-expression Database II) was developed by Yan et al. [23] as a substantially improved successor to CASME [22]. All samples in CASME II are spontaneous and dynamic micro-expressions recorded at a high frame rate (200 fps). A few frames are kept before and after each micro-expression to make the samples suitable for detection experiments. Samples were recorded at a resolution of 640×480 pixels and saved as MJPEG, with the cropped facial area measuring about 280×340 pixels. The participants' facial expressions were elicited in a well-controlled laboratory environment. The dataset contains 255 micro-expressions, gathered from 35 participants, which were selected from nearly 3000 facial movements and labelled with AUs based on FACS. Only 247 movements were used in the original experiments on CASME II [23].

When analysing the FACS codes of the CASME II dataset, it was found that there are many conflicts between the coded AUs and the estimated emotions. These inconsistencies do not help when attempting to train distinct machine learning classes, and add further justification for the proposed introduction of new classes based on AUs only.
For example, the clip from Subject 11 with the filename 'EP19_03f' was coded with AU4 and placed in the 'others' estimated emotion category (shown in Fig. 1). However, the clip from Subject 26 with the filename 'EP18_50' was also coded with AU4 but placed in the 'disgust' estimated emotion category (shown in Fig. 2). As can be seen in the apex frame (centre image) of both Figs. 1 and 2, AU4, the lowering of the brow, is present. Having the same movement in different categories is likely to have an effect on any training stage of machine learning.

(b) SAMM
The Spontaneous Actions and Micro-Movements (SAMM) [24] dataset is the first high-resolution dataset of 159 micro-movements induced spontaneously with the largest variability in demographics. To obtain a wide variety of emotional responses, the dataset was created to be as diverse as possible. A total of 32 participants were recruited for the experiment, with a mean age of 33.24 years (SD: 11.32, ages between 19 and 57) and an even gender split of 16 male and 16 female participants. The inter-coder reliability of the FACS codes within the dataset is 0.82, calculated using a slightly modified version of the inter-coder reliability formula found in the FACS Investigator's Guide [28], to account for three coders rather than two.
The inducement procedure was based on the 7 basic emotions [1] and recorded at 200 fps. As part of the experimental design, each video stimulus was tailored to each participant, rather than obtaining self-reports during or after the experiment. This allowed particular videos to be chosen and shown to participants for optimal inducement potential. The experiment comprised 7 stimuli used to induce emotion in the participants, who were told to suppress their emotions so that micro-facial movements might occur. To increase the chance of this happening, a prize of £50 was offered to the participant who could hide their emotion the best, thereby introducing a high-stakes situation [1,2]. Each participant completed a questionnaire prior to the experiment so that the stimuli could be tailored to each individual to increase the chances of emotional arousal.
The SAMM dataset was originally designed to investigate micro-facial movements by analysing muscle movements of the face rather than recognising distinct classes [29]. We are the first to categorise SAMM based on the FACS AUs and then use these categories for micro-facial expression recognition.

(c) Related Work
Currently, there are three features on which many micro-expression recognition approaches rely: Local Binary Patterns (LBP), Histogram of Oriented Gradients (HOG) and Histogram of Oriented Optical Flow (HOOF). We will discuss different methods that use these features in recent work on micro-expression recognition.
As an extension to the original Local Binary Pattern (LBP) [11] operator, Local Binary Patterns on Three Orthogonal Planes (LBP-TOP) was proposed by Zhao et al. [12] for texture and facial expression analysis in the spatial-temporal domain. A video sequence of time length T is usually thought of as a stack of XY planes along the time axis, but it can equally be thought of as three sets of planes: XY, XT and YT. These provide information about space and time transition. The basic idea of LBP-TOP is similar to LBP, the difference being that LBP-TOP extracts features from all three planes, which are then combined into a single feature vector.
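As a rough illustration of the three-plane idea, the sketch below computes 4-neighbour LBP codes on the XY, XT and YT planes of a grey-scale volume and concatenates the three normalised histograms. It is a simplified NumPy sketch and not the implementation of [12]: the function names (`lbp_codes`, `lbp_top`), the `(T, H, W)` array layout, the bit ordering and the shared border margin are all assumptions made for this example. With four neighbours the sampling points lie on the axes, so no interpolation is needed.

```python
import numpy as np

def lbp_codes(vol, offsets, margin):
    """Return 4-bit LBP codes for the interior voxels of a (T, H, W) volume.

    offsets: four (dt, dy, dx) neighbour displacements lying in one plane.
    margin:  border width skipped on every axis (assumption: shared by all
             planes so that each plane is computed over the same voxels).
    """
    T, H, W = vol.shape
    centre = vol[margin:T - margin, margin:H - margin, margin:W - margin]
    codes = np.zeros(centre.shape, dtype=np.int64)
    for bit, (dt, dy, dx) in enumerate(offsets):
        neighbour = vol[margin + dt:T - margin + dt,
                        margin + dy:H - margin + dy,
                        margin + dx:W - margin + dx]
        # Set the bit when the neighbour is at least as bright as the centre.
        codes |= (neighbour >= centre).astype(np.int64) << bit
    return codes

def lbp_top(vol, rx=1, ry=1, rt=1):
    """Concatenate normalised LBP histograms from the XY, XT and YT planes."""
    vol = np.asarray(vol, dtype=np.float64)
    margin = max(rx, ry, rt)
    if min(vol.shape) <= 2 * margin:
        raise ValueError("volume too small for the chosen radii")
    planes = [
        [(0, 0, rx), (0, -ry, 0), (0, 0, -rx), (0, ry, 0)],  # XY plane
        [(0, 0, rx), (-rt, 0, 0), (0, 0, -rx), (rt, 0, 0)],  # XT plane
        [(0, ry, 0), (-rt, 0, 0), (0, -ry, 0), (rt, 0, 0)],  # YT plane
    ]
    hists = []
    for offsets in planes:
        codes = lbp_codes(vol, offsets, margin)
        hist = np.bincount(codes.ravel(), minlength=16).astype(np.float64)
        hists.append(hist / hist.sum())
    return np.concatenate(hists)

# Synthetic clip standing in for a micro-expression sequence (12 frames, 20x20).
clip = np.random.default_rng(0).integers(0, 256, size=(12, 20, 20))
features = lbp_top(clip, rx=1, ry=1, rt=4)
print(features.shape)  # (48,)
```

Each plane contributes a 16-bin histogram (2^4 possible codes), so the combined vector has 48 entries; a larger neighbour count would need interpolated sampling points, which this sketch leaves out.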
Yan et al. [23] carried out the first micro-expression recognition experiment on the CASME II dataset. LBP-TOP [12] was used to extract the features and Support Vector Machine (SVM) [30] was employed as the classifier. The radii varied from 1 to 4 for X and Y, and from 2 to 4 for T (T=1 was not considered due to little change between two neighbouring frames at 200 fps), with classification occurring between five main categories of emotions provided in this experiment (happiness, disgust, surprise, repression and others).
Davison et al. [31] used the LBP-TOP feature to differentiate between movements and neutral sequences, attempting to avoid bias when classifying with an SVM.
Using LBP-TOP feature extraction, [23] achieved a best result of 63.41% accuracy when recognising micro-expressions in 5 classes with leave-one-out cross-validation. This result is average for recent micro-expression recognition research, and is likely due to the way micro-expressions are categorised. Of the movements in the five CASME II classes, 102 were classed as 'others', which denoted movements not suited to the other categories but related to emotion. The next largest category was 'disgust' with 60 movements, showing that the 'others' class made the categorisation imbalanced. Further, because micro-expressions are short in duration and low in intensity, the categorisation was not based solely on AUs but also on the participants' self-reports. When micro-expressions are classified in this way, features are unlikely to exhibit a pattern and therefore perform poorly during the recognition stage, as can be seen in the other performance results of around 50%-60% in [23].
More recently, LBP-TOP was used as a base feature for micro-expression recognition with integral projection [32,33]. These representations attempt to improve discrimination between micro-expression classes and therefore improve recognition rates. Polikovsky et al. [25] used a 3D gradient histogram descriptor (HOG 3D) to recognise posed micro-facial expressions from high-speed videos. The paper used manually marked-up areas that are relevant to FACS-based movement so that unnecessary parts of the face are left out. This means that classifying movement in these subjectively selected areas is time-consuming and would not suit a real-time application such as interrogation. The spatio-temporal domain is explored, highlighting the importance of the temporal plane in micro-expressions, with the bin count set to 8 for the XY plane and 12 for the XT and YT planes. The number of bins represents the different directions of movement in each plane.
For HOOF-based methods, a Main Direction Mean Optical Flow (MDMO) feature was proposed by Liu et al. [34] for micro-facial expression recognition using SVM as a classifier. The method also uses 36 regions, partitioned using 66 facial points, to isolate local areas for analysis while keeping the feature vector small for computational efficiency. The best result on the CASME II dataset was 67.37% using leave-one-subject-out cross-validation.
The basic HOOF descriptor was also used by Li et al. [35] as a comparative feature when spotting micro-expressions and then performing recognition. This is the first automatic micro-expression system that can spot and recognise micro-expressions from spontaneous video data and is comparable to human performance.
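To make the role of the bin counts concrete, the following sketch builds magnitude-weighted histograms of gradient direction on the XY, XT and YT planes, using 8 bins for XY and 12 for XT and YT. It is only a simplified illustration under stated assumptions, not the descriptor of [25]: the function names, the use of `np.gradient` for the derivatives, the `(T, H, W)` layout and the normalisation are all choices made for this example.

```python
import numpy as np

def plane_histogram(g_a, g_b, n_bins):
    """Magnitude-weighted histogram of gradient directions within one plane."""
    angle = np.arctan2(g_b, g_a) % (2 * np.pi)  # direction in [0, 2*pi)
    magnitude = np.hypot(g_a, g_b)
    # Map each direction to a bin; clip guards against rounding up to n_bins.
    bins = np.minimum((angle / (2 * np.pi) * n_bins).astype(int), n_bins - 1)
    hist = np.bincount(bins.ravel(), weights=magnitude.ravel(), minlength=n_bins)
    total = hist.sum()
    return hist / total if total > 0 else hist

def gradient_histograms_3d(vol, xy_bins=8, t_bins=12):
    """Concatenate direction histograms for the XY, XT and YT planes."""
    vol = np.asarray(vol, dtype=np.float64)
    g_t, g_y, g_x = np.gradient(vol)  # derivatives along time, rows, columns
    return np.concatenate([
        plane_histogram(g_x, g_y, xy_bins),  # XY: spatial edge directions
        plane_histogram(g_x, g_t, t_bins),   # XT: horizontal change over time
        plane_histogram(g_y, g_t, t_bins),   # YT: vertical change over time
    ])

# Synthetic clip standing in for a cropped facial region (10 frames, 16x16).
clip = np.random.default_rng(1).random((10, 16, 16))
descriptor = gradient_histograms_3d(clip)
print(descriptor.shape)  # (32,)
```

A finer quantisation on the two temporal planes gives the histogram more resolution for the direction of motion over time, which is where the subtle changes of a micro-expression are expected to appear.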

Methodology
To overcome the conflicting classes in CASME II, we restructure the classes around the AUs that have been FACS coded. Using EMFACS [28], a list of AUs and AU combinations is proposed for a fair categorisation of the SAMM [24] and CASME II [23] datasets. Categorising in this way removes the bias of human reporting and relies on the ground truth movement data, feature representation and recognition technique for each micro-expression clip. Table 2 shows 7 classes and the corresponding AUs that have been assigned to each class. Classes I-VI are linked with happiness, surprise, anger, disgust, sadness and fear. Class VII relates to contempt and other AUs that have no emotional link in EMFACS [28]. It should be noted that the classes do not directly correspond to these emotions; however, the links used are informed by previous research [27,28,36]. Each movement in both datasets was classified based on the AU categories of Table 3.

Micro-expression recognition experiments are run on two datasets: CASME II and SAMM. For these experiments, three types of feature representation are extracted from a sequence of grey-scale images that represents each micro-expression. These image sequences are divided into 5×5 non-overlapping blocks. For the LBP-TOP features [12], the radii for X, Y and T are set to 1, 1 and 4 respectively, and the number of neighbours in all three planes is set to 4. The HOG 3D [25] and HOOF [14] features are set to the parameters described in the original implementations.
Sequential Minimal Optimization (SMO) [37] is used in the classification phase, with 10-fold cross-validation and leave-one-subject-out (LOSO), to classify between classes I-V, I-VI and I-VII. SMO is a fast algorithm for training SVMs, and provides a solution to the very large quadratic programming (QP) problems that SVM training requires. SMO avoids time-consuming QP calculations by breaking them down into smaller pieces. This allows the classification task to be completed much faster than with traditional SVM training methods [37].
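The division of an image sequence into 5×5 non-overlapping blocks can be sketched as follows. This is a minimal illustration rather than the code used in the experiments: the helper names, the `(T, H, W)` layout and the use of `np.array_split` (which tolerates frame sizes not divisible by 5) are assumptions, and `intensity_histogram` is only a placeholder for a real descriptor such as LBP-TOP, HOG 3D or HOOF.

```python
import numpy as np

def split_into_blocks(vol, rows=5, cols=5):
    """Yield non-overlapping spatial blocks of a (T, H, W) sequence, row by row."""
    for band in np.array_split(vol, rows, axis=1):
        for block in np.array_split(band, cols, axis=2):
            yield block

def intensity_histogram(block, n_bins=8):
    """Placeholder descriptor (assumption): normalised grey-level histogram."""
    hist, _ = np.histogram(block, bins=n_bins, range=(0, 256))
    return hist / max(hist.sum(), 1)

def blockwise_features(vol, descriptor, rows=5, cols=5):
    """Apply a descriptor to every block and concatenate the results."""
    parts = [np.ravel(descriptor(b)) for b in split_into_blocks(vol, rows, cols)]
    return np.concatenate(parts)

# Synthetic grey-scale sequence: 8 frames of 50x60 pixels.
sequence = np.random.default_rng(2).integers(0, 256, size=(8, 50, 60))
blocks = list(split_into_blocks(sequence))
features = blockwise_features(sequence, intensity_histogram)
print(len(blocks), blocks[0].shape, features.shape)  # 25 (8, 10, 12) (200,)
```

Computing the descriptor per block and concatenating keeps coarse information about where on the face a movement occurs, at the cost of a feature vector 25 times longer than a single whole-face histogram.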
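The LOSO protocol can be sketched as below: every clip from one subject is held out for testing while the classifier is trained on all remaining subjects, and this is repeated once per subject. The sketch is illustrative only. To keep it free of dependencies beyond NumPy, a nearest-centroid rule stands in for the SMO-trained SVM used in the experiments; the function names and the toy data are likewise assumptions.

```python
import numpy as np

def leave_one_subject_out(subject_ids):
    """Yield (subject, train_idx, test_idx), holding each subject out once."""
    subject_ids = np.asarray(subject_ids)
    for subject in np.unique(subject_ids):
        test_mask = subject_ids == subject
        yield subject, np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

def nearest_centroid_predict(X_train, y_train, X_test):
    """Stand-in classifier (assumption): pick the class with the closest mean."""
    classes = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(X_test[:, None, :] - centroids[None, :, :], axis=2)
    return classes[dists.argmin(axis=1)]

def loso_accuracy(X, y, subject_ids):
    """Pool the predictions from every held-out subject into one accuracy."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    correct = 0
    for _, train_idx, test_idx in leave_one_subject_out(subject_ids):
        pred = nearest_centroid_predict(X[train_idx], y[train_idx], X[test_idx])
        correct += int((pred == y[test_idx]).sum())
    return correct / len(y)

# Toy data: two well-separated classes recorded from three subjects.
X = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [0.0, 0.1], [5.0, 5.1]]
y = [0, 0, 1, 1, 0, 1]
subjects = ["s1", "s1", "s2", "s2", "s3", "s3"]
print(loso_accuracy(X, y, subjects))  # 1.0
```

Because no subject ever appears in both the training and the test set, LOSO measures generalisation to unseen faces, whereas 10-fold cross-validation can place clips from the same person on both sides of a split.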

Results
Evidence to support the proposed AU-based categories can be seen in the confusion matrix in Fig. 3. [...] 'happiness' and 28.57% of the 'disgust' categories are misclassified as 'others' respectively. The originally chosen emotions, including many placed in the 'others' category, lead to a lot of conflict at the recognition stage.
The classification results for the proposed classes I-V using LBP-TOP can be seen in the confusion matrix in Fig. 4. In contrast, the classification rates are more stable and outperform the original classes overall. The results are by no means perfect; however, they show that the most logical direction is to use objective classes based on AUs rather than estimated emotion categories. [...] selection of FACS-based regions [38] supports this with AUC results for detecting relevant movements of 0.7512 and 0.7261 on SAMM and CASME II, respectively.

Table 4 shows the experimental results on CASME II, with each result metric being a weighted average calculation to account for imbalanced numbers within classes. Each experiment was completed for each feature, within both the original classes defined in [23] and the proposed classes. Both the 10-fold cross-validation and leave-one-subject-out (LOSO) results are shown.
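A weighted average of this kind can be sketched as follows: each per-class score is multiplied by that class's share of the samples before summing, so that large classes count in proportion to their size. The sketch is an assumption-laden illustration; the choice of precision, recall and F-measure, the function name and the toy labels are not taken from Table 4.

```python
from collections import Counter

def weighted_scores(y_true, y_pred):
    """Support-weighted precision, recall and F-measure over all true classes."""
    support = Counter(y_true)
    n = len(y_true)
    totals = {"precision": 0.0, "recall": 0.0, "f_measure": 0.0}
    for cls, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        predicted = sum(1 for p in y_pred if p == cls)
        precision = tp / predicted if predicted else 0.0
        recall = tp / count
        denom = precision + recall
        f_measure = 2 * precision * recall / denom if denom else 0.0
        weight = count / n  # share of the samples belonging to this class
        totals["precision"] += weight * precision
        totals["recall"] += weight * recall
        totals["f_measure"] += weight * f_measure
    return totals

# Toy example with an imbalanced pair of classes (4 versus 2 samples).
y_true = ["I", "I", "I", "I", "II", "II"]
y_pred = ["I", "I", "I", "II", "II", "I"]
scores = weighted_scores(y_true, y_pred)
print(round(scores["f_measure"], 4))  # 0.6667
```

In this toy case class I scores 0.75 and class II scores 0.5 on all three metrics, giving (4/6)·0.75 + (2/6)·0.5 ≈ 0.6667. The support-weighted recall always equals the overall accuracy, which here is 4 correct out of 6.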
For the CASME II results, the proposed classes outperform the original classes for every feature, with the top performance being a weighted accuracy score of 86.35% for the HOG 3D feature in the proposed classes I-V. This is a large improvement over the original classes, which achieved 80.93% for the same feature. Using LOSO, the results were comparable with the original classes. The highest LOSO accuracy was 76.60% from the HOOF feature, in the proposed I-VII classes.
The experiments were then repeated under the same conditions for SAMM, and the results can be seen in Table 5. Overall, the recognition rates were good for SAMM, with the best result achieving an accuracy of 81.93% using LBP-TOP in the I-VI classes for 10-fold cross-validation. The best result using LOSO was 63.93%, from the HOG 3D feature in the proposed I-VII classes; the LOSO results were lower because SAMM contains fewer micro-expressions than CASME II.
The imbalance of data, specifically the small amount of micro-expression data, can skew LOSO results because few samples are available for training and testing. This makes it difficult to quantify LOSO results for micro-expression recognition with a fair degree of significance. Further data collection of spontaneous micro-expressions is required to rectify this.

Conclusion
We show that by restructuring micro-expression classes objectively around the AUs, recognition results outperform the state-of-the-art emotion-based classification approaches. As micro-expressions are so subtle, the best way to categorise them is as objectively as possible, so using AU codes is the most logical choice. Categorising using a combination of AUs and self-reports [23] can cause many conflicts when training a machine learning method. Further, dataset imbalances can be very detrimental to machine learning algorithms, and this is further emphasised by the relatively small number of movements in both datasets. Future work will look into the effect of using more modern features with AU-based classification to improve recognition accuracy. This could include the MDMO feature [34], the local wrinkle feature [39] and the feature extraction methods described by Wang et al. [40].
Further work can be done to improve micro-facial expression datasets. Firstly, creating more datasets or expanding existing ones would be a simple improvement that can help move the research forward faster. Secondly, a standard procedure on how to maximise the number of micro-movements induced spontaneously in laboratory-controlled experiments would be beneficial. If collaboration between established datasets and researchers from psychology occurred, dataset creation would be more consistent.
Deep learning has emerged as a new area of machine learning research [41][42][43], and micro-expression analysis has yet to exploit this trend. Unfortunately, the amount of high-quality spontaneous micro-expression data is low, and deep learning requires a large amount of data to work well [42]. Many video-based datasets used previously have over 10,000 video samples [44], and some have over 1 million actions extracted from YouTube videos [45]. A real effort to gather spontaneous micro-expression data is required for deep learning approaches to be effective in the future.