Late Fusion-Based Video Transformer for Facial Micro-Expression Recognition

Abstract: In this article, we propose a novel model for facial micro-expression (FME) recognition. The proposed model is built on a transformer, which has recently been adopted for computer vision but has never been used for FME recognition. Because a transformer requires far more data than a convolutional neural network, we use motion features such as optical flow, together with late fusion, to compensate for the scarcity of FME data. The proposed method was verified and evaluated on the SMIC and CASME II datasets. Our approach achieved state-of-the-art (SOTA) performance on SMIC, with an unweighted F1 score (UF1) of 0.7447 and an accuracy (Acc.) of 73.17%, which are 0.031 and 1.8% higher than the previous SOTA, respectively. Furthermore, the model achieved a UF1 of 0.7106 and an Acc. of 70.68% on CASME II, which are comparable with the SOTA.


Introduction
A facial micro-expression (FME) is a faint expression lasting only 0.04-0.2 s that often occurs when people try to hide their true feelings, unlike a macro-expression, which appears on the face for 0.75-2 s. Because of these characteristics, building an FME dataset is costly, and few such datasets exist. In addition, the existing datasets, such as SMIC and CASME II [1,2], were collected in strictly controlled environments and contain only a small number of samples.
Because of this nature of FME, most early studies [1,3-6] used handcrafted features such as local binary patterns on three orthogonal planes and optical flow [7]. However, as deep learning gained prominence in computer vision, many attempts [8-11] have been made to combine deep neural networks with handcrafted features, starting from a study [12] that applied convolutional neural networks (CNNs) with a long short-term memory (LSTM) model to FME recognition.
Recently, deep-learning methods have achieved state-of-the-art (SOTA) results in computer vision using vision transformer models, which are composed of self-attention layers rather than convolutions. Generally, a vision transformer outperforms a CNN when it transfers weights pretrained on a large amount of data rather than being trained from scratch. Interestingly, a recent study [13], which injected CNN-like inductive biases [14] such as locality and a pyramid structure into transformer models, showed performance similar to CNNs when trained from scratch on the ImageNet dataset.
However, to the best of our knowledge, no studies have applied a vision transformer to FME recognition. We assume that the transformer's inductive bias of modeling relations between input patches may be more suitable for FME recognition than that of a CNN, since the pattern of an FME is subtle and appears in only part of each frame of the video. Therefore, we propose an FME recognition model that uses a transformer and optical flow [7], a general feature for representing motion in video, without weights pretrained on a large amount of data. We use optical flow as a motion feature to compensate for the lack of data [15].
Since FME datasets are captured by high-speed cameras, we expected the influence of optical flow in FME recognition to differ from that in general video recognition. Therefore, in the ablation study we conducted various experiments on this influence and empirically found a proper way to use optical flow. As a result, our proposed model achieves the SOTA on SMIC [1] and comparable performance on CASME II [2] (see Table 2).

Prior Works of FME Recognition
Previous studies on FME recognition relied on handcrafted features. They can be summarized as follows: Li et al. used local binary pattern histograms from three orthogonal planes (LBP-TOP) to describe the spatiotemporal local textures of cropped face sequences for feature extraction [1], and interpolated videos using a temporal interpolation model (TIM) [16]. Liong et al. proposed a feature extraction method using bi-weighted oriented optical flow (Bi-WOOF), a variant of optical flow, to encode the essential expressiveness of the apex frame, using only two images per video [4]. Wang et al. used the sparse part of robust PCA to extract the subtle motion information of micro-expressions and classified local texture features of this information extracted by local spatiotemporal directional features [3]. Xiaobai et al. proposed a new unifying framework for ME spotting and recognition [5], in which motion magnification is employed to counter the low intensity of MEs. Yuan et al. designed a hierarchical spatial division scheme for spatiotemporal descriptor extraction to address the difficulty of choosing an ideal division grid for different micro-expression samples [6].
However, as deep learning has become the de facto standard, studies on FME recognition have begun to adopt deep-learning methods: Devangini et al. presented the first work exploring the use of deep learning for the micro-expression recognition task, addressing the lack of data with transfer learning from object- and facial-expression-based CNN models [12]. Li et al. applied a 3D flow-based CNN model whose flows consist of grayscale information and horizontal and vertical optical flow [8]. Xia et al. proposed a deep model composed of several recurrent convolutional layers and exploited two ways of extending the connectivity of convolutional networks across the temporal domain, modeling spatiotemporal deformations from the views of facial appearance and geometry separately [9]. Choi et al. proposed a 2D landmark feature map (LFM) obtained by transforming face landmark information into 2D image information, together with an LFM-based recognition method that integrates a CNN and an LSTM [10]. Xuan et al. developed a multi-task learning (MTL) method that leverages a side task, gender detection; their method, GEME [11], recognizes micro-expressions by incorporating unique gender characteristics and thereby improves recognition accuracy.

Vision Transformer
The transformer is a highly successful model in natural language processing and has recently been applied to computer vision. The best-known vision transformer is ViT [17], which replaces the word tokens of a sentence with patch tokens of an image. The biggest difference between the transformer and the CNN, which has been the mainstream in computer vision, is that it uses self-attention operations instead of convolution operations. Self-attention is an operation that lets each token represent contextual information within the group it belongs to rather than an individual meaning. To do so, self-attention converts each input token into a query, key, and value and calculates a scaled dot product [18] between them. Because self-attention models the relationships of patches within the image they belong to, the vision transformer, unlike a CNN, has a global receptive field. In addition, each query, key, and value depends on the input data, so unlike a CNN, the transformer has the property of adaptive weight aggregation, making it more expressive. However, because of its large capacity, it must be trained with large amounts of data to achieve good performance, and many studies have tried to reach CNN-level performance with the same amount of data. Among them, the Swin transformer [13] used in this paper tackles this problem by borrowing some of the pyramid structure and locality of CNNs.
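As a small, generic illustration of the scaled dot-product attention described above (not code from the paper), the following PyTorch function computes context-aware token features:

```python
import torch

def scaled_dot_product_attention(q: torch.Tensor, k: torch.Tensor,
                                 v: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product attention [18]: each token's output mixes the values
    of all tokens in its group, weighted by query-key similarity, so it carries
    contextual rather than purely individual information. Shapes: (tokens, dim)."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5   # pairwise similarity, scaled
    weights = torch.softmax(scores, dim=-1)       # attention distribution per query
    return weights @ v                            # context-aware token features
```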

Proposed Method
Figure 1 depicts the structure of the proposed method. First, we linearly interpolate a variable-length video x ∈ R^(M×H×W×C) into a fixed-length video x_fix ∈ R^(N×H×W×C). Next, we calculate optical flows x_opt from x_fix and then discard the last frame of x_fix so that the lengths of x_fix and x_opt match, where the sequence length M depends on the data sample, N is the desired number of frames, (H, W) is the resolution of the video, and C is the number of channels. We explain this preprocessing in detail later. Afterward, the grayscale or color frames x_fix and the optical flows x_opt are passed independently through two transformer backbones, the face backbone f and the motion backbone g. These backbones have the same structure but do not share parameters; each extracts a k-dimensional feature vector, z_face or z_motion. These vectors are then combined into z_fusion via concatenation, z_fusion = [z_face; z_motion] ∈ R^(2k). Finally, we push z_fusion into the classifier h, composed of a fully connected layer followed by a softmax layer, to obtain a class score s, and train with the cross-entropy loss L = −Σ_{i=1}^{c} t_i log(s_i) with target t ∈ R^c, where c is the number of target classes and the subscript i denotes the position of an element in the vector.
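The pipeline above can be summarized in a rough PyTorch sketch; the backbone modules stand in for the two Video Swin backbones, and the class name, default feature dimension k = 1024, and three-class setting (e.g., SMIC) are illustrative assumptions rather than the authors' code:

```python
import torch
import torch.nn as nn

class LateFusionFME(nn.Module):
    """Illustrative late-fusion model: two independent backbones, feature
    concatenation, and a fully connected classifier (the softmax is applied
    inside the cross-entropy loss during training)."""

    def __init__(self, face_backbone: nn.Module, motion_backbone: nn.Module,
                 feat_dim: int = 1024, num_classes: int = 3):
        super().__init__()
        self.f = face_backbone        # face backbone f(.), consumes x_fix
        self.g = motion_backbone      # motion backbone g(.), consumes x_opt
        self.h = nn.Linear(2 * feat_dim, num_classes)   # classifier h(.)

    def forward(self, x_fix: torch.Tensor, x_opt: torch.Tensor) -> torch.Tensor:
        z_face = self.f(x_fix)                        # (B, k)
        z_motion = self.g(x_opt)                      # (B, k)
        z_fusion = torch.cat([z_face, z_motion], 1)   # (B, 2k)
        return self.h(z_fusion)                       # class scores s

criterion = nn.CrossEntropyLoss()   # cross-entropy between scores s and target t
```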

Preprocessing
The number of frames must be the same for every video to use it as transformer input. A previous study [16] used TIM to interpolate every video to the same number of frames. However, TIM's assumption that the frames are linearly independent is fragile for FME datasets, because video captured by a high-speed camera (100/200 fps) shows much fainter frame-to-frame changes than normal-speed video (30 fps). In addition, TIM interpolates videos using singular value decomposition, which requires substantial computation because the video frames must be flattened into vectors.
Due to these limitations, we use linear interpolation, a curve-fitting method that uses linear polynomials to construct new data points within the range of a discrete set of known data points. Table 1 shows that TIM requires additional computation and does not yield significant performance improvements compared to linear interpolation. We measured the time required to interpolate a 31-frame video into 8 frames; the interpolation was executed on a 64-core AMD EPYC 7702 CPU with 377 GB of RAM. We then calculated an optical flow feature from the interpolated video. Optical flow is the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between an observer and the scene. We used dense optical flow obtained with the Farneback algorithm [19], a basic method of calculating optical flow. In the proposed method, we set N to a quarter of the average video length of the target dataset.
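A minimal sketch of this preprocessing, using NumPy for frame-axis linear interpolation and OpenCV's Farneback routine for dense optical flow; the numeric parameters passed to cv2.calcOpticalFlowFarneback are common defaults and an assumption, not values reported in the paper:

```python
import cv2
import numpy as np

def interpolate_frames(video: np.ndarray, n: int) -> np.ndarray:
    """Linearly interpolate a (M, H, W) grayscale video to exactly n frames."""
    m = video.shape[0]
    pos = np.linspace(0, m - 1, n)            # fractional source positions
    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, m - 1)
    w = (pos - lo)[:, None, None]             # per-frame interpolation weight
    return (1 - w) * video[lo] + w * video[hi]

def farneback_flows(video: np.ndarray) -> np.ndarray:
    """Dense optical flow between consecutive frames; returns (n-1, H, W, 2)."""
    frames = video.astype(np.uint8)
    flows = [
        # args: prev, next, flow, pyr_scale, levels, winsize, iterations,
        #       poly_n, poly_sigma, flags (commonly used default-like values)
        cv2.calcOpticalFlowFarneback(frames[i], frames[i + 1], None,
                                     0.5, 3, 15, 3, 5, 1.2, 0)
        for i in range(len(frames) - 1)
    ]
    return np.stack(flows)

# Example: interpolate to N = 8 frames (a quarter of SMIC's average length),
# compute flows, then drop the last video frame so both sequences match:
# x_fix = interpolate_frames(raw_video, 8); x_opt = farneback_flows(x_fix)
# x_fix = x_fix[:-1]
```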

Transformer Backbone
The Swin video transformer [20] is a model that performs well without a large amount of data by using Swin blocks composed of window or shifted-window multi-head self-attention, which solves the quadratic complexity problem of the transformer.
In the proposed method, we use Video Swin-B as the backbone. It consists of 4 stages containing {2, 2, 18, 2} Swin blocks, respectively. Since we treat a 2 × 4 × 4 volume of the input as one token, a patch partition layer first reshapes the input video into tokens. A linear embedding layer then projects each token to a dimension of 128, and each stage passes its tokens to the next stage through its Swin blocks.
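The resulting token-grid sizes can be traced with simple arithmetic. The sketch below assumes an 8-frame 224 × 224 input and the standard Video Swin design in which patch merging between stages halves the spatial resolution and doubles the channel dimension; these intermediate shapes are inferred from that design, not quoted from the paper:

```python
# Token-grid arithmetic for Video Swin-B with an assumed 8 x 224 x 224 input.
T, H, W, dim = 8 // 2, 224 // 4, 224 // 4, 128   # after 2 x 4 x 4 patch partition
depths = [2, 2, 18, 2]                           # Swin blocks per stage

for stage, d in enumerate(depths, start=1):
    print(f"stage {stage}: {d:2d} Swin blocks, tokens {T} x {H} x {W}, dim {dim}")
    if stage < len(depths):                      # patch merging between stages
        H, W, dim = H // 2, W // 2, dim * 2
```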
We expect the face backbone to model the spatial and temporal relationships between patches across all frames for FME recognition, and the motion backbone to do the same for motion information using the optical flows.

Late Fusion
In prior work [21], researchers defined Late Fusion (LF) as extracting the features of each frame through one shared 2D CNN, which cannot model temporal relations, and combining the features before classification. In contrast to LF, they defined early fusion (EF) as extracting a single combined feature of all frames through one 3D CNN.
These definitions of LF and EF differ slightly from ours, but we keep the names because the position at which the features are combined is the same. In our work, LF denotes extracting two features individually from two different inputs, the video and the optical flows, through two different backbones and combining those features before classification. EF denotes extracting one combined feature through one backbone from a single input in which the video and optical flows are concatenated channel-wise, and then classifying that feature into emotion classes.
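The contrast between the two strategies can be written down in a short PyTorch sketch; the backbone and classifier modules are placeholders, so this only shows where the concatenation happens, not the authors' implementation:

```python
import torch
import torch.nn as nn

def early_fusion(backbone: nn.Module, classifier: nn.Module,
                 video: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    # EF: concatenate video and optical flow channel-wise, then use ONE backbone.
    x = torch.cat([video, flow], dim=1)           # (B, C_video + 2, T, H, W)
    return classifier(backbone(x))

def late_fusion(face_bb: nn.Module, motion_bb: nn.Module, classifier: nn.Module,
                video: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    # LF: two separate backbones, concatenate the FEATURES before classification.
    z = torch.cat([face_bb(video), motion_bb(flow)], dim=1)
    return classifier(z)
```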
In the proposed method, we use LF even though EF requires less computation thanks to its single shared backbone. We expected that extracting a feature from the concatenated input would degrade performance because of the dependency between the two inputs, since the optical flows are calculated from the video itself.

Dataset and Metrics for the Whole Experiment
We used the SMIC [1] and CASME II [2] datasets. Since these datasets have an imbalanced distribution of emotion labels, we used three metrics to reduce the resulting bias: accuracy (Acc), unweighted average recall (UAR) [22], and unweighted F1 score (UF1) [23]:

UAR = (1/C) Σ_c (TP_c / N_c),   UF1 = (1/C) Σ_c 2·TP_c / (2·TP_c + FP_c + FN_c),

where C is the number of classes and N_c is the number of samples of each class. TP, FN, and FP denote the true positives, false negatives, and false positives, respectively.
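For completeness, a small NumPy helper implementing these class-averaged metrics (an illustrative sketch, not the evaluation code used in the paper):

```python
import numpy as np

def uar_uf1(y_true: np.ndarray, y_pred: np.ndarray, num_classes: int):
    """Unweighted average recall and unweighted F1, averaged over classes."""
    recalls, f1s = [], []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        n_c = np.sum(y_true == c)
        recalls.append(tp / n_c if n_c else 0.0)
        f1s.append(2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0)
    return float(np.mean(recalls)), float(np.mean(f1s))
```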

Training Scheme
All models were trained on a single GPU (an RTX A6000 with 48 GB of memory) with one sample per GPU, i.e., a batch size of 1. For backpropagation, we used the AdamW [24] optimizer with betas (0.9, 0.999) and a weight decay of 0.05. The initial learning rate is 10^-5.
Since each video shows the upper body, we cropped the face region using a face detection model [25] and resized it to 224 × 224. In addition, an FME appears as a very faint pattern that can easily be destroyed by strong transformations, so we used only simple augmentations: random scaling in the range [0.9, 1.1], random rotation in [−10°, 10°], and horizontal flipping. Furthermore, since the class distribution of each dataset is imbalanced, a class-balancing sampler was applied before data augmentation so that a similar amount of data per class is seen in each epoch. We expected this sampler to reduce the bias of the dataset.
We trained the face backbone f(·), the motion backbone g(·), and the classifier h(·) on these datasets with the proposed method, and used the same scheme to train the models in the ablation study.
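A condensed sketch of this training setup in PyTorch; the model and label list are placeholders, only the hyper-parameters stated above (AdamW betas, weight decay, learning rate, batch size 1, and the scaling/rotation/flip ranges) are taken from the text, and the torchvision transforms and weighted sampler are assumed stand-ins for the augmentation and class-balancing sampler:

```python
import torch
from torch.utils.data import WeightedRandomSampler
from torchvision import transforms

model = torch.nn.Linear(8, 3)     # placeholder for the full FME model
labels = [0, 0, 1, 2, 2, 2]       # placeholder per-sample class indices

# AdamW with betas (0.9, 0.999), weight decay 0.05, initial lr 1e-5.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5,
                              betas=(0.9, 0.999), weight_decay=0.05)

# Simple per-frame augmentations: scaling in [0.9, 1.1], rotation in
# [-10, 10] degrees, and horizontal flip.
augment = transforms.Compose([
    transforms.RandomAffine(degrees=10, scale=(0.9, 1.1)),
    transforms.RandomHorizontalFlip(),
])

# Class-balancing sampler: draw each sample with probability inversely
# proportional to its class frequency (used with batch size 1).
counts = torch.bincount(torch.tensor(labels))
weights = (1.0 / counts.float())[torch.tensor(labels)]
sampler = WeightedRandomSampler(weights, num_samples=len(labels))
```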

Evaluation Protocol
Each person expresses emotions on their face differently. Therefore, to avoid person-dependent issues, the performance of the model is evaluated with leave-one-subject-out (LOSO) cross-validation: one subject is used as the validation set and the remaining subjects as the training set, and the training and validation process is repeated as many times as there are subjects in the dataset.

Table 2 shows the performance of the proposed method compared to other studies on SMIC and CASME II. The numerical values of each method are taken from a survey [26] or from their own papers. Our method achieves the best average accuracy and UF1 on SMIC and shows comparable performance on CASME II. On SMIC, the proposed method improves on the previous SOTA by about 0.031 in UF1 and 1.8% in accuracy. On CASME II, the proposed method does not outperform the previous SOTA, but whereas the previous SOTA methods, LFM and GEME, demand additional complex procedures for model learning, our model achieves comparable performance without them.

Figure 2 shows the confusion matrices of the proposed method in the validation phase. On SMIC, the proposed model has difficulty classifying 'negative', especially misclassifying 'negative' as 'positive'. This is interesting because 'negative' has the largest number of samples and other studies classify it relatively well; we believe these results are due to the class-balancing sampler. On CASME II, the proposed method is relatively poor at classifying 'others' and 'disgust', which have the most data samples.
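Returning to the LOSO protocol described at the beginning of this section, a minimal sketch of the split generation (a hypothetical helper, not the authors' code); per-fold predictions can then be aggregated before computing UF1 and UAR:

```python
import numpy as np

def loso_splits(subject_ids):
    """Yield (subject, train_idx, val_idx), holding out one subject at a time."""
    subject_ids = np.asarray(subject_ids)
    for subject in np.unique(subject_ids):
        val_idx = np.where(subject_ids == subject)[0]
        train_idx = np.where(subject_ids != subject)[0]
        yield subject, train_idx, val_idx

# Usage sketch: for subject, tr, va in loso_splits(subjects): train on tr, evaluate on va.
```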

Ablation Study
Since we perform FME recognition using a transformer, we need to separate the influence of optical flow from the influence of the transformer itself. Therefore, we compare the performance of CNN-based [27], transformer-based [28], and CNN-like transformer-based [20] models. The CNN-based model represents the locality inductive bias, the transformer-based model represents the inductive bias of a global receptive field, and the CNN-like transformer-based model represents an intermediate inductive bias between them. Inductive biases, broadly speaking, encourage the learning algorithm to prioritize solutions with certain properties, so comparing these three models gives a clue as to which inductive bias is more suitable for FME recognition. Furthermore, since FME datasets consist of high-speed videos (100 Hz or 200 Hz), they differ from regular videos with large changes between frames; therefore, we empirically investigate the proper way to use optical flow.
In Ablations 5.1 and 5.2, we do not use the LF method because it requires too much computation. For model details, see Table 3. We report the amount of computation of each model in multiply-accumulate operations (MACs), where one MAC computes the product of two numbers and adds that product to an accumulator.

Influence of the Optical Flow
We examine whether the motion feature yields significant improvements and analyze the effect of color information, which is often considered useless in FME recognition because it is subject-dependent. To minimize the loss of color information, we use a video length of 32, close to the average number of frames in SMIC (33.7); this makes the effect of optical flow less pronounced. Table 4 shows that the models incorporating optical flow outperformed those using only grayscale or color information. Interestingly, 3D-ResNeXt-101, which consists only of convolutions, performs best when using motion information alone, whereas the video transformer models perform better when image information is used as well, contrary to the common belief that color information is meaningless. Therefore, using motion information is effective, and for the video transformers it is appropriate to use image and motion information together. (In Table 4, bold indicates the best performance of each model in each metric.)

Investigation of the Proper Interpolated Length
Since optical flow represents the motion of objects between two frames, meaningful features may not be extracted from video captured with a high-speed camera. The proposed method therefore interpolates each sample to half the average number of frames in the dataset to extract meaningful motion information. However, it is unknown whether half the average is appropriate in most cases, so it is essential to find an appropriate number of frames. In this experiment, we compared the average, half of the average, and a quarter of the average number of frames. Based on the results of the previous comparison, we do not consider using only color or grayscale information.
In Table 5, using a video length of 32 generally shows similar or worse performance than using 8 or 16. We attribute these results to two reasons. First, the number of samples is too small to use long frame sequences, so the model overfits: in general, as the dimension of the input grows, the number of training samples should grow in proportion, but FME datasets contain few samples, so the model is easily overfitted. Second, motion information is more useful than color information, as confirmed in Ablation 5.1; increasing the interpolation length reduces the loss of color information but also reduces the differences between consecutive frames, which makes the optical flow less useful and lowers model performance. Thus, it is desirable to use a smaller video length.

Effect of the Fusion Location
The proposed model extracts two features individually from grayscale information and optical flow, but it is unknown whether this yields an improvement, because there is no prior research on such fusion with transformers. Therefore, we compare the performance of EF and LF.
From Table 6, it is difficult to determine which is better. LF needs one more backbone, which requires roughly twice as much computation as EF. In addition, except for Swin, the highest performance of each model comes from EF, so EF could be considered better. However, since the highest performance among all models comes from LF, it is still difficult to determine superiority, and the choice remains an open problem.

Conclusions
Recently, various studies have been proposed for FME recognition, but none have used transformers. In this research, we examined whether a transformer model is suitable for FME recognition. Since transformers generally require a large amount of data and no sufficiently large dataset exists for FME recognition, our main goal was to train the transformer successfully despite the limited data. We achieved this by combining optical flow, a feature mainly used in video processing models, with LF and the transformer. As a result, our model sets the SOTA on SMIC and achieves comparable performance on CASME II, even though we do not use methods specialized for FME recognition as other studies do.