1. Introduction
Facial expressions are a channel of human emotion complementary to speech, and their recognition is an active direction in affective computing. In addition to the macro-expressions (MaEs) that occur frequently, micro-expressions can appear in certain special situations. Micro-expressions (MEs), which arise in high-stakes situations, are involuntary facial movements that occur spontaneously when an individual experiences an emotion but tries to suppress the facial expression [1]. Therefore, MEs can be applied in a wide range of fields. For example, when assessing a lecture session based on students' emotions, a transitory ME can reveal emotions a student would normally hide. In business situations, a salesperson can use MEs to estimate a client's real reaction while showcasing goods. Border control agents can detect unusual patterns of behavior by asking questions; the U.S. Transportation Security Administration developed a screening program whereby airport staff are trained to monitor suspicious behaviors by analyzing MEs and conversational cues [2]. In the medical field, especially in psychotherapy, doctors can use MEs as a guide to how patients really feel [3]. In brief, ME analysis plays a vital role in revealing people's hidden emotions in various situations.
The duration of an ME is very short (generally no more than 500 ms) [4], and the intensities of the involved muscle movements are subtle [5]. Automatic ME computing and analysis is a young field within facial analysis that has attracted great interest since 2014 [6,7,8,9]. Generally speaking, facial ME analysis comprises two tasks: recognition and spotting. ME recognition determines the emotional states underlying MEs, while spotting locates the temporal positions of MEs in a video. Based on our survey, most ME research focuses on the recognition task, and ME spotting has received limited attention. This work aims to develop an ME spotting approach. In real-world applications, MEs, MaEs and non-expression intervals may coexist in long video sequences; thus, the prerequisite for identifying the categories of expression clips is to spot them accurately. As a video usually contains several expression clips, the purpose of spotting is to predict the onset and offset frames such that the intersection over union (IoU) between ground truths (GTs) and proposals is no less than a specified threshold. The principle of ME spotting with four scales is visualized in Figure 1. A clip of a MaE or ME captures the complete process of a certain emotion and typically contains an onset frame, an apex frame and an offset frame. The onset frame is the starting point of an expression, the apex frame is the peak of the facial muscle motion, and the offset frame marks the point at which the face returns to its initial state.
Recent advances in deep learning have gained widespread popularity in the recognition task, whereas limited effort has been devoted to the spotting task, especially on long "untrimmed" videos. Owing to the similarity between temporal action spotting and object detection, many temporal action detection methods adopt a framework similar to object detectors. One-stage detection for long videos was first proposed in SSAD [10], which directly predicts the temporal boundaries and confidence scores of multiple categories. However, most long-video detection methods implement two-stage detectors, such as Faster R-CNN, which first generate proposals and then identify their categories. Considering the small scale of ME data and the correspondingly low computational cost, this work concentrates on a two-stage framework for ME localization in long videos.
At present, ME spotting in long videos remains a worthwhile research topic that attracts considerable attention. However, several key challenges remain unresolved.
Existing ME spotting samples are very limited; without abundant data, a deep neural network quickly exhausts the training data and over-absorbs information from the training set, resulting in degraded performance on the test set.
Due to the subtle movements of MEs, it is difficult to detect MEs in their entirety in long video sequences; in other words, it is difficult to determine the boundaries of MEs.
The temporal extent of expressions varies dramatically compared to the sizes of objects in an image (from 0.04 s to 4 s). This large variation in duration also makes it very challenging to locate MEs and MaEs.
To improve the performance of ME spotting in long videos with respect to the three aforementioned challenges, we propose an adaptive enhanced micro-expression detection network (AEM-Net) built on a specific prior of MEs. The main contributions of this paper are summarized as follows:
We propose a multi-stage channel feature extraction module, named Prelayer, which fuses the optical flow modality with RGB information and alleviates the problem of insufficient samples.
A multi-scale network fusing features at multiple scales with an anchor-free-based mechanism was developed to locate each ME's boundaries accurately.
A new post-processing module for MEs, named ME-NMS (non-maximum suppression), is utilized to enhance the detection accuracy of extremely short fragments.
This work is an extended version of our conference paper in the sixteenth Chinese Conference on Biometric Recognition (CCBR) [
11], in which the multi-scale feature fusion network was shown to improve the spotting performance. Our code for this work is available at
https://github.com/sijiacheng-0827/AEM-Net (accessed on 16 October 2022).
3. Our Method
This section introduces AEM-Net in detail. First, the architecture of AEM-Net is outlined; then, the corresponding modules of the model are described.
3.1. Network Architecture
AEM-Net mainly consists of four parts: the multi-stage channel feature extraction module, the multi-scale fused-feature localization network, the position-suppression and adaptive deep feature enhancement module, and the post-processing module (ME-NMS). The entire network structure is shown in Figure 2.

Yu et al. [23] used the I3D model to extract micro-expression features in the LSSNet framework. By extending the convolutional and pooling kernels of an image classification network from 2D to 3D, effective spatio-temporal features can be extracted. However, the optical flow information and the RGB information extracted by the I3D model do not interact sufficiently. To extract richer and more robust features, a multi-stage channel feature extraction module named Prelayer was introduced. After obtaining features that contain rich information, a sliding-window operation is performed on these feature sequences at four scales, with the downsampling step of each scale set to 2; the feature maps of the next layer, which have larger receptive fields, are upsampled and used to guide feature prediction in the current layer.

The number of frames in a micro-expression clip usually falls within a relatively narrow range, whereas the proposals obtained by the sliding-window operation may contain too many or too few frames, which seriously interferes with prediction and degrades detection performance. To suppress too-long and too-short proposals, and to make the network pay more attention to tiny motion features, a spatio-temporal attention module is set up. The spatial attention mechanism screens the proposals within a certain range so as to improve the feature expression of crucial regions. The channel attention mechanism learns the importance of each feature channel automatically, adaptively strengthening essential features and suppressing non-important ones. Finally, the ME-NMS module is proposed to search for the optimal candidate segment of each clip and suppress falsely detected segments.
Given a video containing MaEs, MEs or both, AEM-Net aims to locate all the MaE and ME instances appropriately. Suppose all expressions in this video are denoted $\Phi = \{\varphi_i = (t_{s,i}, t_{e,i}, c_i)\}_{i=1}^{N}$, where $t_{s,i}$ and $t_{e,i}$ represent the onset and offset frames of the $i$-th expression, respectively, and $c_i$ indicates its category label.
3.1.1. Multi-Stage Channel Feature Extraction
For a long ME video, LSSNet uses the dual-stream inflated 3D convolutional network (I3D) [28] to extract features from both the original images and the optical flow. On the one hand, the convolutional and pooling kernels of the image classification network are extended from 2D to 3D so that spatio-temporal features can be learned directly by the 3D convolutional network. On the other hand, ME spotting performance is improved by fusing optical flow features. The features obtained through the I3D network contain some micro-expression information; however, the model simply concatenates the spatio-temporal and optical flow features. This insufficient fusion leads to the loss of much useful information and greatly increases the probability of negative samples. In order to extract richer and more robust features, a multi-stage channel feature extraction module was designed. The detailed structure of the module is shown in Figure 3.
Algorithm 1 ME-NMS calculation procedure.
Input: B, the list of initial detection boxes (candidate clips); S, the corresponding detection scores; N_t, the NMS threshold; M, the duration of a candidate clip; T, the type of the candidate set (macro-expression or micro-expression).
Output: K, the list of final detections kept after NMS.
K ← [ ]
while B is not empty do
  m ← arg max S; remove b_m from B and s_m from S
  if T is micro-expression and M(b_m) does not satisfy the ME duration threshold then
    discard b_m
  else
    K ← K ∪ {b_m}
    for each remaining b_i in B do
      if IoU(b_m, b_i) ≥ N_t then remove b_i from B and s_i from S
    end for
  end if
end while
return K
3.2. Modules
Specifically, three one-dimensional convolutional layers with different numbers of convolutional kernels are used to convolve the spatio-temporal and optical flow features from the I3D model, yielding feature sequences with different channel numbers. The difference in channel numbers makes each feature sequence focus on different information. Because the parameter error of a convolutional layer causes an offset in the estimated mean value, a max pooling layer is placed after each convolutional layer to reduce this offset error while retaining more facial texture information. The feature sequences produced by the max pooling operations are then fused to obtain a richer feature representation.
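As an illustration, the following is a minimal PyTorch sketch of this multi-branch design; the channel widths, kernel sizes and the use of concatenation as the fusion operator are assumptions made for the example rather than values specified in this section.

import torch
import torch.nn as nn

class Prelayer(nn.Module):
    """Multi-stage channel feature extraction (illustrative sketch).

    Assumed setup: the input is the concatenated I3D RGB + optical-flow feature
    sequence of shape (B, C_in, T); the three branches use 64/128/256 kernels;
    fusion is channel concatenation.
    """
    def __init__(self, c_in=2048, branch_channels=(64, 128, 256)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(c_in, c, kernel_size=3, padding=1),       # 1D convolution over time
                nn.ReLU(inplace=True),
                nn.MaxPool1d(kernel_size=3, stride=1, padding=1),   # pooling keeps the temporal length
            )
            for c in branch_channels
        ])

    def forward(self, x):  # x: (B, C_in, T)
        # Each branch attends to different channel statistics; fuse by concatenation.
        return torch.cat([branch(x) for branch in self.branches], dim=1)

feats = torch.randn(2, 2048, 256)          # two windows of 256 feature steps
print(Prelayer()(feats).shape)             # torch.Size([2, 448, 256])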
3.2.1. Clip Proposal Network
In the field of object detection, researchers use segmentation algorithms to generate proposal regions; micro-expression spotting, however, differs from object detection.
Figure 4 presents the effects of different clip lengths under the anchor-based and anchor-free methods, measured by mAP (mean average precision) [29].
For fragments with durations in the range [2.5, 4.9], the anchor-based detector head gives better results, whereas for durations in [0, 2.5] or [4.9, 6.9] the anchor-free detector head performs better. As the duration of a micro-expression sequence is generally within [0, 0.5] and that of a macro-expression sequence within [0.5, 4], we combine the two detector heads as the detector in our method. The IoU thresholds were set from 0.1 to 0.7 in steps of 0.1, and the performance comparison focuses on the score at threshold 0.5. It can be seen that the anchor-free mechanism is more suitable for very short clips, whereas the anchor-based mechanism is more suitable for medium-length clips. For localizing clip sets of mixed lengths, an effective combination of the two detectors must be considered. Hence, we follow the method of generating proposal regions in LSSNet; specifically, both anchor-based and anchor-free mechanisms are employed to generate proposals. The detailed clip proposal network is shown in Figure 5.
The lengths of MaEs and MEs in the datasets range from 3 to 117 frames. To be exact, the lengths of MEs range from 8 to 16 frames, and the lengths of MaEs range from 3 to 117 frames. In view of this property of MaEs and MEs, it is necessary to ensure that the generated proposals cover lengths between 2 and 128. However, the generated feature sequence becomes too large when the proposal scale is set to two, leading to considerable computational complexity, so the scale of two is discarded. In addition, five ratios [0.5, 0.75, 1, 1.5, 2] are set for the anchors of each scale, and four scales are finally selected as the anchor template. A convolutional layer with stride 2 is used to reduce the temporal length at each hierarchy. Subsequently, the feature maps from the upper layer are up-sampled with a ratio of two. We also take a feature map from earlier in the network and merge it with the up-sampled features using an adding operator. This enables the model to obtain more meaningful semantic information from the up-sampled features and finer-grained information from the earlier feature maps. In addition, several convolutional layers are embedded to eventually predict a similar tensor. Suppose the regression output of layer $i$ is $r_i$, $R$ denotes the set of anchor ratios, and the frame number of the sliding window is $w$; the number of anchors at each layer is then determined by the temporal length of that layer's feature map and the number of ratios.
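As a minimal sketch, assuming the standard convention that each temporal position of layer $i$ carries one anchor per ratio, the anchor count can be written as

$N_i = \left\lfloor w / s_i \right\rfloor \cdot |R|,$

where $s_i$ is the accumulated stride (temporal downsampling factor) of layer $i$ and $|R| = 5$ is the number of anchor ratios.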
The anchor-based module selects the closest-matching anchor and regresses the boundary step by step. This mechanism enhances the ability to handle action instances of various lengths. Unlike the anchor-based method, the proposed anchor-free module regresses the distances from the center of an action instance to its boundaries. Compared with previous prediction methods over action proposals, anchor-free regression is more flexible and robust, especially for action instances with extremely short or long durations. In more detail, the action instances are divided into several levels according to their temporal lengths. After mapping a location back onto the input sequence, if it falls within an action's extent, it is labeled a foreground point; otherwise, it is a background point. The anchor-free module predicts a classification score to determine whether a point falls into an action instance, along with its category label.
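As a simple illustration of this distance regression, the following sketch decodes a predicted pair of boundary distances at a foreground point into a temporal segment; the function and variable names are illustrative rather than taken from the model definition.

def decode_anchor_free(center_frame, dist_left, dist_right):
    """Decode an anchor-free prediction into a temporal segment.

    center_frame: frame index of the foreground point (mapped back to the input).
    dist_left / dist_right: regressed distances to the onset and offset boundaries.
    """
    onset = center_frame - dist_left
    offset = center_frame + dist_right
    return onset, offset

# Example: a foreground point at frame 120 predicting distances 5 and 7
print(decode_anchor_free(120, 5, 7))  # (115, 127)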
It can be found that the anchor-free module performs well for relatively short or long action instances, whereas the anchor-based module is more suitable for medium-length action instances. In essence, this stems from the inherent flexibility of the anchor-free module and from the stability of the classification and detection process adopted by the anchor-based module. Since the duration of a micro-expression is short, its motion is small and easily disturbed by other facial movements, whereas a macro-expression lasts longer and involves larger motion. Thus, we use a detector that combines the anchor-based and anchor-free modules, and the weights of the two detector heads are adjusted according to their respective detection abilities.
3.2.2. Spatio-Temporal Attention Block
In order to accurately spot expression intervals of different durations (0.04–4 s), three down-sampling layers, i.e., convolutions over four scales, were designed to cover the large span of expression durations. However, this can lead to data redundancy and an increase in negative samples: fewer down-sampling operations yield fewer scales and more proposals, so the attention paid to short-duration samples is weakened. An attention module is therefore introduced to alleviate this situation. Firstly, after the feature processing network, some generated proposals are shorter than the corresponding MaE clips or longer than the whole video; thus, a location-suppression module is built to suppress too-long and too-short proposals. Secondly, a channel-wise feature recalibration module is applied so that neurons adaptively adjust their local receptive field sizes. The detail of the channel attention block is shown in Figure 6. This increases the network's attention to short-duration samples and enhances the performance of ME spotting.
For the implementation of the position-suppression part, assume that the length of the sliding window is $l$, the sampling frequency of the sliding window is $f$, the stride of the current layer is $s$, and the input of the sliding windows is $X \in \mathbb{R}^{B \times C \times T}$, where $B$ is the batch size, $C$ is the channel number and $T$ is the temporal length of the current proposals. A position enhancer $P$ is calculated from the scale variation and activated with a softmax, serving as a weight for proposals of different lengths. The network structure is shown in Figure 7.
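As an illustration of how such a length-based weighting can be realized, the following sketch maps each proposal's length to a logit and normalizes the logits with a softmax; the linear scorer, the normalization by the maximum length and all variable names are assumptions for the example, with the exact parameterization given in Figure 7.

import torch
import torch.nn as nn

class PositionSuppression(nn.Module):
    """Length-aware proposal weighting (illustrative sketch)."""
    def __init__(self):
        super().__init__()
        self.scorer = nn.Linear(1, 1)  # maps a (normalized) proposal length to a logit

    def forward(self, feats, lengths):
        # feats: (N, C) proposal features; lengths: (N,) proposal lengths in frames
        logits = self.scorer(lengths.float().unsqueeze(1) / lengths.max())  # (N, 1)
        weights = torch.softmax(logits.squeeze(1), dim=0)                   # (N,) weights over proposals
        return feats * weights.unsqueeze(1)                                 # re-weight each proposal

props = torch.randn(6, 256)
lens = torch.tensor([4, 8, 16, 32, 64, 300])   # proposal lengths in frames
print(PositionSuppression()(props, lens).shape)  # torch.Size([6, 256])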
At the same time, to exploit the relations among channels, the model recalibrates channel-wise feature responses by explicitly modeling the inter-dependencies between channels. Each weight value corresponds to one channel; the correlation among channels is captured by a bottleneck structure that outputs weights with the same dimensionality as the input features. Concretely, the global spatial information is first squeezed into a channel descriptor $z$ by average pooling each channel to a single value; $z$ is then recalibrated by an excitation function $F_{ex}$ consisting of two fully connected layers to obtain the weight matrix. Finally, the weights are assigned to the sequence features. The channel weight is denoted $s$, and $x$ is the original feature sequence; $z$ is the compression of the global spatial information into the channel descriptor. Global average pooling is used to generate channel interaction by squeezing the information in the time domain to a single point, so that subsequent operations can take advantage of the aggregated information. Then, a simple gating mechanism with a sigmoid activation $\sigma$ follows. Layer normalization is used to balance the data distribution, and an additional gain in performance is obtained by adding dropout to the convolutional layers.
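A compact sketch of this squeeze-and-excitation style recalibration for 1D feature sequences follows; the reduction ratio of 4 and the placement of the layer normalization are assumptions made for the example.

import torch
import torch.nn as nn

class ChannelRecalibration(nn.Module):
    """Channel attention for (B, C, T) feature sequences (illustrative sketch)."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # bottleneck FC 1
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),  # FC 2 restores dimensionality
            nn.Sigmoid(),                                # gating with sigmoid activation
        )

    def forward(self, x):                 # x: (B, C, T)
        z = x.mean(dim=2)                 # squeeze: global average pooling over time -> (B, C)
        s = self.excite(self.norm(z))     # excitation: per-channel weights in (0, 1)
        return x * s.unsqueeze(2)         # recalibrate each channel of the sequence

seq = torch.randn(2, 448, 64)
print(ChannelRecalibration(448)(seq).shape)  # torch.Size([2, 448, 64])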
3.2.3. ME-NMS Module
NMS is the abbreviation for non-maximum suppression. In traditional object detection, when two boxes overlap closely, the box with the higher score is kept as the benchmark. Specifically, the IoU (intersection over union) of two boxes is calculated by Equation (10).
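For reference, the temporal IoU of two segments $A = [t_{s,A}, t_{e,A}]$ and $B = [t_{s,B}, t_{e,B}]$ takes the standard form

$\mathrm{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{\max\bigl(0,\ \min(t_{e,A}, t_{e,B}) - \max(t_{s,A}, t_{s,B})\bigr)}{\max(t_{e,A}, t_{e,B}) - \min(t_{s,A}, t_{s,B})}.$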
If the IoU exceeds the threshold, the box with the smaller score is suppressed; as there is no need to output two similar boxes, only the one with the larger score is kept. In this paper, in order to reduce category decision errors, an ME-NMS module is proposed for ME spotting. In the post-processing stage, the box with the highest score is taken as the candidate object, and the remaining candidate boxes are sorted by score from high to low; the current candidate segment is then intersected with the union of the $n$ best candidates already kept, and those whose scores fall below the threshold $N_t$ are discarded ($N_t$ is set to 0.2). As a result, falsely detected samples can be filtered out. Because the micro-expression duration is extremely short and susceptible to interference, an additional ME threshold is set based on the short-duration prior of MEs. Through this operation, our multi-scale localization process allows a more accurate location of MEs. Given a video sequence, the anchor-based spotting module predicts anchors at each temporal position. By contrast, the anchor-free mechanism considers a temporal location $t$ on feature map $F_i$; with the accumulated stride of the current layer denoted $s_i$, the location can be mapped back onto the input sequence.
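As a minimal sketch, assuming the common convention of centering each feature-map location within its stride, this mapping takes the form

$t' = \left\lfloor s_i / 2 \right\rfloor + t \cdot s_i,$

where $t$ is the temporal index on feature map $F_i$, $s_i$ is the accumulated stride of the current layer, and $t'$ is the corresponding frame index in the input sequence.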
Since the lengths of micro-expressions in the dataset range from 3 to 16 frames, and considering the mapping mechanism, a threshold is set for micro-expression localization during the post-processing stage to filter candidate clips; the threshold is determined by $L_{\max}$, the number of frames in the longest micro-expression clip, and $w_2$, the sliding-window size of layer 2. When the duration of an obtained candidate segment is larger than this threshold, the candidate segment is taken as a final result. The algorithm implementation is elaborated in Algorithm 1.
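To make the post-processing concrete, a minimal Python sketch of the ME-NMS procedure follows; the function names and the generic duration_ok predicate, which stands in for the ME duration threshold described above, are illustrative assumptions.

def temporal_iou(a, b):
    """IoU of two temporal segments a = (onset, offset), b = (onset, offset)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def me_nms(segments, scores, nms_thresh=0.2, duration_ok=lambda d: True):
    """Keep high-scoring, non-overlapping segments; candidates whose duration fails
    the ME duration check are discarded (a sketch of the procedure in Algorithm 1)."""
    order = sorted(range(len(segments)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        seg = segments[i]
        if not duration_ok(seg[1] - seg[0]):
            continue  # enforce the duration prior of MEs
        if all(temporal_iou(seg, k) < nms_thresh for k in kept):
            kept.append(seg)  # keep only segments that do not overlap the kept ones
    return kept

# Example with hypothetical candidates given as (onset, offset) frame pairs
print(me_nms([(10, 22), (12, 24), (30, 40)], [0.9, 0.8, 0.7]))  # [(10, 22), (30, 40)]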
5. Conclusions
In this paper, a Prelayer module was developed for ME spatio-temporal feature fusion, aiming to improve the processing of micro-motion samples. On top of this, the anchor-free mechanism is combined with the anchor-based mechanism through fused multi-scale localization to accurately locate the onset and offset frames of expressions, avoiding prior biases regarding localization and duration. Moreover, the ME-NMS module is integrated to further reduce false-positive micro-expression samples based on the duration prior. In addition, channel attention and spatial attention jointly enhance the adaptive learning of key regions and deep features of the network. Comprehensive experiments demonstrated the effectiveness of our method: AEM-Net is capable of localizing ME and MaE instances of various durations and achieved improved performance on the evaluated long-video datasets, including SAMM-LV. However, due to the brief and subtle nature of MEs, the F1-score of our ME spotting is not yet satisfactory for real-world applications. Powerful feature learning and action localization in an end-to-end ME spotting framework are promising directions for future research. In addition, we will also study the bioinformatics principles of micro-expressions and further improve ME spotting performance.