A Spatio-Temporal Spotting Network with Sliding Windows for Micro-Expression Detection

Abstract: Micro-expressions reveal underlying emotions and are widely applied in political psychology, lie detection, law enforcement and medical care. Micro-expression spotting aims to detect the temporal locations of facial expressions in video sequences and is a crucial step in micro-expression recognition. In this study, the problem of micro-expression spotting is formulated as per-frame micro-expression classification. We propose an effective spotting model with sliding windows called the spatio-temporal spotting network. The method uses a sliding window detection mechanism and combines the spatial features of local key frames with global temporal features to perform micro-expression spotting. Experiments are conducted on the CAS(ME)² database and the SAMM Long Videos database, and the results demonstrate that the proposed method outperforms the state-of-the-art method by 30.58% on CAS(ME)² and 23.98% on SAMM Long Videos in terms of overall F-scores.


Introduction
Macro-expressions are observable with the naked eye but can be deceptive [1], while micro-expressions [2,3] are short-lived, unconscious expressions [4,5] that are harder to spot and recognize. Micro-expressions are more reliable indicators of psychological states and are more important for understanding people's real emotions. They are widely applied in political psychology [6], lie detection [7], law enforcement and medical care [8].
Research on micro-expression analysis primarily focuses on two areas: micro-expression spotting, which involves identifying the onset and apex frames of micro-expressions in videos, and micro-expression recognition, which predicts the category of the micro-expression. Deep learning methods have wide and valuable applications in artificial intelligence [9–11], and advances in deep models have contributed to the rapid development of micro-expression recognition technology. However, micro-expression spotting, particularly in unprocessed raw videos, remains challenging. In 2020, the Third Facial Micro-Expression Grand Challenge (MEGC2020) [12] introduced a new challenge to spot both macro- and micro-expressions in long videos, drawing the attention of researchers to the spotting task.
Micro-expression spotting aims to automatically detect the start and end frames of micro-expressions in a video, which represent the time interval of the micro-expression action. Traditional machine learning methods rely on manually crafted features. Various feature descriptors are employed, including spatial features such as local binary patterns (LBPs) [13], histograms of oriented gradients (HOG) [14], integral projection [15] and Riesz pyramid features [16]; temporal features such as optical strain [17] and optical flow [18–21]; and features extracted in the frequency domain [22]. Temporal features, such as optical flow vectors, have proven to be highly effective for micro-expression spotting. For example, Shreve et al. [23] partitioned the face into multiple regions, including the forehead, eyes, cheeks and mouth. They employed dense optical flow to extract image features and used central difference methods to compute the optical strain magnitude within each region. By comparing these magnitudes with predefined thresholds, they achieved micro-expression detection. In 2011, Shreve et al. [24] combined existing macro-expression and micro-expression databases and employed optical flow for detection. However, both of these approaches focused on non-spontaneous micro-expressions, which were elicited through experimental instructions. Such micro-expression data contained virtually no interference, making the expressions relatively straightforward to detect and observe.
These features are further processed by using various machine learning methods, also called shallow learning methods, such as the chi-square distances of LBP features [25] and Euclidean distance ratio variations in facial landmarks [26]. For example, optical flow vectors were extracted and video segments without micro-expressions were removed by using heuristics [18]. In addition, not all frames in a video contribute equally to the spotting task. Feature difference (FD)-based methods [13] usually compute feature differences between the first and last frames in the temporal window instead of using the whole sequence. The main idea of using an FD is to search for distinctive variations within temporal windows.
Deep-learning-based methods have become mainstream solutions in many research fields, particularly in computer vision. Researchers have also applied these methods to micro-expression spotting. For instance, a convolutional neural network (CNN) has been proposed to detect apex frames [27]. Neutral frames and apex frames were first classified by a CNN architecture, and feature engineering methods were introduced to merge nearby detection samples.
Combined networks of spatial and temporal deep models have also been utilized. For example, the framework proposed in [28] consisted of two networks: a spatial network and a temporal network. The spatial network generated spatial feature maps of two adjacent frames, based on which a contrasting feature was obtained to enhance micro-expression spotting. The contrasting feature was then fused with the temporal features extracted by the temporal network to perform micro-expression recognition and apex frame detection.
In addition to the deep learning methods mentioned above for short video clips, there have been studies investigating micro-expression spotting in long videos, utilizing various deep learning models such as CNN, 3D-CNN and their variations. For instance, CNN models were employed to extract spatial features from image frames, and a multi-head self-attention model was applied along the temporal dimension to analyze the weight of each frame and identify macro- and micro-expression intervals [29]. Variant CNN-based models are also employed. For example, a Concat-CNN model consisting of three streams of convolutional networks with different sizes of convolution kernels [30] was proposed to learn feature correlations among facial action units (AUs) of different frames. In addition, a local bilinear convolutional neural network (LBCNN) [31] was proposed to transform the micro-expression spotting task into a fine-grained image recognition problem. Xue et al. [32] proposed a Two-Stage Macro- and Micro-expression spotting network (TSMSNet) containing two sub-networks: the Triplet-Stream Attention Network (TSANet) and the Spatial-Temporal Classification Network (STCNet). TSANet processed the horizontal and vertical components of optical flow as well as optical strain in three branches, combining attention mechanisms to extract spatio-temporal features. STCNet utilized the initial expression intervals inferred by TSANet to predict multi-scale expression segments.
Multiple-stream-based deep learning models are also employed. For example, a two-stream 3D-CNN used frame skipping and contrast enhancement [33] for micro-expression spotting in long videos. Liong et al. [34] proposed the Shallow Optical Flow Three-Stream CNN (SOFTNet) model to estimate a confidence score indicating the probability of a frame belonging to an expression interval. They treated micro-expression spotting as a regression problem and introduced a pseudo-label mechanism combined with a sliding window mechanism to achieve macro-expression and micro-expression detection in long videos. In 2022, Liong et al. proposed the multi-temporal stream network (MTSN) model based on SOFTNet [34]. This approach computed two optical flow features with different time differences, and each optical flow was processed by one of two SOFTNets (top and bottom). Finally, the feature vectors from the two streams were concatenated and utilized for micro-expression detection.
The state-of-the-art micro-expression-spotting methods still have much room for improvement. In this study, we aim to design an effective micro-expression-spotting method. Inspired by the idea of video summarization, we extract representative frames from video sequences that may contain crucial information. The micro-expression spotting problem is then formulated as a classification task to determine whether these key frames contain a micro-expression, a macro-expression or no facial expression at all. The proposed method extracts both spatial and temporal information to select key frames by analyzing the video structure and spatio-temporal redundancies in the content.
The contributions of this study are as follows:
• A spatio-temporal network with sliding windows is proposed for effective micro-expression spotting.
• A key-frame-extraction method is fused into the spatio-temporal network so that the spatial features of the video clip are expressed as a more concise key-frame-based representation.
• Experiments show that the proposed model achieves F1-scores of 0.6600 on CAS(ME)² and 0.6091 on SAMM Long Videos for micro-expression spotting and outperforms the state-of-the-art methods by a large margin.

The Spatio-Temporal Spotting Network with Sliding Windows
The video sequences are initially processed with a spatial feature extraction module by using a sliding window mechanism. Key frames are then extracted from the resulting feature sequences within the temporal windows. These key frames are further analyzed by a temporal-information-extraction module for facial expression classification, which identifies whether the central frame of the temporal window contains a micro-expression, a macro-expression or no expression at all. Figure 1 illustrates the overall structure of the proposed micro-expression-spotting method, called STSNet_SW. The code and models of the proposed method are available at https://github.com/ourpubliccodes/STSNet_SW (accessed on 12 September 2023).
Figure 1. Overview of the proposed spatio-temporal micro-expression-spotting method with sliding windows.

Spatial Information Extraction
Given a video sequence $S = \{I_1, I_2, \ldots, I_n, \ldots, I_N\}$, where $N$ is the number of frames, the sequence is sampled by using temporal sliding windows of size $K$. At moment $t$, the window samples a sub-sequence $S_t$, with $I_n$ being the middle frame. All sample windows of the video clip can be denoted as $S = \{S_1, S_2, \ldots, S_t, \ldots, S_T\}$, where $T$ denotes the total number of sliding windows. Note that $K$ is set to five in our experiments, so the first two and last two frames of the video sequence are not sampled as the middle frames of the sliding windows.
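The window sampling described above can be sketched in a few lines of Python. This is a minimal illustration rather than the released implementation: the function name `sliding_windows` is hypothetical, and a stride of one frame is assumed, which is consistent with only the first two and last two frames being excluded as middle frames when K = 5.

```python
from typing import List, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")


def sliding_windows(frames: Sequence[Frame], k: int = 5) -> List[Tuple[int, List[Frame]]]:
    """Return (middle_index, window) pairs for every full window of size k.

    Assumes a stride of one frame; frames closer than k // 2 to either end
    of the sequence never act as the middle frame of a window.
    """
    if k < 1 or k % 2 == 0:
        raise ValueError("k must be a positive odd number so that a middle frame exists")
    half = k // 2
    windows = []
    for center in range(half, len(frames) - half):
        window = list(frames[center - half:center + half + 1])
        windows.append((center, window))
    return windows
```

Under these assumptions a sequence of N frames yields T = N - K + 1 windows, so a 10-frame clip with K = 5 produces six windows centred on frames 2 to 7 (zero-based).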
The spatial information extraction module employs a residual network (ResNet) model [35] as its backbone and is applied to every frame in a video sequence. At time $t$, we extract a feature sequence, denoted as $P_t$, from the $K$ samples within the sub-sequence $S_t$. This feature sequence is represented as $P_t = \{p_1, p_2, \ldots, p_k, \ldots, p_K\}$, where $p_k \in \mathbb{R}^{C \times H \times W}$ is the spatial feature extracted from the $k$-th image frame and $C$, $H$ and $W$ denote the number of channels, the height and the width, respectively.
The model is first initialized with weights trained on the ImageNet dataset [36]. Due to the relatively low intensity and short duration of micro-expressions and the limited number of micro-expression training samples, the model tends to overfit. To avoid this, the initialized model is pre-trained on the macro-expression dataset AffectNet [37], which adapts the model from the general image domain to the facial expression domain.

Key Frames Extraction
Related studies in micro-expression recognition [22,38–40] show that the features extracted from the apex frames contain crucial information and are most effective for facial expression recognition. In a video clip with micro-expressions, most frames are static, contain very little information and are thus redundant, while a few frames contain relatively rich information. Motivated by these observations, we introduce a key-frame-extraction module.
The key-frame-extraction module keeps the most representative frames with more distinctive features and discards invariant frames. The module adopts the idea of video summarization and utilizes a self-attention module and a two-layer fully connected classification network. Figure 2 illustrates the structure of the module.
Figure 2. The structure of the key-frame-extraction module. This module takes $P_t = \{p_1, p_2, \ldots, p_k, \ldots, p_K\}$ as input, where $P_t$ is the spatial feature sequence extracted from the sub-sequence $S_t$ sampled by the $t$-th sliding window. This input is processed through the self-attention calculation part and the key-frame-classification part, resulting in a set of scores $Z_t = \{z_1, z_2, \ldots, z_k, \ldots, z_K\}$. Finally, we select the frames corresponding to the top $M$ scores in $Z_t$ as the $M$ key frames.
The self-attention part captures the correlation between the features. The attention vector $\alpha_k$ is computed as the similarity between the feature extracted from the $k$-th frame and the feature sequence $P_t$. We first calculate the correlation between the spatial feature of the $k$-th frame and the spatial feature of the $i$-th frame in the sequence, as formulated in Equation (1). The correlation between the $k$-th frame and the feature sequence $P_t$ is then denoted by Equation (2), where $K$ denotes the number of frames within a temporal window and $W_1$, $W_2$ are learnable matrices. This correlation vector is normalized by a softmax function to obtain the attention weight vector $\alpha_k$, as calculated in Equation (3). The attention score $\alpha_{k,i}$ evaluates the level of attention given to $p_i$ by $p_k$.
Next, the features are weighted by using the attention vector $\alpha_k$. Each input feature is first linearly transformed by multiplying it with a transformation matrix $W_3$. Each transformed vector is multiplied by its corresponding attention score, and the results are summed to compute the new representation $b_k$, as formulated in Equation (4). This vector focuses on both the global and the key information of the whole sequence.
In the key-frame-classification part, the vector $b_k$ is further processed with a linear transformation $U$, a residual sum, a dropout layer (Dropout) and a normalization layer (Norm), as formulated in Equation (5). Two more layers are applied to compute the final scores, as shown in Equation (6). Layer $L_1$ consists of a ReLU activation layer, a dropout layer and a normalization layer, and $L_2$ contains a single hidden unit with a sigmoid activation.
The output of the key-frame-extraction module is an importance score sequence $Z_t = \{z_1, z_2, \ldots, z_k, \ldots, z_K\}$, $z_k \in [0, 1)$. We rank $Z_t$ and take the top $M$ frames as the key frames $P_{key,t}$ for the feature sequence $P_t$ at time $t$, where $P_{key,t} = \{p_{k_1}, p_{k_2}, \ldots, p_{k_m}, \ldots, p_{k_M}\}$.
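The key-frame scoring pipeline described above can be sketched in plain Python. The sketch rests on several assumptions that are not confirmed by the text: each frame feature is flattened to a vector, the frame-to-frame correlation is taken as a dot product between W1- and W2-projected features, dropout and normalization are omitted (as at inference time), layer L1 is reduced to a ReLU, and the selected key frames are returned in temporal order. All function names are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply matrix w (rows x cols) by vector x (length cols)."""
    return [sum(w_rc * x_c for w_rc, x_c in zip(row, x)) for row in w]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(xs: Vector) -> Vector:
    """Numerically stable softmax."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(p: List[Vector], w1: Matrix, w2: Matrix, w3: Matrix) -> Tuple[List[Vector], List[Vector]]:
    """Return attention weights alpha[k][i] and weighted representations b[k]."""
    keys = [matvec(w1, p_i) for p_i in p]      # assumed role of W1
    queries = [matvec(w2, p_k) for p_k in p]   # assumed role of W2
    values = [matvec(w3, p_i) for p_i in p]    # W3: linear transform of each input
    alphas, reps = [], []
    for q_k in queries:
        e_k = [dot(key_i, q_k) for key_i in keys]  # correlation of frame k with each frame i
        alpha_k = softmax(e_k)                     # attention weights over the K frames
        b_k = [sum(a * v[d] for a, v in zip(alpha_k, values)) for d in range(len(values[0]))]
        alphas.append(alpha_k)
        reps.append(b_k)
    return alphas, reps


def importance_scores(p: List[Vector], reps: List[Vector], u: Matrix, v: Vector) -> Vector:
    """Simplified scoring head: U b_k + p_k (residual), ReLU, one sigmoid unit."""
    scores = []
    for p_k, b_k in zip(p, reps):
        g_k = [ub + pk for ub, pk in zip(matvec(u, b_k), p_k)]
        hidden = [max(0.0, g) for g in g_k]
        scores.append(1.0 / (1.0 + math.exp(-dot(v, hidden))))
    return scores


def top_m_indices(scores: Vector, m: int) -> List[int]:
    """Indices of the m highest scores, returned in temporal order (an assumption)."""
    ranked = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)[:m]
    return sorted(ranked)
```

With K = 5 and M = 3, as used in the experiments, `top_m_indices` keeps three of the five frames in each window.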

Temporal Information Extraction
The spatial feature sequences of the key frames are further processed by the temporal-information-extraction module, which is composed of bidirectional Gated Recurrent Units (bi-GRUs). Compared with Long Short-Term Memory networks (LSTMs), GRUs not only extract temporal contextual information but also contain fewer trainable parameters, which makes them converge faster during training and reduces the risk of overfitting. The structure of the module is illustrated in Figure 3.
Each bi-GRU module extracts features from a specific pixel position of all key frames in parallel and obtains a pixel feature sequence of size $C \times M$, where $C$ denotes the number of channels of the spatial feature and $M$ denotes the number of key frames. Since the spatial feature of each frame has dimension $C \times H \times W$, each feature map has $H \times W$ spatial positions, and the temporal network therefore consists of $H \times W$ bi-GRU modules. All bi-GRU modules share the same set of parameters to reduce the total number of tunable parameters. Let the input features at spatial position $(i, j)$ from all key frames be denoted by $P^{(i,j)}$, where $p^{(i,j)}_{k_m}$ denotes the feature at spatial position $(i, j)$ in the $m$-th key frame. Given $p^{(i,j)}_{k_m}$ and the hidden state $h_{m-1}$ from the previous key frame, the GRU unit obtains the hidden state $h_m$ of the current key frame. The output of each bi-GRU module is the average of the hidden states of the $M$ key frames, which allows the output to accommodate different numbers of key frames. The final spatio-temporal feature $F_t$ is the concatenation of the outputs of the $H \times W$ bi-GRU modules; in this feature, $C$ refers to the number of feature channels after processing by the bi-GRU module.
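The data flow of the temporal module (one shared recurrent module per spatial position, averaging over key frames, then concatenation) can be sketched as follows. This is only an illustration of the tensor bookkeeping: `toy_bidirectional` is a hypothetical, parameter-free stand-in that runs an exponential average forwards and backwards, not a GRU, and all names are assumptions introduced here.

```python
from typing import Callable, List

Vector = List[float]
FeatureMap = List[List[List[float]]]  # one frame, indexed [channel][row][col]


def position_sequence(key_frames: List[FeatureMap], i: int, j: int) -> List[Vector]:
    """Collect the C-dimensional feature at position (i, j) from each key frame."""
    return [[channel[i][j] for channel in fmap] for fmap in key_frames]


def toy_bidirectional(seq: List[Vector]) -> List[Vector]:
    """Stand-in for a bi-GRU: forward and backward running states, concatenated."""
    def scan(items: List[Vector]) -> List[Vector]:
        state = [0.0] * len(items[0])
        states = []
        for x in items:
            # Each state depends on the previous state and the current input.
            state = [0.5 * s + 0.5 * x_d for s, x_d in zip(state, x)]
            states.append(state)
        return states

    forward = scan(seq)
    backward = scan(seq[::-1])[::-1]
    return [f + b for f, b in zip(forward, backward)]


def temporal_feature(
    key_frames: List[FeatureMap],
    module: Callable[[List[Vector]], List[Vector]] = toy_bidirectional,
) -> Vector:
    """Apply the same module at every spatial position and concatenate the averages."""
    height = len(key_frames[0][0])
    width = len(key_frames[0][0][0])
    feature: Vector = []
    for i in range(height):
        for j in range(width):
            states = module(position_sequence(key_frames, i, j))  # shared across positions
            m = len(states)
            # Averaging over the M hidden states makes the output size independent of M.
            feature.extend(sum(s[d] for s in states) / m for d in range(len(states[0])))
    return feature
```

Because the hidden states are averaged over key frames, the length of the resulting feature depends only on H, W and the per-position output size, not on M.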
Finally, a dropout layer with a probability of 0.5 and a fully connected softmax layer are applied to classify expressions for the sub-sequence $S_t$. The classification result denotes the expression category (micro-expression, macro-expression or no expression) of the middle frame within the $t$-th temporal sliding window. The loss function is defined in Equation (7), where $Q$ represents the number of categories, $y(F_t)$ is the predicted label for the feature $F_t$, $q$ is the ground truth label, $1\{\cdot\}$ denotes an indicator function (its value is 1 when $y(F_t)$ and $q$ are equal and 0 otherwise) and $V$ is the weight vector of a fully connected layer.

Segment Merging
For the micro-expression detection task, it is common to use the Intersection over Union (IoU) between the detected sample and the ground truth to determine whether a segment qualifies as a true positive (TP) sample, as given in Equation (8):

$$\frac{W_{spotted} \cap W_{groundTruth}}{W_{spotted} \cup W_{groundTruth}} \geq r \tag{8}$$

where $W_{groundTruth}$ is the ground truth interval from the onset frame to the offset frame, $r$ is set to 0.5 and the spotted interval $W_{spotted}$ is considered to be a TP if it satisfies Equation (8). We observe that if a few frames of a video segment containing an expression are wrongly identified, a long and continuous expression segment may be recognized as multiple short segments. In this case, some segments are filtered out because they do not meet the threshold duration, so they are incorrectly counted as false negatives (FNs), which diminishes the performance of micro-expression spotting. To address this issue, we carry out post-processing through segment merging to reduce FN short segments. For example, if an image frame is predicted as not containing any expression but its two adjacent frames are labeled as macro-expressions (or micro-expressions), the label of this frame is adjusted to be consistent with its neighbors. Figure 4 illustrates the process of segment merging. This merging approach enhances the overlap between the spotted samples and the ground truth, which mitigates, to some extent, the performance degradation that FN short segments cause.
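The IoU criterion and the merging rule can be illustrated with a short Python sketch. It assumes inclusive frame intervals of the form (onset, offset), integer per-frame labels with 0 for no expression, and merging of single-frame gaps only, as in the example given in the text; the function names are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (onset_frame, offset_frame), both inclusive


def interval_iou(spotted: Interval, truth: Interval) -> float:
    """Intersection over Union of two inclusive frame intervals."""
    s_on, s_off = spotted
    g_on, g_off = truth
    intersection = max(0, min(s_off, g_off) - max(s_on, g_on) + 1)
    union = (s_off - s_on + 1) + (g_off - g_on + 1) - intersection
    return intersection / union


def is_true_positive(spotted: Interval, truth: Interval, r: float = 0.5) -> bool:
    return interval_iou(spotted, truth) >= r


def merge_isolated_gaps(labels: List[int]) -> List[int]:
    """Relabel a no-expression frame whose two neighbours share an expression label."""
    merged = list(labels)
    for n in range(1, len(labels) - 1):
        left, right = labels[n - 1], labels[n + 1]
        if labels[n] == 0 and left == right and left != 0:
            merged[n] = left
    return merged


def labels_to_segments(labels: List[int], target: int) -> List[Interval]:
    """Convert per-frame labels into inclusive intervals of the target label."""
    segments, start = [], None
    for n, label in enumerate(labels):
        if label == target and start is None:
            start = n
        elif label != target and start is not None:
            segments.append((start, n - 1))
            start = None
    if start is not None:
        segments.append((start, len(labels) - 1))
    return segments
```

For a ground truth interval (1, 5) and predictions `[0, 1, 1, 0, 1, 1, 0]`, the two fragments (1, 2) and (4, 5) each have an IoU of 0.4 and fail the test with r = 0.5, whereas the merged segment (1, 5) matches the ground truth exactly.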

Figure 4. Illustration of segment merging. The images framed inside the green box in the first row represent micro-expressions according to ground truth annotations. The results of expression detection for these images are displayed in the second row, where "0" signifies the absence of expression and "1" denotes a micro-expression. Notably, the images encased within the green box are incorrectly predicted as two separate expression segments, delineated by the red boxes in the second row. After segment-merging post-processing is applied, these segments are consolidated into a single micro-expression segment, indicated by the blue box in the third row, which is consistent with the ground truth.

Experiments
We conduct extensive experiments on the spotting benchmark of the Third Facial Micro-Expression Grand Challenge (MEGC 2020), which aims to spot both macro- and micro-expressions (from the onset frame to the offset frame) in long videos. The challenge includes two datasets and several metrics for evaluating the performance of the methods, which are also used in our experiments.

Datasets and Evaluation Metrics
The proposed method is evaluated on the CAS(ME)² database [41] and the SAMM Long Videos dataset [42]. The CAS(ME)² database contains a total of 87 videos with a frame rate of 30 frames per second (fps) and an average duration of 86 s. The authors annotated 300 macro-expressions and 57 micro-expressions from 22 subjects, and the emotions are divided into four categories: positive, negative, surprise and others.
The SAMM Long Videos dataset is an extended version of the SAMM dataset. It contains 147 videos with a frame rate of 200 fps and an average duration of 35 s. The dataset contains 343 macro-expressions and 159 micro-expressions from 32 subjects, recorded using a high-speed camera with a resolution of 2040 × 1088.
The MEGC 2020 spotting task evaluates both macro- and micro-expression spotting. All videos are treated as "one particularly long video", so the metric represents the overall performance over all videos. We first evaluate the spotting of macro- and micro-expressions separately and then compute the overall performance on the entire dataset. For macro-expressions, the recall and the precision are defined as Equations (9) and (10):

$$\mathrm{Recall}_{MaE} = \frac{a_1}{m_1} \tag{9}$$

$$\mathrm{Precision}_{MaE} = \frac{a_1}{n_1} \tag{10}$$

where $a_1$ denotes the number of TPs, $m_1$ denotes the total number of macro-expression (MaE) sequences and $n_1$ denotes the total number of predicted macro-expression intervals. The requirement for being a TP is described in Equation (8). Likewise, we use two metrics for micro-expressions, as formulated in Equations (11) and (12):

$$\mathrm{Recall}_{MiE} = \frac{a_2}{m_2} \tag{11}$$

$$\mathrm{Precision}_{MiE} = \frac{a_2}{n_2} \tag{12}$$

where $a_2$ denotes the number of TPs, $m_2$ denotes the total number of micro-expression (MiE) sequences and $n_2$ denotes the total number of predicted micro-expression intervals. The overall performance is then computed as Equations (13) and (14):

$$\mathrm{Recall} = \frac{a_1 + a_2}{m_1 + m_2} \tag{13}$$

$$\mathrm{Precision} = \frac{a_1 + a_2}{n_1 + n_2} \tag{14}$$

Based on the overall recall and precision, we calculate the F1-score with Equation (15):

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{15}$$

The F1-score is one of the most widely used evaluation metrics in micro-expression analysis. It provides a comprehensive assessment by considering both precision and recall. Precision measures the percentage of TP samples among all samples predicted as micro-expressions (including both TPs and false positives (FPs)), while recall measures the percentage of TPs among all micro-expression samples (including both TPs and FNs). Both metrics are equally important in assessing classification accuracy, but they sometimes conflict with each other. Therefore, the F1-score computes the harmonic mean of precision and recall, taking both metrics into account simultaneously. The F1-score ranges from a minimum of 0 to a maximum of 1, where a higher F1-score indicates better model performance.
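The metric definitions above translate directly into code. The following sketch uses a hypothetical `SpottingCounts` container for the three counts (TPs, ground truth intervals and predicted intervals); the counts in the usage note are invented for illustration and are not results from this study.

```python
from dataclasses import dataclass


@dataclass
class SpottingCounts:
    tp: int          # a: number of true positives
    total_gt: int    # m: number of ground truth expression intervals
    total_pred: int  # n: number of predicted expression intervals


def recall(c: SpottingCounts) -> float:
    return c.tp / c.total_gt if c.total_gt else 0.0


def precision(c: SpottingCounts) -> float:
    return c.tp / c.total_pred if c.total_pred else 0.0


def f1_score(c: SpottingCounts) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


def overall(macro: SpottingCounts, micro: SpottingCounts) -> SpottingCounts:
    """Pool macro- and micro-expression counts, treating all videos as one long video."""
    return SpottingCounts(
        tp=macro.tp + micro.tp,
        total_gt=macro.total_gt + micro.total_gt,
        total_pred=macro.total_pred + micro.total_pred,
    )
```

For example, with illustrative counts `SpottingCounts(6, 10, 8)` for macro-expressions and `SpottingCounts(2, 5, 4)` for micro-expressions, the pooled recall is 8/15, the pooled precision is 8/12 and the overall F1-score is 16/27. Note that the overall score is computed from pooled counts rather than by averaging the two per-class F1-scores.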

Experiments and Results
We run the experiments on an Intel i5-9600K CPU @ 3.70 GHz with an NVIDIA GeForce RTX 2070 (with a memory size of 16 GB, manufactured by Gigabyte Technology, New Taipei City, Taiwan). In this study, the size of the sliding windows is set to $K = 5$ and the number of key frames is set to $M = 3$. For the temporal-information-extraction module, the input of each temporal module is a feature map with dimensions of $M \times 512$, where 512 is the output dimension of each frame after the key-frame-extraction module, and the number of feature channels $C$ is equal to 64. For training, the number of epochs is set to 30, and the initial learning rate is set to $1 \times 10^{-3}$. The learning rate is adjusted by using the cosine annealing learning rate method [43], and the minimum value is set to $1 \times 10^{-8}$. For optimization, we use a stochastic gradient descent method [44] with the momentum set to 0.9 and the weight decay set to $5 \times 10^{-4}$. We use the leave-one-subject-out (LOSO) cross-validation protocol for validation. This validation method allows us to evaluate the generalization of the model across individuals, making it particularly suitable for personalized requirements in practical scenarios.
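As an illustration of the learning rate schedule, the standard cosine annealing formula can be written as a small function. The sketch assumes a single annealing cycle without warm restarts that spans all 30 epochs and is updated once per epoch; neither detail is stated in the text, and the function name is hypothetical.

```python
import math


def cosine_annealing_lr(
    epoch: int,
    total_epochs: int = 30,
    lr_max: float = 1e-3,
    lr_min: float = 1e-8,
) -> float:
    """Cosine annealing from lr_max at epoch 0 down to lr_min at total_epochs."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cosine)
```

Under these assumptions the learning rate starts at the initial value, passes roughly half of it at epoch 15 and reaches the stated minimum at the end of training.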
In this study, we compare the proposed method with the baseline provided by the MEGC 2020 spotting task and other state-of-the-art (SOTA) methods in terms of the F1-score. The results in Table 1 show that our method outperforms the others on both the CAS(ME)² and SAMM Long Videos datasets. Specifically, on the CAS(ME)² dataset, we achieve an F1-score of 0.6694 for macro-expression detection and 0.6600 for micro-expression detection. On the SAMM Long Videos dataset, the F1-score for macro-expression detection is 0.5539, while for micro-expression detection, it reaches 0.6091. Compared to the SOTA, the STSNet_SW approach improves by 43.25% on CAS(ME)² and 39.11% on SAMM Long Videos for micro-expression spotting and outperforms the other methods by 25.93% and 14.58% on the two datasets for macro-expression spotting, respectively. It is noteworthy that the compared methods generally perform better in detecting macro-expressions than micro-expressions on both datasets, with the LSSNet-LSM [45] and MTSN [46] methods particularly excelling in this regard. In contrast, the proposed STSNet_SW method is effective at spotting both micro-expressions and macro-expressions.
Table 2 reports a detailed analysis of the experimental results obtained by using the proposed method across several evaluation metrics. The F1-scores for spotting macro-expressions on CAS(ME)² and SAMM Long Videos are 0.6694 and 0.5539, respectively, and the F1-scores for spotting micro-expressions are 0.6600 and 0.6091, respectively. The overall F1-scores are 0.6678 and 0.5697, respectively. Note that the relatively small numbers of FPs indicate that the proposed method has a strong ability to identify no-expression segments, and there are very few cases where no-expression segments are misclassified as macro-expression (or micro-expression) segments.
The experimental results of the proposed method with segment merging are reported in Table 3. The results show that performance on the SAMM Long Videos dataset improved, but the results on the CAS(ME)² dataset barely changed. This discrepancy is attributed to the difference in frame rates between the two datasets. In the SAMM Long Videos database, which has a high frame rate, an expression clip contains a greater number of frames within the same time interval than in the CAS(ME)² dataset, which has a low frame rate. Consequently, the proposed model may miss some of the expression frames during spotting, leading to the detection of multiple shorter segments instead of a single complete segment. The segment-merging post-processing connects multiple short segments into a longer one, significantly increasing the number of detected TPs for the SAMM Long Videos dataset.
In addition, we conduct an experiment that distinguishes macro- and micro-expressions according to segment duration so as to assess and validate the advantages of segment-merging post-processing in micro-expression spotting. Typically, micro-expressions have shorter durations than macro-expressions, which makes them prone to being confused with macro-expression segments. Starting from the initial predictions generated by the STSNet_SW method, we re-label segments with facial expressions as "with-expression" segments and those without expressions as "no-expression" segments. During the segment-duration post-processing, we set a threshold duration to decide whether a "with-expression" segment represents a micro-expression or a macro-expression. Specifically, among segments containing facial expressions, those whose duration is less than or equal to the threshold are considered micro-expressions, while those lasting longer than the threshold are re-labeled as macro-expressions. Figure 5 shows an example of a micro-expression segment misclassified as a macro-expression, which is correctly re-labeled by using segment-duration post-processing. The second row shows the prediction results for this video clip, including no-expression segments denoted by "0", a micro-expression segment indicated by "1" (i.e., the images framed within the red box) and a macro-expression segment identified as "2" (i.e., the images framed within the yellow box). Subsequently, these segments are re-labeled as either "no-expression" segments or "with-expression" segments (depicted by the blue boxes in the third row). The "with-expression" segments are further re-classified as micro-expressions (indicated by the red boxes in the fourth row) by comparing their durations with the predefined threshold.
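The segment-duration post-processing can be sketched as a single pass over the per-frame predictions. The label convention (0 for no expression, 1 for micro-expression, 2 for macro-expression) follows the description of Figure 5; the function name is hypothetical, and the threshold is expressed in frames.

```python
from typing import List


def relabel_by_duration(labels: List[int], threshold: int) -> List[int]:
    """Re-label each contiguous "with-expression" run by its length in frames.

    Runs of at most `threshold` frames become micro-expressions (1);
    longer runs become macro-expressions (2). No-expression frames (0) are kept.
    """
    relabeled = list(labels)
    n = 0
    while n < len(labels):
        if labels[n] == 0:
            n += 1
            continue
        start = n
        # Any non-zero label counts as "with-expression", regardless of its class.
        while n < len(labels) and labels[n] != 0:
            n += 1
        duration = n - start
        new_label = 1 if duration <= threshold else 2
        relabeled[start:n] = [new_label] * duration
    return relabeled
```

Because adjacent micro- and macro-expression predictions are first pooled into one "with-expression" run, a run such as `[1, 1, 2, 2]` is treated as a single four-frame segment before the threshold is applied.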
Empirically, the threshold on the number of frames is set to 15 for the CAS(ME)² dataset and 100 for the SAMM Long Videos dataset. Table 4 shows the experimental results. From the table, we observe that, compared with segment-merging post-processing, the performance on both databases improves for macro-expression spotting but worsens for micro-expression spotting. This is because filtering expression segments based on the threshold leaves the micro-expression class predominantly with shorter-duration segments. Consequently, these micro-expression segments are scattered throughout the video sequence, so that the region covered by the spotted samples and the ground truth includes more non-expression segments. From the perspective of Equation (8), this decreases the IoU and the number of TPs for micro-expressions, which subsequently affects the final detection. Conversely, macro-expressions, which retain segments with a long duration, show a relatively increased overlap between the detected samples and the ground truth. On the other hand, due to individual variations in emotional expression habits, setting a fixed threshold duration for each dataset does not ensure that all micro-expression segments meet the thresholding criterion. As a result, many micro-expression segments may be misclassified as macro-expression segments because their durations are longer than the threshold. Methods that use segment duration for post-processing are therefore subject to accuracy loss in certain scenarios. In conclusion, segment merging is effective as a post-processing procedure.

Conclusions
This study proposes a spatio-temporal spotting network with sliding windows for spotting macro- and micro-expressions in long videos. By combining convolutional neural networks and recurrent neural networks, the model learns both spatial and temporal features and captures the key characteristics of facial expressions. Furthermore, we incorporate a video summarization algorithm for key frame extraction to improve the performance of micro-expression spotting. Many existing expression-spotting methods combine traditional feature extraction with deep learning but frequently struggle to capture complex facial expression variations and incur high costs in labor and time. Additionally, given the shorter durations and smaller amplitudes of facial movements in micro-expressions compared to macro-expressions, current methods tend to prioritize macro-expression detection and perform poorly in micro-expression segment detection.
We evaluate the proposed STSNet_SW on two benchmark datasets from the MEGC 2020 challenge: CAS(ME)² and SAMM Long Videos. In terms of the F1-score, the proposed method achieves scores of 0.6600 and 0.6091 for micro-expression spotting on the CAS(ME)² and SAMM Long Videos datasets, respectively, and scores of 0.6694 and 0.5539 for macro-expression spotting on the CAS(ME)² and SAMM Long Videos datasets, respectively. Compared to the state-of-the-art (SOTA) methods, the STSNet_SW approach improves by 43.25% and 25.93% on the CAS(ME)² dataset for micro-expression and macro-expression spotting, and by 39.11% and 14.58% on the SAMM Long Videos dataset for micro-expression and macro-expression spotting, respectively. These results demonstrate that the STSNet_SW method outperforms state-of-the-art methods in both macro- and micro-expression detection, with particularly remarkable improvements for micro-expressions. However, regardless of whether segment-merging post-processing or segment-duration post-processing is applied, the performance of the proposed method on the SAMM Long Videos dataset is notably lower than on the CAS(ME)² dataset, possibly due to disparities between the datasets. Additionally, there is a significant imbalance between the data available for micro-expressions and macro-expressions, which limits further improvements in spotting performance. In future work, we will use more diverse facial information, such as facial action units (AUs) and optical flow features, to address the problem of insufficient data for micro-expression spotting.

Figure 5. Illustration of the segment duration. The first row displays the video clip to be detected. The second row shows the prediction results for this video clip, including no-expression segments denoted by "0", a micro-expression segment indicated by "1" (i.e., the images framed within the red box) and a macro-expression segment identified as "2" (i.e., the images framed within the yellow box). Subsequently, these segments are re-labeled as either "no-expression" segments or "with-expression" segments (depicted by the blue boxes in the third row). The "with-expression" segments are further re-classified as micro-expressions (indicated by the red boxes in the fourth row) by comparing their durations with the predefined threshold.
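Figure 3. The structure of the temporal-information-extraction module. This module takes the spatial feature sequence $P_{key,t}$ from $M$ key frames as input, where $P_{key,t} = \{p_{k_1}, p_{k_2}, \ldots, p_{k_m}, \ldots, p_{k_M}\}$ and $p_{k_m} \in \mathbb{R}^{C \times H \times W}$. The temporal-information-extraction module is composed of $H \times W$ bi-GRU modules, each of which processes pixel-wise features within the sequence $P_{key,t}$. For each bi-GRU module containing $M$ bi-GRU units, $h_m$ is the output of the $m$-th bi-GRU unit applied to $p^{(i,j)}_{k_m}$, which denotes the feature at spatial position $(i, j)$ in the $m$-th key frame. The concatenation of the outputs from these $H \times W$ bi-GRU modules forms the spatio-temporal feature $F_t$.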

Table 1. Comparison of the proposed method with the state-of-the-art methods according to F1-scores.

Table 2. Detailed performance of the proposed method using several evaluation metrics.

Table 3. Detailed performance of the proposed method with segment-merging post-processing.

Table 4. Detailed performance of the proposed method with segment-duration post-processing.