Multimodal Attention Dynamic Fusion Network for Facial Micro-Expression Recognition

The emotional changes in facial micro-expressions are combinations of action units. Researchers have shown that action units can serve as additional auxiliary data to improve facial micro-expression recognition. Most existing works attempt to fuse image features with action unit information; however, they ignore the impact of action units on the facial image feature extraction process itself. Therefore, this paper proposes a local detail feature enhancement method based on a multimodal dynamic attention fusion network (MADFN) for micro-expression recognition. The method uses a masked autoencoder based on a learnable class token to remove local areas with low emotional expressiveness in micro-expression images. An action unit dynamic fusion module then fuses action unit representations into the image features to improve their latent representation ability. The performance of the proposed model is evaluated and verified on the SMIC, CASME II, and SAMM datasets and their combination, 3DB-Combined. The experimental results demonstrate that the proposed model achieves competitive accuracy rates of 81.71%, 82.11%, and 77.21% on SMIC, CASME II, and SAMM, respectively, showing that the MADFN model helps improve the discrimination of facial emotional features.


Introduction
Facial micro-expressions (hereinafter referred to as micro-expressions) are short-duration, low-intensity facial muscle movements. They usually occur when a person is concealing an emotion and can therefore reveal genuine emotions and motivations [1]. Without professional training, people cannot suppress the appearance of micro-expressions [2]. Researchers have found that micro-expressions often appear in lie detection scenarios. Micro-expression recognition therefore has major implications in high-stakes situations, including criminal investigation, social interactions, national security, and business negotiations [3].
Researchers have shown that facial emotional changes are combinations of action units (AUs), which can be used as additional auxiliary information to improve the performance of facial micro-expression recognition [4]. Xie et al. [5] combined AU detection and micro-expression recognition and proposed an AU-assisted Graph Attention Convolutional Network, which predicts micro-expression categories by learning AU node features in a graph convolutional learning module. Lei et al. [6] proposed an AU-based graph convolutional network that enhances the feature representations of nodes and graph edges by fusing AU features. Zhao et al. [7] proposed a Spatio-Temporal AU Graph Convolution Network, which feeds the local image regions of AUs to a three-dimensional convolutional model to obtain AU features. These works use graph convolutional networks to capture the dependencies between different local regions and thereby improve micro-expression recognition performance.
Although these methods employ action units to enhance image features, they do not consider the impact of action units on the image feature extraction process. Studies have shown that the Vision Transformer (ViT) [8] model can succeed in tasks such as image recognition [9], object detection [10], image segmentation [11], and generation [12] by focusing on local image information. The ViT structure also allows auxiliary information to be introduced into the image encoder to dynamically enhance features. Therefore, this paper proposes a local detail feature enhancement method based on a multimodal dynamic attention fusion network (MADFN) model for micro-expression recognition. The local detail feature enhancement method is shown in Figure 1.
In this model, a learnable class token (LCT) is used to remove local areas with low emotional expressiveness in micro-expression images. To enhance the discrimination of emotional features, action unit representations are added to the extraction of the latent emotional features of the image, and the action unit dynamic fusion (AUDF) module fuses these representations with the local features of the high-weight image sub-blocks for micro-expression recognition. We evaluated the MADFN model on three datasets, the Spontaneous Micro-expression Corpus (SMIC) [13], the Chinese Academy of Sciences Micro-Expression II (CASME II) [14], and Spontaneous Actions and Micro-Movements (SAMM) [15], as well as their combined dataset (3DB-Combined) [16].
In general, this paper proposes the MADFN model to solve the local detail feature enhancement problem. The main contributions of this paper are summarized as follows:
1. A masked autoencoder based on a learnable class token is proposed to remove local image sub-blocks that contribute little to micro-expression recognition.
2. The influence of action units on facial micro-expression recognition is analyzed, and we are the first to add action unit representations to the feature extraction process of micro-expression images.
The remainder of this paper is organized as follows: Section 2 briefly reviews related research on micro-expression recognition. Section 3 provides a complete introduction to the proposed model. Section 4 presents the datasets, experimental details, and results. Finally, Section 5 concludes the paper.

Related Work
Facial micro-expression recognition methods are generally divided into two types. The first type extracts global features from the whole image for micro-expression recognition, while the second type locates the local regions where micro-expressions occur and then extracts local features from these regions.

Global Features for Micro-Expression Recognition
Several earlier studies [17][18][19][20][21] used a rule-based block division approach, extracting features from each block and stitching them into a compact feature vector, to make hand-crafted features best represent micro-expression changes [22,23]. This methodology was first used for micro-expression recognition by Pfister et al. [17], who uniformly divided the micro-expression video images into 4 × 4, 5 × 5, and 6 × 6 blocks on the three planes XY, XT, and YT. The Local Binary Pattern (LBP) features of these blocks were extracted and combined into a feature vector for recognition. Wang et al. [20] used the same procedure to divide the micro-expression video images and then extracted the Spatiotemporal Completed Local Quantized Patterns (STCLQP) features of each block. Wang et al. [24] explored the rule-based block division method in different color spaces to verify the influence of color feature spaces on micro-expression recognition.
Although hand-crafted feature methods can give excellent micro-expression recognition results, they may ignore additional information in the original image data. With the development of deep learning, researchers began applying it to micro-expression recognition to extract the subtle feature changes of micro-expressions [25][26][27]. Kim et al. [28] employed a Recurrent Neural Network to extract the temporal features of micro-expression video images, while using a Convolutional Neural Network (CNN) architecture to capture the spatial information of different temporal stages (onset, apex, and offset frames). Liong et al. [29] developed an optical flow feature from the apex frame (OFF-apex) framework, which uses the optical flow feature map of the micro-expression apex frame as the input of the CNN to enhance the optical flow features and improve the recognition rate. Deep learning methods became the favorite choice of researchers, with excellent results in the 2019 Facial Micro-Expression Grand Challenge (MEGC 2019) [30][31][32][33].

Local Features for Micro-Expression Recognition
Although the global image feature extraction method can be effective for micro-expression recognition, it may neglect the influence of local information and also introduces information redundancy. Therefore, researchers first locate the regions where micro-expressions occur and then extract local features from these regions. The initial work on local features moved away from rule-based block division toward rule-based facial ROI feature extraction [34,35]. Wang et al. [36] used the Facial Action Coding System (FACS) to distinguish 16 ROIs and obtained the Local Spatiotemporal Directional features of these regions through Robust Principal Component Analysis (RPCA). Liu et al. [37] proposed the Main Directional Mean Optical flow (MDMO) feature; to reduce the impact of noise caused by head movement, this method employs a robust optical flow method to extract features from 36 ROIs divided by Action Units (AUs). Xu et al. [38] suggested a micro-expression recognition method based on the Facial Dynamics Map (FDM), which locates ROIs based on facial emotion in a micro-expression video sequence and extracts features from these regions. Happy et al. [39] employed the FACS to locate 36 facial ROIs and applied the Fuzzy Histogram of Optical Flow Orientation (FHOFO) method to extract the subtle change features in these regions. Liong et al. [40] presented a Bi-Weighted Oriented Optical Flow (Bi-WOOF) feature descriptor, which uses two schemes to compute a weighted average of the global and local Histogram of Oriented Optical Flow (HOOF) features: in local feature extraction, each ROI is weighted using the magnitude component and multiplied by the average optical variation of its amplitude, and the final histogram features are weighted from the overall HOOF features.
Although the rule-based ROI location method can help improve recognition accuracy, it may not obtain the best results. Therefore, researchers use deep learning or attention mechanisms to obtain local features for micro-expression recognition. Chen et al. [41] introduced a three-dimensional spatiotemporal convolutional neural network with a Convolutional Block Attention Module (CBAM); while this method focuses on the importance of the features of interest, it ignores the subtle features of local regions. Li et al. [42] presented the LGCcon learning module, which combines local and global information to discover local regions with key emotional information while suppressing the detrimental impact of irrelevant facial regions. Wang et al. [43] presented a Residual Network with Micro-Attention (RNMA) model to locate the facial ROIs holding distinct AUs and thus address the influence of micro-expression changes in local regions. Xia et al. [44] proposed a recurrent convolutional network (RCN) to explore the effects of shallow architecture and low-resolution input data on micro-expression recognition, using an attention model to focus on local facial regions.

Multimodal Dynamic Attention Fusion Network
The multimodal dynamic attention fusion network takes two inputs: the micro-expression image and the AU embedding. First, the micro-expression image is divided into regular non-overlapping sub-blocks. A mask operation is performed, through the learnable class token module, on the image sub-blocks that contribute little to micro-expression recognition. The image sub-blocks with high attention weights pass through normalization and multi-head self-attention (MSA) operations into the action unit dynamic fusion module, where they are fused with the AU embedding to improve the distinguishability of the high-dimensional local feature representations of the micro-expression image. Finally, the micro-expression prediction is performed by fusing the AU embedding with the enhanced local image representation.
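The pipeline above can be summarized in a minimal numpy sketch. This is an illustrative sketch of the data flow only, not the paper's implementation: the LCT scorer, attention, and fusion use random stand-in weights, and all dimensions are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def madfn_forward(image_patches, au_embedding, keep_ratio=0.25):
    """Sketch of the MADFN forward pass (shapes only, random weights).

    image_patches: (N, D) flattened non-overlapping sub-blocks
    au_embedding:  (D,)   action-unit embedding
    """
    N, D = image_patches.shape
    # 1. Learnable class token (LCT) scores each sub-block; low-weight
    #    blocks are masked out (a random linear scorer stands in for
    #    the trained LCT parameters).
    w_lct = rng.standard_normal(D)
    scores = image_patches @ w_lct
    keep = np.argsort(scores)[-int(N * keep_ratio):]
    kept = image_patches[keep]                      # (N_keep, D)
    # 2. Placeholder "MSA" step: softmax self-attention over kept blocks.
    attn = kept @ kept.T / np.sqrt(D)
    attn = np.exp(attn - attn.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    z = attn @ kept                                 # (N_keep, D)
    # 3. AUDF: replicate the AU embedding over the kept blocks and fuse
    #    by element-wise multiplication.
    z = z * au_embedding[None, :]
    # 4. Classification head: pool and fuse with the AU embedding again.
    logits = np.concatenate([z.mean(axis=0), au_embedding])
    return logits

patches = rng.standard_normal((16, 8))
au = rng.standard_normal(8)
out = madfn_forward(patches, au)
print(out.shape)  # (16,)
```

Each of the four numbered steps is detailed in the subsections that follow.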
Different from fusion by feature concatenation, addition, or multiplication, this paper embeds the action unit dynamic fusion module into the transformer encoder model and uses the AU embedding to enhance the local feature representation of the micro-expression image, thereby increasing the discrimination of image features and improving micro-expression recognition performance. The framework of the multimodal dynamic attention fusion network is shown in Figure 2.

Image Autoencoders Based on Learnable Class Token
The small size of micro-expression datasets severely limits model fitting, while the micro-expression image must be divided into regular non-overlapping image sub-blocks in the multimodal dynamic attention fusion network model. If these image sub-blocks are input directly to the vision transformer, they cause information redundancy. The low intensity of micro-expression movement results in slight differences between images of one subject across different categories but large differences between images of different subjects within the same category. Therefore, micro-expression recognition can be regarded as a fine-grained image classification problem, and more attention should be paid to the distinguishability of local image features.
For the local perception of fine-grained image classification, He et al. [45] proposed the Masked Autoencoder (MAE) model, which adopts a random sampling (RS) module that masks a large number of image sub-blocks to reduce redundancy. Compared with block-wise sampling (BS) and grid-wise sampling (GS), random sampling can construct efficient feature representations from highly sparse image sub-blocks. However, the uncertainty of random sampling may remove image sub-blocks with high representational power.
Therefore, this section proposes an image autoencoder pre-training model based on a learnable class token. The model structure is shown in Figure 3.

Specifically, the image is divided into regular non-overlapping image patches x_p. These patches are masked by the LCT module, and the high-weight image sub-blocks are input to the encoder network. The encoder uses the image path of the multimodal dynamic attention fusion network to extract the representations of the local sub-blocks. Then, these representations and the learnable class token are reconstructed according to their original positions and input to the decoder network to restore the original image. In the second iteration, the apex frame is input to the autoencoder and the onset frame is output, so that the emotional representation in the apex frame is extracted for micro-expression recognition.
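The division into regular non-overlapping patches x_p can be done with a single reshape/transpose. The sketch below assumes a square image whose side divides evenly by the patch size; patch size and channel count are illustrative, not values from the paper.

```python
import numpy as np

def patchify(img, p):
    """Split an (H, W, C) image into non-overlapping p x p patches,
    returning (H//p * W//p, p*p*C) flattened sub-blocks x_p."""
    H, W, C = img.shape
    assert H % p == 0 and W % p == 0, "image must divide evenly into patches"
    x = img.reshape(H // p, p, W // p, p, C)
    x = x.transpose(0, 2, 1, 3, 4)           # (H//p, W//p, p, p, C)
    return x.reshape(-1, p * p * C)

img = np.arange(12 * 12 * 3).reshape(12, 12, 3).astype(float)
patches = patchify(img, 4)
print(patches.shape)  # (9, 48)
```

The decoder side of the autoencoder simply inverts this operation to restore the original image from the reconstructed patches.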
The LCT module is a fully connected layer whose input is a feature vector with the same length as the image sub-blocks x_p; its output weights are sorted to remove the sub-blocks with low weights. The module can be expressed as

x_s = { x_p^i | w_i = 1 }, with w_i = 1 if σ(w_lmt · x_p^i) ≥ θ and w_i = 0 otherwise,

where x_s denotes the image sub-blocks sampled by the LCT module, w_lmt is the parameter of the LCT module, w_i is the binary mask token projection corresponding to each image sub-block, θ is the mask weight threshold, and µ is the proportion of all image sub-blocks masked by the LCT module (the threshold is chosen so that a proportion µ of the sub-blocks is masked).
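A minimal sketch of this masking step is given below. The sigmoid scorer and the way θ and µ interact are assumptions consistent with the description above (the paper's exact equation is not reproduced here); `w_lmt` stands in for the trained LCT parameters.

```python
import numpy as np

def lct_mask(x_p, w_lmt, theta=0.5, mu=0.75):
    """Sketch of the LCT masking step.

    x_p:   (N, D) flattened image sub-blocks
    w_lmt: (D,)   LCT parameters (a single fully connected layer)
    theta: weight threshold for the binary mask tokens w_i
    mu:    proportion of sub-blocks to mask out
    """
    # Sigmoid scores in (0, 1); w_i is the binary mask token per sub-block.
    s = 1.0 / (1.0 + np.exp(-(x_p @ w_lmt)))
    w = (s >= theta).astype(int)
    # Enforce the masking proportion mu: keep at most the (1 - mu)
    # fraction of sub-blocks with the highest scores.
    n_keep = max(1, int(round(len(x_p) * (1.0 - mu))))
    keep = np.argsort(s)[-n_keep:]
    keep = keep[w[keep] == 1] if w[keep].any() else keep
    return x_p[keep], keep

rng = np.random.default_rng(1)
x = rng.standard_normal((16, 8))
w = rng.standard_normal(8)
x_s, idx = lct_mask(x, w)
print(x_s.shape)
```

With N = 16 and µ = 0.75, at most four high-weight sub-blocks x_s survive and are passed to the encoder.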
In the model pre-training process, the parameters are updated through two loss functions, which (assuming the mean-squared-error form used in MAE) can be written as

l_i = || x_i − x̃_i ||², l_o = || x_o − x̃_o ||²,

where l_i is the loss function of the image autoencoder, x_i is the i-th frame of the micro-expression video sample, l_o is the loss function of the apex-to-onset frame mapping, x_o is the onset frame, and x̃_i and x̃_o are the corresponding generated face images.

Different from the random sampling of image sub-blocks in the MAE model, this paper removes low-weight image sub-blocks through the learnable class token module. The learned mask sub-blocks are rearranged in the order in which they were removed, and masked sub-blocks with low weights are again selected for deletion. This cycle repeats until the best high-weight local region is selected for micro-expression recognition. By deleting a large number of image sub-blocks, this learnable method greatly reduces information redundancy.
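The two objectives above can be sketched directly; note that the mean-squared-error form is an assumption (the paper does not spell out the loss), taken from the original MAE.

```python
import numpy as np

def reconstruction_losses(x_i, x_i_rec, x_o, x_o_rec):
    """The two pre-training objectives as plain mean-squared errors.

    l_i: image reconstruction loss for frame x_i
    l_o: apex-to-onset mapping loss for the onset frame x_o
    x_i_rec, x_o_rec: the generated face images (x-tilde in the text)
    """
    l_i = np.mean((x_i - x_i_rec) ** 2)
    l_o = np.mean((x_o - x_o_rec) ** 2)
    return l_i, l_o

x = np.ones((4, 4))
li, lo = reconstruction_losses(x, x * 0.5, x, x)
print(li, lo)  # 0.25 0.0
```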

Vision Transformer Model Based on Action Unit Dynamic Fusion
Due to the low intensity of micro-expression facial motion, it is difficult to obtain highly discriminative local representations, which affects the performance of facial micro-expression recognition. The ViT model has been widely used in computer vision [46]. AU embeddings have been proven to help extract more effective feature representations in micro-expression recognition, but how to dynamically add AU information to the feature extraction process remains an open question in the current research field. Inspired by dynamic filters [47][48][49][50], this paper proposes an action unit dynamic fusion module that adds the AU embedding to the vision transformer encoder model to enhance the discriminability of micro-expression image representations.
First, the image representations z_0 of the remaining image sub-blocks are normalized and input to the MSA model to calculate the attention weight of each image sub-block. For each subspace, three feature matrices W_Q,i, W_K,i, and W_V,i linearly map the image sub-blocks to obtain the query Q_i, key K_i, and value V_i matrices of the MSA module. Then, Q_i and K_i perform a dot product, and SoftMax yields the attention probability distribution of each image sub-block, which is multiplied by the value matrix to obtain the attention-weighted representation of the sub-blocks. Finally, the outputs of all subspaces in MSA are concatenated and projected to obtain the final feature output z'_0.
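The MSA computation described above can be sketched in numpy as follows; head count and dimensions are illustrative, and the final projection matrix `w_o` stands in for the concatenation-and-projection step.

```python
import numpy as np

def msa(z, weights, w_o):
    """Multi-head self-attention over the kept sub-blocks (numpy sketch).

    z:       (N, D) normalized sub-block representations z_0
    weights: list of (W_Q, W_K, W_V) triples, one per subspace/head,
             each of shape (D, d_head)
    w_o:     (h * d_head, D) projection applied to the concatenated heads
    """
    heads = []
    for w_q, w_k, w_v in weights:
        Q, K, V = z @ w_q, z @ w_k, z @ w_v
        # Scaled dot-product attention: SoftMax over the Q.K^T scores
        # gives the attention distribution over sub-blocks.
        a = Q @ K.T / np.sqrt(w_q.shape[1])
        a = np.exp(a - a.max(axis=-1, keepdims=True))
        a /= a.sum(axis=-1, keepdims=True)
        heads.append(a @ V)
    # Concatenate all subspaces and project to the output z'_0.
    return np.concatenate(heads, axis=-1) @ w_o

rng = np.random.default_rng(2)
N, D, h, dh = 6, 8, 2, 4
z = rng.standard_normal((N, D))
ws = [tuple(rng.standard_normal((D, dh)) for _ in range(3)) for _ in range(h)]
w_o = rng.standard_normal((h * dh, D))
out = msa(z, ws, w_o)
print(out.shape)  # (6, 8)
```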
In the basic AUDF module, the AU-encoded features are first replicated to match the number of image sub-blocks and then multiplied with the output of MSA. The AUDF module thus uses dynamic multiplication to fuse the AU embedding into the local feature extraction process and increase the discrimination of facial emotional representations. The calculation can be written as

z'_l = Reshape(LN(z_e)) ⊙ z_l,

where z_l is the local attention representation output by the MSA model, z'_l is the output of the AUDF module, z_e is the additional facial AU embedding, Reshape() transforms the one-dimensional feature into a two-dimensional matrix, and LN() represents a fully connected layer. However, facial representations and AU embeddings are mutually complementary and interdependent in micro-expression recognition. Therefore, to further strengthen the enhancement effect of the AU embedding on facial representations, this paper introduces an AUDF enhancement module, AUDF-E.
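The basic AUDF fusion can be sketched as below; the projection shape of `w_ln` is a hypothetical choice, and `np.tile` plays the role of Reshape() by replicating the projected AU feature once per sub-block.

```python
import numpy as np

def audf(z_l, z_e, w_ln):
    """Basic AUDF fusion sketch: project the AU embedding with a fully
    connected layer (LN in the text), replicate it across the image
    sub-blocks, and multiply element-wise with the MSA output.

    z_l:  (N, D) MSA output for the kept sub-blocks
    z_e:  (E,)   AU embedding
    w_ln: (E, D) fully connected projection (hypothetical shape)
    """
    g = z_e @ w_ln                       # (D,) projected AU features
    g = np.tile(g, (z_l.shape[0], 1))    # replicate per sub-block
    return g * z_l                       # dynamic multiplicative fusion

rng = np.random.default_rng(3)
z_l = rng.standard_normal((4, 8))
z_e = rng.standard_normal(5)
out = audf(z_l, z_e, rng.standard_normal((5, 8)))
print(out.shape)  # (4, 8)
```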
In AUDF-E, the attention weights output by MSA are first down-sampled and mapped to a one-dimensional feature z_i. This feature is concatenated with the AU embedding z_e and then linearly mapped to the same dimension as each local image sub-block. Finally, the replicated and concatenated features are multiplied with the output of MSA. AUDF-E can be written as

z'_l = Reshape(LN(Concat(z_i, z_e))) ⊙ z_l.

The AUDF module dynamically enhances the local image representations, and the facial emotion representations are obtained through the MLP module with residual connection and normalization. Finally, the output of the vision transformer encoder is fused with the AU embedding to obtain the final classification result.
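The enhanced variant differs from the basic module only in conditioning the projection on both the down-sampled attention output and the AU embedding. In this sketch, mean pooling stands in for the down-sampling step and all shapes are assumptions.

```python
import numpy as np

def audf_e(z_l, z_e, w_ln):
    """AUDF-E sketch: down-sample the MSA output to a one-dimensional
    feature z_i, concatenate it with the AU embedding z_e, project to
    the sub-block dimension, replicate, and multiply with the MSA
    output.

    z_l:  (N, D) MSA output
    z_e:  (E,)   AU embedding
    w_ln: (D + E, D) fully connected projection (hypothetical shape)
    """
    z_i = z_l.mean(axis=0)                    # down-sample to 1-D, (D,)
    g = np.concatenate([z_i, z_e]) @ w_ln     # project to (D,)
    return np.tile(g, (z_l.shape[0], 1)) * z_l

rng = np.random.default_rng(4)
z_l = rng.standard_normal((4, 8))
z_e = rng.standard_normal(5)
out = audf_e(z_l, z_e, rng.standard_normal((13, 8)))
print(out.shape)  # (4, 8)
```

Unlike the basic module, the fused weights here depend on the current image content, which is what makes the facial representation and the AU embedding mutually reinforcing.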
The prediction can be written as

p = SoftMax(LN(Concat(z_L^0, z_e))),

where p is the micro-expression prediction probability of the MADFN model and z_L^0 is the category label output. Finally, the class token z_L^0 is replaced with the parameters of the LCT corresponding to the micro-expression, and the sub-blocks with high attention weights are used for the next iteration. During MADFN training, the focal loss is used to reduce the impact of category imbalance.
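The focal loss is the standard multi-class form; a minimal sketch follows. The focusing parameter γ = 2 is the common default, not a value reported by the paper.

```python
import numpy as np

def focal_loss(logits, label, gamma=2.0):
    """Multi-class focal loss used to counter class imbalance:
    FL(p_t) = -(1 - p_t)^gamma * log(p_t), where p_t is the softmax
    probability of the true class."""
    z = logits - logits.max()                # numerically stable softmax
    p = np.exp(z) / np.exp(z).sum()
    pt = p[label]
    return -((1.0 - pt) ** gamma) * np.log(pt)

# A confident correct prediction is strongly down-weighted, so the
# rare, hard examples dominate the gradient.
easy = focal_loss(np.array([5.0, 0.0, 0.0]), 0)
hard = focal_loss(np.array([5.0, 0.0, 0.0]), 1)
print(easy < hard)  # True
```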

Results and Analysis
In this section, the analysis and comparison of experimental results, the ablation experiments, and the visualization analysis are introduced in detail. The proposed MADFN model was verified experimentally on three public facial micro-expression datasets, SMIC, CASME II, and SAMM, and their combination, 3DB-Combined.

CASME II
Table 2 shows the comparison of the MADFN model with the two types of baseline methods on the CASME II dataset for three-class recognition. Compared with the local feature methods, the MADFN model outperforms the existing baselines. Although the F1-Score of the MERSiamC3D model with global features is 0.0205 higher than that of MADFN, MERSiamC3D uses key-frame images of video sequences for recognition, and its model structure is more complex. Compared with TSGACN, the best of the local feature methods, which fuses facial key points and optical flow features, MADFN enhances facial local features through AU embedding, and its accuracy and F1-Score are higher than those of TSGACN by 0.41% and 0.6641, respectively. At the same time, the experiments show that although TSGACN achieves higher recognition accuracy, its F1-Score is slightly lower, indicating that it does not consider the influence of sample imbalance in CASME II; similar results emerged on the SAMM dataset.

SAMM
Table 3 shows the comparison of the MADFN model with the two types of baseline methods on the SAMM dataset for three-class recognition. Compared with the global feature methods, MADFN achieves the best experimental results. However, among the local feature methods, MADFN still differs considerably from the TSGACN model. TSGACN achieves excellent performance on the SAMM dataset with its specialized model, but it focuses mainly on raw performance. The recognition performance of the MADFN model is slightly inferior to that of TSGACN, but MADFN pays more attention to generalization ability, the handling of unbalanced data, and model complexity. The experimental results show that MADFN can improve classification accuracy by fusing overall and local features while alleviating the micro-expression sample imbalance problem.

MEGC2019
Table 4 shows the comparison of the MADFN model with the two types of baseline methods on the MEGC2019 benchmark for three-class recognition. On the three subsets SMIC, CASME II, and SAMM, the proposed MADFN achieves state-of-the-art recognition results. On the combined 3DB-Combined dataset, the MADFN model also achieves competitive performance. Compared with PLAN, the best of the local feature methods, MADFN improves the UF1 and UAR indicators by 0.012 and 0.0024, respectively, which further demonstrates the effectiveness of the model.

Ablation Experiment Analysis
A detailed analysis is carried out in the ablation experiments to evaluate the effectiveness of the local feature extraction of the MADFN. This section conducts ablation experiments on three aspects, the basic model, the mask sampling strategy, and the fusion strategy, on the SMIC, CASME II, and SAMM datasets. The different backbone model structures are listed in Table 5. It was found that on SMIC and SAMM, although the performance of ViT-H was higher than that of the ViT-B structure, it was slightly inferior to that of ViT-L. On CASME II, ViT-H achieved the best results. The reason is that the size of the micro-expression datasets does not support training a large-scale model; the available data remain small-scale. Therefore, in the follow-up experiments, this paper uses ViT-L as the backbone network for model training, and the experimental results are shown in Table 6. Based on the chosen backbone network, the impact of different mask sampling strategies on the performance of the DViT model is further compared, specifically the four strategies BS, GS, RS, and LS. Among them, block sampling randomly masks out large image blocks, grid sampling masks out three of every four small image blocks, and random sampling masks out a large proportion of small image blocks; the different sampling strategies are illustrated in the figure. The experimental results are shown in Table 7. On the three comparison datasets, random sampling and learning sampling perform much better than block sampling and grid sampling. Compared with the random sampling strategy, the learning sampling strategy further improves recognition accuracy. This is because mask sampling through learning avoids the uncertainty introduced by random mask sampling, so a highly discriminative local region of interest can be obtained through an accurate mask strategy, thereby extracting more robust local features and improving the performance of micro-expression recognition.
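The block, grid, and random sampling strategies compared above can be sketched directly on a patch grid. The snippet below is an illustrative reimplementation rather than the authors' code; the 14x14 grid and 75% mask ratio are assumed values (a mask entry of True means the patch is removed).

```python
import numpy as np

def grid_sampling(n_side):
    """Grid sampling (GS): mask three of every four small patches,
    keeping one patch per 2x2 cell."""
    mask = np.ones((n_side, n_side), dtype=bool)
    mask[::2, ::2] = False  # keep the top-left patch of each 2x2 cell
    return mask

def random_sampling(n_side, mask_ratio=0.75, seed=0):
    """Random sampling (RS): mask a large fraction of patches uniformly."""
    rng = np.random.default_rng(seed)
    n = n_side * n_side
    mask = np.zeros(n, dtype=bool)
    mask[rng.choice(n, int(n * mask_ratio), replace=False)] = True
    return mask.reshape(n_side, n_side)

def block_sampling(n_side, block=7):
    """Block sampling (BS): mask one large contiguous image region."""
    mask = np.zeros((n_side, n_side), dtype=bool)
    mask[:block, :block] = True
    return mask

# a 14x14 patch grid (e.g., a 224x224 image with 16x16 patches)
g = grid_sampling(14)
r = random_sampling(14)
b = block_sampling(14)
```

Learning sampling (LS) differs from these three in that the kept positions are chosen by a trained scoring module rather than a fixed or random pattern.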

Visualization Analysis
The performance and scale of the AUDF model are largely determined by the learnable mask tokens. To further explain the impact of the learnable mask tokens on model performance, the behavior of AUDF is visualized through Grad-CAM. Figures 6-8 show the correspondence between the learnable mask tokens and Grad-CAM on the SMIC, CASME II, and SAMM datasets: the first column is the sub-block division of the original image, the second column is the mask of the original image produced by the LCT module, the third column is the LCT mask shown on the Grad-CAM map, and the fourth column is the Grad-CAM visualization. The figures show that LCT can mask out areas that have little influence on the category weights and extract emotional features from local areas with high attention weights for micro-expression recognition.
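The Grad-CAM maps used in these figures reduce to a gradient-weighted sum of feature maps followed by a ReLU. The following NumPy sketch illustrates the computation on made-up activation and gradient tensors, not the actual MADFN features:

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM: weight each feature map by the spatial mean of the
    class-score gradient, sum the maps, and keep positive evidence."""
    weights = gradients.mean(axis=(1, 2))                  # (C,)
    cam = np.tensordot(weights, activations, axes=(0, 0))  # (H, W)
    cam = np.maximum(cam, 0)                               # ReLU
    if cam.max() > 0:
        cam = cam / cam.max()                              # normalize to [0, 1]
    return cam

# toy tensors: C=4 feature maps on a 7x7 spatial grid
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 7, 7))   # stand-in activations
dA = rng.standard_normal((4, 7, 7))  # stand-in class-score gradients
heatmap = grad_cam(A, dA)
```

In practice the activations and gradients come from a chosen layer of the trained network, and the low-resolution heatmap is upsampled onto the face image.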

Conclusions
In this paper, a multimodal dynamic attention fusion network method is proposed to enhance the local features of facial images with facial action unit embeddings. To address the parameter complexity of the vision transformer model, a learnable class token is proposed to sample a subset of patches with high attention weights, reducing the computational complexity of facial image feature extraction. The action unit dynamic fusion module adds action unit embedding information during local facial feature extraction to improve the distinguishability of the image's emotional features. The performance of the model is evaluated and verified on the SMIC, CASME II, SAMM, and combined 3DB-Combined datasets. The experimental results show that the MADFN model can perform feature fusion through dynamic mapping, which helps improve the performance of micro-expression recognition.
The research in this paper mainly addresses micro-expression recognition in pre-determined videos; how to locate the occurrence of micro-expressions in real environments remains an open problem. In a real environment, micro-expressions often occur to conceal one's true emotions and are therefore frequently accompanied by macro-expressions. How to locate micro-expressions amid complex environments and emotional changes is also a key topic for future work.

Figure 1. The framework of the MADFN method. The apex frame of the micro-expression video clip is input to an LCT module to remove local areas with low emotional expressiveness. Then, the AUDF module is added to the vision transformer encoder to fuse the action unit representation and improve the latent representation ability of the image features. Finally, the local image features with high attention weights are fused with action unit representations for micro-expression recognition.

Figure 2. The framework structure of the multimodal dynamic attention fusion network.
Therefore, this section proposes an image autoencoder pre-training model based on a learnable class token; the model structure is shown in Figure 3. The model utilizes a learning sampling (LS) module to remove local image sub-blocks that contribute little to micro-expression recognition, reducing the complexity of the pre-training model and improving model performance while focusing on the emotional feature representation of local areas.
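The sub-block removal above operates on a grid of non-overlapping patches. As a minimal illustration (the 28x28 image and 14x14 patch size are arbitrary here), an image can be split into flattened patches as follows:

```python
import numpy as np

def patchify(img, p):
    """Split an (H, W) image into non-overlapping p x p patches,
    returned as an (N, p*p) matrix of flattened patches."""
    H, W = img.shape
    assert H % p == 0 and W % p == 0
    patches = img.reshape(H // p, p, W // p, p).swapaxes(1, 2)
    return patches.reshape(-1, p * p)

img = np.arange(28 * 28, dtype=float).reshape(28, 28)
x_p = patchify(img, 14)  # 4 patches of 196 pixels each
```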

Figure 3. The image autoencoder structure based on a learnable class token. The image autoencoder based on learnable class tokens is an end-to-end pre-training model, and its iteration is divided into two parts. First, all images of the micro-expression video samples are input to the autoencoder for training to obtain high-dimensional representations of the facial image. Specifically, the image is divided into regular, non-overlapping image patches x_p. These patches are masked by the LCT module, and the high-weight image sub-blocks are input to the encoder network. The encoder uses the image path of the multimodal dynamic attention fusion network to extract the representations of the local sub-blocks. Then, these representations and the learnable class token are reconstructed according to their original positions and input to the decoder network to restore the original image. In the second iteration, the apex frame is input to the autoencoder, and the output onset frame extracts the emotional representation in the apex frame for micro-expression recognition. The LCT module is a fully connected layer whose input is a feature vector with the same length as the image sub-blocks x_p; its output is sorted to remove the sub-blocks corresponding to low weights.
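A minimal sketch of this scoring-and-selection step is given below; the single linear scoring layer and the fixed keep ratio are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def lct_select(x_p, w, b, keep_ratio=0.25):
    """Score each patch embedding with a fully connected layer, sort the
    scores, and keep only the highest-weight patches (LCT-style masking)."""
    scores = x_p @ w + b                     # (N,): one weight per patch
    k = max(1, int(len(scores) * keep_ratio))
    keep = np.sort(np.argsort(scores)[-k:])  # indices of the k highest scores
    return keep, x_p[keep]

rng = np.random.default_rng(0)
x_p = rng.standard_normal((16, 32))          # 16 patch embeddings, dim 32
w, b = rng.standard_normal(32), 0.0          # stand-in learned parameters
idx, kept = lct_select(x_p, w, b)
```

During training, the scoring weights are learned jointly with the autoencoder, so the kept positions move toward the emotionally discriminative face regions.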
Studies have shown that in image classification tasks, the ViT model can improve recognition performance by focusing on the attention weights of image sub-blocks. However, due to the complexity of the network structure, the ViT model usually requires large-scale data for training. Therefore, we first apply extensive mask operations to the image through the LCT module to reduce the complexity of the model. Then, the representations of the retained image sub-blocks are input into the vision transformer based on action unit dynamic fusion, which fuses the facial AU embedding to recognize the emotional state of the face. The vision transformer model based on the action unit dynamic fusion structure is shown in Figure 4.
Entropy 2023, 25, x FOR PEER REVIEW

Figure 4. The vision transformer model based on the action unit dynamic fusion structure. The improved ViT encoder includes L layers of MSA, AUDF, and MLP modules. The single-layer MSA, AUDF, and MLP model structures are shown in Figure 5.
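The per-layer ordering of MSA, AUDF, and MLP can be sketched as follows. The additive gating used for the AUDF step is an assumption for illustration only, since the exact fusion mapping is not reproduced in this excerpt, and the attention and MLP sub-modules are stubbed out as identity functions:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mu = x.mean(-1, keepdims=True)
    sd = x.std(-1, keepdims=True)
    return (x - mu) / (sd + eps)

def encoder_layer(x, au_emb, msa, mlp, alpha=0.1):
    """One improved ViT encoder layer: self-attention, AU dynamic fusion
    (sketched here as a gated addition of the AU embedding), then the MLP."""
    x = x + msa(layer_norm(x))  # MSA sub-layer with residual connection
    x = x + alpha * au_emb      # AUDF: inject the AU embedding (assumed form)
    x = x + mlp(layer_norm(x))  # MLP sub-layer with residual connection
    return x

# identity stand-ins for the learned sub-modules
msa = mlp = lambda t: t
rng = np.random.default_rng(0)
tokens = rng.standard_normal((5, 8))  # 5 kept patch tokens, embedding dim 8
au = rng.standard_normal(8)           # one action unit embedding vector
out = encoder_layer(tokens, au, msa, mlp)
```

Stacking L such layers gives the improved encoder of Figure 4, with the AU embedding injected at every layer rather than only at the input.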

4.2.1. Basic Model
This section first compares the influence of three different backbone networks, ViT-Base (ViT-B/16), ViT-Large (ViT-L/16), and ViT-Huge (ViT-H/14), on micro-expression recognition. The backbone network parameters are shown in Table 5.

Figure 6. The correspondence between the learnable class token and Grad-CAM in SMIC. (a) is a sub-block division of the original image; (b) shows the image sub-blocks masked by the learnable class token; (c) shows the learnable class token mask on the corresponding Grad-CAM map; (d) is the Grad-CAM image.

Figure 7. The correspondence between the learnable class token and Grad-CAM in CASME II. (a) is a sub-block division of the original image; (b) shows the image sub-blocks masked by the learnable class token; (c) shows the learnable class token mask on the corresponding Grad-CAM map; (d) is the Grad-CAM image.

Figure 8. The correspondence between the learnable class token and Grad-CAM in SAMM. (a) is a sub-block division of the original image; (b) shows the image sub-blocks masked by the learnable class token; (c) shows the learnable class token mask on the corresponding Grad-CAM map; (d) is the Grad-CAM image.
et al. developed the Local Binary Pattern with Six Intersection Points (LBP-SIP) to reduce the information redundancy of the LBP-TOP feature and, thus, its time-space complexity. The Spatio-Temporal Local Binary Pattern with Integral Projection (STLBP-IP) was proposed by Huang et al. to enhance the properties of LBP-TOP through integral projection. By using Sparsity-Promoting Dynamic Mode Decomposition (DMDSP) to remove neutral expressions from micro-expression videos, Le Ngo et al. managed to achieve a high recognition rate. To overcome the sparsity problem of the LBP features, Huang et al.

Table 1 shows the comparison results of the MADFN model against the two types of baseline methods for three-class recognition on the SMIC dataset. The accuracy of the MADFN model is 6.07% and 11.2% higher than that of KTGSL, the best global-feature method, and SMDMO, the best local-feature model, respectively. The F1-score of our model is 0.0966 and 0.1161 higher than that of TSCNN, the best global-feature model, and SMDMO, the best local-feature model, respectively. The effectiveness of the MADFN model is demonstrated by the comparison with the two classes of baseline methods.

Table 1 .
The performance comparison of MADFN and two types of models on the SMIC dataset.

Table 2 .
The performance comparison of MADFN and two types of models on the CASME II dataset.

Table 3 .
The performance comparison of MADFN and two types of models on the SAMM dataset.

Table 4 .
The performance comparison of MADFN and two types of models on the MEGC2019 dataset.

Table 5 .
The different backbone model structure settings.

Table 6 .
The influence of different backbone models.