M3ENet: A Multi-Modal Fusion Network for Efficient Micro-Expression Recognition
Abstract
1. Introduction
- We propose a unified five-stream CNN architecture that integrates optical flow and RGB modalities, enabling joint modeling of the motion and appearance cues critical for micro-expression recognition (a minimal sketch of this multi-stream fusion pattern follows this list).
- To tackle sample scarcity and label imbalance in micro-expression datasets, we introduce a customized data augmentation pipeline and incorporate focal loss, leading to consistent performance gains across standard benchmarks.
- We conduct extensive experiments on publicly available datasets and demonstrate that our approach achieves state-of-the-art (SOTA) performance while maintaining high computational efficiency, offering a strong baseline for robust and interpretable multi-modal micro-expression analysis.
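To make the first two contributions concrete, the following is a minimal, hedged PyTorch sketch of the general pattern they describe: several small CNN branches encode individual input maps (optical flow components and RGB frames), their features are concatenated for classification, and training uses focal loss [52] to counter class imbalance. The stream contents, channel widths, layer counts, and input sizes here are illustrative assumptions, not the authors' exact M3ENet configuration (see Section 3).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StreamEncoder(nn.Module):
    """Small CNN branch applied to one input map (e.g., one flow component or an RGB frame)."""
    def __init__(self, in_ch, feat_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(32, feat_dim)

    def forward(self, x):
        return self.fc(self.net(x).flatten(1))

class MultiStreamFusionNet(nn.Module):
    """Five-stream late fusion; per-stream channel counts are assumptions for illustration."""
    def __init__(self, stream_channels=(1, 1, 1, 1, 3), num_classes=5):
        super().__init__()
        self.encoders = nn.ModuleList([StreamEncoder(c) for c in stream_channels])
        self.classifier = nn.Linear(64 * len(stream_channels), num_classes)

    def forward(self, streams):
        feats = [enc(x) for enc, x in zip(self.encoders, streams)]
        return self.classifier(torch.cat(feats, dim=1))

def focal_loss(logits, targets, gamma=2.0, alpha=1.0):
    """Multi-class focal loss [52]: down-weights well-classified samples."""
    ce = F.cross_entropy(logits, targets, reduction="none")
    pt = torch.exp(-ce)                        # probability assigned to the true class
    return (alpha * (1 - pt) ** gamma * ce).mean()

# Toy forward/backward pass with five 48x48 input maps (sizes are illustrative).
model = MultiStreamFusionNet()
streams = [torch.randn(8, c, 48, 48) for c in (1, 1, 1, 1, 3)]
logits = model(streams)
loss = focal_loss(logits, torch.randint(0, 5, (8,)))
loss.backward()
```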
2. Related Work
2.1. ME Datasets
- SMIC [21]: SMIC (Spontaneous Micro-Expression Corpus) was the first public spontaneous ME dataset. Its high-speed (HS, 100 fps) subset contains 164 micro-expression clips from 16 participants, while the visible-light (VIS, 25 fps) and near-infrared (NIR, 25 fps) subsets each contain 71 clips from 8 participants, all captured while participants watched emotion-inducing videos under instructions to maintain a neutral expression. Each clip is labeled via participant self-report as “positive,” “negative,” or “surprise,” and FACS-trained coders provide frame-level annotations to ensure reliability.
- CASME [22]: CASME (Chinese Academy of Sciences Micro-Expression) contains 195 spontaneous MEs elicited from 35 participants. Videos were captured at 60 fps in a controlled setting, with onset, apex, and offset frames coded for each sample. Emotion labels are derived from a combination of participant self-reports and FACS action-unit (AU) coding.
- CASME II [23]: CASME II builds on CASME with 247 spontaneous micro-expression samples from 26 participants. Videos were recorded at an improved 200 fps, with the facial region captured at roughly 280 × 340 px. Labeling combines AUs, subjective reports, and contextual video information, with ambiguous cases tagged as “others”. It is the most widely used dataset for MER.
- CAS(ME)2 [24]: CAS(ME)2 extends CASME II by capturing both spontaneous micro- and macro-expressions from the same 22 participants. It contains 57 micro-expression samples and 300 cropped macro-expression samples, all recorded at 30 fps. In addition, 87 long video sequences are provided for joint evaluation of expression spotting and recognition. Emotion labels are assigned based on a combination of facial action units (AUs), the nature of the emotion-inducing videos, and participant self-reports, categorized into “positive,” “negative,” “surprise,” and “others.” The dataset facilitates research into cross-scale expression dynamics and supports both recognition and temporal spotting tasks.
- SAMM [25]: SAMM (Spontaneous Actions and Micro-Movements) is a high-resolution spontaneous ME dataset comprising 159 micro-expression samples recorded at 200 fps under controlled lighting conditions, with frames captured at 2040 × 1088 resolution and a facial region of about 400 × 400 px. The dataset emphasizes diversity by including 32 participants from 13 different ethnicities. Each ME is annotated with onset, apex, and offset frames, as well as corresponding facial action units (AUs). Emotion categories include contempt, disgust, fear, anger, sadness, happiness, and surprise. To induce genuine emotional responses, participant-specific video stimuli and incentive-based protocols were employed. An extended version, SAMM Long Videos [27], provides 147 long videos containing both micro- and macro-expressions, further supporting research in expression spotting across temporal scales.
- MMEW [26]: MMEW (Macro- and Micro-Expression Warehouse) is a multi-modal database containing 300 micro- and 900 macro-expression clips, covering six basic emotions (happiness, surprise, anger, disgust, fear, sadness). Data are available in RGB, depth, and infrared formats, with naming conventions that simplify emotion-specific retrieval. It supports deep learning research due to its scale and richness.
- Composite Dataset [28] and CMED [29]: The Composite Dataset (from MEGC2019) merges CASME II, SAMM, and SMIC-HS, unifying emotion labels into positive, negative, and surprise to standardize evaluation (a minimal relabeling sketch follows this list). The Compound Micro-Expression Dataset (CMED) aggregates MEs from CASME, CASME II, CAS(ME)2, SMIC-HS, and SAMM, categorizing expressions into basic and compound emotions to reflect the psychological realism of naturally co-occurring affective states.
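Because the Composite Dataset reduces heterogeneous label sets to three classes, the harmonization step amounts to a small lookup table. The sketch below illustrates one commonly used mapping in the spirit of MEGC2019; the exact per-dataset mapping is defined in [28], so the dictionary contents here should be treated as an assumption rather than the authoritative protocol.

```python
from typing import Optional

# Illustrative three-class relabeling in the spirit of MEGC2019 [28]; take the
# authoritative per-dataset mapping from the challenge protocol, not from here.
COMPOSITE_MAP = {
    "smic":   {"positive": "positive", "negative": "negative", "surprise": "surprise"},
    "casme2": {"happiness": "positive", "surprise": "surprise",
               "disgust": "negative", "repression": "negative"},
    "samm":   {"happiness": "positive", "surprise": "surprise", "anger": "negative",
               "contempt": "negative", "disgust": "negative", "fear": "negative",
               "sadness": "negative"},
}

def to_composite(dataset: str, label: str) -> Optional[str]:
    """Return the unified label, or None if the sample is excluded (e.g., 'others')."""
    return COMPOSITE_MAP.get(dataset, {}).get(label.lower())

print(to_composite("casme2", "Repression"))  # -> negative
print(to_composite("casme2", "others"))      # -> None (excluded)
```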
2.2. Deep Learning-Based Micro-Expression Recognition Pipeline
2.3. Multi-Modal Approaches for Micro-Expression Recognition
3. Methods
3.1. Preprocessing Module
3.2. Multi-Modal Feature Extraction
3.2.1. RGB Feature Encoder
3.2.2. Optical Flow Feature Encoder
3.3. Output Module
4. Experiments and Results
4.1. Experimental Setup
4.1.1. Datasets
4.1.2. Evaluation Metrics
4.1.3. Baselines
- AlexNet [45]: A classic CNN architecture consisting of five convolutional layers and three fully connected layers. It is used here as a shallow baseline for visual feature learning.
- VGG-16 [46]: A deeper architecture with small convolution kernels (3 × 3) and a uniform design. VGG-16 is known for its strong representation ability and is commonly used for fine-grained recognition tasks.
- ResNet-18 [47]: Incorporates residual connections to mitigate the vanishing-gradient problem in deep networks. As a lightweight member of the ResNet family, ResNet-18 is well suited to small-scale MER datasets.
- GoogLeNet (Inception v1) [48]: Utilizes multi-scale convolution within inception modules, allowing more expressive features with fewer parameters.
- Off-ApexNet [49]: A dual-stream CNN model that captures subtle facial motion by computing optical flow between the onset and apex frames. The network separately processes the horizontal and vertical components of flow to learn discriminative motion features. It is one of the earliest deep learning frameworks tailored for MER tasks.
- STSTNet [17]: A lightweight three-stream 3D CNN that processes horizontal flow, vertical flow, and optical strain simultaneously (a minimal sketch of these flow-derived inputs follows this list). By employing shallow temporal convolution, it effectively models the short-term dynamics of micro-expressions. The architecture is highly efficient and well suited for small-scale MER datasets.
- HTNet [18]: A hierarchical Transformer designed for micro-expression recognition. It divides the face into four regions and uses local self-attention to capture subtle muscle movements, while an aggregation layer models interactions between eye and lip areas.
- LAENet [50]: A local self-attention encoding network that focuses on critical facial regions exhibiting subtle muscle movements. It incorporates spatial attention to selectively enhance local features relevant to micro-expressions.
- SSRLTS-ViT [51]: SSRLTS-ViT is a three-stream Vision Transformer baseline for micro-expression recognition that learns from three optical flow components to capture subtle facial motion with global context.
- CNNCapsNet [40]: This method integrates five input streams: vertical and horizontal optical flow computed between the onset–apex and apex–offset frame pairs, plus the apex grayscale image. A multi-stream CNN is used for feature extraction, followed by a Capsule Network for final classification.
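Several of the baselines above (Off-ApexNet, STSTNet, CNNCapsNet) are driven by flow-derived inputs computed between labeled key frames. The following is a hedged sketch of that preprocessing step, assuming the TV-L1 estimator [44] from opencv-contrib-python; frame selection, face alignment, and normalization details vary between methods and are omitted here.

```python
import cv2
import numpy as np

def flow_inputs(onset_gray: np.ndarray, apex_gray: np.ndarray):
    """Compute (u, v, strain) maps between an onset and an apex frame.

    Requires opencv-contrib-python for cv2.optflow; TV-L1 [44] is the estimator
    commonly used in MER pipelines, but any dense flow method would fit here.
    """
    tvl1 = cv2.optflow.DualTVL1OpticalFlow_create()
    flow = tvl1.calc(onset_gray, apex_gray, None)        # H x W x 2, float32
    u, v = flow[..., 0], flow[..., 1]

    # Optical strain magnitude from the spatial derivatives of the flow field.
    du_dy, du_dx = np.gradient(u)
    dv_dy, dv_dx = np.gradient(v)
    strain = np.sqrt(du_dx**2 + dv_dy**2 + 0.5 * (du_dy + dv_dx)**2)
    return u, v, strain

# Example with two synthetic 8-bit frames (real usage loads the labeled onset/apex frames).
onset = np.random.randint(0, 256, (128, 128), dtype=np.uint8)
apex = np.random.randint(0, 256, (128, 128), dtype=np.uint8)
u, v, strain = flow_inputs(onset, apex)
print(u.shape, v.shape, strain.shape)
```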
4.1.4. Implementation Details
4.2. Performance Evaluation
4.2.1. Overall Performance
4.2.2. Interpretability Analysis
4.2.3. Ablation Analysis
4.2.4. Efficiency and Complexity Analysis
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Zhao, K.; Zhang, S. Research on Museum Visitor Experience Based on Micro-Expression Recognition. Int. J. Psychophysiol. 2021, 168, S147.
2. Ekman, P.; Friesen, W.V. Constants across cultures in the face and emotion. J. Personal. Soc. Psychol. 1971, 17, 124.
3. Ekman, P. Lie catching and microexpressions. Philos. Decept. 2009, 1, 5.
4. Zhao, G.; Pietikainen, M. Dynamic texture recognition using local binary patterns with an application to facial expressions. IEEE Trans. Pattern Anal. Mach. Intell. 2007, 29, 915–928.
5. Wang, Y.; See, J.; Phan, R.C.W.; Oh, Y.H. LBP with six intersection points: Reducing redundant information in LBP-TOP for micro-expression recognition. In Proceedings of the Asian Conference on Computer Vision; Springer: Cham, Switzerland, 2014; pp. 525–537.
6. Wang, Y.; See, J.; Phan, R.C.W.; Oh, Y.H. Efficient spatio-temporal local binary patterns for spontaneous facial micro-expression recognition. PLoS ONE 2015, 10, e0124674.
7. Huang, X.; Zhao, G.; Hong, X.; Zheng, W.; Pietikäinen, M. Spontaneous facial micro-expression analysis using spatiotemporal completed local quantized patterns. Neurocomputing 2016, 175, 564–578.
8. Liong, S.T.; Phan, R.C.W.; See, J.; Oh, Y.H.; Wong, K. Optical strain based recognition of subtle emotions. In Proceedings of the 2014 International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS), Kuching, Malaysia, 1–4 December 2014; pp. 180–184.
9. Liong, S.T.; See, J.; Phan, R.C.W.; Le Ngo, A.C.; Oh, Y.H.; Wong, K. Subtle expression recognition using optical strain weighted features. In Proceedings of the Asian Conference on Computer Vision; Springer: Cham, Switzerland, 2014; pp. 644–657.
10. Happy, S.; Routray, A. Fuzzy histogram of optical flow orientations for micro-expression recognition. IEEE Trans. Affect. Comput. 2017, 10, 394–406.
11. Liu, Y.J.; Zhang, J.K.; Yan, W.J.; Wang, S.J.; Zhao, G.; Fu, X. A main directional mean optical flow feature for spontaneous micro-expression recognition. IEEE Trans. Affect. Comput. 2015, 7, 299–310.
12. Liong, S.T.; See, J.; Wong, K.; Phan, R.C.W. Less is more: Micro-expression recognition from video using apex frame. Signal Process. Image Commun. 2018, 62, 82–92.
13. Patel, D.; Hong, X.; Zhao, G. Selective deep features for micro-expression recognition. In Proceedings of the 2016 23rd International Conference on Pattern Recognition (ICPR), Cancun, Mexico, 4–8 December 2016; pp. 2258–2263.
14. Peng, M.; Wang, C.; Chen, T.; Liu, G.; Fu, X. Dual temporal scale convolutional neural network for micro-expression recognition. Front. Psychol. 2017, 8, 1745.
15. Bai, M.; Goecke, R. Investigating LSTM for micro-expression recognition. In Proceedings of the Companion Publication of the 2020 International Conference on Multimodal Interaction, Utrecht, The Netherlands, 25–29 October 2020; pp. 7–11.
16. Zhou, Z.; Zhao, G.; Pietikäinen, M. Towards a practical lipreading system. In Proceedings of the CVPR 2011, Colorado Springs, CO, USA, 20–25 June 2011; pp. 137–144.
17. Liong, S.; See, J.; Wong, K.; Phan, R. Shallow Triple Stream Three-dimensional CNN (STSTNet) for Micro-expression Recognition. In Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition (FG), Lille, France, 14–18 May 2019.
18. Wang, Z.; Zhang, K.; Luo, W.; Sankaranarayana, R. HTNet for micro-expression recognition. Neurocomputing 2024, 602, 128196.
19. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255.
20. Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images; Technical Report; University of Toronto, Department of Computer Science: Toronto, ON, Canada, 2009.
21. Li, X.; Pfister, T.; Huang, X.; Zhao, G.; Pietikäinen, M. A spontaneous micro-expression database: Inducement, collection and baseline. In Proceedings of the 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), Shanghai, China, 22–26 April 2013; pp. 1–6.
22. Yan, W.J.; Wu, Q.; Liu, Y.J.; Wang, S.J.; Fu, X. CASME database: A dataset of spontaneous micro-expressions collected from neutralized faces. In Proceedings of the 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), Shanghai, China, 22–26 April 2013; pp. 1–7.
23. Yan, W.J.; Li, X.; Wang, S.J.; Zhao, G.; Liu, Y.J.; Chen, Y.H.; Fu, X. CASME II: An improved spontaneous micro-expression database and the baseline evaluation. PLoS ONE 2014, 9, e86041.
24. Qu, F.; Wang, S.J.; Yan, W.J.; Li, H.; Wu, S.; Fu, X. CAS(ME)2: A database for spontaneous macro-expression and micro-expression spotting and recognition. IEEE Trans. Affect. Comput. 2017, 9, 424–436.
25. Davison, A.K.; Lansley, C.; Costen, N.; Tan, K.; Yap, M.H. SAMM: A spontaneous micro-facial movement dataset. IEEE Trans. Affect. Comput. 2016, 9, 116–129.
26. Ben, X.; Ren, Y.; Zhang, J.; Wang, S.J.; Kpalma, K.; Meng, W.; Liu, Y.J. Video-based facial micro-expression analysis: A survey of datasets, features and algorithms. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 5826–5846.
27. Yap, C.H.; Kendrick, C.; Yap, M.H. SAMM Long Videos: A spontaneous facial micro- and macro-expressions dataset. In Proceedings of the 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), Buenos Aires, Argentina, 16–20 November 2020; pp. 771–776.
28. See, J.; Yap, M.H.; Li, J.; Hong, X.; Wang, S.J. MEGC 2019—The second facial micro-expressions grand challenge. In Proceedings of the 2019 14th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2019), Lille, France, 14–18 May 2019; pp. 1–5.
29. Zhao, Y.; Xu, J. A convolutional neural network for compound micro-expression recognition. Sensors 2019, 19, 5553.
30. Viola, P.; Jones, M.J. Robust real-time face detection. Int. J. Comput. Vis. 2004, 57, 137–154.
31. Matsugu, M.; Mori, K.; Mitari, Y.; Kaneda, Y. Subject independent facial expression recognition with robust face detection using a convolutional neural network. Neural Netw. 2003, 16, 555–559.
32. Wu, H.Y.; Rubinstein, M.; Shih, E.; Guttag, J.; Durand, F.; Freeman, W. Eulerian video magnification for revealing subtle changes in the world. ACM Trans. Graph. (TOG) 2012, 31, 1–8.
33. Niklaus, S.; Liu, F. Context-aware synthesis for video frame interpolation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1701–1710.
34. Li, X.; Hong, X.; Moilanen, A.; Huang, X.; Pfister, T.; Zhao, G.; Pietikäinen, M. Towards reading hidden emotions: A comparative study of spontaneous micro-expression spotting and recognition methods. IEEE Trans. Affect. Comput. 2017, 9, 563–577.
35. Xie, H.X.; Lo, L.; Shuai, H.H.; Cheng, W.H. AU-assisted graph attention convolutional network for micro-expression recognition. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 2871–2880.
36. Yu, J.; Zhang, C.; Song, Y.; Cai, W. ICE-GAN: Identity-aware and capsule-enhanced GAN with graph-based reasoning for micro-expression recognition and synthesis. In Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN), Shenzhen, China, 18–22 July 2021; pp. 1–8.
37. Xia, Z.; Hong, X.; Gao, X.; Feng, X.; Zhao, G. Spatiotemporal recurrent convolutional networks for recognizing spontaneous micro-expressions. IEEE Trans. Multimed. 2019, 22, 626–640.
38. Lei, L.; Li, J.; Chen, T.; Li, S. A novel Graph-TCN with a graph structured representation for micro-expression recognition. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 2237–2245.
39. Schroff, F.; Kalenichenko, D.; Philbin, J. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 815–823.
40. Liu, N.; Liu, X.; Zhang, Z.; Xu, X.; Chen, T. Offset or onset frame: A multi-stream convolutional neural network with CapsuleNet module for micro-expression recognition. In Proceedings of the 2020 5th International Conference on Intelligent Informatics and Biomedical Sciences (ICIIBMS), Okinawa, Japan, 18–20 November 2020; pp. 236–240.
41. Xie, Z.; Zhao, C. Dual-branch cross-attention network for micro-expression recognition with transformer variants. Electronics 2024, 13, 461.
42. Wang, S.; Guan, S.; Lin, H.; Huang, J.; Long, F.; Yao, J. Micro-expression recognition based on optical flow and PCANet+. Sensors 2022, 22, 4296.
43. Kumar, A.J.R.; Bhanu, B. Micro-expression classification based on landmark relations with graph attention convolutional network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 1511–1520.
44. Zach, C.; Pock, T.; Bischof, H. A duality based approach for realtime TV-L1 optical flow. In Proceedings of the Joint Pattern Recognition Symposium; Springer: Berlin/Heidelberg, Germany, 2007; pp. 214–223.
45. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems 25, Lake Tahoe, NV, USA, 3–6 December 2012.
46. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
47. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
48. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1–9.
49. Gan, Y.S.; Liong, S.T.; Yau, W.C.; Huang, Y.C.; Tan, L.K. OFF-ApexNet on micro-expression recognition system. Signal Process. Image Commun. 2019, 74, 129–139.
50. Gan, Y.S.; Lien, S.E.; Chiang, Y.C.; Liong, S.T. LAENet for micro-expression recognition. Vis. Comput. 2024, 40, 585–599.
51. Zhang, H.; Yin, L.; Zhang, H.; Wu, X. Facial micro-expression recognition using three-stream vision transformer network with sparse sampling and relabeling. Signal Image Video Process. 2024, 18, 3761–3771.
52. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988.
53. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626.
| Dataset | SMIC [21] | CASME [22] | CASME II [23] | CAS(ME)2 [24] | SAMM [25] | SAMM Long Videos [27] | MMEW [26] |
|---|---|---|---|---|---|---|---|
| Subjects | 16/8/8 | 35 | 35 | 22 | 32 | 32 | 36 |
| ME samples | 164/71/71 | 195 | 247 | 57 | 159 | 159 | 300 |
| Resolution | 640 × 480 | 640 × 480 | 640 × 480 | 640 × 480 | 2040 × 1088 | 2040 × 1088 | 1920 × 1080 |
| Facial size | 190 × 230 | 150 × 90 | 250 × 340 | - | 400 × 400 | - | 400 × 400 |
| Frame rate (fps) | 100/25/25 | 60 | 200 | 30 | 200 | 200 | 90 |
| Expression classes | 3 | 8 | 5 | 4 | 7 | 7 | 7 |
| Environment | Lab | Lab | Lab | Lab | Lab | Lab | Lab |
| Ethnicities | 3 | 1 | 1 | 1 | 13 | 13 | 1 |
| AU labels | ✗ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Apex labels | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Column abbreviations: AN = AlexNet [45], V16 = VGG-16 [46], R18 = ResNet-18 [47], GN = GoogLeNet [48], OAN = Off-ApexNet [49], SSN = STSTNet [17], HN = HTNet [18], LN = LAENet [50], ViT = SSRLTS-ViT [51], CCN = CNNCapsNet [40].

| Dataset | Metric | AN | V16 | R18 | GN | OAN | SSN | HN | LN | ViT | CCN | M3ENet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CASME I | Precision | 0.338 | 0.151 | 0.177 | 0.123 | 0.340 | 0.640 | 0.051 | 0.379 | 0.809 | 0.251 | 0.780 |
| | Recall | 0.325 | 0.142 | 0.178 | 0.149 | 0.349 | 0.417 | 0.143 | 0.263 | 0.629 | 0.286 | 0.746 |
| | F1-score | 0.323 | 0.092 | 0.156 | 0.089 | 0.338 | 0.446 | 0.075 | 0.258 | 0.691 | 0.265 | 0.760 |
| | Accuracy | 0.599 | 0.346 | 0.377 | 0.366 | 0.579 | 0.623 | 0.356 | 0.541 | 0.726 | 0.586 | 0.856 |
| | AUC | 0.837 | 0.323 | 0.526 | 0.622 | 0.843 | 0.893 | 0.112 | 0.694 | 0.953 | 0.746 | 0.974 |
| CASME II | Precision | 0.349 | 0.345 | 0.288 | 0.130 | 0.490 | 0.507 | 0.078 | 0.546 | 0.737 | 0.355 | 0.718 |
| | Recall | 0.336 | 0.200 | 0.272 | 0.175 | 0.405 | 0.412 | 0.167 | 0.308 | 0.530 | 0.323 | 0.715 |
| | F1-score | 0.334 | 0.173 | 0.227 | 0.125 | 0.426 | 0.438 | 0.106 | 0.298 | 0.584 | 0.327 | 0.715 |
| | Accuracy | 0.567 | 0.486 | 0.372 | 0.468 | 0.571 | 0.633 | 0.468 | 0.573 | 0.686 | 0.523 | 0.803 |
| | AUC | 0.807 | 0.653 | 0.698 | 0.713 | 0.826 | 0.881 | 0.259 | 0.810 | 0.913 | 0.816 | 0.948 |
| CAS(ME)2 | Precision | 0.160 | 0.076 | 0.089 | 0.037 | 0.162 | 0.297 | 0.037 | 0.204 | 0.405 | 0.090 | 0.621 |
| | Recall | 0.158 | 0.123 | 0.131 | 0.125 | 0.184 | 0.320 | 0.125 | 0.181 | 0.371 | 0.136 | 0.650 |
| | F1-score | 0.126 | 0.084 | 0.069 | 0.057 | 0.167 | 0.303 | 0.057 | 0.160 | 0.362 | 0.100 | 0.631 |
| | Accuracy | 0.319 | 0.278 | 0.236 | 0.296 | 0.343 | 0.431 | 0.296 | 0.352 | 0.495 | 0.287 | 0.731 |
| | AUC | 0.655 | 0.433 | 0.598 | 0.388 | 0.690 | 0.750 | 0.122 | 0.616 | 0.815 | 0.612 | 0.921 |
| SAMM | Precision | 0.381 | 0.223 | 0.459 | 0.051 | 0.521 | 0.488 | 0.051 | 0.264 | 0.712 | 0.306 | 0.757 |
| | Recall | 0.295 | 0.150 | 0.403 | 0.142 | 0.290 | 0.352 | 0.143 | 0.201 | 0.472 | 0.279 | 0.740 |
| | F1-score | 0.288 | 0.092 | 0.416 | 0.075 | 0.293 | 0.371 | 0.075 | 0.173 | 0.522 | 0.277 | 0.747 |
| | Accuracy | 0.502 | 0.360 | 0.548 | 0.354 | 0.497 | 0.561 | 0.357 | 0.428 | 0.629 | 0.467 | 0.806 |
| | AUC | 0.748 | 0.376 | 0.777 | 0.540 | 0.769 | 0.808 | 0.122 | 0.733 | 0.863 | 0.740 | 0.944 |
| MMEW | Precision | 0.530 | 0.435 | 0.735 | 0.307 | 0.733 | 0.692 | 0.245 | 0.678 | 0.734 | 0.603 | 0.822 |
| | Recall | 0.518 | 0.409 | 0.707 | 0.358 | 0.534 | 0.585 | 0.146 | 0.495 | 0.656 | 0.487 | 0.839 |
| | F1-score | 0.513 | 0.391 | 0.715 | 0.323 | 0.554 | 0.609 | 0.073 | 0.508 | 0.682 | 0.495 | 0.829 |
| | Accuracy | 0.688 | 0.608 | 0.783 | 0.569 | 0.743 | 0.730 | 0.302 | 0.683 | 0.758 | 0.690 | 0.878 |
| | AUC | 0.897 | 0.858 | 0.939 | 0.819 | 0.943 | 0.929 | 0.387 | 0.899 | 0.941 | 0.909 | 0.976 |
| Average | Precision | 0.352 | 0.246 | 0.350 | 0.130 | 0.449 | 0.525 | 0.092 | 0.414 | 0.680 | 0.321 | 0.740 |
| | Recall | 0.327 | 0.205 | 0.338 | 0.190 | 0.352 | 0.417 | 0.145 | 0.290 | 0.532 | 0.302 | 0.738 |
| | F1-score | 0.317 | 0.166 | 0.317 | 0.134 | 0.355 | 0.433 | 0.077 | 0.280 | 0.568 | 0.294 | 0.736 |
| | Accuracy | 0.535 | 0.415 | 0.463 | 0.411 | 0.546 | 0.595 | 0.356 | 0.515 | 0.659 | 0.510 | 0.815 |
| | AUC | 0.789 | 0.528 | 0.708 | 0.616 | 0.814 | 0.852 | 0.200 | 0.750 | 0.897 | 0.765 | 0.953 |
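For readers reproducing the table above, the sketch below shows one common way to compute macro-averaged precision, recall, F1, accuracy, and one-vs-rest AUC with scikit-learn. The evaluation protocol actually used by the authors (subject-wise splits and the order in which metrics are averaged across folds, see Section 4.1.2) is not reproduced here; this is an illustrative assumption only.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_recall_fscore_support,
                             roc_auc_score)

def summarize(y_true, y_pred, y_prob):
    """Macro-averaged metrics of the kind reported above (protocol details assumed)."""
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    acc = accuracy_score(y_true, y_pred)
    auc = roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro")
    return {"Precision": prec, "Recall": rec, "F1-score": f1,
            "Accuracy": acc, "AUC": auc}

# Toy example with 5 classes; real usage aggregates predictions over evaluation folds.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 5, 200)
y_prob = rng.dirichlet(np.ones(5), 200)   # rows sum to 1, as required for multi-class AUC
y_pred = y_prob.argmax(axis=1)
print(summarize(y_true, y_pred, y_prob))
```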
| Ablation Variant | Precision | Recall | F1-score | Accuracy | AUC |
|---|---|---|---|---|---|
| Standard | 0.740 | 0.738 | 0.736 | 0.815 | 0.953 |
| w/o Data augmentation | 0.306 | 0.267 | 0.268 | 0.437 | 0.637 |
| w/o Optical flow input | 0.640 | 0.622 | 0.625 | 0.705 | 0.896 |
| w/o RGB input | 0.742 | 0.715 | 0.723 | 0.776 | 0.939 |
| Single RGB input | 0.731 | 0.726 | 0.722 | 0.793 | 0.943 |
| Shuffled RGB input | 0.621 | 0.571 | 0.584 | 0.678 | 0.891 |
| Model | Parameters (M) | FLOPs (M) | Latency (ms) |
|---|---|---|---|
| M3ENet | 6.01 | 42.87 | 5.08 |
| SSRLTS-ViT | 66.42 | 643.05 | 22.94 |
| ResNet-18 | 11.17 | 98.44 | 3.23 |
| VGG-16 | 33.61 | 723.58 | 1.45 |
| GoogLeNet | 5.97 | 76.40 | 5.47 |
| EfficientNet | 7.98 | 0.16 | 3.46 |
| HTNet | 139.65 | 215.45 | 13.59 |
| CNNCapsNet | 12.51 | 477.10 | 2.47 |
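As a rough guide to how figures like those above can be obtained, the sketch below counts parameters and times forward passes for an arbitrary PyTorch model. FLOP counting is left to an external profiler (e.g., fvcore or thop); the model, input shape, and timing settings here are illustrative assumptions and are not intended to reproduce the reported numbers.

```python
import time
import torch

@torch.no_grad()
def profile(model: torch.nn.Module, example_inputs, warmup: int = 10, runs: int = 100):
    """Report parameter count (M) and mean forward-pass latency (ms) on the current device."""
    params_m = sum(p.numel() for p in model.parameters()) / 1e6
    model.eval()
    for _ in range(warmup):                      # warm-up iterations stabilize timing
        model(example_inputs)
    start = time.perf_counter()
    for _ in range(runs):
        model(example_inputs)
    latency_ms = (time.perf_counter() - start) / runs * 1e3
    return params_m, latency_ms

# Example with a torchvision ResNet-18 and a single 224x224 RGB input (shapes assumed).
from torchvision.models import resnet18
model = resnet18(num_classes=5)
params_m, latency_ms = profile(model, torch.randn(1, 3, 224, 224))
print(f"Params: {params_m:.2f} M, latency: {latency_ms:.2f} ms")
```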