Proceeding Paper

Improving Facial Expression Recognition with a Focal Transformer and Partial Feature Masking Augmentation †

Liang-Ying Ke, Chia-Yu Liao and Chih-Hsien Hsia *
1 Department of Engineering Science, National Cheng Kung University, Tainan 701, Taiwan
2 Taipei Fuhsing Private School, Taipei 106, Taiwan
3 Department of Computer Science and Information Engineering, National Ilan University, Yilan 260, Taiwan
4 Department of Business Administration, Chaoyang University of Technology, Taichung 413, Taiwan
* Author to whom correspondence should be addressed.
Presented at the 2024 IEEE 6th Eurasia Conference on IoT, Communication and Engineering, Yunlin, Taiwan, 15–17 November 2024.
Eng. Proc. 2025, 92(1), 70; https://doi.org/10.3390/engproc2025092070
Published: 14 May 2025
(This article belongs to the Proceedings of 2024 IEEE 6th Eurasia Conference on IoT, Communication and Engineering)

Abstract

With the advancement of deep learning (DL) and computer vision (CV) technologies, significant progress has been made in facial expression recognition (FER) for real-world applications. However, FER still faces challenges such as occlusion and head pose variations, which make it difficult for FER models to maintain stability and accuracy. In this study, we introduced a focal vision transformer (FViT) with partial feature masking (PFM) into FER. The PFM data augmentation efficiently simulates the challenges posed by occlusion and head pose variations by randomly masking parts of the image while preserving key facial-expression regions. The proposed FViT achieved an accuracy of 89.08% on the real-world affective faces database, which includes scenarios with occlusion and head pose variations, and PFM further enhanced the model’s performance. The developed method effectively addresses the challenges of occlusion and head pose variations in FER.

1. Introduction

As social media, information and communications technology (ICT), and deep learning (DL) continue to thrive, accurately and efficiently identifying users’ facial expressions has become indispensable in research on human–computer interaction (HCI). DL techniques are used in facial expression recognition (FER) to improve on traditional computer vision (CV) approaches, with outstanding results [1]. However, current FER technologies require large amounts of high-quality labeled data for training, which is time-consuming and costly and limits the development of FER technology. In addition, captured face images often lack important facial features due to occlusion or head pose. These problems degrade the recognition ability of FER models, making them difficult to apply in practice [2].
Facial occlusion arises from self-occlusion, caused by head pose variation or the user’s own body, and from external objects such as masks, glasses, hair, or the user’s hands, which obscure parts of the face that are critical for identifying expressions, such as the eyes, eyebrows, and mouth. Huang et al. [3] proposed FER-VT, a combination of a convolutional neural network (CNN) and a vision transformer (ViT) for FER. The model uses grid-wise attention (GWA) [4] and visual transformer attention (VTA) to capture facial texture features and semantic features, respectively, while enabling it to learn long-range dependencies (LRDs) [5] across different facial parts. This approach strengthens the FER-VT model’s facial perception ability and further enhances its accuracy.
Although the FER-VT model efficiently extracts facial features using GWA, it still struggles with missing facial features caused by occlusion and head pose variations, which makes it challenging for the model to adapt to real-world environments. Nan et al. [6] proposed the A-MobileNet model for FER, which incorporates depthwise separable convolution (DSC) [7] and the convolutional block attention module (CBAM) [8]. The model captures facial features through the CBAM mechanism, extracting discriminative facial features to improve FER accuracy. However, because the model is constructed largely from DSC layers, its ability to extract facial features is too limited to effectively address the missing facial features caused by pose variations and occlusion. To address these issues, we developed a method called partial feature masking (PFM), which simulates head pose variations and occlusions to augment facial expression images. We applied PFM to focal modulation networks (FocalNets) [9] for FER and verified its effectiveness in this study.

2. Method

2.1. Partial Feature Masking

FER is affected by head pose variations and occlusion, which degrade model performance. To address these issues and improve accuracy, a data augmentation technique was developed in this study to efficiently simulate missing facial features caused by head pose variations and occlusion. The method increases the diversity of the training data while effectively preserving important facial features. The flowchart of the PFM method is shown in Figure 1.
First, PFM uses a pre-trained model [10] on the real-world affective faces database (RAF-DB) to generate a class activation map (CAM) for each face image and uses the CAM to determine the importance of facial features. To simulate missing facial features caused by head pose variations and occlusions, the CAM is separated into two binary maps representing emotion-related and non-emotion-related regions. Conditional random masking (CRM) is applied to both maps: each map is divided into patches according to the specified patch size, and each patch is binarized according to the proportion of emotion-related and non-emotion-related pixels it contains to determine whether it belongs to an emotion-related region. CRM then randomly masks patches of each type separately, which prevents emotion-related facial features from being completely masked. Finally, the two masked maps are recombined and applied to the original image for data augmentation. These operations ensure that emotion-related features are preserved, allowing the model to accurately identify facial expression cues.
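The paper does not include an implementation of PFM; the following PyTorch sketch only illustrates the steps described above under stated assumptions: the CAM is assumed to be given (produced by the pre-trained model [10]), the 0.5 CAM threshold is an assumption, and the patch size and masking-ratio range follow the best settings reported in Section 3. All names are illustrative rather than the authors’ code.

import torch
import torch.nn.functional as F

def pfm_augment(image, cam, patch_size=16, ratio_min=0.02, ratio_max=0.25, cam_threshold=0.5):
    # Illustrative sketch of PFM, not the authors' implementation.
    # image: (C, H, W) tensor; cam: (H, W) class activation map in [0, 1]
    # produced by a pre-trained FER model (assumed to be given).
    C, H, W = image.shape

    # 1. Split the CAM into emotion-related / non-emotion-related binary maps
    #    (the 0.5 threshold is an assumption).
    emotion_map = (cam >= cam_threshold).float()

    # 2. Decide per patch whether it is emotion-related, using the proportion
    #    of emotion-related pixels inside the patch.
    patch_ratio = F.avg_pool2d(emotion_map[None, None], patch_size)  # (1, 1, H/p, W/p)
    emotion_patch = patch_ratio.squeeze() >= 0.5                     # boolean patch grid

    # 3. Conditional random masking: mask a random fraction of patches in each
    #    group separately, so emotion-related patches are never all removed.
    ratio = torch.empty(1).uniform_(ratio_min, ratio_max).item()
    keep = torch.ones(emotion_patch.shape)
    for group in (emotion_patch, ~emotion_patch):
        idx = group.flatten().nonzero().squeeze(1)
        n_mask = int(len(idx) * ratio)
        if n_mask > 0:
            chosen = idx[torch.randperm(len(idx))[:n_mask]]
            keep.view(-1)[chosen] = 0.0

    # 4. Recombine the two masked maps and apply them to the original image.
    keep = keep.repeat_interleave(patch_size, 0).repeat_interleave(patch_size, 1)
    return image * keep[None]    # masked patches are zeroed out

In training, such a transform would be applied alongside the standard augmentations listed in Section 3.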

2.2. Focal Vision Transformer

Since facial expressions involve the coordinated movement of facial muscles, a model’s ability to perceive overall facial features is key to efficiently recognizing facial expressions. Therefore, we used the FocalNets model proposed by Yang et al. [9] to capture global facial features from face images. With the focal modulation mechanism (FMM), FocalNets captures LRDs in face images to efficiently perceive relationships between facial features, enabling the model to learn global features (Figure 2a). Compared with the traditional self-attention mechanism in ViT (Figure 2b), FMM achieves higher accuracy while reducing the model’s data requirements and its need for large training databases. This enables the model to efficiently capture different facial features when recognizing facial expressions while also constructing relationships between these features, yielding more stable and generalized FER.
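To make the mechanism more concrete, the simplified PyTorch module below follows the overall structure of focal modulation described by Yang et al. [9]: a query projection, hierarchical context aggregation with depthwise convolutions of growing receptive field, gated fusion including a global context level, and element-wise modulation of the query. It is a minimal sketch under assumed layer sizes, not the reference FocalNets implementation.

import torch
import torch.nn as nn

class FocalModulation(nn.Module):
    # Simplified sketch of the focal modulation mechanism in [9];
    # layer sizes and the number of focal levels are assumptions.
    def __init__(self, dim, focal_levels=3, kernel_size=3):
        super().__init__()
        # One linear layer produces the query, the context features, and the
        # per-level gates in a single pass.
        self.f = nn.Linear(dim, 2 * dim + focal_levels + 1)
        # Hierarchical context aggregation: depthwise convolutions whose
        # receptive field grows with the focal level.
        self.focal_convs = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(dim, dim, kernel_size + 2 * l,
                          padding=(kernel_size + 2 * l) // 2, groups=dim, bias=False),
                nn.GELU(),
            )
            for l in range(focal_levels)
        ])
        self.h = nn.Conv2d(dim, dim, 1)   # modulator projection
        self.proj = nn.Linear(dim, dim)   # output projection
        self.focal_levels = focal_levels

    def forward(self, x):
        # x: (B, H, W, C) map of visual tokens
        B, H, W, C = x.shape
        q, ctx, gates = torch.split(self.f(x), (C, C, self.focal_levels + 1), dim=-1)
        ctx = ctx.permute(0, 3, 1, 2)        # (B, C, H, W)
        gates = gates.permute(0, 3, 1, 2)    # (B, L + 1, H, W)

        # Gated aggregation of increasingly global contexts, which is how the
        # block captures long-range dependencies across facial regions.
        ctx_all = 0.0
        for l, conv in enumerate(self.focal_convs):
            ctx = conv(ctx)
            ctx_all = ctx_all + ctx * gates[:, l:l + 1]
        ctx_global = ctx.mean(dim=(2, 3), keepdim=True)              # image-level context
        ctx_all = ctx_all + ctx_global * gates[:, self.focal_levels:]

        # Element-wise modulation of each query token by its aggregated context.
        modulator = self.h(ctx_all).permute(0, 2, 3, 1)              # (B, H, W, C)
        return self.proj(q * modulator)

FocalNets stacks such blocks in place of the multi-head self-attention layers of a ViT-style backbone, which lets it model relationships across facial regions without computing pairwise attention.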

3. Results and Discussion

To evaluate the model’s ability to recognize facial expressions, we used the RAF-DB [11], which contains 29,672 face images, for training and testing. The database was independently annotated by 40 well-trained annotators, and its facial expression images vary in head pose, lighting conditions, occlusion, and image processing operations, making it highly diverse. The database consists of two subsets: one contains seven categories of basic emotions with single labels, and the other contains twelve categories of compound emotions with double labels. To compare the performance with those of previous studies, we used the basic emotions subset: 12,271 facial expression images for training and 3068 images for testing. The hyperparameters were an image size of 224, a batch size of 64, and 40 training epochs. AdamW was selected as the optimizer, with an initial learning rate of 0.0002 and a weight decay of 0.0005. The FER model was trained on an NVIDIA RTX 3090 Ti graphics processing unit (GPU) using the PyTorch 2.2 toolbox.
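This training configuration can be summarized in the short PyTorch sketch below. The optimizer, learning rate, weight decay, batch size, image size, and epoch count are the values reported above; FocalNetFER and RAFDBDataset are hypothetical placeholders, since the paper does not specify the classifier head or the data-loading code.

import torch
from torch.utils.data import DataLoader

# Values reported in the paper.
IMAGE_SIZE, BATCH_SIZE, EPOCHS = 224, 64, 40
LEARNING_RATE, WEIGHT_DECAY = 2e-4, 5e-4

# Hypothetical placeholders: the paper does not describe the exact model head
# or dataset wrapper used.
model = FocalNetFER(num_classes=7).cuda()
train_set = RAFDBDataset(split="train", image_size=IMAGE_SIZE)
train_loader = DataLoader(train_set, batch_size=BATCH_SIZE, shuffle=True, num_workers=4)

optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE,
                              weight_decay=WEIGHT_DECAY)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(EPOCHS):
    model.train()
    for images, labels in train_loader:
        images, labels = images.cuda(), labels.cuda()
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()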
To evaluate the developed PFM in this study, we conducted data augmentation from previous studies, including random horizontal flip (RHF), random rotation (RR), random erasing (RE) [12], and PFM (Table 1). The PFM method efficiently enhanced the accuracy of the FER model. Compared with the RE method that treated all features equally, PFM used the pre-trained model’s CAM to assess the importance of facial features and mask them accordingly, thereby more effectively improving the model’s recognition ability.
To evaluate the impact of the patch size and masking ratio in PFM on the FER model, we conducted an ablation study on the data augmentation (Table 2 and Table 3). The developed PFM showed the best FER accuracy with a patch size of 16 × 16, while other patch sizes degraded the model’s ability to recognize facial expressions. This result implies that a larger patch size tends to occlude more facial features, reducing the model’s ability to identify facial details, whereas a smaller patch size preserves more facial features but introduces more noise, affecting recognition performance. Regarding the masking ratio, the FER model showed the best accuracy when the ratio was between 0.02 and 0.25. As the masking ratio increased further, recognition accuracy decreased. This implies that an appropriate masking ratio improves accuracy, but when the ratio is too high, important facial features are excessively occluded, making it difficult for the model to improve further.
To evaluate the developed FER method, we compared its performance with that of methods from previous studies (Table 4). The developed method efficiently improved the accuracy of FER, achieving an accuracy 0.82% higher than that of the best compared method, FER-VT [3]. This demonstrates that applying the proposed method to FER results in better recognition performance.

4. Conclusions

To evaluate whether PFM efficiently addresses head pose and occlusion problems, we assessed the performance of FocalNets with data augmentation for FER on the public RAF-DB database. The developed PFM occludes facial features during training, which improves the model’s adaptability and thereby enhances its recognition ability. FocalNets with PFM achieved an accuracy of 89.08%, outperforming the other methods in recognition ability. Additionally, the data augmentation used in this study efficiently enhanced the model’s stability when applied to diverse facial images.

Author Contributions

Conceptualization, L.-Y.K. and C.-H.H.; methodology, L.-Y.K.; software, L.-Y.K. and C.-Y.L.; validation, L.-Y.K., C.-Y.L. and C.-H.H.; formal analysis, L.-Y.K. and C.-Y.L.; investigation, L.-Y.K.; resources, C.-H.H.; data curation, L.-Y.K.; writing—original draft preparation, L.-Y.K. and C.-Y.L.; writing—review and editing, C.-H.H.; visualization, L.-Y.K. and C.-Y.L.; supervision, C.-H.H.; project administration, C.-H.H.; funding acquisition, C.-H.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially supported by the Ministry of Science and Technology, Taiwan, under grant no. MOST 111-2221-E-197-020-MY3.

Institutional Review Board Statement

Not applicable. This study did not involve any humans or animals.

Informed Consent Statement

Not applicable. This study did not involve humans.

Data Availability Statement

This study did not report any data.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Lin, T.-Y.; Chan, H.-T.; Hsia, C.-H.; Lai, C.-F. Facial skincare products recommendation with computer vision technologies. Electronics 2022, 11, 143.
2. Hsia, C.-H.; Chiang, J.-S.; Lin, C.-Y. A fast face detection method for illumination variant condition. Sci. Iran. 2015, 22, 2081–2091.
3. Huang, Q.; Huang, C.; Wang, X.; Jiang, F. Facial expression recognition with grid-wise attention and visual transformer. Inf. Sci. 2021, 580, 35–54.
4. Zou, R.; Song, C.; Zhang, Z. The devil is in the details: Window-based attention for image compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 17492–17501.
5. Dai, Z.; Liu, H.; Le, Q.V.; Tan, M. CoAtNet: Marrying convolution and attention for all data sizes. In Proceedings of the Neural Information Processing Systems, Virtual, 6–14 December 2021; pp. 3965–3977.
6. Nan, Y.; Ju, J.; Hua, Q.; Zhang, H.; Wang, B. A-MobileNet: An approach of facial expression recognition. Alex. Eng. J. 2022, 61, 4435–4444.
7. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861.
8. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 3–19.
9. Yang, J.; Li, C.; Dai, X.; Gao, J. Focal modulation networks. In Proceedings of the Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; pp. 4203–4217.
10. Zhang, W.-L.; Jia, R.-S.; Wang, H.; Che, C.-Y.; Sun, H.-M. A self-supervised learning network for student engagement recognition from facial expressions. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 12399–12410.
11. Li, S.; Deng, W.; Du, J. Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2852–2861.
12. Zhong, Z.; Zheng, L.; Kang, G.; Li, S.; Yang, Y. Random erasing data augmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 13001–13008.
13. Liao, J.; Lin, Y.; Ma, T.; He, S.; Liu, X.; He, G. Facial expression recognition methods in the wild based on fusion feature of attention mechanism and LBP. Sensors 2023, 23, 4204.
Figure 1. PFM data augmentation method.
Figure 2. Attention mechanism of ViT: (a) shows the architecture of FMM, and (b) shows the architecture of the self-attention mechanism.
Table 1. Ablation study on data augmentation for training.
Methods       RHF   RR   RE   PFM   Accuracy
This paper    –     –    –    –     87.68%
              ✓     ✓    –    –     88.89%
              ✓     ✓    ✓    –     88.75%
              ✓     ✓    –    ✓     89.08%
Table 2. Ablation study on patch size in PFM data augmentation.
Methods       PFM   Patch Size   Accuracy
This paper    w/o   N/A          88.89%
              w/    4            88.40%
              w/    8            88.79%
              w/    16           89.08%
              w/    32           88.82%
Table 3. Ablation study on the masking ratio in PFM data augmentation.
Methods       PFM   Ratio (min)   Ratio (max)   Accuracy
This paper    w/o   N/A           N/A           88.89%
              w/    0.02          0.25          89.08%
              w/    0.25          0.50          88.95%
              w/    0.50          0.75          88.69%
Table 4. Comparison with previous studies on the RAF-DB database in terms of accuracy.
Methods             Accuracy
FER-VT [3]          88.26%
A-MobileNet [6]     84.49%
RCL-Net [13]        88.20%
This paper          89.08%