Article

Swin-FER: Swin Transformer for Facial Expression Recognition

1 Institute of Education, Changchun Normal University, Changchun 130032, China
2 College of Computer Science and Technology, Jilin University, Changchun 130012, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(14), 6125; https://doi.org/10.3390/app14146125
Submission received: 18 June 2024 / Revised: 7 July 2024 / Accepted: 12 July 2024 / Published: 14 July 2024

Abstract

The ability of transformers to capture global context information is highly beneficial for recognizing subtle differences in facial expressions. However, compared to convolutional neural networks, transformers must compute dependencies between each element and all other elements, leading to high computational complexity. Additionally, their large number of parameters requires extensive training data to avoid overfitting. In this paper, guided by the characteristics of facial expression recognition tasks, we made targeted improvements to the Swin transformer network. The proposed Swin-Fer network adopts a fusion strategy from the middle layer to deeper layers and employs a data dimension conversion method so that the network perceives more spatial information. Furthermore, we integrated a mean module, a split module, and a group convolution strategy to effectively control the number of parameters. On the Fer2013 dataset, an in-the-wild dataset, Swin-Fer achieved an accuracy of 71.11%; on the CK+ dataset, an in-the-lab dataset, the accuracy reached 100%.

1. Introduction

The transformer is a neural network model that uses self-attention to establish dependency relationships between sequence elements, and it has attracted wide attention for its excellent performance in natural-language-processing tasks [1]. Many researchers have therefore tried to apply the transformer model to facial expression recognition. The self-attention mechanism enables the model to consider all other elements in a sequence while processing each element, which is especially useful for capturing subtle variations in expressions [2]. Compared with convolutional networks, the transformer is not constrained by local receptive fields, allowing it to flexibly focus on any part of an image; moreover, its self-attention structure supports better parallel processing [3,4]. However, self-attention involves computing dependencies between each element and all other elements, resulting in high data dimensions and computational overheads [5]. In addition, the model includes numerous parameters and usually requires a large amount of data for effective training to mitigate overfitting. Consequently, the model is larger and demands more computational resources for both training and inference [6].
The Swin transformer, a variant of the transformer, has been specifically optimized for visual tasks: its window-partitioning mechanism and hierarchical structure enable the model to learn features at different scales [7]. In this paper, based on the characteristics of facial expression recognition tasks, we made targeted modifications to the original Swin transformer network structure and propose the Swin-Fer network (Swin Transformer for Facial Expression Recognition), which achieves promising experimental results. The main contributions are as follows:
  • Swin-Fer adopts a fusion strategy from middle to deep layers to capture facial expression features more accurately. This guides the network to learn the relationships between local and global features effectively, thereby enhancing its expression recognition ability.
  • Using the data dimension transformation strategy, the whole network model can perceive more spatial dimension information. In addition, in order to improve the generalization ability of the model, the mean module and the split module, as well as group convolution, are introduced. While achieving satisfactory experimental results, the parameter count is kept largely unchanged.
  • The proposed method achieves an accuracy of 71.11% on the Fer2013 dataset under natural conditions and 100% on the CK+ dataset in laboratory environments. The sensitivity and specificity of the model, as indicated by the area under the curve, are also good.

2. Related Works

2.1. Transformer for Facial Expression Recognition

As a self-attention-based model, the transformer demonstrates excellent performance and adaptability across facial expression recognition tasks [8]. Xue et al. [9] applied the transformer to facial expression recognition and developed the TransFER model by integrating strategies such as Multi-Attention Dropping (MAD) and Multi-Head Self-Attention Dropping. This model highlights important local blocks (patches) while suppressing irrelevant ones, thereby exploring rich relationships between local blocks and partially addressing the problem of high inter-class similarity and high intra-class variation in facial expression recognition. Ma et al. [2] proposed the VTFF (Visual Transformer with Feature Fusion) network, which introduced the Attentional Selective Fusion (ASF) method to leverage two feature maps generated by a dual-branch CNN; through global–local attention, multiple features are fused to capture discriminative visual words, and the method achieved excellent performance. In facial expression classification tasks, many researchers input image samples containing various emotional states into transformer models and use a softmax classifier to predict the emotional label represented by each image. Kim et al. [10] observed that the Vision Transformer (ViT) may have limited ability to capture subtle changes in facial expressions and may lose local image features. They proposed Squeeze-ViT, which reduces feature dimensions to lower computational complexity and integrates global and local features to enhance network performance. Zhao et al. [11] proposed the Former-DFER network for natural environmental scenes, composed of a CS transformer (a convolutional spatial transformer) and a T transformer (a temporal transformer). This architecture guides the network to learn spatial and contextual facial features, thus improving the accuracy of facial expression classification.

2.2. Overview of the Swin Transformer

The Swin transformer, as described by Liu et al. [7], is a hierarchical vision transformer designed to handle various computer vision tasks efficiently. It uses a shifted-windows approach for self-attention, which allows for better scalability and performance compared to existing transformer-based models. It builds hierarchical feature maps, similar to convolutional neural networks (CNNs), enabling it to capture multi-scale representations effectively.
A key innovation of the Swin transformer is its use of shifted windows for self-attention computation. Self-attention is first computed within local, non-overlapping windows, and these window positions are then shifted between successive layers. This facilitates cross-window connections, enhancing the model’s ability to capture long-range dependencies while maintaining computational efficiency. The combination of non-overlapping and shifted windows allows the Swin transformer to balance local context extraction within windows with global context integration across windows, leading to high performance in vision tasks. By focusing on local windows, the computational complexity of the self-attention mechanism is significantly reduced, making the Swin transformer more efficient and scalable for high-resolution images. Liang et al. [12] address the challenges that occlusions and head-pose variations pose for facial expression recognition (FER) using a convolution-transformer dual branch network (CT-DBN), which leverages the strengths of convolutional neural networks (CNNs) and the Swin transformer to capture local and global facial information, respectively. Qin et al. [13] integrated a Multi-Level Channel Attention (MLCA) module into each task-specific subnet, enabling adaptive feature selection from optimal levels and channels. This design allows the Swin transformer to perform facial expression recognition efficiently and accurately, achieving strong experimental results and demonstrating a superior understanding of facial features.
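To make the windowing concrete, the sketch below shows the standard window-partition step used in Swin transformer implementations (7 × 7 windows on a B × H × W × C feature map); it is illustrative and not this paper's released code.

```python
import torch

def window_partition(x: torch.Tensor, window_size: int = 7) -> torch.Tensor:
    """Split a B x H x W x C feature map into non-overlapping windows.

    Returns a tensor of shape (num_windows * B, window_size, window_size, C),
    so self-attention can be computed independently inside each window.
    """
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    windows = x.permute(0, 1, 3, 2, 4, 5).contiguous()
    return windows.view(-1, window_size, window_size, C)

# Example: a 56 x 56 feature map yields (56 / 7)^2 = 64 windows per image.
feat = torch.randn(2, 56, 56, 96)
print(window_partition(feat).shape)   # torch.Size([128, 7, 7, 96])
```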

3. Proposed Method

This paper designs a facial expression recognition method based on the Swin transformer, named Swin-Fer, whose network structure is depicted in Figure 1. In facial expression recognition tasks, deeper-level features contain richer semantic information, so Swin-Fer employs a fusion strategy from the middle to the deep layers [14]. After an image is input into the network, window segmentation and patch embedding are carried out: the input image is divided into blocks of fixed size, and each block is embedded to obtain a fixed-length vector representation. The channel number is converted to 96, which is equivalent to applying a 4 × 4 convolution (kernel size = 4) with a stride of 4 (stride = 4) and no overlapping regions. Hence, an input image of size B × 3 × 224 × 224 becomes a 56 × 56 feature map with 96 channels. The data are then flattened from B × C × H × W (batch size × channel number × height × width) by merging the height and width into a single spatial dimension, and a transpose operation is performed to exchange the spatial and channel dimensions.
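The patch-embedding step described above can be sketched in PyTorch as follows (an illustrative sketch, not the authors' implementation): the 4 × 4 stride-4 convolution maps B × 3 × 224 × 224 to B × 96 × 56 × 56, and the flatten/transpose produces the token sequence fed to the Swin transformer blocks.

```python
import torch
import torch.nn as nn

# Illustrative patch embedding: 4 x 4 kernel, stride 4, no overlap, 96 output channels.
patch_embed = nn.Conv2d(in_channels=3, out_channels=96, kernel_size=4, stride=4)

x = torch.randn(8, 3, 224, 224)   # B x 3 x 224 x 224
x = patch_embed(x)                # B x 96 x 56 x 56
x = x.flatten(2)                  # B x 96 x 3136 (56 * 56 spatial positions merged)
x = x.transpose(1, 2)             # B x 3136 x 96, the token sequence for the STBs
print(x.shape)                    # torch.Size([8, 3136, 96])
```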
In Figure 1, STB denotes the Swin transformer block; STB1, STB2, STB3, and STB4 contain one, one, three, and one pair(s) of W-MSA and SW-MSA combinations, respectively. The extraction and fusion of information occurs after the patch merging layers between STB2, STB3, and STB4, producing STB2P and STB3P. The letter ‘P’ in STB2P and STB3P denotes the output after the patch merging operation: STB2P is the output of the patch merging between STB2 and STB3, while STB3P is the output of the patch merging between STB3 and STB4. Layer normalization is a regularization technique similar to batch normalization (BN), but it computes the mean and variance over the features of each sample within a layer rather than across the batch, making it more suitable for sequence models. Applying layer normalization to STB4 yields STB4L. Together with STB4’s output (STB4O) and the earlier STB2P and STB3P, the outputs of these four levels undergo data dimension transformation, adaptive average pooling, and mean operations before being merged. After passing through the split module, a fusion output is obtained; this result is added to STB4L, and adaptive average pooling is applied once more to further compress the spatial features into the final output.
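To make the data flow concrete, the following minimal sketch shows one way the middle-to-deep fusion could be wired. The FusionHead name, the shapes, and the assumptions that all four level outputs have already been reshaped to B × C × H × W and that the split module maps the merged map back to STB4's channel width are ours; this is not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionHead(nn.Module):
    """Hypothetical sketch of the middle-to-deep fusion described above.

    Assumes the four level outputs are already in B x C x H x W form and that
    `split_module` maps the merged map to STB4L's channel width so the
    residual addition is well defined.
    """

    def __init__(self, split_module: nn.Module):
        super().__init__()
        self.split_module = split_module

    def forward(self, stb2p, stb3p, stb4o, stb4l):
        levels = [stb2p, stb3p, stb4o]
        # Pool every level to STB4's spatial resolution, then average over channels.
        pooled = [F.adaptive_avg_pool2d(x, stb4l.shape[-2:]) for x in levels]
        pooled = [x.mean(dim=1, keepdim=True) for x in pooled]   # mean module
        merged = torch.cat(pooled, dim=1)                        # merge the levels
        fused = self.split_module(merged)                        # split module (two grouped-conv branches)
        out = fused + stb4l                                      # add to the normalized STB4 output
        return F.adaptive_avg_pool2d(out, 1).flatten(1)          # final pooling before the classifier
```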

3.1. Patch Merging

Patch merging is a technique employed in transformer models to enhance the efficiency of image processing. It divides the input into several sub-images, which are then concatenated and fed into the transformer for processing, as illustrated in Figure 2. This avoids processing the entire large-scale image directly and thus reduces the computational load when dealing with large images.
During this process, patch merging compresses the detailed information of the high-resolution input feature map into a low-resolution feature map. While retaining the main information of the original feature map, this operation reduces computational complexity, thereby improving the calculation efficiency and generalization capability of the model [7,15].
The patch merging between two STBs is equivalent to a downsampling operation without convolution. In preliminary experiments, we tried extracting features before patch merging, which caused two issues: excessive parameters and irregularly shaped data, leading to suboptimal feature extraction. Therefore, all experiments in this study extract feature data after patch merging.
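For reference, the sketch below reproduces the standard Swin-style patch merging (concatenate each 2 × 2 neighborhood along channels, then linearly project 4C to 2C); it illustrates the downsampling mechanism rather than this paper's specific code.

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Standard Swin-style patch merging: halves H and W, doubles the channels."""

    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: B x H x W x C
        x0 = x[:, 0::2, 0::2, :]                           # top-left pixel of each 2 x 2 block
        x1 = x[:, 1::2, 0::2, :]                           # bottom-left
        x2 = x[:, 0::2, 1::2, :]                           # top-right
        x3 = x[:, 1::2, 1::2, :]                           # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=-1)            # B x H/2 x W/2 x 4C
        return self.reduction(self.norm(x))                # B x H/2 x W/2 x 2C

merge = PatchMerging(dim=96)
print(merge(torch.randn(2, 56, 56, 96)).shape)             # torch.Size([2, 28, 28, 192])
```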

3.2. Dimensional Transformation of Data

Within each STB, all odd-numbered blocks have a shift size of 0, while all even-numbered blocks have a shift size of 3. A shift size of 0 means no shift is applied, and the spatial features are directly normalized before being unfolded. When the shift size is 3, dimensions 1 and 2 (the two spatial dimensions) are shifted and the displaced content wraps around to fill the vacated positions, i.e., a cyclic shift realized by the torch.roll operation. The purpose of this cyclic shift is to disrupt the internal structure of the data: the feature map contains the same values, but their positions change, breaking the fixed dependencies and linear relationships between pixels and facilitating more complex interactions between grids. Regardless of whether odd or even blocks are processed, the output feature dimensions remain the same. The output window features (X_window) are then merged again, which is equivalent to recombining the spatial features; through window-based multi-head self-attention (W-MSA) and shifted-window multi-head self-attention (SW-MSA), interactions both within and between windows are achieved, further improving model performance.
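The cyclic shift can be reproduced with torch.roll; the short sketch below (assuming a B × H × W × C layout and a shift size of 3) shows the shift applied before windowed attention and its reversal afterwards.

```python
import torch

x = torch.randn(2, 56, 56, 96)                 # B x H x W x C feature map
shift_size = 3

# Cyclic shift: rows and columns are rolled, and the wrapped-around content
# fills the vacated positions, changing pixel neighborhoods without losing data.
shifted = torch.roll(x, shifts=(-shift_size, -shift_size), dims=(1, 2))

# ... windowed self-attention would run here ...

# Reverse the shift so the feature map returns to its original layout.
restored = torch.roll(shifted, shifts=(shift_size, shift_size), dims=(1, 2))
assert torch.equal(restored, x)
```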
In addition, the multi-head self-attention mechanism divides the input data into several heads, and each head generates different query, key, and value vectors and computes the corresponding attention-weighted results. These results are then concatenated to form the final output of the multi-head self-attention mechanism. This output is a three-dimensional tensor that needs to be reshaped to a four-dimensional tensor for better matching with subsequent feature fusion calculations. For example, position encoding needs to add position information to input features, and decoder calculation needs to concatenate the encoder results with target sequences. Therefore, converting the three-dimensional tensor to a four-dimensional tensor facilitates these calculations, improving model efficiency and accuracy.
Assume that the input is $X \in \mathbb{R}^{B \times L \times D}$, where B represents the batch size, L the sequence length, and D the input dimensionality. In the multi-head self-attention mechanism, the input is divided into H parts through linear mapping, and after the multi-head attention calculation, an output of shape $Y \in \mathbb{R}^{B \times L \times H}$ is obtained.
Initially, H is divided into num_heads parts, and the input dimension D is divided into the same number of parts, that is, D = num_heads × head_dim, where head_dim is the dimension of each head. The output tensor Y of the multi-head self-attention mechanism is transformed into a four-dimensional tensor $Z \in \mathbb{R}^{B \times L \times \mathrm{num\_heads} \times \mathrm{head\_dim}}$ as follows:
$$Z = \mathrm{reshape}\big(Y, (B, L, \mathrm{num\_heads}, \mathrm{head\_dim})\big)$$
This operation splits the last dimension of the output tensor, creating a new dimension for each head. The transformation from three-dimensional to four-dimensional data effectively expands the spatial features before feature fusion, so that the whole network model can perceive more spatial-dimension information, thereby enhancing classification performance.
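A minimal sketch of the reshape in the equation above, with num_heads = 3 and head_dim = 32 chosen purely to illustrate a 96-dimensional embedding:

```python
import torch

B, L, num_heads, head_dim = 8, 3136, 3, 32
D = num_heads * head_dim                       # D = 96

Y = torch.randn(B, L, D)                       # three-dimensional attention output
Z = Y.reshape(B, L, num_heads, head_dim)       # four-dimensional tensor, one slice per head
print(Z.shape)                                 # torch.Size([8, 3136, 3, 32])
```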

3.3. Mean Module

In transformer networks, the mean operation is usually used to reduce the dimensionality of feature tensors extracted through self-attention mechanisms, enabling them to be input into fully connected layers for classification. For instance, in the vision transformer proposed by Dosovitskiy et al. [16], a feature tensor is divided into multiple blocks, followed by self-attention mechanisms, resulting in feature vectors for each block. Subsequently, each feature vector undergoes a mean operation to obtain a fixed-length feature vector for final classifier input.
In the Swin-Fer network, images are segmented into blocks and features are extracted by self-attention mechanisms, generating a sequence of features. The mean module obtains the average information of the data, achieving dimensionality reduction and lowering computational complexity while retaining the main information.
Assume that the size of the obtained feature tensor is H × W × n , where H represents height, W represents width, and n represents the number of channels [17]. The calculation formula for the mean module is as follows:
$$\mathrm{mean}(X)_{i,j} = \frac{1}{n}\sum_{k=1}^{n} X_{i,j,k}$$
where $\mathrm{mean}(X)_{i,j}$ represents the value of the pooled output feature map at position $(i,j)$, and $X_{i,j,k}$ represents the value of the input feature map X at position $(i,j,k)$. The formula averages along the third (channel) dimension of the feature tensor to obtain a new feature tensor. Since this operation aggregates the original n channels into one, the number of channels changes from n to 1. This average-pooling approach allows the resulting vectors to be merged without increasing the number of network parameters.
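The channel-wise averaging in the formula corresponds to a single parameter-free mean over the channel dimension; a minimal sketch (channel-last layout assumed to match the H × W × n notation, with illustrative sizes):

```python
import torch

x = torch.randn(8, 7, 7, 768)          # B x H x W x n feature tensor
pooled = x.mean(dim=-1, keepdim=True)  # average over the n channels -> B x H x W x 1
print(pooled.shape)                    # torch.Size([8, 7, 7, 1])
```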

3.4. Split Module

Swin-Fer focuses on controlling the number of parameters while performing facial expression recognition tasks. With data dimensions already high, it is essential to minimize weight or bias operations as much as possible. As shown in Figure 1, in Branch1 and Branch2, each branch contains a group convolution. Because the data processing at this stage is not very extensive, group convolution can be considered instead of regular convolution to reduce parameters, especially for small datasets. As shown in Figure 3, the group convolution strategy first divides the input feature map into several groups, with an equal number of channels in each group. Then, convolution operations are performed separately on each group, and, finally, the convolution results of each group are concatenated as the final output. Through this channel grouping strategy, different groups can learn different features, enhancing feature diversity. Additionally, group convolution enables better learning of inter-channel relationships, strengthening model representation capability. Specifically:
$$X' = \big\Vert_{g=1}^{G}\big(X^{(g)} \ast K_g\big) + b$$
where $X^{(g)}$ represents the g-th group of the input feature map, with G groups in total; $X'$ is the output feature map after group convolution; “$\Vert$” denotes the tensor concatenation operation; $\ast$ denotes convolution; $K_g$ is the convolution kernel of the g-th group; and b is the bias term.
The results of these two branches are concatenated. Subsequently, a tensor dimension transposition is performed, where the input shape is [B, 2, 2, 4, 4] and the output shape is also [B, 2, 2, 4, 4]; although the second and third dimensions are both 2, they actually switch positions (as shown in Figure 4). The transposed tensor allows concatenation along the batch dimension and makes better use of hardware parallelism to speed up computation. Generally, position information would be restored after such a transposition; however, since the data volume at this stage is already small, the impact on the experimental results is negligible, and the restoration is not performed.
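A hedged sketch of the two operations in this subsection: a grouped convolution, whose per-group parameter sharing reduces the parameter count relative to a plain convolution, and the transposition of a [B, 2, 2, 4, 4] tensor. The channel counts are illustrative and not taken from the paper.

```python
import torch
import torch.nn as nn

# Group convolution: 8 input channels split into 4 groups of 2; each group has its
# own kernels, and the per-group outputs are concatenated along the channel axis.
group_conv = nn.Conv2d(in_channels=8, out_channels=8, kernel_size=3, padding=1, groups=4)
plain_conv = nn.Conv2d(in_channels=8, out_channels=8, kernel_size=3, padding=1)
print(sum(p.numel() for p in group_conv.parameters()))   # 152 parameters
print(sum(p.numel() for p in plain_conv.parameters()))   # 584 parameters

# Transposition of the second and third dimensions of a [B, 2, 2, 4, 4] tensor:
# the shape is unchanged, but the two dimensions switch positions.
t = torch.randn(8, 2, 2, 4, 4)
t_swapped = t.transpose(1, 2)
print(t_swapped.shape)                                    # torch.Size([8, 2, 2, 4, 4])
```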

4. Experiment Results and Analysis

4.1. Experimental Datasets

The Cohn–Kanade dataset (CK+) is a laboratory-environment dataset, as depicted in Figure 5, showcasing sample expressions for each category. For the task of facial expression recognition in static images, the last three frames of the expression sequence [18], known as the peak expression states, are usually selected for training and testing. In this paper, 981 images are selected as the experimental dataset, comprising 882 images for training and 99 images for testing. The sample distribution of each category is detailed in Table 1.
Fer2013, a representative facial expression dataset collected in natural environments, includes facial occlusions (e.g., hands, hats, and glasses), low pixel resolutions, and faces captured at arbitrary poses and angles. After preprocessing, the samples in the Fer2013 dataset were scaled to 48 × 48 pixels. Figure 6 illustrates sample images for each category of this challenging dataset, and Table 2 outlines the sample distribution for each class.

4.2. Experimental Environment

The experimental environment for Swin-Fer is based on Windows 10, with a 12th Gen Intel(R) Core(TM) i7 CPU at 3.6 GHz and an NVIDIA GeForce RTX 4090 GPU, within a Python development environment. During model training, the Adam optimizer, an adaptive learning-rate optimization algorithm, is adopted; it dynamically adjusts the learning rate based on gradient information at each iteration, achieving faster and more stable convergence. All experiments in the manuscript were conducted with the same environment settings and hyperparameter configurations. The specific experimental parameters are presented in Table 3.
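For reproducibility, the configuration in Table 3 translates into the following sketch (the linear layer is only a placeholder for the Swin-Fer model):

```python
import torch
import torch.nn as nn

torch.manual_seed(1024)                                    # seed from Table 3

model = nn.Linear(96, 7)                                   # placeholder for the Swin-Fer network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # Adam optimizer, learning rate 0.0001
batch_size = 80                                            # batch size from Table 3
```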

4.3. Experimental Results

Applying a transformer to the FER2013 dataset for facial expression recognition is challenging because the original images are only 48 × 48 pixels, whereas most transformer models are designed for input sizes of 224 or 356. The Swin transformer requires a larger image size for operations such as patch embedding, which uses a stride of 4. Therefore, to obtain the transformer baselines, the facial expression images had to be enlarged to 224 × 224. We tested four lightweight transformer structures, with the specific experimental results presented in Table 4. With almost no change in the parameter count, the recognition accuracy of our method reached 71.11% on FER2013 (an increase of 0.41% over the original Swin transformer) and 100% on CK+ (a 3.75% improvement over the original network).
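A minimal preprocessing sketch, assuming torchvision, for enlarging the 48 × 48 grayscale FER2013 images to the 224 × 224 three-channel inputs expected by the network (the paper's exact augmentation pipeline is not reproduced here):

```python
from torchvision import transforms

fer2013_transform = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),   # replicate the single channel to 3 channels
    transforms.Resize((224, 224)),                 # enlarge 48 x 48 images to 224 x 224
    transforms.ToTensor(),
])
```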
To broaden their applicability, transformer architectures usually come in several variants of the same design, such as tiny, small, base, and large. Increasing the model size typically improves accuracy, but the parameter count also grows sharply between variants. The transformer models selected in Table 4 are all lightweight (tiny- or small-level) structures, and our method emphasizes the mean module, adaptive pooling, and group convolution to control the number of parameters [19]. As shown in Table 5, while applying the fusion strategy to extract features more effectively, the total parameter count of the proposed Swin-Fer network increased by only 70 parameters compared to the original Swin transformer.
The experimental results for Swin-Fer and the original Swin transformer on the FER2013 dataset are shown in Figure 7. From Figure 7a, it is evident that, owing to the use of pre-training, both models converged quickly, stabilizing around 40 iterations. Since facial expression recognition is a multi-class classification problem, we adopted the One-vs.-Rest strategy to generate the ROC curves. As shown in Figure 7b, the area under the ROC curve for both models is 0.84, with the curves deviating clearly from the 45-degree diagonal, indicating good sensitivity and specificity. As can be seen in Figure 7c, the Swin-Fer method shows smaller fluctuations in accuracy, exhibiting greater stability. Finally, the confusion matrices in Figure 7d,e show the per-class accuracy changes of Swin-Fer relative to the original Swin transformer: anger (+0.2%), disgust (+3.6%), fear (+0.7%), happiness (−1.7%), neutral (+1.4%), sadness (+1%), and surprise (+1.9%). Accuracy increased for six of the seven classes, demonstrating the effectiveness of the proposed improvements.
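The One-vs.-Rest ROC analysis can be reproduced along the following lines with scikit-learn; the variables y_true and y_score are hypothetical placeholders for the test labels and the per-class predicted probabilities.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

# y_true: integer labels in {0, ..., 6}; y_score: (N, 7) predicted class probabilities.
y_true = np.random.randint(0, 7, size=200)
y_score = np.random.dirichlet(np.ones(7), size=200)

# Macro-averaged AUC under the One-vs.-Rest strategy.
y_onehot = label_binarize(y_true, classes=list(range(7)))
auc = roc_auc_score(y_onehot, y_score, average="macro")
print(f"OvR macro AUC: {auc:.2f}")
```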
Figure 8 presents the experimental results for Swin-Fer and the original Swin transformer network on the CK+ dataset. Due to the limited sample size of the CK+ dataset, the model may be constrained by insufficient data during training and validation, resulting in fluctuations in accuracy. The proposed Swin-Fer method, based on fusion methods and introducing split modules and data dimension transformation strategies, enhances the model’s ability to capture facial expression details. The experimental results show that compared to the original Swin transformer model, the Swin-Fer method has a relatively smoother loss curve on the CK+ dataset, indicating more stable performance during training. Additionally, the accuracy increased by 3.75%, which shows that the Swin-Fer method also exhibits certain potential and advantages in facial expression recognition tasks in laboratory environments.

4.4. Comparison of Experimental Accuracy

Table 6 compares the accuracy of the proposed Swin-Fer network with that of other methods introduced in the past five years on the FER2013 and CK+ datasets. Even representative studies on FER2013 achieve only slightly more than 70% accuracy, which shows that facial expression recognition in natural environments is challenging; adjusting network structures and optimizing hyperparameters can help capture complex image features and further improve model performance. This paper explores transformer-based methods for facial expression recognition, focusing on the design of the Swin-Fer network model. With the network parameters largely unchanged, competitive experimental results were achieved, with an accuracy of 71.11%. In addition, the third column of Table 6 compares the accuracies of different network structures on the CK+ dataset; these accuracies are for the most part very high, with the lowest reaching 96.25%, and the proposed Swin-Fer achieved 100%. These experimental results strengthen confidence in the generalization ability of the model.
To further compare the effectiveness of the proposed model, we conducted training on a larger dataset, AffectNet. AffectNet is a large-scale facial expression dataset designed for training and evaluating facial expression recognition models. It contains over one million facial images collected from the Internet, with approximately 45,000 manually annotated images for eight emotion categories. The images in the AffectNet dataset typically have a resolution of 256 × 256 pixels. These images cover a wide range of facial expressions and poses, making the dataset suitable for research on emotion analysis and facial expression recognition. This resolution is appropriate for training and testing deep learning models, particularly those that require high-resolution inputs, such as transformer models. Table 7 shows the comparison of experimental accuracy across different methods on the AffectNet (eight emotions) dataset.
As indicated in Table 6 and Table 7, although the proposed method did not achieve the highest accuracy on the FER2013 dataset, it demonstrated superior performance on the AffectNet (eight emotions) dataset compared to other advanced models. The larger resolution and color images of AffectNet likely better suit the capabilities of Swin-Fer, highlighting its strength in extracting feature information from high-resolution color images. This suggests that Swin-Fer, based on the Swin transformer for feature extraction, performs well in settings where input images are of larger spatial dimensions, allowing the model to extract more effective feature information.

5. Conclusions

Facial expression recognition is a challenging computer vision task. The key regions of different expressions are distributed diversely, and in natural conditions factors such as head pose, lighting changes, and occlusion make feature extraction particularly difficult [34,35]. To enhance the model’s generalization, Swin-Fer contains a split module, replaces ordinary convolutions with group convolutions in its two branches, and effectively controls the number of parameters through strategies such as averaging and adaptive pooling.
According to the experimental results, the proposed method effectively focuses on the most critical regions of facial expressions and achieves competitive results on datasets from both natural conditions (FER2013) and laboratory environments (CK+). Due to the low resolution of the Fer2013 samples, the performance of transformer models is limited, and accuracy needs to be improved further. On the AffectNet dataset, however, which consists of higher-resolution color images, performance was notably better, demonstrating the advantages of Swin-Fer. Enhancing the performance of transformer models on low-resolution facial expression images remains a key focus for future research. Additionally, the current work is limited to single-person facial expression recognition in static images. Future research should cover a broader range of practical scenarios, such as real-time, multi-person, and dynamic video facial expression recognition, and provide powerful support tools for practical applications.

Author Contributions

Conceptualization, M.B. and X.C.; methodology, H.X. and Y.G.; investigation, H.X. and Y.G.; writing—original draft preparation, M.B., H.X. and K.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was financially supported by the Special Project on Digitization in Education of the Jilin Educational Scientific Research Leading Group under grant JS2338, key project under grant ZD21100, and Social Science Research of the Education Department of Jilin Province (JJKH20231054SK).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Rahali, A.; Akhloufi, M.A. End-to-end transformer-based models in textual-based NLP. AI 2023, 4, 54–110. [Google Scholar] [CrossRef]
  2. Ma, F.; Sun, B.; Li, S. Facial expression recognition with visual transformers and attentional selective fusion. IEEE Trans. Affect. Comput. 2021, 14, 1236–1248. [Google Scholar] [CrossRef]
  3. Shi, C.; Zhao, S.; Zhang, K.; Wang, Y.; Liang, L. Face-based age estimation using improved Swin Transformer with attention-based convolution. Front. Neurosci. 2023, 17, 1136934. [Google Scholar] [CrossRef]
  4. Wang, Q.; Li, Z.; Zhang, S.; Chi, N.; Dai, Q. A versatile Wavelet-Enhanced CNN-Transformer for improved fluorescence microscopy image restoration. Neural Netw. 2024, 170, 227–241. [Google Scholar] [CrossRef] [PubMed]
  5. Shen, X.; Han, D.; Guo, Z.; Chen, C.; Hua, J.; Luo, G. Local self-attention in transformer for visual question answering. Appl. Intell. 2023, 53, 16706–16723. [Google Scholar] [CrossRef]
  6. Chitty-Venkata, K.T.; Mittal, S.; Emani, M.; Vishwanath, V.; Somani, A.K. A survey of techniques for optimizing transformer inference. J. Syst. Archit. 2023, 144, 102990. [Google Scholar] [CrossRef]
  7. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Electr Network, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  8. Zhou, M.; Liu, X.; Yi, T.; Bai, Z.; Zhang, P. A superior image inpainting scheme using Transformer-based self-supervised attention GAN model. Expert Syst. Appl. 2023, 233, 120906. [Google Scholar] [CrossRef]
  9. Xue, F.; Wang, Q.; Guo, G. Transfer: Learning relation-aware facial expression representations with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Electr Network, Montreal, QC, Canada, 11–17 October 2021; pp. 3601–3610. [Google Scholar]
  10. Kim, S.; Nam, J.; Ko, B.C. Facial Expression Recognition Based on Squeeze Vision Transformer. Sensors 2022, 22, 3729. [Google Scholar] [CrossRef]
  11. Zhao, Z.; Liu, Q. Former-dfer: Dynamic facial expression recognition transformer. In Proceedings of the 29th ACM International Conference on Multimedia, Electr Network, Chengdu, China, 20–24 October 2021; pp. 1553–1561. [Google Scholar]
  12. Liang, X.; Xu, L.; Zhang, W.; Zhang, Y.; Liu, J.; Liu, Z. A convolution-transformer dual branch network for head-pose and occlusion facial expression recognition. Vis. Comput. 2023, 39, 2277–2290. [Google Scholar] [CrossRef]
  13. Qin, L.; Wang, M.; Deng, C.; Wang, K.; Chen, X.; Hu, J.; Deng, W. SwinFace: A Multi-Task Transformer for Face Recognition, Expression Recognition, Age Estimation and Attribute Estimation. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 2223–2234. [Google Scholar] [CrossRef]
  14. Bie, M.; Xu, H.; Liu, Q.; Gao, Y.; Che, X. Multi-dimension and Multi-level Information Fusion for Facial Expression Recognition. J. Imaging Sci. Technol. 2023, 67, 1–11. [Google Scholar] [CrossRef]
  15. Kim, J.H.; Kim, N.; Won, C.S. Global–local feature learning for fine-grained food classification based on Swin Transformer. Eng. Appl. Artif. Intell. 2024, 133, 108248. [Google Scholar] [CrossRef]
  16. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghan, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  17. Liu, X.; Peng, H.; Zheng, N.; Yang, Y.; Hu, H.; Yuan, Y. Efficientvit: Memory efficient vision transformer with cascaded group attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 20–22 June 2023; pp. 14420–14430. [Google Scholar]
  18. Cheng, S.; Zhou, G. Facial expression recognition method based on improved VGG convolutional neural network. Int. J. Pattern Recognit. Artif. Intell. 2020, 34, 2056003. [Google Scholar] [CrossRef]
  19. Yang, J.; Li, C.; Zhang, P.; Dai, X.; Xiao, B.; Yuan, L.; Gao, J. Focal attention for long-range interactions in vision transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 30008–30022. [Google Scholar]
  20. Alamsyah, D.; Pratama, D. Implementasi Convolutional Neural Networks (CNN) untuk Klasifikasi Ekspresi Citra Wajah pada FER-2013 Dataset. (JurTI) J. Teknol. Inf. 2020, 4, 350–355. [Google Scholar] [CrossRef]
  21. Nie, H. Face Expression Classification Using Squeeze-Excitation Based VGG16 Network. In Proceedings of the 2nd International Conference on Consumer Electronics and Computer Engineering (ICCECE), Guangzhou, China, 14–16 January 2022; pp. 482–485. [Google Scholar]
  22. Minaee, S.; Minaei, M.; Abdolrashidi, A. Deep-emotion: Facial expression recognition using attentional convolutional network. Sensors 2021, 21, 3046. [Google Scholar] [CrossRef] [PubMed]
  23. Zu, F.; Zhou, C.J.; Wang, X. An improved convolutional neural network based on centre loss for facial expression recognition. Int. J. Adapt. Innov. Syst. 2021, 3, 58–73. [Google Scholar] [CrossRef]
  24. Pan, L.; Shao, W.; Xiong, S.; Lei, Q.; Huang, S.; Beckman, E.; Hu, Q. SSER: Semi-Supervised Emotion Recognition Based on Triplet Loss and Pseudo Label. Knowl.-Based Syst. 2024, 292, 111595. [Google Scholar] [CrossRef]
  25. Shen, T.; Xu, H. Facial Expression Recognition Based on Multi-Channel Attention Residual Network. CMES-Comput. Model. Eng. Sci. 2023, 135, 539–560. [Google Scholar]
  26. Zhu, X.; He, Z.; Zhao, L.; Dai, Z.; Yang, Q. A Cascade Attention Based Facial Expression Recognition Network by Fusing Multi-Scale Spatio-Temporal Features. Sensors 2022, 22, 1350. [Google Scholar] [CrossRef] [PubMed]
  27. Aouayeb, M.; Hamidouche, W.; Soladie, C.; Kpalma, K.; Seguier, R. Learning vision transformer with squeeze and excitation for facial expression recognition. arXiv 2021, arXiv:2107.03107. [Google Scholar]
  28. Zhao, Z.; Liu, Q.; Zhou, F. Robust lightweight facial expression recognition network with label distribution training. In Proceedings of the AAAI conference on artificial intelligence (AAAI), Online, 2–9 February 2021; Volume 35, pp. 3510–3519. [Google Scholar]
  29. Pourmirzaei, M.; Montazer, G.A.; Esmaili, F. Using self-supervised auxiliary tasks to improve fine-grained facial representation. arXiv 2021, arXiv:2105.06421. [Google Scholar]
  30. Savchenko, A.V. Facial expression and attributes recognition based on multi-task learning of lightweight neural networks. In Proceedings of the IEEE 19th International Symposium on Intelligent Systems and Informatics (SISY), Subotica, Serbia, 16–18 September 2021; pp. 119–124. [Google Scholar]
  31. Wen, Z.; Lin, W.; Wang, T.; Xu, G. Distract your attention: Multi-head cross attention network for facial expression recognition. Biomimetics 2023, 8, 199. [Google Scholar] [CrossRef] [PubMed]
  32. Wagner, N.; Mätzler, F.; Vossberg, S.R.; Schneider, H.; Pavlitska, S.; Zöllner, J.M. CAGE: Circumplex Affect Guided Expression Inference. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 4683–4692. [Google Scholar]
  33. Li, J.; Nie, J.; Guo, D.; Hong, R.; Wang, M. Emotion separation and recognition from a facial expression by generating the poker face with vision transformers. arXiv 2022, arXiv:2207.11081. [Google Scholar]
  34. Zhang, L.; Verma, B.; Tjondronegoro, D.; Chandran, V. Facial expression analysis under partial occlusion: A survey. ACM Comput. Surv. (CSUR) 2018, 51, 1–49. [Google Scholar] [CrossRef]
  35. Shao, J.; Qian, Y. Three convolutional neural network models for facial expression recognition in the wild. Neurocomputing 2019, 355, 82–92. [Google Scholar] [CrossRef]
Figure 1. The network architecture of Swin-Fer.
Figure 2. Schematic representation of patch merging.
Figure 3. Group convolution.
Figure 4. Transpose operation.
Figure 5. Sample examples from the CK+ dataset.
Figure 6. Sample examples from the Fer2013 dataset.
Figure 7. Experimental results on the FER2013 dataset.
Figure 8. Experimental results on the CK+ dataset.
Table 1. Sample distribution of the CK+ dataset.
Expression | Surprise | Happy | Disgust | Anger | Fear | Sad | Contempt | Total
Train | 225 | 186 | 159 | 123 | 66 | 75 | 48 | 882
Test | 24 | 21 | 18 | 12 | 9 | 9 | 6 | 99
Total | 249 | 207 | 177 | 135 | 75 | 84 | 54 | 981
Table 2. Sample distribution of the FER2013 dataset.
Expression | Happy | Neutral | Sad | Fear | Angry | Surprise | Disgust | Total
Train | 7215 | 4965 | 4830 | 4097 | 3995 | 3171 | 436 | 28,709
Test | 1774 | 1233 | 1247 | 1024 | 958 | 831 | 111 | 7178
Total | 8989 | 6198 | 6077 | 5121 | 4953 | 4002 | 547 | 35,887
Table 3. Experimental parameters for Swin-Fer.
Parameter Name | Value
Batch size | 80
Learning rate | 0.0001
Optimizer | Adam
Seed | 1024
Image size | 224
Table 4. Accuracy of transformer methods on the FER2013 and CK+ datasets.
Method | Fer2013 | CK+
EdgeViT | 0.5935 | 0.725
EfficientFormer | 0.6053 | 0.7125
Vision transformer | 0.6848 | 0.85
Swin transformer | 0.7070 | 0.9625
Swin-Fer | 0.7111 | 1
Table 5. Comparison of network parameter counts for transformer methods.
Network | Number of Parameters
EdgeViT | 11.30 M
EfficientFormer | 19.8 M
Vision transformer | 21.42 M
Swin transformer | 27.53 M (27,530,120)
Swin-Fer | 27.53 M (27,530,190)
Table 6. Comparison of experimental accuracy of different methods on the FER2013 and CK+ datasets.
Network | Fer2013 | CK+
CNN using the Adamax optimizer [20] | 66% | -
VGG16 + SE Block [21] | 66.8% | 99.18%
DeepEmotion [22] | 70.02% | 98%
Swin transformer | 70.70% | 96.25%
Improved CNN based on center loss [23] | 71.39% | 96.64%
SSER [24] | 71.62% | 97.59%
Multi-Channel Attention Residual Network [25] | 72.7% | 98.8%
ResNet-50 + pyramid + cascaded attention block + GRU [26] | - | 99.23%
ViT + SE [27] | - | 99.8%
Swin-Fer | 71.11% | 100%
Table 7. Comparison of experimental accuracy of different methods on the AffectNet dataset.
Network | AffectNet (Eight Emotions)
EfficientFace [28] | 59.89%
SL + SSL puzzling (B2) [29] | 61.32%
Multi-Task EfficientNet-B0 [30] | 61.32%
DAN [31] | 62.09%
CAGE [32] | 62.30%
ViT-base + MAE [33] | 62.42%
Swin-Fer | 63.29%
