Sensors · Article · Open Access · 21 August 2024

TriCAFFNet: A Tri-Cross-Attention Transformer with a Multi-Feature Fusion Network for Facial Expression Recognition

Faculty of Artificial Intelligence in Education, Central China Normal University, Wuhan 430079, China
* Author to whom correspondence should be addressed.
This article belongs to the Section Intelligent Sensors

Abstract

In recent years, significant progress has been made in facial expression recognition methods. However, facial expression recognition in real environments still requires further research. This paper proposes a tri-cross-attention transformer with a multi-feature fusion network (TriCAFFNet) to improve facial expression recognition performance under challenging conditions. By combining LBP (Local Binary Pattern) features, HOG (Histogram of Oriented Gradients) features, landmark features, and CNN (convolutional neural network) features of facial images, the model is provided with rich input to improve its ability to discern subtle differences between images. Additionally, tri-cross-attention blocks are designed to facilitate information exchange between different features, enabling mutual guidance among them to capture salient attention. Extensive experiments on several widely used datasets show that our TriCAFFNet achieves SOTA performance with 92.17% on RAF-DB, 67.40% on AffectNet (7 cls), and 63.49% on AffectNet (8 cls).

1. Introduction

Facial expression is one of the most prevalent and vital signals conveying human emotions and intentions. As one of the fundamental tasks in computer vision, facial expression recognition (FER) has attracted increasing attention in recent years due to its close relevance to human–computer interaction, education, healthcare, and online monitoring applications.
Two types of datasets are primarily used in the facial expression recognition task. The first type is datasets collected in laboratory environments, such as JAFFE [1], CK+ [2], etc. The facial expression images in these datasets are pictures of individuals making expressions according to instructions in a laboratory environment and then being captured by a camera. The second type is datasets in real-world scenarios, such as RAF-DB [3], AffectNet [4], FERPlus [5], etc. Most of the images in these datasets are collected from the Internet and obtained by keyword searches, and the background environments of the pictures vary significantly. In early studies on facial expression recognition, the datasets used were primarily collected in laboratory settings. For these datasets, researchers relied on handcrafted features such as HOG (Histograms of Oriented Gradients) [6], LBP (Local Binary Patterns) [7], LDP (Local Directional Patterns) [8], and SIFT (Scale-Invariant Feature Transform) [9] features to analyze facial expressions, and they achieved good recognition results. However, in practical applications, the environment is complex and dynamic. Using handcrafted features for facial expression recognition faces challenges due to uncontrollable factors such as occlusion, head posture variations, facial deformation, etc., leading to a decrease in recognition effectiveness. With the development of deep learning, CNNs (convolutional neural networks) have been introduced into the field of facial expression recognition due to their computational efficiency and powerful feature extraction capabilities [10,11,12,13]. Algorithms based on CNNs have achieved outstanding results on datasets collected in laboratory environments. However, constrained by their local receptive fields, CNNs fail to fully consider the importance of the global information of images. Although the recognition accuracy on laboratory datasets has already exceeded 99%, the recognition accuracy on datasets collected in natural environments still needs to be improved. After the Transformer was applied to computer vision, Xue et al. [14] designed the first Transformer network for facial expression recognition, TransFER, which improved Vision Transformer (ViT) by combining global and local information. Since then, ViT has been introduced to facial expression recognition tasks with state-of-the-art (SOTA) results [15,16,17,18,19,20].
Meanwhile, researchers have explored the role of different features in facial expression recognition. LBP features can effectively describe the texture information of an image and are robust to illumination changes, which makes them a useful complement to image features. By combining LBP features with image features and fusing global and local information, facial expression recognition performance can be effectively improved [19,21,22]. Some studies have investigated the role of facial landmarks in facial expression recognition. Landmarks are a set of crucial points in a face image; they exclude the interference of factors such as skin color, gender, age, and image background, and they have a certain generalization ability. They provide a sparse representation of the facial region and can be used as a complement to image information. Landmark features extracted automatically can help the model focus on more important details and increase the inter-class variance [23,24]. Zheng et al. [25] proposed the idea of fusing landmark features and image features. They found that landmark features are key to addressing inter-class similarity and intra-class differences, with image features used as supplementary information to assist landmark features in facial expression classification. Although this approach has achieved good results, it still has the following problems. On the one hand, adding scale information solely through upsampling and downsampling may not enrich semantic details as much as introducing other features. On the other hand, the network structure introduces a large number of parameters, leading to increased computational cost.
Despite the outstanding performance of these algorithms, the following problems still exist: (1) Insufficient feature information. Convolutional neural networks can only extract the CNN features of images, and the input to ViT can only be a single RGB image. Both handle only a single type of information and fail to make effective use of features that help distinguish subtle differences between images. (2) Inter-class similarity. There are similarities between different expression categories of the same person, which makes it difficult to distinguish between different expressions. (3) Intra-class variability. There may be significant variability between images of the same expression from different individuals, and this variability can easily lead to the misrecognition of expression categories. For example, differences in skin color, gender, image environment, and age among different individuals can contribute to such variations. (4) Model parameters and floating-point operations (FLOPs). Many works have considered only accuracy as the model evaluation metric. However, a large number of parameters and slow operation speed cannot meet the needs of practical applications, especially the requirements of real-time expression recognition tasks. Therefore, it is essential to include model parameter size and FLOPs in the evaluation criteria.
In this paper, we propose a tri-cross-attention transformer with a multi-feature fusion network (TriCAFFNet) to further address the four problems of insufficient feature information, inter-class similarity, intra-class variability, and an excessive number of model parameters. TriCAFFNet is a model that processes landmark features, CNN features, and LBPHOG (fusion features of Local Binary Pattern and Histogram of Oriented Gradients) features by using the original input image and the fused feature image as model inputs. Given the strong performance of LBP and HOG features in facial expression recognition, they are used, in place of other handcrafted features, as important information to distinguish different faces and thus decrease intra-class variability. The proposed model uses a convolutional neural network to extract high-level features from the fused LBPHOG image, enhancing the model's ability to recognize facial expressions and improving overall robustness. With their advantages of automatic extraction and robustness to irrelevant surrounding factors in an image, landmark features are used to distinguish subtle differences and decrease inter-class similarity. Meanwhile, the CNN features are used as complementary information for the landmark features and LBPHOG features, and the proposed tri-cross-attention mechanism can adaptively fuse these three types of feature information. In addition, our model extracts two different features using the same image backbone, which keeps the total number of parameters of the model at a low level while keeping recognition accuracy at an advanced level. In summary, the contributions of this paper are as follows:
  • This paper proposes to introduce LBP, HOG, landmark, and CNN features of an image simultaneously in the facial expression recognition task to alleviate the problem of insufficient feature information in the expression recognition task by fusing and cross-utilizing multiple different features.
  • A tri-cross-attention mechanism is proposed, which enhances the model’s ability to resolve inter-class similarities and intra-class variability by guiding the three types of features to each other, adaptively fusing the three types of features, and comprehensively exploiting the advantages of each feature.
  • The effectiveness of TriCAFFNet was verified through extensive experiments, with TriCAFFNet achieving state-of-the-art recognition accuracies on two commonly used datasets (RAF-DB 92.17%, AffectNet (7 cls) 67.40%, AffectNet (8 cls) 63.49%).

3. Method

3.1. Baseline

Figure 1 illustrates the overall architecture of our Baseline model. For the input image, we obtain the feature matrix $X_{cnn} \in \mathbb{R}^{P \times D}$ with an image backbone IR50 [39], where $P$ is the number of patches and $D$ is the feature dimension. After obtaining the image features $X_{cnn}$, the transformer architecture utilizes a self-attention mechanism to capture correlations across patches, realized by the multi-head self-attention (MSA) layer. The input $X_{cnn}$ is first mapped to three matrices, the query matrix $Q$, key matrix $K$, and value matrix $V$, by linear transformations:
$$Q = X_{cnn} W_Q, \quad K = X_{cnn} W_K, \quad V = X_{cnn} W_V$$
where $W_Q, W_K, W_V \in \mathbb{R}^{D \times D}$.
Figure 1. The architecture of Baseline.
The vanilla transformer attention block can be described by the following equation:
$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(QK^{T}/\sqrt{d}\right)V$$
where $\sqrt{d}$ is the scaling factor for appropriate normalization.
Then, the encoder output is calculated by the vanilla transformer encoder, which consists of MSA and an MLP with a layer normalization operator; the calculation is described by the following two equations. Finally, the encoder output is fed into the output layer, with an SE block and another MLP, to obtain the likelihood result.
$$X'_{cnn} = \mathrm{MSA}(Q, K, V) + X_{cnn}$$
$$X_{cnn\_out} = \mathrm{MLP}(\mathrm{Norm}(X'_{cnn})) + X'_{cnn}$$
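For concreteness, the following is a minimal PyTorch sketch of a Baseline encoder block following the equations above. It uses a single attention head and illustrative dimensions (`dim`, `mlp_ratio`); these names and values are assumptions for illustration rather than the authors' implementation, which additionally employs multi-head attention and an SE block in the output layer.

```python
import torch.nn as nn
import torch.nn.functional as F

class BaselineEncoderBlock(nn.Module):
    """Sketch of the Baseline encoder: attention plus MLP with residual connections."""
    def __init__(self, dim=512, mlp_ratio=2.0):
        super().__init__()
        self.W_q = nn.Linear(dim, dim, bias=False)  # W_Q in R^{D x D}
        self.W_k = nn.Linear(dim, dim, bias=False)  # W_K
        self.W_v = nn.Linear(dim, dim, bias=False)  # W_V
        self.norm = nn.LayerNorm(dim)
        hidden = int(dim * mlp_ratio)
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        self.scale = dim ** -0.5                    # 1 / sqrt(d)

    def forward(self, x_cnn):                       # x_cnn: (B, P, D) patch features from the image backbone
        q, k, v = self.W_q(x_cnn), self.W_k(x_cnn), self.W_v(x_cnn)
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)  # Softmax(QK^T / sqrt(d))
        x = attn @ v + x_cnn                        # X'_cnn = Attention(Q, K, V) + X_cnn
        return self.mlp(self.norm(x)) + x           # X_cnn_out = MLP(Norm(X'_cnn)) + X'_cnn
```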

3.2. TriCAFFNet

Figure 2 illustrates the overall architecture of TriCAFFNet. Our network structure includes a multi-feature input, a feature extraction block, a feature fusion block, several tri-cross-attention blocks, and an output. The multi-feature input comprises two types of images: the original facial image and the fused LBPHOG feature image, where the latter is generated by processing the original facial image with the LBP and HOG methods. In the feature extraction block, the original face image and the fused LBPHOG feature image are first preprocessed and then input into the feature extraction block to obtain the LBPHOG feature map, the CNN feature map, and the landmark feature map. The network contains eight tri-cross-attention blocks. In each block, the Q-matrices of different features are swapped when computing the cross-attention maps, and the Q-matrices of the other two features are merged to serve as the Q-matrix of the CNN feature. The cross-attention maps are concatenated to obtain the fused cross-attention map, which is processed by the SE block and fully connected layers to obtain the final expression probability output. In the following sections, we discuss the characteristics of the different features and how they exchange information.
Figure 2. The overall architecture of TriCAFFNet. A facial landmark detector, MobileFaceNet, is applied to obtain the landmark features and the high-level LBPHOG features, and an image backbone, IR50, is used to extract the CNN features.

3.3. Multi-Feature

Feature fusion is the key to enhancing the ability to solve inter-class similarities and intra-class variability in facial expression recognition. Obtaining high-level semantic information for expression recognition through feature fusion is a critical way to improve the accuracy of expression recognition. LBP features and HOG features have been widely used in face recognition and have achieved good results. LBP features can express the texture information of an image, and HOG can express the edge information of an image. They have similar characteristics to landmarks, which can provide significant information on critical parts of a face to support the distinction of subtle differences between similar faces. HOG features can enhance the edge information of the LBP features, and by fusing the two feature images, the texture features and edge features can be fused within a single image. In this paper, the improved circular LBP feature extraction algorithm and HOG feature extraction algorithm are used to extract the LBP feature image and HOG feature image of the input RGB image, respectively. The two feature images are fused by element-wise addition, followed by further extraction of their advanced features to improve the model’s capability to distinguish the critical areas of the face.
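As an illustration of this fusion step, the sketch below generates a fused LBPHOG image with scikit-image; the LBP radius, the number of sampling points, and the HOG cell/block sizes are illustrative assumptions rather than the settings used in the paper.

```python
import numpy as np
from skimage.feature import local_binary_pattern, hog

def lbphog_image(gray):
    """Fuse a circular-LBP map and a HOG visualization by element-wise addition.
    `gray` is a 2D grayscale face image; parameter values are illustrative."""
    lbp = local_binary_pattern(gray, P=8, R=1, method="uniform")      # circular LBP texture map
    _, hog_img = hog(gray, orientations=9, pixels_per_cell=(8, 8),
                     cells_per_block=(2, 2), visualize=True)          # HOG edge visualization

    def minmax(x):                                                    # normalize each map to [0, 1]
        x = x.astype(np.float32)
        return (x - x.min()) / (x.max() - x.min() + 1e-8)

    return minmax(lbp) + minmax(hog_img)                              # fused LBPHOG feature image
```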
Meanwhile, the global information of the image is also essential for facial expression recognition. We extract the CNN features of the input RGB image using a convolutional neural network and combine the local and global information of the picture. In the process of extracting the above features, the MobileFaceNet network pre-trained on ImageNet [40] is used as the landmark detector, and its parameters are frozen during training in order to extract the landmark features quickly. In addition, taking advantage of MobileFaceNet's sensitivity to salient regions, we use the same MobileFaceNet network as a feature extractor to extract high-level features from the LBPHOG feature images. The experimental results also demonstrate that this not only reduces the overall number of parameters of the model but also extracts useful feature information. The IR50 network pre-trained on the Ms-Celeb-1M [41] dataset is used as the image backbone to extract CNN features of the input RGB image. Three feature matrices $X_{landmark} \in \mathbb{R}^{P \times D}$, $X_{lbphog} \in \mathbb{R}^{P \times D}$, and $X_{cnn} \in \mathbb{R}^{P \times D}$ are extracted, containing the landmark features, LBPHOG high-level features, and CNN features, respectively.

3.4. Tri-Cross-Attention

For the inputs to the Transformer encoder, $X_{landmark} \in \mathbb{R}^{P \times D}$, $X_{lbphog} \in \mathbb{R}^{P \times D}$, and $X_{cnn} \in \mathbb{R}^{P \times D}$, the three streams are converted into the query matrix $Q$, the key matrix $K$, and the value matrix $V$, respectively, by linear transformations:
$$Q_1 = X_{landmark} W_{Q_1}, \quad K_1 = X_{landmark} W_{K_1}, \quad V_1 = X_{landmark} W_{V_1}$$
$$Q_2 = X_{cnn} W_{Q_2}, \quad K_2 = X_{cnn} W_{K_2}, \quad V_2 = X_{cnn} W_{V_2}$$
$$Q_3 = X_{lbphog} W_{Q_3}, \quad K_3 = X_{lbphog} W_{K_3}, \quad V_3 = X_{lbphog} W_{V_3}$$
where $W_{Q_1}, W_{Q_2}, W_{Q_3}, W_{K_1}, W_{K_2}, W_{K_3}, W_{V_1}, W_{V_2}, W_{V_3} \in \mathbb{R}^{D \times D}$.
The architecture of the tri-cross-attention module is shown in Figure 3, which can be described as the following formulas:
$$\mathrm{Attention}_{landmark} = \mathrm{Softmax}\!\left(Q_2 K_1^{T}/\sqrt{d}\right)V_1$$
$$\mathrm{Attention}_{cnn} = \mathrm{Softmax}\!\left((Q_1 + Q_3) K_2^{T}/\sqrt{d}\right)V_2$$
$$\mathrm{Attention}_{lbphog} = \mathrm{Softmax}\!\left(Q_2 K_3^{T}/\sqrt{d}\right)V_3$$
where $\sqrt{d}$ is the scaling factor used for normalization. $Q_1$, $Q_2$, and $Q_3$ are matrices computed from the landmark feature map, CNN feature map, and LBPHOG feature map, respectively.
Figure 3. The architecture of the tri-cross-attention module.
During the computation of the respective attention, the Q matrix of the CNN feature map is replaced with $Q_1 + Q_3$, and the Q matrices of the other two feature maps are replaced with $Q_2$. The high distinguishing ability of the landmark feature maps and the LBPHOG high-level feature maps is used to guide the CNN features in computing attention, while the information of the landmark features, LBPHOG high-level features, and CNN features is preserved. The tri-cross-attention module introduces global information by exchanging the Q matrix of the CNN features into the attention computation of the landmark features and the LBPHOG high-level features, respectively. For the given inputs $X_{landmark}$, $X_{lbphog}$, and $X_{cnn}$, their computation through the encoder can be expressed as the following equations:
$$X'_{landmark} = \mathrm{CMSA}_{landmark}(Q_2, K_1, V_1) + X_{landmark}$$
$$X_{landmark\_out} = \mathrm{MLP}(\mathrm{Norm}(X'_{landmark})) + X'_{landmark}$$
$$X'_{cnn} = \mathrm{CMSA}_{cnn}(Q_1, Q_2, Q_3, K_2, V_2) + X_{cnn}$$
$$X_{cnn\_out} = \mathrm{MLP}(\mathrm{Norm}(X'_{cnn})) + X'_{cnn}$$
$$X'_{lbphog} = \mathrm{CMSA}_{lbphog}(Q_2, K_3, V_3) + X_{lbphog}$$
$$X_{lbphog\_out} = \mathrm{MLP}(\mathrm{Norm}(X'_{lbphog})) + X'_{lbphog}$$
where $\mathrm{CMSA}(\cdot)$ denotes the cross-attention multi-head self-attention block, $\mathrm{Norm}(\cdot)$ denotes the layer normalization operation, and $\mathrm{MLP}(\cdot)$ denotes the multi-layer perceptron.
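A minimal single-head sketch of one tri-cross-attention block, written in PyTorch under the same illustrative assumptions as the Baseline sketch (dimension and MLP sizes), is shown below; it mirrors the Q-swapping described above but is not the authors' released implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

class TriCrossAttentionBlock(nn.Module):
    """Single-head sketch of one tri-cross-attention block with swapped Q matrices."""
    def __init__(self, dim=512, mlp_ratio=2.0):
        super().__init__()
        streams = ("landmark", "cnn", "lbphog")
        self.qkv = nn.ModuleDict({s: nn.ModuleDict({m: nn.Linear(dim, dim, bias=False)
                                                    for m in ("q", "k", "v")}) for s in streams})
        self.norm = nn.ModuleDict({s: nn.LayerNorm(dim) for s in streams})
        hidden = int(dim * mlp_ratio)
        self.mlp = nn.ModuleDict({s: nn.Sequential(nn.Linear(dim, hidden), nn.GELU(),
                                                   nn.Linear(hidden, dim)) for s in streams})
        self.scale = dim ** -0.5

    def _attend(self, q, k, v):                      # Softmax(QK^T / sqrt(d)) V
        return F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1) @ v

    def forward(self, x_lmk, x_cnn, x_lbp):          # each input: (B, P, D)
        q1, k1, v1 = (self.qkv["landmark"][m](x_lmk) for m in ("q", "k", "v"))
        q2, k2, v2 = (self.qkv["cnn"][m](x_cnn) for m in ("q", "k", "v"))
        q3, k3, v3 = (self.qkv["lbphog"][m](x_lbp) for m in ("q", "k", "v"))
        # The CNN Q guides the landmark and LBPHOG streams; their summed Qs guide the CNN stream.
        y_lmk = self._attend(q2, k1, v1) + x_lmk
        y_cnn = self._attend(q1 + q3, k2, v2) + x_cnn
        y_lbp = self._attend(q2, k3, v3) + x_lbp
        y_lmk = self.mlp["landmark"](self.norm["landmark"](y_lmk)) + y_lmk
        y_cnn = self.mlp["cnn"](self.norm["cnn"](y_cnn)) + y_cnn
        y_lbp = self.mlp["lbphog"](self.norm["lbphog"](y_lbp)) + y_lbp
        return y_lmk, y_cnn, y_lbp
```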
In the process of fusing two Q matrices in the CNN stream, we explored different fusion methods, including direct element-wise addition, concatenation followed by dimension reduction using a 1 × 1 convolution, and upsampling or downsampling approaches. Experimental results show that element-wise addition is the most effective. We believe that in the process of summing the two matrices, the fusion matrix is able to obtain the components of the two sub-matrices with large weight values, fusing the more important attention components of each of the two sub-matrices.

4. Experiments

4.1. Datasets

The availability of large-scale datasets in real-world scenarios is essential for facial expression recognition tasks. Large-scale datasets can provide rich samples of facial expressions, which can help networks learn more comprehensive features. These datasets can cover face images of different ages, genders, races, and emotional states, thus making the trained models more robust in real-world scenarios. In this paper, two commonly used facial expression datasets in real-world scenarios are chosen to validate the effectiveness of the proposed TriCAFFNet model.
RAF-DB [3]: Real-world Affective Faces Database (RAF-DB) is a facial expression recognition dataset containing seven different basic facial expressions (happy, sad, anger, surprise, fear, disgust, and neutral). The dataset includes 29,672 face images from the real world. Among them, 15,339 images are used for facial expression recognition tasks, with 12,271 images for training and 3068 images for testing. Each image is annotated with labelled expression categories and intensity levels. The RAF-DB dataset is widely used in the field of facial expression recognition and is one of the commonly used benchmark datasets for evaluating and comparing different algorithms.
AffectNet [4]: AffectNet is one of the largest publicly available datasets for facial expression recognition tasks. It is a large-scale dataset collected in real-world scenarios, containing more than one million facial images gathered from the Internet. The photos are labelled into eight emotion categories, comprising the seven primary expressions and the contempt expression. Model recognition results can be compared under two settings: accuracy on seven emotion categories and accuracy on eight emotion categories. We verify the recognition performance of the proposed model under both settings.

4.2. Implementation Details

In the experiments, standardized preprocessing is applied to all input images, including resizing to a uniform size of 224 × 224, random horizontal flipping, random vertical flipping, random addition of Gaussian noise, and random erasure of specific regions in the images. The feature extraction model utilizes the IR50 model pre-trained on the Ms-Celeb-1M dataset as the image backbone. The MobileFaceNet with frozen parameters is employed to extract both the facial landmark features and the high-level LBPHOG features. The learning rate is initialized to 0.0004, the Adam optimizer is used, and the batch size is set to 200. The MLP ratio and drop path rate are set to 2.0 and 0.01, respectively. Cross-entropy is used as the loss function. We implemented all the experiments on an NVIDIA RTX 3090 GPU based on the PyTorch framework.
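A sketch of the preprocessing and optimizer configuration described above follows; the Gaussian-noise scale and the erasing parameters are illustrative assumptions, and the assembled TriCAFFNet model itself is not defined here.

```python
import torch
import torch.nn as nn
from torchvision import transforms

# Preprocessing reported above: resize to 224x224, random horizontal/vertical flips,
# random Gaussian noise, and random erasing. The noise scale is an illustrative assumption.
train_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.ToTensor(),
    transforms.Lambda(lambda x: x + 0.01 * torch.randn_like(x)),
    transforms.RandomErasing(),
])

def build_optimizer_and_loss(model: nn.Module):
    """Adam with the reported initial learning rate, plus the cross-entropy loss.
    DataLoader batch_size=200; mlp_ratio=2.0 and drop_path=0.01 are set inside the model."""
    optimizer = torch.optim.Adam(model.parameters(), lr=4e-4)
    criterion = nn.CrossEntropyLoss()
    return optimizer, criterion
```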

4.3. Comparison with State-of-the-Art Results

In this section, we compare the recognition results of the method proposed in this paper with some SOTA methods on RAF-DB and AffectNet datasets.
Results on RAF-DB: We compare the results with the FER algorithms that have achieved SOTA performance on the RAF-DB dataset in recent years. The results are shown in Table 1. The experimental results indicate that our TriCAFFNet achieved SOTA performance on the RAF-DB dataset with an accuracy of 92.17%. It outperforms MVT (88.62%) [17], PSR (88.98%) [28], QOT (89.97%) [42], TransFER (90.91%) [14], APViT (91.98%) [29], and LCFC (89.23%) [30] by 3.55%, 3.19%, 2.2%, 1.26%, 0.19%, and 2.94%, respectively. It surpasses our Baseline (91.88%) by 0.29% and also surpasses the second-highest method, POSTER (92.05%) [25], by 0.12%.
Table 1. Comparison results with other SOTA FER algorithms on RAF-DB.
Results on AffectNet (7 cls): The results of TriCAFFNet on the AffectNet (7 cls) dataset in comparison with previous methods are presented in Table 2. TriCAFFNet (67.40%) outperforms MVT (64.57%) [17], EAC (65.32%) [43], TransFER (66.23%) [14], APViT (66.91%) [29], and POSTER (67.31%) [25] on AffectNet (7 cls) by 2.83%, 2.08%, 1.17%, 0.49%, and 0.09%, respectively. It surpasses our Baseline (66.34%) by 1.06% and also surpasses the second-highest method, QOT (67.37%) [42], by 0.03%.
Table 2. Comparison results with other SOTA FER algorithms on AffectNet (7 cls).
Results on AffectNet (8 cls): Table 3 shows the results of TriCAFFNet on the AffectNet (8 cls) dataset in comparison with previous methods. TriCAFFNet (63.49%) is 2.81%, 2.16%, 2.09%, and 1.32% higher than PSR (60.68%) [28], ARM (61.33%) [44], MVT (61.40%) [17], and LCFC (62.17%) [30], respectively. It surpasses our Baseline (63.14%) by 0.35% and is 0.15% higher than the second-highest method, POSTER (63.34%) [25].
Table 3. Comparison results with other SOTA FER algorithms on AffectNet (8 cls).
As shown in Figure 4, we display the confusion matrices of the proposed model on these datasets. Darker colors at the diagonal positions of a confusion matrix represent higher recognition accuracy for that class, and lighter colors at other positions indicate lower misidentification rates. It can be seen that TriCAFFNet performs well on the RAF-DB dataset and is weaker only on Fear and Disgust, the classes with fewer training samples. It also reaches a good level on AffectNet.
Figure 4. Confusion matrices of TriCAFFNet on RAF-DB, AffectNet (7 cls), and AffectNet (8 cls).
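For reference, a row-normalized confusion matrix like those in Figure 4 can be computed as in the brief sketch below; the class ordering is an assumption for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Assumed RAF-DB class ordering, for illustration only.
CLASSES = ["surprise", "fear", "disgust", "happy", "sad", "anger", "neutral"]

def normalized_confusion(y_true, y_pred):
    """Row-normalized confusion matrix: entry (i, j) is the fraction of class-i
    samples predicted as class j, so the diagonal gives per-class accuracy."""
    cm = confusion_matrix(y_true, y_pred, labels=range(len(CLASSES))).astype(np.float64)
    return cm / np.maximum(cm.sum(axis=1, keepdims=True), 1.0)
```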
Model size and FLOPs: In model evaluation, the number of parameters and FLOPs are also crucial metrics. Table 4 compares the number of parameters and FLOPs of our TriCAFFNet model with those of MVT [17], VTFF [19], TransFER [14], POSTER [25], and APViT [29]. The proposed model keeps the number of parameters at a low level by using the same image backbone to extract different features while also improving performance.
Table 4. Comparison of parameters, FLOPs, and accuracy on RAF-DB and AffectNet datasets with SOTA models.

4.4. Result Analysis

In order to better understand TriCAFFNet's efficient use of multiple features and the effectiveness of the architecture, we compared TriCAFFNet with other SOTA methods in terms of per-class facial expression recognition accuracy as well as average accuracy over the seven expression classes of the RAF-DB dataset. The results are shown in Table 5.
Table 5. Comparison of class-wise accuracy with some SOTA models on RAF-DB.
Among the seven expression categories, TriCAFFNet achieves the highest facial expression recognition accuracy on neutral and sad compared to all other models listed in the table. Neutral expressions are easily misidentified as other expressions because they have no apparent emotional coloring. TriCAFFNet outperforms the second-ranked model, TransFER [14], on neutral by 2.79%, which indicates that TriCAFFNet is capable of utilizing multiple features to distinguish subtle differences between various expressions, enhancing its ability to address inter-class similarity issues. The feature differences between sad and other expressions are relatively distinct, and TriCAFFNet enhances this distinctiveness by introducing landmark features and LBPHOG features, thereby improving the model's ability to address inter-class differences. Additionally, all models perform worst in the fear category because, in the training set of RAF-DB, the fear category accounts for only 2.2% of the total number of images, nearly 20 times less than the happy category (38.9%), which has the largest proportion. Therefore, the models lack a sufficient number of training samples for the fear category. TriCAFFNet performs well in this category as well, which indicates that even with a limited number of training samples, TriCAFFNet is able to capture various features that provide valuable information for recognition.
We further analyze the misrecognition rates of the model on the neutral, fear, and sad expression categories and compare them with POSTER [25]. The results, as shown in Table 6, illustrate the ability of TriCAFFNet to address both inter-class similarity and intra-class variability issues. In recognition of the neutral category, the probability of TriCAFFNet misidentifying it as fear or anger is 0, and the likelihood of misidentifying it as other expression categories is also maintained at a low level. Compared to POSTER [25], TriCAFFNet can further reduce the probability of misidentifying neutral as happy, surprise, and disgust. In the category of fear, which has the smallest sample size, the number of misidentifications as surprise and sad is relatively high due to the dataset itself. However, TriCAFFNet is still able to keep the number of misidentifications of fear as other categories, excluding surprise and sad, at a low level.
Table 6. Comparison results of misclassification rate on RAF-DB (↓ indicates that TriCAFFNet has a lower misclassification rate than POSTER in this category; ↑ indicates that TriCAFFNet has a higher misclassification rate than POSTER in this category).
As shown in Figure 5, the high-dimensional features of TriCAFFNet are visualized using the t-SNE [45] method. The t-SNE plots on the RAF-DB dataset present a clear clustering effect and obvious separability, with points of different colors far away from each other. Compared with our Baseline model, points of the same color are more tightly clustered in the low-dimensional space, which suggests that they have similar features in the high-dimensional space, further illustrating the SOTA capability of TriCAFFNet in the expression classification task. On the AffectNet dataset, because of its massive number of samples and the unbalanced distribution among them, the distances between points of individual colors are closer; TriCAFFNet still presents a competitive performance.
Figure 5. t-SNE visualization of the high-dimensional features on RAF-DB and AffectNet.
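The t-SNE plots in Figure 5 can be reproduced along the lines of the following sketch, where `features` are the high-dimensional embeddings taken before the classification layer; the perplexity and marker settings are illustrative assumptions.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features, labels):
    """Project high-dimensional expression features to 2D and color by class label."""
    emb = TSNE(n_components=2, init="pca", perplexity=30).fit_transform(features)
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="tab10", s=3)
    plt.axis("off")
    plt.show()
```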

4.5. Ablation Study

We conducted ablation experiments on RAF-DB and AffectNet datasets to validate the effectiveness of our proposed architecture, and the results are shown in Table 7.
Table 7. Results of ablation experiments of key components of TriCAFFNet.
Landmark. We verify the importance of landmark features by removing them from the three-stream attention mechanism. After removing landmark features, the recognition accuracy of TriCAFFNet decreases from 92.17% to 91.40% on the RAF-DB dataset and from 67.40% to 67.10% on the AffectNet dataset, a noticeable reduction in recognition performance. Therefore, introducing landmark features to enhance the model's ability to distinguish subtle differences between images effectively improves its recognition ability.
LBPHOG. We investigated the effect of introducing LBPHOG features on recognition. The recognition rate of the model on both datasets decreases when the LBPHOG features are removed (RAF-DB decreases by 0.43% and AffectNet decreases by 0.10%). Therefore, the ability of the model to distinguish subtle differences between images can be further enhanced by introducing LBPHOG features, thus improving the accuracy of facial expression recognition.
Tri-Cross-Attention. We explore the effect of the tri-cross-attention mechanism on the recognition results by removing it. After removing the tri-cross-attention module, the accuracy decreases by 0.78% on RAF-DB and 0.26% on AffectNet. This yields the poorest recognition performance among the three ablation experiments, which indicates that the proposed tri-cross-attention mechanism is crucial for the model. For the three different features, the tri-cross-attention mechanism enables mutual fusion and guidance, and compared with common fusion methods, it is more helpful in improving the recognition accuracy of the model.

5. Conclusions

In this paper, we propose the tri-cross-attention transformer with a multi-feature fusion network (TriCAFFNet) for facial expression recognition. We improve the model's ability to solve the inter-class similarity and intra-class variability problems by introducing landmark features, LBPHOG features, and CNN features of the image, enabling the model to acquire sufficient information relevant to recognition. The proposed tri-cross-attention mechanism is used to make the three features fuse and guide each other. Extensive FER experiments show that TriCAFFNet achieves SOTA performance while keeping the number of parameters as well as FLOPs at a low level, which makes TriCAFFNet a good choice for FER as it strikes a balance between accuracy and computational complexity.
We enhanced the model’s recognition capability by selectively leveraging distinct features. However, this approach also increased the complexity of our model inputs, necessitating additional preprocessing steps. Furthermore, during feature extraction, we employed a single parameter-frozen backbone to extract advanced features from two different types of features, significantly reducing the number of trainable parameters and simplifying the model’s complexity. Nonetheless, the inclusion of two backbones and a tri-cross-attention ViT means there is still room for improvement in optimizing the overall parameter count despite maintaining it at a relatively low level.

Author Contributions

Conceptualization, Y.T., Z.W., H.Y. and D.C.; methodology, Y.T., Z.W., H.Y. and D.C.; software, Z.W.; validation, Y.T. and Z.W.; investigation, Y.T., Z.W., H.Y. and D.C.; writing—original draft preparation, Y.T. and Z.W.; writing—review and editing, Y.T. and Z.W.; supervision, Y.T., H.Y. and D.C.; project administration, Z.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the General Project for Education of National Social Science Fund, Study on the Mechanism of Emotional Engagement and its Intervention in Primary and Secondary School Teachers’ online Training (Grant Number: BCA230278).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article, further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Lyons, M.; Budynek, J.; Akamatsu, S. Automatic classification of single facial images. IEEE Trans. Pattern Anal. Mach. Intell. 1999, 21, 1357–1362.
  2. Lucey, P.; Cohn, J.F.; Kanade, T.; Saragih, J.; Ambadar, Z.; Matthews, I. The extended cohn-kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition-Workshops, San Francisco, CA, USA, 13–18 June 2010; pp. 94–101.
  3. Li, S.; Deng, W.; Du, J. Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2584–2593.
  4. Mollahosseini, A.; Hasani, B.; Mahoor, M.H. AffectNet: A database for facial expression, valence, and arousal computing in the wild. IEEE Trans. Affect. Comput. 2019, 10, 18–31.
  5. Barsoum, E.; Zhang, C.; Ferrer, C.C.; Zhang, Z. Training deep networks for facial expression recognition with crowd-sourced label distribution. In Proceedings of the 18th ACM International Conference on Multimodal Interaction, ICMI ’16, Tokyo, Japan, 12–16 November 2016; Association for Computing Machinery: New York, NY, USA, 2016; pp. 279–283.
  6. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; Volume 1, pp. 886–893.
  7. Ojala, T.; Pietikainen, M.; Maenpaa, T. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 971–987.
  8. Jabid, T.; Kabir, M.H.; Chae, O. Local directional pattern (LDP) for face recognition. In Proceedings of the 2010 Digest of Technical Papers International Conference on Consumer Electronics (ICCE), Las Vegas, NV, USA, 9–13 January 2010; pp. 329–330.
  9. Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110.
  10. Parra, D.; Camargo, C. Design methodology for single-channel CNN-based FER systems. In Proceedings of the 2023 6th International Conference on Information and Computer Technologies (ICICT), Raleigh, NC, USA, 24–26 March 2023; pp. 89–94.
  11. Borgalli, M.R.A.; Surve, D.S. Deep learning for facial emotion recognition using custom CNN architecture. J. Phys. Conf. Ser. 2022, 2236, 012004.
  12. Bodapati, J.D.; Srilakshmi, U.; Veeranjaneyulu, N. FERNet: A Deep CNN Architecture for Facial Expression Recognition in the Wild. J. Inst. Eng. (India) Ser. B 2022, 103, 439–448.
  13. Wang, K.; Peng, X.; Yang, J.; Lu, S.; Qiao, Y. Suppressing uncertainties for large-scale facial expression recognition. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 6896–6905.
  14. Xue, F.; Wang, Q.; Guo, G. TransFER: Learning relation-aware facial expression representations with transformers. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 3581–3590.
  15. Fan, Y.; Wang, H.; Zhu, X.; Cao, X.; Yi, C.; Chen, Y.; Jia, J.; Lu, X. FER-PCVT: Facial expression recognition with patch-convolutional vision transformer for stroke patients. Brain Sci. 2022, 12, 1626.
  16. Li, Y.; Wang, M.; Gong, M.; Lu, Y.; Liu, L. FER-former: Multi-modal transformer for facial expression recognition. arXiv 2023, arXiv:2303.12997.
  17. Li, H.; Sui, M.; Zhao, F.; Zha, Z.; Wu, F. MVT: Mask vision transformer for facial expression recognition in the wild. arXiv 2021, arXiv:2106.04520.
  18. Yang, W.; Chen, H.; Peng, H.; Guo, J.; Liu, Z. Leveraging one-class classification with resnet18 and cbam for enhanced facial expression recognition. SSRN 2023.
  19. Ma, F.; Sun, B.; Li, S. Facial expression recognition with visual transformers and attentional selective fusion. IEEE Trans. Affect. Comput. 2023, 14, 1236–1248.
  20. Kim, J.H.; Kim, N.; Won, C.S. Facial expression recognition with swin transformer. arXiv 2022, arXiv:2203.13472.
  21. Huang, D.; Shan, C.; Ardabilian, M.; Wang, Y.; Chen, L. Local binary patterns and its application to facial image analysis: A survey. IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) 2011, 41, 765–781.
  22. Shan, C.; Gong, S.; McOwan, P.W. Facial expression recognition based on Local Binary Patterns: A comprehensive study. Image Vis. Comput. 2009, 27, 803–816.
  23. Hasani, B.; Mahoor, M.H. Facial expression recognition using enhanced deep 3D convolutional neural networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, HI, USA, 21–26 July 2017; pp. 2278–2288.
  24. Jung, H.; Lee, S.; Yim, J.; Park, S.; Kim, J. Joint fine-tuning in deep neural networks for facial expression recognition. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; IEEE: Santiago, Chile, 2015; pp. 2983–2991.
  25. Zheng, C.; Mendieta, M.; Chen, C. POSTER: A pyramid cross-fusion transformer network for facial expression recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 3146–3155.
  26. Savchenko, A.V. Facial expression and attributes recognition based on multi-task learning of lightweight neural networks. In Proceedings of the 2021 IEEE 19th International Symposium on Intelligent Systems and Informatics (SISY), Subotica, Serbia, 16–18 September 2021; pp. 119–124.
  27. Tang, Y. Deep Learning using Linear Support Vector Machines. arXiv 2013, arXiv:1306.0239.
  28. Vo, T.H.; Lee, G.S.; Yang, H.J.; Kim, S.H. Pyramid with super resolution for in-the-wild facial expression recognition. IEEE Access 2020, 8, 131988–132001.
  29. Xue, F.; Wang, Q.; Tan, Z.; Ma, Z.; Guo, G. Vision transformer with attentive pooling for robust facial expression recognition. IEEE Trans. Affect. Comput. 2022, 14, 3244–3256.
  30. Li, H.; Xiao, X.; Liu, X.; Wen, G.; Liu, L. Learning Cognitive Features as Complementary for Facial Expression Recognition. Int. J. Intell. Syst. 2024, 2024, 7321175.
  31. Chen, J.; Chen, Z.; Chi, Z.; Fu, H. Facial Expression Recognition in Video with Multiple Feature Fusion. IEEE Trans. Affect. Comput. 2018, 9, 38–50.
  32. Shao, J.; Qian, Y. Three convolutional neural network models for facial expression recognition in the wild. Neurocomputing 2019, 355, 82–92.
  33. Li, X.; Deng, W.; Li, S.; Li, Y. Compound Expression Recognition In-the-Wild with AU-Assisted Meta Multi-Task Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 5734–5743.
  34. Khemakhem, F.; Ltifi, H. Neural style transfer generative adversarial network (NST-GAN) for facial expression recognition. Int. J. Multimed. Inf. Retr. 2023, 12, 26.
  35. Akhand, M.A.H.; Roy, S.; Siddique, N.; Kamal, M.A.S.; Shimamura, T. Facial Emotion Recognition Using Transfer Learning in the Deep CNN. Electronics 2021, 10, 1036.
  36. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2023, arXiv:1706.03762.
  37. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929.
  38. Li, Y.; Lu, G.; Li, J.; Zhang, Z.; Zhang, D. Facial Expression Recognition in the Wild Using Multi-Level Features and Attention Mechanisms. IEEE Trans. Affect. Comput. 2023, 14, 451–462.
  39. Deng, J.; Guo, J.; Xue, N.; Zafeiriou, S. ArcFace: Additive Angular Margin Loss for Deep Face Recognition. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 4685–4694.
  40. Zeng, D.; Lin, Z.; Yan, X.; Liu, Y.; Wang, F.; Tang, B. Face2Exp: Combating data biases for facial expression recognition. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; IEEE: New Orleans, LA, USA, 2022; pp. 20259–20268.
  41. Guo, Y.; Zhang, L.; Hu, Y.; He, X.; Gao, J. MS-Celeb-1M: A Dataset and Benchmark for Large-Scale Face Recognition. arXiv 2016, arXiv:1607.08221.
  42. Jin, L.; Zhou, Y.; Ma, G.; Song, E. Quaternion deformable local binary pattern and pose-correction facial decomposition for color facial expression recognition in the wild. IEEE Trans. Comput. Soc. Syst. 2023, 11, 2464–2478.
  43. Zhang, Y.; Wang, C.; Ling, X.; Deng, W. Learn from all: Erasing attention consistency for noisy label facial expression recognition. arXiv 2022, arXiv:2207.10299.
  44. Shi, J.; Zhu, S.; Liang, Z. Learning to amend facial expression representation via de-albino and affinity. arXiv 2021, arXiv:2103.10189.
  45. Maaten, L.; Hinton, G.E. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605.
