Distract Your Attention: Multi-Head Cross Attention Network for Facial Expression Recognition

This paper presents a novel facial expression recognition network, called Distract your Attention Network (DAN). Our method is based on two key observations in biological visual perception. Firstly, multiple facial expression classes share inherently similar underlying facial appearance, and their differences could be subtle. Secondly, facial expressions simultaneously exhibit themselves through multiple facial regions, and for recognition, a holistic approach by encoding high-order interactions among local features is required. To address these issues, this work proposes DAN with three key components: Feature Clustering Network (FCN), Multi-head Attention Network (MAN), and Attention Fusion Network (AFN). Specifically, FCN extracts robust features by adopting a large-margin learning objective to maximize class separability. In addition, MAN instantiates a number of attention heads to simultaneously attend to multiple facial areas and build attention maps on these regions. Further, AFN distracts these attentions to multiple locations before fusing the feature maps to a comprehensive one. Extensive experiments on three public datasets (including AffectNet, RAF-DB, and SFEW 2.0) verified that the proposed method consistently achieves state-of-the-art facial expression recognition performance. The DAN code is publicly available.


Introduction
Facial expressions are direct and fundamental social signals in human communication [1,2]. Along with other gestures, they convey important nonverbal emotional cues in interpersonal relations. More importantly, visionbased facial expression recognition has become a powerful sentiment analysis tool in a wide spectrum of practical applications. For example, counseling psychologists assess a patient's condition and consider treatment plans by constantly observing facial expressions [3]. In retail sales, a customer's facial expression data are used to determine whether a human sales assistant is needed [4]. Other significant application areas include social robots, e-learning, and facial expression synthesis.
Facial expression recognition (FER) is a technology that uses computers to automatically recognize facial expressions. As this research area matures, a number of large-scale facial expression datasets have emerged. In an early seminal work [5], six prototypical emotional displays are postulated: angry (AN), disgust (DI), fear (FE), happy (HA), sad (SA), and surprise (SU), which are often referred to in the literature as basic emotions. Recent FER datasets regard neutral (NE) or contempt (CO) as additional expression categories, expanding the number of facial expression categories to seven or eight. Figure 1: Comparing the Grad CAM++ visualization of a single attention head model and our proposed DAN on the RAF-DB test set. The first column is obtained with DACL [9], and the rest of the columns are generated by four attention heads from the proposed DAN model. Our method explicitly learns to attend to multiple local image regions for facial expression recognition.
In contrast to generic image classification, there are strong common features among different categories of facial expressions. Indeed, multiple expressions share inherently similar underlying facial appearance, and their differences could be less distinguishable. In computer vision, a common strategy to address this issue involves adopting a variant of the center loss [6]. In this work, a Feature Clustering Network (FCN) is proposed, which includes a simple and straightforward extension to the center loss, and it works well for optimizing both intra-class and inter-class variations. Unlike existing methods [7,8,9], our method does not involve additional computations other than the variation of the cluster centers, and it only requires a few hyper-parameters.
In addition, one unique aspect of FER lies in the delicate contention between capturing the subtle local variations and obtaining a unified holistic representation. To attend to local details, some recent studies focus on attention mechanisms [9,10,11], achieving promising results. Nonetheless, as shown in Figure 1, it is difficult for a model with only a single attention head to concentrate on various parts of the face at the same time. In fact, it is our belief that facial expression is simultaneously manifested in multiple parts of the face, such as eyebrows, eyes, nose, mouth, chin, etc. Therefore, this paper proposes a Multi-head Attention Network (MAN) inspired by biological visual perception that instantiates a number of attention heads to attend to multiple facial areas. Our attention module implements both spatial and channel attentions, which allows for capturing higher-order interactions among local features while maintaining a manageable computational budget. Furthermore, this paper proposes an Attention Fusion Network (AFN) that ensures attention is drawn to multiple locations before their fusion into a comprehensive feature vector for downstream classification.
By integrating the above ideas, a novel facial expression recognition network, called Distract your Attention Network (DAN), is presented in this paper. Our method implements multiple attention heads and ensures that they capture useful aspects of the facial expressions without overlapping. Concretely, our work proposes three sub-networks, including a Feature Clustering Network (FCN), a Multi-head Attention Network (MAN), and an Attention Fusion Network (AFN). More specifically, our method first extracts and clusters the backbone feature embedding with FCN, where an affinity loss is applied to increase the inter-class distances while decreasing the intra-class distances. After that, an MAN is built to attend to multiple facial regions concurrently, where multiple attention heads that each include a spatial attention unit and a channel attention unit are adopted. Finally, the output feature vectors from MAN are fed to an AFN to output class scores. Specifically, this work designs a partition loss in AFN to force attention maps from the MAN to focus on different facial locations. As shown in Figure 1, a single attention module could only concentrate on one coarser image region, missing other important facial locations. On the contrary, our proposed DAN manages to simultaneously capture several vital facial regions.
The main contributions of our work are summarized as follows: • To maximize class separability, this work proposes a simple yet effective feature clustering strategy in FCN to simultaneously optimize intra-class variations and inter-class margins.
• Our work demonstrates that a single attention module cannot sufficiently capture all the subtle and complex appearance variations across different expressions. To address this issue, MAN and AFN are proposed to capture multiple non-overlapping local attentions and fuse them to encode higher-order interactions among local features.
• The experimental results show that the proposed DAN method achieves an accuracy of 62.09% on AffectNet-8, 65.69% on AffectNet-7, 89.70% on RAF-DB, and 53.18% on SFEW 2.0, respectively, which represent the state of the art in facial expression recognition performance.
The rest of the paper is organized as follows. Section 2 reviews related literature in facial expression recognition with a particular focus on the attention mechanism and discriminative loss functions. Section 3 describes the proposed method in detail. Section 4 then presents the experimental evaluation results followed by closing remarks in Section 5.

Related Work
Facial expression recognition is an image classification task involving accurately identifying the emotional state of humans. The earliest study on FER dates back to 1872, when Darwin first proposed the argument of consistency in expressions [2]. In 1971, Ekman and Friesen presented the six basic emotions [5]. Later, the first facial expression recognition system [12] was proposed in 1991, which was based on optical flow. After that, the FER system has gradually matured and, in general, can be divided into three sub-processes: face detection, feature extraction, and expression classification [13]. Recently, FER systems are benefiting from the rapid development of deep learning, and a unified neural network can be used to perform both feature extraction and expression classification.
One of the most significant applications of FER is in human-computer interaction. By analyzing users' facial expressions, computers can interpret human emotions and respond accordingly, providing more natural and intuitive user interfaces. This technology has already been implemented in various areas, such as gaming, virtual reality, and video conferencing. In the healthcare industry, facial expression recognition can be used to detect and diagnose various mental health disorders. It can also be used to monitor patients' emotions and provide personalized treatment options. For example, Ref. [14] take advantage of FER instead of medical devices to detect a people's health state. FER also has important applications in artistic research, Ref. [15] use FER to recognize facial expressions in opera performance scenes to help teams assess user satisfaction and use it to make adjustments. Very recently, [16] propose a method for synthesizing recognizable face line portraits with controllable expression and high recognizability based on a triangle coordinate system.
Attention mechanism. Attention mechanism plays an important role in visual perception [17,18]. In particular, attention enables human beings to actively seek more valuable information in a complex scene. In recent years, there have been a plethora of studies attempting to introduce attention mechanisms into deep Convolutional Neural Network (CNN) with success. For example, [19] focus on the channel relationship of network features and propose a squeeze-and-excitation block to retain the most valuable channel information. Ref. [20] introduce frequency analysis to attention mechanism and present a new method termed FcaNet to compress information loss in scalar-based channel representations. Ref. [21] present a group-wise spatial attention module (SGE) where the spatial-wise features are divided into multiple groups to learn a potential spatial connection. By leveraging the complementary nature between channel-wise and spatial-wise features, Ref. [22] propose the Convolutional Block Attention Module (CBAM) that sequentially connects a channel attention and spatial attention to obtain rich attention features. Ref. [23] propose a coordinate attention that embeds positional information into channel attention to generate spatially selective attention maps. Ref. [24] propose a new triplet attention that captures cross-dimensional interaction to efficiently build interdimensional dependencies. Likewise, Ref. [25] use a position attention module and a channel attention module in parallel to share the local-wise and global-wise features for scene segmentation task. Very recently, seminal works based on self-attention mechanisms have emerged, such as Swin-Transformer [26], DAB-DETR [27], DaViT [28], and QFormer [29]. These models have indeed demonstrated superior performance, but they also bring with them a huge number of model parameters and an expensive computational cost, which are unsuitable for lightweight applications.
Inspired by these efforts, our method includes design of an attention head that sequentially cascades a spatial attention and a channel attention unit, which is both efficient and effective.
There are a few papers that introduce the above progress into FER. For example, Ref. [30] apply region attention on the CNN backbone to enhance its power of capturing local information. Ref. [31] construct a bottom-up and top-down architecture to obtain low resolution attention features. In these papers, only a single attention head is used, which would generally lead to attention on a rough area of the face. In our work, however, multiple non-overlapping attention regions could be simultaneously activated to capture information from different local regions. This is particularly useful for FER, as we need to simultaneously attend to multiple regions (e.g., eyes, nose, mouth, forehead, and chin) to capture the subtle differences among emotion classes.
Attention mechanisms have found applications in a wide range of areas in computer vision. For example, [32] propose a joint weak saliency and attention-aware model for person re-identification, in which saliency features are weakened to obtain more complete global features and diversified saliency features via attention diversity. Ref. [33] propose a reconstruction method based on attention mechanism to address the issue of low-frequency and high-frequency components being treated equally in image super-resolution. Ref. [34] propose a multi-head self-attention approach for multi-label mRNA subcellular localization prediction and achieve excellent results.
Discriminative loss. A discriminative loss function can strongly regulate the distribution of deep features. For example, [35] propose contrastive loss, which is an efficient loss function that maximizes class separability. In detail, a general Euclidean metric is used for features from the same class, but for diverse classes, the loss values will get close to a maximum margin. In addition, Ref. [6] present the center loss to learn a center distribution of each class and penalize the distances between deep features and their corresponding class centers. Differently, Ref. [36] propose using an angle as a distance measure and introduce an angular softmax loss. That work is followed by a number of methods [37,38,39] that improve the angular loss function.
In recent years, several studies demonstrate that discriminative loss functions could be well adapted to the FER task.
Ref. [40] combine the advantages of center loss and softmax loss and propose a discriminative distribution-agnostic loss function. Concretely, a center loss aggregates the features of the same class into a cluster, and a softmax loss separates the adjacent classes. Similarly, Ref. [7] introduce a cosine metric based on center loss to increase the inter-class distance among different categories. Furthermore, Ref. [9] propose an attentive center loss, which advocates learning the relationship of each class center for the center loss. However, all these loss functions bring in auxiliary parameters and computations. On the contrary, the affinity loss proposed in this paper is more simple and uses the internal relationship among class centers to increase the inter-class distance.

Our Approach
This section describes the proposed DAN model in detail. In order to learn high-quality attentive features, our DAN is divided into three components: Feature Clustering Network (FCN), Multi-head Attention Network (MAN), and Attention Fusion Network (AFN). Firstly, the FCN accepts a batch of face images and outputs basic feature embedding with class discrimination abilities. Afterwards, the MAN is employed to learn diverse attention maps that capture several sectional facial expression regions. Then, these attention maps are explicitly trained to focus on different areas by the AFN. Finally, the AFN fuses features coming from all attention heads and gives a prediction for the expression category of the input images.
In particular, the presented MAN contains a series of lightweight but effective attention heads. An attention head composes of a spatial attention unit and a channel attention unit in sequential order. Distinctively, the spatial attention unit involves convolution kernels of various sizes. A channel attention unit is connected to the end of the spatial attention unit to reinforce the attention maps by simulating an encoder-decoder structure. Both the spatial attention and the channel attention units are integrated back into the input features. The overall process of the proposed DAN is shown in Figure 2.

Feature Clustering Network (FCN)
Let us begin by introducing the FCN. Considering both performance and the number of parameters in our model, our method employs a residual network [41] as the backbone. As discussed earlier, different facial expressions may share similar underlying facial appearance. Therefore, this paper proposes a discriminative loss function, named affinity loss, to maximize class margins. Concretely, in each training step, our method encourages features to move closer to the class center to which they belong. At the same time, our method pushes centers of different classes apart in order to maintain good separability. More formally, suppose we have the i-th input image feature x i ∈ X with a class label y i ∈ Y during training, where X is the input feature space and Y is the label space. Here, i ∈ {1 . . . M } where M is the number of images in the training set. For the sake of simplicity, the output features of our backbone can be written as: where F r represents the backbone network and w r denotes the network parameters of F r .
Affinity loss. The affinity loss is proposed to maximize the inter-class distance while minimizing the intra-class variation. By virtue of the affinity loss, the backbone network can accurately cluster various features of facial expressions toward their respective class centers. Compared to the standard center loss [6], our affinity loss advocates that it makes sense to expand the distance between class centers, because a larger distance between class centers further prevents overlap among classes. Formally, given the class center matrix c ∈ R D×|Y| wherein each column corresponds to the center of a specific class, the affinity loss function can be written as follows: where N is the number of images in a training batch, D is the dimension of class centers, c yi is the column vector in c that corresponds to the ground-truth class of the i-th image, and σ c indicates the standard deviation among class centers. One might notice that x i is not necessarily a feature vector and could instead be a feature map of higher dimensions. In practice, following [6], a global average pooling is performed on x i , and any singleton dimensions are then removed before applying Equation 2, and we note that similar operations to flatten the feature maps may also work.
Despite its simplicity, our affinity loss is more effective than the standard center loss, as it encourages wider margins among classes as a result of pushing class centers further away from each other. This arguably completes the other half of the puzzle missing in the center loss, as the standard center loss only enforces tightness within classes but not sparsity among different classes. In particular, σ c can be readily available during training, as class centers have already been computed, and no additional hyper-parameters are needed for our affinity loss. Figure 3 presents the t-SNE visualization of the features obtained by FCN with center loss and affinity loss, respectively. It is clear that by integrating the affinity loss, our FCN learns feature clusters of better quality, and the class margins are clearly wider among different classes.

Multi-Head Attention Network (MAN)
As discussed earlier, a single attention head may be insufficient for effective FER, and the joint reasoning of multiple local regions is necessary. In this regard, our MAN contains a few parallel heads, which remain independent of each other. As shown in Figure 4, an attention head is a combination of a spatial attention unit and a channel attention unit. The spatial attention unit receives the input features from the FCN and extracts the spatial features. Then, the channel attention unit accepts the spatial features as the input feature and extracts the channel features. Features from the two above dimensions are then combined into an attention vector. In particular, it should be noted that all parallel heads are identical in terms of their sizes, differing only in parameter values.
More concretely, the left section of Figure 4 illustrates the spatial attention unit, which consists of four convolutional layers and one activation function. In particular, our method constructs the 1 × 1, 1 × 3, 3 × 1, and 3 × 3 convolution kernels to capture local features at multiple scales. The channel attention unit shown in the right part consists of a global average pooling layer, two linear layers, and one activation function. In particular, our method takes advantage of having two linear layers to encode the channel information.
Formally, let H = {H 1 , . . . , H K } be the spatial attention heads and S = {s 1 , . . . , s K } be the output spatial attention maps, where K is the number of attention heads. Given the output features of our backbone x (the subscript i is omitted for notation brevity), the output of the j-th spatial attention unit can then be written as: where w s are the network parameters of H j , and is the Hadamard product.
Similarly, assume H = {H 1 , . . . , H K } are the channel attention heads and A = {a 1 , . . . , a K } are the final attention feature vectors of MAN. The j-th output a j can be written as: where w c are the network parameters of H j .
The key benefits of the proposed MAN are two-fold. Firstly, our method incorporates spatial attention and channel attention sequentially for each attention head, so as to efficiently capture high-order interactions among features. More importantly, our method instantiates multiple attention heads so that the model can attend to different facial regions at the same time, which is not possible with a single attention head.

Attention Fusion Network (AFN)
Furthermore, to guide the MAN in learning attention maps in an orchestrated fashion, this work proposes the AFN to further enhance the features learned by MAN. Firstly, AFN scales the attention feature vectors by applying a log-softmax function to emphasize the most interesting region. After that, a partition loss is proposed to instruct the attention heads to focus on different crucial regions and avoid overlapping attentions. This would ensure the efficacy of multiple attention heads, making them attend to non-overlapping local regions. Lastly, the normalized attention feature vectors are merged into one, and we can then compute class confidence with a linear layer. Feature scaling. The final attention feature vectors A are firstly scaled using a log-softmax function along the dimension corresponding to the attention heads, so as to obtain a common metric for the features from different heads. More specifically, the output of MAN from Section 3.2 gives the attention feature vectors A = {a 1 , . . . , a K }, and we assume that any of these vectors, e.g., the j-th vector a j , is L-dimensional (e.g., L = 512 in our experiments). In this case, the feature scaling result v j for a j can be given by: , l ∈ {1, . . . , L}, j ∈ {1, . . . , K} where a l j is the l-th element of a j , and v l j is the l-th element of v j . After scaling, the features from different heads are further added up for the final classification, as shown in Figure 2.
Partition loss. This paper proposes a partition loss to direct the attention heads to focus on different facial feature regions so they will not collapse into a single attention. As visualized in the right part of Figure 1, our partition loss successfully controls the heads of MAN to follow different facial areas without additional interventions. More specifically, we do so by maximizing the variance among attention vectors. In particular, our method considers K, the number of attention heads, as a parameter to adaptively adjust the descent speed of loss values. The MAN with a larger quantity of attention heads may generate more subtle areas of attention. Overall, we write our partition loss as follows: where N is the number of images in the current batch, L is the dimension of the attention feature vector in each head, σ 2 i,l denotes the variance of v l j where j ∈ {1, . . . , K} when the i-th image and the l-th dimension are given.

Model Learning
As shown above, our DAN model is comprised of three components: FCN, MAN, and AFN. In order to train this model, one has to consider the losses from all components (i.e., the affinity loss from FCN, the partition loss from AFN, and the cross-entropy loss for image classification) within a unified framework. Following the standard practice in deep learning, our method learns a final loss function by integrating all these loss functions: where λ 1 , λ 2 are the weighting hyper-parameters for L af and L pt . L cls denotes the cross-entropy loss in image classification. In our experiments, we empirically set both λ 1 and λ 2 to 1.0 and note that the consistent performance gains we observe are not particularly sensitive to their values.

Experimental Evaluation
This section describes our experimental evaluation results in detail. In particular, the superiority of the proposed method is quantitatively verified on three benchmark datasets: AffectNet, RAF-DB, and SFEW 2.0. This paper shows that the proposed method provides consistent performance improvements to a number of strong baselines. In addition, it is demonstrated that the various components of our model are all contributing to the final performance through ablation studies.

AffectNet
AffectNet [42] is a large-scale database of facial expressions that contains two benchmark branches: AffectNet-7 and AffectNet-8 (with an additional category of "contempt"). AffectNet-7 includes 287,401 images with seven classes, where all images are divided into 283,901 training samples and 3500 testing samples. Additionally, AffectNet-8 introduces the contempt data and expands the number of training and test samples to 287,568 and 4000, respectively.

RAF-DB
RAF-DB [43] is a real-world database with more than 29,670 facial images downloaded from the Internet. Seven basic and eleven compound emotion labels are provided for the dataset through manual labeling. There are 15,339 images in total for expression classification (including 12,271 training set and 3068 testing set images), each of which is aligned and cropped to a size of 100 × 100.

Implementation Details
On RAF-DB and AffectNet datasets, our work directly uses the official aligned image samples. For the SFEW 2.0 dataset, the facial images are manually aligned using the RetinaFace [45] model. Input images are resized to 224 × 224 pixels for each training and testing step on all datasets. A few basic random data augmentation methods (such as horizontal flipping, random rotation, and erasing) are used selectively to prevent over-fitting. Moreover, a ResNet-18 [41] model is adopted as the backbone of FCN in all experiments, and for a fair comparison, this work pre-trains the ResNet-18 model on the MS-Celeb-1M face recognition dataset.
Our experimental code is implemented with PyTorch, and the models are trained on a workstation with an NVIDIA Tesla P100 12GB GPU. Our code is publicly available from https://github.com/yaoing/DAN. For all tasks, models are trained for 40 epochs with a uniform batch size of 256 (mainly due to GPU memory constraint), and the head number of attention in MAN is set to 4 by default.

Quantitative Performance Comparisons
This paper presents the quantitative performance comparison results in Tables 1-4 for AffectNet, RAF-DB, and SFEW 2.0. Our proposed method achieves an accuracy of 62.09% on AffectNet-8 and an accuracy of 65.69% on AffectNet-7, which are both superior to existing methods. Comparing for the RAF-DB dataset, DAN acquires 89.70% in accuracy, which is state of the art. Our method, although not the best, also obtains a competitive accuracy of 53.18% on the more compact SFEW 2.0 dataset. This is perhaps because the multi-head attention would require larger datasets for learning to ensure an effective model. Overall, the results above clearly demonstrate that the proposed method is highly competitive and the components in our method are effective across multiple datasets.

Effects of Loss Functions for FCN and AFN
In order to verify the efficacy of the individual loss functions proposed in our method, this paper now performs ablation studies on the RAF-DB dataset to demonstrate that our affinity loss and partition loss are indeed contributing to model performance. Specifically, this work evaluates the influence of using different loss functions for FCN and AFN separately in Table 5. In FCN, our proposed affinity loss provides a considerable performance improvement to the standard center loss. Similarly, the partition loss plays a crucial role in the performance of AFN. These results demonstrate that the affinity loss and the partition loss both contribute to the superior performance of our model.

Number of Attention Heads
In addition, the number of attention heads obviously affects the performance of our model. Figure 5 shows the accuracy results with a changing number of attention heads on the RAF-DB dataset. It is evident that our proposed MAN structures are superior to a single attention module. Moreover, using four attention heads maximizes the performance gain. Therefore, four attention heads are used throughout the experiments in this paper.   Our method provides near-perfect precision for most classes in the low recall regime. Classes such as "Happiness", "Neutral", "Surprise", and "Sadness" are relatively easy, while the disgust and fear classes are the most challenging. Table 6 compares our method against state-of-the-art methods in terms of model size and inference time. Our DAN with four attention heads uses 19.72 M and 2.23 G of parameters and FLOPs, respectively, which corresponds to moderate resource consumption, to achieve its state-of-the-art facial expression recognition performance.

Confusion Matrix
In order to better understand the performance of our model on each facial expression category, Figure 6 presents the confusion matrices on each dataset we evaluated. More specifically, it is obvious that the "Happy" class is the easiest on both Affect-8 and Affect-7, followed by "Sad", "Surprise", and "Fear". On RAF-DB, our DAN model has a high accuracy in "Happiness", "Sadness", "Surprise", "Neutral", and "Anger" classes. On SFEW 2.0, our method performs relatively well on the "Happy" and "Neutral" classes, while the "Disgust" and "Fear" classes are very challenging. Possible reasons for the large gaps in the performance include the appearance similarities among facial expression categories as well as the skewed class distribution in the training dataset. It is therefore obvious that to further improve the performance of our method in the future, extra efforts should be made to avoid class confusion on the more difficult classes such as "Disgust", "Fear", and "Anger".

Precision-Recall Analysis
Finally, Figure 7 presents the per-class precision-recall curves on the RAF-DB dataset. It is evident that our approach achieves near-perfect performance in "Happiness" class, and also performs well in "Neutral", "Surprise", "Sadness", and "Anger" classes. Overall, our method provides over 80% precision at 50% recall for all classes.

Conclusions
This paper presents a robust method for facial expression recognition that consists of three novel sub-networks including the Feature Clustering Network (FCN), the Multi-head Attention Network (MAN), and the Attention Fusion Network (AFN). Specifically, the FCN learns to maximize class separability for backbone facial expression features, the MAN captures multiple diverse attentions, and the AFN penalizes overlapping attentions and fuses the learned features. Experimental results on three benchmark datasets demonstrate the superiority of our method for FER. It is our hope that the exploration into feature clustering and learning multiple diverse attentions would provide insights for future research in facial expression recognition and other related vision tasks.