Facial Action Unit Recognition by Prior and Adaptive Attention

Abstract: Facial action unit (AU) recognition remains a challenging task, due to the subtlety and non-rigidity of AUs. A typical solution is to localize the correlated regions of each AU. Current works often predefine the region of interest (ROI) of each AU via prior knowledge, or try to capture the ROI only under the supervision of AU recognition during training. However, the predefinition often neglects important regions, while the supervision is insufficient to precisely localize ROIs. In this paper, we propose a novel AU recognition method by prior and adaptive attention. Specifically, we predefine a mask for each AU, in which the locations farther away from the AU centers specified by prior knowledge have lower weights, and adopt a learnable parameter to control the importance of different locations. Then, we element-wise multiply the mask by a learnable attention map, and use the new attention map to extract AU-related features, in which AU recognition supervises the adaptive learning of the new attention map. Experimental results show that our method (i) outperforms the state-of-the-art AU recognition approaches on challenging benchmark datasets, and (ii) can accurately reason the regional attention distribution of each AU by combining the advantages of both the predefinition and the supervision.


Introduction
Facial action unit (AU) recognition involves predicting the occurrence or non-occurrence of each AU, which is an important task in the communities of computer vision and affective computing [1][2][3][4]. As defined in the facial action coding system (FACS) [5,6], each AU is a local facial action with one or more atomic muscle actions. Due to their subtlety and non-rigidity, the appearances of AUs vary widely across persons and expressions. For instance, as shown in Figure 1, AU 1 (inner brow raiser), AU 2 (outer brow raiser), and AU 4 (brow lowerer) occur in overlapping brow regions, in which it is difficult to distinguish each AU from the fused appearance. In the literature, facial AU recognition is still a challenging task.
Considering that AUs appear in local facial regions, one intuitive solution is to localize the correlated regions so as to extract features for AU recognition. Since the locations of AU centers can be specified by correlated facial landmarks via prior knowledge, Li et al. [2,7] predefined a regional attention map for each AU, in which a position with a farther distance to the AU centers is given a lower attention value. However, different AUs share the same attention distribution, which ignores the divergences across AUs. Furthermore, correlated landmarks can only determine the central locations of AUs, while potentially correlated regions far away from the centers are neglected. As deep neural networks have a self-attention mechanism [8] during training, Shao et al. [9] resorted only to the supervision of AU recognition to capture the correlated regions of AUs. In this approach, some irrelevant regions are included, since AUs are subtle and do not have distinct contours. Shao et al. [3,10] proposed adaptively modifying the predefined attention map of each AU, a pioneering work that combines predefined attention and supervised attention. However, directly convolving on a predefined attention map only works as smoothing, in which positions outside of the predefined region of interest (ROI) obtain similar attention weights and are thus treated as similarly important. In this way, correlated regions very distant from the AU centers are still not emphasized.
To tackle the above issues, we develop a novel facial AU recognition method named PAA (prior and adaptive attention). In particular, we first predefine a mask for each AU by assigning lower weights to the positions farther away from the predefined AU centers. Since the regions outside the predefined ROI should not be neglected, we use a learnable parameter to adaptively control the importance of different positions. Then, we element-wise multiply the mask by a learnable attention map, and employ the new attention map to extract AU-related features. In this process, the new attention map is adaptively learned under the guidance of AU recognition. By integrating the advantages of the predefinition and the supervision, our method can precisely capture the correlated regions of each AU, in which correlated locations are included and uncorrelated locations are discarded. This paper has three main contributions:
• We propose to combine the constraint of prior knowledge and the supervision of AU recognition to adaptively learn the regional attention distribution of each AU.
• We propose a learnable parameter to adaptively control the importance of different positions in the predefined mask, which is beneficial for choosing an appropriate constraint of prior knowledge.
• We conduct extensive experiments on challenging benchmark datasets, which demonstrate that our method outperforms the state-of-the-art AU recognition approaches and can precisely learn the attention map of each AU.

Related Works
In this section, we review other approaches that are strongly relevant to our work, including facial landmark-aided AU recognition approaches and attention learning-based AU recognition approaches.

Facial Landmark-Aided AU Recognition
Since facial landmarks have prior location relationships with AUs, landmarks can help to learn AU-related features. Benitez-Quiroz et al. [11] integrated geometry and local texture features for AU recognition, in which the geometry features contain the normalized distances among landmarks as well as the angles of the Delaunay triangulation constructed from landmarks. Zhao et al. [12] extracted scale-invariant feature transform (SIFT) [13] features from local facial regions centered at relevant landmarks as AU-related features.
Facial landmarks can also facilitate AU recognition in other ways. Niu et al. [14] relied on landmarks to construct a shape regularization for AU recognition. Ma et al. [15] introduced typical object detection techniques into AU recognition by employing landmarks to define bounding boxes for AUs, in which the occurrence of each AU is predicted for its bounding boxes; if an AU is absent, it should be predicted as non-occurrence for all of its bounding boxes.
These approaches all demonstrate the contribution of landmarks to AU recognition. In this paper, we use landmarks to predefine a regional mask with a learnable control parameter for each AU.

Attention Learning-Based AU Recognition
Considering that AUs are subtle and have no distinct contour and texture, it is impracticable to manually annotate their regional attention distributions. An intuitive solution is to use prior knowledge for attention predefinition. Li et al. [2,7] generated an attention map for each AU by predefining two Gaussian distributions centered at the two symmetric AU centers, in which a position farther away from the centers has a smaller attention weight. However, different AUs have identical attention distributions, which ignores the differences among AUs. Furthermore, attention predefinition cannot highlight potentially correlated regions far away from the predefined ROI.
As important regions beyond the predefined ROI may be neglected, Shao et al. [9] directly learned the attention map of each AU without prior knowledge, while Shao et al. [3,10] modified the predefined attention map under the supervision of AU recognition. However, the attention distribution learned in [9] contains quite a few uncorrelated locations, and the attention modification in [3,10] amounts to a smoothing of the predefinition and still cannot emphasize correlated locations very distant from the predefined ROI. In contrast, our approach can include both strongly correlated positions near the AU centers and weakly correlated positions scattered over the global face.

Overview
Our main goal is to predict the occurrence probabilities of all $m$ AUs for an input image: $\hat{p} = (\hat{p}_1, \cdots, \hat{p}_m)$. The structure of our network is illustrated in Figure 2. Specifically, two hierarchical and multi-scale region layers [3,10], each followed by a max-pooling layer, are first used to extract a multi-scale feature, which is beneficial for adapting to AUs with diverse sizes in different local facial regions. Then, each AU has one branch to predict its occurrence probability, in which the predefined mask $M_i$, as the prior knowledge, constrains the learning of the new attention map $M_i^{(1)}$ during training. In our framework, both the prior knowledge and the guidance of AU recognition are exploited to learn the regional attention distribution of each AU.

Convolutional Layer
Max-Pooling Layer  Figure 2. The overview of our PAA framework. An input image firstly goes through two hierarchical and multi-scale region layers [3,10], each of which is followed by a max-pooling layer. Then, m branches are used to predict AU occurrence probabilities, in which the learned attention map M i is element-wise multiplied by the predefined mask M i to obtain the new attention map M (1) i for the i-th AU. We overlay the attention maps as well as the masks on the input image for a better view. " " refers to element-wise multiplication. The expression c × l × l refers to the layer dimensions are c, l, and l, respectively.

Constraint of Attention Predefinition
In the branch of the i-th AU, three convolutional layers are first adopted, where $i = 1, \cdots, m$. Then, a convolutional layer with one channel is used to learn the attention map $M_i^{(0)}$. According to prior knowledge, the central locations of AUs can be determined by correlated facial landmarks [2,3], as illustrated in Figure 3. To exploit this prior knowledge, we predefine a mask $M_i$ for the i-th AU.

Figure 3. The predefined AU centers [2,3], in which "scale" denotes the distance between the two inner eye corners. Each AU has two symmetric centers specified by two correlated facial landmarks, in which landmarks are shown in white and AU centers in other colors.
Since the i-th AU has two centers $(\bar{a}_{i(1)}, \bar{b}_{i(1)})$ and $(\bar{a}_{i(2)}, \bar{b}_{i(2)})$, we take the predefined mask $M_{i(1)}$ of the first center as an example. In particular, we use a Gaussian distribution with a standard deviation $\delta$ centered at the location $(\bar{a}_{i(1)}, \bar{b}_{i(1)})$ to compute the value at each location $(a, b)$:

$$M_{i(1)}^{(a,b)} = \exp\left(-\frac{(a - \bar{a}_{i(1)})^2 + (b - \bar{b}_{i(1)})^2}{2\delta^2}\right). \quad (1)$$

We next incorporate $M_{i(1)}$ and $M_{i(2)}$ by choosing the larger value at each location:

$$M_i^{(a,b)} = \max\left(M_{i(1)}^{(a,b)}, M_{i(2)}^{(a,b)}\right). \quad (2)$$

In $M_i$, the positions with values significantly larger than zero constitute the ROI of the i-th AU, while the other approximately zero-valued positions are ignored. However, we do not want the positions beyond the predefined ROI to be completely discarded when constraining the attention learning. We thus introduce a learnable, instead of fixed, control parameter $\epsilon_i$ to give appropriate importance to the positions outside of the ROI:

$$M_i^{(a,b)} \leftarrow \frac{M_i^{(a,b)} + \epsilon_i}{1 + \epsilon_i}, \quad (3)$$

where $\epsilon_i \geq 0$, and a larger $\epsilon_i$ gives larger importance to the positions beyond the predefined ROI. Note that the relative size between different positions in $M_i$ is unchanged, and different AUs have independent control parameters. As illustrated in Figure 2, $M_1$ and $M_m$ are adaptively learned with different attention distributions.
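To make the mask construction concrete, the following is a minimal PyTorch sketch of Equations (1)-(3); the function name and tensor layout are our assumptions, and $\epsilon_i$ is passed in as a scalar tensor (in the full model it would be registered as a per-AU learnable parameter).

```python
import torch

def predefined_mask(centers, size, delta=3.0, eps=None):
    """Sketch of the predefined mask M_i for one AU.

    centers: two (a, b) AU-center coordinates on the size x size attention grid.
    size:    spatial size l of the attention map.
    delta:   standard deviation of the Gaussian (delta = 3 in our experiments).
    eps:     learnable scalar tensor epsilon_i; None keeps the plain mask.
    """
    a = torch.arange(size, dtype=torch.float32).view(-1, 1)  # row coordinates
    b = torch.arange(size, dtype=torch.float32).view(1, -1)  # column coordinates
    gaussians = [
        # Equation (1): Gaussian centered at one of the two symmetric AU centers.
        torch.exp(-((a - ca) ** 2 + (b - cb) ** 2) / (2.0 * delta ** 2))
        for ca, cb in centers
    ]
    mask = torch.maximum(gaussians[0], gaussians[1])  # Equation (2)
    if eps is not None:
        eps = torch.relu(eps)  # enforce epsilon_i >= 0
        # Equation (3): raise the importance of positions outside the ROI
        # while preserving the relative order of all positions.
        mask = (mask + eps) / (1.0 + eps)
    return mask
```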

Supervision of AU Recognition
After obtaining the predefined mask $M_i$, we generate the new attention map $M_i^{(1)}$ by element-wise multiplying $M_i$ with the learned attention map $M_i^{(0)}$:

$$M_i^{(1)} = M_i \odot M_i^{(0)}. \quad (4)$$

Considering that deep neural networks have a self-attention mechanism [8], we exploit AU recognition to guide the learning of $M_i^{(1)}$. Specifically, as shown in Figure 2, we element-wise multiply $M_i^{(1)}$ with each channel of the fourth convolutional feature map to emphasize AU-related features. Then, another convolutional layer, as well as a global average pooling layer [16], are adopted to extract the AU feature with the size of $12c$. Finally, we predict the occurrence probability $\hat{p}_i$ of the i-th AU by using a fully-connected layer with one dimension followed by a Sigmoid function, and define the AU recognition loss as:

$$\mathcal{L}_{au} = -\sum_{i=1}^{m} w_i \left[ v_i\, p_i \log \hat{p}_i + (1 - p_i) \log(1 - \hat{p}_i) \right], \quad (5)$$

where a weighting strategy is employed, and $p_i$, $w_i$, and $v_i$ denote the ground-truth occurrence probability, the weight, and the weight for occurrence of the i-th AU, respectively. There are two types of data imbalance issues [17] in most existing AU datasets [18][19][20]: inter-AU data imbalance, in which different AUs have different occurrence rates, and intra-AU data imbalance, in which AUs often have smaller occurrence rates than non-occurrence rates. To alleviate these data imbalance issues, $w_i$ and $v_i$ are defined as:

$$w_i = \frac{n / n_i^{occ}}{\sum_{j=1}^{m} n / n_j^{occ}}, \qquad v_i = \frac{n - n_i^{occ}}{n_i^{occ}}, \quad (6)$$

where $n_i^{occ}/n$ is the occurrence rate of the i-th AU, and $n_i^{occ}$ and $n$ denote the number of training images in which the i-th AU occurs and the number of all images in the training set, respectively.
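The weighted loss in Equations (5) and (6) can be sketched as follows; the function names and the (batch, m) tensor layout are our assumptions.

```python
import torch

def au_weights(n_occ, n):
    """Per-AU weights of Equation (6). n_occ: (m,) tensor of occurrence
    counts in the training set; n: total number of training images."""
    rate = n_occ.float() / n                 # occurrence rate of each AU
    w = (1.0 / rate) / (1.0 / rate).sum()    # inter-AU: rarer AUs get larger w_i
    v = (1.0 - rate) / rate                  # intra-AU: up-weight the occurrence term
    return w, v

def au_loss(p_hat, p, w, v, eps=1e-8):
    """Weighted binary cross-entropy of Equation (5) over m AUs.
    p_hat, p: (batch, m) predicted and ground-truth occurrence probabilities."""
    occ = v * p * torch.log(p_hat + eps)            # occurrence term, scaled by v_i
    non = (1.0 - p) * torch.log(1.0 - p_hat + eps)  # non-occurrence term
    return -(w * (occ + non)).sum(dim=1).mean()     # weight per AU, mean over batch
```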
Under the constraint of the attention predefinition and the guidance of AU recognition, the adaptively learned AU attention map $M_i^{(1)}$ can capture both the strongly relevant regions predefined by prior knowledge and the scattered relevant regions on the global face. In this case, our AU recognition method can work well despite the subtlety and non-rigidity of AUs, thanks to the accurate AU-related features.
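Putting these pieces together, the following is a minimal PyTorch sketch of a single AU branch under our reading of Figure 2; the class name, channel counts, the sigmoid on the attention logits, and the exact placement of the attention multiplication are illustrative assumptions, not the paper's reference implementation.

```python
import torch
import torch.nn as nn

class AUBranch(nn.Module):
    """Sketch of one AU branch: learned attention, prior mask,
    and occurrence prediction. Layer sizes are illustrative."""
    def __init__(self, in_ch, mask):
        super().__init__()
        self.convs = nn.Sequential(  # three convolutional layers
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(),
        )
        self.attn_conv = nn.Conv2d(in_ch, 1, 3, padding=1)  # learns M_i^(0)
        self.register_buffer("gaussian_mask", mask)         # Equations (1)-(2)
        self.eps = nn.Parameter(torch.zeros(1))             # epsilon_i of Equation (3)
        self.out_conv = nn.Conv2d(in_ch, in_ch, 3, padding=1)
        self.fc = nn.Linear(in_ch, 1)

    def forward(self, x):
        f = self.convs(x)
        attn = torch.sigmoid(self.attn_conv(f))             # M_i^(0)
        eps = torch.relu(self.eps)                          # keep epsilon_i >= 0
        mask = (self.gaussian_mask + eps) / (1.0 + eps)     # M_i, Equation (3)
        new_attn = mask * attn                              # M_i^(1), Equation (4)
        f = self.out_conv(f * new_attn)                     # emphasize AU-related features
        f = f.mean(dim=(2, 3))                              # global average pooling
        return torch.sigmoid(self.fc(f)).squeeze(1)         # occurrence probability
```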

Datasets
In this paper, we evaluate our PAA on three popular benchmark datasets. Besides AU annotations, each dataset is also annotated with facial landmarks.
• Binghamton-Pittsburgh 4D (BP4D) [18] includes 41 subjects (23 women and 18 men), in which 328 videos with about 140,000 frames are captured in total by placing each subject into 8 sessions. Each frame is labeled with the occurrence or non-occurrence of AUs. Similar to previous approaches [1][2][3], we conduct subject-exclusive three-fold cross-validation on 12 AUs, with two folds for training and the remaining one for testing. Our method uses the same partitions of subjects as the previous works [1][2][3].
• Denver Intensity of Spontaneous Facial Action (DISFA) [19] contains 12 women and 15 men, in which each subject is recorded in a video with 4845 frames. Each frame is labeled with AU intensities ranging from 0 to 5. Following previous methods [1][2][3], we treat an AU as occurring if its intensity is greater than or equal to 2, and as non-occurring otherwise (see the snippet after this list). We conduct subject-exclusive three-fold cross-validation on 8 AUs, using the same partitions of subjects as the previous works [1][2][3].
• Sayette Group Formation Task (GFT) [20] includes 96 subjects, each captured in one video; its images are more challenging than those of BP4D and DISFA due to unscripted interactions in 32 three-subject teams. Each frame is labeled with AU occurrences. We adopt the official training and testing partitions of subjects [20], in which about 108,000 frames of 78 subjects are used for training, and about 24,600 frames of 18 subjects are used for testing.
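As a concrete illustration of the DISFA labeling protocol above, a hypothetical helper for binarizing intensity labels might look as follows (the function name and array layout are our assumptions):

```python
import numpy as np

def binarize_disfa(intensities, threshold=2):
    """Map DISFA AU intensity labels (0-5) to binary occurrence labels:
    an AU is treated as occurring when its intensity is >= 2."""
    return (np.asarray(intensities) >= threshold).astype(np.int64)

# Example: intensities of 8 AUs for one frame.
labels = binarize_disfa([0, 1, 2, 5, 0, 3, 1, 4])  # -> [0, 0, 1, 1, 0, 1, 0, 1]
```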

Implementation Details
Our PAA is implemented in PyTorch [21], in which each convolutional layer adopts 3 × 3 convolutional filters with a stride of 1 and a padding of 1, and each max-pooling layer processes 2 × 2 spatial fields with a stride of 2. We normalize each image to 3 × 200 × 200 by a similarity transformation, and randomly crop the normalized image to 3 × l × l with a random horizontal flip. The image size l, the network parameter c, and the standard deviation δ in Equation (1) are set to 176, 8, and 3, respectively.
Similar to JÂA-Net [3], we employ the stochastic gradient descent (SGD) solver with a Nesterov momentum [22] of 0.9, a weight decay of 0.0005, and a mini-batch size of 8 to train PAA for 12 epochs. The learning rate is initialized to 0.006 and is multiplied by 0.3 every 2 epochs during training. Following the settings in [1][2][3], we use the parameters of the well-trained model on BP4D for initialization when training on DISFA.
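The optimization setup above can be reproduced with standard PyTorch components; `model`, `train_loader`, and the loss weights `w` and `v` (from the loss sketch earlier) are assumed to be defined elsewhere.

```python
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.006, momentum=0.9,
                            nesterov=True, weight_decay=0.0005)
# Multiply the learning rate by 0.3 every 2 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.3)

for epoch in range(12):
    for images, labels in train_loader:  # mini-batch size of 8
        optimizer.zero_grad()
        loss = au_loss(model(images), labels, w, v)
        loss.backward()
        optimizer.step()
    scheduler.step()
```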

Evaluation Metrics
We evaluate methods via a popular metric, the frame-based F1-score (F1-frame):

$$F1 = \frac{2PR}{P + R},$$

where $P$ denotes the precision and $R$ denotes the recall. We also report the average of the F1-frame over all AUs (abbreviated as Avg). For simplicity, we omit "%" in all F1-frame results.
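For reference, a small sketch of computing the F1-frame for one AU from binary per-frame predictions (names and shapes are our assumptions):

```python
import numpy as np

def f1_frame(pred, gt):
    """F1-frame for one AU. pred, gt: (n_frames,) binary arrays."""
    tp = np.sum((pred == 1) & (gt == 1))
    fp = np.sum((pred == 1) & (gt == 0))
    fn = np.sum((pred == 0) & (gt == 1))
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    return 2 * precision * recall / max(precision + recall, 1e-8)
```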

Comparison with State-of-the-Art Methods
In this section, we compare our PAA approach with state-of-the-art AU recognition methods, including LSVM [23], APL [24], JPML [12], AlexNet [25], DRML [1], EAC-Net [2], DSIN [26], CMS [27], LP-Net [14], ARL [9], SRERL [28], AU R-CNN [15], TCAE [29], AU-GCN [30], Ertugrul et al. [31], JÂA-Net [3], UGN-B [32], HMP-PS [33], and GeoCNN [34]. Notice that these works often adopt external training data, while our PAA uses the benchmark dataset only. Specifically, EAC-Net, SRERL, AU R-CNN, UGN-B, HMP-PS, and GeoCNN use pre-trained ImageNet models [35,36], CMS employs external thermal images, LP-Net pre-trains on a face recognition dataset [37], and GeoCNN utilizes a pre-trained 3D morphable model (3DMM) [38,39]. Several related works such as R-T1 [7] are not compared since they process a sequence of frames instead of a single frame.

Table 1 shows the results of different methods on the BP4D benchmark. We can observe that our PAA performs better than most of the previous works, especially in terms of the average F1-frame. Compared to methods using external training data, such as AU R-CNN and UGN-B, PAA uses benchmark training images only while achieving better performance. Although GeoCNN is slightly better than our method, it relies on a pre-trained 3DMM to obtain additional 3D manifold information to facilitate AU recognition.

Table 1. F1-frame results on Binghamton-Pittsburgh 4D (BP4D) [18]. The results of LSVM [23] and JPML [12] are from [1], and those of other previous methods are reported in their original papers. The best results for each AU, as well as the average across methods, are shown in bold. Our PAA method performs better than most of the previous works.

Table 2 shows the results on DISFA, from which we can see that our PAA achieves competitive performance. It can also be found that many methods, such as AU-GCN, exhibit results that fluctuate more across AUs on DISFA than on BP4D, working well on BP4D but showing poor results on DISFA. This is because DISFA is more challenging, with a more severe data imbalance problem than BP4D. In this case, our PAA achieves more stable performance across different AUs than most of the previous works, and performs consistently well on BP4D and DISFA, with average F1-frame results of 63.4 and 62.9, respectively.

Table 2. F1-frame results on Denver Intensity of Spontaneous Facial Action (DISFA) [19]. The results of LSVM [23] and APL [24] are from [1], and those of other previous methods are reported in their original papers. The best results for each AU, as well as the average across methods, are shown in bold. Our PAA method achieves competitive performance, with more stable results across different AUs than most of the previous works.

Evaluation on GFT
The comparison results on GFT are presented in Table 3. It can be observed that our PAA outperforms all the other approaches. Notice that GFT images often exhibit large poses, making them more challenging than the near-frontal BP4D and DISFA images. In this case, PAA still achieves good performance, with the highest average F1-frame of 55.8.

Table 3. F1-frame results on Sayette Group Formation Task (GFT) [20]. The results of LSVM [23] and AlexNet [25] are from [20], those of EAC-Net [2] and ARL [9] are from [3], and those of other previous methods are reported in their original papers. The best results for each AU, as well as the average across methods, are shown in bold. Our PAA method outperforms all the other approaches.

Ablation Study
In this section, we investigate the effectiveness of each component in our PAA. Table 4 summarizes the structures of different variants of PAA, in which Baseline does not have the structure of learning the attention map $M_i^{(1)}$ and does not utilize the weighting strategy in Equation (5), i.e., $w_i = 1/m$ and $v_i = 1$. The results of the different variants of PAA on the BP4D benchmark are presented in Table 5; all variants use the same hyperparameters, as detailed in the Implementation Details section.

Table 4. Structures of different variants of PAA. Baseline does not have the structure of learning the attention map $M_i^{(1)}$, and does not utilize the weighting strategy in Equation (5), i.e., $w_i = 1/m$ and $v_i = 1$.

Table 5. F1-frame results for 12 AUs of different variants of PAA on BP4D [18]. The best results for each AU, as well as the average across methods, are shown in bold. The performance gradually improves as the proposed components are added.

Weighting Strategy for Suppressing Data Imbalance
We can observe that Baseline+W(au) performs better than Baseline, with an average F1-frame of 59.9. This demonstrates the effectiveness of the introduced weighting strategy in alleviating both inter-AU and intra-AU data imbalance.

Supervision of AU Recognition for Attention Learning
Besides the structure of Baseline+W(au), the variant AA only uses the adaptive attention, directly element-wise multiplying the learned attention map $M_i^{(0)}$ with each channel of the fourth convolutional feature map. We can see that AA significantly improves the average F1-frame to 61.5, which shows that an attention map learned only under the guidance of AU recognition can already capture much useful AU information.

Attention Predefinition
Another variant over Baseline+W(au) is PA, which only uses the prior attention, directly element-wise multiplying the predefined mask $M_i$ with each channel of the fourth convolutional feature map. Since the guidance of AU recognition is not available, $\epsilon_i = 0$ is fixed for $M_i$. We can see that PA achieves good performance, with an average F1-frame of 61.9. This indicates that prior knowledge benefits AU recognition by specifying the ROI of each AU. We next further explore the effectiveness of combining prior attention and adaptive attention.
After employing the predefined mask $M_i$ with the adaptively learned $\epsilon_i$ on top of AA, our PAA achieves the highest average F1-frame of 63.4. To investigate the usefulness of the learnable $\epsilon_i$, we implement PAA (fix) by fixing $\epsilon_i = 0$ in Equation (3) for each AU. In this case, the potentially relevant regions beyond the predefined ROI are ignored, and the average F1-frame degrades to 62.0. Therefore, the design of adaptively learning the control parameter is effective, since our PAA can adaptively learn which AUs have correlated positions beyond the predefined ROI that should be emphasized.

Visual Results
In Figure 4, we visualize the attention maps learned by recent attention learning-based methods, EAC-Net [2], JÂA-Net [3], and ARL [9], as well as $M_i^{(1)}$, $M_i^{(0)}$, and $M_i$ for each AU learned by our PAA. Owing to the subtlety and non-rigidity of AUs, different AUs have different appearances, including shapes and sizes, and AU appearances also vary across persons and expressions. In this case, AUs should have diverse regional attention distributions. For example, the two images in Figure 4 both show a happy expression, while AU 14 co-occurs with AUs 6, 7, 10, 12, and 17 in the first image, and occurs alone in the second image. It is thus expected that the same AU should have different attention maps in the two images.

Figure 4. Visualization of the learned attention maps of different methods, in which AUs 6, 7, 10, 12, 14, and 17 appear in the first image, and AU 14 appears in the second image. Each row lists the attention maps of 12 AUs, as well as the combined attention map of the occurred AUs, for the corresponding method. Attention weights from zero to one are visualized using colors from blue to red, and are overlaid on the input images for a better view.
We can observe that, for EAC-Net, different AUs in different images have the same attention distribution except for different distribution centers, in which the divergences across AUs are ignored and the locations beyond the predefined ROIs receive zero attention weights. Although JÂA-Net tries to modify the predefined attention maps, the modification seems to work as a smoothing of the predefinition, in which the positions very distant from the AU centers have smoothed attention weights. Correlated and uncorrelated positions beyond the predefined ROIs are treated as similarly important, so the learning of AU features is still often inaccurate. ARL offers a different perspective, in which the learned attention maps are dense, with almost all correlated positions included; however, many potentially uncorrelated positions are mistakenly emphasized.
In contrast with these methods, by combining the advantages of $M_i$ as the predefinition and $M_i^{(0)}$ as the supervision for each AU, our PAA can capture both strongly relevant positions specified by landmarks and weakly relevant positions distributed globally over the face. Moreover, we can see that different AUs often have different attention distributions in $M_i$, since we employ a learnable instead of fixed control parameter for each AU. In this way, we can more adaptively learn the attention weights at different locations of different AUs, especially those far away from the predefined ROIs. Furthermore, we find that there are quite a few overlaps among the relevant locations in the attention maps of the occurred AUs for our PAA. In this case, our combined attention distribution is clean and lies in the highlighted attention range of the predefined combined attention map, which demonstrates that our method can precisely capture the correlated positions of each AU.
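Overlays in the style of Figure 4 can be produced with a few lines of matplotlib; this is a generic visualization recipe under our assumptions, not the paper's plotting code.

```python
import matplotlib.pyplot as plt

def overlay_attention(image, attn_map):
    """Overlay an l x l attention map (values in [0, 1]) on a face image,
    using a blue-to-red colormap, as in Figure 4."""
    plt.imshow(image)  # H x W x 3 face image
    plt.imshow(attn_map, cmap="jet", alpha=0.5,
               extent=(0, image.shape[1], image.shape[0], 0))  # stretch to image size
    plt.axis("off")
    plt.show()
```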

Conclusions
In this paper, we have proposed a novel AU recognition method by prior and adaptive attention, which integrates the advantages of the constraint of prior knowledge and the supervision of AU recognition. We have also proposed a learnable parameter to adaptively control the importance of different positions in the predefined mask of each AU, so that an appropriate constraint of the prior knowledge can be adaptively learned.
We have compared our approach against state-of-the-art methods on popular and challenging benchmarks, which shows that our approach outperforms most of the previous methods. Besides, we have conducted an ablation study, in which each component of our framework is demonstrated to contribute to AU recognition. Moreover, the visual results indicate that our approach can accurately reason the regional attention distribution of each AU.