Multi-Task Learning Using Task Dependencies for Face Attributes Prediction

Face attributes prediction has a growing number of applications in human-computer interaction, face verification and video surveillance. Various studies show that dependencies exist among face attributes. A multi-task learning architecture can build a synergy among correlated tasks by parameter sharing in the shared layers. However, the dependencies between the tasks have been ignored in the task-specific layers of most multi-task learning architectures. Thus, how to further boost the performance of individual tasks by using task dependencies among face attributes remains challenging. In this paper, we propose a multi-task learning using task dependencies architecture for face attributes prediction and evaluate its performance on the tasks of smile and gender prediction. The attention modules designed for the task-specific layers of our proposed architecture are used for learning task-dependent disentangled representations. The experimental results demonstrate the effectiveness of our proposed network in comparison with the traditional multi-task learning architecture and the state-of-the-art methods on the Faces of the World (FotW) and Labeled Faces in the Wild-a (LFWA) datasets.


Introduction
Face attributes (e.g., smile, gender and age) provide a detailed description of human faces. Face attributes prediction has applications in human-computer interaction, face verification [1,2] and video surveillance [3,4]. Face variations in pose, illumination, scale and occlusion increase the difficulty of face attributes prediction. The performance of face attributes prediction has been improved by using deep convolutional neural networks (DCNNs) [5][6][7][8][9][10]. However, these networks are trained separately for each face attribute, so the inherent correlation between the face attributes is ignored.
Various studies show that dependencies exist among face attributes [11][12][13][14][15]. Multi-task learning networks can improve the performance of individual tasks by jointly learning correlated tasks. In traditional multi-task learning architectures, the shared layers learn general representations for all the tasks by parameter sharing, while the subsequent task-specific layers learn task-specific representations. However, the dependencies between the tasks have been ignored in the task-specific layers. Accordingly, further improving the performance of individual tasks by using task dependencies among face attributes in the task-specific layers of a multi-task learning architecture is a challenging problem.
We propose a multi-task learning using task dependencies architecture for face attributes prediction and evaluate its performance on the tasks of smile and gender prediction. Our proposed architecture splits into two task-specific branches after the shared layers. In the task-specific branches, we establish the task dependencies in the task-specific layers by incorporating an attention mechanism. The fully connected layers in the task-specific layers are transformed by the designed attention modules to learn task-dependent disentangled representations, that is, representations [16,17] of one task that are disentangled [18] by depending on another task. The transformed fully connected layers, which contain the task-dependent disentangled representations, are fed into softmax layers to predict the final face attributes. In the experiments, we demonstrate the effectiveness of our proposed network by comparing it with the traditional multi-task learning architecture and the state-of-the-art methods on the FotW and LFWA datasets.
The rest of this paper is organized as follows: Section 2 briefly reviews related work. Section 3 describes the proposed multi-task learning using task dependencies architecture in detail. Section 4 describes the experimental configuration and presents and discusses the results on the FotW and LFWA datasets. Section 5 concludes the paper.

Related Work
Multi-task learning. Caruana [19] first analyzed multi-task learning in detail. Since then, multi-task learning has been adopted for solving different computer vision problems. Gkioxari et al. used a convolutional neural network (CNN) for pose prediction and action classification of people in unconstrained images [20]. Eigen et al. proposed a multi-scale convolutional architecture for predicting depth, surface normals and semantic labels [21]. Misra et al. presented cross-stitch units to learn shared representations for multi-task learning in ConvNets [22]. Kokkinos et al. presented a CNN that jointly handles low-, mid-, and high-level vision tasks in a unified architecture [23]. Mallya et al. studied a method for performing multiple tasks in a single deep neural network by iteratively pruning and packing the network parameters [24]. Kim et al. proposed a novel architecture, termed deep virtual networks, that contains multiple networks of different configurations with respect to different tasks and memory budgets [25]. Recently, multi-task learning with DCNNs has also been studied and applied to face attributes prediction. Levi et al. used a DCNN for age and gender classification [26]. Liu et al. proposed a novel deep learning framework for attribute prediction in the wild [27]. Ranjan et al. presented a DCNN for face analysis utilizing transfer learning from a face recognition model [28]. Hyun et al. proposed a method for multi-attribute recognition of facial images based on a deep learning network that automatically learns the exclusive and joint relationships among attribute recognition tasks [29].
In multi-task learning, when the prediction of the task used as the condition is accurate, other tasks can be formulated using conditional probability. For example, in [30], the experimental results on the MORPH-II dataset show that the multitask method achieves 98% gender recognition accuracy; thus, in their proposed conditional multitask learning method, the age probability P(A(X) = a) can be calculated using the gender-conditioned probability P(A(X) = a | G(X) = g) and the marginal gender probability P(G(X) = g). However, an incorrectly predicted gender G(X) = g leads to an incorrect calculation of P(A(X) = a | G(X) = g) and P(G(X) = g); therefore, their method cannot be used on other datasets where the multitask method does not predict gender accurately.
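The sensitivity to gender errors can be illustrated with a small numerical sketch. It assumes that the age probability is obtained by the law of total probability, P(A(X) = a) = Σ_g P(A(X) = a | G(X) = g) P(G(X) = g); whether [30] combines the two terms in exactly this way is not stated here. All numbers and the helper name `marginal_age` are hypothetical.

```python
import numpy as np

def marginal_age(p_gender, p_age_given_gender):
    """P(A = a) = sum_g P(A = a | G = g) * P(G = g)."""
    return np.asarray(p_gender) @ np.asarray(p_age_given_gender)

# Hypothetical gender-conditioned age distributions over three age bins.
p_age_given_gender = np.array([
    [0.2, 0.5, 0.3],   # P(A = a | G = male)
    [0.6, 0.3, 0.1],   # P(A = a | G = female)
])

# Gender predicted correctly with high confidence (the input is male).
p_age = marginal_age([0.9, 0.1], p_age_given_gender)

# Gender predicted incorrectly with the same confidence.
p_age_wrong = marginal_age([0.1, 0.9], p_age_given_gender)

print(p_age, int(np.argmax(p_age)))
print(p_age_wrong, int(np.argmax(p_age_wrong)))
```

With these illustrative numbers, the wrong gender estimate moves the most probable age bin, which is the failure mode described above.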
Attention mechanism. The attention mechanism resembles human perception in that it selects specific parts of the input information rather than using all of it. In neural networks, attention mechanisms can be used as feature selectors that determine the importance of each feature for a particular task. The attention mechanism has been studied and applied to recurrent neural networks (RNNs) and long short-term memory (LSTM) networks for sequential tasks [31][32][33]. The attention mechanism with DCNNs has also been applied to vision-related tasks. Tang et al. proposed a deep-learning based generative framework with visual attention [34]. Xiao et al. applied visual attention to a fine-grained classification task using a DCNN [35]. Xu et al. presented an attention based model that automatically learns to describe the content of images [36]. Zhao et al. proposed a diversified visual attention network for fine-grained object classification [37]. Inspired by the attention mechanism, we propose a multi-task learning using task dependencies architecture for face attributes prediction in this paper.
The main contributions of this paper are summarized as follows:
1. We propose a multi-task learning using task dependencies architecture for face attributes prediction in an end-to-end manner. The designed attention modules in our proposed architecture are used for learning task-dependent disentangled representations. We evaluate the performance on the tasks of smile and gender prediction.
2. We present experimental results which demonstrate that our proposed architecture outperforms the traditional multi-task learning architecture and show its effectiveness in comparison with the state-of-the-art methods on the FotW and LFWA datasets.

Modeling
Formally, the smile/non-smile prediction of the input face X is denoted S(X). Equation (1) defines the expected smile/non-smile prediction of X in terms of P(S(X) = s), the probability that the smile/non-smile prediction of X is s, where s ∈ S and S = {'non-smile', 'smile'}. We assume that the predicted smile/non-smile depends on the gender of X. Compared to the traditional multi-task learning architecture shown in Figure 1a, the FC_S layer is transformed into the FC_{S|C_G} layer in our proposed multi-task learning architecture (shown in Figure 1b). FC_S denotes the fully connected layer that contains K (K ∈ N) smile/non-smile representation units. FC_{S|C_G} denotes the transformed gender dependent fully connected layer that contains K gender dependent smile/non-smile representation units, where C_G is the gender context. We feed the transformed FC_{S|C_G} layer into the softmax layer to predict the final smile/non-smile. Equation (2) models the probability P(S(X) = s) of Equation (1) in terms of the transformed FC_{S|C_G} layer. The gender context C_G contains K gender context units C_{Gi}, where C_{Gi} is the i-th (i = 1, 2, ..., K) gender context unit, which is automatically chosen from the gender representation units in the FC_G layer. FC_G denotes the fully connected layer that contains K gender representation units.
The dependency score function score(x_{Sj}, C_{Gi}), formulated in Equation (3), takes the j-th (j = 1, 2, ..., K) input smile/non-smile representation unit x_{Sj} from the FC_S layer together with the i-th gender context unit C_{Gi} from the FC_G layer and scores the dependency between them. The probability P(d = j | x_S, C_{Gi}) reveals the relative importance of x_{Sj} given C_{Gi}, where x_S contains the K input smile/non-smile representation units and d indicates which unit in x_S is important given C_{Gi}. Equation (4) calculates this probability from the dependency score function, and Equation (5) defines the importance probability distribution P(d | x_S, C_{Gi}).
Equation (6) defines the gender dependent smile/non-smile representation units in the transformed FC_{S|C_G} layer. Ŝ_i, the i-th (i = 1, 2, ..., K) gender dependent smile/non-smile representation unit, is the weighted average of the input smile/non-smile representation units; that is, Ŝ_i is the expectation of x_S under the importance probability distribution P(d | x_S, C_{Gi}). The transformed FC_{S|C_G} layer is generated by concatenating the K gender dependent smile/non-smile representation units.
The gender prediction of the input face X is denoted G(X). Equation (7) defines the expected gender prediction in terms of P(G(X) = g), the probability that the gender prediction of X is g, where g ∈ G and G = {'male', 'female'}. We also assume that the predicted gender depends on the smile/non-smile of X. Equation (8) models the probability P(G(X) = g) of Equation (7) in terms of FC_{G|C_S}, the transformed smile/non-smile dependent fully connected layer that contains K smile/non-smile dependent gender representation units, where C_S is the smile/non-smile context chosen from the smile/non-smile representation units in the FC_S layer.
The smile/non-smile dependent gender representation units in the transformed FC_{G|C_S} layer are calculated in the same way as the gender dependent smile/non-smile representation units in the transformed FC_{S|C_G} layer, using Equations (9)-(12).
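As a reading aid, the attention step described in this section can be written out as below. This is a reconstruction under assumptions rather than a quotation: the softmax normalization of the scores is assumed (the text only states that the probability is calculated from the dependency score function), and the score function itself is left abstract because its form is not given here. The weighted-average form of Ŝ_i follows from its description as the expectation of x_S under the importance probability distribution.

```latex
P(d = j \mid x_S, C_{G_i})
  = \frac{\exp\bigl(\mathrm{score}(x_{S_j}, C_{G_i})\bigr)}
         {\sum_{k=1}^{K} \exp\bigl(\mathrm{score}(x_{S_k}, C_{G_i})\bigr)},
\qquad
\hat{S}_i
  = \mathbb{E}_{d \sim P(d \mid x_S, C_{G_i})}\bigl[x_{S_d}\bigr]
  = \sum_{j=1}^{K} P(d = j \mid x_S, C_{G_i}) \, x_{S_j}.
```

Under the same assumptions, the gender branch is obtained by exchanging the roles of S and G.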

Network Architecture
Our proposed multi-task learning architecture is shown in Figure 2. ResNet50 [38] is used as the baseline architecture. We share the parameters of its first 46 layers among all the tasks. We evaluate our proposed architecture on the tasks of smile and gender prediction; thus, the network splits into two task-specific branches corresponding to smile and gender prediction. We attach a fully connected layer FC_S that contains 64 smile/non-smile representation units to 'res5c1' and a fully connected layer FC_G that contains 64 gender representation units to 'res5c2', where 'res5c1' and 'res5c2' are residual blocks in ResNet50. The smile/non-smile representation units in the FC_S layer and the i-th (i = 1, 2, ..., 64) gender context unit C_{Gi} are fed into the i-th gender context attention module Att_C_{Gi} (shown in Figure 3), which is designed to learn the i-th gender dependent smile/non-smile representation unit Ŝ_i by using Equations (3)-(6). The transformed FC_{S|C_G} layer is generated by concatenating the 64 gender dependent smile/non-smile representation units. We feed the transformed FC_{S|C_G} layer into the softmax layer to predict the final smile/non-smile. The procedure for predicting the final gender is similar: the i-th (i = 1, 2, ..., 64) smile/non-smile context attention module Att_C_{Si} (shown in Figure 4) is designed to learn the i-th smile/non-smile dependent gender representation unit Ĝ_i by using Equations (9)-(12).
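The data flow through the 64 attention modules of each branch can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the score function `product_score` is hypothetical (the form of the dependency score is not given in this text), the softmax normalization is assumed, the i-th context unit is taken to be the i-th unit of the other branch's fully connected layer, and random vectors stand in for the FC_S and FC_G activations.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def task_dependent_units(x, context, score_fn):
    """Transform the K units `x` of one task using K context units of the other task.

    Returns the K transformed units and the (K, K) matrix of importance weights.
    """
    # scores[i, j] = score(x_j, C_i), computed for all pairs by broadcasting
    scores = score_fn(x[None, :], context[:, None])
    weights = softmax(scores, axis=1)  # row i is the distribution P(d | x, C_i)
    transformed = weights @ x          # unit i = sum_j P(d = j | x, C_i) * x_j
    return transformed, weights

def product_score(x_j, c_i):
    # Hypothetical score, chosen only so that the sketch runs.
    return x_j * c_i

rng = np.random.default_rng(0)
K = 64
fc_s = rng.standard_normal(K)  # stand-in for the FC_S activations
fc_g = rng.standard_normal(K)  # stand-in for the FC_G activations

s_hat, w_s = task_dependent_units(fc_s, fc_g, product_score)  # transformed FC_{S|C_G}
g_hat, w_g = task_dependent_units(fc_g, fc_s, product_score)  # transformed FC_{G|C_S}
print(s_hat.shape, g_hat.shape)
```

Because each transformed unit is a convex combination of the input units, its value always lies between the smallest and largest input unit; in the architecture, `s_hat` and `g_hat` would then be passed to the two softmax classifiers.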

The Model Objective
We use the cross-entropy loss for training the smile prediction task. The loss function L_S is formulated as L_S = -[s log(p_s) + (1 - s) log(1 - p_s)], where s = 1 for a smiling face and s = 0 otherwise, and p_s is the final predicted probability that the input is a smiling face. We also use the cross-entropy loss for training the gender prediction task. The loss function L_G is formulated as L_G = -[g log(p_g) + (1 - g) log(1 - p_g)], where g = 0 if the gender is male and g = 1 if the gender is female, and p_g is the final predicted probability that the input face is female. The total loss L is the weighted sum of the individual losses, L = λ_s L_S + λ_g L_G, where λ_s and λ_g are the weight parameters corresponding to the smile and gender prediction tasks, respectively.
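For a single input, the objective can be sketched in plain Python as below. The function names and the clipping constant `eps` are illustrative choices and are not taken from the paper; the per-sample binary cross-entropy form is the standard one.

```python
import math

def binary_cross_entropy(label, prob, eps=1e-12):
    """Cross-entropy for one sample with a 0/1 label and predicted probability of class 1."""
    prob = min(max(prob, eps), 1.0 - eps)  # keep log() away from zero
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))

def total_loss(s, p_s, g, p_g, lambda_s=1.0, lambda_g=1.0):
    """Weighted sum of the smile loss L_S and the gender loss L_G."""
    return (lambda_s * binary_cross_entropy(s, p_s)
            + lambda_g * binary_cross_entropy(g, p_g))

# A smiling (s = 1) male (g = 0) face with fairly confident, correct predictions.
print(total_loss(s=1, p_s=0.9, g=0, p_g=0.2))
```

Setting `lambda_s` and `lambda_g` to different values would weight one task more heavily in the gradient.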

Experiments
The proposed multi-task learning using task dependencies architecture is evaluated on the tasks of smile and gender prediction. The architecture shown in Figure 1a, in which the FC_S and FC_G layers are fed directly into softmax layers to predict the final smile/non-smile and gender, respectively, is called TMTL (Traditional Multi-Task Learning). We select the TMTL architecture as the comparison baseline.

Datasets
We evaluate the smile and gender prediction performance on Faces of the World (FotW) [39] and Labeled Faces in the Wild-a (LFWA) [40] datasets. Both FotW and LFWA datasets cover large variations in pose, illumination and scale of faces. The FotW dataset contains 9130 images, each of which is labeled with non-smile/smile and male/female. The FotW dataset has been split into 6078 images for training and 3052 images for validation. The LFWA dataset contains 13,143 images, each of which is labeled with non-smile/smile, male/female and thirty-eight other face attributes. The LFWA dataset has been split into 6263 images for training and 6880 images for validation.
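The split sizes quoted above can be cross-checked with a few lines of Python; the dictionary layout and the helper name `train_fraction` are illustrative only.

```python
# Image counts as stated in the text.
splits = {
    "FotW": {"train": 6078, "validation": 3052, "total": 9130},
    "LFWA": {"train": 6263, "validation": 6880, "total": 13143},
}

def train_fraction(split):
    """Fraction of a dataset's images assigned to training."""
    return split["train"] / split["total"]

for name, split in splits.items():
    # The two parts should account for every image in the dataset.
    assert split["train"] + split["validation"] == split["total"]
    print(f"{name}: {train_fraction(split):.1%} of images used for training")
```

The check shows that FotW trains on roughly two thirds of its images, whereas LFWA trains on slightly less than half.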

Experimental Configuration
For the FotW dataset, we crop the faces from the original images using the provided coordinates of the bounding box and resize the cropped face images to 224 × 224 × 3. For the LFWA dataset, we directly resize the face images to 224 × 224 × 3.
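A minimal sketch of this preprocessing step is given below. It assumes a bounding box in (x, y, width, height) pixel format that lies inside the image, and it uses nearest-neighbor resampling; neither the box format nor the interpolation method is specified in the text, and the function name is hypothetical.

```python
import numpy as np

def crop_and_resize(image, box=None, size=224):
    """Optionally crop `box` from an H x W x 3 image, then resize to size x size x 3."""
    if box is not None:
        x, y, w, h = box
        image = image[y:y + h, x:x + w]
    # Nearest-neighbor lookup: map each output pixel back to a source pixel.
    rows = np.arange(size) * image.shape[0] // size
    cols = np.arange(size) * image.shape[1] // size
    return image[rows[:, None], cols[None, :]]

# FotW-style input: crop the face with the provided box, then resize.
photo = np.zeros((300, 400, 3), dtype=np.uint8)
face = crop_and_resize(photo, box=(50, 60, 100, 120))

# LFWA-style input: resize the whole image directly.
lfwa_face = crop_and_resize(np.zeros((250, 250, 3), dtype=np.uint8))
print(face.shape, lfwa_face.shape)
```

In practice an image library with bilinear or bicubic interpolation would normally replace the nearest-neighbor lookup used here.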
All the architectures are trained using the Keras [41] framework. Data augmentation (horizontal flip, horizontal shift and vertical shift) is adopted to prevent overfitting. We train all the architectures using Adam with a mini-batch size of 64. The initial learning rate is set to 0.001 and is decreased to 0.0001 after 25 epochs of training. The weight parameters are decided based on the importance of each task in the overall loss. We assume that the smile prediction task and the gender prediction task have the same importance in our proposed architecture because both tasks are binary classification problems. Therefore, we set the weight parameters λ_s = 1 and λ_g = 1. For the FotW and LFWA datasets, we adopt he_normal as the weight initialization method and train the TMTL architecture for 40 epochs and 30 epochs, respectively (overfitting occurs beyond these points). For both datasets, we initialize our proposed architecture with the trained weights from the TMTL architecture and train it for 30 epochs.
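The learning-rate schedule described above amounts to a single step function, sketched below. The function name and the 0-based epoch convention are assumptions; a function of this shape could, for example, be wrapped in a Keras `LearningRateScheduler` callback, but the text does not say how the schedule was actually implemented.

```python
def learning_rate(epoch, initial=1e-3, reduced=1e-4, drop_epoch=25):
    """Return the learning rate for a 0-based epoch index.

    The first `drop_epoch` epochs use `initial`; later epochs use `reduced`.
    """
    return initial if epoch < drop_epoch else reduced

# Learning rates over a 30-epoch run, as used for the proposed architecture.
schedule = [learning_rate(epoch) for epoch in range(30)]
print(schedule[0], schedule[24], schedule[25])
```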

The Effectiveness of Multi-Task Learning Using Task Dependencies
We evaluate the contribution of multi-task learning using task dependencies. Disentangling the underlying structure of representations into disjoint parts can help solve a diverse set of tasks in a data-efficient manner. Disentangled representations are vector representations defined with respect to a particular decomposition of a group into subgroups, using group and representation theory [42]. Table 1 compares our proposed architecture with the TMTL architecture on the FotW and LFWA datasets. Our proposed architecture produces performance gains over the TMTL architecture because it disentangles the smile/non-smile and gender representations into gender dependent smile/non-smile and smile/non-smile dependent gender representations, respectively, by establishing the task dependencies between the smile and gender prediction tasks in the task-specific layers.
We combine (smile/non-smile × gender) into four groups. For each group, we randomly sample 100 images from the FotW validation dataset. The t-distributed stochastic neighbor embedding (t-SNE) [43] of the sampled FotW validation dataset shows the distributions of the representations in the FC_S and FC_G layers in Figure 5a,c, respectively, and the distributions of the gender dependent smile/non-smile and smile/non-smile dependent gender representations in the FC_{S|C_G} and FC_{G|C_S} layers in Figure 5b,d, respectively. The clusters in Figure 5b,d are disentangled by gender and smile/non-smile more explicitly than those in Figure 5a,c. The sampled LFWA validation dataset is obtained by the same procedure. The t-SNE of the sampled LFWA validation dataset shows the distributions of the representations in the FC_S and FC_G layers in Figure 6a,c, respectively, and the distributions of the gender dependent smile/non-smile and smile/non-smile dependent gender representations in the FC_{S|C_G} and FC_{G|C_S} layers in Figure 6b,d, respectively. The clusters in Figure 6b,d are also disentangled by gender and smile/non-smile more explicitly than those in Figure 6a,c.
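The sampling procedure used for the t-SNE plots (100 images from each of the four smile × gender groups) can be sketched as follows. The label encoding (0/1 for each attribute), the function name, and the synthetic labels are assumptions made for illustration; the random seed used for the published figures is not known.

```python
import numpy as np

def sample_per_group(smile, gender, per_group=100, seed=0):
    """Draw `per_group` image indices, without replacement, from each (smile, gender) group."""
    rng = np.random.default_rng(seed)
    smile = np.asarray(smile)
    gender = np.asarray(gender)
    picked = {}
    for s in (0, 1):
        for g in (0, 1):
            members = np.flatnonzero((smile == s) & (gender == g))
            # Raises ValueError if a group has fewer than `per_group` images.
            picked[(s, g)] = rng.choice(members, size=per_group, replace=False)
    return picked

# Synthetic labels with the size of the FotW validation set (3052 images).
index = np.arange(3052)
smile_labels = index % 2
gender_labels = (index // 2) % 2

groups = sample_per_group(smile_labels, gender_labels)
sampled = np.concatenate(list(groups.values()))
print(sampled.shape)
```

The layer activations of the 400 sampled images would then be passed to a t-SNE implementation (for instance, scikit-learn provides one) to produce two-dimensional embeddings for plotting.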

Comparison with Previous Approaches
We initialize our proposed architecture using the weights from ResNet50 pre-trained on ImageNet [44]. Tables 2 and 3 compare our results with those of previous methods on the FotW and LFWA datasets, respectively. Our average accuracy is lower than that of SIAT_MMLAB on the FotW dataset and that of LNets+ANet on the LFWA dataset. The SIAT_MMLAB architecture is composed of GNet for gender classification and two SNets for smile classification. GNet and the two SNets are trained with different face cropping schemes for better performance. The SIAT_MMLAB architecture adopts the VGG-Faces [45] model, which is pre-trained on a large-scale face identification dataset for face identification and face verification. It uses a general-to-specific fine-tuning scheme that fine-tunes the model three times, on the CelebA [27] dataset (with forty attribute annotations), the CelebA dataset (with smile and gender annotations) and the FotW dataset (with smile and gender annotations), respectively. The LNets+ANet architecture integrates two CNNs, LNet and ANet, where LNet locates the entire face region and ANet extracts features for attribute recognition. LNet is pre-trained on ImageNet and fine-tuned with image-level attribute tags. ANet is pre-trained on the CelebA dataset and fine-tuned with attribute tags. In contrast, our proposed architecture performs the smile and gender prediction tasks in an end-to-end manner using a single deep neural network. The input face images are processed as described in the experimental configuration, with no extra face cropping or localization steps, and we only use the weights from ResNet50 pre-trained on ImageNet for weight initialization. The results also show the effectiveness of our proposed architecture in comparison with previous state-of-the-art methods.

Conclusions
In this paper, we have proposed a novel multi-task learning using task dependencies architecture for face attributes prediction and evaluated its performance on the tasks of smile and gender prediction. We transformed the fully connected layers by using the designed attention modules for learning task-dependent disentangled representations. The transformed fully connected layers were fed into softmax layers to predict the final face attributes. The experimental results demonstrate the effectiveness of our proposed network in comparison with the traditional multi-task learning architecture and the state-of-the-art methods on the FotW and LFWA datasets. In the future, we will evaluate the performance of our proposed architecture on more face attribute prediction tasks. We also plan to apply the attention module to more fully connected layers or convolution layers and to try dynamic weights when performing more face attribute prediction tasks.

Conflicts of Interest:
The authors declare that they have no conflicts of interest.

Abbreviations
The following abbreviations are used in this manuscript: