1. Introduction
Face recognition is one of the most attractive topics in biometrics and computer vision because of its convenience, hygiene, and low cost: face images can be acquired in a contactless manner without any special equipment [1]. Face recognition is therefore in great demand for personal authentication on smartphones and at security gates, and in payment services, communication robots, etc. Although the explosive development of Convolutional Neural Networks (CNNs) has dramatically improved its accuracy, face recognition still suffers from significant accuracy degradation under changes in pose, facial expression, motion, illumination, and resolution. To meet this demand, further performance improvements have been investigated along two lines: a direct approach that improves the face recognition method itself, and an indirect approach that improves performance by adding other factors to the face recognition method. In this paper, we focus on face attribute estimation, an indirect approach, in the sense that it can be used not only to improve the accuracy of face recognition but also for customer analysis in marketing, image retrieval, video surveillance, and criminal investigation [2,3].
A face has a wide variety of biological features, including age, gender, hair color, hairstyle, mouth size, and nose height. These facial features, called face attributes, cannot be used for personal identification on their own; however, combined, they allow rough personal identification. This use of biometric traits is known as soft biometrics, in contrast to hard biometrics, where a single biometric trait such as a fingerprint, iris, or face suffices for personal identification. For example, the recognition accuracy of face recognition methods can be improved by combining general face features with face attributes [4,5]. The processing time of face recognition can also be reduced by prescreening with face attributes.
Face attribute estimation can be regarded as a multiple binary classification problem, as shown in Figure 1; that is, it is the problem of estimating whether a face has or does not have each attribute. Some face attributes have multiple names depending on color and shape, such as hair, while others, such as age, are expressed numerically. To handle face attribute estimation as a binary classification problem, hair can, for example, be decomposed into classes such as black hair, blond hair, brown hair, and gray hair, and age can be simplified to young. Face attribute estimation consists of three processes: face detection, feature extraction, and classification [3,6]. Among these processes, feature extraction is the most important, since it has the greatest impact on estimation accuracy.
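As a concrete illustration of this multiple-binary-classification view, the sketch below thresholds one sigmoid score per attribute independently. The attribute names follow CelebA, but the scores and the 0.5 threshold are illustrative assumptions, not the classifier of any particular method discussed here.

```python
# Face attribute estimation as multiple binary classifications:
# one sigmoid score per attribute, each thresholded independently.
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def estimate_attributes(logits: dict) -> dict:
    """Map per-attribute raw scores to binary has/has-not decisions."""
    return {name: sigmoid(z) >= 0.5 for name, z in logits.items()}

# Example: raw scores a shared feature extractor might produce (made up).
logits = {"Black_Hair": 2.3, "Blond_Hair": -1.7, "Young": 0.4, "Smiling": -0.1}
decisions = estimate_attributes(logits)
print(decisions)
# {'Black_Hair': True, 'Blond_Hair': False, 'Young': True, 'Smiling': False}
```

Each attribute is decided on its own, which is what allows a single shared feature extractor to serve all 40 binary problems at once.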
Traditional methods utilize hand-crafted features such as Local Binary Patterns (LBP) [7] for feature extraction. LBP-based methods can estimate attributes from a single face image, since they do not require any training process; however, their estimation accuracy is quite low, since LBP cannot handle the wide variety of face attributes. CNN-based approaches have recently become the most popular for face attribute estimation, since CNNs have made a significant impact on image recognition. Although using one feature extractor per attribute would maximize estimation accuracy, in most cases one feature extractor is shared to estimate all face attributes for parameter efficiency [2,8,9,10,11,12,13,14]. To achieve both high parameter efficiency and high estimation accuracy, it is necessary to design a CNN consisting of multiple layers, such as convolution and pooling layers, that extracts the optimal features for each attribute. Several methods have been proposed to improve the accuracy of face attribute estimation by appropriately sharing the layers of CNNs [2,13,14,15]. In those methods, manual grouping and clustering of face attributes were used to decide which layers to share. Manual grouping is not only time consuming but also arbitrary, and simple attribute clustering is not always effective for attribute estimation.
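For reference, the basic 3x3 LBP code mentioned above can be sketched as follows. This is a minimal illustration of the descriptor with an arbitrary (clockwise) neighbor ordering; practical LBP pipelines aggregate these per-pixel codes into regional histograms to form the feature vector.

```python
# Minimal sketch of the basic 3x3 Local Binary Pattern (LBP) operator:
# each pixel is encoded by thresholding its 8 neighbors against its value.
def lbp_code(patch):
    """patch: 3x3 list of grayscale values; returns the 8-bit LBP code
    of the center pixel, walking the neighbors clockwise from top-left."""
    center = patch[1][1]
    neighbors = [patch[0][0], patch[0][1], patch[0][2],
                 patch[1][2], patch[2][2], patch[2][1],
                 patch[2][0], patch[1][0]]
    code = 0
    for bit, n in enumerate(neighbors):
        if n >= center:          # neighbor at least as bright as center -> 1
            code |= 1 << bit
    return code

patch = [[6, 5, 2],
         [7, 6, 1],
         [9, 8, 7]]
print(lbp_code(patch))  # 241
```

Because the code depends only on local intensity ordering, no training is needed, which is exactly why such methods work from a single image but cannot adapt to the diversity of face attributes.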
In this paper, we propose a method to automatically optimize CNN structures for solving multiple binary classification problems, improving both the processing efficiency and the accuracy of face attribute estimation. The basic structure of the CNN used in the proposed method, called Merged Multi-CNN (MM-CNN), consists of a large number of convolution blocks arranged regularly in the depth and width directions and connected at each depth by merging layers. MM-CNN is automatically optimized for face attribute estimation by introducing trainable weight parameters into each merging layer between blocks. We also propose a parameter reduction method called Convolutionalization for Parameter Reduction (CPR), which removes all fully connected layers from MM-CNN. Through a set of experiments on two public datasets, the Large-scale CelebFaces Attributes dataset (CelebA) [9] and the Labeled Faces in the Wild-a dataset (LFW-a) [16], we demonstrate that MM-CNN can estimate face attributes with high accuracy using fewer weight parameters than conventional methods. This paper is a full version of our initial study [17] with a detailed description of the proposed method, a survey of recent works, and a performance comparison. The contributions of this paper can be summarized as follows:
We propose a novel CNN architecture, MM-CNN, specifically designed for multi-task processing; and
We also propose CPR, which significantly reduces the parameters of CNN by removing fully connected layers.
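To convey the merging idea behind MM-CNN, the toy sketch below computes the input to one block as a weighted sum of the outputs of all blocks at the previous depth. The shapes, values, and weights are invented for illustration; real merging layers operate on feature maps rather than scalars, and the weights are learned during training rather than fixed.

```python
# Conceptual sketch of a merging layer: a block at depth d+1 receives a
# weighted sum of the outputs of all blocks at depth d, where the weights
# are trainable parameters (fixed here for illustration only).
def merge(block_outputs, weights):
    """block_outputs: feature values from the parallel blocks at one depth.
    weights: the receiving block's per-input coefficients.
    Returns the merged input for that block."""
    return sum(w * x for w, x in zip(weights, block_outputs))

outputs_at_depth_d = [0.5, -1.0, 2.0]   # three parallel blocks (made up)
weights_for_block_0 = [1.0, 0.0, 0.5]   # a weight of 0 prunes that input
merged = merge(outputs_at_depth_d, weights_for_block_0)
print(merged)  # 1.0*0.5 + 0.0*(-1.0) + 0.5*2.0 = 1.5
```

The point of making the merge weights trainable is that a weight driven toward zero effectively removes that connection, so the degree of sharing between attribute-specific paths is decided by training rather than by manual grouping.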
2. Related Work
The conventional methods for face attribute estimation are summarized in Table 1. These methods can be categorized as Support Vector Machine (SVM)-based, CNN-based, and others, depending on the type of classifier. In the following, we give an overview of the conventional methods for each type of classifier.
The first type of method employs SVMs as classifiers; these are the earliest methods for face attribute estimation [6,8,9,10]. SVM is a machine learning method that determines decision boundaries separating classes in feature space. Kumar et al. [6] proposed one of the best-known face attribute estimation methods using handcrafted local features. This method extracts pixel values from grayscale, RGB, and HSV color spaces, along with edge magnitude and orientation, as features and classifies them into each face attribute using SVMs. After this work, most methods have employed CNN-based feature extractors due to their excellent performance on image recognition. Zhang et al. [8] proposed Pose Aligned Networks for Deep Attribute modeling (PANDA), which consists of feature extraction by CNNs with poselet detection and attribute prediction by a linear SVM for each attribute. Liu et al. [9] proposed two CNN architectures: LNet for face localization and ANet for face attribute prediction, with a linear SVM for each attribute. Zhong et al. [10] extracted features using FaceNet [18] or VGG-16 [19] and predicted attributes using a linear SVM.
The second type of method employs neural networks as classifiers [2,12,13,14,15,21,23,25], where most methods use a single multi-task CNN to perform both feature extraction and classification. Wang et al. [12] proposed a GoogLeNet-like network architecture consisting of three CNNs for face recognition, weather prediction, and location estimation. Face attributes are estimated from concatenated features in the fully connected layers. Hand et al. [2] proposed the Multi-task deep Convolutional Neural Network (MCNN) with an AUXiliary network (MCNN-AUX). They separate the 40 face attributes into six or nine groups based on facial parts and extract features for each attribute group. An auxiliary network, which finally estimates face attributes based on the estimation results of the multi-task CNN, is added. Cao et al. [15] proposed the Partially Shared Multi-task CNN (PS-MCNN). They separate the 40 face attributes into four groups, namely upper, middle, lower, and whole image, based on the position of each attribute in the face. PS-MCNN aggregates the features extracted by the network for each group and estimates their attributes using a classifier consisting of fully connected layers. Gao et al. [13] proposed three small multi-task CNNs: ATNet, ATNet_G, and ATNet_GT. Although these approaches are similar to MCNN, the CNNs are designed according to multiple clusters obtained by classifying face attributes with the k-means algorithm. Han et al. [14] proposed a multi-label classification method using original labels determined by their own rule in light of the correlation among face attributes. They separate the attributes into eight groups, one related to the whole face and seven related to individual facial parts, and design a special classifier architecture with one output per group. Fukui et al. [21] proposed the Attention Branch Network (ABN), a general-purpose CNN with attention to features. ABN consists of two branches: an attention branch for generating a visualization map and a perception branch for classification. They demonstrated that the attention mechanism with a visualization map is effective for estimating face attributes. Bhattarai et al. [23] proposed a new loss function based on continuous labels generated by word2vec [24] from the 40 face attribute labels written in text. Chen et al. [25] proposed a Hard Parameter Sharing-Channel Split network (HPS-CS) consisting of normal and group convolution layers.
The third type of method employs other classifiers [11,27,28]. Huang et al. proposed Large Margin Local Embedding (LMLE)-kNN [11] and Cluster-based LMLE (CLMLE) [27]. They focused on the class imbalance of face attribute labels and proposed a learning method that takes into account the distance between small clusters generated for each class. LMLE-kNN and CLMLE use DeepID2 [26] and a ResNet-like CNN [29], respectively, for feature extraction. Ehrlich et al. [28] proposed Multi-Task Restricted Boltzmann Machines (MT-RBMs) with Principal Component Analysis (PCA).
Our approach is similar to MCNN [2], PS-MCNN [15], and ATNet [13]. Although the relationships among facial attributes are hierarchical and complex, these methods rely on manual or non-hierarchical clustering to form groups of facial attributes in advance. In contrast, our approach automatically optimizes the network parameters by learning the relationships among face attributes during the training of the CNN.
3. Fundamentals of Face Attributes
In this section, we give fundamental observations about the face attributes that we focus on in this paper. We use the 40 face attributes defined in CelebA [9], as shown in Table 2. CelebA is a large-scale face attribute dataset that has been used for the training and performance evaluation of major face attribute estimation methods. In this paper, for convenience, each attribute is assigned an index number from 1 to 40, as shown in Table 2. Most of the attributes in CelebA are defined by biological characteristics, while some are defined by whether the person wears ornaments such as glasses and earrings. These face attributes can be classified into groups based on the following relations: (i) commonality of facial parts, (ii) co-occurrence, and (iii) color, shape, and texture. Figure 2 shows an example illustrating the relationships among face attributes based on relations (i)–(iii). In the following, we discuss the details of each relation.
(i) Commonality of facial parts—For face attribute labels, the most obvious relationship is based on the organs, that is, the facial parts included in the face. For example, Black Hair (9) and Wavy Hair (34) are attributes related to “hair,” Arched Eyebrows (2) and Narrow Eyes (24) are attributes related to “eyes,” and Big Nose (8) and Pointy Nose (28) are attributes related to “nose.” Note that attribute labels such as Male (21), Attractive (3), and Young (40) are assigned to “face” in Figure 2, since they are based on features of the entire face.
(ii) Co-occurrence—Some attributes co-occur, since they can appear simultaneously. Figure 3 shows a color map visualizing the co-occurrence probabilities of the 40 face attributes in CelebA. The co-occurrence probability of two face attributes indicates the ratio of face images assigned both attributes. The face attributes with the highest co-occurrence probabilities are related to gender. Male (21) has a high probability of co-occurring with attributes such as 5 O’Clock Shadow (1), Bald (5), and Goatee (17), while female, i.e., a face image without the Male (21) label, has a high probability of co-occurring with attributes such as Arched Eyebrows (2) and Heavy Makeup (19). Exceptions are the co-occurrence of Smiling (32) with High Cheekbones (20) and Rosy Cheeks (30) for facial expressions, and of Young (40) with Rosy Cheeks (30) for age. The co-occurrence of face attributes shows a positive correlation in most cases, but there are some negative correlations. For example, Gray Hair (18), symbolizing “aging,” shows a high negative correlation with Young (40) and 5 O’Clock Shadow (1). No Beard (25) and Sideburns (31) also show a high negative correlation; we surmise that Sideburns (31) is labeled as part of the beard in CelebA. Note, however, that such correlations between face attributes depend on the dataset. In Figure 3, Blond Hair (10) and No Beard (25) have a high co-occurrence probability, while Black Hair (9) and No Beard (25) have a low one. This indicates that most of the females in CelebA have blond hair rather than black hair; CelebA consists mainly of Western celebrities and very few Asian celebrities. Thus, the correlation of facial attributes strongly depends on ethnicity and gender.
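Under one plausible reading of the co-occurrence probability described above, namely the fraction of images labeled with both attributes, the statistic can be computed as follows. The label vectors here are toy stand-ins for the CelebA annotation matrix (1 = attribute present), not real dataset values.

```python
# Sketch of the co-occurrence statistic visualized in Figure 3:
# the fraction of images labeled with both of two binary attributes.
def cooccurrence(labels_a, labels_b):
    """Fraction of images carrying both attribute labels (toy version)."""
    both = sum(1 for a, b in zip(labels_a, labels_b) if a == 1 and b == 1)
    return both / len(labels_a)

male   = [1, 1, 0, 0, 1, 0]   # invented labels for six images
goatee = [1, 0, 0, 0, 1, 0]
print(cooccurrence(male, goatee))  # 2 of 6 images carry both attributes
```

Computing this for every pair of the 40 attributes yields the 40x40 map shown in Figure 3.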
(iii) Color, shape, or texture—Most face attributes are related to color, shape, or texture, except for abstract attributes such as age and gender. Color-related attributes include Black Hair (9), Blond Hair (10), Brown Hair (12), Gray Hair (18), Bags Under Eyes (4), Pale Skin (27), and Rosy Cheeks (30); shape-related attributes include Straight Hair (33), Wavy Hair (34), Chubby (14), and Oval Face (26); and texture-related attributes include Blurry (11), Eyeglasses (16), and Heavy Makeup (19). The 5 O’Clock Shadow (1) and No Beard (25) attributes are related to both color and shape.
It is important to consider the above relationships among face attributes when estimating them with a multi-task CNN. In a multi-task CNN, sharing feature extractors among strongly related face attributes can improve estimation accuracy while reducing computational cost and memory consumption. The relationships among face attributes are complex, however, and it is difficult to manually design an optimal network architecture that takes them into account. To address this problem, in this paper, we propose a method to automatically optimize a multi-task CNN for face attribute estimation.