Deep Learning for Facial Beauty Prediction

Abstract: Facial beauty prediction (FBP) is a burgeoning issue in attractiveness evaluation, which aims to make assessments consistent with human opinion. Since FBP is a regression problem, data-driven methods have been used to find the relations between facial features and beauty assessment. Recently, deep learning methods have shown a remarkable capacity for feature representation and analysis. Convolutional neural networks (CNNs) have shown strong performance on facial recognition and comprehension, and have proved to be an effective method for facial feature exploration. Lately, well-designed networks with efficient structures have been investigated for better representation performance. However, these designs concentrate on effective blocks but do not build an efficient information transmission pathway, which leads to a sub-optimal capacity for feature representation. Furthermore, these works do not exploit the inherent correlations of feature maps, which also limits performance. In this paper, an elaborate network design for the FBP problem is proposed. A residual-in-residual (RIR) structure is introduced to pass the gradient flow deeper and to build a better pathway for information transmission. By applying the RIR structure, a deeper network can be established for better feature representation. Besides the RIR network design, an attention mechanism is introduced to exploit the inner correlations among features. We investigate a joint spatial-wise and channel-wise attention (SCA) block to distribute importance among features, which finds a better representation of facial information. Experimental results show that our proposed network predicts facial beauty closer to human assessment than state-of-the-art methods.


Introduction
As a burgeoning issue [1], facial beauty prediction (FBP) has attracted more and more attention from researchers and users. It is a comprehensive topic that combines face recognition [2,3] and comprehension [4][5][6]. An example of the FBP problem is shown in Figure 1. FBP has potential applications related to attractiveness, such as makeup recommendation and face beautification.
In the FBP problem, facial features play an important role in the assessment. After extraction, the features are explored and summarized for aggregate analysis. To find a better representation of facial features, various data-driven models for FBP use hand-crafted or adaptively learned descriptors [7][8][9]. With the extracted features, these models perform the assessment with elaborate predictors, which are trained in a statistical manner.
Lately, deep learning has proved to be an efficient tool for signal and image processing [10][11][12]. The revival of deep learning methods, especially convolutional neural networks (CNNs), provides a new perspective for the FBP problem. CNNs deliver much better performance in such tasks. ResNet [26], proposed by He et al., has turned out to be a remarkable design pattern for CNN architectures. When building a deeper network, the gradient vanishing problem limits the performance. In ResNet, a shortcut is designed as a bypath connecting the inputs and outputs for better gradient transmission. The component with a shortcut is termed a residual block, which aims to learn the residual information in the main path. Based on residual learning, it is possible to build a very deep network.
There are many variants of ResNet focusing on efficient network representation. In ResNeXt [27], group convolutions were used to introduce the cardinality of the network, that is, the size of the set of transformations. According to that investigation, increasing the cardinality is more efficient than increasing the depth or width. ResNeXt has a structure similar to InceptionNet [28]. However, the branches in ResNeXt share an identical topology, which reduces the design burden. With the split-transform-merge strategy, ResNeXt achieved competitive performance compared to ResNet with far fewer parameters.
ResNeSt [29] is another effective design derived from ResNet. With channel separation and attention, ResNeSt block designs several identical cardinals in the computation unit. Similar to SENet [30] and SKNet [31], there are channel attentions in each cardinal unit for better feature map representation. Considering the inherent correlation of different features, ResNeSt has become state-of-the-art for image recognition and beyond.
The residual connection and its variants demonstrate superior network representation ability. To build the network deeper and establish a more robust gradient flow, residual-in-residual (RIR) was proposed, which groups residual blocks under a higher-level shortcut. With the high-level residual connection, gradients and information are transmitted from the shallow layers to the deeper ones. The RIR structure has proved its performance on image super-resolution [32], restoration [33], classification and other computer vision tasks, and has turned out to be an efficient design pattern.
Besides effective network designs, some works focus on the inherent correlations of features. Channel attention in SENet is one of the successful mechanisms for finding a better representation of features. In SENet, information from different channels is evaluated by global average pooling and then processed by several fully connected layers. Besides channel attention, spatial attention has been introduced by considering dual attention maps on both channel-wise and spatial-wise features. Non-local attention [34], which is a special pattern for considering global information in features, has been successful for image restoration and comprehension.
This paper proposes a novel CNN design for the FBP problem. In the proposed network, we investigate an RIR block design with an attention mechanism. To build the network deeper, multi-level skip connections are introduced to compose a better gradient transmission flow. An attention mechanism is devised to find the inherent correlation among feature maps. In the attention mechanism, both channel-wise and spatial-wise attentions are considered for a better correlation representation. Experimental results show that our network performs better than other CNN-based methods and is more consistent with human assessment.
The contributions of this paper can be demonstrated as follows:
• We propose a network for the facial beauty prediction (FBP) problem. Specifically, residual-in-residual (RIR) groups are designed for building a deeper network. To devise a better gradient transmission flow, multi-level skip connections are introduced.
• To find the inherent correlations among features, a joint spatial-wise and channel-wise attention mechanism is introduced for better feature comprehension.
• Experimental results demonstrate our network can achieve a better performance than other CNN-based methods and make the assessment more consistent with human opinion.

Facial Beauty Prediction
Facial beauty is an essential influencing factor in daily life. Facial beauty prediction (FBP) has attracted more and more attention from researchers. It is a composite study of psychology, computer science, evolutionary biology, and so forth. Recently, data-driven methods have been proposed for the FBP issue, which adaptively extract facial features and perform the analysis with defined mathematical models. The models are supposed to be as consistent with human assessment as possible. Besides the many prediction methods, databases proposed for the FBP problem have also become a focus in this area. SCUT-FBP5500 [9], to the best of our knowledge, is one of the most popular open benchmarks used for evaluation. SCUT-FBP5500 contains diverse face pictures of both males and females. The ages of the subjects vary from 15 to 60, a wide range that gives a richer representation of real facial beauty. There are 2000 Asian females, 2000 Asian males, 750 Caucasian males and 750 Caucasian females in SCUT-FBP5500, resized to 350 × 350 resolution. Each image is scored by 60 volunteers on a scale from 1 to 5, from low to high attractiveness.

Convolutional Neural Networks
Convolutional neural networks (CNNs) are one of the most remarkable successes in deep learning, demonstrating superior performance on a large number of computer vision tasks, such as detection [35][36][37], denoising [38], recognition [39,40], and feature representation [41][42][43]. To the best of our knowledge, AlexNet, the champion of the ImageNet competition in 2012, is one of the successful CNN designs developed on GPUs. After AlexNet, numerous efficient designs have been proposed for better performance, composed of wider or deeper networks with elaborate layer connections. VGGNet, which is widely applied in various GAN-based works, is one of the representative design patterns for deep networks. InceptionNet, proposed by Google, is another effective design, which won first place in ILSVRC 2014. In InceptionNet, the authors designed an inception module to improve parameter utilization. The inception module applies 1 × 1 convolutional layers to organize the information across different channels. Features from convolutional layers with different filter sizes and from max-pooling operations are aggregated by concatenation. By using the inception module, the network improved performance and avoided over-fitting with more branches. Furthermore, batch normalization (BN) and dropout strategies were applied in InceptionNet to speed up training.
The residual connection from ResNet is another well-known structure for network design. In VGGNet and InceptionNet, network performance is limited as the depth increases, which is caused by vanishing gradients. To solve this issue, residual representation was introduced in ResNet. The blocks are designed based on residual learning, which makes it easier to learn the mapping between inputs and outputs.
Dense connection [44], another efficient design pattern for CNNs, has become a popular choice for various tasks. Different from ResNet, DenseNet applies dense connections to pass features from shallow layers to deeper ones, which is more effective than the residual connection. Furthermore, features are reused in DenseNet via channel concatenation, which saves parameters and reduces computation cost.

Method
As shown in Figure 2, the proposed network has a pyramid structure that progressively extracts features from images. We devise a residual-in-residual group, termed RIRG, to build the network deeper. There are five stages in the network. In each stage, the feature maps are down-scaled by a max-pooling operation to form the output. Finally, global average pooling is used to shrink the resolution to 1 × 1. Let us denote the input image as $I_0$; then for the $i$-th stage with $N$ RIRGs, the operations can be described as
$$I_i = \mathrm{MaxPool}\left(\mathrm{RIRG}^i_N\left(\cdots \mathrm{RIRG}^i_1\left(I_{i-1}\right) \cdots\right)\right),$$
where $\mathrm{RIRG}^i_n(\cdot)$ denotes the $n$-th RIRG in the $i$-th stage, and $\mathrm{MaxPool}(\cdot)$ denotes the max pooling operation.
The proposed network is introduced in the following order. First, we introduce the structure of the proposed RIRG. Then, the joint spatial-wise and channel-wise attention mechanism applied in RIRG, termed SCA, is described in detail. Finally, the settings of the proposed network are described and discussed. There is a progressive design for facial feature exploration. As the channel number increases, the resolution of the feature maps is decreased by a fixed ratio.

Residual-In-Residual Group
RIRGs are devised from the perspective that a deeper network leads to better performance. Since residual connections alleviate the gradient vanishing problem, residual-in-residual connections are introduced to pass shallow features and gradients to deeper layers through long shortcuts. RIRG uses multiple levels of shortcuts to make this flow more effective. The design of RIRG is shown in Figure 3. The basic block in RIRG, termed RIRB and shown in Figure 3a, is composed of two convolutional layers with a ReLU activation and an SCA block. The residual connection in RIRB can be regarded as the first-level skip path, which connects features at a distance of two convolutional layers. By stacking RIRBs, the residual-in-residual module (RIRM) is designed with a padding structure. The second-level skip path is introduced in RIRM, connecting its input and output features. Beyond the second-level skip connection, the residual-in-residual group (RIRG) is devised recursively in the same way as RIRM: it stacks RIRMs as the main path with the same padding structure. The third-level skip connection in RIRG spans a large number of convolutional layers, which passes shallow features to deeper layers more efficiently.
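The three-level nesting described above can be illustrated with a minimal sketch. It is not the authors' implementation: features are plain lists of floats, `toy_body` is a hypothetical stand-in for the conv, ReLU, conv, SCA path of an RIRB, and the same block is reused at every position, whereas a real network would give each block its own weights. The counts of five RIRBs per RIRM and five RIRMs per RIRG follow the Network Design section.

```python
from typing import Callable, List

Feature = List[float]
Layer = Callable[[Feature], Feature]


def residual(body: Layer) -> Layer:
    """Wrap `body` with a skip path: out = x + body(x)."""
    def forward(x: Feature) -> Feature:
        return [a + b for a, b in zip(x, body(x))]
    return forward


def stack(layer: Layer, repeats: int) -> Layer:
    """Apply the same layer `repeats` times in sequence (weights shared for brevity)."""
    def forward(x: Feature) -> Feature:
        for _ in range(repeats):
            x = layer(x)
        return x
    return forward


def build_rirg(block_body: Layer, blocks_per_module: int = 5,
               modules_per_group: int = 5) -> Layer:
    rirb = residual(block_body)                      # level 1: skip over one block body
    rirm = residual(stack(rirb, blocks_per_module))  # level 2: skip over stacked RIRBs
    rirg = residual(stack(rirm, modules_per_group))  # level 3: skip over stacked RIRMs
    return rirg


calls = 0


def toy_body(x: Feature) -> Feature:
    """Hypothetical placeholder body; returns zeros so that only the skip paths act."""
    global calls
    calls += 1
    return [0.0 for _ in x]


rirg = build_rirg(toy_body)
out = rirg([1.0, 2.0])
print(calls, out)  # 25 [33.0, 66.0]
```

With a zero body, every RIRB is the identity, each RIRM doubles its input, and the RIRG returns 1 + 2^5 = 33 times its input, which shows that the input reaches the output through the skip paths alone. The counter confirms that one RIRG evaluates 25 block bodies.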
The RIRG has a structure similar to dense connection. If we expand the three-level residual connections and regard the stacked layers as a single operation, then the features from different layers are densely connected via residual learning. On one hand, the dense-like design delivers features and information more efficiently. On the other hand, the mixture of dense and residual connections reuses the features with limited parameters and computation complexity, which improves the network performance.
Although the RIR design builds an efficient information transmission pathway, it requires a large number of parameters and high computation complexity as the network depth increases. From this point of view, we propose a modified convolution operation to substitute for the vanilla layer. Each convolution step consists of two 1 × 1 convolution operations for channel squeeze and excitation, and one depth-wise convolution for spatial exploitation. The squeezed channel number is set to 32. With this substitution, the computation complexity and the number of parameters are substantially reduced.
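The saving can be checked with a small parameter count. The sketch below is illustrative only and rests on assumptions the text does not state: a 3 × 3 kernel for both the vanilla and the depth-wise convolution, the order squeeze, depth-wise, excitation, equal input and output channel numbers, and no bias terms. The function names are hypothetical.

```python
def vanilla_conv_params(channels: int, kernel: int = 3) -> int:
    """Weights of a standard k x k convolution with `channels` inputs and outputs."""
    return kernel * kernel * channels * channels


def modified_conv_params(channels: int, squeezed: int = 32, kernel: int = 3) -> int:
    """Weights of 1x1 squeeze -> depth-wise k x k -> 1x1 excitation (assumed order)."""
    squeeze = channels * squeezed            # 1 x 1 conv: channels -> squeezed
    depthwise = kernel * kernel * squeezed   # one k x k filter per squeezed channel
    excite = squeezed * channels             # 1 x 1 conv: squeezed -> channels
    return squeeze + depthwise + excite


for c in (64, 1024):
    print(c, vanilla_conv_params(c), modified_conv_params(c))
# 64 36864 4384
# 1024 9437184 65824
```

Under these assumptions the vanilla layer grows quadratically with the channel number while the substituted layer grows linearly, so the gap widens in the later stages where the channel number is large.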

Spatial-Wise and Channel-Wise Attention
The proposed spatial- and channel-wise attention mechanism (SCA) is shown in Figure 4. As shown in the illustration, there are two parallel paths that find the spatial-wise and channel-wise attentions separately. First, a convolutional layer explores the correlation among features in general. After this exploration, two parallel bypaths exploit the different attentions independently. In the channel-wise attention bypath, the information from different channels is evaluated by global average pooling. After pooling, two fully connected layers with a ReLU activation are introduced to capture the inherent correlation. Finally, a Sigmoid activation is applied to ensure non-negativity.

Figure 4. Structure of proposed spatial- and channel-wise attention mechanism (SCA). There are two paths for finding the spatial and channel attention jointly. After distribution, the addition of two features will be regarded as the output.
Similar to the channel-wise bypath, the spatial-wise bypath uses two convolutional layers with a ReLU activation to capture the correlation of spatial-wise features, and a Sigmoid activation is applied at the end. Different from the channel-wise bypath, there is no global pooling for information evaluation.
After extraction, the two attentions are multiplied with the input features, and the addition is regarded as the final result. The operation of SCA can be described as
$$x_{SCA} = \mathrm{Conv}(x_{in}),$$
$$x^{C}_{SCA} = \sigma(\mathrm{FC}(\mathrm{ReLU}(\mathrm{FC}(\mathrm{AvgPool}(x_{SCA}))))),$$
$$x^{S}_{SCA} = \sigma(\mathrm{DConv}(\mathrm{ReLU}(\mathrm{DConv}(x_{SCA})))),$$
$$x_{out} = x^{C}_{SCA} \otimes x_{in} + x^{S}_{SCA} \odot x_{in},$$
where $\sigma(\cdot)$ denotes the Sigmoid activation, and $x_{in}$, $x_{out}$ denote the input and output features respectively. $\otimes$ denotes the channel-wise multiplication, which allocates different weights to channels. $\odot$ denotes the point-wise multiplication. $\mathrm{DConv}(\cdot)$ denotes the depth-wise convolution. In the operation, $x^{C}_{SCA}$ is a tensor with size 1 × 1 × c and $x^{S}_{SCA}$ is a tensor with size h × w × c, where h, w, and c denote the height, width and channel number of $x_{SCA}$.
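How the two attention maps are formed and combined can be sketched in plain Python on nested lists. This is a simplified illustration, not the authors' code: the leading convolution is omitted, so `x` serves as both the input and the explored features; the depth-wise convolutions are replaced by hypothetical per-channel scales `s1` and `s2` (equivalent to a 1 × 1 depth-wise kernel); the FC layers have no bias; and all weight names are assumptions.

```python
import math
from typing import List

Tensor3 = List[List[List[float]]]  # indexed as [row][col][channel]


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def fc(v: List[float], weight: List[List[float]]) -> List[float]:
    """Fully connected layer without bias; `weight` is laid out as [out][in]."""
    return [sum(w * a for w, a in zip(row, v)) for row in weight]


def channel_attention(x: Tensor3, w1: List[List[float]],
                      w2: List[List[float]]) -> List[float]:
    """AvgPool -> FC -> ReLU -> FC -> Sigmoid, giving one weight per channel."""
    h, w, c = len(x), len(x[0]), len(x[0][0])
    pooled = [sum(x[i][j][k] for i in range(h) for j in range(w)) / (h * w)
              for k in range(c)]
    hidden = [max(0.0, a) for a in fc(pooled, w1)]
    return [sigmoid(a) for a in fc(hidden, w2)]


def spatial_attention(x: Tensor3, s1: List[float], s2: List[float]) -> Tensor3:
    """Per-channel scale -> ReLU -> per-channel scale -> Sigmoid, one weight per element."""
    return [[[sigmoid(s2[k] * max(0.0, s1[k] * v)) for k, v in enumerate(px)]
             for px in row] for row in x]


def sca(x: Tensor3, w1, w2, s1, s2) -> Tensor3:
    ca = channel_attention(x, w1, w2)   # size c, broadcast over all positions
    sa = spatial_attention(x, s1, s2)   # size h x w x c, applied point-wise
    return [[[ca[k] * v + sa[i][j][k] * v for k, v in enumerate(px)]
             for j, px in enumerate(row)] for i, row in enumerate(x)]


x = [[[1.0, -2.0], [3.0, 4.0]],
     [[0.5, 0.0], [-1.0, 2.0]]]          # h = 2, w = 2, c = 2
zero_w1 = [[0.0, 0.0]]                   # c -> 1 hidden unit
zero_w2 = [[0.0], [0.0]]                 # 1 hidden unit -> c
y = sca(x, zero_w1, zero_w2, [0.0, 0.0], [0.0, 0.0])
print(y == x)  # True
```

With all weights at zero, both Sigmoid outputs equal 0.5, so the two weighted copies sum back to the input. The check makes the broadcasting explicit: the channel attention supplies one weight per channel, while the spatial attention supplies one weight per element.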

Network Design
The entire network is designed as follows. First, the input image has size 256 × 256 × 3. Then one convolutional layer expands the channel number to 64 while maintaining the resolution. There are N = 2 RIRGs after the convolutional layer for feature exploration, and a max-pooling operation is applied to decrease the feature size. Each RIRG contains five RIRMs, and each RIRM contains five RIRBs. There are K = 2 stages in the network. In each stage, the channel number is expanded by one convolutional layer at the beginning, and the size of the features is halved by max-pooling at the end. After the stages, a global average pooling step resizes the tensors to 1 × 1 × 1024. Two fully connected (FC) layers with a ReLU activation are introduced to perform the prediction, and the output size of the final FC layer is one, representing the predicted facial beauty score.

Experiment
The network is trained on the SCUT-FBP5500 dataset. To the best of our knowledge, it is the largest dataset for the FBP problem to date. We train the network for 1000 iterations with batch size b = 25. The parameters are updated by the Adam optimizer with learning rate lr = 1e-4, which is halved every 200 iterations. We choose the L1 loss as the loss function. Since the input size of the SCUT-FBP5500 dataset is 224 × 224, we rescale the images to 256 × 256 by bicubic interpolation for training and testing.
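The step schedule and the loss stated above can be written out as a short sketch. The function names are hypothetical, and averaging the absolute errors over the batch is an assumption, since the text only names the L1 loss.

```python
from typing import Sequence


def learning_rate(iteration: int, base_lr: float = 1e-4, step: int = 200) -> float:
    """Learning rate at a 0-based iteration: halved once every `step` iterations."""
    return base_lr * 0.5 ** (iteration // step)


def l1_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute error between predicted and labelled scores (mean reduction assumed)."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


for it in (0, 199, 200, 999):
    print(it, learning_rate(it))

print(l1_loss([3.0, 2.5], [3.5, 2.0]))  # 0.5
```

Over 1000 iterations the rate is halved four times, so the last 200 iterations run at one sixteenth of the initial rate.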

Results
We conduct the comparison with diverse methods, including geometric-feature-based and deep-learning-based methods: Linear Regression, Gaussian Regression, SVR, AlexNet, ResNet-18, and ResNeXt-50. The evaluation metrics are the Pearson correlation (PC), mean absolute error (MAE), and root mean square error (RMSE). The dataset is split with a ratio of 0.6 for training and 0.4 for testing; that is, 60% of the instances are randomly chosen for training and the other 40% are used for testing. The results are shown in Table 1.
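For reference, the three metrics and the random 60/40 split can be computed as in the sketch below. The function names and the fixed seed are illustrative assumptions and are not taken from the paper's evaluation code.

```python
import math
import random
from typing import List, Sequence, Tuple


def pearson(pred: Sequence[float], truth: Sequence[float]) -> float:
    """Pearson correlation between predictions and labels (higher is better)."""
    n = len(pred)
    mp, mt = sum(pred) / n, sum(truth) / n
    cov = sum((p - mp) * (t - mt) for p, t in zip(pred, truth))
    var_p = sum((p - mp) ** 2 for p in pred)
    var_t = sum((t - mt) ** 2 for t in truth)
    return cov / math.sqrt(var_p * var_t)


def mae(pred: Sequence[float], truth: Sequence[float]) -> float:
    """Mean absolute error (lower is better)."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(pred)


def rmse(pred: Sequence[float], truth: Sequence[float]) -> float:
    """Root mean square error (lower is better)."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(pred))


def split_60_40(items: Sequence[int], seed: int = 0) -> Tuple[List[int], List[int]]:
    """Shuffle with a fixed seed, then take the first 60% for training."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 6 // 10
    return shuffled[:cut], shuffled[cut:]


truth = [1.0, 2.0, 3.0]
pred = [2.0, 3.0, 4.0]           # every prediction is off by exactly one point
print(pearson(pred, truth), mae(pred, truth), rmse(pred, truth))

train, test = split_60_40(range(5500))
print(len(train), len(test))     # 3300 2200
```

The toy example shows why PC is reported alongside MAE and RMSE: a constant offset leaves the correlation at its maximum while both error metrics equal one.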
PC denotes the Pearson correlation, for which higher is better. From the table, our network performs better than the other works. With a higher PC, our proposed network is more consistent with human opinion, which shows the effectiveness of the proposed structure design. To further verify the network capacity, we perform the comparison via 5-fold cross validation, which uses an 80%/20% split for each fold. The results for each fold and their average are shown in Table 2. From the table, our performance is better than that of state-of-the-art methods.
Furthermore, we analyze the distribution of the scores predicted by our network, which is shown in Figure 5. The yellow points are the frequencies of the predictions, while the blue points denote the ground truth. From the visualization, the scores of males and females are both in accordance with the normal distribution. There is a shift in the mean value between male and female predictions, which we attribute to a gender-related bias. To test the hypothesis that our predictions follow the normal distribution, we use the Anderson-Darling (A-D) test. It is a modification of the Kolmogorov-Smirnov (K-S) test that gives more weight to the tails than the K-S test does. We set the hypotheses $H_0$: the samples follow the normal distribution, and $H_1$: the samples do not follow the normal distribution. After the A-D test, we obtain α = 0.043 and the critical value $C_{A-D}$ = 0.75. Since α ≤ $C_{A-D}$, we cannot reject the hypothesis $H_0$. The test therefore provides no evidence against normality, and we regard the predicted values as following the normal distribution.
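The A-D decision rule can be illustrated with a standard-library sketch of the statistic for a normality test with estimated mean and standard deviation. It is a generic textbook form, not the authors' procedure: small-sample corrections are omitted, the synthetic samples are deterministic quantile grids chosen for the illustration, and the threshold 0.75 is simply the critical value quoted in the text.

```python
import math
from statistics import NormalDist, mean, stdev
from typing import List, Sequence


def anderson_darling_normal(samples: Sequence[float]) -> float:
    """A-D statistic A^2 against a normal law fitted by sample mean and std."""
    n = len(samples)
    mu, sigma = mean(samples), stdev(samples)
    std_normal = NormalDist()
    cdf = [std_normal.cdf((v - mu) / sigma) for v in sorted(samples)]
    total = sum((2 * i + 1) * (math.log(cdf[i]) + math.log(1.0 - cdf[n - 1 - i]))
                for i in range(n))
    return -n - total / n


n = 100
# Deterministic stand-ins for data: quantiles of a normal and of an exponential law.
normal_like: List[float] = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
skewed: List[float] = [-math.log(1.0 - (i + 0.5) / n) for i in range(n)]

CRITICAL = 0.75  # critical value quoted in the text
a_norm = anderson_darling_normal(normal_like)
a_skew = anderson_darling_normal(skewed)
print(a_norm <= CRITICAL)  # True: H0 (normality) is not rejected
print(a_skew <= CRITICAL)  # False: H0 is rejected for the skewed sample
```

The sum weights the smallest and largest ordered samples through the logarithms of the fitted CDF, which is why the A-D test is more sensitive to the tails than the K-S test.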

Ablation Study
Investigation on Network Design. To show the effect of the network design, we conduct experiments with different settings of block numbers. Specifically, $K_m$ and $K_g$ denote the number of blocks in each RIRM and the number of modules in each RIRG, respectively. The results are shown in Table 3. From the table, the depth of the network is one of the most important factors for prediction performance. When $K_m$ and $K_g$ are lower, the PC, MAE and RMSE are worse than those of the deeper network. This accords with the intuition that a wider and deeper network has better representation ability and processes features more effectively. $K_m$ and $K_g$ have a similar influence on the performance: adjusting either one alone has a similar effect on the results.
The results for the attention mechanism are shown in Table 4. From the table, the attention mechanism leads to a modest performance improvement due to the restricted parameters. SCA considers the channel attention and spatial attention jointly, which finds the correlations from two perspectives.
In this paper, our network achieves a better performance than ResNeXt-50. Our network has 6.75 M parameters and requires 34.25 GFlops, while ResNeXt-50 has 25.03 M parameters and requires 5.56 GFlops. From the comparison, our network is lighter than ResNeXt-50 in terms of parameters, although it requires more computation. Although our network is much deeper than other works, the well-designed convolution operation keeps the number of parameters small.
Effectiveness of SCA. In this paper, we propose an attention mechanism termed SCA. Since there is only one convolution layer and a few depth-wise convolutions in SCA, it is a simple but effective design for finding the inherent correlation of feature maps. SCA has few parameters and low computation complexity, so it is an efficient component that gives a performance boost with only a small increase in complexity.
Comparison on Effective Network Architecture Designs. Recently, different effective network architectures have been designed for feature exploitation, such as ResNet and its extensions, DenseNet, the MobileNet series and SqueezeNet. These works concentrate on different block designs for effective feature exploitation. However, they focus on building a deeper or wider structure for better performance and fail to build a more efficient information transmission pathway. The main difference among these networks lies in their inner blocks, and almost all of them are modified from ResNet-50 or ResNet-101, which provide a fixed information transmission pathway for fair performance comparison. To address this issue, we introduce the RIR structure for better information transmission. Furthermore, these lightweight works focus on different blocks but do not consider the correlations of features. In this paper, an attention mechanism, termed SCA, is introduced to find the inherent correlations for better feature representation.
Threats to Validity. In this paper, we propose a novel network for the FBP problem. However, the performance improvement is limited by the number of parameters. A deeper or wider network leads to better performance, but it also incurs a higher computation complexity. Regarding threats to internal validity, the most important factor is the network depth. The ablation study shows that performance improves as the network depth increases.
There are two threats to external validity. On one hand, the training labels were assessed by some students in a specific society, which may not reflect a common opinion. On the other hand, the training images are selected from Asian and Caucasian people, which may lead to a bias in diversity.

Conclusions
In this paper, we proposed a novel network for the facial beauty prediction problem. Traditional networks focus on effective block designs with a deeper or wider network for better performance, and largely neglect the efficiency of the information transmission pathway and the correlations of features. To address these issues, we proposed a three-level residual-in-residual structure, termed RIRG, for better information transmission. Since RIRG was designed recursively with multi-level residual connections, it provides a more efficient way to transmit information and gradients. Furthermore, a joint spatial and channel attention mechanism, SCA, was introduced to find the inherent correlations of features. It is a small component that improves performance with few parameters. The experimental results showed that our proposed network achieved a better performance than other works with restricted parameters. In future work, we will seek datasets with higher diversity and compare our network with more recent works. We also plan to build a novel dataset for the FBP problem that covers more cultures, mentalities, traditions and economic statuses.