Boosting Face Recognition under Drastic Views Using a Pose AutoAugment Manner

Abstract: Face recognition under drastic poses drops rapidly due to the limited samples available during model training. In this paper, we propose a pose-autoaugment face recognition framework (PAFR) based on training a Convolutional Neural Network (CNN) with multi-view face augmentation. The proposed framework consists of three parts: face augmentation, CNN training, and face matching. The face augmentation part is composed of pose autoaugment and background appending, which increase the pose variations of each subject. In the second part, we train a CNN model with the generated facial images to enhance pose-invariant feature extraction. In the third part, we concatenate the feature vectors of each face and its horizontally flipped face from the trained CNN model to obtain a robust feature. The correlation score between two faces is computed by the cosine similarity of their robust features. Comparative experiments are conducted on the Bosphorus and CASIA-3D databases.


Introduction
Face recognition is one of the most studied topics in computer vision. It has been widely applied in security, pedestrian identification, and other fields. Traditional face recognition methods are mainly based on handcrafted features, such as High-Dimensional Local Binary Patterns (HD-LBP) [1], Fisher Vector (FV) descriptors [2], and Multi-Directional Multi-Level Dual-Cross Patterns (MDML-DCPs) [3]. However, handcrafted features are not robust. As Convolutional Neural Networks (CNNs) provide a better solution to this problem, many face recognition approaches based on CNNs have emerged. Incorporating deeper networks and large training sets, CNN-based approaches [4,5] have surpassed human performance on the Labeled Faces in the Wild (LFW) [6] dataset. Despite the dominance of existing CNN-based face recognition approaches in feature extraction, a challenge remains: the recognition accuracy on profile faces (where one eye is self-occluded) drops rapidly [7,8]. The main reason is that the training sets, such as CASIA-WebFace, which is crawled from the Internet, are not evenly distributed over head poses [9,10]. The insufficient intra-class variations (differences of poses within a subject) make the recognition model less robust to profile faces [9].
In response to the challenge of profile face recognition, two novel types of face recognition methods are proposed, i.e., facial image normalization [11,12] and face augmentation [9,13,14], both of which normally require 3D face models to ease the difficulty. Facial image normalization transforms multiple views into a uniform one. In [11], the 3D-2D system (UR2D) maps the facial image to UV space using 3D facial data, which achieves better performance than using a reconstructed 3D shape. The landmark detection and pose estimation are essential for UR2D, but may be challenging and time-consuming when transforming drastic face poses to the UV space. By contrast, face augmentation transforms one view into multiple views. Recent studies [9,13,15] map 2D faces to generic 3D shapes or reconstructed 3D shapes, but render only limited views for each identity. The sparse sampling of views may reduce recognition accuracy [16].
In order to tackle the issue of sparse sampling and enhance the recognition accuracy under drastic poses, we propose an enhanced face recognition framework based on CNN training with a random-view face augmentation method. The proposed framework consists of three parts: face augmentation, CNN training, and face matching. In the first part, we aim at increasing the intra-class variation of views. With the aid of a 3D graphic engine, numerous views, especially drastic pose views, are rendered from face scans. Pose autoaugment is proposed to find the best distribution of facial views. As these views have no backgrounds, the facial contour may change drastically. To keep the subsequent CNN training from learning the facial contour itself, we randomly append a background behind each view. In the second part, we train a CNN model with the views generated in the first part to enhance pose-invariant feature extraction. The CNN architecture is adapted from the classical research [17] on face recognition. In the third part, we concatenate the feature vectors of each face from the trained CNN model. The correlation score between two faces is computed by the cosine similarity of their feature vectors. By computing the scores of a facial image under a drastic pose against all images of the different subjects, we adopt the highest score to identify an unknown subject in the probe among the registered subjects in the gallery. Experiments on the Bosphorus database and CASIA-3D database demonstrate the state-of-the-art performance of the proposed framework. Furthermore, the importance of the components in the first part, such as background appending and face cropping, is also evaluated.
Our contributions are as follows.
(i) An enhanced face recognition framework is proposed to learn pose-invariant representations of faces under drastic poses. The proposed framework trains a pose-invariant CNN model and extracts identifiable features from views under drastic poses for face recognition.
(ii) A novel face augmentation method, composed of pose autoaugment and background appending, is proposed to increase the pose variations of each subject.
(iii) Experiments on the Bosphorus and CASIA-3D FaceV1 databases demonstrate state-of-the-art performance for face recognition under drastic poses.
The rest of this paper is organized as follows. Section 2 briefly reviews related works on pose-invariant face recognition and rendering methods. Section 3 illustrates the face augmentation method and the proposed face recognition framework. Experimental results and discussion are demonstrated in Section 4. This paper is concluded in Section 5.

Related Work
With the rapid development of CNNs, face recognition has made great progress in the past ten years. Although the maturity of face recognition has led to increasing success from research to commercial application, there are still some challenges, such as occlusion [18], age [19], pose [20], and attack [21]. In pose-invariant face recognition, there are two streams: one is image normalization, and the other is face augmentation. Both streams mainly focus on two issues: 3D shape fitting and face generation. Since the 1970s, 3D models have been used for image generation [22][23][24][25]. Many studies apply image generation to object detection, object retrieval [26,27], viewpoint estimation [28], etc. Recently, image generation methods have been employed for face recognition [24], face alignment [29], and 3D face reconstruction [30].
The image normalization methods reconstruct a 3D face from the facial image and render matching views from it for comparison. Georghiades et al. [31] employed a reconstructed surface to render nine views, then matched the test image against these views in a linear subspace. However, the poses of the rendered faces are no more than 24°. Wang et al. [32] reconstructed a 3D face by fitting a 2D facial image and generated multi-view virtual faces. A Gabor feature was extracted from both virtual faces and test faces to identify the same person. Prabhu et al. [33] and Moeini et al. [34] also generated multi-view virtual faces from a reconstructed 3D face, further estimating the viewpoint of the test face so as to compare it with virtual faces under a similar view. Dou et al. [35] reconstructed accurate 3D shapes to transform pose-variant images into a uniform facial space.
The face augmentation methods are proposed to fit facial images to 3D shapes and generate multi-view facial images to learn pose-invariant features for comparison. Rather than concentrating on accurate 3D reconstruction, Hassner et al. rendered the frontal face with a generic 3D face shape [36]. Then 10 generic shapes were employed to render new facial views [9,13]. Another idea of 3D shape fitting is fitting a facial image to its real 3D shape, which was proposed by Kakadiaris et al. [11]. In terms of face generation, Hassner et al. [36] only generated a frontal face for each identity. Vasilescu et al. [24,37] rendered 15 face images from −35° to +35° in 5° steps, including six viewpoints for training and nine viewpoints for testing. Hassner et al. also generated five views (0°, ±45°, ±75°) to train a pose-invariant CNN model [9,13]. Crispell et al. [15] further developed the idea of face generation [9], and generated five views randomly for each identity.
Both image normalization and face augmentation methods generate limited views for face recognition. However, the sparse sampling of views may reduce the recognition accuracy [16]. Dou et al. [30] generated numerous views, but with the goal of designing an end-to-end 3D reconstruction system. We adapt the idea of image generation from [30] to our research. Different from [30], we generate numerous views with a searched distribution to train a CNN model, aiming to improve face recognition performance under drastic poses. Furthermore, our face augmentation method is compatible with any type of 3D face; we use face scans in this paper due to their accurate 3D shapes and the simplicity of mapping texture to the 3D shape via camera calibration.
To the best of our knowledge, we are the first to generate arbitrary views that are accurate and realistic enough for both face training and recognition.

Proposed Methods
We propose a novel framework (PAFR) for face recognition under drastic pose. As shown in Figure 1, the proposed framework consists of three parts: face augmentation, CNN training using generated faces, and face matching. In the following section, we describe these three parts in detail.

Face Augmentation
To increase intra-class pose variations for each subject, we employ a 3D graphic engine to generate multiple views from a 3D face. Both types of 3D faces (reconstructed 3D faces and face scans) are compatible with our face augmentation method. In this paper, we use face scans due to their accurate 3D shapes and the simplicity of mapping texture to a 3D shape via camera calibration. The face augmentation pipeline for face scans consists of four components: preprocessing, pose autoaugment, face cropping, and random background appending.
Preprocessing is necessary to obtain a smoothed face scan, since a raw face scan contains substantial noise, irregular edges, and hollows. Irregular edges and hollows mainly appear around the eyes, hair, and ears, due to their reflective properties and self-occlusion. Surface noise is largely determined by the precision of the scanner. To remove noise, irregular edges, and hollows, preprocessing involves cropping the facial region, filling hollows, and denoising. First, the position of the nose tip is located, and the points whose distance from the nose tip exceeds a threshold are removed. Then, we use bilinear interpolation to fill the missing points. Lastly, we employ Laplacian smoothing to obtain a smoothed face scan.
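The three preprocessing steps can be sketched on a face range image as follows. This is an illustrative stand-in, not the paper's implementation: `preprocess_scan`, the nearest-neighbour hole fill (substituting for the bilinear interpolation described above), and all parameter values are assumptions.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt, uniform_filter

def preprocess_scan(depth, nose_tip, radius=90.0, iterations=5, lam=0.5):
    """Crop, hole-fill, and smooth a face range image (illustrative sketch).

    depth    : 2D array of depth values; np.nan marks missing points (hollows)
    nose_tip : (row, col) of the located nose tip
    radius   : keep only pixels within this distance (in pixels) of the nose tip
    """
    h, w = depth.shape
    rr, cc = np.mgrid[0:h, 0:w]

    # 1. Crop the facial region: remove points too far from the nose tip.
    far = np.hypot(rr - nose_tip[0], cc - nose_tip[1]) > radius
    depth = depth.copy()
    depth[far] = np.nan

    # 2. Fill hollows: copy the value of the nearest valid pixel
    #    (a simple stand-in for bilinear interpolation).
    mask = np.isnan(depth)
    if mask.any():
        idx = distance_transform_edt(mask, return_distances=False,
                                     return_indices=True)
        depth = depth[tuple(idx)]
        depth[far] = np.nan  # keep the cropped region empty

    # 3. Laplacian smoothing: iteratively move each point toward the
    #    (normalized) mean of its 3x3 neighbourhood.
    valid = ~np.isnan(depth)
    z = np.where(valid, depth, 0.0)
    wgt = valid.astype(float)
    for _ in range(iterations):
        num = uniform_filter(z, size=3)
        den = uniform_filter(wgt, size=3)
        mean = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
        z = np.where(valid, z + lam * (mean - z), 0.0)
    return np.where(valid, z, np.nan)
```

A flat surface with a hollow stays flat after filling and smoothing, while points outside the crop radius remain removed.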
In terms of pose autoaugment, we simulate a camera in a 3D context to project a face scan to 2D faces with a 3D graphic engine. Millions of views of a single 3D face can be generated by exploring the extrinsic parameters of the camera: azimuth R_a, elevation R_e, and in-plane R_i rotation. Although a pose-invariant face recognition model performs better with more views, using too many views induces a heavy load in the subsequent tasks, such as cropping, background appending, and training. To keep a balance between the variety of poses and the resource consumption, we generate a repository of sufficient poses with a searched distribution, and randomly sample poses from the repository for each subject.
To be specific, a number N and distribution parameters W_P are set to generate enough extrinsic parameters {R_a, R_e, R_i}; a group of parameters for each subject is then randomly sampled from them. Given N and W_P = {w_1, w_2, ..., w_i, ..., w_m}, a vector W_X = {W_{x_1}, W_{x_2}, ..., W_{x_N}} is first generated obeying the uniform distribution on (0, W_total), where W_total = Σ_{i=1}^{m} w_i. An integer vector X = {x_1, x_2, ..., x_j, ..., x_N} is then obtained by locating each W_{x_j} within the cumulative sums of W_P, so that bin i is chosen with probability w_i / W_total. The rotation parameters R for a subject are computed as R = sX + B, where s is a scale value and B = {b_1, b_2, ..., b_N} obeys the uniform distribution on (−θ, θ); θ and s are set manually. By changing W_P and N, we can obtain any desired distribution and number of rotation parameters. To improve the recognition performance on profile faces automatically, we tune the distribution W_P and N using Bayesian Optimization [38] to find the proper proportion of profile and near-frontal faces.
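The weighted sampling above can be sketched for one rotation axis as follows. This is a minimal illustration of the scheme, not the paper's code; the bin width `s = 15°`, jitter `theta = 7.5°`, and the centering of bins around the frontal view are assumed values.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_rotations(n, weights, s=15.0, theta=7.5):
    """Sample n azimuth angles (degrees) from a searched bin distribution.

    weights : W_P, one unnormalised weight per pose bin; bin i is chosen
              with probability w_i / W_total.
    s       : scale turning a bin index into an angle (illustrative value).
    theta   : half-range of the uniform jitter B (illustrative value).
    """
    w = np.asarray(weights, dtype=float)
    # W_X ~ Uniform(0, W_total); invert the cumulative weights to get bins X.
    wx = rng.uniform(0.0, w.sum(), size=n)
    x = np.searchsorted(np.cumsum(w), wx, side="right")
    # B ~ Uniform(-theta, theta); rotation parameters R = s*X + B,
    # with bins shifted so the middle bin corresponds to the frontal view.
    b = rng.uniform(-theta, theta, size=n)
    center = (len(w) - 1) / 2.0
    return s * (x - center) + b
```

With 13 uniform bins this covers roughly −90° to +90°; putting all the weight on the last bin yields only profile views near +90°, which is how tuning W_P shifts the pose distribution.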
We crop the generated faces while preserving the same aspect ratio, but do not align them, for two reasons. First, as suggested by the literature on VGG-Face training [39], face recognition achieves better performance when training faces are not aligned. Second, when the pose exceeds 60 degrees, the aligned face is seriously distorted because of the occluded eye and mouth corners.
The background of each generated face is transparent, leading to high contrast on the facial contour. To prevent the CNN classifier from overfitting unrealistic contour patterns, we synthesize the background in a flexible manner by randomly appending a scene image as the background. The alpha channel is employed to combine the generated face and background.
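The background-appending step is standard alpha compositing, which can be sketched as follows (a minimal sketch; the function name and float-image convention are assumptions, and the real pipeline would additionally pick a random SUN397 image and resize it):

```python
import numpy as np

def composite(face_rgba, background_rgb):
    """Alpha-blend a rendered face over a scene image.

    face_rgba      : H x W x 4 float array in [0, 1]; channel 3 is alpha
                     (transparent outside the facial region).
    background_rgb : H x W x 3 float array in [0, 1].
    """
    alpha = face_rgba[..., 3:4]  # keep a trailing axis for broadcasting
    return alpha * face_rgba[..., :3] + (1.0 - alpha) * background_rgb
```

Opaque facial pixels keep the rendered colour, while fully transparent pixels are replaced by the background, removing the high-contrast contour.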

CNN Training
Recent face recognition models [4,5,17] achieve state-of-the-art accuracy on YTF, LFW, and MegaFace. As we enrich the training set with large-pose faces, better performance is expected on faces under drastic poses than with existing CNN-based methods.
The training set consists of faces generated from frontal scans of 3D databases, Bosphorus and CASIA-3D FaceV1 in this paper. The CNN architecture is adapted from the 20-layer SphereFace CNN [17], which is pre-trained on CASIA-WebFace. We freeze the parameters of all convolution layers and fine-tune the parameters of the fully connected layers, mapping the feature f_i from the last convolution layer to its identity label C_i. Minimizing the loss function L (Equation (3)) maximizes cos(mθ_{C_i,f_i}) and minimizes cos(θ_{j,f_i}),
where θ_{C_i,f_i} is the angle between f_i and its identity label C_i, θ_{j,f_i} is the angle between f_i and another identity label j, θ ∈ [0, π], m is the angular margin, and N is the number of generated images. Maximizing cos(mθ_{C_i,f_i}) drives the angle θ_{C_i,f_i} toward zero, while minimizing cos(θ_{j,f_i}) drives the angle θ_{j,f_i} toward π.
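For reference, the loss that Equation (3) refers to is the A-Softmax loss as defined in SphereFace [17] (restated here, not a new formulation):

```latex
L = -\frac{1}{N}\sum_{i=1}^{N}
    \log\frac{e^{\|f_i\|\,\psi(\theta_{C_i,f_i})}}
             {e^{\|f_i\|\,\psi(\theta_{C_i,f_i})}
              + \sum_{j\neq C_i} e^{\|f_i\|\cos\theta_{j,f_i}}},
\qquad
\psi(\theta) = (-1)^{k}\cos(m\theta) - 2k,\;
\theta \in \left[\tfrac{k\pi}{m}, \tfrac{(k+1)\pi}{m}\right],\;
k \in [0, m-1],
```

where ψ monotonically extends cos(mθ) over the whole range [0, π], so that enlarging m tightens the angular margin between identities.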

Face Matching
In the testing phase, we drop the last fully connected layer of the trained CNN, which outputs the face identity label, and adopt the penultimate layer, which retains high-level information, as our feature representation. The testing set consists of photos. These photos are not used to reconstruct 3D faces and generate new views for face matching, since the landmark detection and pose estimation essential for 3D reconstruction may not be accurate under drastic face poses. These photos are matched based on the proposed framework. For a photo p, we extract a feature vector f_p and the feature f_flip of its horizontally flipped version from the penultimate layer of the trained CNN. A robust face representation r_p (Equation (4)) is obtained by concatenating these feature vectors.
For a face pair denoted as (p_1, p_2), the score s (Equation (5)) is computed as the cosine similarity of their robust representations.
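Equations (4) and (5) amount to a concatenation followed by cosine similarity, which can be sketched as follows (the function names are illustrative; the feature vectors would come from the penultimate CNN layer):

```python
import numpy as np

def robust_feature(f, f_flip):
    """Equation (4): concatenate the features of a face and its
    horizontally flipped version into a robust representation r_p."""
    return np.concatenate([f, f_flip])

def match_score(r1, r2):
    """Equation (5): cosine similarity between two robust representations."""
    return float(np.dot(r1, r2) / (np.linalg.norm(r1) * np.linalg.norm(r2)))
```

Identification then reduces to scoring a probe representation against every gallery representation and taking the gallery subject with the highest score.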
Before face matching, near-frontal faces (yaw rotations within ±30°) in the testing set are aligned using five landmarks (two eyes, two mouth corners, and the nose), while the other faces are aligned using the visible eye and the tip of the nose. MTCNNv1 [40] is employed for detecting and aligning the testing faces. However, MTCNNv1 is not effective on profile faces; the failure cases in face detection are manually aligned.

Datasets
To evaluate the performance of PAFR on faces under drastic poses, databases containing both 2D faces under arbitrary poses and frontal face scans are needed. CASIA-3D FaceV1 and Bosphorus were employed for our experiments. Bosphorus contains 105 individuals (60 men and 45 women, most between 25 and 35 years old) and 4652 scans in total. Each individual has no fewer than 31 and no more than 54 scans. Facial poses include seven yaw rotations ranging from −90° to 90°, four pitch rotations, and two cross rotations. Each scan contains a point cloud, a facial image, and 22 manually labeled feature points; the facial image can be mapped to 3D space as a face texture by coordinate mapping. CASIA-3D FaceV1 contains 4624 scans of 123 individuals, 37 or 38 scans per person, covering variations of pose, emotion, and illumination. Each scan has a textured 3D face and a facial image. Compared with Bosphorus, the faces in CASIA-3D FaceV1 are darker, since its light source is incandescent. A summary of the datasets is presented in Table 1. A 3D graphic engine, Blender, is employed for its open-source license and Python support. The backgrounds of the generated faces are from the SUN397 database [41]. Figure 3 shows the faces generated from face scans in Bosphorus. Each scan randomly generates 512 facial images, from which we selected 15 images representing multi-view faces under yaw rotations (azimuth rotations relative to the camera) ranging from −90° to 90°. It can be seen that the generated faces are under arbitrary poses with high fidelity.

Evaluate the Components in Face Augmentation
To evaluate the impact of each component in the face augmentation, training faces are generated by the proposed augmentation method with one or two components removed. After CNN training is finished, the class with the maximum probability from the Softmax classifier is taken as the predicted class, and the fraction of correct predictions is reported as rank-1 accuracy. We compute the rank-1 accuracy on 2D faces from Bosphorus.
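The rank-1 metric described above can be sketched as follows (an illustrative helper, not the paper's evaluation code):

```python
import numpy as np

def rank1_accuracy(probs, labels):
    """Rank-1 accuracy: the fraction of samples whose highest-probability
    class (argmax of the Softmax output) equals the true identity label.

    probs  : n_samples x n_classes array of Softmax probabilities
    labels : length-n_samples array of true class indices
    """
    return float(np.mean(np.argmax(probs, axis=1) == labels))
```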
The baseline is established by computing the accuracy on Bosphorus. First, we generate faces without cropping and without background appending; this accuracy is reported as 'cbN'. Second, we randomly append backgrounds to the generated faces of cbN; this accuracy is reported as 'cBN'. Third, the generated faces of cbN are cropped; this accuracy is reported as 'CbN'. Fourth, we randomly append backgrounds to the generated faces of CbN; this accuracy is reported as 'CBN'. Table 2 reports these results and depicts the benefits of combining the components. Cropping increases the performance on faces under ±90° poses by more than 8% without background appending, and by more than 20.79% with it. Background appending increases the performance on faces under ±90° poses by more than 1% without cropping, and by more than 2.66% with it. The experimental results justify the effectiveness of both the cropping component and the background-appending component.

Evaluate the Face Recognition under Pose Variations
To validate the effectiveness of the proposed framework under arbitrary poses, we calculate the face identification accuracy on Bosphorus and CASIA-3D FaceV1. We set nine groups of gallery and probe. For all nine groups, the probes are the same and consist of faces under poses ranging from left 90° to right 90°; within each gallery, the faces are under the same view. Tables 3 and 4 demonstrate the effectiveness and robustness of PAFR. Table 3 describes the performance of the proposed framework on Bosphorus. It shows little difference in face identification when the poses of the gallery faces range from left 45° to right 45°. When the gallery consists of frontal faces, the recognition accuracy is 94% for the probe consisting of faces under left 90° rotations. However, the recognition accuracy drops significantly when the gallery consists only of faces under ±90°. Table 4 describes similar results on CASIA-3D FaceV1, and the performance on CASIA-3D FaceV1 is better than that on Bosphorus. Both results validate the effectiveness of PAFR under arbitrary poses, especially drastic poses.

Comparison with State-of-the-Art
In Table 5, we compare the performance of PAFR with recent studies on Bosphorus. PGM [42], PGDP [43], and Liang et al. [44] demonstrated state-of-the-art performance in 3D face recognition. FLM + GT [45] and Sang et al. [46] demonstrated the performance of image normalization methods. It can be observed that the proposed framework outperforms the 3D face recognition methods and the image normalization methods at all poses, except that the accuracy on faces under right 90° is lower than that of [46]. When examining the failed cases under R90°, we found that most of them were not well aligned. Although we employ MTCNNv1 to align the testing faces before face matching, MTCNNv1 is not effective on profile faces. Since the testing faces under L90° are better aligned than those under R90°, the accuracy at L90° is higher.

Conclusions
We propose a pose-autoaugment face recognition framework, which is the first to generate arbitrary views that are accurate and realistic enough for both face training and recognition. The proposed framework trains a pose-invariant CNN model and extracts identifiable features from views under drastic poses for face recognition. The proposed face augmentation method increases the pose variations of each subject. The experiments on Bosphorus show that our work improves the average accuracy over all poses, and they also demonstrate the robustness and effectiveness of the framework under drastic poses. In the future, we will explore face augmentation methods with 3D faces scanned by low-resolution devices, such as Kinect.
Author Contributions: Conceptualization, W.G.; methodology, W.G.; writing-original draft preparation, W.G.; writing-review and editing, X.Z.; supervision, X.Z. and J.Z. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.