Three-D Wide Faces (3DWF): Facial Landmark Detection and 3D Reconstruction over a New RGB–D Multi-Camera Dataset

Recent advances in the deep learning paradigm and in 3D imaging systems have raised the need for more complete datasets that allow the exploitation of facial features such as pose, gender, or age. In this work, we propose a new facial dataset collected with an innovative RGB–D multi-camera setup whose optimization is presented and validated. 3DWF includes raw and registered 3D data for 92 persons, captured with devices ranging from low-cost RGB–D sensors to highly accurate commercial scanners. By means of annotated, density-normalized 2K clouds and RGB–D streams, 3DWF provides a complete dataset with relevant and accurate visual information for tasks related to facial properties, such as face tracking or 3D face reconstruction. In addition, we validate the reliability of our proposal with an original data augmentation method that uses a massive set of face meshes for facial landmark detection in the 2D domain, and with head pose classification through common machine learning techniques aimed at proving the alignment of the collected data.


Introduction
Recent advances in computer vision and machine learning have directed the research community towards building large collections of annotated data in order to increase performance in different application domains. In an analogous manner, 3D imaging has become widely available through the introduction of low-cost devices and the development of efficient 3D reconstruction algorithms. In this paper, we focus on 3D facial imaging. Facial attributes range from simple demographic information such as gender, age, or ethnicity, to the physical characteristics of a face such as nose size, mouth shape, or eyebrow thickness, and even to environmental aspects such as lighting conditions, facial expression, and image quality [1].
The first attempt to collect facial data for the study of faces was the Multi-PIE database [2], in which facial appearance varies significantly with factors such as identity, illumination, pose, and expression. Its authors built a setup with 13 cameras located at head height and spaced at 15° intervals. The Multi-PIE database contains 337 subjects, imaged under 15 viewpoints and 19 illumination conditions. This work is the starting point of the 300-W benchmark [3], which establishes the main metrics and standards to normalize the evaluation of different methods for facial landmark analysis. One of the standards established by 300-W is the annotation of facial landmarks.

1. 3DWF. We propose a multi-camera dataset containing visual facial features. We include streams of 600 to 1200 frames of RGB-D data from three cameras for 92 subjects, and clouds normalized to 2K points for the ten proposed poses with their corresponding projected 3D landmarks. Demographic data such as age or gender is provided for every subject as well. Such a complete dataset in terms of the number of subjects and different imaging conditions is unique and the first of its kind.

2. An innovative data augmentation method for facial landmark detection. This method is based on a 3D (mesh) to 2D projection implemented by raycasting [11].

3. A 3D reconstruction workflow adapted to facial properties and their normalization to provide meaningful features for cloud formats.
The paper is organized as follows:

1. Section 2 outlines the techniques related to the work presented here.

2. Section 3 presents the set-up for the acquisition of the proposed dataset.

3. Section 4 introduces an innovative data augmentation for facial landmark detection.

4. Section 5 describes a complete pipeline for 3D face modeling with a multi-camera RGB-D setup.

5. Section 6 presents a validation method for the classification of the markers.

6. Section 7 validates the captured data and evaluates the proposed methods for facial landmark detection.

7. The main contributions are discussed in Section 8.

Related Work
This section describes the current algorithms related to facial analysis that make use of computer vision techniques. We focus on the following facial analysis applications: 3D facial acquisition, facial landmarks detection, and head pose estimation.

3D Facial Acquisition
Recently, different approaches have been published to optimize acquisition systems so that they accurately represent 3D facial attributes. Ref. [12] collected 2000 2D facial images of 135 subjects as well as their 3D ground truth face scans. The authors proposed a dense face reconstruction method based on dense correspondence from every 2D image gathered to a collected neutral high resolution scan. Another hybrid solution for reconstruction (Florence Dataset) is presented in [13], where a complex capturing system (3dMD [14]) is used. Its number of subjects (53) is also smaller than in the dataset proposed here. UHDB31 [15] followed up on this work by increasing the number of poses and the number of subjects, resulting in a more complete dataset; however, the cost of the setup is still quite high and the number of subjects is still lower than in this work.
In our case, we provide neutral high resolution scans as well, but we believe that the impressive results of recent works published, implementing deep learning for classification and segmentation with normalized input clouds [16,17] motivate a new research line. Therefore, we postulate a new challenge, and propose an initial reconstructed and normalized set to adopt this line for facial analysis.
The Pandora dataset, focusing on shoulder and head pose estimation, is introduced in [18] for driving environments. However, the dataset only contains images from 20 subjects captured with a single camera, and therefore it does not allow proper 3D reconstruction for extreme poses. Ref. [19] proposes a 3D reference-free face modeling method tested on a set of predefined poses. The authors perform an initial data filtering process, and employ the face pose to adapt the reconstruction. In our case, we use 2D face detection projection and afterward implement the proper 3D filtering techniques, exploiting information from 2D facial landmark detection in order to perform a more reliable 3D face reconstruction. Other techniques use a single RGB sensor, but in this case they require either a 3D Morphable Model (3DMM), initially proposed by Blanz and Vetter [20], or a 2D reference frame and displacement measurement, as in [21]. Methods based on a 3DMM can be trapped in a local minimum and cannot generate very accurate 3D geometry.

Facial Landmark Detection
Research in this field is very prolific. Therefore, we concentrate on the three groups most relevant to our work:

1. Regression-Based Methods. These methods directly learn a regression function from image appearance (feature) to the target output (shape):
$$x = M(F(I))$$
where M denotes the mapping from an image appearance feature F(I) to the shape x, and F is the feature extractor.
Ref. [22] proposed a two-level cascaded learning framework based on boosted regression. This method directly learns a vectorial output for all landmarks. Shape-indexed features are extracted from the whole image and fed into the regressor.

2. Graphical Model-Based Methods. Graphical model-based methods mainly refer to tree-structure-based methods and Markov Random Field (MRF) based methods. Tree-structure-based methods take each facial feature point as a node and all points as a tree. The locations of facial feature points can be optimally solved by dynamic programming. Unlike the tree structure, which has no loops, MRF-based methods model the locations of all points with loops. Zhu and Ramanan [23] proposed a unified model for face detection, head pose estimation, and landmark estimation. Their method is based on a mixture of trees, each of which corresponds to one head pose view. These different trees share a pool of parts. Since tree-structure-based methods only consider the local neighboring relation and neglect the global shape configuration, they may easily lead to unreasonable facial shapes.

3. Deep Learning-Based Methods. Luo et al. [24] proposed a hierarchical face parsing method based on deep learning. They recast the facial feature point localization problem as the process of finding label maps. The proposed hierarchical framework consists of four layers performing, respectively, the following tasks: face detector, facial parts detectors, facial component detectors, and facial component segmentation. Sun et al. [25] proposed a three-level cascaded deep convolutional network framework for point detection in a coarse-to-fine manner. It can achieve great accuracy, but this method needs to model each point with a convolutional network, which increases the complexity of the whole model.
Ref. [26] enhanced the detection by following a coarse-to-fine manner where coarse features inform finer features early in their formation, in such a way that finer features can make use of several layers of computation in deciding how to use coarse features. We selected this method to test the data augmentation method presented in Section 4.1 because of the novelty and efficiency of a deep net that combines convolution and max-pool layers to train faster than the summation baseline and yields more precise localization predictions. Finally, the other selected solution to test the performance of the proposed augmentation method (Section 4.1) is [27] as they imply an evolution from previous models. As the proposed Tweaked Neural Network does not involve multiple part models, it is naturally hierarchical and requires no auxiliary labels beyond landmarks. They provide an analysis of representations produced at intermediate layers of a deep CNN trained for landmark detection, yielding good results at representing different head poses and (some) facial attributes. They inferred from previous analysis that the first fully connected layer already estimates rough head pose. With this information they can train pose specific landmark regressors.

Head Pose Estimation
Head pose estimation is a topic widely explored with applications such as autonomous driving, focus of attention modeling or emotion analysis. Fanelli et al. [28] introduced the first relevant method to solve this problem relying on depth sensing devices. Their proposal is based on random regression forests by formulating pose estimation as a regression problem. They synthesize a great amount of annotated training data using a statistical model of the human face. In an analogous way, Liu et al. [8] also propose a training method based on synthetic generated data. They use a Convolution Neural Network (CNN) to learn the most relevant features. To provide annotated head poses in the training process, they generate a realistic head pose dataset through rendering techniques. They fuse data from 37 subjects with differences in gender, age, race and expression. Some other lines of research, such as the one followed by [29], pose the problem as the classification of human gazing direction. We follow this approach in our work. It proposes as well deep learning techniques to fuse different low resolution sources of visual information that can be obtained from RGB-D devices. The authors encode depth information by adding two extra channels: surface normal azimuthal and surface normal elevation angle. Their learning stage is divided into two CNNs (RGB and depth inputs). The information learned by deep learning is employed to further fine-tune a regressor.
From the analysis of the literature related to our work, we can conclude that our multi-camera RGB-D setup provides an affordable capturing system, able to perform 3D face reconstruction at extreme poses with a reasonable cost and deployment effort. In addition, facial landmark detection has already been explored extensively. Therefore, it is more suitable to provide a refinement of the 3D techniques presented here. Head pose estimation is highly correlated with facial landmark detection (especially in the 3D domain), and we believe that, with good performance in facial landmark detection, head pose estimation can easily be approached.

3DWF Dataset
This dataset is captured by a system composed of three Asus Xtion depth cameras [30] in order to acquire multi-camera RGB-D information from 92 subjects who modify their head poses following a sequence of markers. The subjects were asked to move their head continuously in a natural manner. To achieve synchronous acquisition with three simultaneous devices, three independent USB buses are required, and synchronization among them has been implemented to provide a uniform acquisition. Synchronization among the devices is critical since subjects are moving their head, and reducing the delay between the cameras allows registration of the acquired point clouds. For that aim, the OpenNI 2 library [31] has been adopted, following this procedure:

1. The list of connected devices is gathered.

3. The data structures required to perform the data flow are created.
First-generation RGB-D sensors are deployed because of their higher accuracy for 3D reconstruction of one category of objects, as shown in [32], and because more than one device can be connected to the same computer. Between 600 and 1200 frames are recorded by each device for every subject. The proposed setup is displayed in Figure 1. Three RGB-D cameras can be observed in Figure 1; the one in the middle is referred to as the frontal camera, and the other two as the side cameras. The number in each box represents the sequence of markers that the subjects were asked to follow (starting at box 1 and ending at box 10). W stands for width, H for height, and D for depth; the origin is located at the frontal camera for W and D, and at the floor for H.
The proposed dataset contains the following sources of visual data:
(a) RGB and depth data. This data has been continuously captured and is relevant for topics such as facial tracking or 3D face reconstruction.
(b) RGB point clouds for ten markers. This data has been statically reconstructed with a target resolution of 2K points, and it is well suited to machine learning methods for tasks such as head pose or gaze estimation.
(c) HD initial cloud. This data can be useful as a reference cloud. It has been captured with a Faro Freestyle 3D Laser Scanner [33], whose 3D point accuracy is 0.5 mm, and reconstructed with FARO Scene [34].

Set-Up Optimization
Optimization tests are mainly based on three parameters:

1. Distance from the model to the frontal camera. The manufacturer of the device recommends a distance in the range of 80-150 cm. Therefore, the tests are performed in this range. From Table A2, it can be derived that the highest number of points in the point cloud is obtained at a distance of 80 cm. Likewise, the best visual appearance is obtained with that value.

2. Light source. Once the optimum distance from the model to the camera is determined, the next step is to determine the parameters of the LED light source employed, whose main features are listed in Table A1. Different tests are performed based on three parameters:
(a) Distance. The tests performed for the distance were mainly based on the visual appearance of the cloud obtained. For distances smaller than 200 cm, the appearance was too bright. We found that the optimum distance should be set to 250 cm.
(b) Luminous flux. We base our evaluation on the resolution of the point cloud obtained for each camera. The results are shown in Table A3. It can be derived that as the luminous flux increases, the resolution of the point cloud decreases. Therefore, 250 lumens (lm), the minimum provided by the manufacturer, is chosen.
(c) Orientation. To optimize the orientation of the light source in the proposed scenario, grayscale mean (Î) and standard deviation (σ_I) values are evaluated. To this end, different angles between the light sources are explored with a luminous flux of 250 lm, but in this case we should also consider the visual appearance. For that purpose, we have tested light sources pointing to three targets:
i. Models
ii. Frontal cameras
iii. Side cameras
Tests performed pointing to the models presented the worst visual results, even though different diffusion filters were tested on the light source. Other values are shown in Table A4. The best visual results were obtained when the light sources were pointing to the opposite side cameras. In this case, the mean and STD are replaced by the median (Ĩ) and the Median Absolute Deviation (MAD), since the median and the mean differ notably. The angles of the cameras are also included in the table since they have a large influence on the results. This parameter is analyzed separately below.

3. Camera orientation. To find the best orientation of the cameras, different angles are tested. All the optimum parameters described previously are deployed in the scenario to test this parameter. The RGB-D devices capture the scene as affected by all the previous parameters; we therefore believe that camera orientation should be the last parameter to be tested, and it is the most critical one since it determines the field of view. The first tests are carried out to determine the region of interest to be covered for the subjects involved in the experiment. The minimum angle required to completely cover the face of the subjects is 30°, and as the angle between the cameras increases, the surface covered increases as well. Finally, the grayscale values obtained by the RGB sensor are evaluated in order to capture a similar range of color intensities for the faces. Results are shown in Table A5. It can be derived that as the angle between the cameras increases, the difference between the mean and the STD also increases. Therefore, we can conclude that the optimum angle between the cameras is 30°.
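The grayscale statistics compared in the orientation tests can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names are hypothetical, the population standard deviation is an assumption, and the sample values are invented only to show how a few saturated pixels pull the mean away from the median.

```python
import statistics

def mean_std(values):
    """Grayscale mean and (population) standard deviation."""
    return statistics.fmean(values), statistics.pstdev(values)

def median_mad(values):
    """Grayscale median and Median Absolute Deviation (MAD)."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    return med, mad

# Hypothetical grayscale samples with one saturated pixel.
samples = [10, 12, 11, 13, 12, 250]
print(mean_std(samples))    # mean is dominated by the outlier
print(median_mad(samples))  # median and MAD stay close to the bulk
```

When the mean and the median differ notably, as reported for the light sources pointing to the side cameras, the median and MAD give a description that is less sensitive to a few very bright pixels.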

Subjects Description
For this dataset, age and gender are also registered for all the subjects. Statistics of those features are shown in Figure 2.
We can observe that most of the population is between 20 and 40 years old because the dataset was recorded at a university, although the dataset covers a wide range of ages. It is also noticeable that gender is slightly unbalanced; however, looking at the specific graphs of age ranges for each gender, it can be observed that the age of females is more balanced than the age of males.

Facial Landmark Detection
This section proposes a new data augmentation method from 3D meshes to 2D images and analyzes its influence on two state of the art deep learning facial landmark detection methods.

Data Augmentation
With the data augmentation method proposed for face landmark detection, we want to demonstrate a possible application of the proposed dataset (3DWF). The dataset provided by 3DUniversum is captured by a rotating structured sensor device in order to reconstruct 3D models of 300 subjects. Rotation is performed by a device analogous to the one presented in [35]; however, in this case a projector is not required, and the structured light sensor is combined with a common RGB sensor to perform the capture. This dataset was gathered through a massive data collection, which made it possible to capture a larger number of subjects at the cost of collecting fewer facial attributes such as pose, gender, or age from each of them. The 3D reconstruction of the gathered data is outside the scope of this work. In our case, we are directed towards a deep learning method to extract valuable features from meshes that are already processed. To this end, we used raycasting [11] to perform the projection of the 3D mesh to a 2D image. In implementing this technique, we moved the viewing plane in front of the pinhole to remove the inversion. A graphical explanation is shown in Figure 3. If an object point is at distance $z_0$ from the viewpoint and has y coordinate $y_0$, then its projection $y_p$ onto the viewplane is determined by the ratios of sides of the similar triangles (0, 0), (0, $z_p$), ($y_p$, $z_p$) and (0, 0), (0, $z_0$), ($y_0$, $z_0$). So we have:
$$\frac{y_p}{z_p} = \frac{y_0}{z_0} \quad\Rightarrow\quad y_p = \frac{y_0\, z_p}{z_0}$$
The values of the viewpoint are based on the following parameters and values:
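The similar-triangles argument can be illustrated with a short sketch. The function name is hypothetical, and applying the same ratio to the x coordinate is an assumption that follows from the same geometry (the text only derives the y coordinate); a full raycaster would additionally handle visibility and texture lookup.

```python
def project_point(x0, y0, z0, zp):
    """Project a 3D point onto a viewplane at distance zp from the viewpoint.

    The viewplane sits in front of the pinhole, so the image is not inverted.
    """
    if z0 <= 0:
        raise ValueError("the point must lie in front of the viewpoint")
    scale = zp / z0  # ratio of the similar triangles
    return x0 * scale, y0 * scale

# A point at depth 8 projected onto a viewplane at depth 2.
print(project_point(2.0, 4.0, 8.0, 2.0))
```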

VanillaCNN
The solution shown in [27] is selected as one suitable architecture to increase the performance of facial landmark detection for Annotated Facial Landmarks in the Wild Dataset (AFLW [36]) based on data augmentation method previously discussed. The architecture of this network includes mid-network features and implies a hierarchical learning. The main peculiarity of this network is the tweaking model oriented to two main processes:

1. It performs a specific clustering in the intermediate layers by means of a representation that discriminates between differently aligned faces. With that information, it trains pose-specific landmark regressors.

2. The remaining weights from the first dense layer output are fine-tuned by selecting only the group of images classified in the same cluster with the features from the intermediate layers.
An absolute hyperbolic tangent is used as the activation function and Adam is used for training optimization [37]. The L2 distance normalized by the inter-ocular distance is implemented as the network loss:
$$L(P_i, \hat{P}_i) = \frac{\lVert P_i - \hat{P}_i \rVert_2^2}{\lVert \hat{p}_{i,1} - \hat{p}_{i,2} \rVert_2^2}$$
where $P_i$ is the 2×k vector of predicted coordinates for a training image, $\hat{P}_i$ their ground truth locations, and $\hat{p}_{i,1}$, $\hat{p}_{i,2}$ are the reference eye positions.
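As a rough illustration of a landmark loss normalized by the inter-ocular distance, the sketch below computes it for a single image. It assumes squared L2 norms in both numerator and denominator and assumes that the first two landmarks are the eyes; both choices, as well as the function name, are assumptions made for the example rather than details confirmed by the text.

```python
def interocular_normalized_l2(pred, truth, left_eye=0, right_eye=1):
    """Squared L2 landmark error divided by the squared inter-ocular distance.

    pred and truth are sequences of (x, y) landmark coordinates.
    """
    num = sum((px - tx) ** 2 + (py - ty) ** 2
              for (px, py), (tx, ty) in zip(pred, truth))
    (lx, ly), (rx, ry) = truth[left_eye], truth[right_eye]
    den = (lx - rx) ** 2 + (ly - ry) ** 2
    return num / den

truth = [(0.0, 0.0), (2.0, 0.0), (1.0, 1.0)]
pred = [(0.0, 1.0), (2.0, 0.0), (1.0, 1.0)]
print(interocular_normalized_l2(pred, truth))
```

Dividing by the eye distance makes the loss independent of the face size in pixels, so large and small faces contribute comparably during training.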

Recombinator Networks (RCN)
Ref. [26] performs learning through using landmark independent feature maps. In this case, instead of performing specific learning, a more purely statistical approach is performed. The output of each branch is upsampled, then concatenated with the next level branch with one degree of finer resolution. Therefore, the main novelty is that branches pass more information to each other during training letting the network learn how to combine them non-linearly to maximize the log likelihood of the landmarks. It is only at the end of the Rth branch that feature maps are converted into a per-landmarks scoring representation by implementing a softmax.
All convolutional layers are followed by a ReLU non-linearity except for the one right before the softmax. This architecture is trained globally using gradient backpropagation with an additional regularization term for the weights, through the following loss:
$$L(W) = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{k} -\log P\left(Y_j = y_j^{(i)} \mid X = x^{(i)}\right) + \lambda \lVert W \rVert^2$$
where n is the number of samples, k is the number of landmarks, W represents the network parameters, which are also penalized in the regularization term, and λ is the weight of that term. In summary, we selected two architectures that alternate convolution and max-pooling layers, but whose nature is completely different. VanillaCNN has four convolution layers and two dense layers. The dense layers are interleaved with a discrimination among the clusters previously learned from mid-network features in a pose-specific manner. RCN has a bidirectional architecture with different branches, each including 3-4 convolution layers, whose results are concatenated at the end of each branch with the inputs of the following one. VanillaCNN uses filter sizes that decrease along the network, whereas RCN keeps them fixed.

3D Reconstruction
This section validates the algorithm developed to present one point cloud for every subject for the markers located in the scenario that are graphically shown in Figure 1 in 3DWF Dataset. The steps performed are summarized in Figure 4.

Registration
Clouds obtained from the RGB-D devices are registered by using a rigid body transformation [38]. We use an affine transformation [39]. Ten points are selected from every cloud (two-by-two matching). To obtain the transformation matrix, we build a homogeneous transformation, using the frontal point cloud ($Cl_{Frontal}$) as reference:
$$T = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix}$$
where R is the rotation matrix and t, with components $t_i$, $i \in \{1, ..., 3\}$, is the translation vector. We build an origin matrix with one point per row, $p_i = (x_i, y_i, z_i, 1)$, $i \in \{1, ..., 10\}$, from $Cl_{Right}$ and $Cl_{Left}$, and a target matrix with one point per row, $q_i = (x_i, y_i, z_i, 1)$, $i \in \{1, ..., 10\}$, from $Cl_{Frontal}$. To overcome accuracy errors in the manual annotation and obtain optimum values for $R_{left,right}$ and $T_{left,right}$, we employed Random Sample Consensus (RANSAC) [40]. The initial clouds ($Cl_{Left}$, $Cl_{Right}$) are then transformed towards the reference frontal cloud, and the resulting clouds are added two by two to obtain the complete cloud $Cl_{Total}$. This summation task is shown in Figure 5.
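Applying a homogeneous transformation to a side cloud can be sketched as below. This only illustrates the final "transform towards the frontal cloud" step; estimating R and t from the ten annotated correspondences with RANSAC is not shown, and the function names and the example rotation are assumptions made for the illustration.

```python
def make_homogeneous(R, t):
    """Build a 4x4 homogeneous matrix from a 3x3 rotation R and a translation t."""
    return [list(R[i]) + [t[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def transform_cloud(T, points):
    """Apply the 4x4 transform T to a list of (x, y, z) points."""
    out = []
    for x, y, z in points:
        p = (x, y, z, 1.0)  # homogeneous coordinates
        out.append(tuple(sum(T[r][c] * p[c] for c in range(4)) for r in range(3)))
    return out

# Hypothetical transform: 90 degree rotation about Z plus a translation.
R = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]
t = [1, 2, 3]
T = make_homogeneous(R, t)
print(transform_cloud(T, [(1, 0, 0)]))
```

Once both side clouds are expressed in the frontal camera frame, they can simply be concatenated with the frontal cloud.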

Refinement
The second step of the reconstruction is based on the Iterative Closest Point (ICP) [41] algorithm. ICP is used to minimize the difference between sets of geometrical points such as segments, triangles, or parametric curves. In our work, we use the point-to-point approach. The metric distance between the origin clouds ($Cl_{Left}$ and $Cl_{Right}$) and the target cloud ($Cl_{Frontal}$) is minimized by the following equation:
$$E(R, t) = \sum_{i} \lVert q_i - (R\, p_i + t) \rVert^2$$
where $p_i$ is a point belonging to the origin cloud ($Cl_{Left}$ and $Cl_{Right}$) and $q_i$ is a point belonging to the target cloud ($Cl_{Frontal}$). Regarding the rotation and translation matrices, the algorithm iterates over the minimum square distances by:
$$(R_j, t_j) = \arg\min_{R, t} E(R, t), \qquad j = 1, ..., N$$
where N is the number of iterations, fixed to 30 in our solution; we have also fixed the percentage of worst candidate removal to 90%. With this refinement, $Cl_{Total}$ is obtained.

ROI
Once the whole cloud is built and refined, we use the Dlib face detector [42] on the RGB image from the frontal camera in order to determine the region of interest (ROI). In a similar way, we apply the facial landmark detection based on VanillaCNN described in Section 4 to obtain the locations of the facial landmarks. To project the keypoints obtained from the neural network, we use the Perspective Projection Model [43]. Its equations (8) relate the following quantities: $X_k$, $Y_k$ and $Z_k$ are the projected coordinates in the cloud, $x_c$ and $y_c$ are the coordinates of the center of the 2D image, $x_k$ and $y_k$ are the input coordinates from the 2D image, and $\delta_x$ and $\delta_y$ are the parameters that correct the lens distortion, provided by the manufacturer. The 3D projection of the bounding box delimiters is then used to crop the refined cloud $Cl_{Total}$ into a cloud with mostly facial properties, $Cl_{F_0}$. In order to test the accuracy of $Cl_{F_0}$, we have considered the clouds gathered with the Faro Freestyle 3D Laser Scanner ($Cl_{HD}$) as ground truth, and we have measured the average minimum distance from every point of $Cl_{F_0}$ for Marker 1 to the points of $Cl_{HD}$ for every subject, obtaining distances in the range 16-23 mm. In addition, we should consider:

1. Since the distance from the subject to the camera is below 1 m, the error of the depth sensor should be in the range 5-15 mm according to the results reported in [44].

2. The faces of the subjects are not rigid (although both captures have been performed on a neutral pose).
Therefore, the range measured as the distance from $Cl_{F_0}$ to $Cl_{HD}$ proves the accuracy of the 3D reconstruction performed.

Noise Filtering
In this section, an algorithm to filter $Cl_{F_0}$ is proposed to obtain reliable face clouds. The following features are proposed:

1. Color. Initially we need to delimit two areas:
(a) 2D ROI' to extract. We have employed the facial landmarks detected by VanillaCNN through the data augmentation procedure, presented in Sections 4.2 and 4.1 respectively. In our setup we have detected five points: left eye (le), right eye (re), nose (n), left mouth (lm) and right mouth (rm). A new ROI (ROI') is defined based on a bounding box with these detections: $\{(x_{le}, y_{le}), (x_{re}, y_{re}), (x_{lm}, y_{lm}), (x_{rm}, y_{rm})\}$. The ROI' RGB intensities are transformed to a more uniform color space: CIELAB [45]. The component values of the two intensity samples used for thresholding are calculated in the following manner, following a normal distribution:
$$\hat{L}_{ROI'} \pm w\,\sigma_{L_{ROI'}}, \qquad \hat{a}_{ROI'} \pm w\,\sigma_{a_{ROI'}}, \qquad \hat{b}_{ROI'} \pm w\,\sigma_{b_{ROI'}}$$
where $\hat{L}_{ROI'}$ and $\sigma_{L_{ROI'}}$ are the mean and standard deviation of the L component of the CIELAB color space for the new ROI defined, and analogously for $\hat{a}_{ROI'}$, $\sigma_{a_{ROI'}}$, $\hat{b}_{ROI'}$ and $\sigma_{b_{ROI'}}$. The weight w is fixed to 0.75 in our implementation.
(b) 3D contour $Ct_{F_0}$. In this case we have defined two margins for width and height from $Cl_{F_0}$ in order to filter the points farther from the cloud centroid by applying CIEDE2000 to every point $pt_i \in Ct_{F_0}$, where $\Delta E^*_{00}$ is the metric used in CIEDE2000 and $Cl_{F_{FC}}$ is the point cloud obtained after color filtering.

2. Depth. This step mainly focuses on the noise introduced by the depth sensors and on outliers from the color filtering.
For that aim we have built a confidence interval based on a normal distribution, where $w_Z$ is fixed to 2.25 in our implementation.
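Both the color and the depth filters rely on intervals built from a mean and a standard deviation scaled by a weight (0.75 for color, 2.25 for depth). A minimal sketch of such an interval is given below; the function names are hypothetical, the use of a symmetric interval and of the population standard deviation are assumptions, and the CIELAB conversion and the CIEDE2000 comparison are not shown.

```python
import statistics

def confidence_interval(values, w):
    """Return (mean - w * std, mean + w * std) for one channel."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std: an assumption
    return mean - w * std, mean + w * std

def keep_inliers(values, w):
    """Keep only the values that fall inside the interval."""
    low, high = confidence_interval(values, w)
    return [v for v in values if low <= v <= high]

# Hypothetical channel values (e.g. the L component or the depth Z).
channel = [2, 4, 4, 4, 5, 5, 7, 9]
print(confidence_interval(channel, 0.75))  # narrow interval, as for color
print(confidence_interval(channel, 2.25))  # wide interval, as for depth
print(keep_inliers(channel, 0.75))
```

A small weight gives a selective filter, which suits skin color, whereas a large weight removes only clear outliers, which suits depth noise.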

Uniform Distribution
To provide a reliable point cloud dataset, it is important that the clouds have similar resolutions and that the point density is constant over every part of the cloud. For that reason, $Cl_{F_F}$ is divided into four parts based on its width and height. A target resolution of 2K points is proposed; therefore, every cloud part should have 2K/4 points. An implementation of voxel grid downsampling [46] based on a dynamic radius search is used. The voxel grid filter downsamples the data by taking a spatial average of the points in the cloud over rectangular volumes known as voxels. The set of points that lie within the bounds of a voxel are assigned to that voxel and are combined into one output point. With this final step $Cl_F$ is composed, and sample values for one subject are displayed in Figure 6. In an analogous way, one sample for Marker 1 without texture mapping is shown in Figure 7.
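The core of a voxel grid filter can be sketched in a few lines. This is a simplified illustration with a fixed voxel size and a hypothetical function name; the dynamic radius search that adapts the voxel size until each cloud part reaches its target of 2K/4 points is not shown.

```python
import math
from collections import defaultdict

def voxel_downsample(points, voxel_size):
    """Replace all points that fall in the same voxel by their centroid."""
    buckets = defaultdict(list)
    for p in points:
        # Integer voxel index along each axis.
        key = tuple(math.floor(c / voxel_size) for c in p)
        buckets[key].append(p)
    return [tuple(sum(axis) / len(pts) for axis in zip(*pts))
            for pts in buckets.values()]

cloud = [(0.25, 0.25, 0.25), (0.75, 0.75, 0.75), (1.5, 0.5, 0.5)]
print(voxel_downsample(cloud, 1.0))
```

Larger voxels merge more points and lower the resolution, so searching over the voxel size is one way to steer each part of the cloud towards a target point count.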

Head Pose Classification
This section describes the methods implemented to validate the alignment of the head pose values of the data gathered in the 3DWF dataset with the markers located in the scene. We used the visual information of subjects looking at marker 1 (relaxed pose looking to the front) as reference for the other markers. The initial steps are analogous to the ones proposed in Section 5.3. In this case, we used the projection equations shown in (8) in reverse, together with the 2D Euclidean distance, to gather the points of $Cl_{Total}$ closest to the 2D facial landmarks detected by VanillaCNN. In this way, a new set composed of 3D facial landmarks is obtained: $\{(X_{le}, Y_{le}, Z_{le}), (X_{re}, Y_{re}, Z_{re}), (X_n, Y_n, Z_n), (X_{lm}, Y_{lm}, Z_{lm}), (X_{rm}, Y_{rm}, Z_{rm})\}$

Rigid Motion
Initial transformations are performed by using Least-Squares Rigid Motion by means of SVD [47] on the set of 3D facial landmarks to obtain the corresponding rotation matrix. Let $P = \{p_1, p_2, ..., p_n\}$, where $p_i \in \mathbb{R}^3$ are the 3D coordinates of the facial landmarks for marker 1, and $Q = \{q_1, q_2, ..., q_n\}$, where $q_i \in \mathbb{R}^3$ are the 3D coordinates of the facial landmarks for markers 2-10, be our reference and target sets of data, respectively. We are able to find a rigid transformation that optimally aligns the two sets in the least squares sense, i.e., assuming a unity vector for the translation (subjects are static in the proposed experiment):
$$(R, t) = \arg\min_{R,\, t} \sum_{i=1}^{n} w_i \lVert (R\, p_i + t) - q_i \rVert^2$$
By restating the problem so that the translation is zero, and simplifying the expression, we can reformulate the problem as:
$$R = \arg\min_{R} \sum_{i=1}^{n} w_i \lVert R\, x_i - y_i \rVert^2 = \arg\max_{R} \operatorname{tr}(W Y^T R X)$$
where $x_i$ and $y_i$ denote the restated $p_i$ and $q_i$, $W = \operatorname{diag}(w_1, ..., w_n)$ is an n×n diagonal matrix with the weight $w_i$ on diagonal entry i, Y is the d×n matrix with $y_i$ as its columns, and X is the d×n matrix with $x_i$ as its columns. tr is the trace of a square matrix (the sum of the elements on the diagonal), which is invariant under cyclic permutation of a product. Therefore, we are looking for a rotation R that maximizes $\operatorname{tr}(R X W Y^T)$. We denote the d×d covariance matrix $S = X W Y^T$. We take the Singular Value Decomposition (SVD) of S such that $S = U \Sigma V^T$. Since V, R and U are all orthogonal matrices, $V^T R U$ is also an orthogonal matrix, and we can assume it to be the identity. Therefore, we can calculate the corresponding rotation matrix in the following way:
$$R = V U^T$$
Given a rotation matrix R, we can compute the Euler angles φ, θ, ψ by equating each element in R with the corresponding element in the matrix product $R_Z(\phi) R_Y(\theta) R_X(\psi)$. This results in nine equations that have been used to find the Euler angles.
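The last step, recovering the Euler angles from a rotation matrix written as the product R_Z(φ) R_Y(θ) R_X(ψ), can be sketched as follows. The function names are hypothetical, and the closed-form solution shown assumes the non-degenerate case cos θ ≠ 0; the SVD step that produces R is not repeated here.

```python
import math

def rotation_zyx(phi, theta, psi):
    """Build R = R_Z(phi) * R_Y(theta) * R_X(psi) as a 3x3 nested list."""
    cf, sf = math.cos(phi), math.sin(phi)
    ct, st = math.cos(theta), math.sin(theta)
    cp, sp = math.cos(psi), math.sin(psi)
    return [
        [ct * cf, sp * st * cf - cp * sf, cp * st * cf + sp * sf],
        [ct * sf, sp * st * sf + cp * cf, cp * st * sf - sp * cf],
        [-st, sp * ct, cp * ct],
    ]

def euler_from_rotation(R):
    """Recover (phi, theta, psi) from R, assuming cos(theta) != 0."""
    theta = -math.asin(max(-1.0, min(1.0, R[2][0])))
    psi = math.atan2(R[2][1], R[2][2])
    phi = math.atan2(R[1][0], R[0][0])
    return phi, theta, psi

# Round trip on hypothetical angles (radians).
R = rotation_zyx(0.3, -0.2, 0.5)
print(euler_from_rotation(R))
```

Only three of the nine element equations are needed in the non-degenerate case: the first column gives φ and θ, and the last row gives ψ.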

Results
In this subsection, metrics used for evaluation and results obtained by them for the proposed architectures and datasets are presented.  Table 1.
Subjects are randomized for every viewpoint based on splits presented in Section 4.1.

2. AFLW. The number of images used to split the dataset and the base learning rate for the AFLW dataset are shown in Table 2.

Error Metric
The Euclidean distance between the true and estimated landmark positions, normalized by the distance between the eyes (interocular distance), is used:

$$e = \frac{1}{K N} \sum_{n=1}^{N} \sum_{k=1}^{K} \frac{\sqrt{\left(w_k^{(n)} - \hat{w}_k^{(n)}\right)^2 + \left(h_k^{(n)} - \hat{h}_k^{(n)}\right)^2}}{D^{(n)}},$$

where $K$ is the number of landmarks (5 in our work), $N$ is the total number of images, $D^{(n)}$ is the interocular distance in image $n$, and $(w_k^{(n)}, h_k^{(n)})$ and $(\hat{w}_k^{(n)}, \hat{h}_k^{(n)})$ represent the true and estimated coordinates for landmark $k$ in image $n$, respectively. Localization error is measured as a fraction of the interocular distance, a measure invariant to the actual size of the images. We declare a point correctly detected if its pixel error is below 0.1 times the interocular distance.
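As a sketch of how this metric could be computed, the following NumPy functions evaluate the normalized error and the fraction of points under the 0.1 threshold. The function names and the array layout (`N` images, `K` landmarks, 2 coordinates) are assumptions made for illustration.

```python
import numpy as np

def normalized_landmark_errors(true_pts, pred_pts, interocular):
    """Per-landmark error as a fraction of the interocular distance.

    true_pts, pred_pts : (N, K, 2) arrays of (w, h) pixel coordinates
    interocular        : (N,) array with the eye distance D(n) of each image
    Returns an (N, K) array of normalized errors.
    """
    true_pts = np.asarray(true_pts, dtype=float)
    pred_pts = np.asarray(pred_pts, dtype=float)
    d = np.asarray(interocular, dtype=float)
    pixel_error = np.linalg.norm(true_pts - pred_pts, axis=2)  # (N, K)
    return pixel_error / d[:, None]

def mean_error(errors):
    """Average over all images and landmarks (the 1 / (K N) double sum)."""
    return float(np.mean(errors))

def detection_rate(errors, threshold=0.1):
    """Fraction of landmarks whose normalized error is below the threshold."""
    return float(np.mean(np.asarray(errors) < threshold))
```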

Accuracy
The histogram in Figure 8 shows the error for the different combinations of networks, explained in Section 4.2, and datasets. It can be derived that the lowest error rate is obtained by training RCN with the 3DU dataset.
The massive set of normalized 2D projections of this dataset, learned in a bidirectional approach, reduces the error. In addition, the network layers based on mid-network features proposed by VanillaCNN achieve the worst results with the same training and testing data, since this is a specific solution for data of a different nature. However, the weights and biases pre-learned from the 3DU dataset increase the performance of the algorithm and help to achieve the best results for AFLW after fine-tuning. In this case, the massive data properly initializes the mid-network features, so that the network can get closer to the global minimum on common landmark detection datasets such as AFLW. This procedure is shown through the visualization of filters in Tables 3 and 4 for the last convolution and max-pooling layers, where brighter color intensities represent stronger activations. It can be derived that the initial training on 3DU provides blunter features at this stage of the learning, due to the triangulation procedure used to gather the mesh input data. It can also be inferred that the fine-tuning process provides sharper features that increase the performance of the network.

Head Pose Classification
To validate the proposed dataset, we took the Euler angles calculated previously as head pose angles (pitch (φ), yaw (ψ) and roll (θ)). Three 2D projections of these data can be seen in Figure 9. It can be inferred that the distribution of the data is quite uniform.

Accuracy
Two validation methods have been tested to classify the Euler angles calculated from the projection of the facial landmarks. These landmarks are obtained by fine-tuning on AFLW the initial training of VanillaCNN with the data augmentation procedure described in Section 4.1. The methods selected are Linear Discriminant Analysis (LDA [48]) and Gaussian Naive Bayes (GNB [49]). The aim of these classification methods is to validate the data capture and the projection of the estimated facial landmarks onto the gathered point clouds. The results and the main features of the proposed methods are shown in Table 5. The confusion matrix for GNB classification, where rows correspond to ground truth markers (2-10) and columns to predicted markers in the same range, is shown in Figure 10. The results show promising values for a simple classification technique such as GNB. It can be noticed that the markers where the subjects look to one side of the scene (such as 2 or 5) are the most difficult to predict, and the markers where the subjects look straight ahead and modify their pitch (such as 10) are the simplest.
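To make the classification step concrete, the following is a minimal NumPy sketch of Gaussian Naive Bayes applied to three Euler angles per sample, together with a confusion matrix whose rows are ground truth markers and whose columns are predictions. It is not the implementation used in this work: the class `TinyGaussianNB`, the helper `confusion_matrix`, and the small variance floor are illustrative assumptions.

```python
import numpy as np

class TinyGaussianNB:
    """Gaussian Naive Bayes: one independent Gaussian per class and feature."""

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        self.mean_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        # Small floor avoids division by zero for constant features (assumption).
        self.var_ = np.array([X[y == c].var(axis=0) for c in self.classes_]) + 1e-9
        self.log_prior_ = np.log(np.array([np.mean(y == c) for c in self.classes_]))
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        diff = X[:, None, :] - self.mean_[None, :, :]  # (n, C, d)
        log_lik = -0.5 * (np.log(2.0 * np.pi * self.var_)[None, :, :]
                          + diff ** 2 / self.var_[None, :, :]).sum(axis=2)
        # Pick the class with the highest log posterior.
        return self.classes_[np.argmax(log_lik + self.log_prior_[None, :], axis=1)]

def confusion_matrix(y_true, y_pred, labels):
    """Rows: ground truth markers. Columns: predicted markers."""
    index = {int(label): i for i, label in enumerate(labels)}
    cm = np.zeros((len(labels), len(labels)), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[index[int(t)], index[int(p)]] += 1
    return cm
```

In use, each row of `X` would hold the three Euler angles of one sample and `y` the marker (2-10) the subject was looking at, with the model fitted on training subjects and the confusion matrix computed on held-out subjects.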

Conclusions
In this paper, we have presented an optimized multi-camera RGB-D system that captures accurate and reliable data on facial properties. Within this scope, we have collected data from 92 people, addressing the need for a 3D facial dataset able to exploit the capabilities of the deep learning paradigm in the 3D domain. In addition, we provide a complete pipeline to process the collected data and pose a challenge for the Computer Vision and Machine Learning research community by annotating human characteristics such as age or gender. The collected RGB-D streams enable other related tasks, such as face tracking or 3D reconstruction, with a wide source of visual information that increases the performance of common acquisition systems for extreme head poses.
Within this scope, we found facial landmark detection to be one of the main tasks where our work can contribute to research lines that project 3D information into a more feasible and less costly domain such as 2D. For that reason, we have proposed an innovative data augmentation method, and we have tested and discussed its accuracy on two state-of-the-art deep learning solutions. We have trained and evaluated synthetic and visual imaging data on two complementary architectures, finding a combined solution that enhances the results for a very common deep network architecture such as Vanilla CNN.
Finally, the alignment of the path proposed to the subjects by ten markers is validated by implementing a geometric approach for head pose through previously estimated features. The refinement of the learning techniques implemented for this task is one of the lines of research proposed for future work.

Funding: This project has been partially funded by the European project AI4EU: "Advancing Europe through collaboration in AI".

Conflicts of Interest:
The authors declare that there is no conflict of interest regarding the publication of this article.

Abbreviations
The following abbreviations are used in this manuscript: RGB-D