Tightly-Coupled Data Compression for Efficient Face Alignment

Featured Application: The proposed method is suitable for resource-restricted environments such as mobile face alignment applications.

Abstract: Face alignment is a key component of applications such as face and expression recognition and face-based AR (Augmented Reality). Among existing algorithms, cascaded-regression based methods have become popular in recent years for their low computational costs and satisfactory performance in uncontrolled environments. However, the trained model of cascaded-regression based methods is large, which makes them difficult to apply in resource-restricted scenarios such as applications on mobile phones. In this paper, a data compression method for the trained model of the supervised descent method (SDM) is proposed. First, according to the distribution of the model data estimated with a non-parametric method, a K-means based data quantization algorithm with probability density-aware initialization is proposed to efficiently quantize the model data. Then, a tightly-coupled SDM training algorithm is proposed so that the training process reduces the errors caused by data quantization. Quantitative experimental results show that our proposed method compresses the trained model to less than 19% of its original size with very similar feature localization performance. The proposed method opens the gates to efficient mobile face alignment applications based on SDM.


Introduction
Face alignment is an important part of facial image analysis. It automatically localizes facial feature points such as the eyes, nose, eyebrows, and mouth in a face image. It plays an important role in popular applications such as face recognition [1,2], attribute computing [3,4], and expression recognition [5]. Face alignment technology is usually applied to obtain anchor points for affine warping so that the face recognition procedure is robust against pose variations. In [3], facial landmarks were used to outline a fetus's face. In [4], facial landmarks helped to localize and represent salient regions of the face. Figure 1 shows an example of a face alignment algorithm with the supervised descent method (SDM) [6].
According to a recent survey [7], face alignment methods can be divided into two categories: generative methods and discriminative methods.
Generative methods explicitly construct generative models for the shape and/or appearance of the face. Feature locations are derived according to the best fit of the model to the test image. Cootes et al. proposed the well-known active shape models (ASM), which calculate an appearance model for every facial part separately [8]. In [9], the authors proposed the Gauss-Newton Deformable Part Model (GN-DPM), which constructs generative models for all facial parts simultaneously. The classical active appearance models (AAM) [10], which contain a shape model, an appearance model, and a motion model, also belong to this category. A drawback of the AAM is that it is not robust against occlusions.
Figure 1. Face alignment example with the supervised descent method (SDM) algorithm [6]. Firstly, the face region is automatically detected in the image [11]. Then, features are localized within the detected face region. (Image courtesy of [12]).
Different from generative methods, discriminative methods aim to estimate the mapping between facial appearances and feature locations directly. Constrained local models (CLM) learn an independent local detector for each feature point [13]; a shape model is then used to regularize these local detectors. Different from CLM, cascaded regression methods directly learn a vectorial regression function to calculate the face shape stage by stage. Explicit shape regression (ESR) [14] was one of the first algorithms in this category; it is a two-level boosted regression framework. Burgos-Artizzu et al. introduced occlusion information into the regression process [15] in order to improve robustness. Kazemi and Sullivan proposed using regression trees instead of random ferns and achieved super-fast speed [16]. Besides the above-mentioned two-level boosted regression frameworks, Xiong and De la Torre presented a cascaded linear regression method with hand-crafted features [6]. The contribution of [6] is a provable supervised descent method (SDM). They also extended SDM to Global SDM in order to cope with the problem of conflicting gradient directions [17]. SDM is a popular face alignment method, especially for resource-restricted applications, since it achieves state-of-the-art results in real 2D scenarios while retaining real-time performance.
With the advent of the deep learning era, deep neural networks have been successfully applied to many computer vision tasks in recent years. Sun et al. were the first to use a deep convolutional network cascade for face alignment [18]. Reference [19] proposed a recurrent neural network approach. Recently, there have been works that achieve 3D face alignment by fitting a 3D Morphable Model (3DMM) through convolutional neural networks (CNN) [20,21]. Reference [22] proposed a 3D face alignment network (3D-FAN) by stacking four hourglass networks. Reference [23] proposed a two-stage method with the help of a deep residual network [24]. In this method, heat-maps of 2D landmarks are first calculated using convolutional part heat-map regression. Then, these heat-maps, along with the original RGB image, are used to regress the depth information with a very deep residual network. Although deep learning methods, especially 3D ones, perform better than traditional methods on images with large head poses, they are not easily applied on mobile platforms. The main reasons are as follows: (1) the deep learning model is usually on the order of 100 MB, which is too big for mobile applications; (2) the computational cost is still quite high, and real-time performance can hardly be achieved on mobile platforms.
As mentioned above, SDM achieves satisfactory results with a relatively low computational cost. However, the size of the trained model for SDM can be more than 80 MB, which is still too large for a commercial mobile application. Unfortunately, traditional lossless compression technology such as entropy coding [25] cannot achieve a compression rate high enough to meet the needs of mobile applications. Lossy compression technology is widely used in video encoding. State-of-the-art methods such as HEVC (High Efficiency Video Coding) reach a very high compression rate with good visual quality [26]. Unfortunately, this kind of technology depends heavily on block motion estimation between consecutive frames in the time domain, which is obviously unavailable in our work.
Recently, research on deep learning network compression has gradually emerged. Reference [27] proposed new kinds of convolutional operations to reduce the number of parameters. Reference [28] compressed the network by pruning unimportant filters according to weight analysis. References [29,30] converted the weights into binary values in order to reduce the size of the model. Instead of binary values, Zhu et al. proposed a method that reduces the precision of the weights to ternary values [31], which avoids most of the accuracy degradation. Howard et al. proposed a depth-wise separable convolution architecture so that traditional 3D convolutions can be broken into 2D convolutions [32], making it suitable for mobile applications. Based on [32], shortcut technology was introduced in [33]. Furthermore, linear bottlenecks were used instead of ReLU [34] in order to preserve the features. Most of the above methods are aimed at specific tasks or network structures such as image classification and image segmentation. To the best of our knowledge, there is no compression architecture that can be applied directly to face alignment networks.
In this paper, a tightly-coupled data compression method for the trained model of the supervised descent method (SDM) is proposed; it reduces the model to less than 1/5 of its original size without obvious performance loss. This method opens the gates to mobile applications using SDM-based face alignment technology.
The remainder of the paper is organized as follows: Section 2 briefly describes the main procedure of the SDM algorithm [6] for the sake of completeness and clarity; Section 3 explains our proposed method in detail; Section 4 demonstrates some detailed algorithm implementations in the proposed method; Section 5 shows both qualitative and quantitative experimental results; Section 6 draws conclusions.

Basics of SDM
This section briefly introduces the main workflow of SDM; please refer to [6] for details. We assume that the face feature points are represented by N 2D landmarks s = [x_1, y_1, ..., x_N, y_N]^T; usually, N = 68. Figure 2 shows the definition of the 68 face landmarks. Given a face image I and the initial 2D landmarks s_0 estimated from the detected face region, our aim is to find a series of regressors r_d = {A_d, b_d} (d = 1, ..., D), where A_d is the projection matrix, which can also be called the descent direction, and b_d is the bias term. In SDM, D is usually chosen between 4 and 6, and the estimated 2D landmarks at the dth step are calculated according to the following equation:

s_d = s_{d-1} + A_d f(I, s_{d-1}) + b_d,    (1)

where f(I, s_{d-1}) are the shape-related features, which can be SIFT (Scale-Invariant Feature Transform) features [35] or HoG (Histograms of Oriented Gradients) features [36] for better performance [37]. These features are calculated at the landmarks s_{d-1} in image I. The final estimated facial landmarks are s_D.
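For concreteness, the cascade of Equation (1) can be sketched in a few lines of Python. The feature extractor `feat_fn` here is an arbitrary stand-in (in the paper it would compute SIFT/HoG descriptors around each landmark), so this is only an illustration of the inference loop, not the paper's implementation:

```python
import numpy as np

def sdm_align(image, s0, regressors, feat_fn):
    """Run the SDM cascade of Eq. (1): s_d = s_{d-1} + A_d f(I, s_{d-1}) + b_d.

    `regressors` is the trained model, a list of (A_d, b_d) pairs;
    `feat_fn(image, s)` returns the shape-indexed feature vector
    (SIFT/HoG around each landmark in the paper; a placeholder here).
    """
    s = s0.copy()
    for A_d, b_d in regressors:
        s = s + A_d @ feat_fn(image, s) + b_d
    return s
```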
Given M training images, SDM aims to minimize a series of the following equations and obtain r_d sequentially:

min_{A_d, b_d} Σ_{i=1}^{M} ||Δs_d^i − A_d f(I^i, s_{d−1}^i) − b_d||^2,    (2)

where Δs_d^i is the shape residual of the ith training image at the dth regression step:

Δs_d^i = s_*^i − s_{d−1}^i,    (3)

and s_*^i are the ground-truth landmark locations of the ith training image. Equation (2) is a standard linear least squares problem and can be solved in closed form.
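The closed-form solve of Equation (2) can be sketched as follows, with the bias absorbed as an extra all-ones feature column. The dimensions are toy values for illustration, not the paper's 136 × 27,200:

```python
import numpy as np

# Toy stand-ins: rows of F are the features f(I^i, s_{d-1}^i) of M
# training samples, rows of dS are the shape residuals of Eq. (3).
M, P, two_N = 50, 8, 4
rng = np.random.default_rng(0)
F = rng.standard_normal((M, P))
A_true = rng.standard_normal((P, two_N))
dS = F @ A_true + 0.25                      # exactly affine ground truth

# Append a column of ones so the bias b_d is learned jointly with A_d.
X = np.hstack([F, np.ones((M, 1))])
W, *_ = np.linalg.lstsq(X, dS, rcond=None)  # closed-form least squares
A_d, b_d = W[:-1], W[-1]
```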
Figure 3 shows the flow chart of the training algorithm of SDM.

Tightly-Coupled Data Compression Algorithm
As shown in Section 2, the key components of SDM are the regressors r_d for all D steps, so the trained model of SDM stores all the information of r_d = {A_d, b_d} for every step. Since the features calculated from all the landmarks are concatenated together to form a large feature vector, the dimension of the feature vector can easily reach 27,200. Table 1 lists the data ranges of A_d and b_d in each regression step. From the table, it can be concluded that the data ranges vary across regression steps, so in this paper the data are compressed separately for each step. Furthermore, the data range of A_d is very different from that of b_d, and b_d only contains 136 floating point numbers per step, which is much smaller (only about 3 KB for all 6 steps) than the size of A_d. Therefore, in this paper, only the data in A_d are compressed.
Since the data distribution of A_d cannot be described by a parametric model, a non-parametric method is applied to estimate it. The number of elements in A_d is large, so for the sake of computational efficiency, the Parzen window method [39] is used here. Assuming that the number of elements in A_d is T and the window size is h, the probability density function (PDF) of an element x in A_d can be estimated through the following equation:

p(x) = (1 / (T h)) Σ_{i=1}^{T} φ((x − x_i) / h),    (4)

where φ(•) is the square window function:

φ(u) = 1 if |u| ≤ 1/2, and 0 otherwise.    (5)

Figure 4 shows the estimated PDF of the elements in A_1; the shapes of the PDFs in the other steps are similar. From the figure, it can be concluded that most of the values in A_d concentrate around 0, and the distribution also shows a long-tail effect. As a result, it is inappropriate to compress the data with uniform quantization. In this paper, a K-means based data quantization algorithm with a probability density-aware initialization technique is proposed in order to cope with the above-mentioned difficulties.
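A minimal Python sketch of the estimator in Equations (4) and (5), using the square window:

```python
def parzen_pdf(x, samples, h):
    """Parzen-window density estimate with the square (box) window:
    p(x) = (1 / (T*h)) * sum_i phi((x - x_i) / h),
    phi(u) = 1 if |u| <= 1/2 else 0."""
    T = len(samples)
    inside = sum(1 for xi in samples if abs((x - xi) / h) <= 0.5)
    return inside / (T * h)
```

In practice the estimate would be evaluated on a grid over [V_min, V_max] to obtain the PDF curve shown in Figure 4.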

K-Means Based Data Quantization with Probability Density-Aware Initialization
The basic idea of our proposed data compression algorithm is to quantize all the elements of A_d into Q predefined values so that each element can be represented with fewer bits. The optimal Q predefined values can be calculated by minimizing the following equation:

min_{C, q} Σ_{k=1}^{Q} Σ_{i=1}^{T} 1(A_d(i) ∈ C_k) (A_d(i) − q_k)^2,    (6)

where 1(•) is the characteristic function, T is the number of elements in A_d, A_d(i) is the ith element of A_d, C = {C_1, C_2, ..., C_Q} divides the data of A_d into Q disjoint clusters, and q_k is the representative scalar of cluster k. Solving the minimization problem of Equation (6) is NP-hard [40]. The K-means algorithm [41] can be applied to obtain an approximate solution. The performance of the K-means algorithm heavily depends on its initialization. As shown in Figure 4, the data distribution of A_d has a single-peak characteristic. Traditional random initialization for the K-means algorithm cannot capture this characteristic of the data distribution of A_d, so the acquired result is far from optimal.
In this paper, a probability density-aware initialization method is proposed: the initial quantization step size (cluster size) is made inversely proportional to the probability density of the data in order to fully capture the data distribution of A_d. Thus, we have

∫_{v_{k−1}}^{v_k} p(x) dx = 1/Q, k = 1, ..., Q,    (7)

where v_{k−1} and v_k are the lower and upper bounds of the kth quantization step, respectively. So the optimal initialization can be estimated by solving the following problem:

find v_1, ..., v_{Q−1} satisfying Equation (7), with v_0 = V_min and v_Q = V_max,    (8)

where V_min and V_max are the minimum and maximum values of A_d, as shown in Table 1. Unfortunately, the exact solution of the above equation is difficult to obtain. Since only a reasonable initialization for the K-means algorithm is needed, an approximate solution is proposed as follows.
All the elements of A_d are sorted in ascending order and stored in an array SA. The lowest and highest quantization step bounds are calculated as

v_0 = V_min = SA(1), v_Q = V_max = SA(T).    (9)

The remaining quantization step bounds are calculated with Equations (10) and (11) as follows:

n_k = ⌊kT/Q⌋,    (10)
v_k = SA(n_k), k = 1, ..., Q − 1.    (11)
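The bound computation of Equations (9)-(11) amounts to taking order statistics of the sorted data. A minimal sketch (note that array indexing is 0-based here, unlike the 1-based notation above):

```python
def quantization_bounds(A, Q):
    """Choose bounds so that each of the Q quantization steps contains
    (nearly) the same number of elements, i.e., the step width is
    inversely proportional to the local probability density."""
    SA = sorted(A)                    # ascending order
    T = len(SA)
    v = [SA[0]]                       # v_0 = V_min, Eq. (9)
    for k in range(1, Q):
        v.append(SA[(k * T) // Q])    # v_k = SA(floor(kT/Q)), Eqs. (10)-(11)
    v.append(SA[-1])                  # v_Q = V_max, Eq. (9)
    return v
```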

Here, ⌊•⌋ is the floor function, which rounds its argument to the nearest integer towards minus infinity.
With the above algorithm, the number of elements between consecutive quantization step bounds is nearly the same, so Equation (8) is approximately solved. Most importantly, all the quantization step bounds can be estimated efficiently. However, we do not choose the mid-value between the lower and upper bounds as the initial value for the K-means algorithm, since the data distribution inside a quantized region is not uniform. Instead, the initial value for the K-means algorithm is set to the mean value of all the elements that fall in the same quantized region, as follows. First, the set of all the elements that belong to the kth quantization step is calculated:

S_k = { A_d(i) | v_{k−1} ≤ A_d(i) < v_k }.    (12)
Then, the initial value of the kth cluster center for the K-means algorithm is calculated as

μ_k = (1/|S_k|) Σ_{x ∈ S_k} x,    (13)

where S_k is the set of elements of A_d falling in the kth quantization step. With these initial values, the K-means algorithm [41] can be applied so that the optimal cluster centers q_k (k = 1, ..., Q) with regard to Equation (6) are estimated. The quantized representation AQ_d(i) is then calculated with the following equations:

AQ_d(i) = argmin_k (A_d(i) − q_k)^2,    (14)
Â_d(i) = q_{AQ_d(i)},    (15)

so that each element is stored as the index of its nearest cluster center and reconstructed as that center's value.
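Putting the pieces together, the quantizer can be sketched as a self-contained 1-D Lloyd's K-means with the density-aware initialization of Equations (12)-(13). This is an illustrative re-implementation (a library K-means would do equally well), and it assumes T ≥ Q:

```python
def kmeans_quantize(A, Q, iters=20):
    """Quantize the values in A into Q levels (Eq. (6)).
    Returns the cluster centers q_k and, for each element, the index
    of its nearest center (Eqs. (14)-(15))."""
    SA = sorted(A)
    T = len(SA)
    # Density-aware initialization: each initial center is the mean of
    # one equal-count chunk of the sorted data (Eqs. (12)-(13)).
    centers = []
    for k in range(Q):
        chunk = SA[(k * T) // Q:((k + 1) * T) // Q]
        centers.append(sum(chunk) / len(chunk))
    for _ in range(iters):
        # Assignment step: each element goes to its nearest center.
        clusters = [[] for _ in range(Q)]
        for x in A:
            j = min(range(Q), key=lambda k: abs(x - centers[k]))
            clusters[j].append(x)
        # Update step: recompute each center as its cluster mean.
        centers = [sum(c) / len(c) if c else centers[k]
                   for k, c in enumerate(clusters)]
    idx = [min(range(Q), key=lambda k: abs(x - centers[k])) for x in A]
    return centers, idx
```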

Tightly-Coupled Training Algorithm
If we directly quantize the final learned A_d as described in the previous section, the quantization process introduces extra errors into the feature localization results. In order to reduce the errors introduced by data quantization, we modify the traditional training algorithm of SDM described in Section 2 and propose the tightly-coupled training algorithm shown in Figure 5.
In the above algorithm, the data quantization process is coupled with the training process as shown in steps 4 and 5 so that the errors caused by quantization in the previous step are propagated into the next regression step.As a result, the projection matrix in the next step can partially correct the errors introduced by the quantization process and the final results can be improved.
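The modified loop can be sketched as follows. Here `features` and the uniform quantizer are simplified stand-ins (the paper uses HoG features and the K-means quantizer of Section 3.1), so this only illustrates how quantization is interleaved with the regression steps:

```python
import numpy as np

def quantize(A, Q=8):
    """Stand-in scalar quantizer: uniform levels over [min, max].
    (The paper instead uses the K-means quantizer of Section 3.1.)"""
    lo, hi = A.min(), A.max()
    levels = np.linspace(lo, hi, Q)
    idx = np.abs(A[..., None] - levels).argmin(axis=-1)
    return levels[idx]

def train_tightly_coupled(features, shapes_gt, s0, D=3, Q=8):
    """Sketch of the tightly-coupled training loop (Figure 5): after
    solving each regression step, the projection matrix is quantized,
    and the *quantized* regressor is used to update the shapes that the
    next step is trained on, so later steps can partially correct the
    quantization errors of earlier ones."""
    s = s0.copy()                       # current shape estimates, (M, 2N)
    model = []
    for d in range(D):
        F = features(s)                 # (M, P) shape-indexed features
        ds = shapes_gt - s              # shape residuals, Eq. (3)
        X = np.hstack([F, np.ones((F.shape[0], 1))])
        W, *_ = np.linalg.lstsq(X, ds, rcond=None)   # solve Eq. (2)
        A, b = W[:-1], W[-1]
        AQ = quantize(A, Q)             # quantize A_d (step 4)
        s = s + F @ AQ + b              # propagate with quantized A_d (step 5)
        model.append((AQ, b))
    return model, s
```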

Compressed Model Data Storage Arrangement
The final compressed model consists of the stacked version of all D steps of regressors r_d. Each regressor r_d consists of three parts: the Q quantized values for the projection matrix; the quantized projection matrix AQ_d that corresponds to A_d; and the bias term b_d.
The quantized values are stored with single-precision floating point numbers.There are Q single-precision floating point numbers for each regression step.
The quantized projection matrix AQ_d has the same dimensions as A_d. Each element of AQ_d is an index to one of the quantized values described above. Since there are Q different quantized values, each element of AQ_d only needs log2(Q) bits, which is usually much less than a 32-bit floating point number. Through AQ_d, the corresponding quantized value can be fetched according to the index, and the approximate projection matrix can be reconstructed. In this way, the data compression purpose is achieved. Throughout this paper, we chose Q = 64, which is justified in the experimental results section.
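As an illustration of this storage scheme (a sketch, not the paper's actual file format), 6-bit indices can be packed into a byte stream as follows:

```python
def pack_indices(indices, bits=6):
    """Pack small integer indices into bytes, `bits` bits each.
    With Q = 64 quantized values, each index needs log2(64) = 6 bits
    instead of a 32-bit float: a 6/32 = 18.75% storage footprint."""
    buf, acc, nacc = bytearray(), 0, 0
    for i in indices:
        acc = (acc << bits) | i
        nacc += bits
        while nacc >= 8:
            nacc -= 8
            buf.append((acc >> nacc) & 0xFF)
    if nacc:                                 # flush trailing bits
        buf.append((acc << (8 - nacc)) & 0xFF)
    return bytes(buf)

def unpack_indices(data, count, bits=6):
    """Recover `count` indices from a byte stream produced above."""
    acc, nacc, out = 0, 0, []
    for byte in data:
        acc = (acc << 8) | byte
        nacc += 8
        while nacc >= bits and len(out) < count:
            nacc -= bits
            out.append((acc >> nacc) & ((1 << bits) - 1))
    return out
```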
The bias term b_d is stored directly as floating point numbers since its size is relatively small, as stated before. Figure 6 illustrates the compressed data storage arrangement.

Methodology
In this section, implementation details of the algorithm proposed in Section 3.1 are described. Algorithm 1 gives the pseudo code of the approximate algorithm for solving Equation (8).
Algorithm 1. Approximate algorithm for solving Equation (8).
Input: Projection matrix A_d at the dth step; number of elements T in A_d; number of quantization levels Q.
Output: Quantization step bounds v_0, ..., v_Q.
1. Sort all the elements of A_d in ascending order and store them in an array SA with quick sort [42].
2. Calculate v_0 and v_Q with Equation (9).
3. for k = 1 to Q − 1 do
4.    Calculate the index n_k with Equation (10).
5.    Get v_k with Equation (11).
6. end for

Algorithm 2 shows the pseudo code for the whole procedure of the proposed data quantization algorithm.

Algorithm 2. The proposed data quantization algorithm.
Input: Projection matrix A_d; number of quantization levels Q.
Output: Quantized projection matrix AQ_d and quantized values q_k (k = 1, ..., Q).
1. Calculate the optimal quantization step bounds v_0, ..., v_Q with Algorithm 1.
2. Initialize the cluster centers μ_k (k = 1, ..., Q) for the K-means algorithm with Equations (12) and (13).
3. Minimize Equation (6) with the K-means algorithm [41] and get the optimal cluster centers q_k.
4. for each element A_d(i) do
5.    Calculate the quantized value with Equations (14) and (15).
6. end for


Results
In this section, our method is compared against standard SDM and the deep learning based method [22] on the 300-W dataset [12]. This publicly available and challenging dataset consists of 600 indoor and outdoor in-the-wild images. It covers large variations in identity, expression, illumination conditions, pose, occlusion, and face size. Each image has ground-truth locations for the 68-point configuration [43]. The open-source implementation of standard SDM by Patrik Huber [44] and the authors' implementation [45] of [22] were used in this paper. Since the SDM based method does not use any 3D information, only the 2D face alignment network (2D-FAN) version of the deep learning based method [22] was tested, for the sake of fairness.

The Choice of Q
In this section, the face alignment accuracy is evaluated according to the average distance between the detected landmarks and the ground truth, normalized by the inter-ocular distance as proposed in [46]. The number of bits used for each element of AQ_d was varied, and the corresponding normalized mean error loss against the standard SDM algorithm [6] was calculated. Figure 7 shows the result.
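The error metric of [46] can be sketched as follows (landmarks as (x, y) tuples; the eye centers used for normalization are passed in explicitly):

```python
import math

def normalized_mean_error(pred, gt, left_eye, right_eye):
    """Mean point-to-point Euclidean error between predicted and
    ground-truth landmarks, normalized by the inter-ocular distance."""
    iod = math.dist(left_eye, right_eye)
    err = sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)
    return err / iod
```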


From the figure, it can be concluded that if each element of AQ_d uses more than 6 bits, the loss is small and decreases smoothly. However, if each element is represented with fewer than 6 bits, the error loss increases rapidly. In this paper we chose 6 bits, considering the balance between error loss and compression efficiency. This means Q = 2^6 = 64.
In this paper, HoG features were utilized and 68 feature points were detected. The number of regression steps was 6. The dimensions of AQ_d and A_d were both 136 × 27,200, as mentioned before, and the dimension of the bias vector b_d was 136. For standard SDM, all the elements are represented by 32-bit single-precision floating point numbers, so the total space needed for the trained model is about 6 × (136 × 27,200 + 136) × 4 bytes ≈ 84.7 MB. In our compressed model, each element of AQ_d takes only 6 bits, with the Q = 64 quantized values and the bias term stored as 32-bit floats per step, giving a total of about 15.9 MB. Our model size was therefore only about 18.75% of the standard one, which means the compression rate of our proposed method was about 5.3X. Furthermore, when entropy coding [25] was applied to our compressed data, about 10% additional compression was usually obtained.
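The storage figures above follow from simple bookkeeping; under the stated sizes (D = 6 steps, Q = 64, A_d of 136 × 27,200, a 136-element bias, 32-bit floats):

```python
# Model size bookkeeping for the numbers reported in the text.
D, Q, rows, cols = 6, 64, 136, 27_200

uncompressed = D * (rows * cols + rows) * 4       # bytes, all 32-bit floats
compressed = D * (rows * cols * 6 // 8            # 6-bit indices of AQ_d
                  + Q * 4                         # Q quantized values
                  + rows * 4)                     # bias term b_d
ratio = compressed / uncompressed
print(f"{uncompressed / 2**20:.1f} MB -> {compressed / 2**20:.1f} MB "
      f"({100 * ratio:.2f}%)")
```

Running this reproduces the roughly 84.7 MB to 15.9 MB reduction, i.e., a ratio just under 19%, dominated by the 6/32 = 18.75% footprint of the index matrix.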

Qualitative Experimental Results
In this section, our feature localization results are compared against standard SDM with an uncompressed trained model [6] and 2D-FAN [22]. Figure 8 shows the results on the 300-W dataset [12]. The left column shows the results with our compressed trained model, the center column shows the results of the SDM algorithm with an uncompressed trained model [6], and the right column shows the results of 2D-FAN [22]. Red lines in all the images depict the ground-truth feature locations. From this figure, it can be concluded that our compressed trained model generates results very similar to those of its uncompressed counterpart; both fit the ground truth well, even under occlusion. The deep learning based method [22] performs slightly better, especially on images with large head poses such as those in the first and last rows. However, the model size of 2D-FAN is about 182 MB [45], which is obviously not suitable for mobile applications. The accuracy of our proposed method is sufficient for mobile applications such as a virtual add-on, which is verified in Section 5.6.


Quantitative Experimental Results
As shown in Figure 2, the face features can be divided into five parts: face contour, eyebrows, eyes, nose, and mouth. The face contour contains points No. 1 to No. 17, and the eyebrows contain points No. 18 to No. 27. The normalized mean errors of the three methods for each part are shown in Figure 9. From the figure, it can be concluded that our proposed compressed trained model achieves feature point locations very close to those of the uncompressed trained model. The deep learning based method performs slightly better. However, the differences in the normalized mean errors on the 300-W test dataset between our proposed method and the deep learning based method are all below 1% for all five parts, which is acceptable considering the high computational cost and memory usage of the deep learning based method.

Similar to the work in [22], a subset was chosen from the 300 W test dataset whose yaw angles are between 0 and 30 degrees.Experiments on this subset with the above three methods were conducted.The results are shown in Figure 10.From this figure, we can find that for moderate head poses, which are the typical scenarios for mobile applications, all three methods can achieve lower normalized errors and generate very similar results, especially for eyes and face contour parts.These two parts are very important for AR (Augmented Reality) based applications.The differences of the normalized mean errors between our proposed method and deep learning based method were less than 0.6% for most parts and even achieved 0.3% for the eyes regions.This proves the effectiveness of our proposed algorithm.[6] and deep learning based method [22] on a subset of the 300 W test dataset [12] whose yaw angles are between 0 and 30 degrees.[6] and deep learning based method [22] on the 300 W test set [12].
Similar to the work in [22], a subset was chosen from the 300 W test dataset whose yaw angles are between 0 and 30 degrees.Experiments on this subset with the above three methods were conducted.The results are shown in Figure 10.From this figure, we can find that for moderate head poses, which are the typical scenarios for mobile applications, all three methods can achieve lower normalized errors and generate very similar results, especially for eyes and face contour parts.These two parts are very important for AR (Augmented Reality) based applications.The differences of the normalized mean errors between our proposed method and deep learning based method were less than 0.6% for most parts and even achieved 0.3% for the eyes regions.This proves the effectiveness of our proposed algorithm.

Quantitative Experimental Results
As shown in Figure 2, the face features can be divided into five parts: face contour, eyebrows, eyes, nose, and mouth.Face contour contains points No. 1 to No. 17  9.From the figure it can be concluded that our proposed compressed training model can achieve very close feature points compared with the uncompressed training model.The deep learning based method performs slightly better.However, the differences of the normalized mean errors on the 300 W test dataset between our proposed method and the deep learning based method are all below 1% for all five parts, which is acceptable considering the high computational cost and memory storage usage of the deep learning based method.[6] and deep learning based method [22] on the 300 W test set [12].
Similar to the work in [22], a subset was chosen from the 300 W test dataset whose yaw angles are between 0 and 30 degrees.Experiments on this subset with the above three methods were conducted.The results are shown in Figure 10.From this figure, we can find that for moderate head poses, which are the typical scenarios for mobile applications, all three methods can achieve lower normalized errors and generate very similar results, especially for eyes and face contour parts.These two parts are very important for AR (Augmented Reality) based applications.The differences of the normalized mean errors between our proposed method and deep learning based method were less than 0.6% for most parts and even achieved 0.3% for the eyes regions.This proves the effectiveness of our proposed algorithm.[6] and deep learning based method [22] on a subset of the 300 W test dataset [12] whose yaw angles are between 0 and 30 degrees.[6] and deep learning based method [22] on a subset of the 300 W test dataset [12] whose yaw angles are between 0 and 30 degrees.
Figure 11 shows the cumulative error distribution curve of our proposed method and the uncompressed training model SDM [6].It is obvious that the two curves are very close to each other.This again confirms the similar performance of both methods despite our training model being much smaller.
Appl.Sci.2018, 8, x 14 of 20 Figure 11 shows the cumulative error distribution curve of our proposed method and the uncompressed training model SDM [6].It is obvious that the two curves are very close to each other.This again confirms the similar performance of both methods despite our training model being much smaller.
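The per-part normalized mean errors discussed above can be sketched in a few lines. This is a minimal sketch, not the authors' evaluation code: it assumes the common 300 W convention of normalizing by the inter-ocular distance, uses 0-based slices for the five parts listed above, and treats the outer eye-corner indices (points 37 and 46) as an assumption, since the paper's exact normalization constant is not specified in this excerpt.

```python
import numpy as np

# Five facial parts of the 68-point markup, as 0-based index slices
# (points 1-17, 18-27, 28-36, 37-48, 49-68 in the paper's numbering).
PARTS = {
    "contour":  slice(0, 17),
    "eyebrows": slice(17, 27),
    "nose":     slice(27, 36),
    "eyes":     slice(36, 48),
    "mouth":    slice(48, 68),
}

def normalized_mean_error(pred, gt, left_eye=36, right_eye=45):
    """Mean landmark error normalized by the inter-ocular distance.
    pred, gt: (68, 2) arrays of predicted / ground-truth positions."""
    iod = np.linalg.norm(gt[left_eye] - gt[right_eye])
    return np.linalg.norm(pred - gt, axis=1).mean() / iod

def per_part_errors(pred, gt):
    """Normalized mean error for each of the five facial parts."""
    iod = np.linalg.norm(gt[36] - gt[45])
    d = np.linalg.norm(pred - gt, axis=1)
    return {name: d[sl].mean() / iod for name, sl in PARTS.items()}
```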

Effect of the Tightly-Coupled Training Algorithm
The effect of the tightly-coupled training algorithm was analyzed in this section. We compared the results of our proposed method with those of the method without the tightly-coupled training algorithm, that is, the quantization results of Section 3.1 applied directly. The results are shown in Figure 12. From the figure it can be concluded that without the tightly-coupled training algorithm, the normalized mean error for each feature point increased by more than 2.5%. Considering that the average normalized mean error was about 3.5%, this is a large increase in localization error. This proves that the coupling process successfully reduces the errors caused by the data quantization step.
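The coupling idea can be sketched as follows: quantize each projection matrix immediately after it is solved, and train the next step on shapes produced by the quantized matrix, so later regressors absorb the quantization error. This is a minimal sketch under assumed data layouts, not the authors' implementation; `features` and `quantize` are placeholder callables.

```python
import numpy as np

def train_tightly_coupled(features, shapes, gt, steps, quantize):
    # Sketch of the tightly-coupled training loop (assumed structure):
    # each projection matrix is quantized right after it is solved, and
    # the *quantized* matrix produces the shapes used to train the next
    # step, so later steps compensate for the quantization error.
    #   features: callable mapping one shape vector to its feature vector
    #   shapes:   (N, L) initial shape estimates, gt: (N, L) ground truth
    #   quantize: callable returning the quantized copy of a matrix
    regressors = []
    for _ in range(steps):
        phi = np.stack([features(s) for s in shapes])      # (N, F)
        residual = gt - shapes                             # shape residuals
        X = np.hstack([phi, np.ones((len(phi), 1))])       # bias column
        W, *_ = np.linalg.lstsq(X, residual, rcond=None)   # least squares
        AQ = quantize(W[:-1])                              # coupling step
        b = W[-1]
        regressors.append((AQ, b))
        shapes = shapes + phi @ AQ + b  # propagate with the quantized matrix
    return regressors
```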


Effect of Probability Density-Aware Initialization
Investigation of the effectiveness of our proposed probability density-aware initialization technique from Section 3.1 was conducted in this section. Our proposed method was compared with standard random initialization for the K-means algorithm, that is, we randomly chose Q elements from Ad and set them as the initial cluster centers for the K-means algorithm instead of μk as calculated according to Equation (13). We repeated this 100 times and calculated the averages and standard deviations of the normalized mean errors for the five parts of the face. The results are shown in Table 2. From the table it can be found that the normalized mean errors increased by more than 1.3% when random initialization was used. The reason might be that the data in Ad concentrate at specific points, as shown in Figure 4. If we randomly select Q elements and set them as the initial cluster centers, all these initial cluster centers fall near the mode of the PDF with high probability. Thus, the quantization errors for the elements of Ad that are far from the mode of the PDF are very high and cause large localization errors. Our proposed method was also compared with the method without the K-means algorithm described in Section 3.1, that is, μk, calculated according to Equation (13), is used directly as the quantization center. Figure 13 shows the result. Without K-means clustering, normalized mean errors increased by about 0.8%. This proves that K-means clustering successfully minimizes the quantization errors in Equation (6).


Parameter Sensitivity Analysis
There are two important parameters in our proposed method: one is the number of bits nb = log2 Q used to encode each element in Ad; the other is the number of regression steps D. Table 3 shows the normalized mean error loss against the standard SDM for different choices of nb. It can be concluded from the table that the normalized mean error loss against the standard SDM decreased almost linearly when the number of bits was not smaller than 6. The error loss difference between nb = 6 and nb = 16 was quite small and is hardly noticeable in video applications. The above data justified the choice of Q = 2^6 = 64 in all our experiments.
Experiments on the normalized mean error loss against the standard SDM with different choices of regression steps D were also conducted. Table 4 shows the results. This table verifies that the normalized mean error loss is not sensitive to the choice of the number of regression steps if D > 1. This justifies the robustness of our proposed algorithm. In practice, it is uncommon to choose very small D values unless computing resources are extremely restricted, since feature localization accuracy is not guaranteed even with the standard SDM. This table also reveals that with our proposed method we are free to choose D according to the demands of the application, because the accuracy loss is not sensitive to D.

User Study for AR Mobile Applications
An AR mobile application with SDM using our proposed compressed trained model was developed. Sample effects of this application are shown in Figure 14. This application adds some interesting virtual decorations to the face video in real-time. Twenty short face videos of different people were recorded; the length of each video was about 30 s. AR effects for these face videos were generated with our developed mobile application. Each face video generated two output videos: one with the standard SDM [6] and the other with our proposed compressed model. These two videos have the same AR effect, but different face videos had different AR effects.
Twenty people were recruited to score the results generated by the two methods on the 20 face videos. The range of the scores was from 1 to 5 points. Half of the people were male and the other half were female. Their ages ranged from 19 to 40. The recruits were undergraduate students, graduate students, or teachers. None of them had any relationship with this research project.
In this experiment, the two output videos were shown simultaneously on the monitor, and the test subject scored the visual effects. The results showed that 14 out of 20 people gave exactly the same score to the two methods in all the videos. The scores of the other six people are listed in Table 5. From this table it can be concluded that the visual effects generated by our compressed model were very similar to those of the uncompressed counterpart. This again proves the effectiveness of our proposed algorithm.

Computational Cost Analysis
The computational cost overhead for online feature tracking with our proposed method is decompressing the data file. The face alignment process is exactly the same for our proposed method and [6]. Fortunately, the decompression only needs to be done once before feature tracking. The computational time of our decompression process was about 20 ms on an iPhone 6, which is negligible compared with the loading process of the mobile application, which is on the order of several seconds.
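The one-time decompression amounts to a codebook lookup: every quantization index is replaced by its float32 value to rebuild the projection matrix. The sketch below assumes the layout described in Section 3.3 (64 float32 codebook values plus one 6-bit index per element); the actual file format is not specified in this excerpt.

```python
import numpy as np

def decompress_step(codebook, codes, rows=136, cols=27200):
    # One-time decompression at load: replace every quantization index by
    # its float32 codebook value to rebuild the projection matrix A_d.
    # codebook: (64,) float32 values; codes: (rows*cols,) uint8 indices.
    return codebook[codes].reshape(rows, cols)
```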
The extra computational cost introduced into the training process as described in Section 3.2 is listed in Table 6. Our proposed data compression algorithm was implemented in C++ with the Microsoft Visual Studio 2015 IDE on a 64-bit Windows 7 operating system. All the data in Table 6 were collected on a PC equipped with an Intel 3.4 GHz i7-4770 CPU and 8 GB RAM. From the table it can be concluded that the computational overhead for each regression step in the tightly-coupled training process was about 12.3 + 256.7 + 9.4 = 278.4 ms. In this paper, a 6-step regressor was trained, so the total computational overhead for the tightly-coupled training process was 278.4 × 6 = 1670.4 ms. Since the whole training process took about 20 min, the extra computational cost is again negligible.

Discussion
This paper proposed an adaptive data compression method for the training model of an SDM-based face alignment algorithm. An efficient method to quantize the model data with a probability density-aware K-means algorithm was proposed. Furthermore, our quantization method was tightly coupled into the training process so that the accuracy loss was minimized. Experimental results proved that our proposed method is on par with the standard method while needing less than 1/5 of the original storage space. Our method even achieved comparable results with state-of-the-art deep learning based methods in images with moderate head poses.

Figure 1 .
Figure 1. Face alignment example with the supervised descent method (SDM) algorithm [6]. Firstly, the face region is automatically detected in the image [11]. Then, features are localized within the detected face region. (Image courtesy of [12].)



(2) The computational cost is still quite high and can hardly achieve real-time performance on mobile platforms. (3) Deep learning models need huge amounts of training data, which are not easy to collect for the face alignment task. (4) The training process of deep learning models is tricky without the help of open-source implementations from the authors.

Δs_i^d denotes the shape residual of the ith training image at the dth regression step: Δs_i^d = s_i* − s_i^d, where s_i* are the ground truth landmark locations of the ith training image. Equation (2) is a standard linear least squares problem and can be solved in closed form. Figure 3 shows the flow chart of the training algorithm of SDM.
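Since Equation (2) is a standard linear least squares problem, the closed-form solve can be sketched as below. This is a minimal sketch, not the authors' code; the matrix is stored here as (F, L), i.e., transposed relative to the paper's 136 × 27,200 convention.

```python
import numpy as np

def solve_regressor(phi, residuals):
    # Closed-form solution of the per-step linear least squares problem:
    # find A_d, b_d minimizing ||residuals - phi A_d - b_d||^2 over all
    # training images.  phi: (N, F) features, residuals: (N, L) shape
    # residuals Delta s_i^d = s_i* - s_i^d.
    X = np.hstack([phi, np.ones((phi.shape[0], 1))])   # append bias column
    W, *_ = np.linalg.lstsq(X, residuals, rcond=None)  # least-squares solve
    return W[:-1], W[-1]                               # A_d, b_d
```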

Figure 3 .
Figure 3. The flow chart of the training algorithm of SDM.



Figure 4 .
Figure 4. Typical example of the estimated probability density function (PDF) for the elements in A1.


Figure 5 .
Figure 5. The flow chart of the proposed tightly-coupled training algorithm.


3.3. Compressed Model Data Storage Arrangement
The final compressed model consists of the stacked version of all D steps of regressors rd. Each regressor rd consists of three parts: the Q quantized values for the projection matrix; the quantized projection matrix AQd that corresponds to Ad; and the bias term bd.
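Storing the 6-bit indices of AQd compactly requires packing them into a byte stream, since bytes hold 8 bits. A minimal sketch of that packing follows; the exact on-disk layout of Figure 6 is not reproduced in this excerpt, so MSB-first bit order is an assumption, and only the 6 bits/element budget comes from the paper.

```python
import numpy as np

def pack_codes(codes, nb=6):
    # Pack nb-bit quantization indices (values < 2**nb) into a contiguous
    # byte stream: keep only the low nb bits of each index, then repack.
    bits = np.unpackbits(codes[:, None].astype(np.uint8), axis=1)[:, -nb:]
    return np.packbits(bits.ravel())

def unpack_codes(packed, count, nb=6):
    # Inverse of pack_codes: slice the bit stream back into nb-bit groups
    # and rebuild each integer from its bits (MSB first).
    bits = np.unpackbits(packed)[: count * nb].reshape(count, nb)
    weights = 1 << np.arange(nb - 1, -1, -1)
    return (bits * weights).sum(axis=1).astype(np.uint8)
```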

Figure 6 .
Figure 6. The diagram of the compressed data storage arrangement.



Algorithm 1.
Approximate algorithm for solving Equation (8). Input: Projection matrix Ad at the dth step; number of elements T in Ad; number of quantization levels Q. Output: quantization step bounds v0 … vQ.

Algorithm 2.
The proposed data quantization algorithm. Input: Projection matrix Ad (d = 1 … D) for all D steps; number of quantization levels Q. Output: Quantized projection matrix AQd (d = 1 … D) for all D steps. For d = 1 to D: 1. Calculate the optimal quantization step bounds v0 … vQ according to Ad with Algorithm 1. 2. Initialize cluster centers μk (k = 1 … Q) for the K-means algorithm with Equation (13). 3. Run the K-means algorithm and quantize Ad to AQd with the resulting cluster centers.
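Algorithm 2 can be sketched as below for a single regression step. Because Algorithm 1 and Equation (13) are not reproduced in this excerpt, the density-aware initialization is approximated here with equal-probability (quantile) bins, which captures the idea of spreading the initial centers according to the data's probability mass; treat those details as assumptions rather than the authors' exact procedure.

```python
import numpy as np

def quantize_matrix(A, Q=64, iters=20):
    # Sketch of Algorithm 2 for one regression step.
    # Density-aware init (approximation of Algorithm 1 / Equation (13)):
    # one initial center per equal-mass slice of the data, so the centers
    # follow the estimated PDF instead of clustering near its mode.
    x = A.ravel()
    bounds = np.quantile(x, np.linspace(0.0, 1.0, Q + 1))
    centers = 0.5 * (bounds[:-1] + bounds[1:])
    # Plain 1-D Lloyd (K-means) iterations; not memory-optimized.
    for _ in range(iters):
        codes = np.abs(x[:, None] - centers[None, :]).argmin(axis=1)
        for k in range(Q):
            if np.any(codes == k):
                centers[k] = x[codes == k].mean()
    return centers, codes.reshape(A.shape)  # codebook and per-element codes
```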

Figure 7 .
Figure 7. Normalized mean error loss against the standard SDM.

The dimensions of AQd and Ad were both 136 × 27,200, as mentioned before. The dimension of the bias vector bd was 136. For standard SDM, all the elements are represented by 32-bit single-precision floating point numbers. Therefore, the total space needed for the training model was (136 × 27,200 + 136) × 4 × 6 = 88,784,064 bytes. With our proposed method, we need to store, for each regression step, 64 single-precision floating point quantized values, the 136 × 27,200 6-bit matrix AQd, and the 136-dimensional single-precision floating point vector bd. Therefore, our storage consumption for the whole training model was (64 × 4 + 136 × 27,200 × 6/8 + 136 × 4) × 6 = 16,651,200 bytes.
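The two storage figures above can be checked directly with the dimensions quoted in the text:

```python
def sdm_model_sizes(D=6, rows=136, cols=27200, Q=64, nb=6):
    # Uncompressed: float32 A_d and b_d per step.
    uncompressed = (rows * cols + rows) * 4 * D
    # Compressed: Q float32 codebook values, nb bits per matrix element,
    # and a float32 bias per landmark coordinate, per step.
    compressed = (Q * 4 + rows * cols * nb // 8 + rows * 4) * D
    return uncompressed, compressed

u, c = sdm_model_sizes()
print(u, c, c / u)  # 88784064 16651200 ~0.1875, i.e., < 19% of the original
```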

Figure 8 .
Figure 8. Face alignment results for different methods on the 300 W dataset [12]. The left column shows the results with our compressed training model. The center column shows the results of the SDM algorithm with an uncompressed training model [6]. The right column shows the results of 2D-FAN [22].


Figure 9 .
Figure 9. Normalized mean errors of our compressed training model SDM compared with the uncompressed training model SDM [6] and the deep learning based method [22] on the 300 W test set [12].

Figure 10 .
Figure 10. Normalized mean errors of our compressed training model SDM compared with the uncompressed training model SDM [6] and the deep learning based method [22] on a subset of the 300 W test dataset [12] whose yaw angles are between 0 and 30 degrees.


Figure 11 .
Figure 11. Comparison of the cumulative error distribution curves between our compressed training model SDM and the uncompressed training model SDM [6] on the 300 W test set [12].

Figure 12 .
Figure 12. Normalized mean errors of our proposed method versus the method without the tightly-coupled training algorithm.


Figure 13 .
Figure 13. Normalized mean errors of our proposed method versus the method without K-means clustering.


Figure 14 .
Figure 14. Sample AR (Augmented Reality) effects of the mobile application using our proposed compressed model version of SDM.


Table 1.
if HoG features are used. So the dimension of Ad is 136 × 27,200 and bd is a 136-dimensional vector. Typically, each component in Ad and bd is represented by a single-precision floating point number. Table 1 shows the data ranges of Ad and bd for a typical HoG-based 6-step regressor. Data ranges for a typical HoG-based 6-step regressor.

Table 2 .
Comparisons of normalized mean errors of our proposed method versus the method with random initialization for the K-means algorithm.


Table 3 .
The normalized mean error loss against the standard SDM for different choices of nb.


Table 4 .
The normalized mean error loss against the standard SDM for different choices of D.


Table 5 .
Comparisons of average scores for the two methods.

Table 6 .
Extra computational cost for each part of the algorithm in the tightly-coupled training process.