Stabilized Temporal 3D Face Alignment Using Landmark Displacement Learning

Abstract: One of the most crucial aspects of 3D facial models is facial reconstruction. However, when the 3D morphable model (3DMM) is fitted to highly expressive faces, it is unclear whether face shape distortion is caused by identity or expression. To overcome this problem, we introduce neural networks that reconstruct stable and precise faces over time. The reconstruction network extracts the 3DMM parameters from video sequences to represent 3D faces over time. Meanwhile, our displacement networks learn the changes in facial landmarks. In particular, the networks learn changes caused by facial identity, facial expression, and temporal cues, respectively. By leveraging these displacement networks, the proposed facial alignment network exhibits reliable and precise performance in reconstructing static and dynamic faces. The 300 Videos in the Wild (300VW) dataset is used for qualitative and quantitative evaluations to confirm the effectiveness of our method. The results demonstrate the considerable advantages of our method in reconstructing 3D faces from video sequences.


Introduction
Three-dimensional (3D) facial models find widespread applications in various facial tasks, including facial animation, facial synthesis, facial reconstruction, facial recognition, and facial tracking. A crucial pre-processing step for utilizing 3D facial models is facial alignment, which involves moving and deforming a facial model to match an image. The regularized structure of facial components, such as the eyes, lips, and nose in human faces, serves as a valuable prior for efficient facial alignment. However, conventional facial alignment methods exhibit instability when dealing with large pose and expression changes. In such scenarios, it becomes challenging to distinguish whether the observed facial shape changes derive from identity, expression, or pose variations. Furthermore, when this ambiguity extends to the temporal domain, it results in unnatural facial shape changes and jittering artifacts, leading to significant visual quality degradation. To overcome these issues, this paper introduces a facial reconstruction framework that learns facial movements, i.e., displacements, according to facial identity, expression, and temporal cues. The 3D morphable model (3DMM) stands as the most widely utilized statistical representation for obtaining 3D faces from facial images in diverse face-related applications. Since its initial introduction [1], various adaptations of the 3DMM have been developed by employing principal component analysis (PCA) to decompose facial scans of different identities and expressions, enabling the representation of arbitrary human faces. Consequently, it efficiently captures the 3D facial shape from a given facial image. However, challenges arise when fitting the 3DMM to facial images exhibiting large expression or pose variations.
In such cases, the facial shape is ambiguous, making it difficult to determine whether the facial shape deformation is due to identity or expression changes. While this ambiguity may not cause substantial visual degradation in a static context, it becomes evident in the temporal domain, leading to apparent visual artifacts, such as unnatural facial shape changes and jittering. To address the problem, we separately model identity shape, expression shape, and temporal movements. Facial parametric models, such as the 3DMM [1] or FLAME [2], demonstrate that the statistical shape variations (i.e., movements or displacements) caused by facial identity and expressions are independent of each other. Thus, the proposed method models these movements separately to effectively reduce the ambiguity of the facial movement.
With the spread of the generative adversarial network (GAN) in deep learning, it has been found that using discriminators leads to networks with higher performance [3]. The GAN is composed of two networks: a generator and a discriminator. The discriminator is trained to determine whether its input comes from the ground-truth data distribution or the generated data distribution. At the same time, the generator is trained to fool the discriminator by generating more accurate data. Thanks to its powerful performance, the discriminator has recently been widely adopted in various temporal data generation tasks [4,5]. Motivated by this, we propose a stable and accurate facial alignment framework by introducing displacement discriminators that determine whether the regressed camera and facial shape parameters are stable. We train a discriminator to evaluate whether the distribution of the 3D face alignment results is similar to that of the ground-truth 3D face movements. Thus, this discriminator learns the distribution difference between the alignment results and the ground-truth movements. Then, the 3D facial alignment network is trained to produce a stable 3D face alignment using the distribution difference learned by the discriminator as guidance. Here, to learn the distribution difference more precisely, we present three displacement discriminators that separately discriminate the facial movements according to personal identity, expression, and temporal cues. The identity and expression displacement discriminators are trained to discriminate whether the facial deformations generated from the estimated facial identity and expression parameters are stable. This enables the facial alignment network to estimate accurate facial identity and expression parameters.
The temporal displacement discriminator is trained to discriminate whether the facial temporal displacement is stable, which allows the alignment network to achieve temporally stable alignment results. Using these displacement discriminators, the proposed facial alignment network shows accurate and stable facial alignment performance in both the static and temporal domains.
For the qualitative and quantitative evaluations, we use the 300 Videos in the Wild (300VW) dataset [6], which provides large-scale facial tracking data. In the experimental results, the proposed method shows significant improvements over state-of-the-art methods for temporal facial alignment. The results demonstrate that the proposed method enables accurate facial tracking with multiple discriminators by stabilizing facial locations and shapes over time.

3D Morphable Model
Since the pioneering work of Blanz [1] introducing the first 3D morphable model (3DMM), several subsequent 3DMMs have been proposed [7][8][9]. These models are constructed by encoding the features of 3D facial scans pertaining to identity, expression, and texture through PCA decomposition, leveraging data collected from multiple subjects. Due to the distinct topology of each facial scan, mesh registration is essential to establish vertex correspondences among them. In Blanz's work [1], optical methods were employed to determine the vertex correspondences between facial scans. Paysan et al. [8] proposed a non-rigid registration approach utilizing warping based on thin-plate splines (TPS) [10], and a non-rigid iterative closest point (ICP) [11] was utilized to achieve accurate alignment.
Vlasic et al. [9] presented a multilinear facial model, representing facial identity and expression using singular value decomposition (SVD). Subsequently, Cao et al. [7] proposed a bilinear facial model, building upon the multilinear model by deforming the facial scan into a template model with expression. Thanks to the considerable efforts devoted to constructing accurate 3DMMs, an arbitrary 3D face can now be effectively and precisely represented using these models.

3D Face Alignment
3D facial alignment is the task of fitting a 3D facial shape to input facial images. Due to the powerful representation performance of the 3DMM, it is widely used for face alignment. The first method for 3D facial alignment [12] aligned the 3DMM to the input image by minimizing the pixel-wise difference between the target facial image and a rendered image of the 3DMM. In recent years, regression-based 3D facial alignment techniques have been introduced [13][14][15][16][17], which minimize the discrepancy between the target 2D landmarks and the projected 2D landmarks of the 3DMM. While these approaches demonstrated performance improvements, two major challenges remain.
Firstly, self-occlusion becomes a concern when dealing with large pose or expression variations. Self-occlusion leads to the loss of facial semantic information, resulting in unreliable facial alignment. Secondly, in temporal sequences, temporal instability becomes pronounced during rapid and substantial facial motion. While facial alignment results may appear reliable in static shots, jittering artifacts often emerge in the temporal domain. To address these issues, this paper introduces novel stabilization discriminators that effectively guide changes in the stabilized facial shape, particularly when dealing with large poses, expressions, and motion.

Method
The proposed method is composed of the facial alignment network and the displacement discriminators. For the facial alignment network, we employ the 3DMM for efficient facial shape alignment. In addition, to ensure consistent facial alignments for an individual's identity and expression over time, multiple sub-discriminators are integrated into the displacement discriminators. Figure 1 provides an overview of the entire framework of the proposed method, consisting of the facial alignment network and the displacement discriminators: the identity displacement discriminator (IDD), expression displacement discriminator (EDD), and temporal displacement discriminator (TDD). For face alignment, the facial alignment network estimates the 3DMM parameters $\alpha$ and camera parameters $p$ corresponding to the current image. The identity and expression displacement discriminators ($D_{ID}$, $D_{ED}$) are trained to assess whether facial shape changes align with an individual's identity and expression, respectively. In addition, the temporal displacement discriminator is trained to determine whether the temporal facial shape change is stable.

Facial Alignment Network
A 3DMM represents an arbitrary 3D face ($S$) using bases decomposed through PCA. Using the 3DMM, the 3D face $S$ can be represented by parameters for both identity and expression, $\alpha = [\alpha_{id}, \alpha_{exp}]$. Given a 2D image $I$, the facial reconstruction network $G$ estimates the shape parameters $\alpha$. The projected landmarks of the 3D face are then selected using a landmark index vector $l \in \mathbb{R}^{68}$. The reconstruction network is trained using the landmark loss $L_{land}$, which measures the discrepancy between the projected landmarks and $U$, the labeled ground-truth 2D landmark locations of the input image.
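As a minimal sketch of the quantities in this subsection, the following pure-Python example builds a face from a linear 3DMM and scores it against ground-truth landmarks. It assumes the standard linear form (a mean shape plus identity and expression bases weighted by the parameters), a toy camera that simply drops the z coordinate, and a mean Euclidean distance as the landmark loss. The function names `morphable_shape`, `project`, and `landmark_loss`, as well as the toy dimensions, are illustrative and are not taken from the paper.

```python
import math

def morphable_shape(mean, basis_id, alpha_id, basis_exp, alpha_exp):
    """Linear 3DMM (assumed form): S = mean + A_id * alpha_id + A_exp * alpha_exp.

    mean is a flat list of 3N coordinates; each basis is a list of
    basis vectors (one flat 3N list per parameter).
    """
    shape = list(mean)
    for weight, vec in zip(alpha_id, basis_id):
        for i, value in enumerate(vec):
            shape[i] += weight * value
    for weight, vec in zip(alpha_exp, basis_exp):
        for i, value in enumerate(vec):
            shape[i] += weight * value
    return shape

def project(shape):
    """Toy orthographic camera (assumption): keep (x, y), drop z."""
    return [(shape[i], shape[i + 1]) for i in range(0, len(shape), 3)]

def landmark_loss(gt_landmarks, vertices_2d, landmark_index):
    """Mean Euclidean distance between U and v(:, l) (the norm is an assumption)."""
    total = 0.0
    for (ux, uy), idx in zip(gt_landmarks, landmark_index):
        vx, vy = vertices_2d[idx]
        total += math.hypot(ux - vx, uy - vy)
    return total / len(landmark_index)

# Toy face: 3 vertices, 1 identity basis vector, 1 expression basis vector.
mean = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0]
basis_id = [[0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]]
basis_exp = [[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0]]

shape = morphable_shape(mean, basis_id, [0.5], basis_exp, [2.0])
vertices = project(shape)
landmark_index = [1, 2]            # l: which vertices act as landmarks
gt = [(1.5, 0.0), (0.0, 2.0)]      # U: labeled 2D landmarks
loss = landmark_loss(gt, vertices, landmark_index)
```

In the paper's setting the index vector selects 68 landmarks and the camera parameters $p$ replace the toy projection; the structure of the computation stays the same.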

Displacement Discriminators
To achieve stability in both the temporal and static domains during training of the facial alignment network, we propose three displacement discriminators, one each for identity, expression, and temporal cues. The identity and expression displacement discriminators stabilize the facial alignment network in the static domain by discriminating the changes in facial shape estimated from the identity and expression parameters, respectively. Consequently, the network can better separate the influences of identity and expression on facial shape variations. The temporal displacement discriminator, on the other hand, ensures stability over time by discerning changes in facial shape across consecutive frames, which helps the network maintain consistent facial alignments throughout a temporal sequence.

Identity Displacement Discriminator
The identity displacement discriminator (IDD) determines whether the estimated changes in facial shape align with the facial shape corresponding to the regressed facial identity parameter. To train the IDD, we calculate the difference between the facial landmarks and the landmarks estimated without identity information. To this end, the facial shape without identity is built from the expression basis $A_{exp}$ and the expression parameters $\alpha_{exp}$ only (Equation (3)), which gives the landmark displacement that depends on the identity parameter. Facial landmarks are obtained from the projected facial vertices, which are in image coordinates. To facilitate comparison, both the ground-truth and estimated landmarks are normalized to the range [0, 1] before the difference is computed, so that the IDD can effectively discern facial shape changes due to variations in identity. The input for the IDD is then the discrepancy between the normalized landmarks: $x_{ID}$ is the landmark difference computed from the ground-truth landmarks, and $x_{z,ID}$ is the landmark difference computed from the estimated landmarks. To make the IDD learn the stabilized displacement based on the identity parameter, we use $x_{ID}$ as the real distribution and $x_{z,ID}$ as the fake distribution when defining the IDD loss.

Expression Displacement Discriminator
Similarly to the IDD, the expression displacement discriminator (EDD) is trained to distinguish facial shape changes based on the validity of the expression parameter. Similarly to Equation (3), we calculate the facial shape without expression, $S_{ID}$, by replacing $A_{exp}$ and $\alpha_{exp}$ with $A_{id}$ and $\alpha_{id}$, and we denote its projected vertices by $v_{id}$. The expression-based landmark displacement of the estimated face is then defined as $x_{z,ED} = v(:, l) - v_{id}(:, l)$, and $x_{ED}$ is the corresponding displacement computed from the ground-truth landmarks. During the training of the EDD, the ground-truth displacement $x_{ED}$ is used as the real data distribution, and the estimated displacement $x_{z,ED}$ is used as the fake distribution when defining the EDD loss.

Temporal Displacement Discriminator
The IDD and EDD are responsible for stabilizing the facial alignment network in the static domain. To further enhance the temporal stabilization performance, we introduce a temporal displacement discriminator (TDD) to guide the changes in the temporal facial shape across frames. The input for the TDD is derived from the variation in facial landmarks between the current and previous frames. Facial temporal changes are assessed by calculating the difference between the landmarks of the current frame and those of the previous frame: $x_{z,TD} = v(:, l) - v'(:, l)$ for the estimated face and, correspondingly, $x_{TD} = U - U'$ for the ground truth, where $v'$ and $U'$ are the projected vertices and the ground-truth landmarks of the previous frame, respectively. The temporal discriminator loss is defined over these two displacements, with $x_{TD}$ as the real distribution and $x_{z,TD}$ as the fake distribution.
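The discriminator inputs described in the three subsections above can be illustrated with a short pure-Python sketch. It is an illustration under stated assumptions, not the authors' implementation: landmarks are normalized by dividing by the side length of a square 256-pixel crop, the "landmarks without identity" are supplied as a given toy array, and the helper names `normalize` and `displacement` are hypothetical. Only the identity and temporal cases are shown; the expression case follows the same pattern with the landmarks rendered without expression.

```python
def normalize(points, image_size=256.0):
    """Map image-space landmarks to [0, 1] (assumes a square crop of image_size pixels)."""
    return [(x / image_size, y / image_size) for x, y in points]

def displacement(points_a, points_b):
    """Per-landmark difference a - b, flattened into one input vector."""
    out = []
    for (ax, ay), (bx, by) in zip(points_a, points_b):
        out.extend([ax - bx, ay - by])
    return out

# Toy data with two landmarks (the paper uses 68).
gt_now = normalize([(128.0, 64.0), (192.0, 128.0)])            # U, current frame
est_now = normalize([(130.0, 64.0), (192.0, 132.0)])           # v(:, l), current frame
est_no_identity = normalize([(120.0, 60.0), (180.0, 120.0)])   # landmarks without identity
gt_prev = normalize([(126.0, 64.0), (190.0, 128.0)])           # U', previous frame
est_prev = normalize([(124.0, 66.0), (192.0, 126.0)])          # v'(:, l), previous frame

# Identity displacement: ground truth is "real", estimate is "fake".
x_id = displacement(gt_now, est_no_identity)
x_z_id = displacement(est_now, est_no_identity)

# Temporal displacement: frame-to-frame motion of ground truth vs. estimate.
x_td = displacement(gt_now, gt_prev)
x_z_td = displacement(est_now, est_prev)
```

In this toy case the estimated temporal displacement `x_z_td` moves more than the ground-truth displacement `x_td`, which is the kind of jitter the TDD is meant to penalize.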

Adversarial Loss Function
These multiple discriminators (i.e., IDD, EDD, and TDD) are trained to discern the validity of identity, expression, and temporal changes in facial shape. Concurrently, the facial alignment network is trained to deceive these discriminators. The overall adversarial loss combines the individual discriminator losses, denoted as $L_{D_{ID}}$, $L_{D_{ED}}$, and $L_{D_{TD}}$, weighted by $\lambda_{ID}$, $\lambda_{ED}$, and $\lambda_{TD}$, which are factors for balancing the loss terms. The resulting loss function, Equation (15), is used to train the IDD, EDD, and TDD.
The total loss for the facial alignment network ($G$) combines the alignment loss and the adversarial losses. As in conventional GAN training, we freeze the discriminators (IDD, EDD, and TDD) when training the facial alignment network, so the discriminators and the facial alignment network are trained alternately. In our experiments, we used balancing factors $\lambda_{ID} = \lambda_{ED} = \lambda_{TD} = 0.1$, and the same network architecture was employed for all discriminators. To assess stability, each discriminator takes the landmark difference as input and passes it through three fully connected layers, which ultimately produce a single scalar value ranging from 0 to 1.
For better understanding, Figure 2 shows a block diagram of the training procedure of the proposed displacement learning method. The IDD ($D_{ID}$) and EDD ($D_{ED}$) are trained to judge the estimated facial landmark displacements ($x_{z,ID}$ and $x_{z,ED}$) as unstable (0) and the ground-truth landmark displacements ($x_{ID}$ and $x_{ED}$) as stable (1). The TDD ($D_{TD}$) is trained to judge the temporal displacement of the estimated faces, $x_{z,TD}$, as unstable (0) and that of the ground-truth faces, $x_{TD}$, as stable (1). To deceive these identity, expression, and temporal displacement discriminators, the facial alignment network is trained so that the discriminators output stable (1) for the estimated face ($x_{z,ID}$, $x_{z,ED}$, and $x_{z,TD}$). In a single training iteration, we first train the displacement discriminators while freezing the facial alignment network and then train the facial alignment network while freezing the displacement discriminators.
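The labeling scheme described above (ground-truth displacement scored as 1, estimated displacement scored as 0, and an alignment network that tries to obtain 1 for its own estimates) can be sketched with a binary cross-entropy objective. The cross-entropy form is an assumption, since the text only specifies the 0/1 targets and the 0.1 weights; the function names `bce`, `discriminator_loss`, and `generator_loss` are hypothetical, and the scores are toy numbers rather than network outputs.

```python
import math

EPS = 1e-7  # keeps log() away from 0

def bce(prediction, target):
    """Binary cross-entropy for a single score in (0, 1)."""
    p = min(max(prediction, EPS), 1.0 - EPS)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

def discriminator_loss(score_real, score_fake):
    """Ground-truth displacement should score 1 (stable), estimated 0 (unstable)."""
    return bce(score_real, 1.0) + bce(score_fake, 0.0)

def generator_loss(landmark_loss, fake_scores, weights):
    """Alignment loss plus weighted terms that reward fooling each discriminator."""
    adversarial = sum(w * bce(s, 1.0) for s, w in zip(fake_scores, weights))
    return landmark_loss + adversarial

weights = (0.1, 0.1, 0.1)  # lambda_ID, lambda_ED, lambda_TD from the text

# A discriminator that separates real from fake has a lower loss than one that guesses.
d_separating = discriminator_loss(score_real=0.9, score_fake=0.1)
d_guessing = discriminator_loss(score_real=0.5, score_fake=0.5)

# With the discriminators frozen, the alignment network is rewarded when its
# estimated displacements are scored as stable by the IDD, EDD, and TDD.
g_fooling = generator_loss(0.02, (0.9, 0.9, 0.9), weights)
g_caught = generator_loss(0.02, (0.1, 0.1, 0.1), weights)
```

In an alternating schedule, `discriminator_loss` would be minimized with the alignment network frozen, followed by `generator_loss` with the discriminators frozen.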

Implementation Details
In our experiments, we utilized the 300VW dataset [6], a large-scale facial tracking dataset containing 114 videos with a total of 218,595 frames, each annotated with 68-point landmark labels. Among these videos, 50 were used for training, and the remaining 64 were designated for testing. The test videos were further categorized into A, B, and C sets, with C being the most challenging test subset.
During the training phase, each frame was cropped using the ground-truth landmarks and resized to 256 × 256 pixels, serving as input for the facial alignment network. We employ the ResNet-18 backbone [18] for the facial alignment network. Figure 3 shows the details of the facial alignment network, identity displacement discriminator, expression displacement discriminator, and temporal displacement discriminator for reproducibility. In the facial alignment network, we add a splitting layer and four fully connected layers at the end of the ResNet backbone to estimate the 3DMM and camera parameters. In the splitting layer, the output feature vector is split into 990-dimensional and 35-dimensional feature vectors, in proportion to the 228 3DMM parameters and 8 camera parameters to be estimated. The 990-dimensional feature vector is fed into two fully connected layers with 228 nodes each to estimate the 3DMM parameters. The output is a 228-dimensional vector, of which 199 dimensions are used for the identity parameters and 29 dimensions for the expression parameters. Similarly, the 35-dimensional feature vector is fed into two fully connected layers, with 16 and 8 nodes, respectively, to estimate the camera parameters. ReLU activation functions are employed in all layers of the facial alignment network except the last layer, which has no activation function. For the displacement discriminators, we use four fully connected layers with 256, 128, 64, and 1 nodes, respectively. All of the displacement discriminators are constructed with the same architecture. Table 1 summarizes the architecture of the displacement discriminator.
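The layer sizes quoted above can be checked with a little arithmetic. The sketch below assumes that the split feature vector has 990 + 35 = 1025 dimensions (implied by the two branch sizes, not stated directly) and that each discriminator receives a flattened 68 × 2 = 136-dimensional landmark displacement; both are assumptions, and the helper names are hypothetical.

```python
def proportional_split(feature_dim, group_sizes):
    """Divide feature_dim among groups in proportion to their parameter counts."""
    total = sum(group_sizes)
    return [round(feature_dim * size / total) for size in group_sizes]

def dense_parameter_count(input_dim, layer_sizes):
    """Weights plus biases of a stack of fully connected layers."""
    count = 0
    previous = input_dim
    for size in layer_sizes:
        count += previous * size + size
        previous = size
    return count

# Splitting layer: 228 3DMM parameters vs. 8 camera parameters.
feature_dim = 990 + 35                      # assumed total width of the split feature
branches = proportional_split(feature_dim, [228, 8])

# 3DMM output: identity and expression parts add up to the stated 228 dimensions.
identity_dims, expression_dims = 199, 29
shape_dims = identity_dims + expression_dims

# Discriminator: four fully connected layers on an assumed 136-d input.
discriminator_params = dense_parameter_count(68 * 2, [256, 128, 64, 1])
```

Under these assumptions the proportional split reproduces the stated 990 and 35 dimensions, and each discriminator is a small network of roughly 76 thousand parameters.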
To improve the network's temporal robustness, the frame interval between the current and previous frames was randomly sampled from the range of one to six during training. From the second frame onward, each frame was cropped using the landmarks estimated from the previous frame. In the testing phase, the first frame was cropped based on landmarks detected using a conventional face and landmark detection algorithm, MTCNN [19]. Each subsequent frame was cropped using the landmarks estimated from the previous frame. The proposed method used in all experiments was trained for 500 epochs using TensorFlow (version 2.10.0), cuDNN (version 8.1), and CUDA (version 11.2). We employed the Adam optimizer and trained the model on a single NVIDIA GeForce RTX 2080 Ti (11 GB) GPU with a batch size of 20. The learning rate was set to 0.001 during the initial training phase and was gradually decreased to 0.00001 over time.
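Two of the training details above lend themselves to a short sketch: sampling a previous/current frame pair with a random interval of one to six, and decaying the learning rate from 0.001 to 0.00001 over 500 epochs. The uniform sampling and the exponential shape of the decay are assumptions (the text only says the interval is random and the rate decreases gradually), and `sample_frame_pair` and `learning_rate` are hypothetical helper names.

```python
import random

def sample_frame_pair(num_frames, rng, max_interval=6):
    """Pick (previous, current) frame indices whose gap is uniform in [1, max_interval]."""
    if num_frames <= max_interval:
        raise ValueError("video is too short for the requested interval range")
    interval = rng.randint(1, max_interval)          # inclusive on both ends
    current = rng.randrange(interval, num_frames)    # leaves room for the previous frame
    return current - interval, current

def learning_rate(epoch, total_epochs=500, start=1e-3, end=1e-5):
    """Exponential decay from start to end (the schedule shape is an assumption)."""
    progress = epoch / (total_epochs - 1)
    return start * (end / start) ** progress

rng = random.Random(0)  # fixed seed so the sketch is repeatable
pairs = [sample_frame_pair(100, rng) for _ in range(1000)]
schedule = [learning_rate(epoch) for epoch in range(500)]
```

Varying the interval exposes the temporal discriminator to both small and large frame-to-frame motion, which matches the stated goal of temporal robustness.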

Performance Evaluation
For this evaluation, we compared our method against other state-of-the-art facial alignment techniques, namely 3DDFA [20], RingNet [21], DSFNet [22], and SADRNet [23]. To quantitatively assess the performance, we measured the normalized mean error (NME) of the 2D facial landmarks. The NME is calculated as the average landmark error divided by the facial bounding box size, as in previous facial alignment methods [24,25]. The facial bounding box size is defined as the square root of the product of the width and height of the rectangular hull formed by all the landmarks.
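The NME definition above translates directly into a few lines of Python. This sketch assumes that the rectangular hull is computed from the ground-truth landmarks and that the per-landmark error is the Euclidean distance; the function name `normalized_mean_error` and the toy coordinates are illustrative.

```python
import math

def normalized_mean_error(predicted, ground_truth):
    """Mean landmark error divided by sqrt(width * height) of the landmark hull."""
    xs = [x for x, _ in ground_truth]
    ys = [y for _, y in ground_truth]
    width = max(xs) - min(xs)
    height = max(ys) - min(ys)
    box_size = math.sqrt(width * height)
    errors = [
        math.hypot(px - gx, py - gy)
        for (px, py), (gx, gy) in zip(predicted, ground_truth)
    ]
    return sum(errors) / len(errors) / box_size

# Four toy landmarks spanning a 100 x 100 box; one prediction is off by 5 pixels.
gt = [(0.0, 0.0), (100.0, 0.0), (0.0, 100.0), (100.0, 100.0)]
pred = [(3.0, 4.0), (100.0, 0.0), (0.0, 100.0), (100.0, 100.0)]
nme = normalized_mean_error(pred, gt)
nme_percent = 100.0 * nme
```

Here the mean error is 1.25 pixels on a box of size 100, so the NME is 0.0125, or 1.25% of the bounding box size.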
For the qualitative comparison, we visualize some examples of the 3D face alignment outcomes on 300VW-A, 300VW-B, and 300VW-C in Figure 4. On the 300VW-A set, the easiest dataset, all comparison methods, including ours, show similar alignment performance.
In contrast, our method shows significantly better performance than the comparison methods on the 300VW-C dataset. Note that 300VW-C is the most challenging dataset because of its fast motion and extreme lighting conditions. Specifically, our method produces more accurate alignment results along the face contour and the mouth. In summary, the results show that our method outperforms the state-of-the-art face alignment methods in all cases of the 300VW dataset.
We also evaluate the performance of face alignment in cases where a part of the face is occluded. Figure 5 shows the results of face alignment under occlusion. 3DDFA [20] and RingNet [21] often fail to align the 3D face when significant occlusion occurs. DSFNet [22] and SADRNet [23] exhibit substantial alignment errors when the face is occluded. In particular, when most of the facial region is occluded, both methods have large errors in the rotation, translation, and scale estimations. In contrast, our proposed method demonstrates stable face alignment even under significant occlusion, providing accurate results for rotation, translation, and scale estimation. This result shows that our method outperforms the other methods by achieving stable alignment performance in extreme cases.
In addition, we quantitatively evaluate face alignment accuracy using the NME of the 2D facial landmarks, i.e., the landmark error normalized by the size of the facial bounding box [15], where the size is the square root of the height × width of the rectangular hull calculated from all landmarks. The quantitative evaluation is summarized in Table 2, with values expressed as a percentage of the bounding box size. It shows that our method outperforms the other state-of-the-art face alignment methods. On the 300VW-A set, our method achieves a 14.34% accuracy improvement over 3DDFA [20], which has the lowest accuracy. In addition, our method achieves accuracy improvements of 16.01% and 19.27% over 3DDFA [20] on the 300VW-B and 300VW-C datasets, respectively. This shows that our method has a distinct advantage over the comparison methods in the challenging tracking case (300VW-C), which is consistent with the results in Figure 4.

Ablation Study
The proposed method is inspired by the fact that statistical facial parameter models, such as FLAME and the 3DMM, independently model facial movements with respect to identity, expression, and temporal aspects. Thus, the proposed method separately models identity shape, expression shape, and temporal movements with three different discriminators. To assess this design, we conducted eight ablation tests, covering every combination of the three discriminators.
In the baseline experiment, we trained the facial alignment network without incorporating any discriminator. Subsequently, we added each discriminator to the baseline model and evaluated its individual contribution. The ablation tests were performed by measuring the accuracy of facial alignment, represented by the NME. The results of these tests are summarized in Table 3. The results indicate that each individual discrimination of identity, expression, and temporal changes contributes significantly to performance improvements. Notably, temporal discrimination plays the most crucial role in achieving stable facial alignments over time, while identity discrimination has the least impact. Comparing the outcomes in Table 3 shows that employing the temporal, identity, and expression discriminators simultaneously provides substantial benefits in obtaining stable 3D facial alignments. The ablation tests are therefore consistent with treating facial identity, expression, and temporal movements as independent factors, and they demonstrate that using all discriminators together significantly boosts performance.

Discussion and Conclusions
In this paper, we present a robust and precise facial alignment framework by introducing multiple stability discriminators. Guided by these discriminators, the facial alignment network estimates the camera, face identity, and expression parameters from an input image simultaneously. The proposed framework comprises a facial alignment network and three displacement discriminators: identity (IDD), expression (EDD), and temporal (TDD) discriminators. Previous temporal smoothing schemes use a local average to suppress outlier alignment results. This effectively reduces the alignment error of the outlier frame but causes unwanted alignment errors in nearby frames because of the local averaging. In contrast, the proposed discriminator-based method can effectively reduce the alignment error in the outlier frame without propagating alignment errors to other frames. This is possible because the discriminators distinguish unnatural and unstable facial movements based on facial identity, expression, and temporal cues by comparison with the ground-truth facial movement. To evaluate the performance of the proposed discriminators, we conducted qualitative and quantitative assessments using the 300VW dataset, a large-scale facial tracking dataset. The experimental results demonstrate significant improvements over state-of-the-art methods, showcasing the effectiveness of our approach in achieving accurate and stable facial alignment over time.
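The drawback of local averaging mentioned above can be seen in a toy example. The sketch below applies a three-frame moving average to a per-frame alignment error that is zero everywhere except for one outlier frame; the window size, the error values, and the function name `moving_average` are illustrative assumptions and do not reproduce any specific smoothing method from the literature.

```python
def moving_average(values, radius=1):
    """Average each value with its neighbors inside a (2 * radius + 1)-frame window."""
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - radius)
        hi = min(len(values), i + radius + 1)
        window = values[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed

# Per-frame alignment error with a single outlier at frame 3.
errors = [0.0, 0.0, 0.0, 9.0, 0.0, 0.0, 0.0]
smoothed = moving_average(errors)

outlier_before, outlier_after = errors[3], smoothed[3]
neighbors_before = (errors[2], errors[4])
neighbors_after = (smoothed[2], smoothed[4])
```

The outlier error drops from 9 to 3, but the two neighboring frames, which were previously error-free, each pick up an error of 3. This is the error propagation that the discriminator-based training is intended to avoid.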
However, the main bottleneck in our method is that displacement discrimination is performed on 2D facial landmarks. This is because there is no publicly available video-based dense 3D face dataset. Since 2D facial landmarks provide only sparse information on the facial shape, more detailed facial deformations, such as facial wrinkles, cannot be represented using the landmarks. Therefore, the loss of information in facial details is a limitation of our work. We believe that when the proposed method is trained using a video-based dense 3D face dataset, it will exhibit stable temporal alignment performance while generating facial details. In future research, we plan to extend landmark displacement discrimination to dense displacement discrimination by employing a self-supervised method. This may accurately represent changes in facial details as well as facial shape over time. Lastly, we hope that our work will be valuable in various facial applications, including facial recognition [26][27][28], facial animation [29,30], and VR communication [31,32].

Data Availability Statement: Not applicable.
Acknowledgments: The authors would like to thank the editor and the reviewers for their contributions.

Conflicts of Interest: The authors declare no conflict of interest. The funders had no role in the design of this study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript:
IDD     Identity displacement discriminator
EDD     Expression displacement discriminator
TDD     Temporal displacement discriminator
3D      Three-dimensional
2D      Two-dimensional
GAN     Generative adversarial network
PCA     Principal component analysis
SVD     Singular value decomposition
TPS     Thin-plate splines
3DMM    3D morphable model