An Adaptive Face Tracker with Application in Yawning Detection

In this work, we propose an adaptive face tracking scheme that compensates for possible face tracking errors during its operation. The proposed scheme is equipped with a tracking divergence estimate, which allows face tracking errors to be detected early and minimized, so that the tracked face is not lost indefinitely. When the estimated face tracking error increases, a resyncing mechanism based on Constrained Local Models (CLM) is activated to reduce the tracking errors by re-estimating the locations of the tracked facial features (e.g., facial landmarks). To improve the CLM feature search mechanism, a Weighted-CLM (W-CLM) is proposed and used in resyncing. The performance of the proposed face tracking method is evaluated in the challenging context of driver monitoring, using yawning detection and talking video datasets. Furthermore, an improvement to a yawning detection scheme is proposed. Experiments suggest that our proposed face tracking scheme obtains better performance than comparable state-of-the-art face tracking methods and can be successfully applied to yawning detection.


Introduction
Object visual tracking essentially deals with locating, identifying, and determining the dynamics of moving (possibly deformable) target objects in various areas such as car tracking [1], face detection [2], and driver monitoring [3]. Representational methods have been applied successfully to reduce dimensionality and improve the discriminative ability in classification problems [4]. Some visual object tracking methods applied representation-based methods with pre-computed fixed appearance models [5]; however, the visual appearance of the tracked target object may change over time, and for this reason such methods may stop tracking the target object after a period of time when the tracking conditions change (e.g., scene illumination changes, occlusions). Some authors proposed to use the data generated during the tracking process to accommodate possible target appearance changes, such as in online learning [6], incremental learning for visual tracking (IVT) [7], a patch-based approach with online representation of samples [8], and online feature learning techniques based on dictionaries [1]. Often, online visual tracking methods tend to miss the target object in complex scenarios, such as when the head pose changes while tracking faces, in cluttered backgrounds, and/or under object occlusions [9]. The reasons for this behaviour include the inability to assess the tracking error and to update the object appearance at runtime. To approach these issues, Kim et al. [10] utilized a constrained generative approach to generate generic face poses in a particle filtering framework, and a pre-trained SVM classifier to discard poorly aligned targets. Furthermore, correlation-filter-based methods have become popular in visual object tracking [11]. Li et al. proposed a multi-view model for visual tracking via correlation filters (MCVFT), which fuses multiple features [11].

The main contributions of this work are:

• A face tracker that can track the face and facial landmarks in challenging conditions.
• The proposed tracking scheme utilizes the tracked target face samples collected during tracking to update the appearance model online, adapting to the shape and appearance changes of the tracked face over time.
• A dynamic error prediction scheme to evaluate the correctness of the tracking process during face tracking.
• The utilization of a resyncing mechanism based on the Constrained Local Models (CLM) when the error predictor indicates a high error.
• An improvement of the classical CLM approach, namely the Weighted CLM (W-CLM), to improve the facial landmark localization.
• An improvement of a yawning detection scheme by using facial landmarks and imposing multiple conditions to avoid false positives.
The remainder of this paper is organized as follows. The proposed methodology is described in Section 2, followed by our experimental results in Section 3. Finally, Section 4 gives our conclusions and the future prospects of this work.


Proposed Adaptive Face Tracking Method

Figure 1 shows the block diagram of the proposed face tracking method, and the functions of its blocks are explained below:

• Block 1: In the first video frame, the initial target face, its affine parameters and the landmarks are localized using W-CLM (for details on W-CLM, see Section 2.2).
• Block 2: In order to track the target face in the subsequent video frames, new affine parameter values are drawn around the affine parameter values of the initial/tracked target face in the previous video frames (see details in Section 2.3).
• Block 3: The previously computed affine parameters are used to warp the candidate target face samples of size u × u from the current video frame.
• Block 4: If a specific number (τ) of new target face samples has been gathered, the eigenbases are built.
• Block 5: If the condition in Block 4 is satisfied, the candidate target face samples are decomposed into patches (v × v, with v ≤ u), because the eigenbases are built using patches (see Section 2.1).
• Block 6: The tracked target face is found among the candidate target face samples by maximizing the likelihood function in Equation (10).

The proposed tracking algorithm is able to track non-rigid objects such as faces and to detect early potential tracking deviations from the tracked target object. The incremental update of the tracking process parameters is inspired by the incremental PCA approach [7]; however, the proposed method uses local texture information (patches of size v × v) rather than global information (the target object as a whole) to build the eigenbases, as explained in Section 2.1. A description of the W-CLM scheme and how it is used as a resyncing mechanism is given in Section 2.2. How the proposed tracking method is applied to face and facial landmark tracking is explained in Sections 2.3 and 2.4.
Incremental Update of the Eigenbases and the Mean

Recomputing the eigenbases from scratch every time that new data is received is time-consuming (and impractical) for applications such as object tracking, so the incremental update of the eigenbases tends to be more interesting. Let A be the matrix of previously observed (vectorized) face samples, with SVD A = UCV^T, and let B be the matrix of newly received samples. The concatenation of A and B can be expressed in a partitioned form, in a way that utilizes the previously computed SVD of A, as follows [7]:

$$[A\;\; B] = [U\;\; \tilde{B}]\begin{bmatrix} C & U^{T}B \\ 0 & \tilde{B}^{T}\hat{B} \end{bmatrix}\begin{bmatrix} V & 0 \\ 0 & I \end{bmatrix}^{T},$$

where B̂ = B − UU^T B is the component of B orthogonal to the subspace spanned by U, B̃ represents the new eigenbases associated with the newly received data matrix B (the orthonormal basis of B̂), and I is the identity matrix. The middle partitioned matrix can be expressed more conveniently through its SVD [7]:

$$\begin{bmatrix} C & U^{T}B \\ 0 & \tilde{B}^{T}\hat{B} \end{bmatrix} = \tilde{U}\tilde{C}\tilde{V}^{T}.$$

Finally, U' = [U B̃]Ũ and C' = C̃ are the new eigenvectors (eigenbases) and singular values, respectively, which consider the new data in B. Since only U' and C' will be utilized in the proposed tracking scheme, V' is disregarded from now on. Furthermore, only the desired number of eigenvectors (γ) associated with non-zero singular values will be further processed, while the eigenvectors and singular values beyond the γ top-ranked singular values will be disregarded.

While updating the eigenbases, it is necessary to down-weight the older observations, since the more recent observations are more informative about the current appearance of the tracked target face. Therefore, a forgetting factor f (∈ [0, 1]) is multiplied by the singular values in C [7]. Since µ(t) plays a key role in the detection of the tracked target face, the mean µ(t) at time t is calculated incrementally as follows:

$$\mu(t) = \frac{f\,n\,\mu_{n} + m\,\mu_{m}}{f\,n + m},$$

where µ_n represents the mean of the data matrix A with n face samples, µ_m is the mean of the newly added observations in B, and t = m + n. An important benefit of having the forgetting factor f is that the mean µ(t) at time t can change in response to new observations, even if the total number of older observations in A is large.
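The incremental update of U, C and µ(t) described above can be sketched in numpy as follows. This is a simplified sketch of the update in [7]: the mean-shift correction column of the exact algorithm is omitted for brevity, and the function name and interface are illustrative rather than the paper's.

```python
import numpy as np

def incremental_svd(U, C, mu_a, n, B, f=0.95, gamma=16):
    """Incrementally update eigenbases U, singular values C, and the mean,
    given new column-wise observations B (one vectorized face per column).
    Sketch of the incremental PCA of Ross et al. [7]; f is the forgetting
    factor and gamma the number of retained eigenvectors."""
    m = B.shape[1]
    mu_b = B.mean(axis=1)
    # Updated mean with the forgetting factor f down-weighting older data.
    mu = (f * n * mu_a + m * mu_b) / (f * n + m)
    B_hat = B - mu[:, None]             # new data centered at the new mean
    B_res = B_hat - U @ (U.T @ B_hat)   # component orthogonal to span(U)
    B_tilde, _ = np.linalg.qr(B_res)    # orthonormal basis of the residual
    # Partitioned matrix whose SVD yields the update; f scales the old
    # singular values to down-weight older observations.
    R = np.block([[f * np.diag(C), U.T @ B_hat],
                  [np.zeros((B_tilde.shape[1], len(C))), B_tilde.T @ B_hat]])
    U_r, C_new, _ = np.linalg.svd(R, full_matrices=False)
    U_new = np.hstack([U, B_tilde]) @ U_r
    # Keep only the gamma leading eigenvectors / singular values.
    return U_new[:, :gamma], C_new[:gamma], mu, n + m
```

In the tracker, this update would be invoked once every τ frames, with B holding the τ newly tracked face samples.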

Weighted Constrained Local Model (W-CLM) as the Feature Detector Used for Resyncing
The Constrained Local Model (CLM) method tends to be an accurate facial feature detector, but it tends to converge slowly, making its use in tracking problems challenging. Nevertheless, if CLM is used sparingly in comparison with the other components of the tracking process, a CLM-based tracking system can be viable for real-time operation. In this work, the proposed tracking scheme is applied to face tracking, and a modified CLM method, namely the Weighted Constrained Local Model (W-CLM), is utilized to resync important facial features and avoid tracking failure, and also for the initialization of the tracking process. Consequently, the proposed method is potentially self-driven and self-correcting in real time.
Weights Computation: The proposed W-CLM method utilizes the CLM training data to evaluate the landmarks' consistency, assigning higher weights to more consistent landmarks during the CLM search process. Multivariate Mutual Information (MMI) evaluates the mutual dependence between two or more random variables [20], and is utilized here to evaluate the consistency of each facial landmark. Firstly, MMI is computed independently for the feature vector of each facial landmark within a temporal window. Each feature vector x_i is a column vector of size l, representing the texture information in a window of size √l × √l around the location of landmark i = 1, 2, . . ., Z in a video frame at time t. MMI evaluates the differences of the co-occurrence probabilities of the n random variables describing the local texture, indicates how consistent the texture information around a particular landmark is across the training images, and is used as a weight ŵ_i ∈ [0, 1] of that landmark. The weights ŵ_i of the landmarks are combined in a diagonal matrix to be used in the W-CLM search process. In practice, the CLM consists of two stages (modules): (1) CLM model building; (2) CLM search [18], which are discussed next:
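Before turning to those stages, the weight computation can be illustrated with the following sketch. Since the paper's exact MMI estimator is not reproduced here, this sketch uses the average pairwise mutual information between the patches of consecutive training frames as a surrogate for the multivariate mutual information, followed by min-max normalization into [0, 1]; function names and the binning choice are illustrative assumptions.

```python
import numpy as np

def pairwise_mi(x, y, bins=8):
    """Mutual information (in nats) between two equally sized intensity
    vectors, estimated from their joint histogram."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p = joint / joint.sum()
    px = p.sum(axis=1, keepdims=True)
    py = p.sum(axis=0, keepdims=True)
    nz = p > 0
    return float((p[nz] * np.log(p[nz] / (px @ py)[nz])).sum())

def landmark_weights(patches, bins=8):
    """patches: array (Z, T, l) -- for each of Z landmarks, T training
    frames, each a flattened l-pixel patch around the landmark.
    Returns weights in [0, 1]: higher means more consistent texture."""
    Z = patches.shape[0]
    w = np.zeros(Z)
    for i in range(Z):
        frames = patches[i]
        # Average MI of the patch across consecutive frames, as a
        # pairwise surrogate for the multivariate mutual information.
        mis = [pairwise_mi(frames[t], frames[t + 1], bins)
               for t in range(frames.shape[0] - 1)]
        w[i] = np.mean(mis)
    rng = w.max() - w.min()
    return (w - w.min()) / rng if rng > 0 else np.ones(Z)
```

The resulting weights would then be placed on the diagonal of the weight matrix used in the W-CLM search.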

CLM Model Building
CLM uses two models: (a) a shape model that deals with shape information, and (b) a patch model that considers local patch information. Both models are combined to represent the target object (i.e., face). Images of the cropped faces and a set of facial feature points (landmarks) are used as the training data to build the CLM face model.
In order to build the CLM shape model, all the shapes are aligned with the first (initial) shape of the training set using Procrustes analysis [21], which attenuates the adverse effects of shape variations in terms of scale, translation and rotation, leaving only the intrinsic variations of the face shape S_r. On these aligned faces, PCA is performed to capture the face shape variations (eigenvectors) in the training data, and to obtain an indication of the total face variation through the eigenvalue of each eigenvector [22]. Therefore, each shape can be written as a linear combination of the eigenvectors P and the mean shape S̄, as S_r = S̄ + P H_r, where H_r = P^T Ŝ_r is a column vector containing the coefficients of the corresponding eigenvectors in P for representing the face shape S_r in the model M.
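The shape model construction above can be sketched as follows, assuming the training shapes are given as (R, Z, 2) arrays; the variance-retention criterion for choosing the number of modes is an illustrative choice, not the paper's.

```python
import numpy as np
from scipy.spatial import procrustes  # Procrustes alignment of two shapes

def build_shape_model(shapes, var_kept=0.95):
    """shapes: (R, Z, 2) training face shapes (Z landmarks each).
    Aligns all shapes to the first one with Procrustes analysis, then
    runs PCA; returns the mean shape S_bar and eigenvector matrix P so
    that an aligned shape can be written as S_r = S_bar + P @ H_r."""
    ref = shapes[0]
    aligned = []
    for s in shapes:
        _, s_aligned, _ = procrustes(ref, s)   # remove scale/translation/rotation
        aligned.append(s_aligned.ravel())
    X = np.array(aligned)                      # (R, 2Z)
    S_bar = X.mean(axis=0)
    Xc = X - S_bar
    # PCA via SVD; each eigenvalue indicates that mode's share of variation.
    _, sv, Vt = np.linalg.svd(Xc, full_matrices=False)
    eigvals = sv**2 / (len(X) - 1)
    k = int(np.searchsorted(np.cumsum(eigvals) / eigvals.sum(), var_kept)) + 1
    P = Vt[:k].T                               # (2Z, k) eigenvectors
    return S_bar, P, eigvals[:k]

# Coefficients of a given aligned shape: H_r = P.T @ (S_r - S_bar)
```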
In order to build a patch model for each facial landmark, a linear Support Vector Machine (SVM) [23] is trained, whose response to a vectorized input patch x is given by:

$$f(x) = \Omega^{T} x + \Theta,$$

where Ω = [Ω_1, Ω_2, . . ., Ω_Z] are the weights of each dimension of the input support vectors, and Θ is a constant acting as a bias to prevent overfitting. The goal of the SVM training is to find suitable values for the weights Ω. For details on training the CLM, please refer to [18].

Weighted CLM Search Method
Given a set of initial facial landmarks, a cropped patch around the position of each landmark is classified by the patch model, while preserving the shape constraints, using the following objective function:

$$\max \;\; \sum_{i=1}^{Z} \hat{w}_i\, \imath_i(x_i, y_i) \;-\; \beta \sum_{j=1}^{o} \frac{h_j^{2}}{\lambda_j},$$

where ı_i(x_i, y_i) is the patch of size √l × √l classified by the patch model in the g × g neighborhood of the location of landmark i (g = 8 and l = 100 in our experiments), and ŵ_i is the weight that describes the impact of landmark i on the optimization process. The first term of Equation (5) is the patch model response; it is optimized using quadratic programming, and can be readily solved using the Matlab quadprog function. The second term is the shape constraint, where h_j ∈ H_r is the corresponding eigenvector coefficient in the eigenvector representation H_r = P^T Ŝ of the current shape S, λ_j is the eigenvalue corresponding to the j-th eigenvector in P, and o is the number of eigenvectors in P, whereas the parameter β ∈ [0, 1] establishes a compromise between the patch and shape models.
For each landmark, the patch model is used to compute a response patch at the landmark location in the local region, and the response patch is used to fit a quadratic function. Then, the best landmark positions are obtained by optimizing the function in Equation (5), created by combining the quadratic functions from the patch model and the shape constraints from the CLM shape model. Each landmark is then moved to its new position, and the process is repeated until the optimal landmark locations (i.e., the face shape) are obtained, or until the maximum number of iterations is reached (see details in [18]). For other promising fitting strategies, please refer to the generative shape regularization model for robust face alignment [24] and the unified embedding for face recognition and clustering [25]. In the case of a new video sequence, the mean face shape S̄ is used for the landmark initialization, but in the subsequent frames the face landmarks of the previous frame are used to initialize the landmarks.

The Proposed Tracking Method Applied to Human Faces
As mentioned before, in the current work the proposed tracking method is applied to face tracking. For face tracking, the state at time t is described by the affine parameter vector χ(t) = [x(t), y(t), s(t), θ(t), α(t), φ(t)]^T, where x(t) and y(t) represent the translation of the tracked target face with respect to the origin of the image, s(t) = M/u is the scale of the tracked target face w.r.t. the size of the image (M × N) which contains the tracked target face (u × u), whereas θ(t), α(t) and φ(t) are the rotation angle w.r.t. the horizontal axis, the aspect ratio, and the skew direction, respectively, at time t. The aspect ratio α(t) and the scale s(t) are used to keep the tracked target face in the xy image space (see details in [7]). The dynamics of each parameter in χ(t) are independently modeled by a Gaussian distribution N(·) centered at χ(t − 1), and the transition from χ(t − 1) to χ(t) is given by:

$$p(\chi(t)\,|\,\chi(t-1)) = \mathcal{N}(\chi(t);\, \chi(t-1),\, \psi(t)),$$

where ψ(t) is a diagonal matrix with each main diagonal element representing the variance of the corresponding affine parameter. Equation (6) is referred to as the motion model, because it models the motion of the tracked target face from one frame to the next. Figure 2 shows an example of the working of the motion model. The affine parameters χ(t) are represented by a point in the affine parameter space; this space is six-dimensional, and only three dimensions are shown in Figure 2. The red point in Figure 2 represents the affine parameters of the tracked target face in the previous frame. Numerous affine parameter vectors are drawn from the Gaussian distribution centered around the affine parameters associated with the tracked target face in the previous frame using Equation (6), and these affine parameters are shown as blue points in Figure 2.
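The motion model of Equation (6) amounts to sampling candidate affine parameter vectors from independent Gaussians centered at the previous state; a minimal sketch follows, where the particle count and standard deviations are illustrative values, not the paper's.

```python
import numpy as np

def draw_particles(chi_prev, sigmas, n_particles=600, rng=None):
    """Sample candidate affine parameter vectors around the previous
    state chi_prev = [x, y, s, theta, alpha, phi] (Equation (6)).
    sigmas holds the per-parameter standard deviations (the square
    roots of the diagonal of psi(t))."""
    rng = np.random.default_rng() if rng is None else rng
    chi_prev = np.asarray(chi_prev, dtype=float)
    return chi_prev + rng.normal(size=(n_particles, 6)) * np.asarray(sigmas)
```

Each sampled row is then used to warp one candidate target face sample from the current frame.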
Furthermore, these affine parameters are used to warp the candidate target face samples which may contain the tracked target face in the current frame (shown as green faces in Figure 2), to check whether they correspond to the tracked target face I(t). In order to find the tracked target face in the current frame, every candidate target face sample is represented within the space of the tracked target face I(t), which is spanned by the eigenbases U and centered at the mean µ(t), where U is obtained incrementally using the method explained in Section 2.1 [7]. The likelihood p(I(t)|χ(t)) that the candidate target face sample is the tracked target face I(t) is inversely proportional to the distance δ of the candidate target face sample to a reference point in the space (i.e., the mean µ(t) projected in the space spanned by U). This distance comprises the sample's distance to the space (δt) and the within-space distance (δw) of the projected sample to the reference point µ(t). The likelihood (p_δt) that a candidate target face sample projected in the space spanned by U corresponds to the tracked target face is approximated by the negative exponential of δt:

$$p_{\delta t}(I(t)\,|\,\chi(t)) = \exp(-\delta t),$$

where δt = ||(I(t) − µ(t)) − UU^T(I(t) − µ(t))||², ςI is the noise in the observation process, I is the identity matrix, and ideally ς → 0. It is worth mentioning that, in the initialization, the eigenbases are not yet available, because the eigenbases are only built after a specific number τ of tracked target face samples has been observed; hence, in the initialization, U = 0 and the mean µ(t) are used to estimate the likelihood p(I(t)|χ(t)) that I(t) contains the tracked target face, and Equation (7) is simplified to:

$$p_{\delta t}(I(t)\,|\,\chi(t)) = \exp\left(-\lVert I(t) - \mu(t) \rVert^{2}\right).$$

Similarly, the likelihood (p_δw) that I(t) contains the tracked target face is given by the negative exponential of the Mahalanobis distance δw:

$$p_{\delta w}(I(t)\,|\,\chi(t)) = \exp(-\delta w),$$

where δw = ||(I(t) − µ(t))^T U C^{−2} U^T (I(t) − µ(t))||².
Finally, the likelihood of a candidate target face sample I(t) being the tracked target face is given by the combination of the likelihoods p_δt and p_δw, to ensure a more reliable decision score, as follows:

$$p(I(t)\,|\,\chi(t)) = p_{\delta t}(I(t)\,|\,\chi(t))\; p_{\delta w}(I(t)\,|\,\chi(t)).$$
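The combined likelihood can be sketched as follows, with ς → 0 so the observation noise term is dropped; face samples are assumed to be flattened vectors.

```python
import numpy as np

def candidate_likelihood(I, mu, U=None, C=None):
    """Likelihood that a (flattened) candidate face sample I is the
    tracked target, following Equations (7)-(10): p = p_dt * p_dw.
    Before the eigenbases exist (U is None), the simplified Equation (8)
    is used instead."""
    d = I - mu
    if U is None:                       # initialization: U = 0
        return float(np.exp(-np.sum(d**2)))
    proj = U.T @ d
    dt = np.sum((d - U @ proj)**2)      # squared distance to the subspace
    dw = np.sum((proj / C)**2)          # within-subspace Mahalanobis term
    return float(np.exp(-dt) * np.exp(-dw))
```

The candidate with the highest returned value is selected as the tracked target face.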
The candidate target face sample with the highest likelihood in Equation (10) is selected as the tracked target face. Furthermore, the affine parameters χ(t) associated with the tracked target face are used to estimate the locations of the tracked facial landmarks:

$$\Lambda_T(t) = M(\chi(t))\,\Lambda(1) + [x(t),\, y(t)]^{T}\,\mathbf{1}^{T},$$

where M(χ(t)) is the linear part of the affine transformation defined by s(t), θ(t), α(t) and φ(t), Λ(1) are the facial landmark locations in the initial target face, and 1 is a unitary vector of length Z (the total number of landmarks). The pseudo code of the above described procedure is given in Algorithm 1.
Algorithm 1: The proposed tracking procedure applied to face tracking.

Inputs: I(t − 1), Λ_T(t − 1) and χ(t − 1) are the tracked target face sample, the facial landmarks and the affine parameters of the previous frame; C and U are the singular values and eigenvectors, respectively (empty matrices at the start, computed and updated after each τ frames); flag is the counter of frames for the batch size; Υ is 1 if there is at least one more frame to process, and 0 otherwise.

1: procedure TrackFace
2:   while (Υ = 1) do
3:     flag ← flag + 1;
4:     Draw a finite number of affine parameters centered at χ(t − 1) using Equation (6);
5:     Warp the candidate target face samples from I(t) using these affine parameters;
6:     Compute the probability of every candidate target face being the tracked target face using Equation (10);
7:     Select the candidate target face sample with the highest likelihood as the tracked target face sample I(t);
8:     Estimate the facial landmarks using Equation (11);
9:     if (flag ≥ τ) then
10:      flag ← 0;
11:      Calculate C and U (see details in Section 2.1);
12:      Update the mean µ(t) using Equation (3);
13:    end if
14:  end while
15:  return Λ_T(t), χ(t), C, U, µ(t), I(t);
16: end procedure

Tracking Error Prediction and Resyncing Mechanism
Visual tracking is prone to failure if the tracked object moves quickly or changes its appearance. If the tracking method fails, the tracking error may keep increasing and the facial tracking process may fail permanently. Most of the available methods do not provide a self-assessment of the correctness of the tracking process [5-7,26,27]. The proposed method is based on an error predictor that estimates the tracking error ε(t) at runtime. It was found experimentally that a relevant measure to predict the tracking error is the difference of the facial landmark locations in consecutive frames, represented by ∆(t) at time t, and its adequacy can be verified by observing the correlation ρ of ∆(t) with the tracking error ε(t), where ∆(t) at time t is given by:

$$\Delta(t) = \frac{1}{Z}\sum_{i=1}^{Z}\left\lVert \Lambda_T^{(i)}(t) - \Lambda_T^{(i)}(t-1) \right\rVert,$$

where Λ_T^(i)(t) is the location of the tracked landmark i at time t. The next stage of the face tracking process is to predict potential tracking failures and decide whether a resyncing is required. This is done by checking if the value of ∆(t) in Equation (12) is higher than a threshold Γ_T. A constant threshold value is not suitable for real applications, because ∆(t) may vary from one person to another due to different face sizes, closeness to the camera, and/or the number of facial landmarks used. For this reason, the median value (Γ_T = Median(∆(T))) is used as a dynamic threshold instead:

$$\Psi(t) = \begin{cases} 1, & \text{if } \Delta(t) > \Gamma_T \\ 0, & \text{otherwise,} \end{cases}$$

where ∆(T) = {∆(1), . . ., ∆(t)}, and Ψ(t) is used to indicate whether resyncing is required. Moreover, the proposed error predictor is highly correlated with the actual tracking error (see Section 3). When the error predictor indicates a substantial error, i.e., Ψ(t) = 1, the W-CLM features are used for correcting (resyncing) the tracking process by re-adjusting the tracked landmarks Λ(t). Algorithm 2 provides the pseudo code of the proposed method applied to human faces. In the first frame, the face and the facial features are initialized using the W-CLM search method.

In the other frames, Algorithm 1 is used to track the face and the facial features until the estimated tracking error increases. When the error predictor indicates a substantial error, W-CLM is used to resync the tracking process, re-locating the facial landmarks Λ_T(t) to the correct locations. This error prediction and correction scheme helps the proposed face tracker to adapt to the facial shape and appearance changes of the target over time, so that the target is not lost indefinitely. Furthermore, the detected facial landmarks Λ_T(t) are then used for further processing, such as computing the new affine parameters χ(t) and locating the tracked target face among the candidate samples I(t) in the current frame. Moreover, new eigenbases are created starting from the resynced frame, and the old data is discarded because it is no longer relevant.
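The error predictor and the dynamic median threshold can be sketched as follows; the exact norm used in Equation (12) is an assumption (here, the mean per-landmark displacement between consecutive frames).

```python
import numpy as np

def landmark_delta(lm_t, lm_prev):
    """Delta(t): mean displacement of the Z tracked landmarks between
    consecutive frames, each landmark set given as a (Z, 2) array."""
    return float(np.mean(np.linalg.norm(lm_t - lm_prev, axis=1)))

def needs_resync(deltas):
    """Psi(t) of Equation (13): resync (return 1) when the current Delta
    exceeds the running median of all Deltas observed so far."""
    gamma_t = np.median(deltas)     # dynamic threshold Gamma_T
    return 1 if deltas[-1] > gamma_t else 0
```

When `needs_resync` returns 1, the W-CLM search would be run to re-locate the landmarks.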

Experimental Results and Discussion
The YawDD [28] and Talking Face Video [29] datasets are used in the experimental evaluation. The YawDD dataset contains videos of drivers performing various tasks, such as talking/singing and yawning. The camera was installed on the dash or under the car front mirror, and the videos were taken under various illumination conditions. In total, 119 participants from different age groups, with a minimum age of sixteen years, are involved. The videos of 29 participants were recorded with the camera installed on the dash, and for the other 90 participants the camera was installed under the front mirror. The Talking Face video consists of 5000 frames obtained from a video of a person engaged in conversation, with various face movements [29]. Table 1 shows the important parameters used in the proposed tracking method, together with their ranges and optimal values, which were chosen empirically. For building the eigenbases U, the candidate/tracked target face sample is resized to u × u (u = 32) for computational efficiency, the number of eigenvectors is γ = 16, the patch size is set to v × v (v = 8), and the eigenbases are updated every five frames (τ = 5), with a forgetting factor f = 0.95. The proposed face tracking algorithm is quantitatively evaluated using the Center Location Error (CLE), which measures the distance between the center location of the tracked target face and the manually labeled center location of the target face used as the groundtruth. Furthermore, for a detailed evaluation on the YawDD dataset, six videos have been annotated manually, including the target face and the landmarks (Z = 68) on the face, nose and eyes. These videos contain different backgrounds and varied illumination. Additionally, person-specific characteristics, such as face changes, head motion, and glasses, are also included. The proposed face tracking method is tested to verify if it can track the facial landmarks consistently on these videos.
Hence, the error was measured by the root mean squared error (RMSE) between the estimated landmark locations (Λ_T) and the manually labeled groundtruth locations (Λ_G) of the landmarks as follows:

$$\varepsilon(t) = \sqrt{\frac{1}{Z}\sum_{i=1}^{Z}\left\lVert \Lambda_T^{(i)} - \Lambda_G^{(i)} \right\rVert^{2}},$$

where ε(t) represents the tracking error in the video frame at time t, and Λ_G^(i) represents the groundtruth location (x_i, y_i) of landmark i.
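This landmark RMSE can be computed directly:

```python
import numpy as np

def landmark_rmse(lm_tracked, lm_gt):
    """Tracking error epsilon(t): RMSE between the tracked landmarks and
    the groundtruth landmarks, both given as (Z, 2) arrays."""
    return float(np.sqrt(np.mean(np.sum((lm_tracked - lm_gt)**2, axis=1))))
```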

Choice of Batch Size
In object tracking methods that learn the appearance of the tracked target object incrementally, the batch size plays an important role: it defines after how many frames the appearance model is updated. Different batch sizes have been tested to optimize the performance of the proposed tracking method. The relationship of the batch size τ with the average RMSE tracking error ε_M and the number of resyncs r is shown in Figure 3 for different batch sizes (1 ≤ τ ≤ 16). The size of each triangle indicates the batch size, i.e., after how many frames the resync of the features is performed (the larger the triangle, the larger the batch size). A larger batch size (big triangles in Figure 3) requires a lower number of resyncs, but incurs higher errors, and vice versa. Conversely, small triangles tend to lie in the upper left of the plot (upper indicating a large number of resyncs, and left indicating a smaller error), which shows that more resyncs are required while the error remains low. However, frequent resyncs and updates may lower the number of frames processed per second, as shown in Table 2. Figure 3 indicates that frequent updates (small batch size) lead to a lower tracking error than a large batch size, because the most recent appearance of the face is incorporated and the resync of the features (if required) is performed after fewer frames. The optimal trade-off is the batch size that minimizes both the number of resyncs (r) and the tracking error ε, and is found by minimizing the cost:

$$c(\tau) = \kappa\,\frac{\varepsilon(\tau)}{\max_{1 \le \tau' \le n} \varepsilon(\tau')} \;+\; (1-\kappa)\,\frac{r(\tau)}{\max_{1 \le \tau' \le n} r(\tau')},$$

where c indicates the cost function, n is the total number of batch sizes tested (n = 16 in the current experiments), and κ is a bias between the tracking error ε and the number of resyncs r (κ = 0.5 in the current experiments). Figure 4 shows an example graph of the batch size and the cost function.
The objective is to minimize the cost function to achieve an optimal batch size τ; in this example, the cost function attains its minimum value (green circle) when τ is 6. Table 2. Average frames per second (fps) and number of times the resync is activated for different batch sizes τ in AFTRM and AFTRM-W, over a total of 2000 frames.
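The batch-size selection can be sketched as follows; the max-normalization of the two terms is an assumption, since the paper's exact cost normalization is not reproduced here.

```python
import numpy as np

def best_batch_size(errors, resyncs, kappa=0.5):
    """Pick the batch size (1..n) minimizing a cost that trades off the
    average tracking error against the number of resyncs; errors[i] and
    resyncs[i] correspond to batch size i + 1. The max-normalized
    weighted sum below is one plausible form of the cost function."""
    e = np.asarray(errors, float) / np.max(errors)
    r = np.asarray(resyncs, float) / np.max(resyncs)
    cost = kappa * e + (1 - kappa) * r
    return int(np.argmin(cost)) + 1, cost
```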

Figure 5 shows some examples of the tracking errors of the proposed ILFT method without the error prediction and the resyncing procedure, illustrating some video frames with the tracked target face enclosed in a bounding box and the tracked facial landmarks plotted in red, whereas the yellow facial landmarks show the ground-truth landmarks. Figure 5a shows the effect of a tilted face on the tracking process, and Figure 5b shows that bad lighting also affects the tracking process, which tends to decrease in performance when the lighting conditions change during tracking. When the face deformation is not detected correctly, it is difficult to perform facial expression analysis, as shown in Figure 5c. The illumination changes may cause the tracked target face to be confused with the background, resulting in permanent failure of the tracking process, as can be seen in Figure 5d. Often, the tracking process fails in complex scenarios because the eigenbases end up being built using slightly incorrect tracked target face samples. Nevertheless, this tracking failure can be avoided if the tracker has an estimate of the tracking error. The proposed method addresses this problem using an error predictor and a resyncing scheme. Figure 6 shows the plots of the proposed error predictor ∆(t), computed using Equation (12), and the actual tracking error ε(t) of the tracked facial landmarks. The plots in Figure 6 suggest some correlation ρ between ∆(t) and the actual tracking error ε(t), but the data is noisy and the correlation is low. Due to the noisy nature of ∆(t) and of the actual tracking error ε(t), a one-dimensional median filter of fifth order is applied on a sliding window of τ frames to smooth ∆(t) consistently (i.e., ∆(t) = {∆(t − τ), ∆(t − τ + 1), . . ., ∆(t)}), increasing the correlation between ε(t) and ∆(t), as shown in Figure 7. It can be seen that the filtered ∆(t) and ε(t) have a higher correlation, because the data is smoothed and has fewer spikes.
To further improve the tracking error prediction, a median filter of fifth order is applied over a sliding window of the τ previous values of ∆(t) (i.e., ∆̃(t) = {∆(t − τ), ∆(t − τ + 1), . . ., ∆(t)}), and the correlation between ∆̃(t) and ε(t) is improved, as can be seen in Figure 8. Using the proposed error predictor, the tracking quality can be evaluated, and the re-estimation of the tracked landmark locations uses W-CLM when Ψ(t) = 1 (see Equation (13)). Some results obtained using this error prediction and resyncing based face tracking scheme are shown in Figure 9. The proposed tracking process tends to adapt to the changes in the tracked target face and to work correctly in long video sequences, even if there is a tilt in the face (see Figure 9a), bad lighting (see Figure 9b), changes in face expression (see Figure 9c), or if the tracked face is similar to the background and under varied face expressions (see Figure 9d).
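The fifth-order median filtering of the ∆(t) sequence can be done with scipy:

```python
import numpy as np
from scipy.signal import medfilt

def smooth_delta(delta, kernel=5):
    """Fifth-order 1-D median filter applied to the Delta(t) sequence,
    removing isolated spikes before it is compared against the actual
    tracking error."""
    return medfilt(np.asarray(delta, float), kernel_size=kernel)
```

The filtered sequence can then be correlated with ε(t) (e.g., via `np.corrcoef`) to verify the predictor's adequacy.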

Quantitative Evaluation of the Proposed Face Tracking Method
Next, a quantitative comparison of the proposed AFTRM and AFTRM-W with the following methods is presented: Incremental Learning Tracking Based on Independent Component Analysis (ILICA) [5], Incremental Learning for Robust Visual Tracking (IVT) [7], Incremental Cascaded Continuous Regression (iCCR) [15], Approximate Structured Output Learning for CLM [30], MCVFT [11], DCFNet [13], and MMDL-FT and MMDL-FTU [31]. Table 3 shows the RMSE in the tracking of the facial landmarks for the proposed AFTRM and AFTRM-W and for the comparative methods. Each column indicates the average RMSE tracking error ε_M for the whole video sequence using the method specified in the first column. The last column shows the average tracking error obtained over all the tested videos. For the comparative methods, the parameters (if required) are set to the default values proposed by their respective authors. Furthermore, the initialization for Terissi et al. [5], Ross et al. [7], Wang et al. [13], Li et al. [11], MMDL-FT, MMDL-FTU [31], AFTRM and AFTRM-W is done using the W-CLM search method. Furthermore, Table 4 compares the CLE of the proposed AFTRM-W with the state-of-the-art methods on all the videos of the YawDD dataset with the camera installed on the dash. Tables 3 and 4 show that the proposed AFTRM and AFTRM-W tend to outperform the other methods, with AFTRM-W improving upon AFTRM. This is due to the weighting scheme, as consistent landmarks receive higher weights, improving the quality of the resyncing mechanism. The methods proposed by Zheng et al. [30], Sanchez et al. [15], Wang et al. [13], and our previous MMDL-FTU method [31] perform similarly to the proposed AFTRM method, whereas AFTRM-W performed better than all the other tested methods, with a smaller tracking error.
In our view, the higher tracking error presented by the comparative methods occurs because, once a tracking error is introduced, it keeps increasing and eventually the tracking process fails. We address this problem by estimating the tracking error during tracking and resyncing the facial landmarks whenever the tracking error tends to increase. For this reason, the proposed method can adapt to challenging conditions and avoid losing the tracked target indefinitely. Consequently, the proposed method can be used for consistent face and facial feature tracking in long video sequences, which can in turn be used to detect different facial expressions, such as yawning, talking, fatigue, and so on. To complement the experiments, the proposed method is tested on the Talking Face video [29]. Table 5 compares the proposed method with the comparative methods using the CLE and RMSE measures. The experimental results show a similar trend: both the proposed and the comparative methods perform well on the Talking Face video [29]. The method of Ross et al. [7] performs particularly well on the Talking Face video because of its effectiveness in static background conditions. Nevertheless, AFTRM-W performs better than all the comparative methods, which demonstrates the efficiency of the proposed method and its effectiveness in face tracking.
Faces and facial landmarks can be used as cues for many facial analysis applications, such as yawning detection, talking detection, and facial expression recognition. In this paper, we evaluate the effectiveness of the proposed face and facial landmark tracking in the context of yawning detection, which is explained next.

Evaluation of the Proposed Face Tracking Method in Yawning Detection
The accurate detection of facial landmarks is a requirement for many facial analysis applications, such as human emotion analysis and fatigue detection. To demonstrate the effectiveness of the facial landmark features detected by the proposed tracker, we apply them in a facial analysis application, namely a yawning detection scheme. One of the most common uses of yawning detection is in driver fatigue detection, where yawning is one important cue, among others, for detecting fatigue in drivers [3]. Yawning detection is used here as a case study to evaluate the proposed tracking method in a practical face tracking problem in which the local face appearance changes. The proposed method takes inspiration from the yawning detection approach of Omidyeganeh et al. [3], which is based on backprojection theory and detects yawning from the pixel counts in the binary mouth blocks of the current and reference frames. To obtain a binary image, pixel values greater than a threshold Γ_0 receive a value of 1 (named 'white pixels') and 0 (named 'black pixels') otherwise. The proposed method improves on the method of Omidyeganeh et al. [3] in two ways. Firstly, the proposed method uses only the pixels inside the lips to measure the mouth openness in the binary image (see in Figure 10 that only the pixels inside the white region are used), whereas [3] uses a rectangular mouth block that includes some pixels outside the lips to detect yawning.
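The binarization and lip-restricted pixel counting described above can be sketched as follows. This is a minimal illustration, assuming a grayscale mouth crop and a boolean mask marking the pixels inside the lip contour; the function name and interface are illustrative, not from the paper.

```python
import numpy as np

def binarize_mouth(gray, lip_mask, gamma0):
    """Binarize the mouth region with threshold gamma0 and count
    only the pixels inside the lip contour.

    gray:     2-D grayscale mouth crop.
    lip_mask: boolean array of the same shape; True inside the lips.
    gamma0:   binarization threshold (Γ_0 in the text).
    Returns the binary image and the white/black counts inside the lips.
    """
    binary = (gray > gamma0).astype(np.uint8)  # 1 = 'white', 0 = 'black'
    white = int(np.count_nonzero(binary[lip_mask]))
    black = int(np.count_nonzero(lip_mask)) - white
    return binary, white, black
```

Restricting the counts to `lip_mask` is precisely the first improvement over [3]: pixels outside the lips, which a rectangular mouth block would include, do not contribute to the openness measure.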
Secondly, yawning is detected in a video frame if the following three conditions are satisfied: (1) the ratio of the number of black pixels in the current frame (NBC) to that in the reference frame (NBR) is greater than Γ_1 (i.e., NBC/NBR > Γ_1); (2) the ratio of the number of black pixels to the number of white pixels (NWC) in the current frame is greater than Γ_2 (i.e., NBC/NWC > Γ_2); and (3) the ratio of the vertical distance between the midpoints (VD) to the distance between the corner points (HD) of the mouth is greater than Γ_3 (i.e., VD/HD > Γ_3). The first frame is used as the reference in the proposed scheme and is assumed to contain a closed mouth. Using the pixels of the reference frame tends to minimize scale issues when using conditions 2 and 3, which use only the pixels within the mouth region of the current frame. The proposed yawning detection scheme is evaluated in terms of: the True Positive Rate (TPR), which is the rate of yawning frames correctly detected as yawning, i.e., TPR = TP/(TP + FN); the True Negative Rate (TNR), which is the rate of non-yawning frames correctly detected as non-yawning, i.e., TNR = TN/(FP + TN); the False Positive Rate (FPR), which is the rate of non-yawning frames falsely detected as yawning; the False Negative Rate (FNR), which is the rate of yawning frames falsely detected as non-yawning; and the Correct Detection Rate (CDR), defined as CDR = (TPR + TNR)/(TPR + TNR + FPR + FNR). Table 6 shows a comparison of the proposed method, using the data provided by AFTRM and AFTRM-W, with state-of-the-art methods in yawning detection, including Chiang et al. [32], Bouvier et al. [33], and Omidyeganeh et al. [3]. The proposed method tends to outperform the comparative methods on the YawDD dataset [28]. Furthermore, the proposed method has a higher TPR, which indicates its effectiveness. The threshold values for Γ_1, Γ_2, and Γ_3 are set to 1, 0.5, and 2.5, respectively.
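The three-condition decision rule above reduces to a conjunction of ratio tests. The following sketch encodes it directly, with the thresholds defaulting to the values reported in the text (Γ_1 = 1, Γ_2 = 0.5, Γ_3 = 2.5); the function name is illustrative, not from the paper.

```python
def detect_yawn(nbc, nbr, nwc, vd, hd, gamma1=1.0, gamma2=0.5, gamma3=2.5):
    """Flag a frame as yawning only if all three conditions hold:

    (1) NBC/NBR > Γ_1  -- more dark (open-mouth) pixels than in the
                          closed-mouth reference frame;
    (2) NBC/NWC > Γ_2  -- dark pixels dominate within the lips of
                          the current frame;
    (3) VD/HD  > Γ_3   -- the mouth is vertically elongated.
    """
    return (nbc / nbr > gamma1) and (nbc / nwc > gamma2) and (vd / hd > gamma3)
```

Because all three conditions must hold simultaneously, a talking frame that satisfies only the pixel-count ratios (but not the geometric ratio VD/HD) is not flagged as yawning.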

Conclusions
A new adaptive face tracking scheme has been proposed, which tends to reduce face tracking errors by using a tracking divergence estimate and a resyncing mechanism. This resyncing mechanism adaptively locates the tracked facial features (e.g., facial landmarks), which tends to reduce the tracking errors and to avoid missing the tracked face indefinitely. The proposed Weighted Constrained Local Model (W-CLM) method improves the CLM feature search mechanism by assigning higher weights to more robust facial landmarks, and is used in resyncing.
The performance of the proposed face tracking method was evaluated on the driver video sequences of the YawDD dataset and on the Talking Face video dataset. Both datasets contain significant changes in illumination and head positioning. Our experiments suggest that the proposed face tracking scheme can potentially perform better than comparable state-of-the-art methods and can be applied in yawning detection, obtaining higher Correct Detection Rates (CDRs) and True Positive Rates (TPRs) than comparable methods available in the literature. In the future, we intend to extend our work to develop a tracker for a more general class of non-rigid objects.
Author Contributions: This research article is the result of the contribution of A.K. and J.S. A.K. contributed to the preparation of the research methodology, formal analysis, writing the original draft, software, and review. J.S. had the role of supervisor and helped in preparing the research methodology, formal analysis, writing the original draft, reviewing, and project administration. All authors have read and agreed to the published version of the manuscript.
Funding: CAPES (Coordenação de Aperfeiçoamento de Pessoal de Nível Superior, Brazil) and Sidia Instituto de Ciencia e Tecnologia provided the financial support for this project.