Augmented EMTCNN: A Fast and Accurate Facial Landmark Detection Network

Abstract: Facial landmarks represent prominent feature points on the face that can be used as anchor points in many face-related tasks. So far, a lot of research has been done with the aim of achieving efficient extraction of landmarks from facial images. Employing a large number of feature points for landmark detection and tracking usually requires excessive processing time. On the contrary, relying on too few feature points cannot accurately represent diverse landmark properties, such as shape. To extract the 68 most popular facial landmark points efficiently, in our previous study, we proposed a model called EMTCNN that extended the multi-task cascaded convolutional neural network for real-time face landmark detection. To improve the detection accuracy, in this study, we augment the EMTCNN model by using two convolution techniques: dilated convolution and CoordConv. The former makes it possible to increase the filter size without a significant increase in computation time. The latter enables the spatial coordinate information of landmarks to be reflected in the model. We demonstrate that our model can improve the detection accuracy while maintaining the processing speed.


Introduction
Facial landmarks such as eyes, nose, and mouth are prominent feature points on the face, and diverse tasks such as face recognition, gaze detection, person tracking, emotion recognition, and virtual makeup have been performed based on facial landmarks [1,2]. In particular, to meet the diverse demands of tasks such as real-time processing or rendering on mobile devices, fast and accurate face landmark extraction is essential [3]. So far, plenty of research has been done on extracting facial landmarks. In recent years, as convolutional neural networks (CNNs) have shown overwhelmingly strong performance in the field of image classification [4] and object detection [5], they have been investigated for applications in facial landmark extraction. Facial landmarks are popularly represented by 68 points, which cover facial contours, eyes, eyebrows, nose, and mouth [6][7][8]. In an effort to detect such facial landmark points accurately, adding more convolution layers has been attempted, as in Visual Geometry Group Network (VGGNet) [9,10]. Even though this produces better results, it requires more computational resources and is therefore not appropriate for real-time processing.
Various deep learning models have also been proposed for real-time extraction of facial landmarks [11][12][13]. Such models usually use fewer than 10 points for real-time facial landmark extraction. However, so few facial landmark points fail to represent landmark properties such as shape accurately. In our previous work, to extract a sufficient number of facial landmark points in real time, we proposed the EMTCNN model by extending the original multi-task cascaded convolutional neural network (MTCNN) model [12], which extracts five facial landmark points in real time. By making the CNN layers of the MTCNN model deeper, the EMTCNN model could extract 68 facial landmark points in real time. Even though the accuracy of landmark detection can be improved further by making the output network (O-Net) of the EMTCNN model even deeper, the processing time would increase sharply owing to the increased number of parameters. Hence, in this study, we augment the EMTCNN model using two convolution techniques, dilated convolution [14][15][16] and CoordConv [17], to improve the detection accuracy while maintaining the processing speed. The former makes it possible to extend the receptive field without increasing the number of parameters. The latter allows the feature map to reflect the spatial coordinate information of facial landmarks. In addition, to make our model more robust, we construct a face image dataset based on open face data and other face data collected manually, and we augment it by using image operations such as flipping and illumination adjustment to retain diversity in subject race, age, gender, face posture, and image shooting environment.
Through various experiments, we demonstrate that our proposed model can improve detection accuracy at similar speeds. This paper is organized as follows: Section 2 introduces some related works and Section 3 describes how to extract the facial landmark points in real time. Experimental results for the proposed method are described in Section 4, and Section 5 concludes the paper.

Related Works
Traditional real-time extraction methods in the field of human images have been proposed in many research studies [18][19][20][21]. However, these conventional approaches have the drawback of using hand-crafted features. Recently, CNNs have been demonstrated to be superior in diverse fields such as computer vision, energy forecasting [22,23], and biomedical science [24][25][26][27][28]. This is because they perform overwhelmingly well in extracting feature maps through their convolution layers. Various studies on the extraction of high-quality feature maps based on the CNN architecture are underway [14][15][16][17]. The quality of facial landmark extraction has also been improved remarkably by using CNNs. However, in the case of high-resolution images, performing facial landmark extraction on the entire image requires a large amount of computation, and the results could be inaccurate. To alleviate such problems, the landmark extraction process can be divided into two stages: detecting facial regions and extracting facial landmarks in those detected regions.
A facial region in an image can be detected by using an object recognition technique, and many object detection algorithms have been proposed to localize objects within an image. For instance, Region CNN (R-CNN) [29], the first CNN-based object recognition model, uses the selective search algorithm [30] to create bounding boxes by combining pixels that have similar feature patterns, and then classifies images using a CNN and a support vector machine (SVM) [31]. However, this method requires a considerable amount of processing time because all the bounding boxes are fed to the CNN as input. To solve this problem, Girshick [32] proposed Fast R-CNN based on the concept of region of interest (ROI) pooling. Instead of feeding all the bounding boxes to the CNN, Fast R-CNN pools the regions corresponding to ROIs from the final feature map of the CNN. Both R-CNN and Fast R-CNN use the selective search algorithm to detect bounding box candidates. However, as selective search is a slow and time-consuming process, it affects the performance of the network. To rectify this, Ren et al. [33] proposed Faster R-CNN, which generates ROI proposals with a region proposal network (RPN). They used a sliding window to compute the coordinates and scores of the bounding boxes. On the other hand, Redmon et al. [34] proposed You Only Look Once (YOLO), which divides the image into N × N grids and computes the class probability and offset value along with the bounding boxes on the grid. Then, they selected the bounding boxes that had a class probability higher than some threshold value. Even though YOLO could detect object regions very quickly compared to other models, it had the problem of low accuracy. To improve the detection accuracy, Redmon et al. [35] added a batch normalization process to all the convolution layers, replaced the fully connected layer with a convolution layer, and extended the final feature map size from 7 × 7 to 13 × 13.
Through these attempts, the resulting model, referred to as YOLO9000, can detect 9000 object classes while further improving speed and performance. In addition, they recently reported a new model that improves the performance further by introducing a new network [36].
As mentioned earlier, facial landmark extraction can be done in conjunction with face detection to reduce computation time and improve extraction accuracy. One popular approach for facial landmark detection is to use an open library such as Dlib [37]. This library has shown reasonable performance in face landmark recognition using the ensemble approach of regression trees proposed by Kazemi and Sullivan [38]. Another approach for facial landmark detection is to use CNN-based algorithms. Sun et al. [39] first suggested a method of using cascaded CNNs to extract facial landmarks. They obtained facial landmark points from the first network and then refined the results through shallow networks. Since then, many researchers have studied how to use CNNs to extract facial landmarks. Similar to the method proposed by Sun et al., MTCNN [12] uses relatively light CNNs to detect facial regions and extract five facial landmark points in real time. Ranjan et al. [40] proposed a multi-task CNN structure using combined features from multiple convolutional layers. In this way, face recognition, landmark localization, pose estimation, and gender recognition are jointly performed. Recently, recurrent neural networks (RNNs) have been widely used for refining the output of the main CNN to improve facial landmark extraction performance [41,42]. On the other hand, semantic segmentation can also be used for representing facial landmarks. For instance, Kim et al. [10] proposed a pixel-based facial landmark extraction scheme using SegNet [43,44]. SegNet consists of encoder and decoder networks. The encoder network performs convolution to down-sample the input image to the feature map. The decoder network performs deconvolution to up-sample the feature map to its original size. The final feature map is classified by a facial landmark class through the softmax function.
Generally, the quality of a feature map depends on the depth of the CNN model. The deeper the model, the higher the quality of the feature map that can be obtained. However, as the depth of the model increases, the number of parameters increases exponentially, and the processing time increases accordingly. There has been plenty of research on how to extract features effectively without increasing the number of parameters. Chen et al. [14][15][16] used dilated convolution instead of the conventional convolution method for image semantic segmentation. Dilated convolution enlarges the convolution filter by a constant rate and fills the empty spaces between the weights with zeros. They effectively enlarged the field of view of the filters to incorporate broader context without increasing the number of parameters or the amount of computation. Liu et al. [17] proposed a convolution layer called CoordConv that has channels containing coordinate information. The row and column information of the channels is normalized, resized to the same size as the feature map of each layer, and concatenated to the input of the layer. Thus, the coordinate transformation problem was solved by the CoordConv layer with perfect generalization; it was 150 times faster with 10-100 times fewer parameters than standard convolution.

Materials and Methods
In this section, we describe how our model accurately detects 68 facial landmark points in real time. We constructed the EMTCNN model by extending the original MTCNN model in our previous work. To improve the detection accuracy further while maintaining the processing speed, we augment the EMTCNN model using two convolution techniques: dilated convolution and CoordConv.

EMTCNN Augmentation
MTCNN is a cascaded structure composed of relatively light CNNs including a proposal network (P-Net), refinement network (R-Net), and output network (O-Net). Since MTCNN has a pyramid structure, it extracts face candidate regions from images at various scales and uses them as input images. This allows the model to learn different image scales effectively. The facial landmark extraction process is divided into three steps; each step is performed on a different network. In the first step, the facial region candidates are extracted through the P-Net. Then, non-maximum suppression (NMS) [45] is used to remove highly overlapped candidates. In the second step, the filtered candidates are refined through the R-Net and NMS. In the last step, one facial region and its five facial landmark points are produced through the O-Net.
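The NMS step above can be sketched as follows. This is a minimal greedy implementation over [x1, y1, x2, y2] boxes for illustration only, not the exact routine used in MTCNN; the threshold value is a placeholder.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression over [x1, y1, x2, y2] boxes.

    Repeatedly keeps the highest-scoring box and removes any remaining
    box whose IoU with it exceeds the threshold.
    """
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        # Intersection of the kept box with all remaining candidates.
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0.0, x2 - x1) * np.maximum(0.0, y2 - y1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        # Drop candidates that overlap the kept box too heavily.
        order = rest[iou <= iou_threshold]
    return keep
```

Applied after the P-Net and R-Net stages, this removes highly overlapped facial region candidates before the next network is run.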
To extract the 68 most popular facial landmark points, instead of 5, we extend the O-Net of the MTCNN model by using two 3 × 3 convolution layers and increasing the number of convolution filters. This allows the second 3 × 3 convolution layer to see 5 × 5 regions with fewer parameters than using a 5 × 5 convolution layer [46]. Nevertheless, as the network expands, the number of parameters naturally increases, so there is a limit to expanding the network. To avoid this, we reduce the number of facial region candidates generated by the P-Net by ignoring small facial regions. Even though there is a trade-off between processing time and detection accuracy when manipulating the O-Net, there is still a limit to the extraction performance that can be achieved just by changing the parameter numbers while maintaining the real-time processing property. Hence, to improve the accuracy of real-time facial landmark point extraction without significantly increasing the number of parameters, we augment the EMTCNN model with two convolution techniques: dilated convolution and CoordConv. Figure 1 shows the structure of our augmented EMTCNN. Table 1 shows the number of parameters of major CNN models used for facial landmark extraction [40][41][42]. Although the number of parameters is increased considerably by expanding and augmenting the existing MTCNN, it is still small compared to those of other networks.

Dilated Convolution
We expanded the convolution filter size of the O-Net to make the network consider larger image regions. However, there was a limit to the improvement in performance that could be achieved by expanding the convolution filter because the amount of computation that had to be carried out owing to the expansion increased dramatically. Dilated convolution, which is also known as Atrous convolution, makes it possible to expand the receptive field without increasing the size of the convolution filter. The receptive field represents the number of pixels in the original image that contain one pixel of the feature map. In other words, the larger the receptive field is, the more information the feature map contains. Figure 2a,b shows a general convolution filter and dilated convolution filter with rate equal to 2. In the dilated convolution filter, blanks between the weights are filled with zeroes. In this way, we can increase the filter size without any significant increase in the amount of computation required. As a result, even with the same number of parameters, we can achieve better accuracy by using dilated convolution. Hence, we use a 3 × 3 dilated convolution filter with rate equal to 2 instead of a general 3 × 3 convolution filter to extract facial landmarks more accurately without a significant increase in the number of parameters.
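The effect of dilation can be illustrated with a small sketch (an illustration only, not the network's actual filters): dilating a 3 × 3 kernel at rate 2 yields an effective 5 × 5 kernel that still carries only nine weights.

```python
import numpy as np

def dilate_kernel(kernel, rate):
    """Insert (rate - 1) zeros between adjacent kernel weights.

    A k x k kernel becomes an effective (k + (k - 1) * (rate - 1)) square
    kernel with the same number of non-zero parameters.
    """
    k = kernel.shape[0]
    size = k + (k - 1) * (rate - 1)
    out = np.zeros((size, size), dtype=kernel.dtype)
    # Place the original weights on a strided grid; the gaps stay zero.
    out[::rate, ::rate] = kernel
    return out
```

For a 3 × 3 kernel at rate 2, `dilate_kernel` returns a 5 × 5 array with nine non-zero entries, so the receptive field grows while the parameter count and multiply-accumulate cost stay fixed.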



CoordConv Layer
In general, a CNN consists of several convolution layers and pooling layers used to find significant features in the input images. So far, this architecture has exhibited unprecedented performance in object detection and image classification. Nevertheless, it has some limitations. For instance, convolution filters can detect the noticeable features in an image, but they do not consider positional relationships among the feature maps. Such relationships can be utilized if they are common to all face images; for instance, the human eyes are positioned above the mouth. Even though positional information is significant, the EMTCNN model does not consider it. Such spatial coordinate information can be incorporated into the model by adding a CoordConv layer. In the CoordConv layer, two additional channels are added to each feature map, as shown in Figure 3a. Figure 3b shows sample Coord channels with height and width equal to 8. As shown in the figure, each channel represents normalized coordinate values between -1 and 1 and has the same size as the feature map. One channel represents coordinate information on the horizontal axis, and the other represents coordinate information on the vertical axis.
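The two Coord channels described above can be constructed as in the following sketch. NumPy is used purely for illustration; the actual layer operates on the network's feature maps inside the CNN.

```python
import numpy as np

def coord_channels(height, width):
    """Build the two CoordConv channels, normalized to [-1, 1].

    One channel varies along the horizontal axis, the other along the
    vertical axis; both match the spatial size of the feature map.
    """
    ys = np.linspace(-1.0, 1.0, height)
    xs = np.linspace(-1.0, 1.0, width)
    x_channel = np.tile(xs, (height, 1))           # horizontal coordinates
    y_channel = np.tile(ys[:, None], (1, width))   # vertical coordinates
    return x_channel, y_channel

def append_coords(feature_map):
    """Concatenate the coordinate channels to an (H, W, C) feature map."""
    h, w = feature_map.shape[:2]
    xc, yc = coord_channels(h, w)
    return np.concatenate([feature_map, xc[..., None], yc[..., None]], axis=-1)
```

For an 8 × 8 feature map, this reproduces the sample channels of Figure 3b: the appended x-channel runs from -1 at the left edge to 1 at the right edge, and the y-channel from -1 at the top to 1 at the bottom.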


Metric (Loss Function)
As the MTCNN uses the Euclidean distance as its loss function, it can be used to evaluate the accuracy of the extracted facial landmark points. An example of the 68 most popular landmark coordinates extracted by Dlib is shown in Figure 4a. As the coordinates are symmetric about the y-axis, their x-axis coordinates are often not learned properly compared to the y-axis coordinates, which produces an incorrect result. Figure 4b shows an example in which the coordinates are gathered near their average. To solve this problem, instead of considering only the distance between two points, we consider the x-axis and y-axis coordinates separately. For this reason, we use the Manhattan distance, which can treat the x- and y-axis positions individually, and give more weight to the x-axis coordinates than to the y-axis coordinates when evaluating the loss. Equation (1) represents our loss function.
$$N\_Loss(p, \hat{p}) = \sum_{i=1}^{68} \left( \alpha \frac{|x_i - \hat{x}_i|}{W} + \beta \frac{|y_i - \hat{y}_i|}{H} \right) \tag{1}$$

Here, N_Loss(p, p̂) represents the normalized loss between two sets, p and p̂, of 68 facial landmark points. p is generated by the EMTCNN and p̂ is the ground truth. x and y (x̂ and ŷ) are the coordinates of a point in p (p̂). α and β are the weights of the x- and y-axes, respectively, and the sum of α and β is 1. W and H represent the width and height of the image, respectively.
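A minimal implementation following the description of the loss above might look as follows. The default weights and image size are placeholder values, not the settings used in training.

```python
import numpy as np

def n_loss(p, p_hat, alpha=0.6, beta=0.4, width=48, height=48):
    """Weighted, normalized Manhattan loss over 68 landmark points.

    p and p_hat are (68, 2) arrays of (x, y) coordinates. The x and y
    errors are normalized by the image width and height and weighted by
    alpha and beta, with alpha + beta = 1.
    """
    dx = np.abs(p[:, 0] - p_hat[:, 0]) / width   # per-point x error
    dy = np.abs(p[:, 1] - p_hat[:, 1]) / height  # per-point y error
    return float(np.sum(alpha * dx + beta * dy))
```

Setting alpha larger than beta penalizes x-axis errors more heavily, which counteracts the tendency of the symmetric x-coordinates to collapse toward their average.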

Dataset
To train models for facial region detection, we use the WIDER FACE dataset [49], which consists of 32,203 images and 393,703 labeled faces with a high degree of variability in scale, pose, and occlusion. Incidentally, the face sizes provided in the WIDER FACE dataset are too small to display the 68 landmark points. Therefore, to train the O-Net, we used another face dataset, 300 videos in the wild (300-VW) [50], which contains approximately 110 video files and 68 facial landmark point annotations for each frame. As we used only 200 frames from each video file, our dataset contained a total of 22,000 images from 111 video files. However, 300-VW is more likely to cause overfitting for a particular person owing to the small number of people appearing in the images. Therefore, we used 20,000 additional facial images [51] to consider various cases in the models. As there was no ground truth for the added images, we used Dlib to create 68 facial landmark points for each facial image. Figure 5a,b shows examples in which the landmark points detected by Dlib alone are inaccurate. To solve this problem, we used both Dlib and SegNet [10]. We first executed SegNet for face segmentation and then overlaid the result onto the original facial image to highlight the facial landmarks. By applying Dlib to the overlaid region, we obtained accurate landmark coordinates. Figure 5c shows the detection results obtained by combining SegNet and Dlib, which are much better than those obtained by Dlib alone.
We collected approximately 42,000 images from 300-VW and other sources. In addition, we performed several image augmentation operations on the images in the dataset to represent still more diverse situations. For instance, facial images with diverse angles can be generated by flipping the images, and different illumination and noise environments can be produced by adjusting the brightness and adding salt and pepper noise. Further, some pixels can be masked to reflect the effect of wearing various accessories such as glasses and masks. Using various image augmentation operations, we obtained a total of 1,680,000 images, as calculated in Table 2. Figure 6 shows images produced using the five image augmentation operations we considered in this work.
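The augmentation operations described above can be sketched as follows (illustrative NumPy versions; the noise amount and masked region are arbitrary examples). Note that flipping an image also requires remapping the landmark x-coordinates and swapping the left/right point indices, which is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)

def flip(image):
    """Mirror the image horizontally."""
    return image[:, ::-1]

def adjust_brightness(image, delta):
    """Shift pixel intensities, clipping to the valid 8-bit range."""
    return np.clip(image.astype(np.int16) + delta, 0, 255).astype(np.uint8)

def salt_and_pepper(image, amount=0.02):
    """Set a random fraction of pixels to pure black or white."""
    out = image.copy()
    mask = rng.random(image.shape[:2])
    out[mask < amount / 2] = 0          # pepper
    out[mask > 1 - amount / 2] = 255    # salt
    return out

def mask_region(image, y0, y1, x0, x1):
    """Black out a rectangle, e.g., to mimic glasses or a face mask."""
    out = image.copy()
    out[y0:y1, x0:x1] = 0
    return out
```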


Experiment
This section describes the experiments we performed to evaluate our scheme. First of all, we measure the speed of facial landmark point extraction by our augmented EMTCNN model and compare it with that of other methods. Next, we compare the accuracy of each model in facial landmark point extraction. Finally, we evaluate the extraction accuracy of our model depending on the learning weights of the x- and y-axes. In the experiments, we used an Intel Core i7-8700 CPU (Santa Clara, CA, USA), Samsung 32 GB DDR4 memory (Suwon, Korea), and an NVIDIA GeForce GTX 1080 Ti GPU (Santa Clara, CA, USA). The operating system was Windows 10, and all experiments were implemented in a Python 3.5 environment. The mini-batch size was 384, and we trained for a total of 480,000 iterations.



Training
As mentioned before, all the CNN-based models we consider in this paper consist of three networks. As each network has inputs of different sizes and different roles, it is necessary to preprocess the dataset appropriately for training each network. The role of the P-Net is to select facial region candidates from among randomly cropped regions of an image. To do this, regions are first cropped randomly from the image and evaluated by calculating the Intersection over Union (IoU) value, which represents the ratio of the overlap between a cropped region and the ground-truth bounding box of the facial region to the area of their union. Based on the IoU values, all the cropped regions are divided into three categories (positive, part, and negative) using the following criteria: positive ≥ 0.65, 0.4 < part < 0.65, negative ≤ 0.3. Thus, to train the P-Net to find positive regions in face images accurately, we need a training dataset that contains face images and the ground truth of the facial region. The R-Net further refines the facial region candidates produced by the P-Net by resizing and classifying them once more using the same criteria as the P-Net. The R-Net is trained using the resized regions and their category information. Finally, as the O-Net extracts the landmark points of the facial region, the dataset should contain such facial landmark point information.
The O-Net uses the facial region candidates produced by the R-Net as input and produces a facial region and its facial landmark points as its output.
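The IoU-based categorization above can be expressed directly in code. The handling of crops whose IoU falls between 0.3 and 0.4 (discarded here) is an assumption, as the text does not assign them a category.

```python
def iou(box_a, box_b):
    """Intersection over Union of two [x1, y1, x2, y2] boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def categorize(crop, ground_truth):
    """Label a cropped region by its IoU with the ground-truth face box."""
    v = iou(crop, ground_truth)
    if v >= 0.65:
        return "positive"
    if 0.4 < v < 0.65:
        return "part"
    if v <= 0.3:
        return "negative"
    return None  # 0.3 < IoU <= 0.4: assigned to no category
```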

Accuracy of Landmark Point Extraction
In this experiment, we compare the landmark point extraction accuracy of three models (Augmented EMTCNN, EMTCNN, and Dlib) using the Helen dataset, which was not present in the training set. These images have 68 predefined facial landmark points as the ground truth. Hence, the accuracy can be compared quantitatively using the total distance between the ground-truth points and the landmark points extracted by each model. The mean normalized distance (MND) can be computed using Equation (2). Table 3 compares the mean normalized distances of various methods used to extract the 68 facial landmark points, including EMTCNN and Augmented EMTCNN; the baselines reported there are RCPR (Robust Cascaded Pose Regression) [53] (5.93, 8.35), ESR (Explicit Shape Regression) [54] (5.70, 7.58), CFAN (Coarse-to-Fine Auto-encoder Networks) [55] (5.53, 7.69), SDM (Supervised Descent Method) [56] (5.50, 7.50), and CFSS (Coarse-to-Fine Shape Searching) [57]. Since Dlib was trained with the 300-W dataset, it was excluded from the comparison for 300-W. As we can see from the table, the mean normalized distance for Augmented EMTCNN was improved compared to that for EMTCNN. For the ground truths shown in Figure 7a, the actual facial landmark points extracted using the Dlib, EMTCNN, and Augmented EMTCNN models are shown in Figure 7b-d, respectively. From the figures, we can see that each model performs slightly differently depending on the input image, even though their overall detection results are acceptable. Figure 8 shows more examples of facial landmark detection performed by our Augmented EMTCNN model.
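Equation (2) is not reproduced in this extraction. As a sketch only, a conventional definition of the mean normalized distance divides the mean per-point Euclidean error by a normalizing length (often the inter-ocular distance of the ground truth); the exact normalizer used in this work may differ.

```python
import numpy as np

def mean_normalized_distance(p, p_hat, norm):
    """Mean per-point Euclidean error, divided by a normalizing length.

    p and p_hat are (68, 2) arrays of predicted and ground-truth points;
    norm is the normalizing distance (assumed, e.g., inter-ocular).
    """
    d = np.linalg.norm(p - p_hat, axis=1)  # per-point Euclidean errors
    return float(d.mean() / norm)
```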

Landmark Point Extraction Speed
In this experiment, we compare the speed of five models: Augmented EMTCNN, EMTCNN, MTCNN, Tasks-Constrained Deep Convolutional Network (TCDCN), and Dlib. Augmented EMTCNN, TCDCN, and Dlib showed very good landmark extraction accuracy. We also considered MTCNN for reference, even though it extracts only five landmark points. To compare the extraction speed, we measured the number of frames processed per second (fps) for 200 × 200 input images. Table 4 shows the number of extracted facial landmark points and the speed in fps of each model. It is clear that the processing speed of the original MTCNN is the highest, as it finds just five facial landmark points. However, such few landmark points are insufficient to represent facial landmark features properly. On the other hand, Augmented EMTCNN and EMTCNN extracted 68 landmark points at speeds of 68 fps and 70 fps, respectively. Although Augmented EMTCNN is a little slower than EMTCNN, its extraction accuracy is almost twice that of EMTCNN, and an extraction speed of 68 fps is sufficient for real-time processing. In the case of Dlib, the processing speed is not high enough to support real-time processing of video, even though it can detect the same number of landmark points.

Effects of Weights on Accuracy
In Figure 4, we mentioned that the facial landmark points are symmetric about the y-axis and, owing to this, the x-axis coordinates of the facial landmark points were not learned as well as the y-axis coordinates. To reflect this property of the x-axis coordinates, we trained the model using different weights for the x- and y-axes. Table 5 shows the variation of the mean normalized distance according to the ratio of the x- and y-axis learning weights. In the case of the EMTCNN model, the mean normalized distance improved until the weight ratio of the x- and y-axes reached 8:2. On the other hand, in the case of the Augmented EMTCNN model, the ratio 6:4 produced the best mean normalized distance. Hence, we trained each model using these weight ratios.

Conclusions
In this paper, we proposed a new method for extracting the 68 most popular feature points to represent facial landmarks in real time. More specifically, we first extended the original MTCNN model to increase the number of facial landmark points from 5 to 68. Then, to improve the accuracy of facial landmark extraction, we augmented the EMTCNN model by using two state-of-the-art convolution techniques: dilated convolution and CoordConv. In our experiments, we compared the number of extracted facial landmark points, processing speed, and extraction accuracy of four methods: Dlib, MTCNN, EMTCNN, and Augmented EMTCNN. The Augmented EMTCNN model extracted the 68 most popular feature points at a speed sufficient for real-time processing, and its accuracy was almost the same as the best accuracy achieved by Dlib. Our scheme can be applied to various applications that require real-time object recognition, such as face recognition in payment services and pedestrian detection in autonomous driving. In the near future, we will investigate the face segmentation model and apply its segmentation results to input images for training to further improve the extraction accuracy.