Face Alignment in Thermal Infrared Images Using Cascaded Shape Regression

The evaluation of physiological and psychological states using thermal infrared images is based on the skin temperature of specific regions of interest, such as the nose, mouth, and cheeks. To extract the skin temperature of a region of interest, face alignment in thermal infrared images is necessary. To date, the Active Appearance Model (AAM) has been used for face alignment in thermal infrared images. However, this method is computationally costly, and its real-time performance is low. In contrast, face alignment in visible images using Cascaded Shape Regression (CSR) has been reported to achieve high real-time performance, but no studies have been reported on face alignment in thermal infrared images using CSR. Therefore, the objective of this study was to verify the speed and robustness of face alignment in thermal infrared images using CSR. The results suggest that face alignment using CSR is more robust and computationally faster than AAM.


Introduction
A method for remotely evaluating physiological and psychological states based on facial skin temperature measured by infrared thermography has attracted considerable interest. Biological information is used in various fields, such as medicine, welfare, and industry. In general, the measurement of biological signals often requires physical restraint, and the measurement itself may cause mental or physical stress to the subject [1]. In contrast, infrared thermography can conduct contactless, non-invasive skin temperature measurements with high sensitivity, accuracy, and reproducibility [2][3][4][5]. In a thermal environment that is windless and does not induce sweating, the main cause of variation in skin temperature is skin blood flow [6]. Since the autonomic nervous system controls skin blood flow as a part of the circulatory system's function to regulate body temperature, skin temperature has been used to assess the activity of the autonomic nervous system [7]. For this assessment, facial thermal infrared images are particularly suitable because the face is often exposed and unobstructed by clothing. Many previous studies have estimated physiological and psychological states based on the facial skin temperature distribution [8,9]. For example, studies have estimated vital data such as respiratory rate [10] and heart rate [11], sleepiness [12][13][14], emotions [15][16][17], and mental stress [18,19], and have detected anomalies in the facial skin temperature distribution [20]. These previous studies used the temperature distribution of the entire face or the temperature of specific Regions of Interest (ROI), such as the nose, mouth, and cheeks, for evaluation. Therefore, it is important to automatically detect faces and facial landmarks in thermal infrared images with high accuracy and stability. In recent years, infrared thermography has become less expensive while its performance has improved. The resolution of thermal infrared images has increased, and it is now possible to measure multiple people with a single thermal infrared image. Therefore, for face detection and facial landmark detection in real environments, faster processing is desirable so that multiple people can be analyzed at once.
The Active Appearance Model (AAM) [21] is one of the most popular methods for automatically detecting facial landmarks in thermal infrared images. AAM statistically models the changes in face shape and overall facial appearance and aligns the face shape with the model through nonlinear optimization. Kopaczka et al. conducted face alignment in thermal infrared images using AAM based on intensity, Histogram of Oriented Gradients (HOG), and Dense Scale Invariant Feature Transform (DSIFT) features [17,22]. However, in general, AAM is computationally expensive because it solves an exact optimization problem. It also suffers from low robustness to changes in pose, illumination, and facial expression, and to unknown subjects that are not included in the training set [23].
To solve these problems, Cascaded Shape Regression (CSR) has been proposed for face alignment in visible images [24][25][26][27]. In the CSR approach, the positions of the facial landmarks are estimated by regression, and the estimate is refined multiple times by a multi-stage estimator. Face alignment using CSR achieves high real-time performance; Ren et al. [26] reported face alignment at more than 3000 FPS. Hence, it is expected that CSR can be used for faster face alignment in thermal infrared images. However, no studies have been reported on face alignment in thermal infrared images using CSR. Therefore, the objective of this study was to verify the speed and robustness of face alignment in thermal infrared images using CSR. First, a CSR model was created. Next, we trained and evaluated the CSR model on the thermal infrared images acquired in our experiments. This paper reports the results, which suggest that face alignment using CSR is more robust and computationally faster than the AAM-based methods proposed in previous studies.

Cascaded Shape Regression
If $x_i$ and $y_i$ are the x and y coordinates of the ith facial landmark, then the face shape vector represented by the M facial landmarks is $S = [x_1, y_1, \ldots, x_M, y_M]^T$. The cascaded shape regression model is a multi-stage estimator with T stages that predicts the face shape $S^{(t)}$ in a cascaded manner. Given the initial face shape $S^{(0)}$ and the input image $I$, each stage of the CSR model estimates a shape increment $\Delta S^{(t)}$ and uses it to update the current solution. At stage t, $S^{(t)}$ and $\Delta S^{(t)}$ are regressed as follows:

$$S^{(t)} = S^{(t-1)} + \Delta S^{(t)}, \qquad \Delta S^{(t)} = r^{(t)}\left(I, S^{(t-1)}\right),$$

where $t \in \{1, \ldots, T\}$ is the index of the stage of the CSR and $r^{(t)}$ is the estimator of that stage. The loss function is represented as follows:

$$\arg\min_{r^{(t)}} \sum_{i=1}^{N} \left\| \hat{S}_i - \left( S_i^{(t-1)} + r^{(t)}\left(I_i, S_i^{(t-1)}\right) \right) \right\|^2,$$

where $\hat{S}_i$ is the ground-truth face shape of the ith training image, and N is the number of images for training. In the CSR, training is performed in such a way that this loss function is minimized. In this study, we estimated facial landmarks using the ensemble of regression trees learning method used by Vahid et al. [27]. Gradient boosting was used to train the estimators. At each split node of a regression tree, the intensity difference feature of two pixels [25,28] is compared against a threshold to decide the branch. To train each split node, 400 randomly sampled features were computed.
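The cascaded update described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this study: `run_cascade`, `split_test`, and `make_toy_regressor` are hypothetical names, and the toy regressors simply move the shape halfway toward a known target instead of being trained regression-tree ensembles.

```python
import numpy as np

def run_cascade(image, initial_shape, regressors):
    """Apply S(t) = S(t-1) + r_t(I, S(t-1)) for each stage and return S(T)."""
    shape = np.asarray(initial_shape, dtype=float).copy()
    for regressor in regressors:
        shape = shape + regressor(image, shape)  # add the shape increment
    return shape

def split_test(image, pixel_a, pixel_b, threshold):
    """Split-node decision: intensity difference of two pixels vs. a threshold."""
    (xa, ya), (xb, yb) = pixel_a, pixel_b
    diff = float(image[ya, xa]) - float(image[yb, xb])
    return diff > threshold

def make_toy_regressor(target, step=0.5):
    """Hypothetical stand-in for a trained stage: moves part-way to the target."""
    def regressor(image, shape):
        return step * (target - shape)
    return regressor

# Toy data (assumed values): a blank 640 x 480 image with one warm pixel.
image = np.zeros((480, 640))
image[100, 200] = 5.0

target = np.array([320.0, 240.0, 300.0, 200.0])   # [x1, y1, x2, y2]
initial = np.array([300.0, 250.0, 310.0, 190.0])  # e.g., a mean shape
regressors = [make_toy_regressor(target) for _ in range(10)]  # T = 10 stages

final_shape = run_cascade(image, initial, regressors)
print(final_shape)
```

Each stage only sees the image and the current shape estimate, so the stages can be evaluated one after another without any iterative optimization at test time.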

Experimental Methods
Experiments were conducted to acquire thermal infrared images of a face for training a facial landmark detector. Seven subjects (five males and two females) aged 22-24 years participated in the experiment. They were fully informed about the experiment and the purpose of the study before their participation. All participants signed a consent form.
The experimental system is shown in Figure 1. Thermal infrared images were captured using an infrared thermography camera (FLIR A615, 45° field of view, FLIR Systems, Oregon). The infrared camera had a resolution of 640 × 480 pixels and a temperature resolution of less than 0.05 K. Infrared emissivity is the ratio of the thermal radiation from the surface of an object to the radiation from a black body at the same temperature, the latter being given by the Stefan-Boltzmann law. To obtain accurate temperature measurements, it is necessary to set the correct infrared emissivity of the surface of an object. In this study, the infrared emissivity of the skin was set to ε = 0.98 [29]. The experimental protocol is shown in Figure 2. The distance between the subject and the infrared thermography camera was set to 60 cm, 90 cm, and 120 cm (Figure 3). At each distance, images were recorded in three sections (Small, Large, and Random). As shown in Figure 4, the subjects were asked to turn their heads in nine directions (center, top center, top right, center right, bottom right, bottom center, bottom left, center left, top left) for the Small and Large sections. To evaluate the effect of the angle of face orientation on face alignment, subjects were asked to turn their heads by 20 degrees and 45 degrees in the Small and Large sections, respectively. To increase the robustness of the face alignment, in the Random section, subjects were asked to move their head in any direction and make any facial expression they wanted. Nothing other than the subject's body was recorded. The experiment was conducted in an experimental room without convection. Thermal infrared images were taken 15 min after the subjects entered the experimental room, to allow thermal acclimation to the environmental temperature, and the time needed to take the thermal infrared images of each subject was less than 5 min. A total of 609 thermal infrared images were obtained in this experiment. We manually annotated the obtained data with 68 landmarks according to the literature [30] and with bounding boxes of the face region.

Analysis Methods
The acquired images were flipped horizontally for data augmentation. As a result, 1218 images were created. We performed k-fold cross-validation (k = 7) using CSR: in each fold, the data of six subjects were used as training data and the data of the remaining subject as test data, so that every subject's data were used as test data once. Unless otherwise specified, all experiments were run with the following fixed parameter settings: the number of stages in the cascade T = 10, tree depth F = 4, number of weak regressors K = 500, and number of random pixel pairs P = 400 used as the difference feature between two points. The average coordinates of the facial landmarks in the training data were used as the initial shape. The Normalized Point to Point Error (NPPE) introduced by Zhu et al. [31] was used to evaluate the estimation accuracy of the face alignment. The NPPE of the ith image is given by the following equation:

$$\mathrm{NPPE}_i = \frac{N_i}{N} \sum_{n=1}^{N} \sqrt{\left(x_{n,r} - x_{n,g}\right)^2 + \left(y_{n,r} - y_{n,g}\right)^2},$$

where $x_{n,r}$ and $y_{n,r}$ are the coordinates of the estimated facial landmarks, $x_{n,g}$ and $y_{n,g}$ are the coordinates of the correct facial landmarks, N is the number of facial landmarks, $w_i$ is the width of the face, $h_i$ is the height of the face, and $N_i = 2/(w_i + h_i)$ is the reciprocal of the mean of $w_i$ and $h_i$. To compare the estimation accuracy with that of the CSR model, we also ran the Intensity-, DSIFT-, and HOG-based AAM methods that were effective in aligning faces in thermal infrared images in previous studies [17,22]. Marciniak et al. [32] reported that the accuracy of face recognition in visible images is lower when the number of pixels in the face region is small. To evaluate the effect of the number of pixels in the face region on face alignment, the number of pixels per face width was calculated. To evaluate the computation time, we measured the frames per second (FPS) of the face alignment of the test data for each method. The evaluation PC in this experiment had an Intel Core i7-8700 CPU and 16 GB of RAM. Only one CPU core was used. 
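The evaluation protocol above can be illustrated with a short Python sketch. It assumes that the NPPE is the mean Euclidean landmark error multiplied by the reciprocal of the mean of the face width and height, as the definitions of the symbols suggest; the function names `nppe` and `leave_one_subject_out` and the toy coordinates are hypothetical and not taken from the authors' program.

```python
import numpy as np

def nppe(pred, truth, face_width, face_height):
    """Mean point-to-point error normalized by the mean of face width and height."""
    pred = np.asarray(pred, dtype=float).reshape(-1, 2)
    truth = np.asarray(truth, dtype=float).reshape(-1, 2)
    distances = np.linalg.norm(pred - truth, axis=1)  # one distance per landmark
    normalizer = 2.0 / (face_width + face_height)     # reciprocal of the mean size
    return normalizer * distances.mean()

def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices), one fold per subject."""
    subject_ids = np.asarray(subject_ids)
    for held_out in np.unique(subject_ids):
        test_idx = np.flatnonzero(subject_ids == held_out)
        train_idx = np.flatnonzero(subject_ids != held_out)
        yield held_out, train_idx, test_idx

# Toy example (assumed values): every landmark is off by (3, 4) pixels,
# i.e., 5 pixels, on a face of 100 x 100 pixels.
truth = np.array([[10.0, 10.0], [20.0, 10.0], [15.0, 20.0]])
pred = truth + np.array([3.0, 4.0])
error = nppe(pred, truth, face_width=100.0, face_height=100.0)

# Seven subjects with 87 images each give seven folds (k = 7).
subject_ids = np.repeat(np.arange(1, 8), 87)
folds = list(leave_one_subject_out(subject_ids))
print(error, len(folds))
```

Splitting by subject rather than by image keeps all images of the held-out person out of the training data, which is what makes each fold a test on an unknown subject.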
The program was implemented in C++ and Python.

Results and Discussion
Table 1 shows the minimum, maximum, and mean values of the facial skin temperature and the ambient temperature. The ambient temperature was almost the same for all subjects in the experiment.
Figure 5 shows the percentage of test images satisfying a given NPPE evaluated with CSR and with the Intensity-, DSIFT-, and HOG-based AAM. The CSR method has the highest number of images below 0.05, which is an acceptable accuracy value for NPPE [33]. The lower proportion for the AAM methods is probably because AAM is less robust to unknown subjects that are not part of the training set [23]. Figure 6 shows examples of NPPE for face alignment. From Figure 6, it can be confirmed that the accuracy of face alignment becomes worse when the NPPE is greater than 0.5. The CSR method reached a higher total accuracy value. Figure 7 shows the mean value of NPPE of the test images for each method. Figure 8 shows examples of face alignment using each method. The mean NPPEs of CSR and conventional AAM were almost equal. However, the variation of the NPPE was the smallest for CSR. From Figure 8, it can be confirmed that the accuracy of the face alignment of AAM becomes worse when the face is not facing the front. This suggests that face alignment by CSR is more robust than the AAM method and can be applied to a wider variety of images.
Table 2 shows the FPS of each method: the FPS of the CSR model is over 80, which is more than ten times that of the AAM methods. The FPS of CSR was the largest, followed by the Intensity-, DSIFT-, and HOG-based AAM in that order. The DSIFT- and HOG-based AAMs are considered to have taken more time than the Intensity-based AAM because of the calculation of the DSIFT and HOG features. These results suggest that face alignment in thermal infrared images using CSR achieves high real-time performance.

Table 1. Mean values ± SD of the minimum, maximum, and mean facial skin temperature. The number of thermal infrared images for each subject was 87.

Figure 9 shows the mean NPPE at each cascade stage for tree depths of 3, 4, 5, and 10, and Figure 10 shows examples of face alignment for the same tree depths. When the tree depth was 4, the accuracy of face alignment was the highest. When the tree depth was 5 or 10, the model had a large number of features and overfitted the training data, resulting in lower accuracy. When the tree depth was 3, the model had a small number of features and underfitted the training data, which also resulted in lower accuracy.
Figure 9. The mean NPPE using CSR for each cascade stage for tree depth = 3, 4, 5, and 10. (Plot axes: cascade stage and mean normalized point-to-point error; legend: tree depth.)

Table 3 shows the mean number of pixels across the face width and the mean NPPE for each distance between the infrared thermography camera and the subject, and Figure 11 shows examples of face alignment for each distance. From Table 3 and Figure 11, the accuracy of the fitting did not decrease with distance. Under these experimental conditions, differences in the distance to the infrared thermography camera and in the number of pixels of the face in the image did not affect the accuracy of the face alignment estimates. In thermal measurements, a distance of one meter is known to be an excellent standard for ensuring stable, consistent results [34]. The results suggest that face alignment can be performed with high accuracy when the distance between the thermography camera and the person is between 60 and 120 cm. This range includes the distance of 1 m that is the standard for thermal measurements.
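To give a feel for how the number of pixels per face width changes with distance, the following Python sketch applies a simple pinhole-camera approximation. The values are assumptions, not measurements from this study: the 45° field of view is taken to be the horizontal field of view across the 640-pixel image width, the face width of 0.15 m is a hypothetical round figure, and `pixels_per_face_width` is an illustrative name.

```python
import math

def pixels_per_face_width(distance_m, face_width_m=0.15,
                          hfov_deg=45.0, image_width_px=640):
    """Approximate number of pixels spanned by a face at a given distance.

    Pinhole approximation: the scene width covered by the image at
    distance d is 2 * d * tan(hfov / 2).
    """
    scene_width_m = 2.0 * distance_m * math.tan(math.radians(hfov_deg) / 2.0)
    return image_width_px * face_width_m / scene_width_m

# The three camera-to-subject distances used in the experiment (in cm).
estimates = {d: pixels_per_face_width(d / 100.0) for d in (60, 90, 120)}
for distance_cm, pixels in estimates.items():
    print(f"{distance_cm} cm: about {pixels:.0f} pixels per face width")
```

Under this approximation the pixel count is inversely proportional to the distance, so the face at 120 cm spans half as many pixels as at 60 cm. The measured values in Table 3 should be preferred over these estimates.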
Figure 10. Examples of facial alignment using CSR for each cascade stage for tree depth = 3, 4, 5, and 10. (Panels: ground truth and one panel per tree depth.)
Figure 11. Examples of face alignment for each distance. The distances from the top are 60 cm, 90 cm, and 120 cm.

Conclusions
As mentioned in the introduction, the objective of this study was to verify the speed and robustness of face alignment in thermal infrared images using CSR. The results suggest that CSR is more robust than AAM for face alignment in facial thermal images and can be applied to various types of images. The FPS of face alignment using CSR is more than 80, so facial landmarks can be detected at high speed. Therefore, facial landmark detection by CSR may be useful for real-world applications. However, a limitation of this study is the small sample size of 609 thermal infrared images, and we have not dealt with thermal infrared images in the wild. In the future, we plan to conduct studies using thermal infrared images covering many more varieties and conditions.

Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.

Conflicts of Interest:
The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript: