Face Liveness Detection Using Thermal Face-CNN with External Knowledge

Face liveness detection is important for ensuring security. However, because a face can be presented in a photograph or on a display, it is difficult to distinguish a real face from a fake one using features of the face shape alone. In this paper, we propose a thermal face-convolutional neural network (Thermal Face-CNN) that incorporates the external knowledge that the temperature of a real person's face is 36~37 degrees on average. First, we compared red, green, and blue (RGB) images with thermal images to identify the data more suitable for face liveness detection, using a multi-layer neural network (MLP), a convolutional neural network (CNN), and a C-support vector machine (C-SVM). Next, we compared the performance of these algorithms and the newly proposed Thermal Face-CNN on a thermal image dataset. The experimental results show that thermal images are more suitable than RGB images for face liveness detection. Further, as measured by the F-measure, Thermal Face-CNN performs better than CNN, MLP, and C-SVM when precision is slightly more crucial than recall.


Introduction
Face liveness detection in indoor residential environments is an important technique for security applications, such as unlocking a mobile device with a face recognition system. For example, in order to allow access to only one specific person, that person's unique information, such as their face, can be used to unlock security measures. However, because a printed face photograph or a face shown on a display can reproduce this unique information sufficiently well, the reliability of the security is reduced. Therefore, stronger security is needed through face liveness detection, in which thermal images make the real face distinguishable from the fake face through the heat distribution present in the face of a real person.
In this paper, we first quantitatively identify the more suitable image type for face liveness detection using both RGB images and thermal images. The same algorithms were applied to the RGB and thermal image datasets for the comparison. A multi-layer neural network (MLP) [1], a convolutional neural network (CNN) [2], and a C-support vector machine (C-SVM) [3] with a soft margin hyperplane were used for the comparison. In addition, we compared the performance of these existing algorithms with the thermal face-convolutional neural network (Thermal Face-CNN) proposed in this paper. Thermal Face-CNN is an algorithm with external knowledge about the temperature values that are found in a real face.
We collected thermal images ourselves because there are many RGB image datasets for face liveness detection but few or no thermal image datasets. We obtained RGB and thermal images of the same scenes in order to evaluate how much thermal images improve performance over RGB images. Accuracy [4], recall [4], and precision [4] were the main measures obtained on both the RGB and thermal image datasets.
The experimental results show that the best-performing CNN achieves an accuracy of 0.6898, a recall of 0.5752, and a precision of 0.7342 on the RGB image dataset, while it achieves an accuracy of 0.8367, a recall of 0.7876, and a precision of 0.8476 on the thermal image dataset. The thermal image is therefore more effective for face liveness detection than the RGB image. In addition, we show that the average recall is improved by 13.72% over CNN by using the proposed Thermal Face-CNN on the thermal image dataset. We also found, using the F-measure, that Thermal Face-CNN performs better than CNN, MLP, and C-SVM when precision is slightly more crucial than recall.

Background and Related Work
Face detection is a field concerned with detecting a face in an image. Algorithms for face detection judge whether or not an object in the picture is a face [5]. Face liveness detection, by contrast, judges whether the face presented is a real face, a fake face, or no face at all. Face detection is therefore a very different field from face liveness detection, and work on face detection cannot be directly compared with work on face liveness detection. In the field of face liveness detection, there are three ways to imitate a real face: using a picture of that face, replaying a video of that face, and using a 3D face mask [6]. The picture-based method involves printing the face on paper or showing the face on a display. To counter this attack, studies have explored ways to detect the real face using photo-based datasets [6-9]. In addition, there have been studies that use video-based datasets to distinguish the real face from the fake face [7,10]. Further studies have examined ways to distinguish between the real face and a 3D face mask [11,12].
Many datasets can be used for face liveness detection: NUAA [8], ZJU Eyeblink [13], Idiap Print-attack [14], Idiap Replay-attack [10], CASIA FASD [15], MSU-MFSD [16], MSU RAFS [17], UVAD [18,19], MSU USSA [6], and so on. However, these datasets are composed of RGB images, and there are not enough datasets composed of thermal images. As a result, research on face liveness detection with thermal images has been insufficient to date. Thermal images have already been used in research on face detection and pedestrian detection [20-23]. Thermal images can be obtained from the distribution of infrared rays, even at night when there is no visible light. Because RGB images are affected by the intensity of visible light, while thermal images can be used in places where there is no visible light, thermal images have been applied successfully in various fields. It is therefore necessary to compare the RGB image and the thermal image to determine how much performance improvement the thermal image offers in face liveness detection. Using an existing dataset for this comparison would be ideal, but none of them contains temperature information. Thus, a new dataset is needed.
Face liveness detection involves detecting the real face by analyzing the information obtained from the image. Previous studies on face liveness detection have therefore been carried out using image processing methods. The support vector machine (SVM) is a classification algorithm that has been used to distinguish between real and fake faces in face liveness detection [7,11]. As these studies show, SVM performs well in classification. Of the SVM algorithms, the linear SVM finds the linear hyperplane with the largest margin [24]. The linear SVM assumes that the classes can be separated by a line. However, there are cases where the data cannot be separated by a line. To solve this problem, research was carried out on nonlinear SVM using kernel functions [24]. In [7], classification was performed using SVM on abstracted information combining static and dynamic features for face liveness detection. In addition, in [11], SVM learned the multispectral reflectance distribution information that can distinguish real human skin from images or objects meant to look like skin. The SVMs previously used in face liveness detection learned to classify the training data perfectly, without error. However, it is also possible to find a soft margin hyperplane that has the largest margin while allowing a small amount of the training data to be misclassified [3]. A soft margin hyperplane is more generalizable because it avoids overfitting the training data. Therefore, C-SVM, which is a nonlinear SVM using a soft margin hyperplane and is more generalizable than the SVMs used in previous studies, was used in Section 4 to evaluate the performance of algorithms on the thermal image dataset.
The artificial neural network imitates human neurons [1]. In particular, MLP is one of the artificial neural networks used in image processing [25]. Image processing can be done with MLP by feeding the pixel information into the input layer and having the output layer output 0 or 1 through a single node for binary classification. CNN [2], which is designed for effective image processing, modifies MLP by reducing the number of weights and sharing weights. Several studies have performed face liveness detection effectively using CNN on RGB images [7,26,27]. In addition, CNN is known to be a more powerful algorithm than SVM for face liveness detection on RGB images [26]. Furthermore, CNN can achieve 98.99% accuracy on the relatively easy RGB image dataset NUAA [8], which means that CNN is superior to previous methods [26] and is state-of-the-art. An accuracy of 98.99% does not mean that the field is entirely conquered. More difficult face liveness detection still needs to be studied, in which multiple objects appear simultaneously in an image and the larger number of pixels greatly increases the amount of computation. The thermal image can be used for this purpose because there have also been studies showing that CNN can be applied successfully to thermal images [20-22]. For these reasons, and because the thermal image used for face liveness detection needs to be processed properly with CNN, we used this algorithm in Section 4. Nevertheless, it is necessary to investigate an algorithm superior to CNN for face liveness detection based on the thermal image. The CNN algorithm and Thermal Face-CNN for face liveness detection are described concretely in Section 3 of this paper.
In addition to the support vector machine and the artificial neural network, a variety of other algorithms have been used for face liveness detection. A logistic regression model [8,28] was used to classify the real face and the fake face. In addition, the local binary pattern [9,29] and the Lambertian model [8] were used to identify image features for face liveness detection. The local binary pattern extracts image features by considering, for each pixel, the difference in value relative to its neighboring pixels. With this method, a feature vector representing the features of the image was extracted for face liveness detection [9]. Similarly, the Lambertian model has been studied as a way of extracting information about the difference between the real face and the fake face. The related studies thus include a considerable amount of research on how to extract image feature information.

The Proposed Method
The proposed Thermal Face-CNN is an algorithm for face liveness detection based on CNN. In this algorithm, external knowledge for face liveness detection is inserted first, followed by CNN. The artificial neural network part of the proposed method is the same as the existing CNN, which combines convolutional layers, pooling layers, and fully connected layers. The numbers of convolutional layers, pooling layers, and fully connected layers vary depending on the number and type of pixels in the image. For visual convenience, an example of Thermal Face-CNN with two convolutional layers, two pooling layers, and one hidden layer is shown in Figure 1. The numbers of layers used are explained in Section 4. First, knowledge is inserted for face liveness detection. After that, the data with external knowledge are processed in the convolutional layer and transferred to the pooling layer. This can be repeated several times in order to process a complex image. Next, CNN passes the resulting information to the fully connected layer. Finally, CNN classifies the image in the output layer. The insertion of external knowledge, the convolutional layer, the pooling layer, and the fully connected layer are each explained in the remainder of this section. External knowledge for face liveness detection is inserted by encoding knowledge about the temperature that a human face can have. This can be represented as Equation (1).
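Equation (1) itself does not appear in this copy of the text. Based on the description that follows it (measured values between the down limit and the up limit are multiplied by the knowledge value), one plausible reconstruction is given here. The symbols k, d, and u for the knowledge value, down limit, and up limit are notation introduced for this sketch, and both the inclusive bounds and the pass-through of values outside the range are assumptions.

```latex
h =
\begin{cases}
  k \cdot g, & d \le g \le u, \\
  g,         & \text{otherwise},
\end{cases}
\tag{1}
```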
In Equation (1), g is the measured temperature value, and h is the input value to CNN. Equation (1) multiplies any measured value between the down limit and the up limit by the knowledge value, so as to make use of the physiological knowledge that the mean body temperature of a person is between 36 and 37 degrees [30]. A pixel measuring a part of a real face is expected to have a temperature value in this vicinity. The fact that a pixel with a value close to 36 or 37 degrees in a measured thermal image is likely to represent a part of a real face can only be obtained from external knowledge, not from the data. To insert this knowledge into the artificial neural network, we use Equation (1) to produce a value remarkably different from the measured value. The artificial neural network then sees the temperature of this pixel as very different from the temperatures measured at other pixels. If the knowledge value is 10, for example, the resulting value is about ten times larger than the values of other pixels. Figure 2 shows an example in which 34 and 39 are selected as the limits around the human body temperature of 36 to 37 degrees, taking into account the errors that may occur during measurement. In Section 4, we conducted experiments with various settings of the knowledge value, up limit, and down limit.
In the upper left graph of Figure 2, the vertical axis represents the temperature values. The upper right graph of Figure 2 expresses the external knowledge about whether the part of an object measured by each pixel is likely to be a part of a real face or not. Note that the vertical axis of the upper right graph has no quantitative values. The horizontal axes of all graphs in Figure 2 represent the pixel index. In the upper left graph, pixels 2 and 3 have different meanings according to the upper right graph, yet there is almost no quantitative difference between them. To emphasize this distinction, the input data must be re-expressed so that the two kinds of data, those that might measure a part of a real face and those that might not, are distinctly different. The knowledge value in Equation (1) is used to do so. As shown in the lower graph of Figure 2, the values that may belong to a real face are forced into a specific region far from the other values, while the minute differences in the measured temperatures of the pixels are still expressed. These differences in measured temperature can be seen by comparing pixel 1 with pixel 3 and pixel 2 with pixel 4.
The optimal knowledge value can be found empirically through experimentation.
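The knowledge-insertion step described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function name `insert_knowledge`, the inclusive bounds, and the four pixel temperatures are assumptions, while the limits 34 and 39 and the knowledge value 10 are the example values mentioned in the text.

```python
def insert_knowledge(temps, knowledge_value=10.0, down_limit=34.0, up_limit=39.0):
    """Multiply temperatures inside [down_limit, up_limit] by knowledge_value.

    Values outside the range are passed through unchanged (an assumption,
    since Equation (1) is not reproduced in the text).
    """
    return [
        t * knowledge_value if down_limit <= t <= up_limit else t
        for t in temps
    ]


# Four hypothetical pixel temperatures (not measured data), arranged like
# the pixels discussed for Figure 2: pixels 2 and 4 fall inside the range,
# pixels 1 and 3 fall outside it.
pixels = [33.0, 38.75, 39.25, 36.5]
transformed = insert_knowledge(pixels)

for index, (g, h) in enumerate(zip(pixels, transformed), start=1):
    print(f"pixel {index}: measured g = {g}, network input h = {h}")
```

In this sketch, pixels 2 and 3 differ by only half a degree before the transformation but end up far apart afterwards, while the small difference between pixels 2 and 4 is still visible in the scaled values.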
The convolutional layer serves to extract complex features from the two-dimensional image [31]. The parameters of the convolutional layer are kernel_size, filters, and stride. kernel_size indicates the width and height of a kernel composed of learnable weights. filters is the number of kernels, and stride is the interval at which the kernel is moved across the image when extracting features. With the convolutional layer, spatial information can be extracted while sharing the weights [2]. Formal equations related to the convolutional layer are presented in [31]. The information calculated in the convolutional layer is transferred to the pooling layer.
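To make the roles of kernel_size, filters, and stride concrete, the following pure-Python sketch applies a set of kernels to a single-channel image. It is an illustration under stated assumptions, not the TensorFlow code used in the experiments: the function names are hypothetical, no padding is applied, bias terms and activations are omitted, and the image and kernel values are made up.

```python
def conv2d_single(image, kernel, stride=1):
    """Slide one kernel over a single-channel image without padding."""
    k_h, k_w = len(kernel), len(kernel[0])          # kernel_size
    out_h = (len(image) - k_h) // stride + 1
    out_w = (len(image[0]) - k_w) // stride + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            top, left = i * stride, j * stride      # stride sets the step
            row.append(sum(
                image[top + a][left + b] * kernel[a][b]
                for a in range(k_h)
                for b in range(k_w)
            ))
        output.append(row)
    return output


def conv2d(image, kernels, stride=1):
    """Apply len(kernels) kernels (the `filters` parameter); one map per kernel."""
    return [conv2d_single(image, kernel, stride) for kernel in kernels]


image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
kernels = [
    [[1, 0], [0, 1]],   # sums the main diagonal of each 2 x 2 patch
    [[1, 1], [1, 1]],   # sums the whole 2 x 2 patch
]
feature_maps = conv2d(image, kernels, stride=2)
print(feature_maps)
```

Because the same kernel weights are reused at every position, the number of learnable weights depends on kernel_size and filters rather than on the image size, which is the weight sharing referred to above.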
Among the layers that make up CNN, the pooling layer induces spatial invariance by reducing the size of the feature map [32]. The parameters of the pooling layer are pooling_size and stride. pooling_size represents the size of the zone to be examined and is analogous to kernel_size, the parameter of the convolutional layer discussed above. stride in the pooling layer serves the same purpose as the stride parameter of the convolutional layer. The max pooling layer finds the maximum value in each region and transfers it to the next layer [32]. Finally, after passing through the convolutional and pooling layers, the information is transferred to the fully connected layer.
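Max pooling can be sketched in the same style. This is an illustrative example only: the function name is hypothetical, a square pooling window is assumed for brevity, and the feature map values are made up.

```python
def max_pool2d(feature_map, pooling_size=2, stride=2):
    """Keep the maximum of each pooling_size x pooling_size region."""
    out_h = (len(feature_map) - pooling_size) // stride + 1
    out_w = (len(feature_map[0]) - pooling_size) // stride + 1
    return [
        [
            max(
                feature_map[i * stride + a][j * stride + b]
                for a in range(pooling_size)
                for b in range(pooling_size)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


feature_map = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
pooled = max_pool2d(feature_map, pooling_size=2, stride=2)
print(pooled)
```

With a window of 2 and a stride of 2, the 4 × 4 map is reduced to 2 × 2, which is the size reduction that the pooling layer is meant to provide.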
The fully connected layer is a type of layer used in MLP, consisting of nodes that are completely connected to the nodes in the previous and subsequent layers [1].

Experiments

Data Collection and Experimental Environment Construction
The Flir C3 was used as the camera for collecting data. The camera has two lenses on the front: an RGB lens that captures RGB images of 640 × 480 pixels and an infrared lens that captures thermal images of 80 × 60 pixels. Information on the Flir C3 can be found at a website listed in the Supplementary Materials at the end of this paper. We collected one RGB image and one thermal image of each scene in order to find suitable data for face liveness detection. Because thermal images have an advantage over RGB images at night, we took the images in indoor residential environments with visible light so that the performance comparison would be fair. No conditions were placed on the distance to the object. The faces in the dataset appear with and without a variety of accessories, such as glasses. A face may be partly covered by any object, provided that the eyes, nose, and mouth are not covered. We used the function of the Flir C3 that allows the two lenses to operate simultaneously. A total of 844 scenes were taken. The actual data used were 844 Excel files with temperature information collected from the infrared lens and 2532 Excel files with R, G, and B information collected from the RGB lens.
In Figure 3, the images in the top row are RGB images, while the images in the bottom row are thermal images. Figure 3a,d are RGB and thermal images, respectively, with a real face present. Figure 3b,e are RGB and thermal images, respectively, with a face on a display. Figure 3c,f show images of a ceiling air conditioner with no face. In the thermal images, the color is generated by the software in the thermal camera itself so that the measured temperature can be grasped intuitively. In Figure 3a,b,d,e, it can be seen that the outline of the heat distribution and the heat of the face on the display differ from those of the real face. The RGB face liveness detection dataset jongwoo (RFLDDJ) and the thermal face liveness detection dataset jongwoo (TFLDDJ) that we created are available on the internet. In NUAA [8], the whole picture is completely filled with a face. However, in the RGB dataset we created, people and objects were shot in indoor living environments in order to increase the level of difficulty. In other words, multiple objects coexist in a single image in the datasets we made. The data are more difficult because a more general situation is assumed. Information on the datasets can be found at websites listed in the Supplementary Materials at the end of this paper.
The numbers of pixels differ between the two lenses. The RGB lens has 640 pixels horizontally and 480 pixels vertically, for a total of 307,200 pixels per image. By contrast, the infrared lens has 80 pixels horizontally and 60 pixels vertically, for a total of 4800 pixels per image. The pixel counts of the images obtained by the two lenses therefore differ by a factor of 64. However, as the example in Figure 4 shows, the extent of the scene actually captured is not much different. In addition, because the RGB lens and the infrared lens have different pixel sizes, and because the two lenses are at slightly different positions on the camera, it is not clear how many pixels should be cut from the left, right, top, and bottom sides to obtain the same extent of the scene. Therefore, it is impossible to capture exactly the same extent of the scene with both lenses. To keep the experiment fair, any scene in which the real face lay outside the area captured by the infrared lens was removed from the experiment.
We used Adam [33], Dropout [34], and ReLU [35] to improve learning when training CNN and Thermal Face-CNN. The Adam algorithm reduces the error by updating the weights of the artificial neural network. According to [33], it is straightforward to implement, computationally efficient, and has low memory requirements; the gradients it uses are obtained by back-propagation [36]. Dropout prevents overfitting by randomly excluding nodes from the calculation during the learning process [34]. Sigmoid [37] was used as the activation function in the output layer of every artificial neural network used in the experiments (this does not apply to C-SVM), and ReLU was used as the activation function of the hidden layers. The max pooling layer [32] was used as the pooling layer. The probability of dropping each node was 10%. The hardware used in the experiment was an Intel Core i7-7820X CPU with 32 GB of DDR4 memory. The experiment was carried out using the TensorFlow [38] library, which provides artificial neural network code. For C-SVM, the sklearn.svm.SVC class of the scikit-learn library was used. Information on the library can be found at a website listed in the Supplementary Materials at the end of this paper.
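The activation functions and the dropout step named above can be illustrated in a few lines of standard-library Python. This is a sketch for orientation, not the TensorFlow code used in the experiments: the function names are hypothetical, and rescaling the kept values by 1 / (1 - rate), known as inverted dropout, is an assumption about the convention rather than something stated in the paper.

```python
import math
import random


def relu(x):
    """ReLU, used here for hidden-layer activations."""
    return max(0.0, x)


def sigmoid(x):
    """Sigmoid, used here for the single output node; written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def dropout(values, rate=0.1, rng=random):
    """Zero each value with probability `rate` and rescale the kept ones.

    Intended for training only; at test time the values are used as they are.
    """
    keep = 1.0 - rate
    return [v / keep if rng.random() >= rate else 0.0 for v in values]


hidden = [relu(v) for v in [-1.5, 0.0, 2.0, 3.5]]
dropped = dropout(hidden, rate=0.1, rng=random.Random(0))
output = sigmoid(sum(dropped))
print(hidden, dropped, output)
```

A drop probability of 10% matches the setting reported above; the seeded generator is used only so that the illustration is repeatable.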
Accuracy [4], recall [4], and precision [4] were the main evaluation indices used in the experiment. In this study, accuracy is the proportion of images for which the predicted value matches the actual value, regardless of whether a real face is present. Recall is the proportion of images containing a real face that are judged to contain a real face. Precision is the proportion of images predicted to contain a real face that actually contain one.
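These indices, together with the F-measure mentioned in the Introduction, can be written directly in terms of confusion-matrix counts. The following is a minimal sketch rather than the evaluation code used in the experiments. The function names are hypothetical, and the counts are illustrative values chosen only to be consistent with a test set of 245 images of which 113 contain a real face; the paper does not report a confusion matrix. These counts happen to round to the best thermal-image CNN figures quoted in the Introduction. The value beta = 0.5 is an arbitrary example of weighting precision more heavily than recall, not the setting used in the paper.

```python
def accuracy(tp, fp, fn, tn):
    """Share of all images whose prediction matches the actual label."""
    return (tp + tn) / (tp + fp + fn + tn)


def recall(tp, fn):
    """Share of real-face images that were predicted as real."""
    return tp / (tp + fn)


def precision(tp, fp):
    """Share of images predicted as real that actually contain a real face."""
    return tp / (tp + fp)


def f_beta(p, r, beta=1.0):
    """F-measure; beta < 1 weights precision more, beta > 1 weights recall more."""
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)


# Illustrative counts (assumed, not reported): 113 real-face test images,
# 132 test images without a real face.
tp, fn, fp, tn = 89, 24, 16, 116

acc = accuracy(tp, fp, fn, tn)
rec = recall(tp, fn)
prec = precision(tp, fp)
print(round(acc, 4), round(rec, 4), round(prec, 4))  # 0.8367 0.7876 0.8476
print(round(f_beta(prec, rec, beta=0.5), 4))
```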

The Comparison of Face Liveness Detection between the RGB Image and Thermal Image
Before examining the performance of the proposed Thermal Face-CNN, we obtained accuracy, recall, and precision on both the RGB image dataset and the thermal image dataset in order to identify the appropriate dataset for face liveness detection. For the comparison, we used CNN, MLP, and C-SVM. The left side of Table 1 shows the parameters of CNN applied to the RGB image dataset, and the right side of Table 1 shows the parameters of CNN applied to the thermal image dataset. We empirically sought the values of the parameters that would make the error of the artificial neural network converge to zero.
In Table 1, nodes refers to the number of nodes in the corresponding layer.Further, con_ means convolutional layer and pool_ means pooling layer.input_, hidden_, and output_ mean input layer, As shown in Figure 4, the number of pixels has a difference of 64 times, but there is not much difference in the area to be taken.In addition, because the RGB lens and the infrared lens have different pixel sizes, and because there is a slight difference in the position of each lens on the camera, it is not clear how many pixels from the horizontal, vertical, top, and bottom sides should be cut for the same range of the scene.Therefore, it is impossible to capture the same extent of the range of the scene.For the correct experiment, if the real face is in a scene that the infrared lens cannot capture as an image, this image was removed from the experiment.
We use Adam [33], Dropout [34], and ReLu [35] to improve learning abilities when learning CNN and Thermal Face-CNN.The Adam algorithm reduces error by learning the weights existing in the artificial neural network.It is easier to execute than the back-propagation algorithm [36].It is also more efficient and requires less memory [33].Dropout prevents overfitting by allowing each node not to participate in the calculation randomly during the learning process [34].Sigmoid [37] was used as an activation function in the output layer of all artificial neural networks used in the experiments except for C-SVM, and ReLu was used as an activation function of the hidden layer.As the pooling layer, the max pooling layer [32] is used.In addition, the probability of dropping each node is 10%.An intel core i7-7820X CPU was used as the hardware in the experiment, and the memory was DDR4 32G.The experiment was carried out using the Tensorflow [38] library, which has artificial neural network code.In the case of C-SVM, the sklearn.svm.svclibrary was used to carry out the experiment.The information of the library can be found at a website listed in the Supplementary Materials at the end of this paper.
Accuracy [4], recall [4], and precision [4] were the main evaluation indices in the experiment. In this study, accuracy refers to how often the predicted value matches the actual value, regardless of whether a real face is present. Recall is the proportion of images containing the real face that are judged to contain the real face. Precision is the proportion of images predicted to contain the real face that actually contain it.
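The three indices can be written down as a short sketch. The function names and the toy labels below are illustrative assumptions, not data from the experiment; 1 stands for "real face present" and 0 for "no real face".

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for binary labels."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if a and p:
            tp += 1
        elif not a and p:
            fp += 1
        elif a and not p:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    """Share of all images whose prediction matches the actual value."""
    return (tp + tn) / (tp + fp + tn + fn)


def recall(tp, fn):
    """Share of real-face images that were judged to have the real face."""
    return tp / (tp + fn) if tp + fn else 0.0


def precision(tp, fp):
    """Share of images predicted as real face that actually have it."""
    return tp / (tp + fp) if tp + fp else 0.0


# Toy labels for illustration only.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
tp, fp, tn, fn = confusion_counts(actual, predicted)
```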

The Comparison of Face Liveness Detection between the RGB Image and Thermal Image
Before examining the performance of the proposed Thermal Face-CNN, we obtained accuracy, recall, and precision for both the RGB image dataset and the thermal image dataset in order to identify the appropriate dataset for face liveness detection. For the comparison, we used CNN, MLP, and C-SVM. The left side of Table 1 shows the parameters of CNN applied to the RGB image dataset, and the right side of Table 1 shows the parameters of CNN applied to the thermal image dataset. We empirically sought the values of the parameters that would make the error of the artificial neural network converge to zero.
In Table 1, nodes refers to the number of nodes in the corresponding layer. Further, con_ means convolutional layer and pool_ means pooling layer; input_, hidden_, and output_ mean input layer, hidden layer, and output layer, respectively. The rest of the parameters are the same as those described in Section 3. In Table 1, the values in parentheses give the width and length of the kernel or pooling window, in that order.
The parameter values for C-SVM used in the thermal image dataset are shown in Table 2. In Table 2, c is the error penalty parameter, which we varied across experiments. RBF [39] or polynomial (POLY) [39] is used as the kernel, and gamma is the kernel coefficient. In addition, n_features is the number of features, tolerance is the stopping criterion, and degree is the degree of the polynomial kernel function.
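To make the roles of gamma and degree concrete, the two kernels can be sketched in plain Python following the kernel definitions given in the scikit-learn documentation. The coef0 term is part of that polynomial definition but is not mentioned in the text, so its value here is an assumption, and the numeric inputs are illustrative rather than the settings of Table 2.

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF kernel: exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def poly_kernel(x, y, gamma, degree, coef0=0.0):
    """Polynomial kernel: (gamma * <x, y> + coef0) ** degree.
    coef0 is an assumed extra term, not a parameter named in the text."""
    dot = sum(a * b for a, b in zip(x, y))
    return (gamma * dot + coef0) ** degree


# Illustrative feature vectors and parameter values.
same = rbf_kernel([0.2, 0.4], [0.2, 0.4], gamma=0.5)
far = rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.5)
poly = poly_kernel([1.0, 2.0], [3.0, 4.0], gamma=0.5, degree=2)
```

A larger gamma makes the RBF kernel fall off faster with distance, so each training image influences a smaller neighbourhood of the feature space.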
The parameters of the MLP used to learn the thermal images are shown in Table 3. Images 1 to 599 of the RGB image dataset and the thermal image dataset (599 images in total) were used as training data, and the remaining 245 images were used as test data. Of the 844 images, 338 contain the real face and 506 do not. The training set contains 225 images with the real face and 374 without; the test set contains 113 images with the real face and 132 without.
Table 4 shows the experimental results of CNN in the RGB image dataset and the thermal image dataset. Tables 5 and 6 show the experimental results of MLP and C-SVM in the thermal image dataset. The figures in the following tables, including Tables 4-6, were rounded to the fourth decimal place. Figures expressed as percentages in the following tables were rounded to the second decimal place. In Tables 4 and 5, "The best" refers to the highest values and "Average" means the average value.
To obtain the information shown in Table 4, five CNNs in the RGB image dataset and 20 CNNs in the thermal image dataset were trained with the same parameters. Because a neural network trained with the same parameters ends up with a different combination of weights each time, and therefore performs differently, we repeated the experiment 20 times to obtain the average accuracy, recall, and precision. However, each image in the RGB image dataset contains 907,200 pixels, which requires a substantial amount of computation. Therefore, 20 CNNs were trained on the thermal image dataset, but only five on the RGB image dataset. To obtain Table 5, five MLPs were trained, because MLP also requires a large amount of computation. To evaluate C-SVM's performance in Table 6, we trained one C-SVM for each parameter setting.
The accuracy, recall, and precision values in Table 4 obtained using the thermal image dataset are higher than those obtained using the RGB image dataset. This indicates that, for CNN, the thermal image is more suitable than the RGB image.
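The split described above can be checked with a few lines of arithmetic. The dictionary layout and variable names are illustrative assumptions; the counts are the ones stated in the text. The last value is the accuracy that a classifier would reach on the test set by always predicting the more frequent class, which is a useful reference point when reading the accuracy values in the tables.

```python
# Image counts as stated in the text.
split = {
    "train": {"real_face": 225, "no_real_face": 374},
    "test": {"real_face": 113, "no_real_face": 132},
}

train_total = sum(split["train"].values())
test_total = sum(split["test"].values())
real_total = split["train"]["real_face"] + split["test"]["real_face"]
no_real_total = split["train"]["no_real_face"] + split["test"]["no_real_face"]

# Accuracy of always predicting the majority class of the test set.
majority_baseline = max(split["test"].values()) / test_total

print(train_total, test_total, real_total, no_real_total)
print(round(majority_baseline, 4))
```

On this test set, always answering "no real face" gives an accuracy of 132/245, or about 0.54, with a recall of zero.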
In the case of MLP, since each RGB image contains 907,200 pixels, the input layer would also need 907,200 nodes. We tried to implement an MLP with about 900,000 nodes in the input layer, but hardware limitations made the computation impossible. Further, the C-SVM was trained on the RGB image dataset using the parameters shown in Table 2, but it did not learn properly and predicted that there was no real face for all the test data. However, as shown in Tables 5 and 6, MLP and C-SVM can be trained on the thermal image data because of its small number of pixels. Comparing Tables 4-6 shows that good performance can be obtained with the thermal image data.

4.3. Performance Comparison of CNN, C-SVM, and Thermal Face-CNN
Section 4.2 showed that the thermal image is better than the RGB image. In Section 4.3, we apply the Thermal Face-CNN proposed in this paper to the thermal image, which outperformed the RGB image for face liveness detection, and compare its performance with those of the other algorithms. For Thermal Face-CNN, we used the same parameters as for CNN on the thermal image dataset. We also constructed 20 Thermal Face-CNNs with the same parameter setting as used in the experiment on the 20 CNNs shown in Table 4. The accuracy, recall, and precision values of the Thermal Face-CNNs are shown in Tables 7-12. Parenthetical values in these tables indicate the knowledge value, up limit value, and down limit value, in that order.
In Figure 5, the blue line shows the performance of C-SVM, the green and black lines show the performance of Thermal Face-CNN, the red line shows the performance of MLP, and the orange line shows the performance of CNN. To obtain Figure 5, we used the best-performing configuration of each algorithm: the MLP with an accuracy of 0.7837, a recall of 0.5664, and a precision of 0.9412; the CNN with an accuracy of 0.8367, a recall of 0.7876, and a precision of 0.8476; the best-performing Thermal Face-CNN among those with an up limit value of 39 and a down limit value of 34, which has an accuracy of 0.8327, a recall of 0.8407, a precision of 0.8051, and a knowledge value of −5; the best-performing Thermal Face-CNN among those with a knowledge value of 10, which has an accuracy of 0.8245, a recall of 0.8496, a precision of 0.7869, an up limit value of 39, and a down limit value of 33; and the C-SVM with a c value of 1. As shown in Figure 5, Thermal Face-CNN shows a dramatic performance improvement over CNN, and its performance is close to that of MLP and C-SVM. In this paper, we argue that Thermal Face-CNN is better when precision is more important than recall. However, the ROC graph does not directly consider precision, because it is built from the true positive rate and the false positive rate, neither of which is precision. Nonetheless, the ROC graph shows that Thermal Face-CNN is superior to CNN.
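The remark that the ROC graph does not directly consider precision can be illustrated with a short sketch. The function names and the rates 0.8 and 0.1 are illustrative assumptions; the class sizes 113 and 132 are the test-set counts given earlier, and the second class balance is hypothetical. The same ROC point yields different precision values once the ratio of positive to negative images changes.

```python
def roc_point(tp, fp, tn, fn):
    """Return (false positive rate, true positive rate) for one threshold."""
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return fpr, tpr


def precision_from_rates(tpr, fpr, n_pos, n_neg):
    """Precision implied by an ROC point for a given class balance."""
    tp = tpr * n_pos
    fp = fpr * n_neg
    return tp / (tp + fp) if tp + fp else 0.0


# One illustrative ROC point evaluated under two class balances.
tpr, fpr = 0.8, 0.1
p_test_set = precision_from_rates(tpr, fpr, n_pos=113, n_neg=132)
p_skewed = precision_from_rates(tpr, fpr, n_pos=113, n_neg=1320)  # hypothetical

print(round(p_test_set, 4), round(p_skewed, 4))
```

This is why a ranking by precision, or by an F-measure that weights precision, has to be checked separately from the ROC comparison.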

Conclusions and Future Works
Face liveness detection is an important technique for confirming that a real person is present when security information is delivered. In this paper, face liveness detection was performed in an indoor residential environment using the fact that thermal patterns on a face shown on a display or in a photograph differ from those on the real face. First, we quantitatively compared the performance of the thermal image with that of the RGB image. The thermal image proved more suitable for face liveness detection: the best-performing CNN achieved an accuracy of 0.6898, a recall of 0.5752, and a precision of 0.7342 on the RGB image dataset, versus an accuracy of 0.8367, a recall of 0.7876, and a precision of 0.8476 on the thermal image dataset. We also proposed Thermal Face-CNN, which adds external knowledge about the real face temperature to the existing CNN algorithm, and compared it with CNN. The performance of the best-performing Thermal Face-CNN is equal to or better than that of CNN. Furthermore, we used the F-measure to identify the condition in which Thermal Face-CNN performs better than C-SVM.
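As a minimal sketch of how an F-measure can weight precision against recall, the general F-beta form is shown below. The function name and the beta values are illustrative assumptions; this section does not state which weighting was used. The precision and recall pairs are the ones quoted in the text for the models plotted in Figure 5, with the Thermal Face-CNN labels giving the knowledge value, up limit value, and down limit value.

```python
def f_measure(precision, recall, beta=1.0):
    """F-beta score: beta < 1 weights precision more, beta > 1 weights recall more."""
    b2 = beta * beta
    denom = b2 * precision + recall
    return (1.0 + b2) * precision * recall / denom if denom else 0.0


# (precision, recall) pairs quoted in the text.
reported = {
    "MLP": (0.9412, 0.5664),
    "CNN": (0.8476, 0.7876),
    "Thermal Face-CNN (-5, 39, 34)": (0.8051, 0.8407),
    "Thermal Face-CNN (10, 39, 33)": (0.7869, 0.8496),
}

for beta in (0.5, 1.0, 2.0):  # illustrative values, not taken from the paper
    scores = {name: f_measure(p, r, beta) for name, (p, r) in reported.items()}
    print(beta, {name: round(s, 4) for name, s in scores.items()})
```

With beta equal to zero the score reduces to precision alone, and as beta grows it approaches recall, so the ranking of the models can change with the chosen weighting.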
Based on the results in this paper, we hope that Thermal Face-CNN with the thermal image will be used to detect malicious attempts to imitate a face. This paper shows that it is possible to insert external knowledge by adjusting the values within a particular real number range. Therefore, we expect that applied algorithms incorporating knowledge from various fields will emerge.
In this study, the experiment was conducted using 844 scenes. As the amount of data increases, it becomes more feasible to use face liveness detection in more general situations. Therefore, more thermal images need to be collected in the future. Moreover, due to the difference between the RGB lens and the infrared lens, the captured images differ in pixel size, number of pixels, and range of the scene. Therefore, datasets with fewer differences between the RGB and thermal images need to be constructed. Because not all possible combinations of the algorithms' parameters were tested, the comparisons are not conclusive. Therefore, additional experiments are needed to identify the parameter combination that yields the highest accuracy, recall, precision, and F-measure values.
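One way to picture adjusting the values within a real number range is the sketch below. It is an assumption-laden illustration, not the paper's exact procedure: the function name, the additive update, and the inclusive bounds are hypothetical choices. Only the three parameter names (knowledge value, up limit, down limit) come from the text, and the temperatures are made-up readings in degrees.

```python
def insert_knowledge(temps, knowledge_value, up_limit, down_limit):
    """Add knowledge_value to every reading inside [down_limit, up_limit].

    Assumed rule for illustration: readings outside the range are unchanged.
    """
    return [
        [t + knowledge_value if down_limit <= t <= up_limit else t for t in row]
        for row in temps
    ]


# Hypothetical 2x2 patch of a thermal image.
patch = [
    [22.0, 36.5],
    [40.1, 34.0],
]
adjusted = insert_knowledge(patch, knowledge_value=10, up_limit=39, down_limit=33)
```

Under these assumptions, only the readings that fall within the plausible face-temperature range are shifted, which makes them stand out from the background before the image is passed to the network.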

Figure 2. Example of the process of inserting external knowledge.

Figure 3. Data examples: (a) a real face taken by RGB lens; (b) a face on a display taken by RGB lens; (c) a ceiling air conditioner taken by RGB lens; (d) a real face taken by infrared lens; (e) a face on a display taken by infrared lens; (f) a ceiling air conditioner taken by infrared lens.

Figure 4. Comparison of the ranges of lenses.

Table 1. Convolutional neural network (CNN) parameters used in the RGB image dataset and the thermal image dataset.

Table 2. C-support vector machine (C-SVM) used in the thermal image dataset.

Table 3. Multi-layer neural network (MLP) parameters in the thermal image dataset.

Table 4. CNN's performance in the RGB image dataset and the thermal image dataset.

Table 5. MLP's performance in the thermal image dataset.

Table 6. C-SVM's performance in the thermal image dataset.

Table 10. Thermal Face-CNN accuracy, recall, and precision values.

In an ROC graph, line 'A' is better than line 'B' if line 'A' is closer to the northwest than line 'B'.