Face Liveness Detection Using Defocus

In order to develop security systems for identity authentication, face recognition (FR) technology has been applied. One of the main problems of applying FR technology is that the systems are especially vulnerable to attacks with spoofing faces (e.g., 2D pictures). To defend from these attacks and to enhance the reliability of FR systems, many anti-spoofing approaches have been recently developed. In this paper, we propose a method for face liveness detection using the effect of defocus. From two images sequentially taken at different focuses, three features, focus, power histogram and gradient location and orientation histogram (GLOH), are extracted. Afterwards, we detect forged faces through the feature-level fusion approach. For reliable performance verification, we develop two databases with a handheld digital camera and a webcam. The proposed method achieves a 3.29% half total error rate (HTER) at a given depth of field (DoF) and can be extended to camera-equipped devices, like smartphones.


Introduction
At present, many people deal with personal business using portable devices. From unlocking cellular phones to financial business transactions, people can easily conduct their individual business tasks through such a device. Due to this trend, personal authentication has become a significant issue [1]. Instead of using a simple PIN code, industries have developed stronger security systems with biometric authorization technology [2]. Biometric traits, such as face, iris and fingerprint, are very powerful factors to protect one's private information.
However, attempts to invade security systems and steal personal information have been increasing. One type of these attacks involves using fake identities. Spoofing faces and fingerprints are threatening security systems and privacy. This would not matter if current face recognition (FR) systems were secure, but current systems cannot distinguish fake faces from real faces. In some cases, the FR system embedded in cellular phones gives approvals to forged faces. This phenomenon is an example of weakness in the biometric system. If this problem remains unsolved, anyone will be able to easily obtain others' personal information in order to commit identity-related crimes. For this reason, technological defense against spoofing attacks is necessary, so as to protect personal systems and users' private data. Over the last decade, researchers have shown steady progress in developing anti-spoofing technologies [3]. Most of these methods concentrate on exploiting features obtained from the analysis of textures, spectrums and motion in order to detect face liveness.
In this paper, we propose a new method to secure face identification systems against forged 2D photos. The key idea of our method is to exploit a camera function, variable focusing. In shape-from-focus, it is possible to construct 3D images using focus measures [4,5]. Although we do not need to recover 3D depth images, we use the characteristics of defocusing to infer whether depth information is present. When the focus setting is adjusted, the parts of the image that are not in focus become blurry. With this function, we can evaluate differences in the degree of focus between real faces and fake faces and use this information to detect face liveness. To evaluate our method, we built two databases using a handheld digital camera and a webcam.
The remainder of this paper is organized as follows. In Section 2, we discuss previous studies on face liveness detection and the theoretical background of camera focusing. Our proposed methodologies are stated in Section 3. In Section 4, experimental results are shown and the details are discussed. Finally, concluding remarks are provided in Section 5.

Countermeasures against Spoofing Faces
Numerous approaches to minimize vulnerability to attacks using spoofing faces have been proposed. In early research, intrusive methods that request user cooperation, such as speaking phrases and shaking one's head [6], were developed. However, these approaches cause users inconvenience and rely on users' cooperation. For this reason, many researchers have attempted to develop non-intrusive methods.
Depending on the type of attack, methods can be categorized into three groups: 2D static attacks (facial photographs), 2D dynamic attacks (videos) and 3D attacks (masks). Skills and devices for disguising one's identity have evolved gradually. Masks and videos are examples of advanced spoof attacks. Some studies have focused on protecting FR systems from these advanced attacks [7,8]. However, because such advanced tools are difficult and costly to obtain, 2D static attacks, such as photographs, have been widely used by attackers. In this section, we review studies on detecting 2D facial photo-based spoof attacks.
There are three main spoof detection approaches, depending on the characteristics of input faces. The first approach is based on textures. Real and fake faces have different texture characteristics. Some studies have used texture to detect forged faces. Kim et al. [9] applied local binary patterns (LBP) for texture analysis and power spectrum for frequency analysis. Määttä et al. [10] and Bai et al. [11] also detected face liveness by examining micro texture with multiscale LBP. Peixoto et al. [12] proposed a method to detect and maintain edges (high-middle frequencies) with different Gaussian characteristics under poor illumination conditions. In [13], the authors extracted essential information for discrimination using a Lambertian model. Singh et al. [14] proposed a method to classify real faces based on a second-order gradient. This approach focuses on differences between skin surfaces of real and fake faces. Kant et al. [15] presented a real-time solution using the skin elasticity of the human face. Approaches with a single image have advantages in terms of low capacity and simplicity.
The second approach uses motion information. Signs of liveness, such as eye blink and head movements, are clues to distinguish motionless spoofing faces. Image sequences can be used to perceive movements. These factors are exploited intuitively [16][17][18][19][20]. In addition, optical flow and various illumination approaches are helpful to analyze the differences between real and fake faces [21][22][23][24][25]. Applying the entropies of RGB color spaces is one factor in face liveness detection [26]. To make a robust system, several methods use a combination of static and dynamic images [18,27].
The last approach is based on 3D facial information. The obvious difference between a real face and a fake face is the presence or absence of depth information. Human faces have curves, while photos are flat. By considering this feature, researchers have classified spoofing attacks. Wang et al. [28] suggested an approach to detect face liveness by recovering sparse 3D facial models, and Lagorio et al. [29] presented a solution based on 3D facial shape analysis.

Background Related to Focusing
Unlike previous research, our method utilizes the effect of defocus. Defocusing is exploited to estimate the depth in an image [4,5,30]. The degree of focus is determined by the depth of field (DoF), the range between the nearest and farthest objects that appear acceptably sharp for a given focal plane. Objects within the DoF are perceived as sharp. To emphasize the effect of defocus, the DoF should be narrow. Three parameters modulate the DoF, and Figure 1 shows the conditions for a shallow DoF [31]. The first is the distance between the camera and the subject; a short distance produces a shallow DoF. The second is the focal length; a longer focal length produces a shallower DoF. The last is the lens aperture; a wider aperture produces a shallower DoF. Using these options, we can obtain images with a narrow DoF and a large variation in focus [31].
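The three conditions for a shallow DoF can be checked numerically. The sketch below is not taken from the paper: it uses the standard hyperfocal-distance approximation for a thin lens, and the function name and the circle-of-confusion value (0.02 mm) are assumptions chosen only for illustration.

```python
def depth_of_field_mm(focal_length_mm, f_number, subject_distance_mm, coc_mm=0.02):
    """Approximate total depth of field in millimetres.

    Uses the textbook hyperfocal-distance formulas. coc_mm (circle of
    confusion) is an assumed value, not a parameter reported in the paper.
    """
    f, n, s, c = focal_length_mm, f_number, subject_distance_mm, coc_mm
    hyperfocal = f * f / (n * c) + f
    if s >= hyperfocal:
        # Beyond the hyperfocal distance the far limit is at infinity.
        return float("inf")
    near = s * (hyperfocal - f) / (hyperfocal + s - 2 * f)
    far = s * (hyperfocal - f) / (hyperfocal - s)
    return far - near


# Shorter distance, longer focal length and wider aperture (smaller
# f-number) should each give a shallower DoF than the alternative.
baseline = depth_of_field_mm(50, 2.0, 500)
farther = depth_of_field_mm(50, 2.0, 1000)
shorter_lens = depth_of_field_mm(35, 2.0, 500)
narrower_aperture = depth_of_field_mm(50, 8.0, 500)
```

Under these assumed values, the baseline setting gives the smallest DoF of the four, which matches the qualitative conditions described above.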

Previous Work with Variable Focusing [32]
In the previous work [32], a method for face liveness detection using variable focusing was suggested. Two images sequentially taken at different focuses are used as input images, and focus features are extracted. The focus feature is based on the variation of the sum modified Laplacian (SML) [33], which represents the degree of focusing. With the focus feature and a simple classifier, fake faces are detected. 2D printed photos are used as spoofing attacks, and a database composed of images with various focuses is produced for evaluation. When the DoF is shallow enough that only part of the image is blurred, this method shows good results. However, at a deep DoF, the performance deteriorates. To make up for this weakness of the previous work, we propose an improved method in this paper. We detect spoofing faces by extracting local feature descriptors and frequency characteristics, as well as the focus feature, from the defocused images. Moreover, the size of the database is increased, and various experiments are performed to achieve the best result. A detailed explanation is given in the following sections.

Proposed Methodology
In this section, we introduce new FR anti-spoofing methods using defocusing techniques. From partially defocused images, we extract features and classify fake faces. The most significant difference between real and fake faces is the existence of depth information. Real faces have three dimensions, with the nose and ears being relatively far from each other. This distance can be used to adequately represent the depth information. Depending on the object or place of focus, the ear area might or might not be clear, as shown in Figure 2a. Unlike real faces, 2D spoofing faces are flat. There is little difference in clarity, regardless of the focus (Figure 2b). We emphasize this characteristic in order to discriminate real faces from 2D faces. In order to maximize the effect of defocus, we must adjust the DoF to be shallow, as mentioned in Section 2. However, according to the type of camera, the adjustment of DoF may not be possible. Therefore, we obtain input images using two cameras, a handheld digital camera and a webcam. We will explain image acquisition in the following section.
Our system is composed of three steps: image acquisition and preprocessing, feature extraction and classification (Figure 3).

Image Acquisition and Preprocessing
In our method, image acquisition is an important factor in performance. As mentioned in the previous section, a narrow DoF increases the effect of defocus and assists with detecting fake faces. However, not every camera can easily change its DoF and focal plane. If people use handheld digital cameras, such as DSLR (digital single lens reflex) and mirrorless cameras, the DoF can be made shallow by directly controlling camera settings and the areas of desired focus can be manually selected. However, when users utilize webcams and cameras embedded in cellular phones, they cannot accurately manipulate the DoF. Moreover, the position of the focal plane is inexact with such cameras. Therefore, the process of image acquisition needs to vary with the type of camera. We will introduce two methods appropriate for a handheld digital camera and a webcam, respectively.

Using a Handheld Digital Camera
With handheld digital cameras (DSLR camera, mirrorless camera, compact digital camera, etc.), it is possible to manually control the focal plane and DoF. Hence, two sequentially focused facial images are obtained for use in these experiments: one focused on the nose and the other on the ears (Figure 2). To set the focus on the ears or nose, one can tap the LCD panel or turn a focus ring, depending on the type of handheld digital camera. In this paper, a mirrorless camera (SONY-NEX5) with a focus ring is used. Therefore, we acquire the focused images by turning the focus ring and visually checking the sharpness in the regions of the ears and nose.
In the preprocessing step, we geometrically normalize images based on the location of the eyes [34]. The position of the face differs slightly from image to image, so faces must be aligned for accurate comparison. Based on the coordinates of the eyes, we translate, rotate and crop the facial images. The eyes can be automatically detected by using feature templates. In this paper, however, we select the correct positions of the eyes manually in every image and save the coordinates. Figure 4 shows the normalized images produced in the present study. The images in Figure 4a,c are focused on the nose ($I_N$), and those in Figure 4b,d on the ears ($I_E$).

Using a Webcam
The focus in a webcam is controlled by adjusting the plastic lens in and out. However, the DoF is unknown, and it is difficult to select the focus area without the use of a supplemental program. Therefore, unless the program is used, it is not easy to obtain images focused on either the nose or ears. In order to acquire input images with a webcam, we approach the problem in a different way.
Although it is not possible to accurately take images focused on either the nose or ears when using a webcam, it is possible to obtain image sequences by changing the lens motor step. Depending on the adjustment of the lens, the focal plane varies, producing images with different focal planes. From the image sequence collected here, we select two images, $I_N$ and $I_E$, which denote the normalized images in which the nose area and the ear area are in focus, respectively. To determine these images, we detect the nose and ears and calculate the degrees of focus in those areas [4]. As mentioned before, the centers of the eyes and the regions of the ears and nose are selected manually in this paper. When the focus value of a specific area reaches its maximum at the k-th step, that region is in focus at that step. Figure 5 depicts the changes in focus values in accordance with the lens step. In Figure 5a, the nose area is in focus at the 20th step and the ear area at the 16th step. With fake faces, the steps of the maximum focus values for the nose and ears are the same, as shown in Figure 5b. This allows one to distinguish between real and fake faces. Through this procedure and normalization, we can choose two images as $I_N$ and $I_E$ (Figure 6).
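The selection step described above can be sketched as follows. This is an illustrative outline only: the function names, the box format (top, bottom, left, right) and the use of a summed modified Laplacian as the focus value are assumptions, since the text only states that a focus measure from [4] is evaluated in the manually selected nose and ear regions.

```python
import numpy as np


def modified_laplacian_sum(region):
    """Sum of |2I - left - right| + |2I - up - down| over interior pixels."""
    g = np.asarray(region, dtype=float)
    c = g[1:-1, 1:-1]
    ml = (np.abs(2 * c - g[1:-1, :-2] - g[1:-1, 2:])
          + np.abs(2 * c - g[:-2, 1:-1] - g[2:, 1:-1]))
    return float(ml.sum())


def select_focus_steps(frames, nose_box, ear_box):
    """Return (nose_step, ear_step), the lens steps with maximal focus.

    frames: sequence of 2D grayscale arrays, one per lens motor step.
    nose_box, ear_box: hypothetical (top, bottom, left, right) regions.
    """
    def crop(img, box):
        top, bottom, left, right = box
        return np.asarray(img)[top:bottom, left:right]

    nose_scores = [modified_laplacian_sum(crop(f, nose_box)) for f in frames]
    ear_scores = [modified_laplacian_sum(crop(f, ear_box)) for f in frames]
    return int(np.argmax(nose_scores)), int(np.argmax(ear_scores))
```

The frames at the two returned steps would then be normalized and used as $I_N$ and $I_E$. If both steps coincide, as in Figure 5b, that is already a hint that the face is flat.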

Feature Extraction
To detect forged faces, features are extracted from normalized images. In this paper, we use three feature descriptors: focus, power histogram and gradient location and orientation histogram (GLOH) [35].

Focus Feature
The focus feature is related to the degree of focusing. In the previous study [32], this feature was suggested and used for classifying fake faces. Figure 7 shows the flowchart for extracting focus features. Using several focus measures [4], we can numerically calculate the focus levels in each pixel. There are various focus measures, such as Laplacian-based measures and gradient-based measures. We will show the performance in accordance with the focus measures.
The images in Figure 8 are the results of modified Laplacian (LAPM) focus measure calculations. LAPM is one of the focus measures introduced in [4,33] and is computed as the sum of transformed Laplacian filters. Figure 8a,b shows the LAPMs of a real facial image focused on the nose and ears, and Figure 8c,d shows the LAPMs of a fake facial image focused on the nose and ears. We denote the LAPM of nose-focused images by $\mathrm{LAPM}_N$ and the LAPM of ear-focused images by $\mathrm{LAPM}_E$. In $\mathrm{LAPM}_N$ and $\mathrm{LAPM}_E$, bright pixels represent high values of LAPM, and those regions are in focus with sharp edges. In contrast, out-of-focus regions are severely blurred, lose edge information and have low values of LAPM. In the case of real faces, the nose area in $\mathrm{LAPM}_N$ (Figure 8a) is brighter than that in $\mathrm{LAPM}_E$ (Figure 8b). However, there is little difference between the $\mathrm{LAPM}_N$ and $\mathrm{LAPM}_E$ of fake faces (Figure 8c,d). Consequently, by computing the variations in focus measures, we can determine the degree of focusing.
To maximize the LAPM difference between the regions of the nose and ears, we subtract $\mathrm{LAPM}_E$ from $\mathrm{LAPM}_N$, i.e., we compute $\mathrm{LAPM}_N - \mathrm{LAPM}_E$. To analyze these differences in LAPMs (DiF, difference in focus measures) in a single dimension, we sum the DiF values in each column. In Figure 9, the blue lines show these cumulative sums of the DiF for real and fake faces. However, these distributions are not suitable for liveness detection without refinement, because noise affects the results. Therefore, curve fitting is performed to extract meaningful features. The sum of the DiF of real faces has a shape similar to the curve of a quadratic equation, $y = ax^2 + bx + c$. The three coefficients of the quadratic equation, $A = [a \; b \; c]^T$, are exploited as a feature for classification. To calculate the values of these coefficients, we perform error minimization [32]. Figure 9 presents the results of curve fitting (red circles).
The curve for the cumulative sum of DiF of the real face is convex, as shown in Figure 9a, while that of the fake face is flat. In Figure 10, coefficients of quadratic equations are plotted. Blue circles are features of real faces, and red crosses are those of spoofing faces. Depending on the range of DoF, the degree of feature overlap will change.
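The focus feature pipeline can be outlined in a few lines. The sketch below makes several assumptions that the text does not fix: the per-pixel modified Laplacian is used without a summation window, border pixels are set to zero, the column index is used directly as $x$, and the error minimization is an ordinary least-squares fit via `np.polyfit`. The function names are hypothetical.

```python
import numpy as np


def lapm_map(image):
    """Per-pixel modified Laplacian; border pixels are left at zero."""
    g = np.asarray(image, dtype=float)
    out = np.zeros_like(g)
    c = g[1:-1, 1:-1]
    out[1:-1, 1:-1] = (np.abs(2 * c - g[1:-1, :-2] - g[1:-1, 2:])
                       + np.abs(2 * c - g[:-2, 1:-1] - g[2:, 1:-1]))
    return out


def fit_quadratic(column_sums):
    """Least-squares fit of y = a*x^2 + b*x + c; returns [a, b, c]."""
    y = np.asarray(column_sums, dtype=float)
    x = np.arange(y.size, dtype=float)
    return np.polyfit(x, y, 2)


def focus_feature(img_nose, img_ear):
    """Coefficients A = [a, b, c] fitted to the column-wise sums of DiF."""
    dif = lapm_map(img_nose) - lapm_map(img_ear)  # LAPM_N - LAPM_E
    return fit_quadratic(dif.sum(axis=0))
```

For two identical inputs the DiF is zero everywhere, so all three coefficients vanish, which corresponds to the flat curve described for fake faces.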

Power Histogram Feature
Out-of-focus images have few edge components because the blurring filter eradicates the boundary. This affects the frequency characteristics of such images. We analyze this feature to identify forged faces. In this section, we introduce another feature, the power histogram feature, which contains spatial frequency information. The process of extracting this feature is presented in Figure 11.
In the first step, we divide a normalized image into three subregions, as shown in Figure 12. When a picture is taken focusing on the ears, we adjust the focal plane to include the ear area. Not only the ears, but also other components within the DoF are in focus. To analyze those components, we divide the images radially. The first subregion (subR1, Figure 12b) is the nose area, the second subregion (subR2, Figure 12c) includes the eyes and mouth, and the third subregion (subR3, Figure 12d) contains the ears and the contour of the chin.
Using a Fourier transform, we convert the subregions from the spatial domain to the frequency domain. Figure 13 illustrates the center-shifted Fourier spectra of the three subregions, with power concentrated at the center of each spectrum. The distribution of power differs from subregion to subregion. To analyze those distributions, we calculate the percentage of power in circular regions. We divide the frequency spectrum into several circular regions, which are allowed to overlap. The percentage of power within a circular region is computed by Equation (1) [36], where C is a circular region and real(u, v) and imag(u, v) are the real and imaginary parts of the frequency component, respectively. Each spectrum has a histogram, and the value of each bin is the percentage of power in the corresponding circular area. By concatenating the three histograms, we obtain a combined histogram from one image. The dimensionality of the histogram is determined by the radii of the circular regions in the frequency spectrum. With real faces, the power histograms vary depending on the focus area, whereas those of fake faces do not. We use the differences in the power histograms as a feature for liveness detection.
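Equation (1) is not reproduced in this text. One form consistent with the description is the ratio, in percent, of the power summed over the circular region C to the power summed over the whole spectrum, where the power at (u, v) is real(u, v)² + imag(u, v)². The sketch below implements that reading. It assumes that each bin covers a full disk centered on the zero-frequency component (so bins overlap), that the subregions are passed in as already masked 2D arrays, and that the feature is the nose-focused histogram minus the ear-focused one. The default radii are the ones reported later for the fusion experiments, and the function names are hypothetical.

```python
import numpy as np


def power_percentages(region, radii=(5, 15, 30, 50, 75)):
    """Percentage of spectral power inside disks of the given radii."""
    g = np.asarray(region, dtype=float)
    spectrum = np.fft.fftshift(np.fft.fft2(g))  # zero frequency at the center
    power = spectrum.real ** 2 + spectrum.imag ** 2
    total = power.sum()
    if total == 0:
        return np.zeros(len(radii))
    rows, cols = g.shape
    v, u = np.indices((rows, cols))
    dist = np.hypot(u - cols // 2, v - rows // 2)
    return np.array([100.0 * power[dist <= r].sum() / total for r in radii])


def power_histogram_feature(subregions_nose, subregions_ear,
                            radii=(5, 15, 30, 50, 75)):
    """Difference of the concatenated histograms of I_N and I_E."""
    hist_n = np.concatenate([power_percentages(s, radii) for s in subregions_nose])
    hist_e = np.concatenate([power_percentages(s, radii) for s in subregions_ear])
    return hist_n - hist_e
```

With three subregions and five radii, the feature has 15 dimensions. If the bins were meant to be rings rather than disks, the list comprehension would subtract the previous disk from each entry.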

GLOH Feature
We extract another feature descriptor, the gradient location and orientation histogram (GLOH) [35], which is an extended version of scale-invariant feature transform (SIFT) [37] and makes it possible to consider more spatial regions, as well as making feature descriptors robust and distinctive. In this paper, we modify and apply this feature locally. Figure 14 shows the flowchart of extracting the GLOH feature. For each Gaussian smoothed image, the gradient magnitude, GMag, and orientation, GOri, are computed by Equation (2).
$$GOri(x, y) = \tan^{-1}\left(\frac{I(x, y+1) - I(x, y-1)}{I(x+1, y) - I(x-1, y)}\right), \qquad GMag(x, y) = \sqrt{\left(I(x+1, y) - I(x-1, y)\right)^2 + \left(I(x, y+1) - I(x, y-1)\right)^2} \tag{2}$$
Next, we divide the image into P × Q patches in order to draw features locally. Figure 15 shows how to separate the image into patches. GLOH descriptors are derived from polar location grids in patches. As shown in Figure 15, each patch is divided into 17 subregions (three bins in each radial direction and eight bins in each angular direction). Note that the central subregion is not split. In a subregion, the gradient orientations are quantized into 16 bins (Figure 16). From one patch, 17 histograms are created. We reshape these histograms into one column vector, whose dimensionality is 272 (=17 × 16), as illustrated in Figure 17. Finally, a 272 × P × Q-dimensional column vector is extracted from P × Q patches.
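The bin layout above (1 central region plus 2 rings × 8 sectors = 17 locations, each with 16 orientation bins) can be sketched as follows. Several details are assumptions rather than the authors' settings: the ring boundaries (`radial_edges`, as fractions of the patch half-diagonal), weighting each vote by gradient magnitude, using `arctan2` to obtain a full-circle orientation instead of the plain inverse tangent, computing gradients per patch with zero borders, and omitting the Gaussian smoothing step. All function names are hypothetical.

```python
import numpy as np


def gradients(image):
    """Central-difference gradient magnitude and orientation."""
    g = np.asarray(image, dtype=float)
    dx = np.zeros_like(g)
    dy = np.zeros_like(g)
    dx[:, 1:-1] = g[:, 2:] - g[:, :-2]   # I(x+1, y) - I(x-1, y)
    dy[1:-1, :] = g[2:, :] - g[:-2, :]   # I(x, y+1) - I(x, y-1)
    return np.hypot(dx, dy), np.arctan2(dy, dx)


def gloh_patch_descriptor(patch, radial_edges=(0.25, 0.6, 1.0),
                          n_angular=8, n_orient=16):
    """272-dimensional histogram (17 locations x 16 orientations)."""
    mag, ori = gradients(patch)
    rows, cols = mag.shape
    y, x = np.indices((rows, cols))
    cy, cx = (rows - 1) / 2.0, (cols - 1) / 2.0
    r = np.hypot(y - cy, x - cx) / np.hypot(cy, cx)
    theta = np.arctan2(y - cy, x - cx)
    ring = np.minimum(np.searchsorted(radial_edges, r), len(radial_edges) - 1)
    sector = np.floor((theta + np.pi) / (2 * np.pi) * n_angular).astype(int) % n_angular
    # Location 0 is the unsplit center; the outer rings are split into sectors.
    location = np.where(ring == 0, 0, 1 + (ring - 1) * n_angular + sector)
    obin = np.floor((ori + np.pi) / (2 * np.pi) * n_orient).astype(int) % n_orient
    n_locations = 1 + (len(radial_edges) - 1) * n_angular
    hist = np.zeros((n_locations, n_orient))
    np.add.at(hist, (location.ravel(), obin.ravel()), mag.ravel())
    return hist.ravel()


def gloh_image_descriptor(image, patch_size):
    """Concatenate descriptors of non-overlapping square patches."""
    g = np.asarray(image, dtype=float)
    rows, cols = g.shape
    parts = [gloh_patch_descriptor(g[r:r + patch_size, c:c + patch_size])
             for r in range(0, rows - patch_size + 1, patch_size)
             for c in range(0, cols - patch_size + 1, patch_size)]
    return np.concatenate(parts)
```

With the default arguments each patch yields 17 × 16 = 272 values, so an image split into P × Q patches yields a 272 × P × Q-dimensional vector, in line with the text.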

Classification
For classification, a support vector machine with a radial basis function kernel (SVM-RBF) is used [38]. The SVM classifier is trained on the normalized focus, power histogram and GLOH features. Furthermore, we carry out fusion-based experiments by concatenating the normalized features. Figure 18 shows the flowchart of the feature-level fusion approach. The parameter of the SVM classifier is determined from the training data and the development data.
Figure 18. Flowchart of the feature-level fusion approach.
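Feature-level fusion amounts to normalizing each feature type and concatenating the results into one vector per sample. The text does not state which normalization is used, so the z-score scaling below (statistics estimated on the training set) is an assumption, as are the function names. The classifier itself is left out. An RBF-kernel SVM, for example scikit-learn's `SVC(kernel="rbf")`, could be trained on the fused matrix.

```python
import numpy as np


def fit_normalizer(train_features):
    """Per-dimension mean and standard deviation from training samples."""
    x = np.asarray(train_features, dtype=float)
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant dimensions
    return mean, std


def fuse_features(feature_blocks, normalizers):
    """Normalize each block (samples x dims) and concatenate column-wise."""
    parts = [(np.asarray(block, dtype=float) - mean) / std
             for block, (mean, std) in zip(feature_blocks, normalizers)]
    return np.hstack(parts)
```

Estimating the statistics only on the training set keeps the development and testing sets untouched until evaluation.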

Experimentation
Before evaluating the performance of our approaches, we collected frontal facial images from 24 subjects, because there is no open facial database with images focused on different facial areas. Although there are some databases for liveness detection, they do not satisfy our requirements. Therefore, we created two databases, one composed of images taken by a mirrorless camera (SONY-NEX5) and the other containing images taken by a webcam (Microsoft LifeCam Studio). The difference between the two cameras lies in how accurately and finely the focus can be controlled. With the mirrorless camera, it is possible to focus precisely on the nose or ear area. With the webcam, however, it is difficult to adjust the focus in detail, and users are not able to determine what is in focus. We will explain the processes of acquiring the databases in the following sections. We printed the photos for fake faces with a Fuji Xerox ApeosPort-II C5400 printer.
For evaluations, the following measures are used.
• False acceptance rate (FAR): the proportion of fake images misclassified as real.
• False rejection rate (FRR): the proportion of real images misclassified as fake.
The performance of the proposed method is evaluated with our own databases. The databases are randomly divided into three groups: training, development and testing sets.
• Training set (30%): to be used for training the classifier.
• Development set (30%): to be used for estimating the threshold of the classifier.
• Testing set (40%): to be used for evaluating the performance.
Thirty percent of the subjects are used for each of training and development, and forty percent of the subjects are used for testing. The three groups are disjoint: if images of subject 'A' are used for training, they cannot be used for development or testing.
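The evaluation protocol can be sketched in plain Python. The half total error rate is taken here as the mean of FAR and FRR, which is its usual definition. How the development set fixes the threshold is not specified in the text, so choosing the score at which FAR and FRR are closest is an assumption, as are the function names and the convention that higher scores mean "real".

```python
import random


def split_subjects(subject_ids, seed=0):
    """Subject-disjoint 30/30/40 split into train, development and test."""
    ids = sorted(set(subject_ids))
    random.Random(seed).shuffle(ids)
    n_train = round(0.3 * len(ids))
    n_dev = round(0.3 * len(ids))
    return ids[:n_train], ids[n_train:n_train + n_dev], ids[n_train + n_dev:]


def far_frr(scores, labels, threshold):
    """labels: 1 = real, 0 = fake; score >= threshold is accepted as real."""
    fake = [s for s, y in zip(scores, labels) if y == 0]
    real = [s for s, y in zip(scores, labels) if y == 1]
    far = sum(s >= threshold for s in fake) / len(fake)
    frr = sum(s < threshold for s in real) / len(real)
    return far, frr


def dev_threshold(scores, labels):
    """Threshold (from observed scores) where FAR and FRR are closest."""
    def gap(t):
        far, frr = far_frr(scores, labels, t)
        return abs(far - frr)
    return min(sorted(set(scores)), key=gap)


def hter(scores, labels, threshold):
    """Half total error rate: the mean of FAR and FRR."""
    far, frr = far_frr(scores, labels, threshold)
    return 0.5 * (far + frr)
```

In use, `dev_threshold` would be called on development-set scores and `hter` on testing-set scores with that fixed threshold. With 24 subjects, the split yields 7, 7 and 10 subjects.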

Data Acquisition
With the mirrorless camera, the nose and ear areas can be brought into focus, and the DoF is manually controlled. In order to obtain images with various DoFs, we adjusted the distance between the camera and the subject, the focal length and the F-stop. Figure 19 shows the ranges of the parameters. Table 1. The number of males is 17 and that of females is 7. The size of each normalized image is 150 by 150 pixels, and the distance between the eyes is 70 pixels. Figure 20 shows real (a) and fake (b) samples from the database.

Experimental Results
We carry out experiments for each type of feature, and the detailed results are described in Appendix A. The following shows the performance of the concatenated features. The features are combined at the feature level. For high performance, we choose the features based on those results. Modified Laplacian (LAPM) and wavelet sum (WAVS) are used as focus features. In the case of the power histogram feature, the radii of the circular regions are 5, 15, 30, 50 and 75. GLOH features are extracted using 75 × 75 patches without allowing overlap. In order to reduce the dimensionality of the GLOH features, we apply PCA and retain the eigenvectors that account for 90% of the variance. Table 2 shows the notation for the features. Table 3 and Figures 21 and 22 illustrate the results of the fusion-based methods. When the DoF is shallow (within 4 cm and 6 cm), the focus features (LAPM and WAVS) perform better than the other features. However, as the DoF becomes deeper, the performance of the focus features deteriorates. In the case of the GLOH and fusion-based features, the performance is maintained compared to the other features. In particular, the HTERs of the fusion-based features at a DoF within 16 cm are lower than those of the other features (6.27% and 6.08%). These numerical results demonstrate that the fusion-based methods are effective when the effect of defocusing is low.
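The PCA step can be sketched with NumPy alone. The version below keeps the smallest number of leading components whose cumulative variance reaches the requested fraction (0.90 in the setting above). The SVD-based implementation and the function names are assumptions, since the text does not describe how PCA was implemented.

```python
import numpy as np


def pca_fit(features, energy=0.90):
    """Return (mean, components) covering at least `energy` of the variance.

    features: array of shape (n_samples, n_dims), e.g. GLOH vectors of the
    training set. components has shape (k, n_dims).
    """
    x = np.asarray(features, dtype=float)
    mean = x.mean(axis=0)
    _, s, vt = np.linalg.svd(x - mean, full_matrices=False)
    variance = s ** 2
    ratio = np.cumsum(variance) / variance.sum()
    k = min(int(np.searchsorted(ratio, energy)) + 1, len(s))
    return mean, vt[:k]


def pca_transform(features, mean, components):
    """Project samples onto the retained components."""
    return (np.asarray(features, dtype=float) - mean) @ components.T
```

The projection would be fitted on the training set only and then applied unchanged to the development and testing sets.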

Data Acquisition
For evaluation, we gathered facial data using a Microsoft LifeCam Studio. Using the provided program, we could control the lens motor step from 0 to 40. Therefore, one input sequence is composed of 41 images. Among those, we choose $I_N$ and $I_E$, as mentioned in Section 3.1.2. The distance between the webcam and the subject is about 20 cm, so that the image can contain a whole face. The number of real face sequences is 94. Normal prints and an HD tablet (iPad 2) are used as spoofing attacks, and the numbers of sequences are 240 and 120, respectively. Five-fold cross-validation is applied for the evaluation.

Experimental Results
Numerical results are listed in Table 4. Good performance is maintained, even though the webcam database does not express depth information as well as the mirrorless camera database. The combined features give the best results, with HTERs of 3.02% under the normal print attack and 3.15% under the HD tablet attack. These experiments show that our proposed method could be used in security systems at a low cost and with low-specification devices. Furthermore, if the device allows detailed adjustment of the focus, the performance of our method can be improved further.

Discussion
Because our proposed method requires images taken at different focuses, it cannot be applied to open databases, such as the CASIA database [39] and the Replay-Attack database [40]. Therefore, we conducted comparative experiments by applying other methods to our own database. Table 5 compares the performance of our proposed method with that of other methods.
The other methods [9,41,42] detect liveness based on textural analysis (local binary patterns) or frequency components (difference of Gaussian, power spectrum). Although they have the advantage of using a single image, their performance on our database is not remarkable, regardless of the DoF, whereas the previous work [32] performs relatively well at a DoF within 4 cm. However, when the DoF is deep, the performance of [32] deteriorates. This indicates that the performance of the previous system depends on how the input pictures are collected, which strongly affects the degree of defocus. To overcome this limitation, we design our system around two factors. The first is supplementing the features. By adding other feature descriptors, we try to maintain good performance even as the DoF becomes deeper. The GLOH feature [35] has high matching scores for images with severe blur, whereas the local features used in other methods [9,41,42] are less suitable for defocused images than the GLOH feature descriptor. The influence of the GLOH feature can be confirmed in Section 4.1.2. In Figure 21, the performances of the focus and power histogram features deteriorate as the DoF increases, whereas the performance of the GLOH feature is maintained. As a result, we achieve a 6.51% HTER (feature fusion) at a DoF within 16 cm by using additional features suited to defocused images. This result is better than the HTERs of the other methods and of the previous method [32], which uses only a focus feature (20.8%).
The second way to mitigate the weakness of the previous study [32] is the use of the webcam database. Digital cameras, such as DSLR and mirrorless cameras, have high specifications and make it possible to manually adjust the DoF and focusing areas. However, due to their high cost, people might be unwilling to use digital cameras for image acquisition in anti-spoofing algorithms. Webcams are cheaper than digital cameras and are utilized broadly. With the webcam, we created a database and conducted experiments. As a result, we accomplish 3.02% HTER with the combined feature. The performance with the webcam database is similar to that with the mirrorless camera database.
Even though our method shows good performance for liveness detection, it has a disadvantage in the process of acquiring and normalizing images. In this paper, we set the focus on the ears and nose and find the centers of the eyes manually. In order to apply our proposed method to security systems at low cost and with low-specification devices, like smartphones, facial components must be detected automatically. Recently, many studies on feature point extraction have been in progress, and most cameras and smartphones have a face-priority autofocus function [43][44][45], which helps to obtain face-focused images by automatically controlling the lens actuator. If these technologies are utilized, this limitation can be resolved, and our method can be applied to such devices. Moreover, it will strengthen the security of smartphones.

Conclusion and Future Work
We proposed a face liveness detection method based on the characteristics of defocus. Our method pays attention to the difference between the properties of real and 2D fake faces. We use focus, power histogram and GLOH as feature descriptors and classify spoofing faces in terms of the feature-level fusion processes. Our experimental results show 3.29% HTER when the DoF of images is within 4 cm. Moreover, by applying various features, we overcome the limitation of DoF without adding any other sensors. Furthermore, through experiments with a webcam, we confirm that the good performance of our method is maintained.
Even though our proposed method yields good results, its application to camera-embedded security systems, such as smartphones, is limited by the manual processes needed to acquire the focused images and to detect facial components. Therefore, in future work, we will automate the image acquisition and preprocessing steps so that our method can be embedded in smart devices. In addition, we will consider more robust countermeasures against videos and 3D attacks by analyzing textural and temporal characteristics. Finally, we will advance our method using a light-field camera, which can acquire various focusing information in the spatial domain using a microlens array.

A. Appendix: Experiments According to the Type of Features
We carry out experiments in accordance with the types of features. The following shows the performance of our proposed methods.

A.1. Focus Feature
We conduct experiments with eight types of focus features, which are categorized into four groups: statistic-based, Laplacian-based, gradient-based and wavelet-based operators [4]. The focus features are listed in Table A1, and the related equations are given in [4]. Table A2 and Figure A1 show the HTERs and receiver operating characteristic (ROC) curves of the focus features according to the range of the DoF. In general, the Laplacian group performs better than the other groups. As depicted in Figure A1, the curves of the focus features in the Laplacian group cluster in the upper region. In particular, modified Laplacian (LAPM) has stable and strong results across all DoF ranges (1.64% HTER at a DoF within 4 cm and 8.93% HTER at a DoF within 16 cm). The sum of wavelet coefficients (WAVS) also shows good performance. When the DoF is shallow, the effect of defocusing is large, which makes the focus features of real and fake faces more discriminative. As a result, all focus features except gray-level variance (GLVA) yield their best performance at a DoF within 4 cm. The GLVA focus feature, unusually, has its best performance when the DoF is within 16 cm (12.4% HTER). GLVA is the simple variance of the gray-scale image. Compared to the other focus features, it is inadequate for representing the difference between focused and defocused regions, regardless of the DoF.
Figure A1. ROC curves of the focus features (DoF (a) within 4 cm, (b) within 6 cm, (c) within 10 cm and (d) within 16 cm).

A.2. Power Histogram Feature
To find a division of the frequency spectrum that yields good performance, we carry out experiments with the radii of circular regions listed in Table A3. The dimensionality is the length of the concatenated histograms of the three subregions. Table A4 describes the numerical results, and Figure A2 illustrates the distributions of the HTERs and ROC curves. When the HTERs are averaged for each setting, Rad.ver6 shows good performance: 7.69% HTER. Even though the dimensionality of this power histogram feature is low, it yields the best performance among the settings.

A.3. GLOH Feature
We perform experiments by altering the size of the patch, the energy percentage in PCA and whether the patches are overlapped or not. In Tables A5-A10, numerical results are listed, and Figure A3 describes the ROC curves.
As shown in Figure A3, the performances with and without overlapping patches are similar. Given the additional computational cost, overlapping the patches is not worthwhile. Therefore, it is better not to overlap the patches when extracting the GLOH features. In the case of the energy percentage in PCA, the performance at 98% is worse than at 90% and 95%.
Additionally, experiments are carried out with different patch sizes. When the GLOH features are extracted from the whole image, the performance is the worst, because those features cannot sufficiently represent the spatial properties. When the size of the patch is 75 × 75 and the energy percentage in PCA is 90% without allowing overlap, the performance is the best (7.75% HTER at a DoF within 16 cm).
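As a worked check of the dimensionality in the best setting, assume that the 75 × 75 patches tile the 150 × 150 normalized image exactly and without overlap. The GLOH vector before PCA then has the following size:

$$P = Q = \frac{150}{75} = 2, \qquad 272 \times P \times Q = 272 \times 2 \times 2 = 1088.$$

PCA with a 90% energy percentage is then applied to this 1088-dimensional vector. The number of retained components depends on the data and is not stated here.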

Conflicts of Interest
The authors declare no conflict of interest.