1. Introduction
Facial authentication is an innovative biometric authentication method that has seen widespread adoption in mobile devices, such as smartphones and tablets. Prominent commercial face authentication systems currently available in the market include TrueFaceAI [
2], iProov [
3], Visidon [
4], and Face Unlock [
5].
Compared to traditional password-based user authentication, facial authentication offers several advantages, including higher entropy and no requirement for user memory [
6]. However, many existing facial recognition systems are intrinsically vulnerable to spoofing attacks [
7,
8,
9,
10,
11,
12,
13], where an adversary may replay a photo or video containing the victim’s face or use a 3D virtual model of the victim’s face. This issue is further exacerbated by the vast amount of personal media data published online. Prior research indicates that 53% of facial photos on social networks such as Facebook and Google+ can be used to successfully spoof current facial authentication systems [
14,
15].
To mitigate spoofing attacks, face liveness detection techniques have been developed to enhance facial recognition systems [
16,
17,
18,
19,
20,
21]. These techniques verify the presence of a real user during face authentication by analyzing features such as 3D characteristics, motion patterns, and texture details captured by cameras. Major manufacturers, including Apple, Baidu, Tencent, and Alibaba, have incorporated liveness detection into their face authentication systems [
22,
23,
24].
Previous research on liveness detection has been successful in countering photo-based attacks, where an adversary replays a facial photo of a victim. These liveness detection methods typically require users to perform specific facial motions or expressions during authentication. For example, methods based on eye blinks or head rotations require users to blink or turn their heads (e.g., [
25,
26]). While these techniques are effective against photo-based attacks, they are vulnerable to video-based attacks. In such cases, adversaries can use pre-recorded face videos in which the victim performs the necessary motions or expressions, thereby circumventing the liveness detection system.
Other research on liveness detection aims to counter both photo-based and video-based attacks. For instance, the facial thermogram approach leverages additional data from infrared cameras to analyze thermal characteristics [
27]. Another method, sensor-assisted face recognition, detects liveness by accurately identifying the nose edge under controlled lighting conditions [
28]. However, these approaches face practical challenges on mobile devices due to limited hardware capabilities and diverse usage environments.
More practical liveness detection methods have been developed to combat spoofing attacks on mobile devices. One such method is FaceLive, which requires the user to hold a smartphone with a front-facing camera and move it horizontally over a short distance in front of their face during authentication [
21]. FaceLive detects liveness by analyzing the consistency between a captured facial video and the movement data of the mobile device. This approach is effective in identifying both photo-based and video-based attacks. However, it remains vulnerable to spoofing attacks using 3D virtual face models, as such models can generate facial videos that align with the device’s movements [
11].
Face Flashing is another method aimed at preventing spoofing attacks on mobile devices [
29]. During authentication, the user holds a smartphone facing their face at a close distance. The phone’s screen light is activated, and its color is frequently changed according to a random challenge. A video of the user’s face is recorded under this light and then sent to a cloud server for analysis. The server examines the light reflections to detect 2D surfaces in spoofing attacks. However, Face Flashing demands substantial computational power to analyze facial reflections. Additionally, it generates significant network traffic and raises privacy concerns, as the face video must be transmitted to a remote server for liveness detection. Moreover, Face Flashing relies on precise time synchronization between the screen light and video capture to counter replay attacks. These factors make implementing Face Flashing on current mobile devices challenging.
This study introduces FaceCloseup, an anti-spoofing face authentication solution designed for mobile devices. FaceCloseup effectively detects not only photo-based and video-based attacks, but also those involving 3D virtual face models. The system operates using a standard front-facing camera—commonly available on mobile devices—and requires no specific conditions, such as controlled lighting or the transfer of facial videos to remote servers. These features make FaceCloseup well-suited for on-device liveness detection and practical for deployment on commodity smartphones.
FaceCloseup requires the user to hold and move a mobile device toward or away from their face over a short distance while the front-facing camera captures a video of the user’s face. A live user is identified if the changes in the observed distortions in the facial video correspond to those expected from a live face.
To counter spoofing attacks, FaceCloseup detects the 3D characteristics of a live user’s face by analyzing distortion changes observed in the video frames. Distortion in the facial video is a common phenomenon in photography, particularly when the camera is close to the face. This distortion is primarily due to the uneven 3D surface of the user’s face, causing different facial regions to be displayed at varying scales in the video frames. The scale of specific facial regions depends mainly on the distance between the camera and those regions. Due to the uneven 3D surface of a real face, these scales vary across different facial regions.
To validate the proposed liveness detection mechanism, we conducted a user study, collecting real-world photo and video data from both legitimate authentication requests and face spoofing attacks. In particular, we simulated 3D virtual face model-based attacks using an advanced 3D face reconstruction technique [
30] to generate facial photos with realistic distortions. Our experimental results demonstrate that FaceCloseup can detect face spoofing attacks with an accuracy of 99.48%. Additionally, the results reveal that FaceCloseup can distinguish between different users with an accuracy of 98.44%, effectively capturing the unique 3D characteristics of each user’s face based on the observed distortions in the facial video.
An initial study on FaceCloseup was presented as a short paper at ACM AsiaCCS 2019 [
1]. This version has undergone substantial revisions, expansions, and rewrites, resulting in a significantly enhanced and distinct manuscript. It introduces new theoretical frameworks, experimental evaluations, and analyses that were not included in the original conference paper.
2. Related Work
Face authentication is increasingly favored over traditional passwords and other biometric methods because of its convenience, user-friendliness, and contactless nature, and because it does not rely on memorized credentials that can be forgotten. Recent research on face authentication can be summarized into two primary directions: face recognition and face liveness detection.
Face recognition has been rapidly advancing since the 1990s as a significant component of biometric technologies. With the advent of powerful GPUs and the development of extensive face databases, recent research has focused on the creation of deep neural networks, such as convolutional neural networks (CNNs), for all aspects of face recognition tasks [
31,
32,
33,
34,
35,
36,
37,
38]. Deep learning-based face recognition has achieved remarkable accuracy and robustness [
34,
37], leading to widespread adoption by companies such as Google, Facebook, and Microsoft [
31].
Despite these advancements, face recognition remains vulnerable to various spoofing attacks, including photo-based attacks [
7,
8,
9], video-based attacks [
10], and 3D model-based attacks [
11,
12,
13]. To counter these threats, face liveness detection techniques have been developed to enhance face recognition [
1,
16,
17,
18,
19,
20,
21]. Face liveness detection verifies if a real user is participating in face authentication by analyzing features such as 3D characteristics, motion patterns, and texture details captured by cameras. Additional hardware, such as 3D cameras and infrared lights, may also be used. Many manufacturers, including Apple, Baidu, Tencent, and Alibaba, have integrated liveness detection techniques into their face authentication systems [
22,
23,
24].
Among various face liveness detection methods, the 3D face liveness indicator relies on the understanding that a real face is a three-dimensional object with depth features. Detecting these 3D face characteristics often involves optical flow analysis and changes in facial perspectives. A 3D face exhibits the optical flow characteristic where the central part of the face moves faster than the outer regions [
25]. In this context, Bao et al. proposed a liveness detection method that analyzes the differences and properties of optical flow generated from a holistic 3D face [
39]. In addition to the holistic face, local facial landmarks can also be used for optical flow analysis in liveness detection. Jee et al. introduced a liveness detection algorithm based on the analysis of shape variations in eye blinking, which is utilized for optical flow calculation [
40]. Kollreider et al. developed a liveness detection algorithm that analyzes optical flow by detecting ears, nose, and mouth [
41]. However, approaches based on optical flow analysis typically require high-quality input videos with ideal lighting conditions, which may be challenging to achieve in practice. Unlike these methods, FaceCloseup uses input video from a generic camera, making it more practical for real-world applications.
Conversely, the 3D characteristics of a real face can also be detected through its relative movements. Chen et al. investigated the 3D characteristics of the nose for liveness detection based on the premise that a real face has a three-dimensional nose [
42]. To determine user liveness, their mechanism compares the direction changes of the mobile phone, as measured by the accelerometer, with the changes in the clear nose edge observed in the camera video. However, producing a clear nose edge requires controlled lighting to cast unobstructed shadows, which may be impractical in real-world scenarios. This method is also less effective for individuals with flatter noses.
Li et al. introduced FaceLive, which requires users to move the mobile device in front of their faces and analyzes the consistency between the motion data of the device and the head rotation in the video [
21]. Although these two liveness detection algorithms can identify photo-based and video-based attacks, they remain vulnerable to 3D virtual face model-based attacks, as adversaries can synthesize accurate nose changes and head rotation videos in real-time [
11]. In contrast, FaceCloseup can effectively detect typical face spoofing attacks, including photo-based, video-based, and 3D virtual face model-based attacks.
Texture pattern-based liveness detection techniques are based on the assumption that printed fake faces exhibit detectable texture patterns due to the printing process and the materials used. Maatta et al. assessed user liveness by extracting local binary patterns from a single image [
43]. The IDIAP team utilized facial videos, extracting local binary patterns from each frame to construct a global histogram, which was then used to determine liveness [
9]. Tang et al. introduced Face Flashing, which captures face videos illuminated by random screen light and sends them to remote servers, such as cloud services, for the analysis of light reflections to detect liveness [
29]. These texture pattern-based techniques often require high-quality photos and videos taken under ideal lighting conditions, as well as significant computational power for analysis. This can be challenging to achieve on mobile devices in practice. Moreover, relying on remote servers or cloud services for computation can lead to substantial network traffic and privacy concerns. In contrast, FaceCloseup operates by analyzing closeup facial videos locally on mobile devices, eliminating these issues.
Real-time response-based approaches necessitate user interaction in real-time. For example, Pan et al. required users to blink their eyes to verify liveness [
26], while VeriFace, a popular face authentication software, asked users to rotate their heads for the same purpose [
44]. Unfortunately, these methods are susceptible to video-based and 3D virtual face model-based attacks, where adversaries might replay videos showing the required interactions or use a 3D virtual face model to generate the necessary responses in real time [
5,
45]. In contrast, FaceCloseup effectively detects such video-based attacks.
Finally, multimodal liveness detection approaches incorporate both facial biometrics and other biometric data for user authentication. Rowe et al. proposed a technique that combines face authentication with fingerprint authentication using a camera and a fingerprint scanner [
46]. Similarly, Wilder et al. utilized facial thermograms from an infrared camera alongside facial biometrics from a standard camera during the authentication process [
27,
47]. Unlike these methods, which depend on specialized hardware sensors that are rarely found on mobile devices, our approach leverages the front-facing camera, which is widely available on most mobile devices.
3. Theoretical Background
In this section, we present the theoretical background for developing FaceCloseup, covering face authentication, face spoofing and threat models, as well as distortions in facial images and videos.
3.1. Face Authentication
Face authentication verifies a user’s claimed identity by examining facial features extracted from the user’s photos or videos. A typical face authentication system consists of two subsystems: a face recognition subsystem and a liveness detection subsystem, as illustrated in
Figure 1.
The face recognition subsystem captures a user’s facial image or video using a camera and compares it with the user’s enrolled facial biometrics [
31,
32,
33,
34,
35,
36,
37,
38]. This subsystem accepts the user if the input facial image or video matches the enrolled biometrics; otherwise, it rejects the user. The subsystem consists of two key modules: a face detection module and a face matching module. The face detection module identifies the face region and eliminates irrelevant parts of the image, then passes the detected face region to the face matching module. The face matching module compares the input image with the enrolled face template to determine if they belong to the same individual. As the face recognition subsystem is designed to identify a user from an input facial image or video, but not to detect forged biometrics, it is inherently vulnerable to face spoofing attacks [
7,
8,
9,
10,
11,
12,
13]. In such attacks, an adversary may replay a pre-recorded facial image or video or display a 3D virtual face model of a victim.
The liveness detection subsystem is designed to mitigate spoofing attacks by distinguishing between live and forged faces based on facial images or videos [
1,
16,
17,
18,
19,
20,
21]. This subsystem typically employs a camera and/or other sensors to capture information about a live user during face authentication. It comprises two key modules: the liveness feature extraction module and the forgery detection module. The liveness feature extraction module derives features from the input data, while the forgery detection module calculates a liveness score from these features and determines whether the input is from a live user. Based on the outputs from both the face recognition subsystem and the liveness detection subsystem, the face authentication system makes a final decision regarding the authentication claim.
3.2. Face Spoofing and Threat Model
Face spoofing attacks allow an adversary to forge a user’s facial biometrics using photos or videos of the user’s face [
7,
8,
9,
10,
11,
12,
13]. The adversary can then present these forged biometrics to deceive face authentication systems, posing a significant threat to their security.
The face authentication system is inherently susceptible to face spoofing attacks during the recognition process. As illustrated in
Figure 1, the face recognition subsystem identifies a user based on an input facial photo or video but is unable to determine whether the input is from a live user or a pre-recorded or synthesized image or video. To counter these vulnerabilities, a liveness detection subsystem is implemented to prevent face spoofing attacks, including those based on photos, videos, and 3D virtual face models.
Specifically, photo-based and video-based attacks involve an adversary deceiving face authentication systems by replaying pre-recorded facial photos and videos of a user, which can be sourced online, such as from social networks [
7,
8,
9,
10]. More sophisticated and potent are the 3D virtual face model-based attacks, where an adversary constructs a 3D virtual face model of a user using their photos and videos [
11]. This 3D model allows the adversary to generate realistic facial videos with necessary motions and expressions in real-time, effectively bypassing face authentication systems.
The liveness detection subsystem is designed to distinguish between genuine face biometrics captured from a live user and those fabricated by adversaries using the user’s photos, videos, or a 3D virtual face model. This detection relies on indicators derived from human physiological activities, which can be categorized into several types: 3D facial structure, texture patterns, real-time responses, and multimodal approaches [
25].
The 3D face-based liveness indicators are derived from the inherent depth characteristics of a real face, which is an uneven three-dimensional object, as opposed to a fake face in a photo or video, or a 3D virtual face displayed on a flat (2D) plane. Texture pattern-based liveness indicators are based on the assumption that forged faces exhibit certain texture patterns that real faces do not, and vice versa. Real-time response-based liveness indicators rely on the premise that genuine users can interact with an authentication system in real time, a feat that is challenging for fake faces to achieve.
Typical real-time response liveness indicators include eye blinks and head rotations, which have been implemented in popular face authentication systems such as Google’s FaceUnlock [
5]. These indicators are effective in detecting photo-based attacks. However, they remain susceptible to video-based and 3D virtual face model-based attacks. In these scenarios, an adversary could use pre-recorded videos of the victim containing necessary facial movements and expressions, or construct a 3D virtual face model from the victim’s photos or videos to generate the required facial movements and expressions in real-time.
Finally, multimodal-based liveness indicators can be derived from both facial biometrics and additional biometric traits, which are difficult for an adversary to obtain simultaneously. Among the four types of liveness indicators, FaceCloseup employs the 3D face-based liveness indicator to counter face spoofing attacks: it relies only on commonly available hardware, requires no more than moderate image quality, remains robust in varied environments, and imposes a relatively low usability cost.
In this work, it is assumed that adversaries lack access to a victim’s facial photos or videos exhibiting significant perspective distortions. User preference data (see
Section 6.2.4) suggests that most individuals are reluctant to share closeup facial imagery—particularly those captured at approximately 20 cm—due to the pronounced and often unflattering distortions. Users typically avoid positioning a smartphone within 20 cm of their faces during video calls, as this results in noticeable image distortion. Additionally, minor hand movements at such short distances frequently cause incomplete or blurred facial regions in video frames.
In contrast, facial photos and videos captured at longer distances (e.g., beyond 20 cm) are more likely to be shared on social media, used in video chats, or transmitted via conferencing platforms such as Zoom or Microsoft Teams. Adversaries may acquire such content through online platforms or by recording remote sessions. However, obtaining close-up facial imagery captured within 20 cm—without the user’s awareness—is substantially more difficult.
We also exclude real 3D attacks, such as 3D-printed faces, from our scope. The effectiveness of a 3D-printed face largely depends on the quality of the surface texture, which can be compromised due to inherent material defects. This vulnerability can be addressed using texture analysis-based methods that differentiate between the surface texture of a real face and that of a printed counterfeit face.
3.3. Distortion in Facial Images and Videos
The liveness indicator in FaceCloseup is grounded in the geometric principle of perspective distortion, a well-known phenomenon in photography and computer vision. When a camera captures a three-dimensional object at close range, regions closer to the lens appear disproportionately larger than those farther away. This effect, distinct from lens aberrations like barrel distortion, is purely a function of spatial geometry and has been studied in both biometric imaging and photographic modeling.
In facial imagery, perspective distortion causes features such as the nose or chin to appear enlarged relative to more recessed areas like the ears or temples. This distortion becomes more pronounced as the camera-to-subject distance decreases. While prior work has explored image distortion for spoof detection—such as the Image Distortion Analysis (IDA) framework by Wen et al., which uses features like specular reflection and blurriness to detect spoofing attacks [
48]—FaceCloseup introduces a geometric modeling approach that explicitly quantifies distortion scaling across facial regions using the Gaussian thin lens model.
Let $d_1$ denote the distance from the lens to the image plane, $d_2$ the distance from a facial region to the lens, and $f$ the focal length. The Gaussian lens formula for a thin lens [49] is as follows:

$$\frac{1}{d_1} + \frac{1}{d_2} = \frac{1}{f} \tag{1}$$

With Equation (1), we compare the ratio $R$ between the size of a facial region in the image and the size of the facial region on the 3D face as follows:

$$R = \frac{d_1}{d_2} = \frac{f}{d_2 - f} \tag{2}$$

The change rate $R'$ of the size of the facial region in the image with respect to the change of $d_2$ can be calculated as follows:

$$R' = \frac{\partial R}{\partial d_2} = -\frac{f}{(d_2 - f)^2} \tag{3}$$

Equation (3) reveals that the magnitude of $R'$ grows rapidly as $d_2$ decreases, so as the camera approaches the face, nearer facial regions are magnified far more than farther ones, amplifying perspective distortion. This geometric insight enables FaceCloseup to infer depth cues from single images or videos without requiring stereo vision or active sensors.
Unlike prior distortion-based spoof detection methods that rely on texture or frequency-domain features [
48], FaceCloseup leverages region-specific geometric scaling to distinguish real 3D faces from flat 2D spoof media. Empirical observations show that selfie images taken at 20 cm exhibit significant distortion, while those at 50 cm do not. Since spoofing artifacts (e.g., printed photos or screen replays) lack true 3D structure, they fail to replicate the depth-dependent distortion patterns of a live face. FaceCloseup exploits this discrepancy by analyzing the relative scaling of facial regions to detect liveness.
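To make this depth-dependent scaling concrete, the following minimal Python sketch evaluates the magnification ratio of Equation (2) for two facial regions at slightly different depths; the focal length and the 3 cm nose-to-ear depth offset are illustrative assumptions rather than measured values from our study.

```python
# Illustrative sketch of perspective distortion under the thin-lens model.
# Focal length (4 mm) and the 3 cm nose-to-ear depth offset are assumed values.

def magnification(d2: float, f: float = 0.004) -> float:
    """Ratio R = f / (d2 - f) between image size and object size (Equation (2))."""
    return f / (d2 - f)

depth_offset = 0.03  # the nose tip sits ~3 cm closer to the lens than the ears

for camera_distance in (0.20, 0.35, 0.50):   # camera-to-ear distance in metres
    r_near = magnification(camera_distance - depth_offset)  # nose region
    r_far = magnification(camera_distance)                   # ear region
    print(f"{camera_distance * 100:.0f} cm: nose/ear relative scale = {r_near / r_far:.3f}")

# Trend: roughly 1.18 at 20 cm but only about 1.06 at 50 cm, i.e., the distortion
# across facial regions fades as the camera moves away from the face.
```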
4. FaceCloseup Design
FaceCloseup authenticates facial liveness against spoofing attacks by examining the 3D characteristics of the face, based on distortion changes across various facial regions in close-up videos. This 3D face detection serves as a liveness indicator, as utilized by many existing detection mechanisms outlined in
Section 3.2. The facial video is recorded using a front-facing camera, which is standard hardware on current mobile devices.
When a live user initiates an authentication request, the distortion variations in different facial regions captured in the video should align with the 3D characteristics of the user’s actual face. For successful face authentication, the user must hold and move a mobile device over a short distance towards or away from their face. As this movement occurs, the front-facing camera on the mobile device captures the user’s face at varying distances. If it is a genuine 3D face in front of the camera, the resulting facial video will display corresponding distortion changes in various facial regions.
FaceCloseup comprises three primary modules: the Video Frame Selector (VFS), the Distortion Feature Extractor (DFE), and the Liveness Classifier (LC), as illustrated in
Figure 2. Specifically, the VFS module processes the input facial video, extracting and selecting multiple frames based on the detected facial size in each frame. Using these frames, the DFE module identifies numerous facial landmarks and calculates features related to the distortion changes across the frames. Finally, the LC module applies a classification algorithm to differentiate a genuine face from a forged face during spoofing attacks.
4.1. Video Frame Selector
As a mobile device is moved towards or away from a user’s face, the device’s camera captures a video comprising multiple frames of the user’s face, each taken at varying distances between the camera and the face. Consequently, the size of the faces in the video frames fluctuates due to this movement. The Video Frame Selector (VFS) extracts and selects a series of frames from the video based on the detected face size in each frame. The VFS employs the Viola–Jones face detection algorithm, a robust real-time face detection method that achieves 98.7% accuracy on widely-used face datasets such as LFPW and LFW-A&C [
50,
51].
The face detection algorithm comprises several sub-techniques, including integral image, AdaBoost, and attentional cascade. The integral image technique extracts rectangular features from each frame by computing the sum of values within rectangle subsets of grids in the frame. The AdaBoost algorithm selects these features and trains a strong classifier based on a combination of sub-classifiers. The attentional cascade structure for these sub-classifiers significantly accelerates the face detection process.
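As an illustration of the integral-image sub-technique, the sketch below is a generic summed-area-table implementation (not the code used in our prototype): once the table is built, the sum of any rectangular region, and hence any Haar-like rectangular feature, can be evaluated with a constant number of look-ups.

```python
import numpy as np

def integral_image(gray: np.ndarray) -> np.ndarray:
    """Summed-area table: ii[y, x] = sum of gray[0:y+1, 0:x+1]."""
    return gray.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii: np.ndarray, top: int, left: int, bottom: int, right: int) -> float:
    """Sum of gray[top:bottom+1, left:right+1] from at most four table look-ups."""
    total = ii[bottom, right]
    if top > 0:
        total -= ii[top - 1, right]
    if left > 0:
        total -= ii[bottom, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return float(total)

# A Haar-like rectangular feature is a difference of such rectangle sums.
frame = np.random.randint(0, 256, size=(480, 640)).astype(np.float64)
ii = integral_image(frame)
assert abs(rect_sum(ii, 10, 20, 40, 60) - frame[10:41, 20:61].sum()) < 1e-6
```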
Once faces are detected in the video frames, the Video Frame Selector (VFS) selects K frames based on the size of the detected faces, where K is a parameter of the FaceCloseup system. The face size is determined by the number of pixels in the detected face within a frame. VFS selects K frames by categorizing face sizes into K ranges $sz_1, \ldots, sz_K$, where each size range $sz_i = [sz_i^{l}, sz_i^{u}]$ corresponds to the $i$-th frame selection, ensuring that the face size in the $i$-th selected frame falls within $sz_i$, where $sz_i^{l}$ denotes the lower bound, and $sz_i^{u}$ the upper bound, both in terms of mega-pixels. If multiple frames fall within a particular size range $sz_i$, VFS randomly selects one among them. The sequence of selected frames is denoted as $F_1, F_2, \ldots, F_K$. Without loss of generality, we assume that the user moves their mobile device away from their face during liveness detection, causing the face size to decrease as the frame index increases.
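The frame-selection step of VFS can be sketched as follows; the function assumes that face detection has already produced a face size (in mega-pixels) for every frame, and the helper name and example size ranges are illustrative rather than the exact values used in FaceCloseup.

```python
import random

def select_frames(face_sizes_mp, size_ranges):
    """Pick one frame index per size range sz_i = (lower, upper), in mega-pixels.

    face_sizes_mp: detected face size of every video frame, in mega-pixels.
    size_ranges:   K (lower, upper) bounds; one frame index is returned per range,
                   or None when no frame falls inside that range.
    """
    selected = []
    for lower, upper in size_ranges:
        candidates = [idx for idx, size in enumerate(face_sizes_mp)
                      if lower <= size <= upper]
        selected.append(random.choice(candidates) if candidates else None)
    return selected

# Example with K = 4 illustrative size ranges.
sizes = [2.9, 2.4, 1.9, 1.6, 1.2, 0.9, 0.7, 0.5]
ranges = [(2.5, 3.0), (1.8, 2.4), (1.0, 1.7), (0.4, 0.9)]
print(select_frames(sizes, ranges))  # e.g., [0, 1, 3, 5] (one frame per range)
```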
4.2. Distortion Feature Extractor
Due to the three-dimensional characteristics of the human face, distortions in various facial regions can be observed in facial images, as discussed in
Section 3.3. As a mobile device moves closer to or further from a user’s face, the distortion of facial regions in different video frames varies according to the distance between the camera and the user’s face at the time the frames are captured. These distortion changes are correlated with changes in distance. Given a sequence of frames
selected by the Video Frame Selector (VFS), the Distortion Feature Extractor (DFE) calculates the geometric distances between various facial landmarks in each frame and uses these measurements as features to detect distortion changes in the facial video.
To detect facial landmarks on 2D facial images, we employ the supervised descent method (SDM) due to its ability to identify facial landmarks under various poses and achieve a median alignment error of 2.7 pixels [
51]. The SDM method identifies 66 facial landmarks in each frame. These landmarks are distributed across different facial regions, including the chin (17), eyebrows (10), nose stem (4), area below the nose (5), eyes (12), and lips (18). A comprehensive review of facial landmark detection algorithms, including those utilizing 66 facial landmarks, is available in the literature survey on facial landmark detection [
52].
We maintain consistent indices $j = 1, \ldots, 66$ for the 66 facial landmarks across all video frames, with the coordinate of the $j$-th landmark in frame $F_i$ represented as $p_{i,j} = (x_{i,j}, y_{i,j})$. Consequently, we establish a matrix for the facial landmarks in the K selected frames as follows:

$$P = \begin{bmatrix} p_{1,1} & p_{1,2} & \cdots & p_{1,66} \\ \vdots & \vdots & \ddots & \vdots \\ p_{K,1} & p_{K,2} & \cdots & p_{K,66} \end{bmatrix}$$

where each row $i$ represents frame $F_i$, and each column $j$ represents the $j$-th facial landmark.
Facial distortion impacts both the geometric positions of facial landmarks and the overall size of the face in different frames. To capture this distortion, we calculate the distance between any two facial landmarks, $p_{i,j}$ and $p_{i,j'}$, in frame $F_i$ as $d_{j,j'} = \sqrt{(x_{i,j} - x_{i,j'})^2 + (y_{i,j} - y_{i,j'})^2}$, where $(x_{i,j}, y_{i,j})$ and $(x_{i,j'}, y_{i,j'})$ represent the coordinates of $p_{i,j}$ and $p_{i,j'}$, respectively, with $1 \le j < j' \le 66$. The 66 facial landmarks in each frame generate $\binom{66}{2} = 2145$ pairwise distances. Assuming the detected face size in a frame is $w$ in width and $h$ in height, a geometric vector describing the face in frame $F_i$ is formed as $g_i = (d_{1,2}, d_{1,3}, \ldots, d_{65,66}, w, h)$.
According to Equation (3), facial distortion becomes increasingly pronounced as the camera moves closer to a real 3D face. Instead of utilizing the absolute distances in the geometric vector, we compute relative distances by normalizing the geometric vector of each frame against a base facial image, which the user registers during a registration phase. It is required that the user’s face in this base image fall within a predefined pixel range, $sz_b$. The geometric vector for the base image is calculated as $g_b = (d_{b,1,2}, \ldots, d_{b,65,66}, w_b, h_b)$. For each selected frame $F_i$, we compute a relative geometric vector $r_i = (r_{i,1}, \ldots, r_{i,2147})$, where $r_{i,m} = g_{i,m} / g_{b,m}$ for $i = 1, \ldots, K$ and $m = 1, \ldots, 2147$. The facial distortion in the K selected frames is represented by a $K \times 2147$ matrix FD:

$$FD = \begin{bmatrix} r_1 \\ r_2 \\ \vdots \\ r_K \end{bmatrix}$$

where each row $i$ corresponds to the relative geometric vector for frame $F_i$.
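A compact sketch of the DFE computation is given below; it assumes 66 (x, y) landmarks plus the detected face width and height per frame, and the function names are ours for illustration only.

```python
import numpy as np
from itertools import combinations

def geometric_vector(landmarks: np.ndarray, width: float, height: float) -> np.ndarray:
    """Build the 2147-dimensional vector g: 2145 pairwise landmark distances plus (w, h).

    landmarks: array of shape (66, 2) with the (x, y) coordinates of one frame.
    """
    dists = [float(np.linalg.norm(landmarks[j] - landmarks[k]))
             for j, k in combinations(range(66), 2)]   # C(66, 2) = 2145 pairs
    return np.array(dists + [width, height])

def distortion_matrix(frames, base) -> np.ndarray:
    """Stack the relative geometric vectors of the K selected frames into the K x 2147 matrix FD.

    frames: list of (landmarks, width, height) tuples for frames F_1, ..., F_K.
    base:   (landmarks, width, height) tuple of the registered base facial image.
    """
    g_base = geometric_vector(*base)
    return np.vstack([geometric_vector(*frame) / g_base for frame in frames])

# Shape check with random stand-in landmarks for K = 8 frames.
rng = np.random.default_rng(0)
frames = [(rng.uniform(0, 500, (66, 2)), 400.0, 500.0) for _ in range(8)]
fd = distortion_matrix(frames, (rng.uniform(0, 500, (66, 2)), 400.0, 500.0))
print(fd.shape)  # (8, 2147)
```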
4.3. Liveness Classifier
Following the extraction of the $K \times 2147$ feature matrix FD by the Distortion Feature Extractor (DFE), the Liveness Classifier (LC) module processes FD using a classification algorithm to ascertain whether the features originate from a genuine face or a counterfeit one in spoofing attacks. Given that the input matrix FD resides in a high-dimensional space, traditional classification algorithms may encounter issues such as overfitting and high variance gradients [53]. Deep learning-based classification algorithms, however, typically perform more effectively with high-dimensional data [54,55]. Specifically, we have implemented a convolutional neural network (CNN) tailored to our classification needs within the LC module. Comprehensive evaluations indicate that the CNN yields highly accurate classification results in distinguishing genuine user faces from spoofed ones.

The CNN-based classification algorithm within the LC module comprises seven layers: two convolution layers, two pooling layers, two fully connected layers, and one output layer, as illustrated in Figure 3. Given the input $K \times 2147$ feature matrix FD, the first convolution layer ($C_1$) computes a tensor matrix ($T_1$). To achieve non-linear properties while maintaining the receptive fields of $C_1$, a rectified linear unit (ReLU) activation function is applied to $T_1$, resulting in the tensor matrix $T_1'$. The ReLU function is defined as $\mathrm{ReLU}(x) = \max(0, x)$. The first pooling layer ($P_1$) then performs non-linear downsampling on $T_1'$. Subsequently, the second convolution layer ($C_2$) and the second pooling layer ($P_2$) execute the same operations as $C_1$ and $P_1$, respectively, in the third and fourth steps.

Subsequently, the fully connected layers $FC_1$ and $FC_2$ execute high-level reasoning. Assuming $FC_2$ comprises $M$ neurons, it generates a vector $v \in \mathbb{R}^{M}$, which is then passed to the output layer. The output layer calculates the probabilities for $C$ classes, with the probability of each class $c$ determined using the following multinomial distribution:

$$P(y = c \mid v) = \frac{\exp(w_c \cdot v + b_c)}{\sum_{c'=1}^{C} \exp(w_{c'} \cdot v + b_{c'})}$$

where $C$ represents the number of classes, $w_c$ is the $c$-th row of a learnable weighting matrix $W$, and $b_c$ is a bias term. The output layer generates the classification result by selecting the class with the highest probability among the $C$ classes. Since liveness detection aims to differentiate between a real face and a forged face, $C$ is set to 2.
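A compact PyTorch sketch of a classifier with this layer layout is shown below; since the text above fixes only the layer types and their order, the channel counts, kernel sizes, pooling windows, and the hidden width $M$ are illustrative assumptions rather than the exact configuration used in our prototype.

```python
import torch
import torch.nn as nn

class LivenessCNN(nn.Module):
    """Two conv + two pool + two fully connected layers + a C = 2 output layer.

    The input is the K x 2147 distortion matrix FD, treated as a one-channel image.
    Channel counts, kernel sizes, and hidden width M are illustrative choices.
    """
    def __init__(self, num_classes: int = 2, hidden_m: int = 128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1),   # C1
            nn.ReLU(),                                   # ReLU(x) = max(0, x)
            nn.MaxPool2d(kernel_size=(1, 2)),            # P1: downsample the feature axis
            nn.Conv2d(8, 16, kernel_size=3, padding=1),  # C2
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(1, 2)),            # P2
            nn.AdaptiveAvgPool2d((4, 16)),               # fixed-size map regardless of K
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 4 * 16, hidden_m),            # FC1
            nn.ReLU(),
            nn.Linear(hidden_m, hidden_m),               # FC2 (M neurons)
            nn.ReLU(),
            nn.Linear(hidden_m, num_classes),            # output layer (softmax via loss)
        )

    def forward(self, fd: torch.Tensor) -> torch.Tensor:
        # fd: (batch, 1, K, 2147) -> unnormalized class scores (batch, C)
        return self.classifier(self.features(fd))

# Example: a batch of 4 distortion matrices with K = 8 selected frames.
logits = LivenessCNN()(torch.randn(4, 1, 8, 2147))
print(logits.shape)  # torch.Size([4, 2])
```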
To train the CNN-based classification model, we utilize a cross-entropy loss function, which evaluates the performance of a classification model with probabilistic output values ranging between 0 and 1 [56]. For each input training sample $i$, we define the loss as follows:

$$L_i = -\log\left(p_{i, y_i}\right)$$

where $p_{i, y_i}$ represents the predicted probability for the correct class $y_i$, based on the sample’s actual observation label.
During model training, we apply mini-batches to split the training dataset into small batches that are used to calculate the model error and update the model coefficients. Training the model with mini-batches allows for robust convergence, avoids local minima, and provides a computationally efficient update process [56]. To mitigate the overfitting problem, we add a regularization loss to the loss function for each mini-batch as follows:

$$L = \frac{1}{N} \sum_{i=1}^{N} L_i + \lambda \sum_{k} \sum_{l} \left\| W_{k,l} \right\|_2^2$$

where $\frac{1}{N} \sum_{i=1}^{N} L_i$ represents the average prediction loss for the $N$ samples within the mini-batch, and $\sum_{k} \sum_{l} \| W_{k,l} \|_2^2$ denotes the regularization loss for the $l$ weighting factors across the $k$ layers in the CNN-based classification model. Here, $\lambda$ acts as the regularization strength factor. Hence, our objective is to minimize $L$.
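The mini-batch objective can be sketched in the same framework as follows; cross-entropy provides the per-sample loss $L_i$, an explicit L2 penalty over the weighting factors serves as the regularization term, and the regularization strength value is an assumed placeholder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def minibatch_loss(model: nn.Module, fd_batch: torch.Tensor, labels: torch.Tensor,
                   reg_strength: float = 1e-4) -> torch.Tensor:
    """Average cross-entropy over a mini-batch of N samples plus an L2 penalty on the weights."""
    logits = model(fd_batch)                                    # (N, C) class scores
    prediction_loss = F.cross_entropy(logits, labels)           # mean of -log p_{i, y_i}
    reg_loss = sum((w ** 2).sum() for name, w in model.named_parameters()
                   if name.endswith("weight"))                  # weighting factors only
    return prediction_loss + reg_strength * reg_loss

# Usage with any liveness classifier (a trivial linear stand-in here), K = 8 frames:
model = nn.Sequential(nn.Flatten(), nn.Linear(8 * 2147, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
fd_batch = torch.randn(16, 1, 8, 2147)           # a mini-batch of N = 16 FD matrices
labels = torch.randint(0, 2, (16,))              # 1 = live face, 0 = spoofed face
optimizer.zero_grad()
minibatch_loss(model, fd_batch, labels).backward()
optimizer.step()
```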
5. Data Collection and Dataset Generation
An IRB-approved user study was conducted to gather users’ data for both legitimate requests and face spoofing attacks, including photo-based attacks, video-based attacks, and 3D virtual face model-based attacks.
5.1. Data Collection
Our user study comprises 71 participants, consisting of 43 males and 28 females, aged between 18 and 35. Each participant spent approximately 50 min in a quiet room during the study. The study was divided into three parts, with participants taking a short break of about 3 min after completing each part. Detailed descriptions of the study follow below.
In the first part, we collected multiple selfie facial videos of participants at various device positions. Each participant was instructed to hold a mobile phone and take three frontal facial video clips at each controlled distance (DFD) between their face and the phone. The mobile device used in our experiments was a Google Nexus 6P smartphone, equipped with an 8-megapixel front-facing camera, a 5.7-inch screen, and operating on Android 7.1.1. The front-facing camera captured 1080p HD video at 30 fps. Each video clip lasted for 3 s, with the face centered in each 1080p frame. These facial videos later guided the frame selection by the Video Frame Selector (VFS) as outlined in
Section 4.1 and provided facial photos for photo-based attacks.
The controlled distance (DFD) between the face and the smartphone was set at 20 cm, 30 cm, 40 cm, and 50 cm. These distances were selected based on common participant behaviors and the capabilities of the smartphone’s front-facing camera. According to our pilot study, over 65% of participants faced challenges in capturing clear and complete frontal selfie videos at DFD < 20 cm, as the camera was either too close to focus on partial facial regions or too close to capture entire faces. Conversely, more than 70% of participants experienced difficulties holding the smartphone at DFD > 50 cm due to the limited length of their arms. Consequently, we collected 12 frontal selfie video clips at these controlled distances from each participant.
In the second part of our study, we collected facial videos of participants performing FaceCloseup trials. Each participant was instructed to conduct these trials using the provided Google Nexus 6P smartphone, with controlled device movement distances. For the FaceCloseup trials, participants held and moved the smartphone away from their face from a distance of
20 cm to
50 cm, from
30 cm to
50 cm, and from
40 cm to
50 cm. Each participant completed 10 trials for each controlled movement setting. Prior to each set of trials, a researcher demonstrated the required movements, and participants were given time to practice. During the movement, the participant’s facial video, captured by the front-facing camera, was displayed on the screen in real time, allowing participants to adjust the smartphone to ensure their face was always fully captured. To control the moving distance, the required distance was marked along a horizontal line on the wall. We ensured that no significant head rotation occurred during the smartphone movement, as per the method described in [
57]. In this part of the study, we collected facial video data from 30 trials per participant, which was later used to simulate both legitimate requests and video-based attacks.
In the third part of the study, each participant was asked to complete a questionnaire utilizing a 5-point Likert scale. This segment aimed to gather participants’ perceptions regarding the use of FaceCloseup and their preferences for online sharing behaviors.
5.2. Dataset Generation
To simulate legitimate requests and face spoofing attacks, we generated several datasets: a legitimate dataset, a photo-based attack dataset, a video-based attack dataset, and a 3D virtual face model-based attack dataset. These datasets were derived from the frontal facial videos collected during our user study.
5.2.1. Legitimate Dataset
Since FaceCloseup performs liveness checks by analyzing distortion changes across various facial regions in closeup videos, the facial videos for this purpose must exhibit clear facial distortion. Consequently, the legitimate dataset consists of the closeup facial videos recorded during FaceCloseup trials, with smartphone movements ranging from a distance of
20 cm to
50 cm, as detailed in
Section 5.1. In total, the legitimate dataset comprises 710 trials.
5.2.2. Photo-Based Attack Dataset
To simulate photo-based attacks, we manually extracted 10 facial frames from each participant’s selfie frontal video clips, recorded at fixed distances
as detailed in
Section 5.1. The selected fixed distances were 30 cm, 40 cm, and 50 cm. These distances were chosen because participants typically published their selfie photos/videos or made video calls while holding their mobile phones at these distances. The majority of participants did not share selfie photos/videos taken at distances shorter than 30 cm due to noticeable facial distortion. Further details regarding participants’ sharing preferences and video call behaviors will be presented in the following section.
Secondly, for each extracted facial frame, we displayed the frame on an iPad HD Retina screen to simulate photo-based attacks, in a manner similar to the approach detailed in [
28]. The facial region in the frame was adjusted to full screen to closely match the size of a real face. The smartphone was fixed on a table with its front-facing camera consistently aimed at the iPad screen. We then moved the iPad from
20 cm to
50 cm. During this movement, the smartphone’s front-facing camera recorded a video of the face displayed on the iPad screen. This process resulted in a photo-based attack dataset comprising 1420 attack videos.
5.2.3. Video-Based Attack Dataset
To simulate video-based attacks, we utilized the videos recorded during trials where participants moved the smartphone from 30 cm to 50 cm and from 40 cm to 50 cm. Each video was displayed on an iPad screen, while the smartphone remained fixed on a table with its front-facing camera continuously recording the screen. The video frames were scaled appropriately so that the face in the initial frame was displayed in full screen. The distance between the iPad and the smartphone was kept constant at 20 cm, as the displayed video included movements similar to those in legitimate requests. Consequently, we generated a total of 1420 attacking videos for the video-based attacks.
5.2.4. 3D Virtual Face Model-Based Attack Dataset
The effectiveness of photo-based and video-based attacks is often constrained, as it can be challenging for an adversary to obtain suitable facial photos and pre-recorded videos featuring the necessary facial expressions (e.g., smiles) and deformations (e.g., head rotations). In contrast, a 3D virtual face model-based attack can create a real-time 3D virtual face model of a victim, producing facial photos and videos with the required expressions and deformations. This form of attack presents a significant threat to most existing liveness detection techniques, such as FaceLive.
In 3D virtual face model-based attacks, an adversary may reconstruct a 3D virtual face model of the victim using one or more standard facial photos/videos, typically taken from a distance of at least 40 cm and possibly shared by the victim. A variety of 3D face reconstruction algorithms are available [
11,
58,
59,
60]. Most existing 3D face reconstruction algorithms process one or multiple standard facial photos of the victim to extract facial landmarks. The 3D virtual face model is then estimated by optimizing the geometry of a 3D morphable face model to align with the observed 2D landmarks. This optimization assumes that a virtual camera is positioned at a predefined distance from the face (usually considered infinite). Subsequently, image-based texturing and gaze correction techniques are applied to refine the 3D face model. The textured 3D face model can then be used to generate various facial expressions and head movements in real time.
FaceCloseup verifies the liveness of a face by analyzing changes in facial distortion as the camera-to-face distance varies. The aforementioned 3D virtual face model is unable to circumvent FaceCloseup’s detection, as its reconstruction relies on a virtual camera fixed at a predefined distance. Consequently, this 3D virtual face model cannot replicate the requisite facial distortions resulting from changes in camera distance, particularly when the camera is in close proximity to the face.
To emulate a sophisticated adversary in the context of 3D virtual face model-based attacks, we employ the perspective-aware 3D face reconstruction algorithm [
30]. This algorithm is capable of generating facial distortions in accordance with changes in virtual camera distance, as well as alterations in facial expressions and head poses. For reconstructing a perspective-aware 3D face model of a victim, the algorithm initially extracts 69 facial landmarks from a provided facial photo. Among these, 66 landmarks are automatically identified using the SDM-based landmark detection algorithm [
51], as detailed in
Section 4.2. The remaining three facial landmarks, located on the top of the head and ears, are manually labeled to ensure higher accuracy.
Secondly, the 3D face model is associated with an identity vector, an expression vector, an upper-triangular intrinsic matrix, a rotation matrix $R$, and a translation matrix $T$. The facial photo and the 69 facial landmark locations are employed to fit a 3D head model by identifying the optimal parameters that minimize the Euclidean distance between the facial landmarks and their projections on the 3D head model.
Thirdly, once a good fit is achieved between the input facial photo and the 3D head model, the 3D head model can be manipulated to generate a new projected head shape by altering the virtual camera distance and head poses. Specifically, the virtual camera can be moved closer to or farther from the face by adjusting the translation matrix
T, and the head can be rotated by modifying both the translation matrix
T and the rotation matrix
R. Ultimately, the manipulated 3D head model produces a 2D facial photo with distortions that correspond to the changes in camera distance. For further details on this perspective-aware 3D face reconstruction algorithm, please refer to [
30].
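The effect of manipulating the virtual camera distance can be illustrated with a generic pinhole-projection sketch (this is not the implementation of the reconstruction algorithm in [30]); moving the camera along the optical axis via the translation vector changes how strongly closer facial points, such as the nose tip, are magnified relative to farther ones. All numeric values below are assumed toy parameters.

```python
import numpy as np

def project(points_3d: np.ndarray, intrinsics: np.ndarray,
            rotation: np.ndarray, translation: np.ndarray) -> np.ndarray:
    """Pinhole projection of 3D head-model points to 2D pixels: x ~ A (R X + T)."""
    cam = points_3d @ rotation.T + translation    # camera-frame coordinates
    pix = cam @ intrinsics.T                      # apply the intrinsic matrix
    return pix[:, :2] / pix[:, 2:3]               # perspective divide

# Assumed toy values: identity rotation, simple intrinsics, and two facial points
# (the nose tip lies 3 cm in front of the ear plane).
A = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
face_points = np.array([[0.00, 0.02, -0.03],   # nose tip (closer to the camera)
                        [0.07, 0.02,  0.00]])  # ear (on the reference plane)
for z in (0.20, 0.50):                          # virtual camera distance set via T
    uv = project(face_points, A, R, np.array([0.0, 0.0, z]))
    print(f"camera at {z * 100:.0f} cm:\n{uv.round(1)}")
# At 20 cm the projected nose tip is offset ~14 px from the ear row; at 50 cm only ~2 px,
# reproducing the distance-dependent distortion that the manipulated model must generate.
```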
To conduct the 3D virtual face model-based attacks, we first extracted 10 facial photos from the selfie videos taken by each participant at controlled distances
of 30 cm, 40 cm, and 50 cm, as detailed in
Section 5.1. Using these facial photos as inputs, we employed the perspective-aware 3D face reconstruction algorithm to generate photos with facial distortions by manually varying the virtual camera distances, which included 20 cm, 25 cm, 30 cm, 35 cm, 40 cm, 45 cm, and 50 cm. We adjusted the scale of the manipulated photos to match the size of the facial region in the original photos taken at similar distances. Consequently, we generated a sequence of seven manipulated facial photos from each extracted photo. The resulting dataset for the 3D virtual face model-based attack comprises 2130 sequences of manipulated facial photos.
7. Discussion
In this section, we discuss the integration of FaceCloseup in existing face authentication systems and the limitations of FaceCloseup.
7.1. Integration of FaceCloseup
FaceCloseup can be seamlessly integrated into most existing mobile face authentication systems without significant modifications.
Firstly, FaceCloseup analyzes facial distortion changes by calculating the facial geometry ratio changes relative to a base facial photo within a predefined size range (sz) of the face region, as explained in
Section 4.2. During registration, users need to perform a one-time process of registering one or more facial photos with face regions falling within sz, akin to the registration processes of most current face authentication systems. Secondly, during each authentication request, FaceCloseup requires users to capture a closeup selfie video with a front-facing smartphone camera, moving the smartphone towards or away from their face. By analyzing the captured video, FaceCloseup determines whether the request is from a live face or a spoofed face. Simultaneously, the typical face recognition subsystem can extract a frame from the video and compare it with the pre-stored facial biometrics to verify the user’s identity.
According to our experimental results, FaceCloseup can distinguish different users with an accuracy of 99.44%. Therefore, FaceCloseup has the potential to both recognize a user’s face and verify its liveness simultaneously. To enable face recognition, users must complete a one-time registration by capturing a closeup selfie video. This video is taken in the same manner as the videos used for liveness checks, wherein the user moves the smartphone towards or away from their face. The pattern of facial distortion changes extracted from the registered video is stored on the smartphone. For each authentication request, FaceCloseup captures a closeup selfie video of the user, incorporating the required device movements. The captured video is then used to verify the user’s identity and assess the liveness of the face concurrently.
Since FaceCloseup relies on closeup facial videos to analyze the necessary facial distortion changes, it may be vulnerable if an adversary obtains a closeup video taken within 30 cm of a victim’s face. This proximity provides the significant facial distortion changes required by FaceCloseup. However, it is challenging for adversaries to obtain such closeup videos, as users are generally reluctant to capture and share them. Moreover, it is difficult for an adversary to directly capture a closeup video at such a short distance without the victim’s awareness. Even if an adversary manages to stealthily record a closeup facial video while the victim is sleeping, FaceCloseup can easily detect such attempts by analyzing the status of the victim’s eyes. Numerous existing techniques can monitor eye movement and status in real time [
26].
7.2. Limitations of FaceCloseup
Limitations of the Current User Study: Our study primarily recruited university students, who tend to be more active users of mobile devices. While this provides valuable insights into FaceCloseup’s real-world applicability, it introduces limitations in generalizing performance to broader demographics, including older adults, children, and individuals with diverse ethnic and physiological characteristics.
Potential Variations Across Demographics: Facial authentication systems may exhibit performance differences across age groups due to variations in skin texture, facial structure, and device interaction habits. Similarly, ethnic diversity can influence facial recognition algorithms, particularly in how features are detected under different lighting conditions and facial angles. While our study did not include a stratified analysis, the existing literature suggests that these factors can impact the efficacy of spoof detection and user authentication accuracy [
61,
62,
63].
To address these concerns, future studies will involve a larger and more demographically diverse participant pool. A stratified analysis will be conducted to examine performance variations across age groups, ethnic backgrounds, and physiological characteristics. This will help refine FaceCloseup’s robustness and ensure equitable authentication accuracy across a wider range of users.
Environmental Factors and Their Impact on FaceCloseup: While our study considered lighting and movement, additional environmental factors—such as low light conditions, motion blur, outdoor scenes, and differences in camera quality—can also influence facial authentication performance.
Motion Blur and Device Stability: In our user study, participants were instructed to maintain controlled camera distances and regulated hand movements. However, involuntary hand tremors and natural limitations in movement control can introduce motion blur, which may affect facial feature detection. Prior research suggests that motion blur can degrade recognition accuracy, particularly in dynamic environments [
42,
64]. Future work will explore stabilization techniques and adaptive algorithms to mitigate these effects.
Outdoor Scenes and Lighting Variability: While close-up facial videos benefit from screen light when the smartphone is within 30 cm of the user’s face, outdoor environments introduce additional challenges such as variable lighting, shadows, and background complexity [
65]. Studies indicate that facial recognition systems can experience reduced accuracy in uncontrolled lighting conditions [
29]. Future research will incorporate outdoor testing to evaluate FaceCloseup’s robustness across diverse environments.
Camera Quality and Device Variability: Differences in smartphone camera specifications—such as resolution, sensor quality, and frame rate—can impact facial authentication performance. The existing literature highlights that lower-resolution cameras may struggle with fine-grained facial feature extraction [
66]. Expanding our study to include a range of device models will help assess FaceCloseup’s adaptability across different hardware configurations.
To address these concerns, future studies will incorporate a broader range of environmental conditions, including outdoor settings, varied lighting scenarios, and different smartphone models. This will ensure a more comprehensive evaluation of FaceCloseup’s performance and enhance its applicability in real-world use cases.
Usability Considerations in FaceCloseup: While FaceCloseup enhances security by leveraging perspective distortion in close-up facial authentication, we recognize that requiring users to position their devices close to their faces may raise usability concerns, particularly for individuals with mobility limitations or privacy sensitivities.
Accessibility and Mobility Considerations: For users with limited mobility, holding a device at a precise distance may pose challenges. Future iterations of FaceCloseup could incorporate adaptive mechanisms, such as automatic distance calibration or voice-guided positioning, to improve accessibility.
Privacy Considerations: Some users may feel uncomfortable bringing their devices close to their faces due to privacy concerns, particularly in public settings. To mitigate this, FaceCloseup could be enhanced with on-device processing for privacy-preserving face authentication, ensuring that facial data remains private and is not transmitted externally, thereby addressing data security concerns.
Existing privacy-preserving biometric authentication approaches, including fuzzy extractors and fuzzy signatures [
67,
68,
69,
70,
71,
72,
73,
74], could potentially be applied to transform close-up facial biometrics into privacy-protected representations. The integration of FaceCloseup and such approaches would enable authentication servers—if deployed off-device—to verify users without inferring any biometric information.