Article

Face Spoofing Detection with Stacking Ensembles in Work Time Registration System

by Rafał Klinowski 1 and Mirosław Kordos 1,2,*
1 Department of Computer Science and Automatics, University of Bielsko-Biala, 43-309 Bielsko-Biała, Poland
2 inEwi, sp. z o. o., ul. 1 Maja 15, 43-300 Bielsko-Biała, Poland
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(15), 8402; https://doi.org/10.3390/app15158402
Submission received: 29 May 2025 / Revised: 8 July 2025 / Accepted: 18 July 2025 / Published: 29 July 2025
(This article belongs to the Special Issue Application of Artificial Intelligence in Image Processing)

Abstract

This paper introduces a passive face-authenticity detection system, designed for integration into an employee work time registration platform. The system is implemented as a stacking ensemble of multiple models. Each model independently assesses whether a camera is capturing a live human face or a spoofed representation, such as a photo or video. The ensemble comprises a convolutional neural network (CNN), a smartphone bezel-detection algorithm to identify faces displayed on electronic devices, a face context analysis module, and additional CNNs for image processing. The outputs of these models are aggregated by a neural network that delivers the final classification decision. We examined various combinations of models within the ensemble and compared the performance of our approach against existing methods through experimental evaluation.

1. Introduction

In the context of this article, an authentic face refers to the face of a live person in front of a camera. An inauthentic or spoofed face refers to a face shown on an electronic device screen, in video footage, or printed on paper.
This paper introduces a method for verifying the authenticity of faces, intended for use within a work time management system.
The specific issue addressed in the paper is the detection of scenarios in which an employee places a photo or video of another employee in front of the camera to falsely record the attendance of someone who is not physically present. Before detailing the proposed method, we will provide a brief overview of the work time management system, explain how our method interacts with other modules of the system, and outline the requirements it must meet.
The described work time management system is currently in use by several companies. Based on the requirements received, our new facial authenticity recognition module is to be integrated into the system as illustrated in Figure 1.
The facial authenticity recognition module will receive photos of employees whose identities (first and last names) have already been confirmed by the employee identification module, along with the coordinates of their faces within those photos. It must then evaluate the likelihood that the face is authentic (i.e., not spoofed) and forward the result to the work time registration module.
The technical specifications of the work time management system are as follows: the entire system consists of a client-side and a server-side part. The client-side application runs on Android, iOS, and in a web browser, and most often uses simple front-facing smartphone cameras or, occasionally, USB cameras without infrared sensors, to take pictures of people who register their work time (Figure 1). The server side is based on a server with two AMD EPYC 7282 16-core processors, 256 GB RAM, an Nvidia RTX 3050 GPU, a 1 TB SSD, and an 8 TB HDD, running the Windows operating system. The client-side application takes the photos, compresses them, and sends them to the server side, where further processing takes place and where our module, presented in this paper, operates.
According to the technical specifications, the authors of this paper were not permitted to modify any system modules other than the one they were developing. They were also not allowed to install infrared-equipped cameras or use Apple’s TrueDepth Camera technology [1] for face recognition. This restriction stemmed from the lack of consent from both clients and the original system developers to alter a well-functioning system and incur additional costs.
The client-side application is activated each time an employee starts or ends their work shift by scanning a QR code. This mechanism ensures that the system does not capture unnecessary photos of individuals merely passing by the camera or images of empty rooms. Once activated, the camera captures photos of the employee and uses the BlazeFace algorithm [2] to locate the face. Only images in which a face is successfully detected are compressed into the WEBP format (5–6 kB) for fast transmission. These are then sent to the server, where they are stored in a database for potential future analysis.
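To illustrate the idea of the client-side compression step, the following minimal Python sketch resizes a captured frame and encodes it as WEBP (the actual clients are Android, iOS, and web applications, so this OpenCV code, the target resolution, and the quality value are only illustrative assumptions):

import cv2

def compress_frame(frame_bgr, quality=40):
    # Downscale the captured frame and encode it as WEBP; the quality value is
    # an illustrative setting that typically yields files of a few kilobytes.
    small = cv2.resize(frame_bgr, (320, 240))
    ok, buffer = cv2.imencode(".webp", small, [cv2.IMWRITE_WEBP_QUALITY, quality])
    return buffer.tobytes() if ok else None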
On the server side, the employee identification module processes the incoming photos. It verifies the identity of the person in each image by matching them with the employee associated with the scanned QR code, using the ArcFace algorithm [3]. This module performs with near-perfect accuracy.
Due to the constraints of the system, existing modules could not be modified. As a result, alternatives such as the RetinaFace algorithm [4] could not be integrated for face detection. However, given the high accuracy of the current solution, this limitation was not deemed problematic.
Per the provided specifications, our method receives photos along with face coordinates from the employee identification module, which is already implemented on the server. The image quality is generally suboptimal—photos are highly compressed (WEBP, 5–6 kB), often taken under poor lighting conditions, and frequently depict employees in a hurry, not properly positioned in front of the camera. Several example images taken under these conditions are presented in Figure 2. Nonetheless, our method is designed to operate effectively under these constraints, as we had no control over photo quality.
Typically, the problem of facial authenticity detection is studied in the context of users logging into an operating system or application. Many well-established methods exist for that use case and are thoroughly documented in the literature. However, the conditions and requirements in a work time registration system differ substantially, as previously described. In fact, state-of-the-art methods designed for login systems performed poorly in our settings, achieving only around 50% accuracy, and most authentic faces were incorrectly classified as spoofed. This issue also extended to a prominent commercial solution discussed in the experimental section. These systems perform well under ideal conditions: good lighting, a properly aligned face, and a high-quality USB camera. In contrast, the work time registration environment caused significant performance drops, leading us to conclude that image quality is a critical factor.
To explore this hypothesis, we captured several high-quality images using a full-frame camera and displayed them on a 4K-resolution notebook screen, simulating spoofing attempts. These images were often recognized as authentic by existing systems. This confirmed that photo quality can heavily influence the results. Since no available solution performed reliably under our specific conditions, we were compelled to develop a custom method, which is presented in this paper.
One of the key problems in the industry is the potential for fraud in work time registration. To prevent this, it is essential to ensure that the person registering work time is physically present and not being impersonated via photo or video. Therefore, we evaluated various face recognition algorithms and methods for determining whether a face belongs to a live person or is a spoofed image. This task is very different from facial authentication for login purposes and thus comes with its own set of unique requirements.
First, employees often do not actively cooperate with the system, so we cannot expect them to align their faces perfectly with the camera. Moreover, we cannot require active liveness detection (e.g., blinking, smiling).
Second, the system must function with low-cost, widely available cameras—such as front-facing smartphone cameras—where the user interface appears on the same screen.
Third, the system must be capable of serving thousands of employees logging their start times each morning through a centralized server. This requires strong image compression to save bandwidth, and the face-authenticity detection must be computationally efficient to provide feedback within one second, ensuring responsiveness under high load.
Fourth, there is a limited amount of training data, particularly examples of fraudulent attempts. We cannot ask users to intentionally spoof the system just to expand the dataset.
Importantly, the system is not expected to detect all fraudulent attempts, especially if doing so would result in a high number of false positives, which could disrupt company operations. If one fraud attempt is missed, the individual is likely to try again, increasing the likelihood of being detected on a subsequent attempt.
Furthermore, systems designed to detect screens or smartphones displaying a face often failed in our scenario, where only a partial view of the screen might be visible. For example, YOLO struggled to detect partially visible objects effectively. This limitation necessitated the creation of a custom database containing real and spoofed examples and the development of a bespoke facial authenticity detection system.

2. Review of Existing Solutions

Facial recognition systems are steadily gaining popularity. They are used to confirm a user's identity in many applications, for example, when logging into a computer. Solutions based on deep learning, in particular, have been advancing rapidly in recent years. The most successful methods include DeepFace [7], DeepIDs [8], VGG Face [9], SphereFace [10], and the solution that we used in our system: ArcFace.
However, a common problem in all face recognition systems is the possibility of fraud, where someone else can obtain unauthorized access to the system using a photo or video of the legitimate user’s face.
Several systems aimed at preventing face spoofing have been built using two cameras, a 3D camera, or an additional infrared camera. However, in many cases only a simple, cheap 2D camera is available, in particular the front camera of a smartphone or a webcam embedded in a laptop, as in our case.
The lack of 3D information is not the only indicator of fraud, and the availability of 3D does not protect against attacks that use 3D masks. Other indicators of an attack exist, and since we do not have access to 3D information, we need to rely on them.
Frequently, we do not know in advance what the signals of fraud are. In this case, however, we can use photos of genuine authentications and of fraud attempts to train a predictive model to recognize fraudulent situations.
Predictive models usually need examples of both classes for learning, the real class and the fake class, and during the prediction phase they classify a new face based on its similarity to the examples in the training set. For that reason, several solutions have been proposed.
In previous years, many traditional handcrafted feature-based methods were proposed for face spoofing detection. These algorithms contained methods such as LBP [11,12], SIFT [13], SURF [14], DoG, and HOG [15], and were designed for extracting spoofing patterns from various color spaces (RGB, HSV, and YCbCr).
Also, many methods to detect liveness cues were developed, which tried to detect eye-blinking [16,17,18], face and head movement, gaze tracking [19], and remote physiological signals such as rPPG [20,21,22,23].
However, liveness cues are easy to defeat with a suitable video attack, which makes them less reliable; moreover, users dislike active liveness checks (in our application, they would be unacceptable, and no one would use the system).
Although deep learning and CNNs have achieved great success in many computer vision tasks (image classification, semantic segmentation, object detection, etc.), they contain many tuning parameters (weights) and therefore require large and diverse training datasets in order to avoid overfitting.
Thus, in response, several hybrid methods were proposed, where features are first extracted from the images and the CNNs then operate on the extracted features rather than on the original images, which reduces the number of parameters and thus helps to avoid overfitting. Below, we briefly list some recently proposed solutions.
Traditional image-based approaches focus on image quality and characteristics and therefore employ hand-made features, such as LBP, SIFT, HOG, and SURF, with shallow classifiers to discriminate live and fake faces [24].
Sthevanie et al. [25] combined the LBP and GLCM methods to extract the features used to detect spoof images. The LBP and GLCM are based on texture features, but GLCM uses statistical functions, so the extraction result represents the overall image texture features.
Chen et al. [26] proposed a two-stream convolutional neural network (TSCNN) that works in two complementary spaces: RGB and multi-scale retinex (MSR) space to take advantage of both texture details in RGB space and illumination invariance of MSR.
Rajeswaran and Kumar [27] developed an approach to convert the image to the L*a*b* color space, which is then passed through HOG (Histogram of Oriented Gradients) for face detection. The detected face is then advanced through the VGG7 CNN architecture for the extraction of features and then classified using a fully connected layer for spoof detection.
Hashemifard et al. [28], instead of completely extracting handcrafted texture features or relying only on deep neural networks, addressed the problem by fusing both wide and deep features in a unified neural architecture. The main idea was to take advantage of the strengths of both methods to derive a well-generalized solution for the problem.
Balamurali [9] performed face detection using MTCNN (Multi-Task Cascaded Convolutional Neural Network) and used the Support Vector Classifier (SVC) to classify real and fake faces.
Ab Wahab et al. [29] presented an Online Attendance System Using Face Recognition to record the daily attendance of students in online classes.
Shu et al. [30] proposed a model that consists of two streams. One stream converts the input RGB images into grayscale ones and conducts multi-scale color inversion to obtain the MSCI images, which are then put into the MobileNet to extract face reflection features. The other stream directly feeds RGB images into the MobileNet to extract face color features. Finally, the features extracted separately from the two branches are fused and then used for face spoofing detection.
Bahia et al. [31] presented face spoofing detection based on the extraction of Heterogeneous Auto-Similarities of Characteristics (HASC) descriptor from the HSV and YCbCr color spaces. The HASC descriptor encodes dependencies between low-level dense features of the color face image using covariance and information-theoretic measures to explore how the features are interrelated in real and fake faces.
Zhang et al. [32] proposed Dual Probabilistic Modeling (DPM), with two dedicated modules, DPM-LQ (Label Quality aware learning) and DPM-DQ (Data Quality aware learning). Both modules were designed based on the assumption that data and labels should form coherent probabilistic distributions. The task of DPM-LQ was to produce feature representations without overfitting to the distribution of noisy semantic labels, and the task of DPM-DQ was to eliminate data noise from ‘False Reject’ and ‘False Accept’ during inference by correcting the prediction confidence of noisy data based on its quality distribution.
Huang et al. [33] noted that, since the vision transformer (ViT) has demonstrated significant performance gains in various computer vision tasks by effectively capturing long-range dependencies, recent FAS methods have investigated the potential of leveraging the channel-wise features derived from self-attention (SA) in ViT. While channel-wise features effectively capture discriminative local attention, the complementary information existing between different channels remained unexplored in prior methods. Their paper investigates these complementary channel characteristics within ViT and proposes to incorporate both channel-wise and complementary channel information to learn long-range and discriminative features for FAS. Two modules, Channel Difference Self-Attention (CDSA) and Multi-head Channel Difference Self-Attention (MCDSA), are used to facilitate learning complementary channel characteristics and to enhance both feature discriminability and representational capacity. Based on CDSA and MCDSA, the authors present a custom Channel Difference Transformer (CDformer).
Antil et al. [34] provide a comprehensive review of the state-of-the-art works published over the past decade and discuss the temporal evolution of the FAS/PAD field. The paper reviews different types of attacks against facial authentication systems and covers the key features used by FAS models, the FAS design approaches, and the backbone architectures used to build these methods. It also discusses publicly available databases for FAS models, standard protocols, and benchmarking methods. An extensive comparative analysis of experimental results from different PAD methods over the past decade is provided, highlighting limitations and current challenges. The review observes the lack of a robust, large-scale general dataset for FAS and underscores the need for new developments in the field.

3. The Proposed Method

This work addresses a specific problem of developing a passive face spoofing detection system that does not require user interaction. The system is tailored for integration with an employee work time registration platform, using photos provided by that system, which were usually taken with a single front smartphone camera without infrared sensors.
The proposed method receives photos from the “Identification of the person in the photo” module, as shown in Figure 1, and detects whether the face visible in the photo belongs to a live person standing in front of the camera or was shown to the camera as a photo, a video, or in some other way.
Our contributions and core innovations comprise:
  • Development of a heterogeneous ensemble combining statistical and deep learning models.
  • Design and implementation of several new models that are used together with already existing models within the ensemble.
  • Designing the CNN architecture presented in Figure 3 and in Table 1, which outperforms both domain-specific and state-of-the-art general-purpose deep learning models. It should be noted that a vital component of this architecture is the pre-processing step.
  • Creation of a smartphone bezel-detection algorithm, which was consistently included in the highest-performing model combinations (see the detailed tables in the Appendix A).
  • Development of a CNN-based smartphone detection model presented in Figure 4 and Table 2.
  • Implementation of a novel face context analysis algorithm.
  • Identification of optimal combinations of models for the ensembles.
  • Integration of the proposed solution into a work time registration system to detect face spoofing attempts.
The proposed method is structured as a two-tiered stacking ensemble, as presented in Figure 5.
The first level of the model consists of the following modules:
  • Smartphone bezel-detection algorithm
  • Smartphone detection using a CNN
  • Face context analysis
  • Image analysis using a CNN
  • (optionally) Image analysis using the second deep neural network
  • (optionally) Image analysis using the third deep neural network
First, each of the modules predicts whether the face visible in the photo is real or spoofed. Then, the predictions of the modules are fed into a neural network that acts as a meta-classifier, producing the final authenticity decision. The individual modules are presented in detail in the following sections.
Why a committee of models is used for recognition: the experiments showed that the convolutional neural network we propose was the best-performing of the 10 individual models we tested. However, combining several models in the form of a stacking ensemble yielded further accuracy improvements, justifying our choice of a committee-based approach.

3.1. Smartphone Bezel-Detection Algorithm

The first developed module is a bezel-detection algorithm designed to detect whether a spoofed face is displayed on an electronic device such as a smartphone or tablet. It does so by detecting high-contrast edges around the displayed face; if enough edges of a certain size are detected, a spoofing attempt is flagged.
First, pre-processing is performed on each image to normalize it and adapt it to different sizes of input images and positions of faces. The input images undergo resizing to 256 × 256 pixels. Then, the face is detected, and the image is converted to grayscale. An example of this pre-processing is presented in Figure 6.
To detect a bezel, the center point of the detected face is selected, and then the image is analyzed in four directions toward the edges. Iteratively, smaller and smaller fragments of the image are selected, and their average brightness is calculated. If the average is below a set value (in this case, a threshold of 28 for values between 0 and 255), then the fragment is considered a part of the bezel of an electronic device. This is presented in Algorithm 1. If bezels are detected in at least two directions, then the algorithm detects a spoofing attempt. For most cases, this ensures that even if a single bezel is detected due to uniform background colors or elements of the face (like dark hair), the face will not be mistaken for a spoof. This process is depicted in Figure 7.
This algorithm is optimized for computational efficiency to minimize processing time. In addition, all algorithms are executed in parallel, further accelerating the overall analysis.
Algorithm 1: Algorithm used for detecting smartphone bezels.
# Requires a grayscale image with a detected face. The maximum and minimum
# fragment sizes below are illustrative defaults; the gray threshold of 28
# (for values between 0 and 255) follows the description in the text.
import numpy as np

def detect_bezel(gray, x_start=0, x_end=None, y_end=None,
                 max_bezel_size=20, min_bezel_size=5, gray_threshold=28):
    if x_end is None:
        x_end = gray.shape[1]
    if y_end is None:
        y_end = gray.shape[0]
    candidates = []
    current_bezel_size = max_bezel_size
    y_start = 0
    while current_bezel_size >= min_bezel_size:
        # Select a fragment of the image
        bezel = gray[y_start:y_start + current_bezel_size, x_start:x_end]
        # Check if the average value of the pixels in the fragment is below the threshold
        average = np.mean(bezel)
        if average <= gray_threshold:
            candidates.append((x_start, y_start, x_end, y_start + current_bezel_size, average))
        # Move the fragment one row down
        y_start += 1
        # When the bottom is reached, shrink the fragment and restart from the top
        if y_start + current_bezel_size >= y_end:
            current_bezel_size -= 1
            y_start = 0
    return candidates
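The listing above scans in a single direction. One simple way to cover all four directions, consistent with the two-direction decision rule described earlier, is to run the scan on rotated copies of the image; the helper below is ours, for illustration only, and is not part of the published algorithm:

import numpy as np

def is_spoof_by_bezel(gray, min_directions=2):
    # Run the single-direction scan on four 90-degree rotations of the image,
    # which corresponds to scanning up, left, down, and right from the face,
    # and flag a spoof if bezels are found in at least two directions.
    directions_with_bezel = 0
    for k in range(4):
        rotated = np.rot90(gray, k)
        if detect_bezel(rotated):  # non-empty candidate list means a bezel was found
            directions_with_bezel += 1
    return directions_with_bezel >= min_directions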

3.2. Smartphone Detection Using CNN

Complementing the bezel detector, a smartphone detection algorithm was developed using a convolutional neural network architecture. This module works closely with other parts of the system and helps to detect if a face is displayed on an electronic device, such as a smartphone or a tablet, through features such as shape, bezel, or on-screen elements.
First, a model based on YOLOv4 was developed, but it did not work well in this case, as it was trained to detect whole objects, while in this use case, most of the time, only parts of the device are visible in the camera view. This necessitated the creation of a separate model for this task. The architecture of this model is presented in Figure 4.
The CNN was trained to detect whether there is a smartphone in the image and whether it appears as part of the image (for example, the presented person is holding it), or is used to actively display a spoofed face. Therefore, while this algorithm can provide useful information about the authenticity of the photo, it is best used as part of the whole system.

3.3. Face Context Analysis

This module enhances spoof detection by analyzing regions surrounding the face for inconsistencies. The algorithm aims to detect irregularities in the input image, such as sharp edges or objects, by comparing the area close to the detected face against more distant areas. The advantage of this approach is that it allows the system to more accurately detect spoofed faces displayed on electronic devices or printed by analyzing not just the face, but also its close surroundings.
The area close to the face (the "near" context) is defined as 40% of the width and height of the face outside of the detected face rectangle in all directions. The area further away from the face (the “far” context) is defined as an additional 40% in all directions outside of the close area.
For both areas, edge detection is performed using the Sobel operator. Then, for the detected edges, average pixel intensity is calculated separately for the area closer to the face and further away from it. Then, histograms are computed with the calculated intensities over 64 bins. This method could also be interpreted as a sharpness analysis of the image, as it is based on the amount and intensity of the detected edges. A sample output of this part of the algorithm is shown in Figure 8.
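A minimal OpenCV sketch of this step is shown below. It assumes the face bounding box (x, y, w, h) comes from the upstream detector; the helper names are ours, and the "far" region is illustrated simply as the larger crop rather than a ring-shaped area:

import cv2

def edge_histogram(gray_region, bins=64):
    # Sobel edge magnitudes of the region, summarized as a 64-bin histogram
    gx = cv2.Sobel(gray_region, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(gray_region, cv2.CV_32F, 0, 1)
    magnitude = cv2.magnitude(gx, gy)
    hist = cv2.calcHist([magnitude], [0], None, [bins], [0.0, float(magnitude.max()) + 1.0])
    return cv2.normalize(hist, hist)

def context_regions(gray, face_box):
    # "Near" context: the face box enlarged by 40% of the face size in all
    # directions; "far" context: an additional 40% beyond the near region.
    x, y, w, h = face_box
    img_h, img_w = gray.shape[:2]
    def crop(f):
        x0, y0 = max(0, int(x - f * w)), max(0, int(y - f * h))
        x1, y1 = min(img_w, int(x + w + f * w)), min(img_h, int(y + h + f * h))
        return gray[y0:y1, x0:x1]
    return crop(0.4), crop(0.8)  # near, far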
Then, for the histograms generated this way, separately for the near and far areas, four metrics are calculated with respect to each histogram already stored in the reference dataset. These four metrics are Correlation, Chi-Square, Intersection, and Bhattacharyya distance, given in Equations (1)-(4). The distance between two histograms is then defined as the sum of all four metrics, as shown in Equation (5). It is important to normalize the metrics by scaling them to the range [0, 1], as they cover different ranges. For Correlation and Intersection, the normalized value is subtracted from 1, since for these metrics a value of zero means that the histograms are different. This ensures that, for all four metrics, a value of 1 represents two very different histograms and thus a possible spoofing attempt.
\mathrm{correlation}(H_1, H_2) = 1 - \mathrm{norm}\left( \frac{\sum_I \left( H_1(I) - \bar{H}_1 \right)\left( H_2(I) - \bar{H}_2 \right)}{\sqrt{\sum_I \left( H_1(I) - \bar{H}_1 \right)^2 \sum_I \left( H_2(I) - \bar{H}_2 \right)^2}} \right)    (1)
\mathrm{chi\_square}(H_1, H_2) = \mathrm{norm}\left( \sum_I \frac{\left( H_1(I) - H_2(I) \right)^2}{H_1(I)} \right)    (2)
\mathrm{intersection}(H_1, H_2) = 1 - \mathrm{norm}\left( \sum_I \min\left( H_1(I), H_2(I) \right) \right)    (3)
\mathrm{bhattacharyya}(H_1, H_2) = \mathrm{norm}\left( \sqrt{1 - \frac{1}{\sqrt{\bar{H}_1 \bar{H}_2}\, N^2} \sum_I \sqrt{H_1(I) \cdot H_2(I)}} \right)    (4)
D(H_1, H_2) = \mathrm{correlation}(H_1, H_2) + \mathrm{chi\_square}(H_1, H_2) + \mathrm{intersection}(H_1, H_2) + \mathrm{bhattacharyya}(H_1, H_2)    (5)
where \mathrm{norm}(x) = \frac{1}{1 + e^{-x}}, \bar{H}_k = \frac{1}{N} \sum_J H_k(J), N is the number of histogram bins, and H_1 and H_2 are the compared histograms.
Based on these distances, a kNN algorithm is performed to select the three closest neighbors (with the lowest distance to the new histogram), and the spoofing probability is calculated depending on how many of those histograms belong to authentic images and spoof images.
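The combined distance D and the 3-NN vote can be sketched with OpenCV's histogram comparison functions as follows; the same comparison is applied separately to the near and far histograms, the sigmoid-based normalization follows the definition of norm(x) above, and the reference set here is a hypothetical list of labeled histograms built from the training data:

import cv2
import numpy as np

def norm(x):
    # Sigmoid squashing of a raw comparison value into (0, 1), as in Equations (1)-(4)
    return 1.0 / (1.0 + np.exp(-x))

def histogram_distance(h1, h2):
    # Sum of the four normalized metrics, i.e., Equation (5)
    d = 1.0 - norm(cv2.compareHist(h1, h2, cv2.HISTCMP_CORREL))
    d += norm(cv2.compareHist(h1, h2, cv2.HISTCMP_CHISQR))
    d += 1.0 - norm(cv2.compareHist(h1, h2, cv2.HISTCMP_INTERSECT))
    d += norm(cv2.compareHist(h1, h2, cv2.HISTCMP_BHATTACHARYYA))
    return d

def spoof_probability(new_hist, reference):
    # reference: list of (histogram, is_spoof) pairs built from the training data;
    # vote among the 3 nearest neighbours according to the combined distance
    nearest = sorted(reference, key=lambda r: histogram_distance(new_hist, r[0]))[:3]
    return sum(1 for _, is_spoof in nearest if is_spoof) / 3.0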
This method strengthens the system’s ability to detect inconsistencies by focusing not only on the face but also on the broader context of the surrounding image. By comparing edge intensities in two areas around the face to known distributions, it is possible to detect sudden changes and edges in the image’s foreground that could indicate a spoof attempt.

3.4. Image Analysis Using CNN

A dedicated CNN was developed to analyze full images and assess face authenticity based on subtle cues, such as face quality, edge presence, and visual consistency. The goal of this module is to analyze elements of the image that would otherwise be difficult to process with algorithms, such as the quality of the face, the appearance of objects or edges close to the face, or the cohesion of the image.
The network architecture consists of two sets of convolutional layers followed by batch normalization layers, ReLU activation functions, pooling, and dropout layers (CONV => RELU => CONV => RELU => POOL => DROPOUT). Then, a single fully connected layer is added to the network. Finally, a SoftMax classifier is deployed to output the probabilities of the face being authentic or spoofed. The full architecture is shown in Figure 3. The face score is determined by the highest of those probabilities.
To better account for real use cases, the images were augmented. The augmentations included a random horizontal flip (mirror flip), a random rotation by up to 5 degrees, and a random change in the brightness and saturation of the image (by up to 10%). These parameters were intentionally kept small and were selected to reflect real-world conditions. The operations also included scaling the images down to a size of 64 × 64 pixels.
The images used for training and testing were additionally preprocessed by centering them around the detected face. This ensures that the network analyzes the face and its close surroundings, and is especially useful when the face occupies only a small part of a larger image. Experimental evaluation showed a substantial increase in accuracy with this centering step, compared to scaling the images down without centering. The centering step is therefore an integral part of this module.
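A minimal torchvision sketch of this preprocessing is shown below. The transforms mirror the augmentations listed above; the face-centered cropping assumes the bounding box (x, y, w, h) is available, and the crop margin is an illustrative assumption:

from torchvision import transforms
from PIL import Image

# Augmentations described above: horizontal flip, rotation up to 5 degrees,
# brightness/saturation jitter up to 10%, and downscaling to 64 x 64 pixels.
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(5),
    transforms.ColorJitter(brightness=0.1, saturation=0.1),
    transforms.Resize((64, 64)),
    transforms.ToTensor(),
])

def center_on_face(image: Image.Image, face_box, margin=0.4):
    # Crop the image around the detected face plus a margin, so that the
    # network sees the face together with its close surroundings.
    x, y, w, h = face_box
    left, top = max(0, int(x - margin * w)), max(0, int(y - margin * h))
    right = min(image.width, int(x + w + margin * w))
    bottom = min(image.height, int(y + h + margin * h))
    return image.crop((left, top, right, bottom))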
The output layer of the convolutional neural network contains a SoftMax classifier, which has two outputs. Each output can be interpreted as the probability that the image belongs to the specific class, which in the case of this system is “authentic” or “spoof”.
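The following PyTorch sketch illustrates the layer pattern described above (CONV => RELU => CONV => RELU => POOL => DROPOUT, repeated twice, followed by a fully connected layer and a SoftMax classifier) for 64 × 64 inputs. The channel counts and the dropout rate are illustrative assumptions; the authoritative architecture is the one given in Figure 3 and Table 1:

import torch
import torch.nn as nn

class ImageAnalysisCNN(nn.Module):
    # Minimal sketch of the described architecture for 64 x 64 RGB inputs.
    def __init__(self, num_classes=2):
        super().__init__()
        def block(in_ch, out_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(),
                nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(),
                nn.MaxPool2d(2), nn.Dropout(0.25),
            )
        self.features = nn.Sequential(block(3, 32), block(32, 64))  # 64x64 -> 16x16
        self.classifier = nn.Linear(64 * 16 * 16, num_classes)

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        # SoftMax over the two outputs: P(authentic) and P(spoof)
        # (during training the softmax is typically folded into the loss function)
        return torch.softmax(self.classifier(x), dim=1)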

3.5. Calculating the Result and Final Probability

A lightweight neural network was developed to aggregate outputs of the above-presented models and to compute the final spoofing probability, instead of relying on pre-defined weights for each module individually. The network takes each partial result and its probability (for example, a result could be “spoof” with probability 80%) and produces a final result, classifying the face as either “authentic” or “spoof” with an associated probability.
The proposed architecture consists of two fully connected layers with ReLU activation functions and an output layer using a sigmoid function classifier (Figure 9 and Table 3). The network accepts floating-point values, which represent the absolute probability of the face being inauthentic. For example, if a face is determined to be authentic with a probability of 80%, then the input would be 20% as the probability that the face is spoofed. The output of the network is a single value [0; 1] that represents the probability that the face is spoofed. If the probability is greater than 0.5, the face is marked as inauthentic.
Due to the sigmoid transfer function in the last layer of this network, its output can be interpreted as the probability that a given face is spoofed. For this interpretation to remain meaningful, training should be stopped before the predicted values saturate very close to zero or one, while still training long enough to achieve high classification accuracy.
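A minimal PyTorch sketch of this meta-classifier is given below. The number of first-level modules is a constructor argument and the hidden layer size is an illustrative assumption; the actual configuration is the one in Figure 9 and Table 3:

import torch
import torch.nn as nn

class MetaClassifier(nn.Module):
    # Aggregates the per-module spoof probabilities into a single decision.
    def __init__(self, num_modules, hidden=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_modules, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),  # output: probability of spoof
        )

    def forward(self, module_probs):
        return self.net(module_probs)

# Example: four first-level modules each report P(spoof) for one photo
meta = MetaClassifier(num_modules=4)
p = meta(torch.tensor([[0.20, 0.85, 0.10, 0.64]]))
is_spoof = bool(p.item() > 0.5)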

4. Experimental Evaluation

4.1. Experimental Setup

The proposed face-authenticity detection system was evaluated using four datasets:
  • A small dataset (N = 477) prepared by us, which includes faces captured in the described scenario. There are 242 authentic face images and 234 spoof face images. This is the most realistic database for the purpose of our system. However, this dataset includes faces of individuals who did not agree to publish their images, and for this reason, we cannot share the data. The three remaining datasets are available for download.
  • The open Spoofing Dataset [35] (N = 1054), which includes authentic faces and those displayed on electronic devices (522 authentic images, 532 spoof images).
  • The Large Crowd-collected Facial Anti-Spoofing Dataset by D. Timoschenko et al. [5] (N = 16,284, 1942 authentic images, 14,342 spoof images).
  • Parts of the Celeb-A Dataset by Zhang et al. [6]. This dataset contains over 200 thousand images, which is too much to process with our available resources. Instead, a total of N = 23,052 images were randomly sampled (13,543 authentic images, 9509 spoof images).
Our system was developed in Python using PyTorch, Scikit-Learn, OpenCV, and other libraries. The source code is available at https://github.com/Stukeley/FaceAuthenticityDetection.
Technical specifications of the development and testing system: we used three computers, and the tests were split among them. Two computers were equipped with an AMD Ryzen 9 7950X CPU, 128 GB RAM, an NVidia RTX 4070 GPU, a 512 GB SSD, an 8 TB HDD, and the Windows operating system, and one with 2 x AMD EPYC 7282, 512 GB RAM, an NVidia Tesla T4 GPU, a 512 GB SSD, a 6 TB HDD, and the Windows operating system.
We tested our solutions and compared the results with the results obtained with other methods. The other methods include two implementations of facial spoofing detection, for which we were able to find the source code and train them on our data. Unfortunately, many authors either did not provide the source code, which made their methods practically impossible to use, or provided non-working code and did not respond to our emails when we asked for help to run their code. The remaining methods are the newest and well-known deep learning models, where we used the PyTorch implementation.
  • Depth Prediction with CNN [36], a method developed for facial spoofing detection using an open-source implementation by Anand Pandey [37].
  • MobileNet [38] implemented for facial spoofing detection in an open-source code by Nguyen Dinh Quy [39].
  • The ConvNeXt architecture [40] without pre-trained weights for the “BASE” model, using PyTorch implementation.
  • The EfficientNetV2 architecture [41] without pre-trained weights for the “S” model size, using PyTorch implementation.
  • The Vision Transformer architecture [42] without pre-trained weights for the b_16 architecture.
  • The Swin Transformer architecture [43] without pre-trained weights for the V2_S architecture.
To evaluate each system, five runs of a stratified 5-fold cross-validation were performed, starting each time from a different random data split and a different seed for training the models to account for stochasticity in training and testing performance, for a total of 25 training/testing pairs per dataset for each model. In a given cross-validation run, each model was trained using the same data split.
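A sketch of this evaluation protocol with scikit-learn is shown below; X and y are assumed to be numpy arrays, and train_and_score is a hypothetical helper that trains a given model on the training split and returns its test accuracy:

import numpy as np
from sklearn.model_selection import StratifiedKFold

def evaluate(X, y, train_and_score, n_runs=5, n_splits=5):
    # Five runs of stratified 5-fold cross-validation with different seeds,
    # giving 25 train/test pairs per dataset for each model.
    accuracies = []
    for run in range(n_runs):
        skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=run)
        for train_idx, test_idx in skf.split(X, y):
            acc = train_and_score(X[train_idx], y[train_idx], X[test_idx], y[test_idx], seed=run)
            accuracies.append(acc)
    return np.mean(accuracies), np.std(accuracies)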
For the bezel-detection algorithm, no training was performed at this step; instead, the parameters, including the grayscale threshold for detected bezels, the minimum size, and the number of bezels detected to consider an image spoofed, were empirically adjusted based on images from previous evaluations.
The following models: ConvNeXt, EfficientNetV2, Vision Transformer, and Swin Transformer were trained with the same parameters: LR = 0.001, BS = 32, EPOCHS = 10, DROPOUT_RATE = 0.25, which were the default parameters in the PyTorch implementations we used. The image size for all models was set to (64 × 64), except for the Vision Transformer and Swin Transformer architectures, which were instead trained and evaluated using images of size (224 × 224), as these architectures were built to work with larger image sizes. Values different from those chosen were also tested, resulting in similar or worse performance. No pre-trained weights were used, as this makes the comparison between the proposed system and other tested architectures more reliable. All architectures were trained and evaluated on the same datasets.

4.2. Experimental Results

First, we present the partial results of individual models of the proposed system as well as of other tested systems. Then, the results obtained with different stacking and voting committees are presented. The following configurations were tested:
  • Our four developed modules: bezel detection, smartphone detection CNN, context analysis, and image analysis CNN (baseline).
  • Other methods used for face spoofing analysis: Depth prediction, MobileNet, ConvNeXt, EfficientNetV2, Vision Transformer, and Swin Transformer.
  • Various combinations of the above modules and methods in the stacking ensemble (between 2 and 7 models).
  • Various combinations of the above modules and methods in the voting ensemble (between 2 and 7 models).
The obtained results, including the ensembles with 3, 4, 5, 6, and 7 models with the best accuracy across all datasets, are presented in Table 4. The ensembles were chosen based on mean rank across the four datasets, and we selected the best 3-model ensemble, the best 4-model ensemble, and so forth. The ranks and classification accuracy of the ensembles are presented in Table A3.
We performed hyperparameter optimization for our own neural network models (smartphone detection CNN and image analysis CNN) by varying the learning rate, batch size, epoch count, and dropout rate around the baseline values of LR = 0.001, BS = 32, EPOCHS = 10, DROPOUT_RATE = 0.25. These baseline values were used as a starting point so that our models started from the same hyperparameters as the other models (ConvNeXt, EfficientNetV2, Vision Transformer, and Swin Transformer). The results are presented in Table A1 and Table A2 in the Appendix A.
Based on the experimental results presented in Table A1, we determined the hyperparameters with which the Smartphone Detection CNN and the Image Analysis CNN obtained the highest classification accuracies. The optimal hyperparameters for the smartphone detection CNN are LR = 0.0001, BS = 32, EPOCHS = 10, DROPOUT_RATE = 0.25; the optimal hyperparameters for the image analysis CNN are LR = 0.001, BS = 32, EPOCHS = 20, DROPOUT_RATE = 0.25.
As can be seen from Table A2, when the Smartphone Detection CNN and the Image Analysis CNN are not used as standalone models but as part of a stacking ensemble, the hyperparameter optimization still improves the results, though the differences are much smaller, especially for the two largest datasets.
In the analysis conducted here, the threshold of the final classifier network output above which a face is considered spoofed was set to 0.50: if the probability returned by the network is greater than or equal to 50%, the system predicts the face as spoofed. Although other threshold values were tested, the results were not sensitive to changes in the threshold, and for that reason a threshold of 50% was used, as presented in Table A7, Table A8, Table A9 and Table A10 in the Appendix A.
Two conclusions can be drawn from Table 4. First, the best-performing models are stacking ensembles. Second, the best ensembles consisting of between three and seven models include some or all of our proposed methods. The complete results for various combinations of ensembles can be seen in Table A3.
To test the statistical significance of the obtained results, two tests were used:
  • A one-way ANOVA, testing a null hypothesis: “There is no difference among group means” (α = 0.05).
  • A rank-based Kruskal–Wallis test, testing a null hypothesis: “There is no difference among medians of populations from which the groups are taken” (α = 0.05).
To perform these tests, the accuracies obtained for each fold, random seed, and dataset for the ensembles and other methods presented in Table 4 were collected, amounting to a total of 100 values per system for a total of 11 systems. Then, the necessary metrics were calculated based on those values.
  • For the ANOVA test: f = 116.36, p = 1.73 × 10⁻¹⁴⁵, which is significantly below the accepted threshold. Therefore, the null hypothesis can be rejected.
  • For the Kruskal–Wallis test: h = 476.42, p = 4.80 × 10⁻⁹⁶, which is also significantly below the accepted threshold. Therefore, the null hypothesis can be rejected.
This proves that the tested systems differ and that the differences in the results are statistically significant.
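These tests can be reproduced from the collected per-fold accuracies with SciPy; here accuracies_per_system stands for the 11 lists of 100 values each described above:

from scipy.stats import f_oneway, kruskal

def significance_tests(accuracies_per_system, alpha=0.05):
    # accuracies_per_system: one list of per-fold accuracies for each compared
    # system (here, 11 systems with 100 values each).
    f_stat, p_anova = f_oneway(*accuracies_per_system)
    h_stat, p_kw = kruskal(*accuracies_per_system)
    return {
        "anova": (f_stat, p_anova, p_anova < alpha),
        "kruskal_wallis": (h_stat, p_kw, p_kw < alpha),
    }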
The conclusion is that the stacking ensemble increases the accuracy of the system. Adding the ConvNeXt network had a bigger impact on prediction accuracy than further adding other networks, which may be a result of the individual accuracy of those networks being lower than that of ConvNeXt. Only for the Small dataset does including the lower-accuracy models (ConvNeXt and the Vision Transformer (ViT)) in the ensemble significantly increase the obtained accuracy.
The results indicate that the system is effective in recognizing spoofing attempts while simultaneously displaying high accuracy in correctly identifying authentic faces. The diversity of datasets used to evaluate the system’s performance shows that it is effective in various cases, with different backgrounds and appearances of the people in front of the camera, as well as that it performs significantly better than other systems when using a small dataset for its training.
As part of further testing, all possible ensemble combinations of the seven methods (Bezel, Smartphone Detection CNN, Context, Image Analysis CNN, EfficientNetV2, ConvNeXt, and ViT) were evaluated in an attempt to find the optimal one, separately for stacking and voting ensembles.
Unfortunately, it was not possible to use the Depth Prediction and MobileNet implementations as parts of the ensemble, as the way these models were implemented made it hard to obtain separate results for each input image, rather than aggregate results for the entire datasets, which was needed to test the ensembles. The full results can be viewed in Table A3 in the Appendix A. As can be seen from the results, the best results were obtained for ensembles encompassing more methods, and removing methods from the ensemble significantly affects the variation of the obtained results in cross-validation. The results also show that stacking ensembles are more effective than voting ensembles.
To test the impact of the selected threshold on the obtained results, multiple values were tested in increments of 0.05, and the key metrics of accuracy, precision, recall, and F1-score were recorded for the final ensemble (consisting of the four implemented methods, ConvNeXt, EfficientNetV2, and Vision Transformer). The results are presented in Table A7, Table A8, Table A9 and Table A10. These results were obtained for a different random seed value; therefore, they differ from the results presented earlier. The metrics were calculated for the spoofed faces (the positive class).
The obtained results for all metrics are very similar across the different threshold values. This is especially noticeable for the larger datasets, where the differences are within 1%. This indicates that the outputs of the final neural network are often close to 0.00 or 1.00.
We have applied Platt scaling to the outputs of the final neural network for the ensemble consisting only of our own methods (Bezel, Smartphone Detection CNN, Context, and Image Analysis CNN). This was performed in an attempt to calibrate the final outputs based on expected results. The implementation used a logistic regression model that was trained on the outputs of the final neural network and the expected classes. However, the obtained results did not improve the classification performance, showing slightly lower or slightly higher accuracy. Since the obtained differences in accuracy are inconsistent and minimal, the technique does not provide a clear benefit. The obtained results are presented in Table 5.
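A sketch of this Platt scaling step, fitting a logistic regression on the meta-classifier outputs of a calibration split, is given below; the variable names are illustrative placeholders:

import numpy as np
from sklearn.linear_model import LogisticRegression

def platt_scale(raw_scores, labels, new_scores):
    # Fit a logistic regression on the final network outputs (raw_scores) and
    # the expected classes (1 = spoof, 0 = authentic), then return calibrated
    # spoof probabilities for new outputs.
    calibrator = LogisticRegression()
    calibrator.fit(np.asarray(raw_scores).reshape(-1, 1), labels)
    return calibrator.predict_proba(np.asarray(new_scores).reshape(-1, 1))[:, 1]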
An additional commercial system was tested: “A Personalized Benchmark for Face Anti-spoofing” (D. Belli et al.) [44]. However, it did not offer the possibility to train the model on the provided data. Its accuracy was not satisfactory, as the system frequently failed to detect faces in the provided images and incorrectly marked most authentic images as spoofed. This system was tested on all images from the first, small dataset (N = 477). The obtained results are in Table 6.
Another tested commercial system is the Doubango 3D Passive face liveness detection [45]. However, it was only possible to test a few photos without paying for the system. The system performed perfectly when we showed our face or its photo to a laptop camera. However, for the photos from our work time registration system, the model predicted most authentic faces as spoofed, and even marked some as “deepfake”, with a very high probability. Unfortunately, it was not possible to test the system with more data to obtain more reliable metrics.

5. Conclusions

We proposed a passive face spoofing detection method designed to operate in a work time registration system under real-world constraints: variable lighting, low image quality (WEBP images of about 6 kB), and no possibility of investing in additional hardware (e.g., infrared cameras). Our approach effectively balances performance, flexibility, and computational efficiency, making it a viable solution for real-world employee authentication systems.
The presented method is based on a stacking committee, where each member model evaluates different aspects of the photo. Experimental evaluation showed that the use of a stacking committee improves the accuracy of the entire system. We experimentally evaluated 240 different committees and different threshold parameters for the final classifier in the committee.
Additionally, the use of a centering step in the CNN analysis improves the accuracy of that module, although omitting that step only slightly affects the total accuracy of the system. The use of a committee should also increase the adaptability of the system to real-world conditions, as different methods can detect different types of spoofing attacks.
The modular nature of our design supports future expansion by adding additional models. This will be the subject of our future work. Specifically, we plan to examine the possibility of adding new models to the committee, to evaluate grayscale vs. RGB photos as CNN-based network input, and to train the final neural network with different parameters. Moreover, the acquisition and generation of new data specific to the industrial conditions of the employee registration environment are important aspects to further improve the system.

Author Contributions

Conceptualization, M.K. and R.K.; methodology, M.K. and R.K.; software, M.K. and R.K.; validation, M.K. and R.K.; formal analysis, M.K. and R.K.; investigation, M.K. and R.K.; resources, M.K. and R.K.; data curation, M.K. and R.K.; writing—original draft preparation, M.K. and R.K.; writing—review and editing, M.K. and R.K.; visualization, M.K. and R.K.; supervision, M.K. and R.K.; project administration, M.K. and R.K.; funding acquisition, M.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by The National Centre for Research and Development of Poland (NCBiR), project nr POIR.01.01.01-00-1144/19.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data and source code used in the experiments are available at https://github.com/Stukeley/FaceAuthenticityDetection.

Conflicts of Interest

Author Mirosław Kordos was employed by the company inEwi, sp. z o. o. The remaining author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Appendix A

Table A1. Obtained accuracies [%] in a 5-fold cross-validation for two proposed convolutional neural networks (Smartphone Detection CNN and Image Analysis CNN) with different hyperparameter combinations (learning rate (LR), batch size (BS), epochs, and dropout rate).
LR | BS | Epochs | Dropout | Smartphone CNN | Image CNN
Small dataset
0.01 | 32 | 10 | 0.25 | 68.07 ± 5.56 | 73.95 ± 2.70
0.001 | 32 | 10 | 0.25 | 80.74 ± 4.28 | 85.92 ± 2.19
0.001 | 16 | 10 | 0.25 | 83.57 ± 4.12 | 87.80 ± 3.80
0.001 | 64 | 10 | 0.25 | 77.47 ± 2.97 | 87.55 ± 3.86
0.001 | 32 | 15 | 0.25 | 80.51 ± 4.64 | 86.62 ± 3.31
0.001 | 32 | 20 | 0.25 | 85.20 ± 2.68 | 89.91 ± 1.71
0.001 | 32 | 10 | 0.15 | 80.51 ± 3.04 | 81.00 ± 2.63
0.001 | 32 | 10 | 0.35 | 82.16 ± 2.03 | 82.86 ± 3.48
0.0001 | 32 | 10 | 0.25 | 88.73 ± 3.05 | 81.22 ± 3.24
Spoofing Dataset
0.01 | 32 | 10 | 0.25 | 69.86 ± 2.72 | 84.59 ± 1.80
0.001 | 32 | 10 | 0.25 | 86.70 ± 1.92 | 92.25 ± 0.77
0.001 | 16 | 10 | 0.25 | 88.13 ± 1.64 | 91.67 ± 1.86
0.001 | 64 | 10 | 0.25 | 81.34 ± 2.20 | 92.06 ± 1.50
0.001 | 32 | 15 | 0.25 | 88.42 ± 1.97 | 93.01 ± 0.83
0.001 | 32 | 20 | 0.25 | 86.12 ± 0.68 | 92.25 ± 1.77
0.001 | 32 | 10 | 0.15 | 86.22 ± 2.06 | 90.95 ± 0.63
0.001 | 32 | 10 | 0.35 | 86.41 ± 1.23 | 91.58 ± 2.32
0.0001 | 32 | 10 | 0.25 | 91.10 ± 1.20 | 90.81 ± 2.19
Large Crowd-collected
0.01 | 32 | 10 | 0.25 | 87.80 ± 0.34 | 85.03 ± 0.19
0.001 | 32 | 10 | 0.25 | 90.63 ± 0.55 | 98.26 ± 0.09
0.001 | 16 | 10 | 0.25 | 90.93 ± 0.33 | 98.16 ± 0.12
0.001 | 64 | 10 | 0.25 | 89.91 ± 0.18 | 98.24 ± 0.27
0.001 | 32 | 15 | 0.25 | 90.16 ± 0.45 | 98.50 ± 0.14
0.001 | 32 | 20 | 0.25 | 90.96 ± 0.10 | 98.64 ± 0.13
0.001 | 32 | 10 | 0.15 | 90.16 ± 0.52 | 97.10 ± 0.15
0.001 | 32 | 10 | 0.35 | 90.16 ± 0.53 | 97.33 ± 0.29
0.0001 | 32 | 10 | 0.25 | 91.74 ± 0.20 | 97.35 ± 0.24
CelebA
0.01 | 32 | 10 | 0.25 | 79.84 ± 0.62 | 84.57 ± 0.38
0.001 | 32 | 10 | 0.25 | 89.47 ± 0.52 | 93.80 ± 0.39
0.001 | 16 | 10 | 0.25 | 90.99 ± 0.34 | 94.39 ± 0.13
0.001 | 64 | 10 | 0.25 | 86.72 ± 0.37 | 93.17 ± 0.21
0.001 | 32 | 15 | 0.25 | 90.10 ± 0.50 | 95.16 ± 0.30
0.001 | 32 | 20 | 0.25 | 89.94 ± 0.28 | 95.87 ± 0.11
0.001 | 32 | 10 | 0.15 | 88.75 ± 0.49 | 94.61 ± 0.13
0.001 | 32 | 10 | 0.35 | 89.45 ± 0.36 | 94.95 ± 0.31
0.0001 | 32 | 10 | 0.25 | 88.83 ± 0.27 | 92.87 ± 0.21
Table A2. Obtained accuracies [%] in a 5-fold cross-validation for the stacking ensembles from Table 4 with default and optimal hyperparameters for our models: Smartphone Detection CNN (S) and Image Analysis CNN (N). The other models used in the ensembles: Bezel (B), Context (C), EfficientNetV2 (E), ConvNeXt (X), and Vision Transformer (V).
Combination | Default Parameters | Optimal Parameters
Small dataset
C,N,E | 89.20 ± 1.71 | 91.78 ± 1.29
S,C,N,E | 91.31 ± 2.54 | 91.02 ± 3.60
S,C,N,X,E | 90.84 ± 3.28 | 91.55 ± 3.44
B,S,C,N,X,E | 91.78 ± 2.69 | 91.55 ± 3.61
B,S,C,N,X,E,V | 91.78 ± 2.36 | 91.55 ± 3.61
Spoofing Dataset
C,N,E | 95.56 ± 0.56 | 96.27 ± 1.30
S,C,N,E | 96.36 ± 0.83 | 96.84 ± 0.23
S,C,N,X,E | 96.46 ± 0.78 | 97.22 ± 0.36
B,S,C,N,X,E | 96.75 ± 0.98 | 97.03 ± 0.36
B,S,C,N,X,E,V | 96.56 ± 1.26 | 97.13 ± 0.68
Large Crowd-collected
C,N,E | 98.94 ± 0.12 | 99.02 ± 0.08
S,C,N,E | 98.98 ± 0.23 | 99.13 ± 0.06
S,C,N,X,E | 99.05 ± 0.17 | 99.12 ± 0.06
B,S,C,N,X,E | 99.01 ± 0.22 | 99.14 ± 0.11
B,S,C,N,X,E,V | 99.00 ± 0.20 | 99.16 ± 0.09
CelebA
C,N,E | 99.04 ± 0.10 | 99.10 ± 0.13
S,C,N,E | 99.15 ± 0.08 | 99.18 ± 0.07
S,C,N,X,E | 99.21 ± 0.06 | 99.25 ± 0.10
B,S,C,N,X,E | 99.20 ± 0.10 | 99.23 ± 0.07
B,S,C,N,X,E,V | 99.17 ± 0.11 | 99.26 ± 0.10
Table A3. All obtained accuracies [%] for different stacking and voting ensemble combinations over five runs of a 5-fold cross-validation, part 1. The rank was obtained by averaging the position for all four datasets, for each combination. The models used are Bezel (B), Smartphone Detection CNN (S), Context (C), Image Analysis CNN (N), EfficientNetV2 (E), ConvNeXt (X), and Vision Transformer (V).
Combination | Small | Spoofing | Large | CelebA | Rank
B,S,C,N,X,E,V (stacking) | 92.62 ± 3.05 | 96.88 ± 1.48 | 98.99 ± 0.17 | 99.20 ± 0.12 | 2.75
S,C,N,X,E (stacking) | 92.15 ± 3.36 | 96.85 ± 1.51 | 99.00 ± 0.17 | 99.19 ± 0.10 | 4.25
B,S,C,N,X,E (stacking) | 92.46 ± 3.17 | 96.82 ± 1.54 | 98.99 ± 0.19 | 99.18 ± 0.12 | 4.5
B,S,C,N,E (stacking) | 92.63 ± 3.04 | 96.88 ± 1.53 | 98.95 ± 0.20 | 99.12 ± 0.13 | 7.25
S,C,N,X,E,V (stacking) | 91.69 ± 3.73 | 96.93 ± 1.39 | 98.98 ± 0.18 | 99.19 ± 0.10 | 7.5
B,S,C,N,E,V (stacking) | 92.80 ± 3.21 | 96.80 ± 1.57 | 98.95 ± 0.19 | 99.10 ± 0.13 | 9
S,C,N,E,V (stacking) | 92.11 ± 4.04 | 96.89 ± 1.44 | 98.94 ± 0.21 | 99.12 ± 0.12 | 10.75
S,C,N,E (stacking) | 92.29 ± 3.16 | 96.77 ± 1.46 | 98.93 ± 0.22 | 99.14 ± 0.11 | 11.5
B,S,C,N,X (stacking) | 91.61 ± 2.67 | 96.12 ± 1.31 | 98.99 ± 0.17 | 99.17 ± 0.11 | 11.75
B,C,N,X,E,V (stacking) | 92.04 ± 3.47 | 96.30 ± 1.66 | 98.95 ± 0.14 | 99.09 ± 0.10 | 13.25
B,S,C,N,X,V (stacking) | 91.43 ± 2.91 | 96.07 ± 1.33 | 98.99 ± 0.17 | 99.17 ± 0.09 | 13.75
C,N,X,E,V (stacking) | 91.28 ± 4.00 | 96.38 ± 1.69 | 98.97 ± 0.15 | 99.05 ± 0.15 | 16.5
S,C,N,X (stacking) | 90.92 ± 2.86 | 95.96 ± 1.27 | 99.01 ± 0.18 | 99.16 ± 0.12 | 17
B,C,N,X,E (stacking) | 91.69 ± 3.83 | 96.18 ± 1.66 | 98.95 ± 0.14 | 99.08 ± 0.13 | 17
S,C,N,X,V (stacking) | 90.73 ± 3.05 | 95.88 ± 1.28 | 98.99 ± 0.16 | 99.16 ± 0.09 | 20.25
B,C,N,E,V (stacking) | 92.16 ± 3.66 | 96.34 ± 1.66 | 98.89 ± 0.21 | 98.94 ± 0.18 | 20.5
B,S,C,N,V (stacking) | 91.48 ± 2.72 | 95.91 ± 1.39 | 98.94 ± 0.20 | 99.09 ± 0.13 | 21.25
B,S,C,N (stacking) | 91.29 ± 2.72 | 95.86 ± 1.34 | 98.95 ± 0.21 | 99.11 ± 0.12 | 22
B,C,N,E (stacking) | 91.85 ± 4.07 | 96.18 ± 1.65 | 98.91 ± 0.21 | 98.93 ± 0.17 | 23
C,N,X,E (stacking) | 90.61 ± 3.80 | 96.25 ± 1.80 | 98.95 ± 0.16 | 99.05 ± 0.11 | 23.25
S,C,N,V (stacking) | 90.73 ± 3.06 | 95.93 ± 1.38 | 98.94 ± 0.22 | 99.07 ± 0.13 | 25
C,N,E,V (stacking) | 91.19 ± 4.28 | 96.30 ± 1.62 | 98.89 ± 0.21 | 98.96 ± 0.18 | 27
C,N,E (stacking) | 91.18 ± 3.95 | 96.35 ± 1.69 | 98.89 ± 0.22 | 98.94 ± 0.17 | 27
S,C,N (stacking) | 90.65 ± 2.99 | 95.81 ± 1.32 | 98.94 ± 0.23 | 99.04 ± 0.10 | 29
B,C,N,X (stacking) | 89.75 ± 3.80 | 94.43 ± 1.97 | 98.96 ± 0.16 | 98.98 ± 0.10 | 38.25
B,C,N,X,V (stacking) | 89.40 ± 4.05 | 94.33 ± 2.00 | 98.97 ± 0.15 | 98.98 ± 0.10 | 40.25
B,S,C,X,E,V (stacking) | 90.02 ± 4.15 | 95.97 ± 1.74 | 96.53 ± 0.41 | 99.03 ± 0.12 | 41.25
B,S,N,X,E (stacking) | 91.74 ± 3.22 | 95.62 ± 1.88 | 98.37 ± 0.29 | 97.15 ± 0.28 | 41.25
B,S,C,X,E (stacking) | 90.11 ± 4.12 | 95.80 ± 1.78 | 96.57 ± 0.42 | 99.01 ± 0.12 | 43.75
B,S,N,X,E,V (stacking) | 91.60 ± 3.46 | 95.55 ± 2.12 | 98.37 ± 0.32 | 97.10 ± 0.26 | 43.75
S,N,X,E,V (stacking) | 91.55 ± 3.90 | 95.58 ± 2.25 | 98.37 ± 0.31 | 97.09 ± 0.27 | 44
B,S,N,E (stacking) | 92.16 ± 3.28 | 95.74 ± 2.12 | 98.30 ± 0.36 | 96.68 ± 0.34 | 44.75
S,N,X,E (stacking) | 91.40 ± 3.70 | 95.68 ± 1.96 | 98.35 ± 0.32 | 97.10 ± 0.28 | 45.25
B,S,N,E,V (stacking) | 92.12 ± 3.24 | 95.69 ± 2.19 | 98.31 ± 0.34 | 96.63 ± 0.35 | 45.75
C,N,X (stacking) | 87.78 ± 4.20 | 94.43 ± 2.01 | 98.96 ± 0.16 | 99.00 ± 0.13 | 47.75
C,N,X,V (stacking) | 87.97 ± 3.85 | 94.33 ± 2.05 | 98.96 ± 0.15 | 98.99 ± 0.10 | 47.75
B,C,N (stacking) | 89.76 ± 3.96 | 94.38 ± 1.92 | 98.90 ± 0.22 | 98.84 ± 0.13 | 47.75
B,C,N,V (stacking) | 89.73 ± 3.87 | 94.31 ± 1.99 | 98.91 ± 0.21 | 98.87 ± 0.12 | 48.5
S,C,X,E,V (stacking) | 88.75 ± 5.02 | 95.87 ± 1.80 | 96.46 ± 0.39 | 99.01 ± 0.13 | 50.5
S,N,E,V (stacking) | 91.74 ± 3.86 | 95.59 ± 2.18 | 98.29 ± 0.37 | 96.64 ± 0.34 | 50.5
S,C,E,V (stacking) | 89.16 ± 4.38 | 96.04 ± 1.83 | 96.13 ± 0.35 | 98.93 ± 0.18 | 50.75
S,C,X,E (stacking) | 88.80 ± 4.70 | 95.86 ± 1.80 | 96.49 ± 0.43 | 98.99 ± 0.11 | 51
B,S,C,E,V (stacking) | 89.60 ± 4.38 | 95.84 ± 1.78 | 96.13 ± 0.37 | 98.93 ± 0.17 | 52
S,C,N,E (voting) | 91.92 ± 3.47 | 96.27 ± 1.54 | 94.67 ± 1.30 | 98.02 ± 0.35 | 53.25
B,S,C,N,E (voting) | 92.04 ± 3.26 | 94.61 ± 2.19 | 96.70 ± 0.40 | 97.37 ± 0.51 | 54
S,N,E (stacking) | 91.21 ± 3.81 | 95.66 ± 2.08 | 98.29 ± 0.36 | 96.66 ± 0.36 | 54
B,S,C,E (stacking) | 89.44 ± 4.38 | 95.84 ± 1.73 | 96.08 ± 0.37 | 98.91 ± 0.16 | 54.5
S,C,N (voting) | 90.48 ± 2.19 | 94.87 ± 1.71 | 97.14 ± 0.34 | 98.69 ± 0.17 | 56.25
S,C,E (stacking) | 88.70 ± 5.00 | 95.93 ± 1.93 | 96.11 ± 0.37 | 98.92 ± 0.15 | 56.5
C,N (stacking) | 88.02 ± 3.58 | 94.43 ± 1.94 | 98.89 ± 0.22 | 98.83 ± 0.13 | 57
B,S,N,X,V (stacking) | 91.10 ± 2.11 | 93.98 ± 1.43 | 98.36 ± 0.32 | 97.01 ± 0.34 | 57.75
B,N,X,E,V (stacking) | 91.24 ± 4.21 | 94.71 ± 2.19 | 98.31 ± 0.31 | 96.42 ± 0.25 | 58.75
C,N,V (stacking) | 87.88 ± 3.68 | 94.33 ± 2.00 | 98.88 ± 0.22 | 98.86 ± 0.14 | 59
B,N,X,E (stacking) | 90.58 ± 4.79 | 94.73 ± 2.11 | 98.31 ± 0.34 | 96.39 ± 0.30 | 61.75
B,S,N,X (stacking) | 90.42 ± 2.91 | 93.67 ± 1.68 | 98.36 ± 0.32 | 96.97 ± 0.36 | 62.75
C,N,E (voting) | 89.00 ± 3.84 | 95.47 ± 1.66 | 96.53 ± 0.38 | 98.10 ± 0.22 | 63
N,X,E,V (stacking) | 89.69 ± 4.79 | 94.69 ± 2.22 | 98.31 ± 0.31 | 96.42 ± 0.32 | 65
B,S,N,V (stacking) | 91.41 ± 2.85 | 93.85 ± 1.58 | 98.30 ± 0.37 | 96.32 ± 0.23 | 65.25
S,N,X,V (stacking) | 88.89 ± 3.36 | 93.93 ± 1.35 | 98.38 ± 0.32 | 96.97 ± 0.39 | 66.5
Table A4. All obtained accuracies [%] for different stacking and voting ensemble combinations over five runs of a 5-fold cross-validation, part 2.
CombinationSmallSpoofingLargeCelebARank
B,N,E,V (stacking)91.51 ± 3.9194.64 ± 2.2798.26 ± 0.3395.59 ± 0.6666.5
B,S,N (stacking)91.38 ± 2.8593.45 ± 1.7098.30 ± 0.3696.39 ± 0.2567
S,N,X (stacking)88.64 ± 3.2393.72 ± 1.7698.37 ± 0.3096.98 ± 0.3668.75
N,X,E (stacking)89.26 ± 5.3694.75 ± 2.2598.28 ± 0.3396.39 ± 0.2769.25
B,C,X,E,V (stacking)86.50 ± 6.0395.15 ± 2.0996.10 ± 0.4298.93 ± 0.1270
B,N,E (stacking)91.38 ± 4.2994.60 ± 2.2398.24 ± 0.3595.58 ± 0.6770
B,S,C,X,V (stacking)87.37 ± 3.2593.47 ± 1.9396.42 ± 0.4298.98 ± 0.1070.75
B,C,X,E (stacking)86.08 ± 6.0395.17 ± 2.0496.11 ± 0.4498.91 ± 0.1071.25
C,N (voting)85.89 ± 3.6594.49 ± 1.7398.20 ± 0.2398.89 ± 0.1471.25
B,S,C,X (stacking)87.01 ± 3.5293.37 ± 1.9696.41 ± 0.4498.95 ± 0.1173.75
N,E,V (stacking)89.70 ± 4.8794.75 ± 2.2698.25 ± 0.3495.57 ± 0.6774.25
C,X,E,V (stacking)84.62 ± 6.9395.29 ± 1.9296.03 ± 0.4698.95 ± 0.1075.25
S,N,V (stacking)89.07 ± 2.9593.94 ± 1.5698.28 ± 0.3796.39 ± 0.2775.75
S,N (stacking)89.05 ± 3.0393.53 ± 1.7898.30 ± 0.3796.35 ± 0.2376.25
B,C,E,V (stacking)86.78 ± 5.8095.15 ± 2.0495.61 ± 0.4498.63 ± 0.3677.75
N,E (stacking)89.31 ± 5.3194.56 ± 2.2898.25 ± 0.3495.57 ± 0.7078
S,C,X,V (stacking)85.36 ± 3.7793.50 ± 1.8396.32 ± 0.4398.96 ± 0.1178.25
S,C,X (stacking)85.29 ± 3.3393.44 ± 1.9496.33 ± 0.4498.94 ± 0.1180.75
C,X,E (stacking)84.14 ± 7.1995.29 ± 2.0095.98 ± 0.4498.89 ± 0.1481.5
B,C,E (stacking)86.45 ± 5.8995.07 ± 2.1895.57 ± 0.4798.59 ± 0.3481.5
B,S,C,N (voting)90.14 ± 3.0491.05 ± 2.0798.06 ± 0.2397.22 ± 0.3181.5
B,S,C,V (stacking)86.59 ± 3.0693.46 ± 1.9895.89 ± 0.3798.84 ± 0.1383.25
B,S,C (stacking)86.97 ± 3.0693.43 ± 1.9895.84 ± 0.3798.79 ± 0.1384.25
S,C,N,X,E (voting)89.00 ± 3.8094.19 ± 2.2394.27 ± 1.3297.30 ± 0.3884.25
S,N,E (voting)91.24 ± 3.9294.94 ± 2.1093.40 ± 1.4595.73 ± 0.5484.75
B,N,X,V (stacking)89.10 ± 3.8991.46 ± 1.9998.32 ± 0.3296.01 ± 0.5685.75
C,E,V (stacking)84.71 ± 7.0695.11 ± 2.0895.53 ± 0.4598.62 ± 0.3587
S,C,N,E,V (voting)89.06 ± 2.9594.29 ± 2.3092.78 ± 1.6397.37 ± 0.3788
C,E (stacking)84.57 ± 6.9094.98 ± 2.2495.50 ± 0.4698.58 ± 0.3389
B,C,N,E (voting)89.11 ± 2.6793.30 ± 2.6395.49 ± 3.4096.64 ± 0.8189.5
S,C,V (stacking)85.22 ± 3.6593.44 ± 1.8995.89 ± 0.3498.83 ± 0.0990
S,C,E (voting)88.00 ± 4.2195.36 ± 2.1892.54 ± 1.1296.92 ± 0.4690
S,C (stacking)85.08 ± 3.7193.40 ± 1.9795.82 ± 0.3598.78 ± 0.1493
B,N,X (stacking)88.37 ± 4.4291.35 ± 1.8898.31 ± 0.3395.99 ± 0.5593.5
N,X,V (stacking)86.73 ± 3.9991.53 ± 2.0598.31 ± 0.3496.00 ± 0.5394.25
B,C,N (voting)86.66 ± 4.4190.74 ± 2.1697.42 ± 0.3597.17 ± 0.3096.75
B,N,V (stacking)88.16 ± 4.0791.56 ± 1.8998.26 ± 0.3494.80 ± 0.30100.25
B,S,C,N,X,E (voting)88.72 ± 3.4590.94 ± 2.7795.35 ± 0.9396.98 ± 0.58103
B,S,C,N,E,V (voting)88.82 ± 3.1491.35 ± 3.1994.20 ± 1.3896.98 ± 0.55103.5
N,X (stacking)86.16 ± 4.4791.35 ± 1.9998.29 ± 0.3395.92 ± 0.57103.5
S,N (voting)89.29 ± 3.6992.39 ± 2.2693.90 ± 0.7995.69 ± 0.34103.5
B,N (stacking)88.04 ± 4.4291.36 ± 1.8898.27 ± 0.3694.78 ± 0.27104.25
S,C (voting)84.41 ± 3.3492.34 ± 1.9194.73 ± 0.4098.71 ± 0.20104.75
B,S,C,E (voting)88.93 ± 4.0192.66 ± 2.8292.49 ± 2.9296.06 ± 0.79105.5
B,C,N,X,E (voting)85.67 ± 4.5690.41 ± 2.6496.55 ± 0.4096.67 ± 0.66107
N,V (stacking)86.58 ± 4.0991.40 ± 2.0798.24 ± 0.3494.79 ± 0.29110.5
B,S,C,N,X (voting)86.06 ± 4.4086.84 ± 3.4297.05 ± 0.4697.00 ± 0.57110.75
B,S,N,E (voting)91.21 ± 3.9892.53 ± 2.7991.93 ± 3.1294.68 ± 0.78112
B,C,X (stacking)80.06 ± 3.6591.62 ± 2.0995.90 ± 0.4898.74 ± 0.19112.25
S,C,N,X (voting)85.64 ± 3.7790.39 ± 2.2194.89 ± 0.8197.89 ± 0.25112.25
B,C,X,V (stacking)80.11 ± 4.4891.45 ± 2.2695.92 ± 0.4998.79 ± 0.13114.5
C,X,V (stacking)77.60 ± 3.9391.62 ± 2.1495.85 ± 0.4698.82 ± 0.12115
C,X (stacking)77.83 ± 4.2291.93 ± 1.8395.84 ± 0.5098.74 ± 0.18115.25
B,S,C,N,V (voting)85.28 ± 5.0088.25 ± 3.7496.01 ± 0.4596.68 ± 0.24115.75
B,C,N,E,V (voting)86.07 ± 4.1890.69 ± 3.1695.64 ± 0.7196.48 ± 0.55115.75
C,N,X,E (voting)84.85 ± 4.9492.88 ± 2.3493.43 ± 1.4495.91 ± 0.43117.25
B,C (stacking)80.28 ± 3.7591.72 ± 1.9395.22 ± 0.4398.19 ± 0.18117.75
B,C,V (stacking)80.16 ± 3.5591.51 ± 2.3095.30 ± 0.4598.33 ± 0.18118.5
C,N,X (voting)79.35 ± 4.0388.61 ± 2.3597.03 ± 0.3897.96 ± 0.26119
S,C,N,V (voting)84.65 ± 3.3391.06 ± 2.8592.93 ± 0.8897.52 ± 0.22119.75
B,S,X,E,V (stacking)88.24 ± 5.5792.59 ± 3.3392.31 ± 0.8294.15 ± 0.46121.75
Table A5. All obtained accuracies [%] for different stacking and voting ensemble combinations for five runs of a 5-fold cross-validation, part 3.
Combination | Small | Spoofing | Large | CelebA | Rank
B,S,C,N,X,E,V (voting) | 86.19 ± 4.47 | 88.13 ± 4.98 | 93.64 ± 1.44 | 96.60 ± 0.53 | 123.25
B,S,N (voting) | 89.29 ± 3.01 | 87.68 ± 2.54 | 96.49 ± 0.40 | 91.81 ± 0.62 | 123.5
C,V (stacking) | 77.92 ± 4.11 | 91.38 ± 2.38 | 95.25 ± 0.47 | 98.32 ± 0.21 | 124.5
S,X,E,V (stacking) | 86.82 ± 5.89 | 92.63 ± 3.50 | 92.21 ± 0.79 | 94.16 ± 0.43 | 126.25
B,S,X,E (stacking) | 87.86 ± 5.54 | 92.51 ± 3.45 | 92.28 ± 0.80 | 94.07 ± 0.49 | 126.5
B,S,N,X,E (voting) | 87.64 ± 4.31 | 87.92 ± 2.93 | 94.17 ± 0.99 | 94.84 ± 0.85 | 129.75
C,N,V (voting) | 78.65 ± 4.17 | 89.24 ± 3.18 | 96.00 ± 0.31 | 96.70 ± 0.26 | 130.5
S,X,E (stacking) | 86.49 ± 6.13 | 92.55 ± 3.47 | 92.18 ± 0.73 | 94.06 ± 0.49 | 130.75
C,N,E,V (voting) | 84.45 ± 4.68 | 92.69 ± 2.30 | 91.44 ± 2.14 | 95.68 ± 0.37 | 132
S,C,X,E (voting) | 83.66 ± 5.49 | 92.02 ± 2.55 | 92.18 ± 1.37 | 95.88 ± 0.39 | 132.5
B,S,N,E,V (voting) | 87.94 ± 3.73 | 88.66 ± 3.36 | 92.69 ± 1.36 | 94.37 ± 0.59 | 134
S,N,X,E (voting) | 85.98 ± 5.16 | 91.57 ± 2.88 | 91.76 ± 1.60 | 94.88 ± 0.51 | 134.25
S,C,N,X,E,V (voting) | 85.10 ± 4.20 | 90.16 ± 3.92 | 91.56 ± 1.62 | 96.60 ± 0.38 | 134.75
B,S,E (stacking) | 88.51 ± 5.15 | 92.64 ± 3.40 | 91.04 ± 0.70 | 92.00 ± 1.32 | 135.5
C,X,E (voting) | 81.31 ± 5.40 | 92.36 ± 2.26 | 92.49 ± 1.03 | 95.87 ± 0.46 | 136.25
B,S,E,V (stacking) | 87.91 ± 5.55 | 92.57 ± 3.37 | 91.15 ± 0.62 | 92.23 ± 1.44 | 137.5
B,S,C,X,E (voting) | 83.93 ± 4.93 | 88.23 ± 3.36 | 93.58 ± 0.94 | 95.72 ± 0.81 | 138
N,X,E (voting) | 84.40 ± 5.07 | 91.04 ± 2.41 | 93.41 ± 1.26 | 94.90 ± 0.52 | 138.25
B,S,C,E,V (voting) | 85.38 ± 4.50 | 88.87 ± 3.63 | 92.03 ± 1.22 | 95.59 ± 0.58 | 138.5
C,N,X,E,V (voting) | 81.75 ± 3.43 | 89.34 ± 4.00 | 92.82 ± 1.45 | 96.54 ± 0.40 | 139.5
S,C,N,X,V (voting) | 81.75 ± 3.70 | 86.88 ± 4.74 | 92.91 ± 1.01 | 97.07 ± 0.37 | 140.75
B,S,C (voting) | 85.25 ± 3.84 | 87.59 ± 2.81 | 94.91 ± 0.53 | 93.44 ± 0.49 | 143
S,E,V (stacking) | 86.63 ± 5.92 | 92.55 ± 3.49 | 90.98 ± 0.69 | 92.04 ± 1.60 | 143
S,C,E,V (voting) | 83.93 ± 4.62 | 91.95 ± 2.74 | 90.20 ± 1.62 | 95.71 ± 0.37 | 144.75
S,N,X (voting) | 83.70 ± 4.71 | 86.94 ± 2.90 | 93.82 ± 0.92 | 95.31 ± 0.57 | 145.25
S,E (stacking) | 86.45 ± 5.96 | 92.55 ± 3.46 | 90.92 ± 0.73 | 91.75 ± 1.63 | 146.5
B,C,E (voting) | 85.34 ± 5.11 | 92.44 ± 3.29 | 87.01 ± 12.98 | 95.20 ± 0.67 | 148.5
S,N,E,V (voting) | 86.87 ± 3.62 | 91.30 ± 2.90 | 89.97 ± 1.74 | 94.36 ± 0.44 | 148.5
B,S,C,N,X,V (voting) | 80.70 ± 4.85 | 82.72 ± 6.67 | 94.33 ± 0.91 | 96.26 ± 0.65 | 148.75
C,E (voting) | 83.78 ± 6.63 | 93.05 ± 3.20 | 89.75 ± 1.08 | 94.44 ± 1.08 | 149.25
B,C,N,X (voting) | 79.01 ± 5.16 | 81.14 ± 4.19 | 97.39 ± 0.34 | 94.70 ± 2.28 | 149.75
N,E,V (voting) | 84.91 ± 4.50 | 90.86 ± 2.63 | 91.46 ± 1.93 | 93.85 ± 0.44 | 151.5
C,E,V (voting) | 81.10 ± 4.70 | 91.72 ± 2.65 | 90.56 ± 1.40 | 95.58 ± 0.38 | 151.75
S,C,X (voting) | 77.32 ± 4.07 | 86.65 ± 2.94 | 93.12 ± 0.88 | 96.57 ± 0.50 | 152.25
B,N,E (voting) | 87.55 ± 4.27 | 91.28 ± 3.06 | 88.32 ± 14.03 | 93.50 ± 0.79 | 155.25
B,C,N,V (voting) | 76.95 ± 5.71 | 83.22 ± 4.91 | 97.59 ± 0.27 | 93.37 ± 0.36 | 155.25
S,N,X,E,V (voting) | 84.35 ± 4.39 | 87.84 ± 4.16 | 90.98 ± 1.60 | 94.96 ± 0.53 | 156
B,C,N,X,E,V (voting) | 80.38 ± 4.69 | 83.94 ± 6.44 | 93.31 ± 1.46 | 95.49 ± 0.85 | 156.75
B,C,N,X,V (voting) | 74.90 ± 5.26 | 75.22 ± 9.75 | 96.07 ± 0.55 | 95.30 ± 0.99 | 157.5
B,S,X,V (stacking) | 84.54 ± 3.96 | 87.20 ± 2.82 | 92.04 ± 0.82 | 93.25 ± 0.97 | 158.5
B,C,X,E (voting) | 79.10 ± 5.06 | 86.99 ± 2.65 | 92.46 ± 0.52 | 95.41 ± 0.48 | 160.25
B,S,X (stacking) | 84.55 ± 3.45 | 86.91 ± 2.76 | 91.94 ± 0.83 | 93.18 ± 1.14 | 160.5
B,X,E (stacking) | 83.45 ± 7.73 | 91.46 ± 3.81 | 90.69 ± 0.85 | 93.16 ± 1.08 | 160.5
X,E,V (stacking) | 83.21 ± 7.91 | 91.48 ± 3.80 | 90.69 ± 0.83 | 93.09 ± 1.15 | 161.5
X,E (stacking) | 83.21 ± 7.91 | 91.48 ± 3.80 | 90.70 ± 0.86 | 92.98 ± 1.27 | 161.5
B,S,C,X (voting) | 78.03 ± 4.37 | 81.40 ± 4.15 | 94.48 ± 0.60 | 94.77 ± 1.58 | 162
S,C,X,E,V (voting) | 80.40 ± 4.09 | 87.88 ± 4.45 | 90.60 ± 1.40 | 95.85 ± 0.47 | 162.25
B,X,E,V (stacking) | 83.31 ± 7.87 | 91.44 ± 3.83 | 90.69 ± 0.84 | 93.04 ± 1.16 | 163
S,N,V (voting) | 82.93 ± 4.28 | 87.54 ± 3.32 | 91.30 ± 0.90 | 93.65 ± 0.43 | 166
S,C,V (voting) | 75.50 ± 4.91 | 87.27 ± 3.14 | 90.75 ± 0.80 | 95.92 ± 0.39 | 166.75
S,X,V (stacking) | 82.04 ± 4.30 | 87.24 ± 2.69 | 91.83 ± 0.89 | 93.21 ± 1.04 | 166.75
B,S,N,X (voting) | 82.07 ± 4.84 | 80.53 ± 5.01 | 94.22 ± 0.75 | 92.54 ± 2.28 | 167
B,S,C,X,E,V (voting) | 79.73 ± 4.83 | 84.40 ± 6.17 | 91.56 ± 1.40 | 95.23 ± 0.81 | 167.5
B,S,N,X,E,V (voting) | 82.44 ± 4.98 | 83.41 ± 6.70 | 91.47 ± 1.59 | 94.15 ± 0.91 | 168.75
S,X,E (voting) | 81.92 ± 6.01 | 88.47 ± 3.16 | 90.53 ± 1.26 | 93.64 ± 0.64 | 170.25
B,C,E,V (voting) | 79.13 ± 4.70 | 87.53 ± 2.78 | 90.50 ± 0.99 | 95.03 ± 0.36 | 171.5
S,E (voting) | 83.53 ± 7.14 | 92.28 ± 3.48 | 86.35 ± 5.72 | 91.57 ± 0.52 | 172
B,C (voting) | 77.57 ± 5.67 | 85.87 ± 2.62 | 92.48 ± 0.50 | 93.76 ± 0.35 | 172.25
S,X (stacking) | 81.69 ± 4.17 | 86.87 ± 2.67 | 91.83 ± 0.86 | 93.08 ± 1.19 | 172.25
B,S,E (voting) | 86.88 ± 5.41 | 88.09 ± 3.42 | 83.54 ± 13.05 | 90.54 ± 0.80 | 173.75
Table A6. All obtained accuracies [%] for different stacking and voting ensemble combinations for five runs of a 5-fold cross-validation, part 4.
Combination | Small | Spoofing | Large | CelebA | Rank
B,S,C,V (voting) | 75.54 ± 5.41 | 83.36 ± 4.54 | 93.20 ± 0.58 | 93.05 ± 0.62 | 174.75
B,S,V (stacking) | 84.90 ± 3.62 | 87.07 ± 2.78 | 90.51 ± 0.61 | 90.42 ± 0.50 | 174.75
N,E (voting) | 83.45 ± 7.42 | 91.50 ± 4.00 | 87.02 ± 4.38 | 91.51 ± 0.50 | 175.5
B,S,N,V (voting) | 80.79 ± 5.87 | 82.62 ± 5.26 | 92.92 ± 0.80 | 90.41 ± 0.54 | 177.5
B,S,C,X,V (voting) | 73.99 ± 4.99 | 74.77 ± 9.04 | 92.53 ± 0.93 | 94.42 ± 1.00 | 178
B,C,X,E,V (voting) | 75.14 ± 5.66 | 76.44 ± 9.31 | 91.86 ± 1.23 | 94.47 ± 1.09 | 178.75
B,S,N,X,V (voting) | 77.61 ± 5.32 | 74.28 ± 9.35 | 93.09 ± 0.98 | 93.02 ± 1.19 | 179.5
S,E,V (voting) | 82.59 ± 5.08 | 88.43 ± 3.60 | 88.82 ± 1.03 | 92.57 ± 0.49 | 179.5
B,S (stacking) | 84.83 ± 3.64 | 86.89 ± 2.65 | 90.42 ± 0.64 | 89.87 ± 0.44 | 179.75
B,N,X,E,V (voting) | 77.18 ± 5.48 | 74.97 ± 9.89 | 92.63 ± 1.33 | 93.15 ± 1.33 | 180
B,E (stacking) | 83.73 ± 7.56 | 91.48 ± 3.80 | 88.69 ± 0.75 | 81.54 ± 14.09 | 180
B,E,V (stacking) | 83.63 ± 7.62 | 91.48 ± 3.80 | 88.87 ± 0.58 | 81.94 ± 13.52 | 180.25
E,V (stacking) | 83.21 ± 7.91 | 91.48 ± 3.80 | 88.66 ± 0.84 | 80.37 ± 15.74 | 183.5
B,N,X (voting) | 73.36 ± 5.42 | 65.93 ± 6.65 | 96.03 ± 0.48 | 90.38 ± 3.32 | 184
B,E (voting) | 83.82 ± 7.24 | 91.13 ± 3.91 | 78.57 ± 19.95 | 88.66 ± 0.54 | 184.75
B,N,V (voting) | 71.29 ± 5.81 | 71.32 ± 6.81 | 96.07 ± 0.47 | 84.98 ± 0.56 | 184.75
B,N,X,E (voting) | 81.35 ± 4.76 | 81.93 ± 3.87 | 90.99 ± 1.24 | 91.52 ± 2.41 | 185
C,X (voting) | 61.02 ± 8.35 | 74.35 ± 8.39 | 91.69 ± 0.96 | 95.12 ± 0.45 | 185.75
S,V (stacking) | 81.74 ± 4.10 | 87.17 ± 2.72 | 90.53 ± 0.64 | 90.29 ± 0.67 | 186
B,C,X (voting) | 70.49 ± 4.85 | 67.88 ± 6.33 | 94.33 ± 0.64 | 92.55 ± 2.91 | 186.25
C,N,X,V (voting) | 69.69 ± 7.22 | 70.90 ± 8.28 | 91.93 ± 1.11 | 94.24 ± 1.16 | 187.5
S,C,X,V (voting) | 69.17 ± 7.42 | 75.52 ± 5.02 | 90.72 ± 0.97 | 94.70 ± 0.78 | 189.5
C,X,E,V (voting) | 73.11 ± 6.21 | 80.56 ± 4.37 | 89.75 ± 1.39 | 94.74 ± 0.29 | 190.25
B,S,X,E (voting) | 81.14 ± 4.99 | 82.87 ± 3.80 | 89.49 ± 1.78 | 91.80 ± 1.62 | 192.25
B,N,E,V (voting) | 82.15 ± 5.43 | 83.79 ± 4.35 | 89.43 ± 0.95 | 89.72 ± 0.45 | 193.5
B,C,V (voting) | 68.83 ± 5.40 | 73.65 ± 6.72 | 93.82 ± 0.37 | 89.35 ± 0.59 | 194.25
B,C,X,V (voting) | 66.87 ± 5.83 | 68.09 ± 7.16 | 91.69 ± 1.08 | 93.89 ± 1.05 | 195.5
C,X,V (voting) | 66.16 ± 6.86 | 69.85 ± 7.85 | 91.28 ± 1.00 | 94.13 ± 1.13 | 195.75
N,X,V (voting) | 69.02 ± 6.65 | 69.43 ± 8.05 | 92.06 ± 1.15 | 92.51 ± 1.25 | 195.75
B,S,E,V (voting) | 81.95 ± 5.61 | 84.58 ± 4.31 | 87.78 ± 1.77 | 89.67 ± 0.47 | 197.75
B,S,X,E,V (voting) | 76.41 ± 5.37 | 74.93 ± 8.83 | 90.30 ± 1.28 | 92.16 ± 1.23 | 198.75
S,N,X,V (voting) | 70.56 ± 7.75 | 70.47 ± 8.68 | 90.32 ± 1.16 | 92.95 ± 1.13 | 203.25
X,E (voting) | 71.61 ± 11.66 | 81.96 ± 10.60 | 89.37 ± 1.31 | 90.93 ± 0.87 | 205.25
S,X,E,V (voting) | 70.25 ± 6.99 | 76.43 ± 5.38 | 89.42 ± 1.28 | 91.88 ± 0.93 | 206.75
B,S,X (voting) | 71.92 ± 4.62 | 66.49 ± 6.41 | 91.45 ± 0.80 | 87.76 ± 3.08 | 207.5
B,N (voting) | 83.15 ± 4.37 | 80.15 ± 3.10 | 79.40 ± 2.40 | 72.50 ± 0.57 | 207.75
B,S (voting) | 81.93 ± 4.20 | 81.35 ± 3.29 | 75.78 ± 1.03 | 73.24 ± 0.58 | 209
B,X,E,V (voting) | 70.95 ± 6.89 | 69.21 ± 7.97 | 89.54 ± 1.21 | 91.43 ± 1.19 | 210.5
N,X,E,V (voting) | 68.87 ± 6.76 | 69.86 ± 8.12 | 89.66 ± 1.33 | 91.36 ± 1.35 | 212
B,N,X,V (voting) | 68.58 ± 6.57 | 66.24 ± 6.70 | 91.29 ± 1.14 | 87.29 ± 2.21 | 212.75
B,X,V (stacking) | 70.50 ± 5.33 | 64.03 ± 6.21 | 90.44 ± 1.00 | 90.38 ± 4.04 | 213.75
B,X (stacking) | 70.83 ± 4.04 | 60.59 ± 3.61 | 90.47 ± 1.05 | 90.37 ± 4.05 | 214
S,X (voting) | 57.41 ± 7.78 | 67.26 ± 7.48 | 90.75 ± 0.92 | 89.92 ± 1.89 | 214.5
X,E,V (voting) | 68.23 ± 6.97 | 69.21 ± 8.04 | 89.54 ± 1.21 | 91.35 ± 1.35 | 215.25
E,V (voting) | 74.34 ± 12.13 | 75.00 ± 15.95 | 87.88 ± 0.66 | 84.67 ± 0.35 | 216.25
B,S,V (voting) | 70.35 ± 5.48 | 70.72 ± 5.98 | 89.86 ± 0.59 | 82.61 ± 0.66 | 216.75
S,X,V (voting) | 67.38 ± 6.69 | 68.14 ± 7.78 | 89.29 ± 0.93 | 91.09 ± 1.25 | 217.25
B,E,V (voting) | 75.49 ± 6.26 | 73.47 ± 6.40 | 82.48 ± 11.33 | 85.58 ± 0.46 | 218.25
N,X (voting) | 55.58 ± 7.94 | 57.30 ± 5.41 | 91.02 ± 1.01 | 88.35 ± 3.40 | 218.5
X,V (stacking) | 52.69 ± 8.05 | 62.61 ± 6.64 | 90.48 ± 1.05 | 90.37 ± 4.05 | 220
B,X,E (voting) | 74.55 ± 5.67 | 67.81 ± 5.57 | 83.67 ± 11.34 | 89.31 ± 2.92 | 220
B,S,X,V (voting) | 68.55 ± 6.62 | 66.71 ± 6.94 | 89.85 ± 0.90 | 87.80 ± 1.42 | 220
C,V (voting) | 60.94 ± 5.08 | 69.60 ± 12.04 | 88.26 ± 0.80 | 88.40 ± 0.35 | 223.25
B,X (voting) | 58.23 ± 7.69 | 55.88 ± 5.30 | 90.48 ± 1.03 | 85.75 ± 3.25 | 224
B,X,V (voting) | 62.96 ± 5.35 | 57.91 ± 4.74 | 89.61 ± 0.91 | 83.13 ± 2.03 | 226.5
B,V (stacking) | 70.38 ± 4.32 | 65.07 ± 6.29 | 87.91 ± 0.66 | 71.16 ± 6.14 | 229
S,V (voting) | 55.94 ± 5.89 | 65.45 ± 11.14 | 88.06 ± 0.73 | 77.51 ± 0.83 | 232
X,V (voting) | 53.58 ± 8.05 | 54.93 ± 11.98 | 88.09 ± 0.70 | 83.93 ± 1.22 | 233.25
N,V (voting) | 53.77 ± 5.75 | 57.35 ± 10.48 | 88.08 ± 0.77 | 75.69 ± 0.54 | 234
B,V (voting) | 58.70 ± 5.85 | 56.35 ± 10.52 | 87.92 ± 0.69 | 73.96 ± 0.41 | 234
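For readability, the ordering used in the appendix tables above can be reproduced with the short sketch below. It is written under the assumption that the Rank column is the mean of the per-dataset ranks of each combination (rank 1 for the highest accuracy, ties averaged); the three combinations and their mean accuracies are taken from Table 4 only to illustrate the call, whereas the ranks reported here are computed over all 240 combinations.

```python
import numpy as np
from scipy.stats import rankdata

# Mean accuracies (Small, Spoofing, Large, CelebA) for a few combinations from Table 4,
# used only to illustrate the ranking step.
accuracies = {
    "S,C,N,E (stacking)": [92.29, 96.77, 98.93, 99.14],
    "C,N,E (stacking)":   [91.18, 96.35, 98.89, 98.94],
    "S,C,N,E (voting)":   [91.92, 96.27, 94.67, 98.02],
}
names = list(accuracies)
acc = np.array([accuracies[n] for n in names])      # shape: (n_combinations, 4 datasets)

# Rank the combinations within each dataset (rank 1 = highest accuracy, ties averaged),
# then average the four per-dataset ranks; in the appendix this is done for 240 combinations.
per_dataset_ranks = np.column_stack([rankdata(-acc[:, j]) for j in range(acc.shape[1])])
mean_rank = per_dataset_ranks.mean(axis=1)

for name, r in sorted(zip(names, mean_rank), key=lambda t: t[1]):
    print(f"{name:22s} mean rank = {r:.2f}")
```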
Table A7. Obtained metrics [%] in a 5-fold cross-validation for different values of the threshold of the stacking ensemble B,S,C,N,X,E,V. The threshold is the probability above which images are considered spoofed. Small dataset.
Threshold | Accuracy | Precision | Recall | F1-Score
0.05 | 93.60 ± 1.75 | 88.25 ± 5.24 | 98.84 ± 1.42 | 93.13 ± 2.55
0.10 | 95.13 ± 2.42 | 90.90 ± 6.86 | 98.84 ± 1.42 | 94.53 ± 3.36
0.15 | 95.74 ± 3.09 | 92.19 ± 7.78 | 98.84 ± 1.42 | 95.19 ± 3.99
0.20 | 96.05 ± 2.81 | 92.66 ± 7.51 | 98.84 ± 1.42 | 95.46 ± 3.87
0.25 | 96.35 ± 2.47 | 93.33 ± 6.43 | 98.84 ± 1.42 | 95.86 ± 3.22
0.30 | 96.65 ± 2.23 | 93.82 ± 6.18 | 98.84 ± 1.42 | 96.14 ± 3.13
0.35 | 96.96 ± 1.93 | 94.55 ± 5.01 | 98.84 ± 1.42 | 96.56 ± 2.46
0.40 | 96.05 ± 2.06 | 94.43 ± 5.19 | 96.73 ± 3.16 | 95.46 ± 3.02
0.45 | 96.04 ± 2.29 | 95.65 ± 3.86 | 95.12 ± 4.18 | 95.33 ± 3.32
0.50 | 96.04 ± 2.29 | 95.65 ± 3.86 | 95.12 ± 4.18 | 95.33 ± 3.32
0.55 | 95.74 ± 2.24 | 95.61 ± 3.87 | 94.52 ± 3.60 | 95.02 ± 3.22
0.60 | 96.04 ± 2.07 | 96.56 ± 2.82 | 94.52 ± 3.60 | 95.48 ± 2.57
0.65 | 95.43 ± 1.67 | 96.56 ± 2.82 | 93.27 ± 3.31 | 94.83 ± 1.99
0.70 | 95.13 ± 2.01 | 96.56 ± 2.82 | 92.71 ± 4.13 | 94.51 ± 2.19
0.75 | 95.13 ± 2.01 | 96.56 ± 2.82 | 92.71 ± 4.13 | 94.51 ± 2.19
0.80 | 95.43 ± 2.15 | 97.13 ± 2.55 | 92.71 ± 4.13 | 94.80 ± 2.39
0.85 | 95.74 ± 2.22 | 98.28 ± 2.25 | 91.71 ± 5.15 | 94.82 ± 3.30
0.90 | 95.13 ± 2.00 | 99.38 ± 1.25 | 89.50 ± 5.88 | 94.06 ± 3.15
0.95 | 93.92 ± 2.53 | 99.33 ± 1.33 | 87.13 ± 5.57 | 92.73 ± 3.15
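The rows of Tables A7–A10 can be generated from the ensemble output with a simple threshold sweep, as in the sketch below. The variable names (spoof_probs, labels) and the random example data are ours; only the decision rule (an image is classified as spoofed when its probability exceeds the threshold) comes from the table captions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def threshold_sweep(spoof_probs, labels, thresholds=np.arange(0.05, 1.0, 0.05)):
    """Compute accuracy/precision/recall/F1 for each decision threshold.

    spoof_probs: ensemble output, probability that an image is spoofed (hypothetical name).
    labels:      ground truth, 1 = spoofed, 0 = authentic.
    """
    rows = []
    for t in thresholds:
        pred = (spoof_probs > t).astype(int)          # spoofed if probability above threshold
        rows.append((round(float(t), 2),
                     accuracy_score(labels, pred),
                     precision_score(labels, pred, zero_division=0),
                     recall_score(labels, pred, zero_division=0),
                     f1_score(labels, pred, zero_division=0)))
    return rows

# Example call with synthetic scores, only to show the signature:
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
p = np.clip(y * 0.7 + rng.normal(0.2, 0.25, size=200), 0, 1)
for t, acc, prec, rec, f1 in threshold_sweep(p, y):
    print(f"{t:.2f}  acc={acc:.3f}  prec={prec:.3f}  rec={rec:.3f}  f1={f1:.3f}")
```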
Table A8. Obtained metrics [%] in a 5-fold cross-validation for different values of the threshold of the stacking ensemble B,S,C,N,X,E,V. The threshold is the probability above which images are considered spoofed. Spoofing dataset.
Threshold | Accuracy | Precision | Recall | F1-Score
0.05 | 97.23 ± 1.77 | 96.56 ± 3.97 | 93.40 ± 7.42 | 94.67 ± 3.38
0.10 | 97.23 ± 1.77 | 95.64 ± 4.12 | 95.49 ± 6.75 | 95.32 ± 3.20
0.15 | 97.60 ± 1.32 | 95.43 ± 3.92 | 96.46 ± 6.10 | 95.74 ± 3.01
0.20 | 97.96 ± 1.12 | 95.52 ± 3.71 | 96.99 ± 5.58 | 96.08 ± 2.86
0.25 | 97.96 ± 1.12 | 95.59 ± 3.56 | 97.34 ± 5.17 | 96.31 ± 2.74
0.30 | 98.20 ± 0.94 | 95.75 ± 3.44 | 97.55 ± 4.83 | 96.51 ± 2.63
0.35 | 98.44 ± 1.05 | 95.98 ± 3.38 | 97.66 ± 4.56 | 96.70 ± 2.56
0.40 | 98.32 ± 1.04 | 96.19 ± 3.30 | 97.68 ± 4.36 | 96.83 ± 2.49
0.45 | 98.44 ± 1.12 | 96.38 ± 3.25 | 97.70 ± 4.18 | 96.94 ± 2.44
0.50 | 98.44 ± 1.12 | 96.54 ± 3.19 | 97.71 ± 4.04 | 97.03 ± 2.39
0.55 | 98.32 ± 1.04 | 96.67 ± 3.14 | 97.70 ± 3.91 | 97.10 ± 2.34
0.60 | 98.20 ± 1.01 | 96.78 ± 3.08 | 97.67 ± 3.78 | 97.14 ± 2.29
0.65 | 98.20 ± 1.27 | 96.89 ± 3.04 | 97.62 ± 3.69 | 97.18 ± 2.26
0.70 | 98.44 ± 1.12 | 97.04 ± 3.01 | 97.57 ± 3.61 | 97.23 ± 2.24
0.75 | 98.44 ± 1.12 | 97.16 ± 2.97 | 97.53 ± 3.54 | 97.28 ± 2.21
0.80 | 98.44 ± 1.12 | 97.28 ± 2.94 | 97.49 ± 3.47 | 97.32 ± 2.18
0.85 | 98.44 ± 1.12 | 97.39 ± 2.90 | 97.44 ± 3.44 | 97.36 ± 2.16
0.90 | 98.32 ± 1.34 | 97.50 ± 2.86 | 97.38 ± 3.44 | 97.38 ± 2.16
0.95 | 97.24 ± 1.12 | 97.59 ± 2.83 | 97.19 ± 3.53 | 97.33 ± 2.15
Table A9. Obtained metrics [%] in a 5-fold cross-validation for different values of the threshold of the stacking ensemble B,S,C,N,X,E,V. The threshold is the probability above which images are considered spoofed. Large Crowd-collected dataset.
Threshold | Accuracy | Precision | Recall | F1-Score
0.05 | 97.97 ± 0.34 | 97.60 ± 2.76 | 97.32 ± 3.49 | 97.40 ± 2.13
0.10 | 98.45 ± 0.29 | 97.64 ± 2.70 | 97.43 ± 3.45 | 97.48 ± 2.11
0.15 | 98.56 ± 0.28 | 97.68 ± 2.65 | 97.54 ± 3.41 | 97.55 ± 2.09
0.20 | 98.70 ± 0.21 | 97.72 ± 2.60 | 97.64 ± 3.37 | 97.62 ± 2.08
0.25 | 98.80 ± 0.16 | 97.76 ± 2.56 | 97.72 ± 3.33 | 97.69 ± 2.06
0.30 | 98.92 ± 0.17 | 97.81 ± 2.52 | 97.80 ± 3.29 | 97.76 ± 2.05
0.35 | 98.95 ± 0.13 | 97.86 ± 2.49 | 97.88 ± 3.25 | 97.82 ± 2.03
0.40 | 98.94 ± 0.14 | 97.90 ± 2.45 | 97.94 ± 3.21 | 97.87 ± 2.02
0.45 | 98.96 ± 0.16 | 97.94 ± 2.42 | 98.00 ± 3.17 | 97.93 ± 2.00
0.50 | 98.92 ± 0.19 | 97.98 ± 2.39 | 98.06 ± 3.13 | 97.98 ± 1.99
0.55 | 98.87 ± 0.21 | 98.02 ± 2.36 | 98.10 ± 3.09 | 98.02 ± 1.97
0.60 | 98.85 ± 0.24 | 98.06 ± 2.33 | 98.15 ± 3.05 | 98.06 ± 1.95
0.65 | 98.80 ± 0.20 | 98.10 ± 2.31 | 98.18 ± 3.01 | 98.10 ± 1.94
0.70 | 98.69 ± 0.22 | 98.13 ± 2.28 | 98.21 ± 2.97 | 98.13 ± 1.92
0.75 | 98.58 ± 0.25 | 98.17 ± 2.26 | 98.24 ± 2.94 | 98.16 ± 1.90
0.80 | 98.49 ± 0.20 | 98.21 ± 2.24 | 98.25 ± 2.90 | 98.19 ± 1.88
0.85 | 98.21 ± 0.33 | 98.24 ± 2.22 | 98.26 ± 2.86 | 98.21 ± 1.86
0.90 | 97.99 ± 0.32 | 98.27 ± 2.20 | 98.26 ± 2.82 | 98.23 ± 1.84
0.95 | 97.37 ± 0.34 | 98.31 ± 2.18 | 98.23 ± 2.79 | 98.24 ± 1.81
Table A10. Obtained metrics [%] in a 5-fold cross-validation for different values of the threshold of the stacking ensemble B,S,C,N,X,E,V. The threshold is the probability above which images are considered spoofed. CelebA dataset.
Threshold | Accuracy | Precision | Recall | F1-Score
0.05 | 98.54 ± 0.10 | 98.27 ± 2.16 | 98.27 ± 2.76 | 98.24 ± 1.79
0.10 | 98.78 ± 0.15 | 98.25 ± 2.14 | 98.31 ± 2.74 | 98.25 ± 1.77
0.15 | 98.96 ± 0.12 | 98.24 ± 2.12 | 98.34 ± 2.71 | 98.26 ± 1.75
0.20 | 99.03 ± 0.13 | 98.24 ± 2.09 | 98.37 ± 2.69 | 98.27 ± 1.73
0.25 | 99.08 ± 0.12 | 98.24 ± 2.07 | 98.40 ± 2.66 | 98.29 ± 1.71
0.30 | 99.12 ± 0.12 | 98.24 ± 2.05 | 98.42 ± 2.64 | 98.30 ± 1.70
0.35 | 99.13 ± 0.11 | 98.25 ± 2.03 | 98.45 ± 2.62 | 98.32 ± 1.68
0.40 | 99.18 ± 0.12 | 98.25 ± 2.00 | 98.47 ± 2.59 | 98.33 ± 1.67
0.45 | 99.22 ± 0.14 | 98.26 ± 1.99 | 98.49 ± 2.57 | 98.35 ± 1.65
0.50 | 99.24 ± 0.12 | 98.28 ± 1.97 | 98.50 ± 2.54 | 98.36 ± 1.64
0.55 | 99.25 ± 0.12 | 98.29 ± 1.95 | 98.52 ± 2.52 | 98.38 ± 1.63
0.60 | 99.27 ± 0.12 | 98.31 ± 1.93 | 98.53 ± 2.50 | 98.39 ± 1.61
0.65 | 99.24 ± 0.13 | 98.33 ± 1.92 | 98.54 ± 2.47 | 98.40 ± 1.60
0.70 | 99.20 ± 0.13 | 98.34 ± 1.91 | 98.54 ± 2.45 | 98.42 ± 1.59
0.75 | 99.10 ± 0.15 | 98.36 ± 1.89 | 98.54 ± 2.43 | 98.43 ± 1.58
0.80 | 99.07 ± 0.16 | 98.38 ± 1.88 | 98.54 ± 2.41 | 98.43 ± 1.56
0.85 | 98.90 ± 0.18 | 98.40 ± 1.87 | 98.53 ± 2.39 | 98.44 ± 1.55
0.90 | 98.71 ± 0.17 | 98.42 ± 1.86 | 98.50 ± 2.37 | 98.44 ± 1.54
0.95 | 98.26 ± 0.17 | 98.45 ± 1.85 | 98.46 ± 2.37 | 98.43 ± 1.53

References

  1. Apple Inc. About Face ID Advanced Technology. 2024. Available online: https://support.apple.com/en-us/102381 (accessed on 14 April 2024).
  2. Bazarevsky, V.; Kartynnik, Y.; Vakunov, A.; Raveendran, K.; Grundmann, M. BlazeFace: Sub-millisecond Neural Face Detection on Mobile GPUs. arXiv 2019, arXiv:1907.05047. [Google Scholar] [CrossRef]
  3. Deng, J.; Guo, J.; Zafeiriou, S. ArcFace: Additive Angular Margin Loss for Deep Face Recognition. arXiv 2018, arXiv:1801.07698. [Google Scholar] [CrossRef]
  4. Deng, J.; Guo, J.; Ververas, E.; Kotsia, I.; Zafeiriou, S. RetinaFace: Single-Shot Multi-Level Face Localisation in the Wild. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 5202–5211. [Google Scholar] [CrossRef]
  5. Timoshenko, D.; Simonchik, K.; Shutov, V.; Zhelezneva, P.; Grishkin, V. Large Crowdcollected Facial Anti-Spoofing Dataset. In Proceedings of the 2019 Computer Science and Information Technologies (CSIT), Chennai, India, 19–20 January 2019; pp. 123–126. [Google Scholar] [CrossRef]
  6. Zhang, Y.; Yin, Z.; Li, Y.; Yin, G.; Yan, J.; Shao, J.; Liu, Z. CelebA-Spoof: Large-Scale Face Anti-Spoofing Dataset with Rich Annotations. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020. [Google Scholar]
  7. Taigman, Y.; Yang, M.; Ranzato, M.A.; Wolf, L. Deepface: Closing the gap to human-level performance in face verification. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014; pp. 1701–1708. [Google Scholar]
  8. Sun, Y.; Wang, X.; Tang, X. Deeply learned face representations are sparse, selective, and robust. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 2892–2900. [Google Scholar]
  9. Balamurali, K. Face Spoof Detection Using VGG-Face Architecture. J. Phys. Conf. Ser. 2021, 1917, 012010. [Google Scholar] [CrossRef]
  10. Liu, W.; Wen, Y.; Yu, Z.; Li, M.; Raj, B.; Song, L. Sphereface: Deep hypersphere embedding for face recognition. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  11. de Freitas Pereira, T.; Anjos, A.; De Martino, J.M.; Marcel, S. Lbp-top based countermeasure against face spoofing attacks. In Proceedings of the ACCV, Daejeon, Republic of Korea, 5–6 November 2012. [Google Scholar]
  12. Boulkenafet, Z.; Komulainen, J.; Hadid, A. Face anti-spoofing based on color texture analysis. In Proceedings of the ICIP, Quebec City, QC, Canada, 27–30 September 2015. [Google Scholar]
  13. Patel, K.; Han, H.; Jain, A.K. Secure face unlock: Spoof detection on smartphones. IEEE Trans. Inf. Forensics Secur. 2016, 11, 2268–2283. [Google Scholar] [CrossRef]
  14. Boulkenafet, Z.; Komulainen, J.; Hadid, A. Face antispoofing using speeded-up robust features and fisher vector encoding. IEEE Signal Process. Lett. 2016, 24, 141–145. [Google Scholar]
  15. Komulainen, J.; Hadid, A.; Pietikäinen, M. Context based face anti-spoofing. In Proceedings of the BTAS, Arlington, VA, USA, 29 September–2 October 2013. [Google Scholar]
  16. Pan, G.; Sun, L.; Wu, Z.; Lao, S. Eyeblink-based anti-spoofing in face recognition from a generic webcamera. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Minneapolis, MN, USA, 18–23 June 2007. [Google Scholar]
  17. Jee, H.K.; Jung, S.U.; Yoo, J.H. Liveness detection for embedded face recognition system. Int. J. Biol. Med. Sci. 2006, 1, 235–238. [Google Scholar]
  18. Li, J.W. Eye blink detection based on multiple gabor response waves. In Proceedings of the ICMLC, Kunming, China, 12–15 July 2008; Volume 5, pp. 852–2856. [Google Scholar]
  19. Ali, A.; Deravi, F.; Hoque, S. Liveness detection using gaze collinearity. In Proceedings of the ICEST, Houston, TX, USA, 25–29 June 2012. [Google Scholar]
  20. Li, X.; Komulainen, J.; Zhao, G.; Yuen, P.C.; Pietikäinen, M. Generalized face anti-spoofing by detecting pulse from face videos. In Proceedings of the ICPR, Cancun, Mexico, 4–8 December 2016. [Google Scholar]
  21. Liu, Y.; Jourabloo, A.; Liu, X. Learning deep models for face anti-spoofing: Binary or auxiliary supervision. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
  22. Lin, B.; Li, X.; Yu, Z.; Zhao, G. Face liveness detection by rppg features and contextual patch-based cnn. In Proceedings of the ICBEA, Stockholm, Sweden, 29–31 May 2019; ACM: New York, NY, USA, 2019. [Google Scholar]
  23. Yu, Z.; Peng, W.; Li, X.; Hong, X.; Zhao, G. Remote heart rate measurement from highly compressed facial videos: An end-to-end deep learning solution with video enhancement. In Proceedings of the ICCV, Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  24. Boulkenafet, Z.; Komulainen, J.; Hadid, A. Face spoofing detection using colour texture analysis. IEEE Trans. Inf. Forensics Secur. 2016, 11, 1818–1830. [Google Scholar] [CrossRef]
  25. Sthevanie, F.; Ramadhani, K.N. Spoofing detection on facial images recognition using LBP and GLCM combination. J. Phys. Conf. Ser. 2018, 971, 012014. [Google Scholar] [CrossRef]
  26. Chen, H.; Hu, G.; Lei, Z.; Chen, Y.; Robertson, N.M.; Li, S.Z. Attention-Based Two-Stream Convolutional Networks for Face Spoofing Detection. IEEE Trans. Inf. Forensics Secur. 2019, 15, 578–593. [Google Scholar] [CrossRef]
  27. Rajeswaran, S.; Kumar, S. Face-Spoof detection system using convolutional neural network. In Proceedings of the International Conference on Recent trends in Electronics, Computing and Communication Engineering (ICRTECC), Chennai, India, 25–26 April 2019. [Google Scholar]
  28. Hashemifard, S.; Akbari, M. A Compact Deep Learning Model for Face Spoofing Detection. arXiv 2021, arXiv:2101.04756v1. [Google Scholar] [CrossRef]
  29. Ab Wahab, K.; Yew, L.W.; Jusoh, N.A. Online Attendance System Using Face Recognition. Eng. Agric. Sci. Technol. J. 2022, 1, 57–61. [Google Scholar] [CrossRef]
  30. Shu, X.; Li, X.; Zuo, X.; Xu, D.; Shi, J. Face spoofing detection based on multi-scale color inversion dual-stream convolutional neural network. Expert Syst. Appl. 2023, 224, 119988. [Google Scholar] [CrossRef]
  31. Bahia, Y.Z.; Meriem, F.; Messaoud, B. Face spoofing detection using Heterogeneous Auto-Similarities of Characteristics. Eng. Appl. Artif. Intell. 2024, 130, 107788. [Google Scholar] [CrossRef]
  32. Zhang, Y.; Wu, Y.; Yin, Z.; Shao, J.; Liu, Z. Robust face anti-spoofing with Dual Probabilistic Modeling. Pattern Recognit. 2025, 172, 111700. [Google Scholar] [CrossRef]
  33. Huang, P.K.; Chong, J.X.; Hsu, M.T.; Hsu, F.Y.; Hsu, C.T. Channel difference transformer for face anti-spoofing. Inf. Sci. 2025, 702, 121904. [Google Scholar] [CrossRef]
  34. Antil, A.; Dhiman, C. Unmasking Deception: A Comprehensive Survey on the Evolution of Face Anti-spoofing Methods. Neurocomputing 2025, 617, 128992. [Google Scholar] [CrossRef]
  35. Face. Spoofing Dataset. 2024. Available online: https://universe.roboflow.com/face-hgc6e/spoofing-wqkzq (accessed on 18 February 2025).
  36. Ma, X.; Geng, Z.; Bie, Z. Depth Estimation from Single Image Using CNN-Residual Network. Semantic Scholar. 2017. Available online: https://cs231n.stanford.edu/reports/2017/pdfs/203.pdf (accessed on 21 July 2021).
  37. Pandey, A. Face Liveness Detection Using Depth Map Prediction. Available online: https://github.com/anand498/Face-Liveness-Detection (accessed on 8 July 2025).
  38. Xiao, J.; Wang, W.; Zhang, L.; Liu, H. A MobileFaceNet-Based Face Anti-Spoofing Algorithm for Low-Quality Images. Electronics 2024, 13, 2801. [Google Scholar] [CrossRef]
  39. Quy, N.D. Face Anti-Spoofing Using MobileNet. Available online: https://github.com/dinhquy94/face-antispoofing-using-mobileNet (accessed on 8 July 2025).
  40. Liu, Z.; Mao, H.; Wu, C.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. arXiv 2022, arXiv:2201.03545. [Google Scholar] [CrossRef]
  41. Tan, M.; Le, Q.V. EfficientNetV2: Smaller Models and Faster Training. arXiv 2021, arXiv:2104.00298. [Google Scholar] [CrossRef]
  42. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar] [CrossRef]
  43. Liu, Z.; Hu, H.; Lin, Y.; Yao, Z.; Xie, Z.; Wei, Y.; Ning, J.; Cao, Y.; Zhang, Z.; Dong, L.; et al. Swin Transformer V2: Scaling Up Capacity and Resolution. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 11999–12009. [Google Scholar] [CrossRef]
  44. Belli, D.; Das, D.; Major, B.; Porikli, F. A Personalized Benchmark for Face Anti-spoofing. In Proceedings of the 2022 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW), Waikoloa, HI, USA, 4–8 January 2022; pp. 338–348. [Google Scholar] [CrossRef]
  45. Doubango AI. Doubango AI 3D Passive Face Liveness Detection. Available online: https://github.com/DoubangoTelecom/FaceLivenessDetection-SDK (accessed on 8 July 2025).
Figure 1. Work time management system. The block of the module presented in this paper is marked in red.
Figure 2. Sample images received by the system. Authentic Image 3 comes from the publicly available Large Crowdcollected Facial Anti-Spoofing Dataset [5]. Spoof Image 1 and Spoof Image 2 come from the publicly available Celeb-A dataset [6]. The remaining images come from our own dataset.
Figure 3. Proposed architecture of the neural network for detecting images of spoofed faces.
Figure 4. Proposed CNN architecture to detect smartphones in images.
Figure 5. Face authenticity detection module.
Figure 6. The input image (left) and the image preprocessed by the bezel-detection algorithm (right).
Figure 7. The output image generated by the bezel-detection algorithm from the image in Figure 6. The red rectangle represents the detected face, while the green rectangles represent the detected bezels.
Figure 8. An example of edges for a spoof image displayed on a smartphone.
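As an illustration only, and not the exact bezel-detection algorithm used in the proposed module, the following OpenCV sketch shows one way to extract long, straight edge segments around a detected face, similar in spirit to the edge map in Figure 8 and the bezel rectangles in Figure 7. The function name, parameters, and thresholds are hypothetical.

```python
import cv2
import numpy as np

def find_candidate_bezels(image_bgr, face_box, min_rel_length=0.6):
    """Illustrative sketch: look for long, nearly horizontal or vertical line segments
    around a detected face, which may indicate the bezel of a device held in front of
    the camera. face_box is (x, y, w, h) of the detected face.
    """
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)                      # edge map, cf. Figure 8
    lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=80,
                            minLineLength=int(min_rel_length * face_box[2]),
                            maxLineGap=10)
    candidates = []
    if lines is not None:
        for x1, y1, x2, y2 in lines[:, 0]:
            angle = abs(np.degrees(np.arctan2(y2 - y1, x2 - x1))) % 180
            # Keep only near-horizontal / near-vertical segments (bezels are straight).
            if angle < 10 or abs(angle - 90) < 10 or angle > 170:
                candidates.append((int(x1), int(y1), int(x2), int(y2)))
    return edges, candidates
```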
Figure 9. The proposed architecture of the neural network used as the final classifier.
Table 1. Details about the proposed neural network architecture for the image analysis module.
Layer (Type: Depth-Idx) | Output Shape | Param #
SpoofDetectionNet | [32, 2] | –
Conv2d: 1-1 | [32, 16, 64, 64] | 448
BatchNorm2d: 1-2 | [32, 16, 64, 64] | 32
Conv2d: 1-3 | [32, 16, 64, 64] | 2320
BatchNorm2d: 1-4 | [32, 16, 64, 64] | 32
MaxPool2d: 1-5 | [32, 16, 32, 32] | –
Dropout: 1-6 | [32, 16, 32, 32] | –
Conv2d: 1-7 | [32, 32, 32, 32] | 4640
BatchNorm2d: 1-8 | [32, 32, 32, 32] | 64
Conv2d: 1-9 | [32, 32, 32, 32] | 9248
BatchNorm2d: 1-10 | [32, 32, 32, 32] | 64
MaxPool2d: 1-11 | [32, 32, 16, 16] | –
Dropout: 1-12 | [32, 32, 16, 16] | –
Flatten: 1-13 | [32, 8192] | –
Linear: 1-14 | [32, 64] | 524,352
BatchNorm1d: 1-15 | [32, 64] | 128
Dropout: 1-16 | [32, 64] | –
Linear: 1-17 | [32, 2] | 130
Total params: 541,458
Trainable params: 541,458
Non-trainable params: 0
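Table 1 fully determines the layer sizes, so the network can be reconstructed in PyTorch as sketched below. The sketch assumes a 3-channel 64 × 64 input and 3 × 3 convolutions with padding 1; the ReLU activations and the dropout rates are not listed in the summary and are our assumptions. The parameter count matches the 541,458 parameters reported above.

```python
import torch
import torch.nn as nn

class SpoofDetectionNet(nn.Module):
    """Sketch of the image-analysis CNN from Table 1 (ReLU placement and dropout
    probabilities are assumptions; layer widths follow the table)."""
    def __init__(self, p_conv=0.25, p_fc=0.5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1),  nn.BatchNorm2d(16), nn.ReLU(inplace=True),
            nn.Conv2d(16, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(inplace=True),
            nn.MaxPool2d(2), nn.Dropout(p_conv),
            nn.Conv2d(16, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(inplace=True),
            nn.MaxPool2d(2), nn.Dropout(p_conv),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, 64), nn.BatchNorm1d(64), nn.ReLU(inplace=True),
            nn.Dropout(p_fc),
            nn.Linear(64, 2),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# Sanity check against Table 1: 541,458 parameters and a [32, 2] output for a 64x64 batch.
model = SpoofDetectionNet()
print(sum(p.numel() for p in model.parameters()))   # 541458
print(model(torch.zeros(32, 3, 64, 64)).shape)      # torch.Size([32, 2])
```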
Table 2. Details about the proposed neural network architecture for the smartphone detection module.
Layer (Type: Depth-Idx) | Output Shape | Param #
BezelDetectionNet | [32, 2] | –
Conv2d: 1-1 | [32, 32, 128, 128] | 896
BatchNorm2d: 1-2 | [32, 32, 128, 128] | 64
MaxPool2d: 1-3 | [32, 32, 64, 64] | –
Conv2d: 1-4 | [32, 64, 64, 64] | 18,496
BatchNorm2d: 1-5 | [32, 64, 64, 64] | 128
MaxPool2d: 1-6 | [32, 64, 32, 32] | –
Conv2d: 1-7 | [32, 128, 32, 32] | 73,856
BatchNorm2d: 1-8 | [32, 128, 32, 32] | 256
MaxPool2d: 1-9 | [32, 128, 16, 16] | –
Conv2d: 1-10 | [32, 256, 16, 16] | 295,168
BatchNorm2d: 1-11 | [32, 256, 16, 16] | 512
MaxPool2d: 1-12 | [32, 256, 8, 8] | –
Linear: 1-13 | [32, 1024] | 16,778,240
Linear: 1-14 | [32, 512] | 524,800
Linear: 1-15 | [32, 2] | 1026
Total params: 17,693,442
Trainable params: 17,693,442
Non-trainable params: 0
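Analogously, Table 2 can be reconstructed in PyTorch as below. The sketch assumes a 3-channel 128 × 128 input, 3 × 3 convolutions with padding 1, and ReLU activations plus a flatten step that are not shown in the summary; the parameter count matches the 17,693,442 parameters reported above.

```python
import torch
import torch.nn as nn

class BezelDetectionNet(nn.Module):
    """Sketch of the smartphone/bezel CNN from Table 2 (activations and the flatten
    step are assumptions; layer widths follow the table)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1),    nn.BatchNorm2d(32),  nn.ReLU(inplace=True), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1),   nn.BatchNorm2d(64),  nn.ReLU(inplace=True), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1),  nn.BatchNorm2d(128), nn.ReLU(inplace=True), nn.MaxPool2d(2),
            nn.Conv2d(128, 256, 3, padding=1), nn.BatchNorm2d(256), nn.ReLU(inplace=True), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256 * 8 * 8, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, 512),         nn.ReLU(inplace=True),
            nn.Linear(512, 2),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = BezelDetectionNet()
print(sum(p.numel() for p in model.parameters()))   # 17693442, as in Table 2
print(model(torch.zeros(32, 3, 128, 128)).shape)    # torch.Size([32, 2])
```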
Table 3. Details about the proposed neural network architecture used as the final classifier.
Layer (Type: Depth-Idx) | Output Shape | Param #
ProbabilityNN | [1, 1] | –
Linear: 1-1 | [1, 10] | 50
Linear: 1-2 | [1, 1] | 11
Sigmoid: 1-3 | [1, 1] | –
Total params: 61
Trainable params: 61
Non-trainable params: 0
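The 50 parameters of the first linear layer in Table 3 imply four inputs, i.e., one spoof probability per base model of a four-member ensemble. A PyTorch sketch consistent with the table is given below; the generalization to other ensemble sizes through n_models is our assumption, and no hidden activation is added because none is listed in the summary.

```python
import torch
import torch.nn as nn

class ProbabilityNN(nn.Module):
    """Sketch of the meta-classifier from Table 3: Linear(n_models, 10) -> Linear(10, 1) -> Sigmoid."""
    def __init__(self, n_models=4, hidden=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_models, hidden),
            nn.Linear(hidden, 1),
            nn.Sigmoid(),
        )

    def forward(self, probs):
        # probs: one spoof probability per base model; output: ensemble spoof probability.
        return self.net(probs)

meta = ProbabilityNN()
print(sum(p.numel() for p in meta.parameters()))    # 61, as in Table 3
print(meta(torch.tensor([[0.1, 0.8, 0.6, 0.9]])))   # one image, four base-model outputs
```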
Table 4. Prediction accuracy [%] of the evaluated methods and ensembles in five runs of a 5-fold cross-validation (a total of 25 training/test pairs). At the bottom, the best-performing ensembles of 3, 4, 5, 6, and 7 models are presented. The complete results of 240 different stacking and voting ensembles are presented in Appendix A.
Our Methods | Small dataset | Spoofing Data | Crowd-collected | CelebA
Bezel (B) | 70.99 ± 4.22 | 57.67 ± 3.05 | 37.63 ± 0.69 | 47.05 ± 0.50
Smartphone CNN (S) | 81.35 ± 3.83 | 86.78 ± 2.63 | 90.27 ± 0.59 | 89.26 ± 0.47
Context (C) | 77.98 ± 4.21 | 91.95 ± 1.86 | 90.47 ± 1.05 | 98.22 ± 0.17
Image analysis CNN (N) | 85.64 ± 4.59 | 91.03 ± 1.98 | 98.20 ± 0.36 | 94.71 ± 0.24
Other Methods | Small dataset | Spoofing Data | Crowd-collected | CelebA
EfficientNetV2 (E) | 83.21 ± 7.91 | 91.48 ± 3.80 | 88.53 ± 0.95 | 91.50 ± 0.44
ConvNeXt (X) | 54.71 ± 6.47 | 56.03 ± 4.60 | 90.47 ± 1.05 | 90.37 ± 4.05
MobileNet | 71.73 ± 2.33 | 73.86 ± 0.89 | 90.83 ± 0.19 | 87.96 ± 0.34
Depth prediction | 53.25 ± 3.33 | 41.27 ± 5.25 | 89.37 ± 7.01 | 85.75 ± 8.27
ViT (V) | 52.52 ± 6.04 | 56.56 ± 10.46 | 87.92 ± 0.65 | 75.49 ± 0.55
Swin Transformer (T) | 50.32 ± 3.43 | 59.48 ± 5.54 | 88.07 ± 0.67 | 58.75 ± 0.32
Ensembles | Small dataset | Spoofing Data | Crowd-collected | CelebA
C,N,E (stacking) | 91.18 ± 3.95 | 96.35 ± 1.69 | 98.89 ± 0.22 | 98.94 ± 0.17
S,C,N,E (stacking) | 92.29 ± 3.16 | 96.77 ± 1.46 | 98.93 ± 0.22 | 99.14 ± 0.11
S,C,N,X,E (stacking) | 92.15 ± 3.36 | 96.85 ± 1.51 | 99.00 ± 0.17 | 99.19 ± 0.10
B,S,C,N,X,E (stacking) | 92.46 ± 3.17 | 96.82 ± 1.54 | 98.99 ± 0.19 | 99.18 ± 0.12
B,S,C,N,X,E,V (stacking) | 92.62 ± 3.05 | 96.88 ± 1.48 | 98.99 ± 0.17 | 99.20 ± 0.12
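The difference between the voting and stacking variants compared in Table 4 can be sketched as follows. The majority-voting rule and the variable names are our assumptions; the stacking path simply feeds the base-model probabilities into a trained meta-classifier such as the ProbabilityNN sketched after Table 3.

```python
import numpy as np
import torch

def vote(probs, threshold=0.5):
    """Voting baseline (assumed majority vote over per-model binary decisions)."""
    decisions = (probs > threshold).astype(int)        # shape: (n_images, n_models)
    return (decisions.mean(axis=1) > 0.5).astype(int)

def stack(probs, meta_model):
    """Stacking: base-model spoof probabilities are passed to the trained meta-classifier."""
    with torch.no_grad():
        spoof_prob = meta_model(torch.as_tensor(probs, dtype=torch.float32)).squeeze(1)
    return (spoof_prob > 0.5).int().numpy()

# probs[i, j] = spoof probability of image i according to base model j (e.g., B, S, C, N).
probs = np.array([[0.10, 0.20, 0.05, 0.15],
                  [0.90, 0.60, 0.80, 0.95]])
print(vote(probs))                                     # e.g., [0 1]
# stack(probs, meta) additionally requires a trained ProbabilityNN instance.
```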
Table 5. Comparison of accuracy [%] obtained for the stacking ensemble consisting of our own methods (B,S,C,N) with and without Platt scaling, using constant parameters and a static random seed.
Method | Small | Spoofing | Crowd | CelebA
Baseline | 90.64 ± 2.77 | 95.37 ± 1.10 | 96.83 ± 0.28 | 98.59 ± 0.38
Platt scaling | 86.35 ± 2.13 | 94.49 ± 1.79 | 97.00 ± 0.66 | 98.64 ± 0.15
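A minimal sketch of the Platt-scaling step compared in Table 5, implemented here with a scikit-learn logistic regression fitted on a model's raw scores; where exactly the calibration is inserted in the pipeline, as well as the variable names and the synthetic data, are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def platt_scale(raw_scores_train, labels_train, raw_scores_test):
    """Platt scaling: fit a 1-D logistic regression mapping raw scores to calibrated
    probabilities, then apply it to new scores."""
    lr = LogisticRegression()
    lr.fit(raw_scores_train.reshape(-1, 1), labels_train)
    return lr.predict_proba(raw_scores_test.reshape(-1, 1))[:, 1]

# Example call with synthetic, uncalibrated scores:
rng = np.random.default_rng(0)
y_train = rng.integers(0, 2, 500)
s_train = y_train + rng.normal(0, 0.7, 500)
s_test = rng.normal(0.5, 0.8, 10)
print(np.round(platt_scale(s_train, y_train, s_test), 3))
```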
Table 6. Obtained results for the commercial system "A Personalized Benchmark for Face Anti-spoofing" for the small dataset (N = 477).
Outcome | Authentic Faces | Spoof Faces
Authentic | 8 | 0
Spoof | 230 | 144
Not detected | 5 | 90