Article

Multi-Domain Feature Alignment for Face Anti-Spoofing

School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang 212013, China
*
Author to whom correspondence should be addressed.
Sensors 2023, 23(8), 4077; https://doi.org/10.3390/s23084077
Submission received: 26 February 2023 / Revised: 6 April 2023 / Accepted: 12 April 2023 / Published: 18 April 2023
(This article belongs to the Special Issue Deep Learning for Computer Vision and Image Processing Sensors)

Abstract

Face anti-spoofing is critical for enhancing the robustness of face recognition systems against presentation attacks. Existing methods predominantly rely on binary classification tasks. Recently, methods based on domain generalization have yielded promising results. However, due to distribution discrepancies between various domains, domain-related differences in the feature space considerably hinder generalization to unseen domains. In this work, we propose a multi-domain feature alignment framework (MADG) that addresses the poor generalization that arises when multiple source domains are scattered in the feature space. Specifically, an adversarial learning process is designed to narrow the differences between domains, aligning the features of multiple source domains and thereby achieving multi-domain alignment. Moreover, to further improve the effectiveness of our proposed framework, we incorporate a multi-directional triplet loss to achieve a higher degree of separation between fake and real faces in the feature space. To evaluate the performance of our method, we conducted extensive experiments on several public datasets. The results demonstrate that our proposed approach outperforms current state-of-the-art methods, thereby validating its effectiveness in face anti-spoofing.

1. Introduction

With the extensive use of deep learning in computer vision, face recognition (FR) [1,2] technology has become increasingly important in daily life, particularly in scenarios that require user identification and authorization. Despite significant progress in FR, these systems remain susceptible to various types of attacks, such as print attacks, replayed video attacks, and 3D mask attacks. To address these challenges, current state-of-the-art research has proposed various methods for face anti-spoofing (FAS) [3,4,5,6]. These methods can be broadly categorized into two groups: hand-crafted feature-based and deep learning feature-based approaches.
Despite the notable achievements of previous face anti-spoofing (FAS) methods in intra-domain testing, their performance deteriorates significantly in cross-domain testing. This is primarily due to the bias introduced by the distinct characteristics of each domain and the inability to address this bias by considering the internal relationships among domains. Consequently, the model generalizes poorly to novel domains. To mitigate this limitation, recent studies have utilized unsupervised-learning-based domain adaptation (DA) techniques to eliminate the domain bias between source and target domains. Nevertheless, the target domain is usually unseen from the perspective of the source domains, and acquiring an adequate amount of target domain data for training is rarely feasible in real-world scenarios.
In response to the weakness of a model’s generalization in the unseen domain, several studies related to domain generalization (DG) have been proposed. Conventional DG [7] proposed a novel multi-adversarial discriminative method to learn a discriminative multi-domain feature space and improve the generalization performance. This method aimed to find a feature space with multiple source domains aligned, assuming that the feature extracted from the unseen domain can map near that feature space, indicating that the model is well generalized over the target domain. However, Conventional DG fails to account for the difference in generalization between real and fake faces. To address this limitation, SSDG [8] proposed an end-to-end single-side method that separates the feature distributions of real and fake faces by pulling all the real faces and pushing the fake faces of different domains, resulting in a higher classification capacity than Conventional DG. However, this method is insufficient in eliminating the interference of domain-related cues in the domain generalization problem. Motivated by the above works, this study aims to align the feature spaces corresponding to multiple source domains while dividing the distribution of fake and real faces. As shown in Figure 1, our proposed method aligns multiple source domains to obtain a more generalizable feature space, outperforming the baseline method in terms of generalization and classification performance.
Based on the aforementioned considerations and an analysis of the feature space of the source domains, we found that differences in features between multiple source domains significantly interfere with the generalization effectiveness of face anti-spoofing (FAS) in an unseen feature space. To address this issue, we introduce a novel framework for domain generalization in this work, named the Multi-domain Feature Alignment Domain Generalization (MADG) framework. Specifically, the proposed framework utilizes feature generators to produce real and fake face features in each domain, which are subsequently aligned using a multi-domain feature alignment method that employs a loss function minimizing the margin disparity discrepancy between multiple source domains. To optimize this loss function, we use multiple adversarial learning processes. The feature alignment process tackles the challenges associated with generalization, while the employment of multi-directional triplet mining strengthens class boundaries to further enhance the classification performance.
The present work makes four primary contributions:
  • We propose a novel source-domain-alignment domain generalization algorithm that utilizes margin disparity discrepancy and adversarial learning. This approach is designed to improve the generalization performance of the FAS model significantly.
  • In the context of multi-domain problems, we devise two alignment strategies and modularize the alignment process. The experimental findings demonstrate that multi-domain alignment surpasses cross-domain alignment in terms of generalization performance, rendering it more advantageous in multi-domain scenarios.
  • We combine the new algorithm with multi-directional triplet mining and analyze the source domain feature space. This results in a novel multi-domain feature alignment framework (MADG), which improves the classification accuracy of the FAS model.
  • Extensive experiments and comparisons have been conducted, demonstrating that our proposed approach achieves state-of-the-art performance on most protocols.
The rest of this article is structured into four main sections. In the second section, we provide a detailed introduction to the related work, including mainstream research on face anti-spoofing, domain generalization, and multi-domain learning. We also discuss some works that have provided inspiration for this study. In the third section, we present the proposed method, including a comprehensive introduction to the new alignment method and the domain generalization framework developed in this study. The fourth section details the experiments conducted to evaluate the proposed method and presents a thorough analysis of the experimental results. Finally, in the fifth section, we summarize our work and give an outlook for future research directions in this area.

2. Related Work

2.1. Face Anti-Spoofing Methods

There are two principal categories of conventional face anti-spoofing (FAS) techniques: appearance-based and temporal-based methods. Appearance-based methods involve the extraction of hand-crafted features for classification, such as local binary patterns (LBPs) [9,10] and the scale-invariant feature transform (SIFT) [11]. In contrast, temporal-based FAS methods detect attack faces by extracting temporal cues from a sequence of consecutive face frames. Mouth movement detection [12] and eye blink detection [13,14] are examples of the earliest dynamic texture-detection methods. However, these methods do not generalize well to cross-dataset testing scenarios due to the dissimilarities in feature spaces among diverse domains, which often lead to feature bias during generalization.
Recently, deep neural networks, specifically convolutional neural networks (CNNs), have gained widespread adoption in computer vision tasks and have been extensively applied to FAS. Yang et al. [15] were the pioneers in utilizing binary classification CNNs for FAS. Jourabloo et al. [16] proposed a face de-spoofing technique that performs fake face classification by reverse decomposition into real faces and spoof noise. Liu et al. [17] presented a CNN-RNN network that combines both appearance and temporal cues to detect spoof attacks using remote photoplethysmography (rPPG) signals. Similarly, 3D mask attack detection methods [18,19] also exploit rPPG information. Yang et al. [20] combined temporal and appearance cues to distinguish fake faces from real ones. Roy et al. [21] investigated frame-level FAS to enhance biometric authentication security against face-spoofing attacks. More recently, deep neural networks have been applied to FAS [4,7,8,22,23,24], achieving superior performance compared to conventional methods [15,25,26,27].
In conclusion, traditional FAS methods include appearance-based and temporal-based methods, which extract hand-crafted features and mine temporal cues, respectively. However, deep neural networks, especially CNNs, have achieved state-of-the-art performance in FAS by combining the appearance and temporal cues, using techniques such as reverse decomposition and frame-level FAS for improved biometric authentication security.

2.2. Multi-Domain Learning

The use of multiple datasets has recently sparked research interest in multi-domain processing. In particular, the research community has amassed several large-scale FAS datasets with rich annotations [28,29,30,31]. Our work on multi-source domain processing shares similarities with domain adaptation (DA) methods [32,33,34,35,36,37,38,39] that require a retrained model to perform well on both source and target domain data. Specifically, Zhang et al. [37] introduced the concept of margin disparity discrepancy to characterize the differences between source and target domains, which has inspired our work. Ariza et al. [40] conducted a comparative study of several classification methods, which informed our experimental design. Additionally, Liu et al. [41] proposed the YOLOv3-FDL model for successful small crack detection from GPR images using a four-scale detection layer. Notably, the common approaches of Mancini et al. [34] and Rebuffi et al. [35] employ the ResNet [42] architecture, which offers benefits over architectures such as VGG [43] and AlexNet [44] by increasing abstraction through convolutional layers. Yang et al. [45] recently enriched FAS datasets from a different perspective to achieve multi-domain training, while Guo et al. [46] proposed a novel multi-domain model that overcomes the forgetting problem when learning new domain samples and exhibits high adaptability.

2.3. Domain Generalization

Domain adaptation (DA) and domain generalization (DG) are two fundamental methods used in FAS research. While the DG method mines the relationships among multiple domains, the DA method aims to adapt the model to a target domain. In this work, we propose a novel domain generalization method that introduces a new loss function inspired by the work of Motiian et al. [47], which encourages feature extraction in similar classes. To align multiple source domains for generalization, previous works such as Ghifary et al. [48] and Li et al. [49] have proposed autoencoder-based approaches. Our method follows a similar approach of learning a shared feature space across multiple source domains that can generalize to the target domain. Previous works such as Shao et al. [7], Saha et al. [50], Jia et al. [8], and Kim et al. [51] have also attempted to achieve this goal. Among them, the single-side adversarial learning method proposed in SSDG [8] is the work most related to ours. However, this end-to-end approach overlooks the relationships between different domains. To address the overfitting and generalization problems of adversarial generative networks, Li et al. [5] proposed a multi-channel convolutional neural network (MCCNN). Additionally, meta-learning formulations [52,53,54] have been utilized to simulate the domain shift during training to learn a representative feature space. However, recent works such as Wang et al. [6] have not adopted a domain-alignment approach but have instead increased the diversity of labeled data by reassembling different styles and content features in their SSAN method.

3. Proposed Approach

3.1. Overview

The proposed method aims to enhance the generalization ability of face anti-spoofing (FAS) by improving feature space generalization. Generalization to unseen domains is challenging, as samples from the target domain are usually not available during training. However, similarities exist between the features of the source domains and the target domain, and our approach aims to identify these similarities by aligning multiple source domains. To this end, we propose a multi-domain feature alignment framework called MADG, as shown in Figure 2. The feature generator extracts features that are cross-domain aligned using an adversarial learning process called cross-domain alignment (CDA), and multiple CDA modules are combined to form a multi-domain alignment method. Moreover, we introduce a multi-directional triplet mining process to better separate the distribution differences between real and fake faces in the feature space. Thus, MADG can consider the distribution differences between real and fake faces while aligning the feature space of multiple source domains, leading to improved classification performance in new, unseen domains.

3.2. Multi-Domain Feature Alignment

This study considers N source domains, each with a set of training samples denoted by $X = \{X_1, X_2, \ldots, X_N\}$, and their corresponding labels, $Y$. In the context of the face anti-spoofing task, the labels correspond to $Y_r$ and $Y_f$ for real and fake faces, respectively. However, due to significant distribution gaps between different domains, it is necessary to minimize the disparity in feature space. To address this challenge, we propose a multi-domain feature alignment method that consists of two key components: feature generators for generating features and multiple weighted adversarial learning modules combined for multi-domain feature alignment.
Pre-alignment Feature Generator and Alignment Strategies. We designed pre-alignment feature generators to transform sample data into features, as follows:
$$T_m = G_m(X_m), \qquad T_s = G_s(X_s),$$
The features extracted from mixed samples of real and fake faces are denoted as $T_m$, and their corresponding feature generator as $G_m$. Similarly, $T_s$ and $G_s$ denote the features and generators for samples containing only real or only fake faces. These features are known to contain a significant number of domain-relevant clues, which may adversely affect the generalization performance. Before feature alignment, two alignment strategies are devised for the features extracted from real and fake faces as well as mixed samples. In the first strategy, real and fake faces from each domain are independently fed into the respective feature generators. In contrast, the second strategy involves feeding a mixture of real and fake faces from each domain into the feature generator. The two frameworks resulting from these strategies are referred to as MADG-S (single-sided alignment MADG) and MADG-M (mixed alignment MADG), respectively. We compare the outcomes of these two strategies in Section 4.3.4 to determine the optimal method.
Cross-domain Alignment Module. To enhance the generalization ability of the feature space for the face anti-spoofing task, it is crucial to reduce the presence of domain-relevant cues in the extracted features. Aligning the features of multiple domains into a common feature space effectively reduces deviations from any specific domain, leading to better performance on unseen domains. This insight is akin to feature alignment techniques employed in domain adaptation.
With the aim of reducing the presence of domain-relevant cues and improving generalization performance, we propose a cross-domain feature alignment module to align the features of different domains. There is typically a substantial gap between the features of different domains, characterized by domain-relevant clues, which can be quantified using measures such as the margin disparity discrepancy (MDD) proposed by Zhang et al. [37]. By progressively reducing this gap, we can obtain an aligned feature space in which domain-specific characteristics become less pronounced. Specifically, our method considers two source domains, S1 and S2, and leverages the cross-domain alignment module to achieve this goal. MDD has the following key property:
$$err_{S_2}(h) \le err_{S_1}^{(\rho)}(h) + \mathcal{D}_{f,\mathcal{F}}^{(\rho)}(S_1, S_2) + \lambda,$$
where $h \in \mathcal{H}$ denotes a label classifier and $err$ represents the error rate of the classifier, while $err^{(\rho)}$ is the margin error and $\rho$ is the corresponding margin. The scoring function and hypothesis set are denoted by $f$ and $\mathcal{F}$, respectively. $\mathcal{D}$ denotes the margin disparity discrepancy, and $\lambda = \lambda(\rho, \mathcal{F}, P, Q)$ is the ideal combined margin loss. The method proposed by Zhang et al. [37] is a domain adaptation method, and $err$ in the above equation can be regarded as the generalization error in domain adaptation. From this inequality, we can see that the generalization error of $S_1$ on $S_2$ is upper-bounded. According to this property, we only need to optimize this upper bound to reduce the error and achieve feature alignment. To this end, we designed an adversarial learning module to align domain features, as depicted in Figure 3. The objective is to minimize the MDD, and this process can be formulated as a minimax game, expressed as follows:
$$\min_{f,\psi}\; \varepsilon(\hat{S}_1) + \eta\, \mathcal{D}_\gamma(\hat{S}_1, \hat{S}_2), \qquad \max_{f'}\; \mathcal{D}_\gamma(\hat{S}_1, \hat{S}_2),$$
where $\hat{S}_1$ and $\hat{S}_2$ are samples drawn independently from the distributions $S_1$ and $S_2$, respectively. To strengthen the minimization process, we introduce a feature extractor $\psi$, and $\varepsilon$ denotes the cross-entropy loss. $f$ and $f'$ are classifiers that share the same hypothesis space, and $\gamma$ denotes the margin factor related to the margin of $f$. Through this adversarial learning process, we obtain the minimized cross-domain alignment discrepancy $cda_{p,q}$ between two domains:
$$\min_G cda_{p,q} = \min_{f,\psi} \max_{f'} M(S_1, S_2),$$
where $M$ denotes the adversarial learning function.
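To make the adversarial alignment step concrete, the following minimal PyTorch sketch illustrates one way the cross-domain alignment term between two source domains could be estimated with a main classifier $f$ and an adversarial classifier $f'$, in the spirit of the MDD formulation of Zhang et al. [37]. The function and variable names (cda_loss, clf_main, clf_adv, gamma) are illustrative assumptions and not taken from our implementation.

```python
import torch
import torch.nn.functional as F

def cda_loss(feat_s1, feat_s2, clf_main, clf_adv, gamma=4.0):
    """Hedged sketch: estimate the margin disparity discrepancy between two
    source domains. clf_main (f) and clf_adv (f') share the same hypothesis
    space; clf_adv is trained to maximize the discrepancy (via a gradient
    reversal layer placed before it in the full model), while the feature
    generator minimizes it."""
    logits_s1, logits_adv_s1 = clf_main(feat_s1), clf_adv(feat_s1)
    logits_s2, logits_adv_s2 = clf_main(feat_s2), clf_adv(feat_s2)

    # Pseudo-labels from the main classifier f.
    pseudo_s1 = logits_s1.argmax(dim=1).detach()
    pseudo_s2 = logits_s2.argmax(dim=1).detach()

    # On S1, the adversarial head should agree with f (margin-scaled CE);
    # on S2, it should disagree (log(1 - p) of f's predicted class).
    loss_s1 = gamma * F.cross_entropy(logits_adv_s1, pseudo_s1)
    prob_s2 = F.softmax(logits_adv_s2, dim=1)
    p_agree = prob_s2.gather(1, pseudo_s2.unsqueeze(1)).squeeze(1)
    loss_s2 = -torch.log(torch.clamp(1.0 - p_agree, min=1e-6)).mean()

    return loss_s1 + loss_s2
```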
Multi-domain Alignment Loss. A multi-domain alignment loss is designed by combining all the obtained cross-domain discrepancy values $m_{i,j}$, denoted as:
$$L_{MdAlign} = \frac{2}{N(N-1)} \sum_{1 \le i < j \le N} \lambda_{i,j}\, m_{i,j},$$
where $\lambda_{i,j}$ controls the weight of the alignment of each domain pair. Given that the disparity between various domains differs, choosing appropriate values for $\lambda$ can significantly affect the overall alignment and, therefore, the generalization of the network. We adjusted the value of $\lambda$ on a specific task protocol (O&M&I to C) for comparison; see Section 4.3.2 for details.
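As a sketch of how the pairwise terms could be combined, the snippet below averages the weighted cross-domain discrepancies over all domain pairs, reusing the cda_loss sketch from above; the container names (domain_feats, weights) are illustrative assumptions.

```python
from itertools import combinations

def multi_domain_alignment_loss(domain_feats, clf_main, clf_adv, weights=None):
    """Hedged sketch of the multi-domain alignment loss: a weighted average of
    the pairwise cross-domain discrepancies m_{i,j} over all N(N-1)/2 pairs."""
    n = len(domain_feats)
    weights = weights or {}
    total = 0.0
    for i, j in combinations(range(n), 2):
        m_ij = cda_loss(domain_feats[i], domain_feats[j], clf_main, clf_adv)
        total = total + weights.get((i, j), 1.0) * m_ij  # lambda_{i,j} defaults to 1
    return 2.0 / (n * (n - 1)) * total
```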

3.3. Multi-Directional Triplet Mining

In the context of different domains’ feature spaces, the disparity between real and fake faces is often less pronounced within domains than across them. This is due to the relatively strong similarities between the attack samples corresponding to each real sample, which can arise from variations in data collection techniques and attack methods. Our framework aims to learn a compact feature space for each class. Nonetheless, real faces are easily scattered across the feature space. To overcome this challenge, we partition the feature spaces of real and fake faces and seek to compress all real faces into a compact feature space. To this end, we utilize the multi-directional triplet loss proposed in SSDG [8]. This loss enables us to mine triplet relationships between features, which improves the performance of the generalized feature space. The optimization of the loss is expressed as follows:
$$\min_G L_{Trip}(G) = \sum_{x_i^a,\, x_i^p,\, x_i^n} \left( \left\| f(x_i^a) - f(x_i^p) \right\|_2^2 - \left\| f(x_i^a) - f(x_i^n) \right\|_2^2 + \alpha \right),$$
where $\alpha$ is a predefined margin, $x_i^a$ is the anchor, $x_i^p$ is the positive example, and $x_i^n$ is the negative example.
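For illustration only, the sketch below shows a simple batch-hard variant of such a triplet term over features pooled from all source domains; the exact mining strategy of SSDG may differ, and the names (features, labels, margin) are assumptions.

```python
import torch
import torch.nn.functional as F

def triplet_loss(features, labels, margin=0.1):
    """Hedged sketch of a batch-hard triplet term: for every anchor, take the
    farthest same-class sample as positive and the closest other-class sample
    as negative, assuming the batch mixes real (1) and fake (0) faces."""
    features = F.normalize(features, dim=1)
    dist = torch.cdist(features, features, p=2).pow(2)  # squared L2 distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    eye = torch.eye(len(labels), dtype=torch.bool, device=labels.device)

    hardest_pos = dist.masked_fill(~same | eye, float('-inf')).max(dim=1).values
    hardest_neg = dist.masked_fill(same, float('inf')).min(dim=1).values
    return F.relu(hardest_pos - hardest_neg + margin).mean()
```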

3.4. Loss Function

As shown in Figure 2, our framework employs a face anti-spoofing classifier that combines triplet feature generators after aligning all source domains. The final loss to optimize is a cross-entropy loss based on a hybrid domain with multi-domain alignment and category separation by triplet mining, denoted as L c l s . Once all components are integrated, the proposed multi-domain feature alignment framework can be obtained as follows:
$$L_{MADG} = L_{cls} + \lambda_1 L_{MdAlign} + \lambda_2 L_{Trip},$$
where the weighting factors $\lambda_1$ and $\lambda_2$ adjust the balance between the process of aligning multiple domains and that of separating subclasses between different domains. The values of these factors govern the relative importance of each objective in the overall optimization process.
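A minimal sketch of assembling this objective in a training step is given below; the helper terms l_mdalign and l_trip stand for the alignment and triplet losses computed as sketched earlier, and the default weights follow the values reported in Section 4.2.3.

```python
import torch.nn.functional as F

def madg_objective(cls_logits, labels, l_mdalign, l_trip,
                   lambda1=30.0, lambda2=1.0):
    """Hedged sketch of the overall MADG objective; lambda1 = 30 and
    lambda2 = 1 match the training settings in Section 4.2.3."""
    l_cls = F.cross_entropy(cls_logits, labels)  # classification on the hybrid domain
    return l_cls + lambda1 * l_mdalign + lambda2 * l_trip
```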

4. Experiments

4.1. Experimental Setting

4.1.1. Databases and Protocols

To assess the effectiveness of the proposed approach, we conducted experiments on four publicly available face anti-spoofing datasets; namely, Idiap Replay-Attack [28] (referred to as I), CASIA-FASD [29] (referred to as C), MSU-MFSD [30] (referred to as M), and OULU-NPU [31] (referred to as O). Real and attack face samples from these datasets are shown in Figure 4. Notably, the datasets have a significantly different composition, with variations in display devices, attack types, illumination, and background complexity. For instance, the display devices in I are the iPhone 3Gs and iPad, while C uses the iPad, M utilizes the iPad Air and iPhone 5S, and O deploys the Dell 1905FP and Macbook Retina. Moreover, the attack types vary across the datasets, with I and O containing replayed photo samples, C having cut photo samples, and all datasets including printed photos and replayed videos. The datasets also exhibit differences in illumination, with I and O having extra light, while C and M do not. Additionally, the datasets have varying levels of background complexity, with I, C, and M having a complex background, while O has a less complex background. These differences contribute to significant inter-domain differences in the feature space of the datasets, making them suitable for evaluating the efficacy of the proposed approach.
Therefore, to provide a comprehensive evaluation, we employed the leave-one-out strategy, where one of the databases was designated as the test set and the remaining three were used as the training set. Thus, four task protocols were established: O&C&I to M, O&M&I to C, O&C&M to I, and I&C&M to O. However, to further examine the method’s effectiveness under various multi-domain scenarios, additional protocols need to be designed for controlled experiments on the method’s performance. Thus, we extended our experiments with protocols that use only two of the three available source domains for training, resulting in a total of 12 additional task protocols. This comprehensive approach enabled us to analyze and compare the proposed method’s performance in various multi-domain scenarios.

4.1.2. Evaluation Metrics

In order to assess the effectiveness of our model in achieving the desired outcomes, it is crucial to evaluate its performance. To this end, we follow an evaluation approach similar to that of SSDG [8], the most relevant work in this field. The two evaluation metrics we use to assess the performance of our model are the HTER and AUC. The HTER is calculated as half of the sum of the FAR and FRR, which represent the rates of falsely accepting an attack and falsely rejecting a genuine user, respectively. This metric is widely used in liveness detection to measure the model’s ability to distinguish between live and fake faces. The AUC, in contrast, is a commonly used evaluation metric in classification tasks that measures the area under the ROC curve. The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) for different classification thresholds. A higher AUC value indicates better classification performance, as the model can achieve a high TPR while maintaining a low FPR. By utilizing these metrics, we can obtain a comprehensive evaluation of the performance of our model and compare it with other state-of-the-art methods.
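For reference, the snippet below computes the two metrics from prediction scores; evaluating the HTER at the equal-error-rate threshold is one common convention and is an assumption here, since thresholding protocols vary across papers.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def hter_and_auc(labels, scores):
    """Hedged sketch: labels are 1 for real faces and 0 for attacks; scores
    are the predicted 'realness'. HTER is taken at the FAR = FRR point."""
    auc = roc_auc_score(labels, scores)
    fpr, tpr, _ = roc_curve(labels, scores)
    frr = 1.0 - tpr          # false rejection rate of real faces
    far = fpr                # false acceptance rate of attacks
    idx = np.nanargmin(np.abs(far - frr))
    hter = (far[idx] + frr[idx]) / 2.0
    return hter, auc
```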

4.2. Implementation Details

4.2.1. Data Preprocessing

To prepare the datasets for our experiments, it was necessary to preprocess both real and fake face images from the original video data. The preprocessing involved extracting frames at random intervals from each video clip and applying the MTCNN [55] method to identify and crop the faces in the resulting images. The cropped faces were then rotated and normalized to create input data of size 256 × 256 × 3 . This process ensured that the RGB channel was the sole source of information used during training, which simplified the framework’s complexity. The preprocessing step was critical for enabling our model to accurately learn from both the real and fake faces and was essential for ensuring the robustness and reliability of our experimental results.
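A hedged sketch of this preprocessing pipeline is shown below, using the facenet-pytorch implementation of MTCNN as one publicly available option; the frame-sampling interval and library choice are illustrative assumptions rather than details taken from our implementation.

```python
import cv2
from PIL import Image
from facenet_pytorch import MTCNN  # one publicly available MTCNN implementation

# Illustrative preprocessing: sample frames from a video, detect and crop the
# face with MTCNN, and return 256x256 RGB face crops as tensors.
detector = MTCNN(image_size=256, margin=0, post_process=False)

def extract_faces(video_path, every_n_frames=30):
    cap = cv2.VideoCapture(video_path)
    faces, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n_frames == 0:
            rgb = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            face = detector(rgb)  # 3 x 256 x 256 tensor, or None if no face found
            if face is not None:
                faces.append(face)
        idx += 1
    cap.release()
    return faces
```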

4.2.2. Network Structure

Our framework was implemented in PyTorch. The details of our network structure are shown in Table 1, Table 2 and Table 3.
Details of Feature Generator. To construct the feature generator, we employed ResNet-18 [42] as the backbone. The feature generator was designed by following the ResNet architecture, which includes a convolutional layer with a 7 × 7 convolution kernel and a max pooling layer at the head. The remaining convolutional layers utilize 3 × 3 kernels and are accompanied by a batch normalization (BN) layer to improve the stability and speed of network convergence. Some of the convolutional layers use a rectified linear unit (ReLU) activation function, as indicated in the table by conv2, while conv1 lacks the ReLU layer. We chose the ResNet-18 model due to its ability to strike a balance between performance and computational complexity. By incorporating ResNet-18, our feature generator can effectively learn and extract high-level features from facial images.
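As a rough illustration of this design, the sketch below builds a feature extractor from the torchvision ResNet-18 by dropping its pooling and classification head; the number of retained stages and the input size are assumptions for illustration, not the exact configuration of Table 1.

```python
import torch.nn as nn
from torchvision.models import resnet18

class FeatureGenerator(nn.Module):
    """Hedged sketch of a ResNet-18-based feature generator: the initial 7x7
    conv and max pooling, followed by residual stages of 3x3 convolutions with
    batch normalization, without the global pooling and FC head."""
    def __init__(self, pretrained=False):
        super().__init__()
        backbone = resnet18(pretrained=pretrained)
        self.features = nn.Sequential(*list(backbone.children())[:-2])

    def forward(self, x):            # x: B x 3 x 256 x 256
        return self.features(x)      # B x 512 x 8 x 8 feature maps
```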
Details of Feature Embedder and Classifier. The residual block utilized in this section features the same structure as the one employed in the feature generator. The classifier comprises a linear model and fully connected layers, with a bottleneck fully connected layer added prior to the classifier.
Details of Alignment Classifier. An adversarial learning process is employed in this work to align the source domains, with a network functioning as an alignment classifier. The classifier is composed of multiple components, including a bottleneck layer, a head layer, an adversarial head layer, and a gradient reversal layer (GRL). The bottleneck and head layers consist of fully connected layers, as do the adversarial head layers. During backpropagation, the GRL layer inverts the gradients, enabling the network to learn domain-invariant features.
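The gradient reversal layer mentioned above is a standard construction; a minimal PyTorch sketch is given below, in which the forward pass is the identity and the backward pass multiplies gradients by a negative factor alpha (the factor name and default value are assumptions).

```python
import torch

class GradientReversal(torch.autograd.Function):
    """Minimal gradient reversal layer: identity in the forward pass, gradient
    scaled by -alpha in the backward pass, so the adversarial head's max step
    is realized within a single backward pass."""
    @staticmethod
    def forward(ctx, x, alpha=1.0):
        ctx.alpha = alpha
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.alpha * grad_output, None

def grl(x, alpha=1.0):
    return GradientReversal.apply(x, alpha)
```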

4.2.3. Training Setting

The optimization of the framework was performed using stochastic gradient descent (SGD) with a momentum of 0.9 and a weight decay of $5 \times 10^{-4}$. The initial learning rate was set to $1 \times 10^{-2}$ and gradually decreased to $1 \times 10^{-5}$. To accommodate the limited GPU memory, a batch size of 10 was used for each domain, resulting in a batch size of 30 across the three domains during training. We set the values of $\lambda_1$ and $\lambda_2$ to 30 and 1, respectively. Furthermore, the $\lambda_{i,j}$ values used to constrain the multi-domain alignment process were all set to 1. These weights are defined in Section 3.
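A minimal sketch of this optimizer configuration follows; the cosine decay schedule and the placeholder model and iteration count are assumptions, since only the start and end learning rates are stated above.

```python
import torch
import torch.nn as nn

# Hedged sketch of the training setup: SGD with momentum 0.9, weight decay
# 5e-4, learning rate decayed from 1e-2 to 1e-5.
model = nn.Linear(512, 2)          # stand-in for the full MADG network
num_iterations = 4000              # hypothetical number of training iterations

optimizer = torch.optim.SGD(model.parameters(), lr=1e-2,
                            momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=num_iterations, eta_min=1e-5)
```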

4.2.4. Testing Setting

During the testing phase, the performance of the trained model was assessed on new test samples $x$ by feeding them into the model for classification. The classification outcome is denoted as $l$ and obtained via $l = G(F(x))$, where $G$ represents the trained model and $F$ refers to the feature generator utilized during training. The classification results on the testing set were used to evaluate the model’s generalization ability, which is a crucial factor for machine learning models. Furthermore, the testing results can be utilized to compare the performance of our proposed method with currently available state-of-the-art methods.
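A small sketch of this testing step is shown below; feature_generator and classifier stand for the trained modules $F$ and $G$, and the names are illustrative.

```python
import torch

@torch.no_grad()
def predict(x, feature_generator, classifier):
    """Hedged sketch of inference l = G(F(x)) on a batch of test samples."""
    feature_generator.eval()
    classifier.eval()
    feats = feature_generator(x)        # F(x)
    logits = classifier(feats)          # G(F(x))
    return logits.softmax(dim=1)[:, 1]  # probability of the "real" class
```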

4.3. Discussion

4.3.1. Ablation Study

To assess the individual contributions of each component in our proposed framework, we performed various ablation experiments. We report the findings of these experiments in Table 4. Our proposed method is denoted as MADG, which includes the multi-domain feature alignment loss ($L_{MdAlign}$), the multi-directional triplet loss ($L_{Trip}$), and the cross-entropy loss.
The outcomes presented in Table 4 highlight the substantial influence of each constituent of our proposed framework. Ablating any part of the framework led to a conspicuous deterioration in classification performance, which affirms the effectiveness of each component. Specifically, the experiment reported without $L_{Trip}$ corresponds to deactivating the triplet loss during both forward and backward propagation, while the experiment reported without $L_{MdAlign}$ deactivates the multi-domain alignment component.

4.3.2. Comparisons of Different Alignment Weights

The cross-domain feature alignment module is a fundamental component of our proposed multi-domain feature alignment method. By integrating this module into the training process, we can effectively align the features of two source domains. Our multi-domain alignment strategy involves aligning multiple source domains in pairs and controlling these domain pairs by weight. Furthermore, we align multiple source domains simultaneously. For instance, in the scenario involving three source domains, denoted as O, M, and I, we simultaneously align O&M, O&I, and M&I, with each alignment procedure constrained by a weight. To assess the impact of varying weights on the model’s generalization performance, we conducted a series of experiments using the O&M&I to C task protocol. The weights were adjusted using a control variable method, where only one weight λ was varied while the others were held constant at 1. The experimental results, depicted in Figure 5, reveal the impact of the weights on the model’s performance.
By adjusting the weights assigned to the alignment process of each pair of domains, we can evaluate the impact of each domain on both the overall generalization performance and the inter-domain feature characteristics. As depicted in Figure 5, the alignment process of the O&I domain pair exhibits the most significant impact on the generalization performance. Specifically, as the weight of the O&I alignment process ($\lambda_{OI}$) increases, the generalization performance notably decreases, indicating a significant dissimilarity between the domain-relevant features of O&I and M. In contrast, the optimal generalization performance can be achieved by increasing the weight of the O&M alignment process ($\lambda_{OM}$).

4.3.3. Experiments on Limited Source Domains

Limited Source Domain Protocols. In addition to the leave-one-out experiment, we performed experiments in which we trained the model using a subset of the available source domains. More specifically, we selected two of the three available source domains and used them as the training set. This approach yielded 12 distinct task protocols: O&M to C, O&I to C, M&I to C, C&M to O, C&I to O, M&I to O, C&O to M, C&I to M, I&O to M, C&M to I, C&O to I, and O&M to I. The experimental results of SSDG (baseline), MADG-M (ours), and MADG-S (ours) on these 12 task protocols are shown in Table 5, where the best result for each protocol is shown in bold.
Analysis of Limited Source Domain Experiments. The impact of training set diversity and data quantity on the generalization performance of the proposed framework is demonstrated in Table 4 and Table 5. As indicated in Table 5, the generalization performance of the two-source domains is generally lower than that of the three-source domains. Additionally, significant variations in the experimental results of the 12 task protocols are observed (Table 5), which is likely due to differences in data characteristics across domains. Therefore, the domain-relevant clues between the source and target domains may have a lesser effect on some task protocols, yet still attain good generalization performance. Furthermore, a comparison with other state-of-the-art methods on limited source domains is presented in Section 4.4.1. Moreover, Table 5 shows that the limited source domain task protocol is impacted differently by various alignment strategies. The subsequent Section 4.3.4 provides further comparisons of these alignment strategies.

4.3.4. Comparison of Different Alignment Strategies

To provide a comprehensive comparison of the generalization effects of the two alignment strategies implemented in the MADG-M and MADG-S frameworks, we conducted experiments on 12 limited source domain task protocols and 4 leave-one-out task protocols. To ensure a fair evaluation, we compared all of these protocols with the baseline method. The results of the limited source domain experiments are presented in Table 5, while the results of the complete source domain experiments are shown in Table 6.
Based on the experimental results presented in Table 5 and Table 6, it is apparent that our proposed alignment strategies consistently outperform the baseline method in most task protocols. However, the extent of improvement over the baseline is more limited in the limited source domain experiments than in the experiments conducted with three source domains. This finding suggests that our method performs better with a larger number of source domains, indicating greater potential in scenarios with more source domains. Additionally, the alignment effect of our method can be influenced by the distribution of features from the source domains. Therefore, the performance of different alignment strategies for various task protocols can differ, each with its distinct set of strengths and weaknesses, as revealed by the experimental results. These observations are discussed in more detail in Section 4.4.2.

4.4. Comparison with State-of-the-Art Methods

4.4.1. Comparison on Limited Source Domains

Our proposed method, as shown in Table 7, outperforms the previous state-of-the-art method SSAN [6] in the restricted source domain task protocols. The objective of SSAN was to enhance generalization by reorganizing content and style features. In comparison, our method achieves a significant improvement by offering a source domain alignment method that improves the generalization performance on the training set of double source domains. Since the multi-domain alignment strategy is not involved in the double source domain alignment, it is easier to train a well-aligned generalized feature space. The method demonstrates considerable advantages in cases of insufficient data, and can maintain good performance even with a small training set, thereby making it valuable for practical applications.

4.4.2. Comparison with Baseline Methods

To assess the effectiveness of our proposed method in improving classification performance, we compared it with the SSDG method [8], a one-sided domain generalization approach that discriminates between real and fake faces to achieve a certain generalization performance. To bring samples from different domains closer together, we incorporated a feature alignment algorithm before triplet mining and combined the domain alignment feature space to improve the aggregation of the aligned real and fake faces. Our method has two different alignment strategies to account for the large differences in source data distribution across different task protocols. The comparison results of SSDG and our method are presented in Table 8.
The SSDG method is considered a strong baseline approach and outperforms most state-of-the-art methods, as demonstrated in Table 8. While SSDG aims to achieve generalization across multiple domains by focusing on domain differences and utilizing asymmetric optimization goals for real and fake faces, it may not effectively capture domain-relevant features. In contrast, our proposed MADG method leverages asymmetric optimization while simultaneously considering alignment across multiple domains. The superior performance of our method, as shown in Table 6, suggests that source domain alignment can lead to improved generalization results. Furthermore, our method’s ability to incorporate multiple alignments accounts for the diverse distribution of source data across different task protocols.

4.4.3. Comparison with Other SOTA Methods

Table 8 presents a comparative analysis of our proposed MADG method with state-of-the-art approaches under the leave-one-out protocols. The best performance in each column is highlighted in bold. The table demonstrates the effectiveness of our approach, as it outperforms all other methods, except for the SSDG [8] approach, which relies on discriminating real faces from fake ones to achieve generalization. In contrast, other methods, such as [7,9,10,15,17,30,31,49], demonstrate significantly poorer performance than our method, perhaps due to their inability to generalize effectively across multiple domains. Although both MADDG [7] and SSDG methods adopt the DG method to learn discriminative cues, the MADDG method struggles to optimize real and fake faces simultaneously, while the SSDG method disregards domain-relevant features. Specifically, SSDG relies on the broader feature distribution extracted from fake faces than real faces, while our MADG approach focuses on feature alignment across multiple domains while considering the distribution discrepancies between fake and real faces. This feature alignment capability enables our approach to achieve superior generalization performance while minimizing the impact of domain-specific features on generalization. Furthermore, different alignment methods may produce varying outcomes for various task protocols, depending on the source domain distribution.

4.5. Visualization and Analysis

4.5.1. Feature Visualization

We employed t-SNE visualization, a powerful technique for visualizing high-dimensional data in a low-dimensional space, to analyze the feature space. Figure 6 presents the comparison results of each visualization. Specifically, we randomly selected 200 samples from the four databases and visualized the feature space learned by MADG and the feature space learned by the models in the ablation experiments.
The scatter plot on the right illustrates the feature space learned by MADG, showing that the distribution of feature vectors is more compact and well-separated, indicating that the proposed method is effective in learning discriminative features for face anti-spoofing. In contrast, the plot on the left represents the visualization result of the ablation study without the key component of the multi-domain feature alignment module, $L_{MdAlign}$. In this case, the feature vectors are scattered and poorly separated, highlighting the importance of this module for the proposed method to achieve good performance. Our proposed method achieves a better generalized feature space compared to the model without the multi-domain feature alignment module, and the feature space of real faces is tidier, demonstrating the effectiveness of the proposed method in aligning the feature distributions of real and fake faces.
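For completeness, a sketch of how such a t-SNE projection could be produced from extracted feature vectors is given below; the initialization and plotting details are illustrative assumptions.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

def plot_tsne(features, labels, title="MADG feature space"):
    """Hedged sketch: project feature vectors to 2D with t-SNE and color them
    by class (1 = real, 0 = fake)."""
    emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
    for cls, name in [(1, "real"), (0, "fake")]:
        pts = emb[np.asarray(labels) == cls]
        plt.scatter(pts[:, 0], pts[:, 1], s=8, label=name)
    plt.legend()
    plt.title(title)
    plt.show()
```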

4.5.2. Analysis of Misclassified Samples

To conduct a comprehensive analysis of our model’s performance, we examined misclassified sample frames, as presented in Figure 7. These samples were taken from the test set of the O&C&I to M task protocol, which is the test data of MSU-MFSD [30]. This dataset includes printed photo and replay video attacks collected from 35 individuals through Android or laptop cameras. The experiments utilized preprocessed face slices.
The left box of Figure 7 displays the misclassified attack samples. In the tests, the misclassified attack faces all belonged to the same individual, and the samples of two attack types obtained by all collection methods from this individual were misclassified. We believe that such errors, which occurred only for attack samples of this individual, are mainly related to individual characteristics.
In contrast, the misclassified real faces in our test involved many individuals, as shown in the right box of Figure 7. This suggests that our model is less effective in classifying real faces than attack faces, indicating a certain level of overfitting. Several factors may contribute to this, including that real faces enter the network as planar images of the same dimensions as attack faces, that real face videos contain artificial actions such as expressions and motion, and that ornaments and lighting on the faces weaken the apparent authenticity of the samples. Examining such classification errors can provide insights for future research.

4.6. Conclusion of Experiments

We designed a set of experiments to evaluate the effectiveness of the proposed multi-domain feature alignment framework (MADG). Firstly, we performed an ablation study to assess the contribution of each component to the framework’s classification performance, and our results demonstrate that each component played a significant role in improving performance. Furthermore, we examined the impact of varying weights on the multi-domain alignment process and found that the alignment process of the O&I domain had the greatest interference with the generalization performance in the O&M&I to C protocol. This observation implies that cross-domain alignment between two domains has a significant impact on the overall alignment process due to differences in the feature space. The effectiveness of the alignment module was also demonstrated through comparison. Additionally, we evaluated the performance of MADG on limited source domains and found that the generalization performance of two-source domains was generally inferior to that of three-source domains, indicating the potential of MADG to enhance the classification performance in multi-domain scenarios. Our comparison of different alignment strategies revealed that at least one alignment strategy outperformed the baseline method in most task protocols. Finally, we validated the superiority of MADG by comparing it with the existing SOTA method.

5. Conclusions

In this paper, we presented a novel approach to improve the generalization capability of face anti-spoofing by utilizing a multi-domain feature alignment domain generalization framework (MADG). To the best of our knowledge, this is the first work to introduce the margin disparity discrepancy into domain generalization. Our approach modularizes the alignment algorithm and proposes two multi-domain alignment strategies to enhance the performance of multi-domain alignment. We combine the multi-directional triplet loss with the multi-domain alignment module, enabling the effective separation of real and attack face distributions in the feature space. Our method outperforms previous approaches that align the entire source domain without considering the distribution characteristics of real and attack faces, or that only extract attack face features. We conducted a comprehensive set of experiments on four public databases to validate the effectiveness of our method, and the results demonstrate its superior performance over state-of-the-art approaches.
However, our work also has some limitations that require further improvement in the future. Although our method has made progress in the study of feature alignment, it represents only an initial attempt and is limited to the exploration of multi-domain problems. Recent studies have shown that the introduction of auxiliary supervision information can effectively address the challenges posed by multi-domain problems. Therefore, future work can focus on multi-domain problems and the integration of auxiliary supervision information to further enhance the generalization ability of face anti-spoofing. Overall, our study provides a promising direction for future research on face anti-spoofing.

Author Contributions

Methodology, S.Z.; Validation, S.Z.; Writing—original draft, S.Z.; Writing—review & editing, S.Z. and W.N.; Visualization, S.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are openly available in [repository name CASIA-FASD] at [http://www.cbsr.ia.ac.cn/english/FASDB_Agreement/Agreement.pdf], ref. [29]; [repository name Idiap Replay-Attack] at [https://www.idiap.ch/en/dataset/replayattack], ref. [28]; [repository name MSU-MFSD] at [http://biometrics.cse.msu.edu/Publications/Databases/MSU%20Mobile%20Face%20Spoofing.rar], ref. [30]; [repository name OULU-NPU] at [https://www.sites.google.com/site/oulunpudatabase/], ref. [31] (all accessed on 1 February 2023).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Deng, J.; Guo, J.; Xue, N.; Zafeiriou, S. Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 4690–4699. [Google Scholar]
  2. Wang, M.; Deng, W. Deep face recognition: A survey. Neurocomputing 2021, 429, 215–244. [Google Scholar] [CrossRef]
  3. Yu, Z.; Qin, Y.; Li, X.; Zhao, C.; Lei, Z.; Zhao, G. Deep learning for face anti-spoofing: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 5609–5631. [Google Scholar] [CrossRef] [PubMed]
  4. Benlamoudi, A.; Bekhouche, S.E.; Korichi, M.; Bensid, K.; Ouahabi, A.; Hadid, A.; Taleb-Ahmed, A. Face Presentation Attack Detection Using Deep Background Subtraction. Sensors 2022, 22, 3760. [Google Scholar] [CrossRef]
  5. Li, S.; Dutta, V.; He, X.; Matsumaru, T. Deep Learning Based One-Class Detection System for Fake Faces Generated by GAN Network. Sensors 2022, 22, 7767. [Google Scholar] [CrossRef] [PubMed]
  6. Wang, Z.; Wang, Z.; Yu, Z.; Deng, W.; Li, J.; Gao, T.; Wang, Z. Domain Generalization via Shuffled Style Assembly for Face Anti-Spoofing. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 4123–4133. [Google Scholar]
  7. Shao, R.; Lan, X.; Li, J.; Yuen, P.C. Multi-adversarial discriminative deep domain generalization for face presentation attack detection. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 10023–10031. [Google Scholar]
  8. Jia, Y.; Zhang, J.; Shan, S.; Chen, X. Single-side domain generalization for face anti-spoofing. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 8484–8493. [Google Scholar]
  9. Määttä, J.; Hadid, A.; Pietikäinen, M. Face spoofing detection from single images using micro-texture analysis. In Proceedings of the 2011 International Joint Conference on Biometrics (IJCB), Washington, DC, USA, 11–13 October 2011; pp. 1–7. [Google Scholar]
  10. Freitas Pereira, T.d.; Komulainen, J.; Anjos, A.; De Martino, J.M.; Hadid, A.; Pietikäinen, M.; Marcel, S. Face liveness detection using dynamic texture. EURASIP J. Image Video Process. 2014, 2014, 2. [Google Scholar] [CrossRef]
  11. Patel, K.; Han, H.; Jain, A.K. Secure face unlock: Spoof detection on smartphones. IEEE Trans. Inf. Forensics Secur. 2016, 11, 2268–2283. [Google Scholar] [CrossRef]
  12. Kollreider, K.; Fronthaler, H.; Faraj, M.I.; Bigun, J. Real-time face detection and motion analysis with application in “liveness” assessment. IEEE Trans. Inf. Forensics Secur. 2007, 2, 548–558. [Google Scholar] [CrossRef]
  13. Pan, G.; Sun, L.; Wu, Z.; Lao, S. Eyeblink-based anti-spoofing in face recognition from a generic webcamera. In Proceedings of the 2007 IEEE 11th International Conference on Computer Vision (ICCV), Rio De Janeiro, Brazil, 14–21 October 2007; pp. 1–8. [Google Scholar]
  14. Sun, L.; Pan, G.; Wu, Z.; Lao, S. Blinking-based live face detection using conditional random fields. In Proceedings of the International Conference on Biometrics (ICB), Seoul, Republic of Korea, 27–29 August 2007; pp. 252–260. [Google Scholar]
  15. Yang, J.; Lei, Z.; Li, S.Z. Learn convolutional neural network for face anti-spoofing. arXiv 2014, arXiv:1408.5601. [Google Scholar]
  16. Jourabloo, A.; Liu, Y.; Liu, X. Face de-spoofing: Anti-spoofing via noise modeling. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 290–306. [Google Scholar]
  17. Liu, Y.; Jourabloo, A.; Liu, X. Learning deep models for face anti-spoofing: Binary or auxiliary supervision. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 389–398. [Google Scholar]
  18. Liu, S.; Yuen, P.C.; Zhang, S.; Zhao, G. 3D mask face anti-spoofing with remote photoplethysmography. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; pp. 85–100. [Google Scholar]
  19. Liu, S.Q.; Lan, X.; Yuen, P.C. Remote photoplethysmography correspondence feature for 3D mask face presentation attack detection. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 558–573. [Google Scholar]
  20. Yang, X.; Luo, W.; Bao, L.; Gao, Y.; Gong, D.; Zheng, S.; Li, Z.; Liu, W. Face anti-spoofing: Model matters, so does data. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 3507–3516. [Google Scholar]
  21. Roy, K.; Hasan, M.; Rupty, L.; Hossain, M.S.; Sengupta, S.; Taus, S.N.; Mohammed, N. Bi-fpnfas: Bi-directional feature pyramid network for pixel-wise face anti-spoofing by leveraging fourier spectra. Sensors 2021, 21, 2799. [Google Scholar] [CrossRef]
  22. Liu, Y.; Stehouwer, J.; Jourabloo, A.; Liu, X. Deep tree learning for zero-shot face anti-spoofing. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 4680–4689. [Google Scholar]
  23. Liu, Y.; Stehouwer, J.; Liu, X. On disentangling spoof trace for generic face anti-spoofing. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 406–422. [Google Scholar]
  24. Zhang, K.Y.; Yao, T.; Zhang, J.; Tai, Y.; Ding, S.; Li, J.; Huang, F.; Song, H.; Ma, L. Face anti-spoofing via disentangled representation learning. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 641–657. [Google Scholar]
  25. Feng, L.; Po, L.M.; Li, Y.; Xu, X.; Yuan, F.; Cheung, T.C.H.; Cheung, K.W. Integration of image quality and motion cues for face anti-spoofing: A neural network approach. J. Vis. Commun. Image Represent. 2016, 38, 451–460. [Google Scholar] [CrossRef]
  26. Li, L.; Feng, X.; Boulkenafet, Z.; Xia, Z.; Li, M.; Hadid, A. An original face anti-spoofing approach using partial convolutional neural network. In Proceedings of the 2016 Sixth International Conference on Image Processing Theory, Tools and Applications (IPTA), Oulu, Finland, 12–15 December 2016; pp. 1–6. [Google Scholar]
  27. Patel, K.; Han, H.; Jain, A.K. Cross-database face antispoofing with robust feature representation. In Proceedings of the Biometric Recognition: 11th Chinese Conference (CCBR), Chengdu, China, 14–16 October 2016; pp. 611–619. [Google Scholar]
  28. Chingovska, I.; Anjos, A.; Marcel, S. On the effectiveness of local binary patterns in face anti-spoofing. In Proceedings of the 2012 BIOSIG-Proceedings of the International Conference of Biometrics Special Interest Group (BIOSIG), Darmstadt, Germany, 6–7 September 2012; pp. 1–7.
  29. Zhang, Z.; Yan, J.; Liu, S.; Lei, Z.; Yi, D.; Li, S.Z. A face antispoofing database with diverse attacks. In Proceedings of the 2012 5th IAPR International Conference on Biometrics (ICB), New Delhi, India, 29 March–1 April 2012; pp. 26–31.
  30. Wen, D.; Han, H.; Jain, A.K. Face spoof detection with image distortion analysis. IEEE Trans. Inf. Forensics Secur. 2015, 10, 746–761.
  31. Boulkenafet, Z.; Komulainen, J.; Li, L.; Feng, X.; Hadid, A. OULU-NPU: A mobile face presentation attack database with real-world variations. In Proceedings of the 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), Washington, DC, USA, 30 May–3 June 2017; pp. 612–618.
  32. Rebuffi, S.A.; Bilen, H.; Vedaldi, A. Learning multiple visual domains with residual adapters. arXiv 2017, arXiv:1705.08045.
  33. Li, H.; Li, W.; Cao, H.; Wang, S.; Huang, F.; Kot, A.C. Unsupervised domain adaptation for face anti-spoofing. IEEE Trans. Inf. Forensics Secur. 2018, 13, 1794–1809.
  34. Mancini, M.; Ricci, E.; Caputo, B.; Rota Bulo, S. Adding new tasks to a single network with weight transformations using binary masks. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, Munich, Germany, 8–14 September 2018.
  35. Rebuffi, S.A.; Bilen, H.; Vedaldi, A. Efficient parametrization of multi-domain deep neural networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 8119–8127.
  36. Tu, X.; Zhang, H.; Xie, M.; Luo, Y.; Zhang, Y.; Ma, Z. Deep transfer across domains for face antispoofing. J. Electron. Imaging 2019, 28, 043001.
  37. Zhang, Y.; Liu, T.; Long, M.; Jordan, M. Bridging theory and algorithm for domain adaptation. In Proceedings of the International Conference on Machine Learning (ICML), Long Beach, CA, USA, 9–15 June 2019; pp. 7404–7413.
  38. Wang, G.; Han, H.; Shan, S.; Chen, X. Improving cross-database face presentation attack detection via adversarial domain adaptation. In Proceedings of the 2019 International Conference on Biometrics (ICB), Crete, Greece, 4–7 June 2019; pp. 1–8.
  39. Wang, G.; Han, H.; Shan, S.; Chen, X. Unsupervised adversarial domain adaptation for cross-domain face presentation attack detection. IEEE Trans. Inf. Forensics Secur. 2020, 16, 56–69.
  40. Ariza-Colpas, P.; Piñeres-Melo, M.; Barceló-Martínez, E.; De la Hoz-Franco, E.; Benitez-Agudelo, J.; Gelves-Ospina, M.; Echeverri-Ocampo, I.; Combita-Nino, H.; Leon-Jacobus, A. Enkephalon-technological platform to support the diagnosis of Alzheimer’s disease through the analysis of resonance images using data mining techniques. In Proceedings of the Advances in Swarm Intelligence: 10th International Conference, ICSI 2019, Chiang Mai, Thailand, 26–30 July 2019; pp. 211–220.
  41. Liu, Z.; Gu, X.; Chen, J.; Wang, D.; Chen, Y.; Wang, L. Automatic recognition of pavement cracks from combined GPR B-scan and C-scan images using multiscale feature fusion deep neural networks. Autom. Constr. 2023, 146, 104698.
  42. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
  43. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
  44. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90.
  45. Yang, J.; Lan, G.; Xiao, S.; Li, Y.; Wen, J.; Zhu, Y. Enriching facial anti-spoofing datasets via an effective face swapping framework. Sensors 2022, 22, 4697.
  46. Guo, X.; Liu, Y.; Jain, A.; Liu, X. Multi-domain learning for updating face anti-spoofing models. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; pp. 230–249.
  47. Motiian, S.; Piccirilli, M.; Adjeroh, D.A.; Doretto, G. Unified deep supervised domain adaptation and generalization. In Proceedings of the 2017 IEEE/CVF International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 5715–5725.
  48. Ghifary, M.; Kleijn, W.B.; Zhang, M.; Balduzzi, D. Domain generalization for object recognition with multi-task autoencoders. In Proceedings of the 2015 IEEE/CVF International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 2551–2559.
  49. Li, H.; Pan, S.J.; Wang, S.; Kot, A.C. Domain generalization with adversarial feature learning. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 5400–5409.
  50. Saha, S.; Xu, W.; Kanakis, M.; Georgoulis, S.; Chen, Y.; Paudel, D.P.; Van Gool, L. Domain agnostic feature learning for image and video based face anti-spoofing. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Seattle, WA, USA, 13–19 June 2020; pp. 802–803.
  51. Kim, T.; Kim, Y. Suppressing spoof-irrelevant factors for domain-agnostic face anti-spoofing. IEEE Access 2021, 9, 86966–86974.
  52. Shao, R.; Lan, X.; Yuen, P.C. Regularized fine-grained meta face anti-spoofing. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), New York, NY, USA, 7–12 February 2020; Volume 34, pp. 11974–11981.
  53. Chen, Z.; Yao, T.; Sheng, K.; Ding, S.; Tai, Y.; Li, J.; Huang, F.; Jin, X. Generalizable representation learning for mixture domain face anti-spoofing. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Vancouver, BC, Canada, 2–9 February 2021; Volume 35, pp. 1132–1139.
  54. Wang, J.; Zhang, J.; Bian, Y.; Cai, Y.; Wang, C.; Pu, S. Self-domain adaptation for face anti-spoofing. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Vancouver, BC, Canada, 2–9 February 2021; Volume 35, pp. 2746–2754.
  55. Zhang, K.; Zhang, Z.; Li, Z.; Qiao, Y. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Process. Lett. 2016, 23, 1499–1503.
  56. Wang, G.; Han, H.; Shan, S.; Chen, X. Cross-domain face presentation attack detection via multi-domain disentangled representation learning. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 6678–6687.
  57. Liu, S.; Zhang, K.Y.; Yao, T.; Bi, M.; Ding, S.; Li, J.; Huang, F.; Ma, L. Adaptive normalized representation learning for generalizable face anti-spoofing. In Proceedings of the 29th ACM International Conference on Multimedia (ACM MM), Virtual Event, China, 20–24 October 2021; pp. 1469–1477.
  58. Liu, S.; Zhang, K.Y.; Yao, T.; Sheng, K.; Ding, S.; Tai, Y.; Li, J.; Xie, Y.; Ma, L. Dual reweighting domain generalization for face presentation attack detection. arXiv 2021, arXiv:2106.16128.
Figure 1. Schematic diagram of the proposed multi-source domain alignment. The distribution plot on the left illustrates the sample distribution of the source domains before feature alignment, while the plot on the right shows the distribution after feature alignment in the feature space. The original distributions of the different domains are clearly dissimilar; feature alignment makes the distribution of each domain in the feature space more uniform.
Figure 2. Overview of the proposed method. Different domains are delineated by colored borders, and the method is exemplified with three source domains. Real and fake face features are obtained from all domains by a shared feature generator. These features are then aligned in pairs by cross-domain alignment (CDA) modules, which together achieve multi-domain feature alignment. In addition, multi-directional triplet mining is conducted to disentangle the distributions of real and fake faces in the feature space. The final classifier is trained with cross-entropy loss.
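To make the interplay of these components concrete, the following minimal sketch shows how one training step could combine the cross-entropy, multi-domain alignment, and triplet objectives, assuming a PyTorch implementation. The argument names (feature_generator, classifier, cda_modules, mine_triplets), the loss weights, and the triplet margin are placeholders introduced here for illustration; this is not the authors' released code.

```python
import torch
import torch.nn as nn


def training_step(feature_generator, classifier, cda_modules, mine_triplets,
                  batches, optimizer, lambda_align=1.0, lambda_trip=1.0):
    """One optimization step over a list of per-domain (images, labels) batches.

    feature_generator, classifier, and cda_modules stand in for the components of
    Figure 2 (Tables 1-3); mine_triplets is a hypothetical helper that returns
    (anchor, positive, negative) embeddings for the multi-directional triplet loss.
    """
    feats = [feature_generator(x) for x, _ in batches]
    labels = [y for _, y in batches]

    # 1) Real/fake classification with cross-entropy over all source domains.
    logits = torch.cat([classifier(f) for f in feats])
    loss_cls = nn.functional.cross_entropy(logits, torch.cat(labels))

    # 2) Pairwise cross-domain alignment: one CDA module per domain pair.
    pair_losses = []
    pair_id = 0
    for i in range(len(feats)):
        for j in range(i + 1, len(feats)):
            pair_losses.append(cda_modules[pair_id](feats[i], feats[j]))
            pair_id += 1
    # Uniform pair weights here; the lambda-weighted version appears with Figure 5.
    loss_align = torch.stack(pair_losses).mean()

    # 3) Multi-directional triplet loss, sketched with the stock margin formulation.
    anchor, positive, negative = mine_triplets(feats, labels)
    loss_trip = nn.functional.triplet_margin_loss(anchor, positive, negative, margin=0.3)

    loss = loss_cls + lambda_align * loss_align + lambda_trip * loss_trip
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.detach()
```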
Figure 3. The cross-domain feature alignment module, which aligns sample features from two domains by minimizing the divergence between them. Adversarial learning, formulated as a minimax game, is employed to achieve the alignment: two classifiers of identical structure, denoted f and f′, are trained to perform the same classification task on the aligned features. During the maximization step, the gradient flowing back from the classifier to the input feature is inverted, which drives the features of the two domains toward alignment. The aligned output of this process is denoted “CDA”.
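The inverted-gradient step described in the caption corresponds to the standard gradient-reversal trick, which collapses the minimax game into a single backward pass. A minimal PyTorch sketch is given below; the class name GradReverse and the scaling coefficient alpha are our own notation and are not necessarily how the authors implemented it.

```python
import torch


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -alpha on the way back."""

    @staticmethod
    def forward(ctx, x, alpha=1.0):
        ctx.alpha = alpha
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and optionally scale) the gradient reaching the feature generator.
        return -ctx.alpha * grad_output, None


def grad_reverse(x, alpha=1.0):
    return GradReverse.apply(x, alpha)
```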
Figure 4. Preprocessed sample frames from four publicly available face anti-spoofing databases, namely Idiap Replay-Attack [28], CASIA-FASD [29], MSU-MFSD [30], and OULU-NPU [31]. Real face samples are demarcated by blue borders, while attack face samples are marked by red borders. The visible variations among the domains stem from distinct sampling environments and conditions, emphasizing the need to design multiple task protocols for evaluating generalization performance.
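The caption refers to preprocessed frames; a typical preprocessing step for these datasets is detecting and cropping the face region with MTCNN [55] before training. The sketch below uses the facenet-pytorch package as one possible implementation; the package choice, crop size, margin, and file layout are assumptions rather than the authors' actual pipeline.

```python
# Minimal face-cropping sketch; not the authors' preprocessing code.
from pathlib import Path

from PIL import Image
from facenet_pytorch import MTCNN

mtcnn = MTCNN(image_size=256, margin=20)  # returns an aligned 256x256 face crop


def crop_faces(frame_dir: str, out_dir: str) -> None:
    """Detect and save face crops for every .jpg frame in frame_dir."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for frame in sorted(Path(frame_dir).glob("*.jpg")):
        img = Image.open(frame).convert("RGB")
        # save_path writes the cropped face directly to disk; frames with no
        # detected face are skipped (the call returns None in that case).
        mtcnn(img, save_path=str(Path(out_dir) / frame.name))
```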
Figure 5. The impact of the alignment weights on overall generalization performance, evaluated on the O&M&I to C protocol. The x-axis represents the value of λ, the pairwise weighting coefficient in the multi-domain alignment loss of Section 3.2, $\mathcal{L}_{MdAlign} = \frac{2}{N(N-1)} \sum_{1 \le i < j \le N} \lambda_{i,j}\, m_{i,j}$. A control-variable strategy was applied in the experiment: only one λ was adjusted at a time while the remaining parameters were kept constant. The left plot shows the variation in the half total error rate (HTER), while the right plot shows the corresponding change in the area under the curve (AUC).
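For clarity, the weighted combination of pairwise alignment losses described above can be written as a small helper. The sketch below assumes each pairwise loss m_{i,j} is a scalar tensor; the dictionary layout and function name are illustrative only.

```python
import torch


def multi_domain_alignment_loss(pair_losses, lambdas, n_domains):
    """Weighted combination of the pairwise alignment losses m_{i,j}.

    pair_losses and lambdas are dictionaries keyed by domain-index pairs (i, j)
    with i < j; the 2 / (N(N-1)) factor averages over the N(N-1)/2 domain pairs,
    matching the expression in the Figure 5 caption.
    """
    weighted = torch.stack([lambdas[p] * pair_losses[p] for p in pair_losses])
    return 2.0 / (n_domains * (n_domains - 1)) * weighted.sum()
```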
Figure 6. t-SNE visualization of the high-dimensional feature space, which allows the relationship between the input features and the output labels to be explored. A subset of 200 samples was selected to visualize the feature distribution. As shown in the scatter plot on the right, the proposed MADG method learns discriminative features that enable clear separation between different domains. In contrast, the visualization for the ablation study, which excludes the multi-domain feature alignment module, exhibits a more disorganized feature distribution. Comparing the visualizations indicates that the proposed method generalizes better and is effective in aligning the feature distributions of real and fake faces across multiple domains.
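A typical way to produce such a plot is to embed the learned features with scikit-learn's t-SNE and color the points by domain. The sketch below is illustrative; the perplexity, sample count, and plotting details are assumptions rather than the exact settings used for Figure 6.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE


def plot_tsne(features: np.ndarray, domain_labels: np.ndarray, out_path: str = "tsne.png"):
    """features: (n_samples, feat_dim) array; domain_labels: integer domain id per sample."""
    embedded = TSNE(n_components=2, perplexity=30, init="pca",
                    random_state=0).fit_transform(features)
    for d in np.unique(domain_labels):
        mask = domain_labels == d
        plt.scatter(embedded[mask, 0], embedded[mask, 1], s=8, label=f"domain {d}")
    plt.legend()
    plt.savefig(out_path, dpi=200)
```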
Figure 7. A collection of misclassified samples from the O&C&I to M task protocol. The left-hand box shows misclassified attack faces, all of which came from a single individual; that individual’s attack samples spanned two attack types and were misclassified under every collection method. The right-hand box shows misclassified real face samples, which originated from videos captured with an Android or laptop camera.
Table 1. Details of feature generator.

Feature Generator
Layer | Chan./Stri. | Resampling | Output Shape
Input (image) | | |
conv1-1 | 64/2 | None | 64 × 128 × 128
pool1 | -/2 | - | 64 × 64 × 64
layer1 | | |
conv1-2 | 64/1 | None |
conv2-1 | 64/1 | None |
conv1-3 | 64/1 | None |
conv2-2 | 64/1 | None | 64 × 64 × 64
layer2 | | |
conv1-4 | 128/2 | None |
conv2-3 | 128/1 | None |
conv2-4 | 128/2 | Down |
conv1-5 | 128/1 | None |
conv2-5 | 128/1 | None | 128 × 32 × 32
layer3 | | |
conv1-6 | 256/2 | None |
conv2-6 | 256/1 | None |
conv2-7 | 256/2 | Down |
conv1-7 | 256/1 | None |
conv2-8 | 256/1 | None | 256 × 16 × 16
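The layer names, channel widths, and strides in Table 1 match the stem and first three residual stages of ResNet-18 [42]. Under that assumption, a feature generator with these output shapes could be assembled from torchvision as sketched below (torchvision ≥ 0.13 weights API); whether the authors initialize from ImageNet weights is not specified here.

```python
import torch.nn as nn
from torchvision.models import resnet18


class FeatureGenerator(nn.Module):
    """Truncated ResNet-18 backbone consistent with Table 1 (stem + layers 1-3)."""

    def __init__(self, pretrained: bool = True):
        super().__init__()
        backbone = resnet18(weights="IMAGENET1K_V1" if pretrained else None)
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1,
                                  backbone.relu, backbone.maxpool)
        self.layer1 = backbone.layer1   # 64 channels
        self.layer2 = backbone.layer2   # 128 channels, stride 2
        self.layer3 = backbone.layer3   # 256 channels, stride 2

    def forward(self, x):
        # For a 3 x 256 x 256 input this yields a 256 x 16 x 16 feature map,
        # matching the output shape of layer3 in Table 1.
        x = self.stem(x)
        x = self.layer1(x)
        x = self.layer2(x)
        return self.layer3(x)
```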
Table 2. Details of feature embedder and classifier.

Feature Embedder and Classifier
Layer | Chan./Stri. | Resampling | Output Shape
Input (conv2-8) | | |
layer4 | | |
conv1-8 | 512/2 | None |
conv2-9 | 512/1 | None |
conv2-10 | 512/2 | Down |
conv1-9 | 512/1 | None |
conv2-11 | 512/1 | None | 512 × 8 × 8
bottleneck | | |
average pooling | - | - | 512 × 1 × 1
fc1-1 | 512/1 | - | 512
fc1-2 | 512/1 | - | 512
fc2-2 | 2/1 | - | 2
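Table 2 corresponds to the fourth residual stage followed by global average pooling and a small fully connected embedder ending in a two-way real/fake classifier. The sketch below is one hedged reconstruction; the activation functions between the fully connected layers are assumptions.

```python
import torch.nn as nn
from torchvision.models import resnet18


class EmbedderClassifier(nn.Module):
    """Hedged reconstruction of Table 2: layer4 + pooling + fc embedder + 2-way head."""

    def __init__(self):
        super().__init__()
        backbone = resnet18()
        self.layer4 = backbone.layer4                # 512 x 8 x 8 for a 256 x 256 input
        self.pool = nn.AdaptiveAvgPool2d(1)          # "average pooling" row, 512 x 1 x 1
        self.embedder = nn.Sequential(nn.Linear(512, 512), nn.ReLU(),    # fc1-1
                                      nn.Linear(512, 512), nn.ReLU())    # fc1-2
        self.classifier = nn.Linear(512, 2)          # fc2-2: real vs. fake logits

    def forward(self, feat):
        x = self.pool(self.layer4(feat)).flatten(1)
        emb = self.embedder(x)
        return self.classifier(emb), emb
```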
Table 3. Details of alignment classifier.

Alignment Classifier
Layer | Chan./Stri. | Output Shape
Input (conv2-11) | |
bottleneck | |
pool | -/1 |
fc3-1 | 1024/1 | 1024
head | |
fc4-1 | 1024/1 |
fc5-1 | 2/1 | 2
grl layer | - | 1024
adv head | |
fc4-2 | 1024/1 |
fc5-2 | 2/1 | 2
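Table 3 lists a shared bottleneck followed by a plain classification head and an adversarial head placed behind the gradient-reversal (grl) layer. A hedged PyTorch reconstruction is sketched below, reusing the grad_reverse helper from the sketch after Figure 3; the widths follow the table, while the activations and the reversal coefficient are assumptions.

```python
import torch.nn as nn


class AlignmentClassifier(nn.Module):
    """Hedged reconstruction of Table 3: bottleneck, head, and adversarial head."""

    def __init__(self, in_dim: int = 512, hidden: int = 1024, alpha: float = 1.0):
        super().__init__()
        self.alpha = alpha
        self.bottleneck = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                        nn.Linear(in_dim, hidden), nn.ReLU())  # pool + fc3-1
        self.head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, 2))                         # fc4-1, fc5-1
        self.adv_head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                      nn.Linear(hidden, 2))                     # fc4-2, fc5-2

    def forward(self, feat):
        z = self.bottleneck(feat)
        # grad_reverse is the gradient-reversal helper sketched after Figure 3.
        return self.head(z), self.adv_head(grad_reverse(z, self.alpha))
```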
Table 4. Ablation study of different components of the proposed method. The header is the task protocol and the bold represents the optimal result.

Method | O&C&I to M (HTER % / AUC %) | O&M&I to C (HTER % / AUC %) | O&C&M to I (HTER % / AUC %) | I&C&M to O (HTER % / AUC %)
w/o L_MdAlign | 8.63 / 95.79 | 11.22 / 94.73 | 16.50 / 91.42 | 14.88 / 92.84
w/o L_Trip | 8.40 / 96.72 | 8.56 / 96.95 | 22.14 / 77.60 | 19.88 / 87.53
MADG | 7.20 / 97.57 | 8.78 / 96.33 | 15.64 / 92.53 | 13.99 / 94.19
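The HTER and AUC values reported in Tables 4–8 can be computed from per-sample liveness scores as sketched below; the fixed 0.5 threshold is a simplification, since in practice the operating threshold is usually chosen on a development set.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def hter_and_auc(labels: np.ndarray, scores: np.ndarray, threshold: float = 0.5):
    """labels: 1 for real, 0 for attack; scores: higher means more likely real."""
    auc = roc_auc_score(labels, scores)
    preds = scores >= threshold
    far = np.mean(preds[labels == 0])    # false acceptance rate (attacks accepted)
    frr = np.mean(~preds[labels == 1])   # false rejection rate (real faces rejected)
    hter = (far + frr) / 2.0
    return hter * 100.0, auc * 100.0
```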
Table 5. Experiments on limited source domains involving twelve task protocols. The headers are the baseline method and ours; the bold represents the optimal result.

Protocol | SSDG-R [8] (HTER % / AUC %) | MADG-M (Ours) (HTER % / AUC %) | MADG-S (Ours) (HTER % / AUC %)
O&M to C | 8.11 / 96.80 | 8.89 / 97.04 | 7.78 / 97.39
O&I to C | 24.67 / 84.24 | 21.89 / 88.12 | 20.78 / 87.63
M&I to C | 22.67 / 83.54 | 23.33 / 82.32 | 25.22 / 81.66
C&M to O | 13.54 / 93.718 | 15.09 / 92.47 | 15.10 / 92.35
C&I to O | 16.06 / 92.25 | 15.42 / 91.74 | 13.87 / 93.17
M&I to O | 24.17 / 83.28 | 26.29 / 81.50 | 27.31 / 81.19
C&O to M | 6.96 / 97.26 | 8.40 / 96.33 | 8.87 / 95.09
C&I to M | 11.51 / 94.91 | 11.75 / 94.25 | 10.30 / 95.35
I&O to M | 8.87 / 94.97 | 10.31 / 94.18 | 9.83 / 96.72
C&M to I | 21.57 / 85.25 | 20.00 / 85.91 | 22.21 / 78.25
C&O to I | 23.43 / 84.22 | 24.21 / 83.37 | 19.43 / 88.99
O&M to I | 17.14 / 90.84 | 14.43 / 93.96 | 19.29 / 88.80
Table 6. Comparisons of different alignment strategies. The headers are the baseline method and ours; the bold represents the optimal result.

Protocol | SSDG-R [8] (HTER % / AUC %) | MADG-M (Ours) (HTER % / AUC %) | MADG-S (Ours) (HTER % / AUC %)
O&C&I to M | 7.38 / 97.17 | 7.43 / 96.40 | 7.20 / 97.57
O&M&I to C | 10.44 / 95.94 | 8.78 / 96.33 | 9.33 / 95.35
O&C&M to I | 11.71 / 96.59 | 17.14 / 91.78 | 15.64 / 92.53
I&C&M to O | 15.61 / 91.54 | 13.99 / 94.19 | 15.50 / 92.76
Table 7. Comparison with state-of-the-art methods on limited source domains. The header is the task protocol and the bold represents the optimal result.

Method | M&I to C (HTER % / AUC %) | M&I to O (HTER % / AUC %)
MS LBP [9] | 51.16 / 52.09 | 43.63 / 58.07
IDA [30] | 45.16 / 58.80 | 54.52 / 42.17
LBP-TOP [10] | 45.27 / 54.88 | 47.26 / 50.21
MADDG [7] | 41.02 / 64.33 | 39.35 / 65.10
SSDG-M [8] | 31.89 / 71.29 | 36.01 / 66.88
DR-MD-Net [56] | 31.67 / 75.23 | 34.02 / 72.65
ANRL [57] | 31.06 / 72.12 | 30.73 / 74.10
SSAN-M [6] | 30.00 / 76.20 | 29.44 / 76.62
MADG (Ours) | 23.33 / 82.32 | 26.29 / 81.50
Table 8. Comparison with state-of-the-art methods. The header is the task protocol and the bold represents the optimal result.

Method | O&C&I to M (HTER % / AUC %) | O&M&I to C (HTER % / AUC %) | O&C&M to I (HTER % / AUC %) | I&C&M to O (HTER % / AUC %)
MS LBP [9] | 29.76 / 78.50 | 54.28 / 44.98 | 50.30 / 51.64 | 50.29 / 49.31
Binary-CNN [15] | 29.25 / 82.87 | 34.88 / 71.94 | 34.47 / 65.88 | 29.61 / 77.54
IDA [30] | 66.67 / 27.86 | 55.17 / 39.05 | 28.35 / 78.25 | 54.20 / 44.59
Color Texture [31] | 28.09 / 78.47 | 30.58 / 76.89 | 40.40 / 62.78 | 63.59 / 32.71
LBP-TOP [10] | 36.90 / 70.80 | 42.60 / 61.05 | 49.45 / 49.54 | 53.15 / 44.09
Auxiliary (Depth) | 22.72 / 85.88 | 33.52 / 73.15 | 29.14 / 71.69 | 30.17 / 77.61
Auxiliary [17] | - / - | 28.40 / - | 27.60 / - | - / -
MMD-AAE [49] | 27.08 / 83.19 | 44.59 / 58.29 | 31.58 / 75.18 | 40.98 / 63.08
MADDG [7] | 17.69 / 88.06 | 24.50 / 84.51 | 22.19 / 84.99 | 27.89 / 80.02
PAD-GAN [56] | 17.02 / 90.10 | 19.68 / 87.43 | 20.87 / 86.72 | 25.02 / 81.47
RFM [52] | 13.89 / 93.98 | 20.27 / 88.16 | 17.30 / 90.48 | 16.45 / 91.16
SSDG-M [8] | 16.67 / 90.47 | 23.11 / 85.45 | 18.21 / 94.61 | 25.17 / 81.83
SSDG-R [8] | 7.38 / 97.17 | 10.44 / 95.94 | 11.71 / 96.59 | 15.61 / 91.54
ANRL [57] | 16.03 / 91.04 | 10.83 / 96.75 | 17.85 / 89.26 | 15.67 / 91.90
DRDG [58] | 15.56 / 91.79 | 12.43 / 95.81 | 19.05 / 88.79 | 15.63 / 91.75
MADG (Ours) | 7.20 / 97.57 | 8.78 / 96.33 | 15.64 / 92.53 | 13.99 / 94.19