Multi-Domain Feature Alignment for Face Anti-Spoofing

Face anti-spoofing is critical for enhancing the robustness of face recognition systems against presentation attacks. Existing methods predominantly rely on binary classification tasks. Recently, methods based on domain generalization have yielded promising results. However, due to distribution discrepancies between various domains, domain-specific differences in the feature space considerably hinder generalization to unfamiliar domains. In this work, we propose a multi-domain feature alignment framework (MADG) that addresses the poor generalization that arises when multiple source domains are scattered in the feature space. Specifically, an adversarial learning process is designed to narrow the differences between domains, aligning the features of multiple sources and thus achieving multi-domain alignment. Moreover, to further improve the effectiveness of our proposed framework, we incorporate a multi-directional triplet loss to achieve a higher degree of separation between fake and real faces in the feature space. To evaluate the performance of our method, we conducted extensive experiments on several public datasets. The results demonstrate that our proposed approach outperforms current state-of-the-art methods, thereby validating its effectiveness in face anti-spoofing.


Introduction
With the extensive use of deep learning in computer vision, face recognition (FR) [1,2] technology has become increasingly important in daily life, particularly in scenarios that require user identification and authorization. Despite significant progress in FR, these systems remain susceptible to various types of attacks, such as print attacks, replayed video attacks, and 3D mask attacks. To address these challenges, current state-of-the-art research has proposed various methods for face anti-spoofing (FAS) [3–6]. These methods can be broadly categorized into two groups: hand-crafted feature-based and deep learning feature-based approaches.
Despite the notable achievements of previous face anti-spoofing (FAS) methods in intra-domain testing, their performance significantly deteriorates in cross-domain testing. This is primarily due to the introduction of bias resulting from the distinct characteristics of domains and the inability to address such bias by considering their internal relationships. Consequently, the generalization effect of the model on the novel domain is insufficient. To mitigate this limitation, recent studies have utilized unsupervised-learning-based domain adaptation (DA) techniques to eliminate the domain bias between source and target domains. Nevertheless, the target domain usually denotes an unseen domain for the source domain, and acquiring an adequate amount of target domain data for training in real-world scenarios is not typically feasible.
In response to the weakness of a model's generalization in the unseen domain, several studies related to domain generalization (DG) have been proposed. Conventional DG work [7] proposed a novel multi-adversarial discriminative method to learn a discriminative multi-domain feature space and improve generalization performance. Based on these considerations and an analysis of the feature space of the source domains, we found that feature differences between multiple source domains significantly interfere with the generalization effectiveness of face anti-spoofing (FAS) in an unseen feature space. To address this issue, we introduce a novel framework for domain generalization in this work, named the Multi-domain Feature Alignment Domain Generalization (MADG) framework. Specifically, the proposed framework utilizes feature generators to produce real- and fake-face features in each domain, which are subsequently aligned using a multi-domain feature alignment method whose loss function minimizes the margin disparity discrepancy between multiple source domains. To optimize this loss function, we use multiple adversarial learning processes. The feature alignment process tackles the challenges associated with generalization, while multi-directional triplet mining strengthens class boundaries to further enhance classification performance.
The present work makes four primary contributions:
• We propose a novel source-domain-alignment domain generalization algorithm that utilizes margin disparity discrepancy and adversarial learning. This approach is designed to significantly improve the generalization performance of the FAS model.
• In the context of multi-domain problems, we devise two alignment strategies and modularize the alignment process. The experimental findings demonstrate that multi-domain alignment surpasses cross-domain alignment in terms of generalization performance, rendering it more advantageous in multi-domain scenarios.
• We combine the new algorithm with multi-directional triplet mining and analyze the source domain feature space. This results in a novel multi-domain feature alignment framework (MADG), which improves the classification accuracy of the FAS model.
• Extensive experiments and comparisons have been conducted, demonstrating that our proposed approach achieves state-of-the-art performance on most protocols.
The rest of this article is structured into four main sections. In the second section, we provide a detailed introduction to the related work, including mainstream research on face anti-spoofing, domain generalization, and multi-domain learning. We also discuss some works that have provided inspiration for this study. In the third section, we present the proposed method, which includes a comprehensive introduction to the new alignment method and domain generalization framework developed in this study. The fourth section details the experiments conducted to evaluate the proposed method and presents a thorough analysis of the experimental results. Finally, in the fifth section, we provide a summary of our work and give an outlook for future research directions in this area.

Face Anti-Spoofing Methods
There exist two principal categories of conventional face anti-spoofing (FAS) techniques: appearance-based and temporal-based methods. Appearance-based methods involve the extraction of hand-crafted features for classification, such as local binary patterns (LBPs) [9,10] and scale-invariant feature transform (SIFT) [11]. In contrast, temporal-based FAS methods detect attack faces by extracting temporal cues from a sequence of consecutive face frames. Mouth movement detection [12] and eye blink detection [13,14] are examples of the earliest dynamic texture-detection methods. However, these methods do not generalize well to cross-dataset testing scenarios due to the dissimilarities in feature spaces among diverse domains, which often lead to feature bias during generalization.
Recently, deep neural networks, specifically convolutional neural networks (CNNs), have gained widespread adoption in computer vision tasks and have been extensively applied to FAS. Yang et al. [15] were the pioneers in utilizing binary classification CNNs for FAS. Jourabloo et al. [16] proposed a face de-spoofing technique that performs fake face classification by reverse decomposition into real faces and spoof noise. Liu et al. [17] presented a CNN-RNN network that combines both appearance and temporal cues to detect spoof attacks using remote photoplethysmography (rPPG) signals. Similarly, 3D mask attack detection methods [18,19] also exploit rPPG information. Yang et al. [20] combined temporal and appearance cues to distinguish fake faces from real ones. Roy et al. [21] investigated frame-level FAS to enhance biometric authentication security against face-spoofing attacks. More recently, deep neural networks have been applied to FAS [4,7,8,22–24], achieving superior performance compared to conventional methods [15,25–27].
In conclusion, traditional FAS methods include appearance-based and temporal-based methods, which extract hand-crafted features and mine temporal cues, respectively. However, deep neural networks, especially CNNs, have achieved state-of-the-art performance in FAS by combining the appearance and temporal cues, using techniques such as reverse decomposition and frame-level FAS for improved biometric authentication security.

Multi-Domain Learning
The use of multiple datasets has recently sparked research interest in multi-domain processing. In particular, the research community has amassed several large-scale FAS datasets with rich annotations [28–31]. Our work on multi-source domain processing shares similarities with domain adaptation (DA) methods [32–39] that require a retrained model to perform well on both source and target domain data. Specifically, Zhang et al. [37] introduced the concept of margin disparity discrepancy to characterize the differences between source and target domains, which has inspired our work. Ariza et al. [40] conducted a comparative study of several classification methods, which informed our experimental design. Additionally, Liu et al. [41] proposed the YOLOv3-FDL model for successful small crack detection from GPR images using a four-scale detection layer. Notably, the common approaches of Mancini et al. [34] and Rebuffi et al. [35] employ the ResNet [42] architecture, which offers benefits over architectures such as VGG [43] and AlexNet [44] by increasing abstraction through convolutional layers. Yang et al. [45] recently enriched FAS datasets from a different perspective to achieve multi-domain training, while Guo et al. [46] proposed a novel multi-domain model that overcomes the forgetting problem when learning new domain samples and exhibits high adaptability.

Domain Generalization
Domain adaptation (DA) and domain generalization (DG) are two fundamental methods used in FAS research. While the DG method mines the relationships among multiple domains, the DA method aims to adapt the model to a target domain. In this work, we propose a novel domain generalization method that introduces a new loss function inspired by the work of Motiian et al. [47], which encourages feature extraction in similar classes. To align multiple source domains for generalization, previous works such as Ghifary et al. [48] and Li et al. [49] have proposed autoencoder-based approaches. Our method follows a similar approach of learning a shared feature space across multiple source domains that can generalize to the target domain. Previous works such as Shao et al. [7], Saha et al. [50], Jia et al. [8], and Kim et al. [51] have also attempted to achieve this goal. Among them, the single-side adversarial learning method proposed in SSDG [8] is the work most related to ours. However, this end-to-end approach overlooks the relationships between different domains. To address the overfitting and generalization problems of generative adversarial networks, Li et al. [5] proposed a multi-channel convolutional neural network (MCCNN). Additionally, meta-learning formulations [52–54] have been utilized to simulate the domain shift during training to learn a representative feature space. However, recent works such as Wang et al. [6] have not adopted a domain-alignment approach but have instead increased the diversity of labeled data by reassembling different styles and content features in their SSAN method.

Overview
The proposed method aims to enhance the generalization ability of face anti-spoofing (FAS) by improving feature space generalization. Generalization to unseen domains is challenging, as samples from the target domain are usually not available during training. However, similarities exist between the features of the source domains and the target domain, and our approach aims to identify these similarities by aligning multiple source domains. To this end, we propose a multi-domain feature alignment framework called MADG, as shown in Figure 2. The feature generator extracts features that are then aligned in pairs by an adversarial learning process called the cross-domain alignment (CDA) module; multiple CDA modules work together to achieve multi-domain feature alignment. Furthermore, multi-directional triplet mining is conducted to better separate the distributions of real and fake faces in the feature space. MADG can thus consider the distribution differences between real and fake faces while aligning the feature space of multiple source domains, leading to improved classification performance in new, unseen domains. The final classifier is trained using cross-entropy loss.

Multi-Domain Feature Alignment
This study considers N source domains, each with a set of training samples denoted by X = {X_1, X_2, ..., X_N}, and their corresponding labels, Y. In the context of the face anti-spoofing task, the labels correspond to Y_r and Y_f for real and fake faces, respectively. However, due to significant distribution gaps between different domains, it is necessary to minimize the disparity in feature space. To address this challenge, we propose a multi-domain feature alignment method that consists of two key components: feature generators for generating features and multiple weighted adversarial learning modules combined for multi-domain feature alignment.
Pre-alignment Feature Generator and Alignment Strategies. We designed pre-alignment feature generators to transform sample data into features, as follows: the features extracted from mixed samples of real and fake faces are denoted as T_m, and their corresponding feature generators as G_m. Similarly, T_s and G_s represent the features and generators for samples containing only real or only fake faces. These features are known to contain a significant number of domain-relevant clues, which may adversely affect the generalization performance. Before feature alignment, two alignment strategies are devised for the features extracted from real and fake faces as well as from mixed samples. In the first strategy, real and fake faces from each domain are independently fed into the respective feature generator. In contrast, the second strategy involves feeding a mixture of real and fake faces from each domain into the feature generator. The two frameworks resulting from these strategies are referred to as MADG-S (single-sided alignment MADG) and MADG-M (mixed alignment MADG), respectively. We compare the outcomes of these two strategies in Section 4.3.4 to determine the optimal method.
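The two batching strategies can be illustrated with a minimal, framework-agnostic sketch; the function and dictionary names below are our own, chosen for illustration, and in the actual framework each stream would feed a ResNet-18-based feature generator:

```python
from itertools import chain

def make_generator_inputs(domains, strategy):
    """Group per-domain samples for the pre-alignment feature generators.

    domains maps a domain name to {"real": [...], "fake": [...]}.
    "single" (MADG-S) keeps one stream per class per domain (T_s / G_s);
    "mixed"  (MADG-M) keeps one mixed stream per domain (T_m / G_m).
    """
    if strategy == "single":
        # One entry per (domain, class) pair: real and fake fed separately.
        return {(name, cls): samples[cls]
                for name, samples in domains.items()
                for cls in ("real", "fake")}
    if strategy == "mixed":
        # One entry per domain: real and fake faces concatenated.
        return {name: list(chain(samples["real"], samples["fake"]))
                for name, samples in domains.items()}
    raise ValueError("strategy must be 'single' or 'mixed'")
```

With three source domains, MADG-S therefore maintains six feature streams, whereas MADG-M maintains three.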

Cross-domain Alignment Module.
To enhance the generalization ability of the feature space for the face anti-spoofing task, it is crucial to reduce the presence of domain-relevant cues in the extracted features. Aligning the features of multiple domains into a common feature space effectively reduces deviations from any specific domain, leading to better performance on unseen domains. This insight is akin to feature alignment techniques employed in domain adaptation.
With the aim of reducing the presence of domain-relevant cues and improving generalization performance, we propose a cross-domain feature alignment module to align the features of different domains. There is typically a substantial gap between the features of different domains, characterized by domain-relevant clues, which can be quantified using measures such as the margin disparity discrepancy (MDD) proposed by Zhang et al. [37]. By progressively reducing this gap, we can obtain an aligned feature space in which domain-specific characteristics become less pronounced. Specifically, our method considers two source domains, S1 and S2, and leverages the cross-domain alignment module to achieve this goal. MDD has the following key property:

err_S2(h) ≤ err_S1^(ρ)(f) + D(S1, S2) + λ,   λ = λ(ρ, F, S1, S2),

where h ∈ H denotes a label classifier and err represents the error rate of the classifier, while err^(ρ) is the margin error and ρ is the corresponding margin. The scoring function and hypothesis set are denoted by f and F, respectively. D denotes the margin disparity discrepancy, and λ = λ(ρ, F, S1, S2) is the ideal combined margin loss. The method proposed by Zhang et al. [37] is a domain adaptation method; err in the above inequality can be regarded as the generalization error in domain adaptation. From this inequality, we can see that the generalization error of S1 on S2 is upper-bounded. According to this property, we only need to optimize this upper bound to reduce the error and achieve feature alignment. To this end, we designed an adversarial learning module to align domain features, as depicted in Figure 3. The objective is to minimize MDD, and this process can be formulated as a minimax game, expressed as follows:

min_{f, ψ}  ε_Ŝ1(f ∘ ψ) + D(ψ(Ŝ1), ψ(Ŝ2)),
max_{f′}  disp_Ŝ2(f′, f) − γ · disp_Ŝ1(f′, f),

where Ŝ1, Ŝ2 are the samples drawn independently from distributions S1 and S2, respectively, and disp measures the disagreement between the two classifiers on a sample set. To strengthen the minimization process, we introduce a feature extractor ψ, and ε denotes the cross-entropy loss. f and f′ are classifiers that share the same hypothesis space, and γ denotes the margin factor related to the margin of f. Through this adversarial learning process, we obtain the minimized cross-domain alignment value cda between the two domains:

cda(S1, S2) = min_{f, ψ} max_{f′} M(Ŝ1, Ŝ2),

where M denotes the adversarial learning function (the minimax objective above).
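To make the quantity being minimized concrete, the sketch below estimates a simplified, margin-free (0–1) surrogate of the discrepancy from finite samples; it replaces the margin loss of [37] with plain disagreement counts, and the names are illustrative:

```python
def margin_disparity_discrepancy(f, f_aux, s1, s2):
    """0-1 surrogate of MDD estimated from samples of two domains.

    f is the main classifier and f_aux plays the role of the auxiliary
    classifier f'; both map a sample to a hard label. The discrepancy is
    the disagreement of f_aux with f on S2 minus its disagreement on S1;
    in adversarial training the auxiliary classifier is driven to
    maximize this quantity and the feature extractor to minimize it.
    """
    def disparity(samples):
        # Fraction of samples on which the two classifiers disagree.
        return sum(f_aux(x) != f(x) for x in samples) / len(samples)
    return disparity(s2) - disparity(s1)
```

When f and f_aux agree everywhere, the surrogate is zero, which corresponds to the aligned regime the CDA module is trained toward.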

Multi-domain Alignment Loss.
A multi-domain alignment loss is designed by combining all the obtained cross-domain alignment values cda, denoted as:

L_Mdalign = Σ_{i<j} λ_{i,j} · cda(S_i, S_j),

where λ_{i,j} is used to control the weight of the alignment of each domain pair. Given that the disparity between various domains is diverse, choosing an appropriate value for λ can significantly affect the overall alignment effect and, therefore, the generalization ability of the network. We adjusted the value of λ on a specific task protocol (O&M&I to C) for comparison; see Section 4.3.2 for details. Adversarial learning is employed to achieve the alignment, where two identical classifiers, denoted as f and f′, are trained to perform the same classification task on the aligned features. Adversarial learning is formulated as a minimax game. During the maximization process, the gradient of the input feature is inverted before being fed to the classifier to ensure the alignment of the two domains. The optimized result is denoted as "cda" at the end of the process.
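Given a routine that returns the minimized cda value for a pair of domains, the combination above amounts to a weighted sum over all domain pairs; a minimal sketch (the names are our own):

```python
from itertools import combinations

def multi_domain_alignment_loss(features, cda, weights=None):
    """Weighted sum of pairwise cross-domain alignment values.

    features: dict mapping a domain name to its feature batch;
    cda:      callable returning the alignment value for two batches;
    weights:  optional {(i, j): lambda_ij}, defaulting to 1 per pair.
    """
    weights = weights or {}
    total = 0.0
    # Enumerate each unordered domain pair exactly once.
    for i, j in combinations(sorted(features), 2):
        total += weights.get((i, j), 1.0) * cda(features[i], features[j])
    return total
```

With three source domains this yields the three pairwise terms O&M, O&I, and M&I whose weights are compared in Section 4.3.2.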

Multi-Directional Triplet Mining
In the feature spaces of different domains, the disparity between real and fake faces is often less pronounced within domains than across them. This is due to the relatively strong similarities between the attack samples corresponding to each real sample, which can arise from variations in data collection techniques and attack methods. Our framework aims to learn a compact feature space for each class; nonetheless, real faces tend to scatter in the feature space. To overcome this challenge, we partition the feature spaces of real and fake faces and seek to compress all real faces into a compact feature space. To this end, we utilize the multi-directional triplet loss proposed in SSDG [8]. This loss enables us to mine triplet relationships between features, which can improve the performance of the generalized feature space. The optimization of the loss is expressed as follows:

L_Trip = Σ_i max( ‖f(x_i^a) − f(x_i^p)‖_2^2 − ‖f(x_i^a) − f(x_i^n)‖_2^2 + α, 0 ),

where α is the margin we need to define beforehand, x_i^a is the anchor, x_i^p is a positive example, and x_i^n is a negative one.
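In the standard hinge form, the mined triplets contribute as below; this plain-Python sketch operates on already-extracted feature vectors and assumes squared Euclidean distances:

```python
def multi_directional_triplet_loss(triplets, alpha):
    """Hinge-form triplet loss over mined (anchor, positive, negative) triplets.

    Each triplet holds three equal-length feature vectors (x_a, x_p, x_n);
    alpha is the margin defined beforehand. The loss pulls the positive
    toward the anchor and pushes the negative at least alpha further away.
    """
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return sum(max(sq_dist(a, p) - sq_dist(a, n) + alpha, 0.0)
               for a, p, n in triplets)
```

Triplets that already satisfy the margin contribute nothing, so only violating triplets shape the gradient.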

Loss Function
As shown in Figure 2, our framework employs a face anti-spoofing classifier that combines triplet feature generators after aligning all source domains. The final loss to optimize is a cross-entropy loss based on a hybrid domain with multi-domain alignment and category separation by triplet mining, denoted as L_cls. Once all components are integrated, the overall objective of the proposed multi-domain feature alignment framework can be written as follows:

L = L_cls + λ_1 · L_Mdalign + λ_2 · L_Trip,

where the weighting factors λ_1 and λ_2 adjust the balance between the process of aligning multiple domains and separating subclasses between different domains. The values of these factors govern the relative importance of each objective in the overall optimization process.
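The combination is a plain weighted sum; using the λ values reported in our training settings (λ_1 = 30, λ_2 = 1) as defaults, a sketch (the function name is ours):

```python
def madg_total_loss(l_cls, l_mdalign, l_trip, lam1=30.0, lam2=1.0):
    """Overall objective: classification loss plus the weighted
    multi-domain alignment and multi-directional triplet terms."""
    return l_cls + lam1 * l_mdalign + lam2 * l_trip
```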

Databases and Protocols
To assess the effectiveness of the proposed approach, we conducted experiments on four publicly available face anti-spoofing datasets, namely Idiap Replay-Attack [28] (referred to as I), CASIA-FASD [29] (referred to as C), MSU-MFSD [30] (referred to as M), and OULU-NPU [31] (referred to as O). Real and attack face samples from these datasets are shown in Figure 4. Notably, the datasets have significantly different compositions, with variations in display devices, attack types, illumination, and background complexity. For instance, the display devices in I are the iPhone 3GS and iPad, while C uses the iPad, M utilizes the iPad Air and iPhone 5S, and O deploys the Dell 1905FP and MacBook Retina. Moreover, the attack types vary across the datasets, with I and O containing replayed photo samples, C having cut photo samples, and all datasets including printed photos and replayed videos. The datasets also exhibit differences in illumination, with I and O having extra light, while C and M do not. Additionally, the datasets have varying levels of background complexity, with I, C, and M having a complex background, while O has a less complex background. These differences contribute to significant inter-domain differences in the feature space of the datasets, making them suitable for evaluating the efficacy of the proposed approach.
Therefore, to provide a comprehensive evaluation, we employed the leave-one-out strategy, in which one of the databases was designated as the test set and the remaining three were used as the training set. Thus, four task protocols were established: O&C&I to M, O&M&I to C, O&C&M to I, and I&C&M to O. However, to further examine the method's effectiveness under various multi-domain scenarios, additional protocols need to be designed for controlled experiments on the method's performance. Thus, we designed further protocols analogous to the four above, resulting in a total of 12 additional task protocols. This comprehensive approach enabled us to analyze and compare the proposed method's performance in various multi-domain scenarios.
Figure 4. Samples from Idiap Replay-Attack [28], CASIA-FASD [29], MSU-MFSD [30], and OULU-NPU [31]. Real face samples are demarcated by blue borders, while attack face samples are marked by red borders. The observed variations among the domains are attributed to distinct sampling environmental conditions, emphasizing the need for the design of multiple task protocols to evaluate generalization performance.

Evaluation Metrics
To evaluate the effectiveness of our model, we follow a similar evaluation approach to that of SSDG [8], the most relevant work in this field. The two evaluation metrics we use are HTER and AUC. HTER is calculated as half of the sum of the FAR and FRR, which represent the rates of falsely accepting an attack and falsely rejecting a genuine user, respectively. This metric is widely used in liveness detection to measure the model's ability to distinguish between live and fake faces. AUC, in turn, is a commonly used evaluation metric in classification tasks that measures the area under the ROC curve. The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) for different classification thresholds. A higher AUC value indicates better classification performance, as the model can achieve a high TPR while maintaining a low FPR. Together, these metrics provide a comprehensive evaluation of the performance of our model and allow comparison with other state-of-the-art methods.
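Both metrics are straightforward to compute from classifier scores; the sketch below uses the pairwise-ranking (Mann-Whitney) formulation for AUC and assumes higher scores indicate real faces (the threshold-selection policy is not shown):

```python
def hter(scores, labels, threshold):
    """Half total error rate: mean of FAR (attacks accepted) and
    FRR (genuine users rejected) at the given threshold.
    labels: 1 for a real face, 0 for an attack."""
    attacks = [s for s, y in zip(scores, labels) if y == 0]
    reals = [s for s, y in zip(scores, labels) if y == 1]
    far = sum(s >= threshold for s in attacks) / len(attacks)
    frr = sum(s < threshold for s in reals) / len(reals)
    return (far + frr) / 2

def auc(scores, labels):
    """Area under the ROC curve via pairwise ranking of real vs. attack scores."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Ties between a real and an attack score count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A perfect separator gives HTER = 0 and AUC = 1; a chance-level classifier gives HTER ≈ 0.5 and AUC ≈ 0.5.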

Data Preprocessing
To prepare the datasets for our experiments, it was necessary to preprocess both real and fake face images from the original video data. The preprocessing involved extracting frames at random intervals from each video clip and applying the MTCNN [55] method to identify and crop the faces in the resulting images. The cropped faces were then rotated and normalized to create input data of size 256 × 256 × 3. This process ensured that the RGB channel was the sole source of information used during training, which simplified the framework's complexity. The preprocessing step was critical for enabling our model to accurately learn from both the real and fake faces and was essential for ensuring the robustness and reliability of our experimental results.

Network Structure
Our framework was implemented in PyTorch. The details of our network structure are shown in Tables 1–3.

Details of Feature Generator.
To construct the feature generator, we employed ResNet-18 [42] as the backbone. The feature generator was designed by following the ResNet architecture, which includes a convolutional layer with a 7 × 7 convolution kernel and a max pooling layer at the head. The remaining convolutional layers utilize 3 × 3 kernels and are accompanied by a batch normalization (BN) layer to improve the stability and speed of network convergence. Some of the convolutional layers use a rectified linear unit (ReLU) activation function, as indicated in the table by conv2, while conv1 lacks the ReLU layer. We chose the ResNet-18 model due to its ability to strike a balance between performance and computational complexity. By incorporating ResNet-18, our feature generator can effectively learn and extract high-level features from facial images.
Details of Feature Embedder and Classifier. The residual block utilized in this section features the same structure as the one employed in the feature generator. The classifier comprises a linear model and fully connected layers, with a bottleneck fully connected layer added prior to the classifier.
Details of Alignment Classifier. An adversarial learning process is employed in this work to align the source domains, with a network functioning as an alignment classifier. The classifier is composed of multiple components, including a bottleneck layer, a head layer, an adversarial head layer, and a gradient reversal layer (GRL). The bottleneck and head layers consist of fully connected layers, as do the adversarial head layers. During backpropagation, the GRL layer inverts the gradients, enabling the network to learn domain-invariant features.
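The GRL admits a compact PyTorch implementation as a custom autograd function; the sketch below is the standard construction (the coefficient and its schedule are an illustrative choice of ours):

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies gradients by -coeff in the
    backward pass, so the feature generator preceding the layer is trained
    to confuse the alignment classifier that follows it."""

    @staticmethod
    def forward(ctx, x, coeff):
        ctx.coeff = coeff
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # No gradient flows to coeff itself, hence the trailing None.
        return -ctx.coeff * grad_output, None

def grad_reverse(x, coeff=1.0):
    return GradReverse.apply(x, coeff)
```

Placed between the bottleneck and the adversarial head, this single layer turns ordinary backpropagation into the minimax game described above.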

Training Setting
The optimization of the framework was performed using stochastic gradient descent (SGD) with a momentum of 0.9 and a weight decay of 5 × 10⁻⁴. The initial learning rate was set to 1 × 10⁻² and gradually decreased to 1 × 10⁻⁵. To accommodate the limited GPU memory, a batch size of 10 was used for each domain, resulting in a batch size of 30 for all three domains during training. We set the values of λ_1 and λ_2 to 30 and 1, respectively. Furthermore, the λ_{i,j} values used to constrain the multi-domain alignment process were all set to 1. These weights are defined in Section 3.

Testing Setting
During the testing phase, the performance of the trained model was assessed on new test samples x by feeding them into the model for classification. The classification outcome is denoted as l and obtained via l = G(F(x)), where G represents the trained model, and F refers to the feature generator utilized during training. The classification results on the testing set were used to evaluate the model's generalization ability, which is a crucial factor for machine learning models. Furthermore, the testing results can be utilized to compare the performance of our proposed method with state-of-the-art methods currently available.

Ablation Study
To assess the individual contributions of each component in our proposed framework, we performed various ablation experiments. We report the findings of these experiments in Table 4. Our proposed method is denoted as MADG, which includes the multi-domain feature alignment loss (L_Mdalign), the multi-directional triplet loss (L_Trip), and the cross-entropy loss.
The outcomes presented in Table 4 highlight the substantial influence of each component of our proposed framework. Ablating any part of the framework led to a noticeable deterioration in classification performance, which affirms the effectiveness of each component. Specifically, the experiment without L_Trip corresponds to deactivating the triplet loss during both forward and backward propagation, while the experiment without L_Mdalign deactivates the multi-domain alignment component.

Comparisons of Different Alignment Weights
The cross-domain feature alignment module is a fundamental component of our proposed multi-domain feature alignment method. By integrating this module into the training process, we can effectively align the features of two source domains. Our multi-domain alignment strategy involves aligning multiple source domains in pairs and controlling these domain pairs by weight. Furthermore, we align multiple source domains simultaneously. For instance, in the scenario involving three source domains, denoted as O, M, and I, we simultaneously align O&M, O&I, and M&I, with each alignment procedure constrained by a weight. To assess the impact of varying weights on the model's generalization performance, we conducted a series of experiments using the O&M&I to C task protocol. The weights were adjusted using a control variable method, where only one weight λ was varied while the others were held constant at 1. The experimental results, depicted in Figure 5, reveal the impact of the weights on the model's performance.
By adjusting the weights assigned to the alignment process of each pair of domains, we can evaluate the impact of each domain on both the overall generalization performance and the inter-domain feature characteristics. As depicted in Figure 5, the alignment process of the O&I domain pair exhibits the most significant impact on the generalization performance. Specifically, as the weight of the O&I alignment process (λ_OI) increases, the generalization performance notably decreases, indicating a significant dissimilarity between the domain-relevant features of O&I and M. In contrast, the optimal generalization performance can be achieved by increasing the weight of the O&M alignment process (λ_OM).

Experiments on Limited Source Domains
Limited Source Domain Protocols. In addition to the leave-one-out experiment, we performed experiments in which we trained the model using a subset of the available source domains. More specifically, we selected two of the three available source domains and used them as the training set. This approach yielded 12 distinct task protocols: O&M to C, O&I to C, M&I to C, C&M to O, C&I to O, M&I to O, C&O to M, C&I to M, I&O to M, C&M to I, C&O to I, and O&M to I. The experimental results of SSDG (baseline), MADG-M (ours), and MADG-S (ours) on these 12 task protocols are shown in Table 5, where the best result for each protocol is shown in bold.
Analysis of Limited Source Domain Experiments. The impact of training set diversity and data quantity on the generalization performance of the proposed framework is demonstrated in Tables 4 and 5. As indicated in Table 5, the generalization performance of the two-source domains is generally lower than that of the three-source domains. Additionally, significant variations in the experimental results of the 12 task protocols are observed (Table 5), which is likely due to differences in data characteristics across domains. Therefore, the domain-relevant clues between the source and target domains may have a lesser effect on some task protocols, yet still attain good generalization performance. Furthermore, a comparison with other state-of-the-art methods on limited source domains is presented in Section 4.4.1. Moreover, Table 5 shows that the limited source domain task protocol is impacted differently by various alignment strategies. The subsequent Section 4.3.4 provides further comparisons of these alignment strategies.

Comparison of Different Alignment Strategies
To provide a comprehensive comparison of the generalization effects of the two alignment strategies implemented in the MADG-M and MADG-S frameworks, we conducted experiments on 12 limited source domain task protocols and 4 leave-one-out task protocols. To ensure a fair evaluation, we compared all of these protocols with the baseline method. The results of the limited source domain experiments are presented in Table 5, while the results of the complete source domain experiments are shown in Table 6.
Based on the experimental results presented in Tables 5 and 6, it is apparent that our proposed alignment strategies consistently outperform the baseline method in most task protocols. However, the improvement over the baseline is smaller in the limited source domain experiments than in the experiments conducted with three source domains. This finding suggests that our method performs better with a larger number of source domains, indicating greater potential in scenarios with more source domains. Additionally, the alignment effect of our method can be influenced by the distribution of features from the source domains. Therefore, the performance of the different alignment strategies varies across task protocols, each with its distinct set of strengths and weaknesses, as revealed by the experimental results. These observations are discussed in more detail in Section 4.4.2.

Table 6. Comparison of different alignment strategies. The header is the baseline method and ours; the bold represents the optimal result.
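The tables report results in HTER and AUC. For reference, a minimal sketch of how these two metrics are commonly computed in face anti-spoofing evaluation (labels: 1 = real, 0 = attack; the function names are ours):

```python
import numpy as np

def hter(scores, labels, threshold=0.5):
    """Half Total Error Rate: mean of the false acceptance rate (attacks
    scored as real) and false rejection rate (real faces scored as attacks)
    at a fixed decision threshold."""
    scores, labels = np.asarray(scores, float), np.asarray(labels)
    far = np.mean(scores[labels == 0] >= threshold)  # attacks accepted
    frr = np.mean(scores[labels == 1] < threshold)   # real faces rejected
    return (far + frr) / 2

def auc(scores, labels):
    """Area under the ROC curve via the rank statistic: the probability
    that a random real face scores above a random attack."""
    scores, labels = np.asarray(scores, float), np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()  # ties count as half
    return greater + 0.5 * ties
```

Lower HTER and higher AUC indicate better cross-domain performance.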

Comparison with State-of-the-Art Methods on Limited Source Domains
As shown in Table 7, our method outperforms the previous state-of-the-art method SSAN [6] on the limited source domain task protocols. The objective of SSAN was to enhance generalization by reorganizing content and style features. In comparison, our method achieves a significant improvement by offering a source domain alignment method that improves generalization performance when training on two source domains. Since the multi-domain alignment strategy is not involved in two-source-domain alignment, it is easier to train a well-aligned, generalized feature space. The method demonstrates considerable advantages in cases of insufficient data and maintains good performance even with a small training set, making it valuable for practical applications.

Table 7. Comparison with state-of-the-art methods on limited source domains. The header is the task protocol, and the bold represents the optimal result.

Comparison with Baseline Methods
To assess the effectiveness of our proposed method in improving classification performance, we compared it with the SSDG method [8], a one-sided domain generalization approach that discriminates between real and fake faces to achieve a certain generalization performance. To bring samples from different domains closer together, we incorporated a feature alignment algorithm before triplet mining and combined it with the domain-aligned feature space to improve the aggregation of the aligned real and fake faces. Our method provides two different alignment strategies to account for the large differences in source data distribution across task protocols. The comparison results of SSDG and our method are presented in Table 8.

The SSDG method is considered a strong baseline and outperforms most state-of-the-art methods, as demonstrated in Table 8. While SSDG aims to achieve generalization across multiple domains by focusing on domain differences and utilizing asymmetric optimization goals for real and fake faces, it may not effectively capture domain-relevant features. In contrast, our proposed MADG method leverages asymmetric optimization while simultaneously considering alignment across multiple domains. The superior performance of our method, as shown in Table 6, suggests that source domain alignment can lead to improved generalization results. Furthermore, our method's ability to incorporate multiple alignments accounts for the diverse distribution of source data across different task protocols.

Table 8 presents a comparative analysis of our proposed MADG method with state-of-the-art approaches under the leave-one-out protocols. The best performance in each column is highlighted in bold. The table demonstrates the effectiveness of our approach: it outperforms all other methods, with the closest competitor being the SSDG [8] approach, which relies on discriminating real faces from fake ones to achieve generalization.
In contrast, other methods, such as [7,9,10,15,17,30,31,49], perform significantly worse than our method, likely due to their inability to generalize effectively across multiple domains. Although both the MADDG [7] and SSDG methods adopt domain generalization to learn discriminative cues, MADDG struggles to optimize real and fake faces simultaneously, while SSDG disregards domain-relevant features. Specifically, SSDG relies on the feature distribution extracted from fake faces being broader than that of real faces, whereas our MADG approach focuses on feature alignment across multiple domains while considering the distribution discrepancies between fake and real faces. This alignment capability enables our approach to achieve superior generalization performance while minimizing the impact of domain-specific features. Furthermore, different alignment methods may produce varying outcomes for different task protocols, depending on the source domain distribution.
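The asymmetric optimization discussed here can be illustrated with a toy triplet formulation. The sketch below assumes one simple mining scheme — real faces from any source domain serve as anchors and positives, while fake faces serve as negatives, so real faces aggregate across domains and fakes are pushed away. This is not the paper's exact multi-directional triplet loss; all names and the margin value are hypothetical.

```python
import numpy as np

def triplet_term(anchor, positive, negative, margin=0.5):
    """Hinge triplet term: pulls the anchor toward the positive and
    pushes it away from the negative by at least `margin`."""
    d_ap = np.linalg.norm(anchor - positive)
    d_an = np.linalg.norm(anchor - negative)
    return max(0.0, d_ap - d_an + margin)

def asymmetric_triplet_loss(real_feats, fake_feats, margin=0.5):
    """Illustrative asymmetric mining: every real face (from any source
    domain) is an anchor, its positive is another real face, and its
    negative is a fake face. Real faces thus aggregate across domains
    while fake faces are repelled from the real cluster."""
    loss, count = 0.0, 0
    for i, a in enumerate(real_feats):
        for j, p in enumerate(real_feats):
            if i == j:
                continue
            for n in fake_feats:
                loss += triplet_term(a, p, n, margin)
                count += 1
    return loss / max(count, 1)
```

When the real cluster is compact and far from the fake samples, the hinge terms vanish and the loss is zero; fakes lying near the real cluster produce a positive penalty.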

Feature Visualization
We employed t-SNE, a technique for visualizing high-dimensional data in a low-dimensional space, to analyze the feature space. Figure 6 presents the comparison results. Specifically, we randomly selected 200 samples from the four databases and visualized the feature space learned by MADG alongside the feature spaces learned by the models in the ablation experiments.
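The projection step itself is straightforward; a minimal sketch using scikit-learn's t-SNE (the helper name and parameter choices are ours, and the input stands in for the embeddings produced by the trained model):

```python
import numpy as np
from sklearn.manifold import TSNE

def embed_features_2d(features, seed=0):
    """Project high-dimensional feature vectors to 2-D with t-SNE for
    visualization. `features` is an (n_samples, dim) array of learned
    embeddings (in the paper, 200 samples drawn from the four databases).
    The returned (n_samples, 2) array is then scatter-plotted, colored
    by source domain and by real/fake label."""
    n = len(features)
    tsne = TSNE(
        n_components=2,
        perplexity=min(30, max(5, n // 4)),  # must stay below n_samples
        init="pca",
        random_state=seed,
    )
    return tsne.fit_transform(np.asarray(features, dtype=np.float64))
```

A fixed `random_state` keeps the layout reproducible across the MADG and ablation plots, so differences in cluster compactness reflect the features rather than the projection.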
The scatter plot on the right illustrates the feature space learned by MADG: the distribution of feature vectors is more compact and well separated, indicating that the proposed method is effective in learning discriminative features for face anti-spoofing. In contrast, the plot on the left represents the ablation model trained without the key multi-domain feature alignment module, L_Mdalign. In this case, the feature vectors are scattered and poorly separated, highlighting the importance of this module for achieving good performance. Our proposed method achieves a better generalized feature space than the model without the multi-domain feature alignment module, and the feature cluster of real faces is more compact, demonstrating the effectiveness of the proposed method in aligning the feature distributions of real and fake faces.

Figure 6. t-SNE visualization of the feature space of the high-dimensional data, enabling exploration of the relationship between the input features and the output labels. A subset of 200 samples was selected to visualize the feature distribution. As shown in the scatter plot on the right, the proposed MADG method effectively learns discriminative features that enable clear separation between different domains. In contrast, the visualization of the ablation study, which excludes the multi-domain feature alignment module, exhibits a more disorganized feature distribution. Comparing the two visualizations shows that the proposed method achieves better generalization and effectively aligns the feature distributions of real and fake faces across multiple domains.

Analysis of Misclassified Samples
To conduct a comprehensive analysis of our model's performance, we examined misclassified sample frames, as presented in Figure 7. These samples were taken from the test set of the O&C&I to M task protocol, which corresponds to the test data of MSU-MFSD [30]. This dataset includes printed photo and replayed video attacks collected from 35 individuals with Android or laptop cameras. The experiments utilized preprocessed face crops.
The left box of Figure 7 displays the misclassified attack samples. In the tests, the misclassified attack faces all belonged to the same individual, and for this individual, samples of both attack types under all collection methods were misclassified. We believe that such errors, which occurred only for the attack samples of this individual, are mainly related to individual characteristics.
In contrast, the misclassified real faces in our test involved many individuals, as shown in the right-hand box. This suggests that our model is less effective at classifying real faces than attack faces, indicating a certain level of overfitting. Several factors may contribute to this: real faces enter the network as planar images with the same dimensions as attack faces; real face videos contain artificial actions, such as expressions and motion; and ornaments and lighting on the faces weaken the apparent authenticity of the samples. Examining these classification errors can provide insights for future research.

Figure 7. A collection of misclassified samples obtained from the O&C&I to M task protocol. The left-hand box displays misclassified attack faces, all sourced from a single individual, whose attack samples across both attack types and all collection methods were misclassified. The right-hand box shows misclassified real face samples, which originated from videos captured with an Android or laptop camera.

Conclusion of Experiments
We designed a set of experiments to evaluate the effectiveness of the proposed multi-domain feature alignment framework (MADG). First, we performed an ablation study to assess the contribution of each component to the framework's classification performance, and the results demonstrate that each component played a significant role in improving performance. Furthermore, we examined the impact of varying weights on the multi-domain alignment process and found that the alignment process of the O&I pair interfered most strongly with the generalization performance in the O&M&I to C protocol. This observation implies that cross-domain alignment between two domains can significantly affect the overall alignment process due to differences in the feature space. The effectiveness of the alignment module was also demonstrated through comparison. Additionally, we evaluated the performance of MADG on limited source domains and found that the generalization performance with two source domains was generally inferior to that with three source domains, indicating the potential of MADG to enhance classification performance in multi-domain scenarios. Our comparison of different alignment strategies revealed that at least one alignment strategy outperformed the baseline method in most task protocols. Finally, we validated the superiority of MADG by comparing it with existing state-of-the-art methods.

Conclusions
In this paper, we presented a novel approach to improving the generalization capability of face anti-spoofing: a multi-domain feature alignment domain generalization framework (MADG). To the best of our knowledge, this is the first work to introduce margin disparity discrepancy into domain generalization. Our approach modularizes the alignment algorithm and proposes two multi-domain alignment strategies to enhance the performance of multi-domain alignment. We combine a multi-directional triplet loss with the multi-domain alignment module, enabling the effective separation of the real and attack face distributions in the feature space. Our method outperforms previous approaches that align the entire source domain without considering the distribution characteristics of real and attack faces, or that only extract attack face features. We conducted a comprehensive set of experiments on four public databases to validate the effectiveness of our method, and the results demonstrate its superior performance over state-of-the-art approaches.
However, our work also has some limitations that call for future improvements. Although our method makes significant progress in feature alignment, it is only an initial attempt and is limited to the exploration of multi-domain problems. Recent studies have shown that introducing auxiliary supervision information can effectively address the challenges posed by multi-domain problems. Therefore, future work can center on multi-domain problems and the integration of auxiliary supervision information to further enhance the generalization ability of face anti-spoofing. Overall, our study provides a promising direction for future research on face anti-spoofing.

Conflicts of Interest:
The authors declare no conflict of interest.