1. Introduction
Biometric methods are widely used to identify individuals based on unique biological and behavioral characteristics. One area where biometrics, especially Face Recognition Systems (FRSs), have gained considerable traction is international airport security. By leveraging biometric data stored in electronic Machine-Readable Travel Documents (eMRTDs) [1], an FRS enables authorities to compare the facial image stored in a passport with real-time images of travelers. Despite the high accuracy of state-of-the-art FRSs in controlled scenarios, several studies have shown that they are vulnerable to many image modification attacks. One of the most dangerous threats is the face morphing attack (FMA), which seamlessly blends multiple face images into a new facial image containing the facial features of all contributing subjects. In the context of biometrics, face morphing poses a serious risk, as it enables the creation of synthetic identities that can bypass security measures. Producing high-quality, convincing morphed images requires removing artifacts and abnormal pixels so that the result is visually similar enough to the contributing subjects to convince the officer.
An FMA should maximize both the probability that the human officer accepts the morphed image at the enrollment stage and the probability that the attacker is matched to it at the verification stage. In the enrollment stage, the attacker finds an accomplice and morphs the two faces together; the accomplice then uses the manipulated image to apply for a passport or another form of electronic travel document. Once the electronic travel document containing the morphed image is issued to the accomplice, the criminal can use it to pass border controls, whether they are automated or manually operated. Such an attack exploiting the vulnerability of FRSs was first described in [2]. Numerous face morphing attack methods [3,4,5,6,7] have since been proposed, further highlighting the vulnerability of FRSs.
To counter FMAs, many face morphing attack detection (FMAD) strategies have been proposed to identify morphed facial images. Two prominent approaches are single-image-based FMAD (S-FMAD) [8,9] and differential-image-based FMAD (D-FMAD) [10,11,12,13,14]. While S-FMAD evaluates an individual image to determine whether it has been morphed, D-FMAD compares the suspect image with a trusted probe image. Among D-FMAD methods, de-morphing, first introduced by Ferrara et al. [15], has emerged as a promising one. In this approach, the trusted live capture (TLC) is used to revert (de-morph) a potentially morphed image. Essentially, the TLC is subtracted from the suspect image with a predefined weight (the de-morphing factor). The resulting de-morphed face image is then compared with the TLC using an FRS, and the face recognition score between the TLC and the de-morphed image serves as the final FMAD score. If this comparison yields a non-match, a morphing attack is detected; otherwise, the authentication attempt is considered bona fide. Later, Ortega-Delcampo et al. [16] introduced autoencoders to restore accomplices’ facial images for detecting morphing. Banerjee et al. [17] restored the facial images of two contributors from a single morphed image. Shiqerukaj et al. [18] combined de-morphing with Deep Face Representation. Peng et al. [19] utilized a symmetric dual network and restoration losses for accomplice image restoration. Min Long et al. [20] proposed a diffusion-based method focusing on accomplice image reconstruction.
Although some previous methods perform reasonably well, their practical utility in real-world scenarios remains limited. Environmental factors such as varying lighting conditions, facial expressions, and image resolutions can significantly affect their detection accuracy. In de-morphing methods specifically, the morphing factor, which represents the weight of the attacker’s contribution, is unknown and significantly influences performance. Because of this limitation, Ferrara et al. [15] proposed a practical range of morphing factors for successful enrollment with a landmark-based morphing method and tested landmark-based de-morphing with the same range of blending weights. Their experimental results showed that de-morphing itself is effective, but that its performance varies with the combination of the two (morphing and de-morphing) blending parameters. Estimating the morphing factor from the suspect image (morphed or bona fide) and the TLC is a mathematically ill-posed problem. Furthermore, recent deep learning-based morphing methods use diverse blending parameters, which makes estimating the morphing contribution difficult and impractical.
Our primary goal is to classify whether the suspect image is morphed or bona fide, rather than to reconstruct the accomplice’s face from the morphed image. We therefore aim to reduce the reliance of existing de-morphing methods on prior knowledge of how the morph was generated and to improve the efficiency of morphing attack detection. To achieve this, we train a neural network to recognize the similarity score patterns of de-morphed images across different contribution factors. As we will show in Section 5, the similarity score between the de-morphed image and the live capture varies with the de-morphing factor in different patterns for morphed and non-morphed images. Inspired by these differences, the proposed detection pipeline learns the variation patterns of similarity scores between the live capture and face images de-morphed with different de-morphing factors. An effective deep de-morphing network based on StyleGAN and the pSp (pixel2style2pixel) encoder [21] was developed. The network generates de-morphed images from suspect and live images with multiple de-morphing factors; similarity scores are then calculated between feature vectors from the ArcFace network and classified by the detection network.
The main contributions are as follows:
We propose a D-FMAD method using a neural network that learns the patterns of similarity scores between TLC images and images de-morphed with different contribution factors.
We propose a simple yet effective and efficient deep learning-based face morphing and de-morphing method utilizing pixel2style2pixel [21], which requires no further fine-tuning. We use this de-morphing method in our similarity pattern-based D-FMAD pipeline.
We conduct experiments and analysis on the created FMAD databases and the SYN-MAD dataset. The results demonstrate that the proposed similarity pattern-based detection method outperforms the existing FMAD methods in detecting unseen morphing attacks across different datasets.
The rest of this paper is organized as follows: Related works are reviewed in Section 2. Section 3 provides a detailed description of the proposed FMAD. The FMAD dataset is described in Section 4. The experiments and analysis are presented in Section 5. Finally, some conclusions are drawn in Section 6.
3. Proposed FMAD Method
Figure 1 illustrates the pipeline of the proposed face morphing attack detection method. The pipeline consists of four main stages: a (face feature) encoder network, a latent space blending block, a face decoder (StyleGAN generator), and an ML-based decision stage with a similarity score evaluator. Latent vectors extracted by the encoder network from the suspect (morphed or bona fide) image and the TLC are blended in the StyleGAN $\mathcal{W}+$ latent space at multiple de-morphing factors. The resulting latent codes are fed into the StyleGAN generator to produce the corresponding de-morphed images. These de-morphed images are compared with the TLC image using ArcFace features, and the resulting similarity scores are passed to a machine learning-based classifier that distinguishes bona fide attempts from morph attacks.
It is noteworthy that we neither propose a new encoder nor retrain StyleGAN. Instead, we leverage the strengths of existing models to carry out multiple de-morphing processes, thereby enabling the detection of morphed images. In this study, we employ the pSp encoder [21] with a StyleGAN-based decoder and ArcFace face similarity to benchmark the performance of the proposed method. However, the proposed similarity pattern-based method can be applied to other similar implementations.
Since Generative Adversarial Networks (GANs) were introduced in 2014, many improvements have made them a state-of-the-art approach to generating synthetic images, including synthetic human faces. However, control over the generator initially received little attention. Face morphing and de-morphing require controllability of facial features such as pose, hair color, and eye color. StyleGAN provides excellent hierarchical control, enabling the development of numerous reconstruction methods [21,46,47] and deep learning-based face morphing methods [5,6].
In the StyleGAN model, a latent code $z$ in the input latent space $\mathcal{Z}$ is first transformed into $w$ in the intermediate latent space $\mathcal{W}$ using a nonlinear mapping network $f: \mathcal{Z} \rightarrow \mathcal{W}$, implemented as an 8-layer MLP. The dimensionality of both spaces is set to 512. Learned affine transformations then convert $w$ into styles $y = (y_s, y_b)$, which control the adaptive instance normalization (AdaIN) operations after each convolution layer of the synthesis network $G$. Additionally, the generator introduces explicit noise inputs to generate stochastic detail. These noise inputs are single-channel images of uncorrelated Gaussian noise, fed to each layer of the synthesis network and added to the output of the corresponding convolution through learned per-feature scaling factors.
Image-level morphing is achieved by linearly combining latent vectors. This approach is motivated by the well-established properties of StyleGAN’s latent space, in which linear interpolation between latent codes often results in smooth and semantically meaningful transitions in the image space. Specifically, given two latent vectors $w_1$ and $w_2$, a weighted combination $w = \alpha w_1 + (1 - \alpha) w_2$ with $\alpha \in [0, 1]$ produces an intermediate latent code $w$ that corresponds to a smooth morphing of the images generated from $w_1$ and $w_2$. While a rigorous mathematical derivation of the resulting image is generally intractable due to the inherent nonlinearity of the generator, extensive empirical evidence in the StyleGAN literature supports the validity and effectiveness of linear operations in $\mathcal{W}$-space.
Since the morphing and de-morphing processes operate on existing (real) face images rather than simply generating artificial ones, we need to obtain the latent codes of those face images. GAN inversion aims to invert a given image back into the latent space of a pre-trained GAN model so that the image can be faithfully reconstructed from the inverted code by the generator [48]. We compared previous StyleGAN inversion methods, focusing on reconstruction performance after latent space operations, and found that the pSp encoder [21] performs best for morphing and de-morphing purposes. The pSp (pixel2style2pixel) framework builds on a pre-trained StyleGAN and its $\mathcal{W}+$ latent space. Rather than using explicit noise input, this model learns the latent code relative to the average style vector $\bar{w}$. To match the different levels of detail in StyleGAN, the pSp encoder $E$ extends the backbone with a feature pyramid and map2style networks, generating 18 style vectors. The encoder can thus map each input image to a code in the latent domain and provides a strong representation.
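The linear blending of latent codes described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function name `blend_latents`, the nested-list representation of a W+ code, and the shortened vector length are all assumptions made for readability.

```python
def blend_latents(w1, w2, alpha):
    """Return alpha * w1 + (1 - alpha) * w2 for two W+ codes.

    Each code is a list of style vectors (18 for pSp), each a list of floats.
    The name and data layout are illustrative assumptions.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(w1) != len(w2):
        raise ValueError("both codes need the same number of style vectors")
    return [
        [alpha * a + (1.0 - alpha) * b for a, b in zip(s1, s2)]
        for s1, s2 in zip(w1, w2)
    ]


# Toy codes: 18 style vectors, shortened from 512 to 4 entries for readability.
w1 = [[1.0] * 4 for _ in range(18)]
w2 = [[3.0] * 4 for _ in range(18)]
w_mid = blend_latents(w1, w2, 0.5)  # every entry lies halfway between 1 and 3
```

In a real pipeline the two codes would come from the pSp encoder and the blended code would be passed to the StyleGAN generator; the sketch covers only the interpolation step.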
3.1. Virtual Morphing Method
To facilitate explanation, a virtual morphing model based on StyleGAN is defined. The actual morphing method may be similar to or different from this model. Assume that a suspect image $I_{susp}$ is a morphed image resulting from a morphing process involving the criminal image $I_C$ and the accomplice image $I_A$:
$$I_{susp} = \mathcal{M}(I_C, I_A, \alpha),$$
where $\mathcal{M}$ represents the morph generation and the morphing factor $\alpha$ controls the contribution of the criminal.
The criminal face image $I_C$ and accomplice face image $I_A$ are first transformed into $w_C$ (latent code for the criminal) and $w_A$ (latent code for the accomplice), respectively, through an encoder network. The morphed latent vectors are then generated via the following blending procedure:
$$w_M = \alpha\, w_C + (1 - \alpha)\, w_A.$$
The morphed latent code $w_M$ is fed into the StyleGAN generator along with $\bar{w}$, the average style vector of the pre-trained generator, to produce the corresponding morphed image $I_M$, as follows:
$$I_M = G(w_M + \bar{w}) = G\big(\alpha\, E(I_C) + (1 - \alpha)\, E(I_A) + \bar{w}\big),$$
where $G$ and $E$ denote the StyleGAN generator and encoder, respectively.
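As a rough sketch of this virtual morphing model, the code below encodes both images, blends the latent codes with the criminal's weight `alpha`, adds the average style vector back, and decodes. The real encoder and generator are the pSp encoder and StyleGAN; here `toy_encode` and `toy_generate` are hypothetical stand-ins operating on short lists, chosen only so that the example runs without any model weights.

```python
def morph(img_criminal, img_accomplice, alpha, encode, generate, w_avg):
    """Morph two images by blending their latent codes (alpha = criminal weight)."""
    w_c = encode(img_criminal)
    w_a = encode(img_accomplice)
    w_m = [alpha * c + (1.0 - alpha) * a for c, a in zip(w_c, w_a)]
    # pSp-style encoders predict codes relative to the average style vector,
    # so the average is added back before decoding.
    return generate([m + avg for m, avg in zip(w_m, w_avg)])


# Toy stand-ins (assumptions, not real networks): an identity "generator" and
# an encoder that subtracts the average style, so generate(encode(x) + w_avg) == x.
W_AVG = [0.5, 0.5]


def toy_encode(img):
    return [p - avg for p, avg in zip(img, W_AVG)]


def toy_generate(code):
    return list(code)


morphed = morph([1.0, 2.0], [3.0, 4.0], 0.25, toy_encode, toy_generate, W_AVG)
```

With these linear stand-ins the morph reduces to a pixel-wise blend; with a real StyleGAN generator the relation between latent blending and the output image is nonlinear.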
3.2. The Proposed De-Morphing Method
The proposed de-morphing network is also based on the pSp network, which is again specific to StyleGAN. Given a trusted live capture (TLC) $I_{TLC}$, an inverse morphing process can be applied to recover the accomplice’s facial image $I_A$. From (3), the de-morphed latent code $w_D$ can be obtained with the de-morphing factor $\hat{\alpha}$ as follows:
$$w_D = \frac{E(I_{susp}) - \hat{\alpha}\, E(I_{TLC})}{1 - \hat{\alpha}}.$$
Although a more sophisticated formula could be used for latent space de-morphing, and the morphing method may use different contribution factors for each latent element, we found that this simple formula performs very well in general for detecting morphing attacks.
Finally, a de-morphed image $I_D$ is generated from the de-morphed latent code through the StyleGAN generator, as follows:
$$I_D = \mathcal{D}(I_{susp}, I_{TLC}, \hat{\alpha}) = G(w_D + \bar{w}),$$
where $\mathcal{D}$ represents the de-morphing process.
In practical scenarios, the morphing factor $\alpha$ (the contribution weight of the criminal in the morphed image) is not available. The experiments in Section 5 demonstrate that even if an associated de-morphing factor could be approximated through practical assumptions, reconstructing the facial features of the accomplice remains notably challenging because of post-processing. Ferrara et al. [15] observed a similar phenomenon with the landmark-based de-morphing method. However, it may still be possible to detect morphed images by performing de-morphing across various values of the de-morphing factor $\hat{\alpha}$.
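The following sketch illustrates de-morphing across several factors. It assumes that the de-morphing step is the algebraic inverse of linear latent blending, that is, the suspect code minus the weighted TLC code, divided by one minus the weight; this form and the names `blend` and `demorph_latent` are assumptions for illustration. The toy example also uses the criminal's own code as the TLC code, whereas a real TLC would only approximate it.

```python
def blend(w_c, w_a, alpha):
    """Linear latent morph: alpha weights the criminal's code."""
    return [alpha * c + (1.0 - alpha) * a for c, a in zip(w_c, w_a)]


def demorph_latent(w_susp, w_tlc, alpha_hat):
    """Assumed inverse of the linear blend for a de-morphing factor alpha_hat."""
    if not 0.0 <= alpha_hat < 1.0:
        raise ValueError("alpha_hat must lie in [0, 1)")
    return [(s - alpha_hat * t) / (1.0 - alpha_hat) for s, t in zip(w_susp, w_tlc)]


w_c, w_a = [1.0, -2.0], [3.0, 4.0]   # toy criminal and accomplice codes
w_morph = blend(w_c, w_a, 0.25)      # morph with a (normally unknown) factor

# With the matching factor the accomplice code is recovered (up to rounding).
recovered = demorph_latent(w_morph, w_c, 0.25)

# In practice the factor is unknown, so de-morphing is repeated over a sweep.
sweep = {a: demorph_latent(w_morph, w_c, a) for a in (0.1, 0.2, 0.3, 0.4)}

# For a bona fide suspect whose code equals the TLC code, this formula leaves
# the code unchanged for every factor.
bona_fide = demorph_latent(w_c, w_c, 0.3)
```

Under these idealized assumptions the de-morphed code drifts away from the TLC code for a morph but stays put for a bona fide image, which gives some intuition for why the similarity scores across a sweep may form distinguishable patterns.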
3.3. Morphing Detection Network with Similarity Scores
We apply the above de-morphing process to the same $I_{susp}$ and $I_{TLC}$ at $N$ de-morphing factors $\hat{\alpha}_1, \dots, \hat{\alpha}_N$ and obtain multiple de-morphed images $I_D^{(1)}, \dots, I_D^{(N)}$. A face recognition network (ArcFace [43] in this work) is employed to extract face-related features of size 512 from the TLC and from the set of $N$ de-morphed images: $f_0$ is the TLC’s feature and $f_i$ is the feature of the $i$-th de-morphed image for $i = 1, \dots, N$. These features are combined to produce $N$ similarity scores between $f_0$ and $f_i$. These scores are then fed into a fully connected MLP classifier with three layers: an input layer of size $N$, a hidden layer, and a single output node with sigmoid activation. The network is trained using the Adam optimizer with a learning rate of 0.001 for 10 epochs, with a 7:3 training-to-test split.
The parameter $N$ denotes the number of de-morphing factors. Although the classification network is a relatively small MLP, increasing $N$ proportionally increases the overall computational cost of the de-morphing process. We systematically evaluated $N$ in the range of 2 to 10, and the results showed that detection performance saturates beyond $N = 5$, with no statistically meaningful improvement thereafter. Accordingly, we selected $N = 5$ as a trade-off between computational efficiency and detection accuracy. The network estimates a probability score that indicates whether the suspect image is morphed or bona fide and reaches the decision as illustrated in Figure 2.
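To make the detection stage concrete, the sketch below computes the vector of cosine similarity scores between a TLC feature and the features of the de-morphed images, then runs a forward pass through a small MLP with a sigmoid output. The hidden layer size, the ReLU hidden activation, and all weight values are placeholder assumptions (the weights are untrained), and the 3-dimensional toy features stand in for 512-dimensional ArcFace features.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def similarity_pattern(f_tlc, demorphed_features):
    """One similarity score per de-morphing factor."""
    return [cosine_similarity(f_tlc, f) for f in demorphed_features]


def mlp_forward(scores, w_hidden, b_hidden, w_out, b_out):
    """Input layer -> one hidden layer (ReLU, assumed) -> sigmoid output."""
    hidden = [
        max(0.0, sum(w * s for w, s in zip(row, scores)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    logit = sum(w * h for w, h in zip(w_out, hidden)) + b_out
    return 1.0 / (1.0 + math.exp(-logit))


# Toy example with N = 3 de-morphing factors.
f_tlc = [1.0, 0.0, 0.0]
demorphed = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
scores = similarity_pattern(f_tlc, demorphed)

# Placeholder (untrained) weights: 2 hidden nodes, 1 output node.
prob = mlp_forward(
    scores,
    w_hidden=[[1.0, -1.0, 0.5], [0.0, 2.0, -1.0]],
    b_hidden=[0.0, 0.1],
    w_out=[0.7, -0.3],
    b_out=0.0,
)
```

In training, the weights would be fitted (for example with Adam, as described above) on score vectors labeled as morphed or bona fide, and the output probability would be thresholded to reach the final decision.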
4. Dataset
An FMAD database has been created from the FRGCv2 and Color FERET databases (Table 1). First, selected images are verified to meet ISO/ICAO specifications [49], ensuring that there are no strong expressions, closed eyes, hats, or glasses. After filtering, the dataset includes 1239 subjects: 806 from Color FERET (466 male and 340 female) and 433 from FRGCv2 (241 male and 192 female). Each dataset is divided into two groups: set A is used for morphing and set B contains probe images. For the images selected from the Color FERET database, one image is chosen as the criminal image and one as the probe image per subject. For the images from the FRGCv2 database, one image is chosen as the criminal image and five as probe images per subject.
Furthermore, the SYN-MAD dataset [8], which includes MIPGAN-I, MIPGAN-II, FaceMorpher, Webmorph, and OpenCV morphs, is additionally used for FMA detection evaluation. This dataset contains 4483 morphed face images (984 OpenCV, 1000 FaceMorpher, 500 Webmorph, 1000 MIPGAN-I, and 999 MIPGAN-II) and 204 bona fide images from the FRLL dataset. During the morphing and de-morphing processes, all images are normalized to a resolution of
.
4.1. FMAD Dataset Creation
In general, the strength of a morphing attack depends on the criminal’s face: when the criminal’s and accomplice’s faces are similar, the morphed face is hard to detect. Previous works do not agree on how to construct the dataset and generate morphed images only with equal weights for the criminal and accomplice. Therefore, in this paper, we generated morphed images according to two protocols.
4.1.1. Protocol 1: For De-Morphing Network’s Performance Evaluation
Protocol 1 is designed to evaluate de-morphing on morphing attacks with different morphing factors in the range [0.1, 0.45].
For each subject designated as a criminal, candidate accomplices were selected for morphing as follows:
The image of each subject (criminal) in set A is compared with the images of the other subjects of the same gender from the same source database (FRGCv2 or Color FERET). The K subjects (K = 4 in the experiments) with the highest ArcFace cosine similarity scores with the criminal are chosen as the accomplices for morphing.
Morphing with our StyleGAN-based method in (6) and with FaceMorpher [3] (landmark-based) is performed between each pair of criminal and accomplice at specific values of alpha in the range [0.1, 0.45]. The total number of morphed attempts for each morphing method is therefore the number of criminals multiplied by the number of accomplices per criminal and the number of morphing factors.
For the generated morphed images, ArcFace similarity scores are calculated against the probe image of the criminal to determine the morphing attack success probability (Criminal Morph Acceptance Rate) at a specific morphing factor.
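The accomplice selection step of protocol 1 can be sketched as a top-K search over cosine similarities, restricted to subjects of the same gender. The function name `select_accomplices`, the dictionary layout, and the 2-dimensional toy features are illustrative assumptions; the sketch also assumes that `features` already contains subjects from a single source database.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def select_accomplices(criminal, features, genders, k=4):
    """Return the k same-gender subjects most similar to the criminal."""
    candidates = [
        s for s in features
        if s != criminal and genders[s] == genders[criminal]
    ]
    candidates.sort(
        key=lambda s: cosine_similarity(features[criminal], features[s]),
        reverse=True,
    )
    return candidates[:k]


# Toy face features (real ArcFace features have 512 dimensions).
features = {
    "c": [1.0, 0.0],
    "s1": [0.9, 0.1],
    "s2": [0.0, 1.0],
    "s3": [0.7, 0.7],
    "s4": [1.0, 0.05],
}
genders = {"c": "m", "s1": "m", "s2": "m", "s3": "m", "s4": "f"}
top2 = select_accomplices("c", features, genders, k=2)
```

Here `s4` is the closest subject overall but is skipped because of the gender constraint, so the two selected accomplices are `s1` and `s3`.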
4.1.2. Protocol 2: For Morphing Image Detection Performance Evaluation
Protocol 2 provides a quantitative analysis of the vulnerability of the Face Recognition System (FRS) to morphed images generated with a morphing factor of $\alpha = 0.5$. This follows [50], which demonstrates that morphed images with equal weights pose the greatest vulnerability to FRSs.
Candidate accomplices for each criminal are selected as in protocol 1. If a pair of subjects has been chosen with the first as the criminal and the second as the accomplice, the reverse order (the first as accomplice and the second as criminal) is not selected; instead, the candidate with the next-highest score is chosen. The total number of morphed attempts for each morphing method is therefore the number of criminals multiplied by the number of accomplices per criminal. Due to the limited number of probes per subject in the FERET dataset, and following previous works [5,51], the Mated Morphed Presentation Match Rate (MMPMR) and Fully Mated Morphed Presentation Match Rate (FMMPMR) in Table 2 are reported only on the FRGCv2 database.
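The rule that excludes reversed pairs can be sketched as a single pass over the criminals. The sketch assumes that each criminal already has a list of candidate accomplices ranked by decreasing similarity (for example from a top-K search as in protocol 1); the function name `select_pairs` and the processing order of the criminals are assumptions.

```python
def select_pairs(ranked_candidates, k):
    """Pick up to k accomplices per criminal, never reversing an existing pair.

    ranked_candidates maps each criminal to candidates sorted by decreasing
    similarity. Criminals are processed in insertion order (an assumption).
    """
    chosen = set()
    pairs = []
    for criminal, candidates in ranked_candidates.items():
        picked = 0
        for accomplice in candidates:
            if picked == k:
                break
            if (accomplice, criminal) in chosen:
                continue  # reverse order already used: fall through to next best
            chosen.add((criminal, accomplice))
            pairs.append((criminal, accomplice))
            picked += 1
    return pairs


ranked = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
pairs = select_pairs(ranked, k=1)
```

In this toy example, subject `b` would normally pair with `a`, but since `a` has already been morphed with `b` as the accomplice, the next candidate `c` is taken instead.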
For the constructed FMAD dataset, morphed images that successfully match against criminal images are collected from protocols 1 and 2. Scherhag et al. [13] found that applying post-processing techniques such as resizing and print-and-scan during training has minimal impact on detection performance. Ferrara et al. [15] also showed that de-morphing improves efficiency for print-and-scan images. Although we do not run experiments on print-and-scan images, we expect similar performance in such scenarios.
Table 1 summarizes the created database for face morphing attack detection. Each dataset, created from an original database with a given morphing method, is split into training and testing sets in a 7:3 ratio. We then train a model on one dataset and test it on all the other datasets. The SYN-MAD dataset is used solely for evaluation. Face de-morphing is executed on the suspect image (bona fide or morphed) and the corresponding TLC from set B. For the FRGCv2 dataset, the first image in the probe set is selected as the TLC. Since each subject in the SYN-MAD dataset has only two genuine images (one neutral and one smiling), when one image is used as the criminal image for morphing, the subject’s remaining image is used as the TLC. The network was trained over 100 epochs with a learning rate of
.
4.2. Properties and Statistics of Morphing Attack Datasets
During the morphing and de-morphing experiments, a threshold score corresponding to a False Acceptance Rate (FAR) of 0.1% has been used for both datasets, in accordance with the Frontex guidelines, where the target FAR is 0.1% and the FRR is 5%.
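A threshold for a given target FAR can be derived from a set of impostor (non-mated) comparison scores, as sketched below. The helper names and the "accept if score is strictly above the threshold" convention are assumptions, and the toy impostor scores and the target FAR of 1% are arbitrary values chosen only to keep the example small.

```python
import math


def threshold_at_far(impostor_scores, target_far):
    """Lowest observed score usable as threshold without exceeding target_far.

    A comparison is accepted when its score is strictly above the threshold.
    """
    if not impostor_scores:
        raise ValueError("need at least one impostor score")
    ranked = sorted(impostor_scores, reverse=True)
    # Number of impostor comparisons that may be (wrongly) accepted.
    allowed = math.floor(target_far * len(ranked) + 1e-9)
    if allowed >= len(ranked):
        return float("-inf")
    return ranked[allowed]


def false_accept_rate(impostor_scores, threshold):
    return sum(s > threshold for s in impostor_scores) / len(impostor_scores)


# Toy impostor scores: 0.000, 0.001, ..., 0.999.
impostor = [i / 1000 for i in range(1000)]
tau = threshold_at_far(impostor, 0.01)
far = false_accept_rate(impostor, tau)
```

The same threshold is then applied to the morph comparisons when measuring how often a morphed image is accepted.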
Table 3 summarizes the Criminal Morph Acceptance Rate (CMAR) of morphed images in protocol 1. The results (Table 2) indicate that as the criminal’s contribution to the image increases, the acceptance rate of morphed images increases significantly, eventually reaching a level comparable to that of the landmark-based method [3].
Figure 3 and Figure 4 present examples of morphed images generated by the proposed StyleGAN-based morphing method and by FaceMorpher [3]. The first and second columns show the criminals and accomplices, respectively, with the cosine similarity score calculated between them. The third to fifth columns display morphed images with increasing morphing alphas. The scores shown for these morphed images are computed against the TLC images in the last column. As the morphing alpha increases, the morphed images exhibit higher similarity scores with the criminal, indicating a greater resemblance to the criminal.
The Mated Morphed Presentation Match Rate (MMPMR) and Fully Mated Morphed Presentation Match Rate (FMMPMR) of the created morphs, which indicate the vulnerability of the Face Recognition System (FRS) under protocol 2, are shown in Table 2. It should be noted that the created databases differ from other databases used in scientific publications on FMA and FMAD. In particular, the intra-class variation is much higher in our database due to the number of selected subjects, which makes our database better suited to simulating real-world scenarios. Even though the comparisons are relative, the results show that the proposed morphing method yields high FRS vulnerability, outperforming both StyleGAN and Landmark-II and approaching the performance of the MIPGAN method. This underlines the high quality of the morphs produced by the proposed method, making them reliable for evaluating the proposed detection method.
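For readers unfamiliar with the two vulnerability metrics, the sketch below follows their commonly cited definitions: MMPMR counts a morph as successful when its lowest comparison score over the contributing subjects exceeds the threshold, while FMMPMR counts an individual probe attempt as successful only when the scores of all contributing subjects in that attempt exceed the threshold. The data layout and the toy scores are assumptions, and the evaluation code actually used for Table 2 may differ in detail.

```python
def mmpmr(morph_scores, tau):
    """morph_scores: per morph, one comparison score per contributing subject."""
    return sum(min(scores) > tau for scores in morph_scores) / len(morph_scores)


def fmmpmr(morph_attempts, tau):
    """morph_attempts: per morph, a list of attempts; each attempt holds one
    comparison score per contributing subject (one probe image each)."""
    attempts = [attempt for morph in morph_attempts for attempt in morph]
    return sum(all(s > tau for s in attempt) for attempt in attempts) / len(attempts)


# Toy scores for morphs with two contributing subjects each.
single = [(0.6, 0.5), (0.7, 0.2), (0.4, 0.45), (0.9, 0.8)]
multi = [
    [(0.6, 0.5), (0.2, 0.5)],   # morph 1: two probe attempts
    [(0.7, 0.1), (0.8, 0.9)],   # morph 2: two probe attempts
]
rate_mmpmr = mmpmr(single, tau=0.3)
rate_fmmpmr = fmmpmr(multi, tau=0.3)
```

Because FMMPMR requires every contributing subject to match in the same attempt, it is generally at least as strict as MMPMR variants that keep the best score per subject.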
6. Conclusions
In this paper, we proposed a D-FMAD pipeline based on deep learning de-morphing that addresses the challenges posed by face morphing attacks on FRSs. Inspired by differences in similarity score variations between morphed and non-morphed images, the proposed approach learns the change patterns of similarity scores between the live capture and face images de-morphed with different de-morphing factors. An effective deep de-morphing network based on StyleGAN and the pSp (pixel2style2pixel) encoder was developed. The method generates de-morphed images from suspect and live images with multiple de-morphing factors; similarity scores are then calculated between feature vectors from the ArcFace network and classified by the detection network. Experiments on morphing datasets from the Color FERET, FRGCv2, and SYN-MAD databases, including landmark-based and deep learning attacks, demonstrate that the proposed method achieves high accuracy in detecting unseen morphing attacks across different databases.
It is important to emphasize that the approach using the similarity score variation in the proposed pipeline is not restricted to specific de-morphing techniques. This underscores the potential for improving FMAD tasks by employing advanced de-morphing techniques, particularly since current studies still fall short of meeting the needs of real-world systems.