Global–Local Facial Fusion Based GAN Generated Fake Face Detection

Media content forgery is widely spread over the Internet and has raised severe societal concerns. With the development of deep learning, new technologies such as generative adversarial networks (GANs) and media forgery techniques have already been utilized to forge politicians and celebrities, which has a terrible impact on society. Existing GAN-generated face detection approaches rely on detecting image artifacts and generation traces. However, these methods are model-specific, and their performance deteriorates when faced with more sophisticated generators. Moreover, it is challenging to identify forged images under perturbations such as JPEG compression, gamma correction, and other disturbances. In this paper, we propose a global–local facial fusion network, namely GLFNet, to fully exploit local physiological and global receptive features. Specifically, GLFNet consists of two branches, i.e., a local region detection branch and a global detection branch. The former detects forged traces in facial parts, such as the iris and pupils. The latter adopts a residual connection to distinguish real images from fake ones. GLFNet obtains forged traces in multiple ways by combining physiological characteristics with deep learning. Because the physiological properties remain stable while the deep features are learned, the method is more robust than single-class detection methods. Experimental results on two benchmarks demonstrate its superiority and generalization compared with other methods.


Introduction
With the development of generative adversarial networks (GANs) [1], numerous GAN-based deepfake methods have been proposed to replace, modify, and synthesize human faces. GAN-based deepfake methods can be divided into two main categories, i.e., facial replacement-based methods and expression attribute modification-based methods. Facial replacement methods [2][3][4] exchange two faces by style conversion [5] and face reconstruction [6] to modify the identities of two persons. Expression attribute methods leverage the expressions or actions of one person to manipulate a target person, including personal image reconstruction [7], rendering networks [8,9], expression migration [10], and expression matching [11]. Recently, mobile phone applications with deepfake methods, such as ZAO and Avatarify, have been deployed commercially and have achieved great success.
Due to the malicious use of the above deepfake methods, media credibility has been seriously endangered, resulting in negative impacts on social stability and personal reputation. Various deepfake detection methods have been proposed to reduce these negative impacts. Some methods [12][13][14] focus on detecting local regions, such as the eyes, nose, lips, and other facial parts, to find forgery traces. Matern et al. [12] propose detecting forgeries through missing details and light reflections in the eye and teeth regions. They also use discordant features, such as facial boundaries and nose tips, to detect forgery.
(1) We establish a mechanism to identify forgery images by combining physiological cues, such as iris color and pupil shape, with deep learning methods.
(2) We propose a novel deepfake detection framework, which includes a local region detection branch and a global detection branch. The two branches are trained end-to-end to generate comprehensive detection results.
(3) Extensive experiments demonstrate the effectiveness of our method in detection accuracy, generalization, and robustness compared with other approaches.
The rest of the paper is organized as follows: In Section 2, we introduce the related work. Section 3 gives a detailed description of our method. Section 4 displays the experimental results and analysis. Finally, we make conclusions in Section 5.

Related Work
GAN-generated face detection methods can be divided into two categories, i.e., detection methods based on physical properties and classification methods using deep learning [21,22]. The former mainly detect physiological forgery traces, while the deep learning detection methods primarily focus on global image information.

Physical Properties Detection Method
Physical-property detection identifies inconsistencies and irrationalities caused by the forgery process, ranging from physical device attributes to physiological inconsistencies.
Physical devices such as cameras and smartphones leave behind particular traces, which can be regarded as fingerprints for forensics. An image can be identified as a forgery if multiple fingerprints exist in the same image. Most of these methods aim to detect an image's fingerprint [23] to determine authenticity, including a detection method based on a twin network [24] and a comparison method based on CNNs [25]. Face X-ray [26] converts facial regions into X-rays to determine whether those facial regions are from a single source.
Physiological inconsistencies play a vital role in image or video deepfake detection. These methods detect the physiological signal features from contextual environments or persons, including illumination mistakes, the reflection of the differences in human eyes and faces, and blinking and breathing frequency disorders.
(3) Using audio–visual consistency in multi-modal approaches [30]. (4) Identifying artifacts in affected content through Daubechies wavelet features [31] and edge features [32].
The physical detection methods are generally robust, especially in detecting GAN-generated faces. However, with the improvement of GAN models, the artifacts of synthetic images are no longer apparent, and the applicability of some detection methods is weakened.

Deep Learning Detection Method
Deep learning detection is the most commonly used approach in forgery detection. Earlier methods include classifying forgery content by learning intrinsic features [33], the generator fingerprints [34] of GAN-generated images, co-occurrence matrix inconsistencies [15] in the color channels, and inconsistencies between spectral bands [35], or detecting the synthesis traces of CNN models [36][37][38]. However, with the development of generation approaches, forgery traces are becoming challenging to detect.
Meanwhile, many studies try to solve the problems caused by superimposed noise. Hu et al. [39] proposed a two-stream method that analyzes the frame level and temporality level of compressed deepfake media, aiming at forensics of compressed videos. Chen et al. [40] considered both the luminance and chrominance components of dual color spaces to detect post-processed GAN-generated face images. He et al. [41] proposed re-synthesizing the test images and extracting visual cues for detection; super-resolution, denoising, and colorization are also utilized in the re-synthesis. Zhang et al. [42] proposed an unsupervised domain adaptation strategy to improve the generalization of GAN-generated image detection using only a few unlabeled images from the target domain.
As research has deepened, some recent methods utilize local facial features to enhance global features and obtain better results. Ju et al. [18] proposed a two-branch model that combines global spatial information from the whole image with local features from multiple patches selected by a novel patch selection module. Zhao et al. [20] proposed a method containing both global and local information, where the fused features of the two streams are fed into a temporal module to capture forgery clues.
Deep learning detection methods use a deep learning model to detect synthetic content. Most take the entire image as input, while the physical properties of the image are not fully considered, leaving room for improvement. This paper proposes a global–local dual-branch GAN-generated face detection framework that combines physical properties with deep learning. Specifically, the local region detection branch extracts iris color and pupil shape artifacts, while the global detection branch detects holistic forgery in images. The evaluation is based on fusing the results of the two branches, using a ResNeSt model; finally, a logical operation determines the forgery images.

Proposed Method
In this section, we elaborate the dual-branch architecture GLFNet, which combines physical properties with a deep learning method. The local region detection branch performs consistency judgments on physiological features, including iris color comparison and pupil shape estimation. Meanwhile, the global detection branch detects global information from the image residuals: following ResNeSt [43], we extract the residual features and classify them. Finally, a classifier combines the results of the two branches by a logical operation. The overall architecture is shown in Figure 1.

Motivation
Existing forgery detection approaches [12,14] focus on inconsistencies and traces that appear in local areas, while some methods [41,44] use global features to detect forgery. The results show that critical regions, such as the eyes, exhibit more distinctive features than other areas.
Figure 2a [41] illustrates the results of detecting the artifacts of the perceptual network at the pixel- and stage5-levels, where artifacts are more easily detected in the eyes, lips, and hair. Figure 2b [44] shows a schematic diagram of residual error-guided attention, capturing the apparent residuals in the eyes, nose, lips, and other vital regions.
Based on this, we propose a novel framework to combine the local region and global full-face detection. We use iris color comparison and pupil shape estimation in local region detection to provide more robust detection results and assist the global detection branch.



Local Region Detection Branch
The local region detection branch is designed to model the local regions' consistency and illumination, including iris and pupil segmentation, iris color comparison, and pupil shape estimation.

Iris and Pupil Segmentation
We utilize the HOG SVM shape predictor in the Dlib toolbox to obtain 68 facial coordinate points in the face ROI. The eye landmarks extracted as local regions are illustrated in Figure 3a: the white lines mark the eye landmarks, the red box marks the iris, the yellow box marks the pupil, and the blue arrows point to the sclera.
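As a sketch of this step: assuming the 68 landmarks have already been predicted (e.g., by Dlib's shape predictor), the eye patches can be cropped using the standard 68-point convention, in which indices 36–41 and 42–47 cover the two eyes. The helper names and the `margin` parameter are our own illustration, not the paper's code.

```python
import numpy as np

# Standard 68-point convention (as used by Dlib's shape predictor):
# indices 36-41 cover one eye, 42-47 the other.
LEFT_EYE = slice(36, 42)
RIGHT_EYE = slice(42, 48)

def eye_bounding_box(landmarks, eye, margin=5):
    """Axis-aligned bounding box (x0, y0, x1, y1) around one eye,
    padded by `margin` pixels on every side."""
    pts = landmarks[eye]
    x0, y0 = pts.min(axis=0) - margin
    x1, y1 = pts.max(axis=0) + margin
    return int(x0), int(y0), int(x1), int(y1)

def crop_eyes(image, landmarks, margin=5):
    """Crop left/right eye patches from an H x W (x C) image array."""
    crops = []
    for eye in (LEFT_EYE, RIGHT_EYE):
        x0, y0, x1, y1 = eye_bounding_box(landmarks, eye, margin)
        # Clamp to the image so a near-border eye stays a valid crop.
        crops.append(image[max(y0, 0):y1, max(x0, 0):x1])
    return crops
```

The crops can then be resized and passed to the segmentation network of the next subsection.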

Figure 3. (a) The parts of the eye: iris, pupil, landmark, and sclera.

Following [45], we segment the iris and pupil regions using EyeCool, which adopts U-Net [46] as the backbone. The segmentation results are shown in Figure 3b. EyeCool employs EfficientNet-B5 [47] as the encoder and U-Net as the decoder. Meanwhile, the decoder comprises a boundary attention module, which improves detection at object boundaries.

Iris Color Detection
In a real image, the iris color of a person's left and right eyes is supposed to be the same. However, some GAN-generated images do not consider iris color globally. As shown in Figure 4a, the color differences between the left and right eyes are obvious. In the first image of row 1, the left eye's iris is blue and the right is brown, which does not happen in real people. We also list the iris colors in two authentic images: as shown in the two right pictures of Figure 4a, the iris colors of the left and right eyes are the same. Therefore, we can detect inconsistencies in iris color to distinguish GAN-generated images.
We tag the left and right iris regions segmented by EyeCool as u_L and u_R, as shown in Figure 3c. The difference between u_L and u_R in RGB color space is calculated as:

Dist_{i_c} = |u_{LR} − u_{RR}| + |u_{LG} − u_{RG}| + |u_{LB} − u_{RB}|,

where u_{LR}, u_{RR}, u_{LG}, u_{RG}, u_{LB}, and u_{RB} are the average R, G, and B values of the left- and right-eye pixels after segmentation.
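The iris color comparison reduces to a sum of absolute differences between per-channel means; a minimal sketch, assuming the segmented iris pixels are available as (N, 3) RGB arrays (function name is ours):

```python
import numpy as np

def iris_color_distance(left_pixels, right_pixels):
    """Dist_{i_c}: sum of absolute differences between the mean R, G, B
    values of the segmented left- and right-iris pixels.
    Each argument is an (N, 3) array of RGB values."""
    mean_l = left_pixels.reshape(-1, 3).mean(axis=0)   # (u_LR, u_LG, u_LB)
    mean_r = right_pixels.reshape(-1, 3).mean(axis=0)  # (u_RR, u_RG, u_RB)
    return float(np.abs(mean_l - mean_r).sum())
```

Two irises of the same color give a distance near zero, while a blue/brown mismatch produces a large value that can be compared against the threshold γ_{i_c}.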


Pupil Shape Estimation
In real images, pupil shapes are mainly elliptical, as shown in the two right pictures of Figure 4b, where the pupil regions are marked by white lines. However, pupils generated by GANs may have irregular shapes, as shown in the four left pictures of Figure 4b, which also show the pupil shape difference between the left and right eyes. The irregular pupil shapes are marked by the white lines and would not occur in natural images.
Following Figure 3b, we fit an ellipse to the pupil boundary, as shown in Figure 3d. Following [48], we use a least-squares fitting approach to determine the ellipse parameter vector θ so that the parametric ellipse matches the pupil boundary points as closely as possible. We calculate the algebraic distance D(µ; θ) of a 2D point (x, y) to the ellipse as:

D(µ; θ) = µ^T θ = a x² + b x y + c y² + d x + e y + f,

where µ = (x², xy, y², x, y, 1)^T, θ = (a, b, c, d, e, f)^T collects the parameters of the general ellipse [49], and T denotes the transpose operation. The best fit satisfies D(µ; θ) = 0. We then minimize the sum of squared distances (SSD) over the pupil boundary points; the pseudo-code is shown in Algorithm 1.
Here, ‖·‖₂ denotes the ℓ2 norm used in the SSD objective, and w is a weighting constant decremented in each round of the fitting. A normalization constraint on θ avoids the trivial solution θ = 0 and ensures positive definiteness.
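As an illustrative sketch of the fitting step (not the authors' Algorithm 1): the general-conic parameters can be estimated by minimizing ‖Mθ‖₂ subject to the unit-norm constraint ‖θ‖₂ = 1, which likewise rules out θ = 0. The SVD-based solver below is our simplification of the constrained least-squares fit in [48].

```python
import numpy as np

def fit_conic(points):
    """Fit theta = (a, b, c, d, e, f) of the general conic
    a x^2 + b x y + c y^2 + d x + e y + f = 0 to 2D points by
    minimising ||M theta||_2 subject to ||theta||_2 = 1, which
    avoids the trivial solution theta = 0."""
    x, y = points[:, 0], points[:, 1]
    M = np.column_stack([x * x, x * y, y * y, x, y, np.ones_like(x)])
    # The minimiser is the right singular vector of the smallest
    # singular value of M.
    _, _, vt = np.linalg.svd(M)
    return vt[-1]

def algebraic_distance(point, theta):
    """D(mu; theta) = mu^T theta for mu = (x^2, xy, y^2, x, y, 1)."""
    x, y = point
    mu = np.array([x * x, x * y, y * y, x, y, 1.0])
    return float(mu @ theta)
```

For points sampled on a true ellipse, the recovered θ drives D(µ; θ) to (numerically) zero on the boundary, while off-boundary points yield nonzero distances.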
After that, we use Boundary IoU (BIoU) to evaluate the pupil mask pixels; the larger the BIoU value, the better the boundary-fitting effect.
BIoU(P, D) = |(P ∩ P_ε) ∩ (D ∩ D_ε)| / |(P ∩ P_ε) ∪ (D ∩ D_ε)|,

where P is the predicted pupil mask and D is the fitted ellipse mask. P_ε and D_ε denote the mask pixels within distance ε of the predicted and fitted boundaries, respectively; ε controls the sensitivity of the measure and is directly proportional to the boundary-fitting sensitivity. Following [48], we take ε = 4 in pupil shape estimation.
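A small self-contained sketch of the BIoU computation, assuming binary NumPy masks; the boundary-band extraction here (4-neighbour boundary plus a brute-force Euclidean distance test) is our own minimal implementation, not the code of [48].

```python
import numpy as np

def boundary_band(mask, eps):
    """Mask pixels lying within Euclidean distance `eps` of the mask
    boundary (pixels with at least one 4-neighbour outside the mask)."""
    padded = np.pad(mask, 1)
    inner = (padded[1:-1, 1:-1] & padded[:-2, 1:-1] & padded[2:, 1:-1]
             & padded[1:-1, :-2] & padded[1:-1, 2:])
    boundary = mask & ~inner
    ys, xs = np.nonzero(boundary)
    if len(ys) == 0:
        return np.zeros_like(mask)
    yy, xx = np.nonzero(mask)
    # Brute-force distance from every mask pixel to the nearest
    # boundary pixel (fine for small eye crops).
    d2 = (yy[:, None] - ys[None, :]) ** 2 + (xx[:, None] - xs[None, :]) ** 2
    near = d2.min(axis=1) <= eps * eps
    band = np.zeros_like(mask)
    band[yy[near], xx[near]] = True
    return band

def boundary_iou(P, D, eps=4):
    """BIoU of predicted pupil mask P and fitted ellipse mask D."""
    p_band, d_band = boundary_band(P, eps), boundary_band(D, eps)
    inter = np.logical_and(p_band, d_band).sum()
    union = np.logical_or(p_band, d_band).sum()
    return inter / union if union else 0.0
```

Identical masks score 1.0, while a misaligned ellipse fit lowers the score toward 0, matching the thresholding on γ_{p_s} used later.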

Global Detection Branch
The global detection branch is mainly used to classify the natural and generated faces by calculating the residual from the images, which contains downsampling, upsampling, and residual feature extraction.

Downsampling
In the global detection branch, we downsample the image I by a factor of four to provide space for the subsequent upsampling:

I_down = M_4(I),

where M_4 represents the four-times downsampling function and I_down is the downsampled image.
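To make the residual pipeline concrete, here is a toy sketch: M_4 as nearest-neighbour 4× downsampling, and a placeholder upsampler standing in for the trained super-resolution model δ introduced in the next subsection (in the paper δ is a learned network, not simple repetition).

```python
import numpy as np

def downsample_4x(img):
    """M_4: 4x nearest-neighbour downsampling (stand-in for the
    paper's downsampling function)."""
    return img[::4, ::4]

def upsample_4x(img):
    """Placeholder for the trained super-resolution model delta;
    here just nearest-neighbour repetition."""
    return img.repeat(4, axis=0).repeat(4, axis=1)

def residual(img):
    """Residual fed to the global branch: reconstruction minus input."""
    return upsample_4x(downsample_4x(img)) - img
```

With a real super-resolution model, genuine images reconstruct well (small residual) while GAN-generated images leave characteristic residual patterns.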

Upsampling
We employ a super-resolution model δ to upsample the image and obtain its residuals. Following [41], we utilize a pre-trained perceptual loss to train δ to enhance the detection of high-level information; the model follows [50] and is supervised by the ℓ1 pixel loss as well as the VGG-based perceptual loss [51,52].
In the training stage, δ is trained on the real set D_T. The loss function of the regression task is formulated as:

L(δ) = ‖δ(I_down) − I‖₁ + α₀ Σ_i ‖φ_i(δ(I_down)) − φ_i(I)‖₁,

where α₀ is a hyper-parameter controlling the feature importance during training, and φ_i denotes the feature extractor at stage i. For computational convenience, the loss at each stage is computed as an ℓ1 norm.
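The loss structure can be sketched as follows, with a toy average-pooling stack standing in for the pre-trained feature extractor φ (the paper uses a VGG-style network); the α₀ value and stage count here are illustrative only.

```python
import numpy as np

def avg_pool2(x):
    """Toy stage of a feature extractor phi: 2x2 average pooling.
    Only illustrates the loss structure, not the paper's VGG features."""
    h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2
    x = x[:h, :w]
    return (x[0::2, 0::2] + x[1::2, 0::2] + x[0::2, 1::2] + x[1::2, 1::2]) / 4

def phi_stages(x, n_stages=3):
    """Feature maps at successive stages of the toy extractor."""
    feats = []
    for _ in range(n_stages):
        x = avg_pool2(x)
        feats.append(x)
    return feats

def reconstruction_loss(pred, target, alpha0=0.1):
    """l1 pixel loss plus alpha0-weighted l1 loss per feature stage."""
    loss = np.abs(pred - target).mean()
    for fp, ft in zip(phi_stages(pred), phi_stages(target)):
        loss += alpha0 * np.abs(fp - ft).mean()
    return float(loss)
```

A perfect reconstruction scores zero; a constant offset of 1 contributes 1 from the pixel term plus α₀ per stage from the feature terms.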

Residual Features Extraction
After training δ, we construct a union dataset from D_T and D_F, which is used to train a ResNeSt-14 that extracts and classifies the residual features. The input to the ResNeSt-14 is a residual feature map of size 224 × 224; the network radix, cardinality, and width are set to 2, 1, and 64, respectively. Following [41], the detection result R_G is calculated as:

R_G = C_0(r) + β Σ_i C_i(φ_i(r)),

where r is the residual, C_0 is a pixel-level classifier, C_i is a classifier at stage i of the perceptual-loss training, and β controls the relative importance of the classifiers.
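A minimal sketch of fusing a pixel-level classifier score with β-weighted stage-level scores into the global result; all classifiers here are stand-ins (in the paper they are trained pixel-level and ResNeSt classifiers).

```python
import numpy as np

def global_score(r, C0, stage_feats_and_cls, beta=0.5):
    """R_G = C_0(r) + beta * sum_i C_i(phi_i(r)).
    `stage_feats_and_cls` pairs each stage feature map phi_i(r) with
    its classifier C_i; every classifier here is a stand-in."""
    score = C0(r)
    for phi_i_r, C_i in stage_feats_and_cls:
        score += beta * C_i(phi_i_r)
    return score
```

With stub classifiers that return the mean of their input, the structure of the weighted sum is easy to verify.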

Classifier
We fuse the three results Dist_{i_c}, BIoU_{p_s}, and R_G from the two branches. An image is judged real only if all three results satisfy their thresholds. The fusion result R is calculated as:

R = [Dist_{i_c} ≤ γ_{i_c}] ∧ [BIoU_{p_s} ≥ γ_{p_s}] ∧ [R_G ≤ γ_G],

where γ_{i_c}, γ_{p_s}, and γ_G are the judgment thresholds and ∧ denotes logical AND.
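The final decision rule can be sketched directly. The default thresholds follow the values selected in the ablation study (γ_{i_c} = 5, γ_{p_s} = 0.7, γ_G = 0.5); the convention that a low R_G indicates a real image is our assumption about the score's orientation.

```python
def fuse(dist_ic, biou_ps, r_g, gamma_ic=5.0, gamma_ps=0.7, gamma_g=0.5):
    """Final decision R: the image is judged real only if all three
    branch outputs pass their thresholds; otherwise it is flagged as
    GAN-generated. Assumes r_g is the global branch's 'fake' score."""
    is_real = (dist_ic <= gamma_ic) and (biou_ps >= gamma_ps) and (r_g <= gamma_g)
    return "real" if is_real else "fake"
```

A single failing condition, e.g. mismatched iris colors or a badly fitted pupil ellipse, is enough to flag the image.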

Real Person Data
CelebA [53]: The CelebA dataset includes 10,177 identities and 202,599 aligned face images, which are used as the natural face dataset.
CelebA-HQ [14]: The CelebA-HQ is derived from CelebA images, which contains 30k 1024×1024 images. Following [41], we use 25k real images and 25k fake images as the training set, and we use 2.5k real images and 2.5k fake images as the testing set.
FFHQ [54]: FFHQ consists of 70,000 high-quality images at 1024 × 1024 resolution. It includes more variation than CelebA-HQ regarding age, ethnicity, and image background. Additionally, it has coverage of accessories such as eyeglasses, sunglasses, hats, etc.

GAN-Generated Methods
ProGAN [4]: ProGAN grows both the generator and discriminator progressively. It starts from a low resolution, and new layers that model increasingly fine details are added as training progresses. The authors also construct a higher-quality version of the CelebA dataset.
StyleGAN [54]: StyleGAN is an architecture with an automatically learned, unsupervised separation of high-level attributes, borrowing from the style transfer literature. It considers high-level features and stochastic variation during training, making the generated content more intuitive. The authors also released the corresponding dataset.
StyleGAN2 [55]: Based on StyleGAN, StyleGAN2 redesigned the generator normalization, revisited progressive growing, and regularized the generator to encourage good conditioning in the mapping from latent codes to images.

Implementation Details
In the local region detection branch, the length of the eye regions is resized to 1000 pixels. We utilized ResNeSt as a feature extractor in the global detection branch. The comparison methods include PRNU [34], FFT-2d magnitude [38], Re-Synthesis [41], Xception [36], and GramNet [56], etc.
Meanwhile, we also perform hyper-parameter analysis experiments in the ablation studies, covering γ_{i_c}, γ_{p_s}, and γ_G.

Ablation Study
We test our approach on the CelebA-HQ dataset, using ProGAN and StyleGAN to generate images; each category has 2500 pictures [41]. We chose 1250 synthesized images and mixed them with 1250 authentic images to make up the test dataset. We use the ablation study to test the local region detection branch and verify the noise immunity of the physical detection method.
Ablation Study in Hyper-Parameter Analysis

Table 1 shows the correlation between the RGB scores of the left and right eyes. The value of γ_{i_c} influences the detection of eye color. We set five groups of pixel-comparison experiments, with γ_{i_c} ∈ {1, 3, 5, 7, 9}, to select the optimal result when γ_{p_s} = 0.7 and γ_G = 0.5. Table 1 shows that γ_{i_c} is negatively correlated with false positives and positively correlated with the missed-detection rate, so a balanced γ_{i_c} is necessary. Based on the experimental results, we adopted γ_{i_c} = 5 in subsequent experiments.

Table 2 shows the hyper-parameter analysis for pupil shape estimation. We conduct parameter ablation studies with γ_{p_s} varying in {0.1, 0.3, 0.5, 0.7, 0.9} when γ_{i_c} = 5 and γ_G = 0.5; Table 2 shows that the model achieves the best result when γ_{p_s} = 0.7. Meanwhile, Table 3 shows the hyper-parameter analysis for the global detection branch. We set a five-group parameter experiment with γ_G varying in {0.1, 0.3, 0.5, 0.7, 0.9} when γ_{i_c} = 5 and γ_{p_s} = 0.7; Table 3 shows that the best result is obtained when γ_G = 0.5.

Ablation Study in Local Detection Branch
We set up different groups, including raw images (Raw) and noisy images. The noise types include spectrum regularization (+R), spectrum equalization (+E), spectral-aware adversarial training (+A), and an ordered combination of image perturbations (+P), as shown in Table 4. In addition, ICD represents the iris color detection result, PSE the pupil shape estimation result, LRD (All) the local region detection result combining the two methods, NUM the number of detected forged images, and ACC (%) the accuracy rate. The experimental results show that both ICD and PSE can detect forgery images. ICD compares the colors of the left and right irises; if the color difference exceeds the threshold, the image is identified as a forgery, as shown in Figure 5a. Additionally, PSE can identify images with pupil abnormalities, as shown in Figure 5b. Furthermore, LRD (All) has slightly higher detection accuracy than either method alone, which also demonstrates the stability of the branch. We use LRD (All) as the local region detection branch result in subsequent experiments.


Ablation Study in Two Branches
We test the effects of the local region detection branch and the global detection branch in Table 5. LRD evaluates the effectiveness of the local region detection branch, GD refers to the global detection branch, and Dual refers to the detection results combining both the LRD and the GD. Experiments show that the dual-branch detection result outperforms each single-branch result. Meanwhile, we only utilize the authentic images to train the feature extractor of the global detection branch. Figure 6 shows the visualization results of the global detection branch: Figure 6a shows the residuals of real images, and Figure 6b shows the residuals of fake images. The residual maps of real and fake images differ visibly, and we train the classifier to distinguish this difference.


Noise Study
In this section, we conduct a noise study to test the robustness of our method. We compare our approach with baselines on CelebA-HQ, including PRNU [34], FFT-2d magnitude [38], GramNet [56], and Re-Synthesis [41]. As shown in Table 6, the results are quoted from [22,41]. "ProGAN -> StyleGAN" denotes training with ProGAN and testing with StyleGAN. Table 6 shows that dual-branch detection is more effective than either single branch in average accuracy, and it is also more generalized and robust. Compared with Re-Synthesis [41], our method shows advantages in all groups. In particular, our approach performs well in the +P group, with accuracy stable between 83.3% and 86.9%. We also noticed that the local detection branch is stable, with accuracy between 82.6% and 93.1%, which shows that our approach is particularly stable when processing fake images with superimposed noise.
Meanwhile, we observe that the dual branches have complementary performances. As shown in Figure 7a, the upper row shows faces detected only by the local branch, and the lower row shows results detected only by the global branch.
The local branch is straightforward in detecting anomalies in physical features, such as iris color and pupil shape changes, which effectively complements the detection results of the global detection branch. Meanwhile, the global branch can also detect GAN-generated images with complete physical properties.

Comparison with the Physical Approaches
Following [57], we set up an AUC comparison experiment to verify the effectiveness of our method, mainly against state-of-the-art physically based methods. We selected four typical methods [12,14,48,58] that provide AUC scores. Hu et al. [14] and Guo et al. [48] employed real images from FFHQ and used StyleGAN2 to make synthetic images. Matern et al. [12] selected 1000 real images from CelebA and used ProGAN to produce the synthetic images; the enlarged real images had the lowest AUC (0.76). Yang et al. [58] selected more than 50,000 real images from CelebA and used ProGAN for image synthesis; their lowest AUC (0.91) is derived from color classification, and the highest (0.94) from K-NN classification.
We set up three groups of experiments. The first group chose 1000 real images from FFHQ and used StyleGAN2 as the generation method. The second and third groups selected 1000 images from CelebA as the real images and used ProGAN as the generation method. Meanwhile, we enlarged and reduced the source image size by 50% to verify robustness, as shown in Table 7: "raw" means raw images, "u-s" means the image after 50% upsampling, and "d-s" means the image after 50% downsampling. Some of the results are excerpted from [57].

Table 7. Comparison of classification AUC (%) with state-of-the-art methods.

Method | Real Face | GAN Face | AUC
Hu's Method [14] | FFHQ (500) | StyleGAN2 (500) | 0.94
Guo's Method [48] | FFHQ (1.6K) | StyleGAN2 (1.6K) | 0.91
Matern's Method [12] | CelebA (1K) | ProGAN (1K) | 0.76–0.85
Yang's Method [58] | CelebA ( ) | ProGAN | 0.91–0.94

Experimental results show that our method is significantly better on AUC, because the global detection branch with the deep learning model can detect images without evident physical traces, as shown in Figure 8, which presents synthetic images generated by ProGAN. The physical properties of the first three images are realistic, while the last three use some stylizing methods. The forgery of these images is not apparent, which makes detection by the local branch challenging; these images require the global branch for detection.
Meanwhile, we also compared the influence of image upsampling and downsampling in our method. We made a line chart for the last three rows in Table 7, as shown in Figure 9.
The results show that our method is robust when equipped with image sampling, which proved that the physical method has some influence.
Figure 9. The line chart from Table 7.

Comparison with the State-of-the-Arts Comparative Experiment
We followed the results in [42] and conducted experiments to compare with the state-of-the-art methods, as shown in Table 8, where all the images in the target domain are unlabeled. The experiments prove that our method has higher accuracy than the others. Early approaches cannot detect all types of GAN-generated images; for example, when using Mi's method [37] to detect faces made by StyleGAN, the accuracy is only 50.4%, while recently proposed methods have a certain degree of robustness.
We also followed the results in [40] and tested our proposed method. The detection accuracies of the state-of-the-art methods are shown in Table 9, from which we can conclude that our approach performs well in classification accuracy. Meanwhile, Figure 10 shows the line chart from Table 9: the x-axis represents the disturbance parameters, and the y-axis represents the accuracy. We observed that our approach is more stable in all groups: as shown in Figure 10, it adapts better to JPEG compression, gamma correction, and other disturbances. As the disturbances become more severe, such as a lower JPEG compression quality or a larger Gaussian blurring kernel, the accuracy of our method is barely affected. This is because the physical method is not sensitive to image scaling, padding, noise, and other disturbances, which improves the robustness of the model.

Conclusions and Outlook
In this paper, we proposed a novel deepfake detection method that integrates global and local facial features, namely GLFNet. GLFNet comprises a local region detection branch and a global detection branch, which are designed for forgery detection on iris color, pupil shape, and forgery trace in the whole image. It delivers a new deepfake detection method that combines physiological and deep neural network methods.

Four kinds of experiments were conducted to verify the effectiveness of our method. Firstly, we demonstrated the effectiveness of the two branches using ablation studies. Secondly, we tested the anti-interference performance of our approach. Thirdly, we demonstrated the effectiveness of our method against physical detection methods. Finally, we set up experiment groups to evaluate the method's accuracy against state-of-the-art methods. The added noises include JPEG compression, gamma correction, median blurring, Gaussian blurring, Gaussian noise, and resizing. Experiments show that our method is robust because the local detection branch adopts a physiological detection strategy, which can adapt to image noise and disturbance. In future work, we will investigate physical attributes in cross-dataset settings.