Deepfakes Generation and Detection: A Short Survey

Advancements in deep learning techniques and the availability of free, large databases have made it possible, even for non-technical people, to either manipulate or generate realistic facial samples for both benign and malicious purposes. DeepFakes refer to face multimedia content that has been digitally altered or synthetically created using deep neural networks. The paper first outlines the readily available face editing apps and the vulnerability (or performance degradation) of face recognition systems under various face manipulations. Next, this survey presents an overview of the techniques and works that have been carried out in recent years for deepfakes and face manipulations. Specifically, four kinds of deepfake or face manipulation are reviewed, i.e., identity swap, face reenactment, attribute manipulation, and entire face synthesis. For each category, both generation methods and the corresponding detection methods are detailed. Despite significant progress based on traditional and advanced computer vision, artificial intelligence, and physics, an arms race continues to escalate between attackers/offenders/adversaries (i.e., DeepFake generation methods) and defenders (i.e., DeepFake detection methods). Thus, open challenges and potential research directions are also discussed. This paper is expected to aid the readers in comprehending deepfake generation and detection mechanisms, together with open issues and future directions.


Introduction
It is estimated that 1.8 billion images and videos per day are uploaded to online services, including social and professional networking sites [1]. However, approximately 40% to 50% of these images and videos appear to be manipulated [2] for benign reasons (e.g., images retouched for magazine covers) or adversarial purposes (e.g., propaganda or misinformation campaigns). In particular, human face image/video manipulation is a serious issue menacing the integrity of information on the Internet and face recognition systems since faces play a central role in human interactions and biometrics-based person identification. Therefore, plausible manipulations in face samples can critically subvert trust in digital communications and security applications (e.g., law enforcement).
DeepFakes refer to multimedia content that has been digitally altered or synthetically created using deep learning models [3]. Deepfakes are the results of face swapping, enactment/animation of facial expressions, and/or digitally generated audio or non-existing human faces. In contrast, face manipulation involves modifying facial attributes such as age, gender, ethnicity, morphing, attractiveness, skin color or texture, hair color, style or length, eyeglasses, makeup, mustache, emotion, beard, pose, gaze, mouth open or closed, eye color, injury and effects of drug use [4,5], and adding imperceptible perturbations (i.e., adversarial examples), as shown in Figure 1. The readily available face editing apps (e.g., FaceApp [6], ZAO [7], Face Swap Live [8], Deepfake web [9], AgingBooth [10], PortraitPro Studio [11], Reface [12], Audacity [13], Soundforge [14], Adobe Photoshop [15]) and Deep Neural Network (DNN) source codes [16,17] have enabled even non-experts and non-technical people to create convincing deepfakes and face manipulations.
Deepfakes are expected to advance present disinformation and misinformation sources to the next level, which could be exploited by trolls, bots, conspiracy theorists, hyperpartisan media, and foreign governments; thus, deepfakes could become fake news 2.0. Deepfakes can be used for productive applications such as realistic dubbing of foreign video films [18] or historical figure reanimation for education [19]. They can also be used for destructive applications such as fake pornographic videos to damage a person's reputation or to blackmail them [20], manipulating elections [21], creating warmongering situations [22], generating political or religious unrest via fake speeches [23], causing chaos in financial markets [24], or identity theft [25]. It is easy to notice that the malevolent exploitations of deepfakes chiefly dominate the benevolent ones.
In fact, not only have recent advances made it possible to create a deepfake from just a single still image [26], but deepfakes are also being successfully misused by cybercriminals in the real world. For instance, an audio deepfake was utilized to scam a CEO out of $243,000 [27]. The issue of deepfakes and face manipulations is compounded by the fact that they can negatively affect automated face recognition systems (AFRS). For instance, studies have shown that AFRS error rates can reach up to 95% under deepfakes [28], 50-99% under morphing [29], 17.08% under makeup manipulation [30], 17.05-99.77% under partial face tampering [31], 40-74% under digital beautification [32], 93.82% under adversarial examples [33], and 67% under GAN-generated synthetic samples [34]. Similarly, automated speaker verification accuracy drops from 98% to 40% under adversarial examples [35].
There exist many deepfake and face manipulation detection methods. However, a systematic analysis shows that the majority of them have low generalization capability, i.e., their performance drops drastically when they encounter a novel deepfake/manipulation type that was not used during the training stage, as also demonstrated in [36][37][38][39][40]. Also, prior studies considered deepfake detection a reactive defense mechanism and not a battle between attackers (i.e., deepfake generation methods) and defenders (i.e., deepfake detection methods) [41][42][43]. Therefore, there is a crucial gap between academic deepfake solutions and real-world scenarios or requirements. For instance, the foregoing works usually lag in robustness against adversarial attacks [44], decision explainability [45], and real-time mobile deepfake detection [46].
The study of deepfake generation and detection has, in recent years, gathered much momentum in the computer vision and machine learning community. There exist some review papers on this topic (e.g., [5,24,47,48]), but they focus mainly on deepfakes or synthetic samples produced using generative adversarial networks. Moreover, most survey articles (e.g., [4,49,50]) were written mainly from an academic point of view rather than a practical development point of view, and they do not cover very recent face manipulation methods and new deepfake generation and detection techniques. Thus, this paper provides a concise but comprehensive overview from both theoretical and practical points of view, to furnish the reader with an intellectual grasp of the field as well as to facilitate the progression of novel and more resilient techniques. For example, publicly available apps, codes, or software information can be easily accessed or downloaded for further development and use. All in all, this paper presents an overview of current deepfake and face manipulation techniques by covering four kinds of deepfake or face manipulation: identity swap, face reenactment, attribute manipulation, and entire face synthesis, where, for every category, both manipulation generation and manipulation detection methods are summarized. Furthermore, open challenges and potential future directions (e.g., deepfake detection systems robust against adversarial attacks using multistream and filtering schemes) that need to be addressed in this evolving field are highlighted. The main objectives of this article are to complement earlier survey papers with recent advancements, to impart to the reader a deeper understanding of the deepfake creation and detection domain, and to serve as a basis for developing novel algorithms for deepfake and face manipulation generation and detection systems.
The rest of the article is organized as follows. Section 2 presents deepfake and face manipulation generation as well as detection techniques. In Section 3, the open issues and potential future directions of deepfake generation and detection are discussed. The conclusions are described in Section 4.

Deepfake Generation and Detection
We can broadly define a deepfake as "believable audio, visual, or multimedia content generated by deep neural networks". Deepfake/face manipulation can be categorized into four main groups: identity swap, face reenactment, attribute manipulation, and entire face synthesis [47], as shown in Figure 2. Several works have been conducted on different types of deepfake/face manipulation generation and detection. In the following subsections, we include representative studies based on their novelty, foundational idea, and/or performance. Studies have also been incorporated to represent the most up-to-date research depicting the state of the art in deepfake generation and detection.

Identity Swap
Here, an overview of existing identity swap or face swap (i.e., replacing a person's face with another person's face) generation and detection methods is presented.

Identity Swap Generation
This consists of replacing the face of a person in the target image/video with the face of another person in the source image/video [51]. For example, Korshunova et al. [52] developed a face-swapping method using Convolutional Neural Networks (CNNs), while Nirkin et al. [53] proposed a technique using a standard fully convolutional network in unconstrained settings. Mahajan et al. [54] presented a face swap procedure for privacy protection. Wang et al. [55] presented a real-time face-swapping method. Natsume et al. [56] proposed a region-separative generative adversarial network (RSGAN) for face swapping and editing. Other interesting face swapping methods can be found in [28,57,58,59,60,61].
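For illustration only, the sketch below shows a minimal classical landmark-based face swap (affine alignment followed by Poisson blending). It is not one of the CNN/GAN-based methods cited above, and the `get_landmarks` helper (e.g., a wrapper around an off-the-shelf landmark detector) is an assumption rather than part of any cited system.
```python
# Minimal classical face-swap sketch (illustrative only, not a cited method).
# `get_landmarks` is a hypothetical helper returning an (N, 2) array of facial
# landmark coordinates for the largest face in a BGR image.
import cv2
import numpy as np

def simple_face_swap(source_bgr, target_bgr, get_landmarks):
    src_pts = np.float32(get_landmarks(source_bgr))   # landmarks on source face
    dst_pts = np.float32(get_landmarks(target_bgr))   # landmarks on target face

    # Estimate a similarity transform mapping source landmarks onto target ones.
    M, _ = cv2.estimateAffinePartial2D(src_pts, dst_pts)
    h, w = target_bgr.shape[:2]
    warped_src = cv2.warpAffine(source_bgr, M, (w, h))

    # Convex-hull mask around the target face region.
    hull = cv2.convexHull(dst_pts.astype(np.int32))
    mask = np.zeros((h, w), dtype=np.uint8)
    cv2.fillConvexPoly(mask, hull, 255)

    # Poisson blending hides the seam between the warped source face and target.
    x, y, bw, bh = cv2.boundingRect(hull)
    center = (x + bw // 2, y + bh // 2)
    return cv2.seamlessClone(warped_src, target_bgr, mask, center, cv2.NORMAL_CLONE)
```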

Identity Swap Detection
Numerous identity swap detection methods have been proposed, including the work of Aneja et al. [79] using zero-shot learning. Recently, S. Liu et al. [80] proposed a block shuffling learning method to detect deepfakes, where the image is divided into blocks and, using random shuffling, intra-block and inter-block features are extracted.
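The sketch below roughly illustrates only the block-shuffling preprocessing idea (splitting a face image into blocks and randomly permuting them before feature extraction, so a detector must rely on intra-block artifacts rather than global face layout); it is not the implementation of [80].
```python
# Rough sketch of block shuffling as a preprocessing step for a deepfake
# detector; illustrative only, not the implementation of S. Liu et al. [80].
import numpy as np

def shuffle_blocks(image, block_size=32, rng=None):
    """Split an H x W x C image into non-overlapping blocks and shuffle them."""
    if rng is None:
        rng = np.random.default_rng()
    h, w, c = image.shape
    assert h % block_size == 0 and w % block_size == 0, "pad/crop the face first"
    gh, gw = h // block_size, w // block_size

    # Rearrange into a (gh * gw, block, block, C) stack of blocks.
    blocks = (image.reshape(gh, block_size, gw, block_size, c)
                   .swapaxes(1, 2)
                   .reshape(gh * gw, block_size, block_size, c))
    blocks = blocks[rng.permutation(gh * gw)]           # random inter-block order

    # Reassemble the shuffled blocks into an image of the original size.
    return (blocks.reshape(gh, gw, block_size, block_size, c)
                  .swapaxes(1, 2)
                  .reshape(h, w, c))
```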

Face Reenactment
Here, an overview of prior face reenactment (i.e., changing the facial expression of the individual) generation and detection techniques is provided.

Face Reenactment Generation
This consists of replacing the facial expression of a person in the target image/video with the facial expression of another person in the source image/video [47]. It is also known as expression swap or puppet master. For instance, Thies et al. [82] developed real-time face reenactment RGB video streams. Whereas encoder-decoder, RNN, unified landmark converter with geometry-aware generator, GANs, and task-agnostic GANs-based schemes were designed by Kim et al. [83], Nirkin et al. [84], Zhang et al. [85], Doukas et al. [86], and Cao et al. [87], respectively.

Face Reenactment Detection
Face reenactment detection methods were designed by Cozzolino et al. [88] using CNNs; Matern et al. [89] using visual features with logistic regression and MLP; Rossler et al. [90] using mesoscopic, steganalysis, and CNN features; Sabir et al. [91] using RNN; Amerini et al. [65] using Optical Flow + CNNs; Kumar et al. [92] using multistream CNNs; and Wang et al. [93] using 3DCNN. In contrast, Zhao et al. [94] designed a spatiotemporal network, which can utilize complementary global and local information. In particular, the framework uses a spatial module for the global information, and the local information module extracts features from patches selected by attention layers.
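As a rough illustration of the optical-flow-plus-CNN line of work (e.g., Amerini et al. [65]), the sketch below stacks dense optical flow fields from consecutive grayscale frames and feeds them to a small classifier; the tiny network is purely illustrative and is not the architecture of any cited method.
```python
# Sketch: dense optical flow between consecutive frames is stacked as channels
# and fed to a small CNN producing a real/fake score. Illustrative only.
import cv2
import numpy as np
import torch
import torch.nn as nn

def flow_stack(frames):
    """frames: list of consecutive grayscale uint8 frames (H x W)."""
    flows = []
    for prev, nxt in zip(frames[:-1], frames[1:]):
        f = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                         0.5, 3, 15, 3, 5, 1.2, 0)  # (H, W, 2)
        flows.append(f)
    # Shape (1, 2*(T-1), H, W): flow x/y components become input channels.
    x = np.concatenate(flows, axis=2).transpose(2, 0, 1)[None]
    return torch.from_numpy(x).float()

class FlowDetector(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 1))

    def forward(self, x):                     # x: (B, in_ch, H, W)
        return torch.sigmoid(self.net(x))     # probability of being fake

# Usage with 5 consecutive frames (4 flows, 8 channels):
# detector = FlowDetector(in_ch=8); score = detector(flow_stack(frames))
```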

Attribute Manipulation
Here, an overview of existing attribute manipulation or face retouching, or face editing (i.e., altering certain face attributes such as skin tone, age, and gender) generation and detection techniques is presented.

Entire Face Synthesis
Here, an overview of prior entire face synthesis (i.e., creating non-existent face samples) generation and detection techniques is provided.

Entire Face Synthesis Detection
Many studies have also focused on entire face synthesis detection. For example, McCloskey et al. [124] presented a color cues-based system, while GAN fingerprints + CNNs [125], PRNU [126], co-occurrence matrices [127], neuron behaviors [128], incremental learning + CNNs [129], and self-attention mechanisms [130] have also been utilized. Guo et al. [131] showed that GAN-generated faces can be detected by analyzing irregular pupil shapes, which may be caused by the lack of physiological constraints in GAN models. Table 1 presents a summary of deepfake and face manipulation generation and detection techniques.
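As an illustration of the co-occurrence-matrix cue [127], the sketch below computes a per-channel pixel co-occurrence matrix with a single horizontal offset; this simplification is an assumption for brevity, whereas real detectors combine several offsets/directions and feed the matrices to a learned classifier.
```python
# Rough sketch of a per-channel pixel co-occurrence feature in the spirit of
# co-occurrence-matrix-based GAN-image detectors; simplified and illustrative.
import numpy as np

def cooccurrence_features(image_rgb):
    """image_rgb: H x W x 3 uint8 array. Returns a (3, 256, 256) feature tensor."""
    feats = []
    for ch in range(3):
        band = image_rgb[:, :, ch]
        left, right = band[:, :-1].ravel(), band[:, 1:].ravel()
        mat = np.zeros((256, 256), dtype=np.float64)
        np.add.at(mat, (left, right), 1.0)     # count horizontally adjacent pairs
        feats.append(mat / mat.sum())          # normalize to a joint distribution
    return np.stack(feats)
```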

Open Issues and Research Directions
Although great efforts have been made in devising deepfake generation and detection, there are several issues yet to be addressed successfully. In the following, some of them are discussed.

Generalization Capability
It is easy to notice in the literature that most existing deepfake detection frameworks' performances decrease remarkably when tested on deepfakes, manipulations, or databases that were not used for training. Thus, detecting novel, unknown deepfakes or deepfake generation tools is still a major challenge. The generalization capability of deepfake detectors is vital for dependable accuracy and public trust in the information being shared online. Some preliminary generalization solutions have been proposed, but their ability to tackle novel emerging deepfakes is still an open issue.
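A common way to probe this capability is a leave-one-manipulation-out protocol; the sketch below only makes that evaluation setup concrete and assumes hypothetical `train_detector` and `evaluate` helpers.
```python
# Sketch of a leave-one-manipulation-out protocol for probing generalization:
# train on all manipulation types except one, test on the held-out type.
# `train_detector` and `evaluate` are hypothetical helpers.
MANIPULATIONS = ["identity_swap", "face_reenactment",
                 "attribute_manipulation", "entire_face_synthesis"]

def cross_manipulation_eval(datasets, train_detector, evaluate):
    """datasets: dict mapping manipulation name -> (train_split, test_split)."""
    results = {}
    for held_out in MANIPULATIONS:
        train_data = [datasets[m][0] for m in MANIPULATIONS if m != held_out]
        detector = train_detector(train_data)            # fit on the seen types
        results[held_out] = evaluate(detector, datasets[held_out][1])
    return results
```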

Explainability of Deepfake Detectors
There is a lack of work on the interpretability and dependability of deepfake detection frameworks. Most deep-learning-based deepfake or face manipulation detection methods in the literature do not explain the reason behind the final detection outcome, mainly because deep learning techniques are black boxes in nature. Current deepfake or face manipulation detectors only give a label, confidence percentage, or fakeness probability score, but not an insightful description of the results. Such a description would be useful to understand why the detector made a certain decision. Also, deepfake or face manipulation (e.g., applying digital makeup) can be performed with either benign or malicious intentions; nonetheless, present detection techniques cannot distinguish the intent. To improve the interpretability and dependability of deepfake detection frameworks, advanced combinations of techniques such as fuzzy inference systems [187], layer-wise relevance propagation [188], and the Neural Additive Model [189] could be helpful.
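As a much simpler surrogate for the explanation techniques cited above, the sketch below computes a vanilla gradient saliency map for an assumed PyTorch detector that outputs a single fakeness logit; it indicates which pixels most influence the score but is not a substitute for layer-wise relevance propagation or Neural Additive Models.
```python
# Vanilla gradient-saliency sketch: which pixels most influence the fakeness
# score? `detector` is an assumed trained PyTorch model mapping a (1, 3, H, W)
# image tensor to a single fakeness logit. Illustrative only.
import torch

def saliency_map(detector, image):
    """image: (1, 3, H, W) float tensor in [0, 1]. Returns an (H, W) heat map."""
    detector.eval()
    x = image.detach().clone().requires_grad_(True)
    score = detector(x).squeeze()              # scalar fakeness logit
    score.backward()
    # Max absolute gradient over color channels -> per-pixel importance.
    return x.grad.abs().max(dim=1).values.squeeze(0)
```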

Next-Generation Deepfake and Face Manipulation Generators
Improved deepfake and face manipulation generation techniques will help develop more advanced and generalized deepfake detection methods. Some of the shortcomings of current datasets and generation methods are the lack of ultra-high-resolution samples (e.g., existing methods usually generate samples at 1024 × 1024 resolution, which is not sufficient for the next generation of deepfakes), limited face attribute manipulations (i.e., the types of face attribute manipulation depend on the training set, so manipulation characteristics and attributes are limited and novel attributes cannot be generated), the video continuity problem (i.e., deepfake/face manipulation techniques, especially identity swap, neglect the continuity of video frames as well as physiological signals), and the absence of obvious deepfakes/face manipulations (i.e., present databases do not contain obviously fake samples such as a human face with three eyes).

Vulnerability to Adversarial Attacks
Recent studies have shown that deep learning-based deepfake and face manipulation detection methods are vulnerable to adversarial examples [44]. Though current detectors are capable of handling several degradations (e.g., compression and noise), their accuracy drops to extremely low levels under adversarial attacks. Thus, next-generation techniques should be able to tackle not only deepfakes but also adversarial examples. To this aim, developing various multistream and filtering schemes could be effective.
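To make the threat concrete, the sketch below crafts a basic FGSM perturbation against an assumed PyTorch detector (a model mapping images in [0, 1] to a single fakeness logit); the attack form, perturbation budget, and model interface are illustrative assumptions only.
```python
# Minimal FGSM sketch: a small sign-of-gradient perturbation crafted to push a
# detector's "fake" score toward "real", illustrating the vulnerability above.
# `detector` is an assumed PyTorch model; epsilon is an illustrative budget.
import torch
import torch.nn.functional as F

def fgsm_evade(detector, fake_image, epsilon=2.0 / 255.0):
    x = fake_image.detach().clone().requires_grad_(True)
    logit = detector(x).squeeze()
    # Loss of the correct "fake" label; the attacker ascends this loss.
    loss = F.binary_cross_entropy_with_logits(logit, torch.ones_like(logit))
    loss.backward()
    x_adv = x + epsilon * x.grad.sign()        # step that lowers the fake score
    return x_adv.clamp(0.0, 1.0).detach()
```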

Mobile Deepfake Detector
Neural network-based deepfake detection methods, which are capable of attaining remarkable accuracy, are mostly unsuited for mobile platforms and applications owing to their huge number of parameters and high computational cost. Compressed, yet effective, deep learning-based detection systems that could run on mobile and wearable devices would greatly help counteract deepfakes and fake news.
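One straightforward starting point is post-training dynamic quantization, sketched below for an assumed PyTorch detector; a real mobile deployment would additionally involve pruning, distillation, and a dedicated export/runtime step.
```python
# Sketch of post-training dynamic quantization: Linear-layer weights are stored
# in int8, reducing model size with minimal code. `detector` is an assumed
# trained PyTorch model; this is only one of several compression routes.
import torch
import torch.nn as nn

def quantize_for_mobile(detector):
    detector.eval()
    # int8 weights for all Linear layers; activations remain float at runtime.
    return torch.quantization.quantize_dynamic(
        detector, {nn.Linear}, dtype=torch.qint8)
```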

Lack of Large-Scale ML-Generated Databases
Most studies on AI-synthesized face sample detection compiled their own databases with various GANs. Thereby, different published studies report different performances on GAN-generated samples because the quality of such samples varies and is mostly unknown. Several public GAN-generated fake face sample databases should be produced to help advance this demanding research field.

Reproducible Research
In the machine learning and deepfake research community, the trend of reproducible results should be encouraged by furnishing the public with large datasets (together with human scores/reasons), experimental setups, and open-source tools/codes. This will help outline the true progress in the field and avoid overestimating the performance of the developed methods.

Conclusions
AI-synthesized or digitally manipulated face samples, commonly known as DeepFakes, are a significant challenge threatening the dependability of face recognition systems and the integrity of information on the Internet. This paper provides a survey of recent advances in deepfake and facial manipulation generation and detection. Despite noticeable progress, several issues remain to be resolved to attain highly effective and generalized generation and defense techniques. Thus, this article discussed some of the open challenges and research opportunities. The field still has a long way to go toward dependable deepfake and face manipulation detection frameworks, which will require interdisciplinary research efforts across domains such as machine learning, computer vision, human vision, and psychophysiology. All in all, this survey may be utilized as a foundation for developing novel AI-based algorithms for deepfake generation and detection. It is also hoped that this survey paper will motivate budding scientists, practitioners, researchers, and engineers to consider deepfakes as their domain of study.

Conflicts of Interest:
The authors declare no conflict of interest.