1. Introduction
The rapid advancements in artificial intelligence (AI) and machine learning are being exploited by a wide variety of technologies, spanning beneficial applications (e.g., smart homes) and harmful ones (e.g., generating sophisticated cyberattacks). One such harmful application involves deepfakes, i.e., digital manipulations of audio and facial visual content. The term ‘deepfake’ was coined by fusing ‘deep learning’ and ‘fake’. It is estimated that a significant portion (around 50% [1]) of the billions of audio clips, images, and videos uploaded daily to online platforms, spanning social and professional networking sites, is manipulated [2,3]. As faces and speech are integral to human interaction and serve as the foundation for biometrics-based person recognition, manipulated faces and speech pose serious threats to the integrity of online information and to security systems [4].
Deepfakes are believable audio, visual, or multimedia content that has been digitally modified or synthetically generated through the application of AI and deep learning models [5,6,7]. Digital face manipulation encompasses the alteration of facial traits, e.g., gender, ethnicity, age, eyeglasses, emotion, morphing, beard, makeup, mustache, simulated effects of drug use, attractiveness, mouth closed or open, hair color/length/style, gaze, injury, skin texture or color, pose, adversarial examples (i.e., adding imperceptible perturbations), and eye color [2,8,9], as depicted in Figure 1. Overall, face manipulations or deepfakes can be grouped into four main categories: identity swap, face reenactment, attribute manipulation, and entire audio or face synthesis of non-existing identities [5], as presented in Figure 2.
Similarly, audio deepfakes involve manipulating or synthesizing speech samples in a way that changes the original sentences, or generating entirely new audio content of a particular individual [12,13]. As depicted in Figure 3, the main types of audio deepfakes [14,15,16,17] are voice cloning (i.e., generating synthetic voice recordings that closely sound like a specific person, using deep learning (DL) models trained on the target’s voice data), speech synthesis (i.e., generating natural-sounding speech from text input, also referred to as text-to-speech (TTS) synthesis), voice conversion (i.e., altering the speech attributes of a source speaker to make it sound like the target person’s voice while preserving the original speech’s linguistic content, also referred to as impersonation- or imitation-based audio deepfake), audio manipulation (i.e., modifying speech by changing tone, pitch, tempo, or emotional expression, rearranging, adding, or removing words and sentences, adding or removing background noise, or converting accent or gender), language translation (i.e., converting speech from one language to another while preserving the original speaker’s voice characteristics), and half deepfake (i.e., manipulating or synthesizing a specific segment of the audio while leaving the rest of the original audio source unchanged, also referred to as partial deepfake).
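As a concrete illustration of the audio manipulation category, the following minimal Python sketch pitch-shifts and time-stretches a speech recording; it is not drawn from any surveyed system, and the file names are placeholders.

```python
# A minimal sketch of the "audio manipulation" category: shifting pitch and
# stretching tempo of a speech recording. Assumes librosa and soundfile are
# installed; the file names are placeholders.
import librosa
import soundfile as sf

y, sr = librosa.load("original_speech.wav", sr=None)  # load at native sampling rate

# Raise the pitch by two semitones without changing duration.
y_pitched = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)

# Slow the speech down to 90% of its original tempo without changing pitch.
y_slowed = librosa.effects.time_stretch(y_pitched, rate=0.9)

sf.write("manipulated_speech.wav", y_slowed, sr)
```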
The prevalence of affordable, high-tech mobile devices (e.g., smartphones), along with easily accessible face and audio editing apps (e.g., FaceApp [10] and ResembleAI [18]) as well as open-source deep neural network code (e.g., the Identity-aware Dynamic Network [19] and vocoder-network-based multispeaker text-to-speech synthesis (SV2TTS) [20]), has empowered even non-experts to produce intricate deepfakes and digitally modified face samples that pose challenges for current computer forensic tools and human examiners. The ongoing evolution of deepfakes signals the emergence of fake news 2.0 (i.e., more evolved fake news built on highly realistic deepfakes created via sophisticated AI) and of disinformation and misinformation exploitable by various entities such as bots, foreign governments, hyperpartisan media, conspiracy theorists, and trolls. For instance, a chief executive officer (CEO) was scammed into losing USD 243,000 via the use of an audio deepfake [21].
In recent years, in the face deepfake field, several methods have been developed for identity swap (deepfake) generation (e.g., Shu et al. [22]), identity swap (deepfake) detection (e.g., Wang et al. [23]), reenactment generation (e.g., Agarwal et al. [24]), reenactment detection (e.g., Cozzolino et al. [25]), attribute manipulation generation (e.g., Patashnik et al. [26]), attribute manipulation detection (e.g., Asnani et al. [27]), entire face synthesis generation (e.g., Li et al. [28]), and entire face synthesis detection (e.g., Tan et al. [29]). Similarly, in the audio deepfake field, several methods have been formulated for voice cloning generation (e.g., Luong et al. [30]), voice cloning detection (e.g., Kulangareth et al. [31]), speech synthesis generation (e.g., Oord et al. [32]), speech synthesis detection (e.g., Rahman et al. [33]), voice conversion generation (e.g., Wang et al. [34]), voice conversion detection (e.g., Lo et al. [35]), audio manipulation generation (e.g., Choi et al. [36]), audio manipulation detection (e.g., Zhao et al. [37]), language translation generation (e.g., Jia et al. [38]), language translation detection (e.g., Kuo et al. [39]), half deepfake generation (e.g., Yi et al. [40]), and half deepfake detection (e.g., Wu et al. [41]). Despite this great progress, most audio and video deepfake detection frameworks generalize poorly (i.e., their accuracy decreases substantially on novel deepfakes [42]), are unsuitable for real-time mobile deepfake detection [43], and are susceptible to adversarial attacks [44], primarily owing to their reactive rather than proactive nature. There is a continuous arms race between attackers (i.e., deepfake generation techniques) and defenders (i.e., deepfake detection techniques), and it is imperative for researchers and professionals to stay abreast of the latest deepfake advancements in order to prevail in this crucial race.
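To make the adversarial-attack threat concrete, the sketch below applies the classic fast gradient sign method (FGSM) to a hypothetical frame-level detector; `detector`, the tensor shapes, and the step size are illustrative assumptions, not components of any surveyed system.

```python
# A minimal FGSM sketch in PyTorch illustrating the kind of imperceptible
# adversarial perturbation that deepfake detectors are reported to be
# vulnerable to [44]. `detector` is a hypothetical binary classifier
# (real vs. fake); nothing here comes from a specific surveyed method.
import torch
import torch.nn.functional as F

def fgsm_perturb(detector, frame, label, epsilon=2 / 255):
    """Return `frame` plus a small sign-gradient perturbation."""
    frame = frame.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(detector(frame), label)
    loss.backward()
    # Step in the direction that increases the detector's loss.
    adversarial = frame + epsilon * frame.grad.sign()
    return adversarial.clamp(0.0, 1.0).detach()
```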
Consequently, a few survey articles have been produced in the literature (e.g., [5,45,46,47,48]). However, they do not present systematic and detailed information about the different deepfake datasets, nor do they comprehensively cover the open issues and challenges facing academics and practitioners in this multidisciplinary field. To bridge this gap, this paper provides the following: (i) a comprehensive overview of existing image, video, and audio deepfake databases that can be utilized not only for enhancing the accuracy, generalization, and attack resilience of deepfake detection techniques but also for devising novel deepfake generation methods and tools; (ii) an extensive discussion of open challenges and potential research directions in the field of audio and visual deepfake generation and mitigation. All in all, we hope that this paper will complement prior survey articles and help newcomers, researchers, and engineers achieve a profound comprehension of the field and develop novel deepfake algorithms.
The rest of the paper is organized as follows:
Section 2 presents an overview of available image and video datasets employed for deepfake mitigation techniques.
Section 3 provides a review of available audio databases utilized for audio deepfake detection methods.
Section 4 is dedicated to discussing the open issues in deepfake technology and exploring possible future directions.
Section 5 outlines the key conclusions.
2. Image and Video Deepfake Datasets
Deepfake databases play a pivotal role in training, testing, and benchmarking countermeasures against deepfakes. The availability of an array of diverse databases will help advance deepfake technology. In this section, we present a comprehensive overview of existing datasets focused on image and video deepfakes. We thoroughly examined various research articles and database repositories to furnish comprehensive details, including information not typically found in previous works, such as the memory requirements for downloading or saving databases. We have incorporated all publicly reported and/or accessible datasets from 2013 to 2024, illustrating the comprehensive advancements within the image and video deepfake dataset field.
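Because several of the datasets surveyed below run to hundreds of gigabytes, or in cases such as KoDF to terabytes, a free-space check before downloading can prevent aborted transfers. The following standard-library sketch illustrates this; the mount point `/data` and the ~2.6 TB figure (KoDF's reported size) are placeholder assumptions.

```python
# A minimal sketch (standard library only) of checking free disk space
# before downloading a large dataset.
import shutil

required_bytes = int(2.6 * 1000**4)  # ~2.6 TB (decimal), e.g., for KoDF
free_bytes = shutil.disk_usage("/data").free

if free_bytes < required_bytes:
    print(f"Insufficient space: need ~{required_bytes / 1000**4:.1f} TB, "
          f"have {free_bytes / 1000**4:.1f} TB free")
else:
    print("Enough free space to proceed with the download.")
```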
Table 1, Table 2, Table 3, Table 4, Table 5, Table 6, Table 7, Table 8, Table 9, Table 10, Table 11, Table 12, Table 13, Table 14, Table 15, Table 16, Table 17, Table 18, Table 19, Table 20 and Table 21 present a comparative analysis of existing datasets for image and video deepfakes. The tables list the datasets in ascending order by year (first column) and then in alphabetical order by name within each year (second column) to provide a view of how the datasets and the deepfake field have progressed over time. It is hoped that this chronological arrangement will help readers observe the evolution of image and video deepfake datasets, deepfake types, unimodal or multimodal characteristics, quantities, quality, and the methods/tools used to generate deepfakes.
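For readers who assemble their own index of these datasets, the ordering convention is easy to reproduce programmatically; the records below are a small illustrative subset of those listed in the tables.

```python
# A toy sketch of the tables' ordering convention: sort dataset records by
# year first, then alphabetically by name within each year.
datasets = [
    {"year": 2019, "name": "FFHQ"},
    {"year": 2018, "name": "DeepfakeTIMIT"},
    {"year": 2019, "name": "DEFACTO"},
    {"year": 2020, "name": "DFDC"},
]

for record in sorted(datasets, key=lambda d: (d["year"], d["name"])):
    print(record["year"], record["name"])
```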
2.1. DSI-1 [49]
This dataset comprises 25 authentic and 25 manipulated images sourced from diverse websites, with varying resolutions. The images were altered by adding one or more individuals and exhibit medium-to-high visual quality. A splicing technique was used to create the deepfakes.
2.2. DSO-1 [49]
This dataset is composed of 100 real and 100 fake images. The images were manipulated by adding one or more persons; splicing with brightness and color adjustments was also performed as part of the forgeries.
2.3. CelebA-HQ [50]
This dataset features a collection of 30,000 high-quality deepfake celebrity images in JPG format, each with a resolution of 1024 × 1024 pixels. The GAN technique was used to generate these deepfakes. It is a high-quality deepfake dataset generated using the CelebA dataset as its source. It has a total size of ∼28.21 gigabytes.
2.4. DeepfakeTIMIT [51]
This database was created in 2018 with 320 real and 620 deepfake videos. The deepfakes were created based on face swapping using the FSGAN (Face Swapping Generative Adversarial Network [52]) technique. The deepfakes come in two visual qualities, i.e., 64 × 64 pixels (low resolution) and 128 × 128 pixels (high resolution). The videos are in AVI format. The deepfakes mostly feature frontal faces and exhibit blurring. There is no audio manipulation in the deepfakes. The dataset has a size of 226.6 MB.
Table 1.
Comparison and key characteristics of existing face image and video deepfake datasets. I = Images; V = Videos; C = Value or Visual Quality estimated by authors.
Table 2.
Comparison and key characteristics of existing face image and video deepfake datasets. I = Images; V = Videos; C = Value or Visual Quality estimated by authors.
Table 3.
Comparison and key characteristics of existing face image and video deepfake datasets. I = Images; V = Videos; C = Value or Visual Quality estimated by authors.
Table 4.
Comparison and key characteristics of existing face image and video deepfake datasets. I = Images; V = Videos; C = Value or Visual Quality estimated by authors.
Year | Dataset | No. of Real Samples | No. of Fake Samples | Total No. of Samples | Types of Deepfakes | Visual Quality | Format | Resolution | Methods Used to Generate Deepfake | Size of Dataset | Link to Download the Dataset |
---|---|---|---|---|---|---|---|---|---|---|---|
2019 | DEFACTO [72] | - | 229,000 (I) [72] | 229,000 (I) [72] | Face swapping, Face morphing [72] | Low (C) | TIF, JPG [73] | – [73] | Copy–Move, splicing, object-removal, morphing [72] | 121 GB [73] | https://www.kaggle.com/defactodataset/datasets (accessed on 31 December 2023) |
2019 | DFFD [74] | 1000 (V), 58,703 (I) [74] | 3000 (V), 240,336 (I) [74] | 4000 (V), 299,039 (I) [74] | Identity swap, Expression swap, Attribute manipulation, Entire face Synthesis [74] | Low and High (C) | PNG, MP4 [75] | (I) [75] | FaceSwap, Deepfake, DeepFaceLab, Face2Face, FaceAPP, StarGAN, PGGAN, StyleGAN [74] | 18.73 GB [75] | https://cvlab.cse.msu.edu/dffd-dataset.html (accessed on 31 December 2023) |
2019 | FaceForensics++ [76] | 1000 (V) [76] | 4000 (V) [76] | 5000 (V) (C) | Face reconstruction, Face swap, Facial reenactment [76] | Low and High [76] | MP4, PNG [60] | 640 × 480 (VGA), 1280 × 720 (HD), 1920 × 1080 (Full HD) [77] | DeepFakes (DF), Face2Face (F2F), FaceSwap (FS), NeuralTextures (NT) [76] | Original videos: 38.5 GB; All h264 compressed videos with compression rate factor: raw/0: 500 GB, 23: 10 GB, 40: 2 GB; All raw extracted images as PNGs: 2 TB [60] | https://kaldir.vc.in.tum.de/faceforensics_download_v4.py (accessed on 31 December 2023) |
Table 5.
Comparison and key characteristics of existing face image and video deepfake datasets. I = Images; V = Videos; C = Value or Visual Quality estimated by authors.
Year | Dataset | No. of Real Samples | No. of Fake Samples | Total No. of Samples | Types of Deepfakes | Visual Quality | Format | Resolution | Methods Used to Generate Deepfake | Size of Dataset | Link to Download the Dataset |
---|---|---|---|---|---|---|---|---|---|---|---|
2019 | FFHQ [78] | 70,000 (I) [78] | 140,000 (I) [79] | 210,000 (I) [79] | Entire face synthesis, Attribute manipulation [78] | High [78] | PNG [79] | [78] | StyleGAN [78] | 2.56 TB [79] | https://drive.google.com/drive/folders/1u2xu7bSrWxrbUxk-dT-UvEJq8IjdmNTP (accessed on 1 January 2024) |
2019 | WhichFaceReal [80] | 70,000 (I) [80] | 70,000 (I) (C) | 140,000 (I) (C) | Face Synthesis [80] | High (C) | JPG [80] | (C) | StyleGAN [80] | - | https://www.whichfaceisreal.com/ (accessed on 1 January 2024) |
2020 | DeeperForensics-1.0 [81,82] | 48,475 (V) [83] | 11,000 (V) [83] | 59,475 (V) (C) | Face swap [81] | High [66] | MP4 [84] | [81] | DF-VAE (DeepFake Variational Auto Encoder) [81] | 281.35 GB [84] | https://drive.google.com/drive/folders/1s3KwYyTIXT78VzkRazn9QDPuNh18TWe- (accessed on 31 December 2023) |
2020 | DFDC [85] | 23,654 (V) (C) | 104,500 (V) [85] | 128,154 (V) [85] | Face reenactment, Face swap [85] | High [66,85] | MP4 [86] | (DF-128) and (DF-256) [85] | DFAE, MM/NN face swap, NTH, FSGAN, StyleGAN, Refinement, Audio Swaps [85] | 471.84 GB [86] | https://www.kaggle.com/competitions/deepfake-detection-challenge/data (accessed on 31 December 2023) |
Table 6.
Comparison and key characteristics of existing face image and video deepfake datasets. I = Images; V = Videos; C = Value or Visual Quality estimated by authors.
Table 7.
Comparison and key characteristics of existing face image and video deepfake datasets. I = Images; V = Videos; C = Value or Visual Quality estimated by authors.
Year | Dataset | No. of Real Samples | No. of Fake Samples | Total No. of Samples | Types of Deepfakes | Visual Quality | Format | Resolution | Methods Used to Generate Deepfake | Size of Dataset | Link to Download the Dataset |
---|---|---|---|---|---|---|---|---|---|---|---|
2020 | FFIW 10K [91] | 10,000 (V) [91] | 10,000 (V) [91] | 20,000 (C) | Face swap [91] | High [66] | MP4 [92,93] | 480 and above [91] | DeepFaceLab, FS-GAN, FaceSwap [91] | Train: 17 GB [92], Test: 4.1 GB [93] | Train + Val: https://drive.google.com/file/d/1-Ha_A9yRFS0dACrv-L156Kfy_yaPn980/view?usp=sharing; Test: https://drive.google.com/file/d/1ydNrV_LK3Ep6i3_WPsUo0_aQan4kDUbQ/view?usp=sharing |
2020 | iFakeFaceDB [94] | - | 87,000 (I) [95] | 87,000 (I) [95] | Entire face synthesis [95] | High (C) | JPG [96] | [95] | StyleGAN, GANPrintR [95] | 1.4 GB [96] | http://socia-lab.di.ubi.pt/~jcneves/iFakeFaceDB.zip (accessed on 1 January 2024) |
2020 | RFFD [97] | Train: 1081 (I) [97] | Train: 960 (I) [97] | Train: 2041 (I) (C) [97] | Attribute manipulation [97] | Medium to High (C) | JPG [97] | [97] | Photoshop [97] | 431 MB [97] | https://www.kaggle.com/datasets/ciplab/real-and-fake-face-detection (accessed on 1 January 2024) |
Table 8.
Comparison and key characteristics of existing face image and video deepfake datasets. I = Images; V = Videos; C = Value or Visual Quality estimated by authors.
Table 9.
Comparison and key characteristics of existing face image and video deepfake datasets. I = Images; V = Videos; C = Value or Visual Quality estimated by authors.
Year | Dataset | No. of Real Samples | No. of Fake Samples | Total No. of Samples | Types of Deepfakes | Visual Quality | Format | Resolution | Methods Used to Generate Deepfake | Size of Dataset | Link to Download the Dataset |
---|---|---|---|---|---|---|---|---|---|---|---|
2021 | DeepStreets [106] | 600 (V) [106] | 600 (V) [106] | 1200 (V) [106] | Video to video synthesis [106] | High/Low [106] | MP4 [107] | [106] | Vid2vid, Wcvid2vid [106] | 24.1 GB [107] | http://clem.dii.unisi.it/~vipp/datasets.html (accessed on 1 January 2024) |
2021 | DFGC-21 [108] | 1000 (I) [108] | N × 1000 (I) [108] | Fake: N × 1000; Real: 1000 (I) (C) | Face swapping [108] | High (C) | PNG [108] | [108] | FaceShifter, FaceController, FaceSwap, FirstOrderMotion, FSGAN [108] | 5.18 GB [109] | https://drive.google.com/drive/folders/1SD4L3R0XCZnr-LnZy5G9Vsho9BpIYe6Z (accessed on 1 January 2024) |
2021 | DF-Mobio [110] | 31,950 (V) [110] | 14,546 (V) [110] | 46,496 (V) (C) | Face swap [110] | High (C) | MP4 [111] | [110] | GAN [110] | 170.1 GB [111] | https://zenodo.org/records/5769057 (accessed on 3 July 2024) |
2021 | DF-W [112] | - | 1869 (V) [112] | 1869 (V) [112] | Face swap [112] | High (C) | MP4 [113] | (low) to (high) [112] | Collected from YouTube, Reddit, and Bilibili [112] | 31.55 GB [113] | https://github.com/jmpu/webconf21-deepfakes-in-the-wild (accessed on 1 January 2024) |
Table 10.
Comparison and key characteristics of existing face image and video deepfake datasets. I = Images; V = Videos; C = Value or Visual Quality estimated by authors.
Year | Dataset | No. of Real Samples | No. of Fake Samples | Total No. of Samples | Types of Deepfakes | Visual Quality | Format | Resolution | Methods Used to Generate Deepfake | Size of Dataset | Link to Download the Dataset |
---|---|---|---|---|---|---|---|---|---|---|---|
2021 | FaceSynthesis [114] | - | 100,000 (I) [114] | 100,000 (I) [114] | Face synthesis [114] | High (C) | PNG [115] | [114] | VFX (3D face model) [114] | 31.8 GB [115] | https://facesyntheticspubwedata.blob.core.windows.net/iccv-2021/dataset_100000.zip (accessed on 1 January 2024) |
2021 | FakeAV-Celeb [116] | 500 (V) [116] | 19,500 (V) [116] | 20,000 (V) [116] | Face swapping, facial reenactment, voice cloning [116] | High [116] | MP4 [117] | [118] | Video: FSGAN, Wav2Lip, FaceSwap; Audio: SV2TTS [116] | 6 GB [118] | https://sites.google.com/view/fakeavcelebdash-lab/download?authuser=0 (accessed on 1 January 2024) |
2021 | ForgeryNet [119] | 1,438,201 (I), 99,630 (V) [119] | 1,457,861 (I), 121,617 (V) [119] | 2,896,062 (I) (C), 221,247 (V) (C) | ID replaced: Face swap, Face transfer, Face stacked manipulation; ID remained: Face reenactment, Face editing [119] | High/Low [66] | JPG, MP4 [120] | 240 to 1080 [119] | ID replaced: FSGAN, FaceShifter, BlendFace, MM Replacement; ID remained: MASKGAN, StarGAN, StyleGAN, ATVG-Net, SC-FEGAN [119] | 496.23 GB [120] | Link 1: https://opendatalab.com/OpenDataLab/ForgeryNet/tree/main; Link 2: https://115.com/s/swnk84d3wl3?password=cvpr&#ForgeryNet (accessed on 31 December 2023) |
Table 11.
Comparison and key characteristics of existing face image and video deepfake datasets. I = Images; V = Videos; C = Value or Visual Quality estimated by authors.
Year | Dataset | No. of Real Samples | No. of Fake Samples | Total No. of Samples | Types of Deepfakes | Visual Quality | Format | Resolution | Methods Used to Generate Deepfake | Size of Dataset | Link to Download the Dataset |
---|---|---|---|---|---|---|---|---|---|---|---|
2021 | HiFiFace [121] | - | 1000 (V) [122] | 1000 (V) [122] | Face swapping [121] | High [121] | MP4 [122] | and [121] | Encoder–Decoder [121] | 990 MB [123] and 5.1 GB [122] | https://drive.google.com/file/d/1tZitaNRDaIDK1MPOaQJJn5CivnEIKMnB/view (accessed on 1 January 2023) |
2021 | KoDF [124] | 62,166 (V) [124] | 175,776 (V) [124] | 237,942 (V) [124] | Face reenactment, Face swap [124] | High [124] | MP4 [125] | (initial), (reduced) [124] | Video: FaceSwap, DeepFaceLab, FSGAN, FOMM; Audio: ATFHP, Wav2Lip [124] | 2.6 TB [125] | https://deepbrainai-research.github.io/kodf/ |
2021 | Open Forensics [126] | 45,473 (I) [126] | 70,325 (I) [126] | 115,325 (I) [126] | Face Swap [126] | Low and High (C) | JPG [126] | (low), (high) [126] | GAN, Poisson blending [126] | 56.4 GB [127] | https://zenodo.org/records/5528418 (accessed on 1 January 2024) |
Table 12.
Comparison and key characteristics of existing face image and video deepfake datasets. I = Images; V = Videos; C = Value or Visual Quality estimated by authors.
Table 13.
Comparison and key characteristics of existing face image and video deepfake datasets. I = Images; V = Videos; C = Value or Visual Quality estimated by authors.
Year | Dataset | No. of Real Samples | No. of Fake Samples | Total No. of Samples | Types of Deepfakes | Visual Quality | Format | Resolution | Methods Used to Generate Deepfake | Size of Dataset | Link to Download the Dataset |
---|---|---|---|---|---|---|---|---|---|---|---|
2021 | VideoSham [132] | 413 (V) [132] | 413 (V) [132] | 826 (V) [132] | Face swapping, Face reenactment [132] | High (C) | MP4 [133] | [133] | Attack 1 (Adding an entity/subject): Adobe Photoshop; Attack 2 (Removing an entity/subject): AfterEffects; Attack 3 (Background/Color Change); Attack 4 (Text Replaced/Added); Attack 5 (Frames Duplication/Removal/Dropping); Attack 6 (Audio Replaced) [132] | Trimmed and manipulated: 5.2 GB [134] | https://github.com/adobe-research/VideoSham-dataset (accessed on 1 January 2024) |
2021 | WPDD [135] | 946 (V) [135] | 320,499 (I) [135] | 320,499 (I) + 946 (V) [135] | Face swap, seven face manipulations [135] | Medium and High (C) | MP4 | (Frame), (Face) [135] | iFace and FaceApp | 315 GB | - |
Table 14.
Comparison and key characteristics of existing face image and video deepfake datasets. I = Images; V = Videos; C = Value or Visual Quality estimated by authors.
Year | Dataset | No. of Real Samples | No. of Fake Samples | Total No. of Samples | Types of Deepfakes | Visual Quality | Format | Resolution | Methods Used to Generate Deepfake | Size of Dataset | Link to Download the Dataset |
---|---|---|---|---|---|---|---|---|---|---|---|
2022 | CDDB [136] | - | 842,140 (I) [136] | 842,140 (I) [136] | Face swap, Face reenactment, Face Synthesis [136] | Medium (C) | PNG [137] | [137] | GAN models: ProGAN, StyleGAN, BigGAN, CycleGAN, GauGAN, StarGAN; Non-GAN models: Glow, CRN, IMLE, SAN, Deepfake, Face2Face, Faceswap, Neural Texture; Unknown model: WhichFaceReal, WildDeepFake [136] | 9.6 GB [137] | https://drive.google.com/file/d/1NgB8ytBMFBFwyXJQvdVT_yek1EaaEHrg/view (accessed on 1 January 2024) |
2022 | CelebV-HQ [138] | - | 35,666 (V) [138] | 35,666 (V) [138] | Attribute editing [138] | High [138] | MP4 [139] | [138] | VideoGPT, MoCoGANHD, DIGAN, StyleGANV [138] | 38.5 GB [139] | https://pan.baidu.com/s/1TGzOwUcXsRw72l4gaWre_w?pwd=pg71#list/path=%2F (accessed on 1 January 2024) |
Table 15.
Comparison and key characteristics of existing face image and video deepfake datasets. I = Images; V = Videos; C = Value or Visual Quality estimated by authors.
Table 16.
Comparison and key characteristics of existing face image and video deepfake datasets. I = Images; V = Videos; C = Value or Visual Quality estimated by authors.
Year | Dataset | No. of Real Samples | No. of Fake Samples | Total No. of Samples | Types of Deepfakes | Visual Quality | Format | Resolution | Methods Used to Generate Deepfake | Size of Dataset | Link to Download the Dataset |
---|---|---|---|---|---|---|---|---|---|---|---|
2022 | FMFCC-V [146] | Short version: 44,290 (V), long version: 83 (V) [146] | Short version: 38,102 (V), long version: 192 (V) [146] | Short version: 82,392 (V) [146], long version: 275 (V) (C) | Face swap [146] | High (C) | MP4 [147] | 480, 720, 1080 [146] | Faceswap, faceswapGAN, DeepFaceLab, Recycle-GAN [146] | Long video version: 80.5 GB, Short video version: 411 GB [147] | https://github.com/iiecasligen/FMFCC-V (accessed on 1 January 2024) |
2022 | GBDF [148] | 2500 (V) (C) | 10,000 (V) [148] | 12,500 (V) (C) | Identity Swap, Expression swap [148] | High (C) | MP4 [149] | [148] | Identity Swapping: FaceSwap, FaceSwap-Kowalski, FaceShifter, Encoder–Decoder; Expression swapping: Face2Face and NeuralTextures [148] | 1 TB | https://github.com/aakash4305/~GBDF/releases/tag/v1.0 (accessed on 1 January 2024) |
2022 | LAV-DF [150] | 36,431 (V) [150] | 99,873 (V) [150] | 136,304 (V) [150] | Face reenactment, Voice Reenactment, Transcript manipulation [150] | High (C) | MP4 [151] | [150] | SV2TTS, Wav2Lip [150] | 23.8 GB [151] | https://drive.google.com/file/d/1-OQ-NDtdEyqHNLaZU1Lt9Upk5wVqfYJw/view (accessed on 31 December 2023) |
Table 17.
Comparison and key characteristics of existing face image and video deepfake datasets. I = Images; V = Videos; C = Value or Visual Quality estimated by authors.
Year | Dataset | No. of Real Samples | No. of Fake Samples | Total No. of Samples | Types of Deepfakes | Visual Quality | Format | Resolution | Methods Used to Generate Deepfake | Size of Dataset | Link to Download the Dataset |
---|---|---|---|---|---|---|---|---|---|---|---|
2022 | SFHQ [152] | - | 425,258 (I) [152] | 425,258 (I) [152] | Entire face synthesis [152] | High (C) | JPG, PNG [152] | [152] | StyleGAN2 [152] | Part 1: 15 GB, Part 2: 15 GB, Part 3: 23 GB, Part 4: 24 GB [152] | Part 1: https://www.kaggle.com/datasets/selfishgene/synthetic-faces-high-quality-sfhq-part-1; Part 2: https://www.kaggle.com/datasets/selfishgene/synthetic-faces-high-quality-sfhq-part-2; Part 3: https://www.kaggle.com/datasets/selfishgene/synthetic-faces-high-quality-sfhq-part-3; Part 4: https://www.kaggle.com/datasets/selfishgene/synthetic-faces-high-quality-sfhq-part-4 (accessed on 1 January 2024) |
2022 | TrueFace [153] | 70,000 (Pre social); 30,000 (Post social) (I) [153] | 80,000 (Pre social); 30,000 (Post social) (I) [153] | 150,000 (Pre social), 60,000 (Post social) (I) [153] | Face synthesis [153] | High (Quality factor = 87) [153] | JPG [153] | (initial), (resized) [153] | StyleGAN, StyleGAN2 [153] | 212 GB [154] | https://drive.google.com/file/d/1WgBrmuKUaLM3YT_5bSgyYUgIUYI_ghOo/view (accessed on 1 January 2024) |
2022 | ZoomDF [155] | 400 (I) [155] | 400 (V) [155] | 400 (I), 400 (V) (C) | Motion manipulation [155] | High (C) | MP4 (C) | - | Avatarify based on the First-Order Motion Model (FOMM) [155] | - | - |
Table 18.
Comparison and key characteristics of existing face image and video deepfake datasets. I = Images; V = Videos; C = Value or Visual Quality estimated by authors.
Year | Dataset | No. of Real Samples | No. of Fake Samples | Total No. of Samples | Types of Deepfakes | Visual Quality | Format | Resolution | Methods Used to Generate Deepfake | Size of Dataset | Link to Download the Dataset |
---|---|---|---|---|---|---|---|---|---|---|---|
2023 | AV-Deepfake 1M [65] | 286,721 (V) [65] | 860,039 (V) [65] | 1,146,760 (V) [65] | Face reenactment, Voice cloning, Transcript manipulation [65] | High [65] | MP4 [156] | | TalkLip, VITS, YourTTS, ChatGPT [65] | ∼400 GB | https://monashuni-my.sharepoint.com/:f:/g/personal/zhixi_cai_monash_edu/EgeT8-G5RPdLnHqVw33ePRUBeqfxt6Ighe3CmIWKLLQWLQ?e=Sqf7n (accessed on 25 April 2024) |
2023 | DETER [157] | 38,996 (I) [157] | 300,000 (I) [157] | 338,996 (I) (C) | Face swapping, inpainting, attribute editing [157] | High [157] | JPG [158] | Multiple (up to >2048) [157] | GAN-based models: E4S, MAT; Diffusion Models: DiffSwap, DiffIR [157] | - | https://deter2024.github.io/deter/ (accessed on 1 January 2024) |
2023 | DFFMD [159] | 1000 (V) [159] | 1000 (V) [159] | 2000 (V) [159] | Facial reenactment [159] | Medium [159] | MP4 [160] | [160] | First-Order Motion model (FOMM) [159] | 10 GB [160] | https://www.kaggle.com/datasets/hhalalwi/deepfake-face-mask-dataset-dffmd (accessed on 1 January 2024) |
2023 | DF-Platter [161] | 764 (V) [161] | 132,496 (V) [161] | 133,260 (V) [161] | Face reenactment, Face swap [161] | High/Low [66], Average brisque score: 43.25 [161] | MPEG4.0 [161] | 720 (High resolution), 360 (Low resolution) [161] | FSGAN, FaceSwap, FaceShifter [161] | 417 GB [161] | https://drive.google.com/drive/folders/1GeR-a2LfcMkcY6Qzpv2TP8utLtYFBmTs (accessed on 31 December 2023) |
2023 | eKYC-DF [162] | 760 (V) [162] | 228,000 (V) [162] | 228,760 (V) [162] | Face swap [162] | High [162] | - | – [162] | SimSwap, FaceDancer, SberSwap [162] | eKYC-DF: >1.5 TB; eKYC-6K: >750 GB | https://github.com/hichemfelouat/eKYC-DF (accessed on 1 January 2024) |
2023 | IDForge [163] | 79,827+, 214,438 (reference dataset) (V) [163] | 169,311 (V) [163] | 463,576 (V) [163] | Face swapping, transcript manipulation, audio cloning/manipulation [163] | High [163] | - | [163] | Video: Insight-Face, SimSwap, InfoSwap, Wav2Lip; Audio: TorToiSe, RVC, audio shuffling; Text: GPT-3.5, text shuffling [163] | ∼600 GB | - |
Table 19.
Comparison and key characteristics of existing face image and video deepfake datasets. I = Images; V = Videos; C = Value or Visual Quality estimated by authors.
Year | Dataset | No. of Real Samples | No. of Fake Samples | Total No. of Samples | Types of Deepfakes | Visual Quality | Format | Resolution | Methods Used to Generate Deepfake | Size of Dataset | Link to Download the Dataset |
---|---|---|---|---|---|---|---|---|---|---|---|
2023 | PolyGlotFake [164] | 766 (V) [164] | 14,472 (V) [164] | 15,238 (V) [164] | Attribute manipulation (lip-sync), text-to-speech, voice cloning [164] | High [164] | - | [164] | Audio manipulation: Bark+FreeVC, Micro- TTS+FreeVC, XTTS, Tacotron+F- reeVC, Vall-E-X; Video manipulation: VideoRetalking, Wav2Lip [164] | - | https://github.com/tobuta/PolyGlotFake (accessed on 14 June 2024) |
2023 | Retouching FFHQ [165] | 58,158 (I) [165] | 652,568 (I) [165] | 710,726 (I) (C) | Face retouching [165] | High [165] | JPG, PNG [165] | (initial), (final) [165] | Using APIs: Megvii, Alibaba, Tencent [165] | 363.96 GB [166] | https://drive.google.com/drive/folders/194Viqm8Xh8qleYf66kdSIcGVRupUOYvN (accessed on 18 April 2024) |
2023 | RWDF-23 [167] | - | 2000 (V) [167] | 2000 (V) [167] | Face swapping [167] | High [167] | - | [167] | DeepFaceLab, DeepFaceLive, FOMM, SimSwap, FacePlay, Reface, Deepfake Studio, FaceApp, Revive, LicoLico, Fakeit; DeepFaker, DeepFakesWeb, Deepcake.io, DeepFaker Bot, Revel.ai [167] | - | - |
Table 20.
Comparison and key characteristics of existing face image and video deepfake datasets. I = Images; V = Videos; C = Value or Visual Quality estimated by authors.
Year | Dataset | No. of Real Samples | No. of Fake Samples | Total No. of Samples | Types of Deepfakes | Visual Quality | Format | Resolution | Methods Used to Generate Deepfake | Size of Dataset | Link to Download the Dataset |
---|---|---|---|---|---|---|---|---|---|---|---|
2023 | SPRITZ-PS [168] | 20,000 (I) [168] | 20,000 (I) [168] | 40,000 (I) [168] | Entire face synthesis, iris reconstruction [168] | Low to Medium (C) | JPG [169] | Face: >2000 × >2000; Iris: – [169] | StyleGAN2, ProgressiveGAN, StarGAN [168] | 17.84 GB [169] | https://ieee-dataport.org/documents/spritz-ps-validation-synthetic-face-images-using-large-dataset-printed-documents (accessed on 1 January 2024) |
2024 | DeepFaceGen [170] | 463,583 (I), 313,407 (V) [170] | 350,264 (I), 423,548 (V) [170] | 350,264 (I), 423,548 (V) [170] | Entire Face Synthesis, Face Swapping, Face Reenactment, Attribute Manipulation [170] | High [170] | JPG, PNG, MP4 [171] | - | FaceShifter, FSGAN, DeepFakes, BlendFace, DSS, SBS, MMReplacement, SimSwap, Talking Head Video, ATVG-Net, Motion-cos, FOMM, StyleGAN2, MaskGAN, StarGAN2, SC-FEGAN, DiscoFaceGAN, OJ, SD1, SD2, SDXL, Wenxin, Midjourney, DF-GAN, DALL·E, DALL·E 3, AnimateDiff, AnimateLCM, Hotshot, Zeroscope, MagicTime, Pix2Pix, SDXLR, VD [170] | 491.1 GB [171] | https://github.com/HengruiLou/DeepFaceGen (accessed on 20 June 2024) |
Table 21.
Comparison and key characteristics of existing face image and video deepfake datasets. I = Images; V = Videos; C = Value or Visual Quality estimated by authors.
Year | Dataset | No. of Real Samples | No. of Fake Samples | Total No. of Samples | Types of Deepfakes | Visual Quality | Format | Resolution | Methods Used to Generate Deepfake | Size of Dataset | Link to Download the Dataset |
---|---|---|---|---|---|---|---|---|---|---|---|
2024 | DF40 [172] | 52,590 (V) [172] | 0.1 M+ (I), 1 M+ (V) [172] | 0.1 M+ (I), 1 M+ (V) [172] | Face Swapping, Face Reenactment, Entire Face Synthesis, Attribute Manipulation [172] | High [172] | - | - | FSGAN, FaceSwap, SimSwap, InSwapper, BlendFace, UniFace, MobileSwap, e4s, FaceDancer, DeepFaceLab, FOMM, FS_vid2vid, Wav2Lip, MRAA, OneShot, PIRender, TPSMM, LIA, DaGAN, SadTalker, MCNet, HyperReenact, HeyGen, VQGAN, StyleGAN2, StyleGAN3, StyleGAN-XL, SD-2.1, DDPM, RDDM, PixArt-, DiT-XL/2, SiT-XL/2, MidJourney6, WhichisReal, CollabDiff, e4e, StarGAN, StarGANv2, StyleCLIP [172] | - | - |
2.5. EBV (Eye Blinking Video) Dataset [55]
Within this dataset, there are 50 authentic videos and 49 counterfeit videos, focusing on deepfakes of the face-swapping type. It employs the DeepFake algorithm [55] for generation. The videos are characterized by low visual quality. The dataset is 1.01 GB in size.
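Since this dataset targets eye blinking, it is worth noting how blink cues are commonly quantified: the eye aspect ratio (EAR), computed from six landmarks around each eye, drops sharply during a blink. The sketch below is a generic illustration of that measure, not the detection method of [55]; obtaining the landmarks themselves (e.g., with an external face landmark model) is assumed.

```python
# A minimal sketch of the eye aspect ratio (EAR), a common blink cue computed
# from six eye landmarks (p1..p6); a low EAR over consecutive frames
# indicates a blink. Landmark detection itself is assumed to be external.
import numpy as np

def eye_aspect_ratio(eye):
    """`eye` is a (6, 2) array of landmarks around one eye."""
    v1 = np.linalg.norm(eye[1] - eye[5])  # vertical distance p2-p6
    v2 = np.linalg.norm(eye[2] - eye[4])  # vertical distance p3-p5
    h = np.linalg.norm(eye[0] - eye[3])   # horizontal distance p1-p4
    return (v1 + v2) / (2.0 * h)
```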
2.6. Faceforensics [57]
This database was introduced in 2018. It comprises 1004 real and 1004 deepfake videos. The original source videos were acquired from YouTube: videos tagged with ‘face’, ‘newscaster’, or ‘newsprogram’ on YouTube and in the YouTube-8M dataset [173] were chosen for this dataset. The dataset focuses on two types of deepfakes (i.e., self-reenactment and source-to-target transfer) generated using the Face2Face technique. The videos are of high visual quality and in MP4 format. The dataset has been segmented into 704, 150, and 150 videos for training, validation, and testing, respectively. It consists of 130 GB of losslessly compressed videos and 3.5 TB of raw videos.
2.7. FFW (Fake Faces in the Wild) [61]
This database, established in 2018, comprises a total of 150 fake videos. The dataset focuses primarily on face replacement and image tampering. The videos are formatted in MP4 and AVI and exhibit a resolution of 480 pixels or higher. The manipulation techniques used are CGI (computer-generated imagery), GANs (generative adversarial networks), and the FakeApp mobile application.
2.8. HOHA (Hollywood Human Actions)-Based [63]
The HOHA-based dataset was introduced in 2018 and comprises 300 real and 300 forged videos. This dataset primarily focuses on face swapping and used the encoder–decoder approach for deepfake generation. The videos are provided in MP4 format and are characterized by low-to-medium visual quality.
2.9. UADFV [64]
UADFV is a deepfake dataset created in 2018 with 98 samples, evenly distributed between 49 deepfake and 49 real videos. The dataset’s primary focus is on face swap deepfakes, and its videos have low visual quality. The FakeApp mobile application was used to generate the deepfakes. This database has a size of 146 MB.
2.10. Celeb-DF [68]
This is another deepfake dataset featuring 6229 samples, including 590 real and 5639 fake samples. The primary focus of the dataset is on face reconstruction, accomplished using an improved deepfake synthesis algorithm. The visual quality of the deepfakes is high, and the videos are in MPEG4.0 format. The dataset has a size of 9.3 GB. The real videos acting as sources were derived from publicly accessible YouTube clips, showcasing 59 celebrities with diverse gender identities, age groups, and ethnic backgrounds. In the genuine videos, 56.8% of the subjects are male while 43.2% are female. The distribution of subjects is as follows: 6.4% are under the age of 30, 28.0% are in their 30s, 26.6% are in their 40s, 30.5% are in the 50–60 age range, and 8.5% are aged 60 and above. In terms of ethnicity, 5.1% are Asians, 88.1% are Caucasians, and 6.8% are African Americans.
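Detection models are typically trained on frames sampled from such video datasets. The following generic sketch (not specific to Celeb-DF or any surveyed method) extracts one frame every `stride` frames with OpenCV; the paths and stride are placeholders.

```python
# A generic sketch of sampling frames from a video so that image-level
# detectors can be trained on video datasets. Paths and stride are
# placeholder assumptions.
import cv2

def sample_frames(video_path, out_pattern="frame_{:05d}.png", stride=30):
    cap = cv2.VideoCapture(video_path)
    index = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break  # end of video
        if index % stride == 0:  # keep one frame every `stride` frames
            cv2.imwrite(out_pattern.format(saved), frame)
            saved += 1
        index += 1
    cap.release()
    return saved
```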
2.11. DeepFakeDetection (Google and Jigsaw Dataset) [70]
This dataset was published in 2019 by Google and Jigsaw. It includes 3068 fake videos and 363 real videos. The deepfake type used in the dataset is face swap. The dataset includes videos manipulated with DeepFakes (DF) [
174], Face2Face (F2F) [
59], FaceSwap (FS) [
175], and NeuralTextures (NT) [
176], and is part of the FaceForensics benchmark. The source videos have the following sizes based on their compression rate factors: Raw/0: 200 GB, C23: 3 GB, and C40: 400 MB. The manipulated videos have the following sizes: Raw/0: 1.6 TB, C23: 22 GB, and C40: 3 GB.
2.12. DEFACTO [72]
The DEFACTO deepfake dataset has a total of 229,000 fake images. For faces, the dataset focuses on face swapping and face morphing. It used techniques such as copy–move, splicing, object removal, and morphing. The copy–move technique duplicates an element within an image. In splicing, a portion of one image is copied and pasted onto another image. In object removal, an object is removed from the image using inpainting algorithms. Finally, morphing consists of warping and blending two images together. For each forgery, postprocessing may be applied (rotation, scaling, contrast, etc.). The visual quality of the deepfakes in DEFACTO is low. The images are encoded in the TIF (Tagged Image File Format) and JPG formats. The FaceSwap technique was used to generate the face deepfakes. The dataset has a size of 121 GB.
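As a toy illustration of the splicing operation described above, the sketch below pastes a cropped region from one image into another; real forgery pipelines add blending and the postprocessing steps listed (rotation, scaling, contrast), and the file names and coordinates here are placeholders.

```python
# A toy illustration of splicing: a region of one image is pasted into
# another. File names and coordinates are placeholder assumptions.
from PIL import Image

donor = Image.open("donor.jpg")
target = Image.open("target.jpg")

region = donor.crop((100, 100, 260, 260))   # (left, upper, right, lower)
target.paste(region, (50, 50))              # splice the region into the target
target.save("spliced.jpg")
```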
2.13. DFFD (Diverse Fake Face Dataset) [74]
This dataset was made available in 2019 and is made up of 1000 real videos and 58,703 genuine images, alongside 3000 deepfake videos and 240,336 deepfake images. The deepfake variants encompass identity swap, expression swap, attribute manipulation, and entire face synthesis using frameworks such as FaceSwap [174], Deepfake [174], DeepFaceLab [177], Face2Face [59], FaceAPP [10], StarGAN [178], PGGAN (Progressive Growing of GANs) [50], and StyleGAN [78]. The visual quality ranges from low to high. The size of the dataset is 18.73 GB.
2.14. FaceForensics++ [76]
FaceForensics++ stands out as a prominent dataset for deepfake detection. It is an extension of the Faceforensics [57] database and is made up of 1000 real and 4000 manipulated videos. Multiple tools and techniques were employed to generate four subsets of deepfakes, i.e., DeepFakes (DF) [174], Face2Face (F2F) [59], FaceSwap (FS) [175], and NeuralTextures (NT) [176]. The visual quality is mainly low and high. The dataset provides uncompressed and H.264-compressed MP4 videos, encoded with CRF values of 0, 23, and 40, so a deepfake detector’s performance can be assessed on both compressed and uncompressed videos. The dataset offers a range of resolutions: 640 × 480 (VGA), 1280 × 720 (HD), and 1920 × 1080 (Full HD). The dataset lacks lip-sync deepfakes, and color inconsistencies can easily be observed. It contains 60% male and 40% female samples. The original videos have a total size of 38.5 GB. The H.264-compressed videos have the following sizes, depending on the compression rate factor: raw/0: 500 GB, 23: 10 GB, and 40: 2 GB. Moreover, all the raw extracted images in PNG format have a total size of 2 TB.
2.15. FFHQ (FlickrFaces-High Quality) [78]
The FFHQ (Flickr-Faces-HQ) dataset was created in 2019 with 70,000 real and 140,000 fake images, totaling 210,000 images. The dataset focuses on high-resolution synthetic and attribute-manipulated images stored in PNG format, each with a resolution of 1024 × 1024 pixels. The images were generated using StyleGAN technology. It has an extensive size of 2.56 TB.
2.16. WhichFaceReal [80]
This dataset was established in 2019 and contains 70,000 deepfake images, focusing primarily on face synthesis deepfakes. The samples are stored in JPG format. The dataset used the StyleGAN [11] technique to generate the manipulated images.
2.17. DeeperForensics-1.0 [81,82]
DeeperForensics-1.0 is a deepfake dataset created in 2020 containing 59,475 videos, including 48,475 real and 11,000 fake videos. The primary focus of this dataset is on face swapping. The visual quality of the deepfakes in DeeperForensics-1.0 is high, and the videos are encoded in MP4 format. The generation process involves the use of DF-VAE (DeepFake Variational Auto-Encoder). The dataset has a size of 281.35 GB. It includes 45 females and 55 males, spanning 26 countries. The age spectrum of the subjects varies between 20 and 45 years, mirroring the prevalent age range in real-world videos. The videos contain eight expressions: neutral, happy, surprise, sad, angry, disgust, contempt, and fear.
2.18. DFDC (Deepfake Detection Challenge) [85]
DFDC is a public dataset created in 2020, comprising 128,154 videos: 104,500 fake and 23,654 real. It contains face reenactment and face swap types of deepfakes. The deepfakes exhibit a high level of visual quality, and the videos are in MP4 format. The dataset offers two resolutions, with images sized at 128 × 128 pixels (DF-128) and 256 × 256 pixels (DF-256). Multiple tools and techniques were employed in the generation process, including DFAE (Deepfake Autoencoder), MM/NN (morphable-mask/nearest-neighbors model), NTH (Neural Talking Heads) [179], FSGAN (Face Swapping GAN) [180], StyleGAN [11], Refinement, and Audio Swaps [181]. This dataset has a size of 471.84 GB.
2.19. DMD (Deepfake MUCT Dataset) [87]
This dataset was created in 2020. It contains 751 original images and 8857 manipulated images, forming a dataset of 9608 images in total. The dataset focuses on deepfake types such as face swapping, gender conversion, face morphing, and smile manipulation. The FakeApp mobile application was used to generate the deepfakes. It has a total size of 1.42 GB.
2.20. FaceShifter [88]
This dataset is composed of 1000 real videos and 10,000 fake videos. It is primarily centered around the face swapping deepfake type, with deepfakes generated using the FaceShifter technique. The videos have a high visual quality.
2.21. FakeET [89]
Introduced in 2020, this dataset includes 331 authentic videos and 480 manipulated (fake) videos, and focuses on face swap. The videos are of high visual quality and in MP4 format. This dataset was derived from the Google/Jigsaw dataset and has a size of 25.9 GB.
2.22. FFIW 10K (Face Forensics in the Wild) [91]
FFIW was introduced in 2020, featuring 10,000 real and 10,000 fake videos, totaling 20,000 videos. The deepfake type used in this dataset is face swapping. The videos in FFIW have a resolution of 480 pixels and above. The dataset uses face manipulation technologies such as DeepFaceLab [182], FS-GAN [180], and FaceSwap [183]. The dataset is split into a training set of 17 GB and a testing set of 4.1 GB. The videos are in MP4 format with high visual quality.
2.23. iFakeFaceDB [94]
This dataset was also introduced in 2020, focusing on entire face synthesis. It includes a total of 87,000 fake images stored in JPG format. The high visual quality of the images ensures detailed and visually rich synthetic faces. The dataset was generated using the StyleGAN [11] and GANPrintR approaches. This dataset has a size of 1.4 GB.
2.24. RFFD (Real and Fake Face Detection) [97]
The dataset is composed of 1081 genuine images and 960 manipulated images for training. The dataset focuses on attribute-manipulated deepfakes generated using Photoshop. The images are in JPG format with medium-to-high visual quality. The dataset size is 431 MB.
2.25. UIBVFED [98]
UIBVFED (User-Independent Blendshape-based Video Facial Expression Dataset) contains 640 fake images. This dataset used the blendshapes technique and the Autodesk Character Generator tool to generate attribute manipulation and entire-face-synthesis-based deepfakes. It has a medium-to-high visual quality, and the images are in PNG format. The dataset has a size of 652 MB.
2.26. WildDeepfake [100]
WildDeepfake is a unique deepfake dataset created in 2020 with a total of 7314 videos, comprising 3805 real and 3509 fake videos collected from various sources on the Internet. The visual quality of the deepfakes is high, and it is acknowledged as a challenging dataset for deepfake detection. The extracted images are in PNG format. The dataset has a size of 67.8 GB.
2.27. YouTube-DF [102]
The dataset includes 98 authentic videos that were utilized to generate 79 deepfake videos and 98 face swaps. It mainly focuses on face swapping and has both low and high visual qualities. The videos are formatted in MP4. DeepFaceLab was the technology employed for face swapping in this dataset.
2.28. DeepFake MNIST+ [103]
Crafted in 2021, this dataset comprises 20,000 videos, meticulously balanced with 10,000 authentic videos and 10,000 manipulated (fake) videos. This dataset primarily focuses on image animation. The visual quality of the deepfakes in DeepFake MNIST+ is high, and the videos are in MP4 format. The generation process involves the use of the FOMM (First-Order Motion Model) technique. It has a size of 2.21 GB. The videos contain ten actions: open mouth, blink, yawn, left slope head, right slope head, nod, surprise, embarrassment, look up, and smile.
2.29. DeepStreets [106]
This dataset includes a total of 1200 video samples, evenly distributed between 600 authentic videos and 600 manipulated (fake) videos. The type of deepfake used in the dataset is video-to-video synthesis. It has low and high visual qualities and utilizes the Vid2vid and Wcvid2vid technologies [184,185] to generate deepfakes. The videos are in MP4 format. The total size of the dataset is 24.1 GB.
2.30. DFGC-21 (DeepFake Game Competition) [108]
This dataset was introduced in 2021. It focuses on face swapping deepfake types, encompassing 1000 real images and N × 1000 fake images. The images within DFGC-21 are in PNG format. It uses various face manipulation techniques such as FaceShifter [88], FaceController [186], FaceSwap [183], FOMM [105], and FSGAN [180]. The samples’ visual quality is high, and the total dataset size is 5.18 GB.
2.31. DF-Mobio [110]
This is a gender-diverse dataset released in 2021. It contains a total of 46,496 videos, including 31,950 real and 14,546 fake video samples focusing on identity swap, generated using GAN technology. The dataset consists of bimodal (audio and video) data taken from 150 people, with a female-to-male ratio of nearly 1:2 (99 males and 51 females), and was collected from August 2008 until July 2010 at six different sites in five different countries. The videos are in MP4 format with high visual quality, and the dataset has a size of 170.1 GB.
2.32. DF-W (DeepFake Videos in the Wild) [112]
This dataset consists of 1869 manipulated (fake) videos and focuses on face swapping. The videos were collected from YouTube, Reddit, and Bilibili by searching for videos with the keywords ‘face swap’. The videos in DF-W are formatted in MP4 and are available in both lower and higher resolutions. This dataset has a size of 31.55 GB.
2.33. FaceSynthesis [114]
This dataset has a collection of 100,000 fake images in PNG format with high visual quality. The synthesis techniques involved are associated with VFX (visual effects). The total dataset size is 31.8 GB.
2.34. FakeAVCeleb [116]
This dataset comprises 20,000 videos, with 500 authentic videos and 19,500 manipulated (fake) videos. The generated deepfakes encompass face swapping, facial reenactment, and voice cloning. The videos in FakeAVCeleb are presented in MP4 format with high visual quality. For video manipulations, technologies such as FSGAN [180], Wav2Lip [187], and FaceSwap [188] were utilized, while SV2TTS (Speaker Verification to Text-to-Speech) [20] was used for audio manipulations. This dataset has a size of 6 GB.
2.35. ForgeryNet [119]
Introduced in 2021, this extensive deepfake dataset comprises a total of 2,896,062 images and 221,247 videos. It is characterized by a balance of 1,438,201 real images and 1,457,861 manipulated (fake) images, as well as 121,617 fake videos and 99,630 real videos. The dataset encompasses various face manipulation techniques like face editing, face reenactment, face transfer, face swap, and face stacked manipulation. The visual quality of the deepfakes in ForgeryNet varies for both images and videos (high and low). The dataset’s images and videos are encoded in the JPG and MP4 formats, exhibiting a diverse range of resolutions spanning from 240 pixels to 1080 pixels. The generation process involves the use of multiple techniques and tools, such as FSGAN [
180], FaceShifter [
88], BlendFace, FOMM Replacement [
105], MASKGAN [
189], StarGAN [
190], StyleGAN [
11], ATVG-Net (audio transformation and visual generation network) [
191], and SC-FEGAN [
192]. The total size of the dataset is 496.23 GB.
2.36. HiFiFace (High Fidelity Face Swapping) [121]
The HiFiFace database encompasses 1000 synthetic videos derived from FaceForensics++, aligning precisely with the target and source pair configurations in FaceForensics++. In addition, 10,000 frames sourced from FaceForensics++ videos are included in the database, providing a robust foundation for quantitative assessment. The face swapping was performed using an architecture composed of three components, i.e., a 3D shape-aware identity extractor, a semantic facial fusion module, and an encoder–decoder structure. The samples are available at two resolutions. The database is 900 MB in size.
2.37. KoDF (Korean DeepFake Detection Dataset) [124]
Established in 2021, this deepfake dataset encompasses a total of 237,942 videos, consisting of 62,166 authentic videos and 175,776 manipulated (fake) videos. The dataset focuses on both face reenactment and face swap deepfakes, and the visual quality of the deepfakes in KoDF is high. The videos were recorded at a high initial resolution that was later reduced. Various tools and techniques, including FaceSwap [174], DeepFaceLab [182], FSGAN [180], FOMM (First-Order Motion Model) [105], ATFHP (Audio-driven Talking Face Head Pose) [193], and Wav2Lip [187], were used in the deepfake generation process. The surveyed population was predominantly aged between 20 and 39, making up 77.21% of the respondents. Gender distribution is nearly equal, with females at 50.87% and males at 49.13%, reflecting a balanced demographic profile. The dataset has a size of 2.6 TB.
2.38. OpenForensics [126]
This dataset consists of 45,473 authentic images and 70,325 deepfake images. It specifically concentrates on face swap deepfake images. The images in OpenForensics are presented in JPG format and are available at low and high resolutions with correspondingly low and high visual qualities. The dataset synthesizes images using advanced generative techniques, such as GANs (generative adversarial networks) [194,195] and Poisson blending [196]. This dataset has a size of 56.4 GB.
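Poisson blending, one of the compositing steps named above, is available in OpenCV as seamless cloning. The sketch below is a generic illustration rather than the dataset's actual pipeline; the face crop, mask, and placement point are placeholders, and real face swap pipelines derive them from facial landmarks.

```python
# A minimal sketch of Poisson blending using OpenCV's seamlessClone.
# The face crop, mask, and placement point are placeholder assumptions.
import cv2
import numpy as np

face = cv2.imread("source_face.png")
scene = cv2.imread("target_image.png")

mask = 255 * np.ones(face.shape[:2], dtype=np.uint8)  # blend the whole crop
center = (scene.shape[1] // 2, scene.shape[0] // 2)   # paste location (x, y)

blended = cv2.seamlessClone(face, scene, mask, center, cv2.NORMAL_CLONE)
cv2.imwrite("blended.png", blended)
```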
2.39. Perception Synthetic Faces [128]
This dataset was introduced in 2021 and consists of 150 authentic images and 150 manipulated (fake) images. This dataset synthesizes the entire face to generate deepfakes using techniques such as PGGAN [50], StyleGAN [78], and StyleGAN2 [197]. The samples are in JPG format and exhibit high visual quality. This dataset has a size of 24.5 MB.
2.40. SR-DF (Swapping and Reenactment DeepFake) [23]
Introduced in 2021, this dataset is centered around facial reenactment and face swapping, consisting of 1000 real videos and 4000 manipulated (fake) videos. The dataset incorporates a variety of face manipulation techniques, with face swapping employing technologies such as FS-GAN [180] and FaceShifter [88], while facial reenactment involves the First-Order Motion Model and ICface. The dataset has a high visual quality.
2.41. Video-Forensics-HQ [130]
This dataset was established in 2021 and contains 1737 fake videos. The dataset contains high-quality deepfakes of entire face synthesis and manipulation types. The videos are in MP4 format. The dataset used techniques such as Deep Video Portraits (DVP) [198] to create its synthetic content. It has a total size of 13.5 GB.
2.42. VideoSham [132]
This database presents a balanced composition, featuring an equal number of fake and real samples, totaling 826 videos. The deepfake samples are of the face swap and face reenactment types. This dataset uses spatial and temporal attacks as techniques to generate deepfakes: spatial attacks (1–4) include adding/removing entities, changing backgrounds, and text manipulation, while temporal attacks (5–6) involve frame and audio modifications. The videos are in MP4 format with high visual quality. The dataset has a total size of 5.2 GB.
2.43. WPDD (World Politicians Deepfake Dataset) [135]
WPDD was created in 2021 and consists of 946 real videos and 320,499 fake images. The dataset was primarily created using face swap techniques and also incorporates seven distinct face manipulations. The videos within WPDD are in MP4 format and are provided in two resolutions. The video manipulations were created using the iFace and FaceApp applications with medium and high visual qualities. This dataset has a size of 315 gigabytes.
2.44. CDDB (Continual Deepfake Detection Benchmark) [136]
The CDDB dataset was introduced in 2022. It contains 842,140 deepfake images. It mainly focuses on three deepfakes types: face synthesis, face swap, and face reenactment. The dataset is characterized by a variety of GAN models such as ProGAN [
50], StyleGAN [
78], BigGAN [
199], CycleGAN [
200], GauGAN (Gaussian GAN) [
201], StarGAN [
178], non-GAN models such as Glow [
202], CRN (Cascaded Refinement Network) [
203], IMLE (Implicit Maximum Likelihood Estimation) [
204], SAN (Second-order Attention Network) [
205], DeepFakes [
174], Face2Face [
59], FaceSwap [
175], and NeuralTextures [
176], and unknown models such as WhichFaceReal [
80] and WildDeepFake [
100]. It has a total size of 9.6 gigabytes. The images are formatted in PNG and have medium visual quality.
2.45. CelebV-HQ (High-Quality Celebrity Video Dataset) [138]
This database comprises 35,666 fake videos, specifically emphasizing high-quality renditions of celebrity faces. This dataset focuses on attribute manipulation and employs techniques such as VideoGPT [
206], MoCoGAN-HD (Motion and Content decomposed GAN for High-Definition video synthesis) [
207], DIGAN (Dynamics-aware Implicit Generative Adversarial Network) [
208], and StyleGAN-V [
209]. The videos in CelebV-HQ are presented in MP4 format. This dataset has a size of 38.5 gigabytes. All clips underwent manual labeling, encompassing 83 facial attributes that span appearance, emotion, and action.
2.46. DeePhy (Deepfake Phylogeny) [140]
DeePhy is a deepfake dataset created in 2022 comprising 5140 videos, with 100 real and 5040 fake videos. The dataset covers multiple deepfakes types, including face reenactment and face swap. The visual quality of the deepfakes in DeePhy is high. The videos are in MPEG4.0 format with a resolution of 720 pixels. The generation process involves the use of the following techniques: FS-GAN [
180], FaceShifter [
88], and FaceSwap [
183]. It has a size of 26 GB. The generated deepfakes are segregated into three categories: (i) utilizing a single technique, (ii) employing two different techniques, and (iii) applying three different techniques.
2.47. DFDM (DeepFakes from Different Models) [142]
The DFDM dataset has a collection of 6450 deepfake videos that focus on face-swapping. The videos in DFDM are of high quality and presented in the MPEG4.0 format. This dataset incorporates various face manipulation techniques like FaceSwap [
183], Lightweight [
174], IAE, Dfaker [
210], and DFL-H128 [
182]. It has a size of 159.59 gigabytes.
2.48. FakeDance [144]
This dataset includes 99 real and 99 fabricated videos. It focuses on the whole body reenactment deepfake. It utilizes Everybody Dance Now (a video synthesis algorithm) [
211], GAN, and pix2pixHD [
212] technologies to generate deepfakes. The videos are in MP4 format with low/high visual quality. The total size of the dataset is 32.2 GB.
2.49. FMFCC-V (Fake Media Forensics Challenge of China Society of Image and Graphics-Video Track) [146]
This dataset was introduced in 2022 and features 42 male and 41 female subjects. The short video version has 44,290 real and 38,102 fake videos, totaling 82,392 videos, whereas the long video version has 83 real videos and 192 fake videos, totaling 275 videos. It uses various face manipulation tools like faceswap [
174], faceswapGAN [
52], DeepFaceLab [
177], Recycle-GAN [
213], and Poisson blending. The long version is 80.5 GB and the short version is 411 GB in size.
2.50. GBDF (Gender Balanced DeepFake Dataset) [148]
This dataset was introduced in 2022. It is a specialized dataset created to address gender balance in the context of deepfake generation and detection. The dataset comprises 2500 real and 10,000 fake videos. The videos in GBDF are of high quality and presented in MP4 format. The manipulation techniques used in this dataset are identity swapping (FaceSwap, FaceSwap-Kowalski [
183], FaceShifter [
88], DeepFakes [
174], and Encoder–Decoder) and expression swapping (Face2Face [
59] and NeuralTextures [176]). The dataset is 1 TB in size.
2.51. LAV-DF (Localized Audio Visual DeepFake) [150]
This is a significant deepfake dataset created in 2022 comprising 136,304 videos, with 36,431 real and 99,873 fake videos. The dataset focuses on face reenactment, voice reenactment, and transcript manipulation. The visual fidelity of the deepfakes in LAV-DF is high. The dataset includes both video files (MP4) and associated metadata files (CSV). The generation process involves the use of specific tools and techniques, including SV2TTS [
20] and Wav2Lip [
187]. It has a size of 23.8 GB.
2.52. SFHQ (Synthetic Faces High-Quality) [152]
The SFHQ dataset was created in 2022. It contains 425,258 synthesized images. These images are provided in multiple formats, including JPG and PNG. It uses StyleGAN2 [
197] to generate deepfakes. The dataset is distributed across four parts, each varying in size: Part 1 (15 GB), Part 2 (15 GB), Part 3 (23 GB), and Part 4 (24 GB).
2.53. TrueFace [153]
This dataset is divided into two parts: pre-social and post-social. The pre-social dataset consists of 70,000 authentic images and 80,000 synthetic images, amounting to a total of 150,000 images. The post-social dataset contains 30,000 real and 30,000 fake images, totaling 60,000 images. It contains images of high visual quality with a quality factor of 87. It uses generation techniques like StyleGAN [
78] and StyleGAN2 [
197]. The images are provided at an initial resolution and were later resized. The total size of the dataset is 212 GB.
2.54. ZoomDF [155]
This dataset was introduced in 2022 with 400 fake video samples generated from 400 real images. The deepfake videos were generated on the Avatarify framework using the First-Order Motion Model (FOMM) [
105] method. The type of deepfake is motion manipulation. The videos are in MP4 format with a high visual quality.
2.55. AV-Deepfake1M [65]
Within this dataset, there are 286,721 legitimate videos alongside 860,039 manipulated videos. It concentrates on the creation of deepfakes through face reenactment, as well as voice cloning and transcript manipulation. It uses TalkLip [
214], VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) [
215], ChatGPT [
216] for transcript manipulation, and YourTTS [
217] for audio generation. The samples in this dataset are in MP4 format with high visual quality. It has a total size of ∼400 GB.
2.56. DETER (DETEcting Edited Image Regions) [157]
DETER is a large-scale, manipulated deepfake database. The dataset comprises 38,996 authentic images and 300,000 forged images. The deepfake images were generated by four state-of-the-art techniques with three distinct editing operations, i.e., face swapping, attribute editing, and inpainting. The face swapping and attribute editing were performed using GANs-based editing for swapping (E4S) [
218] and diffusion models (DMs)-based DiffSwap [
219]. For inpainting, the manipulation tools employed were the GANs-based Mask-Aware Transformer (MAT) [
220] and DMs-based DiffIR [
221]. In the process of face swapping and attribute editing, modifications were focused on specific facial regions such as eyes and nose. In contrast, inpainting operations targeted random image regions, effectively eliminating spurious correlations present in prior datasets. Furthermore, meticulous image postprocessing was executed to guarantee that the deepfakes maintain a realistic appearance. The high visual quality of the edited images is evident within the dataset. The image resolutions span a wide range, extending beyond 2028 pixels.
2.57. DFFMD (Deepfake Face Mask Dataset) [159]
This dataset, introduced in 2023, comprises 1000 real and 1000 fabricated videos, resulting in a total of 2000 videos. The dataset focuses on facial reenactment. The technique used in the generation process is FOMM [
105]. The videos are in MP4 format with medium visual quality. The dataset contains videos of 40 subjects—28 male and 12 female. The dataset has a total size of 10 GB.
2.58. DF-Platter [161]
This is a deepfake dataset with a total of 133,260 videos, consisting of 764 real and 132,496 fake videos. The primary focus of this dataset is on face reenactment and face swap. The dataset includes an average BRISQUE score of 43.25, indicating high visual quality. The content is encoded in the MPEG4.0 format and is provided in two resolutions: 720 pixels (high resolution) and 360 pixels (low resolution). The generation process involves the use of various tools, including FSGAN [
180], FaceSwap [
183], and FaceShifter [
88]. It has a size of 417 GB. This database comprises three distinct sets: Set A, Set B, and Set C. Set A encompasses deepfakes featuring a single subject. To generate single-subject deepfakes, a source video and a target video are utilized, each featuring a single subject. Set B encompasses intra-deepfakes, where the faces of one or more subjects within a given video are swapped. In contrast, Set C comprises multi-face deepfakes, wherein the faces in the source videos undergo manipulation to resemble those of celebrities in the target. Set A incorporates deepfakes created using FSGAN and FaceShifter, while Sets B and C include deepfakes generated through all three generation schemes.
2.59. eKYC-DF [162]
This dataset was also introduced in 2023 and contains 228,000 fake video and 760 real video samples. It utilized the SimSwap [
222], FaceDancer [
223], and SberSwap [
224] schemes to yield face-swapped deepfakes. The samples in this dataset are provided in two resolutions. The visual quality of the samples is high. The dataset sizes are estimated to be over 1.7 TB for the eKYC-DF dataset and 750 GB for the eKYC-6K dataset.
2.60. IDForge [163]
This dataset was introduced in 2023. It is composed of 79,827 real samples along with 214,438 samples in the reference dataset as well as 169,311 deepfake samples. It uses various techniques to produce deepfakes like face swapping with lip-syncing (Insight-Face [
225,
226], SimSwap [
222], InfoSwap [
227], and Wav2Lip [
187]), audio cloning/manipulation (TorToiSe [
228], RVC [
229], and audio shuffling), and transcript manipulation (GPT-3.5 [
230] and text shuffling). The visual quality of the samples is high, and the dataset size is estimated at ∼600 GB.
2.61. PolyGlotFake [164]
The dataset contains an audiovisual and multilingual deepfake collection, which includes 766 real videos and 14,472 fake videos, amounting to a total of 15,238 videos. The samples are in seven languages, i.e., English, French, Spanish, Russian, Chinese, Arabic, and Japanese. The audio and video manipulations were generated utilizing various text-to-speech, voice cloning, and lip-sync methods. For audio manipulations, Bark+FreeVC [
231,
232], MicroTTS+FreeVC [
233], XTTS [
234], Tacotron+FreeVC [
235], and Vall-E-X [
236] were used. VideoRetalking [
237] and Wav2Lip [
187] were employed for video manipulations. The quality of the deepfakes is high.
2.62. RetouchingFFHQ (Retouching FlickrFaces-High Quality) [165]
Established in 2023, this dataset consists of 710,726 images, including 58,158 real and 652,568 fabricated images. It primarily concentrates on facial retouching tasks, encompassing skin smoothing, face whitening, face lifting, and eye enlarging. The visual quality of the deepfakes in RetouchingFFHQ is high. The images are provided at an initial resolution that is later reduced in the final output. The dataset is generated using APIs from major technology companies, including Megvii [
238], Alibaba [
239], and Tencent [
240]. The dataset has a total size of 363.96 GB.
2.63. RWDF-23 (Real-World Deepfake) [167]
The RWDF database was introduced in 2023, encompassing content in the English, Korean, Chinese, and Russian languages. The dataset consists of 2000 fake videos sourced from diverse platforms like YouTube, TikTok, Reddit, and Bilibili. The deepfakes were generated using open source frameworks (i.e., DeepFaceLab [
177], DeepFaceLive [
241,
242], FOMM [
105], and SimSwap [
222]), mobile apps (i.e., FacePlay [
243], Reface [
244,
245,
246], Deepfake Studio [
247], FaceApp [
10], Revive [
248], LicoLico [
249], Fakeit [
250], and DeepFaker [
251]) and commercial products (i.e., DeepFakesWeb [
252], Deepcake.io [
253,
254], DeepFaker Bot [
255], and Revel.ai [
256]).
2.64. SPRITZ-PS [168]
Within this database, there are a total of 20,000 deepfake images. It focuses on synthetic false images and iris reconstruction using techniques such as StyleGAN2, ProgressiveGAN, and StarGAN. The dataset provides images in JPG format. It has a size of 17.84 gigabytes.
2.65. DeepFaceGen [170]
DeepFaceGen was created in 2024 and contains 463,583 real images, 313,407 real videos, 350,264 forged images, and 423,548 forged videos. In total, it is composed of 813,847 images and 736,955 videos. The facial deepfakes are focused on entire face synthesis, face swapping, face reenactment, and attribute manipulation, which were generated utilizing 34 image or video generation methods. The deepfake generation techniques employed include FaceShifter [
88], FSGAN [
180], DeepFakes [
174], BlendFace [
257], MMReplacement [
119], DeepFakes-StarGAN-Stack (DSS), StarGAN-BlendFace-Stack (SBS), and SimSwap [
222] for face swapping. For face reenactment, the techniques used are Talking Head Video [
258], ATVG-Net [
191], Motion-cos [
259], and FOMM [
105]. For face alteration, the methods include StyleGAN2 [
11], MaskGAN [
189], StarGAN2 [
190], SC-FEGAN [
192], and DiscoFaceGAN [
260]. The Text2Image techniques involve Openjourney (OJ) [
261], Stable Diffusion 1 (SD1), Stable Diffusion 2 (SD2), Stable Diffusion XL (SDXL) [
262], Wenxin [
263], Midjourney [
264], DF-GAN [
265], DALL·E, and DALL·E 3 [
266]. For Text2Video, the techniques are AnimateDiff [
267], AnimateLCM [
268], Hotshot [
269], Zeroscope [
270], and MagicTime [
271]. The Image2Image subset includes InstructPix2Pix (Pix2Pix), Stable Diffusion XL Refiner (SDXLR), and Stable Diffusion Image Variation (VD) [
262]. The quality of the forgery samples is high, and the dataset size is 491.1 GB.
2.66. DF40 [172]
This dataset is composed of over 100,000 fake images and more than 1 million fake videos. It used 52,590 real videos (Table 10 in [
172]). Four types of deepfakes were targeted, i.e., face swapping, face reenactment, entire face synthesis, and face editing (i.e., attribute manipulation). The quality of deepfakes is high. This dataset contains a diverse range of deepfake samples, which were generated using 40 different deepfake techniques. Face swapping methods include FSGAN [
180], FaceSwap [
183], SimSwap [
222], InSwapper [
272], BlendFace [
257], UniFace [
273], MobileSwap [
19], E4S (editing for swapping) [
218], FaceDancer [
223], and DeepFaceLab [
177]. Face reenactment techniques comprise FOMM [
105], FS_vid2vid (few shot video-to-video) [
274], Wav2Lip [
187], MRAA (motion representations for articulated animation) [
275], OneShot [
276], PIRender (portrait image neural renderer) [
277], TPSMM (thin-plate spline motion model) [
278], LIA (latent image animator) [
279], DaGAN (depth-aware GAN) [
280], SadTalker (stylized audio-driven talking-head) [
281], MCNet (memory compensation network) [
282], HyperReenact [
283], and HeyGen [
284]. Face synthesis techniques include VQGAN (vector-quantized GAN) [
285], StyleGAN2 [
11], StyleGAN3 [
286], StyleGAN-XL (StyleGAN large-scale) [
287], SD-2.1 (Stable-Diffusion-2.1) [
288], DDPM (denoising diffusion probabilistic model) [
289], RDDM (residual denoising diffusion model) [
290], PixArt-α (transformer-based text-to-image diffusion model) [
291], DiT-XL/2 (diffusion transformers large) [
292], SiT-XL/2 (self-supervised vision transformer) [
293], Midjourney 6 [
264], and WhichisReal [
80]. Face editing methods encompass CollabDiff (collaborative diffusion) [
294], e4e (encoder for editing) [
295], StarGAN [
178], StarGANv2 [
190], and StyleCLIP (styleGAN contrastive language-image pre-training) [
26]). Deepfake samples were generated using original/real samples from the FaceForensics++ [
76], Celeb-DF [
68], UADFV [
64], VFHQ (high-quality video face dataset) [
296], FFHQ [
78], CelebA [
297], FaceShifter [
298], and Deeperforensics-1.0 [
81] datasets.
3. Audio Deepfake Datasets
A variety of audio deepfake databases have been created to advance both audio deepfake detection and the technology of audio deepfakes. In this section, we provide a meticulous summary of existing audio deepfake databases. We extensively reviewed multiple research articles and database repositories to provide comprehensive information, encompassing insights not commonly present in prior works. We have included all publicly reported and/or accessible datasets from 2018 to 2024, showcasing the advancements in the field of audio deepfake datasets. A comparative analysis of existing audio deepfake datasets is depicted in Table 22, Table 23, Table 24, Table 25, Table 26 and Table 27. These tables list the audio deepfake datasets in ascending order by year (first column) and then in alphabetical order by name within each year (second column) to provide a view of how the audio deepfake datasets and the audio deepfake field have progressed over time. This chronological arrangement aims to help readers observe the evolution of audio deepfake datasets, including deepfake types, languages of samples, durations, formats, quantities, quality, and the methods/tools used to generate deepfakes.
3.1. Baidu Silicon Valley AI Lab Cloned Audio [299]
This dataset was released by the Baidu Silicon Valley AI Lab. The set has 130 samples, including 10 real and 120 fake voice clips. The voice-cloning-based fake samples were created using encoder-based neural networks. The dataset includes 6 h of high-quality audio clips featuring multiple speakers. The samples are in MP3 format. The dataset is ∼75.27 MB in size.
3.2. M-AILABS [301]
The M-AILABS database spans an extensive duration of 999 h and 32 min. This compilation features a diverse array of languages, including German, Spanish, English, Italian, Russian, Ukrainian, French, and Polish. It includes voices from both male and female speakers. The samples are in WAV format. The total size of the dataset is approximately 110.1 GB.
3.3. ASV Spoof 2019 [303,353,354]
ASVspoof-2019 consists of two components, i.e., LA (logical access) and PA (physical access). Both components were derived from the VCTK base corpus [
355], encompassing audio clips sourced from 107 speakers (61 females and 46 males). LA encompasses both voice conversion and speech synthesis samples, while in PA there is a combination of replay samples and genuine recordings. Both components are subdivided into three distinct sets (i.e., training, development, and evaluation). The fake samples were created using deep learning techniques like Merlin [
356], CURRENNT [
357], MaryTTS [
358], long short-term memory (LSTM) [
359,
360], WaveRNN [
361], WaveNet [
362], and WaveCycleGAN2 [
363]. The samples are preserved in the FLAC format, and the dataset has a size of approximately 23.55 GB.
3.4. Cloud2019 [305,306]
This dataset has a total of 11,785 samples. These samples are derived from TTS (Text-to-Speech) cloud services, including Amazon AWS Polly (PO) [
364], Google Cloud Standard (GS) [
365], Google Cloud Wave Net (GW) [
365], Microsoft Azure (AZ) [
366], and IBM Watson (WA) [
367]. The dataset primarily consists of English-language audio clips that were generated by automated systems using human voices as input. The audio files are in the WAV and PCM formats.
3.5. FoR (Fake or Real) [308]
The Fake or Real (FoR) dataset, introduced in 2019, is a significant collection in the domain of audio deepfake detection. This dataset comprises a total of 198,000+ samples, with 87,000+ being synthetic samples created through the utilization of Deep Voice 3 [
368], Amazon AWS Polly, Baidu TTS, Google traditional TTS, Google cloud TTS, Microsoft Azure TTS, and Google Wavenet [
365]. These samples collectively span a duration of 150.3 h and are in the English language. The dataset features voices with a distribution of 140 real and 33 fake speakers, and the samples are stored in WAV format. The database has a size of 16.1 GB.
3.6. H-Voice [310]
H-Voice is a dataset that includes a total of 6672 recordings, consisting of 3332 (in imitation section) and four (in synthetic section) original recordings, and 3264 (in imitation section) and 72 (in synthetic section) fake recordings. This dataset is designed for TTS applications and offers recordings in the WAV audio format. It covers a range of languages, including Spanish, English, Portuguese, French, and Tagalog, and features contributions from 84 different speakers. The voice recordings were generated using the imitation and deep voice methods. The dataset needs approximately 370 MB of storage space.
3.7. Ar-DAD (Arabic Diversified Audio Dataset) [314]
The ‘Ar-DAD: Arabic Diversified Audio’ dataset comprises a significant total of 16,209 audio recordings. This dataset primarily focuses on Arabic speech data and features contributions from 30 different speakers. In particular, the speakers comprise individuals from Arabic-speaking regions such as Saudi Arabia, Egypt, Kuwait, Yemen, UAE, and Sudan. The audio files are saved in the WAV format, making them versatile for various applications and research in Arabic language processing and speech synthesis. The recordings in this dataset were generated using the imitation method, and it takes approximately 9.37 GB of storage space.
3.8. VCC 2020 [316]
The VCC 2020 dataset is part of the Voice Conversion Challenge 2020 and encompasses multiple languages, including English, Finnish, German, and Mandarin, making it suitable for multilingual studies. The dataset is balanced in terms of gender, with a 50% representation of male and female voices. The voice-conversion-based deepfake samples were created via sequence-to-sequence (seq2seq) mapping networks, neural vocoders, GANs, and encoder–decoder networks. The audio files in this dataset are stored in the WAV format. The storage size of the dataset is approximately 195.5 MB.
3.9. ASVspoof 2021 [319,320]
The ASVspoof 2021 dataset is a comprehensive and diverse collection of 1,513,852 audio samples in the English language. There are 130,032 VC (Voice Conversion)- and TTS (Text-to-Speech)-based fraudulent samples that were created by employing VC attack algorithms [353] and vocoders. It exhibits a gender distribution of 41.5% male and 58.5% female voices. The dataset's audio files are stored in MP3, M4A, and OGG formats and have a storage size of approximately 34.5 GB. The logical access portion of the database can be accessed at [
323].
3.10. FMFCC-A [324]
The FMFCC-A database was introduced in 2021 and comprises a total of 50,000 samples, of which 10,000 are real and 40,000 are fake. The dataset features the Mandarin language, and the samples are available in WAV, AAC, and MP3 formats. For the generation of deepfakes, methods such as Text-to-Speech (TTS) (e.g., FastSpeech 2 [369]) and Voice Conversion (e.g., AdaIN-VC [370]) were employed. Other systems used were Alibaba TTS, BlackBerry TTS, IBM Watson TTS, Tacotron [
235], and GAN TTS [
371]. The dataset occupies about 3.31 GB.
3.11. WaveFake [326]
This database comprises 117,985 synthesized audio clips. The samples encompass both English and Japanese languages. The TTS synthetic samples were produced by leveraging the LJSpeech [
326] and Japanese Speech Corpus (JSUT) [
372] databases, employing MelGAN [
373], Parallel Wave-GAN (PWG) [
374], Multiband MelGAN (MB-MelGAN) [
375], Fullband MelGAN (FB-MelGAN), HiFi-GAN [
376], and Wave-Glow [
377] models. The synthetic samples, though high-quality and realistic, lack diversity as each sample has only one speaker. The samples are in the WAV format, and the total dataset size is approximately 28.9 GB.
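To make the vocoder-based construction concrete, the following hedged Python sketch outlines the typical copy-synthesis recipe (extract a mel-spectrogram from real speech, then resynthesize it with a neural vocoder); the synthetic input tone, parameter values, and the commented-out `vocoder` object are assumptions rather than WaveFake's exact code.

```python
import librosa
import numpy as np

# Stand-in for a real utterance (a pure tone keeps the sketch self-contained).
sr = 22050
wav = librosa.tone(220, sr=sr, duration=2.0)

# Typical mel-spectrogram front-end used by GAN vocoders such as MelGAN/HiFi-GAN.
mel = librosa.feature.melspectrogram(
    y=wav, sr=sr, n_fft=1024, hop_length=256, n_mels=80
)
log_mel = np.log(np.clip(mel, a_min=1e-5, a_max=None))

# A pretrained neural vocoder would then resynthesize the waveform, yielding
# a "copy-synthesis" fake of the original speech:
# fake_wav = vocoder.infer(log_mel)
```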
3.12. ADD 2022 [328]
The ADD 2022 dataset constitutes a component of the ADD 2022 challenge. This database includes partially fake audio detection (PF) and low-quality fake audio detection (LF). LF comprises 36,953 genuine voices and 123,932 fabricated spoken words with real-world noises. In contrast, PF encompasses 127,414 audio samples that are manipulated or altered. The ADD dataset, which is openly accessible, is exclusively in the Chinese language. The dataset requires 3.73 GB of storage space. The findings from ADD 2022 indicate that a single method or model is not capable of addressing all types of fakes and generalization.
3.13. CFAD (Chinese Fake Audio Detection) [330]
Within the CFAD dataset, there are 347,400 samples contributed by a total of 1212 real and 1212 fake speakers (Table II in [
330]). The TTS and partial fake audio clips in Chinese were generated using 12 vocoder techniques (i.e., STRAIGHT [
378], Griffin-Lim [
379], LPCNet [
380], WaveNet [
362], neural-vocoder-based system (PWG [
374], HifiGAN [
376], Multiband-MelGAN [
375], Style-MelGAN [
381]), WORLD [
382], FastSpeech-HifiGAN [
369], Tacotron-HifiGAN [
235], and Partially Fake [
40,
383,
384]). The audio files are available in multiple formats, including MP3, M4A, OGG, FLAC, AAC, and WMA. The dataset’s storage size is approximately 29.9 GB.
3.14. FakeAVCeleb [116]
This database encompasses a collection of 20,000 audio samples, with 500 real and 19,500 fake samples in the English language. The manipulated audio samples were generated by a real-time voice cloning method known as SV2TTS [
20]. The dataset exhibits an equal distribution of male and female voices, with each gender contributing 50% of the dataset. The samples are available in MP4 format.
3.15. In-the-Wild [332]
This database is made up of a total of 31,779 audio samples, including 11,816 synthesized samples created using 19 distinct TTS synthesis algorithms. It covers 58 English-speaking celebrities and politicians, with both genuine and fake samples for each speaker. The total audio duration is 38 h. The samples are stored in WAV format, and the dataset size is approximately 7.6 GB. This study also used RawNet2 [
385] and RawGAT-ST [
386] techniques for audio deepfake detection.
3.16. Lav-DF (Localized Audio Visual DeepFake) [150]
This dataset comprises a total of 136,304 English audio clips, with 99,873 clips being fake segments. The dataset comprises 153 speakers. This dataset generated Text-to-Speech (TTS) deepfake content, employing content-driven Recurrent Encoder (RE), Tacotron 2 [
387], and SV2TTS [
20] techniques. The dataset contains audio files in MP4 format, requiring approximately 24 GB of storage.
3.17. TIMIT [334]
The TIMIT dataset [
388] was expanded to create this dataset with fake audio samples. It features English speech data and includes recordings from 630 different speakers. The fake samples mimic human voice and were created by state-of-the-art neural speech synthesis techniques such as Google TTS [
365], Tacotron-2 [
387], and MelGAN [
373]. The audio files are stored in WAV format, and the dataset size is 3 GB.
3.18. ADD 2023 [336]
The ADD 2023 dataset is an integral part of the ADD 2023 challenge. Audio Fake Game (FG), Manipulation Region Location (RL), and Deepfake Algorithm Recognition (AR) are three sub-challenges in the ADD 2023 challenge. In contrast to the ADD 2022 challenge, ADD 2023 shifts its focus from binary fake or real classification. Instead, it concentrates on the more intricate task of localizing manipulated intervals within a partially fake utterance and identifying the specific source responsible for generating any fake audio. This nuanced approach adds a layer of complexity to the evaluation and fosters advancements in understanding and addressing more sophisticated manipulations in audio data.
3.19. AV-Deepfake 1M [65]
This dataset is composed of 286,721 legitimate samples and 860,039 manipulated audio samples. The fakes are of the voice cloning and transcript manipulation deepfake types, and they were generated using VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) [
215] and YourTTS [
217]. This is an extensive audio dataset of 1886 h from 2068 speakers. The samples in this dataset are in MP4 format with a total size of ∼400 GB.
3.20. EmoFake [37]
This dataset was created using the English portion of the Emotional Speech Database dataset [
389]. The EmoFake dataset consists of both emotion-manipulated audio and authentic emotional speech. In the EmoFake dataset, there are 57,400 samples in total, comprising 17,500 authentic recordings and 39,900 artificially generated instances. The fake samples were generated by transitioning the emotional state from a source emotion to a target emotion. The fake audios were produced using seven open-source Emotional Voice Conversion (EVC) models (i.e., VAW-GAN-CWT [
390], Seq2Seq-EVC [
391], CycleTransGAN [
392], DeepEST [
393], EmoCycleGAN [
394], StarGAN-EVC [
395], and CycleGAN-EVC [
396]).
3.21. Fake Song Detection (FSD) [337]
This Chinese fake song detection dataset consists of 200 real and 450 fake songs. The fake songs were generated using five state-of-the-art singing voice conversion and singing voice synthesis techniques, namely SO-VITS [
397], NSF-HifiGAN with Snake [
398], SO-VITS with shallow diffusion [
399], DiffSinger [
399], and Retrieval-based Voice Conversion (RVC) [
229]. The dataset features voices with a distribution of 27 male and 27 female singers, and the samples are in WAV format.
3.22. Half-Truth [40]
This dataset was meticulously curated to encourage research in identifying partially fake audio utterances, where any single sample contains elements of both truth and falsehood. The partially fake samples were generated by splicing fractions of synthetic audio into the original speech. The fake audio samples were produced utilizing the global style token (GST) Tacotron technique [
383,
384]. This dataset is in Chinese with 175 female and 43 male speakers. The samples are conveniently provided in WAV format, and the total size of the dataset is 8.1 GB.
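As an illustration of the splicing operation described above, the following minimal Python sketch inserts a synthesized segment into a genuine utterance; the stand-in signals and the splice interval are hypothetical and do not reproduce the Half-Truth pipeline.

```python
import numpy as np
import soundfile as sf

sr = 16000
real = np.random.randn(5 * sr) * 0.01    # stand-in for a genuine 5 s utterance
synth = np.random.randn(1 * sr) * 0.01   # stand-in for a 1 s TTS segment

# Replace speech from 1.0 s onward with the synthetic segment, keeping the
# rest of the original audio intact (a "partially fake" utterance).
start = int(1.0 * sr)
end = start + len(synth)
partial_fake = np.concatenate([real[:start], synth, real[end:]])

sf.write("partially_fake.wav", partial_fake, sr)
```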
3.23. SFR (System Fingerprint Recognition) [340]
Established in 2023, this database comprises 181,764 audio clips in total. These recordings span 526.53 h of Chinese speech data and are distributed across 745 different speakers. There are 22,400 authentic samples and 159,364 fabricated samples. The fake samples were generated through seven TTS systems, i.e., Aispeech [
400], Sogou [
401], Alibaba Cloud [
402], Baidu Ai Cloud [
403], Databaker [
404], Tencent Cloud [
405], and iFLYTEK [
406]. All audio files are stored in the MP3, MP4, WMA, AMR, AAC, AVI, WMV, MOV, and FLV formats.
3.24. TIMIT-TTS [341]
This database employed the VidTIMIT [
407] and DeepfakeTIMIT [
51] datasets to produce TTS deepfakes, comprising a total of 80,000 synthetic audio tracks. A set of distinct state-of-the-art TTS methods (MelGAN [
373], WaveRNN [
361], Tacotron [
235], Tacotron 2 [
387], GlowTTS [
408], FastSpeech2 [
369], FastPitch [
409], TalkNet [
410], MixerTTS [
411], MixerTTS-X [
411], VITS [
215], SpeedySpeech [
412], gTTS [
413], and Silero [
414]) were utilized to generate the TTS deepfakes. To elevate the authenticity of the generated tracks, a Dynamic Time Warping (DTW) [
415] step was incorporated. The database features a balanced representation of male (50%) and female (50%) voices. The audio files conform to the WAV format, and the database has a total size of approximately 7.2 GB.
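The following hedged Python sketch shows a DTW alignment of MFCC sequences in the spirit of the post-processing step described above; the stand-in signals and parameter values are illustrative, not those of TIMIT-TTS.

```python
import librosa

# Stand-ins for a reference recording and a slightly longer synthetic track.
sr = 16000
ref = librosa.tone(440, sr=sr, duration=1.0)
synth = librosa.tone(440, sr=sr, duration=1.3)

# MFCCs serve as the feature sequences to be aligned.
ref_mfcc = librosa.feature.mfcc(y=ref, sr=sr, n_mfcc=13)
synth_mfcc = librosa.feature.mfcc(y=synth, sr=sr, n_mfcc=13)

# D: accumulated cost matrix; wp: optimal warping path (pairs of frame indices)
# that maps synthetic frames onto reference timing.
D, wp = librosa.sequence.dtw(X=ref_mfcc, Y=synth_mfcc, metric="euclidean")
print("alignment cost:", D[-1, -1], "path length:", len(wp))
```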
3.25. Cross-Domain Audio Deepfake Detection (CD-ADD) [343]
The CD-ADD dataset encompasses 25,111 real and 120,459 audio deepfake samples, totaling 145,570 samples. The dataset has more than 300 h of speech samples generated by five zero-shot TTS models that are grouped into two types: decoder-only (i.e., VALL-E [
236]) and encoder–decoder (i.e., YourTTS [
217], WhisperSpeech [
416], Seamless Expressive [
417], and Open-Voice [
418]).
3.26. Codecfake [344]
The Codecfake dataset was created to detect audio-language-model-based deepfake audio. It encompasses over 1 million audio samples in both the English and Chinese languages. Specifically, there are 1,058,216 audio samples, with 132,277 and 925,939 being real and fake, respectively. The real audio samples were sourced from the VCTK [
355] and AISHELL3 [
419] databases. The fake samples were fabricated using seven different codec techniques, i.e., SoundStream [
420], SpeechTokenizer [
421], FunCodec [
422], EnCodec [
423], AudioDec [
424], AcademicCodec [
425], and Descript-audio-codec (DAC) [
426]. The audio samples are in WAV format, and the total size of the dataset is 32.61 GB.
3.27. Controlled Singing Voice Deepfake Detection (CtrSVDD) [346]
The CtrSVDD is a collection of diverse bona fide and deepfake singing vocals. For 164 singers, this dataset has 188,486 deepfake song samples and 32,312 bona fide song clips. The total number of samples is 220,798 mono vocal clips totaling 307.98 h. The bona fide singing vocals were taken from Mandarin singing datasets (i.e., Opencpop [
427], M4Singer [
428], Kising [
429], and official ACE-Studio release [
430]) and Japanese singing datasets (i.e., Ofuton-P [
431], Oniku Kurumi [
432], Kiritan [
433], and JVS-MuSiC [
434]). The deepfake vocals were generated using both Singing voice synthesis (SVS) and Singing voice conversion (SVC) methods. The SVS systems used are XiaoiceSing [
435], VISinger [
436], VISinger2 [
437], neural network (NN)-based SVS (NNSVS) [
438], Naive RNN [
439], DiffSinger [
399], and ACESinger [
429,
430]. The SVC systems utilized are Nagoya University (NU) [
440], WavLM [
441], ContentVec [
442], MR-HuBERT [
443], WavLabLM [
444], and Chinese HuBERT [
445].
3.28. FakeSound [347]
The FakeSound dataset consists of 39,597 genuine audio samples and 3798 manipulated audio samples. To create deepfake samples, a grounding model identifies and masks key areas in legitimate audio based on caption information. The generation model then recreates these key regions, substituting them to generate convincing, realistic deepfake audio. In particular, open-source models (i.e., AudioLDM1 [
446] and AudioLDM2 [
447]) inpaint masked portions using input text and remaining audio. AudioSR [
448] upscales for realism and quality; then, the regenerated segments are combined with the original audio. The audio samples are in WAV format, totaling 986.6 MB in size.
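A schematic of this mask-and-regenerate pipeline is sketched below in Python; the inpainting and super-resolution models are reduced to hypothetical placeholder functions (standing in for AudioLDM-style and AudioSR-style components), and only the masking/recombination logic is concrete.

```python
import numpy as np

def inpaint_model(masked, sr, region):
    # Placeholder for an AudioLDM-style generator that recreates the masked
    # region from the caption and the remaining audio context.
    start, end = region
    return np.random.randn(end - start) * 0.01

def sr_model(segment, sr):
    # Placeholder for AudioSR-style upscaling of the regenerated segment.
    return segment

def regenerate_region(audio, sr, start_s, end_s):
    """Mask audio[start:end], regenerate it, and splice it back in."""
    start, end = int(start_s * sr), int(end_s * sr)
    masked = audio.copy()
    masked[start:end] = 0.0                       # mask the key region

    segment = sr_model(inpaint_model(masked, sr, (start, end)), sr)

    out = audio.copy()
    out[start:end] = segment                      # substitute regenerated audio
    return out

sr = 16000
fake = regenerate_region(np.random.randn(3 * sr) * 0.01, sr, 1.0, 1.5)
```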
3.29. SceneFake [351]
The SceneFake dataset was created to detect scene-manipulated utterances. This database is assembled from manipulated audios, which were generated by only tampering with the acoustic scene of real utterances by employing speech enhancement technologies. It includes both fake and genuine utterances involving various scenes. The dataset was developed using the logical access (LA) section of ASVspoof 2019 and the acoustic scene dataset from the DCASE 2022 challenge. It contains 19,838 real samples, 64,642 fake samples, and 84,480 total samples, all in English. The dataset features six types of acoustic scenes, i.e., Airport, Bus, Park, Public, Shopping, and Station. The fake utterances were manipulated by utilizing four kinds of speech enhancement methods, i.e., Spectral Subtraction (SSub) [
449,
450], Minimum Mean Square Error (MMSE) [
449,
450], Wiener filtering [
449,
450], and Full and band network (FullSubNet) [
451].
3.30. SingFake [352]
The SingFake dataset, introduced in 2024, consists of a total of 26,954 audio clips, with 15,488 clips categorized as real and 11,466 clips categorized as fake. The dataset has a cumulative duration of 58.33 h and encompasses multiple languages, including Mandarin, Cantonese, English, Spanish, and Japanese. All audio files from the 40 speakers are stored in the MP3, AAC, OPUS, and VORBIS formats.
Evolution and Transition to Future Prospects of Deepfake Datasets: The deepfake datasets described in
Section 2 and
Section 3 consist of extensive collections of visual and audio deepfake media, which have become crucial in advancing AI’s capabilities to both create and detect deepfakes. This collection is an invaluable resource for those pursuing an in-depth and exhaustive analysis of the subject matter. Overall, during the preliminary years, these datasets were limited and less diverse, but they have progressively become more sophisticated and expansive. In spite of the remarkable progress, there is a growing demand for larger and more comprehensive databases that cover a broader spectrum of scenarios, media types, and characteristics to keep pace with advancing technology, as discussed in
Section 4.2.6,
Section 4.10,
Section 4.15 and
Section 4.22.
Considerations and Strategies for Selecting Deepfake Datasets: Still, the crucial question is how one decides which databases to select from the available options for their research or study. The selection of deepfake datasets involves understanding the purpose of one’s study, the types of deepfake techniques one aims to address, and the quality and diversity of the dataset needed for robust model development. To this aim, first, define your research goal (i.e., whether you focus on single deepfake detection or generalized multiple deepfakes detection). Next, determine which types of deepfakes you need (e.g., audio or visual deepfakes, multimodal deepfakes, or full body deepfakes). Look for datasets with different quality, diverse samples, and sizes (e.g., high-resolution images or videos and sophisticated audio files and formats with a variety of ethnicities, ages, and environments) to ensure robust and advanced model development. Larger and more diverse datasets can improve the generalizability of your model. Moreover, choose datasets that are commonly used in the field to facilitate benchmarking and comparison with other research. Benchmarking datasets (e.g., Deepfake Detection Challenge [
85] and DeepFake Game Competition [
108]) are widely recognized and used within the research community, which allows for standardized evaluation and comparison of different methods and models and facilitates progress and innovation in the field. All in all, to substantiate the efficacy, robustness, and benefits of the developed deepfake framework, it is crucial to report experimental analyses using heterogeneous datasets from various time periods. For instance, employing one dataset from the early years, another from the middle years, and a recent dataset from the last two years (as more advanced tools are typically used to generate deepfakes in recent datasets) would provide a comprehensive evaluation. This selection of diverse databases will objectively assess the dynamism and effectiveness of the developed algorithms under a cross-database setting, wherein training is performed on one or more databases and testing is conducted on different ones.
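The cross-database protocol suggested above can be summarized in a short Python sketch; the dataset names are examples drawn from this survey, while the loaders and the detector are trivial stand-ins (random scores) so that the protocol itself, rather than any particular model, stays the focus.

```python
import random

def load_dataset(name):
    # Placeholder loader: would return (score_input, label) pairs in practice.
    return [(random.random(), random.randint(0, 1)) for _ in range(100)]

def train(datasets):
    # Placeholder training routine; returns an identity "scoring model".
    return lambda x: x

def evaluate(model, data):
    # Placeholder accuracy-style metric over (input, label) pairs.
    return sum(label == (model(score) > 0.5) for score, label in data) / len(data)

TRAIN_SETS = ["FaceForensics++", "Celeb-DF"]   # earlier-era datasets
TEST_SETS = ["DF40", "AV-Deepfake1M"]          # recent datasets

# Train on one era, test on unseen datasets from another era.
model = train([load_dataset(n) for n in TRAIN_SETS])
for name in TEST_SETS:
    print(f"cross-dataset accuracy on {name}: {evaluate(model, load_dataset(name)):.3f}")
```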
4. Open Issues and Potential Research Directions
Despite significant progress in the deepfake field, there remain various noteworthy concerns surrounding deepfake generation and detection frameworks [
12,
14,
66,
319,
452,
453,
454,
455,
456,
457,
458,
459,
460,
461,
462,
463,
464,
465,
466,
467]. This section highlights both key challenges and future research directions in the field of deepfake technology. It roughly follows a topical structure, encompassing themes like benchmarking, next-generation deepfake detectors and generators, critical factors (e.g., source identification and fairness), emerging frontiers in deepfake technology, regulatory landscapes, the complexities of audio deepfake techniques, combating deepfake vulnerabilities and cybersecurity threats, and the importance of cross-disciplinary insights for effective mitigation and awareness. Some subsections within these themes may seem to overlap in subject matter, but they are intentionally structured this way to emphasize their significance without diluting their focus and importance.
4.1. Assessing Effectiveness: Benchmarking and Metrics for Evaluation
The accuracy of audio and visual deepfake and manipulation methods grapples with fundamental issues, including the evaluation, configuration, and comparison of algorithms. A comprehensive evaluation framework is essential to assess and rate the capacity of deepfake generation and detection techniques. Practitioners and researchers could potentially accomplish this by (i) formulating protocols and tools specifically designed to estimate the quality of generated deepfakes, the performance of deepfake detectors, the vulnerability of deepfake detectors to adversarial attacks, and the robustness of speech and face recognition systems under deepfakes; (ii) creating standardized common criteria; and (iii) establishing an online and open platform for transparent and independent evaluations of systems against validated benchmarks. Professionals and researchers are encouraged to devise novel performance evaluation metrics, methods, and security- and non-security-related error rates, as well as to propose unified frameworks, a universal taxonomy, and a shared vocabulary for deepfake systems. We need to design both qualitative and quantitative evaluation metrics.
Protocols ought to establish a foundation for current and future deepfake generation and detection methods, addressing both existing (known) and new (unknown) attribute manipulations and attacks. This is crucial for continuous and timely progress in the field. Furthermore, it is imperative to equally prioritize evaluation metrics that differentiate false positives in deepfake detection and speech and face recognition systems because false positives in deepfake detection occur when genuine content is incorrectly labeled as fake. This can potentially lead to unwarranted scrutiny or filtering of legitimate content, thereby affecting the reliability of detection algorithms. Conversely, false positives in speech or face recognition systems occur when the system incorrectly identifies an unauthorized user as legitimate. This can lead to unauthorized access being granted, thereby compromising security, privacy, and data integrity. Also, protocols should integrate sociocultural parameters such as human perceptions, human detectors’ background and knowledge, and privacy.
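To make the distinction concrete, the following minimal Python sketch computes a deepfake detector's equal error rate (EER) together with the false-positive rate at the EER threshold, where a false positive means genuine content flagged as fake; the scores and labels are purely illustrative.

```python
import numpy as np
from sklearn.metrics import roc_curve

labels = np.array([0, 0, 0, 1, 1, 1])                 # 0 = genuine, 1 = deepfake
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.9])    # detector "fake" scores

# ROC sweep over thresholds: fpr = genuine flagged as fake, fnr = fakes missed.
fpr, tpr, thresholds = roc_curve(labels, scores)
fnr = 1 - tpr

# EER is where false-positive and false-negative rates cross.
eer_idx = np.nanargmin(np.abs(fnr - fpr))
print(f"EER ~ {(fpr[eer_idx] + fnr[eer_idx]) / 2:.3f} "
      f"at threshold {thresholds[eer_idx]:.2f}")
print(f"FPR at that threshold (genuine content flagged fake): {fpr[eer_idx]:.3f}")
```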
Studies [
4,
51] have shown that higher quality samples are not only difficult to detect but also notably degrade the performance of face recognition systems. Thus, the online and open platform should provide datasets with varying qualities as well as options to generate deepfakes of different qualities using freely available tools or commercial software that may be automatic or require human intervention. To report baseline accurately without conveying any misleading impression of progress, it is imperative to define common criteria to assess deepfake quality, computational complexity, decision-making, performance, and policies.
All in all, a few initial attempts (e.g., [
468,
469,
470,
471]) have been made in this direction. However, there is still a deficiency in conducting large-scale analyses of existing evaluation metrics in the field of deepfakes to determine their reasonability. Any such study should also consider human capabilities and disparities between machines and humans for face manipulation and deepfake. Streamlined benchmarking platforms are essential, incorporating modular design, extensibility, interpretability, vulnerability assessment, fairness evaluation, a diverse range of generators and detectors, and strong analytical capabilities.
4.2. Future Deepfake Detection Technologies
This subsection focuses on future cutting-edge research on frameworks to detect and combat increasingly sophisticated deepfake contents. Specifically, the generalization capability of deepfake countermeasures in Section 4.2.1, identity-aware deepfake detection in Section 4.2.2, detection method scarcity for non-English languages in Section 4.2.3, next-generation detection methods in Section 4.2.4, acoustic constituent-aware audio deepfake detectors in Section 4.2.5, multimodal deepfake detection in Section 4.2.6, and modality-neutral deepfake detection in Section 4.2.7 are discussed.
4.2.1. Generalization Capability of Deepfake Countermeasures
Many current deepfake detection models encounter difficulties in accurately identifying deepfakes and digitally manipulated speech and faces in datasets that deviate from their original training data. Namely, the performance of detectors sharply drops upon encountering novel deepfake types, databases, manipulations, variations, attacks, and speech or face generator models that were not utilized during the training phase [
87,
110,
472,
473]. This is attributed to the fact that previous approaches have predominantly concentrated on particular artifacts, deepfake scenarios, closed-set settings, and supervised learning, making them susceptible to overfitting. Besides overfitting, some other factors that may contribute to poor generalization are model architecture and complexity, underfitting, feature selection and engineering, hyperparameter tuning, data preprocessing, and dataset quality and representativeness. Attackers can take advantage of the lack of generalization. Generalizability is crucial for staying ahead of deepfake generators in the continuous development of effective countermeasures. There has been limited research aimed at addressing the generalizability of deepfake countermeasures. However, it is evident that we have not achieved generalizable deepfake detection. The problem of generalization is exacerbated by unknown deepfakes, temporal dynamics, and adversarial examples’ transferability. The rapid development of deepfake detectors with broad generalizability is the need of the hour, ensuring effectiveness across diverse deepfake datasets, types, manipulations, resolutions, audio/visual synthesizers, apps, and tools.
For better understanding and future solutions, studies should strive to address questions such as ‘to what extent do current deepfake detectors expand their capabilities in identifying deepfakes across diverse datasets?’ and ‘do deepfake countermeasures inadvertently learn unwanted features that impede generalization?’. Also, exploring the synergies between deepfake detectors and augmentation methods that contribute to better generalizability, and asking why deepfake detectors struggle to generalize could provide valuable insights. To confront the challenge of generalization capability, we need more robust, scalable, and adaptable deepfake detectors. Future research endeavors may center around enhancing the generalizability of deepfake countermeasures through the implementation of contrastive learning, meta learning, ensemble learning, active learning, semi-supervised learning, unsupervised learning, continuous learning, one class learning, causality learning (i.e., enhancing contributing neurons of generalizability, e.g., partial neurons/layer freezing), interferential learning (i.e., carefully applying strategies to avoid acquiring unwanted features, e.g., guided DropConnect), hybrid learning, zero-shot learning, few shot learning, open set recognition, human-in-the-loop, multi-attention, cross-modality attention, non-identity representations learning (e.g., deep information decomposition), multi-source to multi-target learning, universal features learning, and hierarchical learning methodologies.
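As one concrete instance of the strategies listed above, the following hedged PyTorch sketch applies partial layer freezing to a generic backbone, keeping early, broadly transferable feature layers fixed while fine-tuning later layers on new deepfake types; the choice of ResNet-18 and the frozen/unfrozen split are assumptions for illustration, not a prescribed recipe.

```python
import torch
import torchvision

# Generic image backbone with a binary real/fake head (weights=None keeps the
# sketch offline; pretrained weights would normally be loaded here).
model = torchvision.models.resnet18(weights=None)
model.fc = torch.nn.Linear(model.fc.in_features, 2)

# Partial freezing: only the last residual stage and the classifier are
# updated, so earlier generic features are preserved across deepfake types.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith(("layer4", "fc"))

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
```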
4.2.2. Identity-Aware Deepfake Detection
The landscape of deepfake detection primarily revolves around 'artifact-driven' techniques, i.e., spotting the unnaturalness left behind by generation schemes. Thereby, detectors function well within the closed-set context (i.e., known artifacts, manipulations, and attributes) but work poorly in the open-set context (i.e., unknown artifacts, manipulations, and attributes). New deepfake generators will continue to grow, producing deepfakes with novel manipulations and artifacts. One promising and reliable solution is 'identity-driven' deepfake detection, which learns identity implicit features rather than specific artifacts [
25,
163,
474,
475]. However, ‘Identity-driven’ deepfake detection techniques also struggle to perform effectively in open-set scenes, particularly when the detectors encounter previously unseen identities. Audio and multimodal identity-driven deepfake detection schemes do not currently exist. To develop innovative identity-aware deepfake detection technology, various multi-feature frameworks can be explored. These frameworks may incorporate components such as 3D morphable models, temporal attributes analysis, identity embeddings, metric learning techniques, adversarial training methodologies, attention mechanisms, continual learning approaches, self-supervised learning methods, federated learning strategies, graph neural networks, and secure multi-party computation techniques. Some key use cases of identity-aware deepfake detection could be content moderation (e.g., social platforms using identity-aware deepfake detection to automatically filter and flag fake videos, swiftly removing misinformation and malicious content), evidence authentication (e.g., verifying the video authenticity evidence utilized in court cases to ensure that deepfake videos are not used to mislead the judicial process), fact-checking (i.e., news organizations employing this technology to verify video authenticity before broadcasting and preventing the spread of fake news and manipulated content), and fraud prevention (e.g., banks and financial institutions can use it to verify the identities of individuals in video banking services to avoid identity fraud and account takeovers).
4.2.3. Detection Methods Scarcity for Non-English Languages
Current research significantly prioritizes the crafting of detection techniques aimed at identifying deepfakes in the English language, particularly those in the audio domain. There is an acute shortage of frameworks specifically crafted for other major languages, such as Arabic, Hindi, Chinese, and Spanish, with estimated native speaker populations of 362 million, 344 million, 1.3 billion, and 485 million, respectively. This highlights a significant gap in existing research. The idiosyncrasies of alphabet pronunciation in each language present challenges that traditional English audio/video/multimodal deepfake detectors and audio/video/multimodal processing are not adept at handling. Also, non-English (audio/video/multimodal) deepfake databases are notably scarce. The research community should bridge this important gap by introducing innovative (audio/video/multimodal) deepfake detection methods capable of identifying languages beyond English. For instance, such novel, robust, and enhanced deepfake detection methods for Spanish, Arabic, Hindi, and Chinese could be integrated into social media platforms, empowering users to discern manipulated content, thereby increasing trust and reliability in digital communication.
4.2.4. Next-Generation Detection Methods
The rivalry between deepfake detection and generation resembles a dynamic cat-and-mouse game. Current static methods lack reliability, flexibility, sustainability, and robustness against attacks. It is crucial to devise more advanced detection schemes to overcome the limitations of prior audio/face-forensic technologies. The next-generation deepfake and audio/face manipulation detection algorithms should possess the capability to tackle diverse audio/facial features, sizes, and resolutions. They should be trainable with small databases without a loss of efficacy, be resilient to adversarial examples/attacks, function in real time, and support multi-modal inputs and outputs. To achieve this goal, ensemble learning, multitask learning, adaptive schemes, multi-feature and multi-level fusion, as well as consideration of local and global attributes hold significant potential.
4.2.5. Acoustic Constituent-Aware Audio Deepfake Detectors
In the literature, a common observation is that most audio deepfake detection studies overlooked significant factors (e.g., accent, dialect, speech rate, pitch, tone, and emotional state), which could adversely affect the detection accuracy. Studies reveal the negative influence of such factors on speaker recognition performance [
476]; however, there is a lack of sufficient research on this subject in audio deepfakes so far. More studies are warranted to understand the intricate effects of acoustic constituents like dialect, tone, accent, pitch, speech rate, and emotional state, both individually and in combination, on the efficacy of audio deepfake detectors. To assess their impact, one can evaluate detection accuracy across varying levels of acoustic constituent sophistication, real-world applicability under diverse audio qualities and sources, computational speed and efficiency, comparison with current solutions, scalability and adoption potential, and ethical considerations regarding potential misuse.
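A simple way to probe such factors is to perturb pitch and speech rate and re-score the detector on each variant, as in the following hedged Python sketch; the stand-in tone and the placeholder `detector_score` function are hypothetical, standing in for real utterances and a trained model.

```python
import librosa
import numpy as np

def detector_score(audio, sr):
    # Placeholder detector: returns a pseudo "fake probability" so the probe
    # loop is runnable; a real system would call a trained model here.
    return float(np.clip(np.abs(audio).mean() * 10, 0, 1))

# Stand-in utterance (a pure tone keeps the sketch self-contained).
sr = 16000
wav = librosa.tone(440, sr=sr, duration=2.0)

# Acoustic perturbations of the kind discussed above: pitch and tempo shifts.
variants = {
    "original": wav,
    "pitch_up_2st": librosa.effects.pitch_shift(y=wav, sr=sr, n_steps=2),
    "faster_1.2x": librosa.effects.time_stretch(y=wav, rate=1.2),
}
for name, audio in variants.items():
    print(name, detector_score(audio, sr))   # track score drift per factor
```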
4.2.6. Multimodal Deepfake Detection
Unimodal deepfake detectors, specializing in either visual or audio fake detection, constitute the prevailing majority within the current landscape of deepfake detection. As a result, they exhibit the poorest performance for multimodal deepfakes. Multimodal deepfakes are forged video samples that contain both fake visual and corresponding synthesized lip-synced fake audio. Although a handful of multimodal deepfake detectors have been proposed (e.g., [
477,
478]), their proficiency in identifying multimodal manipulations is notably substandard. The shortage of diverse and high-quality multimodal deepfake datasets restricts advancements in multimodal deepfake detection.
To catalyze increased efforts in the realm of multimodal deepfake detection studies, a greater number of diverse multimodal datasets should be produced and shared publicly for research purposes. These datasets should include types of forgeries (e.g., real audio paired with manipulated video, real video coupled with manipulated audio, manipulated video and manipulated audio, and manipulated partial portions) and a broad spectrum of event/context categories (e.g., indoor, outdoor, sports, and politics). Additionally, it is crucial to create public databases containing highly realistic multimodal deepfakes, wherein entire multimodal samples are meticulously crafted by AI. Generating cohesive and convincing multimodal deepfakes remains a challenge for contemporary techniques, particularly in ensuring the consistency and alignment of fake visuals and fake speech.
Future multimodal deepfake detectors may be grounded in diffusion-based multimodal features, multimodal capsule networks, multimodal graphs, multimodal variational autoencoders, multi-scale features, new audiovisual deepfake features, time-aware neural networks, novel disjoint and joint multimodal training, cross-model joint filters, multimodal multi-attention networks, and identity- and context-aware fusion mechanisms. Researchers should also develop ensemble-based multimodal deepfake detection systems to ensure superior generalization to unseen deepfakes and explainability. Additionally, exploring methods to improve the generalizability of multimodal deepfake detection methods to unseen deepfakes will be a key area of investigation.
4.2.7. Modality-Neutral Deepfake Detection
Deepfakes have evolved from singular modality manipulation to multimodal falsified contents, allowing for the fabrication of audio, visual, or combined audio–visual elements. Employing two separate unimodal detectors (i.e., one for the audio and the other for the visual element) for audiovisual deepfakes either misses potential cross-modal forgery cues or introduces higher computational complexity [
479,
480]. Conventional multimodal deepfake detection methods often involve correlating audio and visual modalities, with a prerequisite that both modalities exist concurrently. In reality, deepfake samples may lack certain modalities, with either modality missing at any given time. Given the increasing prevalence of audiovisual deepfake content, it is essential to devise modality-agnostic detectors capable of efficiently thwarting multimodal deepfakes, regardless of the type or number of modalities at play. Future multimodal deepfake detection methods need to be cohesive, considering the intermodal relationships, and capable of addressing scenarios with missing modalities and the manipulation of either or both modalities. In these frameworks, detectors should assign two separate labels to accurately specify whether the audio, the visual, or both modalities have undergone manipulation (a conceptual sketch is given below). Some potential directions for future research in this field include investigating multi-modal neural networks and ensemble frameworks that can integrate features from different sources to identify inconsistencies across modalities, as well as exploring domain adaptation and transfer learning methods that allow knowledge learned from detecting deepfakes in one modality (e.g., images) to be applied to another (e.g., audio).
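The dual-label idea can be sketched conceptually as follows in PyTorch; the embedding dimensions, the zero-filling convention for missing modalities, and the architecture itself are illustrative assumptions, not a published design.

```python
import torch
import torch.nn as nn

class DualLabelDetector(nn.Module):
    """Fuses audio/visual embeddings and emits one real/fake label per modality."""

    def __init__(self, dim=256):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
        self.audio_head = nn.Linear(dim, 2)    # audio: real vs. fake
        self.visual_head = nn.Linear(dim, 2)   # visual: real vs. fake

    def forward(self, audio_emb, visual_emb):
        # Missing modalities are passed as zero tensors of shape (B, dim),
        # so the same network handles audio-only, visual-only, and joint inputs.
        z = self.fuse(torch.cat([audio_emb, visual_emb], dim=-1))
        return self.audio_head(z), self.visual_head(z)

detector = DualLabelDetector()
a, v = torch.randn(4, 256), torch.zeros(4, 256)   # visual modality missing
audio_logits, visual_logits = detector(a, v)      # two independent verdicts
```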
4.3. Future Deepfake Generation Technologies
This subsection explores cutting-edge research on future frameworks for generating sophisticated deepfake content. In particular, next-generation audio/face manipulation and deepfake generators in
Section 4.3.1, audio/facial attribute manipulation techniques in
Section 4.3.2, and real-time deepfake generation in
Section 4.3.3 are covered.
4.3.1. Next-Generation Audio or Face Manipulation and Deepfake Generators
A dynamic and healthy interplay between deepfake generators and deepfake detectors may fuel great advances in designing next-generation multimedia forensic tools. Contemporary deepfake and audio/face manipulation methods are marked by several practical deficiencies: they are not equipped for lifelike ultra-high-resolution sample creation; audio/face attribute manipulations are constrained by the training dataset (i.e., new attributes cannot be created if they are not part of the training set); they cannot manage scenarios involving occlusion; they suffer from audiovisual continuity problems (i.e., jarring frame transitions and inconsistent physiological signals); they overfit easily (e.g., the training process exhibits instability); they generalize poorly to unfamiliar situations; their generated samples carry artifacts or model fingerprints that are prone to easy detection by digital forensic tools; and they incur long training times, high generation (processing) times (especially for high-quality, high-resolution samples), and a requirement for sophisticated computing infrastructure.
Next-generation deepfake generation frameworks offer ample opportunities for improvement and innovation, for instance, developing automated and swift generators that leverage both deep learning and computer graphics techniques to create highly realistic samples. Similarly, the field requires generators that need only a small training database; that craft lifelike videos from a few audio samples/photographs or from very short clips; that remain unbiased towards the training dataset; that efficiently handle occlusion, pose, resolution, audio noise and distortion, orientation, class imbalance, and demographic factors during and after training; and that improve the synchronicity and temporal consistency of audiovisual deepfake content, particularly in the context of language-translated manipulation. Also, an imperative need exists for a computationally efficient speech/face generator designed to run seamlessly on edge devices or smartphones. Another avenue to explore is creating real-time 3D deepfakes with improved fidelity, head movements, and speech/facial expressions using advanced generators. Future ‘entire sample synthesis’ methods should not only generate audio or audiovisual samples without traces of identity-related indicators from the speeches/faces used in training (i.e., identity-agnostic schemes) but also be multi-domain and optimized for use on mobile devices.
Contrary to current deepfake generators, next-generation methods are poised to advance into full-body manipulations, incorporating changes in gestures, body pose, and more [
211]. Also, there is a limited number of frameworks dedicated to producing obvious deepfakes and speech/face manipulations (e.g., a human face with tusks) or to creating target-specific individual audio or joint audio–visual deepfakes with a heightened level of realism and naturalness. Furthermore, designing dedicated hardware for deepfake generation and establishing fresh empirical metrics for efficacy offer exciting avenues for advancement.
Considering the viewpoint of the red team, future generators should possess the capability to generate samples devoid of subtle traceable information (i.e., fingerprints), speech/pixel inconsistencies, or unusual audio/texture artifacts, as these elements can potentially expose deepfakes to detection methods. Deepfake generators equipped with built-in mechanisms (e.g., trace-tailored loss functions or layers) to mask or eliminate these elements without compromising quality can effectively circumvent proactive defenses. In essence, the competition between sophisticated generators and detectors will drive progress in both domains, particularly in developing robust detection frameworks and creating large-scale deepfake and face manipulation datasets.
4.3.2. Audio/Facial Attribute Manipulation Techniques
Audio/facial attribute schemes can be grouped into two main types: audio/facial attribute estimation (A/FAE) and audio/facial attribute manipulation (A/FAM). A/FAE is utilized to identify the presence of specific audio or visual attributes within a sample, while A/FAM involves modifying speeches or faces by synthesizing or removing attributes using generative models [
453]. Existing audio/facial attribute techniques fail to deal successfully with in-the-wild variations in audio/image/video size, masks, vocal peripherals, obscurity, vocal cavity effects, resolution, and occlusion. The A/FAM methodologies exhibit deficiencies in interactivity (e.g., a lack of sufficient interactivity options for users), in handling attribute/class imbalance (e.g., no samples of large birthmarks), in producing believable and high-quality multi-manipulations, and in translating audio/image/video to video/speech manipulations. Moreover, creating a taxonomy of audio/facial attributes in A/FAE and A/FAM would greatly help researchers and practitioners in the field.
4.3.3. Real-Time Deepfakes Generation
Advancements in deepfake technology have empowered both malicious attackers and non-malicious users to manipulate faces during live streaming, broadcasting, gaming, entertainment, or videoconferencing. Such real-time facial audiovisual content modification is achieved via the use of AI-powered deepfake filters [
481], e.g., increasing the prevalence of smiles or changing eye gaze. Nevertheless, generating high-fidelity and high-precision face swaps or reenactments remains a persistent challenge in the context of real-time Internet live sessions. More compressed, efficient, and tailored algorithms should be developed for live deepfake generation in in-the-wild scenarios. Moreover, innovative AI-driven deepfake filters should be designed that produce high-quality audio and video deepfakes; are tailored for multi-person or multi-face scenarios; accommodate substantial head movements; and address aspects such as higher tone, register, vibrato, diction, and breath support. Future research in real-time deepfake generation may also focus on improved GANs and diffusion models, enhancing temporal coherence in video sequences, few-shot and zero-shot learning with minimal data, and dynamic and adaptive neural networks that can adjust their processing power and complexity in real time depending on the input data. As discussed in
Section 4.12.2, the issue of face tampering in live streaming can be mitigated through a challenge–response procedure.
4.4. Advancing Scalability and Usability in Deepfake Techniques
This section delves into scalability and usability considerations within deepfake techniques.
Section 4.4.1 discusses the topic in general, while
Section 4.4.2 explicitly talks about audio deepfake detection, which is underexplored in this theme compared to visual deepfakes.
4.4.1. Scalability and Usability Considerations
Scalability and usability are important factors to consider when selecting a deepfake technique. The scalability and usability of deepfake and audio/face manipulation generation and detection tools depend on factors such as input (and/or output) audio/face resolution, size, dataset, user interface, required technical expertise, speed, and documentation; thus, researchers and practitioners need to be attentive when choosing the most suitable deepfake method for their study, needs, and available resources. For instance, among generation techniques, CycleGAN [
200] is comparatively better than StyleGAN [
78] due to its ability to handle unpaired data and its user-friendliness. A comprehensive study to assist users in selecting a suitable technique based on scalability and usability is currently unavailable.
4.4.2. Scalability-Focused Approaches to Audio Deepfake Detection
The trade-off between efficiency and scalability is a notable challenge for current audio deepfake detectors. Thus, innovative schemes need to be formulated that achieve both the highest accuracy and significant scalability simultaneously. Towards this goal, emphasis should be on creating fast and efficient preprocessing, audio DL models, data transformations, self-supervised learning, dynamic optimal transport, gradient compression, distributed architectures, hybrid processing, accelerated system-on-chip, and parallel algorithms that are capable of handling both labeled and unlabeled data.
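As one hedged example of this accuracy–scalability trade-off, the sketch below uses depthwise-separable 1D convolutions over raw waveforms to keep parameters and FLOPs low; the architecture, kernel sizes, and input length are illustrative assumptions, not a benchmarked design.

```python
# Illustrative sketch of a scalability-minded audio deepfake detector:
# depthwise-separable 1D convolutions over raw waveforms keep the
# parameter count and FLOPs low. Sizes are assumptions for illustration.
import torch
import torch.nn as nn

def ds_block(cin, cout, k=9, stride=4):
    # Depthwise conv (per-channel) followed by a 1x1 pointwise conv:
    # roughly k-times fewer multiply-adds than a standard convolution.
    return nn.Sequential(
        nn.Conv1d(cin, cin, k, stride=stride, padding=k // 2, groups=cin),
        nn.Conv1d(cin, cout, 1), nn.BatchNorm1d(cout), nn.ReLU(),
    )

class LightAudioDetector(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 16, 9, stride=4, padding=4), nn.ReLU(),
            ds_block(16, 32), ds_block(32, 64), ds_block(64, 64),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(64, 2),
        )

    def forward(self, wav):          # wav: (batch, 1, samples)
        return self.net(wav)

model = LightAudioDetector()
print(sum(p.numel() for p in model.parameters()))  # small parameter budget
logits = model(torch.randn(2, 1, 16000))           # one second at 16 kHz
```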
Evaluating the impact of scalability and usability considerations in deepfake technology involves examining both the technical and societal implications. Key aspects of scalability include computational power, data availability, algorithm efficiency, optimization techniques, cloud services, resource costs, energy consumption, real-time processing, accuracy and speed trade-offs, deployment (i.e., integration with platforms), user accessibility, and adaptability to evolving threats. Key usability considerations include the user interface, ease of use, customization options, and interoperability.
4.5. Deepfake Provenance and Blockchain Solutions
This subsection addresses critical strategies for enhancing the traceability and integrity of deepfake content, focusing on source identification (
Section 4.5.1) and the application of blockchain technology (
Section 4.5.2). With the burgeoning momentum of blockchain technology and its increasing utilization across various sectors, we find it deserving of a dedicated subsection.
4.5.1. Deepfake Source Identification
Another way to alleviate deepfakes is deepfake content provenance (i.e., deepfake source identification). Content provenance can help ordinary people verify the authenticity of information. An initial attempt was made by Adobe, The New York Times, and Twitter (currently X) via the Content Authenticity Initiative (CAI) [
482] to determine details like the time and location of the media, the type of device used, etc. Also, the Coalition for Content Provenance and Authenticity (C2PA) [
483] has been formed, which will guide the enactment of content provenance for editors, creators, media platforms, publishers, and consumers. More attention should be devoted to the aspect of deepfake provenance. Researchers can develop frameworks using blockchain technologies, distributed ledger technologies, and multi-level graph topologies, usable on large social networks, for deepfake content traceability. Furthermore, multimodal deepfake source identification algorithms that can take text, image, video, and audio as input modalities in both unconstrained and constrained settings should be devised.
Addressing the challenge of deepfake provenance requires collaboration and coordination among various stakeholders. The responsibility for implementing and supervising deepfake provenance (source identification) typically falls on a combination of entities, including research institutions, tech companies, government agencies, law enforcement, independent non-profit organizations, social media platforms, and users and communities.
4.5.2. Blockchain in Deepfake Technology
Conventional deepfake methodologies fall short in examining the history and origins of digital media or algorithms. Utilizing blockchain technology offers a fresh perspective in the ongoing battle against deepfake manipulation. Blockchain is a decentralized and distributed digital ledger technology that facilitates secure, transparent, and tamper-resistant record-keeping of transactions and the independent monitoring of resources [
484,
485,
486,
487]. Few studies have explored blockchain in deepfake technology; this study area is still in its infancy. Future advancements in blockchain, smart contracts, Hyperledger Fabric, hashing techniques, security methods, and integrity measures in the realm of deepfake technology are crucial for establishing deepfake provenance, tracking modifications made to media, and verifying the authenticity of media assets. These innovations will help in uncovering the user motivations behind deepfake postings/publications and in addressing the potential misuse of deepfake generation and detection algorithms.
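To illustrate the basic mechanism, consider the toy sketch below, which registers media hashes in a tamper-evident hash chain. It is a minimal, single-node stand-in for a real distributed blockchain with consensus and smart contracts; all class and field names are hypothetical.

```python
# Minimal sketch of blockchain-style provenance: each media asset's hash
# is appended to a tamper-evident chain of records. This is a toy local
# ledger for illustration, not a distributed blockchain implementation.
import hashlib
import json
import time

def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

class ProvenanceLedger:
    def __init__(self):
        self.chain = [{"media_hash": None, "prev": "0" * 64, "ts": 0}]

    def register(self, media_bytes: bytes, meta: dict) -> dict:
        prev_block = self.chain[-1]
        block = {
            "media_hash": sha256(media_bytes),   # fingerprint of the asset
            "meta": meta,                        # e.g., device, creator
            "ts": time.time(),
            "prev": sha256(json.dumps(prev_block, sort_keys=True).encode()),
        }
        self.chain.append(block)
        return block

    def verify(self, media_bytes: bytes) -> bool:
        # An asset is 'known' if its hash appears in the ledger; any edit
        # to the asset changes the hash and breaks the match.
        h = sha256(media_bytes)
        return any(b["media_hash"] == h for b in self.chain)

ledger = ProvenanceLedger()
ledger.register(b"...video bytes...", {"device": "cam-01"})
print(ledger.verify(b"...video bytes..."))   # True
print(ledger.verify(b"...edited bytes..."))  # False
```

Because each block stores the hash of its predecessor, silently rewriting an earlier provenance record breaks the chain, which is precisely the property that makes blockchain attractive for media provenance.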
4.6. Fairness and Credibility of Deepfakes
Recent studies have observed that deepfake (detection) models and datasets are negatively biased and yield skewed performance for different racial, gender, and age groups. Certain groups are excluded from correct detection and unfairly targeted, leading to issues of fairness, security, privacy, generalizability, unintentional filtering of harmless content, and public trust or opinion. Furthermore, a study conducted in [
488] revealed that racial adjustments in videoconferencing sessions resulted in higher speaker credibility when participants were perceived as Caucasian. It is imperative to assess and discern bias (i.e., favoritism or unfairness) in deepfake technology before massive roll-out and commercial adoption. Little research has been conducted on ensuring fairness in deepfake technology, and there is a need to fill this gap. Towards this goal, we can explore new loss functions that are agnostic to demographic factors, non-decomposable losses with large-scale demographic and non-demographic attributes, and the reduction of pairwise feature correlations. In addition, advanced training algorithms can be developed to specifically isolate or remove demographic features, e.g., deep generative models that eliminate gender information from input samples before feeding them into feature extraction for classification. Likewise, fairness penalties and risk measures can be incorporated into the learning objectives.
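As a minimal sketch of the fairness-penalty idea just mentioned, the following function augments a standard classification loss with a demographic-parity-style gap between two groups; the group encoding, the penalty form, and the weight lam are illustrative assumptions.

```python
# Hedged sketch of a fairness penalty in the training objective: the
# classification loss is augmented with the gap in mean predicted
# fakeness score between two demographic groups (a demographic-parity
# style regularizer). Group labels and the weight lam are assumptions.
import torch
import torch.nn.functional as F

def fairness_aware_loss(logits, labels, group, lam=0.1):
    # logits: (batch,) raw scores; labels: (batch,) 0=real, 1=fake;
    # group: (batch,) 0/1 demographic indicator available at train time.
    bce = F.binary_cross_entropy_with_logits(logits, labels.float())
    scores = torch.sigmoid(logits)
    g0, g1 = scores[group == 0], scores[group == 1]
    parity_gap = (g0.mean() - g1.mean()).abs() if len(g0) and len(g1) \
        else torch.tensor(0.0)
    return bce + lam * parity_gap

loss = fairness_aware_loss(torch.randn(8), torch.randint(0, 2, (8,)),
                           torch.randint(0, 2, (8,)))
```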
Specific features also contribute to the lower fairness of deepfake detectors; hence, multi-view feature-based systems comprising multiple micro/local and macro/global mechanisms should be constructed. Also, collecting and publicly releasing more balanced, unbiased, and demographically diverse databases will help remedy the fairness issue. Undoubtedly, substantial strides are necessary for continued advancement in creating balanced datasets and fostering fairness within deepfake frameworks. Besides, social media and video-conferencing companies should introduce a feature for transparency, e.g., full disclosure of face ethnicity alteration (i.e., digital modification of the perceived ethnic characteristics of a person’s face) to all parties on the post or video call (e.g., displaying a red mark).
All in all, evaluating the fairness and credibility of deepfakes necessitates a multifaceted approach combining quantitative and qualitative methodologies. Quantitative approaches such as statistical analysis (e.g., detection accuracy, false negative rate, confidence scores, and types of deepfake) and data analysis (e.g., deepfake distribution across different platforms and contexts, geographical analysis, and temporal evolution) can be applied. Qualitative methods such as user perception studies (e.g., surveys and interviews), content analysis (e.g., categorizing themes, purposes, context, and potential impacts), and expert evaluation (e.g., involving experts from ethics, law, journalism, and technology) could be adopted to understand the societal impact and ethical considerations. Integrated approaches that merge quantitative and qualitative methods ensure comprehensive analyses that address both technical reliability and the broader societal implications of deepfake technologies.
4.7. Mobile Deepfake Generation and Detection
Mobile devices have become natural extensions of ourselves, a trend expected to deepen in the coming years with growing interconnectivity. Deep neural network-based deepfake generation and detection, while highly accurate, is often impractical for mobile platforms or applications due to complex features, extensive parameters, and high computational costs. For instance, models like temporal convolutional networks [
489], denoising diffusion probabilistic models [
289], adversarial autoencoders [
490], and Stack GANs [
491,
492,
493] used in deepfake generation and detection can demand substantial computational power and memory, which are often beyond the capabilities of mobile devices. Light-weight [
494] and compressed deep learning-based [
495] detection systems that are nonetheless efficient and deployable on mobile and wearable devices will be instrumental in mitigating the impact of misinformation, fake news, and deepfakes. There are huge opportunities to devise frameworks for both generating and detecting deepfakes on mobile devices, particularly designed for online meetings and video calls.
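One hedged route toward such mobile-friendly detectors is post-training quantization. The sketch below applies PyTorch's dynamic quantization to a stand-in classifier head so that weights are stored as int8; the two-layer architecture is a placeholder assumption, not a real deepfake model.

```python
# Hedged sketch of post-training dynamic quantization for mobile
# deployment: weights of Linear layers are stored as int8, shrinking
# the model and speeding up CPU inference. The two-layer head below is
# a stand-in assumption, not an actual deepfake detector.
import torch
import torch.nn as nn

detector_head = nn.Sequential(      # placeholder classifier head
    nn.Linear(512, 256), nn.ReLU(),
    nn.Linear(256, 2),
)

quantized_head = torch.quantization.quantize_dynamic(
    detector_head, {nn.Linear}, dtype=torch.qint8
)

# fp32 parameter footprint of the original head, for comparison
fp32_bytes = sum(p.numel() * p.element_size()
                 for p in detector_head.parameters())
print("fp32 bytes:", fp32_bytes)

# The quantized model runs the same forward pass with int8 weights
logits = quantized_head(torch.randn(1, 512))
print(logits.shape)  # torch.Size([1, 2])
```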
4.8. Hardware-Based Deepfake Detectors
Every day, millions of hours’ worth of video are exchanged or uploaded online; therefore, to thwart fake news, mis/disinformation, and deepfakes, researchers and professionals can formulate hardware technologies that detect such content proactively, for example, by developing dedicated media-processing chips/systems (e.g., ASICs (application-specific integrated circuits) [
496] and TPMs (Trusted Platform Modules) [
497]) that directly scrutinize content for authenticity. Hardware accelerators exhibiting parallelism, soft-core emulation, energy efficiency, rapid inference, and fileless processing capabilities are poised to drive real-time, higher-quality deepfake detection in unconstrained situations. Any such innovation could revolutionize the field, spawning a plethora of new applications.
4.9. Part-Centric Strategies for Deepfake Detection
This subsection delves into part-centric approaches, with a specific focus on
Section 4.9.1 and
Section 4.9.2, which specifically address the detection of part-based face and part-based audio deepfakes, respectively.
4.9.1. Part-Based Deepfake or Face Manipulation Detection
Most prior deepfake and face manipulation detection systems employ the whole face image/frame. Nonetheless, certain face parts are often cluttered or redundant, which typically leads to inferior performance. To overcome such issues, face part-based techniques can be designed, for example, selecting particular facial components (e.g., the right eye or the mouth) for deepfake detection [
498]. Part-based localization of facial attributes and feature extraction can be performed using auxiliary part localization (e.g., existing facial part detectors) or end-to-end localization (e.g., a unified framework for facial part localization and classification) algorithms [
453].
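A minimal sketch of the part-based idea follows: a facial component is cropped using a bounding box assumed to come from an existing facial part detector and is then classified on its own. The classifier, the box coordinates, and all sizes are hypothetical.

```python
# Minimal sketch of part-based detection: classify a single facial
# component (here the mouth) instead of the whole face. Landmark-based
# cropping is assumed to be supplied by an existing part detector.
import torch
import torch.nn as nn

def crop_part(frame, box):
    # frame: (3, H, W); box: (top, left, height, width) from a detector
    t, l, h, w = box
    return frame[:, t:t + h, l:l + w]

part_classifier = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 2),
)

frame = torch.randn(3, 224, 224)
mouth = crop_part(frame, (150, 70, 48, 84))      # hypothetical mouth box
logits = part_classifier(mouth.unsqueeze(0))     # real vs. fake mouth
```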
4.9.2. Partial Audio Deepfakes
Publicly available data samples within the domain of deepfake technology are predominantly concentrated on scenarios where the spoken words are entirely real or entirely deepfaked. Real-world scenarios in which only a small part, such as a short section or a single word, is manipulated (i.e., partial deepfakes [
40]) are hard to detect. It is worth noting that partial deepfakes might not trigger alarms in automatic speaker verification systems, yet they can still easily alter the meaning of a phrase. Studying the continuous estimation of labels at the segment level (i.e., which portions are real and which are fake) will help in comprehending the rationale behind a classifier’s specific decision. The demand for extensive partial audio deepfake datasets in various languages emphasizes the necessity of segment-level labels and metadata for better analysis and detection efforts.
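A hedged sketch of such segment-level labeling is shown below: a convolutional front end preserves the time axis, and a per-frame head emits one real/fake logit per window, from which manipulated spans can be read off. Shapes and layer choices are illustrative assumptions.

```python
# Hedged sketch of segment-level labeling for partial audio deepfakes:
# a convolutional front end keeps time resolution, and a per-frame head
# marks which windows look manipulated. Shapes are assumptions.
import torch
import torch.nn as nn

class SegmentLevelDetector(nn.Module):
    def __init__(self, n_mels=64):
        super().__init__()
        self.frontend = nn.Sequential(
            nn.Conv1d(n_mels, 64, 5, padding=2), nn.ReLU(),
            nn.Conv1d(64, 64, 5, padding=2), nn.ReLU(),
        )
        self.frame_head = nn.Conv1d(64, 1, 1)  # per-frame fake logit

    def forward(self, mel):                    # mel: (batch, mels, frames)
        h = self.frontend(mel)
        return self.frame_head(h).squeeze(1)   # (batch, frames)

model = SegmentLevelDetector()
frame_logits = model(torch.randn(2, 64, 300))
fake_mask = torch.sigmoid(frame_logits) > 0.5  # which frames look fake
```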
4.10. Multi-Person Deepfake Databases and Algorithms
The majority of publicly available deepfake datasets are primarily composed of scenarios where a single person/face/audio is manipulated or deepfaked within a given sample, or the sample itself contains only one face/person/audio. Similarly, a systematic analysis shows that the majority of existing methods concentrate on single-person deepfake scenarios and fail in multi-person or multi-face deepfaked situations. However, multi-person deepfakes (i.e., where deepfakes of multiple individuals coexist in a sample) mirror real-world scenarios more accurately. Therefore, wide-reaching deepfake databases featuring multiple people (i.e., multiple faces and/or multiple speakers) in unconstrained settings with known and unknown numbers of manipulations should be created. To this aim, the OpenForensics [
126] and DF-Platter [
161] datasets were recently created. No publicly available audio deepfake datasets currently feature multiple deepfaked speakers in the same sample with varying real-world conditions, to the best of our knowledge. Furthermore, formulating sophisticated algorithms for creating and detecting multi-person audio and video deepfakes will significantly propel the evolution of deepfake technology.
4.11. Occlusion-, Pose-, and Obscured-Aware Deepfake Technology
Current deepfake generators and detectors struggle to effectively manage deepfakes with occlusion, pose, obscurity, and mask variations. Namely, deepfake detectors may have limitations in identifying manipulated audios/faces when they are obscured by facial masks, heavy makeup, vocal peripherals, vocal cavity, phonology, background acoustics, and interference, or when only a partial portion of the speech/face is manipulated. Future research in this area could focus on directions such as occlusion and/or pose detection and inpainting, contextual understanding, 3D modeling and reconstruction, differentiable rendering, multi-view learning, pose normalization, partial convolution, feature disentanglement, adaptive feature extraction, and hierarchical feature representations, which can focus on available information while compensating for missing parts.
Similarly, deepfake generators often lack consistency and quality in their outputs. In addressing such challenges, recently, a few frameworks (e.g., [
223,
499]) have been proposed. Further emphasis should be placed on the advancement of deepfake generators that possess occlusion-, pose-, and obscurity-awareness; high resolution, consistency in identity-related and -unrelated details; real-time functionality; and high fidelity. Additionally, generators should be lightweight and adept at processing/generating diverse qualities for arbitrary source and target pairs. Also, the availability of public databases containing partial, occluded, and obscured deepfakes, along with manipulated speech/face samples, is currently limited.
4.12. Instantaneous and Interactive Detection Methods
This subsection explores the field of live deepfake detection, with
Section 4.12.1 discussing instantaneous detection in general and
Section 4.12.2 focusing on participatory deepfake detection.
4.12.1. Real-Time Deepfake Detection
Certain challenges persist within existing deepfake detectors that require careful attention and resolution; e.g., many algorithms lack real-time applicability, since highly precise detection algorithms require longer inference times. Social media and online services (e.g., voice- or face-based authentication in online banking) will continue to expand, as will the threat of deepfakes. Irreversible impact may materialize before the content is detected or realized to be manipulated or fake. Consequently, the call is for real-time, highly accurate deepfake detectors in real-world applications. Special emphasis should be given to future techniques that operate in real time without being resource-intensive. Novel real-time deepfake detection schemes should exhibit reliability, robustness, swift processing, and effectiveness across various platforms and situations. With this objective in mind, researchers may delve into compressed DNN (deep neural network) modeling, lightweight dynamic ensemble learning, and contextual learning.
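As a simple, hedged illustration of the real-time requirement, the sketch below checks each inference against a streaming latency budget (here assumed to be one frame at 30 fps); the detector is a trivial stand-in, and the budget is an assumption.

```python
# Illustrative sketch of real-time constraints: a detector must keep its
# per-window latency under the streaming budget (e.g., 33 ms per frame
# at 30 fps). The model and budget here are stand-in assumptions.
import time
import torch
import torch.nn as nn

budget_s = 1.0 / 30                      # one video frame at 30 fps
detector = nn.Sequential(nn.Flatten(), nn.Linear(3 * 112 * 112, 2))

def stream_detect(frames):
    for frame in frames:                 # frame: (3, 112, 112)
        t0 = time.perf_counter()
        with torch.no_grad():
            logit = detector(frame.unsqueeze(0))
        elapsed = time.perf_counter() - t0
        ok = elapsed <= budget_s         # real-time check for this window
        yield logit, elapsed, ok

for logit, elapsed, ok in stream_detect(torch.randn(5, 3, 112, 112)):
    print(f"{elapsed * 1e3:.1f} ms, within budget: {ok}")
```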
4.12.2. Challenge–Response-Based Deepfake Detection
Challenge–response deepfake detection can be employed in online audio calls, video calls, or meetings, where users are required to correctly execute a sequence of random instructions (specific tasks/actions) in order to be verified as genuine content/users. This could also prevent bots from performing automated deepfakes in live online calls/meetings. However, this method may be user-unfriendly, with high computational costs and low acceptability. Innovative, effortless-response, and user-friendly techniques need to be devised; e.g., the system can pose challenges triggered by factors such as a lack of change in voice tone [
500], movement, or facial expressions. A deepfake attacker can easily sidestep a single challenge; therefore, a system can implement a series of challenges, as circumventing an entire series is notably harder.
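A toy sketch of such a session is given below: the verifier issues a random series of challenges and accepts only if every response is executed correctly and in order. The challenge list is illustrative, and check_response() is a hypothetical hook into pose/voice analysis, stubbed out here.

```python
# Toy sketch of a challenge-response liveness check: the verifier issues
# a random sequence of prompts and accepts only if every response is
# executed correctly and in order. check_response() is a hypothetical
# hook into pose/voice analysis and is stubbed out for illustration.
import random

CHALLENGES = ["turn head left", "blink twice", "raise voice pitch",
              "say the digits 4 7 1", "smile"]

def check_response(challenge, media_segment) -> bool:
    # Placeholder: in practice, a pose/speech analyzer validates the
    # captured segment against the requested action.
    return True

def challenge_response_session(capture_segment, n_rounds=3) -> bool:
    # A series of challenges is harder to sidestep than a single one.
    for challenge in random.sample(CHALLENGES, n_rounds):
        segment = capture_segment(challenge)   # record the user's response
        if not check_response(challenge, segment):
            return False                       # one failure rejects the call
    return True

genuine = challenge_response_session(lambda c: b"...recorded bytes...")
```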
4.13. Human-in-the-Loop in Deepfake
Deepfake and audio/facial attribute manipulations are continuously attaining a level of sophistication that surpasses ML-based detectors’ capacity [
501,
502]. The battle between audio/visual forgery and multimedia forensics continues without resolution. Thus, rather than fully automated AI/ML deepfake systems without human involvement, including humans in both the training and testing stages of devising deepfake generation/detection algorithms would produce better results because of a continuous feedback loop. We can term this the human-in-the-loop deepfake approach, which can leverage the effectiveness of intelligent ML automation while remaining responsive to human feedback, knowledge, and experience via Human–Computer Interaction (HCI). To the best of our knowledge, there is currently no human-in-the-loop deepfake detection method. The question of how to embody meaningful and useful human knowledge, attributes, and interaction into the system remains open. Even feedback from people who cannot recognize deepfakes will be valuable in a human-in-the-loop approach to improving deepfake systems: such feedback may highlight particular weaknesses in current algorithms, provide non-expert insights that guide developers in improving accuracy and robustness, enhance the diversity of training data, improve user interface design, and surface real-world detection challenges and biases across different user groups and contexts.
4.14. Common Sense Audio/Face Manipulations and Deepfakes
The predominant signs of phoniness reside in visual texture and in the timbre of audio. Humans can very easily identify common-sense audio/face manipulations and deepfakes, e.g., a human face with horns, one eye on the chin area, a horizontal nose on the forehead, ears at the position of the mouth, a malformed face, or artificial intonation. However, deepfake detection techniques cannot identify them, as they are devoid of basic common sense. There is a scarcity of public databases with common-sense audio and visual deepfakes and face manipulations.
4.15. Large-Scale AI-Generated Datasets
Examining published literature reveals that research on detecting AI-synthesized deepfakes often utilizes custom databases derived from various generative [
503,
504] and diffusion [
505,
506] deep models. Since there is no consensus on the choice of parameters and datasets, the bulk of published studies exhibit diverse performances on GAN and diffusion samples, as the quality of AI-generated samples varies and remains largely unknown. To foster preeminent advancements, the community ought to create several all-inclusive public AI-generated audio and video deepfake datasets, housing samples of different manipulations, qualities, ages, ethnicities, languages, tones, education levels, and backgrounds under both constrained and unconstrained scenarios.
4.16. Deepfakes in the Metaverse
The metaverse [
507] will soon be integral to our lives, reshaping how we learn, live, work, and entertain while enabling imaginative experiences beyond the constraints of time and space. Harnessing deepfakes in positive manners in the metaverse can result in extended reality applications that are both hyper-realistic and profoundly immersive. But, in the metaverse, malicious actors might employ deepfakes to deceive and manipulate others, potentially resulting in diverse fraudulent activities [
508]. Current deepfake detection methods, originally designed for 2D deepfakes, are inadequate in the metaverse, as metaverse deepfakes are primarily based on 3D rendering and modeling. To counter XR, AR, VR, and MR (i.e., metaverse) deepfakes effectively, it will be necessary to develop intricate, multi-layered systems of robust and preventive safeguards. Future techniques for generating and detecting metaverse deepfakes demand lower complexity, real-time capability, and high resolution.
4.17. Training and Testing Policies
Collaborative efforts between research and professional communities are essential to formulate comprehensive training and testing policies for deepfake technology. It is advisable to incorporate both a fixed policy and a more flexible one. The strict or fixed policy can be utilized for meaningful comparisons of different deepfake frameworks as well as a basis for benchmarking evaluations. For instance, the fixed policy could include standardized datasets (e.g., DFDC); uniform and well-adopted evaluation metrics (e.g., accuracy and area under the receiver operating characteristic curve); and benchmarking protocols covering the environment setup, hardware specifications, and procedures (e.g., GPU model and software environment). The flexible or adaptable policy can be readily applied to future simple, complex, compressed, or larger frameworks. The flexible policy may be designed to handle, for example, varied frameworks (e.g., personal choice-based DL methods among GANs, RNNs, and CNNs), new or emerging datasets, and innovative metrics (e.g., evaluation metrics that can evaluate both robustness and detection/generation). The training and testing policies, whether fixed or flexible, must explicitly outline procedures and considerations for generating and detecting individual audio, video, and text deepfakes, as well as joint audio–visual and text-to-audio/multimedia deepfakes.
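As a hedged illustration, a fixed policy could even be encoded in machine-readable form so that benchmark submissions can be validated automatically; the field names and values below are illustrative assumptions, not an established standard.

```python
# Hedged sketch of a machine-readable 'fixed policy' for benchmarking:
# pinning datasets, metrics, and environment details so frameworks can
# be compared meaningfully. Field names and values are illustrative.
FIXED_POLICY = {
    "datasets": ["DFDC"],                      # standardized benchmark data
    "splits": {"train": 0.8, "val": 0.1, "test": 0.1},
    "metrics": ["accuracy", "auc_roc"],        # uniform, well-adopted metrics
    "environment": {
        "gpu": "reported by each submission",  # e.g., GPU model
        "software": "framework + version pinned by each submission",
        "seed": 42,                            # fixed for reproducibility
    },
}

def validate_report(report: dict) -> bool:
    # A submission is comparable only if it covers every required metric
    # on every required dataset under the fixed policy.
    return all(ds in report and
               all(m in report[ds] for m in FIXED_POLICY["metrics"])
               for ds in FIXED_POLICY["datasets"])

print(validate_report({"DFDC": {"accuracy": 0.91, "auc_roc": 0.95}}))  # True
```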
4.18. Ethical and Standardization Dimensions in Deepfake Technology
This subsection explores two critical dimensions of deepfake technology, i.e., the efforts to standardize its generation and detection (
Section 4.18.1) and the ethical implications thereof (
Section 4.18.2).
4.18.1. Standardization of Deepfake (and Audio/Face Manipulation) Generation and Detection
Deepfake standards serve as the overarching principles for generating, collecting, storing, sharing, and evaluating deepfake or digitally manipulated audio/face data. There are no international standards yet for deepfakes, although some attempts are underway by the International Organization for Standardization (ISO) [
509], European Telecommunications Standards Institute (ETSI) [
510], and Coalition for Content Provenance and Authenticity (C2PA) [
483], to name a few. It is crucial to set accurate standards to fully maximize the potential of deepfake technologies. As high-level decision-makers and government agencies assess the trade-offs between the deepfake risks and the convenience of digital rights, they must recognize the necessity of implementing proper deepfake standards for privacy, vulnerability, development, sale, and certification.
4.18.2. Ethical Implications of Deepfakes
The ethical concerns surrounding deepfakes are continuously growing [
511], as they can be utilized for blackmail, sabotage, intimidation, incitement to violence, ideological manipulation, and challenges to trust and accountability. To foresee the ethical implications, it is vital to grasp the technology’s current state, limitations, and potential opportunities. While technological aspects have garnered significant attention, the ethical dimensions of deepfake technology have received comparatively less. Proactively developing and adopting ethical standards will help mitigate potential risks. To achieve this objective, a thorough study on the ethical implications of deepfakes across diverse applications and domains should be undertaken to inform ethical developments, for instance, investigating questions like ‘what kinds and levels of harm can arise?’ and ‘what steps can be taken to minimize these harms?’. Ethicists, information technologists, lawmakers, communication scholars, political scientists, digital consumers, business networks, and publishers will have to join forces to develop “Ethical Deepfake” guidelines akin to “Ethical AI” guidelines. For example, ethical deepfakes in business and marketing should be transparent (i.e., disclosing the content source), non-deceptive (i.e., clarifying that the content is not real), fair (i.e., respecting the rights of third parties), and accountable (i.e., providing consumers the option to opt out of deepfake content if desired).
4.19. Regulatory Landscape for Deepfake Technologies
Here,
Section 4.19.1 concentrates on the complexities of protecting personal privacy amidst the proliferation of deepfakes, as well as examining the various legal liabilities arising from their misuse, while
Section 4.19.2 studies legislative efforts worldwide aimed at mitigating risks and ensuring responsible use of this technology.
4.19.1. Deepfakes Privacy and Liability
Privacy: The surge of deepfakes raises grave concerns regarding the compromise of personal privacy, data privacy, and identity security [
512]. Consequently, different techniques have been suggested to enhance individuals’ privacy and security in audio/facial samples or deepfakes, such as pixelation [
513], blurring [
514], masking [
515], background noise or sprechstimme addition [
473], cartoons [
516], and avatars [
517].
On the contrary, deepfakes themselves have lately been regarded as a means to bolster privacy and security, e.g., by swapping the original face for a machine-generated face/avatar or by manipulating its style and attributes. Additionally, to safeguard the privacy of audio/face samples used in training ‘entire audio/face synthesis’ models, approaches like watermarking [
518], obstructing the training process [
519], or removing identity-related features first before use [
520] have been introduced. Despite their utility, the mentioned approaches frequently show suboptimal generalization to unfamiliar models and introduce unwanted noise to the original sample. Wide-ranging studies are necessary to evaluate the efficacy of deepfakes in concealing individuals’ identities and to quantify the extent to which privacy-enhancing deepfakes impact the performance of speech/face recognition systems.
Text-guided audio/face sample editing strategies [
521] should be developed to empower users to carry out deepfake privacy procedures according to their intentions and level of desired privacy or security. Similarly, social networks and deepfake apps should provide users with the option to appear only in manipulated audio/visual samples/posts they have explicitly approved as well as to choose whether their face/audio needs to be deepfaked.
Liability: Deepfakes pose several significant liabilities such as intellectual property infringement (e.g., using deepfake technology owned by others without proper authorization), invasion of privacy (e.g., unethical exploitation of people’s images to create deepfakes or the illicit sale of such manipulated content), defamation liability (e.g., utilizing deepfakes to disseminate false information, potentially tarnishing someone’s reputation), data breach liability (e.g., a compromise of data protection and privacy through unauthorized disclosure, substitution, modification, or use of sensitive identity data for deepfakes), and unfair competition liability (e.g., promotional marketing materials featuring a deepfaked spokesperson). Addressing these concerns demands a comprehensive framework that navigates the intricate legal and ethical landscapes surrounding the deepfake technology.
4.19.2. Deepfake Legislation
The current regulatory system is inadequate for addressing the potential harms associated with deepfakes. Legislators are contemplating new legislation to grapple with deepfake technologies, which could significantly impact individuals, nations, companies, and society as a whole. For example, the states of Virginia and New York have, respectively, criminalized nonconsensual deepfake pornography and the non-disclosure of deepfake audio, visual, or moving-picture content [
522]. Policymakers across the globe should devote more dedicated effort to regulation and the creation of criminal statutes, as numerous regulatory proposals are still in their initial phases. New and effective legislation and regulations should foresee, accommodate, and mitigate current and future deepfake harms without impeding technological development or stifling expression. This will require nuanced collaboration between civil society organizations, technology companies, and governments to formulate adaptable regulatory schemes capable of addressing the evolving nature of deepfake threats.
4.20. True-to-Life Audio Deepfakes
The field of audio deepfake generation and detection is gaining momentum [
461,
523]; however, the existing body of literature on visual/image deepfakes far surpasses that of audio deepfakes. Audio deepfake detection is still problematic, particularly in discerning synthetic voice tracks created through advanced techniques and samples derived from open-set scenarios. Moreover, many detectors use synthetic speech recognition features, which compromise accuracy and leave them susceptible to adversarial examples and realistic counterfeit audio [
524]. More advanced audio deepfake generators should be formulated that can produce more authentic audio tracks, incorporating diverse background noises and contextual variations. Research communities should create extensive public audio deepfake databases that cover diverse languages, background noises, codings, and a variety of vocal characteristics including tenor, bass, mezzo-soprano, baritone, contralto, countertenor, falsetto, whisper, narrative voice, families of synthetic speeches, and different age groups. These databases should also include recordings made using PA systems and mobile devices to ensure a comprehensive representation for effective research in audio deepfake technologies. Additionally, large-scale deepfake samples, subject to simultaneous language conversion and manipulation, should be released publicly with accurate labels.
4.21. Effects of Non-Speech Intervals in Deepfakes
Non-speech intervals naturally occur in spoken language and can also assist in distinguishing between authentic and manipulated speech. Attackers can adroitly eliminate non-speech segments to circumvent audio deepfake detectors. Exhaustive investigations are required to gain insights into the generation and alteration of non-speech segments, as not all non-speech frames/types (e.g., breathing sounds) in different languages have been explored. In a similar vein, it is crucial to conduct research on the impact of varying durations of non-speech intervals in deepfakes across different languages, investigating their correlations within languages and across various languages and modalities.
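As a starting point for such investigations, the minimal sketch below locates non-speech intervals with a simple energy threshold, after which those spans could be inspected for manipulation traces; the window length and threshold are illustrative assumptions, and production systems would use a proper voice activity detector.

```python
# Minimal energy-based sketch for locating non-speech intervals, which
# could then be inspected for traces of manipulation. Threshold and
# window sizes are illustrative assumptions.
import numpy as np

def non_speech_intervals(wav, sr=16000, win_ms=30, thresh_db=-35.0):
    win = int(sr * win_ms / 1000)
    n = len(wav) // win
    frames = wav[: n * win].reshape(n, win)
    # Per-window log energy relative to the loudest window
    energy = 10 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    energy -= energy.max()
    silent = energy < thresh_db
    # Collapse consecutive silent windows into (start_s, end_s) intervals
    intervals, start = [], None
    for i, s in enumerate(list(silent) + [False]):
        if s and start is None:
            start = i
        elif not s and start is not None:
            intervals.append((start * win / sr, i * win / sr))
            start = None
    return intervals

wav = np.concatenate([np.zeros(8000), np.random.randn(16000) * 0.1,
                      np.zeros(4000)])
print(non_speech_intervals(wav))  # leading and trailing silent spans
```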
4.22. Future Deepfake Datasets
This section discusses cutting-edge deepfake datasets across three key dimensions: diverse main categories of audio deepfakes (
Section 4.22.1); varied audio datasets with different attributes, e.g., styles, single and multi-person (
Section 4.22.2); and a comprehensive collection of audiovisual deepfake media spanning a wide range of subjects, scenarios, languages, and contexts (
Section 4.22.3), akin to the breadth of content found in an encyclopedia. Each part explores advancements in technology and methods that will catalyze the advancement of the deepfake field.
4.22.1. Diverse Audio Deepfakes
Publicly available audio deepfake samples or comparative datasets were mainly created using the TTS (text-to-speech), SS (speech synthesis), and VC (voice conversion) methods [
47]. Studies on recent TTS, SS, and VC methods, which need minimal training data to generate cutting-edge deepfake samples, should be conducted. A fresh effort should be initiated to collect and generate manipulated audio tracks using software-based real-time voice changers, voice cloners, vocal synthesizers, language deepfake translations, digital signal processing-based techniques, next-generation neural networks, and hardware-based voice changers. Furthermore, examining the vulnerabilities of deepfake detectors and speech recognizers against the abovementioned manipulation techniques under different types of scenarios (e.g., background noises, vocal cavity, vocal peripherals, fixed microphones, or access control) will aid next-generation developments.
4.22.2. Heterogeneous Audio Datasets
Current deepfake audio datasets predominantly feature samples from one to three languages and include only one or two types of deepfake instances [
12]. As a result, detectors struggle to generalize and tend to overfit [
332,
473]. Developing heterogeneous genuine-audio and deepfake datasets with multiple languages, manipulation techniques, speaking styles, scenes, partially and fully manipulated tracks, single tracks containing multiple types of audio deepfakes, and other relevant factors will advance audio deepfake technologies and applications, including the robustness and versatility of models.
4.22.3. State-of-the-Art Encyclopedic Deepfake Datasets
While numerous deepfake datasets (e.g., [
121,
330]) are publicly accessible for research and development, they come with several limitations. For instance, many datasets lack coverage of the latest deepfake audios and/or videos crafted through state-of-the-art (SOTA) techniques available on various platforms. The datasets also lack diversity in quality, age, and ethnicity; their deepfakes show noticeable disparities from those circulated online; and their class distributions are imbalanced. Hence, deepfake detectors tested on or developed using these datasets encounter overfitting and generalization issues. Forthcoming encyclopedic databases for audio and video deepfakes should be meticulously crafted to include SOTA-generated deepfakes; multi-language, multi-quality, and demographically diverse data; realistic context and scene representations; various audio/facial attribute manipulations; samples collected in the wild; full and partial manipulations; multimodal forgeries; distinct resolutions; superior-quality synthetic and morphed samples; adversarially perturbed and postprocessed samples (i.e., samples on which postprocessing operations remove artifacts so that they look/sound more lifelike); and rich metadata. Such extensive and diverse datasets contribute to the development of sophisticated countermeasures against the evolving landscape of real-world deepfakes. Additionally, although ‘not safe for work’ deepfake audios and/or videos are accessible online, public datasets containing pornographic deepfakes in the wild are not widely available.
4.23. Imitation-Based Audio Deepfake Detection Solutions
The prevailing detectors largely address synthetic-based deepfakes, which leaves imitation-based audio deepfake detection understudied. Identifying audio deepfake content generated via the imitation-based technique is challenging owing to the strong similarity between original and fake audio samples [
525]. This conspicuous gap underscores the urgency of redirecting analytical efforts towards the refinement of imitation-based deepfake detectors, including identifying subtle differences in imitated speech features as well as developing algorithms that can detect both synthetic- and imitation-based deepfakes.
4.24. Proactive Deepfake Defenses
In deepfake passive defenses, detectors undergo retrospective training to discern manipulated samples. This methodology functions as post-event forensics but proves inadequate in preemptively thwarting the dissemination of disinformation. Therefore, initiatives for deepfake proactive defenses have been put forward. This methodology operates as a preemptive forensic analysis, aiming to obstruct the generation of deepfakes and promptly identify them before they swiftly spread across diverse platforms. Proactive approaches bear some resemblance to anti-counterfeit measures. Proactive methods enhance forensic tools in order to fortify overall deepfake defense. Some of the proactive deepfake countermeasures are watermarking [
518] (i.e., tracing the copyright of audiovisual samples by examining embedded visible or hidden watermarks or embedding anti-deepfake labels in speech/facial features to detect tampering by indicating the presence or absence of the labels), adversarial examples [
526] (i.e., injecting adversarial perturbations into real data creates distorted deepfake outputs, which may be easily noticeable to human observers), and radioactive training data [
527] (i.e., infusing data with subtle alterations ensures that any model trained on it will produce deepfakes with a unique and identifiable ‘fingerprint’). According to recent studies, proactive defense techniques demonstrate specific limitations. For instance, they can be comfortably sidestepped by applying basic audio/image transformations to the perturbed samples, reconstructing the original sample from the deformed deepfake output, or eradicating the watermarks or fingerprints from the samples. Furthermore, many current proactive methods incur high computational costs. Subsequent endeavors should focus on developing holistic deepfake proactive solutions by incorporating techniques such as watermarking, adversarial examples, and radioactive training data in a unified approach. Future proactive defenses should be designed to generalize across unseen deepfakes and withstand tampering and postprocessing, which can be achieved using ensemble watermarking (e.g., the integration of invisible and visible watermarks, or of robust and fragile watermarks) and blockchain methodologies. There is a shortage of deepfake proactive defense mechanisms for audio and multimodal samples.
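For intuition, the toy sketch below embeds an anti-deepfake label in the least significant bits of an image and shows how tampering destroys it; real proactive defenses would use robust, imperceptible watermarks, so LSB coding here is purely an illustrative assumption.

```python
# Toy sketch of watermark-based proactive defense: an anti-deepfake
# label is embedded in the least significant bits of an image, and
# tampering that rewrites pixels destroys the label. Real systems would
# use robust watermarks; LSB is used here only for illustration.
import numpy as np

def embed(img, bits):
    flat = img.flatten().copy()
    flat[: len(bits)] = (flat[: len(bits)] & 0xFE) | bits  # set LSBs
    return flat.reshape(img.shape)

def extract(img, n):
    return img.flatten()[:n] & 1

label = np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=np.uint8)  # anti-deepfake tag
img = np.random.randint(0, 256, (64, 64), dtype=np.uint8)

marked = embed(img, label)
assert np.array_equal(extract(marked, 8), label)      # intact: label present

tampered = marked.copy()
tampered[:1, :8] = 0                                  # a manipulation step
print(np.array_equal(extract(tampered, 8), label))    # False: label broken
```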
4.25. Explainable Deepfake Detectors
The reliability of a deepfake detector stands on the pillars of transferability, robustness, and interpretability. Interpretability (also referred to as explainability [
528,
529,
530,
531,
532]) involves the capability to understand, interpret, and offer transparent insights into the processes and decisions made by a deepfake detector, thereby fostering trust among users, researchers, and policymakers in comprehending the intricacies of the detection process. A plethora of solutions have emerged to confront challenges related to transferability and robustness, but the aspect of interpretability has been relatively underexplored. Due to the lack of explainability in deepfake detectors, their outcomes struggle to instill trust among the public and face challenges in gaining acceptance in critical scenarios, such as legal proceedings in a court. Deepfake detectors based on deep learning lack the ability to offer human-understandable justifications for their outputs, primarily because of their black-box nature. Present speech/face manipulation or deepfake detection frameworks solely offer a categorization label (e.g., pristine, deepfake, or digitally manipulated), a probability score indicating the likelihood of fakeness, or a confidence percentage for audio or video content being fake, without accompanying explanations and reasons for the detection results. Additionally, deepfake detectors are unable to discern whether the speech/face manipulation or deepfake was carried out with benign or malicious motives, and which tool was utilized to generate it.
A few explainable deepfake detection schemes have been proposed, often employing attention mechanisms like heatmaps to highlight manipulated regions in audio or video. The dependability and transparency issues in deepfake detection are yet to be fully resolved, requiring answers to questions such as the following: why is the audio or video classified as deepfake?; on which portion and by what criteria does the detection algorithm identify the audio or video as fake?; how can logical reasoning be efficiently integrated to improve the interpretability of multimodal deepfakes?; how and which aspects of psychological knowledge can enhance the generalization capability of transparent deepfake detectors?; and in what ways can future explainable methods be crafted to democratize deepfake detection for individuals not well-versed in AI?
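A common route to the heatmaps mentioned above is a Grad-CAM-style computation, sketched below in hedged form: the gradient of the 'fake' logit weights the feature maps to localize influential regions. The backbone and head are stand-ins; any convolutional detector could be substituted.

```python
# Hedged Grad-CAM-style sketch: a coarse heatmap of the regions that
# most influenced the 'fake' logit, one common route to visual
# explanations. The backbone here is a stand-in assumption.
import torch
import torch.nn as nn
import torch.nn.functional as F

backbone = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
)
head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 2))

img = torch.randn(1, 3, 112, 112)
feats = backbone(img)                    # (1, 32, 112, 112)
feats.retain_grad()
logits = head(feats)
logits[0, 1].backward()                  # gradient of the 'fake' logit

weights = feats.grad.mean(dim=(2, 3), keepdim=True)   # channel importance
cam = F.relu((weights * feats).sum(dim=1)).detach()   # (1, 112, 112)
cam = cam / (cam.max() + 1e-8)           # normalized heatmap in [0, 1]
print(cam.shape, float(cam.max()))
```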
Future efforts should prioritize transparent end-to-end frameworks that incorporate multiple cues (e.g., physiological and physical indicators), since a singular cue alone cannot encompass the diverse range of artifacts and attributes. Additionally, emphasis should be placed on exploring multimodal explainable approaches and developing universally acceptable explainability indicators or measures. To cultivate proficient interpretable deepfake detection systems, leveraging dynamic ensembles of techniques—such as knowledge distillation, multimodal learning, layer-wise relevance propagation, neural additive models, and fuzzy inference—can be advantageous. Deepfake interpretability can also be achieved through interdisciplinary research, involving a team of experts in social networks capable of analyzing posts, re-posts, and comments, along with fact-checkers, data mining specialists, and computer vision and AI scientists.
4.26. Multitask Learning-Based Deepfake Technology
The paradigm of multitask learning in machine learning involves training a framework to tackle several tasks simultaneously. Multitask learning has been empirically shown to yield superior performance when compared to single-task learning [
533]. Future research emphasis should be on developing unified deepfake generation and detection frameworks rooted in multitask learning. These frameworks should jointly handle various factors, including different resolutions, quality variations, occlusions, demographic classes, poses, and expressions, to drive progress in the field. Towards this objective, schemes can meticulously design loss functions, including but not limited to face prior loss, perceptual loss, adversarial loss, style loss, smooth loss, pixel loss, temporal loss, and inter-modality loss. Extracting features from the forgery location contributes substantially to the success of deepfake detection [
534]. Therefore, creating deepfake detection methods capable of simultaneously performing deepfake detection and forgery localization could enhance generalization capability and bolster resilience against adversarial attacks and postprocessing tricks, such as resizing or pitch shifting.
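A minimal sketch of such a joint objective is given below, combining a sample-level classification loss with a per-pixel forgery-localization loss; the loss weights and tensor shapes are illustrative assumptions to be tuned per task.

```python
# Illustrative multitask objective: joint deepfake classification and
# forgery localization, using a weighted sum of a classification loss
# and a per-pixel localization loss. Weights are assumptions to tune.
import torch
import torch.nn.functional as F

def multitask_loss(cls_logits, cls_labels, loc_logits, loc_masks,
                   w_cls=1.0, w_loc=0.5):
    # cls_logits: (batch, 2); loc_logits/loc_masks: (batch, 1, H, W),
    # where the mask marks manipulated pixels (the forgery location).
    l_cls = F.cross_entropy(cls_logits, cls_labels)
    l_loc = F.binary_cross_entropy_with_logits(loc_logits, loc_masks)
    return w_cls * l_cls + w_loc * l_loc

loss = multitask_loss(torch.randn(4, 2), torch.randint(0, 2, (4,)),
                      torch.randn(4, 1, 56, 56),
                      torch.randint(0, 2, (4, 1, 56, 56)).float())
print(float(loss))
```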
4.27. Computational Efficiency
Traditional deepfake approaches based on computer graphics or digital signal processing are relatively costly and time consuming, with low processing speed and high computational complexity. Comparatively, deep learning (DL)-based deepfake generation and detection frameworks are easier to use and more computationally efficient. For instance, a pre-trained GAN model for deepfake generation can perform quickly and efficiently during inference compared to traditional methods involving facial feature modeling, texture mapping, and rendering pipelines. However, despite their advantages, these DL-based frameworks still cannot be used widely in practical scenarios, as they mainly focus on accuracy improvements. In particular, for good-quality deepfake generation, DL methods require large databases. To reduce the computational complexity during training, techniques such as data distillation, one-shot learning [
535], and few-shot learning [
536] should be fully explored. In both detection and generation, by considering image/video/audio resolution, dataset sizes, deepfake types, and model complexity, efforts should be directed towards devising proficient light-weight schemes with cycle-consistency loss and/or federated learning and edge-cloud systems.
4.28. Battle of Deepfake Detection: Humans versus Machines
Deepfakes and manipulated multimedia content, if widely circulated, can have a detrimental influence on society at large. Human deepfake detection capabilities exhibit significant variation, influenced by diverse factors such as context, audio or visual modalities, quality, realism, coherence, and other variables. Few studies (e.g., [
501,
537,
538]) have delved into the comparative analysis of human proficiency against machines in identifying deepfakes. There is no consensus in the findings, but existing studies indicate the following: (i) ordinary people struggle more than experts and need to pay more attention to spot deepfakes; (ii) people are better at spotting fake speeches when they have both audio and visual elements; (iii) automated ML deepfake detectors are slightly more effective with blurry, noisy, grainy, or very dark deepfakes (but it is well known that automated detectors show limitations when they encounter novel deepfakes not used in their training); (iv) human detectors and ML deepfake detectors make different types of errors; (v) synchronous collective performance (of groups of individuals) can match or surpass the accuracy of individual subjects and ML deepfake detectors; (vi) humans can easily detect deepfakes in political videos; (vii) financial incentives have limited impact on improving human accuracy; (viii) in ML deepfake detectors, combining both sight and sound leads to higher accuracy than that of humans; (ix) computers and people perform similarly in detecting audio deepfakes; (x) native speakers are usually better at catching fakes than non-native speakers; (xi) individuals are more vulnerable to video deepfakes than to audio ones; (xii) most studies are on English- and Mandarin-language deepfakes; (xiii) language does not affect detection accuracy much; (xiv) shorter deepfake clips are easier to identify; (xv) increasing awareness of deepfake existence marginally enhances detection rates; (xvi) the difficulty of detection for human subjects rises when background/context knowledge is eliminated; (xvii) detection accuracy relies on human knowledge and expertise; (xviii) impairing visual face processing impairs human participants but not ML-based detectors; (xix) human individual and group biases (e.g., homophily bias and heterophily bias) play a vital role in detection capabilities; (xx) subjects overly confident in their own detection abilities exhibit reduced accuracy in deepfake detection; (xxi) demographic traits strongly affect deepfake detection (e.g., male, non-Caucasian, and young individuals are more accurate); (xxii) audio deepfakes crafted using text-to-speech methods are more difficult to distinguish than the same deepfakes voiced by voice actors; (xxiii) human subjects place more emphasis on ‘how something is said’ rather than ‘what is said’; and (xxiv) examining consistency and coherence in texts is more advantageous for detecting text deepfakes.
In essence, machines currently outshine humans in detecting deepfakes due to their ability to discern subtle details often overlooked by human observers. Human and machine perceptions differ, but both can be fooled by deepfakes in unique ways. Collaboration between academics and professionals in digital media forensics, social psychology, and perceptual psychology is crucial for a comprehensive understanding. The focus of future endeavors should be on devising novel schemes to fuse human deepfake ratings (e.g., mean opinion scores) with machine predictions, and on comprehensive, large-scale assessments of human versus machine performance in the realm of deepfake technology, considering deepfake types; modality (audio, visual, multimodal, and text); quality; styles; resolutions; color depth; codec; aspect ratio; artifacts; the number of people in deepfake samples; real-world situations; languages; and individual, crowd, and demographic traits. Such future studies will aid in identifying the factors that optimize human deepfake detection performance, whether any reciprocal learning between humans and machines exists, and effective ways to integrate human cognition and perception into deepfake detection. There is a need for systematic qualitative and quantitative research to unravel the encoding of modality-specific unique characteristics in deepfakes. There has not yet been any assessment of human versus machine performance specifically for large-scale partial deepfakes. The findings from such studies could contribute to designing tailored cybersecurity training programs aimed at enhancing the human detection of deepfakes, fake news, and misinformation.
4.29. Navigating Security Threats: Adversarial Attacks and Audio Noises
Here, we discuss the security threats posed to deepfake detectors by adversarial attacks (
Section 4.29.1) and the intricacies of audio noise (
Section 4.29.2).
4.29.1. Vulnerability to Adversarial Attacks and Anti-Forensics
Recent investigations reveal that deep neural network-based deepfake detection techniques, owing to their inherent flaws, lack robustness against adversarial examples [44] and anti-forensics [539]. Adversarial examples are samples intentionally perturbed (e.g., by adding imperceptible noise) to mislead deepfake detectors [540]. Anti-forensics (aka counter-forensics) is the deliberate act of eliminating or obscuring deepfake evidence in an effort to diminish the efficacy of forensic investigation methods [539]. In anti-forensics, an adversary can modify original deepfakes through procedures such as resizing, compression, blurring, rotation, lighting variations, noise effects, motion blur, and quality degradation. This impedes forensic tools from extracting essential clues about manipulations, falsifications, source devices, etc. Adversarial examples and anti-forensics can be strategically deployed to undermine the classification accuracy of automated deepfake detectors.
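To make the threat concrete, the following is a minimal sketch of the fast gradient sign method (FGSM), one of the simplest ways to craft adversarial examples; it is written in PyTorch under the assumption of a generic, differentiable binary detector (the `detector` model and the epsilon value are illustrative, not drawn from the cited works).

```python
# Minimal FGSM sketch in PyTorch, illustrating how an imperceptible
# perturbation can flip a deepfake detector's decision.
# `detector` is a placeholder binary classifier; epsilon is illustrative.
import torch
import torch.nn.functional as F

def fgsm_attack(detector: torch.nn.Module, x: torch.Tensor,
                label: torch.Tensor, epsilon: float = 0.01) -> torch.Tensor:
    """Return an adversarially perturbed copy of image batch x."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(detector(x), label)
    loss.backward()
    # Step in the direction that *increases* the loss, then clip to valid range.
    x_adv = x + epsilon * x.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()
```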
There is currently a lack of substantial work on deepfake detectors that are robust against adversarial examples and anti-forensics. Deepfake detection methods experience a drastic decline in accuracy when confronted with novel adversarial attacks and counter-forensics techniques, and this trend is poised to grow in the near future. More resilient detection schemes should be developed that can combat not only conventional deepfakes but also those enhanced with adversarial examples and anti-forensic measures. In pursuit of this goal, the strategic use of filtering, photoplethysmographic features, multi-stream processing, deep reinforcement learning, and noise or adversarial layers within the detection network holds significant promise. Furthermore, next-generation deepfake detectors need to be robust against social media laundering. Social media laundering, i.e., the metadata removal, downsizing, and heavy compression that videos typically undergo before being uploaded to social platforms in order to conserve network bandwidth and safeguard user privacy, eliminates traces of the underlying manipulation, leading deepfake detectors to misclassify deepfakes as real.
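One pragmatic hardening step suggested by the laundering problem is to train detectors on data that has already been "laundered". The sketch below simulates platform-style downscaling and JPEG recompression using OpenCV; the scale and quality ranges are illustrative assumptions rather than parameters of any specific platform.

```python
# Sketch of 'laundering-style' training augmentation: simulate the resizing
# and re-compression that social platforms apply, so a detector does not
# rely on fragile high-frequency traces. Parameter ranges are illustrative.
import random
import cv2
import numpy as np

def simulate_laundering(img: np.ndarray) -> np.ndarray:
    """img: HxWx3 uint8 BGR frame. Returns a degraded copy of the same size."""
    h, w = img.shape[:2]
    scale = random.uniform(0.5, 0.9)                        # random downscale
    small = cv2.resize(img, (int(w * scale), int(h * scale)))
    quality = random.randint(30, 75)                        # heavy JPEG compression
    ok, buf = cv2.imencode(".jpg", small, [cv2.IMWRITE_JPEG_QUALITY, quality])
    degraded = cv2.imdecode(buf, cv2.IMREAD_COLOR)
    return cv2.resize(degraded, (w, h))                     # back to original size
```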
Red teams can also pioneer the development of next-generation adversarial attacks and anti-forensics. For example, some deepfake detectors rely on physiological signals, such as heart rate extracted from videos or respiration rate extracted from audio, as potent indicators of falsification, because GAN techniques fail to preserve the intricate color variations associated with these signals. To outsmart such detectors, an attacker can craft fake videos/audio that embed a plausible heart rate/respiration signal by introducing sequential color variations across frames/samples. Moreover, current anti-forensics primarily focus on aural presence/appearance artifacts (e.g., the distinct frequency spectrum generated by deepfake generators) but overlook their impact on perception (e.g., deep speech/face features). Future advancements may involve the joint removal of both artifacts and perceptual elements for more effective evasion of detectors: attackers can design advanced audio, visual, and multimodal frameworks that simultaneously eliminate fake traces in both presence/appearance artifacts and perceptual intricacies.
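For intuition on the physiological cue mentioned above, the following sketch extracts the frame-wise mean green-channel intensity over a fixed, hypothetical skin patch, which is the raw trace from which remote-photoplethysmography heart-rate estimates are derived; production systems additionally track the face and band-pass filter the trace around plausible heart rates.

```python
# Sketch of the raw signal behind rPPG-based detectors: mean green-channel
# intensity over a (hypothetical, fixed) skin patch across video frames.
import cv2
import numpy as np

def green_channel_trace(video_path: str, roi=(100, 100, 50, 50)) -> np.ndarray:
    x, y, w, h = roi                        # assumed skin region (x, y, width, height)
    cap = cv2.VideoCapture(video_path)
    trace = []
    while True:
        ok, frame = cap.read()              # frame is BGR
        if not ok:
            break
        patch = frame[y:y + h, x:x + w, 1]  # green channel of the patch
        trace.append(patch.mean())
    cap.release()
    return np.asarray(trace)                # periodicity here ~ heart rate
```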
4.29.2. Effects of Noise on Audio Deepfake Detectors
Research has indicated that natural noises (e.g., rain, traffic, machinery, reverberation, wind, thunder, non-stationary signals, babble noise, white noise, and pink noise) and electrical noises (e.g., electrical interference, channel distortions, and cross-talk) can hamper speech recognition system performance [541,542,543,544]. Noise is typically defined as ‘any arbitrary, unwanted, or irrelevant sound or (electrical) disturbance that interferes with the desired signal or information’. Present audio deepfake detectors struggle with both natural and electrical noises, creating opportunities for attackers to deceive them simply by introducing such noises. Large-scale studies have yet to explore the precise impacts of natural and electrical disturbances on automated audio deepfake detection frameworks. Studies considering samples recorded in both indoor and outdoor settings under various noise conditions would provide valuable insights for future research, e.g., for developing robust audio deepfake detectors applicable in real-world noise contexts.
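As a concrete starting point for such studies, the sketch below mixes a noise recording into clean speech at a controlled signal-to-noise ratio (SNR), the standard degradation used in robustness benchmarks; the random arrays and the 16 kHz example are illustrative stand-ins for real recordings.

```python
# Sketch of mixing noise into clean speech at a target SNR (in dB), the
# basic operation behind robustness studies of audio deepfake detectors.
import numpy as np

def add_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """speech, noise: 1-D float arrays of equal length; returns the noisy mix."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Scale noise so that 10*log10(p_speech / p_noise_scaled) == snr_db.
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Example: white noise at 10 dB SNR over a stand-in 1 s, 16 kHz clip.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)
noisy = add_noise(speech, rng.standard_normal(16000), snr_db=10.0)
```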
4.30. Cybersecurity Attacks Using Deepfakes
Deepfakes can be utilized as a tool in different types of cybersecurity attacks [545], such as malware/ransomware attacks (i.e., deepfake messages as a vector to deliver malware/ransomware by tricking victims into interacting with malicious content disguised as authentic communications), social engineering (i.e., lifelike deepfakes of trusted individuals exploiting the trust of unsuspecting victims to gain unauthorized access to sensitive information or systems), phishing attacks (i.e., realistic audio or video deepfakes deceiving victims into performing unauthorized actions), business email compromise (i.e., deepfakes of high-ranking officials used to solicit illicit data sharing or fund transfers), authentication bypass (i.e., audio or visual deepfakes deceiving voice or face authentication systems, allowing unauthorized access to secure systems or sensitive information), and insider threats (i.e., deepfakes impersonating employees to gain unauthorized access or manipulate others within an organization for malicious purposes such as espionage).
Current cybersecurity defense mechanisms should be revisited in order to effectively counter deepfake threats. Defending against deepfake cybersecurity attacks requires a multi-faceted strategy involving technology, education, and vigilance. Some of the key countermeasures are behavioral analytics (i.e., monitoring user behavior and detecting anomalies that may indicate a deepfake), cutting-edge security solutions (i.e., advanced email filtering and endpoint protection solutions to identify and block phishing attempts using deepfake technology), multi-factor authentication (i.e., adding an extra layer of security for identity verification), robust anti-spoofing (i.e., audio, visual, and audiovisual mechanisms to detect deepfake attempts against user authentication systems), secure communication channels (i.e., encrypted messaging platforms for sensitive and confidential information), deepfake detection tools usable in core cybersecurity networks, watermarking and digital signatures (i.e., incorporating digital watermarks and/or signatures into multimedia content to verify its authenticity; see the sketch after this paragraph), continuous monitoring (i.e., proactively monitoring network traffic and user activities to detect deepfakes in real time), employee training and awareness (i.e., novel ways to educate employees on deepfake awareness, associated risks, and threat recognition), AI-powered threat intelligence platforms (i.e., tools analyzing large datasets to identify emerging deepfake threats and provide actionable insights that enhance cybersecurity defenses), intrusion detection and prevention systems (i.e., network and/or system monitoring systems that detect deepfake activities and automatically respond to potential threats), and blockchain (i.e., blockchain-based timestamping of multimedia content to establish authenticity).
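As a minimal illustration of the signing countermeasure, the sketch below tags a media file with a keyed hash (HMAC-SHA256) and later verifies it; a production deployment would use public-key signatures and proper key management, and the key shown is a hypothetical placeholder.

```python
# Minimal sketch of media-content authentication via a keyed hash (HMAC),
# a stand-in for the full public-key signatures mentioned above.
import hmac
import hashlib

SECRET_KEY = b"replace-with-securely-stored-key"   # hypothetical key

def sign_media(path: str) -> str:
    """Return a hex HMAC-SHA256 tag over the file's bytes."""
    with open(path, "rb") as f:
        return hmac.new(SECRET_KEY, f.read(), hashlib.sha256).hexdigest()

def verify_media(path: str, expected_tag: str) -> bool:
    # Constant-time comparison guards against timing attacks.
    return hmac.compare_digest(sign_media(path), expected_tag)
```

Any post-hoc manipulation of the file, including a single changed byte, invalidates the tag, which is why such fingerprints are attractive for provenance checks.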
4.31. Reproducible Research
The research and professional communities should advocate for reproducible deepfake research [546]. This can be accomplished by enriching large public databases with comprehensive human-assigned deepfakability scores and reasons, offering clear and accessible experimental setups, and providing open-source deepfake and audio/face manipulation codes and tools. Doing so will ensure an accurate assessment of the progress of deepfake technology while averting the overestimation of algorithmic strengths.
4.32. Forensics Specialists with Limited Experience in Courtrooms
The use of audio, visual, and multimedia samples as legal evidence relies on digital media forensics experts. Recent advancements in digital content manipulation pose a growing detection challenge even for digital forensics experts trained in law enforcement and/or computer science [547,548]. Digital media forensics professionals must remain vigilant regarding the ease with which multimedia content can be manipulated, as false evidence may lead to wrongful convictions. The complexity of deepfakes in courtrooms is compounded by the silent witness theory (i.e., audio, photos, and videos are considered to speak for themselves as legal evidence). Yet an empirical study has shown that attackers can remotely infiltrate body camera devices, extract footage, make alterations or selectively delete portions they wish to conceal from law enforcement, and seamlessly re-upload the manipulated content without leaving any identifiable trace of tampering [549]. Digital media forensic results, vital for court use, require meticulous validation; however, AI-based manipulation detectors, while accurate, lack explainability (i.e., they are black-box models). The straightforward incorporation of AI-based detectors into forensic applications is hindered because digital forensics experts and computer professionals lack the necessary knowledge of AI algorithms and struggle to explain the results effectively. New courses specifically tailored for digital media forensics experts should be developed. New forensic models that combine cyber-forensic and incident response approaches would empower forensics experts to conduct thorough and legally sound investigations. Future research should concentrate on deepfake forensic tools that are explainable, accurate, and meaningful enough to meet the stringent standards for admissibility in legal proceedings.
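One simple, model-agnostic way to attach an explanation to a black-box detector is occlusion analysis: mask image regions one at a time and record how the ‘fake’ score changes. The sketch below assumes a hypothetical `detector` callable that returns P(fake) for a float image; the patch size and gray fill value are illustrative.

```python
# Sketch of occlusion-based explanation for a black-box detector: slide a
# gray patch over the image and record how the 'fake' probability drops.
# `detector` is a placeholder returning P(fake) for an HxWx3 float image.
import numpy as np

def occlusion_map(detector, img: np.ndarray, patch: int = 16) -> np.ndarray:
    h, w = img.shape[:2]
    base = detector(img)
    heat = np.zeros((h // patch, w // patch))
    for i in range(0, h - patch + 1, patch):
        for j in range(0, w - patch + 1, patch):
            occluded = img.copy()
            occluded[i:i + patch, j:j + patch] = 0.5   # gray out one patch
            # A large drop means this region drove the 'fake' decision.
            heat[i // patch, j // patch] = base - detector(occluded)
    return heat
```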
4.33. National and International Competitions and Challenges for Deepfake Technology
In the spirit of fostering innovation and technological progress, public competitions and challenges have been launched, inviting individuals and teams from the general public, academia, and industry to showcase their deepfake skills within a competitive framework, e.g., the Open Media Forensics Challenge [550] by NIST, the Deepfake Detection Challenge [85], ADD [328], and ASVspoof [320]. These competitions and challenges have mainly concentrated on the development of deepfake detectors rather than generators or a combination of both. Moreover, these events are held biennially or even less frequently. More frequent national and international deepfake competitions and challenges should be hosted, taking into account factors such as diverse data quality, sample variability, model complexity, human detectors, evaluation metrics, and ethical and social implications.
4.34. Deepfake Education for the General Population
Deepfakes are rapidly on the rise, posing a growing threat to individuals in online spaces, including audio and video calls. A 2022 global study by iProov on public knowledge of deepfake media found that only around 29% of respondents understand what deepfakes are [551]. Moreover, tech-savvy individuals, despite their social media literacy, often experience a dip in confidence when exposed to deepfake detection outcomes. Hence, it is paramount to educate the general public about the perils of deepfakes and equip them with the skills to critically evaluate digital content and detect early indicators of a deepfake attack. By enhancing individuals’ ability to identify deepfakes, we can effectively minimize the impact of deceptive content. Deepfake technology literacy is still in its early stages of development. Researchers and professionals must dedicate additional effort to formulating effective training and educational frameworks for the general public, and should identify and assess both existing and novel educational strategies. Multilingual online tutorials, videos, and presentations could prove highly beneficial on a global scale.
4.35. Open-Source Intelligence (OSINT) Techniques against Deepfake
Deepfake OSINT approaches aim to devise and share open-source tools for identifying disinformation-related content and deepfakes [552]. The most widely used approach is reverse image search (e.g., FotoForensics [553], InVID [554], and WeVerify [555]), which helps a user verify the authenticity of a questionable image or video. Encyclopedic OSINT tools against deepfakes should be designed to provide superior accuracy and higher-quality search results. Such tools should include features like audio, visual, and multimedia deepfake or audio/face tampering detection, reverse audio/image/video search, metadata analysis, noise analysis, voice clone detection, extensive datasets, and user-friendly interfaces. There is a shortage of OSINT tools for audio, 3D, virtual reality, and mixed reality deepfakes, along with limited large-scale deepfake OSINT databases.
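At the core of reverse image search lies a perceptual fingerprint that survives resizing and recompression. The sketch below implements a basic average hash (aHash) with Pillow; it is a didactic stand-in for the more robust fingerprints used by the tools cited above.

```python
# Sketch of a perceptual 'average hash' (aHash): a 64-bit fingerprint that
# stays stable under resizing/recompression, enabling near-duplicate lookup.
from PIL import Image

def average_hash(path: str, size: int = 8) -> int:
    img = Image.open(path).convert("L").resize((size, size))
    pixels = list(img.getdata())
    avg = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:                       # one bit per pixel: above/below mean
        bits = (bits << 1) | (1 if p > avg else 0)
    return bits

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")           # small distance => near-duplicate
```

In practice, a suspect image is hashed and matched against an index of known originals; a small Hamming distance flags a likely recirculated or lightly edited copy.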
4.36. Interdisciplinary Research
The research community must prioritize fostering interdisciplinary basic science to cultivate resilient, authentic, reliable, and universally applicable techniques for effectively addressing the challenges posed by deepfakes, fake news, disinformation, and misinformation [556,557,558]. Such efforts will advance state-of-the-art fake content and information detection technologies.
5. Conclusions
The substantive evolution of AI and ML is unlocking new possibilities across sectors such as finance, healthcare, manufacturing, entertainment, information dissemination, and storage and retrieval. Deepfakes leverage AI and ML advancements to digitally manipulate or create convincing, highly realistic audio, visual, or audiovisual face samples. Deepfakes have the potential to deceive speech/facial recognition technologies; deliver malicious payloads in state-sponsored cyber-warfare; and propagate misinformation, disinformation, and fake news, thereby eroding privacy, security, the trustworthiness of online content, and democratic stability. Researchers and practitioners strive to enhance deepfake detection, yet fierce competition persists between generator and detector advancements. Thus, this paper first presents an extensive overview of current audio, image, and video deepfake and audio/face manipulation databases, with detailed insights into their characteristics and nuances. Next, the paper extensively explores the open challenges and promising research directions in audio, image, and video deepfake generation and mitigation. Concerted interdisciplinary efforts are essential for making significant headway in deepfake technology, and this article offers a valuable reference point for creating cutting-edge deepfake generation and detection algorithms. In summary, this article strives to augment existing survey studies and serve as a catalyst to inspire newcomers; researchers; practitioners; engineers; information, political, and social scientists; policymakers; and multimedia forensic investigators to consider deepfakes as a focal point in their scholarly pursuits.