Comparison of Deepfake Detection Techniques through Deep Learning

Deepfakes are realistic-looking fake media generated by deep-learning algorithms that iterate through large datasets until they have learned how to solve the given problem (i


Introduction
In the last few years, cybercrime, which accounts for a 67% increase in the incidents of security breaches, has been one of the most challenging problems that national security systems have had to deal with worldwide [1]. Deepfakes (i.e., realistic-looking fake media generated by deep-learning algorithms) are being widely used to swap faces or objects in video and digital content. Because deepfakes appear online in a wide variety of applications and formats (i.e., audio, image and video), this artificial intelligence-synthesized content can significantly undermine the determination of a medium's legitimacy.
Considering the speed, ease of use, and impact of social media, persuasive deepfakes can rapidly influence millions of people, destroy the lives of their victims and have a negative impact on society in general [1]. The generation of deepfake media can have a wide range of intentions and motivations, from revenge porn to political fake news. Rana Ayyub, an investigative journalist in India, became a target of this practice when a deepfake sex video showing her face on another woman's body was circulated on the Internet in April 2018 [2]. Deepfakes have also been published to falsify satellite images with non-existent landscape features for malicious purposes [3].
There are numerous captivating applications of deepfakery in video compositing and transfiguration in portraits, especially in identity protection as it can replace faces in photographs with ones from a collection of stock images. Cyber-attackers, using various strategies other than deepfakery, are always aiming to penetrate identification or authentication systems to gain illegitimate access. Therefore, identifying deepfake media using forensic methods remains an immense challenge since cyber-attackers always leverage newly published detection methods to immediately incorporate them in the next generation of deepfake generation methods. With the massive usage of the Internet and social media, and billions of images available on the Internet, there has been an immense loss of trust from social media users. Deepfakes are a significant threat to our society and to digital evidence in courts. Therefore, it is highly important to obtain state-of-the-art techniques to identify deepfake media under criminal investigation.
As demonstrated in Table 1 (inspired by the figure presented in [1]), tampering of evidence, scams and frauds (i.e., fake news), digital kidnapping associated with ransomware blackmailing, revenge porn and political sabotage are among the types of deepfake activity with the highest level of intention to mislead [1]. The first deepfake content published on the Internet was a celebrity pornographic video that was created by a Reddit user (named deepfake) in 2017. The Generative Adversarial Network (GAN) was first introduced in 2014 and used for image-enhancement purposes only [4]. However, since the first published deepfake media, it has been unavoidable for deepfake and GAN technology to be used for malicious purposes. In 2017, GANs were used to generate new facial images for malicious purposes for the first time [5]. Following that, there has been a constant development of other deepfake-based applications such as FakeApp and FaceSwap. In 2019, Deepnude was developed and provided undressed videos of the input data [6]. The widespread strategies used to manipulate multimedia files can be broadly categorized into the following major categories: copy-move, splicing, deepfake, and resampling [7]. Copy-move, splicing and resampling involve, respectively, repositioning the contents of a photo, overlapping different regions of multiple photos into a new one, and manipulating the scale and position of components of a photo. The final goal is to deceive the viewer by giving the impression that the photograph contains a larger number of components than were initially present. Deepfake media, however, leveraging powerful machine-learning (ML) techniques, have significantly improved the manipulation of the contents. Deepfake can be considered to be a type of splicing, where a person's face, sound, or actions in media are swapped with those of a fake target [8].
A wide set of cybercrime activities are usually associated with this type of manipulation technique, and while spreading deepfakes is easy, correcting the record and avoiding them is harder [9]. Consequently, it is becoming harder for machine-learning techniques to identify convolutional traces of deepfake generation algorithms, as this requires frequency-specific anomaly analysis. The most basic algorithms that were being used to train models for the task of deepfake detection, such as Support Vector Machine (SVM), Convolutional Neural Network (CNN), and Recurrent Neural Network (RNN), are now being coupled with multi-attentional [10] or ensemble [11] methods to increase performance and address the weaknesses of other methods. As proposed by [12], by implementing an ensemble of standard and attention-based data-augmented detection networks, the generalization issue of the previous approaches can be avoided. As such, it is of high importance to identify the most suitable algorithms for the backbone layers in multi-attentional and ensembled architectures. As generation of deepfake media only started in 2017, academic writing on the problem is meager [13]. Most of the developed and published methods/techniques are focused on deepfake videos. The main difference between deepfake video- and image-detection methods is that video-detection methods can leverage spatial features [14], spatio-temporal anomalies [15] and supervised domain [16] to draw a conclusion on the whole video by aggregating the inferred output both in time and across multiple faces. However, deepfake image-detection techniques have access to one face image only and mostly leverage pixel- [17] and noise-level analysis [18] to identify the traces of the manipulation method.
Therefore, identifying the most reliable methods for face-image forgery detection that rely on convolutional neural networks (CNN) as the backbone for a binary classification task could provide valuable insight for the future direction in the development of deepfake-detection techniques. The overall approach taken in this work is illustrated in Figure 1. DenseNet has shown significant promise in the field of facial recognition. DenseNet, as an extension of the Residual CNN (ResNet) architecture, has addressed the low-supervision problem of all its counterparts by initiating between-layer connections using dense blocks. The dense blocks in the DenseNet architecture improve the learning process by leveraging a transition layer (essentially convolution, average pooling, and batch normalization between each dense block) that concatenates feature maps. As such, gradients from the initial input and loss function are shared by all the layers. The described implementation reduces the number of required parameters and feature maps, and consequently provides a less computationally expensive model. Therefore, we have decided to test DenseNet's capabilities and compare it with other neural network architectures.
VGG-19, as an algorithm that has been widely used to extract the features of the detected face frames [19], was chosen to be compared with the DenseNet architecture. VGG-19's architecture eases the face-annotation process by forming a large training dataset with the use of online knowledge sources that are then used to implement deep CNNs to perform the task of face recognition. The formed model is then evaluated on face recognition benchmarks to analyze model efficiency regarding the generation of facial features. During this process, VGG-19 is trained on classifiers with a sigmoid activation function in the output layer, which produces a vector representation of facial features (face embedding) to fine-tune the model. The fine-tuning process differentiates class similarities using Euclidean distance, achieved with a triplet loss function that compares the Euclidean spaces of similar and different faces using learning score vectors. The CNN architecture implemented in VGG-19 implements fully connected classifiers that include kernels and ReLU activation followed by max-pooling layers.
Finally, we have implemented a Custom CNN architecture to evaluate the performance of previously described algorithms and analyze the effectiveness of dropout, padding, augmentation and grayscale analysis on model performance.
This study aims to provide an in-depth analysis of the described algorithms, structures and mechanisms that could be leveraged in the implementation of an ensembled multi-attentional network to identify deepfake media. The result of this work contributes to the nascent literature on deepfakery by providing a comparative study of effective algorithms for deepfake detection on facial images, for possible use in digital forensics during criminal investigations.
The rest of this paper is organized as follows. Section 2 provides a literature review of the algorithms and datasets that are widely used for deepfake detection. Section 3 provides details on the analysis methods and configurations of the compared algorithms, as well as details on the tested dataset. Section 4 provides the results of the comparative analysis. Finally, Section 5 concludes with implications, limitations, and suggestions for future research.

Literature Review
Anti-deepfake technology can be divided into three categories: (1) detection of the deepfake; (2) authentication of the published content; and (3) prevention of the spread of contents that can be used for deepfake production. Technology for the detection and authentication of deepfakery is growing fast; however, the capacity to generate deepfakes is advancing much faster than the ability to detect them. Twitter has reported attempts to publish misinformation and fake media by 8 million accounts per week [20]. The wide variety of deepfake media, and the detection techniques that have been used to identify them, are shown in Figure 2. This variety has created a massive challenge for researchers to provide a solution that can promptly analyze all the posted material on the Internet and social media platforms to identify deepfakes. Previous research has mostly aimed at improving previously developed technologies to train a new detection system.

Deepfake Detection Datasets
Deepfake detection systems typically leverage binary classifiers to separate information into real and fake classes. This method requires a great quantity of good-quality authentic and tampered data to train classification models. The first known datasets that had a great impact on the growth and improvement of deepfake detection technologies were UADFV [21] and DF-TIMIT [22]. The FaceForensics++ dataset includes 977 videos downloaded from YouTube and provides 1000 sequences of original unobstructed faces, as well as their manipulated versions. The manipulated versions were generated by four methods: Deepfakes, Face2Face, FaceSwap and NeuralTextures [23]. The DeepFakeDetection dataset (DFD), released by Google in collaboration with Jigsaw, contains over 363 original sequences from 28 paid actors in 16 different scenes as well as over 3000 manipulated videos using deepfakes [23]. The Deepfake Detection Challenge (DFDC) dataset [24] published by Facebook is another publicly available large dataset that includes over 100,000 total clips from 3426 actors, produced with deepfake, GAN-based and unsupervised models. The Celeb-DF (v2) dataset [25] is an extension of Celeb-DF (v1) that contains real videos and fake videos generated via a deepfake algorithm, with the same quality as the synthesized videos circulating online. This dataset provides 5639 videos with subjects of different ages, ethnic groups and genders, and their corresponding deepfake videos. The DeeperForensics-1.0 dataset is a large-scale benchmark for face forgery detection that represents the largest face forgery detection dataset by far. This benchmark includes 60,000 videos forming a total of 17.6 million frames generated by an end-to-end face-swapping framework. Furthermore, extensive real-world perturbations are applied to obtain a more challenging benchmark of larger scale and higher diversity [26].
For our research and analysis, we took the "Real and Fake Face-Detection" dataset from Yonsei University [27] that contains expert-generated high-quality Photoshopped face images. The dataset includes 960 fake and 1081 real images that are composites of different faces, separated by eyes, nose, mouth, or whole face. The second dataset that has been used in this work is the "140K Real and Fake Faces" that consists of 70K real faces from the Flickr dataset collected by Nvidia, as well as 70K fake faces sampled from the 1 million fake faces (generated by StyleGAN) that were published by Bojan [28]. These two datasets were used to include both GAN-generated images and expert/human-generated images, and thus to provide a large quantity of good-quality data. All the above-mentioned datasets can be used for image and video classification, segmentation, generation and augmentation of new data. Table 2 presents a cumulative comparison of the mentioned datasets; please note that the rows with a "*" sign include images only (not videos). Deepfake datasets have been categorized into two generations based on several factors and elements. Considering release time and the synthesis algorithms involved in the generation of the data, UADFV and DF-TIMIT are categorized as the first generation. Considering the quality and quantity of the generated data, DFD, DeeperForensics, DFDC, and the Celeb-DF datasets are categorized as the second generation [25].

Deepfake Detection Algorithms
Deepfake detection techniques aim to uncover revealing traces of deepfakes by extracting semantic and contextual understanding of the content. Research in the field of media forensics provides a wide range of imperfections as indicators of fake media, all of which are used as features by deepfake detection algorithms [13]: face wobble, shimmer and distortion; waviness in a person's movements; inconsistencies with speech and mouth movements; abnormal movements of fixed objects such as a microphone stand; inconsistencies in lighting, reflections and shadows; blurred edges; angles and blurring of facial features; lack of breathing; unnatural eye direction; missing facial features such as a known mole on a cheek; softness and weight of clothing and hair; overly smooth skin; missing hair and teeth details; misalignment in face symmetry; inconsistencies in pixel levels; and strange behavior of an individual doing something implausible. The use of deep-learning techniques and algorithms such as CNN and GAN has made deepfake detection more challenging for forensics models because deepfakes can preserve the pose, facial expression and lighting of the photographs [29]. Frequency domain, JPEG Ghost and Error Level Analysis (ELA) are among the first methods that were used to identify manipulation traces on images. However, they are not successful in identifying manipulated images that are generated with deep-learning and GAN algorithms. Neural networks are one of the most widely used methods for deepfake detection. There are some proposals on the usage of X-rays [18] and spectrograms [30] to identify traces of blending and noise in deepfake media. However, such methods cannot detect random noise and suffer from a performance drop when encountering low-resolution images. Deepfakes are implemented mainly using a CNN that generates deepfake images and an encoder-decoder network structure (ED), or GAN [4] that synthesizes fake videos.
Deepfake detection techniques focused on anomalies in the face region only can be categorized into holistic and feature-based matching techniques [31]. The holistic techniques, which are mostly used to identify deepfake face images and include Principal Component Analysis (PCA), Support Vector Machines (SVM), and CNN, mainly analyze the face as a whole. These techniques aim at reducing data dimensionality by forming a smaller set of linear combinations of the image pixels that are then fed to a binary classifier to identify authentic and fake images. Feature-based or attention-based matching techniques, however, are used for both deepfake video and image identification, and split the whole face into different regions of focus such as eye, nose, lips, skin, head position, color mismatches, etc. [32]. Holistic techniques are successful in detecting localized deepfake characteristics (i.e., anomalies in the face and jaw region) and can be leveraged to identify specific feature characteristics (eyes, nose, mouth) that could be significant in detection [12]. Convolutional Neural Network (CNN)-based image classification and recognition models have been proven to be trainable to distinguish manipulated images from authentic ones [33]. Luca et al. [34] aimed to extract and detect fingerprints that represent convolution traces left in the process of generating GAN images using the Expectation-Maximization algorithm. Wang et al. [35] demonstrated that, with careful pre- and post-processing and data augmentation, a standard classifier trained on ProGAN, an unconditional CNN generator, can generalize surprisingly well to unseen architectures, datasets, and training methods. CNNs have also been trained to detect manipulation traces such as lack of eye-blinking [36], missing details in eyes from an image [37], and face-warping artifacts.
Furthermore, CNNs have been shown to be able to capture distinctive traces of generation methods that have worked on further warping the faces with high-resolution sources [17].
VGG19 and VGG16 have significantly improved large-scale fake image recognition by increasing the layer depth (23/26 layers) of CNN-based models [38]. Chang et al. [39] presented an improved VGG network, namely NA-VGG, based on image augmentation and noise-level analysis to detect a deepfake face image. The experimental results using the Celeb-DF dataset show that NA-VGG improved accuracy over other state-of-the-art fake image detectors. Kim et al. [40] demonstrated that VGG-16 performs better than the ShallowNet architecture at distinguishing genuine facial images from disguised face images.
Furthermore, the DenseNet architecture has also been demonstrated to be computationally more efficient with its feed-forward network design, which connects each layer to every other layer [41]. In the DenseNet architecture, the feature maps of all former layers are used as the input for each layer. DenseNet requires significantly fewer parameters and less computation to achieve state-of-the-art performance [33]. Hsu et al. [42] proposed a fake face-image detector based on the novel CFFN, consisting of an improved DenseNet backbone network and a Siamese network architecture. Their comprehensive analysis demonstrated that deep features-based deepfake-detection systems such as DenseNet obtain significant accuracy when trained and tested on the same kind of manipulation technique.
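The channel bookkeeping implied by this dense connectivity can be sketched as follows. This is only an illustration under stated assumptions: the block sizes (6, 12, 24, 16), growth rate of 32, 64 stem channels, 0.5 compression factor and 224-pixel input are those of the standard DenseNet-121 configuration and are not taken from this paper, and the function name `trace_densenet` is hypothetical.

```python
def trace_densenet(block_sizes=(6, 12, 24, 16), growth_rate=32,
                   stem_channels=64, compression=0.5, input_size=224):
    """Track (stage name, spatial size, channels) through a DenseNet-style net.

    All default values are assumptions (DenseNet-121-like), used only to
    illustrate how concatenation makes the channel count grow.
    """
    # Stem: a stride-2 convolution followed by stride-2 max pooling.
    size = input_size // 2 // 2
    channels = stem_channels
    stages = []
    for i, num_layers in enumerate(block_sizes):
        # Every layer in a dense block receives all earlier feature maps
        # (concatenated) and contributes `growth_rate` new ones.
        channels += num_layers * growth_rate
        stages.append((f"dense_block_{i + 1}", size, channels))
        if i < len(block_sizes) - 1:
            # Transition: 1x1 convolution compresses the channels and
            # 2x2 average pooling halves the spatial size.
            channels = int(channels * compression)
            size //= 2
            stages.append((f"transition_{i + 1}", size, channels))
    return stages


for name, size, channels in trace_densenet():
    print(f"{name:>14}: {size}x{size} spatial, {channels} channels")
```

Because each layer adds only a small, fixed number of feature maps and reuses everything computed before it, the per-layer parameter count stays low even as the concatenated input grows.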
Feature-based techniques have started identifying the deficiencies of deepfake generation methods such as unnatural eye-blinking patterns and temporal flickering, which gave rise to a more improved generation of deepfake models that were trained on datasets that have addressed the identified deficiencies. Yang et al. [43] demonstrated that facial landmarks could be used to provide an estimate of head posture direction. The work of [44,45] illustrated that eye pupils' inconsistencies are one of the indicators of fake media. Some studies [46] including audio into the training process have illustrated that the difference between lip movements and voice matching distinguishes real and fake media. There have been some efforts on domain-specific deepfake detection such as [47] that leveraged forensic techniques to model political leaders' facial expressions and speaking patterns; however, it would be a more difficult task to train and generalize such approach for the whole world. Even though feature-based techniques are more robust to deformations, they have been mainly designed to have the best performance on domain-specific datasets. Holistic techniques are competent in learning human faces and extracting higher-dimensional semantic features for classification.
Other techniques that leverage spatial features and spatio-temporal anomalies in the supervised domain, such as Xception [48] and EfficientNet [49], have been shown to be more efficient than CNNs. The Xception architecture claims to make more efficient use of model parameters through depthwise separable convolutions, which can be understood as an Inception module. Kumar and Bhavsar [16] demonstrated that Xception combined with metric learning can enhance classification in high-compression scenarios. They were able to achieve an AUC score of 99.2% and accuracy of 90.71% for deepfake video identification on the Celeb-DF dataset. Ismail et al. [14] in their experimental analysis demonstrated that XceptionNet combined with an additional Bi-LSTM and LSTM layer can achieve a 79% ROC-AUC score. Li et al. [50] demonstrated that Xception does not perform well on face-image datasets (AUC of 73.2) and, furthermore, that it has a high true-negative rate while having the lowest true-positive rate. To summarize, Xception may provide better performance for fake video detection; however, it does not address the generalizability issue across different datasets and does not perform well when fed with images only. EfficientNet proposes a new scaling method that uniformly scales all dimensions of depth/width/resolution using a compound coefficient. Coccomini et al. [15] were able to achieve an AUC of 0.95 and F1-score of 88% on the DFDC dataset. Pokroy and Egorov [51] demonstrated that an increased scale in all dimensions may not always lead to higher accuracy, because CNNs will have to deal with more complex patterns that are difficult to transfer to a different task. Mitra et al. [52] were able to achieve a 96% accuracy on the FaceForensics++ dataset by lowering the complexity of detecting forged videos using the depthwise separable convolution of EfficientNet.
In conclusion, Xception and EfficientNet, by uniformly scaling all dimensions, can gain a more efficient use of model parameters. Furthermore, they can extract spatial features and spatio-temporal anomalies by aggregating the inferred output both in time and across multiple faces due to their depthwise separable convolutions. These methods have illustrated that they can draw an improved conclusion on the whole video; however, they have not demonstrated any improvements to deepfake classification on a single image (i.e., deepfake image-detection).
Recent scholarly work has been focused on implementing an ensemble of holistic and feature-based detection networks by addressing the drawbacks of both methods. Dolescki et al. [53], in their work implementing a classification method, which involves a collection of classifiers with a certain utility function regarded as an aggregation operator, were able to achieve accuracy of 87%. Silva et al. [12] were able to achieve a 92% accuracy on the DFDC dataset by implementing a hierarchical explainable forensics algorithm that incorporates humans in the detection loop. Hanqing et al. [10] proposed a multi-attentional deepfake detection network that can achieve a 97% accuracy by implementing multiple spatial attention heads, textural feature enhancement blocks and aggregating low-level textural features and high-level semantic features. Bonettini et al. [11] were able to achieve AUC of 87% on DFDC by assembling different trained Convolutional Neural Network (CNN) models that combined EfficientNetB4 with attention layers and Siamese training. Du et al. [54] demonstrated that a good balance between accuracy and efficiency can be achieved with two separated EfficientNet architectures that simultaneously analyze raw content and its frequency-domain representation.
Given that the most successful approaches to identifying and preventing deepfakes are deep-learning methods that rely on CNNs as the backbone for a binary classification task [12], and that a large 2D CNN model can prove to be better than an EfficientNet model if deepfake classification is the only desired result [55], we have evaluated the most common backbone architectures of existing frameworks (CNN, VGG-19 and DenseNet) that have been demonstrated to have the best performance on the task of deepfake image classification.

Approach
Our proposed method for deepfake detection on images is shown in Figure 1. We have taken two different classification procedures in this work. As shown in both Figures 1 and 3, input data goes through the same procedure with the same architecture; however, Figure 3 demonstrates a second round of analysis with an additional post-processing classification step that has been added to the last output layer of the analyzed models. The second round of analysis with additional post-processing was performed to analyze the effects of principal component analysis on the task of deepfake classification. Further details about the post-processing step are described in the final paragraphs of the evaluation subsection of this section.

Implementation
The input data form a dataset that is labeled and divided into two categories, real and fake. The images are augmented for training purposes using the specifications given below for each model. After augmentation, the face images are classified as either fake or real using three different models: Custom CNN, VGG, and DenseNet. We defined two classes for our binary classification task: 0 to denote the real (e.g., normal, validation, and disguised face images) and 1 to denote the fake (e.g., impersonator face images) groups, respectively.
The "Real and Fake Face-Detection" dataset was used to train the three models at a learning rate of 0.001 and for 10 epochs. The test accuracy was then calculated using the test set. We applied data augmentation to flip all original images horizontally and vertically, hence a three-fold increase of the dataset size (original image + horizontally flipped image + vertically flipped image).
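The three-fold increase described above can be illustrated with a minimal, dependency-free sketch. Images are represented here as nested lists of pixel values purely for illustration; the function names are hypothetical and the actual pipeline presumably operated on decoded image arrays.

```python
def flip_horizontal(image):
    """Mirror left-right: reverse the pixels within every row."""
    return [row[::-1] for row in image]


def flip_vertical(image):
    """Mirror top-bottom: reverse the order of the rows."""
    return [list(row) for row in image[::-1]]


def augment_with_flips(images):
    """Return every original image plus its horizontal and vertical flips."""
    augmented = []
    for image in images:
        augmented.extend([image, flip_horizontal(image), flip_vertical(image)])
    return augmented


# Toy 2x3 "image" standing in for a face image.
toy = [[1, 2, 3],
       [4, 5, 6]]
dataset = augment_with_flips([toy])
print(len(dataset))  # one original + two flipped copies
```

Each input image contributes exactly three samples, so the augmented set is three times the size of the original one.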
The Custom CNN architecture included six convolution layers (Conv2D), each paired with batch normalization, max pooling and dropout layers. Rectified Linear Unit (ReLU) and sigmoid activation functions were applied for the input and output layers, respectively. Dropout was applied to each layer to minimize over-fitting, and padding was also applied to the kernel to allow for a more accurate analysis of images. The Custom CNN architectures have been trained and validated on the original and augmented datasets with a 1/255 scaling factor. Data augmentation was performed to observe the effects of data aggregation on model performance and to promote the generalizability of the findings. The augmentation process includes a horizontal flip, a zoom range of 0.2 and a shear range of 0.2, along with the rescaling factor, so that image quality does not factor into model behavior during classification, since not all the images had the same pixel-level quality.
Following a similar approach to [56], the VGG-19 model that was used comprises 16 convolutional layers, three fully connected layers, five max-pooling layers and one softmax layer, modeled on the architectures in [56]. VGG-19 has been pretrained on a wide variety of object categories, which leads to its ability to learn rich feature representations. VGG-19 has demonstrated that it can provide a high accuracy level when classifying partial faces. This architecture achieves its highest accuracy when its size is increased [57]; therefore, we have applied a high-end configuration to it by adding a dense layer after the last layer block that provides the facial features, and a further dense layer with a sigmoid activation function as the output layer, to fine-tune the model for the task of deepfake detection.
The DenseNet architecture used in this work is Keras's DenseNet-264 architecture with an additional dense layer as the last output layer. This architecture starts with a 7 × 7 stride-2 convolutional layer followed by a 3 × 3 stride-2 max-pooling layer. It also includes four dense blocks paired with batch normalization and a ReLU activation function for the input layers, and a sigmoid activation function for the output layer. Furthermore, there are transition layers between the dense blocks that include a 2 × 2 average pooling layer along with a 1 × 1 convolutional layer. The last dense block is followed by a classification layer that leverages the feature maps of all layers of the network for the task of classification, which we have coupled with a dense layer with the sigmoid activation function as the output layer. This model was trained on 100,000 images and validated on 20,000 images. This model has also been trained and validated on the original, grayscale and augmented datasets with a 1/255 scaling factor. We aimed to add to the diversity of the training data for the DenseNet architecture by applying a horizontal flip and a rotation range of 20, along with the same rescaling procedure that was applied in the Custom CNN architecture. Because the pixel-level resolution of grayscale and color images is different, we have also measured the importance of color on model behavior when classifying data into the fake and real categories by training the DenseNet architecture on grayscale-only data. The VGG architecture, however, was only trained and tested on the original dataset. All the analyzed models in this work are used as they were designed, with an additional custom dense layer with a sigmoid activation function.
The rationale behind adding this layer to all models was to provide an activation layer suited to binary classification: the sigmoid produces a probability output in the range of 0 to 1 that can easily and automatically be converted to crisp class values.
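The conversion from a sigmoid output to a crisp class value can be sketched as follows. The 0.5 cut-off and the function names are assumptions made for illustration; the text does not state which threshold was used.

```python
import math


def sigmoid(logit):
    """Map a real-valued score to a probability in [0, 1] (overflow-safe)."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    exp_logit = math.exp(logit)
    return exp_logit / (1.0 + exp_logit)


def to_class(probability, threshold=0.5):
    """Convert a probability to a crisp label: 1 = fake, 0 = real.

    The 0.5 threshold is an assumed default, not a value from the paper.
    """
    return 1 if probability >= threshold else 0


for logit in (-2.0, 0.0, 3.0):
    p = sigmoid(logit)
    print(f"logit={logit:+.1f}  p(fake)={p:.3f}  class={to_class(p)}")
```

Moving the threshold away from 0.5 trades false positives against false negatives, which is what the precision-recall and ROC analyses in the evaluation summarize.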

Evaluation
The performance of the described models is assessed with accuracy, precision, recall, F1-score, average precision (AP) and area under the ROC curve.
Accuracy, simply put, indicates how close the model prediction is to the target or actual value (fake vs. real), that is, how many correct predictions the model made among all the predictions it made. Equation (1) gives the overall formula used to calculate accuracy, where TPR stands for true predictions and TOPR stands for total predictions made by the model.
Precision, on the other hand, refers to how reliable the model's positive predictions are. Equation (2) gives the proportion of positive identifications by the model that were actually correct, where TP stands for the number of true positives and FP stands for the number of false positives.
Recall is the proportion of actual positives that the model correctly identified. Equation (3) demonstrates this ratio, where TP is the number of true positives and FN the number of false negatives. Recall is intuitively the ability of the classifier to find all the positive samples.
The F1-score, by taking into account both precision and recall, balances the two and indicates the model's ability to accurately predict both true-positive and true-negative classes. The F1-score can be interpreted as the harmonic mean of precision and recall. For the task of deepfake classification, the F1-score is a better measure to assess model performance, since both classes are of importance and the relative contributions of precision and recall to the F1-score are equal. Equation (4) demonstrates how the F1-score is calculated.
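The four measures defined in Equations (1) to (4) can be computed from the confusion counts as in the sketch below. It uses the standard definitions (accuracy = correct / total, precision = TP / (TP + FP), recall = TP / (TP + FN), F1 = harmonic mean of precision and recall); the function names and the toy labels are illustrative assumptions, with 1 denoting fake and 0 denoting real as in the class encoding above.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN with 1 = fake (positive) and 0 = real."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 as a dictionary."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0      # correct / all
    precision = tp / (tp + fp) if (tp + fp) else 0.0    # TP / (TP + FP)
    recall = tp / (tp + fn) if (tp + fn) else 0.0       # TP / (TP + FN)
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0  # harmonic mean
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels for illustration only (not results from this study).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
print(classification_metrics(y_true, y_pred))
```

Returning 0.0 when a denominator is zero is a convention chosen here for the sketch; libraries differ in how they treat that edge case.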
Average Precision (AP) was used as an aggregation function for the task of object detection to summarize the precision-recall curve as the weighted mean of the precision achieved at each threshold, with the increase in recall from the previous threshold used as the weight, as in Equation (5), where R_n and P_n are the recall and precision at the nth threshold [58].
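The weighted mean described above can be written as AP = sum over n of (R_n - R_{n-1}) x P_n, with R_n the recall and P_n the precision at the nth threshold. The following sketch computes it in plain Python under the assumption that a threshold is placed at every predicted score and that scores are distinct; the function name and the sample data are hypothetical.

```python
def average_precision(y_true, scores):
    """AP = sum_n (R_n - R_{n-1}) * P_n, with one threshold per predicted score.

    Assumes binary labels (1 = positive) and distinct scores (no ties).
    """
    total_positives = sum(y_true)
    if total_positives == 0:
        raise ValueError("AP is undefined when there are no positive samples")
    # Rank samples from the highest to the lowest predicted score.
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    true_positives = 0
    previous_recall = 0.0
    ap = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        true_positives += label
        precision = true_positives / rank
        recall = true_positives / total_positives
        # Weight the precision by the increase in recall at this threshold.
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap


# Hypothetical labels and scores for five images.
ap = average_precision([1, 0, 1, 1, 0], [0.9, 0.8, 0.7, 0.6, 0.2])
```

Only thresholds at which recall increases contribute to the sum, so negatives ranked below the last positive do not lower the AP.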
Finally, as shown in Figure 3, the output vectors of the final hidden layer of the analyzed architectures were extracted and treated as representations of the images. The dimensions of these vectors for the Custom CNN, VGG-19 and DenseNet architectures were 512, 2048 and 1024, respectively. Principal Component Analysis (PCA) was then applied to retain the most dominant directions of variance, preserving 50 principal components. The resulting PCA vectors were fed into a support vector machine (SVM) to classify them into the two classes of real and fake.
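The PCA-SVM stage can be sketched with scikit-learn as shown below. This is an illustration under stated assumptions rather than the configuration used in this work: the feature vectors are synthetic stand-ins for the 512-dimensional Custom CNN outputs, and the linear kernel, the train/test split and the label encoding are all assumptions, since the section does not specify them.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for final-hidden-layer features (512-d, as for the Custom CNN).
n_per_class, n_features = 150, 512
real = rng.normal(loc=0.0, scale=1.0, size=(n_per_class, n_features))
fake = rng.normal(loc=0.5, scale=1.0, size=(n_per_class, n_features))
features = np.vstack([real, fake])
labels = np.array([0] * n_per_class + [1] * n_per_class)  # assumed: 0 = real, 1 = fake

x_train, x_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, stratify=labels, random_state=0
)

# Keep 50 principal components, then classify with an SVM (kernel is an assumption).
classifier = make_pipeline(PCA(n_components=50, random_state=0), SVC(kernel="linear"))
classifier.fit(x_train, y_train)
accuracy = classifier.score(x_test, y_test)
```

Fitting PCA inside the pipeline means the components are estimated from the training split only, which keeps information from the test images out of the projection.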

Preliminary Results
This section provides the results obtained from the three different neural network architectures that have been tested in this work. The dataset section provides an overview of the advantages, drawbacks, and improvements of the datasets described in the literature review.

Dataset
Deepfake datasets should be assessed carefully for quality, scale, and diversity. UADF and DFTMIT provide a baseline dataset for preliminary analysis in deepfake detection; however, they lack quantity and diversity. The DeepFakeDetection dataset extends the preliminary FaceForensics dataset; however, it contains relatively few videos and subjects, and it is limited in size and in the number of methods represented. The DFDC dataset addresses the drawbacks of the previously published datasets by providing a large number of clips, of varying quality, with a good representation of the current state-of-the-art face-swap methods. However, its fake videos still contain various visual artifacts that make them easily distinguishable from the real videos. The DFDC dataset resolves the limited availability of source footage, few videos and fewer subjects; however, the Celeb-DF dataset provides more relevant data to evaluate and support the future development of deepfake detection methods by fixing the color mismatch, inaccurate face masks, and temporal flickering of the previously discussed datasets. Finally, DeeperForensics, by addressing the drawbacks of all the mentioned datasets, provides a benchmark of larger scale and higher diversity that can be leveraged to achieve the best performance of deepfake detection algorithms. Table 3 summarizes the drawbacks and improvements of the described datasets.
The mentioned datasets consist of videos that could be used for face detection in images; however, the "Real and Fake Face-Detection" dataset combined with the "140K Real and Fake Faces" dataset includes both GAN-generated images and expert/human-generated images, and is considered one of the largest available face-image datasets. Together, the two datasets include 70,960 fake and 71,081 real images. As shown in Table 4, 70K of the fake images are GAN-generated and 960 of them are human expert-generated. Similarly, of the real images, 70K come from the "140K Real and Fake Faces" dataset and 1081 come from the "Real and Fake Face-Detection" dataset. The distribution of the human-generated fake images is not balanced with that of the GAN-generated images, but this is currently the largest human-generated image dataset available.

Algorithms
Model performance, in terms of separability and the ability to differentiate between classes, was evaluated using the accuracy, precision and recall rates shown in Table 5, the ROC curves shown in Figure 4, and the area under the ROC curve (AUC), F1-scores and AP results shown in Table 6. The comparison revealed that the VGG-19 model had the best performance of the three analyzed architectures, with an accuracy of 95%.
The results of this study demonstrate that VGG-19 can be a suitable choice not only for partial face images but also for full-face images, confirming the findings of [57]. The better performance of VGG-19 can be attributed to its pretraining on a wide variety of objects. AP was used as an aggregation function to summarize the precision-recall curve into a single value that represents the average of all precisions. Although VGG-19 had the highest accuracy, it had the lowest AP (95%) of all the analyzed models. The DenseNet architecture came close to VGG-19 on both the original and grayscale datasets, with 94% accuracy. The DenseNet results demonstrate that gray channel-based analysis does not have a large impact on accuracy when classifying images into the two categories of real and fake. Although DenseNet was second best in terms of performance, it achieved an AP of 99% on both the augmented and grayscale datasets, which contrasts slightly with the results found in [59] in terms of precision rate but aligns with the claims regarding detection time.
The Custom CNN architecture had the lowest accuracy level (89%) but the second-highest AP score after DenseNet. Augmented input reduced model performance and accuracy on both DenseNet and the Custom CNN by 5-22%; however, the Custom CNN performed better on augmented data than the DenseNet architecture did. The precision and recall rates of the DenseNet architecture trained on augmented data suggest that the final dense block that we coupled with the DenseNet classification layer did not have a positive impact on model behavior. The reduced performance on augmented data might be resolved by training the model for a larger number of epochs, since augmentation results in harder training samples.
Although VGG-19 performed very well, it was computationally very expensive, especially when fed augmented data, which aligns with the results from [60]. DenseNet was computationally more efficient than VGG-19 and the Custom CNN, which aligns with the results from [40]. The F1-score of the DenseNet architecture on grayscale data was the highest, reaching 97%, suggesting that it could be a suitable backbone when dealing with an unbalanced class distribution. VGG-19 achieved the second-highest F1-score (95%), and the Custom CNN on augmented data achieved the lowest (85%). Taking the F1-score as a measure that balances precision and recall, DenseNet on grayscale data might seem to be the better solution; however, since the dataset used for training in this analysis had a balanced class distribution, accuracy is a better judge in this analysis.
The results from the PCA-SVM classification demonstrated that VGG-19 was able to form distinctive clusters of fake and real images using the PCA vectors as a representation of the images (Figure 5). The Custom CNN and DenseNet architectures trained on the original and augmented datasets showed decent classification; however, DenseNet trained on grayscale images performed very poorly (Table 5). Overall, the results reveal that all the architectures were more efficient at detecting and classifying GAN-generated images, owing to the traces that GAN generators leave on the generated media. Although VGG-19 may not be the most computationally efficient model, it performed competitively better than the other analyzed models and showed a promising improvement when coupled with PCA-SVM classification layers.
This suggests that VGG-19 could be a more suitable backbone architecture for deepfake detection where the essential technical and legal requirements that determine evidence admissibility apply. Deepfakes are a threat to the admissibility of digital evidence in courts, and quick and effective detection of authentic media is critical in any criminal investigation. VGG-19 could be a fast solution for detecting deepfakes in courts; however, more datasets drawn from digital evidence must be tested and further experiments conducted.

Conclusions and Future Work
The results of our work demonstrated that deep-learning architectures are reliable and accurate at distinguishing fake vs. real images; however, detecting the remaining minor inaccuracies and misclassifications remains a critical area of research. Recent efforts have focused on improving the algorithms that create deepfakes by adding specially designed noise to digital photographs or videos that is not visible to the human eye and can fool face-detection algorithms [61]. The results of our work indicate that VGG-19 performed best, taking the accuracy, F1-score, precision, AUC-ROC and PCA-SVM measures into account. DenseNet performed slightly better in terms of AP, and the results from the Custom CNN trained on the original data were also satisfactory. This suggests that aggregating the results from multiple models, i.e., ensemble or multi-attention approaches, can be more robust in distinguishing deepfake media.
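One simple way to aggregate the results from multiple models is soft voting, in which the per-image fake probabilities of the individual models are averaged before thresholding. The sketch below is a hypothetical illustration of that idea and was not evaluated in this work; the probabilities, the equal model weights and the 0.5 threshold are assumptions.

```python
def soft_vote(probabilities_per_model, threshold=0.5):
    """Average per-model fake probabilities for each image, then threshold the mean."""
    n_models = len(probabilities_per_model)
    n_images = len(probabilities_per_model[0])
    averaged = [
        sum(model[i] for model in probabilities_per_model) / n_models
        for i in range(n_images)
    ]
    # Assumed encoding: 1 = fake, 0 = real.
    labels = [1 if p >= threshold else 0 for p in averaged]
    return averaged, labels


# Hypothetical fake probabilities for three images from three models.
vgg19 = [0.9, 0.4, 0.2]
densenet = [0.8, 0.6, 0.1]
custom_cnn = [0.7, 0.7, 0.3]
averaged, labels = soft_vote([vgg19, densenet, custom_cnn])
```

In this example the second image is labeled fake by the ensemble even though one model scores it below the threshold, which illustrates how averaging can offset an individual model's error.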
Future work could also leverage unsupervised clustering methods such as autoencoders, analyzing their effectiveness on the task of deepfake classification and providing a better interpretation of the CNN algorithms designed in this work. Classification methods could be developed that examine and flag images/videos uploaded by social media users before they are posted on the Internet, to avoid the spread of misinformation [62]. We plan to further improve performance with deep-learning algorithms and to explore the application of steganography, steganalysis and cryptography in the identification and classification of genuine and disguised face images [63]. Future work must include not only collecting and experimenting with different disguised classifiers but also developing training data that can improve the performance of the implemented architectures, as suggested by [33]. We also plan to explore the use of information pellets in the development of an ensemble framework. As suggested in [64], a patch-based fuzzy rough set feature-selection strategy can preserve the discrimination ability of the original patches, and such an implementation can assist anomaly detection for the task of deepfake detection. By integrating the local-to-global feature-learning method with a multi-attention and ensemble-modeling (holistic, feature-based, noise-level, steganographic) approach, we believe we can achieve performance superior to that of the current state-of-the-art methods. Considering the limitations of the Eff-YNet network developed by [55], which has an advantage in examining visual differences within individual frames, analyzing the performance of EfficientNet on the deepfake image datasets used in this work can be another direction for future work, as it may identify another suitable baseline model for ensemble approaches.