High Performance DeepFake Video Detection on CNN-Based with Attention Target-Specific Regions and Manual Distillation Extraction

Abstract: The rapid development of deep learning has enabled models that produce and synthesize hyper-realistic videos, known as DeepFakes. The growth of such forgery data has prompted concerns about malicious use, and detecting forged videos is a crucial subject in the field of digital media. Nowadays, most detection models are based on deep neural networks and vision transformers, with the SOTA model using an EfficientNetB7 backbone. However, due to their excessively large backbones, these models have the intrinsic drawback of being too heavy. In our research, a high-performance DeepFake detection model for manipulated video is proposed that ensures accuracy while keeping an appropriate weight. We inherited ideas from previous research on distillation methodology, but our proposal takes a different approach, with manual distillation extraction, target-specific region extraction, data augmentation, frame and multi-region ensembles, a CNN-based model, and flexible classification with a dynamic threshold. Our proposal can also reduce overfitting, a common and particularly important problem affecting the quality of many models. To analyze the quality of our model, we performed tests on two datasets. On the DeepFake Detection Challenge (DFDC) dataset, our model obtains an AUC of 0.958 and an F1-score of 0.9243, compared with the SOTA model's AUC of 0.972 and F1-score of 0.906; on the smaller Celeb-DF v2 dataset, it obtains an AUC of 0.978 and an F1-score of 0.9628.


Introduction
DeepFake is a type of artificial intelligence used to produce convincing picture, audio, and video forgeries. The main methods used to construct DeepFakes are based on deep learning and on training generative adversarial network (GAN) [1] architectures. Generative adversarial networks are deep learning techniques for training generative models, most commonly used for the generation of synthetic images. The GAN architecture involves two sub-models: a generator model for generating new examples and a discriminator model for classifying whether the generated examples are real or fake. The growth of GANs led to the development of a series of applications and sophisticated techniques, such as face swapping, face manipulation, and face synthesis, resulting in a rapidly increasing number of fake videos of high quality and complexity. DeepFake generation results have become increasingly realistic in recent years, making it harder for the naked eye to isolate the real from the fake. Multimedia content that has been tampered with is increasingly being utilized in a variety of cybercrime operations (also mentioned in Ferreira et al.'s research [2]). Fake news, disinformation, digital kidnapping, and ransomware-related crimes are only a few of the most common crimes perpetrated and disseminated using altered digital pictures and videos.
DeepFake detection solutions usually use multimodal detection approaches to evaluate whether target material has been altered or created synthetically. Existing approaches often focus on developing AI-based algorithms, such as the Vision Transformer [3,4], the two-stream neural network [5], and MesoNet (proposed by Afchar et al. [6]). However, less attention is paid to manual image processing that highlights the important regions of an image, so the model often has to process the entire video, making it heavier. To improve the DeepFake detection approach, we used both manual processing and an AI-based algorithm in this research. The most important information, regions, and features are carefully focused on and processed before being put into the deep neural networks. Concentrating on the most relevant elements not only reduces the needless learning burden on these networks, but also improves the overall model's accuracy.
The major concept of our article is to take advantage of a few of the most popular classification models to identify fake videos and to show how to transform DeepFake detection into a simpler classification problem. As existing classification models are designed for high accuracy, a reasonable selection of these models also increases the ability to solve the DeepFake detection problem. We also propose processing methods to convert the input sequences of DeepFake detection into the inputs of a basic two-class classification model (class "1" for real video and class "0" for fake video). Our proposal also accommodates processing steps that avoid losing essential features and that support synthesis afterwards.

Face Forgery Generation
Face forgery generation is one of the fields of image synthesis. The objective is to create new faces using generative adversarial networks (GANs) [1]. The most popular approach is StyleGAN [7], which makes it possible to control image synthesis via scale-specific modifications to the styles. With the growth of StyleGAN2 [8], which is based on data-driven unconditional generative image modeling, non-existent lifelike faces can be made with near-real sophistication and are often indistinguishable from real people, even though these people do not actually exist. An application that creates non-existent lifelike faces of people using StyleGAN2 is mentioned in [9]. The algorithms underpinning the AI are trained on publicly available pictures before being asked to generate fresh variants that satisfy the requisite level of realism. In addition, many synthesis programs [10] are available as open source and may be used by anyone.
Face swapping is currently the most common type of face modification. The DeepFake face swap method is built on two auto-encoders [11] with one common encoder, trained to reconstruct the source and target face training pictures, respectively. The aim of face swapping is to generate a new fake image or video of a person after swapping their face. Presently, a variety of approaches have been proposed; some prominent methods are RSGAN [12], FSGAN [13], DCGAN [14], PGGAN [15], FSNet [16], High Fidelity Identity Swapping [17], and StarGAN v2 [18]. Many datasets have been created based on face swaps, of which the standout is the DeepFake Detection Challenge (DFDC) dataset [19]. In addition, face manipulation has also been used in face forgery generation, as in MulGAN [20], MaskGAN [21], PuppetGAN [22], and HistoGAN [23].

Face Forgery Detection
Various techniques have been developed to identify fake synthetic pictures. In research on the detection of face manipulation [24], the authors used attention layers on top of feature maps to extract the manipulated regions of the face. Liu et al. [25] presented a novel spatial-phase shallow learning (SPSL) method, which combines the spatial image and the phase spectrum to capture the upsampling artifacts of face forgeries and enhance transferability. Zhao et al. [26] formulated DeepFake detection as a fine-grained classification problem and proposed a new multi-attentional DeepFake detection network that collects local important features more efficiently. Mazaheri et al. [27] suggested a system that uses image manipulation approaches and a mix of facial expression recognition to identify modification and localization in facial expressions. Chen et al. [28] proposed a lightweight DeepFake detection network using the successive subspace learning (SSL) principle to automatically extract important features from various parts of face images. Das et al. [29] utilized dynamic face augmentation to improve DeepFake detection performance and to prevent overfitting caused by the large amount of data generated from a limited number of distinct subjects. Kim et al. [30] presented a detection approach that combined color distribution analysis of the vertical edge region in a modified picture with pixel correction techniques to reduce the discomfort of pixel value discrepancies. In addition, many proposals related to vision transformers have been published, such as that of Wodajo et al. [3], who proposed a convolutional vision transformer, a combination of a convolutional neural network (CNN) and a vision transformer (ViT), to learn features and categorize them using an attention mechanism. Heo et al. [4] proposed a model with EfficientNetB7 [31] as its basic backbone.
All of the methods mentioned above are based on deep learning, but a variety of approaches rely on non-deep-learning methods too. Yang et al. [32] used an SVM classifier to detect splicing-generated facial regions. Matern et al. [33] applied multi-layer perceptron (MLP) classifiers and extracted landmarks to detect fake videos. In addition, Kwon et al. [34,35] proposed friend-safe adversarial examples for face recognition systems as well as system protection.

Proposed Methodology
In this section, we describe the architecture for DeepFake detection based on a classifier network with manual attention on target-specific regions to create a distillation set, which not only improves classification accuracy but also allows the use of a lighter backbone. In Section 3.1, we introduce the image processing steps used to manually create a set of important data, called the manual distillation set in this paper, focused on special regions. In addition, we provide a model structure used as a normal image classification model, which generates features for each domain to facilitate synthesis in the next step. In Section 3.3, we discuss how to merge several frames and multiple regions of images before deciding on the final result.

Image Preprocessing
The objective of this section is to preprocess the picture before it is fed into the next stages. This part is critical because the quality of the data processing here influences the quality of the entire process moving forward.

Face Extraction
The first step in this process is person detection [36], implemented with the open library "OpenCV" [37]. Person detection is required because it helps identify a real person's face and avoid situations where faces come from a non-real person or object (something resembling a person, such as a statue). This phase also aims to reduce face detection mistakes as much as possible. After identifying the main people in a video, we extract the faces of the people that have been detected, using multitask cascaded convolutional networks (MTCNN) [38]. These steps significantly reduce the amount of misleading data in the dataset that could fool the model. An example of the inclusion of such fictitious photographs is shown in Figure 1. Typically, some approaches concentrate only on face extraction, such as Wodajo et al. [3], Heo et al. [4], and Selim et al. [39]. These models sometimes identify the face of a non-real person, which can significantly and adversely affect the quality of the model.
Figure 1. Example images taken from a video labeled "REAL" in the DFDC [19] dataset, including the real person along with many small pictures containing fake faces that are not the main person. The two images were captured at different times in the same video.
Person detection is currently very popular, with several models far superior to the "OpenCV" model, such as high-level semantic feature detection [40] and person detection for the far field [41]. However, a library that is not very effective at person detection was picked because it strictly constrains the input data when detecting people, which yields well-extracted faces in the output. For the same reason, the confidence level for recognizing those faces is likewise set high when utilizing the MTCNN [38] face extraction library.
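The two-stage filtering described above can be sketched as follows: keep only high-confidence face detections and discard faces that do not fall inside a detected-person box. The function names and the box format `[x1, y1, x2, y2]` are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def select_faces(boxes, probs, min_conf=0.9):
    """Keep only face boxes whose detection confidence meets the threshold.

    boxes: (K, 4) array of [x1, y1, x2, y2]; probs: (K,) confidences,
    e.g. as returned by an MTCNN-style detector.
    """
    boxes = np.asarray(boxes, dtype=float)
    probs = np.asarray(probs, dtype=float)
    keep = probs >= min_conf
    return boxes[keep], probs[keep]

def inside_person(face_box, person_boxes):
    """Check that a face box lies inside at least one detected-person box,
    discarding faces that do not belong to a real detected person
    (e.g. posters, statues, or background photographs)."""
    x1, y1, x2, y2 = face_box
    for px1, py1, px2, py2 in person_boxes:
        if x1 >= px1 and y1 >= py1 and x2 <= px2 and y2 <= py2:
            return True
    return False
```

A face is forwarded to patch extraction only if it passes both filters.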

Face Augmentation
The overfitting problem must always be considered carefully on the DFDC dataset. Overfitting occurs when a model learns the information and noise in the training data to the point that it degrades the model's performance on new data. This means the model learns the DFDC training data too well, but the outcomes on the test data are not as good as expected. One method for resolving this issue is augmentation [42]; previous research has also found that data augmentation can help mitigate this negative effect [29]. It is a crucial approach for generating more usable data and improving the model's quality during training. The augmentation methods used in this proposal mostly focus on information dropping [43,44], illustrated in Figure 2, applied to the meaningful regions of the face used to distinguish the real from the fake, such as the eyes, nose, and mouth. These important regions, illustrated in Figure 3, are dropped randomly to increase data diversity.
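A minimal sketch of this information-dropping augmentation, assuming the bounding box of a meaningful region (eye, nose, or mouth) is already known; the drop probability is an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

def drop_region(face, region_box, p=0.5):
    """Randomly erase one meaningful facial region by filling its bounding
    box with zeros, an information-dropping augmentation in the spirit of
    Cutout/GridMask [43,44].

    face: 2D (or HxWxC) image array; region_box: (x1, y1, x2, y2).
    """
    out = face.copy()
    if rng.random() < p:
        x1, y1, x2, y2 = region_box
        out[y1:y2, x1:x2] = 0
    return out
```

Applying this independently to each important region produces the diverse variants used during training.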


Patch Extraction
Several essential characteristics of the face, notably the eyes, nose, and mouth, which

Patch Extraction
Several essential characteristics of the face, notably the eyes, nose, and mouth, which are difficult to represent, are used to distinguish between fake and real faces. This content is also emphasized in the research content [45]. So as to solve this problem in the preprocessing section it will be handled tightly, facial landmarks are extracted from each extracted face by using "OpenFace2" [46] an open-source toolbox. As the input to our networks, we harvest regions of interest (RoI). Our goal is to reveal the artifacts that exist between the false face and the surrounding. The rectangular regions that encompass both the face and surrounding areas are chosen as the RoIs. Specifically, we use facial landmarks to determine the RoIs. The rectangular regions box can be defined as Equation (1).

Patch Extraction
Several essential characteristics of the face, notably the eyes, nose, and mouth, which are difficult to reproduce, are used to distinguish between fake and real faces. This point is also emphasized in [45]. To handle this tightly in the preprocessing stage, facial landmarks are extracted from each extracted face using "OpenFace2" [46], an open-source toolbox. As the input to our networks, we harvest regions of interest (RoIs). Our goal is to reveal the artifacts that exist between the false face and its surroundings, so rectangular regions that encompass both the face and the surrounding areas are chosen as the RoIs. Specifically, we use facial landmarks to determine the RoIs. The rectangular region box is defined in Equation (1).

After extracting facial landmarks, face patches are cropped from different parts of the face, including the left eye, the right eye, the nose, and the mouth. The RoIs are resized to 32 × 32.
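The RoI construction from landmarks can be sketched as below; the margin factor, standing in for the exact enlargement of Equation (1), is an assumption:

```python
import numpy as np

def roi_box(landmarks, margin=0.25):
    """Rectangular RoI around a set of facial landmarks, enlarged by a
    margin so it covers both the region and its surroundings (a sketch of
    the region definition in Equation (1); the margin value is an
    assumption).

    landmarks: iterable of (x, y) points for one region (eye, nose, mouth).
    Returns integer (x1, y1, x2, y2) suitable for cropping.
    """
    pts = np.asarray(landmarks, dtype=float)
    x1, y1 = pts.min(axis=0)
    x2, y2 = pts.max(axis=0)
    mx, my = margin * (x2 - x1), margin * (y2 - y1)
    return (int(x1 - mx), int(y1 - my),
            int(np.ceil(x2 + mx)), int(np.ceil(y2 + my)))
```

Each resulting crop is then resized to 32 × 32 before entering the network.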

Distillation Set
The idea of distillation through attention [48] comes from Facebook AI, who introduced a teacher-student strategy specific to transformers. It is based on a distillation token that ensures the student learns from the teacher through attention. The distillation token functions similarly to a class token in that it interacts with other embeddings via self-attention, allowing their model to learn from the output of the teacher through the self-attention layers. Instead of using model self-attention throughout the training process to instruct the model to focus on important regions, we used this idea to develop a model that learns locally during the preprocessing steps, before the actual training process. A distillation set includes many patches, which are grouped into sets designating which parts of the face will be trained and which pieces of the face will be learned at which classifier network input. This arrangement into distillation sets increases training accuracy and makes the region ensemble convenient.
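One possible data layout for a distillation set sample, grouping the extracted face, its augmented variants, and the region patches under a single label; the field names are illustrative, not the paper's exact format:

```python
def build_distillation_set(face, aug_faces, patches, label):
    """Group one extracted face, its augmented variants, and the per-region
    patches into a single 'distillation set' sample.

    face: 128x128 extracted face; aug_faces: list of information-dropped
    variants; patches: [left_eye, right_eye, nose, mouth] 32x32 crops;
    label: 1 = real, 0 = fake.
    """
    return {
        "face": face,
        "augmented": aug_faces,
        "patches": {
            "left_eye": patches[0],
            "right_eye": patches[1],
            "nose": patches[2],
            "mouth": patches[3],
        },
        "label": label,
    }
```

Grouping per-region patches this way makes it straightforward to route each part of the face to the right classifier input and later to the region ensemble.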

Classifier Network Architecture
Through the previous steps, we obtain special features and distillation sets containing much useful information for training and inference. These materials, used to differentiate the real from the fake in a frame, concentrate mostly on the face and its important regions, so training on the entire frame is unnecessary. If we used all the information in an image, it would dilute the essential information the model needs to focus on, which may adversely affect the quality of the model or cause its size to grow. The input to the classifier network is the distillation sets; each distillation set contains the extracted face, its augmentations, and multi-region face patches, along with a "Real" or "Fake" label. At this step, the problem essentially becomes a two-class classification problem.
The design is depicted in Figure 4. In this section, two main backbones, InceptionV3 [49] and MobileNet [50], are utilized without their last fully connected layers. The InceptionV3 network is used to train on the entire face as well as the outcomes of face augmentation, while the MobileNet component trains on the face patches. The outputs of these two components are concatenated and fed into global average pooling, as suggested by Min Lin [51]. We now have feature sets containing the information needed to discriminate between real and fake faces, as well as the significant regions.
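The fusion step after the two backbones can be sketched with plain arrays: concatenate the two feature maps along the channel axis and apply global average pooling (the shapes here are illustrative, not the backbones' actual output sizes):

```python
import numpy as np

def fuse_features(incep_fmap, mobile_fmap):
    """Concatenate the InceptionV3-branch and MobileNet-branch feature maps
    along the channel axis, then apply global average pooling to obtain a
    single feature vector per sample.

    incep_fmap: (H, W, C1) array; mobile_fmap: (H, W, C2) array.
    Returns a (C1 + C2,) feature vector.
    """
    fused = np.concatenate([incep_fmap, mobile_fmap], axis=-1)  # (H, W, C1+C2)
    return fused.mean(axis=(0, 1))                              # average over H, W
```

Global average pooling replaces the fully connected layers the backbones were stripped of, keeping the classifier head light.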

If we analyzed and evaluated the output of the classifier network (the feature sets) to decide "Real" or "Fake" directly, we would obtain DeepFake detection for images. However, because the frames and regions are not yet combined, the results are often lower than those of the model with the combination and classification module.

Combination and Classification Module
This part chooses and synthesizes the feature sets of frames in a video, synthesizes the regions of faces, and changes the virtual threshold levels to make the final decision for videos, as depicted in Figure 4.


Frame Ensemble
The model gathers feature sets of N frames per video. By using only a subset of the frames, we reduce the input size. Because video lengths vary widely, extracting N frames per second would also result in a large variation in input lengths. Although this could be solved with padding, we would end up with inputs mostly composed of zeros, which in turn could lead to poor training performance. Therefore, N frames per video, independent of the video length, was selected. A comparison of the two video sampling methods is shown in Figure 5.
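The fixed-N sampling above can be sketched as uniform index selection over the whole video; the exact sampling rule the paper uses may differ:

```python
def sample_frame_indices(total_frames, n=40):
    """Pick n frame indices spread uniformly over the whole video, so every
    video yields the same input length regardless of its duration
    (n = 40 as used in the experiments); short videos keep all frames."""
    if total_frames <= n:
        return list(range(total_frames))
    step = total_frames / n
    return [int(i * step) for i in range(n)]
```

Unlike per-second sampling, this never needs zero padding, so no input is dominated by padding values.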


Multi-Region Ensemble
After the N frames are synthesized, they are put into the multi-region ensemble for synthesis and evaluation; region information is also collected from the feature sets and then pooled to derive the video classification result, evaluating whether the video is real or fake.
Most prior research has relied on a threshold parameter, representing the probability that a video is fake, to synthesize information between the frames of a video and make the final decision about whether the video is real or fake. This threshold is generally set to a fixed value, which might make the model more sensitive to either fake or real videos, lowering overall model quality. In our proposal, the threshold for distinguishing real from fake is changed dynamically to match the trained parameters, enhancing the capacity to properly predict the outcome.
In this section, we use a simple logistic regression model, where the input is portrayed by a value for each video along with the ground-truth label of the video. It can be represented as X = [x_1, x_2, ..., x_M] ∈ R^(d×M), with x_i = [a_i^1, a_i^2, ..., a_i^N], and Y = [y_1, y_2, ..., y_M], where M is the number of samples, N is the number of selected frames per video for the combination and classification module, and a_i^j is the probability-of-fake value of frame j of video i (so d = N).
The output of logistic regression is written in the following format (2): z_i = f(w^T x_i), with f(s) = 1 / (1 + e^(−s)), where f is the logistic function and w is the weight vector. Assuming that the probability that a data point x falls into class 1 is f(w^T x) and into class 0 is 1 − f(w^T x), we can represent it as follows (3): P(y_i | x_i; w) = f(w^T x_i)^(y_i) (1 − f(w^T x_i))^(1−y_i). In this case, we use Stochastic Gradient Descent (SGD). The loss function for (x_i, y_i) is (4): J(w; x_i, y_i) = −(y_i log z_i + (1 − y_i) log(1 − z_i)).
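A minimal sketch of this frame-ensemble logistic regression, trained by SGD on the per-frame probability values following Equations (2)-(4); the learning rate, epoch count, and the added bias term are assumptions:

```python
import numpy as np

def sigmoid(s):
    """Logistic function f(s) = 1 / (1 + e^-s) of Equation (2)."""
    return 1.0 / (1.0 + np.exp(-s))

def sgd_logistic(X, Y, lr=0.1, epochs=500, seed=0):
    """Fit the weight vector w by SGD on the cross-entropy loss of
    Equation (4). X is (d, M): one column of per-frame probability values
    per video; Y holds the M ground-truth labels. A bias feature is
    appended (an assumption; the paper does not state one)."""
    rng = np.random.default_rng(seed)
    Xb = np.vstack([X, np.ones((1, X.shape[1]))])  # append bias row
    w = np.zeros(Xb.shape[0])
    for _ in range(epochs):
        for i in rng.permutation(Xb.shape[1]):
            z = sigmoid(w @ Xb[:, i])
            # gradient of -(y log z + (1 - y) log(1 - z)) w.r.t. w
            w -= lr * (z - Y[i]) * Xb[:, i]
    return w

def classify_video(w, x):
    """Dynamic decision: the learned w replaces a fixed probability
    threshold on the per-frame values."""
    xb = np.append(np.asarray(x, dtype=float), 1.0)
    return int(sigmoid(w @ xb) >= 0.5)
```

Because the decision boundary is learned from the trained parameters rather than fixed in advance, the effective threshold adapts to the data, which is the dynamic-threshold behavior described above.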

Limitation
The accuracy of this part is highly reliant on the quality of the classifier network in the previous section. If the classifier network's accuracy is high, the whole model's quality will improve considerably; conversely, if this part's quality is low, it will have a resonance effect, causing the results to deteriorate.

Experiments
This section describes the datasets as well as the parameters in depth, and compares our model to the SOTA model and some current models. In Section 4.1, we discuss the datasets. In Section 4.2, we discuss the parameter setup and configuration environment necessary for the training process, and in Section 4.3, the outcomes of the experiment are analyzed and the proposal is compared to existing research.

Dataset
We evaluate the performance of our model on two DeepFake video datasets: the smaller dataset Celeb-DF v2 [52], and the bigger dataset DFDC [19]. The goal of testing on both types of datasets is to check that the model works effectively on both small and large datasets, as well as to compare the differences in output between the two datasets.
The Celeb-DF v2 [52] dataset contains real and DeepFake synthesized videos and greatly extends the previous Celeb-DF v1. Currently, Celeb-DF v2 includes 590 original videos collected from YouTube, with subjects of different ages, ethnic groups, and genders, and 5639 corresponding DeepFake videos.
The DFDC [19] dataset is currently the largest DeepFake dataset available on the Internet, with over 100,000 clips in total, generated by GAN-based face swapping. For inference on the DFDC dataset, we evaluated on the 5000-video test set.

Image Preprocessing
As explained in Section 3, we use MTCNN with strict conditions, extracting a face only if its confidence is greater than 90 percent, then cropping and resizing extracted faces to 128 × 128. Then, 68 facial landmarks are detected, and based on these landmarks, the left eye, the right eye, the nose, and the mouth are each cropped to 32 × 32.

Training Parameters
We train the classifier network with a batch size of 64 for 30 epochs. We use the Adam optimizer with a learning rate of 0.001, an exponential decay rate of 0.9 for the 1st-moment estimates (beta_1) and 0.999 for the 2nd-moment estimates, without the epsilon argument. In the combination and classification module, we choose 40 frames per video. We did not use all the frames of each video because, at 25 or 30 frames per second, most frames look alike. By using only a subset of the frames, we reduce the input size and therefore speed up the training process. Furthermore, in the Celeb-DF v2 and DFDC datasets, most videos are quite short (usually a few seconds), so 40 frames per video is appropriate for the video lengths while also keeping training time reasonable. Regarding hardware, we trained and evaluated on an NVIDIA GeForce RTX 2080 Ti graphics card.
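For reference, the stated optimizer settings correspond to the following configuration, shown here in Keras syntax as an assumption, since the paper does not name the framework it used:

```python
from tensorflow.keras.optimizers import Adam

# Adam with the stated decay rates; no explicit epsilon argument,
# so the framework default is used.
optimizer = Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)

BATCH_SIZE = 64        # classifier-network batch size
EPOCHS = 30            # training epochs
FRAMES_PER_VIDEO = 40  # frames selected per video for the ensemble module
```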

Evaluation Metrics
Recall, Precision, F1-score, and, most importantly, the AUC-ROC curve are extremely useful measurements here. They can be calculated from the confusion matrix shown in Table 1. Each column in the confusion matrix represents the instances of a predicted class, while each row represents the instances of an actual class. The positive class refers to the original, unmanipulated images, while the negative class represents the manipulated ones. True positive (TP) represents the outcome where the model correctly predicts the positive class; similarly, true negative (TN) represents the outcome where the model correctly predicts the negative class. False positive (FP) represents the outcome where the model incorrectly predicts the positive class, and false negative (FN) the outcome where the model incorrectly predicts the negative class.
Precision measures the percentage of positive identifications that are actually correct and is defined as (5): Precision = TP / (TP + FP). Recall measures the percentage of actual positives that are identified correctly and is defined as (6): Recall = TP / (TP + FN). F1-score is the harmonic mean of precision and recall; it is difficult to compare two models when one has low precision and high recall or vice versa, and F1-score measures Recall and Precision at the same time. It is defined as (7): F1 = 2 × Precision × Recall / (Precision + Recall). True positive rate (TPR) is a synonym for recall and is defined as (8): TPR = TP / (TP + FN). False positive rate (FPR) is defined as (9): FPR = FP / (FP + TN). Accuracy measures the percentage of correct predictions and is defined as (10): Accuracy = (TP + TN) / (TP + TN + FP + FN).
Training results of the classifier network are shown in Figures 6 and 7. This step basically becomes a simple two-class classification problem ("Real" and "Fake"), so it does not take long; most of the time is spent on distillation set extraction. The training outputs for the Celeb-DF v2 dataset are shown in Figure 6 and for the DFDC dataset in Figure 7. The performance on the training set is usually better than on the validation set; for example, the accuracy on the training set is approximately equal to 1. The accuracy on the validation set is not much lower than on the training set; detailed values are listed in Table 2. This demonstrates that the model has partially solved the overfitting problem mentioned in Section 3.1.
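The metric definitions of Equations (5)-(10) can be computed directly from the confusion-matrix counts:

```python
def metrics(tp, tn, fp, fn):
    """Evaluation metrics of Equations (5)-(10) from confusion-matrix
    counts: precision, recall, F1-score, TPR, FPR, and accuracy."""
    precision = tp / (tp + fp)                                # Eq. (5)
    recall = tp / (tp + fn)                                   # Eq. (6)
    f1 = 2 * precision * recall / (precision + recall)        # Eq. (7)
    tpr = recall                                              # Eq. (8)
    fpr = fp / (fp + tn)                                      # Eq. (9)
    accuracy = (tp + tn) / (tp + tn + fp + fn)                # Eq. (10)
    return {"precision": precision, "recall": recall, "f1": f1,
            "tpr": tpr, "fpr": fpr, "accuracy": accuracy}
```

Sweeping the decision threshold and plotting TPR against FPR yields the ROC curve whose area (AUC) is reported below.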
We can also measure frame-by-frame quality using the feature sets at the output of the classifier network. However, we are primarily interested in DeepFake detection for video, so the results of the classifier network are used in the combination and classification module. Figure 8 illustrates the ROC curve for the validation set; the good performance of the classifier can also be seen in the curve being pushed towards the upper-left corner of the graph, which means the area under the ROC curve is also higher. Figure 9 illustrates the confusion matrix of our model on the Celeb-DF v2 dataset.

Our model is compared to the current state-of-the-art model [39], and its performance is equivalent to SOTA. We used the train split to train our model and the validation split to test it. In Figure 10, the confusion matrix of our model is compared to that of the SOTA model [39]: the left side shows the SOTA model's predictions as described in Heo et al. [4] and the right side shows ours, where top-right, top-left, bottom-right, and bottom-left correspond to false positive, true negative, true positive, and false negative, respectively. Our model achieves an AUC of 0.958 and an F1-score of 0.9243, comparable to the SOTA model's AUC of 0.972 and F1-score of 0.906. However, since the SOTA model is built on EfficientNet, which is often very heavy, our proposed method has the advantage of a lighter architecture while maintaining the same level of quality.
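The combination and classification module aggregates the classifier's per-frame outputs into one video-level decision. The paper's exact aggregation scheme is not reproduced here; the sketch below assumes simple averaging of per-frame fake probabilities followed by thresholding, and all names are illustrative:

```python
# Hedged sketch of the video-level combination step: the classifier emits one
# fake-probability per sampled frame, and these are averaged into a single
# video score before thresholding. The averaging scheme is an assumption,
# not the paper's exact module.

from statistics import mean

def video_score(frame_scores):
    """Average per-frame fake probabilities into one video-level score."""
    return mean(frame_scores)

def classify_video(frame_scores, threshold=0.5):
    """Return 'Fake' if the averaged score exceeds the decision threshold."""
    return "Fake" if video_score(frame_scores) > threshold else "Real"

print(classify_video([0.9, 0.8, 0.95, 0.7]))  # → Fake
print(classify_video([0.1, 0.2, 0.05, 0.3]))  # → Real
```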
Table 3 compares our method with related studies. For the Celeb-DF v2 dataset, most of the studies in [5,6,47,53,54] do not achieve a high AUC. A notable exception is Chen et al. [28], which achieved a fairly good result of 90.56% AUC on this dataset. Moreover, their outcome is also superior in terms of the number of parameters, since their research goal is to create a lightweight model that can operate on systems with minimal hardware requirements. On this dataset, our method achieved an AUC of 97.8%, which is slightly better than theirs; however, our fundamental model is based on a deep neural network for classification, so its number of parameters is larger. For the DFDC dataset, the best result is from the research of Heo et al. [4] with 97.8% AUC. Their idea is similar to the SOTA model [39]: it is also based on the Vision Transformer model and EfficientNet with a distillation methodology. Thus, their model must have a large number of parameters, although these are not mentioned in their paper. On this dataset our result is a bit lower than theirs, but we have the advantage in terms of the number of parameters.

Table 4 depicts the results obtained from benchmarking on the Celeb-DF v2 dataset and the DFDC dataset. Notable in the table is the F1-score on DFDC: the proposed model is robust in keeping a fair assessment between fake videos and real videos, with an F1-score of 0.9243, slightly higher than the 0.919 of the Heo et al. [4] proposal. This might be due to the dynamic threshold described in Section 3.3.
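One plausible reading of the dynamic threshold of Section 3.3 is to pick, on a validation set, the decision threshold that maximises F1 instead of using a fixed 0.5 cut-off. The sketch below implements that reading; the exact criterion in the paper may differ, and the candidate grid, function name, and label convention (fake = 1 as the positive class here) are illustrative assumptions:

```python
# Sketch of a dynamic threshold: scan candidate thresholds on validation data
# and keep the one with the highest F1 (fake = positive class in this sketch).

def best_threshold(scores, labels, candidates=None):
    """Return the candidate threshold with the highest F1 on (scores, labels)."""
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]

    def f1_at(t):
        preds = [1 if s > t else 0 for s in scores]
        tp = sum(1 for p, l in zip(preds, labels) if p == 1 and l == 1)
        fp = sum(1 for p, l in zip(preds, labels) if p == 1 and l == 0)
        fn = sum(1 for p, l in zip(preds, labels) if p == 0 and l == 1)
        return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

    return max(candidates, key=f1_at)

scores = [0.2, 0.4, 0.6, 0.9, 0.1, 0.8]
labels = [0,   0,   1,   1,   0,   1]
print(best_threshold(scores, labels))  # → 0.4
```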

Conclusions
In this paper, we have proposed a method for DeepFake detection: a CNN-based model with high performance and lighter weight compared to current models such as SOTA [39] and the combined Vision Transformer and EfficientNet [4]. We obtained an AUC of 0.958 and an F1-score of 0.9243 on the DFDC dataset, and an AUC of 0.978 and an F1-score of 0.9628 on Celeb-DF v2, with 26M parameters. By combining a manual technique with an AI-based algorithm, we improved the DeepFake detection method. The most important information, regions, and features were carefully cleaned and processed before being fed into the deep neural networks. Important preprocessing steps were also recommended to considerably improve the model's quality. In future work, we will look for ways to enhance the model so that we can continue to propose lighter models while also improving accuracy.