2.3.1. Datasets
The datasets used for training and validation in our studies originate mostly from FaceForensics++ [16], supplemented by additional data from Celeb-DF (V2) [17]. For testing, we used not only the two aforementioned datasets but also the dataset introduced by Baru et al. [18], a novel benchmark created from CelebA-HQ [51] using diffusion models.
The original FaceForensics++ dataset contains four synthesis methods: Deepfakes, FaceSwap, Face2Face, and NeuralTextures, along with a fifth method, FaceShifter, introduced after the release of Rossler et al.'s work [16]. It includes 1000 original videos and 5000 manipulated videos, with each original video processed by all five methods, yielding a total of 6000 videos and over 500,000 frames. The dataset also offers three video quality tiers, raw, c23, and c40, from highest to lowest quality, where c23 and c40 denote different levels of compression applied to the original videos. To reflect the real-world scenario in which many videos are not of high quality, we trained our model on the medium-quality (c23) videos, which were also used for testing.
The Celeb-DF dataset is dedicated exclusively to Deepfakes, comprising 590 real celebrity videos and 5639 generated videos. Unlike FaceForensics++, which also provides a validation split, Celeb-DF is divided into only two subsets: training and testing. The Celeb-DF dataset is more challenging because of the disparity between synthetic and real videos: the number of DeepFake videos is roughly ten times that of the original videos, reflecting, in inverted form, the imbalance of a more realistic scenario [52].
To address the imbalance between the numbers of original and manipulated videos, we employed a dynamic stratified random sampling method on FaceForensics++ to select the training set. This involved sampling different fake videos from each of the five methods, using videos with unique IDs, to ensure a near-equal distribution of real and fake videos. Alongside this balanced dataset, we constructed a mixed dataset that combines the Deepfakes subset of FaceForensics++ with a portion of the Celeb-DF dataset, comprising 1200 original videos and 3600 manipulated videos, for training during the hyperparameter sensitivity analysis. The validation set was randomly chosen from the original Deepfakes subset's validation set to maintain an imbalance similar to that of the training set. Detailed statistics and splits are presented in Table 1.
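For illustration, the following minimal sketch shows one way such stratified, ID-aware sampling could be implemented; the helper name sample_balanced_training_set, the per-method quota, and the source_id field are assumptions made for the example rather than the exact released code.

```python
import random

def sample_balanced_training_set(real_ids, fakes_by_method, seed=0):
    """Sketch: draw roughly the same number of fake videos from each
    manipulation method so that the totals of real and fake videos are
    near-equal, while keeping the fakes' source IDs unique."""
    rng = random.Random(seed)
    per_method = len(real_ids) // len(fakes_by_method)   # fakes drawn per method
    selected_fakes, used_ids = [], set()
    for method, candidates in fakes_by_method.items():
        # keep only fake videos whose source ID has not been used yet
        pool = [v for v in candidates if v["source_id"] not in used_ids]
        picked = rng.sample(pool, min(per_method, len(pool)))
        used_ids.update(v["source_id"] for v in picked)
        selected_fakes.extend(picked)
    return list(real_ids), selected_fakes
```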
Although both FaceForensics++ and Celeb-DF are GAN-based datasets, evaluating on them alone may not sufficiently cover contemporary methodologies such as diffusion models. Consequently, we employed the dataset introduced by Baru et al. [18], which uses three advanced diffusion models, DDPM [53], DDIM [54], and LDM [27], to produce altered faces from CelebA-HQ [51]. The published dataset contains 10,000 images for each model, resulting in a total of 30,000 manipulated images, along with more than 1500 original photographs. This dataset poses greater challenges than the preceding two, as it covers the latest generation of face forgery techniques and offers a more contemporary simulation of real-world conditions.
2.3.2. Implementation Details
Data Preprocessing. The following preprocessing procedures were applied to each video in the datasets. First, we employed OpenCV as the video reader to extract frames from the videos, following the official procedure provided by FaceForensics++. We then used MTCNN [55] to detect faces in the frames, align them to the center of the image, and crop them with a 20-pixel margin on each side to prevent edge loss. The cropped images were then resized to 450 × 450 pixels and center-cropped to 224 × 224 pixels to fit the input dimensions of the Efficient ViT. We also applied data augmentation in the form of random color jittering, including brightness and contrast adjustments, with a small probability of 0.01, since these transformations can affect the frequency features that are critical to the outputs. Note that fake videos are treated as the positive class, contrary to the original labeling of Celeb-DF, since our model's objective is to identify forgeries rather than to focus on the original videos.
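For illustration, the sketch below mirrors this pipeline; the use of the facenet_pytorch MTCNN implementation, the jitter strengths, and the per-video frame budget are assumptions made for the example rather than the exact settings.

```python
import cv2
import torch
from facenet_pytorch import MTCNN  # assumed MTCNN implementation
from PIL import Image
from torchvision import transforms

mtcnn = MTCNN(select_largest=True, post_process=False)

# Mild color jitter applied with a very low probability (0.01), since strong
# photometric changes could distort the frequency cues.
augment = transforms.RandomApply(
    [transforms.ColorJitter(brightness=0.2, contrast=0.2)], p=0.01)
to_input = transforms.Compose([
    transforms.Resize((450, 450)),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

def preprocess_video(path, margin=20, max_frames=32):
    """Read frames with OpenCV, detect and crop faces with MTCNN using a
    20-pixel margin, then map them to 224 x 224 model inputs."""
    cap, faces = cv2.VideoCapture(path), []
    while len(faces) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        img = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        boxes, _ = mtcnn.detect(img)
        if boxes is None:
            continue
        x1, y1, x2, y2 = (int(v) for v in boxes[0])
        crop = img.crop((max(x1 - margin, 0), max(y1 - margin, 0),
                         x2 + margin, y2 + margin))
        faces.append(to_input(augment(crop)))
    cap.release()
    return torch.stack(faces) if faces else None
```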
Training Strategy. We proposed a phased training strategy to address the imbalanced-data issue while retaining conventional end-to-end training and ensuring a stable training process. We defined two parameters that dynamically regulate the sampling process during training, a fixed sample ratio and a novelty ratio, and divided the entire training procedure into two stages:
Initial Training Stage: We randomly select a set of examples for training, disabling the orthogonal loss to prioritize establishing a stable foundation for the model. At this stage, the fixed sample ratio is kept at 1 and the novelty ratio at 0, meaning that only the specified samples are used while the remaining samples are held in a dynamic pool for subsequent stages.
Dynamic Sample Stage: As training advances to this stage, the two ratios begin to adjust linearly, facilitating the transition from fixed sampling to dynamic sampling. The samples in the dynamic pool are first ordered by their usage frequency, with less frequently used samples receiving higher priority for selection. The integration of the selected samples and subsequent shuffling ensures that the final training set provides a thorough representation of the dataset while mitigating the risk of overfitting to frequently encountered samples. The orthogonal loss is also activated at this stage.
A comprehensive explanation of the sampling strategy is presented in Algorithm 1. In its notation, the complete dataset, the fixed sample set, and the dynamic pool are distinguished; one buffer stores the least-used samples, its size governed by the novelty ratio, whereas another comprises samples drawn at random from the dynamic pool. The training batch is formed by integrating these three sets and is subsequently used for training. The counter c records how often each sample has been used, and a separate symbol denotes the output of the model. Figure 5 intuitively illustrates the dynamic sample allocation by depicting how the fixed sample ratio and the novelty ratio change during the training phase.
The batch size was set to 8 videos for video-level training and 1000 frames for frame-level training. The model was trained for 30 epochs, starting from an initial learning rate that was gradually reduced by cosine annealing toward the training's conclusion. We also adopted the AdamW optimizer for the overall network, together with weight decay.
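A minimal sketch of this optimization setup is shown below; the concrete learning rate, minimum rate, and weight decay values are placeholders, as the exact numbers are not restated here.

```python
import torch
import torch.nn as nn

# Placeholder values for illustration only.
INIT_LR, MIN_LR, WEIGHT_DECAY, EPOCHS = 1e-4, 1e-6, 1e-2, 30

model = nn.Linear(10, 2)  # stand-in for the detection network
optimizer = torch.optim.AdamW(model.parameters(), lr=INIT_LR,
                              weight_decay=WEIGHT_DECAY)
# Cosine annealing smoothly decays the learning rate over the 30 epochs.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=EPOCHS, eta_min=MIN_LR)

for epoch in range(EPOCHS):
    # ... one epoch of training would run here ...
    scheduler.step()
```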
Algorithm 1. Two-Stage Training with Dynamic Sample Allocation. The algorithm describes the details of the strategy, together with the hyperparameters we set for training.
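For illustration, a minimal sketch of how such a two-stage allocation could be implemented is given below; the linear ramp schedule, the variable names, and the even split between least-used and randomly drawn pool samples are assumptions for the example rather than the exact procedure of Algorithm 1.

```python
import random

def novelty_ramp(epoch, stage_switch, total_epochs):
    """Novelty ratio: 0 during the initial stage, then a linear ramp to 1."""
    if epoch <= stage_switch:
        return 0.0
    return min((epoch - stage_switch) / (total_epochs - stage_switch), 1.0)

def build_epoch_samples(fixed_set, dynamic_pool, usage, epoch,
                        stage_switch=10, total_epochs=30, seed=0):
    """Two-stage allocation: before stage_switch only the fixed set is used;
    afterwards the fixed ratio shrinks while least-used and random samples
    from the dynamic pool are mixed in. `usage` maps samples to counter c."""
    rng = random.Random(seed + epoch)
    novelty = novelty_ramp(epoch, stage_switch, total_epochs)
    n_fixed = int((1.0 - novelty) * len(fixed_set))
    n_new = int(novelty * len(fixed_set))

    chosen_fixed = rng.sample(fixed_set, n_fixed)
    # least-used pool samples receive priority ...
    least_used = sorted(dynamic_pool, key=lambda s: usage[s])[:n_new // 2]
    # ... complemented by random draws from the remaining pool
    rest = [s for s in dynamic_pool if s not in least_used]
    random_part = rng.sample(rest, min(n_new - len(least_used), len(rest)))

    epoch_samples = chosen_fixed + least_used + random_part
    rng.shuffle(epoch_samples)
    for s in least_used + random_part:
        usage[s] += 1  # update the usage counter c
    return epoch_samples
```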
For the two attention layers, the hyperparameters of the SFE module mirror those of the original Efficient ViT, with the exception of the depth, which is set to 2 so that the DAMA module occupies the dominant position. The MWT module is structured with three levels, based on two considerations: (1) the study by Liu et al. [5] adopted three decomposition levels and reported impressive performance in comparable tasks; (2) our initial experiments indicate that three levels of decomposition effectively balance the capture of intricate artifact details against computational cost. Fewer than three levels would lose nuanced forgery attributes, whereas more levels provide negligible improvements while considerably increasing the parameter count and computational demands. It is also worth emphasizing that the SFE module, together with the Efficient ViT, is not trained entirely from scratch: the EfficientNet-B0 [44] inside is initialized with parameters pretrained on ImageNet [56], as this yields better training results for the model.
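The internal design of the MWT module is not restated here; the snippet below only illustrates what a three-level 2D wavelet decomposition produces on a 224 × 224 input (the Haar basis and the PyWavelets library are chosen purely for illustration and are not claimed to match the module's implementation).

```python
import numpy as np
import pywt

# A toy grayscale crop standing in for a 224 x 224 face image.
image = np.random.rand(224, 224).astype(np.float32)

# Three-level 2D discrete wavelet decomposition. The result is
# [approx_level3, (H3, V3, D3), (H2, V2, D2), (H1, V1, D1)],
# ordered from the coarsest approximation to the finest detail bands.
coeffs = pywt.wavedec2(image, wavelet="haar", level=3)

print("level-3 approximation:", coeffs[0].shape)       # (28, 28)
for bands in coeffs[1:]:                               # coarse -> fine details
    print("detail bands:", [b.shape for b in bands])   # 28x28, 56x56, 112x112
```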
We implemented a weight warmup for the weighting parameter of the orthogonal term in the joint loss function to introduce the orthogonal loss incrementally, preventing the model from becoming overloaded and ensuring that it prioritizes learning the primary task. The weight is set to 0.0 for the initial 20% of the training epochs, then increases linearly to 1.0 by the 50% mark, and remains constant at 1.0 for the remaining 50% of the epochs. By dynamically adjusting this parameter, we can effectively incorporate the orthogonal loss into the model, introducing the regularization only after the model has developed a strong feature representation.
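A sketch of this warmup schedule, written as a simple function of training progress, is as follows:

```python
def orthogonal_loss_weight(epoch, total_epochs):
    """Warmup for the orthogonal-loss weight: 0 for the first 20% of epochs,
    a linear ramp to 1.0 at the 50% mark, then constant at 1.0."""
    progress = epoch / total_epochs
    if progress < 0.2:
        return 0.0
    if progress < 0.5:
        return (progress - 0.2) / 0.3  # linear ramp between 20% and 50%
    return 1.0

# e.g., with 30 epochs: epoch 5 -> 0.0, epoch 10 -> ~0.44, epoch 20 -> 1.0
```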
In the hyperparameter study, we trained the model for 20 epochs for each hyperparameter value, keeping the other parameters fixed as previously determined. To balance the data against the difficulty of the dataset, we employed the focal loss in place of the cross-entropy loss, with alpha and gamma set to 0.25 and 2.0, respectively. The stratified sampling method was also deactivated in this part of the study because the mixed dataset is constructed entirely with the same method, DeepFake.
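For reference, a standard binary focal loss with these settings can be written as follows (the tensor shapes and the helper name are illustrative):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: alpha re-weights the positive (fake) class and
    gamma down-weights easy examples."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)            # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()

# usage: logits and 0/1 targets as float tensors of shape (batch,)
loss = focal_loss(torch.randn(8), torch.randint(0, 2, (8,)).float())
```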
Evaluation Metrics. DeepFake detection is essentially a binary classification task; therefore, we selected the accuracy rate (ACC) and the area under the curve (AUC) as the primary evaluation metrics. AUC measures a classifier's capacity to differentiate between classes and is widely used as a summary of the ROC curve. Higher AUC and ACC indicate better performance in distinguishing between fake and real classes. AUC is the more equitable metric since it is not affected by imbalanced data, whereas ACC is more sensitive to threshold variations.
We selected these two metrics mainly because they are extensively reported in prior studies [57], which facilitates comparison with existing approaches given the constraints of our experimental environment. Metrics such as the F1 score, which are infrequently reported and thus limit direct comparison, are therefore not addressed in this paper.
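Both metrics can be computed directly from the model's scores, for example with scikit-learn (the scores below are toy values):

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Toy predictions: 1 = fake (positive class), 0 = real.
y_true  = np.array([1, 1, 1, 0, 1, 0, 1, 1])
y_score = np.array([0.9, 0.8, 0.4, 0.2, 0.7, 0.6, 0.95, 0.55])

auc = roc_auc_score(y_true, y_score)                         # threshold-free
acc = accuracy_score(y_true, (y_score >= 0.5).astype(int))   # depends on threshold
print(f"AUC = {auc:.3f}, ACC = {acc:.3f}")
```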
We further employ the Equal Error Rate (EER) as a supplementary measure to examine the model's performance on the diffusion-model-based dataset. The EER is a common metric for assessing binary classifiers, especially in contexts with imbalanced classes, as it provides a single number that captures the trade-off between the false positive and false negative rates. It is defined as follows:

EER = FPR(τ*),  where  τ* = arg min_τ |FPR(τ) − FNR(τ)|,

where τ* is the threshold that minimizes the difference between the false positive rate (FPR) and the false negative rate (FNR). The EER is determined by identifying the point on the ROC curve where FPR = 1 − TPR, with TPR denoting the true positive rate. A lower EER denotes better performance, as it represents a more balanced compromise between false positives and false negatives.
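In practice, the EER can be read off the ROC curve, for example as in the following sketch (toy scores, scikit-learn used for illustration):

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(y_true, y_score):
    """Locate the ROC point where the false positive rate and the false
    negative rate (1 - TPR) are closest, and report their average."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    fnr = 1.0 - tpr
    idx = np.argmin(np.abs(fpr - fnr))
    return (fpr[idx] + fnr[idx]) / 2.0

y_true  = [1, 1, 0, 1, 0, 1, 0, 1]
y_score = [0.92, 0.80, 0.35, 0.60, 0.55, 0.75, 0.20, 0.40]
print(f"EER = {equal_error_rate(y_true, y_score):.3f}")
```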
Moreover, many of the methods we evaluated are not open-source, including the study by Liu et al. [5], which serves as one of our main references. Consequently, we had to rely on the findings reported in the original papers as benchmarks for our experiments; these constitute part of the comparative results presented in the following sections.