Article

Deepfake Image Classification Using Decision (Binary) Tree Deep Learning

Department of Computer Networks, College of Computer Sciences and Information Technology, King Faisal University, Al-Ahsa 31982, Saudi Arabia
* Author to whom correspondence should be addressed.
J. Sens. Actuator Netw. 2025, 14(2), 40; https://doi.org/10.3390/jsan14020040
Submission received: 19 February 2025 / Revised: 29 March 2025 / Accepted: 1 April 2025 / Published: 8 April 2025
(This article belongs to the Section Network Security and Privacy)

Abstract

The unprecedented rise of deepfake technologies, leveraging sophisticated AI like Generative Adversarial Networks (GANs) and diffusion-based models, presents both opportunities and challenges in terms of digital media authenticity. In response, this study introduces a novel deep neural network ensemble that utilizes a tree-based hierarchical architecture integrating a vision transformer, ResNet, EfficientNet, and DenseNet to address the pressing need for effective deepfake detection. Our model exhibits a high degree of adaptability across varied datasets and demonstrates state-of-the-art performance, achieving up to 97.25% accuracy and a weighted F1 score of 97.28%. By combining the strengths of various convolutional networks and the vision transformer, our approach underscores a scalable solution for mitigating the risks associated with manipulated media.

1. Introduction

“I only believe what I see.” This statement reflects a widespread cognitive bias that equates visual evidence with truthfulness [1]. This reliance on visual authenticity has been deeply ingrained into human perception; however, recent advancements in deep learning like AlexNet [2] have disrupted this paradigm. The rise of deepfake technologies has enabled the creation of hyper-realistic manipulated media, challenging the very foundation of trust in visual evidence [3,4]. Deepfakes leverage Artificial Intelligence (AI) to alter or replace facial identities in images and videos, often generating results that are indistinguishable from reality. Generative AI models, including Generative Adversarial Networks (GANs) [5] and diffusion-based frameworks like Stable Diffusion [6], have revolutionized synthetic media creation by generating highly realistic content. While these advancements drive innovation, they also pose societal risks such as misinformation, identity theft, and fraud, making the detection of manipulated media a pressing research challenge. According to the Dimensions Scholarly Database [7], the rapid advancement of deepfake generation technologies, along with the corresponding rise in detection research, underscores an ongoing arms race between creation and mitigation efforts. This trend is illustrated in Figure 1.
In this study, we propose a novel deep model ensemble based on a tree structure, combining a vision transformer (ViT-Base) [8], EfficientNet [9], and DenseNet [10] within a hierarchical framework. Our approach enhances hierarchical feature learning, improving the generalization across diverse datasets while reducing the computational overhead. As illustrated in Figure 2, the process outlines the key stages of our work. The model achieves a state-of-the-art accuracy of 97.25% with a weighted F1 score of 97.28%, outperforming the existing methods while maintaining efficiency. Experimental studies validate key design choices, including fully connected layers, architectural diversity, and hyperparameter optimization. By addressing computational inefficiency and improving the robustness, our method establishes a practical, scalable solution for real-world deepfake detection challenges.

Key Contributions

Our work makes four key contributions: First, we explore a novel tree-based topology combining CNN backbones for hierarchical feature learning, improving the generalization. Second, our design emphasizes scalability and efficiency, suitable for real-world deployment. Third, comprehensive ablation studies validate critical design choices, including the FC layers and hyperparameters. Finally, we systematically address challenges like computational inefficiency and generalization gaps.
To facilitate a comprehensive understanding, the structure of this paper is as follows: In Section 2, we review the existing literature on deepfake detection, focusing on traditional and state-of-the-art methods. Section 3 introduces the proposed tree-based topology and its hierarchical design principles. Section 4 details the experimental setup, and Section 5 describes the evaluation metrics, dataset, and computational resources. Section 6 presents our results, including a comparison with existing approaches and an analysis of the computational complexity, and Section 7 discusses the effectiveness and limitations of the proposed topology. Finally, Section 8 concludes the paper, and Section 9 outlines potential directions for future work.

2. Related Works

In the rapidly evolving field of deepfake detection, numerous methodologies have been proposed to address the challenges posed by increasingly sophisticated synthetic media. This section provides an overview of the current landscape, categorizing the existing approaches into key areas. To systematically contextualize our contributions, Table 1 compares state-of-the-art methods across critical dimensions such as accuracy, generalization, and limitations. This analysis highlights the unmet needs in terms of scalability, cross-dataset robustness, and computational efficiency—gaps our work directly addresses.

2.1. Deep-Learning-Based Methods

Generative models: Generative models are a class of deep learning models that have been at the forefront of deepfake creation owing to their ability to automatically extract and learn complex features from data [11,12]. Generative Adversarial Networks (GANs), introduced by [11], form the foundation of deepfake generation. Subsequent advancements, such as Wasserstein GANs (WGANs) by [13], have stabilized GAN training and reduced mode collapse, further enhancing their utility in deepfake generation. In recent years, diffusion models have gained prominence in generative tasks, particularly with the advent of Stable Diffusion (SD) models [6]. SD models employ iterative denoising to generate high-resolution data, offering improved stability over GANs [14]. They excel in producing diverse, high-quality content and enable greater control during the generation process [15]. While diffusion models like Stable Diffusion excel in their generation quality [6], their detection counterparts (e.g., MesoInception-4 [16]) often struggle with generalization, as shown in Table 1.
Object detection: Object detection tasks in computer vision [17,18,19,20,21] have been a cornerstone of Artificial Intelligence (AI), with early works laying the foundation for object recognition and feature extraction using classical methods [22]. Ref. [23] proposed a CNN-based approach that analyzed the spatial inconsistencies in facial regions to detect manipulated content. Contrastive learning, a self-supervised learning approach, has recently emerged as a powerful paradigm for learning robust feature representations by contrasting positive and negative pairs in latent space [24,25,26]. The authors of [27] proposed a novel deepfake detection model leveraging contrastive learning, cross-modality data augmentation (SRM and RGB), and a multi-scale feature enhancement module to improve the generalization across known and unknown manipulations. In [28], the Hybrid Model for Deepfake Detection (HMDD) was proposed, focusing on improving the feature scales and leveraging attributes like background comparisons, eye-blinking patterns, facial artifacts, and pose estimation to enhance the accuracy and efficiency of deepfake detection. Recognizing that deepfakes can affect multiple modalities, recent research has explored the integration of audio and visual data for detection. Ref. [29] developed a multimodal approach that analyzed both facial expressions and voice patterns to improve the detection accuracy. Haliassos et al. [30] proposed a method that examined the lip-sync discrepancies between audio and visual streams, identifying inconsistencies indicative of deepfakes. For instance, hybrid frameworks like DeepfakeStack [52] achieve high accuracy but suffer from overfitting and resource intensity, underscoring the need for efficient architectures like ours.
Traditional machine learning techniques: Prior to the dominance of deep learning, traditional machine learning methods were utilized to detect manipulated media. These approaches often relied on handcrafted features and statistical analysis. For example, ref. [32] developed a method that detected discrepancies in eye-blinking patterns, a physiological signal often overlooked in deepfake videos. Additionally, ref. [31] focused on inconsistencies in head pose estimations to identify forged videos. These methods, however, lack scalability and fail to generalize across diverse manipulation types (Table 1).
Hybrid models: To leverage the strengths of both deep learning and traditional techniques, hybrid models have been proposed. Ref. [33] combined CNNs with recurrent neural networks (RNNs) to capture both spatial and temporal inconsistencies in videos, enhancing the detection performance. Ref. [34] employed a one-class variational autoencoder to detect anomalies indicative of deepfakes, offering a lightweight alternative to traditional ensemble models. Similarly, Caldelli et al. [35] integrated an optical flow analysis with deep learning to detect subtle artifacts in manipulated videos. While prior hybrid approaches (e.g., CNN-LSTM [32]) improve temporal analyses, they exhibit high cross-dataset variability. Our binary tree ensemble mitigates this by hierarchically fusing the spatial and temporal features, achieving superior accuracy (97.25%) and generalization.
Table 1. Comparative analysis of deepfake detection methods on the FaceForensics++ dataset. Key metrics include accuracy (Acc.) and generalization capability.

Method | Core Innovation | Acc. (%) | Generalization | Key Limitations
Pan, D. et al. [32] | Hybrid CNN-LSTM with EfficientNet-B4/Inception V3 | 91–98 | Low | High cross-dataset variability, struggles with temporal inconsistencies
Rana and Sung [52] | DeepfakeStack ensemble (multiple DL models) | 99.65 | Moderate | High GPU memory demand, overfits to training artifacts
CNN-GMM [36] | GMM probability layer replaces FC layers | 96 | Low | Sensitive to class imbalances, low computational efficiency
Afchar et al. [16] | MesoInception-4 for mid-level artifacts | 98 | Moderate | Fails on low-resolution inputs, limited to specific forgery types
Sabir et al. [37] | Recurrent CNN for temporal analysis | 94.3–96.9 | High | Weakness in frame-level manipulation detection, high latency
Our Method | Binary tree fusion of ViT + CNNs | 97.25 | High | Initial setup required if the method needs to be scaled

2.2. Benchmark Datasets and Evaluation

The development of robust deepfake detection methods has been facilitated by the creation of comprehensive datasets. The FaceForensics++ dataset, introduced by [38], provides a large-scale collection of manipulated videos for training and evaluation, including various levels of compression to simulate real-world scenarios. Similarly, the DeepFake Detection Challenge Dataset, released by [39], offers a diverse set of deepfake videos for benchmarking detection algorithms, with contributions from industry leaders such as Facebook AI. Celeb-DF [40] addresses the issues with visual artifacts found in earlier datasets, providing high-quality deepfake videos for improved evaluations. Another noteworthy contribution is the DFDC Preview dataset [41], which emphasizes realistic manipulations by introducing subtle perturbations into deepfake generation. These datasets collectively set the foundation for evaluating the generalization and robustness of detection methods across varied manipulations and compression artifacts.
Evaluation metrics are essential for assessing the performance of classification models. Here, we briefly discuss key metrics:
  • Accuracy: This measures the overall correctness of the model as the ratio of correct predictions to total predictions;
  • F1-Score: The harmonic mean of precision and recall, useful in imbalanced classes;
  • Precision: The ratio of true positives to all positive predictions, emphasizing the cost of false positives;
  • Recall: Also known as sensitivity, this measures the model’s ability to identify all relevant cases;
  • AUC-ROC: This represents the trade-off between the true positive and false positive rates across different thresholds [42].

3. Methodology

3.1. Problem Definition

Deepfake detection is formulated as a binary classification problem. Given an image $I \in \mathbb{R}^{H \times W \times C}$, where H, W, and C represent the height, width, and number of channels, the goal is to classify it as either real ($y = 0$) or fake ($y = 1$). The classifier $f(I; \theta)$, parameterized by $\theta$, predicts the probability $P(y = 1 \mid I)$ of the image being fake. A diagram of the architecture is shown in Figure 3.

3.2. Choice of Models

The backbone models used in our proposed method shown in Figure 3 are as follows: (1) ResNet-18 [43], (2) EfficientNet-B0 [9], (3) DenseNet-121 [10], and (4) a vision transformer (ViT) [8]. These models were chosen for their complementary strengths in feature extraction. ResNet-18 introduces residual connections that mitigate vanishing gradient issues, making it effective for deep architectures. EfficientNet-B0 leverages compound scaling to achieve an optimal balance between the depth, width, and resolution, providing a strong performance with a reduced computational cost. DenseNet-121 employs dense connections that facilitate feature reuse, resulting in efficient parameter utilization. The ViT utilizes self-attention mechanisms that allow it to focus on relevant parts of the input image, enhancing its ability to manage long-range dependencies, which is crucial for understanding complex image contexts.
These models have been extensively validated in various computer vision tasks and are known for their robustness in learning diverse feature representations. By integrating these architectures within a hierarchical tree-like ensemble, the proposed method combines their strengths to achieve improved generalization and robustness against adversarial manipulations.
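To make the backbone choice concrete, the following sketch shows how the four feature extractors could be instantiated with torchvision. It is an illustrative reconstruction, not the authors' released code, and replacing each classifier head with an identity layer is our assumption.

```python
# A minimal sketch (assumption: torchvision backbones with ImageNet weights,
# classifier heads replaced by identity layers to expose pooled features).
import torch
import torch.nn as nn
from torchvision import models

def build_backbones():
    resnet = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
    resnet.fc = nn.Identity()            # 512-d features

    effnet = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.IMAGENET1K_V1)
    effnet.classifier = nn.Identity()    # 1280-d features

    densenet = models.densenet121(weights=models.DenseNet121_Weights.IMAGENET1K_V1)
    densenet.classifier = nn.Identity()  # 1024-d features

    vit = models.vit_b_16(weights=models.ViT_B_16_Weights.IMAGENET1K_V1)
    vit.heads = nn.Identity()            # 768-d class-token features

    return {"ResNet": resnet, "EfficientNet": effnet, "DenseNet": densenet, "ViT": vit}

if __name__ == "__main__":
    backbones = build_backbones()
    x = torch.randn(2, 3, 224, 224)
    for name, net in backbones.items():
        print(name, net(x).shape)  # e.g., ResNet -> torch.Size([2, 512])
```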

3.3. The Model Architecture

The proposed approach integrates the selected backbones (ResNet-18, EfficientNet-B0, DenseNet-121, and ViT) into a hierarchical tree-like ensemble. Each backbone independently extracts features from the input image $I \in \mathbb{R}^{H \times W \times C}$, where H, W, and C represent the height, width, and channels of the image, respectively. These features are processed through fully connected (FC) layers organized into a tree-like hierarchical topology. This design enhances hierarchical feature learning, enabling better generalization across diverse datasets while maintaining computational efficiency.
Let $B_i$ denote the backbone that extracts the features $F_i$ from the input image:

$$F_i = B_i(I), \quad i \in \{\text{ResNet}, \text{EfficientNet}, \text{DenseNet}, \text{ViT}\}$$

3.3.1. Binary Tree Topology

Each backbone extracts complementary features, which are refined through a series of fully connected (FC) layers organized into hierarchical branches. The final outputs from each backbone branch are then concatenated to form a combined feature representation. For a given backbone $B_i$, the feature output is denoted as $F_{B_i}$. These features are passed through multiple levels of fully connected layers to progressively refine the feature representation. For instance, let $F_{\text{ViT}}$ be the features extracted by a pre-trained vision transformer model. Then,

$$H_1^{\text{ViT}} = \sigma(W_1 F_{\text{ViT}} + b_1), \qquad H_2^{\text{ViT}} = \sigma(W_2 F_{\text{ViT}} + b_2),$$
$$H_3^{\text{ViT}} = \sigma(W_3 H_1^{\text{ViT}} + b_3), \qquad H_4^{\text{ViT}} = \sigma(W_4 H_2^{\text{ViT}} + b_4).$$
We experiment with our method in two different settings. Firstly, we use all CNN-based models; that is, the representation $H_{\text{final}}$ becomes

$$H_{\text{final}} = [H_{\text{ResNet}};\, H_{\text{EfficientNet}};\, H_{\text{DenseNet}}]$$

Here, the final feature representation is a combination of three CNN-based models, namely ResNet-18, EfficientNet, and DenseNet. We call this variant CNNs-Ensemble. Secondly, we replace ResNet-18 with the vision transformer (ViT) in the model configuration. This updated combination includes the ViT, EfficientNet-B0, and DenseNet-121. We refer to this variant as ViT_CNNs-Ensemble.

$$H_{\text{final}} = [H_{\text{EfficientNet}};\, H_{\text{DenseNet}};\, H_{\text{ViT}}]$$

$H_{\text{final}}$ is passed through additional fully connected layers for the final prediction:

$$\hat{y} = \text{Softmax}(W_{\text{final}} H_{\text{final}} + b_{\text{final}})$$
Given the substantial computational demands associated with transformer-based models, we have decided against experimenting with configurations solely comprising multiple vision transformers (i.e., a combination of three distinct ViT architectures).
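The following sketch illustrates one way the tree-structured branches and the final fusion could be realized in PyTorch, following the equations above. The hidden widths, the use of ReLU for $\sigma$, the concatenation of $H_3$ and $H_4$ as each branch's output, and the feature dimensions are our assumptions, since the paper does not specify them.

```python
# A minimal sketch, under stated assumptions, of one tree branch per backbone
# and the concatenation-based fusion head (ViT_CNNs-Ensemble variant).
import torch
import torch.nn as nn

class TreeBranch(nn.Module):
    def __init__(self, in_dim: int, hidden: int = 256):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden)   # H1 = sigma(W1 F + b1)
        self.fc2 = nn.Linear(in_dim, hidden)   # H2 = sigma(W2 F + b2)
        self.fc3 = nn.Linear(hidden, hidden)   # H3 = sigma(W3 H1 + b3)
        self.fc4 = nn.Linear(hidden, hidden)   # H4 = sigma(W4 H2 + b4)
        self.act = nn.ReLU()

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        h1, h2 = self.act(self.fc1(f)), self.act(self.fc2(f))
        h3, h4 = self.act(self.fc3(h1)), self.act(self.fc4(h2))
        return torch.cat([h3, h4], dim=1)       # branch output H_{B_i} (assumed)

class TreeEnsembleHead(nn.Module):
    """Fuses the branch outputs of EfficientNet, DenseNet, and ViT into H_final."""
    def __init__(self, feat_dims=(1280, 1024, 768), hidden: int = 256, num_classes: int = 2):
        super().__init__()
        self.branches = nn.ModuleList(TreeBranch(d, hidden) for d in feat_dims)
        self.classifier = nn.Linear(2 * hidden * len(feat_dims), num_classes)

    def forward(self, feats):                    # feats: list of backbone features F_{B_i}
        h_final = torch.cat([b(f) for b, f in zip(self.branches, feats)], dim=1)
        return self.classifier(h_final)          # logits; softmax is applied in the loss
```

In this sketch, the feature dimensions (1280, 1024, 768) correspond to the pooled outputs of EfficientNet-B0, DenseNet-121, and ViT-Base, respectively.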

3.3.2. The Proposed Framework Algorithm

The proposed architecture is inherently scalable in two primary dimensions:
1. Ensemble-wise scalability: The model can seamlessly incorporate additional backbones or reduce the number of backbones based on the computational requirements or dataset complexity. Adding more backbones enhances the feature diversity but increases the computational overhead. Conversely, reducing the number of backbones makes the model lightweight, which is suitable for resource-constrained environments.
2. Depth-wise scalability: The hierarchical structure allows for flexibility in the number of fully connected layers within each backbone branch. Increasing the depth of the tree structure (i.e., adding more FC layers) can improve the feature refinement, capturing more intricate relationships. However, this comes at the cost of higher computational complexity and potential overfitting. Conversely, reducing the depth simplifies the model, making it faster and more efficient, albeit at the risk of losing some feature richness.
This dual scalability enables the model to adapt to various tasks, ranging from lightweight implementations for real-time applications to more robust configurations for high-accuracy requirements. The complete detection process is outlined in Algorithm 1.
Algorithm 1 Deepfake detection framework.
Require: Input image $I$, models $\{B_i\}$
Ensure: Prediction real/fake
1: Extract features $F_i = B_i(I)$ for each backbone
2: Hierarchical fusion: $H = \text{TreeMerge}(F_{\text{ViT}}, F_{\text{ResNet}}, \ldots)$
3: Classify: $\hat{y} = \text{Softmax}(W H + b)$

3.3.3. Model Ensembling

Model ensembling has long been a cornerstone of machine learning, demonstrating improved performance and robustness by aggregating the predictions from multiple models [44]. Traditional ensemble techniques often involve training a set of models independently and combining their predictions using strategies such as bagging [45], boosting [46], or majority voting [47]. Our work, however, takes inspiration from methods like the Model Averaging (MA) technique by [48], where the predictions from multiple checkpoints of a single model are averaged during inference. We refer to these intermediate checkpoints as snapshots. Specifically, we save the state dictionary of the model every five epochs during training. For a total of N training epochs, this results in N/5 snapshots. At inference time, the final model parameters $\theta_{\text{final}}$ are computed as the average of the state dictionaries across all snapshots, formulated as

$$\theta_{\text{final}} = \frac{1}{K} \sum_{k=1}^{K} \theta_k$$

where $K = N/5$ and $\theta_k$ represents the parameters of the model saved at the k-th snapshot. This averaging strategy integrates knowledge captured at various training stages, thereby enhancing the generalization ability of the ensemble.
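As a minimal sketch of this snapshot averaging (our reading of the description above, not the authors' implementation), the saved state dictionaries can be averaged element-wise before inference:

```python
# Average K snapshot state dicts into theta_final = (1/K) * sum_k theta_k.
import torch

def average_snapshots(snapshot_paths, model):
    """Set `model` to the element-wise mean of the saved snapshot state dicts."""
    avg_state = None
    for path in snapshot_paths:
        state = torch.load(path, map_location="cpu")
        if avg_state is None:
            avg_state = {name: tensor.clone().float() for name, tensor in state.items()}
        else:
            for name, tensor in state.items():
                avg_state[name] += tensor.float()
    for name in avg_state:
        avg_state[name] /= len(snapshot_paths)
    model.load_state_dict(avg_state)   # copy_ casts values back to each parameter's dtype
    return model
```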

3.3.4. Integration with the Dataset

The method is evaluated using the FaceForensics++ dataset [38], a benchmark dataset widely used in deepfake detection research. This dataset includes videos manipulated using various techniques, providing a diverse and challenging benchmark for evaluating detection methods. Additionally, the dataset repository [38] offers tools for preprocessing and standardized splits, ensuring reproducibility and fair comparison with the existing approaches.

3.4. The Algorithm: Training and Snapshot Ensembling

The following algorithm (Algorithm 2) summarizes the training process and snapshot ensembling:
Algorithm 2 The algorithm initializes the model parameters, iterates through epochs with loss computation and updates, and periodically captures model snapshots. Learning rate adjustments optimize the performance, while ensembling enhances the generalization and mitigates overfitting.
1: Input: Training data $D_{\text{train}}$, validation data $D_{\text{val}}$, learning rate $\alpha$, patience $p$, snapshot intervals $S$.
2: Initialize the model parameters $\theta$, optimizer, scheduler, and loss function.
3: for epoch = 1 to $N_{\text{epochs}}$ do
4:    Training:
5:    for each batch $(x, y) \in D_{\text{train}}$ do
6:       Compute predictions: $\hat{y} = f(x; \theta)$.
7:       Compute loss: $\mathcal{L} = -y \log \hat{y}$.
8:       Backpropagation: update $\theta \leftarrow \theta - \alpha \cdot \nabla_{\theta} \mathcal{L}$.
9:    end for
10:   Validation: evaluate the model on $D_{\text{val}}$.
11:   Adjust the learning rate: $\alpha \leftarrow \text{scheduler}(\mathcal{L}_{\text{val}})$.
12:   if epoch $\in S$ then
13:      Save snapshot: $\mathcal{M} \leftarrow \text{save}(\theta)$.
14:   end if
15: end for
16: Output: The trained model and snapshot ensemble.
This setup ensures robust training with strong generalization capabilities and reduced overfitting risks through dynamic learning rate adjustments and snapshot ensembling.

4. Experimental Settings

In this section, we present our experimental framework, outlining the training configuration, evaluation metrics, and computational resources utilized in our experiments. Special attention is given to strategies that enhance the model’s generalization and reduce overfitting. To classify a given image as either fake or real, we use the cross-entropy (CE) loss function:
$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]$$
where $y_i$ is the ground truth label, $\hat{y}_i$ is the predicted probability, and N is the batch size. The Adam optimizer [49] with a learning rate of $1 \times 10^{-5}$ and a weight decay of $1 \times 10^{-6}$ is used:

$$\theta \leftarrow \theta - \alpha \cdot \nabla_{\theta} \mathcal{L}$$

where $\alpha$ is the learning rate, and $\theta$ represents the model parameters.
The ReduceLROnPlateau and Cosine Annealing schedulers in PyTorch [50] are used to dynamically adjust the learning rate based on the validation loss in CNNs-Ensemble and ViT_CNNs-Ensemble, respectively:

$$\alpha_{t+1} = \begin{cases} \alpha_t \cdot \text{factor}, & \text{if } \mathcal{L}_{\text{val}} \text{ does not improve for } p \text{ epochs}, \\ \alpha_t, & \text{otherwise}. \end{cases}$$
Snapshots of the model weights are saved every five epochs (i.e., at epochs 5, 10, 15, and so on). During testing, the average of these snapshot weights is used to predict the class label (fake or real). The model is trained for 100 epochs on an NVIDIA GPU equipped with 24 GB of dedicated memory.
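A compact training-loop sketch consistent with this setup is given below; the ReduceLROnPlateau factor and patience values, as well as the loader and model objects, are assumptions for illustration only.

```python
# Sketch of the training loop: cross-entropy loss, Adam (lr 1e-5, wd 1e-6),
# ReduceLROnPlateau on the validation loss, and a snapshot every five epochs.
import torch
import torch.nn as nn
from torch.optim import Adam
from torch.optim.lr_scheduler import ReduceLROnPlateau

def train(model, train_loader, val_loader, device, epochs=100):
    criterion = nn.CrossEntropyLoss()
    optimizer = Adam(model.parameters(), lr=1e-5, weight_decay=1e-6)
    scheduler = ReduceLROnPlateau(optimizer, mode="min", factor=0.1, patience=3)  # assumed values
    snapshots = []
    model.to(device)
    for epoch in range(1, epochs + 1):
        model.train()
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
        # The validation loss drives the learning-rate schedule.
        model.eval()
        val_loss, n = 0.0, 0
        with torch.no_grad():
            for images, labels in val_loader:
                images, labels = images.to(device), labels.to(device)
                val_loss += criterion(model(images), labels).item() * labels.size(0)
                n += labels.size(0)
        scheduler.step(val_loss / n)
        if epoch % 5 == 0:                      # snapshot every five epochs
            path = f"snapshot_epoch_{epoch}.pt"
            torch.save(model.state_dict(), path)
            snapshots.append(path)
    return snapshots
```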
To improve the generalization and enhance the robustness of the model, aggressive data augmentation techniques are applied during training. These include random resized cropping to $224 \times 224$, random horizontal flips with a probability of $p = 0.5$, and random vertical flips with $p = 0.2$. Additionally, random rotations of up to 30° are performed, alongside color jittering to adjust the brightness, contrast, saturation, and hue, thereby simulating varied lighting conditions. Random affine transformations, such as translation, scaling, and rotation, further diversify the dataset. All of the inputs are normalized using ImageNet [51] statistics, with the mean $\mu = [0.485, 0.456, 0.406]$ and the standard deviation $\sigma = [0.229, 0.224, 0.225]$. For validation and testing, only resizing and normalization are applied, ensuring consistency in the evaluation while minimizing alterations to the input data.
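The augmentation pipeline described above could be expressed with torchvision transforms as follows; the color-jitter and affine magnitudes not stated in the text are illustrative assumptions.

```python
# Sketch of the stated augmentation pipeline (unstated magnitudes are assumed).
from torchvision import transforms

IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomVerticalFlip(p=0.2),
    transforms.RandomRotation(degrees=30),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),
    transforms.RandomAffine(degrees=10, translate=(0.1, 0.1), scale=(0.9, 1.1)),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])

# Validation/testing: only resizing and normalization, as described above.
eval_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])
```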

5. Evaluation Metrics

To ensure a comprehensive understanding of the model’s strengths and limitations, we employ the following metrics:
  • Accuracy: Accuracy measures the proportion of correctly classified samples out of the total samples. It is defined as
    $$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
    where TP, TN, FP, and FN represent true positives, true negatives, false positives, and false negatives, respectively. Accuracy provides a high-level overview of the model’s performance but may be less informative for imbalanced datasets.
  • Weighted F1 score: The F1 score offers the harmonic mean of precision and recall, making it a robust metric for evaluating models in scenarios where the class distributions are imbalanced. We compute the weighted F1 score to account for class imbalances:
    $$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}, \quad \text{weighted by the class distribution}$$
    Here, precision and recall are defined as
    $$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$
    The weighted F1 score ensures that the model’s performance on both real and fake image classes is fairly represented.
  • Confusion matrix: The confusion matrix provides a detailed breakdown of the model’s predictions, showing the distribution of TPs, TNs, FPs, and FNs. This granular analysis is instrumental in identifying specific error patterns, such as the tendency to misclassify fake images as real or vice versa.
These metrics, widely recognized in the machine learning community, offer a holistic view of the model’s classification performance.
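For reference, these metrics can be computed directly with scikit-learn; the toy labels below are placeholders, not results from our experiments.

```python
# Sketch of metric computation with scikit-learn (placeholder labels).
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, classification_report

y_true = [0, 1, 1, 0, 1, 1]   # 0 = real, 1 = fake (toy example)
y_pred = [0, 1, 1, 1, 1, 0]

print("Accuracy:", accuracy_score(y_true, y_pred))
print("Weighted F1:", f1_score(y_true, y_pred, average="weighted"))
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred, target_names=["Real", "Fake"]))
```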

The Dataset and Computational Resources

The FaceForensics++ dataset comprises original videos and their counterparts manipulated via DeepFake, Face2Face, FaceSwap, and NeuralTextures, totaling 6000 videos (Table 2). We extracted frames at 1 fps, resulting in approximately 10,000 images, split into 80% training, 10% validation, and 10% testing.
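A hedged sketch of the 1 fps frame extraction is shown below using OpenCV; the paper does not specify its extraction tooling, so the library choice and file naming are assumptions.

```python
# Sketch: keep roughly one frame per second of video using OpenCV.
import cv2
import os

def extract_frames(video_path: str, out_dir: str, fps_target: float = 1.0):
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 25.0      # fall back if metadata is missing
    step = max(int(round(native_fps / fps_target)), 1)  # frames to skip between saves
    idx, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            cv2.imwrite(os.path.join(out_dir, f"frame_{saved:05d}.jpg"), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved
```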
As shown in Figure 4, our dataset includes diverse manipulation techniques, with subtle artifacts such as misaligned facial landmarks and texture irregularities distinguishing synthetic media. These samples illustrate nuanced differences that necessitate advanced detection frameworks.
The experiments were conducted on a system equipped with an NVIDIA RTX 4090 GPU (24 GB VRAM; NVIDIA Corporation, Santa Clara, CA, USA) and CUDA version 12.2. The average training time for 50 epochs with a batch size of 64 was approximately 60 min. Efficient memory management techniques, such as gradient checkpointing and mixed-precision training, were employed to optimize the resource utilization.

6. Results

This section comprehensively evaluates our proposed models, analyzing their performance, computational efficiency, and comparative effectiveness against the state-of-the-art deepfake detection methods.

6.1. The Performance of Our Models

Table 3 presents the classification performance metrics for both the CNNs-Ensemble and ViT-CNNs-Ensemble models. ViT-CNNs-Ensemble achieves a superior overall accuracy (97.25%) and demonstrates a more balanced trade-off between its precision and recall across both real and fake classes, highlighting its effectiveness in deepfake detection.
Figure 5 presents the trends in the validation accuracy (left) and F1 scores (right) across training epochs. Both graphs exhibit a steady increase, stabilizing near 0.9 after approximately 20 epochs, indicating strong model generalization.

The Impact of Snapshot Ensembling on Deepfake Detection

To evaluate the impact of snapshot ensembling on our method, we compare the classification performance with and without ensembling. Table 4 and Table 5 present the precision, recall, and F1 score for the two variants, CNNs-Ensemble and ViT_CNNs-Ensemble, respectively.
For CNNs-Ensemble, snapshot ensembling raises the real class precision (92.77% → 94%) and the fake class recall (98.76% → 99%) but lowers the real class recall (79.38% → 74%), highlighting the trade-off between stability and sensitivity (Table 4). For ViT_CNNs-Ensemble, ensembling improves the real class precision (89.32% → 93%) while keeping the fake class precision close to 99%, though at the cost of some real class recall, indicating that ensembling mainly refines the precision on the minority (real) class (Table 5).

6.2. A Comparative Analysis with the State-of-the-Art Methods

To assess the effectiveness of the proposed approach in the context of existing technologies, we compare its performance with that of state-of-the-art deepfake detection methods evaluated on the FaceForensics++ dataset, as shown in Table 6. ViT_CNNs-Ensemble achieves a competitive accuracy of 97.25%, outperforming several CNN-based models while maintaining a reasonable computational footprint. Although ensemble methods such as [52] reach a higher accuracy (99.65%), they impose a significant computational overhead, limiting their feasibility for real-time applications.

6.3. Analysis of the Computational Complexity

Table 7 validates our design’s efficiency. CNNs-Ensemble requires 71% fewer FLOPs than Xception does while achieving a higher accuracy. The ViT variant trades efficiency for enhanced artifact detection.

6.4. Insights from Comparative Models

  • Rana et al. [52]’s ensemble achieves 99.65% accuracy but requires three-model execution (Xception × 3), resulting in 7.8× higher FLOPs than those of our CNNs-Ensemble (38.9G vs. 5.0G FLOPs);
  • Afchar et al. [16]’s MesoNet achieves 98% accuracy on their custom deepfake dataset but shows a degraded performance on compressed videos (93.5% at H.264 level 23), while our model maintains 95.53% accuracy across compression artifacts;
  • Our ViT-CNNs attains accuracy close to MesoNet’s (97.25% vs. 98%) while offering stronger generalization across forgery types, at a cost of 20.8 GFLOPs vs. MesoNet’s 1.3 GFLOPs, reflecting the accuracy–computation trade-off for modern detection needs.

7. Discussion

7.1. The Effectiveness of the Binary Tree Topology

The proposed tree topology demonstrates significant advantages in hierarchical feature learning and generalization. By structuring the model into levels of fully connected layers, the architecture effectively decomposes complex patterns into simpler sub-patterns, allowing for more robust feature extraction. This hierarchical approach mirrors the process of decision-making, enabling the model to progressively refine its understanding of the input data. The results highlight that increasing the depth of the tree up to an optimal level, such as eight fully connected layers, leads to superior performance in terms of both accuracy and the F1 score.
Furthermore, the hierarchical structure contributes to improved generalization, especially in datasets with subtle differences between classes, such as deepfake and real images. Unlike flat architectures, the tree topology leverages its layered organization to capture nuanced relationships between features while avoiding overfitting. This is particularly evident from the consistently high performance across varied configurations, underscoring its effectiveness in handling complex classification tasks.

7.2. Real-World Deployment Considerations

The proposed model is designed for integration into platforms requiring real-time deepfake detection, such as social media moderation tools, video conferencing systems, and CCTV surveillance networks. For facial analysis tasks (e.g., identity verification in banking apps), the optimal performance is expected at a camera distance of 0.5–2 m, where the facial features are captured at sufficient resolution (1080p or higher). In scenarios with moving cameras or platforms (e.g., drones, autonomous vehicles), the model’s hierarchical feature extraction remains robust to moderate motion blur at speeds of up to 30 km/h, provided the input frames are stabilized using optical flow algorithms or hardware gimbals. However, extreme motion (e.g., >60 km/h) or low-light conditions may degrade the accuracy, necessitating preprocessing steps like super-resolution or adaptive noise reduction. To address scalability, the tree topology’s efficiency enables deployment on edge devices (e.g., the NVIDIA Jetson, the Raspberry Pi) with minimal latency, though dynamic environments may require hybrid cloud–edge inference for resource-intensive tasks. Future work will explore temporal consistency checks across video frames to further mitigate motion-related artifacts.

7.3. Challenges and Limitations

While the tree topology offers numerous advantages, it is not without challenges and limitations. One primary limitation is the increase in the computational complexity as the depth of the tree grows. Although deeper trees may initially enhance the performance, diminishing returns were observed beyond 8 fully connected layers, with a slight degradation in the accuracy and F1 score at 10 layers. This suggests that overly deep configurations may lead to an unnecessary computational overhead without proportional gains in performance.
Another challenge lies in balancing the trade-off between the memory usage and computational efficiency. While the hierarchical structure requires less memory compared to some dense architectures, optimizing the number of layers and nodes is essential to prevent excessive resource utilization. Additionally, the tree topology’s reliance on predefined levels requires careful tuning and experimentation, making it less flexible compared to models that dynamically adjust their architecture during training. These challenges highlight the need for further optimization to make the model more scalable and efficient for broader applications.

8. Conclusions

This research has established a robust framework for deepfake detection by introducing a tree-based deep neural network ensemble, which effectively harnesses the strengths of the ViT, ResNet, EfficientNet, and DenseNet within a singular hierarchical structure. The presented model not only meets but exceeds the existing benchmarks through its enhanced generalization capabilities and computational efficiency. Achieving high accuracy and F1 scores, the model validates the effectiveness of integrating multiple architectures for complex feature learning. By continually refining the model architecture and exploring innovative approaches, we aim to stay ahead of the evolving landscape of digital media manipulation.

9. Future Work

While the tree-topology-based architecture has shown promising results, several avenues merit exploration to enhance its robustness and applicability. Integrating attention mechanisms could further improve the feature interactions by dynamically focusing on artifact-prone regions, particularly for emerging diffusion-generated deepfakes (e.g., Stable Diffusion 3, DALL-E 3). Expanding the dataset’s diversity through cross-benchmark evaluations on Celeb-DF and DFDC would test the generalization beyond FaceForensics++, addressing current technique-specific biases. The computational efficiency could be enhanced via dynamic tree pruning and quantization-aware training for edge deployment, while a neural architecture search might automate topology optimization. Finally, extending the framework to video–temporal analyses and multimodal detection (e.g., audio–visual inconsistencies) could address real-world deployment challenges, particularly for streaming platforms combating sophisticated synthetic media.

Author Contributions

Conceptualization, M.A. and A.A.-S.; methodology, M.A. and A.A.-S.; software, M.A. and A.A.-S.; validation, M.A. and A.A.-S.; formal analysis, M.A. and A.A.-S.; investigation, M.A.; writing—original draft preparation, M.A. and A.A.-S.; writing—review and editing, M.A. and A.A.-S. All authors have read and agreed to the published version of the manuscript.

Funding

The authors acknowledge the Deanship of Scientific Research, Vice Presidency for Graduate Studies and Scientific Research, King Faisal University, Saudi Arabia (Project No. KFU 250994). The authors extend their appreciation for the financial support that made this study possible.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AI: Artificial Intelligence
ViT: Vision Transformer
CNN: Convolutional Neural Network

References

  1. Lange, R.D.; Chattoraj, A.; Beck, J.M.; Yates, J.L.; Haefner, R.M. A confirmation bias in perceptual decision-making due to hierarchical approximate inference. PLoS Comput. Biol. 2021, 17, e1009517. [Google Scholar]
  2. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2012, 60, 84–90. [Google Scholar]
  3. Nguyen, T.T.; Nguyen, Q.V.H.; Nguyen, D.T.; Nguyen, D.T.; Huynh-The, T.; Nahavandi, S.; Nguyen, T.T.; Pham, Q.V.; Nguyen, C.M. Deep learning for deepfakes creation and detection: A survey. Comput. Vis. Image Underst. 2022, 223, 103525. [Google Scholar]
  4. Gupta, G.; Raja, K.; Gupta, M.; Jan, T.; Whiteside, S.T.; Prasad, M. A Comprehensive Review of DeepFake Detection Using Advanced Machine Learning and Fusion Methods. Electronics 2023, 13, 95. [Google Scholar] [CrossRef]
  5. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. Adv. Neural Inf. Process. Syst. 2014, 27. Available online: https://papers.nips.cc/paper_files/paper/2014/hash/f033ed80deb0234979a61f95710dbe25-Abstract.html (accessed on 28 March 2025).
  6. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-Resolution Image Synthesis with Latent Diffusion Models. arXiv 2022, arXiv:2112.10752. [Google Scholar]
  7. Dimensions. Dimensions Scholarly Database. 2025. Available online: https://www.dimensions.ai/resources/dimensions-scientific-research-database/ (accessed on 28 March 2025).
  8. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  9. Tan, M.; Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 10–15 June 2019; pp. 6105–6114. [Google Scholar]
  10. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
  11. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144. [Google Scholar]
  12. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar]
  13. Arjovsky, M.; Chintala, S.; Bottou, L. Wasserstein generative adversarial networks. In Proceedings of the International Conference on Machine Learning, PMLR, Sydney, NSW, Australia, 6–11 August 2017; pp. 214–223. [Google Scholar]
  14. Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851. [Google Scholar]
  15. Nichol, A.; Dhariwal, P. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv 2021, arXiv:2112.10741. [Google Scholar]
  16. Afchar, D.; Nagrani, A.; Zheng, C.; Smeureanu, R.; Sattar, M.A. MesoNet: A Compact Facial Video Forgery Detection Network. In Proceedings of the 2018 IEEE International Workshop on Information Forensics and Security (WIFS), Hong Kong, China, 11–13 December 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1–6. [Google Scholar]
  17. Zou, Z.; Chen, K.; Shi, Z.; Guo, Y.; Ye, J. Object detection in 20 years: A survey. Proc. IEEE 2023, 111, 257–276. [Google Scholar]
  18. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229. [Google Scholar]
  19. Vijayakumar, A.; Vairavasundaram, S. Yolo-based object detection models: A review and its applications. Multimed. Tools Appl. 2024, 83, 83535–83574. [Google Scholar]
  20. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. Detrs beat yolos on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 16965–16974. [Google Scholar]
  21. Khan, A.; Khattak, M.U.; Dawoud, K. Object detection in aerial images: A case study on performance improvement. In Proceedings of the 2022 International Conference on Artificial Intelligence of Things (ICAIoT), Istanbul, Turkey, 29–30 December 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1–9. [Google Scholar]
  22. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar]
  23. Nguyen, H.H.; Yamagishi, J.; Echizen, I. Capsule-forensics: Using capsule networks to detect forged images and videos. In Proceedings of the ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 2307–2311. [Google Scholar]
  24. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning, PMLR, Online, 13–18 July 2020; pp. 1597–1607. [Google Scholar]
  25. Khosla, P.; Teterwak, P.; Wang, C.; Sarna, A.; Tian, Y.; Isola, P.; Maschinot, A.; Liu, C.; Krishnan, D. Supervised contrastive learning. Adv. Neural Inf. Process. Syst. 2020, 33, 18661–18673. [Google Scholar]
  26. Khan, A.; AlBarri, S.; Manzoor, M.A. Contrastive self-supervised learning: A survey on different architectures. In Proceedings of the 2022 2nd International Conference on Artificial Intelligence (ICAI), Islamabad, Pakistan, 30–31 March 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1–6. [Google Scholar]
  27. Dong, F.; Zou, X.; Wang, J.; Liu, X. Contrastive learning-based general Deepfake detection with multi-scale RGB frequency clues. J. King Saud Univ.-Comput. Inf. Sci. 2023, 35, 90–99. [Google Scholar]
  28. Rajesh, N.; Prajwala, M.; Kumari, N.; Rayyan, M.; Ramachandra, A. Hybrid Model for Deepfake Detection. In Proceedings of the 3rd International Conference on Machine Learning, Advances in Computing, Renewable Energy and Communication: MARC 2021, Ghaziabad, India, 10–11 December 2021; Springer: Singapore, 2022; pp. 639–649. [Google Scholar]
  29. Agarwal, S.; Varshney, L.R. Limits of deepfake detection: A robust estimation viewpoint. arXiv 2019, arXiv:1905.03493. [Google Scholar]
  30. Haliassos, A.; Mira, R.; Petridis, S.; Pantic, M. Leveraging real talking faces via self-supervision for robust forgery detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 14950–14962. [Google Scholar]
  31. Yang, X.; Li, Y.; Lyu, S. Exposing deep fakes using inconsistent head poses. In Proceedings of the ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 8261–8265. [Google Scholar]
  32. Li, Y.; Chang, M.C.; Lyu, S. In ictu oculi: Exposing ai created fake videos by detecting eye blinking. In Proceedings of the 2018 IEEE International workshop on information forensics and security (WIFS), Hong Kong, China, 11–13 December 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1–7. [Google Scholar]
  33. Chandrasegaran, K.; Tran, N.T.; Binder, A.; Cheung, N.M. Discovering transferable forensic features for cnn-generated images detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 671–689. [Google Scholar]
  34. Khalid, H.; Woo, S.S. Oc-fakedect: Classifying deepfakes using one-class variational autoencoder. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 656–657. [Google Scholar]
  35. Caldelli, R.; Galteri, L.; Amerini, I.; Del Bimbo, A. Optical Flow based CNN for detection of unlearnt deepfake manipulations. Pattern Recognit. Lett. 2021, 146, 31–37. [Google Scholar]
  36. Alnafea, R.M.; Nissirat, L.; Al-Samawi, A. CNN-GMM approach to identifying data distribution shifts in forgeries caused by noise: A step towards resolving the deepfake problem. PeerJ Comput. Sci. 2024, 10, e1991. [Google Scholar]
  37. Sabir, E.; Cheng, J.; Jaiswal, A.; AbdAlmageed, W.; Masi, I.; Natarajan, P. Recurrent Convolutional Strategies for Face Manipulation Detection in Videos. arXiv 2019, arXiv:1905.00582. [Google Scholar]
  38. Rossler, A.; Cozzolino, D.; Verdoliva, L.; Riess, C.; Thies, J.; Nießner, M. Faceforensics++: Learning to detect manipulated facial images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1–11. [Google Scholar]
  39. Dolhansky, B.; Bitton, J.; Pflaum, B.; Lu, J.; Howes, R.; Wang, M.; Ferrer, C.C. The deepfake detection challenge (dfdc) dataset. arXiv 2020, arXiv:2006.07397. [Google Scholar]
  40. Li, Y.; Yang, X.; Sun, P.; Qi, H.; Lyu, S. Celeb-df: A large-scale challenging dataset for deepfake forensics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 3207–3216. [Google Scholar]
  41. Jiang, L.; Li, R.; Wu, W.; Qian, C.; Loy, C.C. Deeperforensics-1.0: A large-scale dataset for real-world face forgery detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2889–2898. [Google Scholar]
  42. Hanley, J.A.; McNeil, B.J. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 1982, 143, 29–36. [Google Scholar] [CrossRef] [PubMed]
  43. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778. [Google Scholar]
  44. Dietterich, T.G. Ensemble methods in machine learning. In International Workshop on Multiple Classifier Systems; Springer: Berlin/Heidelberg, Germany, 2000; pp. 1–15. [Google Scholar]
  45. Breiman, L. Bagging predictors. Mach. Learn. 1996, 24, 123–140. [Google Scholar]
  46. Freund, Y.; Schapire, R.E. Experiments with a new boosting algorithm. In Proceedings of the Thirteenth International Conference on Machine Learning, Garda, Italy, 28 June–1 July 1996; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 1996; pp. 148–156. [Google Scholar]
  47. Shaaban, M.A.; Akkasi, A.; Khan, A.; Komeili, M.; Yaqub, M. Fine-Tuned Large Language Models for Symptom Recognition from Spanish Clinical Text. arXiv 2024, arXiv:2401.15780. [Google Scholar]
  48. Khan, A.; Shaaban, M.A.; Khan, M.H. Improving Pseudo-labelling and Enhancing Robustness for Semi-Supervised Domain Generalization. arXiv 2024, arXiv:2401.13965. [Google Scholar]
  49. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  50. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. arXiv 2019, arXiv:1912.01703. [Google Scholar]
  51. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
  52. Rana, M.S.; Sung, A.H. Deepfakestack: A deep ensemble-based learning technique for deepfake detection. In Proceedings of the 2020 7th IEEE International Conference on Cyber Security and Cloud Computing (CSCloud)/2020 6th IEEE International Conference on Edge Computing and Scalable Cloud (EdgeCom), New York, NY, USA, 1–3 August 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 70–75. [Google Scholar]
  53. Agnihotri, A. DeepFake Detection using Deep Neural Networks. Ph.D. Thesis, National College of Ireland, Dublin, Ireland, 2021. Available online: https://norma.ncirl.ie/5131/1/ambujagnihotri.pdf (accessed on 18 December 2024).
Figure 1. Evolution of deepfake generation and detection research (2016–2024). The left axis shows the publication counts for detection methods, while the right axis shows the year, highlighting the escalating arms race between creation and detection technologies.
Figure 2. This illustration provides an overview of our deepfake detection process, focusing on stages like data preprocessing, model training, and classification in the binary tree topology.
Figure 3. This illustration combines ResNet-18, ViT-Base, EfficientNet-B0, and DenseNet-121. It features fully connected layers for classification, showing the image input flow, feature extraction stages, and final classification decisions for real or fake.
Figure 4. Example pairs of real and fake images from the dataset. Key artifacts in fake images (e.g., inconsistent lighting, unnatural facial contours) highlight detection challenges.
Figure 5. Comparison of the models’ validation accuracy and F1 score over training epochs.
Table 2. Dataset distribution across training, validation, and testing sets.

Set | Fake Images | Real Images | Total Images
Training Set | 4250 | 850 | 5100
Validation Set | 970 | 194 | 1164
Test Set | 485 | 97 | 582
Table 3. Performance metrics of our proposed models.

Metric | CNNs-Ensemble | ViT_CNNs-Ensemble
Overall Accuracy (%) | 95.53 | 97.25
Precision (Fake Class) (%) | 95.99 | 98.96
Recall (Fake Class) (%) | 98.76 | 97.73
Precision (Real Class) (%) | 92.77 | 89.32
Recall (Real Class) (%) | 79.38 | 94.85
Overall Precision (%) | 95.45 | 97.35
Overall Recall (%) | 89.07 | 96.29
Overall F1 Score (%) | 95.39 | 97.28
Training Time (GPU Hours) | 2.45 | 1.90
Table 4. CNNs-Ensemble classification reports.

Class | Precision | Recall | F1-Score | Support
Without Ensemble
Real | 0.9277 | 0.7938 | 0.8556 | 97
Fake | 0.9599 | 0.9876 | 0.9736 | 485
With Snapshot Ensemble
Real | 0.94 | 0.74 | 0.83 | 97
Fake | 0.95 | 0.99 | 0.97 | 485
Table 5. ViT_CNNs-Ensemble classification reports.

Class | Precision | Recall | F1-Score | Support
Without Ensemble
Real | 0.8932 | 0.9485 | 0.9200 | 97
Fake | 0.9896 | 0.9773 | 0.9834 | 485
With Snapshot Ensemble
Real | 0.93 | 0.85 | 0.88 | 97
Fake | 0.99 | 0.98 | 0.98 | 485
Table 6. Comparison of our models with existing approaches.

Method | Technique | Accuracy (%)
Pan, D. et al. [53] | Xception + MobileNet | 91–98
Rana M. S. and Sung A. H. [52] | Heavy Ensemble (3 × Xception) | 99.65
Afchar et al. [16] | MesoInception (Lightweight CNN) | 98
Sabir et al. [37] | Recurrent CNNs + LSTM | 94.3–96.9
CNN-GMM | CNN + Gaussian Mixture Model | 96
Our Model (CNNs-Ensemble) | ResNet + EfficientNet + DenseNet Tree | 95.53
Our Model (ViT-CNNs) | ViT + CNNs Tree | 97.25
Table 7. The computational profile for 224 × 224 input dimensions. Timing is measured on an NVIDIA RTX 4090 with PyTorch 2.0, utilizing a batch size of 64 and mixed precision.

Model | FLOPs (G) | Params (M) | Inference (ms)
Xception | 16.85 | 22.9 | 34
MesoNet | 1.3 | 2.21 | 12
Rana et al. (Ensemble) [52] | 38.9 | 285.3 | 121
CNNs-Ensemble | 5.0 | 25.0 | 28
ViT_CNNs-Ensemble | 20.8 | 99.7 | 121
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
