Article

Deepfake Detection Using Multimodal CLIP-Based SigLIP-2 Vision Transformers

1 Institute for Data Science and Informatics, University of Missouri, Columbia, MO 65201, USA
2 Department of Electrical Engineering and Computer Science, Bond Life Sciences Center, University of Missouri, Columbia, MO 65201, USA
3 Health Informatics Institute, Morsani College of Medicine, University of South Florida, Tampa, FL 33612, USA
* Author to whom correspondence should be addressed.
AI 2026, 7(3), 115; https://doi.org/10.3390/ai7030115
Submission received: 6 February 2026 / Revised: 11 March 2026 / Accepted: 12 March 2026 / Published: 19 March 2026
(This article belongs to the Section AI Systems: Theory and Applications)

Abstract

Background: Deepfakes pose a growing threat to the integrity of visual media, motivating detectors that remain reliable as forgeries become increasingly realistic. Methods: We propose a deepfake detection framework built on CLIP-derived SigLIP-2 vision transformers and a multi-task design that jointly performs (i) classification and (ii) manipulated-region localization when pixel-level supervision is available. We evaluated the approach on three public benchmarks of increasing complexity—HiDF, SID_Set (SIDA), and CiFake—using each dataset’s official partitions where provided (SID_Set uses the predefined train/validation split) and a standardized preprocessing and training pipeline across experiments. Results: On HiDF, our model achieved strong performance on both video and image tracks (AUC up to 0.931 on video and 0.968 on images), yielding large gains relative to previously reported HiDF baselines under their published settings. On SID_Set, the model achieved 99.1% three-class accuracy (real/synthetic/tampered) and produced accurate localization masks for many tampered regions, while we explicitly documented the split protocol and leakage checks to support the validity of the evaluation. On CiFake, the model exceeded 95% accuracy and attained an AUC of 0.986. Conclusions: Overall, the results indicate that SigLIP-2 representations combined with multi-task training can deliver high detection accuracy and interpretable localization on challenging, realistic forgeries, while highlighting the importance of clearly stated evaluation protocols for fair comparison.

1. Introduction

The emergence of deepfake technology has been driven by advances in generative modeling within computer vision and machine learning [1]. Early work on facial reenactment and manipulation demonstrated the feasibility of photorealistic identity transfer. Thies et al. proposed Face2Face in 2016, enabling real-time transfer of facial expressions between source and target actors [2]. Suwajanakorn et al. further advanced this line of research by synthesizing realistic lip movements of former President Obama driven by audio input, highlighting the potential of deep learning for audiovisual synthesis [3]. These early studies provided the foundation for later, more accessible systems commonly referred to as deepfakes. Through the 2000s and early 2010s, steadily improving tools for image synthesis and facial modeling laid the groundwork for today’s deepfakes. Innovations in 3D face modeling (e.g., morphable models) and early neural-network-based image filters advanced the ability to swap or alter faces in media [4]. However, these earlier methods often required significant manual effort or had limited realism. The emergence of deep learning around 2014 dramatically accelerated progress. Researchers began applying neural networks to facial generation and editing, moving from traditional manual editing toward automated synthesis. By the mid-2010s, academic teams were already showing convincing face swaps and talking-head animations driven by neural networks.
Behind the rise of deepfakes is a clear technical trajectory. Early deepfake implementations in 2017–2018 primarily used autoencoder networks, a class of deep neural networks designed for unsupervised encoding and decoding. In a typical face-swapping autoencoder, one neural network (encoder) learns to compress face images into a latent representation, and two decoders are trained to reconstruct the faces of Person A and Person B, respectively [5]. However, the results often had noticeable artifacts (blurry skin textures, color mismatches) that gave the fake away. GAN-based architectures [6] have achieved impressive photorealism in human face generation. Notable examples include StyleGAN (Karras et al., 2019) [7] and StarGAN (Choi et al., 2018) [8]. Vision transformers (ViTs) are increasingly used for deepfake detection because they treat an image as a sequence of patches and use self-attention to model relationships across the entire frame [9]. In video deepfakes, transformers are also well suited to capturing spatiotemporal cues, since they can attend across patch sequences to detect irregular motion/texture dynamics that arise from frame-wise synthesis and blending [10].
The Local- and Temporal-aware Transformer-based Deepfake Detection (LTTD) framework explicitly models sequences of local patches using a Local Sequence Transformer and then aggregates evidence globally (including cross-patch objectives) to improve robustness and cross-dataset generalization, motivated by the observation that low-level temporal inconsistencies can persist under common post-processing [11]. Recent work also shows that ViT-based detection can be made more efficient and more generalizable through input design and pre-trained transformer adaptation. TALL (ICCV 2023) [12] introduces a simple but effective strategy for deepfake video detection by converting short sequences of consecutive video frames into a single composite “thumbnail” image. Instead of modeling temporal information with computationally expensive 3D CNNs or explicit temporal transformers, TALL spatially rearranges multiple frames into a grid-like layout (the Thumbnail Layout). This design allows a standard image-based vision transformer—such as a Swin Transformer—to implicitly capture temporal inconsistencies (e.g., flickering artifacts, misaligned facial geometry, blending errors) through spatial attention mechanisms. By encoding temporal dynamics as spatial structure, TALL significantly reduces memory and computational overhead while maintaining performance [13].
At the same time, foundation-model approaches adapt or decode pre-trained ViT representations: FatFormer builds a forgery-aware adaptive transformer on top of CLIP-style vision–language features (including frequency-aware adaptation and language-guided alignment) to improve detection across diverse generators [14]. DeCLIP (WACV 2025) decodes frozen CLIP embeddings with a convolutional decoder to localize partial manipulations, improving generalization and producing interpretable manipulation masks rather than only a binary label [15]. Despite rapid advances in deepfake detection, many detectors exhibit degraded performance under high perceptual realism or unseen generators, highlighting a persistent generalization gap [16]. In addition, a substantial portion of prior work focuses on face-centric artifacts and offers limited spatial interpretability, thereby constraining its applicability to broader social media manipulation scenarios. Furthermore, cross-dataset transfer remains challenging, particularly when only limited labeled data are available in deployment settings [17]. Li et al. noted that existing DeepFake datasets often contained low visual quality and conspicuous synthesis artifacts, such as splicing boundaries, color mismatch, and inconsistent face orientation, which made them less representative of real-world deepfakes [18]. This indicates that earlier detectors were effectively solving an easier task than actual in-the-wild deepfake detection. Next-generation datasets have begun to address these gaps; for example, the recent HiDF benchmark was explicitly built using state-of-the-art tools to produce “more realistic and undetectable” forgeries—and revealingly, detectors that performed well on older data suffered major accuracy drops on HiDF’s high-fidelity fakes [19].
As such, current systems still face four technical limitations that hinder real-world deployment. First, generalization remains brittle: detectors trained on one dataset or generator often degrade under new generators, compression, or domain shifts; to address this, we build on vision–language pre-trained backbones (SigLIP-2/OpenCLIP) to leverage semantically rich features that are less tied to dataset-specific artifacts. Second, much prior work is face-centric, while modern manipulations increasingly involve objects, attributes, and compositing across diverse social media imagery; accordingly, we emphasize semantic representations that capture global coherence and spatial context rather than relying on face-only cues. Third, explainability is limited in many detectors; we therefore add a mask-prediction (segmentation) head alongside classification to localize manipulated regions when pixel supervision is available, making outputs more inspectable for forensic and moderation workflows. Finally, deployment often involves distribution shift with scarce labels, where full retraining is impractical; we include a prototype-based few-shot adaptation procedure that reuses frozen features and updates decisions from a small support set, enabling lightweight transfer to new domains.
To address these limitations, our framework is designed with a clear mapping between identified research gaps and our methodological objectives. Specifically, we mitigate the risk of brittle generalization by leveraging the semantically rich, pre-trained feature space of SigLIP-2 and OpenCLIP backbones, which provide a more robust foundation than standard vision transformers. To overcome the excessive reliance on face-centric cues common in existing methods, we employ global-context semantic representations that facilitate a holistic analysis of image integrity. Furthermore, we address the challenge of limited explainability with an integrated multi-task mask-prediction head, enabling precise, pixel-level localization of forensic artifacts. Finally, to counter the persistent threat of distribution shift across diverse deployment scenarios, we incorporate a few-shot prototype adaptation procedure that enables the model to rapidly generalize to unseen data distributions. Collectively, these design choices ensure that our framework transitions from task-specific classification to a more interpretable and adaptable forensic analysis system.
Similarly, new benchmarks like SIDA move beyond faces to include object replacements and other subtle manipulations in social media images [20]. The field clearly faces a knowledge gap: current detection systems, trained on yesterday’s artifacts and limited domains, are ill-equipped to handle the increasingly realistic, diverse, and context-rich deepfakes emerging today. Without new approaches, automated forgery detectors will continue to lag the ever-evolving methods of synthetic media generation. This work makes three contributions: (i) we adapt SigLIP-2/OpenCLIP vision transformers for deepfake detection across image and video settings; (ii) we introduce a multi-task detector that outputs both a class decision and a pixel-level tamper mask when supervision is available; and (iii) we provide a unified empirical evaluation on HiDF, SIDA, and CiFake, including a few-shot prototype adaptation study to quantify cross-domain transfer. The multi-task SigLIP-2 framework, which jointly optimizes classification and forensic localization, represents a key innovation in deepfake detection by moving beyond simple classification toward integrated and interpretable forensic analysis grounded in foundation-model representations. In addition, our cross-domain adaptation strategy further demonstrates the framework’s flexibility through a prototype-based few-shot adaptation procedure, underscoring its practical utility in settings with limited labeled data, a capability not readily achieved by existing methods.

2. Materials and Methods

2.1. Model Overview

The HIDF-Siglip2 model is a binary image classifier built on a pre-trained OpenCLIP vision transformer backbone with a lightweight linear head (Figure 1). The backbone (a ViT-based image encoder) is initialized with OpenCLIP weights (e.g., ViT-L/16 SigLIP2 model pre-trained on WebLI) and kept frozen during training. This frozen feature extractor produces high-dimensional image embeddings, which are passed to a small fully connected head for classification. The head consists of a 256-unit hidden layer with ReLU activation and dropout, followed by a single output neuron that produces a single logit (interpreted via a sigmoid as the probability that the image is “real”). To regularize the model, light noise and dropout are applied to the features during training: a small Gaussian noise (std 0.02) is added to the embedding, and 10% dropout masks some feature components. This encourages robustness and prevents the classifier from relying on spurious cues. All image features are L2-normalized before feeding into the classifier head, consistent with CLIP’s feature representation.
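For concreteness, the following PyTorch sketch illustrates this design. It is a minimal illustration, not the exact repository code: the OpenCLIP model identifier, feature dimensionality, module names, and the ordering of normalization and noise injection are assumptions based on the description above.

```python
# Minimal sketch of a frozen SigLIP-2 backbone with a lightweight trainable head.
# Model name, embedding size, and noise/dropout ordering are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
import open_clip

class FrozenSigLIP2Classifier(nn.Module):
    def __init__(self, model_name="ViT-L-16-SigLIP2-384", pretrained="webli",
                 feat_dim=1024, noise_std=0.02, p_drop=0.10):
        super().__init__()
        # Frozen OpenCLIP/SigLIP-2 image encoder used purely as a feature extractor.
        self.backbone, _, _ = open_clip.create_model_and_transforms(
            model_name, pretrained=pretrained)
        for p in self.backbone.parameters():
            p.requires_grad = False
        self.noise_std = noise_std
        # Trainable head: feature dropout, 256-unit hidden layer with ReLU, single logit.
        self.head = nn.Sequential(
            nn.Dropout(p_drop),
            nn.Linear(feat_dim, 256),
            nn.ReLU(inplace=True),
            nn.Dropout(p_drop),
            nn.Linear(256, 1),
        )

    def forward(self, images):
        with torch.no_grad():
            feats = self.backbone.encode_image(images)
        feats = F.normalize(feats, dim=-1)            # L2-normalize, as in CLIP
        if self.training and self.noise_std > 0:
            feats = feats + self.noise_std * torch.randn_like(feats)  # light Gaussian noise
        return self.head(feats).squeeze(-1)           # single logit; sigmoid => P("real")
```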
Preprocessing: The input images undergo a two-stage preprocessing pipeline. First, images are processed using OpenCV (version 4.12.0) (which performs fast JPEG/PNG reading) and converted from BGR to RGB color space. The processed image (or a fallback dummy image if decoding fails) is then converted to a PIL image without re-encoding and passed through OpenCLIP’s standard transform pipeline. This pipeline includes resizing to the backbone’s expected resolution (e.g., 224 × 224 pixels), center cropping, tensor conversion, and normalization to CLIP’s mean and standard deviation. The result is a normalized tensor suitable for the CLIP visual encoder. No additional data augmentations (beyond the CLIP defaults) are applied in HIDF. The dataset is organized into two classes, FAKE and REAL, with labels 0 and 1, respectively. The data is split into separate training, validation, and test sets, each containing subfolders for the two classes.
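A compact sketch of this two-stage pipeline is shown below, assuming the OpenCLIP transforms supply the resize, crop, and normalization steps; the model identifier and the fallback behavior on decode failure are illustrative assumptions.

```python
# Illustrative version of the two-stage preprocessing described above.
import cv2
import numpy as np
from PIL import Image
import open_clip

# The returned transform performs CLIP-style resize, center crop, and normalization.
_, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-L-16-SigLIP2-384", pretrained="webli")

def load_and_preprocess(path, fallback_size=224):
    img = cv2.imread(path, cv2.IMREAD_COLOR)            # fast JPEG/PNG decoding
    if img is None:                                      # fallback dummy image on failure
        img = np.zeros((fallback_size, fallback_size, 3), dtype=np.uint8)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)           # BGR -> RGB
    pil = Image.fromarray(img)                           # wrap without re-encoding
    return preprocess(pil)                               # normalized tensor for the encoder
```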

2.2. Training Setup

Dataset and Task: The HIDF-Siglip2 model is trained to distinguish “fake” images from “real” images in a supervised binary classification setup. The class distribution is roughly balanced between fake and real images (the training set had fake ≈ real, so no extreme class imbalance), and thus no special class weighting was needed (the loss uses equal weight for both classes, with pos_weight = 1.0). Each input image is processed through the frozen CLIP backbone to obtain features, and the trainable head predicts a single logit, which is compared to the binary ground truth.
Training Configuration: The model was trained for 20 epochs on the training set, using a batch size of 64 and the binary cross-entropy loss (implemented as BCEWithLogitsLoss) for optimization. The AdamW optimizer was employed only on the classification head parameters (since the frozen backbone receives no gradient updates) with an initial learning rate of 2 × 10−4 and a weight decay of 0.01. We chose to freeze the backbone to reduce trainable capacity and increase stability. Training was accelerated with mixed-precision (FP16) automatic casting on GPU. Early stopping was not used; instead, the model ran for the full 20 epochs, and the best model (highest validation F1) was checkpointed for reference. We trained for a fixed budget of 20 epochs across all experiments to keep the optimization budget consistent and fully reproducible across datasets and comparisons. We do not claim that 20 epochs is optimal for all settings; rather, it serves as a standardized training budget. Reported results were obtained from the best validation checkpoint according to the primary metric (e.g., AUC/F1 depending on the task), not necessarily from the final epoch. A systematic study of early-stopping criteria and epoch budget sensitivity is left for future work. Validation was performed at the end of each epoch to track performance. The final model selection occurred at the last epoch (as performance continued to improve), and the model was then evaluated on the held-out test set.
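The head-only optimization can be summarized in the following sketch, which reuses the classifier sketched above (or any module exposing a trainable .head); dataloader construction and the per-epoch validation/F1 checkpointing are omitted, and the code is illustrative rather than the exact training script.

```python
# Sketch of head-only training with a frozen backbone, FP16 autocast, and AdamW.
import torch

def train_head(model, train_loader, epochs=20, lr=2e-4, device="cuda"):
    criterion = torch.nn.BCEWithLogitsLoss(
        pos_weight=torch.tensor(1.0, device=device))      # balanced classes
    optimizer = torch.optim.AdamW(model.head.parameters(), lr=lr, weight_decay=0.01)
    scaler = torch.cuda.amp.GradScaler()                   # FP16 mixed precision
    model.to(device)
    for epoch in range(epochs):                            # fixed 20-epoch budget
        model.train()
        for images, labels in train_loader:                # labels: 0 = FAKE, 1 = REAL
            images = images.to(device)
            labels = labels.float().to(device)
            optimizer.zero_grad(set_to_none=True)
            with torch.cuda.amp.autocast():
                loss = criterion(model(images), labels)
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
        # validate at the end of each epoch; checkpoint the best validation F1 (omitted)
```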
For the second dataset, SID_Set (the dataset introduced with the SIDA benchmark), we used a model architecture that supports both binary classification and pixel segmentation. We built our system on the SigLIP-2 vision transformer backbone, augmented with a multi-task decoder design. SigLIP-2 (a CLIP-based ViT model) provides a strong image encoder that produces a sequence of patch embeddings and a pooled image representation. On top of this encoder, we attached two heads: a classification head and a segmentation decoder. The classification head is a simple linear layer (with dropout regularization) that takes the transformer’s pooled output and predicts the image’s class: real, synthetic, or tampered. For segmentation, we incorporated a lightweight, SegFormer-style decoder to predict a pixel-wise tampering mask. Specifically, we extracted feature maps from multiple transformer layers (from shallow to deep) and projected each to a fixed embedding dimension via 1 × 1 convolutions. Each projected feature map was reshaped to its 2D spatial layout and passed through a depth-wise convolutional smoothing layer. These multi-scale feature maps were then concatenated and fed into a learned fusion module that uses an attention-gating mechanism to reweight each scale’s contribution. Finally, a 1 × 1 convolution produced the output tamper probability mask. This design, inspired by SegFormer’s all-MLP decoder, is efficient and leverages both fine and coarse features for precise localization. The model therefore outputs both a 3-class image prediction and a full-resolution binary mask in a single forward pass.
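The following schematic sketch illustrates one way such a multi-task head can be organized; the tapped encoder layers, patch-grid size, and module names are assumptions for illustration and not the exact repository implementation.

```python
# Schematic multi-task head: 3-class classification + SegFormer-style mask decoder
# on SigLIP-2 patch features. Layer indices, grid size, and names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskHead(nn.Module):
    def __init__(self, encoder_dim=1024, embed_dim=256, num_classes=3,
                 tap_layers=(5, 11, 17, 23), grid=24):
        super().__init__()
        self.tap_layers, self.grid = tap_layers, grid
        self.cls_head = nn.Sequential(nn.Dropout(0.1),
                                      nn.Linear(encoder_dim, num_classes))
        # 1x1 projections and depth-wise smoothing for each tapped encoder layer
        self.proj = nn.ModuleList([nn.Conv2d(encoder_dim, embed_dim, 1)
                                   for _ in tap_layers])
        self.smooth = nn.ModuleList([nn.Conv2d(embed_dim, embed_dim, 3,
                                               padding=1, groups=embed_dim)
                                     for _ in tap_layers])
        fused = embed_dim * len(tap_layers)
        # attention gate that re-weights each scale's contribution before fusion
        self.gate = nn.Sequential(nn.Conv2d(fused, len(tap_layers), 1), nn.Sigmoid())
        self.fuse = nn.Conv2d(fused, embed_dim, 1)
        self.mask_head = nn.Conv2d(embed_dim, 1, 1)

    def forward(self, pooled, hidden_states, out_size):
        cls_logits = self.cls_head(pooled)               # real / synthetic / tampered
        maps = []
        for i, layer_idx in enumerate(self.tap_layers):
            tokens = hidden_states[layer_idx]            # (B, N, C) patch tokens
            fmap = tokens.transpose(1, 2).reshape(       # back to the 2D spatial layout
                tokens.size(0), -1, self.grid, self.grid)
            maps.append(self.smooth[i](self.proj[i](fmap)))
        x = torch.cat(maps, dim=1)
        gates = self.gate(x)                             # one weight map per scale
        x = torch.cat([m * gates[:, i:i + 1] for i, m in enumerate(maps)], dim=1)
        mask_logits = self.mask_head(self.fuse(x))
        mask_logits = F.interpolate(mask_logits, size=out_size,
                                    mode="bilinear", align_corners=False)
        return cls_logits, mask_logits                   # class logits + tamper mask
```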
Loss Functions: Training used a composite objective to supervise both classification and segmentation. For image classification, we used a standard cross-entropy loss on the 3-class output. For segmentation, we designed an enhanced multi-loss to address class imbalance and shape accuracy in the tampered masks. Our implementation also supports an IoU loss term that directly optimizes the intersection over union of the predicted mask to improve region-level accuracy; for the reported experiments, however, we used a two-term segmentation objective combining binary cross-entropy and Dice loss in Equation (1):
$$\mathcal{L}_{\mathrm{seg}} = \lambda_{\mathrm{bce}}\,\mathcal{L}_{\mathrm{BCE}} + \lambda_{\mathrm{dice}}\,\mathcal{L}_{\mathrm{Dice}} \quad (1)$$
with fixed weights $\lambda_{\mathrm{bce}} = 0.2$ and $\lambda_{\mathrm{dice}} = 0.3$. This choice was used consistently to ensure a reproducible training protocol. The complete training code and configuration for the submitted results are provided in our GitHub repository. Our implementation also contained optional loss components (e.g., focal, IoU, and morphological consistency) that can be enabled for further study; however, these components were disabled in the default configuration used for all experiments reported in this paper.
We also employed a simple dynamic weighting schedule for the segmentation losses: early in training, a higher weight was placed on pixel-wise BCE (stabilizing learning), and gradually more weight was shifted to the Dice/IoU components in later epochs, to fine-tune the mask shape and overlap as the model becomes more confident. The classification and segmentation losses were finally combined (summed), as the tasks are learned jointly.
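A minimal sketch of the segmentation objective in Equation (1), together with an illustrative form of the BCE-to-Dice weight shift described above, is given below; the exact schedule shape and how it interacts with the fixed base weights are assumptions.

```python
# Sketch of the composite segmentation loss (Equation (1)) with an illustrative
# dynamic schedule that shifts emphasis from BCE (early) toward Dice (late).
import torch
import torch.nn.functional as F

def dice_loss(pred_logits, target, eps=1.0):
    pred = torch.sigmoid(pred_logits)
    inter = (pred * target).sum(dim=(1, 2, 3))
    union = pred.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    return (1.0 - (2.0 * inter + eps) / (union + eps)).mean()

def seg_loss(mask_logits, mask_target, epoch, total_epochs,
             lam_bce=0.2, lam_dice=0.3):
    bce = F.binary_cross_entropy_with_logits(mask_logits, mask_target)
    dice = dice_loss(mask_logits, mask_target)
    # illustrative schedule: converges to the fixed weights (0.2, 0.3) by the last epoch
    t = epoch / max(total_epochs - 1, 1)
    w_bce = lam_bce * (1.5 - 0.5 * t)
    w_dice = lam_dice * (0.5 + 0.5 * t)
    return w_bce * bce + w_dice * dice

# Total training loss: classification cross-entropy + segmentation term (summed), e.g.
#   loss = F.cross_entropy(cls_logits, cls_labels) + seg_loss(mask_logits, masks, epoch, 24)
```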
Dataset and Augmentation: We evaluated our approach on the SID_Set dataset, which consists of natural and manipulated images labeled in three categories (real, fully synthetic, and tampered). Each tampered image was accompanied by a ground-truth binary mask indicating the manipulated region, enabling supervised training of the segmentation branch. The dataset is roughly balanced across the three classes (e.g., 10 k images per class in the test set), and tampered regions vary in size and content, posing a realistic challenge for localization. To improve generalization, we applied extensive data augmentations during training. We first resized and padded to a fixed input size (using longest-side scaling to preserve aspect ratio), and then applied random image-level transformations using Albumentations and Kornia. These include random horizontal flips and mild geometric transforms (affine rotations and perspective warps) to augment spatial invariance. We simulated imaging artifacts by randomly applying JPEG compression degradation or Gaussian noise on some images and adjusting illumination via color jittering (random changes in brightness, contrast, saturation, and hue). We also added blur or sharpening filters (e.g., Gaussian blur, motion blur, or sharpening) to mimic various camera and post-processing effects. In addition, to help reveal subtle tampering traces, we employed Contrast Limited Adaptive Histogram Equalization (CLAHE) on a random subset (e.g., 20%) of images, which boosts local contrast and can make hidden splice or edit boundaries more visible. These augmentations were applied stochastically per batch, using GPU-accelerated routines (Kornia) to avoid bottlenecks. Overall, the augmentation strategy exposes the model to a wide range of appearances and distortions, forcing it to rely on intrinsic anomalies of synthetic or tampered content rather than superficial cues. We used the official HuggingFace partitions (train/validation) and did not perform any additional random splitting; training was performed on the training split and all SIDA metrics were reported on the validation split.
Training Strategies: We trained the SigLIP-2 + SegFormer model end to end, leveraging several strategies to maximize performance and efficiency. Progressive resizing was used: the model was initially trained on lower-resolution images and later fine-tuned on higher resolutions as training progressed. This curriculum allows faster convergence in early epochs and improves segmentation detail in later stages (once the model has learned coarse features, it benefits from seeing higher-resolution input). We also applied dropout regularization (e.g., 10% dropout) in both the classification head and within the decoder’s convolutional layers, which mitigates overfitting given the model’s high capacity and near-perfect training accuracy. To efficiently train the large SigLIP-2 backbone, we employed automatic mixed precision (AMP) with bfloat16. This accelerates training and reduces memory usage without degrading accuracy. In our experiments, we enabled gradient checkpointing on the transformer encoder (trading extra computation for about 30% lower memory usage), and used memory-efficient tensor layouts (PyTorch’s channels-last format) to further optimize GPU utilization. We additionally compiled the decoder using PyTorch 2.0’s torch.compile to improve runtime throughput. Training ran for 24 epochs using a cosine learning-rate schedule with warmup.
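The efficiency-oriented pieces of this setup can be assembled roughly as follows; attribute names such as model.encoder and model.decoder, and the warmup length, are assumptions for illustration.

```python
# Illustrative helpers: cosine schedule with warmup, channels-last layout,
# gradient checkpointing, torch.compile, and bfloat16 autocast.
import math
import torch

def cosine_with_warmup(optimizer, total_steps, warmup_steps=500):
    # linear warmup followed by cosine decay over the training run
    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(warmup_steps, 1)
        progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
        return 0.5 * (1.0 + math.cos(math.pi * progress))
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

def apply_efficiency_settings(model):
    model = model.to(memory_format=torch.channels_last)       # memory-efficient layout
    if hasattr(model, "encoder") and hasattr(model.encoder, "gradient_checkpointing_enable"):
        model.encoder.gradient_checkpointing_enable()          # trade compute for memory
    if hasattr(model, "decoder"):
        model.decoder = torch.compile(model.decoder)            # PyTorch 2.x compilation
    return model

# Forward/backward passes then run under bfloat16 autocast, e.g.:
#   with torch.autocast("cuda", dtype=torch.bfloat16):
#       cls_logits, mask_logits = model(images)
```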
For the CiFake dataset, we used a codebase similar to the one used for HiDF [21]. Architecturally, the system is grounded in OpenCLIP’s SigLIP and SigLIP-2 families of vision–language encoders, offering multiple configurations tailored to different computational budgets and research needs: a tiny ViT-B-16 at 256 px input resolution, a small ViT-B-16 at 384 px, a medium ViT-L-16 at 384 px (~307 M parameters), and a large ViT-SO400M-16-SigLIP2 at 512 px (~400 M parameters). A custom FastBinaryClassifier wraps these backbones, providing an efficient classification head, optional lightweight attention for smaller variants, and dropout regularization. From a data processing perspective, the pipeline is highly flexible. It supports traditional torchvision pipelines alongside advanced augmentations via Albumentations, and also includes GPU-accelerated Kornia transforms for real-time stochastic variation. Two experimental strategies, progressive resizing and ultra-JPEG compression, were exposed as configurable options to improve model robustness against distribution shifts and compression artifacts. Batch sizes were dynamically tuned based on backbone size and active augmentations, reducing out-of-memory risks for medium and large models, particularly when multiple memory-intensive features are enabled simultaneously. The loss function can be specified as either a weighted binary cross-entropy with logits, giving additional weight to the synthetic (“fake”) class, or as focal loss, with tunable α and γ parameters to address class imbalance and hard-negative mining. The optimization backbone relied on AdamW with decoupled weight decay, paired with modern stability and performance techniques such as gradient accumulation, gradient checkpointing, and Fully Sharded Data Parallelism (FSDP) for large-scale distributed training. Additional regularization included label smoothing, MixUp interpolation, and exponential moving average (EMA) updates of model parameters, a practice known to improve generalization in vision tasks. PyTorch 2.0’s torch.compile facility was integrated for potential runtime acceleration, with fallbacks across compilation modes to ensure robustness.
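As one example of the configurable objectives above, a minimal binary focal loss with logits might look as follows; the default α and γ values and the positive-class convention (1 = fake) are illustrative assumptions.

```python
# Minimal binary focal loss with logits, matching the alpha/gamma option described above.
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)                                   # probability of the true class
    alpha_t = alpha * targets + (1.0 - alpha) * (1.0 - targets)
    return (alpha_t * (1.0 - p_t) ** gamma * bce).mean()    # down-weights easy examples
```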
Computational Footprint: All experiments were run on a single NVIDIA 5090 GPU (24 GB) with PyTorch 2.1. We partitioned each dataset into training, validation, and test splits with ratios of 70%, 15%, and 15%, respectively. This configuration ensured reproducibility while providing sufficient computational resources for both training and evaluation. Unless otherwise specified, the same experimental protocol was applied across all datasets, except for SID_Set, which required a slightly different handling strategy due to its unique characteristics. The SID_Set dataset was downloaded from HuggingFace and comes with predefined partitions: approximately 210,000 images in the training set and 30,000 images in the validation set, totaling ~240,000 samples.

2.3. Prototype-Based Few-Shot Inference Using SigLIP-2

In this work, we employ a prototype-based few-shot classification framework to perform cross-dataset adaptation for deepfake detection using a frozen SigLIP-2 model. We initialize from a checkpoint trained on the HiDF dataset; during few-shot adaptation, we discard the classifier head and use only the frozen image encoder to extract transferable representations. This enables adaptation to a new dataset using only a small, labeled support set per class, without fine-tuning the backbone.
Pre-trained feature extractor.
Let $f : \mathcal{X} \rightarrow \mathbb{R}^d$ denote the frozen SigLIP-2 image encoder (here $d = 1024$). For an input image $x$, we compute an $L_2$-normalized embedding in Equation (2):
$$z(x) = \frac{f(x)}{\lVert f(x) \rVert_2} \quad (2)$$
Support set and prototype construction.
Given a support set $S = \{(x_i, y_i)\}_{i=1}^{n}$ with classes $c \in \{\mathrm{real}, \mathrm{fake}\}$, we define the class-specific subset $S_c = \{x_i \in S : y_i = c\}$. The prototype for class $c$ is the mean of the normalized embeddings in Equation (3):
$$p_c = \frac{1}{\lvert S_c \rvert} \sum_{x_i \in S_c} z(x_i) \quad (3)$$
We then $L_2$-normalize $p_c$ before classification.
Query classification.
For a query image $x$, we compute its embedding $z(x)$ and measure its Euclidean distance to each prototype. Class probabilities are obtained via a softmax over negative distances in Equation (4):
$$P(y = c \mid x) = \frac{\exp\!\left(-\lVert z(x) - p_c \rVert^{2} / \tau\right)}{\sum_{c'} \exp\!\left(-\lVert z(x) - p_{c'} \rVert^{2} / \tau\right)} \quad (4)$$
In our implementation we do not apply additional temperature scaling; therefore, we set τ = 1.0 throughout.
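Putting Equations (2)–(4) together, the prototype-based inference procedure can be sketched as follows; the function names are illustrative, and the encoder is assumed to expose an encode_image method as in OpenCLIP.

```python
# Sketch of prototype-based few-shot classification on frozen SigLIP-2 embeddings.
import torch
import torch.nn.functional as F

@torch.no_grad()
def embed(encoder, images):
    z = encoder.encode_image(images)            # frozen SigLIP-2 image encoder
    return F.normalize(z, dim=-1)               # Equation (2): L2-normalized embedding

@torch.no_grad()
def build_prototypes(encoder, support_images, support_labels, num_classes=2):
    z = embed(encoder, support_images)
    protos = torch.stack([z[support_labels == c].mean(dim=0)
                          for c in range(num_classes)])   # Equation (3): class means
    return F.normalize(protos, dim=-1)                    # re-normalize prototypes

@torch.no_grad()
def classify(encoder, query_images, protos, tau=1.0):
    z = embed(encoder, query_images)                       # (B, d)
    d2 = torch.cdist(z, protos).pow(2)                     # squared Euclidean distances
    return F.softmax(-d2 / tau, dim=-1)                    # Equation (4), with tau = 1.0
```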

2.3.1. Robustness Through Test-Time Augmentation

To increase stability and account for minor spatial variations, we apply test-time augmentation (TTA) by horizontally flipping the input image and averaging the softmax outputs across original and flipped versions. This ensemble approach mitigates overfitting to dataset-specific low-level artifacts and improves generalization. This approach is especially powerful when the source (training) and target (inference) datasets differ in compression, resolution, manipulation type, or content domain (e.g., faces, artworks, or social media uploads).
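A minimal version of this flip-based TTA, reusing the classify function sketched above, is:

```python
# Horizontal-flip test-time augmentation: average softmax outputs of original + flipped.
import torch

@torch.no_grad()
def classify_with_tta(encoder, query_images, protos, tau=1.0):
    p = classify(encoder, query_images, protos, tau)
    p_flip = classify(encoder, torch.flip(query_images, dims=[-1]), protos, tau)
    return 0.5 * (p + p_flip)
```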

2.3.2. Threshold Selection and Performance Evaluation

A classification threshold is selected using a disjoint validation subset to satisfy a minimum precision constraint. This threshold is then applied to the probability outputs for final prediction. The model’s performance is reported as accuracy, precision, recall, F1-score, and AUC-ROC. Additionally, false positives and negatives are analyzed to identify dominant failure modes, and standard diagnostic plots (confusion matrix, ROC, precision–recall curve) are generated for interpretability.
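The precision-constrained threshold search can be sketched as follows; the minimum-precision value shown is an assumed placeholder rather than the value used in our experiments.

```python
# Sketch of threshold selection under a minimum-precision constraint on a
# disjoint validation subset: pick the threshold that maximizes recall
# among all thresholds satisfying the precision floor.
import numpy as np

def pick_threshold(val_probs, val_labels, min_precision=0.90):
    best_t, best_recall = 0.5, -1.0
    for t in np.linspace(0.05, 0.95, 181):
        preds = (val_probs >= t).astype(int)
        tp = np.sum((preds == 1) & (val_labels == 1))
        fp = np.sum((preds == 1) & (val_labels == 0))
        fn = np.sum((preds == 0) & (val_labels == 1))
        precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
        if precision >= min_precision and recall > best_recall:
            best_t, best_recall = t, recall
    return best_t
```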

3. Results

Figure 2 presents a comprehensive evaluation of the HIDF model. The training progress plot (top left) shows that the loss declines rapidly in the early epochs before flattening around epoch 20, while the validation F1 and accuracy steadily increase and plateau at high levels (F1 ≈ 0.92, Acc ≈ 0.91), demonstrating convergence and stable learning. The ROC curve (top right) highlights the model’s strong discriminative ability, with the solid red curve bending sharply toward the upper left and yielding an AUC of ≈0.968, which approaches near-perfect classification; by contrast, the random baseline (AUC = 0.5) is represented by the diagonal dashed line. The precision–recall curve (bottom left) further emphasizes robustness on the positive (real) class, maintaining very high precision across a wide recall range and achieving an average precision of ≈0.972, substantially outperforming the random baseline precision of ~0.553 defined by class distribution. Finally, the confusion matrix (bottom right) demonstrates that correct predictions dominate, with 4287 fake–fake and 5283 real–real classifications on the diagonal and relatively few errors (401 fake as real and 505 real as fake), reflected by the light shading of the off-diagonal cells. Taken together, these results confirm that the HIDF model achieves excellent generalization and balanced performance across multiple evaluation criteria.

3.1. Training Progress

In Figure 3, the video model’s training history over 20 epochs shows steady improvement and convergence. Training loss decreased rapidly from around 0.8 to roughly 0.30 by epoch 20, while validation loss dropped in parallel from about 0.6 to 0.36. This downward trend in both losses, with validation loss closely tracking training loss, indicates that the model learned effectively without severe overfitting. Similarly, accuracy climbed consistently—training accuracy rose to the high 80s (%), and validation accuracy improved from ~70% in early epochs to around 85–90% by the final epoch. Notably, the validation accuracy closely approached the training accuracy in later epochs, suggesting good generalization. The validation F1 score followed a similar pattern: it jumped early on and then gradually leveled off, reaching a best value around 0.85–0.86. After about 10 epochs, improvements in F1 became incremental, indicating that the model had largely converged. In summary, by epoch 20 the training and validation metrics stabilized (with high accuracy and F1 and low loss), demonstrating that the model converged to a strong solution with balanced performance on training and validation data.

3.2. Video Classification Model Performance

As shown in Figure 4, the accuracy was 85.3%, indicating that the model correctly classified approximately 85% of all video samples. The balanced accuracy shown in Figure 4 was also 85.3%, confirming that the model performs evenly well on both classes (mitigating any class imbalance effects by averaging per-class recall). The model’s precision was 0.834, and recall was 0.881—in other words, when the model predicted a video was “fake,” 83.4% of those predictions were correct, and it caught 88.1% of all actual fake videos. This trade-off yielded a combined F1 score of 0.857, reflecting a strong balance between precision and recall for the positive “fake” class. Furthermore, the classifier demonstrated strong discriminative ability: the ROC AUC was 0.931, corresponding to an area of 0.931 under the ROC curve when separating real from fake videos. The average precision (AP) was 0.925, which corresponds to the area under the precision–recall curve—a high value indicates that the model maintains good precision across a range of recall levels. Together, these metrics indicate robust performance for the video classification model, with both high accuracy and strong ability to separate the two classes.
As shown in Figure 5, out of all real videos in the test set, 540 were correctly predicted as real (true negatives for the fake-detection task), while 115 were incorrectly predicted as “fake.” These 115 misclassified real videos represent the false positives (type I errors) from the perspective of fake detection—genuine content that the model flagged as fake. For the fake videos, the model correctly identified 577 as fake (true positives), and 78 were wrongly classified as “real.” Those 78 missed fakes are the model’s false negatives (type II errors), i.e., fake videos that slipped through as seeming real. This breakdown aligns with the earlier per-class recall values (real recall 540/(540 + 115) ≈ 82.4%; fake recall 577/(577 + 78) ≈ 88.1%). Overall, the confusion matrix shows that the classifier is more likely to produce false positives (flagging a real video as fake) than false negatives (failing to catch a fake video). In absolute terms, there were 115 false alarms vs. 78 misses. This suggests that the model has been tuned to err on the side of caution—it catches most of the fake videos (high fake recall), even if it occasionally raises an incorrect alert on real videos. Importantly, the large numbers on the diagonal (540 and 577 correct classifications for real and fake, respectively) relative to off-diagonals demonstrate the model’s strong overall accuracy in distinguishing the two classes.

3.3. Prediction Confidence Distribution

The distribution of the model’s prediction confidences shows that its outputs are well separated across the two classes, with fake predictions skewed toward higher confidence. We examined the predicted probability scores for the “fake” class for both actually real and fake videos. The histogram of these probabilities shows a clear bimodal pattern: most real videos have predicted fake probabilities near 0 (blue distribution concentrated at the low end), whereas most fake videos have predicted probabilities near 1.0 (red distribution piled up at the high end). There is a pronounced gap in the middle probability range, indicating that the model rarely produces mid-range, ambiguous confidence scores. In other words, the classifier tends to be very confident in its decisions—it usually outputs probabilities close to 0 or close to 1 for real and fake instances, respectively.
In Figure 6, the boxplot of prediction probabilities for each true class further illustrates this separation. For real videos, the predicted probability of being fake is generally very low: the median probability for real instances is near 0.1–0.2, and the interquartile range lies in the low probabilities (most real videos receive <0.3 probability of being fake). Only a small number of outlier real videos receive higher fake-probability scores (corresponding to the few false positives in which the model was overly suspicious of real content). In contrast, fake videos show a high-confidence profile: their predicted fake-probability scores have a median around 0.9, with many examples at or near 1.0. The bulk of fake instances receive a probability well above 0.7, indicating that the model is typically very sure when it labels something as fake. A few fake videos have lower scores (outliers in the boxplot), which are the cases the model found challenging and possibly misclassified as real (the false negatives). Overall, this confidence analysis demonstrates that the model’s internal decision boundary cleanly separates real and fake videos. The scores for the two classes occupy mostly distinct ranges with minimal overlap, explaining the high AUC of 0.931. Moreover, the skew of fake predictions toward high probabilities suggests that the model is emphatically identifying fake content with strong confidence. This clear separation in prediction confidence is a desirable property in a classification model—it means that not only is the model accurate, but it also knows when it is uncertain. Here, the small overlap between the blue and red distributions corresponds to the relatively few errors (the 115 real videos and 78 fakes near the decision threshold), while most cases are classified with a comfortable margin. Such well-separated probability distributions reinforce the reliability of the model’s predictions in practical deployment.

3.4. Comparative Analysis with HiDF Baselines

To contextualize our results, we compare them against the baseline metrics reported in the HiDF benchmark (Table 3 in KDD 2025) [19]. Across both video and image tracks, previously published methods struggled with HiDF’s human-indistinguishable content, with video baselines such as EB4 achieving AP = 0.712 and AUC = 0.733 and image baselines like EB4 achieving AP = 0.722 and AUC = 0.697. Other transformer-based or convolutional architectures (e.g., MARLIN, CADDM, FTCN) generally remained in the 0.49–0.70 AUC range for video and below 0.75 AP for images, reflecting the difficulty of the dataset.
By contrast, as shown in Table 1, our model demonstrates a substantial performance leap on HiDF. On the video track, we obtain AP = 0.925 and AUC = 0.931, representing an improvement of +0.213 AP and +0.198 AUC over the strongest reported baseline (EB4). On the image track, our model further advances the state of the art with AP = 0.972 and AUC = 0.968, outperforming the prior best (EB4, AP = 0.722, AUC = 0.697) by more than +0.25 AP and +0.27 AUC. These results establish our approach as the new state of the art (SOTA) on HiDF across both modalities. The magnitude of improvement highlights not only the robustness of our architecture but also its ability to generalize to human-indistinguishable forgeries, where prior methods exhibited severe performance degradation.

3.4.1. SIDA Dataset

Classification Performance: In Figure 7, the proposed model achieves near-perfect classification accuracy on SIDA. At epoch 24, overall classification accuracy reaches 99.09%, which indicates that almost all images are correctly identified as real, synthetic, or tampered. The macro-averaged F1-score is 0.9909, showing uniformly high performance across all three classes. The confusion matrix reveals that real images are correctly recognized 99.01% of the time (9901 out of 10,000 real images), synthetic images 98.69% of the time, and tampered images 99.56% of the time. In other words, only a handful of images in each category are misclassified. Most of the confusion comes from synthetic vs. tampered mislabeling in a few cases (e.g., a synthetic image mistaken as a tampered photo or vice versa), whereas genuine real images are almost never confused as fake. Such results represent a substantial improvement over prior works on this dataset (which typically report lower-90s accuracy) and effectively push the classification task to an almost saturated level. Although overall classification accuracy exceeds 99%, the ROC-AUC (~0.74) indicates moderate ranking separability between real and manipulated samples. This discrepancy arises because two classes (synthetic and tampered) are highly separable under argmax classification, while real and tampered samples exhibit overlapping probability distributions. For deepfake detection, SigLIP-2’s vision transformer (ViT) architecture functions as a holistic evaluator of physical and logical consistency, moving beyond the “crop and look” limitations of traditional face-centric detectors. By treating an image as a sequence of patches that “talk” to one another via self-attention, the model can cross-reference information from across the entire frame to identify subtle compositing errors. For instance, while a manipulated face might look flawless in isolation, SigLIP-2 can recognize that the direction of light hitting the nose contradicts the shadows cast by objects in the background, or that the person’s reflection in a window does not align with their actual posture. This “global conversation” allows the model to detect when a subject has been digitally spliced into a scene, ensuring that the foreground and background obey the same rules of geometry, lighting, and perspective. Ultimately, the model does not just evaluate if a face looks real, but whether that face logically belongs within its specific environmental context.
Figure 8 summarizes the pixel-level segmentation results. The confusion matrix shows that the model correctly identifies most background pixels, with 7,437,482 classified as background and only 91,200 mislabeled as tampered. For tampered regions, 807,227 pixels are correctly detected while 52,699 are missed, demonstrating strong recall. Overall, this yields a pixel accuracy of 98.28%. Focusing on tampered pixels, the model attains a recall of 93.87% and a precision of 89.85%, reflecting a tendency to slightly over-predict boundaries to maximize detection coverage. The overlap-based measures confirm this performance: the mean intersection over union (IoU) across the dataset is 0.743, with a corresponding Dice coefficient of 0.822. Notably, the IoU distribution skews toward high values, with the median at 0.8215, meaning that more than half the images achieve IoU above 82%. Taken together, these results indicate that while segmentation remains more challenging than classification, the model reliably captures most tampered areas with strong overlap quality and only occasional boundary-level false positives.
In Figure 9, for large, well-defined manipulations (top rows), the predicted masks align closely with the ground truth, yielding high overlaps (IoU = 0.920, Dice = 0.958 for “truck”; IoU = 0.849, Dice = 0.918 for “mug”). For medium-sized manipulations, such as human figures or objects, predictions remain strong, though boundary alignment begins to degrade slightly (IoU = 0.710, Dice = 0.830; IoU = 0.699, Dice = 0.823). For smaller or more subtle edits (bottom rows), performance is more variable: IoUs range from 0.614 down to 0.479, with Dice scores between 0.761 and 0.647. The error overlays confirm that failure cases are largely due to imperfect boundary delineation or missed fragments of very small, tampered regions, rather than complete failure to detect manipulation. These examples highlight the strengths of the proposed model in accurately recovering large-scale manipulations, while also revealing their limitations on fine-grained or low-contrast edits.
The proposed model attains 99.09% overall accuracy with macro/weighted F1 = 0.9909. At epoch 24, the confusion matrix shows 9901/10,000 real, 9869/10,000 synthetic, and 9956/10,000 tampered predictions correct, with the few residual errors concentrated between synthetic and tampered. By comparison, the SIDA paper reports 93.5% accuracy and 93.5% F1 for SIDA-7B on SID-Set (13B is similar). As summarized in Table 2, this places the detector ≈5.6–5.7 percentage points higher in both accuracy and F1 than SIDA’s best reported setting. As shown in Table 3, the results demonstrate a significant advancement over existing state-of-the-art (SOTA) methods: the proposed model establishes a new benchmark, with substantial improvements in both F1 score and intersection over union (IoU).

3.4.2. CIFake Dataset

The binary classifier’s performance, as shown in the visualizations provided, demonstrates strong ability to distinguish between real and fake instances. Beginning with the normalized confusion matrix, we observe that 94.9% of real samples are correctly identified as real, while only 5.1% are mistakenly classified as fake. Similarly, on the other side of the matrix, 96.9% of fake samples are correctly predicted as fake, with only 3.1% misclassified as real. These percentages highlight balanced performance, with neither class disproportionately favored and both exhibiting high recognition rates. The normalized view allows us to see general trends independent of the dataset’s absolute size, underscoring the model’s ability to maintain reliable predictions across classes.
The raw confusion matrix further grounds these observations by revealing the concrete counts of predictions. In Figure 10, out of nearly twenty thousand samples, 9487 real items are correctly identified, while 513 fall into the category of false positives. Likewise, 9689 fake items are classified correctly, with only 311 contributing to false negatives. These raw numbers reinforce the earlier impression: misclassifications are a relatively small minority, accounting for just over four percent of the total. The fact that the dataset is relatively balanced between real and fake examples, and that both sides achieve similar success rates, suggests that the classifier did not overfit to one class at the expense of the other. Instead, it demonstrates consistency across the task of binary classification.
In Figure 10, turning to the ROC curve, the model’s discriminative capacity is further highlighted. The curve rises sharply toward the upper-left corner, which represents the ideal performance zone where the true-positive rate is maximized while the false-positive rate remains minimal. The area under the curve (AUC) reaches 0.986, which is extremely close to the perfect score of 1.0. This indicates that across all possible decision thresholds, the classifier maintains a very high probability of correctly ranking positive instances above negative ones. In other words, the model does not simply perform well at one cutoff but instead exhibits robustness and reliability across varying decision boundaries. Such an AUC value points to a strong generalization capability and a high level of confidence in the system’s predictions.

3.5. Few-Shot Inference and Cross Validation on Different Datasets

We evaluate our few-shot adaptation framework using two publicly available datasets from Kaggle: the AI vs. Human Generated Images dataset by Sala (2023) [22] and the Deepfake vs. Real 60K dataset by Prithiv (2023) [23]. The Sala dataset comprises a balanced collection of 79,950 images, equally divided between AI-generated (e.g., diffusion or GAN-based) and human-generated photographs. Each sample is annotated with a binary label: 1 for AI-generated and 0 for human-generated, spanning diverse visual content and styles that reflect the range of real-world generative use cases. The Deepfake vs. Real 60K dataset provides a large-scale benchmark of 60,000 images similarly labeled as real or deepfake, with a variety of manipulation artifacts and compression levels. Together, these datasets support rigorous evaluation across both content and distribution shifts. We use a subset of each for prototype construction (few-shot support sets) and apply our method on the full query set, simulating low-resource deployment settings. Per Table 4, their combined diversity and scale make them well suited for assessing generalization and cross-domain robustness in image authenticity detection. The 50-shot adaptation significantly improved cross-domain generalization, with Deepfake-vs.-Real-Classification achieving near-perfect balance (F1 ≈ 0.982) and AI vs. Human retaining robust discriminative power (AUC = 0.9553). These findings confirm the few-shot (50-shot) framework’s strong generalization and rapid adaptation to new datasets.

4. Discussion and Conclusions

4.1. Comparison with Prior Work and Benchmark Performance

The experimental results establish that the proposed SigLIP-2-based models significantly outperform previous deepfake detectors across all evaluated benchmarks. On the HiDF dataset, which was explicitly designed with extremely realistic, “human-indistinguishable” forgeries, our model achieves a video AUC of 0.931 and image AUC of 0.968, whereas the strongest prior baseline only reached AUC 0.733 on videos and 0.697 on images. This ~0.20 (20 percentage-point) improvement in AUC over the best previous results is a remarkable leap, highlighting that our CLIP-based architecture can detect subtle artifacts that earlier CNN and transformer models missed. Other published methods on HiDF (e.g., MARLIN, CADDM, FTCN) remained in the 0.49–0.70 AUC range for videos, reflecting the difficulty of this dataset. In contrast, our approach’s high accuracy on HiDF indicates a robust ability to generalize to high-fidelity fakes that previously caused severe performance degradation. The state-of-the-art performance across both images and videos suggests that the SigLIP-2 backbone effectively captures discriminative features of forgeries even when they appear highly authentic.
On the SIDA image dataset, a similarly large performance gain is observed. SIDA’s original benchmark model achieved about 93.5% accuracy (F1 = 93.5%) on its three-class classification task (real, synthetic, tampered). Our proposed model dramatically improves this to 99.09% accuracy (F1 ≈ 0.991), effectively pushing the task toward saturation. In practical terms, the detector correctly identifies virtually 99 out of 100 images in each category, versus ~93 out of 100 for the previous state of the art. Such a result indicates that the SigLIP-2 vision transformer backbone provides an exceedingly discriminative representation for distinguishing real photographs from both AI-generated and manipulated ones. The confusion matrix analysis shows almost no confusion between genuine and fake images—most errors are minor mix-ups between the synthetic vs. tampered classes. This implies that our model not only detects the presence of a fake but can also differentiate AI-generated images from those that were genuine but altered, a nuanced distinction important for forensics. The improvement over SIDA’s baseline (~5.6 percentage points in accuracy) underscores the strength of our approach’s feature encoding and training strategy. It is noteworthy that SIDA’s own framework emphasized multimodal explainability—their model not only classified images but also produced text explanations for each decision. In contrast, our detector does not generate natural-language explanations, focusing purely on classification and pixel-level localization. This likely allowed the model to allocate all its capacity to detection performance, resulting in higher accuracy. However, it also means our system currently lacks the user-friendly rationale that SIDA’s explanations provided.
For the CiFake dataset (a binary real vs. fake image benchmark derived from CIFAR-10), our model similarly achieves outstanding results. CiFake is a more constrained scenario (low-resolution 32 × 32 images from a fixed set of object classes), but it is useful for evaluating detectors on generative artifacts in a controlled setting. Previous work on CiFake reported accuracy up to ~92.98% with deep CNNs. Our CLIP-based model exceeds this, correctly classifying over 95% of images and yielding an AUC of 0.986. The ROC curve for our classifier rises steeply towards the top left, indicating excellent discriminative power across decision thresholds. The confusion matrix is nearly diagonal—about 94% of real images and 96.9% of fakes are identified correctly, with only ~3–5% error rates in each class. This balanced high performance suggests the model is not skewed toward false positives or false negatives on CiFake, an important property for practical use. Given CiFake’s limited resolution and scope, our results demonstrate that the proposed approach can scale down to detect even low-res synthetic images reliably. It also highlights that pre-training on large image-text data (as in CLIP) endows the model with general features that transfer effectively to detecting AI-generated content, even in small images.

4.2. Limitations and Failure Modes

The conclusions of this paper are bounded by the datasets, generators, and evaluation protocols included in our experiments. We therefore do not claim universal robustness to all future generative models, all post-processing pipelines, or all “in-the-wild” conditions; rather, we report performance under the specific benchmarks and settings evaluated here. A consistent limitation is sensitivity to subtle edits. While large, visually salient manipulations (e.g., face swaps or fully synthetic images) are detected with near-perfect accuracy, performance degrades when the forgery is small, low-contrast, or boundary-ambiguous. In the SIDA experiments, segmentation performance for smaller or low-visibility manipulations could drop to approximately 0.50–0.60 IoU, compared to ~0.82 IoU for large changes. In such cases (e.g., adding a small object or altering a minor detail), the model may partially highlight the tampered area or miss it entirely. Importantly, these cases reflect localization and boundary delineation failures (missed fragments of tiny regions) rather than a complete inability to detect any manipulation. Practically, this means that the method is less reliable for detecting small edits embedded within otherwise authentic imagery, which remains a recognized open challenge in multimedia forensics. Although we evaluate across three benchmarks with distinct characteristics (HiDF: high-fidelity deepfakes; SIDA: social-media-style synthetic images produced by specific diffusion/GAN pipelines; CiFake: diffusion-based low-resolution fakes), there remains a risk that the model exploits dataset- or generator-specific artifacts that may not persist in future synthesis pipelines. For example, prior studies report that CiFake classifiers may rely on diffusion-related background noise patterns; given our strong CiFake performance, similar shortcut cues may contribute to the decision boundary. If future generators reduce or eliminate such artifacts, detection performance may degrade. Likewise, the SIDA setting may bias the model toward the generation pipelines represented in its training data.
Consequently, the results reported here should be interpreted as evidence of generalization across the evaluated benchmarks, not as proof of generator-invariant detection in fully open-world scenarios. The strongest configuration relies on a large vision transformer backbone and non-trivial training/inference costs. This work does not evaluate production constraints such as latency, throughput, or energy budgets, and it does not provide a deployment-optimized model. In addition, our study focuses on images and RGB video; we do not evaluate multimodal deepfakes (e.g., synthetic speech in talking-head videos) or audio-driven manipulation, which are increasingly relevant [24]. Finally, while localization masks offer a form of visual interpretability where pixel supervision exists, we do not provide textual or causal explanations of the model’s decisions, which limits interpretability compared with approaches that explicitly generate human-readable rationales [25].

4.3. Conclusions

This work presented a new deepfake detection framework that leverages large-scale pre-trained vision transformers (OpenCLIP’s SigLIP-2) to achieve improved accuracy in identifying and localizing fake content. Across three benchmark datasets—HiDF (high-fidelity video and images), SIDA (social media images with manipulations), and CiFake (synthetic vs. real image pairs)—the proposed models established state-of-the-art results, markedly outperforming previous best approaches on each. Notably, the detector improved AUC on the very challenging HiDF videos by nearly 20 points over prior methods, and pushed image classification accuracy on SIDA to ~99%, about 5–6% higher than the previous state of the art. In addition to classification prowess, the model provides pixel-level tampering masks that align well with ground truth, offering valuable forensic insight into where and how an image was altered.
In conclusion, our CLIP-based SigLIP-2 detection models significantly advance the state of the art in deepfake detection, combining accuracy and interpretability to address key challenges in the field. While no detector can afford complacency in the face of rapidly evolving generative technology, the techniques and results we have demonstrated here form a strong foundation. They pave the way for the development of next-generation forensic tools that can keep pace with emerging deepfake methods, protect the authenticity of digital media, and ultimately bolster public trust in the veracity of online information. The ongoing evolution of this detector, through continued research on generalization and explainability, will be crucial for maintaining a trustworthy digital ecosystem as AI-generated content becomes increasingly sophisticated.

Author Contributions

J.S. and D.X. proposed the research. J.S. designed the algorithms, implemented the software, performed the data analytics, and co-drafted the paper. D.X. provided guidance and edited the paper. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Science Foundation CyberCorps SFS program (Award 1946619).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are openly available in public repositories at the following locations: the HiDF: A Human-Indistinguishable Deepfake Dataset is available in Zenodo at https://zenodo.org/records/16140829 (accessed on 10 March 2026); the SIDA (Social Media Image Deepfake Assistant) dataset and associated code are available on GitHub at https://github.com/hzlsaber/SIDA (accessed on 10 March 2026); the CIFAKE: Real and AI-Generated Synthetic Images dataset is available on Kaggle at https://www.kaggle.com/datasets/birdy654/cifake-real-and-ai-generated-synthetic-images (accessed on 10 March 2026); the AI vs. Human-Generated Images dataset is available on Kaggle at https://www.kaggle.com/datasets/alessandrasala79/ai-vs-human-generated-dataset (accessed on 10 March 2026); and the Deepfake-vs-Real-60k dataset is available on Kaggle at https://www.kaggle.com/datasets/prithivsakthiur/deepfake-vs-real-60k (accessed on 10 March 2026). The GitHub for the code is https://github.com/joesound212985/Deepfake-Detection-using-CLIP-Based-SigLIP-2-Vision-Transformers/ (accessed on 10 March 2026). The demo is available at https://huggingface.co/spaces/joesound212985/deepfake-detector-v2 (accessed on 10 March 2026).

Conflicts of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as potential conflicts of interest.

References

  1. Bregler, C.; Covell, M.; Slaney, M. Video Rewrite: Driving Visual Speech with Audio. In Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH); ACM: New York, NY, USA, 1997. [Google Scholar] [CrossRef]
  2. Thies, J.; Zollhöfer, M.; Stamminger, M.; Theobalt, C.; Nießner, M. Face2Face: Real-time face capture and reenactment of RGB videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2016; pp. 2387–2395. [Google Scholar] [CrossRef]
  3. Suwajanakorn, S.; Seitz, S.M.; Kemelmacher-Shlizerman, I. Synthesizing Obama: Learning lip sync from audio. ACM Trans. Graph. 2017, 36, 95. [Google Scholar] [CrossRef]
  4. Blanz, V.; Vetter, T. A morphable model for the synthesis of 3D faces. In Proceedings of SIGGRAPH; ACM: New York, NY, USA, 1999; pp. 187–194. [Google Scholar]
  5. Guo, Y.; He, W.; Zhu, J.; Li, C. A light autoencoder networks for face swapping. In Proceedings of the 2018 2nd International Conference on Computer Science and Artificial Intelligence; ACM: New York, NY, USA, 2018; pp. 459–462. [Google Scholar] [CrossRef]
  6. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Montreal, QC, Canada, 8–13 December 2014; pp. 2672–2680. [Google Scholar]
  7. Karras, T.; Laine, S.; Aila, T. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2019; pp. 4401–4410. [Google Scholar] [CrossRef]
  8. Choi, Y.; Choi, M.; Kim, M.; Ha, J.-W.; Kim, S.; Choo, J. StarGAN: Unified Generative Adversarial Networks for Multi-Domain Image-to-Image Translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 8789–8797. [Google Scholar] [CrossRef]
  9. Yermakov, A.; Cech, J.; Matas, J.; Fritz, M. Deepfake detection that generalizes across benchmarks. arXiv 2025, arXiv:2508.06248. [Google Scholar] [CrossRef]
  10. Kaddar, B.; Fezza, S.A.; Akhtar, Z.; Hamidouche, W.; Hadid, A.; Serra-Sagristá, J. Deepfake detection using spatio-temporal transformer. ACM Trans. Multimed. Comput. Commun. Appl. 2024, 20, 345. [Google Scholar] [CrossRef]
  11. Raza, M.A.; Malik, K.M.; Haq, I.U. HolisticDFD: Infusing spatiotemporal transformer embeddings for deepfake detection. Inf. Sci. 2023, 645, 119352. [Google Scholar] [CrossRef]
  12. Xu, Y.; Liang, J.; Jia, G.; Yang, Z.; Zhang, Y.; He, R. TALL: Thumbnail layout for deepfake video detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 22658–22668. [Google Scholar]
  13. Guan, J.; Zhou, H.; Hong, Z.; Ding, E.; Wang, J.; Quan, C.; Zhao, Y. Delving into sequential patches for deepfake detection. Adv. Neural Inf. Process. Syst. 2022, 35, 4517–4530. [Google Scholar]
  14. Liu, H.; Tan, Z.; Tan, C.; Wei, Y.; Wang, J.; Zhao, Y. Forgery-aware adaptive transformer for generalizable synthetic image detection (FatFormer). In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; IEEE: New York, NY, USA, 2024. [Google Scholar]
  15. Smeu, S.; Oneata, E.; Oneata, D. DeCLIP: Decoding CLIP representations for deepfake localization. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Tucson, AZ, USA, 28 February–4 March 2025; IEEE: New York, NY, USA, 2025; pp. 149–159. [Google Scholar]
  16. Gupta, P.; Ghosh, S.; Gedeon, T.; Do, T.T.; Dhall, A. Multiverse through deepfakes: The MultiFakeVerse dataset of person-centric visual and conceptual manipulations. In Proceedings of the 33rd ACM International Conference on Multimedia, Dublin, Ireland, 27–31 October 2025; ACM: New York, NY, USA, 2025. [Google Scholar]
  17. Dolhansky, B.; Bitton, J.; Pflaum, B.; Lu, J.; Howes, R.; Wang, M.; Ferrer, C.C. The DeepFake Detection Challenge (DFDC) dataset. arXiv 2020, arXiv:2006.07397. [Google Scholar] [CrossRef]
  18. Li, Y.; Yang, X.; Sun, P.; Qi, H.; Lyu, S. Celeb-DF: A large-scale challenging dataset for deepfake forensics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2020. [Google Scholar]
  19. Kang, C.; Jeong, S.; Lee, J.; Choi, D.; Woo, S.S.; Han, J. HiDF: A human-indistinguishable deepfake dataset. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining; ACM: New York, NY, USA, 2025. [Google Scholar] [CrossRef]
  20. Huang, Z.; Hu, J.; Li, X.; He, Y.; Zhao, X.; Peng, B.; Wu, B.; Huang, X.; Cheng, G. SIDA: Social media image deepfake detection, localization and explanation with large multimodal model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2025; pp. 28831–28841. [Google Scholar]
  21. Bird, J.J.; Lotfi, A. CIFAKE: Image classification and explainable identification of AI-generated synthetic images. IEEE Access 2024, 12, 15642–15650. [Google Scholar] [CrossRef]
  22. Sala, A. AI vs. Human-Generated Images Dataset. Kaggle Datasets 2024. Available online: https://www.kaggle.com/datasets/alessandrasala79/ai-vs-human-generated-dataset (accessed on 10 March 2026).
  23. Prithiv Sakthi, U.R. Deepfake-vs-Real-Classification Dataset. Kaggle Datasets 2025. Available online: https://www.kaggle.com/datasets/prithivsakthiur/deepfake-vs-real-60k (accessed on 10 March 2026).
  24. Smith, M.S. Real-time audio deepfakes have arrived: A cybersecurity firm has created convincing voices on the fly. IEEE Spectrum, 21 October 2025. [Google Scholar]
  25. Momin, M.D.S.; Sufian, A.; Barman, D.; Leo, M.; Distante, C.; Damer, N. Explainable deepfake detection across different modalities: An overview of methods and challenges. Image Vis. Comput. 2025, 163, 105738. [Google Scholar] [CrossRef]
Figure 1. Prediction architecture. Input images (384 × 384) are resized/normalized and tokenized into a 24 × 24 patch grid (576 tokens) for a ViT-L/16 backbone. Most transformer blocks are frozen while the final blocks are fine-tuned; pooled 1024-D features are L2-normalized and passed to a lightweight MLP head (1024 → 512 → 1) to output a real/fake logit (colors indicate frozen vs. trainable components).
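As a rough illustration of the pipeline in Figure 1, the sketch below shows how a pooled 1024-D backbone feature can be L2-normalized and fed through a 1024 → 512 → 1 MLP head, with all but the last few transformer blocks frozen. This is a minimal sketch, not the released training code: the `DeepfakeHead` and `freeze_all_but_last` names, the dropout rate, the GELU activation, and the number of unfrozen blocks are illustrative assumptions, and the backbone/pooling call is left as a placeholder for the SigLIP-2 ViT-L/16.

```python
# Minimal sketch of the Figure 1 head: frozen ViT backbone -> L2-normalized
# pooled feature -> MLP (1024 -> 512 -> 1) producing a real/fake logit.
# The helper names, dropout rate, and number of trainable blocks are
# illustrative assumptions, not the authors' released code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeepfakeHead(nn.Module):
    def __init__(self, feat_dim: int = 1024, hidden_dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.GELU(),
            nn.Dropout(0.1),           # regularization choice is an assumption
            nn.Linear(hidden_dim, 1),  # single real/fake logit
        )

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        pooled = F.normalize(pooled, dim=-1)  # L2-normalize the 1024-D feature
        return self.mlp(pooled).squeeze(-1)

def freeze_all_but_last(backbone: nn.Module, blocks, n_trainable: int = 4) -> None:
    """Freeze every backbone parameter, then re-enable the last n blocks.
    `blocks` is assumed to be an ordered container of transformer blocks."""
    for p in backbone.parameters():
        p.requires_grad = False
    for blk in list(blocks)[-n_trainable:]:
        for p in blk.parameters():
            p.requires_grad = True

# Usage sketch (backbone and pooling are placeholders for the SigLIP-2 ViT-L/16):
# feats = backbone(images)            # images: (B, 3, 384, 384) -> (B, 1024)
# logits = DeepfakeHead()(feats)
# loss = F.binary_cross_entropy_with_logits(logits, labels.float())
```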
Figure 2. Evaluation of the HiDF model. (Top left) Training loss with validation accuracy and F1 across 20 epochs; (Top right) ROC curve (AUC = 0.968); (Bottom left) precision–recall curve (AP = 0.972); (Bottom right) confusion matrix. ROC = receiver operating characteristic; AP = average precision.
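For reference, the curve-based metrics shown in Figure 2 (AUC, AP, and a thresholded confusion matrix) can be computed from raw prediction scores with standard scikit-learn calls; the `y_true`/`y_score` arrays below are small placeholders rather than actual model outputs.

```python
# Sketch of how the Figure 2 summary metrics are typically computed from
# model outputs. `y_true` (0 = real, 1 = fake) and `y_score` (predicted
# P(fake)) are toy placeholder arrays.
import numpy as np
from sklearn.metrics import (roc_auc_score, average_precision_score,
                             confusion_matrix)

y_true = np.array([0, 0, 1, 1, 1, 0])
y_score = np.array([0.05, 0.40, 0.90, 0.75, 0.60, 0.20])

auc = roc_auc_score(y_true, y_score)                         # area under ROC
ap = average_precision_score(y_true, y_score)                # area under PR curve
cm = confusion_matrix(y_true, (y_score >= 0.5).astype(int))  # 0.5 threshold

print(f"AUC={auc:.3f}  AP={ap:.3f}")
print(cm)
```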
Figure 3. Training progress—video classification (20 epochs). (Top left): Training (blue) and validation (red) loss steadily decrease, with validation loss remaining slightly lower than training loss. (Top right): Training (blue) and validation (red) accuracy increase rapidly early and then plateau in the mid-80% range by epoch 20. (Bottom left): Validation F1 score (green) rises from ~0.65 to a peak of 0.856 (dotted reference line). (Bottom right): Combined view showing train/val loss decreasing while validation F1 increases; the dashed vertical line marks the current epoch.
Figure 4. Overall performance metrics for HiDF video deepfake detection.
Figure 5. Confusion matrix for HiDF video classification (raw counts). Test-set outcomes for real vs. fake videos indicate more false-positive “fake” flags than missed fakes.
Figure 6. Prediction confidence distributions for HiDF video classification. (Left): density histograms of predicted P(fake) for real vs. fake videos. (Right): boxplots of P(fake) by true class (median, IQR, and outliers), illustrating strong score separation and a small overlap region near the decision boundary.
Figure 7. SIDA (SID-Set) 3-class classification confusion matrix (epoch 24).
Figure 8. Pixel-level tampering localization on SIDA. (Left): pixel confusion matrix for background vs. tampered regions. (Right): distribution of per-image IoU scores with mean IoU = 0.743 (dashed line). IoU = intersection over union.
Figure 9. Qualitative segmentation examples on SIDA validation data. Each row shows the tampered image, ground-truth mask, predicted mask, and an error overlay highlighting TP (green), FP (red), FN (blue), and TN (gray/transparent). IoU and Dice are reported per example to illustrate success cases (large edits) and boundary/missed-fragment failure modes (small edits).
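A minimal sketch of the per-image localization metrics reported in Figures 8 and 9 is shown below: IoU and Dice between predicted and ground-truth binary masks, together with the TP/FP/FN pixel sets used for the error overlays. The `mask_metrics` helper and the toy masks are illustrative assumptions, not the evaluation code released with the paper.

```python
# Sketch of per-image localization metrics: IoU and Dice between a predicted
# and a ground-truth binary mask, plus the TP/FP/FN pixel sets used for the
# Figure 9 error overlays. Mask arrays are toy placeholders.
import numpy as np

def mask_metrics(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8):
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    tp = np.logical_and(pred, gt)    # correctly flagged tampered pixels
    fp = np.logical_and(pred, ~gt)   # over-segmented (false alarm) pixels
    fn = np.logical_and(~pred, gt)   # missed tampered pixels
    inter = tp.sum()
    union = np.logical_or(pred, gt).sum()
    iou = inter / (union + eps)
    dice = 2 * inter / (pred.sum() + gt.sum() + eps)
    return iou, dice, tp, fp, fn

# Toy example: two overlapping 4x4 squares on an 8x8 canvas.
pred = np.zeros((8, 8), dtype=bool); pred[2:6, 2:6] = True
gt = np.zeros((8, 8), dtype=bool); gt[3:7, 3:7] = True
iou, dice, *_ = mask_metrics(pred, gt)
print(f"IoU={iou:.3f}  Dice={dice:.3f}")
```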
Figure 10. CiFake binary classification results: confusion matrices and ROC. (Left): normalized confusion matrix; (middle): raw counts; (right): ROC curve with AUC = 0.986, indicating near-perfect separability across thresholds.
Table 1. Comparison of the baseline to our proposed model on HiDF.

| Type | Model | AP | AUC | ΔAP vs. Baseline | ΔAUC vs. Baseline |
|---|---|---|---|---|---|
| Video | EB4 (best baseline) | 0.712 | 0.733 | – | – |
| Video | Proposed Model | 0.925 | 0.931 | +0.213 | +0.198 |
| Image | EB4 (best baseline) | 0.722 | 0.697 | – | – |
| Image | Proposed Model | 0.972 | 0.968 | +0.250 | +0.271 |
Table 2. Comparison of baselines to our proposed model. Baseline results reproduced from Huang et al. [20].

| Method | Year | Real (Acc/F1) | Synthetic (Acc/F1) | Tampered (Acc/F1) | Overall (Acc/F1) |
|---|---|---|---|---|---|
| AntifakePrompt | 2024 | 64.8/78.6 | 93.8/96.8 | 30.8/47.2 | 63.1/74.2 |
| CnnSpott | 2021 | 79.8/88.7 | 39.5/56.6 | 6.9/12.9 | 42.1/52.7 |
| FreDect | 2020 | 83.7/91.1 | 16.8/28.8 | 11.9/21.3 | 37.4/47.0 |
| Gram-Net | 2020 | 70.1/82.4 | 93.5/96.6 | 0.8/1.6 | 54.8/60.2 |
| UnivFD | 2023 | 68.0/67.4 | 62.1/87.5 | 64.0/85.3 | 64.7/80.0 |
| LGrad | 2023 | 64.8/78.6 | 83.5/91.0 | 6.8/12.7 | 51.7/60.7 |
| LNP | 2023 | 71.2/83.2 | 91.8/95.7 | 2.9/5.7 | 55.3/61.5 |
| SIDA-7B | 2024 | 89.1/91.0 | 98.7/98.6 | 92.7/91.0 | 93.5/93.5 |
| SIDA-13B | 2024 | 89.6/91.1 | 98.5/98.7 | 92.9/91.2 | 93.6/93.5 |
| Proposed Model | 2025 | 99.0/99.0 | 98.6/99.3 | 99.6/98.9 | 99.1/99.1 |
Table 3. Comparison of baselines to our proposed model for segmentation. Baseline results reproduced from Huang et al. [20].

| Method | Year | F1 | IoU |
|---|---|---|---|
| MVSS-Net | 2023 | 31.6 | 23.7 |
| HIFI-Net | 2023 | 45.9 | 21.1 |
| PSCC-Net | 2022 | 71.3 | 35.7 |
| LISA-7B-v1 | 2024 | 69.1 | 32.5 |
| SIDA-7B | 2024 | 73.9 | 43.8 |
| Proposed Model | 2025 | 82.2 | 74.3 |
Table 4. The metrics summarize binary discrimination between real and fake (AI-generated) images. Accuracy represents the overall proportion of correctly classified samples. AUC-ROC denotes the area under the receiver operating characteristic curve, indicating separability between real and synthetic imagery. AP score (average precision) reflects the area under the precision–recall curve. Precision (macro) and Recall (macro) are the means of the per-class precision and recall values, while F1 (macro) is the per-class harmonic mean of precision and recall, averaged across classes. Total samples lists the number of images evaluated for each dataset.

| Dataset | Accuracy | Precision (Macro) | Recall (Macro) | F1 (Macro) | Total Samples |
|---|---|---|---|---|---|
| AI vs. Human Generated Images [22] | 0.8917 | 0.8923 | 0.8917 | 0.8917 | 79,850 |
| Deepfake-vs.-Real-Classification [23] | 0.9820 | 0.9819 | 0.9819 | 0.9820 | 91,234 |
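The macro-averaged columns in Table 4 follow the standard convention of computing each metric per class and averaging with equal class weight; a minimal sketch of how such values can be obtained with scikit-learn is given below, using placeholder label arrays rather than the actual predictions.

```python
# Sketch of the macro-averaged metrics used in Table 4: precision, recall, and
# F1 are computed per class and then averaged with equal weight per class.
# `y_true` and `y_pred` are toy placeholder label arrays (0 = real, 1 = fake).
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 0])

acc = accuracy_score(y_true, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
print(f"Accuracy={acc:.4f}  Precision(macro)={prec:.4f}  "
      f"Recall(macro)={rec:.4f}  F1(macro)={f1:.4f}")
```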
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
