1. Introduction
Since the emergence of deep generative models, deepfake technology has evolved rapidly. It has shifted from a niche research topic to widely accessible tools, enabling the easy creation of highly convincing facial manipulations that often escape casual observation. Recent advancements in generative adversarial networks (GANs), variational autoencoders (VAEs), and attention-based architectures have enabled seamless modifications of facial identity, expressions, lip movements, and other features [1, 2]. Although these tools offer opportunities in creative domains such as film post-production and virtual try-on, their accessibility has also fueled the rise of digital forgeries [3]. This trend raises serious concerns about trust, privacy, and the integrity of public discourse. For instance, individuals with little legal awareness can create fake videos of public figures to spread misinformation or attempt extortion. Such incidents prompt urgent calls for effective forensic detection methods from regulators and platform operators [4, 5, 6].
Conventionally, deepfake detection is formulated as a binary classification problem [7, 8]. To identify whether a segment of video is forged or not, researchers have adopted two primary methodologies. The first focuses on intra-frame analysis, scrutinizing individual video frames for visual inconsistencies, such as frequency-domain artifacts resulting from upsampling, color mismatches at facial boundaries, or subtle semantic discrepancies between manipulated areas and their surroundings [9, 10]. Although these methods often achieve high accuracy against known manipulation techniques, they struggle with novel or heavily compressed forgeries [11]. Moreover, their dependence on standard CNN feature extractors limits their ability to detect subtle, localized artifacts such as those around the corners of the eyes or the boundaries of the lips, which reveal a synthetic origin [7, 12]. The second type of approach utilizes temporal cues across multiple frames, examining features like eye-blink frequency, coherence of optical flow, or the dynamics of deep features learned by recurrent networks to identify unnatural motion patterns [13]. Although temporal methods generally demonstrate a higher resilience to compression artifacts, they require longer video inputs and entail considerable computational costs. These factors may limit their practicality for real-time applications or large-scale deployments [14, 15]. Furthermore, both intra-frame and temporal techniques often exhibit limited generalization when confronted with previously unseen forgery algorithms.
Recent research has delved deeper into these limitations, revealing that current detectors struggle significantly with cross-dataset evaluations and real-world corruptions [16, 17]. One category of approaches leverages GANs with dual generators (blend-based for adaptive masking and blending, and transfer-based for style mixing) to create challenging synthetic forgeries that bridge in-dataset and cross-dataset gaps by simulating divergent synthetic patterns, such as those from techniques like DF-VAE or NeuralTextures [18]. These GAN-based methods often incorporate collaborative spatial-frequency discriminators to detect artifacts while filtering out perturbations like blur, noise, or adversarial attacks, thereby enhancing overall robustness. Another class of algorithms emphasizes inconsistency learning through advanced image blending, introducing bi-level inconsistencies (extrinsic between real and pseudo-forged regions, and inherent between real and manipulated areas) to better replicate common forgery clues and improve generalization across diverse datasets, as evidenced by improved metrics on benchmarks like DFDC and Celeb-DF [19, 20]. These developments underscore the essential requirement for detection frameworks that not only adapt to unseen forgeries but also withstand various real-world distortions.
Despite promising progress, two inevitable challenges remain in practical deployment: (1) the limited generalization of detectors to unseen forgery generation techniques and cross-dataset scenarios, where texture patterns may differ drastically; and (2) the lack of robustness to perturbations such as compression, noise, and adversarial attacks, which can easily corrupt discriminative cues and lead to misclassification.
Addressing the limitations of existing approaches, this paper proposes WAFF as a video-level face forgery detection framework, rather than merely as a replacement of an existing backbone. The novelty lies in the task-specific integration of weakly supervised attention learning, complementary local–global augmentation, and calibrated video-level evidence fusion for robust deepfake detection under compression and dataset shift. The significant contributions of this paper are as follows:
- (1) We design WSEffiNet, an EfficientNet-B3-based detector enhanced by weakly supervised attention. Instead of requiring pixel-level forgery masks or manually annotated artifact regions, WSEffiNet learns discriminative attention maps from image-level real/fake labels and uses them to guide bilinear attention pooling, attention cropping, and attention dropping.
- (2) For video-level classification, WAFF incorporates a calibrated fusion module that integrates fake-frame counting, average confidence scoring, key-frame voting, and attention-guided weighting. This design explicitly balances sensitivity to intermittent forgery cues with stability against isolated false alarms.
- (3) Experiments conducted on the public datasets FaceForensics++, Celeb-DF v2, DFD, DFDC, and FFIW-10K show that WAFF generally outperforms state-of-the-art baselines in both in-dataset and cross-dataset settings. The evaluation further analyzes compression robustness, decision-rule behavior, weak versus full supervision, preprocessing sensitivity, and error cases, thereby clarifying both the strengths and the deployment limitations of the proposed framework.
3. Materials and Methods
This section delineates the proposed WAFF framework, which was developed to address the pressing challenge of robust and scalable video-level forgery detection. As illustrated in Figure 1, WAFF is conceptualized as a three-stage pipeline, furnishing an end-to-end methodology that spans from raw video input to robust video-level inference. Specifically, the framework comprises (1) preprocessing, wherein videos are decomposed into sampled frames and facial regions are localized utilizing InsightFace; (2) frame-level analysis, wherein a novel weakly supervised backbone, designated WSEffiNet (see Section 3.1), is employed to integrate EfficientNet with attention-based augmentation, facilitating the identification of subtle and localized forgery artifacts; and (3) video-level decision rules, wherein frame-level predictions are synthesized via max-based detection, mean-confidence scoring, and key-frame voting to produce the final classification outcome.
The design of WAFF is motivated by the observation that deepfake artifacts are not uniformly distributed across the entire face or consistently present in every frame. Instead, manipulations typically introduce localized visual anomalies in specific facial regions—such as facial boundaries, the mouth, or the nose—and these anomalies may appear intermittently over time. Consequently, effective deepfake detection requires both spatial sensitivity to local artifacts at the frame level and a robust mechanism to aggregate such evidence across frames at the video level.
It is important to note that the weak supervision in WAFF is realized through the design of WSEffiNet, where the WS-DAN module generates attention responses from image-level class labels during training. These responses act as internally produced spatial indicators and are obtained without incorporating region annotations or pixel-level artifact masks. Unlike conventional supervised settings that integrate explicitly defined spatial supervision into the training pipeline, the spatial guidance in WAFF emerges from the model’s own attention mechanism, which derives localized cues implicitly while learning from global labels.
More specifically, each training face crop is supervised only by its binary real/fake label. No manual forgery boundary, manipulated-region mask, landmark-level artifact label, or temporal annotation is used. Given the final EfficientNet-B3 feature tensor, a convolution produces M attention maps. During training, the attention maps are sampled according to their activation strength: one map guides attention cropping, which zooms into the most discriminative local region, while another guides attention dropping, which suppresses an already salient region and forces the network to discover complementary evidence. In this way, the weak label supervises classification, whereas the attention mechanism supplies self-generated spatial proposals that encourage local–global feature learning.
3.1. Architecture Details of WSEffiNet
The proposed WSEffiNet architecture is based on a customized EfficientNet-B3 backbone, shown in Table 1. Compared with the original EfficientNet-B3, we adapt the input resolution for higher spatial fidelity, slightly adjust the number of channels in the later stages (e.g., 136 and 232 channels in Stages 6 and 7), and integrate an attention mechanism before the classification head. These modifications are designed to better capture subtle artifacts in facial forgeries while maintaining efficiency.
Architecturally, WSEffiNet follows a backbone–head design. The EfficientNet-B3 backbone is employed as a feature extractor and preserves its original MBConv and SE structures. The WS-DAN module is integrated at the output of the backbone and operates on the final convolutional feature maps, without altering the intermediate layers. Specifically, WS-DAN takes the last-stage feature tensor as an input to generate multiple attention maps, which are subsequently used for attention pooling, attention-guided cropping, and dropping during training.
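The compound scaling principle of EfficientNet governs the design of the backbone:
$$d = \alpha^{\phi}, \qquad w = \beta^{\phi}, \qquad r = \gamma^{\phi}, \qquad \text{s.t. } \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2, \quad \alpha \ge 1, \ \beta \ge 1, \ \gamma \ge 1,$$
where $d$, $w$, and $r$ control depth, width, and resolution, respectively, and $\phi$ is the compound coefficient. In our case, $\phi$ is set to the value that corresponds to EfficientNet-B3, which achieves a strong trade-off between accuracy and complexity.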
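As a rough illustration of compound scaling, the sketch below computes the three multipliers for a given compound coefficient. It assumes the base coefficients reported for the EfficientNet family ($\alpha = 1.2$, $\beta = 1.1$, $\gamma = 1.15$); the function names are hypothetical, and released model variants use rounded per-variant values rather than these raw powers.

```python
# Base coefficients reported for EfficientNet compound scaling (assumed here).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_scale(phi: float) -> dict:
    """Return depth, width, and resolution multipliers for coefficient phi."""
    return {
        "depth": ALPHA ** phi,
        "width": BETA ** phi,
        "resolution": GAMMA ** phi,
    }


def flops_factor(phi: float) -> float:
    """Approximate FLOPs growth, (alpha * beta^2 * gamma^2) ** phi."""
    return (ALPHA * BETA ** 2 * GAMMA ** 2) ** phi


if __name__ == "__main__":
    for phi in (0, 1, 2):
        print(phi, compound_scale(phi), round(flops_factor(phi), 3))
```

Because the constraint keeps $\alpha \cdot \beta^{2} \cdot \gamma^{2}$ close to 2, each unit increase of $\phi$ roughly doubles the computational cost.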
At the core of the backbone is a sequence of Mobile Inverted Bottleneck Convolution (MBConv) blocks, each followed by a Squeeze-and-Excitation (SE) module. An MBConv block first expands the input channels with a $1 \times 1$ convolution, applies depthwise convolution over $k \times k$ neighborhoods, and projects features back via another $1 \times 1$ convolution. The SE module adaptively reweights channels to emphasize informative features. These modules preserve efficiency while improving the representation quality.
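After processing an input face image $I$, the network produces feature maps $F \in \mathbb{R}^{C \times h \times w}$. To focus on discriminative regions, we append a $1 \times 1$ convolutional layer that generates $M$ attention maps $A = \{A_1, \dots, A_M\}$ with $A_k \in \mathbb{R}^{h \times w}$. Instead of uniformly aggregating all regions through global average pooling, we apply bilinear attention pooling (BAP) to selectively weight spatial regions:
$$x_k = g(A_k \odot F) = \frac{1}{hw} \sum_{i=1}^{h} \sum_{j=1}^{w} A_k(i,j)\, F(:, i, j), \qquad k = 1, \dots, M,$$
where $g(\cdot)$ denotes global average pooling, yielding a feature matrix $X = [x_1, \dots, x_M]^{\top} \in \mathbb{R}^{M \times C}$ with tens of thousands of elements.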
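The pooling step can be sketched as follows. This is a minimal, dependency-free illustration that assumes the feature maps and attention maps are given as nested lists of shape `C x h x w` and `M x h x w`; a practical implementation would use tensor operations, and the function name is hypothetical.

```python
def bilinear_attention_pooling(features, attention):
    """Attention-weighted average pooling.

    features:  C x h x w nested lists (one h x w map per channel).
    attention: M x h x w nested lists (one h x w map per attention head).
    Returns an M x C matrix whose row k pools the features under map k.
    """
    h, w = len(features[0]), len(features[0][0])
    pooled = []
    for a_k in attention:
        row = []
        for f_c in features:
            total = sum(a_k[i][j] * f_c[i][j] for i in range(h) for j in range(w))
            row.append(total / (h * w))  # global average over spatial positions
        pooled.append(row)
    return pooled


if __name__ == "__main__":
    feats = [[[1, 2], [3, 4]], [[0, 0], [0, 8]]]   # C = 2
    atts = [[[1, 1], [1, 1]], [[0, 0], [0, 1]]]    # M = 2
    print(bilinear_attention_pooling(feats, atts))
```

A uniform attention map reduces a row to ordinary global average pooling, whereas a peaked map restricts the pooled descriptor to the attended location.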
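To stabilize training, each row $x_k$ of $X$ is transformed by the sign–sqrt rule and normalized to the unit $\ell_2$ sphere:
$$\tilde{x}_k = \operatorname{sign}(x_k) \odot \sqrt{|x_k|}, \qquad \hat{x}_k = \frac{\tilde{x}_k}{\lVert \tilde{x}_k \rVert_2}.$$
This enhances discriminability by reducing bursty responses. The classification head is a fully connected layer applied to the matrix $\hat{X}$ of normalized rows:
$$z = W \operatorname{vec}(\hat{X}) + b,$$
producing logits $z$ over $K$ classes.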
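The normalization of one pooled row can be illustrated with a short sketch. The small constant `eps` is an assumption added only to avoid division by zero for an all-zero row.

```python
import math


def sign_sqrt_l2(row, eps=1e-12):
    """Apply the signed square root element-wise, then scale to unit L2 norm."""
    signed = [math.copysign(math.sqrt(abs(v)), v) for v in row]
    norm = math.sqrt(sum(v * v for v in signed))
    return [v / (norm + eps) for v in signed]


if __name__ == "__main__":
    print(sign_sqrt_l2([4.0, -9.0, 0.0]))
```

The square root compresses large activations more strongly than small ones, which is why a few dominant responses no longer overwhelm the rest of the descriptor.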
To encourage richer representation learning, we adopt a multi-path attention strategy where raw images, attention-guided cropping, and attention-guided dropping are jointly used. This forces the model to focus on both localized fine-grained cues and global context. The classification head is trained with a joint loss that combines cross-entropy and center loss:
$$\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \mathcal{L}_{\mathrm{center}},$$
where $\mathcal{L}_{\mathrm{CE}}$ averages cross-entropy over raw, cropped, and dropped paths, while $\mathcal{L}_{\mathrm{center}}$ enforces intra-class compactness by penalizing deviations from learned class centers.
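A per-sample version of this objective might look like the sketch below. The balancing weight `lam` and the squared Euclidean form of the center term are assumptions made for illustration; the text does not specify how the two terms are weighted.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample, using the log-sum-exp trick."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_z - logits[label]


def center_loss(feature, center):
    """Squared Euclidean distance between a feature vector and its class center."""
    return sum((f - c) ** 2 for f, c in zip(feature, center))


def joint_loss(logits_raw, logits_crop, logits_drop, label, feature, center, lam=1.0):
    """Average cross-entropy over the three paths plus a weighted center term.

    lam is a hypothetical balancing weight, not a value taken from the paper.
    """
    ce = (
        cross_entropy(logits_raw, label)
        + cross_entropy(logits_crop, label)
        + cross_entropy(logits_drop, label)
    ) / 3.0
    return ce + lam * center_loss(feature, center)
```

When a feature coincides with its class center the second term vanishes, so only the averaged classification error remains.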
The integration of compound-scaled MBConv+SE blocks, attention-guided BAP, multi-path attention learning, and the joint objective allows WSEffiNet to maintain efficiency (about 12 M parameters and 1.8 B FLOPs, comparable to EfficientNet-B3) while enhancing sensitivity to subtle deepfake artifacts.
The overall training procedure is summarized in Algorithm 1. Unlike conventional CNN training, our pipeline incorporates attention-guided cropping and dropping to augment feature diversity, as well as a center loss term to enforce compact intra-class representations, making the process tailored to deepfake detection.
3.2. Decision Rules
While the proposed WSEffiNet model described in the previous subsection yields frame-level predictions with high discriminative capacity, its practical deployment requires the implementation of a robust mechanism to aggregate and elevate these results to the video level. This step is crucial, since deepfake artifacts may only appear intermittently across frames, and occasional spurious responses could otherwise trigger false alarms.
To address this challenge, the video-level aggregation in WAFF is designed to be evidence-driven rather than purely uniform. Although the decision rules are implemented using simple aggregation operations, they are guided by frame-level prediction confidence and attention responses produced by WSEffiNet. This design allows the aggregation process to account for heterogeneous frame-level signals when forming a video-level decision, rather than assuming equal contribution from all frames.
Algorithm 1. Training Procedure of the Integrated EfficientNet-B3 and WS-DAN Model
Input: training set $D$, model parameters $\theta$, number of epochs $T$, initial learning rate $\eta$, batch size $N$, number of attention maps $M$
Output: updated model parameters $\theta$, feature center matrix $C$
 1: Initialize weight parameters $\theta$
 2: Initialize feature center matrix $C$
 3: for $t = 1$ to $T$ do
 4:   for $i = 1$ to $N$ do
 5:     Input image $X$ and label $y$
 6:     Extract feature map $F$ of $X$ using EfficientNet-B3
 7:     Apply convolution on $F$ to obtain attention maps $A = \{A_1, \dots, A_M\}$
 8:     for $k = 1$ to $M$ do
 9:       Compute weighted feature map: $F_k = A_k \odot F$
10:       Apply global average pooling on $F_k$ to obtain feature vector $f_k$
11:     end for
12:     Concatenate features: $f = [f_1, \dots, f_M]$
13:     Compute raw prediction $\hat{y}_{\mathrm{raw}}$ from $f$
14:     Update feature center $C$ using $f$
15:     Randomly select two attention maps from $A$
16:     Apply attention cropping/dropping on $X$ to obtain $X_{\mathrm{crop}}$ and $X_{\mathrm{drop}}$
17:     Feed $X_{\mathrm{crop}}$ and $X_{\mathrm{drop}}$ into the model to obtain $\hat{y}_{\mathrm{crop}}$ and $\hat{y}_{\mathrm{drop}}$
18:     Compute average cross-entropy loss $\mathcal{L}_{\mathrm{CE}}$ over the raw, cropped, and dropped predictions
19:     Compute feature center loss $\mathcal{L}_{\mathrm{center}}$
20:     Compute total loss $\mathcal{L}$ from $\mathcal{L}_{\mathrm{CE}}$ and $\mathcal{L}_{\mathrm{center}}$
21:     Backpropagate
22:     Update parameters using SGD
23:     Adjust learning rate using StepLR scheduler
24:   end for
25: end for
To derive a unified video-level classification from frame-level fake probabilities $\{p_i\}_{i=1}^{n}$, where $i = 1, \dots, n$ corresponds to sampled frames, we propose an aggregation strategy that balances sensitivity to forgery cues with robustness against noise. A straightforward yet effective approach is to first count the number of frames where $p_i$ exceeds a predetermined detection threshold $\tau$. If any such frame exists, the video is immediately classified as fake according to the rule
$$\hat{y}_{\max} = \mathbb{1}\Big(\max_{1 \le i \le n} p_i > \tau\Big),$$
thereby maximizing sensitivity to even a single high-confidence forgery cue. However, this “any-frame” criterion can be overly susceptible to occasional false alarms, so we complement it with an average-confidence criterion that computes the mean probability
$$\bar{p} = \frac{1}{n} \sum_{i=1}^{n} p_i$$
and labels the video as fake only if $\bar{p}$ exceeds a global threshold $\tau_{\mathrm{avg}}$,
$$\hat{y}_{\mathrm{mean}} = \mathbb{1}\big(\bar{p} > \tau_{\mathrm{avg}}\big).$$
This averaging strategy dilutes sporadic outliers and is particularly effective when forgeries manifest across multiple frames.
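The two basic rules can be sketched in a few lines. The function and parameter names, as well as the threshold values in the example, are illustrative assumptions rather than the calibrated settings of the framework.

```python
def any_frame_rule(probs, tau):
    """Flag the video as fake if at least one frame probability exceeds tau."""
    if not probs:
        raise ValueError("at least one frame probability is required")
    return max(probs) > tau


def mean_rule(probs, tau_avg):
    """Flag the video as fake if the mean frame probability exceeds tau_avg."""
    if not probs:
        raise ValueError("at least one frame probability is required")
    return sum(probs) / len(probs) > tau_avg


if __name__ == "__main__":
    probs = [0.10, 0.20, 0.95, 0.15]   # one isolated high-confidence frame
    print(any_frame_rule(probs, tau=0.9), mean_rule(probs, tau_avg=0.5))
```

With a single isolated spike, the any-frame rule fires while the mean rule does not, which illustrates the sensitivity versus stability trade-off described above.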
To further mitigate computational load in long sequences, we restrict evaluation to a smaller subset of “key frames” $\mathcal{K}$, selected for their high motion or salience. In our implementation, key frames are identified based on frame-level prediction confidence, where the $K$ frames with the highest forgery scores $p_i$ are selected. The value of $K$ is set empirically as a balance between detection robustness and computational efficiency. A soft majority vote is then performed by requiring the proportion of key frames whose score $p_i$ exceeds the detection threshold $\tau$ to exceed a ratio $\rho$:
$$\hat{y}_{\mathrm{vote}} = \mathbb{1}\Big(\frac{1}{K} \sum_{i \in \mathcal{K}} \mathbb{1}(p_i > \tau) > \rho\Big),$$
where $\mathbb{1}(\cdot)$ denotes the indicator function.
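A minimal sketch of the key-frame vote is given below, assuming that key frames are simply the `k` highest-scoring frames and that the same detection threshold is reused inside the vote. All names and example values are hypothetical.

```python
def key_frame_vote(probs, k, tau, rho):
    """Soft majority vote over the k frames with the highest fake probability.

    Returns True if the fraction of key frames above tau is greater than rho.
    """
    if not probs or k < 1:
        raise ValueError("need at least one frame and k >= 1")
    key = sorted(probs, reverse=True)[:k]      # top-k scores; fewer if the video is short
    ratio = sum(p > tau for p in key) / len(key)
    return ratio > rho


if __name__ == "__main__":
    probs = [0.10, 0.20, 0.95, 0.15, 0.92, 0.70]
    print(key_frame_vote(probs, k=3, tau=0.9, rho=0.5))
```

Here two of the three key frames exceed the threshold, so the vote passes for a ratio of 0.5 but would fail for a stricter ratio such as 0.7.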
Beyond these rules, more flexible schemes are incorporated. In particular, weighted aggregation assigns higher importance to frames with stronger attention responses or higher face detection confidence, while a hierarchical strategy first flags suspicious frames using the max rule and then confirms them through averaging or voting.
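These extensions can be sketched as follows. The weights stand in for attention responses or face detection confidences, and the hierarchical variant shown here confirms with the mean rule; both choices are assumptions made for illustration.

```python
def weighted_score(probs, weights):
    """Weighted mean of frame probabilities; weights must have a positive sum."""
    if len(probs) != len(weights):
        raise ValueError("probs and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(p * w for p, w in zip(probs, weights)) / total


def hierarchical_rule(probs, tau, tau_avg):
    """Stage 1: max rule flags the video. Stage 2: mean rule confirms it."""
    if max(probs) <= tau:
        return False                      # no suspicious frame, accept as real
    return sum(probs) / len(probs) > tau_avg


if __name__ == "__main__":
    print(weighted_score([0.2, 0.9], [1.0, 3.0]))
    print(hierarchical_rule([0.95, 0.80, 0.70], tau=0.9, tau_avg=0.5))
```

The second stage only runs for videos that already contain a suspicious frame, so an isolated spike in an otherwise clean video is rejected.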
By integrating these complementary aggregation rules—max-based detection, mean confidence, key-frame voting, and weighted extensions—into a cohesive decision framework, our system forms a natural continuation of the attention-guided frame-level design, ensuring that localized forgery cues are effectively elevated to a reliable and stable video-level verdict.
The thresholds are not chosen from the test set. In our implementation, the detection threshold $\tau$, the mean-confidence threshold $\tau_{\mathrm{avg}}$, the number of key frames $K$, and the voting ratio $\rho$ are selected on the validation split by grid search with two objectives: maximizing validation AUC/ACC, and avoiding excessive false positives on real videos. The final values are then fixed for all test datasets. This calibration protocol reduces the risk that the reported performance is tied to a single favorable threshold, and it supports a fairer comparison between the sensitivity-oriented and stability-oriented decision rules.
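One possible way to combine the two objectives is to treat the false-positive rate as a constraint and maximize accuracy among the remaining candidates. The sketch below does this for a single threshold; the constraint form, the cap `max_fpr`, and the function name are assumptions, since the exact search procedure is not specified here.

```python
def calibrate_threshold(val_scores, val_labels, candidates, max_fpr=0.1):
    """Grid-search a threshold on validation data.

    val_labels use 1 for fake and 0 for real. Candidates whose false-positive
    rate on real videos exceeds max_fpr are discarded; among the rest, the
    threshold with the highest accuracy is returned as (tau, acc, fpr).
    Returns None if no candidate satisfies the constraint.
    """
    n_real = sum(1 for y in val_labels if y == 0)
    best = None
    for tau in candidates:
        preds = [int(s > tau) for s in val_scores]
        correct = sum(int(p == y) for p, y in zip(preds, val_labels))
        acc = correct / len(val_labels)
        false_pos = sum(1 for p, y in zip(preds, val_labels) if p == 1 and y == 0)
        fpr = false_pos / n_real if n_real else 0.0
        if fpr > max_fpr:
            continue
        if best is None or acc > best[1]:
            best = (tau, acc, fpr)
    return best


if __name__ == "__main__":
    scores = [0.10, 0.30, 0.55, 0.70, 0.90]
    labels = [0, 0, 0, 1, 1]
    print(calibrate_threshold(scores, labels, [0.2, 0.5, 0.6, 0.8]))
```

The same loop extends to several parameters by iterating over their Cartesian product, with the selected values then frozen before any test data are scored.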
3.3. Preprocessing
To guarantee that the model receives clean, focused, and geometrically standardized inputs, we designed a preprocessing pipeline that transforms raw videos into aligned facial images, which are suitable for precise deepfake detection. This step is crucial because most manipulations in deepfake videos occur within facial regions. Proper localization and normalization enable the network to concentrate on areas of semantic significance, ensuring more accurate and effective results.
The preprocessing process begins with uniform frame sampling. For each video, this method extracts one frame per second—a frequency selected to strike a balance between temporal diversity and computational efficiency, which guarantees that a sufficient number of frames are sampled to reveal potential forgeries while minimizing unnecessary redundancy among highly similar adjacent frames.
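Uniform sampling at one frame per second reduces to choosing frame indices with a stride equal to the frame rate. The helper below is a small sketch under that assumption; rounding the stride for non-integer frame rates is an illustrative choice.

```python
def sample_frame_indices(num_frames, fps, every_seconds=1.0):
    """Return frame indices spaced roughly every_seconds apart."""
    if num_frames < 0 or fps <= 0:
        raise ValueError("num_frames must be >= 0 and fps must be > 0")
    step = max(1, round(fps * every_seconds))   # stride in frames
    return list(range(0, num_frames, step))


if __name__ == "__main__":
    print(sample_frame_indices(100, 25.0))
```

A ten-second clip therefore contributes about ten face crops regardless of its native frame rate.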
Next, each sampled frame is processed through the RetinaFace detector. RetinaFace is a single-stage dense facial detection network, particularly suitable for our task due to its robustness in handling extreme head poses, occlusion, and low resolution. For each frame, this method retains the largest detected facial bounding box, assuming that the subject of interest typically occupies the most prominent area of the frame. To ensure that contextual information, such as the cheeks and jawline, is not truncated, the bounding box is expanded by a fixed scale factor of 1.3 along both axes, which enlarges the region and then serves as the candidate face region.
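The box selection and expansion step can be sketched as follows, assuming boxes in `(x1, y1, x2, y2)` pixel coordinates. Clipping the expanded box to the image bounds is an added assumption, and the function names are hypothetical.

```python
def largest_box(boxes):
    """Pick the detection with the largest area from (x1, y1, x2, y2) boxes."""
    return max(boxes, key=lambda b: (b[2] - b[0]) * (b[3] - b[1]))


def expand_box(box, scale, img_w, img_h):
    """Scale a box about its center and clip it to the image bounds."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half_w = (x2 - x1) * scale / 2.0
    half_h = (y2 - y1) * scale / 2.0
    return (
        max(0.0, cx - half_w),
        max(0.0, cy - half_h),
        min(float(img_w), cx + half_w),
        min(float(img_h), cy + half_h),
    )


if __name__ == "__main__":
    boxes = [(10, 10, 50, 50), (100, 100, 200, 200)]
    print(expand_box(largest_box(boxes), 1.3, img_w=640, img_h=480))
```

Scaling about the center keeps the face in the middle of the crop while adding a margin for the cheeks and jawline.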
For implementation consistency, face localization is performed through the InsightFace face analysis interface with a RetinaFace-style detector. When multiple faces are detected, the largest face is selected because the benchmark videos are dominated by a principal subject. When no face is detected in a sampled frame, the frame is skipped and the decision is made from the remaining valid frames; if no valid face frame is obtained for a video, the video is excluded from quantitative scoring and recorded as a preprocessing failure. This policy prevents low-quality frames, missed detections, or background faces from silently biasing the model prediction.
To mitigate the variation introduced by different face orientations and camera angles, this method applies facial alignment using five key landmarks provided by RetinaFace: the centers of the left and right eyes, the nose tip, and the corners of the mouth. Let the detected landmarks in the input image be denoted as $\{u_i\}_{i=1}^{5}$, and let the reference template landmarks be $\{v_i\}_{i=1}^{5}$. We then compute a similarity transformation $T$ that minimizes the least-squares alignment error:
$$T^{*} = \arg\min_{T} \sum_{i=1}^{5} \lVert T(u_i) - v_i \rVert_2^{2}.$$
This transformation is applied to warp the detected face region so that the five landmarks align with a canonical configuration, resulting in consistent geometric alignment across all samples.
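For 2D points, the least-squares similarity transform has a closed form that is compact to write with complex numbers: a point $(x, y)$ becomes $z = x + iy$, and the transform is $z \mapsto a z + b$, where $a$ encodes rotation and uniform scale and $b$ the translation. The sketch below uses this formulation (reflections are excluded); the function names and the example points are hypothetical and are not the template used in the paper.

```python
def fit_similarity(src, dst):
    """Least-squares similarity transform mapping src points onto dst points.

    Points are (x, y) tuples. Returns complex (a, b) such that z -> a * z + b.
    """
    p = [complex(x, y) for x, y in src]
    q = [complex(x, y) for x, y in dst]
    p_mean = sum(p) / len(p)
    q_mean = sum(q) / len(q)
    # Optimal a for centered points: sum(q' * conj(p')) / sum(|p'|^2).
    num = sum((qi - q_mean) * (pi - p_mean).conjugate() for pi, qi in zip(p, q))
    den = sum(abs(pi - p_mean) ** 2 for pi in p)
    if den == 0:
        raise ValueError("source points must not all coincide")
    a = num / den
    b = q_mean - a * p_mean
    return a, b


def apply_similarity(a, b, point):
    """Apply z -> a * z + b to a single (x, y) point."""
    z = a * complex(*point) + b
    return (z.real, z.imag)


if __name__ == "__main__":
    src = [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.3)]
    dst = [apply_similarity(2j, 3 + 4j, pt) for pt in src]   # rotate 90 deg, scale 2
    print(fit_similarity(src, dst))
```

In practice the estimated transform is converted to a $2 \times 3$ affine matrix and passed to an image warping routine.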
After alignment, each facial image is cropped and resized to the fixed spatial resolution required by the backbone architecture. Subsequently, the pixel intensities are rescaled to the range of $[0, 1]$, followed by channel-wise normalization utilizing the mean and standard deviation values derived from the ImageNet dataset:
$$\hat{I}_c = \frac{I_c - \mu_c}{\sigma_c}, \qquad c \in \{R, G, B\},$$
where $\mu$ and $\sigma$ are the per-channel mean and standard deviation vectors used during EfficientNet pretraining.
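For a single RGB pixel, the rescaling and normalization can be sketched as below. The constants are the commonly used ImageNet channel statistics and are assumed here; the paper's exact values are not reproduced in this text.

```python
# Commonly used ImageNet channel statistics for inputs scaled to [0, 1] (assumed).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb_uint8):
    """Rescale an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    return tuple(
        (value / 255.0 - mean) / std
        for value, mean, std in zip(rgb_uint8, IMAGENET_MEAN, IMAGENET_STD)
    )


if __name__ == "__main__":
    print(normalize_pixel((128, 128, 128)))
```

Matching the statistics used during pretraining keeps the input distribution consistent with what the pretrained backbone weights expect.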
Consequently, each input video is transformed into a sequence of geometrically aligned, photometrically normalized facial images that maintain consistent resolution and structured semantics. These processed face crops are then fed into the network for feature extraction and classification, forming the foundation for dependable and robust deepfake detection.
3.4. Data Augmentation
To enhance the robustness of the model and improve its ability to focus on discriminative local features, we adopt an attention-guided data augmentation strategy inspired by WS-DAN. Unlike traditional data augmentation techniques that apply global transformations such as random flipping, color jittering, or cropping, our approach leverages spatial attention maps derived from the model to dynamically generate augmentation masks. These attention maps highlight the most informative regions of an input image, allowing for the targeted amplification or suppression of local features during training.
Formally, given a feature map $F$ extracted from the backbone and its corresponding attention maps $A = \{A_1, \dots, A_M\}$, two attention maps are selected for each training image: one for cropping and one for dropping. Let $A_k$ denote the $k$-th attention map from the set. For cropping, we identify the bounding region where the attention intensity exceeds a threshold $\theta_c$ and extract the corresponding image patch:
$$I_{\mathrm{crop}} = \operatorname{Crop}(I, B_k), \qquad B_k = \operatorname{bbox}\big(\{(i,j) : A_k(i,j) > \theta_c\}\big).$$
Then, the extracted patch is resized to the original image resolution. This forces the model to examine prominent local features, such as the eyes, mouth, and facial contours, which are often the areas most vulnerable to forgery artifacts.
Conversely, for attention dropping, we mask out a high-attention region by setting the corresponding pixels to zero or a neutral value, effectively encouraging the network to explore alternative regions:
$$I_{\mathrm{drop}} = I \odot D_k, \qquad D_k(i,j) = \begin{cases} 0, & A_k(i,j) > \theta_d, \\ 1, & \text{otherwise}, \end{cases}$$
where $D_k$ denotes the binary mask induced by a drop threshold $\theta_d$, and $I$ is the original input image. This operation aims to prevent the model from overfitting to only the most salient patterns and promotes the learning of complementary features.
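Both operations can be sketched on a small grid. The example assumes that the attention map has already been normalized to $[0, 1]$ and resized to the image resolution, and it uses single-channel nested lists for brevity; the function names and threshold values are hypothetical.

```python
def crop_box(att, theta_c):
    """Inclusive bounding box (top, left, bottom, right) of cells above theta_c."""
    cells = [(i, j) for i, row in enumerate(att) for j, v in enumerate(row) if v > theta_c]
    if not cells:
        return None                      # nothing salient enough to crop
    rows = [i for i, _ in cells]
    cols = [j for _, j in cells]
    return (min(rows), min(cols), max(rows), max(cols))


def drop_mask(att, theta_d):
    """Binary mask that is 0 where attention exceeds theta_d and 1 elsewhere."""
    return [[0 if v > theta_d else 1 for v in row] for row in att]


def apply_mask(image, mask):
    """Element-wise product of a single-channel image and a binary mask."""
    return [[px * m for px, m in zip(img_row, mask_row)]
            for img_row, mask_row in zip(image, mask)]


if __name__ == "__main__":
    att = [[0.1, 0.2, 0.1], [0.2, 0.9, 0.6], [0.1, 0.7, 0.3]]
    image = [[5, 5, 5], [5, 5, 5], [5, 5, 5]]
    print(crop_box(att, theta_c=0.5))
    print(apply_mask(image, drop_mask(att, theta_d=0.8)))
```

Cropping keeps the region around the strongest responses, while dropping removes only the peak so that the remaining pixels must carry the classification.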
To balance these two augmentation strategies, we randomly sample two attention maps, $A_{k_1}$ and $A_{k_2}$, from the top-activated set for each mini-batch sample. One produces a cropped augmented image, and the other generates a dropped version. Both are included in the training batch with the original image, which increases intra-class variance and creates a more diverse and challenging training distribution.
5. Conclusions
This paper introduces WAFF, a novel deepfake detection framework that integrates WSEffiNet with a flexible video-level fusion strategy. WAFF leverages an EfficientNet-B3 backbone enhanced by a Weakly Supervised Data Augmentation Network (WS-DAN) to generate fine-grained attention maps, which highlight subtle forgery artifacts. At the video level, we further propose an attention-guided decision fusion mechanism that effectively balances sensitivity to localized manipulations with robustness against spurious noise and compression artifacts. Extensive experiments across multiple benchmarks, including FaceForensics++, Celeb-DF, DFDC, DFD, and FFIW-10K, demonstrate that WAFF consistently achieves superior in-dataset performance and outperforms state-of-the-art methods in challenging cross-dataset evaluations. These results confirm the robustness, generalization ability, and efficiency of WAFF for practical deployment in real-world deepfake forensics.
The current framework also has limitations. First, WAFF depends on reliable face detection and alignment; severe occlusion, very small faces, or unusual camera views may reduce the number of valid frames and weaken video-level confidence. Second, although EfficientNet-B3 offers a favorable accuracy–efficiency trade-off, the model still benefits from ImageNet pretraining and may require additional adaptation when deployed against substantially different attack families. Third, video-level aggregation adds latency compared with a single-frame classifier, especially when dense sampling is used. Finally, the present design primarily models frame-level evidence with lightweight aggregation rather than explicit long-range temporal dynamics, which may limit sensitivity to manipulations that are visible only through extended motion inconsistencies.
In future work, we aim to explore several promising directions to further advance this line of research. First, we plan to investigate more lightweight architectures and model compression techniques to reduce inference costs and enable deployment on diverse edge devices. Second, we will incorporate adaptive temporal modeling strategies to better capture long-range dynamics in videos without sacrificing efficiency. Finally, we also plan to study explainable deepfake detection by developing visualization tools that explicitly reveal discriminative artifacts, which may enhance interpretability and trustworthiness in forensic practice.