1. Introduction
With the rapid advancement of deepfake technologies, particularly Generative Adversarial Networks (GANs), image generation techniques have achieved remarkable progress, enabling the synthesis of highly realistic forged images indistinguishable from authentic faces [1]. Recent enhancements in GANs have significantly increased both visual fidelity and detection difficulty [2]. Such high-quality synthetic images pose unprecedented challenges to facial recognition and security systems, creating significant risks in political and financial domains, thus emphasizing the urgent need for improved forgery detection methods [3].
To address these challenges, researchers have proposed various methods for detecting synthetic images. For instance, Zhao et al. [4] introduced a multi-attention network that considerably improved detection accuracy by integrating multiple spatial attention heads and texture enhancement modules. Lee et al. [5] demonstrated exceptional performance in detecting high-quality forgeries using a Transfer Learning-based Autoencoder with Residual Blocks (TAR), which outperformed other methods on the FaceForensics++ dataset [6]. Wang et al. [7] proposed an attention-guided data augmentation approach to enhance the capability of CNN-based detectors in detecting forged faces. While these methods have improved detection accuracy to a degree, their effectiveness remains constrained for low-quality images.
However, despite recent progress, current deepfake detection techniques still exhibit significant performance degradation when processing low-quality forged images. These images, characterized by blur, noise, and low resolution, severely obscure critical facial features, making traditional CNN-based methods ineffective. Existing approaches predominantly address high-quality forgeries, while those targeting low-quality scenarios either rely on computationally intensive super-resolution or lack generalization capabilities. Consequently, there is an urgent need for an efficient detection method specifically designed to handle the complexities associated with low-quality deepfakes.
As image generation technologies have advanced, conventional forgery detection methods have shown substantial performance degradation on low-quality images [8], particularly those affected by noise, low resolution, or blur. Low-quality images typically contain considerable noise and reduced resolution, which significantly degrade the performance of existing deep learning techniques in detecting synthetic content [9]. The similarity between low-quality synthetic and authentic images further complicates the differentiation process for traditional detection strategies [10]. Although state-of-the-art deep learning models excel at processing high-quality images, they often fail to deliver satisfactory results on low-quality inputs [11].
In recent years, researchers have introduced several novel approaches to address the challenge of detecting low-quality counterfeit images, particularly those affected by low resolution and noise. These techniques aim to enhance the accuracy and robustness of forgery detection by leveraging strategies such as feature extraction, super-resolution restoration, and deep learning. For example, Zou et al. [12] proposed the DIFLD technique, which employs high-frequency invariance and high-dimensional feature distribution learning to enhance detection performance on low-quality compressed images. While effective, this approach exhibits limited generalization ability. Kiruthika et al. [13] proposed a counterfeit face detection approach based on image quality assessment features, achieving improved accuracy through frequency and spatial domain analysis; however, its performance on non-standard datasets remains suboptimal.
Sohaib et al. [14] developed a video forgery detection system using a Convolutional Neural Network–Long Short-Term Memory (CNN-LSTM) model, integrating facial feature extraction via a Multi-task Cascaded Convolutional Neural Network (MTCNN) and Extreme Inception (Xception) with temporal feature capture through Long Short-Term Memory (LSTM). Although this model performed well on the Google Deepfake AI dataset [15], it struggled with issues such as color inconsistency.
In addition, advancements include those by Wang et al. [16], who combined sparse coding with deep learning for super-resolution image restoration. Despite achieving high restoration accuracy and subjective quality, their method incurs significant computational overhead and has limited applicability. Herrmann et al. [17] enhanced local matching techniques by integrating multi-scale Local Binary Pattern (LBP) features with temporal fusion, demonstrating success in low-resolution surveillance video face recognition; however, its accuracy declines under extremely low resolution and complex backgrounds. Heinsohn et al. [18] introduced the blur-ASR method, which leverages sparse representation with dictionaries tailored to various blur levels to improve low-resolution image recognition. While effective, its performance is highly dependent on dataset representativeness, limiting its applicability in real-world scenarios. Kim et al. [19] introduced an adaptive feature extraction technique effective for blurred face images, though it struggles with severe blur and high computational complexity. Lin et al. [20] proposed a two-stage pipeline using the pix2pix network for low-resolution face image enhancement, coupled with a multi-quality fusion network to improve recognition performance. Despite its strengths, the method is limited in restoring extremely low-quality data and suffers from high computational complexity. Mu et al. [21] presented a lightweight convolutional model combining multi-resolution feature fusion and spatial attention modules, outperforming traditional CNNs on the Lock3DFace dataset [22]. Nevertheless, its ability to handle extremely low-quality data remains constrained.
Many existing methods focus on a single technique, such as super-resolution restoration or feature extraction, and often exhibit limited performance on extremely low-quality images or incur significant computational overhead. Consequently, current research has yet to provide a comprehensive solution for addressing the combined challenges of noise, blur, and low resolution in low-quality forged images.
To address these limitations, this paper proposes the YOLOv9-ARC algorithm, which significantly enhances the detection performance of low-quality forged images through the introduction of two key components: the Adaptive Kernel Convolution (AKConv) module and the Convolutional Block Attention Module (CBAM). Unlike previous approaches, the core innovation of this work lies in the synergistic integration of dynamically adjustable convolution kernels (AKConv) and multi-level attention mechanisms (CBAM), which optimize feature extraction and feature representation, respectively. The AKConv module dynamically adjusts the size of the convolution kernel, enabling the model to effectively handle low resolution, noise, and blur, thereby enhancing its feature extraction capability on low-quality images. Simultaneously, the CBAM mechanism dynamically recalibrates feature importance across both channel and spatial dimensions, amplifying critical regions while suppressing noise and irrelevant features.
Compared to existing techniques that rely solely on feature extraction or super-resolution restoration, the proposed approach combines adaptive convolution kernel adjustment with attention-guided feature optimization. This dual strategy not only improves the model’s adaptability to low-quality images but also achieves higher accuracy in addressing challenges such as low resolution and noise. Furthermore, the model demonstrates enhanced robustness by focusing on discriminative regions and reducing the impact of irrelevant information.
In summary, this paper builds upon existing technologies and introduces significant improvements to overcome the limitations of traditional methods in detecting low-quality forged images. The proposed YOLOv9-ARC algorithm offers a more efficient and adaptable solution, advancing the state-of-the-art in forgery detection under challenging conditions.
The key contributions of this paper are outlined below:
We propose YOLOv9-ARC, which is an improved detection framework specifically designed for low-quality deepfake images characterized by noise, blur, and low resolution.
We introduce Adaptive Kernel Convolution (AKConv) to enhance the model’s ability to extract meaningful features under degraded visual conditions.
We incorporate the Convolutional Block Attention Module (CBAM) to emphasize critical facial regions and suppress irrelevant noise, improving detection accuracy.
Extensive experiments demonstrate that YOLOv9-ARC achieves superior performance and generalization on low-quality datasets compared to baseline methods.
Compared to existing techniques, the innovations of this paper include the following:
Combining adaptive convolution and attention mechanisms to improve detection performance: Unlike methods that rely only on super-resolution restoration (e.g., Wang et al. [16]) or feature extraction (e.g., Sohaib et al. [14]), YOLOv9-ARC integrates both adaptive convolution and attention mechanisms, achieving more robust low-quality forgery detection.
Better generalization and adaptability: While the DIFLD method [12] mainly relies on high-dimensional feature distribution learning, AKConv enables the model to optimize feature extraction across varying resolutions and noise levels, improving its adaptability to complex datasets.
Optimized computational efficiency: Compared to computationally expensive super-resolution methods (e.g., Lin et al. [20], who used a pix2pix network for low-resolution face enhancement), YOLOv9-ARC reduces computational complexity through its lightweight AKConv design, making it more suitable for real-time applications.
This paper is structured as follows:
Section 1 reviews related works on deepfake detection with a focus on challenges posed by low-quality synthetic images.
Section 2 introduces the proposed YOLOv9-ARC model in detail, including the AKConv and CBAM modules.
Section 3 presents the experimental setup and evaluation results on the DFDC dataset.
Section 4 discusses the implications of the findings, current limitations, and future research directions. Finally, Section 5 concludes the paper by summarizing the key contributions and outcomes.
2. Materials and Methods
2.1. Model Overview
In the field of object detection, YOLOv9 [23] emerged as a breakthrough in 2024, setting a new benchmark for efficiency and accuracy. However, when dealing with low-quality facial images, the YOLOv9 model begins to show inefficiencies, leading to a decline in detection performance.
To address these issues, this paper proposes the YOLOv9-ARC algorithm for counterfeit detection in low-quality facial images. The full network architecture is depicted in Figure 1. This framework is composed of three primary components.
The Backbone is responsible for extracting initial features from the input image. The Convolutional Block Attention Module (CBAM) [24] is integrated into the Backbone to focus on important regions of the image. By leveraging both spatial and channel attention mechanisms, CBAM enhances critical information while suppressing noise and irrelevant regions, thereby improving the feature extraction capability for low-quality images. The Adaptive Kernel Convolution (AKConv) [25] is combined with the RepNCSPELAN4 module to expand the receptive field and enable the network to capture finer spatial features, enhancing its adaptability to images of varying quality and size. Additionally, the SPPFELAN module further aggregates and refines these features.
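To make the adaptive-kernel idea concrete, below is a minimal, simplified PyTorch sketch of an offset-based adaptive convolution in the spirit of AKConv [25]. It assumes a single learned 2D offset per spatial location, predicted by a small convolution; the full AKConv design additionally supports an arbitrary number and arrangement of sampling points, so this is an illustration rather than the exact module used in the network:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleAdaptiveConv(nn.Module):
    """Simplified adaptive-kernel convolution (illustrative only).

    A small conv predicts a per-pixel 2D offset; the feature map is
    resampled at the shifted locations with grid_sample, and a standard
    conv then aggregates the resampled features."""

    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.offset = nn.Conv2d(in_ch, 2, kernel_size=3, padding=1)
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=k, padding=k // 2)

    def forward(self, x):
        b, _, h, w = x.shape
        # Base sampling grid in normalized [-1, 1] coordinates.
        ys = torch.linspace(-1, 1, h, device=x.device)
        xs = torch.linspace(-1, 1, w, device=x.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        base = torch.stack((gx, gy), dim=-1).unsqueeze(0).expand(b, h, w, 2)
        # Predicted offsets, kept small relative to the image extent.
        off = self.offset(x).permute(0, 2, 3, 1) * (2.0 / max(h, w))
        sampled = F.grid_sample(x, base + off, align_corners=True)
        return self.conv(sampled)
```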
The Neck connects the Backbone and the Head, primarily processing and fusing the features extracted by the Backbone to enhance multi-scale feature representation. This is particularly beneficial for detecting objects of varying sizes in low-quality images. The Neck also employs AKConv to dynamically adjust the convolution kernel size, enabling it to handle feature maps at different scales. Through CBFuse operations, multiple feature maps are fused, integrating information from different levels into a unified feature representation. Furthermore, UpSample and Concat operations upsample and concatenate feature maps of different resolutions, generating higher-dimensional feature maps and ensuring effective multi-scale information fusion.
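As a rough illustration of the upsample-and-concatenate fusion described above (a generic sketch under our own assumptions, not the exact CBFuse operation), a minimal module might look like this:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpsampleConcatFuse(nn.Module):
    """Upsample the deeper (lower-resolution) map to the shallower
    map's spatial size, concatenate along channels, and mix with a
    1x1 convolution. Illustrative sketch only."""

    def __init__(self, deep_ch, shallow_ch, out_ch):
        super().__init__()
        self.mix = nn.Conv2d(deep_ch + shallow_ch, out_ch, kernel_size=1)

    def forward(self, deep, shallow):
        deep = F.interpolate(deep, size=shallow.shape[2:], mode="nearest")
        return self.mix(torch.cat([deep, shallow], dim=1))
```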
Finally, the Head utilizes the refined features from the Neck to generate the final detection results. This component includes multiple Conv layers to further refine the features, ensuring precise detection outputs. Multiple Conv2d layers are used to generate target location and category information, while the Bbox Loss is computed to optimize the model through bounding box regression.
By integrating the CBAM attention mechanism, AKConv adaptive convolution, and multi-scale feature fusion techniques, YOLOv9-ARC significantly enhances object detection capabilities in low-quality images. It effectively addresses challenges such as noise and blur, providing accurate forgery detection results.
As illustrated in Figure 1 and Figure 2, the proposed YOLOv9-ARC model demonstrates significant advantages over the original YOLOv9 model when processing low-quality facial images, owing to the integration of the CBAM attention mechanism and AKConv adaptive convolution. The CBAM mechanism enables the model to focus on critical regions of the image, particularly the edge and texture features that are essential for low-quality images. Meanwhile, AKConv dynamically adjusts the convolution kernel size to adapt to multi-scale feature extraction, effectively capturing fine-grained details in the images. In terms of feature fusion, the proposed model employs the CBFuse operation to better integrate features from different levels, enhancing the ability to combine multi-scale information. This is particularly crucial for low-quality images, where facial features may be fragmented or blurred. In contrast, although the original YOLOv9 model also utilizes Conv layers and concatenation operations, its lack of the CBAM attention mechanism limits its ability to prioritize key features in low-quality images, resulting in inferior performance compared to the proposed model. Overall, the proposed model, with its more sophisticated feature processing and fusion techniques, is better equipped to handle noise and blur in low-quality facial images, delivering higher detection accuracy.
2.2. YOLOv9 Algorithm
YOLOv9 [23] introduces the concept of Programmable Gradient Information (PGI) and designs a supplementary supervision structure that incorporates PGI. As shown in Figure 3, the framework features three essential components: the main branch, an auxiliary reversible branch, and multi-level auxiliary information. During inference, the network uses only the main branch, avoiding any increase in inference cost. During training, however, the auxiliary reversible branch provides reliable gradient information that helps the main branch learn effective features, preventing the loss of important feature information. This design ensures that the loss function is computed using the full input data, thereby supplying stable and reliable gradient feedback for fine-tuning the model parameters.
In addition, the research team proposed a newly designed, low-complexity network architecture, the Generalized Efficient Layer Aggregation Network (GELAN), as shown in Figure 4c. GELAN is built upon the Efficient Layer Aggregation Network (ELAN) architecture in Figure 4b by integrating the Cross-Stage Partial Network (CSPNet) from Figure 4a. Firstly, balancing computational complexity and detection accuracy is at the core of the design: GELAN optimizes the model's computational requirements through modular design, ensuring that high detection accuracy can be maintained even under limited computational resources. Secondly, lightweight convolution operations are introduced between the convolutional layers, reducing computation; this is particularly suitable for edge devices or low-power environments, lowering the computational cost during inference and improving processing speed. The integration of CSPNet further improves computational efficiency by distributing the computational load across different stages and network blocks, preventing excessive computation at any single stage and enabling flexible adjustment of computation based on hardware conditions. Dynamic resource adjustment allows GELAN to optimize the allocation of computational resources according to hardware conditions, ensuring high inference efficiency even when resources are limited. Finally, the modular design, through the rational combination and integration of different network layers and modules, maintains high detection accuracy without adding excessive computational burden.
The incorporation of these two innovative technologies is crucial for mitigating information loss during training while also optimizing efficiency and inference speed. These advancements strengthen the model’s competitiveness and practical relevance in real-world situations.
2.3. CBAM Hybrid Attention Mechanism
In low-quality facial images, low resolution and noise blur the boundaries between the facial features and the background. This makes it difficult to distinguish between different facial features within the same image and leads to large variations in appearance across images of different quality. To address these problems, this paper embeds the CBAM Hybrid Attention Mechanism into the detection framework. By mining and integrating the feature information that is crucial for facial detection, the mechanism allows the model to concentrate on these key features and enables effective feature fusion.
CBAM [24] is a deep learning attention mechanism based on the idea of inferring attention maps along the channel and spatial dimensions independently from the input feature map. The attention maps are then applied to the input feature map to refine the features.
The CBAM comprises two main components: the Channel Attention Module and the Spatial Attention Module (as shown in Figure 5a); the channel attention module is applied first, followed by the spatial attention module.
This sequence enables the model to effectively capture and utilize feature information from multiple dimensions, enhancing its capacity to recognize relevant characteristics within images.
The entire process of the attention mechanism can be represented by the following equations:

$$F' = M_c(F) \otimes F, \qquad F'' = M_s(F') \otimes F'$$

In the above expressions, $\otimes$ denotes the element-wise multiplication operation, $F$ indicates the input feature map, $F'$ refers to the feature map generated after passing through the channel attention mechanism, and $F''$ denotes the resulting feature map obtained after passing through the spatial attention module. $M_c(F)$ indicates the channel attention map produced from the input feature map $F$, while $M_s(F')$ represents the spatial attention map produced from the channel-refined feature map $F'$.
As shown in Figure 5b, this part is the channel attention module. The input feature map $F$, with dimensions $C \times H \times W$ (where $C$ represents the channel count, while $H$ and $W$ denote the image's height and width, respectively), is first processed by global pooling applied individually to each channel: both max pooling and average pooling are computed over the entire spatial dimension, yielding two distinct feature descriptors. These are then passed through a shared fully connected layer (MLP), which learns the attention weights for each channel, and the two outputs are summed to produce a new activation map. Finally, the channel attention map $M_c(F)$ is computed by applying a sigmoid activation function; it is then multiplied element-wise with the input feature map $F$ to generate the channel-refined feature map $F'$.
The process of channel attention is expressed as shown below:

$$M_c(F) = \sigma\big(\mathrm{MLP}(\mathrm{MaxPool}(F)) + \mathrm{MLP}(\mathrm{AvgPool}(F))\big) = \sigma\big(W_1(W_0(F^c_{\max})) + W_1(W_0(F^c_{\mathrm{avg}}))\big)$$

In the equation above, $\sigma$ stands for the sigmoid activation function, $\mathrm{MaxPool}$ denotes the global max pooling operation, and $\mathrm{AvgPool}$ corresponds to global average pooling. The terms $W_0$ and $W_1$ refer to the learnable weights of the shared MLP, and $F^c_{\max}$ and $F^c_{\mathrm{avg}}$ are the max-pooled and average-pooled channel descriptors.
As shown in Figure 5c, this section represents the spatial attention module. After being processed by the channel attention module, the output feature map $F'$ undergoes global max pooling followed by global average pooling along the channel axis, resulting in two feature maps of dimensions $1 \times H \times W$. The concatenated results are then processed through a convolutional layer to produce the spatial attention weights $M_s(F')$. Subsequently, the attention weights are passed through the sigmoid function, after which they are applied to the feature map $F'$ for spatial attention weighting, ultimately producing the final output feature map $F''$.
The spatial attention process can be expressed as shown below:

$$M_s(F') = \sigma\big(f^{7\times 7}\big(\big[\mathrm{MaxPool}(F');\ \mathrm{AvgPool}(F')\big]\big)\big), \qquad F'' = M_s(F') \otimes F'$$

In the expression, $\sigma$ refers to the sigmoid function, $f^{7\times 7}$ represents a convolutional filter with dimensions 7 × 7, $\mathrm{MaxPool}$ represents max pooling, $\mathrm{AvgPool}$ refers to average pooling, $F'$ represents the feature map generated by the channel attention mechanism, $M_s(F')$ denotes the output of the spatial attention module, and $F''$ refers to the final output of the CBAM attention mechanism.
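The following is a minimal PyTorch sketch consistent with the channel and spatial attention equations above; the reduction ratio of 16 and the 7 × 7 spatial kernel follow the original CBAM paper [24], and the sketch is illustrative rather than the exact module used in our network:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        # Shared MLP (W0, W1) realized as 1x1 convolutions.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )

    def forward(self, x):
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))  # AvgPool branch
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))   # MaxPool branch
        return torch.sigmoid(avg + mx)                           # Mc(F)

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, x):
        avg = torch.mean(x, dim=1, keepdim=True)   # channel-wise average pooling
        mx, _ = torch.max(x, dim=1, keepdim=True)  # channel-wise max pooling
        return torch.sigmoid(self.conv(torch.cat([mx, avg], dim=1)))  # Ms(F')

class CBAM(nn.Module):
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        self.ca = ChannelAttention(channels, reduction)
        self.sa = SpatialAttention(kernel_size)

    def forward(self, f):
        f = self.ca(f) * f  # F'  = Mc(F) (x) F
        f = self.sa(f) * f  # F'' = Ms(F') (x) F'
        return f
```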
In this paper, the CBAM hybrid attention mechanism is integrated into the backbone network of the baseline model. By amplifying important information through both channel and spatial attention, it suppresses the expression of irrelevant information. Additionally, since this attention mechanism is a lightweight module, it does not noticeably add to the system's computational complexity.
4. Discussion
This study proposes a novel method for detecting low-quality fake facial images based on the YOLOv9-ARC model and validates its effectiveness through extensive experiments. The experimental results demonstrate that YOLOv9-ARC achieves a mean average precision (mAP) of 75.1% on the DFDC (DeepFake Detection Challenge) dataset, representing a 3.5% improvement over the baseline model, highlighting the model’s superior detection performance in complex environments.
The primary advantage of YOLOv9-ARC lies in the integration of the Adaptive Kernel Convolution (AKConv) module, which dynamically adjusts the receptive field to enable more effective multi-scale feature extraction. Additionally, the Convolutional Block Attention Module (CBAM) enhances the model’s ability to focus on critical regions, further improving detection accuracy. Experimental results confirm that CBAM significantly boosts both precision and recall, validating its effectiveness in optimizing feature representation.
A deeper analysis reveals that the combination of AKConv and CBAM plays a critical role in the model’s success on low-quality images. AKConv allows adaptive kernel size adjustment based on local features, which helps mitigate challenges caused by noise, blur, and reduced resolution. Meanwhile, CBAM refines feature representation by emphasizing discriminative regions and suppressing redundant or noisy features. This synergy is particularly beneficial for detecting visually degraded or compressed deepfakes, which are prevalent in the DFDC dataset.
However, the DFDC dataset still contains challenging conditions such as extreme lighting, occlusion, and large pose variations, which may affect detection outcomes. Although YOLOv9-ARC improves most performance metrics, it can sometimes generate false positives due to its high sensitivity to subtle artifacts, resulting in slightly lower precision in specific cases.
The current evaluation is limited to a single dataset, and the model’s generalizability requires further validation on diverse datasets such as FaceForensics++ and Celeb-DF. Potential data bias and lack of adversarial robustness testing also present limitations. Future work will address these issues through cross-dataset evaluation, deployment optimization, and the integration of explainable AI techniques to improve interpretability and transparency.
Compared to prior research, YOLOv9-ARC demonstrates superior performance in detecting low-quality fake facial images. Traditional methods such as Faster R-CNN [26] and EfficientNet [27] excel on high-quality facial image datasets but exhibit significant performance degradation under conditions of low resolution, noise, or blur. Faster R-CNN, with its region proposal-based detection mechanism, offers strong accuracy but suffers from high computational complexity, making it unsuitable for real-time applications. EfficientNet, while effective for standard image classification tasks, struggles with low-resolution and high-noise environments due to its limited feature extraction capabilities. Recent advancements in YOLO series models [28,29,30] have made them popular for face detection tasks due to their efficiency and robust feature extraction. However, studies reveal that YOLOv3 and YOLOv5 still face challenges such as false positives and missed detections in low-quality fake face detection. This is primarily due to the limitations of standard convolution in capturing multi-scale information and the lack of mechanisms to focus on key regions. While YOLOv3 employs a fixed-size receptive field, limiting its effectiveness for small-object detection, YOLOv5 improves computational efficiency through depth-wise separable convolution but remains constrained in feature extraction for high-noise data.
In contrast, YOLOv9-ARC introduces the AKConv variable convolution kernel and CBAM attention mechanism, significantly enhancing the model’s detection capability for low-quality images. Experimental results show that YOLOv9-ARC achieves a 3.5% improvement in mAP over the baseline model on the DFDC dataset and outperforms YOLOv5 and EfficientNet in noisy and low-resolution conditions. Analysis indicates that AKConv adaptively adjusts the receptive field based on input scales, improving the model’s ability to capture fine details, while CBAM enhances focus on key regions, reducing false detections. These advancements enable YOLOv9-ARC to maintain high detection accuracy even in complex environments, addressing gaps in existing research on low-quality fake face detection.
Despite the strong performance of YOLOv9-ARC in detecting low-quality forged facial images, several research opportunities remain across different levels of difficulty and innovation. As shown in Figure 9, future work can be structured into three progressive layers.
At the fundamental level, efforts should focus on constructing diverse and representative datasets. This includes collecting and synthesizing forged samples under extreme lighting, occlusion, and demographic variations, using both generative models and real-world data. Benchmarking across multiple datasets with standardized evaluation metrics (e.g., accuracy, recall) is also essential for understanding YOLOv9-ARC’s strengths and limitations compared to other state-of-the-art detectors.
In the advanced research stage, model robustness must be strengthened through adversarial training strategies and stress testing with examples generated by attacks like FGSM and PGD. Furthermore, adapting the model to detect forgeries created by emerging techniques such as diffusion models and advanced GANs requires structural enhancements in feature extraction and classification modules.
Moving into high-level innovation, future directions include the development of cross-modal detection systems that fuse video, audio, and text information to enhance robustness. Additionally, the integration of explainable AI (XAI) tools such as saliency maps and Layer-wise Relevance Propagation (LRP) can provide transparency and accountability, while fairness audits are crucial to ensure equitable model performance across demographic groups.
Future research should focus on optimizing computational efficiency to enable deployment in resource-constrained environments, such as mobile devices or embedded systems. Specifically, we plan to integrate lightweight neural network architectures (e.g., MobileNet, ShuffleNet) into the backbone of YOLOv9-ARC. These architectures are specifically designed to reduce the number of parameters and floating-point operations (FLOPs), which are key factors influencing model efficiency. For example, MobileNet uses depthwise separable convolutions, which significantly reduce the computational load by breaking the convolution operation into two parts: depthwise convolution (for each input channel) and pointwise convolution (to combine the outputs).
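As a concrete illustration of this design, a minimal PyTorch sketch of a MobileNet-style depthwise separable convolution block follows (layer sizes are illustrative assumptions):

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """MobileNet-style block: a per-channel (depthwise) 3x3 conv
    followed by a 1x1 pointwise conv that mixes channels, cutting
    parameters and FLOPs relative to a standard 3x3 convolution."""

    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))
```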
In addition to using lightweight backbones, we also consider applying model pruning techniques to remove redundant connections and neurons that do not significantly contribute to the model’s performance. This will help further reduce the model size and increase inference speed. Additionally, we plan to use knowledge distillation to transfer the knowledge from a larger, more complex model (teacher model) to a smaller, more efficient model (student model). By leveraging the student model’s compactness, we aim to maintain detection accuracy while significantly reducing computational overhead.
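For reference, a standard Hinton-style distillation loss, which blends hard-label cross-entropy with a temperature-scaled KL term against the teacher's outputs, could be sketched as follows; the temperature and weighting values here are illustrative assumptions, not tuned settings from this work:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets,
                      temperature=4.0, alpha=0.5):
    # Hard-label supervision from the ground truth.
    hard = F.cross_entropy(student_logits, targets)
    # Soft supervision: match the teacher's softened distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean",
    ) * temperature ** 2  # rescale gradients w.r.t. temperature
    return alpha * hard + (1.0 - alpha) * soft
```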
These strategies (lightweight architecture, pruning, and distillation) are specifically targeted at reducing inference time, memory footprint, and power consumption, all of which are critical when deploying models in resource-constrained environments such as mobile devices and embedded systems.
In future experiments, we will benchmark the optimized models on low-power hardware, such as Jetson Nano or ARM-based devices, and evaluate their practical deployment feasibility by measuring key performance indicators, including inference time (latency), memory usage, frames per second (FPS), and detection accuracy (mAP) on benchmark datasets. These metrics will help assess the model’s efficiency, resource consumption, and performance in real-time applications. The evaluations will help us determine the trade-off between model size, accuracy, and real-time processing capability, ensuring that YOLOv9-ARC remains efficient and effective even in environments with limited computational resources.
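A simple latency/FPS measurement routine of the kind we intend to use is sketched below; the input size and iteration counts are illustrative, and on-device evaluation would replace the random tensor with real data:

```python
import time
import torch

@torch.no_grad()
def benchmark(model, input_size=(1, 3, 640, 640), runs=100, device="cpu"):
    """Measure average single-image latency and derived FPS."""
    model = model.to(device).eval()
    x = torch.randn(*input_size, device=device)
    for _ in range(10):           # warm-up iterations
        model(x)
    if x.is_cuda:
        torch.cuda.synchronize()  # wait for queued GPU work
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    if x.is_cuda:
        torch.cuda.synchronize()
    latency = (time.perf_counter() - start) / runs
    return {"latency_ms": latency * 1e3, "fps": 1.0 / latency}
```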
In summary, this study demonstrates the effectiveness of the YOLOv9-ARC model in detecting low-quality fake facial images, achieving significant improvements through the integration of AKConv and CBAM. However, future research must address challenges related to dataset diversity, computational efficiency, robustness in complex environments, and adaptability to emerging deepfake technologies. By focusing on these areas, the practical application value of fake facial image detection models can be further enhanced, contributing to the development of more reliable and scalable solutions.