1. Introduction
In recent years, the rapid advancement of deepfake techniques has resulted in manipulated facial images and videos becoming increasingly realistic [1,2]. These technologies have been widely applied in entertainment, film production, and virtual reality. However, their malicious use poses significant risks, including identity theft, misinformation dissemination, and privacy violations [3,4]. Consequently, many countries and organizations have established policies and regulations to govern the application of this technology, and technical measures are urgently needed to complement legal governance [5].
Mainstream deepfake detection approaches fall into three categories according to the domain in which forgery cues are modeled: the frequency, temporal, and spatial domains.
Frequency-based approaches analyze spectral characteristics to reveal forgery traces. Qian et al. [6] proposed a frequency-aware decomposition module to extract subtle spectral artifacts. Wang et al. [7] integrated frequency-domain features into a multi-modal multi-scale Transformer. Tan et al. [8] utilized cross-dimensional high-frequency representations. Gao et al. [9] introduced a high-frequency enhancement framework to improve forgery detection.
Temporal-based approaches leverage inter-frame inconsistencies to detect video forgeries. Guera et al. [10] developed a temporally aware network. Liu et al. [11] proposed TI2Net for temporal identity inconsistency. Sun et al. [12] utilized facial region displacement trajectories. Yu et al. [13] designed a multi-scale spatiotemporal inconsistency amplifier.
Spatial-domain approaches examine pixel-level artifacts such as texture, edges, lighting, and color distribution. Yang et al. [14] analyzed head and facial posture differences. Matern et al. [15] observed artifacts in the eye and teeth regions. Li et al. [16] detected blending boundaries. Nguyen et al. [17] employed heatmap visualization and self-consistency regression.
Recent multi-feature integration efforts have been proposed to enhance representation learning and improve generalization. Zhou et al. [18] and Chen et al. [19] proposed dual-stream architectures. Chen et al. [20] combined RGB and frequency features. Bonettini et al. [21] leveraged multiple convolutional neural networks (CNNs) for cross-dataset adaptability. Yu et al. [22] explored common and specific forgery features. Gao et al. [23] used texture and artifact features. Yan et al. [24], Xu et al. [25], and Lin et al. [26] incorporated more than two feature types. Despite these advances, most methods do not explicitly model the relationships among heterogeneous feature representations or fully exploit cross-channel dependencies.
Existing deepfake detectors still face generalization limitations in cross-dataset scenarios, since forgery cues learned from one dataset may be affected by manipulation methods, compression levels, image quality, and acquisition conditions. Local representations are important for capturing subtle differences between real and fake face images, but they may be sensitive to noise when lacking global semantic context. In contrast, global representations provide coarse-grained semantic information and improve robustness to data distribution variations, but they may overlook fine-grained forgery details. Although hybrid CNN-Transformer methods can capture both types of information, their interaction is often treated as simple feature aggregation, which may not sufficiently preserve complementary local and global representations. Therefore, the present study aims to construct a channel-aware local–global representation learning network that integrates local representations, global representations, input features, and channel-wise feature interaction for more stable cross-dataset deepfake detection.
The present study proposes a channel-aware local–global representation learning network for generalizable deepfake detection, achieving improved cross-dataset generalization. The main contributions are as follows:
(a) A channel-aware local–global representation learning network is proposed for generalizable deepfake detection. The network jointly learns local and global representations and enhances channel-wise feature interaction to improve the discriminability of forgery-related representations across datasets.
(b) A Local–Global Integration Vision Transformer (LGI-ViT) block is designed to learn and integrate local representations, global representations, and input features. The block uses coordinate convolution to strengthen the spatial perception of local differences and adopts Transformer layers to model long-range feature interactions.
(c) A Lightweight Channel Attention Network (LCAN) block is introduced to improve cross-channel feature interaction in the lightweight backbone. By incorporating channel attention into the inverted residual structure, LCAN helps emphasize informative channels and suppress less relevant channel responses with limited structural complexity.
2. Method
This section is organized as follows:
Section 2.1 introduces the overall detection network.
Section 2.2 details the LGI-ViT block for local–global representation learning within the network.
Section 2.3 explains the LCAN block.
2.1. Overall Detection Network
We propose a channel-aware local–global representation learning network that jointly models local and global representations. We adopt MobileViT [27] as the backbone because it effectively captures both local and global representations. On this basis, we design the LGI-ViT and LCAN blocks to boost generalization capability. The overall architecture is illustrated in Figure 1. The LGI-ViT and LCAN blocks are shown in blue and green, respectively, while standard convolution layers are depicted in yellow. Downsampling operations (stride = 2) are indicated by downward arrows. An input image is first passed through a 3 × 3 convolution layer to extract initial features, reducing spatial resolution by half. Subsequently, features are processed through alternating LCAN and LGI-ViT blocks. After multiple downsampling stages, the features are fed into a 1 × 1 convolution, followed by global average pooling and a fully connected layer to produce a binary classification output.
The LGI-ViT block learns local and global representations, where local representations are extracted via a convolutional neural network (CNN) and global representations are modeled via stacked Transformer layers, and further integrates them with input features. The LCAN block enhances channel-wise feature discrimination by emphasizing informative channels and suppressing irrelevant ones through channel attention.
2.2. LGI-ViT Block for Local–Global Representation Learning
The LGI-ViT block is the core module of the proposed network, where local and global representations are learned and further integrated with input features. In the LGI-ViT block, local and global representations are learned through feature extraction and feature interaction at different levels. Specifically, a CNN is used to extract local features. Through convolution operations, the CNN learns hierarchical feature representations that progress from fine-grained details to semantic concepts. Under this architecture, features are extracted from spatially adjacent regions, producing feature maps enriched with local information. For global feature extraction, we adopt the paradigm from MobileViT. L Transformer layers model long-range dependencies across different spatial regions of the image. This enables the extraction of feature maps enriched with global context. The self-attention mechanism enables interactions with features from all image patches, thereby supporting comprehensive global modeling.
The backbone network is MobileViT [27]. Within this backbone, we modify the original MobileViT block, inspired by the design principles of MobileViTv3 [28], to construct the LGI-ViT block, which integrates local and global feature representations for generalizable deepfake detection. To better exploit the complementary properties of these two types of features, an adaptive channel-wise weight $\alpha$ is generated to dynamically balance the contributions of local and global features. $\alpha \in \mathbb{R}^{C \times 1 \times 1}$ is obtained through a lightweight gating network. Specifically, the local features are weighted by $\alpha$, while the global features are weighted by $1 - \alpha$, and the two are then combined through channel-wise addition. Let $F_{local}$ denote the local features, and $F_{global}$ denote the global features. The fused features $F_{fused}$ can be formulated as
$$F_{fused} = \alpha \odot F_{local} + (1 - \alpha) \odot F_{global},$$
where $\odot$ denotes element-wise multiplication with channel-wise broadcasting.
The weight $\alpha$ is generated by a lightweight sub-network consisting of an adaptive average pooling layer, a 1 × 1 convolution layer, a ReLU activation function, another 1 × 1 convolution layer, and a Sigmoid activation function. The generation process of $\alpha$ is formulated as
$$\alpha = \sigma\left(W_2\,\phi\left(W_1\,\mathrm{GAP}(F)\right)\right),$$
where $F$ denotes the features fed to the gating sub-network, $W_1$ and $W_2$ are the learnable weight matrices of the two 1 × 1 convolutions, $\sigma$ denotes the Sigmoid function, $\phi$ denotes the ReLU function, and $\mathrm{GAP}$ denotes global average pooling.
Inspired by the idea of residual networks [29], the fused features are further integrated with the input features through element-wise addition, so as to obtain enhanced feature representations with stronger local–global complementarity. As shown in Figure 1, before residual addition, a 1 × 1 convolution is applied to $F_{fused}$ to match the channel dimension of $F_{in}$. The output feature $F_{out}$ is defined as
$$F_{out} = F_{in} + \mathrm{Conv}_{1 \times 1}(F_{fused}),$$
where $F_{in}$ denotes the input features, and $F_{out}$ denotes the output features.
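The gating, fusion, and residual steps described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: feature maps are plain nested lists (one flat list of pixels per channel), the helper names (`gap`, `gate`, `fuse`, `residual`) and the toy weights are hypothetical, and feeding the gate with the sum of the two branches is an assumption, since the text does not state which features the gating network receives.

```python
import math

def gap(feat):
    """Global average pooling: one mean per channel (feat[c] is a flat pixel list)."""
    return [sum(ch) / len(ch) for ch in feat]

def linear(weights, vec):
    """On a pooled C x 1 x 1 tensor, a 1x1 convolution is a matrix-vector product."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def gate(feat, w1, w2):
    """alpha = sigmoid(W2 * relu(W1 * GAP(feat))), one weight per channel."""
    hidden = [max(0.0, h) for h in linear(w1, gap(feat))]
    return [1.0 / (1.0 + math.exp(-z)) for z in linear(w2, hidden)]

def fuse(f_local, f_global, alpha):
    """F_fused = alpha * F_local + (1 - alpha) * F_global, broadcast over pixels."""
    return [[a * l + (1.0 - a) * g for l, g in zip(lc, gc)]
            for a, lc, gc in zip(alpha, f_local, f_global)]

def residual(f_in, f_fused, w_proj):
    """F_out = F_in + Conv1x1(F_fused); the 1x1 conv mixes channels at each pixel."""
    n_pix = len(f_fused[0])
    proj = [[sum(w * f_fused[c][p] for c, w in enumerate(row)) for p in range(n_pix)]
            for row in w_proj]
    return [[x + y for x, y in zip(ic, pc)] for ic, pc in zip(f_in, proj)]

# Toy example with C = 2 channels and 4 pixels per channel.
f_in = [[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5]]
f_local = [[1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 0.0, 0.0]]
f_global = [[0.0, 0.0, 0.0, 0.0], [2.0, 2.0, 2.0, 2.0]]

# Assumption: the gate is driven by the sum of the local and global branches.
gate_in = [[l + g for l, g in zip(lc, gc)] for lc, gc in zip(f_local, f_global)]
w1 = [[0.5, 0.5]]        # hypothetical weights: C -> 1 hidden unit
w2 = [[1.0], [-1.0]]     # hypothetical weights: 1 hidden unit -> C

alpha = gate(gate_in, w1, w2)
f_fused = fuse(f_local, f_global, alpha)
identity = [[1.0, 0.0], [0.0, 1.0]]   # stand-in for the channel-matching 1x1 conv
f_out = residual(f_in, f_fused, identity)
```

Because each entry of `alpha` lies strictly between 0 and 1, every channel of the fused map is a convex combination of the corresponding local and global channels, so neither branch can be scaled beyond its original magnitude by the gate.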
In addition, the original 3 × 3 convolution layer in the MobileViT block is removed, which simplifies the post-integration learning process while controlling the total number of parameters. Compared with the original MobileViT block, the proposed design learns local and global representations and further integrates them with input features. Despite this additional integration, the total number of parameters remains relatively low due to the removal of the 3 × 3 convolution layer.
2.2.1. Local Feature Extraction
Within the proposed LGI-ViT block, a Lightweight Local Feature Augmentation Network (LLFAN) is designed to extract local features. Its structure is shown in Figure 2. It consists of a depthwise convolution layer, a coordinate convolution layer, a group convolution network, and a convolutional feedforward network.
First, a depthwise convolution (DWConv) layer encodes local spatial information. It performs convolution independently on each channel, reducing the network’s parameter count. To enhance the network’s spatial perception capability for local artifacts in deepfakes, a coordinate convolution (CoordConv) layer [30] is introduced to record the positional information of each pixel. A standard convolution is translation-equivariant: shifting the input shifts the output accordingly. CoordConv adds coordinate channels to provide explicit spatial information. If the coordinate channels do not carry useful information, the layer behaves similarly to standard convolution, roughly preserving translation equivariance. When the coordinate channels capture spatial information, the layer becomes position-sensitive, enabling the detection of location-dependent forgery artifacts while weakening strict translation equivariance. Therefore, coordinate convolution enables the network to learn spatially aware representations, improving generalization to some extent.
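The coordinate channels that CoordConv appends before the convolution can be sketched as follows. This is a simplified illustration rather than the authors' layer: the feature map is a nested list of shape C × H × W, the function name `add_coord_channels` is hypothetical, and scaling the coordinates to [-1, 1] follows the common CoordConv convention, which the paper does not confirm for its own implementation.

```python
def add_coord_channels(feat, height, width):
    """Append a row-coordinate and a column-coordinate channel to a C x H x W map."""
    def norm(i, n):
        # Map index i in [0, n - 1] to [-1, 1]; a single row or column maps to 0.
        return 0.0 if n == 1 else 2.0 * i / (n - 1) - 1.0

    ys = [[norm(r, height) for _ in range(width)] for r in range(height)]
    xs = [[norm(c, width) for c in range(width)] for _ in range(height)]
    return feat + [ys, xs]

# One all-zero channel of size 2 x 3; the output has C + 2 = 3 channels.
feat = [[[0.0] * 3 for _ in range(2)]]
out = add_coord_channels(feat, 2, 3)
```

A regular convolution applied to the augmented map can then learn weights on the two extra channels; if those weights stay near zero, the layer reduces to an ordinary convolution, which matches the behavior described above.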
Subsequently, the feature channels are divided into four parts, with one part fed into the group convolution network [31]. In group convolution, the feature maps are first partitioned into groups, each processed separately. As shown in the upper part of Figure 3a, for 12 input channels and 6 convolution kernels, the feature maps are evenly divided into 3 groups of 4 channels each; the kernels are also divided into 3 groups with 2 kernels per group, and each group processes its corresponding 4 input channels to produce 2 output feature maps. Therefore, the three groups produce 6 output feature maps in total. Compared with standard convolution, group convolution has fewer parameters and lower computational cost, effectively reducing network complexity.
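The parameter saving in the 12-channel, 6-kernel, 3-group example can be checked with a short calculation. The 3 × 3 kernel size is an assumption made only for illustration (the text does not give it), bias terms are ignored, and `conv_params` is a hypothetical helper name.

```python
def conv_params(in_ch, out_ch, kernel, groups=1):
    """Weight count of a 2D convolution without bias.

    Each output channel only sees in_ch / groups input channels.
    """
    if in_ch % groups or out_ch % groups:
        raise ValueError("channels must be divisible by the number of groups")
    return out_ch * (in_ch // groups) * kernel * kernel

standard = conv_params(12, 6, 3)            # every kernel spans all 12 channels
grouped = conv_params(12, 6, 3, groups=3)   # every kernel spans 4 channels
ratio = standard / grouped
```

With these settings the standard layer holds 648 weights and the grouped layer 216, so the reduction factor equals the number of groups.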
Although group convolution reduces parameters, each output channel connects only to a subset of input channels. As shown in the lower part of Figure 3a, in two stacked group convolutions without channel shuffle, each group’s output depends only on the corresponding input group, limiting information flow and representation. Therefore, channel shuffle is introduced to enhance feature learning. Figure 3b shows that rearranging channels within each group promotes interaction across feature channels; Figure 3c further illustrates the shuffled channel connections, enhancing cross-group information flow and network representation [32].
By combining group convolution with channel shuffle, the LLFAN block enables efficient feature extraction while promoting information interaction across channel groups. The group convolution output is fused with the remaining original features and fed into a convolutional feedforward network. This network, similar to the feedforward network after multi-head attention in ViT, uses 1 × 1 convolutions instead of fully connected layers, enhancing correlations among feature maps and improving feature representation and integration of outputs from different information sources.
2.2.2. Global Feature Extraction
For global feature extraction, the proposed method aims to model long-range dependencies while preserving a large receptive field. The Global Representations part in Figure 1 provides an intuitive illustration of this process, which helps readers better understand the following feature extraction procedure. To enable the LGI-ViT block to learn global features with spatial inductive bias, stacked Transformer layers are used for feature extraction. The number of stacked Transformer layers $L$ is set to 2. Let $X_L$ denote the local features output by the LLFAN. To obtain global features, $X_L$ is unfolded into $N$ non-overlapping flattened patch features $X_U \in \mathbb{R}^{P \times N \times d}$. Here, $P = w \times h$, $N = HW/P$, and $N$ is the number of patches, while $P$ denotes the number of pixels in each patch; $h \le n$ and $w \le n$ are the height and width of a patch, respectively. For each intra-patch position $p \in \{1, 2, \ldots, P\}$, the relationships among the $N$ patches are encoded by the stacked Transformer layers. The global interaction feature $X_G \in \mathbb{R}^{P \times N \times d}$ is formulated as
$$X_G(p) = \mathrm{Transformer}\left(X_U(p)\right), \quad 1 \le p \le P.$$
Consequently, $X_G \in \mathbb{R}^{P \times N \times d}$ is folded back to the original spatial dimensions to obtain the feature map $X_F \in \mathbb{R}^{H \times W \times d}$. $X_F$ is then projected into a lower $C$-dimensional space via pointwise convolution. Since $X_U(p)$ uses convolution to encode local information from an $n \times n$ region, and $X_G(p)$ models global dependencies across image patches, each pixel in $X_G$ can aggregate information from all pixels in the input tensor $X$. By employing stacked Transformer layers, the order of image patches is preserved while also maintaining the spatial structure of pixels within each patch. Therefore, the effective receptive field of the LGI-ViT block covers the entire spatial dimension $H \times W$.
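The unfold and fold operations around the Transformer layers amount to an index rearrangement, which the following sketch makes explicit for a single-channel map. It is an illustration under simplifying assumptions: one scalar per pixel instead of a d-dimensional vector, row-major ordering of patches and of positions inside a patch, and hypothetical function names `unfold` and `fold`.

```python
def unfold(x, ph, pw):
    """Rearrange an H x W map into P x N: P positions per patch, N patches."""
    height, width = len(x), len(x[0])
    if height % ph or width % pw:
        raise ValueError("patch size must divide the feature map size")
    patches_per_row = width // pw
    n_pos = ph * pw
    n_patches = (height // ph) * patches_per_row
    out = [[None] * n_patches for _ in range(n_pos)]
    for r in range(height):
        for c in range(width):
            p = (r % ph) * pw + (c % pw)                    # position inside the patch
            n = (r // ph) * patches_per_row + (c // pw)     # index of the patch
            out[p][n] = x[r][c]
    return out

def fold(xu, height, width, ph, pw):
    """Inverse of unfold: restore the H x W layout from the P x N layout."""
    patches_per_row = width // pw
    return [[xu[(r % ph) * pw + (c % pw)][(r // ph) * patches_per_row + (c // pw)]
             for c in range(width)]
            for r in range(height)]

# 4 x 4 map with values 0..15 and 2 x 2 patches, so P = 4 and N = 4.
x = [[4 * r + c for c in range(4)] for r in range(4)]
xu = unfold(x, 2, 2)
restored = fold(xu, 4, 4, 2, 2)
```

Each row `xu[p]` is the sequence that self-attention would process for intra-patch position `p`: it contains the pixel at that position from every patch, which is how information is exchanged across patches while the pixel order inside each patch is kept.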
2.3. LCAN Block
In the original MobileViT network structure, the MV2 module adopts the inverted residual structure of MobileNetV2 [33], which is built upon depthwise separable convolutions. However, depthwise convolution operates independently on each channel of the input feature map, which limits the modeling of inter-channel dependencies. Thus, features from important channels may not be effectively emphasized, while interference from noisy channels cannot be adequately suppressed. To enhance cross-channel feature interaction, we incorporate channel attention into this structure to form the LCAN block. As illustrated in Figure 4, channel attention comprises three operations: squeeze, excitation, and recalibration [34].
The LCAN block first applies a 1 × 1 pointwise convolution to the input features, expanding the channels to a higher-dimensional space to enhance nonlinear expressive capacity. To reduce computational cost and avoid cross-channel parameter redundancy, a 3 × 3 depthwise convolution operation filters the expanded features, processing each channel independently. After filtering, the features are passed through a channel attention module. Finally, another 1 × 1 pointwise convolution compresses the channel dimension back to its original size. To minimize feature information loss, a linear activation function is used in this final pointwise convolution instead of ReLU.
By incorporating channel attention into the inverted-residual-style structure, the LCAN block compensates for the limited channel communication caused by depthwise convolution. Specifically, the LCAN block estimates channel-wise importance from aggregated feature responses and recalibrates the expanded feature maps before the final pointwise projection. In this way, informative channels can be strengthened while less relevant channels are suppressed, with limited additional parameters.
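The squeeze, excitation, and recalibration operations can be sketched in a few lines of pure Python. This follows the generic squeeze-and-excitation pattern and is not the authors' LCAN code: the bottleneck of one hidden unit, the toy weights, and the function names are all assumptions chosen to keep the arithmetic easy to follow.

```python
import math

def squeeze(feat):
    """Squeeze: collapse each channel (a flat pixel list) to its mean response."""
    return [sum(ch) / len(ch) for ch in feat]

def excite(z, w_down, w_up):
    """Excitation: bottleneck layer with ReLU, then expansion with Sigmoid."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w_down]
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in w_up]
    return [1.0 / (1.0 + math.exp(-t)) for t in logits]

def recalibrate(feat, scores):
    """Recalibration: rescale every pixel of a channel by that channel's score."""
    return [[s * v for v in ch] for s, ch in zip(scores, feat)]

# Expanded features with 2 channels of 2 pixels each (toy values).
expanded = [[2.0, 4.0], [6.0, 8.0]]
w_down = [[1.0, 0.0]]     # hypothetical weights: 2 channels -> 1 hidden unit
w_up = [[0.0], [0.0]]     # zero weights give a neutral score of 0.5 per channel
scores = excite(squeeze(expanded), w_down, w_up)
attended = recalibrate(expanded, scores)
```

In the block described above, this step would sit between the 3 × 3 depthwise convolution and the final 1 × 1 pointwise projection, so the projection receives channels that have already been reweighted.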
This design is practically useful for generalizable deepfake detection because manipulation traces are often subtle, weakly distributed, and easily affected by image quality or compression. LCAN provides additional channel dependency modeling without introducing a heavy spatial attention branch or complex global context module. Therefore, it can be integrated into the lightweight backbone while maintaining a balance among discriminative feature learning, structural simplicity, and computational efficiency. Such a design helps the network produce more focused responses to manipulation-related facial regions.
The LCAN block designed in this study contains only 784 parameters, making it extremely lightweight. Owing to the minimal parameter count of LCAN, the total network parameters amount to merely 0.34 M, which is far lower than those of common deepfake detection networks such as Xception or EfficientNet-B4, whose parameters typically range in the tens of millions. This design ensures that the model maintains competitive detection performance while achieving very low computational and memory overhead.
3. Experiments and Analysis
To evaluate the generalization capability of the proposed detection network, we designed and conducted a series of experiments.
Section 3.1 introduces the experimental setup.
Section 3.2 introduces the evaluation metrics.
Section 3.3 and Section 3.4 evaluate the performance of the LGI-ViT and LCAN blocks, respectively.
Section 3.5 presents generalization performance.
Section 3.6 describes ablation studies.
3.1. Experimental Setup
To evaluate generalization, following prior studies, we trained the network on FF++ [35] and tested it on DFDC [36] and Celeb-DF [37]. FF++, DFDC, and Celeb-DF differ in data source, manipulation quality, compression settings, and artifact distribution. FF++ contains multiple manipulation methods and compression levels, DFDC includes more diverse real-world recording conditions, and Celeb-DF provides higher-quality deepfake videos with fewer obvious visual artifacts. These differences make cross-dataset evaluation important for assessing whether the model generalizes beyond dataset-specific cues. The FF++ dataset provides three versions with different compression levels: raw, C23, and C40. Given that forgery methods often involve post-processing operations to degrade image quality and improve concealment, we used the moderately compressed C23 version for training. All datasets were processed at the frame level after face detection and cropping. The training and testing splits followed the official or commonly used video-level protocols to avoid frame-level leakage. For each video, frames were uniformly sampled and resized to 224 × 224.
During training, we employed the AdamW optimizer, and the binary cross-entropy loss is defined as
$$\mathcal{L}_{BCE} = -\left[\, y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \,\right],$$
where $\hat{y}$ denotes the predicted probability of the fake class, and $y = 1$ indicates a manipulated facial image, while $y = 0$ indicates a real one.
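For reference, the binary cross-entropy loss can be written as a short function. The clamping constant `eps` and the averaging over a batch are assumptions added for numerical safety and convenience; the function names are hypothetical.

```python
import math

def bce(y, y_hat, eps=1e-12):
    """Binary cross-entropy for one sample; y is 1 for fake and 0 for real."""
    y_hat = min(max(y_hat, eps), 1.0 - eps)   # keep log() away from zero
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))

def mean_bce(labels, probs):
    """Average loss over a batch of labels and predicted fake probabilities."""
    return sum(bce(y, p) for y, p in zip(labels, probs)) / len(labels)

confident_right = bce(1, 0.9)   # fake image predicted as fake with p = 0.9
confident_wrong = bce(0, 0.9)   # real image predicted as fake with p = 0.9
batch_loss = mean_bce([1, 0], [0.5, 0.5])
```

The loss grows without bound as a confident prediction moves toward the wrong label, which is why the wrong prediction above is penalized far more heavily than the correct one is rewarded.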
The initial learning rate was set to 0.001, with a linear warm-up for the first 10 steps, followed by cosine annealing decay to $1 \times 10^{-6}$. To achieve a trade-off between GPU memory usage and training stability, the batch size was set to 16, and the number of training epochs was set to 50. To mitigate overfitting, we applied random horizontal flipping and CutMix for data augmentation and set the dropout rate to 0.2. All experiments were conducted using Python 3.11 (Python Software Foundation, Wilmington, DE, USA), PyTorch 2.5.0 (PyTorch Foundation, Linux Foundation, Wilmington, DE, USA), and CUDA 11.8 (NVIDIA Corporation, Santa Clara, CA, USA). The experiments were performed on an NVIDIA GeForce RTX 4090 GPU (NVIDIA Corporation, Santa Clara, CA, USA). All reported experimental results correspond to the average performance over three independent training runs to account for the inherent randomness in deep learning optimization.
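The learning-rate schedule can be approximated by the sketch below, which uses the stated peak rate (0.001), warm-up length (10 steps), and floor ($1 \times 10^{-6}$). Several details are assumptions because the text does not specify them: the warm-up ramps from one tenth of the peak rate, the schedule is updated per optimizer step, and `total_steps` is a hypothetical value chosen only for the example.

```python
import math

def lr_at(step, total_steps, base_lr=1e-3, min_lr=1e-6, warmup_steps=10):
    """Linear warm-up followed by cosine annealing down to min_lr."""
    if step < warmup_steps:
        # Assumed ramp: base_lr / warmup_steps at step 0, base_lr at the last warm-up step.
        return base_lr * (step + 1) / warmup_steps
    span = max(1, total_steps - warmup_steps)
    progress = min((step - warmup_steps) / span, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

total_steps = 1000   # hypothetical training length, not taken from the paper
schedule = [lr_at(s, total_steps) for s in range(total_steps + 1)]
```

The schedule reaches the peak rate at the end of warm-up, decreases monotonically afterwards, and ends at the floor value when `step` equals `total_steps`.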
3.2. Evaluation Metrics
In this paper, we use accuracy (ACC) and Area Under the Curve (AUC) as evaluation metrics. ACC represents the proportion of correctly classified samples among all test samples. AUC refers to the area under the Receiver Operating Characteristic (ROC) curve. It represents the comprehensive performance of a classifier across various thresholds, thereby avoiding the limitations of evaluation based on a single threshold. The ROC curve is a tool for evaluating classifier performance. It plots the True-Positive Rate (TPR) against the False-Positive Rate (FPR) as the classification threshold varies. The AUC is calculated as follows:
$$AUC = \int_{0}^{1} TPR \; \mathrm{d}(FPR),$$
where TPR, also known as recall, indicates the proportion of actual fake face samples correctly identified as fake by the model. The TPR is calculated as follows:
$$TPR = \frac{TP}{TP + FN},$$
where TP (True Positive) denotes the number of deepfakes correctly identified as fake, and FN (False Negative) denotes the number of deepfakes incorrectly identified as real. FPR represents the proportion of real samples incorrectly identified as fake by the model among all real samples. It is calculated as follows:
$$FPR = \frac{FP}{FP + TN},$$
where FP (False Positive) denotes the number of real samples incorrectly identified as fake, and TN (True Negative) denotes the number of real samples correctly identified as real.
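The metrics above can be computed from labels and predicted scores as in the following sketch, which sweeps the threshold over the observed scores and integrates the ROC curve with the trapezoidal rule. The function names and the six-sample toy data are hypothetical, and the code assumes that both classes are present so that neither denominator is zero.

```python
def confusion(labels, scores, threshold):
    """Count TP, FP, TN, FN when scores >= threshold are predicted as fake (1)."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    return tp, fp, tn, fn

def roc_points(labels, scores):
    """(FPR, TPR) pairs obtained by lowering the threshold over the observed scores."""
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp, fp, tn, fn = confusion(labels, scores, t)
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

labels = [1, 1, 0, 1, 0, 0]                 # 1 = fake, 0 = real
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]     # predicted fake probabilities
tp, fp, tn, fn = confusion(labels, scores, 0.5)
accuracy = (tp + tn) / len(labels)
area = auc(roc_points(labels, scores))
```

In this toy example, 8 of the 9 fake-real pairs are ranked correctly, so the area equals 8/9, while the accuracy at a threshold of 0.5 is 5/6; the two metrics can differ because ACC depends on a single threshold and AUC does not.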
3.3. Evaluation of the Local–Global Integration Vision Transformer Block
(1) Evaluation of Coordinate Convolution
To investigate the contribution of coordinate convolution to the performance improvement of the LGI-ViT block, we compared it with two recently proposed convolutional operations: depthwise separable convolution (DSConv) [38] and dual convolution kernel convolution (DualConv) [39]. In this experiment, we replaced the coordinate convolution in the LGI-ViT block with DSConv and DualConv, respectively, while keeping all other structures unchanged. The comparative results are presented in Table 1.
As shown in Table 1, different convolutions lead to noticeable variations in ACC and AUC. CoordConv achieves the best detection performance across the FF++, DFDC, and Celeb-DF datasets. Specifically, it achieves ACC improvements of 4.78 and 5.58 percentage points on average, and AUC improvements of 4.79 and 5.31 percentage points on average, respectively, over DSConv and DualConv. The results suggest that introducing coordinate information may help the network capture spatially related artifact cues under the current experimental setting.
(2) Evaluation of Different Integration Strategies
To validate the effectiveness of the proposed local–global representation learning network, we conducted experiments using different integration strategies within the LGI-ViT block. The results are shown in Table 2. In Table 2, “Without Integration” refers to the configuration in which the local features, global features, and input features are not integrated; in this case, both the red and yellow dashed lines in the LGI-ViT block structure shown in Figure 1 are removed. “Local–Global Integration” refers to integrating only the local and global features, in which the yellow dashed line in the LGI-ViT block is removed. The results indicate that the Without Integration strategy yields lower cross-dataset performance on DFDC and Celeb-DF, whereas the Local–Global Integration strategy improves generalization but reduces in-dataset performance on FF++. This suggests that the local features extracted by the proposed LLFAN contribute to improved generalization in deepfake detection. Although the Local–Global Integration strategy performs slightly better than the Local–Global Integration with Input Residual strategy on the Celeb-DF dataset in AUC, it results in lower ACC and AUC on the FF++ and DFDC datasets. The residual integration strategy slightly improves DFDC performance while showing a small decrease on Celeb-DF. This may be because DFDC contains more complex variations in lighting, resolution, background, and compression; the input residual helps preserve basic visual information and reduces information loss during local–global fusion. On Celeb-DF, the AUC slightly decreases, possibly because forgery traces in this dataset rely more on fused representations, and additional input features may introduce slight redundancy. In contrast, the Local–Global Integration with Input Residual strategy demonstrates more pronounced detection advantages on both FF++ and DFDC.
3.4. Evaluation of the Lightweight Channel Attention Network Block
To assess the contribution of channel attention to detection performance, we compared the proposed LCAN block with two recent attention mechanisms: Efficient Channel Attention (ECA) [40] and Global Context Attention (GCA) [41]. In this experiment, we replaced the channel attention within the LCAN block with ECA and GCA, respectively, while keeping all other structures unchanged. The experimental results are presented in Table 3. As shown in Table 3, the proposed LCAN block achieves the best generalization detection performance among the three attention mechanisms. On the FF++, DFDC, and Celeb-DF datasets, the proposed method achieves ACC and AUC improvements of 4.25 and 3.59 percentage points on average over ECA, and 8.69 and 9.10 percentage points on average over GCA, respectively.
To visualize the effect of the LCAN block in the deepfake detection process, we employed Grad-CAM [42] to generate heatmaps for analysis. The results are shown in Figure 5. Since the LCAN block introduces channel attention into the MV2 module, we compare the heatmaps produced by the original MV2 module and the proposed LCAN block. Warmer colors indicate regions where the network responds more strongly to forgery-related features. The visualization shows that when using the MV2 module, the network’s response is concentrated on relatively small regions, with the largest response area observed on the FF++ dataset and the smallest on the DFDC dataset. In contrast, with the LCAN block, the network exhibits larger response regions on the FF++ dataset and significantly enhanced responses on both the DFDC and Celeb-DF datasets. The Grad-CAM visualization indicates that LCAN is associated with broader responses in regions that may contain forgery-related cues.
3.5. Generalization Performance Evaluation
We compared the proposed method with 11 state-of-the-art detection models, including Xception [35], EfficientB4 [43], F3Net [6], X-ray [16], B4ATT [21], SRM [44], Yu [22], UCF [24], MCX-API [25], Lin [26], and TAD [23]. The evaluation metric is AUC, and the results are presented in Table 4. Among these, Xception is a widely used baseline in deepfake detection. Xception, EfficientB4, and X-ray utilize single-feature detection, while the remaining methods leverage multi-feature representation learning strategies. The results for the comparison methods are sourced from [22,23,26,45]. Since the results of competing methods are collected from published benchmarks, minor differences in preprocessing protocols may exist. Therefore, Table 4 is used mainly to provide a reference comparison under commonly adopted cross-dataset settings.
As shown in Table 4, the proposed method achieves an average AUC of 83.31% across FF++, DFDC, and Celeb-DF, surpassing the second-best model (UCF) by 1.01 percentage points. On the FF++ dataset, it attains an AUC of 96.56%, closely following Yu (98.38%) and Lin (98.28%). On DFDC and Celeb-DF, it achieves the highest AUCs of 73.95% and 79.42%, respectively. The lower AUC on DFDC is attributed to greater diversity in lighting, resolution, occlusion, and background, which better reflects real-world scenarios and increases the difficulty of generalization.
Examining baseline methods, X-ray, Lin, and TAD achieve high AUC on FF++ but degrade substantially on DFDC and Celeb-DF. For example, X-ray drops from 95.92% on FF++ to 63.26% on DFDC; Lin and TAD exhibit similar declines. In contrast, the proposed method consistently maintains AUCs of 73.95% on DFDC and 79.42% on Celeb-DF, outperforming all compared approaches. The improvement can be attributed to enhanced feature extraction: by jointly learning local and global representations and integrating them with input features, the model captures more discriminative multi-level forgery representations. The LCAN block further leverages inter-channel dependencies to focus on manipulated facial regions.
In summary, the proposed channel-aware local–global representation learning network for deepfake detection exhibits competitive performance in both in-dataset and cross-dataset evaluations. These results suggest its potential for practical deepfake detection scenarios.
3.6. Ablation Study
To further investigate the individual contributions of the proposed LGI-ViT and LCAN blocks, we conducted ablation studies using MobileViT as the backbone network. Performance was evaluated by replacing the corresponding blocks in MobileViT with the LGI-ViT block and the LCAN block, either individually or in combination. The experimental results are presented in Table 5.
As shown in Table 5, both the LGI-ViT and LCAN blocks contribute to improving the generalization performance of the model. The LGI-ViT block alone yields a more substantial improvement, indicating that local–global representation integration and the introduction of coordinate convolution have a greater impact on performance than the LCAN block alone. The last row of Table 5 shows that combining both blocks maximizes the model’s generalization capability.
Next, t-distributed stochastic neighbor embedding (t-SNE) [46] was used to visualize the learned feature representations. t-SNE is a nonlinear dimensionality reduction technique mainly used for visualizing high-dimensional data. By preserving local similarities between data points, it can map complex high-dimensional structures to a low-dimensional space. For each dataset, 2000 samples were selected in a class-balanced manner, including 1000 real samples and 1000 fake samples, to avoid visual bias caused by class imbalance. As illustrated in Figure 6, compared to the original backbone network MobileViT, the proposed method provides clearer visual separation between real and fake samples, forming two distinct clusters. On the Celeb-DF dataset, the proposed method achieves a clear distinction between real and fake samples, although a small number of real samples (represented by red points) remain dispersed within the fake cluster, indicating some residual confusion. Nevertheless, the separation is clearly improved over the original backbone. On the DFDC dataset, real images are scattered across the feature space, forming multiple isolated clusters rather than a single coherent group. This distribution suggests that the model’s representation of real samples is not yet fully compact, potentially leading to ambiguous decision boundaries when evaluating new samples. This observation is consistent with the lower detection accuracy on DFDC compared to Celeb-DF. Overall, the proposed method achieves clearer separation of real and fake samples than the original backbone across all three datasets. This indicates that the learned representations exhibit improved discriminability and consistency across datasets, which helps the network distinguish real from fake images.
However, some false negatives and false positives still occur. False negatives mainly arise when manipulated regions are subtle, highly compressed, or partially occluded, making local artifacts less distinguishable. False positives are more likely to appear in real images with strong illumination variation, motion blur, heavy compression, or facial regions containing abnormal texture patterns. These observations suggest that, although the proposed method improves local–global representation learning, its robustness may still be affected by severe image degradation and ambiguous facial textures.
4. Conclusions
This paper proposes a channel-aware local–global representation learning network for generalizable deepfake detection. We designed an LGI-ViT block to learn local and global representations and further integrated them with input features. By constructing a local feature augmentation network, the spatial perception of local artifacts in fake samples is enhanced. An LCAN block is introduced to improve cross-channel feature interaction. Ablation studies demonstrate that combining the proposed LGI-ViT and LCAN blocks improves the network’s generalization capability, resulting in cross-dataset AUCs of 73.95% and 79.42% on the DFDC and Celeb-DF datasets, respectively. Comparisons with 11 state-of-the-art models further validate the competitive generalization performance of the proposed network.
Although the proposed method achieves promising cross-dataset performance, potential overfitting to dataset-specific artifacts, such as compression patterns, image quality, and manipulation traces, still requires further investigation. Future work will evaluate the framework on more diverse real-world deepfake content to further improve its robustness and cross-domain generalization. In addition, the current image-level framework could be extended to video-level detection by incorporating temporal information, such as inter-frame consistency and temporal dynamics. More comprehensive efficiency analysis, including floating-point operations (FLOPs) and inference speed, will also be considered to assess the practical deployment potential of the proposed method.