Channel-Aware Local–Global Representation Learning for Generalizable Deepfake Detection

Kuang, Liang; Chen, Beijing; Shi, Pei

doi:10.3390/math14111913

by Liang Kuang ¹,²,*, Beijing Chen ¹ and Pei Shi ³

¹ School of Cyber Science and Engineering, Nanjing University of Information Science and Technology, Nanjing 210044, China

² School of IoT Engineering, Jiangsu Vocational College of Information Technology, Wuxi 214153, China

³ School of IoT Engineering, Wuxi University, Wuxi 214105, China

* Author to whom correspondence should be addressed.

Mathematics 2026, 14(11), 1913; https://doi.org/10.3390/math14111913

Submission received: 28 April 2026 / Revised: 27 May 2026 / Accepted: 28 May 2026 / Published: 1 June 2026

(This article belongs to the Special Issue Artificial Intelligence Algorithms in Information Security and Cryptography)

Abstract

Deepfake detection has emerged as a critical research area in image content security. To address the issue of limited generalization caused by insufficient modeling of forgery representations, this paper proposes a channel-aware local–global representation learning network for generalizable deepfake detection. Specifically, we introduce a Local–Global Integration Vision Transformer (LGI-ViT) block that learns local and global representations and further integrates them with input features to capture more generalizable forgery cues at both fine-grained and global levels. Local representation learning is enhanced through coordinate convolution, while a hybrid convolution–Transformer architecture is employed to model global dependencies. Based on these representations, a residual connection is incorporated to integrate the combined local–global representation with the original input features. In addition, a Lightweight Channel Attention Network (LCAN) block is designed to strengthen interactions among feature channels and improve the discriminability of forgery-related representations. Experimental results demonstrate that the proposed network, trained on the FaceForensics++ (FF++) dataset, achieves cross-dataset AUC scores of 73.95% and 79.42% on the DeepFake Detection Challenge (DFDC) and Celeb-DF datasets, respectively. It outperforms the best-performing baseline among 11 competing models for generalizable detection by 1.01 percentage points on average, thereby validating its effectiveness in generalizable deepfake detection.

Keywords: deepfake detection; generalization; local–global representation; channel-aware; transformer

MSC: 68T45; 68U10; 68T07

1. Introduction

In recent years, the rapid advancement of deepfake techniques has resulted in manipulated facial images and videos becoming increasingly realistic [1,2]. These technologies have been widely applied in entertainment, film production, and virtual reality. However, their malicious use poses significant risks, including identity theft, misinformation dissemination, and privacy violations [3,4]. Consequently, many countries and organizations have established policies and regulations to govern the application of this technology, and technical measures are urgently needed to complement legal governance [5].

Mainstream deepfake detection approaches fall into three categories according to the domain in which forgery cues are modeled: the frequency, temporal, and spatial domains.

Frequency-based approaches analyze spectral characteristics to reveal forgery traces. Qian et al. [6] proposed a frequency-aware decomposition module to extract subtle spectral artifacts. Wang et al. [7] integrated frequency-domain features into a multi-modal multi-scale Transformer. Tan et al. [8] utilized cross-dimensional high-frequency representations. Gao et al. [9] introduced a high-frequency enhancement framework to improve forgery detection.

Temporal-based approaches leverage inter-frame inconsistencies to detect video forgeries. Guera et al. [10] developed a temporally aware network. Liu et al. [11] proposed TI2Net for temporal identity inconsistency. Sun et al. [12] utilized facial region displacement trajectories. Yu et al. [13] designed a multi-scale spatiotemporal inconsistency amplifier.

Spatial-domain approaches examine pixel-level artifacts such as texture, edges, lighting, and color distribution. Yang et al. [14] analyzed head and facial posture differences. Matern et al. [15] observed artifacts in the eye and teeth regions. Li et al. [16] detected blending boundaries. Nguyen et al. [17] employed heatmap visualization and self-consistency regression.

Recent multi-feature integration efforts have been proposed to enhance representation learning and improve generalization. Zhou et al. [18] and Chen et al. [19] proposed dual-stream architectures. Chen et al. [20] combined RGB and frequency features. Bonettini et al. [21] leveraged multiple convolutional neural networks (CNNs) for cross-dataset adaptability. Yu et al. [22] explored common and specific forgery features. Gao et al. [23] used texture and artifact features. Yan et al. [24], Xu et al. [25], and Lin et al. [26] incorporated more than two feature types. Despite these advances, most methods do not explicitly model the relationships among heterogeneous feature representations or fully exploit cross-channel dependencies.

Existing deepfake detectors still face generalization limitations in cross-dataset scenarios, since forgery cues learned from one dataset may be affected by manipulation methods, compression levels, image quality, and acquisition conditions. Local representations are important for capturing subtle differences between real and fake face images, but they may be sensitive to noise when lacking global semantic context. In contrast, global representations provide coarse-grained semantic information and improve robustness to data distribution variations, but they may overlook fine-grained forgery details. Although hybrid CNN-Transformer methods can capture both types of information, their interaction is often treated as simple feature aggregation, which may not sufficiently preserve complementary local and global representations. Therefore, the present study aims to construct a channel-aware local–global representation learning network that integrates local representations, global representations, input features, and channel-wise feature interaction for more stable cross-dataset deepfake detection.

The present study proposes a channel-aware local–global representation learning network for generalizable deepfake detection, achieving improved cross-dataset generalization. The main contributions are as follows:

(a) A channel-aware local–global representation learning network is proposed for generalizable deepfake detection. The network jointly learns local and global representations and enhances channel-wise feature interaction to improve the discriminability of forgery-related representations across datasets.

(b) A Local–Global Integration Vision Transformer (LGI-ViT) block is designed to learn and integrate local representations, global representations, and input features. The block uses coordinate convolution to strengthen the spatial perception of local differences and adopts Transformer layers to model long-range feature interactions.

(c) A Lightweight Channel Attention Network (LCAN) block is introduced to improve cross-channel feature interaction in the lightweight backbone. By incorporating channel attention into the inverted residual structure, LCAN helps emphasize informative channels and suppress less relevant channel responses with limited structural complexity.

2. Method

This section is organized as follows: Section 2.1 introduces the overall detection network. Section 2.2 details the LGI-ViT block for local–global representation learning within the network. Section 2.3 explains the LCAN block.

2.1. Overall Detection Network

We propose a channel-aware local–global representation learning network by jointly modeling local and global representations. Inspired by MobileViT [27], which effectively captures both local and global representations, we adopt it as the backbone. On this basis, we design the LGI-ViT and LCAN blocks to boost generalization capability. The overall architecture is illustrated in Figure 1. The LGI-ViT and LCAN blocks are shown in blue and green, respectively, while standard convolution layers are depicted in yellow. Downsampling operations (stride = 2) are indicated by downward arrows. An input image is first passed through a 3 × 3 convolution layer to extract initial features, reducing spatial resolution by half. Subsequently, features are processed through alternating LCAN and LGI-ViT blocks. After multiple downsampling stages, the features are fed into a 1 × 1 convolution, followed by global average pooling and a fully connected layer to produce a binary classification output.

The LGI-ViT block learns local and global representations, where local representations are extracted via a convolutional neural network (CNN) and global representations are modeled via stacked Transformer layers, and further integrates them with input features. The LCAN block enhances channel-wise feature discrimination by emphasizing informative channels and suppressing irrelevant ones through channel attention.

2.2. LGI-ViT Block for Local–Global Representation Learning

The LGI-ViT block is the core module of the proposed network, where local and global representations are learned and further integrated with input features. In the LGI-ViT block, local and global representations are learned through feature extraction and feature interaction at different levels. Specifically, a CNN is used to extract local features. Through convolution operations, the CNN learns hierarchical feature representations that progress from fine-grained details to semantic concepts. Under this architecture, features are extracted from spatially adjacent regions, producing feature maps enriched with local information. For global feature extraction, we adopt the paradigm from MobileViT. L Transformer layers model long-range dependencies across different spatial regions of the image. This enables the extraction of feature maps enriched with global context. The self-attention mechanism enables interactions with features from all image patches, thereby supporting comprehensive global modeling.

The backbone network is MobileViT [27]. Within this backbone, we modify the original MobileViT block, inspired by the design principles of MobileViTv3 [28], to construct the LGI-ViT block, which integrates local and global feature representations for generalizable deepfake detection. To better exploit the complementary properties of these two types of features, an adaptive channel-wise weight α is generated to dynamically balance the contributions of local and global features. The weight α ∈ R^{C×1×1} is obtained through a lightweight gating network. Specifically, the local features are weighted by α, while the global features are weighted by 1 − α, and the two are then combined through channel-wise addition. Let F_local denote the local features, and F_global denote the global features. The fused features F_fused can be formulated as

F_{\text{fused}} = \alpha \odot F_{\text{local}} + (1 - \alpha) \odot F_{\text{global}} \qquad (1)

where ⊙ denotes element-wise multiplication with channel-wise broadcasting.

The weight α is generated by a lightweight sub-network consisting of an adaptive average pooling layer, a 1 × 1 convolution layer, a ReLU activation function, another 1 × 1 convolution layer, and a Sigmoid activation function. The generation process of α is formulated as

\alpha = \sigma\left(W_{2} \cdot \phi\left(W_{1} \cdot \text{GAP}\left([F_{\text{local}}, F_{\text{global}}]\right)\right)\right) \qquad (2)

where W₁ and W₂ are learnable weight matrices, σ denotes the Sigmoid function, ϕ denotes the ReLU function, and GAP denotes global average pooling.

Inspired by the idea of residual networks [29], the fused features are further integrated with the input features through element-wise addition, so as to obtain enhanced feature representations with stronger local–global complementarity. As shown in Figure 1, before residual addition, a 1 × 1 convolution is applied to F_fused to match the channel dimension of F_in. The output feature F_out is defined as

F_{\text{out}} = F_{\text{in}} + \text{Conv}(F_{\text{fused}}) \qquad (3)

where F_in denotes the input features, and F_out denotes the output features.
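Equations (1) to (3) can be traced end to end on a toy example. The following is a minimal sketch in plain Python, not the authors' implementation: feature maps are nested lists of shape [C][H][W], the bracket in Equation (2) is assumed to mean channel-wise concatenation, the hidden width of the gating network is an arbitrary choice, and all function names are hypothetical.

```python
import math

def global_avg_pool(feat):
    """Average each channel of a [C][H][W] nested list down to one scalar."""
    return [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feat]

def matvec(weight, vec):
    """A 1x1 convolution on a pooled vector is a matrix-vector product."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]

def gate(f_local, f_global, w1, w2):
    """Eq. (2): alpha = sigmoid(W2 . relu(W1 . GAP([F_local, F_global])))."""
    pooled = global_avg_pool(f_local) + global_avg_pool(f_global)  # assumed concat, length 2C
    hidden = [max(0.0, h) for h in matvec(w1, pooled)]              # ReLU
    return [1.0 / (1.0 + math.exp(-z)) for z in matvec(w2, hidden)]  # Sigmoid, length C

def fuse(f_local, f_global, alpha):
    """Eq. (1): per-channel convex combination of local and global features."""
    return [[[a * l + (1.0 - a) * g for l, g in zip(lrow, grow)]
             for lrow, grow in zip(lch, gch)]
            for a, lch, gch in zip(alpha, f_local, f_global)]

def pointwise_conv(feat, weight):
    """1x1 convolution: mix channels independently at every pixel."""
    c_in, h, w = len(feat), len(feat[0]), len(feat[0][0])
    return [[[sum(weight[o][i] * feat[i][y][x] for i in range(c_in))
              for x in range(w)] for y in range(h)]
            for o in range(len(weight))]

def lgi_output(f_in, f_local, f_global, w1, w2, w_proj):
    """Eq. (3): F_out = F_in + Conv(F_fused)."""
    alpha = gate(f_local, f_global, w1, w2)
    projected = pointwise_conv(fuse(f_local, f_global, alpha), w_proj)
    return [[[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(c1, c2)]
            for c1, c2 in zip(f_in, projected)]

# Toy example: C = 2 channels, 2x2 maps. Zero gate weights give alpha = 0.5.
f_local = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]]
f_global = [[[3.0, 3.0], [3.0, 3.0]], [[4.0, 4.0], [4.0, 4.0]]]
w1 = [[0.0] * 4 for _ in range(2)]   # 2C -> hidden width 2 (assumed)
w2 = [[0.0] * 2 for _ in range(2)]   # hidden -> C
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for the 1x1 projection
alpha = gate(f_local, f_global, w1, w2)
fused = fuse(f_local, f_global, alpha)
f_out = lgi_output(f_local, f_local, f_global, w1, w2, identity)
```

With zero gate weights the two branches are averaged; a trained gate would instead shift each channel toward whichever branch is more informative for that channel.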

In addition, the original 3 × 3 convolution layer in the MobileViT block is removed, which simplifies the post-integration learning process. Compared with the original MobileViT block, the proposed design learns local and global representations and further integrates them with input features; because the 3 × 3 convolution layer is removed, the total number of parameters remains relatively low despite this additional integration.

2.2.1. Local Feature Extraction

Within the proposed LGI-ViT block, a Lightweight Local Feature Augmentation Network (LLFAN) is designed to extract local features. Its structure is shown in Figure 2. It consists of a depthwise convolution layer, a coordinate convolution layer, a group convolution network, and a convolutional feedforward network.

First, a depthwise convolution (DWConv) layer encodes local spatial information. It performs convolution independently on each channel, reducing the network’s parameter count. To enhance the network’s spatial perception capability for local artifacts in deepfakes, a coordinate convolution (CoordConv) layer [30] is introduced to record the positional information of each pixel. A standard convolution is translation-equivariant: shifting the input shifts the output accordingly. CoordConv adds coordinate channels to provide explicit spatial information. If the coordinate channels do not carry useful information, the layer behaves similarly to standard convolution, roughly preserving translation equivariance. When the coordinate channels capture spatial information, the layer becomes position-sensitive, enabling the detection of location-dependent forgery artifacts while weakening strict translation equivariance. Therefore, coordinate convolution enables the network to learn spatially aware representations, improving generalization to some extent.
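The coordinate channels described above can be illustrated with a short sketch. It assumes the common CoordConv formulation in which two extra channels hold the row and column index of every pixel, scaled to [-1, 1]; the scaling range and the helper name `add_coord_channels` are assumptions, not details given in this paper.

```python
def add_coord_channels(feat):
    """Append normalized y- and x-coordinate channels to a [C][H][W] feature map."""
    h, w = len(feat[0]), len(feat[0][0])

    def norm(i, n):
        # Map index i in [0, n-1] to [-1, 1]; a size-1 axis gets coordinate 0.
        return 2.0 * i / (n - 1) - 1.0 if n > 1 else 0.0

    y_chan = [[norm(y, h) for _ in range(w)] for y in range(h)]
    x_chan = [[norm(x, w) for x in range(w)] for _ in range(h)]
    # A convolution applied to the result sees position alongside appearance.
    return feat + [y_chan, x_chan]

# One input channel of size 2x3 becomes three channels.
feat = [[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]]
out = add_coord_channels(feat)
```

If the following convolution learns zero weights for the two added channels, it reduces to an ordinary convolution, which matches the behaviour described in the paragraph above.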

Subsequently, the feature channels are divided into four parts, with one part fed into the group convolution network [31]. In group convolution, the feature maps are first partitioned into groups, each processed separately. As shown in the upper part of Figure 3a, for 12 input channels and 6 convolution kernels, the feature maps are evenly divided into 3 groups of 4 channels each; the kernels are also divided into 3 groups with 2 kernels per group, and each group processes its corresponding 4 input channels to produce 2 output feature maps. Therefore, the three groups produce 6 output feature maps in total. Compared with standard convolution, group convolution has fewer parameters and lower computational cost, effectively reducing network complexity.
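The parameter saving in this example can be checked with a little arithmetic. The sketch below counts convolution weights for the 12-channel, 6-kernel, 3-group case; the 3 × 3 kernel size and the omission of bias terms are assumptions made for illustration, and `conv_weight_count` is a hypothetical helper.

```python
def conv_weight_count(c_in, c_out, kernel_size, groups=1):
    """Number of weights in a 2D convolution layer, ignoring biases."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    # Each output kernel only spans the c_in / groups channels of its own group.
    return (c_in // groups) * c_out * kernel_size * kernel_size

standard = conv_weight_count(12, 6, 3)            # 12 * 6 * 9 = 648
grouped = conv_weight_count(12, 6, 3, groups=3)   # 4 * 6 * 9 = 216
```

With g groups the weight count drops by a factor of g, here from 648 to 216.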

Although group convolution reduces parameters, each output channel connects only to a subset of input channels. As shown in the lower part of Figure 3a, in two stacked group convolutions without channel shuffle, each group’s output depends only on the corresponding input group, limiting information flow and representation. Therefore, channel shuffle is introduced to enhance feature learning. Figure 3b shows that rearranging channels within each group promotes interaction across feature channels; Figure 3c further illustrates the shuffled channel connections, enhancing cross-group information flow and network representation [32].
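A channel shuffle can be written as a reshape, transpose, and flatten over the channel axis. The sketch below assumes this ShuffleNet-style permutation; the exact rearrangement drawn in Figure 3 is not specified in the text, and `channel_shuffle` is a hypothetical helper operating on a flat list of channels.

```python
def channel_shuffle(channels, groups):
    """Interleave channels so each new group holds one channel from every old group."""
    n = len(channels)
    if n % groups:
        raise ValueError("number of channels must be divisible by groups")
    per_group = n // groups
    # View as (groups, per_group), transpose to (per_group, groups), then flatten.
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]

# Six channels in three groups: [0, 1 | 2, 3 | 4, 5]
shuffled = channel_shuffle([0, 1, 2, 3, 4, 5], groups=3)
```

After the shuffle, consecutive channels come from different original groups, so a second group convolution receives inputs that mix information across those groups.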

By combining group convolution with channel shuffle, the LLFAN block enables efficient feature extraction while promoting information interaction across channel groups. The group convolution output is fused with the remaining original features and fed into a convolutional feedforward network. This network, similar to the feedforward network after multi-head attention in ViT, uses 1 × 1 convolutions instead of fully connected layers, enhancing correlations among feature maps and improving feature representation and integration of outputs from different information sources.

2.2.2. Global Feature Extraction

For global feature extraction, the proposed method aims to model long-range dependencies while preserving a large receptive field; the Global Representations part of Figure 1 illustrates this process. To enable the LGI-ViT block to learn global features with spatial inductive bias, stacked Transformer layers are used for feature extraction, with the number of layers L set to 2. Let X_L denote the local features output by the LLFAN. To obtain global features, X_L is unfolded into N non-overlapping flattened patches X_U ∈ R^{P×N×d}. Here, P = w × h is the number of pixels in each patch, N = HW/P is the number of patches, and h ≤ n and w ≤ n are the height and width of a patch, respectively. For each intra-patch position p ∈ {1, 2, …, P}, the relationships among the N patches are encoded by the stacked Transformer layers. The global interaction feature X_G ∈ R^{P×N×d} is formulated as

X_{G}(p) = \text{Transformer}(X_{U}(p)), \quad 1 \leq p \leq P \qquad (4)

Subsequently, X_G ∈ R^{P×N×d} is folded back to the original spatial dimensions to obtain the feature map X_F ∈ R^{H×W×d}. X_F is then projected into a lower C-dimensional space via pointwise convolution. Since X_U(p) encodes local information from an n × n region through convolution, and X_G(p) models global dependencies across the N patches at position p, each pixel in X_G can aggregate information from all pixels in the input tensor X. The stacked Transformer layers preserve both the order of image patches and the spatial structure of pixels within each patch. Therefore, the effective receptive field of the LGI-ViT block covers the entire spatial dimension H × W.
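The unfold and fold steps around Equation (4) are index rearrangements, which a small sketch can make concrete. It works on a single [H][W] grid whose entries stand in for d-dimensional feature vectors, and it omits the Transformer itself; `unfold` and `fold` are hypothetical helper names.

```python
def unfold(x, ph, pw):
    """Split an [H][W] grid into non-overlapping ph x pw patches.

    Returns P = ph * pw lists, one per intra-patch position p,
    each holding N = H * W / P entries, one per patch.
    """
    H, W = len(x), len(x[0])
    if H % ph or W % pw:
        raise ValueError("grid size must be divisible by patch size")
    return [[x[by + dy][bx + dx]
             for by in range(0, H, ph)
             for bx in range(0, W, pw)]
            for dy in range(ph)
            for dx in range(pw)]

def fold(xu, H, W, ph, pw):
    """Inverse of unfold: place every entry back at its spatial location."""
    x = [[None] * W for _ in range(H)]
    p = 0
    for dy in range(ph):
        for dx in range(pw):
            n = 0
            for by in range(0, H, ph):
                for bx in range(0, W, pw):
                    x[by + dy][bx + dx] = xu[p][n]
                    n += 1
            p += 1
    return x

# 4x4 grid with 2x2 patches: P = 4 positions, N = 4 patches.
grid = [[r * 4 + c for c in range(4)] for r in range(4)]
xu = unfold(grid, 2, 2)
restored = fold(xu, 4, 4, 2, 2)
```

Each list `xu[p]` gathers the pixel at the same intra-patch position from every patch, which is the sequence that the Transformer in Equation (4) would attend over; folding afterwards restores the spatial layout unchanged.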

2.3. LCAN Block

In the original MobileViT network structure, the MV2 module adopts the inverted residual structure of MobileNetV2 [33], which is built upon depthwise separable convolutions. However, depthwise convolution operates independently on each channel of the input feature map, which prevents the model from capturing inter-channel dependencies. Thus, features from important channels may not be effectively emphasized, while interference from noisy channels cannot be adequately suppressed. To enhance cross-channel feature interaction, we design the LCAN block by incorporating channel attention into this structure. As illustrated in Figure 4, channel attention comprises three operations: squeeze, excitation, and recalibration [34].

The LCAN block first applies a 1 × 1 pointwise convolution to the input features, expanding the channels to a higher-dimensional space to enhance nonlinear expressive capacity. To reduce computational cost and avoid cross-channel parameter redundancy, a 3 × 3 depthwise convolution operation filters the expanded features, processing each channel independently. After filtering, the features are passed through a channel attention module. Finally, another 1 × 1 pointwise convolution compresses the channel dimension back to its original size. To minimize feature information loss, a linear activation function is used in this final pointwise convolution instead of ReLU.

By incorporating channel attention into the inverted-residual-style structure, the LCAN block compensates for the limited channel communication caused by depthwise convolution. Specifically, the LCAN block estimates channel-wise importance from aggregated feature responses and recalibrates the expanded feature maps before the final pointwise projection. In this way, informative channels can be strengthened while less relevant channels are suppressed, with limited additional parameters.
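The squeeze, excitation, and recalibration steps can be sketched as follows. This is a generic squeeze-and-excitation style illustration in plain Python, not the exact LCAN configuration: the bottleneck width, the weight values, and the function name `channel_attention` are assumptions.

```python
import math

def channel_attention(feat, w_reduce, w_expand):
    """Apply squeeze, excitation, and recalibration to a [C][H][W] feature map."""
    # Squeeze: global average pooling gives one descriptor per channel.
    squeezed = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feat]
    # Excitation: bottleneck layer with ReLU, then a layer with Sigmoid.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed)))
              for row in w_reduce]
    scale = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w_expand]
    # Recalibration: rescale every channel by its importance weight.
    out = [[[s * v for v in row] for row in ch] for s, ch in zip(scale, feat)]
    return out, scale

# Two channels, bottleneck width 1 (assumed).
feat = [[[1.0, 1.0], [1.0, 1.0]], [[3.0, 3.0], [3.0, 3.0]]]
w_reduce = [[1.0, 1.0]]      # C -> 1
w_expand = [[0.0], [1.0]]    # 1 -> C
recalibrated, scale = channel_attention(feat, w_reduce, w_expand)
```

In this toy setting the first channel is scaled by 0.5 and the second by a value close to 1, showing how the block can suppress one channel while keeping another almost intact.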

This design is practically useful for generalizable deepfake detection because manipulation traces are often subtle, weakly distributed, and easily affected by image quality or compression. LCAN provides additional channel dependency modeling without introducing a heavy spatial attention branch or complex global context module. Therefore, it can be integrated into the lightweight backbone while maintaining a balance among discriminative feature learning, structural simplicity, and computational efficiency. Such a design helps the network produce more focused responses to manipulation-related facial regions.

The LCAN block designed in this study contains only 784 parameters, making it extremely lightweight. Owing to the minimal parameter count of LCAN, the total network parameters amount to merely 0.34 M, which is far lower than those of common deepfake detection networks such as Xception or EfficientNet-B4, whose parameters typically range in the tens of millions. This design ensures that the model maintains competitive detection performance while achieving very low computational and memory overhead.

3. Experiments and Analysis

To evaluate the generalization capability of the proposed detection network, we designed and conducted a series of experiments. Section 3.1 introduces the experimental setup. Section 3.2 introduces the evaluation metrics. Section 3.3 and Section 3.4 evaluate the performance of the LGI-ViT and LCAN blocks, respectively. Section 3.5 presents generalization performance. Section 3.6 describes ablation studies.

3.1. Experimental Setup

To evaluate generalization, we followed prior studies: the network was trained on FF++ [35] and tested on DFDC [36] and Celeb-DF [37]. FF++, DFDC, and Celeb-DF differ in data source, manipulation quality, compression settings, and artifact distribution. FF++ contains multiple manipulation methods and compression levels, DFDC includes more diverse real-world recording conditions, and Celeb-DF provides higher-quality deepfake videos with fewer obvious visual artifacts. These differences make cross-dataset evaluation important for assessing whether the model generalizes beyond dataset-specific cues. The FF++ dataset provides three versions with different compression levels: raw, C23, and C40. Given that forgery methods often involve post-processing operations to degrade image quality and improve concealment, we used the moderately compressed C23 version for training. All datasets were processed at the frame level after face detection and cropping. The training and testing splits followed the official or commonly used video-level protocols to avoid frame-level leakage. For each video, frames were uniformly sampled and resized to 224 × 224.

During training, we employed the AdamW optimizer, and the binary cross-entropy loss is defined as

L_{\text{CE}} = -\left[ y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \right] \qquad (5)

where \hat{y} denotes the predicted probability of the fake class, and y = 1 indicates a manipulated facial image, while y = 0 indicates a real one.
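As a quick numerical illustration of Equation (5), the sketch below evaluates the binary cross-entropy for single predictions and for a small batch. The clamping constant `eps` is an assumption added to keep the logarithm finite, and the function names are hypothetical.

```python
import math

def bce_loss(y, y_hat, eps=1e-7):
    """Binary cross-entropy for one sample; y is 1 for fake and 0 for real."""
    # Clamp the prediction so that log(0) never occurs (assumed safeguard).
    y_hat = min(max(y_hat, eps), 1.0 - eps)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))

def mean_bce(labels, probs):
    """Average the per-sample loss over a batch."""
    return sum(bce_loss(y, p) for y, p in zip(labels, probs)) / len(labels)

confident_right = bce_loss(1, 0.9)   # fake image, high fake probability
confident_wrong = bce_loss(1, 0.1)   # fake image, low fake probability
batch_loss = mean_bce([1, 0, 1, 0], [0.9, 0.2, 0.6, 0.4])
```

The loss is small when the predicted probability agrees with the label and grows quickly for confident mistakes, which is what drives the classifier toward calibrated fake probabilities.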

The initial learning rate was set to 0.001, with a linear warm-up for the first 10 steps, followed by cosine annealing decay to 1 × 10⁻⁶. To achieve a trade-off between GPU memory usage and training stability, the batch size was set to 16, and the number of training epochs was set to 50. To mitigate overfitting, we applied random horizontal flipping and CutMix for data augmentation and set the dropout rate to 0.2. All experiments were conducted using Python 3.11 (Python Software Foundation, Wilmington, DE, USA), PyTorch 2.5.0 (PyTorch Foundation, Linux Foundation, Wilmington, DE, USA), and CUDA 11.8 (NVIDIA Corporation, Santa Clara, CA, USA). The experiments were performed on an NVIDIA GeForce RTX 4090 GPU (NVIDIA Corporation, Santa Clara, CA, USA). All reported experimental results correspond to the average performance over three independent training runs to account for the inherent randomness in deep learning optimization.
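The learning-rate schedule described above can be sketched as a function of the step index. The text does not say whether the cosine phase is updated per iteration or per epoch, nor how the warm-up starts, so the per-step formulation, the warm-up starting at one tenth of the base rate, and the name `learning_rate` are assumptions.

```python
import math

def learning_rate(step, total_steps, base_lr=1e-3, warmup_steps=10, min_lr=1e-6):
    """Linear warm-up followed by cosine annealing down to min_lr."""
    if step < warmup_steps:
        # Ramp linearly so that the last warm-up step reaches base_lr.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

total = 1000  # hypothetical number of optimizer steps
schedule = [learning_rate(s, total) for s in range(total + 1)]
```

The schedule rises to 0.001 within the first 10 steps and then decays smoothly, reaching 1 × 10⁻⁶ at the final step.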

3.2. Evaluation Metrics

In this paper, we use accuracy (ACC) and Area Under the Curve (AUC) as evaluation metrics. ACC represents the proportion of correctly classified samples among all test samples. AUC refers to the area under the Receiver Operating Characteristic (ROC) curve. It represents the comprehensive performance of a classifier across various thresholds, thereby avoiding the limitations of evaluation based on a single threshold. The ROC curve is a tool for evaluating classifier performance. It plots the True-Positive Rate (TPR) against the False-Positive Rate (FPR) as the classification threshold varies. The AUC is calculated as follows:

\text{AUC} = \int_{0}^{1} \text{TPR}(\text{FPR}) \, d(\text{FPR}) \qquad (6)

where TPR, also known as recall, indicates the proportion of actual fake face samples correctly identified as fake by the model. The TPR is calculated as follows:

\text{TPR} = \frac{\text{TP}}{\text{TP} + \text{FN}} \qquad (7)

where TP (True Positive) denotes the number of deepfakes correctly identified as fake, and FN (False Negative) denotes the number of deepfakes incorrectly identified as real.

FPR represents the proportion of real samples incorrectly identified as fake by the model among all real samples. It is calculated as follows:

\text{FPR} = \frac{\text{FP}}{\text{FP} + \text{TN}} \qquad (8)

where FP (False Positive) denotes the number of real samples incorrectly identified as fake, and TN (True Negative) denotes the number of real samples correctly identified as real.
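Equations (6) to (8) can be combined into a small AUC computation. The sketch below sweeps the decision threshold over the distinct scores, records the (FPR, TPR) pairs, and approximates the integral with the trapezoidal rule; the helper names and the toy scores are illustrative assumptions, with label 1 denoting a fake sample.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs as the threshold moves from high to low."""
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present")
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples sharing one score cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1   # fake correctly flagged as fake
            else:
                fp += 1   # real wrongly flagged as fake
            i += 1
        points.append((fp / neg, tp / pos))   # Eq. (8) and Eq. (7)
    return points

def auc(points):
    """Trapezoidal approximation of Eq. (6)."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
curve = roc_points(labels, scores)
area = auc(curve)
```

In this toy case one real sample scores higher than one fake sample, so three of the four real-fake pairs are ranked correctly and the AUC is 0.75.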

3.3. Evaluation of the Local–Global Integration Vision Transformer Block

(1) Evaluation of Coordinate Convolution

To investigate the contribution of coordinate convolution to the performance improvement of the LGI-ViT block, we compared it with two recently proposed convolutional operations: depthwise separable convolution (DSConv) [38] and dual convolution kernel convolution (DualConv) [39]. In this experiment, we replaced the coordinate convolution in the LGI-ViT block with DSConv and DualConv, respectively, while keeping all other structures unchanged. The comparative results are presented in Table 1.

As shown in Table 1, different convolutions lead to noticeable variations in ACC and AUC. CoordConv achieves the best detection performance across the FF++, DFDC, and Celeb-DF datasets. Specifically, it achieves ACC improvements of 4.78 and 5.58 percentage points on average, and AUC improvements of 4.79 and 5.31 percentage points on average, respectively, over DSConv and DualConv. The results suggest that introducing coordinate information may help the network capture spatially related artifact cues under the current experimental setting.

(2) Evaluation of Different Integration Strategies

To validate the effectiveness of the proposed local–global representation learning network, we conducted experiments using different integration strategies within the LGI-ViT block. The results are shown in Table 2. In Table 2, “Without Integration” refers to the configuration in which the local features, global features, and input features are not integrated; in this case, both the red and yellow dashed lines in the LGI-ViT block structure shown in Figure 1 are removed. “Local–Global Integration” refers to integrating only the local and global features, in which the yellow dashed line in the LGI-ViT block is removed. The results indicate that the Without Integration strategy yields lower cross-dataset performance on DFDC and Celeb-DF, whereas the Local–Global Integration strategy improves generalization but reduces in-dataset performance on FF++. This suggests that the local features extracted by the proposed LLFAN contribute to improved generalization in deepfake detection. Although the Local–Global Integration strategy performs slightly better than the Local–Global Integration with Input Residual strategy on the Celeb-DF dataset in AUC, it results in lower ACC and AUC on the FF++ and DFDC datasets. The residual integration strategy slightly improves DFDC performance while showing a small decrease on Celeb-DF. This may be because DFDC contains more complex variations in lighting, resolution, background, and compression; the input residual helps preserve basic visual information and reduces information loss during local–global fusion. On Celeb-DF, the AUC slightly decreases, possibly because forgery traces in this dataset rely more on fused representations, and additional input features may introduce slight redundancy. In contrast, the Local–Global Integration with Input Residual strategy demonstrates more pronounced detection advantages on both FF++ and DFDC.

3.4. Evaluation of the Lightweight Channel Attention Network Block

To assess the contribution of channel attention to detection performance, we compared the proposed LCAN block with two recent attention mechanisms: Efficient Channel Attention (ECA) [40] and Global Context Attention (GCA) [41]. In this experiment, we replaced the channel attention within the LCAN block with ECA and GCA, respectively, while keeping all other structures unchanged. The experimental results are presented in Table 3. As shown in Table 3, the proposed LCAN block achieves the best generalization detection performance among the three attention mechanisms. On the FF++, DFDC, and Celeb-DF datasets, the proposed method achieves ACC and AUC improvements of 4.25 and 3.59 percentage points on average over ECA, and 8.69 and 9.10 percentage points on average over GCA, respectively.

To visualize the effect of the LCAN block in the deepfake detection process, we employed Grad-CAM [42] to generate heatmaps for analysis. The results are shown in Figure 5. Since the LCAN block introduces channel attention into the MV2 module, we compare the heatmaps produced by the original MV2 module and the proposed LCAN block. Warmer colors indicate regions where the network responds more strongly to forgery-related features. The visualization shows that when using the MV2 module, the network’s response is concentrated on relatively small regions, with the largest response area observed on the FF++ dataset and the smallest on the DFDC dataset. In contrast, with the LCAN block, the network exhibits larger response regions on the FF++ dataset and significantly enhanced responses on both the DFDC and Celeb-DF datasets. The Grad-CAM visualization indicates that LCAN is associated with broader responses in regions that may contain forgery-related cues.

3.5. Generalization Performance Evaluation

We compared the proposed method with 11 state-of-the-art detection models, including Xception [35], EfficientB4 [43], F3Net [6], X-ray [16], B4ATT [21], SRM [44], Yu [22], UCF [24], MCX-API [25], Lin [26], and TAD [23]. The evaluation metric is AUC, and the results are presented in Table 4. Among these, Xception is a widely used baseline in deepfake detection. Xception, EfficientB4, and X-ray utilize single-feature detection, while the remaining methods leverage multi-feature representation learning strategies. The results for the comparison methods are sourced from [22,23,26,45]. Since the results of competing methods are collected from published benchmarks, minor differences in preprocessing protocols may exist. Therefore, Table 4 is used mainly to provide a reference comparison under commonly adopted cross-dataset settings.

As shown in Table 4, the proposed method achieves an average AUC of 83.31% across FF++, DFDC, and Celeb-DF, surpassing the second-best model (UCF) by 1.01 percentage points. On the FF++ dataset, it attains an AUC of 96.56%, closely following Yu (98.38%) and Lin (98.28%). On DFDC and Celeb-DF, it achieves the highest AUCs of 73.95% and 79.42%, respectively. The lower AUC on DFDC is attributed to greater diversity in lighting, resolution, occlusion, and background, which better reflects real-world scenarios and increases the difficulty of generalization.

Among the baseline methods, X-ray, Lin, and TAD achieve high AUC on FF++ but degrade substantially on DFDC and Celeb-DF. For example, X-ray drops from 95.92% on FF++ to 63.26% on DFDC; Lin (98.28% to 61.47%) and TAD (93.32% to 56.40%) exhibit similar declines. In contrast, the proposed method reaches AUCs of 73.95% on DFDC and 79.42% on Celeb-DF, outperforming all compared approaches on both cross-dataset benchmarks. The improvement may be attributed to enhanced feature extraction: by jointly learning local and global representations and integrating them with input features, the model captures more discriminative multi-level forgery representations. The LCAN block further leverages inter-channel dependencies to focus on manipulated facial regions.

In summary, the proposed channel-aware local–global representation learning network for deepfake detection exhibits competitive performance in both in-dataset and cross-dataset evaluations. These results suggest its potential for practical deepfake detection scenarios.

3.6. Ablation Study

To further investigate the individual contributions of the proposed LGI-ViT and LCAN blocks, we conducted ablation studies using MobileViT as the backbone network. Performance was evaluated by replacing the corresponding blocks in MobileViT with the LGI-ViT block and the LCAN block, either individually or in combination. The experimental results are presented in Table 5.

As shown in Table 5, both the LGI-ViT and LCAN blocks improve the generalization performance of the model. The LGI-ViT block alone yields the larger improvement on most metrics, suggesting that local–global representation integration and the introduction of coordinate convolution have a greater impact on performance than the LCAN block alone. The exception is the DFDC AUC, where the LCAN block alone is marginally higher (71.69% vs. 71.40%). The last row of Table 5 shows that combining both blocks gives the best result on every metric.
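The per-block contributions can be read off Table 5 as AUC gains over the backbone. The sketch below copies the AUC column from the table; the names `TABLE5_AUC` and `auc_gain` are illustrative only.

```python
# AUC values (%) copied from Table 5, keyed by scheme and dataset.
TABLE5_AUC = {
    "Backbone": {"FF++": 95.73, "DFDC": 59.95, "Celeb-DF": 62.84},
    "Backbone + LGI-ViT": {"FF++": 96.12, "DFDC": 71.40, "Celeb-DF": 75.32},
    "Backbone + LCAN": {"FF++": 95.95, "DFDC": 71.69, "Celeb-DF": 74.53},
    "Backbone + LCAN + LGI-ViT": {"FF++": 96.56, "DFDC": 73.95, "Celeb-DF": 79.42},
}


def auc_gain(scheme, dataset):
    """AUC improvement of a scheme over the plain backbone, in percentage points."""
    return round(TABLE5_AUC[scheme][dataset] - TABLE5_AUC["Backbone"][dataset], 2)


gains = {
    scheme: {dataset: auc_gain(scheme, dataset) for dataset in scores}
    for scheme, scores in TABLE5_AUC.items()
    if scheme != "Backbone"
}

for scheme, per_dataset in gains.items():
    print(scheme, per_dataset)
```

The gains are small on FF++ (under 1 percentage point) and large on the two cross-dataset benchmarks, which is where the ablation separates the variants.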

Next, t-distributed stochastic neighbor embedding (t-SNE) [46] was used to visualize the learned feature representations. t-SNE is a nonlinear dimensionality reduction technique that maps high-dimensional data to a low-dimensional space while preserving local similarities between data points. For each dataset, 2000 samples were selected in a class-balanced manner (1000 real and 1000 fake) to avoid visual bias caused by class imbalance.

As illustrated in Figure 6, compared with the original MobileViT backbone, the proposed method provides clearer visual separation between real and fake samples. On the Celeb-DF dataset, the proposed method separates real and fake samples clearly, although a small number of real samples (red points) remain dispersed within the fake cluster, indicating some residual confusion. Nevertheless, the separation is visibly improved over the original backbone. On the DFDC dataset, real images form multiple isolated clusters in the feature space rather than a single coherent group. This distribution suggests that the model’s representation of real samples is not yet fully compact, potentially leading to ambiguous decision boundaries when evaluating new samples. This observation is consistent with the lower detection accuracy on DFDC compared to Celeb-DF.

Overall, the proposed method achieves clearer separation of real and fake samples than the backbone across all three datasets. This suggests that the learned representations are more discriminative and more consistent across datasets, although the DFDC result shows that the separation is not yet uniform.
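The class-balanced selection used for the t-SNE plots can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: the paper does not describe how samples were drawn, so uniform random sampling with a fixed seed is assumed, and the function name `balanced_sample` and the toy label list are hypothetical.

```python
import random


def balanced_sample(labels, per_class, seed=0):
    """Return shuffled indices containing `per_class` items of each label."""
    rng = random.Random(seed)

    # Group sample indices by their class label.
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    chosen = []
    for label in sorted(by_class):
        pool = by_class[label]
        if len(pool) < per_class:
            raise ValueError(f"class {label!r} has only {len(pool)} samples")
        # Draw without replacement so no sample appears twice.
        chosen.extend(rng.sample(pool, per_class))

    rng.shuffle(chosen)
    return chosen


# Toy label list (0 = real, 1 = fake), deliberately imbalanced.
labels = [0] * 1500 + [1] * 6000
subset = balanced_sample(labels, per_class=1000)
print(len(subset))
```

The feature vectors at the selected indices would then be passed to a t-SNE implementation, for example `sklearn.manifold.TSNE` in scikit-learn, to obtain the two-dimensional embedding.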

However, some false negatives and false positives still occur. False negatives mainly arise when manipulated regions are subtle, highly compressed, or partially occluded, making local artifacts less distinguishable. False positives are more likely to appear in real images with strong illumination variation, motion blur, heavy compression, or facial regions containing abnormal texture patterns. These observations suggest that, although the proposed method improves local–global representation learning, its robustness may still be affected by severe image degradation and ambiguous facial textures.

4. Conclusions

This paper proposes a channel-aware local–global representation learning network for generalizable deepfake detection. We designed an LGI-ViT block that learns local and global representations and integrates them with the input features. A local feature augmentation network enhances the spatial perception of local artifacts in fake samples, and an LCAN block improves cross-channel feature interaction. Ablation studies show that combining the proposed LGI-ViT and LCAN blocks improves the network’s generalization capability, resulting in cross-dataset AUCs of 73.95% and 79.42% on the DFDC and Celeb-DF datasets, respectively. Comparisons with 11 state-of-the-art models further support the competitive generalization performance of the proposed network.

Although the proposed method achieves promising cross-dataset performance, potential overfitting to dataset-specific artifacts, such as compression patterns, image quality, and manipulation traces, still requires further investigation. Future work will evaluate the framework on more diverse real-world deepfake content to further improve its robustness and cross-domain generalization. In addition, the current image-level framework could be extended to video-level detection by incorporating temporal information, such as inter-frame consistency and temporal dynamics. More comprehensive efficiency analysis, including floating-point operations (FLOPs) and inference speed, will also be considered to assess the practical deployment potential of the proposed method.

Author Contributions

Conceptualization, L.K. and B.C.; methodology, L.K.; software, L.K.; data curation, L.K.; writing—original draft, L.K. and P.S.; writing—review and editing, B.C. All authors have read and agreed to the published version of the manuscript.

Funding

The project is funded by the Wuxi Science and Technology Innovation Fund “Taihu Light” Science and Technology Tackling Program (K20231011), the Jiangsu Provincial Higher Vocational Colleges Engineering Technology Research and Development Center (Document No. Su Jiao Ke Han [2023] 11), and the Jiangsu Provincial University Excellent Scientific and Technological Innovation Team (Document No. Su Jiao Ke [2023] 3).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Yuan, C.; Cao, Y.; Zhou, Z.; Fu, Z.; Xia, Z.; Wu, Q.J. A Robust Dual-Pronged Proactive Defense Framework Against Deepfakes via Adversarial Semi-Fragile Watermarking. Expert Syst. Appl. 2025, 303, 130721.
2. Kaur, A.; Noori Hoshyar, A.; Saikrishna, V.; Firmin, S.; Xia, Z. Deepfake video detection: Challenges and opportunities. Artif. Intell. Rev. 2024, 57, 159.
3. Xu, F.; Wang, R.; Huang, Y.; Guo, Q.; Ma, L.; Liu, Y. Countering malicious deepfakes: Survey, battleground, and horizon. Int. J. Comput. Vis. 2022, 130, 1678–1734.
4. Yuan, C.; Guo, Q.; Zhang, S.; Zhou, Z.; Xia, Z.; Choo, K.-K.R. ProIP-D2FDM: Protecting intellectual property of deepfake fingerprint detection model on cross-material generalization. IEEE J. Sel. Top. Signal Process. 2025, 19, 1392–1405.
5. Abbas, F.; Taeihagh, A. Unmasking deepfakes: A systematic review of deepfake detection and generation techniques using artificial intelligence. Expert Syst. Appl. 2024, 252, 124260.
6. Qian, Y.; Yin, G.; Sheng, L.; Chen, Z.X.; Shao, J. Thinking in frequency: Face forgery detection by mining frequency-aware clues. In Computer Vision—ECCV 2020; Springer: Cham, Switzerland, 2020; pp. 86–103.
7. Wang, J.; Wu, Z.; Ouyang, W.; Han, X.; Chen, J.; Jiang, Y.-G.; Li, S.-N. M2TR: Multimodal multi-scale transformers for deepfake detection. In Proceedings of the 2022 International Conference on Multimedia Retrieval; Association for Computing Machinery: New York, NY, USA, 2022; pp. 615–623.
8. Tan, C.; Zhao, Y.; Wei, S.; Gu, G.; Liu, P.; Wei, Y. Frequency-Aware Deepfake Detection: Improving Generalizability Through Frequency Space Domain Learning. Proc. AAAI Conf. Artif. Intell. 2024, 38, 5052–5060.
9. Gao, J.; Xia, Z.; Marcialis, G.L.; Dang, C.; Dai, J.; Feng, X. DeepFake detection based on high-frequency enhancement network for highly compressed content. Expert Syst. Appl. 2024, 249, 123732.
10. Guera, D.; Delp, E. Deepfake Video Detection Using Recurrent Neural Networks. In Proceedings of the 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Auckland, New Zealand, 27–30 November 2018; pp. 1–6.
11. Liu, B.; Liu, B.; Ding, M.; Zhu, T.; Yu, X. TI²Net: Temporal identity inconsistency network for deepfake detection. In Proceedings of the 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–7 January 2023; pp. 4691–4700.
12. Sun, Y.; Zhang, Z.; Echizen, I.; Nguyen, H.H.; Qiu, C.; Sun, L. Face forgery detection based on facial region displacement trajectory series. In Proceedings of the 2023 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW), Waikoloa, HI, USA, 3–7 January 2023; pp. 633–642.
13. Yu, Y.; Zhao, X.; Ni, R.; Yang, S.; Zhao, Y.; Kot, A.C. Augmented multi-scale spatiotemporal inconsistency magnifier for generalized DeepFake detection. IEEE Trans. Multimed. 2023, 25, 8487–8498.
14. Yang, X.; Li, Y.; Lyu, S. Exposing deep fakes using inconsistent head poses. In Proceedings of the ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 8261–8265.
15. Matern, F.; Riess, C.; Stamminger, M. Exploiting visual artifacts to expose deepfakes and face manipulations. In Proceedings of the 2019 IEEE Winter Applications of Computer Vision Workshops (WACVW), Waikoloa, HI, USA, 7–11 January 2019; pp. 83–92.
16. Li, L.; Bao, J.; Zhang, T.; Yang, H.; Chen, D.; Wen, F.; Guo, B. Face x-ray for more general face forgery detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 1–10.
17. Nguyen, D.; Mejri, N.; Singh, I.P.; Kuleshova, P.; Astrid, M.; Kacem, A.; Ghorbel, E.; Aouada, D. LAA-Net: Localized artifact attention network for quality-agnostic and generalizable DeepFake detection. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 17395–17405.
18. Zhou, P.; Han, X.; Morariu, V.I.; Davis, L.S. Two-stream neural networks for tampered face detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, HI, USA, 21–26 July 2017; pp. 1831–1839.
19. Chen, Z.; Yang, H. Attentive semantic exploring for manipulated face detection. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 1985–1989.
20. Chen, S.; Yao, T.; Chen, Y.; Ding, S.; Li, J.; Rui, L. Local relation learning for face forgery detection. Proc. AAAI Conf. Artif. Intell. 2021, 35, 1081–1088.
21. Bonettini, N.; Cannas, E.D.; Mandelli, S.; Bondi, L.; Bestagini, P.; Tubaro, S. Video face manipulation detection through ensemble of CNNs. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 5012–5019.
22. Yu, P.; Fei, J.; Xia, Z.; Zhou, Z.; Weng, J. Improving generalization by commonality learning in face forgery detection. IEEE Trans. Inf. Forensics Secur. 2022, 17, 547–558.
23. Gao, J.; Micheletto, M.; Orrù, G.; Concas, F.; Feng, X.; Marcialis, G.L.; Roli, F. Texture and artifact decomposition for improving generalization in deep-learning-based deepfake detection. Eng. Appl. Artif. Intell. 2024, 133, 108450.
24. Yan, Z.; Zhang, Y.; Fan, Y.; Wu, B. UCF: Uncovering common features for generalizable deepfake detection. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 22412–22423.
25. Xu, Y.; Raja, K.; Verdoliva, L.; Pedersen, M. Learning pairwise interaction for generalizable deepfake detection. In Proceedings of the 2023 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW), Waikoloa, HI, USA, 3–7 January 2023; pp. 672–682.
26. Lin, L.; He, X.; Ju, Y.; Wang, X.; Ding, F.; Hu, S. Preserving fairness generalization in deepfake detection. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 16815–16825.
27. Mehta, S.; Rastegari, M. MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer. In Proceedings of the Tenth International Conference on Learning Representations, Virtually, 25–29 April 2022.
28. Wadekar, S.N.; Chaurasia, A. MobileViTv3: Mobile-friendly vision transformer with simple and effective fusion of local, global and input features. In Proceedings of the Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023.
29. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
30. Liu, R.; Lehman, J.; Molino, P.; Such, F.P.; Frank, E.; Sergeev, A.; Yosinski, J. An intriguing failing of convolutional neural networks and the coordconv solution. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018.
31. Krizhevsky, A.; Sutskever, I.; Hinton, G. Imagenet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90.
32. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6848–6856.
33. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4510–4520.
34. Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Wu, E. Squeeze-and-excitation networks. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 2011–2023.
35. Rossler, A.; Cozzolino, D.; Verdoliva, L.; Riess, C.; Thies, J.; Niessner, M. FaceForensics++: Learning to detect manipulated facial images. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1–11.
36. Dolhansky, B.; Bitton, J.; Pflaum, B.; Lu, J.; Howes, R.; Wang, M.; Ferrer, C.C. The DeepFake Detection Challenge (DFDC) dataset. arXiv 2020, arXiv:2006.07397.
37. Li, Y.; Yang, X.; Sun, P.; Qi, H.; Lyu, S. Celeb-DF: A large-scale challenging dataset for deepfake forensics. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 3204–3213.
38. Shi, P.; Lu, J.; Xu, Y.; Wang, Q.; Zhang, Y.; Kuang, L.; Chen, D.; Huang, G. A multiple convolution and bilayer acceleration model for precise and efficient early urban fire detection in complex scenarios. Eng. Appl. Artif. Intell. 2025, 162, 112555.
39. Zhong, J.; Chen, J.; Mian, A. DualConv: Dual convolutional kernels for lightweight deep neural networks. IEEE Trans. Neural Netw. Learn. Syst. 2022, 34, 9528–9535.
40. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542.
41. Cao, Y.; Xu, J.; Lin, S.; Wei, F.; Hu, H. GCNet: Non-local networks meet squeeze-excitation networks and beyond. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Republic of Korea, 27–28 October 2019; pp. 1971–1980.
42. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 618–626.
43. Tan, M.; Le, Q. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114.
44. Fridrich, J.; Kodovsky, J. Rich models for steganalysis of digital images. IEEE Trans. Inf. Forensics Secur. 2012, 7, 868–882.
45. Yan, Z.; Zhang, Y.; Yuan, X.; Lyu, S.; Wu, B. DeepfakeBench: A comprehensive benchmark of deepfake detection. In Proceedings of the Thirty-Seventh Annual Conference on Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; pp. 4534–4565.
46. Van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605.

Figure 1. Structure of the channel-aware local–global representation learning network.

Figure 2. Structure of the LLFAN block.

Figure 3. Illustration of the principles of group convolution and channel shuffle. (a) Group convolution and two stacked group convolutions without channel shuffle, where information flow is restricted within each channel group. (b) Channel shuffle operation, which rearranges feature channels among different groups to promote cross-group information exchange. (c) Shuffled channel connections after channel shuffle, which enhance cross-group feature interaction and improve representation learning.

Figure 4. Structure of the LCAN block.

Figure 5. Grad-CAM heatmaps generated by the original MV2 module and the proposed LCAN block on FF++, DFDC, and Celeb-DF. Warmer colors indicate stronger model responses.

Figure 6. t-SNE visualization of feature distributions produced by the backbone network and the proposed method on FF++, DFDC, and Celeb-DF. Red and green points denote real and fake samples, respectively.

Table 1. Comparison of different convolutions (%).

Convolution	FF++ (In-Dataset) ACC	FF++ (In-Dataset) AUC	DFDC ACC	DFDC AUC	Celeb-DF ACC	Celeb-DF AUC
DSConv	92.64	93.95	64.46	67.32	72.05	74.30
DualConv	91.35	94.12	69.98	72.79	65.41	67.08
CoordConv	94.58	96.56	72.18	73.95	76.73	79.42

Table 2. Comparison of different integration strategies (%).

Integration Strategy	FF++ (In-Dataset) ACC	FF++ (In-Dataset) AUC	DFDC ACC	DFDC AUC	Celeb-DF ACC	Celeb-DF AUC
Without Integration	94.32	96.01	68.15	71.09	71.08	73.40
Local–Global Integration	90.15	92.78	70.32	72.34	76.32	79.59
Local–Global Integration with Input Residual	94.58	96.56	72.18	73.95	76.73	79.42

Table 3. Comparison of different attention mechanisms (%).

Attention Mechanism	FF++ (In-Dataset) ACC	FF++ (In-Dataset) AUC	DFDC ACC	DFDC AUC	Celeb-DF ACC	Celeb-DF AUC
ECA	93.07	95.85	69.24	71.50	68.43	71.82
GCA	90.74	91.94	64.63	66.35	62.06	64.34
Channel Attention	94.58	96.56	72.18	73.95	76.73	79.42

Table 4. AUC comparison of different models across datasets (%).

Method	FF++ (In-Dataset)	DFDC	Celeb-DF	Average
Xception [35]	96.37	70.77	77.94	81.69
EfficientB4 [43]	95.67	69.55	79.09	81.44
F3Net [6]	96.35	70.21	77.69	81.42
X-ray [16]	95.92	63.26	70.93	76.70
B4ATT [21]	89.52	58.28	69.29	72.36
SRM [44]	95.76	69.95	79.26	81.66
Yu [22]	98.38	72.09	74.20	81.56
UCF [24]	97.05	71.91	77.93	82.30
MCX-API [25]	67.19	54.94	56.43	59.52
Lin [26]	98.28	61.47	74.42	78.06
TAD [23]	93.32	56.40	64.42	71.38
Proposed method	96.56	73.95	79.42	83.31

Table 5. Ablation study on different datasets in terms of ACC and AUC (%).

Scheme	FF++ (In-Dataset) ACC	FF++ (In-Dataset) AUC	DFDC ACC	DFDC AUC	Celeb-DF ACC	Celeb-DF AUC
Backbone	92.11	95.73	57.06	59.95	60.76	62.84
Backbone + LGI-ViT	94.09	96.12	69.34	71.40	73.08	75.32
Backbone + LCAN	93.22	95.95	68.91	71.69	72.43	74.53
Backbone + LCAN + LGI-ViT	94.58	96.56	72.18	73.95	76.73	79.42

Share and Cite

MDPI and ACS Style

Kuang, L.; Chen, B.; Shi, P. Channel-Aware Local–Global Representation Learning for Generalizable Deepfake Detection. Mathematics 2026, 14, 1913. https://doi.org/10.3390/math14111913
