ACE-Net: A Fine-Grained Deepfake Detection Model with Multimodal Emotional Consistency

Yu, Shaoqian; Chen, Xingyu; Sheng, Yuzhe; Zhang, Han; Li, Xinlong; Yu, Sijia

doi:10.3390/electronics14224420

Open AccessArticle

ACE-Net: A Fine-Grained Deepfake Detection Model with Multimodal Emotional Consistency

by

Shaoqian Yu

,

Xingyu Chen

,

Yuzhe Sheng

^*,

Han Zhang

,

Xinlong Li

and

Sijia Yu

School of Computer Science, Hunan University of Technology and Business, Changsha 410205, China

^*

Author to whom correspondence should be addressed.

Electronics 2025, 14(22), 4420; https://doi.org/10.3390/electronics14224420

Submission received: 10 October 2025 / Revised: 6 November 2025 / Accepted: 7 November 2025 / Published: 13 November 2025

(This article belongs to the Special Issue Computer Vision and Pattern Recognition Based on Machine Learning)

Download

Browse Figures

Review Reports Versions Notes

Abstract

The alarming realism of Deepfake presents a significant challenge to digital authenticity, yet its inherent difficulty in synchronizing the emotional cues between facial expressions and speech offers a critical opportunity for detection. However, most existing approaches rely on general-purpose backbones for unimodal feature extraction, resulting in an inadequate representation of fine-grained dynamic emotional expressions. Although a limited number of studies have explored cross-modal emotional consistency of deepfake detection, they typically employ shallow fusion techniques which limit latent expressiveness. To address this, we propose ACE-Net, a novel framework that identifies forgeries via multimodal emotional inconsistency. For the speech modality, we design a bidirectional cross-attention mechanism to fuse acoustic features from a lightweight CNN-based model with textual features, yielding a representation highly sensitive to fine-grained emotional dynamics. For the visual modality, a MobileNetV3-based perception head is proposed to adaptively select keyframes, yielding a representation focused on the most emotionally salient moments. For multimodal emotional consistency discrimination, we develop a multi-dimensional fusion strategy to deeply integrate high-level emotional features from different modalities within a unified latent space. For unimodal emotion recognition, both the audio and visual branches outperform baseline models on the CREMA-D dataset. Building on this, the complete ACE-Net model achieves a state-of-the-art AUC of 0.921 on the challenging DFDC benchmark.

Keywords:

deepfake detection; multimodal emotional consistency; fine-grained emotion recognition; cross-modal fusion

1. Introduction

With the development of deep learning and Generative Adversarial Networks (GANs) [1,2,3], Deepfake, especially face swapping and voice synthesis [4,5], is now capable of realistically simulating real content in both visual and auditory modalities. While enabling creative applications, this technology poses a significant threat to social trust and individual security. The goal of Deepfake detection is to automatically identify such AI-generated forgeries from vast digital repositories. However, the rapid evolution of generative models continuously blur the boundary between low-level visual artifacts and acoustic traces in authentic and manipulated data [6,7,8], rendering traditional artifact-based detectors increasingly unreliable [9], creating an urgent need for more sophisticated detection paradigms.

Therefore, deep learning-based detectors have emerged as the dominant approach, but many still focus on analyzing artifacts within a single modality, such as XceptionNet [10], ResNet, and EfficientNet. While demonstrating strong local feature extraction capabilities, their reliance on unimodal, generator-specific fingerprints often leads to overfitting and poor generalization to unseen manipulation techniques [11].

As a result, the research focus has shifted toward analyzing high-level semantic conflicts across modalities, with emotional consistency emerging as a particularly promising cue [12]. Unlike easily forged low-level traces, emotional information is intricately woven into human audio and facial expressions, making it exceptionally difficult for generative models to perfectly synchronize this cross-modal intrinsic correlation [13]. However, emotion-based methods still face two primary challenges: (1) Inadequate unimodal representation. Existing models often fail to adequately characterize the fine-grained dynamics of emotional expressions. (2) Superficial cross-modal fusion. Simple concatenation or fixed metrics are insufficient for modeling the deep, non-linear relationships between high-dimensional features [14,15].

To this end, we propose ACE-Net, a novel framework centered on multimodal emotional consistency. To address (1), We introduce dedicated attention mechanisms within the unimodal branches to refine their feature representations. To address (2), we construct a consistency discriminator with a multi-dimensional fusion strategy.

The main contributions of this work are summarized as follows:

We develop a lightweight Multi-grained Depthwise Convolutional Network (MDCNN) equipped with a parallel channel-spatial attention mechanism. This network captures emotionally salient peaks by combining global and local pooling strategies and efficiently models spatio-temporal correlations in the time-frequency domain using depthwise separable convolutions, thereby achieving refined, multi-scale perception of acoustic features.
We design a Coarse-to-Fine visual emotional dynamics frame selection strategy, which integrates motion analysis based on optical flow with an expression verification perception head based on MobileNetV3. This two-stage mechanism rapidly discards static and redundant segments, focusing computational resources on keyframes with the most intense emotional expressions, thereby significantly improving the efficiency and precision of visual feature extraction.
We propose an emotional consistency discriminator based on multi-aspect feature fusion and deep learning. Our method first constructs a comprehensive feature representation by combining three complementary operations: concatenation, difference, and product. This multi-aspect vector, which encodes aggregation, conflict, and synergy signals, is then fed into a deep, non-linear discriminator (MLP). This allows the model to learn complex semantic correlations from the data directly, significantly enhancing its performance and generalization capability.

2. Related Works

2.1. Visual Modality Detection Methods

While Convolutional Neural Networks (CNNs) are widely validated for visual deepfake detection, their local receptive fields inherently limit sensitivity to global context and subtle, localized manipulations [16]. Attention modules can mitigate this limitation by reweighting salient regions, but this comes at an increased computational cost. Vision Transformers (ViT) models [17] excel at modeling long-range dependencies and are effective against subtle artifacts, yet their quadratic complexity often precludes deployment in real-time or resource-constrained settings. This establishes a central trade-off between performance and efficiency.

To address the efficiency aspect of this trade-off, lightweight backbones like GhostNet [18] reduces the computational load by generating feature maps through inexpensive linear operations, thereby preserving representational power for edge and biometric applications [19]. Despite their efficiency, these models still require stronger mechanisms to reliably focus on critical forgery cues [20].

Navigating this trade-off, our visual branch first performs dynamic keyframe selection to prune static or low-information frames and localize segments with pronounced facial movements. The retained frames are then processed by an attention-augmented lightweight backbone built on GhostNet, which reallocates capacity toward emotionally salient facial regions (e.g., eyes, mouth) and micro-expressions while maintaining low computational overhead. This two-stage design preserves global context and sensitivity to subtle manipulations, yielding a practical detector suitable for real-time and resource-constrained settings.

2.2. Audio Modality Detection Methods

Classical audio forgery and speaker analysis pipelines commonly relied on handcrafted spectral descriptors, notably MFCCs [21] and LPC [22], to summarize timbre, formants, and other low-level acoustical cues. As voice conversion, neural TTS, and audio-video synchronization matured, these descriptors became increasingly spoofable, motivating a shift toward higher-level semantics, particularly prosody and emotion, that are harder to counterfeit consistently [23].

To operationalize this shift, CNNs and ViT-style architectures have been applied to log-Mel spectrograms to learn discriminative patterns directly from time-frequency maps [24]. Hybrid designs such as CNN + GRU further capture longer-range temporal structure critical for affective dynamics [25]. While attention and channel re-weighting enhance emotionally informative bands and transitions, several complex variants incur substantial computational overhead in streaming or resource-constrained settings [26].

Depthwise separable convolution (DSC) offers a favorable efficiency–accuracy compromise by factorizing 2D convolutions into depthwise and pointwise stages. This decomposition matches the separable structure of spectrogram processing, enabling lightweight yet fine-grained pattern extraction without sacrificing local detail [27]. Nevertheless, acoustics alone can be semantically ambiguous; fusing ASR-derived textual embeddings with acoustic features improves robustness under noise and content ambiguity by aligning complementary evidence in a shared space [28,29,30].

In line with these trends, our audio branch adopts a DSC-based backbone with multi-granularity channel-spatial attention (MDCNN) to sharpen emotionally salient time-frequency responses [31], and a bidirectional cross-attention module that fuses acoustic features with textual embeddings to form a compact, emotion-sensitive representation.

2.3. Multimodal Emotional Consistency Detection Methods

To overcome the limitations of unimodal detection, recent work has increasingly targeted multimodal consistency modeling, reframing the task from a search for generative artifacts to the identification of cross-modal semantic conflicts. However, early fusion strategies often relied on shallow aggregation, which struggled to expose deep contradictions [32]. Initial explorations frequently adopted decision-level fusion, such as voting or simple weighting of unimodal outputs. Since interaction in these models occurs only at a highly abstract layer, they often disregard fine-grained feature evidence and prove ineffective against complex cross-modal conflicts in forged data. To address this information loss, research shifted to feature-level fusion, most commonly by directly concatenating multimodal embeddings [33]. Yet, simple concatenation remains a form of shallow aggregation, as it treats all feature dimensions equally and fails to explicitly model the specific correlations and conflicts between modalities, which is the crux of assessing high-level semantic consistency [34]. In the post-fusion discrimination stage, studies by Cheng et al. [35] have adopted a two-stage similarity-threshold paradigm. As noted by other researchers [36,37], this design decouples representation learning from the final decision, thereby impeding gradient flow and limiting both performance and generalization.

Consequently, the central challenge has evolved from whether to fuse to how to fuse deeply and discriminate intelligently [38]. This principle is supported by findings in related domains, where methods that model complex cross-modal interactions have consistently demonstrated superior performance over shallow fusion schemes. Motivated by these insights, we propose a consistency discriminator that learns to assess emotional alignment directly from the data. Instead of relying on a single fusion operation, our approach first constructs a set of multi-aspect statistics—capturing feature aggregation, discrepancy, and synergy—and then uses a non-linear model to learn their complex relationships in a unified latent space, removing the reliance on external similarity thresholds.

3. Methods

3.1. Overall Framework

ACE-Net formulates multimodal deepfake detection as emotional consistency discrimination between speech and facial cues. As shown in Figure 1, the system comprises three modules: (1) a Speech–Text Emotion Feature Extractor, (2) a Dynamic–Temporal Facial Emotion Feature Extractor, and (3) a Consistency Discriminator.

Given a 2–4 s video segment, faces are sampled at 30 fps, aligned, and cropped to 224 × 224; the audio track is resampled to 16 kHz and converted into 80-band log-Mel spectrograms (25 ms window, 10 ms hop), while text is obtained from a frozen ASR followed by tokenization (maximum length L). Audio spans and dynamically selected facial keyframes are temporally aligned; forged pairs are constructed as emotion-tampering (same identity, different emotion) and cross-identity samples (different identity and emotion), mixed 1:1 with genuine pairs unless otherwise specified. The speech–text branch outputs a joint embedding

z_{a t} \in R^{d}

and the facial branch outputs

z_{v} \in R^{d}

(default d = 256); both are projected to the same dimensionality and then fused in a unified latent space to expose synergy, conflict, and correlation across modalities. The fused representation is fed to a lightweight MLP and optimized with binary cross-entropy on balanced batches of genuine and forged pairs.

3.2. Speech–Text Emotion Feature Extraction Module

This module produces a compact, emotion-sensitive embedding by jointly modeling acoustic prosody and lexical content [39]. We achieve this through a dual-stream fusion architecture that consists of three core components: (1) an acoustic stream for log-Mel spectrograms; (2) a textual stream from a frozen ASR/BERT encoder; and (3) a bidirectional cross-attention mechanism for deep interaction between the two modalities. Audio is resampled to 16 kHz and converted to 80-band log-Mel spectrograms (25 ms window, 10 ms hop); transcripts are tokenized to a sequence of length

T_{t}

.

3.2.1. MDCNN Acoustic Branch

The MDCNN is an acoustic feature extraction network designed specifically to address the unique challenges of speech emotion recognition. Its novelty lies not in the invention of new operators, but in the task-specific architectural design that combines established techniques to create a synergistic effect for capturing fine-grained emotional cues from spectrograms. The design philosophy centers on “multi-granularity” perception for emotional dynamics and end-to-end efficiency.

For the speech modality, we employ a network with a Depthwise Separable Convolution (DSC) backbone, augmented by a parallel multi-granularity channel-spatial attention module. This architecture, which we term MDCNN (Multi-granularity-attention Depthwise Convolutional Network), is illustrated in Figure 2. The attention block has two branches: a channel-attention branch and a spatial-attention branch. Given an utterance-level log-Mel input

{X \in R}^{F \times T}

(default F = 80, 25 ms windows, 10 ms hop), MDCNN stacks four DSC stages with channels [64, 128, 256, 256] and strides [2, 2, 2, 1] to obtain an intermediate feature map

A \in R^{F \times T \times C}

with C = 256. BatchNorm and ReLU are used throughout. The MDCNN then splits into two parallel branches, named channel attention and spatial attention, whose outputs are later combined to reweight the backbone features.

The channel attention branch reweights channels according to their importance for emotional discrimination. Recognizing that speech emotion is characterized by both global tonal properties and transient, local energy bursts, we adopt a dual-pooling approach inspired by CBAM [40]. Specifically, Global Average Pooling (GAP) effectively summarizes the overall statistical distribution of each feature map channel [41], capturing the global emotional tone. In parallel, Global Max Pooling (GMP) is more sensitive to capturing the most discriminative local responses, which correspond to moments of peak emotional expression in the spectrogram. We concatenate GAP and GMP descriptors

{[G A P, G M P] \in R}^{2 C}

. Finally, pass these fused features through a shared MLP

{[2 C \to C / r \to C] \in R}^{2 C}

(reduction = 8), ReLU and Sigmoid, yielding channel weights

{w_{c} \in R}^{C}

.

The spatial attention branch produces a time-axis saliency map by applying a 3 × 3 DSC to the channel-pooled map. We first collapse the channel dimension of

A

by mean and max pooling and concatenate the two maps to form

U \in R^{2 \times F \times T}

. To efficiently model local temporal correlations crucial for emotion dynamics, deviating from traditional methods that use large-kernel pooling or standard convolutions, we innovatively employ a 3 × 3 DSC on the channel-pooled feature map to efficiently model spatial correlations, which is applied on

U

to produce a single-map score

S

. After a sigmoid we obtain a spatial score map

M

. Averaging

M

along frequency yields a temporal saliency curve

w_{t}

that highlights the time steps where emotional evidence is strongest. This design choice not only maintains stylistic consistency with the MDCNN’s DSC backbone but also allows for the capture of fine-grained local spatial context at a very low computational cost.

We reweight the backbone features by combining the channel weights

w_{c}

and temporal weights

w_{t}

:

\tilde{A} = (w_{c} \otimes 1_{F} \otimes 1_{T}) ⊙ ({1_{C} \otimes 1}_{T} \otimes w_{t}) ⊙ A

(1)

Intuitively,

w_{c}

highlights the most relevant frequency channels for a given emotion, while

w_{t}

pinpoints the critical moments in time; their synergy directs computation to the most informative regions of the signal.

From the reweighted feature map

\tilde{A}

, we average along the frequency axis to obtain a per-time embedding

f_{t}^{A}

. Stacking these vectors over time forms an acoustic sequence

F \in R^{T \times C}

. A linear projection then maps

F

to the shared model dimension d, producing the acoustic token sequence

z_{a}

. This sequence is subsequently fused with textual tokens via the bidirectional cross-attention described in Section 3.2.2.

3.2.2. Bidirectional Cross-Modal Attention Mechanism

The bidirectional cross-attention module is responsible for the deep fusion of acoustic and semantic features. Given the acoustic token sequence

z_{a}

from MDCNN and token embeddings

T_{0}

from a frozen ASR and BERT encoder. As illustrated in Figure 3, this module performs two attention operations in parallel. We project both streams to a shared model dimension d (default d = 256), obtaining the acoustic sequence A and the textual sequence T. Padding masks are applied to ignore padded time steps during attention. We adopt multi-head scaled dot-product attention in both directions, with h = 4 heads and dropout 0.1.

In the Acoustic-to-Textual Attention stream (

A \to T

), acoustic features A provide the queries

Q_{A}

to selectively aggregate relevant semantic information from T, which provide the keys

K_{T}

and values

V_{T}

. Conversely, in the Textual-to-Acoustic Attention stream (

T \to A

), textual features T provide the queries

Q_{T}

to align with critical prosodic details from A, which in turn serve as the keys

K_{A}

and values

V_{A}

. This bidirectional information enhancement allows the model to generate a deeply fused representation. The core operations can be formulated as:

{A t t n}_{A \to T} = s o f t m a x (\frac{Q_{A} K_{T}^{⊤}}{\sqrt{d_{h}}} + M_{A \to T}) V_{T}

(2)

{A t t n}_{T \to A} = s o f t m a x (\frac{Q_{T} K_{A}^{⊤}}{\sqrt{d_{h}}} + M_{T \to A}) V_{A}

(3)

where

Q_{A} = A W_{A}^{Q}

,

K_{T} = T W_{T}^{K}

, and

V_{T} = T W_{T}^{V}

are learnable linear projection matrices.

W^{Q}

,

W^{K}

, and

W^{V}

are learned linear projection matrices for queries, keys, and values, respectively. Head outputs are concatenated and linearly projected back to d, followed by Residual + LayerNorm.

We then mean-pool the outputs of both directions along time to obtain

\bar{a}

and

\bar{t}

, and sum them to form the joint acoustic-textual embedding

z_{a t}

. Only

W_{A}

,

b_{A}

,

W_{T}

,

b_{T}

and cross-attention parameters are trainable; the ASR/BERT backbone remains frozen for stability and efficiency. Through this bidirectional “information focusing,” the model generates a context-aware joint emotion representation that holistically integrates “how it is said” (acoustics) with “what is said” (text), serving as the final output of the speech-modality branch.

3.3. Dynamic–Temporal Facial Emotion Feature Extraction Module

This module is responsible for efficiently extracting spatiotemporal features of facial emotions from video sequences. Recognizing that emotional cues in videos are often sparse and transient—peaking around moments of expression change rather than being uniformly present—we designed a two-stage keyframe selection pipeline. This “coarse-to-fine” approach aims to maximize both computational efficiency and the signal-to-noise ratio of the extracted visual features, ensuring that only the most emotionally salient frames are passed to the downstream feature extractor. It consists of two core stages: (1) keyframe selection, which is used to filter out redundant frames from the original video frame sequence and locate key segments of expression changes; (2) lightweight spatiotemporal feature extraction, using GhostNet as the backbone network to efficiently encode the features of the selected keyframe sequences

z_{v}

.

3.3.1. Keyframe Selection

We adopt a coarse-to-fine cascade [42] to select keyframes that exhibit pronounced facial expression changes. The idea is to quickly discard easy negatives with a cheap coarse model and then focus computation on a compact set of fine candidates with a lightweight verifier.

The principle of the coarse stage is motion gating. Given face-aligned frames

{\{I_{t}\}}_{t = 1}^{T_{v}}

at 30 fps, we compute dense optical flow between consecutive frames and summarize motion by the mean magnitude:

v_{t} (x) = F l o w (I_{t - 1}, I_{t}) [x]

(4)

m_{t} = \frac{1}{| Ω |} \sum_{x \in Ω} ‖ v_{t} (x) ‖_{2}

(5)

I_{t - 1}

and

I_{t}

are two consecutive face-cropped frames;

F l o w (\cdot, \cdot)

estimates the pixel-wise displacement vector at location

x

.

Ω

denotes the face region; the L2 norms are averaged to yield a single motion score

m_{t}

for frame

t

(higher means stronger facial movement).

Within a sliding window of

w

seconds centered at

t

(default w = 0.5 s), let

W = ⌊w \cdot f p s⌋

. We set a data-adaptive threshold as the

ρ

-th percentile (default

ρ

= 80) of local motion levels and keep frames whose motion exceeds it:

τ_{t} = {P e r c e n t i l e}_{ρ} (\{m_{t^{'}} : |t^{'} - t| \leq \frac{W}{2}\})

(6)

G = \{t| m_{t} \geq τ_{t}\}

(7)

τ_{t}

is computed from motion scores inside the local window (window is truncated near sequence boundaries). Frames with

m_{t} \geq τ_{t}

are retained as candidates, pruning static and low-motion segments.

In the fine stage, we perform expressiveness verification. For each motion-gated candidate

t \in G

, a lightweight MobileNetV3 head

g (\cdot)

(input 224 × 224) produces an expressiveness score.

c_{t} = σ (w^{⊤} g (I_{t}) + b) \in (0,1)

(8)

This maps each candidate frame

I_{t}

to a comparable probability score

c_{t} \in (0,1)

that quantifies its facial expressiveness content. Here

g (\cdot)

is a lightweight MobileNetV3 head extracting expression-related features, and

(w, b)

with the sigmoid

σ (\cdot)

converts those features into the confidence that the frame is sufficiently expressive. We use this head as a fixed scorer (no human per-frame labels), and it is not separately supervised on our datasets. The purpose is to provide a unified numeric measure of emotional salience for subsequent filtering and weighting. We then retain a compact keyframe set via confidence filtering and

T o p - K

ranking, using a fixed confidence threshold

θ = 0.7

to discard low-score frames before ranking:

K = T o p - K (\{(t, c_{t}) : t \in G, c_{t} \geq θ\})

(9)

Among motion-gated candidates

G

, frames with score

\geq θ

are retained and the

T o p - K

by

c_{t}

are selected to form the keyframe set

K

. This produces a compact subset of the most expressive moments, improving robustness and computational efficiency and avoiding ad hoc, intuition-based selection. Finally, to synchronize the selected visual frames with the audio stream, we apply a simple time-alignment mapping, which maps each video frame index

t

to the nearest audio index

τ

based on the video fps and the audio frame rate. This auxiliary step ensures temporal consistency for downstream multimodal fusion, as shown in Figure 1 Keyframe Selection module.

3.3.2. Lightweight Spatiotemporal Feature Extraction

For the selected keyframe sequence

{\{I_{t}\}}_{t \in K}

, we adopt a lightweight visual encoder

h (\cdot)

built on GhostNet (FV-LiteNet). To better capture subtle facial muscle movements while controlling latency, we keep the shallow and mid layers that are most sensitive to local deformations and remove the SE modules in the last two stride-2 Ghost bottlenecks, as shown in Figure 4a. Frames are face-aligned and resized to 224 × 224; BatchNorm and ReLU are used throughout the backbone. Before global pooling, we apply a lightweight joint spatial–channel attention head, as shown in Figure 4b, to recalibrate the backbone features over key facial regions.

Each retained frame is encoded into a compact vector and stacked in temporal order:

f_{t}^{v} = h (I_{t}) \in R^{C_{v}}

(10)

{F = [f_{t}^{v}]}_{t \in K} R^{{K \times C}_{v}}

(11)

h (\cdot)

is the GhostNet-based FV-LiteNet encoder;

f_{t}^{v}

is the

C_{v}

dimensional embedding of frame

t

; keyframe count K = 8. Stacking yields a

{K \times C}_{v}

sequence for aggregation. To emphasize emotionally salient moments identified, the fine-stage scores

c_{t}

are converted into normalized attention weights via a temperature-scaled softmax:

α_{t} = \frac{e x p (β_{c_{t}})}{\sum_{t^{'} \in K} e x p (β_{c_{t}'})}, β > 0

(12)

Larger

c_{t}

induces a larger weight

α_{t}

; the softmax temperature

β

(default = 5) controls how sharply the attention concentrates on the top-scoring frames. The visual representation is obtained by attention pooling followed by a linear projection to the shared model width d.

z_{v} = (\sum_{t \in K} α_{t} f_{t}^{v}) W_{v} \in R^{d}

(13)

A weighted sum of per-frame embeddings is mapped by

W_{v} \in R^{C_{v} \times d}

to a fixed-length vector

z_{v}

, aligned with the audio–text branch for multimodal fusion

z_{a t}

.

3.4. Multimodal Emotional Consistency Discrimination

This module is responsible for fusing and discriminating the multimodal features output by upstream modules. Let the feature vector output by the joint speech–text representation module be

z_{a t}

, and the feature vector output by the facial emotion representation module be

z_{v}

. After obtaining aligned speech–text and visual emotion embeddings, we encode aggregation, discrepancy, and agreement as complementary, interpretable statistics and let a non-linear discriminator learn higher-order interactions over them:

f = [z_{a t}; z_{v}; |z_{a t} - z_{v}|; z_{a t} ⊙ z_{v}] \in R^{4 d}

(14)

where

[z_{a t}; z_{v}]

represents vector concatenation is employed to preserve the complete information from both modalities, providing a foundational feature space for the discriminator [43];

Element-wise Difference

|z_{a t} - z_{v}|

is designed to explicitly capture inter-modal conflict. This directly tests our hypothesis that large feature discrepancies are a strong indicator of emotional inconsistency, a technique proven effective for highlighting such conflicts [44];

{[z}_{a t} ⊙ z_{v}]

is the element-by-element product, aiming to model synergistic interactions, capturing correlated feature activations that are assumed to reinforce each other in emotionally consistent pairs [45].

This fusion approach not only preserves the original high-dimensional information of each modality but also explicitly incorporates correlation features between them, enabling the downstream discriminator to fully learn the essential differences in the joint distribution between genuine and fake.

The fused features are used as input and are subjected to binary classification by a multi-layer perceptron (MLP). The MLP adopts a three-layer inverted triangle structure, with each layer followed by Batch Normalization and Dropout to improve generalization ability, and the activation function is ReLU. We feed

f

to an MLP with an inverted-triangle layout, BatchNorm (BN), ReLU, and Dropout:

h_{1} = D r o p (R e L U (B N (W_{1} f + b_{1})))

(15)

h_{2} = D r o p (R e L U (B N (W_{2} h_{1} + b_{2})))

(16)

p = σ (w_{o}^{⊤} h_{2} + b_{o})

(17)

where

W_{1} \in R^{512 \times 4 d}

,

W_{2} \in R^{128 \times 512}

, and the consistency score is

p \in (0,1)

. The output layer is mapped by a Sigmoid function

σ (\cdot)

to give the probability of the current sample being “fake” or “real”. The MLP learns a decision surface that assigns higher

p

when discrepancies dominate and lower

p

when the two streams correlate and agree. We set hidden widths

{d i m (h}_{1}) =

512 and

{d i m (h}_{2}) =

128, with Dropout = 0.3 after each hidden layer.

The discriminator is trained using a binary cross entropy (BCE) loss function and supports full-process end-to-end backpropagation, so that the upstream feature extraction module can adaptively optimize feature representation for the forgery detection task. For a label

y \in \{0,1\}

(0: genuine, 1: fake), the loss is:

L_{B C E} = - [y \log p + (1 - y) \log (1 - p)]

(18)

This objective supports full end-to-end backpropagation, so gradients flow through the MLP and fusion operator back into the unimodal encoders, enabling them to refine feature extraction specifically for forgery discrimination. Inference uses a default decision threshold of 0.5; no external similarity threshold is required.

The main advantages of this method are: (1) the high-dimensional feature fusion method significantly enhances the model’s ability to express modal relationships, which helps to capture more fine-grained forgery clues, especially in real-world scenarios where there are certain errors in emotion recognition. It still has strong robustness; (2) the end-to-end discrimination mechanism does not require explicit threshold setting, which simplifies the system design process, reduces artificial hyperparameters, and improves the applicability and generalization ability of the method.

3.5. Training Strategy

Our model is trained via a decoupled, two-stage strategy, which is deliberately designed to ensure that it learns to detect high-level semantic inconsistency rather than low-level synthesis artifacts.

The first stage focuses on learning robust unimodal feature representations. To achieve this, the speech–text module MDCNN and the facial module FV-LiteNet are trained independently as standard emotion classifiers. Critically, this training is performed exclusively on genuine, unaltered data from the CREMA-D, MELD, and SAVEE datasets. Upon completion, the parameters of these feature extractors are frozen. This crucial step grounds the learned feature spaces in authentic emotional expressions, preventing any bias from later exposure to synthetic data. The performance of these individual modules, which serves as a quality benchmark for our feature extractors, is reported in Section 4.3.

In the second stage, the model is trained to discriminate between emotionally consistent and inconsistent audio–visual pairs. With the feature extractors frozen, only the parameters of the fusion module and the final MLP discriminator are updated. This stage utilizes a carefully balanced mixed dataset. The negative class (consistent pairs) comprises only genuine audio–visual samples in which the speaker identity and the expressed emotion are aligned (

S

same,

A

same). The positive class (inconsistent pairs) includes forgeries from both Emotional Tampering (

S

same, emotion

A \neq B

) and Cross-Identity Mismatch (

S 1 \neq S 2, A \neq B

). This design compels the model to learn semantic consistency across identity and emotion, rather than relying on low-level artifacts or speaker-specific cues. During training, we use a 1:1 class balance and stratify positive samples evenly across the two subtypes.

3.6. Computational Efficiency Analysis

Table 1 summarizes device-agnostic complexity figures that are statically derivable from the architecture. We report parameter counts and analytical compute reductions where applicable. The values corroborate that ACE-Net’s “lightweight” property results from quantifiable choices at each stage rather than hardware-dependent latency.

4. Results and Discussion

4.1. Experimental Datasets and Forgery Synthesis

Our experiments are conducted on three public emotion corpora: CREMA-D, MELD, and SAVEE. To facilitate our study on cross-modal consistency, a dedicated forgery dataset was synthesized exclusively from the CREMA-D corpus. The entire experimental pipeline, from data synthesis to training, is guided by a design philosophy aimed at rigorously testing for high-level semantic inconsistency. The following subsections detail our core design principles, the forgery synthesis methods, and the data preprocessing steps.

4.1.1. Design Principles

Our experiments are conducted on three public emotion datasets: CREMA-D, MELD, and SAVEE. The core of our experimental design is to rigorously test for high-level semantic inconsistency while explicitly preventing the model from relying on low-level synthesis artifacts. To achieve this, we adhere to the following key principles:

Decoupled Two-Stage Training. Our training process, detailed in Section 3.5, separates feature learning on genuine data from consistency learning on mixed data to ensure a focus on semantic relationships.
Controlled Forgery Generation. We generate different types of forgeries to act as experimental controls, allowing us to isolate and test for specific inconsistencies. The detailed synthesis methods are described in Section 4.1.2.

4.1.2. Forgery Synthesis

Our forgery synthesis process covers two key paradigms designed to target emotional consistency, with all forged samples balanced at a 1:1 ratio against genuine pairs during training.

The first paradigm, Emotional Tampering, creates an audio–visual emotional conflict while preserving speaker identity. For a genuine video clip featuring speaker

S

with facial emotion

A

and transcript

T

, we first synthesize a new speech waveform using an emotion-controllable TTS model, Melotron TTS, conditioned on a different target emotion

B

(

B \neq A

) and the original transcript

T

. Subsequently, we employ a Retrieval-based Voice Conversion (RVC) model to transfer the original speaker

S

’s timbre onto the synthesized TTS waveform, yielding

{\hat{a}}_{S, B}

. To ensure identity preservation, we retain a sample only if the cosine similarity between pre-trained speaker embeddings (x-vector) of the original and converted audio ≥0.75 [46]. Finally, the forged audio is re-synchronized with the original video (fixed 30 fps) to create a sample where the facial expression conveys emotion

A

while the speech conveys emotion

B .

The second paradigm, Cross-Identity Spliced Forgery, simulates a common deepfake scenario involving both identity and affective inconsistencies. We create forged pairs by taking a video clip of speaker

S_{1}

expressing emotion

A

and pairing it with an audio clip of speaker

S_{2}

expressing emotion

B

, where

A \neq B

. The pairs are sampled uniformly across speakers and emotions from the genuine CREMA-D datasets, ensuring no speaker overlap between the audio and video streams. To prevent the model from learning simple identity cues, we explicitly exclude cases where cross-identity spliced forgery would be trivially detectable and require an x-vector cosine similarity ≥0.75 to qualify positive matches; pairs failing this constraint are discarded. Furthermore, to confirm a genuine identity mismatch, we require the x-vector cosine similarity between the audio of the paired speaker

S_{2}

and the original video’s speaker

S_{1}

to be ≤0.40. Audio-video synchronization is verified to be within a 2-frame offset (at 30 fps), and duration differences are normalized to within 0.2 s via time-stretching or trimming.

4.1.3. Data Preprocessing

For the visual stream, we first applied the dynamic keyframe selection algorithm. For each selected keyframe, we employed MTCNN [47] to detect the face bounding box and its corresponding 5 facial landmarks (eyes, nose, and mouth corners). To normalize for head pose and scale, a similarity transformation was applied to align the landmarks to a canonical template. The resulting bounding box was then expanded by a 25% margin to include the full facial region and context, such as the chin and forehead. These cropped images were subsequently resized to 224 × 224 pixels. All videos were processed at a uniform rate of 30 fps. To handle occasional detection failures, if a face was not detected in a given frame, the bounding box from the previous successfully detected frame was propagated forward for one frame. Video clips with more than 10% of frames failing face detection were discarded from the datasets to ensure data quality. During training, we applied data augmentation, including random horizontal flipping and brightness jittering.

For the audio stream, all waveforms were resampled to 16 kHz. We then computed 80-dimensional log-Mel spectrograms using the librosa library [48], with a 25 ms Hanning window, a 10 ms hop length, and 1024 FFT points. During training, audio augmentations included random volume perturbation.

For the text stream, transcriptions were generated using a pre-trained and frozen ASR model, specifically the OpenAI Whisper ‘base’ model. Standard text cleaning procedures were applied, such as converting text to lowercase and removing punctuation. The cleaned transcripts were then tokenized using the BERT tokenizer before being fed into the model.

4.2. Experimental Setup

4.2.1. Model Parameters and Training Configuration

All models were implemented using the PyTorch 2.4.1 framework with Python 3.8, and experiments were conducted on a single NVIDIA RTX 3090 GPU with 24 GB of VRAM. To ensure reproducibility, all experiments were run with a fixed random seed.

For training, we used the Adam optimizer with a learning rate of 1 × 10⁻⁴ and a weight decay of 1 × 10⁻⁵. Models were trained with a batch size of 32 for up to 50 epochs. To mitigate overfitting and reduce unnecessary training time, we implemented an early stopping strategy with a patience of 25 epochs based on the validation set loss. After training concluded, the model state with the best validation performance was restored for all subsequent evaluations.

4.2.2. Evaluation Strategy and Metrics

To evaluate model performance, this study adopts a differentiated validation scheme tailored to the size characteristics of each dataset, ensuring statistical robustness and reliability. For the CREMA-D and MELD datasets, which have sufficient sample sizes, a fixed holdout validation strategy is applied. Each dataset is randomly partitioned into a training set (80%), a validation set (10%), and a test set (10%). For the SAVEE dataset, due to its small sample size, a 10-fold cross-validation strategy is employed to mitigate the accidental bias that a single random split might introduce.

The model’s performance is assessed using standard evaluation metrics. For the emotion recognition task, our primary metric is the Weighted F1-Score, chosen for its effectiveness in handling the class imbalance common in affective datasets. We also report Accuracy for completeness and use the Confusion Matrix for qualitative error analysis. For the spoofing detection task, the Area Under the Curve (AUC) serves as the core metric due to its robustness to class distribution and its direct measure of discriminative capability. To mitigate ASR-induced bias, we freeze the ASR/BERT stack, rely on acoustic cues via MDCNN when transcriptions are noisy, and evaluate on datasets with diverse recording conditions.

4.3. Emotion Recognition Results Analysis

4.3.1. Speech–Text Model Experimental Results and Analysis

To validate the effectiveness of our proposed speech–text emotion recognition module MDCNN, we conducted a comparative analysis against several baseline models on the CREMA-D, SAVEE, and MELD datasets. These baselines were implemented to systematically ablate the contribution of different architectural components and are representative of common approaches in the field of speech emotion recognition. Importantly, for this unimodal evaluation, the speech–text model was trained and tested exclusively on genuine data. The baselines are defined as follows:

CNN [49]: A standard convolutional neural network applied to log-Mel spectrograms. This serves as a fundamental baseline to evaluate the effectiveness of basic acoustic feature extraction, a common starting point in many audio analysis tasks.

CNN + GRU [25]: This model extends the CNN by adding a Gated Recurrent Unit (GRU) layer to model the temporal dependencies within the speech signal. Hybrid CNN-RNN architectures are a widely adopted standard for SER.

CNN + BERT [50]: To assess the impact of textual information, this baseline performs a simple late fusion by concatenating the features from the acoustic CNN with sentence embeddings from a pre-trained BERT model. This represents a straightforward approach to multimodal speech–text fusion.

By comparing our proposed MDCNN against these progressively more complex baselines, we can isolate the performance gains attributable to our bidirectional cross-modal attention mechanism. A summary of the performance comparison is presented in Table 2.

The overall results in Table 2 demonstrate that the proposed model achieved the best performance across all three datasets, consistently outperforming the established baselines. Specifically, the model attained accuracies of 74.67%, 80.25%, and 68.92% on the respective datasets, a significant improvement over models that rely on simple concatenation for fusion. This confirms that our attention-based fusion mechanism can more effectively weigh the relative importance of the acoustic and textual modalities, thereby enabling more precise discrimination. These baselines provide standard unimodal references for validating the superiority of the proposed speech–text fusion framework.

An examination of the Speech–Text Confusion Matrix (Figure 5) further reveals the model’s strong classification capability across most emotion categories, with values along the main diagonal being substantially higher than off-diagonal entries. In particular, for the “Happy” emotion—characterized by a distinct positive valence and unique acoustic patterns—the model achieved exceptionally high recognition accuracy. Even on the contextually complex MELD dataset, this accuracy remained near 70%. Similarly, “Anger,” as a high-arousal negative emotion, consistently yielded recognition rates above 80% on both CREMA-D and SAVEE.

While the model exhibits some confusion when distinguishing between acoustically similar emotions like “Neutral,” “Fear,” and “Surprise,” its overall recognition performance remains high. This suggests that the MDCNN backbone is highly effective in capturing acoustic features defined by high energy and rapid speech rates.

4.3.2. Facial Emotion Recognition Results and Analysis

To evaluate the performance of the facial modality branch, independent experimental validation was conducted on the CREMA-D, SAVEE, and MELD datasets, with the results shown in Table 3. While the FV-LiteNet, like all our unimodal modules, was trained exclusively on genuine data, its performance on the CREMA-D test set was evaluated using the original, unaltered videos that were also selected for our forgery synthesis pipeline. This test verifies that our feature extractor’s performance on these specific visual samples is stable and not incidentally affected by their selection for a different task, thus providing a reliable baseline before they are used in the multimodal forgery detection stage. For the MELD and SAVEE datasets, where no forgeries were synthesized, the evaluation followed a standard protocol using their respective genuine test sets.

To benchmark our model, we compared it against a traditional baseline using Local Binary Patterns and a Support Vector Machine (LBP + SVM), a widely recognized method for hand-crafted facial texture analysis. To ensure a fair and reproducible comparison, we implemented this baseline as follows: First, Uniform LBP features were extracted from each facial crop, which was divided into an 8 × 8 grid. An LBP histogram was computed for each region using a radius of 1 pixel and 8 neighboring points, and these regional histograms were concatenated. The final feature vector for a video was the average of all its frame-level vectors. Subsequently, these features were used to train an SVM classifier with an RBF kernel. Hyperparameters were optimized via grid search, yielding a penalty parameter C = 1 and a kernel coefficient gamma (γ) = 0.5. This strong conventional approach [51] was evaluated on the same test sets as FV-LiteNet for a direct comparison.

The comparative results, presented in Table 3, show that our FV-LiteNet model generally outperforms the LBP + SVM baseline, particularly on the more challenging datasets.

Furthermore, a comprehensive observation of the Facial Emotion Recognition Confusion Matrices (Figure 6) reveals that for “Happy” and “Sad”—two emotions characterized by clear, singular muscle group movements—the model achieved extremely high recognition accuracy across all datasets, with accuracy generally exceeding 90%. Similarly, “Surprise,” as a high-arousal emotion, was easily captured by the model, resulting in excellent recognition performance. Although there was some confusion between “Anger” and “Disgust,” and minor differences in the subtle static frames between “Fear” and “Neutral” led to occasional misclassification, the model still demonstrated a high recognition rate even for these easily confused emotions. The consistent performance across datasets of varying styles and scales strongly validates that FV-LiteNet can provide high-quality, high-confidence facial visual feature input for the subsequent multimodal fusion stage. The comparison with the classical LBP + SVM baseline confirms that the lightweight FV-LiteNet achieves superior generalization and expressive power while maintaining computational efficiency.

4.4. Forgery Detection Results Analysis

4.4.1. Performance on Different Forgery Types

To comprehensively evaluate the performance of our proposed forgery detection framework, we conducted experiments on the CREMA-D, MELD, and SAVEE datasets. Our evaluation focuses on the model’s ability to distinguish Genuine Pairs, Emotional Tampering and Cross-Identity Spliced Forgery. The detailed detection performance across these different audio–visual pairing types is presented in Table 4.

The experimental results, detailed in Table 4, demonstrate that the proposed cross-modal detection method exhibits robust performance across the three diverse datasets: SAVEE, CREMA-D, and MELD. The model’s capabilities were assessed on three distinct pairing types: For Genuine Pairs, the model effectively identifies authentic emotional consistency, achieving accuracies consistently above 91% and an AUC exceeding 0.95. For the two primary forgery categories, Emotion-Tampered Forgeries and Cross-Identity Spliced Forgeries, the model attains accuracies ranging from 70% to 88%, with the AUC reaching as high as 0.94.

Notably, the detection performance on Cross-Identity Spliced Forgeries surpasses that on Emotional Tampering. This highlights the model’s robustness in scenarios with strong audio–visual mismatches, likely because cross-identity splicing induces more pronounced and readily detectable inter-modal inconsistencies. Furthermore, the model maintains stable performance on the more challenging MELD dataset, with all evaluation metrics for all pairing types remaining above 70%. Collectively, these findings validate the efficacy of the multimodal emotional consistency mechanism for the deepfake detection task.

A noteworthy observation from Table 4 is the close proximity between the Accuracy and F1-score values across most experiments. This is not a coincidence but rather a direct consequence of our experimental design and the balanced nature of our model’s performance. Our test sets for the forgery detection task were constructed with a balanced 1:1 ratio of genuine to forged samples. The resulting high correlation between Accuracy and F1-score indicates that our model exhibits a symmetric error profile, making similar numbers of false positive and false negative predictions. This balanced performance is a desirable characteristic for a reliable detection system, as it demonstrates that the model is not biased towards either the genuine or the forged class.

4.4.2. Ablation Study

To assess the influence of various modal fusion strategies on forgery detection performance, we conducted a comprehensive ablation study on the CREMA-D dataset. As detailed in Table 5, the experiments compare three progressively enhanced fusion methods.

The results in Table 5 reveal that our proposed multi-dimensional feature fusion strategy yields progressive and consistent performance gains across both forgery types.

Specifically, for Emotion Tampering detection, starting from the Concatenation only baseline, the introduction of the Difference operation significantly boosts the F1-score to 85.7%. This empirically confirms that explicitly modeling inter-modal discrepancies is crucial for capturing emotional conflicts. The final integration of the Product operation further elevates the F1-score to 87.0% and the AUC to 0.92, demonstrating the added value of modeling feature correlations.

A similar upward trend is observed in Cross-Identity Spliced Forgery detection, where the F1-score and AUC ultimately reach a peak of 90.5% and 0.94, respectively, with the full Concatenation + Difference + Product strategy. This consistent improvement across different forgery scenarios validates our core hypothesis: while simple concatenation merely aggregates information, incorporating Difference and Product operations empowers the model with a more profound understanding of the complex relationship between modalities, capturing both discrepancy and correlation. Furthermore, the close alignment of Accuracy and F1-scores throughout this ablation study reinforces the observation from Table 5 that our model maintains a balanced and stable error profile across different architectural configurations. This enrichment of interactive features allows the downstream MLP discriminator to learn more precise distributional distinctions between authentic and forged pairs, thereby enabling more robust discrimination overall.

4.4.3. Comparison with Existing Methods

To position ACE-Net at the forefront of current research, we conduct a comprehensive evaluation against a diverse range of existing methods on the challenging DFDC dataset. The selected baselines span two primary technical paradigms: (1) Unimodal Artifact-based Detectors: This group comprises methods that rely on either visual or audio cues derived from low-level artifacts. For visual modality, we include MesoNet-4 [52], which targets mid-level texture inconsistencies; Face X-ray [53], which focuses on blending boundaries; and Two-stream CNN [54], a framework designed to enhance artifact detection. For speech modality, we compare against the classic CQCC-GMM [55] and the deep learning-based RawNet2 [56]. (2) Multimodal High-Level Semantic Frameworks: Representing the current research frontier, this category includes approaches that analyze multimodal inconsistencies. Our comparison features DeepRhythm [57], which inspects heart-rate consistency; a Siamese network [58] that utilizes metric learning; and MDS [59], another advanced multimodal discriminator. The comparative results, measured by AUC, are presented in Table 6.

The comparative results presented in Table 6 systematically elucidate the performance landscape of different detection paradigms on the DFDC dataset. An analysis of the unimodal approaches reveals a distinct performance ceiling, for both visual and audio modalities, that is significantly lower than that of advanced multimodal methods. Within the visual domain, even the Two-stream CNN (AUC 0.614), despite its fusion strategy, is substantially outperformed by Face X-ray (AUC 0.809), which leverages finer-grained cues. This observation, coupled with the generally modest performance of audio-based detectors (which top out with RawNet2 at an AUC of 0.718), underscores that most unimodal methods are unable to match the efficacy of leading multimodal systems. This strongly suggests that the fusion of multi-source information is a critical pathway for performance enhancement, particularly when confronted with the complex challenges posed by datasets like DFDC, characterized by extensive cross-modal inconsistencies and variable video quality.

Among multimodal methods, a clear trend emerges: approaches that pivot to modeling high-level semantic consistency demonstrate marked performance gains. For example, the Siamese network (AUC 0.844) based on metric learning and the state-of-the-art multimodal discriminator MDS (AUC 0.915) both vastly outperform all unimodal baselines. These results indicate that as forgery techniques mature, the paradigm of detection is shifting from capturing low-level statistical artifacts to modeling disruptions in high-level semantic associations, which are inherently more difficult to replicate with fidelity [60].

Through this systematic comparison, our experiments not only validate that targeting multimodal high-level semantics is a highly effective direction for contemporary forgery detection. Furthermore, ACE-Net, architected around the core principle of cross-modal emotional consistency, contributes a novel approach for constructing next-generation forgery detection systems that are both efficient and versatile, owing to its unique emphasis on a lightweight, flexible and modular design [61,62].

5. Conclusions and Future Work

In response to the generalization bottleneck faced by current deepfake detection techniques, we propose a novel detection framework ACE-Net based on multimodal emotional consistency analysis. Its core lies in identifying the illogical inconsistencies between modalities in intrinsic emotional expressions. We effectively achieve this goal through a two-stage strategy comprising feature extraction and consistency discrimination. In the first stage, this strategy uses the upstream single-modal expert model as a high-level semantic feature extractor to convert the original audio–visual content into a dense, high-quality emotional embedding; in the second stage, the downstream multi-dimensional fusion network is used to model and discriminate the consistency of cross-modal emotions through explicit feature interaction operations.

While promising, our work opens several avenues for future research. A key direction is to enhance the model’s generalization to unseen manipulation techniques by evaluating it on a wider range of benchmarks and employing more rigorous testing protocols. Another promising path is to create a more holistic detector by integrating our emotion-centric module with state-of-the-art identity verification systems, thus combining semantic and biometric cues. Finally, exploring more advanced learnable fusion mechanisms, such as bilinear pooling or graph-based models, could further improve the detection of complex cross-modal relationships. Additionally, future work should address the robustness of the textual channel by exploring methods that are less reliant on explicit ASR transcriptions, such as using self-supervised speech representations, or by incorporating ASR confidence scores to mitigate the impact of transcription errors.

Author Contributions

Conceptualization, H.Z.; methodology, X.C.; software, X.L.; formal analysis, Y.S.; resources, S.Y. (Sijia Yu) and Y.S.; writing—original draft preparation, X.C.; writing—review and editing, S.Y. (Shaoqian Yu); project administration, S.Y. (Shaoqian Yu). All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data are contained within the article.

Acknowledgments

During the preparation of this work the authors used ChatGPT 5.0 in order to improve language. After using this tool, the authors reviewed and edited the content as needed and take full responsibility for the content of the publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

ACE-Net	Affective Consistency Evaluation Network
ASR	Automatic Speech Recognition
AUC	Area Under the ROC Curve
BERT	Bidirectional Encoder Representations from Transformers
CBAM	Convolutional Block Attention Module
DWConv	Depthwise Convolution
DSC	Depthwise Separable Convolution
FV-LiteNet	Facial Visual Lite Network
MACs	Multiply–Accumulate Operations
MDCNN	Multi-granularity-attention Depthwise Convolutional Network
MLP	Multi-Layer Perceptron

References

Croitoru, F.A.; Hondru, V.; Ionescu, R.T.; Shah, M. Diffusion models in vision: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 10850–10869. [Google Scholar] [CrossRef]
Xu, C.; Zhang, J.; Hua, M.; He, Q.; Yi, Z.; Liu, Y. Region-aware face swapping. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 7632–7641. [Google Scholar]
Tan, M.K.; Xu, S.K.; Zhang, S.H.; Chen, Q. Survey on deep adversarial visual generation. J. Image Graph. 2021, 26, 2751–2766. [Google Scholar] [CrossRef]
Yang, H.Y.; Li, X.H.; Hu, Z. A review of deepfake face generation and detection techniques. J. Huazhong Univ. Sci. Technol. (Nat. Sci. Ed.) 2025, 53, 85–103. [Google Scholar]
Koujan, M.R.; Doukas, M.C.; Roussos, A.; Zafeiriou, S. Head2head: Video-based neural head synthesis. In Proceedings of the 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition, Buenos Aires, Argentina, 16–20 November 2020; pp. 16–23. [Google Scholar]
Xue, P.Y.; Dai, S.T.; Bai, J.; Gao, X. Bimodal emotion recognition with speech and facial image. J. Electron. Inf. Technol. 2024, 46, 4542–4552. [Google Scholar]
Nguyen, H.H.; Yamagishi, J.; Echizen, I. Capsule-forensics: Using capsule networks to detect forged images and videos. In Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing, Brighton, UK, 12–17 May 2019; pp. 2307–2311. [Google Scholar]
Wang, Z.; Bao, J.; Zhou, W.; Wang, W.; Li, H. Altfreezing for more general video face forgery detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 4129–4138. [Google Scholar]
Zhao, H.; Zhou, W.; Chen, D.; Wei, T.; Zhang, W.; Yu, N. Multi-attentional deepfake detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 2185–2194. [Google Scholar]
Rossler, A.; Cozzolino, D.; Verdoliva, L.; Riess, C.; Thies, J.; Nießner, M. Faceforensics++: Learning to detect manipulated facial images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1–11. [Google Scholar]
Zhang, D.; Xiao, Z.; Li, S.; Lin, F.; Li, J.; Ge, S. Learning natural consistency representation for face forgery video detection. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 407–424. [Google Scholar]
Chen, X.; Zhang, W.; Xu, X.; Chao, W. A public and large-scale expert information fusion method and its application: Mining public opinion via sentiment analysis and measuring public dynamic reliability. Inf. Fusion 2022, 78, 71–85. [Google Scholar] [CrossRef]
Jiang, L.; Tan, P.; Yang, J.; Liu, X.; Wang, C. Speech emotion recognition using emotion perception spectral feature. Concurr. Comput. Pract. Exp. 2021, 33, e5427. [Google Scholar] [CrossRef]
Liang, W.; Chen, X.; Huang, S.; Xiong, G.; Yan, K.; Zhou, X. Federal learning edge network based sentiment analysis combating global COVID-19. Comput. Commun. 2023, 204, 33–42. [Google Scholar] [CrossRef]
Yang, D.; Liu, M.; Cao, M. Multi-modality behavioral influence analysis for personalized recommendations in health social media environment. IEEE Trans. Comput. Soc. Syst. 2019, 6, 888–897. [Google Scholar] [CrossRef]
Zhang, T. Deepfake generation and detection, a survey. Multimed. Tools Appl. 2022, 81, 6259–6276. [Google Scholar] [CrossRef]
Yuan, L.; Chen, Y.; Wang, T.; Yu, W.; Shi, Y.; Jiang, Z.; Tay, F.E.H.; Feng, J.; Yan, S. Tokens-to-token vit: Training vision transformers from scratch on imagenet. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 558–567. [Google Scholar]
Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. GhostNet: More features from cheap operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1580–1589. [Google Scholar]
Zhou, K.; Zhou, X.; Yu, L.; Shen, L.; Yu, S. Double biologically inspired transform network for robust palmprint recognition. Neurocomputing 2019, 337, 24–45. [Google Scholar] [CrossRef]
Zhou, X.; Liang, W.; Kevin, I.; Wang, K.; Wang, H.; Yang, L.T.; Jin, Q. Deep-learning-enhanced human activity recognition for Internet of Healthcare Things. IEEE Internet Things J. 2020, 7, 6429–6438. [Google Scholar] [CrossRef]
Davis, S.; Mermelstein, P. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process. 1980, 28, 357–366. [Google Scholar] [CrossRef]
Atal, B.S.; Schroeder, M.R. Adaptive predictive coding of speech signals. Bell Syst. Tech. J. 1970, 49, 1973–1986. [Google Scholar] [CrossRef]
Wani, T.M.; Gunawan, T.S.; Qadri, S.A.A.; Kartiwi, M.; Ambikairajah, E. A comprehensive review of speech emotion recognition systems. IEEE Access 2021, 9, 47795–47814. [Google Scholar] [CrossRef]
Akinpelu, S.; Viriri, S.; Adegun, A. An enhanced speech emotion recognition using vision transformer. Sci. Rep. 2024, 14, 13126. [Google Scholar] [CrossRef]
Nfissi, A.; Bouachir, W.; Bouguila, N.; Sadouk, L. CNN-N-GRU: End-to-end speech emotion recognition from raw waveform signal using CNNs and gated recurrent unit networks. In Proceedings of the 2022 21st IEEE International Conference on Machine Learning and Applications, Nassau, Bahamas, 12–15 December 2022; pp. 699–702. [Google Scholar]
Guo, Y.C.; Zhang, X.; Zhao, H.Y.; Mao, X.N. Speech enhancement based on deep complex gated dilated recurrent convolutional network. J. China Acad. Electron. Inf. Technol. 2025, 20, 194–202. [Google Scholar]
Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258. [Google Scholar]
Nguyen, T.A.; Muller, B.; Yu, B.; Stenetorp, P.; Gales, M.; Aharoni, R.; Andreae, C.; Gales, D.; Wang, Y.; King, S. Spirit-lm: Interleaved spoken and written language model. Trans. Assoc. Comput. Linguist. 2025, 13, 30–51. [Google Scholar]
Zhou, X.; Li, Y.; Liang, W. CNN-RNN based intelligent recommendation for online medical pre-diagnosis support. IEEE/ACM Trans. Comput. Biol. Bioinform. 2020, 18, 912–921. [Google Scholar] [CrossRef] [PubMed]
Zhan, M.; Kou, G.; Dong, Y.; Chiclana, F. Bounded confidence evolution of opinions and actions in social networks. IEEE Trans. Cybern. 2022, 52, 7017–7028. [Google Scholar] [CrossRef]
Zhu, G.J.; Cai, C.G.; Pan, B.; Wang, P. A multi-agent linguistic-style large group decision-making method considering public expectations. Int. J. Comput. Intell. Syst. 2021, 14, 188. [Google Scholar] [CrossRef]
Shi, S.; Qin, J.J.; Yu, Y.; Hao, X.K. Audio-visual emotion recognition based on improved ConvMixer and dynamic focal loss. Acta Electron. Sin. 2024, 52, 2824–2835. [Google Scholar]
Tang, B.; Zheng, B.; Paul, S.; Wang, H.; Wang, Z.; Li, Y.; Zhang, H.; Zheng, Z.; Sun, J.; Liu, S. Exploring the deep fusion of large language models and diffusion transformers for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 15–21 June 2025; pp. 28586–28595. [Google Scholar]
Feng, Y.; Qin, Y.; Zhao, S. Correlation-split and Recombination-sort Interaction Networks for air quality forecasting. Appl. Soft Comput. 2023, 145, 110544. [Google Scholar] [CrossRef]
Cheng, H.; Guo, Y.; Wang, T.; Dou, Y.; Cao, Y.; Tao, D. Voice-face homogeneity tells deepfake. ACM Trans. Multimed. Comput. Commun. Appl. 2023, 20, 1–22. [Google Scholar]
Liu, M.; Qi, M.J.; Zhan, Z.Y.; Qu, L.G.; Nie, X.S.; Nie, L.Q. A survey on image-text matching research based on deep learning. Chin. J. Comput. 2023, 46, 2370–2399. [Google Scholar]
Sun, W.; Jiang, J.; Huang, Y.; Li, J.; Zhang, M. An integrated PCA-DAEGCN model for movie recommendation in the social Internet of Things. IEEE Internet Things J. 2021, 9, 9410–9418. [Google Scholar] [CrossRef]
Ustubioglu, A.; Ustubioglu, B.; Ulutas, G. Mel spectrogram-based audio forgery detection using CNN. Signal Image Video Process. 2023, 17, 2211–2219. [Google Scholar] [CrossRef]
Liu, J.; Zhang, X. Truthful resource trading for dependent task offloading in heterogeneous edge computing. Future Gener. Comput. Syst. 2022, 133, 228–239. [Google Scholar] [CrossRef]
Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
Makhmudov, F.; Kutlimuratov, A.; Akhmedov, F.; Ostonov, A.; Islmuratov, S. Modeling speech emotion recognition via attention-oriented parallel CNN encoders. Electronics 2022, 11, 4047. [Google Scholar] [CrossRef]
Gu, X.; Fan, Z.; Zhu, S.; Dai, Q.; Tan, P. Cascade cost volume for high-resolution multi-view stereo and stereo matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2495–2504. [Google Scholar]
Chen, G.; Liao, Y.; Zhang, D.; Mahendren, J.; Bukhari, S.; Iqbal, S. Multimodal Emotion Recognition via the Fusion of Mamba and Liquid Neural Networks with Cross-Modal Alignment. Electronics 2025, 14, 3638. [Google Scholar] [CrossRef]
Wang, C.; Qian, J.; Wang, J.; Chen, Y. Illumination-Aware Cross-Modality Differential Fusion Multispectral Pedestrian Detection. Electronics 2023, 12, 3576. [Google Scholar] [CrossRef]
Zhou, H.; Du, J.; Zhang, Y.; Wang, Q.; Liu, Q.F.; Lee, C.H. Information fusion in attention networks using adaptive and multi-level factorized bilinear pooling for audio-visual emotion recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 2617–2629. [Google Scholar]
Snyder, D.; Garcia-Romero, D.; Sell, G.; Povey, D.; Khudanpur, S. X-vectors: Robust DNN embeddings for speaker recognition. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, Calgary, AB, Canada, 15–20 April 2018; pp. 5329–5333. [Google Scholar]
Ku, H.; Dong, W. Face recognition based on MTCNN and convolutional neural network. Front. Signal Process. 2020, 4, 37–42. [Google Scholar] [CrossRef]
Mcfee, B.; Raffel, C.; Liang, D.; Ellis, D.P.W.; McVicar, M.; Battenberg, E.; Nieto, O. librosa: Audio and music signal analysis in python. In Proceedings of the 14th Python in Science Conference, Austin, TX, USA, 6–12 July 2015; pp. 18–24. [Google Scholar]
Maountzouris, K.; Perikos, I.; Hatzilygeroudis, I. Speech emotion recognition using convolutional neural networks with attention mechanism. Electronics 2023, 12, 4376. [Google Scholar] [CrossRef]
Lee, S.; Han, D.K.; Ko, H. Fusion-ConvBERT: Parallel convolution and BERT fusion for speech emotion recognition. Sensors 2020, 20, 6688. [Google Scholar] [CrossRef] [PubMed]
Ahonen, T.; Hadid, A.; Pietikainen, M. Face description with local binary patterns: Application to face recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2006, 28, 2037–2041. [Google Scholar] [CrossRef] [PubMed]
Afchar, D.; Nozick, V.; Yamagishi, J.; Echizen, I. MesoNet: A compact facial video forgery detection network. In Proceedings of the 2018 IEEE International Workshop on Information Forensics and Security, Hong Kong, China, 11–13 December 2018; pp. 1–7. [Google Scholar]
Li, L.; Bao, J.; Zhang, T.; Yang, H.; Chen, D.; Wen, F.; Guo, B. Face X-ray for more general face forgery detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 5001–5010. [Google Scholar]
Zhou, P.; Han, X.; Morariu, V.I.; Davis, L.S. Two-stream neural networks for tampered face detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 1831–1839. [Google Scholar]
Todisco, M.; Delgado, H.; Evans, N. Constant Q cepstral coefficients: A spoofing countermeasure for automatic speaker verification. Comput. Speech Lang. 2017, 45, 516–535. [Google Scholar] [CrossRef]
Tak, H.; Patino, J.; Todisco, M.; Nautsch, A.; Evans, N.; Larcher, A. End-to-end anti-spoofing with RawNet2. In Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing, Toronto, ON, Canada, 6–11 June 2021; pp. 6369–6373. [Google Scholar]
Qi, H.; Guo, Q.; Juefei-Xu, F.; Xie, X.; Ma, L.; Feng, W.; Liu, Y.; Zhao, J. DeepRhythm: Exposing deepfakes with attentional visual heartbeat rhythms. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 4318–4327. [Google Scholar]
Mittal, T.; Bhattacharya, U.; Chandra, R.; Bera, A.; Manocha, D. Emotions don’t lie: An audio-visual deepfake detection method using affective cues. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 2823–2832. [Google Scholar]
Chugh, K.; Gupta, P.; Dhall, A.; Subramanian, R. Not made for each other-Audio-visual dissonance-based deepfake detection and localization. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 439–447. [Google Scholar]
Pan, Z.; Wang, Y.; Cao, Y.; Gui, W. VAE-based interpretable latent variable model for process monitoring. IEEE Trans. Neural Netw. Learn. Syst. 2023, 35, 6075–6088. [Google Scholar] [CrossRef]
Ren, Y.; Liu, A.; Mao, X.; Li, F. An intelligent charging scheme maximizing the utility for rechargeable network in smart city. Pervasive Mob. Comput. 2021, 77, 101457. [Google Scholar] [CrossRef]
Wang, J.; Lv, P.; Wang, H.; Shi, C. SAR-U-Net: Squeeze-and-Excitation Block and Atrous Spatial Pyramid Pooling Based Residual U-Net for Automatic Liver Segmentation in CT. Comput. Methods Programs Biomed. 2021, 208, 106268. [Google Scholar] [CrossRef] [PubMed]

Figure 1. ACE-Net: Overall architecture of the proposed multimodal emotional consistency discriminator network for deepfake detection, illustrating the key components including Unimodal Emotion Recognition, Multimodal Fusion, and Cross-Modal Emotional Consistency Discrimination.

Figure 2. MDCNN as a lightweight acoustic feature extractor, it leverages a parallel Multi-granularity Channel-Spatial Attention mechanism to dynamically focus on salient time–frequency patterns in speech.

Figure 3. Architecture of the Bidirectional Cross-Modal Attention Mechanism.

Figure 4. The architecture of the two core innovative components in FV-LiteNet, they should be listed as: (a) Ghost bottleneck used in FV-LiteNet; in the last two stride-2 the SE (squeeze–excitation) is removed to reduce parameters and latency; (b) The proposed spatial-channel attention head, which replaces the original classification layer. Abbreviations: Conv2D (standard convolution), DWConv (Depthwise Convolution), PWConv (Pointwise Convolution).

Figure 5. Speech–Text Model Confusion Matrix.

Figure 6. Facial Expression Model Confusion Matrix.

Table 1. Analytical Complexity and Parameter Estimates of ACE-Net Components.

Component	Efficiency Source	Params (M)	Analytical Reduction
MDCNN	3 × 3 DSC; GAP + GMP (r = 8)	0.019	≈1/9 MACs vs. standard 3 × 3
Keyframe-based	K keyframes instead of all T	–	×(K/T) ≈ 0.07–0.13
FV-LiteNet	Truncated GhostNet; last two SE removed	1.2	–
Fusion	Small inverted MLP on 4d-dim fused vector	0.59	–

Table 2. Comparison of Emotional Recognition Performance Evaluation (ACC %).

Model	CREMA-D	SAVEE	MELD
CNN	62.15	68.25	55.48
CNN + GRU	65.83	71.98	58.01
CNN + BERT	70.29	76.54	64.33
MDCNN	74.67	80.25	68.92

Table 3. Comparison of Emotional Recognition Performance Evaluation (ACC%).

Model	CREMA-D	SAVEE	MELD
LBP + SVM	88.10	82.45	54.18
FV-LiteNet	86.83	90.15	71.77

Table 4. Evaluation of Cross-modal Detection Performance for Different Forgery Types Paired.

Dataset	Pairing Type	ACC	Precision	Recall	F1-Score	AUC
SAVEE	Genuine Pairs	91.2	90.8	92	91.4	0.95
SAVEE	Emotion Tampering	85.6	86.5	84.2	85.3	0.91
SAVEE	Cross-Identity Spliced Forgery	88.1	87.6	89	88.3	0.92
CREMA-D	Genuine Pairs	90.5	90.9	90.2	90.5	0.95
CREMA-D	Emotion-Tampering	86.2	88.1	86	87	0.92
CREMA-D	Cross-Identity Spliced Forgery	88.9	88.5	89.4	88.9	0.94
MELD	Genuine Pairs	75.1	76.0	74.8	75.4	0.82
MELD	Emotion-Tampering	70.4	71.2	70.0	70.6	0.77
MELD	Cross-Identity Spliced Forgery	73.2	72.8	73.8	73.3	0.80

Table 5. Ablation Experiments of Different Fusion Methods and Discrimination Strategies.

Pairing Type	Fusion Method	ACC	Precision	Recall	F1-Score	AUC
Emotion Tampering	Concatenation+ Difference + Product	87.2	88.1	86	87	0.92
Emotion Tampering	Concatenation+ Difference	85.9	86.3	85.2	85.7	0.9
Emotion Tampering	Concatenation	82.5	83.1	82.0	82.5	0.88
Cross-Identity Spliced Forgery	Concatenation+ Difference + Product	90.3	89.8	91.2	90.5	0.94
Cross-Identity Spliced Forgery	Concatenation+ Difference	89.1	89.6	88.3	88.9	0.93
Cross-Identity Spliced Forgery	Concatenation	85.1	84.8	85.5	85.1	0.90

Table 6. Performance comparison (AUC) with existing methods on the DFDC dataset.

Category	Method	Dataset
Category	Method	DFDC
Visual—Artifact/Texture	MesoNet-4	0.753
	Face X-ray	0.809
	Two-stream CNN	0.614
Audio—Acoustic Artifact	CQCC-GMM	0.523
Audio—Acoustic Artifact	RawNet2	0.718
Multimodal—High-level Semantics	DeepRhythm	0.745
	Siamese	0.844
	MDS	0.915
Ours	ACE-Net	0.921

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Yu, S.; Chen, X.; Sheng, Y.; Zhang, H.; Li, X.; Yu, S. ACE-Net: A Fine-Grained Deepfake Detection Model with Multimodal Emotional Consistency. Electronics 2025, 14, 4420. https://doi.org/10.3390/electronics14224420

AMA Style

Yu S, Chen X, Sheng Y, Zhang H, Li X, Yu S. ACE-Net: A Fine-Grained Deepfake Detection Model with Multimodal Emotional Consistency. Electronics. 2025; 14(22):4420. https://doi.org/10.3390/electronics14224420

Chicago/Turabian Style

Yu, Shaoqian, Xingyu Chen, Yuzhe Sheng, Han Zhang, Xinlong Li, and Sijia Yu. 2025. "ACE-Net: A Fine-Grained Deepfake Detection Model with Multimodal Emotional Consistency" Electronics 14, no. 22: 4420. https://doi.org/10.3390/electronics14224420

APA Style

Yu, S., Chen, X., Sheng, Y., Zhang, H., Li, X., & Yu, S. (2025). ACE-Net: A Fine-Grained Deepfake Detection Model with Multimodal Emotional Consistency. Electronics, 14(22), 4420. https://doi.org/10.3390/electronics14224420

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

ACE-Net: A Fine-Grained Deepfake Detection Model with Multimodal Emotional Consistency

Abstract

1. Introduction

2. Related Works

2.1. Visual Modality Detection Methods

2.2. Audio Modality Detection Methods

2.3. Multimodal Emotional Consistency Detection Methods

3. Methods

3.1. Overall Framework

3.2. Speech–Text Emotion Feature Extraction Module

3.2.1. MDCNN Acoustic Branch

3.2.2. Bidirectional Cross-Modal Attention Mechanism

3.3. Dynamic–Temporal Facial Emotion Feature Extraction Module

3.3.1. Keyframe Selection

3.3.2. Lightweight Spatiotemporal Feature Extraction

3.4. Multimodal Emotional Consistency Discrimination

3.5. Training Strategy

3.6. Computational Efficiency Analysis

4. Results and Discussion

4.1. Experimental Datasets and Forgery Synthesis

4.1.1. Design Principles

4.1.2. Forgery Synthesis

4.1.3. Data Preprocessing

4.2. Experimental Setup

4.2.1. Model Parameters and Training Configuration

4.2.2. Evaluation Strategy and Metrics

4.3. Emotion Recognition Results Analysis

4.3.1. Speech–Text Model Experimental Results and Analysis

4.3.2. Facial Emotion Recognition Results and Analysis

4.4. Forgery Detection Results Analysis

4.4.1. Performance on Different Forgery Types

4.4.2. Ablation Study

4.4.3. Comparison with Existing Methods

5. Conclusions and Future Work

Author Contributions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

Abbreviations

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI