ACE-Net: A Fine-Grained Deepfake Detection Model with Multimodal Emotional Consistency
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
See attachment.
Comments for author File: Comments.pdf
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
While ACE-Net presents an interesting approach to Deepfake detection by leveraging multimodal emotional consistency, several aspects of the paper could be improved:
1) While the paper describes MDCNN as novel, the components (depthwise separable convolutions, global average/max pooling, an MLP for channel attention, a 3x3 DSC for spatial attention) are individually well-established building blocks (e.g., Squeeze-and-Excitation networks, CBAM). The specific combination might be unique, but the paper doesn't sufficiently explain why this configuration is uniquely superior or how it fundamentally advances the state of the art beyond a careful assembly of existing techniques. The "multi-granularity" aspect feels somewhat generic. (An illustrative assembly of exactly these standard components is given in the first sketch after this list.)
2) The "coarse-to-fine" strategy using optical flow for motion and MobileNetV3 for expression verification, while practical, isn't groundbreaking. Optical flow is a classic motion-detection method, and MobileNetV3 is a standard lightweight CNN; the novelty lies more in the application within this specific pipeline than in a new algorithmic contribution. (A sketch of such a classic optical-flow screening step follows this list.)
3) Concatenation, element-wise difference, and element-wise product are common operations for fusing features. The claim of "deep fusion" needs stronger justification beyond combining these operations; as presented, it is an exhaustive combination of common fusion strategies rather than a fundamentally new "deep" fusion mechanism. (See the fusion sketch after this list.)
4) The paper describes what components are used and how they are put together, but often lacks a deeper theoretical explanation of why certain architectural choices are made and how they fundamentally address the limitations of previous methods beyond empirical performance. For instance, why is the specific multi-granularity attention in MDCNN better than other attention mechanisms for emotion features?
5) The "emotional consistency" concept is intuitive, but the paper doesn't delve into a formal or computational definition of what constitutes "consistency" or "inconsistency" in a way that truly informs the model's design beyond training on forged vs. genuine data.
6) The term "lightweight" is used frequently (e.g., lightweight MDCNN, lightweight GhostNet backbone, lightweight joint spatial-channel attention). While efficiency is mentioned, the paper doesn't provide concrete metrics (e.g., FLOPs, parameter count, inference speed on specific hardware) for these "lightweight" components relative to alternatives until the full model performance is discussed, which makes it hard to gauge the true efficiency benefit of these individual design choices. (The parameter-counting sketch after this list illustrates the kind of concrete reporting expected.)
7) The textual features are derived from "ASR-derived text embeddings" and fed into a BERT model. While BERT is powerful, the paper doesn't discuss the potential weaknesses or biases introduced by the ASR model itself (e.g., errors in transcription of emotionally charged speech, robustness to noise). The interaction between text and acoustic features could be explored more deeply from a linguistic or paralinguistic perspective.
8) The "synthetic forgery dataset" is constructed from existing genuine corpora (CREMA-D, MELD, SAVEE). While useful for controlled experiments, this might not fully capture the complexity and diversity of real-world Deepfakes created by advanced generative models. The generalization to "unseen manipulation techniques" (mentioned in the introduction) is a critical challenge that synthetic datasets may not fully address.
9) The DFDC benchmark is good, but a wider range of benchmarks, including those focusing on specific types of forgery (e.g., highly convincing face swaps vs. subtle emotional manipulations), would strengthen the evaluation.
10) While the ablation study for fusion methods (Table 4) shows performance gains, the "multidimensional fusion" (concatenation, difference, product) still feels somewhat like throwing everything at the problem. The paper doesn't clearly articulate what specific type of emotional conflict or correlation each fusion operation is uniquely designed to capture, or how the model learns to weigh these different fused representations. It's an empirical success but lacks clear mechanistic insight.
11) Figure 1 (Overall Framework) is dense and could benefit from clearer visual flow and labeling, especially around the "Multi-dimensional Fused Embedding" and "Multi-aspect Fusion."
12) Figure 4 (the FV-LiteNet architecture) has labels such as "Cnov2d" and "DWCnov" (presumably "Conv2d" and "DWConv") that are not immediately intuitive or defined.
13) Figures 5 and 6, the confusion matrices, are very hard to read due to small font sizes and dense numbers. They would benefit from a more visually engaging representation (e.g., heatmaps) or a clear explanation of how to interpret the color scaling.
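The following minimal PyTorch sketch assembles the building blocks named in comment 1 (global average/max pooling with a shared MLP for channel attention, followed by a 3x3 depthwise separable convolution for spatial attention). All class, layer, and parameter names are illustrative assumptions, not taken from the ACE-Net paper; the point is only that the described configuration can be reproduced from standard components.

```python
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    """Illustrative channel + spatial attention block (CBAM-like), assembled
    from the standard components named in comment 1. Layer sizes and names
    are assumptions for illustration, not ACE-Net's actual MDCNN module."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        # Channel attention: GAP and GMP descriptors passed through a shared MLP.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        # Spatial attention: 3x3 depthwise separable convolution over pooled maps.
        self.spatial = nn.Sequential(
            nn.Conv2d(2, 2, kernel_size=3, padding=1, groups=2),  # depthwise 3x3
            nn.Conv2d(2, 1, kernel_size=1),                       # pointwise 1x1
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        # Channel attention weights from average- and max-pooled descriptors.
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        # Spatial attention map from channel-wise mean and max.
        pooled = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(pooled))
```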
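Likewise, the coarse screening step discussed in comment 2 can be approximated entirely with classic tooling; the sketch below gates frame pairs by dense optical-flow magnitude (OpenCV's Farnebäck method) before any finer expression check. The threshold and the gating rule are assumptions for illustration, not the authors' pipeline.

```python
import cv2
import numpy as np

def coarse_motion_gate(prev_frame: np.ndarray, curr_frame: np.ndarray,
                       threshold: float = 1.0) -> bool:
    """Illustrative coarse stage: flag a frame pair for fine-grained expression
    verification only if its mean dense optical-flow magnitude exceeds a
    threshold. The threshold value is an assumption, not ACE-Net's setting."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr_frame, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    magnitude = np.linalg.norm(flow, axis=2)  # per-pixel motion magnitude
    return float(magnitude.mean()) > threshold
```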
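Comments 3 and 10 both concern the fusion operations. The sketch below shows the three operations side by side, annotated with common (but not author-stated) readings of what each emphasizes; it illustrates the kind of mechanistic explanation the comments ask for, not the paper's actual fusion module.

```python
import torch

def multi_aspect_fusion(audio_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """Fuse two equally sized emotion embeddings with the three operations
    named in comments 3 and 10. The interpretive comments are common readings
    of these operations, not claims taken from the ACE-Net paper."""
    concat = torch.cat([audio_emb, text_emb], dim=-1)  # preserves both views unchanged
    diff = audio_emb - text_emb                        # often read as cross-modal disagreement
    prod = audio_emb * text_emb                        # often read as cross-modal co-activation
    return torch.cat([concat, diff, prod], dim=-1)
```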
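Finally, the concrete efficiency reporting requested in comment 6 is straightforward to produce: parameter counts can be taken directly from the model in plain PyTorch, as sketched below, while FLOPs would additionally require a profiler (e.g., fvcore or thop). This is an illustration of the requested measurement, not a measurement of ACE-Net itself.

```python
import torch.nn as nn

def trainable_parameter_count(module: nn.Module) -> int:
    """Count trainable parameters of a module; one of the concrete metrics
    requested in comment 6. FLOPs and measured inference latency on named
    hardware would complete the efficiency picture."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)
```

Reporting this number for the MDCNN, the GhostNet backbone, and the attention modules alongside the alternatives they replace would let readers gauge the claimed "lightweight" benefit directly.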
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
Further details are given in the attached report, which reflects updates since the first review. In summary, I appreciate the authors' substantial and thoughtful revisions. The manuscript has improved in clarity, structure, reproducibility, and methodological rigor. The expanded explanations in Sections 4.2 and 4.3, the inclusion of proper citations, and the detailed baseline specifications directly address prior concerns. Only minor improvements remain; overall, this version is well prepared for publication following minor language and editorial revisions.
Comments for author File: Comments.pdf
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
The manuscript should include a statement on the use of generative AI. Also, there should be a table of abbreviations at the end of the manuscript.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf