Modal-Guided Multi-Domain Inconsistency Learning for Face Forgery Detection
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This paper proposes MGDL-Net (Modal-Guided Domain Learning Network) for detecting DeepFakes across various modalities, including images, videos, and audio. The proposed network learns inconsistencies in spatial, temporal, and frequency domains, effectively detecting tampered traces both within and between modalities. Additionally, a multi-modal dataset was constructed to enable detailed detection of manipulations specific to each modality. Experimental results show that MGDL-Net outperforms existing state-of-the-art methods, suggesting that learning cross-modal inconsistencies is beneficial for multi-modal face forgery detection.
1) There is a lack of detailed explanation regarding the hyperparameters used during the model training process (e.g., learning rate, batch size) and the optimization techniques applied.
2) Although the proposed method claims to be superior to existing methods, there is a lack of in-depth analysis explaining the reasons for this superiority.
3) Although Grad-CAM results are included in the paper, there needs to be more explanation on how these results contribute to the model’s decision-making process. Discussions should be more specific about which features the model considers important, along with the visualization results.
4) Further discussion is needed on how the proposed method can be used in real-world applications, including the real-time processing capability and any performance constraints.
Author Response
Comments 1: There is a lack of detailed explanation regarding the hyperparameters used during the model training process (e.g., learning rate, batch size) and the optimization techniques applied.
Response: Thank you very much for your valuable suggestion. We have added a detailed explanation of the experimental settings and training process of the proposed MGDL-Net in Subsection 4.1, 'Implementation Details', in the revision, including the learning rate, batch size, weight decay, etc.
Comments 2: Although the proposed method claims to be superior to existing methods, there is a lack of in-depth analysis explaining the reasons for this superiority.
Response: Thank you very much for your valuable suggestion. In Section 4 ‘Experiment’ in the revision, we added more in-depth analysis of our experimental results in terms of precision and generalization to explain the rationality of the superior performance of the model.
Comments 3: Although Grad-CAM results are included in the paper, there needs to be more explanation on how these results contribute to the model’s decision-making process. Discussions should be more specific about which features the model considers important, along with the visualization results.
Response: Thank you very much for your valuable suggestion. The Grad-CAM results highlight the regions in the raw images that have a significant impact on the final decision of MGDL-Net, thereby providing visual explanations for the decision-making process. In Subsection 4.6, we added further discussion of our Grad-CAM visualizations to show the specific components that are essential and discriminative for modeling multi-modal inconsistency.
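For readers unfamiliar with the technique, Grad-CAM weights each convolutional activation map by the global-average-pooled gradient of the target score with respect to that map, sums the weighted maps, and applies a ReLU. The following self-contained NumPy sketch (a toy illustration, not the authors' implementation) shows the computation:

```python
import numpy as np

def grad_cam(activations, gradients):
    """Compute a Grad-CAM heatmap from a conv layer's activations and the
    gradients of the target class score w.r.t. those activations.

    activations, gradients: arrays of shape (channels, H, W).
    Returns an (H, W) map normalized to [0, 1].
    """
    # Channel weights alpha_k: global-average-pool the gradients.
    weights = gradients.mean(axis=(1, 2))  # shape (C,)
    # Weighted sum of activation maps, then ReLU to keep positive evidence.
    cam = np.maximum((weights[:, None, None] * activations).sum(axis=0), 0.0)
    # Normalize for visualization; guard against an all-zero map.
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam

# Toy example: 2 channels of 4x4 feature maps with constant values.
acts = np.ones((2, 4, 4))
grads = np.stack([np.full((4, 4), 0.5), np.full((4, 4), -0.2)])
heatmap = grad_cam(acts, grads)  # uniform map, normalized to 1.0
```

In a real detector the heatmap would be upsampled to the input resolution and overlaid on the face image, which is how visualizations like those in Subsection 4.6 are typically produced.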
Comments 4: Further discussion is needed on how the proposed method can be used in real-world applications, including the real-time processing capability and any performance constraints.
Response: Thank you very much for your valuable suggestion. In Section 5 in the modified version, we re-organized the “Conclusion” of the paper and added more discussion on the applications of our model, including its real-world applications, the limitations of the proposed MGDL-Net, and our future plans.
Reviewer 2 Report
Comments and Suggestions for Authors
The manuscript entitled "Modal-Guided Multi-Domain Inconsistency Learning for Face Forgery Detection" mainly proposes a novel unified neural network named MGDL-Net that (as the authors declare) is able to detect face-related input with flexible modalities and perceive both intra- and inter-domain inconsistency across unimodal, bimodal, and trimodal inputs.
Some suggestions for the authors:
- paper's structure should be presented by the end of Section 1;
- avoid referencing too many references in a single place; check line 26, where there are 9 references [3-11]; line 34, with 7 references [12-18]; and line 64, with 5 references [19-23]. References should be made for one particular idea extracted from a study (as is done in lines 91-156).
- Figures should have a title and number. The description should be given in the text, and the figure number should be referenced from that text. For Figures 1, 2, 3, 4, 5, 6 and 7, the explanations should be moved to paragraphs rather than being displayed below the figures.
- Similarly (as above) for Tables 1-6 !
- the conclusion section should be elaborated further
- the limitations of the study and future research paths should be added, perhaps in Section 5.
Author Response
Comments 1: Paper's structure should be presented by the end of Section 1.
- avoid referencing too many references in a single place; check line 26, where there are 9 references [3-11]; line 34, with 7 references [12-18]; and line 64, with 5 references [19-23]. References should be made for one particular idea extracted from a study (as is done in lines 91-156).
- Figures should have a title and number. The description should be given in the text, and the figure number should be referenced from that text. For Figures 1, 2, 3, 4, 5, 6 and 7, the explanations should be moved to paragraphs rather than being displayed below the figures.
- Similarly (as above) for Tables 1-6!
Response: Thank you very much for your valuable suggestion. In the modified version, we moved the descriptions of the figures and tables into the corresponding paragraphs and reduced the number of references cited at any single point by moving some references to Section 2. Moreover, we outlined the structure of our paper at the end of Section 1 according to your valuable advice.
Comments 2:
- the conclusion section should be more elaborated
- the limitations of the study and future path, should be added, perhaps in Section 5.
Response: Thank you very much for your valuable suggestion. In Section 5 in the modified version, we re-organized the “Conclusion” of the paper and added more discussions on the applications of our model, such as the inference time of each branch, limitations of the proposed MGDL-Net and future plans.
Reviewer 3 Report
Comments and Suggestions for Authors
1. A brief summary:
A unified multi-modal framework is introduced for detecting face forgery.
2. General concept comments
a) Weaknesses of the paper:
Further study ideas/plans or forecasts are not stated in the conclusion.
b) Strengths of the paper:
A useful and practical topic.
Discussions and results are explained sufficiently. Also, the chapters are organized logically to help readers understand easily.
Figures and tables are helpful for understanding and are addressed very well in the paragraphs. Related works are reviewed in detail.
c) Hypotheses/goals / research gap:
This paper addresses a gap in research on DeepFake detection methods, which often examine content of specific types, such as images, videos, or audio, to find forgeries; the goal here is to detect more complex forgeries that involve multiple types of media together.
d) Methodology
Methodologies and experiments are explained sufficiently. Experiments were carried out, and their results show that the proposed method performs better than many modern techniques.
e) Literature:
References are appropriate for the context of the study, and all references are cited in order.
3. Specific comments:
The conclusion chapter should have been written in more detail; limitations and future research are needed.
The language of the paper is clear, and researchers used appropriate English.
The article's purpose is applicable/practical and appropriate for Applied Science Journal.
This paper is recommended for publication.
Author Response
Comments: Further study ideas/plans or forecasts are not stated in conclusion. Conclusion chapter should have been written in more detail, limitations and future research are needed.
Response: Thank you very much for your valuable suggestion. In Section 5 in the modified version, we re-organized the “Conclusion” of the paper and added more discussions on the applications of our model, such as the inference time of each branch, limitations of the proposed MGDL-Net and future plans.
Reviewer 4 Report
Comments and Suggestions for Authors
In this manuscript, a Modal-Guided Domain Learning Network is introduced, a unified neural network characterized by spatial, temporal, and frequency branches. This architecture allows the network to process face-related inputs across various modalities and to detect inconsistencies within and between domains, including unimodal, bimodal, and tri-modal configurations. Heterogeneous Inconsistency Learning is also proposed, utilizing a three-level joint extraction approach to enhance the detection of inconsistencies from different perspectives and reduce static noise. Additionally, a multi-modal deepfake dataset has been developed to support the capabilities of the model.
The deployment of the proposed model is likely highly impractical as it requires extremely high computational and hardware resources.
It's crucial to understand that claims of the proposed method's superiority need to be substantiated. To truly validate this superiority, a comprehensive comparative analysis with state-of-the-art methods is indispensable. This rigorous validation process is not just a formality but a necessity to provide the reassurance and confidence needed in the research.
One significant issue that needs to be addressed is the need for more discussion about where and how the model does not work. This is not a weakness but an essential part of fully understanding the model's contours and ensuring that the audience is fully informed and aware of its limitations.
The performance metrics are unreliable: results were excellent on some criteria but wrong on others.
This is a problem in terms of scientific validation, because more information is needed to reproduce the experiments.
The validation methods they utilized were not enough to support how well the algorithm would perform in other everyday scenarios, casting doubt on its widespread application in practice.
In Section 4.6, the figure numbers need to be corrected.
Author Response
Comments 1: The deployment of the proposed model is likely highly impractical as it requires extremely high computational and hardware resources.
Response: Thank you very much for your valuable suggestion. We plan to deploy the proposed MGDL-Net in a client-server manner, since its computational requirements make it unsuitable for deployment on edge computing devices (e.g., NVIDIA Jetson TX2). However, as mentioned in Subsection 4.1, the pre-trained ResNet-18 and 3D-ResNet34 are adopted as the backbones of the respective branches in MGDL-Net. The proposed MGDL-Net adaptively selects the corresponding branch for prediction according to the modality of the input data. This mechanism allows our model to reduce the required computational resources in many scenarios. Even in the worst case, where input samples contain all three modalities and MGDL-Net runs all three branches, inference can be smoothly processed by a deep learning workstation with an NVIDIA GTX 1080Ti GPU. As we discuss in Section 5, our future work will focus on designing lightweight and real-time face forgery detection methods.
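The branch-selection mechanism described above can be sketched as a simple dispatch on the modalities present in the input. The function and branch names below are hypothetical illustrations of the idea, not the authors' code:

```python
def select_branches(sample):
    """Return the names of the branches needed for a given input sample.

    sample: dict whose optional keys 'image', 'video', and 'audio' hold
    the corresponding modality data. Only matching branches are run,
    which is how computation is saved for unimodal or bimodal inputs.
    """
    branches = []
    if 'image' in sample:
        branches.append('spatial')    # e.g. a ResNet-18 image branch
    if 'video' in sample:
        branches.append('temporal')   # e.g. a 3D-ResNet34 video branch
    if 'audio' in sample:
        branches.append('frequency')  # audio/frequency-domain branch
    return branches

# A unimodal image input only triggers the spatial branch:
branches = select_branches({'image': object()})  # -> ['spatial']
```

Under this scheme, only a trimodal input activates all three branches, which corresponds to the worst case discussed in the response.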
Comments 2: It's crucial to understand that claims of the proposed method's superiority need to be substantiated. To truly validate this superiority, a comprehensive comparative analysis with state-of-the-art methods is indispensable. This rigorous validation process is not just a formality but a necessity to provide the reassurance and confidence needed in the research.
Response: Thank you very much for your valuable suggestion. In Section 4.3 “Comparisons with SOTA Methods” in the modified version, we added three more state-of-the-art methods for comprehensive comparison, namely AVoiD-DF [1], VFD [2], and CWM [3]. The proposed MGDL-Net consistently ranks first on the corresponding benchmarks in terms of the AUC and Accuracy metrics.
[1] Yang, Wenyuan, et al. "Avoid-df: Audio-visual joint learning for detecting deepfake." IEEE Transactions on Information Forensics and Security 18 (2023): 2015-2029.
[2] Cheng, Harry, et al. "Voice-face homogeneity tells deepfake." ACM Transactions on Multimedia Computing, Communications and Applications 20.3 (2023): 1-22.
[3] Zou, Heqing, et al. "Cross-Modality and Within-Modality Regularization for Audio-Visual Deepfake Detection." ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024.
Comments 3: One significant issue that needs to be addressed is the need for more discussion about where and how the model does not work. This is not a weakness but an essential part of fully understanding the model's contours and ensuring that the audience is fully informed and aware of its limitations.
Response: Thank you very much for your valuable suggestion. In Section 5 in the modified version, we re-organized the “Conclusion” of the paper and added more discussion on the applications of our model, including the limitations of the proposed MGDL-Net, to ensure the audience fully understands the model's contours.
Comments 4: The performance metrics are unreliable: results were excellent on some criteria but wrong on others. This is a problem in terms of scientific validation, because more information is needed to reproduce the experiments. The validation methods utilized were not enough to support how well the algorithm would perform in other everyday scenarios, casting doubt on its widespread application in practice.
Response: Thank you very much for your valuable suggestion. Following most face forgery detection works, we rigorously validated the proposed method using the official metrics (e.g., Acc, AUC) provided by the benchmark publishers.
Comments 5: In Section 4.6, the figure numbers need to be corrected.
Response: Thank you very much for your valuable suggestion. In the modified version, we corrected the incorrect figure numbers in Section 4.6.
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
The author implemented my suggestions.
There are still issues with the references. Some of them appear as question marks.
Please correct them! Good luck!
Author Response
Question 1: There are still issues with the references. Some of them appear as question marks.
Response: Thank you very much for your valuable suggestion. We have corrected these issues in the references section. In our revised manuscript, two question marks still appear, in references [11] and [51], because the titles of those two papers originally contain question marks.