Article
Peer-Review Record

Novel Deepfake Image Detection with PV-ISM: Patch-Based Vision Transformer for Identifying Synthetic Media

Appl. Sci. 2025, 15(12), 6429; https://doi.org/10.3390/app15126429
by Orkun Çınar * and Yunus Doğan
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3:
Submission received: 24 April 2025 / Revised: 1 June 2025 / Accepted: 5 June 2025 / Published: 7 June 2025
(This article belongs to the Special Issue Application of Artificial Intelligence in Image Processing)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

Dear authors, I have the following comments:

1. You should clarify the architectural novelty beyond ViT. What does this paper contribute?

2. Add at least 2, 3 more datasets for generalization, which is very critical.

3. Improve academic writing and proofreading. And please remove unrelated citations.

Comments on the Quality of English Language

Proofread it properly

Author Response

Response to Reviewer 1

We sincerely thank Reviewer 1 for the thoughtful and constructive comments.

For the reviewer’s convenience, all newly added or revised sections in the manuscript have been highlighted in purple within the PDF. These highlights correspond to the reviewer’s original suggestions and were made explicitly in response to the comments provided.

Below, we address each point in detail and describe the corresponding revisions made in the manuscript.

 

Comment 1:

“You should clarify the architectural novelty beyond ViT. What does this paper contribute?”

Response:
We appreciate the reviewer’s request for a clearer articulation of the architectural novelty beyond standard Vision Transformer (ViT) frameworks. In response, a new subsection titled “3.2.3. Architectural Contributions beyond ViT” has been added at Line 301. This section details the specific architectural distinctions of PV-ISM, including:

  • The integration of customized Dense and Dropout layers both within and after the transformer block sequence.
  • The use of a dual-logit output strategy in place of a standard single-neuron sigmoid output, which enhances flexibility in post-processing.
  • The ability to achieve robust performance without reliance on pretraining or transfer learning.

These modifications are summarized and justified within the new subsection to clearly differentiate PV-ISM from conventional ViT-based models.
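To make these points concrete, a minimal Keras-style sketch of such a patch-based transformer classifier is shown below. It is illustrative only: the image and patch sizes, projection width, number of transformer blocks, layer widths, and dropout rates are placeholder assumptions, not the exact PV-ISM configuration reported in the manuscript.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Placeholder hyperparameters (assumptions for illustration, not the PV-ISM values).
IMAGE_SIZE, PATCH_SIZE, PROJECTION_DIM, NUM_BLOCKS = 224, 16, 64, 4
NUM_PATCHES = (IMAGE_SIZE // PATCH_SIZE) ** 2


class PatchEncoder(layers.Layer):
    """Adds a learned positional embedding to the projected patch tokens."""

    def __init__(self, num_patches, projection_dim, **kwargs):
        super().__init__(**kwargs)
        self.num_patches = num_patches
        self.position_embedding = layers.Embedding(num_patches, projection_dim)

    def call(self, patch_tokens):
        positions = tf.range(start=0, limit=self.num_patches, delta=1)
        return patch_tokens + self.position_embedding(positions)


def transformer_block(x):
    # Pre-norm multi-head self-attention with a residual connection.
    attn_in = layers.LayerNormalization(epsilon=1e-6)(x)
    attn = layers.MultiHeadAttention(num_heads=4, key_dim=PROJECTION_DIM,
                                     dropout=0.1)(attn_in, attn_in)
    x = layers.Add()([attn, x])
    # Dense + Dropout sub-layers inside the block.
    mlp_in = layers.LayerNormalization(epsilon=1e-6)(x)
    mlp = layers.Dense(PROJECTION_DIM * 2, activation="gelu")(mlp_in)
    mlp = layers.Dropout(0.1)(mlp)
    mlp = layers.Dense(PROJECTION_DIM)(mlp)
    return layers.Add()([mlp, x])


inputs = layers.Input(shape=(IMAGE_SIZE, IMAGE_SIZE, 3))
# Non-overlapping patch extraction and linear projection.
tokens = layers.Conv2D(PROJECTION_DIM, kernel_size=PATCH_SIZE, strides=PATCH_SIZE)(inputs)
tokens = layers.Reshape((NUM_PATCHES, PROJECTION_DIM))(tokens)
tokens = PatchEncoder(NUM_PATCHES, PROJECTION_DIM)(tokens)

for _ in range(NUM_BLOCKS):
    tokens = transformer_block(tokens)

# Additional Dense/Dropout layers after the transformer stack.
x = layers.LayerNormalization(epsilon=1e-6)(tokens)
x = layers.Flatten()(x)
x = layers.Dropout(0.5)(x)
x = layers.Dense(128, activation="gelu")(x)
x = layers.Dropout(0.3)(x)
# Dual-logit output (one logit per class) instead of a single sigmoid neuron.
outputs = layers.Dense(2)(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=["accuracy"])
```

The dual-logit head pairs naturally with a from-logits cross-entropy loss and leaves room for post-processing such as temperature scaling or per-class thresholding, which a single sigmoid neuron does not offer as directly.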

 

Comment 2:

“Add at least 2, 3 more datasets for generalization which is very critical.”

Response:
Thank you for highlighting the importance of evaluating model generalization. In response, we have introduced a new section titled “5. Supplementary Experimental Studies with PV-ISM” at Line 491, which reports extensive evaluations using two additional datasets:

  • RVF-10K (Real vs Fake Faces): A dataset of 10,000 high-resolution face images for binary real/fake classification.
  • Art Images: A curated dataset focusing on stylistically abstract content, challenging the model’s robustness under non-standard visual domains.

The results, along with corresponding figures and tables, are presented across pages 16 to 21. These experiments validate the generalization capacity of PV-ISM across diverse visual contexts, as requested.

 

Comment 3:

“Improve academic writing and proofreading. And please remove unrelated citations.”

Response:
We acknowledge the importance of clear academic writing and citation accuracy. Accordingly, the entire manuscript has undergone a thorough round of professional proofreading. We also employed the university’s licensed Grammarly tool to enhance linguistic precision and consistency. In addition, a careful review of all references was conducted, and irrelevant or misplaced citations (references 29, 30, and 32) were corrected to ensure academic rigor.

 

Once again, we thank the reviewer for their valuable feedback, which has significantly contributed to the improvement of this manuscript.

 

Author Response File: Author Response.docx

Reviewer 2 Report

Comments and Suggestions for Authors

The authors presented a deep learning method to detect deepfake images. The topic is interesting and useful. The methods used in the paper are not very innovative, but this work still provides a reference for others to quickly set up a framework for deepfake detection. Minor modification suggestions:

(1) The author may consider having more augmentation methods, such as color (brightness, contrast, and saturation) modification.

(2) The authors may add some discussions about which part of the network is the most important part for their technology, like what would be the gain by having this structure.

Author Response

Response to Reviewer 2

We sincerely thank you for your valuable and constructive comments. Your insightful suggestions have helped us improve the clarity and completeness of our manuscript. To facilitate your review, we have highlighted all newly added content in the revised manuscript in brown, indicating the sections that were specifically enhanced following your insightful suggestions.

Below, we provide detailed responses to each of your comments along with the corresponding revisions implemented.

 

Comment 1:

“The author may consider having more augmentation methods, such as, color (brightness, contrast and saturation) modification.”

Response:

Thank you for this valuable suggestion. We have expanded the data augmentation pipeline accordingly. Specifically, at line 344, the following sentence has been added:

“Additionally, geometric data augmentation techniques, such as random horizontal flipping and rotations, and color-based augmentations, such as random brightness, contrast, and saturation adjustments, were applied to enhance model generalization and robustness.”

This addition strengthens the robustness of our model by including relevant color augmentation methods.
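For illustration, a minimal sketch of such a combined geometric and color augmentation pipeline (written here with TensorFlow/Keras) is given below; the specific flip, rotation, and jitter ranges are assumptions for the example rather than the exact values used in our experiments.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Geometric augmentations as Keras preprocessing layers (rotation factor is a placeholder).
geometric_augment = tf.keras.Sequential([
    layers.RandomFlip("horizontal"),
    layers.RandomRotation(0.05),  # up to roughly +/- 18 degrees
])


def color_augment(image):
    """Random brightness, contrast, and saturation jitter on images scaled to [0, 1]."""
    image = tf.image.random_brightness(image, max_delta=0.2)
    image = tf.image.random_contrast(image, lower=0.8, upper=1.2)
    image = tf.image.random_saturation(image, lower=0.8, upper=1.2)
    return tf.clip_by_value(image, 0.0, 1.0)


def augment(image, label):
    image = geometric_augment(image, training=True)
    image = color_augment(image)
    return image, label

# Applied only to the training split of a tf.data pipeline, e.g.:
# train_ds = train_ds.map(augment, num_parallel_calls=tf.data.AUTOTUNE)
```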

 

Comment 2:

“The authors may add some discussions about which part of the network is the most important part for their technology, like what would be the gain by having this structure.”

Response:

We appreciate the insightful comment regarding the key architectural components of our model. In response, at line 488, we included the following discussion:

“Based on performance trends and model behavior, the patch encoding and transformer-based attention layers appear to be the most impactful. The patch-wise tokenization enables fine-grained feature localization, while the attention mechanism enhances global feature correlation, both essential for distinguishing subtle generative artifacts.”

Moreover, supportive information elaborating on these architectural contributions can be found under the newly introduced subsection “3.2.3. Architectural Contributions beyond ViT” at line 302. These additions aim to clarify the advantages and critical components of our PV-ISM architecture.
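As a small illustration of the patch-wise tokenization referred to above, the sketch below splits a batch of images into non-overlapping patches using tf.image.extract_patches; the 224×224 input size and 16×16 patch size are assumptions for the example, not necessarily the settings used in the paper.

```python
import tensorflow as tf

def to_patch_tokens(images, patch_size=16):
    """Split images into non-overlapping patches and flatten each patch into a token."""
    patches = tf.image.extract_patches(
        images=images,
        sizes=[1, patch_size, patch_size, 1],
        strides=[1, patch_size, patch_size, 1],
        rates=[1, 1, 1, 1],
        padding="VALID",
    )
    num_patches = patches.shape[1] * patches.shape[2]
    patch_dim = patches.shape[-1]  # patch_size * patch_size * channels
    return tf.reshape(patches, [-1, num_patches, patch_dim])

# Example: a 224x224 RGB image yields 196 tokens of length 768.
dummy = tf.zeros([2, 224, 224, 3])
print(to_patch_tokens(dummy).shape)  # (2, 196, 768)
```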

 

Once again, we thank the reviewer for their valuable feedback, which has significantly contributed to the improvement of this manuscript.

 

Author Response File: Author Response.docx

Reviewer 3 Report

Comments and Suggestions for Authors

The manuscript introduces a new method of detecting AI-generated images on a custom Vision Transformer (ViT) model, PV-ISM. The research is well-organized, and it gives useful information about the efficiency of transformer-based architectures in the detection of synthetic images.

  1. The Abstract demonstrates the purpose of the research, the detection of AI-generated images, and describes the proposed PV-ISM method. It shows the performance of PV-ISM with 96.60% accuracy and its superiority to ResNet50 (93.32%) and other existing methods. The abstract gives a brief description of PV-ISM’s architecture, mentioning key components such as patch extraction, positional encoding, transformer blocks, and training efficiency.

Suggestions:

  • The abstract is concerned with technical performance but lacks the emphasis on practical importance.
  • Include a sentence as to why detecting synthetic images is important (e.g., preventing deepfake misinformation, authenticating digital media).

 

 

  2. The Introduction clearly sets out the wide range of applications of AI in visual data analysis with proper examples in various fields. It smoothly flows into the issues of synthetic media, which highlights the societal risks (misinformation, loss of trust) and technical needs. However, the introduction can be greatly improved in several aspects to make it clearer, more focused, and more academic.

Suggestions:

  • The introduction starts with a wide overview of AI applications, but it takes a long period to get down to the synthetic media detection. The connection between the overall progress of AI and the particular need for PV-ISM can be closer.
  • The introduction contains an extensive list of AI applications in all fields (arts, biology, medicine, etc.), which, although impressive, may divert attention from the paper’s real subject. This list can be shortened or shifted to another section (e.g., Related Work), which would leave space for a more compact build-up to the paper’s research question.

 

  3. The Related Works section provides a comprehensive overview of existing methods of AI-generated image detection, including transfer learning, hybrid approaches, and emerging techniques such as prototype-based classification. It identifies the main trends, including the dominance of DenseNet/ResNet models and the inadequacy of benchmark datasets.

 

Suggestions:

  • The section reiterates similar aspects of transfer learning’s effectiveness (e.g., lines 109–111, 147–153). Merge into a single subsection with a table of model performances.
  • Line 137: There is a repetition of "and": "and and frequency-based" → "and frequency-based".

 

  4. The Materials and Methods section clearly describes the data augmentation pipeline and its role in preventing overfitting. It includes a full description of the ResNet-50 transfer learning approach, including preprocessing, feature extraction, and hyperparameter tuning. It explicitly compares PV-ISM (ViT-based) with ResNet-50 (CNN-based), which justifies the need for transformer-based approaches.

Suggestions:

  • “Recent researches” should be “Recent research”.
  • There are several explanations which duplicate the concepts such as normalization and patch-based modeling which can be shortened.
  • The process of self-attention and transformer encoding is explained multiple times in different contexts, consider unifying this explanation under a single header.

 

  5. The Results and Discussion section provides a detailed and properly organized description of the experimental framework. It specifically describes the comparison of the proposed Vision Transformer-based model (PV-ISM) and the baseline ResNet50 using transfer learning. The section shows good methodological depth, with equations, figures, and extensive descriptions of hyperparameters.

Suggestions:

  • "Converesly" → "Conversely".
  • The risk of overfitting for more epochs is observed but not examined. Was early stopping used? How was the best number of epochs found? Outline the Mitigation strategies (e.g., validation-based early stopping, dropout rates) briefly.
  • Even though figures are cited, their descriptions are minimal in the text. The authors should incorporate the figures more into the discussion, and emphasize the trends, anomalies, or the insights from the confusion matrix.

 

  6. Conclusion

Suggestion:

  • Replace redundant performance statements with wider implications.

  7. Technical and Language Corrections

Suggestions:

  • "Recent researches..." → should be "Recent research..."
  • "never-overlapping patches" → should be "non-overlapping patches."
  • "has 4 main element" → should be "has four main elements."
  • "Converesly" → "Conversely".
  • "and and frequency-based" → "and frequency-based".
  • Correct grammatical errors and simplify complex sentences for clarity and ease of understanding.
  • Pay attention to consistency (terms, abbreviations) and technical accuracy (equations, hyperparameters).

  8. Figures and Tables

Suggestions:

  • Introduce figures and diagrams in the Introduction section for better understanding.
  • Some figure captions end with a full stop, whereas others do not. For uniformity and compliance with academic formatting instructions, follow one style for every figure caption, concluding each with a period.

The manuscript offers a current, technically viable contribution to the area of deepfake image detection via a Vision Transformer architecture. It provides important insight into patch-based processing and produces better results than conventional transfer-learning methods. Nevertheless, to increase the clarity and academic value of the paper, improvements to structural organization, figure integration, language consistency, and technical explanation are recommended. These revisions would significantly elevate the manuscript's quality and impact.

Author Response

Response to Reviewer 3
First and foremost, we would like to express our sincere gratitude for your valuable and constructive feedback. To facilitate your review, we have highlighted all newly added content in the revised manuscript in blue, indicating the sections that were specifically enhanced following your insightful suggestions.
Below, we provide detailed responses to each of your comments along with the corresponding revisions implemented.
Comment 1:
“The abstract is concerned with technical performance but lacks the emphasis on practical importance. Include a sentence as to why detecting synthetic images is important (e.g., preventing deepfake misinformation, authenticating digital media).”
Response:
We have added a sentence emphasizing the practical importance of detecting synthetic images, highlighting applications such as preventing deepfake misinformation and ensuring the authenticity of digital media. This addition aims to contextualize the technical performance within real-world relevance (line 2).
Comment 2:
“The introduction starts with a wide overview of AI applications, but it takes a long period to get down to the synthetic media detection. The connection between the overall progress of AI and the particular need for PV-ISM can be closer.
The introduction contains an extensive list of AI applications in all fields (arts, biology, medicine, etc.), which, although impressive, may divert attention from the paper’s real subject. This list can be shortened or shifted to another section (e.g., Related Work), which would leave space for a more compact build-up to the paper’s research question.”
Response:
Following your recommendation, the extensive list of AI applications has been shortened and, where appropriate, moved to the Related Works section to maintain focus. We have also strengthened the connection between general AI advancements and the specific need for our proposed PV-ISM method, making the narrative more direct and concise (line 26).
Comment 3:
“The section reiterates similar aspects of transfer learning’s effectiveness (e.g., lines 109–111, 147–153). Merge into a single subsection with a table of model performances.”
Response:
Repetitive statements regarding the effectiveness of transfer learning were merged into a single subsection accompanied by a comparative table summarizing model performances. Minor typographical errors such as “and and frequency-based” have been corrected (lines 140 and 158).
Comment 4:
“Recent researches” should be “Recent research
There are several explanations which duplicate the concepts such as normalization and patch-based modeling which can be shortened.
The process of self-attention and transformer encoding is explained multiple times in different contexts, consider unifying this explanation under a single header.”
Response: The phrase “Recent researches” was corrected to “Recent research.” (line 162). 
Redundant explanations, especially related to normalization and patch-based modeling, were condensed to improve clarity (line 207). 
Additionally, we unified the descriptions of self-attention and transformer encoding under a dedicated subsection titled “3.2.2 Patch Encoding and Transformer Block” to avoid repetition (line 261).
Comment 5:
“Converesly" → "Conversely
The risk of overfitting for more epochs is observed but not examined. Was early stopping used? How was the best number of epochs found? Outline the Mitigation strategies (e.g., validation-based early stopping, dropout rates) briefly.
Even though figures are cited, their descriptions are minimal in the text. The authors should incorporate the figures more into the discussion, and emphasize the trends, anomalies, or the insights from the confusion matrix.”
Response: The typo “Converesly” was corrected to “Conversely.” (line 402) 
We have expanded the discussion regarding the risk of overfitting, detailing the early stopping strategy and other mitigation techniques employed during training (line 405). 
Furthermore, figure references were integrated more thoroughly into the text with enhanced analysis of trends and insights derived from confusion matrices (lines 462 and 483).
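For reference, a minimal sketch of validation-based early stopping in Keras is shown below; the monitored metric, patience, and epoch budget are placeholder assumptions rather than the exact settings reported in the revised manuscript.

```python
from tensorflow.keras import callbacks

# Stop training when validation loss stops improving and keep the best weights
# (patience value is an assumption for illustration).
early_stop = callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                     restore_best_weights=True)

# model.fit(train_ds, validation_data=val_ds, epochs=50, callbacks=[early_stop])
```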

Comment 6:
“Conclusion”
Response:
Redundant performance statements were replaced with broader implications of our findings, emphasizing the significance and potential impact of the proposed approach (line 640).
Comment 7:
“Technical and Language Corrections
"Recent researches..." → should be "Recent research..."
"never-overlapping patches" → should be "non-overlapping patches."
"has 4 main element" → should be "has four main elements."
"Converesly" → "Conversely".
"and and frequency-based" → "and frequency-based".
Correct grammatical errors and simplify complex sentences for clarity and ease of understanding.
Attention to consistency (terms, abbreviations), technical accuracy (equations, hyperparameters).”
Response:
Following your valuable comments on language accuracy and clarity, we have carefully reviewed the manuscript and implemented the following specific corrections:
The phrase “Recent researches” was corrected to “Recent research” to conform with standard academic usage and singular collective noun form.
The term “never-overlapping patches” was replaced with the technically accurate expression “non-overlapping patches.”
The phrase “has 4 main element” was revised to “has four main elements,” ensuring proper number agreement and spelling out the numeral for stylistic consistency.
The misspelled word “Converesly” was corrected to “Conversely.”
The duplicated phrase “and and frequency-based” was fixed to “and frequency-based.”
In addition to these specific corrections, we conducted a thorough proofreading of the entire manuscript, addressing grammatical errors, improving sentence structures, and simplifying complex sentences to enhance readability and comprehension.
We also paid close attention to the consistency of terms and abbreviations throughout the paper to maintain uniformity.
Technical accuracy was verified, including correctness in equations and hyperparameter descriptions, to ensure clarity and precision.
We acknowledge the importance of clear academic writing. Accordingly, the entire manuscript has undergone a thorough round of professional proofreading. We also employed the university’s licensed Grammarly tool to enhance linguistic precision and consistency.
Comment 8:
“Figures and Tables
Introduce figures and diagrams in the introduction section for better understanding.
In the case of figure captions, some of them are ended using a full stop, whereas others are not. For uniformity and compliance with the academic formatting instructions, it is advised to follow one style for every figure caption, one that should be concluded with a period.”
Response:
We standardized figure captions to consistently end with a period, ensuring compliance with academic formatting. Additionally, introductory figures were incorporated into the Introduction section to aid conceptual understanding.
Once again, we thank the reviewer for their valuable feedback, which has significantly contributed to the improvement of this manuscript.

Author Response File: Author Response.docx

Round 2

Reviewer 3 Report

Comments and Suggestions for Authors

No further comments.
