Article
Peer-Review Record

Fake News Detection Through LLM-Driven Text Augmentation Across Media and Languages

Mach. Learn. Knowl. Extr. 2026, 8(4), 103; https://doi.org/10.3390/make8040103
by Abdul Sittar 1,*, Mateja Smiljanic 2, Alenka Guček 1 and Marko Grobelnik 1
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 2 March 2026 / Revised: 8 April 2026 / Accepted: 9 April 2026 / Published: 15 April 2026

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

In general, the topic raised by the authors is interesting and topical. Fake news detection combined with LLM-based dataset generation is an interesting, although not new, idea (LLMs have long been used for text dataset augmentation, perhaps not for fake news but for other tasks; this should be reflected in the review section), and it is promising and interesting for readers.

The manuscript is well organized and follows scientific style, and I would say that there is only one main concern regarding acceptance (minor remarks are provided below):

  • The authors state that "random oversampling improves accuracy by 1.46%, while synthetic augmentation yields a slightly higher improvement of 1.58%." These and similar statements in the abstract, text, and conclusions are not statistically reliable: since cross-validation is not used (or at least not described), the statements are based on a single random result, and a different initial split of the data may not give any increase at all. Discussing 0.12% differences has no value. I would suggest either performing several experiments to obtain statistically reliable results, or removing the statements about the improvements achieved and providing strong motivation for why cross-validation was not possible.

Minor remarks on the text:

  • "Given the heterogeneous nature of global information ecosystems, fake news detection models must operate effectively across a wide range of linguistic, cultural, and media environments [11]." This statement is debatable, since it may be better to have several better-tuned models for different regions, platforms, or languages.
  • The authors state that the LLM in the proposed approach "generates challenging and alternative examples that help models handle difficult and unfamiliar cases". Separating "good" news from "fake" news is not simple, and when the distinction is not based on scientific facts, it is more a question of political interpretation or propaganda manipulation. How can this be done with the generated text, then?
  • Different models (OpenAI GPT-3.5-Turbo and gpt-4-turbo-preview) were used for different tasks. First, the choice of GPT models should be motivated in general. Please also explain why different models are used for different tasks.
  • The question of poisoning of the dataset used for LLM training is not discussed. This is extremely important for fake news detection.
  • Please provide samples of generated text records.

Author Response

Comment # 1: In general, the topic raised by the authors is interesting and topical. Fake news detection combined with LLM-based dataset generation is an interesting, although not new, idea (LLMs have long been used for text dataset augmentation, perhaps not for fake news but for other tasks; this should be reflected in the review section), and it is promising and interesting for readers.

Response: Thank you for the positive feedback. We have added discussion to the related work section to better reflect the use of LLMs for data augmentation in other domains (see paragraph 4 of Subsection 2.2).

Comment # 2: The manuscript is well organized and follows scientific style, and I would say that there is only one main concern regarding acceptance (minor remarks are provided below):

Response: Thank you for the positive feedback. We have addressed the remaining concerns and improved the manuscript accordingly.

Comment # 3: The authors state that "random oversampling improves accuracy by 1.46%, while synthetic augmentation yields a slightly higher improvement of 1.58%." These and similar statements in the abstract, text, and conclusions are not statistically reliable: since cross-validation is not used (or at least not described), the statements are based on a single random result, and a different initial split of the data may not give any increase at all.

Response: Thank you for the comment. We have repeated the experiments with cross-validation and updated the results accordingly (see Tables 7, 8, and 9).

Comment # 4: Discussing 0.12% differences has no value. I would suggest either performing several experiments to obtain statistically reliable results, or removing the statements about the improvements achieved and providing strong motivation for why cross-validation was not possible.

Response: Thank you for the suggestion. We conducted multiple experiments using cross-validation and removed the previously overstated claims. The abstract and results sections have been updated to reflect statistically reliable findings.
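
The reviewer's point about small accuracy gaps can be illustrated with a minimal sketch of how per-fold accuracies from two strategies might be compared. The function name `paired_fold_summary` and the fold accuracies in the usage example are hypothetical placeholders, not values from the manuscript, and the paired t-statistic is only a rough indicator because cross-validation folds share training data.

```python
import math
import statistics


def paired_fold_summary(acc_baseline, acc_augmented):
    """Summarise per-fold accuracy differences (augmented minus baseline).

    Returns (mean difference, sample standard deviation, paired t-statistic).
    Both lists must come from the same folds, in the same order.
    """
    if len(acc_baseline) != len(acc_augmented) or len(acc_baseline) < 2:
        raise ValueError("need two equally long lists with at least 2 folds")

    diffs = [aug - base for aug, base in zip(acc_augmented, acc_baseline)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)

    if sd_diff == 0:
        # Identical difference on every fold: the statistic is unbounded
        # unless the difference itself is zero.
        t_stat = 0.0 if mean_diff == 0 else math.copysign(math.inf, mean_diff)
    else:
        t_stat = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return mean_diff, sd_diff, t_stat


# Hypothetical fold accuracies, for illustration only.
baseline = [0.80, 0.82, 0.81, 0.79, 0.83]
augmented = [0.81, 0.81, 0.82, 0.80, 0.82]
mean_diff, sd_diff, t_stat = paired_fold_summary(baseline, augmented)
print(f"mean diff = {mean_diff:.4f}, sd = {sd_diff:.4f}, t = {t_stat:.2f}")
```

When the spread of the fold-wise differences is several times larger than their mean, as in this made-up example, a gap of a fraction of a percentage point cannot be distinguished from split-to-split noise.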

Comment # 5: "Given the heterogeneous nature of global information ecosystems, fake news detection models must operate effectively across a wide range of linguistic, cultural, and media environments [11]." This statement is debatable, since it may be better to have several better-tuned models for different regions, platforms, or languages.

Response: Thank you for the comment. We have revised the statement to acknowledge that region- and platform-specific models may perform better in specific contexts, while also discussing the trade-off with generalizability (see Section 1).

Comment # 6: The authors state that the LLM in the proposed approach "generates challenging and alternative examples that help models handle difficult and unfamiliar cases". Separating "good" news from "fake" news is not simple, and when the distinction is not based on scientific facts, it is more a question of political interpretation or propaganda manipulation. How can this be done with the generated text, then?

Response: Thank you for the comment. We have added an explicit explanation in the Methods section (see Section 3.4), clarifying that the LLM operates on already labeled data and preserves the original semantic meaning and labels during augmentation.

Comment # 7: Different models (OpenAI GPT-3.5-Turbo and gpt-4-turbo-preview) were used for different tasks. First, the choice of GPT models should be motivated in general. Please also explain why different models are used for different tasks.

Response: Thank you for this comment. We have added a justification for the choice of LLMs and clarified why different models were used for different tasks (see Subsection 5.8).

Comment # 8: The question of poisoning of the dataset used for LLM training is not discussed. This is extremely important for fake news detection.

Response: Thank you for the comment. We clarify that no LLM was trained or fine-tuned for fake news detection. Instead, LLMs were used solely for synthetic data generation. This clarification has been added in Section 3 (Subsection 3.4.1).

Comment # 9: Please provide samples of generated text records.

Response: Thank you for the comment. We have added representative samples of the generated text records in the Appendix, including synthetic tweets, headlines, articles, and multilingual examples, along with their associated metadata (see Appendix A).

Reviewer 2 Report

Comments and Suggestions for Authors

This paper investigates an LLM-based, feature-guided augmentation pipeline for fake news detection across different media types and multiple languages. The authors aim to improve performance under class imbalance and domain variation by generating synthetic fake samples guided by empirically observed stylistic and factual patterns. The topic is relevant, and the paper has several positive aspects, including its broad experimental scope, its focus on multilingual and cross-media settings, and its attempt to move beyond naive augmentation by grounding generation in exploratory feature analysis. At the same time, the current version leaves several important questions open regarding reproducibility, methodological transparency, and the strength and consistency of the reported empirical gains.

Weaknesses

  1. The paper emphasizes prompt engineering as a central component of the method, but no full prompt templates are provided. Could the authors include several representative prompts, at least in the appendix, to improve reproducibility?

  2. I could not find any public code repository or implementation artifacts. Given that the main contribution is the augmentation pipeline itself, do the authors plan to release the code, preprocessing scripts, and generation settings?

  3. How much of the reported improvement actually comes from feature guidance, rather than from simply adding more synthetic fake samples? A comparison against simpler prompting strategies would help clarify the contribution of the proposed pipeline.

  4. The reported results are somewhat difficult to reconcile. In some places synthetic augmentation is presented as the strongest approach, while the summary table appears to show random oversampling performing better in at least one setting. Could the authors clarify this point?

  5. How was the quality of the generated synthetic samples validated beyond feature-level matching? It would be helpful to better understand whether the generated texts are genuinely realistic or mainly reflect prompt-specific artifacts.

  6. The experimental section covers multiple settings at once, including headlines, tweets, articles, and multilingual data. Could the authors more clearly separate the main controlled experiment from the additional evaluations, and clarify which conclusions are intended to generalize across setups?

Comments on the Quality of English Language

The English is generally understandable, but the manuscript would benefit from stylistic editing for clarity and precision. In several places, the writing is overly dense and uses broad or overly strong formulations that make the empirical claims sound more definitive than the presented evidence. There are also occasional awkward phrasings and minor editing issues. A careful language revision would improve readability and help the paper express its contributions more clearly.

Author Response

Comment 1: This paper investigates an LLM-based, feature-guided augmentation pipeline for fake news detection across different media types and multiple languages. The authors aim to improve performance under class imbalance and domain variation by generating synthetic fake samples guided by empirically observed stylistic and factual patterns. The topic is relevant, and the paper has several positive aspects, including its broad experimental scope, its focus on multilingual and cross-media settings, and its attempt to move beyond naive augmentation by grounding generation in exploratory feature analysis. At the same time, the current version leaves several important questions open regarding reproducibility, methodological transparency, and the strength and consistency of the reported empirical gains.

Response: Thank you for the detailed feedback. We have improved methodological transparency, clarified the experimental settings, and refined the presentation of results to address the concerns about reproducibility and consistency.

Comment 2:   The paper emphasizes prompt engineering as a central component of the method, but no full prompt templates are provided. Could the authors include several representative prompts, at least in the appendix, to improve reproducibility?

Response: Thank you for the comment. We have added representative prompt templates in the Appendix, including examples for tweet generation, headline generation, article style transfer, and multilingual generation. These prompts illustrate the design of our augmentation pipeline and improve the reproducibility of the proposed approach (see Appendix B).

Comment 3:  I could not find any public code repository or implementation artifacts. Given that the main contribution is the augmentation pipeline itself, do the authors plan to release the code, preprocessing scripts, and generation settings?

Response: Thank you for the comment. We have already made the implementation publicly available and have now clarified this in the manuscript by explicitly referencing the GitHub repository in the main text (see the footnote in subsection 3.2.3).

Comment 4:   How much of the reported improvement actually comes from feature guidance, rather than from simply adding more synthetic fake samples? A comparison against simpler prompting strategies would help clarify the contribution of the proposed pipeline.

Response: Thank you for this insightful comment. We have clarified the contribution of feature-guided augmentation by comparing it with simple semantic-based generation and random oversampling (see Subsections 5.2 and 6.1).

Comment 5:   The reported results are somewhat difficult to reconcile. In some places synthetic augmentation is presented as the strongest approach, while the summary table appears to show random oversampling performing better in at least one setting. Could the authors clarify this point? 

Response: Thank you for the comment. We clarified that performance differences depend on the proportion of synthetic data used. Additional explanation has been added to resolve inconsistencies across experiments (see Subsections 3.4 and 4.3).

Comment 6:   How was the quality of the generated synthetic samples validated beyond feature-level matching? It would be helpful to better understand whether the generated texts are genuinely realistic or mainly reflect prompt-specific artifacts.

Response: Thank you for this important comment. We have expanded the validation procedure to include multiple checks beyond feature-level matching (see Subsection 3.4.6).

Comment 7:   The experimental section covers multiple settings at once, including headlines, tweets, articles, and multilingual data. Could the authors more clearly separate the main controlled experiment from the additional evaluations, and clarify which conclusions are intended to generalize across setups? 

Response: Thank you for the suggestion. We have restructured the experimental section to clearly separate the main controlled experiment from the additional evaluations (see Subsections 4.3 and 6.1).

Comment 8:   The English is generally understandable, but the manuscript would benefit from stylistic editing for clarity and precision. In several places, the writing is overly dense and uses broad or overly strong formulations that make the empirical claims sound more definitive than the presented evidence. There are also occasional awkward phrasings and minor editing issues. A careful language revision would improve readability and help the paper express its contributions more clearly.

Response: Thank you for the comment. We have revised the manuscript for clarity, improved phrasing, and reduced overly strong claims to better reflect the evidence.

Reviewer 3 Report

Comments and Suggestions for Authors
  1. The steps labeled "prompt engineering" and "augmentation control" appear in Figure 1, but the text does not explain them in detail. The authors do not mention temperature settings, specific versions of the LLM (for example, GPT-4o or Llama 3), or the exact prompts used for "paraphrasing" and "style changes."
  2. While the authors say that samples are "filtered using automatic checks," they do not explain what these checks are. There is a gap in understanding "synthetic artifacts," which are patterns introduced by LLMs that models might learn instead of real indicators of fake news.
  3. The authors should add an appendix or section with the exact prompts used for LLM-driven augmentation, including paraphrasing, translation, and style transfer. This information is vital for ensuring "reproducibility."
  4. In Section 4 (Discussion: Interpretability and Attention Mechanisms), it is suggested to cite the paper “Encoder-Only Attention-Guided Transformer Framework for Accurate and Explainable Social Media Fake Profile Detection”. Justification: A major weakness of deep learning-based detection is that it operates as a "black box." This reference proposes improving detection by using attention weights to show which features the model relies on. Adding these "attention-guided" insights will help address the need for "interpretive validity."
  5. The authors should explain what "automatic checks" they use to ensure factual accuracy. Are these based on Natural Language Inference (NLI) models, knowledge graphs, or basic keyword matching?
  6. The authors should analyze how varying the synthetic-to-real data ratio (e.g., 10%, 25%, 50%) affects model performance. This analysis will support the idea of "controlled" mixing.
  7. All figures and tables need to be clearly referenced in the text. For example, Figure 1 and Table 1 should be properly integrated within the narrative.
  8. If applicable, the "Highlights" section should be revised to clearly state that "style-based and fact-based features" are essential elements of the framework.

Author Response

Comment 1:  The steps labeled "prompt engineering" and "augmentation control" appear in Figure 1, but the text does not explain them in detail. The authors do not mention temperature settings, specific versions of the LLM (for example, GPT-4o or Llama 3), or the exact prompts used for "paraphrasing" and "style changes." 

Response: Thank you for the comment. We have clarified the components “prompt engineering” and “augmentation control” in Figure 1 by providing a detailed description in the Methods section (see Subsection 3.2.1). 

Comment 2:  While the authors say that samples are "filtered using automatic checks," they do not explain what these checks are. There is a gap in understanding "synthetic artifacts," which are patterns introduced by LLMs that models might learn instead of real indicators of fake news. 

Response: Thanks for the comment. We have clarified the automatic filtering process and explained how stylistic features are used to reduce synthetic artifacts (see Subsection 6.1). 

Comment 3:  The authors should add an appendix or section with the exact prompts used for LLM-driven augmentation, including paraphrasing, translation, and style transfer. This information is vital for ensuring "reproducibility."

Response: Thank you for the comment. We have added representative prompt templates in the Appendix to improve reproducibility (see Appendix B).

Comment 4: In Section 4 (Discussion: Interpretability and Attention Mechanisms), it is suggested to cite the paper “Encoder-Only Attention-Guided Transformer Framework for Accurate and Explainable Social Media Fake Profile Detection”. Justification: A major weakness of deep learning-based detection is that it operates as a "black box." This reference proposes improving detection by using attention weights to show which features the model relies on. Adding these "attention-guided" insights will help address the need for "interpretive validity."

Response: Thank you for the valuable suggestion. We have incorporated the recommended reference in Section 6 (Discussion: Interpretability and Attention Mechanisms) and briefly discussed how attention-guided approaches can enhance interpretability.

Comment 5:  The authors should explain what "automatic checks" they use to ensure factual accuracy. Are these based on Natural Language Inference (NLI) models, knowledge graphs, or basic keyword matching?

Response: Thank you for the comment. We clarify that the “automatic checks” refer primarily to the use of stylistic feature constraints rather than explicit factual verification mechanisms. Our goal is to generate synthetic samples that reflect the stylistic characteristics of fake news in order to address class imbalance.

Regarding factual accuracy, we initially explored two approaches: 1) prompt design to encourage consistency, and 2) an additional verification method based on prior work (https://aile3.ijs.si/dunja/SiKDD2024/Papers/IS2024_-_SIKDD_2024_paper_13.pdf). However, this approach was not effective for our datasets, as they are not strictly claim-based and lack structured factual annotations. 

Importantly, since the synthetic samples belong to the fake news class, strict factual accuracy is not required, and controlled variation in content can even be beneficial for improving model robustness. Based on this, we revised the manuscript to avoid misleading claims about factual preservation. Specifically, we replaced references to “factual structure” with “semantic structure” and updated related terminology (e.g., “fact-based features” to “semantic features”). These changes are reflected in Sections 1, 3 (see Subsection 3.1), and 6.
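
To make the idea of filtering by stylistic feature constraints concrete, here is a minimal sketch. The manuscript's actual features and thresholds are not given in this record, so the three features (`n_words`, `exclamations`, `caps_ratio`), the bounds, and the function names below are all hypothetical choices made for illustration.

```python
import re


def style_features(text: str) -> dict:
    """Compute a few simple stylistic features (hypothetical feature set)."""
    words = re.findall(r"[A-Za-z']+", text)
    n_words = len(words)
    # Count fully upper-case words longer than one character ("I", "A" excluded).
    n_caps = sum(1 for w in words if len(w) > 1 and w.isupper())
    return {
        "n_words": n_words,
        "exclamations": text.count("!"),
        "caps_ratio": n_caps / n_words if n_words else 0.0,
    }


def within_bounds(features: dict, bounds: dict) -> bool:
    """True if every constrained feature lies inside its [low, high] range."""
    return all(low <= features[name] <= high
               for name, (low, high) in bounds.items())


def filter_synthetic(samples, bounds):
    """Keep only generated samples whose style falls inside the bounds."""
    return [s for s in samples if within_bounds(style_features(s), bounds)]


# Assumed bounds; in practice they would be derived from the real fake-news class.
bounds = {"n_words": (3, 50), "exclamations": (0, 2), "caps_ratio": (0.0, 0.5)}
candidates = [
    "SHOCKING!!! YOU WON'T BELIEVE THIS!!!",
    "Officials deny the report, sources say!",
    "Too short",
]
print(filter_synthetic(candidates, bounds))
```

A check of this kind constrains style only. As the response notes, it does not verify facts, which is why the terminology was changed from "fact-based" to "semantic" features.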

Comment 6:  The authors should analyze how varying the synthetic-to-real data ratio (e.g., 10%, 25%, 50%) affects model performance. This analysis will support the idea of "controlled" mixing.

Response: Thank you for the suggestion. We conducted a controlled analysis by varying the imbalance severity, resulting in synthetic data proportions ranging from approximately 3% to 34% of the training set. The results have been added to the manuscript to support the concept of controlled augmentation (see Subsection 5.6).
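
The link between imbalance severity and synthetic data proportion can be sketched with a short calculation. It assumes that augmentation adds exactly enough synthetic minority samples to balance the two classes; the record does not state this, so both the assumption and the helper name `synthetic_share` are illustrative only.

```python
def synthetic_share(n_majority: int, n_minority: int) -> float:
    """Fraction of the training set that is synthetic after balancing.

    Assumes synthetic samples are added to the minority class until both
    classes have n_majority samples (an assumption, not the paper's setup).
    """
    if n_minority < 0 or n_minority > n_majority:
        raise ValueError("expected 0 <= n_minority <= n_majority")
    n_synthetic = n_majority - n_minority
    total = n_majority + n_minority + n_synthetic  # equals 2 * n_majority
    return n_synthetic / total if total else 0.0


# Illustrative class counts, chosen only to show the arithmetic.
for n_min in (940, 750, 500, 320):
    share = synthetic_share(1000, n_min)
    print(f"minority={n_min:4d}  synthetic share={share:.0%}")
```

Under this balancing assumption the share equals (1 - r) / 2, where r is the minority-to-majority ratio, so it can never exceed 50%. Reaching the 50% setting the reviewer mentions would require a different mixing scheme.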

Comment 7:  All figures and tables need to be clearly referenced in the text. For example, Figure 1 and Table 1 should be properly integrated within the narrative.

Response: Thank you for the comment. We have revised the manuscript to ensure that all figures and tables, including Figure 1 and Table 1, are clearly referenced and properly integrated into the narrative.

Comment 8:  If applicable, the "Highlights" section should be revised to clearly state that "style-based and fact-based features" are essential elements of the framework.

Response: Thank you for the comment. We have revised the Highlights section to explicitly emphasize the role of semantic- and style-based features as core components of the proposed framework.

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

Authors have addressed all the remarks provided in the original review. Manuscript can be accepted.

Author Response

Comment 1: The authors have addressed all the remarks provided in the original review. Manuscript can be accepted.

Response 1: Dear Reviewer, thank you for your kind evaluation and positive feedback. We sincerely appreciate your time and effort in reviewing our manuscript and are glad that our revisions addressed all the remarks.

Reviewer 2 Report

Comments and Suggestions for Authors

The revised manuscript addresses the main points raised in the previous review to a satisfactory extent. In particular, the authors improved methodological transparency by adding representative prompt templates, clarifying the experimental structure, expanding the validation of synthetic data, and providing access to the implementation. While some issues of presentation and claim calibration remain, the revision is substantial and the manuscript is now acceptable for publication.

Author Response

Comment 1: The revised manuscript addresses the main points raised in the previous review to a satisfactory extent. In particular, the authors improved methodological transparency by adding representative prompt templates, clarifying the experimental structure, expanding the validation of synthetic data, and providing access to the implementation. While some issues of presentation and claim calibration remain, the revision is substantial and the manuscript is now acceptable for publication.

Response 1: Thank you for your thoughtful evaluation and constructive comments. We are pleased that the revisions—particularly the added prompt templates, clarified experimental structure, expanded validation of synthetic data, and provision of implementation details—addressed your main concerns.
