Review Reports - Fake News Detection: It’s All in the Data!

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

Summary

This survey offers a timely and comprehensive examination of the role of datasets in fake news detection research. Rather than focusing on detection algorithms alone, the paper deliberately centers on dataset characteristics—including data types, labeling schemes, annotation methodologies, biases, and ethical considerations—as the primary drivers of model performance and generalizability. The authors systematically analyze 93 peer-reviewed studies following PRISMA 2020 guidelines, supported by bibliometric visualizations (e.g., VOSviewer keyword co-occurrence maps). They further contribute a curated GitHub repository that aggregates publicly available fake news datasets, enhancing accessibility and reproducibility for the research community. The survey also addresses multimodal data, cross-dataset generalization challenges, and emerging trends such as AI-generated misinformation, while outlining best practices for dataset construction and proposing future directions grounded in ethical and technical considerations.

Strengths

1. Clear and Unique Focus: Unlike prior surveys that emphasize algorithms or linguistic features, this work places dataset quality at the core of fake news detection research—an underexplored but critical perspective. This reframing is both novel and necessary given the field’s recurring generalization failures.

2. Methodological Rigor: The adoption of PRISMA 2020 guidelines, explicit inclusion/exclusion criteria, reproducible search queries (Appendix A), and bibliometric analysis (Figure 2) significantly strengthens the survey’s credibility and transparency.

3. Practical Contribution: The GitHub dataset portal (to be made public) is a valuable community resource, especially given the fragmented and scattered nature of existing fake news datasets. Tables 2–4 provide clear, structured summaries of dataset properties (labels, size, language, modality), aiding researchers in dataset selection.

4. Holistic Coverage: The survey thoughtfully integrates technical, ethical, and societal dimensions—from annotation bias and class imbalance to consent, privacy, and cultural representation—aligning with growing demands for responsible AI in misinformation research.

Weaknesses

1. Limited Critical Synthesis of Dataset Limitations: While biases (selection, labeling, cultural) are well-described, the survey does not sufficiently analyze *how* these biases propagate into specific failure modes of detection models (e.g., false positives on satire in non-Western contexts). A deeper linkage between dataset flaws and model errors would strengthen the argument.

2. Underdeveloped Discussion of Emerging Threats: The treatment of AI-generated fake news (e.g., via LLMs) is brief and lacks discussion of dataset gaps for detecting synthetic media (e.g., deepfakes, GPT-4–level text). Given the paper’s 2025 submission date, this topic merits more emphasis.

Overall Assessment

This is a well-structured, timely, and highly relevant survey that fills an important gap in the fake news detection literature by foregrounding data—rather than models—as the foundational challenge. The systematic methodology, ethical reflection, and practical repository significantly advance research reproducibility and rigor. While the discussion of emerging AI-generated threats could be expanded, the paper already provides a robust foundation for future work in dataset-centric misinformation research. With minor enhancements—particularly in linking dataset flaws to real-world model failures—this survey will serve as an essential reference for researchers, practitioners, and policymakers.

Recommendation: Accept with Minor Revisions

Author Response

Comment 1: Limited Critical Synthesis of Dataset Limitations: While biases (selection, labeling, cultural) are well-described, the survey does not sufficiently analyze *how* these biases propagate into specific failure modes of detection models (e.g., false positives on satire in non-Western contexts). A deeper linkage between dataset flaws and model errors would strengthen the argument.

Response 1: We added a concise synthesis explaining how dataset biases translate into specific failure modes, such as false positives on culturally distinct satire and false negatives in underrepresented domains (see line 392).

Comment 2: Underdeveloped Discussion of Emerging Threats: The treatment of AI-generated fake news (e.g., via LLMs) is brief and lacks discussion of dataset gaps for detecting synthetic media (e.g., deepfakes, GPT-4–level text). Given the paper’s 2025 submission date, this topic merits more emphasis.

Response 2: We expanded the Future Directions section to highlight gaps in datasets for detecting modern AI-generated fake news, including realistic multimodal content generated by state-of-the-art language and image models.

Reviewer 2 Report

Comments and Suggestions for Authors

The topic addressed in this article is undoubtedly fascinating, as it provides a systematic review of the role data plays in the development of systems for detecting fake news. However, the approach and methodology raise certain doubts about the suitability and validity of the article.

Firstly, although disinformation and fake news are among the most frequently studied topics in the scientific literature of recent years, the article's theoretical basis is thin and superficial. It does not engage with the more specific theories of the phenomenon, nor with the aspects that the research subsequently develops or focuses on. This is a significant shortcoming of the article.

Regarding the methodology and criteria used for searching and selecting articles, the review follows PRISMA guidelines, which in principle signals transparency and rigour. It is noteworthy, however, that the Web of Science database was not considered. This is the leading and most prestigious database, so it can be assumed that much of the most relevant research has not been taken into account for this study. This affects the rigour and validity of the results obtained and is one of the article's main weaknesses.

Other flaws are also identified in the methodological section. For example, the authors should specify the time period analysed, the procedures for searching and selecting articles, and the variables and categories used to analyse each article.

That said, it is worth highlighting the practical approach taken in creating and promoting a repository on GitHub, which facilitates access to and use of public datasets for the research community. Furthermore, one of the most innovative and valuable aspects of the article is the discussion of biases, ethics, and challenges associated with data collection and use, which provides an essential and novel critical approach to the study of this phenomenon.

Even so, in addition to the methodological doubts raised above, I perceive another area for improvement. Although the review is exhaustive, it could benefit from a more in-depth discussion of the specific limitations of current datasets in relation to journalism, especially in media environments where misinformation has immediate effects on public opinion and democracy. Additionally, although reference is made to ethical challenges, the discussion could be expanded and enriched with the specific implications of using these data in journalistic research contexts, such as possible overt or covert censorship, and the extent to which data transparency actually translates into greater social responsibility.

These issues limit the article's scope and prevent it from representing a significant advance in knowledge in this area. Despite the interesting approach and perspective it offers, the resulting work does not deliver on that distinctive angle to the same extent. In general terms, this could be remedied, at least partially, by strengthening the discussion in relation to specific applications and challenges in the field of journalism and communication in an ever-changing media context. If these shortcomings are corrected, the manuscript could be published.

Author Response

Comment 1: Web of Science database was not considered. This is the leading and most prestigious database ...

Response 1: Thank you for raising this concern. As described in the Paper Selection Methodology section, we used Scopus as the primary bibliographic database because it provides comprehensive coverage and structured metadata suitable for systematic literature reviews, including abstracts, titles, authors, DOIs, and keywords. To ensure broader coverage beyond Scopus indexing, we supplemented the search with Google Scholar and arXiv, then manually screened and de-duplicated results. In addition, VOSviewer was employed to analyze keyword co-occurrence and thematic structures, supporting the organization and synthesis of the surveyed literature. This multi-source and transparent selection strategy follows PRISMA guidelines and ensures adequate coverage and methodological rigor.

Comment 2: ... In general terms, this could be remedied, at least partially, by strengthening the discussion in relation to specific applications and challenges in the field of journalism and communication in an ever-changing media context. If these shortcomings are corrected, the manuscript could be published.

Response 2: We thank the reviewer for this insightful comment. To address this concern, we have added a new subsection (Section 10.4) that explicitly discusses the implications of fake news dataset design for media and journalism from a dataset-centric perspective. In this section, we explain how common dataset construction and annotation practices implicitly reflect journalistic norms, how these assumptions influence downstream detection systems, and how such effects may be amplified across different linguistic and cultural contexts. The discussion is grounded in established communication and journalism literature and is intended to clarify the relevance of our dataset survey to media and journalism research without expanding the scope beyond datasets and their design.

Reviewer 3 Report

Comments and Suggestions for Authors

Although the paper claims adherence to PRISMA 2020, the screening process appears weak. No papers were excluded after abstract screening, which is an unusual outcome for a systematic review.

The review summarizes datasets and prior studies but lacks a strong comparative analysis. Many sections reiterate well-known issues such as dataset bias, imbalance, and multilingual challenges without providing new theoretical insights or novel taxonomies.

The main claimed contribution, a GitHub repository aggregating datasets, offers some utility but remains incremental. Similar repositories already exist within the community, and the repository remained private at the time of submission, limiting immediate verifiability and reproducibility.

Coverage across the paper is uneven: some sections, such as dataset descriptions and ethical considerations, are highly detailed, whereas others, including multimodal limitations and model–dataset interaction mechanisms, remain superficial and high-level.

Several tables contain visual artifacts, broken symbols, unclear icons, and placeholders, which hinder readability. In particular, Table 4 is difficult to interpret and does not meet publication-quality standards.

Although generative text datasets (M4) are briefly mentioned, the paper does not address LLM-driven misinformation or synthetic media used for adversarial content generation, issues that have emerged as pressing challenges since 2023.

Author Response

Comment: The review summarizes datasets and prior studies but lacks a strong comparative analysis. Many sections reiterate well-known issues such as dataset bias, imbalance, and multilingual challenges without providing new theoretical insights or novel taxonomies ...

Response:

We thank the reviewer for highlighting the uneven coverage and fragmented treatment of challenges and generalization issues.

In response, we substantially reorganized Section 7. Specifically, we merged the former subsections on common challenges, model influence, and generalizability into a unified subsection entitled “From dataset construction to generalization failure” (Section 7.1), which explicitly traces the causal pipeline from dataset properties to model behavior and cross-dataset transfer limitations. This reorganization reduces redundancy, equalizes depth across topics, and clarifies the mechanisms of model–dataset interaction.

In addition, to address emerging challenges that were previously under-represented, we introduced a new subsection (Section 7.3) dedicated to LLM-era challenges and benchmark instability, including synthetic misinformation, evaluation contamination, and ecosystem-level effects. This new subsection strengthens the discussion of contemporary threats and improves the balance and completeness of the limitations analysis.

We believe these revisions substantially improve coherence, analytical depth, and contemporary relevance, directly addressing the reviewer’s concerns regarding uneven coverage and superficial treatment of generalization.

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

The requested changes have been made correctly, so I consider the article suitable for publication.