Systematic Review

Fake News Detection: It’s All in the Data!

1 Faculty of Mathematics and Information Science, Warsaw University of Technology, Plac Politechniki 1, 00-661 Warsaw, Poland
2 Systems Research Institute, Polish Academy of Sciences, Newelska 6, 01-447 Warsaw, Poland
3 Department of Computer Science and Engineering, National Institute of Technology, Zone P1, Plot No. FA7, GT Karnal Rd., Delhi 110036, India
* Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(3), 1585; https://doi.org/10.3390/app16031585
Submission received: 5 December 2025 / Revised: 22 January 2026 / Accepted: 1 February 2026 / Published: 4 February 2026

Abstract

This brief survey acts as a fundamental resource for researchers beginning their exploration into fake news detection. It emphasizes the importance of dataset quality and diversity in enhancing the effectiveness of detection models, detailing key features, labeling systems, and prevalent biases. It also examines the challenges and limitations of current resources. By addressing ethical considerations (such as privacy and consent, societal impacts, transparency, and accountability) and best practices (annotation methodologies, real-world dynamics, reliability, and validity), we offer a thorough overview of current datasets. Additionally, our contribution includes a GitHub repository that aggregates publicly available datasets into a single, easily accessible portal, thereby supporting further research and development in the fight against fake news.

1. Introduction

Detecting fake news is essential in today’s digital era due to its profound impact on individuals, societies, and democratic processes [1,2,3]. Identifying and debunking false information ensures the dissemination of accurate and reliable information, protecting public discourse from manipulation and mitigating the harm caused by rumors, conspiracy theories, and false narratives [3,4,5,6,7]. Effective fake news detection maintains trust in media sources, promotes critical thinking, and prevents manipulation by malicious actors. It is also crucial for cybersecurity, as misinformation can facilitate the spread of malware and phishing attacks. Addressing fake news thus supports social cohesion and upholds ethical standards in digital communication, safeguarding democratic processes from manipulation and propaganda.
Fake news and disinformation are commonly distinguished in communication studies based on intent, dissemination strategy, and contextual framing. These distinctions have direct implications for dataset construction, annotation practices, and evaluation protocols, as datasets necessarily operationalize specific assumptions about what constitutes “false” or “misleading” information. Given the critical role of fake news detection in preserving information integrity, it is important to examine the development of these detection systems, particularly the use of datasets. High-quality and diverse datasets are crucial for capturing various misinformation patterns, enhancing the effectiveness of detection models that integrate textual, visual, and behavioral features [8,9,10].
This survey investigates the key components of datasets used in developing fake news detection models. It explores the characteristics, common features, and labels of existing datasets, evaluating their impact on the effectiveness and resilience of detection algorithms. Emphasis is placed on the importance of collecting representative and reliable datasets. The survey also addresses the challenges, biases, and ethical considerations associated with these datasets, the role of multimodal datasets, and best practices for constructing high-quality datasets. Additionally, it reviews the evolution of fake news detection models with the availability of diverse datasets and considers future directions for advancing detection technologies.
In Section 2, we detail the selection methodology, including the data sources and search strategy, as well as the visualization techniques. Section 3 defines key terms and reviews related surveys to provide context for our discussion. Section 4 delves into the characteristics of existing fake news datasets, while Section 5 examines the impact of dataset properties on detection algorithms. Section 6 discusses the role of multimodal datasets in improving detection accuracy, and Section 7 highlights the challenges and limitations in current datasets. In Section 8, we outline best practices for creating high-quality datasets, and in Section 9, we trace the evolution of detection models with the availability of updated datasets. Finally, Section 10 addresses the ethical considerations involved in fake news datasets, and Section 11 proposes future directions for research in this field.

2. Paper Selection Methodology

To improve transparency and reproducibility in the study selection process, this survey follows the PRISMA 2020 guidelines for literature reviews. The flow of records through the identification, screening, eligibility, and inclusion phases is illustrated in Figure 1. The completed PRISMA checklist is provided in Appendix B for reference. This review was not registered in any public systematic review registry.

2.1. Data Sources and Search Strategy

A comprehensive search was conducted across three major scholarly platforms, namely Scopus, Google Scholar, and arXiv, to identify research related to datasets used in fake news detection. Scopus was the primary database due to its extensive coverage of peer-reviewed literature. The advanced search query used in Scopus is provided in Appendix A; the search was limited to publications from 2023 onward to emphasize recent developments in dataset-driven fake news detection. To ensure coverage beyond Scopus, supplementary searches were carried out in Google Scholar and arXiv using the same set of keywords.
This multi-source strategy ensured broad coverage of relevant work, including peer-reviewed articles, conference proceedings, and preprints involving dataset development, dataset use, or methodological contributions to fake news detection.

2.2. Screening and Eligibility Assessment

The initial search identified 125 records. These records were consolidated and checked for redundancy. A total of 32 duplicates were removed, resulting in 93 unique articles. Since all retained articles were relevant based on titles and abstracts, and all met the basic inclusion criteria, no articles were excluded during full-text assessment.
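The record flow above amounts to simple bookkeeping; as a sanity check, a minimal sketch (counts taken directly from the text) confirms the numbers are internally consistent:

```python
# PRISMA-style record flow; counts taken from the screening description above.
identified = 125        # records returned by the initial search
duplicates = 32         # redundant records removed during consolidation
excluded_fulltext = 0   # no articles excluded at full-text assessment

unique = identified - duplicates
included = unique - excluded_fulltext

print(unique, included)  # 93 93
```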
Table 1 summarizes the inclusion and exclusion criteria that guided this process.

2.3. Bibliometric Visualization with VOSviewer

Following article selection, the metadata associated with the 93 included studies were exported and analyzed using VOSviewer 1.6.20. This software enables the construction of keyword co-occurrence networks, revealing dominant research themes, methodological trends, and dataset-related patterns across the literature. The resulting map (Figure 2) highlights central concepts such as fake news detection, machine learning, and multimodal datasets, which informed the thematic structure of this survey.
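VOSviewer builds such maps from keyword co-occurrence counts; the core counting step can be sketched in a few lines (the keyword lists below are hypothetical, not drawn from the 93 included studies):

```python
from collections import Counter
from itertools import combinations

# Hypothetical author-keyword lists, one per paper.
papers = [
    ["fake news detection", "machine learning", "datasets"],
    ["fake news detection", "multimodal datasets", "deep learning"],
    ["machine learning", "multimodal datasets", "fake news detection"],
]

# Count how often each unordered keyword pair appears in the same paper.
cooccurrence = Counter()
for keywords in papers:
    for a, b in combinations(sorted(set(keywords)), 2):
        cooccurrence[(a, b)] += 1

# The heaviest edges correspond to the central concepts on the map.
top_edge, weight = cooccurrence.most_common(1)[0]
print(top_edge, weight)
```

In VOSviewer itself, these edge weights (optionally normalized) drive both the layout and the clustering that surfaces the dominant themes.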

2.4. Final Dataset

The final dataset consists of 93 studies that met all inclusion criteria. These papers collectively represent the current landscape of dataset-driven research in fake news detection and constitute the evidence base for the analysis presented in the following sections.

3. Definitions and Related Surveys

3.1. Definitions

In the context of this survey, several key terms are defined to ensure clarity and consistency throughout the discussion:
  • Fake News: False or misleading information presented as news. This may include fabricated news stories, misinformation, and disinformation intended to deceive readers [1].
  • Misinformation: False or inaccurate information spread regardless of intent to deceive. It may arise from errors, misconceptions, or rumors [11].
  • Disinformation: Deliberately false information spread with the intention to mislead or manipulate public opinion [12].
  • Detection Models: Algorithms and systems designed to identify and classify fake news, misinformation, or disinformation [13].
  • Datasets: Collections of data used to train and evaluate fake news detection models, which can include textual, visual, and multimodal data [14].

3.2. Related Surveys

Several surveys have been conducted in recent years to explore various aspects of fake news detection. These surveys provide a comprehensive overview of existing methods, datasets, and challenges in the field. Here, we summarize a few significant works:
  • Shu et al. (2017): This survey explores fake news detection from both social and news content perspectives, highlighting the importance of leveraging social context in improving detection accuracy [15].
  • Zhou and Zafarani (2018): The authors review machine learning approaches for fake news detection, emphasizing the role of feature engineering and classification techniques [16].
  • Zhou and Zafarani (2020): This comprehensive survey provides an in-depth analysis of fundamental theories, detection methods, and future opportunities in the field of fake news detection [17].
  • Kumar and Shah (2018): This survey focuses on false information on the web and social media, discussing various detection strategies and their effectiveness [18].
While these surveys offer valuable insights into different dimensions of fake news detection, our survey is uniquely focused on the datasets themselves. Specifically, we examine how dataset characteristics influence the performance and effectiveness of fake news detection models. By focusing on datasets, we aim to provide a detailed understanding of the current landscape, identify gaps, and propose best practices for creating and using high-quality datasets in fake news detection research.

4. Characteristics of Existing Fake News Datasets

In this section, we contribute by collecting publicly available datasets, summarizing their contents, and comparing them on our GitHub page (https://github.com/fakenewsresearch/dataset, accessed on 31 January 2026). This initiative offers researchers a centralized, comprehensive portal for accessing and analyzing relevant datasets, with regular updates. Due to page constraints, only a portion of the GitHub content is displayed here. Table 2 provides an overview of key datasets included in the repository.

4.1. Types of Data Collected

Fake news datasets are typically categorized into textual, visual, and multimodal types, reflecting the different forms fake news content can take [30,32]. Textual datasets focus on written content, including articles, headlines, and social media posts; examples include the LIAR [22] and MisInfoText [26] datasets, which support the analysis of linguistic patterns and textual inconsistencies, though short texts lack context and satire or sarcasm remains difficult to detect [33]. Visual datasets consist of images and videos used to detect fake news through visual content analysis; the Verification Corpus [29] and FCV-2018 [28] datasets help develop algorithms for identifying image manipulations and verifying video authenticity. Multimodal datasets combine text, images, and videos for a comprehensive approach to fake news detection; examples include FakeNewsNet [13] and r/fakeddit [30], which improve detection accuracy by cross-referencing multiple data types. Generative machine-text datasets, produced by AI models such as GPT-3, are a recent trend focused on understanding and detecting AI-generated misinformation. State-of-the-art methods for detecting fake news in machine-generated text achieve over 90% accuracy in controlled environments, but their applicability in real-world settings remains challenging [34]. The M4 dataset [31] exemplifies this trend, containing both human-written and machine-generated text and supporting the development of tools that distinguish between them.

4.2. Common Features and Labels

Fake news datasets typically include a variety of features, such as linguistic, metadata, and visual features, along with labels indicating whether content is fake or real. In datasets that focus exclusively on images, visual features such as image metadata, pixel patterns, and manipulation traces are used to distinguish authentic from fake images. These visual datasets are essential for identifying visually deceptive information, which cannot be assessed through textual analysis alone [35]. Linguistic features include syntax, semantics, and stylistics derived from text, and are crucial for training machine learning models to classify textual content [4,13,36,37]; common examples include n-grams, part-of-speech tags, and sentiment scores [36,37,38]. Metadata provides contextual information such as publication date, author, source, and social media engagement metrics, helping assess source credibility and track the spread of information; for instance, the PHEME dataset [20] uses metadata to analyze the spread of rumors on Twitter. Visual features include image metadata, pixel data, and patterns identified through image processing, and advances in computer vision—such as convolutional neural networks—enhance the detection of visual inconsistencies [39]. Labels generally classify content as fake or real, with some datasets offering more detailed labels indicating degrees of falsity; binary labels are simple, while nuanced labels offer deeper insights into disinformation [22]. Table 3 illustrates the various rating scales used in different datasets. For example, the CREDBANK dataset uses a five-point scale from “Certainly Inaccurate” to “Certainly Accurate,” while the PHEME dataset categorizes information as “true,” “false,” or “unverified.”
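The linguistic features listed above (n-grams, part-of-speech tags, sentiment scores) are straightforward to compute; a minimal sketch of the n-gram and lexicon-based sentiment parts, using a toy sentiment word list that is illustrative only and not taken from any surveyed dataset:

```python
# Toy feature extraction: word bigrams plus a crude lexicon-based sentiment score.
# The two word lists are hypothetical, for illustration only.
POSITIVE = {"confirmed", "verified", "official"}
NEGATIVE = {"shocking", "hoax", "exposed"}

def extract_features(text: str) -> dict:
    tokens = text.lower().split()
    # Word bigrams: adjacent token pairs, a common n-gram feature.
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    # Sentiment: positive lexicon hits minus negative lexicon hits.
    sentiment = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return {"tokens": tokens, "bigrams": bigrams, "sentiment": sentiment}

feats = extract_features("Shocking hoax exposed by verified sources")
print(feats["sentiment"])  # 1 positive hit minus 3 negative hits -> -2
```

Real pipelines replace the toy lexicon with resources such as sentiment dictionaries or learned embeddings, but the feature shapes are the same.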

4.3. Variation in Characteristics

The size, source, annotation process, and language of fake news datasets vary significantly, affecting their applicability and the methods used for analysis. The following explains these characteristics and provides examples. Size varies from a few hundred entries to millions; larger datasets, like FakeNewsNet (https://github.com/KaiDMML/FakeNewsNet, accessed on 20 November 2024) [13], provide extensive training data but require more computational resources. Refer to Table 4 for detailed information on dataset sizes.
Sources vary across social media platforms, news websites, and fact-checking organizations, each with unique challenges, such as the dynamic nature of social media and the credibility of news websites [13].
Annotation processes include manual, crowdsourced, and automated approaches; manual annotation, exemplified by the LIAR dataset [22], guarantees high accuracy but is labor-intensive and time-consuming, whereas crowdsourcing, as utilized by BuzzFeed [23], and automation, as implemented in FakeNewsNet [13], provide scalable solutions that may occasionally compromise on quality [40].
Language coverage is increasingly multilingual; datasets such as M4 [31] are vital for ensuring global applicability, though they demand language-specific models and complex data processing [41]. Furthermore, the TALLIP-FakeNews-Dataset (https://github.com/Arko98/TALLIP-FakeNews-Dataset, accessed on 20 November 2024) serves as a multilingual resource that encompasses low-resource languages [42].
Understanding these characteristics helps researchers select appropriate datasets, ensuring their methodologies align with the dataset’s features and constraints. Table 3 outlines the rating scales employed across different datasets, while Table 2 and Table 4 provide detailed information on the year of publication, language coverage, data types, and access methods. This comprehensive overview highlights the diversity and scope of available datasets, helping researchers select the most appropriate datasets for their specific needs. The trend towards multimodal and generative machine text datasets reflects the evolving landscape of fake news detection and underscores the necessity for advanced analytical methods.

5. Impact of Dataset Properties on Detection Algorithms

5.1. Performance Influence

The performance of detection algorithms is significantly influenced by dataset characteristics. Larger datasets generally improve classification performance by providing more information and aiding pattern generalization during training. In contrast, smaller datasets often lead to overfitting and less reliable models due to limited variability and detail [4,43,44,45,46]. For instance, using the NELA-GT-2018 dataset [27], which contains approximately 713,000 articles, researchers have observed a noticeable enhancement in model performance. The large volume and diversity of data help reduce overfitting and improve the generalization capabilities of the model. In one study, models trained on NELA-GT-2018 achieved an accuracy improvement of up to 15% compared to models trained on smaller datasets [27]. Conversely, smaller datasets like the Verification Corpus, with about 15,630 articles, present challenges such as overfitting despite being relatively substantial. For example, a study demonstrated that models trained on smaller datasets could see a performance drop of around 10–20% in accuracy when tested on unseen data, illustrating the limitations of small datasets in capturing the full variability of real-world scenarios [29]. To conclude, the size and diversity of datasets play a crucial role in the effectiveness of detection algorithms, with larger datasets generally providing better performance and reliability.

5.2. Specific Properties Leading to Better Performance

Certain dataset characteristics consistently improve detection accuracy and robustness [47,48]. High-quality annotations provide richer information, resulting in more accurate predictions. Datasets that accurately represent the original distribution tend to yield better performance regardless of size. Incorporating diverse features, such as numerical and textual data, enhances generalization. For example, the LIAR dataset [22] includes extensive fact-checking data with multiple levels of truthfulness, allowing models to learn subtle distinctions and improve accuracy in classifying statement veracity. For additional information on rating scales, see Table 3. Balanced class distributions help train unbiased models, reducing the risk of bias toward a particular class and improving performance across classes. For example, the PHEME dataset [20] provides a balanced distribution of rumor and non-rumor data, ensuring the model is not biased toward one class and leading to more robust and reliable rumor detection. However, it is also argued that natural statistical balance, as seen in real-world data, may lead to better generalization and performance in certain applications. Real-world datasets often exhibit inherent class imbalance that reflects the actual class frequencies, and training models on such datasets can improve their ability to handle real-world scenarios effectively [49,50].
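When a dataset's class distribution is imbalanced, a common mitigation during training is inverse-frequency class weighting; a minimal sketch with made-up label counts (the 900/100 split is hypothetical):

```python
from collections import Counter

# Hypothetical label distribution: real news heavily outnumbers fake news.
labels = ["real"] * 900 + ["fake"] * 100

counts = Counter(labels)
n, k = len(labels), len(counts)

# Inverse-frequency weights (the scheme used, e.g., by scikit-learn's
# class_weight="balanced"): weight = n_samples / (n_classes * class_count).
weights = {cls: n / (k * c) for cls, c in counts.items()}
print(weights)  # fake examples weighted 9x heavier than real ones
```

Passing such weights to the training loss makes each misclassified fake example count proportionally more, offsetting the majority-class pull described above.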

5.3. Cross-Dataset Generalization and Transferability

Although the majority of fake news detection studies evaluate models within a single dataset, real-world deployment requires robustness across datasets that differ in domain, temporal scope, linguistic characteristics, and annotation practices [51,52]. As a result, performance measured under within-dataset evaluation provides only a limited indication of practical reliability and may overestimate robustness under distributional shift [52]. Rather than proposing new benchmarks, this subsection synthesizes empirical evidence from prior cross-dataset and transfer-oriented evaluations to identify systematic generalization failures that are primarily attributable to dataset properties rather than model architecture [53,54].
To this end, we review studies that explicitly assess fake news detection models under cross-dataset settings, where training and evaluation are conducted on different datasets [51,52]. Because reported metrics vary across studies (e.g., accuracy, F1-score, AUC), we focus on relative performance variation or degradation observed within individual works, rather than comparing absolute scores across papers [51]. Table 5 summarizes representative evidence from the recent literature.
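The relative-degradation measure used in this comparison is easy to make precise; a sketch with hypothetical within- and cross-dataset scores (not figures from any specific study):

```python
# Hypothetical evaluation results for one model within a single study.
within_dataset_f1 = 0.92   # train and test on dataset A
cross_dataset_f1 = 0.64    # train on dataset A, test on dataset B

# Relative degradation: the fraction of the within-dataset score that is
# lost when the model is evaluated on an unseen dataset.
degradation = (within_dataset_f1 - cross_dataset_f1) / within_dataset_f1
print(f"{degradation:.1%}")  # 30.4%
```

Because the measure is a ratio computed within one study, it sidesteps the problem that different papers report different absolute metrics (accuracy, F1-score, AUC).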
Several consistent patterns emerge from the evidence summarized in Table 5. Across all surveyed studies, cross-dataset evaluation results in non-trivial performance degradation, ranging from moderate drops of approximately 8–15% to severe declines exceeding 40%, even when identical model architectures and training strategies are employed [51,52]. This observation indicates that the dataset characteristics exert a dominant influence on generalization behavior, often outweighing the impact of architectural choice [53].
A closer examination of the reviewed studies suggests that cross-dataset degradation is primarily driven by mismatches in dataset properties rather than limitations of the model capacity [52,53]. Domain shift constitutes a major source of error: datasets constructed from curated news articles differ substantially from social media datasets in terms of linguistic style, topical diversity, noise level, and narrative structure [51,54]. Models trained on long-form, professionally edited content frequently rely on stylistic regularities and lexical cues that fail to transfer to short, informal, or conversational text, leading to systematic performance degradation under cross-domain evaluation [52].
Annotation granularity further exacerbates generalization challenges [51,53]. Datasets labeled at the article or source level implicitly encourage models to exploit topic-level or publisher-level correlations, whereas claim-level datasets require fine-grained semantic reasoning and factual consistency assessment [51]. When models trained under one annotation paradigm are evaluated on datasets constructed under another, their learned decision boundaries become misaligned, resulting in degraded cross-dataset performance even in the absence of architectural changes [53].
Evidence from multiple studies also points to benchmark saturation and dataset aging as important contributors to inflated within-dataset results [51,52]. Legacy benchmarks such as ISOT are frequently associated with near-perfect accuracy across a wide range of models [51]. However, models achieving over 99% accuracy on ISOT often exhibit substantial performance degradation when evaluated on more challenging datasets such as LIAR or FakeNewsNet [51,52]. This discrepancy suggests that evaluations relying primarily on saturated or static datasets may overestimate real-world generalization by encouraging overfitting to dataset-specific lexical and stylistic artifacts [52].
Importantly, increasing model complexity alone does not resolve these limitations [52,53]. Ensemble learning, contrastive objectives, and multimodal fusion improve robustness within individual datasets but remain sensitive to shifts in domain, annotation policy, and temporal distribution [53]. These findings reinforce the view that cross-dataset generalization is fundamentally constrained by dataset design rather than model capacity [52].
At the same time, recent transfer-focused studies demonstrate that stronger cross-dataset robustness is achievable when training paradigms explicitly reduce dependence on dataset-specific fake artifacts [54]. Approaches that emphasize dataset-aware objectives, such as transfer- and stream-based detection, report substantially smaller performance degradation across heterogeneous datasets, indicating that improved generalization is attainable through careful dataset and evaluation design [54].
Overall, the evidence synthesized in this subsection indicates that cross-dataset generalization remains a central bottleneck in fake news detection research [51,52]. High accuracy reported on individual benchmarks should therefore be interpreted with caution, as it may reflect alignment between training and test data distributions rather than genuine robustness [53]. These observations motivate the need for temporally diverse, domain-inclusive, and transfer-aware datasets and evaluation protocols, as discussed in the following sections [54].

6. Role of Multimodal Datasets in Fake News Detection

6.1. Comparison with Unimodal Datasets

Multimodal datasets, which combine text, images, and videos, generally outperform unimodal datasets in fake news detection. Research shows that models using multimodal data better capture context and detect inconsistencies, leading to improved accuracy. For instance, the Fakeddit dataset, integrating text and images, achieved 87% accuracy with a CNN architecture, surpassing text-only methods [30,55,56]. Studies indicate that multimodal news classification can improve accuracy by up to 8.11% compared to text-only classification [57]. These results are supported by further research, underscoring the superiority of multimodal approaches over unimodal ones in detecting fake news [58,59].
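A common way such models combine modalities is late fusion, where per-modality classifier scores are merged into one decision; a minimal sketch with hypothetical scores (the equal weighting is illustrative, not the architecture of any surveyed system):

```python
# Hypothetical per-modality fake-news probabilities for one post.
text_score = 0.55    # text classifier: weakly suspicious headline
image_score = 0.90   # image classifier: strong manipulation traces

# Weighted late fusion; in practice the weights are tuned on validation data.
WEIGHTS = {"text": 0.5, "image": 0.5}
fused = WEIGHTS["text"] * text_score + WEIGHTS["image"] * image_score

label = "fake" if fused >= 0.5 else "real"
print(fused, label)  # 0.725 fake
```

The example illustrates why multimodality helps: neither modality alone is decisive here, but cross-referencing the two pushes the post over the decision threshold.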

6.2. Challenges of Multimodal Datasets

Creating and utilizing multimodal datasets involves several challenges. Collecting data from various sources (e.g., social media, news articles, images) and integrating them into a unified dataset is complex and labor-intensive, and ensuring that data from different modalities are synchronized and accurately linked is critical for effective analysis [55,56,60]. Annotating multimodal datasets requires expertise in both textual and visual analysis, and the process is more time-consuming and expensive than annotating unimodal datasets because it involves reviewing and labeling multiple types of data [55,56]. In addition, processing multimodal data demands significant computational resources, as models must handle large volumes of data and perform complex feature extraction and integration, which can be computationally intensive and require advanced hardware [55,61].

6.3. Advantages and Disadvantages of Multimodal Datasets

Multimodal datasets offer several advantages. By leveraging multiple data types, they can capture a richer set of features and contextual information, leading to higher detection accuracy and a more comprehensive understanding of news content [30,55]. They also enhance robustness by enabling cross-verification across modalities, which reduces the likelihood of false positives and negatives [55,56]. Moreover, combining text and visual data improves contextual understanding; for example, a sensational headline paired with an equally sensational image can be more easily identified as fake news [30,62]. However, multimodal approaches also have disadvantages. They are resource-intensive, requiring greater computational power and storage, which can be a barrier for smaller research teams or organizations with limited resources [30,55,56]. They are more complex to implement than unimodal models, involving sophisticated data preprocessing, feature extraction, and model integration techniques [55,56]. Finally, high-quality multimodal datasets are less readily available, as collecting and curating large-scale datasets that include both textual and visual content is challenging and resource-intensive [30,62].

7. Challenges and Limitations in Current Fake News Datasets

7.1. From Dataset Construction to Generalization Failure

Fake news detection performance is fundamentally shaped by the properties of the datasets used for training and evaluation. Class imbalance remains pervasive across benchmarks, where real news instances substantially outnumber fake news, biasing models toward majority-class predictions and increasing false negative rates for rare misinformation categories [10,62]. Data noise introduced by informal user-generated content, typographical errors, and inconsistent annotations further obscures discriminative patterns and degrades representation learning [62,63]. In parallel, the dynamic evolution of misinformation narratives induces distribution drift: topics, writing styles, and dissemination strategies change rapidly, rendering static datasets increasingly mismatched to contemporary threats [62,64].
These dataset-level characteristics propagate directly into model behavior. Imbalanced and noisy corpora promote majority-class overfitting and reduce sensitivity to minority misinformation categories, while biased topic and source distributions yield uneven detection performance across domains and narratives [62,63]. Models trained on temporally outdated corpora further overfit to obsolete stylistic and topical patterns, undermining robustness in rapidly evolving information environments [10,62]. As a consequence, high in-dataset accuracy often masks brittle decision boundaries and fragile feature reliance.
The resulting limitations become most visible in cross-dataset and cross-domain evaluation. Models trained on platform-specific corpora (e.g., social media) frequently transfer poorly to other domains (e.g., long-form news) due to divergent linguistic conventions and feature distributions [65,66]. Cross-lingual generalization remains particularly constrained, as detectors trained predominantly on English-centric data struggle to capture morphological, syntactic, and cultural variation in other languages [62,64]. Temporal misalignment further degrades transferability: rapidly evolving misinformation narratives render historical datasets increasingly obsolete, diminishing effectiveness in longitudinal deployment [10,67].
Taken together, these interactions reveal a systematic pipeline from dataset construction flaws to model-level overfitting and eventual generalization collapse, underscoring the need for balanced, continuously updated, multilingual, and domain-diverse benchmarks to support robust misinformation detection in realistic settings.

7.2. Biases in Datasets

Biases present in fake news datasets can significantly impact detection outcomes. Selection bias arises when the sampled data fails to accurately reflect the broader population, leading to skewed representations and potentially flawed conclusions [62,63]. This often manifests when certain types of fake news are over-represented while others are under-represented, causing algorithms to perform inconsistently across categories [62,63]. For instance, Baly et al. [41] reported that the MediaEval Benchmarking Initiative for Multimedia Evaluation dataset disproportionately featured political news (approximately 60% of articles), whereas health-related fake news comprised only about 10%, yielding model accuracy of 85% on political news but only 60% on health-related news. Similarly, Shu et al. [13] found that FakeNewsNet over-represented fake news from popular websites relative to lesser-known sources, leading to 78% accuracy on content from well-known websites but just 55% on content from less prominent sources. Labeling bias occurs when the process of labeling news as fake or real introduces systematic errors influenced by annotators’ preconceptions or inconsistent criteria, a problem that is especially pronounced in crowdsourced settings [62]. Studies show that articles aligned with annotators’ political views were 30% less likely to be labeled as fake than opposing articles [62,68], and Chen et al. [69] observed that clearer, stricter guidelines increased inter-annotator agreement by 15%, reducing bias (see also [70]). Cultural and linguistic bias emerges when data are predominantly in one language or rooted in a specific cultural context, limiting generalization to other languages or cultures [62,63,64].
Models trained primarily on English content from Western sources often struggle with languages such as Mandarin, Arabic, or Spanish due to linguistic and cultural differences [64], and phenomena like satire or parody can be misinterpreted across cultures [63]. Underrepresentation of minority languages further degrades performance, with notable accuracy drops reported for languages such as Swahili or Tagalog compared to major languages [62]. Efforts like the Multilingual and Multicultural Fake News Detection (MM-FND) project aim to mitigate these issues by compiling datasets across diverse languages and cultural contexts [64]. These biases propagate into specific failure modes rather than merely affecting overall accuracy. Selection and cultural biases lead models to rely on topic and stylistic cues, resulting in false positives for satire or parody, especially in non-Western contexts. At the same time, misinformation in underrepresented domains is more likely to go unnoticed, leading to higher false-negative rates.

7.3. LLM-Era Challenges and Benchmark Instability

The emergence of large language models fundamentally alters the assumptions underlying the design and evaluation of fake news datasets. Chen and Shu [71] show that LLM-generated misinformation is often fluent, coherent, and stylistically indistinguishable from human-authored content, substantially reducing the effectiveness of traditional lexical, stylistic, and propagation-based cues. As a result, even balanced and bias-mitigated datasets may fail to capture the true difficulty of contemporary misinformation detection.
A more fundamental challenge arises from benchmark instability and data contamination. Yang et al. [72] demonstrate that static evaluation datasets are increasingly unreliable because both generative models and detectors are trained on overlapping web-scale corpora. Classical human-written texts, including literary and religious sources, are frequently misclassified as machine-generated, while commercial detectors continuously adapt to public benchmarks. This circular dependency undermines reproducibility, inflates reported performance, and weakens the validity of cross-study comparisons, casting doubt on the notion of a fixed ground truth in LLM-era misinformation detection.
These challenges are further amplified at the ecosystem level. Chen and Shu [71] document the rapid proliferation of AI-powered news websites and synthetic journalism across multiple languages, generating large volumes of automated content that pollute information ecosystems. This shift introduces new dataset requirements: continuous updating, explicit modeling of synthetic–human mixtures, multilingual and cross-domain coverage, and evaluation protocols resilient to contamination. Without such adaptations, existing benchmarks risk systematically underestimating real-world difficulty and overestimating model robustness in adversarial misinformation environments.

8. Best Practices for Creating High-Quality Fake News Datasets

8.1. Annotation Methodologies

Reliable annotation methodologies are essential for high-quality fake news datasets. Expert annotators familiar with fake news nuances help maintain accuracy and consistency [64,73]. Automated methods can assist but require human oversight to address subtleties that algorithms miss [63,74]. Quality control measures, such as cross-validation by multiple annotators and periodic reviews, are critical for dataset integrity [63,74]. For example, the NELA-GT-2018 dataset [27] employs cross-verification to reduce bias and error [64]. To elaborate on these methodologies and provide concrete examples:
1. Expert annotators: Utilizing expert annotators ensures that those who are well-versed in the nuances of fake news can provide high-quality annotations. For example, the MediaEval Benchmarking Initiative engages professional journalists and fact-checkers to annotate news articles, ensuring that the annotations are based on a deep understanding of journalistic integrity and misinformation tactics [41].
2. Automated assistance with human oversight: While automated methods can expedite the annotation process, they must be complemented by human oversight. Automated systems can flag potentially fake news based on certain linguistic cues or metadata, but human annotators are necessary to verify these flags. For instance, the FakeNewsNet dataset uses an automated system to initially filter articles based on their sources and content, followed by human verification to ensure accuracy [13].
3. Cross-validation by multiple annotators: Cross-validation involves having multiple annotators review the same content to ensure consistency and accuracy. For example, in the creation of the LIAR dataset, each statement was reviewed by three annotators to cross-check labels and resolve discrepancies through discussion and majority voting [22].
4. Periodic reviews and updates: Regular reviews and updates to the dataset ensure that it remains relevant and accurate. This involves periodically re-evaluating the data to correct any errors and update annotations based on new information or shifts in the nature of fake news. The PHEME dataset, for example, undergoes regular updates where annotations are reviewed and revised based on ongoing events and newly available information [20].
5. Specific quality control measures:
  • Inter-annotator agreement (IAA): This measure is used to assess the consistency among annotators. High IAA scores indicate that annotators are in agreement, thereby enhancing the reliability of the dataset. For instance, the Yelp review dataset uses IAA metrics to ensure that labels for fake and real reviews are consistent across annotators [25].
  • Blind review processes: In blind review processes, annotators are unaware of each other’s assessments, which helps reduce bias. The Verification Corpus uses a blind review process in which posts are labeled independently by different annotators to ensure unbiased annotations [29].
  • Question and answer annotation: The POLygraph dataset incorporates a detailed question and answer annotation scheme, expanding beyond simple factuality to include questions on the disseminator’s intention, the target of the news, and the potential harm caused, providing a more comprehensive understanding of fake news [75].
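The quality-control measures above lend themselves to simple computations. As an illustrative sketch (the label values and the tie-handling policy are assumptions, not taken from any of the datasets discussed), the following computes Cohen’s kappa between two annotators and resolves per-item labels by majority voting, flagging ties for adjudication:

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators' label sequences."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

def majority_vote(annotations):
    """Resolve one item's label from several annotators; ties return None."""
    counts = Counter(annotations).most_common(2)
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # unresolved tie: send back for discussion/adjudication
    return counts[0][0]
```

In practice, low kappa values or frequent ties would trigger the guideline revisions and adjudication rounds described above.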

8.2. Incorporating Real-World Dynamics

Continuously updating datasets with new data reflecting current trends is crucial for capturing evolving patterns of fake news [63,74]. This involves monitoring various information sources, including social media and news websites, to capture emerging narratives and misinformation tactics [73,76]. Integrating these real-world dynamics keeps datasets relevant and useful for training robust fake news detection models [74,76]. Using automated tools for data scraping and periodic manual updates can help maintain dataset relevance [74,76].
For example, the CREDBANK dataset is continuously updated with tweets evaluated for credibility by human annotators, ensuring it reflects the most recent trends and events [19]. Similarly, the CoAID dataset includes data on COVID-19 misinformation, which is regularly updated to capture the evolving nature of misinformation during the pandemic [77].

8.3. Ensuring Reliability and Validity

Maintaining the reliability and validity of fake news datasets involves several strategies. First, constructing datasets from diverse sources minimizes bias and increases representativeness [63,64]. Second, rigorous pre-processing steps, such as removing duplicates and normalizing data, enhance dataset quality [63,64]. Third, standardized metrics and evaluation methods aid in assessing dataset performance and reliability [63,64]. The FakeWatch ElectionShield dataset, for instance, consolidates data from multiple sources and employs robust pre-processing and annotation techniques to ensure high quality and validity [63,64].
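The pre-processing steps mentioned above can be sketched as follows; the specific normalization choices (Unicode NFKC, URL stripping, lowercasing) are illustrative assumptions rather than the pipeline of any particular dataset:

```python
import re
import unicodedata

def normalize_text(text):
    """Normalize Unicode, drop URLs, collapse whitespace, lowercase."""
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"https?://\S+", " ", text)  # strip URLs
    return re.sub(r"\s+", " ", text).strip().lower()

def deduplicate(articles):
    """Remove exact duplicates after normalization, keeping first occurrences."""
    seen, unique = set(), []
    for article in articles:
        key = normalize_text(article)
        if key not in seen:
            seen.add(key)
            unique.append(article)
    return unique
```

Near-duplicate detection (e.g., MinHash over shingles) would catch lightly paraphrased copies that exact matching misses.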
An example of ensuring reliability is the implementation of the Media Bias/Fact Check (MBFC) methodology, which involves cross-referencing articles with verified fact-checking websites to ensure the accuracy of annotations. Another example is the use of ground-truth data from established fact-checking organizations such as Snopes and PolitiFact to validate the dataset labels [78].

9. Evolution of Fake News Detection Models with Dataset Availability

9.1. Trends in Model Performance

  • Early Models and Their Limitations
From 2017 to 2020, early fake news detection models primarily employed conventional machine learning algorithms, including logistic regression, support vector machines (SVMs), and decision trees. These models largely relied on textual data derived from news articles and social media posts, focusing on linguistic features such as n-grams, part-of-speech tags, and basic metadata [15,45,79].
  • Advancements with Deep Learning
Beginning in 2021, a substantial number of peer-reviewed articles have been published on the utilization of deep learning for fake news detection. With the availability of larger, more complex datasets, models began to incorporate deep learning methodologies. Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) have enabled more sophisticated text analysis by capturing semantic subtleties and contextual information, resulting in notable enhancements in detection accuracy [80,81,82].
  • Impact of Transformer Models
The development of transformer-based models, such as BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer), has revolutionized fake news detection. These models excel in understanding context and generating coherent text, making them highly effective in identifying fake news. Their training on extensive datasets has enhanced their ability to generalize across diverse content types [83,84,85].
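As a minimal sketch of the early feature-based approach described above, the following trains a logistic regression over word uni- and bigram counts in pure Python; real systems of that era used richer features (part-of-speech tags, metadata) and library implementations, so this is only illustrative:

```python
import math
import re
from collections import defaultdict

def ngram_features(text):
    """Word uni- and bigram counts, the shallow features early models used."""
    tokens = re.findall(r"[a-z']+", text.lower())
    feats = defaultdict(float)
    for t in tokens:
        feats[t] += 1.0
    for a, b in zip(tokens, tokens[1:]):
        feats[a + "_" + b] += 1.0
    return feats

class LogisticRegression:
    """Tiny per-sample gradient-descent logistic regression on sparse dicts."""
    def __init__(self, lr=0.5, epochs=50):
        self.w, self.b = defaultdict(float), 0.0
        self.lr, self.epochs = lr, epochs

    def _score(self, feats):
        return self.b + sum(self.w[f] * v for f, v in feats.items())

    def predict_proba(self, text):
        return 1.0 / (1.0 + math.exp(-self._score(ngram_features(text))))

    def fit(self, texts, labels):  # labels: 1 = fake, 0 = real
        data = [(ngram_features(t), y) for t, y in zip(texts, labels)]
        for _ in range(self.epochs):
            for feats, y in data:
                err = y - 1.0 / (1.0 + math.exp(-self._score(feats)))
                self.b += self.lr * err
                for f, v in feats.items():
                    self.w[f] += self.lr * err * v
        return self
```

Such models latch onto surface cues (sensational wording, clickbait phrasing), which is precisely why they generalize poorly to the fluent LLM-generated misinformation discussed in Section 7.3.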

9.2. Impact of Dataset Versions and Updates

  • Significance of Regular Updates
Updated datasets profoundly impact longitudinal studies and the effectiveness of models. Regular updates ensure models remain relevant and handle the latest fake news trends and tactics. For example, NELA-GT-2020 and NELA-GT-2022 include additional data and updated annotations that reflect current misinformation trends, thereby enhancing model effectiveness over time [22,86]. Please refer to Table 6 for a summary of updated datasets.
  • Addressing Distribution Drift
As the nature of fake news changes, models trained on outdated datasets may underperform on new misinformation types. Regular dataset updates help mitigate this issue by ensuring that models adapt to distribution shifts and maintain high performance [87,88].
  • Empirical Evidence
Successful case studies have shown that updated datasets can significantly enhance the performance of fake news detection models. For instance, models trained on the latest versions of the LIAR dataset have exhibited improved accuracy in detecting fake news. This improvement is largely due to the more accurate representation of recent misinformation patterns, which helps models better adapt to and identify new types of fake news [22,62].
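A full drift analysis compares model accuracy across time-sliced dataset versions; as a cheap illustrative proxy (an assumption of this sketch, not a method taken from the cited works), vocabulary overlap between an older and a newer snapshot can signal how far the text distribution has moved:

```python
def vocabulary_overlap(old_texts, new_texts):
    """Jaccard overlap of word vocabularies between two dataset snapshots.

    Values near 1.0 suggest a stable distribution; low values flag drift
    and the need to retrain or re-annotate.
    """
    old_vocab = {w for t in old_texts for w in t.lower().split()}
    new_vocab = {w for t in new_texts for w in t.lower().split()}
    return len(old_vocab & new_vocab) / len(old_vocab | new_vocab)
```
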

10. Ethical Considerations in Fake News Datasets

10.1. Privacy and Consent

Ensuring privacy and consent for individuals in fake news datasets is crucial. Anonymizing personal information prevents identification and misuse by applying techniques such as tokenization and removal of PII; for example, the FakeWatch ElectionShield dataset anonymizes user data to safeguard privacy [64,73,76]. Equally important is obtaining informed consent from individuals whose data are included, which requires clearly communicating the intended use of the data and ensuring explicit user agreement; ethical research practices demand transparency and respect for privacy rights [89].
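A minimal sketch of such anonymization is shown below, assuming hypothetical regex patterns for emails, phone numbers, and user handles; a production pipeline would rely on a vetted PII/NER toolkit and a reviewed pattern set:

```python
import hashlib
import re

# Hypothetical, deliberately simple PII patterns for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")
HANDLE = re.compile(r"@\w+")

def pseudonym(handle, salt="dataset-release-1"):
    """Stable salted token: the same user maps to the same pseudonym,
    preserving propagation structure without exposing identity."""
    digest = hashlib.sha256((salt + handle).encode()).hexdigest()[:8]
    return "<USER_" + digest + ">"

def anonymize(text):
    text = EMAIL.sub("<EMAIL>", text)
    text = PHONE.sub("<PHONE>", text)
    # Handles are pseudonymized rather than dropped, so social-context
    # features (who shares what) remain usable after anonymization.
    return HANDLE.sub(lambda m: pseudonym(m.group()), text)
```
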

10.2. Societal Impacts

Bias propagation must be carefully considered when using specific datasets for fake news detection, as propagating biased or unverified information can influence public opinion and cause harm. Datasets focusing on particular types of fake news or sources may inadvertently reinforce biases if not managed properly [63,89]. Managing biases properly means ensuring that the dataset is representative, annotations are consistent, and any inherent biases are identified and mitigated through diverse data collection, rigorous annotation standards, regular updates, and the use of bias detection and mitigation tools. Influence on public opinion calls for transparency and accountability when developing and deploying fake news detection models to avoid perpetuating or creating biases. Ethical implications—such as influencing elections or public health decisions—highlight the need for rigorous standards in dataset creation and use [3,22,89]. To overcome these challenges, it is essential to ensure transparency in data collection and annotation, regularly update datasets to reflect current trends, involve diverse annotators to minimize bias, employ robust validation methods, and engage stakeholders (policymakers, researchers, and the public) to ensure ethical deployment.

10.3. Transparency and Accountability

Transparent, accountable deployment of fake news detection models requires adherence to well-defined guidelines that prevent misuse and unintended harm, thereby preserving the integrity and reliability of these systems [90]. Equally important is clear disclosure of training data, data provenance, and known biases so stakeholders can understand model decision-making, identify potential inaccuracies, and enable external scrutiny; such openness is fundamental to improving models, maintaining public trust, and aligning with ethical standards [90].

10.4. Media and Journalism Implications of Dataset Design in Cross-Cultural Contexts

Fake news datasets are closely connected to journalism, as they frequently rely on journalistic institutions, fact-checking organizations, and editorial judgments to define ground truth. Communication research has shown that fake news challenges established norms of verification and credibility rather than constituting a purely technical classification problem [91]. From a dataset perspective, common labeling practices—often binary or coarse-grained—simplify journalistic judgments that are inherently contextual, provisional, and temporally evolving. Such simplifications can limit the alignment between dataset-trained models and real-world media practices, particularly in early-stage reporting or contested news events [92].
Dataset design also carries normative implications. When datasets implicitly treat journalistic form, institutional affiliation, or source reputation as proxies for legitimacy, detection models may inherit these assumptions. Framing fake news as “counterfeit news” highlights how datasets may encode normative distinctions between legitimate and illegitimate information sources, with downstream implications for automated content filtering and editorial decision-making [93].
These limitations are further amplified across diverse linguistic and cultural contexts. Recent bibliometric evidence indicates that fake news and media trust research is geographically concentrated, with many widely used datasets derived from English-language media environments and a narrow set of platform ecosystems [94]. As a result, benchmark datasets may insufficiently capture misinformation dynamics in regions characterized by different journalistic traditions, communication infrastructures, and platform usage patterns. Annotation guidelines developed in one context may therefore misinterpret culturally specific forms of expression, such as satire, political dissent, or informal reporting styles.
From a dataset design standpoint, these observations highlight the importance of transparent documentation of contextual assumptions, broader multilingual coverage, and culturally informed annotation practices. Addressing these factors does not expand the scope of fake news detection research; rather, it clarifies the conditions under which dataset-driven models can be expected to generalize reliably across media systems and societal contexts.

11. Future Directions

As fake news evolves, creating adaptive datasets to reflect changing patterns is essential. One approach is to develop dynamic datasets that are regularly updated with new data from social media, news websites, and forums. Incorporating feedback mechanisms based on the performance of a fake news detection model can maintain relevance and effectiveness, and using machine learning to identify and integrate emerging patterns further enhances adaptiveness. In practice, this includes implementing real-time data collection through automated systems that continuously curate data from diverse sources to keep datasets current, establishing model feedback loops in which model performance highlights gaps that require more data or different signals, and applying anomaly detection techniques to surface and incorporate newly emerging misinformation patterns.
Synthetic data generation enables supplementing real-world datasets, addressing data scarcity and imbalance. Techniques such as Generative Adversarial Networks (GANs) can create realistic fake news samples to augment datasets, training more robust models with diverse examples [95,96]. To preserve integrity, hybrid datasets should combine synthetic and real data using approaches such as data augmentation and transfer learning, so that synthetic data complements rather than distorts real distributions [97,98]. It is also critical to implement bias detection and correction mechanisms within the generation pipeline to avoid reinforcing existing biases, for example, through fairness-aware objectives and auditing [99,100].
Developing models that can handle multiple languages and cultural contexts is crucial for global effectiveness. Models restricted to a single language or cultural framework may not perform adequately in varied settings, limiting their utility; multilingual and cross-cultural capabilities enable more accurate detection across diverse environments. Key strategies include curating multilingual training corpora that span languages and cultural contexts [101], applying cross-lingual transfer learning to transfer knowledge from high-resource to low-resource languages [102], and incorporating cultural context via culturally aware embeddings, annotations, and evaluation protocols [103].
Finally, a particularly urgent future direction concerns datasets for detecting AI-generated fake news produced by modern large language models and diffusion-based image generators. While synthetic data augmentation has been explored, most existing benchmarks rely on earlier-generation models and fail to capture the realism and multimodal coherence of contemporary AI-generated news. Recent evidence shows that realistic image–caption pairs generated by state-of-the-art systems pose a substantial challenge not only for humans but also for current multimodal detection models, suggesting that existing datasets significantly overestimate real-world robustness [104]. Addressing this gap requires continuously updated, multimodal datasets that reflect advances in generative models and their use in real misinformation campaigns.

12. Conclusions

This survey has examined the pivotal role of datasets in the development of effective fake news detection models. We explored various facets, including the characteristics and challenges of existing fake news datasets, their impact on model performance, and the importance of multimodal and high-quality datasets. In addition, the survey synthesized evidence on cross-dataset generalization, showing that strong performance within individual datasets does not necessarily translate to robustness across different data sources. The analysis further highlighted significant ethical considerations in the creation and utilization of datasets, along with the evolution of detection models in response to the availability of new and updated data.
Datasets remain the cornerstone of progress in fake news detection technologies. As the misinformation landscape continues to evolve, the development of adaptive, diverse, and transfer-aware datasets becomes increasingly critical. Future research should prioritize the construction of dynamic datasets that reflect changing misinformation patterns, support cross-dataset evaluation, and leverage synthetic data to enhance diversity and robustness. Equally important is the enforcement of ethical standards in dataset creation and use, which is essential for maintaining public trust and ensuring the long-term effectiveness of fake news detection systems. Continued innovation and collaboration in dataset design and evaluation will be key to addressing the complex and evolving challenges posed by fake news.

13. Glossary

  • Convos (in PHEME dataset) [20] Refers to conversations. Specifically, it denotes the threads or discussions on social media platforms where a statement, claim, or rumor is being discussed or responded to by various users. The dataset categorizes these conversations as true, false, or unverified.
  • 9 pages (in BuzzFace dataset) [24] Refers to the specific Facebook pages from which the dataset’s news samples were collected. These pages likely include a mix of reputable news sources, satire, and pages known for spreading misinformation, providing a diverse set of news items for analysis.
  • Parallel data (in M4 dataset) [31] Refers to a collection of data that includes pairs of text in two different languages that are translations of each other. The M4 dataset consists of 147,000 such parallel pairs, with 3000 pairs for each of 49 different languages. This type of data is commonly used in training and evaluating machine translation systems.

Author Contributions

S.K.: Conceptualization, Writing—Original Draft, Writing—Review and Editing, Methodology, Visualization, Data Curation. A.W.: Conceptualization, Writing—Review and Editing, Supervision. M.G.: Conceptualization, Supervision. M.P.: Conceptualization, Supervision. S.S.: Writing—Review and Editing, Supervision. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the European Union under the Horizon Europe grant OMINO (grant number 101086321). A.W. and S.K. were also co-financed with funds from the Polish Ministry of Education and Science under the program entitled International Co-Financed Projects.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available in the article.

Acknowledgments

A.W. and S.K. were funded by the European Union under the Horizon Europe grant OMINO (grant number 101086321). Views and opinions expressed are, however, those of the authors only and do not necessarily reflect those of the European Union or the European Research Executive Agency. Neither the European Union nor the European Research Executive Agency can be held responsible for them. A.W. and S.K. were also co-financed with funds from the Polish Ministry of Education and Science under the program entitled International Co-Financed Projects.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Appendix A. Scopus Advanced Search Query

To ensure a comprehensive review of the impact of datasets on fake news detection, the following advanced search query was used in Scopus:
TITLE-ABS-KEY("fake news" OR "misinformation" OR "disinformation" OR "false information") AND
("detection" OR "identification" OR "classification") AND
("dataset" OR "data set" OR "corpus" OR "data collection") AND
("algorithm" OR "model" OR "method" OR "approach") AND
(LIMIT-TO(SUBJAREA, "COMP") OR LIMIT-TO(SUBJAREA, "ENGI") OR LIMIT-TO(SUBJAREA, "MATH") OR LIMIT-TO(SUBJAREA, "SOCI") OR LIMIT-TO(SUBJAREA, "DECI"))

Appendix B. PRISMA 2020 Checklist

Table A1. Completed PRISMA 2020 checklist for this systematic review.
Item | Description | Reported in Manuscript
Title
1 | Identify the report as a systematic review | Title/Abstract
Abstract
2 | Structured summary | Abstract
Introduction
3 | Rationale | Introduction
4 | Objectives | Introduction
Methods
5 | Eligibility criteria | Methods–Screening and Selection Criteria
6 | Information sources | Methods–Data Sources and Search Strategy
7 | Search strategy | Appendix A (Search Query)
8 | Selection process | Methods + PRISMA Flow Diagram
9 | Data collection process | Methods–VOSviewer Export Procedure
10 | Data items | Methods–Metadata and Keyword Extraction
11 | Risk of bias assessment | Not applicable
12 | Effect measures | Not applicable
13 | Synthesis methods | Methods–VOSviewer Visualization
Results
14 | Study selection | PRISMA Flow Diagram + Methods
15 | Study characteristics | Results Section
16 | Risk of bias in studies | Not applicable
17 | Results of individual studies | Reported narratively
18 | Synthesis results | Results–Keyword Co-occurrence Analysis
Discussion
19 | Summary of evidence | Discussion
20 | Limitations | Discussion
21 | Conclusions | Conclusions Section
Other Information
22 | Registration | Not registered
23 | Support/Funding | As stated in manuscript
24 | Competing interests | As stated in manuscript

References

  1. Lazer, D.M.; Baum, M.A.; Benkler, Y.; Berinsky, A.J.; Greenhill, K.M.; Menczer, F.; Metzger, M.J.; Nyhan, B.; Pennycook, G.; Rothschild, D.; et al. The science of fake news. Science 2018, 359, 1094–1096. [Google Scholar] [CrossRef]
  2. Lewandowsky, S.; Ecker, U.K.; Cook, J. Beyond misinformation: Understanding and coping with the “post-truth” era. J. Appl. Res. Mem. Cogn. 2017, 6, 353–369. [Google Scholar] [CrossRef]
  3. Reisach, U. The responsibility of social media in times of societal and political manipulation. Eur. J. Oper. Res. 2021, 291, 906–917. [Google Scholar] [CrossRef] [PubMed]
  4. Shu, K.; Wang, S.; Liu, H. Beyond news contents: The role of social context for fake news detection. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining; Association for Computing Machinery: New York, NY, USA, 2019; pp. 312–320. [Google Scholar]
  5. Abdelminaam, D.S.; Ismail, F.H.; Taha, M.; Taha, A.; Houssein, E.H.; Nabil, A. Coaid-deep: An optimized intelligent framework for automated detecting COVID-19 misleading information on twitter. IEEE Access 2021, 9, 27840–27867. [Google Scholar] [CrossRef]
  6. Goldani, M.H.; Momtazi, S.; Safabakhsh, R. Detecting fake news with capsule neural networks. Appl. Soft Comput. 2021, 101, 106991. [Google Scholar] [CrossRef]
  7. Truică, C.O.; Apostol, E.S. It’s all in the embedding! fake news detection using document embeddings. Mathematics 2023, 11, 508. [Google Scholar] [CrossRef]
  8. Peng, L.; Jian, S.; Kan, Z.; Qiao, L.; Li, D. Not all fake news is semantically similar: Contextual semantic representation learning for multimodal fake news detection. Inf. Process. Manag. 2024, 61, 103564. [Google Scholar] [CrossRef]
  9. Jain, M.K.; Gopalani, D.; Meena, Y.K. ConFake: Fake news identification using content based features. Multimed. Tools Appl. 2024, 83, 8729–8755. [Google Scholar] [CrossRef]
  10. Lai, J.; Yang, X.; Luo, W.; Zhou, L.; Li, L.; Wang, Y.; Shi, X. RumorLLM: A Rumor Large Language Model-Based Fake-News-Detection Data-Augmentation Approach. Appl. Sci. 2024, 14, 3532. [Google Scholar] [CrossRef]
  11. Wu, L.; Morstatter, F.; Carley, K.M.; Liu, H. Misinformation in social media: Definition, manipulation, and detection. ACM SIGKDD Explor. Newsl. 2019, 21, 80–90. [Google Scholar] [CrossRef]
  12. Fallis, D. What is disinformation? Libr. Trends 2015, 63, 401–426. [Google Scholar]
  13. Shu, K.; Mahudeswaran, D.; Wang, S.; Lee, D.; Liu, H. FakeNewsNet: A Data Repository with News Content, Social Context and Dynamic Information for Studying Fake News on Social Media. arXiv 2018, arXiv:1809.01286. [Google Scholar] [CrossRef]
  14. Hanselowski, A.; PVS, A.; Schiller, B.; Caspelherr, F.; Chaudhuri, D.; Meyer, C.M.; Gurevych, I. A retrospective analysis of the fake news challenge stance detection task. arXiv 2018, arXiv:1806.05180. [Google Scholar] [CrossRef]
  15. Shu, K.; Sliva, A.; Wang, S.; Tang, J.; Liu, H. Fake news detection on social media: A data mining perspective. ACM SIGKDD Explor. Newsl. 2017, 19, 22–36. [Google Scholar] [CrossRef]
  16. Zhou, X.; Zafarani, R. Fake news: A survey of research, detection methods, and opportunities. arXiv 2018, arXiv:1812.00315. [Google Scholar]
  17. Zhou, X.; Zafarani, R. A survey of fake news: Fundamental theories, detection methods, and opportunities. ACM Comput. Surv. (CSUR) 2020, 53, 1–40. [Google Scholar] [CrossRef]
  18. Kumar, S.; Shah, N. False information on web and social media: A survey. arXiv 2018, arXiv:1804.08559. [Google Scholar] [CrossRef]
  19. Mitra, T.; Gilbert, E. Credbank: A large-scale social media corpus with associated credibility annotations. In Proceedings of the International AAAI Conference on Web and Social Media; AAAI Press: Washington, DC, USA, 2015; Volume 9, pp. 258–267. [Google Scholar]
  20. Zubiaga, A.; Liakata, M.; Procter, R.; Wong Sak Hoi, G.; Tolmie, P. Analysing how people orient to and spread rumours in social media by looking at conversational threads. PLoS ONE 2016, 11, e0150989. [Google Scholar] [CrossRef]
  21. Tacchini, E.; Ballarin, G.; Della Vedova, M.L.; Moret, S.; De Alfaro, L. Some like it hoax: Automated fake news detection in social networks. arXiv 2017, arXiv:1704.07506. [Google Scholar] [CrossRef]
  22. Wang, W.Y. “Liar, liar pants on fire”: A new benchmark dataset for fake news detection. arXiv 2017, arXiv:1705.00648. [Google Scholar] [CrossRef]
  23. Horne, B.; Adali, S. This just in: Fake news packs a lot in title, uses simpler, repetitive content in text body, more similar to satire than real news. In Proceedings of the International AAAI Conference on Web and Social Media; AAAI Press: Washington, DC, USA, 2017; Volume 11, pp. 759–766. [Google Scholar]
  24. Santia, G.; Williams, J. Buzzface: A news veracity dataset with facebook user commentary and egos. In Proceedings of the International AAAI Conference on Web and Social Media; AAAI Press: Washington, DC, USA, 2018; Volume 12, pp. 531–540. [Google Scholar]
  25. Barbado, R.; Araque, O.; Iglesias, C.A. A framework for fake review detection in online consumer electronics retailers. Inf. Process. Manag. 2019, 56, 1234–1244. [Google Scholar] [CrossRef]
  26. Torabi Asr, F.; Taboada, M. Big Data and quality data for fake news and misinformation detection. Big Data Soc. 2019, 6, 2053951719843310. [Google Scholar] [CrossRef]
  27. Nørregaard, J.; Horne, B.D.; Adalı, S. NELA-GT-2018: A large multi-labelled news dataset for the study of misinformation in news articles. In Proceedings of the International AAAI Conference on Web and Social Media; AAAI Press: Washington, DC, USA, 2019; Volume 13, pp. 630–638. [Google Scholar]
  28. Papadopoulou, O.; Zampoglou, M.; Papadopoulos, S.; Kompatsiaris, I. A corpus of debunked and verified user-generated videos. Online Inf. Rev. 2019, 43, 72–88. [Google Scholar] [CrossRef]
  29. Boididou, C.; Papadopoulos, S.; Zampoglou, M.; Apostolidis, L.; Papadopoulou, O.; Kompatsiaris, Y. Detection and visualization of misleading content on Twitter. Int. J. Multimed. Inf. Retr. 2018, 7, 71–86. [Google Scholar] [CrossRef]
  30. Nakamura, K.; Levy, S.; Wang, W.Y. Fakeddit: A New Multimodal Benchmark Dataset for Fine-grained Fake News Detection. In Proceedings of the Twelfth Language Resources and Evaluation Conference; Calzolari, N., Béchet, F., Blache, P., Choukri, K., Cieri, C., Declerck, T., Goggi, S., Isahara, H., Maegaard, B., Mariani, J., et al., Eds.; European Language Resources Association: Paris, France, 2020; pp. 6149–6157. [Google Scholar]
  31. Wang, Y.; Mansurov, J.; Ivanov, P.; Su, J.; Shelmanov, A.; Tsvigun, A.; Whitehouse, C.; Mohammed Afzal, O.; Mahmoud, T.; Sasaki, T.; et al. M4: Multi-generator, Multi-domain, and Multi-lingual Black-Box Machine-Generated Text Detection. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers); Graham, Y., Purver, M., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 1369–1407. [Google Scholar]
  32. Zhang, G.; Giachanou, A.; Rosso, P. SceneFND: Multimodal fake news detection by modelling scene context information. J. Inf. Sci. 2024, 50, 355–367. [Google Scholar] [CrossRef]
  33. Ali, R.; Farhat, T.; Abdullah, S.; Akram, S.; Alhajlah, M.; Mahmood, A.; Iqbal, M.A. Deep learning for sarcasm identification in news headlines. Appl. Sci. 2023, 13, 5586. [Google Scholar] [CrossRef]
  34. Valiaiev, D. Detection of Machine-Generated Text: Literature Survey. arXiv 2024, arXiv:2402.01642. [Google Scholar] [CrossRef]
  35. Kondamudi, M.R.; Sahoo, S.R.; Chouhan, L.; Yadav, N. A comprehensive survey of fake news in social networks: Attributes, features, and detection approaches. J. King Saud Univ.-Comput. Inf. Sci. 2023, 35, 101571. [Google Scholar] [CrossRef]
  36. Garg, S.; Sharma, D.K. Linguistic features based framework for automatic fake news detection. Comput. Ind. Eng. 2022, 172, 108432. [Google Scholar] [CrossRef]
  37. Choudhary, A.; Arora, A. Linguistic feature based learning model for fake news detection and classification. Expert Syst. Appl. 2021, 169, 114171. [Google Scholar] [CrossRef]
  38. Chakraborty, A.; Paranjape, B.; Kakarla, S.; Ganguly, N. Stop clickbait: Detecting and preventing clickbaits in online news media. In 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM); IEEE: Piscataway, NJ, USA, 2016; pp. 9–16. [Google Scholar]
  39. Jin, Z.; Cao, J.; Guo, H.; Zhang, Y.; Luo, J. Multimodal fusion with recurrent neural networks for rumor detection on microblogs. In Proceedings of the 25th ACM International Conference on Multimedia; Association for Computing Machinery: New York, NY, USA, 2017; pp. 795–816. [Google Scholar]
  40. Ghanem, B.; Rosso, P.; Rangel, F. Stance detection in fake news a combined feature representation. In Proceedings of the First Workshop on Fact Extraction and VERification (FEVER); Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 66–71. [Google Scholar]
  41. Baly, R.; Karadzhov, G.; Alexandrov, D.; Glass, J.; Nakov, P. Predicting factuality of reporting and bias of news media sources. arXiv 2018, arXiv:1810.01765. [Google Scholar] [CrossRef]
  42. De, A.; Bandyopadhyay, D.; Gain, B.; Ekbal, A. A Transformer-Based Approach to Multilingual Fake News Detection in Low-Resource Languages. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2021, 21, 9. [Google Scholar] [CrossRef]
  43. Gravanis, G.; Vakali, A.; Diamantaras, K.; Karadais, P. Behind the cues: A benchmarking study for fake news detection. Expert Syst. Appl. 2019, 128, 201–213. [Google Scholar] [CrossRef]
  44. Ahmad, F.; Lokeshkumar, R. A comparison of machine learning algorithms in fake news detection. Int. J. Emerg. Technol. 2019, 10, 177–183. [Google Scholar]
  45. Faustini, P.H.A.; Covoes, T.F. Fake news detection in multiple platforms and languages. Expert Syst. Appl. 2020, 158, 113503. [Google Scholar] [CrossRef]
  46. Dhawan, A.; Bhalla, M.; Arora, D.; Kaushal, R.; Kumaraguru, P. FakeNewsIndia: A benchmark dataset of fake news incidents in India, collection methodology and impact assessment in social media. Comput. Commun. 2022, 185, 130–141. [Google Scholar] [CrossRef]
  47. Shu, K.; Zhou, X.; Wang, S.; Zafarani, R.; Liu, H. The role of user profiles for fake news detection. In Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining; Association for Computing Machinery: New York, NY, USA, 2019; pp. 436–439. [Google Scholar]
  48. Wang, Y.; Yang, W.; Ma, F.; Xu, J.; Zhong, B.; Deng, Q.; Gao, J. Weak supervision for fake news detection via reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence; AAAI Press: Washington, DC, USA, 2020; Volume 34, pp. 516–523. [Google Scholar]
  49. Japkowicz, N.; Stephen, S. The class imbalance problem: A systematic study. Intell. Data Anal. 2002, 6, 429–449. [Google Scholar] [CrossRef]
  50. He, H.; Garcia, E.A. Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 2009, 21, 1263–1284. [Google Scholar] [CrossRef]
  51. Rout, J.; Mishra, M.; Saikia, M.J. Towards Reliable Fake News Detection: Enhanced Attention-Based Transformer Model. J. Cybersecur. Priv. 2025, 5, 43. [Google Scholar] [CrossRef]
  52. Aslam, Z.; Missen, M.M.S.; Ghaffar, A.A.; Mehmood, A.; Villar, M.G.; Alvarado, E.S.; Ashraf, I. Advancing fake news combating using machine learning: A hybrid model approach. Knowl. Inf. Syst. 2025, 67, 12137–12177. [Google Scholar] [CrossRef]
  53. Li, F.; Zhang, H.; Lian, Z.; Wang, S. Fake news detection based on contrast learning and cascading attention. In 2024 IEEE Cyber Science and Technology Congress (CyberSciTech); IEEE: Piscataway, NJ, USA, 2024; pp. 306–313. [Google Scholar]
  54. Xie, J.; Liu, J.; Zha, Z. Towards Effective and Transferable Detection for Multi-modal Fake News in the Social Media Stream. IEEE Trans. Knowl. Data Eng. 2025, 37, 6723–6737. [Google Scholar] [CrossRef]
  55. Wang, L.; Zhang, C.; Xu, H.; Xu, Y.; Xu, X.; Wang, S. Cross-modal contrastive learning for multimodal fake news detection. In Proceedings of the 31st ACM International Conference on Multimedia; Association for Computing Machinery: New York, NY, USA, 2023; pp. 5696–5704. [Google Scholar]
  56. Segura-Bedmar, I.; Alonso-Bartolome, S. Multimodal fake news detection. Information 2022, 13, 284. [Google Scholar] [CrossRef]
  57. Wang, Z.; Shan, X.; Zhang, X.; Yang, J. N24News: A new dataset for multimodal news classification. arXiv 2021, arXiv:2108.13327. [Google Scholar]
  58. Zhou, Y.; Pang, A.; Yu, G. Clip-GCN: An adaptive detection model for multimodal emergent fake news domains. Complex Intell. Syst. 2024, 10, 5153–5170. [Google Scholar] [CrossRef]
  59. Sormeily, A.; Dadkhah, S.; Zhang, X.; Ghorbani, A.A. MEFaND: A Multimodel Framework for Early Fake News Detection. IEEE Trans. Comput. Soc. Syst. 2024, 11, 5337–5353. [Google Scholar] [CrossRef]
  60. Hangloo, S.; Arora, B. Combating multimodal fake news on social media: Methods, datasets, and future perspective. Multimed. Syst. 2022, 28, 2391–2422. [Google Scholar] [CrossRef] [PubMed]
  61. Bayoudh, K.; Knani, R.; Hamdaoui, F.; Mtibaa, A. A survey on deep multimodal learning for computer vision: Advances, trends, applications, and datasets. Vis. Comput. 2022, 38, 2939–2970. [Google Scholar] [CrossRef]
  62. Murayama, T. Dataset of fake news detection and fact verification: A survey. arXiv 2021, arXiv:2111.03299. [Google Scholar] [CrossRef]
  63. D’Ulizia, A.; Caschera, M.C.; Ferri, F.; Grifoni, P. Fake news detection: A survey of evaluation datasets. PeerJ Comput. Sci. 2021, 7, e518. [Google Scholar] [CrossRef]
  64. Khan, T.; Rahman, M.; Chatrath, V.; Bamgbose, O.; Raza, S. FakeWatch ElectionShield: A Benchmarking Framework to Detect Fake News for Credible US Elections. arXiv 2023, arXiv:2312.03730. [Google Scholar]
  65. Abdali, S. Multi-modal misinformation detection: Approaches, challenges and opportunities. arXiv 2022, arXiv:2203.13883. [Google Scholar] [CrossRef]
  66. Chen, Y.; Li, D.; Zhang, P.; Sui, J.; Lv, Q.; Tun, L.; Shang, L. Cross-modal ambiguity learning for multimodal fake news detection. In Proceedings of the ACM Web Conference 2022; Association for Computing Machinery: New York, NY, USA, 2022; pp. 2897–2905. [Google Scholar]
  67. Choraś, M.; Demestichas, K.; Giełczyk, A.; Herrero, Á.; Ksieniewicz, P.; Remoundou, K.; Urda, D.; Woźniak, M. Advanced Machine Learning techniques for fake news (online disinformation) detection: A systematic mapping study. Appl. Soft Comput. 2021, 101, 107050. [Google Scholar] [CrossRef]
  68. Vosoughi, S.; Roy, D.; Aral, S. The spread of true and false news online. Science 2018, 359, 1146–1151. [Google Scholar] [CrossRef]
  69. Chen, Z.; Hu, L.; Li, W.; Shao, Y.; Nie, L. Causal intervention and counterfactual reasoning for multi-modal fake news detection. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 627–638. [Google Scholar]
  70. Shu, K. Combating disinformation on social media: A computational perspective. BenchCouncil Trans. Benchmarks Stand. Eval. 2022, 2, 100035. [Google Scholar] [CrossRef]
  71. Chen, C.; Shu, K. Combating misinformation in the age of llms: Opportunities and challenges. AI Mag. 2024, 45, 354–368. [Google Scholar] [CrossRef]
  72. Yang, X.; Pan, L.; Zhao, X.; Chen, H.; Petzold, L.R.; Wang, W.Y.; Cheng, W. A Survey on Detection of LLMs-Generated Content. In Findings of the Association for Computational Linguistics: EMNLP 2024; Al-Onaizan, Y., Bansal, M., Chen, Y.N., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 9786–9805. [Google Scholar] [CrossRef]
  73. Kim, B.; Xiong, A.; Lee, D.; Han, K. A systematic review on fake news research through the lens of news creation and consumption: Research efforts, challenges, and future directions. PLoS ONE 2021, 16, e0260080. [Google Scholar] [CrossRef]
  74. Nagy, K.; Kapusta, J. Improving fake news classification using dependency grammar. PLoS ONE 2021, 16, e0256940. [Google Scholar] [CrossRef] [PubMed]
  75. Dzienisiewicz, D.; Graliński, F.; Jabłoński, P.; Kubis, M.; Skórzewski, P.; Wierzchoń, P. POLygraph: Polish Fake News Dataset. arXiv 2024, arXiv:2407.01393. [Google Scholar] [CrossRef]
  76. Huang, Y.; Sun, L. Harnessing the power of chatgpt in fake news: An in-depth exploration in generation, detection and explanation. arXiv 2023, arXiv:2310.05046. [Google Scholar]
  77. Cui, L.; Lee, D. Coaid: COVID-19 healthcare misinformation dataset. arXiv 2020, arXiv:2006.00885. [Google Scholar] [CrossRef]
  78. Hanselowski, A.; Stab, C.; Schulz, C.; Li, Z.; Gurevych, I. A richly annotated corpus for different tasks in automated fact-checking. arXiv 2019, arXiv:1911.01214. [Google Scholar] [CrossRef]
  79. Reddy, H.; Raj, N.; Gala, M.; Basava, A. Text-mining-based fake news detection using ensemble methods. Int. J. Autom. Comput. 2020, 17, 210–221. [Google Scholar] [CrossRef]
  80. Aslam, N.; Ullah Khan, I.; Alotaibi, F.S.; Aldaej, L.A.; Aldubaikil, A.K. Fake detect: A deep learning ensemble model for fake news detection. Complexity 2021, 2021, 5557784. [Google Scholar] [CrossRef]
  81. Sahoo, S.R.; Gupta, B.B. Multiple features based approach for automatic fake news detection on social networks using deep learning. Appl. Soft Comput. 2021, 100, 106983. [Google Scholar] [CrossRef]
  82. Hu, L.; Wei, S.; Zhao, Z.; Wu, B. Deep learning for fake news detection: A comprehensive survey. AI Open 2022, 3, 133–155. [Google Scholar] [CrossRef]
  83. Raza, S.; Ding, C. Fake news detection based on news content and social contexts: A transformer-based approach. Int. J. Data Sci. Anal. 2022, 13, 335–362. [Google Scholar] [CrossRef]
  84. Raza, S. Automatic fake news detection in political platforms-a transformer-based approach. In Proceedings of the 4th Workshop on Challenges and Applications of Automated Extraction of Socio-Political Events from Text (CASE 2021); Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 68–78. [Google Scholar]
  85. Low, J.F.; Fung, B.C.; Iqbal, F.; Huang, S.C. Distinguishing between fake news and satire with transformers. Expert Syst. Appl. 2022, 187, 115824. [Google Scholar] [CrossRef]
  86. Mishra, A.; Sadia, H. A Comprehensive Analysis of Fake News Detection Models: A Systematic Literature Review and Current Challenges. Eng. Proc. 2023, 59, 28. [Google Scholar]
  87. Horne, B.D.; Nørregaard, J.; Adali, S. Robust fake news detection over time and attack. ACM Trans. Intell. Syst. Technol. (TIST) 2019, 11, 1–23. [Google Scholar] [CrossRef]
  88. Fenza, G.; Gallo, M.; Loia, V.; Petrone, A.; Stanzione, C. Concept-drift detection index based on fuzzy formal concept analysis for fake news classifiers. Technol. Forecast. Soc. Change 2023, 194, 122640. [Google Scholar] [CrossRef]
  89. Shushkevich, E.; Alexandrov, M.; Cardiff, J. Improving Multiclass Classification of Fake News Using BERT-Based Models and ChatGPT-Augmented Data. Inventions 2023, 8, 112. [Google Scholar] [CrossRef]
  90. Campos Zabala, F.J. Responsible AI Understanding the Ethical and Regulatory Implications of AI. In Grow Your Business with AI: A First Principles Approach for Scaling Artificial Intelligence in the Enterprise; Apress: Berkeley, CA, USA, 2023; pp. 453–477. [Google Scholar]
  91. Tandoc, E.C., Jr.; Jenkins, J.; Craft, S. Fake news as a critical incident in journalism. J. Pract. 2019, 13, 673–689. [Google Scholar] [CrossRef]
  92. Waisbord, S. Truth is what happens to news: On journalism, fake news, and post-truth. J. Stud. 2018, 19, 1866–1878. [Google Scholar] [CrossRef]
  93. Fallis, D.; Mathiesen, K. Fake news is counterfeit news. Inquiry 2025, 68, 3191–3210. [Google Scholar] [CrossRef]
  94. Dwivedi, V.; Sen, K. Navigating the challenges of fake news and media trust: A bibliometric study. J. Inf. Commun. Ethics Soc. 2025, 23, 262–283. [Google Scholar] [CrossRef]
  95. Yadav, D.; Salmani, S. Deepfake: A survey on facial forgery technique using generative adversarial network. In 2019 International Conference on Intelligent Computing and Control Systems (ICCS); IEEE: Piscataway, NJ, USA, 2019; pp. 852–857. [Google Scholar]
  96. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems 27 (NIPS 2014); Neural Information Processing Systems Foundation: San Diego, CA, USA, 2014; Volume 27. [Google Scholar]
  97. Antoniou, A.; Storkey, A.; Edwards, H. Data augmentation generative adversarial networks. arXiv 2017, arXiv:1711.04340. [Google Scholar]
  98. Frid-Adar, M.; Klang, E.; Amitai, M.; Goldberger, J.; Greenspan, H. Synthetic data augmentation using GAN for improved liver lesion classification. In 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018); IEEE: Piscataway, NJ, USA, 2018; pp. 289–293. [Google Scholar]
  99. Orphanou, K.; Otterbacher, J.; Kleanthous, S.; Batsuren, K.; Giunchiglia, F.; Bogina, V.; Tal, A.S.; Hartman, A.; Kuflik, T. Mitigating bias in algorithmic systems—A fish-eye view. ACM Comput. Surv. 2022, 55, 87. [Google Scholar] [CrossRef]
  100. Xu, D.; Yuan, S.; Zhang, L.; Wu, X. Fairgan: Fairness-aware generative adversarial networks. In 2018 IEEE International Conference on Big Data (Big Data); IEEE: Piscataway, NJ, USA, 2018; pp. 570–575. [Google Scholar]
  101. Hu, J.; Ruder, S.; Siddhant, A.; Neubig, G.; Firat, O.; Johnson, M. Xtreme: A massively multilingual multi-task benchmark for evaluating cross-lingual generalisation. In Proceedings of the 37th International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2020; pp. 4411–4421. [Google Scholar]
  102. Lample, G.; Conneau, A. Cross-lingual language model pretraining. arXiv 2019, arXiv:1901.07291. [Google Scholar] [CrossRef]
  103. Senel, L.K.; Ebing, B.; Baghirova, K.; Schütze, H.; Glavaš, G. Kardeş-NLU: Transfer to Low-Resource Languages with the Help of a High-Resource Cousin–A Benchmark and Evaluation for Turkic Languages. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 1672–1688. [Google Scholar]
  104. Huang, R.; Dugan, L.; Yang, Y.; Callison-Burch, C. MiRAGeNews: Multimodal Realistic AI-Generated News Detection. In Findings of the Association for Computational Linguistics: EMNLP 2024; Al-Onaizan, Y., Bansal, M., Chen, Y.N., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 16436–16448. [Google Scholar] [CrossRef]
Figure 1. PRISMA 2020 flow diagram summarizing article selection, with duplicates as the only reason for exclusion.
Figure 2. Keyword co-occurrence network generated by VOSviewer. This network visualizes relationships between keywords related to fake news detection, emphasizing dataset forms such as text, images, and multimodal data. The size of each node represents keyword frequency, and the links indicate co-occurrence, highlighting the critical role of diverse datasets in this research domain.
Table 1. Inclusion and exclusion criteria for paper selection.
Inclusion Criteria:
  • Peer-reviewed journals, conference proceedings, or preprints.
  • Focus on dataset development or application in fake news detection.
  • Discuss algorithms, models, or methods using datasets.
  • Publications in English.
Exclusion Criteria:
  • Not related to fake news, misinformation, or disinformation.
  • Do not focus on dataset utilization or impact.
  • Non-peer-reviewed articles (e.g., opinion pieces, editorials, book chapters).
  • Insufficient methodological rigor.
  • Duplicate articles across databases.
Table 2. Summary of misinformation datasets including year, language(s), and availability.
Dataset | Year | Lang. | Availability
CREDBANK [19] | 2015 | EN | Public: https://github.com/compsocial/CREDBANK-data
PHEME [20] | 2016 | EN, DE | Public: https://figshare.com/articles/PHEME_rumour_scheme_dataset_journalism_use_case/2068650/2
FacebookHoax [21] | 2017 | EN | Public: https://github.com/gabll/some-like-it-hoax
LIAR [22] | 2017 | EN | Public: https://paperswithcode.com/dataset/liar
BuzzFeed [23] | 2017 | EN | Public: https://github.com/BuzzFeedNews/2016-10-facebook-fact-check/tree/master/data
BuzzFace [24] | 2018 | EN | Public: https://github.com/gsantia/BuzzFace
FakeNewsNet [13] | 2018 | EN | Public: https://github.com/KaiDMML/FakeNewsNet
Yelp [25] | 2019 | EN | Upon request: o.araque@upm.es
MisInfoText [26] | 2019 | EN | Public: https://github.com/sfu-discourse-lab/MisInfoText
NELA-GT-2018 [27] | 2019 | EN | Public: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/YHWTFC
FCV-2018 [28] | 2019 | Multi | Public: https://mklab.iti.gr/results/fake-video-corpus/
Verification Corpus [29] | 2019 | Multi | Public: https://github.com/MKLab-ITI/image-verification-corpus
r/fakeddit [30] | 2020 | EN | Public: https://fakeddit.netlify.app/
M4 [31] | 2024 | Multi | Public: https://github.com/mbzuai-nlp/M4?tab=readme-ov-file#data
All links accessed on 20 November 2024. Lang.: EN = English, DE = German, Multi = multilingual (more than two languages).
Table 3. Summary of datasets and their rating scales. Abbreviations: Cert. (Certainly), Prob. (Probably), Inacc. (Inaccurate), Acc. (Accurate), Doubt. (Doubtful), unverif. (unverified).
Dataset | Rating Scale
CREDBANK [19] | 5 values (Cert. Inacc., Prob. Inacc., Doubt., Prob. Acc., Cert. Acc.)
PHEME [20] | 3 values (true, false, unverif.)
FacebookHoax [21] | 2 values (hoaxes, non-hoaxes)
LIAR [22] | 6 values (pants-fire, false, barely-true, half-true, mostly-true, true)
BuzzFeed [23] | 4 values (mostly true, not factual, mix, mostly false)
BuzzFace [24] | 4 values (mostly true, mostly false, mix, no factual content)
FakeNewsNet [13] | 2 values (fake, real)
Yelp [25] | 2 values (fake, trustful)
MisInfoText [26] | 4 (BuzzFeed), 5 (Snopes) values
NELA-GT-2018 [27] | 2 values (true, false)
FCV-2018 [28] | 2 values (true, false)
Verification Corpus [29] | 2 values (true, false)
r/fakeddit [30] | 5 values (Cert. Inacc., Prob. Inacc., Doubt., Prob. Acc., Cert. Acc.)
M4 [31] | 5 values (Cert. Inacc., Prob. Inacc., Doubt., Prob. Acc., Cert. Acc.)
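Where a detector needs a binary target, multi-point scales such as LIAR's six-point scheme are commonly collapsed to fake/real. A minimal sketch, assuming LIAR's published tab-separated layout of (id, label, statement); the sample rows and the choice to fold half-true into the real class are illustrative assumptions, not part of the dataset:

```python
import csv
import io

# LIAR ships as tab-separated files whose second column holds one of six
# veracity labels; the rows below are synthetic stand-ins for illustration.
SAMPLE_TSV = (
    "2635.json\tfalse\tSays the Annies List political group supports abortion.\n"
    "1234.json\thalf-true\tSome illustrative statement.\n"
    "5678.json\tmostly-true\tAnother illustrative statement.\n"
)

# Collapse the six-point LIAR scale into a binary fake/real scheme, a common
# preprocessing step; where the boundary falls is a modeling choice.
BINARY_MAP = {
    "pants-fire": "fake", "false": "fake", "barely-true": "fake",
    "half-true": "real", "mostly-true": "real", "true": "real",
}

def load_liar(tsv_text):
    """Yield (statement, binary_label) pairs from LIAR-style TSV text."""
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        _statement_id, label, statement = row
        yield statement, BINARY_MAP[label]

pairs = list(load_liar(SAMPLE_TSV))
print(pairs[0][1])  # → fake
```

Where the boundary is drawn (e.g., whether half-true counts as real) materially changes class balance, so it should be reported alongside any results.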
Table 4. Summary of datasets and their sizes.
Dataset | Size | Annotation Process
CREDBANK [19] | 60 M tweets, 1049 events | Crowdsourced
PHEME [20] | 330 conversations * (159 true, 68 false, 103 unverified) | Manual
FacebookHoax [21] | 15.5 K posts from 32 pages * | Manual
LIAR [22] | 12.8 K labeled statements | Manual
BuzzFeed [23] | 2.3 K news samples | Manual
BuzzFace [24] | 2.3 K news from 9 pages * | Crowdsourced
FakeNewsNet [13] | 422 news (211 fake, 211 real) | Automated
Yelp [25] | 18.9 K reviews (9.5 K fake, 9.5 K real) | Manual
MisInfoText [26] | 1.7 K articles (1.4 K BuzzFeed, 312 Snopes) | Crowdsourced
NELA-GT-2018 [27] | 713 K articles | Automated
FCV-2018 [28] | 380 videos, 77.3 K tweets | Manual
Verification Corpus [29] | 15.6 K posts | Manual
r/fakeddit [30] | 1.06 M samples | Manual
M4 [31] | 147 K parallel data * (3 K per 49 languages) | Manual
* See Section 13.
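Several of the corpora in Table 4 are visibly skewed (e.g., PHEME's 159/68/103 split), which is exactly the class-imbalance setting studied in [49,50]. A quick sanity check before training, sketched here using PHEME's conversation-level counts from the table:

```python
from collections import Counter

# Conversation-level label counts for PHEME as reported in Table 4.
pheme_labels = ["true"] * 159 + ["false"] * 68 + ["unverified"] * 103

def imbalance_ratio(labels):
    """Ratio of the majority class size to the minority class size."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

print(round(imbalance_ratio(pheme_labels), 2))  # → 2.34 (159 true vs. 68 false)
```

An imbalance ratio well above 1 argues for stratified splits and imbalance-aware metrics (e.g., macro-F1 rather than raw accuracy).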
Table 5. Summary of cross-dataset generalization evidence in fake news detection studies.
Study | Year | Datasets | Evaluation | Generalization
Li et al. [53] | 2024 | DGM4 | Cross-dataset | Weak
Rout et al. [51] | 2025 | FakeNewsNet, ISOT, LIAR | Multi-dataset | Moderate
Aslam et al. [52] | 2025 | Multiple benchmarks | Multi-dataset | Moderate
Xie et al. [54] | 2025 | Twitter, Weibo | Transfer-focused | Strong
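The cross-dataset protocols summarized in Table 5 all reduce to the same harness: fit on one corpus, score another, and compare in-domain with out-of-domain accuracy. The sketch below uses toy data and a simple unigram log-odds scorer (an illustrative stand-in, not any of the cited models) purely to show the shape of such an evaluation:

```python
import math
from collections import Counter

def train(examples):
    """Fit per-token log-odds scores for the 'fake' class from (text, label) pairs."""
    fake, real = Counter(), Counter()
    for text, label in examples:
        (fake if label == "fake" else real).update(text.lower().split())
    vocab = set(fake) | set(real)
    n_fake = sum(fake.values()) + len(vocab)  # add-one smoothing
    n_real = sum(real.values()) + len(vocab)
    return {w: math.log((fake[w] + 1) / n_fake) - math.log((real[w] + 1) / n_real)
            for w in vocab}

def predict(scores, text):
    s = sum(scores.get(w, 0.0) for w in text.lower().split())
    return "fake" if s > 0 else "real"

def accuracy(scores, examples):
    return sum(predict(scores, t) == y for t, y in examples) / len(examples)

# Toy stand-ins for two corpora with largely disjoint vocabularies.
source = [("miracle cure shocks doctors", "fake"),
          ("senate passes the budget bill", "real"),
          ("shocking miracle pill revealed", "fake"),
          ("court upholds the budget ruling", "real")]
target = [("aliens cure cancer overnight", "fake"),
          ("lizard people run the government", "fake"),
          ("parliament approves trade deal", "real")]

scores = train(source)
print(accuracy(scores, source), round(accuracy(scores, target), 2))  # → 1.0 0.67
```

With disjoint vocabularies the out-of-domain score drops, mirroring the weak-to-moderate generalization reported in most of the studies above.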
Table 6. Summary of dataset updates. A ✔ indicates that the dataset is regularly updated, whereas a ✗ indicates that it is not.
Dataset | Regularly Updated
NELA-GT-2018 [27] | ✔
CREDBANK [19] | ✗
PHEME [20] | ✗
FacebookHoax [21] | ✗
LIAR [22] | ✗
BuzzFeed [23] | ✗
BuzzFace [24] | ✗
FakeNewsNet [13] | ✗
Yelp [25] | ✗
MisInfoText [26] | ✗
FCV-2018 [28] | ✗
Verification Corpus [29] | ✗
r/fakeddit [30] | ✗
M4 [31] | ✗

Share and Cite

MDPI and ACS Style

Kuntur, S.; Wróblewska, A.; Ganzha, M.; Paprzycki, M.; Sachdeva, S. Fake News Detection: It’s All in the Data! Appl. Sci. 2026, 16, 1585. https://doi.org/10.3390/app16031585

