Fake News Detection: It’s All in the Data!
Abstract
1. Introduction
2. Paper Selection Methodology
2.1. Data Sources and Search Strategy
2.2. Screening and Eligibility Assessment
2.3. Bibliometric Visualization with VOSviewer
2.4. Final Dataset
3. Definitions and Related Surveys
3.1. Definitions
- Fake News: False or misleading information presented as news. This may include fabricated news stories, misinformation, and disinformation intended to deceive readers [1].
- Misinformation: False or inaccurate information spread regardless of intent to deceive. It may arise from errors, misconceptions, or rumors [11].
- Disinformation: Deliberately false information spread with the intention to mislead or manipulate public opinion [12].
- Detection Models: Algorithms and systems designed to identify and classify fake news, misinformation, or disinformation [13].
- Datasets: Collections of data used to train and evaluate fake news detection models, which can include textual, visual, and multimodal data [14].
3.2. Related Surveys
- Shu et al. (2017): This survey explores fake news detection from both social and news content perspectives, highlighting the importance of leveraging social context in improving detection accuracy [15].
- Zhou and Zafarani (2018): The authors review machine learning approaches for fake news detection, emphasizing the role of feature engineering and classification techniques [16].
- Zhou and Zafarani (2020): This comprehensive survey provides an in-depth analysis of fundamental theories, detection methods, and future opportunities in the field of fake news detection [17].
- Kumar and Shah (2018): This survey focuses on false information on the web and social media, discussing various detection strategies and their effectiveness [18].
4. Characteristics of Existing Fake News Datasets
4.1. Types of Data Collected
4.2. Common Features and Labels
4.3. Variation in Characteristics
5. Impact of Dataset Properties on Detection Algorithms
5.1. Performance Influence
5.2. Specific Properties Leading to Better Performance
5.3. Cross-Dataset Generalization and Transferability
6. Role of Multimodal Datasets in Fake News Detection
6.1. Comparison with Unimodal Datasets
6.2. Challenges of Multimodal Datasets
6.3. Advantages and Disadvantages of Multimodal Datasets
7. Challenges and Limitations in Current Fake News Datasets
7.1. From Dataset Construction to Generalization Failure
7.2. Biases in Datasets
7.3. LLM-Era Challenges and Benchmark Instability
8. Best Practices for Creating High-Quality Fake News Datasets
8.1. Annotation Methodologies
- 1. Expert annotators: Engaging annotators who are well-versed in the nuances of fake news yields high-quality annotations. For example, the MediaEval Benchmarking Initiative engages professional journalists and fact-checkers to annotate news articles, ensuring that the annotations are based on a deep understanding of journalistic integrity and misinformation tactics [41].
- 2. Automated assistance with human oversight: Automated methods can expedite the annotation process, but they must be complemented by human oversight. Automated systems can flag potentially fake news based on linguistic cues or metadata, and human annotators then verify these flags. For instance, the FakeNewsNet dataset uses an automated system to initially filter articles based on their sources and content, followed by human verification to ensure accuracy [13].
- 3. Cross-validation by multiple annotators: Multiple annotators review the same content to ensure consistency and accuracy. For example, in the creation of the LIAR dataset, each statement was reviewed by three annotators, who cross-checked labels and resolved discrepancies through discussion and majority voting [22].
- 4. Periodic reviews and updates: Regular reviews and updates keep the dataset relevant and accurate. This involves periodically re-evaluating the data to correct errors and to update annotations based on new information or shifts in the nature of fake news. The PHEME dataset, for example, undergoes regular updates in which annotations are reviewed and revised based on ongoing events and newly available information [20].
- 5. Specific quality control measures:
- Inter-annotator agreement (IAA): This measure is used to assess the consistency among annotators. High IAA scores indicate that annotators are in agreement, thereby enhancing the reliability of the dataset. For instance, the Yelp review dataset uses IAA metrics to ensure that labels for fake and real reviews are consistent across annotators [25].
- Blind review processes: In blind review processes, annotators are unaware of each other’s assessments, which helps reduce bias. The Verification Corpus uses a blind review process in which posts are labeled independently by different annotators to ensure unbiased annotations [29].
- Question and answer annotation: The POLygraph dataset incorporates a detailed question and answer annotation scheme, expanding beyond simple factuality to include questions on the disseminator’s intention, the target of the news, and the potential harm caused, providing a more comprehensive understanding of fake news [75].
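The majority-voting and inter-annotator agreement steps described above can be illustrated with a minimal Python sketch. It is not the procedure of any specific dataset: the function names `majority_vote` and `cohen_kappa` are hypothetical, Cohen's kappa is assumed as the agreement metric (the text only says "IAA"), and kappa as written covers exactly two annotators.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(labels: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the strict-majority label, or None if no label exceeds 50%.

    A None result marks an item that would need discussion among annotators.
    """
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None


def cohen_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators who labelled the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each annotator's
    label frequencies.
    """
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    freq_a, freq_b = Counter(a), Counter(b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)
```

For instance, `majority_vote(["fake", "fake", "real"])` yields `"fake"`, while three mutually different labels yield `None`. For more than two annotators, a multi-rater statistic such as Fleiss' kappa or Krippendorff's alpha would be the usual substitute.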
8.2. Incorporating Real-World Dynamics
8.3. Ensuring Reliability and Validity
9. Evolution of Fake News Detection Models with Dataset Availability
9.1. Trends in Model Performance
- Early Models and Their Limitations
- Advancements with Deep Learning
- Impact of Transformer Models
9.2. Impact of Dataset Versions and Updates
- Significance of Regular Updates
- Addressing Distribution Drift
- Empirical Evidence
10. Ethical Considerations in Fake News Datasets
10.1. Privacy and Consent
10.2. Societal Impacts
10.3. Transparency and Accountability
10.4. Media and Journalism Implications of Dataset Design in Cross-Cultural Contexts
11. Future Directions
12. Conclusions
13. Glossary
- Convos (in PHEME dataset) [20]: Refers to conversations. Specifically, it denotes the threads or discussions on social media platforms where a statement, claim, or rumor is being discussed or responded to by various users. The dataset categorizes these conversations as true, false, or unverified.
- 9 pages (in BuzzFace dataset) [24]: Refers to the specific Facebook pages from which the dataset’s news samples were collected. These pages likely include a mix of reputable news sources, satire, and pages known for spreading misinformation, providing a diverse set of news items for analysis.
- Parallel data (in M4 dataset) [31]: Refers to a collection of data that includes pairs of text in two different languages that are translations of each other. The M4 dataset consists of 147,000 such parallel pairs, with 3000 pairs for each of 49 different languages. This type of data is commonly used in training and evaluating machine translation systems.
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A. Scopus Advanced Search Query
Appendix B. PRISMA 2020 Checklist
| Item | Description | Reported in Manuscript |
|---|---|---|
| Title | ||
| 1 | Identify the report as a systematic review | Title/Abstract |
| Abstract | ||
| 2 | Structured summary | Abstract |
| Introduction | ||
| 3 | Rationale | Introduction |
| 4 | Objectives | Introduction |
| Methods | ||
| 5 | Eligibility criteria | Methods–Screening and Selection Criteria |
| 6 | Information sources | Methods–Data Sources and Search Strategy |
| 7 | Search strategy | Appendix (Search Query) |
| 8 | Selection process | Methods + PRISMA Flow Diagram |
| 9 | Data collection process | Methods–VOSviewer Export Procedure |
| 10 | Data items | Methods–Metadata and Keyword Extraction |
| 11 | Risk of bias assessment | Not applicable |
| 12 | Effect measures | Not applicable |
| 13 | Synthesis methods | Methods–VOSviewer Visualization |
| Results | ||
| 14 | Study selection | PRISMA Flow Diagram + Methods |
| 15 | Study characteristics | Results Section |
| 16 | Risk of bias in studies | Not applicable |
| 17 | Results of individual studies | Reported narratively |
| 18 | Synthesis results | Results–Keyword Co-occurrence Analysis |
| Discussion | ||
| 19 | Summary of evidence | Discussion |
| 20 | Limitations | Discussion (if included) |
| 21 | Conclusions | Conclusion Section |
| Other Information | ||
| 22 | Registration | Not registered |
| 23 | Support/Funding | As stated in manuscript |
| 24 | Competing interests | As stated in manuscript |
References
- Lazer, D.M.; Baum, M.A.; Benkler, Y.; Berinsky, A.J.; Greenhill, K.M.; Menczer, F.; Metzger, M.J.; Nyhan, B.; Pennycook, G.; Rothschild, D.; et al. The science of fake news. Science 2018, 359, 1094–1096. [Google Scholar] [CrossRef]
- Lewandowsky, S.; Ecker, U.K.; Cook, J. Beyond misinformation: Understanding and coping with the “post-truth” era. J. Appl. Res. Mem. Cogn. 2017, 6, 353–369. [Google Scholar] [CrossRef]
- Reisach, U. The responsibility of social media in times of societal and political manipulation. Eur. J. Oper. Res. 2021, 291, 906–917. [Google Scholar] [CrossRef] [PubMed]
- Shu, K.; Wang, S.; Liu, H. Beyond news contents: The role of social context for fake news detection. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining; Association for Computing Machinery: New York, NY, USA, 2019; pp. 312–320. [Google Scholar]
- Abdelminaam, D.S.; Ismail, F.H.; Taha, M.; Taha, A.; Houssein, E.H.; Nabil, A. Coaid-deep: An optimized intelligent framework for automated detecting COVID-19 misleading information on twitter. IEEE Access 2021, 9, 27840–27867. [Google Scholar] [CrossRef]
- Goldani, M.H.; Momtazi, S.; Safabakhsh, R. Detecting fake news with capsule neural networks. Appl. Soft Comput. 2021, 101, 106991. [Google Scholar] [CrossRef]
- Truică, C.O.; Apostol, E.S. It’s all in the embedding! fake news detection using document embeddings. Mathematics 2023, 11, 508. [Google Scholar] [CrossRef]
- Peng, L.; Jian, S.; Kan, Z.; Qiao, L.; Li, D. Not all fake news is semantically similar: Contextual semantic representation learning for multimodal fake news detection. Inf. Process. Manag. 2024, 61, 103564. [Google Scholar] [CrossRef]
- Jain, M.K.; Gopalani, D.; Meena, Y.K. ConFake: Fake news identification using content based features. Multimed. Tools Appl. 2024, 83, 8729–8755. [Google Scholar] [CrossRef]
- Lai, J.; Yang, X.; Luo, W.; Zhou, L.; Li, L.; Wang, Y.; Shi, X. RumorLLM: A Rumor Large Language Model-Based Fake-News-Detection Data-Augmentation Approach. Appl. Sci. 2024, 14, 3532. [Google Scholar] [CrossRef]
- Wu, L.; Morstatter, F.; Carley, K.M.; Liu, H. Misinformation in social media: Definition, manipulation, and detection. ACM SIGKDD Explor. Newsl. 2019, 21, 80–90. [Google Scholar] [CrossRef]
- Fallis, D. What is disinformation? Libr. Trends 2015, 63, 401–426. [Google Scholar]
- Shu, K.; Mahudeswaran, D.; Wang, S.; Lee, D.; Liu, H. FakeNewsNet: A Data Repository with News Content, Social Context and Dynamic Information for Studying Fake News on Social Media. arXiv 2018, arXiv:1809.01286. [Google Scholar] [CrossRef]
- Hanselowski, A.; PVS, A.; Schiller, B.; Caspelherr, F.; Chaudhuri, D.; Meyer, C.M.; Gurevych, I. A retrospective analysis of the fake news challenge stance detection task. arXiv 2018, arXiv:1806.05180. [Google Scholar] [CrossRef]
- Shu, K.; Sliva, A.; Wang, S.; Tang, J.; Liu, H. Fake news detection on social media: A data mining perspective. ACM SIGKDD Explor. Newsl. 2017, 19, 22–36. [Google Scholar] [CrossRef]
- Zhou, X.; Zafarani, R. Fake news: A survey of research, detection methods, and opportunities. arXiv 2018, arXiv:1812.00315. [Google Scholar]
- Zhou, X.; Zafarani, R. A survey of fake news: Fundamental theories, detection methods, and opportunities. ACM Comput. Surv. (CSUR) 2020, 53, 1–40. [Google Scholar] [CrossRef]
- Kumar, S.; Shah, N. False information on web and social media: A survey. arXiv 2018, arXiv:1804.08559. [Google Scholar] [CrossRef]
- Mitra, T.; Gilbert, E. Credbank: A large-scale social media corpus with associated credibility annotations. In Proceedings of the International AAAI Conference on Web and Social Media; AAAI Press: Washington, DC, USA, 2015; Volume 9, pp. 258–267. [Google Scholar]
- Zubiaga, A.; Liakata, M.; Procter, R.; Wong Sak Hoi, G.; Tolmie, P. Analysing how people orient to and spread rumours in social media by looking at conversational threads. PloS ONE 2016, 11, e0150989. [Google Scholar] [CrossRef]
- Tacchini, E.; Ballarin, G.; Della Vedova, M.L.; Moret, S.; De Alfaro, L. Some like it hoax: Automated fake news detection in social networks. arXiv 2017, arXiv:1704.07506. [Google Scholar] [CrossRef]
- Wang, W.Y. “Liar, liar pants on fire”: A new benchmark dataset for fake news detection. arXiv 2017, arXiv:1705.00648. [Google Scholar] [CrossRef]
- Horne, B.; Adali, S. This just in: Fake news packs a lot in title, uses simpler, repetitive content in text body, more similar to satire than real news. In Proceedings of the International AAAI Conference on Web and Social Media; AAAI Press: Washington, DC, USA, 2017; Volume 11, pp. 759–766. [Google Scholar]
- Santia, G.; Williams, J. Buzzface: A news veracity dataset with facebook user commentary and egos. In Proceedings of the International AAAI Conference on Web and Social Media; AAAI Press: Washington, DC, USA, 2018; Volume 12, pp. 531–540. [Google Scholar]
- Barbado, R.; Araque, O.; Iglesias, C.A. A framework for fake review detection in online consumer electronics retailers. Inf. Process. Manag. 2019, 56, 1234–1244. [Google Scholar] [CrossRef]
- Torabi Asr, F.; Taboada, M. Big Data and quality data for fake news and misinformation detection. Big Data Soc. 2019, 6, 2053951719843310. [Google Scholar] [CrossRef]
- Nørregaard, J.; Horne, B.D.; Adalı, S. NELA-GT-2018: A large multi-labelled news dataset for the study of misinformation in news articles. In Proceedings of the International AAAI Conference on Web and Social Media; AAAI Press: Washington, DC, USA, 2019; Volume 13, pp. 630–638. [Google Scholar]
- Papadopoulou, O.; Zampoglou, M.; Papadopoulos, S.; Kompatsiaris, I. A corpus of debunked and verified user-generated videos. Online Inf. Rev. 2019, 43, 72–88. [Google Scholar] [CrossRef]
- Boididou, C.; Papadopoulos, S.; Zampoglou, M.; Apostolidis, L.; Papadopoulou, O.; Kompatsiaris, Y. Detection and visualization of misleading content on Twitter. Int. J. Multimed. Inf. Retr. 2018, 7, 71–86. [Google Scholar] [CrossRef]
- Nakamura, K.; Levy, S.; Wang, W.Y. Fakeddit: A New Multimodal Benchmark Dataset for Fine-grained Fake News Detection. In Proceedings of the Twelfth Language Resources and Evaluation Conference; Calzolari, N., Béchet, F., Blache, P., Choukri, K., Cieri, C., Declerck, T., Goggi, S., Isahara, H., Maegaard, B., Mariani, J., et al., Eds.; European Language Resources Association: Paris, France, 2020; pp. 6149–6157. [Google Scholar]
- Wang, Y.; Mansurov, J.; Ivanov, P.; Su, J.; Shelmanov, A.; Tsvigun, A.; Whitehouse, C.; Mohammed Afzal, O.; Mahmoud, T.; Sasaki, T.; et al. M4: Multi-generator, Multi-domain, and Multi-lingual Black-Box Machine-Generated Text Detection. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers); Graham, Y., Purver, M., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 1369–1407. [Google Scholar]
- Zhang, G.; Giachanou, A.; Rosso, P. SceneFND: Multimodal fake news detection by modelling scene context information. J. Inf. Sci. 2024, 50, 355–367. [Google Scholar] [CrossRef]
- Ali, R.; Farhat, T.; Abdullah, S.; Akram, S.; Alhajlah, M.; Mahmood, A.; Iqbal, M.A. Deep learning for sarcasm identification in news headlines. Appl. Sci. 2023, 13, 5586. [Google Scholar] [CrossRef]
- Valiaiev, D. Detection of Machine-Generated Text: Literature Survey. arXiv 2024, arXiv:2402.01642. [Google Scholar] [CrossRef]
- Kondamudi, M.R.; Sahoo, S.R.; Chouhan, L.; Yadav, N. A comprehensive survey of fake news in social networks: Attributes, features, and detection approaches. J. King Saud-Univ.-Comput. Inf. Sci. 2023, 35, 101571. [Google Scholar] [CrossRef]
- Garg, S.; Sharma, D.K. Linguistic features based framework for automatic fake news detection. Comput. Ind. Eng. 2022, 172, 108432. [Google Scholar] [CrossRef]
- Choudhary, A.; Arora, A. Linguistic feature based learning model for fake news detection and classification. Expert Syst. Appl. 2021, 169, 114171. [Google Scholar] [CrossRef]
- Chakraborty, A.; Paranjape, B.; Kakarla, S.; Ganguly, N. Stop clickbait: Detecting and preventing clickbaits in online news media. In 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM); IEEE: Piscataway, NJ, USA, 2016; pp. 9–16. [Google Scholar]
- Jin, Z.; Cao, J.; Guo, H.; Zhang, Y.; Luo, J. Multimodal fusion with recurrent neural networks for rumor detection on microblogs. In Proceedings of the 25th ACM International Conference on Multimedia; Association for Computing Machinery: New York, NY, USA, 2017; pp. 795–816. [Google Scholar]
- Ghanem, B.; Rosso, P.; Rangel, F. Stance detection in fake news a combined feature representation. In Proceedings of the First Workshop on Fact Extraction and VERification (FEVER); Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 66–71. [Google Scholar]
- Baly, R.; Karadzhov, G.; Alexandrov, D.; Glass, J.; Nakov, P. Predicting factuality of reporting and bias of news media sources. arXiv 2018, arXiv:1810.01765. [Google Scholar] [CrossRef]
- De, A.; Bandyopadhyay, D.; Gain, B.; Ekbal, A. A Transformer-Based Approach to Multilingual Fake News Detection in Low-Resource Languages. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2021, 21, 9. [Google Scholar] [CrossRef]
- Gravanis, G.; Vakali, A.; Diamantaras, K.; Karadais, P. Behind the cues: A benchmarking study for fake news detection. Expert Syst. Appl. 2019, 128, 201–213. [Google Scholar] [CrossRef]
- Ahmad, F.; Lokeshkumar, R. A comparison of machine learning algorithms in fake news detection. Int. J. Emerg. Technol. 2019, 10, 177–183. [Google Scholar]
- Faustini, P.H.A.; Covoes, T.F. Fake news detection in multiple platforms and languages. Expert Syst. Appl. 2020, 158, 113503. [Google Scholar] [CrossRef]
- Dhawan, A.; Bhalla, M.; Arora, D.; Kaushal, R.; Kumaraguru, P. FakeNewsIndia: A benchmark dataset of fake news incidents in India, collection methodology and impact assessment in social media. Comput. Commun. 2022, 185, 130–141. [Google Scholar] [CrossRef]
- Shu, K.; Zhou, X.; Wang, S.; Zafarani, R.; Liu, H. The role of user profiles for fake news detection. In Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining; Association for Computing Machinery: New York, NY, USA, 2019; pp. 436–439. [Google Scholar]
- Wang, Y.; Yang, W.; Ma, F.; Xu, J.; Zhong, B.; Deng, Q.; Gao, J. Weak supervision for fake news detection via reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence; AAAI Press: Washington, DC, USA, 2020; Volume 34, pp. 516–523. [Google Scholar]
- Japkowicz, N.; Stephen, S. The class imbalance problem: A systematic study. Intell. Data Anal. 2002, 6, 429–449. [Google Scholar] [CrossRef]
- He, H.; Garcia, E.A. Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 2009, 21, 1263–1284. [Google Scholar] [CrossRef]
- Rout, J.; Mishra, M.; Saikia, M.J. Towards Reliable Fake News Detection: Enhanced Attention-Based Transformer Model. J. Cybersecur. Priv. 2025, 5, 43. [Google Scholar] [CrossRef]
- Aslam, Z.; Missen, M.M.S.; Ghaffar, A.A.; Mehmood, A.; Villar, M.G.; Alvarado, E.S.; Ashraf, I. Advancing fake news combating using machine learning: A hybrid model approach. Knowl. Inf. Syst. 2025, 67, 12137–12177. [Google Scholar] [CrossRef]
- Li, F.; Zhang, H.; Lian, Z.; Wang, S. Fake news detection based on contrast learning and cascading attention. In 2024 IEEE Cyber Science and Technology Congress (CyberSciTech); IEEE: Piscataway, NJ, USA, 2024; pp. 306–313. [Google Scholar]
- Xie, J.; Liu, J.; Zha, Z. Towards Effective and Transferable Detection for Multi-modal Fake News in the Social Media Stream. IEEE Trans. Knowl. Data Eng. 2025, 37, 6723–6737. [Google Scholar] [CrossRef]
- Wang, L.; Zhang, C.; Xu, H.; Xu, Y.; Xu, X.; Wang, S. Cross-modal contrastive learning for multimodal fake news detection. In Proceedings of the 31st ACM International Conference on Multimedia; Association for Computing Machinery: New York, NY, USA, 2023; pp. 5696–5704. [Google Scholar]
- Segura-Bedmar, I.; Alonso-Bartolome, S. Multimodal fake news detection. Information 2022, 13, 284. [Google Scholar] [CrossRef]
- Wang, Z.; Shan, X.; Zhang, X.; Yang, J. N24News: A new dataset for multimodal news classification. arXiv 2021, arXiv:2108.13327. [Google Scholar]
- Zhou, Y.; Pang, A.; Yu, G. Clip-GCN: An adaptive detection model for multimodal emergent fake news domains. Complex Intell. Syst. 2024, 10, 5153–5170. [Google Scholar] [CrossRef]
- Sormeily, A.; Dadkhah, S.; Zhang, X.; Ghorbani, A.A. MEFaND: A Multimodel Framework for Early Fake News Detection. IEEE Trans. Comput. Soc. Syst. 2024, 11, 5337–5353. [Google Scholar] [CrossRef]
- Hangloo, S.; Arora, B. Combating multimodal fake news on social media: Methods, datasets, and future perspective. Multimed. Syst. 2022, 28, 2391–2422. [Google Scholar] [CrossRef] [PubMed]
- Bayoudh, K.; Knani, R.; Hamdaoui, F.; Mtibaa, A. A survey on deep multimodal learning for computer vision: Advances, trends, applications, and datasets. Vis. Comput. 2022, 38, 2939–2970. [Google Scholar] [CrossRef]
- Murayama, T. Dataset of fake news detection and fact verification: A survey. arXiv 2021, arXiv:2111.03299. [Google Scholar] [CrossRef]
- D’Ulizia, A.; Caschera, M.C.; Ferri, F.; Grifoni, P. Fake news detection: A survey of evaluation datasets. PeerJ Comput. Sci. 2021, 7, e518. [Google Scholar] [CrossRef]
- Khan, T.; Rahman, M.; Chatrath, V.; Bamgbose, O.; Raza, S. FakeWatch ElectionShield: A Benchmarking Framework to Detect Fake News for Credible US Elections. arXiv 2023, arXiv:2312.03730. [Google Scholar]
- Abdali, S. Multi-modal misinformation detection: Approaches, challenges and opportunities. arXiv 2022, arXiv:2203.13883. [Google Scholar] [CrossRef]
- Chen, Y.; Li, D.; Zhang, P.; Sui, J.; Lv, Q.; Tun, L.; Shang, L. Cross-modal ambiguity learning for multimodal fake news detection. In Proceedings of the ACM Web Conference 2022; Association for Computing Machinery: New York, NY, USA, 2022; pp. 2897–2905. [Google Scholar]
- Choraś, M.; Demestichas, K.; Giełczyk, A.; Herrero, Á.; Ksieniewicz, P.; Remoundou, K.; Urda, D.; Woźniak, M. Advanced Machine Learning techniques for fake news (online disinformation) detection: A systematic mapping study. Appl. Soft Comput. 2021, 101, 107050. [Google Scholar] [CrossRef]
- Vosoughi, S.; Roy, D.; Aral, S. The spread of true and false news online. Science 2018, 359, 1146–1151. [Google Scholar] [CrossRef]
- Chen, Z.; Hu, L.; Li, W.; Shao, Y.; Nie, L. Causal intervention and counterfactual reasoning for multi-modal fake news detection. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 627–638. [Google Scholar]
- Shu, K. Combating disinformation on social media: A computational perspective. BenchCouncil Trans. Benchmarks Stand. Eval. 2022, 2, 100035. [Google Scholar] [CrossRef]
- Chen, C.; Shu, K. Combating misinformation in the age of llms: Opportunities and challenges. AI Mag. 2024, 45, 354–368. [Google Scholar] [CrossRef]
- Yang, X.; Pan, L.; Zhao, X.; Chen, H.; Petzold, L.R.; Wang, W.Y.; Cheng, W. A Survey on Detection of LLMs-Generated Content. In Findings of the Association for Computational Linguistics: EMNLP 2024; Al-Onaizan, Y., Bansal, M., Chen, Y.N., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 9786–9805. [Google Scholar] [CrossRef]
- Kim, B.; Xiong, A.; Lee, D.; Han, K. A systematic review on fake news research through the lens of news creation and consumption: Research efforts, challenges, and future directions. PloS ONE 2021, 16, e0260080. [Google Scholar] [CrossRef]
- Nagy, K.; Kapusta, J. Improving fake news classification using dependency grammar. PloS ONE 2021, 16, e0256940. [Google Scholar] [CrossRef] [PubMed]
- Dzienisiewicz, D.; Graliński, F.; Jabłoński, P.; Kubis, M.; Skórzewski, P.; Wierzchoń, P. POLygraph: Polish Fake News Dataset. arXiv 2024, arXiv:2407.01393. [Google Scholar] [CrossRef]
- Huang, Y.; Sun, L. Harnessing the power of chatgpt in fake news: An in-depth exploration in generation, detection and explanation. arXiv 2023, arXiv:2310.05046. [Google Scholar]
- Cui, L.; Lee, D. Coaid: COVID-19 healthcare misinformation dataset. arXiv 2020, arXiv:2006.00885. [Google Scholar] [CrossRef]
- Hanselowski, A.; Stab, C.; Schulz, C.; Li, Z.; Gurevych, I. A richly annotated corpus for different tasks in automated fact-checking. arXiv 2019, arXiv:1911.01214. [Google Scholar] [CrossRef]
- Reddy, H.; Raj, N.; Gala, M.; Basava, A. Text-mining-based fake news detection using ensemble methods. Int. J. Autom. Comput. 2020, 17, 210–221. [Google Scholar] [CrossRef]
- Aslam, N.; Ullah Khan, I.; Alotaibi, F.S.; Aldaej, L.A.; Aldubaikil, A.K. Fake detect: A deep learning ensemble model for fake news detection. Complexity 2021, 2021, 5557784. [Google Scholar] [CrossRef]
- Sahoo, S.R.; Gupta, B.B. Multiple features based approach for automatic fake news detection on social networks using deep learning. Appl. Soft Comput. 2021, 100, 106983. [Google Scholar] [CrossRef]
- Hu, L.; Wei, S.; Zhao, Z.; Wu, B. Deep learning for fake news detection: A comprehensive survey. AI Open 2022, 3, 133–155. [Google Scholar] [CrossRef]
- Raza, S.; Ding, C. Fake news detection based on news content and social contexts: A transformer-based approach. Int. J. Data Sci. Anal. 2022, 13, 335–362. [Google Scholar] [CrossRef]
- Raza, S. Automatic fake news detection in political platforms-a transformer-based approach. In Proceedings of the 4th Workshop on Challenges and Applications of Automated Extraction of Socio-Political Events from Text (CASE 2021); Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 68–78. [Google Scholar]
- Low, J.F.; Fung, B.C.; Iqbal, F.; Huang, S.C. Distinguishing between fake news and satire with transformers. Expert Syst. Appl. 2022, 187, 115824. [Google Scholar] [CrossRef]
- Mishra, A.; Sadia, H. A Comprehensive Analysis of Fake News Detection Models: A Systematic Literature Review and Current Challenges. Eng. Proc. 2023, 59, 28. [Google Scholar]
- Horne, B.D.; Nørregaard, J.; Adali, S. Robust fake news detection over time and attack. ACM Trans. Intell. Syst. Technol. (TIST) 2019, 11, 1–23. [Google Scholar] [CrossRef]
- Fenza, G.; Gallo, M.; Loia, V.; Petrone, A.; Stanzione, C. Concept-drift detection index based on fuzzy formal concept analysis for fake news classifiers. Technol. Forecast. Soc. Change 2023, 194, 122640. [Google Scholar] [CrossRef]
- Shushkevich, E.; Alexandrov, M.; Cardiff, J. Improving Multiclass Classification of Fake News Using BERT-Based Models and ChatGPT-Augmented Data. Inventions 2023, 8, 112. [Google Scholar] [CrossRef]
- Campos Zabala, F.J. Responsible AI Understanding the Ethical and Regulatory Implications of AI. In Grow Your Business with AI: A First Principles Approach for Scaling Artificial Intelligence in the Enterprise; Apress: Berkeley, CA, USA, 2023; pp. 453–477. [Google Scholar]
- Tandoc, E.C., Jr.; Jenkins, J.; Craft, S. Fake news as a critical incident in journalism. J. Pract. 2019, 13, 673–689. [Google Scholar] [CrossRef]
- Waisbord, S. Truth is what happens to news: On journalism, fake news, and post-truth. J. Stud. 2018, 19, 1866–1878. [Google Scholar] [CrossRef]
- Fallis, D.; Mathiesen, K. Fake news is counterfeit news. Inquiry 2025, 68, 3191–3210. [Google Scholar] [CrossRef]
- Dwivedi, V.; Sen, K. Navigating the challenges of fake news and media trust: A bibliometric study. J. Inf. Commun. Ethics Soc. 2025, 23, 262–283. [Google Scholar] [CrossRef]
- Yadav, D.; Salmani, S. Deepfake: A survey on facial forgery technique using generative adversarial network. In 2019 International Conference on Intelligent Computing and Control Systems (ICCS); IEEE: Piscataway, NJ, USA, 2019; pp. 852–857. [Google Scholar]
- Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems 27 (NIPS 2014); Neural Information Processing Systems Foundation: San Diego, CA, USA, 2014; Volume 27. [Google Scholar]
- Antoniou, A.; Storkey, A.; Edwards, H. Data augmentation generative adversarial networks. arXiv 2017, arXiv:1711.04340. [Google Scholar]
- Frid-Adar, M.; Klang, E.; Amitai, M.; Goldberger, J.; Greenspan, H. Synthetic data augmentation using GAN for improved liver lesion classification. In 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018); IEEE: Piscataway, NJ, USA, 2018; pp. 289–293. [Google Scholar]
- Orphanou, K.; Otterbacher, J.; Kleanthous, S.; Batsuren, K.; Giunchiglia, F.; Bogina, V.; Tal, A.S.; Hartman, A.; Kuflik, T. Mitigating bias in algorithmic systems—A fish-eye view. ACM Comput. Surv. 2022, 55, 87. [Google Scholar] [CrossRef]
- Xu, D.; Yuan, S.; Zhang, L.; Wu, X. Fairgan: Fairness-aware generative adversarial networks. In 2018 IEEE International Conference on Big Data (Big Data); IEEE: Piscataway, NJ, USA, 2018; pp. 570–575. [Google Scholar]
- Hu, J.; Ruder, S.; Siddhant, A.; Neubig, G.; Firat, O.; Johnson, M. Xtreme: A massively multilingual multi-task benchmark for evaluating cross-lingual generalisation. In Proceedings of the 37th International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2020; pp. 4411–4421. [Google Scholar]
- Lample, G.; Conneau, A. Cross-lingual language model pretraining. arXiv 2019, arXiv:1901.07291. [Google Scholar] [CrossRef]
- Senel, L.K.; Ebing, B.; Baghirova, K.; Schütze, H.; Glavaš, G. Kardeş-NLU: Transfer to Low-Resource Languages with the Help of a High-Resource Cousin–A Benchmark and Evaluation for Turkic Languages. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 1672–1688. [Google Scholar]
- Huang, R.; Dugan, L.; Yang, Y.; Callison-Burch, C. MiRAGeNews: Multimodal Realistic AI-Generated News Detection. In Findings of the Association for Computational Linguistics: EMNLP 2024; Al-Onaizan, Y., Bansal, M., Chen, Y.N., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 16436–16448. [Google Scholar] [CrossRef]


| Inclusion Criteria | Exclusion Criteria |
|---|---|
| Dataset | Year | Lang. | Data Type | Availability |
|---|---|---|---|---|
| CREDBANK [19] | 2015 | EN | ⧫ | https://github.com/compsocial/CREDBANK-data (accessed on 20 November 2024) |
| PHEME [20] | 2016 | EN, DE | | https://figshare.com/articles/PHEME_rumour_scheme_dataset_journalism_use_case/2068650/2 (accessed on 20 November 2024) |
| FacebookHoax [21] | 2017 | EN | ⧫ | https://github.com/gabll/some-like-it-hoax (accessed on 20 November 2024) |
| LIAR [22] | 2017 | EN | ⧫ | https://paperswithcode.com/dataset/liar (accessed on 20 November 2024) |
| BuzzFeed [23] | 2017 | EN | ⧫ | https://github.com/BuzzFeedNews/2016-10-facebook-fact-check/tree/master/data (accessed on 20 November 2024) |
| BuzzFace [24] | 2018 | EN | ⧫ | https://github.com/gsantia/BuzzFace (accessed on 20 November 2024) |
| FakeNewsNet [13] | 2018 | EN | | https://github.com/KaiDMML/FakeNewsNet (accessed on 20 November 2024) |
| Yelp [25] | 2019 | EN | ⧫ | mailto:o.araque@upm.es |
| MisInfoText [26] | 2019 | EN | ⧫ | https://github.com/sfu-discourse-lab/MisInfoText (accessed on 20 November 2024) |
| NELA-GT-2018 [27] | 2019 | EN | ⧫ | https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/YHWTFC (accessed on 20 November 2024) |
| FCV-2018 [28] | 2019 | Multi | | https://mklab.iti.gr/results/fake-video-corpus/ (accessed on 20 November 2024) |
| Verification Corpus [29] | 2019 | Multi | | https://github.com/MKLab-ITI/image-verification-corpus (accessed on 20 November 2024) |
| r/fakeddit [30] | 2020 | EN | | https://fakeddit.netlify.app/ (accessed on 20 November 2024) |
| M4 [31] | 2024 | Multi | | https://github.com/mbzuai-nlp/M4?tab=readme-ov-file#data (accessed on 20 November 2024) |

Availability: a URL indicates a publicly available dataset; an e-mail address indicates a dataset available upon request.

| Dataset | Rating Scale |
|---|---|
| CREDBANK [19] | 5 values (Cert. Inacc., Prob. Inacc., Doubt., Prob. Acc., Cert. Acc.) |
| PHEME [20] | 3 values (true, false, unverif.) |
| FacebookHoax [21] | 2 values (hoaxes, non-hoaxes) |
| LIAR [22] | 6 values (pants-fire, false, barely-true, half-true, mostly-true, true) |
| BuzzFeed [23] | 4 values (mostly true, no factual content, mix, mostly false) |
| BuzzFace [24] | 4 values (mostly true, mostly false, mix, no factual content) |
| FakeNewsNet [13] | 2 values (fake, real) |
| Yelp [25] | 2 values (fake, trustful) |
| MisInfoText [26] | 4 (BuzzFeed), 5 (Snopes) values |
| NELA-GT-2018 [27] | 2 values (true, false) |
| FCV-2018 [28] | 2 values (true, false) |
| Verification Corpus [29] | 2 values (true, false) |
| r/fakeddit [30] | 5 values (Cert., Prob. Inacc., Doubt., Prob. Acc., Cert.) |
| M4 [31] | 5 values (Cert., Prob. Inacc., Doubt., Prob. Acc., Cert.) |

| Dataset | Size | Annotation Process |
|---|---|---|
| CREDBANK [19] | 60 M tweets, 1049 events | Crowdsourced |
| PHEME [20] | 330 convos * (159 true, 68 false, 103 unverified) | Manual |
| FacebookHoax [21] | 15.5 K posts from 32 pages * | Manual |
| LIAR [22] | 12.8 K labeled statements | Manual |
| BuzzFeed [23] | 2.3 K news samples | Manual |
| BuzzFace [24] | 2.3 K news from 9 pages * | Crowdsourced |
| FakeNewsNet [13] | 422 news (211 fake, 211 real) | Automated |
| Yelp [25] | 18.9 K reviews (9.5 K fake, 9.5 K real) | Manual |
| MisInfoText [26] | 1.7 K articles (1.4 K BuzzFeed, 312 Snopes) | Crowdsourced |
| NELA-GT-2018 [27] | 713 K articles | Automated |
| FCV-2018 [28] | 380 videos, 77.3 K tweets | Manual |
| Verification Corpus [29] | 15.6 K posts | Manual |
| r/fakeddit [30] | 1.06 M samples | Manual |
| M4 [31] | 147 K parallel data * (3 K per 49 languages) | Manual |

| Study | Year | Datasets | Evaluation | Generalization |
|---|---|---|---|---|
| Li et al. [53] | 2024 | DGM4 | Cross-dataset | Weak |
| Rout et al. [51] | 2025 | FakeNewsNet, ISOT, LIAR | Multi-dataset | Moderate |
| Aslam et al. [52] | 2025 | Multiple benchmarks | Multi-dataset | Moderate |
| Xie et al. [54] | 2025 | Twitter, Weibo | Transfer-focused | Strong |

| Dataset | Regularly Updated |
|---|---|
| NELA-GT-2018 [27] | ✔ |
| CREDBANK [19] | ✗ |
| PHEME [20] | ✗ |
| FacebookHoax [21] | ✗ |
| LIAR [22] | ✗ |
| BuzzFeed [23] | ✗ |
| BuzzFace [24] | ✗ |
| FakeNewsNet [13] | ✗ |
| Yelp [25] | ✗ |
| MisInfoText [26] | ✗ |
| FCV-2018 [28] | ✗ |
| Verification Corpus [29] | ✗ |
| r/fakeddit [30] | ✗ |
| M4 [31] | ✗ |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Kuntur, S.; Wróblewska, A.; Ganzha, M.; Paprzycki, M.; Sachdeva, S. Fake News Detection: It’s All in the Data! Appl. Sci. 2026, 16, 1585. https://doi.org/10.3390/app16031585