Fake News Detection Through LLM-Driven Text Augmentation Across Media and Languages

Sittar, Abdul; Smiljanic, Mateja; Guček, Alenka; Grobelnik, Marko

doi:10.3390/make8040103


¹ Jožef Stefan Institute, Jamova Cesta 39, 1000 Ljubljana, Slovenia

² Faculty of Mechanical Engineering, University of Ljubljana, Aškerčeva Cesta 6, 1000 Ljubljana, Slovenia

* Author to whom correspondence should be addressed.

Mach. Learn. Knowl. Extr. 2026, 8(4), 103; https://doi.org/10.3390/make8040103

Submission received: 2 March 2026 / Revised: 8 April 2026 / Accepted: 9 April 2026 / Published: 15 April 2026


Abstract

The proliferation of fake news across social media, headlines, and news articles poses major challenges for automated detection, particularly in multilingual and cross-media settings affected by data imbalance. We propose a fake news detection framework based on LLM-driven, feature-guided text augmentation. The method generates realistic synthetic samples across languages, media types, and text granularities while preserving meaning and stylistic coherence. Experiments with classical and transformer-based models (Random Forest, Logistic Regression, BERT, XLM-R) on social media, headline, and multilingual news datasets show consistent improvements in performance. For inherently balanced datasets (e.g., social media), synthetic augmentation yields negligible but stable performance changes. In imbalanced scenarios, synthetic augmentation substantially improves minority-class recall and F1-score (e.g., fake news recall rises from 0.57 to 0.86) while preserving majority-class performance, leading to more balanced and reliable classifiers. Oversampling, by contrast, significantly degrades results due to overfitting on duplicated language patterns. Overall, a hybrid semantic- and style-based model proves to be the most robust strategy, outperforming oversampling and matching or exceeding baseline performance across datasets.

Keywords: fake news detection; low-resource languages; data imbalance; synthetic data generation; prompt engineering; style-based features; semantic features
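
The abstract reports results in terms of minority-class recall and F1-score. As a minimal sketch of how such per-class metrics are computed, the following treats one label as the positive class and counts true positives, false positives, and false negatives. The function name `class_metrics` and the toy labels are assumptions made for illustration; they are not the authors' evaluation code or data.

```python
from collections import Counter


def class_metrics(y_true, y_pred, positive):
    """Return (precision, recall, f1) for one class, treated as positive."""
    # Count each (true label, predicted label) pair once.
    pairs = Counter(zip(y_true, y_pred))
    tp = sum(n for (t, p), n in pairs.items() if t == positive and p == positive)
    fp = sum(n for (t, p), n in pairs.items() if t != positive and p == positive)
    fn = sum(n for (t, p), n in pairs.items() if t == positive and p != positive)
    # Guard against empty denominators so the function never divides by zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels invented for illustration only (not data from the paper):
# 4 "fake" items (the minority class) and 6 "real" items.
y_true = ["fake"] * 4 + ["real"] * 6
y_pred = ["fake", "fake", "real", "real"] + ["real"] * 5 + ["fake"]

precision, recall, f1 = class_metrics(y_true, y_pred, positive="fake")
print(f"fake-class precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Reporting these values for the minority class separately, rather than overall accuracy alone, is what makes an improvement on imbalanced data visible, since a classifier can score high accuracy while missing most minority-class items.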
