An Automated Unsupervised Model Using Probabilistic Mixture Models and Textual Analysis for Arabic Fake News Detection
Abstract
1. Introduction
- The study introduces an unsupervised model for Arabic fake news detection, addressing the scarcity of methodologies that target Arabic.
- It examines linguistic features in detail and highlights the differences between real and fake posts, offering insight into how misinformation is expressed in Arabic.
- Distribution-based clustering techniques are used to classify textual content, which may inform future research in Natural Language Processing.
- The online learning variant reaches up to 92.51% accuracy, demonstrating the model’s effectiveness and its adaptability to changing misinformation trends.
2. Related Works
2.1. Unsupervised Model for Detecting Fake News
2.2. Arabic Fake News
3. The Proposed Online Learning Framework
3.1. The Considered Distributions
3.1.1. The Generalized Dirichlet Multinomial and Its Exponential Approximation
3.1.2. The Multinomial Beta-Liouville and Its Exponential Approximation
3.2. The Proposed Online Mixture Model
3.2.1. Phase 1: Mixture Model for Offline Clustering
3.2.2. Phase 2: Real-Time Detection of Fake News in the News
Algorithm 1. The proposed online clustering framework.
4. Experimental Results and Discussion
- Normalization: Arabic characters are normalized first: “إ”, “أ”, and “آ” are replaced with “ا”; “ؤ” is replaced with “و”; “ى” and “ئ” are replaced with “ي”; and “ة” is replaced with “ه”. Hindi (Arabic-Indic) numerals are also replaced with Arabic (Western) numerals; for example, “٠” and “١” become “0” and “1”, respectively.
- Diacritics Removal: Arabic diacritics are removed.
- Punctuation Removal: All punctuation marks are removed.
- Noise Removal: Hyperlinks, mentions starting with @, emojis, and retweet signs (RT) are removed.
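
The preprocessing steps listed above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. Several details are assumptions: the function name `preprocess`, the diacritic range (U+064B to U+0652), the emoji code point ranges, the extra Arabic punctuation marks, and the ordering, in which hyperlinks, mentions, and RT markers are stripped before punctuation so that `@` and `://` are still intact when they are matched.

```python
import re
import string

# Letter normalization from the list above, written with code points
# to avoid right-to-left display ambiguity.
_LETTER_MAP = {
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0622": "\u0627",  # alef with madda -> bare alef
    "\u0624": "\u0648",  # waw with hamza -> waw
    "\u0649": "\u064A",  # alef maksura -> yeh
    "\u0626": "\u064A",  # yeh with hamza -> yeh
    "\u0629": "\u0647",  # teh marbuta -> heh
}
# Arabic-Indic digits (U+0660..U+0669) -> ASCII digits.
_DIGIT_MAP = {chr(0x0660 + i): str(i) for i in range(10)}
_TRANSLATION = str.maketrans({**_LETTER_MAP, **_DIGIT_MAP})

_URL = re.compile(r"https?://\S+|www\.\S+")
_MENTION = re.compile(r"@\w+")
_RETWEET = re.compile(r"\bRT\b")
# Assumption: a few common emoji blocks, not an exhaustive list.
_EMOJI = re.compile("[\U0001F1E6-\U0001F1FF\U0001F300-\U0001FAFF\u2600-\u27BF]")
# Assumption: fathatan through sukun covers the diacritics of interest.
_DIACRITICS = re.compile("[\u064B-\u0652]")
# ASCII punctuation plus Arabic comma, semicolon, and question mark.
_PUNCT = re.compile("[" + re.escape(string.punctuation + "\u060C\u061B\u061F") + "]")


def preprocess(text: str) -> str:
    """Apply noise removal, diacritics removal, normalization, and punctuation removal."""
    for pattern in (_URL, _MENTION, _RETWEET, _EMOJI):
        text = pattern.sub(" ", text)
    text = _DIACRITICS.sub("", text)
    text = text.translate(_TRANSLATION)
    text = _PUNCT.sub(" ", text)
    return " ".join(text.split())  # collapse repeated whitespace
```

Compiling the patterns once at module level keeps the per-post cost low, which may matter when posts arrive as a stream.
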
5. Textual Analysis of Real and Fake Posts
5.1. Textual Features Categories
5.2. Textual Features Analysis
6. Limitation and Future Works
7. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Appendix A
References
1. Akbar, S.Z.; Panda, A.; Kukreti, D.; Meena, A.; Pal, J. Misinformation as a Window into Prejudice: COVID-19 and the Information Environment in India. Proc. ACM Hum.-Comput. Interact. 2021, 4, 249.
2. Gallotti, R.; Valle, F.; Castaldo, N.; Sacco, P.; De Domenico, M. Assessing the risks of ‘infodemics’ in response to COVID-19 epidemics. Nat. Hum. Behav. 2020, 4, 1285–1293.
3. Knittel, J.; Koch, S.; Tang, T.; Chen, W.; Wu, Y.; Liu, S.; Ertl, T. Real-time visual analysis of high-volume social media posts. IEEE Trans. Vis. Comput. Graph. 2021, 28, 879–889.
4. Xia, Q.; Maekawa, T.; Hara, T. Unsupervised human activity recognition through two-stage prompting with ChatGPT. arXiv 2023, arXiv:2306.02140.
5. Al-Sayed, A.; Khayyat, M.M.; Zamzami, N. Predicting Heart Disease Using Collaborative Clustering and Ensemble Learning Techniques. Appl. Sci. 2023, 13, 13278.
6. Cholevas, C.; Angeli, E.; Sereti, Z.; Mavrikos, E.; Tsekouras, G.E. Anomaly detection in blockchain networks using unsupervised learning: A survey. Algorithms 2024, 17, 201.
7. Li, D.; Guo, H.; Wang, Z.; Zheng, Z. Unsupervised fake news detection based on autoencoder. IEEE Access 2021, 9, 29356–29365.
8. Su, X.; Zamzami, N.; Bouguila, N. COVID-19 news clustering using MCMC-based learning of finite EMSD mixture models. In Proceedings of the International FLAIRS Conference, North Miami Beach, FL, USA, 17–19 May 2021; Volume 34.
9. Bouguila, N. Clustering of count data using generalized Dirichlet multinomial distributions. IEEE Trans. Knowl. Data Eng. 2008, 20, 462–474.
10. Bouguila, N. Count data modeling and classification using finite mixtures of distributions. IEEE Trans. Neural Netw. 2010, 22, 186–198.
11. Zamzami, N.; Bouguila, N. Sparse count data clustering using an exponential approximation to generalized Dirichlet multinomial distributions. IEEE Trans. Neural Netw. Learn. Syst. 2020, 33, 89–102.
12. Zamzami, N.; Bouguila, N. High-dimensional count data clustering based on an exponential approximation to the multinomial Beta-Liouville distribution. Inf. Sci. 2020, 524, 116–135.
13. Saini, N.; Singhal, M.; Tanwar, M.; Meel, P. Multimodal, Semi-supervised, and Unsupervised web content credibility Analysis Frameworks. In Proceedings of the 2020 4th IEEE International Conference on Intelligent Computing and Control Systems (ICICCS), Madurai, India, 13–15 May 2020; pp. 948–955.
14. Dong, X.; Victor, U.; Chowdhury, S.; Qian, L. Deep two-path semi-supervised learning for fake news detection. arXiv 2019, arXiv:1906.05659.
15. Hosseinimotlagh, S.; Papalexakis, E.E. Unsupervised content-based identification of fake news articles with tensor decomposition ensembles. In Proceedings of the Workshop on Misinformation and Misbehavior Mining on the Web (MIS2), Marina del Rey, CA, USA, 9 February 2018.
16. Gangireddy, S.C.R.; Long, C.; Chakraborty, T. Unsupervised fake news detection: A graph-based approach. In Proceedings of the 31st ACM Conference on Hypertext and Social Media, Virtual Event, USA, 13–15 July 2020; pp. 75–83.
17. Wan, P.; Wang, X.; Pang, G.; Wang, L.; Min, G. A novel rumor detection with multi-objective loss functions in online social networks. Expert Syst. Appl. 2023, 213, 119239.
18. Yang, S.; Shu, K.; Wang, S.; Gu, R.; Wu, F.; Liu, H. Unsupervised fake news detection on social media: A generative approach. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 5644–5651.
19. Najar, F.; Zamzami, N.; Bouguila, N. Fake news detection using Bayesian inference. In Proceedings of the 2019 IEEE 20th International Conference on Information Reuse and Integration for Data Science (IRI), Los Angeles, CA, USA, 30 July–1 August 2019; pp. 389–394.
20. Alasmari, A.; Addawood, A.; Nouh, M.; Rayes, W.; Al-Wabil, A. A retrospective analysis of the COVID-19 infodemic in Saudi Arabia. Future Internet 2021, 13, 254.
21. Mahlous, A.R.; Al-Laith, A. Fake news detection in Arabic tweets during the COVID-19 pandemic. Int. J. Adv. Comput. Sci. Appl. 2021, 12, 778–788.
22. Haouari, F.; Hasanain, M.; Suwaileh, R.; Elsayed, T. ArCOV-19: The first Arabic COVID-19 Twitter dataset with propagation networks. arXiv 2020, arXiv:2004.05861.
23. Haouari, F.; Hasanain, M.; Suwaileh, R.; Elsayed, T. ArCOV19-rumors: Arabic COVID-19 Twitter dataset for misinformation detection. arXiv 2020, arXiv:2010.08768.
24. Ameur, M.S.H.; Aliane, H. AraCOVID19-MFH: Arabic COVID-19 Multi-label Fake News & Hate Speech Detection Dataset. Procedia Comput. Sci. 2021, 189, 232–241.
25. Alam, F.; Dalvi, F.; Shaar, S.; Durrani, N.; Mubarak, H.; Nikolov, A.; Da San Martino, G.; Abdelali, A.; Sajjad, H.; Darwish, K.; et al. Fighting the COVID-19 Infodemic in Social Media: A Holistic Perspective and a Call to Arms. In Proceedings of the Fifteenth International AAAI Conference on Web and Social Media, Palo Alto, CA, USA, 7–10 June 2021; pp. 913–922.
26. Alyoubi, S.; Kalkatawi, M.; Abukhodair, F. The Detection of Fake News in Arabic Tweets Using Deep Learning. Appl. Sci. 2023, 13, 8209.
27. Al-Yahya, M.; Al-Khalifa, H.; Al-Baity, H.; AlSaeed, D.; Essam, A. Arabic fake news detection: Comparative study of neural networks and transformer-based approaches. Complexity 2021, 2021, 5516945.
28. Helmy, T.M.H.Y.M.; Elzanfaly, D.S. An Ensemble Stacking Model for Rumor Detection Based on Arabic Tweets. J. Theor. Appl. Inf. Technol. 2023, 101, 7347–7358.
29. Madsen, R.E.; Kauchak, D.; Elkan, C. Modeling word burstiness using the Dirichlet distribution. In Proceedings of the 22nd International Conference on Machine Learning, Bonn, Germany, 7–11 August 2005; pp. 545–552.
30. Zhou, H.; Lange, K. MM algorithms for some discrete multivariate distributions. J. Comput. Graph. Stat. 2010, 19, 645–665.
31. Zamzami, N.; Bouguila, N. Consumption behavior prediction using hierarchical Bayesian frameworks. In Proceedings of the 2018 First IEEE International Conference on Artificial Intelligence for Industries (AI4I), Laguna Hills, CA, USA, 26–28 September 2018; pp. 31–34.
32. McLachlan, G.J.; Krishnan, T. The EM Algorithm and Extensions; Wiley Series in Probability and Statistics; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2007.
33. Yao, J.F. On recursive estimation in incomplete data models. Stat. J. Theor. Appl. Stat. 2000, 34, 27–51.
34. Bouguila, N.; Ziou, D. Online clustering via finite mixtures of Dirichlet and minimum message length. Eng. Appl. Artif. Intell. 2006, 19, 371–379.
35. Hu, B.; Mao, Z.; Zhang, Y. An overview of fake news detection: From a new perspective. Fundam. Res. 2025, 5, 332–346.
36. Tucker, J.S.; Friedman, H.S. Chapter 17—Emotion, Personality, and Health. In Handbook of Emotion, Adult Development, and Aging; Magai, C., McFadden, S.H., Eds.; Academic Press: Cambridge, MA, USA, 1996; pp. 307–326.
37. Himdi, H.; Weir, G.; Assiri, F.; Al-Barhamtoshy, H. Arabic fake news detection based on textual analysis. Arab. J. Sci. Eng. 2022, 47, 10453–10469.
38. Alshahrani, H.J.; Tarmissi, K.; Alshahrani, H.; Ahmed Elfaki, M.; Yafoz, A.; Alsini, R.; Alghushairy, O.; Ahmed Hamza, M. Computational Linguistics with Deep-Learning-Based Intent Detection for Natural Language Understanding. Appl. Sci. 2022, 12, 8633.
39. Biber, D.; Johansson, S.; Leech, G.; Conrad, S.; Finegan, E. Longman Grammar of Spoken and Written English; Longman: London, UK, 2000.
40. Rayson, P.; Wilson, A.; Leech, G. Grammatical Word Class Variation within the British National Corpus Sampler. In New Frontiers of Corpus Research; Peters, P., Purnell, P., Rayner, S., Eds.; Brill: Leiden, The Netherlands, 2002; pp. 295–306.
41. Abdelali, A.; Darwish, K.; Durrani, N.; Mubarak, H. Farasa: A fast and furious segmenter for Arabic. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, San Diego, CA, USA, 12–17 June 2016; pp. 11–16.
42. Alfaidi, A.; Alwadei, H.; Alshutayri, A.; Alahdal, S. Exploring the Performance of Farasa and CAMeL Taggers for Arabic Dialect Tweets. Int. Arab. J. Inf. Technol. (IAJIT) 2023, 20, 349–356.
43. Levenson, R.W.; Ekman, P.; Friesen, W.V. Voluntary facial action generates emotion-specific autonomic nervous system activity. Psychophysiology 1990, 27, 363–384.
44. Jupe, L.M.; Vrij, A.; Leal, S.; Nahari, G. Are you for real? Exploring language use and unexpected process questions within the detection of identity deception. Appl. Cogn. Psychol. 2018, 32, 622–634.
45. Hancock, J.T.; Curry, L.E.; Goorha, S.; Woodworth, M. On lying and being lied to: A linguistic analysis of deception in computer-mediated communication. Discourse Processes 2007, 45, 1–23.
46. Himdi, H.T.; Assiri, F.Y. Tasaheel: An Arabic Automative Textual Analysis Tool—All in One. IEEE Access 2023, 11, 139979–139992.
47. Kapusta, J.; Obonya, J. Improvement of Misleading and Fake News Classification for Flective Languages by Morphological Group Analysis. Informatics 2020, 7, 4.
48. Horne, B.; Adali, S. This just in: Fake news packs a lot in title, uses simpler, repetitive content in text body, more similar to satire than real news. In Proceedings of the International AAAI Conference on Web and Social Media, Montreal, QC, Canada, 15–18 May 2017; Volume 11, pp. 759–766.
49. Yang, Y.; Zheng, L.; Zhang, J.; Cui, Q.; Li, Z.; Yu, P.S. TI-CNN: Convolutional neural networks for fake news detection. arXiv 2018, arXiv:1806.00749.
50. Kapusta, J.; Hajek, P.; Munk, M.; Benko, L. Comparison of fake and real news based on morphological analysis. Procedia Comput. Sci. 2020, 171, 2285–2293.
51. Knapp, M.L.; Hart, R.P.; Dennis, H.S. An exploration of deception as a communication construct. Hum. Commun. Res. 1974, 1, 15–29.
52. O’Connor, T. Saudi Arabia Warns Those Who Spread ‘Fake News’ Will Be Jailed, Fined, Amid Rumors It Had Journalist Killed. Available online: https://www.newsweek.com/saudi-arabia-fake-news-jamal-khashoggi-1170613 (accessed on 1 September 2024).
53. Newman, M.L.; Pennebaker, J.W.; Berry, D.S.; Richards, J.M. Lying Words: Predicting Deception from Linguistic Styles. Personal. Soc. Psychol. Bull. 2003, 29, 665–675.
54. DePaulo, B.M.; Lindsay, J.J.; Malone, B.E.; Muhlenbruck, L.; Charlton, K.; Cooper, H. Cues to deception. Psychol. Bull. 2003, 129, 74.
55. Petty, R.E.; Cacioppo, J.T. The Elaboration Likelihood Model of Persuasion; Advances in Experimental Social Psychology; Academic Press: Amsterdam, The Netherlands, 1986; Volume 19, pp. 123–205.
56. Bertelson, P.; Eelen, P.; d’Ydewalle, G. International Perspectives On Psychological Science, II: The State of the Art; Taylor & Francis: London, UK; Psychology Press: London, UK, 2013. (In English)
57. Paschen, J. Investigating the emotional appeal of fake news using artificial intelligence and human contributions. J. Prod. Brand Manag. 2020, 29, 223–233.



Offline clustering results on ArCOV19-Rumors:

| Model | Accuracy (%) | Precision (%) | Recall (%) | F-Score (%) |
|---|---|---|---|---|
| K-means | 50.42 | 50.26 | 72.78 | 59.45 |
| MM | 51.45 | 51.47 | 51.47 | 51.47 |
| DCM | 52.32 | 51.26 | 73.78 | 60.49 |
| GDM | 64.55 | 64.97 | 65.82 | 65.39 |
| MBL | 83.15 | 82.92 | 84.69 | 83.79 |
| EGDM | 74.94 | 75.98 | 69.88 | 72.80 |
| EMBL | 89.44 | 89.44 | 89.44 | 89.44 |
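
As a quick sanity check on the table above, the reported F-scores can be recomputed as the harmonic mean of precision and recall, F = 2PR / (P + R). The sketch below is only an illustration: the names `REPORTED`, `f_score`, and `max_deviation` are hypothetical, and the values are copied from the table rather than recomputed from the data.

```python
# (precision %, recall %, reported F-score %) copied from the table above.
REPORTED = {
    "K-means": (50.26, 72.78, 59.45),
    "MM": (51.47, 51.47, 51.47),
    "DCM": (51.26, 73.78, 60.49),
    "GDM": (64.97, 65.82, 65.39),
    "MBL": (82.92, 84.69, 83.79),
    "EGDM": (75.98, 69.88, 72.80),
    "EMBL": (89.44, 89.44, 89.44),
}


def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both in percent)."""
    return 2 * precision * recall / (precision + recall)


def max_deviation(rows: dict) -> float:
    """Largest absolute gap between recomputed and reported F-scores."""
    return max(abs(f_score(p, r) - f) for p, r, f in rows.values())


for model, (p, r, f) in REPORTED.items():
    print(f"{model:8s} recomputed={f_score(p, r):.2f} reported={f:.2f}")
```

Under these inputs the recomputed values differ from the reported ones by less than 0.02 percentage points, which is consistent with rounding of the precision and recall figures.
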

Online learning results on ArCOV19-Rumors:

| Model | Avg. Accuracy (%) | F-Score (%) | Time |
|---|---|---|---|
| GDM | 68.04 | 68.50 | 0.0240 |
| EGDM | 76.65 | 75.88 | 0.0055 |
| MBL | 87.20 | 86.05 | 0.0239 |
| EMBL | 92.51 | 90.89 | 0.0049 |
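
The effect of the exponential approximations (EGDM and EMBL) relative to their base distributions (GDM and MBL) can be read off the table above with a little arithmetic. The following sketch is illustrative only. The names `ONLINE`, `speedup`, and `accuracy_gain` are hypothetical, and because the unit of the Time column is not stated, only the unitless ratio of times is computed.

```python
# (average accuracy %, time) copied from the table above.
ONLINE = {
    "GDM": (68.04, 0.0240),
    "EGDM": (76.65, 0.0055),
    "MBL": (87.20, 0.0239),
    "EMBL": (92.51, 0.0049),
}


def speedup(base: str, approx: str) -> float:
    """How many times faster the approximation is than its base model."""
    return ONLINE[base][1] / ONLINE[approx][1]


def accuracy_gain(base: str, approx: str) -> float:
    """Accuracy difference in percentage points (approximation minus base)."""
    return ONLINE[approx][0] - ONLINE[base][0]


for base, approx in (("GDM", "EGDM"), ("MBL", "EMBL")):
    print(f"{approx} vs {base}: {speedup(base, approx):.2f}x faster, "
          f"{accuracy_gain(base, approx):+.2f} points accuracy")
```

With the reported figures, both approximations are roughly four to five times faster than their base models while also scoring higher average accuracy.
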

Comparison with previous Arabic fake news detection studies:

| Reference | Method | Dataset | Best Reported Result |
|---|---|---|---|
| [27] | Deep learning models (CNN, RNN, GRU) and transformer-based models (AraBERT v1, AraBERT v0.2, AraBERT v2, ArElectra, QARiB, ARBERT, MARBERT) for Arabic fake news detection. | ArCOV19-Rumors, Covid-19-Fakes, AraNews, ANS corpus | Accuracy: 95.80% (QARiB) |
| [26] | Deep learning (CNN-BiLSTM) and transformer-based models (ARBERT, MARBERT) for Arabic fake news detection. | ArCOV19-Rumors, COVID-19 misinformation | F1-score: 95% (ARBERT, MARBERT) |
| [28] | Hybrid ensemble model combining machine learning and deep learning techniques for detecting Arabic rumor tweets. | ArCOV19-Rumors | Accuracy: 90% |
| Proposed model 1 | EGDM (offline) | ArCOV19-Rumors | Accuracy: 74.94% |
| Proposed model 2 | EMBL (offline) | ArCOV19-Rumors | Accuracy: 89.44% |
| Proposed model 3 | EGDM (online) | ArCOV19-Rumors | Accuracy: 76.65% |
| Proposed model 4 | EMBL (online) | ArCOV19-Rumors | Accuracy: 92.51% |

Textual features in real and fake posts:

| Feature | Real | Fake |
|---|---|---|
| POS | | |
| Nouns | 23.86 | 30.60 |
| Verbs | 3.91 | 4.46 |
| Particles | 2.79 | 1.89 |
| Conjunctions | 2.78 | 3.41 |
| Pronouns (all) | 3.10 | 4.06 |
| Pronouns (singular) | 33.16 | 28.27 |
| Pronouns (plural) | 5.44 | 6.07 |
| Adverbs | 0.23 | 0.26 |
| Determiners | 5.03 | 4.08 |
| Prepositions | 8.14 | 7.51 |
| Adjectives | 4.85 | 4.98 |
| Linguistics | | |
| Place | 0.74 | 0.86 |
| Assurance | 1.14 | 0.79 |
| Negators | 1.55 | 0.69 |
| Opposition | 0.19 | 0.03 |
| Justification | 0.88 | 0.21 |
| Exception | 0.23 | 0.19 |
| Illustration | 0.12 | 0.05 |
| Hedges | 0.52 | 0.20 |
| Time | 0.20 | 0.27 |
| Order | 0.01 | 0.01 |
| Intensifiers | 0.59 | 0.62 |
| Emotions | | |
| Joy/happiness | 0.93 | 0.78 |
| Sadness | 0.17 | 0.25 |
| Anger | 0.24 | 0.02 |
| Fear | 0.14 | 0.03 |
| Surprise | 0.08 | 0.11 |
| Disgust | 0.04 | 0.02 |
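
One simple way to see which cues separate the two classes most strongly is to rank the features by the ratio of their real to fake values. The sketch below is a hypothetical summary, not an analysis taken from the paper. It uses only the Linguistics and Emotions rows of the table above, the names `FEATURES` and `rank_by_ratio` are illustrative, and it relies on every fake value in those rows being nonzero.

```python
# (real, fake) values copied from the Linguistics and Emotions rows above.
FEATURES = {
    "Place": (0.74, 0.86),
    "Assurance": (1.14, 0.79),
    "Negators": (1.55, 0.69),
    "Opposition": (0.19, 0.03),
    "Justification": (0.88, 0.21),
    "Exception": (0.23, 0.19),
    "Illustration": (0.12, 0.05),
    "Hedges": (0.52, 0.20),
    "Time": (0.20, 0.27),
    "Order": (0.01, 0.01),
    "Intensifiers": (0.59, 0.62),
    "Joy/happiness": (0.93, 0.78),
    "Sadness": (0.17, 0.25),
    "Anger": (0.24, 0.02),
    "Fear": (0.14, 0.03),
    "Surprise": (0.08, 0.11),
    "Disgust": (0.04, 0.02),
}


def rank_by_ratio(features: dict) -> list:
    """Sort feature names by real/fake ratio, largest first."""
    return sorted(
        features,
        key=lambda name: features[name][0] / features[name][1],
        reverse=True,
    )


ranking = rank_by_ratio(FEATURES)
print("Most real-leaning:", ranking[:3])
print("Most fake-leaning:", ranking[-3:])
```

A ratio is sensitive to very small denominators, so features with values near 0.01 should be read with caution.
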
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Zamzami, N.; Himdi, H.; Qarout, R.K. An Automated Unsupervised Model Using Probabilistic Mixture Models and Textual Analysis for Arabic Fake News Detection. Mathematics 2026, 14, 1250. https://doi.org/10.3390/math14081250

