Article

An Automated Unsupervised Model Using Probabilistic Mixture Models and Textual Analysis for Arabic Fake News Detection

Department of Computer Science and Artificial Intelligence, University of Jeddah, Jeddah 23218, Saudi Arabia
*
Author to whom correspondence should be addressed.
Mathematics 2026, 14(8), 1250; https://doi.org/10.3390/math14081250
Submission received: 1 February 2026 / Revised: 6 April 2026 / Accepted: 7 April 2026 / Published: 9 April 2026

Abstract

Along with the coronavirus pandemic (COVID-19), some in the medical publication industry have observed an “infodemic” that spread even more widely than the virus itself. Given the lack of sufficient pandemic preparedness measures in many countries, people shared millions of posts on social media without questioning their veracity or accuracy, particularly within Arabic-speaking communities. This study investigates an unsupervised model for detecting fake news in Arabic to fight the infodemic. While there has been much research on fake news detection (FND) in English, the subject remains underexplored for Arabic. We examine the use of distribution-based clustering techniques for Arabic FND and compare their performance. Moreover, we conduct a comprehensive linguistic analysis, identifying significant differences in textual features between real and fake posts, which can improve fake news detection. Our research shows the potential of online learning techniques to enhance model performance, leading to high accuracy, reaching up to 92%. By addressing the unique challenges posed by Arabic-language posts, our research offers practical implications for developing effective strategies for reducing infodemics and their social consequences and for strategic planning to control current and future infodemics.

1. Introduction

In the fight against the COVID-19 pandemic, social media screening for public health has emerged as a critical component. False rumors, misinformation, disinformation, and fake news spread faster, deeper, and wider on social media platforms than reliable information. In pandemics, research has shown that disinformation can contribute to a climate of fear and discrimination [1]. As a result, during pandemics, the public requires clear, up-to-date information as well as transparency in strategic and operational decision-making. With the globalization of social media platforms, there are concerns about how regional cultural variables influence online involvement, particularly in the context of “#infodemics” [2], a term that refers to an abundance of information, both online and offline. The World Health Organization revealed in February 2021 that the coronavirus pandemic was accompanied by an “infodemic” of fake news (WHO 2020). In this study, we investigate the COVID-19 infodemic in the context of Arabic-speaking users on the X platform.
The number of content creators, as well as the volume and topic diversity of published content on social media platforms like X, has expanded significantly with the growing importance of these platforms in society. Consequently, a major challenge is the sheer amount and pace of newly created posts. Numerous methods are either limited in their analytical capacities, only support offline analyses, or are unable to process large volumes of data; moreover, they often depend on a great deal of preprocessing or extra metadata [3]. In addition, most existing work relies on deep learning or supervised learning techniques, which require large amounts of labeled data. However, high-dimensional unlabeled data accumulate rapidly in the real world, and obtaining labeled data is expensive and time-consuming. Accordingly, unsupervised learning techniques are promising, yet challenging. To address this challenge, we propose an unsupervised learning model that explores alternative techniques that do not heavily depend on labeled data.
This research addresses a significant gap in the literature by explicitly focusing on detecting fake news in Arabic, a topic that has been less explored than studies on English and other languages within the same field. We aim to enhance the understanding of misinformation among Arabic-speaking communities and provide tailored solutions catering to their unique linguistic context. In recent years, unsupervised learning methods have drawn more attention, as they offer a potent method of segmenting the available data into distinct groups, thereby facilitating the identification of abnormal instances in the form of outliers. This approach can achieve state-of-the-art performance comparable to supervised learning by utilizing data in different fields, e.g., pertinent information about the activities [4], patient data to predict heart disease [5], anomaly detection in blockchain networks [6], and fake news detection [7,8].
This study explores innovative approaches to tackle fake texts along two directions. The first is an online unsupervised learning method to detect fake news. We propose an online clustering approach based on mixtures of the generalized Dirichlet multinomial (GDM) [9] and the Multinomial Beta-Liouville (MBL) [10], together with their exponential approximations, EGDM [11] and EMBL [12]. Indeed, numerous applications across a wide range of areas highlight the significance of online learning in real-world circumstances where data are created at an elevated rate. This approach minimizes the need for labeled datasets, which are often scarce, particularly in less-resourced languages, and showcases the potential of advanced machine learning techniques to efficiently identify misinformation without extensive prior annotation.
Furthermore, in the second approach, we perform a thorough lexical analysis of both real and fake posts in Arabic across specific linguistic categories, namely part of speech, emotion, and linguistic style. This analysis offers valuable insights relative to prior studies aimed at mitigating fake news.
For both approaches, this study emphasizes the linguistic characteristics of textual content rather than multimodal content, as studies demonstrate that text-based content is the most common form for spreading fake news on social media (https://www.envistaforensics.com/knowledge-center/insights/articles/pixels-and-perjury-the-alarming-ease-of-fabricating-text-message-evidence-and-what-to-do-about-it/ (accessed on 13 September 2024)). This prevalence can be attributed to the relative ease with which textual content can be created, manipulated, and distributed compared to other formats, such as images or videos. By utilizing distribution-based clustering methods and linguistic analyses, we demonstrate the potential of advanced technologies to improve model performance and adaptability. In today’s fast-paced information environment, there is a pressing need for real-time solutions to detect and counteract the spread of misinformation. The proposed framework seeks to provide timely responses that can help mitigate the societal impacts of fake news, ultimately fostering informed decision-making during critical situations.
This work enhances understanding of misinformation and fake news detection, particularly in Arabic-language content during the COVID-19 pandemic. The objective of this work is not to maximize supervised accuracy but to provide an unsupervised and online-adaptive framework suitable for evolving and label-limited scenarios. Key contributions include the following:
  • The study introduces a unique unsupervised model, addressing the lack of methodologies focused on Arabic.
  • It conducts a detailed examination of linguistic features, highlighting differences between real and fake posts, which improves insights into misinformation in Arabic.
  • The use of innovative distribution-based clustering techniques aids in classifying textual content, potentially influencing future research in Natural Language Processing.
  • Achieving high results (up to 92% accuracy) through online learning techniques demonstrates the model’s effectiveness and adaptability to misinformation trends.
The remainder of this paper is organized as follows. Section 2 presents the most recent work regarding unsupervised learning approaches for FND in general and the published work related to Arabic language news specifically. In Section 3, we outline the proposed online framework for the real-time clustering of fake news, discuss the properties of the fitting distribution for each proposed model, and introduce the online clustering model based on a mixture of the discussed distributions. Section 4 discusses and evaluates the experimental results. Section 5 presents an analysis of real and fake news, including word use categories and a comparison of real and fake news. Finally, Section 7 presents the conclusions, limitations, and suggestions for future work.

2. Related Works

2.1. Unsupervised Model for Detecting Fake News

The unsupervised approach to fake news detection is significantly more challenging than the traditional supervised approach because unsupervised techniques do not have the luxury of labeled data to guide the search process. In contrast, supervised methods rely on labeled data to train machine learning models to classify news items as fake or legitimate.
Unsupervised models have rarely been used for fake news detection. The survey in [13] presents recent models, showing the outcomes and limitations of existing fake news detection frameworks, and discusses several implications of fake news for society. Fake news can rapidly disseminate false information, misleading the public and distorting perceptions of reality; this is particularly concerning in the context of significant events, where fake news can escalate tensions and create panic. Its social and economic impact can be far-reaching, affecting individuals’ beliefs and decisions. For instance, the paper cites an example in which a rumor about a political figure led to a significant drop in stock value, illustrating the potential economic ramifications of fake news. The prevalence of fake news also erodes trust in legitimate news sources and institutions: as people become more suspicious of the information they encounter, a general distrust in media and public discourse develops, complicating efforts to communicate important information. The paper therefore emphasizes the urgent need for effective fake news detection methods and highlights that, as writing styles and methods of spreading fake news evolve, technological advances and updated detection strategies are necessary to keep pace with these changes. The implications of fake news are profound, affecting not only individual understanding but also societal cohesion and the functioning of democratic processes. Most of the studied models did not achieve remarkable results with limited datasets, and the paper calls for continued research and the development of robust detection frameworks to address these challenges effectively.
A study by [14] used a semi-supervised model that combines the outputs of two paths (supervised and unsupervised CNNs) to optimize detection performance by jointly minimizing the supervised and unsupervised losses, allowing it to learn effectively from limited labeled data while leveraging a larger amount of unlabeled data. The architecture consists of a supervised path (for labeled data) and an unsupervised path (for both labeled and unlabeled data), which allows the model to exploit the abundant information in unlabeled data to enhance learning. Overall, the experimental results confirmed the effectiveness of the proposed model in detecting fake news, showcasing its potential for real-world applications in combating fake news.
Another work by [15] focuses on precisely uncovering different categories of fake news based on content. The study examines the relationship between articles and terms, as well as the contextual relations between terms, to characterize the full contents. The presented model shows high accuracy in identifying fake news and classifying it into different categories. The proposed method was able to find most of the fake news categories with homogeneity values up to 80% overall, reduce the diversity of outliers to 2.5, and explore all categories with high homogeneity. In contrast, other algorithms were incapable of discovering all categories.
Additionally, the proposed method was able to produce latent groups with 65% coherence overall and more than 80% for most of the categories, whereas NMFSVD achieved up to 50% on average of the top 30 news for each factor. The paper also reports that the proposed CP/PARAFAC-based method outperforms other algorithms in both overall homogeneity and the capability of finding all categories. This approach of combining multiple models to improve accuracy can be applied to other types of data analysis problems where multiple models can be trained on the same data to improve the system’s overall performance.
A graph-based approach was introduced by [16] as an unsupervised detection model for detecting fake news in new and missing labeled datasets. The proposed model operates in three phases and shows improved accuracy close to 80% over state-of-the-art techniques. By leveraging the relationships and interactions between articles and users, this graph-based approach can identify patterns indicative of fake news, making it a powerful tool for detecting fake news in the absence of labeled data. This approach does not require labeled data and can be applied to a wide range of domains, making it more generalizable than supervised methods. However, as mentioned in this paper, it is also more complex and computationally intensive than supervised methods.
A study by [17] proposed a rumor detection method that treats detection time, accuracy, and stability as three training objectives and continuously adjusts and optimizes them, rather than using fixed values during the entire training process, thereby enhancing the method’s adaptability and universality. Multi-objective loss functions are used to optimize these three objectives jointly. To solve the hyperparameter-selection problem introduced by integrating multiple optimization objectives, a convex optimization method parameterizes them so that they adapt throughout model learning, avoiding the huge computational cost of enumeration. Moreover, the authors propose a sliding interval-based detection method that constructs different optimization loss objectives to intercept only the required data rather than the entire sequence; this reduces computational cost, since the method only needs to find the detection point and perform detection within the detection interval. Through continuous feature learning, the detection points are adaptively adjusted to make them more universal. The results show that the proposed method (IDMO-SA) achieves higher metrics than existing methods such as DSTS, CAMI, and GLAN, particularly on the Twitter dataset, where it attained an accuracy of 0.818, precision of 0.749, and recall of 0.896. The experimental results also demonstrate the effectiveness of the sliding interval detection method and the multi-objective optimization approach, confirming that the method can adaptively adjust detection points and improve overall performance in rumor detection tasks.
Moreover, [18] used a Bayesian network model to recognize the truthfulness of news and the trustworthiness of users, employing Gibbs sampling to measure news authenticity and user credibility. The Bayesian network model provides a robust framework for unsupervised fake news detection by leveraging user engagement data and probabilistic reasoning to infer the truthfulness of news items; the proposed model shows improved results on two real-world datasets compared with existing unsupervised benchmarks. Similarly, [19] developed a Bayesian approach for learning mixture models and validated the method through extensive simulations and comparisons with multinomial-based mixture models. In doing so, the authors introduced a new family of distributions that is more adjustable to high-dimensional data and can overcome the drawbacks of the multinomial assumption. The proposed Bayesian approach, based on mixtures of the Exponential-family approximation to the Dirichlet Compound Multinomial (EDCM) distribution, achieved superior results compared with other multinomial-based methods for fake news detection.

2.2. Arabic Fake News

Few studies in the literature tackle the issue of Arabic fake news, and they do so in different ways. For instance, [20] combines quantitative and qualitative techniques to determine how information-exchanging behaviors might be used to reduce the consequences of emergent disinformation. According to the findings, social media platforms are the most significant source of rumors, rapidly disseminating information in the community. It was discovered that WhatsApp users accounted for around 46% of the sources of rumors on online platforms, whereas Twitter (rebranded as X in July 2023) reported a 41% decline in rumors. Furthermore, the results show that pharmaceutical corporations provided the second-most common sort of fake news; nonetheless, a widespread type of fake news spreading worldwide during the pandemic related to biological warfare. According to the combined retrospective analysis in [20], engaging social media through various public-discourse techniques leads to more efficient public health responses.
In addition, the authors in [21] gathered around seven million Arabic news items about the coronavirus pandemic from January to August 2020, utilizing trending hashtags during the outbreak period. To extract a list of keywords connected to disinformation and fake news issues, they used two fact-checkers: the France-Press Agency and the Saudi Anti-Rumors Authority. A small corpus was retrieved from the gathered news and carefully annotated into bogus and authentic classes. The authors employed a set of features taken from post content to train a set of machine learning classifiers, and a technique for automatically detecting fake news from Arabic text was built using the manually annotated corpus as a baseline. The manually annotated dataset was classified with an F1-score of 87.8% using Logistic Regression (LR) as a classifier with n-gram-level Term Frequency-Inverse Document Frequency (TF-IDF) features, and the automatically annotated dataset was classified with a 93.3% F1-score using the same classifier with count-vector features.
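To make this kind of baseline concrete, the sketch below shows a minimal TF-IDF n-gram plus Logistic Regression pipeline in scikit-learn, in the spirit of the setup just described. It is not the cited study's actual code; the toy posts, labels, and parameter choices are invented for illustration.

```python
# Minimal sketch of a TF-IDF + Logistic Regression fake news baseline.
# The posts and labels below are invented toy data, not a real corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

posts = [
    "vaccine cures virus in one day",          # toy "fake" post
    "health ministry issues new guidance",     # toy "real" post
    "miracle herb kills virus instantly",      # toy "fake" post
    "who publishes updated case counts",       # toy "real" post
]
labels = [1, 0, 1, 0]                          # 1 = fake, 0 = real (toy labels)

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),       # word-level unigram+bigram TF-IDF
    LogisticRegression(max_iter=1000),
)
model.fit(posts, labels)
print(model.predict(["miracle cure kills virus in one day"]))
```

In practice, the vectorizer would be fit on a preprocessed Arabic corpus, and the n-gram range and regularization strength would be tuned on held-out data.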
ArCOV-19, an Arabic news dataset, was introduced in [22]. The dataset contains over 2.7 million news posts collected over one year (January 2020 to January 2021) and includes all propagation networks, including re-posts and conversation threads. The authors released this dataset to the public for all research fields related to text processing and social analysis. They extended this work in [23] to detect fake news at two levels: verifying free-text claims (claim-level verification) and verifying claims expressed in posts (post-level verification). The benchmarking results for post-level verification on the ArCOV19-Rumors dataset show that the MARBERT model achieved the highest overall performance, with an F1 score of 0.717 for fake posts and 0.762 for true posts. The results also suggest that the dataset is a good resource for training verification systems, as the distribution of true vs. false news is balanced.
Furthermore, “AraCOVID19-MFH”, a multi-label Arabic COVID-19 false news and hate speech detection dataset, was carefully annotated and released in [24]. This dataset has around 10,828 Arabic news with ten different classifications. The labels were created with some features of the fact-checking process in mind, including the post’s checkworthiness, positivity, negativity, and factuality. The annotated dataset was used to train and assess multiple classification models, and the findings were provided to prove their practical value. Though the dataset was created primarily to detect fake news, it can also be used for hate speech detection, opinion/news classification, dialect recognition, and various other applications. In our study, we will use the dataset provided by this study to train our dataset and compare the results.
On the other hand, ref. [25] designed an annotation schema and precise annotation instructions reflecting the perspective that malicious content encompasses not only fake news, rumors, and conspiracy theories but also the promotion of fake cures, panic, racism, xenophobia, and mistrust of authorities, among others. The authors also introduced a multilingual annotation platform and issued a rallying cry to the academic community.
Moreover, the authors in [27] thoroughly compared neural network and transformer-based language models for Arabic fake news detection, investigating their performance and the likely causes of the disparities in outcomes produced by the different methodologies. The results showed that transformer-based models outperform neural network-based models, improving on the best neural network-based F1 score of 0.83 (with QARiB as the best-performing transformer-based model) and yielding a 16% increase in accuracy over the best neural network-based solution.
In the most recent work on the Arabic dataset using deep learning, the work presented by [26] emphasized the effectiveness of deep learning methods, particularly CNNs, in recognizing fake news across various topics in Arabic news. The study demonstrated that the proposed deep learning model, which utilizes both the tweets’ content and users’ social context, achieved a superior performance with an F1-score of 0.956, indicating a high level of accuracy in detecting fake news across various topics in Arabic tweets. The research highlighted the effectiveness of contextual embeddings (such as MARBERT) compared to classic word embeddings; this approach allows the model to better capture the meaning of words based on their context, which is crucial for understanding the nuances of the Arabic language. The authors constructed a comprehensive, manually labeled dataset containing fake and real news tweets in Arabic. This dataset is significant for training and evaluating the model, addressing the challenge of limited labeled data in previous studies. The study concluded that incorporating both the textual features of the news and the characteristics of the news publisher (social context) leads to more accurate detection of news credibility, a notable contribution to fake news detection. The results indicate that the MARBERT with CNN model outperforms others, achieving an accuracy and F1-score of 0.956. The authors suggest that while their model shows promising results, further research is still needed to enhance the generalization capabilities of fake news detection systems, particularly in the context of Arabic. Overall, the research contributes to understanding fake news detection in Arabic and provides a foundation for future studies in this area.
Furthermore, ref. [28] presents an innovative ensemble stacking model designed for detecting rumors in Arabic news, a crucial step in combating fake news in Arabic-speaking communities. The authors highlight the importance of addressing the unique challenges of rumor detection in Arabic, utilizing a combination of machine learning and deep learning techniques. The accuracy of the Random Forest and Gaussian Naive Bayesian models significantly outperforms the Decision Tree model. According to the study, the Random Forest achieved an accuracy of 86% and an F1 score of 77%. Gaussian Naive Bayes also reached an accuracy of 86% with an F1 score of 74%.
In contrast, the Decision Tree model had a lower accuracy of 83% and an F1 score of 68%. This indicates that both the Random Forest and Gaussian Naive Bayesian models perform better in terms of accuracy and F1 score than the Decision Tree model, highlighting the effectiveness of ensemble methods like Random Forest in handling complex classification tasks.

3. The Proposed Online Learning Framework

In this section, we give the details of the proposed online framework for the real-time clustering of fake news. For each proposed model, we first discuss the properties of the fitting distribution; then, we propose the online clustering model based on the mixture of the discussed distributions. Finally, we give the complete learning framework.
Please note that this study does not aim to introduce fundamentally new model components; rather, it proposes a unified unsupervised and online-adaptive framework through the integration and adaptation of existing techniques to address dynamic and label-scarce environments.

3.1. The Considered Distributions

3.1.1. The Generalized Dirichlet Multinomial and Its Exponential Approximation

The generalized Dirichlet distribution (GD) has proven to be a more suitable prior for naive Bayesian classifiers, even though the Dirichlet distribution is frequently employed as a prior for the multinomial. This is because the generalized Dirichlet overcomes the Dirichlet's negative-correlation and equal-confidence constraints, among others. Additionally, the GD distribution offers greater flexibility than the Dirichlet distribution due to its independence structure: each element of the random vector can be sampled from an independent beta distribution. The generalized Dirichlet multinomial (GDM), composed of the generalized Dirichlet and the multinomial in the same manner as the Dirichlet compound multinomial model (DCM) [29], was first introduced by Bouguila [9]. Define $\mathcal{X} = (x_1, \ldots, x_{D+1})$ as a vector of counts representing a COVID-19 post, where $x_d$ is the frequency of word $d$. The following formula gives the probability density function of generating the vector $\mathcal{X}$ under the GDM distribution with parameters $\theta = \{\alpha_1, \beta_1, \ldots, \alpha_D, \beta_D\}$ [9]:
$$\mathrm{GDM}(\mathcal{X}\mid\theta) = \frac{\Gamma(n+1)}{\prod_{d=1}^{D+1}\Gamma(x_d+1)}\,\prod_{d=1}^{D}\frac{\Gamma(\alpha_d+\beta_d)}{\Gamma(\alpha_d)\,\Gamma(\beta_d)}\,\frac{\Gamma(\alpha_d')\,\Gamma(\beta_d')}{\Gamma(\alpha_d'+\beta_d')}$$
where $\Gamma(\cdot)$ is the gamma function, $n = \sum_{d=1}^{D+1} x_d$, $\alpha_d' = \alpha_d + x_d$, and $\beta_d' = \beta_d + x_{d+1} + \cdots + x_{D+1}$, for $d = 1, \ldots, D$. Just like the DCM, the GDM does not belong to the exponential family.
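As an illustration, the GDM density above can be evaluated numerically in the log domain for stability. The Python sketch below (using SciPy's `gammaln`; the function name and the example parameter values are our own) follows the definitions of $n$, $\alpha_d'$, and $\beta_d'$ given here.

```python
import numpy as np
from scipy.special import gammaln

def gdm_logpmf(x, alpha, beta):
    """Log-density of the generalized Dirichlet multinomial (GDM).

    x     : integer count vector of length D+1 (bag-of-words for one post)
    alpha : parameters (alpha_1, ..., alpha_D)
    beta  : parameters (beta_1, ..., beta_D)
    Uses log-gamma throughout to avoid overflow on long documents.
    """
    x = np.asarray(x, dtype=float)
    D = len(alpha)
    n = x.sum()
    tail = np.cumsum(x[::-1])[::-1]           # tail[d] = x_d + ... + x_{D+1}
    alpha_p = alpha + x[:D]                   # alpha'_d = alpha_d + x_d
    beta_p = beta + tail[1:D + 1]             # beta'_d = beta_d + x_{d+1} + ... + x_{D+1}
    log_coef = gammaln(n + 1) - gammaln(x + 1).sum()          # multinomial coefficient
    log_prior = (gammaln(alpha + beta) - gammaln(alpha) - gammaln(beta)).sum()
    log_post = (gammaln(alpha_p) + gammaln(beta_p) - gammaln(alpha_p + beta_p)).sum()
    return log_coef + log_prior + log_post

# Example: a 4-word vocabulary (D = 3) with invented parameters.
print(gdm_logpmf([2, 0, 1, 3], np.array([1.0, 2.0, 0.5]), np.array([0.8, 1.2, 2.0])))
```

A quick sanity check: for $D = 1$ the GDM reduces to the beta-binomial, so the probabilities over all count vectors with fixed $n$ sum to one.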
In fact, in a variety of applications, the generalized Dirichlet multinomial (GDM) has proven to be a successful substitute for DCM that provides superb clustering accuracy [9,30,31]. However, it has issues comparable to those of DCM, such as the slow parameter estimation process. Thus, an efficient approximation to the GDM was proposed in [11] to streamline the parameter estimation procedure and decrease computation in high-dimensional spaces.
The approximation, called EGDM, belongs to the exponential family of distributions and has been shown to successfully and accurately reflect the burstiness phenomenon while being much faster and more computationally efficient than the comparable GDM. By using an appropriate transformation and reparameterization while considering specific characteristics of the logarithm and gamma functions, it was possible to reduce GDM to a member of the exponential family. The exponential family version of the approximation, EGDM, is as follows:
$$\mathrm{EGDM}(\mathcal{X}) = \frac{n!}{\prod_{d:x_d\geq 1} x_d}\,\prod_{d:x_d\geq 1}\frac{\Gamma(z_d)}{\Gamma(x_d+z_d)}\,\exp\left[\sum_{d=1}^{D}\mathbb{I}(x_d\geq 1)\,\log\frac{\alpha_d\,\beta_d}{\alpha_d+\beta_d}\right]$$
where $z_d = x_{d+1} + \cdots + x_D$ is the cumulative (tail) sum, and $\mathbb{I}(x_d \geq 1)$ is an indicator of whether word $d$ appears at least once in the vector $\mathcal{X}$. The reader may refer to the original work [11] for more details about the transition from GDM to EGDM.

3.1.2. The Multinomial Beta-Liouville and Its Exponential Approximation

The second kind of Liouville family includes the Dirichlet distribution as a particular case when all variables in the Liouville random vector have the same normalized variance and the density-generator variate follows a beta distribution. Selecting the beta distribution as the generating density yields what is known as the Beta-Liouville distribution. Like the Dirichlet distribution, the Beta-Liouville is a conjugate prior to the multinomial distribution and overcomes the Dirichlet distribution's fundamental limitations. Additionally, its two extra parameters can be used to adjust the distribution's spread, making it more useful and enhancing its modeling capability. The Multinomial Beta-Liouville (MBL) is a flexible joint distribution obtained by taking the Beta-Liouville as a prior for the multinomial. The probability density function of the MBL distribution with parameters $\xi = (\alpha_1, \ldots, \alpha_D, \alpha, \beta)$, proposed in [10], is given by
$$\mathrm{MBL}(\mathcal{X}\mid\xi) = \frac{\Gamma\!\big(\big(\sum_{d=1}^{D+1} x_d\big)+1\big)}{\prod_{d=1}^{D+1}\Gamma(x_d+1)}\times\frac{\Gamma\!\big(\sum_{d=1}^{D}\alpha_d\big)\,\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)\,\prod_{d=1}^{D}\Gamma(\alpha_d)}\times\frac{\Gamma(\alpha')\,\Gamma(\beta')\,\prod_{d=1}^{D}\Gamma(\alpha_d')}{\Gamma\!\big(\sum_{d=1}^{D}\alpha_d'\big)\,\Gamma(\alpha'+\beta')}$$
where $\alpha_d' = \alpha_d + x_d$, $\alpha' = \alpha + \sum_{d=1}^{D} x_d$, and $\beta' = \beta + x_{D+1}$.
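The MBL density can likewise be computed in the log domain. The sketch below (function name and example values are our own) implements the three factors of the density using the posterior updates $\alpha_d'$, $\alpha'$, and $\beta'$ defined just above.

```python
import numpy as np
from scipy.special import gammaln

def mbl_logpmf(x, alpha_vec, a, b):
    """Log-density of the Multinomial Beta-Liouville (MBL).

    x         : count vector of length D+1
    alpha_vec : (alpha_1, ..., alpha_D)
    a, b      : the Beta-Liouville shape parameters (alpha, beta in the text)
    Posterior updates: alpha'_d = alpha_d + x_d, alpha' = a + sum_{d<=D} x_d,
    beta' = b + x_{D+1}.
    """
    x = np.asarray(x, dtype=float)
    D = len(alpha_vec)
    n = x.sum()
    s = alpha_vec.sum()
    alpha_p = alpha_vec + x[:D]
    a_p = a + x[:D].sum()
    b_p = b + x[D]
    log_coef = gammaln(n + 1) - gammaln(x + 1).sum()          # multinomial coefficient
    log_prior = (gammaln(s) + gammaln(a + b)
                 - gammaln(a) - gammaln(b) - gammaln(alpha_vec).sum())
    log_post = (gammaln(alpha_p).sum() + gammaln(a_p) + gammaln(b_p)
                - gammaln(alpha_p.sum()) - gammaln(a_p + b_p))
    return log_coef + log_prior + log_post

# Example: a 4-word vocabulary (D = 3) with invented parameters.
print(mbl_logpmf([2, 0, 1, 3], np.array([1.0, 2.0, 0.5]), 2.0, 3.0))
```

As a sanity check, for $D = 1$ the MBL collapses to a beta-binomial (the single $\alpha_1$ cancels), so the probabilities over all count splits of a fixed $n$ sum to one.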
Indeed, compared to other models of the same composition, the MBL mixture model has been demonstrated to attain excellent clustering accuracy. However, MBL is inefficient in high-dimensional spaces, where many parameters must be evaluated, because it does not belong to the exponential family. A previous work combined the benefits of exponential approximation, to reduce computation time and improve flexibility, with the efficiency of MBL for modeling sparse high-dimensional count data, in an approximation called EMBL [12]. To enhance computational performance, only non-zero word counts $x_d$ are taken into account in this approximation. If $\alpha_d \ll 1$, which was found to hold for text collections represented as bag-of-words (BoW), EMBL can be produced as an approximation to MBL, formulated as follows:
$$\mathrm{EMBL}(\mathcal{X}) = \frac{n!}{\prod_{d:x_d\geq 1} x_d}\,\frac{\Gamma(s)\,\Gamma(\alpha)\,\Gamma(\beta)\,\alpha}{\Gamma(s+n)\,\Gamma(\alpha+\beta)}\times\exp\left[\sum_{d=1}^{D}\mathbb{I}(x_d\geq 1)\,\log\alpha_d\right]$$
where $n = \sum_{d=1}^{D+1} x_d$ and $s = \sum_{d=1}^{D} \alpha_d$. The sufficient statistic $\mathbb{I}(x_d \geq 1)$ indicates whether word $d$ appears at least once in the vector $\mathcal{X}$. For more details about deriving EMBL from MBL, we refer the reader to the original work [12].

3.2. The Proposed Online Mixture Model

In this study, fake news detection is carried out in two steps using an unsupervised learning technique. In the first phase, known as the offline phase, the clustering model is created using a training set. In the second phase, the model is updated promptly online as new data are received. In our tests, assuming the full dataset arrives at time $t = 0$, we randomly select 80% of the observations (represented as bag-of-words vectors) to build and configure the model (the offline phase). Then, one new observation is inserted at each subsequent time step $t + 1$ until the remaining 20% (the online phase) has been processed.
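The 80/20 offline/online protocol can be sketched as follows. This is an illustrative outline only: the toy corpus, the function name, and the commented-out `fit_mixture`/`update` calls are hypothetical placeholders for the mixture-fitting and incremental-update steps described in the following subsections.

```python
import numpy as np

rng = np.random.default_rng(0)

def split_offline_online(X, offline_frac=0.8):
    """Randomly split bag-of-words vectors into an offline training set
    (used to build the mixture model) and an online stream that is fed to
    the model one post per time step, mirroring the 80/20 protocol above."""
    idx = rng.permutation(len(X))
    cut = int(offline_frac * len(X))
    return X[idx[:cut]], X[idx[cut:]]

# Toy corpus: 100 posts over a 50-word vocabulary.
X = rng.integers(0, 5, size=(100, 50))
offline, stream = split_offline_online(X)

# model = fit_mixture(offline)            # offline phase (hypothetical fitter)
# for t, x_new in enumerate(stream):      # online phase: one post per step t+1
#     model.update(x_new)                 # incremental parameter update
```

The split is done once at $t = 0$; afterwards the model only ever sees one new post at a time, which is what makes the framework suitable for streaming data.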

3.2.1. Phase 1: Mixture Model for Offline Clustering

For unsupervised learning of multivariate data, finite mixtures provide a versatile and effective probabilistic model-based technique. In mixture modeling, the data are assumed to arise from several subpopulations. Let X = { X_1, …, X_N } be an observed dataset with N data instances, where each X_i = ( x_{i1}, …, x_{iD} ) is drawn from a superposition of M densities of the following form:
$$ P(\mathcal{X} \mid \Theta) = \prod_{i=1}^{N} \sum_{j=1}^{M} \pi_j\, P(\mathbf{X}_i \mid \theta_j) $$
where π_j ( 0 < π_j < 1 and ∑_{j=1}^{M} π_j = 1 ) are the mixing proportions. Each P(X | θ_j) represents mixture component j and has its own parameters θ_j. A latent variable Z_i corresponds to each observed data point X_i. The set Z = { Z_1, …, Z_N } contains the missing group-indicator vectors for the data elements: each Z_i satisfies z_ij ∈ { 0, 1 }, with exactly one element equal to 1 (indicating the cluster of X_i) and every other element equal to 0. The pair ( X, Z ) is referred to as the complete data, where Θ is the collection of all parameters and latent variables. Given a mixture model with M components, the complete-data log-likelihood is given by
$$ \mathcal{L}(\mathcal{X}, \mathcal{Z} \mid \Theta) = \sum_{i=1}^{N} \sum_{j=1}^{M} z_{ij} \left[ \log \pi_j + \log P(\mathbf{X}_i \mid \theta_j) \right] $$
In this work, we consider mixture models of the above-mentioned distributions (i.e., MGD, EMGD, MBL, and EMBL). To simplify the calculation of the model parameters, the logarithm of the likelihood function is maximized. The suggested estimation method for identifying the set of parameters is maximum likelihood estimation (MLE), achieved by maximizing the total likelihood using the expectation-maximization (EM) algorithm [32]. This iterative algorithm repeats two steps until convergence. In the expectation step (E-step), the posterior probability (i.e., the likelihood that a data sample X_i belongs to a cluster j) is first calculated. The maximization step (M-step) then maximizes the conditional expectation of the complete-data log-likelihood.
If the convergence condition is satisfied, the best parameter values are returned. Finally, each post (feature vector) is assigned to the cluster with the highest posterior probability. The original works [9,10,11,12] contain the specifics of the algorithms for updating the MGD, EMGD, MBL, and EMBL mixture model parameters.
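The EM loop described above can be sketched generically. The snippet below uses a plain multinomial mixture as a simple stand-in, since the exact M-step updates for the MGD, EMGD, MBL, and EMBL mixtures are given in [9,10,11,12]; the function name and smoothing constant are illustrative assumptions.

```python
import math, random

def em_multinomial_mixture(X, M, n_iter=50, seed=0):
    """Generic EM skeleton for clustering count vectors X (list of lists)
    with an M-component multinomial mixture -- a simple stand-in; the
    paper's MGD/EMGD/MBL/EMBL updates would replace the M-step below."""
    rng = random.Random(seed)
    N, D = len(X), len(X[0])
    pi = [1.0 / M] * M                     # mixing proportions
    theta = []                             # per-component word probabilities
    for _ in range(M):
        raw = [rng.random() + 0.1 for _ in range(D)]
        s = sum(raw)
        theta.append([r / s for r in raw])
    z = [[0.0] * M for _ in range(N)]
    for _ in range(n_iter):
        # E-step: responsibilities z_ij proportional to pi_j * P(X_i | theta_j)
        for i in range(N):
            logs = [math.log(pi[j])
                    + sum(x * math.log(t) for x, t in zip(X[i], theta[j]))
                    for j in range(M)]
            m = max(logs)
            w = [math.exp(l - m) for l in logs]   # log-sum-exp for stability
            tot = sum(w)
            z[i] = [v / tot for v in w]
        # M-step: maximize the expected complete-data log-likelihood
        for j in range(M):
            pi[j] = sum(z[i][j] for i in range(N)) / N
            counts = [sum(z[i][j] * X[i][d] for i in range(N)) + 1e-9
                      for d in range(D)]
            c = sum(counts)
            theta[j] = [v / c for v in counts]
    # Hard assignment: each post goes to its highest-posterior cluster
    labels = [max(range(M), key=lambda j: z[i][j]) for i in range(N)]
    return pi, theta, labels
```

On well-separated count data, the loop converges in a few iterations and the final labels realize the highest-posterior assignment rule described above.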
It is worth mentioning that the computational cost of each update step depends primarily on the number of features D and the number of components M. Specifically, each update requires computing distances or gradients with respect to all cluster representatives, resulting in a per-sample complexity of O ( M × D ) . This linear dependence on both M and D makes the proposed method scalable and suitable for high-dimensional and streaming data scenarios.

3.2.2. Phase 2: Real-Time Detection of Fake News

As new data are obtained, online clustering systems must update the model without losing its flexibility. Suppose that we have a dataset X = { X_1, …, X_N } comprising the N posts available at time t, represented by an M-component MGD, EMGD, MBL, or EMBL mixture with parameters Θ(t). A new post, X_{N+1}, is then presented to the model at time t + 1. To detect whether the new post is fake, the model should be updated incrementally, taking the new data into account. In other words, the various mixture model parameters must be revised whenever new data arrive. We utilize the stochastic gradient ascent parameter update suggested in [33]. Naturally, we must maintain the constraints ( 0 < π_j(t) ≤ 1 ) and ∑_{j=1}^{M} π_j(t) = 1. In this regard, new real-valued variables ω_1, …, ω_{M−1} are introduced via the logit transform
$$ \omega_j = \log\frac{\pi_j}{\pi_M}, \quad j = 1, \ldots, M-1 $$
to guarantee that the mixing proportions π_j remain a valid probability vector. The mixing proportions can then be obtained as follows:
$$ \pi_j(t+1) = \frac{\exp(\omega_j(t+1))}{1 + \sum_{k=1}^{M-1} \exp(\omega_k(t+1))}, \quad j = 1, \ldots, M-1 $$
$$ \pi_M(t+1) = \frac{1}{1 + \sum_{k=1}^{M-1} \exp(\omega_k(t+1))} $$
such that,
$$ \omega_j(t+1) = \omega_j(t) + \delta_N \left( z_{(N+1)j} - \omega_j(t) \right), $$
where δ_N is a decreasing sequence of positive numbers chosen as δ_N = 1/(N+1) [33,34], and z_{(N+1)j} is the posterior probability of the newly arriving vector given the set of parameters Θ(t). Additionally, the model parameters θ_j for MGD, EMGD, MBL, and EMBL are updated in accordance with [33]:
$$ \theta_j(t+1) = \theta_j(t) + \frac{z_{(N+1)j}}{N+1}\, \frac{\partial \log P(\mathbf{X}_{N+1}, \mathbf{Z}_{N+1} \mid \theta_j(t))}{\partial \theta_j} $$
The reader may refer to the previously mentioned works [9,10,11,12] for the first-order derivative of the log-likelihood with respect to each parameter of the considered models, as used in Equation (9). For completeness, we summarize the main steps leading to Equation (9), including the explicit form of the gradient terms, in Appendix A. The procedure for estimating and updating the model parameters is summarized in Algorithm 1. After the parameters have been updated, the new post is assigned to the class that maximizes its posterior, and the model’s performance is then evaluated.
Algorithm 1 The proposed online clustering framework.
 1: INPUT: At time t, a D-dimensional dataset with N news items X(t) = { X_1, …, X_N }.
 2: Estimate Θ(t) = { θ_1(t), …, θ_M(t) } as shown in [9,10,11,12].
 3: At t + 1, insert a new post as a BoW feature vector X_{N+1}.
 4: for j = 1 to M do
 5:     Compute the posterior probability z_{(N+1)j}.
 6:     if z_{(N+1)j} is the highest then
 7:         Assign X_{N+1} to cluster j.
 8:     end if
 9:     Update the weights π_j(t+1) using Equations (7) and (8).
10:     Update the parameters θ_j(t+1) using Equation (9).
11: end for
The stability of the online updating process is influenced by the incremental nature of the updates and the choice of step size. By employing controlled update magnitudes, the algorithm avoids abrupt changes in model parameters, thereby ensuring smooth adaptation over time. Furthermore, the use of repeated experimental runs demonstrates low variability in performance, providing empirical evidence of the robustness and stability of the proposed approach.
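A minimal sketch of one online update of the mixing proportions, following the logit reparameterization and the step size δ_N = 1/(N+1) described above, might look as follows. The model-specific gradient step for θ_j is omitted, and the function name is illustrative.

```python
import math

def online_mixing_update(omega, z_new, N):
    """One online update of the mixing proportions when post N+1 arrives.
    omega: current logit parameters (length M-1); z_new: posterior
    responsibilities of the new post (length M). Returns (omega, pi)."""
    delta = 1.0 / (N + 1)                          # step size delta_N
    # Gradient-style step in logit space (the omega update above)
    omega = [w + delta * (z - w) for w, z in zip(omega, z_new[:-1])]
    # Recover the mixing proportions via the inverse logit transform
    expw = [math.exp(w) for w in omega]
    denom = 1.0 + sum(expw)
    pi = [e / denom for e in expw] + [1.0 / denom]
    return omega, pi
```

The inverse transform guarantees that the updated proportions are positive and sum to one regardless of the value of the step, which is what makes the logit parameterization attractive for unconstrained online updates.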

4. Experimental Results and Discussion

The significance of online learning in practical situations where data are produced rapidly is evidenced by numerous applications across several fields. Real-time surveillance, encompassing traffic monitoring and the prevention of potential dangers such as terrorist attacks, environmental risks, and disease outbreaks, is one setting in which online learning is particularly advantageous. A major threat to global public health arose from the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), which causes coronavirus disease 2019 (COVID-19). COVID-19 surfaced in late 2019 and resulted in a persistent pandemic affecting over 200 nations globally. Researchers, policymakers, and other stakeholders promptly launched machine learning projects to address significant concerns about the outbreak and mitigate its impact on society as much as possible.
In this section, we provide the results of our experiments, which show the effectiveness of online learning techniques in enhancing model performance. The problem is addressed in the context of detecting fake news in COVID-19 health news articles as a binary classification task: each post is hypothesized to be fake or real (not fake). Since fake news is a broad umbrella covering numerous types of false content, such as misinformation, disinformation, and propaganda, categorizing texts simply as real or fake gives an overall perspective by capturing the underlying contrast between real and fake content. This methodology is consistent with current research in fake news detection and ensures robustness in recognizing deceptive content, regardless of its intention [35]. It is worth mentioning that the framework is designed to be domain-agnostic, as it relies on general data processing and machine learning components that can be adapted to other thematic datasets with minimal modification.
We used the ArCOV19-Rumors benchmark subset released as part of the ArCOV-19 project (publicly available at https://gitlab.com/bigirqu/ArCOV-19/ (accessed on 13 September 2024)), compiled by Haouari et al. [22]; it contains 3553 labeled Arabic posts with a dictionary size of around 11,372 words and was used to enable quantitative evaluation. In this study, we differentiate between the complete ArCOV-19 collection and the reference subset utilized for assessment. The ArCOV-19 project is widely known as a large Arabic COVID-19 corpus, but its total size may change with each new dataset release or repository update (for example, the literature may say “over 2.7 million” items, while the dataset link may show a larger number of tweets because of updates and new collections).
To conduct quantitative experimentation and facilitate equitable comparison with previous research, we utilize the ArCOV19-Rumors benchmark subset linked to ArCOV-19. This subset comprises 3553 Arabic posts/news items labeled with rumor-related tags and is widely employed in rumor and fake news verification contexts. The full ArCOV-19 collection is large, but not all of it carries labels suitable for misinformation detection. Using the labeled benchmark subset makes it possible to perform reproducible evaluations and to report performance across methods in a consistent way.
Fake news detection is a challenging task in Arabic, mainly due to the characteristics of the language, such as its rich morphology, the greater ambiguity of many words, and the scarcity of Arabic human-annotated resources. The task is further hampered by the limited available resources and the small number of large-scale datasets. This work addresses this deficiency, which is exacerbated by the difficulty of detecting health-related fake news.
Before any experiments are conducted, the input data has to be preprocessed. Details of the preprocessing steps are outlined below, and the corresponding implementation is available at https://github.com/NuhaEZ/Arabic_text (accessed on 29 March 2025).
  • Normalization: The first phase is to normalize Arabic characters, where the “إ”, “أ”, and “آ” characters are replaced with “ا”, the “ؤ” character is replaced with “و”, and “ى” or “ئ” are replaced by “ي”, while “ة” is replaced by “ه”. Moreover, Hindi numerals are replaced by Arabic numerals; for example, “٠” and “١” are replaced with “0” and “1”, respectively.
  • Diacritics Removal: In this step, Arabic diacritics are removed.
  • Punctuation Removal: Removing all punctuation marks.
  • Noise Removal: In this step, hyperlinks, mentions starting with @, emojis, and retweet signs (RT) are removed.
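The preprocessing steps above can be sketched as follows. This is a simplified illustration of the pipeline, not the implementation from the linked repository, and the specific regular expressions are assumptions.

```python
import re

# Character-level normalization data (see the Normalization step above)
ALEF_FORMS = "إأآ"                                  # hamza-bearing alef forms
HINDI_TO_ARABIC = str.maketrans("٠١٢٣٤٥٦٧٨٩", "0123456789")
DIACRITICS = re.compile(r"[\u064B-\u0652]")          # fathatan .. sukun
NOISE = re.compile(r"(https?://\S+|@\w+|\bRT\b)")    # links, mentions, retweets

def preprocess_arabic(text):
    """Apply the normalization, diacritics/punctuation removal, and noise
    removal steps described above to one post (illustrative sketch)."""
    text = NOISE.sub(" ", text)
    for ch in ALEF_FORMS:                            # إ / أ / آ -> ا
        text = text.replace(ch, "ا")
    text = text.replace("ؤ", "و")                    # hamza on waw -> waw
    text = text.replace("ى", "ي").replace("ئ", "ي")  # alef maqsura, hamza on ya -> ya
    text = text.replace("ة", "ه")                    # ta marbuta -> ha
    text = text.translate(HINDI_TO_ARABIC)           # Hindi -> Arabic numerals
    text = DIACRITICS.sub("", text)                  # strip diacritics
    text = re.sub(r"[^\w\s]", " ", text)             # drop punctuation/symbols
    return re.sub(r"\s+", " ", text).strip()
```

Note that Python’s Unicode-aware `\w` matches Arabic letters, so the punctuation filter keeps the normalized text intact while discarding symbols and most emoji.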
Then, we evaluate different clustering approaches in offline clustering, where we use the complete dataset directly. All the experiments were conducted using optimized MATLAB R2024a code on a PC with an Intel(R) Core(TM) i7-4790T processor @ 2.70 GHz, 8 GB of main memory, and the 64-bit Windows 11 operating system. All models, including the proposed approach and baseline methods, are evaluated under a unified experimental protocol to ensure fair and consistent comparison. Specifically, experiments are conducted using the same dataset splits, where the data are partitioned into training and evaluation sets following a fixed and reproducible scheme. Identical preprocessing steps are applied across all methods, including data cleaning, normalization, and feature extraction where applicable.
For baseline models, we adhere as closely as possible to the configurations reported in their original studies. When such details are not explicitly available, we adopt commonly used or recommended parameter settings from the literature. At the same time, we ensure that all methods are evaluated using the same performance metrics and evaluation procedures.
To account for variability, experiments involving stochastic components are repeated multiple times, and the average performance is reported. This unified setup ensures that the reported results provide a fair, transparent, and reproducible comparison between the proposed framework and existing approaches.
We utilize various metrics to evaluate performance, including overall accuracy, precision, recall, and F-score, as each is crucial for comprehending our model’s efficacy. The performance indicators for the considered dataset, obtained by applying various clustering techniques, are presented in Table 1. No train/test split is used due to the unsupervised nature of the task. Each experiment is repeated 10 times with different random initializations. The reported results correspond to the average performance, and the observed low variance across runs indicates the stability of the proposed method. That is, the clustering algorithms may produce clusters with different label distributions depending on initialization and data structure. To mitigate this issue, we perform multiple runs of the clustering algorithm and report the average performance across runs. This helps ensure that the results are not dependent on a particular cluster assignment.
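Because clusters produced by unsupervised methods carry no intrinsic labels, accuracy is typically computed by first mapping each cluster to its majority class and then scoring. A minimal sketch of such an evaluation, with an illustrative function name, is:

```python
from collections import Counter

def clustering_accuracy(true_labels, cluster_ids):
    """Map each cluster to its majority true label (fake/real), then score
    overall accuracy -- a standard evaluation for unsupervised clustering."""
    by_cluster = {}
    for t, c in zip(true_labels, cluster_ids):
        by_cluster.setdefault(c, []).append(t)
    # Majority label per cluster
    mapping = {c: Counter(ts).most_common(1)[0][0]
               for c, ts in by_cluster.items()}
    correct = sum(1 for t, c in zip(true_labels, cluster_ids)
                  if mapping[c] == t)
    return correct / len(true_labels)
```

Averaging this score over several runs with different random initializations, as done in the experiments, reduces the dependence on any single cluster assignment.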
As Table 1 shows, the accuracy is improved considerably using the GDM and MBL models. Furthermore, the models based on exponential approximation outperform their corresponding base models, with an overall average clustering accuracy of 74.94% and 89.44% for EGDM and EMBL, respectively. This result confirms previously published results in [11,12]. The performance of these models has a twofold justification: the capacity of their parameters to encapsulate the overdispersion and burstiness phenomena when fitting high-dimensional count data, along with the efficiency of the exponential approximation principle in handling high dimensionality.
Then, to assess the proposed online approach, we select at random 60% of the observations, represented as Bag-of-Words (BoW) vectors, to initialize and construct the model (the offline phase), assuming that X(t) = { X_1, …, X_N } arrives all at time t = 0, and we cluster using the mixture models as detailed in [9,10,11,12]. Afterwards, we introduce a new post at each time t + 1 until completion (the online phase), while concurrently updating the various mixture model parameters by stochastic gradient ascent as outlined in Section 3.2.2. Figure 1 presents the average accuracy of online learning for the proposed models in comparison to offline learning. The results indicate that our proposed online frameworks, utilizing mixtures of EGDM and EMBL, deliver enhanced accuracy, significantly surpassing their corresponding models. The clustering accuracy of 92.51% for the online EMBL, compared to 89.44% for the offline learning approach, demonstrates the promise of the online clustering framework. Likewise, online learning for EGDM achieves an accuracy of 76.65% compared to 74.94% for offline clustering. The justification for the enhanced performance is that, in online learning, the parameters are updated with the newly arriving data without losing model flexibility. As a result, the models are continuously up to date, and expenses for data storage and maintenance are considerably minimized.
Adopting online learning with EGDM and EMBL mixture models enhances the model’s efficacy in managing the daily and exponentially growing volume of data. The efficacy of the online learning algorithm is illustrated in Figure 2 and Figure 3. These figures show the model’s accuracy as batches containing the remaining 40% of the dataset are inserted as new data vectors. The model’s accuracy does not improve substantially with the availability of fresh data; thus, the trade-off between performance and resource allocation is contingent upon the application’s requirements and the user’s preferences.
Figure 2 and Figure 3 illustrate the accuracy (%) of the GDM/EGDM and MBL/EMBL models with each insertion of new feature vectors during the method’s execution. With each insertion, the model is updated in accordance with the proposed online algorithm, and the overall accuracy is re-evaluated. EGDM and EMBL significantly outperform their corresponding models, namely, GDM and MBL. The given results indicate that the minimal clustering accuracy for GDM is approximately 63%, whereas EGDM achieves around 70%.
Moreover, as additional posts are incorporated, the performance of both models improves to reach up to 68.04% and 76.65% for GDM and EGDM, respectively. This outcome validates that the online clustering method facilitates lifelong learning, as the models are enhanced with the incorporation of new data.
Table 2 illustrates the average accuracy, the average micro F1, and run time results averaged over 10 runs for the proposed online frameworks. The presented results suggest that the performance of the proposed online framework, based on the mixture of EGDM and EMBL, provides promising results in terms of time and accuracy and significantly outperforms the corresponding GDM and MBL models. The best performing model is EMBL, measured by the accuracy of around 92.51% with a run time of 0.0049 per post, offering a fast and acceptably accurate online clustering framework.
Furthermore, as a comparison with models from the literature, Table 3 benchmarks the proposed model against state-of-the-art transformer models (e.g., MARBERT and AraBERT) to contextualize its performance. Several approaches listed in Table 3 achieve higher scores, especially those using fully supervised deep learning or transformer architectures. These architectures usually do better when they have large amounts of labeled training data and offline retraining.
In contrast, the proposed approach is an unsupervised, online system for monitoring false information in real time. Its main goal is to adapt quickly to new stories and changing content streams with as little reliance on expensive manual annotation as possible. The proposed online EMBL method achieves 92.51% accuracy on the ArCOV19-Rumors benchmark, despite operating in a more challenging setting. This shows that it is competitive while also providing additional benefits for real-world crisis situations (such as early-warning detection, triage, and continuous streaming analysis).
Accordingly, the scientific novelty of this work lies not only in absolute metric maximization but in providing a scalable and label-efficient online misinformation detection methodology that can function under dynamic conditions where supervised retraining and large labeled datasets are not always feasible. In other words, this study does not aim to achieve state-of-the-art performance relative to supervised transformer-based models; instead, it focuses on enabling unsupervised learning with online adaptability in dynamic and label-scarce environments.

5. Textual Analysis of Real and Fake Posts

According to [36], deceiving another person requires the manipulation of language and the careful construction of a story that appears to be true. Although liars have some control over the content of their stories, the language and style used to convey the story or to influence the narrative can indicate their state of mind, and their writings may provide vital cues to the linguistic manifestations of that state. Following this notion, this study examines the effect this may have on Arabic fake posts, which are considered a type of deception. Additionally, by examining the linguistic features and patterns in Arabic fake posts, we can gain vital insight into, and a deeper understanding of, how fake post peddlers use language to manipulate and mislead others. This analysis can therefore help us develop strategies to detect and counteract deceptive communication on online platforms.
In this section, we adhere to the technique used by Himdi et al. [37] to examine the linguistic nature of fake posts. We thoroughly analyze the ArCOV-19 dataset, which includes a total of 3553 Arabic posts about COVID-19. The initial step of our analysis identifies 1831 real and 1722 fake posts about COVID-19.
While the structure of fake news is evolving, the results of this analysis are tied to the dataset utilized and do not necessarily coincide with, or account for, the most recent texts or trends regarding fake news. Despite this drawback, the dataset employed provides real-world posts and is considered reliable and appropriate for linguistic analyses and academic discourse.

5.1. Textual Features Categories

The experiment focuses on the textual content of the posts. In that view, we determine if there are any statistically significant differences in word usage between ‘real’ and ‘fake’ posts. The categories analyzed are as follows.
Part of speech: These reflect the words that make up the sentences, including verbs, nouns, adjectives, adverbs, prepositions, determiners, particles, conjunctions, and pronouns. According to Al-Shahrani et al. [38], the frequency distribution of part-of-speech (POS) identifiers in a text is frequently genre dependent [39,40]. In our work, we examine whether a correlation exists between real and fake posts by constructing features for each post based on the frequency of each POS tag. These characteristics also serve as a benchmark against which to evaluate our other automated methods. We employ the Farasa POS Arabic tagger [41], a tool that assigns Arabic-specific tags based on contextual cues, to analyze the text. We chose this Arabic tagger because studies suggested its high POS-tagging accuracy of 83% compared to other tested Arabic taggers [42].
Emotions: They refer to the degree of emotion expressed in a given text. Therefore, six essential human emotions—anger, disgust, fear, sadness, joy, and surprise—were used to analyze the emotional state of each post [43]. Diverse emotions have been linked to deceptive texts such as fake posts [44,45].
Linguistics: Specific linguistic categories that fall into each category were extracted to study the word uses. The linguistic categories are undoubtedly syntactical categories that are too fine-grained to be captured by general POS. Each syntactical unit serves a specific linguistic purpose used to construct meaningful statements. We then extract the linguistic categories used in a study by Himdi et al. [37], which are hedges, assurances, temporal and spatial words, including exceptions, negations, illustrations, intensifiers, oppositions, justifications, and superlatives.
To extract both emotion and linguistic categories, we use Tasaheel [46], a textual analysis tool specifically developed for Arabic natural language processing tasks.
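Building the per-post POS frequency features described under the part-of-speech category can be sketched as follows, assuming the tagger output is available as a simple list of tags. The tag inventory and function name are illustrative assumptions and do not reflect Farasa’s actual output format.

```python
from collections import Counter

# Illustrative tag inventory covering the categories listed above
POS_TAGS = ["NOUN", "VERB", "ADJ", "ADV", "PREP",
            "DET", "PART", "CONJ", "PRON"]

def pos_feature_vector(tags):
    """Turn one post's POS tag sequence (e.g., from a POS tagger) into a
    relative-frequency feature vector over a fixed tag inventory."""
    counts = Counter(tags)
    total = max(len(tags), 1)          # guard against empty posts
    return [counts[t] / total for t in POS_TAGS]
```

Stacking these vectors over all posts yields the per-class frequency profiles that the lexical density analysis below compares between real and fake posts.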
Computing the lexical densities of textual classes in a dataset involves the use of descriptive statistics. This phase aims to support comprehension of the dataset’s subtler details, specifically the distinctions between word use in real and fake posts. Several studies, including those conducted by [47,48], have compared the word usage of real and fake texts to obtain a deeper understanding of the writing characteristics of fake news.
In that view, the formula adopted by Yang et al. [49] is used to calculate lexical densities in this context, which is as follows:
Lexical Density ( L ) = ( Total number of occurrences of each feature in a class ) × 100 Total number of words in the whole class
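The formula translates directly into code; a one-line helper (with an illustrative name) might be:

```python
def lexical_density(feature_occurrences, total_words_in_class):
    """Lexical density L of one feature within a class (real or fake),
    following the formula above: occurrences * 100 / class word count."""
    return feature_occurrences * 100.0 / total_words_in_class
```

For example, a feature occurring 279 times in a class of 10,000 words has a lexical density of 2.79.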

5.2. Textual Features Analysis

The analysis focuses on ascertaining which word categories are typical of real and fake posts. In this context, we determine whether there are any statistically significant variations in the relative counts of word groups or categories within posts analyzed previously. The results of lexical densities for POS, linguistics, and emotions categories are presented in Table 4.
According to Table 4, the study shows that the largest proportion of POS tags comes from nouns. It is pertinent to note that ‘nouns’ comprise the themes that are present in the posts in greater proportion. Interestingly, the findings show not only a large increase in nouns but an increase in ‘conjunctions’ as well. A study by Kapusta et al. [50] reported similar observations, wherein ‘nouns’ comprised a higher percentage of the themes discovered during their experimental study. These findings could partly be explained by the fact that fake post producers tend to pad their fake statements with made-up ‘nouns’ in the form of ‘persons’ and ‘countries’ to support their writings and narratives. We also find that verbs, adverbs, and adjectives are used slightly more in fake posts than in real ones, indicating that fake post writers tend to shape and support their deceptive text by integrating supporting events and citing supposed sources.
An interesting insight drawn from the study’s findings concerns the analysis of pronouns. We found that singular pronouns are less frequently used in fake posts, representing an overall value of 28.27%, than in real ones, representing an overall value of 33.16%; the difference in pronoun usage between real and fake posts is thus 4.89 percentage points. The prominent type of pronoun tagged, which is associated with nouns, is the plural form. Interestingly, this insight is supported by an existing study [51], which highlighted that liars tend to dissociate themselves from their made-up statements, whether out of guilt over lying or because they cannot support their lie with authentic facts. Another reason, which could be profound in the Arab world, is that posting fake news is a highly punishable crime in countries such as Saudi Arabia [52]. This may discourage and instill fear in fake post producers in the event that they are caught in their lies and punished or prosecuted. Consequently, in order to disconnect themselves from made-up stories and be “less self-responsible” for the posts’ context and content, they frequently use the plural form rather than the singular form, which indicates “responsibility” [53].
Another finding concerns particles, which are among the less frequently used categories, comprising 2.7908% of real posts and 1.8926% of fake posts. Though the difference is subtle, it may suggest that when real post producers cite key figures, they refer to their full names and titles with confidence to support the sources of their content, whereas fake post writers tend to hesitate to cite reliable sources to support their made-up content. Relative to fake posts, particles are used intensively in real posts, which could be explained by the events and facts that post producers state and highlight, requiring a group of particles to link the events into a coherent scenario. Fake post writers, on the contrary, tend to write simple events and facts. They are not keen to link these events and facts in their posts because they lack “self-responsibility” and the substance of “supporting authentic sources”. This notion was widely described in a study by [54]: liars seem more nervous when the stakes are high (such as murder cases) than when the stakes are low (such as traffic violations). The same characteristic features could be indicative of fake post writers, especially for serious and sensitive topics such as COVID-19, which affects many aspects of life, including medical care, governmental regulation, and personal mental health and wellbeing. Consequently, fake posts lack writing intricacy, especially when compared to real posts that confidently integrate several ideas, facts, and events. This is also supported by the same study, which found that liars tend to be less forthcoming and candid than truth tellers, resulting in fewer complicated and detailed statements.
Another interesting finding is that fake posts show increased lexical density involving intensifiers. This is unsurprising because social media platforms such as Twitter are “social”: they create engagement between users focused on social topics, in this case COVID-19, and reward posts that generate as much engagement as possible. There is no better way to gain wider social attention than integrating intensifiers. In support of the “made up” events in posts, specific time and location or geographical terms were woven into the fake posts to prompt belief. On the other hand, assurance, exceptions, hedges, and justification, including illustration, are linguistic terms that were mostly used in real posts rather than fake ones. It could be that the influence Twitter has on its users impacts their writing as well. We find that users wrote posts about COVID-19 as logical discussions, supporting such posts with medical and governmental facts. This resulted in the usage of a range of justification terms to better persuade readers. For example, hedges, exceptions, and negation are used to present facts with caution, whereas assurance terms are used to state facts firmly. Similarly, illustration and opposition terms are used to display real-world cases or events.
When we compare the positions of truth-tellers and liars in high-stakes to low-stakes situations, we find that even truth-tellers can be apprehensive or nervous in high-stakes settings. The nervousness found in both parties is not a one-size-fits-all template for all cases [54]. We clearly find that real post writers and truth tellers share the feeling of nervousness on the topic of COVID-19. In some cases, they do so by expressing it in their writings by adding linguistic terms such as hedges, exceptions, and negation.
To better emphasize the results, we link our findings with theoretical studies, the thoroughly studied Elaboration Likelihood Model (ELM) of persuasion is one such example. According to the ELM, readers are persuaded via two distinct routes: central and peripheral [55]. The central route of persuasion emerges from a thorough assessment of the presenting arguments and message qualities, which necessitates a significant amount of effort and cognition. On the contrary, the peripheral path of persuasion includes linking concepts or establishing assumptions that are unconnected to the logic and quality of the presented information. This method could be referred to as heuristic because it does not guarantee to be optimal or even adequate in achieving its goal of locating accurate or trustworthy information. The peripheral route requires minimal energy and mental capacity [56]. This result is consistent with what we discovered when analyzing the linguistic word choices in real and fake posts. Specifically, linguistic terms such as justification, assurance, exceptions, and illustration are found in real posts that follow the central route of persuasion. On the contrary, we find that fake posts rely more on linguistic terms such as places (geographical location) and times (events in time) to support the made-up events in their effort to persuade through a peripheral route.
When it concerns emotions, we find that sad/sadness, disgust, and surprise are the most dominant emotional terms used in fake posts because these posts deal with the COVID-19 topic. On the other hand, real posts frequently included the emotions of joy/happiness, anger, and fear. Unlike previous studies that found the emotional tone to significantly influence the content and impact of fake news articles [57], our findings did not indicate substantial discrepancies in emotional expression between real and fake posts. This similarity may be attributed to the sensitivity of the topic—COVID-19—which likely elicited strong emotional responses across both real and fake content. It is possible that the pandemic’s increased public anxiety caused emotional tones to converge, making it harder to identify fake news using just emotional markers. In fact, joy/happiness as an emotional term or word usage was found in real posts that posted the announcement of successfully producing the COVID-19 vaccine. Similarly, fake post writers shared the same announcement. However, their posts come with a twist in the tale by misrepresenting facts. This comprehensive analysis can assist researchers in identifying fake posts through an emphasis on their linguistic features.
The observed differences in word density between real and fake posts provide important insights into language patterns that may be associated with fake news. However, the results could be influenced by the specific characteristics of the dataset used in this study. Consequently, the identified patterns should be interpreted as indicative signals rather than definitive conclusions. To validate their reliability and statistical significance, subsequent investigations should examine these linguistic patterns across multiple Arabic fake news datasets.
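The word densities compared here (and reported in Table 4) are per-category percentages: the share of tokens carrying a given POS, linguistic, or emotion tag. A minimal sketch of that computation, using a hypothetical pre-tagged token sequence (the tag names and counts below are illustrative, not the study's data):

```python
from collections import Counter

def lexical_density(tags):
    """Percentage of tokens per category tag (POS, linguistic, or emotion)."""
    counts = Counter(tags)
    total = len(tags)
    return {tag: 100.0 * n / total for tag, n in counts.items()}

# Toy tagged post: 3 of 5 tokens are nouns -> noun density 60.0
tags = ["NOUN", "VERB", "NOUN", "PREP", "NOUN"]
print(lexical_density(tags)["NOUN"])  # 60.0
```

In practice the tags would come from a tagger such as Farasa [41] or the Tasaheel tool [46]; densities for real and fake posts are then computed separately and compared.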

6. Limitations and Future Work

Even though the proposed work has shown successful results, some limitations still need to be addressed. First, the experiments used only one benchmark dataset of fake posts discussing COVID-19, which may limit the generalizability of the findings to other topics. We therefore suggest future work that evaluates the framework on additional datasets from different domains to further validate its robustness and generalization. Second, the proposed study examines only the textual aspect, without including other types of metadata that can accompany social media posts, such as images and videos; the findings and methodology should therefore be applied to similar textual contexts. Third, the methodology used only unsupervised models; it may be expanded with additional innovative AI models to obtain better outcomes and deliver more comprehensive insights. Finally, given the widespread harmful impact of fake news across several domains, security models should be incorporated to tackle this issue fully.

7. Conclusions

This study highlights the crucial need to tackle misinformation, especially about epidemics such as COVID-19. The paper demonstrates the efficacy of the proposed unsupervised model for detecting fake news in Arabic, advancing the dependability of information circulation. By analyzing linguistic features and emotional cues in both real and fake posts, the research offers valuable insights into the characteristics that distinguish credible information from misleading content. The findings emphasize the need for continuous efforts to develop practical tools and methodologies to combat the spread of fake news, particularly in linguistically diverse contexts. Furthermore, the study calls for collaboration among researchers, policymakers, and technology developers to create comprehensive strategies that not only detect but also mitigate the impact of fake news on public perception and behavior. Ultimately, this research contributes to the broader discourse on fake news and its implications for society, paving the way for future studies and interventions that foster a more informed public.

Author Contributions

Conceptualization, N.Z.; Methodology, N.Z. and R.K.Q.; Software, N.Z.; Validation, N.Z. and H.H.; Formal analysis, N.Z., H.H. and R.K.Q.; Investigation, N.Z. and H.H.; Writing—original draft, N.Z., H.H. and R.K.Q.; Writing—review and editing, H.H. and R.K.Q.; Project administration, N.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the University of Jeddah, Jeddah, Saudi Arabia, under grant No. (UJ-22-DR-101). The authors, therefore, acknowledge with thanks the University of Jeddah for its technical and financial support.

Data Availability Statement

The data used in this study are publicly available in the ArCOV-19 repository and can be accessed at: https://gitlab.com/bigirqu/ArCOV-19 (accessed on 13 September 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Forms of gradient terms for GDM & EGDM:
The log-likelihood of one observation $X_i$, following GDM, is

$$\log P(X_i \mid \theta_j) = \log \Gamma\Big(\sum_{d=1}^{D+1} x_{id} + 1\Big) - \sum_{d=1}^{D+1} \log \Gamma(x_{id} + 1) + \sum_{d=1}^{D} \big[\log \Gamma(\alpha_{jd} + \beta_{jd}) - \log \Gamma(\alpha_{jd}) - \log \Gamma(\beta_{jd})\big] + \sum_{d=1}^{D} \big[\log \Gamma(\alpha'_{jd}) + \log \Gamma(\beta'_{jd}) - \log \Gamma(\alpha'_{jd} + \beta'_{jd})\big]$$

where $\alpha'_{jd} = \alpha_{jd} + x_{id}$ and $\beta'_{jd} = \beta_{jd} + \sum_{l=d+1}^{D+1} x_{il}$. Then,

$$\frac{\partial L(X, Z \mid \Theta)}{\partial \alpha_{jd}} = \sum_{i=1}^{N} z_{ij} \big[\Psi(\alpha_{jd} + \beta_{jd}) - \Psi(\alpha_{jd}) + \Psi(\alpha'_{jd}) - \Psi(\alpha'_{jd} + \beta'_{jd})\big]$$

$$\frac{\partial L(X, Z \mid \Theta)}{\partial \beta_{jd}} = \sum_{i=1}^{N} z_{ij} \big[\Psi(\alpha_{jd} + \beta_{jd}) - \Psi(\beta_{jd}) + \Psi(\beta'_{jd}) - \Psi(\alpha'_{jd} + \beta'_{jd})\big]$$

where $\Psi(\cdot)$ is the digamma function.
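The GDM gradient terms can be sanity-checked numerically. The sketch below is ours, not the authors' code: it evaluates the per-observation GDM log-likelihood with Python's `math.lgamma`, assuming the primed parameters $\alpha'_{jd} = \alpha_{jd} + x_{id}$ and $\beta'_{jd} = \beta_{jd} + \sum_{l>d} x_{il}$, and compares the digamma-based gradient for one observation against a central finite difference. The `digamma` helper is a standard recurrence-plus-asymptotic-series approximation.

```python
import math

def digamma(x):
    """Psi(x) via the recurrence Psi(x) = Psi(x+1) - 1/x and a standard
    asymptotic series once x >= 6."""
    acc = 0.0
    while x < 6.0:
        acc -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return acc + math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))

def gdm_loglik(x, alpha, beta):
    """Log-likelihood of one count vector x (length D+1) under a single GDM
    component with shape parameters alpha, beta (length D)."""
    ll = math.lgamma(sum(x) + 1) - sum(math.lgamma(xd + 1) for xd in x)
    for d in range(len(alpha)):
        tail = sum(x[d + 1:])  # x_{d+1} + ... + x_{D+1}, i.e. beta'_d - beta_d
        ll += (math.lgamma(alpha[d] + beta[d])
               - math.lgamma(alpha[d]) - math.lgamma(beta[d])
               + math.lgamma(alpha[d] + x[d]) + math.lgamma(beta[d] + tail)
               - math.lgamma(alpha[d] + beta[d] + x[d] + tail))
    return ll

def gdm_grad_alpha(x, alpha, beta, d):
    """Digamma form of d(log-lik)/d(alpha_d) for a single observation."""
    tail = sum(x[d + 1:])
    return (digamma(alpha[d] + beta[d]) - digamma(alpha[d])
            + digamma(alpha[d] + x[d])
            - digamma(alpha[d] + beta[d] + x[d] + tail))

x = [3, 1, 0, 2]                                  # toy counts, D+1 = 4
alpha, beta = [1.2, 0.8, 2.0], [1.5, 2.5, 0.9]    # toy component parameters

eps, d = 1e-6, 1
hi, lo = alpha[:], alpha[:]
hi[d] += eps; lo[d] -= eps
fd = (gdm_loglik(x, hi, beta) - gdm_loglik(x, lo, beta)) / (2 * eps)
print(abs(fd - gdm_grad_alpha(x, alpha, beta, d)))  # ~0: the two forms agree
```

In the mixture setting, the full gradient of $L(X, Z \mid \Theta)$ is the same per-observation expression weighted by the responsibilities $z_{ij}$ and summed over $i$.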
The log-likelihood of one observation $X_i$, following EGDM, is

$$\log P(X_i \mid \theta_j) = \log(n_i!) + \sum_{d:\, x_{id} \geq 1} \big[\log \Gamma(z_{id}) - \log \Gamma(x_{id} + z_{id})\big] + \sum_{d:\, x_{id} \geq 1} \big[\log \alpha_{jd} + \log \beta_{jd} - \log(\alpha_{jd} + \beta_{jd}) - \log x_{id}\big]$$

where $n_i = \sum_{d=1}^{D+1} x_{id}$ and $z_{id} = \sum_{l=d+1}^{D+1} x_{il}$. By computing the first derivative of $L(X, Z \mid \Theta)$ with respect to $\alpha_{jd}$ and $\beta_{jd}$, we obtain

$$\frac{\partial L(X, Z \mid \Theta)}{\partial \alpha_{jd}} = \sum_{i=1}^{N} z_{ij}\, I(x_{id} \geq 1) \left[\frac{1}{\alpha_{jd}} - \frac{1}{\alpha_{jd} + \beta_{jd}}\right],$$

$$\frac{\partial L(X, Z \mid \Theta)}{\partial \beta_{jd}} = \sum_{i=1}^{N} z_{ij}\, I(x_{id} \geq 1) \left[\frac{1}{\beta_{jd}} - \frac{1}{\alpha_{jd} + \beta_{jd}}\right].$$
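The EGDM gradients only receive contributions from dimensions where the count is nonzero, which is what makes the approximation cheap on sparse text. A minimal sketch (helper names are ours) isolating the parameter-dependent terms and checking the indicator-gated gradient by finite differences:

```python
import math

def egdm_param_terms(x, alpha, beta):
    """Parameter-dependent part of the EGDM log-likelihood for one count
    vector x: sum over d with x_d >= 1 of log a_d + log b_d - log(a_d + b_d)."""
    return sum(math.log(alpha[d]) + math.log(beta[d]) - math.log(alpha[d] + beta[d])
               for d in range(len(alpha)) if x[d] >= 1)

def egdm_grads(x, alpha, beta):
    """Indicator-gated per-dimension gradients, as in the appendix."""
    ga = [(1.0 / alpha[d] - 1.0 / (alpha[d] + beta[d])) if x[d] >= 1 else 0.0
          for d in range(len(alpha))]
    gb = [(1.0 / beta[d] - 1.0 / (alpha[d] + beta[d])) if x[d] >= 1 else 0.0
          for d in range(len(alpha))]
    return ga, gb

x = [2, 0, 1]                                   # sparse toy counts
alpha, beta = [0.7, 1.1, 0.4], [1.9, 0.6, 2.2]  # toy component parameters
ga, gb = egdm_grads(x, alpha, beta)

eps, d = 1e-6, 0
hi, lo = alpha[:], alpha[:]
hi[d] += eps; lo[d] -= eps
fd = (egdm_param_terms(x, hi, beta) - egdm_param_terms(x, lo, beta)) / (2 * eps)
print(abs(fd - ga[d]), ga[1])  # gradient matches; zero-count dim contributes 0
```

Because no gamma or digamma evaluations are needed, each gradient step costs only a handful of arithmetic operations per active dimension, consistent with the shorter times reported for EGDM in Table 2.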
Forms of gradient terms for MBL & EMBL:
The log-likelihood of one observation $X_i$, following MBL, is

$$\log P(X_i \mid \theta_j) = \log \Gamma\Big(\sum_{d=1}^{D+1} x_{id} + 1\Big) - \sum_{d=1}^{D+1} \log \Gamma(x_{id} + 1) + \log \Gamma\Big(\sum_{d=1}^{D} \alpha_{jd}\Big) - \log \Gamma\Big(\sum_{d=1}^{D} \alpha'_{jd}\Big) + \log \Gamma(\alpha_j + \beta_j) - \log \Gamma(\alpha'_j + \beta'_j) + \log \Gamma(\alpha'_j) - \log \Gamma(\alpha_j) + \log \Gamma(\beta'_j) - \log \Gamma(\beta_j) + \sum_{d=1}^{D} \big[\log \Gamma(\alpha'_{jd}) - \log \Gamma(\alpha_{jd})\big]$$

where $\alpha'_{jd} = \alpha_{jd} + x_{id}$, $\alpha'_j = \alpha_j + \sum_{d=1}^{D} x_{id}$, and $\beta'_j = \beta_j + x_{i(D+1)}$. Then,

$$\frac{\partial L(X, Z \mid \Theta)}{\partial \alpha_{jd}} = \sum_{i=1}^{N} z_{ij} \Big[\Psi\Big(\sum_{d=1}^{D} \alpha_{jd}\Big) - \Psi(\alpha_{jd}) + \Psi(\alpha'_{jd}) - \Psi\Big(\sum_{d=1}^{D} \alpha'_{jd}\Big)\Big]$$

$$\frac{\partial L(X, Z \mid \Theta)}{\partial \alpha_j} = \sum_{i=1}^{N} z_{ij} \big[\Psi(\alpha_j + \beta_j) - \Psi(\alpha_j) + \Psi(\alpha'_j) - \Psi(\alpha'_j + \beta'_j)\big]$$

$$\frac{\partial L(X, Z \mid \Theta)}{\partial \beta_j} = \sum_{i=1}^{N} z_{ij} \big[\Psi(\alpha_j + \beta_j) - \Psi(\beta_j) + \Psi(\beta'_j) - \Psi(\alpha'_j + \beta'_j)\big]$$
The data log-likelihood following EMBL is given by

$$\log P(X_i \mid \theta_j) = \log(n_i!) + \log \Gamma(s_j) - \log \Gamma(s_j + n_i) + \log \Gamma(\alpha_j) + \log \Gamma(\beta_j) - \log \Gamma(\alpha_j + \beta_j) + \log \alpha_j + \sum_{d=1}^{D} I(x_{id} \geq 1) \big[\log \alpha_{jd} - \log x_{id}\big]$$

We compute the first derivative of the log-likelihood with respect to $\alpha_j$ and $\beta_j$ as

$$\frac{\partial L(X, Z \mid \Theta)}{\partial \alpha_j} = \sum_{i=1}^{N} z_{ij} \Big[\Psi(\alpha_j) + \frac{1}{\alpha_j} - \Psi(\alpha_j + \beta_j)\Big]$$

$$\frac{\partial L(X, Z \mid \Theta)}{\partial \beta_j} = \sum_{i=1}^{N} z_{ij} \big[\Psi(\beta_j) - \Psi(\alpha_j + \beta_j)\big]$$
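The EMBL derivatives can likewise be verified against the $(\alpha_j, \beta_j)$-dependent terms of the log-likelihood, $\log\Gamma(\alpha_j) + \log\Gamma(\beta_j) + \log\alpha_j - \log\Gamma(\alpha_j + \beta_j)$. A minimal sketch (our helper names, with a standard digamma approximation):

```python
import math

def digamma(x):
    """Psi(x) via the recurrence Psi(x) = Psi(x+1) - 1/x and a standard
    asymptotic series once x >= 6."""
    acc = 0.0
    while x < 6.0:
        acc -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return acc + math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))

def embl_param_part(a, b):
    """(alpha_j, beta_j)-dependent terms of the EMBL log-likelihood."""
    return math.lgamma(a) + math.lgamma(b) + math.log(a) - math.lgamma(a + b)

alpha_j, beta_j = 1.3, 2.4  # toy component parameters

# Analytic per-observation gradients from the appendix
grad_a = digamma(alpha_j) + 1.0 / alpha_j - digamma(alpha_j + beta_j)
grad_b = digamma(beta_j) - digamma(alpha_j + beta_j)

# Central finite differences of the parameter-dependent terms
eps = 1e-6
fd_a = (embl_param_part(alpha_j + eps, beta_j) - embl_param_part(alpha_j - eps, beta_j)) / (2 * eps)
fd_b = (embl_param_part(alpha_j, beta_j + eps) - embl_param_part(alpha_j, beta_j - eps)) / (2 * eps)
print(abs(fd_a - grad_a), abs(fd_b - grad_b))  # both ~0: formulas agree
```

As with the other models, the mixture-level gradient multiplies these per-observation terms by the responsibilities $z_{ij}$ and sums over observations.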

References

  1. Akbar, S.Z.; Panda, A.; Kukreti, D.; Meena, A.; Pal, J. Misinformation as a Window into Prejudice: COVID-19 and the Information Environment in India. Proc. ACM Hum.-Comput. Interact. 2021, 4, 249. [Google Scholar] [CrossRef]
  2. Gallotti, R.; Valle, F.; Castaldo, N.; Sacco, P.; De Domenico, M. Assessing the risks of ‘infodemics’ in response to COVID-19 epidemics. Nat. Hum. Behav. 2020, 4, 1285–1293. [Google Scholar] [CrossRef] [PubMed]
  3. Knittel, J.; Koch, S.; Tang, T.; Chen, W.; Wu, Y.; Liu, S.; Ertl, T. Real-time visual analysis of high-volume social media posts. IEEE Trans. Vis. Comput. Graph. 2021, 28, 879–889. [Google Scholar] [CrossRef]
  4. Xia, Q.; Maekawa, T.; Hara, T. Unsupervised human activity recognition through two-stage prompting with chatgpt. arXiv 2023, arXiv:2306.02140. [Google Scholar] [CrossRef]
  5. Al-Sayed, A.; Khayyat, M.M.; Zamzami, N. Predicting Heart Disease Using Collaborative Clustering and Ensemble Learning Techniques. Appl. Sci. 2023, 13, 13278. [Google Scholar] [CrossRef]
  6. Cholevas, C.; Angeli, E.; Sereti, Z.; Mavrikos, E.; Tsekouras, G.E. Anomaly detection in blockchain networks using unsupervised learning: A survey. Algorithms 2024, 17, 201. [Google Scholar] [CrossRef]
  7. Li, D.; Guo, H.; Wang, Z.; Zheng, Z. Unsupervised fake news detection based on autoencoder. IEEE Access 2021, 9, 29356–29365. [Google Scholar] [CrossRef]
  8. Su, X.; Zamzami, N.; Bouguila, N. Covid-19 news clustering using MCMC-based learning of finite EMSD mixture models. In Proceedings of the International FLAIRS Conference, North Miami Beach, FL, USA, 17–19 May 2021; Volume 34. [Google Scholar] [CrossRef]
  9. Bouguila, N. Clustering of count data using generalized Dirichlet multinomial distributions. IEEE Trans. Knowl. Data Eng. 2008, 20, 462–474. [Google Scholar] [CrossRef]
  10. Bouguila, N. Count data modeling and classification using finite mixtures of distributions. IEEE Trans. Neural Netw. 2010, 22, 186–198. [Google Scholar] [CrossRef]
  11. Zamzami, N.; Bouguila, N. Sparse count data clustering using an exponential approximation to generalized Dirichlet multinomial distributions. IEEE Trans. Neural Netw. Learn. Syst. 2020, 33, 89–102. [Google Scholar] [CrossRef]
  12. Zamzami, N.; Bouguila, N. High-dimensional count data clustering based on an exponential approximation to the multinomial beta-liouville distribution. Inf. Sci. 2020, 524, 116–135. [Google Scholar] [CrossRef]
  13. Saini, N.; Singhal, M.; Tanwar, M.; Meel, P. Multimodal, Semi-supervised, and Unsupervised web content credibility Analysis Frameworks. In Proceedings of the 2020 4th IEEE International Conference on Intelligent Computing and Control Systems (ICICCS), Madurai, India, 13–15 May 2020; pp. 948–955. [Google Scholar] [CrossRef]
  14. Dong, X.; Victor, U.; Chowdhury, S.; Qian, L. Deep two-path semi-supervised learning for fake news detection. arXiv 2019, arXiv:1906.05659. [Google Scholar] [CrossRef]
  15. Hosseinimotlagh, S.; Papalexakis, E.E. Unsupervised content-based identification of fake news articles with tensor decomposition ensembles. In Proceedings of the Workshop on Misinformation and Misbehavior Mining on the Web (MIS2), Marina del Rey, CA, USA, 9 February 2018. [Google Scholar]
  16. Gangireddy, S.C.R.; Long, C.; Chakraborty, T. Unsupervised fake news detection: A graph-based approach. In Proceedings of the 31st ACM Conference on Hypertext and Social Media, Virtual Event USA, 13–15 July 2020; pp. 75–83. [Google Scholar] [CrossRef]
  17. Wan, P.; Wang, X.; Pang, G.; Wang, L.; Min, G. A novel rumor detection with multi-objective loss functions in online social networks. Expert Syst. Appl. 2023, 213, 119239. [Google Scholar] [CrossRef] [PubMed]
  18. Yang, S.; Shu, K.; Wang, S.; Gu, R.; Wu, F.; Liu, H. Unsupervised fake news detection on social media: A generative approach. In Proceedings of the AAAI conference on artificial intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 5644–5651. [Google Scholar] [CrossRef]
  19. Najar, F.; Zamzami, N.; Bouguila, N. Fake news detection using Bayesian inference. In Proceedings of the 2019 IEEE 20th International Conference on Information Reuse and Integration for Data Science (IRI), Los Angeles, CA, USA, 30 July–1 August 2019; pp. 389–394. [Google Scholar] [CrossRef]
  20. Alasmari, A.; Addawood, A.; Nouh, M.; Rayes, W.; Al-Wabil, A. A retrospective analysis of the COVID-19 infodemic in Saudi Arabia. Future Internet 2021, 13, 254. [Google Scholar] [CrossRef]
  21. Mahlous, A.R.; Al-Laith, A. Fake news detection in Arabic tweets during the COVID-19 pandemic. Int. J. Adv. Comput. Sci. Appl. 2021, 12, 778–788. [Google Scholar] [CrossRef]
  22. Haouari, F.; Hasanain, M.; Suwaileh, R.; Elsayed, T. Arcov-19: The first arabic covid-19 twitter dataset with propagation networks. arXiv 2020, arXiv:2004.05861. [Google Scholar]
  23. Haouari, F.; Hasanain, M.; Suwaileh, R.; Elsayed, T. ArCOV19-rumors: Arabic COVID-19 twitter dataset for misinformation detection. arXiv 2020, arXiv:2010.08768. [Google Scholar]
  24. Ameur, M.S.H.; Aliane, H. AraCOVID19-MFH: Arabic COVID-19 Multi-label Fake News & Hate Speech Detection Dataset. Procedia Comput. Sci. 2021, 189, 232–241. [Google Scholar] [CrossRef]
  25. Alam, F.; Dalvi, F.; Shaar, S.; Durrani, N.; Mubarak, H.; Nikolov, A.; Da San Martino, G.; Abdelali, A.; Sajjad, H.; Darwish, K.; et al. Fighting the COVID-19 Infodemic in Social Media: A Holistic Perspective and a Call to Arms. In Proceedings of the Fifteenth International AAAI Conference on Web and Social Media, Palo Alto, CA, USA, 7–10 June 2021; pp. 913–922. [Google Scholar] [CrossRef]
  26. Alyoubi, S.; Kalkatawi, M.; Abukhodair, F. The Detection of Fake News in Arabic Tweets Using Deep Learning. Appl. Sci. 2023, 13, 8209. [Google Scholar] [CrossRef]
  27. Al-Yahya, M.; Al-Khalifa, H.; Al-Baity, H.; AlSaeed, D.; Essam, A. Arabic fake news detection: Comparative study of neural networks and transformer-based approaches. Complexity 2021, 2021, 5516945. [Google Scholar] [CrossRef]
  28. Helmy, T.M.H.Y.M.; Elzanfaly, D.S. An Ensemble Stacking Model for Rumor Detection Based on Arabic Tweets. J. Theor. Appl. Inf. Technol. 2023, 101, 7347–7358. [Google Scholar]
  29. Madsen, R.E.; Kauchak, D.; Elkan, C. Modeling word burstiness using the Dirichlet distribution. In Proceedings of the 22nd International Conference on Machine Learning, Bonn, Germany, 7–11 August 2005; pp. 545–552. [Google Scholar] [CrossRef]
  30. Zhou, H.; Lange, K. MM algorithms for some discrete multivariate distributions. J. Comput. Graph. Stat. 2010, 19, 645–665. [Google Scholar] [CrossRef] [PubMed]
  31. Zamzami, N.; Bouguila, N. Consumption behavior prediction using hierarchical Bayesian frameworks. In Proceedings of the 2018 First IEEE International Conference on Artificial Intelligence for Industries (AI4I), Laguna Hills, CA, USA, 26–28 September 2018; pp. 31–34. [Google Scholar] [CrossRef]
  32. McLachlan, G.J.; Krishnan, T. The EM Algorithm and Extensions; Wiley Series in Probability and Statistics; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2007. [Google Scholar]
  33. Yao, J.F. On recursive estimation in incomplete data models. Stat. J. Theor. Appl. Stat. 2000, 34, 27–51. [Google Scholar] [CrossRef]
  34. Bouguila, N.; Ziou, D. Online clustering via finite mixtures of Dirichlet and minimum message length. Eng. Appl. Artif. Intell. 2006, 19, 371–379. [Google Scholar] [CrossRef]
  35. Hu, B.; Mao, Z.; Zhang, Y. An overview of fake news detection: From a new perspective. Fundam. Res. 2025, 5, 332–346. [Google Scholar] [CrossRef]
  36. Tucker, J.S.; Friedman, H.S. Chapter 17—Emotion, Personality, and Health. In Handbook of Emotion, Adult Development, and Aging; Magai, C., McFadden, S.H., Eds.; Academic Press: Cambridge, MA, USA, 1996; pp. 307–326. [Google Scholar] [CrossRef]
  37. Himdi, H.; Weir, G.; Assiri, F.; Al-Barhamtoshy, H. Arabic fake news detection based on textual analysis. Arab. J. Sci. Eng. 2022, 47, 10453–10469. [Google Scholar] [CrossRef]
  38. Alshahrani, H.J.; Tarmissi, K.; Alshahrani, H.; Ahmed Elfaki, M.; Yafoz, A.; Alsini, R.; Alghushairy, O.; Ahmed Hamza, M. Computational Linguistics with Deep-Learning-Based Intent Detection for Natural Language Understanding. Appl. Sci. 2022, 12, 8633. [Google Scholar] [CrossRef]
  39. Biber, D.; Johansson, S.; Leech, G.; Conrad, S.; Finegan, E. Longman Grammar of Spoken and Written English; Longman: London, UK, 2000. [Google Scholar]
  40. Rayson, P.; Wilson, A.; Leech, G. Grammatical Word Class Variation within the British National Corpus Sampler. In New Frontiers of Corpus Research; Peters, P., Purnell, P., Rayner, S., Eds.; Brill: Leiden, The Netherlands, 2002; pp. 295–306. [Google Scholar] [CrossRef]
  41. Abdelali, A.; Darwish, K.; Durrani, N.; Mubarak, H. Farasa: A fast and furious segmenter for Arabic. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, San Diego, CA, USA, 12–17 June 2016; pp. 11–16. [Google Scholar]
  42. Alfaidi, A.; Alwadei, H.; Alshutayri, A.; Alahdal, S. Exploring the Performance of Farasa and CAMeL Taggers for Arabic Dialect Tweets. Int. Arab. J. Inf. Technol. (IAJIT) 2023, 20, 349–356. [Google Scholar] [CrossRef]
  43. Levenson, R.W.; Ekman, P.; Friesen, W.V. Voluntary facial action generates emotion-specific autonomic nervous system activity. Psychophysiology 1990, 27, 363–384. [Google Scholar] [CrossRef]
  44. Jupe, L.M.; Vrij, A.; Leal, S.; Nahari, G. Are you for real? Exploring language use and unexpected process questions within the detection of identity deception. Appl. Cogn. Psychol. 2018, 32, 622–634. [Google Scholar] [CrossRef]
  45. Hancock, J.T.; Curry, L.E.; Goorha, S.; Woodworth, M. On lying and being lied to: A linguistic analysis of deception in computer-mediated communication. Discourse Processes 2007, 45, 1–23. [Google Scholar] [CrossRef]
  46. Himdi, H.T.; Assiri, F.Y. Tasaheel: An Arabic Automative Textual Analysis Tool—All in One. IEEE Access 2023, 11, 139979–139992. [Google Scholar] [CrossRef]
  47. Kapusta, J.; Obonya, J. Improvement of Misleading and Fake News Classification for Flective Languages by Morphological Group Analysis. Informatics 2020, 7, 4. [Google Scholar] [CrossRef]
  48. Horne, B.; Adali, S. This just in: Fake news packs a lot in title, uses simpler, repetitive content in text body, more similar to satire than real news. In Proceedings of the International AAAI Conference on Web and Social Media, Montreal, QC, Canada, 15–18 May 2017; Volume 11, pp. 759–766. [Google Scholar] [CrossRef]
  49. Yang, Y.; Zheng, L.; Zhang, J.; Cui, Q.; Li, Z.; Yu, P.S. TI-CNN: Convolutional neural networks for fake news detection. arXiv 2018, arXiv:1806.00749. [Google Scholar] [CrossRef]
  50. Kapusta, J.; Hajek, P.; Munk, M.; Benko, L. Comparison of fake and real news based on morphological analysis. Procedia Comput. Sci. 2020, 171, 2285–2293. [Google Scholar] [CrossRef]
  51. Knapp, M.L.; Hart, R.P.; Dennis, H.S. An exploration of deception as a communication construct. Hum. Commun. Res. 1974, 1, 15–29. [Google Scholar] [CrossRef]
  52. O’Connor, T. Saudi Arabia Warns Those Who Spread ’Fake News’ Will Be Jailed, Fined, Amid Rumors It Had Journalist Killed. Available online: https://www.newsweek.com/saudi-arabia-fake-news-jamal-khashoggi-1170613 (accessed on 1 September 2024).
  53. Newman, M.L.; Pennebaker, J.W.; Berry, D.S.; Richards, J.M. Lying Words: Predicting Deception from Linguistic Styles. Personal. Soc. Psychol. Bull. 2003, 29, 665–675. [Google Scholar] [CrossRef] [PubMed]
  54. DePaulo, B.M.; Lindsay, J.J.; Malone, B.E.; Muhlenbruck, L.; Charlton, K.; Cooper, H. Cues to deception. Psychol. Bull. 2003, 129, 74. [Google Scholar] [CrossRef]
  55. Petty, R.E.; Cacioppo, J.T. The Elaboration Likelihood Model of Persuasion; Advances in Experimental Social Psychology; Academic Press: Amsterdam, The Netherlands, 1986; Volume 19, pp. 123–205. [Google Scholar] [CrossRef]
  56. Bertelson, P.; Eelen, P.; d’Ydewalle, G. International Perspectives On Psychological Science, II: The State of the Art; Taylor & Francis: London, UK; Psychology Press: London, UK, 2013. (In English) [Google Scholar]
  57. Paschen, J. Investigating the emotional appeal of fake news using artificial intelligence and human contributions. J. Prod. Brand Manag. 2020, 29, 223–233. [Google Scholar] [CrossRef]
Figure 1. Comparison between the performance of the offline vs. the proposed online models.
Figure 2. The performance of GDM and EGDM online learning.
Figure 3. The performance of MBL and EMBL online learning.
Table 1. Performance of different clustering models.

| Model | Accuracy (%) | Precision (%) | Recall (%) | F-Score (%) |
|---|---|---|---|---|
| K-means | 50.42 | 50.26 | 72.78 | 59.45 |
| MM | 51.45 | 51.47 | 51.47 | 51.47 |
| DCM | 52.32 | 51.26 | 73.78 | 60.49 |
| GDM | 64.55 | 64.97 | 65.82 | 65.39 |
| MBL | 83.15 | 82.92 | 84.69 | 83.79 |
| EGDM | 74.94 | 75.98 | 69.88 | 72.80 |
| EMBL | 89.44 | 89.44 | 89.44 | 89.44 |
Table 2. Comparing the performance of the proposed online frameworks.

| Model | Avg. Accuracy (%) | F-Score (%) | Time |
|---|---|---|---|
| GDM | 68.04 | 68.50 | 0.0240 |
| EGDM | 76.65 | 75.88 | 0.0055 |
| MBL | 87.20 | 86.05 | 0.0239 |
| EMBL | 92.51 | 90.89 | 0.0049 |
Table 3. Comparison of the proposed models with existing approaches on publicly available Arabic fake news datasets.

| Reference | Method | Dataset | Best Reported Result |
|---|---|---|---|
| [27] | Deep learning models (CNN, RNN, GRU) and transformer-based models (AraBERT v1, AraBERT v0.2, AraBERT v2, ArElectra, QARiB, ARBERT, MARBERT) for Arabic fake news detection. | ArCOV19-Rumors, Covid-19-Fakes, AraNews, ANS corpus | Accuracy: 95.80% (QARiB) |
| [26] | Deep learning (CNN-BiLSTM) and transformer-based models (ARBERT, MARBERT) for Arabic fake news detection. | ArCOV19-Rumors, COVID-19 misinformation | F1-score: 95% (ARBERT, MARBERT) |
| [28] | Hybrid ensemble model combining machine learning and deep learning techniques for detecting Arabic rumor tweets. | ArCOV19-Rumors | Accuracy: 90% |
| Proposed model 1 | EGDM (offline) | ArCOV19-Rumors | Accuracy: 74.94% |
| Proposed model 2 | EMBL (offline) | ArCOV19-Rumors | Accuracy: 89.44% |
| Proposed model 3 | EGDM (online) | ArCOV19-Rumors | Accuracy: 76.65% |
| Proposed model 4 | EMBL (online) | ArCOV19-Rumors | Accuracy: 92.51% |
Table 4. POS, linguistic, and emotion lexical densities.

| Category | Real | Fake |
|---|---|---|
| POS | | |
| Nouns | 23.86 | 30.60 |
| Verbs | 3.91 | 4.46 |
| Particles | 2.79 | 1.89 |
| Conjunctions | 2.78 | 3.41 |
| Pronouns (all) | 3.10 | 4.06 |
| Pronouns (singular) | 33.16 | 28.27 |
| Pronouns (plural) | 5.44 | 6.07 |
| Adverbs | 0.23 | 0.26 |
| Determiners | 5.03 | 4.08 |
| Prepositions | 8.14 | 7.51 |
| Adjectives | 4.85 | 4.98 |
| Linguistics | | |
| Place | 0.74 | 0.86 |
| Assurance | 1.14 | 0.79 |
| Negators | 1.55 | 0.69 |
| Opposition | 0.19 | 0.03 |
| Justification | 0.88 | 0.21 |
| Exception | 0.23 | 0.19 |
| Illustration | 0.12 | 0.05 |
| Hedges | 0.52 | 0.20 |
| Time | 0.20 | 0.27 |
| Order | 0.01 | 0.01 |
| Intensifiers | 0.59 | 0.62 |
| Emotions | | |
| Joy/happiness | 0.93 | 0.78 |
| Sadness | 0.17 | 0.25 |
| Anger | 0.24 | 0.02 |
| Fear | 0.14 | 0.03 |
| Surprise | 0.08 | 0.11 |
| Disgust | 0.04 | 0.02 |

Share and Cite

Zamzami, N.; Himdi, H.; Qarout, R.K. An Automated Unsupervised Model Using Probabilistic Mixture Models and Textual Analysis for Arabic Fake News Detection. Mathematics 2026, 14, 1250. https://doi.org/10.3390/math14081250
