Automatic Fake News Detection for Romanian Online News

Abstract: This paper proposes a supervised machine learning system to detect fake news in online sources published in Romanian. Additionally, this work compares the results obtained by using recurrent neural networks based on long short-term memory and gated recurrent unit cells, a convolutional neural network, and a Bidirectional Encoder Representations from Transformers (BERT) model, namely RoBERT, a pre-trained Romanian BERT model. The deep learning architectures are compared with the results achieved by two classical classification algorithms: Naïve Bayes and Support Vector Machine. The proposed approach is based on a Romanian news corpus containing 25,841 true news items and 13,064 fake news items. The best result, an F1 score of 98.20%, is achieved by the convolutional neural network, which outperforms the standard classification methods and the BERT models. Moreover, irony detection and sentiment analysis systems are applied to reveal additional details about how irony and sentiment relate to fake news.


Introduction
Over the recent years, artificial intelligence (AI) brought important changes in the domain of information technologies and architectures, such as using and developing intelligent transportation systems, virtual personal assistants, robotic surgery, and maybe with the greatest impact on our lives, natural language processing [1].
Nowadays, due to the internet, the quality and quantity of the news increases every day, and the way that the consumer accesses and manages daily online information is constantly changing. Young people, especially the inexperienced ones, use social media platforms, mobile applications, or simple and dynamic websites to extract the necessary information quickly and easily, many times without discernment. The diversity of the online news may increase the engagement in democratic elections, giving everyone the opportunity to get involved or even change opinions and mentalities. However, new technologies and features can be used through social media platforms to spread fake news on a large scale, creating personalized information and becoming more effective in misinformation campaigns. Therefore, credible and reliable sources of information are needed so that the public does not fall prey to the intentions of those interested in manipulating reality. Some researchers [2] have also suggested that populist politicians use fake news in order to undermine authority. For example, recent research suggests that fake news is used to doubt some sources of information that in the past had been considered trusted in several fields, such as the scientific community or journalism [3].
Fake news can function as propaganda or misinformation, but it always appeals to the emotions of the public and aims to suppress rational responses, analysis, and comparison of information from several sources. It encourages inflammation and outrage and can easily lead to conspiracy theories and partisan, biased content that negatively affects social security. Hence, to tackle fake news, researchers use several approaches such as text classification, network analysis, or reviews and evaluations [4]. Moreover, Caplan et al. [5] described how companies and AI researchers should define fake news by type. Other forms of misinformation, such as conspiracies, discrediting, emotional appeals, or social media feeds that may influence democratic processes, also need to be addressed [6]. In addition, content on the web is intentionally polluted by fake journalist profiles to attract clicks and attention [7]. It is also very important to evaluate the credibility of the writers' beliefs and moral values.
This paper focuses on analyzing the performance of several models for fake news detection in Romanian by using neural network architectures such as long short-term memory (LSTM), a convolutional neural network (CNN), gated recurrent units (GRU), and Bidirectional Encoder Representations from Transformers (BERT), as well as standard classifiers such as Support Vector Machine (SVM) and Naïve Bayes (NB). Furthermore, a statistical analysis of a dataset of online articles with real and fake classes is presented. Moreover, sentiment analysis and irony detection systems are applied to our datasets, providing important information for the process of detecting fake news.
The paper continues as follows. In the next section, the related work is presented. Section 3 describes the methodology of the proposed method, and Section 4 describes the experiments, while Section 5 presents the results. Section 6 contains the conclusions.

Related Work
Busioc et al. [8] proposed an automated analysis of political statements in Romanian using several natural language processing techniques. They used a corpus collected from Factual [9], a Romanian initiative where different public statements which are labeled as true or fake can be found. Another study highlights the existing approaches, challenges, and observations from other languages to be applied for Romanian resources, identifying future paths [10] in developing fake news detection systems.
In recent years, there have been many studies available in the fake news detection field for other languages. Currently, there are studies and models that suggest using classical machine learning algorithms for detecting fake news in other languages [11]. Some authors have also proposed BERT models [12], mentioning an accuracy of 98.90% for FakeBert. There is a wide choice of deep neural network models available in the literature [13], and some papers use hybrid convolutional neural network and recurrent neural network (RNN) models, such as that of Ajao et al. [14], wherein they achieved an accuracy of 82%. Furthermore, several papers focused on fake news detection using neural learning systems have been published [15][16][17], highlighting the complexity of this domain. Another study provides a comparison between multiple methods using neural network systems and attention mechanisms [18], achieving an accuracy of 88.78% for CNN + Bi-LSTM ensembled networks.
The internet gives researchers the opportunity to find and use several datasets, such as Fake News Challenge [19], to develop different approaches based on CNN, LSTM, and Bi-LSTM [20] by using the headlines and bodies of the articles, achieving an accuracy of 71.2% for the testing dataset.
At this moment, the importance of social media is widely recognized from both social and financial points of view, as it is a powerful free online tool that can spread misinformation without investigation or personal filters. Due to the increasing number of people who get their daily information from social media, these platforms have become the most important "weapon" in misinformation campaigns. Recent papers [21] have revealed that fake and real news spread differently, making it possible to identify patterns that could be used in fake news detection systems.
An existing paper in the broader literature examined the challenges in fake news detection on social media using a logistic classifier [22], achieving an accuracy of 99.4% for the testing data.
The study of Shu et al. [23] revealed that the user's engagements are important and relevant. In addition, the analysis process of the huge amount of noisy data collected from social platforms is essential in understanding and detecting the misinformation phenomenon.
Guibon et al. [24] proposed different approaches for fake news detection systems and, based on redundant information, tried to find a connection between satire and fake news, achieving an accuracy of 93% for several datasets.
Some experiments that have been used to incorporate sentiment analysis (SA) in fake news detection approaches were presented by Zhang et al. [25], who stated that most existing papers on fake news are based on the strength of the emotions expressed by the publishers. Additionally, Ross and Thirunarayan [26] used in their study several sentimental features.
Nowadays, large-scale pre-trained language models have become very important in fake news detection, and the first Romanian transformer-based language model was proposed by Dumitrescu, Avram, and Pyysalo [27]. In this paper, a Romanian pretrained BERT model named RoBERT [28] is used for the experiments.

Dataset Details
We collected a dataset of fake and real news published between 2016 and 2021:
• Fake news: This dataset contains 12,767 news items that were automatically crawled from Romanian online platforms such as Fluierul [29], Vremuritulburi [30], and Cunoastelumea [31]. It is based on Rubrika [32], the first fully automatic news aggregator in Romania, which promotes articles only from trustworthy sources and provides a list of websites to avoid [33]. In addition to this automatically collected dataset, 297 more news items that were manually labeled as fake news were added. For example, after a fake news instance is manually annotated, several Romanian news sites that propagate the same information can be identified, and these items are added to the dataset by a human using a web application system, which labels them as fake with a simple button.
• Real news: This dataset contains 25,841 news items and was manually collected from Romanian official sources such as Agerpres [34], Mediafax [35], and Rador [36]. Each article's content is verified and annotated as real news.
The dataset includes only Romanian content that was automatically detected with a PHP library named Text Language Detect [37]. The evaluation and annotation process was performed by 12 employees, consisting of males and females aged between 34 and 49 years, in a public institution in Romania.

Dataset Description
For the research presented herein, the dataset was split into train (59.48%), validation (20.26%), and test (20.26%) sets. Additionally, to maintain the balance of the class distribution, only a randomly selected 50.55% of the real class was used for this experiment. Table 1 presents the distribution of the fake and real classes in the dataset. It shows that our dataset was balanced and that the training dataset was approximately three times larger than the validation and test datasets. Table 2 presents two examples of real and fake news, and Figure 1 presents the distribution of words across the fake and real datasets. Table 3 shows that the average number of words in the fake dataset was three times larger than that in the true dataset, and the vocabulary size, consisting of unique Romanian words from The Explanatory Dictionary of the Romanian Language (DEX), was twice as large for the fake dataset as for the real dataset.
Table 2. Examples of fake and real news from dataset.
In some cases, the differences between fake and real news could be rather small. The presented example was very hard to identify because it was necessary to search for the employee in the public institution's archive after he said that he had been injured as a soldier, which turned out to be a lie.
Pre-processing was applied to minimize noisy data and provide simple, complete, and consistent datasets. Specifically, words with more than 5000 occurrences in the fake and real training datasets were removed. Some examples of such words included "Romania" (15,461), "arta" (14,769), "national" (13,729), "military" (19,111), and "present" (8085).
This paper proposes a method that receives as input pre-processed news to reduce the chances of underfitting or overfitting and performs several analyses and transformations, using the term frequency-inverse document frequency (TF-IDF) for feature extraction.
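The pre-processing and feature extraction steps described above can be illustrated with a minimal sketch. It assumes whitespace-tokenized documents, the plain unsmoothed weighting tf × log(N / df), and a toy frequency threshold of 2 in place of the 5000 occurrences used for the real corpus. The exact TF-IDF variant used in the experiments is not stated in the text, so the function names and the formula below are illustrative assumptions only.

```python
import math
from collections import Counter


def remove_frequent_words(docs, max_count):
    """Drop every word whose total count over the corpus exceeds max_count."""
    totals = Counter(word for doc in docs for word in doc)
    return [[w for w in doc if totals[w] <= max_count] for doc in docs]


def tf_idf(docs):
    """Return one {word: weight} dictionary per tokenized document.

    Assumed variant: relative term frequency times log(N / document frequency).
    """
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            w: (c / len(doc)) * math.log(n_docs / doc_freq[w])
            for w, c in counts.items()
        })
    return vectors


# Toy corpus (hypothetical sentences, not taken from the dataset).
docs = [
    "romania stire falsa despre armata".split(),
    "romania stire reala despre economie".split(),
    "romania anunt oficial".split(),
]

filtered = remove_frequent_words(docs, max_count=2)  # "romania" occurs 3 times
weights = tf_idf(filtered)
print(weights[0])
```

In this toy corpus, a word that appears in a single document ("falsa") receives a higher weight than a word shared by two documents ("stire"), which is the behavior that makes TF-IDF useful as a feature for the classifiers.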

Experiments
For the experiments presented in this paper, 4 NVIDIA Tesla V100 GPU Accelerators with 32 GB RAM and 5120 CUDA cores were used. The proposed method (as shown in Figure 2) consisted of two classical algorithms (NB and SVM), three deep learning models (LSTM, CNN, and GRU), and two variants of BERT.

Proposed Models
• Classical algorithms: The classical machine learning models are based on supervised classifiers such as Naïve Bayes and Support Vector Machine. Each traditional algorithm learns in different ways. The Naïve Bayes algorithm is based on Bayes' theorem to evaluate and choose the highest probability of new data belonging to one of the classes defined in the dataset. The SVM classifier finds the best hyperplane that separates the data into two classes (fake vs. real) with the highest margin. For experiments, the SVM algorithm uses an SVC linear kernel, and the NB algorithm uses multinomial Naïve Bayes.
• Deep learning models: Three types of deep neural network models were investigated. The first two were recurrent neural network architectures using LSTM and GRU. The third type was a CNN architecture, a class of deep neural network mostly used in computer vision tasks. For the experiments, these architectures used the optimal parameters achieved during the random search optimization phase and binary crossentropy as a loss function.
• Transformer models: Transformers are a type of neural network model introduced by Vaswani et al. [38] to solve the issue of sequence transduction or neural machine translation. The most popular NLP model that uses a transformer is BERT, introduced by Devlin et al. [39], which is a model that learns contextual embeddings from both sides of a token's context during the training phase.
The two applications of BERT are "pretraining" and "fine-tuning". For the pretraining process, BERT uses the masked language model and next sentence prediction. In this research, for the fine-tuning process, the Romanian pretrained model RoBERT was used. Currently, there are three uncased versions available: RoBERT-small, RoBERT-base, and RoBERT-large.
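To make the multinomial Naïve Bayes decision rule from the list above concrete, the following is a minimal sketch written from scratch. It assumes raw word counts as features and Laplace smoothing with alpha = 1. The class name, the smoothing value, and the toy documents are hypothetical and are not taken from the experiments, which may rely on a library implementation and TF-IDF features.

```python
import math
from collections import Counter


class MultinomialNaiveBayes:
    """Tiny multinomial Naive Bayes over tokenized documents (illustrative)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing constant (assumed value)

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.vocab = {w for doc in docs for w in doc}
        self.word_counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc)
        label_counts = Counter(labels)
        self.log_prior = {
            c: math.log(label_counts[c] / len(docs)) for c in self.classes
        }
        self.total_words = {
            c: sum(self.word_counts[c].values()) for c in self.classes
        }
        return self

    def log_score(self, doc, c):
        """log P(c) + sum of log P(word | c), skipping unseen words."""
        denom = self.total_words[c] + self.alpha * len(self.vocab)
        score = self.log_prior[c]
        for w in doc:
            if w in self.vocab:
                score += math.log((self.word_counts[c][w] + self.alpha) / denom)
        return score

    def predict(self, doc):
        """Choose the class with the highest posterior log-probability."""
        return max(self.classes, key=lambda c: self.log_score(doc, c))


# Hypothetical toy training data.
train_docs = [
    ["soc", "conspiratie", "secret"],
    ["conspiratie", "ascuns", "soc"],
    ["guvern", "anunt", "buget"],
    ["minister", "anunt", "oficial"],
]
train_labels = ["fake", "fake", "real", "real"]

model = MultinomialNaiveBayes().fit(train_docs, train_labels)
print(model.predict(["conspiratie", "soc"]))
print(model.predict(["anunt", "oficial"]))
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied, which matters for long news articles.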

Deep Learning Architectures
This section lists the software packages and version numbers used (Table 4) and provides an overview of the deep neural network models used in this research, their main hyperparameters (Table 5), and their architectures. The models were fine-tuned for three epochs with an Adam optimizer.
• Long short-term memory: LSTM networks are a type of recurrent neural network with the capability to learn a mapping between the input and output patterns. For the experiments, the LSTM model consisted of 1 layer with 128 units that decreased the embedding vector from 5000 to 128, a dropout layer (0.2), and 2 dense layers, using 32 as the batch size and 32 neurons. The details of the LSTM architecture used in this work are presented in Table 6.
• Convolutional neural network: A CNN is a deep learning architecture successfully used to extract features from images and to classify text documents. For this architecture, the convolution layer has 250 filters with a kernel size of 3, which decreases the embedding vector from 5000 to 4998. Max-pooling, rectified linear unit (ReLU) activation, and dropout layers were added to the proposed CNN model, and the outputs were passed through a dense layer. The CNN architecture is described in Table 7.
• Gated recurrent units: GRUs are among the latest generation of recurrent neural networks; their hidden state transfers useful information based on two gates: a reset gate and an update gate. In this architecture, the GRU model consists of one layer with 128 units, followed by dropout, activation (TanH, the hyperbolic tangent), and dense layers. The details of the GRU architecture used in this work are presented in Table 8.
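The reduction from 5000 to 4998 positions in the CNN follows from applying a kernel of size 3 without padding and with a stride of 1: the output length is 5000 − 3 + 1 = 4998. The sketch below reproduces this for a single filter on a single input channel. The filter weights and the input values are arbitrary, and the use of global max pooling is an assumption, since the text does not state which pooling variant was used.

```python
def conv1d_valid(sequence, kernel):
    """Slide the kernel over the sequence with stride 1 and no padding."""
    k = len(kernel)
    return [
        sum(sequence[i + j] * kernel[j] for j in range(k))
        for i in range(len(sequence) - k + 1)
    ]


def relu(values):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, v) for v in values]


def global_max_pool(values):
    """Keep only the strongest activation of the feature map (assumed variant)."""
    return max(values)


# Stand-in for one channel of a 5000-step input (arbitrary values).
sequence = [float(i % 7) for i in range(5000)]
# One filter of size 3 with arbitrary weights; the model in the text has 250.
kernel = [0.5, -1.0, 0.5]

feature_map = relu(conv1d_valid(sequence, kernel))
print(len(feature_map))  # 4998
pooled = global_max_pool(feature_map)
```

Each of the 250 filters would produce one such feature map, and the pooled values would then be passed on to the dense layer.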

Transformer Architectures
The transformer model is an encoder-decoder architecture using a multi-headed attention layer to increase the speed of the training process and excel in specific NLP tasks such as voice conversion or text-to-speech transformation.
In this research, two versions of Romanian pretrained BERT models were used: RoBERT-small and RoBERT-large. RoBERT is a Romanian pretrained model that is based on a multi-layer bidirectional transformer. It was pretrained on a large Romanian corpus collected from several sources such as Wikipedia, Oscar [40], and the RoTex collection [41]:
• RoBERT-small (see Table 9) contains fewer weights (19M) and fewer trainable layers.
• RoBERT-large (see Table 9) contains more weights (341M) and twice the number of trainable layers, having the same layer sizes as BERT-large. Furthermore, the authors of the RoBERT models followed the same methodology proposed by Devlin et al. to train their models.
The BERT models were trained over 30 epochs with the Adam optimizer [42], with a learning rate of 3 × 10⁻⁵ and a maximum sequence length of 512. Moreover, two dense layers were added to these models: the first with ReLU activation followed by a dropout layer (0.2), and the second a Softmax layer. Table 10 presents the main hyperparameters used in the training process.
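The classification head placed on top of the BERT encoder can be sketched as a dense layer with ReLU activation followed by a Softmax output over the two classes. The sketch below uses the 256 to 32 reduction that the results section mentions for RoBERT-small. The random weights, the random input vector standing in for the encoder output, and all function names are assumptions made only for illustration.

```python
import math
import random


def dense(x, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [
        sum(xi * wi for xi, wi in zip(x, row)) + b
        for row, b in zip(weights, bias)
    ]


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(logits):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classification_head(pooled, w1, b1, w2, b2):
    hidden = relu(dense(pooled, w1, b1))
    # Dropout (0.2) is only active during training, so it is omitted here.
    return softmax(dense(hidden, w2, b2))


random.seed(0)
in_dim, hidden_dim, n_classes = 256, 32, 2  # 256 -> 32 as for RoBERT-small


def random_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


w1, b1 = random_matrix(hidden_dim, in_dim), [0.0] * hidden_dim
w2, b2 = random_matrix(n_classes, hidden_dim), [0.0] * n_classes

# Stand-in for the encoder output of one news item.
pooled = [random.uniform(-1.0, 1.0) for _ in range(in_dim)]
probs = classification_head(pooled, w1, b1, w2, b2)
print(probs)  # two probabilities (fake, real) that sum to 1
```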

Results and Discussion
This paper presented three architectures based on classical algorithms, deep learning models, and transformers. Tables 11 and 12 show the results of the test and validation datasets, which consisted of 5296 unique news items.
• Classical algorithms: From Table 12, it can be observed that the Naïve Bayes algorithm obtained a better F1 score of 97.50% for the test dataset compared with the Support Vector Machine algorithm, which obtained an F1 score of 94.70%. The results were similar for the validation set. There are studies and models that suggest using Naïve Bayes with n-gram (bigram TF-IDF) features to outperform the standard machine learning systems for online fake news detection approaches, achieving almost 94% accuracy on multiple corpora [43].
• Deep learning models: In this research, the differences between the neural network models' performances were small. The CNN architecture obtained an F1 score of 97.80% for the validation dataset and an F1 score of 98.20% for the test dataset, outperforming the LSTM and GRU models. For example, instead of just using CNN models, another study [44] proposed a hybrid deep learning architecture that combines the CNN and RNN models trained on several datasets.
• Transformer models: As already mentioned, this research used two Romanian pretrained models for the BERT experiments. The RoBERT-small model obtained a better F1 score of 92.50%, while RoBERT-large's was only 88% for the test dataset, with similar results for the validation dataset. This was due to the first dense layer of the BERT models, which decreased the dense vector from 1024 to 512 for RoBERT-large and from 256 to 32 for RoBERT-small, indicating that the RoBERT-small model was more efficient for our datasets, generating fewer false positives.
Future research should consider the potential effects of Language Understanding with Knowledge-Based Embeddings (LUKE), a new model based on the transformer that outperformed the BERT and RoBERTa [45] models, achieving an F1 score of 95%. LUKE [46] was evaluated on the Stanford Question Answering Dataset [47].
• Fake news and sentiment analysis: The sentiment expressed in the fake news dataset had a significant role, and some researchers such as Alonso et al. [48] and Bhutani et al. [49] proposed different fake news detection systems that incorporated sentiment as an important feature. Therefore, a sentiment analysis method [50] was applied to the test dataset, based on an algorithm that achieved an F1 score of 82% using a Romanian dictionary of 42,497 labeled words with 3 levels for the positive and negative polarities. Table 13 shows that 99.96% of the fake news dataset contained a neutral polarity, indicating that in these campaigns of fake news, the impartial connotation was predominant. Moreover, Figure 3 presents as a word cloud several words with positive (left side) and negative polarities (right side) from the fake news employed in the proposed system.
• Fake news and irony: Even if irony was not used as a legitimate way of communication, and most of the recent papers tried to establish a connection between satire and fake news, in this paper, a solution to finding the possible relations between irony and fake news was provided. Therefore, an automatic irony detection approach [51] was applied to the test dataset based on the Naïve Bayes algorithm, achieving an F1 score of 91%. Table 14 shows that 24.05% of the fake news contained irony, suggesting that ironic articles from online media besides fake news were used very often in misinformation campaigns in order to denigrate institutions or even public figures.
There are some potentially open questions about the reliability of several news pieces used in this experiment that were automatically collected from Times New Roman [52] and may have contained fake content. The presented results (see Table 12) show that the convolutional neural network architecture provided a better score than other models such as LSTM, GRU, or BERT. The small score differences achieved by these models suggest that it is necessary to measure the reliability of the proposed system by applying statistically significant tests, and future studies should include such evaluation processes.
In addition, the results obtained by using the sentiment analysis and irony detection systems (see Tables 13 and 14) suggest a connection between irony, polarity, and the fake news phenomenon, with neutral polarity and ironic remarks both appearing frequently in fake news.
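For reference, the F1 score used to compare the models throughout this section is the harmonic mean of precision and recall. The following minimal sketch computes it for a binary task and assumes that "fake" is treated as the positive class. Whether the reported scores are binary, macro-averaged, or weighted is not stated in the text, and the toy labels below are hypothetical.

```python
def precision_recall_f1(y_true, y_pred, positive="fake"):
    """Compute precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold labels and predictions for six news items.
y_true = ["fake", "fake", "fake", "real", "real", "real"]
y_pred = ["fake", "fake", "real", "fake", "real", "real"]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

In this toy case there are two true positives, one false positive, and one false negative, so precision, recall, and F1 all equal 2/3.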

Conclusions
The fake news phenomenon is spreading every day, from public discussion up to the level of research, which makes automatic detection with machine learning systems very important. This paper supports these statements by defining the necessary environment that would lead to the identification and processing of available data, which will help us to control this phenomenon.
In this paper, a fake news detection method was proposed by choosing the best result among traditional classifiers, such as Naïve Bayes and Support Vector Machine, and neural network models based on long short-term memory, convolutional neural networks, gated recurrent units, and BERT. The proposed approach achieved the best score with the convolutional neural network, indicating the superiority of deep neural network architectures over classical machine learning. Furthermore, according to the results that were obtained during the evaluation process, by applying an SA and irony detection system, a correlation could be associated between irony or neutral sentiments and fake news in Romanian online news. In addition, our experiments confirmed that human evaluation provides paramount contributions for Romanian datasets and achieves significant improvements for fake news detection approaches when using a convolutional neural network. Our results for the Romanian language are consistent with the literature covering English news, in which some studies investigated and analyzed several datasets using neural network architectures, achieving better performance for the convolutional neural network (FNDNet) [53] and yielding an accuracy of 98.36% for the test data. Moreover, the paper of Martínez-Gallego [54] addressed the problem of fake news detection, achieving an accuracy of up to 80% for a language with Latin roots such as Spanish by using a combination of a pretrained BETO (Spanish BERT) model and an LSTM architecture.
The system proposed in this paper was integrated in a mobile application [55] designed and developed to give users access to reliable Romanian information and to inspire other researchers to use this model.
Nowadays, governments and public institutions are using different machine learning algorithms to automate the important processes of several departments, such as human resources and claims revenue, which makes it necessary to follow several ethical rules such as transparency, responsibility, or fairness and to avoid epistemic and normative concerns. The exclusive use of algorithms and AI systems may encounter errors or have insufficient data for selecting the best ways to work, but these systems should deliver socially good outcomes, and ethical and technological analyses are mandatory.