It's All in the Embedding! Fake News Detection Using Document Embeddings

With the current shift in the mass media landscape from journalistic rigor to social media, personalized content is becoming the new norm. Although the digitalization of the media brings many advantages, it also increases the risk of spreading disinformation, misinformation, and malinformation in the form of fake news. The emergence of this harmful phenomenon has managed to polarize society and manipulate public opinion on particular topics, e.g., elections, vaccinations, etc. Such information propagated on social media can distort public perceptions and generate social unrest while lacking the rigor of traditional journalism. Natural Language Processing and Machine Learning techniques are essential for developing efficient tools that can detect fake news. Models that use the context of textual data are essential for resolving the fake news detection problem, as they manage to encode linguistic features within the vector representation of words. In this paper, we propose a new approach that uses document embeddings to build multiple models that accurately label news articles as reliable or fake. We also present a benchmark on different architectures that detect fake news using binary or multi-labeled classification. We evaluated the models on five large news corpora using accuracy, precision, and recall. We obtained better results than more complex state-of-the-art Deep Neural Network models. We observe that the most important factor for obtaining high accuracy is the document encoding, not the classification model's complexity.


Introduction
With the increase in the digitalization of mass media, new journalistic paradigms for information distribution have emerged. These new paradigms have substantially changed the way society consumes information. By trying to stay ahead of the competition, people who report on world events sometimes leave behind the rigors of classical journalism and publish their content as soon as possible in order to "go viral" by obtaining as many views, likes, comments, and shares as possible in a short amount of time [1]. This new paradigm centers on the users, catering to their needs, behavior, and interests. Along with the advantages the digitalization of mass media brings, it also increases the risk of misinformation, with potentially detrimental consequences for society, by facilitating the spread of false content [2,3] in the form of fake news (which influenced the Brexit referendum [4], the 2016 US presidential election [5], COVID-19 vaccinations [6], etc.).
Fake news consists of news articles that are intentionally and verifiably false. This type of information aims to mislead readers by presenting alleged, real-seeming facts about social, economic, and political subjects of interest [7]. However, the current technological trends make this type of content harmful, with potentially dire consequences to the community (e.g., public polarization regarding elections). This has become a major challenge for democracy. Information propagated online may lack the rigor of classic journalism, and can, therefore, distort public perceptions, cause false alarms, and generate social unrest. Furthermore, the president of the EU, Ursula von der Leyen, has repeatedly condemned and asked for immediate action to be taken against the spread of fake news that undermines democracy and public health [8]. Thus, the ideological polarization of readers through the spread of fake news is an important issue and requires scholarly attention. We believe that designing and building tools and methods for accurately detecting fake news is of great relevance, and thus, our results will have an overall positive impact.
In this paper, we propose a new approach that uses document embeddings (i.e., DOCEMB) for detecting fake news. We also present our benchmark on different architectures that detect fake news using binary or multi-labeled classification. The document embeddings are constructed using (1) word embeddings trained on each dataset selected for the experiments, and (2) pre-trained transformers. We employ TFIDF, word embeddings (i.e., WORD2VEC, FASTTEXT, and GLOVE), and transformers (i.e., BERT, ROBERTA, and BART) to create DOCEMB, our new document embedding approach. For classification, we train both classical Machine Learning models (i.e., Naïve Bayes, Gradient Boosted Trees) and Deep Learning models (i.e., Perceptron, Multi-Layer Perceptron, [Bi]LSTM, [Bi]GRU).
In our experiments, we analyze the performance of the DOCEMB-based detection solution on multiple datasets annotated with either binary or multi-class labels. We use 4 binary datasets, i.e., a sample of 20 000 manually annotated news articles from the Fake News Corpus, Liar, Kaggle, and Buzz Feed News. In addition, we use 2 multi-labeled datasets, i.e., Liar with 6 labels and TSHP-17 with 3 labels. As evaluation metrics, we use accuracy, precision, and recall.
We compare our results with state-of-the-art Deep Neural Networks models. Our method outperforms these models on each dataset. The most important takeaway from our experiments is that we empirically show that: (1) A simpler neural architecture offers better or at least similar results compared to complex architectures that employ multiple layers, and (2) The difference in performance lies in the embeddings used to vectorize the textual data and how well these perform in encoding contextual and linguistic features.
The main contributions of this article are as follows. (C1) We propose a new document embedding (DOCEMB) constructed using word embeddings and transformers. We specifically trained the proposed DOCEMB on the five datasets used in the experiments.
(C2) We show empirically that simple Machine Learning algorithms trained with our proposed DOCEMB obtain similar or better results than deep learning architectures specifically developed for the task of binary and multi-class fake news detection. This contribution is important in the machine learning literature because it shifts the focus from the classification architecture to the document encoding architecture.
(C3) We present a new manually filtered dataset. The original dataset is the widely used Fake News Corpus, which was annotated with an automatic process.
This paper is structured as follows. Section 2 discusses current research on the topic of fake news detection. Section 3 introduces our approach and presents the different modules and models employed. Section 4 presents the datasets and analyzes the results. In Section 5, we summarize our key findings and discuss the major implications. Section 6 presents the conclusions and outlines directions for future work.

Related Work
As views and clicks monetize online media, for some publishers, it is most important to provide news that might interest their audience, to the detriment of the quality of the facts reported [9]. Thus, proper journalistic rigor has come under threat through the online spread of fake news.
Wang [10] employed SVM (Support Vector Machine), LogReg (Logistic Regression), BiLSTM (Bidirectional Long Short-Term Memory), and CNN (Convolutional Neural Network), to detect the veracity of ∼13 K short statements. The preprocessing was done using Google News' pre-trained WORD2VEC embeddings. Conroy et al. [11] present analysis methods based on linguistic and syntactic features for discovering fake news.
Many current approaches employ complex Deep Neural Network architectures, e.g., based on CNN (Convolutional Neural Network) [12,13,14], BiLSTM (Bidirectional Long Short-Term Memory) [15], and others. Ilie et al. [16] used multiple deep neural networks to determine how models that use pre-trained and specific trained word embeddings perform in the task of fake news detection. Further, some solutions use advanced document embeddings based on encoder architectures [17]. Kaliyar et al. [18] propose FakeBERT, an extension of FNDNet that uses BERT instead of GLOVE embeddings. Kula et al. [19] used a hybrid architecture for fake news detection that connects BERT with recurrent networks while Mersinias et al. [20] introduced CLDF, a new vectorization technique for extracting feature vectors. The results for CLDF, FNDNet, and FakeBERT were obtained using the Kaggle dataset with ∼21 K news articles.
Different ensemble models have also been used for this task, with good results [21,22]. Mondal et al. [21] used a voting-based ensemble method that relies on the voting of the collective majority. The authors employ only non-deep learning models and TF-IDF as the vectorization technique. Aslam et al. [22] used an ensemble-based deep learning model that combines two architectures, i.e., Bi-LSTM-GRU-Dense and Dense. Truică and Apostol [23] propose MisRoBAERTa, a BERT- and ROBERTA-based ensemble model for fake news detection.
Sedik et al. [24] propose a deep learning approach that uses both sequential and recurrent layers. The sequential models employ stacked CNNs (i.e., the CNN model) or concatenated CNNs (i.e., the C-CNN model), while the recurrent models use stacked CNNs with LSTM and Dense layers (i.e., the CNN-LSTM model) or a simple GRU with a Dense layer (i.e., the GRU model). The experimental results using the binary labeled Kaggle and Fake News Challenge datasets show that C-CNN and CNN-LSTM have the best performance, i.e., C-CNN obtains an accuracy of 99.90% on the Kaggle dataset, and CNN-LSTM obtains an accuracy of 96% on the Fake News Challenge dataset.
Several current solutions are based on linguistic and syntactic features, e.g., WELFake [25], which uses word embeddings over linguistic features. In other current directions, multimodal learning that integrates comments [26], images [27], and the social and network context has been used [26,28,29]. Wang et al. [30] propose a knowledge-driven Multimodal Graph Convolutional Network model for detecting fake news from textual and visual information. This solution models posts from social media as graph data structures that combine textual and visual data with knowledge concepts.
Le and Mikolov [31] propose Doc2Vec as an extension to Word2Vec. Doc2Vec computes a feature vector for every document in the dataset, as opposed to Word2Vec, which computes one for every word in the dataset. Several articles have discussed the use of Doc2Vec for fake news detection, but only as a baseline combined with traditional Machine Learning solutions. Cui et al. [32] use Doc2Vec with SVM as a baseline and compare it with graph-based Deep Learning solutions. Singh [33] presents several experiments on the LIAR and Kaggle datasets using different vector space representations, i.e., one-hot encoding, TFIDF, Word2Vec, and Doc2Vec. Truică et al. [7] propose a BiLSTM architecture with Sentence Transformer for the fake news detection challenge at CheckThat! Lab 2022. The proposed architecture uses BART for the monolingual fake news detection task and XLM-RoBERTa for the multilingual task. For the multilingual task, the model relies on transfer learning. Thus, the BiLSTM XLM-RoBERTa model is trained on English and tested on a German dataset. The proposed model managed to obtain an accuracy of 0.53 for the first task and an accuracy of 0.28 for the second task.
Methodology
Figure 1 presents the pipeline of our proposed solution. A labeled corpus of news articles is first preprocessed to extract the tokens. Then, the tokens are transformed into a vector model using term weighting schemes (TFIDF) and word/transformer embeddings. These vectors are used to create document embeddings. We also use the raw corpus to create document embeddings using transformers. The vectorized documents are then passed to the classification module. Finally, the classification is evaluated using accuracy, precision, and recall.

Text Preprocessing
To prepare the text for vectorization, we use the following preprocessing steps to minimize the vocabulary and remove terms that bring no information gain [34]: (1) removal of punctuation and stopwords, and (2) word lemmatization. We chose to lemmatize the words to minimize the vocabulary and remove any language inflections. We do not apply these preprocessing steps when using the transformer embeddings.
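These preprocessing steps can be sketched as follows. The paper does not specify the toolkit; this minimal sketch uses scikit-learn's English stop-word list and omits the lemmatization step (which could be filled with, e.g., NLTK's WordNetLemmatizer):

```python
import re
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def preprocess(document: str) -> list[str]:
    """Lowercase, strip punctuation/digits, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", document.lower())
    # In the full pipeline each remaining token would also be lemmatized
    # (e.g., with NLTK's WordNetLemmatizer) to remove language inflections.
    return [t for t in tokens if t not in ENGLISH_STOP_WORDS]

print(preprocess("The senators were debating the new election laws."))
```

The resulting token lists are the input to the term-weighting and word-embedding modules described next.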

Term Weighting
To vectorize the preprocessed documents, we employ TFIDF (Equation (1)). To compute this metric, we first need to compute (1) the term frequency TF (Equation (2)) and (2) the inverse document frequency IDF (Equation (3)). For a set of n documents D = {d_i | i ∈ 1..n}, we extract the set of m unique terms V = {t_j | j ∈ 1..m}. This set of unique terms is called a vocabulary. For each term, we compute the raw frequency f(t_j, d_i), which counts the number of occurrences of a term t_j in a document d_i. The raw frequency does not store context and is biased towards longer documents. Thus, to remove the bias, we normalize the frequency by the length of the document, Σ_{t'∈d_i} f(t', d_i), and obtain TF [35]. Furthermore, to minimize the importance of common terms that bring no information value, IDF (Equation (3)) is used to reduce the TF weight by a factor that grows with the collection frequency n_j of a term t_j, i.e., n_j is the number of documents that contain at least one occurrence of term t_j. Finally, to normalize TFIDF to the [0, 1] range, we use the ℓ2-norm (Equation (4)).
Using TFIDF, we construct an n×m document-term matrix X = {x_ij | i ∈ 1..n, j ∈ 1..m} (X ∈ R^{n×m}) where rows correspond to documents and columns to terms. The value x_ij = TFIDF(t_j, d_i, D) represents the weight of term t_j in document d_i. Thus, each document d_i is represented by a vector x_i = {x_ij | j ∈ 1..m}. For ease of notation, we use x_i to denote the rows of X.
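The ℓ2-normalized document-term matrix can be obtained with an off-the-shelf implementation; a minimal scikit-learn sketch (the corpus is illustrative, and scikit-learn's smoothed IDF variant may differ slightly from Equation (3)):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the election results were verified by officials",
    "officials deny the fabricated election claims",
    "a new vaccine study was published today",
]

# norm="l2" matches the l2-normalization of Equation (4).
vectorizer = TfidfVectorizer(norm="l2")
X = vectorizer.fit_transform(corpus)   # n x m document-term matrix

print(X.shape)                         # (3, vocabulary size)
# Each row (document vector) has unit l2 norm:
print(np.allclose(np.linalg.norm(X.toarray(), axis=1), 1.0))
```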

Word Embeddings
Each word from the vocabulary is transformed into its vector representation. This module employs WORD2VEC [36,37], FASTTEXT [38], and GLOVE [39]. For WORD2VEC and FASTTEXT, we use both the CBOW (Continuous Bag-of-Words) and SG (Skip-Gram) models. By using these models, we obtain the embedding WordEmb(t) for each term t ∈ V.

WORD2VEC
The WORD2VEC [36,37] embedding model is used to create vectorized representations of the words in a dataset within the same vector space. This representation measures the distance between the corresponding vectors in this space to determine the context similarity. For WORD2VEC, there are two models for representing the words in this vector space: Continuous Bag-Of-Words (CBOW) or Skip-Gram.
CBOW Model The CBOW model attempts to predict a word using the context given by its surrounding words. Each word t_i ∈ V (i ∈ 1..m) is defined by two d-dimensional vectors (with d ≥ 2 a natural number, i.e., d ∈ N), depending on its function in training: (1) v_ti ∈ R^d is defined when t_i is used as the center word, and (2) u_ti ∈ R^d is defined when t_i is used as a context word. The conditional probability of generating any center word t_c given its surrounding context words T_o = {t_1, ..., t_{c−1}, t_{c+1}, ..., t_s} within a context window of size s can be modeled by a probability distribution p(t_c | T_o) (Equation (5)) that applies a softmax over the average of the context vectors, v̄_o = (1/(s−1)) Σ_{t∈T_o} v_t.

Skip-Gram Model
The Skip-Gram model starts with the center word t_c as input and tries to generate its context. As in the CBOW case, the two d-dimensional vectors (d ∈ N and d ≥ 2), i.e., v_ti ∈ R^d and u_ti ∈ R^d, are defined for each word t_i ∈ V (i ∈ 1..m). The conditional probability of generating any context word t_o given the center word t_c can be modeled by a softmax operation (Equation (6)).
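In both models, the conditional probability reduces to a softmax over dot products of center and context vectors. A toy numpy sketch of the Skip-Gram probability of Equation (6), with random vectors standing in for trained embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 5, 4                      # toy vocabulary of 5 words, 4-dim embeddings
V = rng.normal(size=(m, d))      # v_t: center-word vectors
U = rng.normal(size=(m, d))      # u_t: context-word vectors

def skipgram_probs(center: int) -> np.ndarray:
    """p(t_o | t_c): softmax over u_o . v_c for every word o in the vocabulary."""
    scores = U @ V[center]
    scores -= scores.max()       # subtract max for numerical stability
    exp = np.exp(scores)
    return exp / exp.sum()

print(skipgram_probs(2))         # a distribution over the 5 vocabulary words
```

The CBOW probability of Equation (5) is obtained the same way, replacing V[center] with the average of the context vectors.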

FASTTEXT
FASTTEXT [38] is an extension to WORD2VEC and follows a similar approach to construct word embeddings [40]. The main difference between FASTTEXT and WORD2VEC is that FASTTEXT does not consider the word as the basic unit, but rather considers a bag of character n-grams. Using such an approach, the accuracy is improved, and the training time is decreased when compared to WORD2VEC. As in the case of WORD2VEC, FASTTEXT employs both CBOW and Skip-Gram models.

GLOVE
GLOVE (Global Vectors) [39] is another model used for creating word embeddings. To create the vector representation of words, GLOVE uses the word co-occurrences matrix. This matrix manages to encapsulate local and global corpus statistics regarding word-word co-occurrences. Thus, GLOVE for each word stores the frequency of its appearance in the same context as another word by employing a term co-occurrence matrix. Using the ratio of co-occurrence probability, GLOVE captures the relationship between words. Furthermore, GLOVE identifies word analogies and synonyms within the same contexts using this probability ratio.

Transformers Embeddings
To create transformer embeddings, we use BERT [41], ROBERTA [42], and BART [43]. By using these models, we obtain the transformer-based word embedding WordEmb(t) for each term t ∈ V.

BERT
BERT (bidirectional encoder representations from transformers) [41] is a deep bidirectional transformer architecture used for natural language understanding. Thus, in contrast to classic language models that treat textual data as unidirectional or shallowly bidirectional sequences of words, BERT learns contextual relations between the words by employing this deep bidirectional transformer architecture. Using the surrounding words of a given word, the model learns and creates a vector representation for each word that also encapsulates its context. Thus, BERT reads the entire sequence of words at once using the transformer encoder to create contextual word embeddings. By employing transfer learning, BERT can directly be used for various natural language processing, understanding, and generation tasks. Furthermore, it can be fine-tuned on new datasets to adapt to specific tasks. Experimental results on various tasks [41] show that language models built with BERT improve language context detection more than models that use static word embeddings, which only see textual data as sequences of words.

ROBERTA
ROBERTA (a Robustly optimized BERT pre-training Approach) [42] is a training optimization method for BERT. This model improves the language masking strategy of BERT by modifying the following key training aspects: (1) more data are used for training, (2) dynamic masking patterns are employed instead of static masking patterns, (3) the next-sentence pre-training objective is removed and replaced with full sentences without NSP (Next Sentence Prediction), (4) training is performed on longer sequences, (5) larger mini-batches are used, and (6) larger learning rates are employed. All these modifications improve ROBERTA's downstream task performance and mitigate some of the shortcomings of the significantly undertrained BERT model.

BART
BART (bidirectional and autoregressive transformer) [43] is a transformer model that employs the standard transformer-based neural machine translation architecture, i.e., a generalized BERT architecture. The pre-training process of BART uses an arbitrary noising function to corrupt the textual data within the dataset in order to make the transformer learn how to recreate the original text during training. During the pre-training of BART, two key techniques are used to improve the words' contextual representations. Firstly, the order of original sentences is randomly shuffled. Secondly, using a novel in-filling scheme, a single mask token is used to replace the spans of text. Experimental results [43] show that a fine-tuned BART works better than BERT for both text generation and comprehension tasks.

Document Embeddings
We create a vector for each document by averaging all the word or transformer embeddings of the words appearing in the document. Thus, if we have m_i terms in a document d_i, we obtain the document embedding (DOCEMB) x_i by summing the embeddings WordEmb(t) of the terms t that are present both in document d_i and in the vocabulary V, and dividing the sum by m_i (Equation (7)). Each document embedding creates a context for the words in a document and becomes an extension of the presented word embeddings.
Similarly to TFIDF, we construct a document-embedding matrix X = {x_i | i ∈ 1..n} where each row corresponds to the document embedding x_i. For this matrix, the columns are not associated with terms in the vocabulary V; instead, m is the size of the embedding vector. For ease of notation, we reuse m as the number of columns, although it differs from the number of terms in the vocabulary used for the document-term matrix. Thus, X ∈ R^{n×m} is an n × m matrix.
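The mean pooling of Equation (7) can be sketched in a few lines of numpy. The vocabulary and embedding table below are hypothetical stand-ins for a trained WordEmb:

```python
import numpy as np

# Hypothetical trained embeddings: one 100-dim vector per vocabulary term.
rng = np.random.default_rng(42)
vocab = {"election": 0, "fraud": 1, "vaccine": 2, "study": 3}
word_emb = rng.normal(size=(len(vocab), 100))

def doc_emb(tokens: list[str]) -> np.ndarray:
    """Average the embeddings of in-vocabulary tokens (Equation (7))."""
    vecs = [word_emb[vocab[t]] for t in tokens if t in vocab]
    return np.mean(vecs, axis=0)

x = doc_emb(["election", "fraud", "unknownword"])
print(x.shape)    # one fixed-size vector per document, regardless of length
```

Out-of-vocabulary tokens are simply skipped, so every document maps to a fixed-size vector usable by any downstream classifier.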

Fake News Detection
Classification is used to determine the veracity of a news article, i.e., fake news detection. Given a set of documents D represented by the matrix X ∈ R^{n×m} (either the document-term or the document-embedding matrix), a set of classes Y = {y_1, ..., y_n} with values in a discrete domain C = {c_k | k ∈ 1..κ} (Y ⊆ C) of size κ (i.e., the number of classes is κ), and an assignment x_i → y_i (i ∈ 1..n), the objective of classification is to predict the label ŷ_i = f(x_i) (ŷ_i ∈ Ŷ ⊆ C) that best approximates the true label y_i. For the fake news detection task, we employ the following algorithms to construct models: Naïve Bayes (NB), Gradient Boosted Trees (XGBTrees), Perceptron, Multi-Layer Perceptron (MLP), Long Short-Term Memory Network (LSTM), Bidirectional LSTM (BiLSTM), Gated Recurrent Units (GRU), and Bidirectional GRU (BiGRU). For comparison, we use MisRoBAERTa [23]. In the original article presenting MisRoBAERTa, the authors fine-tune both BART and ROBERTA. In this work, we use the pre-trained BART (facebook/bart-large) and ROBERTA (roberta-base) from HuggingFace [44].

Naïve Bayes
The Naïve Bayes (NB) model is a probabilistic classification algorithm that computes the probability of a class c_k given x_i (Equation (8)), where p(x_i) and p(c_k) are the probabilities of a document and a class, respectively, and p(x_i | c_k) is the likelihood of x_i given class c_k. Expanding x_i into its components {x_i1, ..., x_im}, we can rewrite Equation (8) as Equation (9).
The denominator p(x_i) is constant for a given input and can be dropped when comparing classes. Furthermore, the model assumes that all the terms are conditionally independent given a class c_k. Using these assumptions, the Naïve Bayes classifier estimates the class ŷ_i using Equation (10), i.e., ŷ_i = argmax_{c_k ∈ C} p(c_k) Π_{j=1}^{m} p(x_ij | c_k). There are various types of Naïve Bayes classifiers; the most common ones are Multinomial Naïve Bayes and Gaussian Naïve Bayes.
Multinomial Naïve Bayes Multinomial Naïve Bayes (MNB) models the distribution of words in a document by using a multinomial representation for the distribution of probabilities that a word appears for a certain class (Equation (11)). The assumption for this model is that a document is handled as a sequence of words. Also, it is assumed that each word position is generated independently of every other [45].
Gaussian Naïve Bayes The Gaussian Naïve Bayes (GNB) model is used when dealing with continuous data. The model is based on the assumption that the continuous values associated with each class are distributed according to a Gaussian distribution. Thus, given a column j ∈ 1..m of X and a class c_k, we employ the following steps:
• Segment the data by class c_k.
• Compute the associated mean µ_j and variance σ_j of dimension j using the values x_ij (i ∈ 1..n), only for the rows x_i ∈ X labeled with class c_k.
• Compute the probability p(x_ij | c_k) (Equation (13)).
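The three steps above can be sketched directly in numpy on synthetic two-class data (the data, priors, and variance-smoothing constant are illustrative, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic 2-class data: 100 rows, 3 continuous features,
# class 0 centered at 0 and class 1 centered at 2.
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(2, 1, (50, 3))])
y = np.array([0] * 50 + [1] * 50)

def gaussian_nb_predict(x):
    log_post = []
    for c in (0, 1):
        Xc = X[y == c]                                   # 1. segment by class
        mu = Xc.mean(axis=0)                             # 2. per-dimension mean
        var = Xc.var(axis=0) + 1e-9                      #    and variance
        # 3. Gaussian log-likelihood per dimension, plus the log prior
        ll = -0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var)
        log_post.append(np.log(len(Xc) / len(X)) + ll.sum())
    return int(np.argmax(log_post))

print(gaussian_nb_predict(np.array([2.1, 1.9, 2.0])))    # near class 1's mean
```

Working in log space avoids the numerical underflow that multiplying many small probabilities would cause.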

Gradient Boosted Trees
Gradient boosting is an ensemble method that combines multiple weak prediction learners. In the case of Gradient Boosted Trees, the weak learners are Decision Trees. As in other classification methods, the goal is to predict ŷ_i = f(x_i) that best approximates the true class y_i of x_i by minimizing an objective function L(Ŷ, Y) that represents the training loss. The model is built additively: at each step t, a new weak learner f^(t)(x_i) is added to best fit the residuals of the current ensemble. As the objective is to minimize the training loss, to obtain the specific objective at step t, we take the Taylor expansion of the loss function up to the second order for each learner and remove all the constants (Equation (15)).
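In practice this additive tree-fitting is handled by an off-the-shelf library; a minimal scikit-learn sketch on synthetic stand-in embeddings (the hyperparameters are illustrative, not the paper's):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
# Synthetic stand-in for document embeddings: 200 docs x 50 dims, 2 classes.
X = rng.normal(size=(200, 50))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Each boosting stage fits a shallow tree to the gradient of the loss
# on the current ensemble's residuals.
clf = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))    # training accuracy on the synthetic data
```

The paper's XGBTrees model follows the same boosting principle with the second-order (Taylor-expanded) objective described above.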

Perceptron
The Perceptron model (Equation (16)) is a simple non-linear processing unit that tries to predict the label ŷ_i for a given input x_i by adjusting a weight vector w ∈ R^m using the sigmoid activation δ_s(z) = 1/(1 + e^{−z}) ∈ [0, 1]. The objective for a good prediction is to minimize the average cross-entropy loss between the set of predictions Ŷ and the set of true labels Y (Equation (17)).
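A minimal numpy sketch of this unit, trained by gradient descent on the cross-entropy loss (the toy data, learning rate, and epoch count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] > 0).astype(float)           # toy binary labels

w = np.zeros(10)
b = 0.0
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for _ in range(500):                      # gradient descent on the loss (Eq. (17))
    y_hat = sigmoid(X @ w + b)            # prediction (Eq. (16))
    grad = y_hat - y                      # d(cross-entropy)/d(logit)
    w -= 0.1 * X.T @ grad / len(y)
    b -= 0.1 * grad.mean()

acc = ((sigmoid(X @ w + b) > 0.5) == y).mean()
print(acc)                                # training accuracy
```

The convenient cancellation grad = y_hat − y is exactly the derivative of the cross-entropy loss composed with the sigmoid.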

Multi-Layer Perceptron
The Multi-Layer Perceptron (MLP) model is a Deep Learning architecture that stacks multiple layers j ∈ 1..l of fully-connected Perceptron units. The MLP architecture can be divided into three components: (1) the input layer i (j = 1), (2) the hidden layers h_j (j ∈ 2..l−1), and (3) the output layer o = ŷ (j = l). Each node in layer j connects to every node in the following layer j + 1 with a certain weight W_j. Because the connections between the layers are directed from the input i to the output o by passing information through the hidden layers h_j, the MLP model is a feed-forward architecture. Equation (18) presents the MLP classification model at a given iteration t.

Long Short-Term Memory
Long Short-Term Memory (LSTM) [46] is a Recurrent Artificial Neural Network that uses two state components for classification. The first component, represented by a hidden state, is the short-term memory that learns the short-term dependencies between the previous and current states. The second component, represented by an internal cell state, is the long-term memory which stores long-term dependencies between the previous and current states. The model uses three gates to preserve the long-term memory within the state: (1) the input gate (i ∈ R^n), (2) the forget gate (f ∈ R^n), and (3) the output gate (o ∈ R^n). Equation (19) presents the compact forms of the state updates of the LSTM unit for a given iteration t, where:
• h^(t) ∈ R^n is the hidden state vector as well as the unit's output vector of dimension n, with initial value h^(0) = 0;
• c̃^(t) ∈ R^n is the input activation vector;
• c^(t) ∈ R^n is the cell state vector, with initial value c^(0) = 0;
• W_i, W_o, W_f, W_c ∈ R^{n×m} are the weight matrices corresponding to the current input for the input gate, output gate, forget gate, and cell state;
• V_i, V_o, V_f, V_c ∈ R^{n×n} are the weight matrices corresponding to the hidden output of the previous state for the input gate, output gate, forget gate, and cell state;
• b_i, b_o, b_f, b_c ∈ R^n are the bias vectors corresponding to the input gate, output gate, forget gate, and cell state;
• tanh is the hyperbolic tangent activation function;
• ⊙ is the Hadamard Product, i.e., the element-wise product.
We chose LSTM because it manages to avoid the vanishing and the exploding gradient issues by regulating the way the recurrent weights are learned.
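One step of the gated state updates described above can be sketched in numpy (random weights and illustrative dimensions, not a trained model):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 6                                        # hidden size, input size
W = {g: rng.normal(size=(n, m)) for g in "iofc"}   # input-to-gate weights
V = {g: rng.normal(size=(n, n)) for g in "iofc"}   # hidden-to-gate weights
b = {g: np.zeros(n) for g in "iofc"}

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev):
    i = sigmoid(W["i"] @ x + V["i"] @ h_prev + b["i"])        # input gate
    o = sigmoid(W["o"] @ x + V["o"] @ h_prev + b["o"])        # output gate
    f = sigmoid(W["f"] @ x + V["f"] @ h_prev + b["f"])        # forget gate
    c_tilde = np.tanh(W["c"] @ x + V["c"] @ h_prev + b["c"])  # input activation
    c = f * c_prev + i * c_tilde   # cell state: keep old memory, add new
    h = o * np.tanh(c)             # hidden state / output of the unit
    return h, c

h, c = lstm_step(rng.normal(size=m), np.zeros(n), np.zeros(n))
print(h.shape, c.shape)
```

The forget gate's multiplicative control over c_prev is what lets gradients flow across long sequences without vanishing.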

Bidirectional LSTM
As the LSTM model processes sequence data, it is able to capture past information. To also take future information into consideration, we use the Bidirectional LSTM (BiLSTM). The BiLSTM encapsulates past and future information through the use of two hidden states (Equation (20)): the forward hidden state →h^(t), which processes the input in a forward manner using the past information provided by the forward LSTM (LSTM_F), and the backward hidden state ←h^(t), which processes the input in a backward manner using the future information provided by the backward LSTM (LSTM_B). At every time step, the hidden states →h^(t) and ←h^(t) are concatenated into one hidden state h^(t) (Equation (21)). This approach enables the encoding of information from both past and future contexts in the hidden state h^(t).

Gated Recurrent Unit
The Gated Recurrent Unit (GRU) [47] is a Recurrent Artificial Neural Network that simplifies the LSTM unit and improves performance considerably. Instead of three gates as in the case of the LSTM, the GRU has only two gating mechanisms. The first gating mechanism is the update gate (u ∈ R^n). This gate encodes both the forget gate and the input gate that are present in the LSTM cell. The second gating mechanism is the reset gate (r ∈ R^n). This gate determines the percentage of information from the previous hidden state that contributes to the candidate state of the new step [48]. Furthermore, the GRU uses the hidden state as its only state component. Equation (22) presents the compact forms of the state updates of the GRU unit at a given iteration step t, where:
• i^(t) ∈ R^m is the input of the cell at step t;
• h̃^(t) ∈ R^n is the candidate hidden state with a cell dimension of n;
• h^(t) ∈ R^n is the current hidden state with a cell dimension of n;
• W_u, W_r, W_h ∈ R^{n×m} are the weight matrices corresponding to the current input for the update gate, reset gate, and hidden state;
• V_u, V_r, V_h ∈ R^{n×n} are the weight matrices corresponding to the hidden output of the previous state for the update gate, reset gate, and hidden state;
• b_u, b_r, b_h ∈ R^n are the bias vectors corresponding to the update gate, reset gate, and hidden state;
• ⊙ is the Hadamard Product.

Bidirectional GRU
Similar to the BiLSTM, the Bidirectional GRU (BiGRU) considers both past and future information by employing a forward and a backward GRU: (1) the forward hidden state →h^(t), which processes the input in a forward manner using GRU_F, and (2) the backward hidden state ←h^(t), which processes the input in a backward manner using GRU_B. As for the BiLSTM, the hidden states →h^(t) and ←h^(t) are concatenated at every time step to encode the information from both past and future contexts into one hidden state h^(t).

Evaluation Module
We use accuracy, precision, and recall [49] to evaluate the models. For binary classification with the classes positive and negative, the following information is used to construct a confusion matrix that is afterward used to compute the evaluation metrics: • tp (True Positive) is the number of positive observations that are correctly classified; • fn (False Negative) is the number of positive observations that are incorrectly classified as negative; • fp (False Positive) is the number of negative observations that are incorrectly classified as positive; • tn (True Negative) is the number of negative observations that are correctly classified.
Accuracy (Equation (24)) measures the overall effectiveness of a classifier. Precision (Equation (25)) measures the proportion of predicted positive labels that are correct. Recall (Equation (26)) measures the effectiveness of a classifier in identifying the positive labels.
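Given the confusion-matrix counts, the three metrics follow directly; a small sketch with illustrative counts:

```python
def metrics(tp: int, fn: int, fp: int, tn: int):
    """Compute accuracy, precision, and recall from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)   # overall effectiveness
    precision = tp / (tp + fp)                   # correctness of predicted positives
    recall = tp / (tp + fn)                      # coverage of the true positives
    return accuracy, precision, recall

print(metrics(tp=90, fn=10, fp=20, tn=80))   # accuracy 0.85, recall 0.9
```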

Experimental Results
In this section, we present the experimental results obtained using our methodology. Firstly, we introduce a human-verified sample from the Fake News Corpus [50] and present the results of the exploratory data analysis performed on it. Secondly, we present the experimental setup for our experiments as well as the hyperparameters and implementation packages for the models. Thirdly, we present the experimental results using the different document embeddings and classification methods on the Fake News Corpus sample. Lastly, we show the generalization of our observations by performing additional experiments on 5 additional datasets: LIAR multiclass [10], LIAR binary [51], Kaggle [12,18], Buzz Feed News [52], and TSHP-17 [53,54].

Dataset Details
For the experiments, we used a set of 20K English-language news articles (10K reliable and 10K fake) selected from the Fake News Corpus [50], as it is widely used in current research [55,56,57,16,23]. Some of the labels might not be correct because the original dataset was not manually annotated. However, this shortcoming should not pose a practical issue for classification models, as ML/DL models generalize better when some noise is added [58]; instead, it should help the models generalize better and reduce overfitting. Additionally, we made sure that the URLs of the selected articles point to the correct articles by matching the titles and authors.
Before performing the experiments, we verified the label correctness of the sampled news articles using computer science students as annotators. In total, 40 student annotators annotated 25K articles (12.5K reliable and 12.5K fake). We sampled more articles than needed in order to mitigate inconsistencies between two annotators as well as between the final annotation and the original label. For their annotation work, the students received credits for different courses.
Before annotating the articles, the students received an instruction list explaining the annotation task, which included the following steps: (1) Verify that the title matches the title from the URL; (2) Verify that the content matches the content from the URL; (3) Verify that the authors match the authors from the URL; (4) Verify that the source matches the source from the URL; (5) Verify whether the information is false or reliable. Each article was manually verified by two annotators. If there was no consensus between the two, a third annotator was used to break the tie; in 99% of the cases, a third annotator was not needed. From the experiments, we removed all the articles where no consensus was found, as well as those where the human annotation differed from the original label. In the end, we scaled down the sample to 20K news articles. Table 1 presents the corpus statistics and information before and after preprocessing. We observe that, although there is a small imbalance in the number of tokens between the classes, this imbalance is small enough not to add bias to the classification task. We also extracted the top 10 unigrams and the top topic using the NMF algorithm for topic modeling [59]. We used NLTK [60] for extracting unigrams and scikit-learn [61] for NMF. We computed the average similarity with PolyFuzz [62] by employing the pre-trained FASTTEXT embedding for news articles (sim(FT)) and the base BERT model (sim(BERT)). Analyzing both similarities, we conclude that the documents discuss the same topics (Table 1). For the neural networks (i.e., Perceptron, MLP, LSTM, BiLSTM, GRU, BiGRU), we use one-hot encoders to represent the labels.

Experimental Setup
In our experiments, we analyzed how well we can predict whether an article is fake or reliable using multiple vectorizations: (1) the TFIDF vector space model, and (2) document embeddings (DOCEMBs) built from word and transformer embeddings [67].
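One common way to build a document embedding from word embeddings is to average the word vectors of a document's tokens; the sketch below is a hedged illustration of that construction (not necessarily the exact method of [67]), assuming a pre-loaded `word_vectors` lookup from a pre-trained model such as WORD2VEC, FASTTEXT, or GLOVE:

```python
import numpy as np

def document_embedding(tokens, word_vectors, dim=300):
    """Average the word vectors of a document's tokens into a single DOCEMB.
    Out-of-vocabulary tokens are skipped; an empty document maps to zeros."""
    vectors = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)
```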
The neural-based fake news detection module used the Multi-Layer Perceptron, LSTM, Bidirectional LSTM, GRU, and Bidirectional GRU models. Each recurrent layer consists of 100 cells. The LSTM was configured as in [46], while the GRU was configured as presented in [47]. A Dense layer with 2 units and a sigmoid activation function was used as the output.
For the LSTM, BiLSTM, GRU, and BiGRU models, we used the ADAM optimizer and a batch size of 64. All the neural network models were trained for 100 epochs with an Early-Stopping mechanism to mitigate overfitting. We employed Keras to implement the neural models. For comparison, we used the freely available implementation of MisRoBAERTa [23], made available by the authors on GitHub.
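The recurrent models described above can be sketched in Keras. This is a hedged sketch: the loss function, the early-stopping patience, and the way document representations are batched into sequences are assumptions for illustration, not taken from our configuration:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_bilstm(embedding_dim=768, num_classes=2):
    """Bidirectional LSTM with 100 cells per direction and a 2-unit
    sigmoid output layer, mirroring the configuration in the text."""
    model = keras.Sequential([
        keras.Input(shape=(None, embedding_dim)),  # sequence of embedding vectors
        layers.Bidirectional(layers.LSTM(100)),
        layers.Dense(num_classes, activation="sigmoid"),
    ])
    # The loss choice is an assumption; the text only fixes ADAM and batch size 64.
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# Early stopping over the 100-epoch budget (patience is an assumed value):
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                           restore_best_weights=True)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=100, batch_size=64, callbacks=[early_stop])
```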
The code will be made available on GitHub upon the acceptance of this work.

Fake News Detection
For the experiments, we used an NVIDIA® DGX Station™. Table 2 presents the results for the fake news detection task. We tested the models in 10 rounds, using a 70%-10%-20% train-validation-test split ratio. Each round shuffles the dataset and extracts a stratified sample initialized with a different random seed. We report the average and standard deviation for each metric. The TFIDF approach shows that computing the importance of each word in a document is a key factor for the fake news detection problem. This result is a direct consequence of the size of the document-term matrix used as the input. Moreover, the transformer embeddings obtained the best results among the document embedding experiments, as they manage to encode and preserve the context within the vector representation. When compared to the state-of-the-art model MisRoBAERTa [23], the BiLSTM with BART obtained similar results, while the BiGRU with BART marginally outperformed the model with a 0.02% difference in accuracy. We hypothesize that this difference in performance is due to our use of pre-trained transformers instead of fine-tuned versions.
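The stratified 70%-10%-20% split used in each round can be sketched with scikit-learn; `stratified_split` is an illustrative helper, not our exact experiment code:

```python
from sklearn.model_selection import train_test_split

def stratified_split(X, y, seed):
    """70%-10%-20% train-validation-test split, stratified by label."""
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, train_size=0.70, stratify=y, random_state=seed, shuffle=True)
    # Of the remaining 30%, one third (10% overall) goes to validation.
    X_val, X_test, y_val, y_test = train_test_split(
        X_rest, y_rest, train_size=1 / 3, stratify=y_rest, random_state=seed)
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)
```

Running the helper with a different `seed` in each of the 10 rounds reproduces the shuffled, stratified resampling described above.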
The recall metric is the most relevant one for the fake news detection task because it measures the fraction of actual fake documents that are correctly classified as fake. No clear pattern emerges among the document embeddings to determine which has the overall best performance. For example, when using the LSTM, the best performance was obtained with DOCEMB WORD2VEC CBOW (96.84%), followed closely by DOCEMB FASTTEXT SG (96.72%), while, when using the Multi-Layer Perceptron, the best performance was obtained by DOCEMB FASTTEXT SG (95.48%), followed closely by DOCEMB WORD2VEC SG (95.40%).
Finally, by analyzing the results, we observed the following: (1) A simpler neural architecture offers similar or better results compared to complex deep learning architectures that employ multiple layers, i.e., in our comparison, we obtained similar results as the complex MisRoBAERTa [23] architecture without fine-tuning the transformers; (2) The embeddings used to vectorize the textual data make all the difference in performance, i.e., the right embedding must be selected to obtain good results with a given model; (3) We need a data-driven approach to select the best model and the best embedding for our dataset.

Additional Experiments
In this section, we present more experiments using four additional datasets that are analyzed in detail in [68]. For this set of experiments, we compared our results with existing results from the current literature. Furthermore, we trained our own model for each dataset using MisRoBAERTa [23], but we did not fine-tune the transformers; we used the pre-trained BART (facebook/bart-large) and ROBERTA (roberta-base) versions from HuggingFace [44]. We hypothesize that this is the reason we obtained results with this state-of-the-art architecture that are similar to the ones obtained with the models that use document embeddings.

Tables 3 and 4 present the experimental results obtained on the LIAR dataset [10]. For our experiments, we used the dataset as it was initially released, with 6 labels [10] (Table 3), and with balanced labels (Table 4) as proposed in [51]. To balance the labels, we created binary labels, i.e., all the texts that are not labeled with true are considered false. Using the same experimental configurations as presented in Section 4.2, we obtained results that are aligned with our original observations on the proposed dataset. Furthermore, we obtained results similar to state-of-the-art results for the multi-label dataset, e.g., Wang [10] and Alhindi et al. [69] obtained an accuracy of ∼20%. For the binary classification, we obtained results that go beyond the state of the art, e.g., Upadhayay and Behzadan [51] obtained an accuracy of 70%, while we obtained an accuracy of 83.99% with the LSTM model that employs the document embeddings constructed with GLOVE.

Table 3 presents the results obtained by the different machine and deep learning algorithms on the LIAR dataset [10]. The dataset contains approximately 12.8K human-annotated short statements collected using POLITIFACT.COM's API. In this set of experiments, we used all 6 labels of LIAR, i.e., pants-fire, false, barely-true, half-true, mostly-true, and true, to build our classification models.
The dataset is highly imbalanced, as there are more news articles labeled with true than news articles labeled with the other five classes combined. Due to this high degree of imbalance, the models performed poorly. We observe that the best-performing models employ document embeddings constructed with BART. The overall best-performing model was the Multi-Layer Perceptron with BART-built document embeddings, with an accuracy of 25.89%. The overall difference between the worst- and best-performing models is approximately 7%. We note that, for Naïve Bayes, the model trained with TFIDF obtained better scores than the models trained with document embeddings. We observed no real difference in performance among the models trained with the document embeddings built from word embeddings. This low performance is also present in the current literature [10,69], with accuracy scores very similar to the ones obtained by the models we trained.
To mitigate the poor performance obtained using all 6 labels of the LIAR dataset and to minimize the imbalance between the classes, we applied a binarization approach to the dataset. This approach is also used in the current literature; for example, Upadhayay and Behzadan [51] and Yang et al. [29] also use the LIAR dataset with 2 labels, i.e., true and false, to train their models. On this dataset, we observed that the performance of all the models improved. Naïve Bayes trained on document embeddings obtained the worst results. The overall best results were obtained by the LSTM with the document embeddings constructed with GLOVE, with an accuracy score of 83.99%. Furthermore, Naïve Bayes, Gradient Boosted Trees, and Perceptron obtained better results with the TFIDF vectorization; the performance of these models is directly impacted by TFIDF's features. The proposed approach outperforms more complex models proposed in the current literature, e.g., CNN with BERT-base embeddings [51] obtained an accuracy of 70% and UFD [29] obtained an accuracy of 75.90%.
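The label binarization described above (every label other than true collapses to false) can be sketched as:

```python
def binarize_liar(labels):
    """Collapse LIAR's six truthfulness labels into binary ones:
    'true' stays true; every other label becomes false."""
    return ["true" if label == "true" else "false" for label in labels]
```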
Tables 5-7 present the experimental results obtained on the Kaggle [12,18], Buzz Feed News [52], and TSHP-17 datasets as presented in [53,54]. Both Kaggle and Buzz Feed News are binary datasets, i.e., with the labels reliable and false. To emphasize that the embedding makes the main difference and that the models can generalize when we move from binary to multi-class classification, we used the multi-class dataset TSHP-17, which has the following 3 classes: satire, hoax, and propaganda. For this set of experiments, we used the same experimental setup and algorithm configurations as presented above. Again, we obtained results that are aligned with our original observations, reinforcing our claims.

Table 5 presents the results obtained on the Kaggle dataset [12,18]. We observed that only for the Gradient Boosted Trees and Multi-Layer Perceptron models did the document embeddings obtained with WORD2VEC SG and FASTTEXT SG outperform their CBOW counterparts. When analyzing the same document embedding, i.e., DOCEMB, we observed very little difference in performance among the neural models. As we used early-stopping mechanisms, the neural network models did not overfit. Among the document embeddings employing transformers, the ones that use BART obtained the best results across all experiments. With an accuracy of 99.80%, the overall best-performing model is the Bidirectional LSTM with document embeddings constructed with BART, i.e., DOCEMB BART. The results show that our approach outperforms more complex models proposed in the current literature, e.g., FNDNet [12] obtained an accuracy of 98.36%, FakeBERT [18] obtained an accuracy of 98.90%, and C-CNN [24] obtained an accuracy of 99.90%.
We observe that C-CNN, a large neural model with multiple layers that concatenates the results of three CNN models, outperforms the Bidirectional LSTM in terms of average accuracy on the Kaggle dataset by only 0.10%. We also want to emphasize that the results in Table 5 present the mean over 10 runs for each metric per model and embedding pair. Thus, if we only take the best-performing run, as in the case of the C-CNN results presented by Sedik et al. [24], then the Bidirectional LSTM model manages to obtain an accuracy of 99.92% (= 99.80% mean accuracy + 0.12% standard deviation).

Table 6 presents the results obtained on the Buzz Feed News dataset [52]. On this dataset, we observed that all the models obtained good results with TFIDF, such that some models that employ the TFIDF vectorization outperformed the document embeddings constructed with word and transformer embeddings, see, e.g., the results for LSTM, Bidirectional LSTM, GRU, and Bidirectional GRU. With an accuracy of 79.78%, the overall best-performing model is the Perceptron with BART document embeddings. For all the models, there is very little difference between the document embeddings that employ CBOW and their Skip-Gram counterparts. The results show that our approach outperforms more complex models proposed in the current literature, e.g., SVM [52] obtained an accuracy of 78.00% and UFD [29] obtained an accuracy of 67.90%.

Thus, in conclusion, we show, on five additional datasets, that: (1) A simpler neural architecture offers at least similar or better results than complex architectures that employ multiple layers, and (2) The difference in performance lies in the embeddings used to vectorize the textual data.
Furthermore, we generally obtained better results than other current state-of-the-art work: (1) On the LIAR dataset with 6 labels, Wang [10] obtained an F1-score of 27.7% using Hybrid CNNs and Alhindi et al. [69] obtained an F1-score of 26% using a BiLSTM, while we obtained an accuracy of 25.89% using the Multi-Layer Perceptron with the document embeddings employing BART; (2) On the LIAR dataset with 2 labels, Upadhayay and Behzadan [51] obtained an accuracy of 70% using a CNN with BERT-base embeddings, while we obtained an accuracy of 83.99% using the LSTM with the document embeddings employing GLOVE; (3) On the Kaggle dataset, the large deep learning model FakeBERT [18] obtained an accuracy of 98.90% and C-CNN [24] obtained an accuracy of 99.90%, while we obtained an accuracy of 99.80% using a simple Bidirectional LSTM with the document embeddings employing BART; (4) On the Buzz Feed News dataset, Horne and Adali [52] obtained an accuracy of 78% using a linear SVM, while we obtained an accuracy of 79.78% using the Perceptron with the document embeddings employing BART; (5) On the TSHP-17 dataset, Barrón-Cedeño et al. [54] obtained an accuracy of 97.58% using Proppy [70], while we obtained 99.65% using the Bidirectional LSTM with the document embeddings employing BART. To sum up, this set of experiments again reinforces our observation that the embedding is more important than the complexity of the classification architecture. Furthermore, there is no generic model that offers the best performance regardless of the dataset. Thus, a data-driven approach together with hyper-parameter tuning and ablation testing should be considered when the goal is to determine the best-performing model for a given dataset.

Discussion
Word embeddings manage to capture both local and global contexts as defined in Truicȃ et al. [71]. These contexts help the machine learning algorithms model and learn the text's context, syntax, and semantics, but they fail to differentiate among words' grammatical functions, i.e., the same word embedding is computed for a word regardless of its part-of-speech. On the other hand, transformer embeddings manage to learn the linguistic meaning of words, as they preserve context by design; thus, the same word has a different embedding depending on its lexical sense and concept as well as its part-of-speech. Based on this, we observe that, on average, the experiments that use document embeddings employing transformers perform better than those employing word embeddings. The most interesting results, however, are those obtained with the TFIDF document representation: the frequency-based importance of a word to a document within a textual corpus alone already has a high impact on the models' performance. As a general observation, there is very little difference in performance among the neural models when using the same document embedding.
The experimental results show that the DOCEMBs that use WORD2VEC and FASTTEXT obtain very similar results, with a difference of ∼±2% when using the Perceptron and ∼±0.5% when using the LSTM. The GLOVE-based DOCEMB obtains the best results, together with the LSTM model, on the LIAR dataset when using 2 labels. For the sample extracted from the Fake News Corpus as well as the LIAR with 6 labels, Kaggle, and TSHP-17 datasets, the BART DOCEMB obtains the best results with different classification algorithms. We can conclude that no single classification model generalizes well regardless of the dataset.
From the experimental results, we could not determine a clear winner with regard to document embedding and classification model. We observed empirically that the best-performing classification model changes with the dataset and the document embedding employed.
Our DOCEMB solutions were compared to the results we obtained when employing MisRoBAERTa [23], a more complex state-of-the-art model that employs fine-tuned BART and ROBERTA embeddings. We note that we did not fine-tune these transformers on our dataset as the authors did in the original work [23]; instead, we used the pre-trained BART (facebook/bart-large) and ROBERTA (roberta-base) from HuggingFace [44]. Furthermore, we also compared the results we obtained on each dataset with the results obtained with other state-of-the-art models presented in the current literature.
In our experiments, we obtained results that lead to the following observation: feature selection is more important than the Deep Learning architecture used for classification. To put it bluntly, stacking layers upon layers of neural cells just to claim a novel architecture does not solve real-world problems; it only obscures our understanding of how to use Machine Learning/Deep Learning for Natural Language Processing tasks. To conclude our findings: (1) A simpler neural architecture offers similar if not better results than complex deep learning architectures that employ multiple layers, i.e., in our comparison, we obtained results similar to those of the complex MisRoBAERTa [23] architecture and better than state-of-the-art results, i.e., FakeBERT [18] and Proppy [70]; (2) The embeddings used to vectorize the textual data make all the difference in performance, i.e., the right embedding must be selected to obtain good results with a given model; (3) We need a data-driven approach to select the best model and the best embedding for our dataset; (4) The way the word embedding manages to encapsulate the semantic, syntactic, and contextual features improves the performance of the classification models.

Conclusions
In this article, we presented a new approach for fake news detection using document embeddings (DOCEMBs). We also proposed a benchmark to establish the most efficient ways of detecting misleading information. To detect fake news, we used multiple machine learning algorithms together with DOCEMBs built using either TFIDF or word and transformer embeddings: WORD2VEC SG and CBOW, FASTTEXT SG and CBOW, GLOVE, BERT, ROBERTA, and BART.
Our approach emphasizes the importance of an overall document representation for the task of fake news detection and achieves state-of-the-art performance. Depending on the dataset, the results show that BiGRU/BiLSTM with DOCEMB BART outperforms the other models. In the experiments, we obtained better results than state-of-the-art Deep Neural Network models, even though we used simpler Deep Neural Network architectures. Additionally, we obtained results similar to MisRoBAERTa [23] when using the pre-trained BART (facebook/bart-large) and ROBERTA (roberta-base) from HuggingFace [44]. These are significant results, not because of the evaluation scores but because of the complexity of the models. The main takeaway of this work is that a simpler neural architecture offers similar if not better results than complex architectures that employ multiple layers. We observe that the most relevant factor is the embedding employed for classification, as it can really make a difference.
In future research, we plan to use sentiment analysis with fake news detection to determine if there is a correlation between polarity and veracity. We also aim to use ensemble models that combine our proposed method with existing methods to determine if the performance of fake news detection is improved.