Article

An Empirical Comparison of Machine Learning and Deep Learning Models for Automated Fake News Detection

1 College of Computing, Georgia Institute of Technology, Atlanta, GA 30332, USA
2 Computer Science & Engineering Department, University of California San Diego, La Jolla, CA 92093, USA
3 Khoury College of Computer Sciences, Northeastern University, Seattle, WA 98109, USA
4 Center for Data Science, New York University, New York, NY 10011, USA
5 College of Liberal Arts & Sciences, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Mathematics 2025, 13(13), 2086; https://doi.org/10.3390/math13132086
Submission received: 1 June 2025 / Revised: 18 June 2025 / Accepted: 23 June 2025 / Published: 25 June 2025
(This article belongs to the Special Issue Mathematical Foundations in NLP: Applications and Challenges)

Abstract

Detecting fake news is a critical challenge in natural language processing (NLP), demanding solutions that balance accuracy, interpretability, and computational efficiency. Despite advances in NLP, systematic empirical benchmarks that directly compare classical and deep models—across varying input richness and with careful attention to interpretability and computational tradeoffs—remain scarce. In this study, we systematically evaluate the mathematical foundations and empirical performance of five representative models for automated fake news classification: three classical machine learning algorithms (Logistic Regression, Random Forest, and Light Gradient Boosting Machine) and two state-of-the-art deep learning architectures (A Lite Bidirectional Encoder Representations from Transformers (ALBERT) and Gated Recurrent Units (GRUs)). Leveraging the large-scale WELFake dataset, we conduct rigorous experiments under headline-only, content-only, and combined headline-plus-content input scenarios, providing a comprehensive assessment of each model’s capability to capture linguistic, contextual, and semantic cues. We analyze each model’s optimization framework, decision boundaries, and feature importance mechanisms, highlighting the empirical tradeoffs between representational capacity, generalization, and interpretability. Our results show that transformer-based models, especially ALBERT, achieve state-of-the-art performance (macro F1 up to 0.99) with rich context, while classical ensembles remain viable for constrained settings. These findings directly inform practical fake news detection.

1. Introduction

The rise of the internet and digital communication platforms over the past two decades has fundamentally transformed the production, dissemination, and consumption of news. While these advances have democratized access to information and accelerated the pace of global news delivery, they have also facilitated the spread of fake news—false or misleading information presented as legitimate news content [1]. Fake news poses significant risks to public understanding, democratic institutions, and societal trust. Notable examples include the proliferation of fabricated political stories during the 2016 U.S. presidential election [2], widespread misinformation about COVID-19 vaccines and health policies [3], and manipulated narratives contributing to social unrest in geopolitical conflicts [4].
Fake news propagates through a variety of online channels, including social media platforms (e.g., Twitter, Facebook, Reddit), news aggregation websites, independent blogs, and online forums [5]. Social media, in particular, enables rapid and wide-reaching dissemination, often amplifying the reach of false information through algorithmic recommendation systems and viral user engagement. Automated accounts or bots further compound the problem by artificially boosting the visibility of fake news, distorting public discourse [6].
Given these challenges, automated detection of fake news has emerged as a critical task in natural language processing (NLP) and computational journalism.

1.1. Automated Detection of Fake News and Related Work

Given the overwhelming volume and velocity of digital news content, traditional manual verification approaches are insufficient to ensure information integrity [7]. This has motivated the development of automated fake news detection systems leveraging advances in machine learning (ML) and deep learning (DL).
Early ML approaches—such as Logistic Regression, Random Forests, and support vector machines—use features extracted from textual content, user metadata, or publication patterns to classify news articles as real or fake [8]. However, these models often depend on hand-engineered features and may struggle to generalize across topics, domains, or evolving deception strategies. Several recent works have advanced the use of ensemble and gradient boosting methods for fake news detection, including LightGBM and related feature-augmented approaches [9,10].
Recent progress in deep learning, particularly in NLP, has enabled more flexible and effective fake news detection. Recurrent neural networks (RNNs), convolutional neural networks (CNNs), and transformer-based architectures (e.g., BERT, ALBERT) can automatically learn complex linguistic and contextual representations from raw text, surpassing traditional approaches in both accuracy and adaptability [11,12,13,14,15,16]. These models have proven especially valuable in identifying nuanced or subtle patterns of deception that may not be captured by surface-level features alone. For instance, FakeBERT [14] and related BERT-based variants [15,16] have demonstrated strong results on social media and short-form news.
Meanwhile, the research community has explored hybrid and knowledge-augmented models [9,10], graph-based networks [9,10], and multimodal architectures that integrate textual, visual, and metadata signals for improved robustness [17,18].
A systematic review of domain generalization and transfer learning approaches has also been undertaken [19,20], particularly in the context of COVID-19 misinformation and multilingual news, showing that models often struggle to generalize across datasets and domains. Several benchmarks have evaluated the adaptability of deep models to new sources or languages [19,20]. The use of data augmentation, e.g., sequence-to-sequence and contrastive learning, is being actively researched to address data sparsity and generalization gaps [18,21].

1.2. Challenges and Open Problems

Despite these advances, several core challenges persist:
  • Linguistic Diversity and Context: Fake news articles often use varied writing styles, rhetorical strategies, and context-dependent cues, complicating detection by automated systems [19,20,22,23,24].
  • Class Imbalance: In most datasets, genuine news far outnumbers fake news, leading to imbalanced learning scenarios that can bias models toward majority classes [25,26,27].
  • Generalization Across Sources and Topics: News data are highly heterogeneous, with content drawn from diverse sources, domains, and time periods. Models trained in one context may not transfer well to others [19,20,28].
  • Interpretability: While deep models achieve high accuracy, their complex architectures often hinder transparency and interpretability, complicating the justification of automated decisions—an important issue for stakeholders, journalists, and platform administrators.
Furthermore, most existing benchmarks focus on a single input type—either headlines or article bodies—rather than systematically evaluating model performance across varying input richness (title only, content only, and both). This leaves open questions about the incremental value of additional context for model accuracy and interpretability.

1.3. Research Gap and Contributions

Despite the progress described above, there is a notable gap in systematic, empirical benchmarking of both classical and state-of-the-art deep learning models across input scenarios of varying richness (headline-only, content-only, and combined), as well as in reporting not only accuracy but also interpretability and computational tradeoffs.
This work addresses these challenges by providing a comprehensive, empirically-driven comparison of classical and deep learning models for automated fake news detection. Using the WELFake dataset—a large, balanced benchmark that includes both news headlines and full article content [29]—we systematically evaluate the following models:
  • Classical machine learning: Logistic Regression, Random Forest, and Light Gradient Boosting Machine (LightGBM);
  • Deep learning: A Lite Bidirectional Encoder Representations from Transformers (ALBERT) and Gated Recurrent Units (GRUs).
Models are assessed under three input conditions: (i) news headlines only, (ii) article content only, and (iii) combined headlines and article bodies. We report performance across multiple metrics (macro-averaged precision, recall, F1-score, and AUC-ROC), apply robust hyperparameter optimization and McNemar’s statistical significance testing, and analyze model interpretability via feature importance.
The main contributions of this study are as follows:
  • A unified benchmarking of traditional and neural NLP models for fake news detection across diverse input scenarios;
  • Empirical insights into how input granularity (headline, content, headline + content) affects model performance and feature utilization;
  • A contextualized review and benchmarking of both classical and recent state-of-the-art transformer and hybrid models for fake news detection, including discussion of domain generalization and data augmentation strategies;
  • A transparent, reproducible experimental protocol with interpretable analysis of model decision criteria.
Our findings aim to inform both the research community and practitioners regarding best practices, empirical tradeoffs, and open challenges in building robust, interpretable automated fake news detection systems.
The remainder of this paper is organized as follows: Section 2 describes data processing, model architectures, and evaluation metrics. Section 3 presents quantitative results, statistical comparisons, and interpretability analyses. Section 4 discusses practical implications, limitations, and future research directions.

2. Methods

This section presents the methodological framework developed to systematically evaluate the effectiveness of classical and deep learning models for fake news detection. We begin by describing the construction and preprocessing of the WELFake dataset, detailing the mathematical techniques used to transform raw text into suitable feature representations. Next, we outline the architectures and optimization procedures for both traditional machine learning and advanced neural models, including hyperparameter tuning strategies grounded in statistical rigor. Finally, we define the quantitative evaluation metrics employed to assess and compare model performance, highlighting the mathematical rationale behind each metric in the context of binary text classification. This comprehensive approach enables a robust, reproducible, and interpretable analysis of fake news identification models across a variety of input and algorithmic scenarios.

2.1. Data Preprocessing and Preparation

In this study, we utilized the WELFake dataset, a large and diverse corpus specifically designed for fake news classification tasks [29]. We selected WELFake because it is one of the most widely used and carefully curated public benchmarks for fake news detection, offering a balanced and relatively noise-free corpus whose scale and source diversity make it a strong proxy for benchmarking automated methods under controlled conditions. WELFake integrates four prominent open-access news sources—the Kaggle Fake News Dataset, McIntire Dataset, Reuters Dataset, and BuzzFeed Political News Dataset—resulting in a balanced collection of 72,134 English-language news articles, each annotated with a binary label indicating its authenticity (1: fake, 0: real).
A distinctive feature of the WELFake dataset is the inclusion of two textual columns for each record: the title, corresponding to the news headline, and the content, representing the full article text. This structure allowed us to systematically examine and compare model performance in three experimental settings:
  • Title-only models: Trained solely on the title field, emulating scenarios where only headline information is available or where computational efficiency is paramount.
  • Content-only models: Trained solely on the content field, isolating the value of full-article context and providing a direct content-based baseline.
  • Title + Content models: Trained on the concatenated output of the title and content fields, leveraging both concise headline cues and the broader semantic context provided by the full article.
Distribution of Number of Tokens/Words: Recent analyses (Table 1) of the WELFake dataset provide the following descriptive statistics:
  • Titles: The average title length is 12.17 words, with a maximum of 72 words. There are 558 articles with no title. The distribution is right-skewed, with most titles being short and only a few very long.
  • Content: Most article bodies fall between 0 and 1000 words, though there is a long tail: some articles contain up to 24,000 words. There are 783 articles with 0 words in the body. The top 5 longest articles exceed 20,000 words, but the vast majority are much shorter.
  • For machine learning applications, article content is often truncated or padded (e.g., to 200 tokens) for model input. Both fields have some missing values, but the majority of the data are well formed for NLP tasks.
By including all three input configurations, our experiments enable a rigorous assessment of the incremental value provided by headlines, article body text, and their combination, which has been identified as a key methodological gap in prior work.
For all modeling strategies, we applied a unified text normalization pipeline to all relevant fields, comprising:
  • Lowercasing: All text was converted to lowercase to standardize lexical representation.
  • Noise removal: URLs, HTML tags, user mentions, hashtags, punctuation, and numerical digits were removed using regular expressions.
  • Whitespace normalization: Multiple consecutive spaces were collapsed into a single space, and leading/trailing whitespace was trimmed.
  • Stopword removal: Common English stopwords were excluded using the NLTK corpus, retaining only semantically meaningful words.
  • Lemmatization: The WordNet lemmatizer was used to reduce each token to its lemma, consolidating morphological variants.
For the Title + Content models, the preprocessed title and content were concatenated prior to feature extraction or tokenization. To ensure data quality and consistency, all entries with missing or duplicate values in either field were removed.
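To make the pipeline concrete, the following is a minimal sketch of the normalization steps above, using NLTK’s stopword list and WordNet lemmatizer; the `normalize` helper and its regular expressions are illustrative rather than the exact patterns used in our experiments.

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

STOPWORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def normalize(text: str) -> str:
    """Lowercase, strip noise, drop stopwords, and lemmatize one document."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"<[^>]+>", " ", text)                # HTML tags
    text = re.sub(r"[@#]\w+", " ", text)                # mentions and hashtags
    text = re.sub(r"[^a-z\s]", " ", text)               # punctuation and digits
    tokens = text.split()                               # also collapses whitespace
    return " ".join(LEMMATIZER.lemmatize(t) for t in tokens if t not in STOPWORDS)
```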
For classical machine learning models, the processed text was transformed into fixed-length feature vectors using the Term Frequency–Inverse Document Frequency (TF-IDF) method. Specifically, for each document d and term t, the TF-IDF score is defined as
$$\mathrm{TF\text{-}IDF}(t, d) = \mathrm{tf}(t, d) \times \log \frac{N}{\mathrm{df}(t)}$$
where $\mathrm{tf}(t, d)$ is the frequency of term t in document d, N denotes the total number of documents, and $\mathrm{df}(t)$ represents the number of documents containing t [30,31]. We extracted up to 1000 TF-IDF features (including unigrams and bigrams, n-gram range: 1–2) per sample. The TF-IDF vectorizer was fit on the training set and applied to the validation and test sets to prevent information leakage. TF-IDF weighting is especially advantageous in our setting, as it normalizes for document length and emphasizes terms that are discriminative across a highly variable distribution of document sizes—ranging from single-sentence headlines to article bodies with up to 24,000 words—thus mitigating the impact of extreme length variability on model training and performance.
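A sketch of this feature extraction with scikit-learn is shown below; `train_texts`, `val_texts`, and `test_texts` are assumed to hold the normalized documents of each split.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=1000, ngram_range=(1, 2))
X_train = vectorizer.fit_transform(train_texts)  # fit on training data only
X_val = vectorizer.transform(val_texts)          # reuse the fitted vocabulary
X_test = vectorizer.transform(test_texts)        # prevents information leakage
```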
For deep learning models (ALBERT and GRUs), after the unified normalization described above, text sequences were further processed using model-specific tokenizers. For ALBERT, we utilized the default SentencePiece-based subword tokenizer from the HuggingFace transformers library, which segments text into subword units and maps them to integer token IDs compatible with the pretrained ALBERT vocabulary. Each input sequence was padded or truncated to a fixed maximum length (e.g., 200 tokens), reflecting common practice in NLP applications and facilitating efficient batch processing. For GRU models, the Keras Tokenizer was employed to map words to integer indices, with all sequences similarly padded or truncated. This approach standardizes input lengths for neural models and enables them to handle the highly variable article and title lengths in WELFake.
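The following sketch shows both tokenization paths under these settings; it assumes the HuggingFace transformers library and TensorFlow/Keras are installed, and that `train_texts` holds the normalized documents.

```python
from transformers import AutoTokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

# ALBERT path: subword tokenization, padded/truncated to 200 tokens.
albert_tok = AutoTokenizer.from_pretrained("albert-base-v2")
albert_inputs = albert_tok(train_texts, padding="max_length",
                           truncation=True, max_length=200)

# GRU path: word-to-index mapping with the same fixed length.
keras_tok = Tokenizer()
keras_tok.fit_on_texts(train_texts)
gru_inputs = pad_sequences(keras_tok.texts_to_sequences(train_texts), maxlen=200)
```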
The dataset was partitioned into training (60%), validation (20%), and test (20%) subsets using stratified random sampling, preserving the class distribution across splits. Random seeds were fixed for all splitting procedures to guarantee full experimental reproducibility.
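A sketch of this 60/20/20 stratified partition, assuming `texts` and `labels` are the cleaned documents and binary labels:

```python
from sklearn.model_selection import train_test_split

X_tmp, X_test, y_tmp, y_test = train_test_split(
    texts, labels, test_size=0.20, stratify=labels, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=42)  # 0.25 * 0.8 = 0.2
```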
This comprehensive and standardized preprocessing framework ensured high data quality and comparability across all models and input configurations, allowing for a rigorous assessment of the incremental value of article context and the empirical properties of different NLP modeling approaches.

2.2. Machine Learning Models

To systematically assess the predictive capacity of classical algorithms for fake news detection, we implemented and rigorously evaluated a suite of supervised machine learning models. Our selection—Logistic Regression, Random Forest, and Light Gradient Boosting Machine (LightGBM)—is motivated by their widespread adoption in text classification and their complementary modeling paradigms. Logistic Regression serves as a simple, interpretable linear baseline. Random Forest and LightGBM represent the two major tree-based ensemble paradigms, bagging and boosting, each leveraging different mechanisms for variance and bias reduction: Random Forest provides a robust, nonlinear ensemble approach that mitigates overfitting, while LightGBM offers a state-of-the-art gradient boosting method known for its efficiency and strong performance in high-dimensional, sparse-text scenarios. Collectively, these models allow us to compare linear versus nonlinear and bagging versus boosting strategies, offering a comprehensive perspective on the strengths and limitations of established techniques within natural language processing (NLP). By applying these models to both title-only and title + content feature sets, we aim to elucidate the mathematical foundations and practical trade-offs that govern model selection and performance in high-dimensional, text-based classification tasks. The following subsections provide detailed descriptions and mathematical formulations of each model considered in this study.

2.2.1. Logistic Regression

Logistic Regression is a fundamental statistical model for binary classification and provides an interpretable reference point for more complex algorithms in NLP [32]. In the fake news detection setting, each article is represented as a high-dimensional TF-IDF feature vector $\mathbf{x} \in \mathbb{R}^p$, and the probability that a sample belongs to the ‘fake’ class ($y = 1$) is modeled as
$$P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^\top \mathbf{x} + b) = \frac{1}{1 + \exp\left(-(\mathbf{w}^\top \mathbf{x} + b)\right)}$$
where $\mathbf{w}$ is the coefficient vector, b is a scalar bias, and $\sigma(\cdot)$ denotes the sigmoid activation.
The model parameters are estimated by minimizing the regularized binary cross-entropy loss:
$$\mathcal{L}(\mathbf{w}, b) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right] + \lambda \lVert \mathbf{w} \rVert_2^2$$
where $y_i$ is the ground-truth label for sample i, $\hat{y}_i$ is the model-predicted probability, and $\lambda$ is the $\ell_2$ regularization strength. The regularization term discourages overly large weights, promoting model stability and generalizability, especially in high-dimensional settings common in text classification.
To optimize model performance, we conducted a grid search over key hyperparameters using the validation set weighted F1-score as the selection criterion. The parameter grid was defined as follows:
  • penalty = $\ell_2$, specifying ridge regularization.
  • $C \in \{0.01, 0.1, 1, 10, 100\}$, where $C = 1/\lambda$ is the inverse regularization strength.
  • solver $\in$ {liblinear, lbfgs}, where liblinear implements coordinate descent, suitable for sparse, high-dimensional data, and lbfgs is a quasi-Newton method advantageous for larger datasets and faster convergence with $\ell_2$ regularization.
  • class_weight $\in$ {balanced, None}, to optionally adjust weights inversely proportional to class frequencies.
The grid search systematically evaluated all combinations of these hyperparameters to identify the configuration maximizing generalization as measured by the weighted F1-score on held-out validation data.
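A sketch of this search with scikit-learn follows; it uses a `PredefinedSplit` so that model selection is driven by the fixed validation split rather than cross-validation, and it reuses the TF-IDF matrices from Section 2.1.

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, PredefinedSplit

param_grid = {
    "penalty": ["l2"],
    "C": [0.01, 0.1, 1, 10, 100],
    "solver": ["liblinear", "lbfgs"],
    "class_weight": ["balanced", None],
}

# -1 marks rows used only for fitting; 0 marks the validation fold.
split = PredefinedSplit(np.r_[np.full(X_train.shape[0], -1),
                              np.zeros(X_val.shape[0])])
X_all = sparse.vstack([X_train, X_val])
y_all = np.r_[y_train, y_val]

search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                      scoring="f1_weighted", cv=split)
search.fit(X_all, y_all)
```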
The linearity and transparency of Logistic Regression permit direct interpretation of feature coefficients, enabling identification of headline terms most associated with the likelihood of fake news. This mathematical clarity justifies its use as a baseline for comparison against nonlinear and ensemble models.

2.2.2. Tree-Based Models

The decision tree is a nonparametric, supervised learning algorithm that partitions the input space into axis-aligned regions, facilitating interpretable and hierarchical decision boundaries for classification. In the context of fake news detection, the tree recursively splits the feature space derived from text representations (such as TF-IDF vectors) to predict the binary authenticity label of each article [33].
At each internal node, the tree selects a feature j and threshold t that partitions the dataset D into left and right child nodes, maximizing the purity of each subset. The splitting criterion is based on minimizing an impurity function, with common choices including the Gini index and Shannon entropy.
For a node containing N samples with C classes, the Gini impurity is defined as
$$G = 1 - \sum_{c=1}^{C} p_c^2$$
where $p_c$ is the proportion of samples belonging to class c at the node. For binary classification ($C = 2$), this simplifies to $G = 2p(1 - p)$, where p is the fraction of one class.
Alternatively, the entropy-based information gain uses the entropy at a node:
$$H = -\sum_{c=1}^{C} p_c \log_2 p_c$$
The optimal split at each node is determined by selecting the feature and threshold that maximize the reduction in impurity (Gini or entropy) from the parent node to its children.
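For illustration, these impurity criteria reduce to a few lines of code; the helpers below are illustrative, not part of our experimental pipeline.

```python
import numpy as np

def gini(p: np.ndarray) -> float:
    """Gini impurity for a vector of class proportions."""
    return 1.0 - float(np.sum(p ** 2))

def entropy(p: np.ndarray) -> float:
    """Shannon entropy in bits; zero-probability classes contribute nothing."""
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# A balanced binary node (p = 0.5) maximizes both criteria: 0.5 and 1.0 bits.
print(gini(np.array([0.5, 0.5])), entropy(np.array([0.5, 0.5])))
```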
The tree construction proceeds recursively, partitioning the data until a stopping condition is met, as follows:
  • A minimum number of samples required to further split a node.
  • A maximum tree depth.
  • All samples at a node belong to the same class.
Although decision trees can represent complex and highly nonlinear relationships, their expressiveness often leads to overfitting, especially in high-dimensional, sparse settings such as those encountered in text classification. Overfitting manifests as a tree memorizing idiosyncrasies of the training data, rather than learning generalizable patterns [34].
Despite their limitations, decision trees remain popular for their transparent, rule-based decision structure, which enables users to trace the logic behind each classification through the tree’s branches.
In this work, we constructed decision tree classifiers as mathematical baselines for understanding hierarchical feature interactions. However, due to their known propensity for overfitting—particularly in the context of high-dimensional textual features—we focused our primary analysis on ensemble variants, such as Random Forest and boosting, which introduce mechanisms to improve generalization performance.
Random Forest
Random Forest is an ensemble-based classification algorithm that addresses the high variance and overfitting issues commonly associated with single decision trees. By constructing a collection of randomized trees and aggregating their predictions, the Random Forest achieves robust generalization, especially in high-dimensional text classification tasks [35].
Each tree in the ensemble is built from a bootstrapped sample of the training data. At each node, only a randomly selected subset of features is considered for splitting, promoting diversity among the trees. The prediction for a given sample is determined by a majority vote across all trees.
For a binary classification problem, the Random Forest seeks to minimize classification error by reducing variance through averaging, while maintaining low bias. Formally, for T trees $\{h_t\}_{t=1}^{T}$, the predicted class $\hat{y}$ for an input $\mathbf{x}$ is
$$\hat{y} = \operatorname{mode}\left\{ h_t(\mathbf{x}) \right\}_{t=1}^{T}$$
where $h_t(\mathbf{x})$ is the class prediction from tree t.
We optimized the Random Forest configuration using a grid search over the following parameter space, selecting the best model based on the weighted F1-score on the validation set:
  • n_estimators $\in \{50, 100, 200\}$: Number of trees in the forest.
  • max_depth $\in \{10, 20, \text{None}\}$: Maximum allowable depth for each tree.
  • min_samples_split $\in \{2, 5, 10\}$: Minimum number of samples required to split an internal node.
  • min_samples_leaf $\in \{1, 2, 4\}$: Minimum number of samples required to be at a leaf node.
  • class_weight $\in$ {None, balanced}: Adjusts weights inversely proportional to class frequencies to mitigate class imbalance.
All combinations were exhaustively evaluated to identify the optimal configuration for the task.
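A sketch of this grid search, reusing the `split`, `X_all`, and `y_all` objects from the Logistic Regression sketch in Section 2.2.1:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rf_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [10, 20, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "class_weight": [None, "balanced"],
}
rf_search = GridSearchCV(RandomForestClassifier(random_state=42), rf_grid,
                         scoring="f1_weighted", cv=split)
rf_search.fit(X_all, y_all)
```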
Feature importance was assessed by averaging the reduction in impurity across all trees for each feature, offering interpretable insights into which words or phrases most influenced classification. While the Random Forest sacrifices some transparency relative to a single decision tree, its ensemble structure yields significantly improved predictive accuracy and robustness in text-based settings.
Light Gradient Boosting Machine (LightGBM)—Mathematical Perspective
Gradient boosting machines (GBMs) are additive models that build an ensemble of base learners $\{h_m\}_{m=1}^{M}$, typically decision trees, in a stage-wise manner. At each boosting round m, the model $F_m$ is updated as
$$F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \gamma_m h_m(\mathbf{x}),$$
where $h_m$ is fit to approximate the negative gradient of the loss function L evaluated at the current model predictions [36].
For a general differentiable loss function L, the optimization at each stage is performed by minimizing the expected loss:
$$\min_{h_m} \sum_{i=1}^{n} L\left(y_i,\, F_{m-1}(\mathbf{x}_i) + h_m(\mathbf{x}_i)\right).$$
LightGBM utilizes a second-order Taylor expansion of the loss function around $F_{m-1}(\mathbf{x}_i)$ to efficiently find the optimal splits:
$$L\left(y_i, F_{m-1}(\mathbf{x}_i) + h_m(\mathbf{x}_i)\right) \approx L\left(y_i, F_{m-1}(\mathbf{x}_i)\right) + g_i h_m(\mathbf{x}_i) + \frac{1}{2} h_i h_m(\mathbf{x}_i)^2,$$
where $g_i = \partial L(y_i, F_{m-1}(\mathbf{x}_i)) / \partial F_{m-1}(\mathbf{x}_i)$ is the first derivative (gradient), and $h_i$ is the corresponding second derivative (Hessian) with respect to $F_{m-1}(\mathbf{x}_i)$. This second-order approximation enables LightGBM to select splits based on both the gradient and curvature information, accelerating convergence [37].
For a candidate split that partitions data into left (L) and right (R) child nodes, the split gain (i.e., reduction in loss) is computed as
$$\mathrm{Gain} = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma,$$
where $G_L = \sum_{i \in L} g_i$ and $H_L = \sum_{i \in L} h_i$, with $G_R$ and $H_R$ the corresponding sums for the right child; $\lambda$ is the regularization parameter on leaf weights, and $\gamma$ is the penalty for creating a new leaf (to control model complexity). The optimal split maximizes this gain.
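The gain formula translates directly into code; the sketch below assumes per-sample gradient and Hessian arrays `g` and `h` and a boolean mask selecting the left child.

```python
import numpy as np

def split_gain(g: np.ndarray, h: np.ndarray, left_mask: np.ndarray,
               lam: float = 1.0, gamma: float = 0.0) -> float:
    """Loss reduction for one candidate split, per the equation above."""
    G_L, H_L = g[left_mask].sum(), h[left_mask].sum()
    G_R, H_R = g[~left_mask].sum(), h[~left_mask].sum()
    return 0.5 * (G_L ** 2 / (H_L + lam) + G_R ** 2 / (H_R + lam)
                  - (G_L + G_R) ** 2 / (H_L + H_R + lam)) - gamma
```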
Traditional GBMs expand trees level-wise (breadth-first). LightGBM instead uses a leaf-wise strategy: at each step, it identifies the leaf whose split would yield the greatest reduction in the loss (highest gain), resulting in deeper, more adaptive trees that can fit complex patterns in the data.
LightGBM bins continuous features into a finite number of discrete intervals, which reduces the complexity of finding the best split from $O(N)$ to $O(K)$ per feature per split, where K is the number of bins ($K \ll N$), improving both speed and memory efficiency.
To avoid overfitting, LightGBM introduces the following:
  • An $\ell_2$ penalty on leaf weights ($\lambda$).
  • A minimum sum of Hessians per leaf node.
  • A penalty γ for each additional leaf.
These regularizers help to control tree growth and model complexity.
Through mask-based optimization and efficient memory usage, LightGBM directly supports sparse input matrices (e.g., from TF-IDF vectorization), skipping unnecessary computation and storage for zero-valued features.
Through its mathematically grounded innovations—including second-order optimization, leaf-wise splitting, and histogram-based binning—LightGBM attains a strong balance of accuracy, efficiency, and scalability, making it particularly apt for high-dimensional, sparse-text classification problems such as fake news detection.
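In practice, these regularizers surface as LightGBM constructor arguments; the following sketch shows an illustrative configuration on the sparse TF-IDF matrices (values are examples, not our tuned settings).

```python
from lightgbm import LGBMClassifier

lgbm = LGBMClassifier(
    n_estimators=200,
    num_leaves=31,          # leaf-wise growth, capped by leaf count
    learning_rate=0.1,
    reg_lambda=1.0,         # l2 penalty on leaf weights (lambda above)
    min_child_weight=1e-3,  # minimum sum of Hessians per leaf
)
lgbm.fit(X_train, y_train)  # accepts scipy.sparse input directly
```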

2.3. Deep Learning Models

To complement the classical machine learning baselines, we implemented a set of deep learning architectures tailored for sequential and contextual modeling of textual data. Deep learning approaches, particularly those based on neural networks, offer powerful tools for capturing complex dependencies, hierarchical patterns, and semantic nuances in natural language. In this study, we selected two state-of-the-art neural models, each representing a distinct paradigm in modern NLP: ALBERT (A Lite Bidirectional Encoder Representations from Transformers), representing transformer-based contextual encoding, and Gated Recurrent Units (GRUs), representing recurrent sequence modeling. This choice is motivated by their demonstrated balance between model complexity, computational efficiency, and state-of-the-art performance. GRUs are chosen over vanilla RNNs and LSTMs because their simpler gating mechanism achieves comparable performance to LSTMs with fewer parameters and faster convergence, making them suitable for large-scale experiments. ALBERT, a parameter-efficient transformer, is selected over larger models such as BERT and recent LLMs because it provides competitive accuracy with significantly reduced memory and computational costs, facilitating practical deployment; while LLMs and more complex architectures may offer marginal gains, they often require prohibitive resources and are less interpretable for structured benchmarking studies. By focusing on these two models, we aim for an optimal trade-off between performance, efficiency, and reproducibility. The following subsections provide mathematical formulations and methodological details for each model, highlighting their suitability for large-scale fake news detection tasks.

2.3.1. A Lite Bidirectional Encoder Representations from Transformers (ALBERT)

Transformer-based models have revolutionized natural language processing by enabling efficient modeling of contextual and sequential relationships in text. The Bidirectional Encoder Representations from Transformers (BERT) model [38] established a new paradigm by leveraging deep stacks of transformer encoders, each comprising multihead self-attention and position-wise feedforward layers, to compute bidirectional, context-sensitive token embeddings.
Mathematically, the encoder processes an input sequence $\{x_i\}$ through L stacked layers, producing token representations $h_i^{(l)}$ at each layer l:
$$\tilde{h}_i^{(l)} = \mathrm{LayerNorm}\left(h_i^{(l)} + \mathrm{MultiHeadAttn}\left(h_i^{(l)}\right)\right),$$
$$h_i^{(l+1)} = \mathrm{LayerNorm}\left(\tilde{h}_i^{(l)} + \mathrm{FFN}\left(\tilde{h}_i^{(l)}\right)\right),$$
where MultiHeadAttn denotes multihead self-attention, FFN the position-wise feedforward network, LayerNorm layer normalization, and $\tilde{h}_i^{(l)}$ the intermediate representation after the attention sublayer.
While BERT achieves strong empirical results, its large parameter count increases both memory usage and computational demand. ALBERT (A Lite BERT) [12] addresses these issues via two principal innovations:
  • Factorized Embedding Parameterization: ALBERT factorizes the word embedding matrix, decoupling vocabulary size from hidden dimension, such that $E = E_1 E_2$ with $E_1 \in \mathbb{R}^{V \times k}$ and $E_2 \in \mathbb{R}^{k \times d}$ for vocabulary size V, bottleneck size k, and hidden size d, where $k \ll d$.
  • Cross-layer Parameter Sharing: A single set of encoder weights Θ is shared across all layers, so
    $$z_i^{(l+1)} = \mathrm{TransformerLayer}\left(z_i^{(l)}; \Theta\right), \quad l = 0, \ldots, L-1,$$
    reducing the total number of trainable parameters and promoting regularization.
Furthermore, ALBERT introduces Sentence Order Prediction (SOP) as an auxiliary pretraining objective to enhance inter-sentence coherence modeling.
In our study, we fine-tuned the pretrained albert-base-v2 model for binary fake news classification, using either headline-only or concatenated headline + content inputs. Each sequence was tokenized, padded, or truncated to a fixed maximum length, and the [CLS] token’s output embedding was used for classification via a softmax layer.
A randomized search was performed over the following domains:
  • Learning rate $\eta$: log-uniformly sampled in $[10^{-5}, 10^{-4}]$.
  • Number of epochs: integers in $[3, 5]$.
  • Dropout rate: uniformly sampled in $[0.1, 0.5]$.
Ten random hyperparameter configurations were evaluated, with the optimal setting selected based on the validation weighted F1-score.
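A condensed sketch of this fine-tuning setup is given below, using the HuggingFace transformers and PyTorch APIs; the training loop is elided, and the dropout overrides via config keyword arguments are one reasonable way to realize the sampled rates.

```python
import random

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("albert-base-v2")

for trial in range(10):
    lr = 10 ** random.uniform(-5, -4)   # log-uniform in [1e-5, 1e-4]
    epochs = random.randint(3, 5)
    dropout = random.uniform(0.1, 0.5)
    model = AutoModelForSequenceClassification.from_pretrained(
        "albert-base-v2", num_labels=2,
        hidden_dropout_prob=dropout, attention_probs_dropout_prob=dropout)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    # ... fine-tune for `epochs` epochs, score the validation split, and
    # keep the configuration with the highest weighted F1.
```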
ALBERT’s parameter-efficient and mathematically grounded design yields robust performance on large-scale text classification, making it particularly suitable for fake news detection across both succinct headlines and full article contexts.

2.3.2. Gated Recurrent Units (GRUs)

Recurrent neural networks (RNNs) are foundational models for sequential data analysis in natural language processing, as they can process arbitrary-length sequences and capture dependencies across time steps. Standard RNNs, however, often struggle with learning long-range dependencies due to vanishing or exploding gradients. Gated Recurrent Units (GRUs) [39] address this issue by introducing gating mechanisms that adaptively regulate the flow of information through the network.
Given an input sequence $\{x_t\}_{t=1}^{T}$, the GRU maintains a hidden state $h_t$ at each time step, updated via
$$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z), \quad \text{(update gate)}$$
$$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r), \quad \text{(reset gate)}$$
$$\tilde{h}_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h), \quad \text{(candidate state)}$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t,$$
where $\sigma(\cdot)$ denotes the sigmoid activation, $\odot$ is element-wise multiplication, and the W, U, and b matrices and vectors are learnable parameters. The update gate $z_t$ interpolates between the previous hidden state and the candidate state, while the reset gate $r_t$ determines how much past information to forget.
For fake news classification, each input text (either title-only or concatenated title + content) was tokenized and mapped to a sequence of embeddings. The GRU-based model consisted of the following:
  • An embedding layer that converts token indices into dense vectors.
  • One or more GRU layers to encode sequential dependencies and context.
  • A fully connected output layer applied to the final hidden state for binary classification.
Dropout regularization was included to reduce overfitting.
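A sketch of this architecture in Keras follows; the embedding and hidden dimensions shown are mid-range values from the search domains listed below, not our selected configuration.

```python
from tensorflow.keras import layers, models

def build_gru(vocab_size: int, embed_dim: int = 200,
              hidden_dim: int = 512) -> models.Model:
    model = models.Sequential([
        layers.Embedding(vocab_size, embed_dim),  # token indices -> dense vectors
        layers.GRU(hidden_dim),                   # final hidden state encodes the text
        layers.Dropout(0.3),                      # regularization against overfitting
        layers.Dense(1, activation="sigmoid"),    # binary fake/real probability
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```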
We adopted a randomized search approach, sampling 10 distinct hyperparameter combinations from the following domains:
  • Embedding dimension: integers in $[150, 250]$.
  • Hidden dimension: integers in $[256, 768]$.
  • Learning rate $\eta$: log-uniformly sampled in $[10^{-4}, 10^{-3}]$.
  • Number of epochs: integers in $[5, 10]$.
Model performance was evaluated using the weighted F1-score on the validation set, and the best configuration was selected accordingly.
The GRUs’ gating mechanisms enable the model to retain or forget information dynamically, allowing effective modeling of both local and global dependencies in text. This makes GRUs especially suited to tasks such as fake news detection, where critical information may occur anywhere in the sequence.
Through their mathematically principled gating structure and efficient parameterization, GRU networks provide a strong and scalable approach for sequential modeling in NLP, serving as a competitive neural baseline for our fake news detection experiments.

2.4. Evaluation Metrics

To rigorously assess model performance in the context of fake news identification, we employed several standard classification metrics, each underpinned by a precise mathematical formulation. Let $y_i \in \{0, 1\}$ denote the ground-truth label and $\hat{y}_i \in \{0, 1\}$ the predicted label for sample i, where 1 represents a fake news article.
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
Here, TP (true positives) is the number of fake news articles correctly identified as fake, while FP (false positives) is the number of real news articles incorrectly labeled as fake. High precision in fake news detection indicates that when the model flags an article as fake, it is highly likely to be truly fake—minimizing false alarms and reducing the risk of wrongly censoring legitimate content.
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
FN (false negatives) is the number of fake news articles incorrectly labeled as real. High recall means the model is effective at catching the majority of fake news articles, reducing the chance that misleading information will evade detection and propagate unchecked.
$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
The F1-score balances the trade-off between precision and recall. In fake news identification, a high F1-score signifies that the model performs well both at accurately flagging fake articles (precision) and at catching as many fakes as possible (recall)—a crucial property in environments where both false positives (wrongly flagged news) and false negatives (missed fake news) are costly.
When classes are imbalanced, as is common in real-world fake news datasets, macro-averaged metrics are used. For a binary case, macro-averaged precision, recall, and F1-score are calculated by computing the metric for each class (fake and real) and then averaging, thereby giving equal weight to both classes regardless of their frequency.
The ROC curve plots the true positive rate (TPR, or recall) versus the false positive rate (FPR) at various probability thresholds:
$$\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN}$$
The area under this curve (AUC) reflects the probability that the model will rank a randomly chosen fake article higher than a randomly chosen real article. In fake news detection, a high AUC means the model is consistently good at distinguishing between fake and real news, regardless of the classification threshold.
Weighted F1-score on the validation set was adopted as the primary criterion for hyperparameter tuning and model selection, ensuring both precision and recall are optimized and that the evaluation is robust to any minor class imbalance. All metrics were reported on the held-out test set for final comparison.
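These metrics correspond directly to standard scikit-learn calls; the sketch assumes hard predictions `y_pred` and positive-class probabilities `y_prob` for the test set.

```python
from sklearn.metrics import precision_recall_fscore_support, roc_auc_score

prec, rec, f1, _ = precision_recall_fscore_support(y_test, y_pred, average="macro")
auc = roc_auc_score(y_test, y_prob)
print(f"macro P={prec:.2f}  R={rec:.2f}  F1={f1:.2f}  AUC-ROC={auc:.2f}")
```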
This comprehensive, mathematically rigorous evaluation strategy provides nuanced insight into the strengths and weaknesses of each model in addressing the critical societal challenge of fake news identification.

2.5. Statistical Comparison of Classifiers: McNemar’s Test

To formally assess whether differences in classification accuracy between pairs of models were statistically significant, we conducted pairwise hypothesis testing using McNemar’s test [40]. McNemar’s test is a nonparametric method for evaluating the performance of two classifiers on the same sample, specifically testing the null hypothesis that both classifiers have the same error rate on paired observations.
Given two classifiers, A and B, let $n_{01}$ denote the number of samples misclassified by A but correctly classified by B, and $n_{10}$ the number of samples correctly classified by A but misclassified by B. McNemar’s test statistic is defined as
$$\chi^2 = \frac{(n_{01} - n_{10})^2}{n_{01} + n_{10}}$$
Under the null hypothesis of equal error rates, this statistic asymptotically follows a chi-squared distribution with one degree of freedom.
To account for multiple pairwise comparisons, p-values were adjusted using Holm’s method, providing a more stringent control of the family-wise error rate. This approach enables robust inference on the relative superiority of competing classifiers under identical experimental conditions.
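A sketch of this procedure, assuming boolean correctness vectors (prediction == label) for each classifier on the shared test set and a list `model_pairs` of such vector pairs:

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar
from statsmodels.stats.multitest import multipletests

def mcnemar_p(correct_a: np.ndarray, correct_b: np.ndarray) -> float:
    n01 = int(np.sum(~correct_a & correct_b))  # A wrong, B right
    n10 = int(np.sum(correct_a & ~correct_b))  # A right, B wrong
    table = [[0, n01], [n10, 0]]               # only discordant counts matter
    return mcnemar(table, exact=False, correction=False).pvalue

p_values = [mcnemar_p(a, b) for a, b in model_pairs]
reject, p_corrected, _, _ = multipletests(p_values, method="holm")
```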

3. Results

This section presents a comprehensive evaluation of classical and deep learning models for fake news detection across three input configurations: (i) headline (title) only, (ii) article body (content) only, and (iii) headline combined with article body (title + content). We first summarize the predictive performance of all models using multiple quantitative metrics on the held-out test set, enabling direct comparison of precision, recall, F1-score, and AUC-ROC under all scenarios. Next, we report the results of rigorous pairwise statistical tests to determine the significance of performance differences between models. Finally, we analyze feature importance for each machine learning model, providing interpretability insights into which textual elements most strongly influence classification outcomes. Collectively, these results offer a nuanced understanding of how model architecture and input context affect fake news detection accuracy, robustness, and transparency.

3.1. Computational Resource Cost and Efficiency

To provide a comprehensive view of practical deployment, we report the average training computational time required for each model across all three input configurations. All models were trained on Google Colab, utilizing a high-RAM configuration powered by an NVIDIA T4 GPU, which provided sufficient computational efficiency for the experiments—especially for deep learning models. The reported training time for each configuration is averaged over all hyperparameter tuning conditions, calculated as the total tuning time divided by the number of tuning configurations, thereby representing the average cost for a single model fit.
Table 2 summarizes the computational time (in seconds) required for each model under the three experimental scenarios: title only, content only, and title plus content.
These results highlight that classical machine learning models (Logistic Regression, Random Forest, LightGBM) require minimal computational time, even as input complexity increases. In contrast, deep learning models—especially transformer-based ALBERT—incur substantially higher training costs, reflecting their greater representational power and parameterization. GRUs achieve a more favorable trade-off, offering significant efficiency improvements over ALBERT while still supporting sequential modeling. This comparison can guide practitioners in balancing model performance against resource constraints in real-world fake news detection deployments.

3.2. Model Performance

Table 3 reports the macro-averaged precision, recall, F1-score, and AUC-ROC for all evaluated models on the test set, stratified by the three input scenarios: title only, content only, and title + content.
When restricted to headline information, ensemble and deep learning models consistently outperformed classical linear baselines. Among traditional machine learning approaches, Random Forest exhibited the strongest results (Precision/Recall/F1: 0.85; AUC-ROC: 0.93), marginally exceeding both Logistic Regression (0.84 across metrics, AUC-ROC: 0.92) and LightGBM (0.83 across metrics, AUC-ROC: 0.92). This finding suggests that the Random Forest’s capacity to model nonlinear feature interactions confers a performance advantage, even in concise textual contexts.
Deep learning architectures provided a substantial boost. ALBERT achieved the highest overall accuracy in this setting, with a macro F1-score of 0.93 and an AUC-ROC of 0.98, indicating its strong ability to extract informative patterns from short headlines. The GRU model also demonstrated competitive performance (F1: 0.90, AUC-ROC: 0.96), affirming the value of sequential modeling, although it trailed ALBERT by a moderate margin.
We also evaluated a set of models trained exclusively on article content (‘content-only’), which serves as a necessary baseline for quantifying the incremental value of headlines versus body text. Content-only models, unsurprisingly, performed significantly better than headline-only models across all algorithms, confirming that richer textual context provides substantial predictive signal. For example, LightGBM, Random Forest, and Logistic Regression all achieved macro F1-scores above 0.92 and AUC-ROC above 0.98 in the content-only setting, outperforming their headline-only counterparts by a wide margin. Both ALBERT and GRUs also exhibited strong results (F1: 0.97, AUC-ROC: 1.00), nearly matching their title + content scores, underscoring the power of deep architectures to leverage extended context. These results reinforce the importance of article body information and provide a rigorous, previously missing comparison baseline for future studies.
Integrating the full article text with the headline led to marked improvements across all models. Classical machine learning approaches exhibited notable gains: both Logistic Regression and Random Forest achieved F1-scores of 0.93 and AUC-ROC values of 0.98 and 0.99, respectively. LightGBM showed an even greater improvement (F1: 0.94, AUC-ROC: 0.99), highlighting the utility of richer feature sets for boosting ensembles.
Neural models achieved near-perfect accuracy in the combined input scenario. ALBERT reached an F1-score and recall of 0.99 and an AUC-ROC of 1.00, demonstrating exceptional discriminative ability. The GRU model also performed remarkably well (F1: 0.97, AUC-ROC: 1.00), nearly matching ALBERT’s performance and substantially outperforming all classical baselines. These results demonstrate that, when provided with comprehensive textual context, advanced neural networks are capable of synthesizing both local and global features to enable highly accurate fake news detection.
The incremental benefit of incorporating article content is clear for all models, with the most pronounced relative gains seen for LightGBM and GRUs. Across all input scenarios, ALBERT consistently outperformed all other models, highlighting the power of transformer-based architectures for both short and long textual classification. Classical machine learning models, while efficient and interpretable, were ultimately surpassed by neural methods—especially when richer input was available. Across all models and configurations, AUC-ROC values remained high, reflecting robust discrimination between fake and real news even under balanced class conditions.
Overall, these findings reinforce the superiority of state-of-the-art neural architectures, particularly ALBERT, for automated fake news detection. The added contextual information from article content led to substantial improvements in performance for all models, with neural networks demonstrating especially strong gains. Classical machine learning methods remain viable for resource-constrained settings or when interpretability is paramount, but their predictive power is outmatched by contemporary deep learning approaches on large-scale, real-world datasets.

3.3. Model Comparison

To rigorously assess whether observed differences in predictive performance between models were statistically significant, we conducted pairwise McNemar’s tests with Holm correction for multiple comparisons (Table 4) [40]. This nonparametric approach evaluates whether two classifiers differ in their tendency to make errors on the same samples, providing a robust basis for head-to-head significance testing.
For the title-only models, ALBERT consistently and significantly outperformed all classical machine learning baselines (Logistic Regression, Random Forest, LightGBM) as well as GRUs, with extremely small corrected p-values (<0.0001) in all comparisons. GRUs also demonstrated significant superiority over all tree-based models. Among the classical methods, Random Forest outperformed both LightGBM and Logistic Regression, with a particularly pronounced margin over LightGBM. The only nonsignificant result in the title-only scenario was between LightGBM and Logistic Regression (p = 0.33), indicating comparable performance profiles between these two models in this input setting.
For the content-only models, both ALBERT and GRUs achieved top performance; although their aggregate scores were tied, McNemar’s test still registered a significant difference in their error patterns (corrected p < 0.0001). LightGBM demonstrated a statistically significant edge over Random Forest and Logistic Regression, suggesting that boosting techniques gain greater advantage when richer article content is available. The classical baselines, while strong, were consistently outperformed by deep neural approaches in this context.
For the title + content models, the advantage of deep learning models became even more pronounced. ALBERT emerged as the top performer, significantly outperforming all other models, including GRUs, which exhibited a significant advantage over all classical methods. LightGBM outperformed both Random Forest and Logistic Regression, while Random Forest was only marginally better than Logistic Regression (p = 0.02). The high frequency of statistically significant results (corrected p < 0.0001) underscores the clear superiority of transformer-based neural architectures when richer textual context is leveraged.
Collectively, these results provide robust statistical evidence that advanced neural models—especially ALBERT—offer substantial gains over traditional machine learning approaches for fake news detection. This effect is most pronounced when both headline and full article content are available, but it remains apparent even with headline-only inputs. These findings highlight the critical role of deep contextual models for text classification in practical misinformation detection pipelines.

3.4. Model Interpretation: Feature Importance Analysis

To better understand how classical machine learning models distinguish between fake and real news, we analyzed the top 20 most important TF-IDF features as determined by each model in both the title-only and title + content scenarios. Figure 1 displays these feature importances for Logistic Regression, Random Forest, and LightGBM.
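For reference, rankings of this kind can be extracted from the fitted models and the TF-IDF vectorizer of Section 2.1 along the following lines; the model variable names here are illustrative.

```python
import numpy as np

names = vectorizer.get_feature_names_out()
top_lr = names[np.argsort(np.abs(lr_model.coef_[0]))[::-1][:20]]      # |coefficient| magnitude
top_rf = names[np.argsort(rf_model.feature_importances_)[::-1][:20]]  # mean impurity reduction
top_lgbm = names[np.argsort(lgbm_model.feature_importances_)[::-1][:20]]
```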
For both LightGBM and Random Forest, the feature ‘said’ overwhelmingly dominates as the most important predictor, followed by ‘image’ and ‘washington’. This pattern suggests that attributional cues (e.g., ‘said’) and references to images or locations play a crucial role in content-based fake news detection for tree-based ensemble models. Other common features with moderate importance include temporal terms (e.g., ‘monday’, ‘friday’, ‘october’), communication-related terms (‘email’, ‘twitter’), and event or entity words (e.g., ‘america’, ‘percent’, ‘obama’). Logistic Regression displays a different pattern, distributing importance more evenly across its top features. Instead of being dominated by a single term, the top features include ‘com’, ‘follow’, ‘image’, ‘october’, ‘didnt’, and ‘pic’, among others. While ‘said’ remains among the top 10, the linear model appears more sensitive to platform or web-related terms (e.g., ‘com’, ‘twitter’, ‘rt’) and colloquial expressions (‘im’, ‘dont’, ‘thats’), which may signal informal or viral content. Across all models, location and event-driven terms (e.g., ‘washington’, ‘friday’, ‘october’), as well as references to communication platforms or media (‘twitter’, ‘email’, ‘image’), are recurrently selected, indicating that both content and context are critical in distinguishing fake news articles.
Compared with headline-only models, the inclusion of full article content leads to a sharper concentration of importance on specific attributional and event-related features for tree-based models. The greater spread of feature importance in Logistic Regression highlights its tendency to pick up on a broader array of stylistic and platform-specific cues, while ensemble models focus more tightly on a few highly predictive features.
Overall, the combination of location, media-related, event-driven, and stylistic cues reflects the multifaceted nature of language signals that classical machine learning models leverage in fake news detection, and highlights the complementary strengths of linear and nonlinear algorithms in modeling different aspects of the input text.
Although classical machine learning models readily provide interpretable feature importance rankings, our deep learning models (ALBERT, GRUs) do not inherently yield such direct explanations. We attempted to employ SHAP (SHapley Additive exPlanations) for interpretability with these neural models, but the computational demands were prohibitive: even after over 100 h of running on a high-memory GPU, we were unable to complete the analysis. This observation echoes recent findings that highlight significant scalability and efficiency challenges when applying SHAP to complex NLP architectures [27,41]. As a result, we do not present SHAP-based interpretability for our deep models, but recognize this as an ongoing limitation and an important avenue for future research.

3.5. Qualitative Error Analysis and Limitations

Despite the strong overall performance of both classical and deep learning models in our experiments, several recurring sources of misclassification were observed, consistent with broader trends in fake news detection research [7,42,43].
First, models sometimes struggle with headlines or articles that use ambiguous, sarcastic, or figurative language, which can be difficult to distinguish from genuine misinformation. Satirical content, humor, and news pieces relying on cultural or contextual cues are commonly misclassified, as automatic systems may not fully grasp the intended nuance [43].
Moreover, we observed that model errors are more frequent on articles from underrepresented sources or with atypical topics. This may reflect limitations in the diversity of the training data, leading to potential domain or topic bias [42].
Additionally, both classical and neural models may exhibit sensitivity to temporal drift, misclassifying articles related to breaking news events or novel topics that were not present in the training set. This highlights the importance of continual model updating and retraining to maintain accuracy in evolving news landscapes.
In general, models tend to be more reliable when clear, explicit cues for fake or real news are present, but they can falter in cases with subtle language, emerging topics, or less typical writing styles. Although we did not conduct an extensive ablation or systematic error annotation, these observations are consistent with known limitations in the field and underscore the value of ongoing research into dataset diversity, model robustness, and interpretability.

4. Discussion

This study delivers a comprehensive, mathematically principled evaluation of both classical and deep learning models for automated fake news detection, leveraging the large-scale WELFake dataset under headline-only, content-only, and headline + content input scenarios. Our findings provide new insights into the relative effectiveness, interpretability, and practical implications of these approaches for text-based misinformation detection in modern digital ecosystems.

4.1. Comparative Model Performance

Our results underscore that both input representation and algorithmic choice exert a profound impact on fake news detection outcomes. In the headline-only scenario, ensemble tree-based models (Random Forest, LightGBM) consistently surpass linear baselines such as Logistic Regression, demonstrating the advantage of capturing nonlinear feature interactions and higher-order relationships even in short textual inputs. However, the adoption of deep neural models—most notably ALBERT, a highly parameter-efficient transformer—produces the most pronounced gains. The superior macro F1-score and AUC-ROC attained by ALBERT indicate that transformer-based architectures are adept at extracting subtle contextual patterns from minimal input, which are otherwise challenging for classical models.
When the full article content is included, all models benefit from the additional semantic context, but the magnitude of improvement is especially dramatic for deep learning approaches. Both ALBERT and GRU models attain near-perfect accuracy (macro F1 ≥ 0.97, AUC-ROC ≈ 1.00), decisively outperforming classical methods. These findings validate the exceptional representational power of neural architectures for natural language processing (NLP) tasks and affirm that context-rich features significantly enhance the detection of sophisticated fake news.

4.2. Computational Cost and Resource Efficiency

While detection accuracy is paramount, practical deployment of fake news detection systems also requires careful consideration of computational cost and resource requirements. In our experiments, all models were trained and evaluated on Google Colab using an NVIDIA T4 GPU (for deep learning models) and a high-RAM configuration (for all models).
  • ALBERT: Fine-tuning ALBERT on the full dataset required the highest training time and memory consumption among all models—typically several hours per hyperparameter configuration—due to the complexity of transformer-based architectures and the need to process long sequences, especially when article content is included.
  • GRUs: GRU models offered a more lightweight alternative to transformers but still required substantial GPU resources for sequence modeling.
  • Random Forest and LightGBM: Classical ensemble models (Random Forest, LightGBM) trained much faster (within minutes on CPU or GPU) and required substantially less memory, making them suitable for resource-constrained environments or real-time applications.
  • Logistic Regression: This linear model was the most efficient, both in terms of memory and computation, completing training in under a minute on CPU for all input settings.
Average training times per model and input scenario are summarized in Table 2. These results reinforce the trade-off between model complexity and computational cost: while transformer-based models deliver state-of-the-art accuracy, classical models offer significant efficiency and transparency advantages for large-scale or time-sensitive deployments.
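As a rough illustration of how such wall-clock numbers can be reproduced, the sketch below times two of the classical pipelines; the toy corpus and hyperparameters are placeholders, not our exact configuration (in our experiments the input is the WELFake training split).

```python
import time
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from lightgbm import LGBMClassifier

# Placeholder corpus standing in for the real training data.
texts = ["breaking video shows chaos downtown",
         "officials confirmed the policy on friday"] * 500
labels = [1, 0] * 500

# Shared TF-IDF featurization for both classical models.
X = TfidfVectorizer(max_features=50_000).fit_transform(texts)

for name, clf in [("Logistic Regression", LogisticRegression(max_iter=1000)),
                  ("LightGBM", LGBMClassifier(n_estimators=200))]:
    start = time.perf_counter()
    clf.fit(X, labels)
    print(f"{name}: trained in {time.perf_counter() - start:.2f} s")
```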

4.3. Statistical Significance and Model Robustness

To rigorously evaluate the robustness of observed performance differences, we conducted pairwise McNemar’s tests with Holm correction. These statistical results substantiate that the improvements offered by neural models—and especially ALBERT—over classical baselines are both substantial and statistically significant across all settings. This provides strong evidence that, when data and computational resources permit, deep contextual NLP models are the methodological gold standard for fake news detection.
Notably, in the headline-only scenario, absolute differences between Random Forest and LightGBM are small (macro F1 of 0.85 vs. 0.83), although McNemar's test still favors Random Forest; the only statistically nonsignificant comparison is between LightGBM and Logistic Regression, suggesting that boosting offers little advantage over a strong linear baseline on such concise inputs. In contrast, the advantage of boosting becomes pronounced once more complex, higher-dimensional features (title + content) are incorporated, where LightGBM significantly outperforms both Random Forest and Logistic Regression (Table 4).
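The testing procedure can be summarized in a short sketch using statsmodels; `preds` is a hypothetical dict mapping each model name to its predicted labels on the shared test set.

```python
import numpy as np
from itertools import combinations
from statsmodels.stats.contingency_tables import mcnemar
from statsmodels.stats.multitest import multipletests

def pairwise_mcnemar(preds, y_true):
    """preds: {model_name: predicted-label array}; y_true: gold labels."""
    correct = {m: (p == y_true) for m, p in preds.items()}
    pairs, pvals = [], []
    for a, b in combinations(preds, 2):
        # 2x2 table over (model a correct?, model b correct?); McNemar's
        # statistic depends on the off-diagonal discordant counts.
        table = [[np.sum(correct[a] & correct[b]), np.sum(correct[a] & ~correct[b])],
                 [np.sum(~correct[a] & correct[b]), np.sum(~correct[a] & ~correct[b])]]
        pairs.append((a, b))
        pvals.append(mcnemar(table, exact=False, correction=True).pvalue)
    # Holm step-down correction over all pairwise comparisons.
    reject, p_holm, _, _ = multipletests(pvals, method="holm")
    return [(a, b, p, sig) for (a, b), p, sig in zip(pairs, p_holm, reject)]
```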

4.4. Interpretability and Feature Analysis

Interpretability remains an essential consideration, particularly for real-world applications where model transparency and explainability are required. Classical machine learning models offer direct insights into the decision process. Our feature importance analysis reveals that all models consistently prioritize keywords such as ‘video’, ‘breaking’, and location names as key indicators of fake news. Linear models, such as Logistic Regression, tend to elevate expressive or colloquial terms (e.g., ‘lol’, ‘hilarious’), suggesting sensitivity to the sensational and stylistic language that often characterizes fake headlines. In contrast, ensemble and boosting models leverage broader contextual and event-related cues, especially with article content (e.g., ‘said’, ‘image’, ‘friday’), demonstrating their capacity to incorporate both local and global signals. These observations point to complementary strengths of linear and nonlinear models for different aspects of fake news detection.
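These rankings are read directly off the fitted models. A minimal sketch of the extraction step follows, assuming a fitted TF-IDF `vectorizer` and classifier `model` (names illustrative):

```python
import numpy as np

def top_features(vectorizer, model, k=20):
    # Map importance scores back to the TF-IDF vocabulary.
    names = np.asarray(vectorizer.get_feature_names_out())
    if hasattr(model, "coef_"):
        scores = model.coef_.ravel()          # linear models: signed weights
    else:
        scores = model.feature_importances_   # tree ensembles: split-based importances
    idx = np.argsort(np.abs(scores))[::-1][:k]
    return list(zip(names[idx], scores[idx]))
```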

4.5. Practical Implications

Our findings have practical implications for the design and deployment of fake news detection systems:
  • For environments requiring computational efficiency and interpretability—such as media monitoring or regulatory compliance—classical ensemble models, particularly Random Forest and LightGBM, offer a robust and transparent option, especially when limited to headline data.
  • In mission-critical or high-stakes settings, where detection accuracy is paramount and computational resources are sufficient, transformer-based neural architectures (such as ALBERT) are preferable, particularly when full article content is available.
  • The interpretable nature of classical models can facilitate human-in-the-loop systems and explainable AI pipelines, supporting trust, regulatory transparency, and the rapid identification of emerging fake news topics or patterns.

4.6. Limitations and Future Directions

Despite the strengths of this study, several limitations should be considered. First, our experiments rely exclusively on the WELFake dataset, which, although large and relatively diverse, is composed of English-language news articles collected from four prominent open-access sources. As such, the dataset may not fully capture the linguistic, stylistic, or topical diversity encountered in real-world misinformation across different languages, regions, or media platforms. Consequently, the generalizability of our findings to multilingual, cross-domain, or more dynamic fake news scenarios remains an open question that warrants future investigation [17,19,20]. Second, while the WELFake dataset is balanced, real-world fake news distributions are often imbalanced and can shift over time (temporal or topical drift), challenging generalization. Third, our analysis is limited to supervised learning; future research should explore semi-supervised, unsupervised, and domain adaptation approaches, as well as continual learning for dynamically evolving misinformation landscapes. Fourth, deep learning models, despite their high accuracy, pose ongoing challenges for interpretability. The development and application of model-agnostic interpretability tools and attention-based visualization techniques should be pursued. Finally, while our statistical significance tests provide rigorous comparative validation, future evaluations in operational settings should also consider cost-sensitive metrics, deployment trials, and real-world impact assessments.
Another promising direction for future work is the exploration of hybrid systems that integrate classical machine learning with deep learning architectures. Recent research suggests that combining interpretable feature-based models (such as ensemble trees) with contextual deep learning encoders (e.g., transformers or RNNs) can leverage the strengths of both paradigms—balancing transparency and generalization. Such hybrid approaches may involve feature-level ensembling, stacking, or model distillation, and have shown potential in recent fake news detection studies [17,19,20]. Developing and benchmarking hybrid pipelines represents a valuable next step for improving both the reliability and interpretability of automated misinformation detection systems.
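As one hedged illustration of feature-level ensembling, the sketch below stacks two tree ensembles under a logistic meta-learner; swapping in, or concatenating, frozen transformer sentence embeddings for the input features is the natural hybrid extension. All names and hyperparameters are illustrative, not a pipeline we evaluated.

```python
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from lightgbm import LGBMClassifier

# X_train could be TF-IDF vectors, frozen transformer embeddings, or their
# concatenation; the meta-learner weights the base models' probabilities.
hybrid = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=300)),
                ("lgbm", LGBMClassifier(n_estimators=300))],
    final_estimator=LogisticRegression(max_iter=1000),
    stack_method="predict_proba",
    cv=5,
)
# hybrid.fit(X_train, y_train); hybrid.predict_proba(X_test)
```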
Overall, this study demonstrates that, under mathematically rigorous and statistically validated evaluation protocols, transformer-based models—especially ALBERT—offer state-of-the-art performance for fake news detection, particularly when leveraging both headline and article content. Nevertheless, further work is needed to assess model robustness in more linguistically and geographically diverse contexts and to ensure reliable deployment in the face of evolving news landscapes. These findings provide practical guidance for both researchers and practitioners developing automated solutions for misinformation detection in increasingly complex digital media environments.

4.7. Concluding Remarks

In summary, this study systematically benchmarked both classical machine learning and modern deep learning models for automated fake news detection using the large-scale WELFake dataset. By evaluating multiple input settings (headline, content, and combined), we provided a comprehensive assessment of model performance, interpretability, and computational efficiency. Our results demonstrate that transformer-based models, especially ALBERT, achieve state-of-the-art accuracy, while classical ensembles remain competitive for efficient and interpretable applications. These findings address the stated research objectives and offer actionable guidance for the development and deployment of robust misinformation detection systems.

Author Contributions

Conceptualization, S.X.; Methodology, Y.T. and Z.W. (Zhongyan Wang); Software, S.X. and Z.W. (Zhongyan Wang); Validation, Z.W. (Zhongyan Wang); Formal analysis, Y.T. and Z.W. (Zhongyan Wang); Investigation, Y.T., Z.W. (Zhongyan Wang) and Z.W. (Zijing Wei); Resources, Y.C. and Z.W. (Zijing Wei); Data curation, Y.T., S.X., Y.C., Z.W. (Zhongyan Wang) and Z.W. (Zijing Wei); Writing—original draft, Y.C. and Z.W. (Zijing Wei); Writing—review & editing, Y.C. and Z.W. (Zijing Wei); Visualization, S.X.; Supervision, S.X.; Project administration, Y.C. and Z.W. (Zijing Wei). All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are openly available in the Zenodo repository, record number 4561253; the dataset is described in Verma et al., 2021 [29] (https://doi.org/10.1109/TCSS.2021.3068519).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Lazer, D.; Baum, M.; Benkler, Y.; Berinsky, A.; Greenhill, K.; Menczer, F.; Metzger, M.; Nyhan, B.; Pennycook, G.; Rothschild, D.; et al. The Science of Fake News. Science 2018, 359, 1094–1096. [Google Scholar] [CrossRef] [PubMed]
  2. Allcott, H.; Gentzkow, M. Social Media and Fake News in the 2016 Election. J. Econ. Perspect. 2017, 31, 211–236. [Google Scholar] [CrossRef]
  3. Zarocostas, J. How to Fight an Infodemic. Lancet 2020, 395, 676. [Google Scholar] [CrossRef]
  4. Uluşan, O.; Özejder, İ. Faking the War: Fake Posts on Turkish Social Media During the Russia–Ukraine War. Humanit. Soc. Sci. Commun. 2024, 11, 891. [Google Scholar] [CrossRef]
  5. Vosoughi, S.; Roy, D.; Aral, S. The Spread of True and False News Online. Science 2018, 359, 1146–1151. [Google Scholar] [CrossRef]
  6. Himelein-Wachowiak, M.; Giorgi, S.; Devoto, A.; Rahman, M.; Ungar, L.; Schwartz, H.A.; Epstein, D.H.; Leggio, L.; Curtis, B. Bots and Misinformation Spread on Social Media: Implications for COVID-19. J. Med. Internet Res. 2021, 23, e26933. [Google Scholar] [CrossRef]
  7. Shu, K.; Sliva, A.; Wang, S.; Tang, J.; Liu, H. Fake News Detection on Social Media: A Data Mining Perspective. SIGKDD Explor. Newsl. 2017, 19, 22–36. [Google Scholar] [CrossRef]
  8. Conroy, N.; Rubin, V.; Chen, Y. Automatic Deception Detection: Methods for Finding Fake News. Proc. Assoc. Inf. Sci. Technol. 2015, 52, 1–4. [Google Scholar] [CrossRef]
  9. Devi, V.S.; Kannimuthu, S. Author Profiling in Code-Mixed WhatsApp Messages Using Stacked Convolution Networks and Contextualized Embedding Based Text Augmentation. Neural Process. Lett. 2023, 55, 589–614. [Google Scholar] [CrossRef]
  10. Devi, V.S.; Kannimuthu, S.; Madasamy, A.K. The Effect of Phrase Vector Embedding in Explainable Hierarchical Attention-Based Tamil Code-Mixed Hate Speech and Intent Detection. IEEE Access 2024, 12, 11316–11329. [Google Scholar] [CrossRef]
  11. Ruchansky, N.; Seo, S.; Liu, Y. CSI: A Hybrid Deep Model for Fake News Detection. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, Singapore, 6–10 November 2017. [Google Scholar] [CrossRef]
  12. Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; Soricut, R. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. arXiv 2020, arXiv:1909.11942. [Google Scholar]
  13. Zhang, Y.; Wang, Z.; Ding, Z.; Tian, Y.; Dai, J.; Shen, X.; Liu, Y.; Cao, Y. Tutorial on using machine learning and deep learning models for mental illness detection. arXiv 2025, arXiv:2502.04342. [Google Scholar]
  14. Kaliyar, R.K.; Goswami, A.; Narang, P. FakeBERT: Fake news detection in social media with a BERT-based deep learning approach. Multimed. Tools Appl. 2021, 80, 11765–11788. [Google Scholar] [CrossRef]
  15. Kula, S.; Choraś, M.; Kozik, R. Application of the BERT-Based Architecture in Fake News Detection. In Proceedings of the 13th International Conference on Computational Intelligence in Security for Information Systems (CISIS 2020), Burgos, Spain, 16–18 September 2020; Herrero, A., Cambra, C., Urda, D., Sedano, J., Quintián, H., Corchado, E., Eds.; Advances in Intelligent Systems and Computing. Springer: Cham, Switzerland, 2021; Volume 1267, pp. 233–241. [Google Scholar] [CrossRef]
  16. Liu, C.; Lin, Z.; Liu, M.; Sun, Y.; Zhou, D. A Two-Stage Model Based on BERT for Short Fake News Detection. In Proceedings of the Knowledge Science, Engineering and Management (KSEM 2019), Athens, Greece, 28–30 August 2019; Douligeris, C., Karagiannis, D., Apostolou, D., Eds.; Proceedings, Part II. Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2019; Volume 11776, pp. 337–350. [Google Scholar] [CrossRef]
  17. Tahmasebi, S.; Hakimov, S.; Ewerth, R.; Müller-Budack, E. Improving Generalization for Multimodal Fake News Detection. In Proceedings of the International Conference on Multimedia Retrieval (ICMR ’23), Thessaloniki, Greece, 12–15 June 2023; p. 5. [Google Scholar] [CrossRef]
  18. Hua, J.; Cui, X.; Li, X.; Tang, K.; Zhu, P. Multimodal fake news detection through data augmentation-based contrastive learning. Appl. Soft Comput. 2023, 136, 110125. [Google Scholar] [CrossRef]
  19. Bang, Y.; Ishii, E.; Cahyawijaya, S.; Ji, Z.; Fung, P. Model Generalization on COVID-19 Fake News Detection. In Combating Online Hostile Posts in Regional Languages During Emergency Situation. CONSTRAINT 2021, Collocated with AAAI 2021, Virtual Event, 8 February 2021, Revised Selected Papers; Chakraborty, T., Shu, K., Bernard, H.R., Liu, H., Akhtar, M.S., Eds.; Communications in Computer and Information Science; Springer: Cham, Switzerland, 2021; Volume 1402, pp. 191–206. [Google Scholar] [CrossRef]
  20. Suprem, A.; Vaidya, S.; Pu, C. Exploring Generalizability of Fine-Tuned Models for Fake News Detection. In Proceedings of the 2022 IEEE 8th International Conference on Collaboration and Internet Computing (CIC), Atlanta, GA, USA, 14–16 December 2022; pp. 82–88. [Google Scholar] [CrossRef]
  21. Glazkova, A. Data Augmentation for Fake News Detection by Combining Seq2seq and NLI. In Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing (RANLP 2023), Varna, Bulgaria, 4–6 September 2023; Mitkov, R., Angelova, G., Eds.; INCOMA Ltd.: Shoumen, Bulgaria, 2023; pp. 429–439. [Google Scholar]
  22. Gupta, A.; Kumaraguru, P. Credibility Ranking of Tweets during High Impact Events. In Proceedings of the 1st Workshop on Privacy and Security in Online Social Media, Lyon, France, 17 April 2012. [Google Scholar] [CrossRef]
  23. Zhou, X.; Zafarani, R. A Survey of Fake News: Fundamental Theories, Detection Methods, and Opportunities. ACM Comput. Surv. 2020, 53, 109. [Google Scholar] [CrossRef]
  24. Liu, Y.; Shen, X.; Zhang, Y.; Wang, Z.; Tian, Y.; Dai, J.; Cao, Y. A Systematic Review of Machine Learning Approaches for Detecting Deceptive Activities on Social Media: Methods, Challenges, and Biases. Int. J. Data Sci. Anal. 2025. [Google Scholar] [CrossRef]
  25. Chawla, N.; Japkowicz, N.; Kołcz, A. Editorial: Special Issue on Learning from Imbalanced Data Sets. SIGKDD Explor. 2004, 6, 1–6. [Google Scholar] [CrossRef]
  26. Sun, Y.; Wong, A.K.C.; Kamel, M.S. Classification of Imbalanced Data: A Review. Int. J. Pattern Recognit. Artif. Intell. 2009, 23, 687–719. [Google Scholar] [CrossRef]
  27. Ding, Z.; Wang, Z.; Zhang, Y.; Cao, Y.; Liu, Y.; Shen, X.; Tian, Y.; Dai, J. Trade-offs between machine learning and deep learning for mental illness detection on social media. Sci. Rep. 2025, 15, 14497. [Google Scholar] [CrossRef]
  28. Bay, Y.Y.; Yearick, K.A. Machine Learning vs Deep Learning: The Generalization Problem. arXiv 2024, arXiv:2403.01621. [Google Scholar]
  29. Verma, P.K.; Agrawal, P.; Amorim, I.; Prodan, R. WELFake: Word Embedding Over Linguistic Features for Fake News Detection. IEEE Trans. Comput. Soc. Syst. 2021, 8, 881–893. [Google Scholar] [CrossRef]
  30. Jones, K.S. A Statistical Interpretation of Term Specificity and Its Application in Retrieval. J. Doc. 1972, 28, 11–21. [Google Scholar] [CrossRef]
  31. Cao, Y.; Dai, J.; Wang, Z.; Zhang, Y.; Shen, X.; Liu, Y.; Tian, Y. Machine learning approaches for depression detection on social media: A systematic review of biases and methodological challenges. J. Behav. Data Sci. 2025, 5, 1–36. [Google Scholar] [CrossRef]
  32. Hosmer, D.W.; Lemeshow, S. Applied Logistic Regression, 2nd ed.; John Wiley & Sons, Inc.: New York, NY, USA, 2000. [Google Scholar] [CrossRef]
  33. Breiman, L.; Friedman, J.H.; Olshen, R.A.; Stone, C.J. Classification and Regression Trees; Wadsworth & Brooks/Cole Advanced Books & Software: Monterey, CA, USA, 1984. [Google Scholar]
  34. Xu, S.; Cao, Y.; Wang, Z.; Tian, Y. Fraud Detection in Online Transactions: Toward Hybrid Supervised–Unsupervised Learning Pipelines. In Proceedings of the 2025 6th International Conference on Electronic Communication and Artificial Intelligence (ICECAI 2025), Chengdu, China, 20–22 June 2025. [Google Scholar]
  35. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar]
  36. Friedman, J.H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 1189–1232. [Google Scholar] [CrossRef]
  37. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.Y. LightGBM: A highly efficient gradient boosting decision tree. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 3149–3157. [Google Scholar]
  38. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2019, arXiv:1810.04805. [Google Scholar]
  39. Cho, K.; van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1724–1734. [Google Scholar]
  40. McNemar, Q. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika 1947, 12, 153–157. [Google Scholar] [CrossRef]
  41. Mosca, E.; Szigeti, F.; Tragianni, S.; Gallagher, D.; Groh, G. SHAP-based Explanation Methods: A Review for NLP Interpretability. In Proceedings of the 29th International Conference on Computational Linguistics (COLING), Gyeongju, Republic of Korea, 12–17 October 2022; pp. 4593–4603. [Google Scholar]
  42. Musi, E.; Reed, C.; Aakhus, M. A systematic review on the detection of fake news. arXiv 2022, arXiv:2110.11240. [Google Scholar]
  43. Musi, E.; Reed, C. From fallacies to semi-fake news: Improving the identification of misinformation triggers across digital media. Discourse Soc. 2022, 33, 439–457. [Google Scholar] [CrossRef]
Figure 1. Top 20 most important features for each machine learning model (Logistic Regression, Random Forest, LightGBM) under title-only (top row), content-only (middle row), and title + content (bottom row) inputs: (a) Logistic Regression (title only). (b) Random Forest (title only). (c) LightGBM (title only). (d) Logistic Regression (content only). (e) Random Forest (content only). (f) LightGBM (content only). (g) Logistic Regression (title + content). (h) Random Forest (title + content). (i) LightGBM (title + content).
Table 1. Summary statistics of title and content lengths in the WELFake dataset.

| Feature | Typical Range | Average | Max | Min | Notes |
| Title length | 1–20 words | 12.17 words | 72 | 0 | 558 titles missing |
| Content length | 0–1000+ words | (not specified) | 24,000 | 0 | 783 contents missing |
Table 2. Training Computational Time (in seconds) for Each Model Across Input Configurations.

| Model | Title Only | Content Only | Title + Content |
| Logistic Regression | 2.95 | 2.54 | 3.01 |
| Random Forest | 1.87 | 7.50 | 10.76 |
| LightGBM | 2.32 | 22.41 | 26.91 |
| ALBERT | 1440.69 | 2095.64 | 6489.21 |
| GRU | 107.16 | 211.41 | 413.48 |
Table 3. Test set performance (macro-averaged) for all models using title-only, content-only, and title + content inputs.

| Input | Model | Precision | Recall | F1-Score | AUC-ROC |
| Title only | Logistic Regression | 0.84 | 0.84 | 0.84 | 0.92 |
| | Random Forest | 0.85 | 0.85 | 0.85 | 0.93 |
| | LightGBM | 0.83 | 0.83 | 0.83 | 0.92 |
| | ALBERT | 0.92 | 0.93 | 0.93 | 0.98 |
| | GRU | 0.80 | 0.80 | 0.90 | 0.96 |
| Content only | Logistic Regression | 0.93 | 0.93 | 0.93 | 0.98 |
| | Random Forest | 0.92 | 0.92 | 0.92 | 0.98 |
| | LightGBM | 0.94 | 0.94 | 0.94 | 0.98 |
| | ALBERT | 0.97 | 0.97 | 0.97 | 1.00 |
| | GRU | 0.97 | 0.97 | 0.97 | 1.00 |
| Title + Content | Logistic Regression | 0.93 | 0.93 | 0.93 | 0.98 |
| | Random Forest | 0.93 | 0.93 | 0.93 | 0.98 |
| | LightGBM | 0.94 | 0.94 | 0.94 | 0.99 |
| | ALBERT | 0.98 | 0.99 | 0.98 | 1.00 |
| | GRU | 0.97 | 0.97 | 0.97 | 1.00 |
Table 4. Pairwise McNemar Test Results (Holm corrected) for Model Comparisons: Title-Only, Content-Only, and Title + Content Models.

| Model 1 | Model 2 | Statistic | Winner | Corrected p-Value | Significant |

Title Only
| LightGBM | RandomForest | 51.60 | RandomForest | <0.0001 | Yes |
| LightGBM | LR | 0.96 | LR | 0.33 | No |
| LightGBM | ALBERT | 801.63 | ALBERT | <0.0001 | Yes |
| LightGBM | GRU | 425.25 | GRU | <0.0001 | Yes |
| RandomForest | LR | 24.39 | RandomForest | <0.0001 | Yes |
| RandomForest | ALBERT | 553.23 | ALBERT | <0.0001 | Yes |
| RandomForest | GRU | 260.98 | GRU | <0.0001 | Yes |
| LR | ALBERT | 789.32 | ALBERT | <0.0001 | Yes |
| LR | GRU | 394.33 | GRU | <0.0001 | Yes |
| ALBERT | GRU | 77.64 | ALBERT | <0.0001 | Yes |

Content Only
| LightGBM | RandomForest | 65.08 | LightGBM | <0.0001 | Yes |
| LightGBM | LR | 9.78 | LightGBM | 0.0018 | Yes |
| LightGBM | ALBERT | 223.67 | ALBERT | <0.0001 | Yes |
| LightGBM | GRU | 223.67 | GRU | <0.0001 | Yes |
| RandomForest | LR | 20.36 | LR | <0.0001 | Yes |
| RandomForest | ALBERT | 422.97 | ALBERT | <0.0001 | Yes |
| RandomForest | GRU | 422.97 | GRU | <0.0001 | Yes |
| LR | ALBERT | 295.33 | ALBERT | <0.0001 | Yes |
| LR | GRU | 295.33 | GRU | <0.0001 | Yes |
| ALBERT | GRU | — | Tie | 0.0000 | Yes |

Title + Content
| LightGBM | RandomForest | 64.50 | LightGBM | <0.0001 | Yes |
| LightGBM | LR | 30.82 | LightGBM | <0.0001 | Yes |
| LightGBM | ALBERT | 371.12 | ALBERT | <0.0001 | Yes |
| LightGBM | GRU | 207.76 | GRU | <0.0001 | Yes |
| RandomForest | LR | 5.39 | LR | 0.02 | Yes |
| RandomForest | ALBERT | 592.98 | ALBERT | <0.0001 | Yes |
| RandomForest | GRU | 408.49 | GRU | <0.0001 | Yes |
| LR | ALBERT | 524.70 | ALBERT | <0.0001 | Yes |
| LR | GRU | 341.88 | GRU | <0.0001 | Yes |
| ALBERT | GRU | 41.65 | ALBERT | <0.0001 | Yes |