This section reports the results of the proposed pipeline, assessing data augmentation, ensemble learning, transfer learning, and active learning using accuracy, precision, recall, F1-score, and AUC. Comparative analyses confirm improved robustness and generalisation in class-imbalanced, low-resource settings, demonstrating the framework’s effectiveness for fake news detection. Throughout, BRF denotes Balanced Random Forest, a class-balanced ensemble method, and SMOTE denotes the Synthetic Minority Over-sampling Technique, a data-level augmentation method.
4.1. Fake News Detection Pipeline Performance
Table 6 summarises the performance of all evaluated models across key metrics, including Accuracy (ACC), Precision (PRE), Recall (REC), F1-score (F1), Area Under the Curve (AUC), and Cohen’s Kappa (KAPPA). These results capture the comparative behaviour of classical, ensemble, resampling, autoencoder-based, and active learning configurations under class-imbalanced conditions. Mean and standard deviation values, computed across multiple folds, provide insight into both central performance trends and model stability. This comprehensive comparison establishes the empirical foundation for subsequent analysis of robustness, generalisation, and trade-offs among the evaluated fake news detection approaches. Each node in the propagation tree is treated as an independent text instance for feature extraction and classification.
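To make the role of the chance-corrected and class-averaged metrics concrete, the following minimal sketch computes macro F1 and Cohen’s Kappa from label lists. The function names and the toy labels are illustrative assumptions, not the evaluation code used in this study.

```python
from collections import Counter

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores over all observed classes."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def cohen_kappa(y_true, y_pred):
    """Agreement between predictions and labels, corrected for chance."""
    n = len(y_true)
    p_observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_chance = sum(true_counts[c] * pred_counts[c] for c in true_counts) / n ** 2
    return (p_observed - p_chance) / (1 - p_chance) if p_chance < 1 else 0.0

# Hypothetical toy case: a classifier that always predicts the majority class.
y_true = ["TRUE"] * 8 + ["FALSE"] * 2
y_pred = ["TRUE"] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, macro_f1(y_true, y_pred), cohen_kappa(y_true, y_pred))
```

In this toy case accuracy is 0.8, yet Kappa is zero and macro F1 falls below 0.5, which is the pattern of majority bias that accuracy alone would hide.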
The evaluated configurations can be interpreted as a staged ablation analysis, where baseline models establish reference performance, augmentation improves minority exposure, ensemble methods reduce variance, and autoencoder-based representations enhance feature structure. This progression highlights the contribution of each component under imbalanced conditions. Although SVM achieved the highest baseline accuracy, all models exhibited weak recall and F1 scores, particularly for minority classes, indicating strong majority bias and limited generalisation. Low Cohen’s Kappa values further confirm poor agreement beyond chance, emphasising the need for imbalance mitigation. Consequently, synthetic oversampling using SMOTE significantly improved minority sensitivity and overall class balance, with augmented models consistently outperforming their non-augmented counterparts.
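The core idea of SMOTE, interpolating between a minority sample and one of its nearest minority neighbours, can be sketched as follows. This is a simplified illustration under stated assumptions (Euclidean distance, a hypothetical function name `smote_like`, and default values for `k` and `seed` chosen for the example); it is not the implementation or configuration used in the experiments.

```python
import random

def smote_like(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points by interpolating minority neighbours.

    minority: list of feature vectors (lists of floats), at least two rows.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x, excluding x itself.
        neighbours = sorted(
            (m for m in minority if m is not x),
            key=lambda m: sum((a - b) ** 2 for a, b in zip(x, m)),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from x to nb
        synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic

# Hypothetical two-dimensional minority samples.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_like(minority, n_new=6)
```

Because every synthetic point lies on a segment between two real minority samples, oversampling of this kind should be applied to the training folds only, so that no interpolated information reaches the validation data.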
Autoencoder-based transfer learning provided an additional mechanism to enhance minority-class detection. As reported in Table 6, the Encoder model achieved high precision with moderate recall, indicating that latent embeddings preserved discriminative information while reducing dimensionality. Decoder-generated synthetic samples further enriched minority exposure, improving classification balance. The workflow illustrated in Figure 2 demonstrates how encoder–decoder architectures support both representation learning and informed augmentation within a unified process. The autoencoder should be interpreted as a feature compression and representation learning component, rather than a standalone classifier, as it enhances latent structure but does not optimise recall independently. Active learning with uncertainty sampling, shown in Figure 3, iteratively selected informative instances, yielding incremental improvements in minority F1. Using logistic regression as an efficient base learner, performance gains stabilised after the eighth round, indicating convergence and diminishing returns. These findings suggest that combining representation learning with selective querying enhances cost efficiency in low-resource and imbalanced misinformation settings.
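The query step of uncertainty sampling can be illustrated with a least-confidence criterion, in which the instances whose highest predicted class probability is lowest are sent for labelling. The text does not state which uncertainty criterion was used, so least confidence is an assumption here, as are the function name and the example probabilities.

```python
def least_confident(probabilities, batch_size):
    """Return indices of the pool instances with the lowest top-class probability.

    probabilities: one list of class probabilities per unlabelled instance,
    for example the output of a fitted logistic regression model.
    """
    confidence = [max(p) for p in probabilities]
    ranked = sorted(range(len(probabilities)), key=lambda i: confidence[i])
    return ranked[:batch_size]

# Hypothetical four-class probabilities for four unlabelled instances.
pool_probs = [
    [0.90, 0.05, 0.03, 0.02],
    [0.30, 0.28, 0.22, 0.20],
    [0.55, 0.25, 0.10, 0.10],
    [0.26, 0.25, 0.25, 0.24],
]
to_label = least_confident(pool_probs, batch_size=2)
```

In each round the selected instances would be labelled, added to the training set, and the base learner refitted; the loop stops once the validation metric no longer improves.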
Although absolute AUC values appear modest, performance must be interpreted within the constraints of the dataset: a four-class structure, severe minority scarcity (n = 53 for the FALSE class), and substantial linguistic overlap. Under these conditions, baseline classifiers approached near-random discrimination, particularly for minority instances. In contrast, ensemble and augmentation strategies achieved macro-F1 improvements exceeding 30% and reduced fold variance by up to 40%, indicating enhanced stability and minority sensitivity. Importantly, this study does not aim to maximise absolute accuracy, but to evaluate robustness under constrained settings. Therefore, performance should be assessed through relative improvements, variance reduction, and minority recall, which better reflect practical utility in low-resource misinformation detection and demonstrate meaningful boundary refinement beyond conventional metrics.
To provide a reference point for modern NLP approaches, contextual semantic representations were extracted using a multilingual BERT model and used as input features for the same classifiers evaluated in the linguistic-feature pipeline (DT, SVM, and FCNN). This choice ensures a fair comparison at the representation level while avoiding confounding effects introduced by large-scale fine-tuning and computational resource disparities. Unlike handcrafted linguistic indicators, BERT embeddings encode contextual semantic information learned from large-scale corpora. The results presented in Table 7 therefore function as a semantic baseline, enabling a comparison between interpretable linguistic features and transformer-based representations within the same classification framework. BERT is intentionally used as a frozen feature extractor rather than a fully fine-tuned transformer model: the objective is not to achieve state-of-the-art performance, but to evaluate representation-level differences, rather than optimisation-dependent performance, under identical experimental conditions.
While contextual embeddings capture semantic relations, linguistic indicators provide interpretable signals that facilitate analysis of misinformation patterns.
While fully fine-tuned transformer models may achieve higher absolute performance, their effectiveness depends on large annotated datasets and substantial computational resources. In contrast, the present study deliberately focuses on realistic low-resource conditions, where large-scale pretraining and GPU-intensive optimisation are not feasible. Fine-tuned transformer models were therefore not considered, both to ensure methodological fairness and to reflect these deployment conditions. The comparison with frozen BERT embeddings is intended to isolate representation-level differences rather than to compete with fully optimised transformer pipelines, in line with the study’s objective of evaluating interpretable and deployable solutions under constrained environments.
4.2. Comparative Analysis
Table 6 compares performance across classical, resampling, ensemble, and active learning configurations. Overall, class-balanced ensemble methods, particularly Balanced Random Forest and Bagging, achieved the most consistent trade-off across metrics, demonstrating superior stability under severe class imbalance. In contrast, baseline models such as DT, SVM, and FCNN exhibited poor generalisation, with low F1 scores and near-random AUC values. Although Active Learning attained the highest accuracy, its lower F1 and increased variance indicate a bias toward majority class predictions. These results confirm that single models are insufficient under extreme imbalance, while ensemble aggregation significantly improves robustness. Consistent improvements over baseline models, together with statistically supported gains, also indicate that learning is occurring despite inherently low absolute performance under extreme data constraints.
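One common formulation of a class-balanced ensemble trains each tree on a bootstrap sample in which every class contributes the same number of instances. The sketch below illustrates only that sampling step; the function name, the per-class sample size (the size of the smallest class), and the toy labels are assumptions made for illustration and may differ from the BRF configuration used here.

```python
import random
from collections import defaultdict

def balanced_bootstrap(labels, seed=0):
    """Draw a bootstrap sample of indices with equal counts per class.

    Each class contributes as many indices (sampled with replacement)
    as the smallest class has instances.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    n_per_class = min(len(indices) for indices in by_class.values())
    sample = []
    for indices in by_class.values():
        sample.extend(rng.choices(indices, k=n_per_class))
    return sample

# Hypothetical imbalanced label vector with three classes.
labels = ["A"] * 20 + ["B"] * 5 + ["C"] * 3
sample = balanced_bootstrap(labels)
```

Repeating this draw with a different seed for each tree gives every base learner a balanced view of the data, while the ensemble as a whole still sees most majority-class instances.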
Viewed as a staged ablation, the configurations show that resampling techniques, including SMOTE, improved minority recall but produced variable outcomes depending on the model. Importantly, the strongest gains emerged from combining synthetic augmentation with class-balanced ensembles, highlighting the complementary interaction between data-level and model-level strategies. This synergy resulted in improved stability, reduced variance, and more balanced decision boundaries, demonstrating that performance improvements are driven by integration rather than isolated techniques.
Finally, autoencoder and active learning models revealed distinct behaviours. The Encoder achieved high precision but low recall and F1, indicating limited generalisation and a narrow decision boundary. Similarly, Active Learning improved accuracy through iterative sampling but showed modest AUC and sensitivity to sample selection. These findings suggest that while representation learning and selective querying offer efficiency benefits, they do not match the robustness of ensemble and resampling approaches. Overall, pronounced precision–recall trade-offs emphasise the difficulty of balancing sensitivity and specificity in imbalanced fake news detection, reinforcing the need for integrated and interpretable solutions.
4.3. Mutual Information
To better illustrate the relative importance of the extracted features, Figure 5 presents a horizontal bar chart showing the top twenty variables ranked by their mutual information values. For readability, the feature names are abbreviated.
Mutual information analysis reveals that discourse and argumentative markers are the most informative predictors of fake news. Specifically, features from Argumentative Discourse Markers (ADM), including exemplification–concession, generalizers, and justifying expressions, indicate that deceptive narratives rely on structured framing to enhance plausibility. In addition, Textual Discourse Markers (TDM), such as distributors and finalizatives, highlight the role of discourse organisation. Moreover, emotional features from the NRC and LMc lexicons, particularly Joy, Anger, and Sadness, suggest that misinformation leverages affective framing, while syntactic markers, including Demonstratives and Negations, indicate referential ambiguity. Importantly, the lower contribution of GFOG confirms the strength of targeted linguistic indicators. To complement this, Figure 6 illustrates class distributions, revealing partial separation and overlap, which reflects the inherent complexity of distinguishing misinformation categories.
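For reference, the mutual information between a discrete (or discretised) feature and the class label can be computed directly from co-occurrence counts, as sketched below. The estimator actually used for Figure 5 is not specified in this section, so this plug-in estimate, the function name, and the toy feature names are illustrative assumptions; continuous features would first need binning or a different estimator.

```python
import math
from collections import Counter

def mutual_information(feature, labels):
    """Plug-in estimate of I(feature; label) in nats for discrete values."""
    n = len(labels)
    joint = Counter(zip(feature, labels))
    feature_counts = Counter(feature)
    label_counts = Counter(labels)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) ), written with raw counts.
        mi += (count / n) * math.log(count * n / (feature_counts[x] * label_counts[y]))
    return mi

# Hypothetical binned features for four documents in two classes.
labels = ["fake", "fake", "real", "real"]
features = {
    "marker_present": [1, 1, 0, 0],   # perfectly aligned with the label
    "unrelated_flag": [0, 1, 0, 1],   # independent of the label
}
ranking = sorted(features, key=lambda name: mutual_information(features[name], labels),
                 reverse=True)
```

Sorting features by this score in descending order yields the kind of ranking that a top-twenty bar chart displays.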
Figure 7 illustrates the comparative recall performance of all evaluated models, highlighting their ability to correctly identify fake news instances. The bars represent mean recall values across validation folds, with error bars indicating standard deviation, thereby reflecting model stability under class-imbalanced conditions. This visual comparison complements the quantitative results in Table 6, providing insight into how different learning strategies, such as resampling, ensemble integration, and transfer learning, affect sensitivity to minority-class examples in fake news detection.
4.4. Statistical Validation
This subsection reports formal statistical analyses to determine whether performance differences across models reflect robust improvements rather than sampling variability.
Table 8 presents 95% confidence intervals, median values, interquartile ranges, and mean ± standard deviation for AUC under five-fold cross-validation. These statistics quantify uncertainty and allow a more rigorous comparison of discriminative capacity. Ensemble-based and SMOTE-enhanced models, including BRF, Bagging, and FCNN with SMOTE, display narrower confidence intervals and higher median AUC values, indicating greater stability and consistent separation across folds. In contrast, baseline models such as DT and FCNN show wider intervals and larger dispersion, suggesting reduced robustness under imbalance. This analysis strengthens inferential reliability by moving beyond point estimates toward distribution-aware evaluation.
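The summary statistics of this kind can be derived from the five per-fold AUC values as sketched below. The section does not state how the confidence intervals were constructed, so the Student-t interval is an assumption, as are the function name and the example fold scores; the quartile convention is the default of Python's `statistics.quantiles` and other conventions give slightly different interquartile ranges.

```python
import math
import statistics

# Two-sided 95% Student-t critical value for 4 degrees of freedom (five folds).
T_CRIT_DF4 = 2.776

def summarise_folds(scores):
    """Return mean, SD, 95% t-interval, median and IQR for five fold scores."""
    if len(scores) != 5:
        raise ValueError("T_CRIT_DF4 is only valid for exactly five folds")
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)  # sample standard deviation (n - 1)
    half_width = T_CRIT_DF4 * sd / math.sqrt(len(scores))
    q1, _, q3 = statistics.quantiles(scores, n=4)
    return {
        "mean": mean,
        "sd": sd,
        "ci95": (mean - half_width, mean + half_width),
        "median": statistics.median(scores),
        "iqr": q3 - q1,
    }

# Hypothetical per-fold AUC values for one model.
summary = summarise_folds([0.52, 0.55, 0.58, 0.61, 0.64])
```

With only five folds the interval is wide by construction, which is why interval width and dispersion are compared across models rather than read as precise estimates.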
Table 9 presents pairwise comparisons of F1-score between the SVM baseline and enhanced models using summary statistics over cross-validation folds. A two-sided z-test on the difference of means was applied, with Holm correction for multiple comparisons. Results show that ensemble-based models, particularly BRF and Bagging, achieved statistically significant improvements with large effect sizes. In addition, SMOTE-based variants also demonstrated significant gains, although with comparatively smaller effects. Conversely, DT and Encoder underperformed relative to the baseline, while FCNN and Active Learning showed no meaningful differences. Importantly, effect sizes should be interpreted cautiously, given the limited number of folds and the approximation of independent samples. Overall, these findings confirm that combining ensemble learning with data augmentation yields consistent and practically meaningful improvements in F1-score.
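The testing procedure described above can be sketched from fold-level summary statistics as follows. The standard error treats the two sets of fold scores as independent samples of equal size, which is the approximation noted in the text; the function names and the example p-values are illustrative assumptions rather than values from Table 9.

```python
import math

def two_sided_z(mean_a, sd_a, mean_b, sd_b, n):
    """z statistic and two-sided p-value for a difference of two means.

    Assumes n independent observations per group (here, folds).
    """
    standard_error = math.sqrt((sd_a ** 2 + sd_b ** 2) / n)
    z = (mean_a - mean_b) / standard_error
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p

def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on;
        # the running maximum keeps the adjusted values monotone.
        running_max = max(running_max, (m - rank) * p_values[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted

# Hypothetical raw p-values for three model-versus-baseline comparisons.
adjusted = holm_adjust([0.01, 0.04, 0.03])
```

Because cross-validation folds share training data, the independence assumption is optimistic, so adjusted p-values near the threshold deserve the cautious reading given above.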
Taken together, the Holm-adjusted comparisons in Table 9 indicate that class-balanced ensemble methods, particularly BRF and Bagging-based configurations, achieve the most consistent and statistically meaningful improvements over the SVM baseline, confirming their robustness under severe class imbalance. In contrast, DT and Encoder exhibit weaker generalisation, reflecting the limitations of single or unregularised models. These findings demonstrate the advantage of integrating ensemble aggregation with data-centric strategies in imbalanced misinformation detection. From a statistical perspective, inference combines cross-validation estimates with controlled comparisons, and results are interpreted cautiously due to the limited number of folds. Effect size analysis supports practical relevance, indicating that the observed improvements reflect systematic methodological advantages rather than random variation.
4.5. Summary of Key Findings
The results demonstrate that the proposed pipeline effectively addresses the challenges of small, imbalanced datasets by improving stability, minority-class recall, and overall robustness. Baseline models exhibited poor generalisation and strong bias toward majority classes, confirming the limitations of traditional classifiers under data scarcity. In contrast, data-centric strategies, particularly synthetic oversampling and augmentation, significantly enhanced minority representation, enabling models to capture more informative linguistic patterns. Moreover, the integration of autoencoder-based representations further enriched the feature space structure, supporting more discriminative learning under constrained conditions.
Importantly, the strongest performance gains were achieved through the combination of class-balanced ensemble methods and resampling techniques, which consistently improved stability and reduced variance across folds. While active learning contributed to efficient sample selection and reduced annotation effort, its performance remained sensitive to data distribution. Overall, the findings highlight that hybrid approaches, integrating linguistic features, augmentation, and ensemble learning, provide the most reliable and interpretable solutions, particularly when performance is evaluated through robustness-oriented metrics rather than absolute accuracy in low-resource misinformation detection settings. Augmentation modifies feature space geometry while preserving core linguistic patterns, thereby supporting robust decision boundaries. Furthermore, the experimental setup can be interpreted as a staged ablation analysis, enabling systematic evaluation of component-wise contributions.
Additional distributional checks confirmed that key linguistic feature statistics were preserved after augmentation, indicating that synthetic samples did not distort the underlying feature space.