Algorithms
  • Article
  • Open Access

16 October 2024

Hybrid RFSVM: Hybridization of SVM and Random Forest Models for Detection of Fake News

1 GGSIPU, NSUT East Campus (Formerly Ambedkar Institute of Advanced Communication Technologies and Research), New Delhi 110031, India
2 NSUT East Campus (Formerly Ambedkar Institute of Advanced Communication Technologies and Research), New Delhi 110031, India
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Algorithms in Data Classification (2nd Edition)

Abstract

Fake information can be created and spread very easily across online communities, and this pervasive escalation of fake news and rumors has a strongly adverse effect on society. Detecting fake news on the social web is an emerging research topic. In this work, the authors review various characteristics of fake news and identify research gaps. The fake news dataset is modeled and tokenized by applying term frequency–inverse document frequency (TFIDF), and several machine-learning classification approaches are used to compute evaluation metrics. The authors propose hybridizing the SVM and RF classification algorithms to improve accuracy, precision, recall, and F1-score. The authors also present a comparative analysis of different news categories using various machine-learning models and compare them against the performance of the hybrid RFSVM. Comparative studies of the hybrid RFSVM with algorithms such as Random Forest (RF), naïve Bayes (NB), SVMs, and XGBoost show improvements of around 8% to 16% in terms of accuracy, precision, recall, and F1-score.

1. Introduction

More people now spend their time communicating online and consuming news from web media rather than from press agencies. There has been a change in consumption behavior, as it is cheaper and faster to obtain news or information from online media platforms than from traditional press media such as newspapers or television. Related concepts of information are distinguished by characteristics such as legitimacy or authenticity, intention, and whether the content is news at all; such content may be misleading, deceptive, a rumor, fake news, or malicious fake news. The authors present several perspectives for this study, based on knowledge, style, and propagation.

1.1. Knowledge-Based Study

This study aims to analyze and detect fake news using a fact-checking process. In fact checking, knowledge is extracted from verified news content to check the authenticity of the news. Expert-based fact checking relies on well-known websites that report on the authenticity of topics, and this information can support further scrutiny for verification purposes. Table 1 [1] shows a review of expert-based websites. For example, HoaxSlayer focuses on the authenticity of information and categorizes articles and information into hoaxes, junk e-mails, and false news.
Table 1. Expert-based fact-checking websites.
Figure 1 shows the automatic fact-checking process. This process has two parts: knowledge-base construction (knowledge extraction) and knowledge comparison (checking). In the first part, knowledge representing raw facts and data is extracted from the web, and a knowledge graph is constructed from it; redundancy reduction, invalidity reduction, conflict resolution, credibility improvement, and completeness enhancement are then applied to this extracted knowledge. In the fact-checking step, the news content to be verified is compared against the knowledge extracted from the knowledge base to check the authenticity of the news [1].
Figure 1. Automatic fact-checking process.

1.2. Style-Based Study

This study determines whether the intention of a news item is to deceive the public. Style is a characteristic that represents and differentiates fake news from truthful news. Deception analysis investigates how the style of misleading content varies across different kinds of information. Attribute-based features require an additional level of calibration or computation, which is time-consuming, but they contribute substantially to the evaluation and filtering used to detect misleading content.

1.3. Propagation-Based Study of Fake News

This study examines how rumors and fake news spread and the process by which they propagate. In this propagation-based study, we deal with the following:
  • How are the propagation patterns of false news represented?
  • What will be the measuring parameters for the characterization of dissemination of false news?
  • How do we differentiate the dissemination of fake news from news that is verified?
  • How do we analyze the pattern of fake news in various domains like politics, economy, and education?
  • How does fake news propagate differently for topics like presidential elections and health, for various platforms like Instagram, Facebook, and X (Twitter), in different languages like English, Hindi, Mandarin Chinese, and Spanish?
The rest of this paper is organized as follows. Section 2 discusses the research methods, motivation, problem description, and objectives deduced from the literature review, together with the proposed architecture and the proposed algorithm for measuring various parameters. Section 3 summarizes the implementation process and outcomes. Section 4 presents conclusions along with the scope of future work.

3. Implementation and Results

The authors implemented the project in Python and considered data on fake news about COVID-19 from X (Twitter). Classifiers such as RandomForestClassifier were imported from the scikit-learn (sklearn) library, and XGBClassifier from the xgboost package. In total, 60% of the data were training data and 40% of the data were testing data. The dataset was read using the read_csv method of the pandas package. The authors used TFIDF [22] as the feature set. TF computes the weighted frequency of every term occurring in a document.
$\mathrm{TF}(w, j) = \dfrac{\text{frequency of } w \text{ in document } j}{\text{total number of terms in document } j}$
IDF computes the importance of each term w. In sklearn, with smoothing, IDF(t) is
$\mathrm{IDF}(t) = \log\dfrac{1 + n}{1 + \mathrm{df}(t)} + 1$
where n is the total number of documents and df(t) is the number of documents containing term t.
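A minimal sketch of this preprocessing step is shown below, assuming the dataset is a CSV file with hypothetical "text" and "label" columns (the actual file and column names in the authors' dataset may differ):

```python
# Minimal sketch of the preprocessing described above. The file name and the
# "text"/"label" column names are hypothetical; the authors' dataset may differ.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.read_csv("covid19_fake_news.csv")  # hypothetical file name

# 60% training data / 40% testing data, as stated in the text.
X_train_text, X_test_text, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.4, random_state=42
)

# TfidfVectorizer implements the TF and smoothed IDF formulas shown above.
vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(X_train_text)  # fit the vocabulary on training data only
X_test = vectorizer.transform(X_test_text)
```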
The evaluation parameters were determined using naïve Bayes, XGBoost, Support Vector Machines, and Random Forest. For text classification problems, naïve Bayes classifiers are a quick and effective option and can easily be trained on small datasets. Support Vector Machines tend to give more accurate results on smaller datasets and handle high-dimensional spaces efficiently. Random Forest works efficiently on large datasets and produces very accurate results. The XGBoost classifier scales to large datasets, delivers optimized and efficient computational performance [30], and also handles sparse data. Table 2 shows the abbreviations used for the machine-learning classifiers.
Table 2. Abbreviations used for classifiers.
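As a sketch of how these four baselines can be trained on the TF-IDF features (hyperparameters here are illustrative defaults, not the authors' reported settings, and labels are assumed to be encoded as 0/1 integers):

```python
# Sketch of the four baseline classifiers; X_train, X_test, y_train, y_test
# come from the TF-IDF preprocessing sketch above.
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier  # provided by the xgboost package, not sklearn

baselines = {
    "NB": MultinomialNB(),
    "SVM": SVC(kernel="rbf", probability=True),
    "RF": RandomForestClassifier(n_estimators=100, random_state=42),
    "XGBoost": XGBClassifier(eval_metric="logloss"),
}

predictions = {}
for name, clf in baselines.items():
    clf.fit(X_train, y_train)               # train on the TF-IDF training split
    predictions[name] = clf.predict(X_test)  # predict on the held-out 40%
```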
Table 3 reports the performance of the various parameters using classification algorithms such as naïve Bayes [26], Support Vector Machines [24], Random Forest [25], XGBClassifier, and the hybrid algorithm of SVM and Random Forest. The results show that the hybrid RFSVM outperformed the other classifiers on our dataset.
Table 3. Test performance.
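The exact RF-SVM hybridization rule is defined in Section 2 and is not reproduced in this section; purely as an illustration, one common way to combine the two classifiers is soft voting over their predicted class probabilities, sketched below. This is not necessarily the authors' combination scheme.

```python
# Illustration only: soft voting averages the class probabilities predicted
# by RF and SVM. The authors' exact combination rule may differ.
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.svm import SVC

rfsvm = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
        ("svm", SVC(kernel="rbf", probability=True)),  # probability=True enables soft voting
    ],
    voting="soft",
)
rfsvm.fit(X_train, y_train)           # TF-IDF features from the sketch above
y_pred_hybrid = rfsvm.predict(X_test)
```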
Figure 4 and Figure 5 show the accuracy % and precision % for the five classification algorithms, respectively. Naïve Bayes has the lowest accuracy, and the hybrid of SVMs and Random Forest (RFSVM) has the highest. SVMs have a precision of 86.91%, RF 84.88%, and XGBoost 84.56%. Figure 6 and Figure 7 show the recall % and F1-score % of NB, RF, SVMs, XGBoost, and RFSVM, respectively. RFSVM has the highest recall at 92.30%, XGBoost has the lowest recall at 83.90%, and RFSVM has the highest F1-score at 93.50%.
Figure 4. Accuracy % of artificial intelligence algorithms.
Figure 5. Precision % of artificial intelligence algorithms.
Figure 6. Recall % of artificial intelligence algorithms.
Figure 7. F1-score % of artificial intelligence algorithms.
The results have shown that the RFSVM excels in various parameters with an accuracy of 97.56%, precision of 88.21%, recall of 92.30%, and F1-score of 93.50%, compared to the results of naïve Bayes, Random Forest, SVMs and XGBoost. This means that 92.30% (maximum recall) of fake news was detected successfully with our proposed methodology. Table 4 shows the comparison among various machine-learning classification algorithms in terms of accuracy, precision, recall, and F1-score.
Table 4. Comparison of Performance Results.
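For reference, the four evaluation parameters can be computed from a classifier's test-set predictions with scikit-learn as follows (a sketch; the "fake" class is assumed to be the positive label, encoded as 1):

```python
# Computing accuracy, precision, recall, and F1-score from test-set predictions.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate(y_true, y_pred):
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),  # share of fake news actually detected
        "f1_score": f1_score(y_true, y_pred),
    }

print(evaluate(y_test, y_pred_hybrid))  # y_pred_hybrid from the hybrid sketch above
```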
SVMs show better accuracy than naïve Bayes, Random Forest, and XGBoost: naïve Bayes treats features as independent, whereas SVMs can exploit relationships between features to a certain degree, and a nonlinear Gaussian kernel is used. The output therefore varies depending on how the features of the problem interact and on the prediction model. SVMs are better than naïve Bayes because a prediction function that depends on interactions between variables, such as y(a, b) = ab, cannot be captured by naïve Bayes, which is not a universal approximator. Because SVMs use the kernel trick and the maximum-margin principle, they work better on nonlinear and high-dimensional tasks, making them the strongest of the individual algorithms. They also benefit, much of the time, from a well-chosen feature set and from feature extraction/transformation techniques. Figure 8 shows the graphical comparison among the various parameters.
Figure 8. Comparison among evaluation parameters of different classifiers.
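To make the feature-independence argument concrete, the toy example below (not from the paper) uses a target that depends on the product of two features; a Gaussian-kernel SVM separates this pattern, while naïve Bayes cannot.

```python
# Toy illustration: a label defined by the interaction a*b is essentially XOR.
# An RBF-kernel SVM fits it; naive Bayes, assuming independent features,
# stays near chance level.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)  # label depends on the product a*b

print("NB training accuracy: ", GaussianNB().fit(X, y).score(X, y))       # about 0.5
print("SVM training accuracy:", SVC(kernel="rbf").fit(X, y).score(X, y))  # close to 1.0
```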
Table 5 shows that the hybrid approach of SVMs and Random Forest outperforms the others with an accuracy of 0.9756. Different categories of fake news are evaluated with the five models in terms of accuracy, precision, recall, and F1-score.
Table 5. Comprehensive comparative analysis with baseline studies.
(a) Analysis of different types of news categories with five different models in terms of accuracy: naïve Bayes (NB), Random Forest (RF), XGBoost, SVMs, and RFSVM. A comparative analysis of these models in terms of accuracy for the different news categories is presented in Table 6.
Table 6. Accuracy values for different news categories with different ML algorithms.
RFSVM, the combination of Support Vector Machines and Random Forest, is around 10% better than the naïve Bayes model in terms of accuracy, and around 6% to 8% better than the other models such as XGBoost, Random Forest, and SVMs.
(b) Analysis of different types of news categories with five different models in terms of precision is presented in Table 7:
Table 7. Precision values for different news categories with different ML algorithms.
RFSVM, the combination of Support Vector Machines and Random Forest, is around 8% better than the naïve Bayes model in terms of precision, and around 4% to 6% better than the other models such as XGBoost, Random Forest, and SVMs.
(c) Analysis of different types of news categories with five different models in terms of recall is presented in Table 8:
Table 8. Recall values for different news categories with different ML algorithms.
RFSVM, the combination of Support Vector Machines and Random Forest, is around 18% better than the naïve Bayes model in terms of recall, and around 12% to 16% better than the other models such as XGBoost, Random Forest, and SVMs.
(d) Analysis of different types of news categories with five different models in terms of F1-score is presented in Table 9:
Table 9. F1-score values for different news categories with different ML algorithms.
RFSVM, the combination of Support Vector Machines and Random Forest, is around 15% better than the naïve Bayes model in terms of F1-score, and around 8% to 12% better than the other models such as XGBoost, Random Forest, and SVMs.
Cross-validation of RF and SVMs provides significantly better results in terms of interpretability, time, and accuracy. Owing to the improved training and feature-extraction process, together with the hybridization of RF and SVMs, improvements in accuracy, precision, recall, and F1-score are achieved across the different news categories. At present, the hybrid model has been applied to small datasets in order to better visualize the data across the different categories. Even on small datasets, the hybrid algorithm already provides clear insights into how the results improve, for different types of news categories, compared with the individual machine-learning models. In our approach, RF and SVMs are currently applied for comparisons across the various classes.
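A minimal sketch of such a cross-validation check is given below, assuming 5 folds (the number of folds is not stated in this section):

```python
# Cross-validated accuracy of the hybrid sketch on the training split.
from sklearn.model_selection import cross_val_score

scores = cross_val_score(rfsvm, X_train, y_train, cv=5, scoring="accuracy")
print("RFSVM cross-validated accuracy: %.4f +/- %.4f" % (scores.mean(), scores.std()))
```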

4. Conclusions

In the current social media era, fake news detection is a rapidly emerging topic. The literature surveyed here addresses various research gaps identified on the web and on media platforms. The authors have surveyed, summarized, compared, and evaluated ongoing research on fake news from several perspectives. Evaluation parameters such as precision, recall, accuracy, and F1-score have been computed for different machine-learning classifiers using the TFIDF feature set, and the results show that the hybrid RFSVM performed best on these parameters. In the current approach, RF and SVMs are applied for comparisons within the classes. Because the dataset used in the proposed model is not very large, the hybridization already provides effective, accurate, and fast results; for this reason, modern techniques such as Transformer-based models were not used to strengthen the comparisons. In future work, the hybridization of RF and SVMs will be applied to larger datasets, and more modern techniques such as Transformer-based models will be integrated to strengthen the comparisons and further improve the efficiency and effectiveness of the model. Future research can also focus on more diversified and labeled datasets. Early-stage fake news detection, cross-platform fake news detection in different languages, and hybridization of various intelligent algorithms can be pursued for better results, while dynamic and benchmark datasets remain major challenges in the domain of false news identification and offer potential research opportunities. Images containing fake text are also a promising area for future research.

Author Contributions

Conceptualization, D.G.D. and V.B.; methodology, D.G.D. and V.B.; software, D.G.D. and V.B.; validation, D.G.D. and V.B.; formal analysis, D.G.D. and V.B.; investigation, D.G.D. and V.B.; resources, D.G.D. and V.B.; data curation, D.G.D. and V.B.; writing—original draft preparation, D.G.D. and V.B.; writing—review and editing, D.G.D. and V.B.; visualization, D.G.D. and V.B.; supervision, D.G.D. and V.B.; project administration, D.G.D. and V.B.; funding acquisition, D.G.D. and V.B. All authors have read and agreed to the published version of the manuscript.

Funding

The authors declare that no funding was received for this article.

Data Availability Statement

The data supporting this article are available from the authors upon request.

Acknowledgments

The authors are thankful to the reviewers for their valuable comments.

Conflicts of Interest

The authors declare that they have no competing interests.

References

  1. Zhou, X.; Zafarani, R. Fake news: A survey of research, detection methods, and opportunities. arXiv 2018, arXiv:1812.00315. [Google Scholar]
  2. Reddy, H.; Raj, N.; Gala, M.; Basava, A. Text-mining-based Fake News Detection Using Ensemble Methods. Int. J. Autom. Comput. 2020, 17, 210–221. [Google Scholar] [CrossRef]
  3. Liu, Y.; Wu, Y.F.B. FNED: A Deep Network for Fake News Early Detection on Social Media. ACM TOIS 2020, 38, 25. [Google Scholar] [CrossRef]
  4. Alkhodair, S.A.; Ding, S.H.; Fung, B.C.; Liu, J. Detecting breaking news rumors of emerging topics in social media. Inf. Process. Manag. 2020, 57, 102018. [Google Scholar] [CrossRef]
  5. Meel, P.; Vishwakarma, D.K. Fake News, Rumor, Information Pollution in Social Media and Web: A Contemporary Survey of State-of-the-arts, Challenges and Opportunities. Expert Syst. Appl. 2019, 153, 112986. [Google Scholar] [CrossRef]
  6. Sharma, K.; Qian, F.; Jiang, H.; Ruchansky, N.; Zhang, M.; Liu, Y. Combating fake news: A survey on identification and mitigation techniques. ACM TIST 2019, 10, 21. [Google Scholar] [CrossRef]
  7. Bondielli, A.; Marcelloni, F. A survey on fake news and rumour detection techniques. Inf. Sci. 2019, 497, 38–55. [Google Scholar] [CrossRef]
  8. Ozbay, F.A.; Alatas, B. Fake news detection within online social media using supervised artificial intelligence algorithms. Phys. A Stat. Mech. Appl 2020, 540, 123174. [Google Scholar] [CrossRef]
  9. Monti, F.; Frasca, F.; Eynard, D.; Mannion, D.; Bronstein, M.M. Fake News Detection on Social Media Using Geometric Deep Learning. arXiv 2019, arXiv:1902.06673. [Google Scholar]
  10. Vishwakarma, D.K.; Varshney, D.; Yadav, A. Detection and veracity analysis of fake news via scrapping and authenticating the web search. Cogn. Syst. Res. 2019, 58, 217–229. [Google Scholar] [CrossRef]
  11. Alzanin, S.M.; Azmi, A.M. Detecting rumors in social media: A survey. Procedia Comput. Sci. 2018, 142, 294–300. [Google Scholar] [CrossRef]
  12. Cybenko, A.K.; Cybenko, G. AI and fake news. IEEE Intell. Syst. 2018, 33, 1–5. [Google Scholar] [CrossRef]
  13. Jang, S.M.; Geng, T.; Li, J.Y.Q.; Xia, R.; Huang, C.T.; Kim, H.; Tang, J. A computational approach for examining the roots and spreading patterns of fake news: Evolution tree analysis. Comput. Hum. Behav. 2018, 84, 103–113. [Google Scholar] [CrossRef]
  14. Pérez-Rosas, V.; Kleinberg, B.; Lefevre, A.; Mihalcea, R. Automatic Detection of Fake News. arXiv 2017, arXiv:1708.07104. [Google Scholar]
  15. Wang, W.Y. “Liar, liar pants on fire”: A new benchmark dataset for fake news detection. arXiv 2017, arXiv:1705.0064. [Google Scholar]
  16. Ruchansky, N.; Seo, S.; Liu, Y. CSI: A hybrid deep model for fake news detection. In Proceedings of the 26th ACM International Conference on Information and Knowledge Management, Singapore, 6–10 November 2017; pp. 797–806. [Google Scholar]
  17. Shu, K.; Sliva, A.; Wang, S.; Tang, J.; Liu, H. Fake news detection on social media: A data mining perspective. ACM SIGKDD Explor. Newsl. 2017, 19, 22–36. [Google Scholar] [CrossRef]
  18. Al Amrani, Y.; Lazaar, M.; El Kadiri, K.E. Random forest and support vector machine based hybrid approach to sentiment analysis. Procedia Comput. Sci. 2018, 127, 511–520. [Google Scholar] [CrossRef]
  19. Garg, N.; Gupta, R.; Kaur, M.; Ahmed, S.; Shankar, H. Efficient Detection and Classification of Orange Diseases using Hybrid CNN-SVM Model. In Proceedings of the 2023 International Conference on Disruptive Technologies (ICDT), Greater Noida, India, 11–12 May 2023; pp. 721–726. [Google Scholar]
  20. Nasir, J.A.; Khan, O.S.; Varlamis, I. Fake news detection: A hybrid CNN-RNN based deep learning approach. Int. J. Inf. Manag. Data Insights 2021, 1, 100007. [Google Scholar] [CrossRef]
  21. Dedeepya, P.; Yarrarapu, M.; Kumar, P.P.; Kaushik, S.K.; Raghavendra, P.N.; Chandu, P. Fake News Detection on Social Media Through a Hybrid SVM-KNN Approach Leveraging Social Capital Variables. In Proceedings of the 2024 3rd International Conference on Applied Artificial Intelligence and Computing (ICAAIC), Salem, India, 5–7 June 2024; pp. 1168–1175. [Google Scholar]
  22. Ramos, J. Using TF-IDF to determine word relevance in document queries. In Proceedings of the 1st Instructional Conference on Machine Learning. 2003; Volume 242, pp. 133–142. Available online: https://citeseerx.ist.psu.edu/document?repid=rep1;type=pdf;doi=b3bf6373ff41a115197cb5b30e57830c16130c2c (accessed on 11 August 2024).
  23. Rish, I. An empirical study of the naive Bayes classifier. In Proceedings of the IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, Seattle, WA, USA, 4 August 2001; Volume 3, pp. 41–46. [Google Scholar]
  24. Chang, C.C.; Lin, C.J. LIBSVM: A library for support vector machines. ACM TIST 2011, 2, 1–27. [Google Scholar] [CrossRef]
  25. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  26. Yager, R.R. An extension of the naive Bayesian classifier. Inf. Sci 2006, 176, 577–588. [Google Scholar] [CrossRef]
  27. Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297. [Google Scholar] [CrossRef]
  28. Ahmad, I.; Yousaf, M.; Yousaf, S.; Ahmad, M.O. Fake news detection using machine learning ensemble methods. Complexity 2020, 1, 8885861. [Google Scholar] [CrossRef]
  29. Hamsa, H.; Indiradevi, S.; Kizhakkethottam, J.J. Student academic performance prediction model using decision tree and fuzzy genetic algorithm. Proc. Technol. 2016, 25, 326–332. [Google Scholar] [CrossRef]
  30. Malhotra, P.; Malik, S.K. Fake News Detection Using Ensemble Techniques. Multimed. Tools Appl. 2024, 83, 42037–42062. [Google Scholar] [CrossRef]
  31. Sharma, U.; Saran, S.; Patil, S.M. Fake news detection using machine learning algorithms. IJCRT 2020, 8, 509–518. [Google Scholar]
  32. Khanam, Z.; Alwasel, B.N.; Sirafi, H.; Rashid, M. Fake news detection using machine learning approaches. IOP Conf. Ser. Mater. Sci. Eng. 2021, 1099, 012040. [Google Scholar] [CrossRef]
  33. Pandey, S.; Prabhakaran, S.; Reddy, N.S.; Acharya, D. Fake news detection from online media using machine learning classifiers. J. Phys. Conf. Ser. 2022, 2161, 012027. [Google Scholar] [CrossRef]
  34. Mallick, C.; Mishra, S.; Senapati, M.R. A cooperative deep learning model for fake news detection in online social networks. J. Ambient. Intell. Humaniz. Comput. 2023, 14, 4451–4460. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
