Stylometric Fake News Detection Based on Natural Language Processing Using Named Entity Recognition: In-Domain and Cross-Domain Analysis
Abstract
1. Introduction
- The novelty lies in integrating NLP with NER to emphasize the relationships between entities within a text: NER identifies the named entities, and NLP captures how those entities relate to one another. This combination strengthens the discriminatory model and lets it better grasp the thematic context of the text.
- A mathematical model is developed to put this concept into practice. Its optimal solution delimits a legitimate visual area, which forms the fundamental basis for stylometric detection.
2. Related Works
3. Data and Methods
3.1. Dataset and Corpus
3.2. NER-SA: The Modified Proof-of-Concept Model
3.2.1. Creation of Related-Entity and Unrelated-Entity Banks
3.2.2. Filtering Out the Outliers and Calculating the Reoccurrence Index and Variety Index
3.2.3. Developing the Mathematical Model to Form the Legitimate Area
- Initialize ‘p’ to an initial value ‘initial_p’ (parameter setting = 0.5).
- Set the maximum number of iterations ‘max_iterations’ (parameter setting = 5000).
- Set the initial temperature ‘initial_temp’ (parameter setting = 100).
- Set the cooling rate ‘cooling_rate’ (parameter setting = 0.95).
- Define the objective function ‘objective_function(L, F, M, p)’, which calculates the objective score for a given ‘p’.
- Enter the main loop and iterate from 1 to ‘max_iterations’. In each iteration:
  - Calculate the objective score ‘current_score’ for the current value ‘current_p’.
  - If ‘current_score’ is better than the previous best score ‘best_score’, update ‘best_score’ and ‘best_p’.
  - Generate a new value ‘new_p’, usually by adding a small random decimal to ‘current_p’.
  - Calculate the acceptance probability ‘acceptance_prob’ using the formula ‘exp((current_score - objective_function(L, F, M, new_p))/initial_temp)’.
  - Generate a random number between 0 and 1; if it is less than ‘acceptance_prob’, accept the new solution by updating ‘current_p’ to ‘new_p’.
  - Multiply the temperature ‘initial_temp’ by the cooling rate ‘cooling_rate’ and store the result back in ‘initial_temp’, so that the temperature decreases with every iteration.
- After optimization, output the best ‘p’ value ‘best_p’ and its corresponding objective score ‘best_score’.
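The procedure above is a simulated annealing search over the single parameter ‘p’. The following Python sketch mirrors those steps under stated assumptions. The body of `objective_function` is a hypothetical placeholder, since the listed steps do not define L, F, and M. The step size, the clamping of ‘p’ to [0, 1], and the reading of "better" as "lower" (which matches the sign of the acceptance formula) are also assumptions.

```python
import math
import random


def objective_function(L, F, M, p):
    """Hypothetical placeholder cost (lower is better), NOT the paper's objective.

    Assumes L and F are lists of scalar values for legitimate and fake articles,
    M is a penalty weight, and p is a threshold in [0, 1].
    """
    missed = sum(1 for x in L if x > p) / len(L)      # legitimate items left outside
    admitted = sum(1 for x in F if x <= p) / len(F)   # fake items let inside
    return missed + M * admitted


def simulated_annealing(L, F, M, initial_p=0.5, max_iterations=5000,
                        initial_temp=100.0, cooling_rate=0.95,
                        step=0.05, seed=0):
    rng = random.Random(seed)
    current_p = initial_p
    current_score = objective_function(L, F, M, current_p)
    best_p, best_score = current_p, current_score
    temp = initial_temp

    for _ in range(max_iterations):
        # Track the best solution seen so far.
        if current_score < best_score:
            best_p, best_score = current_p, current_score

        # Propose a neighbour by adding a small random decimal (assumed range).
        new_p = min(1.0, max(0.0, current_p + rng.uniform(-step, step)))
        new_score = objective_function(L, F, M, new_p)

        # exp((current_score - new_score) / temp), capped at 1 for improvements.
        delta = current_score - new_score
        acceptance_prob = 1.0 if delta >= 0 else math.exp(delta / temp)
        if rng.random() < acceptance_prob:
            current_p, current_score = new_p, new_score

        # Cool down; the floor avoids division by a vanishing temperature.
        temp = max(temp * cooling_rate, 1e-12)

    # The last accepted state may still improve on the recorded best.
    if current_score < best_score:
        best_p, best_score = current_p, current_score
    return best_p, best_score


if __name__ == "__main__":
    legit = [0.1, 0.2, 0.3, 0.4]   # illustrative values only
    fake = [0.6, 0.7, 0.8, 0.9]
    print(simulated_annealing(legit, fake, M=1.0, initial_p=0.9))
```

With a cooling rate of 0.95 the temperature falls below 1 after roughly 90 iterations, so most of the 5000 iterations behave like a greedy local search. A slower cooling schedule would keep the search exploratory for longer.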
3.2.4. Differentiating between Legitimate News and Fake News
3.3. Performance Evaluation
4. Experimental Results and Discussion
4.1. Statistical Evaluation
4.2. In-Domain Analysis Results
4.3. Cross-Domain Analysis Results
4.4. Discussion
- Accuracy (A) is the proportion of correct predictions among all predictions, which means that accurately identifying true and false news samples is crucial to maintaining high accuracy. In NER for fake news detection, a higher accuracy demonstrates that the NER-SA model possesses a higher proportion of correctly recognized named entities, such as people, places, organizations, etc., among all the named entities in the text. Accurate identification of entities is an important and reliable foundation for NLP.
- Precision (P) is the average of the proportion of true positive predictions among all positive predictions and the proportion of true negative predictions among all negative predictions. In NER for fake news detection, a higher precision means that when the NER-SA model marks an entity, it is more likely to be a true entity and not a false one. Precise identification of trustworthy entities in the extracted information helps in avoiding the inclusion of fabricated or misleading entities.
- Recall (R) is the average of the proportion of true positive predictions among all actual positive samples and the proportion of true negative predictions among all actual negative samples. In NER for fake news detection, a higher recall implies that the NER-SA model is more effective at capturing a significant portion of the entities, including those that might be intentionally obscured or hidden within the fake news text. The entities recovered in this way provide valuable clues about the authenticity of the information.
- The F1-score (F1) is the harmonic mean of precision and recall, considering the trade-off between false positives and false negatives. In NER for fake news detection, achieving a higher F1 implies that the NER-SA model not only correctly identifies named entities (precision) but also successfully captures a substantial portion of all entities present (recall). Moreover, a high F1 leads to a more robust and reliable NER-SA model for detecting fake news.
- In the cross-domain analysis, the NER-SA model reports higher recall along with lower precision in all domains. Real news articles tend to resemble one another because they reference similar evidence to support their reporting on events; fake news, by contrast, is not constrained by evidence, so its descriptions are conjured up from the author’s imagination.
- For the mathematical model, some fake news articles will inevitably have coordinates on the VI that fall inside the legitimate area. Including the penalty function in the NER-SA model addresses this problem by preventing excessive expansion of the legitimate area. Judging from the numerical results for A and F1, the NER-SA model outperforms the machine learning and deep learning models it was compared with. Because fake news producers continuously employ adversarial techniques that intentionally modify linguistic features to deceive NLP models, more diverse RI and VI shape frameworks should be developed in the future to maximize the legitimate area, enhance the robustness of the NER-SA model against adversarial attacks, and improve detection performance.
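The definitions of A, P, R, and F1 given in this discussion average the positive-class and negative-class rates. They can be written down directly from the four cells of a binary confusion matrix. The sketch below is illustrative only: the function name `binary_metrics` and the example counts are assumptions, and it presumes that every denominator is non-zero.

```python
def binary_metrics(tp, fp, tn, fn):
    """Return (accuracy, precision, recall, f1) from confusion-matrix counts.

    Precision and recall are averaged over the positive and negative classes,
    following the definitions stated in the text; F1 is the harmonic mean of
    those two averages. Assumes no denominator is zero.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total

    # Share of correct predictions among all positive / all negative predictions.
    precision = 0.5 * (tp / (tp + fp) + tn / (tn + fn))

    # Share of correct predictions among all actual positive / negative samples.
    recall = 0.5 * (tp / (tp + fn) + tn / (tn + fp))

    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


if __name__ == "__main__":
    # Illustrative counts only, not taken from the experiments.
    print(binary_metrics(tp=40, fp=10, tn=45, fn=5))
```

Because both classes are weighted equally, the averaged precision and recall do not favor the majority class when the legitimate and fake samples are imbalanced.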
4.5. Theoretical and Practical Implications
5. Conclusions
Funding
Data Availability Statement
Conflicts of Interest
References
- Hutchinson, A. New Research Shows that 71% of Americans Now Get News Content via Social Platforms. 2021. Available online: https://www.socialmediatoday.com/news/new-research-shows-that-71-of-americans-now-get-news-content-via-social-pl/593255/ (accessed on 31 October 2022).
- Ellerbeck, S. Most People Get Their News Online—But Many Are Switching off Altogether. Here’s Why. 2022. Available online: https://www.weforum.org/agenda/2022/09/news-online-europe-social-media/ (accessed on 31 October 2022).
- Majid, A. Survey: Google Is Most Trusted Tech Platform for News, TikTok the Least. 2022. Available online: https://pressgazette.co.uk/data-shows-broad-trust-gap-between-news-in-general-and-news-on-social-media/ (accessed on 31 October 2022).
- Shahsavari, S.; Holur, P.; Tangherlini, T.R.; Roychowdhury, V. Conspiracy in the time of corona: Automatic detection of COVID-19 conspiracy theories in social media and the news. J. Comput. Soc. Sci. 2020, 3, 279–317.
- Tsai, C.M.; Xu, B.S. Automatic differentiation between legitimate and fake news using named entity recognition. In Proceedings of the 2020 3rd International Conference on Artificial Intelligence and Pattern Recognition, Xiamen, China, 26–28 June 2020; pp. 74–78.
- Potthast, M.; Kiesel, J.; Reinartz, K.; Bevendorff, J.; Stein, B. A stylometric inquiry into hyperpartisan and fake news. arXiv 2017, arXiv:1702.05638.
- Nadeem, M.I.; Ahmed, K.; Zheng, Z.; Li, D.; Assam, M.; Ghadi, Y.Y.; Alghamedy, F.H.; Eldin, E.T. SSM: Stylometric and semantic similarity oriented multimodal fake news detection. J. King Saud Univ.-Comput. Inf. Sci. 2023, 35, 101559.
- Abeynayake, A.D.L.; Sunethra, A.A.; Deshani, K.A.D. A stylometric approach for reliable news detection using machine learning methods. In Proceedings of the 2022 22nd International Conference on Advances in ICT for Emerging Regions (ICTer), Colombo, Sri Lanka, 30 November–1 December 2022; pp. 1–6.
- Wang, Y.; Qian, S.; Hu, J.; Fang, Q.; Xu, C. Fake news detection via knowledge-driven multimodal graph convolutional networks. In Proceedings of the 10th International Conference on Multimedia Retrieval, Dublin, Ireland, 8–11 June 2020; pp. 540–547.
- Torabi Asr, F.; Taboada, M. Big Data and quality data for fake news and misinformation detection. Big Data Soc. 2019, 6.
- Himdi, H.; Weir, G.; Assiri, F.; Al-Barhamtoshy, H. Arabic fake news detection based on textual analysis. Arab. J. Sci. Eng. 2022, 47, 10453–10469.
- Wang, W.Y. Liar, liar pants on fire: A new benchmark dataset for fake news detection. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, BC, Canada, 30 July–4 August 2017; pp. 422–426.
- Shu, K.; Mahudeswaran, D.; Wang, S.; Lee, D.; Liu, H. FakeNewsNet: A data repository with news content, social context, and spatiotemporal information for studying fake news on social media. Big Data 2020, 8, 171–188.
- Cauteruccio, F.; Stamile, C.; Terracina, G.; Ursino, D.; Sappey-Marinier, D. An automated string-based approach to extracting and characterizing White Matter fiber-bundles. Comput. Biol. Med. 2016, 77, 64–75.
- Cauteruccio, F.; Stamile, C.; Terracina, G.; Ursino, D.; Sappey-Marinier, D. An automated string-based approach to White Matter fiber-bundles clustering. In Proceedings of the 2015 International Joint Conference on Neural Networks (IJCNN), Killarney, Ireland, 12–17 July 2015; pp. 1–8.
- Saikh, T.; De, A.; Ekbal, A.; Bhattacharyya, P. A deep learning approach for automatic detection of fake news. In Proceedings of the 16th International Conference on Natural Language Processing, Hyderabad, India, 18–21 December 2019; pp. 230–238.
- Amer, E.; Kwak, K.-S.; El-Sappagh, S. Context-based fake news detection model relying on deep learning models. Electronics 2022, 11, 1255.
- Rasool, A.; Tao, R.; Kamyab, M.; Hayat, S. GAWA - A feature selection method for hybrid sentiment classification. IEEE Access 2020, 8, 191850–191861.
- Lai, C.-M.; Chen, M.-H.; Kristiani, E.; Verma, V.K.; Yang, C.-T. Fake News Classification Based on Content Level Features. Appl. Sci. 2022, 12, 1116.
- Bonifazi, G.; Cauteruccio, F.; Corradini, E.; Marchetti, M.; Sciarretta, L.; Ursino, D.; Virgili, L. A Space-Time Framework for Sentiment Scope Analysis in Social Media. Big Data Cogn. Comput. 2022, 6, 130.
- Khan, J.Y.; Khondaker, M.T.I.; Afroz, S.; Uddin, G.; Iqbal, A. A benchmark study of machine learning models for online fake news detection. Mach. Learn. Appl. 2021, 4, 100032.
- Pérez-Rosas, V.; Kleinberg, B.; Lefevre, A.; Mihalcea, R. Automatic detection of fake news. In Proceedings of the International Conference on Computational Linguistics, Yangon, Myanmar, 16–18 August 2017.
- Wang, H.; Wang, S.; Han, Y.H. Detecting fake news on Chinese social media based on hybrid feature fusion method. Expert Syst. Appl. 2022, 208, 118111.
- Alghamdi, J.; Lin, Y.; Luo, S. A comparative study of machine learning and deep learning techniques for fake news detection. Information 2022, 13, 576.
- Corradini, E. The dark threads that weave the web of shame: A network science-inspired analysis of body shaming on Reddit. Information 2023, 14, 436.
- Kishwar, A.; Zafar, A. Fake news detection on Pakistani news using machine learning and deep learning. Expert Syst. Appl. 2023, 211, 118558.
- Song, C.; Yang, C.; Chen, H.; Tu, C.; Liu, Z.; Sun, M. CED: Credible early detection of social media rumors. IEEE Trans. Knowl. Data Eng. 2019, 1, 99.
- Zhang, H.; Fang, Q.; Qian, S.; Xu, C. Multi-modal knowledge-aware event memory network for social media rumor detection. In Proceedings of the 27th International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 1942–1951.
- Trueman, T.E.; Ashok Kumar, J.; Narayanasamy, P.; Vidya, J. Attention-based C-BiLSTM for fake news detection. Appl. Soft. Comput. 2021, 110, 107600.
- Segura-Bedmar, I.; Alonso-Bartolome, S. Multimodal fake news detection. Information 2022, 13, 284.
- Yang, L.O. Newspaper3k: Article Scraping & Curation. 2020. Available online: https://newspaper.readthedocs.io (accessed on 15 December 2020).
- Xu, M.; Fralick, D.; Zheng, J.Z.; Wang, B.; Tu, X.M.; Feng, C. The differences and similarities between two-sample t-test and paired t-test. Shanghai Arch. Psychiatry 2017, 29, 184–188.
- Zubiaga, A.; Aker, A.; Bontcheva, K.; Liakata, M.; Procter, R. Detection and resolution of rumours in social media: A survey. ACM Comput. Surv. 2018, 51, 1–36.
- DSouza, K.M.; French, A.M. Social media and fake news detection using adversarial collaboration. In Proceedings of the 55th Hawaii International Conference on System Sciences, Maui, HI, USA, 4–7 January 2022; pp. 115–123.
Reference | Method/Contributions | Limitations |
---|---|---|
Shu et al. [13] | Their content-based LSTM model is beneficial for fake news detection, fake news evolution, fake news mitigation, and malicious account detection. | People belonging to the same social communities often express the same interests and viewpoints. Their opinions are more likely to be manipulated. |
Cauteruccio et al. [14,15] | They proposed a particular string-based fiber representation to solve the string clustering problem of fibers. Their method can be applied to heterogeneous data from different domains. | The model can easily extract those fibers with the same structure. The contextual multi-modal approach might not be considered. |
Amer et al. [17] | They developed a more effective linguistic model with contextual feature extraction and showed that their model outperformed the accuracy of BERT. Their model also achieves the same accuracy as the LSTM and the Gated Recurrent Unit (GRU) models. | The authors said their model needs to be continuously retrained or updated to deal with a larger volume of malicious fake news generation tools. |
Rasool et al. [18] | They proposed the GAWA method, supported by using hybrid sentiment classification to find the optimal features for feature reductions. The results revealed that GAWA can reduce the feature set by up to 61.95% without reducing the accuracy level and enhance the efficiency of the Naive Bayes algorithm for better accuracy. | The GAWA method may not consider interdependencies between features. In cases where some features are correlated with each other, the GAWA method may select all features, resulting in a redundant feature set. |
Lai et al. [19] | They verified that both ML models and neural network models implementing NLP have better performance than traditional ML models, CNN, and LSTM in both accuracy and precision. | Lack of consistency in writing patterns, stylistic analysis, and sentiment analysis can lead manipulators to intentionally use common words or synonyms to mimic legitimate texts. |
Bonifazi et al. [20] | They extracted the sentiment scope of a user on any topic on the social platform Reddit and proposed an approach that expands the sentiment scope to integration with spatial and temporal scopes. | Authors argued that the main limitations of their approach include: 1. the interconnectivity between different social platforms is not considered; 2. the interference provided by different users potentially having specific sentiments regarding a topic on a given user cannot be analyzed. |
Alghamdi et al. [24] | They proposed an approach that combines the bidirectional encoder representation from the transformer (BERT) and CNN to capture semantic and contextual information of a given news text using NLP and demonstrated the improvement of the performance of fake news detection. | The context-based information needs to incorporate style and sentiment analysis for better performance. |
Trueman et al. [29] | They proposed the attention-based convolutional bidirectional long short-term memory (AC-BiLSTM) model, which can simultaneously capture the meaning of all sentences and memorize long input sequences. The hybrid model provides an impressive improvement in accuracy and F1-score in comparison with existing SVM, CNN, and LSTM models. | The AC-BiLSTM may face limitations such as computational complexity, overfitting, and limited interpretability, which make it less suitable for some specific tasks. |
Segura-Bedmar and Alonso-Bartolome [30] | They adopted both uni-modal (using only texts) and multi-modal (integrating texts and images) approaches to detect fake news on the Fakeddit dataset. The results revealed that BERT is the uni-modal model with the best accuracy, while the CNN-based multi-modal approach obtains the best accuracy overall. | The proposed multi-modal approach does not involve similarity and association detection methods. |
Datasets | Training Set Accuracy | Test Set Accuracy | RMSE | R-Square | t Value |
---|---|---|---|---|---|
LIAR | 0.6426 | 0.6218 | 0.4045 | 0.703 | 3.867 *** |
Fake or real news | 0.9382 | 0.9304 | 0.3640 | 0.814 | 4.748 *** |
FakeNews AMT (Technology domain) | 0.9463 | 0.9378 | 0.2454 | 0.837 | 4.506 *** |
FakeNews AMT (Education domain) | 0.9451 | 0.9365 | 0.2271 | 0.860 | 4.454 *** |
FakeNews AMT (Business domain) | 0.8926 | 0.8731 | 0.2689 | 0.815 | 3.980 *** |
FakeNews AMT (Sports domain) | 0.8705 | 0.8582 | 0.2904 | 0.796 | 4.014 *** |
FakeNews AMT (Politics domain) | 0.9502 | 0.9376 | 0.2487 | 0.842 | 4.621 *** |
FakeNews AMT (Entertainment domain) | 0.8677 | 0.8563 | 0.3104 | 0.775 | 3.859 *** |
Datasets | Metric | Machine Learning Models *: SVM (Lexical Features) | Machine Learning Models *: Naive Bayes (Bigram Feature) | Deep Learning Models *: CNN | Deep Learning Models *: LSTM | NER-SA Model (The Increase % in Performance on Average) |
---|---|---|---|---|---|---|
LIAR | A | 0.56 | 0.60 | 0.58 | 0.54 | 0.62 (8.94%) |
LIAR | P | 0.56 | 0.59 | 0.58 | 0.29 | 0.61 (31.96%) |
LIAR | R | 0.56 | 0.60 | 0.58 | 0.54 | 0.61 (7.18%) |
LIAR | F1 | 0.48 | 0.59 | 0.58 | 0.38 | 0.61 (24.04%) |
Fake or real news | A | 0.67 | 0.86 | 0.86 | 0.76 | 0.93 (19.36%) |
Fake or real news | P | 0.67 | 0.86 | 0.86 | 0.78 | 0.93 (18.58%) |
Fake or real news | R | 0.67 | 0.86 | 0.86 | 0.76 | 0.93 (19.36%) |
Fake or real news | F1 | 0.67 | 0.86 | 0.86 | 0.76 | 0.93 (19.36%) |
Domains | Model 1 * (Readability Feature): A | Model 1 * (Readability Feature): F1 | Model 2 * (All Features): A | Model 2 * (All Features): F1 | NER-SA Model: A (The Increase % in Performance on Average) | NER-SA Model: P | NER-SA Model: R | NER-SA Model: F1 (The Increase % in Performance on Average) |
---|---|---|---|---|---|---|---|---|
Technology | 0.90 | 0.90 | 0.80 | 0.79 | 0.94 (10.97%) | 0.84 | 0.96 | 0.90 (6.96%) |
Education | 0.84 | 0.84 | 0.84 | 0.84 | 0.94 (11.90%) | 0.82 | 0.96 | 0.88 (4.76%) |
Business | 0.53 | 0.41 | 0.85 | 0.85 | 0.87 (43.97%) | 0.83 | 0.88 | 0.85 (53.66%) |
Sports | 0.51 | 0.45 | 0.81 | 0.81 | 0.86 (50.18%) | 0.80 | 0.88 | 0.84 (45.19%) |
Politics | 0.91 | 0.91 | 0.75 | 0.75 | 0.94 (14.32%) | 0.86 | 0.96 | 0.91 (10.67%) |
Entertainment | 0.61 | 0.60 | 0.75 | 0.75 | 0.86 (39.72%) | 0.82 | 0.86 | 0.84 (26.00%) |
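The "Increase % in Performance on Average" figures in the two comparison tables are not accompanied by a formula here. The tabulated values are, however, consistent with taking the relative gain of the NER-SA score over each baseline separately and then averaging those gains. The sketch below reproduces two of the reported figures under that assumption. The function name `mean_relative_increase` is illustrative, and the formula is inferred from the numbers rather than quoted from the paper.

```python
def mean_relative_increase(model_score, baseline_scores):
    """Average of the per-baseline relative gains, in percent.

    Inferred from the tabulated values; treat the formula as an assumption.
    """
    gains = [(model_score - b) / b for b in baseline_scores]
    return 100.0 * sum(gains) / len(gains)


if __name__ == "__main__":
    # LIAR accuracy: NER-SA 0.62 vs. SVM, Naive Bayes, CNN, LSTM.
    print(round(mean_relative_increase(0.62, [0.56, 0.60, 0.58, 0.54]), 2))  # 8.94

    # Technology-domain accuracy: NER-SA 0.94 vs. Model 1 and Model 2.
    print(round(mean_relative_increase(0.94, [0.90, 0.80]), 2))  # 10.97
```

Under this reading, a single weak baseline can dominate the average. In the LIAR precision row, the LSTM score of 0.29 accounts for most of the reported 31.96% increase.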
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Tsai, C.-M. Stylometric Fake News Detection Based on Natural Language Processing Using Named Entity Recognition: In-Domain and Cross-Domain Analysis. Electronics 2023, 12, 3676. https://doi.org/10.3390/electronics12173676