Misinformation vs. Situational Awareness: The Art of Deception and the Need for Cross-Domain Detection
Abstract
1. Introduction
- (1) Reputation/publishing history analysis;
- (2) Network data analysis;
- (3) Image-based analysis;
- (4) Image tampering detection.
1.1. Motivation
1.2. Contribution
1.3. Structure
2. Literature Review
2.1. Publications
2.2. Related Surveys
3. Background and Criteria
3.1. Machine Learning Models Used in Misinformation Detection
- Support Vector Machines (SVM): A model that utilizes training data to generate support vectors. These are used by the model to label the test data it is given [19].
- Random Forest: A classification and regression model that aggregates the results of multiple randomized decision trees [20].
- K-Nearest Neighbors (K-NN): A model that conceptualizes and plots the training data as a set of data points in a high-dimensional space. It then categorizes each test point based on the K nearest data points in that space [21].
- Logistic Regression: A linear regression model that uses a threshold and a sigmoid function to dichotomize and label the test data [22].
- AdaBoost: A model that is mainly used in conjunction with short decision trees. AdaBoost takes multiple iterations of “weak learners” (i.e., shallow decision trees), each trained with a different selection of weighted features, and aggregates them into one “strong learner” [23].
- XGBoost, LGBoost, Gradient Boosting: Machine learning models based on the Gradient Boosting framework [24].
- Neural Networks: A collection of interconnected nodes (neurons), modelled to imitate neurons in the human brain. Layer upon layer of these units form weighted connections that capture the underlying relationships in a data set. This information is used to solve classification and regression problems [25].
- Long Short-Term Memory (LSTM): A variation of the Recurrent Neural Network (RNN), created to avoid the exploding and vanishing gradient problems found in standard RNNs [26].
- BERT Model: Bidirectional Encoder Representations from Transformers (BERT) is an NLP model that pre-trains such representations on unlabeled data. After pre-training, the model can be fine-tuned to the task at hand using labeled data [27].
- Convolutional Neural Network (CNN): A type of deep neural network that takes advantage of convolution (a linear mathematical operation on matrices) [28].
- MMFD, SemSeq4FD: Custom frameworks created specifically for this problem set.
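As a concrete illustration of one of the simpler models above, the following is a minimal from-scratch sketch of the K-NN idea: each text becomes a point in a high-dimensional word-count space, and a test point takes the majority label of its K nearest training points. The corpus, vocabulary, and labels are invented for illustration and are not drawn from any surveyed paper.

```python
# Toy K-Nearest Neighbors sketch for text labeling (illustrative only).
import math
from collections import Counter

def vectorize(text, vocabulary):
    """Map a text onto word-count coordinates over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

def knn_predict(train, test_vector, k=3):
    """train: list of (vector, label); return the majority label of the K nearest points."""
    by_distance = sorted(train, key=lambda item: math.dist(item[0], test_vector))
    top_labels = [label for _, label in by_distance[:k]]
    return Counter(top_labels).most_common(1)[0][0]

# Invented toy training set: "fake" vs. "real" headlines.
vocabulary = ["shocking", "miracle", "secret", "budget", "rates", "parliament"]
train_texts = [
    ("shocking miracle cure kept secret", "fake"),
    ("secret shocking celebrity story", "fake"),
    ("parliament approves budget", "real"),
    ("central bank holds rates", "real"),
]
train = [(vectorize(text, vocabulary), label) for text, label in train_texts]
prediction = knn_predict(train, vectorize("shocking secret revealed", vocabulary))
# prediction == "fake"
```

In practice the surveyed approaches replace the toy word counts with Bag-of-Words or TF-IDF vectors (Appendix A) and use library implementations of SVM, Random Forest, and the other models listed above.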
3.2. Model Evaluation Criteria
- The Vanishing Gradient Problem is found mainly in models trained with gradient-based methods (e.g., Gradient Descent). It causes the model to stagnate, since it can no longer learn or learns at an extremely slow rate [32].
- The Exploding Gradient Problem is the opposite phenomenon of the Vanishing Gradient Problem. It occurs when gradients accumulated over long sequences grow exponentially, causing unstable updates and poor prediction results [32].
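A toy numerical illustration (ours, not from the survey) of the two gradient problems: backpropagating through T steps repeatedly multiplies the gradient by a per-step factor, so a factor below 1 shrinks it toward zero (vanishing) while a factor above 1 blows it up exponentially (exploding). Gradient clipping, a common remedy used alongside architectures such as the LSTM, simply caps the magnitude.

```python
# Repeated multiplication of a gradient by a per-step factor, with optional clipping.
def backprop_gradient(per_step_factor, steps, clip=None):
    grad = 1.0
    for _ in range(steps):
        grad *= per_step_factor
        if clip is not None and abs(grad) > clip:
            grad = clip if grad > 0 else -clip  # cap the magnitude
    return grad

vanishing = backprop_gradient(0.9, steps=100)            # ~2.7e-5: learning stalls
exploding = backprop_gradient(1.1, steps=100)            # ~1.4e4: updates diverge
clipped = backprop_gradient(1.1, steps=100, clip=5.0)    # capped at 5.0
```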
3.3. Data Pre-Processing Criteria
3.4. Adaptability Criteria
- The first parameter is associated with the ML model accuracy of each approach. We utilize accuracy as it is the most commonly provided metric in the researched literature. When a methodology does not report this evaluation metric, it is assessed as “Non-flexible”. If the metric is reported but its score is lower than 0.6, the approach is likewise “Non-flexible”; if the score is between 0.6 and 0.8, the approach is “Relatively flexible”. Otherwise, it is “Flexible”.
- The second parameter is related to the sector (domains/languages/sources) of each cross-category. Specifically, in Cross-Domain methodologies if an approach covers less than 3 domains then it is “Non-flexible”. In the case that the approach covers 3–7 domains, then it is “Relatively flexible”. Otherwise, it is “Flexible”. In Cross-Language methods, when an approach focuses on two languages, it is “Relatively flexible”, whereas when it focuses on three or more languages, it is “Flexible”. In the case of Cross-Source methodologies, if an approach uses data from less than 10 sources, it is “Non-flexible”. When it uses data from 10 to 100 sources, then it is “Relatively flexible”. Otherwise, it is “Flexible”. In the case of Multi-Cross methodologies, if the approach covers two cross-categories, it is “Relatively flexible”, otherwise it is “Flexible”. According to the categories to which this approach belongs, we separately calculate the adaptability of the related sectors. For instance, we consider a Multi-Cross methodology (A) that is Cross-Domain and Cross-Language. To characterize (A), we take into account the criteria related to Cross-Domain and Cross-Language methodologies, while ignoring Cross-Source-related criteria.
- The third parameter refers to the data pre-processing features that the examined methodologies use. When an approach uses fewer than 5 features, it is “Non-flexible”. When it uses 5–15 features, we also consider the ML model accuracy score: if the accuracy score is lower than 0.6, the approach is “Non-flexible”; otherwise, when the accuracy score is equal to or greater than 0.6, it is “Relatively flexible”. Finally, if the approach uses more than 15 features and achieves an ML model accuracy score above 0.8, it is “Flexible”; otherwise, if its accuracy score is equal to or less than 0.8, it is “Relatively flexible”.
- The fourth parameter is related to the datasets that each approach uses. Specifically, if the approach uses less than 5 datasets and covers less than 3 domains, then it is “Non-flexible”. If the datasets that are used cover 3 to 7 domains, the approach is “Relatively flexible”. Otherwise, it is “Flexible”.
- The fifth parameter refers only to Multi-Cross methodologies. Our aim is to study whether including multiple cross categories refines the results of deception detection methodologies, thus increasing their flexibility. Specifically, if the approach incorporates two cross categories, then it is “Relatively flexible”. Otherwise, it is “Flexible”. Methodologies that use only one category are not considered as Multi-Cross.
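The rules above can be encoded mechanically. The sketch below is our own encoding (not the authors' tooling) of the first two parameters: mapping a reported accuracy score and a Cross-Domain sector count to the three flexibility labels. Where the text is ambiguous, boundary values (accuracy of exactly 0.8, exactly 7 domains) are read as the lower label, which is an assumption.

```python
# Flexibility labels for the first two adaptability parameters (illustrative encoding).
def accuracy_flexibility(accuracy=None):
    """First parameter: ML model accuracy; None means the metric was not reported."""
    if accuracy is None or accuracy < 0.6:
        return "Non-flexible"
    if accuracy <= 0.8:  # 0.6 <= accuracy <= 0.8 (boundary reading is an assumption)
        return "Relatively flexible"
    return "Flexible"

def cross_domain_flexibility(num_domains):
    """Second parameter, Cross-Domain case: number of domains covered."""
    if num_domains < 3:
        return "Non-flexible"
    if num_domains <= 7:  # 3-7 domains
        return "Relatively flexible"
    return "Flexible"
```

For instance, an approach reporting an average accuracy of 0.7350 over six domains would be rated "Relatively flexible" on both parameters.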
4. Cross-* Detection Methodologies
4.1. Datasets Used in Cross-* Methodologies
4.2. Cross-Domain Methodologies
4.2.1. Cross-Domain Methodologies Analysis
- Structural: Determination of structural patterns in how fake news spreads globally (macro-level) and identification of structural patterns in conversations, where viewpoints are expressed (micro-level).
- Temporal: Recognition of opinions and emotions through the time period and frequency in which a conversation or a response took place.
- Linguistic: Analysis of textual information, such as emotional language, specific vocabulary, and syntax.
4.2.2. Comparison of Cross-Domain Methodologies
4.3. Cross-Language Methodologies
4.3.1. Cross-Language Methodology Analysis
4.3.2. Cross-Language Methodologies Comparison
4.4. Cross-Source Methodologies
4.4.1. Cross-Source Methodologies Analysis
4.4.2. Cross-Source Methodologies Comparison
4.5. Multiple-Cross Methodologies
4.5.1. Multiple-Cross Methodologies Analysis
4.5.2. Comparison of Multi-Cross Methodologies
4.6. Other Cross-* Datasets
4.7. Tabulated Processed Information
- Cross category of the methodology.
- Sectors of the cross category which are contained in the dataset used.
- The data pre-processing applied to each dataset.
- The type of representation used for training.
- Machine learning type.
- Metrics for which results are provided.
5. Discussion on Cross-* Methodologies
6. Suggestions, Conclusions and Future Plans
6.1. Suggestions and Future Research
6.2. Conclusions
6.3. Future Plans
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Appendix A
Authors | Publication Title | Cross-* | Sectors (Domain/Language/Source) | Dataset | Data Pre-Processing | Word Vector Representation | Machine Learning | Performance Metric |
---|---|---|---|---|---|---|---|---|
PÉREZ-ROSAS et al., Univ. of Michigan, Univ. of Amsterdam | Automatic detection of fake news | Cross-Domain | Business, Education, Politics, Sports, Technology, Entertainment | FakeNewsAMT Celebrity | Syntax, N-Grams, Punctuation, Psychological, Readability, Word Count, Sentence Structure, Emotional Words, Common Word Count, Average Word Size, POS Tags | Bag of Words | supervised learning | A (Accuracy), F1, R (Recall), P (Precision) |
GAUTAM A. and JERRIPOTHULA., Indraprastha Institute of Information Technology Delhi | SGG: Spinbot, grammarly and GloVe based fake news detection | Cross-Domain | Business, Education, Politics, Sports, Technology, Entertainment | FakeNewsAMT Celebrity | Syntax, Punctuation, Readability, Word Count, Misspelled Words, Verb Tense Analysis, Sentence Structure, Style, Plagiarism, Vocabulary, Common Word Count | TF-IDF vectorizer | supervised learning | A (Accuracy) |
SAIKH T. et al. Indian Institute of Technology Patna, Government College of Engineering | A deep learning approach for automatic detection of fake news | Cross-Domain | Business, Education, Politics, Sports, Technology, Entertainment | FakeNewsAMT Celebrity | N-Grams, Word Count | N/A | supervised learning | A (Accuracy) |
CASTELO S. et al., New York Univ., Federal Univ. of Amazonas | A topic-agnostic approach for identifying fake news pages | Cross-Domain | Political News, Entertainment | US Election 2016 Celebrity | N-Grams, Psychological, Readability, Word Count, Web Markup Words | Bag of Words and TF-IDF | supervised learning | A (Accuracy) |
LIN J. et al. Louisiana State Univ., Keene State College, Worcester Polytechnic Institute | Detecting fake news articles | Cross-Domain | Political News, Entertainment | PolitiFact GossipCop | N-Grams, Punctuation, Psychological, Readability, Word Count, Sentence Structure | Bag of Words and TF-IDF | deep learning | A (Accuracy), F1, R (Recall), P (Precision) |
CHU S. et al. Univ. of Hong Kong | Cross-language fake news detection | Cross-Language | English, Chinese | English (Custom) Chinese (Custom) | Common Word Count | N/A | supervised learning | A (Accuracy), F1, R (Recall), P (Precision) |
VOGEL et al. Fraunhofer Institute SIT | Detecting fake news spreaders on Twitter (multilingual perspective) | Cross-Language | English, Spanish | Custom Dataset | Syntax, Psychological, Word Count, Verb Tense Analysis, Style, Stop Word Removal, Emotional Words, Average Word Size | Bag of Words and TF-IDF | supervised learning | A (Accuracy), F1, R (Recall) P (Precision) |
ABONIZIO et al. State Univ. of Londrina, Univ. degli Studi di Milano | Language-independent fake news detection: English, Portuguese and Spanish mutual features | Cross-Language | English, Spanish, Portuguese | FNC | Syntax, Punctuation, Psychological, Word Count, Verb Tense Analysis, Sentence Structure, Style, Common Word Count, Average Word Size, Tokenization, POS tags | Bag of Words | supervised learning | A (Accuracy) |
GUIBON et al. Aix-Marseille Univ., Univ. de Bretagne Occidentale, Geol Semantics, Knoema-Corporation Perm, PluriTAL | Multilingual fake news detection with satire | Cross-Language | English, French | Custom Dataset | Syntax, N-Grams, Punctuation, Word Count, Stop Word Removal, Web Markup Words, Common Word Count, Tokenization | TF-IDF vectorizer | supervised and deep learning | F1 |
WANG L. et al. Max Planck Institute for Informatics, Shandong University, Univ. of Potsdam | Cross-Domain learning for classifying propaganda in online contents | Cross-Source | Speeches, News and Tweets from various sources | Custom Dataset | Web Markup Words | Bag of Words and TF-IDF | supervised and deep learning | P (Precision) R (Recall) F1 |
ASR et al. Simon Fraser University | The data challenge in misinformation detection: Source reputation vs. content veracity | Cross-Source | 1st (The Natural News, Activist Report, The Onion, The Borowitz Report, Clickhole, America News, DC Gazette, Gigaword News), 2nd (The Onion, The Beaverton, The Toronto Star, The New York Times), 3rd (Multiple FB pages), 4th (multiple sources, by Snopes) | Rashkin, Rubin, BuzzFeedUSE, Snopes312 | N-Grams, Punctuation, Psychological, Word Count, Sentence Structure, Stop Word Removal, Emotional Words, Common Word Count, Average Word Size, POS Tags | Bag of Words and TF-IDF | supervised learning | F1 |
HUANG et al. National Tsing Hua University | Conquering cross-source failure for news credibility: Learning generalizable representations beyond content embedding | Cross-Source | 1st dataset FRN (NPR, New York Times, WallStreet Journal, Bloomberg), 2nd dataset KJR (Before It’s News, Daily Buzz Live, Activist Post, BBC, Reuters, CNN, ABC News, NYTimes) 3rd dataset NELA (195 sources) | FRN KJR NELA | Syntax, N-Grams, Sentence Structure, Style, Tokenization, POS Tags | TF-IDF vectorizer | supervised and deep learning | A (Accuracy) |
KARIMI et al. Michigan State University | Multi-source multi-class fake news detection | Cross-Source | LIAR (from multiple sources selected by PolitiFact) | LIAR | N-Grams | Bag of Words | supervised and deep learning | A (Accuracy) |
SHU et al., Arizona State Univ., Penn State University | Hierarchical propagation networks for fake news detection: Investigation & exploitation | Cross-Domain | Domains (Political News, Entertainment) | FakeNewsNet, PolitiFact, GossipCop | Custom | N/A | supervised learning | A (Accuracy), P (Precision), R (Recall) |
Y. WANG et al. Taiyuan Univ. of Technology | SemSeq4FD: Integrating global semantic relationship & local sequential order to enhance text representation for fake news detection | Multi-Cross | Domains (health, economic, technology, entertainment, society, military, political and education), Languages (English, Chinese), Sources (The Onion, Gigaword News, Associated Press, Washington Post, Bloomberg NewsWire, The Borowitz Report, Clickhole, Toronto Star, NY Times, The Beaverton) | SLN (English) LUN (English) Weibo (Chinese) RCED (Chinese) | N-Grams, Word Count, Stop Word Removal, Common Word Count, Average Word Size, Tokenization | TF-IDF vectorizer | supervised and deep learning | A (Accuracy), F1, R (Recall) P (Precision) |
JERONIMO C. et al. Federal Univ. of Campina Grande | Fake news classification based on subjective language | Multi-Cross | Domains (Politics, Sports, Economy, Culture) Languages (English, Portuguese), Sources (Folha de Sao Paulo, Estadao) | Custom Dataset | Punctuation, Word Count, Sentence Structure, Stop Word Removal, Emotional Words | TF-IDF vectorizer | supervised learning | A (Accuracy) |
SHAHI, G. K., and NANDINI, D. Univ. of Duisburg-Essen, Univ. of Bamberg | FakeCovid—A multilingual cross domain FactCheck news dataset for COVID-19 | Multi-Cross | Domains (Health, Science) Languages (English, Hindi, German), Source (Multiple sources from Snopes and Poynter) | FakeCovid | Word Count, Misspelled Words, Web Markup Words, Average Word Size, Tokenization | N/A | N/A | P (Precision), R (Recall) F1 |
KULA et al., UTP Univ. of Science and Technology, Kaszimierz Wielki Univ., Wroclaw | Sentiment analysis for fake news: Detection by means of neural networks | Cross-Domain | Domains (World News, Politics, US news) | ISOT, Kaggle | Custom | N/A | proposed | A (Accuracy), P (Precision), R (Recall), F1 |
References
- Zhou, X.; Zafarani, R. A survey of fake news. ACM Comput. Surv. 2020, 53, 1–40. [Google Scholar] [CrossRef]
- Gelfert, A. Fake News: A Definition. Informal Log. 2018, 38, 84–117. [Google Scholar] [CrossRef]
- Tandoc, E.; Lim, Z.; Ling, R. Defining “fake news”: A typology of scholarly definitions. Digit. J. 2018, 6, 137–153. [Google Scholar]
- Vieweg, S.; Hughes, A.; Starbird, K.; Palen, L. Microblogging during two natural hazards events. In Proceedings of the 28th International Conference on Human factors in Computing Systems, Atlanta, GA, USA, 10–15 April 2010. [Google Scholar]
- Medium. Detecting Fake News with NLP. 2021. Available online: https://medium.com/@Genyunus/detecting-fake-news-with-nlp-c893ec31dee8 (accessed on 15 March 2021).
- Wang, Y.; Wang, L.; Yang, Y.; Lian, T. SemSeq4FD: Integrating global semantic relationship and local sequential order to enhance text representation for fake news detection. Expert Syst. Appl. 2021, 166, 114090. [Google Scholar] [CrossRef] [PubMed]
- PRISMA-Transparent reporting of systematic reviews and meta-analyses. Available online: http://www.prisma-statement.org/ (accessed on 25 January 2021).
- Stergiopoulos, G.; Gritzalis, D.A.; Limnaios, E. Cyber-attacks on the Oil & Gas sector: A survey on incident assessment and attack patterns. IEEE Access 2020, 8, 128440–128475. [Google Scholar]
- Zhang, X.; Ghorbani, A.A. An overview of online fake news: Characterization, detection, and discussion. Inf. Process. Manag. 2020, 57, 102025. [Google Scholar] [CrossRef]
- Shu, K.; Sliva, A.; Wang, S.; Tang, J.; Liu, H. Fake news detection on social media: A data mining perspective. ACM SIGKDD Explor. Newsl. 2017, 19, 22–36. [Google Scholar] [CrossRef]
- Bondielli, A.; Marcelloni, F. A survey on fake news and rumor detection techniques. Inf. Sci. 2019, 497, 38–55. [Google Scholar] [CrossRef]
- Oshikawa, R.; Qian, J.; Wang, W. A survey on natural language processing for fake news detection. arXiv 2018, arXiv:1811.00770. [Google Scholar]
- Sharma, K.; Qian, F.; Jiang, H.; Ruchansky, N.; Zhang, M.; Liu, Y. Combating fake news: A survey on identification and mitigation techniques. ACM Trans. Intell. Syst. Technol. 2019, 10, 1–42. [Google Scholar] [CrossRef]
- Chowdhury, G. Natural language processing. Annu. Rev. Inf. Sci. Technol. 2005, 37, 51–89. [Google Scholar] [CrossRef] [Green Version]
- Dey, A. Machine learning algorithms: A review. Int. J. Comput. Sci. Inform. Technol. 2016, 7, 1174–1179. [Google Scholar]
- Padala, V.; Gandhi, K.; Dasari, P. Machine learning: The new language for applications. Int. J. Artif. Intell. 2019, 8, 411. [Google Scholar] [CrossRef]
- Zhou, Z. A brief introduction to weakly supervised learning. Natl. Sci. Rev. 2017, 5, 44–53. [Google Scholar] [CrossRef] [Green Version]
- Medium. 10 Machine Learning Methods that Every Data Scientist Should Know. Available online: https://towardsdatascience.com/10-machine-learning-methods-that-every-data-scientist-should-know-3cc96e0eeee9 (accessed on 31 March 2021).
- Pisner, D.A.; Schnyer, D.M. Support vector machine. In Machine Learning; Mechelli, A., Vieira, S., Eds.; Academic Press: Cambridge, MA, USA, 2020; pp. 101–121. [Google Scholar] [CrossRef]
- Biau, G.; Scornet, E. A random forest guided tour. Test 2016, 25, 197–227. [Google Scholar] [CrossRef] [Green Version]
- Nadkarni, P. Core Technologies: Data Mining and “Big Data”. In Clinical Research Computing: A Practitioner’s Handbook; Academic Press: Cambridge, MA, USA, 2016; pp. 187–204. [Google Scholar]
- Hosmer, D.W., Jr.; Lemeshow, S.; Sturdivant, R.X. Applied Logistic Regression; John Wiley & Sons: Hoboken, NJ, USA, 2013; Volume 398, pp. 1–5. [Google Scholar]
- Collins, M.; Schapire, R.E.; Singer, Y. Logistic regression, AdaBoost and Bregman distances. Mach. Learn. 2002, 48, 253–285. [Google Scholar] [CrossRef]
- Xgboost.readthedocs.io. XGBoost Documentation—Xgboost 1.5.0-dev Documentation. 2021. Available online: https://xgboost.readthedocs.io/en/latest/ (accessed on 21 June 2021).
- MIT News | Massachusetts Institute of Technology. Explained: Neural Networks. 2021. Available online: https://news.mit.edu/2017/explained-neural-networks-deep-learning-0414 (accessed on 21 June 2021).
- Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
- González-Carvajal, S.; Garrido-Merchán, E.C. Comparing BERT against traditional machine learning text classification. arXiv 2020, arXiv:2005.13012. [Google Scholar]
- Albawi, S.; Mohammed, T.A.; Al-Zawi, S. Understanding of a convolutional neural network. In Proceedings of the 2017 International Conference on Engineering and Technology (ICET), Antalya, Turkey, 21–23 August 2017. [Google Scholar]
- Hossin, M.; Sulaiman, M. A review on evaluation metrics for data classification evaluations. Int. J. Data Min. Knowl. Manag. Process. 2015, 5, 1. [Google Scholar] [CrossRef]
- Dietterich, T. Overfitting and undercomputing in machine learning. ACM Comput. Surv. 1995, 27, 326–327. [Google Scholar] [CrossRef]
- Hawkins, D. The Problem of Overfitting. J. Chem. Inf. Comput. Sci. 2004, 44, 1–12. [Google Scholar] [CrossRef]
- Pascanu, R.; Mikolov, T.; Bengio, Y. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning, Atlanta, GA, USA, 16–21 June 2013; pp. 1310–1318. [Google Scholar]
- Castelo, S.; Almeida, T.; Elghafari, A.; Santos, A.; Pham, K.; Nakamura, E.; Freire, J. A topic-agnostic approach for identifying fake news pages. In Proceedings of the 2019 WWW Conference, San Francisco, CA, USA, 13–17 May 2019. [Google Scholar]
- Gritzalis, D.; Iseppi, G.; Mylonas, A.; Stavrou, V. Exiting the risk assessment maze: A meta-survey. ACM Comput. Surv. 2018, 51, 1–30. [Google Scholar] [CrossRef]
- Saikh, T.; De, A.; Ekbal, A.; Bhattacharyya, P. A deep learning approach for automatic detection of fake news. arXiv 2020, arXiv:2005.04938. [Google Scholar]
- Shu, K.; Mahudeswaran, D.; Wang, S.; Lee, D.; Liu, H. FakeNewsNet: A Data Repository with News Content, Social Context, and Spatiotemporal Information for Studying Fake News on Social Media. Big Data 2020, 8, 171–188. [Google Scholar] [CrossRef]
- Chu, S.; Xie, R.; Wang, Y. Cross-Language Fake News Detection. Data Inf. Manag. 2020, 5, 100–109. [Google Scholar]
- Dogo, M.; Deepak, P.; Jurek-Loughrey, A. Exploring Thematic Coherence in Fake News. In Proceedings of the ECML PKDD 2020 Workshops, 14–18 September 2020; pp. 571–580. [Google Scholar]
- Rashkin, H.; Choi, E.; Jang, J.; Volkova, S.; Choi, Y. Truth of varying shades: Analyzing language in fake news and political fact-checking. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, 7–11 September 2017; pp. 2931–2937. [Google Scholar]
- Rubin, V.; Conroy, N.; Chen, Y.; Cornwell, S. Fake news or truth? Using satirical cues to detect potentially misleading news. In Proceedings of the 2nd Workshop on Computational Approaches to Deception Detection, San Diego, CA, USA, 12–17 June 2016; pp. 7–17. [Google Scholar] [CrossRef] [Green Version]
- Huang, Y.; Liu, T.; Lee, S.; Calderon Alvarado, F.; Chen, Y. Conquering Cross-source Failure for News Credibility: Learning Generalizable Representations beyond Content Embedding. In Proceedings of the Web Conference 2020, Taipei, Taiwan, 20–24 April 2020. [Google Scholar]
- Wang, W. Liar, liar pants on fire: A new benchmark dataset for fake news detection. arXiv 2017, arXiv:1705.00648. [Google Scholar]
- Long, Y. Fake news detection through multi-perspective speaker profiles. In Proceedings of the 8th International Joint Conference on Natural Language Processing, Taipei, Taiwan, 27 November–1 December 2017; pp. 252–256. [Google Scholar]
- Ma, J.; Gao, W.; Mitra, P.; Kwon, S.; Jansen, B.; Wong, K.; Cha, M. Detecting rumors from microblogs with recurrent neural networks. In Proceedings of the 25th International Joint Conference on Artificial Intelligence, New York, NY, USA, 9–15 July 2016; pp. 3818–3824. [Google Scholar]
- Ma, J.; Gao, W.; Wei, Z.; Lu, Y.; Wong, K. Detect rumors using time series of social context information on microblogging websites. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, New York, NY, USA, 18–23 October 2015. [Google Scholar]
- Jeronimo, C.; Marinho, L.; Campelo, C.; Veloso, A.; da Costa Melo, A. Fake news classification based on subjective language. In Proceedings of the 21st International Conference on Information Integration and Web-based Applications & Services, Munich, Germany, 2–4 December 2019; pp. 15–24. [Google Scholar]
- Pérez-Rosas, V.; Kleinberg, B.; Lefevre, A.; Mihalcea, R. Automatic detection of fake news. arXiv 2017, arXiv:1708.07104. [Google Scholar]
- Shahi, G.; Nandini, D. FakeCovid-A Multilingual Cross-domain Fact Check News Dataset for COVID-19. arXiv 2020, arXiv:2006.11343. [Google Scholar]
- Gautam, A.; Jerripothula, K. SGG: Spinbot, Grammarly and GloVe based Fake News Detection. In Proceedings of the IEEE 6th International Conference on Multimedia Big Data, New Delhi, India, 24–26 September 2020. [Google Scholar]
- Lin, J.; Tremblay-Taylor, G.; Mou, G.; You, D.; Lee, K. Detecting Fake News Articles. In Proceedings of the 2019 IEEE International Conference on Big Data, Los Angeles, CA, USA, 9–12 December 2019. [Google Scholar]
- Kula, S.; Choraś, M.; Kozik, R.; Ksieniewicz, P.; Woźniak, M. Sentiment Analysis for Fake News Detection by Means of Neural Networks. In Proceedings of the International Conference on Computational Science, Amsterdam, The Netherlands, 3–5 June 2020; pp. 653–666. [Google Scholar]
- Shu, K.; Mahudeswaran, D.; Wang, S.; Liu, H. Hierarchical propagation networks for fake news detection: Investigation and exploitation. In Proceedings of the International AAAI Conference on Web and Social Media, Atlanta, GA, USA, 8–11 June 2020; Volume 14. [Google Scholar]
- Vogel, I.; Meghana, M. Detecting fake news spreaders on twitter from a multilingual perspective. In Proceedings of the IEEE 7th International Conference on Data Science and Advanced Analytics, Sydney, Australia, 6–9 October 2020; pp. 599–606. [Google Scholar]
- Abonizio, H.; de Morais, J.; Tavares, G.; Barbon, S. Language-Independent Fake News Detection: English, Portuguese, and Spanish Mutual Features. Future Internet 2020, 12, 87. [Google Scholar] [CrossRef]
- Guibon, G.; Ermakova, L.; Seffih, H.; Firsov, A.; Le Noé-Bienvenu, G. Multilingual fake news detection with satire. In Proceedings of the International Conference on Computational Linguistics and Intelligent Text Processing, La Rochelle, France, 7–13 April 2019. [Google Scholar]
- Asr, F.; Taboada, M. The data challenge in misinformation detection: Source reputation vs. content veracity. In Proceedings of the 1st Workshop on Fact Extraction and VERification, Brussels, Belgium, 1 November 2018; pp. 10–15. [Google Scholar]
- Karimi, H.; Roy, P.; Saba-Sadiya, S.; Tang, J. Multi-source multi-class fake news detection. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, NM, USA, 20–26 August 2018; pp. 1546–1557. [Google Scholar]
- Wang, L.; Shen, X.; de Melo, G.; Weikum, G. Cross-Domain Learning for Classifying Propaganda in Online Contents. arXiv 2020, arXiv:2011.06844. [Google Scholar]
- Tacchini, E.; Ballarin, G.; Della Vedova, M.; Moret, S.; de Alfaro, L. Some like it hoax: Automated fake news detection in social networks. arXiv 2017, arXiv:1704.07506. [Google Scholar]
- Horne, B.; Adali, S. This Just in: Fake News Packs a Lot in Title, Uses Simpler, Repetitive Content in Text Body, More Similar to Satire Than Real News. In Proceedings of the ICWSM, Montreal, QC, Canada, 15–18 May 2017; Volume 11. [Google Scholar]
- GitHub. BuzzFeedNews/2017-12-Fake-News-Top-50. 2021. Available online: https://github.com/BuzzFeedNews/2017-12-fake-news-top-50 (accessed on 15 March 2021).
- GitHub. thiagovas/bs-detector-dataset. 2021. Available online: https://github.com/thiagovas/bs-detector-dataset (accessed on 15 March 2021).
- Mitra, T.; Gilbert, E. CREDBANK: A Large-Scale Social Media Corpus with Associated Credibility Annotations. In Proceedings of the ICWSM, Oxford, UK, 26–29 May 2015; Volume 9. [Google Scholar]
- Buntain, C.; Golbeck, J. Automatically identifying fake news in popular twitter threads. In Proceedings of the IEEE International Conference on Smart Cloud, New York, NY, USA, 3–5 November 2017; pp. 208–215. [Google Scholar]
- Kochkina, E.; Liakata, M.; Zubiaga, A. All-in-one: Multi-task learning for rumor verification. arXiv 2018, arXiv:1806.03713. [Google Scholar]
- Fakenewschallenge.org. Fake News Challenge. 2021. Available online: http://www.fakenewschallenge.org/ (accessed on 15 March 2021).
Goal | Question | Keyword Search |
---|---|---|
Discover Cross-Domain methodologies for fake news detection | Which of the published methods are addressed to detect fake news in different topics? | (“fake news” OR “disinformation” OR “misinformation” OR “false information”) AND (“cross domain” OR “different domains” OR “different topics” OR “cross domain datasets”) AND (“deception detection” OR “automatic detection” OR “manual detection”) |
Discover Cross-Language methodologies for fake news detection | Which of the published methods are addressed to detect fake news in different languages? | (“fake news” OR “disinformation” OR “misinformation” OR “false information”) AND (“cross language” OR “different languages” OR “multilingual”) AND (“deception detection” OR “automatic detection” OR “manual detection”) |
Discover Cross-Source methodologies for fake news detection | Which of the published methods are addressed to detect fake news from different sources? | (“fake news” OR “disinformation” OR “misinformation” OR “false information”) AND (“cross source” OR “different sources”) AND (“deception detection” OR “automatic detection” OR “manual detection”) |
Metric | Definition |
---|---|
Accuracy | Number of correctly predicted articles proportionate to the total amount of texts given to the model for testing. |
Recall | Number of correctly predicted Positives (or Negatives) relative to the total number of actual Positives (or Negatives) in the test data. |
Precision | Number of True Positives (or True Negatives) the model correctly predicted in relation to the total number of Positives (or Negatives) the model predicted. |
F1 | Describes the balance between Precision and Recall. A model is considered perfect when it achieves an F1 score of 1 and useless if it reaches a value near 0. |
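The four metrics above can be computed directly from confusion-matrix counts for the positive (“fake”) class; the sketch below shows the standard formulas, with counts invented purely for illustration.

```python
# Standard classification metrics from confusion-matrix counts (illustrative).
def classification_metrics(tp, fp, tn, fn):
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0  # of predicted fakes, how many were fake
    recall = tp / (tp + fn) if tp + fn else 0.0     # of actual fakes, how many were caught
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)           # harmonic mean of P and R
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

m = classification_metrics(tp=40, fp=10, tn=45, fn=5)
# accuracy = 0.85, precision = 0.80, recall ≈ 0.889, f1 ≈ 0.842
```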
Feature Category | Definition |
---|---|
Readability Features | • Information that simplifies/complicates reading comprehension. • Syntax, Tokenization, Average word size, Common Word Count, Sentence Structure, Word Count, Stop Word Removal, Other. |
Web-markup Features | • Information regarding the layout of the web pages from which it was gathered. |
Morphological Features | • Information related to the grammatical and syntactic structure of sentences. • N-Grams, POS Tags, Misspelled Words, Verb Tense Analysis. |
Psychological Features | • Information associated with semantic data that are captured from textual analysis. • Punctuation, Emotional Words, Other. |
Other Features | • Features that do not belong in the previous categories. • Style, Plagiarism, Vocabulary. |
Ref. No. | Authors | Datasets | Model | Average Accuracy | Precision | Recall | F1-Score |
---|---|---|---|---|---|---|---|
[47] | V. Pérez-Rosas (Univ. of Michigan), B. Kleinberg (Univ. of Amsterdam), A. Lefevre (Univ. of Michigan), R. Mihalcea (Univ. of Michigan) | FakeNewsAMT Celebrity | Custom SVM | 0.7350 | 0.7375 | 0.7350 | 0.7325 |
[50] | J. Lin (Louisiana State University), G. Tremblay-Taylor (Keene State College), G. Mou, Di You, K. Lee (Worcester Polytechnic Institute) | PolitiFact GossipCop | Logistic Regression | 0.8097 | 0.8207 | 0.8097 | 0.8137 |
SVM | 0.7677 | 0.7037 | 0.6470 | 0.7973 | |||
KNN | 0.8163 | 0.8133 | 0.8163 | 0.7683 | |||
Random Forest | 0.8420 | 0.8443 | 0.8420 | 0.8290 | |||
AdaBoost | 0.8277 | 0.8207 | 0.8277 | 0.8200 | |||
XGBoost | 0.8583 | 0.8550 | 0.8583 | 0.8517 | |||
LSTM-ATT | 0.8120 | 0.8167 | 0.8120 | 0.8123 | |||
[49] | A. Gautam, K. Jerripothula (Indraprastha Institute of Information Technology Delhi) | FakeNewsAMT Celebrity | Spinbot + Grammarly + GloVe | 0.6300 | N/A | N/A | N/A |
[35] | T. Saikh, A. Ekbal, P. Bhattacharyya (Indian Institute of Technology), A. De (Government College of Engineering, Berhampore) | FakeNewsAMT Celebrity | MLP ELMo | 0.7680 0.8115 | N/A | N/A | N/A |
[33] | S. Castelo, A. Santos, K. Pham, J. Freire, A. Elghafari (NYU), T. Almeida, E. Nakamura (Federal Univ. of Amazonas) | US Election 2016 Celebrity | FNDetector (SVM) FNDetector (KNN) FNDetector (RF) TAG Model (SVM) TAG Model (KNN) TAG Model (RF) | 0.5900 0.5750 0.5350 0.6650 0.6200 0.6200 | N/A | N/A | N/A |
[51] | Sebastian Kula, Michał Choraś, Rafał Kozik, Paweł Ksieniewicz, Michał Woźniak (Univ. of Science and Technology, Kazimierz Wielki Univ., Wrocław Univ. of Science and Technology, Poland) | ISOT Kaggle | RNN | 0.9986 | 0.9982 | 0.9991 | 0.9986
[52] | Kai Shu, Deepak Mahudeswaran, Suhang Wang, Huan Liu (Computer Science and Engineering, Arizona State Univ., Penn State Univ.) | FakeNewsNet PolitiFact GossipCop | Multiple Different Models (RF, Naïve Bayes, Decision Trees, Logistic Regression) | 0.8520 | 0.8440 | 0.8660 | 0.8520 |
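The tables in this section report average accuracy, precision, recall, and F1-score. As a reminder of how these metrics derive from a binary (fake/real) confusion matrix, here is a minimal sketch; it assumes plain label lists and is not the evaluation code of any surveyed work.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 from parallel label lists,
    treating `positive` (e.g. the 'fake' class) as the positive label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1
```

Note that F1 is the harmonic mean of precision and recall, which is why several rows above report an F1 close to, but not above, both of those values.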
Features | Pérez-Rosas et al. | Gautam A. et al. | Saikh T. et al. | Castelo S. et al. | Lin J. et al. | |
---|---|---|---|---|---|---|
Readability Features | Syntax | ✓ | ✓ | |||
Tokenization | ||||||
Average Word Size | ✓ | |||||
Common Word Count | ✓ | ✓ | ||||
Sentence Structure | ✓ | ✓ | ✓ | |||
Word Count | ✓ | ✓ | ✓ | ✓ | ✓ | |
Stop Word Removal | ||||||
Other readability features | ✓ | ✓ | ✓ | |||
Psychological Features | Punctuation | ✓ | ✓ | ✓ | ||
Emotional Words | ✓ | |||||
Other Psychological features | ✓ | ✓ | ✓ | |||
Morphological Features | N-Grams | ✓ | ✓ | ✓ | ✓ | |
POS Tags | ✓ | |||||
Misspelled Words | ✓ | |||||
Verb Tense Analysis | ✓ | |||||
Others | Style | ✓ | ||||
Plagiarism | ✓ | |||||
Vocabulary | ✓ | |||||
Web Markup Words | ✓ |
Suggested by | Education | Entertainment | Business | Politics | Technology | Sports | Celebrity | Other |
---|---|---|---|---|---|---|---|---|
Pérez-Rosas et al. | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |
Gautam A. et al. | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |
Saikh T. et al. | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |
Castelo S. et al. | ✓ | ✓ | ||||||
Lin J. et al. | ✓ | ✓ | ||||||
Kula et al. | ✓ | World News | | | | | | |
Shu et al. | ✓ | ✓ |
Suggested by | 1st Parameter (Accuracy) | 2nd Parameter (Num. of Sectors) | 3rd Parameter (Num. of Features) | 4th Parameter (Num. of Datasets) | Overall Adaptability |
---|---|---|---|---|---|
Pérez-Rosas et al. | Relatively flexible | Flexible | Relatively flexible | Relatively flexible | Relatively flexible |
Gautam A. et al. | Relatively flexible | Flexible | Relatively flexible | Relatively flexible | Relatively flexible |
Saikh T. et al. | Flexible | Flexible | Non-flexible | Relatively flexible | Flexible |
Castelo S. et al. | Relatively flexible | Non-flexible | Non-flexible | Non-flexible | Non-flexible |
Lin J. et al. | Flexible | Non-flexible | Non-flexible | Non-flexible | Non-flexible |
Ref. No. | Authors | Datasets | Model | Average Accuracy | Precision | Recall | F1-Score |
---|---|---|---|---|---|---|---|
[37] | S. W. Chu, R. Xie, Y. Wang (Univ. of Hong Kong) | English (Custom) Chinese (Custom) | BERT Model | 0.6701 | 0.6750 | 0.8700 | 0.7400 |
[53] | I. Vogel, M. Meghana (Fraunhofer Institute) | Custom Dataset | CNN | 0.7250 | 0.7850 | 0.6600 | 0.7150 |
| | | | SVM | 0.7500 | 0.8200 | 0.6450 | 0.7200
| | | | Logistic Regression | 0.7400 | 0.7950 | 0.6400 | 0.7100
[54] | H. Queiroz Abonizio, J. de Morais, S. Barbon (State Univ. of Londrina), G. Marques Tavares (Univ. of Milan) | FNC | Random Forest | 0.8530 | N/A | N/A | N/A
[55] | G. Guibon (Aix-Marseille Univ.), L. Ermakova (Univ. de Bretagne), H. Seffih (GeolSemantics), A. Firsov (Knoema Corp.), G. Le Noé-Bienvenu (PluriTAL) | Custom Dataset | Decision Tree | N/A | N/A | N/A | 0.5776
| | | | SVM | N/A | N/A | N/A | N/A
| | | | Custom LGBM | N/A | N/A | N/A | 0.8476
| | | | Optimized LGBM | N/A | N/A | N/A | 0.8727
| | | | Optimized SVM | N/A | N/A | N/A | 0.8835
| | | | CNN | N/A | N/A | N/A | N/A
Features | Chu et al. | Vogel et al. | Abonizio et al. | Guibon et al. | |
---|---|---|---|---|---|
Readability Features | Syntax | ✓ | ✓ | ✓ | |
Tokenization | ✓ | ✓ | |||
Average Word Size | ✓ | ✓ | |||
Common Word Count | ✓ | ✓ | ✓ | ||
Sentence Structure | ✓ | ||||
Word Count | ✓ | ✓ | ✓ | ||
Stop Word Removal | ✓ | ✓ | |||
Other readability features | |||||
Psychological Features | Punctuation | ✓ | ✓ | ||
Emotional Words | ✓ | ||||
Other Psychological features | ✓ | ✓ | |||
Morphological Features | N-Grams | ✓ | |||
POS Tags | ✓ | ||||
Misspelled Words | |||||
Verb Tense Analysis | ✓ | ✓ | |||
Others | Style | ✓ | ✓ | ||
Plagiarism | |||||
Vocabulary | |||||
Web Markup Words | ✓ |
Suggested by | Languages |
---|---|
Chu S. et al. | English, Chinese |
Vogel et al. | English, Spanish |
Abonizio et al. | English, Spanish, Portuguese |
Guibon et al. | English, French |
Suggested by | 1st Parameter (Accuracy) | 2nd Parameter (Num. of Sectors) | 3rd Parameter (Num. of Features) | 4th Parameter (Num. of Datasets) | Overall Adaptability |
---|---|---|---|---|---|
Chu S. et al. | Relatively flexible | Relatively flexible | Non-flexible | Non-flexible | Non-flexible |
Vogel et al. | Relatively flexible | Relatively flexible | Relatively flexible | Non-flexible | Relatively flexible |
Abonizio et al. | Flexible | Flexible | Relatively flexible | Relatively flexible | Flexible |
Guibon et al. | Non-flexible | Relatively flexible | Non-flexible | Non-flexible | Non-flexible |
Ref. No. | Authors | Datasets | Model | Average Accuracy | Precision | Recall | F1-Score |
---|---|---|---|---|---|---|---|
[59] | L. Wang, X. Shen, G. Weikum (Max Planck Institute for Informatics), L. Wang (Shandong University), G. de Melo (Hasso Plattner Institute, University of Potsdam) | Custom Dataset | LR | N/A | 0.5300 | 0.3412 | 0.3540
| | | | SVM | N/A | 0.5132 | 0.3617 | 0.3840
| | | | LSTM | N/A | 0.4715 | 0.5772 | 0.4745
| | | | LSTMR | N/A | 0.4797 | 0.3317 | 0.3767
[57] | Y.-H. Huang, F. Calderon, T.-W. Liu, Y.-S. Chen, S.-R. Lee (National Tsing Hua University) | FRN KJR NELA-GT 2018 | GBT | 0.6683 | N/A | N/A | N/A
| | | | SVM | 0.6860 | N/A | N/A | N/A
| | | | Random Forest | 0.6026 | N/A | N/A | N/A
| | | | Bi-LSTM-attention | 0.6658 | N/A | N/A | N/A
[58] | H. Karimi, P. C. Roy, S. Saba-Sadiya, J. Tang (Michigan State University) | LIAR | Basic SVM | 0.2998 | N/A | N/A | N/A
| | | | Basic Random Forest | 0.2701 | N/A | N/A | N/A
| | | | Basic NN | 0.2912 | N/A | N/A | N/A
| | | | MMFD | 0.3881 | N/A | N/A | N/A
[56] | F. T. Asr, M. Taboada (Simon Fraser University) | Custom Dataset | SVM | N/A | N/A | N/A | 0.8550 |
Features | Wang et al. | Huang et al. | Karimi et al. | Asr et al. |
---|---|---|---|---|---|
Readability Features | Syntax | ✓ | |||
Tokenization | ✓ | ||||
Average Word Size | ✓ | ||||
Common Word Count | ✓ | ||||
Sentence Structure | ✓ | ✓ | |||
Word Count | ✓ | ||||
Stop Word Removal | ✓ | ||||
Other readability features | |||||
Psychological Features | Punctuation | ✓ | |||
Emotional Words | ✓ | ||||
Other Psychological features | ✓ | ||||
Morphological Features | N-Grams | ✓ | ✓ | ✓ | |
POS Tags | ✓ | ✓ | |||
Misspelled Words | |||||
Verb Tense Analysis | |||||
Others | Style | ✓ | |||
Plagiarism | |||||
Vocabulary | |||||
Web Markup Words | ✓ |
Suggested by | 1st Parameter (Accuracy) | 2nd Parameter (Num. of Sectors) | 3rd Parameter (Num. of Features) | 4th Parameter (Num. of Datasets) | Overall Adaptability |
---|---|---|---|---|---|
Wang et al. | Non-flexible | Relatively flexible | Relatively flexible | Relatively flexible | Relatively flexible |
Huang et al. | Relatively flexible | Flexible | Relatively flexible | Relatively flexible | Relatively flexible |
Karimi et al. | Non-flexible | Relatively flexible | Non-flexible | Relatively flexible | Non-flexible |
Asr et al. | Non-flexible | Relatively flexible | Non-flexible | Relatively flexible | Non-flexible |
Ref. No. | Authors | Datasets | Model | Average Accuracy | Precision | Recall | F1-Score |
---|---|---|---|---|---|---|---|
[6] | Yuhang Wang, Li Wang, Yanjie Yang, Tao Lian (Taiyuan University of Technology) | SLN (English) LUN (English) Weibo (Chinese) RCED (Chinese) | SemSeq4FD | 0.8825 | 0.8900 | 0.8800 | 0.8967 |
[48] | Gautam Kishore Shahi (Univ. of Duisburg-Essen), Durgesh Nandini (Univ. of Bamberg) | FakeCovid | BERT-based Classification | N/A | 0.7800 | 0.7500 | 0.7600 |
[46] | Caio Libanio Melo Jeronimo, Leandro Balby Marinho, Claudio Campelo, Adriano Veloso, Allan Sales da Costa Melo (Federal Univ. of Campina Grande) | Custom Dataset | XGBoost | 0.8350 | N/A | N/A | N/A
| | | | Random Forest | 0.8950 | N/A | N/A | N/A
| | | | Dummy | 0.8000 | N/A | N/A | N/A
Features | Y. Wang et al. | Jeronimo et al. | Shahi G. & Nandini D. | |
---|---|---|---|---|
Readability Features | Syntax | |||
Tokenization | ✓ | ✓ | ||
Average Word Size | ✓ | ✓ | ||
Common Word Count | ✓ | |||
Sentence Structure | ✓ | |||
Word Count | ✓ | ✓ | ✓ | |
Stop Word Removal | ✓ | ✓ | ||
Other readability features | ||||
Psychological Features | Punctuation | ✓ | ||
Emotional Words | ✓ | |||
Other Psychological features | ||||
Morphological Features | N-Grams | ✓ | ||
POS Tags | ||||
Misspelled Words | ✓ | |||
Verb Tense Analysis | ||||
Others | Style | |||
Plagiarism | ||||
Vocabulary | ||||
Web Markup Words | ✓ |
Suggested by | Education | Entertainment | Business | Politics | Technology | Sports | Celebrity | Other |
---|---|---|---|---|---|---|---|---|
Y. Wang et al. | ✓ | ✓ | ✓ | ✓ | Military, Health, Economy, Society | |||
Jeronimo C. L. M. et al. | ✓ | ✓ | Economy, Culture | |||||
Shahi, G. & Nandini, D. | Health, Science |
Suggested by | Cross-Domain | Cross-Language | Cross-Source |
---|---|---|---|
Y. Wang et al. | ✓ | ✓ | ✓ |
Jeronimo C. et al. | ✓ | ✓ | |
Shahi, G., & Nandini, D. | ✓ | ✓ | ✓ |
Suggested by | 1st Parameter (Accuracy) | 2nd Parameter (Num. of Sectors) | 3rd Parameter (Num. of Features) | 4th Parameter (Num. of Datasets) | 5th Parameter (Num. of Categories) | Overall Adaptability |
---|---|---|---|---|---|---|
Y. Wang et al. | Flexible | Flexible | Relatively flexible | Relatively flexible | Flexible | Flexible
Jeronimo C. et al. | Non-flexible | Relatively flexible | Non-flexible | Relatively flexible | Relatively flexible | Relatively flexible |
Shahi, G., & Nandini, D. | Flexible | Non-flexible | Relatively flexible | Relatively flexible | Flexible | Flexible |
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Xarhoulacos, C.-G.; Anagnostopoulou, A.; Stergiopoulos, G.; Gritzalis, D. Misinformation vs. Situational Awareness: The Art of Deception and the Need for Cross-Domain Detection. Sensors 2021, 21, 5496. https://doi.org/10.3390/s21165496