Combating the Infodemic: A Chinese Infodemic Dataset for Misinformation Identification
Abstract
1. Introduction
- A Chinese infodemic dataset is introduced. To the best of our knowledge, this is the first Chinese infodemic dataset for misinformation identification.
- The original imbalanced dataset is converted into a balanced one by exploring the properties of the collected records.
- The proposed dataset is validated through intercoder reliability and word-frequency analysis, and baseline experiments are carried out as a reference for future work.
2. Related Works
3. Data Collection
4. Dataset Construction
5. Dataset Validation
6. Baseline Experiments
7. Conclusions and Future Works
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Label | Criterion | Representative Example |
---|---|---|
Questionable | Controversial | Remdesivir is effective for the treatment of COVID-19. |
Questionable | Inconclusive | Like the flu, COVID-19 will break out seasonally. |
Questionable | Conditionally true | The N95 mask needs to be changed every 4 h. |
Questionable | Partially true | A virus can stick to hair. Thus, it is necessary to wash one's hair after arriving home. |
False | False general knowledge | Perfume can be used to prevent COVID-19. |
False | False scientific knowledge | Drinking red wine can help resist COVID-19 and delay the development of the disease. |
False | Fake news | Wuhan’s coronavirus hospital will be relocated. |
False | Rumor | After the reopening of the red-light district in Greece, customers could stay for only 15 min. |
True | General assertion | Family members are advised not to share towels. |
True | True news | NBA announced the suspension of the 2019–2020 season. |
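
The labelling scheme in the table above maps ten criteria onto three labels. A minimal sketch of that mapping as a lookup structure follows; the names `Label`, `CRITERION_TO_LABEL`, and `label_for` are hypothetical and chosen only for illustration, not taken from the authors' tooling.

```python
from enum import Enum


class Label(Enum):
    """The three labels used in the dataset."""
    QUESTIONABLE = "questionable"
    FALSE = "false"
    TRUE = "true"


# Criterion -> label, transcribed from the criteria table.
# The dictionary itself is an illustrative assumption, not the authors' code.
CRITERION_TO_LABEL = {
    "controversial": Label.QUESTIONABLE,
    "inconclusive": Label.QUESTIONABLE,
    "conditionally true": Label.QUESTIONABLE,
    "partially true": Label.QUESTIONABLE,
    "false general knowledge": Label.FALSE,
    "false scientific knowledge": Label.FALSE,
    "fake news": Label.FALSE,
    "rumor": Label.FALSE,
    "general assertion": Label.TRUE,
    "true news": Label.TRUE,
}


def label_for(criterion: str) -> Label:
    """Return the label for a criterion name (case-insensitive)."""
    return CRITERION_TO_LABEL[criterion.strip().lower()]


print(label_for("Partially true").value)
```

Keeping the mapping in one place makes it easy to check that every criterion resolves to exactly one label.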
Dataset Stage | No. of Records Labelled as Questionable | No. of Records Labelled as False | No. of Records Labelled as True |
---|---|---|---|
Raw dataset | 128 | 600 | 69 |
Dataset after the first adjustment | 512 | 249 | 36 |
Dataset after the second adjustment | 478 | 281 | 38 |
Dataset after the third adjustment | 435 | 281 | 38 |
Dataset after the fourth adjustment | 435 | 281 | 339 |
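
The effect of each adjustment on class balance can be checked directly from the counts in the table above. The sketch below computes the total, the per-label share, and a max/min class ratio for every stage. The max/min ratio is an assumption introduced here as a simple imbalance indicator; the table itself reports only raw counts.

```python
# Record counts per label at each stage, copied from the table above.
STAGES = {
    "raw": {"questionable": 128, "false": 600, "true": 69},
    "first adjustment": {"questionable": 512, "false": 249, "true": 36},
    "second adjustment": {"questionable": 478, "false": 281, "true": 38},
    "third adjustment": {"questionable": 435, "false": 281, "true": 38},
    "fourth adjustment": {"questionable": 435, "false": 281, "true": 339},
}


def summarize(counts):
    """Return (total, per-label share, largest/smallest class ratio)."""
    total = sum(counts.values())
    shares = {label: n / total for label, n in counts.items()}
    imbalance = max(counts.values()) / min(counts.values())
    return total, shares, imbalance


summary = {stage: summarize(counts) for stage, counts in STAGES.items()}

for stage, (total, shares, imbalance) in summary.items():
    share_text = ", ".join(f"{label}={share:.1%}" for label, share in shares.items())
    print(f"{stage}: total={total}, {share_text}, max/min={imbalance:.2f}")
```

Under this indicator the ratio falls from roughly 8.7 in the raw dataset to roughly 1.5 after the fourth adjustment, while the total number of records changes from 797 to 1055.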
Measure | Questionable | False | True |
---|---|---|---|
Labels after the fourth adjustment | 435 | 281 | 339 |
Labels agreed by healthcare workers | 394 | 259 | 330 |
Intercoder reliability | 0.9057 | 0.9217 | 0.9734 |
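
The reliability values in the table above are consistent with simple percent agreement, that is, the number of labels agreed by healthcare workers divided by the number of labels after the fourth adjustment. This reading is inferred from the numbers rather than from a stated formula, so the sketch below should be taken as an assumption; `percent_agreement` is a hypothetical helper name.

```python
def percent_agreement(agreed: int, total: int) -> float:
    """Share of labels on which the second coder agreed with the first."""
    if total <= 0:
        raise ValueError("total must be positive")
    return agreed / total


# (agreed, total) per label, copied from the table above.
COUNTS = {
    "Questionable": (394, 435),
    "False": (259, 281),
    "True": (330, 339),
}

agreement = {
    label: percent_agreement(agreed, total)
    for label, (agreed, total) in COUNTS.items()
}

for label, value in agreement.items():
    print(f"{label}: {value:.4f}")
```

For the "True" label the ratio 330/339 rounds to 0.9735, whereas the table lists 0.9734, which suggests the published figure was truncated rather than rounded.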
Method | Metric | Questionable | False | True |
---|---|---|---|---|
RNN | Precision | 0.8400 | 0.5077 | 0.6667 |
RNN | Recall | 0.7925 | 0.5156 | 0.7164 |
RNN | F1-score | 0.8155 | 0.5116 | 0.6906 |
RNN | Accuracy | 0.6962 | | |
CNN | Precision | 0.8500 | 0.6232 | 0.7941 |
CNN | Recall | 0.8019 | 0.6719 | 0.8060 |
CNN | F1-score | 0.8252 | 0.6466 | 0.8000 |
CNN | Accuracy | 0.7679 | | |
fastText | Precision | 0.8208 | 0.6833 | 0.7183 |
fastText | Recall | 0.8208 | 0.6406 | 0.7612 |
fastText | F1-score | 0.8208 | 0.6613 | 0.7391 |
fastText | Accuracy | 0.7553 | | |
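
The F1-scores in the table above can be cross-checked against the reported precision and recall using the standard harmonic-mean definition, F1 = 2PR / (P + R). The sketch below does this for every method and label. Because the inputs are themselves rounded to four decimals, agreement is expected only to within about 0.001. The helper `f1_score` and the `RESULTS` layout are illustrative assumptions, not the authors' evaluation code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) per method and label, from the table above.
RESULTS = {
    "RNN": {
        "Questionable": (0.8400, 0.7925, 0.8155),
        "False": (0.5077, 0.5156, 0.5116),
        "True": (0.6667, 0.7164, 0.6906),
    },
    "CNN": {
        "Questionable": (0.8500, 0.8019, 0.8252),
        "False": (0.6232, 0.6719, 0.6466),
        "True": (0.7941, 0.8060, 0.8000),
    },
    "fastText": {
        "Questionable": (0.8208, 0.8208, 0.8208),
        "False": (0.6833, 0.6406, 0.6613),
        "True": (0.7183, 0.7612, 0.7391),
    },
}

# Largest gap between the recomputed and the reported F1 over all cells.
max_gap = 0.0
for method, per_label in RESULTS.items():
    for label, (precision, recall, reported) in per_label.items():
        recomputed = f1_score(precision, recall)
        max_gap = max(max_gap, abs(recomputed - reported))
        print(f"{method:8s} {label:12s} recomputed={recomputed:.4f} reported={reported:.4f}")

print(f"largest gap: {max_gap:.4f}")
```

In every method the "False" label has the lowest F1-score, which matches the pattern visible in the table.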
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Luo, J.; Xue, R.; Hu, J.; El Baz, D. Combating the Infodemic: A Chinese Infodemic Dataset for Misinformation Identification. Healthcare 2021, 9, 1094. https://doi.org/10.3390/healthcare9091094