Attention-Driven Deep Learning for News-Based Prediction of Disease Outbreaks
Abstract
1. Introduction
Motivation
- A disease outbreak news dataset developed without using any disease names.
- A novel and generic framework for disease outbreak prediction using news data without bias.
- Attention-based DL models to predict disease outbreak using news data.
2. Related Work
2.1. Disease Outbreak Prediction
2.2. News Data for Disease Outbreak Prediction
2.3. Disease Outbreak Prediction Using LSTM, Bi-LSTM, and Bi-LSTM with Multi-Head Attention
2.4. Disease Outbreak Prediction Using Transformer
3. Proposed Methodology
3.1. All the News 2.0
3.2. Disease Outbreak News Articles Extracted from All the News 2.0
3.3. Pre-Processing News Data
3.4. LSTM and Bi-LSTM
3.5. The Attention Mechanism
3.6. Bi-LSTM with Multi-Head Attention
3.7. Transformer Architecture
4. Results
4.1. LSTM
4.2. Bi-LSTM
4.3. Bi-LSTM with MHA
4.4. Transformer
4.5. Attention-Based DL Model Accuracy Comparison
5. Discussion
5.1. No Benchmark Available: Generic Disease Outbreak Prediction
5.2. Context, Limitations, and Future Integration with Network-Based Models
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| AdB | AdaBoost |
| AIDS | Acquired Immunodeficiency Syndrome |
| ANN | Artificial Neural Network |
| AUC | Area Under the Curve |
| BERT | Bidirectional Encoder Representations from Transformers |
| Bi-LSTM | Bidirectional Long Short-Term Memory |
| CNN | Convolutional Neural Network |
| COVID-19 | Coronavirus Disease 2019 |
| DAnIEL | Data Analysis for Information Extraction in any Language |
| DL | Deep Learning |
| FFN | Feed-Forward Neural Network |
| FPR | False Positive Rate |
| GCN | Graph Convolutional Network |
| GCT | Granger Causal Testing |
| GNN | Graph Neural Network |
| HFMD | Hand, Foot, and Mouth Disease |
| HIV | Human Immunodeficiency Virus |
| ILI | Influenza-Like Illness |
| KNN | K-Nearest Neighbors |
| LASSO | Least Absolute Shrinkage and Selection Operator |
| LDA | Latent Dirichlet Allocation |
| LSTM | Long Short-Term Memory |
| MERS | Middle East Respiratory Syndrome |
| MHA | Multi-Head Attention |
| MLP | Multi-Layer Perceptron |
| MNB | Multinomial Naive Bayes |
| NCDC | Nigeria Centre for Disease Control and Prevention |
| NLP | Natural Language Processing |
| NN | Neural Network |
| PADI | Platform for Automated Extraction of Disease Information |
| ReLU | Rectified Linear Unit |
| RF | Random Forest |
| RNN | Recurrent Neural Network |
| ROC | Receiver Operating Characteristic |
| RSS | Really Simple Syndication |
| Seq2Seq | Sequence-to-Sequence |
| SAMOH | Saudi Arabia Ministry of Health |
| SARS | Severe Acute Respiratory Syndrome |
| SGD | Stochastic Gradient Descent |
| SIR | Susceptible, Infected, Recovered |
| TB | Tuberculosis |
| TF-IDF | Term Frequency–Inverse Document Frequency |
| TPR | True Positive Rate |
| WHO | World Health Organization |
| WHO-AFRO | WHO African Region |
| WHO-DON | WHO Disease Outbreak News |
| WHO-IHR | WHO International Health Regulations |
References
- Disease Outbreak News. Available online: https://www.who.int/emergencies/disease-outbreak-news (accessed on 19 August 2025).
- Gautam, A.S.; Raza, Z. Disease Outbreak Prediction Using Natural Language Processing: A Review. Knowl. Inf. Syst. 2024, 66, 6561–6595. [Google Scholar] [CrossRef]
- Gautam, A.S.; Raza, Z. Autoencoder and Multi-Head Attention with GRU Based Approach to Predict Disease Outbreak Using News-Crawl 2019 Data. In Proceedings of the 2024 International Conference on Computational Intelligence and Network Systems (CINS), Dubai, United Arab Emirates, 28–29 November 2024; pp. 1–7. [Google Scholar] [CrossRef]
- Vector-Borne Diseases. Available online: https://www.who.int/news-room/fact-sheets/detail/vector-borne-diseases (accessed on 19 August 2025).
- Pley, C.; Evans, M.; Lowe, R.; Montgomery, H.; Yacoub, S. Digital and Technological Innovation in Vector-Borne Disease Surveillance to Predict, Detect, and Control Climate-Driven Outbreaks. Lancet Planet. Health 2021, 5, e739–e745. [Google Scholar] [CrossRef] [PubMed]
- Coronavirus Disease (COVID-19)–World Health Organization. Available online: https://www.who.int/emergencies/diseases/novel-coronavirus-2019 (accessed on 19 August 2025).
- New Research on Deaths and Economic Impact in the First Year of the COVID-19 Pandemic. Available online: https://siepr.stanford.edu/news/new-research-deaths-and-economic-impact-first-year-covid-19-pandemic (accessed on 19 August 2025).
- Wang, W.; Gurgone, A.; Martínez, H.; Barbieri Góes, M.C.; Gallo, E.; Kerényi, Á.; Turco, E.M.; Coburger, C.; Andrade, P.D.S. COVID-19 Mortality and Economic Losses: The Role of Policies and Structural Conditions. J. Risk Financ. Manag. 2022, 15, 354. [Google Scholar] [CrossRef]
- Khotimah, P.H.; Fachrur Rozie, A.; Nugraheni, E.; Arisal, A.; Suwarningsih, W.; Purwarianti, A. Deep Learning for Dengue Fever Event Detection Using Online News. In Proceedings of the 2020 International Conference on Radar, Antenna, Microwave, Electronics, and Telecommunications (ICRAMET), Tangerang, Indonesia, 18–20 November 2020; pp. 261–266. [Google Scholar]
- Li, J.; Sia, C.-L.; Chen, Z.; Huang, W. Enhancing Influenza Epidemics Forecasting Accuracy in China with Both Official and Unofficial Online News Articles, 2019–2020. Int. J. Environ. Res. Public Health 2021, 18, 6591. [Google Scholar] [CrossRef] [PubMed]
- Zhang, Y.; Ibaraki, M.; Schwartz, F.W. Disease Surveillance Using Online News: Dengue and Zika in Tropical Countries. J. Biomed. Inform. 2020, 102, 103374. [Google Scholar] [CrossRef] [PubMed]
- Fast, S.M.; Kim, L.; Cohn, E.L.; Mekaru, S.R.; Brownstein, J.S.; Markuzon, N. Predicting Social Response to Infectious Disease Outbreaks from Internet-Based News Streams. Ann. Oper. Res. 2018, 263, 551–564. [Google Scholar] [CrossRef] [PubMed]
- Azam, N.; Tahir, B.; Mehmood, M.A. News-EDS: News Based Epidemic Disease Surveillance Using Machine Learning. In Proceedings of the 2020 14th International Conference on Open Source Systems and Technologies (ICOSST), Lahore, Pakistan, 16–17 December 2020; pp. 1–6. [Google Scholar]
- Chakraborty, S.; Subramanian, L. Extracting Signals from News Streams for Disease Outbreak Prediction. In Proceedings of the 2016 IEEE Global Conference on Signal and Information Processing (GlobalSIP), Washington, DC, USA, 7–9 December 2016; pp. 1300–1304. [Google Scholar]
- Li, Z.; Wang, B.; Li, M.; Ma, W.-Y. A Probabilistic Model for Retrospective News Event Detection. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Salvador, Brazil, 15–19 August 2005; pp. 106–113. [Google Scholar]
- PULS Project: Surveillance of Global News Media. Available online: http://puls.cs.helsinki.fi/static/index.html (accessed on 19 August 2025).
- Valentin, S.; Lancelot, R.; Roche, M. Identifying Associations between Epidemiological Entities in News Data for Animal Disease Surveillance. Artif. Intell. Agric. 2021, 5, 163–174. [Google Scholar] [CrossRef]
- Jang, B.; Kim, I.; Kim, J.W. Effective Training Data Extraction Method to Improve Influenza Outbreak Prediction from Online News Articles: Deep Learning Model Study. JMIR Med. Inform. 2021, 9, e23305. [Google Scholar] [CrossRef]
- Jahanbin, K.; Rahmanian, V. Using Twitter and Web News Mining to Predict COVID-19 Outbreak. Asian Pac. J. Trop. Med. 2020, 13, 378. [Google Scholar] [CrossRef]
- Liu, D.; Clemente, L.; Poirier, C.; Ding, X.; Chinazzi, M.; Davis, J.T.; Vespignani, A.; Santillana, M. A Machine Learning Methodology for Real-Time Forecasting of the 2019–2020 COVID-19 Outbreak Using Internet Searches, News Alerts, and Estimates from Mechanistic Models. Available online: https://www.jmir.org/2020/8/e20285/ (accessed on 19 August 2025).
- Collier, N. What’s Unusual in Online Disease Outbreak News? J. Biomed. Sem. 2010, 1, 2. [Google Scholar] [CrossRef]
- Khan, S.A.; Patel, C.O.; Kukafka, R. GODSN: Global News Driven Disease Outbreak and Surveillance. AMIA Annu. Symp. Proc. 2006, 2006, 983. [Google Scholar]
- Mele, I.; Bahrainian, S.A.; Crestani, F. Event Mining and Timeliness Analysis from Heterogeneous News Streams. Inf. Process. Manag. 2019, 56, 969–993. [Google Scholar] [CrossRef]
- Goel, R.; Valentin, S.; Delaforge, A.; Fadloun, S.; Sallaberry, A.; Roche, M.; Poncelet, P. EpidNews: Extracting, Exploring and Annotating News for Monitoring Animal Diseases. J. Comput. Lang. 2020, 56, 100936. [Google Scholar] [CrossRef]
- Ghosh, S.; Chakraborty, P.; Nsoesie, E.O.; Cohn, E.; Mekaru, S.R.; Brownstein, J.S.; Ramakrishnan, N. Temporal Topic Modeling to Assess Associations between News Trends and Infectious Disease Outbreaks. Sci. Rep. 2017, 7, 40841. [Google Scholar] [CrossRef] [PubMed]
- All the News 2-News Articles Dataset. 2019. Available online: https://components.one/datasets/all-the-news-2-news-articles-503 (accessed on 19 August 2025).
- Camacho-Collados, J.; Pilehvar, M.T. On the Role of Text Preprocessing in Neural Network Architectures: An Evaluation Study on Text Categorization and Sentiment Analysis. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Brussels, Belgium, 31 October–1 November 2018. [Google Scholar]
- Honnibal, M.; Montani, I.; Van Landeghem, S.; Boyd, A.; et al. Industrial-Strength Natural Language Processing. Available online: https://spacy.io (accessed on 19 August 2025).
- Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019. [Google Scholar]
- Graves, A. Long Short-Term Memory. In Supervised Sequence Labelling with Recurrent Neural Networks; Studies in Computational Intelligence; Springer: Berlin/Heidelberg, Germany, 2012; Volume 385, pp. 37–45. ISBN 978-3-642-24796-5. [Google Scholar]
- Bahdanau, D.; Cho, K.; Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: Long Beach, CA, USA, 2017; Volume 30, pp. 1–11. [Google Scholar]
- Bogoch, I.I.; Watts, A.; Thomas-Bachli, A.; Huber, C.; Kraemer, M.U.G.; Khan, K. Potential for Global Spread of a Novel Coronavirus from China. J. Travel. Med. 2020, 27, taaa011. [Google Scholar] [CrossRef] [PubMed]
- Fong, S.J.; Dey, N.; Chaki, J. AI-Empowered Data Analytics for Coronavirus Epidemic Monitoring and Control. In Artificial Intelligence for Coronavirus Outbreak; Springer Briefs in Applied Sciences and Technology; Springer: Singapore, 2021; pp. 47–71. ISBN 978-981-15-5935-8. [Google Scholar]
- Ten Threats to Global Health in 2019. Available online: https://www.who.int/news-room/spotlight/ten-threats-to-global-health-in-2019 (accessed on 19 August 2025).
- Amin, S.; Uddin, M.I.; Zeb, M.A.; Alarood, A.A.; Mahmoud, M.; Alkinani, M.H. Detecting Dengue/Flu Infections Based on Tweets Using LSTM and Word Embedding. IEEE Access 2020, 8, 189054–189068. [Google Scholar] [CrossRef]
- Aziz, A.; Aziz, A. Dengue Cases Prediction Using Machine Learning Approach. Irasd J. Comp. Info Tech. 2021, 2, 13–25. [Google Scholar] [CrossRef]
- Amin, S.; Irfan Uddin, M.; Ali Zeb, M.; Abdulsalam Alarood, A.; Mahmoud, M.H.; Alkinani, M. Detecting Information on the Spread of Dengue on Twitter Using Artificial Neural Networks. Comput. Mater. Contin. 2021, 67, 1317–1332. [Google Scholar] [CrossRef]
- Fung, I.C.-H.; Fu, K.-W.; Ying, Y.; Schaible, B.; Hao, Y.; Chan, C.-H.; Tse, Z.T.-H. Chinese Social Media Reaction to the MERS-CoV and Avian Influenza A(H7N9) Outbreaks. Infect. Dis. Poverty 2013, 2, 31. [Google Scholar] [CrossRef]
- Huang, Y.; Zhang, P.; Wang, Z.; Lu, Z.; Wang, Z. HFMD Cases Prediction Using Transfer One-Step-Ahead Learning. Neural Process Lett. 2023, 55, 2321–2339. [Google Scholar] [CrossRef]
- Wang, Y.; Cao, Z.; Zeng, D.; Wang, X.; Wang, Q. Using Deep Learning to Predict the Hand-Foot-and-Mouth Disease of Enterovirus A71 Subtype in Beijing from 2011 to 2018. Sci. Rep. 2020, 10, 12201. [Google Scholar] [CrossRef] [PubMed]
- Meng, D.; Xu, J.; Zhao, J. Analysis and Prediction of Hand, Foot and Mouth Disease Incidence in China Using Random Forest and XGBoost. PLoS ONE 2021, 16, e0261629. [Google Scholar] [CrossRef]
- Fung, I.C.-H.; Zeng, J.; Chan, C.-H.; Liang, H.; Yin, J.; Liu, Z.; Tse, Z.T.H.; Fu, K.-W. Twitter and Middle East Respiratory Syndrome, South Korea, 2015: A Multi-Lingual Study. Infect. Dis. Health 2018, 23, 10–16. [Google Scholar] [CrossRef]
- Lee, H. Stochastic and Spatio-Temporal Analysis of the Middle East Respiratory Syndrome Outbreak in South Korea, 2015. Infect. Dis. Model. 2019, 4, 227–238. [Google Scholar] [CrossRef]
- Balashankar, A.; Dugar, A.; Subramanian, L.; Fraiberger, S. Reconstructing the MERS Disease Outbreak from News. In Proceedings of the 2nd ACM SIGCAS Conference on Computing and Sustainable Societies, Accra, Ghana, 3–5 July 2019; pp. 272–280. [Google Scholar]
- Odlum, M.; Yoon, S. What Can We Learn about the Ebola Outbreak from Tweets? Am. J. Infect. Control 2015, 43, 563–571. [Google Scholar] [CrossRef] [PubMed]
- Joshi, A.; Sparks, R.; Karimi, S.; Yan, S.-L.J.; Chughtai, A.A.; Paris, C.; MacIntyre, C.R. Automated Monitoring of Tweets for Early Detection of the 2014 Ebola Epidemic. PLoS ONE 2020, 15, e0230322. [Google Scholar] [CrossRef]
- Park, J.; Chaffee, A.W.; Harrigan, R.J.; Schoenberg, F.P. A Non-Parametric Hawkes Model of the Spread of Ebola in West Africa. J. Appl. Stat. 2022, 49, 621–637. [Google Scholar] [CrossRef]
- Wakamiya, S.; Kawai, Y.; Aramaki, E. Twitter-Based Influenza Detection After Flu Peak via Tweets with Indirect Information: Text Mining Study. JMIR Public Health Surveill. 2018, 4, e65. [Google Scholar] [CrossRef] [PubMed]
- Nsoesie, E.O.; Oladeji, O.; Abah, A.S.A.; Ndeffo-Mbah, M.L. Forecasting Influenza-like Illness Trends in Cameroon Using Google Search Data. Sci. Rep. 2021, 11, 6713. [Google Scholar] [CrossRef]
- Kim, M.; Chae, K.; Lee, S.; Jang, H.-J.; Kim, S. Automated Classification of Online Sources for Infectious Disease Occurrences Using Machine-Learning-Based Natural Language Processing Approaches. Int. J. Environ. Res. Public Health 2020, 17, 9467. [Google Scholar] [CrossRef] [PubMed]
- Valentin, S.; Arsevska, E.; Rabatel, J.; Falala, S.; Mercier, A.; Lancelot, R.; Roche, M. PADI-Web 3.0: A New Framework for Extracting and Disseminating Fine-Grained Information from the News for Animal Disease Surveillance. One Health 2021, 13, 100357. [Google Scholar] [CrossRef] [PubMed]
- Valentin, S.; Arsevska, E.; Falala, S.; De Goër, J.; Lancelot, R.; Mercier, A.; Rabatel, J.; Roche, M. PADI-Web: A Multilingual Event-Based Surveillance System for Monitoring Animal Infectious Diseases. Comput. Electron. Agric. 2020, 169, 105163. [Google Scholar] [CrossRef]
- Jang, B.; Kim, M.; Kim, I.; Kim, J.W. EagleEye: A Worldwide Disease-Related Topic Extraction System Using a Deep Learning Based Ranking Algorithm and Internet-Sourced Data. Sensors 2021, 21, 4665. [Google Scholar] [CrossRef]
- Șerban, O.; Thapen, N.; Maginnis, B.; Hankin, C.; Foot, V. Real-Time Processing of Social Media with SENTINEL: A Syndromic Surveillance System Incorporating Deep Learning for Health Classification. Inf. Process. Manag. 2019, 56, 1166–1184. [Google Scholar] [CrossRef]
- Abbood, A.; Ullrich, A.; Busche, R.; Ghozzi, S. EventEpi—A Natural Language Processing Framework for Event-Based Surveillance. PLoS Comput. Biol. 2020, 16, e1008277. [Google Scholar] [CrossRef] [PubMed]
- EpiTator. Available online: https://github.com/ecohealthalliance/EpiTator (accessed on 19 August 2025).
- Amin, S.; Uddin, M.I.; Hassan, S.; Khan, A.; Nasser, N.; Alharbi, A.; Alyami, H. Recurrent Neural Networks with TF-IDF Embedding Technique for Detection and Classification in Tweets of Dengue Disease. IEEE Access 2020, 8, 131522–131533. [Google Scholar] [CrossRef]
- Alassafi, M.O.; Jarrah, M.; Alotaibi, R. Time Series Predicting of COVID-19 Based on Deep Learning. Neurocomputing 2022, 468, 335–344. [Google Scholar] [CrossRef]
- Nguyen, V.-H.; Tuyet-Hanh, T.T.; Mulhall, J.; Minh, H.V.; Duong, T.Q.; Chien, N.V.; Nhung, N.T.T.; Lan, V.H.; Minh, H.B.; Cuong, D.; et al. Deep Learning Models for Forecasting Dengue Fever Based on Climate Data in Vietnam. PLoS Negl. Trop. Dis. 2022, 16, e0010509. [Google Scholar] [CrossRef]
- Athanasiou, M.; Fragkozidis, G.; Zarkogianni, K.; Nikita, K.S. Long Short-Term Memory–Based Prediction of the Spread of Influenza-Like Illness Leveraging Surveillance, Weather, and Twitter Data: Model Development and Validation. J. Med. Internet Res. 2023, 25, e42519. [Google Scholar] [CrossRef]
- Zhu, X.; Fu, B.; Yang, Y.; Ma, Y.; Hao, J.; Chen, S.; Liu, S.; Li, T.; Liu, S.; Guo, W.; et al. Attention-Based Recurrent Neural Network for Influenza Epidemic Prediction. BMC Bioinform. 2019, 20, 575. [Google Scholar] [CrossRef]
- Kim, Y.; Park, C.-R.; Ahn, J.-P.; Jang, B. COVID-19 Outbreak Prediction Using Seq2Seq + Attention and Word2Vec Keyword Time Series Data. PLoS ONE 2023, 18, e0284298. [Google Scholar] [CrossRef] [PubMed]
- Dai, S.; Han, L. Influenza Surveillance with Baidu Index and Attention-Based Long Short-Term Memory Model. PLoS ONE 2023, 18, e0280834. [Google Scholar] [CrossRef]
- Yang, L.; Li, G.; Yang, J.; Zhang, T.; Du, J.; Liu, T.; Zhang, X.; Han, X.; Li, W.; Ma, L.; et al. Deep-Learning Model for Influenza Prediction from Multisource Heterogeneous Data in a Megacity: Model Development and Evaluation. J. Med. Internet Res. 2023, 25, e44238. [Google Scholar] [CrossRef] [PubMed]
- Wu, N.; Green, B.; Ben, X.; O’Banion, S. Deep Transformer Models for Time Series Forecasting: The Influenza Prevalence Case. arXiv 2020. [Google Scholar] [CrossRef]
- Li, L.; Jiang, Y.; Huang, B. Long-Term Prediction for Temporal Propagation of Seasonal Influenza Using Transformer-Based Model. J. Biomed. Inform. 2021, 122, 103894. [Google Scholar] [CrossRef]
- Li, Y.; Wang, Y.; Ma, K. Integrating Transformer and GCN for COVID-19 Forecasting. Sustainability 2022, 14, 10393. [Google Scholar] [CrossRef]
- Yom-Tov, E. Ebola Data from the Internet: An Opportunity for Syndromic Surveillance or a News Event? In Proceedings of the 5th International Conference on Digital Health 2015, Florence, Italy, 18–20 May 2015; pp. 115–119. [Google Scholar]
- Choi, S.; Lee, J.; Kang, M.-G.; Min, H.; Chang, Y.-S.; Yoon, S. Large-Scale Machine Learning of Media Outlets for Understanding Public Reactions to Nation-Wide Viral Infection Outbreaks. Methods 2017, 129, 50–59. [Google Scholar] [CrossRef] [PubMed]
- McGough, S.F.; Brownstein, J.S.; Hawkins, J.B.; Santillana, M. Forecasting Zika Incidence in the 2016 Latin America Outbreak Combining Traditional Disease Surveillance with Search, Social Media, and News Report Data. PLoS Negl. Trop. Dis. 2017, 11, e0005295. [Google Scholar] [CrossRef] [PubMed]
- Tibshirani, R. Regression Shrinkage and Selection via the Lasso: A Retrospective. J. R. Stat. Soc. Ser. B Stat. Methodol. 2011, 73, 273–282. [Google Scholar] [CrossRef]
- Grubaugh, N.D.; Saraf, S.; Gangavarapu, K.; Watts, A.; Tan, A.L.; Oidtman, R.J.; Ladner, J.T.; Oliveira, G.; Matteson, N.L.; Kraemer, M.U.G.; et al. Travel Surveillance and Genomics Uncover a Hidden Zika Outbreak during the Waning Epidemic. Cell 2019, 178, 1057–1071.e11. [Google Scholar] [CrossRef]
- Yong, B.; Owen, L. Dynamical Transmission Model of MERS-CoV in Two Areas. In Proceedings of the AIP Conference Proceedings, Bandung, Indonesia, 22–23 November 2016; p. 020010. [Google Scholar]
- Granger, C.W.J. Investigating Causal Relations by Econometric Models and Cross-Spectral Methods. Econometrica 1969, 37, 424. [Google Scholar] [CrossRef]
- Ramos, J. Using TF-IDF to Determine Word Relevance in Document Queries. In Proceedings of the First Instructional Conference on Machine Learning, West New York, NJ, USA, 9–15 August 2003; Volume 242, pp. 29–48. [Google Scholar]
- Wang, Y.; Zhou, Z.; Jin, S.; Liu, D.; Lu, M. Comparisons and Selections of Features and Classifiers for Short Text Classification. IOP Conf. Ser. Mater. Sci. Eng. 2017, 261, 012018. [Google Scholar] [CrossRef]
- Bijalwan, V.; Kumar, V.; Kumari, P.; Pascual, J. KNN Based Machine Learning Approach for Text and Document Mining. Int. J. Database Theory Appl. 2014, 7, 61–70. [Google Scholar] [CrossRef]
- Pal, S.K.; Mitra, S. Multilayer Perceptron, Fuzzy Sets, and Classification. IEEE Trans. Neural Netw. 1992, 3, 683–697. [Google Scholar] [CrossRef]
- Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
- Freund, Y.; Schapire, R.E. A Decision-Theoretic Generalization of on-Line Learning and an Application to Boosting. In Computational Learning Theory; Vitányi, P., Ed.; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 1995; Volume 904, pp. 23–37. ISBN 978-3-540-59119-1. [Google Scholar]
- Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Association for Computational Linguistics: Minneapolis, MN, USA, 2019; pp. 4171–4186. [Google Scholar]
- Zhang, X.; LeCun, Y. Text Understanding from Scratch. arXiv 2016. [Google Scholar] [CrossRef]
- Bojanowski, P.; Grave, E.; Joulin, A.; Mikolov, T. Enriching Word Vectors with Subword Information. Trans. Assoc. Comput. Linguist. 2017, 5, 135–146. [Google Scholar] [CrossRef]
- International Health Regulations (2005) (IHR). Available online: https://www.who.int/teams/ihr (accessed on 19 August 2025).
- Weekly Bulletin on Outbreak and Other Emergencies: Week 29: 14–20 July 2025. Available online: https://www.afro.who.int/countries/democratic-republic-of-congo/publication/weekly-bulletin-outbreak-and-other-emergencies-week-29-14-20-july-2025 (accessed on 19 August 2025).
- Nigeria Centre for Disease Control and Prevention. Available online: https://ncdc.gov.ng/ (accessed on 19 August 2025).
- Meng, Z.; Okhmatovskaia, A.; Polleri, M.; Shen, Y.; Powell, G.; Fu, Z.; Ganser, I.; Zhang, M.; King, N.B.; Buckeridge, D.; et al. BioCaster in 2021: Automatic Disease Outbreaks Detection from Global News Media. Bioinformatics 2022, 38, 4446–4448. [Google Scholar] [CrossRef] [PubMed]
- Liu, F.; Shareghi, E.; Meng, Z.; Basaldella, M.; Collier, N. Self-Alignment Pretraining for Biomedical Entity Representations. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Virtual Meeting, 6–11 June 2021. [Google Scholar] [CrossRef]
- Mutuvi, S.; Boros, E.; Doucet, A.; Lejeune, G.; Jatowt, A.; Odeo, M. Multilingual Epidemic Event Extraction. In Towards Open and Trustworthy Digital Societies; Ke, H.-R., Lee, C.S., Sugiyama, K., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2021; Volume 13133, pp. 139–156. ISBN 978-3-030-91668-8. [Google Scholar]
- Conneau, A.; Khandelwal, K.; Goyal, N.; Chaudhary, V.; Wenzek, G.; Guzmán, F.; Grave, E.; Ott, M.; Zettlemoyer, L.; Stoyanov, V. Unsupervised Cross-Lingual Representation Learning at Scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Virtual Meeting, 5–10 July 2020. [Google Scholar] [CrossRef]
- Mutuvi, S.; Doucet, A.; Lejeune, G.; Odeo, M. A Dataset for Multilingual Epidemiological Event Extraction. In Proceedings of the 12th Conference on Language Resources and Evaluation, Marseille, France, 11–16 May 2020; pp. 4139–4144. [Google Scholar]
- Lejeune, G. Daniel_corpus: A Corpus for Evaluating Multilingual Epidemic Surveillance Systems (2089 Annotated Documents in 5 Languages). 2013. Available online: https://aclanthology.org/2021.ranlp-1.138.pdf (accessed on 19 August 2025).
- Menya, E.; Roche, M.; Interdonato, R.; Owuor, D. Enriching Epidemiological Thematic Features for Disease Surveillance Corpora Classification. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, Marseille, France, 20–25 June 2022; European Language Resources Association: Marseille, France, 2022; pp. 3741–3750. [Google Scholar]
- Parekh, T.; Mac, A.; Yu, J.; Dong, Y.; Shahriar, S.; Liu, B.; Yang, E.; Huang, K.-H.; Wang, W.; Peng, N.; et al. Event Detection from Social Media for Epidemic Prediction. arXiv 2024. [Google Scholar] [CrossRef]
- Wadden, D.; Wennberg, U.; Luan, Y.; Hajishirzi, H. Entity, Relation, and Event Extraction with Contextualized Span Representations. arXiv 2019. [Google Scholar] [CrossRef]
- Du, X.; Cardie, C. Event Extraction by Answering (Almost) Natural Questions. arXiv 2020. [Google Scholar] [CrossRef]
- Hsu, I.-H.; Huang, K.-H.; Boschee, E.; Miller, S.; Natarajan, P.; Chang, K.-W.; Peng, N. DEGREE: A Data-Efficient Generation-Based Event Extraction Model. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Seattle, WA, USA, 10–15 July 2022. [Google Scholar] [CrossRef]
- Hsu, I.-H.; Huang, K.-H.; Zhang, S.; Cheng, W.; Natarajan, P.; Chang, K.-W.; Peng, N. TAGPRIME: A Unified Framework for Relational Structure Extraction. arXiv 2022. [Google Scholar] [CrossRef]
- Shi, B.; Huang, W.; Dang, Y.; Zhou, W. Leveraging Social Media Data for Pandemic Detection and Prediction. Humanit. Soc. Sci. Commun. 2024, 11, 1075. [Google Scholar] [CrossRef]
- Cui, Y.; Che, W.; Liu, T.; Qin, B.; Wang, S.; Hu, G. Revisiting Pre-Trained Models for Chinese Natural Language Processing. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2020, Virtual Meeting, 16–20 November 2020; Association for Computational Linguistics: Online, 2020; pp. 657–668. [Google Scholar]
- Keras 3: Deep Learning for Humans. Available online: https://github.com/fchollet/keras (accessed on 19 August 2025).
- Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; et al. TensorFlow: A System for Large-Scale Machine Learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), Savannah, GA, USA, 2–4 November 2016; USENIX Association: Savannah, GA, USA, 2016; pp. 265–283. [Google Scholar]
- Tan, C.W.; Yu, P.-D.; Chen, S.; Poor, H.V. DeepTrace: Learning to Optimize Contact Tracing in Epidemic Networks with Graph Neural Networks. IEEE Trans. Signal Inf. Process. over Networks 2025, 11, 97–113. [Google Scholar] [CrossRef]
- Tan, C.W.; Yu, P.-D. Contagion Source Detection in Epidemic and Infodemic Outbreaks: Mathematical Analysis and Network Algorithms. Found. Trends® Netw. 2023, 13, 106–251. [Google Scholar] [CrossRef]

| Ref. | Year | Methods | Dataset | Specific Disease Outbreak | Limitations |
|---|---|---|---|---|---|
| [69] | 2015 | Spearman correlation | Bing queries, tweets, news articles, WHO Ebola case data | Ebola | Specific to Ebola, NLP methods and Clustering can be used |
| [14] | 2016 | Supervised non-negative matrix factorization | English newspaper | Flu, Malaria, Dengue, Diarrhea, and TB | Manually labeled data |
| [70] | 2017 | LDA and Word embedding | Korean news articles | MERS | Specific to MERS |
| [71] | 2017 | Multivariable model and LASSO regression [72] | Google Zika search data, tweets and HealthMap news data | Zika | Specific to Zika, NLP methods can be used |
| [12] | 2018 | Hill-climbing greedy search with Bayesian Information Criterion | HealthMap news | Ebola | Specific to 16 disease outbreaks |
| [73] | 2019 | Bayesian model and mean posterior approximations | Zika case data; Zika and Dengue news articles of Cuba | Zika and Dengue | Specific to Zika and Dengue |
| [45] | 2019 | SIR [74] model with GCT [75] | GDELT news event dataset | MERS | Specific to MERS, NLP methods can be used |
| [11] | 2020 | Clustering, Louvain Modularity, and NLP methods | LexisNexis database | Dengue and Zika | Specific to Dengue and Zika |
| [13] | 2020 | TF-IDF [76,77] with SGD, MNB, KNN [78], MLP [79], RF [80], AdB [81], and BERT [82] embeddings with CNN | Pakistan media news | Hepatitis, HIV/AIDS, Influenza, Dengue, and Malaria | Manually labeled data |
| [51] | 2020 | CNN [83], Bi-LSTM [84] | WHO-DON, WHO-IHR [85], WHO-AFRO [86], NCDC [87], and SAMOH. | Data on 100 disease outbreaks | Manually labeled and characterized |
| [9] | 2020 | MLP, CNN and LSTM | Twitter datasets in English language. Indonesian Dengue news | Dengue | Manually labeled data and specific to dengue |
| [88] | 2021 | PubMedBERT and SapBERT [89] | Google and RSS news feeds. Translating the various news documents from 10 languages into English. | | Manually labeled data and rule-based approach outbreak event extraction |
| [90] | 2021 | BERT-multilingual-cased and uncased and semi-supervised learning [82,91] | DAnIEL News Dataset [92,93]. Analysis of texts in six languages. | Specific to various disease outbreaks | Manual annotations at token-level and detection of disease names and location, not prediction of disease outbreak |
| [94] | 2022 | EpidBioBERT | PADI-Web corpus | Animal disease outbreak | Specific to animal disease outbreak |
| [65] | 2023 | Multi-Attention LSTM | ILI cases, Baidu search engines, climate, and demography data | ILI | Specific to ILI |
| [95] | 2024 | DyGIE++ [96], BERT-QA [97], DEGREE [98], TagPrime [99] | Twitter dataset SPEED with human-annotated events focused on the COVID-19 pandemic | Monkeypox, Zika, Dengue, et al. | Early warning of any impending epidemic |
| [100] | 2024 | BERT (Chinese-RoBERTa-wwm-ext-large) [101] | CCIR 2020 | COVID-19 | Leveraging social media data for pandemic monitoring and forecasting |

| Parameters | LSTM | Bi-LSTM | Bi-LSTM with MHA | Transformer |
|---|---|---|---|---|
| batch_size | 128 | 128 | 128 | 128 |
| n_epochs | 20 | 20 | 20 | 20 |
| max_seq_len | 100 | 100 | 100 | 100 |
| learning_rate | | | | |
| embedding_dim | 128 | 128 | 128 | 128 |
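
The shared hyperparameters in the table above can be captured in one configuration object. The following is a minimal sketch, not the authors' implementation: the names `TrainConfig`, `CONFIGS`, and `pad_or_truncate` are hypothetical, the learning rate is left unset because the table gives no value for it, and padding at the end with a pad id of 0 is an assumption about how `max_seq_len` might be enforced.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters listed in the table; identical for all four models."""
    batch_size: int = 128
    n_epochs: int = 20
    max_seq_len: int = 100
    embedding_dim: int = 128
    # The table gives no learning-rate value, so it must be supplied explicitly.
    learning_rate: Optional[float] = None


MODELS = ("LSTM", "Bi-LSTM", "Bi-LSTM with MHA", "Transformer")
CONFIGS = {name: TrainConfig() for name in MODELS}


def pad_or_truncate(token_ids: Iterable[int], max_len: int, pad_id: int = 0) -> List[int]:
    """Clip a token-id sequence to max_len, then right-pad it with pad_id (assumed scheme)."""
    clipped = list(token_ids)[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))
```

For example, `pad_or_truncate([5, 9, 2], CONFIGS["LSTM"].max_seq_len)` yields a list of length 100 whose first three entries are the original ids.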

| Model | Precision (%) | Recall (%) | F1 Score (%) | Accuracy (%) |
|---|---|---|---|---|
| LSTM | 89.67 | 94.09 | 91.50 | 91.60 |
| Bi-LSTM | 89.71 | 96.28 | 93.00 | 92.59 |
| Bi-LSTM with MHA | 97.45 | 99.11 | 98.00 | 98.25 |
| Transformer | 97.45 | 96.29 | 93.00 | 92.76 |
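
As a reading aid for the table above, the F1 score is conventionally the harmonic mean of precision and recall. The sketch below assumes that standard definition; the helper name `f1_score` is hypothetical, and the reported values may have been rounded or averaged differently, so exact agreement with the table is not expected.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both on the same scale, e.g. percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall taken from the Bi-LSTM with MHA row of the table.
f1_mha = f1_score(97.45, 99.11)
print(f"Bi-LSTM with MHA F1 from precision/recall: {f1_mha:.2f}%")
```

For the Bi-LSTM with MHA row this gives roughly 98.27%, close to the 98.00% reported.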

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Gautam, A.S.; Raza, Z.; Lapina, M.; Babenko, M. Attention-Driven Deep Learning for News-Based Prediction of Disease Outbreaks. Big Data Cogn. Comput. 2025, 9, 291. https://doi.org/10.3390/bdcc9110291

