Article

Attention-Driven Deep Learning for News-Based Prediction of Disease Outbreaks

by
Avneet Singh Gautam
1,
Zahid Raza
1,
Maria Lapina
2,3 and
Mikhail Babenko
2,3,*
1
School of Computer and Systems Sciences, Jawaharlal Nehru University, New Delhi 110067, India
2
Research Center for Trusted Artificial Intelligence, Ivannikov Institute for System Programming of the Russian Academy of Science, 109004 Moscow, Russia
3
Department of Computational Mathematics and Cybernetics, North-Caucasus Federal University, 355017 Stavropol, Russia
*
Author to whom correspondence should be addressed.
Big Data Cogn. Comput. 2025, 9(11), 291; https://doi.org/10.3390/bdcc9110291
Submission received: 25 September 2025 / Revised: 6 November 2025 / Accepted: 10 November 2025 / Published: 14 November 2025

Abstract

Natural Language Processing (NLP) is increasingly being used to predict disease outbreaks from news data. However, the available research focuses on predicting outbreaks of specific diseases, such as COVID-19, Zika, SARS, MERS, and Ebola, using disease-specific data. To address the challenge of disease outbreak prediction without relying on prior knowledge or introducing bias, this research proposes a model that leverages a news dataset devoid of specific disease names. This approach ensures generalizability and domain independence in identifying potential outbreaks. To facilitate supervised learning, spaCy was employed to annotate the dataset, enabling the classification of articles as either related or unrelated to disease outbreaks. LSTM, Bi-LSTM, Bi-LSTM with a Multi-Head Attention mechanism, and transformer models have been used and compared for classification. Experimental results show good prediction accuracy on the test dataset for Bi-LSTM with Multi-Head Attention and for the transformer. The work serves as a proactive and unbiased approach to predicting any disease outbreak without being specific to a particular disease.

1. Introduction

Disease outbreak prediction refers to a system’s ability to anticipate and alert about a potential outbreak before it occurs. The World Health Organization (WHO) defines a disease outbreak as an abrupt increase in disease cases, regardless of whether the disease is known or previously unidentified [1,2,3].
Each year, over 700,000 deaths are attributed to diseases such as Dengue, Malaria, Chikungunya, Yellow Fever, Schistosomiasis, Rift Valley Fever, Leishmaniasis, and Japanese Encephalitis [4,5]. Owing to factors such as pollution, adulteration, and many other unknown factors, any disease and illness can turn into an outbreak if the conditions are favorable. Predicting disease outbreaks earlier can save human lives and reduce economic loss. In 2020, the COVID-19 outbreak had a global impact, resulting in widespread loss of life and causing severe disruptions or damage to the economies of numerous countries [6,7,8].
When any disease outbreak occurs, it soon becomes a news event as media agencies start to cover the disease outbreak with news headlines. Predicting disease outbreaks using news data and Natural Language Processing (NLP) techniques is a challenging task. Various researchers have used specific disease names to curate datasets, but all works are limited to only a specific disease outbreak prediction [9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25]. Therefore, it is very important to have a mechanism capable of identifying a disease outbreak that is not based on the searches corresponding to a specific disease and is able to detect a disease outbreak from multiple sources.
To address this problem, this research proposes an approach in which disease outbreak prediction is performed using news data from the All the News 2.0 dataset [26], from which a dataset is created using disease outbreak-related keywords. These keywords are used to extract outbreak-related news without searching for any specific disease name, followed by the application of text pre-processing techniques [27] to remove any bias.
Later, spaCy [28] is used to label each news article as disease_outbreak_news or no_disease_outbreak_news. For news embedding, the work uses the pre-trained Sentence-BERT model ‘all-MiniLM-L6-v2’ [29]. Finally, each article is classified as disease outbreak news or otherwise. The work uses Long Short-Term Memory (LSTM) [30], Bi-LSTM (Bidirectional LSTM), Bi-LSTM with a Multi-Head Attention (MHA) mechanism [31], and a transformer [32] for classification. The methodology is explained in detail in Section 3.

Motivation

Based on the extant literature known to the authors, there is currently no system capable of predicting disease outbreaks without relying on disease names; even if such a system exists, it typically depends on resource-intensive processes such as manual annotation and accurate data labeling. This research work aims to bridge this gap and focuses on developing a system that can predict disease outbreaks using news data without using any disease names. This enables the model to focus on the relevance of the disease outbreak news with minimal human intervention, rather than focusing on the disease names involved in an outbreak. As mentioned earlier, most of the available systems in the domain can only predict specific disease outbreaks such as COVID-19, Middle East Respiratory Syndrome (MERS), Ebola, Zika, and Severe Acute Respiratory Syndrome (SARS). Accordingly, these systems have a mandate limited to those specific disease outbreak predictions. Further, these systems depend on datasets containing specific disease outbreak information, which can only support prediction of that specific disease outbreak. This work is novel in the sense that a dataset is developed using generic disease outbreak keywords rather than disease names. This is followed by attention-based deep learning (DL) models to predict disease outbreaks using news data.
The contributions of this research work are summarized as follows:
  • A disease outbreak news dataset developed without using any disease names.
  • A novel and generic framework for disease outbreak prediction using news data without bias.
  • Attention-based DL models to predict disease outbreaks using news data.
The organization of the work is outlined as follows. The current section presents an introduction to disease outbreaks with definitions, objectives, motivation, and research contributions. Section 2 presents the prior literature research in the disease outbreak domain. Section 3 presents the proposed methodology used in the work. Section 4 presents the results obtained using various DL models considered in the work. Section 5 presents a detailed discussion of the results while analyzing the same. The work concludes with Section 6 presenting the conclusions drawn from the work.

2. Related Work

To study the existing literature in the domain, an effort is made to cover every aspect of disease outbreak prediction from news data with LSTM, Bi-LSTM, Bi-LSTM with MHA, and the state-of-the-art transformer model. To ascertain the current state of research, a literature review was performed using the Web of Science databases. The search query comprised the terms “disease outbreak prediction” and “news” alongside the specific architectural keywords “LSTM,” “Bi-LSTM,” “Bi-LSTM with MHA,” and “transformer.”

2.1. Disease Outbreak Prediction

Disease outbreak prediction is an active area where researchers use various disease or symptom names as keywords to build datasets and use NLP techniques that restrict the scope of predicting the disease outbreak to a specific disease outbreak. A few examples of such research are COVID-19 [33,34,35], Dengue [9,36,37,38], SARS [39], HFMD [40,41,42], MERS [39,43,44,45], Ebola [46,47,48], and Influenza/Flu/ILI [49,50].

2.2. News Data for Disease Outbreak Prediction

Using news as a data source to predict disease outbreaks is observed to be limited mainly to specific diseases, since disease names and symptoms are used as keywords to search for news related to a particular disease outbreak. Various researchers have developed disease outbreak prediction systems using NLP and Artificial Neural Networks (ANNs), but their datasets limit the scope of prediction because the news covers only specific disease outbreak names or symptoms [12,13,51,52,53,54].

2.3. Disease Outbreak Prediction Using LSTM, Bi-LSTM, and Bi-LSTM with Multi-Head Attention

The study in [51] employed a Bi-LSTM model to classify 100 disease outbreak news articles using manually labeled data. A global topic mining system, EagleEye [54], integrates Term Frequency–Inverse Document Frequency (TF-IDF) with Bi-LSTM to retrieve important terms from internet-based data sources. SENTINEL [55] is a real-time surveillance system that collects and analyzes disease-related news articles from various health-focused platforms. In a similar vein, the tweet sentiment analysis approach in [17] utilizes LSTM to predict regions at high risk for epidemic spread. EventEpi [56], in conjunction with EpiTator [57], extracts named entities related to disease outbreaks and enhances influenza prediction using an LSTM-based model. Additionally, a novel method combining RNN and LSTM for detecting dengue outbreaks was realized [58]. In [59], LSTM is leveraged with Rectified Linear Unit (ReLU) activation and RNN models with tanh and sigmoid functions to forecast COVID-19 confirmed cases and deaths across Malaysia, Morocco, and Saudi Arabia.
The use of LSTM, attention-enhanced LSTM, Convolutional Neural Networks (CNNs), and a transformer to develop an accurate Dengue Fever prediction model with meteorological data in Vietnam, where attention-enhanced LSTM outperformed all other models, has been reported in [60]. LSTM was employed to investigate the combined use of Influenza-Like Illness (ILI) surveillance data, Twitter, population, and weather data for Greece to forecast ILI weekly in [61]. An attention-based RNN (multichannel LSTM) for influenza prediction in Guangzhou, China, using influenza case data and climate data, is reported in [62]. Reference [63] predicted COVID-19 using LSTM with a Seq2Seq model, drawing on data such as Google Trends for top keywords, news data from the New York Times, and Centers for Disease Control and Prevention (CDC) data. To predict influenza trends, [64] uses survey data, Baidu Index data, and ILI data provided by ILI monitoring outpost hospitals across thirty-one provinces, with attention and LSTM. Multi-attention with LSTM has been used in [65] to predict influenza trends using heterogeneous data from various sources, including ILI, climate, demography, and Baidu search engine data.

2.4. Disease Outbreak Prediction Using Transformer

Reference [66] uses transformer techniques to forecast influenza cases using state-wise and country-level USA data. Similarly, reference [67] uses influenza case data from the USA and Japan to predict trends in epidemiological data in order to prevent influenza outbreaks. Transformer models have also been developed to use data from multiple sources and capture spatial dependency. An integrated transformer with a novel dynamic positional encoding for encoding and a Graph Convolutional Network (GCN) for decoding COVID-19 data with a graph structure as a single Neural Network (NN) has been used in [68], but it suffers from unstable predictions and poor convergence. Table 1 summarizes the related and prominent research work observed according to these criteria.

3. Proposed Methodology

In the proposed work, disease outbreak news data are first extracted from the All the News 2.0 article dataset. Then, various disease outbreak-related keywords are used with spaCy to label the extracted disease outbreak data. NLP pre-processing techniques are used to clean the data, including removing duplicate news entries, if present. After pre-processing, embeddings are generated using the pre-trained Sentence-BERT model ‘all-MiniLM-L6-v2’ [29]. After labeling the data using spaCy [28], four deep learning (DL) models, viz. LSTM, Bi-LSTM, Bi-LSTM with MHA, and transformer, are used to predict whether a piece of news belongs to disease outbreak news or otherwise. The flow diagram of the proposed methodology is presented in Figure 1.

3.1. All the News 2.0

The All the News 2.0 dataset [26] consists of nearly twenty-seven lakh (2.7 million) news articles sourced from twenty-seven American news publications, covering the period from 1 January 2016 to 2 April 2020. Each entry includes metadata such as publication date, author, title, article text, URL, section, and publication name. The dataset is open source and available for research and non-commercial purposes. The authors utilized the ‘title’ column because it contains the greatest number of unique articles, viz. 2,481,262, followed by the ‘article’ column with 2,475,520 unique news articles. The total number of publications published between 2016 and 2020 is shown in Figure 2. Notably, the volume of news articles surged after the declaration of COVID-19 as a global pandemic.

3.2. Disease Outbreak News Articles Extracted from All the News 2.0

In this phase, a disease outbreak news dataset is extracted from All the News 2.0 [26]. To achieve this, the work uses various generic terms to extract news articles related to disease outbreak, which are presented below:
‘Outbreak’, ‘disease outbreak’, ‘epidemic’, ‘pandemic’, ‘infection disease’, ‘virus spreads’, ‘unusual illness’, ‘mysterious disease’, ‘new disease outbreak’, ‘quarantine’, ‘lockdown’, ‘travel restrictions’, ‘containment measures’, ‘case surge’, ‘hospitals overwhelmed’, ‘health officials warn’, ‘disease x’, ‘unknown disease outbreak’, ‘patient zero’, and ‘public health emergency’.
The spaCy library [28] was used with multiple keywords to label each extracted news article as either related to a disease outbreak or not. A total of 83,601 disease outbreak news articles were labeled using spaCy. After eliminating duplicate entries and records with missing information, the dataset was refined to a total of 75,222 articles.
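The labeling step above can be sketched in plain Python. The paper uses spaCy's matching facilities on the full keyword list; the sketch below is a minimal, pure-Python analogue with an abridged subset of the keywords, and the function name `label_headline` is illustrative rather than taken from the paper.

```python
# Minimal sketch of keyword-based labeling. The paper uses spaCy's matcher;
# this pure-Python analogue only illustrates the idea. The keyword subset is
# taken from the list above; the function name is illustrative.

OUTBREAK_KEYWORDS = [
    "outbreak", "disease outbreak", "epidemic", "pandemic",
    "virus spreads", "unusual illness", "mysterious disease",
    "quarantine", "lockdown", "travel restrictions", "case surge",
    "hospitals overwhelmed", "health officials warn", "patient zero",
    "public health emergency",
]

def label_headline(title: str) -> str:
    """Label a headline by case-insensitive keyword matching."""
    text = title.lower()
    if any(kw in text for kw in OUTBREAK_KEYWORDS):
        return "disease_outbreak_news"
    return "no_disease_outbreak_news"

print(label_headline("Health officials warn of case surge in the region"))
print(label_headline("Stock markets rally after earnings reports"))
```

A production version would use spaCy's matcher over tokenized text rather than substring search, which avoids partial-word matches.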
Figure 3 depicts the monthly distribution of disease outbreak news stories gathered from All the News 2.0 between 2016 and 2020. A significant surge in article volume is observed following the emergence of COVID-19.

3.3. Pre-Processing News Data

Pre-processing steps include the removal of stop words, special characters, line breaks, and URLs. Numerical information is converted into words, and the text is then lowercased. To generate news embeddings, the pre-trained Sentence-BERT model ‘all-MiniLM-L6-v2’ [29] is utilized. Figure 4 presents the length distribution of the disease outbreak news ‘title’ column.
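The cleaning steps above can be sketched with the standard library alone. The stop-word set below is a tiny illustrative subset (the paper does not list its stop words), and the number-to-words step is noted in a comment rather than implemented.

```python
import re

# Tiny illustrative stop-word set; the paper's actual list is not specified.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "to", "is", "as"}

def preprocess(title: str) -> str:
    """Apply the cleaning steps described above: strip URLs, line breaks,
    and special characters, drop stop words, and lowercase. The paper also
    converts numbers to words (e.g. with a num2words-style step), which is
    omitted here for brevity."""
    text = re.sub(r"https?://\S+", " ", title)      # remove URLs
    text = re.sub(r"[\r\n]+", " ", text)            # remove line breaks
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)     # remove special characters
    tokens = [t.lower() for t in text.split() if t.lower() not in STOP_WORDS]
    return " ".join(tokens)

print(preprocess("Outbreak in the city: officials respond\nhttps://example.com/story"))
# → outbreak city officials respond
```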

3.4. LSTM and Bi-LSTM

Following the pre-processing stage, LSTM [30], a specialized form of RNN, was utilized to mitigate the vanishing gradient problem while preserving long-term contextual relationships in sequences. In standard RNNs, the gradient can vanish or explode as it is propagated backward through time during training, which makes it difficult to learn long-range dependencies. LSTM employs a gating mechanism to regulate the flow of information, enabling the model to determine which information should be retained or discarded at each step. Its core components are the input gate, the forget gate, the cell state update, the cell state, and the output gate. The input gate controls how much new information should be incorporated into the cell state; it takes the current input and the preceding hidden state as inputs and generates a value between 0 and 1. The forget gate decides how much of the previous cell state should be retained or discarded; it takes the same inputs as the input gate and likewise outputs a value between 0 and 1, which governs the quantity of old information kept in the cell state. The cell state update computes the candidate cell state from the current input and the previous hidden state, using the hyperbolic tangent (tanh) function to generate a candidate value that can be added to the cell state. The cell state, a memory unit, stores and updates information using the candidate cell state, the input gate, and the forget gate: the input gate controls how much of the candidate state is added, while the forget gate controls how much old information is removed. Finally, the output gate decides how much of the updated cell state is exposed as output to the hidden state.
The gates receive the current input along with the previous hidden state and produce values between 0 and 1, which determine the extent of information carried forward to the next step.
Bi-LSTM processes inputs in both forward and backward directions to capture dependencies from earlier and later positions in the sequence. In Bi-LSTM, the embedding layer converts the inputs into vector representations, which the forward LSTM layer processes while maintaining hidden states and memory cells to capture information. The forward LSTM layer is complemented by a backward LSTM layer that processes the inputs in reverse. A concatenation layer then combines the information from the forward and backward LSTMs, and an output layer produces the final Bi-LSTM output.
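The gate interactions described above can be made concrete with a single LSTM step in NumPy. This is a didactic sketch with randomly initialized parameters and illustrative dimensions, not the paper's trained model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step implementing the gates described above. W, U, b hold
    stacked parameters for the input (i), forget (f), candidate (g), and
    output (o) gates; dimensions are illustrative."""
    z = W @ x_t + U @ h_prev + b
    d = h_prev.shape[0]
    i = sigmoid(z[0:d])          # input gate: how much new info enters
    f = sigmoid(z[d:2*d])        # forget gate: how much old state is kept
    g = np.tanh(z[2*d:3*d])      # candidate cell state
    o = sigmoid(z[3*d:4*d])      # output gate: how much state is exposed
    c_t = f * c_prev + i * g     # cell state update
    h_t = o * np.tanh(c_t)       # new hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
d_in, d_h = 8, 4                                   # illustrative sizes
h, c = np.zeros(d_h), np.zeros(d_h)
W = rng.normal(size=(4 * d_h, d_in))
U = rng.normal(size=(4 * d_h, d_h))
b = np.zeros(4 * d_h)
h, c = lstm_step(rng.normal(size=d_in), h, c, W, U, b)
print(h.shape, c.shape)
```

A Bi-LSTM simply runs two such recurrences, one over the sequence and one over its reverse, and concatenates the two hidden states at each position.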

3.5. The Attention Mechanism

The attention mechanism [31] enables the model to precisely emphasize important portions of the input text while performing computations. RNNs face challenges in parsing long sequences, paving the way for the attention mechanism, which allows the model to select the significant words that are pertinent and related to other words in the input sentence while making predictions. The attention mechanism has three main components, named Query (Q), Key (K), and Value (V), as shown in Figure 5. Here, the Query denotes the existing context or the hidden state of the decoder. The Key denotes the hidden states of the input sequence, which contain the encoded representations of the input tokens. The Value denotes the data accompanying each token in the input sequence, which the model uses to focus on and update the information.
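The Q/K/V interaction above is scaled dot-product attention from [31]. The NumPy sketch below shows the computation with illustrative shapes; in self-attention, Q, K, and V all come from the same sequence.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, as in [31].
    Q, K, V: (T, d_k) arrays; shapes here are illustrative."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # query-key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(1)
Q = K = V = rng.normal(size=(5, 16))                # self-attention: Q = K = V
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)       # (5, 16)
print(w.sum(axis=-1))  # each row of attention weights sums to 1
```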

3.6. Bi-LSTM with Multi-Head Attention

Bi-LSTM with MHA is an extension of the traditional Bi-LSTM architecture that incorporates the attention mechanism, commonly used for sequence-to-sequence tasks, such as text classification and machine translation. Bi-LSTM can also be used to perform the classification task on the disease outbreak news data to classify the news using the MHA mechanism [31], which in turn allows us to focus on the important context available in the news data. Attention helps the model to accentuate the most pertinent parts of the input sequence, allowing it to give more importance to specific elements during the disease outbreak classification process.
Bi-LSTM with MHA follows the same Bi-LSTM architecture as explained in Section 3.4, whereas the input text is tokenized and every token is mapped to a fixed-size vector representation using sentence transformer embedding, which in turn are fed into the Bi-LSTM. The Bi-LSTM processes input embeddings sequentially, updating its hidden state at every step by incorporating the current input along with the preceding hidden state. The attention mechanism targets the important and relevant tokens in the input sequence for the disease outbreak news classification task, which in turn takes the hidden state from the Bi-LSTM and calculates attention weights that use the dot product between a context vector and each hidden state of the Bi-LSTM. To obtain a probability distribution, the dot product value is passed to the softmax function, which represents the importance of every token in the input sequence. To compute the weighted sum of the Bi-LSTM hidden states, attention weights are used and each hidden state is multiplied by its attention weight and then summed. This weighted sum shows the context vector, which represents the relevant portions of the input sequence for disease outbreak news classification.
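The multi-head step over the Bi-LSTM hidden states can be sketched as follows. Random matrices stand in for the learned projections W_i^Q, W_i^K, W_i^V, and W^O, and all dimensions are illustrative; this is not the paper's trained configuration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(H, num_heads, rng):
    """Sketch of MHA over Bi-LSTM hidden states H of shape (T, d_model).
    Random matrices stand in for the learned projections W_i^Q, W_i^K,
    W_i^V and the output projection W^O."""
    T, d_model = H.shape
    d_k = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
        Q, K, V = H @ Wq, H @ Wk, H @ Wv
        A = softmax(Q @ K.T / np.sqrt(d_k))   # per-head attention weights
        heads.append(A @ V)
    Wo = rng.normal(size=(num_heads * d_k, d_model))
    return np.concatenate(heads, axis=-1) @ Wo  # Concat(head_1..head_h) W^O

rng = np.random.default_rng(2)
H = rng.normal(size=(6, 32))   # e.g., 6 time steps of Bi-LSTM output
out = multi_head_attention(H, num_heads=4, rng=rng)
print(out.shape)               # (6, 32)
```

Each head can attend to a different aspect of the headline, which is the motivation for using multiple heads rather than a single attention map.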

3.7. Transformer Architecture

The transformer architecture, as presented in [32], uses the attention mechanism to attain parallelization and tackle long text input dependencies capably. Self-attention enables the transformer architecture to assess the contextual significance of each token in relation to all other tokens within the input sequence. This capability allows the model to selectively attend to the most pertinent information when generating task-specific outputs such as, in the case of this study, classifying disease outbreak-related news. The encoder and decoder, having multiple self-attention layers and a Feed-Forward Neural Network (FFNN), are the main components of the transformer architecture.
The encoder processes the input sequence, while the decoder is responsible for generating the corresponding output sequence. The input sequence embedding also includes positional encodings, which make information about each token's position available and help capture contextual information. The model then uses residual connections, which allow information to bypass a layer, together with layer normalization to stabilize training. The decoder structure is similar to that of the encoder but incorporates an extra encoder–decoder attention mechanism, which enables it to attend to and extract relevant information from the encoder's output. To keep the focus on the relevant parts of the text data, encoder–decoder attention computes attention weights between the decoder's current hidden-state query and the encoder's output representations.
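The positional encodings mentioned above can be illustrated with the sinusoidal scheme defined in [32], in which PE[t, 2i] = sin(t / 10000^{2i/d}) and PE[t, 2i+1] = cos(t / 10000^{2i/d}); the sketch below assumes an even model dimension.

```python
import numpy as np

def positional_encoding(T, d_model):
    """Sinusoidal positional encoding from [32], added to token embeddings
    so the encoder can see token order. Assumes d_model is even."""
    pos = np.arange(T)[:, None]                    # token positions 0..T-1
    i = np.arange(0, d_model, 2)[None, :]          # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((T, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dims: sine
    pe[:, 1::2] = np.cos(angles)                   # odd dims: cosine
    return pe

pe = positional_encoding(T=10, d_model=16)
print(pe.shape)   # (10, 16)
print(pe[0, :4])  # position 0: sine terms are 0, cosine terms are 1
```

Note that [68] replaces this fixed scheme with a dynamic positional encoding; the fixed sinusoidal form above is the original formulation.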

4. Results

The work uses four DL models, viz. LSTM, Bi-LSTM, Bi-LSTM with MHA, and transformer, on disease outbreak news data to realize the objectives. Table 2 presents the hyperparameters used in the above-mentioned models. Results are presented in Table 3. The authors evaluated the performance of the deep learning models using metrics such as accuracy, precision, recall, the confusion matrix, and the Receiver Operating Characteristic (ROC) curve. Accuracy measures the proportion of data points (in this research work, disease outbreak news articles) that are classified correctly. The F1 score combines precision and recall through their harmonic mean, where precision is the proportion of predicted positives that are correct and recall is the proportion of actual positives that are correctly identified. The ROC curve is employed to visualize the trade-off between the True Positive Rate (TPR) and the False Positive Rate (FPR). Bi-LSTM with MHA achieves the highest accuracy among the four deep learning models.
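The evaluation metrics used throughout Section 4 can be computed directly from the binary confusion matrix, as the short sketch below shows (function names are illustrative).

```python
def confusion_counts(y_true, y_pred):
    """Counts for the binary confusion matrix (1 = disease outbreak news)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

def metrics(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)                            # correct positives / predicted positives
    recall = tp / (tp + fn)                               # correct positives / actual positives
    f1 = 2 * precision * recall / (precision + recall)    # harmonic mean
    return accuracy, precision, recall, f1

y_true = [1, 1, 0, 0, 1, 0]  # toy labels for illustration
y_pred = [1, 0, 0, 1, 1, 0]
print(metrics(y_true, y_pred))
```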
The following sections, Section 4.1, Section 4.2, Section 4.3, Section 4.4 and Section 4.5 present the implementation of the four deep learning models in the context of the proposed work, along with an explanation of the models’ parameters. A comparison of these four DL models developed in this research work is presented in Section 4.5.

4.1. LSTM

The embedding layer is the first layer of the LSTM model; each word or token from the input news headline is represented as a dense vector capturing its semantic meaning. Following the embedding, a single LSTM layer with 64 neurons (units) is applied to process the sequence of word embeddings and learn complex, long-range dependencies within the news text. The LSTM layer captures important sequential patterns across the news headlines to assist in the binary classification task. A fully connected dense layer with a single neuron and a sigmoid activation function produces a probability score indicating whether a news headline corresponds to a disease outbreak. The model is trained with the Adam optimizer and the Binary Cross-Entropy loss function, with accuracy, precision, and recall employed as evaluation metrics. To avoid overfitting and ensure generalization, the model is validated on a separate test set. Finally, performance is assessed through accuracy and loss plots, classification reports, confusion matrix visualization, and ROC curve analysis. LSTM is chosen as the baseline model for the disease outbreak prediction task using news data.

4.2. Bi-LSTM

The first layer of the Bi-LSTM architecture is the embedding layer, where each input news headline is converted into dense 128-dimensional vector representations that capture the semantic information of the tokens. After the embedding layer, a bidirectional LSTM layer with 64 units processes input sequences in both forward and backward directions, enabling the Bi-LSTM to capture contextual dependencies from preceding and succeeding tokens within the news data. This bidirectional processing helps in learning richer, long-range patterns for better prediction. Following the bidirectional LSTM layer, a fully connected dense layer with a single neuron and a sigmoid activation function generates a likelihood score indicating whether the input news headline pertains to a disease outbreak. The Bi-LSTM model is trained using the Adam optimizer, with Binary Cross-Entropy as the loss function for the disease outbreak news classification task. Throughout training, accuracy, recall, and precision are used as evaluation metrics to assess the performance of the Bi-LSTM model. To assess and visualize the Bi-LSTM's performance, accuracy and loss curves, the confusion matrix, a classification report, and an ROC curve are generated, providing a comprehensive analysis of both the training and validation phases.

4.3. Bi-LSTM with MHA

The first layer of the Bi-LSTM with MHA architecture is the embedding layer, which transforms the input news headlines into dense word vectors capturing semantic relationships between words. Let the input sequence be
X = (x_1, x_2, \ldots, x_T), \qquad x_t \in \{1, 2, \ldots, V\}
where each token $x_t$ is mapped to an embedding vector using an embedding matrix $W_e$:
e_t = W_e x_t \in \mathbb{R}^{d_e}
yielding the embedded input matrix:
E = (e_1, e_2, \ldots, e_T) \in \mathbb{R}^{T \times d_e}
Next, a Bi-LSTM layer with 128 neurons is applied, processing the embedded sequence in both forward and backward directions:
\overrightarrow{h}_t = \mathrm{LSTM}_{\mathrm{forward}}(e_t), \qquad \overleftarrow{h}_t = \mathrm{LSTM}_{\mathrm{backward}}(e_t)
and the outputs are combined as:
h_t^{(1)} = [\overrightarrow{h}_t; \overleftarrow{h}_t] \in \mathbb{R}^{2d_h}, \qquad H^{(1)} = (h_1^{(1)}, \ldots, h_T^{(1)}) \in \mathbb{R}^{T \times 2d_h}
After the Bi-LSTM layer, an MHA mechanism is applied to let the model focus on multiple salient parts of the news sequence:
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V
where $Q = K = V = H^{(1)}$. For each of the $h$ attention heads:
\mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q}, K W_i^{K}, V W_i^{V})
and the MHA output is the concatenated result:
\mathrm{MHA}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W^{O}
To stabilize learning, a residual connection and layer normalization are applied:
Z = \mathrm{LayerNorm}(H^{(1)} + \mathrm{Dropout}(\mathrm{MHA}))
This is followed by another bidirectional LSTM layer to further process the attention-refined representation:
\overrightarrow{h}_t^{(2)}, \overleftarrow{h}_t^{(2)} = \text{Bi-LSTM}(Z), \qquad h^{(2)} = [\overrightarrow{h}_T^{(2)}; \overleftarrow{h}_1^{(2)}] \in \mathbb{R}^{2d_h}
A dropout layer is then applied:
c = \mathrm{Dropout}(h^{(2)})
followed by a fully connected dense layer with a sigmoid activation for binary classification:
\hat{y} = \sigma(W_o c + b_o), \qquad \hat{y} \in (0, 1)
The model is trained using the Binary Cross-Entropy loss function:
L = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]
Training is optimized using the Adam optimizer, and the model is evaluated using metrics such as Accuracy, Precision, and Recall:
\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}
\mathrm{Precision} = \frac{TP}{TP + FP}
\mathrm{Recall} = \frac{TP}{TP + FN}
For probabilistic evaluation, the ROC curve and AUC are calculated using the following:
\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN}, \qquad \mathrm{AUC} = \int_{0}^{1} \mathrm{TPR}(x)\, dx
Attention weight visualization was used for model interpretability and to confirm that the model focuses on outbreak-relevant information. This was achieved with an auxiliary model that shares the same input and embedding layers as the primary classification model and is configured to output the attention scores directly from the Multi-Head Attention layer. Figure 6 presents five panels for attention weight visualization: Figure 6a,b present attention heatmaps for “Disease Outbreak News”, Figure 6c,d present the visualization for “No Disease Outbreak News”, and Figure 6e presents incorrect predictions.
To visualize the attention weights, the authors used current news items as a test set. Attention scores were obtained from all attention heads and averaged across heads; the attention weight received by each token position was then summed to create a vector representing the overall importance of each term in the sequence. These scores were aligned with the original, non-padded words, normalized to a 0–1 scale, and plotted as a heatmap. The color intensity, from dark purple (low attention) to bright yellow (high attention), directly corresponds to the normalized score, clearly highlighting the key terms and phrases the model uses to classify news headlines.
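The score-aggregation steps described above can be sketched in NumPy. This is a rough reconstruction of the described procedure (average over heads, sum the attention each token position receives, min-max normalize), with illustrative shapes and a hypothetical function name.

```python
import numpy as np

def token_importance(attn_scores):
    """Aggregate MHA attention scores into one importance value per token,
    roughly following the steps described above. attn_scores has shape
    (num_heads, T, T), where each row is a softmax attention distribution."""
    avg = attn_scores.mean(axis=0)   # average across attention heads -> (T, T)
    received = avg.sum(axis=0)       # total attention each token receives -> (T,)
    lo, hi = received.min(), received.max()
    return (received - lo) / (hi - lo)  # normalize to a 0-1 scale for the heatmap

rng = np.random.default_rng(3)
raw = rng.random(size=(8, 5, 5))                # 8 heads, 5 tokens
attn = raw / raw.sum(axis=-1, keepdims=True)    # rows behave like softmax outputs
scores = token_importance(attn)
print(scores.shape)                 # (5,)
print(scores.min(), scores.max())   # 0.0 and 1.0 after normalization
```

The resulting vector, aligned with the non-padded words, is what gets rendered as the heatmap's color intensities.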

4.4. Transformer

The transformer model begins with an embedding layer that encodes tokenized disease outbreak news into dense vectors. A Multi-Head Self-Attention mechanism follows, capturing contextual and semantic relationships across tokens, aided by residual connections and layer normalization to stabilize training and avoid vanishing gradients. After self-attention, a position-wise Feed-Forward Neural Network (FFN) refines the token representations, again using residual connections and normalization. The model uses a single Transformer Encoder block with eight attention heads and a feed-forward dimension of 128, with minimal dropout to maintain learning. Global Average Pooling converts the encoded sequence into a fixed-size vector, which passes through dense layers, ending with a sigmoid output layer that predicts the probability of a headline being classified as disease outbreak news or no disease outbreak news. The transformer model was trained with Binary Cross-Entropy loss, optimized using Adam at a learning rate of 10^{-6}, and evaluated over 20 epochs using accuracy, recall, and precision metrics. Model performance was visualized through accuracy/loss curves and evaluated using a classification report, a confusion matrix, and the ROC-AUC curve.

4.5. Attention-Based DL Model Accuracy Comparison

LSTM, Bi-LSTM, Bi-LSTM with MHA, and transformer models have been developed in Keras [102] and TensorFlow [103]. As mentioned in Section 4, the Bi-LSTM with MHA model achieves the highest accuracy. The accuracy and loss of Bi-LSTM with MHA on disease outbreak news data are presented in Figure 7a,b, respectively. Figure 7c,d present the confusion matrix and ROC curve, respectively.
Table 3 summarizes the results obtained using four attention-based DL models in terms of the performance metrics on the disease outbreak news data.

5. Discussion

An active area of investigation involves the utilization of news data for predicting disease outbreaks. A notable challenge within this field is the limited scope of the currently available research, a constraint stemming from the fact that existing news datasets often concentrate on specific disease outbreaks [9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25].
As discussed in Section 1 and Section 2, the existing research applies only to the prediction of specific disease outbreaks, because the underlying news data concern those particular diseases. This research proposes a methodology that predicts disease outbreaks from news data with minimal human intervention and without reference to any disease name. This is achieved by generating a dataset from which the dependency on specific diseases is removed. The resulting model is more realistic, as it looks for disease outbreaks without bias, unlike the existing works reported in the literature. Four DL models, viz. LSTM, Bi-LSTM, Bi-LSTM with MHA, and transformer, were developed in this research using the disease outbreak news data.
All four DL-based models performed well, with accuracy ranging from 91.60% (LSTM) to 98.25% (Bi-LSTM with MHA). LSTM reported the lowest performance, owing to information flowing in only one direction. Bi-LSTM achieved better performance owing to its bidirectional information flow, reaching 92.59% accuracy, a 0.99% improvement over LSTM. The Bi-LSTM model integrated with MHA captured contextual information from the disease outbreak news most effectively, achieving an accuracy of 98.25% on the test dataset. This represents a 5.49% improvement over the transformer model, which attained 92.76% accuracy, and surpasses all three comparative deep learning models.

5.1. No Benchmark Available: Generic Disease Outbreak Prediction

Due to the absence of publicly available benchmark datasets that cover generic, non-disease-specific outbreak predictions using news data, it was not possible to directly compare the proposed models with existing state-of-the-art systems. Most prior works have utilized curated datasets focused on specific disease outbreaks such as COVID-19, Ebola, or Zika, often employing targeted features and labels. As such, those datasets and models are not directly applicable for evaluating general-purpose outbreak prediction models like those developed in this study. Consequently, the performance metrics reported for the models in this research, based on a custom-labeled dataset from the All the News 2.0 dataset, serve as preliminary baselines for future research in this underexplored direction. The lack of a standardized benchmark underscores the originality of the proposed approach and highlights the need for a broader, community-accepted dataset for generalized disease outbreak prediction from news data.

5.2. Context, Limitations, and Future Integration with Network-Based Models

The contribution of this study is intentionally focused on establishing a robust, disease-agnostic semantic classifier for detecting outbreak-related signals in unstructured news text. The authors acknowledge that this text-first approach, by design, does not capture the complex, real-world dynamics of disease spread, which are fundamentally network problems defined by spatiotemporal connectivity and population topology. For example, the DeepTrace model [104] employs GNNs to learn representations of epidemic spread networks: it optimizes contact tracing by identifying the most informative individual nodes, thereby modeling the propagation dynamics of a known outbreak to mitigate its spread.
Similarly, the monograph on Contagion Source Detection in Epidemic and Infodemic Outbreaks [105] addresses the critical problem of identifying the source, or "patient zero", of an outbreak from the network's observed state. This body of work highlights methods for tracing contagion pathways backward through a network to find the origin, a task that is non-trivial and essential for public health interventions. Our proposed approach is positioned as a foundational component for such advanced systems. Network models of this kind, while powerful, must be populated with timely and accurate data, and our semantic classifier can serve as an initial, high-precision signal-generation engine. The textual events it identifies can be time-stamped and geotagged, providing the initial data points needed to construct or update dynamic graphs such as those used by DeepTrace.
In essence, our proposed framework provides the "what" of a potential outbreak event from unstructured news data, a necessary precursor to modeling where and how an outbreak spreads spatiotemporally over a network, the question addressed by GNNs. A future multi-modal surveillance architecture would thus logically integrate our NLP-based disease outbreak prediction with a GNN-based network model. This synthesis represents a promising and comprehensive direction for developing next-generation public health warning systems capable of both early signal detection and predictive spread analysis.
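As a hedged illustration of this hand-off, the sketch below (our own, not part of the paper's implementation) shows how a high-confidence classifier detection could be packaged as a time-stamped, geotagged event record of the kind a dynamic-graph model would consume. The record fields and the `OutbreakSignal` name are assumptions for illustration.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class OutbreakSignal:
    """A single classifier detection, ready to seed a dynamic epidemic graph."""
    headline: str
    probability: float   # sigmoid output of the semantic classifier
    published_at: str    # ISO-8601 timestamp from article metadata
    location: str        # geotag extracted from the article text

def to_graph_event(headline, probability, published_at, location, threshold=0.5):
    # Only high-confidence detections are forwarded to the network model.
    if probability < threshold:
        return None
    return asdict(OutbreakSignal(headline, probability, published_at, location))

event = to_graph_event(
    "Hospitals report surge in unexplained fever cases",
    probability=0.93,
    published_at=datetime(2020, 3, 1, tzinfo=timezone.utc).isoformat(),
    location="New Delhi",
)
print(event["location"], event["probability"])  # → New Delhi 0.93
```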

6. Conclusions

Disease outbreak prediction is an important part of disaster management, and NLP techniques have proven quite useful in this domain. Although numerous studies have addressed disease outbreak prediction, the existing literature predominantly presents models tailored to specific diseases, thereby limiting their general applicability. Such methods serve the purpose only partially, as they evaluate outbreaks with an input bias. This work addresses the problem of unbiased disease outbreak prediction without being specific to any disease, with the aim of building a system that can proactively identify any outbreak from news data.
To accomplish this, the All the News 2.0 dataset, comprising news articles published between 2016 and 2020, was utilized. A set of generic disease outbreak-related keywords, excluding specific disease names, was employed to extract relevant articles, ensuring the dataset remained broad and domain-agnostic. For annotation, spaCy was used to label the articles, enabling the application of supervised learning techniques. Subsequently, four deep learning-based models were evaluated, with particular emphasis on attention mechanisms. Among these, the Bi-LSTM model integrated with MHA emerged as the most effective, demonstrating strong potential as the core component of a robust disease outbreak prediction system. A critical avenue for future research is the integration of this semantic detection framework with network-based architectures, such as Graph Neural Networks, to progress from semantic event detection to comprehensive mechanistic spatiotemporal modeling and contagion source detection, as discussed in Section 5.2.
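The keyword-based extraction step summarized above can be sketched as a simple, disease-agnostic filter. This is an illustrative reconstruction: the keywords below are placeholders, not the paper's actual keyword list, and the real pipeline additionally uses spaCy for annotation.

```python
# Illustrative, disease-agnostic keyword filter; the keywords below are
# placeholders, not the paper's actual list. No disease names appear,
# keeping the extracted dataset broad and domain-agnostic.
OUTBREAK_KEYWORDS = {
    "outbreak", "epidemic", "infection", "quarantine",
    "contagion", "virus", "pathogen", "vaccination",
}

def is_candidate(headline):
    """Keep a headline if it mentions any generic outbreak-related term."""
    tokens = {t.strip(".,!?;:\"'").lower() for t in headline.split()}
    return bool(tokens & OUTBREAK_KEYWORDS)

headlines = [
    "City declares quarantine after rapid spread of infection",
    "Stock markets rally on strong earnings reports",
]
candidates = [h for h in headlines if is_candidate(h)]
print(len(candidates))  # → 1
```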

Author Contributions

A.S.G.: Study conception and design, the literature search, methodology design, coding, result analysis, table design, validation, writing—original draft, and writing—review and editing. Z.R.: Study conception and design, methodology design, supervision, validation, and writing—review and editing. M.L.: Study conception and design, methodology design, supervision, validation, and writing—review and editing. M.B.: Study conception and design, methodology design, supervision, validation, and writing—review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by a grant provided by the Ministry of Economic Development of the Russian Federation in accordance with the subsidy agreement (agreement identifier 000000C313925P4G0002) and an agreement with the Ivannikov Institute for System Programming of the Russian Academy of Sciences dated 20 June 2025 No. 139-15-2025-011.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

These data were derived from the following resources available in the public domain: [https://components.one/datasets/all-the-news-2-news-articles-dataset (accessed on 19 August 2025)].

Conflicts of Interest

The authors have no conflicts of interest to declare.

Abbreviations

The following abbreviations are used in this manuscript:
AdB	AdaBoost
AIDS	Acquired Immunodeficiency Syndrome
ANN	Artificial Neural Networks
AUC	Area Under the Curve
BERT	Bidirectional Encoder Representations from Transformers
Bi-LSTM	Bidirectional Long Short-Term Memory
CNN	Convolutional Neural Network
COVID-19	Coronavirus Disease 2019
DAnIEL	Data Analysis for Information Extraction in any Language
DL	Deep Learning
FFN	Feed-Forward Neural Network
FPR	False Positive Rate
GCN	Graph Convolutional Network
GCT	Granger Causal Testing
GNN	Graph Neural Network
HFMD	Hand, Foot, and Mouth Disease
HIV	Human Immunodeficiency Virus
ILI	Influenza Like Illness
KNN	K-Nearest Neighbors
LASSO	Least Absolute Shrinkage and Selection Operator
LDA	Latent Dirichlet Allocation
LSTM	Long Short-Term Memory
MERS	Middle East Respiratory Syndrome
MHA	Multi-Head Attention
MLP	Multi-Layer Perceptron
MNB	Multinomial Naive Bayes
NCDC	Nigeria Centre for Disease Control and Prevention
NLP	Natural Language Processing
NN	Neural Network
PADI	Platform for Automated Extraction of Disease Information
ReLU	Rectified Linear Unit
RF	Random Forest
RNN	Recurrent Neural Network
ROC	Receiver Operating Characteristic
RSS	Really Simple Syndication
Seq2Seq	Sequence-to-Sequence
SAMOH	Saudi Arabia Ministry of Health
SARS	Severe Acute Respiratory Syndrome
SGD	Stochastic Gradient Descent
SIR	Susceptible, Infected, Recovered
TB	Tuberculosis
TF-IDF	Term Frequency–Inverse Document Frequency
TPR	True Positive Rate
WHO	World Health Organization
WHO-AFRO	WHO African Region
WHO-DON	WHO Disease Outbreak News
WHO-IHR	WHO International Health Regulations

References

  1. Disease Outbreak News. Available online: https://www.who.int/emergencies/disease-outbreak-news (accessed on 19 August 2025).
  2. Gautam, A.S.; Raza, Z. Disease Outbreak Prediction Using Natural Language Processing: A Review. Knowl. Inf. Syst. 2024, 66, 6561–6595. [Google Scholar] [CrossRef]
  3. Gautam, A.S.; Raza, Z. Autoencoder and Multi-Head Attention with GRU Based Approach to Predict Disease Outbreak Using News-Crawl 2019 Data. In Proceedings of the 2024 International Conference on Computational Intelligence and Network Systems (CINS), Dubai, United Arab Emirates, 28–29 November 2024; pp. 1–7. [Google Scholar] [CrossRef]
  4. Vector-Borne Diseases. Available online: https://www.who.int/news-room/fact-sheets/detail/vector-borne-diseases (accessed on 19 August 2025).
  5. Pley, C.; Evans, M.; Lowe, R.; Montgomery, H.; Yacoub, S. Digital and Technological Innovation in Vector-Borne Disease Surveillance to Predict, Detect, and Control Climate-Driven Outbreaks. Lancet Planet. Health 2021, 5, e739–e745. [Google Scholar] [CrossRef] [PubMed]
  6. Coronavirus Disease ({COVID-19})–World Health Organization. Available online: https://www.who.int/emergencies/diseases/novel-coronavirus-2019 (accessed on 19 August 2025).
  7. New Research on Deaths and Economic Impact in the First Year of the COVID-19 Pandemic. Available online: https://siepr.stanford.edu/news/new-research-deaths-and-economic-impact-first-year-covid-19-pandemic (accessed on 19 August 2025).
  8. Wang, W.; Gurgone, A.; Martínez, H.; Barbieri Góes, M.C.; Gallo, E.; Kerényi, Á.; Turco, E.M.; Coburger, C.; Andrade, P.D.S. COVID-19 Mortality and Economic Losses: The Role of Policies and Structural Conditions. JRFM 2022, 15, 354. [Google Scholar] [CrossRef]
  9. Khotimah, P.H.; Fachrur Rozie, A.; Nugraheni, E.; Arisal, A.; Suwarningsih, W.; Purwarianti, A. Deep Learning for Dengue Fever Event Detection Using Online News. In Proceedings of the 2020 International Conference on Radar, Antenna, Microwave, Electronics, and Telecommunications (ICRAMET), Tangerang, Indonesia, 18–20 November 2020; pp. 261–266. [Google Scholar]
  10. Li, J.; Sia, C.-L.; Chen, Z.; Huang, W. Enhancing Influenza Epidemics Forecasting Accuracy in China with Both Official and Unofficial Online News Articles, 2019–2020. Int. J. Environ. Res. Public Health 2021, 18, 6591. [Google Scholar] [CrossRef] [PubMed]
  11. Zhang, Y.; Ibaraki, M.; Schwartz, F.W. Disease Surveillance Using Online News: Dengue and Zika in Tropical Countries. J. Biomed. Inform. 2020, 102, 103374. [Google Scholar] [CrossRef] [PubMed]
  12. Fast, S.M.; Kim, L.; Cohn, E.L.; Mekaru, S.R.; Brownstein, J.S.; Markuzon, N. Predicting Social Response to Infectious Disease Outbreaks from Internet-Based News Streams. Ann. Oper. Res. 2018, 263, 551–564. [Google Scholar] [CrossRef] [PubMed]
  13. Azam, N.; Tahir, B.; Mehmood, M.A. News-EDS: News Based Epidemic Disease Surveillance Using Machine Learning. In Proceedings of the 2020 14th International Conference on Open Source Systems and Technologies (ICOSST), Lahore, Pakistan, 16–17 December 2020; pp. 1–6. [Google Scholar]
  14. Chakraborty, S.; Subramanian, L. Extracting Signals from News Streams for Disease Outbreak Prediction. In Proceedings of the 2016 IEEE Global Conference on Signal and Information Processing (GlobalSIP), Washington, DC, USA, 7–9 December 2016; pp. 1300–1304. [Google Scholar]
  15. Li, Z.; Wang, B.; Li, M.; Ma, W.-Y. A Probabilistic Model for Retrospective News Event Detection. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Salvador, Brazil, 15–19 August 2005; pp. 106–113. [Google Scholar]
  16. PULS Project: Surveillance of Global News Media. Available online: http://puls.cs.helsinki.fi/static/index.html (accessed on 19 August 2025).
  17. Valentin, S.; Lancelot, R.; Roche, M. Identifying Associations between Epidemiological Entities in News Data for Animal Disease Surveillance. Artif. Intell. Agric. 2021, 5, 163–174. [Google Scholar] [CrossRef]
  18. Jang, B.; Kim, I.; Kim, J.W. Effective Training Data Extraction Method to Improve Influenza Outbreak Prediction from Online News Articles: Deep Learning Model Study. JMIR Med. Inform. 2021, 9, e23305. [Google Scholar] [CrossRef]
  19. Jahanbin, K.; Rahmanian, V. Using Twitter and Web News Mining to Predict COVID-19 Outbreak. Asian Pac. J. Trop. Med. 2020, 13, 378. [Google Scholar] [CrossRef]
  20. Liu, D.; Clemente, L.; Poirier, C.; Ding, X.; Chinazzi, M.; Davis, J.T.; Vespignani, A.; Santillana, M. A Machine Learning Methodology for Real-Time Forecasting of the 2019–2020 COVID-19 Outbreak Using Internet Searches, News Alerts, and Estimates from Mechanistic Models. Available online: https://www.jmir.org/2020/8/e20285/ (accessed on 19 August 2025).
  21. Collier, N. What’s Unusual in Online Disease Outbreak News? J. Biomed. Sem. 2010, 1, 2. [Google Scholar] [CrossRef]
  22. Khan, S.A.; Patel, C.O.; Kukafka, R. GODSN: Global News Driven Disease Outbreak and Surveillance. AMIA Annu. Symp. Proc. 2006, 2006, 983. [Google Scholar]
  23. Mele, I.; Bahrainian, S.A.; Crestani, F. Event Mining and Timeliness Analysis from Heterogeneous News Streams. Inf. Process. Manag. 2019, 56, 969–993. [Google Scholar] [CrossRef]
  24. Goel, R.; Valentin, S.; Delaforge, A.; Fadloun, S.; Sallaberry, A.; Roche, M.; Poncelet, P. EpidNews: Extracting, Exploring and Annotating News for Monitoring Animal Diseases. J. Comput. Lang. 2020, 56, 100936. [Google Scholar] [CrossRef]
  25. Ghosh, S.; Chakraborty, P.; Nsoesie, E.O.; Cohn, E.; Mekaru, S.R.; Brownstein, J.S.; Ramakrishnan, N. Temporal Topic Modeling to Assess Associations between News Trends and Infectious Disease Outbreaks. Sci. Rep. 2017, 7, 40841. [Google Scholar] [CrossRef] [PubMed]
  26. All the News 2-News Articles Dataset. 2019. Available online: https://components.one/datasets/all-the-news-2-news-articles-503 (accessed on 19 August 2025).
  27. Camacho-Collados, J.; Pilehvar, M.T. On the Role of Text Preprocessing in Neural Network Architectures: An Evaluation Study on Text Categorization and Sentiment Analysis. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Brussels, Belgium, 31 October–1 November 2018. [Google Scholar]
  28. Honnibal, M.; Montani, I.; Van Landeghem, S.; Boyd, A. Others Industrial-Strength Natural Language Processing. Available online: https://spacy.io (accessed on 19 August 2025).
  29. Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Stroudsburg, PA, USA, 3–7 November 2019. [Google Scholar]
  30. Graves, A. Long Short-Term Memory. In Supervised Sequence Labelling with Recurrent Neural Networks; Studies in Computational Intelligence; Springer: Berlin/Heidelberg, Germany, 2012; Volume 385, pp. 37–45. ISBN 978-3-642-24796-5. [Google Scholar]
  31. Bahdanau, D.; Cho, K.; Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  32. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: Long Beach, CA, USA, 2017; Volume 30, pp. 1–11. [Google Scholar]
  33. Bogoch, I.I.; Watts, A.; Thomas-Bachli, A.; Huber, C.; Kraemer, M.U.G.; Khan, K. Potential for Global Spread of a Novel Coronavirus from China. J. Travel. Med. 2020, 27, taaa011. [Google Scholar] [CrossRef] [PubMed]
  34. Fong, S.J.; Dey, N.; Chaki, J. AI-Empowered Data Analytics for Coronavirus Epidemic Monitoring and Control. In Artificial Intelligence for Coronavirus Outbreak; Springer Briefs in Applied Sciences and Technology; Springer: Singapore, 2021; pp. 47–71. ISBN 978-981-15-5935-8. [Google Scholar]
  35. Ten Threats to Global Health in 2019. Available online: https://www.who.int/news-room/spotlight/ten-threats-to-global-health-in-2019 (accessed on 19 August 2025).
  36. Amin, S.; Uddin, M.I.; Zeb, M.A.; Alarood, A.A.; Mahmoud, M.; Alkinani, M.H. Detecting Dengue/Flu Infections Based on Tweets Using LSTM and Word Embedding. IEEE Access 2020, 8, 189054–189068. [Google Scholar] [CrossRef]
  37. Aziz, A.; Aziz, A. Dengue Cases Prediction Using Machine Learning Approach. Irasd J. Comp. Info Tech. 2021, 2, 13–25. [Google Scholar] [CrossRef]
  38. Amin, S.; Irfan Uddin, M.; Ali Zeb, M.; Abdulsalam Alarood, A.; Mahmoud, M.H.; Alkinani, M. Detecting Information on the Spread of Dengue on Twitter Using Artificial Neural Networks. Comput. Mater. Contin. 2021, 67, 1317–1332. [Google Scholar] [CrossRef]
  39. Fung, I.C.-H.; Fu, K.-W.; Ying, Y.; Schaible, B.; Hao, Y.; Chan, C.-H.; Tse, Z.T.-H. Chinese Social Media Reaction to the MERS-CoV and Avian Influenza A(H7N9) Outbreaks. Infect. Dis. Poverty 2013, 2, 31. [Google Scholar] [CrossRef]
  40. Huang, Y.; Zhang, P.; Wang, Z.; Lu, Z.; Wang, Z. HFMD Cases Prediction Using Transfer One-Step-Ahead Learning. Neural Process Lett. 2023, 55, 2321–2339. [Google Scholar] [CrossRef]
  41. Wang, Y.; Cao, Z.; Zeng, D.; Wang, X.; Wang, Q. Using Deep Learning to Predict the Hand-Foot-and-Mouth Disease of Enterovirus A71 Subtype in Beijing from 2011 to 2018. Sci. Rep. 2020, 10, 12201. [Google Scholar] [CrossRef] [PubMed]
  42. Meng, D.; Xu, J.; Zhao, J. Analysis and Prediction of Hand, Foot and Mouth Disease Incidence in China Using Random Forest and XGBoost. PLoS ONE 2021, 16, e0261629. [Google Scholar] [CrossRef]
  43. Fung, I.C.-H.; Zeng, J.; Chan, C.-H.; Liang, H.; Yin, J.; Liu, Z.; Tse, Z.T.H.; Fu, K.-W. Twitter and Middle East Respiratory Syndrome, South Korea, 2015: A Multi-Lingual Study. Infect. Dis. Health 2018, 23, 10–16. [Google Scholar] [CrossRef]
  44. Lee, H. Stochastic and Spatio-Temporal Analysis of the Middle East Respiratory Syndrome Outbreak in South Korea, 2015. Infect. Dis. Model. 2019, 4, 227–238. [Google Scholar] [CrossRef]
  45. Balashankar, A.; Dugar, A.; Subramanian, L.; Fraiberger, S. Reconstructing the MERS Disease Outbreak from News. In Proceedings of the 2nd ACM SIGCAS Conference on Computing and Sustainable Societies, Accra, Ghana, 3–5 July 2019; pp. 272–280. [Google Scholar]
  46. Odlum, M.; Yoon, S. What Can We Learn about the Ebola Outbreak from Tweets? Am. J. Infect. Control 2015, 43, 563–571. [Google Scholar] [CrossRef] [PubMed]
  47. Joshi, A.; Sparks, R.; Karimi, S.; Yan, S.-L.J.; Chughtai, A.A.; Paris, C.; MacIntyre, C.R. Automated Monitoring of Tweets for Early Detection of the 2014 Ebola Epidemic. PLoS ONE 2020, 15, e0230322. [Google Scholar] [CrossRef]
  48. Park, J.; Chaffee, A.W.; Harrigan, R.J.; Schoenberg, F.P. A Non-Parametric Hawkes Model of the Spread of Ebola in West Africa. J. Appl. Stat. 2022, 49, 621–637. [Google Scholar] [CrossRef]
  49. Wakamiya, S.; Kawai, Y.; Aramaki, E. Twitter-Based Influenza Detection After Flu Peak via Tweets with Indirect Information: Text Mining Study. JMIR Public Health Surveill. 2018, 4, e65. [Google Scholar] [CrossRef] [PubMed]
  50. Nsoesie, E.O.; Oladeji, O.; Abah, A.S.A.; Ndeffo-Mbah, M.L. Forecasting Influenza-like Illness Trends in Cameroon Using Google Search Data. Sci. Rep. 2021, 11, 6713. [Google Scholar] [CrossRef]
  51. Kim, M.; Chae, K.; Lee, S.; Jang, H.-J.; Kim, S. Automated Classification of Online Sources for Infectious Disease Occurrences Using Machine-Learning-Based Natural Language Processing Approaches. Int. J. Environ. Res. Public Health 2020, 17, 9467. [Google Scholar] [CrossRef] [PubMed]
  52. Valentin, S.; Arsevska, E.; Rabatel, J.; Falala, S.; Mercier, A.; Lancelot, R.; Roche, M. PADI-Web 3.0: A New Framework for Extracting and Disseminating Fine-Grained Information from the News for Animal Disease Surveillance. One Health 2021, 13, 100357. [Google Scholar] [CrossRef] [PubMed]
  53. Valentin, S.; Arsevska, E.; Falala, S.; De Goër, J.; Lancelot, R.; Mercier, A.; Rabatel, J.; Roche, M. PADI-Web: A Multilingual Event-Based Surveillance System for Monitoring Animal Infectious Diseases. Comput. Electron. Agric. 2020, 169, 105163. [Google Scholar] [CrossRef]
  54. Jang, B.; Kim, M.; Kim, I.; Kim, J.W. EagleEye: A Worldwide Disease-Related Topic Extraction System Using a Deep Learning Based Ranking Algorithm and Internet-Sourced Data. Sensors 2021, 21, 4665. [Google Scholar] [CrossRef]
  55. Șerban, O.; Thapen, N.; Maginnis, B.; Hankin, C.; Foot, V. Real-Time Processing of Social Media with SENTINEL: A Syndromic Surveillance System Incorporating Deep Learning for Health Classification. Inf. Process. Manag. 2019, 56, 1166–1184. [Google Scholar] [CrossRef]
  56. Abbood, A.; Ullrich, A.; Busche, R.; Ghozzi, S. EventEpi—A Natural Language Processing Framework for Event-Based Surveillance. PLoS Comput. Biol. 2020, 16, e1008277. [Google Scholar] [CrossRef] [PubMed]
  57. EpiTator. Available online: https://github.com/ecohealthalliance/EpiTator (accessed on 19 August 2025).
  58. Amin, S.; Uddin, M.I.; Hassan, S.; Khan, A.; Nasser, N.; Alharbi, A.; Alyami, H. Recurrent Neural Networks with TF-IDF Embedding Technique for Detection and Classification in Tweets of Dengue Disease. IEEE Access 2020, 8, 131522–131533. [Google Scholar] [CrossRef]
  59. Alassafi, M.O.; Jarrah, M.; Alotaibi, R. Time Series Predicting of COVID-19 Based on Deep Learning. Neurocomputing 2022, 468, 335–344. [Google Scholar] [CrossRef]
  60. Nguyen, V.-H.; Tuyet-Hanh, T.T.; Mulhall, J.; Minh, H.V.; Duong, T.Q.; Chien, N.V.; Nhung, N.T.T.; Lan, V.H.; Minh, H.B.; Cuong, D.; et al. Deep Learning Models for Forecasting Dengue Fever Based on Climate Data in Vietnam. PLoS Negl. Trop. Dis. 2022, 16, e0010509. [Google Scholar] [CrossRef]
  61. Athanasiou, M.; Fragkozidis, G.; Zarkogianni, K.; Nikita, K.S. Long Short-Term Memory–Based Prediction of the Spread of Influenza-Like Illness Leveraging Surveillance, Weather, and Twitter Data: Model Development and Validation. J. Med. Internet Res. 2023, 25, e42519. [Google Scholar] [CrossRef]
  62. Zhu, X.; Fu, B.; Yang, Y.; Ma, Y.; Hao, J.; Chen, S.; Liu, S.; Li, T.; Liu, S.; Guo, W.; et al. Attention-Based Recurrent Neural Network for Influenza Epidemic Prediction. BMC Bioinform. 2019, 20, 575. [Google Scholar] [CrossRef]
  63. Kim, Y.; Park, C.-R.; Ahn, J.-P.; Jang, B. COVID-19 Outbreak Prediction Using Seq2Seq + Attention and Word2Vec Keyword Time Series Data. PLoS ONE 2023, 18, e0284298. [Google Scholar] [CrossRef] [PubMed]
  64. Dai, S.; Han, L. Influenza Surveillance with Baidu Index and Attention-Based Long Short-Term Memory Model. PLoS ONE 2023, 18, e0280834. [Google Scholar] [CrossRef]
  65. Yang, L.; Li, G.; Yang, J.; Zhang, T.; Du, J.; Liu, T.; Zhang, X.; Han, X.; Li, W.; Ma, L.; et al. Deep-Learning Model for Influenza Prediction from Multisource Heterogeneous Data in a Megacity: Model Development and Evaluation. J. Med. Internet Res. 2023, 25, e44238. [Google Scholar] [CrossRef] [PubMed]
  66. Wu, N.; Green, B.; Ben, X.; O’Banion, S. Deep Transformer Models for Time Series Forecasting: The Influenza Prevalence Case. arXiv 2020. [Google Scholar] [CrossRef]
  67. Li, L.; Jiang, Y.; Huang, B. Long-Term Prediction for Temporal Propagation of Seasonal Influenza Using Transformer-Based Model. J. Biomed. Inform. 2021, 122, 103894. [Google Scholar] [CrossRef]
  68. Li, Y.; Wang, Y.; Ma, K. Integrating Transformer and GCN for COVID-19 Forecasting. Sustainability 2022, 14, 10393. [Google Scholar] [CrossRef]
  69. Yom-Tov, E. Ebola Data from the Internet: An Opportunity for Syndromic Surveillance or a News Event? In Proceedings of the 5th International Conference on Digital Health 2015, Florence, Italy, 18–20 May 2015; pp. 115–119. [Google Scholar]
  70. Choi, S.; Lee, J.; Kang, M.-G.; Min, H.; Chang, Y.-S.; Yoon, S. Large-Scale Machine Learning of Media Outlets for Understanding Public Reactions to Nation-Wide Viral Infection Outbreaks. Methods 2017, 129, 50–59. [Google Scholar] [CrossRef] [PubMed]
  71. McGough, S.F.; Brownstein, J.S.; Hawkins, J.B.; Santillana, M. Forecasting Zika Incidence in the 2016 Latin America Outbreak Combining Traditional Disease Surveillance with Search, Social Media, and News Report Data. PLoS Negl. Trop Dis. 2017, 11, e0005295. [Google Scholar] [CrossRef] [PubMed]
  72. Tibshirani, R. Regression Shrinkage and Selection via the Lasso: A Retrospective. J. R. Stat. Soc. Ser. B Stat. Methodol. 2011, 73, 273–282. [Google Scholar] [CrossRef]
  73. Grubaugh, N.D.; Saraf, S.; Gangavarapu, K.; Watts, A.; Tan, A.L.; Oidtman, R.J.; Ladner, J.T.; Oliveira, G.; Matteson, N.L.; Kraemer, M.U.G.; et al. Travel Surveillance and Genomics Uncover a Hidden Zika Outbreak during the Waning Epidemic. Cell 2019, 178, 1057–1071.e11. [Google Scholar] [CrossRef]
  74. Yong, B.; Owen, L. Dynamical Transmission Model of MERS-CoV in Two Areas. In Proceedings of the AIP Conference Proceedings, Bandung, Indonesia, 22–23 November 2016; p. 020010. [Google Scholar]
  75. Granger, C.W.J. Investigating Causal Relations by Econometric Models and Cross-Spectral Methods. Econometrica 1969, 37, 424. [Google Scholar] [CrossRef]
  76. Ramos, J. Using TF-IDF to Determine Word Relevance in Document Queries. In Proceedings of the First Instructional Conference on Machine Learning, West New York, NJ, USA, 9–15 August 2003; Volume 242, pp. 29–48. [Google Scholar]
  77. Wang, Y.; Zhou, Z.; Jin, S.; Liu, D.; Lu, M. Comparisons and Selections of Features and Classifiers for Short Text Classification. IOP Conf. Ser. Mater. Sci. Eng. 2017, 261, 012018. [Google Scholar] [CrossRef]
  78. Bijalwan, V.; Kumar, V.; Kumari, P.; Pascual, J. KNN Based Machine Learning Approach for Text and Document Mining. Int. J. Database Theory Appl. 2014, 7, 61–70. [Google Scholar] [CrossRef]
  79. Pal, S.K.; Mitra, S. Multilayer Perceptron, Fuzzy Sets, and Classification. IEEE Trans. Neural Netw. 1992, 3, 683–697. [Google Scholar] [CrossRef]
  80. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  81. Freund, Y.; Schapire, R.E. A Desicion-Theoretic Generalization of on-Line Learning and an Application to Boosting. In Computational Learning Theory; Vitányi, P., Ed.; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 1995; Volume 904, pp. 23–37. ISBN 978-3-540-59119-1. [Google Scholar]
  82. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North, Minneapolis, MN, USA, 2–7 June 2019; Association for Computational Linguistics: Minneapolis, MN, USA, 2019; pp. 4171–4186. [Google Scholar]
  83. Zhang, X.; LeCun, Y. Text Understanding from Scratch. arXiv 2016. [Google Scholar] [CrossRef]
  84. Bojanowski, P.; Grave, E.; Joulin, A.; Mikolov, T. Enriching Word Vectors with Subword Information. Trans. Assoc. Comput. Linguist. 2017, 5, 135–146. [Google Scholar] [CrossRef]
  85. International Health Regulations (2005) (IHR). Available online: https://www.who.int/teams/ihr (accessed on 19 August 2025).
  86. Weekly Bulletin on Outbreak and Other Emergencies: Week 29: 14–20 July 2025. Available online: https://www.afro.who.int/countries/democratic-republic-of-congo/publication/weekly-bulletin-outbreak-and-other-emergencies-week-29-14-20-july-2025 (accessed on 19 August 2025).
  87. Nigeria Centre for Disease Control and Prevention. Available online: https://ncdc.gov.ng/ (accessed on 19 August 2025).
  88. Meng, Z.; Okhmatovskaia, A.; Polleri, M.; Shen, Y.; Powell, G.; Fu, Z.; Ganser, I.; Zhang, M.; King, N.B.; Buckeridge, D.; et al. BioCaster in 2021: Automatic Disease Outbreaks Detection from Global News Media. Bioinformatics 2022, 38, 4446–4448. [Google Scholar] [CrossRef] [PubMed]
  89. Liu, F.; Shareghi, E.; Meng, Z.; Basaldella, M.; Collier, N. Self-Alignment Pretraining for Biomedical Entity Representations. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Virtual Meeting, 6–11 June 2021. [Google Scholar] [CrossRef]
  90. Mutuvi, S.; Boros, E.; Doucet, A.; Lejeune, G.; Jatowt, A.; Odeo, M. Multilingual Epidemic Event Extraction. In Towards Open and Trustworthy Digital Societies; Ke, H.-R., Lee, C.S., Sugiyama, K., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2021; Volume 13133, pp. 139–156. ISBN 978-3-030-91668-8. [Google Scholar]
  91. Conneau, A.; Khandelwal, K.; Goyal, N.; Chaudhary, V.; Wenzek, G.; Guzmán, F.; Grave, E.; Ott, M.; Zettlemoyer, L.; Stoyanov, V. Unsupervised Cross-Lingual Representation Learning at Scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Virtual Meeting, 5–10 July 2020. [Google Scholar] [CrossRef]
  92. Mutuvi, S.; Doucet, A.; Lejeune, G.; Odeo, M. A Dataset for Multilingual Epidemiological Event Extraction. In Proceedings of the 12th Conference on Language Resources and Evaluation, Marseille, France, 11–16 May 2020; pp. 4139–4144. [Google Scholar]
  93. Lejeune, G. Daniel_corpus: A Corpus for Evaluating Multilingual Epidemic Surveillance Systems (2089 Annotated Documents in 5 Languages). 2013. Available online: https://aclanthology.org/2021.ranlp-1.138.pdf (accessed on 19 August 2025).
  94. Menya, E.; Roche, M.; Interdonato, R.; Owuor, D. Enriching Epidemiological Thematic Features for Disease Surveillance Corpora Classification. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, Marseille, France, 20–25 June 2022; European Language Resources Association: Marseille, France, 2022; pp. 3741–3750. [Google Scholar]
  95. Parekh, T.; Mac, A.; Yu, J.; Dong, Y.; Shahriar, S.; Liu, B.; Yang, E.; Huang, K.-H.; Wang, W.; Peng, N.; et al. Event Detection from Social Media for Epidemic Prediction. arXiv 2024. [Google Scholar] [CrossRef]
  96. Wadden, D.; Wennberg, U.; Luan, Y.; Hajishirzi, H. Entity, Relation, and Event Extraction with Contextualized Span Representations. arXiv 2019. [Google Scholar] [CrossRef]
  97. Du, X.; Cardie, C. Event Extraction by Answering (Almost) Natural Questions. arXiv 2020. [Google Scholar] [CrossRef]
  98. Hsu, I.-H.; Huang, K.-H.; Boschee, E.; Miller, S.; Natarajan, P.; Chang, K.-W.; Peng, N. DEGREE: A Data-Efficient Generation-Based Event Extraction Model. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Seattle, WA, USA, 10–15 July 2022. [Google Scholar] [CrossRef]
  99. Hsu, I.-H.; Huang, K.-H.; Zhang, S.; Cheng, W.; Natarajan, P.; Chang, K.-W.; Peng, N. TAGPRIME: A Unified Framework for Relational Structure Extraction. arXiv 2022. [Google Scholar] [CrossRef]
  100. Shi, B.; Huang, W.; Dang, Y.; Zhou, W. Leveraging Social Media Data for Pandemic Detection and Prediction. Humanit. Soc. Sci. Commun. 2024, 11, 1075. [Google Scholar] [CrossRef]
  101. Cui, Y.; Che, W.; Liu, T.; Qin, B.; Wang, S.; Hu, G. Revisiting Pre-Trained Models for Chinese Natural Language Processing. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online, 16–20 November 2020; pp. 657–668. [Google Scholar]
  102. Keras 3: Deep Learning for Humans. Available online: https://github.com/fchollet/keras (accessed on 19 August 2025).
  103. Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; et al. TensorFlow: A System for Large-Scale Machine Learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), Savannah, GA, USA, 2–4 November 2016; USENIX Association: Savannah, GA, USA, 2016; pp. 265–283. [Google Scholar]
  104. Tan, C.W.; Yu, P.-D.; Chen, S.; Poor, H.V. DeepTrace: Learning to Optimize Contact Tracing in Epidemic Networks with Graph Neural Networks. IEEE Trans. Signal Inf. Process. over Networks 2025, 11, 97–113. [Google Scholar] [CrossRef]
  105. Tan, C.W.; Yu, P.-D. Contagion Source Detection in Epidemic and Infodemic Outbreaks: Mathematical Analysis and Network Algorithms. Found. Trends® Netw. 2023, 13, 106–251. [Google Scholar] [CrossRef]
Figure 1. Disease outbreak prediction using news data with attention-based models.
Figure 2. Total number of news articles published between 2016 and 2020, showing a significant increase in 2020 following the WHO’s declaration of the COVID-19 outbreak.
Figure 3. Total number of disease outbreak news articles published between 2016 and 2020, showing a marked increase in 2020 following the onset of the COVID-19 pandemic.
Figure 4. Length distribution of disease outbreak news articles.
Figure 5. Attention mechanism.
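To make the mechanism in Figure 5 concrete, the sketch below implements scaled dot-product attention in NumPy: Attention(Q, K, V) = softmax(QKᵀ/√d_k)·V. The shapes and values are toy inputs of our own choosing; in the published model this computation is presumably handled by Keras’ multi-head attention layer [102] rather than hand-written code.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)  # (seq_q, seq_k) similarity scores
    weights = softmax(scores, axis=-1)              # each row is a distribution over keys
    return weights @ V, weights

# Hypothetical toy example: 4 token positions, head dimension 8.
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
out, w = scaled_dot_product_attention(Q, K, V)
```

The attention weight matrix `w` is what the heatmaps in Figure 6 visualize: each row shows how strongly one position attends to every other position.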
Figure 6. Attention heatmaps: (a,b) correct “Disease Outbreak News” predictions; (c,d) correct “No Disease Outbreak News” predictions; (e) an incorrect prediction.
Figure 7. Performance of Bi-LSTM with Multi-Head Attention model: (a) accuracy; (b) loss; (c) confusion matrix; and (d) ROC Curve.
Table 1. Literature review summary.
Ref. | Year | Methods | Dataset | Specific Disease Outbreak | Limitations
[69] | 2015 | Spearman correlation | Bing queries, tweets, news articles, WHO Ebola case data | Ebola | Specific to Ebola; NLP methods and clustering could be used
[14] | 2016 | Supervised non-negative matrix factorization | English newspapers | Flu, Malaria, Dengue, Diarrhea, and TB | Manually labeled data
[70] | 2017 | LDA and word embeddings | Korean news articles | MERS | Specific to MERS
[71] | 2017 | Multivariable model and LASSO regression [72] | Google Zika search data, tweets, and HealthMap news data | Zika | Specific to Zika; NLP methods could be used
[12] | 2018 | Hill-climbing greedy search with Bayesian Information Criterion | HealthMap news | Ebola | Specific to 16 disease outbreaks
[73] | 2019 | Bayesian model and mean posterior approximations | Zika case data; Zika and Dengue news articles of Cuba | Zika and Dengue | Specific to Zika and Dengue
[45] | 2019 | SIR [74] model with GCT [75] | GDELT news event dataset | MERS | Specific to MERS; NLP methods could be used
[11] | 2020 | Clustering, Louvain modularity, and NLP methods | LexisNexis database | Dengue and Zika | Specific to Dengue and Zika
[13] | 2020 | TF-IDF [76,77] with SGD, MNB, KNN [78], MLP [79], RF [80], AdB [81], and BERT [82] embeddings with CNN | Pakistani media news | Hepatitis, HIV/AIDS, Influenza, Dengue, and Malaria | Manually labeled data
[51] | 2020 | CNN [83], Bi-LSTM [84] | WHO-DON, WHO-IHR [85], WHO-AFRO [86], NCDC [87], and SAMOH | Data on 100 disease outbreaks | Manually labeled and characterized
[9] | 2020 | MLP, CNN, and LSTM | Twitter datasets in English; Indonesian Dengue news | Dengue | Manually labeled data and specific to Dengue
[88] | 2021 | PubMedBERT and SapBERT [89] | Google and RSS news feeds; news documents translated from 10 languages into English | — | Manually labeled data and rule-based outbreak event extraction
[90] | 2021 | BERT-multilingual cased and uncased, and semi-supervised learning [82,91] | DAnIEL News Dataset [92,93]; analysis of texts in six languages | Various disease outbreaks | Manual token-level annotations; detects disease names and locations rather than predicting outbreaks
[94] | 2022 | EpidBioBERT | PADI-Web corpus | Animal disease outbreaks | Specific to animal disease outbreaks
[65] | 2023 | Multi-Attention LSTM | ILI cases, Baidu search engine, climate, and demography data | ILI | Specific to ILI
[95] | 2024 | DyGIE++ [96], BERT-QA [97], DEGREE [98], TagPrime [99] | SPEED Twitter dataset with human-annotated events focused on the COVID-19 pandemic | Monkeypox, Zika, Dengue, etc. | Early warning of any impending epidemic
[100] | 2024 | BERT (Chinese-roberta-wwm-ext-large) [101] | CCIR 2020 | COVID-19 | Leverages social media data for pandemic monitoring and forecasting
Times and Center for Disease Control (CDC) data were used. To predict influenza trends, [64] uses surveys, Baidu Index data, and ILI data provided by an ILI-monitoring outpost hospital in each of thirty-one provinces, together with attention and LSTM. Multi-attention with LSTM is used in [65] to predict influenza trends from heterogeneous data drawn from multiple sources: ILI, climate, demography, and Baidu search engine data.
Table 2. Parameter settings of LSTM, Bi-LSTM, Bi-LSTM with MHA, and transformer models.
Parameter | LSTM | Bi-LSTM | Bi-LSTM with MHA | Transformer
batch_size | 128 | 128 | 128 | 128
n_epochs | 20 | 20 | 20 | 20
max_seq_len | 100 | 100 | 100 | 100
learning_rate | 10⁻³ | 10⁻³ | 10⁻³ | 10⁻⁶
embedding_dim | 128 | 128 | 128 | 128
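The settings in Table 2 can be collected into a plain configuration dictionary, which makes the single deviation (the transformer’s much smaller learning rate) easy to see. The variable and key names below are our own; the paper does not publish its code in this form.

```python
# Hyperparameters as reported in Table 2; naming and layout are illustrative.
BASE = {
    "batch_size": 128,
    "n_epochs": 20,
    "max_seq_len": 100,
    "embedding_dim": 128,
    "learning_rate": 1e-3,
}

HYPERPARAMS = {
    "LSTM": dict(BASE),
    "Bi-LSTM": dict(BASE),
    "Bi-LSTM with MHA": dict(BASE),
    # Only the transformer deviates, with a learning rate of 1e-6.
    "Transformer": {**BASE, "learning_rate": 1e-6},
}
```

A dictionary like this would typically feed a model-builder function, keeping the four training runs identical except for the parameters the paper actually varies.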
Table 3. Comparison of LSTM, Bi-LSTM, Bi-LSTM with MHA, and transformer models on disease outbreak news data.
Model | Precision [%] | Recall [%] | F1 Score [%] | Accuracy [%]
LSTM | 89.67 | 94.09 | 91.50 | 91.60
Bi-LSTM | 89.71 | 96.28 | 93.00 | 92.59
Bi-LSTM with MHA | 97.45 | 99.11 | 98.00 | 98.25
Transformer | 97.45 | 96.29 | 93.00 | 92.76
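The scores in Table 3 follow the standard definitions for binary classification. The helper below computes them from confusion-matrix counts such as those shown in Figure 7c; the counts in the example are made up for illustration and are not taken from the paper’s test set.

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, F1, and accuracy (as percentages) from confusion-matrix counts."""
    precision = tp / (tp + fp)                       # fraction of positive predictions that are correct
    recall = tp / (tp + fn)                          # fraction of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {name: round(100 * value, 2)
            for name, value in {"precision": precision, "recall": recall,
                                "f1": f1, "accuracy": accuracy}.items()}

# Hypothetical counts for illustration only.
m = classification_metrics(tp=90, fp=10, fn=5, tn=95)
# → {'precision': 90.0, 'recall': 94.74, 'f1': 92.31, 'accuracy': 92.5}
```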
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Gautam, A.S.; Raza, Z.; Lapina, M.; Babenko, M. Attention-Driven Deep Learning for News-Based Prediction of Disease Outbreaks. Big Data Cogn. Comput. 2025, 9, 291. https://doi.org/10.3390/bdcc9110291
