1. Introduction
With the explosion of available data, obtaining optimal solutions to data-driven problems is becoming increasingly challenging. It has been recognized that bio-inspired intelligent algorithms are necessary for addressing highly complex problems [1]. To date, numerous algorithms inspired by nature or biological phenomena have been proposed, such as neural networks, the genetic algorithm (GA), ant colony optimization (ACO), and particle swarm optimization. They have various applications in solving engineering and biomedical problems [2,3]. Neural networks are usually defined as adaptive nonlinear data processing algorithms that combine multiple processing units connected within a network. They attempt to replicate the mechanism by which neurons operate in intelligent organisms, such as humans. The long short-term memory (LSTM) model is one of the most popular neural networks [4,5].
The prevention and control of infectious diseases is an important research topic in biomedicine. In recent years, outbreaks of infectious diseases such as influenza A (H1N1) and coronavirus disease 2019 (COVID-19) have occurred repeatedly. COVID-19 has spread across the world. Many countries adopted various forms of lockdown to reduce social contact and thus inhibit the spread of the coronavirus; this disrupted supply chains, depressed consumer demand, and put millions out of work [6,7]. Moreover, the spread of influenza has continued to show an upward trend in multiple provinces across China, with outbreaks of the influenza A (H1N1) virus reported at many schools in several areas in February 2023 [8]. Thus, it is important to model infectious diseases to predict their trends. On the one hand, outbreaks of infectious diseases harm people's health, and predicting the number of confirmed cases in advance can provide decision-making support for prevention and control. On the other hand, estimating the number of existing infected cases could help allocate medical resources, such as beds and ventilators.
New problems emerge one after another, and traditional algorithms often cannot solve them effectively. By analyzing a problem, we can design algorithms tailored to it. In the field of infectious diseases, the time interval from infection to medical diagnosis is a random variable that obeys a specific log-normal distribution, as confirmed by previous research [9]. Inspired by this biological law, the back-projection algorithm was proposed to estimate the number of infected cases. Analyzing the development law and predicting the pandemic provides useful insights to policymakers and allows them to make informed decisions on allocating limited resources, controlling outbreaks, and ensuring the safety of the general public. Various population and social factors, such as community mobility, population density, and the awareness of wearing masks, have an impact on the spread of infectious diseases. We expect that the use of multisource data will provide a meaningful avenue for modeling and forecasting.
In this paper, we first formulate a modified back-projection model inspired by the law of infectious disease transmission, and then propose a hybrid bio-inspired architecture combining modified back-projection and the recurrent neural network for pandemic prediction. The main contributions of this paper are summarized as follows:
We propose a novel hybrid bio-inspired neural network model that not only predicts the number of new daily confirmed cases, but also estimates the number of new daily infected cases.
Using the multimodal data, we design the LSTM module to estimate the time-varying infection rates in the infected–susceptible–infected (ISI) module. This is more practical and flexible compared with the common curve fitting methods.
The proposed model, BPISI-LSTM, outperforms popular epidemic prediction models on real-world datasets across prediction windows of different sizes.
The remainder of this paper is organized as follows. Section 2 outlines the related work in pandemic prediction, especially for COVID-19. Section 3 describes the framework of the proposed model and details its mathematical theory. Section 4 provides the experimental results of predicting confirmed cases using the multimodal data of three countries. Section 5 discusses the superiority of the model in estimating the number of infected cases. The conclusion is provided in Section 6.
2. Related Work
We focus on the related methods of infectious disease prediction, which are mainly divided into compartmental mathematical models, mechanistic statistical models, and deep learning models.
Compartmental mathematical models include the susceptible–infected–recovered (SIR) model and its derived models, such as the susceptible–exposed–infected–recovered (SEIR) model. These models divide the population into mutually exclusive groups and define the transitions among the groups through ordinary differential equations. Kim et al. [10] developed a novel SEIR model in which the distribution of the incubation period is approximated by the Coxian distribution. The model can adapt to a variety of realistic epidemic settings, since all types of incubation periods can be approximated by the Coxian distribution. However, several parameters need to be fitted using real epidemic data, which is a non-trivial problem. Sun et al. [11] proposed a more generalized version of the SIR model, in which the infection rate and the recovery rate both vary with time. Reciprocal regression is used to estimate the infection rate, and the recovery rate curve is fitted using the last five data points. The model was evaluated by tracking the COVID-19 epidemic in 30 provinces in China and 15 cities in Hubei province. Chen et al. [12] also derived a time-dependent SIR model that tracks the transmission and recovery rates at time t. Because of the existence of asymptomatic COVID-19 infections, they extended the model by considering two types of infected persons: detectable and undetectable. Giordano et al. [13] proposed a compartmental model that considers eight stages of infection. The model distinguishes infected individuals according to whether they have been diagnosed and the severity of their symptoms. In the long run, the prediction of the model is not very sensitive to the initial conditions, but it is sensitive to the model parameters estimated using empirical data.
Back-projection is a representative mechanistic statistical model that was developed to estimate human immunodeficiency virus (HIV) incidence using surveillance data on acquired immunodeficiency syndrome (AIDS) diagnoses [9]. Becker et al. [14] modified classical back-projection using the multiplicative method to model the age-specific relative risk of HIV infection. The smoothed expectation maximization (EM) algorithm is applied to solve the modified back-projection model. Chau et al. [15] proposed a modified back-projection based only on the number of HIV diagnoses. The model rectifies some of the shortcomings of the original back-projection method, which is based on AIDS data alone. McEwan et al. [16] applied the classical back-projection approach to estimate the number of patients living with chronic hepatitis C virus (HCV) infection in Taiwan. Moreover, they quantified the expected numbers in each of the five METAVIR fibrosis stages. Back-projection has also been used to analyze the surveillance data of COVID-19 diagnoses for different regions, such as Hong Kong [17] and Australia [18]. However, it is difficult to estimate recent infections precisely using the classical back-projection model, let alone predict the number of new daily infected cases in the future. There are two unavoidable sources of uncertainty. First, the prediction involves unknown future infection rates. Second, little is known about the recent infection rate; this is a consequence of the long and variable incubation period of the infectious disease and cannot be overcome by statistical ingenuity [14].
Neural network methods, such as long short-term memory (LSTM) [19] and the graph neural network [20,21,22], have been extensively used to predict pandemics in recent years. To predict influenza-like illness (ILI) in Guangzhou, Fu et al. [23] designed a multi-channel LSTM network to extract fused descriptors from multiple types of inputs. They further improved the prediction accuracy by adding an attention mechanism, allowing the model to handle the relationship among multiple input streams more appropriately. Deng et al. [24] designed a message-passing framework that combines learned feature embeddings and an attention matrix to model disease propagation over time. They evaluated the model on real epidemiological data and validated its effectiveness. However, the model uses only flu disease data and geographic location data, thus ignoring external features such as weather, social factors, and population migration. Tian et al. [25] proposed the COVID-Net network, which combines LSTM cells and gated recurrent unit (GRU) cells and takes five risk factors and disease-related history data as the input. Wu et al. [26] developed a deep learning framework combining the recurrent neural network (RNN), the convolutional neural network (CNN), and residual links for epidemiological predictions. In this framework, the RNN captures the long-term correlation and the CNN fuses information from different sources. The residual structure is applied to prevent overfitting. Their approach shows excellent performance on real epidemic data. These pure deep learning models are data-driven and do not incorporate any epidemic mechanism. They tend to predict the short-term trend of the epidemic well but have poor long-term prediction precision.
In this paper, we aim to overcome these limitations by combining the mechanistic model and the deep learning model. Different approaches following this idea have been proposed for several applications. For example, Zheng et al. [27] proposed a hybrid artificial intelligence (AI) model for COVID-19 prediction that includes a susceptible–infected module, an LSTM module, and a natural language processing (NLP) module. In addition to infectious disease data, the hybrid model takes prevention and control measures and related news reports as input, thereby accounting for the effects of these measures. Gatta et al. [28] proposed a novel machine-learning-based framework that estimates the parameters of compartmental models, such as contact rates and recovery rates, based on static and dynamic features of places. However, these methods cannot estimate the number of infected cases. In this paper, the law of infectious disease transmission and the deep learning model are combined to predict the numbers of confirmed and infected cases.
3. Methodology
In this section, the proposed methodology for designing the hybrid model for COVID-19 pandemic prediction is presented.
3.1. Framework of the Hybrid Model
Compartmental models based on differential equations divide the population into mutually exclusive groups, define the transitions from one group to another, and predict the epidemic. One of the most extensively used compartmental models is the susceptible–exposed–infected–recovered (SEIR) model, which does not distinguish between confirmed cases and infected cases. In practice, the model is solved using the confirmed cases rather than the infected cases, because the infected cases are unobservable. Therefore, the number of infected cases obtained by the SEIR model is actually the number of confirmed cases. However, the estimated number of infected cases is a crucial indicator for informing policymakers and thus controlling the epidemic. Based on the retrospective method, back-projection models the transition from infected cases to confirmed ones and estimates the number of new daily infected cases. Thus, we take back-projection as the basic module of the proposed hybrid model.
Unfortunately, back-projection shares the weakness of the retrospective method: when t is the latest day, the estimates of infected cases for the most recent days up to day t are inaccurate, where the number of affected days is a constant related to the transmission capacity of the coronavirus. Because of the time lag from infection to diagnosis, estimating the infected cases for these most recent days requires information on confirmed cases in the future. Naive back-projection cannot deal with this problem.
In addition to the conversion from infection to diagnosis, a development law also exists within the infected cases. Under prevention and control measures, the newly infected cases at the current moment are infected by the newly infected cases of recent days. Under this assumption, the ISI model is proposed to calculate the infection rate and thereby revise the inaccurate estimates of new daily infected cases for the most recent days up to day t. The basic principle of the ISI model is to use the ratio of the number of newly infected cases at day t to the cumulative number of new confirmed cases over different time scales before day t to calculate the infection rate and establish an epidemic model.
The infection rate of the coronavirus varies with time. Because common functions, such as exponential functions and power functions, have limited ability to fit the data, we use the LSTM model to predict the infection rates for the most recent days up to day t. To include the impact of mobility on the spread of the pandemic, community mobility data collected by Google are used as additional input features of our LSTM module, in addition to disease-related historical information.
The output of the LSTM model, i.e., the sequence of predicted infection rates, is used in the ISI model to estimate the infected cases over the corresponding days, and then the confirmed cases are also calculated. The proposed framework is shown in Figure 1.
3.2. Back-Projection Module
Individuals infected with coronavirus will be clinically diagnosed several days later, either because they feel unwell and actively undergo testing with a nucleic acid reagent, or because the government implements a national screening policy and they are passively diagnosed. In short, by collecting the nucleic acid test data from medical institutions, the new daily confirmed cases can be calculated, while the new daily infected cases are unobserved. Back-projection based on a retrospective approach estimates the new daily infected cases up to the present, forming the basis for prediction of the infected cases. The basic principle of the retrospective approach is that the new daily diagnosed individuals come from the new daily infected individuals from previous days with a certain probability.
The model involves two daily series: the unobserved number of individuals infected with the coronavirus on day t and the observed number of COVID-19 cases diagnosed on day t. The method of back-projection is based on the following assumptions.
Infected individuals must be confirmed later, that is, death before diagnosis is not considered.
The daily numbers of newly infected individuals are assumed to be independent Poisson variables.
The time from infection to diagnosis, denoted by X, is a log-normal random variable whose distribution is the same irrespective of when the individual is infected. The parameters of this distribution are specified using the 0.95 upper quantile of the standard normal distribution.
Under Assumption 3, the probability that an infected individual is diagnosed a given number of days after infection is described by the discretized log-normal density function.
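As a minimal sketch of how a discretized log-normal delay distribution can be computed, the Python snippet below assigns to each whole-day delay d the probability that X falls in the interval from d to d + 1 and renormalizes over a truncated support. The discretization convention, the truncation at `max_delay`, the function names, and the parameter values `mu=1.6` and `sigma=0.5` are illustrative assumptions and are not the values used in this paper.

```python
import math


def lognormal_cdf(x, mu, sigma):
    """CDF of a log-normal distribution; zero for non-positive x."""
    if x <= 0:
        return 0.0
    z = (math.log(x) - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))


def discretized_lognormal(mu, sigma, max_delay):
    """Return f[0..max_delay] with f[d] proportional to P(d <= X < d + 1).

    The result is renormalized so that it sums to 1 over the truncated range.
    """
    raw = [
        lognormal_cdf(d + 1, mu, sigma) - lognormal_cdf(d, mu, sigma)
        for d in range(max_delay + 1)
    ]
    total = sum(raw)
    return [p / total for p in raw]


# Hypothetical parameters, chosen only to make the example concrete.
delay_pmf = discretized_lognormal(mu=1.6, sigma=0.5, max_delay=30)
most_likely_delay = max(range(len(delay_pmf)), key=lambda d: delay_pmf[d])
```

With these illustrative parameters, most of the probability mass falls on delays of a few days, which is the shape a delay from infection to diagnosis is expected to have.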
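Based on the above assumptions, the mean number of confirmed cases on day t is the sum, over all infection days up to day t, of the mean number of new infections on that day multiplied by the probability of the corresponding delay from infection to diagnosis.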
Assumption 2 implies that the daily numbers of confirmed cases are also independent Poisson variables. The likelihood function of the observed daily confirmed cases is therefore a product of Poisson probabilities with the means given above.
Maximizing the likelihood function with respect to the mean daily numbers of infections via the EM algorithm always leads to non-negative estimates. However, the naive EM algorithm produces an estimated sequence with large fluctuations, so we introduce smoothing in each iteration [9]. The specific steps are as follows. Let T represent today's date.
Expectation Step: The posterior expectation of the number of patients who are infected on day t and confirmed on a given later day is calculated from the observed confirmed cases and the current estimates.
Maximization Step: The estimates are updated by maximizing the expected likelihood, using the smoothed estimator of the kth iteration.
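Smooth Step: The updated estimates are smoothed by a weighted moving average over neighboring days, in which the weights are symmetric binomial weights.
When t is close to 1 or T, a subscript in the smoothing window may fall out of range. To avoid this situation, separate provisions are made for subscripts that fall below 1 and for subscripts that exceed T.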
Stopping Criterion: given a constant and the upper bound of the accepted error , the algorithm fails if .
The constant, the size of the smoothing window, and the error bound are set to fixed values. The log-likelihood function in this paper is concave, and the smoothing function in the Smooth Step is linear; thus, the EM algorithm converges and the final convergent point is unique. The proof is omitted here; please see the References section for details.
Once the estimates of new daily infected cases are obtained, the estimated number of new daily confirmed cases can be calculated from the expression for the mean number of confirmed cases.
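The expectation, maximization, and smoothing steps described in this subsection can be sketched in Python as follows. This is a hedged illustration in the style of the standard smoothed EM back-projection, not the exact procedure of this paper: the function name, the hand-specified `delay_pmf`, the fixed iteration count used in place of the stopping criterion, the default window size, and the clamping of out-of-range subscripts to the nearest valid day are all assumptions made for the example.

```python
from math import comb


def ems_back_projection(confirmed, delay_pmf, window=2, n_iter=200):
    """Estimate mean daily infections from daily confirmed counts (EMS sketch).

    confirmed[t] -- observed confirmed cases on day t, t = 0..T-1
    delay_pmf[d] -- assumed probability of diagnosis d days after infection
    window       -- even integer K; K + 1 symmetric binomial weights are used
    n_iter       -- fixed iteration count (assumption, replaces a stopping rule)
    """
    T = len(confirmed)

    def f(d):
        return delay_pmf[d] if 0 <= d < len(delay_pmf) else 0.0

    # Probability that an infection on day s is diagnosed by the last day.
    detect = [sum(f(d) for d in range(T - s)) for s in range(T)]
    weights = [comb(window, i) / 2 ** window for i in range(window + 1)]
    half = window // 2
    lam = [max(sum(confirmed) / T, 1e-9)] * T  # flat starting estimate

    for _ in range(n_iter):
        # Mean confirmed cases implied by the current estimate.
        mu = [sum(lam[s] * f(t - s) for s in range(t + 1)) for t in range(T)]
        # E and M steps: diagnoses attributed to infection day s, rescaled by
        # the probability of being diagnosed within the observation period.
        phi = []
        for s in range(T):
            attributed = sum(
                confirmed[t] * lam[s] * f(t - s) / mu[t]
                for t in range(s, T) if mu[t] > 0
            )
            phi.append(attributed / detect[s] if detect[s] > 0 else lam[s])
        # S step: binomial-weighted moving average; subscripts outside the
        # observed range are clamped to the nearest valid day (assumption).
        lam = [
            sum(w * phi[min(max(t + i - half, 0), T - 1)]
                for i, w in enumerate(weights))
            for t in range(T)
        ]
    return lam


# Synthetic check: a triangular infection curve that peaks on day 10.
delay_pmf = [0.0, 0.1, 0.2, 0.4, 0.2, 0.1]  # hypothetical delay distribution
true_infections = [max(0, 100 - 10 * abs(t - 10)) for t in range(30)]
confirmed = [
    sum(true_infections[s] * delay_pmf[t - s]
        for s in range(t + 1) if t - s < len(delay_pmf))
    for t in range(30)
]
estimated = ems_back_projection(confirmed, delay_pmf)
peak_estimated = max(range(30), key=lambda t: estimated[t])
peak_confirmed = max(range(30), key=lambda t: confirmed[t])
```

On this synthetic series the estimated infection curve peaks earlier than the confirmed curve, as expected from the delay between infection and diagnosis. The estimates for the last few days rest on very little information, which is the weakness of back-projection noted in Section 3.1.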
3.3. Infected–Susceptible–Infected Module
Individuals infected with the coronavirus will spread the virus to those who are susceptible through social contact. Since the infected individuals will show abnormal symptoms, such as fever, dry cough, fatigue, etc., they will eventually accept the nucleic acid reagent test and be diagnosed.
The observation period of COVID-19 is 14 days, so we assume that the maximum length of time for an infected individual from being infected to no longer spreading the virus is 14 days, that is, all new daily infected cases are infected by patients infected in the past 14 days.
Most people under epidemiological investigation will be quarantined, observed, and tested with a nucleic acid reagent. It takes at least two positive tests for a patient to be diagnosed as positive for COVID-19. Therefore, we speculate that most of the confirmed cases have been quarantined for at least 3 days before being diagnosed and are unable to infect others [27], which means that most infected persons were not infected by an individual who had been infected 11 or more days previously. Therefore, for each day t, this paper examines the infection rate of new daily infected cases in the past 10 days relative to the infected cases of day t.
The infected–susceptible–infected (ISI) model is also based on the retrospective method, in which the newly infected cases on day t were infected by the newly infected cases of the preceding days. Therefore, the number of newly infected cases on day t is described as the infection rate of that day multiplied by a weighted sum of the newly infected cases over a preceding window, where the window length w is a parameter and a weight is assigned to each time point in the window.
We calculate the infection rate according to Equation (9).
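The retrospective relation of the ISI module can be illustrated with a short Python sketch. It assumes that the infection rate of a day is the number of newly infected cases on that day divided by a weighted sum of the newly infected cases over the preceding w days, and that the same relation can be rolled forward once future rates are supplied. The function names, the uniform weights, and the toy series are hypothetical; a window of two days is used only to keep the arithmetic readable, whereas the text above considers the past 10 days.

```python
def infection_rates(infected, weights):
    """Infection rate per day; None where the lookback window is incomplete.

    infected[t]    -- estimated newly infected cases on day t
    weights[i - 1] -- weight given to the cases infected i days earlier
    """
    w = len(weights)
    rates = []
    for t in range(len(infected)):
        if t < w:
            rates.append(None)
            continue
        pool = sum(weights[i - 1] * infected[t - i] for i in range(1, w + 1))
        rates.append(infected[t] / pool if pool > 0 else None)
    return rates


def project_infections(history, future_rates, weights):
    """Roll the same relation forward, one day per supplied infection rate."""
    w = len(weights)
    series = list(history)
    for beta in future_rates:
        pool = sum(weights[i - 1] * series[-i] for i in range(1, w + 1))
        series.append(beta * pool)
    return series[len(history):]


# Toy series that doubles every day, with uniform weights over two days.
infected = [10, 20, 40, 80, 160]
weights = [1.0, 1.0]
rates = infected_rates = infection_rates(infected, weights)
projected = project_infections([10, 20], [4 / 3] * 3, weights)
```

In this toy series every day with a complete window has the same rate, 4/3, and feeding that rate back into the forward relation reproduces the values 40, 80, and 160. In the proposed framework the future rates would come from the LSTM module rather than being fixed by hand.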
3.4. Long Short-Term Memory
The recurrent neural network can dynamically incorporate experience owing to its internal recurrence. Unlike the conventional RNN, LSTM alleviates the problem of vanishing and exploding gradients. An LSTM memory cell has four units: an input gate, an output gate, a forget gate, and a self-recurrent neuron. LSTM is implemented by a composite function, and the detailed pipeline is shown in Figure 2.
In this composite function, $\sigma$ represents the logistic sigmoid function; i, f, o, and c represent the input gate, forget gate, output gate, and cell input activation vectors, respectively; h represents the hidden vector. The weight matrix subscripts have an intuitive meaning; for example, $W_{hi}$ represents the hidden–input gate matrix, etc.
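For reference, one widely used form of the LSTM composite function that is consistent with the symbols described here can be written as follows. This is a sketch under stated assumptions: peephole connections are omitted, $x_t$ denotes the input at time step $t$, the $b$ terms denote bias vectors, and $\odot$ denotes the element-wise product. The exact variant used in the proposed model may differ; in particular, some formulations add cell-to-gate (peephole) terms to the three gates.

```latex
\begin{aligned}
i_t &= \sigma\left(W_{xi} x_t + W_{hi} h_{t-1} + b_i\right), \\
f_t &= \sigma\left(W_{xf} x_t + W_{hf} h_{t-1} + b_f\right), \\
o_t &= \sigma\left(W_{xo} x_t + W_{ho} h_{t-1} + b_o\right), \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh\left(W_{xc} x_t + W_{hc} h_{t-1} + b_c\right), \\
h_t &= o_t \odot \tanh\left(c_t\right).
\end{aligned}
```

The forget gate scales the previous cell state and the input gate scales the new candidate, so the cell state is updated additively; this additive path is what lets gradients persist over long sequences.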
5. Discussion
Most epidemic models can only predict the number of confirmed cases based on historical disease-related data. In contrast, our model predicts the numbers of both confirmed and infected cases. We used BPISI-LSTM to estimate the numbers of infected cases in India, Austria, and Indonesia. We plotted the numbers of confirmed and infected cases over time, with the red line representing the estimated infected cases and the green dashed line representing the real confirmed cases.
Firstly, the numbers of infected and confirmed cases in India from the onset of COVID-19 to 22 November 2020 are shown in Figure 5. As of 22 November 2020, the peak of infection in India had occurred in early September 2020.
Secondly, the numbers of infected and confirmed cases in Austria between the onset of COVID-19 and 2 May 2021 are shown in Figure 6. As of 2 May 2021, there had been three infection peaks in Austria: in mid-March 2020, early November 2020, and early March 2021.
Thirdly, the numbers of infected and confirmed cases in Indonesia between the onset of COVID-19 and 2 May 2021 are shown in Figure 7. As of 2 May 2021, two peaks of infection had occurred in Indonesia, in mid-September 2020 and mid-January 2021.
From the above figures, we can see that the curve of confirmed cases has an overall delay compared with the curve of infected cases, which indicates that the number of infected cases is a more sensitive indicator. Thus, the estimation of infected cases can inform us on how to prevent and control the pandemic in advance.
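One simple way to quantify such a delay between two curves is to find the shift at which they overlap best. The Python sketch below does this with an unnormalized cross-correlation; the function name and the two toy series are hypothetical and are not taken from Figures 5 to 7.

```python
def best_lag(leading, lagging, max_lag):
    """Return the shift in days at which `lagging` best matches `leading`."""
    def overlap(lag):
        return sum(
            leading[t] * lagging[t + lag]
            for t in range(len(leading) - lag)
        )
    return max(range(max_lag + 1), key=overlap)


# Toy curves: the confirmed series is the infected series shifted by 3 days.
infected = [0, 0, 1, 3, 6, 3, 1, 0, 0, 0, 0, 0]
confirmed = [0, 0, 0, 0, 0, 1, 3, 6, 3, 1, 0, 0]
lag = best_lag(infected, confirmed, max_lag=6)
```

For these toy curves the best shift is 3 days. On real data the confirmed curve is a smoothed as well as delayed version of the infected curve, so such a single number is only a rough summary of the lead time.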
The time interval from infection to medical diagnosis is a random variable that obeys the log-normal distribution. By building on this biomedical law, the proposed bio-inspired intelligent algorithm is able to estimate the number of infected cases and predict the number of confirmed cases. The experimental results show that the prediction performance of intelligent algorithms can be further improved by incorporating biological laws.
6. Conclusions
By analyzing the transmission mechanism of COVID-19, we used multimodal data to predict confirmed cases and infected cases. On the one hand, the time interval from infection to medical diagnosis is a random variable that obeys a specific log-normal distribution. On the other hand, in addition to the daily disease-related data, movement trends over time by geography provide a new perspective for epidemic prediction. Based on these two motivations, we propose a back-projection-based bio-inspired hybrid model (BPISI-LSTM). The model takes disease-related data and social migration data as input, and these data are encoded by LSTM and concatenated to obtain the multimodal feature for prediction. We validate the effectiveness of the proposed model on multimodal datasets of developed and developing countries. Firstly, our experimental results show that the use of biological laws, LSTM modules, and multimodal data improves the prediction accuracy of the confirmed cases. Secondly, compared with other models that can only predict the number of confirmed cases, BPISI-LSTM also estimates the number of infections and thus anticipates the course of the pandemic.
Mobility and disease-related features are both used in the model. We encourage future research that explores more external features, such as the prevalence of wearing masks and changes in the weather. Moreover, this modeling framework can be readily extended. For example, the LSTM module can be replaced by a graph neural network, which may better capture the mobility information between regions and attributes of regions such as the population and medical resources.