Forecast of Dengue Cases in 20 Chinese Cities Based on the Deep Learning Method

Dengue fever (DF) is one of the most rapidly spreading diseases in the world, and timely, accurate forecasts of dengue might help local governments implement effective control measures. To forecast DF cases accurately, it is crucial to model the long-term dependency in time series data, which is difficult for a typical machine learning method. This study aimed to develop a timely and accurate dengue forecasting model based on long short-term memory (LSTM) recurrent neural networks while considering only monthly dengue cases and climate factors. The performance of the LSTM models was compared with that of other previously published models when predicting DF cases one month into the future. Our results showed that the LSTM model reduced the average root mean squared error (RMSE) of the predictions by 12.99% to 24.91% and reduced the average RMSE of the predictions in the outbreak period by 15.09% to 26.82% as compared with other candidate models. The LSTM model thus achieved superior performance in predicting dengue cases as compared with other previously published forecasting models. Moreover, transfer learning (TL) can improve the generalization ability of the model in areas with fewer dengue incidences. The findings provide a more precise dengue forecasting model that could also be used for other dengue-like infectious diseases.


Introduction
Dengue is a climate-sensitive, mosquito-borne tropical disease caused by infection with the dengue virus [1,2]. The symptoms include high fever, headache, vomiting, muscle and joint pains, and a skin rash [3]. In a small proportion of cases, the disease becomes severe, resulting in bleeding, low levels of blood platelets, and blood plasma leakage or shock syndrome [1]. The dengue virus (five serotypes) is transmitted by Aedes albopictus and Aedes aegypti [4] and is highly sensitive to environmental factors [5][6][7][8]. Meteorological conditions, such as temperature, precipitation, and humidity, have a significant impact on the spread of dengue fever (DF), as these conditions help to increase the population density of Aedes [5][6][7][8]. As temperature and precipitation increase, the development of Aedes from larva to pupa becomes shorter, thereby increasing the population growth of Aedes [9,10]. Conversely, temperatures above 35 °C or heavy precipitation may reduce DF transmission by decreasing the survival rate of Aedes [9,11]. Besides meteorological variables, socio-economic factors also influence the spread of DF [12].
DF has been a global threat since World War II [13,14]. According to a recent analysis of the global distribution and burden of dengue virus, approximately 390 million dengue cases are reported annually worldwide, especially in Asia and South America [15,16]. The growing number of dengue cases is a huge economic burden [17]. In China, DF reemerged in Guangzhou city in 1978 after the city was free of the disease for more than 40 years [18]. Since then, more regions have reported dengue outbreaks, which include Guangdong, Guangxi, Yunnan, and Zhejiang, and the incidence has increased steadily in recent years [19]. Dengue was one of the most severe public health threats in China, especially when 45,230 dengue cases were reported in Guangdong Province in 2014 [20].
In the absence of effective antiviral agents capable of treating dengue infection, early and accurate forecasts of dengue might minimize the threat and help the government to implement effective control measures. Numerous studies have attempted to predict the incidence of DF using meteorological factors together with vector population density or social media data. For example, Shi et al. established a set of statistical models using the least absolute shrinkage and selection operator method to improve the forecasting of dengue in Singapore [21]. Xu et al. used zero-inflated generalized additive models (ZIGAMs) to demonstrate the effects of climatic conditions on the spread of mosquitos and dengue transmission rates [8]. In addition, Li et al. provided accurate dengue predictions by applying the susceptible infected recovered (SIR) epidemic model in mainland China [22]. Chae et al. recognized that the infectious disease reports of some medical organizations are subject to delays in the reporting system and constructed a data-based infectious disease forecast model based on deep learning [23]. However, it is difficult to track DF in a timely manner because local mosquito data reports are delayed, and most social media data that have received popular attention are not the outputs of instruments designed to produce valid and reliable data amenable to scientific analysis [24].
Moreover, the relationship between dengue cases and meteorological features is highly complex and cannot be easily fitted by classical time series models. Deep learning methods offer advantages for the health care field as compared with traditional statistical models [25][26][27][28] and are being actively applied to the prediction of infectious disease dynamics [29,30]. Lee et al. showed that artificial neural networks (ANNs) offer a potential benefit in forecasting fluctuations in the mosquito population (especially the extreme values) and that this method is better than traditional statistical techniques, such as the multiple regression model [31]. Aburas et al. produced encouraging results, with a correlation coefficient of 0.86 for the 2005 period for dengue case prediction in Singapore using ANNs [32]. By truncating the gradient where this does no harm and by using memory cells and gate units, LSTM recurrent neural networks can bridge a large number of discrete time steps [33]. Therefore, LSTM is considered one of the most advanced deep learning architectures for sequence learning tasks, such as speech recognition or time series prediction [34,35]. LSTM has been used successfully to forecast influenza trends and epidemics of hand, foot, and mouth disease [36,37]. However, a thorough search of the literature shows that few previous attempts have been made to deploy LSTM networks to predict dengue incidence and to assess their performance. Current research is more concerned with areas that have a high incidence of dengue fever, whereas prediction is difficult in areas with fewer dengue incidences. TL is an optimization method used to improve the learning of a new task through the transfer of knowledge from related learning tasks [38]. Correspondingly, TL may be used to improve the learning of models in areas with fewer dengue incidences.
Therefore, this study added lag to the collected data set to consider temporal characteristics. In addition, thorough testing of all the areas selected was performed to examine the model's robustness. The model prediction performance of this study was verified by comparing it with different prediction algorithms, including a deep learning method and an infectious disease prediction model that uses time series analysis.
Ultimately, by using this study's obtained results, it should be possible to construct a model that can predict monthly dengue cases in real time. Such a model can not only accurately predict the number of DF cases in areas with a high incidence of dengue fever but also forecast the trend of DF in areas with fewer dengue incidences, which might be helpful for the local government and community to respond early to the disease.

Study Areas and Dengue Cases
Case-level records of human dengue incidence data from January 2005 to December 2018 were available in the China National Notifiable Disease Surveillance System (NNDSS). Relevant information for each case was recorded, including basic demographic characteristics (e.g., gender, age, nationality, and residential address) and time of disease-related events (e.g., date of disease onset, diagnosis, and death). All human dengue cases were diagnosed according to the diagnostic criteria for DF (WS216-2008) enacted by the Chinese Ministry of Health [39,40]. Due to the variability of dengue prevalence among the cities, we summarized the data into a monthly scale. Human dengue cases per month (denoted as D) and monthly meteorological data were aggregated for the neural network model training and prediction.
A total of 250 cities in China reported dengue cases from 2005 to 2018 (Figure 1A), and human dengue cases showed an increasing trend over this period (Figure 1C). The top 20 cities (i.e., Guangzhou, Foshan, Sinsong Panna, Dehong, Chaozhou, Zhongshan, Hangzhou, Zhanjiang, Fuzhou, Jiangmen, Shenzhen, Nanning, Zhuhai, Lincang, Shantou, Dongguan, Yangjiang, Putian, Qingyuan, and Zhaoqing) with the highest incidence of dengue were selected as the study areas (Figure 1A). Among these cities, Guangzhou accounts for 57.45% of the total dengue cases in mainland China (Figure 1B). Compared with the dengue epidemics in other cities, the epidemic in Guangzhou has obvious periodicity (Figure 1D-W), and the dengue data of Guangzhou were thus chosen to train the pre-trained model. Moreover, all of these study areas have frequent economic and cultural communication with the nations of Southeast Asia, where dengue has been hyperendemic for decades [41].

Meteorological Data and Attribute Selection
The meteorological data of these cities were obtained from the National Meteorological Information Center (NMIC). A total of 15 meteorological variables (i.e., extreme wind speed, maximum wind speed, mean wind speed, minimum pressure, maximum pressure, average pressure, mean water pressure, minimum air temperature, maximum air temperature, mean air temperature, average of daily highest temperature, average of daily lowest temperature, average of daily precipitation, number of days with rainfall, and average of relative humidity) were retained, with no missing values in the raw data. The LSTM model might be overfitted if all variables are used for neural network model training. Overfitting improves the model performance on the training set but degrades it on the test set, indicating that the generalization ability of the model is weak [42]. Thus, attribute selection was used to prevent overfitting and remove redundant attributes.
Attribute reduction for these meteorological variables was carried out using high correlation filtering and low variance filtering. Four variables (i.e., maximum wind speed, extreme wind speed, mean wind speed, and minimum pressure) were deleted as they had the smallest variance in all the study areas. Moreover, two meteorological variables that were highly correlated with the remaining ones (i.e., mean minimum temperature and mean air temperature) were removed. In total, 10 valid attributes (i.e., nine meteorological variables and one epidemiological variable) were considered in our study, and the feature parameters are shown in Table 1.
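The two filters described above can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure actually used in the study: the thresholds `min_variance` and `max_abs_corr`, the function name `select_attributes`, and the toy values are all assumptions, since the text does not report the cut-offs that were applied.

```python
from statistics import mean, pvariance


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def select_attributes(columns, min_variance, max_abs_corr):
    """Return the names of the columns kept after both filters.

    columns: dict mapping a variable name to its list of monthly values.
    min_variance and max_abs_corr are hypothetical thresholds.
    """
    # Low variance filtering: drop variables that barely change.
    kept = [name for name, values in columns.items()
            if pvariance(values) >= min_variance]
    # High correlation filtering: drop a variable when it is nearly
    # collinear with one that has already been selected.
    selected = []
    for name in kept:
        if all(abs(pearson(columns[name], columns[other])) <= max_abs_corr
               for other in selected):
            selected.append(name)
    return selected


# Toy monthly values, invented purely for illustration.
toy = {
    "mean_temp": [14.0, 18.0, 24.0, 28.0, 26.0, 19.0],
    "mean_daily_low": [10.1, 14.0, 20.2, 24.0, 22.1, 15.0],
    "min_pressure": [1001.0, 1001.1, 1001.0, 1001.1, 1001.0, 1001.1],
    "rain_days": [5.0, 12.0, 9.0, 15.0, 7.0, 11.0],
}
result = select_attributes(toy, min_variance=0.05, max_abs_corr=0.95)
print(result)
```

In this toy run, the nearly constant pressure series is removed by the variance filter and the daily-low temperature series is removed because it tracks the mean temperature almost exactly. Note that an unscaled variance threshold depends on the units of each variable, so in practice the variables would probably need to be put on a common scale first.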

Method
This study constructed an LSTM-based DF prediction model, and compared its performance with other candidate models, i.e., back propagation neural network (BPNN), generalized additive model (GAM), support vector regression (SVR), and gradient boosting machine (GBM). The overall framework of this study is illustrated in Figure 2, and the detailed steps are presented hereafter. The input data consist of the natural logarithm (NL) of the human dengue cases and meteorological condition variables of the current month. The output data are the log-values of dengue cases in the subsequent month. The normalized data were divided into a training set and testing set. The data from 2005 to 2016 were used as the training set, and the data from 2017 to 2018 were used as a test set, which are the unseen data during the training. The comparison of the model performance proceeded using the predicted value of the inverse logarithm.
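The preprocessing steps above (log transform, chronological split, and back-transformation for evaluation) might look as follows. This is a sketch under stated assumptions: the `+ 1` offset inside the logarithm is our own choice so that months with zero cases remain defined, the helper names are hypothetical, and the case counts are placeholders rather than NNDSS data. The normalization step is omitted because the text does not specify which method was used.

```python
import math


def to_log(cases):
    """Log-transform a monthly count. The +1 offset is an assumption
    made here so that months with zero cases stay defined."""
    return math.log(cases + 1)


def from_log(value):
    """Inverse of to_log, returning a count on the original scale."""
    return math.exp(value) - 1


def split_by_year(records, last_train_year=2016):
    """Split (year, month, cases) records chronologically."""
    train = [r for r in records if r[0] <= last_train_year]
    test = [r for r in records if r[0] > last_train_year]
    return train, test


# Placeholder counts: one invented value per month from 2005 to 2018.
records = [(year, month, (year - 2005) * month)
           for year in range(2005, 2019) for month in range(1, 13)]
train, test = split_by_year(records)
print(len(train), len(test))  # 144 24
```

Splitting by calendar year rather than at random keeps the 2017 to 2018 months strictly after every training month, which matches the requirement that the test data be unseen during training.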

LSTM Modeling
LSTM, as an advanced intelligent algorithm, can automatically find the characteristics of both the long-term trends and the short-term fluctuations of time series data. LSTMs belong to the class of improved recurrent neural networks (RNNs) and rely on a memory cell with three gating functions incorporated into its construction (Figure S1) [33]. The ability of the LSTM neural network to automatically find the connections between attributes in a time series is derived from learning. A learning process in the neural network model is a weight adjustment from given examples, which brings the network output closer to the true observation without changes in the network structure [32]. The LSTM model consisted of 10 input parameters, such as the monthly mean maximum temperature, monthly average relative humidity, monthly raining days, and the observation taken last month. The vector $X_{t_i} = (Pr^{\max}_{t_i}, Pr^{a}_{t_i}, Pr^{w}_{t_i}, T^{\min}_{t_i}, T^{\max}_{t_i}, T^{h}_{t_i}, P^{a}_{t_i}, P^{d}_{t_i}, H^{a}_{t_i}, D_{t_i})$ is the set of inputs observed in the same month $t_i$. The vector $c = (c_1, c_2, \ldots, c_{64})$ denotes the 64 memory cells of the hidden layer. The input time series $X = (X_{t_1}, X_{t_2}, \ldots, X_{t_{12}})$ is transmitted to the hidden layer and its memory cells through weighted connections to compute the output $Y$, the logarithm of dengue cases in the subsequent month. The inputs $X_{t_1}$ to $X_{t_{12}}$ and the output $Y$ are shown in Figure 3.
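To make the role of the memory cell and its three gates concrete, the following sketch implements one time step of a single memory cell in plain Python, using the commonly used formulation with input, forget, and output gates. It is an illustration only and not the TensorFlow implementation used in the study: the function name `lstm_cell_step`, the parameter layout, and the hand-picked weights are assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_cell_step(x, h_prev, c_prev, params):
    """One time step of a single LSTM memory cell.

    x: list of input features for one month.
    h_prev, c_prev: previous hidden output and cell state (floats).
    params: dict with, for each of "i", "f", "o", "g", a tuple
            (input weights, recurrent weight, bias).
    """
    def pre(name):
        w, u, b = params[name]
        return sum(wi * xi for wi, xi in zip(w, x)) + u * h_prev + b

    i = sigmoid(pre("i"))      # input gate
    f = sigmoid(pre("f"))      # forget gate
    o = sigmoid(pre("o"))      # output gate
    g = math.tanh(pre("g"))    # candidate cell update
    c = f * c_prev + i * g     # new cell state
    h = o * math.tanh(c)       # new hidden output
    return h, c


# Hand-picked weights (not learned): the forget gate is held open and
# the input gate is held shut, so the cell state is carried unchanged.
zeros = [0.0] * 10
params = {
    "i": (zeros, 0.0, -100.0),
    "f": (zeros, 0.0, 100.0),
    "o": (zeros, 0.0, 0.0),
    "g": (zeros, 0.0, 0.0),
}
h, c = 0.0, 0.7
for month in range(12):
    h, c = lstm_cell_step([1.0] * 10, h, c, params)
print(h, c)
```

With the forget gate open and the input gate shut, the cell state survives all 12 steps unchanged, which is the mechanism that lets an LSTM carry information across a full yearly cycle.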
Most of the experiments were performed in Python (version 3.6.8) on 64-bit Windows with a 3.0 GHz Intel Core i5-8500 CPU. The LSTM network models used in this study were built with TensorFlow (version 1.13.1), an application programming interface for deep learning released by Google [43,44].
The architecture has three layers: an input layer; a single RNN layer with LSTM (hidden layer), which includes 64 memory cells; and an output layer (Figure 3). The initial learning rate was set to 1e-4. Since the transmission cycle of the dengue virus is one year in China, the time step of the LSTM network was set to 12. The root mean squared error (RMSE) in the validation set was smallest when the model took 12 sets of parameters as input to produce the number of dengue cases in the next month as output (Table 2). TensorFlow defaults were used for the initialization of the weights and biases and for the activation function of the recurrent nodes. In addition, for regularization, a dropout layer was added between the LSTM layer and the output layer, with the dropout rate set to 40% to combat network overfitting. The internal weight parameters of the LSTM neural network were adjusted through the adaptive moment estimation (Adam) optimizer [45]. The model was trained for 8000 epochs. Our source code is available at https://github.com/KeqiangXu/Dengue_Forecast_Based_on_LSTM.
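Setting the time step to 12 means that each training sample pairs 12 consecutive months of inputs with the log case count of the following month. A minimal sketch of how such samples could be assembled is given below; the function name `make_windows` and the dummy series are assumptions for illustration and are not taken from the published source code.

```python
def make_windows(features, targets, time_steps=12):
    """Build (window, target) pairs for one-month-ahead forecasting.

    features: list of per-month feature vectors, in time order.
    targets:  list of per-month log case counts, same length and order.
    Each sample pairs time_steps consecutive months with the target
    of the month that immediately follows them.
    """
    samples = []
    for end in range(time_steps, len(features)):
        window = features[end - time_steps:end]   # 12 consecutive months
        samples.append((window, targets[end]))    # the following month
    return samples


# Dummy series: 144 training months (2005 to 2016), 10 attributes each.
features = [[float(m)] * 10 for m in range(144)]
targets = [float(m) for m in range(144)]
samples = make_windows(features, targets)
print(len(samples))  # 132
```

Each window has shape 12 by 10, matching the 12 time steps and 10 input attributes described above; the first 12 months of the series serve only as inputs because no full window precedes them.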

Candidate Models
For LSTM, we designed two model training routes: one trained the LSTM model using the local data only; the other pre-trained a model using data from Guangzhou and then trained the models of the other cities using transfer learning (TL). The Guangzhou data were selected for pre-training as they contained a large number of DF cases. The pre-trained LSTM model learns the mapping between the input (meteorological data and dengue case data) and the output (dengue cases in the next month). The model fitted to the Guangzhou data was then used as the starting point for the model of another city.
In addition, we also fitted other models that have been applied in dengue prediction. The BPNN model has shown excellent performance in multivariate time series prediction. For the BPNN model, optimal parameters (including the number of neurons in the hidden layer and the learning rate) were selected to avoid overfitting and improve the predictive performance [32]. For the SVR model, we used an ε-SVR approach with a linear kernel function to track dengue dynamics [46]. For the GAM model [47] and the GBM model [48], the training parameters were left at the default values of the respective Python packages.
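The second LSTM training route described above (pre-train on a data-rich city, then continue training on a data-poor city from the pre-trained weights) can be illustrated with a deliberately simplified stand-in. The sketch below uses a one-variable linear model fitted by gradient descent instead of an LSTM, and all data are invented; it only demonstrates the warm-start idea and is not the procedure implemented in the published source code.

```python
def fit_linear(xs, ys, w=0.0, b=0.0, lr=0.01, epochs=50):
    """Gradient descent on mean squared error for y = w * x + b,
    starting from the given (w, b)."""
    n = len(xs)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def mse(xs, ys, w, b):
    """Mean squared error of the linear model on (xs, ys)."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Invented data: a data-rich "source" series and a data-poor "target"
# series that follows a similar, but not identical, relationship.
source_x = [i / 10 for i in range(20)]
source_y = [2.0 * x + 1.0 for x in source_x]
target_x = [0.5, 1.0, 1.5]
target_y = [2.1 * x + 0.9 for x in target_x]

# Route 1: train on the target data only, from zero-initialized weights.
w_local, b_local = fit_linear(target_x, target_y, lr=0.1, epochs=5)

# Route 2: pre-train on the source, then fine-tune on the target.
w_pre, b_pre = fit_linear(source_x, source_y, lr=0.1, epochs=2000)
w_tl, b_tl = fit_linear(target_x, target_y, w=w_pre, b=b_pre,
                        lr=0.1, epochs=5)

mse_local = mse(target_x, target_y, w_local, b_local)
mse_tl = mse(target_x, target_y, w_tl, b_tl)
print(mse_local, mse_tl)
```

With the same small training budget on the target data, the warm-started model ends with a lower error because the source relationship is close to the target relationship. If the two relationships differed strongly, the pre-trained starting point would offer little or no benefit.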

Model Validation
Based on the inverse logarithm of the model outputs, the model performance and prediction accuracy were measured by RMSE. RMSE is widely used to evaluate continuous variables by measuring the differences between predicted and observed values [49]:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(Y_t - \hat{Y}_t\right)^2} \qquad (1)$$

where $Y_t$ is the observed number of dengue cases at time $t$, $\hat{Y}_t$ is the number of cases predicted by the model, and $n$ is the number of months evaluated. A smaller RMSE value indicates a smaller difference between the predicted and observed values and thus a higher prediction performance of the model. The root relative squared error (RRSE) can be used to evaluate the goodness-of-fit between predicted and observed values. Mathematically, the RRSE is evaluated by Equation (2):

$$\mathrm{RRSE} = \sqrt{\frac{\sum_{t=1}^{n}\left(Y_t - \hat{Y}_t\right)^2}{\sum_{t=1}^{n}\left(Y_t - U\right)^2}} \qquad (2)$$

where $U$ is the average of the observations. The RRSE index ranges from 0 to infinity, with 0 corresponding to the ideal.

In order to fully evaluate the predictive performance of the model, we designed two scenarios. First, we evaluated the prediction accuracy of each model over the last 24 months and compared the models' performance; second, data from July to November, which covers the peak in dengue incidence in 2017 and 2018, were selected to assess the prediction performance of the model in the outbreak period.
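The two error measures and the two evaluation scenarios described above can be sketched as follows. The helper names and the observed and predicted values are invented for illustration; only the definitions of RMSE and RRSE and the July to November outbreak window come from the text.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error between observations and predictions."""
    n = len(observed)
    return math.sqrt(
        sum((y - p) ** 2 for y, p in zip(observed, predicted)) / n)


def rrse(observed, predicted):
    """Root relative squared error: the model's squared error relative
    to that of always predicting the mean of the observations."""
    u = sum(observed) / len(observed)
    num = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    den = sum((y - u) ** 2 for y in observed)
    return math.sqrt(num / den)


def outbreak_subset(months, values, outbreak_months=range(7, 12)):
    """Keep only the values that fall in July to November."""
    return [v for m, v in zip(months, values) if m in outbreak_months]


# Invented monthly values for May to December of one year.
months = [5, 6, 7, 8, 9, 10, 11, 12]
observed = [1, 3, 10, 30, 50, 20, 5, 1]
predicted = [1, 3, 12, 26, 50, 20, 5, 1]

overall = rmse(observed, predicted)
outbreak = rmse(outbreak_subset(months, observed),
                outbreak_subset(months, predicted))
print(overall, outbreak, rrse(observed, predicted))
```

An RRSE of 1 corresponds to a model that is no better than predicting the mean of the observations, which gives a scale-free reference point when comparing cities with very different case counts.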

Comparison of LSTM, LSTM-TL, and Candidate Models
The data show an increasing number of dengue cases year by year (Figure 1C). Most of the dengue cases occurred in the Pearl River Delta region, which includes Guangzhou, Foshan, Zhongshan, Jiangmen, Shenzhen, and Zhuhai (Figure 1B). In particular, the dengue cases in Guangzhou accounted for more than half of the total dengue cases in China between 2005 and 2018 (Figure 1B). The LSTM model trained with TL had a lower RMSE in most cities than the LSTM model trained using only local data. Among the cities, Qingyuan, Dongguan, Shenzhen, Foshan, and Zhongshan are in the same province as Guangzhou and lie no more than 100 km from it. The reductions in prediction RMSE in these cities were considerable, at 34.6%, 47.4%, 30.3%, 26.9%, and 32.5%, respectively. The prediction RMSE declined markedly in these cities in the vicinity of Guangzhou because TL can improve the model to some degree, provided that the source and target tasks are the same [38].
The predictive accuracy of each model for dengue cases from 2017 to 2018 and during the outbreak period is shown in Table 3. According to the predictive accuracy for the two prediction periods, the LSTM model trained with TL has a lower RMSE in most cities than the BPNN, GAM, SVR, and GBM models. Our LSTM method reduced the average prediction RMSE by 12.99% to 24.91% as compared with other previously published models, and the average prediction RMSE in the outbreak period decreased by 15.09% to 26.82% (Table 3). Notably, the LSTM method reduced the prediction RMSE by 44.48% to 75.56% in Guangzhou (Table 3), which has the highest incidence of dengue fever in China, and the prediction RMSE in the outbreak period decreased by 44.75% to 75.7%. The goodness-of-fit assessment for each model is shown in Table 4. The trend of dengue incidence predicted by the LSTM, GBM, GAM, and SVR models from 2017 to 2018 in the top five cities with high dengue incidence is shown in Figure 4. The corresponding predicted trends in the other 15 cities are shown in Figures S2-S4.

Table 3. Comparison of model performances using the root mean square error (RMSE). The number before the symbol "/" is the RMSE of the model prediction for the last 24 months, and the number after the symbol "/" is the RMSE of the model prediction in the outbreak period (July to November in 2017 and 2018). TL: Transfer Learning.

Table 4. Comparison of the models' goodness-of-fit using the root relative squared error (RRSE). The number before the symbol "/" is the RRSE of the model prediction for the last 24 months, and the number after the symbol "/" is the RRSE of the model prediction in the outbreak period (July to November in 2017 and 2018).

Discussion
This study reviewed the meteorological factors related to dengue occurrence and proposed an LSTM-based model to efficiently predict dengue cases in 20 cities in mainland China. To the best of our knowledge, this is the first time that dengue forecasting models have been established based on the LSTM network and assessed in mainland China.
Judging from the national notifiable infectious disease reports of the past five years, the incidence of dengue fever in China has remained high. Using the data predicted by the model, the government can track dengue dynamics and carry out targeted prevention and control measures. To date, different dengue forecast models have been developed [20,24,50,51]. The Chinese Center for Disease Control and Prevention (CCDC) has introduced the China Infectious Disease Automated-alert and Response System (CIDARS) for the detection of dengue outbreaks, but reports of the spread of dengue by the system are delayed because this method depends on the numbers of notified dengue cases [51]. Some scholars have developed dengue forecast models using climate data, mosquito density data, and dengue case data [20,24]. Because local mosquito data cannot be updated quickly in China, these models are unable to track dengue fever in a timely manner, and there is room to improve their predictive performance. In contrast, dengue case data and meteorological data can be updated promptly by the relevant departments. Thus, mosquito data were discarded, and monthly dengue cases and meteorological data were chosen to develop a DF prediction model.
Dengue outbreaks in mainland China are often caused by the virus being carried by returning travelers or visitors to China from dengue-endemic areas elsewhere, and most of the dengue cases occur in autumn [52]. In addition, we observed a wide range in the number of monthly dengue cases, from 0 to 18,569, which makes it difficult to predict dengue cases. To obtain accurate forecasts of non-linear time series, such as those of infectious diseases, it is crucial to model the long-term dependency in time series data. Periodic patterns spanning multiple time steps are difficult for a typical machine learning method to identify, but they can be captured by the LSTM network [53,54]. The results in Table 2 showed that the setting of the time step affected the model performance and that the model performed best when the time step was set to 12, which suggests that the underlying mechanism of dengue outbreaks may be related to long-term climate change. Thus, the LSTM neural network was chosen to develop a DF prediction model that can accurately predict the prevalence of dengue in a timely manner.
In this study, the dataset was relatively small for deep learning models, and neural network models are less effective when trained on limited data. Nevertheless, climate has been proven to be a driving force for DF [8,13], which benefits the deep learning model by helping it capture the pattern of viral transmission and predict the number of cases. TL can improve the model to some degree, but it requires that the source and target tasks be the same [38]. TL is an optimization method that improves the learning of a new task through the transfer of knowledge from related learning tasks. In our study, TL is applicable in regions with similar climates, and it can improve the learning of a new model in areas with fewer dengue incidences through transfer from a model already trained in areas with high dengue incidences.
The predictive accuracy and goodness-of-fit of our LSTM model are superior to those of the other models in Guangdong, which is China's most dengue-hit area (Tables 3 and 4). Further, we compared the neural network models with other previously published models in more cities of China (Table 3). In terms of prediction accuracy, the LSTM neural network model has a lower RMSE in most cities than the other models. When the LSTM model was applied to more cities in mainland China, we found that it failed to capture the characteristics of viral transmission in areas with a low dengue incidence. This problem can be addressed using TL in areas in the vicinity of Guangzhou where the LSTM model alone yields a high RMSE. In the outbreak period, the predictions of the LSTM model or the LSTM model trained by transfer learning were closer to the observations than those of the other models (Table 3).
A large number of studies have examined the impact of meteorological factors on dengue fever. However, most of the research models are limited to short-term analysis, and the cumulative and lagged effects of the relevant factors are not considered, which limits their results. At the same time, geographical differences and research materials of different spatial scales lead to a lack of comparability among research results. Based on the LSTM method, this study achieved an accurate prediction of DF cases for high-risk areas in mainland China, using long-term time series of dengue cases and meteorological variables. This method might be used for the large-scale prediction of other dengue-like diseases.
However, our research has some limitations. First, the LSTM model takes a large amount of time for training compared to other machine learning models; however, the impact is not significant since the data collected in this study were from a small-sized dataset. Second, we could not obtain accurate predictions in some cities by using any model in this study, probably because we failed to consider other relevant potential socio-economic factors [55].

Conclusions
This study proposed an LSTM-based model, which enabled us to efficiently predict monthly dengue cases using meteorological data and dengue cases in 20 cities of mainland China. Several candidate models were also implemented in order to appraise the performance of the LSTM-based model. In brief, the LSTM-based model could identify periodic patterns spanning multiple time steps in non-linear time series. Moreover, integrating LSTM and transfer learning could improve the prediction accuracy. We conjecture that the proposed LSTM and LSTM-TL models might be used for the large-scale prediction of other dengue-like diseases.