1. Introduction
Acute respiratory infections (ARI) are one of the main causes of morbidity and mortality in the world, particularly in children under 5 years old and adults over 65 years old. An estimated 156 million acute lower respiratory infections occur worldwide every year, and almost 2.4 million deaths were associated with these infections in 2016 [1,2,3,4,5]. Developing an effective monitoring and response system for infectious diseases therefore remains a challenge [6,7].
The most frequent pathogens that cause ARI are respiratory syncytial virus (RSV), human metapneumovirus, rhinovirus/enterovirus, influenza viruses, parainfluenza 1–4, adenovirus, coronavirus, Streptococcus pneumoniae, and Mycoplasma pneumoniae [8,9]. ARI exhibit a seasonal pattern where RSV and influenza viruses are the major contributing pathogens. Changes in circulating viral strains of these viruses may result in yearly winter ARI epidemics. In addition, the introduction of novel influenza strains or other viruses into the human population can lead to the emergence of pandemics.
Despite the health and economic burden of RSV, there is currently no vaccine or effective antiviral treatment against this virus. In contrast, there are several antivirals and vaccines available for influenza. While mortality associated with influenza has been reduced since the introduction of the influenza vaccine, this virus remains an important cause of ARI [10].
Health surveillance around the world has become a subject of primary concern because of the continuous emergence of infectious disease outbreaks, including those associated with ARI such as pandemic influenza, severe acute respiratory syndrome (SARS), and, more recently, SARS-CoV-2 [11]. A detailed understanding of transmissible disease dynamics and the accurate forecasting of disease outbreaks translate into fewer lives lost and into cost savings from avoiding desperate measures during health contingencies [12,13,14].
Several systems that collect epidemiological information from informal sources, such as non-government reports, news reports, and field agents, have been proposed as potential Early Warning Systems (EWS). ProMED [15], GOARN [16], GPHIN [17], Argus [18], BioCaster [19], EpiSPIDER [20,21], and PREDICT [22] are some examples. ProMED proved the usefulness of EWS with its early report of the SARS epidemic in 2003 in mainland China. On February 10, 2003, a ProMED report became the earliest public alert of a disease that would later be known as severe acute respiratory syndrome (SARS) and that would ultimately affect in excess of 8000 individuals worldwide and kill more than 900 [23].
The use of Internet search engines by the general public and physicians creates trends in search terms that match the temporal occurrence of diseases and allow for potential detection of outbreaks at early stages, before traditional surveillance methods identify them [24,25]. In recent years, the use of Internet search engines and social media platforms for disease surveillance and forecasting has been widely studied. Google [26,27,28,29,30,31,32,33,34,35,36,37,38], Yahoo [39], Baidu [26,36,40], and Twitter [31,32] are some of the services that have been used to this effect. Infectious diseases such as dengue [25,36,41,42], influenza-like illness (ILI) [24,27,31,32,33,35,39,40,43,44,45,46], diarrhea [47], varicella [29,47], Lyme disease [28], and Zika [37] are among those that have been studied most frequently with this approach because of their seasonal and epidemic behaviors.
Work related to this research includes that of Santillana et al. [32], published in 2015. They used six data sources to predict ILI activity in 2013 and 2014: CDC-reported ILI, near real-time hospital visit records from athenahealth, Google Trends, influenza-related Twitter microblogging, FluNearYou, and Google Flu Trends. One of these sources, Google Flu Trends, has been unavailable since 2015. The predictions were made with three machine learning algorithms for multivariate regression: stacked linear regression, support vector machines (SVM), and AdaBoost with decision tree regression. They report being able to predict one, two, and three weeks into the future. That study reports only the different data sources and regression methods, and performance is evaluated with the Pearson correlation, root mean squared error (RMSE), maximum absolute percentage error (MAPE), and hit rate [32]. Additionally, Volkova et al. [31], in 2017, employed only two data sources: the Defense Medical Information System (in the USA) and Twitter (for the period 2011 to 2014). They used SVM and AdaBoost as baseline models and proposed a long short-term memory (LSTM) model, a type of recurrent neural network. They trained the LSTM models on two seasons (2012–2013) and tested them on the 2014 season. They also employed the same metrics used by Santillana et al. [32], except the hit rate. Their models are capable of predicting weekly ILI dynamics and forecasting up to several weeks in advance [31].
Thus, previous research reports only which data were used and which regression models performed best over a specific period of time. Little has been said about data retrieval, data preprocessing, and feature extraction (together called the data acquisition process). These studies also do not use endemic channel information.
The main contribution of this research is a methodology composed of the data acquisition process and the computational model. Other contributions of this research include: (i) a method to automatically select the search terms associated with the available data source; (ii) a smoothed endemic channel calculation; (iii) a predictive calculation made by merging the forecast of an artificial neural network, the projection of a sum of sines model, and the proposed smoothed endemic channels.
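As a rough illustration of contribution (ii), an endemic channel can be sketched as per-week quartiles of historical counts, followed by a smoothing step. The quartile definition, the circular moving average, the window length, and the function names below are all assumptions made for this sketch; the smoothing actually adopted in the methodology may differ.

```python
import numpy as np


def endemic_channel(history):
    """Per-week quartiles over past years.

    history: array of shape (n_years, 52) with weekly case counts.
    Returns an array of shape (3, 52): first quartile, median, third quartile.
    """
    return np.percentile(history, [25, 50, 75], axis=0)


def smooth_circular(curve, window=5):
    """Centered moving average that wraps around the year.

    The window length is an assumed value and must be odd.
    """
    pad = window // 2
    padded = np.concatenate([curve[-pad:], curve, curve[:pad]])
    return np.convolve(padded, np.ones(window) / window, mode="valid")


# Synthetic history for eight years (not data from the study).
rng = np.random.default_rng(3)
week = np.arange(52)
history = 100 + 40 * np.sin(2 * np.pi * week / 52) + rng.normal(0, 10, (8, 52))

q1, median, q3 = endemic_channel(history)
smoothed_median = smooth_circular(median)
```

Because the moving average wraps around, the last weeks of the year are smoothed with the first ones, which avoids edge effects at the year boundary.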
Overall, this paper presents a methodology capable of making accurate predictions of ARI activity with data obtained from epidemiological reports along with search term usage derived from the Google search engine. The combined use of epidemiological, machine learning, and forecasting techniques allowed us to develop a computational model that is capable of accurately predicting ARI trends. Adaptations of this model might prove useful for timely detection of outbreaks at early stages and before they become a major health burden. This research is part of a larger project called Mexican Infectious Disease Analysis and Surveillance mapping application (MIDASmap; http://midasmap.uaslp.mx/), which is under development and will be available online.
3. Results
This section presents the results of the proposed methodology fed with the ARI dataset and Internet search terms. The reported results focus on the last stage of the methodology, the computational model, i.e., the forecasting model (FFNN), the projection model (SoS), and the merge prediction.
3.1. Forecasting Model
The FFNN was tested to assess one-week-in-advance forecasting of the ARI data for the winter seasons between 2015 and 2019, one season at a time. Data from 2008 to 2018 were used to train the models; for example, data from the winter seasons between 2008 and 2015 were used as training data to forecast the 2015–2016 winter season.
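The season-by-season scheme described above can be sketched as a list of train/test splits. This is a minimal illustration that assumes an expanding training window and labels each season by its starting year; the function name, the `skipped` parameter, and the way a skipped year is handled are hypothetical choices, not details taken from the study.

```python
def season_splits(first_season, test_seasons, skipped=()):
    """Pair each test season with all earlier seasons as training data.

    Seasons are labelled by starting year: 2015 stands for winter 2015-2016.
    """
    splits = []
    for test in test_seasons:
        train = [s for s in range(first_season, test) if s not in skipped]
        splits.append((train, test))
    return splits


# Assumption: the season labelled 2009 is left out of training.
splits = season_splits(2008, [2015, 2016, 2017, 2018], skipped={2009})
for train, test in splits:
    print(f"test {test}-{test + 1}: {len(train)} training seasons, "
          f"{train[0]}-{train[-1] + 1}")
```

Each later test season gains one more training season, so no forecast ever uses data from the season it is evaluated on.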
To evaluate the accuracy of the forecasting model, several experiments (Exp 1 to 6) were performed. These assessed the training window size and evaluated the FFNN with and without retraining of the network every 13 weeks (a quarter of a year, denoted as Q1, Q2, Q3, and Q4 in the figures).
Table 3 and Table 4 show the details of the training and testing sets for all experiments to determine the best window length. Because the initial weights of the FFNN (its parameters) are initialized randomly, each experiment was executed one thousand times, creating one thousand networks; each FFNN was then tested with data from the 27th week of the starting year to the 26th week of the ending year. Training concluded when the best FFNN (the one with the lowest RMSE) was selected for use in the final testing. It is worth noting that the year 2009 was omitted from these tests because of the influenza A(H1N1) pandemic.
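The selection of the best network among many random initializations can be sketched as follows. This is an illustrative stand-in, not the FFNN configuration of the study: the one-hidden-layer architecture, the plain gradient-descent training, the hyperparameters, the synthetic series, and the 20 restarts (instead of one thousand) are all assumptions chosen to keep the example small.

```python
import numpy as np


def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def train_ffnn(X, y, hidden=4, epochs=300, lr=0.05, seed=0):
    """Train a one-hidden-layer tanh network with plain gradient descent."""
    rng = np.random.default_rng(seed)          # seed controls the random start
    W1 = rng.normal(0, 0.5, (X.shape[1], hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0, 0.5, hidden)
    b2 = 0.0
    n = len(y)
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)               # hidden activations
        err = H @ W2 + b2 - y                  # prediction error per sample
        dH = np.outer(err, W2) * (1 - H ** 2)  # backpropagated error
        W2 -= lr * (H.T @ err) / n
        b2 -= lr * err.mean()
        W1 -= lr * (X.T @ dH) / n
        b1 -= lr * dH.mean(axis=0)
    return W1, b1, W2, b2


def predict(params, X):
    W1, b1, W2, b2 = params
    return np.tanh(X @ W1 + b1) @ W2 + b2


def best_of_n(X_train, y_train, X_val, y_val, n_runs=20):
    """Train n_runs networks and keep the one with the lowest held-out RMSE."""
    best_params, best_score = None, float("inf")
    for seed in range(n_runs):
        params = train_ffnn(X_train, y_train, seed=seed)
        score = rmse(y_val, predict(params, X_val))
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Synthetic weekly series: two lagged weeks predict the next week.
rng = np.random.default_rng(42)
t = np.arange(120)
series = np.sin(2 * np.pi * t / 52) + 0.05 * rng.normal(size=t.size)
X = np.column_stack([series[:-2], series[1:-1]])
y = series[2:]
X_train, y_train, X_val, y_val = X[:90], y[:90], X[90:], y[90:]

params, score = best_of_n(X_train, y_train, X_val, y_val, n_runs=20)
```

Fixing the seed of each run makes the selection reproducible, which is useful when the chosen network is later reused for the final testing.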
The final testing was divided into quarters of 13 weeks each, but the results are grouped by year. In the retrained version of the tests, the network was retrained by adding the most recent data along with part of the previous training data. The version without retraining used the same network during the four quarters of each winter season test.
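The two testing variants can be sketched with a generic walk-forward loop. The example below substitutes a least-squares line for the FFNN so that it stays short, and it assumes that retraining keeps a fixed-length window (the newest weeks replace the oldest ones); which part of the previous training data is actually kept is not specified here, so that rule, like the function names, is hypothetical.

```python
import numpy as np

QUARTER = 13  # weeks per quarter, as in the text


def fit(X, y):
    """Stand-in model: least-squares line with intercept (not the FFNN)."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict(coef, X):
    return np.column_stack([X, np.ones(len(X))]) @ coef


def forecast_season(X_train, y_train, X_test, y_test, retrain):
    """One-week-ahead forecasts for one season, quarter by quarter."""
    coef = fit(X_train, y_train)
    preds = []
    for start in range(0, len(X_test), QUARTER):
        if retrain and start > 0:
            # Assumed rule: append the test weeks already observed and keep
            # a window of the original training length (oldest weeks dropped).
            X_all = np.vstack([X_train, X_test[:start]])[-len(X_train):]
            y_all = np.concatenate([y_train, y_test[:start]])[-len(y_train):]
            coef = fit(X_all, y_all)
        preds.append(predict(coef, X_test[start:start + QUARTER]))
    return np.concatenate(preds)


# Synthetic weekly series (not data from the study): last week -> next week.
rng = np.random.default_rng(0)
t = np.arange(262)
series = 100 + 40 * np.sin(2 * np.pi * t / 52) + rng.normal(0, 3, t.size)
X, y = series[:-1].reshape(-1, 1), series[1:]
X_train, y_train = X[:208], y[:208]
X_test, y_test = X[208:260], y[208:260]

static = forecast_season(X_train, y_train, X_test, y_test, retrain=False)
rolling = forecast_season(X_train, y_train, X_test, y_test, retrain=True)
```

Both variants share the same first quarter, since retraining only starts once a full quarter of new observations is available.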
The metrics of the results are shown in Table 5, where the best three results for each metric are in bold. An example of the results for the retrained FFNN is shown in Figure 9, and for the FFNN without retraining in Figure 10. These results show that Exp 6 without retraining has consistently more accurate results than the other experiments.
3.2. Merge Prediction
The merge prediction combines the forecasting model (FFNN), the projection model (SoS), and the endemic channels in a linear equation that reduces the error-based metrics. Results of this prediction for the 2017–2018 season are shown in Figure 11; these results can be compared with those obtained by the FFNN model, the SoS model, and the endemic channels in Figure 12. The model was tested for several seasons, from 2015 to 2019; see detailed results in Appendix A.
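One way to picture such a linear merge is as a weighted sum of the three components whose weights are chosen to minimize the squared error. The sketch below assumes ordinary least squares with an intercept and uses synthetic series; the actual form and coefficients of the merging equation are not reproduced here, and the function names are hypothetical.

```python
import numpy as np


def fit_merge(ffnn, sos, channel, observed):
    """Least-squares weights (plus intercept) for the three components."""
    A = np.column_stack([ffnn, sos, channel, np.ones(len(observed))])
    weights, *_ = np.linalg.lstsq(A, observed, rcond=None)
    return weights


def merge(weights, ffnn, sos, channel):
    A = np.column_stack([ffnn, sos, channel, np.ones(len(ffnn))])
    return A @ weights


def rmse(y, y_hat):
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))


# Synthetic stand-ins for one 52-week season (not data from the study).
rng = np.random.default_rng(1)
week = np.arange(52)
observed = 100 + 40 * np.sin(2 * np.pi * week / 52) + rng.normal(0, 5, 52)
ffnn = observed + rng.normal(0, 8, 52)                    # noisy, responsive
sos = 100 + 40 * np.sin(2 * np.pi * week / 52)            # smooth seasonal
channel = 95 + 35 * np.sin(2 * np.pi * (week - 1) / 52)   # historical median

weights = fit_merge(ffnn, sos, channel, observed)
merged = merge(weights, ffnn, sos, channel)
```

On the data used to fit the weights, the merged RMSE cannot exceed that of any single component, because each component alone is a special case of the weighted sum. In practice the weights would be fitted on past seasons and then applied to the season being predicted.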
This merged result has a reduced RMSE, RMSPE, and MAPE and a higher correlation coefficient compared with the FFNN and SoS individually. The average results for these metrics obtained with the three methods during the four study seasons are shown in Table 6. This proposal is also compared with similar state-of-the-art models that use data from Google searches [32] and Twitter [31,32] for influenza surveillance and ILI forecasting with machine learning algorithms such as AdaBoost, support vector machines (SVM) with linear and radial basis function (RBF) kernels, and long short-term memory (LSTM) networks. The best one-week forecast results were selected for comparison with this methodology; see Table 6. It is worth noting that the RMSE is not a normalized metric: it depends on the units of measurement and therefore cannot be used to compare results across studies.
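The unit dependence of the RMSE can be checked numerically. The sketch below uses common textbook forms of the metrics; the MAPE follows the "maximum absolute percentage error" expansion given in the Introduction, which is an assumption about the definition used in the evaluation, and the numbers are made up for illustration.

```python
import numpy as np


def rmse(y, y_hat):
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))


def rmspe(y, y_hat):
    """Root mean squared percentage error, in percent."""
    return float(100 * np.sqrt(np.mean(((y - y_hat) / y) ** 2)))


def mape(y, y_hat):
    """Maximum absolute percentage error, in percent (assumed definition)."""
    return float(100 * np.max(np.abs((y - y_hat) / y)))


def pearson(y, y_hat):
    return float(np.corrcoef(y, y_hat)[0, 1])


# Illustrative values: the same series expressed in thousands and in units.
cases_k = np.array([120.0, 150.0, 210.0, 180.0, 140.0])
forecast_k = np.array([110.0, 160.0, 190.0, 185.0, 150.0])
cases, forecast = cases_k * 1000, forecast_k * 1000

print("RMSE :", rmse(cases_k, forecast_k), "vs", rmse(cases, forecast))
print("RMSPE:", rmspe(cases_k, forecast_k), "vs", rmspe(cases, forecast))
```

Changing the unit multiplies the RMSE by the same factor, whereas the percentage-based metrics and the correlation stay the same, which is why only the latter are suitable for comparing studies that report different quantities.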
4. Discussion
Over the last century, the availability of vaccines and antibiotics has resulted in a significant reduction of the impact of infectious diseases in the human population worldwide. Nevertheless, infectious diseases continue to cause significant morbidity and mortality. Of special importance, the occurrence of epidemics and pandemics has resulted in major loss of life, health system saturation, and economic burdens. Early identification of outbreaks and epidemics is considered essential in order to limit the spread and effects of an infection within a community or country. Epidemiological surveillance systems are essential tools to identify the onset of outbreaks. Surveillance systems can rely on information that is obtained actively or passively. Active surveillance systems require that epidemiological information be obtained through case-finding activities, which may require identifying individuals that fulfill certain case definitions or carrying out laboratory tests to detect a specific pathogen. In contrast, passive surveillance may use routinely gathered information for analysis. As a result, active surveillance systems tend to be more complete but more expensive than passive systems [58]. In both instances, analysis of the information that has been gathered is a key element in order to distinguish changes in disease occurrence that indicate the onset of an epidemic from fluctuations that may be considered normal. Of interest, current computing power allows the analysis of large data sets, and with the use of diverse algorithms it is possible to identify signal changes that might be difficult to observe under traditional epidemiological analyses. In addition, the expanding use of the internet has resulted in the potential use of temporal and geographic query patterns for infectious disease surveillance [59]. As a result, over the last decade there has been an increasing interest in the use of internet usage patterns to analyze infectious disease dynamics and to forecast expected behaviors and the occurrence of epidemics [60].
In the present work we describe a computational model for ARI surveillance that might allow for early detection of outbreaks. Our model is based on ARI data reported on a weekly basis to the Health Ministry in Mexico as well as the number of internet searches for a set of terms by Mexican Google users. The model has been tested with historical data and was able to predict the behavior of ARI data for four successive winter seasons (2015–2019); the best MAPE results were obtained with the SoS projection and the merge prediction, with 21.7% and 30.9%, respectively. In order to assess these results, we contrasted several metrics (Pearson correlation, RMSPE, and MAPE) with those reported with the use of other methodologies that have analyzed the behavior of respiratory infectious diseases. Unfortunately, there are no previous studies focused on ARI that provide similar metrics to assess the accuracy of forecasting or that encompass the same geographical and temporal boundaries as our study. Therefore, we included studies that have assessed specific respiratory infections (such as influenza or influenza-like illness) and that have shown very good results [31,32]. The performance of this methodology was competitive in comparison to results reported in those studies with the use of other forecasting methodologies. The modular structure of the proposed model makes it possible to change the forecasting or projection model to enhance the results; in addition, a decision-making stage could be added in which the predictions are analyzed to detect potential outbreaks and send alerts when one is identified.
The main advantage of the proposed model is its use of readily available data, such as Internet search terms and routine disease surveillance data (ARI data), to predict an infectious disease. Some previous reports that describe forecasting of respiratory infections (such as influenza infections) rely on samples obtained for virological testing rather than syndromic clinical reports [61]. While our model could be limited when assessing the behavior of a specific microorganism (such as influenza), it could allow for the timely identification of outbreaks when the etiological agent is unknown, such as the appearance of unusual cases of pneumonia late in 2019 in Wuhan, China, which were subsequently identified as caused by a novel viral strain (SARS-CoV-2) [11]. In addition, because our system is based on routinely obtained information and does not require specific laboratory tests, it is expected to allow for surveillance of wide geographical areas, even in regions where laboratory facilities are not available. This could have immediate application in epidemiological surveillance as a complement to already established strategies (for example, in the current SARS-CoV-2 pandemic) [62,63]. This methodology could be adapted for use at a subnational level (such as the regional or state level). Overall, the expected usefulness of ARI analysis using this methodology includes the timely identification of an increase in the number of ill persons. Of note, we observed that the FFNN forecast (Exp 6 without retraining) performed well in relation to the peak number of ARI cases. A limitation of this observation is that our study included forecasts for only four winter seasons, which is too few to be certain that this finding is reproducible in all seasons. Appropriate interventions when outbreak signals are identified by this methodology could include targeted laboratory testing, institution of outbreak assessment and control measures, and mobilization of health-care supplies (such as medications or vaccines, when available). In addition, the methodology can be adapted for use in other countries, which may be particularly useful in regions where surveillance systems require strengthening. Furthermore, this proposal could also be used to assess other infectious diseases that show seasonal patterns, such as gastrointestinal infections, dengue, and varicella. The inclusion of an automatic strategy for term removal is also an advantage: it allows the search term list to be updated and many additional terms to be included, because the number of terms in the initial list does not matter; the model reduces the list to the minimum required for prediction, which eliminates subjectivity. Nevertheless, collecting and analyzing a long list of new search terms would require additional time. During the first stage of the development of our model, Google Correlate was used to obtain the initial list of potential search terms by analyzing their correlation with ARI data; unfortunately, this tool was discontinued at the end of 2019. Nevertheless, potential search terms can still be assessed with correlation tests, such as Pearson’s correlation.
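A correlation-based screening of candidate search terms, as suggested at the end of the paragraph above, could look like the following sketch. The term names, the synthetic series, and the 0.7 threshold are hypothetical; this is a simple pre-filter, not the automatic term-removal strategy of the model.

```python
import numpy as np


def screen_terms(term_series, ari, threshold=0.7):
    """Keep terms whose weekly search volume correlates with ARI counts.

    term_series maps a term to a weekly series aligned with `ari`.
    The threshold is an assumed cut-off, not a value from the study.
    """
    kept = {}
    for term, volume in term_series.items():
        r = float(np.corrcoef(volume, ari)[0, 1])   # Pearson correlation
        if r >= threshold:
            kept[term] = r
    # Strongest correlations first.
    return dict(sorted(kept.items(), key=lambda item: item[1], reverse=True))


# Synthetic example with placeholder term names (two years of weekly data).
week = np.arange(104)
ari = 100 + 50 * np.sin(2 * np.pi * week / 52)
rng = np.random.default_rng(7)
terms = {
    "term_a": 2 * ari + 5,                               # perfectly aligned
    "term_b": ari + rng.normal(0, 10, week.size),        # aligned with noise
    "term_c": 50 + 20 * np.cos(2 * np.pi * week / 52),   # out of phase
}

kept = screen_terms(terms, ari)
```

Here the out-of-phase term is discarded because its correlation with the ARI series is close to zero, while the two aligned terms are kept and ranked.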
For the feature extraction stage, we assessed several approaches for endemic channel smoothing, including polynomial, spline, and Fourier methods, among others [49]. In addition, we explored several techniques to model the cyclic behavior of the ARI time series and found that SoS is simple and resulted in improved data fitting. We also explored other techniques to model the cyclic behavior, such as the Holt–Winters method, which is used in economics; however, these are more complex and did not improve the model.
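To illustrate why a sum of sines is a simple way to capture the cyclic behavior, the sketch below fits a few sine terms to a weekly series and projects them forward. It is a simplification under stated assumptions: the frequencies are fixed to harmonics of a 52-week period and a constant offset is added, so that each term a·sin(ωt + c) can be rewritten as α·sin(ωt) + β·cos(ωt) and fitted by linear least squares. A general SoS fit would also estimate the frequencies, and the function names here are hypothetical.

```python
import numpy as np


def design(t, n_terms, period):
    """Columns: constant, then sin and cos for each harmonic of the period."""
    cols = [np.ones_like(t, dtype=float)]
    for k in range(1, n_terms + 1):
        w = 2 * np.pi * k / period
        cols += [np.sin(w * t), np.cos(w * t)]
    return np.column_stack(cols)


def fit_sos(t, y, n_terms=3, period=52.0):
    """Least-squares fit of a fixed-frequency sum of sines."""
    coef, *_ = np.linalg.lstsq(design(t, n_terms, period), y, rcond=None)
    return coef


def project_sos(coef, t, period=52.0):
    """Evaluate the fitted model at (possibly future) weeks t."""
    n_terms = (len(coef) - 1) // 2
    return design(t, n_terms, period) @ coef


def seasonal(t):
    """Noise-free synthetic signal used only to check the fit."""
    return (100 + 40 * np.sin(2 * np.pi * t / 52 + 0.8)
            + 10 * np.sin(4 * np.pi * t / 52 - 0.3))


t_past = np.arange(156, dtype=float)         # three years of weekly data
t_future = np.arange(156, 208, dtype=float)  # the following year

coef = fit_sos(t_past, seasonal(t_past))
projection = project_sos(coef, t_future)
```

Because the model is periodic, the projection for the following year repeats the fitted seasonal shape, which is what makes it usable as a baseline alongside a responsive forecasting model.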