Spatio-Temporal Analysis of Influenza-Like Illness and Prediction of Incidence in High-Risk Regions in the United States from 2011 to 2020

According to the Centers for Disease Control and Prevention, about 8% of Americans contract influenza during an average season. Early warning and prediction of influenza therefore need to be strengthened. In this study, spatial autocorrelation analysis and spatial scanning analysis were used to identify the spatiotemporal patterns of influenza-like illness (ILI) prevalence in the United States during the 2011–2020 transmission seasons. A seasonal autoregressive integrated moving average (SARIMA) model was constructed to predict influenza incidence in high-risk states. The highest incidence of ILI was concentrated in Louisiana, the District of Columbia and Virginia. Mississippi was a high-risk state with a high influenza incidence and formed a high-high cluster with neighboring states. A SARIMA (1, 0, 0) (1, 1, 0)52 model was suitable for forecasting the ILI incidence of Mississippi. The relative errors between actual and predicted values indicated that the predictions matched the observations well. Influenza remains an important health problem in the United States. The spread of ILI varies by season and geographical region: incidence peaked in winter and spring, and the states with higher influenza rates were concentrated in the southeast. Increased surveillance in high-risk states could help control the spread of influenza.


Introduction
Influenza is caused by the influenza virus, which spreads mainly through airborne droplets and direct contact. It is characterized by strong infectivity, rapid transmission and antigenic variation. Seasonal influenza activity begins to increase in October, most often peaks between December and February, and can remain elevated until May. Influenza virus infections are very common, and their incidence can only be estimated [1]. Previous estimates attributed to the World Health Organization indicated that 250,000-500,000 influenza-associated deaths occur annually, corresponding to 3.8-7.7 deaths per 100,000 individuals when calculated using the 2005 United Nations Department of Economic and Social Affairs World Population Prospects [2]. In particular, the 2017-2018 influenza season in the United States was notable for its high severity, with about 45 million illnesses and 810,000 influenza-associated hospitalizations [2].
Because influenza can present with fever, cough, sore throat, runny or stuffy nose, body aches, headache, chills or fatigue, it is hard to diagnose on the basis of symptoms alone. A number of tests are available to detect influenza viruses in respiratory specimens; the most common are rapid influenza diagnostic tests (RIDTs) [3]. However, because not everyone is tested for influenza, the number of reported cases may significantly underestimate the actual prevalence of

Data Resources
Information on outpatient visits to health care providers for ILI is collected weekly through the United States Outpatient Influenza-like Illness Surveillance Network. In this system, a confirmed influenza case is "A patient who tests positive for influenza virus infection by an approved laboratory test", and ILI is defined as "fever (temperature of 100 °F (37.8 °C) or greater) and a cough and/or a sore throat without a known cause other than influenza" (https://www.cdc.gov/flu/weekly/overview.htm, accessed on 24 November 2020).
We collected ILI data from the 1st week of 2011 to the 29th week of 2020 from the Centers for Disease Control and Prevention (CDC). The data included the number of ILI cases in 49 states for different age groups. For hotspot states, time-series models were fitted to data from the 1st week of 2011 to the 52nd week of 2018, and data from the 1st week of 2019 to the 29th week of 2020 were held out as test data to assess forecast performance [17].

Spatiotemporal Cluster Analysis
Moran's I is an important index for analyzing the spatial correlation of diseases [18]. It ranges from −1 to 1, where 0 indicates a random spatial distribution of influenza. A value close to 1 indicates that neighboring units have similar values, and a value close to −1 indicates that units with high values are adjacent to units with low values [19]. Based on its value and significance, the local form of Moran's I can detect four types of cluster: high-high (HH), high-low (HL), low-low (LL) and low-high (LH). The number of permutations was 999, and the significance level was 0.05 [20].
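The global statistic and its permutation test can be illustrated with a minimal sketch. It is not the GeoDa implementation used in the study: the function names, the binary contiguity weights and the six-region example are hypothetical, and the one-sided pseudo p value is an assumed convention.

```python
import numpy as np

def morans_i(values, weights):
    """Global Moran's I for values (n,) and a spatial weights matrix (n, n)."""
    x = np.asarray(values, dtype=float)
    w = np.asarray(weights, dtype=float)
    z = x - x.mean()                      # deviations from the mean
    n = x.size
    return (n / w.sum()) * (z @ w @ z) / (z @ z)

def permutation_p_value(values, weights, n_perm=999, seed=0):
    """One-sided pseudo p value from random relabelling of the regions."""
    rng = np.random.default_rng(seed)
    x = np.asarray(values, dtype=float)
    observed = morans_i(x, weights)
    permuted = np.array(
        [morans_i(rng.permutation(x), weights) for _ in range(n_perm)]
    )
    # Count permuted statistics at least as large as the observed one
    p = (np.sum(permuted >= observed) + 1) / (n_perm + 1)
    return observed, p

# Hypothetical example: six regions on a line; adjacent regions are neighbours
rates = np.array([9.0, 8.5, 8.0, 2.0, 1.5, 1.0])
W = np.zeros((6, 6))
for i in range(5):
    W[i, i + 1] = W[i + 1, i] = 1.0

I_obs, p_val = permutation_p_value(rates, W)
print(f"Moran's I = {I_obs:.3f}, pseudo p = {p_val:.3f}")
```

In this toy layout the high rates sit next to each other, so the statistic is positive; an alternating high/low pattern would push it toward −1.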
Spatiotemporal cluster analysis measures temporal and spatial correlation by extending spatial autocorrelation to take the time factor into account [21]. It can relate the spatial characteristics of influenza to its temporal characteristics [16]. Clusters during the study period were detected with the retrospective space-time permutation scan statistic [22]. A retrospective analysis covers a fixed geographic area and a fixed study period. The SaTScan software scans over multiple start and end dates and evaluates both alive clusters (which last until the end date of the study period) and historic clusters (which ceased to exist before the end date of the study period). The space-time scan statistic is defined by a cylindrical window with a circular geographic base and a height corresponding to time. The window size was varied continuously to detect possible spatiotemporal clusters [22]. To scan for both small and large clusters, the maximum spatial window size was set to 50% of the total population at risk, and the maximum temporal window size was set to 50% of the total study period [23]. The log likelihood ratio (LLR) was used to compare observed and expected case numbers and identify specific clusters. After the most likely spatiotemporal clusters were detected, their significance was tested by the Monte Carlo method [24].
Monte Carlo simulation generates random replicates of the data set under the null hypothesis to determine the statistical significance of the results. The p value is calculated by comparing the maximum likelihood ratio from the real data set with the maximum likelihood ratios from the random data sets, where p = rank/(1 + number of simulations) [11]. At least 999 replications are recommended to ensure sufficient accuracy. Therefore, we used 999 Monte Carlo replications to estimate the significance level of these clusters. A cluster was considered significant if its aggregated pattern was maintained when compared with 999 randomized simulations of the entire dataset [25].
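As a rough illustration of how observed and expected counts enter the LLR, the sketch below scores a single space-time cylinder. It assumes the Poisson-type likelihood ratio commonly used with scan statistics and the space-time permutation expectation (row total times column total divided by the grand total); the function names and the small case matrix are hypothetical, and this is not the SaTScan code itself.

```python
import numpy as np

def expected_counts(cases):
    """Expected cases per (region, week) cell under the permutation null."""
    cases = np.asarray(cases, dtype=float)
    total = cases.sum()
    return np.outer(cases.sum(axis=1), cases.sum(axis=0)) / total

def cylinder_llr(cases, regions, week_start, week_end):
    """LLR of one cylinder: a set of regions over an inclusive week range."""
    cases = np.asarray(cases, dtype=float)
    total = cases.sum()
    rows = np.asarray(regions)
    c = cases[rows, week_start:week_end + 1].sum()                    # observed
    mu = expected_counts(cases)[rows, week_start:week_end + 1].sum()  # expected
    if c <= mu:
        return 0.0  # only excess-risk (high-rate) cylinders are scored
    inside = c * np.log(c / mu)
    outside = 0.0 if c == total else (total - c) * np.log((total - c) / (total - mu))
    return inside + outside

# Hypothetical weekly counts: 3 regions x 4 weeks, with an excess in region 1
cases = np.array([
    [10, 10, 10, 10],
    [10, 40, 40, 10],
    [10, 10, 10, 10],
])
llr = cylinder_llr(cases, regions=[1], week_start=1, week_end=2)
print(f"LLR = {llr:.3f}")
```

A full scan would evaluate this score over many candidate cylinders and keep the maximum; the sketch shows only the scoring step.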
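The rank-based p value described above can be written in a few lines. This is a minimal sketch; the function name and the simulated statistics are hypothetical placeholders for the maximum LLR values obtained from the randomized data sets.

```python
import numpy as np

def monte_carlo_p_value(observed_stat, simulated_stats):
    """p = rank / (1 + number of simulations), with rank 1 for the largest value."""
    sims = np.asarray(simulated_stats, dtype=float)
    rank = 1 + int(np.sum(sims >= observed_stat))
    return rank / (1 + sims.size)

# Hypothetical example: the observed statistic exceeds all 999 simulated maxima
rng = np.random.default_rng(1)
simulated = rng.uniform(0.0, 5.0, size=999)
p = monte_carlo_p_value(10.0, simulated)
print(p)
```

With 999 replications the smallest attainable p value is 1/1000 = 0.001, which is why results from such tests are often reported as p < 0.001 or p = 0.001.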

Time-Series Analysis
Time-series analysis is well suited to predicting incidence: it uses the number of patients in the past to predict the number of patients in the future. The SARIMA model is based on the sequential lag relationships in time-series data and is more suitable for forecasting when the data show obvious seasonality [13]. The model is written as SARIMA (p, d, q) (P, D, Q) s, where p, d and q are the orders of autoregression, differencing and moving average; P, D and Q are the orders of seasonal autoregression, seasonal differencing and seasonal moving average; and s is the length of the seasonal cycle. The cycle of influenza in the United States is 52 weeks (s = 52) [13].
The SARIMA model was built in three steps. First, a weekly time-series plot of incidence (per 100,000 population) was drawn to check for stationarity and seasonality, and candidate models were specified from the autocorrelation function (ACF) and partial autocorrelation function (PACF). Second, the residuals of each candidate model were checked with the Ljung-Box Q test, and the model with the lowest Bayesian information criterion (BIC) was taken as the optimal SARIMA model. Finally, the model was applied to forecast the weekly ILI incidence from the 30th week of 2020 to the 52nd week of 2021.
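For reference, the SARIMA (p, d, q) (P, D, Q) s notation corresponds to the following standard backshift-operator form. The symbols B (backshift operator), y_t (weekly incidence), ε_t (white noise) and the polynomial names are notation assumed here rather than taken from the study, and no constant term is shown.

```latex
\phi_p(B)\,\Phi_P(B^{s})\,(1-B)^{d}\,(1-B^{s})^{D}\,y_t
  = \theta_q(B)\,\Theta_Q(B^{s})\,\varepsilon_t,
\qquad B^{k} y_t = y_{t-k},
```

```latex
\phi_p(B) = 1 - \sum_{i=1}^{p}\phi_i B^{i},\quad
\Phi_P(B^{s}) = 1 - \sum_{i=1}^{P}\Phi_i B^{is},\quad
\theta_q(B) = 1 + \sum_{j=1}^{q}\theta_j B^{j},\quad
\Theta_Q(B^{s}) = 1 + \sum_{j=1}^{Q}\Theta_j B^{js}.
```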
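The diagnostic quantities named in these steps can be sketched with NumPy alone. The functions below are illustrative and hypothetical, not the R or SPSS routines used in the study; they show seasonal differencing with s = 52, the sample ACF, and the Ljung-Box Q statistic. Converting Q to a p value requires a chi-square distribution whose degrees of freedom depend on the number of fitted parameters, which is omitted here.

```python
import numpy as np

def seasonal_difference(y, s=52):
    """Seasonal differencing (D = 1): w_t = y_t - y_{t-s}."""
    y = np.asarray(y, dtype=float)
    return y[s:] - y[:-s]

def acf(x, max_lag):
    """Sample autocorrelations at lags 1..max_lag."""
    x = np.asarray(x, dtype=float)
    z = x - x.mean()
    denom = z @ z
    return np.array([(z[k:] @ z[:-k]) / denom for k in range(1, max_lag + 1)])

def ljung_box_q(residuals, max_lag):
    """Q = n (n + 2) * sum_k r_k^2 / (n - k) over lags k = 1..max_lag."""
    n = len(residuals)
    r = acf(residuals, max_lag)
    lags = np.arange(1, max_lag + 1)
    return n * (n + 2) * np.sum(r**2 / (n - lags))

# Hypothetical data: white noise versus a strongly autocorrelated random walk
rng = np.random.default_rng(0)
noise = rng.normal(size=500)
q_noise = ljung_box_q(noise, max_lag=16)
q_walk = ljung_box_q(np.cumsum(noise), max_lag=16)
print(f"Q(noise) = {q_noise:.1f}, Q(random walk) = {q_walk:.1f}")
```

A small Q for the residuals of a fitted model is consistent with white noise, whereas a large Q indicates remaining autocorrelation.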

Statistical Analysis
The data were organized in Microsoft Excel 2013. The SARIMA model was constructed with R 3.6.0 and SPSS 27.0. Moran's I and local indicators of spatial association were calculated with GeoDa 1.14.0. The scan statistic was computed with SaTScan™ 9.5. All maps were drawn with ArcGIS 10.0.

Epidemiological Analysis
A total of 9,065,910 ILI cases from the 1st week of 2011 to the 29th week of 2020 in the United States were included in our study. The annual ILI infection rate fluctuated from 5.92 to 15.84 per 100,000 population. ILI occurred throughout the year, most often peaked between December and February, and lasted until May.
In terms of age, the 5-24 years age group had the most ILI cases, accounting for about 35 percent, while the group over 65 years old had the fewest, accounting for about 7 percent (Figure 1). The difference between age groups was statistically significant (p < 0.001).
This study collected the population of 49 states and visualized it on a map (Figure 2), and found no obvious association between population density and influenza incidence.

Spatiotemporal Analysis
Overall, the highest cumulative incidence of ILI during the study period was seen in Louisiana, the District of Columbia and Virginia, with 12,200, 9,563 and 9,554 cases per 100,000 population, respectively. The lowest cumulative incidence of ILI was reported from Ohio, Washington and Iowa (Figure 3).

Global Spatial Autocorrelation
The global spatial autocorrelation analysis for ILI suggested a clustered distribution at the state level from 2012 to 2017, when the global Moran's I was significant at the 0.05 level. In contrast, the global Moran's I for 2011, 2018 and 2019 showed no significant spatial autocorrelation, although it was greater than 0 (Table 1).

Local Spatial Autocorrelation
Local spatial autocorrelation analysis reveals only relative, rather than absolute, correlations between states. Only states whose local Moran's I was significant at the 0.05 level are shown on the map. From 2011 to 2019, the local spatial autocorrelation showed three HH clusters, two HL clusters, four LH clusters and three LL clusters. HH clusters were observed in the states of Louisiana (5 years

Spatiotemporal Cluster Analysis
The spatiotemporal cluster analysis detected 23 clusters of ILI in the study period. The clusters were particularly obvious in spring and winter. For example, the risk ratio (RR) was highest in 2015, when three levels of clustering were found. The level 1 cluster was centered on Louisiana and included two surrounding states; the risk of ILI in this area was 11.66 times that in other areas (LLR = 69,009, p < 0.001). The level 2 cluster was centered on Virginia and included three surrounding states; the risk of ILI was 9.79 times that in other areas (LLR = 73,277, p < 0.001). The level 3 cluster was centered on New Mexico and included three surrounding states; the risk of ILI was 3.38 times that in other areas (LLR = 26,518, p < 0.001). The states identified as high clusters in the local spatial autocorrelation analysis were all located within these high-risk cluster areas, so the results of the two analyses were consistent. In terms of timing, the high-incidence periods mainly occurred between January and March (Table 2).
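The risk ratio of a cluster compares the observed-to-expected ratio inside the cluster with the same ratio outside it. The sketch below assumes the definition RR = (c / E[c]) / ((C − c) / (C − E[c])), where c is the observed count inside the cluster, E[c] the expected count and C the total number of cases. The function name and the numbers are hypothetical and are not taken from Table 2.

```python
def relative_risk(observed_in, expected_in, total_cases):
    """Ratio of observed/expected inside the cluster to observed/expected outside."""
    observed_out = total_cases - observed_in
    expected_out = total_cases - expected_in
    return (observed_in / expected_in) / (observed_out / expected_out)

# Hypothetical cluster: 80 cases observed where 40 were expected, 200 cases in total
rr = relative_risk(observed_in=80, expected_in=40, total_cases=200)
print(f"RR = {rr:.2f}")
```

In this toy case the area inside the cluster has about 2.67 times the risk of the area outside it.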

Time-Series Analysis
Based on the results of the spatiotemporal analysis, the HH cluster was mainly in Mississippi and Louisiana. In particular, Mississippi has exhibited an HH cluster in recent years. We therefore predicted the incidence of ILI in Mississippi by time-series analysis.
Using the raw training data from the 1st week of 2011 to the 52nd week of 2018, the trend difference (d = 0) and seasonal difference (D = 1) were determined (Figure 5). The augmented Dickey-Fuller test indicated that the sequence was stationary (t = −3.98, p = 0.01). The ACF and PACF plots were used to estimate the parameter ranges of p, P and q, Q [26]. After checking the ACF and PACF plots, SARIMA (1, 0, 0) (1, 1, 0) 52 was the best-fitting model, with the lowest AIC and BIC values. The Ljung-Box Q test for this model was not significant (χ² = 21.822, p = 0.149), indicating that the residuals were a white noise sequence. All the parameter estimates were significant (Table 3).
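To make the structure of a SARIMA (1, 0, 0) (1, 1, 0) 52 model concrete, the sketch below produces point forecasts by recursion. It assumes the form (1 − φB)(1 − ΦB^52)(1 − B^52) y_t = ε_t with no constant term, so that the seasonally differenced series w_t = y_t − y_{t−52} satisfies w_t = φ w_{t−1} + Φ w_{t−52} − φΦ w_{t−53} + ε_t. The function name and the coefficient values are hypothetical; the fitted estimates are those reported in Table 3.

```python
import numpy as np

def sarima_100_110_forecast(y, phi, seasonal_phi, steps, s=52):
    """Recursive point forecasts; future errors are set to their mean of zero."""
    hist = [float(v) for v in y]
    if len(hist) < 2 * s + 1:
        raise ValueError("need at least 2*s + 1 observations")
    for _ in range(steps):
        t = len(hist)

        def w(k):
            # Seasonally differenced value at lag k: y_{t-k} - y_{t-k-s}
            return hist[t - k] - hist[t - k - s]

        w_hat = phi * w(1) + seasonal_phi * w(s) - phi * seasonal_phi * w(s + 1)
        hist.append(hist[t - s] + w_hat)  # undo the seasonal difference
    return np.array(hist[-steps:])

# Hypothetical series: a perfectly repeating 52-week pattern over four years
pattern = np.arange(52, dtype=float)
series = np.tile(pattern, 4)
forecast = sarima_100_110_forecast(series, phi=0.5, seasonal_phi=-0.3, steps=52)
```

For a perfectly periodic input the seasonal differences are all zero, so the forecast simply repeats last year's pattern; with real data the two autoregressive terms adjust that seasonal baseline.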

The forecasting performance of the SARIMA (1, 0, 0) (1, 1, 0) 52 model was tested by comparing the predicted values with the observed values from the 1st week of 2019 to the 29th week of 2020. In Figure 6, the black and blue lines represent the observed and predicted values, respectively, and the dark gray and light gray bands represent the 80% and 95% confidence intervals, respectively. The predicted trend of ILI incidence was broadly consistent with the actual trend, and both the root mean squared error (RMSE) and mean absolute percent error (MAPE) were small, indicating that the model predictions were reliable. The SARIMA model was then used to forecast the ILI incidence from the 30th week of 2020 to the 52nd week of 2021. The forecast showed a high ILI incidence in winter and spring and a low ILI incidence in summer and autumn, with incidence peaking in the 6th week of 2021 (Table 4).
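The two accuracy measures can be computed as follows. This is a minimal sketch with hypothetical values rather than the study's test data; MAPE is undefined when an observed value is zero.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error."""
    a = np.asarray(actual, dtype=float)
    p = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((a - p) ** 2)))

def mape(actual, predicted):
    """Mean absolute percent error, in percent."""
    a = np.asarray(actual, dtype=float)
    p = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs((a - p) / a)) * 100.0)

# Hypothetical weekly incidence values
observed = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]
print(f"RMSE = {rmse(observed, predicted):.2f}, MAPE = {mape(observed, predicted):.2f}%")
```

RMSE is expressed in the units of the incidence series and penalizes large errors more heavily, while MAPE is scale-free, which makes it easier to compare across states with different incidence levels.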

Discussion
Influenza is a contagious respiratory illness caused by influenza viruses. It can cause mild to severe illness, and serious cases can result in hospitalization or death. Because detection rates are low, the severity of influenza is easily underestimated. The weekly ILI surveillance conducted by the CDC provides timely information on influenza trends. The main purpose of this study was to explore the epidemiological characteristics of ILI incidence, to identify the states and possible clusters with high ILI incidence in the United States through spatiotemporal analysis, and then to construct a SARIMA model for short-term prediction of ILI incidence.
In the descriptive analysis, we recorded age characteristics, seasonal peaks and regional differences. The individuals most at risk for severe symptoms and complications from this virus are the very young, vulnerable older adults, pregnant women, immunocompromised individuals of all ages, and those with chronic comorbid conditions [27]. Many studies have indicated that influenza viruses cause severe morbidity and mortality in the elderly [28]. However, in this study, people aged 65 and over accounted for the lowest proportion of ILI cases. One reason is that the population aged 65 years and over is relatively small. During the study period, a gradual increase in ILI incidence was observed, with a sharp increase during the 2018 and 2019 influenza seasons. Many factors can explain influenza outbreaks; both extreme weather and insufficient vaccination are important factors affecting the incidence of influenza [29].
Spatiotemporal analysis has been used to identify high-risk areas for multiple diseases [20]. Yue [30] and Freitas [31] used spatiotemporal analysis to identify the spatial clustering characteristics of dengue fever cases. Liu [23] used spatiotemporal scanning analysis to explore the high-risk areas of hand, foot and mouth disease. In this study, the high incidence of influenza was mainly concentrated in Louisiana, Virginia and Mississippi. Spatiotemporal analysis revealed that the HH clusters and high-risk states were mainly located in Mississippi, and that the temporal clusters were mainly concentrated in January to March. This finding is consistent with other studies [32,33]. Time-series analysis uses the number of patients in the past to predict the number of patients in the future [13]. The 95% confidence interval of the predicted ILI incidence contained almost all of the observed values, and the RMSE and MAPE were small, which supports the effectiveness of the SARIMA model in predicting ILI. We then used this model to forecast the ILI incidence from the 30th week of 2020 to the 52nd week of 2021. The results indicated that ILI incidence would increase from the 45th week of 2020 and peak in the 6th week of 2021, with a distribution similar to that of previous years.
Influenza viruses spread through human contact. Therefore, geography and population density are potential factors in influenza transmission [34]. According to Garrett's research, high population density accelerates the spread of influenza [35]. In this study, the Northeast and Southwest, the most densely populated areas of the United States, had lower ILI incidence. Mississippi is a mostly rural state with a low population density and the highest incidence, which differs from the results of Garrett's research [35]. One possible explanation is the lower economic level of Mississippi [35,36]. During epidemics, the poorest part of the population usually suffers the most. In addition, Mississippi is the state with the highest proportion of Black Americans, and many studies have shown that Black individuals account for a higher proportion of influenza cases [37][38][39].
Transmission of influenza varies across seasons and geographical areas in the United States. The obvious temporal clusters occurred during winter and spring, in accordance with the seasonality of respiratory disease [40]. Most parts of the United States have a temperate or subtropical climate. The continental climate zone in the central plain is characterized by cold winters. Winters in temperate regions are characterized by an average temperature between 0 °C and 20 °C, with the minimum temperature dropping to as low as −40 °C in some regions. Generally speaking, influenza peaks in winter in temperate regions, whereas the seasonal pattern in subtropical regions seems to be more complicated [29]. Influenza transmission is influenced by variations in meteorological variables such as temperature, absolute humidity and precipitation. In the annual influenza epidemics of the United States, the transmission of influenza increases during periods with low precipitation and low absolute humidity [41]. Absolute humidity is the total water content of the air, and the survival rate of the influenza virus increases at lower absolute humidity levels. Relative humidity increases with high precipitation. High relative humidity accelerates the accumulation of respiratory droplets, which reduces the spread of influenza virus, whereas low relative humidity favors its spread. The temperature range suitable for influenza virus transmission could partially explain the common winter epidemics in the central regions [42].

Limitations
Some limitations of this study should be highlighted. First, since all influenza activity reporting by public health partners and healthcare providers was voluntary, it was difficult to maintain the quality and consistency of the data source. Second, the influenza incidence in 2020 may be underestimated because of mask wearing. Therefore, the 2020 data were used for model testing instead of model building. Third, the factors influencing influenza activity were not studied in depth. In the next step, these risk factors need to be explored.

Conclusions
In this study, we found that the high-risk clusters were concentrated in the southeast and that the incidence of influenza may peak in the 6th week of 2021. To limit the spread of outbreaks, surveillance activities and health education should be targeted at higher-incidence areas during the epidemic season.
Author Contributions: Conceptualization, X.S.; methodology, J.B.; writing-original draft preparation, Z.S.; writing-review and editing, X.J., Y.Y. and H.Z.; All authors have read and agreed to the published version of the manuscript.