Novel coronavirus disease (COVID-19) has rapidly spread worldwide, becoming a global health threat [1]. The disease was first identified in Wuhan, China, and has since spread across the world [2]. According to the World Health Organization [3], as of 4 June 2020, there have been more than 6.4 million confirmed cases and over 380 thousand deaths worldwide. These statistics have surpassed the number of deaths and cases for Middle East respiratory syndrome (MERS) and severe acute respiratory syndrome (SARS) since their outbreaks [4]. The pandemic has directly impacted the economy, society, and healthcare systems. According to the International Monetary Fund [5], global economic growth in the year 2020 is estimated to be -3.0%, compared to +2.9% in 2019. The United Nations predicts that the pandemic can continue to adversely impact societies with perpetual disease spread due to improper policy interventions [6].
Although the United States is ranked number one in the global health security index [7], it is the leading country in the number of confirmed cases and deaths globally [8]. As of 4 June 2020, there have been over 1.8 million confirmed cases and more than 108,000 deaths in this country [9]. Moreover, the case fatality ratio (CFR) continues to fluctuate in this country. As of 4 June 2020, the United States ranks in ninth place worldwide, with a CFR of 5.8% [10].
Recent studies have demonstrated that preexisting conditions, such as cardiovascular diseases [11], respiratory diseases [12], cancer [13], infectious diseases [14], and substance abuse [15], can contribute to the elevated morbidity and mortality of COVID-19. In China, Zheng et al. [11] utilized the MERS virus as a reference and suggested that SARS-CoV-2 can cause cardiac failure and acute myocarditis. Although the findings were preliminary, they indicated that patients could experience chronic cardiovascular effects secondary to contracting the disease. Lippi and Henry [12] conducted a meta-analysis demonstrating that chronic obstructive pulmonary disease (COPD) patients are five times more at risk of contracting the SARS-CoV-2 virus. You et al. [13] discussed the guidelines suggested by French medical oncologists on cancer patient care during the pandemic. In South Africa, Cox et al. [14] highlighted changes in tuberculosis (TB) patients’ treatment during the pandemic. In the United Kingdom, Marsden et al. [15] indicated how individuals with substance abuse disorders might experience addiction augmentation during the pandemic, consequently increasing the risk for COVID-19 contraction. They suggested that substance abuse disorder should not be overlooked when addressing preexisting conditions in COVID-19 patients.
In addition to preexisting conditions, environmental [16], demographic, and socioeconomic [17] factors can potentially influence COVID-19 incidence. For instance, Wang et al. [16] indicated that COVID-19 transmission is influenced by temperature variability. Their results suggest that reduced COVID-19 transmission is associated with higher humidity and temperature. In the United States, Mollalo et al. [17] suggested that higher percentages of nurse practitioners and black females and higher income inequality at the county level could explain 68.1% of COVID-19 incidence geographic variations.
Artificial neural networks (ANNs) are relatively novel techniques to model complex non-linear relationships in spatial epidemiology [18]. The techniques have been applied in a variety of fields, including but not limited to environmental science [19], agriculture [21], finance [22], artificial intelligence [24], and epidemiology and public health [25]. Reddy and Imler [26] demonstrated that ANNs could provide reliable predictions for chronic diseases, such as hepatocellular carcinoma in cirrhosis patients. They found high sensitivity (80.61–86.67%) and specificity (99.88–99.95%), corresponding to demographic and physiological inputs. Badnjević et al. [28] incorporated ANNs to classify asthma; they found high levels of sensitivity (97.11%) in asthmatic individuals and specificity (98.85%) in healthy individuals. Their findings suggested that ANNs can be appropriate techniques for asthma detection. Due to a lack of research on the spatial complexities of COVID-19 at the national level, in this study, we leveraged the potential of ANNs in identifying complex spatial patterns and the power of geographic information systems (GIS) in spatial analysis [29] to predict county-level COVID-19 incidence rates in the continental United States. We employed one of the widely used topologies of ANNs, which is described in Section 2.4.
Results of spatial analysis with Global Moran’s I indicated that the distribution of COVID-19 incidence rate in the continental United States is clustered (Index: 0.36, z-score: 34.75, p < 0.0001), rejecting the null hypothesis of a random spatial distribution. Moreover, Getis-Ord Gi* identified the locations of hotspots of disease incidence rates (Figure 2). In total, 217 counties were identified as hotspots (p < 0.05), which were mainly located in the northeastern regions of the continental United States, western Georgia, central Ohio, southern Louisiana, and northeast Iowa.
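The two statistics above can be sketched in a few lines. This is a minimal illustration of the standard textbook formulas, not the software used in this study. The binary contiguity weights, the six-unit toy data, and the function names `morans_i` and `getis_ord_gi_star` are assumptions made for the example.

```python
import math

def morans_i(values, weights):
    """Global Moran's I for values x_i and a spatial weights matrix w_ij."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_sum = sum(sum(row) for row in weights)
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    return (n / w_sum) * cross / sum(d * d for d in dev)

def getis_ord_gi_star(values, weights):
    """Getis-Ord Gi* z-scores; weights[i][i] is nonzero so each unit counts itself."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum(v * v for v in values) / n - mean ** 2)
    scores = []
    for i in range(n):
        w_i = sum(weights[i])
        w_sq = sum(w * w for w in weights[i])
        num = sum(weights[i][j] * values[j] for j in range(n)) - mean * w_i
        den = s * math.sqrt((n * w_sq - w_i ** 2) / (n - 1))
        scores.append(num / den)
    return scores

# Toy example (made-up data): six units in a row, high rates on the left.
values = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
n = len(values)
# Assumed binary contiguity: units are neighbors when they are adjacent.
contiguity = [[1.0 if abs(i - j) == 1 else 0.0 for j in range(n)] for i in range(n)]
# Gi* includes the unit itself in its own neighborhood.
with_self = [[1.0 if abs(i - j) <= 1 else 0.0 for j in range(n)] for i in range(n)]

print(morans_i(values, contiguity))
print(getis_ord_gi_star(values, with_self))
```

In this toy layout Moran's I comes out at about 0.6 (positive, hence clustered), and the Gi* scores are positive at the high-rate end and negative at the low-rate end. Significance testing, which the reported z-score and p-values rely on, is not covered by this sketch.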
The Boruta algorithm and Pearson’s correlation analysis retained 34 important and weakly intercorrelated variables (Supplementary Materials), which were then fed as inputs to ANNs. Overall, among the activation functions, “tanh” had slightly better performance (lowest RMSE) and thus was used in the MLPs. We systematically increased the number of neurons in the hidden layers from 10 to 30. The lowest errors were obtained with 15 neurons in the hidden layer. The performances of all employed models, in terms of RMSE, MAE, and r between observed COVID-19 incidence rate and model predictions on the holdout sample, are presented in Table 1. Correlation coefficients of the models ranged between 0.30 and 0.65. The linear regression model achieved the lowest correlation with observed COVID-19 incidence rates (r < 0.3). In contrast, the MLP with one hidden layer achieved the highest correlation (r = 0.65), indicating a satisfactory agreement between model predictions and observed COVID-19 incidence rates. Moreover, the accuracy assessment of the results indicated that the prediction error of the MLP with one hidden layer was lower than that of the other models (RMSE = 0.72, MAE = 0.36). The worst performance was obtained by linear regression (RMSE = 0.99, MAE = 0.58), while the MLP with one hidden layer yielded better accuracy and generalization capability than other models and was thus considered as the proposed model for further analysis. Figure 3 compares the z-scores of actual and predicted values of the dependent variable for holdout samples using the one-hidden-layer MLP.
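The three holdout metrics used for model comparison (RMSE, MAE, and Pearson's r) follow their usual definitions and can be sketched as below. The function names and the four-point toy data are illustrative assumptions and do not come from the study.

```python
import math

def rmse(observed, predicted):
    """Root mean squared error between observed and predicted values."""
    return math.sqrt(
        sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    )

def mae(observed, predicted):
    """Mean absolute error between observed and predicted values."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)

def pearson_r(observed, predicted):
    """Pearson correlation coefficient between observed and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_o * var_p)

# Made-up holdout values, for illustration only.
observed = [0.0, 1.0, 2.0, 3.0]
predicted = [1.0, 1.0, 2.0, 2.0]
print(rmse(observed, predicted), mae(observed, predicted), pearson_r(observed, predicted))
```

Lower RMSE and MAE and a higher r indicate better agreement, which is the basis on which the one-hidden-layer MLP was preferred over the other models.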
We performed a sensitivity analysis to investigate the effect of each variable on the COVID-19 incidence rate using the MLP with one hidden layer. Figure 4 shows the top 10 contributing variables in order of importance. According to Figure 4, age-adjusted mortality rates of ischemic heart disease, pancreatic cancer, leukemia, Hodgkin’s disease, mesothelioma, and cardiovascular disease were among the top 10 factors with the highest relative importance for COVID-19 incidence rates, showing the potential importance of these preexisting conditions to COVID-19 incidence rate. In addition to the mortality rates, the proportion of males above 65 years old, higher median household income, precipitation, and maximum terrain slope were other important contributing variables.
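The text does not state which sensitivity analysis procedure was applied. One common option, shown here purely as an assumed stand-in, is permutation importance: shuffle one input at a time and record how much the prediction error grows. The `predict` callable, the function names, and the toy data are hypothetical.

```python
import math
import random

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))

def permutation_importance(predict, rows, targets, seed=0):
    """Increase in RMSE when each input column is shuffled in turn.

    predict: callable mapping one input row (a list) to a prediction.
    This is an assumed method, not necessarily the one used in the study.
    """
    rng = random.Random(seed)
    baseline = rmse(targets, [predict(r) for r in rows])
    importances = []
    for col in range(len(rows[0])):
        shuffled = [r[col] for r in rows]
        rng.shuffle(shuffled)
        # Rebuild every row with only this column replaced by a shuffled value.
        permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled)]
        importances.append(rmse(targets, [predict(r) for r in permuted]) - baseline)
    return importances

# Toy model (made up): the output depends on the first input only.
rows = [[float(i), float(i % 3)] for i in range(20)]
toy_predict = lambda r: 3.0 * r[0]
targets = [toy_predict(r) for r in rows]
print(permutation_importance(toy_predict, rows, targets))
```

Ranking the inputs by these values gives an ordering comparable in spirit to the relative importances reported in Figure 4, although the numbers themselves depend on the method chosen.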
The logistic regression model was used to explain the association between the presence/absence of the identified hotspots (p < 0.05) of COVID-19 incidence rates and the explanatory variables obtained from sensitivity analysis. The results indicate that age-adjusted pancreatic cancer mortality rates, followed by median household income, precipitation, and Hodgkin’s disease mortality rates, were positively associated with the presence of hotspots. Meanwhile, age-adjusted mortality rates for leukemia and cardiovascular disease, and maximum terrain slope, were negatively associated with the occurrence of the hotspots. Table 2 summarizes the logistic regression model statistics.
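To make the interpretation of coefficient signs concrete, the following is a minimal logistic regression sketch fitted by plain gradient descent. It is an illustration under assumed settings (learning rate, number of epochs, made-up binary predictors), not the estimation procedure behind Table 2, and `fit_logistic` is a hypothetical name.

```python
import math

def fit_logistic(rows, labels, lr=0.1, epochs=2000):
    """Fit weights and bias of a logistic model by batch gradient descent."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            z = bias + sum(w * v for w, v in zip(weights, x))
            p = 1.0 / (1.0 + math.exp(-z))  # predicted hotspot probability
            err = p - y
            for k in range(n_features):
                grad_w[k] += err * x[k]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

# Made-up data: label 1 marks a hotspot; predictors are coded as -1 or +1.
cells = {
    (1.0, -1.0): [1, 1, 1, 0],
    (1.0, 1.0): [1, 1, 0, 0],
    (-1.0, -1.0): [1, 1, 0, 0],
    (-1.0, 1.0): [0, 0, 0, 1],
}
rows = [list(x) for x, ys in cells.items() for _ in ys]
labels = [y for ys in cells.values() for y in ys]
weights, bias = fit_logistic(rows, labels)
print(weights, bias)
```

A positive fitted weight means that higher values of the predictor raise the odds of a county being a hotspot, and a negative weight means they lower the odds. In this toy data the first predictor gets a positive weight and the second a negative one.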
The virus that causes COVID-19 is an RNA virus that, like the flu and measles viruses, has the potential to mutate, which may have contributed to the rapid transmission of the disease [49]. Due to the successful performance of ANNs in modeling many complex relationships, we examined the applicability of ANNs in predicting COVID-19 incidence in the continental United States. One of the main advantages of ANNs over widely applied traditional statistical techniques is their predictive capabilities even when working with noisy, complex, and incomplete datasets [18], which may also be useful for modeling other viruses with complex epidemiology, such as Zika virus. This motivated us to compile a relatively broad range (n = 57) of socioeconomic, behavioral, environmental, topographic, and demographic factors together with mortality rates of preexisting conditions. The variables were either suggested by previous studies or were based on domain knowledge (rarely investigated at the county level).
Among the different combinations of network topologies and learning parameters that were examined, the MLP with one hidden layer performed better and thus was used for predictions. Sensitivity analysis of this model indicated that six age-adjusted mortality rates, including ischemic heart disease, pancreatic cancer, leukemia, Hodgkin’s disease, mesothelioma, and cardiovascular disease, had substantial impacts on county-level COVID-19 incidence across the continental United States. While there is still much to discover and research, the results suggest that the disease incidence may be influenced by variation in the distribution of mortality rates nationwide. Therefore, counties with elevated mortality rates for one or more chronic conditions may be more vulnerable to a higher incidence of COVID-19 than other counties. This, in turn, may affect mortality rates during the pandemic. Lai et al. [50] indicated that comorbidities and cancer might be substantial contributors to COVID-19 mortality excess rates. They proposed that their findings are applicable to COVID-19 incidence and mortality in the United States. Hanff et al. [51] reported that COVID-19 mortality is significantly associated with comorbidities, including cardiovascular diseases (e.g., hypertension), and suggested that further studies may focus on detailed descriptions of comorbid physiological implications in COVID-19 patients, especially in the use of pharmacological therapies. Alimadadi et al. [52] proposed that sophisticated analysis, such as machine learning and artificial intelligence, may aid in combating the pandemic. They also suggest that these methods may provide a better understanding of COVID-19 diagnosis, medication treatment, prevention, and hospital logistics. Although our findings seem consistent with recent studies, drawing conclusions at the individual level is not valid due to the ecological fallacy; thus, the findings can only be interpreted at the county level.
According to our findings, demographic (i.e., % male above 65), socioeconomic (i.e., median household income), and environmental factors (i.e., maximum terrain slope and precipitation) are influential in predicting COVID-19 incidence, indicating that the disease is not merely affected or driven by physiological conditions. The findings support and extend the previous study of Mollalo et al. [17], who utilized multiscale geographically weighted regression to explain geographic county-level variations of COVID-19 incidence in the United States. Their results indicated that higher median household income and income inequality at the county level were positively correlated with elevated disease incidence, predominantly in the tristate area. Kavanagh et al. [53] proposed that socioeconomic and demographic factors are vital to consider when addressing the pandemic as they may be associated with income disparities that exist in the United States. For example, some employees may not have the option to work remotely from home, potentially resulting in more frequent exposure to the virus and contributing to further spread of the disease. The study of Qu et al. [54] emphasizes the significance of examining the effects of environmental factors pertaining to COVID-19. Their results suggest that COVID-19 may be aggravated by air pollutants (e.g., airborne particulate matter), influencing infectivity. Hence, further studies on the impacts of preexisting conditions and of socioeconomic, demographic, and environmental factors on COVID-19 incidence, preferably at a finer spatial granularity, are essential.
We acknowledge that the agreement between model predictions and observed values is only moderate (r = 0.65). This is likely due to the limited knowledge about the recently emerged disease and to factors that may be influential but were not included in this study. Therefore, future studies should focus on improving the prediction accuracy of this initial model. Additionally, even though no significant difference was observed between the performance of MLP networks with one and two hidden layers, there may still exist complex relationships in the data that are not captured. This leads us to another limitation of this study, which is the number of training samples. With more training data, one could apply deeper networks, i.e., networks with more than two hidden layers, and leverage the power of deep learning models. By stacking additional hidden layers, deeper neural networks can capture more complex non-linear relationships between dependent and independent variables. Thus, such networks are, in general, capable of reaching higher accuracies and can reveal the nuances of the data. However, the amount of training data that was available in this study does not justify utilizing deep networks. A few possible solutions to increase the amount of data are to consider a longer temporal interval (which was not possible in this case), to incorporate data from other countries and regions, to use data at finer spatial units (if available), or to use data augmentation techniques to (artificially) generate more training data and features. Moreover, although adjusted mortality rates of the diseases used in this study cannot be directly interpreted as preexisting conditions, a higher mortality rate for a certain disease could indicate a higher incidence of that disease. Therefore, this study could be used to further investigate any potential correlation between disease prevalence and COVID-19 incidence.
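Stacking hidden layers, as discussed above, amounts to repeating the same "weighted sum followed by tanh" step before a final linear output. The sketch below shows only the forward pass of such a network with random, untrained weights. The layer sizes use the 34 inputs and 15 hidden neurons reported in this study, while the width of the second hidden layer, the weight initialization, and the function names are assumptions.

```python
import math
import random

def init_mlp(layer_sizes, seed=0):
    """Create random weights and zero biases for consecutive layer sizes."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
        biases = [0.0] * n_out
        layers.append((weights, biases))
    return layers

def forward(layers, inputs):
    """Forward pass: tanh on every hidden layer, linear output layer."""
    activations = inputs
    for index, (weights, biases) in enumerate(layers):
        z = [sum(w * a for w, a in zip(row, activations)) + b
             for row, b in zip(weights, biases)]
        is_output = index == len(layers) - 1
        activations = z if is_output else [math.tanh(v) for v in z]
    return activations

# One hidden layer: 34 inputs, 15 tanh neurons, 1 output.
one_hidden = init_mlp([34, 15, 1])
# Two hidden layers: the second width of 15 is an assumption for illustration.
two_hidden = init_mlp([34, 15, 15, 1])

sample = [0.1] * 34  # made-up standardized input vector
print(forward(one_hidden, sample), forward(two_hidden, sample))
```

Each added hidden layer introduces another full set of weights to estimate, which is why deeper variants need more training samples than were available here.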
More than three months after the first confirmed case of COVID-19 in the US, and given the substantial economic and social impacts of the pandemic itself and the resulting lockdown policies, discussions regarding “re-opening the country” are omnipresent. The findings of this paper could be used as one of the many guidelines needed by policymakers to decide if and where (at the county level) lockdown policies should be relaxed.