Demographics, Socioeconomic Context, and the Spread of Infectious Disease: The Case of COVID-19

Importance: Due to the evolving variants of coronavirus disease 2019 (COVID-19), it is important to understand the relationship between the disease condition and socioeconomic, demographic, and health indicators across regions. Background: Studies examining the relationships between infectious disease and socioeconomic variables are not yet well established. Design: A total of 3042 counties in the United States are included as the observation unit in the study. Two outcome variables employed in the study are the control of disease spread and infection prevalence rates in each county. Method: Data are submitted to quantile regression, hierarchical regression, and random forest analyses to understand the extent to which health outcomes are affected by demographics, socioeconomics, and health indicators. Results: Counties with better control of the disease spread tend to have lower infection rates, and vice versa. When measuring different outcome variables, the common risk factors for COVID-19 with a 5% level of statistical significance include employment ratio, female labor ratio, young population ratio, and residents’ average health risk factors, while protective factors include land size, housing value, travel time to work, female population ratio, and ratio of residents who identify themselves as mixed race. Conclusions: The findings imply that the ability to maintain social distancing and personal hygiene habits is crucial in deterring disease transmission and lowering incidence rates, especially in the early stage of disease formation. Relevant authorities should identify preventive factors and take early actions to fight infectious diseases in the future.


Background
Infectious diseases have spread quickly throughout the world in recent decades. The rapidity with which we travel across borders and continents has fueled disease outbreaks, such as avian influenza, Middle East respiratory syndrome coronavirus (MERS-CoV), and the most recent and ongoing outbreak of coronavirus disease 2019, known as COVID-19. While a renewed focus on pandemic planning has been established, how the disease spreads and how it is affected by health indicators and socioeconomic factors have yet to be investigated and are of great interest to many healthcare professionals and social scientists.
The hypothesis that socioeconomic determinants of health, such as poverty, race, ethnicity, social marginalization, and environment, are linked to infectious diseases, including influenza, malaria, tuberculosis, Ebola, and other diseases, has been widely acknowledged [1][2][3]. The WHO Commission on the Social Determinants of Health, Closing the Gap, also explicitly stated that health inequalities may fuel many infectious diseases [4].
Differential healthcare-seeking behaviors may exist for individuals with different income levels, potentially driven by differential access to healthcare. These differences might lead to delays in seeking care in response to respiratory infection [10], as well as differences in the quality of care available. Less access to healthcare may result in uncontrolled chronic conditions, such as pneumonia, asthma, and septicemia, and hence more severe diseases [11]. These conditions may lead to differences in the rates of antiviral prescriptions [12] and differential outcomes that require hospitalization, both of which contribute to the social consequences of ill health and further social stratification [9]. Income-related disparities in access to care are far wider in the United States than in other wealthy countries. Thirty-nine percent of Americans with a below-average income reported not seeing a doctor for a medical problem because of cost, compared with 7% of low-income Canadians and 1% of those in the UK [13]. Disparities in access are largely due to high rates of uninsurance or insufficient insurance among low-income Americans. This group of people is more likely than adequately insured people to forgo needed medical services and medications because of cost. This condition is especially severe for millions of uninsured Americans with chronic conditions [14].
For infectious diseases, strong correlations between poverty and tubercular disease, influenza, acute respiratory tract infection, and acute respiratory infection are well documented [5].
However, researchers do not find a strong hazardous effect of income inequality on all-cause mortality using U.S. data when controlling for income, education, race, and urbanization [15,16]; such an effect is observed only for homicides and, to a lesser extent, infant mortality and deaths from accidents. In general, although income is positively related to health, income inequality does not contribute to a higher population mortality rate. Thus, in this study, we include income level as a risk factor, but not income inequality.

Health Inequality and Race
Health disparities take on many forms for racial and ethnic minorities, including infant mortality, chronic disease, and premature death, compared with the rates among other ethnic groups [17,18]. Other conditions, such as obesity and related chronic diseases and debilitating conditions, also disproportionately affect racial and ethnic minorities, with major implications for the quality of life and wellbeing of these population groups. For example, Asians had the lowest prevalence rate (8.6%) of obesity in the U.S., and Hispanic children had the highest prevalence (21.9%) from 2011 to 2014 (NCHS, 2016) [19]. African Americans were 30% and 100% more likely to die prematurely from heart disease and stroke, respectively, in 2010 than their white counterparts (HHS, 2016) [20]. African Americans have the highest mortality rate for all cancers combined compared with any other racial and ethnic group [21]. As race plays a role in health inequality, it must be included as a factor when determining the spread and severity of COVID-19 across counties.

Health Inequality and Education
Education and health are both considered indicators of the quality of human capital that can be invested and are linked to income level. The existing health economics literature suggests that the causal effect running from income to health is indirect and might be mediated by the purchase of healthcare services, suggesting that the correlation between income and health is potentially driven by factors such as education or rates of time preference [22][23][24]. Those who have a stronger desire for current consumption are likely to fail to make investments to protect their health and fail to obtain the education and skills needed to generate higher earnings [25]. Even with the endogeneity between these variables, income and education are still considered independent protective factors for self-reported health status [26]. Researchers apply state or metropolitan data and find relationships among mortality, income, and education. Specifically, average education drives average income and modulates the effect on mortality and even shifts it to a risk factor [27][28][29]. However, the conflict between the individual and aggregate data remains unresolved. Education is included in the study to control for this underlying effect.
The aforementioned discussion implies that disadvantaged populations might be particularly vulnerable and susceptible to pandemics and might crowd hospital wards, placing medical personnel at great risk. Understanding the spread of highly contagious diseases while considering socioeconomic factors is very important for policy. Recommendations for policies to prepare for and respond to a respiratory disease pandemic are critically needed [30].

Research Method
According to a previous study that explains the variations in health status mediated by socioeconomic factors [31], this research project investigates the population outcome, denoted as Y for district (county) i across the United States. The variation related to socioeconomics (S), demographics and geographics (D), and health-related indicators (H) is calculated using Equation (1).
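In generic functional form, the relationship described here can be sketched as below. The function symbol f, the county subscripts on S, D, and H, and the absence of an explicit error term are assumptions made for illustration; only the outcome and the three groups of covariates are named in the text.

```latex
Y_i = f\left(S_i,\; D_i,\; H_i\right)
```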
The population outcome Y in our study is the COVID-19 conditions, which are measured in two ways: control of disease spread and the severity of the infection. Explanations are provided below.

Spread of Infectious Disease
Traditionally, the basic reproduction number (R) is adopted in the field of public health to show the speed of infection for a disease. It is the average number of people who will be infected by a single infectious person over the course of his or her illness. This number, however, is constantly changing and is highly sensitive to short-term conditions and the specific methods of computation. As an alternative, in this study we measured the number of days from the first incident or the first day the data are available, whichever came first, to the day that the infection rate reached 3% of the county population. This measure provides a direct indicator of how rapidly the virus is spreading: a longer time to reach 3% implies better control of disease spread in the county. We chose 3% as the benchmark because only 12 of the 3138 counties included in our study had a maximum infection rate of less than 3% in the pre-Delta period of the pandemic [1], and even those counties have a value close to 3%. Thus, it could serve as an objective indicator of spread control.
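The spread-control measure can be illustrated with a minimal sketch. The function name, the input layout (one cumulative case count per day), the choice of the first day with a positive count as the starting day, and the `None` return for counties that never reach the benchmark are all assumptions made for illustration, not the study's actual procedure.

```python
from typing import Optional, Sequence


def days_to_threshold(cumulative_cases: Sequence[int],
                      population: int,
                      threshold: float = 0.03) -> Optional[int]:
    """Days from the first recorded case to the first day on which the
    cumulative infection rate reaches `threshold` (None if never reached)."""
    first_case_day = None
    for day, cases in enumerate(cumulative_cases):
        # Assumption: the clock starts on the first day with any confirmed case.
        if first_case_day is None and cases > 0:
            first_case_day = day
        if first_case_day is not None and cases / population >= threshold:
            return day - first_case_day
    return None


# Hypothetical county of 1,000 residents with daily cumulative counts.
print(days_to_threshold([0, 0, 1, 5, 20, 40], population=1000))
```

A larger return value corresponds to slower spread, which the text interprets as better control.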

Infection Severity
The direct measure of the severity of a disease in an area is the incidence or prevalence rate of the disease. This study employs the prevalence rate, computed as the cumulative number of confirmed positive cases on each day divided by the population of each U.S. county. The number of new cases rises and falls, but the cumulative case number plateaus when the disease condition is alleviated. For each county, the prevalence rate on the day of the peak of newly confirmed cases is used as the indicator of severity. In the few counties where newly confirmed cases did not peak in the pre-Delta period, prevalence rates were computed at the end of the study period, 15 March 2021.
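As a rough sketch of this severity measure, the snippet below derives daily new cases from a cumulative series, locates the peak day, and reports the cumulative prevalence on that day. The function name, the input layout, and the tie-breaking rule (earliest peak wins) are assumptions for illustration.

```python
from typing import Sequence


def prevalence_at_peak(cumulative_cases: Sequence[int], population: int) -> float:
    """Cumulative prevalence on the day with the most newly confirmed cases."""
    # Daily new cases are first differences of the cumulative series.
    new_cases = [cumulative_cases[0]] + [
        today - yesterday
        for yesterday, today in zip(cumulative_cases, cumulative_cases[1:])
    ]
    # Assumption: if several days tie for the peak, the earliest one is used.
    peak_day = max(range(len(new_cases)), key=new_cases.__getitem__)
    return cumulative_cases[peak_day] / population


# Hypothetical county of 1,000 residents; new cases peak on the fourth day.
print(prevalence_at_peak([0, 2, 10, 30, 35, 36], population=1000))
```

If new cases are still rising on the last day of the series, the peak day is the last day, which matches the rule of using the end of the study period for counties that had not peaked.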

Study Period
Our study period, described as the COVID-19 pre-Delta period, is from 22 January 2020 to 15 March 2021. The start date is based on the availability of the daily statistics of COVID-19 released by USAFacts. The end date is the trough of the trend of newly confirmed cases in most of the states, representing the end of the first wave of COVID-19 and the beginning of the spread of the Delta variant.

Socioeconomics
Health inequality is best known to be attributed to income disparity, as stated in the previous section. Other associated factors, such as employment conditions and urbanization of the district, are potential determinants to be included in the control covariates. Housing value, broadband internet coverage, and the female labor force participation rate are included as controls related to the urbanization of the counties.

Herfindahl-Hirschman Index (HHI)
The HHI is a measure of market concentration and hence of industrial competitiveness. Suppliers' behaviors are substantially influenced by market conditions, and an interdependent relationship exists among institutions. In economic theory, quality is one of the components of nonprice competition, which might be a focal point of healthcare institutions when publicizing their brand names in the industry. Institutions with greater market power can manipulate their prices and quality to differentiate themselves across the broad range of services in the market. In contrast, firms facing fierce competition may be more cost conscious and maintain a minimum level of required quality. As a result, the HHI may be an important determinant of the variation in the quality of healthcare institutions across regions. This index is not readily available from government publications; thus, we imputed the figure based on the number of patients served by each of the healthcare institutions listed on the Centers for Medicare & Medicaid Services (CMS) website, using the number of patients as the indicator of each institution's market share. The data published by CMS do not directly report patient numbers; however, each facility reports its number of respondents, which we use as the proxy for our utilization measure.
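The index itself is the sum of squared market shares. The sketch below computes it for one county from facility-level respondent counts; the function name and the choice to express shares as fractions (so the index ranges from near 0 to 1 rather than up to 10,000) are assumptions for illustration.

```python
from typing import Sequence


def hhi(respondents: Sequence[float]) -> float:
    """Herfindahl-Hirschman Index from per-facility respondent counts,
    with shares expressed as fractions of the county total."""
    total = sum(respondents)
    if total <= 0:
        raise ValueError("at least one facility must report respondents")
    return sum((r / total) ** 2 for r in respondents)


print(hhi([100]))             # a single facility: fully concentrated market
print(hhi([50, 50]))          # two equal facilities
print(hhi([25, 25, 25, 25]))  # four equal facilities: more competition
```

A higher value indicates a more concentrated market, which is how the text describes mildly infected areas in the Results.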

Health Facility Indicators
The quality of healthcare facilities is measured with two indicators, patient readmission rate and patient satisfaction rate, for each institution within the county. The average values are obtained for each county. The readmission rate is defined by CMS as an admission to an acute care hospital within 30 days of discharge from the same or another acute care hospital for all causes, and thus the cause of the readmission does not need to be related to the cause of the initial hospitalization [32].
Patient satisfaction ratings were obtained from a survey conducted by the Hospital Consumer Assessment of Healthcare Providers and Systems (HCAHPS), a national, standardized survey of patients which asks about their experiences during a recent inpatient hospital stay [33]. Patients who stayed in the hospital were asked 27 questions, including experiences with nurses, doctors, environments, and general treatment. We used one specific measure, "Using any number from 0 to 10, where 0 is the worst hospital possible and 10 is the best hospital possible, what number would you use to rate this hospital during your stay?", to capture the general patients' satisfaction with the healthcare facilities.

Hierarchical Condition Category (HCC) Score
Geographic variation in health status, in terms of diagnosis intensity and cost, has long been recognized as a notable phenomenon [34]. In this study, the HCC score was employed as a proxy for county residents' health conditions. It estimates how beneficiaries' fee-for-service (FFS) spending will compare to the overall average for the entire Medicare population. Thus, it is a risk measure for the health spending of the Medicare population. The CMS-HCC model is normalized to 1.0. Beneficiaries with a risk score less than 1.0 are considered relatively healthy and, therefore, less costly, whereas beneficiaries with scores greater than 1.0 are expected to have above-average spending. In other words, a higher score implies poorer health since spending is higher [35]. The HCC is generally regarded as the best risk-adjustment model available and is used by CMS for both Medicare Advantage plan and (in a modified form) Part D payment.

Demographics, Geographics, and Other Control Variables
Variations in age, race and sex ratios across regions are driving factors for many diseases [36,37]. In addition to Black, Asian, and Hispanic, this study includes multiracial as an identity in the race category. A multicultural background may suggest a different perspective on adapting to new ideas about disease control. For infectious diseases, population density is presumed to be a critical factor since the ability to maintain social distancing has been widely emphasized in preventing disease. Thus, geographic factors are included in our models of estimation as control variables, including population density, county area, and travel time to work.

Data
A total of 3042 counties in the United States are the observation units in the study. Some counties are not included due to the unavailability of data, primarily because their populations are less than 500 and data are not collected for them. The outcome measures are extracted from USAFacts [38], which collects COVID-19 data from multiple sources, including the Centers for Disease Control and Prevention (CDC) and state- and local-level public health agencies. Quality measures for healthcare institutions in the United States are sourced from CMS [39] under the CAHPS® (Consumer Assessment of Healthcare Providers and Systems) Healthcare Institution Survey, a national survey of family members or friends who cared for a patient who died in the care of a healthcare institution. Detailed descriptions of the quality measurement can be obtained from the CMS webpage [39].
The aforementioned socioeconomic, demographic, and county characteristics are compiled and published by the Census Bureau of the U.S. government [40]. Population, race and ethnicity data, and sex ratios are obtained from the Population Estimates Program (PEP). Employment, education, income, transportation, and housing data are obtained from the American Community Survey (ACS). Uninsured population and disability data are obtained from the Current Population Survey (CPS), Annual Social and Economic Supplement (ASEC); State level-American Community Survey (ACS), one-year estimates; and County level-The Small Area Health Insurance Estimates (SAHIE) [41]. Geographic data are obtained from the Geography Division based on the TIGER/Geographic Identification Code Scheme (TIGER/GICS) computer file.

Quantile Regression (QR) Model
Since areas with different levels of infection severity may have different causes and risk factors, subdividing the analyses into different quantiles based on the population infection rate is of interest. Quantile regression (QR) analysis was proposed as an expansion of the least absolute deviation (LAD) [42]. QR has been used to detail the performance of explanatory variables at different conditional quantiles. The benefit of QR estimation is that the models describe the conditional distribution at different quantiles and, therefore, more comprehensively describe the characteristics of samples. This model is different from the OLS model, which only describes the mean marginal effects of the explanatory variables on the explained variables.
Based on the conventional descriptions of the QR study [42], we established a random variable cumulative distribution function, as shown in Equation (2).
where y_i represents the dependent variable for county i, and x_i is the vector of independent explanatory variables, including socioeconomics, demographics, health indicators, and other control characteristics of the counties. β is the regression coefficient vector obtained through an estimation satisfying Equation (2) and varies according to different quantiles τ. Therefore, β(τ) represents the regression coefficient vector at the τth quantile. We simplified Equation (2) into a basic cross-sectional data quantile regression model, as shown in Equation (3).
where ε_i(τ) represents the random error under quantile τ, assuming E(ε_i(τ)|x_i) = 0, and α_i represents the area fixed effects (Koenker, 2004). The value of β(τ) estimated by QR under fixed effects represents the marginal effects of the explanatory variables on the explained variable at different quantiles when the other explanatory variables in x_i are controlled. The bootstrap method for sampling estimation was employed, and resampling was used to simulate the population distribution [43]. We also relaxed the assumption that the conditional distribution of the errors is homoscedastic [44]. Thus, a consistent estimator of the variance matrix was obtained.
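To make the quantile idea concrete, the sketch below uses the standard check (pinball) loss that underlies quantile regression, ρ_τ(u) = u(τ − 1[u < 0]), in the simplest possible setting: a model with only a constant and no covariates. The function names and data are hypothetical, and this is not the fixed-effects estimator used in the study; it only shows that minimizing the check loss recovers a sample quantile rather than the mean.

```python
from typing import Sequence


def check_loss(u: float, tau: float) -> float:
    """Check (pinball) loss: weight tau on positive residuals and
    (1 - tau) on negative residuals."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def total_loss(q: float, ys: Sequence[float], tau: float) -> float:
    """Sum of check losses when every observation is predicted by q."""
    return sum(check_loss(y - q, tau) for y in ys)


def best_constant(ys: Sequence[float], tau: float) -> float:
    """Constant minimizing the total check loss. A minimizer always exists
    among the observed values, so it suffices to search over them."""
    return min(sorted(set(ys)), key=lambda q: total_loss(q, ys, tau))


ys = [1, 2, 3, 4, 100]          # one extreme value, as in a severely hit county
print(best_constant(ys, 0.5))   # the median, unaffected by the outlier
print(best_constant(ys, 0.25))  # a lower quantile
print(sum(ys) / len(ys))        # the mean, pulled up by the outlier
```

With covariates, the same loss is minimized over a coefficient vector for each τ, which is why the estimated effects can differ between mildly and severely infected counties.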

Hierarchical Regression Model
In this research, the observation unit is a county in one of the 50 U.S. states, which are nested in ten regions based on the classifications of the Centers for Medicare & Medicaid Services (CMS); thus, a model of two-level nested groups was constructed. This hierarchy is suitable for applying mixed-effects models, which are characterized as containing both fixed effects and random effects; the former are analogous to standard regression coefficients and are estimated directly, and the latter are not directly estimated but are summarized according to their estimated variances and covariances. Random effects may take the form of either random intercepts or random coefficients in the nested groups. Multilevel models, also known as hierarchical models, have been used extensively in diverse fields, ranging from the health and social sciences to econometrics [45][46][47].
In our regression models, i, s, and r denote the county, state, and CMS region, respectively. Y represents the variables of interest for investigation: spread control and the infection rate of COVID-19. X denotes independent variables, including socioeconomic, demographic, health indicators, and other control variables. The region and state error terms and residual are, respectively,
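A random-intercept specification consistent with this description could be written as follows. The symbols u_r, v_{sr}, and e_{isr}, the normality assumptions, and the restriction to random intercepts (no random coefficients) are illustrative assumptions rather than the exact model estimated in the study.

```latex
Y_{isr} = X_{isr}\beta + u_r + v_{sr} + e_{isr},
\qquad
u_r \sim N(0, \sigma_u^2),\quad
v_{sr} \sim N(0, \sigma_v^2),\quad
e_{isr} \sim N(0, \sigma_e^2)
```

In this sketch, u_r and v_{sr} play the role of the region and state error terms, and e_{isr} is the county-level residual.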

Random Forest Model
The regression coefficients derived from the abovementioned models are a measure of the association between a particular feature and the outcomes. We supplemented our analyses with a random forest machine learning algorithm, which computes feature importance values and provides information about the relative importance of each feature for predicting outcomes for the entire sample. The importance value of each feature in the models determines which variables were the most important for determining the speed of disease spread and the severity of the infection condition. The Stata software package is employed for the prediction of the random forest model, in which variable importance is calculated by summing the improvement in the objective function obtained from the splitting criterion over all internal nodes of a tree and across all trees in the forest. Importance is expressed through the mean decrease in Gini. The outcome variables and the regressors are identical to those described in the aforementioned models. Additional details of the statistical analysis and feature engineering are available in studies by Breiman (2001) and Zou and Schonlau (2019) [48,49].
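The aggregation step described above (summing split improvements over all internal nodes and all trees) can be sketched as below. The data layout, the function name, the feature names, and the rescaling so that the top feature has importance 1 are assumptions for illustration; this does not reproduce the Stata implementation, and the improvement values are made up.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical layout: each tree is a list of (feature, improvement) pairs,
# one pair per internal node, where improvement is the gain in the splitting
# criterion achieved at that node.
Tree = List[Tuple[str, float]]


def feature_importance(forest: List[Tree]) -> Dict[str, float]:
    """Sum split improvements per feature across all nodes and trees, then
    rescale so the most important feature equals 1 (assumed convention)."""
    totals: Dict[str, float] = defaultdict(float)
    for tree in forest:
        for feature, improvement in tree:
            totals[feature] += improvement
    top = max(totals.values(), default=0.0)
    if top <= 0:
        return dict(totals)
    return {feature: value / top for feature, value in totals.items()}


forest = [
    [("density", 4.0), ("income", 1.0)],
    [("density", 2.0), ("income", 2.0)],
]
print(feature_importance(forest))
```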

Results
The characteristics of the areas with severe and mild infection incidents might be very different. Thus, we divided our observation units, the U.S. counties, into terciles based on the infection rate: mild, moderate, and severe. Tables 1 and 2 show the summary statistics for the whole sample and the three groups. Mildly infected areas have a higher percentage of the white population, a lower percentage of the black population, a lower percentage of foreign-born people, a greater percentage of the older population, and a greater percentage of owner-occupied housing than severely infected areas. Regarding healthcare indicators, mildly infected areas have more concentrated healthcare institutions, institutions have lower readmission rates, and the population has fewer risk factors, as measured by general healthcare spending, than severely infected areas. All the differences are statistically significant at the 1% or 5% level.
Data are submitted to ordinary least squares (OLS) and quantile regression methods to investigate whether relationships exist between those factors and the control of spread and the infection conditions of the disease. The results are presented in Tables 3-6. For the spread of the disease, the risk factors that facilitate disease spread (that is, have a negative effect on the time to reach the 3% infection rate) include the female labor ratio, percentage of the population under 18 years old, percentage of the population over 65 years of age, and HCC. All factors are statistically significant at the 1% level for at least two of the three quantile groups. Counties' median housing value, land size, travel time to work, female population ratio, race mix, and HHI are protective factors that are positively related to the length of time to reach the 3% infection rate. Some factors, such as income and the percentage of the population with a college degree, exert opposite effects at different quantiles of infection conditions.
The regression results for the analyses of infection rates are presented in Table 4. Income, broadband internet coverage, travel time to work, elderly population ratio, and college graduate ratio have negative effects on the infection rate and are thus protective factors, while employment ratio, population density, percentage of owner-occupied housing, the population ratio under 18 years, and percentage of uninsured individuals are positively related to the infection rate. Based on the R² value of the results, the model of the severe tercile has a better fit than those of OLS and the other two terciles. Our next step is to include the state fixed effects in the models to increase the precision of the estimate, and the results return a better fit in terms of R² and more variables with statistically significant coefficients, as presented in Tables 5 and 6. Generally, the signs of the coefficients are consistent with those in the models without fixed effects. The differences are that the coefficients of the hospital readmission rate and income are no longer significant in the control of disease spread model, and the female labor population rate and broadband coverage are no longer significant in the infection rate estimation. More interestingly, the coefficients of the uninsured population change to negative at the 1% statistical significance level.
Selected variables were submitted for quantile graphic presentation, as shown in Figures 3 and 4, to obtain a clearer picture of the extent to which the effects of the covariates on the outcomes vary across infection severity levels. For disease spread, income, density, and female ratio exert a strong positive effect on the control of infection spread when the disease condition is mild. However, as infection conditions worsen, the effect gradually vanishes.
On the other hand, housing value, HCC, and uninsured ratio exert prominent negative effects on the control of disease spread when the infection condition is mild, and the effect diminishes as the quantile approaches 1 (most severe). The opposite effect to what we found in the spread control model was observed for the infection incidence rate. All the variables illustrated in Figure 4 exert moderate effects when counties have mild infection rates, and the effects intensify when infection conditions worsen.
The data were further examined using a hierarchical regression analysis (mixed random and fixed effect model) to assess the robustness of the results. Similar results are obtained, as shown in Table 7. The state mixed effects for both outcome measurements are collected after the analyses, and the two-dimensional plot is shown in Figure 5, where the vertical axis represents the effects extracted from the spread model, and the horizontal axis represents the effect extracted from the infection rate model. The second quadrant shows the states with slow spread and low infection rates; the fourth quadrant shows the states with fast disease spread and high infection rates in the pre-Delta period of disease statistics. An apparent negative trend between the two outcome variables is observed, suggesting that better control of disease spread would lead to a lower infection rate.
Figure 5. Random effects at the state level. Notes: The vertical axis indicates the effect extracted from the spread model, and the horizontal axis represents the effect extracted from the infection rate model. A higher value indicates better spread control, and a shift to the right indicates a higher infection rate. The second quadrant shows the states with better than average outcomes (slow spread and low infection rates); the fourth quadrant shows the states with worse than average outcomes (fast spread and high infection rates) in the first wave of disease statistics.
Finally, random forest modeling is employed using the standard procedures designed by Breiman [44] and Frank et al. [50] to understand the relative importance of each explanatory variable. The prediction performances of the models are approximated using the out-of-bag (OOB) errors [51]. After a bootstrap of 1500 iterations, OOB errors of 0.027 and 0.204 are obtained for the models of infection rate and spread control, respectively. The results of the relative feature importance (FI) for the two outcome variables are shown in Figure 6, mainly confirming the findings of the regression analysis.
Figure 6. Random forest prediction for importance (panels: Spread Days; Infection Rate). Notes: H value represents the median housing value; HH person represents the number of people per household; EmpRatio represents the employment-to-population ratio; Rating indicates the average hospital rating.

Discussion
At this time of unpredictable pandemic upticks due to evolving SARS-CoV-2 variants, studies are needed to provide insights into the risk factors determining the spread and the incidence rates of this contagious disease in a timely manner. This study investigates the key factors that influence the spread of COVID-19 and the variation in infection rates across the United States. During the outbreak of respiratory diseases when vaccines are not yet available, understanding the risk and preventive factors might help control the spread and gain more time for scientists to develop new vaccines and treatment methods. Results of these studies would help government authorities allocate medical resources, prepare for disease prevention, and plan strategically to achieve better management in disease control.
Intuitively, counties with better spread control would have a lower infection rate. Using the U.S. county data and various statistical models, our results indicate that most factors exert consistent effects in the two models: factors that are protective for the control of disease spread (positive effects) also tend to be protective against COVID-19 infection (negative effects on infection rates). Analogously, factors that exert negative effects on the control of disease spread, i.e., the risk factors, tend to raise the infection rate.
Our results reveal that for the spread control model, the effects of risk and preventive factors are more prominent in the counties with mild infection conditions than in the counties with severe infection conditions, while the opposite results are observed for the models of infection rate, i.e., they have stronger effects on severely infected counties than on mildly infected counties. For example, the female population ratio is a preventive factor for the control of disease spread. As the female ratio increases by 1%, the spread is delayed by 1.20% when the infection is mild. However, the delay is only 0.5% when infection is in the moderate tercile, and no significant delay is observed in the severe tercile. Another example is the effect of median income on the infection rate. As income increases by 1%, the infection rate decreases by 0.05% (p < 0.05) for the severely infected counties. However, the effect is only 0.005% and 0.008% for the mildly and moderately infected counties, respectively, and the result is not statistically significant in the latter cases.
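The severity-specific effects discussed above rest on quantile regression, which fits a chosen quantile of the outcome by minimizing an asymmetric "check" (pinball) loss rather than squared error. The sketch below illustrates only that loss on a toy list of numbers: it fits a constant instead of a linear predictor, and the data and function names are hypothetical rather than drawn from the study.

```python
def pinball_loss(y_true, y_pred, tau):
    """Average check (pinball) loss at quantile level tau, with 0 < tau < 1."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        residual = y - p
        # Under-predictions are weighted by tau, over-predictions by 1 - tau.
        total += tau * residual if residual >= 0 else (tau - 1) * residual
    return total / len(y_true)


def best_constant(y, tau):
    """Observed value that minimizes the pinball loss as a constant prediction."""
    candidates = sorted(set(y))
    return min(candidates, key=lambda c: pinball_loss(y, [c] * len(y), tau))


# Toy outcome values with one extreme observation (illustrative only).
outcome = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100]

low = best_constant(outcome, 0.1)   # fitted 10th percentile
mid = best_constant(outcome, 0.5)   # fitted median
high = best_constant(outcome, 0.9)  # fitted 90th percentile
```

On this list the three fitted constants are 2, 6, and 10, so the extreme value of 100 does not pull the fitted quantiles toward it. In a full quantile regression the constant is replaced by a linear function of the covariates, and the coefficients are re-estimated at each quantile level, which is how a factor's effect can differ between mildly and severely infected counties.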
The risk factors are generally the employment to population ratio, young population (18 years or younger) ratio, female labor ratio, and residents' general health risk measured using the HCC. The protective factors include median housing value, broadband internet coverage, land size, travel time to work, female population ratio, multiracial ratio, HHI, and median household income. These factors have different levels of effect and various levels of statistical significance when measured using spread control models or infection rate models. In both models, the factors showing consistent significance levels include housing value, land size, travel time to work, female population ratio, and multiracial ratio as preventive factors and employment ratio, female labor ratio, young population ratio, and HCC as risk factors.
Some factors exert opposite effects on the spread control and infection rate models. For example, more densely populated areas tend to have higher infection rates. However, population density is a protective factor when disease conditions are mild or moderate in the disease spread control model. This discrepancy is probably because population density captures the characteristics of urbanization of the county. The percentage of the uninsured population also exerts the opposite effect; it is a risk factor for the control of disease spread but a preventive factor for the infection rate. This difference is probably because the uninsured group is usually younger and healthier, so its infection rate is lower. Furthermore, this population might not have expected COVID-19 to cause much harm and therefore did not maintain precautions as a habit, which negatively affected spread control.
Studying the nature of these factors suggests that personal hygiene may play an important role in promoting disease prevention. Different cohorts may share the characteristics of ease of adaptation or openness of attitudes toward new habits under certain circumstances. For example, a multiracial cohort and people residing in more urbanized areas might find it more acceptable to adopt new habits of more frequent handwashing, mask wearing, and social distancing. A higher elderly ratio is associated with less severe infection, possibly because the elderly, who comprise the high-risk group, take special precautions and adopt new habits to avoid contracting the disease.
Another interesting finding concerns the female role in society. A greater female ratio helps delay disease spread and lower the infection rate. However, greater female labor force participation exerts the opposite effect, implying that females devoting time to the workforce spend less time ensuring sanitary conditions for their families, thereby increasing the infection risk in their communities. In summary, the study results imply that taking precautions in personal hygiene is important both for spread control and for decreasing the infection incidence rate, as manifested by special cohort groups who might share certain characteristics for high vigilance in personal hygiene. However, when the general disease condition continues to worsen, reaching the higher quantile in infection rates, these protective factors play less important roles in preventing the disease.
Finally, well-established healthcare institutions with greater market power are significantly protective for slowing disease spread, implying that competitiveness is a less ideal market structure in the healthcare industry. Although the protective effect vanishes when the infection rate becomes severe, reputation and quality of care are better served in an imperfectly competitive setting.
The feature importance (FI) values for the random forest models generally confirm the findings of the regression analysis. Population density, female population ratio, and travel time to work are the top three factors determining spread control. The female population ratio, elderly ratio, and housing value are the top factors determining the infection rate. People with a comfortable home environment exhibit a greater tendency to stay home and reduce their interactions with people outside the family, which in turn reduces the probability of contracting COVID-19. The only factor that exhibits a difference in determining power between the random forest and the regression models is the HHI. In the random forest model, the HHI ranks as the second least important factor, while it appears to be one of the few significant explanatory variables in the quantile regression and mixed effect models. A similar discrepancy appears in existing studies and is probably because random forest models assign greater weight to prediction accuracy and the magnitudes of the coefficients instead of the causal relationship and the statistical significance of individual regressors [52]. This finding is noted as a limitation in the interpretability of this research.

Conclusions
Effectively containing the spread of infectious disease is essential in public health considerations, especially when vaccines and efficacious cures for the diseases are not yet available. In this study, we employ three popular and newly developed models, namely quantile regression, a hierarchical mixed effect model, and random forest models, to investigate the COVID-19 pandemic condition before the introduction of the vaccines. Notably, both protective and risk factors for COVID-19 are incorporated as predictors. Our results suggest that the protective factors that slow disease spread and lower infection rates include land size, housing value, travel time to work, female ratio, HHI, and percentage of the population who identify themselves with more than one race (multiracial). Some of these protective factors are related to the ease of maintaining social distancing, while others may be linked to cohort characteristics for their attitudes toward adopting new habits that might be beneficial for disease prevention, such as the habits of maintaining personal hygiene, mask wearing, and handwashing. Populations with more females and multiracial cohorts seem more adaptable to taking precautions with personal hygiene. Healthcare facilities with higher ratings that face less competition also play a more important role in controlling disease spread and lowering infection rates than facilities facing fierce competition. However, most of the protective factors only exert a significant effect when disease conditions are mild or moderate in the counties. When the disease condition worsens, the effects of protective factors diminish. On the other hand, risk factors, such as employment ratio, female labor ratio, and HCC, exert more prominent effects when the disease condition is aggravated.
The implications of the risk factors for our study are described below. First, bustling business interactions facilitate the spread of viruses. Second, more females in the labor market aggravate the disease condition. Females usually play the primary role in running the household. If they devote their time to the labor market, they spend less time and effort maintaining sanitary conditions for their families. Third, the health risk indicator of a county, the HCC, directly exerts a significant positive effect on disease severity.
Our study also reveals some other interesting findings. Although the elderly might be frail and vulnerable, the elderly ratio is not associated with a higher infection rate, probably because this population is more careful about maintaining social distancing and practicing personal hygiene because they know that they are at high risk once infected. The uninsured population represents a younger, active, and healthier group of people who accelerate disease spread, but in general, the overall infection rate in an area is not particularly worsened when a highly uninsured population is present.
Continuing efforts to maintain personal hygiene, social distancing, and mask wearing are crucial for controlling disease spread. These measures are particularly effective when the infection condition is not serious, as indicated in the low quantile of infection rates in this study. When the infection condition continues to deteriorate, these protective factors lose their effect, and the risk factors become more powerful in aggravating the situation.
This study provides insight into controlling contagious disease spread and the infection rate in terms of socioeconomics, demographics, and indicators of regional healthcare facilities. The findings affirm the importance of personal precautions, broadband internet coverage, and large-scale healthcare facilities. Suggestions for future studies include continuous efforts to monitor pandemic conditions for ever-emerging variants and to assess the relationships between the unvaccinated rate, hospitalization rate, death rate, and demographic and socioeconomic indicators. Refined statistical models and machine learning algorithms should also be adopted for greater precision of predictions, together with methods that improve the interpretability of artificial intelligence models, such as SHapley Additive exPlanations (SHAP) [53].