The Use of Penalized Regression Analysis to Identify County-Level Demographic and Socioeconomic Variables Predictive of Increased COVID-19 Cumulative Case Rates in the State of Georgia

Systemic inequity concerning the social determinants of health has been known to affect morbidity and mortality for decades. Significant attention has focused on the individual-level demographic and co-morbid factors associated with rates and mortality of COVID-19. However, less attention has been given to the county-level social determinants of health that are the main drivers of health inequities. To identify the degree to which social determinants of health predict COVID-19 cumulative case rates at the county level in Georgia, we performed a sequential, cross-sectional ecologic analysis using a diverse set of socioeconomic and demographic variables. Lasso regression was used to identify variables from collinear groups. Twelve variables were correlated with cumulative case rates (for cases reported by 1 August 2020), with an adjusted r-squared of 0.4525. As the pandemic progressed, the correlation of demographic and socioeconomic factors with cumulative case rates increased, as did the number of variables selected. Findings indicate the social determinants of health and demographic factors continue to predict case rates of COVID-19 at the county level as the pandemic evolves. This research contributes to the growing body of evidence that health disparities continue to widen, disproportionately affecting vulnerable populations.


Introduction
Coronavirus disease 2019 (COVID-19), the disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), was first detected in late 2019 in Wuhan, China [1]. Since its emergence, COVID-19 has spread globally, causing massive morbidity and mortality worldwide [2]. The World Health Organization declared COVID-19 a public health emergency of international concern on 30 January 2020, and subsequently labeled it a pandemic on 12 March 2020 [3]. To date, there have been more than 18 million confirmed infections worldwide with over 675,000 deaths attributed to COVID-19 [2].
Early in the pandemic, evidence indicated minority groups and those with lower socioeconomic position suffered disproportionately from COVID-19, both in the United States (US) and abroad [4,5]. Historically, these same groups experienced an inordinate burden of disease, both infectious and reliability, the original data source for CHR was located and used when available. If information was unavailable from the original data source, as with the food environment index, the county statistic was estimated as the mean of the geographically surrounding counties. For rare events where data were suppressed due to small numbers, such as infant mortality, values were estimated as zero.
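The neighbor-mean estimation described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the county names, the adjacency list, and the function name `impute_from_neighbors` are hypothetical, and missing values are represented as `None`.

```python
from statistics import mean


def impute_from_neighbors(values, neighbors):
    """Fill missing county values (None) with the mean of the observed
    values in geographically adjacent counties.

    values:    dict mapping county name -> value or None (hypothetical data)
    neighbors: dict mapping county name -> list of adjacent county names
    """
    filled = dict(values)
    for county, value in values.items():
        if value is None:
            # Use only neighbors that actually have an observed value.
            observed = [values[n] for n in neighbors.get(county, [])
                        if values.get(n) is not None]
            if observed:
                filled[county] = mean(observed)
    return filled


# Hypothetical toy example: county "B" has no food environment index.
food_index = {"A": 6.0, "B": None, "C": 8.0, "D": 7.0}
adjacency = {"B": ["A", "C", "D"]}
imputed = impute_from_neighbors(food_index, adjacency)
```

A county with no observed neighbors is left missing in this sketch; how such cases were handled is not stated in the text.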

Outcome of Interest
The COVID-19 case definition for the GA DPH was an individual with positive molecular testing for SARS-CoV-2. Cases recorded by the GA DPH were reported through electronic lab reporting and the state electronic notifiable disease surveillance system, as well as via calls or faxes from providers [23]. The continuous dependent variable in our study was the cumulative number of confirmed cases per 100,000 residents in a county, as publicly reported by the GA DPH on 1 August 2020. Case rates per 100,000 residents were log-transformed to approximate normality before analysis. Cases were excluded from our analysis if they did not have a known county of residence in Georgia at the time of case reporting.
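As a small illustration of the outcome variable, the sketch below computes a cumulative case rate per 100,000 residents and its logarithm. The natural logarithm is an assumption, since the text does not state the base, and the counts in the example are hypothetical rather than taken from the GA DPH data.

```python
import math


def case_rate_and_log(cases, population, per=100_000):
    """Return (cases per `per` residents, natural log of that rate).

    The natural log is assumed here; the source does not specify the base.
    """
    if cases <= 0 or population <= 0:
        # The log of a zero rate is undefined, so such counties need
        # separate handling.
        raise ValueError("cases and population must be positive")
    rate = cases / population * per
    return rate, math.log(rate)


# Hypothetical county: 150 cumulative cases among 10,000 residents.
rate, log_rate = case_rate_and_log(150, 10_000)
```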

Data Analysis
Descriptive statistics, including mean, median, and standard deviation, were calculated for all variables. Variables for which data could not be ascertained or accurately estimated were omitted from the model selection process. CHR rankings for access to care, such as primary care physician rate, dentist rate, and mental healthcare provider rate, were excluded from analysis due to the small geographic county size and the possibility of residents from adjoining counties sharing providers. Racial demographics were separated into minority or non-Hispanic White. Table S1 reports the included and excluded variables in our analysis and their level of inclusion or exclusion in the county health rankings.
Initial multivariable regression analysis revealed significant multicollinearity, which was not easily rectified using standard techniques related to variance inflation, condition index, and variance proportion diagnostics. A predictor collinearity matrix (Figure 1), created using the statistical programming language R v.3.6.3 (R Core Team, R Foundation for Statistical Computing, Vienna, Austria), revealed this multicollinearity, justifying the use of a technique other than best subset multivariable regression analysis. Lasso (least absolute shrinkage and selection operator) regression analysis was used due to the number of predictors and overlap of variables. Figure 1 confirms that the selection of lasso for analysis was appropriate as opposed to other coefficient shrinkage techniques, such as elastic net [25]. By shrinking coefficients to zero, lasso variable selection allowed for the automated determination of the most important variables in a group of collinear determinants where traditional least-squares linear regression models failed and the variance of the least-squares estimators was unacceptably high. Proposed by Tibshirani in 1996 [26], lasso has been used in a variety of settings with similar sets of variables for outcomes with complex sets of underlying predictors [27][28][29]. Lasso improves variable selection not only in settings where significant multicollinearity occurs but also when many predictors may each contribute small to moderate effects. Hence, this technique is useful in the analysis of large data sets that include variables such as demographics, housing statistics, and economic indicators, which often overlap both within and between groups of variables. Lasso analysis was performed in SAS v9.4 (SAS Institute Inc., Cary, NC, USA) using the GLMSELECT procedure and the default Schwarz Bayesian Information Criterion (SBC) for variable selection.
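To illustrate how lasso selects among collinear predictors, the following is a minimal sketch in plain Python. It is not the SAS GLMSELECT implementation used in the study: it fits the lasso by coordinate descent on synthetic data (two correlated predictors, `x1` and `x2`, and one unrelated predictor, `x3`) and picks the penalty by a commonly used form of the criterion, SBC = n ln(SSE/n) + k ln(n). The data, the penalty grid, and all function names are assumptions made for illustration.

```python
import math
import random


def standardize(columns):
    """Center each column and scale it to unit (population) variance."""
    out = []
    for col in columns:
        m = sum(col) / len(col)
        sd = math.sqrt(sum((v - m) ** 2 for v in col) / len(col))
        out.append([(v - m) / sd for v in col])
    return out


def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_cd(X, y, lam, n_iter=500, tol=1e-8):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1.

    X is a list of standardized columns; y is centered.
    Returns (coefficients, residuals).
    """
    n, p = len(y), len(X)
    beta = [0.0] * p
    resid = list(y)
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p):
            xj = X[j]
            # Correlation of column j with the partial residual; adding
            # beta[j] restores column j's own contribution (unit variance).
            rho = sum(xj[i] * resid[i] for i in range(n)) / n + beta[j]
            new = soft_threshold(rho, lam)
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * xj[i]
                beta[j] = new
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return beta, resid


def sbc(resid, n_selected):
    """SBC = n*ln(SSE/n) + k*ln(n), with k = selected terms + intercept."""
    n = len(resid)
    sse = sum(r * r for r in resid)
    return n * math.log(sse / n) + (n_selected + 1) * math.log(n)


# Synthetic data (hypothetical): x2 is strongly correlated with x1.
rng = random.Random(0)
n = 100
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [v + rng.gauss(0, 0.3) for v in x1]
x3 = [rng.gauss(0, 1) for _ in range(n)]
y_raw = [2.0 * a + rng.gauss(0, 0.5) for a in x1]
y_mean = sum(y_raw) / n
y = [v - y_mean for v in y_raw]
X = standardize([x1, x2, x3])
names = ["x1", "x2", "x3"]

# Smallest penalty at which every coefficient is zero, then a decreasing grid.
lam_max = max(abs(sum(xj[i] * y[i] for i in range(n)) / n) for xj in X)
grid = [lam_max * (0.7 ** k) for k in range(20)]

results = []
for lam in grid:
    beta, resid = lasso_cd(X, y, lam)
    selected = [nm for nm, b in zip(names, beta) if b != 0.0]
    results.append((sbc(resid, len(selected)), lam, selected, beta))

best = min(results, key=lambda r: r[0])
print("penalty:", best[1], "selected:", best[2])
```

In this sketch the coefficients are on the standardized scale, so they are comparable to one another within a fitted model but not directly interpretable in units of the outcome.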
A sensitivity analysis was also performed by applying the same procedures to data from 1st April, 1st May, 1st June and 1st July, which provided multiple cross-sectional analyses to understand how predictive variables and overall predictive ability changed over time as the virus spread throughout the state.

Results
As of 1 August 2020, the confirmed cumulative rate reported by the DPH for the state of Georgia was 1726.62 per 100,000 residents, with 190,012 cumulative cases diagnosed in the state. After non-residents and patients with unknown residency status were excluded, the mean and median county case rates were 1748.34 and 1538.46 per 100,000 residents, respectively. Cases per 100,000 residents varied from 612.60 in Long County to 6140.11 in Chattahoochee County. Long County reported 19,915 residents according to the DPH [23] and was 81% rural according to the CHR [25]. Chattahoochee County reported 10,749 residents and was 30% rural. Cumulative COVID-19 cases per 100,000 residents at the conclusion of the study are also shown by county in Figure 2.

Descriptive statistics, including mean, median, and standard deviation for case rates and the twelve variables chosen by lasso analysis on 1 August 2020, are shown in Table 1. Mean, median, and standard deviation for all independent variables considered are shown in Table S2. The overall r-squared on 1 August 2020 was 0.4940, with an adjusted r-squared of 0.4525, indicating the final model had a moderate correlation with cumulative case rates by county. Lasso eliminated some variables from the analysis by shrinking their coefficients to zero, and it also shrinks the coefficients of the variables it retains. Therefore, the standardized coefficients (βz) can be interpreted as they relate to each other within the model but should not be interpreted directly in terms of the dependent variable. Unlike traditional ordinary least squares regression, the coefficients do not directly represent a percent change in cumulative case rates in our model. Lasso analysis was used in our case to choose variables in the face of collinearity, as well as to identify variables that may only have a mild to moderate association.
Socioeconomic predictors in the final model included teen birth rate, children in poverty, children qualifying for free lunch, child mortality rate, and percentage of uninsured adults. The strongest positive indicators were those involving children, with the highest positive coefficient for the percent of children living in poverty (βz = 0.125). Additionally, children qualifying for free lunch (βz = 0.115) and child mortality rate (βz = 0.11) had a stronger positive association with increasing cumulative case rates relative to other variables in the final model. Lesser contributing variables were uninsured adults (βz = 0.078) and teen birth rate (βz = 0.035). Percent of non-Hispanic Whites (βz = −0.174) and percent of those with long commutes who drive alone (βz = −0.183) had the strongest standardized coefficients and were inversely related to cumulative case rates. In addition to minority status, other demographic indicators included were the percent of residents under 18 (βz = 0.034), percent of female residents (βz = −0.067), percent of residents not fluent in English (βz = 0.086), and Black/White segregation index (βz = 0.088). Other variables included were percent with annual influenza vaccine (βz = −0.062) and percent of those who self-report poor or fair health (βz = 0.09).
Our sensitivity analysis, shown in Table 2, indicates how the variables chosen by lasso analysis changed over time at monthly intervals, beginning 1 April 2020. Standardized coefficient (βz) estimates are included for each variable in the table. However, due to the mechanics of lasso analysis mentioned above, these should not be compared between models or in direct relation to cumulative case rates. We report them to show their negative or positive association with cumulative case rates, as well as to allow comparison within a model of the contribution of each variable at a specific time point. Table S3 includes t statistics and p-values for each coefficient presented below.
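The two fit statistics reported for the 1 August model can be related through the standard adjusted r-squared formula. The short check below is a sketch under stated assumptions: it treats 0.4940 as the unadjusted r-squared, uses n = 159 (the number of counties in Georgia, which this section does not state explicitly), and uses p = 12 selected variables.

```python
def adjusted_r2(r2, n, p):
    """Adjusted r-squared for n observations and p predictors."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


# Assumed inputs: r2 = 0.4940 read as unadjusted r-squared,
# n = 159 Georgia counties, p = 12 variables selected by lasso.
adj = adjusted_r2(0.4940, n=159, p=12)
print(round(adj, 3))
```

Under these assumptions the result is about 0.452, in line with the reported adjusted r-squared of 0.4525 once rounding of the inputs is taken into account.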
A strengthening association of predictive variables with the outcome and a generally increasing number of chosen variables over time were observed. The adjusted r-squared on 1 April was 0.0930, with only one variable being predictive of cumulative case rates. By 1 August 2020, twelve variables were included in the model with an adjusted r-squared of 0.4540. On 1 April 2020, race was not predictive of higher cumulative case rates. However, by 1 May 2020, continuing until the final model on 1 August 2020, higher numbers of minorities were consistently predictive of counties with increased cumulative case rates. This variable was the most consistent variable included and was chosen in all models after 1 April. Some variables included in earlier time points were considered indicators of urban versus rural spread, such as higher levels of air pollution (PM2.5) and violent crime rates. With time, more indicators of socioeconomic status, such as low birthweight and lack of insurance, entered the model. The overall increase in adjusted r-squared and the number of socioeconomic variables predictive of increased case rates show that with the spread of COVID-19 over time in the state, the social determinants of health became increasingly predictive of higher cumulative case rates in the counties.

Discussion
Sequential lasso regression analysis showed an increasing trend of the predictive value of the social determinants of health on COVID-19 cumulative case rates. In our final analysis on 1 August 2020, our model using county-level demographics, health, access, and socioeconomic measures accounted for 45.4% of the variation in cumulative case rates by county. Additionally, we observed the number of variables included in the model by lasso regression, as well as model strength, increased as the pandemic progressed.
Our findings contribute to a growing body of literature that highlights the need to improve our understanding of the complex interconnectivity between demographics, socioeconomics, and structural inequities as they pertain to infectious diseases. Our study was also consistent with the finding of nationwide community-level disparities in COVID-19 infections and deaths in large US metropolitan areas [30]. Health disparities among race and ethnic divisions are not unique or specific to COVID-19, having been observed in a variety of infectious and chronic diseases [14,31,32]. Instead of proactively protecting those known to be the most vulnerable in society, the gaps in health disparities continue to widen during this crisis. These findings correspond with others that indicated minority groups were overrepresented in low-wage jobs considered essential, such as transportation and grocery store workers [33]. Additionally, another study found fewer than one in five Black Americans have job flexibility to work from home compared to more than a third of White and Asian American workers [34]. Thus, racial differences seen in our study and others may be related to a variety of reasons, including a varying ability to social distance and differences in access and quality of care [35], as well as differences in perceived susceptibility to infectious diseases [36]. Our research supports these explanations, as the inverse association between non-Hispanic Whites and cumulative case rates was the most consistent variable included over time and had one of the strongest coefficients in our final model, alongside a variety of other demographic and socioeconomic indicators.
The negative association of the percent of residents with long, solo commutes was first seen in the analysis on 1 July and was the most influential variable in the final analysis on 1 August. In general, this variable is considered indicative of poor health and chronic diseases, such as obesity, diabetes, hypertension, and cardiopulmonary disease [37][38][39]. We suspect its inclusion represented residents of suburban communities, who may be telecommuting during the crisis. Wealthy suburban residents may be more likely to have occupations that allow working from home, as opposed to urban residents, and may be additionally advantaged due to low crowding and a higher possibility of social distancing. This association between the ability to work from home and socioeconomic status has been previously reported [40]. Research elsewhere supports such variation in COVID-19 cases by geolocation [41].
Although it was a small contributor to the final model (βz = 0.034), the addition of the percent of people under 18 years old as a variable in the 1 August model is worth discussion. This variable was not seen previously in our sensitivity analysis. Its addition may be an aberration, but in light of concern over younger individuals becoming infected and spreading the infection [42], this association could also become stronger with time. There is a known increased risk of morbidity and mortality for older adults and seeming resistance to severe disease outcomes by young adults and children who may nonetheless be spreading the virus [43]. The increased availability of testing may play a part in the inclusion of this variable, as children, teens, and their young parents may now test at a higher rate even if they do not present with severe symptoms. Additionally, because this is an ecologic study, the inclusion of this variable could indicate infection of adults or parents with children under the age of 18 in the household rather than of the children themselves. Further research and monitoring are needed as children return to school.
Our study has several important limitations that should be taken into account. First, most of the county-level variables used as independent variables were measured by a variety of organizations across a two- to three-year time span for a purpose beyond COVID-19. Second, the implementation of new state policies for the mitigation of COVID-19, including stay-at-home orders, social distancing, and mask ordinances, may have impacts not measured through this cross-sectional research. Furthermore, our unit of analysis was the county. Therefore, aggregation bias should be considered, as the relationships observed at the county level may not hold at the individual level. Selecting covariates for the final model through lasso regression, rather than purely on theoretical reasoning, may also be a limitation. Lasso regression analysis has been shown to over-select regression coefficients, which is a concern and drawback for this method. However, it has still been shown to be superior to ordinary least squares techniques in similar situations [44,45].
Since COVID-19 is caused by a novel coronavirus, we believe validating traditional epidemiological techniques using machine learning models, such as lasso, can add support to previous findings related to race and have the additional ability of identifying variables that contribute small or moderate effects to COVID-19 infection rates. Additional research is needed to further explore the complicated relationship between COVID-19 pathogenesis, environmental factors, demographics, and socioeconomics with regard to the social determinants of health. We hope the use of the lasso in this study serves as another methodology that can be used to investigate other outcomes of COVID-19 and their relationship to the social determinants of health, such as cause-specific mortality and hospitalization rates. Due to surveillance gaps in this rapidly spreading disease, there have been challenges in collecting and obtaining individual-level information that can help address the concerns with an ecologic study. Combining individual-level data with neighborhood effects through the use of multilevel modeling could provide a clearer picture of factors related to COVID-19 diagnoses and mortality. Finally, since this is an ecologic, cross-sectional examination of COVID-19 in the state of Georgia, causal inference should not be extrapolated from these findings. However, our final model and sensitivity analysis provide a starting point for future longitudinal research. The consistency of our findings with the disparities and inequalities observed across the country in morbidity and mortality rates suggests many structural-level issues are contributing to the spread of COVID-19 [46].

Conclusions
This research examined the community-level impact of factors from both a health and economic perspective on county-level COVID-19 case rates in the state of Georgia. Because health, demographic, and socioeconomic factors overlap in very complex ways, the full scale and intricacy of these inter-linkages are difficult to ascertain. However, we believe the strategic use of machine learning techniques, such as lasso, can elucidate some of these complexities. In the absence of consistent data collection on the demographics of positive cases, group-level studies such as ours help to identify influential predictors. Given the knowledge that the social determinants of health have significant effects on acute and chronic disease burden within a population, these findings support the linkage between fragile health, economic indicators, and demographics as key predictors of infection rates. Until longstanding inequities are eliminated and systemic injustices are addressed, the health and wellbeing of vulnerable and minority populations across Georgia will continue to be disproportionately affected, leaving marginalized communities to shoulder the largest burden of COVID-19.