Ecological Associations between Obesity Prevalence and Neighborhood Determinants Using Spatial Machine Learning in Chicago, Illinois, USA

Abstract: Some studies have established relationships between neighborhood conditions and health. However, they neither evaluate the relative importance of neighborhood components in increasing obesity nor, more crucially, how these neighborhood factors vary geographically. We use the geographical random forest to analyze each factor's spatial variation and contribution to explaining tract-level obesity prevalence in Chicago, Illinois, United States. According to our findings, the geographical random forest outperforms the typically used nonspatial random forest model in terms of the out-of-bag prediction accuracy. In the Chicago tracts, poverty is the most important factor, whereas biking is the least important. Crime is the most critical factor in explaining obesity prevalence in Chicago's south suburbs while poverty appears to be the most important predictor in the city's south. For policy planning and evidence-based decision-making, our results suggest that social and ecological patterns of neighborhood characteristics are associated with obesity prevalence. Consequently, interventions should be devised and implemented based on local circumstances rather than generic notions of prevention strategies and healthcare barriers that apply to Chicago.


Introduction
The United States ranks 12th in the world regarding the number of overweight people [1]. Obesity prevalence was 40.0% among individuals aged 20 to 39 in 2020, 44.8% among those aged 40 to 59, and 42.8% among those aged 60 and up. Obesity is related to heart disease, stroke, type 2 diabetes, and several cancer types [2]. Although an active lifestyle reduces the risk of obesity, most adults in developed and developing countries cannot meet the recommended levels of physical activity due to sedentary lifestyles, the use of passive modes of transportation, etc. [3,4].
While it is generally understood that individual-level factors such as genetic predisposition [5] and behavioral aspects (e.g., physical activity) [6] play a role in weight gain, more research into the influence of residential neighborhood characteristics is needed to provide a multidisciplinary understanding of obesity prevalence. Environmental characteristics such as urban form, neighborhood safety, socioeconomic capital, and food availability have received attention in the literature [7][8][9]. As suggested elsewhere [7][8][9][10][11], these determinants are likely related to a wide range of social and environmental traits such as urban mobility and crime. Evidence also suggests that the built environment can positively influence health behaviors or be a health stressor [12]. For instance, the risk of obesity rises with residential instability and unaffordability of rent [13]. Although housing costs and house insecurity (i.e., homelessness) are a significant burden in the United States, little is known about their impact on obesity prevalence. Cost-burdened low-income households have limited resilience in economic crises or job loss, resulting in housing insecurity and other significant sacrifices that harm health [13]. It reduces a household's ability to pay for health-promoting necessities such as nutritious food, healthcare visits, energy, and home maintenance [14][15][16]. Additionally, urban amenities such as parks, bike lanes, and playgrounds encourage an active lifestyle and reduce the incidence of obesity [17][18][19][20][21]. While green spaces are important in reducing obesity concerns [22], there are inconsistent results concerning the association between green spaces and obesity [23].
To date, obesity has been primarily investigated in public and behavioral health, and to a lesser extent, from a geographical science perspective. Spatial modeling could be a powerful means for urban health practitioners to grasp geographical patterns and dynamics that otherwise remain unnoticed. While most obesity research relies either on linear aspatial [24,25] or linear spatial models [26][27][28][29][30][31], nonlinear spatially explicit modeling to investigate the relationships between tract-level obesity prevalence and socioenvironmental factors is, to our knowledge, lacking. For example, Ferdowsy et al. [32] used nonlinear random forest modeling to assess obesity risk by means of behavioral factors. Ghosh and Guha [33] employed latent Dirichlet allocation to investigate obesity-related themes in Twitter data. Further, obesity studies on the built environment are undertaken at the microscale utilizing geospatial technologies such as geographical positioning systems [34][35][36] and with a focus on pediatric obesity [37,38]. However, obesity is context dependent and driven by an interplay of policy, social, economic, cultural, environmental, behavioral, and biological factors, as well as cross-sector and nonlinear interactions across these dimensions [39,40]. Collinearity among socioenvironmental covariates further complicates analyses with linear models, necessitating the development of collinearity-aware models [41]. We investigate the spatial distribution of obesity prevalence at the local level using the geographical random forest (GRF) model.
The GRF is a novel tree-based spatial machine-learning model [42,43]. It has the advantage of not presupposing local linearity and often outperforms an aspatial random forest model in predictive performance [44], but at the cost of greater computational complexity [45]. The GRF model is conceptually inspired by the geographically weighted regression [46], except that it is calibrated using a random forest rather than traditional least squares. GRF, unlike geographically weighted regression, does not need to account for multicollinearity and can evaluate all independent variables without the requirement for collinearity screening. It may also be used to examine local relationships between independent and dependent geographical variables while typically resulting in higher prediction accuracy than geographically weighted regression [42,44].
To our knowledge, no study used the GRF to examine spatial variation (i.e., nonstationarity) in the associations between tract-level obesity prevalence and socio-environmental neighborhood variables. Our goals were to (a) investigate local associations between obesity prevalence and tract-level variables in order to focus prevention and intervention efforts in high-risk areas and (b) compare GRF's prediction performance to that of a traditional random forest regression model. Of note, our goal was not to propose a holistic approach to model and map obesity prevalence by considering every available factor, but rather to present a new spatial approach to the community to understand obesity prevalence.
GRF was recently used to model socioeconomic circumstances in the European Union regions [47]. Furthermore, it was used to model the relative importance of 29 socioeconomic and health-related factors to the COVID-19 death rate, outperforming commonly used local and global regressions [44]. It was also applied to predict diabetes prevalence in the United States [42]. However, besides this limited number of applications, more evidence is needed.

Data
Obesity is a major problem in Chicago, where 61.2% of adults in the metropolitan area are overweight or obese [48]. Obesity was especially recognized as a contributing cause of death in metropolitan cities such as Chicago, Illinois [48]. We obtained cross-sectional, ecological data for all 793 census tracts in Chicago from various sources. We obtained census tract polygon geometries as TIGER/Line Shapefiles from the United States Census Bureau to conduct spatial analysis and mapping using geographic information systems (GIS) [49]. With an average size of 0.28 square miles (standard deviation [SD] ±0.39), census tracts were deemed a suitable analytical scale for assessing area-level obesity prevalence. The 11-digit Federal Information Processing Standards codes were used to enrich the geometries with census tract-level variables.
We obtained estimates of obesity prevalence per census tract based on responses to the Behavioral Risk Factor Surveillance System survey and the Centers for Disease Control and Prevention's PLACES Project [50]. Obesity is defined as having a body mass index (BMI) of at least 30 kg/m² using self-reported height and weight data. The obesity prevalence estimates for the year 2020 served as our response variable (Table 1). Based on previous studies [51,52], we selected eight covariates. The years of the covariates match the obesity data in 2019. First, we added severe rent as the percentage of the population spending more than 50% of their income on housing rent. Households with high monthly expenses have little money left over for other necessities such as food, clothing, utilities, and health care. These experiences may negatively affect physical and mental health [13]. Second, we included the poverty percentage, which is quantified based on income that varies by family size and composition and accounts for the uneven distribution of income among a population; third, the unemployment percentage relative to the corresponding census tract population; fourth, the eviction percentage, measured as eviction filings per 100 renter-occupied households; and fifth, the percentage of vacant housing per census tract provided by the Chicago Health Atlas [52]. Sixth, we included available green space (e.g., parks, green open spaces, residential gardens) using the averaged Normalized Difference Vegetation Index per census tract. Remote sensing data were obtained from Landsat 8 NASA Earth Data [53,54]. Seventh, using data from the Divvy Bike share system, we calculated bicycle usage as the ratio of bike trips to, from, and within census tracts to the corresponding population [55]. Eighth, we considered crime extracted from the Chicago Police Department [56] as the ratio between the number of all types of crimes and the corresponding census tract population. The crime data refer to the locations of crime incidents (Table 1).
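The variable definitions above can be illustrated with a minimal sketch. The function names are hypothetical and chosen for illustration only; the NDVI formula is the standard (NIR − Red)/(NIR + Red) band ratio, and the band values in the example are invented, not taken from the study data.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index in kg/m^2 from weight and height."""
    return weight_kg / height_m ** 2


def is_obese(weight_kg: float, height_m: float, threshold: float = 30.0) -> bool:
    """Obesity as defined in the text: BMI of at least 30 kg/m^2."""
    return bmi(weight_kg, height_m) >= threshold


def ndvi(nir: float, red: float) -> float:
    """Normalized Difference Vegetation Index for one pixel."""
    return (nir - red) / (nir + red)


def per_capita(count: float, population: float) -> float:
    """Ratio of an event count (e.g., crimes, bike trips) to tract population."""
    return count / population


# Illustrative values only (assumed, not from the paper).
tract_pixels = [(0.5, 0.1), (0.4, 0.2), (0.3, 0.3)]  # (NIR, Red) reflectances
tract_mean_ndvi = sum(ndvi(n, r) for n, r in tract_pixels) / len(tract_pixels)
```

Averaging the per-pixel index over all pixels in a tract, as in the last line, is one plausible reading of "averaged NDVI per census tract"; the exact aggregation used by the authors is not specified here.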

Analytical Approaches

Aspatial Random Forest
Since the repertoire of machine learning techniques is large [57], we selected a well-established and typically well-performing regression-based model, namely the random forest, to assess the association between obesity prevalence and neighborhood determinants. The standard random forest is a global machine learning approach assessing associations uniformly across space, while the model can assess non-linearities and interactions between variables [44]. Moreover, there are no strict statistical assumptions [58].
Briefly, a random forest comprises a collection of separate decision trees for regression analyses [57]. Each decision tree is fitted from a given training dataset. First, a subset is created by randomly selecting samples with replacement from the original training set (usually 2/3 of the training set). The remaining data (usually the other 1/3), referred to as the out-of-bag set, are used for model evaluation (i.e., the assessment of the predictive accuracy). Additionally, a random subset of covariates is picked for each node in each decision tree. The same approach is repeated for a large number of iterations, yielding a forest of trees trained with random subsets of training data. Finally, each tree's prediction error is computed, and the final output of the forest is the average of the trees' predictions.
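The bootstrap step described above can be sketched as follows. This is an illustration rather than the authors' implementation: the helper name `bootstrap_split` and the seed are assumptions. It also shows why the in-bag share is roughly two thirds: drawing n samples with replacement leaves about 1 − 1/e ≈ 63% of the distinct observations in the bag.

```python
import random


def bootstrap_split(n: int, rng: random.Random):
    """Draw n indices with replacement; the never-drawn indices are out-of-bag."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    oob = sorted(set(range(n)) - set(in_bag))
    return in_bag, oob


rng = random.Random(42)      # fixed seed so the sketch is reproducible
n_tracts = 793               # number of census tracts reported in the text

in_bag, oob = bootstrap_split(n_tracts, rng)

# Average share of distinct observations that end up in the bag.
shares = []
for _ in range(200):
    bag, _ = bootstrap_split(n_tracts, rng)
    shares.append(len(set(bag)) / n_tracts)
mean_unique = sum(shares) / len(shares)
```

Each tree in a forest would be trained on its own `in_bag` draw and evaluated on the matching `oob` indices.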
The out-of-bag accuracy is a robust independent measure frequently used to evaluate each variable's relevance and the overall model performance [57]. We used an increase in Mean Square Error (IncMSE) as a metric for determining the relevance of each variable [58]. The out-of-bag error is calculated by randomly permuting the values of each variable in the out-of-bag set. If the out-of-bag error rises, the variable is deemed to be important; and the greater the change, the more relevant the variable is in estimating the dependent variable [44,58]. However, the traditional random forest results in a single regression model assumed to be valid for the entire study area. As a result, the algorithm fails to account for geographic variations in the associations, which may lead to an inadequate representation of the associations.
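The permutation logic behind IncMSE can be sketched with a toy example. The fixed `predict` function below is a stand-in for a fitted forest, and the data are synthetic; in an actual random forest the permutation is applied to each tree's out-of-bag samples and the MSE changes are averaged over trees.

```python
import random


def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, rng):
    """Increase in MSE after shuffling each column of X in turn."""
    baseline = mse(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        rng.shuffle(column)  # break the link between feature j and y
        X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
        permuted = mse(y, [predict(row) for row in X_perm])
        importances.append(permuted - baseline)
    return importances


# Synthetic data: the outcome depends on feature 0 only.
X = [[float(i), float(i % 5)] for i in range(50)]
y = [2.0 * row[0] for row in X]


def predict(row):
    """Stand-in for a fitted model (assumed, not a real forest)."""
    return 2.0 * row[0]


inc_mse = permutation_importance(predict, X, y, random.Random(0))
```

Here shuffling the informative feature raises the error, while shuffling the irrelevant one leaves it unchanged, which mirrors how a larger IncMSE marks a more relevant variable.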

Geographical Random Forest
To circumvent the restrictive assumption of a uniform association of the traditional random forest, we fitted a geographical random forest (GRF) with spatial weighting capable of modeling spatial non-stationarity [47]. Technically, GRF is calibrated locally using only nearby observations through a spatial kernel and a spatial weights matrix [59]. The main principle of GRF is similar to the geographically weighted regression [60,61], in which a moving window is applied to create local submodels. Each local random forest is evaluated for each site depending on the input data from surrounding observations. We used an adaptive spatial kernel for the GRF since it is widely used when data points are unevenly distributed spatially and when spatial autocorrelation is assumed to be present in the data [47,59]. We used the minimized out-of-bag error to determine an optimal bandwidth (BW). For a thorough discussion of the GRF, see Georganos et al. [47,59].
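A minimal sketch of the adaptive-kernel idea follows: for every location, the k nearest observations form the training subset of one local sub-model. The names are hypothetical, the local mean is only a placeholder for a local random forest, and counting the focal tract toward the bandwidth is an assumption of this sketch. Spatial weighting of the neighbours is omitted.

```python
import math


def nearest_neighbours(coords, i, k):
    """Indices of the k observations closest to observation i (i included)."""
    xi, yi = coords[i]

    def dist(j):
        return math.hypot(coords[j][0] - xi, coords[j][1] - yi)

    return sorted(range(len(coords)), key=dist)[:k]


def local_fits(coords, y, k, fit):
    """Fit one local model per observation on its k nearest neighbours."""
    return [
        fit([y[j] for j in nearest_neighbours(coords, i, k)])
        for i in range(len(coords))
    ]


def mean(values):
    """Placeholder 'model': in GRF this would be a local random forest."""
    return sum(values) / len(values)


# Toy 10 x 10 grid of tract centroids with a smooth synthetic outcome.
coords = [(float(gx), float(gy)) for gx in range(10) for gy in range(10)]
outcome = [cx + cy for cx, cy in coords]
local_means = local_fits(coords, outcome, k=3, fit=mean)
```

With an adaptive kernel the number of neighbours is fixed (30 in the study), so the spatial extent of each window shrinks where tracts are dense and grows where they are sparse.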
To determine the GRF's optimal hyperparameters (i.e., the number of trees and the proportion of randomly sampled features at each node), we used a random grid search as performed elsewhere [42,44]. From a set of possible hyperparameter combinations, we utilized 10-fold cross-validation to determine the most suitable ones. We used a bandwidth of 30 observations, the number of trees was set to 1000, and the number of variables randomly sampled as candidates at each split was set to 5. Both the random forest and the GRF models were trained with these hyperparameters. We then computed performance metrics such as the mean square error (MSE), mean absolute error (MAE), root-mean-square error (RMSE), and coefficient of determination (R²). Since significant residual spatial autocorrelation violates regression assumptions concerning the independence of observations, we used the well-established Moran's I statistic to investigate the degree of residual spatial autocorrelation. We also used the local Moran's I to track potential spatial residual clustering [60].
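For reference, the four performance metrics named above can be computed as in the following sketch. The function name and the example values are illustrative assumptions, not results from the study.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MSE, MAE, RMSE, and R^2 for paired observations and predictions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)              # residual sum of squares
    mean_y = sum(y_true) / n
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)  # total sum of squares
    mse = ss_res / n
    return {
        "mse": mse,
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(mse),
        "r2": 1.0 - ss_res / ss_tot,
    }


# Made-up observed and predicted values for illustration.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0])
```

Lower MSE, MAE, and RMSE and a higher R² indicate better predictive performance, which is the basis on which the two models are compared.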
We used the permutation feature importance approach to evaluate the predictors' role in the random forest and GRF models. While the former is a global and aspatial model, GRF decomposes a random forest into local sub-models, thereby considering data-inherent nonstationarity and spatial autocorrelation. GRF yields local feature importance, local residuals, and local goodness-of-fit statistics for each predictor in each local random forest model [44]. Similar to the random forest, we ranked the variables' importance based on the percent change in the MSE [44]. Further, we mapped the local variable importance to examine how each independent variable's effect on obesity prevalence varies geographically.
Partial dependence plots were used to characterize the nonlinear relationships between the obesity prevalence and the covariates. The partial dependence plots reveal whether the relationship between the target and a feature is linear, monotonic, curvilinear, or more complex by presenting the expected target response as a function of the input features of interest [62]. All our statistical analyses are conducted with the "SpatialML" package [63] in the R Statistical Computing Environment [64]. For cartography, we used ArcGIS 10.8.1.
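The computation behind a partial dependence curve can be sketched as follows: the feature of interest is set to each grid value for every observation, and the model predictions are averaged. The `predict` function is an assumed toy model with a plateau, chosen only to produce a nonlinear curve; it is not the fitted model from the study.

```python
def partial_dependence(predict, X, feature, grid):
    """Average prediction when column `feature` is fixed to each grid value."""
    curve = []
    for value in grid:
        preds = [
            predict(row[:feature] + [value] + row[feature + 1:]) for row in X
        ]
        curve.append(sum(preds) / len(preds))
    return curve


def predict(row):
    """Toy model (assumed): rises with feature 0 up to 2.0, then stays flat."""
    return min(row[0], 2.0) + 0.5 * row[1]


X = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0]]
curve = partial_dependence(predict, X, feature=0, grid=[0.0, 1.0, 2.0, 3.0])
```

Plotting `curve` against the grid would show a linear rise followed by a flat segment, the kind of range-specific linearity a partial dependence plot is meant to reveal.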

Aspatial Random Forest and GRF Results
In the out-of-bag set, the GRF model had a lower MSE, RMSE, and MAE and a higher R² than the global random forest model (Table 2). Overall, the associations show complex shapes, highlighting the need to utilize nonlinear models. Figure 2 depicts the nonlinear relationship between obesity prevalence and the covariates. The covariates biking (Figure 2b), crime (Figure 2c), unemployment (Figure 2d), and eviction rate (Figure 2e) were curvilinearly associated with obesity prevalence, while vacant housing (Figure 2f) and severe rent (Figure 2h) show a rather nonlinear relationship with obesity prevalence. While the overall associations indicate nonlinearities, linear associations exist within specific ranges. There is, for example, a positive linear association between eviction rate and obesity prevalence in the range below 2%, but the effect remains stable thereafter (Figure 2e). Similarly, in the range between 15% and 27%, there is a positive linear association between severe rent and obesity prevalence, but the effect is negligible after that (Figure 2h).
Table 2 compares the importance of these covariates in the random forest and GRF model. According to the permutation-based feature importance, the crime ratio is the most important variable in the former, followed by poverty, unemployment, and eviction rate, whereas poverty is the most important variable in the GRF, followed by the crime ratio, unemployment, and eviction rate. As shown in Table 2, the average positive MSE of the GRF model shows that most tracts have positive local covariate importance. Other determinants' importance ordering also varies from that of the global random forest model. Biking, for example, is placed fifth in the random forest model but last in the GRF model (Table 2). The difference in feature importance is likely because the random forest is a global model and does not take spatial and local variations into account. In contrast, GRF assesses the spatial (i.e., local) variation of the predictor variables.
Additionally, we mapped the determinants to better understand the spatial distribution of the local variable importance (IncMSE) (Figure 3). Values above zero indicate that a variable is important for explaining obesity prevalence, and higher tract values suggest greater importance. Green spaces seem to be of minor importance across the city (Figure 3a). The same holds for biking, with the exception of a few places southwest of downtown (Figure 3b). Unemployment, on the other hand, is most important in the area south of downtown (Figure 3c). Poverty is most important in various areas to the south and southwest of downtown (Figure 3d). The crime rate is most important in southern tracts (Figure 3e). Except for severe rent, which has the highest importance in the south, vacant housing (Figure 3f

Figure 4a depicts the local R² of the GRF. The model fit varied across space, ranging from 35% to 60%. In tracts located predominantly in the north of the city, the R² was below 0.5 (Figure 4a). The global Moran's I test (I = −0.01, p = 0.18) indicates no significant spatial autocorrelation in the residuals. Additionally, the local Moran's I reveals that there is no geographical clustering of residuals in most locations and that the residuals are randomly distributed (Figure 4c). These results indicate that the GRF models the data well.
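The global Moran's I used to check the residuals can be sketched as follows, using the standard formula I = (n / S0) · Σᵢⱼ wᵢⱼ zᵢ zⱼ / Σᵢ zᵢ², where z are deviations from the mean and S0 is the sum of all weights. The binary chain-neighbour weights and the values are toy assumptions; significance testing (the reported p-value) is not covered by this sketch.

```python
def morans_i(values, weights):
    """Global Moran's I for `values` given a square spatial weights matrix."""
    n = len(values)
    mean_v = sum(values) / n
    z = [v - mean_v for v in values]                 # deviations from the mean
    s0 = sum(sum(row) for row in weights)            # sum of all weights
    cross = sum(
        weights[i][j] * z[i] * z[j] for i in range(n) for j in range(n)
    )
    return (n / s0) * cross / sum(zi * zi for zi in z)


def chain_weights(n):
    """Binary weights for n units on a line: each unit neighbours the next."""
    w = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        w[i][i + 1] = 1.0
        w[i + 1][i] = 1.0
    return w


w = chain_weights(4)
clustered = morans_i([1.0, 2.0, 3.0, 4.0], w)      # similar values adjacent
alternating = morans_i([1.0, -1.0, 1.0, -1.0], w)  # dissimilar values adjacent
```

Values near zero, as reported for the GRF residuals, are consistent with a spatially random pattern, whereas positive values indicate clustering and negative values indicate dispersion.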

Discussion
This cross-sectional ecological study looks at the prevalence of obesity at the tract level in Chicago using the geographical random forest (GRF), an innovative spatial machine learning approach. According to our findings, the GRF model outperformed the typically used aspatial random forest model in terms of prediction accuracy. This suggests that GRF considers spatial heterogeneity in the associations and identifies the factors that trigger local variations in obesity prevalence rates, which can contribute to developing place-based interventions to control obesity. Our finding corroborates the results of a few other studies using the GRF model [23,44,47]. We also found that three of the top four most important local factors (i.e., poverty, crime, and unemployment) refer to neighborhood determinants, and one is associated with housing (eviction rate). The ramifications of the findings are described below.

Poverty
Poverty has the greatest proportionate importance per census tract in the west and southwest downtown neighborhoods, while it has the least importance in the rest of Chicago (Figure 3d). In addition, the overall positive association between obesity and poverty in Figure 2d demonstrates the importance of poverty. Poverty creates an obesogenic environment in which people may lack access to affordable, healthful foods [65], lack funds for sports equipment and physical activity participation, and are exposed to psychological stress [66], as well as live in overcrowded houses with poor sleep quality [67]. However, the relationship between high wealth and increased physical activity is not well established and requires further research [68]. Since poverty is associated with an increased risk of obesity, policymakers and planners should assess the consequences of neighborhood poverty on health outcomes [69]. Deprivation promotes the formation of harmful habits and cultures, which are then passed down through generations. Obesity risk in emerging adulthood is significantly increased by cumulative exposure to neighborhood poverty. The deterioration of neighborhood socioeconomic conditions is also a significant obesity risk factor [69]. Our study suggests that multidisciplinary policies and organizations (such as the USDA Food and Nutrition Service and housing authorities) work together to control and reduce obesity.

Crime
The importance of crime (i.e., a variety of illegal behaviors) per census tract is greatest in the southern districts, although it is not widespread in Chicago's northern neighborhoods (Figure 3e). Criminality is strongly associated with obesity prevalence along southern Lake Michigan, where a large percentage of people face severe rent, housing insecurity, poverty, and unemployment (Figure 1). Both community-level and individual-level factors may be associated with neighborhood crime. At the community level, business withdrawal, population outmigration, physical deterioration, declining community resources, and crumbling public infrastructure are witnessed in high-crime zones that are unsafe for physical exercise [70,71]. At the individual level, perceptions of unsafety, anxiety, and stress are influenced by neighborhood crime, which restricts participation in physical activities [72]. When dealing with crime issues at the neighborhood level, crime types (e.g., home burglary vs. robbery) should be treated separately because different institutions and policies regulate them. Well-designed neighborhoods with well-maintained socioeconomic capital (e.g., sidewalks) encourage healthy behaviors and inhibit illegal behaviors.

Unemployment
The importance of unemployment per census tract is highest in the southern downtown neighborhoods along Lake Michigan (Figure 3c). Similarly, unemployment is strongly associated with obesity in the districts south and southwest of downtown, where green spaces are abundant but eviction rates are high (Figure 1). Unemployment is well known to be associated with an increased risk of illness. This association could be partly explained by negative health-related behaviors (particularly smoking, diet, exercise, and alcohol consumption) caused by the lower income, altered daily routine, and psychological stress that typically accompany job loss [73]. However, Hughes et al. [51] found that job seekers were less likely to be overweight than never-unemployed people, implying that the association between unemployment and BMI may differ across populations. Ruhm [73] finds that a 1% increase in the state unemployment rate is associated with a 2% decrease in daily fat intake between 1987 and 1995; he also finds that during economic downturns, body weight falls among the severely obese and exercise increases among the previously physically inactive. Cutler et al. [74] find that a higher unemployment rate at graduation is associated with lower income and greater obesity later in life. Deb et al. [75] explore the impact of business closures on obesity. They find that job separation increases the likelihood of being obese more significantly for females, those with lower incomes, the least educated, and the middle-aged, compared to elderly individuals. Unemployment should be studied primarily at the household and individual levels regarding obesity prevalence.

Eviction Rate
The importance of the eviction rate per census tract is greatest in the south, although it is also high in some western neighborhoods (Figure 3g). Along Lake Michigan, where unemployed people dwell in poverty, the eviction rate is highly associated with obesity (Figure 1). The association between eviction from renter-occupied properties and health is not well understood [76], although housing affordability and eviction are inextricably linked. Moreover, little is known about how policy interventions, such as supply-side housing subsidy programs designed to increase affordable housing, affect local eviction dynamics [77]. Housing insecurity and unaffordability can lead to stress, worry, and despair, as well as change metabolism and raise the risk of obesity [78][79][80]. The COVID-19 global pandemic, during which tenants lost their employment, has highlighted the need for research on eviction rates at the neighborhood level. Our findings draw attention to the need for localities to respond quickly in order to protect public health from the obesity epidemic through current measures such as increasing rental assistance and extending the eviction moratorium [81].

Other Factors
Other factors such as green spaces, severe rent, and vacant housing had minimal importance on the prevalence of obesity in Chicago. However, the literature [82] highlights the importance of green spaces as a valuable resource for physical activity and hence has the potential to contribute to reducing obesity and improving health. It piques the interest of planners and policymakers researching green space functionalities in terms of accessibility, availability, and visibility in Chicago, primarily focused on the tourism industry, rather than using green infrastructure to promote health. Similarly, biking infrastructure concentrated in the city center serves to promote tourism rather than physical activity [83]. A series of connected bike routes allows neighbors to quickly get to all places by bike, primarily in the city center. Integrating walking and cycling routes with green space is critical for creating a built environment that promotes physical activity. Additionally, unoccupied properties, such as underutilized urban spaces, cause urban blights, foster crime in neighborhoods, and might indirectly affect an active lifestyle [84].

Limitations
The tract-level estimates, which are assessed in several prior studies [85,86], have certain limitations. The outcome variable was collected by means of a telephone survey which likely faces problems due to recall bias and social desirability bias [87,88]. We cannot exclude that the population-level bias in self-reported weight and height is larger in telephone interviews than in in-person interviews to measure BMI. In addition to the data limitation, locally weighted models have some drawbacks. For example, we used an adaptive kernel bandwidth to choose the optimal number of tracts to train the GRF that accounts for differences in tract size. The tracts with different sizes varying across a geographic area may result in spillover effects of the dependent variable in neighboring tracts or spatial autocorrelation of the residuals. Not unexpectedly, the local R² in the GRF varied across space. For most tracts in the north of the city, the R² was below 0.5. These results indicate that the included variables only explained a limited fraction of the variance of the outcome variable and alternative variables should be included to improve the performance of the local models in these regions. While our results may be sensitive to the underlying analytical scale and zoning, causal inference is also hampered by the cross-sectional and ecological nature of the data [89].

Conclusions
This study is the first to use the GRF model with spatial weights to assess geographic variations of obesity prevalence at Chicago's tract level in response to the determinants associated with neighborhood conditions. The GRF outperforms the typically used nonspatial random forest model in terms of out-of-bag prediction accuracy. Poverty is the most important factor in Chicago tracts, while biking is the least important. While poverty is the most important predictor in Chicago's south suburbs, crime is the most important factor in explaining obesity's prevalence. Future research should look at other aspects of household quality (such as mobile homes, homelessness, and ecological factors), as well as the spatial behavior of the obesity epidemic from a neighborhood-household standpoint as a whole. Understanding the spatial heterogeneity of obesity-determinant correlations could support place-based policy developments to address the spatially varying obesity determinant.

Data Availability Statement:
The data presented in this paper are openly available at https://www.cdc.gov/obesity/data/prevalence-maps.html (accessed on 20 May 2022).