1. Introduction
Avoidable hospitalizations (AHs), also known as ambulatory care sensitive conditions (ACSCs), represent hospital admissions that could potentially be prevented through timely and effective outpatient care. These conditions include both chronic illnesses like diabetes, asthma, and congestive heart failure, as well as acute conditions such as pneumonia and complicated appendicitis. When primary care is effective, it can help prevent or manage these conditions, thereby reducing the need for hospitalization [1,2,3].
As a key indicator of primary care quality, avoidable hospitalizations frequently result from inadequate or delayed community-based care. Their occurrence underscores the critical need for improved care coordination, enhanced preventive services, and better disease management strategies across healthcare settings [1,2].
Research consistently demonstrates that lower socioeconomic status (SES) is associated with an elevated risk of avoidable hospitalizations [4,5,6]. The combined effect of individual-level household income and neighborhood-level material deprivation on hospitalization risk proves particularly significant, with individuals residing in low-income neighborhoods experiencing the highest risk. While the precise mechanisms remain incompletely understood, experts believe factors such as limited healthcare access, health behaviors, and health outcomes likely play important roles [4].
In urban environments, lower-SES populations usually face disproportionate exposure to higher air pollution levels, contributing to increased mortality risks from all causes, including respiratory conditions [7,8]. Although extensive epidemiological research on air pollution recognizes this disparity, significant gaps remain in understanding how best to adjust for SES confounding and which biases arise from improper adjustment [8].
Recent research has increasingly linked air pollution to neurological outcomes. A large population-based study in Ontario, Canada [9], reported that long-term exposure to fine particulate matter (PM2.5) was associated with a 5.5% increased risk of developing epilepsy, while ozone exposure was linked to a 9.6% increase. Other research suggests that air pollutants affect the number of pediatric patients in the emergency department with epilepsy attacks [10]. Proposed mechanisms include the entry of pollutants into the bloodstream and central nervous system via the lungs, potentially crossing the blood–brain barrier and triggering neuroinflammatory processes [11]. However, findings across studies are not entirely consistent, particularly with regard to specific pollutants.
The high correlation between social and economic factors presents analytical challenges in determining their relative contributions. Consequently, SES analyses frequently focus primarily on income or residential area while neglecting other crucial factors like educational attainment, potentially oversimplifying complex socioeconomic relationships.
International comparisons reveal considerable variation in avoidable hospitalization rates, with Mexico showing distinct patterns. For instance, asthma admission rates vary 12-fold across OECD countries, with Mexico, Italy, and Colombia reporting the lowest rates, while Latvia, Turkey, and Poland report rates more than twice the OECD average [12].
Understanding avoidable hospitalizations in specific populations requires careful consideration of contextual factors including SES, healthcare access, and health behaviors. By studying these populations in detail, policymakers can identify necessary changes to address avoidable hospitalizations, ultimately improving healthcare outcomes while reducing costs.
Analyzing trends in avoidable hospitalizations by clinical condition helps inform healthcare policy and resource allocation by identifying increasing or decreasing rates over time. This examination can reveal important patterns and correlations between specific conditions and hospitalization rates, guiding targeted interventions. Additionally, it can help identify high-risk groups based on age, sex, or SES.
In this comprehensive study, we investigate the principal risk factors for avoidable hospitalizations in Mexico City, examining their relationship with SES and air pollution, identifying high-risk groups, and tracking temporal changes. Our analysis of 86,170 patient records from 2015 to 2019 employed negative binomial regression, logistic regression, and Gradient Boosting Machine (GBM) models to account for nonlinearity and interactions between variables. We included SES and air pollution as key risk factors, along with relevant covariates such as locality, age, sex, weight, and admission date.
To carefully examine SES effects, we generated a composite indicator using factor analysis (FA) that provides a more nuanced assessment of each economic and social factor’s contribution, thereby better capturing their interrelation. For air pollution, we constructed indexes through Principal Component Analysis (PCA) to properly account for spatial pollutant concentration correlations across localities. We applied an iterative algorithm specifically tailored to this problem to obtain relevant factors, systematically eliminating non-significant variables while penalizing the simultaneous inclusion of highly correlated variables to reduce multicollinearity and interpretation problems.
Our results demonstrate that different aspects of the composite SES indicator influence the incidence of various avoidable hospitalization categories, while environmental air pollution affects both the incidence and severity of hospitalizations. In particular, we identified significant interactions and nonlinear effects between variables, findings that can directly inform prevention efforts and public policy aimed at reducing avoidable hospitalizations.
2. Materials and Methods
To determine the relevance of air pollution (AP) and socioeconomic status (SES) on the leading causes of hospitalization, we matched each patient’s locality of residence with complementary datasets to estimate corresponding AP and SES indexes. Our analysis incorporated multiple confounding factors including sex, age, weight, access to social security, municipality of residency, and admission date (months 1–60), among others.
We estimated severity through two measures: the number of hospitalization days and mortality occurrence. For each variable, we assessed its contribution using an iterative algorithm designed to address the problem of multicollinearity. Unlike traditional Forward–Backward Selection algorithms [13], which do not account for correlations between variables, our approach systematically eliminates and includes variables based on the Akaike Information Criterion (AIC) while penalizing correlation to maximize model interpretability. Additionally, to explore potential nonlinearity and variable interactions, we utilized relative feature importance derived from the Gradient Boosting Machine (GBM) model. All variables were scaled to a 0–1 range to facilitate comparison.
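The 0–1 scaling mentioned above can be illustrated with a minimal min-max sketch. The function name, the handling of constant columns, and the example ages are assumptions made for illustration; the source does not specify how the scaling was implemented.

```python
def min_max_scale(values):
    """Rescale a sequence of numbers linearly onto the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column is mapped to 0.0 by convention.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical patient ages, not taken from the study data.
ages = [2, 35, 60, 87]
scaled_ages = min_max_scale(ages)
```

After scaling, a coefficient describes the change in the outcome when a predictor moves from its observed minimum to its observed maximum, which is what makes effect sizes comparable across variables measured in different units.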
2.1. Data Sources
We integrated three primary data sources:
Hospitalizations: The Mexico City Ministry of Health (SEDESA) provided anonymized data from public hospitals under CONACYT project 7051. This comprehensive dataset included patient information such as:
Demographic characteristics (age, weight, sex);
Geographic origin;
Hospitalization indicators;
Health services entitlement status;
Admission and discharge dates;
Duration of hospitalization;
Medical conditions;
Locality of residence;
International Classification of Diseases (ICD) codes for:
– Initial diagnosis;
– Primary condition;
– Cause of death (when applicable).
Air Pollutant Concentrations: We obtained AP measures from Mexico City’s Automatic Air Quality Monitoring Network [14]. For each monitoring station, we calculated 15-year averages (2005–2020) for concentrations of PM10, PM2.5, CO, NO2, NOx, SO2, NO, and O3. Using kriging interpolation and QGIS, we estimated mean concentrations for each patient’s locality based on centroid coordinates.
Census of Population and Housing: We incorporated official 2020 Mexico Census data containing detailed housing and population variables at the locality level, which served as the foundation for constructing our SES indicators.
2.2. Socioeconomic Status and Air Pollution Factors
Census data are widely employed for constructing neighborhood-level composite SES indicators, typically using Principal Component Analysis (PCA) or Factor Analysis (FA) to weight each variable’s contribution [15,16].
In this study, we derived SES indicators using FA to detect more nuanced economic and social dimensions (F_ECONOM and F_SOCIAL) in the data, resulting in indicators where higher values represent less favorable circumstances, which can be interpreted as economic and social lag indicators. We validated these factors by regression analysis against established indices including:
Social Lag Index (SLI) [17];
Social Development Index (SDI) [18];
Human Development Index (HDI) [19].
These analyses resulted in coefficients of determination (R²) exceeding 0.9, indicating strong concordance.
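As a rough illustration of this validation step, the coefficient of determination from a simple linear regression of an established index on a derived factor can be computed as follows. This is a sketch only: the function name and the toy numbers are assumptions, and the source does not state whether simple or multiple regression was used.

```python
def r_squared(x, y):
    """R² of the least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual and total sums of squares.
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return 1.0 - ss_res / ss_tot


# Hypothetical locality-level values, invented for illustration only.
factor_scores = [0.1, 0.4, 0.5, 0.9]
reference_index = [1.0, 2.1, 2.4, 4.1]
r2 = r_squared(factor_scores, reference_index)
```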
The economic factor (F_ECONOM) was most strongly influenced by housing-related census variables.
In contrast, the social factor (F_SOCIAL) is strongly influenced by variables such as:
Average number of live-born children;
Average educational attainment;
Affiliation with different health services.
Additional details on the development of the SES indicators are presented in Appendix A.2.
Regarding air pollution, Mexico City’s complex terrain significantly influences local meteorology and atmospheric pollutant behavior, resulting in spatially correlated AP patterns. Failure to account for this spatial correlation could lead to biased effect estimates. Therefore, we constructed pollution factors by grouping geographically correlated pollutants. Using PCA, we identified three distinct pollutant groups based on spatial concentration patterns:
PM_CO (PM10, PM2.5, and CO);
NO2_NOx (NO2 and NOx);
SO2_NO_O3 (SO2, NO, and O3).
Additional details on the development of the air pollutant factors are presented in Appendix A.1.
2.3. Analytical Models
For each patient, we measured hospitalization severity through a composite indicator that combined mortality occurrence and hospitalization duration. Specifically, the severity Y for patient i was defined such that:
Higher values indicated mortality, with values approaching 1 representing faster mortality (greater severity).
Lower values indicated survival, with values approaching 0 representing shorter hospital stays (lower severity).
This formulation can alternatively be interpreted as a classification problem with high-severity (death) and low-severity (non-death) classes, weighted to account for extreme cases (see [20] for more details on severity estimation).
A comprehensive catalog of avoidable hospitalizations for ambulatory care sensitive conditions, defined by ICD-10 codes, was taken from [21].
Table 1 presents the 14 categories with sufficient data for the severity analysis, showing case counts by locality and admission date.
Two primary model types were developed: models estimating monthly locality-specific hospitalization counts for specific conditions, and models estimating hospitalization severity.
For hospitalization count modeling at the locality level, we included total locality population (POBTOT) as the primary expected predictor. Additionally, we considered population proportions by age group (0–2 years, 18–24 years, and 60+ years), population density (POB_AREA), male–female ratio (REL_H_M), and SES and AP factors described previously.
At the patient level, we considered the municipality of residence (E_MUN_XXXXX), admission date (ADM_DATE, months 1–60), and month of admission (MONTH).
For hospitalization count modeling, we employed negative binomial regression as our primary model, following the meta-analysis by Wallar et al. [22], which concluded that it is the most appropriate method for this type of data.
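For reference, a common way to write a negative binomial count model is the NB2 parameterization with a log link, sketched below. The notation is an assumption introduced here for illustration ($Y_{\ell t}$ for the count in locality $\ell$ and month $t$, $\mathbf{x}_{\ell t}$ for the scaled predictors, $\alpha$ for the dispersion parameter); the source does not state which parameterization or link was used.

```latex
\begin{aligned}
Y_{\ell t} &\sim \mathrm{NegBin}(\mu_{\ell t}, \alpha), \\
\log \mu_{\ell t} &= \beta_0 + \mathbf{x}_{\ell t}^{\top} \boldsymbol{\beta}, \\
\operatorname{Var}(Y_{\ell t}) &= \mu_{\ell t} + \alpha\, \mu_{\ell t}^{2}.
\end{aligned}
```

The quadratic variance term is what distinguishes this model from Poisson regression: when $\alpha \to 0$ the variance equals the mean, while $\alpha > 0$ accommodates the overdispersion typical of hospitalization counts.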
For the prediction of mortality during hospitalization, we selected logistic regression based on its established suitability for clinical outcomes (see [23]). At the patient level, we included potentially relevant severity predictors such as age (AGE), weight (WEIGHT), sex (SEX_M: 1 = male, 0 = female), and origin (PROCED: 1 = external, 2 = emergency, 3 = referred, 4 = other, 9 = unspecified).
At the locality level, we incorporated municipality of residence (E_MUN_XXXXX) and admission date (ADM_DATE, months 1–60).
In both model types, municipality of residence proved particularly relevant as different municipalities may have varying hospital infrastructure, health policies, or other unmeasured factors that could spuriously correlate with SES or AP exposure.
We implemented a systematic variable selection algorithm that began by identifying the 10 variables most strongly correlated with the outcome. The procedure iteratively removed variables with the weakest contribution to model fit, evaluated using the AIC, while penalizing the inclusion of highly correlated variables. Specifically, the final model was selected to minimize an AIC-based criterion that includes a correlation penalty, governed by a penalty parameter and by r, the maximum absolute correlation between included variables. This approach balances goodness-of-fit with reduced multicollinearity, enhancing both model interpretability and robustness.
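The selection procedure can be sketched as a backward elimination over a penalized score. The exact form of the objective is not recoverable from the text, so the additive form `AIC + lam * r` used below is an assumption, as are all function names, the variable names `A`, `B`, `C`, and the toy AIC function (a real application would refit the regression for each subset).

```python
from itertools import combinations


def max_abs_corr(subset, corr):
    """Largest absolute pairwise correlation among the variables in `subset`."""
    pairs = combinations(sorted(subset), 2)
    return max((abs(corr[frozenset(p)]) for p in pairs), default=0.0)


def penalized_score(subset, aic_of, corr, lam):
    # ASSUMPTION: additive penalty; the source only says AIC is penalized by r.
    return aic_of(subset) + lam * max_abs_corr(subset, corr)


def backward_select(candidates, aic_of, corr, lam):
    """Drop one variable at a time while the penalized score keeps improving."""
    current = frozenset(candidates)
    best = penalized_score(current, aic_of, corr, lam)
    while len(current) > 1:
        trials = [(penalized_score(current - {v}, aic_of, corr, lam), v)
                  for v in sorted(current)]
        score, drop = min(trials)
        if score >= best:
            break  # no single removal improves the score
        current, best = current - {drop}, score
    return current, best


# Toy setup with invented numbers: B and C are highly correlated.
gains = {"A": 30, "B": 10, "C": 3}
corr = {frozenset(("A", "B")): 0.2,
        frozenset(("A", "C")): 0.1,
        frozenset(("B", "C")): 0.9}


def toy_aic(subset):
    # Stand-in for refitting a model: fit gain minus 2 per parameter.
    return 100 - sum(gains[v] for v in subset) + 2 * len(subset)


selected, score = backward_select(gains, toy_aic, corr, lam=10)
unpenalized, _ = backward_select(gains, toy_aic, corr, lam=0)
```

In this toy setup the penalized run drops `C`, the weaker member of the correlated pair, whereas with `lam=0` plain AIC keeps all three variables. That contrast is the behavior the correlation penalty is meant to produce.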
While conventional modeling methods often struggle with high-dimensional relationships, advanced machine learning techniques like Gradient Boosting Machine (GBM) models have demonstrated superior performance in medical predictive analytics compared to traditional statistical models (e.g., Kong et al. [24]). We employed GBM to automatically account for nonlinear confounding effects and interactions, to estimate variable effects, and to explore complex relationships that are difficult to detect with classical models.
For robust validation, we reserved 15% of the records in each category as a holdout set. This ensured that predictions were better than random and that variable importance measurements had genuine predictive value. It also allowed us to compare the models and identify where modeling nonlinearity and interactions improves accuracy.
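A per-category holdout split of this kind might look like the following sketch. The random shuffling, the fixed seed, the rounding rule, and the function names are assumptions; the source states only that 15% of records in each category were reserved.

```python
import random


def holdout_split(records, holdout_fraction=0.15, seed=0):
    """Shuffle records and return (training, holdout) lists."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]


def split_by_category(records_by_category, holdout_fraction=0.15, seed=0):
    """Apply the same holdout fraction independently within each category."""
    return {category: holdout_split(records, holdout_fraction, seed)
            for category, records in records_by_category.items()}


# Hypothetical record identifiers for two disease categories.
splits = split_by_category({"DC": list(range(200)), "ASTH": list(range(40))})
train_dc, holdout_dc = splits["DC"]
```

Splitting within each category, rather than over the pooled data, keeps small categories from ending up with an empty or unrepresentative validation set.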
3. Results
Our analysis revealed several key findings regarding relevant factors for each model. We report standardized coefficients and 95% confidence intervals to facilitate interpretation of effect magnitudes and associated uncertainty. If predictor variables do not appear in the results tables, it is because they were either weakly associated with the outcome or highly correlated with more influential variables, and were therefore excluded during model selection based on relevance and multicollinearity considerations.
For visualization, red colors in figures represent effects that increase hospitalizations or severity, while blue indicates protective effects; 95% Confidence Intervals (CIs) are also included. In GBM models, we present the normalized Gini importance for the top 10 influential variables.
Figure 1 displays the estimated risk factors associated with diabetes complications for both hospitalization frequency and severity, along with their 95% confidence intervals. Patient weight emerged as one of the most significant factors increasing severity, while NO2 and NOx pollutants also showed substantial effects. For the number of hospitalizations, total locality population (POBTOT) showed the largest effect as expected, but economic status (F_ECONOM) demonstrated a similarly strong association where localities with less favorable economic conditions had a higher number of hospitalizations.
Figure 2 displays the estimated risk factors associated with influenza and pneumonia for both hospitalization frequency and severity, along with their 95% confidence intervals. Patient age has a significantly larger effect on severity than the other variables, followed by patient weight, while NO2 and NOx pollutants also showed substantial effects. For the number of hospitalizations, the population aged 65 years or older (POB65_MAS) and exposure to SO2, NO, and O3 have similar effects in increasing the number of hospitalizations. Exposure to PM and CO is also related to an increase in the number of hospitalizations.
Figure 3 presents some results of the GBM model for diabetes complications that illustrate the interactions between variable pairs and their effects on the number of hospitalizations and severity. These plots highlight the importance of analyzing interactions and nonlinear effects. For example, while admission date showed no statistical significance in regression analysis, GBM revealed a nonlinear pattern where cases increased until month 30 then decreased, a pattern that could yield non-significant linear effects despite meaningful temporal variation. Similarly, examining the social–economic factor relationship showed that economic factors are more relevant, but unfavorable social conditions amplified effects when combined with poor economic status.
The severity analysis in Figure 3 shows both linear age effects and nonlinear age-weight interactions. Severity peaked for older patients with either high or low weight. While logistic regression identified NO2 and NOx effects, GBM additionally revealed severity increases when these pollutants co-occurred with PM and CO exposure.
Together, these figures demonstrate how both modeling approaches can identify significant risk factors while GBM provides additional insight into complex nonlinear relationships and interactions.
Table 2 and Table 3 present variables of interest related to the number of hospitalizations, as identified by the regression and GBM models, respectively. Similarly, Table 4 and Table 5 show the corresponding results for hospitalization severity.
Table 2 and Table 4 report estimated effects and 95% confidence intervals for the variables retained in the final regression models, while Table 3 and Table 5 display normalized Gini importance scores from the GBM models, with bold values indicating variables that were also retained in the regression models. Comparing these results provides complementary insights. While Gini importance does not indicate the direction of association, high importance scores for variables not selected in the regression models may reflect nonlinear relationships or interactions not captured by the linear specification.
The results of the negative binomial regression (Table 2) confirm that the total population (POBTOT) has the strongest effect in all categories, as expected. GBM results (Table 3) similarly identify POBTOT as the most important. For admission date (ADM_DATE), only ear, nose, and throat infections (EN&T INFEC) showed significant linear effects (decreasing over time), but GBM revealed substantial nonlinear effects for diabetes (DC), angina (ANG), and COPD, suggesting an inverted U-shaped temporal pattern, as shown in Figure 3, that linear models might miss.
Economic status (F_ECONOM) significantly affected only diabetes complications (DC) in regression, with GBM confirming DC as the most impacted category. For social status (F_SOCIAL), prenatal delivery-related conditions (DPCPD) showed the strongest effect, with less favorable status increasing hospitalizations. Hypertension (HYPERT) also showed increased hospitalizations for less favorable F_SOCIAL values.
Air pollution groups showed category-specific effects:
PM_CO (PM10, PM2.5, CO) increased hospitalizations for:
– Influenza and pneumonia (I&P);
– Gastroenteritis (GASTRO);
– Ear, nose, and throat infections (EN&T INFEC);
– Asthma (ASTH);
– Epilepsy (EPILEP).
SO2_NO_O3 significantly affected I&P.
GBM importance scores align with these findings, showing PM_CO as the most influential pollutant group, particularly for asthma and ENT infections.
For hospitalization severity (Table 4 and Table 5), age and weight showed the strongest effects. Age significantly affected almost all categories, with influenza and pneumonia showing the largest effect. Weight most strongly impacted diabetes (DC) and hypertension (HYPERT). Although weight was not selected as a relevant variable in many regression models, GBM showed high importance across multiple categories, likely reflecting inverted U-shaped relationships where both high and low weights increase severity (e.g., for asthma). Sex differences emerged for pyelonephritis (riskier for males) and heart failure (riskier for females).
Asthma showed reduced severity with unfavorable economic status and PM_CO exposure. These results may reflect a mixture of more exposure to these pollutants in higher SES areas, and also a possible survivor bias where severely affected individuals cannot reside in highly polluted areas.
For the NO2_NOx group of pollutants, exposure shows increased severity for diabetes complications as well as influenza and pneumonia.
Table 6 presents the performance of regression (negative binomial and logistic) and GBM models to predict the number and severity of hospitalizations in the validation dataset. Severity can be interpreted as a weighted binary prediction, where more severe cases have a greater weight in the loss function; therefore, AUC values can be estimated. The correlation between predicted and ground truth is used as a performance metric for the number of hospitalizations and the AUC value for severity. In both cases, values closer to 1 are preferable, and bold values indicate the best model on each task and category. We observed that for the number of hospitalizations, the GBM model tends to perform better across most categories, while for severity, regression tends to have better results in many cases. However, in terms of average AUC value, GBM performs slightly better.
These regression models, which include only variables selected for their relevance and avoid simultaneously including highly correlated predictors, exhibit performance comparable to the more complex GBM models while offering greater interpretability.
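The two performance metrics can be sketched in plain Python as follows. The pairwise product weighting in the AUC is an assumption, since the source does not specify how case weights enter the AUC estimate; the function names and toy values are likewise illustrative.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between predicted and observed counts."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def weighted_auc(labels, scores, weights=None):
    """Weighted share of (positive, negative) pairs ranked correctly."""
    if weights is None:
        weights = [1.0] * len(labels)
    rows = list(zip(labels, scores, weights))
    positives = [(s, w) for label, s, w in rows if label == 1]
    negatives = [(s, w) for label, s, w in rows if label == 0]
    correct = total = 0.0
    for score_pos, weight_pos in positives:
        for score_neg, weight_neg in negatives:
            pair_weight = weight_pos * weight_neg  # ASSUMPTION: product weights
            total += pair_weight
            if score_pos > score_neg:
                correct += pair_weight
            elif score_pos == score_neg:
                correct += 0.5 * pair_weight  # ties count as half
    return correct / total


# Toy values, invented for illustration only.
labels = [1, 1, 0, 0]          # 1 = death, 0 = survival
scores = [0.9, 0.4, 0.5, 0.1]  # predicted severity
auc_plain = weighted_auc(labels, scores)
auc_weighted = weighted_auc(labels, scores, weights=[2, 1, 1, 1])
corr_counts = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
```

In the toy data, giving the correctly ranked death a larger weight raises the AUC from 0.75 to 5/6, which shows how weighting extreme cases shifts the metric toward performance on those cases.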
4. Discussion
Our multifaceted analytical approach, combining composite SES/AP indicators with advanced modeling techniques, reveals complex interactions between risk factors for avoidable hospitalizations in Mexico City. By distinguishing economic versus social SES dimensions and their differential health impacts, we provide nuanced insights for targeted public health interventions.
The identified nonlinear effects and interactions, particularly between age and weight, highlight limitations of conventional regression approaches. These complex relationships explain why some factors (e.g., admission date) showed high importance in GBM but were not significant in regression models, underscoring the value of combining both types of analytical methods in epidemiological research.
Diabetes complications exemplify the need for research that treats air pollution and SES in a more nuanced way. Our results show that the economic component of SES is the main contributor to increased hospitalizations, which is important for better-targeted care campaigns. Additionally, populations more exposed to NO2 and NOx should be made aware of their higher risk of severe hospitalization.
Asthma is another example of how the complex interaction between SES and air pollution should be considered to better understand the effects of both factors on disease outcomes.
Table 2 shows that the most significant variables predicting the number of hospitalizations due to asthma in a location are its population and, more importantly, exposure to PM and CO. However, regarding hospitalization severity (
Table 4), we observed that better SES could lead to higher severity, which might seem to contradict previous research where low SES is associated with more severe asthma (e.g., [
25]). However, in the context of Mexico City, our results (see
Appendix A.3) show that higher exposure to PM, CO, and
is related to higher SES (consistent with existing research, e.g., [
26]), considering that PM and
are related to higher risks of asthma (see [
27]). Therefore, this interaction between SES and air pollution can result in the compensatory effects observed in severity. The existence of complex interactions can be inferred from the large performance difference between regression and GBM models (
Table 6), as GBM can better model nonlinear effects and interactions.
The higher severity observed in women with heart failure (Table 4) is consistent with recent findings. Lu et al. [28] reported that women hospitalized for heart attacks are less likely to receive key interventions such as cardiac catheterization, percutaneous coronary intervention (PCI), and coronary artery bypass grafting (CABG) compared to men, contributing to higher mortality rates among women. Similarly, Ezekowitz et al. [29] found that hospital mortality rates were notably higher in women than in men for ST-segment elevation myocardial infarction (STEMI), with 9.4% mortality in women versus 4.5% in men, and 4.7% versus 2.9% for non-STEMI (NSTEMI). Given the gender differences observed in our analysis for Mexico City, it may be valuable to investigate whether similar disparities in treatment access or quality exist locally, where targeted changes in patient management could potentially reduce excess mortality.
The higher severity found in men compared to women for pyelonephritis also aligns with prior studies. Kim et al. [30] reported an in-hospital mortality rate of 1.5 per 1000 episodes of pyelonephritis in men, compared to 0.5 per 1000 in women. Severe pyelonephritis in men is also associated with higher rates of complications such as renal abscesses, which are rare in females. Experimental models suggest that androgens (male hormones) may enhance the severity of urinary tract infections, including pyelonephritis, in males [31].
Although SES is usually studied as a mixture of economic and social factors, our results show the need for more nuanced analysis. For example, in the correlations of variables (
Appendix A.3), even though economic and social factors are highly correlated, our results show that air pollutants are mainly related to the economic aspects of SES. In addition, for Mexico City, different pollutants behave differently in relation to SES. More favorable SES scores are related to higher exposure to PM, CO, and
, while lower SES is associated with higher exposure to
and
. Like Mexico City, many cities could have unique and more complex situations where these interactions between SES and air pollution should be considered to properly address public policy and prevention, underscoring the need for more research in regions with different circumstances than those typically studied.
Regarding model performance, although some interpretability is possible in GBM by studying the Gini importance of variables, the magnitude and direction of effects can be more difficult to assess, and multicollinearity issues remain, with the possibility of wrongly assessing the magnitude of effects that could be diluted across other variables. Therefore, if prediction accuracy is the main objective, GBM could be preferable, but when studying the effects of different factors on diseases, our results show that regression models can be highly interpretable while still maintaining competitive performance. However, comparing both types of models can be beneficial for discovering unknown nonlinear effects and interactions between variables.
Although changes in patient management over time could influence trends in avoidable hospitalizations, such information was not directly available in the dataset. However, our analysis did not reveal consistent temporal effects across most conditions. In particular, the variable representing admission date (ADM_DATE) was not retained in the final regression models for the majority of disease categories, suggesting limited or no measurable shifts in hospitalization patterns during the study period. An exception was observed for ear, nose, and throat infections, where a downward trend was detected. These findings may indicate overall stability in care delivery or access for most conditions, although unobserved factors such as policy changes or protocol updates cannot be ruled out. Future research could benefit from incorporating hospital-level or policy implementation data to further investigate temporal changes in care practices.
Our findings advance understanding of how air pollution and SES jointly influence both AH incidence and severity, a key improvement over previous studies examining these factors separately [22,32]. Notably, SES factors primarily affected hospitalization frequency (especially for chronic conditions like diabetes), while air pollution impacted both incidence and severity (e.g., diabetes complications and influenza). This pattern suggests socioeconomic factors influence long-term health behaviors and preventive care access, while pollution has both acute and chronic health effects.
5. Conclusions
Our analysis of 86,170 hospitalizations in Mexico City (2015–2019) yields several important conclusions with both scientific and policy implications:
5.1. Key Findings
SES effects are multidimensional: Economic and social components of SES showed distinct health impacts, with:
– Economic factors strongly influencing diabetes complications.
– Social factors more relevant for prenatal conditions and hypertension.
Pollution effects are pollutant-specific: Different pollutant groups affected:
– Incidence (PM10–PM2.5–CO group): Asthma, influenza, epilepsy.
– Severity (NO2–NOx group): Diabetes complications, influenza.
Complex interactions exist: Notable nonlinear relationships were found for:
– Age–weight interactions in disease severity.
– Temporal patterns in hospitalization rates.
– SES–pollution interactions.
5.2. Policy Implications
These findings suggest several targeted intervention strategies:
5.3. Methodological Contributions
Our study demonstrates the value of:
Combining traditional and machine learning approaches.
Developing composite SES indicators.
Analyzing pollutant groups rather than individual species.
Examining both incidence and severity outcomes.
These findings significantly advance our understanding of the complex interplay between environmental and social determinants of health in urban populations. The methodologies developed here can be applied to other cities facing similar public health challenges, while the specific results provide actionable insights for improving population health in Mexico City.