Explaining and Reducing Urban Heat Islands Through Machine Learning: Evidence from New York City

Liao, Shengyao; Liu, Zhewei

doi:10.3390/buildings16010186

by Shengyao Liao ¹,² and Zhewei Liu ³,*

¹ School of Resources and Environment, Shandong Agricultural University, Taian 271001, China
² School of Real Estate and Land Management, Royal Agricultural University, Cirencester GL7 6JS, UK
³ Department of Geography, Geomatics and Environment, University of Toronto Mississauga, Mississauga, ON L5L 1C6, Canada
* Author to whom correspondence should be addressed.

Buildings 2026, 16(1), 186; https://doi.org/10.3390/buildings16010186 (registering DOI)

Submission received: 29 October 2025 / Revised: 28 November 2025 / Accepted: 5 December 2025 / Published: 1 January 2026

(This article belongs to the Special Issue Advancing Urban Analytics and Sensing for Sustainable Cities)

Abstract

Urban heat islands (UHIs) have intensified in rapidly urbanizing regions like New York, exacerbating thermal discomfort, public health risks, and energy consumption. While previous research has highlighted various environmental and socioeconomic contributors, most existing studies lack interpretable, fine-scale models capable of quantifying the effects of specific drivers, which limits their utility for targeted planning. To address this challenge, we develop an interpretable machine learning framework using Random Forest and XGBoost to predict land surface temperature across 1800+ census tracts in the New York metropolitan area, incorporating vegetation indices, water proximity, urban morphology, and socioeconomic factors. Both models performed strongly (mean R² ≈ 0.90), with vegetation coverage and water proximity emerging as the most influential cooling factors, while built form features played supporting roles. Socioeconomic vulnerability indicators showed weak correlations with temperature, suggesting a relatively equitable thermal landscape. Optimization simulations further revealed that increasing vegetation to a threshold level could lower average surface temperatures by up to 6.37 °C, with additional but smaller gains achievable through adjustments to water access and urban form. These findings provide evidence-based guidance for climate-adaptive urban design and green infrastructure planning. More broadly, the study illustrates the potential of explainable machine learning to support data-driven environmental interventions in complex urban systems.

Keywords: heat island effect; built environment; machine learning

1. Introduction

Urban heat islands (UHIs)—the phenomenon where urban areas experience significantly higher temperatures than surrounding rural regions—have become increasingly pronounced with accelerating urbanization and climate change [1]. In megacities like New York, UHIs not only intensify energy demands but also elevate health risks [2], strain infrastructure, and reduce urban livability [3]. As a critical challenge to achieving the United Nations’ Sustainable Development Goals—especially those related to good health, sustainable cities, and climate action—mitigating the UHI effect is now central to urban sustainability agendas [4]. Governments and planners are paying growing attention to identifying effective strategies to alleviate UHI impacts and enhance urban resilience [5].

Early investigations into UHIs primarily examined physical-geographical variables, including temperature, humidity, vegetation indices, and the leaf area index [6,7,8,9]. Later studies expanded this scope to include socioeconomic and demographic factors such as population density and urban GDP [10,11,12,13,14]. Methodological evolution has also been notable: while early approaches relied on simple regression analysis [15,16], subsequent research incorporated principal component analysis and multiple linear regression [17,18], enhancing the capacity to capture spatial complexity in urban thermal patterns. In the context of New York City, historical heat events—such as the 2006 and 2011 heatwaves—have highlighted the severe public health risks and disproportionate impacts on vulnerable communities, prompting the development of targeted resilience initiatives like the “Cool Neighborhoods NYC” program and the NYC Green Infrastructure Plan [19,20]. These programs underscore the critical need for spatially explicit and interpretable UHI analysis to support urban cooling strategies [21,22].

More recently, the rise of machine learning has opened new avenues for modeling UHI dynamics with improved accuracy. Studies have employed models such as random forest and XGBOOST to capture nonlinear interactions among land surface, morphological, and climatic variables [23,24]. Furthermore, the integration of explainable AI (XAI) techniques [25]—such as SHAP (SHapley Additive exPlanations), LIME (Local Interpretable Model-agnostic Explanations), and Partial Dependence Plots (PDP)—has advanced model interpretability beyond basic feature importance, enabling more transparent and actionable insights into urban thermal drivers [26,27]. Alongside these technical developments, the increasing availability of crowdsourced geospatial data—e.g., from OpenStreetMap [28], Overpass-turbo [29], and NYC Open Data—has enabled high-resolution analyses of urban form and land cover. Prior studies have shown that in developed urban areas such as New York, the positional accuracy of OSM datasets is comparable to that of official data sources [30]. Recent UHI literature has also increasingly focused on the roles of blue-green infrastructure—such as green roofs, urban forests, and water bodies—as well as urban morphology metrics (e.g., sky view factor, building height-to-width ratio) derived from multi-source remote sensing, which provide critical inputs for mitigating heat stress in dense urban settings [31,32,33,34]. Despite these advancements, many existing studies remain limited in two ways. First, they often adopt a generalized perspective on urban spatial form, overlooking the fine-grained built environment characteristics that shape UHI intensity [25,35]. Second, while machine learning models have been increasingly applied, few leverage interpretable frameworks that quantify the contributions of specific physical and socioeconomic drivers [26]. This lack of interpretability reduces their utility for informing policy and targeted mitigation strategies. 
Moreover, models frequently omit optimization-based assessments of how urban variables could be adjusted to reduce UHI exposure—leading to a disconnect between analytical findings and actionable planning insights [36].

To address these challenges, this study proposes a data-driven, interpretable machine learning framework for analyzing UHIs in the New York metropolitan region. Specifically, this study seeks to address the following three research questions:

(1) Which physical and socioeconomic features most strongly influence spatial variation in urban heat across the New York metropolitan area?
(2) To what extent can explainable machine learning models accurately predict land surface temperatures using multi-source urban datasets?
(3) How can optimization simulations inform practical and targeted strategies for reducing heat exposure through changes in urban form and environmental design?

By leveraging Random Forest and XGBoost models, we integrate vegetation indices, proximity to water, 3D urban morphology, and socioeconomic vulnerability indicators across more than 1800 census tracts. Our models achieve high predictive accuracy (mean R² ≈ 0.90), and feature importance analysis reveals that vegetation coverage and water proximity are the dominant cooling factors [37,38]. In contrast, socioeconomic indicators show weak correlations with temperature, indicating a relatively equitable thermal landscape [39,40]. Crucially, through simulation-based optimization, we identify threshold values, particularly for vegetation coverage, that can reduce surface temperatures by as much as 6.37 °C [41]. These findings offer actionable guidance for green infrastructure design and adaptive urban planning, and demonstrate the value of explainable machine learning for tackling environmental challenges in complex urban systems.

The remainder of this paper is organized as follows. Section 2 presents the datasets employed and outlines the methodology. Section 3 reports the main results, highlighting both quantitative findings and key patterns. Section 4 provides a detailed discussion of these results. Finally, Section 5 concludes the paper by summarizing the contributions and suggesting directions for future research.

2. Materials and Methods

The main objective of this study is to use machine learning models to analyze urban heat islands and their drivers based on socioeconomic factors and the characteristics of the built environment. The analytic workflow is shown in Figure 1. The left part of the figure details two categories of regional features: (1) the natural and architectural environment, which encompasses NDVI (Normalized Difference Vegetation Index), rivers and coastline, road network, building footprints, green space, and point and polygon POIs (Points of Interest); and (2) socioeconomic status, including poverty, unemployment, housing burden, low education, and lack of health insurance. These environmental and socioeconomic datasets were integrated with census tract temperature data from multiple geographic units, after which Random Forest and XGBoost models were applied for regression analysis. The dataset was partitioned into training and testing subsets, on which the models were trained and evaluated. Feature importance was then quantified using Gini importance for RF and gain for XGBoost, providing the basis for the sensitivity analysis and the optimal adjustment strategy simulation. Finally, correlation analysis of the socioeconomic indicators was carried out, involving the construction of a Pearson correlation coefficient matrix and analysis at the 33% quantile, followed by an evaluation of regional differentiation in heat exposure that considered the average, difference, and range of the socioeconomic indicators.

2.1. Study Area and Collection of Data

This study focuses on the New York City metropolitan region, a dense urban environment characterized by significant spatial heterogeneity in land use, socioeconomic conditions, and built infrastructure [42]. The region’s complex urban morphology and diverse population make it an ideal case for investigating the drivers and dynamics of urban heat islands (UHIs) [43]. The analytical boundaries defined in this study are 40.502397° N to 40.911327° N and 74.247808° W to 73.700360° W. This range fully covers all five boroughs of New York City (New York County/Manhattan, Kings County/Brooklyn, Queens County/Queens, Bronx County/Bronx, and Richmond County/Staten Island), and includes the western part of neighboring Nassau County, New York, and the eastern part of Hudson County, New Jersey. This study explicitly excludes most of Long Island; the Lower Hudson Valley, New York; most of Northern New Jersey; and southwestern Connecticut.

We collected a comprehensive set of geospatial and socio-environmental datasets for 2022 to analyze UHI patterns across more than 1800 census tracts, detailed in Table 1. We selected the census tract as our fundamental spatial unit for several compelling reasons. A census tract is a relatively stable, small area specifically designed by the U.S. Census Bureau to approximate “neighborhoods”, typically housing 1200 to 8000 residents. This makes it an ideal scale for studying urban phenomena with intertwined social and environmental dimensions. The more than 1800 census tracts in our study range in area from approximately 0.07 km² to 15.63 km², reflecting the true heterogeneity of urban space. To ensure comparability, all input variables were standardized as proportions or densities within each tract’s total area.
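The per-tract standardization described above can be illustrated with a minimal sketch. The function names, the POI count, and the park area are hypothetical and chosen only for illustration; only the tract areas (0.07 km² and 15.63 km²) come from the text.

```python
def per_area_density(count, tract_area_km2):
    """Convert a raw count (e.g., POIs) into a density per km² of tract area."""
    if tract_area_km2 <= 0:
        raise ValueError("tract area must be positive")
    return count / tract_area_km2


def area_proportion(part_area_km2, tract_area_km2):
    """Share of the tract covered by a feature (e.g., green space)."""
    if tract_area_km2 <= 0:
        raise ValueError("tract area must be positive")
    return part_area_km2 / tract_area_km2


# Hypothetical example: the same 14 POIs in the smallest and largest tracts.
small_tract_density = per_area_density(14, 0.07)
large_tract_density = per_area_density(14, 15.63)
# Hypothetical example: 3.126 km² of parks in the largest tract.
green_share = area_proportion(3.126, 15.63)

print(round(small_tract_density, 2), round(large_tract_density, 2), round(green_share, 2))
```

Expressing inputs per unit area keeps a very small tract and a very large tract comparable even when their raw counts are identical.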

While fixed grids (e.g., 1 km × 1 km) offer geometric regularity, they present significant disadvantages for this study. First, key socioeconomic vulnerability data from the CDC SVI is collected and published exclusively at the census tract level. Using a fixed grid would necessitate complex and uncertain spatial allocation of this data, introducing error and weakening the model’s explanatory power. Second, and crucially, urban planning, public health interventions, and resource allocation are typically implemented at the census tract or similar administrative level. Our findings can thus be directly mapped to these policy units, empowering decision-makers to identify specific communities for priority intervention and greatly enhancing the practical utility of our research.

Land Surface Temperature (LST) data were derived from NASA’s Terra MODIS satellite (1 km resolution), focusing on summer daytime temperatures (June–August). The Normalized Difference Vegetation Index (NDVI) was computed using ESA Sentinel-2 Level-2A imagery (10 m resolution). Green spaces were obtained from the NYC Open Data portal’s Parks Properties dataset. Urban morphology indicators—including building height, building coverage ratio, and POI densities—were extracted from the Global Building Footprints dataset, and OpenStreetMap using Python (version 3.12) scripts and the OSMnx library. Hydrological features such as water systems and coastline distances were retrieved from Geofabrik and Overpass-turbo, respectively. Socioeconomic variables—including poverty rate, unemployment, education level, and health insurance coverage—were obtained from the 2022 CDC Social Vulnerability Index (SVI), which aggregates American Community Survey data. Road network metrics were computed from OSM road geometry, while all vector boundaries (e.g., administrative units) were standardized using the WGS84 coordinate system. Raster data processing was conducted in Google Earth Engine; vector data were handled in Python with GeoPandas, and all outputs were exported in GeoTIFF or Shapefile formats. These datasets provide a multidimensional foundation for examining the environmental, morphological, and socioeconomic determinants of UHI intensity in the study area.

2.2. Machine Learning Models for Predicting the Heat Island Effect

This study frames UHI intensity prediction as a regression problem. Given a set of spatial and socioeconomic feature vectors $X = \{x_{1}, x_{2}, \dots, x_{N}\}$, where each $x_{i} \in \mathbb{R}^{D}$, and the corresponding observed land surface temperature values $Y = \{y_{1}, y_{2}, \dots, y_{N}\}$, we aim to learn a mapping function $f : \mathbb{R}^{D} \to \mathbb{R}$ that minimizes the prediction error across all census tracts:

$$\hat{y}_{i} = f(x_{i}) \tag{1}$$

where $\hat{y}_{i}$ denotes the predicted temperature for the i-th census tract, N is the number of census tracts, and D represents the number of input features (e.g., NDVI, distance to water, POI density, etc.).

In this study, two ensemble models are used to approximate the mapping function: (1) Random Forest (RF) and (2) Extreme Gradient Boosting (XGBOOST). Both models construct multiple decision trees but differ in how trees are trained and aggregated. Random Forest builds trees independently on bootstrapped samples and averages their outputs, while XGBoost sequentially builds trees where each new tree aims to correct the residuals of previous trees. These models are well-suited for handling high-dimensional, nonlinear data and are considered interpretable due to their decision-tree architecture.

RF is an ensemble learning method that uses the bagging strategy (bootstrap aggregating). Each decision tree is trained on a random sample of the training set, and the final prediction is the average of all tree outputs:

$$\hat{y}_{i} = \frac{1}{M} \sum_{j=1}^{M} f_{j}(x_{i}) \tag{2}$$

where $f_{j}(x_{i})$ is the output of the j-th decision tree and M is the total number of trees. This method reduces variance and helps prevent overfitting, especially when the number of features is large.
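The averaging step in Equation (2) can be sketched in a few lines of Python. The three "trees" below are hypothetical hand-written stand-ins for fitted decision trees; their split thresholds and leaf temperatures are invented for illustration and are not taken from the study.

```python
from statistics import mean


# Hypothetical stand-ins for fitted trees: each maps a tract's features
# to a predicted land surface temperature in °C.
def tree_a(x):
    return 31.0 if x["ndvi"] > 0.4 else 36.0


def tree_b(x):
    return 32.0 if x["d2coast_km"] < 2.0 else 35.0


def tree_c(x):
    return 30.0 if x["ndvi"] > 0.6 else 35.5


def forest_predict(trees, x):
    """Average the outputs of all M trees for one feature vector x."""
    return mean(tree(x) for tree in trees)


tract = {"ndvi": 0.5, "d2coast_km": 1.2}
prediction = forest_predict([tree_a, tree_b, tree_c], tract)
print(round(prediction, 2))
```

Because each tree errs in a different direction, the average is typically more stable than any single tree, which is the variance reduction the paragraph refers to.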

XGBoost, in contrast, uses a boosting strategy. It iteratively adds new trees to model the residuals of previous predictions. At each iteration $m$, a new tree $f_{m}(x)$ is learned to minimize a regularized loss function:

$$L^{(m)} = \sum_{i=1}^{N} \mathrm{Loss}\left(y_{i}, \sum_{j=1}^{m} f_{j}(x_{i})\right) + \sum_{j=1}^{m} \Omega(f_{j}) \tag{3}$$

where $\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{k=1}^{T} w_{k}^{2}$ is the regularization term penalizing tree complexity (with $T$ leaves and leaf weights $w_{k}$). XGBoost uses both the first- and second-order derivatives (i.e., gradients and Hessians) of the loss function to guide tree construction, which enhances convergence speed and model accuracy.
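To make the role of gradients, Hessians, γ, and λ concrete, the following sketch evaluates a single candidate split. It relies on the standard second-order results from the XGBoost literature (optimal leaf weight w* = −G/(H + λ) and the corresponding split gain), which are not stated in this article; the squared-error loss, the toy temperatures, and all function names are assumptions for illustration.

```python
def squared_loss_grad_hess(y_true, y_pred):
    """Gradient and Hessian of 0.5 * (y_true - y_pred)**2 with respect to y_pred."""
    return y_pred - y_true, 1.0


def optimal_leaf_weight(grads, hessians, lam):
    """w* = -G / (H + lambda), with G and H summed over the samples in the leaf."""
    return -sum(grads) / (sum(hessians) + lam)


def leaf_score(grads, hessians, lam):
    """G^2 / (H + lambda): how much loss a leaf can remove at its optimal weight."""
    g = sum(grads)
    return g * g / (sum(hessians) + lam)


def split_gain(g_left, h_left, g_right, h_right, lam, gamma):
    """Loss reduction from splitting one leaf in two, minus the penalty gamma."""
    parent = leaf_score(g_left + g_right, h_left + h_right, lam)
    children = leaf_score(g_left, h_left, lam) + leaf_score(g_right, h_right, lam)
    return 0.5 * (children - parent) - gamma


# Toy data: four tracts, current ensemble predicts 33.5 °C for all of them.
observed = [30.0, 31.0, 36.0, 37.0]
current = [33.5, 33.5, 33.5, 33.5]
pairs = [squared_loss_grad_hess(y, p) for y, p in zip(observed, current)]
grads = [g for g, _ in pairs]
hess = [h for _, h in pairs]

# Candidate split: the two cooler tracts go left, the two hotter tracts go right.
w_left = optimal_leaf_weight(grads[:2], hess[:2], lam=1.0)
w_right = optimal_leaf_weight(grads[2:], hess[2:], lam=1.0)
gain = split_gain(grads[:2], hess[:2], grads[2:], hess[2:], lam=1.0, gamma=0.0)
print(w_left, w_right, gain)
```

In this sketch a larger λ shrinks the leaf weights toward zero, and a positive γ rejects splits whose gain does not exceed the cost of adding a leaf.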

XGBoost also implements engineering optimizations such as parallelized tree construction, a weighted quantile sketch for approximate split finding, and sparsity-aware split finding for efficient handling of sparse or missing data. These innovations make XGBoost particularly scalable and robust for large geospatial datasets.

The interpretability of both RF and XGBoost is enabled through feature importance analysis. In RF, importance is assessed via impurity reduction (commonly reported as Gini importance; in regression trees the impurity being reduced is the variance of the target), while XGBoost uses average gain, i.e., the improvement in model loss due to splits on a given feature. These importance scores reveal which urban, environmental, or socioeconomic factors most influence temperature variation, providing scientific insight for UHI mitigation planning.

2.3. Performance Evaluation

To explore the combined impact of urban morphological characteristics (including NDVI, road network structure, distribution of functional facilities, adjacent water bodies, and building attributes) on the local thermal environment (characterized by census tract temperature), ensemble machine learning algorithms based on decision trees were used for regression modeling. Such algorithms are good at capturing complex nonlinear relationships and interactions among high-dimensional features, which is crucial for understanding the urban temperature distribution under the combined influence of multiple factors. Specifically, two ensemble models, Random Forest (RF) and eXtreme Gradient Boosting (XGBoost), were selected. RF reduces variance and improves generalization by constructing a large number of decorrelated decision trees and averaging their predictions. XGBoost builds decision trees sequentially, with each tree correcting the residuals of the previous trees, and optimizes a regularized objective function through gradient-based boosting to obtain strong predictive performance. The model input features include the weighted average normalized difference vegetation index (NDVI), average road network length, point-of-interest (POI) density, shortest distance to the nearest water system, shortest distance to the coastline, building coverage ratio, and average building height. The target variable is the observed temperature of the corresponding census tract. Model performance is quantitatively evaluated by the mean squared error (MSE) and the coefficient of determination (R²):

$$R^{2} = 1 - \frac{\sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2}}{\sum_{i=1}^{n} (y_{i} - \bar{y})^{2}} \tag{4}$$

where $\bar{y}$ is the mean of the observed target values, the numerator is the residual sum of squares, and the denominator is the total sum of squares.

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2} \tag{5}$$

where $y_{i}$ is the observed value of sample i, $\hat{y}_{i}$ is the model prediction for sample i, and n is the total number of samples.
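Equations (4) and (5) translate directly into code. The sketch below uses plain Python and four invented temperature values, so the numbers are illustrative only and unrelated to the study's data.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of squared residuals, as in Equation (5)."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot, as in Equation (4)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


# Invented observed and predicted temperatures (°C) for four tracts.
observed = [30.0, 32.0, 34.0, 36.0]
predicted = [30.5, 31.5, 34.5, 35.5]

print(mse(observed, predicted), r2(observed, predicted))
```

Because MSE averages squared residuals, its unit is the square of the target's unit (°C² for temperatures), whereas R² is dimensionless.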

2.4. Data Analysis and Software

The data processing, statistical analysis, and machine learning modeling presented in this study were conducted using the Python programming language (version 3.12). The analysis relied extensively on key scientific libraries including pandas for data manipulation, numpy for numerical computations, and scikit-learn for implementing the Random Forest and XGBoost algorithms. Visualizations were generated using the matplotlib and seaborn libraries. All scripts were developed and executed within the Visual Studio Code integrated development environment.

3. Results

3.1. Model Performance in Predicting Urban Heat Exposure

One focus of this work is to model the complex relationship between built environment features and urban heat exposure. A snapshot of the datasets collected for the study region is displayed in Figure 2. The maps in Figure 2 use geographic coordinates (latitude and longitude in degrees) as their spatial reference.

Both ensemble regression models predict the spatial distribution of census tract temperature well, as shown in Figure 3. The Random Forest (RF) model yields a mean squared error (MSE) of 0.3975 and a coefficient of determination (R²) of 0.8994. The Extreme Gradient Boosting (XGBoost) model performs slightly better, with an MSE of 0.3623 and an R² of 0.9083. These results show that both models capture the complex relationship between the selected urban morphological features (NDVI, road network, POI, adjacent water bodies, building coverage and height) and local temperature, explaining 89.9% (RF) and 90.8% (XGBoost) of the variance in the census tract temperature observations.

The results in Table 2 confirm that the selected feature set has strong explanatory power for the spatial differentiation of the urban thermal environment. The Random Forest (RF) model achieved an average R² of 0.8994 with an average MSE of 0.3975, while XGBoost yielded a slightly higher average R² of 0.9083 and a lower average MSE of 0.3623. Both models exhibited low standard deviations in performance metrics across folds (MSE SD: 0.0925 for RF, 0.0951 for XGBoost; R² SD: 0.034 for RF, 0.0308 for XGBoost), indicating consistent predictive performance. The small performance difference may stem from XGBoost’s built-in regularization mechanisms, more efficient gradient optimization strategies, and enhanced learning capabilities for complex feature interactions, giving it a slight advantage on this particular dataset. Classification metrics derived from temperature threshold analysis revealed high precision (RF: 0.9212, XGBoost: 0.9126) and recall (RF: 0.9168, XGBoost: 0.9148), with correspondingly low false negative rates (RF: 0.0832, XGBoost: 0.0852) and false positive rates (RF: 0.0773, XGBoost: 0.0873). This comprehensive evaluation demonstrates the models’ strong generalization capability and reliability for urban heat island prediction.
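The threshold-based classification metrics can be sketched as follows. The article does not state which temperature threshold was used, so the 35.0 °C cut-off, the six temperature pairs, and the function name are assumptions for illustration.

```python
def threshold_metrics(observed, predicted, threshold):
    """Binarize temperatures at `threshold` and return confusion-matrix rates.

    Assumes both classes occur in the observed and predicted labels;
    otherwise a denominator below would be zero.
    """
    tp = fp = tn = fn = 0
    for obs, pred in zip(observed, predicted):
        is_hot, flagged_hot = obs >= threshold, pred >= threshold
        if is_hot and flagged_hot:
            tp += 1
        elif is_hot:
            fn += 1
        elif flagged_hot:
            fp += 1
        else:
            tn += 1
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "fnr": fn / (fn + tp),  # false negative rate = 1 - recall
        "fpr": fp / (fp + tn),  # false positive rate
    }


# Invented observed and predicted temperatures (°C) for six tracts.
observed = [36.0, 35.5, 33.0, 32.0, 36.5, 31.0]
predicted = [35.8, 34.6, 33.2, 35.1, 36.0, 30.5]
metrics = threshold_metrics(observed, predicted, threshold=35.0)
print(metrics)
```

The false negative rate is by definition one minus recall, which matches the reported pairs (for RF, 1 − 0.9168 = 0.0832; for XGBoost, 1 − 0.9148 = 0.0852).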

The high R² values (both close to or above 0.9) and low MSE values of the two models together indicate that the model predictions are highly reliable and robust, and that the fitted relationship accounts for most of the observed variation in temperature.

3.2. Dominant Role of Vegetation in Feature Importance Analysis

To explore the factors influencing the urban heat island effect across New York census tracts, we evaluated the relative importance of each feature using two models: Random Forest (RF) and XGBoost (XGB). For the RF model, feature importance was measured using Gini importance, while for the XGB model, the gain metric (average reduction in the loss function) was applied.

As shown in Table 3, both models consistently identified NDVI (Normalized Difference Vegetation Index) as the dominant predictor, with Gini importance of 0.8989 in RF and a gain importance of 0.8843 in XGBoost, both ranking first. This overwhelming contribution indicates that vegetation coverage exerts a far stronger effect on surface temperature than any other factor considered in the analysis. The next most influential variables were distance to coastline (D2Coast, second in both models) and distance to water bodies (D2Water, third in RF and fourth in XGBoost), though their importance values (RF: 0.0240 and 0.0139; XGBoost: 0.0213 and 0.0157) were much lower than NDVI. Such results suggest that while geographical proximity to natural cooling elements like water does matter, its role is secondary compared with vegetation. It is worth noting that Green space ranked ninth in RF but third in XGBoost, showing notable model-dependent variation. The remaining features—including POI density, average building height, building coverage ratio, and road network length—all had importance values predominantly below 0.014. Although their direct contributions to prediction accuracy were small, their consistent presence in both models indicates they still capture subtle variations in local thermal environments.

These patterns are further illustrated in Figure 4, where NDVI stands out as the overwhelmingly dominant factor in both models, while all other features cluster near zero. The visual contrast reinforces the statistical results from the table, making it immediately clear that vegetation cover plays a decisive role in regulating urban surface temperature. At the same time, the compressed scale of the other variables highlights the sharp disparity between NDVI and the rest, suggesting that most urban forms and socioeconomic characteristics provide only marginal explanatory power on their own. This does not mean they are irrelevant; rather, their influence is overshadowed by NDVI but may emerge more clearly through interactions that are difficult to visualize directly from the importance scores. As shown in Figure 5, SHAP analysis further validates these findings, with NDVI again exhibiting the highest mean |SHAP value| (approximately 1.2–1.4), substantially exceeding D2Coast (≈0.25) and D2Water (≈0.10). The remaining features—Coverage Ratio, Poly_POI, Average Height, Road Network, Pnt_POI, and Green Space—all show mean SHAP values below 0.05, reinforcing their relatively minor individual influence in the model.

The prominence of NDVI reflects the well-known cooling effects of vegetation through shading and evapotranspiration. Distances to coastlines and water bodies are also meaningful, as water moderates surface temperatures due to its high heat capacity and associated sea–land breeze circulation. The relatively low contributions of POI density, road networks, and building form may be explained by weaker or indirect physical links with surface temperature, as well as data limitations (e.g., POI static distributions, simplified building metrics). Nonetheless, these features may still enhance model performance by capturing nonlinear interactions with dominant variables like NDVI.

3.3. Socioeconomic Indicators and Their Weak Correlation with Temperature

In this study, the point-biserial correlation method was applied to examine the spatial relationship between land surface temperature and five socioeconomic vulnerability indicators: poverty (F_POV150), unemployment (F_UNEMP), housing burden (F_HBURD), education (F_NOHSDP), and health insurance (F_UNINSUR). Each indicator was coded as a binary variable, with “1” indicating that a census tract falls within the top 10% of vulnerability for that metric and “0” indicating a non-vulnerable area. As summarized in Table 4, all five indicators produced p-values greater than 0.05 (ranging from 0.685 to 0.920) and correlation coefficients near zero (|r| < 0.01). These results indicate no statistically significant correlation in this framework. The temperature differences between vulnerable and non-vulnerable areas were modest in scale, all less than 0.75 °C. Among the five metrics, four showed slightly cooler conditions in high-vulnerability areas, while only health insurance vulnerability corresponded to a small increase (+0.20 °C). Additionally, the age-related indicators (Age 65+ and Age < 17) also showed no significant correlation with temperature, with correlation coefficients near zero (Age 65+: r = −0.00669, p = 0.773583; Age < 17: r = −0.00954, p = 0.68158) and minimal temperature differences between vulnerable and non-vulnerable areas (Age 65+: −0.33 °C; Age < 17: +0.03 °C), consistent with the overall pattern of weak bivariate relationships.
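A correlation between a binary vulnerability flag and a continuous temperature is commonly computed as a point-biserial coefficient, which is numerically identical to Pearson's r on the 0/1 variable. The sketch below shows that calculation under this assumption; the four flags and temperatures are invented, and the function name is hypothetical.

```python
from math import sqrt


def point_biserial(flags, values):
    """Point-biserial r between a 0/1 flag and a continuous variable.

    r = (mean_1 - mean_0) / s * sqrt(n1 * n0 / n^2), where s is the
    population standard deviation of all values.
    """
    n = len(values)
    group1 = [v for f, v in zip(flags, values) if f == 1]
    group0 = [v for f, v in zip(flags, values) if f == 0]
    n1, n0 = len(group1), len(group0)
    mean1, mean0 = sum(group1) / n1, sum(group0) / n0
    mean_all = sum(values) / n
    sd = sqrt(sum((v - mean_all) ** 2 for v in values) / n)
    return (mean1 - mean0) / sd * sqrt(n1 * n0 / n ** 2)


# Invented example: 1 = tract flagged as vulnerable, 0 = not flagged.
vulnerable = [1, 1, 0, 0]
temperature = [34.0, 36.0, 30.0, 32.0]
r_pb = point_biserial(vulnerable, temperature)
print(round(r_pb, 4))
```

With a top-decile flag the two groups are very unequal in size, so the factor sqrt(n1 · n0 / n²) is small and even a visible difference in group means yields a modest r.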

However, the tercile stratification reveals a different pattern. Instead of focusing only on the most extreme 10% of vulnerable tracts, this method classified all census tracts into low, medium, and high vulnerability groups based on the 33rd and 66th percentiles of each indicator. This broader categorization allows for detecting gradient-like trends across the full distribution rather than only at the tail. Figure 6 shows the temperature distribution across socioeconomic vulnerability categories based on terciles, with a total sample size of 1851 census tracts. The specific sample sizes for each category are as follows: Poverty rate: Low (612), Medium (610), High (629); Unemployment rate: Low (615), Medium (611), High (625); Housing burden: Low (612), Medium (611), High (628); Education level (no high school diploma): Low (612), Medium (610), High (629); Health insurance (uninsured rate): Low (614), Medium (609), High (628). For the age structure indicators: Proportion of Age 65+: Low (622), Medium (607), High (622); Proportion of Age < 17: Low (624), Medium (608), High (619). Each subplot is presented as a violin plot, with the Y-axis representing surface temperature (°C) and the X-axis representing the low, medium, and high vulnerability categories based on the 33rd and 66th percentiles. The width of the violin plot represents the probability density, and the inner box plots represent the quartiles and medians.
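The tercile grouping can be sketched with the standard library. The percentile interpolation method and the handling of values that fall exactly on a cut point are not specified in the article, so the choices below (inclusive percentiles, ties assigned to the lower group) are assumptions, as are the nine invented poverty rates and temperatures.

```python
from statistics import mean, quantiles


def tercile_labels(scores):
    """Label each score low/medium/high using the 33rd and 66th percentiles."""
    cuts = quantiles(scores, n=100, method="inclusive")  # 99 percentile cut points
    p33, p66 = cuts[32], cuts[65]
    labels = []
    for score in scores:
        if score <= p33:
            labels.append("low")
        elif score <= p66:
            labels.append("medium")
        else:
            labels.append("high")
    return labels


def mean_temperature_by_group(labels, temperatures):
    """Average temperature within each vulnerability group."""
    groups = {}
    for label, temp in zip(labels, temperatures):
        groups.setdefault(label, []).append(temp)
    return {label: mean(temps) for label, temps in groups.items()}


# Invented poverty rates (%) and land surface temperatures (°C) for nine tracts.
poverty_rate = [5, 10, 15, 20, 25, 30, 35, 40, 45]
lst_celsius = [30.0, 30.5, 31.0, 32.0, 32.5, 33.0, 34.0, 34.5, 35.0]

labels = tercile_labels(poverty_rate)
group_means = mean_temperature_by_group(labels, lst_celsius)
print(group_means)
```

Unlike the top-decile flag, this grouping places roughly a third of the tracts in each category, so differences between group means reflect the whole distribution rather than only its tail.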

As illustrated in Figure 6, mean summer daily temperatures increased consistently with rising levels of socioeconomic vulnerability. Specifically, temperatures ranged from 26 °C (low) to 34 °C (high) for poverty, 28 °C to 34 °C for unemployment, 26 °C to 32 °C for housing burden, and 26 °C to 38 °C for both low educational attainment and lack of insurance. A similar gradient is evident for the age structure indicators: temperatures increased from 26 °C (low) to 32 °C (high) for tracts with a high proportion of residents Age 65+, and from 26 °C to 32 °C for tracts with a high proportion of residents Age < 17. Across all indicators, the most disadvantaged tercile experienced substantially higher heat exposure than the least disadvantaged tercile, with differences of 6–12 °C.

Taken together, these findings suggest that while binary top-decile comparisons show negligible differences, the tercile stratification reveals systematic overlaps between socioeconomic disadvantage and elevated heat exposure. The discrepancy arises because the binary approach focuses only on the most extreme 10% of vulnerable tracts, which may mask broader gradients across the population, especially in cities where policy interventions and environmental buffers mitigate conditions for the very worst-off neighborhoods. In contrast, the tercile method captures vulnerability across the full distribution, offering a more sensitive and comprehensive view of inequality. Therefore, although both approaches provide useful perspectives, the stratified analysis better reflects the cumulative heat burden faced by disadvantaged groups. Overall, our results indicate that socioeconomic inequality and climate vulnerability are closely intertwined in the study area, reinforcing the need for adaptation strategies that address both environmental exposures and their underlying social determinants.

To further quantify the relationship, we reanalyzed the socioeconomic vulnerability indicators as continuous variables using correlation analysis and multiple regression (Table 4 and Table 5). The results revealed a subtle relationship: while the bivariate correlation between individual socioeconomic vulnerability indices and surface temperature was negligible (all |r| < 0.01, p > 0.68), multiple regression analysis identified two significant predictors. Areas with higher education levels (lower education levels: β = −0.569, p < 0.001) had lower surface temperatures, potentially reflecting related green infrastructure investment. Conversely, areas with limited health insurance coverage (no health insurance: β = 0.415, p < 0.001) had higher surface temperatures, indicating an overlap between environmental and social vulnerabilities. Poverty, unemployment, and housing burden did not show significant independent effects. Similarly, the age structure indicators (Age 65+ and Age < 17) were not significant in the multiple regression model (Age 65+: β = 0.006277, p = 0.11087; Age < 17: β = −0.00634, p = 0.104688), suggesting that age alone does not independently influence temperature patterns after accounting for other socioeconomic variables. These findings confirm that the previously reported weak relationship between socioeconomic status and temperature is not merely a result of categorical variable transformation but is robust to the choice of analytical method.

3.4. Sensitivity Simulations of Urban Morphological Features

Based on the random forest regression model, a sensitivity analysis was conducted on eight urban morphological features to evaluate their influence on land surface temperature. The random forest model, built with 100 decision trees, performed well on the test set (R² = 0.898, MSE = 0.634 °C²), with a baseline temperature of 34.63 °C across all samples.

The quantitative outcomes are summarized in Table 6. Among all features, NDVI was by far the most influential: increasing NDVI to a coefficient of 0.75 led to a sharp temperature reduction of 6.37 °C, resulting in a predicted temperature of 28.26 °C. In contrast, distance-related factors such as D2coast (coefficient 0.05) and D2water (coefficient 0.05) produced much smaller effects, with reductions of 0.28 °C and 0.16 °C, respectively. Adjustments to the remaining features—road network, average building height, building coverage ratio, Pnt_POI, Poly_POI, and Green space—produced only marginal changes, all below 0.1 °C. Notably, green space at coefficient 0.05 showed a minimal reduction of only 0.01 °C. These results point to vegetation as the most effective single lever for reducing surface heat.

A broader view of the sensitivity experiments is shown in Figure 7, which illustrates the response of each feature across the full range of adjustment coefficients. The figure clearly shows that NDVI achieves substantially greater temperature reduction across its adjustment range compared to all other features. While the general trend confirms NDVI’s dominance, some subtler patterns emerge. For instance, increasing average building height to coefficient 2.0 produced a 0.05 °C reduction, while Poly_POI at coefficient 1.6 showed a 0.02 °C reduction—weak but measurable cooling effects, possibly reflecting shading from taller buildings or the cooling contributions of specific land uses. Meanwhile, dramatic reductions in features like road network (coefficient 0.05) showed only 0.04 °C cooling, though such adjustments are clearly unrealistic in real-world planning scenarios.

Taken together, the sensitivity analysis suggests that the most practical and effective strategies for mitigating the urban heat island effect are those that enhance vegetation cover (NDVI). Proximity to coastal and water bodies provides secondary benefits, while the contributions of other urban form features are relatively limited. The minimal impact of green space as an individual factor suggests that simply increasing green space distribution without considering vegetation quality may have limited cooling effect. At the same time, secondary effects—such as shading from taller structures or diversified land uses—should not be ignored, and the combined influence of multiple urban form features needs to be considered when developing comprehensive heat-mitigation strategies.
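The mechanics of such a sensitivity experiment can be sketched as follows. The sketch assumes that an adjustment coefficient multiplies one feature across all tracts while the others are held fixed, and that the effect is the change in mean predicted temperature. The exact adjustment rule is not spelled out in this section, so this is one plausible reading. The function `toy_lst_model`, its weights, and the three tracts are hypothetical stand-ins, not the fitted Random Forest or the study data.

```python
from statistics import mean

def toy_lst_model(tract):
    """Hypothetical stand-in for a fitted regressor; NOT the paper's Random Forest."""
    return 36.0 - 8.0 * tract["ndvi"] + 0.4 * tract["d2coast_km"]

def sensitivity(predict, tracts, feature, coefficients):
    """Scale one feature by each coefficient and report the change in mean prediction."""
    baseline = mean(predict(t) for t in tracts)
    deltas = {}
    for c in coefficients:
        adjusted = [{**t, feature: t[feature] * c} for t in tracts]
        deltas[c] = mean(predict(t) for t in adjusted) - baseline
    return baseline, deltas

# Toy tracts (invented values).
tracts = [
    {"ndvi": 0.10, "d2coast_km": 4.0},
    {"ndvi": 0.25, "d2coast_km": 1.0},
    {"ndvi": 0.40, "d2coast_km": 2.5},
]

baseline, deltas = sensitivity(toy_lst_model, tracts, "d2coast_km", [0.05, 0.5, 1.0, 2.0])
print(f"baseline mean LST: {baseline:.2f} °C")
for c, d in deltas.items():
    print(f"coefficient {c}: change of {d:+.2f} °C")
```

Because `sensitivity` only needs a `predict` callable, any fitted model could be substituted for the toy function. A coefficient of 1.0 leaves the data unchanged and should always return a zero change, which is a useful sanity check.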

4. Discussion

This section reflects on the main findings of the study, positioning them within the broader literature and highlighting their implications for urban climate adaptation. Three themes are emphasized: (i) the dominant role of environmental and morphological features, (ii) the socioeconomic dimension of thermal exposure, and (iii) the implications for planning and policy.

4.1. Vegetation and Water as Primary Environmental Drivers of UHI

The analysis demonstrates that vegetation cover (NDVI) is by far the most influential determinant of spatial variation in land surface temperature. It ranks first among all features in both models, with an importance of 0.8989 (Gini importance) in Random Forest and 0.8843 (gain importance) in XGBoost, or roughly 90% of total feature importance. SHAP analysis quantitatively confirmed this dominance, showing a mean |SHAP value| for NDVI of approximately 1.2–1.4, far exceeding D2Coast (≈0.25) and D2Water (≈0.10). This finding aligns with the broad consensus on the central role of greenspace in UHI mitigation, as evidenced by studies from cities as diverse as Beijing and Berlin [31,33]. However, the sheer magnitude of NDVI’s dominance in our New York case study, surpassing the second-ranked feature by more than 37 times, is striking. This degree of dominance is not universally reported. For instance, studies in cities with denser built forms or arid climates often report a more balanced contribution from building morphology, albedo, and shading. In Singapore, for example, building height and compactness showed comparable importance to vegetation in certain districts [44], while in Phoenix, surface materials and urban geometry were critical due to limited vegetation and high solar irradiance [45]. The exceptional dominance of NDVI in our models may be attributed to New York’s unique geographical context: its extensive water bodies and the successful, large-scale implementation of greening programs, such as the Million Trees NYC initiative, which have collectively amplified the relative cooling signal of vegetation across the urban fabric. Sensitivity simulations revealed that increasing NDVI to a coefficient of 0.75 can reduce average summer surface temperatures by 6.37 °C, a magnitude far greater than the cooling contributions from other features (all below 0.3 °C).
This potential cooling aligns with findings from cities like Vienna, where targeted greening was projected to reduce peak temperatures by 4–6 °C [46], but exceeds the 2–4 °C reductions commonly reported in meta-analyses for temperate cities [34], highlighting the high efficacy potential in New York’s specific context. By contrast, morphological indicators such as road network, POI density, building height, and coverage ratio exhibited importance values below 0.014 in both models, suggesting that their thermal effects are either nonlinear, context-dependent, or overshadowed by vegetation and water systems. It is worth noting that green space, which had a low importance score in RF (0.0077, rank 9) but ranked third in XGBoost, showed minimal cooling effect (0.01 °C) in sensitivity testing, indicating that its cooling role may depend more on vegetation quality than spatial density. This nuance is crucial; it suggests that the type of green space (e.g., forested park vs. lawn) matters significantly, a point supported by studies showing that tree canopy cover is a better predictor of cooling than general green space area [47]. These findings were further validated through 10-fold cross-validation, with both models achieving high predictive accuracy (RF R² = 0.8994, XGBoost R² = 0.9083). This supports the reliability of these vegetation-driven insights and underscores the need for nature-based interventions, including tree planting, pocket parks, and green roof initiatives, as the most impactful strategies for mitigating UHIs.
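The mean |SHAP value| ranking cited in this subsection averages the absolute per-tract attribution of each feature. The sketch below computes exact Shapley values by brute force over all feature orderings, relative to a single baseline point (the feature means). This is an assumption made to keep the example small, and it is not the tree-based estimator typically used with Random Forest or XGBoost models [26]. The toy model, its weights, and the three tracts are hypothetical.

```python
from itertools import permutations
from math import factorial
from statistics import mean

def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x, relative to one baseline point."""
    names = list(x)
    phi = {name: 0.0 for name in names}
    for order in permutations(names):
        current = dict(baseline)
        prev = predict(current)
        for name in order:
            current[name] = x[name]      # switch this feature from baseline to x
            new = predict(current)
            phi[name] += new - prev      # marginal contribution in this ordering
            prev = new
    n_orders = factorial(len(names))
    return {name: total / n_orders for name, total in phi.items()}

def toy_model(f):
    """Hypothetical stand-in for a fitted LST regressor (weights are invented)."""
    return 36.0 - 8.0 * f["ndvi"] + 0.4 * f["d2coast_km"] + 0.1 * f["d2water_km"]

tracts = [
    {"ndvi": 0.10, "d2coast_km": 4.0, "d2water_km": 1.0},
    {"ndvi": 0.25, "d2coast_km": 1.0, "d2water_km": 2.0},
    {"ndvi": 0.40, "d2coast_km": 2.5, "d2water_km": 3.0},
]
names = ["ndvi", "d2coast_km", "d2water_km"]
baseline = {k: mean(t[k] for t in tracts) for k in names}

per_tract = [shapley_values(toy_model, t, baseline) for t in tracts]
mean_abs_shap = {k: mean(abs(p[k]) for p in per_tract) for k in names}
print(mean_abs_shap)
```

For each tract the attributions sum to the difference between the tract's prediction and the baseline prediction, which is the property that makes mean absolute values comparable across features. Brute-force enumeration grows factorially with the number of features, so it is only practical for small illustrations like this one.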

4.2. Socioeconomic Equity and Patterns of Thermal Exposure

Contrary to expectations and much of the global UHI literature, our results showed no statistically significant relationship between socioeconomic vulnerability indicators (poverty, unemployment, housing cost burden, low education, lack of insurance) and land surface temperature across census tracts in New York. This finding contrasts with numerous studies in the United States, such as those by Hsu et al. (2021) [39], which found that people of color and those living in poverty were consistently exposed to higher summer surface temperatures in most major U.S. cities, and Huang et al. (2019), who documented clear thermal inequities linked to income and race across hundreds of urban areas [48]. However, when re-analyzed as continuous variables using multivariate regression, a more nuanced pattern emerged: areas with higher educational attainment (Poor Education: β = −0.569, p < 0.001) exhibited significantly lower temperatures, while tracts with limited health insurance coverage (No Health Insurance: β = 0.415, p < 0.001) showed elevated temperatures. This suggests that, at least in this region, heat exposure is relatively evenly distributed across social groups, though specific vulnerability dimensions reveal complex interactions between social and environmental factors. The education-temperature relationship may reflect correlated investments in green infrastructure in better-educated neighborhoods, while the health insurance effect potentially captures overlapping social and environmental vulnerabilities in underserved communities. This pattern, where specific social determinants like education show a clearer signal than broad income-based measures, has also been observed in other contexts, suggesting that the pathways linking social disadvantage to environmental risk can be multifaceted and not always captured by aggregate poverty metrics [49]. This finding presents a notable contrast to studies in many U.S. 
and global cities, where thermal inequity is a persistent and pressing environmental justice issue. Several factors may contribute to the comparatively even distribution of heat exposure observed here: extensive coastal cooling effects (with D2Coast ranking second in importance in both models), large-scale greening programs like the Million Trees NYC initiative that targeted low-income areas, and the widespread provision of public housing in waterfront areas that benefit from natural cooling. These potential explanatory factors warrant further investigation. The success of targeted greening programs in mitigating thermal inequity, if confirmed, could serve as a valuable model for other cities [50]. Similarly, the distribution of public housing in microclimatically favorable zones is a distinctive aspect of New York’s urban history that may not be replicated elsewhere, which may explain the divergence from national patterns. From an urban microclimate perspective, the relatively low importance of morphological factors such as building coverage and road network may help explain why socioeconomic gradients are less pronounced in thermal exposure. While the absence of pronounced thermal inequity is encouraging, it does not diminish the broader risks for vulnerable populations. Even if exposure is comparable, groups with fewer financial resources or limited access to cooling infrastructure remain disproportionately at risk of heat-related morbidity and mortality. Thus, equity-oriented heat management strategies should remain central to resilience planning.

4.3. Planning and Policy Implications for UHI Mitigation

The findings of this study carry practical implications for climate-adaptive planning. First, the evidence strongly supports prioritizing vegetation enhancement as the most effective and scalable cooling measure. Building on the identification of NDVI as the dominant cooling factor (6.37 °C potential reduction), municipalities could develop urban greening threshold maps that identify priority areas for intervention based on quantifiable cooling benefits. This approach moves beyond uniform greening mandates towards a performance-based paradigm, similar to recommendations emerging from studies in European cities seeking to optimize green infrastructure investments [51]. Urban greening should be incorporated into zoning codes, infrastructure investment, and community-led initiatives to maximize thermal benefits. Second, water proximity (D2Coast and D2Water) also emerged as stable secondary cooling factors, with sensitivity analysis showing 0.28 °C and 0.16 °C reductions, respectively, suggesting that enhancing public access to waterfronts and integrating blue–green infrastructure (e.g., wetlands, retention ponds, waterfront parks) could further alleviate heat stress. The synergistic effect of combining blue and green infrastructure for enhanced cooling and other ecosystem services is a key principle in contemporary sustainable urban drainage and climate adaptation strategies [34,52]. Third, the minimal individual impact of green space (0.01 °C reduction) suggests that simply increasing green space without ensuring adequate vegetation coverage may yield limited cooling benefits. This critical distinction between ‘green space area’ and ‘vegetation quality/density’ reinforces calls in the literature for planning policies to specify canopy cover targets or promote specific types of vegetation, rather than relying solely on area-based metrics [47]. 
The relatively uniform distribution of heat across socioeconomic strata in New York highlights the success of past greening and housing policies but also signals the need to maintain and expand such equity-sensitive programs. The validated machine learning framework (with cross-validated R² > 0.89) can be operationalized as a policy simulation tool, allowing planners to quantify the thermal impacts of various intervention scenarios before implementation. Finally, integrating interpretable machine learning frameworks into planning practice offers policymakers a transparent and data-driven tool to simulate the benefits of alternative interventions, thereby aligning scientific insight with actionable strategies.
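A cross-validated R², as cited above, is the average of R² scores computed on held-out folds. The sketch below shows the procedure with a one-variable least-squares line as a stand-in model and synthetic data. Both are assumptions for illustration, since the study's models are Random Forest and XGBoost. Folds are formed by interleaving indices rather than by random shuffling, to keep the example deterministic.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for one predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

def r_squared(y_true, y_pred):
    """Coefficient of determination on held-out data."""
    my = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - my) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def k_fold_r2(xs, ys, k=10):
    """Fit on k-1 folds, score on the remaining fold, and repeat for every fold."""
    scores = []
    for fold in range(k):
        train = [i for i in range(len(xs)) if i % k != fold]
        test = [i for i in range(len(xs)) if i % k == fold]
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        preds = [slope * xs[i] + intercept for i in test]
        scores.append(r_squared([ys[i] for i in test], preds))
    return scores

# Synthetic data: temperature falling with a vegetation-like index, plus small noise.
xs = [i / 200 for i in range(200)]
ys = [36.0 - 8.0 * x + 0.2 * math.sin(i) for i, x in enumerate(xs)]

scores = k_fold_r2(xs, ys)
mean_r2 = sum(scores) / len(scores)
print(f"{len(scores)}-fold mean R²: {mean_r2:.3f}")
```

Reporting the spread of the per-fold scores alongside the mean gives planners a sense of how stable the model is before it is used to compare intervention scenarios.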

4.4. Implications for Global Urban Planning

While our model is calibrated for the specific context of New York City, the interpretable machine learning framework and key findings offer transferable insights for planners in other cities globally facing similar UHI challenges. First, the established hierarchy of cooling effects—with vegetation as the dominant factor, followed by water proximity and urban form—provides a strategic blueprint for prioritizing investments in a variety of urban environments. The relative weight of these factors will vary by climatic zone and urban fabric—for example, water bodies may be more critical in arid cities, while shading from urban form may be paramount in ultra-dense cores—but the hierarchical approach to prioritization remains valid [53]. The identification of a vegetation cover threshold is particularly valuable, enabling cities to shift from arbitrary greening targets to performance-based objectives for optimizing green infrastructure budgets.

Second, our methodology, which combines readily available satellite and census data, offers a replicable and low-cost approach for cities that may lack extensive ground-based sensor networks, thereby lowering the barrier to data-driven planning in resource-constrained contexts. This democratizes advanced urban analytics, allowing cities in the Global South, for instance, to conduct robust UHI assessments without prohibitive costs, a step towards bridging the data gap in urban climate studies [54].

Finally, the distinct socioeconomic-thermal pattern observed in New York—characterized by relatively equitable exposure—serves as a valuable reference point. It demonstrates that severe thermal inequity is not an inevitable outcome of urbanization and can be mitigated through conscious policy. This provides a counter-narrative and a goal for cities currently grappling with significant environmental injustices. In global cities where significant thermal inequalities are found, our framework can be adapted as a diagnostic tool to detect and quantify such disparities, thereby informing targeted environmental justice interventions. The global applicability of our approach lies not in prescribing universal solutions, but in providing a transparent, quantitative methodology to guide city-specific planning and policy.

5. Conclusions

Clarifying the role of different drivers of urban heat islands is crucial for managing and mitigating the heat island effect. This study proposes an interpretable machine learning framework that can inform effective adjustment strategies for reducing urban heat islands. Decision-tree-based models (Random Forest and XGBoost) were combined with ecological, built-environment, and socioeconomic characteristics to explore the state of urban heat islands. The models were trained and tested using census-tract data from the New York metropolitan area, and both achieved high accuracy. Feature importance analysis showed that NDVI is by far the dominant influence on temperature in the study area. We then simulated different adjustment coefficients for each indicator to identify the most effective scheme for mitigating the heat island effect. To make the study more comprehensive, temperature differences between areas with different socioeconomic conditions were also examined, highlighting the need for locally specific measures and model training specifications.

The models and results of this study are significant in several respects. First, these machine learning models are real-time and dynamic, providing a data-based and easy-to-implement tool to dynamically simulate specific indicators in different regions and provide a basis for relevant policies. Second, this study contributes to research on the role of machine learning models in studying the urban heat island effect. Data-driven models have been shown to complement physics-based models effectively, balancing efficiency and mechanistic insight. Heat island research can thereby benefit from large-scale predictive implementation, multivariate data fusion, dynamic scenario adaptation, mechanism research, and refined simulation, while reducing the obstacles posed by the computational cost of physics-based models.

Based on the model and results presented in this study, several directions for future work could address its limitations. First, this study does not cover all regions, so the results are incomplete; future studies could use methods such as graph neural networks (GNNs) to capture the spatial adjacency of regions and the characteristics of each region, and so obtain more comprehensive and balanced data for model training. Second, the selected data lacked granularity and precision, the inclusion of built environment characteristics was incomplete, and the design of socioeconomic factors was inadequate because the affluence and health indicators were too general and did not play a significant role. Using MODIS data (1 km resolution) for fine-scale modeling in a complex urban environment like New York City may lead to scale mismatch issues. Furthermore, relying solely on 2022 data limits the model’s generalization ability. Future research will incorporate community-level data on gardens and green spaces and economic data on children and the elderly, and will process data from multiple years for comparison, while placing greater emphasis on data accuracy. Finally, because this study examines each indicator only coarsely, which may make strategy formulation less detailed, future research with an expanded dataset could study the function of each indicator in more detail (for example, the three-dimensional green volume of the vegetation canopy or a root permeability index), so as to enhance the utility of the model in management and response and provide more accurate information.

Author Contributions

Conceptualization, S.L. and Z.L.; methodology, S.L.; software, S.L.; validation, S.L.; formal analysis, S.L.; investigation, S.L.; resources, S.L.; data curation, S.L.; writing—original draft preparation, S.L.; writing—review and editing, S.L.; visualization, S.L.; supervision. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

All data supporting the findings of this study have been deposited in a public repository. The dataset, entitled New York Dataset, is available through Figshare at the following DOI: https://doi.org/10.6084/m9.figshare.30272128.v1. The shared data encompasses key urban metrics for New York, including but not limited to average road network length, average building height, building coverage ratio, the shortest distance to water systems or coastlines, surface temperature, NDVI (Normalized Difference Vegetation Index), point-of-interest (POI) density, and area POI density.

Acknowledgments

The authors would like to thank Yikang Wang, who gave us helpful suggestions throughout the preparation of this article.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

LST	Land Surface Temperature
NDVI	Normalized Difference Vegetation Index
D2coast	Distance to the nearest coast
D2water	Distance to the nearest stream
Poly_POI	Polygon POIs
Pnt_POI	Point POIs

References

1. Li, Y.; Schubert, S.; Kropp, J.P.; Rybski, D. On the influence of density and morphology on the Urban Heat Island intensity. Nat. Commun. 2020, 11, 2647.
2. Stone, B.; Hess, J.J.; Frumkin, H. Urban form and extreme heat events: Are sprawling cities more vulnerable to climate change than compact cities? Environ. Health Perspect. 2010, 118, 1425–1428.
3. Yu, W.; Yang, J.; Sun, D.; Xue, B.; Sun, W.; Ren, J.; Yu, H.; Xiao, X.; Xia, J.C.; Li, X. Shared insights for heat health risk adaptation in metropolitan areas of developing countries. iScience 2024, 27, 109728.
4. Budzik, G.; Sylla, M.; Kowalczyk, T. Understanding Urban Cooling of Blue–Green Infrastructure: A Review of Spatial Data and Sustainable Planning Optimization Methods for Mitigating Urban Heat Islands. Sustainability 2025, 17, 142.
5. Peng, S.; Piao, S.; Ciais, P.; Friedlingstein, P.; Ottle, C.; Bréon, F.-M.; Nan, H.; Zhou, L.; Myneni, R.B. Surface urban heat island across 419 global big cities. Environ. Sci. Technol. 2012, 46, 696–703.
6. Li, H.; Zhou, Y.; Li, X.; Meng, L.; Wang, X.; Wu, S.; Sodoudi, S. A new method to quantify surface urban heat island intensity. Sci. Total Environ. 2018, 624, 262–272.
7. Mihalakakou, G.; Santamouris, M.; Papanikolaou, N.; Cartalis, C.; Tsangrassoulis, A. Simulation of the urban heat island phenomenon in Mediterranean climates. Pure Appl. Geophys. 2004, 161, 429–451.
8. Sarkar, A.; De Ridder, K. The urban heat island intensity of Paris: A case study based on a simple urban surface parametrization. Bound. Layer Meteorol. 2011, 138, 511–520.
9. Fan, C.; Zou, B.; Li, J.; Wang, M.; Liao, Y.; Zhou, X. Exploring the relationship between air temperature and urban morphology factors using machine learning under local climate zones. Case Stud. Therm. Eng. 2024, 55, 104151.
10. Ramírez-Aguilar, E.A.; Souza, L.C.L. Urban form and population density: Influences on Urban Heat Island intensities in Bogotá, Colombia. Urban Clim. 2019, 29, 100497.
11. Kotharkar, R.; Surawar, M. Land use, land cover, and population density impact on the formation of canopy urban heat islands through traverse survey in the Nagpur urban area, India. J. Urban Plan. Dev. 2016, 142, 04015003.
12. Zhang, J.; Wang, Y. Study of the relationships between the spatial extent of surface urban heat islands and urban characteristic factors based on Landsat ETM+ data. Sensors 2008, 8, 7453–7468.
13. Cilek, M.U.; Cilek, A. Analyses of land surface temperature (LST) variability among local climate zones (LCZs) comparing Landsat-8 and ENVI-met model data. Sustain. Cities Soc. 2021, 69, 102877.
14. Guo, F.; Hertel, D.; Schlink, U.; Hu, D.; Qian, J.; Wu, W. Remote sensing-based attribution of urban heat islands to the drivers of heat. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5002312.
15. Hawkins, T.W.; Brazel, A.J.; Stefanov, W.L.; Bigler, W.; Saffell, E.M. The role of rural variability in urban heat island determination for Phoenix, Arizona. J. Appl. Meteorol. 2004, 43, 476–486.
16. Voogt, J.A.; Oke, T.R. Thermal remote sensing of urban climates. Remote Sens. Environ. 2003, 86, 370–384.
17. Liu, L.; Zhang, Y. Urban heat island analysis using the Landsat TM data and ASTER data: A case study in Hong Kong. Remote Sens. 2011, 3, 1535–1552.
18. Zhou, W.; Huang, G.; Cadenasso, M.L. Does spatial configuration matter? Understanding the effects of land cover pattern on land surface temperature in urban landscapes. Landsc. Urban Plan. 2011, 102, 54–63.
19. Klein-Rosenthal, J.; Raven, J. Snapshot: Urban Heat And Urban Design—An Opportunity To Transform In NYC. 2017. Available online: http://www.sallan.org/Snapshot/2017/05/how_passive_houses_took_over_brussels.php (accessed on 18 July 2017).
20. National Academies of Sciences, Engineering, and Medicine; Division on Earth and Life Studies; Water Science and Technology Board; Committee to Review the New York City Department of Environmental Protection Operations Support Tool for Water Supply. Review of the New York City Department of Environmental Protection Operations Support Tool for Water Supply; The National Academies Press: Washington, DC, USA, 2019.
21. Charles-Guzman, K. Cool neighborhoods NYC: A data-driven approach to keep communities safe and adapt New York city to rising temperatures and extreme heat events. In Proceedings of the American Meteorological Society Meeting Abstracts, Boston, MA, USA, 12–16 January 2020.
22. Zimmerman, R.; Foster, S.; González, J.E.; Jacob, K.; Kunreuther, H.; Petkova, E.P.; Tollerson, E. New York City Panel on Climate Change 2019 Report Chapter 7: Resilience Strategies for Critical Infrastructures and Their Interdependencies. 2019. Available online: https://nyaspubs.onlinelibrary.wiley.com/doi/10.1111/nyas.14010 (accessed on 28 October 2025).
23. Li, X.; Zhou, Y.; Asrar, G.R.; Mao, J.; Li, X.; Li, W. Response of vegetation phenology to urbanization in the conterminous United States. Glob. Change Biol. 2017, 23, 2818–2830.
24. Mendez-Astudillo, J.; Mendez-Astudillo, M. A machine learning approach to monitoring the UHI from GNSS data. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5800911.
25. Stewart, I.D.; Oke, T.R. Local climate zones for urban temperature studies. Bull. Am. Meteorol. Soc. 2012, 93, 1879–1900.
26. Lundberg, S.M.; Lee, S.-I. A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst. 2017, 30.
27. Molnar, C. Interpretable Machine Learning; Lulu: Morrisville, NC, USA, 2020.
28. Haklay, M.; Weber, P. Openstreetmap: User-generated street maps. IEEE Pervasive Comput. 2008, 7, 12–18.
29. Barron, C.; Neis, P.; Zipf, A. A comprehensive framework for intrinsic OpenStreetMap quality analysis. Trans. GIS 2014, 18, 877–895.
30. Zielstra, D.; Hochmair, H.H. Comparative study of pedestrian accessibility to transit stations using free and proprietary network data. Transp. Res. Rec. 2011, 2217, 145–152.
31. Gunawardena, K.R.; Wells, M.J.; Kershaw, T. Utilising green and bluespace to mitigate urban heat island intensity. Sci. Total Environ. 2017, 584, 1040–1055.
32. Chen, X.; Xu, Y.; Yang, J.; Wu, Z.; Zhu, H. Remote sensing of urban thermal environments within local climate zones: A case study of two high-density subtropical Chinese cities. Urban Clim. 2020, 31, 100568.
33. Bowler, D.E.; Buyung-Ali, L.; Knight, T.M.; Pullin, A.S. Urban greening to cool towns and cities: A systematic review of the empirical evidence. Landsc. Urban Plan. 2010, 97, 147–155.
34. Kumar, P.; Debele, S.E.; Khalili, S.; Halios, C.H.; Sahani, J.; Aghamohammadi, N.; de Fatima Andrade, M.; Athanassiadou, M.; Bhui, K.; Calvillo, N. Urban heat mitigation by green and blue infrastructure: Drivers, effectiveness, and future needs. Innovation 2024, 5, 100588.
35. Srivastava, V.T.; Sharma, A.; Jadon, S. A review of the formation, mitigation strategies from 50 years of global urban heat island studies. Environ. Dev. Sustain. 2024, 1–18.
36. Cuce, P.M.; Cuce, E.; Santamouris, M. Towards sustainable and climate-resilient cities: Mitigating urban heat islands through green infrastructure. Sustainability 2025, 17, 1303.
37. Chakraborty, T.; Lee, X. A simplified urban-extent algorithm to characterize surface urban heat islands on a global scale and examine vegetation control on their spatiotemporal variability. Int. J. Appl. Earth Obs. Geoinf. 2019, 74, 269–280.
38. Estoque, R.C.; Murayama, Y.; Myint, S.W. Effects of landscape composition and pattern on land surface temperature: An urban heat island study in the megacities of Southeast Asia. Sci. Total Environ. 2017, 577, 349–359.
39. Hsu, A.; Sheriff, G.; Chakraborty, T.; Manya, D. Disproportionate exposure to urban heat island intensity across major US cities. Nat. Commun. 2021, 12, 2721.
40. Manoli, G.; Fatichi, S.; Schläpfer, M.; Yu, K.; Crowther, T.W.; Meili, N.; Burlando, P.; Katul, G.G.; Bou-Zeid, E. Magnitude of urban heat islands largely explained by climate and population. Nature 2019, 573, 55–60.
41. Gago, E.J.; Roldan, J.; Pacheco-Torres, R.; Ordóñez, J. The city and urban heat islands: A review of strategies to mitigate adverse effects. Renew. Sustain. Energy Rev. 2013, 25, 749–758.
42. Oke, T.R. Initial Guidance to Obtain Representative Meteorological Observations at Urban Sites. Instruments and Methods of Observation Program; IOM Report No. 81, WMO/TD 1250; World Meteorological Organization: Geneva, Switzerland, 2004.
43. Zhou, B.; Rybski, D.; Kropp, J.P. The role of city size and urban form in the surface urban heat island. Sci. Rep. 2017, 7, 4791.
44. Jin, H.; Cui, P.; Wong, N.H.; Ignatius, M. Assessing the effects of urban morphology parameters on microclimate in Singapore to control the urban heat island effect. Sustainability 2018, 10, 206.
45. Middel, A.; Chhetri, N.; Quay, R. Urban forestry and cool roofs: Assessment of heat mitigation strategies in Phoenix residential neighborhoods. Urban For. Urban Green. 2015, 14, 178–186.
46. Gál, T.; Unger, J. Detection of ventilation paths using high-resolution roughness parameter mapping in a large urban area. Build. Environ. 2009, 44, 198–206.
47. Zölch, T.; Maderspacher, J.; Wamsler, C.; Pauleit, S. Using green infrastructure for urban climate-proofing: An evaluation of heat mitigation measures at the micro-scale. Urban For. Urban Green. 2016, 20, 305–316.
48. Huang, K.; Li, X.; Liu, X.; Seto, K.C. Projecting global urban land expansion and heat island intensification through 2050. Environ. Res. Lett. 2019, 14, 114037.
49. Jesdale, B.M.; Morello-Frosch, R.; Cushing, L. The racial/ethnic distribution of heat risk–related land cover in relation to residential segregation. Environ. Health Perspect. 2013, 121, 811–817.
50. Gerrish, E.; Watkins, S.L. The relationship between urban forests and income: A meta-analysis. Landsc. Urban Plan. 2018, 170, 293–308.
51. Di Leo, N.; Escobedo, F.J.; Dubbeling, M. The role of urban green infrastructure in mitigating land surface temperature in Bobo-Dioulasso, Burkina Faso. Environ. Dev. Sustain. 2016, 18, 373–392.
52. Liu, W.; Chen, W.; Peng, C. Assessing the effectiveness of green infrastructures on urban flooding reduction: A community scale study. Ecol. Model. 2014, 291, 6–14.
53. O’Malley, C.; Piroozfar, P.; Farr, E.R.; Pomponi, F. Urban Heat Island (UHI) mitigating strategies: A case-based comparative analysis. Sustain. Cities Soc. 2015, 19, 222–235.
54. Masson, V.; Lemonsu, A.; Hidalgo, J.; Voogt, J. Urban climates and climate change. Annu. Rev. Environ. Resour. 2020, 45, 411–444.

Figure 1. The analytic workflow of this study. It involved collecting and integrating environmental and socioeconomic datasets, applying Random Forest and XGBoost for regression, conducting feature importance and sensitivity analyses, and finally examining socioeconomic correlations and regional heat exposure differences.

Figure 2. Spatial distribution of the datasets used to derive features in our study region: (a) LST distribution map, (b) NDVI distribution map, (c) building distribution map, (d) road network distribution map, (e) coastline distribution map, (f) waterway distribution map, (g) green space distribution map.

Figure 3. Scatter plots of predicted versus actual land surface temperature values based on various urban built features. (a) Results from the Random Forest (RF) model ($R^{2} = 0.899$). (b) Results from the XGBoost model ($R^{2} = 0.908$). The solid black line in each plot represents the 1:1 perfect prediction line.

Figure 4. Histogram of feature importance for the RF and XGBoost models.

Figure 5. Histogram of SHAP analysis results for the RF and XGBoost models.

Figure 6. Temperature statistics for different socioeconomic vulnerability categories: (a) housing burden category, (b) no high school diploma category, (c) poverty category, (d) unemployment category, (e) uninsured category, (f) age < 17 proportion category, (g) age 65+ proportion category.

Figure 7. Simulation of the adjustment coefficient of each indicator.

Table 1. Features as model input.

Features	Source	Value	Description
Natural features
LST	Google Earth Engine	26–37 °C	Weighted average temperature
NDVI	Google Earth Engine	$3464 \times 10^{-4}$	Weighted average vegetation coverage
D2coast	Overpass-turbo	2773.7721 m	Distance from the location to the nearest coast
D2water	Geofabric	1975.8993 m	Distance from the location to the nearest stream
Built-environmental features
Poly_POI	OpenStreetMap	$12 \times 10^{-4}$/m$^{2}$	Number of polygon POIs in the census tract/area of census tract
Pnt_POI	OpenStreetMap	$5 \times 10^{-5}$/m$^{2}$	Number of point POIs in the census tract/area of census tract
Road network	OpenStreetMap	$116 \times 10^{-4}$ km/m$^{2}$	Road network length/area of census tract
Average height	GlobalMLBuildingFootprints	7 m	Average height of buildings
Coverage ratio	GlobalMLBuildingFootprints	$2665 \times 10^{-4}$	Building area/census tract area
Green Space	NYC	0.0158/km$^{2}$	Green space area/census tract area
Socioeconomic features
Socioeconomic status	SVI		Poverty rate, unemployment rate, education level, health insurance, age

Table 2. 10-fold cross-validation results.

Index	Unit	RF	XGBoost
Average MSE	[–]	0.3975	0.3623
Average $R^{2}$	[–]	0.8994	0.9083
MSE standard deviation	[–]	0.0925	0.0951
$R^{2}$ standard deviation	[–]	0.034	0.0308
Precision	[–]	0.9212	0.9126
Recall	[–]	0.9168	0.9148
False negative rate	[–]	0.0832	0.0852
False positive rate	[–]	0.0773	0.0873
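
The averages and standard deviations in Table 2 are aggregates over the ten held-out folds. The following is a minimal sketch of that aggregation, not the authors' pipeline: to stay dependency-free it swaps the RF and XGBoost regressors for a one-feature least-squares line, uses synthetic NDVI and LST values, and assumes the sample standard deviation across folds. All function names are hypothetical.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (stand-in for RF/XGBoost)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def mse_and_r2(y_true, y_pred):
    """Mean squared error and coefficient of determination."""
    mean_y = statistics.fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return ss_res / len(y_true), 1.0 - ss_res / ss_tot


def cross_validate(xs, ys, k=10, seed=0):
    """Return one (MSE, R^2) pair per fold from shuffled k-fold CV."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint test sets
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        a, b = fit_line([xs[i] for i in train_idx],
                        [ys[i] for i in train_idx])
        y_true = [ys[i] for i in test_idx]
        y_pred = [a + b * xs[i] for i in test_idx]
        scores.append(mse_and_r2(y_true, y_pred))
    return scores


# Synthetic tracts (assumption): LST falls linearly with NDVI, plus noise.
rng = random.Random(42)
ndvi = [rng.uniform(0.05, 0.75) for _ in range(500)]
lst = [36.0 - 12.0 * v + rng.gauss(0.0, 0.6) for v in ndvi]

scores = cross_validate(ndvi, lst, k=10)
mses = [m for m, _ in scores]
r2s = [r for _, r in scores]
summary = {
    "Average MSE": statistics.fmean(mses),
    "Average R2": statistics.fmean(r2s),
    "MSE standard deviation": statistics.stdev(mses),
    "R2 standard deviation": statistics.stdev(r2s),
}
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

The four printed rows correspond to the first four rows of Table 2. The precision, recall, and error-rate rows are not covered here because they require a classification threshold that this section does not state.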

Table 3. Feature importance values and ranks for the RF and XGBoost models.

Feature	Unit	RF_Gini_Importance	XGBOOST_Gain_Importance	RF_Rank	XGB_Rank
NDVI	[–]	0.898893	0.884325	1	1
D2Coast	[–]	0.023986	0.021322	2	2
D2Water	[–]	0.013948	0.015697	3	4
Road Network	[–]	0.013093	0.009234	4	9
Coverage Ratio	[–]	0.011259	0.013800	5	6
Average Height	[–]	0.010767	0.011123	6	8
Poly_POI	[–]	0.010646	0.012254	7	7
Pnt_POI	[–]	0.009727	0.014872	8	5
Green Space	[–]	0.007681	0.017374	9	3
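
The rank columns in Table 3 follow directly from sorting each importance column in descending order. As a rough check, the sketch below recomputes the ranks from the tabulated values and confirms that each column sums to approximately 1. The Spearman rank correlation at the end is an added summary of how far the two rankings agree; it is not a statistic reported by the authors, and the helper names are hypothetical.

```python
# Importance values copied from Table 3.
rf_gini = {
    "NDVI": 0.898893, "D2Coast": 0.023986, "D2Water": 0.013948,
    "Road Network": 0.013093, "Coverage Ratio": 0.011259,
    "Average Height": 0.010767, "Poly_POI": 0.010646,
    "Pnt_POI": 0.009727, "Green Space": 0.007681,
}
xgb_gain = {
    "NDVI": 0.884325, "D2Coast": 0.021322, "D2Water": 0.015697,
    "Road Network": 0.009234, "Coverage Ratio": 0.013800,
    "Average Height": 0.011123, "Poly_POI": 0.012254,
    "Pnt_POI": 0.014872, "Green Space": 0.017374,
}


def rank_descending(importance):
    """Map each feature to its rank (1 = most important)."""
    ordered = sorted(importance, key=importance.get, reverse=True)
    return {name: pos for pos, name in enumerate(ordered, start=1)}


def spearman_rho(rank_a, rank_b):
    """Spearman rank correlation for two tie-free rankings."""
    n = len(rank_a)
    d2 = sum((rank_a[name] - rank_b[name]) ** 2 for name in rank_a)
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


rf_rank = rank_descending(rf_gini)
xgb_rank = rank_descending(xgb_gain)
rho = spearman_rho(rf_rank, xgb_rank)

print("RF total importance: ", round(sum(rf_gini.values()), 4))
print("XGB total importance:", round(sum(xgb_gain.values()), 4))
for name in rf_gini:
    print(f"{name:15s} RF rank {rf_rank[name]}  XGB rank {xgb_rank[name]}")
print("Spearman rho:", round(rho, 3))
```

Both models place NDVI and distance to coast first and second. The rankings diverge among the remaining features, whose importance values all lie below 0.02.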

Table 4. Results of socioeconomic vulnerability and temperature correlation analysis.

Index	Correlation Coefficient [–]	p-Value [–]	Statistically Significant	Temperature Difference (°C)	Vulnerable Areas [Count]	Non-Vulnerable Areas [Count]
Poverty	−0.004101	0.860027	No	−0.746797	245	1565
Unemployed	−0.009183	0.692979	No	−0.056392	293	1518
Housing cost burden	−0.002334	0.920069	No	−0.543403	347	1460
Poor Education	−0.009421	0.685429	No	−0.504241	336	1475
No Health insurance	−0.009055	0.697050	No	0.202722	316	1495
Age 65+	−0.00669	0.773583	No	−0.33392	622	616
Age < 17	−0.00954	0.68158	No	0.033368	619	624
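
Each row of Table 4 pairs a correlation coefficient with a temperature difference between two groups of tracts. The sketch below illustrates both quantities on made-up data. It assumes a Pearson correlation and a difference computed as the vulnerable-group mean minus the non-vulnerable-group mean; the 0.25 threshold, the tract values, and the function names are hypothetical and not taken from the paper.

```python
import math
import statistics


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def group_temperature_gap(rates, temps, threshold):
    """Mean LST of vulnerable tracts minus mean LST of the rest.

    A tract counts as vulnerable when its indicator rate is at or
    above `threshold` (an assumed rule, for illustration only).
    """
    vulnerable = [t for r, t in zip(rates, temps) if r >= threshold]
    others = [t for r, t in zip(rates, temps) if r < threshold]
    gap = statistics.fmean(vulnerable) - statistics.fmean(others)
    return gap, len(vulnerable), len(others)


# Hypothetical tract records: poverty rate and land surface temperature.
poverty_rate = [0.05, 0.12, 0.31, 0.08, 0.27, 0.40, 0.10, 0.22]
lst_celsius = [33.9, 34.8, 34.1, 35.2, 33.5, 34.6, 34.0, 35.0]

r = pearson_r(poverty_rate, lst_celsius)
gap, n_vul, n_non = group_temperature_gap(
    poverty_rate, lst_celsius, threshold=0.25
)
print(f"correlation: {r:.4f}")
print(f"temperature difference: {gap:.4f} C "
      f"({n_vul} vulnerable, {n_non} non-vulnerable tracts)")
```

A negative difference under this convention means the vulnerable group is cooler on average, which is the sign most rows of Table 4 show.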

Table 5. Results of socioeconomic vulnerability and temperature multiple regression analysis.

Variable	Coefficient	Unit	Std_Error	CI_Lower	CI_Upper	t_Statistic	p_Value
Poverty	0.002515	[–]	0.264006	−0.71398	0.005482	0.009528	0.992399
Unemployed	0.151466	[–]	0.153904	−0.04037	0.54513	0.984155	0.325169
Housing cost burden	0.000163	[–]	0.044125	−0.09345	0.003678	0.003689	0.997057
Poor Education	−0.56935	[–]	0.15905	−0.76944	−0.17641	−3.57966	0.000353
No Health insurance	0.415089	[–]	0.098876	0.252456	0.634912	4.198074	2.82 × 10⁻⁵
Age 65+	0.006277	[–]	0.003935	−0.00165	0.013521	1.59506	0.11087
Age < 17	−0.00634	[–]	0.003908	−0.01358	0.001435	−1.62334	0.104688
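
The t statistics in Table 5 can be recovered as the coefficient divided by its standard error. The sketch below recomputes them from the tabulated values and derives two-sided p-values under a standard normal approximation. That approximation is an assumption made here to avoid external libraries; a Student t distribution with the model's residual degrees of freedom would give slightly different values.

```python
import math

# (coefficient, std. error, reported t, reported p) copied from Table 5.
table5 = {
    "Poverty": (0.002515, 0.264006, 0.009528, 0.992399),
    "Unemployed": (0.151466, 0.153904, 0.984155, 0.325169),
    "Housing cost burden": (0.000163, 0.044125, 0.003689, 0.997057),
    "Poor Education": (-0.56935, 0.15905, -3.57966, 0.000353),
    "No Health insurance": (0.415089, 0.098876, 4.198074, 2.82e-5),
    "Age 65+": (0.006277, 0.003935, 1.59506, 0.11087),
    "Age < 17": (-0.00634, 0.003908, -1.62334, 0.104688),
}


def t_statistic(coefficient, std_error):
    """Test statistic for the null hypothesis coefficient = 0."""
    return coefficient / std_error


def two_sided_p_normal(t):
    """Two-sided p-value under a standard normal approximation."""
    return math.erfc(abs(t) / math.sqrt(2.0))


computed = {}
for name, (coef, se, t_reported, p_reported) in table5.items():
    t = t_statistic(coef, se)
    p = two_sided_p_normal(t)
    computed[name] = (t, p)
    print(f"{name:20s} t = {t:9.5f} (reported {t_reported:9.5f})  "
          f"p = {p:.6f} (reported {p_reported:.6f})")
```

Small gaps between computed and reported values reflect rounding of the tabulated coefficients and the normal approximation.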

Table 6. Summary of optimal tuning strategies.

Feature	Best Coefficient [–]	Predicted Temperature (°C)	Reduction (°C)
Road network	0.05	34.59	0.04
Average height	2	34.58	0.05
Coverage ratio	0.3	34.54	0.08
D2coast	0.05	34.35	0.28
NDVI	0.75	28.26	6.37
D2water	0.05	34.47	0.16
Pnt_POI	0.05	34.61	0.01
Poly_POI	1.6	34.61	0.02
Green Space	0.05	34.62	0.01
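
In Table 6, the predicted temperature plus the reduction comes to about 34.62 to 34.63 °C in every row, which is consistent with each feature being adjusted on its own against a single shared baseline. A one-feature-at-a-time sweep of that kind can be sketched as follows. The sketch treats the coefficient as a multiplier on the feature, which is an assumption since this section does not define it, and `toy_lst_model` is a hypothetical stand-in for the trained regressor with invented weights.

```python
from statistics import fmean


def toy_lst_model(features):
    """Hypothetical stand-in predictor; the weights are invented."""
    return (36.0
            - 10.0 * features["ndvi"]
            + 0.0002 * features["d2coast"]
            + 1.5 * features["coverage_ratio"])


def sweep_feature(predict, tracts, feature, coefficients):
    """Scale one feature by each coefficient and keep the coolest result.

    Returns (best coefficient, mean predicted LST, reduction vs baseline).
    All other features are held at their observed values.
    """
    baseline = fmean(predict(t) for t in tracts)
    results = []
    for c in coefficients:
        adjusted = [{**t, feature: t[feature] * c} for t in tracts]
        results.append((c, fmean(predict(t) for t in adjusted)))
    best_c, best_temp = min(results, key=lambda item: item[1])
    return best_c, best_temp, baseline - best_temp


# Hypothetical census tracts.
tracts = [
    {"ndvi": 0.20, "d2coast": 3000.0, "coverage_ratio": 0.30},
    {"ndvi": 0.35, "d2coast": 1500.0, "coverage_ratio": 0.25},
    {"ndvi": 0.50, "d2coast": 4000.0, "coverage_ratio": 0.20},
]
grid = [round(0.05 * i, 2) for i in range(1, 41)]  # 0.05, 0.10, ..., 2.00

outcome = {
    name: sweep_feature(toy_lst_model, tracts, name, grid)
    for name in ("ndvi", "d2coast", "coverage_ratio")
}
for name, (best_c, temp, reduction) in outcome.items():
    print(f"{name:15s} best coefficient {best_c:4.2f}  "
          f"predicted {temp:6.2f} C  reduction {reduction:5.2f} C")
```

With a tree-based model the response is not linear, so the best coefficient need not sit at either end of the grid as it does for this toy predictor.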

MDPI and ACS Style

Liao, S.; Liu, Z. Explaining and Reducing Urban Heat Islands Through Machine Learning: Evidence from New York City. Buildings 2026, 16, 186. https://doi.org/10.3390/buildings16010186