Comparison of Population-Weighted Exposure Estimates of Air Pollutants Based on Multiple Geostatistical Models in Beijing, China

Various geostatistical models have been used in epidemiological research to evaluate ambient air pollutant exposures at a fine spatial scale. Few studies have investigated the performance of different exposure models on population-weighted exposure estimates and the resulting potential misclassification across various modeling approaches. This study developed spatial models for NO2 and PM2.5 and conducted exposure assessment in Beijing, China. It explored three spatial modeling approaches: variable dimension reduction, machine learning, and conventional linear regression. It compared their model performance by cross-validation (CV) and population-weighted exposure estimates. Specifically, partial least square (PLS) regression, random forests (RF), and supervised linear regression (SLR) models were developed based on an ordinary kriging (OK) framework for NO2 and PM2.5 in Beijing, China. The mean squared error-based R2 (R2mse) and root mean squared error (RMSE) in leave-one site-out cross-validation (LOOCV) were used to evaluate model performance. These models were used to predict the ambient exposure levels in the urban area and to estimate the misclassification of population-weighted exposure estimates in quartiles between them. The results showed that the PLS-OK models for NO2 and PM2.5, with the LOOCV R2mse of 0.82 and 0.81, respectively, outperformed the other models. The population-weighted exposure to NO2 estimated by the PLS-OK and RF-OK models exhibited the lowest misclassification in quartiles. For PM2.5, the estimates of potential misclassification were comparable across the three models. It indicated that the exposure misclassification made by choosing different modeling approaches should be carefully considered, and the resulting bias needs to be evaluated in epidemiological studies.


Introduction
Long-term exposure to air pollutants has been proven to be associated with adverse health outcomes [1][2][3].Current research has focused on a medium-and long-term period exposure assessment at an individual level [4], which needs accurate exposure assessment at a fine spatial scale to eliminate exposure errors [5,6].It is a big challenge because of the sparse monitoring stations and missing coverage of specific predictors, such as satellite-based models [7].
Geostatistical models developed by the land-use regression (LUR) approach have been widely used to assess medium-and long-term exposure to air pollution in epidemiological studies [8,9].The geostatistical models were created based on the inputs of observational and geographic data sets, in which the former was involved as outcome variables and the latter were predictor variables.The choice of model development approaches, the method for dealing with predictor variables, and the model structures can all affect model performances [10][11][12][13].The golden standard method for model evaluation includes out-ofsample validation and hold-out cross-validation (CV) [10,14].However, some studies laced sufficient observational data for out-of-sample model validation or hold-out CV.Also, some results for evaluating some two-step models were not easy to interpret, e.g., the predictor selection and dimension reduction methods [15,16].In brief, evaluating the performance of exposure models developed based on limited monitoring sites was challenging, but they are still helpful for environmental health studies.In previous studies on modeling approach comparison [10,17], the model performance was evaluated by cross-validation or out-of-sample validation.Nevertheless, to further compare these models' performances, they need to be evaluated by exposure assessment in the real world, and potential exposure bias caused by different modeling algorithms needs to be assessed.
In this study, we used three modeling approaches, variable dimensionality reduction, machine learning, and variable screening, to build geostatistical models for NO 2 and PM 2. 5 and to compare their model performance by estimating the misclassification of population-weighted exposure estimates in quartiles.

Study Area and Observations at Monitoring Sites
Beijing, China's capital city, is a mega-city with a population of 21.5 million during the research period (2015-2020) [18].Ambient concentrations of PM 2.5 have decreased since stringent national and local environmental regulations were implemented in 2014 [19].However, the ambient exposure level of such air pollutants is still higher than the World Health Organization (WHO) Global Air Quality Guidelines, 10 µg/m 3 for NO 2 and 5 µg/m 3 for PM 2.5 [20].This study obtained NO 2 and PM 2.5 observational data from the Beijing air quality monitoring network (Beijing Municipal Environmental Monitoring Center).The NO 2 was measured using the ultraviolet fluorescence method, and the PM 2.5 was measured using the micro oscillating balance method at these monitoring sites [21].A total of 35 monitoring sites were divided into 4 types, including 1 background site, 7 traffic sites, 14 urban sites, and 13 suburban sites, as shown in Figure 1.The background site is located near a reservoir in the Miyun district.The lowest NO 2 and PM 2.5 concentrations were observed at this background site.The traffic sites are close to the (A) type roads (highways and arterial roads) with a distance of less than 250 m.The other 27 monitoring sites were divided into urban and suburban sites, in which the urban sites were located inside and around the 6th-ring road, and the suburban sites were the rest of them.
Toxics 2024, 12, x FOR PEER REVIEW 2 of 15 and the latter were predictor variables.The choice of model development approaches, the method for dealing with predictor variables, and the model structures can all affect model performances [10][11][12][13].The golden standard method for model evaluation includes out-ofsample validation and hold-out cross-validation (CV) [10,14].However, some studies laced sufficient observational data for out-of-sample model validation or hold-out CV.Also, some results for evaluating some two-step models were not easy to interpret, e.g., the predictor selection and dimension reduction methods [15,16].In brief, evaluating the performance of exposure models developed based on limited monitoring sites was challenging, but they are still helpful for environmental health studies.In previous studies on modeling approach comparison [10,17], the model performance was evaluated by crossvalidation or out-of-sample validation.Nevertheless, to further compare these models' performances, they need to be evaluated by exposure assessment in the real world, and potential exposure bias caused by different modeling algorithms needs to be assessed.
In this study, we used three modeling approaches, variable dimensionality reduction, machine learning, and variable screening, to build geostatistical models for NO2 and PM2.5 and to compare their model performance by estimating the misclassification of population-weighted exposure estimates in quartiles.

Study Area and Observations at Monitoring Sites
Beijing, China's capital city, is a mega-city with a population of 21.5 million during the research period (2015-2020) [18].Ambient concentrations of PM2.5 have decreased since stringent national and local environmental regulations were implemented in 2014 [19].However, the ambient exposure level of such air pollutants is still higher than the World Health Organization (WHO) Global Air Quality Guidelines, 10 µg/m 3 for NO2 and 5 µg/m 3 for PM2.5 [20].This study obtained NO2 and PM2.5 observational data from the Beijing air quality monitoring network (Beijing Municipal Environmental Monitoring Center).The NO2 was measured using the ultraviolet fluorescence method, and the PM2.5 was measured using the micro oscillating balance method at these monitoring sites [21].A total of 35 monitoring sites were divided into 4 types, including 1 background site, 7 traffic sites, 14 urban sites, and 13 suburban sites, as shown in Figure 1.The background site is located near a reservoir in the Miyun district.The lowest NO2 and PM2.5 concentrations were observed at this background site.The traffic sites are close to the (A) type roads (highways and arterial roads) with a distance of less than 250 m.The other 27 monitoring sites were divided into urban and suburban sites, in which the urban sites were located inside and around the 6th-ring road, and the suburban sites were the rest of them.Annual average concentrations were obtained from the raw hourly concentrations.First, the daily averages were calculated by hourly concentration, following a criterion that 50% of the data (12 h) were available at each monitoring site.Second, these daily averages were used for calculating weekly averages, following the same criterion that 3-day data were available.Third, these weekly averages were applied for calculating annual averages, following another criterion of 25%.It allows the yearly averages to be calculated based on daily, weekly, and seasonal bases and be temporally representative.A similar temporal average strategy was used in Araki et al.'s NO 2 exposure modeling study [22].Figure 2 depicts the calculated weekly average concentrations of NO 2 and PM 2.5 , where blank spaces represent missing concentrations.The above criteria screened some of the missing data, and the rest were due to the raw data loss.Table 1 shows the statistical summary of annual average NO 2 and PM 2.5 concentrations.In 2016 and 2017, the available yearly averages were less than 35 because of the data screening criteria.After 2018, one of the 35 monitoring sites, the Beijing Botanical Garden site, was excluded from the monitoring network, as shown in Figure 1.
Toxics 2024, 12, x FOR PEER REVIEW 3 of 15 Annual average concentrations were obtained from the raw hourly concentrations.First, the daily averages were calculated by hourly concentration, following a criterion that 50% of the data (12 h) were available at each monitoring site.Second, these daily averages were used for calculating weekly averages, following the same criterion that 3-day data were available.Third, these weekly averages were applied for calculating annual averages, following another criterion of 25%.It allows the yearly averages to be calculated based on daily, weekly, and seasonal bases and be temporally representative.A similar temporal average strategy was used in Araki et al.'s NO2 exposure modeling study [22].Figure 2 depicts the calculated weekly average concentrations of NO2 and PM2.5, where blank spaces represent missing concentrations.The above criteria screened some of the missing data, and the rest were due to the raw data loss.Table 1 shows the statistical summary of annual average NO2 and PM2.5 concentrations.In 2016 and 2017, the available yearly averages were less than 35 because of the data screening criteria.After 2018, one of the 35 monitoring sites, the Beijing Botanical Garden site, was excluded from the monitoring network, as shown in Figure 1.

Geographic Variables
In this study, we collected a wide array of geographic variables, including population density, traffic network, features (e.g., airport, rail yard, railways, etc.), points of interest

Geographic Variables
In this study, we collected a wide array of geographic variables, including population density, traffic network, features (e.g., airport, rail yard, railways, etc.), points of interest (POI), land-use types, Normalized Difference Vegetation Index (NDVI), topography and coordinate variables.Some of these geographic variables were related to the emission sources, such as road network, features, and POI, while others were derived from the natural characteristics in the study area, e.g., population density and elevation.The details of these geographic variables are described in Table S1 in the Supplementary Information (SI).

Algorithms for Model Development
In this study, three types of modeling approaches were applied for model development.First, we chose the partial least squares (PLS) to represent the dimension reduction modeling approach.Second, random forest (RF), an advanced and relatively simple modeling approach with relatively fewer steps, was selected as a machine learning algorithm.Third, the supervised linear regression (SLR) algorithm was chosen as a traditional modeling approach.The PLS and SLR algorithms are based on a linear regression framework [10,16].The RF algorithm is a complex algorithm dealing with non-linear relationships between response and predictor variables and also between predictor variables [10].The PLS and SLR models have clear and interpretable frameworks, while the RF model is less interpretable because of its hidden black box [23].
Specifically, the PLS regression algorithm reduces predictor variables to a smaller set of uncorrelated components and performs least squares regression on these components.The PLS decomposes the large geographic variable matrix into a sequence of orthogonal PLS scores computed to maximize the covariance between concentrations and their prediction.According to the experience in our previous study in Beijing [11,24], the number of PLS scores was set to be 3.
As a machine learning algorithm, RF is an ensemble learning method that combines multiple decision trees to improve the accuracy and robustness of the developed models [23].In RF model development, potential predictor variables are forced to be partitioned into subsets, which include separate decision trees for training.The output is the average of the decision tree simulation results.The RF model provides an importance evaluation index of the variables (IncMSE) that can be used to determine the influence of predictor variables on the response variables [25].In this study, the setting of the RF models was based on our previous experience [11].The coefficients of the random sampling times (mtry), number of decision trees in a random forest (ntree), and a minimum number of decision tree nodes (node size) were set to be 50, 500, and 5, respectively.
The SLR model is a traditional and widely used stepwise linear regression method developed using selected geographic variables as predictors [26,27].First, univariate linear regression is conducted to find a starting point for an SLR model, as the highest R 2 is obtained.Second, the additional predictor is added in each round to obtain the most significant increase in R 2 until the rise of the R 2 is less than 0.1.The selected variable in each round is available when its direction is plausible with the outcome pollutant.Third, the variance inflation factor (VIF) is applied to prevent multicollinearity.The selected variables with a value of VIF more than 3 were removed [28].
All the models were developed by R software (R 4.2.0, https://www.r-project.org/,accessed on 1 April 2022), using the R packages of "pls", "randomForest" and so on.

Model Structure and Validation
For LUR models, their residuals are always spatially correlated [12].Thus, we applied a two-step model structure, a LUR model with ordinary kriging (OK), to develop a LUR-OK model.First, a LUR model was developed; second, the residuals of the LUR model were further explained by OK.The same or similar model structures were widely used in previous studies [29][30][31].
The developed models were evaluated by using leave-one-site-out cross-validation (LOOCV).The observed data were split into groups equal to the number of monitoring sites, in which each group included the observations from one monitoring site.One data group was used for testing (testing group), and the other remaining data groups (training group) were used to fit the model.Then, the fitted model was used to predict the testing group and repeated until predictions for all groups were generated.We used mean square error based-R-Squared (R 2 mse ), regression-based R 2 (R 2 reg ), and root-mean-square error (RMSE) to assess the accuracy and prediction ability of the model, which was computed on observations (y i ) and predictions ( ŷi ) according to the equations below: In both equations, n is the number of observations, and y is the mean of observations.R 2 mse is a measure of fit to the 1:1 line and is typically lower than R 2 reg, which is a measure of fit to the regression line.The model performance was mainly evaluated by R 2 mse and RMSE, and the R 2 reg was also reported for comparisons with other studies.

PLS Models
The geographic variables were downscaled using the PLS approach.The first three PLS scores were obtained as inputs for the PLS models.The first PLS score explains most variations across the geographic data set.Thus, we evaluated correlation coefficients between the first PLS score and geographic variables to represent the influence of the geographic variables on PLS models.Figure 3 depicts these correlation coefficients for the annual NO 2 and PM 2.5 models.For NO 2 models, the correlation coefficients between the first PLS score and the geographic variables were relatively stable across the annual models.The variables of all the annual NO 2 models that were highly correlated with the first PLS score were population density, POI variables (the count of temples, restaurants, gas stations and bus stops), NDVI variables (NDVI in summer and the 75th percentile of NDVI), land-use type variables (shrubland, impervious, grassland, and forest), and the proximity variables (the distance to the type (C) road, railway, the intersection between type (A) and type (B) roads, and the intersection between type (A) roads.S1 in SI.

RF Models
According to the IncMSE of the variables given by the RF models, the top ten variables were summarized in Figure 4.The traffic-related variables significantly influence the NO2 models.The variables of distance to the railyard, the count of gas stations, the distance to the type (A) road, and the sum of the type (A) road length had relatively high IncMSE values in all NO2 models.For PM2.5, the most important variables were longitude, shrubland, and NDVI.Similar variable sensitivities were observed in the PLS models for PM2.5, in which the longitude variable was also highlighted.Compared with the annual NO 2 models, the correlation coefficients between the first PLS score and the geographic variables varied among the annual PM 2.5 models, especially for the annual PM 2.5 models after 2017.The variables of all the annual PM 2.5 models with an absolute value of these correlation coefficients greater than 0.5 included the NDVI variables (NDVI in summer and the 75th, 50th, and 25th percentiles of NDVI), land-use type of forest, longitude (Lambert y), elevation, and distance to the railyard.The contrast of the correlation coefficients between these models presented different emission sources of NO 2 and PM 2.5 .As a traffic-related primary air pollutant, the spatial distribution of NO 2 was correlated with the geographic variables of road networks.In contrast, PM 2.5 was correlated with the longitude variable because of its partially secondary species formation that originated from long-distance transportation [32].

RF Models
According to the IncMSE of the variables given by the RF models, the top ten variables were summarized in Figure 4.The traffic-related variables significantly influence the NO 2 models.The variables of distance to the railyard, the count of gas stations, the distance to the type (A) road, and the sum of the type (A) road length had relatively high IncMSE values in all NO 2 models.For PM 2.5 , the most important variables were longitude, shrubland, and NDVI.Similar variable sensitivities were observed in the PLS models for PM 2.5 , in which the longitude variable was also highlighted.

RF Models
According to the IncMSE of the variables given by the RF models, the top ten variables were summarized in Figure 4.The traffic-related variables significantly influence the NO2 models.The variables of distance to the railyard, the count of gas stations, the distance to the type (A) road, and the sum of the type (A) road length had relatively high IncMSE values in all NO2 models.For PM2.5, the most important variables were longitude, shrubland, and NDVI.Similar variable sensitivities were observed in the PLS models for PM2.5, in which the longitude variable was also highlighted.

SLR Models
Figure 5 shows the selected geographic variables by the SLR model and their regression coefficients.For NO 2 models, the traffic-related variables were involved in model development.It is consistent with the RF models.Specifically, the coefficients of the variables of the distance to the airport, railway, and railyard had relatively high values.In addition, the NDVI and land-use type of water variables were also selected.Regarding PM 2.5 , the annual PM 2.5 models selected fewer variables than the NO 2 models.The variables of the distance to the airport and the distance to the type (B) road influenced the annual PM 2.5 models a lot. Figure 5 shows the selected geographic variables by the SLR model and their regression coefficients.For NO2 models, the traffic-related variables were involved in model development.It is consistent with the RF models.Specifically, the coefficients of the variables of the distance to the airport, railway, and railyard had relatively high values.In addition, the NDVI and land-use type of water variables were also selected.Regarding PM2.5, the annual PM2.5 models selected fewer variables than the NO2 models.The variables of the distance to the airport and the distance to the type (B) road influenced the annual PM2.5 models a lot.In order to compare model performance between the geostatistical models developed in the whole Beijing area and inside the 6th-ring road area that covered the total urban area.The LUR-urban (LURU) models were developed based on the monitoring sites in-

Models Developed in the Urban Area
In order to compare model performance between the geostatistical models developed in the whole Beijing area and inside the 6th-ring road area that covered the total urban area.The LUR-urban (LURU) models were developed based on the monitoring sites inside and around the 6th-ring road.The coefficients of these LURU models are shown in Figures S1-S3.
For the urban model developed by the PLS approach (PLSU), some variables' correlation coefficients had significant changes, such as the variables of the count of industries (POI), water land-use type, latitude, and distance to airports for the NO 2 models and the variables of the count of industry (POI), NDVI in winter, latitude, and distance to the airports for PM 2.5 models.The predictors in the PLS and PLSU models were different.It indicated that the PLS approach could be affected when the range of the observational data was expanded as the suburban areas were included.Different model coefficients were also found when comparing RF and RFU models, SLR and SLRU models.

Model Performance
The LOOCV results of NO 2 and PM 2.5 models are shown in Tables 2 and 3, respectively.The performances of the LUR and LUR-OK models were summarized in each table.Generally, the annual PLS models performed better than others, with a higher range of R 2 mse for NO 2 (0.72~0.82) and PM 2.5 (0.80~0.89), which were higher than the previous studies in Beijing [33,34].Regarding the LTM of NO 2 , the PLS models had comparable good performance with the SLR models, which performed better than the RF models.Compared with the NO 2 models without an OK in the model structure, the LUR-OK models improved the model performance for PLS and SLR LTM models with a slightly higher R 2 mse .For annual NO 2 models, some performed worse when adding OK to the model structure.However, adding OK into the model structure worked well for RF models of PM 2.5 .For annual and LTM PM 2.5 , the RF-OK models improved their performance with an increase of 0.06~0.21 in R 2 mse compared with the RF models.In comparison, these increases in R 2 mse of PLS and SLR models because of adding OK were −0.17~0.01 and −0.09~0.08,respectively.For the LTM of PM 2.5 , the PLS-OK model performed best among the PLS, RF, and SLR models with or without OK.A spatiotemporal covariate model was built in a previous study with an R 2 mse of 0.93, and its RMSE was 1.72 µg/m 3 [35].Improved performance by adding OK in the model structure was also found in a previous study [36].The good performance of the PLS approach on spatial modeling was concluded in the previous study [11,30].In terms of the comparisons between the traditional SLR model and machine learning models, the SLR model underperformed in this study, which is different from the previous comparisons [10].The controversy over machine learning approaches has existed for a long time.It may depend on the target air pollutants and the collecting method of monitoring data [13].
Different model performances were found for PLSU, RFU, and SLRU models developed in urban areas with less observational data involved in model development, as shown in Tables S2 and S3 in Supplementary Information.Compared with the models developed in the whole city area, the PLSU models had better performance for both NO 2 (LOOCV R 2 mse : 0.83~0.96)and PM 2.5 (LOOCV R 2 mse : 0.74~0.82),while RFU and SLRU models had opposite performances.The increased performance of these PLSU models might be due to overfitting.The RFU and SLRU models were sensitive to the number of observational monitoring sites.We also found worse model performances from 2018 to 2020 than those from 2015 to 2017, when observational data were missing from 2018 at the monitoring site in the western mountain area.It indicated that the monitoring sites with various observational levels were necessary for model development in the urban area [28].
Figure 6 shows the scatter plots of the long-term mean observations and LOOCV predictions of the PLS, RF, and SLR mdels for NO 2 and PM 2.5 .For NO 2 , the PLS (LOOCV R 2 mse : 0.78) and SLR (LOOCV R 2 mse : 0.76) models performed better than the RF models (LOOCV R 2 mse : 0.57).The RF model was overestimated at the background site.Similar Toxics 2024, 12, 197 8 of 14 results were found for PM 2.5 models, in which the LOOCV R 2 mse of PLS, RF, and SLR models were 0.80, 0.51, and 0.54, respectively.The PM 2.5 RF model performed worse at the background site than those at other types of monitoring sites.

Prediction in the Urban Area
A square covering the 6th ring road in Beijing was picked to show the predictions around urban areas.It was divided into 3301 grids at a 1 km spatial scale.These grids are categorized according to the quartile distributions of model predictions, as shown in Figure 7.For NO 2 , the grids with high predictions were clustered in the central urban area.In addition, the hotspots of the PLS and SLR model predictions were highlighted across the road network, while the grids with high RF predictions were aggregated (Figure 7).Regarding PM 2.5 , noticeable spatial differences were found across the three model predictions.The hotspots of the PLS model predictions were sparsely distributed, while the RF predictions were highlighted in the southern part of Beijing.In comparison, Toxics 2024, 12, 197 9 of 14 the SLR model predictions were shown with clear spatial clusters.The contrast among the three PM 2.5 model predictions arises from the difference in the selection of geographical variables as primary predictors and the variations in the significance of predictors during model development.Figures S4 and S5 in Supplementary Information show the correlation coefficient between the three models.For LTM, the correlation coefficients of NO 2 models among the three approaches were 0.72~0.83,and it was 0.81~0.85 in PM 2.5 models.The correlation coefficient between PLS and SLR was the lowest regardless of whether in either NO 2 models or PM 2.5 models.The lower correlation coefficients of SLR with the other two models may be related to the fact that the variables were screened in SLR while the other two modeling methods were not.
Since the LUR-OK model performed better than the LUR models in LTM predictions, we focused on using the LUR-OK model to make predictions for population-weighted exposure estimates.Figure 8 depicts the box plots of the spatial predictions at a 1 km spatial scale (shown as grids in Figure 7) for NO 2 and PM 2.5 in the urban area.Regarding the annual predictions, a noticeable decline was found for PM 2.5 year by year.It is expected that the air quality level in Beijing has become better in recent years [37].Meanwhile, regarding NO 2 , the decline in annual mean concentrations was almost flat from 2015 to 2017, especially for PLS-OK and RF-OK predictions.Comparing the predictions among different models, the PLS-OK and RF-OK models had comparable median predictions.In contrast, SLR had relatively low median predictions, especially for annual mean NO 2 prediction in 2017 (34.98 µg/m 3 ) and NO 2 and PM 2.5 LTM predictions (NO 2 : 37.58 µg/m 3 ; PM 2.5 : 71.43 µg/m 3 ).The NO 2 predictions obtained a more considerable divergence across the three models with the coefficient of variation (COV) of 11.85~19.41for three NO 2 models, compared with the PM 2.5 predictions among the three models with the COV of 7.57 ~11.88.It indicated that the NO 2 models were more sensitive to the predictors derived from the geographic variables than the PM 2.5 models.The increased performance of these PLSU models might be due to overfitting.The RFU and SLRU models were sensitive to the number of observational monitoring sites.We also found worse model performances from 2018 to 2020 than those from 2015 to 2017, when observational data were missing from 2018 at the monitoring site in the western mountain area.It indicated that the monitoring sites with various observational levels were necessary for model development in the urban area [28].
Figure 6 shows the scatter plots of the long-term mean observations and LOOCV predictions of the PLS, RF, and SLR mdels for NO2 and PM2.5.For NO2, the PLS (LOOCV R 2 mse: 0.78) and SLR (LOOCV R 2 mse: 0.76) models performed better than the RF models (LOOCV R 2 mse: 0.57).The RF model was overestimated at the background site.Similar results were found for PM2.5 models, in which the LOOCV R 2 mse of PLS, RF, and SLR models were 0.80, 0.51, and 0.54, respectively.The PM2.5 RF model performed worse at the background site than those at other types of monitoring sites.

Prediction in the Urban Area
A square covering the 6th ring road in Beijing was picked to show the predictions around urban areas.It was divided into 3301 grids at a 1 km spatial scale.These grids are categorized according to the quartile distributions of model predictions, as shown in Figure 7.For NO2, the grids with high predictions were clustered in the central urban area.In addition, the hotspots of the PLS and SLR model predictions were highlighted across the road network, while the grids with high RF predictions were aggregated (Figure 7).Regarding PM2.5, noticeable spatial differences were found across the three model predictions.The hotspots of the PLS model predictions were sparsely distributed, while the RF predictions were highlighted in the southern part of Beijing.In comparison, the SLR model development.Figures S4 and S5 in SI show the correlation coefficient between the three models.For LTM, the correlation coefficients of NO2 models among the three approaches were 0.72~0.83,and it was 0.81~0.85 in PM2.5 models.The correlation coefficient between PLS and SLR was the lowest regardless of whether in either NO2 models or PM2.5 models.The lower correlation coefficients of SLR with the other two models may be related to the fact that the variables were screened in SLR while the other two modeling methods were not.Since the LUR-OK model performed better than the LUR models in LTM predictions, we focused on using the LUR-OK model to make predictions for population-weighted exposure estimates.Figure 8 depicts the box plots of the spatial predictions at a 1 km spatial scale (shown as grids in Figure 7) for NO2 and PM2.5 in the urban area.Regarding the annual predictions, a noticeable decline was found for PM2.5 year by year.It is expected that the air quality level in Beijing has become better in recent years [37].Meanwhile, regarding NO2, the decline in annual mean concentrations was almost flat from 2015 to 2017, especially for PLS-OK and RF-OK predictions.Comparing the predictions among different models, the PLS-OK and RF-OK models had comparable median predictions.In contrast, SLR had relatively low median predictions, especially for annual mean NO2 prediction in 2017 (34.98 µg/m 3 ) and NO2 and PM2.5 LTM predictions (NO2: 37.58 µg/m 3 ; PM2.5: 71.43 µg/m 3 ).The NO2 predictions obtained a more considerable divergence across the three models with the coefficient of variation (COV) of 11.85~19.41for three NO2 models, compared with the PM2.5 predictions among the three models with the COV of 7.57 ~11.88.It indicated that the NO2 models were more sensitive to the predictors derived from the geographic variables than the PM2.5 models.

Population-Weighted Exposure Estimates
The population-weighted exposure estimates (PEE) in urban areas from 2015 to 2020 were predicted using the PLS-OK, RF-OK, and SLR-OK models.The long-term mean values of PEE in the urban area from 2015 to 2020 were calculated based on the model predictions and population density across the 1 km grids [38].The total population in the prediction area was 15.94 million, accounting for about 76% of the entire population in Beijing, ranging from 211 to 32,097 persons/km 2 across the grids in the urban area.The misclassification of PEE in quartiles between the three models was estimated personally.
Figure 9 depicts the misclassification of PEE in quartiles between the PLS-OK, RF-OK, and SLR-OK models.Tables S4 and S5 in SI summarizes the percentage of misclassification in quartiles for NO2 and PM2.5.For NO2, the total misclassification of particular PEE between the PLS-OK and RF-OK models was 37.87%, 49.27% between PLS-OK and RF-OK models, and 50.49% between RF-OK and SLR-OK models.The contrast between PLS-OK and RF-OK models was smaller than the other two pairs.Compared with the PEE predicted by the PLS-OK model, 19.16% of them indicated by the RF-OK model were overestimated, and 18.72% were underestimated.Overestimation and underestimation between the PEE predicted using the PLS-OK and SLR-OK models were 23.25% and 26.03%, respectively.Most of the misclassifications happened across the adjacent quar-

Population-Weighted Exposure Estimates
The population-weighted exposure estimates (PEE) in urban areas from 2015 to 2020 were predicted using the PLS-OK, RF-OK, and SLR-OK models.The long-term mean values of PEE in the urban area from 2015 to 2020 were calculated based on the model predictions and population density across the 1 km grids [38].The total population in the prediction area was 15.94 million, accounting for about 76% of the entire population in Beijing, ranging from 211 to 32,097 persons/km 2 across the grids in the urban area.The misclassification of PEE in quartiles between the three models was estimated personally.
Figure 9 depicts the misclassification of PEE in quartiles between the PLS-OK, RF-OK, and SLR-OK models.Tables S4 and S5 in Supplementary Information summarizes the percentage of misclassification in quartiles for NO 2 and PM 2.5 .For NO 2 , the total misclassification of particular PEE between the PLS-OK and RF-OK models was 37.87%, 49.27% between PLS-OK and RF-OK models, and 50.49% between RF-OK and SLR-OK models.The contrast between PLS-OK and RF-OK models was smaller than the other two pairs.Compared with the PEE predicted by the PLS-OK model, 19.16% of them indicated by the RF-OK model were overestimated, and 18.72% were underestimated.Overestimation and underestimation between the PEE predicted using the PLS-OK and SLR-OK models were 23.25% and 26.03%, respectively.Most of the misclassifications happened across the adjacent quartiles.For comparison between the PLS-OK and RF-OK models, only 4.74% of the PEE was misclassified into the upper or lower quartiles that were not adjacent.This number was about 10% for other pairs of comparisons.It indicated that the PEE of NO 2 predicted by PLS-OK and RF-OK models obtained similar results, which resulted in the least misclassifications of the PEE in quartiles across the three models.Regarding the PEE of PM 2.5 , the misclassification in quartiles was comparable between the three models.

Strengths and Limitations
This study compared three geostatistical model performances in an air pollutant exposure assessment study in a metropolitan city in China.The findings have a broad implication for environmental studies.First, the three approaches chosen for model development represent a variety of advanced and traditional, complicated, and simple geostatistical modeling methods.The usage of each modeling approach is distinctive.The SLR models outperformed the RF models.However, compared with the PLS model predictions, the RF model predictions had fewer misclassifications than the SLR predictions.Second, this study focused on spatial model comparisons, which may provide a reference for spatiotemporal modeling studies.For primary pollutants, like NO2, the spatial distribution of air pollutants highly relies on local features related to emission sources.A comprehensive array of geographic variables could influence the model performance at a fine spatial scale.The PLS approach has such potential for exposure assessment on directly emitted air pollutants [39].Third, the interpretability of machine learning models is challenging.Although its usage is potentially limited because of its black box instinct, it possesses the capability for utilization in an urban area abundant with spatially dense observational data on the basis of RF model performance in this study.
There were also some limitations in this study.First, there is a lack of additional data for model development and out-of-sample validation.In this study, the observational data used for model development was derived from the national and regional monitoring network.Increasing the abundance of spatial data would enhance the accuracy and stability of the model [40].Second, the spatial models might be overestimated as less spatially rich observational data were involved in model development.To make use of the limited spa-

Strengths and Limitations
This study compared three geostatistical model performances in an air pollutant exposure assessment study in a metropolitan city in China.The findings have a broad implication for environmental studies.First, the three approaches chosen for model development represent a variety of advanced and traditional, complicated, and simple geostatistical modeling methods.The usage of each modeling approach is distinctive.The SLR models outperformed the RF models.However, compared with the PLS model predictions, the RF model predictions had fewer misclassifications than the SLR predictions.Second, this study focused on spatial model comparisons, which may provide a reference for spatiotemporal modeling studies.For primary pollutants, like NO 2 , the spatial distribution of air pollutants highly relies on local features related to emission sources.A comprehensive array of geographic variables could influence the model performance at a fine spatial scale.The PLS approach has such potential for exposure assessment on directly emitted air pollutants [39].Third, the interpretability of machine learning models is challenging.Although its usage is potentially limited because of its black box instinct, it possesses the capability for utilization in an urban area abundant with spatially dense observational data on the basis of RF model performance in this study.
There were also some limitations in this study.First, there is a lack of additional data for model development and out-of-sample validation.In this study, the observational data used for model development was derived from the national and regional monitoring

Figure 1 .
Figure 1.Map of the 35 monitoring sites in Beijing.Figure 1. Map of the 35 monitoring sites in Beijing.

Figure 1 .
Figure 1.Map of the 35 monitoring sites in Beijing.Figure 1. Map of the 35 monitoring sites in Beijing.

Figure 2 .
Figure 2. Weekly average concentrations of NO2 and PM2.5 at the 35 monitoring sites in Beijing from 2015 to 2020.

Figure 2 .
Figure 2. Weekly average concentrations of NO 2 and PM 2.5 at the 35 monitoring sites in Beijing from 2015 to 2020.

15 Figure 3 .
Figure 3. Correlation coefficients between the first PLS score and the geographic variables.Each box denotes the mean value of the correlation coefficient with corresponding geographic variables across the buffers, and the upper and lower bars represent the standard deviation across the buffers.The abbreviations of each geographic variable are shown in Table S1 in SI.

Figure 3 .
Figure 3. Correlation coefficients between the first PLS score and the geographic variables.Each box denotes the mean value of the correlation coefficient with corresponding geographic variables across the buffers, and the upper and lower bars represent the standard deviation across the buffers.The abbreviations of each geographic variable are shown in TableS1in Supplementary Information.

Figure 3 .
Figure 3. Correlation coefficients between the first PLS score and the geographic variables.Each box denotes the mean value of the correlation coefficient with corresponding geographic variables across the buffers, and the upper and lower bars represent the standard deviation across the buffers.The abbreviations of each geographic variable are shown in Table S1 in SI.

Figure 4 .
Figure 4.The geographic variables with the top ten IncMSE values in the RF model results.3.1.3.SLR Models

Figure 4 .
Figure 4.The geographic variables with the top ten IncMSE values in the RF model results.

Figure 5 .
Figure 5.The coefficients of the variables selected by the SLR model.3.1.4.Models Developed in the Urban Area

Figure 5 .
Figure 5.The coefficients of the variables selected by the SLR model.

Figure 6 .
Figure 6.The average observations and LOOCV predictions of the three models for NO2 and PM2.5 from 2015 to 2020.Dashed lines denote the 1:1 line, and solid lines represent linear regression lines.

Figure 6 .
Figure 6.The average observations and LOOCV predictions of the three models for NO 2 and PM 2.5 from 2015 to 2020.Dashed lines denote the 1:1 line, and solid lines represent linear regression lines.

Figure 7 . 15 Figure 8 .
Figure 7. Maps of the long-term mean NO 2 and PM 2.5 predictions from 2015 to 2020.Toxics 2024, 12, x FOR PEER REVIEW 11 of 15

Figure 8 .
Figure 8. Distributions of annual averages and long-term means (LTM) from 2015 to 2020 of NO 2 and PM 2.5 predicted by three models.Each box's upper, middle, and lower lines denote 75%, 50%, and 25% of the concentration.The dots represent the predictions that are higher or lower than 1.5 times the IQR from the median.The point in the box represents the means of the predictions.

Figure 9 .
Figure 9.The misclassification of the PEE in quartiles between the PLS-OK, RF-OK, and SLR-OK models for NO 2 and PM 2.5 .The four classifications from top to bottom denote the ranges of 0~25%, 25~50%, 50~75%, and 75~100% of PEE.The numbers indicate the non-misclassified percentage in quartiles.

Table 2 .
LOOCV results of the NO 2 models.

Table 3 .
LOOCV results of the PM 2.5 models.