An Ensemble Spatiotemporal Model for Predicting PM2.5 Concentrations

Although fine particulate matter with a diameter of <2.5 μm (PM2.5) has a greater negative impact on human health than particulate matter with a diameter of <10 μm (PM10), measurements of PM2.5 have only recently been performed, and the spatial coverage of these measurements is limited. Comprehensively assessing PM2.5 pollution levels and the cumulative health effects is difficult because PM2.5 monitoring data for prior time periods and certain regions are not available. In this paper, we propose a promising approach for robustly predicting PM2.5 concentrations. In our approach, a generalized additive model is first used to quantify the non-linear associations between predictors and PM2.5, the bagging method is used to sample the dataset and train different models to reduce the bias in prediction, and the variogram for the daily residuals of the ensemble predictions is then simulated to improve our predictions. Shandong Province, China, is the study region, and data from 96 monitoring stations were included. To train and validate the models, we used PM2.5 measurement data from 2014 with other predictors, including PM10 data, meteorological parameters, remote sensing data, and land-use data. The validation results revealed that the R2 value was improved and reached 0.89 when PM10 was used as a predictor and a kriging interpolation was performed for the residuals. However, when PM10 was not used as a predictor, our method still achieved a CV R2 value of up to 0.86. The ensemble of spatial characteristics of relevant factors explained approximately 32% of the variance and improved the PM2.5 predictions. The spatiotemporal modeling approach to estimating PM2.5 concentrations presented in this paper has important implications for assessing PM2.5 exposure and its cumulative health effects.


Introduction
Studies show that air pollution has negative health impacts on humans [1], and this issue has received considerable attention in recent years [2,3]. The air pollutants that are most dangerous to humans include sulfur dioxide, nitrogen dioxide, ozone, and particulate matter with an aerodynamic diameter of less than 2.5 µm (PM 2.5 ) [4]. Because of its small size, particulate matter can adhere to the deep respiratory tract and affect blood circulation by penetrating lung cells [5,6]. Several studies have shown that particulate matter increases the risk of developing airway obstructive disease, chronic bronchitis [7], asthma (in children) [8][9][10], lung cancer [11], and various other cardiovascular diseases [12][13][14]. Thus, when assessing PM 2.5 pollution and the cumulative health effects, an accurate method of estimating PM 2.5 exposure at fine spatial and temporal resolutions is essential, even when 2.2. Monitoring Data PM 2.5 and PM 10 concentration data (unit: µg/m 3 ) were obtained from the Ministry of Environmental Protection of the People's Republic of China (http://datacenter.mep.gov.cn/). The data were gathered at 96 national air quality monitoring stations located in 17 cities ( Figure 1) from 1 January to 31 December 2014 (365 days). Because 15 days of data were missing, we imputed the omitted data using the singular value decomposition-based missing-value method [27] using the pcaMethods package (Max-Planck Institute for Molecular Plant Physiology, Golm, Germany), which is available through the R statistical software program (version 3.2.2). PM 10 and PM 2.5 concentrations were monitored hourly, and we calculated the daily average for each site using the 75% measurement standard (measurements covering 75% of a 24-h period were used to calculate the daily means). The daily mean concentration for 2014 is shown in Figure 2. In 2014, monitoring stations in the northwest region of Shandong Province presented lower average PM 2.5 concentrations than those of the central and southern regions (Figure 1).

Monitoring Data
PM2.5 and PM10 concentration data (unit: μg/m 3 ) were obtained from the Ministry of Environmental Protection of the People's Republic of China (http://datacenter.mep.gov.cn/). The data were gathered at 96 national air quality monitoring stations located in 17 cities ( Figure 1) from 1 January to 31 December of 2014 (365 days). Because 15 days of data were missing, we imputed the omitted data using the singular value decomposition-based missing-value method [27] using the pcaMethods package (Max-Planck Institute for Molecular Plant Physiology, Golm, Germany), which is available through the R statistical software program (version 3.2.2). PM10 and PM2.5 concentrations were monitored hourly, and we calculated the daily average for each site using the 75% measurement standard (measurements covering 75% of a 24-h period were used to calculate the daily means). The daily mean concentration for 2014 is shown in Figure 2. In 2014, monitoring stations in the northwest region of Shandong Province presented lower average PM2.5 concentrations than those of the central and southern regions ( Figure 1).

Predictors
We collected predictor data from multiple sources, including meteorological parameters, remote sensing derived variables, traffic indices, point emissions, land uses, and elevation levels. Meteorological parameters such as air temperature, precipitation, humidity, and wind speed are closely associated with concentrations of particulate matter, particularly PM2.5 [28][29][30][31][32]. Data from the Modern-Era Retrospective Analysis for Research and Applications were used to extract the meteorological parameters, including the surface air temperature (°C), effective surface specific humidity (kg/kg), bias-corrected total precipitation (kg/m 2 s), surface eastward wind speed (m/s), and surface northward wind speed levels (m/s). The Modern-Era Retrospective Analysis dataset is a NASA reanalysis dataset for the satellite era that integrates a variety of observational systems to produce a temporally and spatially consistent synthesis [33,34]. The spatial resolution of the meteorological parameters was 0.5° latitude by 0.6° longitude, and an hourly temporal resolution was used [34]. For the meteorological predictors, we calculated the average of the 24-h measurements for each day.
Transportation emissions constitute one source of air pollution emissions [26,[35][36][37]. We extracted two index variables as predictors of transportation: the road lengths of the 10 km buffer zones around each monitoring station and the shortest distance from each station to the nearest road. We used the Open Street Map (http://www.openstreetmap.org/) as the road network for deriving traffic predictors.
In addition to traffic predictors, industrial plants, such as coal, oil-burning, chemical, smelting, power, paper, and mining plants, may constitute major sources of PM2.5 emissions [38]. To reflect the emission sources for PM2.5, we obtained location information on 326 plants in the study domain via Baidu map points of interest technologies (a major map provider in China). We measured the number

Predictors
We collected predictor data from multiple sources, including meteorological parameters, remote sensing derived variables, traffic indices, point emissions, land uses, and elevation levels. Meteorological parameters such as air temperature, precipitation, humidity, and wind speed are closely associated with concentrations of particulate matter, particularly PM 2.5 [28][29][30][31][32]. Data from the Modern-Era Retrospective Analysis for Research and Applications were used to extract the meteorological parameters, including the surface air temperature ( • C), effective surface specific humidity (kg/kg), bias-corrected total precipitation (kg/m 2 s), surface eastward wind speed (m/s), and surface northward wind speed levels (m/s). The Modern-Era Retrospective Analysis dataset is a NASA reanalysis dataset for the satellite era that integrates a variety of observational systems to produce a temporally and spatially consistent synthesis [33,34]. The spatial resolution of the meteorological parameters was 0.5 • latitude by 0.6 • longitude, and an hourly temporal resolution was used [34]. For the meteorological predictors, we calculated the average of the 24-h measurements for each day.
Transportation emissions constitute one source of air pollution emissions [26,[35][36][37]. We extracted two index variables as predictors of transportation: the road lengths of the 10 km buffer zones around each monitoring station and the shortest distance from each station to the nearest road. We used the Open Street Map (http://www.openstreetmap.org/) as the road network for deriving traffic predictors.
In addition to traffic predictors, industrial plants, such as coal, oil-burning, chemical, smelting, power, paper, and mining plants, may constitute major sources of PM 2.5 emissions [38]. To reflect the emission sources for PM 2.5 , we obtained location information on 326 plants in the study domain via Baidu map points of interest technologies (a major map provider in China). We measured the number of plants potentially releasing PM 2.5 pollution within the optimal 10 km buffer of each monitoring station as well as the shortest distance between each station and its closest plant.
Land-use data for 2010 were obtained from the Chinese Academy of Sciences Data Center for Resources and Environmental Sciences [39]. The dataset consists of six first-class types (plow areas, forest land, grassland, water bodies, residential areas, and unutilized areas) and 25 second-class types with a 1 km resolution. We extracted the proportion of the forest land-use areas as an environmental factor and the proportion of construction land-use areas (referring to land used for factories and mines, oil fields, and stone pits) as an indirect emission factor within the optimal 10 km buffer (with the highest correlation to PM 2.5 ) of the monitoring stations.
Aerosol optical thickness (AOT) data were extracted from satellite images [40]. More specifically, AOT data (spatial resolution: 1 km; temporal resolution: daily) derived from the Moderate Resolution Imaging Spectroradiometer MODIS Terra/Aqua satellites were used, and we observed a strong correlation between the satellite AOT data and PM 2.5 levels in certain regions of China [41,42]. In addition, we extracted the monthly NDVI (normalized difference vegetation index) from the MODIS MODND1D dataset (spatial resolution: 500 m; temporal resolution: monthly) for each sampling location. The dataset was provided by the International Scientific and Technical Data Mirror Site Computer Network Information Center of the Chinese Academy of Sciences [43].
In addition to the predictors described above, we considered GDP and digital elevation model (DEM) values. GDP denotes the economic level of the study domain, and DEM values reflect the terrain characteristics that may influence pollution diffusion [44]. However, the data were not significantly correlated with PM 2.5 ; therefore, the two variables were not considered in the models.
Variables were selected based on the following two steps. First, we calculated the variance inflation factor (VIF) to avoid multicollinearity [45]. Second, we used the backward-stepwise method to select variables for each combination by removing a variable once until the Akaike's information criterion (AIC) [46] remained the same. Finally, we selected variable combinations with the lowest AIC values as optimal inputs.

Modeling Approach
This section describes our modeling process from the generation of the predictors of temporal basis trends (Section 2.4.1), non-linear additive modeling (Section 2.4.2), ensembling learning (Section 2.4.3), kriging of the residuals (Section 2.4.4), and validation and independent test (Section 2.4.5).

Generation of the Predictors of Temporal Trends
Besides the meteorological, remote sensing, traffic, point emission, land-use and elevation predictors listed in Section 2.3, we also extracted the predictors of temporal trends from the temporally continuous monitoring PM 2.5 data. Air pollutant levels, including PM 2.5 , follow a regular temporal trend of high values in winter and low values in summer [47]. Thus, we can use the temporal variables extracted to model seasonal trends for the study region. Of the two methods, empirical mode decomposition [48] and singular value decomposition (SVD) [49,50], we used the SVD method to obtain the temporal basis functions and capture dominant seasonal trends because of its better generalizability and missing data processing capabilities. Cubic smoothing splines were used to fit the temporal basis function, and the first and second temporal basis functions for the Shandong province of China are shown in Figure 3. The first and second temporal basis functions were used as predictors as they account for a majority of the seasonal variability of PM 2.5 for the study region [51].

Non-Linear Additive Modeling
The following equation illustrates our non-linear modeling framework: In Equation (1), s and t represent the location (latitude and longitude) and time (day index), respectively, of the monitoring sample; ( , ) represents the log-transformed concentration of measured or estimated PM2.5 (μg/m 3 ), because the histograms of the PM2.5 and PM10 levels (Supplemental Materials Figure S1) showed a right-skewed distribution; thus, log transformation was necessary for normalization; and μ(s,t) represents the mean for ( , ) with spatiotemporal residual ε(s,t)~N(0,σ 2 ). In Equation (2), the mean of μ(s,t) are modeled based on temporal variables (the first and second temporal basis functions ( ) and ( ) and spatiotemporal variables (xi(s,t) including co-located PM10, meteorological parameters, AOT, and NDVI) and spatial variables (pk(s), including traffic indices and land-use predictors). We used generalized additive models (GAMs) to perform the non-parametric non-linear associations (s(…) refers to smooth functions), and we set the maximum degree of freedom to 10 to prevent overfitting in the non-linear model [49].
The "mgcv" package available through the R statistical software program (version 3.2.2) was used to model the non-linear additive modeling in Equation (2).

Ensemble Learning
Section 2.4.1 presents the individual GAM models, which may be sensitive to new datasets when performing practical predictions. The ensemble method of machine learning [52] was used to enhance the predictions of the multiple GAM models included in Equations (1) and (2) and to avoid the potential over-fitting problem; meanwhile, standard deviation can be obtained from multiple predictions as an uncertainty indicator. Because of the strong performance of bootstrap aggregating (bagging) methods [53][54][55][56][57], we used bagging to conduct integrated learning for the individual models. In this method, input data are first sampled with replacements to generate multiple sets of input data of the same size as the original input data. Using different sample subsets, various regression models were trained. Cross-validations were used to determine the optimal number of models, and we ultimately obtained 1000 models to capture the dataset's different variability levels based on the varied sample subsets. Each trained model was then used to make predictions for the remaining sample to train the model [57]. Because the original dataset included 35,040 data points, we obtained a re-sampled training dataset (63.2% of the original dataset) and remaining test data

Non-Linear Additive Modeling
The following equation illustrates our non-linear modeling framework: In Equation (1), s and t represent the location (latitude and longitude) and time (day index), respectively, of the monitoring sample; y(s, t) represents the log-transformed concentration of measured or estimated PM 2.5 (µg/m 3 ), because the histograms of the PM 2.5 and PM 10 levels (Supplemental Materials Figure S1) showed a right-skewed distribution; thus, log transformation was necessary for normalization; and µ(s, t) represents the mean for y(s, t) with spatiotemporal residual ε(s, t)~N(0, σ 2 ). In Equation (2), the mean of µ(s, t) are modeled based on temporal variables (the first and second temporal basis functions f 1 (t) and f 2 (t) and spatiotemporal variables (x i (s, t) including co-located PM 10 , meteorological parameters, AOT, and NDVI) and spatial variables (p k (s), including traffic indices and land-use predictors). We used generalized additive models (GAMs) to perform the non-parametric non-linear associations (s( . . . ) refers to smooth functions), and we set the maximum degree of freedom to 10 to prevent overfitting in the non-linear model [49].
The "mgcv" package available through the R statistical software program (version 3.2.2) was used to model the non-linear additive modeling in Equation (2).

Ensemble Learning
Section 2.4.1 presents the individual GAM models, which may be sensitive to new datasets when performing practical predictions. The ensemble method of machine learning [52] was used to enhance the predictions of the multiple GAM models included in Equations (1) and (2) and to avoid the potential over-fitting problem; meanwhile, standard deviation can be obtained from multiple predictions as an uncertainty indicator. Because of the strong performance of bootstrap aggregating (bagging) methods [53][54][55][56][57], we used bagging to conduct integrated learning for the individual models. In this method, input data are first sampled with replacements to generate multiple sets of input data of the same size as the original input data. Using different sample subsets, various regression models were trained. Cross-validations were used to determine the optimal number of models, and we ultimately obtained 1000 models to capture the dataset's different variability levels based on the varied sample subsets. Each trained model was then used to make predictions for the remaining sample to train the model [57]. Because the original dataset included 35,040 data points, we obtained a re-sampled training dataset (63.2% of the original dataset) and remaining test data (36.8%) for independent validation [58]. The final integrated prediction was the mean weighted by the model's performance (measured by R 2 ): , from the GAMs generated by bootstrap sampling and their weights, w i that is proportion to the square of R 2 : Further, the predictions outputted the uncertainty of predictions by calculating the standard deviation of all the predictions from all the models: where σ f (s, t) is the standard deviation from the output of multiple models, and M is the number of nonzero weights.

Kriging of Spatiotemporal Residuals
The daily residuals were modeled for their spatiotemporal patterns. A reasonable assumption is the stable spatial domain after removal of local means [59]. Based on this assumption, the residuals after removal of the means present spatial pattern (autocorrelation) and the variogram could be used to capture such autocorrelation. Consequently, based on the variogram fit, we can infer the residuals for any location under the study domain. Kriging is a classical approach and the readers can refer to [60] for details. Here, based on the stable domain theory after removal of local means [49], kriging was used to estimate the residuals: where ε(s 0 , t) is the residual for a target location to be estimated, s 0 with time point, t, ε(s, t) is the residual from the estimated mean of the training sample at the location s i and the same time point t, from the ensemble models. λ i is the weight for ε(s i , t) determined by the fitted variogram. Due to the complexity and instability of directly fitting the spatiotemporal variogram, we simplified variogram fitting by independently fitting daily residuals. Thereby, we can avoid the complex assumption of temporal stability besides spatial stability for directly fitting and improved the effectiveness of parameter estimates for the variogram.
By leave-one-out-cross-validation, we selected the exponential model from multiple models (spherical, circular, exponential, Gaussian, linear models) [61] to simulate the variogram with their optimal performance.
where γ(d, t) is the variogram function for the residuals, ε(s, t) to be fit, d is spatial distance, t is the temporal index (day index), c 0 (t) is nugget, c(t) is the sill and a 0 is the parameter to be estimated. For variogram fitting, three important parameters-range, nugget and sill-were estimated. Nugget reflects the variance at the discontinuity at the origin, sill is the limit of the variogram tending to infinity lag distances, and range is the distance in which the difference of the variogram from the sill becomes negligible [58]. In our model, the variogram was used to derive the covariance between different locations at the same time that was used in the weighted regression in Equation (6).
Using the daily residual variogram models, we simulated temporal changes in the parameters (range, partial sill, and nugget) for the residual variogram models for 2014. For the predictions, we created grid interpolations for daily residuals for a year across the study region and then added the corresponding estimate of the residual to the bagging prediction estimate to arrive at our final predictions. Therefore, we improved the final predictions by incorporating non-linear associations and residual spatial autocorrelations. The variogram and prediction residuals were completed in ArcGIS (version 10.3, ESRI, Sacramento, CA, USA) using the geostatistical analyst module [62], and the nugget and sill variogram parameters were optimized through cross validations with a focus on the estimation of range parameters.

Validation and Independent Test
To examine the predictive performance of our models under different scenarios, we designed six models with different predictor combinations: Model 1, GAM with no use of the PM 10 predictor and residual kriging; Model 2, GAM with the PM 10 predictor but without residual kriging; Model 3, bagging with no use of the PM 10 predictor and residual kriging; Model 4, bagging with no use of the PM 10 predictor but with residual kriging; Model 5, bagging with the PM 10 predictor but without residual kriging; and Model 6, bagging with the PM 10 predictor and residual kriging. In the six models, the PM 10 and PM 2.5 variables were log transformed and transferred back by the exponential function. Bagging predictions for the training dataset were similar to the 63.2-36.8% cross validation. We also conducted a 10 × 10-fold cross-validation for Models 1 and 2. The R 2 value, root-mean-square error (RMSE), and residual plots between the predicted and observed values were used to compare the performances of the different models.
Independent test was also conducted using Model 4 trained by the 2014 data to predict the 2016 monthly mean PM 2.5 levels of the 30 sub-regions ( Figure S2 of Supplemental Materials) of Shandong Province. Due to the missing PM 2.5 monitoring station based data for 2016, we just compared the predicted monthly sub-regional mean level of PM 2.5 with the measured PM 2.5 level derived from the Ministry of Environmental Protection of People's Republic of China to illustrate our method's applicability. Monthly averages of PM 2.5 were summarized over daily predicted or measured values. Block kriging was used to derive the regional means based on the point estimated or measured values.

Data Summary
We obtained 35,040 measurements for all 96 sample locations, with a rate of missing daily values of 4%, and the SVD method was used to complete the training sample. The daily mean PM 2.5 and PM 10 values in the study domain are shown in Figure 2. For all measurements, the mean PM 2.5 concentration was 73.86 µg/m 3 , while the mean PM 10 was 128.80 µg/m 3 . The summer and autumn concentrations were obviously lower than the spring and winter concentrations ( Figure 2). The PM 2.5 across the 96 stations varied considerably from approximately 1 µg/m 3 to nearly 600 µg/m 3 (standard deviation: 50.79 µg/m 3 ), whereas the PM 10 levels across the 96 stations also varied greatly from roughly 3 µg/m 3 to nearly 690 µg/m 3 (standard deviation: 77.37 µg/m 3 ). The highest PM 2.5 mean concentration measured for the period was 130.22 µg/m 3 , and a relatively high PM 10 value (158.55 µg/m 3 ) was measured at the 1624A station located in the northwest part of the province (Figure 1). This figure is much higher than the other observed levels. The highest daily PM 2.5 concentration for the period was 592.33 µg/m 3 (PM 10 : 658.92 µg/m 3 ), and it was recorded at the 1657A station located in the central part of the province on January 19, 2014. We found 50 stations in the study domain with PM 2.5 annual mean concentrations exceeding 75 µg/m 3 . We also examined the means of the PM 2.5 to PM 10 ratios across daily timelines for 2014 (Figure 4), and the ratios appear to follow a dominant trend with minor variations (mean: 0.57 µg/m 3 with a standard deviation of 0.10 µg/m 3 , with slightly lower levels in summer than winter).

Predictors
Using univariate and multivariate GAMs (generalized additive models), we explored the nonlinear associations of each predictor with PM2.5 as well as their respective contributions to the variance. Table 1 presents the variance explained by each predictor of the univariate and multivariate GAM models with and without the inclusion of PM10. The results showed that the highest contributions were observed with the co-located PM10 (accounting for 67.97-73% of the variance) and the second highest contributions were observed with the first temporal basis function (accounting for 4.84-37% of the variance) in the univariate and multivariate models. For the univariate model, the day of the year and humidity levels accounted for the third and fourth highest contributions to PM2.5 variability, respectively. In addition to the co-located PM10 and the first temporal basis function, the other predictive variables varied in their contributions in terms of the variance between the univariate model and the multivariate model.

Predictors
Using univariate and multivariate GAMs (generalized additive models), we explored the non-linear associations of each predictor with PM 2.5 as well as their respective contributions to the variance. Table 1 presents the variance explained by each predictor of the univariate and multivariate GAM models with and without the inclusion of PM 10 . The results showed that the highest contributions were observed with the co-located PM 10 (accounting for 67.97-73% of the variance) and the second highest contributions were observed with the first temporal basis function (accounting for 4.84-37% of the variance) in the univariate and multivariate models. For the univariate model, the day of the year and humidity levels accounted for the third and fourth highest contributions to PM 2.5 variability, respectively. In addition to the co-located PM 10 Figure 5 and Figure S3 of the Supplemental Materials section present the non-linear associations between each predictor and PM 2.5 . The variance explained by the non-linear models was up to 10% than that of the linear models. As shown in Figure 5, the GAMs showed a good ability to capture the non-linear associations between the predictive variables and PM 2.5 . The varying associations (e.g., PM 10 non-linear associations, including critical varying points, such as 148 µg/m 3 with PM 2.5 (also known as 5 log PM 10 ) in Figure 5a) captured by the GAMs were critical because they better captured the varying trends between influential factors and concentrations than the linear model. For example, the higher concentrations measured in winter compared with those in summer for the day of the year parameter (Figure 5g) could not be quantified in the linear models. Similarly, varying non-linear associations were also observed for the AOT, the number of emission plants, precipitation, and air temperature.  Figure 5 and Figure S3 of the Supplemental Materials section present the non-linear associations between each predictor and PM2.5. The variance explained by the non-linear models was up to 10% than that of the linear models. As shown in Figure 5, the GAMs showed a good ability to capture the non-linear associations between the predictive variables and PM2.5. The varying associations (e.g., PM10 non-linear associations, including critical varying points, such as 148 μg/m 3 with PM2.5 (also known as 5 log PM10) in Figure 5a) captured by the GAMs were critical because they better captured the varying trends between influential factors and concentrations than the linear model. For example, the higher concentrations measured in winter compared with those in summer for the day of the year parameter ( Figure 5g) could not be quantified in the linear models. Similarly, varying non-linear associations were also observed for the AOT, the number of emission plants, precipitation, and air temperature.   Table 2 shows the results of the six models. Model 1 did not use PM10 as a predictor or perform a kriging interpolation of the residuals, and it achieved the worst performance, generating a CV R 2 of 0.53 and the highest RMSE of 34.69 μg/m 3 . The bagging method (Model 3) results were also similar to that of the ten-fold cross-validation and achieved a similar predictive performance as Model 1; in addition, this method was able to output the standard deviation as a measure of uncertainty. Although the co-located PM10 significantly contributed to explaining the variance (28-36%), the kriging interpolation of residuals was able to explain a large proportion (33%) of the variance (Model 4 vs. Model 3) when PM10 was not used as a predictor. When PM10 was used as a predictor, the kriging interpolation of the daily residuals improved the model's ability to explain the variance by 7% (Model 6 vs. Model 5). When the PM10 predictor was not used, the final model (Model 4) achieved a CV R 2 of 0.86, and when the PM10 predictor was used, the final model (Model 6) achieved a CV R 2 of 0.89. The  Table 2 shows the results of the six models. Model 1 did not use PM 10 as a predictor or perform a kriging interpolation of the residuals, and it achieved the worst performance, generating a CV R 2 of 0.53 and the highest RMSE of 34.69 µg/m 3 . The bagging method (Model 3) results were also similar to that of the ten-fold cross-validation and achieved a similar predictive performance as Model 1; in addition, this method was able to output the standard deviation as a measure of uncertainty. Although the co-located PM 10 significantly contributed to explaining the variance (28-36%), the kriging interpolation of residuals was able to explain a large proportion (33%) of the variance (Model 4 vs. Model 3) when PM 10 was not used as a predictor. When PM 10 was used as a predictor, the kriging interpolation of the daily residuals improved the model's ability to explain the variance by 7% (Model 6 vs. Model 5). When the PM 10   Our independent test shows the R 2 of 0.73 for sub-regional monthly PM 2.5 averages. Figures S6 of Supplemental Materials shows the scatter plot of predicted vs. measured values and their residual plot.

Variogram Modeling of the Residuals
As shown in Table 2, the daily residual kriging interpolation considerably improved the predictive performance and accounted for 33% of the explained variance in the model that did not use the PM 10 predictor. An exponential model was used to fit the variogram of the residuals. Three parameters for the variogram model (range, partial sill, and nugget) were measured over all 365 days of 2014 to explore the temporal patterns of daily residuals. Table 3 shows the variogram parameters for Models 4 and 6, and Figure 6 presents the temporal trends with the fitted curves of the three parameters for 2014. The results show lower range, partial sill, and nugget values in summer than in winter, and this trend may have been caused by the higher concentrations of PM 2.5 , larger influential range, and stronger spatial auto-correlations in winter than summer.  Our independent test shows the R 2 of 0.73 for sub-regional monthly PM2.5 averages. Figures S6 of Supplemental Materials shows the scatter plot of predicted vs. measured values and their residual plot.

Variogram Modeling of the Residuals
As shown in Table 2, the daily residual kriging interpolation considerably improved the predictive performance and accounted for 33% of the explained variance in the model that did not use the PM10 predictor. An exponential model was used to fit the variogram of the residuals. Three parameters for the variogram model (range, partial sill, and nugget) were measured over all 365 days of 2014 to explore the temporal patterns of daily residuals. Table 3 shows the variogram parameters for Models 4 and 6, and Figure 6 presents the temporal trends with the fitted curves of the three parameters for 2014. The results show lower range, partial sill, and nugget values in summer than in winter, and this trend may have been caused by the higher concentrations of PM2.5, larger influential range, and stronger spatial auto-correlations in winter than summer.   After quantifying the variogram from the exponential model, ordinary kriging was used to interpolate the residuals. The final PM2.5 prediction included the output of the bagging approach and the estimate of the specific-time residual. To extend our models to other years, we assumed limited changes in the variograms for PM2.5 across different years and overlaid the day-of-year interpolated surfaces of the residual (obtained by the kriging method in ArcGIS) with the target location within the study region to extract the residual estimate for a certain day of the target year.

Uncertainty
As illustrated, the bagging approach integrated the predictions from 1000 models to generate a stable estimate with standard deviations as a measure of uncertainty. Using the estimated standard deviation, we could also generate 95% confident intervals for the estimate [63,64]. The standard deviation is summarized in Table 4, and the standard deviation distribution is shown in Figure 7. The results show small uncertainty (only 5.2% of the standard deviations exceeded 5 μg/m 3 for Models 3 and 4, and only 1.9% of the standard deviation exceeded 5 μg/m 3 for Models 5 and 6). (a) After quantifying the variogram from the exponential model, ordinary kriging was used to interpolate the residuals. The final PM 2.5 prediction included the output of the bagging approach and the estimate of the specific-time residual. To extend our models to other years, we assumed limited changes in the variograms for PM 2.5 across different years and overlaid the day-of-year interpolated surfaces of the residual (obtained by the kriging method in ArcGIS) with the target location within the study region to extract the residual estimate for a certain day of the target year.

Uncertainty
As illustrated, the bagging approach integrated the predictions from 1000 models to generate a stable estimate with standard deviations as a measure of uncertainty. Using the estimated standard deviation, we could also generate 95% confident intervals for the estimate [63,64]. The standard deviation is summarized in Table 4, and the standard deviation distribution is shown in Figure 7.
The results show small uncertainty (only 5.2% of the standard deviations exceeded 5 µg/m 3 for Models 3 and 4, and only 1.9% of the standard deviation exceeded 5 µg/m 3 for Models 5 and 6).  After quantifying the variogram from the exponential model, ordinary kriging was used to interpolate the residuals. The final PM2.5 prediction included the output of the bagging approach and the estimate of the specific-time residual. To extend our models to other years, we assumed limited changes in the variograms for PM2.5 across different years and overlaid the day-of-year interpolated surfaces of the residual (obtained by the kriging method in ArcGIS) with the target location within the study region to extract the residual estimate for a certain day of the target year.

Uncertainty
As illustrated, the bagging approach integrated the predictions from 1000 models to generate a stable estimate with standard deviations as a measure of uncertainty. Using the estimated standard deviation, we could also generate 95% confident intervals for the estimate [63,64]. The standard deviation is summarized in Table 4, and the standard deviation distribution is shown in Figure 7. The results show small uncertainty (only 5.2% of the standard deviations exceeded 5 μg/m 3 for Models 3 and 4, and only 1.9% of the standard deviation exceeded 5 μg/m 3 for Models 5 and 6). (a)

Discussion
In this paper, we proposed an ensemble spatiotemporal modeling approach for robustly predicting PM2.5 concentrations. For the individual model, we used a GAM (generalized additive models) to determine the non-linear association between PM2.5 and multiple predictors (meteorological patterns, traffic indices, season variations, the number of emission sources, and landuse patterns). In particular, we extracted temporal basis functions from the monitoring stations to represent the seasonal PM2.5 variability for the study region, which accounted for a large portion of the explained variance. We then used the bagging method to sample the dataset to train 1000 individual models and derived stable ensemble predictions with standard deviations as an uncertainty measure. Then, the kriging method was used to model the residuals from the ensemble predictions to estimate the daily PM2.5 residuals throughout a year (365 days) for the study region. For the model that did not include PM10 as a predictor, the daily residual kriging achieved a similar predictive performance (R 2 : 0.86 vs. 0.89) as the model using PM10 as a predictor. Strong spatial autocorrelations of the residuals accounted for a considerable portion (33%) of the explained variance when co-located PM10 values were not included. These results denote the usefulness of residual spatial autocorrelations for predicting PM2.5 when PM10 measurements are missing as observed for our study location of Shandong Province, China. To our knowledge, previous studies that have performed the same predictions [22,23,65] have only reported R 2 values of 0.54-0.81, and few studies have achieved a similar estimation accuracy for PM2.5 without using PM10 as a predictor.
This study also illustrates the important contributions of non-linear associations in the models [19,20,22,23]. The final results show a considerable predictive improvement through the use of non-linear additive models. The results reveal a notable positive non-linear association between PM2.5 and PM10, AOT (aerosol optical thickness), the number of emission plants, and air temperature, with varying fluctuations found for different predictor intervals (Figure 5a-c,f), whereas a negative association was observed between PM2.5 and precipitation (Figure 5e). For the wind vector terms, the PM2.5 concentrations tended to decline as the wind speed increased in the south-north and east-west directions because of complex interactions (Figure 5d). Such associations are generally consistent with previous conclusions [29,49]. As a regional pollutant, PM2.5 is affected by various factors, including meteorological parameters, emissions sources, and traffic indices [18,19,22,66]. PM10 consists of PM2.5 and other components, and it is strongly correlated with PM2.5; therefore, co-located PM10 is a primary predictor of PM2.5 levels and explained most of the variance in the multivariate models. Following PM10, the first temporal basis function was the second most important predictor. The temporal basis functions captured the seasonal variability of PM2.5 levels for the study region. For the model with PM10 used as a predictor, the first temporal basis function accounted for roughly 5% of the total variance. However, in the multivariate model (Model 3) without PM10, meteorological parameters (including

Discussion
In this paper, we proposed an ensemble spatiotemporal modeling approach for robustly predicting PM 2.5 concentrations. For the individual model, we used a GAM (generalized additive models) to determine the non-linear association between PM 2.5 and multiple predictors (meteorological patterns, traffic indices, season variations, the number of emission sources, and land-use patterns). In particular, we extracted temporal basis functions from the monitoring stations to represent the seasonal PM 2.5 variability for the study region, which accounted for a large portion of the explained variance. We then used the bagging method to sample the dataset to train 1000 individual models and derived stable ensemble predictions with standard deviations as an uncertainty measure. Then, the kriging method was used to model the residuals from the ensemble predictions to estimate the daily PM 2.5 residuals throughout a year (365 days) for the study region. For the model that did not include PM 10 as a predictor, the daily residual kriging achieved a similar predictive performance (R 2 : 0.86 vs. 0.89) as the model using PM 10 as a predictor. Strong spatial autocorrelations of the residuals accounted for a considerable portion (33%) of the explained variance when co-located PM 10 values were not included. These results denote the usefulness of residual spatial autocorrelations for predicting PM 2.5 when PM 10 measurements are missing as observed for our study location of Shandong Province, China. To our knowledge, previous studies that have performed the same predictions [22,23,65] have only reported R 2 values of 0.54-0.81, and few studies have achieved a similar estimation accuracy for PM 2.5 without using PM 10 as a predictor.
This study also illustrates the important contributions of non-linear associations in the models [19,20,22,23]. The final results show a considerable predictive improvement through the use of non-linear additive models. The results reveal a notable positive non-linear association between PM 2.5 and PM 10 , AOT (aerosol optical thickness), the number of emission plants, and air temperature, with varying fluctuations found for different predictor intervals (Figure 5a-c,f), whereas a negative association was observed between PM 2.5 and precipitation (Figure 5e). For the wind vector terms, the PM 2.5 concentrations tended to decline as the wind speed increased in the south-north and east-west directions because of complex interactions (Figure 5d). Such associations are generally consistent with previous conclusions [29,49]. As a regional pollutant, PM 2.5 is affected by various factors, including meteorological parameters, emissions sources, and traffic indices [18,19,22,66]. PM 10 consists of PM 2.5 and other components, and it is strongly correlated with PM 2.5 ; therefore, co-located PM 10 is a primary predictor of PM 2.5 levels and explained most of the variance in the multivariate models. Following PM 10 , the first temporal basis function was the second most important predictor. The temporal basis functions captured the seasonal variability of PM 2.5 levels for the study region. For the model with PM 10 used as a predictor, the first temporal basis function accounted for roughly 5% of the total variance. However, in the multivariate model (Model 3) without PM 10 , meteorological parameters (including precipitation, wind speed, temperature, and humidity) accounted for only 4.55% of the variance. In addition, traffic indices accounted for 5.01% of the variance; emission plants accounted for roughly 3.34% of the variance; and AOT and NDVI (normalized difference vegetation index) accounted for 4.77% and 0.24% of the variance, respectively. Previous studies of certain regions show weak correlations between AOT and surface PM 2.5 because the surface reflectance ratio between visible and shortwave infrared channels of the AOT product was underestimated [67]. Our study also illustrates the limited contributions of AOT and NDVI as predictors.
For the models using PM 10 as a predictor (Models 5 and 6), co-located PM 10 accounted for most (67.97%) of the variance, whereas the other predictors together accounted for only 13.33%. This finding illustrates that PM 10 captured a major part of the spatiotemporal variability in PM 2.5 . Without the PM 10 predictor, the other variables accounted for roughly 53.2% of the variance. In addition to the first temporal basis function (accounting for 26.71%), the other predictors together accounted for only 26.49% of the variance.
Unfortunately, historical PM 2.5 and PM 10 measurements are not available for China and many other countries across the globe [16,68]. Even today, the spatial coverage of the PM 2.5 and PM 10 monitoring network are limited in China. To comprehensively monitor PM 2.5 pollution levels and assess their cumulative health effects on humans, reliable estimates of PM 2.5 concentrations at fine spatiotemporal resolutions should be performed using the limited available PM 2.5 monitoring data for time periods without available data over extensive spatial areas. Unfortunately, accurately predicting PM 2.5 levels at high spatiotemporal resolutions is difficult without relevant data on co-located pollutants. In this paper, we explored the use of kriging interpolations of daily residuals, and the results show considerable improvements in predictive performance. Whereas the GAM already took both spatial and temporal information by the covariates, such spatial and spatiotemporal covariates couldn't fully capture the spatiotemporal variability of PM 2.5 and so the performance is not so good without use of spatiotemporal residuals or PM 10 . The daily residual's kriging captured an important portion of PM 2.5 spatiotemporal variability not captured by GAMs. Although a previous study also employed residual kriging interpolations [69], the variogram modeling of daily residuals remains poorly researched. Therefore, we analyzed the variogram patterns of residuals for a one-year period and performed a cross validation to illustrate the generalizability of the proposed method. As a regional pollutant, PM 2.5 exhibited stronger effects and spatial autocorrelations than nitrogen oxide. Spatial autocorrelations of the residuals were better able to explain the spatial variability of PM 2.5 than nitrogen oxide pollutants [49]. Because of the considerable effects of PM 2.5 , particularly for winter in Shandong Province, the residual kriging method was applicable despite the limited number of PM 2.5 monitoring stations examined and large spatial distances between the monitored samples. As shown in the results (Figure 6), PM 2.5 in winter showed effects over a longer time period and a greater spatial area than those observed in summer. We expected PM 2.5 to have an effect over a longer time period (a longer range) and to present higher concentrations (also resulting in higher partial sill and nugget values) in winter than summer. Our variogram modeling of the daily residuals captured these spatial autocorrelation trends to compensate for the gap in the variance explained due to the missing PM 10 predictor, and thus improved the accuracy of our final predictions. In practice, co-located PM 10 and other pollutant measurements are not usually available. Thus, Model 4 (using residual kriging interpolations but not PM 10 as a predictor) has the potential for use in a greater number of applications than the other models that used PM 10 as a predictor. Our cross-validation results show that Model 4 achieved high levels of accuracy and the results were slightly better than those of Model 5 (using PM 10 as a predictor but not the residual kriging interpolation) (R 2 : 0.86 vs. 0.82), although the results were slightly less accurate than those of Model 6 (using PM 10 and residual kriging) (R 2 : 0.86 vs. 0.89). The R 2 value of Model 6 was only 7% higher than that of Model 5, which may have been related to the spatiotemporal PM 10 values, which accounted for a major portion of the variance explained in Model 6; whereas the remaining 7% that was not captured by the predictors was explained by the spatial autocorrelations of the residuals.
Based on individual GAMs, our approach introduced the bagging technique to obtain the stable prediction with standard deviation as an uncertainty indicator. Thereby, our prediction could be robust and less over-fitting although our performance is pretty similar to the individual model's output from the other approaches [13,24,25,65,70]. On the other hand, due to characteristics of strong spatial autocorrelation of PM 2.5 concentration, our approach could considerably improve performance even without use of the PM 10 predictor. Thus, for practical prediction, our approach does not need the extraction of complicated covariates such as output of the chemical transport model (GEOS-CHEM) and is less costly with a similar performance at high spatiotemporal resolution. Production of PM 2.5 is complicated and varies with different regions. While some approaches may have good performances in other regions [13,65,70], they may be not applied to China as our approach, due to unavailability of some predictive covariates or differences in geographical and meteorological factors and emission sources.
This study presents several limitations. First, we extracted the first and second temporal basis functions from regular monitoring data to represent the study region's seasonal variability in PM 2.5 . This procedure could have resulted in overfitting. To determine the influence on the final results, we conducted a 10 × 10 cross-validation and independent test, and the results (CV R 2 : 0.86-0.89; 0.73 for independent test) show high levels of predictive accuracy, thereby illustrating the limited influence of the temporal basis functions extracted from the predictions. Second, our variogram modeling of the daily residuals was not systematic, which may have affected the applicability of the proposed spatiotemporal approach. Our validation and independent results show that such effects are limited for extensive applications. Third, our spatiotemporal modeling approach was based on monitoring data for Shandong Province, China; thus, its applicability may be limited. PM 2.5 pollution has a larger sphere of influence in northern regions of China, including Shandong Province, and the spatial autocorrelation of residuals of PM 2.5 concentrations was markedly strong. Thus, the use of the residual kriging method for less dense monitoring networks could still capture the spatiotemporal variability of PM 2.5 and account for a large portion of the variance. For regions with lower PM 2.5 pollution and sparser monitoring networks, the residual kriging method may not be applicable, and spatial effects modeling may represent a more preferable approach. We have explored and reported on such a method in another study [71].

Conclusions
This paper has proposed an ensemble spatiotemporal modeling approach for improved prediction of PM 2.5 even with missing co-located pollutants such as PM 10 . Our approach uses a generalized additive model to quantify the non-linear associations between the predictors and PM 2.5 , and ensemble learning was used to obtain stable predictions, with standard deviations used as an uncertainty measure. In addition, the variogram of the daily residuals was modeled to capture day-of-year patterns for the residuals. The results showed the better CV R 2 of 0.89 (with co-located PM 10 used as a predictor) and 0.86 (without PM 10 used as a predictor) than the previous methods. The results demonstrate that our approach can be used to estimate PM 2.5 exposure at a good accuracy and indicate its potential applicability for estimating PM 2.5 exposure in Shandong Province or other similar regions of China.
Supplementary Materials: The following are available online at www.mdpi.com/1660-4601/14/5/549/s1, Figure S1: Histogram with normal density plot of PM 2.5 and PM 10 , Figure S2: Thirty sub-regions in Shandong province, Figure S3: Associations of predictive covariates and PM 2.5 , Figure S4: Residual plots for the measured and fitted PM 2.5 in Model 4 (no PM 10 used but including estimates of the residuals), Figure S5: Residual plots for the measured and predicted PM 2.5 in Model 6 (PM 10 used and incorporation of estimates of the residuals), Figure S6: Residual plots for the measured and predicted monthly-mean PM 2.5 in regions, Table S1: An annex  table for technical terms.