Imputing Satellite-Derived Aerosol Optical Depth Using a Multi-Resolution Spatial Model and Random Forest for PM 2.5 Prediction

: A task for environmental health research is to produce complete pollution exposure maps despite limited monitoring data. Satellite-derived aerosol optical depth (AOD) is frequently used as a predictor in various models to improve PM 2.5 estimation, despite signiﬁcant gaps in coverage. We analyze PM 2.5 and AOD from July 2011 in the contiguous United States. We examine two methods to aid in gap-ﬁlling AOD: (1) lattice kriging, a spatial statistical method adapted to handle large amounts data, and (2) random forest, a tree-based machine learning method. First, we evaluate each model’s performance in the spatial prediction of AOD, and we additionally consider ensemble methods for combining the predictors. In order to accurately assess the predictive performance of these methods, we construct spatially clustered holdouts to mimic the observed patterns of missing data. Finally, we assess whether gap-ﬁlling AOD through one of the proposed ensemble methods can improve prediction of PM 2.5 in a random forest model. Our results suggest that ensemble methods of combining lattice kriging and random forest can improve AOD gap-ﬁlling. Based on summary metrics of performance, PM 2.5 predictions based on random forest models were largely similar regardless of the inclusion of gap-ﬁlled AOD, but there was some variability in daily model predictions.


Introduction
Ambient outdoor air pollution, particularly particulate matter less than 2.5 micrometers in aerodynamic diameter (PM 2.5 ), poses a substantial risk to human health [1][2][3]. Air pollution monitors that can directly measure pollution concentrations are placed at a limited set of locations, resulting in large areas without direct measurements of groundlevel pollution exposure. Aerosol optical depth (AOD) measures the amount of aerosol in the atmosphere and can be remotely sensed by satellite instruments at various spatial resolutions [4]. A growing literature has developed for using satellite-derived AOD as a proxy and predictor for PM 2.5 concentrations, often in conjunction with land-use and meteorological variables, using a range of model types such as geographically weighted regression, linear mixed effect models, and machine learning methods [5][6][7].
However, AOD itself has substantial missingness, complicating the process of predicting PM 2.5 concentrations. Gaps in AOD coverage are a result of cloud cover, snow cover, and surface brightness; for the moderate resolution imaging spectroradiometer (MODIS) 10 km product, on average each grid cell has no AOD available on approximately 70% of days, with substantial variation across regions [8,9]. AOD's patterns of missingness are also not random for the purpose of PM 2.5 prediction; cloud and snow cover may plausibly be related to PM 2.5 concentrations [9,10]. At the scale of the continental United States, research suggests that missingness as a result of cloud cover is not likely to greatly bias monthly and yearly PM 2.5 , although there is regional and seasonal variation [11]. However, Liang et al. [12] show that long-term PM 2.5 estimates in China are substantially biased as result of missing AOD observations. Furthermore, for health effects research, the relevant geographic scale is small and more impacted by missingness. When using PM 2.5 estimates based on satellite-derived AOD with substantial missingness, time series studies will miss many days and lose statistical power, and cohort studies will use potentially biased exposure estimates, resulting in a loss of statistical power. A number of approaches have been proposed for handling missing AOD observations when estimating PM 2.5 [7]. One approach has been to combine different AOD retrievals, although this will still result in incomplete coverage; e.g., Geng et al. [13] combine AOD measurements from Terra and Aqua satellites using linear regression. Other approaches have used AOD where available, but otherwise bypassed the need for gap-filling AOD [14][15][16].
Many recent studies use multi-stage approaches, where AOD is gap-filled, and then a model relating PM 2.5 to the gap-filled AOD and other land-use and meteorological variables is fit. These gap-filling models may use land-use and meteorological terms, as well as chemical transport model (CTM) estimates. Hu et al. [17] forego a statistical modeling procedure for gap-filling AOD, saving computational time, and they replace missing AOD values with CTM (GEOS-Chem) estimates of AOD. Xiao et al. [18] and Huang et al. [19] use linear models that include cloud fraction estimates, meteorological and land-use data together with smoothing splines, to account for spatial correlation for imputing AOD. Lv et al. [20] gap-fill AOD using a model that relates AOD to the ratio of daily and seasonal averages of PM 2.5 multiplied by the average seasonal AOD values for the grid cell for each city under study; a second stage then uses ordinary Kriging to fill in the remaining gaps. Because of the computational costs of ordinary Kriging, this method will not scale well to large datasets, but previous studies suggest that smoothing splines may not perform as well as Kriging in some settings [21]. Chen et al. [22] use a mixed effect model to first combine Terra and Aqua AOD measurements, and interpolate missing AOD values using inverse-distance weighting (IDW). IDW with a maximum distance will not be able to provide full coverage for AOD, however, as there are large missing areas with no observed data. Random forest (RF) is a popular machine learning method used for gap-filling, due to the fast implementations available and its ability to account for complex non-linear interactions of features [7,23]. Bi et al. [10] uses a two-stage model with RF being used to impute AOD using a number of relevant variables, including MODIS cloud and snow fractions. Several other recent papers impute AOD as part of a multi-stage process using RF (e.g., see Stafoggia et al. [24], Huang et al. [25], and Zhang et al. [26]). However, judging performance based on "out-of-bag" measures or random holdouts of observed data may be misleading in spatial prediction problems with large contiguous areas of missing data. Furthermore, when a strong spatial pattern is present as in AOD, it is unclear how RF performs compared to spatial statistical models. Jiang et al. [27] use RF to gap-fill AOD in China and evaluate performance using aerosol robotic network (AERONET) measurements [28,29]; however, there are relatively few locations with AERONET measurements on which to validate performance on any given day.
Importantly, models for gap-filling AOD are generally more costly to fit than models for estimating PM 2.5 , due to the much larger number of daily observations. For example, in our case study, using a modeling grid of 12 km spatial resolution over the contiguous United States entails over 50,000 daily cells. While several studies have used machine learning methods for AOD gap-filling to overcome the computational costs, traditional spatial statistical methods like Kriging are not well-suited to handle large datasets due to the need to invert the spatial covariance matrix. Over the course of the last decade or so, several spatial statistical methods have been developed to handle big data [30,31]. Although considerable attention has been given to using ensemble and hybrid approaches for estimating PM 2.5 (e.g., [32][33][34][35]), AOD gap-filling for large areas has received less focus, possibly due to the greater computational cost.
To our knowledge, studies have not thus far considered ensemble methods for combining large-scale spatial statistical methods with machine learning methods for gap-filling AOD. In this study, we focus on a particular spatial statistical method, lattice kriging (LK) [36], together with RF for AOD gap-filling. Our study considers both RF and LK models for gap-filling MODIS AOD, as well as ensemble methods for combining these predictions following the super learner methodology [37,38]. Our case study focuses on the contiguous United States using daily data for the month of July 2011. We focus on a single month for computational reasons as each AOD model is fit daily to tens of thousands of observations, and we perform 10-fold cross-validation on each day as part of the ensemble method construction, resulting in an additional 10 models fit per day for LK and RF. We assess performance using spatially clustered holdouts for AOD gap-filling models that may more accurately measure performance than more commonly used approaches. Finally, we assess whether the imputed AOD product using ensemble methods improves PM 2.5 estimation in a random forest model. Broadly, we find that ensemble methods can be effective for AOD gap-filling, but there is less evidence to suggest an ultimate benefit for PM 2.5 estimation.

Study Area
The study area of interest is the contiguous United States, consisting of 48 states, and Washington DC, using daily data for the month of July 2011. We focused our analysis on July, as summer months generally have less AOD missingness than other months on average, while retaining substantial day-to-day variability in missingness. Descriptions of the data sources follow the work of Hu et al. [17].

PM 2.5 Measurements
We obtain measurements of PM 2.5 from the U.S. Environmental Protection Agency (EPA) Air Quality System (AQS) (https://www.epa.gov/outdoor-air-quality-data). We used 24h averaged concentrations collected from 1248 federal reference method samplers.

MODIS AOD
For the purpose of this study, we utilize Collection 6 level 2 Aqua MODIS retrievals at 550 nm wavelength using the MYD04_L2 product [4,39]. High-confidence AOD retrievals from the combined deep-blue/dark target parameter were used [8]. AOD at 550 nm is the main product provided by the various mainstream aerosol instruments, and 550 nm is the most common wavelength used in research applications for aerosols (https://atmosphereimager.gsfc.nasa.gov/faqs/aerosol). Following previous work [17], these retrievals at a resolution of 10 km are regridded to 12 km × 12 km community multi-scare air quality (CMAQ) grids. We consider daily MODIS AOD data from July 2011, for a total of 53,807 daily cells in the contiguous United States. The proportion of cells in which daily AOD is observed ranges from a minimum of 26.33% to 54.63%, with an average of 41.08%. The top row of Figure 1 demonstrates two days with the least and most missing observed AOD points. We used MODIS AOD rather than a finer scaled product (e.g., 1 km 2 products), in part because our goal was to explore large-scale variation in the national map for AOD, and in part due to computational costs.

GEOS-Chem AOD
GEOS-Chem is a "global 3-D model of atmospheric chemistry driven by assimilated meteorological observations from the Goddard Earth Observing System (GEOS) of the NASA Global Modeling Assimilation Office (GMAO)" (http://acmg.seas.harvard.edu/ geos/) [40]. We utilize version 10.1 of the model, using GEOS-5 meteorological data for 2011, with total column AOD calculated as the sum of 6 AOD parameters (sulfate-nitrateammonium, black carbon, organic carbon, accumulation-mode sea-salt, coarse-mode seasalt, and total dust) over 37 vertical layers (from the surface up to ≈20 km) [17,41].

Meteorological Variables
We obtained meteorological data from the North American Land Data Assimilation System phase 2 (NLDAS-2) (https://ldas.gsfc.nasa.gov/nldas/) [42,43]. These data have a spatial resolution of approximately 13 km and are available hourly. For this analysis, we use pressure at surface (pa), u-and v-direction wind speed (m/s), temperature (K), relative humidity (%), precipitation (kg/m 2 ), fraction of total precipitation that is convective (no units), convective available potential energy (J/kg), surface DW shortwave radiation flux (W/m 2 ), surface DW longwave radiation flux (W/m 2 ), and potential evaporation (kg/m 2 ). Measurements are averaged from 10 a.m. to 4 p.m. local time to construct daily daytime observations, roughly coinciding with the Aqua overpass time (about 1:30 p.m.).

Land Use
We include elevation obtained from the National Elevation Dataset at 30 m spatial resolution (https://viewer.nationalmap.gov/basic/). We obtained total length of highways (m), total length of limited-access road (m), and total length of local road (m) from ESRI StreetMap USA (Environmental Systems Research Institute, Redlands, CA, USA). Forest cover (unitless) and impervious surface (%) are derived from the National Land Cover Database (https://www.mrlc.gov/). In addition, we include point emissions data for PM 2.5 and PM 10 combined (in tons) from the EPA 2011 National Emissions Inventory report (https://www.epa.gov/air-emissions-inventories). Population density is obtained from the 2010 Census at the tract level (population/km 2 ).

Data Integration
Data were projected into a common coordinate system using the U.S. Lambert conformal conic projection. For each 12 km × 12 km grid cell, forest cover, impervious surface, and elevation were averaged, while road length and point emission values were summed. Meteorological variables and population density were assigned based on nearest distance. Grid cells containing multiple PM 2.5 monitors for a day were averaged.

Statistical Methods
We conduct AOD gap-filling using two distinct methods from the spatial statistical and machine learning fields, respectively, as well as methods for combining them.

Lattice Kriging
Lattice kriging (LatticeKrig or LK) is a multi-resolution Gaussian process model [36]. At a high-level, LK models the spatial process using several levels of two-dimensional basis functions, which are laid out on a grid and approximately double with each successive layer. These basis functions are compact, which means that for a particular point, only a small number of basis functions are used to make the prediction. The coefficients associated with the basis functions are assumed to be correlated, and this structure can flexibly model observed spatial covariance structures. Estimation proceeds through a likelihood-based approach after specifying various tuning parameters.
Following Nychka et al. [36], we observe {y i } at locations {x i } for i = 1, .., n. We assume that {y i } follow an additive model consisting of a mean function based on covariates, a spatial process, and a measurement error term: where d is a p × 1 vector of fixed coefficients associated with the covariates Z i , and g(x i ) denotes the spatial process. The mean-zero error terms i are presumed to be independent and identically distributed, i.e., ∼ N(0, σ 2 I), where = ( 1 , ..., n ) T . The overall spatial process g(x i ) can be written as a sum of L independent spatial processes g l (x i ): where φ j,l denotes the the lth level of resolution's jth basis function, and c l j denotes the coefficient associated with this basis function. Although the basis functions and number of levels are fixed (i.e., chosen), the coefficients for each level l, c l = (c l 1 , ..., c l m(l) ) T are assumed to follow a multivariate normal with mean zero and covariance ρQ −1 l : Each level's spatial process is independent with marginal variance ρα l subject to the constraint ∑ L l=1 α l = 1, so that the marginal variance of the overall spatial process g(x i ) is ρ. The main parameters that need to be selected are the number of levels of resolution (L), the number of grid points along the largest dimension at the coarsest level of resolution (which in turn determines the total number of basis functions), the relative weight of each spatial level's process (parameterized by ν), and the scale/range parameter (a). We provide a more thorough description of the model in the Supplemental Materials, and we also refer readers to the originating paper [36], the comparison paper by Heaton et al. [30], and the documentation for the R implementation (version 8.4) [44].

Random Forest
Random forest (RF) consists of constructing a large number of regression trees [23]. At a high level, regression trees search for the best (as determined by mean-squared error) binary split among the covariates (i.e., features), and then split the data accordingly. This process continues until some condition is met (e.g., there is only 1 observation left, so no further split can be made). Two key components of RF are: (1) bagging, or bootstrap aggregation, wherein each tree is fit to a random sample (with replacement) of the original sample; and (2) at each node, the algorithm considers only a random subset m of the original p predictors for deciding on the best split. A single decision tree would likely overfit to the data. Averaging many trees that implement these two components results in reducing variance, while maintaining a low bias from the procedure.
The key parameters are the number of trees (B), the number of predictors to randomly select for each split (m, or mtry) from the original p, and, to a lesser extent, the node size (n size ). In general, we choose the number of trees B to be large, and we use validation data to help select m and n size . For this study, we use the ranger package (version 0.12.1) in R [45].

Super Learner Methods
Super learners (SL), related to stacked generalization and stacked regression methods [46], use a potentially large and diverse set of algorithms by weighting their predictions optimally according to some risk measure such as squared error loss. Although a large number of algorithms are recommended in practice, we use just RF and LK as our algorithms in order to demonstrate the use of SL and to maintain focus on the cross-validation approach. The process for super learners is as follows [37,38,47,48]: Divide observed data into k folds.

2.
For each fold k, let the kth fold be the validation data, and the remainder be the training data. Fit each algorithm or model m to the training data and make predictions on the kth fold.

3.
Stack all predictionsŷ m for each algorithm.

4.
Estimate the weights α m for algorithm m = 1, . . . , M using the model formulation where α m ≥ 0 and ∑ M m=1 α m = 1. α m can be estimated by non-negative least-squares methods and then normalizing the weights to sum to 1.
After these α m model weights are estimated through the cross-validation process, each algorithm is fit to the full observed data, and test data predictions are made by using these weights for combining predictions. Davies and van der Laan [48] provide a discussion of extending SL theory to the case of spatial data. Murray et al. [35] use a similar stacked regression approach for determining weights in combining separate models for PM 2.5 prediction.

AOD Gap-Filling Analysis
From the observed AOD data, we consider a spatially clustered approach for creating a testing dataset on which to evaluate the results. In the proposed method, ten random AOD observations are selected from the observed AOD values for each day in July 2011. These observations and any other observations within a 250 km radius are then held out as the test dataset. We selected ten random points, in particular, to ensure that different areas of the observed data had a sufficient chance of being held out for the testing data, so that the performance of the different methods for a particular day was not likely to be judged solely on predictions for one area of the observed map. We also observed that this procedure resulted in close to a 70/30 split between training and testing data, on average. Figure 1 demonstrates the observed and training data on two particular days. This approach to creating testing data is an attempt to mimic the actual observed pattern of AOD data, where large contiguous areas are missing and require imputation. In particular, for many missing observations, there are likely no points nearby to aid in prediction. In our analysis, we consider each day separately for model fitting and prediction.
For RF and LK, we used validation data taken from the training data following a similar approach. For RF, we use 2000 trees with a node size of 5 and m = 7. We include 22 variables: the projected centroid coordinates, elevation, the 2011 emission inventory, forest cover, impervious surface, total lengths of highway, limited-access road, and local road, population density, potential evaporation, surface DW longwave radiation flux, surface DW shortwave radiation flux, convective available potential energy, fraction of total precipitation that is convective, precipitation, relative humidity, temperature, u-and v-direction wind speed, pressure at surface, and GEOS-Chem AOD. For LK, we set the number of levels of resolution to 5, a = 12, ν = 0.1, and the number of grid points along the largest dimension at the coarsest level of resolution to 15. By default, the LatticeKrig package includes the spatial coordinates as fixed predictors in Z. In addition, we include the interaction between the coordinates, GEOS-Chem, the interaction between each coordinate and GEOS-Chem, and elevation as the fixed predictors in the model. No variable selection was performed-we instead focused on tuning the spatial aspect of the model.
In addition to comparing RF and LK, we aim to consider SL approaches for combining these distinct methods. For each day, we construct 10 cross-validation folds using the blockCV R package (version 2.1.1) [49]. This constructs spatial blocks for the validation dataset, so that performance more accurately mimics the task of gap-filling AOD. In the Supplementary Materials ( Figure S6), we provide a full set of the maps showing these spatial block cross-validation folds. Sarafian et al. [50], Murray et al. [35] (for PM 2.5 prediction) and Young et al. [51] (for NO 2 ) also consider spatially clustered cross-validation approaches for assessing model performance. Based on the cross-validation folds, we stack LK and RF validation predictions as discussed in the previous section. We assess 4 different methods for combining LK and RF:

1.
Average. We construct a simple average of RF and LK predictions. Cross-validation data are not used in this approach.

2.
SL: overall. After stacking all of the cross-validation predictions for all days together, we produce a single set of optimal weights with (4) for making predictions. 3.
SL: daily. We stack cross-validation predictions for each day separately, and we obtaining a daily set of optimal weights with (4).

4.
SL: distance-based. For each cross-validation fold on each day, we determine the closest distance between each point in the cross-validation fold and the training data.
We then stack all of the cross-validation predictions across days together with these nearest-neighbor distances. We bin these stacked predictions according to nearestneighbor distances with bin widths of 25 km from 0 to 300 km and higher. Using (4), we estimate the optimal weights for LK and RF for each binned distance grouping. We then fit a simple loess model relating interval mid-point distance and optimal weight, and we use these fitted optimal weights for combining LK and RF for predictions. The motivation for this last technique is that the further the unobserved point is from the observed data, the more different algorithms may be in predictions. If there is strong spatial correlation, then LK may perform better; in contrast, if there is limited range in the spatial correlation and the covariates are more important, then RF may produce better fits based on the relationship between the covariates and response.
In all cases, we restrict weights to be between 0.1 and 0.9. We primarily assess model performance on the basis of root mean square error (RMSE) and the coefficient of determination (R 2 ), as well as the intercept and slope from a linear model relating the true AOD observations to the predicted values.

PM 2.5 Analysis
We compare several random forest models for PM 2.5 concentration estimation to determine whether including gap-filled AOD as a predictor can improve prediction performance. Our gap-filled AOD is based on the super learner distance-based method. There are five variations on the random forest models' features: The other features included in the random forest models are the same as those in the AOD analysis. All models except M5 additionally include an indicator variable for whether AOD was observed at the location, and all models include a so-called convolution layer of PM 2.5 . Several analyses [17,34] have demonstrated that a weighted-average of nearby PM 2.5 observations can aid in model prediction for PM 2.5 . Briefly, for each location, the convolution layer of PM 2.5 is a weighted average of all other training PM 2.5 observations, not including the location itself. The weights are inversely proportional to the squared distance between locations (less distant observations in the training data are weighted more). The procedure for creating the convolution PM 2.5 layer must be repeated for each training/validation split for each day.
We consider three distinct 10-fold cross-validation approaches for assessing performance: • Random: Cross-validation folds are constructed by selecting observed PM 2.5 monitors at random on a daily basis. • Constant spatial clusters: Cross-validation folds are constructed by creating spatially clustered areas that are constant across all days. A particular area of the map will be assigned to the same cross-validation fold for each day. • Varying spatial clusters: Cross-validation folds are constructed by creating spatial clusters at random for each day.
Spatial clusters are constructed using the blockCV package (version 2.1.1) in R with block widths of 150 km. Figure S8 in the Supplemental Materials displays the constant spatial construction by color-coding monitor locations.
Models were fit both for each day separately, as well as for all of the days in July 2011 together. In the latter spatio-temporal random forest model, day of the year and day of the week are additionally included as integer predictor variables. Our primary metrics of interest are RMSE, R 2 , and the full prediction maps, but we also present the intercept/slope estimates from fitting a linear regression model with the true PM 2.5 values as the dependent variable and the random forest prediction as the independent variable. For all models, we set the number of trees to 2000. We varied m (mtry) between values of 4, 8, 12, and 16 and presented the best results for each model and cross-validation fold type on the basis of RMSE. The full maps and feature importance results are based on m = 4. The Supplemental Materials include additional figures and tables for m = 8.

AOD Gap-Filling
We highlight several results from our analysis. First, daily results show that any of the average/super learner approaches match or exceed performance from either LK or RF alone on a majority of days on the basis of RMSE (Table 1, Figure S2). While there are some days where either LK or RF does particularly well, there are also days where these two methods perform worse than any of the other methods. The ensemble methods are the best or close to the best in terms of RMSE and R 2 on the majority of days. The distance-based SL method performs best on more days (10 out of 31 days) than any of the other approaches considered. Based on our described approach, the distance-based SL prediction weights LK greater for testing points that are close to training data, while for points further away from the training data, it weights RF more, relying on a combination of location, land use, and meteorological features (see Figure S7 in the Supplemental Materials).
Evaluating test predictions across all 31 days together, LK and RF have R 2 values of 0.644 and 0.619, respectively. The average and SL methods improve to R 2 values in the range of 0.657 to 0.659. Compared to LK alone, the RMSE is reduced 2.34% and 2.30% in the distance-based and overall SL models, respectively. A simple average of the LK and RF predictions also provides most of these gains. The super learner method based on the daily construction of weights performs well but is marginally worse than the other ensemble methods. In explaining this small difference in performance, we posit that the daily weight estimates may be slightly over-fitted to the daily training data relative to the other ensemble methods, which benefit from pooling cross-validation predictions across days together for the estimation of the super learner weights.
Both LK and RF methods diverge substantially from using just the observed AOD data in the July 2011 averages (Figure 2). LK and RF predictions are notably different from each other in the Appalachian Mountains and in parts of the Southwest, where LK predicts higher values relative to RF (Figure 3). There are also apparent edge effects in some areas such as south Florida, where LK predictions will tend to diverge more substantially from those of RF. These may be partly due to issues with LK's coefficient estimation in areas where there are few data for a particular day (see Figures S3 and S4 in the Supplemental Materials for plots of daily AOD predictions and daily differences between LK and RF).

PM 2.5 Prediction
The gap-filled AOD used in the PM 2.5 analyses is based on the SL distance-based method for combining RF and LK predictions on the basis of the results in the AOD analysis. We first highlight a few notable results from the daily random forest models. First, the daily random forest models suggest that RMSE is improved consistently but only marginally by including the imputed AOD predictor vs. the four alternatives ( Table 2, M3a). The outlier model is M5, which trains solely on observations where AOD is observed and predicts using the imputed AOD where AOD is missing. The results from model M5 are substantially worse than the other models, with a relatively biased prediction map (Figure 4d). For the remainder of the results, we omit discussion of this model. Generally, the gain in RMSE for the model using gap-filled AOD (M3) is small and ranges from 0.01 to 0.03 µg m −3 against the other models. The results for R 2 are similar, with small gains of approximately 0.002 to 0.006. Second, the daily random forest models tended to have better PM 2.5 prediction in locations where AOD was not observed as compared to areas where AOD was observed, regardless of the features included. Third, cross-validated RMSE is substantially larger in spatial cross-validation settings than in the case with folds consisting of randomly selected locations.
The spatio-temporal random forest results in the second set of columns in Table 2 show somewhat different patterns. RMSE and R 2 are generally improved over the daily models for the random and varying spatially clustered cross-validation analyses, but there is no longer a benefit to including imputed AOD. On the contrary, the model predictions tend to do better when neither AOD nor GEOS-Chem are included on the basis of R 2 and RMSE. The exception to these results are in the constant spatially clustered cross-validation setting-here, there is some very marginal improvement from including imputed AOD over the other models. We posit that in spatio-temporal models, multiple days' observations in the same area as where we intend to make a prediction on a different day can largely diminish the predictive power of AOD. However, when the same spatial area is consistently missing, the model can no longer rely on other days' observations for the same area to improve prediction accuracy. Notably, this setting mirrors qualities of producing full maps of PM 2.5 observations. Given that there is a fixed network of monitors (not all of which operate on every day), PM 2.5 prediction is primarily focused on areas where there is never a monitor present. We emphasize that the percentage decrease in RMSE from including the imputed AOD in this constant spatially clustered cross-validation setting for the spatio-temporal random forest models is small at 1.25%, 0.67%, and 0.77% compared to the models with no AOD or GEOS-Chem, just GEOS-Chem, or the combined AOD/GEOS-Chem variable, respectively. Notably in this case, daily models slightly outperform the performance of the spatio-temporal models. In the other cross-validation settings, RMSE and R 2 both improve substantially from fitting a full spatio-temporal model over a series of daily models.
Results for RMSE and R 2 by region (as defined by NOAA) and cross-validation setting are also provided in Tables 3 and 4. RMSE results tend to be worst in the West, Southwest, and Central regions across cross-validation settings. Notably, although RMSE is quite low for the Northwest region, the R 2 for this area is comparatively low, suggesting low pollution levels but also poor prediction performance from the models. The spatio-temporal models improve the regional RMSE and R 2 as compared to the daily models, except for the constant spatially clustered cross-validation setting, where there is variability in changes in regional performance. Variable importance metrics for the m = 4 setting based on the spatio-temporal models are presented in Table S1 using the permutation-based method [23]. Briefly, this importance metric denotes the increase in mean-squared error on the OOB sample for each tree after permuting the values of the feature. On this basis, the convolution layer of PM 2.5 is the most important predictor for these models. When imputed AOD is included, the relative importance of several other variables is slightly diminished. While imputed AOD is not the most important feature, it appears substantively important on the basis of a mean decrease in accuracy. Additional feature importance tables are provided in the Supplemental Materials for m = 8 (Table S2). In general, for larger m, the convolution layer of PM 2.5 becomes more important-it is more likely to be selected as the optimal feature for splitting a node as m increases, and it is a particularly strong predictor.
When comparing model M3 to models M1, M2, and M4, the average of monthly mean differences are close to 0 µg m −3 , but the monthly mean differences are apparently spatially correlated (Figure 4). The model trained only on points where AOD is observed (M5) leads to over-estimated average monthly values of PM 2.5 relative to the model using gap-filled AOD, with an average difference of 0.25 µg m −3 . The standard deviation of daily differences for all grid cells for July 2011 is 0.44 µg m −3 for M3 and M1, 0.32 µg m −3 for M3 and M2, and 0.38 µg m −3 for M3 and M4, suggesting small but meaningful variability in the daily model predictions. Some of the largest daily differences between models tend to be in areas with sparse air pollution monitors ( Figure S11). Figure 5 shows the July 2011 averaged values from M3 with the gap-filled AOD as a feature in the spatio-temporal random forest model.   Table 2. R 2 and root mean square error (RMSE) result from the daily and spatio-temporal random forest models for three different 10-fold cross-validation settings. Summary metrics are evaluated on all observations as well as separately on the basis of AOD's missingness status (observed or missing). The M3a and M3b random forest models include gap-filled AOD as a predictor.

Discussion
We highlight the main findings of this study in three points. First, we emphasize the importance of constructing testing and cross-validation data that mimic the missing data patterns for both AOD and PM 2.5 prediction. Previously reported metrics for AOD gapfilling using RF may be over-stated if using out-of-bag (OOB) metrics [10,24], as using large contiguous areas for testing suggests substantially lower R 2 . Different cross-validation settings for PM 2.5 model evaluation also suggest that performance varies considerably based on the manner of holdout, echoing the findings of previous studies that "spatial" cross-validation performance metrics are typically worse than random cross-validation metrics. In our study, our spatial cross-validation procedures leave out spatially clustered sets of monitors as in several recent studies [35,50,51]. Our results show roughly similar performance metrics for PM 2.5 estimation compared to previous RF results when using the random cross-validation setting, with ≈0.80 (≈2.99) vs. 0.81 (2.78) R 2 (RMSE) for summer 2011 in Hu et al. [17]. The small difference in performance in our model as compared to that of Hu et al. [17] is likely because we fit the model solely to data in July 2011 rather than the full year, and because we did not include additional variables such as convolutional layers for land use terms, which have been shown to improve overall performance.
Second, we demonstrate how super learner approaches combining a large-scale spatial statistical method and machine learning predictions can improve upon the performance of each constituent predictor, and how the super learner method can be further modified for the particular task of AOD gap-filling. Future work should examine extensions to more machine learning and spatial statistical methods. For example, several recent studies have highlighted a number of spatial statistical methods with promising predictive performance and low computational costs [30,31,[52][53][54][55][56][57][58], and using these in an ensemble approach may provide further improvements. Spatial data present additional theoretical challenges for super learner methods, given that the training data and testing data will generally not be independent of each other [48]. A limitation of the current study is the limited time frame and the use of daily rather than spatio-temporal AOD gap-filling models. We focused on July 2011 as there was, on average, less missingness in AOD in the summer than in other months, and we limited our analysis to a single month due to the high computational cost of fitting daily models with 10-fold cross-validation for the super learner. Future studies may consider expanding the time frame of spatial prediction beyond a month and including spatio-temporal statistical models that may better utilize the available observed data.
Finally, we demonstrate that imputed AOD using our proposed ensemble method can have a very small impact on particular RF models for estimating PM 2.5 concentrations, depending on the cross-validation setting. With a convolution layer of PM 2.5 and a rich set of other features, we generally find that AOD (imputed or not) is not strictly needed for good prediction of PM 2.5 in RF models as judged by R 2 and RMSE. However, population-level metrics like R 2 and RMSE may be misleading in masking improved small-scale predictions, and we find subtle differences in the predicted values between models, with and without gap-filled AOD as a predictor. Similarly, Huang et al. [25] find meaningful differences in daily PM 2.5 predictions in models with and without AOD as a predictor, particularly in areas of the map with sparse PM 2.5 monitors and on high-pollution days. Thus, although gap-filling AOD in the proposed ensemble approach adds to the computational costs of making PM 2.5 predictions, as compared to fitting a model without AOD as a predictor, there may be some marginal benefit to the quality of PM 2.5 predictions. A limitation of the current study is the lack of certain variables for AOD gap-filling and PM 2.5 estimation; previous work has found that the inclusion of cloud and snow fractions may improve AOD gap-filling and produce meaningful visual improvements in PM 2.5 estimation [10]. Moreover, finer resolution AOD products such as multi-angle implementation of atmospheric correction (MAIAC)-derived AOD may provide greater predictive power for PM 2.5 [59] at the expense of increasing the computational costs of gap-filling. Future studies should examine the computational costs and benefits of using the proposed method to gap-fill AOD for predicting PM 2.5 at the 1 km resolution.

Conclusions
This article considered a recently developed large-scale spatial statistical method (LK), a popular machine learning method (RF), and various combinations of these methods for gap-filling daily MODIS AOD on a 12 km × 12 km grid in the contiguous United States in July 2011. Using large contiguous areas as holdouts for our performance comparison, we found that ensemble approaches can improve daily AOD gap-filling as compared to the individual methods on their own, on the basis of RMSE and R 2 . The ultimate goal of making improvements in gap-filling AOD is to improve the performance of PM 2.5 prediction models and to provide complete spatial coverage. For this task, we compared several daily and spatio-temporal random forest models, with and without the inclusion of gap-filled AOD as a predictor. These results were mixed, showing very small but consistent improvements to RMSE and R 2 by including the gap-filled AOD in the daily models, but varying results in the spatio-temporal models. However, in the more realistic spatially clustered cross-validation setting, where spatial clusters of the observed locations are assigned to the same cross-validation fold for all days, we find small improvements for PM 2.5 from including the gap-filled AOD predictor. Furthermore, although the differences in RMSE and R 2 are small, daily prediction maps suggest some meaningful differences between the considered models, particularly in areas with sparse monitoring locations. Future research should extend this work by considering computationally efficient spatiotemporal statistical approaches for gap-filling AOD (as compared to daily gap-filling models), increasing the time frame of study, and considering gap-filling at finer resolutions (e.g., MAIAC AOD at 1 km resolution).
Supplementary Materials: All codes for replicating the analyses in this paper are provided at https://github.com/bzki/aodpm25_paper. Additional figures, tables, and details are available online at https://www.mdpi.com/2072-4292/13/1/126/s1. Figure S1: Daily split between training and testing data for AOD experiments, Figure S2: RMSE and R 2 across days for various methods in AOD experiments, Figure S3: Daily observed and predicted AOD values, Figure S4: Daily differences between LK and RF AOD predictions, Figure S5: Differences in average predictions for various methods, Figure S6: Spatial cross-validation folds for PM 2.5 , Figure S7: Comparison of LK and RF at difference distances between test and training data, Figure S8: Constant spatial clustering crossvalidation map for PM 2.5 , Figure S9: Average PM 2.5 predicted map, Figure S10: Average differences from gap-filled AOD RF model, Figure S11: Daily differences from gap-filled AOD RF model, Figures S12-S14: Scatter plots of predicted and observed PM 2.5 , Tables S1 and S2: Feature importance tables from RF models for PM 2.5 , Tables S3-S5: Intercept and slope estimates from PM 2.5 cross-validation, Section S3: Additional LatticeKrig modeling details.

Acknowledgments:
We thank Xuefei Hu for sharing the dataset and Douglas Nychka for generously responding to questions about the LatticeKrig package.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: