1. Introduction
Exposure to ambient air pollution is a well-established risk factor for multiple adverse health outcomes. Particulate matter less than 2.5 μm in aerodynamic diameter (PM2.5) has been shown to be especially harmful due to its ability to penetrate deep into the respiratory system [1,2,3,4,5,6]. PM2.5 represents a chemically diverse mixture of pollutants, and its major sources include electricity generation, motor vehicles, and wildland fires. Accurate estimation of PM2.5 concentration is an important component of air pollution health research [7,8].
A network of monitors established for regulatory purposes provides precise, direct measurements of PM2.5. However, these monitors are relatively expensive to install and maintain and are therefore both spatially sparse and preferentially located in prioritized areas. Two additional data sources are highly correlated with PM2.5 and enable the development of methods to predict PM2.5 beyond locations with ground monitors. The first is output from chemical transport models (CTMs), which simulate air quality based on emissions, physical processes, and meteorological data. Gridded CTM simulations provide complete spatial coverage but require bias correction with observations due to uncertainties in model inputs, discretization, and other sources of error [9]. The second is satellite measurements of aerosol optical depth (AOD), which provide measures of aerosol for the entire atmospheric column at potentially finer spatial resolution than CTMs [10,11,12]. However, AOD also requires transformation and bias correction and is subject to high missingness due to cloud cover and retrieval error [13,14,15].
Previous studies have explored the use of CTM or AOD for predicting PM2.5 concentrations [16,17,18]. An important and active area of air pollution exposure assessment research is the development of methods that utilize both CTM and AOD to predict PM2.5, addressing the limitations associated with each data type [17,19,20]. One such framework combines predictions from geostatistical models trained separately on AOD or CTM using Bayesian ensemble averaging [21]. This approach utilizes all available monitoring, AOD, and CTM data, providing fine-scale estimates when AOD is available and gap-filled estimates from CTM when AOD is missing. While current machine learning approaches—such as random forests, support vector machines, and neural networks—can yield high predictive accuracy and computational efficiency [20], our approach additionally provides probabilistic uncertainty quantification for each prediction via prediction intervals and standard deviations. These outputs are crucial for downstream exposure health analyses. For this reason, our framework has been adopted by the Multi-Angle Imager for Aerosols (MAIA) project to estimate daily PM2.5, PM10, and speciated PM2.5 major components in multiple large population centers around the world [22,23]. A more detailed description of our method and its comparison to related approaches is provided in Murray et al. (2019) [21].
Here, we introduce an R package, ensembleDownscaleR, developed with R (version 4.4.1), containing a suite of functions that facilitate the adaptation of this modeling approach to other settings and air quality modeling applications. We describe the statistical method from [21], as well as model extensions implemented for MAIA. We also detail the functionality of the R package and provide an example analysis using 2018 data from the Los Angeles metropolitan area to estimate daily PM2.5 at 1 km spatial resolution. This work is structured as a tutorial designed to guide practitioners in the use of the ensembleDownscaleR R package. Each stage includes a methodological introduction, code examples applied to the Los Angeles dataset, and corresponding results. We have made the data available on Zenodo (https://zenodo.org/record/14996970, accessed on 24 March 2025) and provided all code used for model fitting, data processing, and plot/table creation on GitHub (https://github.com/WyattGMadden/ensembleDownscaleR_tutorial, accessed on 24 March 2025) to ensure full reproducibility of the results and to assist users in their own analyses. The ensembleDownscaleR R package is available for download on GitHub (https://github.com/WyattGMadden/ensembleDownscaleR, accessed on 24 March 2025).
2. Case Study Data
Our analyses correspond to a spatial region overlapping the Los Angeles metropolitan area, ranging from −120.50 to −115.75 longitude and from 32.00 to 35.70 latitude, and a temporal range of 1 January 2018 to 31 December 2018 (Figure 1). This region contained 60 monitoring stations from the Air Quality System (AQS) that report daily average PM2.5 measurements.
AOD data were obtained from the Multi-Angle Implementation of Atmospheric Correction (MAIAC) with a spatial resolution of 1 km [24]. Chemical transport model (CTM) data were obtained from the Community Multiscale Air Quality (CMAQ) model, which provides daily PM2.5 simulations at a 12 km spatial resolution [25]. We spatially linked the AQS monitor, AOD, and CTM data using the MAIAC 1 km grid for model fitting and prediction.
Additional spatial land use and spatio-temporal meteorological covariates were matched to both monitor locations and grid cells. These include elevation (elevation), population density (population), cloud cover (cloud), east–west wind velocity (u_wind), north–south wind velocity (v_wind), height of the planetary boundary layer (hpbl), shortwave radiation (short_rf), and humidity at two meters above the ground (humidity_2m). Elevation was obtained from the ASTER Global Digital Elevation Model [26]. Population density was obtained from the Oak Ridge National Laboratory LandScan USA 2018 dataset [27]. Cloud cover was obtained from the NASA MODIS atmosphere L2 cloud product [28]. Wind velocities, shortwave radiation, and height of the planetary boundary layer were obtained from the NASA NLDAS-2 Forcing Dataset [29]. Additional details on the source and spatial resolution of each covariate are provided in the Supplementary Materials.
We compiled the Los Angeles data into four datasets that are used in both the subsequent analyses and the accompanying tutorial: PM2.5 monitor-linked CTM data and covariates, PM2.5 monitor-linked AOD data and covariates, grid cell CTM data and covariates, and grid cell AOD data and covariates. The grid cell datasets include all days in the month of July 2018, rather than the full year, to conserve memory. Daily monitor PM2.5 observation missingness averaged 27% (range 0–45%, IQR 10–38%), CTM data were available for all grid cells in the study area on all days, and daily AOD missingness averaged 11% (range 2–79%, IQR 6–35%) of the grid cells in the study area. This resulted in 15,821 observation–CTM pairs, 11,668 observation–AOD pairs, and a maximum of 122,735 prediction grid cells for each day. Further details on these datasets are included in the Supplementary Materials.
3. Methods and Results
In this section, we provide a case study demonstrating all stages of the Bayesian ensemble fitting and prediction process. For each stage, we detail the methods used, provide code examples to illustrate the use of relevant ensembleDownscaleR package functions, and present corresponding results. We break up this complete workflow for producing Bayesian ensemble PM2.5 predictions into six stages, detailed in Figure 2 and as follows:
1. Fit two separate Bayesian downscaler regression models on the monitoring PM2.5 data, one spatially and temporally matched with CTM data and the other matched with AOD data.
2. Produce estimates of PM2.5 (posterior predictive means) and variances for all times and grid cells using the CTM-based model from stage 1. Produce estimates of PM2.5 means and variances for all times and grid cells for which AOD is available using the AOD-based model.
3. Use cross-validation to produce two sets of out-of-sample PM2.5 prediction means and variances using the same data and model forms as in stage 1. This produces two datasets of out-of-sample prediction means and variances for each monitor observation.
4. Estimate spatially varying weights from the out-of-sample prediction means and variances from stage 3 and the monitor PM2.5 measurements.
5. Use Gaussian process spatial interpolation (kriging) to predict weights for all grid cells in the study area from the stage 4 weight estimates.
6. Use the mean and variance estimates from stage 2 and the weight estimates from stage 5 to compute ensemble predictions of PM2.5 at each time and grid cell in the study area.
The total computation time, with 25,000 Markov chain Monte Carlo (MCMC) iterations per model fit (two stage 1 model fits, 20 stage 3 model fits, and one stage 4 model fit), was approximately 11.29 h on an Apple MacBook Pro laptop (macOS Sequoia, version 15.5) with an Apple M3 Max processor and 36 GB of RAM.
3.1. Stage 1: Downscaler Regression Model
This section details the model specifications available for Bayesian downscaler regression model fitting using the grm() function.
3.1.1. Model
The Bayesian downscaler regression model is formulated as a spatial–temporal regression of PM2.5 against $X$, which is either AOD or CTM depending on user input. The statistical model is as follows:

$$Y(s, t) = \alpha(s, t) + \beta(s, t) X(s, t) + \boldsymbol{\gamma}^\top \mathbf{L}(s) + \boldsymbol{\delta}^\top \mathbf{M}(s, t) + \epsilon(s, t),$$

where $Y(s, t)$ is the PM2.5 concentration, $\alpha(s, t)$ and $\beta(s, t)$ are the spatial–temporal intercept and AOD/CTM slope of the regression model at location $s$ and time $t$, $\boldsymbol{\gamma}$ and $\boldsymbol{\delta}$ are fixed effects for spatial and spatio-temporal covariates $\mathbf{L}(s)$ and $\mathbf{M}(s, t)$, respectively, and $\epsilon(s, t) \sim \mathcal{N}(0, \sigma^2)$. Here, $\sigma^2$ is modeled with an inverse gamma prior distribution, $\sigma^2 \sim \mathrm{IG}(a_\sigma, b_\sigma)$, where the $a_\sigma$ and $b_\sigma$ hyperparameters are specified with the sigma.a and sigma.b arguments in the grm() function. For these and the remainder of the hyperparameter arguments, defaults are set to represent uninformative priors.
The slope and intercept parameters are composed of the following additive spatial and temporal random effects and fixed effects:

$$\alpha(s, t) = \alpha_0 + \alpha_s + \alpha_t, \qquad \beta(s, t) = \beta_0 + \beta_s + \beta_t,$$

where the spatial random effects $\alpha_s \sim \mathcal{GP}\left(0,\ \tau^2_\alpha K(\theta_\alpha; d)\right)$ follow a Gaussian process ($\mathcal{GP}$) depending on a user-specified kernel $K$ with range parameter $\theta_\alpha$ and distance $d$, and the temporal random effects $\alpha_t$ are set as a first-order random walk ($\beta_s$ and $\beta_t$ are specified analogously). Normal priors are applied to the fixed effects $\boldsymbol{\gamma}$ and $\boldsymbol{\delta}$, which in a Bayesian framework induce an $L_2$ penalty with equivalence to a ridge-regression penalty in a frequentist setting [30]. This provides theoretically justified shrinkage when covariates are numerous or correlated.
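The prior-to-penalty correspondence can be made explicit with a short derivation sketch (illustrative notation: $\mathbf{Z}$ is a generic covariate matrix with coefficients $\boldsymbol{\gamma}$):

```latex
% With likelihood Y ~ N(Z gamma, sigma^2 I) and prior gamma ~ N(0, sigma_gamma^2 I),
% the log posterior is
\log p(\boldsymbol{\gamma} \mid \mathbf{Y})
  = -\frac{1}{2\sigma^2}\,\lVert \mathbf{Y} - \mathbf{Z}\boldsymbol{\gamma} \rVert_2^2
    - \frac{1}{2\sigma_\gamma^2}\,\lVert \boldsymbol{\gamma} \rVert_2^2
    + \mathrm{const},
% so the MAP estimate solves the ridge problem
\hat{\boldsymbol{\gamma}}
  = \arg\min_{\boldsymbol{\gamma}}\,
    \lVert \mathbf{Y} - \mathbf{Z}\boldsymbol{\gamma} \rVert_2^2
    + \lambda \lVert \boldsymbol{\gamma} \rVert_2^2,
  \qquad \lambda = \frac{\sigma^2}{\sigma_\gamma^2}.
```

Larger prior variance $\sigma_\gamma^2$ thus corresponds to weaker shrinkage of the fixed effects.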
Users can specify inclusion of any combination of additive or multiplicative spatial or temporal random effects, and can input $\mathbf{L}$ and $\mathbf{M}$ matrices for fixed effects. For example, if the user specifies the inclusion of an additive spatial effect, a multiplicative temporal effect, and no $\mathbf{L}$ and $\mathbf{M}$ matrices, the intercept/slope equations simplify as follows:

$$\alpha(s, t) = \alpha_0 + \alpha_s, \qquad \beta(s, t) = \beta_0 + \beta_t.$$

The inclusion of additive or multiplicative temporal and spatial effects is specified by the user with the include.additive.temporal.effect, include.multiplicative.temporal.effect, include.additive.spatial.effect, and include.multiplicative.spatial.effect arguments in the grm() function.
3.1.2. Spatial Random Effects
The spatial random effects are modeled using Gaussian processes (GPs), with the covariance kernel $K(\theta; d)$ specified by the user. We provide four covariance kernels for the spatial random effects: exponential, and Matérn for a fixed set of values of the smoothness parameter $\nu$. We also allow the user to specify a custom covariance kernel, which must be positive definite and symmetric and parameterized with respect to the range parameter $\theta$ and distance $d$ such that $K = K(\theta; d)$.
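As a concrete illustration of the $K(\theta; d)$ form a custom kernel must take, the sketch below implements exponential and Matérn ($\nu = 3/2$) kernels and checks symmetry and positive definiteness on a toy distance matrix. The closed forms are standard; the exact parameterization used internally by grm() may differ.

```r
# Sketch of exponential and Matern (nu = 3/2) covariance kernels in the
# K(theta; d) form required of a custom kernel. The closed forms are standard;
# the exact internal parameterization used by grm() may differ.
exp_kernel <- function(theta, d) exp(-d / theta)

matern32_kernel <- function(theta, d) {
  a <- sqrt(3) * d / theta
  (1 + a) * exp(-a)
}

# Check symmetry and positive definiteness on a small set of 1-D locations.
locs <- c(0, 1, 2.5, 4)
d    <- as.matrix(dist(locs))
K    <- matern32_kernel(theta = 2, d = d)
sym  <- isTRUE(all.equal(K, t(K)))
pd   <- all(eigen(K, symmetric = TRUE, only.values = TRUE)$values > 0)
```

Checking these two properties numerically on representative locations is a quick sanity test before passing a custom kernel to the model.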
While non-additive spatio-temporal effects are not currently implemented, we provide the option to use different sets of spatial effects for different time periods. By using different spatial effects for, say, seasons or months, some temporal variation in spatial effects can be accounted for. Regardless of the number of spatial effect sets, the Gaussian process parameters ($\tau^2_\alpha$, $\theta_\alpha$) are shared across sets. For example, if spatial–temporal sets are specified for the four seasons, the additive spatial random effect is as follows:

$$\boldsymbol{\alpha}_s \sim \mathcal{N}\left(\mathbf{0},\ \mathbf{I}_4 \otimes \tau^2_\alpha K(\theta_\alpha; d)\right),$$

where $n$ is the number of spatial locations, $\mathbf{I}_4$ is the $4 \times 4$ identity matrix, and $\otimes$ is the Kronecker product.
Priors are placed on the Gaussian process parameters $\tau^2_\alpha$ and $\theta_\alpha$ such that $\tau^2_\alpha \sim \mathrm{IG}(a_{\tau_\alpha}, b_{\tau_\alpha})$ and $\theta_\alpha \sim \mathrm{Gamma}(a_{\theta_\alpha}, b_{\theta_\alpha})$, where $a_{\tau_\alpha}$, $b_{\tau_\alpha}$, $a_{\theta_\alpha}$, and $b_{\theta_\alpha}$ are specified by the user using the tau.alpha.a, tau.alpha.b, theta.alpha.a, and theta.alpha.b arguments in the grm() function. While other parameters in the Bayesian downscaler regression model have conjugate full-conditional distributions and thus are sampled with Gibbs updates [31], the $\theta_\alpha$ and $\theta_\beta$ parameters are sampled with a Metropolis–Hastings [32] step, and thus require tuning parameter specification for the proposal distribution. We employ a log-normal proposal distribution with a user-specified tuning parameter $\sigma^2_{\theta_\alpha}$, such that $\theta^*_\alpha \sim \mathrm{LogNormal}(\log \theta_\alpha, \sigma^2_{\theta_\alpha})$. The user can specify the tuning parameter $\sigma^2_{\theta_\alpha}$ using the theta.alpha.tune argument in the grm() function, as well as the initial value for $\theta_\alpha$ using the theta.alpha.init argument. Prior specifications and tuning parameter arguments are similar for $\tau^2_\beta$ and $\theta_\beta$.
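The Metropolis–Hastings step with a log-normal proposal can be sketched in plain R as follows. The target below is a stand-in Gamma density, not the actual full conditional for the range parameter used inside grm(); note the $\log(\theta^*/\theta)$ term, which corrects for the asymmetry of the log-normal proposal.

```r
# Sketch of a Metropolis-Hastings update for a positive range parameter theta
# using a log-normal proposal. The target is a stand-in Gamma density, not the
# actual full conditional sampled inside grm().
set.seed(42)
log_target <- function(theta) dgamma(theta, shape = 5, rate = 0.05, log = TRUE)

n_iter <- 5000
tune   <- 0.2    # proposal variance on the log scale (theta.alpha.tune analogue)
theta  <- 50     # initial value (theta.alpha.init analogue)
draws  <- numeric(n_iter)

for (i in seq_len(n_iter)) {
  theta_star <- rlnorm(1, meanlog = log(theta), sdlog = sqrt(tune))
  # log(theta_star / theta) is the Hastings correction for the asymmetric
  # log-normal proposal density.
  log_accept <- log_target(theta_star) - log_target(theta) +
    log(theta_star) - log(theta)
  if (log(runif(1)) < log_accept) theta <- theta_star
  draws[i] <- theta
}
```

A poorly chosen tuning variance yields either near-zero or near-one acceptance rates, which is the practical motivation for exposing theta.alpha.tune to the user.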
3.1.3. Temporal Random Effects
The first-order random walk temporal random effects ($\alpha_t$, $\beta_t$) are specified such that

$$\alpha_t \mid \alpha_{t - 1} \sim \mathcal{N}\left(\rho_\alpha \alpha_{t - 1},\ \omega^2_\alpha\right),$$

with $\beta_t$ similarly specified. To reduce computational burden, each $\rho$ is discretized as 2000 evenly spaced values between 0 and 1, and each $\rho$ determines the temporal smoothness level. Initial values for $\rho_\alpha$ and $\rho_\beta$ can be specified by the user using the rho.alpha.init and rho.beta.init arguments in the grm() function. An inverse gamma prior is placed on $\omega^2_\alpha$ and $\omega^2_\beta$ such that $\omega^2_\alpha \sim \mathrm{IG}(a_{\omega_\alpha}, b_{\omega_\alpha})$ and $\omega^2_\beta \sim \mathrm{IG}(a_{\omega_\beta}, b_{\omega_\beta})$. The user can specify the hyperparameters $a_{\omega_\alpha}$, $b_{\omega_\alpha}$, $a_{\omega_\beta}$, and $b_{\omega_\beta}$ using the omega.alpha.a, omega.alpha.b, omega.beta.a, and omega.beta.b arguments in the grm() function.
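To build intuition for how $\rho$ controls temporal smoothness, the following sketch simulates the first-order random walk temporal effect for two values of $\rho$. The parameter values are arbitrary illustrations, not package defaults.

```r
# Sketch simulating the first-order random walk temporal effect
# alpha_t | alpha_{t-1} ~ N(rho * alpha_{t-1}, omega^2) for two values of rho.
# Parameter values are arbitrary illustrations, not package defaults.
set.seed(3)
simulate_rw1 <- function(n_time, rho, omega2) {
  alpha <- numeric(n_time)
  for (t in 2:n_time) {
    alpha[t] <- rnorm(1, mean = rho * alpha[t - 1], sd = sqrt(omega2))
  }
  alpha
}

smooth_path <- simulate_rw1(365, rho = 0.95, omega2 = 0.1)  # slowly varying
rough_path  <- simulate_rw1(365, rho = 0.20, omega2 = 0.1)  # nearly independent
```

A value of $\rho$ near 1 produces a slowly varying daily effect; values near 0 produce effects that are nearly independent from day to day.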
3.1.4. Fixed Effects
The user is able to specify inclusion of fixed effects $\boldsymbol{\gamma}$ and $\boldsymbol{\delta}$ for spatial and spatio-temporal covariates $\mathbf{L}(s)$ and $\mathbf{M}(s, t)$, respectively. The fixed effects are modeled with normal priors, $\boldsymbol{\gamma} \sim \mathcal{N}(\mathbf{0}, \sigma^2_\gamma \mathbf{I})$ and $\boldsymbol{\delta} \sim \mathcal{N}(\mathbf{0}, \sigma^2_\delta \mathbf{I})$, where inverse gamma priors are placed on the $\sigma^2_\gamma$ and $\sigma^2_\delta$ parameters such that $\sigma^2_\gamma \sim \mathrm{IG}(a_\sigma, b_\sigma)$ and $\sigma^2_\delta \sim \mathrm{IG}(a_\sigma, b_\sigma)$. Thus, these parameters share the same hyperparameter settings as those for $\sigma^2$, specified by the user using the sigma.a and sigma.b arguments in the grm() function.
3.1.5. Stage 1 Code Example
We loaded the ensembleDownscaleR package and fit the Bayesian downscaler regression model for CTM using the previously described Los Angeles PM2.5 dataset. The code for fitting the AOD model is omitted for brevity, but it is similar to the CTM model fit and is included in the tutorial code on GitHub.
cmaq_fit <- grm(
  Y = monitor_pm25_with_cmaq$pm25,
  X = monitor_pm25_with_cmaq$cmaq,
  L = monitor_pm25_with_cmaq[, c("elevation", "population")],
  M = monitor_pm25_with_cmaq[, c("cloud", "v_wind", "hpbl",
                                 "u_wind", "short_rf", "humidity_2m")],
  n.iter = 25e3,
  burn = 5e3,
  thin = 20,
  covariance = "matern",
  matern.nu = 0.5,
  coords = monitor_pm25_with_cmaq[, c("x", "y")],
  space.id = monitor_pm25_with_cmaq$space_id,
  time.id = monitor_pm25_with_cmaq$time_id,
  spacetime.id = monitor_pm25_with_cmaq$spacetime_id,
  verbose.iter = 10
)
We note that grm() returns a fitted model object—cmaq_fit here—that is a named list of all posterior parameter draws from the MCMC sampler. Summary statistics or maximum a posteriori (MAP) estimates can be calculated directly from this list.
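For example, posterior summaries can be computed from such a list of draws as in the following sketch; the element name sigma2 and the simulated draws are hypothetical stand-ins, not actual grm() output.

```r
# Sketch: summarizing posterior draws stored in a named list, as grm() returns.
# The element name "sigma2" and the simulated draws are hypothetical stand-ins
# for actual ensembleDownscaleR output.
set.seed(1)
fit <- list(sigma2 = rgamma(1000, shape = 3, rate = 1.5))  # pretend MCMC draws

post_mean <- mean(fit$sigma2)                        # posterior mean
post_sd   <- sd(fit$sigma2)                          # posterior standard deviation
cred_int  <- quantile(fit$sigma2, c(0.025, 0.975))   # 95% credible interval

# A simple MAP approximation: the argmax of a kernel density estimate.
dens    <- density(fit$sigma2)
map_est <- dens$x[which.max(dens$y)]
```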
3.2. Stage 2: Produce Estimates and Predictions with Available CTM and AOD Data
CTM data are available for all times and locations in the study area, while AOD data availability depends on the time period. In this stage, we used the grm_pred() function to produce posterior predictive means and variances for all CTM and AOD data. We first input the fitted CTM-based model and the CTM data for all times and locations in the study area to produce predictive means and variances for all locations s and times t. We then input the fitted AOD-based model and the sparser AOD data to produce predictive means and variances for all times and locations for which AOD data are available. We note that grm_pred() outputs NA values for prediction locations identical to monitor locations because, in this case, the observed concentrations can be used.
Stage 2 Code Example
Using the fitted downscaler regression models from stage 1, we produced full PM
2.5 predictions for all locations and times in the study area using the CTM-based fitted model, and for all locations and times for which AOD is available using the AOD-based fitted model (
Figure 3). The code for producing AOD predictions is omitted here, but is included in the full tutorial code at GitHub.
cmaq_pred <- grm_pred(
  grm.fit = cmaq_fit,
  X = cmaq_for_predictions$cmaq,
  L = cmaq_for_predictions[, c("elevation", "population")],
  M = cmaq_for_predictions[, c("cloud", "v_wind", "hpbl",
                               "u_wind", "short_rf", "humidity_2m")],
  coords = cmaq_for_predictions[, c("x", "y")],
  space.id = cmaq_for_predictions$space_id,
  time.id = cmaq_for_predictions$time_id,
  spacetime.id = cmaq_for_predictions$spacetime_id,
  n.iter = 1e3,
  verbose = TRUE
)
3.3. Stage 3: Use Cross-Validation to Produce Out-of-Sample Prediction Means and Variances
3.3.1. Cross-Validation Details
K-fold cross-validation prevents overfitting by separating the dataset into k folds, iteratively fitting the model on k − 1 folds and predicting the remaining fold. We provide two functions to perform k-fold cross-validation with the geostatistical regression model. The first, create_cv(), creates cross-validation indices according to a user-specified sampling scheme. The second, grm_cv(), returns the out-of-sample PM2.5 predictions, calculated according to user-inputted cross-validation indices (either obtained from the create_cv() function or created by the user) and arguments similar to those used for the grm() function to specify the downscaler regression model. The out-of-sample predictions are stacked into a dataset of the same length and order as the original dataset on which the cross-validation is applied.
The create_cv() function allows specification of the following types of cross-validation (Figure 4):
Ordinary: Folds are randomly assigned across all observations.
Spatial: Folds are randomly assigned across all spatial locations.
Spatial Clustered: K spatial clusters are estimated using k-means clustering on spatial locations. These clusters determine the folds.
Spatial Buffered: Folds are randomly assigned across all spatial locations. For each fold, observations are dropped from the training set if they are within a user-specified distance from the nearest test set point.
In Figure 4, we visually detail how the folds are assigned for each type of cross-validation, using the monitor locations and times in our study area as an example and assuming five cross-validation folds. We plot all fold assignments for four randomly chosen days for ordinary, spatial, and spatial clustered cross-validation, with color representing fold assignment. For the spatial buffered cross-validation, we plot only one fold assignment, with color representing whether a location is in the first test fold, in the first training fold, or dropped due to being within a 30 km buffer of a location in the first test fold.
When assigning folds, we enforce that each fold contains at least one observation from each spatial location and spacetime indicator. Locations for which there are fewer observations than folds should be filtered out of the dataset prior to analysis. Data from the first and last time points are left unassigned, and out-of-sample predictions for these data are output as missing data. If the out-of-sample dataset has a larger temporal range than the in-sample dataset for a given fold, the out-of-sample predictions for the extra time points are also output as missing data.
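The difference between ordinary and spatial fold assignment can be sketched in a few lines of base R. The data frame below is simulated, and the real create_cv() function operates on space.id and time.id inputs with the additional constraints described above.

```r
# Sketch contrasting ordinary vs. spatial fold assignment, in the spirit of
# create_cv(type = "ordinary") and create_cv(type = "spatial"). The data are
# simulated; the real function also enforces the constraints described above.
set.seed(7)
n_loc <- 30
k     <- 5
obs <- data.frame(
  space_id = rep(seq_len(n_loc), each = 10),  # 10 days per location
  time_id  = rep(1:10, times = n_loc)
)

# Ordinary CV: folds assigned independently to each observation.
obs$fold_ordinary <- sample(rep_len(1:k, nrow(obs)))

# Spatial CV: folds assigned to whole locations, so all observations from a
# location are held out together.
loc_fold <- sample(rep_len(1:k, n_loc))
obs$fold_spatial <- loc_fold[obs$space_id]
```

Under spatial assignment, a held-out location contributes no observations to the training folds, which is what makes the resulting error estimates honest for prediction at unmonitored sites.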
3.3.2. Producing Out-of-Sample Prediction Means and Variances
The grm_cv() function uses the previously detailed cross-validation indices and model specifications to produce out-of-sample estimates of the posterior predictive mean and variance of the PM2.5 value at each location s and time t, under both the CTM-based and AOD-based downscaler regression models. Specifically, grm_cv() outputs the posterior predictive means and variances that are used in stage 4 to fit the full ensemble model.
3.3.3. Stage 3 Code Example
We created the cross-validation indices with the create_cv() function for both AOD- and CTM-linked monitors, and then used the grm_cv() function to produce out-of-sample PM2.5 predictions for all monitor observations, for both the CTM and AOD data. The code for producing AOD out-of-sample predictions is omitted here, but it is included in the tutorial code on GitHub.
cv_id_cmaq_ord <- create_cv(
  space.id = monitor_pm25_with_cmaq$space_id,
  time.id = monitor_pm25_with_cmaq$time_id,
  type = "ordinary"
)
cmaq_fit_cv <- grm_cv(
  Y = monitor_pm25_with_cmaq$pm25,
  X = monitor_pm25_with_cmaq$cmaq,
  cv.object = cv_id_cmaq_ord,
  L = monitor_pm25_with_cmaq[, c("elevation", "population")],
  M = monitor_pm25_with_cmaq[, c("cloud", "v_wind", "hpbl",
                                 "u_wind", "short_rf", "humidity_2m")],
  n.iter = 25e3,
  burn = 5e3,
  thin = 20,
  coords = monitor_pm25_with_cmaq[, c("x", "y")],
  space.id = monitor_pm25_with_cmaq$space_id,
  time.id = monitor_pm25_with_cmaq$time_id,
  spacetime.id = monitor_pm25_with_cmaq$spacetime_id,
  verbose.iter = 10
)
3.4. Stage 4: Estimate Spatially Varying Weights
At this stage, we used the ensemble_spatial() function to fit the ensemble model, in which the predictive distribution of PM2.5 at location $s$ and time $t$ is a weighted combination of the CTM-based and AOD-based predictive distributions,

$$p\left(Y(s, t)\right) = w(s)\, p_{\mathrm{CTM}}\left(Y(s, t)\right) + \left(1 - w(s)\right) p_{\mathrm{AOD}}\left(Y(s, t)\right),$$

where $w(s) \in [0, 1]$ are spatially varying weights. We estimated the weights by fitting the ensemble model on the out-of-sample predictions produced during stage 3 and the original PM2.5 data at all times and monitor locations. We placed a Gaussian process prior on the weights, using an exponential kernel $K(\theta; d)$, where $d$ and $\theta$ are the distance and range parameters, respectively. Similar to the spatial processes in stage 1, we placed an inverse gamma prior on $\tau^2$ and a gamma prior on $\theta$, such that $\tau^2 \sim \mathrm{IG}(a_\tau, b_\tau)$ and $\theta \sim \mathrm{Gamma}(a_\theta, b_\theta)$. The user can specify the hyperparameters $a_\tau$, $b_\tau$, $a_\theta$, and $b_\theta$ using the tau.a, tau.b, theta.a, and theta.b arguments in the ensemble_spatial() function.
The ensemble_spatial() function accepts the output from grm_cv() in stage 3 as input, and outputs the full posterior distribution samples of the spatially varying weights $w(s)$ at the monitor locations.
Stage 4 Code Example
We display the out-of-sample predictions produced in stage 3 against the original monitor PM2.5 measurements, while including the weight estimates for reference (Figure 5). If desired, one can additionally use these weight estimates with the gap_fill() function to produce ensemble-based predictions for each location at which PM2.5 is observed, though these outputs are not used in later stages. We did this here to compare the ensemble model performance to the CTM-based and AOD-based models at the times and locations for which both CTM and AOD data were observed, employing ordinary, spatial, spatial clustered, and spatial buffered cross-validation for each model (Table 1; see Supplementary Materials for more details). For all cross-validation formulations, the ensemble model outperforms both the CTM-based and AOD-based models in terms of RMSE and R2, while maintaining accurate 95% prediction interval coverage. Note that the gap_fill() function uses the weight estimates to produce ensemble-based predictions for times and locations at which both CTM and AOD data are observed, and fills in the remaining times and locations with the CTM-based predictions.
At this stage, we estimated the spatially varying weights for the ensemble model, using the out-of-sample predictions produced in stage 3, and the original monitor PM2.5 measurements.
ensemble_fit <- ensemble_spatial(
grm.fit.cv.1 = cmaq_fit_cv,
grm.fit.cv.2 = aod_fit_cv,
n.iter = 25e3,
burn = 5e3,
thin = 20,
tau.a = 0.001,
tau.b = 0.001,
theta.tune = 0.2,
theta.a = 5,
theta.b = 0.05
)
3.5. Stage 5: Predict Weights for All Locations
At this stage, we used the weight_pred() function to spatially interpolate the posterior samples of the weights garnered in stage 4 across all 1 km × 1 km grid cells in the study area. Specifically, weights at unmonitored grid cells are obtained via Gaussian process conditional (kriging) prediction from the stage 4 posterior samples at the monitor locations. These weights are used in the final stage to produce ensemble-based PM2.5 predictions for all locations in the study area.
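For a single posterior sample, this kriging step amounts to Gaussian process conditional prediction, which can be sketched as follows. The locations, weights, and range parameter below are simulated stand-ins, and the exact computation inside weight_pred() may differ (e.g., in how the weights are transformed and constrained).

```r
# Sketch of Gaussian process (kriging) interpolation of weights from monitor
# locations to grid cells, analogous to what weight_pred() does per posterior
# sample. Locations, weights, and the range parameter are simulated stand-ins.
set.seed(11)
monitor_xy <- matrix(runif(20), ncol = 2)   # 10 monitor locations
grid_xy    <- matrix(runif(50), ncol = 2)   # 25 grid cell centroids
w_monitor  <- runif(10, 0.3, 0.7)           # weight sample at monitors
theta      <- 0.5                           # exponential kernel range

# Pairwise Euclidean distances between two sets of points.
cross_dist <- function(a, b) {
  d2 <- outer(rowSums(a^2), rowSums(b^2), "+") - 2 * a %*% t(b)
  sqrt(pmax(d2, 0))  # guard against tiny negative values from rounding
}

K_mm <- exp(-as.matrix(dist(monitor_xy)) / theta)      # monitor-monitor
K_gm <- exp(-cross_dist(grid_xy, monitor_xy) / theta)  # grid-monitor

# Simple kriging of the centered weights, then clamping to [0, 1].
w_bar  <- mean(w_monitor)
w_grid <- as.numeric(w_bar + K_gm %*% solve(K_mm, w_monitor - w_bar))
w_grid <- pmin(pmax(w_grid, 0), 1)
```

A defining property of this interpolator is that, with no nugget, it exactly reproduces the weights at the monitor locations themselves.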
Table 1.
Model PM2.5 prediction performance for ensemble model from stage 4, and CTM-based and AOD-based models from stage 2, using each cross-validation type available in the create_cv() function. We assess model performance using 10-fold ordinary, spatial, spatial clustered, and spatial buffered cross-validation. The spatial buffered cross-validation is formulated with buffer sizes of 12.6 km and 42.6 km, corresponding with approximately 0.7 and 0.3 spatial random effect correlation, respectively.
CV Type | Model | RMSE | R2 | Posterior SD | 95% PI Coverage
--- | --- | --- | --- | --- | ---
Ordinary | AOD-Based | 4.401 | 0.573 | 4.273 | 0.954
Ordinary | CMAQ-Based | 3.847 | 0.674 | 4.077 | 0.960
Ordinary | Ensemble | 3.713 | 0.696 | 4.223 | 0.971
Spatial | AOD-Based | 4.710 | 0.486 | 4.764 | 0.954
Spatial | CMAQ-Based | 4.379 | 0.555 | 4.603 | 0.957
Spatial | Ensemble | 4.116 | 0.607 | 4.714 | 0.969
Spatial Buffered (0.3 Corr) | AOD-Based | 4.778 | 0.471 | 4.767 | 0.953
Spatial Buffered (0.3 Corr) | CMAQ-Based | 6.200 | 0.109 | 5.194 | 0.950
Spatial Buffered (0.3 Corr) | Ensemble | 4.349 | 0.561 | 4.988 | 0.970
Spatial Buffered (0.7 Corr) | AOD-Based | 4.736 | 0.480 | 4.758 | 0.952
Spatial Buffered (0.7 Corr) | CMAQ-Based | 4.578 | 0.514 | 4.612 | 0.955
Spatial Buffered (0.7 Corr) | Ensemble | 4.243 | 0.583 | 4.758 | 0.968
Spatial Clustered | AOD-Based | 5.394 | 0.325 | 5.151 | 0.945
Spatial Clustered | CMAQ-Based | 5.304 | 0.348 | 5.104 | 0.959
Spatial Clustered | Ensemble | 4.735 | 0.480 | 5.227 | 0.966
Stage 5 Code Example
We spatially interpolated the posterior samples of the weights from stage 4 across all locations in the study area using the weight_pred() function (Figure 6). This provided the weights used in the final stage to produce full ensemble estimates for all times and locations in the study area.
weight_preds <- weight_pred(
  ensemble.fit = ensemble_fit,
  coords = cmaq_for_predictions[, c("x", "y")],
  space.id = cmaq_for_predictions$space_id,
  verbose = TRUE
)
Figure 6.
(A) Posterior mean spatially interpolated weights produced in stage 5. (B) Ensemble-based posterior predictive PM2.5 mean estimates. (C) Ensemble-based posterior predictive PM2.5 standard deviation estimates.
3.6. Stage 6: Compute Ensemble Predictions for All Locations
The last stage comprised using the posterior means and variances for all CTM and AOD data produced in stage 2, together with the spatially interpolated weights from stage 5, to compute PM2.5 posterior predictive means and variances for all times t and locations s in the study area, done with the gap_fill() function.
For times and locations at which both CTM and AOD are observed, gap_fill() outputs ensemble-based estimates obtained by combining the CTM-based and AOD-based predictive distributions according to the interpolated weights. For times and locations at which only CTM is available, gap_fill() outputs posterior predictive means and variances identical to those produced in stage 2 from the CTM-based model.
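One simple way to combine two predictive distributions with a weight $w$ is via the moments of the two-component mixture, sketched below. This illustrates the ensemble computation conceptually; gap_fill() itself operates on posterior samples and its internals may differ.

```r
# Sketch of combining CTM-based and AOD-based posterior predictive means and
# variances with a weight w, using the moments of a two-component mixture.
# Illustrative only; gap_fill() operates on posterior samples internally.
ensemble_moments <- function(mu1, var1, mu2, var2, w) {
  mu <- w * mu1 + (1 - w) * mu2
  # Mixture variance: E[Y^2] - (E[Y])^2.
  v  <- w * (var1 + mu1^2) + (1 - w) * (var2 + mu2^2) - mu^2
  list(mean = mu, var = v)
}

# Example: CTM-based prediction 12 ug/m3 (variance 4), AOD-based prediction
# 10 ug/m3 (variance 9), weight 0.6 on the CTM-based component.
res <- ensemble_moments(mu1 = 12, var1 = 4, mu2 = 10, var2 = 9, w = 0.6)
```

Note that the mixture variance exceeds the weighted average of the component variances whenever the two means disagree, so disagreement between CTM and AOD predictions is propagated into wider uncertainty.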
Stage 6 Code Example
Here, we input the posterior means and variances for all CTM and AOD data produced in stage 2, and the spatially interpolated weights from stage 5, into the gap_fill() function, which outputs PM2.5 posterior predictive means and variances for all times t and locations s in the study area (Figure 6).
4. Discussion
In this work, we introduce the ensembleDownscaleR package for fitting Bayesian geostatistical regression and ensemble models, designed for predicting PM2.5 using CTM simulations and AOD measurements. We also provide a code tutorial based on a case study of the Los Angeles metropolitan area data from 2018. The purpose of this work is to guide practitioners in generating robust PM2.5 predictions and uncertainty quantification with a flexible and well-documented software workflow. The framework can also be applied to other air pollutants and data integration problems.
There are areas for future improvement worth noting. The Gaussian process spatial random effects employed by our model are appropriate for the size of the Los Angeles case study data used here. For much larger datasets (with many more monitors and/or much larger prediction grids), inference for the Gaussian process parameters will be prohibitively slow. Incorporation of scalable random processes, such as Nearest-Neighbor Gaussian Processes [33], could make these methods feasible for much larger datasets than those assessed here. Furthermore, spatial covariance specifications are currently limited to isotropic, stationary kernels. There are cases where PM2.5 data may exhibit correlation that suggests anisotropic or nonstationary kernels would be more appropriate, such as data that include periods with high wind or localized wildfires. Inclusion of covariates such as wind information can often resolve this. Inspecting residual correlation and covariance parameters can help ensure that the covariance is reasonably specified for a given dataset. Finally, the current software does not support integrated parallelization. For example, the grm_cv() function fits a model on each cross-validation fold sequentially rather than exploiting multiple cores or compute nodes to fit each model concurrently. Model fitting time could be substantially lowered, without sacrificing software ease of use, by incorporating parallelization specifications directly into the ensembleDownscaleR package functions.