Technical Note

ensembleDownscaleR: R Package for Bayesian Ensemble Averaging of PM2.5 Geostatistical Downscalers

1 Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA 30322, USA
2 Department of Environmental Health, Emory University, Atlanta, GA 30322, USA
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(11), 1941; https://doi.org/10.3390/rs17111941
Submission received: 24 March 2025 / Revised: 5 May 2025 / Accepted: 12 May 2025 / Published: 4 June 2025

Abstract

Ambient fine particulate matter of size less than 2.5 μm in aerodynamic diameter (PM2.5) is a key air pollutant that has been linked to numerous adverse health outcomes. Reliable estimates of PM2.5 are important for supporting epidemiological and health impact assessment studies. Precise measurements of PM2.5 are available through networks of monitors; however, these are spatially sparse and temporally incomplete. Chemical transport model (CTM) simulations and satellite-retrieved aerosol optical depth (AOD) measurements are two data sources that have been used to develop prediction models for PM2.5 at fine spatial resolutions with increased spatial coverage. As part of the Multi-Angle Imager for Aerosols (MAIA) project, a geostatistical regression model has been developed to bias-correct AOD, followed by Bayesian ensemble averaging to gap-fill missing AOD values with CTM simulations. Here, we present a suite of statistical software (available in the R package ensembleDownscaleR) to facilitate the adaptation of this modeling approach to other settings and air quality modeling applications. We describe the Bayesian ensemble averaging approach, model specifications, estimation methods, and evaluation via cross-validation that are implemented in the software. We also provide a case study of estimating PM2.5 using 2018 data from the Los Angeles metropolitan area with an accompanying tutorial. All code is fully reproducible and available on GitHub, data are made available on Zenodo, and the ensembleDownscaleR package is available for download on GitHub.

1. Introduction

Exposure to ambient air pollution is a well-established risk factor for multiple adverse health outcomes. Particulate matter of size less than 2.5 μm in aerodynamic diameter (PM2.5) has been shown to be especially harmful due to its ability to penetrate deep into the respiratory system [1,2,3,4,5,6]. PM2.5 represents a chemically diverse mixture of pollutants and its major sources include electricity generation, motor vehicles, and wildland fires. Accurate estimation of PM2.5 concentration is an important component of air pollution health research [7,8].
A network of monitors established for regulatory purposes provides precise, direct measurements of PM2.5. However, these monitors are relatively expensive to install and maintain, and, therefore, are both spatially sparse and preferentially located in prioritized areas. There exist two additional data sources that are highly correlated with PM2.5, which enable the development of methods to predict PM2.5 beyond locations with ground monitors. The first is outputs from the chemical transport model (CTM), which provide simulations of air quality based on emissions, physical processes, and meteorological data. Gridded CTM simulations provide complete spatial coverage, but require bias-correction with observations due to uncertainties in model inputs, discretization, and other sources of error [9]. The second is satellite measurements of aerosol optical depth (AOD), which provide measures of aerosols for the entire atmospheric column at potentially finer spatial resolution than CTM [10,11,12]. However, AOD also requires transformation and bias-correction and is subject to high missingness due to cloud cover and retrieval error [13,14,15].
Previous studies have explored the use of CTM or AOD for predicting PM2.5 concentrations [16,17,18]. An important and active area of air pollution exposure assessment research is the development of methods that can utilize both CTM and AOD to predict PM2.5 to address limitations associated with each data type [17,19,20]. One such framework combines predictions from geostatistical models trained separately on AOD or CTM data using Bayesian ensemble averaging [21]. This approach utilizes all available monitoring, AOD, and CTM data, providing fine-scale estimates when AOD is available and gap-filled estimates from CTM when AOD is missing. While current machine learning approaches, such as Random Forests, Support Vector Machines, and Neural Networks, can yield high predictive accuracy and computational efficiency [20], our approach additionally provides probabilistic uncertainty quantification for each prediction via prediction intervals and standard deviations. These outputs are crucial for downstream exposure health analyses. For this reason, our framework has been adopted by the Multi-Angle Imager for Aerosols (MAIA) project to estimate daily PM2.5, PM10, and speciated PM2.5 major components in multiple large population centers around the world [22,23]. A more detailed description of our method and its comparison to related approaches is provided in Murray et al. (2019) [21].
Here, we introduce an R package, ensembleDownscaleR, developed with R (version ~4.4.1), containing a suite of functions to facilitate the adaptation of this modeling approach to other settings and air quality modeling applications. We describe the statistical method from [21], as well as model extensions implemented for MAIA. We also detail the functionality of the R package and provide an example analysis using 2018 data from the Los Angeles metropolitan area to estimate daily PM2.5 at 1 km spatial resolution. This work is structured as a tutorial designed to guide practitioners in the use of the ensembleDownscaleR R package. Each stage includes a methodological introduction, code examples applied to the Los Angeles dataset, and corresponding results. We made the data available on Zenodo (https://zenodo.org/record/14996970, accessed on 24 March 2025) and provided all code used for model fitting, data processing, and plot/table creation on GitHub (https://github.com/WyattGMadden/ensembleDownscaleR_tutorial, accessed on 24 March 2025), to ensure full reproducibility of results and assist users in their own analyses. The ensembleDownscaleR R package is available for download on GitHub (https://github.com/WyattGMadden/ensembleDownscaleR, accessed on 24 March 2025).

2. Case Study Data

Our analyses correspond to a spatial region overlapping the Los Angeles metropolitan area, ranging from −120.50 to −115.75 longitude, and from 32.00 to 35.70 latitude, and a temporal range of 1 January 2018, to 31 December 2018 (Figure 1). This region contained 60 monitoring stations from the Air Quality System (AQS) that report daily average PM2.5 measurements.
AOD data were obtained from the Multi-Angle Implementation of Atmospheric Correction (MAIAC) with a spatial resolution of 1 km [24]. Chemical transport model (CTM) data were obtained from the Community Multiscale Air Quality (CMAQ) model, which provides daily PM2.5 simulations at a 12 km spatial resolution [25]. We spatially linked the AQS monitor, AOD, and CTM data using the MAIAC 1 km grid for model fitting and prediction.
Additional spatial land use and spatio-temporal meteorological covariates were matched to both monitor locations and grid cells. These include elevation (elevation), population density (population), cloud cover (cloud), east–west wind velocity (u_wind), north–south wind velocity (v_wind), height of the planetary boundary layer (hpbl), shortwave radiation (short_rf), and humidity at two meters above the ground (humidity_2m). Elevation was obtained from the ASTER Global Digital Elevation Model [26]. Population density was obtained from the Oak Ridge National Laboratory LandScan USA 2018 dataset [27]. Cloud cover was obtained from the NASA MODIS atmosphere L2 cloud product [28]. Wind velocities, shortwave radiation, and height of the planetary boundary layer were obtained from the NASA NLDAS-2 Forcing Dataset [29]. Additional details on the source and spatial resolution of each covariate are provided in the Supplementary Materials.
We compiled the Los Angeles data into four datasets that were used both in the subsequent analyses and the accompanying tutorial: PM2.5 monitor-linked CTM data and covariates, PM2.5 monitor-linked AOD data and covariates, grid cell CTM data and covariates, and grid cell AOD data and covariates. The grid cell datasets include all days in the month of July 2018, rather than the full year, to conserve memory. Daily monitor PM2.5 observation missingness averaged 27% (range 0–45%, IQR 10–38%), CTM data were available for all grid cells in the study area for all days, and daily AOD missingness averaged 11% (range 2–79%, IQR 6–35%) of the grid cells in the study area. This resulted in 15,821 observation–CTM pairs, 11,668 observation–AOD pairs, and a maximum of 122,735 prediction grid cells for each day. Further details of these datasets are included in the Supplementary Materials.

3. Methods and Results

In this section, we provide a case study of all stages of the Bayesian ensemble fitting and prediction process. For each stage, we detail the methods used, provide code examples to illustrate the use of relevant ensembleDownscaleR package functions, and present corresponding results. We break this complete workflow for producing Bayesian ensemble PM2.5 predictions into six stages, detailed in Figure 2 and listed as follows:
  1. Fit two separate Bayesian downscaler regression models, M_CTM and M_AOD, on the monitoring PM2.5 data, one spatially and temporally matched with CTM data and the other matched with AOD data.
  2. Produce estimates of PM2.5 (posterior predictive means) and variances for all times and grid cells using M_CTM from stage 1. Produce estimates of PM2.5 means and variances for all times and grid cells for which AOD is available, using M_AOD.
  3. Use cross-validation to produce two sets of out-of-sample PM2.5 prediction means and variances using the same data and model forms as in stage 1. This produces two datasets of out-of-sample prediction means and variances for each monitor observation.
  4. Estimate spatially varying weights from the stage 3 out-of-sample prediction means and variances and the monitor PM2.5 measurements.
  5. Use Gaussian process spatial interpolation (kriging) to predict weights for all grid cells in the study area from the stage 4 weight estimates.
  6. Use the mean and variance estimates from stage 2 and the weight estimates from stage 5 to acquire ensemble predictions of PM2.5 at each time and grid cell in the study area.
The total computation, with 25,000 Markov chain Monte Carlo (MCMC) iterations per model fit (two stage 1 model fits, 20 stage 3 model fits, and one stage 4 model fit), took approximately 11.29 h on an Apple MacBook Pro laptop (macOS Sequoia, version 15.5) with an Apple M3 Max processor and 36 GB of RAM.

3.1. Stage 1: Downscaler Regression Model

This section details the model specifications available for Bayesian downscaler regression model fitting using the grm() function.

3.1.1. Model

The Bayesian downscaler regression model is formulated as a spatial–temporal regression of PM2.5 against X, which is either AOD or CTM depending on user input.
The statistical model is as follows:
$$PM_{2.5}(s, t) = \alpha(s, t) + \beta(s, t) X(s, t) + l(s)^{\top} \gamma + m(s, t)^{\top} \delta + \epsilon(s, t)$$
where α(s, t) and β(s, t) are the spatial–temporal intercept and AOD/CTM slope of the regression model at location s and time t, γ and δ are fixed effects for spatial and spatio-temporal covariates l(s) and m(s, t), respectively, and ε(s, t) ~ N(0, σ²). Here, σ² is modeled with an inverse gamma prior distribution, σ² ~ IG(a_σ², b_σ²), where the hyperparameters a_σ² and b_σ² are specified with the sigma.a and sigma.b arguments in the grm() function. For these and the remainder of the hyperparameter arguments, defaults are set to represent uninformative priors.
The slope and intercept parameters are composed of the following additive spatial and temporal random effects and fixed effects:
$$\alpha(s, t) = \alpha_0 + \alpha(s) + \alpha(t), \qquad \beta(s, t) = \beta_0 + \beta(s) + \beta(t)$$
where the spatial random effects α(s), β(s) ~ GP(0, τ² K(θ, d)) follow a Gaussian process (GP) with a user-specified kernel K with range parameter θ and distance d. Temporal random effects α(t), β(t) are set as a first-order random walk. Normal priors are applied to the fixed effects γ and δ, which in a Bayesian framework induce an L2 penalty equivalent to a ridge-regression penalty in a frequentist setting [30]. This provides theoretically justified shrinkage when covariates are numerous or correlated.
Users can specify inclusion of any combination of additive or multiplicative spatial or temporal random effects, and can input L and M matrices for fixed effects. For example, if the user specifies the inclusion of an additive spatial effect, a multiplicative temporal effect, and no L and M matrices, the intercept/slope equations would simplify as follows:
$$\alpha(s, t) = \alpha_0 + \alpha(s), \qquad \beta(s, t) = \beta_0 + \beta(t)$$
The inclusion of additive or multiplicative temporal and spatial effects is specified by the user with the include.additive.temporal.effect, include.multiplicative.temporal.effect, include.additive.spatial.effect, and include.multiplicative.spatial.effect arguments in the grm() function.
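As a minimal sketch of these flags (using the Los Angeles monitor dataset introduced in the stage 1 code example below, Section 3.1.5, with all other settings left at their defaults), the simplified intercept/slope structure above could be requested as follows; the flag values shown are illustrative only.

# Sketch: request only an additive spatial effect and a multiplicative
# temporal effect, matching the simplified equations above. The dataset and
# remaining arguments mirror the stage 1 code example.
fit_simplified <- grm(
    Y = monitor_pm25_with_cmaq$pm25,
    X = monitor_pm25_with_cmaq$cmaq,
    include.additive.spatial.effect = TRUE,
    include.additive.temporal.effect = FALSE,
    include.multiplicative.spatial.effect = FALSE,
    include.multiplicative.temporal.effect = TRUE,
    coords = monitor_pm25_with_cmaq[, c("x", "y")],
    space.id = monitor_pm25_with_cmaq$space_id,
    time.id = monitor_pm25_with_cmaq$time_id,
    spacetime.id = monitor_pm25_with_cmaq$spacetime_id,
    n.iter = 25e3,
    burn = 5e3,
    thin = 20
)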

3.1.2. Spatial Random Effects

The spatial random effects are modeled using Gaussian processes (GPs), with the covariance kernel K specified by the user. We provide four covariance kernels for the spatial random effects: exponential, and Matérn with ν ∈ {1/2, 3/2, 5/2}, where
$$K_{exp} = \exp\!\left(-\frac{d}{\theta}\right), \qquad K_{mat} = \frac{1}{\Gamma(\nu) 2^{\nu - 1}} \left(\frac{\sqrt{2\nu}}{\theta} d\right)^{\nu} K_{\nu}\!\left(\frac{\sqrt{2\nu}}{\theta} d\right)$$
We also allow the user to specify a custom covariance kernel, which must be positive definite and symmetric and parametrized with respect to the range parameter θ and distance d, such that K_user = f(θ, d).
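For illustration, the sketch below defines a squared-exponential (Gaussian) kernel, which is symmetric and positive definite and is a function of the range parameter θ and distance d; the function name is hypothetical, and the exact grm() argument through which a user-supplied kernel is passed is not covered here (see the package documentation).

# Hypothetical user-defined kernel: squared-exponential (Gaussian),
# K_user = f(theta, d) = exp(-(d / theta)^2). Symmetric and positive definite.
gaussian_kernel <- function(theta, d) {
    exp(-(d / theta)^2)
}

# Quick check on a small distance matrix built from three toy coordinates.
d_mat <- as.matrix(dist(cbind(c(0, 1, 2), c(0, 0, 1))))
gaussian_kernel(theta = 1.5, d = d_mat)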
While non-additive spatio-temporal effects are not currently implemented, we provide the option to use different sets of spatial effects for different time periods. By using different spatial effects for, say, seasons or months, some temporal variation in spatial effects can be accounted for. Regardless of the number of spatial effect sets, the Gaussian process parameters (τ², θ) are shared across sets. For example, if spatial effect sets are specified for seasons, the additive spatial random effect is as follows:
$$[\alpha_n(s_{spr}), \alpha_n(s_{sum}), \alpha_n(s_{fal}), \alpha_n(s_{win})] \sim GP\left(\mathbf{1}_4 \otimes \mathbf{0}_n, \; \tau_\alpha^2 I_{4,4} \otimes K(\theta_\alpha, d)_{n,n}\right)$$
where n is the number of spatial locations and ⊗ is the Kronecker product.
Priors are placed on the Gaussian process parameters τ_α² and θ_α such that τ_α² ~ IG(a_τα², b_τα²) and log(θ_α) ~ Gamma(a_θα, b_θα), where a_τα², b_τα², a_θα, and b_θα are specified by the user using the tau.alpha.a, tau.alpha.b, theta.alpha.a, and theta.alpha.b arguments in the grm() function. While other parameters in the Bayesian downscaler regression model have conjugate full-conditional distributions and thus are sampled with Gibbs updates [31], the θ_α and θ_β parameters are sampled with a Metropolis–Hastings step [32], and thus require a tuning parameter specification for the proposal distribution. We employ a log-normal proposal distribution with a user-specified tuning parameter θ_α*, such that θ_α^(t+1) ~ Log-Normal(θ_α^(t), θ_α*). The user can specify the tuning parameter θ_α* using the theta.alpha.tune argument in the grm() function, as well as the initial value for θ_α using the theta.alpha.init argument. Prior specifications and tuning parameter arguments are similar for τ_β² and θ_β.

3.1.3. Temporal Random Effects

The first-order random walk temporal random effects ( α ( t ) , β ( t ) ) are specified such that
$$E[\alpha(t)] = \begin{cases} \rho_\alpha \alpha(t+1) & \text{if } t = 1 \\ \rho_\alpha \dfrac{\alpha(t-1) + \alpha(t+1)}{2} & \text{if } 1 < t < T \\ \rho_\alpha \alpha(t-1) & \text{if } t = T \end{cases} \qquad Var[\alpha(t)] = \begin{cases} \omega_\alpha^2 & \text{if } t = 1 \\ \dfrac{\omega_\alpha^2}{2} & \text{if } 1 < t < T \\ \omega_\alpha^2 & \text{if } t = T \end{cases}$$
with β(t) similarly specified. To reduce computational burden, each ρ parameter is discretized over 2000 evenly spaced values between 0 and 1, and each ω² parameter determines the level of temporal smoothness. Initial values for ρ_α and ρ_β can be specified by the user using the rho.alpha.init and rho.beta.init arguments in the grm() function. An inverse gamma prior is placed on ω_α² and ω_β² such that ω_α² ~ IG(a_ωα², b_ωα²) and ω_β² ~ IG(a_ωβ², b_ωβ²). The user can specify the hyperparameters a_ωα², b_ωα², a_ωβ², and b_ωβ² using the omega.alpha.a, omega.alpha.b, omega.beta.a, and omega.beta.b arguments in the grm() function.

3.1.4. Fixed Effects

The user is able to specify inclusion of fixed effects γ and δ for spatial and spatio-temporal covariates, L and M, respectively. The fixed effects are modeled with normal priors, γ ~ N(0, λ_γ I) and δ ~ N(0, λ_δ I), where inverse gamma priors are placed on the λ_γ and λ_δ parameters such that λ_γ ~ IG(a_σ², b_σ²) and λ_δ ~ IG(a_σ², b_σ²). Thus, the λ parameters share the same hyperparameter settings as those for σ², specified by the user using the sigma.a and sigma.b arguments in the grm() function.
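To make the hyperparameter and tuning arguments described in Sections 3.1.1, 3.1.2, 3.1.3 and 3.1.4 concrete, the sketch below supplies them explicitly in a grm() call, using the dataset from the stage 1 code example below; the numeric values are placeholders for illustration only, not recommended settings (omitting these arguments uses the package defaults, which represent uninformative priors).

# Sketch: explicitly supplying the prior and tuning arguments described above.
# All numeric values are illustrative placeholders; defaults are uninformative.
fit_with_priors <- grm(
    Y = monitor_pm25_with_cmaq$pm25,
    X = monitor_pm25_with_cmaq$cmaq,
    coords = monitor_pm25_with_cmaq[, c("x", "y")],
    space.id = monitor_pm25_with_cmaq$space_id,
    time.id = monitor_pm25_with_cmaq$time_id,
    spacetime.id = monitor_pm25_with_cmaq$spacetime_id,
    # residual variance prior (also shared by the fixed-effect lambda priors)
    sigma.a = 0.001,
    sigma.b = 0.001,
    # spatial Gaussian process priors and Metropolis-Hastings tuning
    tau.alpha.a = 0.001,
    tau.alpha.b = 0.001,
    theta.alpha.a = 5,
    theta.alpha.b = 0.05,
    theta.alpha.tune = 0.2,
    theta.alpha.init = 100,
    # temporal random walk priors and initial values
    rho.alpha.init = 0.5,
    rho.beta.init = 0.5,
    omega.alpha.a = 0.001,
    omega.alpha.b = 0.001,
    omega.beta.a = 0.001,
    omega.beta.b = 0.001,
    n.iter = 25e3,
    burn = 5e3,
    thin = 20
)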

3.1.5. Stage 1 Code Example

We loaded the ensembleDownscaleR package and fit the Bayesian downscaler regression model for CTM using the previously described Los Angeles PM2.5 dataset. The code for fitting the AOD model is omitted for brevity, but is similar to the CTM model fit and is included in the tutorial code on GitHub.
library(ensembleDownscaleR)
monitor_pm25_with_cmaq <- readRDS("monitor_pm25_with_cmaq.rds")
cmaq_fit <- grm(
    Y = monitor_pm25_with_cmaq$pm25,
    X = monitor_pm25_with_cmaq$cmaq,
    L = monitor_pm25_with_cmaq[, c("elevation", "population")],
    M = monitor_pm25_with_cmaq[, c("cloud", "v_wind", "hpbl",
                                   "u_wind", "short_rf", "humidity_2m")],
    n.iter = 25e3,
    burn = 5e3,
    thin = 20,
    covariance = "matern",
    matern.nu = 0.5,
    coords = monitor_pm25_with_cmaq[, c("x", "y")],
    space.id = monitor_pm25_with_cmaq$space_id,
    time.id = monitor_pm25_with_cmaq$time_id,
    spacetime.id = monitor_pm25_with_cmaq$spacetime_id,
    verbose.iter = 10
)
We note that grm() returns a fitted model object—cmaq_fit here—that is a named list of all posterior parameter draws from the MCMC sampler. Summary statistics or maximum a posteriori (MAP) estimates can be calculated directly from this list.
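For example, a generic way to summarize the returned draws, without assuming any specific element names inside the returned list, is sketched below.

# Sketch: inspect the fitted object and compute posterior means generically.
# No particular element names of the returned list are assumed.
str(cmaq_fit, max.level = 1)

posterior_means <- lapply(cmaq_fit, function(draws) {
    if (is.matrix(draws)) {
        colMeans(draws)      # one posterior mean per parameter
    } else if (is.numeric(draws)) {
        mean(draws)          # scalar parameter draws
    } else {
        NULL                 # non-numeric bookkeeping elements
    }
})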

3.2. Stage 2: Produce PM2.5 Estimates and Predictions with Available CTM and AOD Data

CTM data are available for all times and locations in the study area, while AOD data availability depends on the time period. In this stage, we used the grm_pred() function to produce posterior predictive means and variances for all CTM and AOD data. For example, we first input the fitted M_CTM model and CTM data for all times and locations in the study area to produce μ_st^CTM and σ²_st,CTM for all locations s and times t. We then input the fitted M_AOD model and the sparser AOD data to produce μ_st^AOD and σ²_st,AOD for all times and locations for which AOD data are available. We note that grm_pred() outputs NA values for prediction locations identical to monitor locations because, in this case, the observed concentrations can be used.

Stage 2 Code Example

Using the fitted downscaler regression models from stage 1, we produced full PM2.5 predictions for all locations and times in the study area using the CTM-based fitted model, and for all locations and times for which AOD is available using the AOD-based fitted model (Figure 3). The code for producing AOD predictions is omitted here, but is included in the full tutorial code at GitHub.
cmaq_for_predictions <- readRDS("../data/cmaq_for_predictions.rds")
cmaq_pred <- grm_pred(
    grm.fit = cmaq_fit,
    X = cmaq_for_predictions$cmaq,
    L = cmaq_for_predictions[, c("elevation", "population")],
    M = cmaq_for_predictions[, c("cloud", "v_wind", "hpbl",
                                 "u_wind", "short_rf", "humidity_2m")],
    coords = cmaq_for_predictions[, c("x", "y")],
    space.id = cmaq_for_predictions$space_id,
    time.id = cmaq_for_predictions$time_id,
    spacetime.id = cmaq_for_predictions$spacetime_id,
    n.iter = 1e3,
    verbose = TRUE
)

3.3. Stage 3: Use Cross-Validation to Produce Out-of-Sample Prediction Means and Variances

3.3.1. Cross-Validation Details

K-fold cross-validation guards against overfitting by separating the dataset into k folds, iteratively fitting the model to k − 1 folds and predicting the held-out fold. We provide two functions to perform k-fold cross-validation with the geostatistical regression model. The first, create_cv(), creates cross-validation indices according to a user-specified sampling scheme. The second, grm_cv(), returns the out-of-sample PM2.5 predictions, calculated according to user-supplied cross-validation indices (either obtained from the create_cv() function or created by the user) and arguments similar to those used for the grm() function to specify the downscaler regression model. The out-of-sample predictions are stacked into a dataset of the same length and order as the original dataset on which the cross-validation is applied.
The create_cv() function allows specification of the following types of cross-validation (Figure 4):
  • Ordinary: Folds are randomly assigned across all observations.
  • Spatial: Folds are randomly assigned across all spatial locations.
  • Spatial Clustered: K spatial clusters are estimated using k-means clustering on spatial locations. These clusters determine the folds.
  • Spatial Buffered: Folds are randomly assigned across all spatial locations. For each fold, observations are dropped from the training set if they are within a user-specified distance from the nearest test set point.
In Figure 4, we visually detail how the folds are assigned for each type of cross-validation, using the monitor locations and times in our study area as an example and assuming five cross-validation folds. We plot all fold assignments for four randomly chosen days for ordinary, spatial, and spatial clustered cross-validation, with color representing fold assignment. For the spatial buffered cross-validation, we plot only one fold assignment, with color indicating whether a location is in the first test fold, in the first training fold, or dropped because it lies within a 30 km buffer of a location in the first test fold.
When assigning folds, we enforce that each fold contains at least one observation from each spatial location and spacetime indicator. Locations for which there are fewer observations than folds should be filtered out of the dataset prior to analysis. Data from the first and last time points are left unassigned, and out-of-sample predictions for these data are output as missing data. If the out-of-sample dataset has a larger temporal range than the in-sample dataset for a given fold, the out-of-sample predictions for the extra time points are also output as missing data.

3.3.2. Producing Out-of-Sample Prediction Means and Variances

The grm_cv() function uses the previously detailed cross-validation indices and model specifications to produce estimates of f_CTM(y_st | M_CTM) and f_AOD(y_st | M_AOD), where y_st is the PM2.5 value at location s and time t, and f_CTM and f_AOD are the posterior predictive distributions based on the downscaler regression models M_CTM and M_AOD, respectively. Specifically, grm_cv() outputs posterior predictive means μ_st and variances σ²_st that are used in stage 4 to fit the full ensemble model.

3.3.3. Stage 3 Code Example

We created the cross-validation indices with the create_cv() function for both the CTM- and AOD-linked monitor data, and then used the grm_cv() function to produce out-of-sample PM2.5 predictions for all monitor observations, for both the CTM and AOD data. The code for producing AOD out-of-sample predictions is omitted here, but is included in the tutorial code on GitHub.
cv_id_cmaq_ord <- create_cv(
    space.id = monitor_pm25_with_cmaq$space_id,
    time.id = monitor_pm25_with_cmaq$time_id,
    type = "ordinary"
)
cmaq_fit_cv <- grm_cv(
    Y = monitor_pm25_with_cmaq$pm25,
    X = monitor_pm25_with_cmaq$cmaq,
    cv.object = cv_id_cmaq_ord,
    L = monitor_pm25_with_cmaq[, c("elevation", "population")],
    M = monitor_pm25_with_cmaq[, c("cloud", "v_wind", "hpbl",
                                   "u_wind", "short_rf", "humidity_2m")],
    n.iter = 25e3,
    burn = 5e3,
    thin = 20,
    coords = monitor_pm25_with_cmaq[, c("x", "y")],
    space.id = monitor_pm25_with_cmaq$space_id,
    time.id = monitor_pm25_with_cmaq$time_id,
    spacetime.id = monitor_pm25_with_cmaq$spacetime_id,
    verbose.iter = 10
)

3.4. Stage 4: Estimate Spatially Varying Weights

At this stage, we used the ensemble_spatial() function to fit the ensemble model p(y_st | M_CTM, M_AOD) = w_s f_CTM(y_st | M_CTM) + (1 − w_s) f_AOD(y_st | M_AOD), where w_s are spatially varying weights. We estimated the weights w_s by fitting the ensemble model on the out-of-sample predictive distributions produced during stage 3, f̂_CTM(y_st | M_CTM) and f̂_AOD(y_st | M_AOD), and the original PM2.5 data at all times and monitor locations. We placed a Gaussian process prior on the logit-transformed weights, q_s = logit(w_s) ~ GP(0, τ_w² K(θ_w, d)), where K is an exponential kernel, and τ_w² and θ_w are the variance and range parameters, respectively. Similar to the spatial processes in stage 1, we placed an inverse gamma prior on τ_w² and a gamma prior on log(θ_w), such that τ_w² ~ IG(a_τw², b_τw²) and log(θ_w) ~ Gamma(a_θw, b_θw). The user can specify the hyperparameters a_τw², b_τw², a_θw, and b_θw using the tau.a, tau.b, theta.a, and theta.b arguments in the ensemble_spatial() function.
The ensemble_spatial() function accepts the output from grm_cv() in stage 3 as input, and outputs the full set of posterior samples of q_s, where w_s = logit⁻¹(q_s).
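As a small sketch of working with this output, the inverse-logit transform can be applied with plogis(); the element name q used below is hypothetical, so inspect the ensemble_spatial() output (created as ensemble_fit in the code example that follows) for the actual name.

# Sketch: convert posterior samples of q_s back to weights w_s = logit^-1(q_s).
# The element name `q` is hypothetical; check str(ensemble_fit) for the real one.
q_samples <- ensemble_fit$q            # assumed: iterations x monitor locations
w_samples <- plogis(q_samples)         # inverse-logit, elementwise
w_posterior_mean <- colMeans(w_samples)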

Stage 4 Code Example

At this stage, we estimated the spatially varying weights w_s for the ensemble model, using the out-of-sample predictions produced in stage 3 and the original monitor PM2.5 measurements. We display the stage 3 out-of-sample predictions alongside the original monitor PM2.5 measurements, with the resulting weight estimates included for reference (Figure 5). If desired, one can additionally use these weight estimates with the gap_fill() function to produce ensemble-based predictions for each location at which PM2.5 is observed, though these outputs are not used in later stages. We did this here to compare the ensemble model performance to the CTM-based and AOD-based models on the times and locations for which both CTM and AOD data were observed, employing ordinary, spatial, spatial clustered, and spatial buffered cross-validation for each model (Table 1; see Supplementary Materials for more details). For all cross-validation formulations, the ensemble model outperforms both the CTM-based and AOD-based models in terms of RMSE and R², while maintaining accurate 95% prediction interval coverage. Note that the gap_fill() function uses the weight estimates to produce ensemble-based predictions for times and locations at which both CTM and AOD data are observed, and fills in the remaining times and locations with the CTM-based predictions.
ensemble_fit <- ensemble_spatial(
    grm.fit.cv.1 = cmaq_fit_cv,
    grm.fit.cv.2 = aod_fit_cv,
    n.iter = 25e3,
    burn = 5e3,
    thin = 20,
    tau.a = 0.001,
    tau.b = 0.001,
    theta.tune = 0.2,
    theta.a = 5,
    theta.b = 0.05
)
ensemble_preds_at_observations <- gap_fill(
    grm.pred.1 = cmaq_fit_cv,
    grm.pred.2 = aod_fit_cv,
    weights = ensemble_fit
)

3.5. Stage 5: Predict Weights for All Locations

At this stage, we used the weight_pred() function to spatially interpolate the posterior samples of w_s obtained in stage 4 across all 1 km × 1 km grid cells in the study area. Specifically, w_predictions = logit⁻¹(q_predictions), where q_predictions = krig(q_s), with q_s output from stage 4. These weights are used in the final stage to produce ensemble-based PM2.5 predictions for all locations in the study area.
Table 1. Model PM2.5 prediction performance for the ensemble model from stage 4 and the CTM-based and AOD-based models from stage 2, using each cross-validation type available in the create_cv() function. We assess model performance using 10-fold ordinary, spatial, spatial clustered, and spatial buffered cross-validation. The spatial buffered cross-validation is formulated with buffer sizes of 12.6 km and 42.6 km, corresponding to approximately 0.7 and 0.3 spatial random effect correlation, respectively.
CV Type                      Model         RMSE    R²      Posterior SD    95% PI Coverage
Ordinary                     AOD-Based     4.401   0.573   4.273           0.954
Ordinary                     CMAQ-Based    3.847   0.674   4.077           0.960
Ordinary                     Ensemble      3.713   0.696   4.223           0.971
Spatial                      AOD-Based     4.710   0.486   4.764           0.954
Spatial                      CMAQ-Based    4.379   0.555   4.603           0.957
Spatial                      Ensemble      4.116   0.607   4.714           0.969
Spatial Buffered (0.3 Corr)  AOD-Based     4.778   0.471   4.767           0.953
Spatial Buffered (0.3 Corr)  CMAQ-Based    6.200   0.109   5.194           0.950
Spatial Buffered (0.3 Corr)  Ensemble      4.349   0.561   4.988           0.970
Spatial Buffered (0.7 Corr)  AOD-Based     4.736   0.480   4.758           0.952
Spatial Buffered (0.7 Corr)  CMAQ-Based    4.578   0.514   4.612           0.955
Spatial Buffered (0.7 Corr)  Ensemble      4.243   0.583   4.758           0.968
Spatial Clustered            AOD-Based     5.394   0.325   5.151           0.945
Spatial Clustered            CMAQ-Based    5.304   0.348   5.104           0.959
Spatial Clustered            Ensemble      4.735   0.480   5.227           0.966

Stage 5 Code Example

We spatially interpolated the posterior samples of w s from stage 4 across all locations in the study area using the weight_pred() function (Figure 6). This provided the weights used in the final stage to produce full ensemble estimates for all times and locations in the study area.
weight_preds <- weight_pred(
    ensemble.fit = ensemble_fit,
    coords = cmaq_for_predictions[, c("x", "y")],
    space.id = cmaq_for_predictions$space_id,
    verbose = TRUE
)
Figure 6. (A) Posterior mean spatially interpolated weights w s produced in stage 5. (B) Ensemble-based posterior predictive PM2.5 mean estimates. (C) Ensemble-based posterior predictive PM2.5 standard deviation estimates.

3.6. Stage 6: Compute Ensemble Predictions for All Locations

In the last stage, we used the posterior predictive means and variances for all CTM and AOD data produced in stage 2, together with the spatially interpolated weights from stage 5, to compute PM2.5 posterior predictive means ŷ_st and variances σ̂²_yst for all times t and locations s in the study area, using the gap_fill() function.
For times and locations for which both CTM and AOD are observed, gap_fill() outputs ensemble-based estimates, where ŷ_st = w_s μ_st^CTM + (1 − w_s) μ_st^AOD and σ̂²_yst = w_s² σ²_st,CTM + (1 − w_s)² σ²_st,AOD. For times and locations for which only CTM is available, gap_fill() outputs posterior predictive means and variances identical to those produced in stage 2 from M_CTM.
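As a small worked illustration of these formulas (all numbers are made up), for a single grid cell and day:

# Illustration only: ensemble combination for one grid cell/day (made-up numbers).
w <- 0.6                              # interpolated weight w_s
mu_ctm <- 9.5; var_ctm <- 4.2         # CTM-based posterior mean and variance
mu_aod <- 11.0; var_aod <- 3.1        # AOD-based posterior mean and variance

y_hat <- w * mu_ctm + (1 - w) * mu_aod            # 10.1
var_hat <- w^2 * var_ctm + (1 - w)^2 * var_aod    # 2.008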

Stage 6 Code Example

Here, we input the posterior means and variances for all CTM and AOD data produced in stage 2, and the spatially interpolated weights from stage 5, into the gap_fill() function, which outputs PM2.5 posterior predictive means ŷ_st and variances σ̂²_yst for all times t and locations s in the study area (Figure 6).
results <- gap_fill(
    grm.pred.1 = cmaq_pred,
    grm.pred.2 = aod_pred,
    weights = weight_preds
)

4. Discussion

In this work, we introduce the ensembleDownscaleR package for fitting Bayesian geostatistical regression and ensemble models, designed for predicting PM2.5 using CTM simulations and AOD measurements. We also provide a code tutorial based on a case study of the Los Angeles metropolitan area data from 2018. The purpose of this work is to guide practitioners in generating robust PM2.5 predictions and uncertainty quantification with a flexible and well-documented software workflow. The framework can also be applied to other air pollutants and data integration problems.
There are areas for future improvement that are worth noting. The Gaussian process spatial random effects employed by our model are appropriate for the size of the Los Angeles case study data used here. For much larger datasets (with many more monitors and/or much larger prediction grids), inference for the Gaussian process parameters will be prohibitively slow. Incorporation of scalable random processes, such as Nearest-Neighbor Gaussian Processes [33], could make these methods feasible for much larger datasets than those assessed here. Furthermore, spatial covariance specifications are currently limited to isotropic, stationary kernels. There are cases where PM2.5 data may exhibit correlation structure for which anisotropic or nonstationary kernels would be more appropriate, such as data that include periods with high winds or localized wildfires. Inclusion of covariates such as wind information can often resolve this. Inspecting residual correlation and covariance parameters can help ensure that the covariance is reasonably specified for a given dataset. Finally, the current software does not support integrated parallelization. For example, the grm_cv() function fits a model on each cross-validation fold sequentially rather than exploiting multiple cores or compute nodes to fit each model concurrently. Model fitting time could be substantially reduced, without sacrificing software ease-of-use, by incorporating parallelization options directly in the ensembleDownscaleR package functions.
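Independent model fits can still be run concurrently from user code. The sketch below, a workaround rather than package functionality, launches the two stage 1 fits in separate processes with the base parallel package (fork-based, so not available on Windows); the AOD dataset name comes from the Supplementary Materials, and its column names are assumed to mirror the CMAQ dataset used in the tutorial.

# Sketch: run the CTM- and AOD-based stage 1 fits concurrently using base R's
# 'parallel' package (fork-based; Unix-alikes only). Column names of the AOD
# dataset are assumed to mirror the CMAQ dataset.
library(parallel)

ctm_job <- mcparallel(
    grm(
        Y = monitor_pm25_with_cmaq$pm25,
        X = monitor_pm25_with_cmaq$cmaq,
        coords = monitor_pm25_with_cmaq[, c("x", "y")],
        space.id = monitor_pm25_with_cmaq$space_id,
        time.id = monitor_pm25_with_cmaq$time_id,
        spacetime.id = monitor_pm25_with_cmaq$spacetime_id,
        n.iter = 25e3, burn = 5e3, thin = 20
    )
)
aod_job <- mcparallel(
    grm(
        Y = monitor_pm25_with_aod$pm25,
        X = monitor_pm25_with_aod$aod,       # assumed column name
        coords = monitor_pm25_with_aod[, c("x", "y")],
        space.id = monitor_pm25_with_aod$space_id,
        time.id = monitor_pm25_with_aod$time_id,
        spacetime.id = monitor_pm25_with_aod$spacetime_id,
        n.iter = 25e3, burn = 5e3, thin = 20
    )
)
fits <- mccollect(list(ctm_job, aod_job))    # wait for both fits to finish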

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/rs17111941/s1. Table S1: Data dictionary for monitor_pm25_with_cmaq dataset; Table S2: Data dictionary for monitor_pm25_with_aod dataset; Table S3: Data dictionary for cmaq_for_predictions dataset; Table S4: Data dictionary for aod_for_predictions dataset; Table S5: Posterior mean spatial random effects. S2: Determining Spatial Buffer.

Author Contributions

Conceptualization, W.G.M. and H.H.C.; methodology, W.G.M. and H.H.C.; software, W.G.M. and H.H.C.; validation, W.G.M. and M.Q.; formal analysis, W.G.M. and H.H.C.; investigation, W.G.M. and H.H.C.; resources, H.H.C. and Y.L.; data curation, W.G.M. and Y.L.; writing—original draft preparation, W.G.M.; writing—review and editing, M.Q., Y.L. and H.H.C.; visualization, W.G.M. and M.Q.; supervision, H.H.C.; project administration, Y.L. and H.H.C.; funding acquisition, Y.L. and H.H.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially supported by NASA (80NSSC19K0191, PI: Y. Liu) and the NASA MAIA investigation led by D. Diner (Subcontract #1558091). It has not been formally reviewed by NASA. The views expressed in this document are solely those of the authors and do not necessarily reflect those of NASA. NASA does not endorse any products or commercial services mentioned in this publication.

Informed Consent Statement

Not applicable.

Data Availability Statement

All code is fully reproducible and available on GitHub, data are made available on Zenodo, and the ensembleDownscaleR package is available for download on GitHub.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Apte, J.S.; Marshall, J.D.; Cohen, A.J.; Brauer, M. Addressing global mortality from ambient PM2.5. Environ. Sci. Technol. 2015, 49, 8057–8066. [Google Scholar] [CrossRef] [PubMed]
  2. Fu, P.; Guo, X.; Cheung, F.M.H.; Yung, K.K.L. The association between PM2.5 exposure and neurological disorders: A systematic review and meta-analysis. Sci. Total Environ. 2019, 655, 1240–1248. [Google Scholar] [CrossRef] [PubMed]
  3. Yuan, S.; Wang, J.; Jiang, Q.; He, Z.; Huang, Y.; Li, Z.; Cai, L.; Cao, S. Long-term exposure to PM2.5 and stroke: A systematic review and meta-analysis of cohort studies. Environ. Res. 2019, 177, 108587. [Google Scholar] [CrossRef]
  4. Alexeeff, S.E.; Liao, N.S.; Liu, X.; Van Den Eeden, S.K.; Sidney, S. Long-term PM2.5 exposure and risks of ischemic heart disease and stroke events: Review and meta-analysis. J. Am. Heart Assoc. 2021, 10, e016890. [Google Scholar] [CrossRef] [PubMed]
  5. Fan, J.; Li, S.; Fan, C.; Bai, Z.; Yang, K. The impact of PM2.5 on asthma emergency department visits: A systematic review and meta-analysis. Environ. Sci. Pollut. Res. 2016, 23, 843–850. [Google Scholar] [CrossRef]
  6. Gong, C.; Wang, J.; Bai, Z.; Rich, D.Q.; Zhang, Y. Maternal exposure to ambient PM2.5 and term birth weight: A systematic review and meta-analysis of effect estimates. Sci. Total Environ. 2022, 807, 150744. [Google Scholar] [CrossRef]
  7. Shaddick, G.; Thomas, M.L.; Amini, H.; Broday, D.; Cohen, A.; Frostad, J.; Green, A.; Gumy, S.; Liu, Y.; Martin, R.V.; et al. Data integration for the assessment of population exposure to ambient air pollution for global burden of disease assessment. Environ. Sci. Technol. 2018, 52, 9069–9078. [Google Scholar] [CrossRef]
  8. Van Donkelaar, A.; Hammer, M.S.; Bindle, L.; Brauer, M.; Brook, J.R.; Garay, M.J.; Hsu, N.C.; Kalashnikova, O.V.; Kahn, R.A.; Lee, C.; et al. Monthly global estimates of fine particulate matter and their uncertainty. Environ. Sci. Technol. 2021, 55, 15287–15300. [Google Scholar] [CrossRef]
  9. Appel, K.W.; Bash, J.O.; Fahey, K.M.; Foley, K.M.; Gilliam, R.C.; Hogrefe, C.; Hutzell, W.T.; Kang, D.; Mathur, R.; Murphy, B.N.; et al. The Community Multiscale Air Quality (CMAQ) model versions 5.3 and 5.3.1: System updates and evaluation. Geosci. Model Dev. Discuss. 2020, 2020, 1–41. [Google Scholar] [CrossRef]
  10. Xiao, Q.; Chang, H.H.; Geng, G.; Liu, Y. An ensemble machine-learning model to predict historical PM2.5 concentrations in China from satellite data. Environ. Sci. Technol. 2018, 52, 13260–13269. [Google Scholar] [CrossRef]
  11. Zhang, D.; Du, L.; Wang, W.; Zhu, Q.; Bi, J.; Scovronick, N.; Naidoo, M.; Garland, R.M.; Liu, Y. A machine learning model to estimate ambient PM2.5 concentrations in industrialized highveld region of South Africa. Remote Sens. Environ. 2021, 266, 112713. [Google Scholar] [CrossRef] [PubMed]
  12. Chu, Y.; Liu, Y.; Li, X.; Liu, Z.; Lu, H.; Lu, Y.; Mao, Z.; Chen, X.; Li, N.; Ren, M.; et al. A review on predicting ground PM2.5 concentration using satellite aerosol optical depth. Atmosphere 2016, 7, 129. [Google Scholar] [CrossRef]
  13. Kianian, B.; Liu, Y.; Chang, H.H. Imputing satellite-derived aerosol optical depth using a multi-resolution spatial model and random forest for PM2.5 prediction. Remote Sens. 2021, 13, 126. [Google Scholar] [CrossRef]
  14. Xiao, Q.; Geng, G.; Cheng, J.; Liang, F.; Li, R.; Meng, X.; Xue, T.; Huang, X.; Kan, H.; Zhang, Q.; et al. Evaluation of gap-filling approaches in satellite-based daily PM2.5 prediction models. Atmos. Environ. 2021, 244, 117921. [Google Scholar] [CrossRef]
  15. Pu, Q.; Yoo, E.H. A gap-filling hybrid approach for hourly PM2.5 prediction at high spatial resolution from multi-sourced AOD data. Environ. Pollut. 2022, 315, 120419. [Google Scholar] [CrossRef]
  16. Just, A.C.; Arfer, K.B.; Rush, J.; Dorman, M.; Shtein, A.; Lyapustin, A.; Kloog, I. Advancing methodologies for applying machine learning and evaluating spatiotemporal models of fine particulate matter (PM2.5) using satellite data over large regions. Atmos. Environ. 2020, 239, 117649. [Google Scholar] [CrossRef]
  17. Ma, X.; Zou, B.; Deng, J.; Gao, J.; Longley, I.; Xiao, S.; Guo, B.; Wu, Y.; Xu, T.; Xu, X.; et al. A comprehensive review of the development of land use regression approaches for modeling spatiotemporal variations of ambient air pollution: A perspective from 2011 to 2023. Environ. Int. 2024, 183, 108430. [Google Scholar] [CrossRef]
  18. Pu, Q.; Yoo, E.H. Ground PM2.5 prediction using imputed MAIAC AOD with uncertainty quantification. Environ. Pollut. 2021, 274, 116574. [Google Scholar] [CrossRef] [PubMed]
  19. Ma, Z.; Dey, S.; Christopher, S.; Liu, R.; Bi, J.; Balyan, P.; Liu, Y. A review of statistical methods used for developing large-scale and long-term PM2.5 models from satellite data. Remote Sens. Environ. 2022, 269, 112827. [Google Scholar] [CrossRef]
  20. Berrocal, V.J.; Guan, Y.; Muyskens, A.; Wang, H.; Reich, B.J.; Mulholland, J.A.; Chang, H.H. A comparison of statistical and machine learning methods for creating national daily maps of ambient PM2.5 concentration. Atmos. Environ. 2020, 222, 117130. [Google Scholar] [CrossRef]
  21. Murray, N.L.; Holmes, H.A.; Liu, Y.; Chang, H.H. A Bayesian ensemble approach to combine PM2.5 estimates from statistical models using satellite imagery and numerical model simulation. Environ. Res. 2019, 178, 108601. [Google Scholar] [CrossRef] [PubMed]
  22. Diner, D.J.; Boland, S.W.; Brauer, M.; Bruegge, C.; Burke, K.A.; Chipman, R.; Di Girolamo, L.; Garay, M.J.; Hasheminassab, S.; Hyer, E.; et al. Advances in multiangle satellite remote sensing of speciated airborne particulate matter and association with adverse health effects: From MISR to MAIA. J. Appl. Remote Sens. 2018, 12, 042603. [Google Scholar] [CrossRef]
  23. Jin, Z.; Pu, Q.; Janechek, N.; Zhang, H.; Wang, J.; Chang, H.; Liu, Y. A MAIA-like modeling framework to estimate PM2.5 mass and speciation concentrations with uncertainty. Remote Sens. Environ. 2024, 303, 113995. [Google Scholar] [CrossRef]
  24. Lyapustin, A.; Wang, Y. MCD19A2 MODIS/Terra+Aqua Land Aerosol Optical Depth Daily L2G Global 1 km SIN Grid V006. NASA EOSDIS Land Processes DAAC. 2018. Available online: https://lpdaac.usgs.gov/products/mcd19a2v006/ (accessed on 27 May 2025).
  25. Baker, K.; Woody, M.; Tonnesen, G.; Hutzell, W.; Pye, H.; Beaver, M.; Pouliot, G.; Pierce, T. Contribution of regional-scale fire events to ozone and PM2.5 air quality estimated by photochemical modeling approaches. Atmos. Environ. 2016, 140, 539–554. [Google Scholar] [CrossRef]
  26. NASA/METI/AIST/Japan Spacesystems and U.S./Japan ASTER Science Team. ASTER Global Digital Elevation Model V003. 2019. Release Date: 5 August 2019. Available online: https://lpdaac.usgs.gov/products/astgtmv003/ (accessed on 27 May 2025).
  27. Rose, A.; Weber, E.; Moehl, J.; Laverdiere, M.; Yang, H.; Whitehead, M.; Trombley, N.; Sims, K.; Whitlock, C.; Bhaduri, B. LandScan USA 2018. 2019. Data Set. Available online: https://landscan.ornl.gov/ (accessed on 27 May 2025).
  28. Platnick, S.; Ackerman, S.A.; King, M.D.; Meyer, K.; Menzel, W.P.; Holz, R.E.; Baum, B.A.; Yang, P. MODIS Atmosphere L2 Cloud Product (06_L2); NASA MODIS Adaptive Processing System, Goddard Space Flight Center: Greenbelt, MD, USA, 2015. Available online: https://modis-atmosphere.gsfc.nasa.gov/products/cloud (accessed on 1 April 2025).
  29. NASA Land Data Assimilation System (LDAS) Team. NLDAS-2 Forcing Dataset Information. 2025. Available online: https://ldas.gsfc.nasa.gov/nldas/v2/forcing (accessed on 27 May 2025).
  30. Bishop, C.M. Pattern Recognition and Machine Learning; Springer: New York, NY, USA, 2006. [Google Scholar]
  31. Gelfand, A.E.; Smith, A.F.M. Sampling-based approaches to calculating marginal densities. J. Am. Stat. Assoc. 1990, 85, 398–409. [Google Scholar] [CrossRef]
  32. Hastings, W.K. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 1970, 57, 97–109. [Google Scholar] [CrossRef]
  33. Datta, A.; Banerjee, S.; Finley, A.O.; Gelfand, A.E. Hierarchical Nearest-Neighbor Gaussian Process Models for Large Geostatistical Datasets. J. Am. Stat. Assoc. 2016, 111, 800–812. [Google Scholar] [CrossRef]
Figure 1. The PM2.5 monitor locations and grid cells for the Los Angeles metropolitan area data. The yellow squares mark the location of all grid cells in the study area and the blue triangles mark the locations of the 60 PM2.5 monitoring stations.
Figure 2. The workflow detailing all steps of the Bayesian ensemble fitting process. Stages are denoted with dashed green boxes. Tasks (including the action performed and the function employed) and outputs are denoted with solid blue boxes. Arrows indicate which outputs are used in later stages.
Figure 3. For 15 July 2018, (A) PM2.5 predictions for all locations in the study area using the CTM-based fitted model. (B) Input CTM PM2.5 simulations. (C) PM2.5 predictions for all locations at which AOD is available using the AOD-based fitted model. (D) Input AOD observations.
Figure 4. Four types of cross-validation are available in the ensembleDownscaleR package: ordinary, spatial, spatial clustered, and spatial buffered. The column facets are the four randomly chosen days with near-full locations available, and the row facets are the four types of cross-validation.
Figure 5. (A) Monitor PM2.5 measurements in study area for 15 July 2018. (B) Out-of-sample CTM-based PM2.5 mean estimates for 15 July 2018. (C) Out-of-sample AOD-based PM2.5 mean estimates for 15 July 2018. (D) Ensemble model with spatially varying weights.