2.1. Model Building
The particle counter used in the community monitors was a modified Dylos 1700. The firmware was changed to increase the number of particle size bins from two to four (>0.5 μm, >1.0 μm, >2.5 μm, and >10 μm). A custom circuit board was designed to interface the Dylos with an Arduino Yun to add networking capabilities. The circuit board also integrated a Honeywell HIH 6300 temperature and relative humidity sensor (Honeywell, Charlotte, NC, USA). Each Arduino Yun was then networked via Wi-Fi, Ethernet, or a T-Mobile cellular modem.
In previous work, we collocated a monitor at the California Air Resources Board (CARB) Calexico-Ethel site, located at Calexico High School on East Belcher Street, which collects reference measurements of PM from both Federal Equivalent Method (FEM) beta-attenuation monitors (BAMs) and Federal Reference Method (FRM) filter-based gravimetric samplers. By comparing FEM and FRM data with data from the community air monitors, an equation for estimating mass concentrations from the Dylos particle count concentrations was developed. The calibration equation was validated by comparing calibrated monitor results with PM2.5 measured by collocated reference instruments at six other sites. The process was previously described in detail [15].
Community monitoring PM2.5, PM10, relative humidity, and temperature data from 35 sites for a 12-month period, 1 October 2016 to 1 October 2017, were used in the following analyses. Relative humidity is known to change particle size through the addition or subtraction of water from particles [16]. Since temperature and relative humidity were moderately correlated, only relative humidity was included in the conversion equation. The PM data were converted from particle counts to particle mass as detailed in Carvlin et al. [15]. However, the conversion equation was updated using data from the Calexico-Ethel site for the 12-month study period; the previous PM10 conversion equation had used data from 15 January 2016 to 12 July 2016. The updated equation differed from the previous one in that it used new conversion constants fit to the new time period: for PM10, c1 = 9.35, c2 = 0.216, c3 = −0.344; for PM2.5, c1 = 5.41, c2 = 0.00831, c3 = −0.224. PMcoarse was calculated as PM10 − PM2.5. The conversion equations for PM2.5 and PM10 take the form given in Carvlin et al. [15], where β0 is the intercept, RH is the relative humidity as measured by the RH sensor on our custom circuit board, and e is the residual error. The factors c1, c2, and c3 were used in the inversion of the model to estimate BAM-equivalent PM concentrations from the Dylos measurements, as described in Carvlin et al. [15].
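To make the conversion and inversion step concrete, a minimal sketch follows. The functional form used here (linear in counts, with an exponential RH adjustment) is an assumption for illustration only; the actual equations and their inversion are given in Carvlin et al. [15]. The constants are the updated values reported above, and note that the paper's analyses were performed in R rather than Python.

```python
import math

# Updated conversion constants from the 12-month Calexico-Ethel fit
# (values as reported in the text above).
CONSTANTS = {
    "pm25": {"c1": 5.41, "c2": 0.00831, "c3": -0.224},
    "pm10": {"c1": 9.35, "c2": 0.216, "c3": -0.344},
}

def counts_to_mass(counts: float, rh: float, species: str) -> float:
    """Estimate a BAM-equivalent mass concentration (ug/m^3) from a Dylos
    count concentration and relative humidity.

    NOTE: the linear-in-counts form with an exponential RH adjustment is an
    illustrative assumption, not the published equation; rh is taken as a
    fraction between 0 and 1.
    """
    c = CONSTANTS[species]
    return c["c1"] + c["c2"] * counts * math.exp(c["c3"] * rh)

# PMcoarse follows by difference, as in the text: PMcoarse = PM10 - PM2.5
pm25 = counts_to_mass(1200.0, 0.40, "pm25")
pm10 = counts_to_mass(90.0, 0.40, "pm10")
pm_coarse = pm10 - pm25
```

The inversion direction matters: c1, c2, and c3 come from the calibration fit against the reference BAM, and the function above applies them to field Dylos measurements to produce BAM-equivalent concentrations.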
As part of the conversion process, data were run through an automated quality control (QC) process; hours with less than 75% data completeness and hours with particle counts below 30 in Dylos bin 1 were discarded. After the automated QC, a manual QC process was performed to identify time periods when the monitor response was slowly attenuated by incremental dust build-up on the photodiode and periods when the monitor readings oscillated rapidly between high and low values; this dropped a further 1.2% of the data. The conversion equation can produce negative numbers, which were used as is in the following analyses unless otherwise noted. Hourly PM2.5 and PMcoarse data were averaged to monthly values using a 50% data-completeness cutoff, leaving a total of 207 monthly data points across 33 monitors. Each monitor had six months of data on average; however, some monitors had only a few months. The monitors with the least data were those near the Salton Sea, which have poor cell reception and are subject to harsher environmental conditions, leading to lower data completeness.
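The automated screens and the monthly averaging rule can be sketched with pandas. This is a sketch only: column names such as `pct_complete` and `bin1_counts` are illustrative assumptions, and the paper's processing was done in R.

```python
import pandas as pd

def automated_qc(df: pd.DataFrame) -> pd.DataFrame:
    """Drop hours failing the automated screens: less than 75% data
    completeness, or fewer than 30 particle counts in Dylos bin 1."""
    keep = (df["pct_complete"] >= 75) & (df["bin1_counts"] >= 30)
    return df[keep]

def monthly_average(hourly: pd.Series) -> pd.Series:
    """Average an hourly PM series to monthly values, masking months
    where fewer than 50% of the month's hours are present."""
    mean = hourly.resample("MS").mean()
    n_obs = hourly.resample("MS").count()
    expected = pd.Series(mean.index.days_in_month * 24, index=mean.index)
    return mean.where(n_obs / expected >= 0.5)
```

The 50% cutoff is applied against the full number of hours in each calendar month, so a month missing more than half its hours yields a missing monthly value rather than a biased average.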
Regulatory data were downloaded from the California Air Resources Board (CARB) website. The regulatory network consists of five sites located near population centers in Imperial Valley that have Met One 1020 PM10 beta-attenuation monitors (BAMs) (Met One, Grants Pass, OR, USA) [17]. Two of these sites also have Met One 1020 PM2.5 BAMs. There are also five sites located around the Salton Sea, operated by the Imperial Irrigation District, that have PM2.5 and PM10 Thermo Fisher Scientific Series 1405-D tapered element oscillating microbalances (TEOMs) (Thermo Fisher Scientific, Waltham, MA, USA) [18]. These sites were set up to monitor emissions from the Salton Sea as it recedes due to changes in water rights that reduced the agricultural runoff that kept the sea from evaporating. Only QC-screened data were used. Values greater than 985 μg/m³ for BAMs were excluded since this is above the range of the instrument [19]. No upper cutoff was used for TEOM measurements since all values were within the instrument range [20]. All negative values from regulatory instruments were kept as is and included in the analysis.
Land-use variables and community monitor locations were loaded into ArcGIS (ESRI, v. 10.3.1, Redlands, CA, USA). Then, 250 m, 500 m, and 1000 m buffers were created around each monitor, and land-use parameters were sampled within each of these buffers. Geographic information system (GIS) and meteorological variables are listed in Table 1 along with their source, date, buffers, and averaging period. All data manipulation and analyses were performed using R statistical software (v. 3.3.3, https://www.r-project.org/).
Agricultural burning records were obtained from the Imperial Air Pollution Control District. Acres burned were recorded at the daily level. This information was added to the model as acres burned within 5 km of a monitoring site within the last day. When multiple burns were recorded within 5 km of a site, they were summed.
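The burn covariate described above can be sketched as follows. The record layout, the integer day index, and the projected kilometer coordinates are simplifying assumptions made for this sketch:

```python
import math

def acres_burned_near_site(burns, site_xy, day, radius_km=5.0):
    """Sum acres burned within radius_km of a site during the previous day.

    `burns` is a list of dicts like {"xy": (x, y), "day": int, "acres": float},
    with coordinates in km on a projected grid and days as integer indices
    (both simplifying assumptions for this sketch).
    """
    total = 0.0
    for b in burns:
        if b["day"] != day - 1:  # restrict to "within the last day"
            continue
        d = math.hypot(b["xy"][0] - site_xy[0], b["xy"][1] - site_xy[1])
        if d <= radius_km:
            total += b["acres"]  # multiple qualifying burns are summed
    return total
```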
Other GIS variables were considered but rejected because all monitors had the same, or nearly the same, value for that variable. Dropped variables included indicators of industrial PM emissions, since none of our monitors were located near industrial sites permitted to release PM. Satellite-derived PM2.5 was included but was not predictive, perhaps because the data were 15 years old and satellite measurements are known not to perform well in desert areas.
Meteorological data completeness was less than ideal, especially for planetary boundary layer height. Because the models require a complete dataset, hours without meteorological data were dropped. This left a limited number of complete hours for October and November 2016; therefore, these months are not included in the monthly and yearly PM maps.
Some monitors have buffers that cross the US/Mexico border; however, we had no land-use data for Mexico. Using only the US side of the buffer would underestimate the true value of that land-use variable. To adjust for this, the percentage of each buffer's area lying within Mexico was recorded for each site. The variables were then multiplied by 100/(100 − % area), which gave the land use for the whole buffer under the assumption that the land-use distribution in Mexico matches that in the US. A satellite image of the area showed that, in most cases, land within the buffer on both sides of the border was primarily urban.
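The border adjustment is a single rescaling. For example, if 20% of a buffer lies in Mexico and a land-use variable measures 40 on the US side, the whole-buffer estimate is 40 × 100/(100 − 20) = 50:

```python
def adjust_for_border(us_value: float, pct_area_in_mexico: float) -> float:
    """Scale a land-use value measured on the US side of a buffer up to the
    whole buffer, assuming the same land-use mix on both sides of the border.
    """
    return us_value * 100.0 / (100.0 - pct_area_in_mexico)
```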
Land-use variables were converted from continuous to categorical or binary variables. This was done because the monitors do not cover the range of land use seen throughout the valley, particularly the range sampled by the grid of points used for prediction (the fishnet). If linear extrapolation were used, the fishnet predictions failed, becoming extremely small or large. Histograms of each variable were examined to determine whether it should be converted to binary or categorical form. The cut point for the binary variables was the first quartile, since most of these variables had nearly all measurements near zero and just a few at much higher values. All categorical variables were given three categories, with cut points at the first and third quartiles: values at or below the 25th percentile were labeled "low"; values between the 25th and 75th percentiles, "medium"; and values above the 75th percentile, "high". After conversion from continuous to categorical and binary, the fishnet predictions were much closer to the range of the monitoring measurements. However, the choice of cut points depends on the data and therefore limits the more general application of the models developed in this paper.
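The quartile-based recoding can be sketched as follows. The use of Python's `statistics.quantiles` (with its default exclusive method) and the direction of the binary coding are implementation assumptions:

```python
import statistics

def to_categorical(value, q1, q3):
    """Three-level coding: 'low' at or below Q1, 'medium' up to Q3,
    'high' above Q3."""
    if value <= q1:
        return "low"
    if value <= q3:
        return "medium"
    return "high"

def to_binary(value, q1):
    """Binary coding with the first quartile as the cut point
    (coding 1 above Q1 is an illustrative choice)."""
    return int(value > q1)

# Example: derive cut points from the observed distribution of a variable.
values = list(range(1, 101))
q1, _median, q3 = statistics.quantiles(values, n=4)
labels = [to_categorical(v, q1, q3) for v in values]
```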
Three alternative models were considered: a Bayesian additive regression trees (BART) model, a lasso model, and a partial least squares (PLS) model. Categorical variables were converted to binary variables for use in models that cannot process categories. PLS is a modeling technique that re-projects the data to find the directions in the input variable space that explain the most variance in the outcome; it is used in PM modeling, particularly when there are a large number of variables [21]. Lasso is a penalized least squares method that reduces the number of variables in the model based on an alpha (penalty) parameter chosen by cross-validated testing. Mercer et al. [22] and Knibbs et al. [23] used lasso to help reduce the number of variables in PM LUR models. BART is a model that sums individual regression trees using a Bayesian approach [24]. It has been used to predict torrential rain and avalanches, and to relate vehicle trip duration to household characteristics [25,26,27]. It should be noted that BART has a random component, such that each time it is re-run it produces slightly different results. For model creation and variable selection, it was therefore important to average over a large number of runs to get a sense of the model's average response.
Models were compared by leaving out one site at a time and calculating the RMSE at that site using a model fit to the remaining sites. For each model, the RMSE was averaged across all sites. BART was found to be the best-performing model for both PM2.5 and PMcoarse. A variable selection test was then performed to identify which variables had the most impact on these models. Variable selection for the BART models was done by dropping one variable at a time from the model and calculating the test R². Each BART model was run many times and the results were averaged to reduce noise from the random component. The variables that led to the largest decrease in R² were those with the most impact on the model. The 10 most important variables for the PM2.5 and PMcoarse models were compared. To assess variable selection stability across models, the top 10 BART and lasso variables were also compared; lasso variables were ranked by their standardized coefficient values, corrected by their standard deviations.
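The leave-one-site-out comparison can be sketched generically. Here `fit` and `predict` are placeholders standing in for the BART, lasso, or PLS fitting calls, which are not reproduced:

```python
import math

def loso_rmse(sites, fit, predict):
    """Leave-one-site-out evaluation: fit on all other sites, predict the
    held-out site, and average the per-site RMSE across sites.

    `sites` maps site id -> (X, y); `fit` and `predict` are supplied by the
    caller (placeholders for the actual model-fitting routines).
    """
    rmses = []
    for held_out in sites:
        train = {s: d for s, d in sites.items() if s != held_out}
        model = fit(train)
        X, y = sites[held_out]
        preds = predict(model, X)
        mse = sum((p - t) ** 2 for p, t in zip(preds, y)) / len(y)
        rmses.append(math.sqrt(mse))
    return sum(rmses) / len(rmses)
```

Because BART is stochastic, this whole procedure would be repeated over many runs and the resulting RMSEs averaged, as described above.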