A Statistical Calibration Framework for Improving Non-Reference Method Particulate Matter Reporting: A Focus on Community Air Monitoring Settings

Abstract: Recent advancements in lower-cost air monitoring technology have resulted in increased interest in community-based air quality studies. However, non-reference monitoring (NRM; e.g., low-cost sensors) is imperfect, and approaches that improve data quality are highly desired. Herein, we illustrate a framework for adjusting continuous NRM measures of particulate matter (PM) with field-based comparisons and non-linear statistical modeling as an example of instrument evaluation prior to exposure assessment. First, we collected continuous measurements of PM with a NRM technology collocated with a US EPA federal equivalent method (FEM). Next, we fit a generalized additive model (GAM) to establish a non-linear calibration curve that defines the relationship between the NRM and FEM data. Then, we used our fitted model to generate calibrated NRM PM data. Evaluation of raw NRM PM2.5 data revealed strong correlation with FEM (R = 0.9) but an average bias (AB) of −2.84 µg/m³ and a root mean square error (RMSE) of 2.85 µg/m³, with 406 h of data. Fitting of our GAM revealed that the correlation structure was maintained (R = 0.9) and that average bias (AB = 0) and error (RMSE = 0) were minimized. We conclude that field-based statistical calibration models can be used to reduce bias and improve NRM data used for community air monitoring studies.


Introduction
Recent advancements in air monitoring have led to increased availability of 'lower-cost' non-reference method (NRM) technologies that are readily adaptable to a variety of air pollution studies [1][2][3][4][5]. Such developments provide a tremendous opportunity to improve studies of air pollution, particularly at smaller scales such as the community or neighborhood level, where collecting air monitoring data was previously unfeasible. For example, the Community Assessment of Freeway Exposure and Health (CAFEH) team collaborated with Boston neighborhoods near major highways to assess traffic-related air pollution using a variety of sensors [6,7]. In 2016, the EPA worked with the

Figure 1. Map of study location depicting the study site within the Columbia core-based statistical area (CBSA). The CBSA represents an area with one or more counties anchored by an urban center of at least 10,000 people, together with adjacent counties that are socioeconomically associated with the urban center.

Non-Reference Method (NRM) Monitoring Equipment
We used the TSI DustTrak™ DRX Aerosol Monitor Model 8533EP (TSI Inc., MN, USA) as the NRM to monitor PM in this study. The DRX is an attractive NRM because it combines two approaches to monitoring PM: (1) a photometer that uses conventional light-scattering technology to closely estimate particulate mass concentrations; and (2) a particle counter that measures particle size and number concentration by detecting the light scattered from individual particles [25]. The DRX is innovative in how it obtains and processes particle signals and ultimately records mass concentrations of different PM size fractions (Figure 2). By individually sizing particles with an optical size larger than 1 µm in diameter and calculating the mass of those particles, multiple mass fractions can be measured at the same time while improving the accuracy of reported mass concentration values [26]. The DRX was deployed at South Carolina's only National Core Multipollutant Network (NCore) site within a manufacturer-provided environmental enclosure (DUSTTRAK Environmental Enclosure 8537) on a platform adjacent to the FEM sample inlet. Given concerns over humidity [27,28], the DRX was outfitted with a manufacturer-provided Heated Inlet Sample Conditioner (part number 801850) that effectively adjusts sampled air to a relative humidity (RH) of 50%.
The humidity correction is intended to reduce potential measurement bias by maintaining a constant relative humidity level in the sample air stream before it enters the DRX optics chamber [29]. The Heated Inlet Sample Conditioner had an additional accessory, the Autozero Module (part number 801690), which was also attached to the DRX and programmed to automatically re-zero the instrument every 24 h.

Data Collection, Management, and Preparation
FEM PM2.5 data used in this study were collected with a Thermo Environmental Model 1405 F tapered element oscillating microbalance (TEOM) and reported at hourly intervals. Prior to reporting, data were verified by South Carolina Department of Health and Environmental Control (SC DHEC) personnel. NRM PM2.5 data were collected at one-minute intervals from the DRX and summarized at hourly intervals to facilitate comparisons. Data Quality Assurance/Quality Control involved multiple steps designed to minimize instrumental errors. First, we only retained data if values of all PM size fractions were reported, the concentrations were >0 µg/m³, and the concentration of each smaller size fraction was less than or equal to that of every larger size fraction. Next, data were trimmed to retain the central 99% (0.5-99.5%) for the DRX in order to avoid potential skewing by outliers. Finally, hourly data summaries were calculated for each size fraction if at least 80% of the one-minute data for a particular hour block were available.
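The record-level retention rules and the 80% completeness rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' processing code: the record layout (an hour label paired with a tuple of size-fraction concentrations ordered from smallest to largest), the function names, and the use of a simple mean as the hourly summary are all assumptions made here, and the percentile trimming step is omitted for brevity.

```python
from collections import defaultdict
from statistics import mean

# 80% of the 60 one-minute values expected in an hour block
MIN_MINUTES_PER_HOUR = 48


def is_valid_record(fractions):
    """Apply the three retention rules to one one-minute record.

    `fractions` holds concentrations (µg/m³) ordered from the smallest to
    the largest size fraction; None marks a value that was not reported.
    """
    if any(value is None for value in fractions):
        return False  # every size fraction must be reported
    if any(value <= 0 for value in fractions):
        return False  # concentrations must be > 0
    # each smaller fraction must not exceed the next larger one
    return all(a <= b for a, b in zip(fractions, fractions[1:]))


def hourly_means(records, index=0):
    """Average one size fraction per hour, keeping sufficiently complete hours.

    `records` is an iterable of (hour_label, fractions) pairs and `index`
    selects which size fraction to summarize (both hypothetical names).
    """
    by_hour = defaultdict(list)
    for hour_label, fractions in records:
        if is_valid_record(fractions):
            by_hour[hour_label].append(fractions[index])
    return {
        hour: mean(values)
        for hour, values in by_hour.items()
        if len(values) >= MIN_MINUTES_PER_HOUR
    }
```

An hour with 50 valid minutes would be summarized, whereas an hour with only 40 valid minutes would be dropped from the hourly series.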

Statistical Analyses
The aims of our statistical analyses were to conduct a field performance evaluation of our NRM and to construct a calibration curve for adjusting data based on observed differences with FEM. For evaluation, we use descriptive statistics, correlation coefficients, root mean square error (RMSE), and average bias (AB) to assess relationships between the DRX measurements and the FEM measurements. To construct our calibration curve, we employ a generalized additive model (GAM) to construct a non-linear function (i.e., penalized spline) that captures the relationship between FEM and NRM across our observed concentration range. Specifically, smooth functions (within GAMs) were developed using a combination of model selection and automatic smoothing parameter selection with penalized regression splines, which optimized the fit while seeking to minimize the number of dimensions in the model [30]. Smoothing parameters were chosen through restricted maximum likelihood (REML), and standard errors and corresponding confidence intervals were estimated using an unconditional Bayesian method [30].
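The evaluation metrics named above have standard definitions, sketched here for paired hourly series. The sign convention (NRM minus FEM, so that under-reporting by the NRM gives a negative bias) is an assumption consistent with the negative bias reported for the DRX; the function names are hypothetical.

```python
from math import sqrt
from statistics import mean


def average_bias(nrm, fem):
    """Mean of the paired differences NRM - FEM (assumed sign convention)."""
    return mean([n - f for n, f in zip(nrm, fem)])


def rmse(nrm, fem):
    """Root mean square of the paired differences."""
    return sqrt(mean([(n - f) ** 2 for n, f in zip(nrm, fem)]))


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)
```

Correlation is insensitive to a constant offset, which is why a sensor can track the reference closely in time and still carry a substantial bias.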

Calibration Model
The framework of our generalized additive model can be expressed in the following form:

$$Y_i = \alpha + s(\mathrm{NRM}_i) + \varepsilon_i$$

where Y_i is the hourly FEM referent PM2.5 concentration reported at time i during the co-located monitoring period, α is the intercept term, s is the smooth function of the NRM PM2.5 summarized measurement for hour i (NRM_i), and ε_i represents the error term, which is assumed to be normally distributed.
Our model was fit using the 'mgcv' package in R: A Language and Environment for Statistical Computing [31].
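For readers less familiar with penalized regression splines, the fitting criterion can be written in a generic textbook form. This is a sketch that assumes a second-derivative roughness penalty; the text does not state the basis or penalty actually used. The symbols x_i (the hourly NRM PM2.5 value) and λ (the smoothing parameter, which the authors selected by REML) are notation introduced here.

```latex
(\hat{\alpha}, \hat{s}) = \arg\min_{\alpha,\, s}\;
  \sum_{i=1}^{n} \bigl( Y_i - \alpha - s(x_i) \bigr)^2
  \;+\; \lambda \int \bigl( s''(x) \bigr)^2 \, dx
```

Under this form, a larger λ pulls s toward a straight line, while a smaller λ lets the curve follow the data more closely.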

Ambient Conditions during Monitoring Period
Our co-location monitoring campaign began on 17 January 2018, and ended on 7 February 2018, giving us a 22-day (526 hr) observational period. Weather conditions during this time were generally cool, as average (standard deviation) temperatures were 9.2 (6.2) °C but ranged from −4.4 to 21.7 °C, indicating modest variability. Relative humidity averaged 69.9 (25.6)% but ranged from 14.0 to 100%, indicating that conditions spanned relatively dry periods to complete saturation (i.e., rain events). Average (SD) wind speed was 1.6 (1.9) m/s and ranged from 0 to 9.3 m/s.

PM Data Summary and Evaluation
The focus of our co-location monitoring was to collect data that were suitable for comparison of our NRM to a FEM. As such, we first chose to implement EPA criteria [32] in the selection of our comparison data, and thus days on which 24 hr Federal Reference Method (FRM) measurements were below 3.0 µg/m³ were excluded. This criterion resulted in the dropping of 1 day (i.e., approximately 22 hrs) from our comparison period, when the collocated FRM PM2.5 reported concentration was 2.2 µg/m³ (on 29 January 2018). However, we note that FRM data collection was every 3rd day during this study period and FRM data were only available for 7 days (Table 1). As such, we chose to extend this criterion [32] to FEM data and dropped hourly FEM PM2.5 values less than or equal to 2.9 µg/m³. This resulted in the additional exclusion of n = 98 h, where reported FEM TEOM values ranged from −3.6 to 2.9 µg/m³, giving us a final comparison period of 406 hrs in which PM conditions were deemed suitable for comparison [33]. During the sample period, the 1 hr average (SD) FEM PM2.5 concentration was 8.3 (4.6) µg/m³, indicating that air pollution levels were low to modest during our comparison. The 1 hr average (SD) NRM PM2.5 was 5.6 (3.5) µg/m³, indicating that our DRX, on average, reported lower levels than the FEM. Calculation of average bias (AB) confirms this difference at −2.54 µg/m³ during the comparison period (n = 406 h). The corresponding root mean square error (RMSE) was 2.55 µg/m³. Another aspect of interest in our evaluation was to determine how well our NRM tracked temporal changes in air quality reported by the FEM. To examine this, we plot these data over time to visualize agreement between the different measurement approaches (Figure 3). Here, we see that our NRM performed rather well at capturing the peaks and valleys in reported air quality. Correlation analysis confirms this, as we found strong positive agreement (R = 0.9) between hourly NRM and FEM data.
The final aspect of interest for our NRM evaluation is the reliability of our NRM to successfully capture data during a period of interest. Here, we found that the DRX performed very strongly, as we successfully monitored and retrieved data for nearly 526 hrs (31,528 min of data), indicating a capture rate of 100%; however, our data cleaning steps retained 99% of the one-minute data. Additional stringent criteria, such as removing hourly observations ≤ 2.9 µg/m³ (n = 98) and another 22 h with a corresponding FRM measurement of 2.2 µg/m³, resulted in 406 total hours of data (77% of the original data: 406/526), which are used in subsequent analysis.
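The hour accounting in this paragraph can be checked with a few lines of arithmetic. All figures below are taken from the text; only the variable names are introduced here.

```python
TOTAL_HOURS = 526          # length of the co-location period
FRM_EXCLUDED_HOURS = 22    # hours on the day with a 24 hr FRM value of 2.2 µg/m³
LOW_FEM_HOURS = 98         # hourly FEM values <= 2.9 µg/m³
MINUTES_RETRIEVED = 31528  # one-minute records retrieved from the DRX

retained = TOTAL_HOURS - FRM_EXCLUDED_HOURS - LOW_FEM_HOURS
share = retained / TOTAL_HOURS

print(retained)                          # 406
print(f"{share:.0%}")                    # 77%
print(round(MINUTES_RETRIEVED / 60, 1))  # 525.5
```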

Statistical Calibration Model
Our GAM fit showed that the non-linear calibration curve (Figure 4) was able to explain 83% of the deviance (i.e., variability around the mean) of FEM-reported 1 hr PM2.5 and had an R² = 0.82. Application of our model fit to produce calibrated data shows that adjusted data retained the original correlation structure (R = 0.9) and that bias was successfully minimized, as average bias was reduced from 33% to 6% (Figure 5a,b). The average (SD) for the calibrated PM2.5 was 8.3 (4.1) µg/m³, with a range of 3-33 µg/m³. Examination of the calibration curve (Figure 4) reveals that the relationship with our response was non-linear, as we found areas of the data with strong agreement (~12 µg/m³) and areas where the 'steepness' of the curve varied (relatively low values, relatively high values). A resulting scatter plot of our calibrated data (superimposed over raw data) reveals the general positive shift in the data that our application of the model performs (Figure 5c,d).
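To illustrate why a calibration fitted to co-located data removes in-sample average bias while leaving correlation intact, the sketch below substitutes an ordinary least-squares straight line for the GAM smooth. This is a deliberately simpler stand-in, not the authors' model, and the six data pairs are synthetic values invented for the example.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for y = a + b * x (a linear stand-in for s)."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    a = my - b * mx
    return a, b


# Synthetic illustration only; these are not values from the study.
nrm = [2.0, 4.0, 5.0, 7.0, 9.0, 12.0]
fem = [4.5, 6.0, 8.0, 9.5, 12.5, 15.0]

a, b = fit_line(nrm, fem)
calibrated = [a + b * x for x in nrm]

raw_bias = mean([n - f for n, f in zip(nrm, fem)])
cal_bias = mean([c - f for c, f in zip(calibrated, fem)])

print(raw_bias)               # -2.75
print(abs(cal_bias) < 1e-9)   # True
```

Because the fitted line includes an intercept, its residuals average to zero on the data used for fitting, and a positive slope leaves the correlation with the reference unchanged. A similar in-sample property is expected for a smooth fitted with an unpenalized intercept, although that is an inference here rather than a result stated in the text.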

Discussion
In this study, we aimed to illustrate how the development of statistical modeling tools can be used to improve reporting of particulate matter in studies that rely on non-reference monitoring (NRM) methodologies (e.g., low-cost sensors). The key finding from our study was that our statistical model can be applied to identify and remove non-linear trends in the biases observed between data collected by our NRM monitor and a FEM (Figure 5). Evaluations of raw NRM data identified a consistent negative bias (i.e., underreporting) in the DRX data that, if not accounted for, could lead to erroneous reporting in air quality studies. That said, raw DRX data tracked very well with reported FEM measures, indicating that raw data can be used to successfully monitor trends.
An important aspect of discussion for these data is to relate our findings to performance goals specified by EPA's sensor evaluation guidelines [34]. To achieve this, we applied EPA formulations to our raw and calibrated NRM DRX data to improve understanding of how the resulting approach can be applied in community-based air quality studies. We found an overall bias error estimate of 33.1% for our raw NRM DRX data, a finding that implies that the raw DRX measurements are suitable for Education and Information only (Tier (or level) I, precision and bias error < 50%) [34]. However, when our calibration framework is employed, bias is reduced to 6.4%, indicating that calibrated NRM data can be used for hotspot identification and characterization (Tier II) and personal exposure monitoring (Tier IV). Guidelines for the precision and bias errors for Tiers II and IV are both < 30% [34]. We note that our findings may also satisfy the requirements for supplemental monitoring (Tier III, precision and bias error < 20%). Overall, we find these results to be encouraging, as they demonstrate significant improvements in the reported data from our NRM and expand the potential applicability of the monitoring approach.
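The tier comparison above amounts to checking a bias error against a set of ceilings. The sketch below encodes only the thresholds quoted in this paragraph; the dictionary and function names are hypothetical, and the check covers bias error alone, whereas the guidance also sets a precision requirement that is not evaluated here.

```python
# Error ceilings (%) as quoted in the text from the EPA guidance [34].
TIER_LIMITS = {
    "I (education and information)": 50.0,
    "II (hotspot identification and characterization)": 30.0,
    "III (supplemental monitoring)": 20.0,
    "IV (personal exposure monitoring)": 30.0,
}


def tiers_met(bias_error_pct):
    """Return the tiers whose bias ceiling exceeds the given bias error (%)."""
    return [
        tier
        for tier, limit in TIER_LIMITS.items()
        if abs(bias_error_pct) < limit
    ]


print(tiers_met(33.1))       # ['I (education and information)']
print(len(tiers_met(6.4)))   # 4
```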
A detailed study by Rivas et al. identified several technical problems affecting the performance of the DRX, including frequent artefact jumps, zero measurements, and differences between actual ambient PM levels and DRX measurement records [35]. The authors recommended zero offset calibrations and the correction of problematic data [35]. The field-based statistical calibration model illustrated in our study can be used to reduce bias and improve reporting of NRM data. There are multiple reasons why our NRM may have reported different values from the FEM TEOM method. First, we believe this may be partially explained by a loss of volatile and semi-volatile compounds during the heated inlet phase of air flow in the DRX monitoring schematic (Figure 2). We postulate this because heat increases volatilization. This is consistent with other work: a comparison study in the United Kingdom [36] observed differences in mass concentrations between a TEOM and a gravimetric sampler that were attributed to volatilization. Here, the authors noted that adding the calculated mass of ammonium nitrate significantly improved the mass concentration estimate and suggested that higher amounts of semi-volatile compounds tended to be associated with elevated PM levels [36]. Some of these semi-volatile compounds may have been lost at our sampling site. Other researchers have also pointed to the relationships between water and certain properties of PM components (e.g., inorganic or hygroscopic organic compounds) as important contributors to differences in mass concentrations [37,38].
Together, these factors must be considered and can complicate comparisons between instruments. Another possible explanation for our reporting of lower PM values is deposition on the inlet tubing of the DRX. The DRX setup uses a seven-inch, slightly curved conductive tube that connects to a 360° omni-directional sampling inlet, and this may contribute to a longer than necessary sampling pathway. Furthermore, the NRM and TEOM operated at different flow rates during our study, which affected the volume of air sampled per unit time and may have contributed to differences in reporting (Table 2). More specifically, the total flow rate for the DRX is 3.0 L/min, with 2.0 L/min serving as the measured sample volume and the remainder representing sheath flow. On the other hand, the TEOM operates at a total flow rate of 16.67 L/min, with a measured sample volume of 3.0 L/min and a bypass flow of 13.67 L/min. Additionally, both the NRM and the TEOM use heated inlets; hence, air is sampled at approximately 50 °C, as opposed to being sampled at ambient temperature and pressure [29,36,39].

Limitations
As with all studies, our calibration framework is not without limitations. First, co-location studies always benefit from longer monitoring periods, as the breadth of conditions captured is an important factor to consider. As such, our campaign should only be viewed as an example, because our period was relatively short, which limits the interpretability of our results and the applicability of the resulting calibration model. Ideally, co-location campaigns would capture the full range of conditions experienced in a potential air shed; however, we note that this may not always be feasible in community air quality studies. Another limitation that stems from our relatively short campaign is that our calibration curve may not be stable over the long term, as we did not include data from other times of the year (e.g., summer) that may change the relationships between our reported data. A recent study highlighted the importance of re-evaluating model performance and developing new models over time [40].
Another potential limitation of this study was that we did not adjust for meteorology in our calibration model. While inclusion of weather conditions may have improved model fit, we chose not to include them given the use of the heated inlet mechanism and because the air used to determine the mass concentrations was under relatively similar conditions. Finally, the scope of this work is limited to a particular type of sensor (the DRX) calibrated with data from a single environment with low PM levels and for a limited period of time. As such, generalizing the specific results to other pollutants, to instruments from different manufacturers, and to different environments may be difficult. However, the concept of using GAMs to remove bias and improve data quality can be applied to other instruments and in a variety of environments. Such extended applications of this concept can improve the state of the science of low-cost air sensors and contribute to the further understanding of air pollution mixtures and their effects on human health.

Future Directions
Ultimately, our goal is to improve community outreach and awareness, and to aid the development of community-based air pollution studies with accurate and interpretable data. As such, we plan to continue developing our calibration models with the intent of supporting future monitoring efforts in near-port communities in Charleston, SC [41]. Our calibration framework may improve such efforts, as high-quality data are required to document baseline measurements and can serve as a reference for potential future changes [42,43]. Additionally, gas- and particle-phase chemical analysis will be a major strength of such comparison studies in the future.

Conclusions
This study reveals that field-based statistical calibration models can be used to reduce bias and thus improve reporting of NRM data in community air monitoring studies. This work also provides pertinent information needed for the voluntary performance evaluation program suggested by the US EPA, and it highlights the general usefulness of NRM technology for neighborhood-level air quality characterization. We also demonstrate that care is needed when interpreting results, and we recommend longer monitoring periods to understand neighborhood/community-level air quality.