A Model-to-Monitor Evaluation of 2011 National-Scale Air Toxics Assessment (NATA)

Environmental research has widely utilized the ambient concentrations of hazardous air pollutants (HAPs) modeled by the National-Scale Air Toxics Assessment (NATA) program; however, few studies have evaluated the model's performance. This study aims to evaluate the model-to-monitor agreement of the 2011 NATA data against the monitoring data reported to the U.S. Environmental Protection Agency's (EPA) Air Quality System (AQS). Concentrations of 27 representative HAPs measured at 274 sites in the U.S. in 2011 were merged with NATA data by census tract. The comparison consisted of two steps for each HAP: first, the model-monitor difference at each site was compared with the limit of quantitation (LOQ); second, the modeled annual average was compared with the 95% confidence interval of the monitored annual average. NATA predicted the national medians of most HAPs well; however, it was unable to capture high concentrations. At individual sites, a large portion of model-monitor differences fell below the LOQs and was therefore unquantifiable. Model-to-monitor agreement displayed inconsistent patterns across chemical groups and EPA regions and was strongly impacted by the comparison methods. The substantial non-agreement of NATA predictions with monitoring data calls for caution in environmental epidemiology and justice studies based on NATA data.


Introduction
Valid and representative data of hazardous air pollutants (HAPs) are required to evaluate emission compliance, air quality attainment, and population health risks. Chronic and acute exposure to HAPs may cause damage to multiple human organs [1], including the respiratory [2], nervous [3,4], circulatory [5], reproductive [6], immune [7], digestive [8], and urinary systems [9]. The U.S. Environmental Protection Agency (EPA) aimed to reduce HAP emissions by 75% from the 1993 level to meet the requirements of the Government Performance and Results Act. EPA has been working with state, local and tribal air pollution control agencies to measure ambient HAP concentrations. The current monitoring efforts, however, are inadequate for increasingly refined health and climate studies. Health data are collected at the individual level or small geographic scales, yet sparsely distributed air monitoring stations often lack spatial representativeness [10]. For example, national analyses of air pollutants identified only 169 stations for polycyclic aromatic hydrocarbons (PAHs) [11] and 379 stations for fine particulate matter (PM2.5) [12]. The sub-kilometer-scale variation of air pollutants requires dense sampling networks with more than 1-2 nodes per km² [13], which far exceeds the current capacity. Modeling programs have therefore been developed to estimate exposures at high temporal and spatial resolutions [14].
EPA initiated the National-Scale Air Toxics Assessment (NATA) in 1996 to serve as a geographical extension of the existing air monitoring network. NATA is designed to inform decision-making, e.g., to prioritize pollutants and sources, identify locations for investigation, and design monitoring programs [15]. NATA models HAP concentrations at geographic resolutions down to the census tract level. These high-spatial-resolution data have many environmental applications. Environmental epidemiology studies have used NATA data to explore associations between HAP exposure and health endpoints such as respiratory disease [16,17], autism spectrum disorder in children [18], and school performance [19,20]. The cancer risk estimates in NATA often serve as bases for addressing environmental justice issues [21][22][23][24][25][26][27][28]. NATA methodology and data are also used to model population exposure [29], predict future exposures [30], estimate excess risks [31], and establish emission-to-intake relationships [32].
Evaluating NATA model performance is imperative given the numerous applications of NATA data. NATA modeling uses conservative assumptions that potentially lead to overestimation [15]; however, some comparison studies gave the opposite results [33][34][35]. A few independent evaluation studies used local-scale monitoring in California [35], Pittsburgh, Pennsylvania [33], Detroit, Michigan [36], Texas [34], and South Baltimore, Maryland [37]. These model-to-monitor comparisons are often limited in terms of the number of chemicals and geographic areas. EPA has conducted limited evaluations and encouraged more studies [36,38].
The 2011 NATA yielded a rich database that contains concentrations, exposures, and cancer and non-cancer risks for 180 HAPs, as well as their contributing sources. There has been no independent evaluation of the 2011 NATA, although EPA has made limited model-to-monitor comparisons for selected compounds [39]. Methodologically, EPA has used multiple comparison measures, e.g., linear regressions, a factor of 2, and absolute biases [40]; however, these often give inconsistent results. Measurement uncertainty was not considered in previous comparisons, which may introduce substantial bias because many modeled concentrations are far below the detection limits. These limitations call for a systematic approach to model-to-monitor comparisons.
This study aims to evaluate the model performance of 2011 NATA by comparing modeling and monitoring concentrations. We compile the real measurements collected at 274 sites throughout the U.S. in 2011 and merge modeling and monitoring datasets. We then assess agreement using statistical and empirical methods, considering the measurement uncertainty.

Data Sources and Compilation
The monitoring HAP data were extracted from the U.S. EPA's Air Quality System (AQS). AQS is a web-based air pollution database accessible to the public. It contains ambient air pollution data and sampling condition information collected from tribal, local and state agencies through consistent and strict quality assurance (QA) processes. The HAPs were monitored following EPA's Air Toxics Monitoring Methods [41]. In brief, volatile organic compounds (VOCs) were measured by the TO-15 method, aldehydes by TO-11A, PAHs by TO-13A, and heavy metals (including antimony, arsenic, beryllium, cadmium, chromium, cobalt, lead, manganese, nickel and selenium) by the IO-3.5 method. Most HAP samples were analyzed at central laboratories, and their typical limits of quantitation (LOQs) are available [42,43]. We downloaded daily (24-hour) HAP concentrations measured in 2011 [44]. Conventional units, such as parts per billion (ppb), parts per million (ppm), ppb carbon (ppbC), or ppm carbon (ppmC), were converted to the standard unit µg/m³ to match that used in NATA. Locations of the monitoring sites in AQS were geocoded and assigned census tract numbers in ArcGIS (v10.3.1, ESRI Inc., Redlands, CA, USA).
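The unit harmonization above can be sketched in code. This is a minimal Python example assuming EPA's common reference conditions of 25 °C and 1 atm (molar volume 24.45 L/mol); the compound names and molecular weights shown are illustrative, not a complete table for the 27 HAPs:

```python
# Sketch of the ppb -> ug/m3 conversion used to harmonize AQS units with NATA.
# Assumes reference conditions of 25 degC and 1 atm (molar volume 24.45 L/mol).
MOLAR_VOLUME = 24.45  # L/mol at 25 degC, 1 atm

MOLECULAR_WEIGHT = {  # g/mol; illustrative entries only
    "benzene": 78.11,
    "toluene": 92.14,
}

def ppb_to_ugm3(conc_ppb: float, compound: str) -> float:
    """Convert a gas-phase concentration from ppb to ug/m3."""
    return conc_ppb * MOLECULAR_WEIGHT[compound] / MOLAR_VOLUME

# Example: 1 ppb of benzene is roughly 3.19 ug/m3.
```

Carbon-based units (ppbC, ppmC) additionally require dividing by the compound's carbon number before applying the same conversion.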
The modeled HAP concentrations at the census tract level were downloaded from the 2011 NATA database. The 2011 NATA contained 78,000 census tracts in the continental U.S. AQS and NATA data were then merged by census tract. The merged dataset contained 274 monitoring stations from AQS (Figure 1), matched to 274 census tracts in NATA. This subset of NATA data was representative of the entire 2011 NATA dataset, as their key descriptive statistics were very similar (Table 1). Thus, NATA in the following text refers to the matched sub-dataset. In any census tract, NATA gives a single annual average concentration of a HAP, and AQS gives 5-162 measurements of the same HAP taken in the year 2011.
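The census-tract merge can be sketched as follows. The record layouts, tract numbers, and concentration values are hypothetical placeholders, not actual AQS or NATA records:

```python
# Minimal sketch of merging daily AQS measurements with NATA annual averages
# by (census tract, compound). Field layouts here are hypothetical.
from collections import defaultdict
from statistics import mean

aqs_daily = [  # (census_tract, compound, daily concentration in ug/m3)
    ("48201100000", "benzene", 1.2),
    ("48201100000", "benzene", 0.8),
    ("17031800000", "benzene", 0.5),
]
nata_annual = {  # (census_tract, compound) -> modeled annual average
    ("48201100000", "benzene"): 0.9,
    ("17031800000", "benzene"): 0.7,
}

# Group AQS daily values by (tract, compound) ...
grouped = defaultdict(list)
for tract, compound, conc in aqs_daily:
    grouped[(tract, compound)].append(conc)

# ... and keep only tracts present in both datasets, pairing the monitored
# annual mean with the single modeled annual average.
merged = {
    key: {"c_aqs": mean(vals), "c_nata": nata_annual[key]}
    for key, vals in grouped.items()
    if key in nata_annual
}
```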

Model-to-Monitor Comparison Methods
The 2011 NATA modeling results contain very low concentrations for certain compounds; e.g., the annual average concentrations of 1,1,2-trichloroethane and chromium were 0.00041 and 0.00003 µg/m³, respectively. In practice, the measurement method has a limit of quantitation (LOQ) for a specific chemical. The LOQ is defined as the lowest concentration that can be accurately measured under regular laboratory analysis conditions [42]. Following the concept of the LOQ, the absolute difference (∆M) between the modeled concentration (C_NATA) and the monitored concentration (C_AQS), i.e., ∆M = |C_NATA − C_AQS|, is unquantifiable if ∆M is less than the LOQ. C_NATA and C_AQS can be considered to be in agreement given an unquantifiable ∆M [45].
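The LOQ screening rule above can be expressed as a small helper function; the concentrations and LOQ value in the example are illustrative:

```python
# The LOQ rule: a model-monitor difference smaller than the method LOQ is
# treated as unquantifiable, and the two values are considered to agree.

def compare_with_loq(c_nata: float, c_aqs: float, loq: float) -> str:
    """Classify a model-monitor pair using the LOQ rule."""
    delta_m = abs(c_nata - c_aqs)
    if delta_m < loq:
        return "agreement (unquantifiable difference)"
    return "quantifiable difference"

# E.g., with a chromium-like LOQ far above both the modeled and monitored
# values, any difference is unquantifiable:
print(compare_with_loq(0.00003, 0.00010, loq=0.002))
```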
To evaluate the national-level agreement, we compared national medians rather than means, considering the substantial spatial heterogeneity among monitoring sites. For each HAP, we first determined whether the ∆M was quantifiable, and then compared the two medians using the Wilcoxon signed-rank test. A p-value of ≥0.05 was considered to indicate agreement.
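A self-contained sketch of the Wilcoxon signed-rank test is shown below (in practice a library routine such as scipy.stats.wilcoxon would be used). This simplified version drops zero differences, assigns midranks to ties, and uses a large-sample normal approximation without tie or continuity corrections:

```python
# Simplified Wilcoxon signed-rank test on paired model-monitor differences,
# using a normal approximation for the two-sided p-value.
from math import erf, sqrt

def wilcoxon_signed_rank_p(paired_diffs):
    d = [x for x in paired_diffs if x != 0]  # drop zero differences
    n = len(d)
    # Midranks of the absolute differences.
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        midrank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = midrank
        i = j + 1
    w_plus = sum(r for r, x in zip(ranks, d) if x > 0)
    mu = n * (n + 1) / 4
    sigma = sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mu) / sigma
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided p

# Perfectly symmetric differences give no evidence of a median shift:
assert wilcoxon_signed_rank_p([1, -1, 2, -2, 3, -3, 4, -4]) > 0.99
```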
At individual sites, we compared annual modeled and monitored averages using statistical methods if ∆M was quantifiable. We calculated the 95% confidence interval (CI) of the annual mean concentration of a compound, and then determined whether the single NATA annual average fell within this 95% CI. We log-transformed the AQS data, as they followed a skewed lognormal distribution, and then calculated the 95% CI using Cox's method [46]. This is a strict statistical comparison that applies the widely accepted criterion of a 95% CI, or equivalently a p-value of 0.05. EPA has indicated that statistical analysis is the best way to evaluate model performance [40,47].
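Cox's method for the 95% CI of a lognormal mean can be sketched as below. This is an illustrative implementation using the normal quantile z = 1.96; a t-quantile is a common small-sample variant, and the paper's exact formulation may differ:

```python
# Cox's method: 95% CI for the arithmetic mean of lognormally distributed
# data, applied here to annual AQS concentrations.
from math import exp, log, sqrt
from statistics import mean, variance

def cox_ci_lognormal_mean(data, z=1.96):
    y = [log(x) for x in data]          # log-transformed concentrations
    n, ybar, s2 = len(y), mean(y), variance(y)  # sample variance (n - 1)
    point = ybar + s2 / 2               # log of the estimated mean
    se = sqrt(s2 / n + s2 ** 2 / (2 * (n - 1)))
    return exp(point - z * se), exp(point + z * se)

def nata_within_ci(c_nata, aqs_measurements):
    """Agreement test: does the NATA annual average fall inside the CI?"""
    lcl, ucl = cox_ci_lognormal_mean(aqs_measurements)
    return lcl <= c_nata <= ucl
```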

At each site, if NATA agreed with AQS, the site was defined as an agreement site for that chemical; otherwise, it was defined as an underestimation or overestimation site. These steps were repeated for all the available sites and the 27 HAPs of interest. The percentages of underestimation, agreement, and overestimation sites were calculated for all the sites in the U.S. as well as by EPA region. While there is no bright line defining the degree of agreement, we define a compound as an under-predicted, agreement, or over-predicted compound if it is under-predicted, in agreement, or over-predicted at ≥50% of sites, respectively. This definition enables an overall impression of the model-to-monitor agreement of each HAP at the national or regional level.
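The site-tally and ≥50% labeling rule can be sketched as follows; the per-site labels in the example are hypothetical:

```python
# Given per-site labels for one HAP, compute the percentage of
# underestimation / agreement / overestimation sites and apply the >=50%
# rule to label the compound.
from collections import Counter

def classify_compound(site_labels):
    """site_labels: list of 'under', 'agree', or 'over', one per site."""
    counts = Counter(site_labels)
    n = len(site_labels)
    pct = {k: 100 * counts[k] / n for k in ("under", "agree", "over")}
    for label, name in (("under", "under-predicted"),
                        ("agree", "agreement"),
                        ("over", "over-predicted")):
        if pct[label] >= 50:
            return name, pct
    return "no dominant pattern", pct

# 60% agreement sites -> an "agreement" compound:
label, pct = classify_compound(["agree"] * 6 + ["under"] * 3 + ["over"])
```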
EPA has long used a factor of 2 as the criterion for model-to-monitor comparisons [35,48,49], i.e., a C_NATA/C_AQS ratio of 0.5-2 could be considered agreement. Rather than using the simple ratio, we adopted an equivalent metric, the fractional bias (FB): FB = 2(C_NATA − C_AQS)/(C_NATA + C_AQS). FB is a relative measure that combines bias and ratio; it is symmetrical, bounded and dimensionless [47]. An FB between −0.67 and +0.67 indicates acceptable agreement, and values of −2 and +2 indicate extreme underestimation and overestimation, respectively [47,50]. An FB was calculated only when ∆M = |C_NATA − C_AQS| was quantifiable.
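The fractional bias and its thresholds can be computed as follows. FB here follows the standard definition FB = 2(C_NATA − C_AQS)/(C_NATA + C_AQS), under which a modeled-to-monitored ratio of 2 corresponds to FB ≈ 0.67, matching the factor-of-2 criterion:

```python
# Fractional bias: symmetric, bounded in [-2, +2], dimensionless.
# |FB| <= 0.67 corresponds to EPA's factor-of-2 agreement criterion.

def fractional_bias(c_nata: float, c_aqs: float) -> float:
    return 2 * (c_nata - c_aqs) / (c_nata + c_aqs)

def fb_category(fb: float) -> str:
    if abs(fb) <= 0.67:
        return "acceptable agreement"
    return "underestimation" if fb < 0 else "overestimation"

# A modeled value exactly twice the monitored value sits at the boundary:
# fractional_bias(2.0, 1.0) = 2 * 1 / 3, about 0.667.
```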
The differences between AQS and NATA medians were small, although they were statistically significant for most compounds. Sixteen HAPs had their ∆Ms less than the corresponding LOQs, indicating the differences were too small to be measurable. Naphthalene was the only compound that showed good agreement between AQS and NATA medians when ∆M > LOQ. Out of the remaining 10 compounds, 9 had their national medians underestimated by NATA, and acetaldehyde had its national median overestimated by NATA. Overall, NATA predicted national medians correctly for 17 compounds, underestimated medians for 10 compounds, and overestimated medians for 1 compound.
NATA was unable to capture extreme concentrations. The maximum concentrations in AQS were much higher than those modeled by NATA for all HAPs except toluene, bromomethane, and methyl isobutyl ketone. This could be explained by the inability of dispersion models to simulate extreme concentrations [38,51].

National Model-to-Monitor Agreement
At individual monitoring sites, the ∆M between NATA and AQS annual averages was first examined for each compound (Table 3). The ∆M of chromium was noticeably below the LOQ at all the sites. Similarly, ∆M was below the LOQ at over 90% of sites for cumene, vinyl chloride, 1,1,2-trichloroethane and methyl tert-butyl ether, and at 50-90% of sites for another seven compounds. A total of 12 compounds showed agreement at ≥50% of sites by comparing ∆M to the LOQ.
When ∆M was quantifiable, toluene, formaldehyde, acetaldehyde, and naphthalene showed agreement at 20-27% of sites; methyl chloride, 1,3-butadiene, tetrachloroethylene, and carbon disulfide showed agreement at 10-20% of sites; and the remaining 20 chemicals all showed agreement at <10% of sites. Therefore, NATA agreed with AQS at a small portion (<30%) of sites nationally in the quantifiable concentration ranges (Table 3).
Taken together, 14 compounds had NATA-AQS agreement at ≥50% of sites, as highlighted in Table 3. Methyl chloride, chloroform, carbon tetrachloride, carbon disulfide and lead were nationally under-predicted, and toluene, formaldehyde and acetaldehyde were nationally over-predicted. Benzene, ethylbenzene, naphthalene, 1,3-butadiene and n-hexane did not show strong patterns.
Better agreement was observed when adopting EPA's factor-of-2 criterion (Table 4). A total of 21 compounds showed agreement at ≥50% of sites. A significant increase in the number of agreement sites occurred for benzene, methyl chloride, carbon tetrachloride, formaldehyde and acetaldehyde. Styrene, tetrachloroethylene and 1,3-butadiene showed agreement at 44-48% of sites. Chloroform, carbon disulfide and lead were underestimated at 72-75% of sites. NATA overestimated concentrations at a small portion of sites for all the compounds. These results indicated that the factor-of-2 criterion was more lenient than the statistical comparisons.

Regional Model-to-Monitor Agreement
The agreement between NATA estimates and AQS measurements could be further examined by EPA region, as shown in Figure 2. Checking by compound in Figure 2, lead (TSP), formaldehyde, naphthalene, ethylbenzene, toluene, carbon disulfide, 1,3-butadiene, acetaldehyde, benzene, n-hexane, chloroform, methyl chloride and carbon tetrachloride had poor agreement in most regions. In contrast, eight compounds showed good agreement in all regions, including acrylonitrile, methyl isobutyl ketone, cumene, methyl tert-butyl ether, chromium VI (TSP), 1,1,2-trichloroethane, trichloroethylene, and vinyl chloride. Checking by region in Figure 2, certain compounds or chemical groups displayed regional characteristics. Aromatic compounds showed poor agreement in Region 1; e.g., benzene did not show model-to-monitor agreement at any site in Region 1. Halogenated compounds showed poor agreement in Region 2; e.g., methyl chloride did not show model-to-monitor agreement at any site in Region 2. Carbonyl compounds showed poor agreement in Region 6, as none of the three carbonyls showed agreement at more than 50% of sites. Overall, the agreement displayed a strong by-compound pattern but not a regional pattern.

Similar Findings from National and Local Studies
Our results confirmed previous national and local evaluations of NATA modeling. Previous NATA evaluations found good agreement for only a few compounds and underestimation for most compounds [15,35]. In the 2005 NATA model assessment, only 8 out of 68 compounds showed agreement at the national level, and the other compounds were all underestimated [48]. At the state and local levels, Lupo and Symanski [34] found that the 1996 NATA underestimated 8 out of 15 HAPs and the 1999 NATA underestimated 18 out of 27 HAPs in Texas. Wang et al. [50] found general agreement for benzene and toluene concentrations modeled by the 1999 NATA in Camden, New Jersey. Logue et al. [33] reported that the 2002 NATA underestimated 32 out of 49 HAPs measured at 7 sites in and around Pittsburgh, Pennsylvania. The Detroit Exposure and Aerosol Research Study (DEARS) reported that benzene concentrations in the 2002 NATA generally agreed with field measurements from 2004 to 2007 [36]. Garcia et al. [35] found that 12 HAPs were underestimated by the 1996 NATA, 8 out of 9 by the 1999 NATA, 10 out of 12 by the 2002 NATA, and 6 out of 10 by the 2005 NATA. Notably, previous studies found good agreement for benzene, whereas our results showed only moderate agreement, possibly due to different comparison methods. These findings indicate that model-to-monitor agreement was inconsistent by region and chemical, and that under-prediction was more frequent [35,36,50].
The general underestimation by NATA is attributable to factors including (1) missing emission sources; (2) underestimated emission rates; (3) sites intended to capture peak concentrations; and (4) measurement accuracy [52]. As seen in Table 2, the NATA model was in general unable to capture extreme concentrations. Average concentrations measured by monitors within a census tract might be affected by extremes from nearby short-term strong emissions, which could not be captured by the census-tract averages in NATA. Similarly, the National Emissions Inventory, on which NATA estimates were based, might have missed local emission sources [51]. The lack of stable estimates of meteorological conditions and photochemical reactions is another factor leading to disagreement. For example, unstable estimates of wind conditions and the secondary formation of chemicals were major weaknesses of the NATA model [38]. The uncertainty in monitored measurements due to insufficient and unbalanced geographic coverage of monitoring sites also contributed to the discrepancies between monitored and modeled estimates [35,51]. These factors warrant future improvements in both monitoring adequacy (technology, frequency and coverage) and model parameterization.

Impacts of Comparison Methods and Metrics
Model-to-monitor comparison results were significantly impacted by the comparison methods. We introduced LOQs to address measurement uncertainty, which was ignored in previous studies. It turned out that model-to-monitor differences were unquantifiable at a large portion of sites for many compounds. A small and practically unquantifiable difference should mean agreement; however, statistical analyses of these uncertain, small numbers often yield significant differences. For example, we found 100% agreement for chromium due to its extremely low concentrations (median = 0.00001 µg/m³) estimated by NATA, while the 2005 NATA evaluation reported 0% agreement when using ratios alone [48]. This and other examples suggest that ignoring measurement uncertainty, and in particular the LOQs, would lead to distinctly different results.
Previous studies have applied a number of model-to-monitor comparison metrics and methods, including biases and root mean square error [50,51,53,54], Kendall rank correlation [38], ratios [33][34][35], regressions [50], and even more complex metrics [55]. There is no commonly accepted criterion for these metrics; for example, EPA uses a relative bias within ±30% and a median ratio of 0.5-2 for agreement. The goodness-of-fit threshold for a regression line, indicated by R², is often arbitrary. The median ratio of modeled-to-monitored concentrations is the most commonly used metric; however, it may become extremely small or large when concentrations are too small to be practically quantifiable. The strengths of our approach are the consideration of measurement uncertainty and statistical comparisons with the commonly used 95% confidence interval criterion.

Study Limitations
We acknowledge limitations in data sources and the comparison methodology. We used all the available annual averages without considering the number of measurements or seasonality, in order to increase the sample sizes; a representative annual average should be calculated from data measured in at least two seasons [52]. AQS did not report the LOQ for each measurement, so we adopted LOQs from EPA's major contract laboratory [43]. The use of a single LOQ for all the measurements of a compound might have caused misclassification of ∆Ms, considering that LOQs vary over time and by laboratory. This limitation also calls for the inclusion of LOQs in future air quality data reporting. The analysis unit was the census tract, a small geographic unit often used in environmental disparity and epidemiology research. EPA has acknowledged that NATA estimates are unreliable at the census tract level and has discouraged uses of census tract data [52]. The lack of local information prevented us from explaining regional differences in model performance. For example, the poor agreement of halogenated compounds in Region 2 might be due to the quality of emissions and meteorological data as well as photochemical reactions. The underlying factors contributing to these regional differences need further investigation.

Implications for Environmental Health Disparity Research
The environment plays a critical role in determining people's health [56]. Environmental health disparity is the difference in health risks among people who experience both uneven exposure to environmental risk factors and social inequality [57]. It is often examined at the census tract level because a census tract roughly represents a neighborhood whose sociodemographic characteristics are homogeneous across a stable population size (1500 housing units and 4000 people on average) [58]. With readily available census-tract-level sociodemographic data from the census and exposure data from NATA, many studies have examined environmental disparities in HAP exposures and risks, and have consistently found significant associations between HAP exposure and sociodemographic status [21,24,59]. Our evaluation indicates two major uncertainties in NATA data. First, a large portion of NATA estimates are so low that they fall below the LOQs. Accordingly, large uncertainties exist in estimating cancer risks from exposure to carcinogenic HAPs when applying the linear non-threshold dose-response relationship. Second, NATA modeling was unable to predict extremely high concentrations due to the lack of information on local, intermittent and sporadic emissions. These two uncertainties may result in falsely strong disparity patterns in studies based on NATA data. The overall model performance warrants that future disparity studies be conducted with actual HAP monitoring data, particularly when examining disparities at the local level.

Conclusions
This study provides an independent model-to-monitor assessment of the census-tract-level HAP concentrations modeled in the latest NATA. Significant portions of modeled concentrations (5-100%) fell below the limits of quantitation (LOQs), and less than 30% of quantifiable concentrations showed statistical agreement. Of the 27 compounds examined, 14 showed agreement at over 50% of sites. Underestimation by NATA was predominant among the non-agreement cases. The agreement was inconsistent across chemical groups and regions and was impacted by the comparison methods. These findings generally concur with those from previous national, state, and local NATA evaluation studies. The substantial non-agreement of NATA predictions with monitoring data calls for caution in environmental epidemiology and justice studies that utilize NATA modeling data.