1. Introduction
In the past three decades, many air quality models have been developed as tools for air quality simulation and prediction in air quality planning, management, and assessment. Besides the Gaussian models, which mainly focus on transport and turbulence diffusion processes at the local scale, numerical chemistry and transport models (CTMs) have been developed. CTMs can simulate the atmospheric physical and chemical processes of air pollutants from emission to removal in the atmosphere, including advection transport, turbulent diffusion and convection, gas-phase and liquid-phase chemical reactions, aerosol dynamics and heterogeneous chemistry, dry deposition and surface processes, and wet deposition by clouds and precipitation back to the Earth’s surface. They are suitable for the simulation and prediction of both primary air pollution problems and secondary and regional air pollution problems, such as particulate matter, photochemical oxidation, and acid deposition, and can be used with different spatial scales (city, regional, or global scale), different resolutions or grid sizes, and different temporal resolutions or time steps. Based on the development level and the degree of complexity, different models and versions vary in physical and chemical algorithms and options, the demand for emission and meteorological input data, and the computer resource requirements.
Due to the complexity of atmospheric processes and the limitations of scientific knowledge and computing technology, air quality models are approximations of real physical and chemical processes and mechanisms. Therefore, they may not simulate all physical and chemical processes well for all species and application scenarios. In addition, there are always uncertainties in the emission inventories and meteorological input data, so CTMs are not always accurate and carry various uncertainties. Besides accuracy, consistency is another important requirement when a model is used as a prediction and assessment tool for air quality management or regulatory purposes, such as air quality planning and environmental impact assessment. At the same time, the availability of input data and parameters and ease of use are important issues for air quality modeling in regulatory or management applications. Therefore, the evaluation and validation of air quality models are essential for model development and are necessary preconditions for regulatory applications.
During its long period of development and application, the CMAQ model has been systematically evaluated for every version [1,2,3], especially through the annual CMAS conference [4]. For example, a comprehensive evaluation of CMAQ v5.3 and v5.3.1 [3,4] was conducted recently using monitoring data from the 1304 O3 monitors and 2010 PM2.5 monitors of the Air Quality System (AQS) maintained by the U.S. Environmental Protection Agency (EPA), 190 O3 monitors and 196 PM2.5 monitors of the National Air Pollution Surveillance Program (NAPS) in Canada, PM2.5 components from 242 CSN sites, 149 sites of the Interagency Monitoring of Protected Visual Environments (IMPROVE), and 94 Clean Air Status and Trends Network (CASTNet) sites. CAMx7.1 was also evaluated for daily PM2.5 concentrations and component species against the observed data of 2016, including 107 CSN sites and 150 IMPROVE monitoring sites. Many evaluations have also been conducted for specific model developments and specific applications [5,6,7,8,9,10].
For the regulatory application of air quality models, it is crucial to evaluate the overall performance for the main pollutants at the averaging times required by air quality standards. Although many model evaluations for specific air pollution cases have been conducted [11,12,13,14], there have been only a few systematic evaluation and verification studies for regional air quality models at the national level in China. In the project on Long-range Trans-boundary Air Pollutants in Northeast Asia (LTP), the episode and long-term simulation results for sulfur concentrations from CMAQ by China, RAQM by Japan, and CADM by Korea were compared [15]. In MICS-Asia phase III, the performances for O3, NOx, and PM from 14 independent modeling groups were compared [16,17]. In response to problems in model applications and to meet the urgent need for regulatory applications of air quality models under the Air Pollution Control Action Plan (2014–2017), which targeted PM2.5 in China, the Ministry of Ecology and Environment and the Ministry of Science and Technology of China jointly launched the project “Research on the Technical System of Regulatory Air Quality Modeling” in 2017 (hereinafter referred to as “the Project”) [18].
This study is the model evaluation part of the Project. Five regional CTM air quality models and versions (hereafter AQMs) were evaluated for regional PM2.5 and O3 pollution using the same set of monitoring data and unified statistical methods: CMAQ version 5.0.2 (CMAQv5.0.2 hereafter), CMAQ version 5.3.2 (CMAQv5.3.2 hereafter) [19], CAMx version 6.2 (CAMx6.2 hereafter), CAMx version 7.1 (CAMx7.1 hereafter) [20,21], and the NAQPMS model.
4. Evaluation Results of AQMs
Simulations of CO, NO2, SO2, O3, PM2.5, and PM10, the six air pollutants listed in the ambient air quality standards of China [41] and essential for the regulatory application of AQMs, were evaluated for model performance. Furthermore, to investigate the model performance for PM, the PM2.5 component concentrations, including sulfate, nitrate, EC, OC, and ammonium, were evaluated in addition to the PM2.5 and PM10 concentrations. The evaluations considered both average cases and the most severe case to understand model behavior during severe air pollution events.
4.1. Evaluations of Model Performance for CO Concentration
While CAMx6.2 underestimated the CO concentration, CAMx7.1, CMAQv5.0.2, and CMAQv5.3.2 performed reasonably well in simulating the mean CO concentrations (Figure 1a, Table S1). The simulated mean CO concentrations for the four models were 0.53 mg/m3, 1.50 mg/m3, 1.32 mg/m3, and 0.88 mg/m3, while the mean observed concentrations were 1.25 mg/m3, 1.16 mg/m3, 1.46 mg/m3, and 1.17 mg/m3, respectively, showing that, except for CAMx6.2, the models have small biases and errors. The NMBs of the four models were −0.41, 0.28, 0.10, and −0.15, and the NMEs were 0.59, 0.52, 0.57, and 0.35, respectively. The correlation coefficients of the four models were 0.59, 0.46, 0.54, and 0.55 (Figure 1b, Table S1). The NAQPMS model performed similarly, with an R of 0.03–0.84 (average of 0.44), an NMB of −0.49 to 1.12 (average of 0.32), and an RMSE of 0.20 to 2.50 (average of 1.35) (Figure 1b, Table S1).
For the highest CO concentration case, CAMx7.1 predicted well, simulating 2.88 mg/m3 against an observed 2.97 mg/m3. However, CAMx6.2, CMAQv5.0.2, and CMAQv5.3.2 simulated 0.60 mg/m3, 1.63 mg/m3, and 1.50 mg/m3 against observed concentrations of 2.60 mg/m3, 2.50 mg/m3, and 2.49 mg/m3, respectively. The correlation coefficients of the four models for the high CO cases, 0.54, 0.55, 0.70, and 0.43, respectively, were similar to those for the other cases. The biases and errors were also similar to those in the other cases, with NMBs of −0.77, −0.03, −0.37, and −0.34 and NMEs of 0.77, 0.45, 0.47, and 0.46, respectively.
Generally, except for CAMx6.2, which underestimated CO, the performances of all the participating models were acceptable in simulating mean CO concentrations. The participating models also performed well for the highest CO concentration cases.
4.2. Evaluations of Model Performance for SO2 Concentration
Generally, CAMx6.2 slightly overestimated the SO2 concentration, while CAMx7.1, CMAQv5.0.2, and CMAQv5.3.2 overestimated it significantly (Figure 1c, Table S1). The simulated mean SO2 concentrations for all cases from the CAMx6.2, CAMx7.1, CMAQv5.0.2, and CMAQv5.3.2 models were 5.81 µg/m3, 27.28 µg/m3, 38.80 µg/m3, and 26.91 µg/m3, while the mean observed concentrations were 4.13 µg/m3, 10.16 µg/m3, 9.80 µg/m3, and 11.46 µg/m3, respectively. The four models have a large positive bias, with NMBs of 3.37, 5.36, 13.35, and 4.94, and relatively large errors, with NMEs of 3.64, 5.43, 13.39, and 5.02, respectively (Figure 1d, Table S1). Similarly, the NAQPMS model overestimated SO2 concentrations with a large positive bias, an average NMB of 3.30, and a large RMSE of 15.60. The correlation coefficients for the five models, CAMx6.2, CAMx7.1, CMAQv5.0.2, CMAQv5.3.2, and NAQPMS, averaged 0.46, 0.20, 0.46, 0.41, and 0.46, respectively (Figure 1d, Table S1).
To investigate the performance of the models in severe SO2 pollution situations, the highest SO2 concentration case was specifically evaluated. CAMx6.2 underestimated the SO2 concentration, while the other three models still overestimated it. Notably, CMAQv5.0.2 and CMAQv5.3.2 performed better than in the average of all cases. Specifically, the four models predicted 4.96 µg/m3, 61.58 µg/m3, 44.04 µg/m3, and 48.63 µg/m3, whereas the observed concentrations were 12.83 µg/m3, 33.58 µg/m3, 33.74 µg/m3, and 32.52 µg/m3, respectively. The four models also have better correlation coefficients of 0.66, 0.78, 0.73, and 0.50, respectively, for the high SO2 cases. The biases and errors are still high, but the bias is not always positive. The NMBs and NMEs are also improved for the highest concentration cases, with NMBs of −0.61, 0.83, 0.31, and 0.48 and NMEs of 0.61, 0.85, 0.47, and 0.58, respectively.
The model performance varied significantly between regions. A larger discrepancy in the simulation was found for southern China. For example, both CMAQv5.0.2 and CMAQv5.3.2 overpredicted SO2 concentrations with a higher bias and error in Shenzhen and Chengdu, which points to problems in updating the emission inventory or to the faster improvement of coal combustion-related air pollution. A similarly poor performance for SO2 and an underestimation of sulfate were found in a previous model comparison study by the LTP project, where three modeling results from China, Korea, and Japan showed a large discrepancy.
Compared with the model performance for other species, the poor performance of all models in simulating SO2 is probably caused by the rapid change in the emission inventory under the air pollution action plan during that period, especially the phasing out of small boilers and the switch from coal to natural gas and electricity for household heating in northern China. The better performance of the models for the high SO2 cases suggests that there is still potential to improve not only the emission amounts but also the temporal variations in emission inventories.
4.3. Evaluations of Model Performance for NO2 Concentration
CAMx7.1, CMAQv5.0.2, and CMAQv5.3.2 performed very well in simulating the NO2 concentrations (Figure 1e, Table S1). The NO2 concentrations simulated by the three models were 47.67 µg/m3, 55.95 µg/m3, and 47.07 µg/m3, while the mean observed concentrations were 48.54 µg/m3, 61.58 µg/m3, and 54.94 µg/m3, respectively. On the other hand, the CAMx6.2 model underestimated NO2 at 19.65 µg/m3 compared to the observed value of 35.66 µg/m3. On average, the biases and errors of the four models are small; the NMBs are only −0.23, 0.27, −0.02, and 0.07, and the NMEs are 0.56, 0.70, 0.42, and 0.55, respectively (Figure 1f, Table S1). The models also have good correlation coefficients of 0.64, 0.54, 0.52, and 0.53. The NAQPMS model has a similarly good performance for NO2, with an R of 0.03~0.78, an NMB in the range of −0.19~1.88, and an RMSE in the range of 8.70~56.00.
For the highest NO2 concentration cases, the performances of the four models are not as good as for the average of all cases. The four models underestimated NO2, at 24.18 µg/m3, 54.28 µg/m3, 66.63 µg/m3, and 64.73 µg/m3, compared to the observations of 57.54 µg/m3, 108.25 µg/m3, 84.84 µg/m3, and 118.15 µg/m3, respectively. Nevertheless, the biases and errors of the four models are still small, with NMBs of −0.58, −0.50, −0.21, and −0.45 and NMEs of 0.59, 0.50, 0.23, and 0.46, respectively; the NMEs improved slightly relative to the average cases. The four models also have better correlation coefficients of 0.66, 0.69, 0.41, and 0.70, respectively, for the high NO2 cases.
Overall, the participating models perform well in simulating the NO2 concentrations. The good performance for NO2 is similar to that found in the previous MICS-Asia III study [17].
4.4. Evaluations of Model Performance for O3 Concentration
The CAMx6.2, CAMx7.1, CMAQv5.0.2, and CMAQv5.3.2 models perform reasonably well in simulating the daily maximum 8 h average O3 concentrations (MDA8 O3) (Figure 1g). The four models’ simulated mean MDA8 O3 concentrations for all cases were 30.18 µg/m3, 115.76 µg/m3, 90.94 µg/m3, and 72.29 µg/m3, while the mean observed concentrations were 29.65 µg/m3, 87.45 µg/m3, 78.65 µg/m3, and 60.78 µg/m3, respectively. The simulated results from the four models have good correlation coefficients of 0.75, 0.53, 0.75, and 0.48, respectively, and have small biases and errors (Figure 1h, Table S1). The NMBs for MDA8 O3 of the four models are 0.06, 0.40, −0.17, and 0.14, and the NMEs are 0.54, 0.48, 0.57, and 0.56, respectively. The NAQPMS model has a similarly good performance for O3, with an R of 0.37~0.86, an NMB in the range of −0.83~4.83, and an RMSE in the range of 16.1~83.6.
For the highest MDA8 O3 concentration cases, the simulated MDA8 O3 from the four models was 48.01 µg/m3, 199.42 µg/m3, 189.04 µg/m3, and 132.41 µg/m3, while the observed concentrations were 62.41 µg/m3, 172.53 µg/m3, 190.89 µg/m3, and 111.1 µg/m3, respectively. The four models have similar biases and errors, with NMBs of −0.65, 0.16, −0.04, and 0.18 and NMEs of 0.37, 0.19, 0.23, and 0.20, respectively. The four models also have better correlation coefficients than in the average cases, at 0.64, 0.82, 0.92, and 0.76, respectively.
Overall, the participating models performed well in simulating the MDA8 O3 concentrations. The good performance of the participating models is similar to the model evaluation results of CMAQ and CAMx in the U.S. [42,43]. However, it is worth noting that most cases evaluated in this study were in the fall or winter, when O3 levels are typically low. In the previous study of MICS-Asia III, the model performance for O3 showed considerable variability, high uncertainties, and usually overestimation (high NMB of 0.25–1.25 for May–September for the Greater BTH region domain) [16]. Therefore, further research is needed to investigate the performances of the participating models during the O3 pollution season.
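For readers unfamiliar with the MDA8 O3 metric used above, the following is a minimal sketch of how a daily maximum 8 h average can be derived from hourly values. The function name, the requirement of at least six valid hours per window, and the restriction of windows to a single calendar day are assumptions made for illustration; regulatory definitions differ in these details, and this is not necessarily the procedure used in the study.

```python
def mda8(hourly, min_valid=6):
    """Daily maximum 8 h running mean from 24 hourly values.

    `hourly` holds one day of hourly concentrations; None marks a missing
    hour. Windows start at hours 0..16 and stay within the same day
    (an assumption; some definitions let windows run into the next day).
    Returns None if no window has at least `min_valid` valid hours.
    """
    if len(hourly) != 24:
        raise ValueError("expected 24 hourly values")
    best = None
    for start in range(24 - 8 + 1):
        window = [v for v in hourly[start:start + 8] if v is not None]
        if len(window) < min_valid:
            continue  # too many missing hours in this window
        mean = sum(window) / len(window)
        if best is None or mean > best:
            best = mean
    return best


# Hypothetical day: low values at night, an 8 h afternoon plateau
day = [10.0] * 8 + [50.0] * 8 + [10.0] * 8
print(mda8(day))
```

The same function would be applied to both the simulated and the observed hourly series before computing the statistical indicators.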
4.5. Evaluations of Model Performance for PM10 Concentration
The models generally underestimated PM10 concentrations. CAMx6.2, CAMx7.1, and CMAQv5.0.2 significantly underestimated PM10 concentrations, and only CMAQv5.3.2 underestimated them slightly (Figure 1i, Table S1). The simulated average concentrations were 66.57 µg/m3, 95.17 µg/m3, 87.43 µg/m3, and 121.54 µg/m3, while the mean observed concentrations were 149.62 µg/m3, 155.88 µg/m3, 134.78 µg/m3, and 148.21 µg/m3, respectively. The NMBs for PM10 of the CAMx6.2, CAMx7.1, and CMAQv5.0.2 models were −0.44, −0.30, and −0.20, while CMAQv5.3.2 had −0.01 (Figure 1j, Table S1). The NMEs for PM10 of the four models were 0.65, 0.51, 0.56, and 0.51, respectively. The correlation coefficients for PM10 of the four models were 0.48, 0.45, 0.48, and 0.38, respectively (Figure 1j, Table S1). The NAQPMS model had a similar range of errors and correlation coefficients for PM10, with an R in the range of −0.05~0.88, NME in the range of −0.08~0.69, and RMSE in the range of 27.10~169.90.
For the highest PM10 concentration cases, while CMAQv5.3.2 reproduced peak concentrations very well, CAMx6.2, CAMx7.1, and CMAQv5.0.2 underestimated PM10 concentrations significantly. The four models predicted PM10 concentrations of 79.50 µg/m3, 135.14 µg/m3, 129.02 µg/m3, and 263.5 µg/m3, whereas the observation concentrations were 272.50 µg/m3, 272.50 µg/m3, 272.50 µg/m3, and 272.5 µg/m3, respectively. Accordingly, CMAQv5.3.2 had a very small bias of −0.03, and the other three models had bigger biases of −0.71, −0.50, and −0.53. CMAQv5.3.2 had a small NME of 0.20, while the other three models had NMEs of 0.71, 0.50, and 0.53, respectively. The CAMx7.1, CMAQv5.0.2, and CMAQv5.3.2 models had good correlation coefficients of 0.88, 0.78, and 0.77, respectively, and CAMx6.2 had a low correlation coefficient of 0.34. Except for the CMAQv5.3.2 model, the other three models still have the potential to improve their performances for PM10.
4.6. Evaluations of Model Performance for PM2.5 Concentration
During the action plan period of 2014–2017, PM2.5 was the most severe air pollution problem. Air quality models are urgently needed as prediction and assessment tools for control policy-making. The CAMx6.2, CAMx7.1, CMAQv5.0.2, and CMAQv5.3.2 models perform well in simulating the mean PM2.5 concentrations and perform better than in simulating PM10. For most cases, the four models slightly underestimated the PM2.5 concentrations; the simulated mean PM2.5 concentrations were 60.72 µg/m3, 79.85 µg/m3, 80.58 µg/m3, and 75.48 µg/m3, while the mean observed concentrations were 95.19 µg/m3, 95.16 µg/m3, 92.77 µg/m3, and 94.24 µg/m3, respectively (Figure 1k, Table S1). The biases and errors of the four models were similarly small: −0.29, −0.07, −0.04, and −0.11 for NMB and 0.51, 0.48, 0.53, and 0.52 for NME, respectively (Figure 1l, Table S1). The modeled results also correlate well with observations, with correlation coefficients of 0.58, 0.55, 0.60, and 0.39, respectively (Figure 1l, Table S1). The model performances for PM2.5 in this study were similar to those in MICS-Asia III, in which the participating models also slightly underestimated the PM2.5 concentration with small biases [16].
For the highest PM2.5 concentration cases, the models still underestimated the PM2.5 concentrations. The CAMx6.2, CAMx7.1, CMAQv5.0.2, and CMAQv5.3.2 models predicted 97.46 µg/m3, 120.44 µg/m3, 91.46 µg/m3, and 120.14 µg/m3, while the observation concentrations were 155.93 µg/m3, 155.93 µg/m3, 155.93 µg/m3, and 157.54 µg/m3, respectively. The NMBs for the four models were −0.37, −0.23, −0.41, and −0.24, and the NMEs were 0.42, 0.30, 0.44, and 0.25, respectively. The four models had very good correlation coefficients of 0.89, 0.87, 0.90, and 0.93, respectively. Similarly, the NAQPMS model simulated PM2.5 reasonably well, with an R in the range of 0.01~0.89, NMB in the range of −0.10~1.61, and RMSE in the range of 21.30~155.90.
Overall, all participating models performed reasonably well in simulating the PM2.5 concentrations.
4.7. Evaluations for Modeled PM2.5 Components
The AQMs simulate the components of particulate matter, such as sulfate, nitrate, elemental carbon (EC), organic carbon (OC), and crustal elements, and then sum the independently simulated components to obtain PM2.5 and PM10 concentrations. Therefore, to evaluate the model performance in simulating PM2.5 and PM10, it is important to understand the performance in simulating the individual PM components.
In this work, the model performances in simulating the PM2.5 components of sulfate, nitrate, BC, OC, and NH4+ are evaluated.
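The summation of independently simulated components into a total PM2.5 mass can be sketched as follows. This is only an illustration of the bookkeeping: the component names, the lumped "other" category, and the sample concentrations are hypothetical, and the actual species lists and aggregation rules differ between models and versions.

```python
def total_pm25(components):
    """Sum independently simulated component masses (µg/m3) into PM2.5."""
    missing = [name for name, value in components.items() if value is None]
    if missing:
        raise ValueError(f"missing component values: {missing}")
    return sum(components.values())


def mass_fractions(components):
    """Return each component's share of the summed PM2.5 mass."""
    total = total_pm25(components)
    if total <= 0:
        raise ValueError("total mass must be positive")
    return {name: value / total for name, value in components.items()}


# Hypothetical simulated component concentrations in µg/m3
simulated = {
    "sulfate": 8.0,
    "nitrate": 12.0,
    "ammonium": 6.0,
    "BC": 4.0,
    "OC": 10.0,
    "other": 10.0,  # lumped remainder (assumption), e.g., crustal material
}
print(total_pm25(simulated))
print(mass_fractions(simulated)["nitrate"])
```

Because the total is a plain sum, a bias in one component can be masked in the total by an opposite bias in another, which is why the components are evaluated separately below.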
4.7.1. Evaluation for Nitrate of PM2.5
The CAMx7.1 and CMAQv5.0.2 models performed very well, slightly underestimating the mean nitrate concentrations, while CAMx6.2 and CMAQv5.3.2 underestimated them more significantly. The simulated nitrate concentrations of CAMx6.2, CAMx7.1, CMAQv5.0.2, and CMAQv5.3.2 were 10.36 µg/m3, 19.32 µg/m3, 18.15 µg/m3, and 13.77 µg/m3, while the mean observed concentrations were 19.12 µg/m3, 18.96 µg/m3, 18.86 µg/m3, and 19.37 µg/m3, respectively (Figure 2a, Table S1). The biases and errors of the four models for nitrate were small, with NMBs of −0.04, 0.34, 0.47, and 0.27 and NMEs of 0.95, 0.87, 1.02, and 1.16, respectively (Figure 2b, Table S1). The four models had good correlation coefficients of 0.62, 0.55, 0.63, and 0.40, respectively. The NAQPMS model had a similarly good performance for nitrate, with an R in the range of 0.04~0.89 for high concentration cases and an NMB in the range of −0.87~11.37. Overall, the participating models perform well in simulating the nitrate concentrations of PM2.5.
4.7.2. Evaluation for Sulfate of PM2.5
Generally, the CAMx6.2, CAMx7.1, CMAQv5.0.2, and CMAQv5.3.2 models perform reasonably well in simulating the mean sulfate concentrations (Figure 2c, Table S1). CAMx6.2 and CMAQv5.0.2 slightly underestimated, while CAMx7.1 and CMAQv5.3.2 slightly overestimated the sulfate concentrations. The simulated mean sulfate concentrations of the four models were 6.53 µg/m3, 11.63 µg/m3, 8.97 µg/m3, and 6.82 µg/m3, while the mean observed concentrations were 7.71 µg/m3, 7.70 µg/m3, 7.50 µg/m3, and 8.76 µg/m3, respectively. CAMx6.2 and CMAQv5.3.2 had small biases, with NMBs of 0.23 and 0.21, while CAMx7.1 and CMAQv5.0.2 had acceptable NMBs of 1.25 and 0.75, respectively (Figure 2d, Table S1). The four models also had reasonable errors; the NMEs were 0.81, 1.51, 1.16, and 0.82, respectively. The four models had good correlation coefficients of 0.58, 0.53, 0.62, and 0.43, respectively. The NAQPMS model had a similarly acceptable performance for sulfate, with an R in the range of −0.35~0.86, NMB in the range of −0.29~3.66, and RMSE in the range of 2.30~23.70 (Figure 2d, Table S1).
Overall, the participating models effectively simulated the sulfate concentrations, which is similar to the good performance for sulfate in previous studies conducted in the U.S.; for example, the models were found to overestimate concentrations at CSN and IMPROVE sites in most regions and seasons by 0.1–0.6 µg/m3, as reported in 2016 [21,42,43]. However, the overestimation of SO2 concentrations by the participating models suggests that there are still large uncertainties in the emission inventory, and that the aqueous and heterogeneous reactions and the meteorology need to be further investigated. In the model evaluation work of MICS-Asia III, a general underestimation of sulfate was also found, and the possible absence of sulfate formation mechanisms such as heterogeneous reactions was suggested.
4.7.3. Evaluation for BC of PM2.5
Overall, the CAMx6.2, CAMx7.1, CMAQv5.0.2, and CMAQv5.3.2 models perform very well in simulating the mean BC concentrations. The simulated mean BC concentrations for the four models were 6.13 µg/m3, 7.04 µg/m3, 5.56 µg/m3, and 8.04 µg/m3, while the mean observed concentrations were 4.80 µg/m3, 6.88 µg/m3, 4.89 µg/m3, and 6.79 µg/m3, respectively (Figure 2e, Table S1). All four models slightly overestimated the BC concentration, with positive NMBs of 0.57, 0.45, 0.49, and 1.05. The errors of the four models in simulating BC were relatively large, with NMEs of 0.98, 0.96, 0.92, and 1.41, respectively (Figure 2f, Table S1). The models show reasonable correlation coefficients of 0.50, 0.60, 0.48, and 0.59. The NAQPMS model had a similar performance for the evaluation cases, with an R in the range of −0.02~0.86 and NMB in the range of −0.53~4.63.
4.7.4. Evaluation for OC of PM2.5
The performances of CAMx6.2, CAMx7.1, CMAQv5.0.2, and CMAQv5.3.2 varied in simulating the mean OC concentrations. CAMx6.2 and CAMx7.1 simulated the mean OC concentrations very well, whereas CMAQv5.0.2 underestimated and CMAQv5.3.2 overestimated the mean OC concentrations significantly. The simulated mean OC concentrations of the four models were 14.03 µg/m3, 16.15 µg/m3, 6.72 µg/m3, and 20.94 µg/m3, while the mean observed concentrations were 14.84 µg/m3, 13.14 µg/m3, 15.27 µg/m3, and 12.27 µg/m3, respectively (Figure 2g, Table S1). The CAMx6.2 and CMAQv5.3.2 models had large NMBs of 1.63 and 2.43, whereas CAMx7.1 and CMAQv5.0.2 had reasonable NMBs of 0.80 and 0.66, respectively (Figure 2h, Table S1). All four models had large errors, with NMEs of 2.14, 1.06, 1.55, and 2.79. The four models also had reasonable correlation coefficients of 0.60, 0.56, 0.57, and 0.53 (Figure 2h, Table S1). The NAQPMS model had a similarly varied performance for OC, with an R in the range of 0.15~0.81, NMB in the range of −0.42~30.88, and RMSE in the range of 2.60~44.30.
4.7.5. Evaluation for NH4+ of PM2.5
The CAMx6.2, CAMx7.1, CMAQv5.0.2, and CMAQv5.3.2 models underestimated the mean NH4+ concentrations, but with small biases and errors (Figure 2i,j, Table S1). The simulated mean NH4+ concentrations were 5.41 µg/m3, 10.05 µg/m3, 8.66 µg/m3, and 5.80 µg/m3, while the mean observed concentrations were 13.40 µg/m3, 13.41 µg/m3, 13.17 µg/m3, and 12.96 µg/m3, respectively. The NMBs for NH4+ of CAMx6.2, CAMx7.1, CMAQv5.0.2, and CMAQv5.3.2 were −0.31, 0.16, 0.06, and −0.22, respectively. The four models had consistently small NMEs for NH4+ of 0.69, 0.71, 0.70, and 0.74, respectively. The four models also had good correlation coefficients of 0.68, 0.59, 0.66, and 0.52, respectively. The NAQPMS model had a similarly good performance for NH4+, with a high R of 0.89 for high concentration cases. The model performances for NH4+ of PM2.5 in this study are similar to the work conducted in MICS-Asia III (Chen et al., 2019, [16]).
All the participating models simulated gaseous NH3 in the atmosphere, which is the precursor of NH4+ in PM2.5. Unfortunately, few gaseous NH3 concentration data were available for model verification in this study. In AQMs, the NH4+ concentrations are affected by the partitioning between gaseous NH3 and the NH4+ of PM2.5 and by the balance of cations and anions in both the aerosol and liquid phases. At the same time, most areas in the modeling domain are in NH3-rich conditions. Discrepancies in sulfate and nitrate, together with cations and precursors missing from the emissions, such as chloride ions, can lead to uncertainties in simulating NH4+ concentrations for PM2.5. The performances of the participating models for gaseous ammonia and for the ammonium of PM2.5 still need to be further evaluated.
5. Discussion
5.1. Sources of Uncertainty of CTMs and the Limitation of the Evaluation in This Work
In this study, we mainly used daily average concentrations, which are required for PM2.5 and most air pollutants by the Ambient Air Quality Standard of China [41], as evaluation statistical measures. The participating models demonstrated their basic ability in simulating the daily average concentrations for NO2, CO, O3, PM2.5, and MDA8 O3 with acceptable errors and biases, and can be used as prediction and evaluation tools in regulatory applications. However, for the detailed evaluation of models and the improvement of specific chemical mechanisms or physical algorithms, detailed case-specific evaluations using a higher temporal resolution, even if not defined in the air quality standards (e.g., for PM2.5), should be considered. We did not evaluate the model performance for annual average concentrations due to a lack of monitoring data.
The sources of uncertainty in the air quality models can be roughly categorized as emission data, meteorological data and meteorology models, and chemical mechanisms and physical algorithms. In the Project, the meteorology input data, including global analysis data and land-use data, were tested and provided. An updated emission inventory was provided, together with a pre-processing tool and suggestions. The specifications of the meteorology and air quality models were tested and suggested. However, each model group updated and modified the input emission data, pre-processed the data, and selected the meteorology model, CTMs, modeling specifications, and options based on their own circumstances and considerations [18]. Given the many types of data, the various uncertainties in the emission inventory and meteorology input data, and the large number of atmospheric chemical and physical processes, scientific schemes, and specification options, this study, following the common air quality modeling approach, did not require consistency of input data and model specifications. This study also did not conduct in-depth sensitivity analyses of the sources of uncertainty, using either the brute-force method or a sensitivity analysis module of the models. Therefore, the evaluation results cannot be used as guidance for specific data, mechanism, and algorithm improvements.
Due to the uncertainties in the emissions and chemical characteristics of different air pollutants, it is difficult to define fixed criteria for air pollutants at different concentration levels and emissions (China has much higher concentrations and more complicated emission sources, with varied meteorology). We applied the approach of comparing statistical indicators, as used in the development of CMAQ and CAMx in the U.S. [1,2,3] and other countries, and made suggestions for criteria for further consideration in the Project [18].
It is also worth noting that, even though there are five cases for high O3 events, the main purpose of this evaluation is for PM2.5. Due to the limited availability of measurement data, we did not investigate in depth the performance of the models for O3 and its precursors or for the products and radicals of photochemical reactions. The evaluation for high O3 seasons should be investigated further, especially for VOCs, NOx, and radicals.
Since the PM2.5 concentrations are largely dominated by regional emissions, the large domain covering East Asia or the whole of China in this study can substantially reduce the uncertainties of boundary conditions. However, due to the global background of tropospheric ozone and its high concentrations in the upper air, further investigation of boundary conditions is needed.
5.2. Uncertainty of Emission Data
The uncertainties in the emission data used in the CTM air quality model stem from both the emission inventory and the pre-processing procedures. In this evaluation, we did not observe significant performance differences among the participating models or their versions, despite variations in chemical mechanisms, physical algorithms, model configurations, and spatial resolutions. However, the models showed a consistent performance across different species, with a relatively better performance for PM2.5, CO, and NO2, and poorer performance for PM10, SO2, and certain PM2.5 components. Compared to the generally better performance of CMAQ and CAMx models in the United States and Europe, it can be concluded that the emission inventory remains the largest source of uncertainty in China.
Even though the participating models simulated the sulfate concentrations relatively well, the overestimation of SO2 may still suggest the possible underestimation of sulfate by the models. The model evaluation results for SO2 and sulfate in this study were comparable to the poor performance from previous evaluations conducted in China. In contrast, extensive evaluation efforts in the U.S. and Europe using the CMAQ and CAMx models have generally demonstrated a good performance for both SO2 and sulfate. The poor performance of the participating models in simulating SO2 indicated that there was still a large uncertainty in the emission inventory, which was most likely caused by the rapid emission inventory change as a result of the air pollution action plan during that period. The performances of meteorology modeling, such as precipitation and humidity, which are related to aqueous and heterogeneous sulfate formation mechanisms, also need to be further investigated for the possible underestimation of sulfate.
For the highest PM10 concentration cases, three models, CAMx6.2, CAMx7.1, and CMAQv5.0.2, underestimated PM10 concentrations significantly. Since coarse PM mainly comes from sources such as road dust, wind-blown dust and soil, and the long-range transport of dust–sand storms, the difference in the model performances or uncertainty of models for PM10 may be related to the uncertainties in local emission data. A high spatial and temporal resolution emission inventory or emission models, either inline or offline, considering human activity, the conditions of road or loading sites, and meteorology are essential for improvement.
The pre-processing of emission data, including the spatial and temporal allocation of emission sources of administrative units to the modeling grids and time steps, as well as chemical speciation, can contribute further to the uncertainties and inconsistency due to the lack of necessary information and methodology. For example, the current approach using population and GDP for the spatial allocation of the emissions of administrative units to the modeling grids is not appropriate or accurate for all air pollutants and types of sources. The other essential information, such as chemical speciation profiles for the chemical speciation of VOCs, NOx, and PM emissions for model species and temporal profiles of the sources for temporal allocation, is usually not available for emission sources.
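The surrogate-based spatial allocation described above can be sketched as follows. This is a minimal illustration rather than the pre-processing used by any participating model: the function name `allocate_to_grid`, the grid-cell keys, and all numbers are hypothetical. An administrative-unit emission total is distributed to grid cells in proportion to a surrogate such as population.

```python
def allocate_to_grid(total_emission, surrogate_by_cell):
    """Distribute an administrative-unit emission total to grid cells
    in proportion to a surrogate weight (e.g., population or GDP)."""
    total_weight = sum(surrogate_by_cell.values())
    if total_weight <= 0:
        raise ValueError("surrogate weights must sum to a positive value")
    return {
        cell: total_emission * weight / total_weight
        for cell, weight in surrogate_by_cell.items()
    }


# Hypothetical example: 1000 t/yr of emissions for one administrative unit
# and a population surrogate on four grid cells (row, column).
population = {(0, 0): 50000, (0, 1): 30000, (1, 0): 20000, (1, 1): 0}
gridded = allocate_to_grid(1000.0, population)

for cell, value in sorted(gridded.items()):
    print(cell, round(value, 1))
```

In this sketch, cell (1, 1) receives no emissions because nobody lives there. If that cell actually contained an industrial source, a population surrogate would misplace its emissions, which is the kind of allocation error discussed above.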
5.3. Evaluation Methods
The primary objective of this study is to provide a general performance evaluation for the key air pollutants of air quality models for regulatory applications. Therefore, we pay more attention to the general model performance for six air pollutants from the air quality standards of China and five components of PM2.5. This work used the most common statistical indicators for model performance, which is common practice in quantitative model evaluation (Tables S1 and S2). However, when investigating the performance of specific cases or different species, the model performance showed more uncertainties. For example, the participating models showed reasonable correlation coefficients, indicating that the models reproduced the regional distributions and temporal variations in air pollutant concentrations well. However, upon closer examination of the time series across different case studies, the simulated values for various air pollutants exhibit distinct deviations. Deviations are notably greater for PM2.5, particularly for some PM2.5 components (Tables S1 and S2). Simulated O3 usually showed a better performance for diurnal variation, while CO and NO2 showed a poorer performance (Tables S1 and S2). Therefore, even though the general performance of the models, e.g., for PM2.5 and O3, is acceptable, a further evaluation for detailed cases, diurnal variation, PM2.5 components, and precursors of O3 is necessary.
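To illustrate why a reasonable correlation coefficient can coexist with noticeable deviations, the following minimal sketch computes several widely used performance statistics for paired observed and simulated values. The definitions shown (mean bias, normalized mean bias, normalized mean error, root-mean-square error, and Pearson correlation) are common conventions assumed here; the indicators actually used in this study are those given in Tables S1 and S2. The function name and the sample values are hypothetical.

```python
import math


def evaluation_statistics(obs, sim):
    """Return common performance statistics for paired observed and
    simulated concentrations (same units, same length)."""
    if len(obs) != len(sim) or len(obs) < 2:
        raise ValueError("need at least two paired values")
    n = len(obs)
    mean_obs = sum(obs) / n
    mean_sim = sum(sim) / n
    diffs = [s - o for o, s in zip(obs, sim)]

    mb = sum(diffs) / n                               # mean bias
    nmb = sum(diffs) / sum(obs)                       # normalized mean bias
    nme = sum(abs(d) for d in diffs) / sum(obs)       # normalized mean error
    rmse = math.sqrt(sum(d * d for d in diffs) / n)   # root-mean-square error

    # Pearson correlation coefficient
    cov = sum((o - mean_obs) * (s - mean_sim) for o, s in zip(obs, sim))
    var_obs = sum((o - mean_obs) ** 2 for o in obs)
    var_sim = sum((s - mean_sim) ** 2 for s in sim)
    r = cov / math.sqrt(var_obs * var_sim)

    return {"MB": mb, "NMB": nmb, "NME": nme, "RMSE": rmse, "R": r}


# Hypothetical daily mean concentrations (ug/m3), not data from this study.
observed = [50.0, 80.0, 120.0, 60.0]
simulated = [45.0, 70.0, 100.0, 65.0]
stats = evaluation_statistics(observed, simulated)

for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

In this hypothetical sample the correlation is high even though the peak value is underestimated by about 17%, which is why aggregate statistics are best read together with time series for individual cases.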
This study also did not conduct in-depth sensitivity analyses for sources of uncertainty, using either the brute-force method or the sensitivity analysis modules of the models. This study did not require consistency of input data and model specifications. Therefore, the evaluation results do not provide direct guidance for the improvement of specific input data such as emission inventory and pre-processing methods, meteorology input data and modeling, and mechanisms and algorithms of air quality models. For evaluation work aimed at model and application improvement, in-depth evaluations for specific species and specific cases (location and meteorology) are needed. In future work, standardized charts and figures for time series and spatial distributions for regulatory model evaluations could be considered.
6. Conclusions
This study inter-compared five regional chemistry and transport models/versions using the same set of observation data and statistical indicators.
All the participating models performed well in simulating the daily average concentrations of NO2, CO, O3, and PM2.5 with an acceptable bias, and can be used as prediction and evaluation tools in regulatory applications. The participating models generally overestimated SO2, underestimated PM10, and showed variability among models for the air pollutant and PM2.5 component concentrations. The participating models also showed reasonable correlation coefficients, indicating that the models reproduced the regional distributions and temporal variations in air pollutant concentrations well.
Except for CAMx6.2, which underestimated CO, the performances of all participating models were acceptable in simulating the mean CO concentrations. The participating models also performed well for the highest CO concentration cases. Similarly, except for CAMx6.2, which underestimated NO2, all the models reproduced the average and the range of NO2 concentrations well. However, for the highest-concentration cases, all the models underestimated the NO2 concentrations. Overall, the participating models performed well in simulating the MDA8 O3 concentrations. The good performances of the participating models were similar to the model evaluation results of CMAQ and CAMx in the U.S. and better than the results of MICS-Asia, in which a considerable variability of model performance for O3 was found. It is worth noting that only five O3 cases were evaluated, while the remaining cases occurred during fall and winter, when O3 levels are generally lower. Further research is needed to investigate the performance of the participating models during the O3 pollution season.
The participating models generally underestimated PM10 concentrations and showed great variability among different models. CAMx6.2, CAMx7.1, and CMAQv5.0.2 significantly underestimated PM10 concentrations, and only CMAQv5.3.2 underestimated them slightly. The participating models performed well in simulating PM2.5 concentrations; for most cases, the participating models slightly underestimated the PM2.5 concentrations. However, the participating models showed great variability for PM2.5 components. The models performed reasonably well in simulating the mean sulfate and nitrate concentrations in PM2.5. The participating models also simulated BC very well in terms of the mean concentration and peak value. However, the performances of CAMx6.2, CAMx7.1, CMAQv5.0.2, and CMAQv5.3.2 diverged in simulating the mean OC concentrations. The participating models performed reasonably well in simulating the average and peak values of NH4+ concentrations. However, since there were no NH3 concentration data for verification and most areas in the modeling domain are under NH3-rich conditions, the performance for NH4+ may be affected by the partitioning between gaseous NH3 and aerosols and by the equilibrium of cations and anions in aerosols. The performances of the participating models for gaseous ammonia and ammonium in PM2.5 still need to be further evaluated.
This evaluation shows that the different air quality models and versions, with their different parameterization schemes, still have varying degrees of uncertainty across pollutants and components, time periods, and regions. The results suggest that continuous and routine evaluations are essential for regulatory applications of air quality models, and that a regulatory mechanism or framework is needed for air quality model applications and improvement.
Based on the evaluation results, it can be concluded that the emission inventory remains the largest source of uncertainty in China. Therefore, updating and improving the emission inventory should be considered a top priority for the regulatory applications and improvement of CTM air quality models.
This study aims to conduct a general performance evaluation of the key air pollutants in air quality models for regulatory applications. Overall, the model performance was found to be acceptable for most air pollutants, and the modeling results can provide valuable information for policy-making. However, significant uncertainties remain, particularly given the limitations of this evaluation. Further regular and systematic evaluations are needed. More in-depth evaluations of specific input data (such as emissions, meteorology, and modeling), as well as of chemical mechanisms and physical algorithms in air quality models, are essential for improving both models and data. At the same time, air quality model application guidelines, standardized evaluation methodologies, and regulations are necessary to ensure consistent model evaluations and regulatory applications in China.