Validation and Calibration of CAMS PM 2.5 Forecasts Using In Situ PM 2.5 Measurements in China and United States

An accurate forecast of fine particulate matter (PM 2.5) concentration for the coming days is crucial, since it can serve as an early warning that helps protect the general public from hazardous PM 2.5 pollution events. Although the European Copernicus Atmosphere Monitoring Service (CAMS) provides global PM 2.5 forecasts up to 120 h ahead at a 3 h time interval, the accuracy of this product has not been well evaluated. Using hourly PM 2.5 concentration data sampled in China and the United States (US) between 2017 and 2018, we examined the accuracy and bias levels of the CAMS PM 2.5 concentration forecasts over these two countries. Ground-based validation results indicate a relatively low accuracy of the raw PM 2.5 forecasts given the presence of large and spatially varying modeling biases, especially in northwest China and the western United States. Specifically, the PM 2.5 forecasts in China showed a mean correlation in the range 0.31–0.45 (0.24–0.42 in the US) and an RMSE of 38–83 µg/m 3 (8.30–16.76 µg/m 3 in the US) across forecasting time horizons from 3 h to 120 h. Additionally, the accuracy was found not only to decrease as the forecasting time horizon increased but also to exhibit an evident diurnal cycle. This implies that the current CAMS forecasting model failed to resolve the local processes that modulate the diurnal variability of PM 2.5. Moreover, the accuracy varied between seasons: accurate PM 2.5 forecasts were more likely in autumn in China, whereas they were more likely in spring in the US. To improve the accuracy of the raw PM 2.5 forecasts, a statistical bias correction model was then established using the random forest method to account for the large modeling biases. The cross-validation results clearly demonstrated the effectiveness and benefits of the proposed bias correction model, as the diurnally varying and temporally increasing modeling biases were substantially reduced after the calibration. Overall, given the improved accuracy, the calibrated CAMS PM 2.5 forecasts are a promising data source for protecting the general public from severe PM 2.5 pollution events.


Introduction
With the fast pace of urbanization and economic growth, deteriorating air quality, particularly the increasing concentration of fine particulate matter such as PM 2.5 (particles with an aerodynamic diameter of no more than 2.5 µm), has raised great concern given its negative impacts on public health, the environment, and even climate [1,2]. On the one hand, numerous cohort studies have confirmed the adverse impacts of PM 2.5 on public health, since both long-term and short-term acute exposures can lead to cardiovascular diseases, pneumonia, and even premature death [3][4][5][6][7]. On the using satellite retrievals [53,54]. Meanwhile, methods toward such a goal include linear/nonlinear fitting [46], quantile mapping [50], and more complex machine learning approaches such as random forest [51,53].
In this study, we attempt to evaluate the data accuracy of the CAMS PM 2.5 forecasts, since the data accuracy of this product has not yet been well examined. PM 2.5 concentration measurements from the national air quality monitoring networks in China and the United States (US), between 2017 and 2018, are used as the ground truth to compare against the forecasted PM 2.5 concentration data at each forecasting step. Moreover, we establish a bias correction model using the random forest method to calibrate the raw CAMS PM 2.5 forecasts to further improve the data accuracy. Science questions to be answered by this study include: (1) Can the CAMS PM 2.5 forecasts be used as a reliable data source to give early warning of severe PM 2.5 pollution events in the upcoming days? (2) Is it feasible to further improve the data accuracy of raw CAMS PM 2.5 forecasts?

CAMS PM 2.5 Concentration Forecasts
CAMS provides forecasts of air pollution levels across the globe for the next few days. Specifically, CAMS generates a forecast of global atmospheric composition with time horizons extending up to the next 120 h, covering 56 reactive trace gases in the troposphere as well as stratospheric ozone and five types of aerosols (i.e., desert dust, sea salt, organic matter, black carbon, and sulphate). For aerosols and chemical species (e.g., PM 2.5 concentration), the forecasts are produced twice daily (00:00 and 12:00 UTC) at a 3 h time interval. By default, the forecasts are provided at a horizontal resolution of 40 × 40 km on 137 vertical levels from the surface up to a top height of 60 km.
In this study, the CAMS PM 2.5 concentration forecasts at the ground level between 2017 and 2018 were acquired from the ECMWF data archive (IFS cycle 46r1). The global forecasts at each time step (every 3 h from 00:00 UTC out to 120 h) were processed on a 0.4° × 0.4° latitude/longitude grid. Therefore, a data set with a dimension of 40 × 450 × 900 (time/latitude/longitude) was obtained for each date during the study period. Since our purpose is to evaluate the accuracy of this product, no quality control measures such as outlier removal were applied to the PM 2.5 forecasts.
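The stated cube dimension follows directly from the forecast horizon, the time interval, and the grid spacing. The short sketch below reproduces that arithmetic and shows one possible way to map a station location to a grid cell. The function names and the grid origin (first row at 90°N, first column at 180°W) are assumptions made for illustration and are not taken from the CAMS archive.

```python
# Sketch only: the grid origin and ordering are assumed, not taken from the CAMS archive.
FORECAST_HORIZON_H = 120   # forecast length stated in the text
STEP_H = 3                 # time interval stated in the text
GRID_DEG = 0.4             # grid spacing stated in the text


def cube_shape(horizon_h=FORECAST_HORIZON_H, step_h=STEP_H, grid_deg=GRID_DEG):
    """Return the (time, latitude, longitude) sizes of one daily forecast cube."""
    n_time = horizon_h // step_h
    n_lat = round(180 / grid_deg)
    n_lon = round(360 / grid_deg)
    return n_time, n_lat, n_lon


def cell_index(lat, lon, grid_deg=GRID_DEG):
    """Map a station location to (row, col).

    Hypothetical convention: row 0 starts at 90N, column 0 starts at 180W.
    """
    n_lat = round(180 / grid_deg)
    n_lon = round(360 / grid_deg)
    row = min(int((90.0 - lat) / grid_deg), n_lat - 1)
    col = min(int((lon + 180.0) / grid_deg), n_lon - 1)
    return row, col


print(cube_shape())  # (40, 450, 900)
```

Stations that map to the same `(row, col)` pair fall within the footprint of the same forecast value, which is the grouping needed for any point-to-grid comparison.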

In Situ PM 2.5 Concentration Measurements
In this study, hourly PM 2.5 concentration data measured by ground-based air quality observation networks in China and the contiguous US (CONUS) were used to evaluate the accuracy of, and to calibrate, the raw CAMS PM 2.5 forecasts. Two years of in situ PM 2.5 measurements between 2017 and 2018 were acquired from the China National Environmental Monitoring Center and the US Environmental Protection Agency, respectively. As an essential quality control, we only selected PM 2.5 records with at least 10 days of valid measurements in each month and missing value ratios lower than 50%. This ensures a sufficient number of valid PM 2.5 samples in each data record and avoids potential errors in the accuracy estimates due to differing numbers of samples among data records. Meanwhile, PM 2.5 data values lower than 1 or higher than 1000 were excluded [55]. In addition, a moving median method with a sliding window of 13 (6 h time lags around the current sampling time) was applied to each ground-based PM 2.5 concentration time series to remove possible outliers. Finally, PM 2.5 measurements within the footprint of each PM 2.5 forecast were averaged to represent the regional mean PM 2.5 concentration level over the grid cell, yielding 507 hourly PM 2.5 concentration records in China and 241 records over CONUS.
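The quality control steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the processing code used in the study: the text does not say how the moving median is used to decide that a value is an outlier, so the `max_ratio` threshold below is hypothetical, as are all function names.

```python
from statistics import median

VALID_MIN, VALID_MAX = 1.0, 1000.0  # valid range stated in the text
WINDOW = 13                          # 6 h on either side of the current hour


def range_filter(series):
    """Replace values outside [VALID_MIN, VALID_MAX] with None."""
    return [v if v is not None and VALID_MIN <= v <= VALID_MAX else None
            for v in series]


def moving_median(series, window=WINDOW):
    """Centered moving median that ignores missing (None) values."""
    half = window // 2
    out = []
    for i in range(len(series)):
        neighbours = [v for v in series[max(0, i - half): i + half + 1]
                      if v is not None]
        out.append(median(neighbours) if neighbours else None)
    return out


def remove_outliers(series, max_ratio=3.0, window=WINDOW):
    """Mask values exceeding max_ratio times the local median.

    max_ratio is a hypothetical threshold; the text does not specify one.
    """
    med = moving_median(series, window)
    return [None if v is None or m is None or v > max_ratio * m else v
            for v, m in zip(series, med)]


def grid_cell_mean(station_series):
    """Average co-located station series hour by hour, skipping gaps."""
    means = []
    for values in zip(*station_series):
        valid = [v for v in values if v is not None]
        means.append(sum(valid) / len(valid) if valid else None)
    return means


raw = [35.0, 36.0, 0.2, 38.0, 900.0, 37.0, 36.0, 1500.0, 34.0]
clean = remove_outliers(range_filter(raw))
print(clean)
```

In this toy series, the values 0.2 and 1500 are removed by the range check, while 900 passes the range check but is masked because it is far above the local median.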

Auxiliary Data
To calibrate the raw CAMS PM 2.5 forecasts, we incorporated a set of meteorological factors to characterize the meteorological conditions at the base time (00:00 UTC, denoted as t 0 hereafter) and at each forecasting step (denoted as t n hereafter, where n refers to the time interval from t 0 ). The ultimate goal is to resolve the PM 2.5 variations associated with changing atmospheric conditions, since the temporal variation in PM 2.5 loading is largely regulated by anthropogenic emissions and meteorological conditions. Due to the lack of a high-resolution emission inventory, seven meteorological factors that are closely related to PM 2.5 variations, namely, relative humidity (RH), temperature (T), zonal wind component (U), meridional wind component (V), total precipitation (Prep), planetary boundary layer height (BLH), and surface pressure (SP), were used in this study to characterize the changes in meteorological conditions. Specifically, factors at t 0 were collected from the fifth generation ECMWF atmospheric reanalysis of the global climate (ERA5), while the forecasted meteorological fields were acquired from the Global Forecast System (GFS) of the National Centers for Environmental Prediction (NCEP). In addition, the analyzed PM 2.5 concentration data from EAC4 were acquired to represent the initial PM 2.5 loading at t 0 , namely, the background field for the derivation of PM 2.5 forecasts.

Random Forest
To account for the possible modeling biases in CAMS PM 2.5 forecasts, we applied the random forest (RF) method to establish a machine learning-based bias correction model to calibrate the raw CAMS simulations. Given its advantages of fast generalization and reduced overfitting, as well as its capability of evaluating the relative importance of input features, RF has been widely used in solving regression and classification problems [56,57]. Compared with other bias correction methods, the machine learning method does not require explicit physical assumptions since the model is driven simply by the input data [58][59][60]. In this study, we assumed that the temporal variation in PM 2.5 loading during a short period is mainly ascribed to the change of meteorological conditions rather than emission intensity. In this context, the biases in PM 2.5 forecasts could be modeled as:

$$PM_{cal} = f\left(PM_{t_n},\ PM_{t_0},\ MET_{t_0},\ MET_{t_n},\ season\right) \qquad (1)$$

where PM cal denotes the calibrated PM 2.5 forecasts, f denotes the mapping learned by the RF model, and PM tn and PM t0 are the raw CAMS PM 2.5 forecasts and the PM 2.5 reanalysis from EAC4, respectively. MET t0 and MET tn denote the meteorological variables at t 0 and at the forecasting time step t n , respectively. As PM 2.5 concentration exhibits significant variations in time, we also incorporated a dummy variable (season in Equation (1)) to indicate the season of each CAMS PM 2.5 value and thereby account for seasonally dependent biases. Specifically, data values in December, January, and February were considered to be wintertime observations, while March, April, and May were treated as springtime. Likewise, June, July, and August were referred to as summer, and September, October, and November as autumn. The season was numbered from one to four but used as a categorical variable in the RF model. To simplify the modeling process and to avoid possible offsetting effects among modeling biases, one calibration model was created for each CAMS forecasting step.
To make the computational burden manageable, we randomly selected 80% of the paired samples as the training set and retained the remaining 20% as the testing set for cross-validation.
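The bookkeeping around the calibration model (season encoding, per-step models, and the random 80/20 split) can be sketched as below. This is an illustrative sketch, not the authors' code: the assignment of the numbers one to four to particular seasons, the one-hot encoding used to treat the season as categorical, and the random seed are all assumptions.

```python
import random

# Assumed numbering (winter = 1 ... autumn = 4); the text only states that
# seasons were numbered from one to four.
SEASON_BY_MONTH = {12: 1, 1: 1, 2: 1,    # winter
                   3: 2, 4: 2, 5: 2,     # spring
                   6: 3, 7: 3, 8: 3,     # summer
                   9: 4, 10: 4, 11: 4}   # autumn

# One calibration model per forecasting step: 3 h, 6 h, ..., 120 h.
FORECAST_STEPS_H = list(range(3, 121, 3))


def season_one_hot(month):
    """Encode the season as a categorical (one-hot) feature."""
    code = SEASON_BY_MONTH[month]
    return [1 if code == k else 0 for k in (1, 2, 3, 4)]


def split_samples(samples, train_fraction=0.8, seed=0):
    """Randomly split paired samples into training and testing sets."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


train, test = split_samples(range(100))
print(len(FORECAST_STEPS_H), len(train), len(test))  # 40 80 20
```

A one-hot encoding is only one way to keep a tree-based model from reading the season codes as ordered magnitudes; some RF implementations accept categorical inputs directly.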
To better examine the possible dependence of modeling biases in CAMS PM 2.5 forecasts, we estimated the relative importance of each predictor in Equation (1) by taking advantage of the unique capacity of RF. In principle, the relative importance of each predictor is evaluated via the permuted variable's delta error [61]. Specifically, we assume there is a training dataset containing M variables and N observations. For any variable, we firstly randomly permute (reorder) all of its N observations, while maintaining the rest of the training dataset values in the same order and then retrain the model using the permuted dataset. In RF, the relative importance of a given variable is oftentimes evaluated by the percent increase in the mean squared modeling error after the permutation, and the selected variable is considered to play an important role if the modeling error increases significantly, and vice versa.
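The permute-and-retrain procedure described above can be sketched generically. In the example below, an ordinary least-squares fit is used purely as a lightweight stand-in for the RF model so that the block runs with the standard library alone, and the error is evaluated on the training data for simplicity. RF implementations commonly evaluate it on out-of-bag samples instead. All names and the synthetic data are hypothetical.

```python
import random


def fit_ols(X, y):
    """Least-squares fit with intercept via the normal equations (stand-in model)."""
    rows = [[1.0] + list(r) for r in X]
    m = len(rows[0])
    # Build the augmented system (A^T A | A^T y).
    A = [[sum(r[i] * r[j] for r in rows) for j in range(m)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(m)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(m):
        p = max(range(c, m), key=lambda k: abs(A[k][c]))
        A[c], A[p] = A[p], A[c]
        for k in range(m):
            if k != c:
                f = A[k][c] / A[c][c]
                A[k] = [a - f * b for a, b in zip(A[k], A[c])]
    beta = [A[i][m] / A[i][i] for i in range(m)]
    return lambda r: beta[0] + sum(b * v for b, v in zip(beta[1:], r))


def mse(model, X, y):
    """Mean squared error of a fitted model on (X, y)."""
    return sum((model(r) - t) ** 2 for r, t in zip(X, y)) / len(y)


def permutation_importance(fit, X, y, seed=0):
    """Percent increase in MSE after permuting one variable and retraining."""
    rng = random.Random(seed)
    base = mse(fit(X, y), X, y)
    scores = []
    for j in range(len(X[0])):
        column = [r[j] for r in X]
        rng.shuffle(column)  # reorder the N observations of variable j only
        X_perm = [r[:j] + [v] + r[j + 1:] for r, v in zip(X, column)]
        permuted = mse(fit(X_perm, y), X_perm, y)
        scores.append(100.0 * (permuted - base) / base)
    return scores


# Synthetic data: the target depends on the first variable only.
rng = random.Random(42)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
y = [2.0 * a + rng.gauss(0, 0.1) for a, _ in X]
scores = permutation_importance(fit_ols, X, y)
print(scores)
```

Permuting the informative first variable destroys its relationship with the target and inflates the error sharply, whereas permuting the uninformative second variable leaves the error essentially unchanged.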

Statistical Metrics for Accuracy Evaluation
Three commonly used statistical metrics, namely the correlation coefficient (R), root mean squared error (RMSE), and mean bias error (MBE), were calculated between spatially and temporally co-located in situ PM 2.5 measurements and CAMS PM 2.5 forecasts to quantitatively evaluate the accuracy and uncertainty of the latter. Mathematically, these three metrics can be derived from the following equations:

$$R = \frac{\sum_{i=1}^{n} (o_i - \bar{o})(p_i - \bar{p})}{\sqrt{\sum_{i=1}^{n} (o_i - \bar{o})^2 \, \sum_{i=1}^{n} (p_i - \bar{p})^2}}$$

$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (p_i - o_i)^2}$$

$$MBE = \frac{1}{n} \sum_{i=1}^{n} (p_i - o_i)$$

where $o_i$ denotes the ground-based in situ PM 2.5 measurements and $p_i$ the CAMS PM 2.5 forecasts, $\bar{o}$ and $\bar{p}$ are the arithmetic means of the observed and forecasted PM 2.5 concentrations, respectively, and n denotes the number of data pairs.

Figure 1 shows the site-specific accuracy of CAMS PM 2.5 forecasts in China and CONUS with a forecasting time horizon of 3 h (step-3). The forecasted PM 2.5 concentrations exhibit a moderate correlation with ground-based PM 2.5 measurements. Larger positive correlations were found mainly in eastern China, whereas weaker correlations were found in the western regions. Conversely, large positive correlations were more likely to be observed in the north and west of CONUS. In terms of RMSE, large modeling biases were found mainly in the northwest of China (>70 µg/m 3 ) and the west of CONUS (>30 µg/m 3 ). Such extraordinarily high biases indicate a relatively low accuracy of the raw CAMS PM 2.5 forecasts. With reference to MBE, the CAMS PM 2.5 forecasts overestimated in situ PM 2.5 measurements in eastern China (highly populated regions) and in severely polluted areas (e.g., Sichuan basin and Gansu). These large, spatially varying modeling biases indicate that the current CAMS forecasting model failed to accurately reproduce PM 2.5 concentration levels across China, which could be attributable to the lack of accurate emission inventories and limited access to observational data when simulating aerosols over China.
In contrast, evident overestimations were observed in the 3 h PM 2.5 forecasts across CONUS. Nevertheless, the overestimations were much smaller than in China, which could be due to the relatively low ambient PM 2.5 loadings in CONUS compared with China. On the other hand, abundant freely accessible ground-based air quality observations are available in CONUS, and assimilating these in situ observational data significantly helps reduce modeling errors in aerosol simulations.
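The three accuracy metrics used in this section (R, RMSE, and MBE) can be computed from co-located data pairs as in the minimal sketch below. The function name is hypothetical, and the bias is taken as forecast minus observation, so that a positive MBE corresponds to overestimation.

```python
import math


def evaluate(obs, pred):
    """Return (R, RMSE, MBE) for paired observations and forecasts.

    Bias is computed as forecast minus observation (assumed sign convention).
    """
    n = len(obs)
    o_bar = sum(obs) / n
    p_bar = sum(pred) / n
    cov = sum((o - o_bar) * (p - p_bar) for o, p in zip(obs, pred))
    var_o = sum((o - o_bar) ** 2 for o in obs)
    var_p = sum((p - p_bar) ** 2 for p in pred)
    r = cov / math.sqrt(var_o * var_p)
    rmse = math.sqrt(sum((p - o) ** 2 for o, p in zip(obs, pred)) / n)
    mbe = sum(p - o for o, p in zip(obs, pred)) / n
    return r, rmse, mbe


# Toy example: forecasts track the observations but are 2 units too high.
r, rmse, mbe = evaluate([10.0, 20.0, 30.0], [12.0, 22.0, 32.0])
print(r, rmse, mbe)  # 1.0 2.0 2.0
```

The toy case illustrates why the metrics are reported together: a forecast can be perfectly correlated with the observations and still carry a systematic bias that only RMSE and MBE reveal.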

Data Accuracy of CAMS PM 2.5 Forecasts
Similarly, Figure 2 shows the accuracy of CAMS PM 2.5 forecasts at step-120 (i.e., with a forecasting time horizon of 120 h). Notably, the accuracy decreased significantly as the time horizon increased from 3 to 120 h. Compared with the statistical metrics shown in Figure 1 (step-3), the PM 2.5 forecasts at step-120 not only showed a weaker correlation with in situ PM 2.5 measurements but also suffered from larger modeling biases (only in China). This implies a degradation of forecasting accuracy as the forecasting time horizon increases. Additionally, the overestimations were significantly enlarged with the increase of the forecasting time horizon, extending to cover more than half of the land area of China. Conversely, it is interesting to notice that both RMSE and MBE even decreased in CONUS as the time horizon increased from 3 to 120 h. This behavior runs counter to the error propagation pattern revealed in China, where the modeling biases increased significantly with the forecasting time horizon.

To better examine the temporal evolution of the accuracy of PM 2.5 forecasts, we also calculated site-specific R and RMSE between PM 2.5 forecasts and co-located in situ measurements at each forecasting step. Given the evident diurnal variation in PM 2.5 concentration, we adjusted the UTC time of the PM 2.5 forecasts to local time (UTC+8 for China and UTC-5 for CONUS) to account for the time difference between China and the US, so that the derived accuracy metrics in these two countries can be compared fairly. Figure 3 compares the regionally averaged R and RMSE at each forecasting time step between China and CONUS. The PM 2.5 forecasting accuracy decreased with the increase of the forecasting time horizon, especially in China, where a statistically significant decreasing trend of R and an increasing tendency of RMSE were observed.
Overall, such an accuracy degradation pattern is reasonable, as future PM 2.5 concentration levels depend not only on changes in emission sources but also on meteorological conditions. Although forecasts of these two factors (i.e., emission and meteorological fields) are subject to larger uncertainty as time evolves due to possible error propagation, we should be aware that the limited access to observational data (both meteorological data and air quality measurements) could also be a critical factor resulting in the extraordinarily large biases in PM 2.5 forecasts in China. This could be partially corroborated by the temporal variations in RMSE in CONUS, since no increasing trend was observed. Such an effect could be attributable to the relatively stationary variation in mean PM 2.5 loading in CONUS during a short period. In other words, the CAMS forecasting model succeeded in predicting mean PM 2.5 concentration levels in CONUS but failed to capture the fluctuations of PM 2.5 , which resulted in a decreased correlation.

In addition to the time-evolving accuracy degradation, the forecasting accuracy was also found to vary with an evident diurnal cycle. As shown in Figure 3, the largest correlation was mainly observed at 17:00 local time in China and 16:00 in CONUS on each specific date, whereas the smallest correlation was observed at 05:00 in China and 04:00 in CONUS. The largest RMSE was observed at 05:00 in China and 04:00 in CONUS, whereas the daily minimum occurred at 14:00 in China and 16:00 in CONUS. Such an evident diurnal variation in forecasting accuracy indicates that the current CAMS forecasting model might fail to accurately resolve the local processes, such as the variation of boundary layer height, that play important roles in determining the diurnal variability of PM 2.5 [62].
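The local hours quoted above follow from shifting the 3-hourly forecast steps, counted from 00:00 UTC, by the fixed offsets used in this study. The sketch below, with hypothetical names, lists the local hours at which forecasts are available in each region.

```python
UTC_OFFSET_H = {"China": 8, "CONUS": -5}  # fixed offsets used in the text


def local_hour(step_h, region, base_utc_hour=0):
    """Local hour of day for a forecast step counted from the base time."""
    return (base_utc_hour + step_h + UTC_OFFSET_H[region]) % 24


steps = range(3, 121, 3)  # 3 h, 6 h, ..., 120 h
for region in UTC_OFFSET_H:
    hours = sorted({local_hour(s, region) for s in steps})
    print(region, hours)
# China [2, 5, 8, 11, 14, 17, 20, 23]
# CONUS [1, 4, 7, 10, 13, 16, 19, 22]
```

Because forecasts are issued every 3 h, each region samples only eight local hours per day. The hours at which extremes are reported (05:00, 14:00, and 17:00 in China, 04:00 and 16:00 in CONUS) all lie on these grids, so the timing of the diurnal extremes is resolved to within 3 h at best.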
These results collectively reveal that the current CAMS PM 2.5 forecasts suffered from large yet nonstationary modeling biases, though the detailed reasons remain unclear, since numerical simulation efforts are required to diagnose them. Nevertheless, these results highlight the importance of performing bias correction to account for the large, diurnally varying modeling biases in the PM 2.5 forecasts prior to practical use of this dataset.

Figure 4 gives a further comparison of seasonally averaged R and RMSE to examine the possible seasonal variation in accuracy. Evident seasonal differences were observed in the PM 2.5 forecasting accuracy in both countries. In China, the highest correlation was observed in autumn and the lowest RMSE in summer. Given the generally low PM 2.5 loading in summer, the lowest RMSE is thus reasonable. In this context, we may conclude that the CAMS forecasting model had the highest accuracy in predicting autumn PM 2.5 concentrations in China, since the RMSE in autumn is the second lowest and the correlation is the highest. In contrast, PM 2.5 forecasts in spring and winter showed a relatively low accuracy, with larger biases more likely to be observed in spring.
As indicated by the CAMS science team, the large modeling biases in these two seasons could be attributable to the newly implemented dust emission and aerosol composition schemes in the CAMS forecasting system. Specifically, the new dust emission scheme always results in high dust emission values, while the newly added nitrate and ammonium compositions could lead to an overestimation of aerosol optical depth (AOD). In China, dust storms occur more frequently in spring, while more nitrates and ammonium are released in spring and winter due to excessive heating-related primary combustion [63,64]. Therefore, the overestimated AOD and dust emissions may inevitably lead to significant overestimations in PM 2.5 forecasts during these two seasons. On the contrary, large RMSE values were more likely to be observed in summer and autumn in the US, though high correlations were also observed at the same time. More importantly, the RMSE was even found to decrease in spring and summer with the increase of the forecasting time horizon, which may help explain the decreasing trend of RMSE shown in Figure 3b, though the detailed reasons remain unclear.

In contrast, the CAMS PM 2.5 forecasts showed higher overall accuracy in predicting PM 2.5 concentration in northeast China, where both RMSE and MBE were relatively low. Specifically, the accuracy varied with smaller deviations along the forecasting time horizon, as indicated by a shorter box compared with the others. Among the seven regions of interest, PM 2.5 concentration in central China was poorly predicted given significant overestimations and large variations in the forecasting accuracy. In the US, the CAMS PM 2.5 forecasts poorly reproduced the PM 2.5 concentration in the western part of the country given the much larger RMSE. Overall, the evident spatial and temporal variations in these three accuracy metrics clearly indicate that the current CAMS PM 2.5 forecasts suffered from spatially and temporally varying modeling biases.
To examine the possible dependence of the forecasting accuracy on PM 2.5 pollution levels, we also calculated correlation coefficients between regional mean PM 2.5 concentration and two accuracy metrics, namely R and RMSE, in China and CONUS. As shown in Table 1, the site-specific RMSE was closely correlated with mean PM 2.5 concentration levels across China except in South China, where RMSE showed no statistical dependence (R = −0.01) on mean PM 2.5 concentration values. The positive correlation between mean PM 2.5 concentration levels and RMSE thus indicates that the forecasted PM 2.5 concentration data were subject to larger modeling biases in regions with higher PM 2.5 loadings, especially over central China and the southwest of the country. In other words, the modeling biases in raw PM 2.5 forecasts may resemble the spatial distribution pattern of mean PM 2.5 concentration levels. A similar phenomenon was also observed in the west and southwest of CONUS, as larger modeling biases may occur in regions with higher PM 2.5 loadings. In contrast, there was no apparent dependence of correlation on mean PM 2.5 concentration levels, except in East China and the western US, where a statistically significant positive correlation was observed. This implies that the CAMS forecasting model can better capture future PM 2.5 concentration levels in regions with higher ambient PM 2.5 loadings. The negative correlation (mid-west of the US) thus means an opposite response. These dependence effects also reveal that PM 2.5 forecasts derived from the current CAMS forecasting model suffered from spatially heterogeneous and magnitude-dependent biases.

Machine Learning-Based Calibration of CAMS PM 2.5 Forecasts
Given the nonstationary and spatially heterogeneous modeling biases in CAMS PM 2.5 forecasts revealed above, we proposed to reduce such biases by calibrating the original PM 2.5 forecasts using a machine learning-based bias correction model. Figures 6 and 7 compare the validation accuracy of CAMS PM 2.5 forecasts at two specific forecasting time horizons (i.e., step-3 versus step-120) before and after the calibration. The accuracy of the PM 2.5 forecasts was significantly improved after the calibration. In China, the R value improved from 0.45 to 0.78 and the RMSE was reduced from 50.15 to 25.51 µg/m 3 (Figure 6). This benefit was even more prominent for PM 2.5 forecasts at step-120, as the R value improved from 0.43 to 0.76 and the RMSE was reduced from 91.69 to 22.79 µg/m 3 (Figure 7). Similarly, the benefit of the calibration method was also remarkable in the US, as the R value improved from 0.39 to 0.67 (0.30 to 0.58 for step-120), while the RMSE was reduced from 15.08 to 6.53 µg/m 3 (13.05 to 7.08 for step-120). As the scatter plots show, the calibrated data values agreed better with in situ PM 2.5 measurements, though they still underestimated the high PM 2.5 loadings to some extent.

Figure 8 shows the site-specific accuracy of the calibrated PM 2.5 forecasts (step-3) in China and CONUS. The large modeling biases in raw CAMS PM 2.5 forecasts were substantially reduced by the calibration model. As shown, the heterogeneous modeling biases in the original CAMS PM 2.5 forecasts (Figure 1), especially the large, spatially varying modeling biases in Sichuan basin and the northwest part of the country, were significantly reduced after the calibration, resulting in a spatially more homogeneous distribution of the three statistical accuracy metrics.
Figure 9 compares the mean accuracy of PM 2.5 forecasts at each forecasting time step before and after the calibration. There was an evident improvement in the accuracy of CAMS PM 2.5 forecasts at each forecasting time step after the calibration, as evidenced by improved correlation and reduced RMSE. More importantly, the diurnally varying modeling biases were also well accounted for by the calibration model. Whereas the RMSE derived from the original PM 2.5 forecasts exhibited evident diurnal variability, no apparent diurnal cycle was observed in the calibrated PM 2.5 forecasts (Figure 9b). Additionally, the increasing trend of RMSE was largely reduced in the calibrated dataset. Overall, these results not only justify the effectiveness of the proposed machine learning-based calibration model in reducing large modeling biases in raw CAMS PM 2.5 forecasts, but also highlight the need to improve the performance of the forecasting model used in the current CAMS system. Otherwise, the nighttime PM 2.5 forecasts would suffer from extraordinarily large modeling biases.

Figure 7. Comparison of the cross-validation accuracy between the original (a,b) and the calibrated (c,d) PM 2.5 forecasts with a forecasting time horizon of 120 h (step-120). Note that the data values compared here were unseen data, randomly selected and retained for cross-validation.

The results clearly show that the large and spatially heterogeneous modeling biases in raw PM 2.5 forecasts were significantly reduced after the calibration, resulting in PM 2.5 forecasts that better resemble the ground-based PM 2.5 measurements. Despite its effectiveness in reducing large modeling biases in raw PM 2.5 forecasts, we should be aware that the calibration model is still incapable of reconstructing all PM 2.5 concentration measurements (e.g., the high PM 2.5 loading over the middle-to-west regions in China on 9 February 2018).
This is because the calibrated data still depend highly on the raw PM 2.5 forecasts, as large errors in PM 2.5 background fields would persist in the subsequent forecasting field without adequate corrections.

To examine the possible dependence of modeling biases, we also estimated the relative importance of the predictors used in the random forest model to calibrate raw PM 2.5 forecasts at steps 3 and 120. As shown in Figure 12, PM 2.5 concentration at t 0 (i.e., the initial background) was found to have the largest relative importance (RH in China), which even exceeded that of the raw PM 2.5 forecasts at t 3 . This result not only emphasizes the critical role of the initial PM 2.5 concentration in determining future PM 2.5 concentration levels, but also implies the presence of large modeling bias in raw CAMS PM 2.5 forecasts. Otherwise, the raw PM 2.5 forecasts at t 3 should play the most important role in reproducing the actual PM 2.5 concentration measured at t 3 . Among the remaining predictors, RH, BLH, and season were the three predictors that played more important roles in calibrating PM 2.5 forecasts in China. In contrast, season, P, and T were the three critical predictors for the calibration of PM 2.5 forecasts in the US. Moreover, all meteorological variables at t 0 were found to have larger importance than the forecast fields at t 3 , except RH and BLH in the US. The reasons behind this effect could be twofold. First, PM 2.5 loadings vary little within a 3 h time interval. Second, the forecasted meteorological fields might suffer from large biases. With the increase of the forecasting time horizon, the relative importance of the initial PM 2.5 background was reduced (Figure 12a versus Figure 12b). Rather, the forecasted PM 2.5 fields were found to play the most critical role in predicting actual PM 2.5 measurements.
This implies that the forecasted PM 2.5 fields better resemble the actual PM 2.5 fields, which in turn justifies the effectiveness of the CAMS forecasting model in predicting future PM 2.5 pollution levels. Moreover, the forecasted meteorological fields were also found to play more important roles than the analyzed fields at t 0 , especially in China. This is in line with expectation, since the meteorological conditions may vary significantly as the time horizon increases. The distinct relative importance of predictors between step-3 and step-120 indicates that both the analyzed and forecasted fields should be included to better calibrate the raw CAMS PM 2.5 forecasts.

Discussion
In this study, we used in situ PM 2.5 concentration measurements from the national air quality monitoring networks in China and the US, between 2017 and 2018, as ground truth to evaluate the accuracy of CAMS PM 2.5 forecasts. Due to the coarse spatial resolution (40 km) of the model's footprint, the CAMS PM 2.5 forecasts are not expected to be as accurate as ground measurements and/or satellite observations at the local scale, such as in urban regions. Additionally, the coarse spatial resolution could lead to significant bias in direct point-to-grid comparisons between model outputs and ground measurements, since in situ measurements may be largely affected by local pollution sources. In other words, the assessed accuracy could be somewhat biased given the low representativeness of in situ measurements, especially in regions with few monitoring stations such as western China. On the other hand, the ground-based PM 2.5 data series were derived simply by averaging PM 2.5 records measured at multiple stations falling within the same model grid cell. Such an averaging scheme is easy to apply but ignores the spatial variations in PM 2.5 . This matters because a limited number of monitoring stations may not provide accurate pollution levels at the regional scale [17]. In other words, the spatial representativeness of the averaged PM 2.5 records could be biased, and the derived PM 2.5 records might poorly represent PM 2.5 concentration levels over the given CAMS grid cell. In this context, the reported accuracy could be prone to large uncertainty. Meanwhile, the spatial distribution and/or density of monitoring stations is also a critical factor that could influence the representativeness of the averaged PM 2.5 record [65]. For instance, two sets containing the same number of PM 2.5 records, one measured entirely in downtown areas and the other sampled in both rural and urban regions, may yield two distinct PM 2.5 concentration levels if each set is simply averaged. Overall, this scale-related bias should be recognized when interpreting point-to-grid comparison results, especially at a coarse model grid resolution.
With regard to the temporal evolution of R, RMSE, and MBE, we observed that the forecasting accuracy generally decreased with the increase of the forecasting time horizon. This is in line with expectation, since the forecasted PM 2.5 fields would suffer from larger bias due to error propagation and larger uncertainty in the forecasted meteorological fields. However, it is noteworthy that there was no significant increase in RMSE in the US as the forecasting time horizon extended. The exact reason remains unclear, but it is likely associated with the near-stationary variability of daily mean PM 2.5 concentration levels in the US. Moreover, an evident diurnal cycle was observed in these statistical metrics (Figures 3 and 4), indicating the presence of diurnally varying biases in the CAMS forecasted PM 2.5 fields. Such a diurnal variation pattern was also found by Marécal et al. when evaluating the performance of the models used in the European regional air quality system of MACC [66]. Since CAMS is the successor of the MACC project, the similar diurnally varying modeling bias revealed in the current CAMS PM 2.5 forecasts implies that the CAMS model still fails to account for the diurnally varying modeling bias found in the MACC project. In this context, we may ascribe the observed diurnally varying bias to defects of the CAMS model, in which factors modulating the diurnal variability of PM 2.5 , such as variations in emissions and boundary layer height, might not be well resolved [66].
To further improve the accuracy of CAMS PM 2.5 forecasts, a calibration model was developed using the random forest method to account for large modeling biases in raw CAMS PM 2.5 forecasts. Such a data-driven method proved effective in reducing nonlinear and nonstationary modeling biases in CAMS PM 2.5 forecasts by making use of both analyzed and forecasted PM 2.5 and meteorological fields, giving the calibrated PM 2.5 forecasts a higher accuracy in reproducing ground PM 2.5 measurements. Nevertheless, we should note that, apart from the PM 2.5 data, only the analyzed and forecasted meteorological fields were used, whereas factors indicating the actual aerosol loading (e.g., satellite AOD observations) were not included. Without observational aerosol data, the developed calibration model might be incapable of correcting large modeling biases in regions with high PM 2.5 loadings due to large bias in the analyzed PM 2.5 field. To derive better PM 2.5 forecasts and/or to account for spatially and temporally varying large modeling biases in CAMS PM 2.5 concentration forecasts, more accurate aerosol observational data and auxiliary factors that are closely related to the production and dispersion of PM 2.5 particles (e.g., hourly boundary layer height) could be included.

Conclusions
In this study, the accuracy of CAMS PM 2.5 forecasts was evaluated using two years of hourly in situ PM 2.5 concentration measurements sampled from the national air quality monitoring networks in China and CONUS. The ground-based validation results revealed a relatively low accuracy of the raw CAMS PM 2.5 forecasts given the presence of nonlinear and nonstationary modeling biases. Temporally, the accuracy of PM 2.5 forecasts generally decreased with the increase of the forecasting time horizon. Additionally, the accuracy was found to vary with an evident diurnal cycle, as the highest accuracy was more likely to be observed in the late afternoon (17:00 local time in China and 16:00 in CONUS), whereas the lowest accuracy occurred in the early morning (05:00 local time in China and 04:00 in CONUS). Moreover, the accuracy varied between seasons, as PM 2.5 forecasts in autumn in China (spring in the US) appeared to be better simulated. Generally, the revealed low accuracy of the raw CAMS PM 2.5 forecasts could be attributable to factors such as the coarse spatial resolution of the CAMS model, representation errors due to the distinct scales of ground measurements and model outputs, limited access to observational data, as well as improper formulation of the boundary layer effect in the CAMS model. A machine learning-based calibration model was then developed to reduce the large modeling biases found in the raw CAMS PM 2.5 forecasts. The validation results indicate that the calibrated data not only had a much lower RMSE but also correlated better with ground-based PM 2.5 measurements, suggesting an improved accuracy of the calibrated PM 2.5 forecasts. Overall, the accuracy of CAMS PM 2.5 forecasts assessed in this study provides a good reference for potential data users, and the developed machine learning-based calibration model can be used as a promising postprocessing measure to improve the accuracy.