Improving PM 2.5 Air Quality Model Forecasts in China Using a Bias-Correction Framework

: Chinese cities are experiencing severe air pollution in particular, with extremely high PM 2.5 levels observed in cold seasons. Accurate forecasting of occurrence of such air pollution events in advance can help the community to take action to abate emissions and would ultimately beneﬁt the citizens. To improve the PM 2.5 air quality model forecasts in China, we proposed a bias-correction framework that utilized the historic relationship between the model biases and forecasted and observational variables to post-process the current forecasts. The framework consists of four components: (1) a feature selector that chooses the variables that are informative to model forecast bias based on historic data; (2) a classiﬁer trained to efﬁciently determine the forecast analogs (clusters) based on clustering analysis, such as the distance-based method and the classiﬁcation tree, etc.; (3) an error estimator, such as the Kalman ﬁlter, to predict model forecast errors at monitoring sites based on forecast analogs; and (4) a spatial interpolator to estimate the bias correction over the entire modeling domain. One or more methods were tested for each step. We applied ﬁve combinations of these methods to PM 2.5 forecasts in 2014–2016 over China from the operational AiMa air quality forecasting system using the Community Multiscale Air Quality (CMAQ) model. All ﬁve methods were able to improve forecast performance in terms of normalized mean error (NME) and root mean square error (RMSE), though to a relatively limited degree due to the rapid changing of emission rates in China. Among the ﬁve methods, the CART-LM-KF-AN (a Classiﬁcation And Regression Trees-Linear Model-Kalman Filter-Analog combination) method appears to have the best overall performance for varied lead times. While the details of our study are speciﬁc to the forecast system, the bias-correction framework is likely applicable to the other air quality model forecast as well.


Introduction
The fine particulate matter with an aerodynamic diameter of less than 2.5 µm (PM 2.5 ) is the dominant air pollutant in most Chinese cities in recent years. In 2016, the nationwide annual mean-observed PM 2.5 concentration was 47 µg/m 3 , which exceeded 35 µg/m 3 , the level II national ambient air quality standards (NAQQS) of China (GB3095-2012), by more than 34%. The pollution levels are much higher in densely populated regions [1]. For example, in the Beijing-Tianjin-Hebei (BTH) area the average annual mean of PM 2.5 concentration was 71 µg/m 3 in 2016, twice of the standard, which would cause severe adverse health effects [2,3]. It is critical to provide the public and

PM 2.5 Model Forecasts
The daily PM 2.5 model forecasts from 2014 to 2016 were obtained from an operational air quality forecasting system (Available online: www.aimayubao.com) built upon the CMAQ model (version 5.0.2), along with the Weather Research and Forecasting Model (WRF, version 3.4.1) [32] and an emissions processing component. The emissions inventory used (called AiMa inventory) was compiled and projected from various inventories and information sources and was further adjusted by utilizing inverse modeling techniques. The CMAQ modeling is configured with the saprc07tc gas chemistry mechanism and the aero6 aerosol module. The WRF simulation is driven by NCEP's GFS 0.5-degree global weather forecast products. The forecasting system produces 144 h forecasts (called AiMa forecasts) at each cycle that covers 5 days local time. The operation of the forecasting system started on 4 February 2014. The model grids have a spatial resolution of 12 km covering the entirety of China. The configurations of the forecasting system have not been changed since operation, ensuring the consistency of the forecasting error distribution. The daily forecasts for the lead times of 1, 3, and 5 days were used in this study. The relative longer forecasting lead time (compared to 1-2 days lead time in previous studies [16,22]) would enable us to evaluate whether the post-correction method is useful for dynamic management practices [33], which usually requires about a 3-5 day lead time to take actions.

PM 2.5 Observations
The air quality observational data from the Chinese air quality monitoring network was obtained from the official real-time air quality monitoring publishing platform [34]. The monitoring network had 945 monitors in operation in 2014 and expanded to 1496 monitors in 2015. The network measured PM 2.5 concentrations with the widely used TEOM (Tapered Element Oscillating Microbalance) instruments. The 24 hourly observations within a day from 0:00 through 24:00 (Beijing local time) were averaged to calculate daily mean PM 2.5 concentrations. In case that two or more monitors are within the same grid cell, the averages of the observations from these monitors are used. To distinguish from the original observational data from monitors, we refer to this dataset as observations at "avg-monitors". As a result, the 945 monitors operational since 2014 were assigned to 545 different grid cells and all the 1496 monitors operational since 2015 were assigned to 840 grid cells (Figure 1a). Figure 1b illustrates the spatial patterns of the observed PM 2.5 pollution levels in China in year 2016.

PM2.5 Model Forecasts
The daily PM2.5 model forecasts from 2014 to 2016 were obtained from an operational air quality forecasting system (Available online: www.aimayubao.com) built upon the CMAQ model (version 5.0.2), along with the Weather Research and Forecasting Model (WRF, version 3.4.1) [32] and an emissions processing component. The emissions inventory used (called AiMa inventory) was compiled and projected from various inventories and information sources and was further adjusted by utilizing inverse modeling techniques. The CMAQ modeling is configured with the saprc07tc gas chemistry mechanism and the aero6 aerosol module. The WRF simulation is driven by NCEP's GFS 0.5-degree global weather forecast products. The forecasting system produces 144 h forecasts (called AiMa forecasts) at each cycle that covers 5 days local time. The operation of the forecasting system started on 4 February 2014. The model grids have a spatial resolution of 12 km covering the entirety of China. The configurations of the forecasting system have not been changed since operation, ensuring the consistency of the forecasting error distribution. The daily forecasts for the lead times of 1, 3, and 5 days were used in this study. The relative longer forecasting lead time (compared to 1-2 days lead time in previous studies [16,22]) would enable us to evaluate whether the post-correction method is useful for dynamic management practices [33], which usually requires about a 3-5 day lead time to take actions.

PM2.5 Observations
The air quality observational data from the Chinese air quality monitoring network was obtained from the official real-time air quality monitoring publishing platform [34]. The monitoring network had 945 monitors in operation in 2014 and expanded to 1496 monitors in 2015. The network measured PM2.5 concentrations with the widely used TEOM (Tapered Element Oscillating Microbalance) instruments. The 24 hourly observations within a day from 0:00 through 24:00 (Beijing local time) were averaged to calculate daily mean PM2.5 concentrations. In case that two or more monitors are within the same grid cell, the averages of the observations from these monitors are used. To distinguish from the original observational data from monitors, we refer to this dataset as observations at "avg-monitors". As a result, the 945 monitors operational since 2014 were assigned to 545 different grid cells and all the 1496 monitors operational since 2015 were assigned to 840 grid cells (Figure 1a). Figure 1b illustrates the spatial patterns of the observed PM2.5 pollution levels in China in year 2016.

Bias-Correction Framework
To improve forecast performance, we propose a bias-correction framework as a post-processing procedure for model forecasts. The framework utilizes historic data to establish relationships between the model forecast biases and a variety of model-simulated or observed variables, and then uses these relationships to correct the current forecast. The framework is conducted in four steps: (1) feature selection; (2) forecast analog determination; (3) local correction estimation; and (4) correction spread. Figure 2 illustrates the framework and the methods we tested in each step.

Bias-Correction Framework
To improve forecast performance, we propose a bias-correction framework as a post-processing procedure for model forecasts. The framework utilizes historic data to establish relationships between the model forecast biases and a variety of model-simulated or observed variables, and then uses these relationships to correct the current forecast. The framework is conducted in four steps: (1) feature selection; (2) forecast analog determination; (3) local correction estimation; and (4) correction spread. Figure 2 illustrates the framework and the methods we tested in each step. Bias-correction framework with its four steps and the four method combinations that are tested in this study. The fifth method tested is the 7-day Moving Average (7-Day-MA) method, which can be regarded as a special case within the framework but is not explicitly shown here.

Feature Selection
In the "feature selection" step, we choose a group of variables containing information about model biases from a pool of modeled or observed variables. By eliminating non-informative variables, we can improve the predictability of model biases by analogs. In this study, data from 545 avg-monitors (blue dots in Figure 1) in 2014 and 2015 are used as historic data for selecting informative variables, or features, at each of these avg-monitors. Only the variables that are selected in this step will be used in the following "analog determination" step.
Here, we consider in total 19 candidate variables. Five of them are observation variables (i.e., the mean observed concentrations of the five criteria air pollutants, CO, NO2, O3, PM2.5, and SO2) on the day prior to the forecasting cycle, and the rest of the candidate variables are model output, including model forecasted concentrations of the five criteria air pollutants, forecasted daily mean PM2.5 composition concentrations (SO4, NO3, NH4, EC and OC), and four meteorological variables (i.e., wind speed, temperature, planetary boundary layer heights, and relative humidity). We tested two methods for feature selection: (1) a linear regression model (denoted as LM); and (2) the random forest algorithm (denoted as RF). The LM method constructed a linear regression model with the PM2.5 forecast biases being the explained variable and the candidate variables being the explanatory variables. Those explanatory variables with p-value < 0.05 [35,36] were retained as informative Bias-correction framework with its four steps and the four method combinations that are tested in this study. The fifth method tested is the 7-day Moving Average (7-Day-MA) method, which can be regarded as a special case within the framework but is not explicitly shown here.

Feature Selection
In the "feature selection" step, we choose a group of variables containing information about model biases from a pool of modeled or observed variables. By eliminating non-informative variables, we can improve the predictability of model biases by analogs. In this study, data from 545 avg-monitors (blue dots in Figure 1) in 2014 and 2015 are used as historic data for selecting informative variables, or features, at each of these avg-monitors. Only the variables that are selected in this step will be used in the following "analog determination" step.
Here, we consider in total 19 candidate variables. Five of them are observation variables (i.e., the mean observed concentrations of the five criteria air pollutants, CO, NO 2 , O 3 , PM 2.5 , and SO 2 ) on the day prior to the forecasting cycle, and the rest of the candidate variables are model output, including model forecasted concentrations of the five criteria air pollutants, forecasted daily mean PM 2.5 composition concentrations (SO 4 , NO 3 , NH 4 , EC and OC), and four meteorological variables (i.e., wind speed, temperature, planetary boundary layer heights, and relative humidity). We tested two methods for feature selection: (1) a linear regression model (denoted as LM); and (2) the random forest algorithm (denoted as RF). The LM method constructed a linear regression model with the PM 2.5 forecast biases being the explained variable and the candidate variables being the explanatory variables. Those explanatory variables with p-value < 0.05 [35,36] were retained as informative variables for analog determination. With the RF method [37], the variables with large importance indicators are selected as informative variables. We implemented these feature selection algorithm with the R software (the lm function for LM and the Boruta function for RF) [38]. Figure 3 shows the number counts of the avg-monitors that selected each candidate variable by the LM and RF algorithms. On average, the LM algorithm selected about six informative variables for each avg-monitors. In contrast, the RF algorithm selected about 15 on average. This difference is likely an indication that RF is less effective than LM at distinguishing the information content among variables. As we will find out later in our analysis (Table 1), a method involving LM (e.g., the CART-LM-KF-AN method, see definition in Figure 2) often outperformed a method involving RF (e.g., CART-RF-KF-AN). This comparison highlights the importance of a proper feature selection procedure. Figure 3 shows the number counts of the avg-monitors that selected each candidate variable by the LM and RF algorithms. On average, the LM algorithm selected about six informative variables for each avg-monitors. In contrast, the RF algorithm selected about 15 on average. This difference is likely an indication that RF is less effective than LM at distinguishing the information content among variables. As we will find out later in our analysis (Table 1), a method involving LM (e.g., the CART-LM-KF-AN method, see definition in Figure 2) often outperformed a method involving RF (e.g., CART-RF-KF-AN). This comparison highlights the importance of a proper feature selection procedure.
Among the 19 variables, PM2.5 observations on the initial day of the forecast cycle (I.PM2.5) and CMAQ PM2.5 forecast (F.PM2.5) were the top two most selected variables for the 1-day lead time, indicating that they contain the most information about the short-term model forecast errors. As expected, the I.PM2.5 is less informative for longer lead times, resulting in a decrease in selection counts by the LM algorithm for the 3 and 5 day lead times (Figure 3). For all lead times in question, model-forecast air pollutant variables, such as F.PM2.5, OC, and F.CO, and meteorological variables, such as RH, were also frequently selected by the LM algorithm at many avg-monitors. The forecasted OC and F.CO were frequently selected, likely because they are indicative of model biases in emissions and transport.

Analog Determination
In the "analog determination" step, we search for a "forecast analog", an ensemble of previous model forecasts that are similar to the current forecast to be corrected [16,21]. The bias information in the forecast analog is then used to estimate the correction to be applied in the current forecast in the following "local correction estimation" step. The similarity between current and historic forecasts is measured in terms of the informative variables we selected in the "feature selection" step, using two  Among the 19 variables, PM 2.5 observations on the initial day of the forecast cycle (I.PM 2.5 ) and CMAQ PM 2.5 forecast (F.PM 2.5 ) were the top two most selected variables for the 1-day lead time, indicating that they contain the most information about the short-term model forecast errors. As expected, the I.PM 2.5 is less informative for longer lead times, resulting in a decrease in selection counts by the LM algorithm for the 3 and 5 day lead times (Figure 3). For all lead times in question, model-forecast air pollutant variables, such as F.PM 2.5 , OC, and F.CO, and meteorological variables, such as RH, were also frequently selected by the LM algorithm at many avg-monitors. The forecasted OC and F.CO were frequently selected, likely because they are indicative of model biases in emissions and transport.

Analog Determination
In the "analog determination" step, we search for a "forecast analog", an ensemble of previous model forecasts that are similar to the current forecast to be corrected [16,21]. The bias information in the forecast analog is then used to estimate the correction to be applied in the current forecast in the following "local correction estimation" step. The similarity between current and historic forecasts is measured in terms of the informative variables we selected in the "feature selection" step, using two different methods: (1) Euclidean distance between two forecasts in the feature space (denoted as DIST) and (2) classification predicted by the CART algorithm [39] (denoted as CART). The CART algorithm generates a decision tree which minimizes the total deviations within the branches of the tree. The algorithm is widely used in remote sensing image processing for land cover classification [40,41], ecology modeling [42,43] and pattern recognition studies [44]. We implemented the CART calculation with the rpart() function in the R software [38].

Local Correction Estimation
In the "local correction estimation" step, we estimated the forecast bias at individual avg-monitors based on the "forecast analog" determined in the previous "analog determination" step.
A straightforward way to estimate the bias is the Distance-based Analog (DIST-AN) method, which takes an inverse distance weighted average of the biases in the forecast analog.
where the PM cp,T+k and PM p,T+k , respectively, refer to the corrected and raw model forecasts of the PM 2.5 concentrations. δ T+k refers to the correction, which is calculated by the inverse distance weighted mean forecast biases of the M analogs.
In addition to the inverse distance weighted average, the Kalman filter (KF), known for its easy implementation, fast convergence speed, and effectiveness at eliminating data noises, is also tested in this study to estimate forecast errors. The Kalman filter works on an ordered set of inputs. Previous studies have used the input dataset ordered by time [26,45]. In this study, we implemented the Kalman filter on a set of analogs ordered by distances. The Kalman filter used a dynamic weighting method to fuse the observations and estimations at time t, as shown in the equation below: where thex t+1|t refers to the estimation of forecast error at the time t+1 using the information at time t. Thex t|t−1 denotes the forecast error estimation at the time t. The y t refers to the observed forecast error at time t. The weighting factor K t was called Kalman gain, which was calculated through the optimization of estimation and observation noises. The detailed approach for K t estimation can be found in Delle Monache, Nipen, Liu, Roux, and Stull [23]. Depending on the methods used in the "analog determination" step, the Kalman filter is applied in the Distance- cycle. This method is chosen for its fast computation and easy implementation. The 7-day window length has also been used in previous studies on model post-correction [16,23]. The 7-Day-MA method can be regarded as a special case of the analog method, in which the 7 days prior to the forecasting cycle are the forecast analog and each day is weighted equally.

Correction Spread
After the "local correction estimation" is done, we further spread the estimated corrections at avg-monitors to the entire domain including model grids containing no monitors. The corrections were then applied to the original gridded model forecast to obtain bias-corrected forecasts over the whole China domain. In this study, we used ordinary Kriging [46] to spatially interpolate the biases from monitors to the entire domain.

Evaluation
Following the post-process framework, we tested five combinations of methods ( Figure 2). For example, the CART-LM-KF-AN method uses the LM method for "feature selection", the CART method for "analog determination", and the KF method for "local correction estimation". Readers can refer to Figure 2 for an illustration of how varied methods for each step are combined. To evaluate the performance in varied steps of the procedure, we reported separately the performance statistics for the "local correction estimation" and "correction spread" steps. The reported performance statistics include the coefficient of determination (R 2 ), the normalized mean error (NME), and the root mean square error (RMSE) [16].
The observation and model outputs from 4 February 2014 to 31 December 2015 are used as historic data, with which we selected informative variables and searched for forecast analogs. In the "local correction estimation" step, we used the model output for 2016 to estimate the bias-corrected forecasts at the 545 individual avg-monitors. We then used the corresponding 2016 observations to evaluate the performance after the "local correction estimation" step. Note that not all avg-monitors were used in the "local correction estimation". After the "correction spread" step, we then used the data from the remaining 211 avg-monitors to evaluate the performance of the "final product" at locations without observations. The 211 avg-monitors were so selected that they were adequately apart from each other and from the 545 avg-monitors that were used in the "local correction estimation". These 211 avg-monitors are marked as red dots in Figure 1. The performance evaluation was conducted separately for 1-day, 3-day, and 5-day lead forecasts.

Performance in Estimating Local Corrections
Although up to the "local correction estimation" step we only computed the local corrections at locations with observations, these local corrections were crucial for the performance of the final product. Therefore, in this study, we applied five different methods ( Figure 2) and evaluated them at avg-monitors. Figure 4 presents the performance of the raw PM 2.5 model forecasts in predicting observations at avg-monitors in 2016. The annual mean biases of model forecasts in 2016 were generally negative in the southern and northwestern part of China, while they were positive (mostly 0-10 µg/m 3 ) in the middle part of China, regardless of the forecast lead-times. According to the distributions of statistical metrics of R 2 and NME, the raw model forecast performed much better in North China and East China than in other regions, while it performed much worse in West China. Spatial patterns with geographical divisions in biases and other performance statistical metrics, i.e., R 2 , NME, and RMSE here, indicate potential non-uniform uncertainties in the emission rates estimation among regions and varied prediction errors in meteorological forecast fields over different terrains. We do observe a significant degradation in the performance of predicting meteorological variables at the surface in regions with complex terrains (results not shown). However, as the lead-time increases, the performance of the raw PM 2.5 model forecasts only degrade slightly ( Figure 4 and Table 1), implying that the error does not grow much in predicting meteorology during the forecasting period of 144 h.  Table 1 summarizes the performance statistics of each method for different lead times and Table 2 summarizes the percentage changes of the performance statistics with respect to the raw model forecasts. Compared with the raw model forecasts, all the bias-correction methods are able to decrease the NME and the RMSE at all the three lead times. The reductions in the NME are 7.4-19.3%, 7.3-14.4%, and 4.5-12.2%, and the reductions in RMSE are 7.8-16.1%, 7.5-12.5%, and 6.3-11.3% for 1-day, 3-day, and 5-day lead times, respectively, showing that these post-processing techniques are effective to improve the PM2.5 forecast at locations with observations. Figure 5 shows that all methods can improve NME and RMSE at the majority of avg-monitors. For example, the CART-LM-KF-AN method decreases NME and RMSE at about 70% to 80% avg-monitors for all three lead times.
For most of these methods, however, R 2 only increases marginally for the 1-day lead time and even decreases slightly in the 3-day and 5-day lead times. For example, DIST-AN decreases the R 2 by 17.7% for the 3-day lead time. The ineffectiveness in increasing R 2 may reflect that these analog-based methods, although good at reducing biases, do not improve the ability to capture the variability in the data, especially for longer lead times. Among the five methods in question, the CART-LM-KF-AN method has the largest R 2 for all the three lead times, with a 6% increase in R 2 for the 1-day lead and essentially no change for the 3-day and 5-day leads from the raw model forecast.  Table 1 summarizes the performance statistics of each method for different lead times and Table 2 summarizes the percentage changes of the performance statistics with respect to the raw model forecasts. Compared with the raw model forecasts, all the bias-correction methods are able to decrease the NME and the RMSE at all the three lead times. The reductions in the NME are 7.4-19.3%, 7.3-14.4%, and 4.5-12.2%, and the reductions in RMSE are 7.8-16.1%, 7.5-12.5%, and 6.3-11.3% for 1-day, 3-day, and 5-day lead times, respectively, showing that these post-processing techniques are effective to improve the PM 2.5 forecast at locations with observations. Figure 5 shows that all methods can improve NME and RMSE at the majority of avg-monitors. For example, the CART-LM-KF-AN method decreases NME and RMSE at about 70% to 80% avg-monitors for all three lead times. Table 2. Percentage changes (%) of the performance statistics with respect to original model forecast by the five methods in the "correction estimation" step.

Performance of the Final Product
After we estimated the correction for the model forecasts at each of the individual avg-monitors, we spread the local corrections across the entire Chinese domain by spatially interpolating the local corrections with ordinary Kriging. Figure 6 shows an example of the "final product" of a 5-day lead forecast for 30 December 2016, using the CART-LM-KF-AN method. Table 3. Performance statistics for R 2 , NME, and RMSE (in ug/m 3 ) at locations without monitors after correction spread. By comparing the "final product" with observations at 211 avg-monitors (whose data were not used in the "local correction estimation" step), we can evaluate the performance of the "final product" at locations without avg-monitors. Table 3 shows that the correction estimated through the spatial interpolation can also effectively reduce forecast errors, even at locations without PM2.5 monitors. Depending on the methods and lead times, the fraction of avg-monitors that finds improvements in NME and RMSE varies from 50% to 80% (Figure 7). The improvements, for most methods, are slightly less but still comparable to those at locations with observations (Tables 1 and 3), indicating that, compared to the forecast errors at locations with observations, the uncertainties induced by spatial interpolation are likely insignificant. In other words, the "local correction estimation" (including feature selection and analog determination) rather than "correction spread" is the "bottle-neck" in the post-correction framework. Efforts to further improve the performance should be directed to improve the estimation of local corrections. For most of these methods, however, R 2 only increases marginally for the 1-day lead time and even decreases slightly in the 3-day and 5-day lead times. For example, DIST-AN decreases the R 2 by 17.7% for the 3-day lead time. The ineffectiveness in increasing R 2 may reflect that these analog-based methods, although good at reducing biases, do not improve the ability to capture the variability in the data, especially for longer lead times. Among the five methods in question, the CART-LM-KF-AN method has the largest R 2 for all the three lead times, with a 6% increase in R 2 for the 1-day lead and essentially no change for the 3-day and 5-day leads from the raw model forecast.

CART-LM-KF-AN
The results also show the impact of forecasting lead times on the performance of bias-correction techniques (Tables 1 and 2, Figure 5). In general, the enhancement in the forecast performance decreases with the longer lead time. The decreasing effectiveness of the post-processing procedure with lead times may partly result from the fact that the increasing uncertainties in the model forecasted meteorology and pollutant concentrations lead to larger uncertainties in the analog determination for the longer lead times. Among the five methods in this study, the performance of the CART-LM-KF-AN method is most insensitive to varied lead times (NME 0.41, 0.43, 0.45, and RMSE 27.5, 27.5, 28.7 ug/m 3 for 1-day, 3-day, and 5-day lead times, respectively), showing that the combination of the CART and LM methods constitutes a more robust analog determination algorithm for the PM 2.5 model forecast in China. In contrast, the performance of the DIST-AN method (NME 0.40, 0.46, 0.49, and RMSE 27.0, 29.0, 30.3 ug/m 3 for 1-day, 3-day, and 5-day lead times, respectively) and the 7-Day-MA method (NME 0.39, 0.43, 0.46 and RMSE 27.0, 27.8, 29.1 ug/m 3 for 1-day, 3-day, and 5-day lead time, respectively) degrade significantly with longer lead times. Although the performance of the CART-LM-KF-AN method is similar to or a little worse than the 7-Day-MA and DIST-AN for the 1-day lead time, CART-LM-KF-AN outperforms other methods for a longer lead time, which is a good property for the purpose of dynamic air quality management.

Performance of the Final Product
After we estimated the correction for the model forecasts at each of the individual avg-monitors, we spread the local corrections across the entire Chinese domain by spatially interpolating the local corrections with ordinary Kriging. Figure 6 shows an example of the "final product" of a 5-day lead forecast for 30 December 2016, using the CART-LM-KF-AN method.

Discussion
In comparison with previous studies conducted in the U.S., our results generally show less improvements (in terms of percentage) from the raw model forecasts. For example, Djalalova, Delle, Monache, and Wilczak [16] used KF-AN (similar to DIST-KF-AN in this study) to post-correct hourly CMAQ PM2.5 forecasts for the 1-day lead time and reduced the MAE values by 65% from 8.5 μg/m 3 to 3 μg/m 3 . Kang, et al. [47] also applied the KF-AN method to daily PM2.5 forecasts and reduced RMSE by 33% from 7.5 μg/m 3 to 5 μg/m 3 . Previous studies have found that the KF-and AN-based methods better perform in lower pollution level regions [28]. The differences in the performance between this and previous studies may result from the fact that PM2.5 levels in China are much higher than in the U.S. Consistent with previous studies, our analysis also shows that the percentage improvements by the CART-LM-KF-AN method are generally larger in relatively cleaner regions (e.g., the Pearl River Delta in South china, Northeast China, and other remote regions) than in heavily polluted regions (e.g., the North China Plain and the Yangtze River Delta in East China) (Figure 8), suggesting that there might be important factors missing in the trained relationship between model biases and predictor variables over polluted regions. One such factor is the fast-changing emissions in both magnitude and distribution in regions such as the North China Plain and the Yangtze River Delta during the modeled three years [48,49], a result of increasingly more strict emission control enforcements and/or economic fluctuations. The significant change of emission rates in these regions between the training years (2014-2015) and the prediction year (2016) could confound the trained bias correction relationships. In contrast, the actual emission rates were likely to vary insignificantly in the cleaner regions over the same time period. By comparing the "final product" with observations at 211 avg-monitors (whose data were not used in the "local correction estimation" step), we can evaluate the performance of the "final product" at locations without avg-monitors. Table 3 shows that the correction estimated through the spatial interpolation can also effectively reduce forecast errors, even at locations without PM 2.5 monitors. Depending on the methods and lead times, the fraction of avg-monitors that finds improvements in NME and RMSE varies from 50 to 80% (Figure 7). The improvements, for most methods, are slightly less but still comparable to those at locations with observations (Tables 1 and 3), indicating that, compared to the forecast errors at locations with observations, the uncertainties induced by spatial interpolation are likely insignificant. In other words, the "local correction estimation" (including feature selection and analog determination) rather than "correction spread" is the "bottle-neck" in the post-correction framework. Efforts to further improve the performance should be directed to improve the estimation of local corrections.

Discussion
In comparison with previous studies conducted in the U.S., our results generally show less improvements (in terms of percentage) from the raw model forecasts. For example, Djalalova, Delle, Monache, and Wilczak [16] used KF-AN (similar to DIST-KF-AN in this study) to post-correct hourly CMAQ PM2.5 forecasts for the 1-day lead time and reduced the MAE values by 65% from 8.5 μg/m 3 to 3 μg/m 3 . Kang, et al. [47] also applied the KF-AN method to daily PM2.5 forecasts and reduced RMSE by 33% from 7.5 μg/m 3 to 5 μg/m 3 . Previous studies have found that the KF-and AN-based methods better perform in lower pollution level regions [28]. The differences in the performance between this and previous studies may result from the fact that PM2.5 levels in China are much higher than in the U.S. Consistent with previous studies, our analysis also shows that the percentage improvements by the CART-LM-KF-AN method are generally larger in relatively cleaner regions (e.g., the Pearl River Delta in South china, Northeast China, and other remote regions) than in heavily polluted regions (e.g., the North China Plain and the Yangtze River Delta in East China) (Figure 8), suggesting that there might be important factors missing in the trained relationship between model biases and predictor variables over polluted regions. One such factor is the fast-changing emissions in both magnitude and distribution in regions such as the North China Plain and the Yangtze River Delta during the modeled three years [48,49], a result of increasingly more strict emission control enforcements and/or economic fluctuations. The significant change of emission rates in these regions between the training years (2014-2015) and the prediction year (2016) could confound the trained bias correction relationships. In contrast, the actual emission rates were likely to vary insignificantly in the cleaner regions over the same time period.

Discussion
In comparison with previous studies conducted in the U.S., our results generally show less improvements (in terms of percentage) from the raw model forecasts. For example, Djalalova, Delle, Monache, and Wilczak [16] used KF-AN (similar to DIST-KF-AN in this study) to post-correct hourly CMAQ PM 2.5 forecasts for the 1-day lead time and reduced the MAE values by 65% from 8.5 µg/m 3 to 3 µg/m 3 . Kang, et al. [47] also applied the KF-AN method to daily PM 2.5 forecasts and reduced RMSE by 33% from 7.5 µg/m 3 to 5 µg/m 3 . Previous studies have found that the KF-and AN-based methods better perform in lower pollution level regions [28]. The differences in the performance between this and previous studies may result from the fact that PM 2.5 levels in China are much higher than in the U.S. Consistent with previous studies, our analysis also shows that the percentage improvements by the CART-LM-KF-AN method are generally larger in relatively cleaner regions (e.g., the Pearl River Delta in South china, Northeast China, and other remote regions) than in heavily polluted regions (e.g., the North China Plain and the Yangtze River Delta in East China) (Figure 8), suggesting that there might be important factors missing in the trained relationship between model biases and predictor variables over polluted regions. One such factor is the fast-changing emissions in both magnitude and distribution in regions such as the North China Plain and the Yangtze River Delta during the modeled three years [48][49][50], a result of increasingly more strict emission control enforcements and/or economic fluctuations. The significant change of emission rates in these regions between the training years (2014-2015) and the prediction year (2016) could confound the trained bias correction relationships. In contrast, the actual emission rates were likely to vary insignificantly in the cleaner regions over the same time period.

Conclusions
To improve the PM2.5 forecast, we proposed a bias-correction framework that utilized the relationships between biases and select forecasted and observational variables. The framework consists of four steps: feature selection, analog determination, local correction estimation, and correction spread. We applied this bias-correction framework to PM2.5 forecasts in 2014-2016 over China from the operational AiMa air quality forecasting system using the CMAQ model. Five methods, differing in how to perform feature selection, analog determination, and local correction estimation, were tested in this study, and we found all the five methods were able to improve the overall forecast performance in terms of RMSE and NME, though to a relatively limited degree.
Based on our results, we recommend the CART-LM-KF-AN method. In most cases, the performance of this method is better or comparable to other methods. Particularly, the method shows consistent improvement for longer lead times (3-5 day) when other methods degrade in their performance. This is important for dynamic air quality management, as this type of practice often requires longer lead time.
In comparison with previous studies that were all conducted in areas outside of China, our results generally show fewer improvements (in terms of percentage) from the raw model forecast, Figure 8. The difference in R 2 (a,d,g), NME (b,e,h), and RMSE (c,f,i) values between bias-corrected forecasts by the CART-LM-KF-AN method and raw CMAQ forecasts (Corrected forecast minus raw forecasts) at the 545 avg-monitors for the 1, 3, and 5-day lead times. The differences in RMSE is in µg/m 3 .

Conclusions
To improve the PM 2.5 forecast, we proposed a bias-correction framework that utilized the relationships between biases and select forecasted and observational variables. The framework consists of four steps: feature selection, analog determination, local correction estimation, and correction spread. We applied this bias-correction framework to PM 2.5 forecasts in 2014-2016 over China from the operational AiMa air quality forecasting system using the CMAQ model. Five methods, differing in how to perform feature selection, analog determination, and local correction estimation, were tested in this study, and we found all the five methods were able to improve the overall forecast performance in terms of RMSE and NME, though to a relatively limited degree.
Based on our results, we recommend the CART-LM-KF-AN method. In most cases, the performance of this method is better or comparable to other methods. Particularly, the method shows consistent improvement for longer lead times (3-5 day) when other methods degrade in their performance. This is important for dynamic air quality management, as this type of practice often requires longer lead time.
In comparison with previous studies that were all conducted in areas outside of China, our results generally show fewer improvements (in terms of percentage) from the raw model forecast, especially in regions with much higher pollution levels. A major reason for this is that the fast-changing emissions in high pollution regions of China can confound the relationship between model biases and predictor variables. On the other side, the correction spread is not found to be a significant source of errors at locations without monitors. Future efforts should be directed to improve the performance in the local correction estimation, especially to explore methods that can build relationships that better represent the missing factors of changing emissions.