Estimation of the Near-Surface Ozone Concentration with Full Spatiotemporal Coverage across the Beijing-Tianjin-Hebei Region Based on Extreme Gradient Boosting Combined with a WRF-Chem Model

: With the intensiﬁcation of global warming and economic development in China, the near-surface ozone (O 3 ) concentration has been increasing recently, especially in the Beijing-Tianjin-Hebei (BTH) region, which is the political and economic center of China. However, O 3 has been measured in real time only over the past few years, and the observational records are discontinuous. Therefore, we propose a new method (WRFC-XGB) to establish a near-surface O 3 concentration dataset in the BTH region by integrating the Weather Research and Forecasting with Chemistry (WRF-Chem) model with the extreme gradient boosting (XGBoost) algorithm. Based on this method, the 8-h maximum daily average (MDA8) O 3 concentrations are obtained with full spatiotemporal coverage at a spatial resolution of 0.1 ◦ × 0.1 ◦ across the BTH region in 2018. Two evaluation methods, sample- and station-based 10-fold cross-validation (10-CV), are used to assess our method. The sample-based (station-based) 10-CV evaluation results indicate that WRFC-XGB can achieve excellent accuracy with a high coefﬁcient of determination (R 2 ) of 0.95 (0.91), low root mean square error (RMSE) of 13.50 (17.70) µ g m − 3 , and mean absolute error (MAE) of 9.60 (12.89) µ g m − 3 . In addition, superb spatiotemporal consistencies are conﬁrmed for this model, including the estimation of high O 3 concentrations, and our WRFC-XGB model outperforms traditional models and previous studies in data mining. In addition, the proposed model can be applied to estimate the O 3 concentration when it has not been measured. Furthermore, the spatial distribution analysis of the MDA8 O 3 in 2018 reveals that O 3 pollution in the BTH region exhibits signiﬁcant seasonality. Heavy O 3 pollution episodes mainly occur in summer, and the high O 3 loading is distributed mainly in the southern BTH areas, which will pose challenges to atmospheric environmental governance for local governments.


Introduction
Near-surface ozone (O 3 ), a secondary air pollutant, is produced primarily by photochemical reactions of volatile organic compounds (VOCs), nitrogen oxides (NO x ), and carbon monoxide (CO) under solar radiation [1]. Epidemiological studies have demonstrated that human long-term exposure to high levels of O 3 could cause asthma, lung cancer, and cardiovascular diseases [2]. In addition, high O 3 concentrations inhibit vegetation growth, reduce the primary productivity of vegetation, and diminish crop yields [3]. Moreover, as a greenhouse gas, O 3 can change the global climate by affecting the radiative energy budget of the Earth-atmosphere system [4]. Since the beginning of the 21st century, China has experienced rapid urbanization and industrialization; meanwhile, the emissions resulting model is named WRFC-XGB. In combination with European Centre for Medium-Range Weather Forecasts (ECMWF) Fifth-Generation Reanalysis (ERA5) data, satellite data, and WRF-Chem output data, the MDA8 O 3 concentrations in the BTH region in 2018 were estimated. Section 2 introduces the establishment of the WRFC-XGB model and the 10-fold CV (10-CV) method. In Section 3, we evaluate the performance of the WRFC-XGB model from different perspectives and compare the results of the model with those of traditional models and existing O 3 studies. Finally, we summarize some of our conclusions in Section 4.

Study Area
The BTH region, which includes two municipalities (Beijing and Tianjin) and eleven prefecture-level cities in Hebei Province (Figure 1), is one of the three major areas in China characterized by air pollution. Approximately 136 million people live in an area of~248,000 km 2 [29]. Unfortunately, many residents are exposed to high aerosol and O 3 loadings. According to the National Urban Air Quality Report in July 2018, most cities in this region, especially those located in the southern BTH region, placed low in the MDA8 O 3 ranking among all the cities in China [30].

Near-Surface O 3 Monitoring Data
The hourly surface O 3 concentration records of 75 monitoring sites in the BTH region were collected from the China Environmental Monitoring Centre (CEMC). The distribution of these sites is shown in Figure 1. To reduce the uncertainty in this dataset, we removed hourly data with less than 8 h of daily monitoring data, and the MDA8 O 3 concentration of each station from 1 January 2018, to 31  In this study, the WRF-Chem 3.9.1 was used to simulate the hourly O 3 concentration in the BTH region with a spatial resolution of 9 km. WRF-Chem is a fully coupled online atmospheric chemistry model [31] that is generally driven by meteorological and emission data. The meteorological driving datasets were sourced from the National Centers for Environmental Prediction (NCEP) Final Operational Global Analysis with temporal and spatial resolutions of 6 h and 1 • × 1 • , respectively. The initial field and boundary conditions of the proposed model were built on the basis of these datasets. Emission data are also essential. The emission data were divided into anthropogenic and biogenic emissions. The anthropogenic emission inventory data were obtained from the China Multiresolution Emission Inventory (MEIC) with a 0.25 • × 0.25 • spatial resolution, and the monthly emission data were converted into hourly emissions. Moreover, the MEIC mainly includes five anthropogenic emission sources, including industrial, power plants, residential, vehicular, and agricultural emissions, and covers 10 major air pollutants and greenhouse gases [32][33][34][35][36]. The biogenic emissions were obtained from the Model of Emissions of Gases and Aerosols from Nature (MEGAN) [37].

Other Auxiliary Data
Meteorological factors can also affect air pollution [38]. Eight meteorological variables were selected to establish the model in this paper: the 2-m temperature (TEM), relative humidity (RH), boundary layer height (BLH), evaporation (ET), surface pressure (SP), wind direction (WD), wind speed (WS), and surface solar radiation downwards (SSRD). All the above parameters were collected from the ERA5 product with spatial and temporal resolutions of 0.25 • × 0.25 • and 1 h, respectively. In addition, vegetation can also release VOCs; hence, monthly normalized difference vegetation index (NDVI) data collected from the Resource and Environment Data Cloud Platform were also used as input data with a spatial resolution of 1 km. Furthermore, to ensure spatial consistency among the datasets, the spatial resolution of all input data was interpolated to 0.1 • × 0.1 • by the bilinear interpolation method.

WRFC-XGB Model
In this study, we combined the WRF-Chem model with the XGBoost algorithm to propose a new two-stage method called WRFC-XGB. In the first stage, to map the fullcoverage spatial distribution of the surface O 3 concentration, we employed the WRF-Chem model to roughly estimate the near-surface MDA8 O 3 concentration (SIMO 3 ). The Carbon-Bond Mechanism version Z (CBMZ) was selected for the chemical mechanism because it is more efficient in O 3 -NO titration than other mechanisms [39,40]. The specific physicochemical parameterization schemes of the WRF-Chem model are shown in Table 1. The model adopts two layers within the nested grid with the coordinate system in a Lambert projection ( Figure S1). The resolution of the first layer (D01) is 27 km, and the simulation can cover most of North China while providing the background fields of large-scale atmospheric transport diffusion and pollutant concentrations. In contrast, the resolution of the second layer (D02) is 9 km, and it mainly covers the BTH region. To improve the accuracy of the WRF-Chem model MDA8 O 3 simulation, the model was run every month with a spin-up time for the first 168 h. Then, in the second stage, the XGBoost machine learning model was combined with the SIMO 3 obtained in the first stage, meteorological parameters, and NDVI to further calibrate and estimate the MDA8 O 3 concentration with full spatiotemporal coverage. The XGBoost model was developed based on gradient enhancement in 2016 [47]. Unlike other machine learning algorithms used in previous studies, each iteration of the XGBoost model adds a tree to fit the residuals between the prediction results of the previous tree, and then the true values are estimated on the basis of the existing tree [48]. In addition, the proposed model incorporates a regularization term, which can effectively prevent overfitting. The model is expressed as the following Equation (1): where O 3 _Pre i,j indicates the estimated MDA8 O 3 concentration on day i at grid j; DOY i,j is the day of year (DOY); TEM i,j , RH i,j , BLH i,j , ET i,j , SP i,j , WD i,j , WS i,j , and SSRD i,j are the values of TEM, RH, BLH, ET, SP, WD, and WS at grid j on day i, respectively; NDVI m,j is the NDVI value in month m at grid j; and SIMO 3i,j denotes the WRF-Chem model output of the MDA8 O 3 concentration on day i at grid j. Similar to the inverse distance weighting (IDW) [49], the DOY was used the weighted time distance, namely, the reciprocal of the distance from each day to the middle of the year, which can better reflect the continuity of the daily variation in O 3 pollution. Before constructing the model, we conducted a correlation analysis for all independent variables (Table 2); the results show that all correlations were statistically significant (p < 0.01). Among them, a positive relationship was captured between the O 3 concentration and TEM, RH, BLH, WD, WS, SSRD, NDVI, and SIMO 3 , while a negative relationship was found with ET and SP. In addition, to avoid the systematic errors caused by multicollinearity, the variance inflation factor (VIF) index was calculated to identify the collinearity among all independent variables used in the WRFC-XGB model. A VIF smaller than 10 indicates the absence of multicollinearity in our model. According to the results, no multicollinearity existed in our model (VIF < 10). Furthermore, we compared the WRFC-XGB model with other machine learning methods (Table S1).
To further evaluate the WRFC-XGB model, the traditional models used previous studies to estimate the O 3 concentration were selected for comparison: the MLR model, generalized additive model (GAM), GWR model, and linear mixed effect (LME) model [22,[50][51][52]. The same training dataset was used for each of these four models for the O 3 estimation in 2018.

Evaluation Method
To test the estimation performance of the WRFC-XGB model, the commonly used 10-CV method was applied herein. For this method, the observation results were first randomly divided into ten parts. Then, nine subsets were selected as the training data, and another subset was used as the verification data. The above process was repeated 10 times to ensure that each dataset was verified once, and the average values of 10 verification results were taken as the final result [53]. Furthermore, according to the division of subsets for 10-CV, we employed sample-based 10-CV and station-based 10-CV to evaluate the model [54]. In addition, four indexes, namely, the regression line, R 2 , root mean square error (RMSE), and mean absolute error (MAE), between the observed and estimated O 3 were also calculated to evaluate the agreement between the simulated results and measurements.

Feature Importance
Before estimating the MDA8 O 3 concentrations, the applicability of independent variables was evaluated first. Figure 2 shows the feature importance (FI) of all input variables of the WRFC-XGB model. For this figure, the FI indicates the contribution of each independent variable to the established model, and the maximum FI is 100%, where a higher FI indicates a greater impact of the input variable on the MDA8 O 3 estimation. In general, the highest contribution was captured in SIMO 3 with an FI of 36%, which can clarify the rationality of our model and the accuracy of the WRF-Chem simulation, followed by SSRD with an FI of 28%. As a general rule, radiation is conducive to photochemical reactions, resulting in the formation of O 3 , and heavy O 3 pollution episodes are usually accompanied by high levels of radiation or severe weather conditions with more precursors [10]. Another important reaction condition is temperature, which accounts for 10% of the MDA8 O 3 estimation. Temperature is one of the main driving factors responsible for generating O 3 . On the one hand, temperature can affect O 3 concentrations by influencing atmospheric turbulence and photochemical reactions [55,56]. On the other hand, temperature can increase the biological emission of VOCs, thus increasing the O 3 concentration in the BTH region [57]. Following these variables, the total contribution of all the other meteorological factors, including BLH, RH, SP, WD, WS, and ET, was~17%, indicating that these variables affect the generation, transmission, and dissipation of O 3 to varying degrees [58][59][60][61]. Notably, the FI of DOY was approximately 6%, indicating that the surface O 3 concentration exhibited a significant temporal variation that could be captured effectively by our WRFC-XGB model.   The station-based 10-CV estimation accuracy is also improved (R 2 = 0.57). This significant improvement in the estimation accuracy is due to the excellent autonomous learning and expression capabilities of the XGBoost machine learning method, which effectively corrects the errors of the WRF-Chem model. Overall, our WRFC-XGB model can effectively and precisely estimate the MDA8 O 3 concentrations across the BTH region not only during periods of light pollution but also during periods of heavy pollution.

Spatial Consistency Verification
The O 3 concentrations show significant spatial heterogeneity across the BTH region, which could also cause uncertainty in the spatial MDA8 O 3 concentration distribution at different scales. Therefore, we calculated the three evaluation indicators for sample-based and station-based 10-CV at each station in the BTH region. Figure 4 plots the spatial distributions of the R 2 , RMSE, and MAE at each surface measurement site in the BTH region in 2018 from the sample-and station-based 10-CV for our WRFC-XGB model. In general, our WRFC-XGB model yields superb spatial O 3 estimates. For the sample-based 10-CV, the highest R 2 is 0.98, which is found in Langfang. All sites exhibit an R 2 greater than 0.85, 97% of all sites have an RMSE less than 20 µg m −3 , and 99% of all sites hold an MAE less than 15 µg m −3 ( Figure S2). In contrast, the lowest R 2 is 0.86 in Tianjin, and the site is near the Bohai Sea. The accuracies at the sites located in Qinhuangdao near the Bohai Sea are also relatively low, similar to the spatial distribution simulated by the WRF-Chem model ( Figure S3). This may be because these stations are affected by the sea breeze, and the WD and WS simulated by the WRF-Chem model have large errors relative to the other meteorological elements. For the station-based 10-CV, the R 2 is greater than 0.85 at 91% of the sites, the RMSE is less than 20 µg m −3 at 79% of the sites, and the MAE is less than 15 µg m −3 at 79% of the sites. Overall, the simulation results of the WRFC-XGB model display good spatial heterogeneity and can capture the characteristics in both high-and low-pollution areas, which can be used to analyze and interpret the spatial differences in the O 3 concentration. In addition, the accuracy of sample-based 10-CV is higher than that of station-based 10-CV at all time scales, which is consistent with another previous study [62], mainly because the O 3 concentration characteristics between similar sites are more similar than those between distant sites, and station-based 10-CV reduce the number of sites for the training data.   , respectively. This is mainly because the LME model considers both fixed effects and random effects. Fixed effects represent the annual average state of influence of each input variable on O 3 , while random effects are used to explain the diurnal variation relationships between O 3 and SIMO 3 and the meteorological factors, as well as the monthly variation relationship between O 3 and NDVI. Nevertheless, although the estimation accuracy of the LME model is similar to that of the WRFC-XGB model, it requires an excessively long computation time and a large number of calculations.

Comparison with Other Traditional Models and Studies
Moreover, without using SIMO 3 as an input variable, the estimation accuracies of these traditional models are reduced for both sample-based and station-based 10-CV ( Figure S4). The results indicate that the SIMO 3 dataset obtained by the WRF-Chem model plays a vital role in improving the O 3 estimation accuracy. In addition, the worst-performing model at this time is not MLR but GWR, mainly because GWR is a spatial analysis algorithm that is based on the local effects of objects distributed in space, and thus, considers the influences of spatial changes in the model input variables on the estimated O 3 concentration [63]. In the WRF-Chem model, although the simulated O 3 concentration is low overall, its spatial distribution is reasonable, and therefore, provides a good basis for estimating the O 3 for GWR. Therefore, after removing the O 3 concentration output from the WRF-Chem model, the estimation accuracy of GWR drops the most below even that of MLR.
To further prove the reliability of the WRFC-XGB model proposed in this study, we compared the results of our model with the conclusions of existing publications (Table 3) . Compared with the application of other models on the same spatial (regional) scale (10-CV R 2 = 0.84) [67], our model achieves a higher estimation accuracy. This is mainly because the result of a high-resolution (9 km × 9 km) model with physicochemical principles (WRF-Chem) is incorporated into the WRFC-XGB model, which reduces the deviation caused by interpolation.

Spatial Distribution of MDA8 O 3 in the BTH Region
The distribution of observation sites in the BTH region is uneven; most of the sites are located in Beijing and Tianjin, and some of them lack measurements. In general, the MDA8 O 3 concentration at all sites in the BTH region in 2018 was 106.67 ± 60.29 µg m −3 . In this study, the WRFC-XGB model was used to estimate the spatial and temporal coverage of the MDA8 O 3 concentration with a spatial resolution of 0.1 • × 0.1 • in the BTH region in 2018. Figure 6b pected, the O 3 concentration in summer is much higher than that in the other seasons due to the long sunshine duration, strong solar radiation, and active photochemical reactions in summer, all of which are conducive to the formation of high concentrations of O 3 [68]. Moreover, high temperatures lead to significant increases in the VOC emissions of O 3 precursors, especially from natural sources, resulting in serious O 3 pollution [69].
According to the results of the WRFC-XGB model, there are obvious spatial differences in the estimated MDA8 O 3 concentration, with higher O 3 concentrations in Beijing, Tianjin, and southern Hebei, which is consistent with the results of Xue et al. [64]. Figure 7f shows the spatial distribution of the frequency of O 3 concentrations exceeding the standard (the percentage of days when the O 3 concentration exceeds the standard divided by 365), yielding values between 5% and 30%, which is similar to the summertime distribution of the MDA8 O 3 concentration (Figure 7c). O 3 is known to have a complicated nonlinear relationship with VOC precursors and NOx (NO + NO 2 = NO X ). In China, power plants, industry, and transportation are the main sources of NO X emissions, accounting for approximately 88% of total NO X emissions [70]. The population distribution of the BTH region varies considerably, where the population is most densely concentrated in Beijing, Tianjin, and the central and southern regions of Hebei [71]. These densely populated areas have large traffic flows and developed industries, and thus, are characterized by the highest emissions in the BTH region. A previous study showed that Beijing, Tianjin, Shijiazhuang, Tangshan, and Handan account for 65.4% and 65.2% of the total NOx and CO emissions of BTH, respectively [72]. In addition to anthropogenic emissions, topography also has a significant impact on O 3 pollution. The BTH region is composed mainly of the Bashang Plateau, Taihang Mountains, Yanshan Mountains, and northern part of the North China Plain. Because the study area is topographically surrounded by mountains, southeasterly winds can blow pollutants toward Beijing and accumulate; hence, Beijing is prone to heavy pollution incidents [73].

Conclusions
Since 2013, many O 3 concentration measurement sites have been built in China. However, the spatial distribution of those sites is uneven, resulting in an inadequate understanding of the surface O 3 loading, especially in the BTH region. Therefore, a new method, i.e., the WRFC-XGB model, which combines the WRF-Chem model with the XGBoost machine learning algorithm, was developed in this study. Combining SIMO 3 data, meteorological data, NDVI data, and DEM data, the MDA8 O 3 concentration across the BTH region in 2018 was estimated based on this model. Compared with the results of previous studies and other traditional methods, our model shows higher accuracy and better spatial prediction capabilities, with sample-based and station-based 10-CV R 2 (RMSE) values of 0.95 (13.50 µg m −3 ) and 0.91 (17.70 µg m −3 ), respectively. Then, we employed the WRFC-XGB model to estimate the full spatiotemporal coverage of MDA8 O 3 in the BTH region. The results show that the MDA8 O 3 concentration is 106.67 ± 60.29 µg m −3 in BTH. The O 3 concentration has obvious spatial differences across BTH, with high O 3 concentrations being found in Beijing, Tianjin, and southern Hebei. Overall, our WRFC-XGB method possesses superior O 3 estimation performance; thus, it can be widely used for estimating O 3 over long-term and wide spatial scales.
Supplementary Materials: The following supporting information can be downloaded at: https: //www.mdpi.com/article/10.3390/atmos13040632/s1, Figure S1: Double-layer nesting distribution in the WRF-Chem model, the color bar represents the altitude; Figure S2: Probability density functions (PDFs) and cumulative density functions (CDFs) of the sample-based (red columns) and stationbased (blue columns) 10-fold cross-validation; Figure S3: The validation results of the WRF-Chem simulations and site observations in the Beijing-Tian-Hebei region in 2018; Figure S4: The scatter density plot of the final estimation accuracy of different traditional models with and without fusion of the WRF-Chem model; Table S1: Comparison of MDA8 O3 concentration estimation accuracy between integrating different machine learning algorithms with the WRF-Chem model and the WRFC-XGB model. Data Availability Statement: The ERA5 data are available at https://www.ecmwf.int/en/forecasts/ datasets/reanalysis-datasets/era5 (accessed on 25 March 2021). The NDVI data are available at https://www.resdc.cn/data.aspx?DATAID=254 (accessed on 2 April 2021).

Conflicts of Interest:
The authors declare no conflict of interest.