Mapping Seasonal High-Resolution PM2.5 Concentrations with Spatiotemporal Bagged-Tree Model across China

High concentrations of fine particulate matter (PM2.5) are well known to reduce environmental quality, visibility, atmospheric radiation, and damage the human respiratory system. Satellite-based aerosol retrievals are widely used to estimate surface PM2.5 levels because satellite remote sensing can break through the spatial limitations caused by sparse observation stations. In this work, a spatiotemporal weighted bagged-tree remote sensing (STBT) model that simultaneously considers the effects of aerosol optical depth, meteorological parameters, and topographic factors was proposed to map PM2.5 concentrations across China that occurred in 2018. The proposed model shows superior performance with the determination coefficient (R2) of 0.84, mean-absolute error (MAE) of 8.77 μg/m3 and root-mean-squared error (RMSE) of 15.14 μg/m3 when compared with the traditional multiple linear regression (R2 = 0.38, MAE = 18.15 μg/m3, RMSE = 29.06 μg/m3) and linear mixed-effect (R2 = 0.52, MAE = 15.43 μg/m3, RMSE = 25.41 μg/m3) models by the 10-fold cross-validation method. The results collectively demonstrate the superiority of the STBT model to other models for PM2.5 concentration monitoring. Thus, this method may provide important data support for atmospheric environmental monitoring and epidemiological research.


Introduction
Particulate matter with ≤2.5 µm diameter is called fine particulate matter (PM 2.5 ) [1]. Numerous epidemiological studies have found that cardiovascular and respiratory diseases are closely related to long-term exposure to PM 2.5 [2]. Emerging evidence has also shown that PM 2.5 is associated with impaired cognitive function [3], Alzheimer's disease, Parkinson's disease, cognitive decline, and dementia [4,5]. High resolution and high coverage PM 2.5 levels promote epidemiologists to analysis the effects of PM 2.5 in human health with more efficient [6]. Unfortunately, the lack of accurate monitoring data on long-term PM 2.5 levels results in scarcity of epidemiological studies concerning the impact of particulate matter on human health [7]. The uneven atmospheric monitoring network of the China Meteorological Administration, established in 2013, cannot capture the regional PM 2.5 concentration. Thus, establishing a suitable PM 2.5 model with wide-area coverage is necessary.
Aerosol optical depth (AOD) is an important parameter in atmospheric research [8]. AOD can be calculated by integrating aerosol extinction coefficient in a vertical column of atmosphere. In recent years, an aerosol robot network (AERONET) has been established in China, which can monitor AOD values relatively accurately and support regionally environmental analysis. However, the sparse distribution of monitoring stations makes it difficult to characterize the actual spatial change of AOD [9]. Satellite remote sensing can realize wide-area aerosol retrieval, thereby providing chances for large-scale regional air quality assessments [10]. Numerous studies have demonstrated a complex correlation between AOD and surface PM 2.5 levels. Surface PM 2.5 concentrations estimated by satellite-based AOD have been widely applied in recent years to monitor air quality [10,11]. The AOD products commonly used to estimate PM 2.5 concentrations include MODIS AOD [12,13], MERRA-2 AOD, and Himawari-8 AOD [14,15]. A new high-resolution (1 km) daily MCD19A2 AOD retrieved by a multiangle implementation of atmospheric correction (MAIAC) algorithm was released on 30 May 2018 [16]. In the MAIAC AOD product development process, researchers improved many key operations, such as snow and cloud screening and selecting aerosol types after analysis based on time series images. At present, MAIAC AOD data have been widely used to reveal the changes of AOD in various regions of the world [17]. However, for estimating fine particle concentrations on fine scale, higher resolution AOD products are indispensable, and the resolution of AOD data often used in previous studies cannot meet the requirements. Based on this situation, the VIIRS sensor was initiated with the launch of the S-NPP satellite in 2011. It is a new generation of satellite sensor used to describe aerosol characteristics [18]. As a scanning radiometer, it has expanded and improved capabilities when compared with the traditional AVHRR and MODIS sensors [18], and can generate aerosol products with a spatial resolution of 750 m [19].
The methods for estimating PM 2.5 concentrations based on satellite remote sensing mainly include the empirical formula [20], chemical transport model [12], and statistical model. The classical statistical models include the linear mixed-effect (LME) model, the generalized additive model, and the geographical-weighted regression model [12,[21][22][23]. A large number of PM 2.5 concentration data from ground monitoring stations are required to develop and verify these models [7]. However, these models are unable to completely capture the complex relations of PM 2.5 with various influencing factors [24] and cannot fully reflect temporal and spatial differences in PM 2.5 distributions. Thus, developing superior model to map PM 2.5 concentrations is still an important task by respectively considering the spatio-temporal heterogeneity of different variables.
In this study, the AOD products with high coverage ratio are generated by integrating MODIS MAIAC AOD [13] and VIIRS IP AOD [25]. Integrating the advantages of the two AOD products by considering similar pixels between the two products can improve the coverage of MAIAC AOD products. Moreover, a spatiotemporal bagged-tree (STBT) model that considers the spatiotemporal heterogeneity between different influencing factors was applied to map PM 2.5 concentrations across China that occurred in 2018. Sample-based and station-based cross-validation (CV) methods are used to evaluate the performance of the STBT model.

Study Area and Datasets
Ground PM 2.5 observation, MAIAC AOD, VIIRS IP AOD, meteorological parameters, topographic factors, and other auxiliary data related to site-measured PM 2.5 levels were applied in this study, as shown in Table 1. The datasets cover the period from 1 January to 31 December 2018.

Study Area
In this study, the nationwide PM 2.5 observation data at 1591 sites were downloaded from the database of the China National Environmental Monitoring Center. As shown in Figure 1, the monitoring sites are distributed unevenly in the study area; specifically, the stations are densely distributed in the east and sparsely distributed in the west. East China is adjacent to the Pacific Ocean, which spans many temperature zones, such as tropical, subtropical, and temperate monsoon. In contrast, the northwest belongs to a non-monsoon region with a temperate continental climate.

MODIS AOD
The MODIS AOD product has high retrieval accuracy and is widely applied for PM 2.5 level monitoring over large areas [26]. The MAIAC AOD product developed by the MAIAC algorithm has a high spatial resolution of 1 km. The confidence level of the MAIAC AOD product used in this study is high. MAIAC AOD from 1 January 2018 to 31 December 2018 were downloaded from NASA (http://ladsweb.modaps.eosdis.nasa.gov/, accessed on 2 February 2019).

VIIRS IP AOD
The visible infrared imaging radiometer (VIIRS), which is extended from the MODIS series, was carried on the S-NPP satellite [27] and used to obtain the AOD product with 750 m resolution. The VIIRS IP AOD has been used in domestic and international studies to retrieve PM 2.5 concentrations over large areas [28]. VIIRS IP AOD from 1 January 2018 to 31 December 2018 were downloaded from NOAA (https://ncc.nesdis.noaa.gov/VIIRS/, accessed on 11 April 2019).

Meteorological Data
The surface PM 2.5 concentrations were closely related to meteorological parameters, especially the boundary layer height and the wind [29]. In this study, the meteorological data from the reanalysis dataset of the European Meteorological Centre are used in this study. These data were downloaded from ERA5 (https://cds.climate.copernicus. eu/cdsapp#!/home, accessed on 22 June 2019). The meteorological parameters used in this study, including relative humidity (RH), temperature (TEMP), boundary layer height (BLH), and wind speed (WS), as shown in Table 1.

Geographic and Topographic Data
MODIS retrieved Normalized difference vegetation index (NDVI) with a time resolution of 16 days was used in this study, which can represent different land cover types. NDVI data at a spatial resolution of 0.05 • were downloaded from the NASA Earth Observatory (http://neo.sci.gsfc.nasa.gov/, accessed on 17 June 2019). In addition, digital elevation model (DEM) data from the U.S. Geological Survey (https://www.usgs.gov/, accessed on 18 April 2018) with a spatial resolution 30 m was used in this study to characterize the topographic features of the study area.

Multi-Source AOD Data Fusion
Given cloud effect and MODIS aerosol retrieval method, a large number of AOD data are missing in the study area. According to the mechanism of retrieving aerosol loadings from satellite-based sensors, some researchers considered the relationship between AOD loadings and NDVI, comprehensively weighing the spatial proximity, AOD and NDVI similarity, to recover AOD [30,31]. The VIIRS IP AOD at 550 nm can provide a reliable dataset with a high resolution (750 m) [32]. In this study, filling AOD vacancy based on the adaptive threshold method was adopted to enhance the spatiotemporal continuity of the data. The similar pixels from VIIRS IP AOD found by local range were used to recover missing pixels in the MAIAC AOD product. Here, adaptive determination was used to search for similar pixels by considering local differences between MAIAC AOD and VIIRS IP AOD values and spatial distance. Similar pixels should satisfy the following inequality, where A j and A i refer to a similar AOD pixel and target AOD pixel in the VIIRS IP AOD dataset, respectively, and A_thi is the adaptive threshold calculated by the AOD local standard deviation formula: where MAI ACdiv and V I IRSdiv refer to two AOD datasets from MAIAC and VIIRS in a given window respectively, std represents the calculated standard deviation, and length and width represent the local window size. Similar pixels are endowed with weights based on AOD differences and spatial relations: where x and y refer to longitude and latitude, respectively, and β is a small value that prevents Dij from equaling zero, which is empirically determined. Normalized processing is then carried out: Finally, the missing value in MAIAC AOD is filled based on the weighting sum of these similar pixels in MAIAC AOD that are determined by the similarity relation between MAIAC and VIIRS AOD:

Bagged-Tree Model
Decision tree models typically give good classification decisions [33]. The model is built in the light of the bagged-tree combination classification method [34]. The combined classifier used in this work is composed of multiple individual classifiers consisting of decision trees. The training data of each tree are extracted by using bootstrap. Each individual classifier has its own classification results. The classified result from the combined classifier is determined by the combination of the results of individual classifiers to avoid overfitting.

Spatiotemporal Weighted Function
The spatial cross-correlation and temporal autocorrelation of the data were explored by considering the spatio-temporal heterogeneity of PM 2.5 concentrations in this study. The spatial cross-correlation can be expressed by the spatial weight function: where d s refers to the space distance, PM w refers to the PM 2.5 of station w adjacent to the target station, and W refers to the number of stations within the selected scope. The temporal autocorrelation is expressed by the temporal weight function: where PM t refers to the PM 2.5 measured in the day before and after the current day, 1 n ∑ n 1 PM n is the averaged PM 2.5 value within a week, n is the number of valid days of the week, 1 m ∑ m 1 PM M represents the averaged PM 2.5 level within a month, and m is the number of valid days of the month. The coefficient is obtained by linear analysis, and θ refers to the linear analysis residual.

MLR Model
The simple multiple linear regression (MLR) model can be expressed as: where b indicates the intercept, α 1 − α 7 refer to regression coefficients, and ε represents the error term.

LME Model
The ordinary MLR model can be extended as LME model by considering random effect in a specific time. LME model can explain the time-related relationship between surface PM 2.5 levels and multiple predictors in a specific region, and can be expressed as: where n and m refers to the grid and time index, respectively; β 0 represent the fixed intercept; β 1 − β 7 are the fixed slopes for these corresponding predictors; b day 1,n,m and b day 0,n,m represent the time-specific random slope and intercept for intercept and AOD, respectively; ∑ indicates the variance-covariance matrix of the random effects; ε n,m represent the error term.

Model Evaluation
The performance of the proposed STBT model was validated via sample-based and station-based 10-fold cross-validation (CV) methods to calculate the determination coefficient (R 2 ), MAE, and RMSE. CV has an ability to reveal whether or not a model is overfit. Finally, the results of the proposed model were compared with those of traditional estimation models, such as the MLR and LME models, to determine its accuracy and generalizability.

Assessment of Fused AOD
The AERONET measurements are used to evaluate the fused AOD data. However, AERONET network observes AOD value in multiple wavelengths, almost of which are different from MAIAC AOD at 550 nm. Therefore, AERONET aerosol retrievals at 550 nm can be interpolated from the value at other wavelengths by the second-order polynomial method. By fitting the linear relationship between the fused and AERONET AOD, the error is validated using correlation coefficient (R), RMSE, and bias. The analysis results are shown in Figure 2. percentage in the study area is calculated by the ratio between the number of valid and total pixels. Figure 3 shows the daily coverage of original and fused AOD products in the study areas (70-140 • E, 10-55 • N). The average daily coverage of the original AOD is only 21.20%. In contrast, the coverage of the fused AOD reaches 37.24%.

Statistical Analysis of the Datasets
The spatial and temporal resolution of different factors is unified to 750 m and 1 day by the linear interpolation method, respectively. After screening for abnormal data, a total of 215,893 matched data are obtained. The data are statistically analyzed, and their maximum, minimum, mean, and standard deviation are calculated. The statistical results are shown in Figure 5.

Model Evaluation and Comparison
The bagged-tree model is applied in this study; data numbering 215,893 are matched through 1591 stations, and each datum contains 13 attributes, including time, longitude, latitude, temperature, relative humidity, etc. Parameter debugging is also very important in the process of model training. Combined with the data volume and feature number, the minimum leaf size is set to 8 and the number of Learning Cycles is set to 30 in the bagged-tree model.
The performance of the STBT model is evaluated by using R 2 , RMSE, and MAE, as shown in Figure 6. Both station-based and sample-based 10-CV method is adopted to determine whether overfitting occurs. The comparisons of the proposed STBT model and two traditional models (MLR and LME) are also Figure 6. Two kinds of 10-CV methods were adopted to verify the performance of these models. Firstly, 90% samples from the 215,893 data were randomly selected to train the STBT model, and the remaining 10% was regarded as validation samples. Secondly, considering the wide distribution of measured sites, 90% of 1591 sites are randomly selected to train the model, and the remaining 10% sites are used as verification, which can adequately reveal the prediction ability of the model in different spatial domains.  Figure 6b,e. It is gratifying that the STBT model performs well with site-based (sample-based) 10-CV: R 2 value of 0.81 (0.84), the corresponding the MAE is 8.93 (8.77) µg/m 3 , and RMSE is 16.37 (15.14) µg/m 3 , as shown in Figure 6c,f. The similar verification results between site-based and sample-based 10-CV method indicate that the proposed STBT model has good prediction ability over the regions without measurements and could effectively avoid overfitting by considering spatial and temporal heterogeneity. Compared with the two traditional MLR and LME models, the R 2 of the proposed STBT model is higher by 121.05% and 61.54%, respectively, its RMSE is lower by 41.91% and 40.42%, respectively, and its MAE is lower by 51.69% and 43.17%, respectively. Thus, compared with other models, the STBT model shows greatly improved performance for mapping regional PM 2.5 concentrations.  Figure 7. The bias between estimations and measurements from during the study period (from 1 January to 31 December 2018) is plotted in the same figure. The annual average bias between the estimated and measured PM 2.5 concentration is 7.16 µg/m 3 . The estimated results match the measured values well, especially in summer. Figure 8 shows the seasonal average PM 2.5 levels estimated by the STBT model across China. These subfigures reveal significant seasonal changes in the distribution of surface PM 2.5 levels. Among the four seasons, winter demonstrates the greatest levels of pollution, with an average PM 2.5 value of 44 µg/m 3 . By contrast, summer shows the lowest levels of pollution, with an average PM 2.5 value of 31 µg/m 3 . This significant seasonal change is strongly correlated with anthropogenic emissions [35][36][37][38]. A mass of particulate matter produced by burning fossil fuels and biomass promote the high polluted levels in winter [39,40]. Adverse weather conditions during cold periods could promote the accumulation of air pollutants over a certain region [41]. The low pollution in summer may be related to the less fossil fuel and biomass burning in this season. Moreover, clean marine air mass, intense atmospheric convection, and sufficient wet deposition of aerosols can significantly reduce pollution levels during the Asian summer monsoon [42]. Ground PM 2.5 concentrations also show distinct spatial inconsistency. Low seasonal PM 2.5 levels exist in the eastern coastal area. By contrast, the seasonal average PM 2.5 levels over the Beijing-Tianjin-Hebei and Xinjiang regions are high, likely because of regional industrial development or adverse terrain accumulate of air pollutants. Moreover, the performance of the STBT model in the western region may be influenced by the sparse distribution of measured stations in these regions. Furthermore, the lifetime of PM 2.5 in the atmosphere can be up to 6 days, and during those days the particles can travel up to 3000 km. The wind erosion effect leads to the very high concentration of PM 2.5 in northwestern China [43], which transports sand dust from Taklamakan Desert to adjacent areas.

Regional PM 2.5 Concentrations
Four typical polluted regions are selected, including the Yangtze River Delta region, the North China Plain, the Sichuan Basin and the Pearl River Delta region. As shown in Figure 9, the North China Plain remains the most polluted area, due to many anthropogenic emission sources, adverse topographic conditions, and other factors [44]. The Pearl River Delta has the lowest polluted levels among the four regions, because the monsoon on the east coast disperses fine particles. The concentration of fine particles in Sichuan Basin is also relatively high, which is mainly due to the closed topography, which results in pollutant accumulation [45].

Conclusions
The spatiotemporal distribution of surface PM 2.5 levels across China are mapped by a STBT model using fused AOD data collected in 2018 in this study. The main conclusions follow: (1) Compared with the average coverage of the original MAIAC AOD (21.20%), the coverage of the fused AOD reaches 37.24% by using an adaptive threshold algorithm of auxiliary pixels. The stability and performance of the STBT model is improved by considering the spatiotemporal heterogeneity of different modeling factors. In future work, our research team aims to improve models with better performance for regional PM 2.5 mapping.