Ambient PM 2.5 Estimates and Variations during COVID-19 Pandemic in the Yangtze River Delta Using Machine Learning and Big Data

The lockdown of cities in the Yangtze River Delta (YRD) during COVID-19 provided many natural and typical test sites for estimating the potential of air pollution control and reduction. To evaluate the reduction of PM 2.5 concentration in the YRD region by the epidemic lockdown policy, this study employs big data, including PM 2.5 observations and 29 independent variables regarding Aerosol Optical Depth (AOD), climate, terrain, population, road density, and Gaode map point of interest (POI) data, to build regression models and retrieve spatially continuous distributions of PM 2.5 during COVID-19. The simulation accuracies of multiple machine learning regression models, i.e., random forest (RF), support vector regression (SVR), and artificial neural network (ANN), were compared. The results showed that the RF model outperformed the SVR and ANN models in the inversion of PM 2.5 in the YRD region: the model-fitting and cross-validation coefficients of determination (R 2 ) reached 0.917 and 0.691, the mean absolute error (MAE) values were 1.026 µg m −3 and 2.353 µg m −3 , and the root mean square error (RMSE) values were 1.413 µg m −3 and 3.144 µg m −3 , respectively. PM 2.5 concentrations during COVID-19 in 2020 decreased by 3.61 µg m −3 compared with the same period of 2019 in the YRD region. The results of this study provide a cost-effective method of air pollution exposure assessment and help provide insight into the atmospheric changes under strong government control strategies.


Introduction
Coronavirus disease 2019 (COVID-19), an infectious disease, was identified in the city of Wuhan, China, and spread to nearly every country around the globe [1][2][3]. By 20 January 2021, COVID-19 was known to have caused more than two million deaths worldwide, with a global mortality rate of 3.4%. In response to the outbreak of COVID-19, a nation-wide lockdown of cities was imposed by the Chinese government after January 2020, confining its 1.3 billion citizens to their homes [4][5][6]. Almost all production activities, such as transportation, construction, and industry, were completely restricted [7][8][9]. Such unprecedented stagnation of industrial production and residents' consumption effectively reduced air pollution emissions, providing natural and typical test sites for estimating the impact of controlling human activities on air pollution control and reduction [10][11][12][13].
At present, studies on PM 2.5 pollution during COVID-19 mainly use PM 2.5 concentrations, which are generally sourced from ground observations and satellite remote sensing inversions [14][15][16][17][18][19]. Ground observations provided by meteorology stations are at a diurnal scale with high accuracy. However, these stations are usually sparsely distributed, limiting the knowledge of spatially continuous distributions of PM 2.5 concentrations. Comparatively, satellite remote sensing inversions can provide a spatially continuous distribution of PM 2.5 , which can fill the data gap in areas where there is no monitoring station [20][21][22]. As a result, this study uses satellite remote sensing inversions to obtain high-precision PM 2.5 concentration data to assess PM 2.5 changes during COVID-19.
To build the inversion model, variables regarding aerosol optical depth (AOD), climate, and LUCC (land use and land cover) were usually selected in previous studies [23][24][25][26]. Classic models and machine learning methods have been applied to fit the linear and non-linear relations between environmental variables and PM 2.5 concentrations in previous research work [27][28][29][30]. It is suggested that classic models are usually sensitive to collinearity between independent variables and fail to handle a very large sample with missing data or outliers [31]. Although the variance inflation factor (VIF) test and related statistics can avoid the influence of collinearity by deleting collinear variables [32], such a screening step can lose some important variables by mistake [33]. Linear models, e.g., multiple linear regression, fail to detect non-linear relations [22,34,35], whereas the formation, diffusion, migration, and transformation of PM 2.5 are complex and perhaps non-linearly related to environmental factors. Machine learning methods can handle a very large sample with fast computing speed [36]. They have proved to be robust and insensitive to missing data and outliers.
In recent years, machine learning methods, such as random forest (RF) [23,30,37,38], support vector regression (SVR) [39], and artificial neural network (ANN) [40] have been successfully used in estimating PM 2.5 concentrations. Consequently, machine learning methods can be used to estimate the PM 2.5 concentration during COVID-19.
In this study, we hypothesized that the government's "lockdown policy" may have reduced air pollution in the urban agglomeration. To address the influence of the "lockdown policy" on PM 2.5 concentrations, spatial PM 2.5 concentrations during COVID-19 (2020-I) and the same period in 2019 (2019-I) were compared. Firstly, 29 independent variables regarding AOD, climate, terrain, population, road density, and Gaode map POI data were collected to build the RF, SVR, and ANN PM 2.5 retrieval models. Secondly, the prediction accuracies of the three models were evaluated by the coefficient of determination (R 2 ), MAE, and RMSE, for both model fitting and cross-validation (CV). The importance of variables was assessed to examine the impact of each predictor on PM 2.5 concentration. Finally, the optimal model was determined and applied in PM 2.5 retrieval, to further estimate the influence of the "lockdown policy" on PM 2.5 concentrations. Investigating PM 2.5 changes before and during COVID-19 not only quantitatively evaluates the impact of the epidemic on economic activities and emission reductions, but also helps in understanding the potential for pollution control in the Yangtze River Delta (YRD). This study aims to obtain high-resolution, spatially continuous PM 2.5 data and analyze the potential of PM 2.5 pollutant emission reduction during COVID-19. The findings provide a reference for future air pollution control in the YRD.

Data and Methods
The Yangtze River Delta is located in the north-central subtropical zone and at the junction of eastern coastal China and the Yangtze River, including Shanghai, Zhejiang Province, and Jiangsu Province, as shown in Figure 1. The study region is the Yangtze River Delta's core area, including 16 cities, such as Shanghai, Nanjing, and Hangzhou. The Yangtze River Delta accounts for 2.2% of the national land and 11.7% of the national population, contributing about 21% of the country's gross domestic product (GDP). The urbanization level has reached 64.7%, and the urban space layout is still expanding. Therefore, the Yangtze River Delta is China's leading economic development area. However, the rapid development of industrialization and urbanization has caused unprecedented pressure on the ecological environment leading to frequent pollution incidents.

Data
Independent variables covered both natural and socio-economic aspects, and the observations were divided into a training dataset (80% of the observations) and a testing dataset (20% of the observations). Table 1 lists seven types of data that were used to fit the PM 2.5 concentration inversion model and evaluate the accuracy. The retrieval and pre-processing of these datasets in the current study are described below. The workflow for processing data, fitting the model to produce the PM 2.5 map, and assessing accuracy is shown in the flowchart in Figure 2.

PM 2.5 Data
PM 2.5 data were derived from hourly observations in the real-time publishing platform of urban air quality at China Environmental Monitoring Station (http://www.cnemc.cn/sssj/ accessed on 1 December 2020). A total of 214 monitoring stations were used, covering 12 January to 20 February 2019 and 1 January to 9 February 2020. In accordance with the requirements for the validity of air pollutant concentration data in GB3095-2012, quality control of the PM 2.5 data was performed [15]. Firstly, hourly PM 2.5 concentrations ≤ 0 and missing values were excluded. Secondly, if the measured data were missing for more than 4 h in a day, all the data for that day were invalidated and excluded from the calculation of the average daily PM 2.5 . Finally, a few anomalies with hourly PM 2.5 concentrations > 900 µg m −3 were also eliminated. A monthly average of PM 2.5 was obtained as the arithmetic mean of the daily average concentrations.
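The screening rules for the hourly PM 2.5 records can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the input layout (one station's records as `(date, hour, value)` tuples) are hypothetical, and it assumes that values removed for being ≤ 0 or > 900 µg m −3 count toward the "more than 4 h missing" rule, which the text does not state explicitly.

```python
from collections import defaultdict
from statistics import mean

MAX_VALID = 900.0       # upper bound from the text (µg m^-3)
MAX_MISSING_HOURS = 4   # a day with more than 4 missing hours is dropped


def daily_means(hourly):
    """Daily means for one station from (date, hour, value-or-None) records.

    Assumption: non-positive, absent, and > 900 values are all treated as
    missing hours when the 4-hour rule is applied.
    """
    valid_by_day = defaultdict(list)
    for day, _hour, value in hourly:
        values = valid_by_day[day]  # registers the day even if no hour is valid
        if value is not None and 0 < value <= MAX_VALID:
            values.append(value)
    result = {}
    for day, values in valid_by_day.items():
        missing = 24 - len(values)  # hours absent from the input also count
        if missing <= MAX_MISSING_HOURS:
            result[day] = mean(values)
    return result


def period_mean(hourly):
    """Arithmetic mean of the valid daily means over the study period."""
    days = daily_means(hourly)
    return mean(list(days.values())) if days else None
```

Applying `period_mean` per station gives one PM 2.5 value per station and period, which is the quantity the regression models are fitted to.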

Aerosol Optical Depth (AOD) Data
In this study, the MODIS Collection 6 MAIAC AOD products (MCD19A2) at a spatial resolution of 1 km from 12 January to 20 February 2019, and 1 January to 9 February 2020, covering the YRD region, were collected. Only the MAIAC AOD retrievals at 550 nm that passed the recommended quality assurance (QA) were used; these yield reliable data quality in China, especially in bright urban areas [41][42][43]. Finally, the Terra and Aqua MAIAC AOD data were averaged and integrated to expand the spatial coverage of PM 2.5 estimates.
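One plausible reading of "averaged and integrated" is a cell-by-cell mean that falls back to whichever sensor has a valid retrieval. The sketch below illustrates that reading with NumPy. The function name is hypothetical, and encoding missing retrievals as NaN is an assumption.

```python
import numpy as np


def combine_aod(terra, aqua):
    """Average two AOD grids cell by cell, ignoring missing (NaN) retrievals."""
    stack = np.stack([terra, aqua]).astype(float)
    valid = ~np.isnan(stack)
    count = valid.sum(axis=0)                        # 0, 1, or 2 valid sensors
    total = np.where(valid, stack, 0.0).sum(axis=0)  # sum of valid retrievals
    # Cells with no valid retrieval from either sensor stay NaN.
    return np.where(count > 0, total / np.maximum(count, 1), np.nan)
```

A cell observed by only one satellite keeps that satellite's value, which is how combining the two overpasses widens coverage.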

POIs Data
A POI is a place or object marked on a map, with a name, category, coordinates, and other information, and can reflect social and economic activities. POIs were retrieved from the Gaode Map (https://www.amap.com/ accessed on 24 September 2020), which is the largest desktop and mobile map service provider in China. We obtained 8,806,799 POI records from 2019 to 2020 using Gaode Map's application programming interface. Gaode Map classifies these POIs into 23 categories based on their Chinese semantic phrases. All records were unified to the Gauss-Krüger coordinate system. Table 2 presents 20 categories and the number of POI records for each category, excluding the 3 categories of Place Name and Address, Incidents and Event, and Indoor facilities.

Meteorological Data
Meteorological data were gathered from the Chinese meteorological data sharing service network (http://data.cma.cn/ accessed on 24 September 2020), including daily average wind speed, atmospheric pressure, temperature, relative humidity, and 24 h cumulative precipitation. The data were pre-processed and interpolated to obtain continuous surfaces of the meteorological elements over the area.

Elevation Data
Elevation data were downloaded from China's geospatial data cloud (http://gdex.cr.usgs.gov/gdex/ accessed on 24 September 2020), with a spatial resolution of 30 m, and the altitude at each monitoring station location was extracted.

Boundary and Road Network Data
The boundary maps at city levels were obtained from the Open Street Map (https://www.openstreetmap.org/ accessed on 24 September 2020). Such datasets include China's national highways, city roads, provincial, county, and township-level roads. The road density is calculated and generated by the kernel density method of ArcGIS software.

Random Forest Model
The random forest is a machine learning algorithm built as an ensemble of multiple classification and regression trees (CART) [22,44,45]. Compared with CART, it has three distinct characteristics. First, a random forest generates many trees, each of which is grown from a bootstrap sample of the original dataset, while in CART, all raw data are used to create only one tree. Second, a random forest splits each tree node on the optimal variable within a random subset of predictors, while CART selects the optimal variable among all predictors to split the node. Finally, the trees in a random forest are fully grown without pruning. These properties make the random forest model less prone to overfitting [46]. Three training parameters need to be defined in the random forest algorithm: n_estimators, the number of trees in the forest, each built on a bootstrap sample of the observations; max_features, the number of features to be considered when looking for the best split (the default setting is "auto", in which case max_features = n_features); and min_samples_leaf, the minimum number of samples required at a leaf node (the default value is one). The two main parameters for predicting PM 2.5 (i.e., n_estimators and max_features) were determined and optimized based on the out-of-bag (OOB) error rate of calibration.
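The parameter names above match scikit-learn's `RandomForestRegressor`, so OOB-based tuning can be sketched with that library. This is only an illustration under stated assumptions: the candidate grids are hypothetical, the function name is invented, and the score used is scikit-learn's `oob_score_` (the OOB R 2 ) rather than the OOB error rate named in the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def tune_rf_by_oob(X, y, n_estimators_grid=(100, 300, 500),
                   max_features_grid=(0.3, 0.6, 1.0), seed=0):
    """Return the setting with the best out-of-bag R^2 (grids are illustrative)."""
    best = {"oob_r2": -np.inf}
    for n_trees in n_estimators_grid:
        for frac in max_features_grid:
            rf = RandomForestRegressor(
                n_estimators=n_trees,
                max_features=frac,    # fraction of predictors tried at each split
                min_samples_leaf=1,   # default value named in the text
                bootstrap=True,       # each tree is grown on a bootstrap sample
                oob_score=True,       # score each sample with trees that skipped it
                random_state=seed,
            )
            rf.fit(X, y)
            if rf.oob_score_ > best["oob_r2"]:
                best = {"oob_r2": rf.oob_score_, "n_estimators": n_trees,
                        "max_features": frac, "model": rf}
    return best
```

Because every tree leaves out part of its bootstrap sample, the OOB score gives a validation estimate without setting aside a separate tuning set.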

Support Vector Regression Model
Support vector regression (SVR), proposed by Cortes and Vapnik in 1995 [47,48], constructs a hyperplane or a set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression, or other tasks. The performance of SVR is determined by three parameters, i.e., the kernel function, the penalty factor (C), and the kernel width parameter (Gamma). Grid search and cross-validation were applied to determine the optimal values of the three parameters. In this study, a radial basis function (RBF) kernel with C = 8 and Gamma = 11 was optimal according to the validation results.
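A grid search over the three SVR parameters could look like the sketch below, written with scikit-learn. The candidate values, the number of folds, and the feature standardization step are assumptions made for illustration; the text reports only the selected setting (RBF, C = 8, Gamma = 11).

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR


def tune_svr(X, y, cv=5):
    """Grid-search kernel, C, and gamma with k-fold cross-validation."""
    # Standardizing predictors is an assumption; SVR is sensitive to scale.
    pipe = make_pipeline(StandardScaler(), SVR())
    grid = {
        "svr__kernel": ["rbf", "linear"],
        "svr__C": list(2.0 ** np.arange(0, 7, 3)),  # 1, 8, 64 (illustrative)
        "svr__gamma": [0.1, 1.0, 11.0],             # illustrative candidates
    }
    search = GridSearchCV(pipe, grid, cv=cv,
                          scoring="neg_root_mean_squared_error")
    search.fit(X, y)
    return search
```

After fitting, `search.best_params_` holds the selected combination and `search.best_estimator_` is refitted on all training data.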

Back Propagation Artificial Neural Network
The back propagation artificial neural network (ANN), proposed by Rumelhart and McClelland in 1986 [49], consists of an interconnected group of artificial neurons and processes information using a connectionist approach to computation. ANN is a non-linear statistical data modeling tool that can fit complex relationships between inputs and outputs, or find patterns in data. The structure of the ANN model includes three layers: an input layer (29 neurons), a hidden layer (25 neurons), and an output layer (1 neuron). The activation function was ReLU, and the solver was Sigmoid.

Cross Validated Model Accuracy
The model performance is evaluated by the coefficient of determination (R 2 ), mean absolute error (MAE), and root mean square error (RMSE). A larger R 2 and smaller MAE and RMSE indicate higher prediction accuracy. In the calculation formulas, M is the measured value, P is the predicted value, M̄ is the mean measured value, and n is the number of samples in the validation set.
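With M_i and P_i denoting the measured and predicted values of sample i, the three metrics can be written in their standard forms as below. The form of R 2 shown here (one minus the ratio of the residual to the total sum of squares) is an assumption; the squared correlation coefficient is another common definition.

```latex
\begin{align}
R^2 &= 1 - \frac{\sum_{i=1}^{n} (M_i - P_i)^2}{\sum_{i=1}^{n} (M_i - \bar{M})^2} \\
\mathrm{MAE} &= \frac{1}{n} \sum_{i=1}^{n} \lvert M_i - P_i \rvert \\
\mathrm{RMSE} &= \sqrt{\frac{1}{n} \sum_{i=1}^{n} (M_i - P_i)^2}
\end{align}
```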

Model Performance
The coefficient of determination (R 2 ), MAE, and RMSE were applied to estimate the accuracy of modeling; the results are given in Table 3.
The RF model provides an importance assessment for each predictor variable. The importance of each variable can be assessed via the percent increase in prediction error (MSE) that results from randomly permuting the values of an explanatory variable for the out-of-bag observations [22]. The importance assessment can make variable selection more efficient.
As shown in Figure 3, during 2019-I, the five most important factors, in descending order, were temperature, precipitation, DEM, wind speed, and tourist attraction. In contrast, during 2020-I, the order of importance was temperature, road furniture, atmospheric pressure, relative humidity, and precipitation. It is suggested that the RF models utilized a larger and more diverse selection of predictors for PM 2.5 . Overparameterization can be avoided, as the RF can detect non-linear relations between variables and PM 2.5 concentration, and the variable selection was included as a part of the cross-validation process [22,38].
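The permutation idea behind the importance ranking can be sketched as follows. Note the simplification: this version permutes each predictor on a supplied evaluation set for any fitted model with a `predict` method, whereas the RF measure described in the text does so per tree on the out-of-bag observations. The function name is hypothetical.

```python
import numpy as np


def percent_increase_mse(model, X, y, seed=0):
    """Percent increase in MSE after shuffling each predictor column in turn."""
    rng = np.random.default_rng(seed)
    base_mse = np.mean((y - model.predict(X)) ** 2)
    scores = []
    for j in range(X.shape[1]):
        X_perm = X.copy()
        # Shuffling column j breaks its link with y but keeps its distribution.
        X_perm[:, j] = rng.permutation(X_perm[:, j])
        perm_mse = np.mean((y - model.predict(X_perm)) ** 2)
        scores.append(100.0 * (perm_mse - base_mse) / base_mse)
    return np.array(scores)
```

A predictor the model does not rely on scores near zero, while an influential one produces a large increase in error when shuffled.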

Cross Validated Model Accuracy
Cross-validation on the validation data set was applied to check the models for overfitting, using the cross-validated R 2 , MAE, and RMSE for each period and model type. The results indicate that the RF estimation should be a good approximation to the true state of PM 2.5 concentrations in the Yangtze River Delta.
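A sample-based cross-validation of this kind can be sketched as below. The number of folds is not stated in the text, so `k=10` is an assumption, as are the function name and the `fit_predict` callback, which stands in for any of the three models. R 2 is computed as one minus the ratio of residual to total sum of squares.

```python
import numpy as np


def kfold_scores(fit_predict, X, y, k=10, seed=0):
    """Pool out-of-fold predictions, then compute R^2, MAE, and RMSE once.

    fit_predict(X_train, y_train, X_test) must return predictions for X_test.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    pred = np.empty(len(y), dtype=float)
    for test_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[test_idx] = False  # hold this fold out of training
        pred[test_idx] = fit_predict(X[train_mask], y[train_mask], X[test_idx])
    resid = y - pred
    return {
        "r2": 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2),
        "mae": np.mean(np.abs(resid)),
        "rmse": np.sqrt(np.mean(resid ** 2)),
    }
```

A large gap between the model-fitting R 2 and the value returned here is the signal of overfitting that this step is meant to detect.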
The regional mean values of measured PM 2.5 and predicted PM 2.5 (RF) for the 16 cities in the Yangtze River Delta are shown in Figure 5. During 2019-I (the same period in 2019), differences between measured PM 2.5 and predicted PM 2.5 ranged from 0.089 µg m −3 to −2.867 µg m −3 ; comparatively, during 2020-I (during COVID-19), differences between measured PM 2.5 and predicted PM 2.5 ranged from 0.121 µg m −3 to 1.669 µg m −3 . The RF model performed well in most cities of the Yangtze River Delta, with satisfactory goodness of fit. The cities with relatively large estimation errors are Zhoushan, Nantong, Taizhou, and Huzhou. These are coastal cities with low concentrations of PM 2.5 , where the weather conditions are complex and changeable, which gives rise to larger estimation errors. In conclusion, a comprehensive comparison between models shows that the R 2 values of the RF model are higher than those of SVR and ANN, while the MAE and RMSE values of RF are lower than those of SVR and ANN. The results suggest that the RF model is optimal in predicting PM 2.5 concentrations. Therefore, the RF model was selected for estimation of PM 2.5 .

PM 2.5 Estimates during COVID-19
In this study, the RF model was developed to estimate PM 2.5 in the Yangtze River Delta with MODIS AOD data, meteorological, DEM, road density, and POI data. The RF-based PM 2.5 predictions were mapped in the ArcGIS platform (Figure 6).
Overall, the spatial distribution of PM 2.5 concentrations in the Yangtze River Delta showed a pattern of high values in the north and low values in the south; PM 2.5 concentrations significantly decreased under the "lockdown policy" during COVID-19 in 2020. Through the model, we extended the PM 2.5 station data across space, effectively compensating for the sparse spatial coverage of the PM 2.5 monitoring stations and obtaining data covering the entire region during COVID-19.

PM 2.5 Variations during COVID-19
An overall declining trend of PM 2.5 in the Yangtze River Delta can be found during COVID-19 in 2020, with only a few areas in Taizhou showing upward trends (Figure 7). The regional mean value of PM 2.5 in the Yangtze River Delta declined by 3.61 µg m −3 during COVID-19, with the largest decline found in Yangzhou (5.70 µg m −3 ) and the smallest in Taizhou (2.26 µg m −3 ). In general, larger declines of PM 2.5 were mainly found in the northern part of the Yangtze River Delta, which is consistent with the spatial clustering of PM 2.5 in the northern part of the region. The area with high PM 2.5 concentrations is usually the area with a high concentration of human activities. The northern part of the Yangtze River Delta has a flat terrain with a densely distributed population, industries, and farming activities. In contrast, the southern part is a mostly hilly and mountainous area with low population density and low air pollutant emission. According to previous studies, PM 2.5 pollution in the Yangtze River Delta mainly comes from industry and traffic. Therefore, the obvious reductions of PM 2.5 found in this study were directly related to the strict lockdown actions. Most fine particles from industry and traffic were primary emissions from industrial sources. It was found that traffic emissions decreased, with an increase in secondary particles in PM 2.5 , during the COVID-19 lockdown period.
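City-level changes such as those quoted above can be derived from the two predicted PM 2.5 grids by differencing them and averaging within each city. The sketch below is illustrative only: the function name, the integer city-label grid, and the convention that negative labels mark cells outside the study area are all assumptions.

```python
import numpy as np


def city_mean_change(pm_2019, pm_2020, city_id):
    """Mean PM2.5 change (2020 minus 2019) per city; negative means a decline."""
    diff = pm_2020 - pm_2019
    changes = {}
    for cid in np.unique(city_id):
        if cid < 0:  # assumption: negative ids flag cells outside the region
            continue
        cells = diff[city_id == cid]
        cells = cells[~np.isnan(cells)]  # skip cells without a prediction
        if cells.size:
            changes[int(cid)] = float(cells.mean())
    return changes
```

Averaging the difference grid over all valid cells in the region, rather than per city, gives the regional mean change in the same way.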

Discussion
Air pollution brings about many challenges for the sustainable development of cities. The sparse distribution of monitoring stations limits our understanding of the spatial-temporal dynamics of air conditions. To address this gap, many researchers have tried to obtain the spatially continuous distribution of PM 2.5 based on relations between PM 2.5 and AOD. The AOD products applied in earlier studies have coarse spatial resolutions of about 10 km, which are difficult to apply in air pollution estimation studies at the urban scale. The recently developed AOD product based on MODIS data has a high resolution of 1 km, which significantly improves the spatial resolution of regional PM 2.5 mapping and is gradually being applied to estimation models of urban PM 2.5 .
In this study, the R 2 values of the RF model during COVID-19 in 2020 and the same period of 2019 are 0.93 and 0.917, respectively, and the cross-validation R 2 values are 0.77 and 0.69, respectively. The RF model outperformed the SVR and ANN models in the Yangtze River Delta. This suggests that the RF model explained a large fraction of the measured PM 2.5 spatial variability based on the monitoring data and AOD in the Yangtze River Delta. For comparability with our study, only studies on AOD-PM 2.5 estimation over the Yangtze River Delta were selected (Table 4). The RF model can capture 69-77% of the variation in the sample-based CV and can outperform most previous models used for generating 3 km resolution PM 2.5 maps of the Yangtze River Delta, e.g., the spatio-temporal model (STM) (CV R 2 = 0.63; Yang et al., 2017) [25] and the linear mixed-effects (LME) model (CV R 2 = 0.725) [50]. The accuracy of the current RF model is close to the results of PM 2.5 mapping models with 6 km and 10 km resolutions, including the geographically weighted regression (GWR) model (Jiang et al., 2017) [51] and the three-stage hierarchical spatial and temporal statistical model (T-SSM) (She et al., 2020) [52]. The comparison indicates that the RF model is suitable for estimating and predicting PM 2.5 concentration in the Yangtze River Delta. However, the RF model developed in this study is slightly over-fitted. Humidity correction and vertical correction are suggested in future modeling of PM 2.5 to reduce the error of the input variables.
Recent pioneering studies revealed that the mean value of PM 2.5 concentrations in 367 cities during COVID-19 decreased by 18.9 µg m −3 compared with the periods before COVID-19; in Wuhan, the city with the worst outbreak of COVID-19, PM 2.5 decreased by 1.4 µg m −3 [53]. The mean value of PM 2.5 concentrations in Zhejiang province declined by 14.691 µg m −3 during COVID-19 [54].
The magnitude of the PM 2.5 decline varied among studies because of their different spatial-temporal scales. However, the consensus is that PM 2.5 concentrations decreased, in general, under the strict "lockdown policy" during COVID-19, and that air quality improved [10,55,56,57,58]. This study provides a theoretical basis for controlling human activities to enhance air quality under extreme air pollution conditions. The published literature uses PM 2.5 data from urban monitoring sites. This paper compares different models and uses the most accurate one to estimate PM 2.5 in the Yangtze River Delta during the epidemic, obtaining PM 2.5 data that cover the entire region. Therefore, compared with the published literature, the PM 2.5 data estimated by the model cover both urban and rural areas and can be examined through spatial analysis. The research results revealed the spatial heterogeneity of PM 2.5 pollution during COVID-19.
In summary, RF-derived PM 2.5 concentrations during COVID-19 in 2020 and the same period in 2019 were compared to assess the influence of the "lockdown policy" on air pollution. The results of this study provide an important reference for air pollution control strategy. Although the PM 2.5 reduction during COVID-19 is mainly caused by declining emissions due to the stagnation of production and human activities, the effects of climatic change and the inertia of earlier emission-reduction measures cannot be ignored. Their contributions need to be clarified in future studies.

Conclusions
The machine learning method was able to explain a large proportion of the variability in the ambient PM 2.5 concentrations in the Yangtze River Delta, with variables of meteorology, elevation, population, road, and POI data. The RF model of PM 2.5 outperformed the SVR and ANN models in the Yangtze River Delta (YRD) region, and the PM 2.5 concentration predicted by the RF model showed high spatial variation in the YRD region. Therefore, the RF model can provide an exposure assessment for future studies on air pollution in China. RF-based results suggested that PM 2.5 concentrations in the YRD region decreased at multiple spatial scales during COVID-19 in 2020, compared with the same period in 2019, under the influence of the "lockdown policy". We propose that further studies could look into the applications of the RF model as a decision-making tool in air pollution control, and that the temporal and spatial resolution should be further improved.

Data Availability Statement:
No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest:
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.