Optimization of PM2.5 Estimation Using Landscape Pattern Information and Land Use Regression Model in Zhejiang, China

The effect of landscape pattern information on the accuracy of particulate matter estimation is seldom reported, which motivates this paper. Landscape pattern indexes were incorporated into a land use regression (LUR) model to investigate the performance of PM2.5 simulation over Zhejiang Province. The results show that the prediction accuracy of the model improved significantly after the landscape pattern indexes were incorporated. At the class level, waters and residential areas were the landscape components most clearly associated with decreasing and increasing PM2.5 concentrations, respectively. At the landscape level, CONTAG (contagion index) had a strong negative effect on pollutant concentrations. Latitude and relative humidity were key factors affecting the PM2.5 concentration at the province level. When the LUR model incorporating landscape pattern indexes was used to simulate the distribution of PM2.5, ordinary kriging for the LUR-based data mining was more accurate than LUR-based ordinary kriging, especially in areas of low pollution concentration.


Introduction
Fine particulate matter (PM2.5) refers to particles with aerodynamic equivalent diameters of less than 2.5 µm. These particles are highly toxic to humans and reduce air visibility, and they have thus attracted rising social attention in recent years. With accelerating urbanization in China, fine particles have become the primary pollutant affecting air quality in cities and severely influence people's daily lives. Routine analysis of PM2.5 was added to the new Ambient Air Quality Standards (GB3095-2012) published in China in March 2012. PM2.5 has become a pollutant of focus for future atmospheric pollution research and control in the nation.
The land-use regression (LUR) model is widely applied in the simulation of atmospheric pollutants on different spatial and temporal scales [1][2][3][4][5][6][7][8][9]. The model simulates the PM2.5 spatial distribution by regressing the PM2.5 concentrations at monitoring stations on the influencing factors in their surroundings (e.g., land use, topography, transportation, climate, population, and pollution sources). The landscape pattern index is a landscape ecological expression that quantifies land use [10][11][12][13][14]. Landscape pattern indexes and atmospheric pollution are typically related by a complicated pattern-process relationship. Land use structure and landscape patterns can affect the spatial distribution of particles [15]. Research has suggested that green patches in a city landscape purify atmospheric pollution more effectively when their average area is greater and their fragmentation index is lower [16]. The reduction in atmospheric pollution differs depending on the population density, industrial distribution, and landscape patterns (characteristics such as horizontal structure, heterogeneity, and connectivity) [17]. Thus, including landscape pattern indexes in the LUR model may improve the accuracy of the simulation. Nevertheless, studies on this aspect have seldom been reported.
There are two methods of simulating the spatial distribution of the PM2.5 concentration with the LUR model. In one method, the particle concentrations at the stations are obtained from the constructed model, and the spatial distribution of the PM2.5 concentration is then simulated through spatial interpolation [18,19]. In the other method, the raster data of the variables are entered into the model for regression mapping to obtain the spatial distribution of the PM2.5 concentration [20]. However, few studies have compared these two methods of LUR model application. Some studies have shown that the complete spatial distribution of particles in a region can be reflected using interpolation [21]. Ordinary kriging interpolation performed better than remote sensing inversion in reflecting the overall particle distribution in the study region, and the continuity of its results was also better. However, traditional interpolation methods tend to weight extreme changes excessively because they depend on single factors. Surface fitting for PM2.5 is typically not possible for regions with low station density or missing data [22].
The Yangtze River Delta, one of the national economic centers, is urbanizing rapidly, and its particulate matter pollution is serious. Zhejiang Province is a main component of the Yangtze River Delta and is typical of the region, with rich land use types and diverse landscape patterns. In this study, we investigated the effects of landscape pattern indexes on PM2.5 simulation using the LUR model and compared the two methods of land-use model application. Finally, a data mining method was introduced to improve the regional PM2.5 estimation.

Investigated Regions and Monitoring Stations
Zhejiang is a coastal province located in the south of the Yangtze River Delta in southeast China. It is one of the most economically active provinces but has a small land area. In 2015, the population of the province was 48.7334 million, with a high population density. The PM2.5 index in regions such as Hangzhou, Shaoxing, and Huzhou indicated a heavy pollution level. The environmental health of the province is thus a matter of great concern. The investigated region contained 150 evenly distributed national air quality monitoring stations (Figure 1). Zhoushan city is composed of sparse islands and has unique natural and cultural conditions. To keep the study representative, Zhoushan city was not included.

Settings for LUR Model
The equation of the LUR model is expressed as

$$y = a_1 x_1 + a_2 x_2 + \cdots + a_n x_n + \varepsilon$$

where the dependent variable $y$ is the pollutant concentration, the independent variables $x_1, \ldots, x_n$ are the potential variables, $a_1, \ldots, a_n$ are the associated coefficients, and $\varepsilon$ is the constant intercept [20].
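As an illustration of the linear form above, the following minimal sketch fits the coefficients and the intercept by ordinary least squares with NumPy. The function names `fit_lur` and `predict_lur` and the synthetic data are assumptions made for this example; they are not the software or the data used in the study.

```python
import numpy as np

def fit_lur(X, y):
    """Fit y = a_1 x_1 + ... + a_n x_n + eps by ordinary least squares.

    X: (stations, n) matrix of predictors; y: (stations,) concentrations.
    Returns (coefficients a_1..a_n, intercept eps).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Append a column of ones so the last fitted value is the intercept.
    design = np.column_stack([X, np.ones(len(y))])
    solution, *_ = np.linalg.lstsq(design, y, rcond=None)
    return solution[:-1], solution[-1]

def predict_lur(X, coefs, intercept):
    """Apply the fitted linear model to new predictor values."""
    return np.asarray(X, dtype=float) @ coefs + intercept

# Synthetic, noise-free demonstration data (hypothetical, not from the paper).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
true_coefs = np.array([2.0, -1.5, 0.5])
y_demo = X_demo @ true_coefs + 40.0

coefs, intercept = fit_lur(X_demo, y_demo)
predictions = predict_lur(X_demo, coefs, intercept)
```

Because the demonstration data contain no noise, the fitted coefficients reproduce `true_coefs` and the intercept of 40 up to floating-point error.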

Dependent Variable
The PM2.5 concentration data were from the air quality publication platform of the Zhejiang Environmental Protection Bureau (http://aqi.zjemc.org.cn/aqi/flex/index.html). Daily averages of PM2.5 were collected from the 150 national air quality monitoring stations from June 2015 to May 2016, and the monthly and annual averages of each station were derived from them.

Independent Variables
A total of 80 independent variables were selected for the LUR model, covering five categories: meteorological data, land-use data, population data, digital elevation data, and pollution source data. Land use/coverage data for the areas within 5 km of the stations were obtained by manual interpretation. The types of land use included woodland, residential, industrial, commercial, urban greenery, transportation, agricultural, bare land, waters, and roads. Buffers were created at 100, 300, 500, 800, 1000, 2000, 3000, 4000, and 5000 m, according to previous research findings [23][24][25]. Using version 4.2 of FRAGSTATS [10,26], we calculated the landscape pattern indexes [27] of the different distance buffers for analysis (Table 1).
The weather data were obtained from the China Meteorological Data Service Center (http://data.cma.cn/); the indexes obtained included temperature, relative humidity, pressure, sunlight, wind speed, and rainfall.
The national population density per square kilometer was obtained from the Center for International Earth Science Information Network provided by Columbia University (http://sedac.ciesin.columbia.edu/), and the population data of each station were extracted from those data. Data on pollution sources were from the monitoring of primary pollutants by the Zhejiang Environmental Protection Bureau, and the sources included power plants, steel plants, and other industries.
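To make the class-level indexes concrete, the sketch below computes class area (CA, in hectares) and the largest patch index (LPI, the largest patch of a class as a percentage of the landscape) on a small land-use grid. It is an illustration only: the grid, the class codes, the 30 m cell size, and the 8-neighbour patch rule are assumptions, and the whole rectangular grid stands in for a circular buffer. The study itself used FRAGSTATS.

```python
import numpy as np

def patch_sizes(mask):
    """Cell counts of connected patches in a boolean mask (8-neighbour rule)."""
    mask = np.asarray(mask, dtype=bool)
    seen = np.zeros_like(mask, dtype=bool)
    rows, cols = mask.shape
    sizes = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r, c] or seen[r, c]:
                continue
            # Flood fill from an unvisited cell of the class.
            stack = [(r, c)]
            seen[r, c] = True
            size = 0
            while stack:
                i, j = stack.pop()
                size += 1
                for di in (-1, 0, 1):
                    for dj in (-1, 0, 1):
                        ni, nj = i + di, j + dj
                        if (0 <= ni < rows and 0 <= nj < cols
                                and mask[ni, nj] and not seen[ni, nj]):
                            seen[ni, nj] = True
                            stack.append((ni, nj))
            sizes.append(size)
    return sizes

def class_area_ha(grid, class_id, cell_size_m):
    """CA: total area of one class in hectares (1 ha = 10,000 m^2)."""
    cells = int(np.sum(np.asarray(grid) == class_id))
    return cells * cell_size_m ** 2 / 10_000.0

def largest_patch_index(grid, class_id):
    """LPI: largest patch of the class as a percentage of all cells."""
    grid = np.asarray(grid)
    sizes = patch_sizes(grid == class_id)
    return 100.0 * max(sizes) / grid.size if sizes else 0.0

# Hypothetical 4 x 4 land-use grid: 0 = other, 1 = residential, 2 = water.
grid = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 2],
    [0, 0, 2, 2],
    [1, 0, 2, 2],
])
residential_ca = class_area_ha(grid, 1, cell_size_m=30)
residential_lpi = largest_patch_index(grid, 1)
water_lpi = largest_patch_index(grid, 2)
```

In this toy grid the residential class forms two patches (three cells and one cell), while water forms a single five-cell patch, so water has the higher LPI even though the two classes cover similar areas.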

Model Development and Evaluation
Each variable was paired with the PM2.5 concentration for bivariate correlation analysis to screen for influencing factors that were significantly correlated with the PM2.5 concentration. Independent variables with insignificant t-statistics (α = 0.05) were removed. To address the loss of model accuracy caused by collinearity between factors, the influencing factor most strongly correlated with PM2.5 among the variables in the same category was selected. Variables that were highly correlated (R > 0.6) with the selected factor were eliminated, and variables whose correlations contradicted established experience were removed [18]. By comparing the prediction accuracy of forward, backward, and stepwise selection, we found that the prediction accuracy of stepwise selection was higher [25,28,29,30]. All variables that satisfied the requirements were subjected to stepwise multivariate linear regression along with the PM2.5 concentration. The statistical parameters were defined, followed by the detection of outliers and influential data points. Based on practical needs, a regression equation with a higher adjusted R² and easy-to-interpret independent variables was selected as the final prediction model.
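The collinearity step of this screening can be sketched as follows. This is a simplified illustration, not the authors' procedure: it ranks candidates by the absolute Pearson correlation with PM2.5 and greedily drops any candidate whose correlation with an already kept variable exceeds 0.6, and it omits the significance test and the per-category grouping. The function name `screen_predictors`, the variable names, and the data are hypothetical.

```python
import numpy as np

def screen_predictors(X, y, names, collinearity_threshold=0.6):
    """Keep predictors in order of |r| with y, skipping collinear ones."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Pearson correlation of each candidate with the response.
    r_with_y = np.array(
        [np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])]
    )
    order = np.argsort(-np.abs(r_with_y))  # strongest correlation first
    kept = []
    for j in order:
        collinear = any(
            abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) > collinearity_threshold
            for k in kept
        )
        if not collinear:
            kept.append(j)
    return [names[j] for j in kept]

# Synthetic data: x1 is nearly a copy of x0, x2 is independent of both.
rng = np.random.default_rng(1)
x0 = rng.normal(size=200)
x1 = x0 + 0.1 * rng.normal(size=200)
x2 = rng.normal(size=200)
pm25 = 3.0 * x0 + 1.0 * x2 + 40.0

kept_names = screen_predictors(
    np.column_stack([x0, x1, x2]), pm25, ["x0", "x1", "x2"]
)
```

Only one of the two nearly identical variables survives the filter, while the independent variable is retained; the surviving set would then be passed to the stepwise regression.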
Because of the lack of landscape information and the screening of independent variables, 126 samples were ultimately used for regression modeling. The models were validated using the cross-validation statistics of the geostatistical analysis of ArcGIS software (ESRI, Redlands, CA, USA) [31,32]. When the predicted PM2.5 values are compared with the monitored values, a model with a smaller RMSE (root-mean-square error) and a greater adjusted R² provides a better fit. Generally, lower RMSE values mean more stable and accurate models [18].
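The two evaluation statistics named above can be written down directly. The sketch below also includes a leave-one-out loop for a linear regression as one possible way to obtain cross-validated predictions; that loop is an assumption for illustration and is not the ArcGIS cross-validation tool used in the study. All numbers are made up.

```python
import numpy as np

def rmse(observed, predicted):
    """Root-mean-square error between monitored and predicted values."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((observed - predicted) ** 2)))

def adjusted_r2(observed, predicted, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - p - 1)."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((observed - predicted) ** 2)
    ss_tot = np.sum((observed - observed.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    n = observed.size
    return float(1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1))

def loocv_predictions(X, y):
    """Refit the regression without each station, then predict that station."""
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.asarray(X, dtype=float), np.ones(len(y))])
    preds = np.empty_like(y)
    for i in range(len(y)):
        train = np.arange(len(y)) != i
        beta, *_ = np.linalg.lstsq(design[train], y[train], rcond=None)
        preds[i] = design[i] @ beta
    return preds

# Hypothetical monitored and predicted concentrations (µg/m^3).
monitored = [30.0, 40.0, 50.0, 60.0]
modelled = [32.0, 38.0, 51.0, 59.0]
demo_rmse = rmse(monitored, modelled)
demo_adj_r2 = adjusted_r2(monitored, modelled, n_predictors=1)

# Exactly linear data: leave-one-out predictions recover every value.
X_line = np.arange(6, dtype=float).reshape(-1, 1)
y_line = 2.0 * X_line[:, 0] + 1.0
loo = loocv_predictions(X_line, y_line)
```

Adjusted R² penalizes the number of predictors, which is why it is preferred over plain R² when candidate models with different numbers of variables are compared.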

LUR Model Construction
A stepwise multivariate linear regression analysis was performed on the 65 variables that satisfied the requirements for the model construction (Table 2). The five variables retained in the final model were the latitude, annual average relative humidity, 5000_residence_CA, 4000_water_LPI, and 5000_CONTAG. Among these variables, the latitude and 5000_residence_CA were positively correlated with PM2.5. The positive correlation between latitude and the PM2.5 concentration may be attributed to the differences in land use types and weather data for the regions adjacent to the stations as the latitude increases. On the one hand, the average urban construction area in northern Zhejiang province is larger than that in southern Zhejiang province, and urbanization can lead to serious pollution [33]. On the other hand, under the influence of the northwest wind in the Yangtze River Delta region, the fine particulate matter at the urban sites at the junction of the Yangtze River is significantly affected by transport within the region [34]; northerly air masses carry pollution to the south. In addition, the variable that showed the most significant negative correlation with the increase in latitude from Wenzhou to Huzhou and Jiaxing was the annual average temperature, with a Pearson correlation coefficient of −0.865. The higher the temperature, the stronger the atmospheric convection; pollutants in the atmosphere are transported over greater distances, which reduces the concentration of fine particulate matter. The variable that showed the second most significant correlation was the air pressure, with a Pearson correlation coefficient of 0.410. High pressure inhibits the transport of particles [35]. 5000_residential_CA represents the residential area within the 5000 m buffer. The larger the 5000_residence_CA, the more sources of pollution there are. Residential land stimulates the surrounding business, transportation, and production activities, which release more pollution and cause environmental damage to a certain extent [36,37].

The variables negatively correlated with PM2.5 were the annual average relative humidity, 5000_CONTAG, and 4000_waters_LPI. When the relative humidity is high and the PM2.5 concentration reaches a certain value, the particles settle under their own weight, which reduces the concentration of particles in the air [38]. 5000_CONTAG represents the degree of clustering or the trend of extension of the landscape patterns in the 5000 m buffer regions of the monitoring stations [39]. Generally, the higher the 5000_CONTAG, the better the continuity of a certain dominant patch type in the landscape, whereas a higher degree of landscape fragmentation increases the inter-patch transportation cost for urban residents and accelerates gas emissions. 4000_water_LPI represents the proportion of the landscape occupied by the largest patch of water within the 4000 m buffer regions. An increase in the area of water bodies in a city, whether in a scattered or centralized distribution, results in a decreased air temperature, increased humidity, increased average wind speed, reduced urban heat island effects, and increased dispersion of air pollutants. To investigate the effects of the landscape pattern indexes on the simulation accuracy of the model, the parameters of the model with and without the landscape pattern indexes are given in Tables 3 and 4, respectively.
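For reference, the contagion index is commonly defined in the FRAGSTATS documentation as shown below. The symbols are notation introduced here rather than taken from the text: $P_i$ is the proportion of the landscape occupied by class $i$, $g_{ik}$ is the number of adjacencies between cells of classes $i$ and $k$, and $m$ is the number of classes present.

$$\mathrm{CONTAG} = \left[ 1 + \frac{\displaystyle\sum_{i=1}^{m}\sum_{k=1}^{m} \left( P_i \frac{g_{ik}}{\sum_{k=1}^{m} g_{ik}} \right) \ln\!\left( P_i \frac{g_{ik}}{\sum_{k=1}^{m} g_{ik}} \right)}{2 \ln m} \right] \times 100$$

Under this definition, values near 0 correspond to a landscape whose classes are maximally disaggregated and interspersed, and values near 100 to a landscape dominated by large, contiguous patches, which matches the interpretation of 5000_CONTAG as a measure of clustering.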

Cross Validation
For the simulation precision of the LUR model based on landscape pattern indexes, it is important to remove the spatial autocorrelation and to examine the normal distribution and trends of the PM2.5 concentration; these checks are reported in the supplementary section (Figures S1-S4). The RMSE was 3.512 µg/m³ for the ordinary kriging for the LUR-based data mining, 3.571 µg/m³ for the LUR-based ordinary kriging, 4.067 µg/m³ for the ordinary kriging, and 4.055 µg/m³ for the data-mining-based ordinary kriging (Table 5). A smaller RMSE indicates a lower deviation of the predicted values from the measured values. Therefore, the cross-validation results suggested that the simulation accuracy of ordinary kriging for the LUR-based data mining was the best (Figure 2). As found in the pair-wise comparisons, the RMSE was significantly reduced using the LUR model; the RMSE of the particle concentration prediction decreased by approximately 0.5 µg/m³. These improvements were attributed to the factors incorporated in the regression model. The factors considered included not only space and distance but also environmental and social factors such as the weather, transportation, population, elevation, and land use. Different influencing factors have different explanatory power for the PM2.5 concentration. The use of the LUR model could avoid over-reliance on the spatial and distance factors. Furthermore, the factors incorporated in the model were considered comprehensively, which resulted in predictions that were closer to the actual conditions; hence, the model is based on a more solid scientific foundation.
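The pair-wise comparison can be checked with simple arithmetic on the four RMSE values quoted above. The short script below takes plain ordinary kriging as the baseline, which is a choice made for this illustration, and computes the absolute and relative RMSE reductions of the other methods.

```python
# RMSE values (µg/m^3) as reported in the text for Table 5.
rmse_by_method = {
    "ordinary kriging": 4.067,
    "data-mining-based ordinary kriging": 4.055,
    "LUR-based ordinary kriging": 3.571,
    "ordinary kriging for LUR-based data mining": 3.512,
}

baseline = rmse_by_method["ordinary kriging"]

# Absolute reduction relative to plain ordinary kriging.
reduction = {
    method: round(baseline - value, 3)
    for method, value in rmse_by_method.items()
}

# The same reduction expressed as a percentage of the baseline RMSE.
percent_reduction = {
    method: round(100.0 * (baseline - value) / baseline, 1)
    for method, value in rmse_by_method.items()
}

# Method with the smallest cross-validation RMSE.
best_method = min(rmse_by_method, key=rmse_by_method.get)
```

The reduction for LUR-based ordinary kriging is 0.496 µg/m³, which corresponds to the improvement of about 0.5 µg/m³ stated in the text, and the smallest RMSE belongs to ordinary kriging for the LUR-based data mining.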

Concentration Simulation
We compared the prediction accuracy of the different LUR-model application methods in the supplementary section; the simulation accuracy was greater when the land use model was applied directly to point fitting (Figure S5, Table S1). Compared with the other three methods, ordinary kriging for LUR-based data mining resulted in a prediction range of 19.81-53.06 µg/m³ (Figure 3); for the low-pollution regions, the predicted values were even closer to the monitored range of 19.83-59 µg/m³ (Figure 3). The overall results of the concentration simulation by the four methods were similar. The distribution of the predicted values was, to a large extent, consistent with the monitored particle distribution (Figure 3). In particular, the pollution in northern Zhejiang was severe, whereas that in the southern region was relatively mild. Among the stations of northern Zhejiang, the PM2.5 concentrations at the Zhaohui, Wuqu, and Hemu school stations in Hangzhou were the highest, reaching 59 µg/m³. In addition, the PM2.5 concentrations at the other stations in northern Zhejiang were mostly above 40 µg/m³. In contrast, among the stations of southern Zhejiang, the PM2.5 concentrations were as low as 40 µg/m³. Although the PM2.5 concentration at Qianjiangyuan station in the east was lower than at all other sites in Zhejiang province, the pollution levels in the east and west of the province were similar, and their severity was between those of the north and south. The heavily polluted regions were mainly located in Huzhou, Jiaxing, Hangzhou, Shaoxing, and the northeast of Jinhua, which were regions with higher levels of urbanization. On the other hand, the landscape in the southern regions was mainly composed of mountains, and the population was more scattered than in the north. The southern region also had higher vegetation coverage, lower emissions of pollutants, and higher air humidity. Therefore, the pollutant concentration was lower.

Urban planning could directly impact the air quality of a city [40,41]. Without proper guidance and management, rapid urban development will lead to environmental issues and health risks caused by poor air quality. As shown in our study, the air quality can be improved with the following measures: reduce the area of residence within 5000 m of the city center, increase the clustering of lands with the same type of use, suitably increase the area of water bodies within 4000 m of the city center, avoid intensive traffic that triggers bursts of pollutant emissions [42][43][44], and allow the dispersion of air pollutants. Although we have proposed some suggestions for improving air quality, our research has some shortcomings. Data from only 150 sites over Zhejiang province were collected to test the model, so the representativeness was limited, and the method was mainly linear and did not consider nonlinear relations between factors. The prediction accuracy of the model still has room for improvement. Although we have a certain understanding of the long-term changes of particulate matter at the regional scale, the short-term formation and dispersion of particulate matter cannot be captured by the model because daily variations were not included.

Conclusions
The prediction accuracy of the model that included the landscape pattern factors was greater than that of the model without them. The latitude and 5000_residential_CA were positively correlated with PM2.5, whereas the annual average relative humidity, 5000_CONTAG, and 4000_waters_LPI were negatively correlated with PM2.5. Over the lightly polluted region, the predicted value obtained by ordinary kriging for the LUR-based data mining was closest to the monitored value, and its RMSE was the lowest (3.512 µg/m³).

Table S1: Fitting parameters of different application mechanisms.

Figure 1. Distribution and elevation of the national air quality monitoring stations in Zhejiang province.
Figure S5: Predicted pollutant concentration and error distribution resulting from different application mechanisms: (a) applying model to surface fitting; (b) applying model to point fitting.

Table 1. Landscape pattern index of different distance buffers.

Table 2. List of variables of modeling. DEM: Digital Elevation Model.

Table 3. Parameters of the multiple regression model with the inclusion of landscape pattern factors.

Table 4. Parameters of the multiple regression model prior to the inclusion of landscape pattern factors.

Table 5. Cross-validation parameters for the four different simulation methods.