Spatio-Temporal Characteristics of PM2.5 Concentrations in China Based on Multiple Sources of Data and LUR-GBM during 2016–2021

Fine particulate matter (PM2.5) has a continuing impact on the environment, climate change and human health. In order to improve the accuracy of PM2.5 estimation and obtain a continuous spatial distribution of PM2.5 concentration, this paper proposes a LUR-GBM model based on land-use regression (LUR), the Kriging method and LightGBM (light gradient boosting machine). Firstly, this study modelled the spatial distribution of PM2.5 in the Chinese region by obtaining PM2.5 concentration data from monitoring stations in the Chinese study region and established a PM2.5 mass concentration estimation method based on the LUR-GBM model by combining data on land use type, meteorology, topography, vegetation index, population density, traffic and pollution sources. Secondly, the performance of the LUR-GBM model was evaluated by a ten-fold cross-validation method based on samples, stations and time. Finally, the results of the model proposed in this paper are compared with those of the back propagation neural network (BPNN), deep neural network (DNN), random forest (RF), XGBoost and LightGBM models. The results show that the prediction accuracy of the LUR-GBM model is better than other models, with the R2 of the model reaching 0.964 (spring), 0.91 (summer), 0.967 (autumn), 0.98 (winter) and 0.976 (average for 2016–2021) for each season and annual average, respectively. It can be seen that the LUR-GBM model has good applicability in simulating the spatial distribution of PM2.5 concentrations in China. The spatial distribution of PM2.5 concentrations in the Chinese region shows a clear characteristic of high in the east and low in the west, and the spatial distribution is strongly influenced by topographical factors. The seasonal variation in mean concentration values is marked by low summer and high winter values. 
The results of this study can provide a scientific basis for the prevention and control of regional PM2.5 pollution in China and can also provide new ideas for the acquisition of data on the spatial distribution of PM2.5 concentrations within cities.


Introduction
In response to the growing air pollution problem, China has set up large-scale ground-based PM2.5 monitoring stations to monitor and warn of heavily polluted weather [1]. PM2.5 can markedly reduce the body's immunity and cause respiratory diseases such as asthma and chronic bronchitis, as well as cardiovascular diseases such as heart disease and atherosclerosis, and can increase the risk of cancer [2]. The 2019 Global Burden of Disease Study reports that air pollution is the leading environmental risk factor for global [...]
Int. J. Environ. Res. Public Health 2022, 19, 6292
The contributions of this study are as follows: (1) This study uses an integrated approach combining the LUR model, the Kriging method and the LightGBM model to improve the daily concentration estimates of PM2.5 in the Chinese region from 2016 to 2021. AOD data, latitude and longitude information, meteorological observation elements, land use and road data are used to estimate PM2.5 concentrations. Specifically, stepwise selection in the LUR model identifies the important predictor variables, and five machine learning algorithms (BPNN, DNN, RF, XGBoost and LightGBM) are then used to build prediction models. (2) The hybrid spatial prediction model proposed in this paper combines the strength of LUR in identifying the most influential emission predictors with the strength of LightGBM in estimating non-linear trends, and is expected to be more widely effective than traditional machine learning estimation methods. Validated by the R2, RMSE and MAE metrics, the results show that LUR-GBM performs better. (3) The spring, summer, autumn, winter and 2016–2021 average concentrations are modelled, and the spatial and temporal characteristics of regional PM2.5 concentrations in China are analysed.
The rest of the paper is organised as follows. Section 2 describes the data sources used in this study. Section 3 introduces the methodology and model construction. Section 4 presents the model results and the spatial and temporal characteristics of the PM2.5 distribution. Section 5 is the discussion, and Section 6 gives the conclusions.

MODIS Remote Sensing Data
The MODIS sensor of NASA is mounted on the Terra and Aqua satellites. It has multiple channels and offers multi-spectral, wide-coverage and high-temporal-resolution observations, from which the spatial distribution of aerosol optical depth (AOD) can be retrieved with high accuracy. The MODIS MOD021KM data released from 2016 to 2021, with a spatial resolution of 1 km, were used in this work.

Meteorological Data
The main meteorological data used are planetary boundary layer height (PBLH), relative humidity (RH), air temperature (TEM), surface pressure (SP), wind speed (WIN) and total rainfall (RF). The meteorological data were obtained from the ERA5 dataset on the European Centre for Medium-Range Weather Forecasts website and were rasterised, resampled and cropped using ArcGIS to match the spatial resolution of the AOD data.

Land Use and Road Dataset
This study uses the land-use dataset published by the China Geographic Monitoring Cloud platform. The classification and description of the independent variables are shown in Table 1. Using ArcGIS 10.7 (Esri, Redlands, CA, USA), the land-use data were stitched, cropped and reclassified into six categories (arable land, forest land, grassland, water, construction land and bare land), taking into account the area and attributes of each type of land. The road data were obtained from the vector road network of OpenStreetMap. Four categories (highways, trunk roads, primary roads and secondary roads) were extracted within the study area, and identical buffer zones were established around each monitoring station. The length of each type of road within each buffer zone was obtained by spatial overlay and used as the road factor.

LightGBM
The LightGBM algorithm is an optimised implementation of the gradient boosting decision tree (GBDT) [36]. The model is trained on a sufficiently large sample, and its final output is determined by building multiple decision trees (weak learners) and combining their outputs. The training process can be expressed as follows: decision trees are added iteratively, and when the increase in accuracy due to adding a tree falls below a certain threshold, the iteration stops and a LightGBM model consisting of N_tree decision trees is obtained [37]:
$$\hat{y}_i = \sum_{k=1}^{N_{tree}} f_k(PM_i)$$
where \hat{y}_i is the model output for sample i; PM_i is the vector of PM2.5 influencing factors; f_k(PM_i) is the kth decision tree. Heuristic information in the LightGBM iteration trees can be used as a measure of feature importance. The tree-structure-based metric therefore directly affects the quality of the subset of candidate features and ultimately determines the experimental effectiveness of the machine learning algorithm. For any given tree structure, PM_Split represents the total number of times each PM2.5 influencing factor has been used to partition the data in the iteration trees, and PM_Gain represents the level of importance of each PM2.5 influencing factor. They are defined as follows:
$$PM\_Split = \sum_{k=1}^{K} Split_k, \qquad PM\_Gain = \sum_{k=1}^{K} Gain_k$$
where K is the number of decision trees resulting from K rounds of iteration, and Split_k and Gain_k are the number of splits on the factor and the gain those splits contribute in the kth tree.
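The split-count and gain measures described above can be illustrated with a minimal sketch. It assumes each tree is represented simply as a list of (feature, gain) pairs, one per internal split; this representation, the function name and the toy gains are hypothetical and are not taken from the study.

```python
from collections import defaultdict


def split_and_gain_importance(trees):
    """Aggregate split counts and total gain per feature over all trees.

    `trees` is a list of trees; each tree is a list of (feature, gain)
    pairs, one pair per internal split node (an assumed toy format).
    """
    split_count = defaultdict(int)
    total_gain = defaultdict(float)
    for tree in trees:
        for feature, gain in tree:
            split_count[feature] += 1      # contributes to PM_Split
            total_gain[feature] += gain    # contributes to PM_Gain
    return dict(split_count), dict(total_gain)


# Hypothetical ensemble of K = 3 trees with made-up gains.
toy_trees = [
    [("AOD", 12.0), ("PBLH", 4.0), ("AOD", 2.5)],
    [("AOD", 8.0), ("RH", 3.0)],
    [("PBLH", 5.0), ("TEM", 1.5), ("AOD", 1.0)],
]
pm_split, pm_gain = split_and_gain_importance(toy_trees)

# Rank the factors from most to least important by accumulated gain.
ranked = sorted(pm_gain, key=pm_gain.get, reverse=True)
```

In the LightGBM Python package, the same two quantities are exposed through `Booster.feature_importance` with `importance_type="split"` or `importance_type="gain"`.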

LUR Model
LUR is an effective method for modelling PM2.5 concentrations because of its high simulation accuracy and the range of factors it considers [38]. In this study, a multivariate regression equation, or LUR model, was constructed for PM2.5 concentrations in relation to land-use type, topography, meteorology, road traffic, population density and pollution sources. The basic form of the model consists of one dependent variable and two or more independent variables and is given by the following equation [39]:
$$y = \alpha_0 + \alpha_1 PM(x_1) + \alpha_2 PM(x_2) + \dots + \alpha_n PM(x_n) + \varepsilon$$
where y is the dependent variable and represents the PM2.5 concentration value; PM(x_1), PM(x_2), ..., PM(x_n) are the different influencing factors of PM2.5; α_0, α_1, α_2, ..., α_n are the coefficients to be determined; ε is the random error term.
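As a rough illustration of how the coefficients of such a regression can be estimated, the sketch below fits the linear form by ordinary least squares using the normal equations. It covers only the fitting step, not the stepwise variable selection, and the two predictors and all numbers are invented for the example.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_lur(X, y):
    """Least-squares fit of y = a0 + a1*x1 + ... + an*xn; returns [a0, ..., an]."""
    design = [[1.0] + list(row) for row in X]  # prepend the intercept column
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Invented predictors: built-up land share and road length inside a buffer.
X = [[0.1, 2.0], [0.4, 1.0], [0.7, 3.5], [0.2, 0.5], [0.9, 2.5]]
# Noise-free synthetic response, so the fit should recover 20, 30 and 4.
y = [20.0 + 30.0 * a + 4.0 * b for a, b in X]
coeffs = fit_lur(X, y)
```

With real station data the response would carry noise, and a stepwise procedure would add or drop predictors according to their significance before the final fit.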

Kriging
The basic principle of Kriging's method is to estimate data at other unobserved locations in space from data at regularly distributed sample points in space [40].

Regionalised Variables
The PM2.5 concentration over the study area can be considered a regionalised variable R(S) satisfying Kriging's interpolation condition, where S_1, S_2, ..., S_n are the locations of the PM2.5 ground monitoring stations in the area and R(S_1), R(S_2), ..., R(S_n) are the observed PM2.5 values at the corresponding stations. For a point S_0 in the region, the spatial attribute R_d(S_0) can be obtained by interpolation with the Kriging method, and the temporal attribute R_t can be expressed in terms of the month in which the point is located [41]:
$$R_d(S_0) = \sum_{i=1}^{n} \omega_i R(S_i), \qquad R_t = m$$
where R_d(S_0) is the spatial attribute of the given point, ω_i is the Kriging weight, R(S_i) is the monitoring value of the ith station around the point and m is the month of the given point.
Kriging selects the set of weights that minimises the variance of the difference between the estimated value R_d(S_0) and the true value R(S_0) while satisfying the condition of unbiased estimation, as follows:
$$\sum_{i=1}^{n} \omega_i = 1, \qquad \min_{\omega}\ \operatorname{Var}\left[R_d(S_0) - R(S_0)\right]$$

Variance Functions
The variance function (variogram) is the basis of the Kriging interpolation method. It is a model function that describes the spatial relationship between PM2.5 ground monitoring stations and between stations and pixels. The variance function for the regionalised variable R(S) can be expressed as the semi-variance µ(S_i, S_j) of the difference between the observations at monitoring stations S_i and S_j, as in Equation (6):
$$\mu(S_i, S_j) = \frac{1}{2} E\left[\left(R(S_i) - R(S_j)\right)^2\right] \tag{6}$$

Equation Solving
The Kriging equation can be obtained by minimising the variance of the unbiased estimate, which gives Equation (7):
$$\sum_{j=1}^{n} \omega_j\, \mu(S_i, S_j) + \phi = \mu(S_i, S_0), \quad i = 1, \dots, n, \qquad \sum_{j=1}^{n} \omega_j = 1 \tag{7}$$
where ϕ is the Lagrange multiplier. Solving the above system of equations yields the Kriging weights ω_i and hence the estimated value R_d(S_0) for any point S_0 in the region. The Kriging method takes full account of the correlation of PM2.5 site data by calculating the variance function of the sample.
Figure 2 shows the research framework. A total of six models (BPNN, DNN, RF, XGBoost, LightGBM, LUR-GBM) were developed in this study. Given that the shortcomings of machine learning models in selecting appropriate predictor variables can be addressed by applying LUR, this study uses an integrated approach combining LUR and machine learning models to improve the estimation of regional PM2.5 daily concentrations in China for the period 2016 to 2021. First, a traditional LUR is used to identify significant predictor variables. Deep neural network, random forest and XGBoost algorithms were then used to fit predictive models based on the variables selected by the LUR model. Data partitioning, 10-fold cross-validation, external data validation and seasonal and year-based validation were used to assess the robustness of the developed models. Specifically, the significant predictor variables identified through the stepwise variable selection of the LUR procedure were applied to LightGBM to improve the accuracy of PM2.5 change predictions. A hybrid spatial prediction model that combines the strength of LUR in identifying the most influential emission predictors with the ability of machine learning to estimate non-linear trends is expected to be more effective than techniques that rely on LUR or machine learning alone.
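The Kriging system referred to as Equation (7) can be illustrated numerically. The sketch below solves an ordinary Kriging system with a Lagrange multiplier for a handful of stations. The exponential variogram model, its sill and range, the station coordinates and the observed values are all assumptions made for the example; the study's fitted variance function is not given in the text.

```python
import math


def semivariance(h, sill=1.0, corr_range=300.0):
    """Exponential variogram model (an assumed form) for separation h."""
    return sill * (1.0 - math.exp(-h / corr_range))


def solve(a, b):
    """Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def distance(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])


def kriging_weights(stations, target):
    """Return (weights, Lagrange multiplier) of the ordinary Kriging system."""
    n = len(stations)
    # Station-to-station semivariances, bordered by the unbiasedness constraint.
    a = [[semivariance(distance(stations[i], stations[j])) for j in range(n)] + [1.0]
         for i in range(n)]
    a.append([1.0] * n + [0.0])
    # Station-to-target semivariances, plus the constraint sum(weights) = 1.
    b = [semivariance(distance(s, target)) for s in stations] + [1.0]
    solution = solve(a, b)
    return solution[:n], solution[n]


# Hypothetical stations on a 100 km square and invented PM2.5 observations.
stations = [(0.0, 0.0), (100.0, 0.0), (0.0, 100.0), (100.0, 100.0)]
observed = [35.0, 50.0, 42.0, 61.0]

weights, phi = kriging_weights(stations, (50.0, 50.0))
estimate = sum(w * r for w, r in zip(weights, observed))
```

Because the target lies at the centre of the square, symmetry forces equal weights, so the estimate is the plain average of the four observations. Moving the target onto a station makes that station's weight equal to one, which is the exact-interpolation property of Kriging.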
In order to fully consider the spatial correlation of monitoring station data in PM2.5 mass concentration estimation and to improve the accuracy of PM2.5 spatial estimation, this paper introduces the Kriging method and constructs a spatio-temporal LUR-GBM model, which provides a new way to handle the complex spatial relationships in PM2.5 estimation. The LUR-GBM model takes into account the influence of the PM2.5 value at any point in space on the values of other stations in the surrounding neighbourhood, using the spatial location estimate R_d calculated by the Kriging method and the temporal location R_t of that point as input variables to the model. The LUR-GBM model can be expressed as Equation (8).

LUR-GBM Model
where EPM2.5 is the LUR-GBM model PM2.5 estimate, Model is the LUR-GBM model, LAT is the latitude and LON is the longitude.

Accuracy Evaluation
To fully evaluate the performance of the LUR-GBM model, a ten-fold cross-validation (10-CV) based on samples, sites and time was used, and the computed results were compared with BPNN, DNN, RF, XGBoost and LightGBM. Four indicators, the coefficient of determination (R2), root mean square error (RMSE), mean absolute error (MAE) and mean absolute percentage error (MAPE), were calculated from the model prediction results to test the model performance [42]. R2 measures the degree of linear correlation between variables and reflects the proportion of the variation in the dependent variable that can be explained by the independent variables. Therefore, the coefficient of determination was selected as one of the indicators for model evaluation in this study [43]. The evaluation indicators are calculated using the following formulas:
$$R^2 = 1 - \frac{\sum_{i=1}^{N} \left(PM_{T,i} - PM_{F,i}\right)^2}{\sum_{i=1}^{N} \left(PM_{T,i} - \overline{PM_T}\right)^2}$$
$$RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left(PM_{F,i} - PM_{T,i}\right)^2}$$
$$MAE = \frac{1}{N} \sum_{i=1}^{N} \left|PM_{F,i} - PM_{T,i}\right|$$
$$MAPE = \frac{100\%}{N} \sum_{i=1}^{N} \left|\frac{PM_{F,i} - PM_{T,i}}{PM_{T,i}}\right|$$
where PM_F is the predicted PM2.5 value; PM_T is the measured PM2.5 value; \overline{PM_T} is the mean of the measured values; N is the number of samples.
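The four indicators can be computed in a few lines. The sketch below uses the standard definitions of R2 (one minus the ratio of residual to total sum of squares), RMSE, MAE and MAPE; the function name and the four sample values are invented for illustration.

```python
import math


def evaluate(predicted, measured):
    """Return R2, RMSE, MAE and MAPE (in percent) for paired values."""
    n = len(measured)
    mean_t = sum(measured) / n
    sse = sum((f - t) ** 2 for f, t in zip(predicted, measured))  # residual SS
    sst = sum((t - mean_t) ** 2 for t in measured)                # total SS
    return {
        "R2": 1.0 - sse / sst,
        "RMSE": math.sqrt(sse / n),
        "MAE": sum(abs(f - t) for f, t in zip(predicted, measured)) / n,
        "MAPE": 100.0 * sum(abs((f - t) / t) for f, t in zip(predicted, measured)) / n,
    }


# Invented measured and predicted PM2.5 values in µg/m3.
measured = [20.0, 40.0, 60.0, 80.0]
predicted = [22.0, 38.0, 63.0, 77.0]
scores = evaluate(predicted, measured)
```

Note that MAPE divides by the measured value, so it weights errors at low concentrations more heavily than RMSE or MAE do.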

Correlation Analysis of PM 2.5 Concentrations and Impact Factors
The results of the bivariate correlation analysis between PM2.5 concentration and the influencing factors are shown in Table 2. Within the land-use sub-categories, arable land, forest land, grassland and urban and rural industrial, mining and residential land all have a strong influence on the change in PM2.5 concentration. Among the road traffic data, highways and major arterial roads had a strong influence. Topography, forest land, grassland and unused land had a negative relationship with PM2.5 concentration, while road traffic and urban and rural industrial, mining and residential land had a positive relationship. More generally, PM2.5 concentrations are negatively correlated with factors such as woodland, grassland, water, altitude, precipitation and relative humidity, and positively correlated with factors such as industrial and mining settlements, barometric pressure, temperature, population density and road length. The p-values represent the level of significance: a correlation is highly significant at α = 0.01 and significant at α = 0.05. Table 2 shows that population was significantly correlated at α = 0.05 and all other modelling variables were highly significant at the α = 0.01 level, so all variables passed the significance test.
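A bivariate correlation of this kind can be sketched with the Pearson coefficient. The example below computes r and the associated t statistic for two made-up predictors against made-up PM2.5 values. Whether the study used Pearson or another coefficient is not stated, so this choice is an assumption, and turning the t statistic into a p-value would additionally need a t-distribution table or a statistics library.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def t_statistic(r, n):
    """t statistic for testing r = 0 with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Invented station-level values, for illustration only.
pm25 = [82.0, 64.0, 51.0, 38.0, 27.0]
forest_share = [0.05, 0.15, 0.30, 0.45, 0.60]
road_length_km = [14.0, 11.0, 9.5, 6.0, 3.0]

r_forest = pearson_r(forest_share, pm25)   # expected to be negative
r_road = pearson_r(road_length_km, pm25)   # expected to be positive
t_forest = t_statistic(r_forest, len(pm25))
```

The signs mirror the pattern reported in the text: more forest cover goes with lower concentrations, more road length with higher ones.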

Model Performance
Using China as the study area, data from 1 January 2016 to 31 December 2021 were selected, and the training and test (validation) datasets were drawn by repeated random sampling. The training set comprised 70% of the data and the test set 30%; the experiment was repeated and the averaged metrics were taken as the evaluation result of the model. The LUR-GBM model was trained in Python 3.7, using the target factors selected by bivariate correlation analysis as features and the PM2.5 concentrations at the monitoring stations as supervised values. The LightGBM model had the following parameters: base learner = GBDT, number of base learners = 100, Num_leaves = 31, Learning_rate = 0.05, Feature_fraction = 0.9, Bagging_fraction = 0.8, Bagging_freq = 5.
Table 3 shows the performance of the machine learning models, with R2 ranging from 0.76 to 0.98 for the five machine learning models in a sample-based cross-validation. The R2 of both the LightGBM and LUR-GBM models, which consider site data correlation, was greater than 0.9, with the LUR-GBM model performing best. The RMSE of the models ranged from 6.43 to 11.37 µg/m3, with the LUR-GBM model having the lowest RMSE and the BPNN model the highest (11.37 µg/m3). The MAE ranged from 4.17 to 8.35 µg/m3, with the LUR-GBM model having the lowest MAE of 4.17 µg/m3, followed by the LightGBM model at 4.56 µg/m3.
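The reported training setup can be written down as a configuration plus a repeated 70/30 split. In the sketch below the dictionary keys are LightGBM parameter names and aliases, and the values are those listed in the text; the split helper, the number of repeats, the seeds and the `fit_and_score` callback are hypothetical scaffolding rather than the authors' code.

```python
import random

# LightGBM parameter names; values as reported in the text.
LGBM_PARAMS = {
    "boosting_type": "gbdt",   # base learner = GBDT
    "n_estimators": 100,       # number of base learners
    "num_leaves": 31,
    "learning_rate": 0.05,
    "feature_fraction": 0.9,
    "bagging_fraction": 0.8,
    "bagging_freq": 5,
}


def random_split(n_samples, train_fraction=0.7, seed=0):
    """Shuffle sample indices and cut them into training and test parts."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    cut = int(round(train_fraction * n_samples))
    return indices[:cut], indices[cut:]


def repeated_evaluation(n_samples, fit_and_score, repeats=5):
    """Average a score over several random 70/30 splits.

    `fit_and_score(train_idx, test_idx)` is a caller-supplied function
    (hypothetical here) that trains a model and returns one metric.
    """
    scores = []
    for seed in range(repeats):
        train_idx, test_idx = random_split(n_samples, seed=seed)
        scores.append(fit_and_score(train_idx, test_idx))
    return sum(scores) / len(scores)


train_idx, test_idx = random_split(1000)
# Dummy scorer that just reports the test share, to show the call pattern.
mean_score = repeated_evaluation(1000, lambda tr, te: len(te) / (len(tr) + len(te)))
```

With the lightgbm package installed, a dictionary like this could be passed as `lightgbm.LGBMRegressor(**LGBM_PARAMS)`; how the study actually wired the parameters is not stated.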
In the site-based cross-validation, the R2 values of the LightGBM and LUR-GBM models, which consider geographical correlation and temporal variation, were significantly higher than those of traditional machine learning models such as XGBoost, RF and BPNN. They were nevertheless lower than in the sample-based cross-validation because of the significant spatial heterogeneity of the PM2.5 distribution. The LUR-GBM model has the highest R2 value of 0.91, followed by the LightGBM model, and the BPNN model performed the worst. The RMSE and MAE values of the LightGBM and LUR-GBM models were significantly lower than those of the other traditional machine learning models, with the LUR-GBM model performing best with RMSE and MAE values of 7.46 µg/m3 and 5.01 µg/m3, respectively. The LUR-GBM model performs well at the spatial scale, taking full account of the relevance of site data. The relatively poor performance of the time-based cross-validation models is due to the fact that the PM2.5 [...]
Figure 3 shows the scatter plot of the PM2.5 concentrations estimated by the BPNN, RF, DNN, XGBoost, LightGBM and LUR-GBM models against the PM2.5 concentrations measured at the ground monitoring sites. As can be seen from Figure 3, the LightGBM and LUR-GBM models outperform traditional machine learning models such as BPNN, DNN, RF and XGBoost. The reason is that the LightGBM and LUR-GBM models take into account site data and temporal variation and can better characterise the spatial and temporal characteristics of PM2.5. The scatter density plots for the LightGBM and LUR-GBM models have a goodness of fit (R2) of 0.91 and 0.98, respectively, indicating that the LUR-GBM model gives the best fit.
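The sample-, site- and time-based cross-validations compared above differ only in what is held out together. One way to sketch this is a single fold builder that keeps all records sharing a key in the same fold: the key is the record itself for sample-based folds, the station for site-based folds and the date for time-based folds. The helper name and the synthetic records are assumptions made for the example.

```python
import random


def grouped_folds(group_keys, k=10, seed=0):
    """Split record indices into k folds, keeping each group in one fold."""
    rng = random.Random(seed)
    groups = sorted(set(group_keys))
    rng.shuffle(groups)
    fold_of_group = {g: i % k for i, g in enumerate(groups)}
    folds = [[] for _ in range(k)]
    for idx, key in enumerate(group_keys):
        folds[fold_of_group[key]].append(idx)
    return folds


# Synthetic records: 20 stations observed on 30 days each.
records = [(f"station_{s:02d}", day) for s in range(20) for day in range(30)]

sample_folds = grouped_folds(list(range(len(records))))           # each record alone
site_folds = grouped_folds([station for station, _ in records])   # whole stations
time_folds = grouped_folds([day for _, day in records])           # whole days
```

In the site-based case a held-out station contributes nothing to training, which makes it a stricter test of spatial generalisation and is consistent with the lower R2 reported for that setting.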
The LUR-GBM is based on the LightGBM model with the introduction of the Kriging method, which improves the accuracy of PM2.5 estimation by calculating the variance function and taking full account of the spatial correlation of station data. The ground-level PM2.5 mass concentrations estimated by the BPNN were the least well fitted, grossly underestimating PM2.5 values. The overall error of our model is small, but extreme events such as dust storms occur in places like Xinjiang, and most of the observations exceeding 200 µg/m3 come from these areas. Some predictions there can deviate significantly from the true values, and the deviations are compounded by the incomplete coverage of monitoring stations in this part of the country and the large distances between them. Furthermore, the model is based on daily regional PM2.5 mass concentration data for 2016–2021 in China and takes regional variability into account, so a small number of deviations in the predicted values is to be expected. The MAPE values show that the average error of the BPNN reaches more than 30%, while the MAPE of the LightGBM and LUR-GBM models is less than 20%, with the LUR-GBM model at 15.304%. A comprehensive comparison of the six machine learning models showed that the LUR-GBM model had the best prediction performance, followed by the LightGBM model, while the BPNN had the worst prediction performance among the six models.
We validated the six models using annual average data from 2016 to 2021, and Figure 4 shows the scatter density plots of PM2.5 concentrations estimated by the BPNN, RF, DNN, XGBoost, LightGBM and LUR-GBM models against the actual PM2.5 concentrations measured at ground monitoring stations. The overall performance of the six models was better than for the predictions of daily concentrations, because annual concentrations are less variable and less affected by extreme values. As can be seen from Figure 4, the LightGBM and LUR-GBM models outperformed traditional machine learning models such as BPNN, DNN, RF and XGBoost, with R2 values of 0.82 and 0.866, respectively. BPNN and DNN had the worst fit, with R2 values of 0.75 and 0.79, respectively. The best RMSE values among the six models were 5.571 µg/m3 for the LightGBM model and 5.291 µg/m3 for the LUR-GBM model, while the worst was 6.669 µg/m3 for the BPNN. In terms of MAE, the LUR-GBM model had the minimum of 4.021 µg/m3 and the BPNN the maximum of 6.669 µg/m3. The MAPE of all six models was below 15%, with the LUR-GBM model the smallest at 10.71%. A comprehensive comparison of the six machine learning models shows that the LUR-GBM model had the best prediction performance, followed by the LightGBM model, while the BPNN had the worst prediction performance among the six models. The values of RMSE, MAE and MAPE all decreased compared with the daily concentration results. However, the R2 was considerably lower than for the daily concentration data, mainly because the annual dataset is much smaller than the daily dataset, which makes a good fit harder to obtain.
Figure 5 shows a scatter density plot of the PM2.5 concentrations estimated by the LUR-GBM model on a seasonal scale against the PM2.5 concentrations measured at ground-based monitoring stations. A ten-fold cross-validation based on samples showed that R2 (0.98) was highest in autumn. The highest RMSE (12.54 µg/m3) occurred in spring and the highest MAE (7.61 µg/m3) in winter; in these seasons the higher correlation between surface temperature and PM2.5 contributed to the difference in R2. In contrast, the lowest R2 (0.91), the lowest RMSE (4.34 µg/m3) and the lowest MAE (3.01 µg/m3) were recorded in summer. The lower estimation error in summer reflects the lower ground-level PM2.5 mass concentrations caused by frequent rainfall. Overall, the LUR-GBM model performed well on seasonal scales and was able to predict the distribution of PM2.5 mass concentrations on seasonal scales. To test the accuracy of the LUR-GBM model simulation, a linear correlation analysis was performed between the simulated PM2.5 model values at [...]

Seasonal Distribution Characteristics
The seasons were first divided according to the climatic conditions of the Chinese region as a whole: spring from March to May, summer from June to August, autumn from September to November and winter from December to February. The spatial distribution of seasonal average PM2.5 concentrations is shown in Figure 7. Concentrations are mostly high in winter and low in summer, falling in spring and rising in autumn. Summer air quality is good in all cities, with concentrations below 35 µg/m3 in most areas. East China, Central China and the Fenwei Plain are the most polluted in winter, with most cities in these regions exceeding 70 µg/m3. Concentrations are higher in the north than in the south in spring and become more serious in autumn, mainly in East China and Xinjiang. The very highest seasonal values occur in winter in Xinjiang, reaching above 100 µg/m3. Xinjiang's air quality is relatively good in summer, although it remains among the most polluted areas in the country even then; it is polluted to some degree in all other seasons, and pollution is at high levels throughout the region in winter. The overall air quality in southern China is good, with little difference between spring, summer and autumn, when concentrations are generally below 40 µg/m3. Pollution there is relatively serious in winter and mostly concentrated in Hunan and Jiangxi provinces.
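The seasonal grouping defined above maps directly onto a month-to-season lookup. The sketch below averages daily values by season using that definition; the function name and the daily values are invented for illustration.

```python
from collections import defaultdict

# Season definition taken from the text: Mar-May, Jun-Aug, Sep-Nov, Dec-Feb.
SEASON_OF_MONTH = {
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
    12: "winter", 1: "winter", 2: "winter",
}


def seasonal_means(daily_records):
    """Average PM2.5 by season from an iterable of (month, value) pairs."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for month, value in daily_records:
        season = SEASON_OF_MONTH[month]
        totals[season] += value
        counts[season] += 1
    return {season: totals[season] / counts[season] for season in totals}


# Invented daily PM2.5 values in µg/m3, as (month, value).
toy_records = [(1, 90.0), (2, 70.0), (4, 45.0), (7, 22.0),
               (8, 26.0), (10, 40.0), (12, 80.0)]
means = seasonal_means(toy_records)
```

December is grouped with the following January and February, so when working with calendar-year files the winter of a given year draws on two different years of data.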

Fitting Assessment of PM2.5 Concentrations in Typical Chinese Cities
The Beijing-Tianjin-Hebei Urban Agglomeration, the Yangtze River Delta and the Fenwei Plain are areas with high emission intensity per unit area of air pollution sources in China, and these three regions are also key areas identified by the state for air pollution prevention and control [44]. We fitted PM2.5 concentrations for 10 typical cities in these heavily polluted areas, all of which have large populations and a relatively large number of observation sites. As shown in Figure 9, the R2 of the fit was above 98% for all 10 cities, with Hangzhou and Hefei having the highest accuracy in terms of RMSE and MAE values and Shijiazhuang and Tianjin having poorer results. The accuracy for northern cities was lower than for southern cities. The main reason is that northern cities such as Beijing and Tianjin are affected by sandstorms, while lower winter temperatures and more snow and ice lead to more complex aerosol types, which affects the accuracy of the model.
Spatially, pollution as a whole is more serious in the east than in the west, which is consistent with China's overall economic development, urbanisation and population distribution. Pollution is serious in northern China, where dense industry and heavy emissions concentrate pollutants in southern Hebei, northern Henan and western Shandong, with average concentrations above 70 µg/m3. Central China and the Sichuan Basin also have greater air pollution: Central China is economically developed and densely populated, so intense human activity has increased pollutant emissions, and the special topography of the Sichuan Basin is not conducive to the dispersion of pollutants. Owing to its southerly, coastal location and high rainfall, southern China has low air pollution, with an average PM2.5 concentration below 30 µg/m3, which is lower than the national average annual concentration. In addition, the Xinjiang region also experienced more serious air pollution because of the frequent dust storms and poor air quality around the Taklamakan Desert.

Discussion
(1) To verify that the PM2.5 concentration prediction based on the LUR-GBM model was more accurate, validation was carried out with different datasets and different control models. Across the sample-based, site-based and time-based datasets, the LUR-GBM model had the highest prediction accuracy on the sample-based dataset. In particular, compared with the prediction based on the station dataset, the sample-based prediction improved R2 by 7.69%, reduced RMSE by 13.81% and reduced MAE by 16 [...] In terms of the spatial distribution of PM2.5 concentrations, China's polluted regions as a whole are characterised by higher levels in the east than in the west. North China is the most polluted region, mainly including southern Hebei, northern Henan and western Shandong, followed by Central China, the Sichuan Basin and Xinjiang. Southern China has the lowest PM2.5 concentration and the best air quality. (4) PM2.5 concentration predictions for ten typical cities in heavily polluted regions of China were studied and found to be less accurate in northern cities than in southern cities. Hangzhou and Hefei had the highest forecast accuracy, while Shijiazhuang and Tianjin had a lower forecast accuracy.
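The relative changes quoted in item (1) can be checked against the LUR-GBM values reported in the Model Performance section (sample-based: R2 0.98, RMSE 6.43, MAE 4.17; site-based: R2 0.91, RMSE 7.46, MAE 5.01). The short calculation below is only arithmetic on those reported numbers. The MAE percentage is truncated in the text, so the value this sketch computes should not be read as the authors' stated figure.

```python
def relative_change(reference, value):
    """Percentage change of `value` relative to `reference`."""
    return 100.0 * (value - reference) / reference


# LUR-GBM metrics as reported in the Model Performance section.
site_based = {"R2": 0.91, "RMSE": 7.46, "MAE": 5.01}
sample_based = {"R2": 0.98, "RMSE": 6.43, "MAE": 4.17}

changes = {
    name: relative_change(site_based[name], sample_based[name])
    for name in site_based
}
r2_gain = round(changes["R2"], 2)        # positive: R2 improves
rmse_drop = round(-changes["RMSE"], 2)   # reported as a reduction
```

The R2 and RMSE results reproduce the 7.69% and 13.81% quoted in the text, which suggests the percentages were computed relative to the site-based values.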

Conclusions
In this paper, a hybrid model, LUR-GBM, is proposed based on PM 2.5 observation data for China from 2016 to 2021. The spatial and temporal distribution of PM 2.5 concentrations was estimated using AOD data from satellite remote sensing inversions together with conventional meteorological observations, land use and road data. By analysing the spatial and temporal patterns of PM 2.5 and its influencing factors, this paper clarifies the changes in PM 2.5 at different time scales and the underlying mechanisms in recent years, and summarises the general patterns of the spatial and temporal distribution of PM 2.5 concentrations in China. By taking into account land use information, correlation and spatio-temporal heterogeneity, the inversion of PM 2.5 helps to capture how PM 2.5 varies regionally in time and space. This study provides a scientific basis for the prevention and control of regional PM 2.5 pollution and a new approach for management departments to obtain data on the spatial distribution of PM 2.5 concentrations. The LUR-GBM method also offers a better way of handling the spatial heterogeneity of the study objects.
The recommendations of this paper are as follows:
(1) Improve joint prevention and control mechanisms across regions. The formation and sources of PM 2.5 are complex, and it is difficult to reduce pollution radically by controlling only a single source or a single city. Analysis of the spatial distribution of PM 2.5 on a regional scale can provide reliable information to support improved regional joint prevention and control mechanisms and thus better address urban air pollution.
(2) Regulate pollution levels at a fine scale by zoning. Pollution prevention and control measures should be formulated according to the geographical features, meteorological conditions and economic development of each region, taking local conditions into account. Heavily polluted areas and general areas should be managed with differentiated controls, and the relevant government departments should speed up the improvement of early warning and treatment in heavily polluted areas.
(3) Implement seasonally differentiated control. This study found significant differences in PM 2.5 concentrations between seasons, which calls for targeted prevention and control measures. Such measures include reducing pollution through artificial precipitation, restricting motor vehicle use and managing heating reasonably.
(4) Strengthen the control of pollution at the source. Energy restructuring, energy conservation and emission reduction efforts need to be increased to prevent and control air pollution at the source. Predictive warnings can be used to allocate the functional tasks of agency staff rationally across areas with different PM 2.5 levels. Information on pollution sources should be released promptly to support the transformation from governance to prevention.