Spatial Estimation of Regional PM 2.5 Concentrations with GWR Models Using PCA and RBF Interpolation Optimization

Abstract: In recent years, geographically weighted regression (GWR) models have been widely used to address the spatial heterogeneity and spatial autocorrelation of PM 2.5, but these studies have not fully considered the effects of all potential variables on PM 2.5 variation and have rarely optimized the models for residuals. Therefore, we first propose a modified GWR model based on principal component analysis (PCA-GWR), then introduce five different spatial interpolation methods of radial basis functions to correct the residuals of the PCA-GWR model, and finally construct five combinations of residual correction models to estimate regional PM 2.5 concentrations. The results show that (1) the PCA-GWR model can fully consider the contributions of all potential explanatory variables to estimate PM 2.5 concentrations and minimize the multicollinearity among explanatory variables, and the PM 2.5 estimation accuracy and the fitting effect of the PCA-GWR model are better than the original GWR model. (2) All five residual correction combination models can better achieve the residual correction optimization of the PCA-GWR model, among which the PCA-GWR model corrected by Multiquadric Spline (MS) residual interpolation (PCA-GWRMS) has the most obvious accuracy improvement and more stable generalizability at different time scales. Therefore, the residual correction of PCA-GWR models using spatial interpolation methods is effective and feasible, and the results can provide references for regional PM 2.5 spatial estimation and spatiotemporal mapping. (3) The PM 2.5 concentrations in the study area are high in winter months (January, February, December) and low in summer months (June, July, August), and spatially, PM 2.5 concentrations show a distribution of high north and low south.


Introduction
In recent years, with accelerated urbanization, industrialization, and modernization, air pollution has become increasingly serious. PM 2.5, one of the main air pollutants in China, has drawn widespread concern both in scientific fields, including atmospheric environmental protection, and among the general public [1][2][3]. PM 2.5 is highly active, small in size but large in surface area, remains suspended in the air for a long time, and easily adsorbs heavy metals, microorganisms, and other toxic and harmful substances. It not only directly reduces atmospheric visibility by scattering and absorbing sunlight, disturbing people's daily lives, but can also reach the end of the human respiratory tract through airflow, directly endangering human health [4][5][6][7]. PM 2.5 data come from precise measurements at PM 2.5 ground monitoring stations, but because these stations in China are limited in number and spatial coverage and are unevenly distributed, data can only be obtained for specific areas. Therefore, many experts and scholars have conducted a series of studies on how to obtain high-precision PM 2.5 concentrations in areas without monitoring stations and to explore the spatial and temporal distribution of PM 2.5.
Remote Sens. 2022, 14, 5626
Therefore, some scholars have proposed the geographically weighted regression (GWR) model [29], which can better explain the problem of spatial autocorrelation and spatial heterogeneity in the existence of PM 2.5 and has a high accuracy for PM 2.5 estimation. For example, Zou et al. compared the accuracy of land-use regression (LUR) and GWR models for PM 2.5 mapping in California, USA, and showed that the GWR model had higher mapping accuracy than the LUR model [30]. Gu et al. estimated the spatial distribution of urban PM 2.5 in China in 2016 using the IDW method and the GWR model by combining socioeconomic activity factors such as population density, industrial structure, and the level of economic development and showed that the GWR model could better explain the spatial heterogeneity of the effects of various factors linked to socioeconomic activities on PM 2.5 among Chinese cities [31]. Zhang et al. introduced NO 2 and the enhanced vegetation index (EVI) into the GWR model and combined aerosol optical depth (AOD) and meteorological parameters to estimate the spatial distribution of PM 2.5 in the Chinese region. The results show that the GWR model with the introduction of NO 2 and EVI can explain about 87% of the spatial variation of PM 2.5, and its estimation accuracy is significantly higher than that of the original GWR model [32]. Xiao et al. used satellite-derived AOD, topographic data, meteorological data, and atmospheric pollutants to combine GWR analysis with Bayesian maximum entropy (BME) theory to assess the spatial and temporal characteristics of PM 2.5 exposure in most regions of China and achieve spatial and temporal distribution mapping of PM 2.5 in continuous regions [33]. Wei et al. used three interpolation methods, tension spline functions (TSF), empirical Bayesian kriging (EBK), and kriging, to correct the residuals of the GWR model and construct three combined models to spatially interpolate PM 2.5 during the National Day and Chinese New Year in south-central China. The results showed that meteorological factors and zenith tropospheric delay (ZTD) can better explain the spatial heterogeneity of PM 2.5, and the interpolation accuracy of the combined model of GWR and TSF is significantly higher than that of other combined interpolation models [34].
The factors influencing PM 2.5 are complex, diverse, and correlated with each other, which leads to multicollinearity among the independent variables of the model and affects its accuracy and performance [35]. To address this problem, most existing studies have used multicollinearity diagnosis to remove the explanatory variables responsible, thus reducing the multicollinearity among independent variables [36]. However, this approach may omit key influencing factors of PM 2.5 and thus cannot fully consider the influence of all potential explanatory variables on PM 2.5 changes [37][38][39]. Some scholars have therefore introduced the principal component analysis (PCA) method to optimize the GWR model and have achieved better estimation accuracy.
For example, Guo et al. used the PCA method to extract eight environmental variables (elevation, slope, normalized vegetation index, etc.) by dimensionality reduction and then used the extracted principal component variables to construct a GWR model to spatially simulate soil organic carbon storage in Forked River Town, China. The results showed that the PCA method played an important role in reducing the redundancy and multicollinearity of auxiliary variables, and the prediction accuracy of the GWR model constructed based on principal components was higher than that of the ordinary least squares regression and ordinary collaborative kriging models constructed based on principal components [40]. Zhang et al. used PCA to extract five factors associated with COVID-19 mortality by downscaling from 14 indicators of social, economic, and environmental impacts, which were used to construct a GWR model that effectively analyzed the spatial and temporal characteristics of the sources triggering COVID-19 mortality [41]. Zhai et al. estimated the spatial distribution of PM 2.5 in the Beijing-Tianjin-Hebei region using a geographically weighted regression model based on principal component analysis (PCA-GWR), and the results showed that the PCA method improved the estimation accuracy of the GWR model by fully considering the contribution of all potential predictor variables to PM 2.5 variation. Additionally, the PCA-GWR model generated PM 2.5 spatial distribution maps that clearly portrayed more details of spatial variability than conventional GWR models [42].
In summary, all existing studies can achieve PM 2.5 concentration estimation in areas without monitoring stations, but in terms of solving the spatial autocorrelation and spatial heterogeneity of PM 2.5 , GWR models can better explain these two characteristics and are more effective in estimating PM 2.5 concentrations in areas without monitoring stations. In addition, there are still relatively few studies on the application of PCA methods to GWR models for PM 2.5 spatial distribution estimation and relatively few studies on the quadratic correction of residuals for GWR models optimized based on principal component analysis. Therefore, we consider these aspects together and use the atmospheric pollutants, meteorological data, normalized vegetation index, elevation, population size, and zenith wet delay (ZWD) data of the middle and lower reaches of the Yangtze River as the database. We use the GWR model as the base model, combined with the PCA and the radial basis function (RBF) interpolation method based on five different basis functions, to construct six GWR improvement models (PCA-GWR, PCA-GWRCRS, PCA-GWRTS, PCA-GWRMS, PCA-GWRTPS, PCA-GWRIMS) to estimate the spatial distribution of PM 2.5 in the study area. We then compare their interpolation accuracy and model performance and select the method with the best accuracy to generate the spatial distribution map of PM 2.5 concentration in the study area.

Study Area and Data Preprocessing
The middle and lower reaches of the Yangtze River Economic Belt (hereinafter collectively referred to as the middle and lower reaches of the Yangtze River) span the central-eastern region of China, located between 24°29′-35°08′ N latitude and 108°21′-123°10′ E longitude, and comprise six major provinces and one municipality directly under the Central Government (the lower reaches include Shanghai, Jiangsu, Zhejiang and Anhui, and the middle reaches include Jiangxi, Hubei, and Hunan). This region is one of the most developed economic regions in China. It accounts for more than a quarter of the Chinese population and approximately one-third of the Chinese gross domestic product (GDP). The Yangtze River Economic Zone has important ecological value, strong comprehensive strength, and great development potential; promoting the development of the Yangtze River Economic Zone is important for China's economic development. However, with the economic growth of the Yangtze River Economic Zone, increases in population and motor vehicles, coupled with a regional consumption structure dominated by coal, cause severe air pollution, especially in the middle and lower reaches of the Yangtze River. This pollution has become the focus of air environment management and has received widespread public attention. PM 2.5 is the main indicator of air pollutants. To support China's pollution prevention and control battle and ecological environmental protection strategy, we take the monthly average PM 2.5 concentration data collected from PM 2.5 ground monitoring stations in the middle and lower reaches of the Yangtze River economic belt for 2018-2020 as the research object.
Atmospheric pollutant (PM 2.5, O 3, CO, NO 2, SO 2) data were obtained from PM 2.5 ground monitoring station observations (data from http://envi.ckcest.cn/environment/, accessed on 14 June 2022), meteorological data were obtained from meteorological monitoring station observations (data from http://data.cma.cn/, accessed on 14 June 2022), and elevation (ELE) data were obtained from the SRTMDEMUTM 90 m resolution digital elevation data product of the Geospatial Data Cloud (data from https://www.gscloud.cn/sources, accessed on 3 July 2022). The elevation, air quality monitoring station, and meteorological monitoring station distribution map is shown in Figure 1.
ZWD is the wet component of the ZTD due to water vapor in the atmosphere [43,44]. ZTD is a signal propagation delay formed by the bending and delay of electromagnetic wave signals emitted by Global Navigation Satellite System (GNSS) [45] satellites as they traverse the troposphere due to the influence of atmospheric refraction [46,47]. The ZWD data used in the experiments were obtained from the VMF data server platform (https://vmf.geo.tuwien.ac.at/, accessed on 7 May 2022).
The normalized difference vegetation index (NDVI) is one of the important parameters to reflect crop growth and nutrient information, which can detect vegetation growth and vegetation cover and reflect the background influence of the plant canopy [48,49]. The NDVI data used in the experiment were obtained from the Data Center for Resource and Environmental Sciences, Chinese Academy of Sciences (http://www.resdc.cn/, accessed on 10 December 2021); the population size (POP) data were obtained from the Worldpop website (https://hub.worldpop.org/, accessed on 3 July 2022). Table 1 indicates the time scale of each variable and access to information on the type and spatial resolution of these variables.
From Figure 1, we can see that the number of meteorological monitoring stations is smaller than the number of PM 2.5 concentration monitoring stations, and ZWD is grid data. To ensure the smooth implementation of the subsequent experiments, we use the IDW method to spatially interpolate the meteorological data (TEM, PRS, WS, RH) and ZWD to obtain the corresponding raster data [17,34]. Figure 2 shows the root-mean-square error of the cross-validation results of the IDW interpolated meteorological and ZWD data.
From Figure 2, we can see that the RMSEs of the meteorological factors (TEM, PRS, WS, RH) for 2018-2020 all remain small and relatively stable, indicating that IDW interpolation is applicable and stable for the meteorological factors. The RMSE of the ZWD data, by contrast, differs considerably between months. We found that this was caused by large differences in the ZWD values themselves: the difference between the mean ZWD in June-August and the mean ZWD in December, January, and February was nearly fourfold, while the differences between the monthly means of the meteorological factors were all less than onefold. Therefore, the interpolation accuracy of the IDW method for ZWD is high relative to the magnitude of ZWD, and the interpolated data can be used for subsequent studies.
After obtaining the meteorological factor raster and ZWD raster with higher accuracy using the IDW method, we extracted the values of all raster data to the corresponding PM 2.5 ground monitoring points using the spatial analysis tool of ArcGIS 10.4 software to obtain explanatory variables with a uniform spatial and temporal scale with PM 2.5 data.
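The IDW interpolation and its cross-validation RMSE described above can be illustrated with a minimal sketch. This is not the ArcGIS implementation used in the study: the function names, the power of 2, the use of all stations as neighbors, and the leave-one-out scheme are assumptions chosen for illustration.

```python
import numpy as np

def idw_interpolate(xy_known, values, xy_query, power=2.0, eps=1e-12):
    """Inverse distance weighting: each station is weighted by 1 / distance**power."""
    xy_known = np.asarray(xy_known, dtype=float)
    values = np.asarray(values, dtype=float)
    xy_query = np.asarray(xy_query, dtype=float)
    # Pairwise distances, shape (n_query, n_known)
    d = np.linalg.norm(xy_query[:, None, :] - xy_known[None, :, :], axis=2)
    # eps avoids division by zero when a query point coincides with a station
    w = 1.0 / np.maximum(d, eps) ** power
    return (w @ values) / w.sum(axis=1)

def idw_loo_rmse(xy_known, values, power=2.0):
    """Leave-one-out cross-validation RMSE (assumed scheme, for illustration)."""
    xy_known = np.asarray(xy_known, dtype=float)
    values = np.asarray(values, dtype=float)
    n = len(values)
    errors = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i  # drop station i, predict it from the rest
        pred = idw_interpolate(xy_known[keep], values[keep], xy_known[i:i + 1], power)
        errors[i] = pred[0] - values[i]
    return float(np.sqrt(np.mean(errors ** 2)))
```

Calling `idw_loo_rmse` once per variable and month would give values comparable in spirit to those plotted in Figure 2, although the exact numbers depend on the neighborhood settings, which are not stated in the text.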

GWR Model
The geographically weighted regression (GWR) model is a spatial analysis technique that embeds the spatial location of the data into the regression equation of the traditional linear regression model. Because it takes into account the local effects of spatial objects, it can better explain the spatial heterogeneity and spatial autocorrelation in spatial data and has a high estimation accuracy. The principle of the GWR model is as follows [29][30][31][32][33][34]:

$$F_i = \beta_0(u_i, v_i) + \sum_{k=1}^{p} \beta_k(u_i, v_i)\, x_{ik} + \varepsilon_i$$

where $F_i$ is the observed value of the sample point and is used as the dependent variable in the GWR model, $(u_i, v_i)$ are the coordinates of the i-th sample point, $\beta_k(u_i, v_i)$ is the k-th regression coefficient at the i-th sample point, $x_{ik}$ is the k-th explanatory variable of the i-th observation point, $p$ is the total number of explanatory variables, $\varepsilon_i$ is the regression residual, and $\beta_0(u_i, v_i)$ is the regression intercept term of the model at the i-th sample point. The weighted least squares method was used to estimate the model regression coefficients; the coefficient matrix for each point is as follows:

$$\hat{\beta}(u_i, v_i) = \left(X^{T} W(u_i, v_i)\, X\right)^{-1} X^{T} W(u_i, v_i)\, Y$$

where $W(u_i, v_i)$ is the diagonal matrix of spatial weights, $X$ is the design matrix of independent variables, and $Y$ is the matrix of dependent variables. The spatial weight matrix $W$ is calculated using the bi-square function:

$$w_{ij} = \begin{cases} \left[1 - \left(d_{ij}/\theta\right)^{2}\right]^{2}, & d_{ij} \le \theta \\ 0, & d_{ij} > \theta \end{cases}$$

where $w_{ij}$ is the weight between the known sample point j and the point i to be estimated, $d_{ij}$ is the Euclidean distance between the point i to be estimated and the sample point j, and $\theta$ is the bandwidth, which is selected using the corrected Akaike information criterion (AICc); the bandwidth of the weight function is optimal when the AICc is smallest.
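To make the local estimation concrete, the following is a minimal sketch of GWR prediction with a bi-square kernel and a fixed bandwidth. The function names are hypothetical, the bandwidth is passed in by hand rather than selected by AICc, and the sketch assumes enough sample points fall inside the bandwidth for the local system to be solvable.

```python
import numpy as np

def bisquare_weights(d, bandwidth):
    """Bi-square kernel: (1 - (d/theta)^2)^2 inside the bandwidth, 0 outside."""
    d = np.asarray(d, dtype=float)
    w = (1.0 - (d / bandwidth) ** 2) ** 2
    return np.where(d <= bandwidth, w, 0.0)

def gwr_predict(coords, X, y, query_coords, query_X, bandwidth):
    """Fit one weighted least squares regression per query point and predict there."""
    coords = np.asarray(coords, dtype=float)
    y = np.asarray(y, dtype=float)
    query_coords = np.asarray(query_coords, dtype=float)
    # Design matrices with a leading column of ones for the local intercept
    Xd = np.column_stack([np.ones(len(coords)), np.asarray(X, dtype=float)])
    Xq = np.column_stack([np.ones(len(query_coords)), np.asarray(query_X, dtype=float)])
    preds = np.empty(len(query_coords))
    for i, (pt, xq) in enumerate(zip(query_coords, Xq)):
        d = np.linalg.norm(coords - pt, axis=1)
        w = bisquare_weights(d, bandwidth)
        XtW = Xd.T * w  # same as X^T W for a diagonal weight matrix W
        beta = np.linalg.solve(XtW @ Xd, XtW @ y)  # local coefficients
        preds[i] = xq @ beta
    return preds
```

In practice the bandwidth would be chosen by evaluating AICc over a range of candidate values, which this sketch leaves out.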

PCA-GWR Model
PCA is a statistical method that effectively reduces the dimensionality of data. It explores the trend of multiple variables and converts multiple potential explanatory variables into new, mutually independent linear combinations that replace the original variables; these new combinations are called principal components. The number of principal components to extract is determined by their contribution to explaining the variables: generally, principal components are extracted until their cumulative contribution reaches 90% or more; otherwise, the number of principal components should be adjusted [40][41][42]. PCA was performed primarily with the integrated tools of SPSSPRO (Scientific Platform Serving for Statistics Professional, Version 1.0.11, online application software; https://www.spsspro.com, accessed on 13 July 2022).
The PCA-GWR model is a combinatorial optimization model, which is based on the principle of using the PCA method to extract the principal components as new independent variables, instead of the original independent variables, to establish the modified GWR model; it not only fully considers the contribution of all potential explanatory variables to the changes in the dependent variable but also effectively addresses multicollinearity among explanatory variables. The main processes are as follows:

• Step 1: The independent variables of the GWR model were standardized, and the Kaiser-Meyer-Olkin (KMO) test and Bartlett's test of sphericity were then performed on the data. If the KMO value was greater than 0.5 and the p-value of Bartlett's test of sphericity was less than 0.05, the independent variables were strongly correlated and PCA could be performed; otherwise, the data were not suitable for PCA [50].

• Step 2: The correlation between PM 2.5 and the independent variables was analyzed using the gray relation analysis (GRA) [51] integrated tool in SPSSPRO to obtain the gray relational grade; the closer the gray relational grade was to 1, the higher the correlation between the variable and PM 2.5.

• Step 3: The variables with high correlation (gray relational grade > 0.9) were selected as input variables for PCA, and all principal components were calculated using the PCA integration tool in SPSSPRO.

• Step 4: All the principal components were ranked and cumulatively summed according to their percentage of variance, and those with a cumulative percentage of variance greater than or close to 90% were selected as the final input variables of the GWR model. The PCA-GWR model was then constructed to obtain the estimation results of the target variables.
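The standardization and component selection in Steps 1 and 4 can be sketched as follows. This is an illustrative NumPy version, not the SPSSPRO procedure: the function name is hypothetical, the KMO and Bartlett tests are omitted, and the sketch applies a strict threshold of 90% cumulative variance, whereas the text also admits values close to 90%.

```python
import numpy as np

def pca_components(X, target=0.90):
    """Return principal component scores whose cumulative variance reaches `target`."""
    X = np.asarray(X, dtype=float)
    # Standardize each explanatory variable (zero mean, unit variance)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    corr = np.corrcoef(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # re-sort by explained variance, descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    cum = np.cumsum(eigvals) / eigvals.sum()
    # Smallest number of components whose cumulative share is >= target
    k = min(int(np.searchsorted(cum, target)) + 1, len(eigvals))
    scores = Z @ eigvecs[:, :k]              # new, mutually uncorrelated inputs
    return scores, cum[:k]
```

The returned scores would then replace the original explanatory variables as the inputs of the GWR model.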

RBF Interpolation
Radial basis function (RBF) interpolation is an exact deterministic spatial interpolation method that makes no assumptions about the data and provides accurate prediction surfaces, which is beneficial for dealing with scattered data and approximating surfaces. In addition, it can interpolate predicted values larger than the maximum and smaller than the minimum of the observed values when the maximum and minimum of the spatial data are not clear, and it has the advantages of a simple computational format, flexible node configuration, small computational effort, and relatively high accuracy. The RBF interpolation in this research was implemented using the RBF interpolation analysis tool in the geostatistical analysis toolkit of ArcGIS 10.4 software [52]. The basic principle of the model can be expressed as follows [53,54]:

$$Z(x, y) = \sum_{i=1}^{n} \lambda_i\, \varphi(r_i) + T(x, y)$$

where $(x, y)$ are the coordinates of the point to be interpolated, $n$ is the number of sample points, $\lambda_i$ is the weight coefficient obtained by solving the linear system of equations, $\varphi(r_i)$ is the basis function, and $T(x, y) = a + bx + cy$ is the trend function. The coefficients of the trend function $T(x, y)$ are solved using the least squares method, and the following constraints must be satisfied when solving:

$$\sum_{i=1}^{n} \lambda_i = 0, \qquad \sum_{i=1}^{n} \lambda_i x_i = 0, \qquad \sum_{i=1}^{n} \lambda_i y_i = 0$$

where $(x_i, y_i)$ are the coordinates of sample point i.
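The basis functions of the RBF are chosen from the Completely Regular Spline (CRS) function $\varphi_{CRS}(r_i)$, the Tension Spline (TS) function $\varphi_{TS}(r_i)$, the Multiquadric Spline (MS) function $\varphi_{MS}(r_i)$, the Inverse Multiquadric Spline (IMS) function $\varphi_{IMS}(r_i)$, and the Thin Plate Spline (TPS) function $\varphi_{TPS}(r_i)$. The five basis functions [55] are calculated as follows:

$$\varphi_{CRS}(r_i) = \ln\left[\left(\omega r_i / 2\right)^{2}\right] + E_0\left[\left(\omega r_i / 2\right)^{2}\right] + c_0$$

$$\varphi_{TS}(r_i) = \ln\left(\omega r_i / 2\right) + K_0\left(\omega r_i\right) + c_0$$

$$\varphi_{MS}(r_i) = \sqrt{r_i^{2} + \omega^{2}}$$

$$\varphi_{IMS}(r_i) = \frac{1}{\sqrt{r_i^{2} + \omega^{2}}}$$

$$\varphi_{TPS}(r_i) = \left(\omega r_i\right)^{2} \ln\left(\omega r_i\right)$$

where $r_i$ is the Euclidean distance between the point $(x, y)$ to be interpolated and the i-th sample point, $E_0$ is the exponential integral function, $K_0$ is the modified Bessel function, $c_0$ is a constant (0.577215), and $\omega$ is the smoothing factor. The optimal smoothing factor for each basis function is calculated automatically by the parameter optimization function in the geostatistical analysis tool of ArcGIS 10.4.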

Combined Model with Residual Correction Based on the RBF Interpolation
The PCA-GWR model is an inexact spatial interpolation method: its estimated value at a known location is not equal to the known value. Residual interpolation correction of the PCA-GWR estimates can therefore further improve the accuracy of PM 2.5 estimation. In addition, because the factors influencing PM 2.5 are diverse and complex, the PCA-GWR model cannot fully explain the spatial variation of PM 2.5, so its residuals retain some spatial autocorrelation; a spatial interpolation method can thus be used to correct the residuals of the PCA-GWR model and further explain the spatial characteristics of PM 2.5.
Therefore, based on these two considerations, five radial basis function (RBF) interpolation methods based on different basis functions (CRS, TS, MS, TPS, IMS) are selected to interpolate the residuals of the PCA-GWR model and further optimize its interpolation accuracy, and five residual correction models (PCA-GWRCRS, PCA-GWRTS, PCA-GWRMS, PCA-GWRTPS, PCA-GWRIMS) are constructed. Their model principles are described as follows:

$$F_{\mathrm{PCA\text{-}GWRRBF}} = \hat{F}_{\mathrm{PCA\text{-}GWR}} + Z_{\mathrm{RES}}(\mathrm{RBF})$$

where $F_{\mathrm{PCA\text{-}GWRRBF}}$ denotes the value after residual RBF interpolation correction of the estimated values of the PCA-GWR model, $\hat{F}_{\mathrm{PCA\text{-}GWR}}$ denotes the estimated value of the PCA-GWR model, and $Z_{\mathrm{RES}}(\mathrm{RBF})$ denotes the residual estimates obtained after the RBF interpolation of the regression residuals of the PCA-GWR model. The subscript RBF indicates the five different RBF spatial interpolation methods (CRS, TS, MS, TPS, IMS).
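As an illustration of the residual correction, the sketch below interpolates model residuals with a multiquadric basis function plus a linear trend and adds the interpolated residual to the model estimate. It is not the ArcGIS tool used in the study: the function names are hypothetical, the smoothing factor `omega` is fixed by hand instead of being optimized, residuals are assumed to be defined as observed minus estimated, and the weights and trend coefficients are obtained from one augmented linear system that enforces the three constraints on the weights.

```python
import numpy as np

def multiquadric(r, omega):
    """Multiquadric basis function sqrt(r^2 + omega^2)."""
    return np.sqrt(r ** 2 + omega ** 2)

def fit_rbf(xy, z, omega=1.0):
    """Solve for RBF weights and the linear trend a + b*x + c*y."""
    xy = np.asarray(xy, dtype=float)
    z = np.asarray(z, dtype=float)
    n = len(z)
    r = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=2)
    Phi = multiquadric(r, omega)
    P = np.column_stack([np.ones(n), xy])  # trend terms 1, x, y
    # The lower block rows enforce sum(l) = sum(l*x) = sum(l*y) = 0
    A = np.block([[Phi, P], [P.T, np.zeros((3, 3))]])
    rhs = np.concatenate([z, np.zeros(3)])
    sol = np.linalg.solve(A, rhs)
    return sol[:n], sol[n:]

def eval_rbf(xy_known, lam, trend, xy_query, omega=1.0):
    """Evaluate the fitted surface at the query points."""
    xy_known = np.asarray(xy_known, dtype=float)
    xy_query = np.asarray(xy_query, dtype=float)
    r = np.linalg.norm(xy_query[:, None, :] - xy_known[None, :, :], axis=2)
    P = np.column_stack([np.ones(len(xy_query)), xy_query])
    return multiquadric(r, omega) @ lam + P @ trend

def residual_corrected(pred_query, xy_known, residuals, xy_query, omega=1.0):
    """Corrected estimate = model estimate + interpolated residual."""
    lam, trend = fit_rbf(xy_known, residuals, omega)
    return np.asarray(pred_query, dtype=float) + eval_rbf(
        xy_known, lam, trend, xy_query, omega)
```

Because the interpolator is exact, the corrected estimate reproduces the observations at the stations used for fitting, so accuracy has to be judged on held-out stations.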

Evaluation Indicators
To evaluate the model accuracy more intuitively, we use four metrics, the root mean square error (RMSE), mean absolute error (MAE), mean absolute percentage error (MAPE), and the coefficient of determination (R 2), to comprehensively evaluate the model's accuracy and performance:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(x_i - \hat{x}_i\right)^{2}}$$

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left|x_i - \hat{x}_i\right|$$

$$\mathrm{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left|\frac{x_i - \hat{x}_i}{x_i}\right|$$

$$R^{2} = 1 - \frac{\sum_{i=1}^{n} \left(x_i - \hat{x}_i\right)^{2}}{\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^{2}}$$

where $n$ is the total number of samples, $x_i$ is the observed value of the target variable at the i-th position, $\hat{x}_i$ is the estimated output of the model at the same position, and $\bar{x}$ is the average of the observed values over all samples. In general, RMSE and MAE are mainly used to evaluate the estimation accuracy of the model: the smaller the value, the higher the estimation accuracy, and vice versa. MAPE and R 2 are mainly used to evaluate the performance of the model: the smaller the MAPE and the closer R 2 is to 1, the better the performance and fitting effect of the model, and vice versa.
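The four indicators can be computed with a few lines of NumPy. The sketch below assumes the standard definitions (with R 2 computed as one minus the ratio of the residual sum of squares to the total sum of squares) and that no observed value is zero; the function name is hypothetical.

```python
import numpy as np

def evaluate(observed, estimated):
    """RMSE, MAE, MAPE (in percent), and R^2 for paired observations and estimates."""
    x = np.asarray(observed, dtype=float)
    x_hat = np.asarray(estimated, dtype=float)
    err = x - x_hat
    return {
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "MAE": float(np.mean(np.abs(err))),
        "MAPE": float(100.0 * np.mean(np.abs(err / x))),  # undefined if any x is 0
        "R2": float(1.0 - np.sum(err ** 2) / np.sum((x - x.mean()) ** 2)),
    }
```

For example, observations of 10, 20, and 30 with estimates of 12, 18, and 30 give an MAE of about 1.33, a MAPE of 10%, and an R 2 of 0.96.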

Analysis of PM 2.5 and Its Related Explanatory Variables

PM 2.5 Descriptive Statistics
To further understand the change in PM 2.5 concentration in January-December 2018-2020, we conducted descriptive statistics on PM 2.5 ground monitoring station data, and the results are shown in Figure 3.
From Figure 3, it can be seen that the maximum (Figure 3a), minimum (Figure 3b), mean (Figure 3c), and standard deviation (Figure 3d) of PM 2.5 concentrations in January-December 2018-2020 show a 'U'-shaped distribution; therefore, it can be concluded that PM 2.5 concentrations are high in December, January, and February, low in June-August, and moderate in March-May and September-November each year. In addition, the standard deviation of PM 2.5 concentrations in December, January, and February of 2018-2020 is greater than that of the remaining months, indicating that the PM 2.5 data for December, January, and February are more discrete and less stable, while the data for the remaining months are more stable.

GRA
To ensure that both PCA and GWR models have good modeling effects and that subsequent tests were carried out smoothly, we used the GRA method to analyze the correlation between PM 2.5 and 12 explanatory variables and measured the correlation between the two variables by the closeness of the gray relational grade to 1 [56]. The GRA results are shown in Figure 4.
From Figure 4, it can be concluded that the gray relational grades between PM 2.5 and the 12 explanatory variables (CO, NDVI, NO 2, O 3, PRS, RH, SO 2, TEM, WS, ZWD, ELE, and POP) are all greater than 0.9. The grades for ELE and POP are lower than those of the other variables, and the highest grade is that of PRS; on the monthly scale, the gray relational grade in 2018-2019 showed a slow trend of increasing and then decreasing. In summary, PM 2.5 has a high correlation with all 12 explanatory variables, which can be used as input variables for the construction of GWR and PCA models.
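For readers unfamiliar with gray relation analysis, the sketch below computes gray relational grades in the commonly used Deng formulation. It is an assumption that SPSSPRO follows this variant: the mean normalization, the distinguishing coefficient of 0.5, and the function name are illustrative choices, not settings reported in the text.

```python
import numpy as np

def gray_relational_grades(reference, factors, rho=0.5):
    """Gray relational grade of each factor column with respect to the reference series."""
    ref = np.asarray(reference, dtype=float)
    fac = np.asarray(factors, dtype=float)  # shape (n_samples, n_factors)
    # Make the series dimensionless by dividing by their means
    # (assumes non-zero means; other normalizations are also in use)
    ref_n = ref / ref.mean()
    fac_n = fac / fac.mean(axis=0)
    delta = np.abs(fac_n - ref_n[:, None])  # absolute difference sequences
    dmin, dmax = delta.min(), delta.max()
    if dmax == 0:
        return np.ones(fac.shape[1])        # every factor matches the reference
    coef = (dmin + rho * dmax) / (delta + rho * dmax)  # relational coefficients
    return coef.mean(axis=0)                # grade = mean coefficient per factor
```

A factor that is exactly proportional to the reference series receives a grade of 1, and the grade decreases as the normalized series diverge.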

Multicollinearity Diagnosis
Multicollinearity refers to the distortion of model estimates due to the correlation between explanatory variables in a linear regression model; it is necessary to test for multicollinearity among explanatory variables before constructing a GWR model. Selecting a combination of variables suitable for modeling based on the diagnostic results can ensure the accuracy of model estimation. Therefore, the exploratory regression method in the spatial statistics tool of ArcGIS 10.4 is used to test the multicollinearity among the 12 explanatory variables, and the severity of multicollinearity is judged by the magnitude of the output variance inflation factor (VIF). The closer the VIF is to 1, the weaker the multicollinearity among the variables; the further the VIF exceeds 1, the more severe the multicollinearity. If the VIF is between 1 and 5, the multicollinearity among the explanatory variables is mild and its impact on the estimation accuracy of the regression model is negligible. If the VIF is greater than 5, the multicollinearity among the explanatory variables is more serious, its impact on the estimation accuracy of the model is not negligible, and a reasonable method must be used to address it [57,58]. The results of the diagnosis of multicollinearity among the explanatory variables are shown in Figure 5.
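The VIF of an explanatory variable is 1 / (1 - R 2), where R 2 comes from regressing that variable on all the others. The sketch below computes it directly with NumPy; it illustrates the definition and is not the ArcGIS exploratory regression tool, and the function name is hypothetical.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of X."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        # Regress variable j on the remaining variables plus an intercept
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out
```

Applied to the 12 explanatory variables of one month, values above 5 would flag the variables that the stepwise diagnosis considers for exclusion.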
[...] Figure 5c)) causing the highest likelihood of multicollinearity (VIF > 10). However, the VIFs of the explanatory variables for PM 2.5 in June 2020 and November 2019 are less than 5, indicating that the multicollinearity between the explanatory variables for PM 2.5 in these two months is small. Therefore, to minimize the multicollinearity among the explanatory variables of PM 2.5 and obtain the best combination of explanatory variables suitable for PM 2.5 estimation in all months, we screened and excluded the explanatory variables with large VIFs and performed a stepwise multicollinearity diagnosis. The results show that in months with more severe multicollinearity (VIF > 5), the VIFs of the remaining explanatory variables after excluding TEM and ZWD all decrease to varying degrees and are all less than 5. Moreover, there is a corresponding decrease in the VIFs of the remaining explanatory variables in November 2019 and June 2020 after the exclusion of these two variables. In summary, we used the remaining 10 explanatory variables of PM 2.5 (CO, NO 2 , O 3 , SO 2 , PRS, WS, RH, NDVI, ELE, and POP) to construct a GWR model for the interpolation estimation of the PM 2.5 spatial distribution.

PCA
Although the 10 explanatory variables selected by stepwise exploratory regression effectively reduced the multicollinearity among the explanatory variables, two explanatory variables (ZWD and TEM) with a high correlation with PM 2.5 were excluded, and the contribution of all potential explanatory variables to the change in PM 2.5 was not fully considered; hence, we chose the PCA method to reduce the dimensionality of the 12 explanatory variables to further minimize the influence of multicollinearity among the explanatory variables while maximizing the contribution of the explanatory variables to the variation in PM 2.5 spatial distribution.
Before conducting principal component analysis, we conducted the Kaiser-Meyer-Olkin (KMO) test and Bartlett's test of sphericity on the explanatory variables for each month of 2018-2020. The KMO values were greater than 0.5 for each month of 2018-2020, and the p-values of Bartlett's test of sphericity were 0.000 *** (Note: *** represents a 1% significance level), which basically meets the requirements of principal component analysis and allows for the PCA of the explanatory variables; the percentages of variance of the principal components are shown in Figure 6.
As shown in Figure 6, the percentage of variance of the first principal component (PC1) for January-December 2018-2020 is between 20 and 35%, the percentage of variance of the second principal component (PC2) is between 14 and 20%, and the percentages of variance of the third principal component (PC3) and fourth principal component (PC4) are approximately 10%; the remaining percentages of variance are below 10% and decrease as the principal component number increases. The cumulative percentage of variance of PC1-PC8 is approximately 90%, indicating that PC1-PC8 account for approximately 90% of the variance in the 12 explanatory variables, so we selected PC1-PC8 as the independent variables of the GWR model and constructed the PCA-GWR model for PM 2.5 spatial distribution estimation.
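The component selection described above can be illustrated with a short sketch, assuming that the explanatory variables are standardized, that PCA is carried out on their correlation matrix, and that components are retained until a cumulative variance target of 90% is reached. The function name, the 0.90 target as a hard cut-off, and the synthetic input are assumptions for illustration; the source does not state which software or exact rule was used.

```python
import numpy as np


def select_components(X, target=0.90):
    """PCA on standardized variables; keep the fewest PCs reaching `target` cumulative variance."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # standardize each variable
    corr = np.corrcoef(Z, rowvar=False)               # correlation matrix of the variables
    eigvals, eigvecs = np.linalg.eigh(corr)           # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]                 # reorder to descending variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratio = eigvals / eigvals.sum()                   # percentage of variance per PC
    cumulative = np.cumsum(ratio)
    k = min(int(np.searchsorted(cumulative, target)) + 1, len(ratio))
    scores = Z @ eigvecs[:, :k]                       # PC scores used as regression inputs
    return ratio, k, scores


# Synthetic illustration (not the paper's data): 12 correlated variables at 300 sites.
rng = np.random.default_rng(1)
latent = rng.normal(size=(300, 4))
X_demo = latent @ rng.normal(size=(4, 12)) + 0.5 * rng.normal(size=(300, 12))
ratio, k, scores = select_components(X_demo, target=0.90)
print(k, np.round(np.cumsum(ratio)[:k], 3))
```

Because the retained score columns are mutually uncorrelated, they can be passed to a regression model without the multicollinearity that affects the original variables, while still drawing on all 12 inputs.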

Comparison of Model Accuracy
Through a series of exploratory analyses, we finally selected 10 variables (CO, NO 2 , O 3 , SO 2 , PRS, WS, RH, NDVI, ELE, and POP) to construct the GWR model and used principal component analysis to extract eight principal components, whose cumulative contribution to the 12 explanatory variables was nearly 90%, to construct the PCA-GWR model. We then compared the accuracy and performance of the two models; the results are shown in Figure 7.
Comparing Figure 7a 1 -a 3 ,b 1 -b 3 , it can be seen that the RMSE values of both the GWR and PCA-GWR models are less than 8 µg/m 3 and the MAE values are less than 6 µg/m 3 for the monthly average PM 2.5 estimation from 2018 to 2020, which indicates that both models have high accuracy in estimating PM 2.5 concentrations. In addition, the RMSE and MAE of the PCA-GWR model generally improved to different degrees compared with the GWR model, with the RMSE in November 2018, June 2019, September 2020, and December 2020 improving significantly compared with the GWR model by 9.89%, 17.94%, 11.59%, and 12.98%, respectively. The MAEs in November 2018, June 2019, September 2020, and December 2020 were also more significantly optimized relative to the GWR model, with improvements of 12.20%, 16.86%, 12.20%, and 9.51%, respectively; therefore, it can be concluded that the spatial estimation accuracy of the PCA-GWR model for PM 2.5 is better than that of the GWR model.
From Figure 7c 1 -c 3 ,d 1 -d 3 , it can be seen that the MAPE of the PCA-GWR model is smaller than that of the GWR model in 2018-2020, where the improvement of MAPE relative to the GWR model is more obvious in November 2018, June 2019, September 2020, and December 2020, with 11.39%, 19%, 11.63%, and 10.11%, respectively, in that order. The MAPE of the PCA-GWR model remains between 10 and 13 in June-August and is less than 10 in the rest of the months, with the smallest MAPEs in February 2018, December 2019, and January 2020. Meanwhile, the R 2 of the PCA-GWR model is larger than that of the GWR model, among which the R 2 values in July 2018, June 2019, and September 2020 are more significantly improved relative to the GWR model by 15.79%, 23.53%, and 12.86%, respectively, in that order. The R 2 of the PCA-GWR model is larger in January-March and October-December than in April-September, among which the PCA-GWR model had R 2 values greater than 0.9 in January, November, December 2018-2019, and January and December 2020.
In summary, compared with the GWR model, the PCA-GWR model can not only fully consider the contribution of all potential explanatory variables to the PM 2.5 changes and minimize the multicollinearity among explanatory variables, but also effectively improve the precision and fitting effect of the spatial estimation of PM 2.5 ; hence, it is feasible to optimize the GWR model using principal component analysis. However, the fitting effect and estimation accuracy for some months are still relatively poor, so we subsequently considered the residual correction process of the PCA-GWR model using the spatial interpolation method to further improve the estimation accuracy and fitting effect of PM 2.5 .
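For reference, the four accuracy measures compared above can be sketched as follows. The source does not give its formulas, so the definitions here are the common ones and should be read as assumptions: R 2 is taken as the coefficient of determination, MAPE is expressed in percent, and the relative improvement is taken as (baseline - improved) / baseline × 100. All numbers are made up for illustration.

```python
import math


def rmse(obs, est):
    return math.sqrt(sum((o - e) ** 2 for o, e in zip(obs, est)) / len(obs))


def mae(obs, est):
    return sum(abs(o - e) for o, e in zip(obs, est)) / len(obs)


def mape(obs, est):
    # Mean absolute percentage error, in percent; observations must be non-zero.
    return 100.0 * sum(abs(o - e) / abs(o) for o, e in zip(obs, est)) / len(obs)


def r_squared(obs, est):
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - e) ** 2 for o, e in zip(obs, est))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot


def relative_improvement(baseline, improved):
    # Percentage reduction of an error measure relative to the baseline model.
    return 100.0 * (baseline - improved) / baseline


# Made-up station values (µg/m3), for illustration only.
observed = [40.0, 50.0, 60.0, 70.0]
estimated = [42.0, 48.0, 63.0, 69.0]
metrics = {
    "RMSE": rmse(observed, estimated),
    "MAE": mae(observed, estimated),
    "MAPE": mape(observed, estimated),
    "R2": r_squared(observed, estimated),
}
print(metrics)
```

Note that MAPE divides each error by the observed concentration, so the same absolute error weighs more heavily in months with low PM 2.5 concentrations.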

Regional Distribution of Model Residuals
From Figure 7, we know that the model performance and estimation accuracy of PCA-GWR are better than those of the GWR model, but the estimation accuracy and fitting effect of the PCA-GWR model still have room for improvement, so the spatial distribution of the residuals of the PCA-GWR model is visualized to further analyze the spatial distribution pattern of the residuals of the PCA-GWR model.
Since the strengths and weaknesses of the model interpolation effects for 2018-2020 are basically the same, we use the spatial distribution of residuals of the PCA-GWR model for January-December 2018 in the middle and lower reaches of the Yangtze River ( Figure 8a 1 -a 12 ) as an example to save space and use these plots as the basis for our analysis.
From Figure 8, it can be seen that the residual distribution of the PCA-GWR model shows a spatial trend of high in the north and low in the south, and the absolute values of residuals greater than 20 µg/m 3 are mainly concentrated in January, February, and December. When we combine these data with Figure 7, we can see that although the residuals in January, February, and December 2018 are larger, their MAPEs are less than 10 and their R 2 values are greater than 0.8, indicating that the PCA-GWR model has a better fit for PM 2.5 , but there is still room for optimizing the estimation accuracy of the model. When we combine these results with the PM 2.5 values in Figure 3, it can be seen that the PM 2.5 concentrations in June-August 2018 are lower than those in other months, but the absolute values of the residuals from June to August are large relative to the PM 2.5 concentrations, which makes the MAPE of the PCA-GWR model in June, July, and August large and the R 2 small (Figure 7).
In summary, although the accuracy and fitting effect of the PCA-GWR model are better than those of the conventional GWR model, there is still room for optimizing the residuals of the PCA-GWR model for estimating PM 2.5 concentrations in months with high and low PM 2.5 concentrations; therefore, residual correction for the PCA-GWR model can be considered.

Residual Correction of PCA-GWR Model
To justify introducing spatial interpolation methods for the residual correction of the PCA-GWR model in subsequent experiments, we performed a spatial autocorrelation analysis on the regression residuals of the PCA-GWR model. The global Moran's I of the residuals was calculated with the spatial analysis tool of GeoDa version 1.18.0; the results are shown in Table 2. As seen from Table 2, the residuals of the PCA-GWR model for most months of 2018-2020 are spatially autocorrelated: their p-values are almost all less than 0.1, and the absolute values of the Z values are almost all greater than 1.65. This indicates that the spatial autocorrelation of the residuals is significant at the 0.1 level, that is, there is less than a 10% probability that the observed spatial pattern of the residuals was generated by a random process. Therefore, we introduce five different radial basis function (CRS, TS, MS, IMS, and TPS) interpolation methods to correct the residuals of the PCA-GWR model.
Comparing Figure 9a-c, we can see that the RMSE of all five models is less than 5 µg/m 3 for each month in 2018 (Figure 9a), less than 3.5 µg/m 3 for each month in 2019 (Figure 9b), and less than 3 µg/m 3 for each month in 2020 (Figure 9c). This indicates that the interpolation accuracy of all five interpolation models for PM 2.5 is high, with the best interpolation effect for 2020 and the worst for 2018. Comparing Figure 10a-c, we can see that the MAPE of all five models is less than 10, and the MAPE of both the PCA-GWRMS and PCA-GWRTPS models is less than 5, indicating that these two models have a better fitting effect and better model performance for PM 2.5 ; of the two, the PCA-GWRMS model performs better.
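The global Moran's I used above to test the residuals for spatial autocorrelation can be sketched as follows. This is not GeoDa's implementation: the binary distance-band weights, the band width, and the toy coordinates and residuals are assumptions for illustration, and the significance test (Z value and p-value) is omitted.

```python
import math


def morans_i(coords, values, band):
    """Global Moran's I with binary weights: w_ij = 1 if 0 < distance(i, j) <= band."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]          # deviations from the mean
    cross = 0.0                             # sum of w_ij * z_i * z_j
    s0 = 0.0                                # sum of all weights
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(coords[i], coords[j]) <= band:
                cross += z[i] * z[j]
                s0 += 1.0
    return (n / s0) * (cross / sum(zi * zi for zi in z))


# Toy example: four stations on a line, neighbors are adjacent stations.
coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
clustered = [3.0, 2.0, -2.0, -3.0]      # similar residuals sit next to each other
alternating = [3.0, -3.0, 3.0, -3.0]    # neighbors always have opposite signs
i_clustered = morans_i(coords, clustered, band=1.0)
i_alternating = morans_i(coords, alternating, band=1.0)
print(i_clustered, i_alternating)
```

A positive value, as in the clustered case, means that nearby residuals tend to be similar, which is the situation in which interpolating the residuals spatially can be expected to add information.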
In summary, all five interpolation models (PCA-GWRCRS, PCA-GWRTS, PCA-GWRMS, PCA-GWRIMS, and PCA-GWRTPS) can better achieve the interpolation estimation of the spatial distribution of monthly PM 2.5 in the middle and lower reaches of the Yangtze River for 2018-2020 with better interpolation accuracy and fitting effect, among which the PCA-GWRMS model outperformed the other four residual correction models and the PCA-GWR model in all aspects.
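The idea behind the residual correction can be illustrated with a minimal multiquadric RBF sketch. It assumes the kernel φ(r) = sqrt(r^2 + c^2) with a hypothetical shape parameter c, defines the residual as observed minus estimated, and adds the interpolated residual back to the regression estimate. This is not the implementation used in the study, whose software settings are not given in this section, and all numbers are made up.

```python
import numpy as np


def multiquadric(r, c):
    return np.sqrt(r ** 2 + c ** 2)


def fit_rbf_weights(station_xy, residuals, c=1.0):
    """Solve Phi @ w = residuals so the interpolant reproduces the station residuals."""
    d = np.linalg.norm(station_xy[:, None, :] - station_xy[None, :, :], axis=-1)
    return np.linalg.solve(multiquadric(d, c), residuals)


def interpolate_residuals(station_xy, weights, target_xy, c=1.0):
    d = np.linalg.norm(target_xy[:, None, :] - station_xy[None, :, :], axis=-1)
    return multiquadric(d, c) @ weights


# Made-up stations: coordinates, observed PM2.5, and regression estimates (µg/m3).
station_xy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
observed = np.array([50.0, 55.0, 60.0, 65.0, 58.0])
estimated = np.array([48.0, 57.0, 59.0, 66.0, 55.0])

residuals = observed - estimated                      # assumed sign convention
weights = fit_rbf_weights(station_xy, residuals, c=1.0)

# Corrected estimate = regression estimate + interpolated residual.
corrected_at_stations = estimated + interpolate_residuals(station_xy, weights, station_xy)
grid_xy = np.array([[0.25, 0.75]])                    # hypothetical unmonitored location
grid_correction = interpolate_residuals(station_xy, weights, grid_xy)
print(corrected_at_stations, grid_correction)
```

Because this form of RBF interpolation is exact at the stations, accuracy at the stations themselves is not informative; a fair comparison between kernels needs held-out stations or cross-validation.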

Generation of the Spatial Distribution Map of the PM 2.5 Concentration
Through the analysis, we concluded that the accuracy and performance of the PCA-GWRMS model are better than those of the other models, and this model takes into account more comprehensive PM 2.5 influencing factors and less data loss; hence, we chose to use the PCA-GWRMS model to generate the spatial distribution map of PM 2.5 concentrations in the middle and lower reaches of the Yangtze River from 2018 to 2020. Its mapping steps are as follows:

• Step 1: Based on the PM 2.5 concentrations at 390 ground monitoring stations, we use ArcGIS 4.0 to densify the PM 2.5 monitoring stations and obtain 0.5° × 0.5° grid points.

• Step 2: The inverse distance weighting (IDW) method is used to interpolate the atmospheric pollutants (CO, NO 2 , O 3 , SO 2 ), meteorological data (TEM, PRS, WS, RH), and ZWD data to obtain the raster of the corresponding data, and then ArcGIS 4.0 is used to extract the values of the NDVI raster, ELE raster, and POP raster to the 0.5° × 0.5° grid points and the 390 PM 2.5 ground monitoring stations.

• Step 3: We construct the PCA-GWRMS model using data from the 390 monitoring stations to obtain PM 2.5 estimates for the 0.5° × 0.5° grid points, then visualize the predicted values for the 0.5° × 0.5° grid points and the actual PM 2.5 values from the 390 ground monitoring stations using the inverse distance weighting (IDW) [31] interpolation method to generate PM 2.5 concentration spatial distribution maps from January to December 2018-2020 (Figures 11-13).
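Steps 2 and 3 above both rely on inverse distance weighting (IDW). A minimal sketch of IDW at a single target point is given below; the power of 2, the use of all stations rather than a search neighborhood, and the sample values are assumptions for illustration, since the ArcGIS settings used in the study are not stated.

```python
import math


def idw(known_xy, known_values, target, power=2.0):
    """Inverse distance weighted estimate at `target` from scattered known points."""
    weighted_sum = 0.0
    weight_total = 0.0
    for xy, value in zip(known_xy, known_values):
        d = math.dist(xy, target)
        if d == 0.0:
            return value                 # target coincides with a known point
        w = 1.0 / d ** power             # nearer points receive larger weights
        weighted_sum += w * value
        weight_total += w
    return weighted_sum / weight_total


# Made-up example: two stations with PM2.5 of 40 and 60 µg/m3.
stations = [(0.0, 0.0), (2.0, 0.0)]
pm25 = [40.0, 60.0]
at_station = idw(stations, pm25, (0.0, 0.0))
at_midpoint = idw(stations, pm25, (1.0, 0.0))
near_first = idw(stations, pm25, (0.5, 0.0))
print(at_station, at_midpoint, near_first)
```

IDW estimates always lie between the minimum and maximum of the input values, so it smooths the field and cannot produce peaks beyond the observed or predicted values it is given.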
From the PM 2.5 spatial distribution in Figures 11-13, it can be seen that the PM 2.5 concentration distribution in the middle and lower reaches of the Yangtze River in 2018-2020 has a 'U'-shaped distribution on a monthly scale, which is consistent with the results described in Figure 3, where the PM 2.5 concentrations in January, February, and December are high, and those in June, July, and August are low, especially in the northern part of the study area in January each year, which is generally higher than 75 µg/m 3 . From the overall PM 2.5 spatial distribution, PM 2.5 concentrations show a spatial trend of high in the north and low in the south, and this variation is obvious in January-March and November-December each year, indicating that the use of the PCA-GWRMS model can better estimate regional PM 2.5 concentrations and generate a spatial distribution map of PM 2.5 concentrations with a high degree of refinement.

Discussion
In this paper, we found that the distribution of PM 2.5 in six provinces and one city in the middle and lower reaches of the Yangtze River in China shows a 'U'-shaped distribution on the monthly scale, with high PM 2.5 concentrations mainly occurring in the winter months (January, February, and December), where PM 2.5 concentrations are higher in January than in other months, and low concentrations mainly occurring in the summer months (June, July, and August) (Figure 3). This pattern is consistent with the regional PM 2.5 concentration distributions reported in some existing studies [59,60]. The main reason is that in winter in China the atmospheric temperature near the ground is lower than that of the upper atmosphere, forming a temperature inversion. This results in a relatively stable atmospheric structure with no air convection in the vertical direction, which makes it difficult for PM 2.5 and other atmospheric pollutants near the ground to disperse, so they accumulate and form haze [61]. At the same time, due to the lower temperatures near the ground in winter, the water vapor content in the air is lower, causing the air near the ground to be drier and facilitating haze formation. In the summer, the near-surface atmospheric temperature is high, the water vapor content in the air is high, and the vertical movement of the atmosphere is active; therefore, temperature inversions do not easily occur [62]. Moreover, more rainfall in summer is not conducive to the formation and diffusion of haze [63].
Considering the complex and diverse influencing factors of PM 2.5 [20,21,48], we demonstrated the high correlation between PM 2.5 and 12 explanatory variables such as meteorological factors, ZWD, and NDVI using the GRA method (Figure 4), and interestingly, we found a sudden increase in the gray relational grade of PM 2.5 with POP and ELE in June 2020, while the gray relational grade of these two variables remained around 0.92 and lower than other explanatory variables in the same month for the rest of 2018-2020. We consider that this phenomenon may first be due to the fact that ELE involves long-term data (Table 1), which do not change in a short time range, and POP involves annual-scale data (Table 1), which do not change with the month. However, the PM 2.5 concentrations and the remaining explanatory variables are monthly data and change with the month and season ( Table 1, Figures 3 and 11-13), making the gray relational grade of PM 2.5 with ELE and POP in different months less variable and lower than the other explanatory variables of PM 2.5 in the same month.
Secondly, because of the COVID-19 outbreak in early 2020 [64], China promoted travel reduction and imposed closures on areas with severe COVID-19 outbreaks [65], making the mean PM 2.5 concentration in June 2020 slightly lower than that in July but much lower than that in May (Figure 3c), which differs from the patterns of the mean PM 2.5 in May-July 2018 and 2019. Meanwhile, June is in the transition period between spring and summer, with low PM 2.5 concentrations, making PM 2.5 concentration changes vulnerable to various factors such as meteorological factors, POP, and ELE. We therefore conclude that the higher gray relational grade between PM 2.5 and POP and ELE in June 2020 may be influenced by China's COVID-19 prevention and control measures and by the transition between spring and summer.
Multicollinearity is an issue that must be considered in linear regression models [40,66], and through our study, we found that the PCA-GWR model was able to minimize the loss of data, and the spatial estimation accuracy and fitting effect of the PCA-GWR model were better than those of the traditional GWR model (Figure 7). This may be because the traditional stepwise exploratory regression method, while eliminating explanatory variables with multicollinearity, also eliminates explanatory variables that are correlated with PM 2.5 (Figures 4 and 5). Despite the multicollinearity among explanatory variables, each explanatory variable has a unique influence on the formation and distribution of PM 2.5 and cannot be completely replaced, whereas principal component analysis extracts principal components that can fully consider the contribution of all potential explanatory variables to PM 2.5 variation [48,67,68]. Therefore, we suggest that the PCA method be considered to improve the efficiency and accuracy of a linear model when the model has many explanatory variables or the multicollinearity among the explanatory variables is serious.
Our spatial autocorrelation analysis and the spatial visualization analysis of the PCA-GWR model residuals showed that the residuals of the PCA-GWR model have some positive spatial correlation (Table 2) and cluster spatially, with high values clustering around other high values (Figure 8), meaning that the residual at a station is affected by the surrounding stations. Therefore, we used five different radial basis function interpolation methods (CRS, TS, MS, IMS, and TPS) for the residual correction of the PCA-GWR model and demonstrated that the five improved combined models (PCA-GWRCRS, PCA-GWRTS, PCA-GWRMS, PCA-GWRIMS, and PCA-GWRTPS) all performed well in the spatial estimation of PM 2.5 concentrations and all outperformed the PCA-GWR model (Figures 7, 9 and 10). This improvement arises because the PCA-GWR model cannot fully explain the spatial characteristics of PM 2.5 ; the remaining spatial characteristics, such as positive spatial correlation (Table 2), are expressed in the residuals, so correcting the residuals of the model with a spatial interpolation method can better capture such characteristics.
The PCA-GWRMS model has the best applicability among all the combined models, with more than 60% improvement in both MAPE and RMSE (Figures 7, 9 and 10). The advantage of this model is that its RMSE and MAPE vary more smoothly and fluctuate less across months (Figures 9 and 10), which allows it to better handle the differences in estimation accuracy and fitting effect caused by high and low PM 2.5 concentrations. It combines the advantages of PCA, RBF interpolation, and the GWR model to achieve effective spatial estimation and mapping of PM 2.5 concentrations.

Conclusions
The main findings of this work can be summarized as follows.
1. PM 2.5 concentrations show a 'U'-shaped, seasonal distribution on the monthly scale, with higher PM 2.5 concentrations in January, February, and December (winter) and lower PM 2.5 concentrations in June, July, and August (summer). On the spatial scale, PM 2.5 concentrations are mainly high in the north and low in the south, and the high concentration areas are mainly located in the northern part of western Jiangsu Province, northern Anhui Province, central Hubei Province, and northeastern Hunan Province, while the PM 2.5 concentrations in Jiangxi Province and southern Zhejiang Province are relatively low for the whole study area.

2. To extract the best independent variables of the GWR model, the principal component analysis method has advantages over the traditional exploratory regression rejection method, and the PCA method can better balance the problems of multicollinearity among the explanatory variables of PM 2.5 and the adequacy of the contribution of potential explanatory variables to the distribution of PM 2.5 as well as the problem of data loss. The RMSE, MAE, MAPE, and R 2 of the PCA-GWR model are all improved compared with those of the GWR model, which can better achieve the spatial estimation of PM 2.5 .

3. All five residual correction combination models (PCA-GWRMS, PCA-GWRTPS, PCA-GWRCRS, PCA-GWRTS, and PCA-GWRIMS) outperform the PCA-GWR model in the spatial estimation of PM 2.5 concentrations in the middle and lower reaches of the Yangtze River region of China for 2018-2020, indicating that the residual correction of the PCA-GWR model using radial basis function interpolation can effectively improve the model performance and better achieve the spatial estimation and mapping of PM 2.5 concentrations in the study area. In addition, the PCA-GWRMS model shows stronger advantages than other combined models in terms of applicability and model performance for the spatial estimation of PM 2.5 in the study area.