Using an Eigenvector Spatial Filtering Based Spatially Varying Coecient Model to Analyze the Spatial Heterogeneity of COVID-19 and Its Inuencing Factors in Mainland China

 Background: The COVID-19 pandemic has led to many deaths and economic disruptions 13 across the world. Several studies have examined the effect of health risk factors on COVID-19 14 rates in different places, but the problem of spatial heterogeneity has not been adequately 15 addressed. 16  Methods: In this paper, we developed an Eigenvector Spatial Filtering based spatially 17 varying coefficient model (ESF-SVC) to reveal the spatially varying impact of certain health 18 risk factors on the COVID-19 spread. The experiment was conducted during 7 weeks within 19 two study extents (Hubei province and mainland China). Spatial varying coefficient maps were 20 produced for spatial pattern discovery. 21  Results: Results showed that the ESF-SVC model could take good control of over-fitting 22 problems, with average adjusted R 2 16.31% (in Hubei province) and 10.25% (in mainland 23 China) higher than that of GWR. The cross validation RMSE of ESF-SVC model was also the lowest. In Hubei province, Population density and wind speed had a significant impact on 25 COVID-19 infection rates and that their effect was constant across cities. While in mainland 26 China, migration score, building density, temperature and altitude showed significant impact 27 and their effect varies across space. The influence of migration score was less contributive and 28 less significant in cities around Wuhan than cities farther away, while the altitude showed 29 stronger contributions in high altitude cities. 30  Conclusions: Our study hopes to provide not only a feasible path to solve the problem of 31 spatial autocorrelation and spatial heterogeneity in COVID-19 characterization but also an 32 intuitive way to discover spatial patterns in large study areas, which could help people and 33 government awareness of the potential health risks and shed some light in COVID-19 control 34 strategies. 35


38
The COVID-19 pandemic, initially found in Wuhan, China in December 2019, is 39 incredibly infectious and has caused large impacts across the world . 40 The World Health Organization (WHO) designated the COVID-19 as a global 41 pandemic on March 11, 2020 (Cucinotta and Vanelli, 2020). The virus has spread rapidly 42 worldwide and infected cases were found in almost all countries by the end of 2020, 43 with global cumulative cases above 79 million and the death cases above 1.7 million, 44 and the number kept rising(WHO, 2020).

Spatial weights matrix construction 201
The spatial weights matrix W we use in the study is a K nearest neighbors' contiguity 202 matrix, which was a distance-based matrix and could deal with the problem of "isolation" 203 units(Gerkman and Ahlgren, 2014; Lesage and Fischer, 2008;Schneider et al., 2021). 204 For each study unit i, calculate the distance between the centroid of unit i and the other 205 n-1 units ( 1 , 2 … −1 , +1 , … , ) and rank by distance. Then find the k closest 206 units to i and denoted these as neighbors, while other units are not. When cities i and j 207 are neighbors, takes on non-zero values (one, for a binary matrix), otherwise zero 208 (Rogerson, 2001). 209 Then the spatial weights matrix W is centered as below: 210 C = (I − 11 T n ) (I − 11 T n ) (2)

211
where n represents the city polygon units in this study, n=17 in the study of Hubei 212 province and n=362 in the study of mainland China, I is an n-dimension identity matrix 213 and 1 is an n-by-1 vector of ones. 214

Eigenvector extraction 215
The centered spatial weight matrix C in Equation 2 was decomposed into 216 eigenvectors and eigenvalues as follows: 217 where E = ( 1 , 2 , 3 ,…, ) is the collection of eigenvectors; ∧ is an n×n diagonal 219 matrix of eigenvalues ( 1 , 2 , 3 ,…, ) which defines the corresponding eigenvectors' 220 MC value, 1 is the largest and is the smallest eigenvalue. 221 To capture the positive spatial autocorrelation, the subset of eigenvectors C was first 222 selected using the criterion λ λ 1 > 0.25 for the following model. Then a stepwise variable 223 selection method was applied in further eigenvector extraction to avoid 224 multicollinearity (Helbich and Griffith, 2016). 225

Variable selection and ESF-SVC model construction 226
ESF regression uses a set of eigenvectors as new variables. The vectors extracted in 227 Section 2.2.2 are added as control variables to the model, which is expressed as follows: 228 where Y represents the dependent variable and X represents an n × p matrix whose -230 th row of element is the independent variable , E represents an n × k matrix of k 231 selected eigenvectors, and ε represents disturbances. 232 An extended ESF-based SVC model could be shown as: 233 where is a n × 1 vector of the -th health risk factor, is the -th EV that 235 combined with its corresponding health risk factor , 0 , 0 , are the regression 236 coefficients that estimated by restricted maximum likelihood method, ε represents 237 random disturbance, and " • " denotes the element-wise product operator. 238 The coefficients − include two parts: The first part 0 represents the global effect 243 of health risk factors on the COVID-19 infection rate and the second part 244 represents the spatial effect on health risk factors and their local influence on the 245 COVID-19 infection rate. 246

Model assessment and comparison 247
Four models with Gaussian link function-OLS, GWR, ESF and ESF-SVC -were 248 built and compared using the criteria specified below. 249 Firstly, the adjusted R 2 , the AIC, the RMSE was selected as performance criteria to 250 assess model reliability, calculated as follows: 251 where is the COVID-19 infection rate of the ith city, ̂ is the predicted model value 254 of the ith city, ̅ is the total average COVID-19 infection rate, and df is the degrees of 255 freedom of the corresponding model. 256 Then cross validation method was utilized to further test model stability and fitting 257 accuracy. Models with better cross validation performance could be viewed as bearing 258 less over-fitting problem (Ghojogh and Crowley, 2019;Picard and Cook, 1984). Instead 259 of 10-fold cross validation, a more time consuming but unbiased leave-one-out cross 260 validation method (LOOCV) was applied to alleviate the problem of spatial 261 autocorrelation that might exist within the test dataset (Brenning, 2012;Refaeilzadeh et 262 13 al., 2009). That is, only one city with its corresponding variables acts as the validation 263 sample and the other cities serve as the training set. The process was repeated until each 264 city were taken as a validation sample and the average RMSE was taken as the 265 assessment criteria. 266 Finally, the Moran's I value for residuals was also taken as a performance criterion 267 to test if spatial autocorrelation in the residuals was successfully filtered out. If the 268 Moran's I value is high and significant, it suggests that the residuals are spatially 269 autocorrelated, which violates the hypothesis of independence in linear regression. 270 Generally, the higher the adjusted R 2 , the lower the RMSE, AIC and the RMSE of 271 LOOCV, the smaller and less significant the Moran's I for residuals, the better the 272 model performance. 273

Spatial pattern discovery 274
The output of the ESF-SVC model contains the coefficients (be they spatially varying 275 or constant) as well as their statistical significance, indicating how and to what extent 276 risk factors affect the COVID-19 infection rate and how they vary across space. For 277 better visualization, we mapped the coefficients of corresponding risk factors through 278 7 weeks. Quantile method was used to grade the legend and each risk factor shared the 279 same mapping dimension. Cities whose coefficient significance was less than 0.1 were 280 labeled by scattered points. A total of 28 (4 coefficients * 7 weeks) coefficient maps 281    Table 4 shows the Pearson correlation coefficients between health risk factors and 302 COVID-19 infection rate in Hubei province. In the first 3 weeks from January 22 to 303

Correlation analysis and multicollinearity diagnosis 301
February 11, all 10 risk factors did not show significant correlation with the COVID-304 19 infection rate. This might be because the nucleic acid detection method was not 305 mature and detection resources was not sufficient at first, such that many patients were 306 not detected as infected very quickly in high risk areas (i.e., Hubei province). The 307 Chinese Healthcare Commission announced an improved method for detecting 308 confirmed cases on February 12, which was the start time point of week 4 in our study.  Table 5 to test if multicollinearity exists. The VIF of PDEN, HOS, MS, GDP 314 and BD was larger than 10, which means multicollinearity problem exists. Therefore, a 315 stepwise regression method was applied for variable selection to avoid multicollinearity. 316 Finally, only PDEN and WDSP was maintained and other variables was screened out. 317 Table 6 shows the VIF value for factors after stepwise selection and the values are 318 smaller than 10, which means the remaining variables were not multicollinear and could 319 be used for modeling. 320

Model assessment 327
The results of different models (OLS, ESF, GWR, ESF-SVC) are shown in Table 7. The ESF-SVC model's average RMSE is the smallest (0.310), but the AIC is larger than 331 the other models, because the spatially varying coefficients increase model complexity.

335
The LOOCV results are shown in Table 8 The Moran's I values for model residuals are displayed in Table 9. These numbers 341 show that the OLS residuals were significantly spatially autocorrelated during the first 342 3 weeks, but that they did not show significant spatial autocorrelation in the final weeks. 343 The GWR residuals only showed significant spatially autocorrelation in week 2 and 344 none of the residuals for the ESF and ESF-SVC models were spatially autocorrelated, 345 suggesting that they could better filter out the influence of spatial autocorrelation. 346 To intuitively analyze model fitness, the model average residuals are visualized in 347 Figure 3. Map legend was grade by quantiles of all 4 models' residuals. The average 348 residuals for the ESF-SVC and GWR models remained at the same level, and were 349 better than those of the OLS and ESF models. 350

ESF-SVC model coefficients 355
The constructed ESF-SVC model coefficients are shown in week 4 to week 7, with average coefficient values of 0.944 and 1.172, respectively. 363 In general, these results show that none of the potential risk factors but the spatial 364 varying intercepts that generated by eigenvectors, which represent spatial 365 autocorrelation pattern, were associated with COVID-19 infection rates in Hubei 366 province during the first three weeks. Starting in week 4, PDEN and WDSP began to 367 show significant impacts, but their influence did not vary spatially across 17 cities in 368 Hubei province. WDSP, with an average coefficient of 1.172, played a more important 369 role in the growth of the COVID-19 infection rate in Hubei province since week 4. No 370 spatial varying coefficient maps were produced, because the health risk factor 371 coefficients were constant. 372   Table  398 14 shows the VIF of risk factors. All the VIF values are less than 10, which means the 399 multicollinearity among these factors was weak and could be added for modeling. 400   To intuitively analyze the model performance, Figure 4 shows the average residuals 431 in mainland China, which were calculated by taking the absolute average predicted

ESF-SVC model coefficients 445
The constructed ESF-SVC model coefficients are shown in Table 18    and/or small sample size. 502 When the study area was expanded to mainland China, the average adjusted R 2 of 503 ESF-SVC model was 0.624, which was 10.25% higher than GWR (0.566), 19.54% 504 higher than ESF (0.522) and 105.94% higher than the OLS model (0.303). The average 505 RMSE value of the ESF-SVC model was much higher than in the ESF and OLS models, 506 and slightly higher than that of the GWR model. The AIC of the ESF-SVC model was 507 no better than that of the other models, because the spatial varying coefficients 508 increased the model complexity and reduced the degrees of freedom. The LOOCV 509 results for the ESF-SVC model, with an average RMSE of 3.160, were significantly 510 smaller than those for the GWR (6.200) and OLS models (6.167) and slightly smaller 511 than that for the ESF model Overall, the ESF-SVC method provide a more accurate path for modeling COVID-525 19 infection rates by simultaneously considering the spatial autocorrelation and the 526 spatial heterogeneity. The ESF-SVC method also took good control of the over-fitting 527 problems that plagued the GWR model, thereby providing a more robust model. 528

Influence of health risk factors 529
The spatial varying coefficient values reflect how the corresponding risk factor 530 possibly affected the COVID-19 infection rates because the independent variables were 531 all standardized and normalized. If the risk factor passed the significance test during 532 the modeling, the larger the absolute coefficient, the greater its contribution to the 533

COVID-19 infection rate. 534
When the study area was limited to Hubei province, only PDEN and WDSP passed 535 the significance test and were selected for modeling. WDSP, with a lager average 536 coefficient than that of PDEN, contributes more to the increase of infection rate after 537 February 12. This might be because the virus could spread more in windy weather 538 (Coşkun et al., 2021). However, as the study area expanded to mainland China, WDSP 539 did not pass the model significance test, indicating that the influence of WDSP is more 540 likely to act on small but high risk areas. that should be considered is the model efficiency. Leave One Out cross-validation 602 method was applied in our study to avoid the spatial impact on the cross-validation 603 testing set, which would be time consuming for large dataset and needs to be improved 604 in the future. 605

606
In this paper, we developed an ESF-SVC method for modeling COVID-19 infection 607 rates. Population density, hospital capacity, migration score, GDP, building density, 608 precipitation, temperature, air pressure and DEM were considered as health risk factors 609 and the experiments was conducted using two study extents: Hubei province and 610 mainland China over 7 weeks from January 22, 2020 to March 10, 2020. 611 The proposed ESF-SVC model could take good control of over-fitting problems, with 612 a higher average adjusted R 2 compared with ESF and GWR. Also, the ESF-SVC 613 34 model's cross validation RMSE were also largely lower than other three models, 614 indicating that it can better estimate the relationship between COVID-19 infection rate 615 and health risk factors within different areas. 616 The ESF-SVC's spatial varying coefficients at different periods were displayed in 617 tables and maps for spatial and temporal pattern discover. The effect of health risk 618 factors was different as the study extents and study period change. In Hubei province, 619 only wind speed (WDSP) and population density (PDEN) passed the model significant 620 test. WDSP contributes more to the increase of infection rate than other health risk 621 factors after February 12, indicating that the virus could spread by wind in high-risk 622 areas with larger population density. Therefore, wearing a mask while going out is the 623 preferred response, which was also recommended by other related studies within In general, our method could better control multicollinearity and over-fitting 640 problems that plague the GWR model. In addition, it provides an innovative way of 641 exploring how health risk factors influence people differently across space and time. 642 The findings about the impact of health risk factors were also partially consistent with 643 previous studies on COVID-19 and other respiratory infectious diseases, which could 644 help people and government awareness of the potential health risks and shed some light 645 in COVID-19 control strategies. As the proposed method was not limited by datasets, 646 it could be a reference to other spatial epidemiology researches. As a next step, we plan 647 to add temporal variables in the spatial weights matrix and expand the ESF-SVC model 648 into an ESF based spatiotemporal varying coefficient model. We will also consider how 649 to expand the study area to the entire world and explore the spatiotemporal patterns 650 within different countries and continents.