Construction of a Seasonal Difference-Geographically and Temporally Weighted Regression (SD-GTWR) Model and Comparative Analysis with GWR-Based Models for Hemorrhagic Fever with Renal Syndrome (HFRS) in Hubei Province (China)

Hemorrhagic fever with renal syndrome (HFRS) is considered a globally distributed infectious disease which results in many deaths annually in Hubei Province, China. In order to conduct a better analysis and accurately predict HFRS incidence in Hubei Province, a new model named Seasonal Difference-Geographically and Temporally Weighted Regression (SD-GTWR) was constructed. The SD-GTWR model, which integrates the analysis and relationship of seasonal difference, spatial and temporal characteristics of HFRS (HFRS was characterized by spatiotemporal heterogeneity and it is seasonally distributed), was designed to illustrate the latent relationships between the spatio-temporal pattern of the HFRS epidemic and its influencing factors. Experiments from the study demonstrated that SD-GTWR model is superior to traditional models such as GWR- based models in terms of the efficiency and the ability of providing influencing factor analysis.


Introduction
HFRS is a zoonosis caused by different species of hantavirus, such as Hantaan virus (HTNV) and Seoul virus (SEOV) [1]. As an extremely dangerous viral disease, HFRS is a characterized by fever, acute renal dysfunction and haemostatic manifestations [2] and globally it causes a considerable number of deaths each year [3]. China accounted for approximately 90% of all HFRS cases in the world from 1990 to 2003 [4]. From 1981 to 2005, more than 20,000 cases of HFRS occurred annually, and in 1986, it caused 2569 deaths [5]. Hubei Province is one of the main HFRS outbreak areas in mainland China since the first case was discovered in Wuhan, China, in the 1980s. The fatality rate of HFRS in factors are independent variables. Regression analysis is a powerful tool for detecting relationships between the dependent variable and independent variables [21,22].
As we mentioned before, HFRS is a specifically spatial distributed epidemic. Each model (OLS, GWR and GTWR) has their own merits and limitations to verify the spatial-temporal characteristics of HFRS.
Firstly, OLS is a traditional statistical method that uses least squares in linear regression analysis. Its weighting calculation neglects the spatial location information. The spatial heterogeneity cannot be integrated with OLS in the analysis of HFRS.
Secondly, both of temporal information and spatial information should be considered for HFRS. GWR model uses spatial weights to evaluate the connections between the dependent variable and independent variables [21], while ignoring temporal information.
Thirdly, GTWR model employs both non-stationary spatial and temporal weights matrices to represent spillovers from neighboring geographical locations, which improves the accuracy of estimation compared with the GWR model. HFRS caused a strong seasonal epidemic in Changsha City (Hunan Province) from 2000 to 2009 and in Anhui Province [23,24]. In our previous study, we have proved that the seasonal epidemic pattern of HFRS in Hubei Province was characterized by a shift from the unimodal type (autumn/winter peak) to the bimodal type [6]. OLS models and GWR-based models (GWR and GTWR) cannot deal with the unique seasonal characteristics of HFRS.
Considering the seasonal characteristics of HFRS, a new Seasonal Difference-Geographically and Temporally Weighted Regression model (SD-GTWR) was developed in this research. This SD-GTWR model was developed based on the GTWR model. As a result, it inherited the spatial and temporal characteristics of the GTWR model and thus can simulate the tendencies of HFRS epidemics with consideration of both their spatial and temporal variations. However, the SD-GTWR model integrates the seasonally distributed characteristics of HFRS epidemics and has a capability to perform seasonal difference calculations to eliminate the temporal non-stationary problem of HFRS data. Specifically, the new SD-GTWR model displays three characteristics as follows: First, evolved from traditional linear regression methods (such as OLS), the GWR model estimated regression models based on spatial non-stationarity analysis [21]. Spatial variations can be measured by a specified neighborhood value for each location data in the GWR model [25]. Temporal parameters were considered when the data showed both spatial and temporal characteristics. As a result, GTWR is considered a spatiotemporal non-stationary model. Spatial and temporal weighted matrixes are inherited from GTWR.
Secondly, in the SD-GTWR model, seasonal difference calculations can eliminate the impact of seasonal variation and the temporal non-stationarity of HFRS case data. Seasonal variation is an important characteristic of HFRS outbreaks in every province of mainland China [6,23,24]. Seasonal fluctuations in temporal dimension have special effects on the accuracy of coefficient analysis on factors.
Thirdly, SD-GTWR utilized Incremental Spatial Autocorrelation (ISA) to verify the initial bandwidth value which is an important parameter during the iteration calculation of the model. When big data is utilized an enormous amount of time will be consumed for the determination of bandwidth values due to the numerous iterations required in the calculations. ISA provides a new way to reduce the computation time. Although the matrix order has thousands of levels, bandwidth selection range will be narrowed significantly with SD-GTWR model.

Construction of the Seasonal Difference-Geographically and Temporally Weighted Regression Model (SD-GTWR)
Extended from GTWR model, the SD-GTWR model not only inherits the functions of GTWR, but also has its own particular advantages.

Principle of the Geographically and Temporally Weighted Regression Model (GTWR)
When making GWR estimations, there are commonly three types of models that could be selected. These are the Gaussian GWR model (GWGR), Poisson GWR model (GWPR) and the Logistic GWR model (GWLR).
The GWGR model is simple and commonly used in the related studies about this issue [26]. In order to compare our result to similar studies, and also for convenience consideration, we decided to use GWGR as the basis for our new SD-GTWR model. The GWGR model uses spatial weight functions and bandwidths to avoid getting a negative estimation result. Meanwhile, the amount of data used in this study is large enough to serve GWGR model estimations.
From the previous research [20], GTWR was expressed as: where u i represents the x coordinate of observed location i, v i represents the y coordinate of the observed location i, and then t i represents the observed time for the observed location i. y i , the dependent variable of this model, represents the number of HFRS cases at location (u i ,v i ,t i ). x ik is the corresponding possible influencing factor. d represents the number of categories of influencing factors, and n represents the count number of observed locations. The parameter β k (u i ,v i ,t i ), k = 0,1,2, . . . ,d) is an arbitrary function for the observed location (u i ,v i ,t i ), and it is an unknown parameter for the observed location (u i ,v i ,t i ). ε i is a constant, which represents the random error for the observed location i. This model is different from other "fixed" coefficient estimation models, such as OLS or the Autoregressive Integrated Moving Average (ARIMA) model. It allows the parameter estimates to vary across space and time to capture the local effects in different time. Spatial autocorrelation hypothesis assumes that observed data closed to observed location i have greater impacts than the data spatially far from observed location i does. From Equation (1), parameter estimation for (u i ,v i , t i ) can be expressed as:β where X is a matrix of influencing factors in observation i for separate space and time, Y is vector for HFRS cases in observation location i. W(u i ,v i ,t i ) is an n × n spatial and temporal diagonal matrix. Its diagonal elements demonstrate the spatial and temporal weighting of influencing factors for observed location i. Deduced from Equation (2), w ij (j = 1,2, . . . ,n) is named the spatial weight function. It can indicate the weight of another observation except for the observed location (u i ,v i ,t i ) [21].
is a calculated value based on each spatial and temporal point parameter. It can be expressed using Equation (4): can be obtained according to Equation (2). HFRS cases variable Y at point (u i ,v i ,t i ) is expressed as: where x T i = (1, x i1 , x i2 , . . . , x id ) represents the values of influencing factors at observed location (u i ,v i ,t i ).

Spatial Weight Function Selection for SD-GTWR
Concluded from Tobler's first law of geography "Everything is related to everything else, but near things are more related than distant things" [27], the correlation coefficient of HFRS cases between different counties is negatively associated with spatial distance. In order to clarify the autocorrelation of HFRS infection, effects from neighbors should be considered according to its spatial distance from the focal location [28].
In previous studies, results of the calculation should be directly influenced by the spatial weight function [21,22]. In recent research, the major functions for GWR-based models (GWR model and GTWR model) are the tri-cube kernel function, Gauss kernel function and bi-square kernel function [3,28]. In the tri-cube kernel function process, the situation for infinity in the regression point might happen. Gauss kernel function considers all observation data (such as observation points weak/none relevant to the current observation data) in data estimation. The bi-square kernel function excludes the unnecessary influences to eliminate the weak referenced observation points. It can be expressed as: In the GWR model, d ij is the spatial distance between regression point (u i ,v i ) and (u j ,v j ). In the GTWR model, d ij is the spatial and temporal distance between regression point (u i ,v i ,t i ) and another regression point (u j ,v j ,t j ). h is not a negative value and represents the bandwidth. h indicates the relationship between w ij and d ij .
In Equation (7), λ is spatial parameter to measure spatial distance. µ is a temporal parameter to measure temporal distance. In Equation (6), when the spatial and temporal weight function is ascertained, the bandwidth value h will directly impact the regression result. The reason is that h directly decides the value of w ij for point (u i ,v i ,t i ). The selection of bandwidth value can be accomplished by using Cross Validation (CV), Generalized Cross Validation (GCV), the Akaike information criterion (AIC), Bias information criterion (BIC), etc.
Using each approach, the operation time will increase exponentially with the growing size of the statistical data. To solve this problem, Incremental Spatial Autocorrelation (ISA) is utilized to specify an initial bandwidth value. As ISA method sharply narrows the selection range of bandwidth, correspondingly reducing the operation time to a certain degree.

SD-GTWR
Research has demonstrated that HFRS happens as a characteristic of seasonal or cyclic time series [3,11]. In 1976, ARIMA was developed by Box and Jenkins to forecast non-seasonal time series. To deal with seasonal data like HFRS cases, Box and Jenkins developed an extension for ARIMA model called Seasonal Autoregressive Integrated Moving Average (S-AIMA). It uses a seasonal difference method to obtain stabilized data. Seasonal differences are used to deal with the non-stationary time series.
Following the research on ARIMA and S-ARIMA done by Box and Jenkins, the SD-GTWR model was constructed. Time series analysis and autocorrelation analysis were conducted to ensure the feasibility of using seasonal difference methods. The GTWR model is a prerequisite for doing an advanced seasonal difference model to use seasonal differences. In our previous research, HFRS cases in Hubei Province displayed a bimodal seasonal distribution pattern rather than a linear distribution during 1980-2000 [6]. Seasonal differences calculation (SD-GTWR) for HFRS cases in Hubei Province can be expressed as Equation (8) based on the GTWR model: W t is the result of seasonal difference on time t. S means the time periods of seasonal difference. X t represents the HFRS cases vector in time t. X t−s represents the HFRS cases vector value S units before time t. Different from GTWR, vector W t is used as a dependent variable to progress the procedures instead of the initial cases data.

Implementation of SD-GTWR for HFRS
A series of tests were conducted to identify the spatial and temporal characteristics of the HFRS epidemic. First, it should be verified whether the HFRS infection data demonstrates a spatial autocorrelation feature. Secondly, time sequence analysis should be adopted to verify the seasonal characteristics for HFRS. The first and second steps should provide a seasonal difference characteristics result for HFRS. Thirdly, OLS, GWR-based model (including GWR, GTWR) and SD-GTWR should be performed respectively based on the HFRS infections data. The purpose of this implementation was to improve the reliability and accuracy of the SD-GTWR model.

Study Data
The study area is Hubei Province, which is located in the south-central part of China. With an area of 186,000 square kilometers, it lies in the middle reaches of the Yangtze River. It is situated at 108'21"-116'07" east longitude and 29'05"-33'20" north latitude. Jianghan Plain takes up most of the central and southern area, while mountains are found in the west area. Hubei Province has thousands of square kilometers of plains area, and also possesses large areas of hills and mountainous regions. In its total area, mountainous regions account for 56% and hills occupy 24%, while the remaining 20% being water area. The two major rivers, the Yangtze River and Hanshui tributary flow through it. The Yangtze river is 1061 km in length in the Hubei Province and occupies about 1416 km 2 of drainage area. The Hanshui tributary is 878 km in length and occupies about 450 km 2 of drainage area. There are thousands of lakes in Jianghan Plain, the largest two being Liangzi Lake and Hong Lake.
The study data covers the period from 2011 to 2015. Basic geographic information data of Hubei Province was collected from the Chinese National Administrator of Surveying, Mapping and Geo-Information. The geographic data contains the counties' administrative regions by name and code. HFRS case data and rodent density data were provided by Hubei Provincial Center for Disease Control and Prevention and Chinese Center for Disease Control and Prevention. HFRS case data contains the monthly case values for each county. Climate data was obtained from the National Centers for Environmental Prediction and Hubei Meteorological Bureau. Climate data contains monthly average temperature, humidity and rainfall for each county. Human population density data, which includes the annual population for each county, was extracted from the Hubei Statistical Yearbook.
HFRS in China was mainly caused by two types of hantavirus (HTNV transmitted by Apodemus agrarius and SEOV transmitted by Rattus norvegicus) [5,14,29]. In Hubei Province, HFRS was mainly in the form of SEOV [6]. There are many factors that should be considered when performing a regression analysis for HFRS [24,30,31]. Climate influencing factors were commonly adopted. In this study, temperature, rainfall, relative humidity, SOI and air pressure factors were analyzed as the climate factors for HFRS in Yingshang County [14]. Eco-geographical factors such as intensity of human activity, climate conditions, and landscape elements have been proven to affect the occurrence of HFRS in Changsha, Hunan Province. Derived from the previous research, average temperature, average humidity, average rainfall, human population density, rodent population density, water area and county average surface elevation were included to execute our model. It indicates that the number of HFRS cases fluctuated during this period. In general, the number of case is decreased over the time. A nonlinear distribution pattern appears from the curve. This finding reveals that HFRS cases data cannot be stationary distributed temporally.  Figure 1 demonstrates the time series for HFRS cases in Hubei Province from 2011 to 2015. It indicates that the number of HFRS cases fluctuated during this period. In general, the number of case is decreased over the time. A nonlinear distribution pattern appears from the curve. This finding reveals that HFRS cases data cannot be stationary distributed temporally. Incremental Spatial Autocorrelation (ISA) was provided by the ArcGIS 10.2 software (PESRI, Redlands, CA, USA), which used Spatial Autocorrelation (Global Moran's I) tool for a series of increasing distances. ISA also measured the intensity of spatial clustering for each distance. Table 1 and Figure 2 displays Moran's Index value in different distances provided by Incremental Autocorrelation Analysis tool. Z-scores reflect the intensity of spatial clustering. A statistically significant peak Z-score indicates distances in which spatial clustering is the most prominent. A Zscore peak represents the best fixed distance value in spatial regression analysis. In this study, a Zscore peak emerged when the distance was set to 103,540.34. The initial calculation bandwidth for GWR and GTWR model was set to 103,540.34.  Incremental Spatial Autocorrelation (ISA) was provided by the ArcGIS 10.2 software (PESRI, Redlands, CA, USA), which used Spatial Autocorrelation (Global Moran's I) tool for a series of increasing distances. ISA also measured the intensity of spatial clustering for each distance. Table 1 and Figure 2 displays Moran's Index value in different distances provided by Incremental Autocorrelation Analysis tool. Z-scores reflect the intensity of spatial clustering. A statistically significant peak Z-score indicates distances in which spatial clustering is the most prominent. A Z-score peak represents the best fixed distance value in spatial regression analysis. In this study, a Z-score peak emerged when the distance was set to 103,540.34. The initial calculation bandwidth for GWR and GTWR model was set to 103,540.34.  Figure 1 demonstrates the time series for HFRS cases in Hubei Province from 2011 to 2015. It indicates that the number of HFRS cases fluctuated during this period. In general, the number of case is decreased over the time. A nonlinear distribution pattern appears from the curve. This finding reveals that HFRS cases data cannot be stationary distributed temporally. Incremental Spatial Autocorrelation (ISA) was provided by the ArcGIS 10.2 software (PESRI, Redlands, CA, USA), which used Spatial Autocorrelation (Global Moran's I) tool for a series of increasing distances. ISA also measured the intensity of spatial clustering for each distance. Table 1 and Figure 2 displays Moran's Index value in different distances provided by Incremental Autocorrelation Analysis tool. Z-scores reflect the intensity of spatial clustering. A statistically significant peak Z-score indicates distances in which spatial clustering is the most prominent. A Zscore peak represents the best fixed distance value in spatial regression analysis. In this study, a Zscore peak emerged when the distance was set to 103,540.34. The initial calculation bandwidth for GWR and GTWR model was set to 103,540.34.

Parameter Selection
After defining a bandwidth value, it is important to define the spatial and temporal balancing parameters λ and µ. In the GTWR model, the range of geo-location coordinates and date time are on totally different scales, necessitating the two parameters be set into a unified range scale for computation. Based on difference reciprocal, λ value was defined as the maximum spatial value, µ value was defined as the minimum temporal value. The units of space and time were set as km and month, respectively. In Hubei Province, the maximum distance between two neighboring counties is about 800 km. The sample time frame is 60 months (12 months per year multiplied by 5 years). According to reciprocal weight values, λ was set to 3, µ was set to 40.

Parameter Selection
After defining a bandwidth value, it is important to define the spatial and temporal balancing parameters  and μ. In the GTWR model, the range of geo-location coordinates and date time are on totally different scales, necessitating the two parameters be set into a unified range scale for computation. Based on difference reciprocal,  value was defined as the maximum spatial value, μ value was defined as the minimum temporal value. The units of space and time were set as km and month, respectively. In Hubei Province, the maximum distance between two neighboring counties is about 800 km. The sample time frame is 60 months (12 months per year multiplied by 5 years).
According to reciprocal weight values,  was set to 3, μ was set to 40.  Previous studies have demonstrated that outbreaks of HFRS epidemics were strongly related with climate influencing factors [6], and the HFRS cases in Hubei Province displayed a bimodal distribution for each year [32,33]. From Figures 1 and 3, it also appears that two peaks happen in a year (12 months). Accordingly, it seems that it is better to narrow down the seasonal difference range for each year to 6 months [34]. As a result of the above reasons, the time frame for the seasonal difference calculations was set to 6 months.

Seasonal Analysis
After determining the seasonal difference value, it is inevitable to set up independent and dependent values. In our research data, there were 5 (years)  12 (months)  76 (counties) rows. Previous studies have demonstrated that outbreaks of HFRS epidemics were strongly related with climate influencing factors [6], and the HFRS cases in Hubei Province displayed a bimodal distribution for each year [32,33]. From Figures 1 and 3, it also appears that two peaks happen in a year (12 months). Accordingly, it seems that it is better to narrow down the seasonal difference range for each year to 6 months [34]. As a result of the above reasons, the time frame for the seasonal difference calculations was set to 6 months.
After determining the seasonal difference value, it is inevitable to set up independent and dependent values. In our research data, there were 5 (years) × 12 (months) × 76 (counties) rows. Independent variables were set as a (4560 × 7) matrix, expressed as X. The dependent variable was set as a (4560 × 1) matrix, expressed as Y. HFRS cases data was seasonally different with a six month interval (peaks happened in February and August). The seasonally different dependent variable matrix can be expressed as YDif.

Correlation Analysis on Influencing Factors
Correlation analysis was used to analyze the correlation of the influencing factors, including average temperature (Avertemp), average humidity (Averhumi), average rainfall (Rainacc), Area, rodent density (RodentDensity), human population density (PopDensity), water area (WaterArea) and surface mean elevation (MeanHeight). In Table 2, the 5% significant level is marked as "*", while "**" represents a 1% significance level. The result shows that five factors are statistically significant (p < 0.10), which are rodent density (Rodent Density), human population density (PopDensity), water area (Water Area), average temperature (Avertemp) and average surface elevation (MeanHeight). The following regression analysis should select the five factors as influencing factors.  Table 3 reflects the model diagnostics result. The R square value is 0.447, which means that 44.7% of the variation for HFRS cases and possible influencing factors can be explained.   Table 4, parameter B represents the regression intercept for parameters [35]. The values of B indicate that the independent variables are associated, either positively or negatively. It can be inferred from Table 4 that factors including average humidity (Averhumi), rodent density (RodentDensity) and human population density (PopDensity) are positively associated with the infection of HFRS. Factors including average temperature (Avertemp) and average surface elevation (MeanHeight) are negatively associated with HFRS cases.

Compared Results among GWR-Based Models
Regarding the test on goodness of fit, Table 5 indicates that R square for the GWR model is 0.54. This value is higher than the value from the OLS model in Table 3 (0.447). This means that in terms of fitting the data, the GWR model that incorporated the temporal and seasonal effects achieves a 22.7% improvement as compared to the OLS model. Along with the R square value, the corrected Akaike Information Criterion (AICc) also measures the regression level for models. AICc is not an absolute measure of goodness of fit, however, it is useful for test models with different explanatory variables when applying the same dependent variable [36,37]. The lower AICc value a model has, the better fitness the observed data provides. As presented in Table 5, R square values for GWR, GTWR and SD-GTWR are 0.54, 0.61 and 0.77. The corresponding AICc values are 3539.23, 3462.41, 3421.06. Therefore, we can infer from the statistical results that the GTWR model is better than the GWR model by 13.0% and that the SD-GTWR model is better than the GTWR model by 26.2%. There could be two reasons for this phenomenon. First, in GTWR-based models (GTWR and SD-GTWR), the temporal factors covered more information than they did in the GWR model. Secondly, considering the seasonal pattern of HFRS cases, the SD-GTWR model simulated the relationships between HFRS epidemic and the possible influencing factors much better [37]. Table 6 shows the F-tests results for the three models and their corresponding p-values. Variables reach 5% level significant are marked with "*". It can be inferred that different significant variables can be found when using different models. In the GWR model, only climate factors including average temperature (Avetemp), average humidity (Averhumi) and average rainfall (Rainacc) are considered statistically significantly influencing factors. In the GTWR model, including the climate factors (as the GWR model discovered), rodent density (RodentDensity) and surface mean elevation (MeanHeight) are also regarded as influencing factors. In the SD-GTWR model, the water area (WaterArea) and human population density (PopDensity) factors are also selected for the estimation of HFRS cases. On the other hand, average temperature factor is excluded from the SD-GTWR model. The GWR model adds spatial characteristics to the general linear regression when processing the data. It is a non-stationary regression method. The GWR model is much better than OLS model in simulating the HFRS trends. First, HFRS demonstrates a strong spatial and temporal autocorrelation characteristic [37]. Neglecting its spatial variation may have a great impact on any simulation results. Secondly, HFRS from neighboring regions could contribute to the emergence of HFRS in the focal county. This phenomenon can be ascribed to the global and local autocorrelation characteristics of HFRS in Hubei Province.
Compared with the GWR model, R-square values were much higher in the GTWR and SD-GTWR models. Regression parameters were utilized as functions to describe the spatial and temporal position of sample data in the GTWR-based models (the GTWR model and SD-GTWR model). The calculation accuracy of the two models are higher than that of the GWR model, because that spatial and temporal weights in these functions can better respond to the influencing factors in different spatial and temporal locations.

Conclusions
Regression models were commonly used to evaluate the relationships between possible influencing factors and the number of HFRS epidemic cases. The GTWR model was used to speculate about the connections between influencing factors and the number of HFRS epidemics. With the consideration of spatial and temporal variation, simulation results from the GTWR model were more accurate than those from non-spatial models like OLS or from a non-temporal model like GWR. Combined with the seasonal characteristics of the HFRS epidemic in Hubei Province, a new model named SD-GTWR was initially developed to conduct regression analysis on HFRS cases. Estimations have been made by different models such as OLS, GWR, GTWR and SD-GTWR. It can be inferred from the model diagnosis results that with the process of seasonal difference, the SD-GTWR model better simulated the correlations of the possible influencing factors.
The regression results from different models revealed the characteristics of different influencing factors in Hubei Province in 2011-2015. The following conclusions can be reached: First, the models applied in this paper demonstrated the relationships between HFRS epidemics and meteorological factors at different levels. Meteorological factors notably impacted the changing trends of HFRS outbreaks, for the reason that they are associated with the spatial presence of this infection.
Secondly, one of the influencing factors, average humidity, has been demonstrated as significantly associated with the HFRS outbreaks in the GWR, GTWR and SD-GTWR models. It can be interpreted that this factor might have a strong impact on the HFRS epidemic in Hubei Province.
Thirdly, different models showed different estimation parameter results. This is reasonable since the OLS model does not consider spatial heterogeneity, which means that coefficients with spatial variation characteristics like rainfall may not be obvious. Moreover, the spatial and temporal variation of rodent density and surface mean elevation share the same distributions. It can be inferred that the model estimation results are dominantly affected by the spatial and temporal variation of HFRS.