Spatial Interaction Modeling of OD Flow Data: Comparing Geographically Weighted Negative Binomial Regression (GWNBR) and OLS (GWOLSR)

: Due to the emergence of new big data technology, mobility data such as ﬂows between origin and destination areas have increasingly become more available, cheaper, and faster. These improvements to data infrastructure have boosted spatial and temporal modeling of OD (origin-destination) ﬂows, which require the consideration of spatial dependence and heterogeneity. Both ordinary least square (OLS) and negative binomial (NB) regression methods have been used extensively to calibrate OD ﬂow models by processing ﬂow data as di ﬀ erent types of dependent variables. This paper aims to compare both global and local spatial interaction modeling of OD ﬂows between traditional and geographically weighted OLS (GWOLSR) and NB (GWNBR) modeling methods. From this study with empirical data it is concluded that GWNBR outperforms GWOLSR in reducing spatial autocorrelation and in detecting spatial non-stationarity. Although, it is noted that both local modeling methods show improvement when compared against the equivalent global models.


Introduction
Spatial interaction models, extensively used to investigate and analyze spatial movements, have become a well-established method for understanding factors affecting geographical mobility, such as migration [1,2], transport [3], international trade [4], commuting [5], and tourism [6]. A primary concern in spatial interaction modeling is the statistical and spatial distributions of OD (origin-destination) flows between origin and destination locations, which do not tend to follow normal distribution and spatial independence.
Traditionally, spatial interaction models are calibrated using an ordinary least square (OLS) regression, especially when dealing with normally distribution data. Here, log-transformed flow data, which follows an approximately normal distribution, is used as the dependent variable for calibrating spatial interaction models. However, in large networks, OD flow data often consist of zero flows between some ODs. As these zero flows are not always compatible with OLS estimation, the Poisson model is often used, particularly when dealing with count data. Where flow data is shown global and local modeling methods (OLS and NB). The remainder of the paper is structured as follows: Section 2 introduces the study area, data sets, and global and local modeling methods. Section 3 presents analytical and modeling results, followed by a comparison between two modeling methods in Section 4. Finally, Section 5 draws general conclusions and makes recommendation for future work.

Study Area
Jiangsu province is located on China's eastern coast and covers an area of 102,600 square kilometers. With a total of 63 counties and cities, this province is usually divided into three regions-northern (29 counties), central (16 counties, with its capital, Nanjing city), and southern (18 counties). These three regions have been shown to demonstrate large variations in levels of economic growth and social welfare, where the southern region, popularly called SuNan, has become a model of growth in China within current literature [34]. In 2014, Jiangsu reported a gross domestic product (GDP) of up to 6.51 trillion yuan RMB [35], thereby representing one of the fastest growing economies and most vigorous provinces in China.

Data Collection
In 2015, the expressway network across Jiangsu was ranked first in China with respect to density [35]. In this case study, the following data sets were used: location of toll-gates on expressway network, OD traffic flows between toll-gates, and county-level socio-economic data. The location of 334 toll-gates, covering the whole Jiangsu Province, as shown in Figure 1, was captured by a GPS device. The OD traffic flows between the 334 toll-gates were calculated from the transaction records at each toll-gate. Each vehicle driver is required to pay a fee to use the expressway network. Fees are paid in cash or electrically (e.g., WeChat) when the driver passes through a toll-gate. Every transaction record contains the ID of the vehicle and its time of entering or leaving the toll-gate. Therefore, each vehicle only has two records per trip-time the driver enters the expressway through the first toll-gate and a second time where the driver exits the expressway through the second toll-gate. These two records define the complete OD flow for the vehicle, from the entering gate to the exiting gate. In 2014, there were 235 million OD flows between the 334 toll-gates, which is typical in terms of high volume and velocity of big data. Administrative units in China include province, prefecture, and county. Statistically, the county unit is the lowest spatial unit for national socio-economic census surveys. To model the inter-dependence between transport and economy on regional scale, it is important to aggregate the traffic flows from toll-gate to county unit. The two counties, in which the entering and exiting gates (recorded when calculating OD flow) are located, become the origin and destination counties accordingly. As there were very few OD flows within a county, intra-flows at county level were excluded from inter-county flows. Out of the 63 counties within Jiangsu province, no transaction records were available in 2014 for four counties (Gaochun County, Rudong County, Funing County, and Jinhu County). Therefore, a total of 59 counties were included within the empirical study. The aggregated traffic flows at county level was shown to form a flow matrix 59 × 59 for 2014. Statistical (secondary) data of GDP were collated from the Jiangsu Statistical Yearbook 2015 [35]. This study assumes that the distance between two counties is the measured road network distance (including the expressway and other road networks) between the capitals (towns) of the two counties.

Global and Local Modeling
Spatial interaction modeling has been employed extensively to analyze a variety of geographical mobility factors (e.g., commuting, tourism, and migration). However, differences in methods used, e.g., OLS, Poisson, and NB regression, to calibrate the spatial interaction models have also been highlighted within the literature [1,7,9]. When considering spatial non-stationarity within spatial interaction models [27], calibration becomes more complicated as mobility often deals with count-type dependent variables (e.g., number of people or goods).
There has been confusion amongst modelers concerning the proper selection of modeling methods for a specific application. Theoretically, empirical evidence confirms the inter-dependence between transport and economy on a regional scale. Meaning that the improvement of transport infrastructure is driven by both rapid economic development [36] and increased economic activities, which in turn are stimulated by decreasing transport cost due to improvements to transport infrastructure [37]. GDP is a crucial indicator of economic development and has been used extensively for flow modeling in the research of trade [38], migration [39], tourism [40], and aviation [41]. In this study, the study area has been shown to have advanced transport infrastructure and nationally renowned urban and economic agglomeration zones. Consequently, it is important to examine the impacts of economic development on transport flows at county level. Taking spatial interaction modeling as an example, this paper aims to compare two local modeling methods-GWNBR and GWR.

Global Models of Flow
The spatial interaction model for traffic flows at a regional level is represented by Equation (1): where M is the number of vehicles travelling from county i to county j (indicating the interaction intensity between two areas), GDP and GDP represent the push and pull forces at the origin county i and destination county j, respectively, and d denotes the network distance between the capitals of counties i and j. A negative exponential function was selected for the distance decay function f(d ) at a regional scale [1,42], e.g., exp(−βd ), where β means the spatial distance friction coefficient, where a higher β value indicates that the flow is more sensitive to the network distance.
Here, the equivalent global model can be calibrated by OLS as follows (Equation (2)):

Global and Local Modeling
Spatial interaction modeling has been employed extensively to analyze a variety of geographical mobility factors (e.g., commuting, tourism, and migration). However, differences in methods used, e.g., OLS, Poisson, and NB regression, to calibrate the spatial interaction models have also been highlighted within the literature [1,7,9]. When considering spatial non-stationarity within spatial interaction models [27], calibration becomes more complicated as mobility often deals with count-type dependent variables (e.g., number of people or goods).
There has been confusion amongst modelers concerning the proper selection of modeling methods for a specific application. Theoretically, empirical evidence confirms the inter-dependence between transport and economy on a regional scale. Meaning that the improvement of transport infrastructure is driven by both rapid economic development [36] and increased economic activities, which in turn are stimulated by decreasing transport cost due to improvements to transport infrastructure [37]. GDP is a crucial indicator of economic development and has been used extensively for flow modeling in the research of trade [38], migration [39], tourism [40], and aviation [41]. In this study, the study area has been shown to have advanced transport infrastructure and nationally renowned urban and economic agglomeration zones. Consequently, it is important to examine the impacts of economic development on transport flows at county level. Taking spatial interaction modeling as an example, this paper aims to compare two local modeling methods-GWNBR and GWR.

Global Models of Flow
The spatial interaction model for traffic flows at a regional level is represented by Equation (1): where M ij is the number of vehicles travelling from county i to county j (indicating the interaction intensity between two areas), GDP i and GDP j represent the push and pull forces at the origin county i and destination county j, respectively, and d ij denotes the network distance between the capitals of counties i and j. A negative exponential function was selected for the distance decay function f d ij at a regional scale [1,42], e.g., exp −βd ij , where β means the spatial distance friction coefficient, where a higher β value indicates that the flow is more sensitive to the network distance.
Here, the equivalent global model can be calibrated by OLS as follows (Equation (2)): where α, β, and γ are the parameters to be calibrated and e ij is the error term. In this case, the dependent variable has been transformed from an integer to a real variable. In most cases, across disciplines, the transformed variable follows a normal distribution so OLS regression is used for calibration [43][44][45]. However, traffic flow is considered a counting variable, where the data follows a Poisson distribution. The Poisson regression method has been employed in numerous studies to calibrate the model in Equation (1) [46][47][48]. In cases where the flow data is shown to demonstrate over-dispersion (i.e., its variance is much larger than its mean), negative binomial regression should be chosen [1] and Equation (2) updated accordingly, as shown by Equation (3): where alpha is a dispersion parameter, which is greater than 0 in the case of over-dispersion.

Local Models of Flow
The global models mentioned above do not take into account spatial non-stationarity when modeling spatial interaction. Spatial non-stationarity, which highlights the varying relationships between flows and other socio-economic variables across the study area, can be explored using geographically weighted regression (GWR) [8,49]. As with global models, there are two different local modeling methods for treating dependent variables differently. When using a log-transformed flow as the dependent variable, local modeling can be represented as Equation (4): For a Gaussian model, the calibration WLS (weighted least square) method is applicable as shown by Equation (5): Further details concerning model calibration can be found in Fotheringham et al. [27]. When using raw flow values as the dependent variable, local modeling is updated as shown by Equation (6): where ij means the location of flow from origin site i to destination site j. Here, each flow has a set of parameters together with other local statistics, e.g., standard error and t-statistics.
The spatially weighted interaction model (SWIM) is a local modeling method that incorporates flow data [27] and is based on the Poisson model. In the case of NB regression, where it is assumed that there are a total of n flows and m explanatory variables, parameters in Equation (6) are calibrated as shown by Equation (7) (for further details of algorithms, see da Silva and Rodrigues [32]): where A is a vector matrix of n × m and W {ij} is an n × n diagonal matrix of spatial weight for flow ij, as shown by Equation (8).

Bandwidth
Due to the complexity of flow data that includes both origin and destination sites, local modeling needs to appropriately measure the distance between flows and properly calibrate the local models. In the case of flow-focused spatial interaction modeling [27], the distance between two flows ij and i j , as shown in Figure 2, are measured as d (ij)(i j ) (as shown by Equation (9)), in which the direction of flow is considered: Due to the complexity of flow data that includes both origin and destination sites, local modeling needs to appropriately measure the distance between flows and properly calibrate the local models. In the case of flow-focused spatial interaction modeling [27], the distance between two flows ij and i'j', as shown in Figure 2, are measured as d (ij)(i'j') (as shown by Equation (9)), in which the direction of flow is considered: Figure 2. The Euclidean distance between flow(i,j) and flow(i ′ j ′ ).
There are two popular methods to calculate spatial weight W(d ij ): fixed bandwidth and adaptive bandwidth.
Fixed bandwidth aims to search for an optimal distance, within which all flows of j will be calculated a spatial weight, w ij by following a Gaussian function, as shown by Equation (10): where b is the optimal threshold distance, called bandwidth in this case, and d ij is the distance between flows i and j. Adaptive bandwidth aims to search for the optimal number of nearest flows and is used to determine the distance bi for each flow i. It is clear the bi is affected by the density of flows near flow i. In this case, spatial weight is calculated using the bi-square kernel function, as shown by Equation (11).
Here, optimal bandwidth is generated through a golden selection process until a minimum corrected Akaike information criterion (AICc) value has been achieved.

Flow-Based Global Moran
Global Moran I has become a popular method to measure spatial autocorrelation, which is a form of spatial dependence. Many geographic patterns demonstrate spatial dependence due to complicated socio-economic processes shaping the patterns, such as economic agglomeration in this case study. When the spatial autocorrelation, a form of spatial dependence, is present in the model residuals, the statistical model will have biased and inefficient parameter estimations [8]. In this paper, this method is employed to compare the reduction of spatial auto-correlation in the residuals of global and local models. Considering the unique spatial autocorrelation in flow data, spatial weight is defined by using contiguity at both origin and destination sites (as detailed by Chun [50]).
Flows are considered adjacent only when both sites (origin and destination) are immediately adjacent to one another, as shown in Figure 3. In the spatial weight matrix, w ij = 1 if flow i and j are adjacent, otherwise, w ij = 0. There are two popular methods to calculate spatial weight W d ij : fixed bandwidth and adaptive bandwidth.
Fixed bandwidth aims to search for an optimal distance, within which all flows of j will be calculated a spatial weight, w ij by following a Gaussian function, as shown by Equation (10): where b is the optimal threshold distance, called bandwidth in this case, and d ij is the distance between flows i and j. Adaptive bandwidth aims to search for the optimal number of nearest flows and is used to determine the distance bi for each flow i. It is clear the bi is affected by the density of flows near flow i. In this case, spatial weight is calculated using the bi-square kernel function, as shown by Equation (11).
Here, optimal bandwidth is generated through a golden selection process until a minimum corrected Akaike information criterion (AICc) value has been achieved.

Flow-Based Global Moran
Global Moran I has become a popular method to measure spatial autocorrelation, which is a form of spatial dependence. Many geographic patterns demonstrate spatial dependence due to complicated socio-economic processes shaping the patterns, such as economic agglomeration in this case study. When the spatial autocorrelation, a form of spatial dependence, is present in the model residuals, the statistical model will have biased and inefficient parameter estimations [8]. In this paper, this method is employed to compare the reduction of spatial auto-correlation in the residuals of global and local models. Considering the unique spatial autocorrelation in flow data, spatial weight is defined by using contiguity at both origin and destination sites (as detailed by Chun [50]).
Flows are considered adjacent only when both sites (origin and destination) are immediately adjacent to one another, as shown in Figure 3. In the spatial weight matrix, w ij = 1 if flow i and j are adjacent, otherwise, w ij = 0.

Comparisons
Model performance was assessed using several statistical methods. In the case of global models, adjusted R2 were used to show how much variance (%) in the dependent variable can be explained by the global model. AICc was used to compare the efficiency of model performance improvements between global and local models. However, due to different methods of calibration, AICc cannot be used to compare OLS and NB. Flow-based Moran I, which indicates the spatial autocorrelation in the residuals of models, was instead used to compare the efficiency of modeling methods when considering spatial dependence. The strengths of local modeling methods were evident when mapping the distribution of parameter estimations and their t-statistics. To compare the spatial patterns between significant parameter estimations from two local modeling methods, the Lee-Sallee shape [51] indicator was used, as shown by Equation (12): where A and A are the spatial distribution of parameter estimations from OLS and NB local models, respectively, ∩ is the logic intersection between both spatial distributions, and U is the union of both. The traditional correlation coefficients were used to compare the statistical correlation between their data distribution as follows: R = correlation coefficient between the parameter estimates from both OLS and NB models and k = 1, 2, 3 represents three explanatory variables (GDPi, GDPj, and Dij).

Flow Patterns
In 2014, there were 111,556 (= 334 × 334) traffic flows between the 334 toll-gates, with a total volume across these traffic flows of 234,119,115. Flows with a volume larger than 10,000 were mapped using ARCGIS 10.6, as shown in Figure 4a. Figure 4a clearly shows that high-volume flows are mainly distributed across southern Jiangsu. Among these flows, 22 flows were shown to have volumes greater than 500,000, accounting for 7.37%; 79 flows had volumes between 200,000 and 500,000, accounting for 9.47%; and 84,126 flows had volumes smaller than 1000, accounting for 6.45%. Spatially, large-volume flows between a small number of primary toll-gates were found to dominate, particularly around the urban agglomeration zones of Nanjing, Suzhou, and Wuxi cities. This highlights the imbalanced distribution of traffic flows across the study area.

Comparisons
Model performance was assessed using several statistical methods. In the case of global models, adjusted R2 were used to show how much variance (%) in the dependent variable can be explained by the global model. AICc was used to compare the efficiency of model performance improvements between global and local models. However, due to different methods of calibration, AICc cannot be used to compare OLS and NB. Flow-based Moran I, which indicates the spatial autocorrelation in the residuals of models, was instead used to compare the efficiency of modeling methods when considering spatial dependence. The strengths of local modeling methods were evident when mapping the distribution of parameter estimations and their t-statistics. To compare the spatial patterns between significant parameter estimations from two local modeling methods, the Lee-Sallee shape [51] indicator was used, as shown by Equation (12): where A 0 and A 1 are the spatial distribution of parameter estimations from OLS and NB local models, respectively, ∩ is the logic intersection between both spatial distributions, and U is the union of both. The traditional correlation coefficients were used to compare the statistical correlation between their data distribution as follows: R k = correlation coefficient between the parameter estimates from both OLS and NB models and k = 1, 2, 3 represents three explanatory variables (GDPi, GDPj, and Dij).

Flow Patterns
In 2014, there were 111,556 (= 334 × 334) traffic flows between the 334 toll-gates, with a total volume across these traffic flows of 234,119,115. Flows with a volume larger than 10,000 were mapped using ARCGIS 10.6, as shown in Figure 4a. Figure 4a clearly shows that high-volume flows are mainly distributed across southern Jiangsu. Among these flows, 22 flows were shown to have volumes greater than 500,000, accounting for 7.37%; 79 flows had volumes between 200,000 and 500,000, accounting for 9.47%; and 84,126 flows had volumes smaller than 1000, accounting for 6.45%. Spatially, large-volume flows between a small number of primary toll-gates were found to dominate, particularly around the urban agglomeration zones of Nanjing, Suzhou, and Wuxi cities. This highlights the imbalanced distribution of traffic flows across the study area. After aggregating the traffic flows from toll-gate to county level, there were 3481 flows (= 59 × 59) between 59 counties, as shown in Figure 4b. In 2014, high-volume flows were mainly located in southern Jiangsu, with dominant flows present around the core centers of Nanjing, Suzhou, and Wuxi. The smaller-volume flows (less than 100,000) were found to be primarily situated in northern Jiangsu. However, within this region, some flows were found to be larger than 100,000, which can be mainly attributed to connections amongst the regional central cities of Xuzhou, Huai'an, and Yancheng. Among all the flows, 10 flows with a volume larger than 3 million were found to account for 24.43%, taking a dominant position on the network. In addition, 32 flows with volumes between 1 to 3 million were found to account for 21.47% and 2214 flows with volumes less than 10,000 were found to account for only 2.71%. Same as the flow patterns at toll-gate level, the network was dominated by a small-number of high-volume flows in southern Jiangsu. Statistically, a mean volume of 67,256.29 was reported, with a standard deviation of 42,4074.9. From this, the ratio of variance to mean was calculated at 2,673,943.5 indicating an over-dispersion statistical pattern. Figure 5 shows county level traffic flows as histogram and log-transformed data. For the former, an exponential distribution of flow volumes can be seen. Comparatively, the log-transformed data indicates that the flow volume follows an (approximately) normal distribution. After aggregating the traffic flows from toll-gate to county level, there were 3481 flows (=59 × 59) between 59 counties, as shown in Figure 4b. In 2014, high-volume flows were mainly located in southern Jiangsu, with dominant flows present around the core centers of Nanjing, Suzhou, and Wuxi. The smaller-volume flows (less than 100,000) were found to be primarily situated in northern Jiangsu. However, within this region, some flows were found to be larger than 100,000, which can be mainly attributed to connections amongst the regional central cities of Xuzhou, Huai'an, and Yancheng. Among all the flows, 10 flows with a volume larger than 3 million were found to account for 24.43%, taking a dominant position on the network. In addition, 32 flows with volumes between 1 to 3 million were found to account for 21.47% and 2214 flows with volumes less than 10,000 were found to account for only 2.71%. Same as the flow patterns at toll-gate level, the network was dominated by a small-number of high-volume flows in southern Jiangsu.
Statistically, a mean volume of 67,256.29 was reported, with a standard deviation of 424,074.9. From this, the ratio of variance to mean was calculated at 2,673,943.5 indicating an over-dispersion statistical pattern. Figure 5 shows county level traffic flows as histogram and log-transformed data. For the former, an exponential distribution of flow volumes can be seen. Comparatively, the log-transformed data indicates that the flow volume follows an (approximately) normal distribution. Thus, two options for calibrating the spatial interaction models were employed. OLS-based calibration was used for log-transformed flow data as it follows a normal distribution. Meanwhile, NB regression was used for raw flow data, as the data is considered a counting variable with a strong over-dispersion pattern. The polygon-based Moran I for GDP, as shown in Figure 6, was 0.207 (p value of 0), indicating a clustering pattern of GDP, with the highest values in Nanjing and Suzhou cities.
Jiangsu. However, within this region, some flows were found to be larger than 100,000, which can be mainly attributed to connections amongst the regional central cities of Xuzhou, Huai'an, and Yancheng. Among all the flows, 10 flows with a volume larger than 3 million were found to account for 24.43%, taking a dominant position on the network. In addition, 32 flows with volumes between 1 to 3 million were found to account for 21.47% and 2214 flows with volumes less than 10,000 were found to account for only 2.71%. Same as the flow patterns at toll-gate level, the network was dominated by a small-number of high-volume flows in southern Jiangsu. Statistically, a mean volume of 67,256.29 was reported, with a standard deviation of 42,4074.9. From this, the ratio of variance to mean was calculated at 2,673,943.5 indicating an over-dispersion statistical pattern. Figure 5 shows county level traffic flows as histogram and log-transformed data. For the former, an exponential distribution of flow volumes can be seen. Comparatively, the log-transformed data indicates that the flow volume follows an (approximately) normal distribution. Thus, two options for calibrating the spatial interaction models were employed. OLS-based calibration was used for log-transformed flow data as it follows a normal distribution. Meanwhile, NB regression was used for raw flow data, as the data is considered a counting variable with a strong over-dispersion pattern. The polygon-based Moran I for GDP, as shown in Figure 6, was 0.207 (p value of 0), indicating a clustering pattern of GDP, with the highest values in Nanjing and Suzhou cities.

Global Modeling-OLS/NB
The OLS global model was calibrated as follows (t-statistic value of the parameter estimation given in brackets): log M = −2.592 + 0.960 * log(GDP ) + 1.069 * log (GDP − 0.008 * d + e , Here, the adjusted R was found to be 0.609 with an AICc value of 11,633.207. This indicates that 60.9% of the variance in the log-transformed volume of flows can be explained by the three variables. All three explanatory variables were found to be statistically significant at a 1% level. Comparatively, the parameter estimations of GDP at origin site (OGDP) and GDP at destination site (DGDP) were found to be 0.960 and 1.069, respectively. This indicates that a greater positive contribution was reported by the economic power in the destination county when compared with the origin county. It also reveals that GDP has much higher pulling than pushing effects on traffic flows across the province. The parameter estimation of OD distance was found to be significantly negative at −0.008.
By contrast, the NB global model (Equation (3)

Global Modeling-OLS/NB
The OLS global model was calibrated as follows (t-statistic value of the parameter estimation given in brackets): Here, the adjusted R 2 was found to be 0.609 with an AICc value of 11,633.207. This indicates that 60.9% of the variance in the log-transformed volume of flows can be explained by the three variables. All three explanatory variables were found to be statistically significant at a 1% level. Comparatively, the parameter estimations of GDP at origin site (OGDP) and GDP at destination site (DGDP) were found to be 0.960 and 1.069, respectively. This indicates that a greater positive contribution was reported by the economic power in the destination county when compared with the origin county. It also reveals that GDP has much higher pulling than pushing effects on traffic flows across the province. The parameter estimation of OD distance was found to be significantly negative at −0.008.
By contrast, the NB global model (Equation (3)) was calibrated as follows (t-statistic values for each parameter estimation are given in brackets): Here, the adjusted pseudo R 2 was reported to be 0.737 with an AICc value of 72,822.835. The pseudo R 2 does not have the same meaning as the adjusted R 2 from the OLS model although a higher pseudo R 2 does indicate better performance. All three explanatory variables were found to be statistically significant at a 1% level. Comparatively, the parameter estimations of GDP at origin site (OGDP) and GDP at destination site (DGDP) were 0.7870 and 0.987, respectively, which again highlights a greater contribution from the economic power in the destination county when compared to the origin county. This reveals that GDP has a much greater pulling than pushing effect on traffic flows across the province. The parameter estimation of OD distance was found to be significantly negative at −1.958. Here, the alpha value was found to be 2.165, indicating a strong over-dispersion pattern.
Deploying the contiguity-based Moran I, as described in Section 2, the spatial autocorrelations in the residuals from the global OLS and NB models were calculated to be 0.348 and 0.197, respectively. All of which were found to be statistically significant at a 1% level. This indicates that the global NB model has a weaker spatial dependence than the global OLS model.

Local Modeling-OLS/NB
Considering the varied density of flows across the study area, as shown in Figure 4b, an adaptive bandwidth strategy (Equation (9)) was chosen to calibrate each local model.

GWOLS Flowing Model
Based on a golden selection process, a bandwidth of 18 (meaning 18 flows are picked up as a sample for a local OLS model of each flow) was searched to achieve the minimum AICc value. The adjusted R 2 for the local modeling was reported as 0.839, which is a vast improvement on the performance of the global OLS model. Here, the AICc value was found to be reduced from 11,633.207 to the current value of 10,352.371. At a significance level of 1%, there were 1624 (46.7%), 1714 (49.2%), and 1440 (41.4%) flows out of 3481 with statistically significant parameter estimations of OGDP, DGDP, and distance, respectively. No explanatory variables were found to achieve 50% of significant parameter estimation. The statistical distributions of all three t-statistics, and their corresponding parameter estimations, are shown in Figure 7.
All parameter estimations of flows found to be significant at a 1% level were mapped, as shown in Figure 8. Among the parameter estimation of OGDP, the maximum value was found to be 13.53, with a minimum of −13.13, and around 50 flows were found to have a negative parameter estimation. The majority of parameter estimations were found to range between 0 and 3, indicating a pushing effect of economic development on traffic flows, particularly in northern Jiangsu. Comparatively, the maximum parameter estimation value of DGDP was found to be 13.05, with a minimum of −12.7, and around 51 flows were found to have a negative parameter estimation. The majority of parameter estimations were found to be positive, indicating a pulling effect of economic development on traffic flows, particularly in northern Jiangsu. All parameter estimations of flows found to be significant at a 1% level were mapped, as shown in Figure 8. Among the parameter estimation of OGDP, the maximum value was found to be 13.53, with a minimum of −13.13, and around 50 flows were found to have a negative parameter estimation. The majority of parameter estimations were found to range between 0 and 3, indicating a pushing effect of economic development on traffic flows, particularly in northern Jiangsu. Comparatively, the maximum parameter estimation value of DGDP was found to be 13.05, with a minimum of −12.7, and around 51 flows were found to have a negative parameter estimation. The majority of parameter estimations were found to be positive, indicating a pulling effect of economic development on traffic flows, particularly in northern Jiangsu. Finally, parameter estimation values of distance were found to range from a minimum −0.073 to a maximum 0.005, with a higher value (indicating greater sensitivity to transport distance) reported in southern Jiangsu. Overall, these parameters have demonstrated spatial heterogeneity across the study area. However, the spatial effects of GDP were found to be relatively stronger than distance. Finally, parameter estimation values of distance were found to range from a minimum −0.073 to a maximum 0.005, with a higher value (indicating greater sensitivity to transport distance) reported in southern Jiangsu. Overall, these parameters have demonstrated spatial heterogeneity across the study area. However, the spatial effects of GDP were found to be relatively stronger than distance.

GWNBR Flowing Model
Based on a golden selection process, a bandwidth of 75 (meaning 75 flows are picked up as a sample for local NB model for each flow) was searched to achieve the minimum AICc value. The adjusted pseudo R for local modeling was found to be 0.918, which indicates a vast improvement on the performance of the global NB model. The AICc value was found to be greatly reduced from 72,822.835 to the current value of 69,648.787. At a significance level of 1%, 2582 (74.2%), 2735 (78.6%), and 3209 (92.2%) flows out of 3481 were found to be statistically significant parameter estimations of OGDP, DGDP, and distance, respectively. All explanatory variables were found to achieve more than 70% of significant parameter estimation. The statistical distributions of all three t-statistics, and their corresponding parameter estimations, are shown in Figure 9.

GWNBR Flowing Model
Based on a golden selection process, a bandwidth of 75 (meaning 75 flows are picked up as a sample for local NB model for each flow) was searched to achieve the minimum AICc value. The adjusted pseudo R 2 for local modeling was found to be 0.918, which indicates a vast improvement on the performance of the global NB model. The AICc value was found to be greatly reduced from 72,822.835 to the current value of 69,648.787. At a significance level of 1%, 2582 (74.2%), 2735 (78.6%), and 3209 (92.2%) flows out of 3481 were found to be statistically significant parameter estimations of OGDP, DGDP, and distance, respectively. All explanatory variables were found to achieve more than 70% of significant parameter estimation. The statistical distributions of all three t-statistics, and their corresponding parameter estimations, are shown in Figure 9. All parameter estimations of flows, found to be significant at a 1% level, were mapped as shown in Figure 10. Among the parameter estimation of OGDP, the maximum value was found to be 3.43 and the minimum −1.95, with three flows reporting negative parameter estimations. The majority of parameter estimations were found to range between 0.5 and 2. This highlights the pushing effect of economic development on traffic flows, particularly in northern and central Jiangsu. Comparatively, the maximum parameter estimation value of DGDP was found to be 2.9, with the minimum −2.03, and six flows reported negative parameter estimations. Here, the majority of parameter estimations were found to be positive, thus highlighting the pulling effect of economic development on traffic flows, particularly in eastern Jiangsu.
Finally, the parameter estimation values of distance were found to range from a minimum −7.95 to a maximum 0.51, with a higher absolute value for the flows between northwestern and southern Jiangsu (particularly in border counties). Relatively, distance was found to have a stronger spatial effect than GDP. Overall, these parameters have demonstrated spatial non-stationarity across the study area. However, relative spatial effects of GDP were found to be weaker than that of distance, which is contrary to the conclusion made from the OLS local model. All parameter estimations of flows, found to be significant at a 1% level, were mapped as shown in Figure 10. Among the parameter estimation of OGDP, the maximum value was found to be 3.43 and the minimum −1.95, with three flows reporting negative parameter estimations. The majority of parameter estimations were found to range between 0.5 and 2. This highlights the pushing effect of economic development on traffic flows, particularly in northern and central Jiangsu. Comparatively, the maximum parameter estimation value of DGDP was found to be 2.9, with the minimum −2.03, and six flows reported negative parameter estimations. Here, the majority of parameter estimations were found to be positive, thus highlighting the pulling effect of economic development on traffic flows, particularly in eastern Jiangsu. Finally, the parameter estimation values of distance were found to range from a minimum −7.95 to a maximum 0.51, with a higher absolute value for the flows between northwestern and southern Jiangsu (particularly in border counties). Relatively, distance was found to have a stronger spatial effect than GDP. Overall, these parameters have demonstrated spatial non-stationarity across the study area. However, relative spatial effects of GDP were found to be weaker than that of distance, which is contrary to the conclusion made from the OLS local model.

Discussion
To compare the two modeling methods, differences in modeling results were evaluated. First, statistics (mean, standard deviation, and number of significant flows) of the three parameter estimations, as shown in Table 1, were compared. From this, it can be suggested that local NB modeling is superior at distinguishing these flows statistically, and thus will detect heterogeneity with more ease. Second, employing the contiguity-based Moran I, as described in Section 2.3.4, the spatial autocorrelations of residuals from local OLS and NB models were calculated and found to be 0.143 and 0.111, respectively. Compared with traditional statistical models, e.g., OLS in this paper, geographically weighted modeling methods lead to the reduction of spatial autocorrelation in the model residuals. Between the two geographically weighted modeling methods compared in this paper, the geographically weighted negative binomial regression methods can better reduce the spatial autocorrelation in the model residuals. It indicates the parameter estimations from local NB are more efficient.
Third, using all the flows as samples, the correlations between the parameter estimations from both OLS and NB based local models were calculated for OGDP (0.599), DGDP (0.582), and distance (0.25). This suggests that the parameter estimations of GDP have a higher similarity between the two methods, but less for distance.
Fourth, using the Lee-Sallee shape index for the three parameter estimations, values were found to be 0.41 for OGDP, 0.44 for DGDP, and 0.40 for distance. These very similar but low values indicate variations between spatial distribution of three parameter estimations across the two methods. Both the correlation coefficient and Lee-Sallee shape index confirm this disparity in the modeling results.
Finally, calibrating GWFM was found to be a time-consuming process due to the golden selection of bandwidth, particularly when there was a large number of flows (e.g., 3481 flows in this case study). Using the following computer configuration-CPU Intel i5-3470 (3.2 GHz) and RAM 4.00 GB-the total times for calibrating the local OLS and NB models were 0.067 and 1.817 hours, respectively.

Conclusions
Big data, collected from the transaction records of toll-gates across Jiangsu province, was used to determine traffic flow (aggregated to county level) for the purpose of spatial interaction modeling. Using GDP as a pushing force at the origin site and a pulling force at the destination site, the flow-focused spatial interaction modeling was calibrated globally, using ordinary least square (OLS) regression and negative binomial (NB) regression methods. The results reveal that the pulling effect of economic development was stronger on traffic flows than its pushing effect in economically wealthy regions. To consider spatial auto-correlation and non-stationarity, local spatial interaction models were calibrated using geographically weighted OLS and NB, respectively. This study has confirmed that both local modeling methods (either OLS or NB oriented) can improve the model performance of the counterpart global model, in terms of modeling statistics (e.g., adjusted R 2 and AICc) and spatial autocorrelation (e.g., Moran I). Both modeling results were also found to exhibit strong spatial non-stationarity in the transport impacts of economy and transport distance. Comparatively, global and geographically weighted negative binomial flow modeling was found to reduce spatial dependence more efficiently than their OLS counterparts. In particular, results from local modeling, which were massively different from those reported for geographically weighed OLS modeling, were found to better detect spatial non-stationarity.
In conclusion, both methods could be used to model global and local flows that result from complicated spatial interactions. Compared with global model counterparts, the two local modeling methods considering spatial non-stationarity, could be used to produce maps to help understand the spatial process of socio-economic contributions. Wider implications of this study suggest that these results (maps and statistics) could be used by policy makers to further regional economic and transport development. When flow data is shown to demonstrate an over-dispersion statistical pattern and a strong clustering spatial pattern, GWNBR outperforms GWOLSR in reducing spatial autocorrelation in the model residuals. As a comparative study, this paper has demonstrated the following novelties and has added value to GIS in the following areas.
First, this study is novel in the use of new big data, where regional transaction data recorded at toll-gates on an expressway network across a large-area province has been used. Statistical models of such flow data were used to help understand the varied contributions of economic development to traffic flows and to consider the spatial interaction between county units. Thus, this study has proven the added value of using big data to analyze regional transport patterns.
Second, this study is novel as it has developed and successfully applied geographically weighted NB, including the Moran I of flow data, which has been rarely reported in GIS literature. Different from general geographically weighted regression methods, geographically weighted NB considers the complicated statistical and spatial patterns of flow data and as such requires the measurement of flow distance and calibration of NB models. Again, this study has proven the added value of employing GWNBR to model similar flow data locally.
Most importantly, this study adds value by making comparisons between GWOLSR and GWNBR and discussing which is superior in reducing spatial autocorrelation in model residuals and in detecting spatial non-stationarity. The more reducing spatial autocorrelation residual, the better the model is. This will aid modelers in making decisions when modeling flow data, especially where spatial non-stationarity needs to be considered.
However, this study highlights potential challenges that could be addressed in future work. The first concerns the visualization of created parameter estimation maps, which was a challenge due to the large number of flow lines. Second, future work could focus on reducing the computational time for calibrating local models particularly when working with a large-size matrix. Here, the use of a cloud or parallel computation has been identified as a potential solution to this challenge. Third, challenges may become visible as spatio-temporal flow modeling becomes increasingly more complex due to the increased availability of flow data with high temporal resolution, e.g., hourly records in this study.
Theoretically, more socio-economic variables could be included in the spatial interaction models. This would enable increased model performance and provide more evidence for regional policy making. Technically, transaction data should be disaggregated by vehicle type and mode of transport to enable the integration of spatial interaction modeling into traffic simulation.