Spatial interaction models, widely used to analyze spatial movements, are a well-established method for understanding the factors that affect geographical mobility, such as migration [1], transport [3], international trade [4], commuting [5], and tourism [6]. A primary concern in spatial interaction modeling is the statistical and spatial distribution of origin-destination (OD) flows, which tend to be neither normally distributed nor spatially independent.
Traditionally, spatial interaction models are calibrated using ordinary least squares (OLS) regression, especially when dealing with normally distributed data. Here, log-transformed flow data, which follow an approximately normal distribution, are used as the dependent variable. However, in large networks, OD flow data often contain zero flows between some OD pairs. Because zero flows cannot be log-transformed and are therefore not always compatible with OLS estimation, the Poisson model is often used, particularly when dealing with count data. Where flow data exhibit over-dispersion, negative binomial (NB) regression is used in place of the Poisson model. Comparisons between OLS and NB have been reported in the literature [7] and can inform the choice of statistical model for analyzing flow data.
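The two data problems described above (zero flows and over-dispersion) can be screened for before a model is chosen. The following is a minimal sketch, not the procedure used in this paper: the flow counts are hypothetical, and the raw variance-to-mean ratio is only a crude indicator because it ignores covariates.

```python
from statistics import mean, variance


def dispersion_ratio(flows):
    """Sample variance divided by mean; values well above 1 hint at over-dispersion."""
    m = mean(flows)
    if m == 0:
        raise ValueError("mean flow is zero; the ratio is undefined")
    return variance(flows) / m


def log_transformable(flows):
    """Keep only strictly positive flows, the ones a log transform can handle."""
    return [f for f in flows if f > 0]


# Hypothetical OD flow counts (assumption: not taken from the Jiangsu data).
flows = [0, 0, 3, 12, 150, 7, 0, 920, 41, 5]

ratio = dispersion_ratio(flows)   # far above 1 for these counts
kept = log_transformable(flows)   # the zero flows drop out
print(f"variance/mean = {ratio:.1f}, usable for log-OLS: {len(kept)} of {len(flows)}")
```

A Poisson model assumes the variance equals the mean, so a ratio far above 1 is the pattern that motivates NB regression, while the dropped zeros show what a log-transformed OLS model would lose.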
However, in addition to following a non-normal distribution, flow data, which represent geographical mobility, often exhibit complicated spatial structures resulting from complex spatial interactions between origin and destination units. Thus, spatial dependence and heterogeneity must be considered. Spatial dependence, associated with the first law of geography, is caused by spill-over effects, in which an event in one context occurs because of something in a neighboring context, whereas spatial heterogeneity, associated with the second law of geography, is driven by contextual variation over space. Within the literature, spatial autocorrelation (a form of spatial dependence) has been used to examine spatial randomness in the residuals of statistical models. Likewise, spatial non-stationarity (a form of spatial heterogeneity) has been frequently employed to explore how the contributions of independent variables vary across space. As such, geographically weighted modeling has proven effective and efficient in addressing spatial autocorrelation and non-stationarity [8].
Due to the emergence of new technologies associated with big data, such as sensors, tracking devices, smart transactions, and citizen science, the availability of mobility data (i.e., flows between origin and destination areas) has increased as collection methods have become cheaper and faster [9]. This improved data infrastructure has boosted spatial and temporal modeling of OD flows [12], which has a long-standing place in the evolution of GIS [15]. Furthermore, this has stimulated interest in, and demand for, temporally and geographically weighted flow modeling. For example, Qian et al. [17] analyzed the spatial-temporal characteristics of expressway traffic flow, and Hui et al. [18] focused on the nonlinear characteristics of expressway traffic flow in their analysis.
A variety of modeling methods have been employed to predict traffic flows on different infrastructure types. For example, to predict highway traffic flow, studies have used agent-based modeling with spatial cognition methods [19] as well as support vector regression along with Bayesian classifiers [20]. Likewise, high-speed traffic flow has been predicted using deep learning methods [21]. However, many empirical studies [2] do not consider the spatial dependence present in the flow data [24] or the spatial non-stationarity of flow determinants [27]. This can lead to biased and inefficient modeling results. Indeed, Fischer and Griffith [24] and LeSage and Pace [25] identified theoretical and empirical reasons for the inadequacy of global OLS and NB models when analyzing flows that exhibit spatial dependence. Several methods consider spatial heterogeneity, such as moving window regression [28] and spatially adaptive filtering [29]. The geographically weighted regression (GWR) approach has been widely applied, with many variations that are adapted for specific domains [31] or include spatial interactions [27]. This approach has also given rise to the modeling of spatial non-stationarity. However, few studies in the literature develop and apply geographically weighted NB for regional transport flows.
Approaches employing both OLS and NB regression have been used extensively for statistical modeling of flow data [24], where the dependent variable is either a count or a log-transformed ratio variable. Statistically, there is a strong argument that NB should be used when flow data exhibit over-dispersion [27]. Spatially, there is also a strong argument that a geographically weighted regression model reduces the spatial autocorrelation in the residuals better than its global counterpart. However, it is unclear which geographically weighted model, geographically weighted OLS regression (GWOLSR) or geographically weighted negative binomial regression (GWNBR), better reduces the spatial autocorrelation of model residuals. This is a research gap in the spatial statistical modeling of flow data. Findings from such methodological comparisons can enable model developers to make informed decisions regarding local modeling of flow data.
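To make the idea of geographically weighted calibration concrete, the sketch below fits a locally weighted OLS slope at two focal locations. It is an illustration under stated assumptions, not the GWOLSR or GWNBR implementation of this paper: the Gaussian kernel, the fixed bandwidth, the one-dimensional positions, and the single predictor are all simplifications chosen for brevity.

```python
import math


def gaussian_weights(distances, bandwidth):
    """Gaussian distance-decay kernel: nearby observations get weights near 1."""
    return [math.exp(-0.5 * (d / bandwidth) ** 2) for d in distances]


def local_slope(x, y, weights):
    """Weighted least-squares slope of y on x for one focal location."""
    sw = sum(weights)
    xbar = sum(w * xi for w, xi in zip(weights, x)) / sw
    ybar = sum(w * yi for w, yi in zip(weights, y)) / sw
    sxy = sum(w * (xi - xbar) * (yi - ybar) for w, xi, yi in zip(weights, x, y))
    sxx = sum(w * (xi - xbar) ** 2 for w, xi in zip(weights, x))
    return sxy / sxx


# Hypothetical data: ten observations along a line; the true slope is 1 in the
# western half (positions 0-4) and 3 in the eastern half (positions 5-9).
positions = list(range(10))
x = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
y = [1, 2, 3, 4, 5, 3, 6, 9, 12, 15]


def slope_at(focal, bandwidth=1.0):
    dists = [abs(p - focal) for p in positions]
    return local_slope(x, y, gaussian_weights(dists, bandwidth))


slope_west = slope_at(0)  # close to 1
slope_east = slope_at(9)  # close to 3
print(round(slope_west, 3), round(slope_east, 3))
```

A single global regression on these data would return one compromise slope, whereas the local fits recover the two regimes; this is the sense in which geographically weighted models detect spatial non-stationarity.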
Using Jiangsu province, an economically wealthy province in eastern China, as a case study, this paper analyzes and models traffic flow data, collected through transaction recordings [33], using both global and local versions of OLS and NB models. The remainder of the paper is structured as follows: Section 2 introduces the study area, data sets, and global and local modeling methods. Section 3 presents analytical and modeling results, followed by a comparison between the two modeling methods in Section 4. Finally, Section 5 draws general conclusions and makes recommendations for future work.
To compare the two modeling methods, differences in modeling results were evaluated. First, statistics (mean, standard deviation, and number of significant flows) of the three parameter estimates, as shown in Table 1, were compared. These statistics suggest that local NB modeling is better at distinguishing the flows statistically and thus detects heterogeneity more readily.
Second, employing the contiguity-based Moran's I, as described in Section 2.3.4, the spatial autocorrelations of residuals from the local OLS and NB models were calculated and found to be 0.143 and 0.111, respectively. Compared with traditional statistical models (e.g., OLS in this paper), geographically weighted modeling methods reduce the spatial autocorrelation in the model residuals. Of the two geographically weighted methods compared in this paper, geographically weighted negative binomial regression reduces the spatial autocorrelation in the model residuals more. This indicates that the parameter estimates from local NB are more efficient.
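For readers unfamiliar with the statistic, the following is a minimal sketch of global Moran's I for a vector of residuals and a binary contiguity matrix. It assumes the weight matrix is already given; how contiguity between flows is defined (Section 2.3.4) is not reproduced here, and the four-unit chain and the residual values are hypothetical.

```python
def morans_i(values, weights):
    """Global Moran's I: (n / S0) * sum_ij w_ij z_i z_j / sum_i z_i^2."""
    n = len(values)
    mean_v = sum(values) / n
    dev = [v - mean_v for v in values]
    s0 = sum(sum(row) for row in weights)  # sum of all weights
    denom = sum(d * d for d in dev)
    if s0 == 0 or denom == 0:
        raise ValueError("Moran's I is undefined for zero weights or constant values")
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    return (n / s0) * (cross / denom)


# Hypothetical contiguity: four units in a chain, each adjacent to its neighbors.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

clustered = morans_i([1, 2, 3, 4], chain)     # similar neighbors: positive
alternating = morans_i([1, -1, 1, -1], chain)  # dissimilar neighbors: negative
print(clustered, alternating)
```

Values near zero indicate spatially random residuals, which is why the lower value reported for the local NB model (0.111 versus 0.143) is read as the better result.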
Third, using all the flows as samples, the correlations between the parameter estimates from the OLS-based and NB-based local models were calculated for OGDP (0.599), DGDP (0.582), and distance (0.25). This suggests that the GDP parameter estimates are moderately similar between the two methods, whereas the distance estimates are much less so.
Fourth, using the Lee–Sallee shape index for the three parameter estimates, values were found to be 0.41 for OGDP, 0.44 for DGDP, and 0.40 for distance. These similar but low values indicate that the spatial distributions of the three parameter estimates differ between the two methods. Both the correlation coefficient and the Lee–Sallee shape index confirm this disparity in the modeling results.
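The Lee–Sallee shape index is the ratio of the area shared by two shapes to the area covered by either, so it ranges from 0 (no overlap) to 1 (identical shapes). The sketch below is only an illustration of that ratio: this section does not state how the parameter surfaces were converted into shapes, so the code assumes, hypothetically, that each method yields a set of flow identifiers falling in the same class (for example, above-median estimates).

```python
def lee_sallee(shape_a, shape_b):
    """Overlap ratio |A intersect B| / |A union B| for two sets of unit identifiers."""
    a, b = set(shape_a), set(shape_b)
    union = a | b
    if not union:
        raise ValueError("both shapes are empty; the index is undefined")
    return len(a & b) / len(union)


# Hypothetical flow IDs classed as 'high' by each local model (assumption).
high_ols = {1, 2, 3, 4}
high_nb = {3, 4, 5, 6}

index = lee_sallee(high_ols, high_nb)  # 2 shared IDs out of 6 in the union
print(round(index, 3))
```

Under this reading, values of about 0.4 mean that well under half of the combined area is common to both methods.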
Finally, calibrating GWFM was found to be time-consuming because of the golden section search used for bandwidth selection, particularly with a large number of flows (3481 in this case study). On a computer with an Intel i5-3470 CPU (3.2 GHz) and 4.00 GB of RAM, the total times for calibrating the local OLS and NB models were 0.067 and 1.817 hours, respectively.
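The cost of bandwidth selection can be seen in a minimal sketch of golden section search, the one-dimensional minimization that the passage above appears to refer to. The score function here is a hypothetical stand-in; in a real calibration each call would refit every local model at the candidate bandwidth and return a criterion such as AICc, which is an assumption about the workflow rather than a description of the authors' code.

```python
import math


def golden_section_min(score, lo, hi, tol=1e-4):
    """Shrink [lo, hi] around the minimum of a unimodal score function."""
    inv_phi = (math.sqrt(5) - 1) / 2  # about 0.618
    c = hi - inv_phi * (hi - lo)
    d = lo + inv_phi * (hi - lo)
    fc, fd = score(c), score(d)
    evaluations = 2
    while hi - lo > tol:
        if fc < fd:
            # Minimum lies in [lo, d]; the old c becomes the new d.
            hi, d, fd = d, c, fc
            c = hi - inv_phi * (hi - lo)
            fc = score(c)
        else:
            # Minimum lies in [c, hi]; the old d becomes the new c.
            lo, c, fc = c, d, fd
            d = lo + inv_phi * (hi - lo)
            fd = score(d)
        evaluations += 1
    return (lo + hi) / 2, evaluations


def toy_score(bandwidth):
    """Hypothetical stand-in for a model-fit criterion, minimized at 30."""
    return (bandwidth - 30.0) ** 2 + 5.0


best_bandwidth, n_evals = golden_section_min(toy_score, 1.0, 100.0)
print(round(best_bandwidth, 2), n_evals)
```

Each iteration needs only one new score evaluation, but when every evaluation means recalibrating thousands of local regressions, a few dozen iterations add up, and an iteratively fitted NB model makes each evaluation costlier than its OLS counterpart.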
Big data, collected from the transaction records of toll-gates across Jiangsu province, was used to determine traffic flow (aggregated to county level) for the purpose of spatial interaction modeling. Using GDP as a pushing force at the origin site and a pulling force at the destination site, the flow-focused spatial interaction model was calibrated globally using ordinary least squares (OLS) regression and negative binomial (NB) regression. The results reveal that the pulling effect of economic development on traffic flows was stronger than its pushing effect in economically wealthy regions. To consider spatial autocorrelation and non-stationarity, local spatial interaction models were calibrated using geographically weighted OLS and NB, respectively. This study has confirmed that both local modeling methods (either OLS or NB oriented) can improve on the performance of their global counterparts in terms of modeling statistics (e.g., adjusted R² and AICc) and spatial autocorrelation (e.g., Moran's I). Both sets of modeling results were also found to exhibit strong spatial non-stationarity in the transport impacts of the economy and of transport distance. Comparatively, global and geographically weighted negative binomial flow models were found to reduce spatial dependence more efficiently than their OLS counterparts. In particular, results from local NB modeling, which differed substantially from those of geographically weighted OLS modeling, were found to better detect spatial non-stationarity.
In conclusion, both methods could be used to model global and local flows that result from complicated spatial interactions. Compared with their global counterparts, the two local modeling methods, which consider spatial non-stationarity, could be used to produce maps that help explain the spatial process of socio-economic contributions. More widely, these results (maps and statistics) could be used by policy makers to further regional economic and transport development. When flow data demonstrate an over-dispersed statistical pattern and a strongly clustered spatial pattern, GWNBR outperforms GWOLSR in reducing spatial autocorrelation in the model residuals. As a comparative study, this paper makes the following novel contributions to GIS.
First, this study is novel in its use of new big data: regional transaction data recorded at toll-gates on an expressway network across a large province. Statistical models of such flow data were used to help understand the varied contributions of economic development to traffic flows and to consider the spatial interaction between county units. Thus, this study has demonstrated the added value of using big data to analyze regional transport patterns.
Second, this study is novel in that it has developed and successfully applied geographically weighted NB, including the Moran's I of flow data, which has rarely been reported in the GIS literature. Unlike general geographically weighted regression methods, geographically weighted NB considers the complicated statistical and spatial patterns of flow data and as such requires the measurement of flow distance and the calibration of NB models. Again, this study has demonstrated the added value of employing GWNBR to model similar flow data locally.
Most importantly, this study adds value by comparing GWOLSR and GWNBR and discussing which is superior in reducing spatial autocorrelation in model residuals and in detecting spatial non-stationarity. The more a model reduces the spatial autocorrelation in its residuals, the better the model is. This will aid modelers in making decisions when modeling flow data, especially where spatial non-stationarity needs to be considered.
However, this study highlights potential challenges that could be addressed in future work. The first concerns the visualization of the parameter estimate maps, which was a challenge due to the large number of flow lines. Second, future work could focus on reducing the computational time for calibrating local models, particularly when working with a large matrix. Here, cloud or parallel computing has been identified as a potential solution. Third, challenges may arise as spatio-temporal flow modeling becomes increasingly complex due to the increased availability of flow data with high temporal resolution, e.g., the hourly records in this study.
Theoretically, more socio-economic variables could be included in the spatial interaction models. This would enable increased model performance and provide more evidence for regional policy making. Technically, transaction data should be disaggregated by vehicle type and mode of transport to enable the integration of spatial interaction modeling into traffic simulation.