A Network-constrained Integrated Method for Detecting Spatial Cluster and Risk Location of Traffic Crash: a Case Study from Wuhan, China

Research on spatial cluster detection of traffic crashes (TCs) at the city level plays an essential role in safety improvement and urban development. This study aimed to detect the spatial cluster pattern and identify riskier road segments (RRSs) of TCs constrained by the road network, using a two-step integrated method called NKDE-GLINCS that combines density estimation and spatial autocorrelation. The first step is novel and involves spreading TC counts onto a density surface using Network-constrained Kernel Density Estimation (NKDE). The second step calculates local indicators of spatial association (LISA) using the network-constrained Getis-Ord Gi* (GLINCS). GLINCS takes the smoothed TC density as its input value to identify the locations of high-risk road segments. The method was tested using TC data from 2007 in Wuhan, China. The results demonstrated that the method was valid for delineating TC clusters and identifying risky road segments. It was also more effective than traditional GLINCS using TC counts as input. Moreover, the top 20 road segments with high-high TC density at the significance level of 0.1 were listed. These results can promote better identification of RRSs, which is valuable in the pursuit of improving transit safety and sustainability in urban road networks. Further research should address spatial-temporal analysis and the exploration of TC factors.


Introduction
In recent years, the Chinese vehicle fleet has experienced rapid growth. The production and sales of vehicles in China have been ranked first in the world [1]. However, according to statistics from the National Bureau of Statistics, in 2006 the death rate per million registered vehicles in China was as high as 6.2, whereas the figure in the USA was estimated to be approximately 1.77 and the figure in Japan was only 0.77 [2,3]. Fatalities in the transportation industry account for 77.6% of all fatalities in production safety, 15 times the number in the coal mining industry [3]. Traffic crashes (TCs) have become the number one "killer" threatening people's lives and property in China [4]. How to keep public transit safe and sustainable has become a major concern for the Department of Road Administration.
Wuhan, with thoroughfares to nine provinces in China, is one of the largest cities in the country, with continuing expansion of its urban area and highway construction. In the period 2000-2010, the number of motor vehicles grew at 25 percent and continues to grow, whereas the road mileage increased by 3 percent [5]. As the demand for motor vehicle trips increases, tension between traffic supply and demand has emerged. Frequent TCs and traffic congestion have kept adversely affecting people's everyday life and social development. Stringent efforts should be made to raise traffic safety standards and control traffic congestion for the sustainable development of transportation and urbanization. Besides, it has been confirmed that TCs contribute around 10 to 15 percent of random traffic congestion and cause the greatest amount of lost time due to congestion delays [6]. A systematic analysis of TC scenarios, proper traffic control devices, suitable roadway design practices and effective traffic police activities can often help to reduce TCs. Moreover, it has been shown that spatial analysis can provide an effective means to detect the pattern and suggest reasons for the pattern characteristics [7][8][9]. The detection of TC patterns and the identification of hot spots is the first step of TC strategies, which include identification, ranking, profiling and treatment [10]. Nevertheless, city-level empirical research on the spatial pattern of TCs and risk road identification in China is lacking.
Two decades ago, it was noted that "there has been little published on the geography of traffic crashes" [11]. This has clearly changed over the last two decades [12]. In a broad sense, a TC is the result of interactions between human activities and the geographical environment. In geographical space, a TC is abstracted as a discrete event in an area or on a line. The factors associated with TCs include driver factors, motor vehicle conditions, roadway conditions, traffic characteristics and environmental factors [13]. Driver factors, defined as subjective judgments, are always involved in reacting to objective conditions, such as roadway conditions. Therefore, improving objective conditions can reduce TCs. In one sense, knowing the influence that road conditions have on crashes would help target the maintenance effort for the road system. Nevertheless, considering the costs and resources required to improve objective conditions, the question becomes how to determine which roads should be repaired first, particularly which roadway segments are riskier than others. RRSs are defined as segments with more TCs in the same time interval and over an equal roadway length. Considering that a TC is a kind of point event that often occurs along roadways, RRSs can be identified by detecting the cluster pattern of TCs along roadways at the city scale. Thus, the detection of spatial clusters of TCs is an essential approach to identifying RRSs for the appropriate allocation of resources for road safety improvements [14].
Research on the spatial pattern of TCs has progressed substantially in recent years. Previous work has shown that the distribution of TCs has apparent spatial cluster characteristics [14][15][16][17], i.e., there are TC hot spots, hot road segments or hot areas that are combined geographic units of high-risk points, road segments or areas [10,[18][19][20]. According to the scale of the research object and its research area, there are two main methods for analyzing the spatial pattern of geographic events: the area statistics method and the discrete event method.
As for area statistics, because of the spatial heterogeneity and spatial dependence of areas, global spatial autocorrelation and local indicators of spatial association (LISA) are used to measure the degree of clustering of area attributes, such as TCs from different areas [7,21,22]. The global spatial autocorrelation methods, including global Moran's I, global Geary's C, and Getis-Ord Gi*, can be used to describe the distribution of TCs across the study area; however, they ignore the location of TC clusters and the differences among TC clusters [23]. Subsequently, LISA methods, such as the local Getis-Ord Gi* developed from the global Getis-Ord Gi*, proved suitable for detecting hot areas and identifying the center of a cluster at a given significance level [24]. Because a TC is a spatial event in planar space but is constrained to the road network, a network-constrained LISA, named local indicators of network-constrained clusters (LINCS), was proposed [25]. GLINCS, based on the G statistic, and ILINCS, based on the I statistic, are the most commonly used LINCS in network space.
With regard to the spatial clustering of discrete events, the approaches include descriptive analysis, quadrat analysis and distance analysis. The most typical methods based on discrete events are arguably the nearest neighbor distance method, Ripley's K function methods [26], Kernel Density Estimation (KDE) methods [27,28] and others. Traditionally, KDE methods have been widely used in point-pattern analyses of discrete events, especially in TC analyses [14,29,30]. Although no single technique has emerged as the "best" for detecting and predicting TC clusters, recent research suggests that KDE outperforms other approaches because of its simplicity and ease of implementation [31]. The KDE method may also outperform the empirical Bayesian method in identifying hazardous road segments when only the location of the crash is available for the analysis [32]. However, although KDE has shown acceptable properties using density values, its assumption of a homogeneous 2D space seems inappropriate for events distributed in 1.5D space, such as TCs on a road network [33][34][35][36][37][38]. To overcome this limitation, Okabe proposed network-based spatial analysis, Network-Constrained Kernel Density Estimation (NKDE), which can overcome the shortcomings of the KDE method and reduce the deviation of its results [39][40][41]. Furthermore, research has demonstrated the validity of NKDE for analyzing network-based phenomena such as TCs [35,[42][43][44][45][46][47].
Although KDE and NKDE are useful methods for spatial cluster analysis in TC research, they have some limitations. One inevitable problem is the local maxima and boundary effects that arise from the derivation of the kernel function. Therefore, it is necessary to decide which clusters are statistically significant. Nevertheless, not enough attention has been paid to the statistical significance of KDE in the current literature [48]. The same question has been raised by researchers such as Xie and Anderson [14,42]. Xie noted that NKDE shares one of the fundamental drawbacks of planar KDE. Moreover, Plug stated that KDE is better suited to visualization than to the identification of black spots [49,50].
Hence, in this paper, KDE and NKDE are first compared to portray the spatial cluster characteristics of TCs at the area scale and the network scale, respectively. For each road network polyline, NKDE generates a smoothed density surface with reduced data noise and statistical bias. Secondly, to address the statistical significance of NKDE, the GLINCS method is used to identify high-risk road segments, with the density values as input attributes. Finally, the result of NKDE-GLINCS is compared with that of GLINCS.
This paper aims to evaluate and represent the TC pattern in order to contribute to traffic safety in Wuhan City. The detection can help to identify vulnerable locations and road segments that require remedial measures. A spatial method for visualizing TC spatial clusters and identifying risky road segments is presented. The remainder of this manuscript is organized as follows: first, a description of the data used in the current study; second, an explanation of the methods; third, the results of our analyses; and finally, a discussion of the implications and limitations of the methods, and suggestions for future research.

Data Source
TC in this research is defined as a motor vehicle crash with death or property loss that occurs on, and is restricted to, the roadway network in central urban areas. At the same time, it is a type of geographical phenomenon in a subset of 2-D geo-space. The main data sources of the present study, collected from Wuhan in 2007, are as follows: (1) TC data from the Traffic Management Department; (2) the roadway network; and (3) an administrative map. The data from 2007 were chosen because, at that time, the main form of transportation took place within the roadway network and neither the metro nor the "three urban highway" had come into use. All the data were stored as shapefiles and shown as a TC map in an ArcGIS platform (Figure 1).
The TC data are stored in a geo-database with basic information (such as time of day, fatality or serious injury, driver's characteristics, environmental conditions, vehicle's characteristics, etc.), roadway information (type of traffic control device, light condition, type of road, etc.) and handling information (insurance, compensation, cleaning up of the accident scene, etc.). According to the basic information, the total number of crashes in 2007 was 3113, consisting of 301 with fatalities, 2773 with injuries, and 39 with property losses. Only the address, time, and injury or fatality information is extracted, to increase computational speed and reduce the complexity of the methods below. Since the position information is stored as addresses in a traditional form, such as "Jiefang Road to Medicine Verified Agency", validating each TC against its address on a map is the first step of data pre-processing. In this research, the Geocoding API in "Baidu Map" is used to match each address to a location on the administrative map and the roadway map. The roadway network data are extracted and abstracted from the Traffic Map of Wuhan (2007). In the TC map, each roadway is shown as a polyline with its road grade as an attribute. In addition, each TC, abstracted as a point, is placed on the polyline of its matching road. Here, the urban roadway network includes main roadways, secondary roadways and branch roads and excludes metro-ways, highways, railways and ferry-ways. According to statistics at the city level, 78.3% of TCs occurred on main roadways, 8.6% on secondary roadways, 2.4% on branch roads and 10.6% on other types of roadways.
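Placing a geocoded crash point on the polyline of its matching road can be illustrated with a small geometric sketch. The paper does not state how the points were placed, so the orthogonal projection used here, the function name `snap_to_polyline` and the sample coordinates are all assumptions for illustration only.

```python
from math import hypot

def snap_to_polyline(point, polyline):
    """Return (snapped_xy, offset_along_line, distance) for the closest spot on a polyline.

    Assumption: planar (projected) coordinates in metres; the paper does not
    describe its own snapping rule.
    """
    px, py = point
    best = None
    travelled = 0.0  # length of the polyline walked so far
    for (x1, y1), (x2, y2) in zip(polyline, polyline[1:]):
        dx, dy = x2 - x1, y2 - y1
        seg_len = hypot(dx, dy)
        if seg_len == 0.0:
            continue  # skip degenerate segments
        # Parameter t in [0, 1] of the orthogonal projection onto this segment.
        t = max(0.0, min(1.0, ((px - x1) * dx + (py - y1) * dy) / seg_len**2))
        sx, sy = x1 + t * dx, y1 + t * dy
        dist = hypot(px - sx, py - sy)
        if best is None or dist < best[2]:
            best = ((sx, sy), travelled + t * seg_len, dist)
        travelled += seg_len
    return best

# Hypothetical road with one bend, and a crash geocoded 12 m off the road.
road = [(0.0, 0.0), (100.0, 0.0), (100.0, 50.0)]
snapped, offset, dist = snap_to_polyline((40.0, 12.0), road)
```

The returned offset along the line is what a later step would use to decide which fixed-length road unit a crash falls into.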

Network Kernel Density Estimation
NKDE mainly addresses the first-order properties of spatial data in a nonparametric way to reveal the cluster pattern of point events. It provides not only the density value of each target but also a continuous surface map of risk targets in the study area. A symmetric and continuous surface is placed on the center point of each spatial unit to calculate the density of the entire area, considering the distances between the center point and the locations of the observations within the surface. The symbols of the NKDE estimator are defined as follows.
$\lambda(s)$ is the density at location $s$; $r$ is the search width, which is always larger than 0 (only points within $r$ are used to calculate $\lambda(s)$); $d_{is}$ is the distance from the estimation point to the observation point marked as $i$; $k$, known as the kernel function, is a function of the ratio between $d_{is}$ and $r$ that measures the distance decay effect; and $a_i$ is the number of TCs at observation point $i$.
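For reference, one form of the estimator that is consistent with the symbol definitions above and with the lixel-based network KDE literature cited in this paper [47] is sketched below. The $1/r$ normalization and the placement of the weight $a_i$ are assumptions, not a confirmed reproduction of the authors' equation.

```latex
\lambda(s) = \sum_{i=1}^{n} \frac{1}{r}\, k\!\left(\frac{d_{is}}{r}\right) a_i ,
\qquad d_{is} \le r
```

Under this form, an observation contributes nothing to $\lambda(s)$ once its distance from $s$ exceeds the search width $r$.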
The choice of the two parameters, $r$ and $k$, is critical. When $r$ increases, the density surface becomes smoother and some details of the density are lost. When $r$ decreases, the density surface becomes uneven and the computational cost increases. Previous work has also shown that the choice of kernel function matters less than the choice of search width [47,48]. There are many kernel density functions, such as the Gaussian, Quartic, Conic, Negative exponential and Epanechnikov functions [51]. In this paper, the Gaussian function is used.
NKDE follows the same analysis process as KDE; the main difference is the measurement of distance. NKDE is based on the kernel function method of planar KDE, with the measurement of distance between two points extended from Euclidean distance to network distance. The core of NKDE lies in dividing the road network into basic units of fixed length (called lixels) and measuring $r$ and $d_{is}$ along the shortest path in the network. The procedure can be organized in six steps:
(1) Check the topology and connectivity of the road network, and merge road segments with the same road name.
(2) Divide each road with a fixed length (marked as l) into basic road segment units (called lixels).
(3) Calculate the number of TCs in each lixel, marked as i, i = 1, 2, ..., n.
(4) Use the kernel function to determine the density contribution of each TC to each lixel inside the search width.
(5) Determine the density value of each lixel, which is the sum of the density contributions from each TC to the lixel within the search width.
(6) Output the density value of each lixel.
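Steps (3) to (6) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the GeoDaNet implementation used in the study: lixels are represented by their center points, adjacent centers are joined by edges of known length, the Gaussian kernel is truncated at the search width, and the names `gaussian_kernel`, `network_distances` and `nkde` are hypothetical.

```python
import heapq
from math import exp, pi, sqrt

def gaussian_kernel(u):
    """Gaussian kernel on the ratio u = d / r, set to zero beyond the search width."""
    return exp(-0.5 * u * u) / sqrt(2.0 * pi) if u <= 1.0 else 0.0

def network_distances(adjacency, source, cutoff):
    """Dijkstra shortest-path distances from source, ignoring lixels beyond cutoff."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, length in adjacency[node]:
            nd = d + length
            if nd <= cutoff and nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return dist

def nkde(adjacency, counts, r):
    """Density per lixel: sum over source lixels of a_i * k(d_is / r) / r (assumed form)."""
    density = {lixel: 0.0 for lixel in adjacency}
    for source, a_i in counts.items():
        if a_i == 0:
            continue
        for target, d in network_distances(adjacency, source, r).items():
            density[target] += a_i * gaussian_kernel(d / r) / r
    return density

# Toy network: five 10 m lixels in a row (0-1-2-3-4), centers 10 m apart.
adjacency = {i: [] for i in range(5)}
for i in range(4):
    adjacency[i].append((i + 1, 10.0))
    adjacency[i + 1].append((i, 10.0))
counts = {0: 0, 1: 0, 2: 3, 3: 0, 4: 0}  # three crashes on the middle lixel
density = nkde(adjacency, counts, r=20.0)
```

In this toy case the density peaks on the middle lixel and decays symmetrically along the road, which is the smoothing effect the six steps are meant to produce. A real network would also need the topology checks of step (1) and careful handling of intersections, which this sketch omits.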

Local Indicator of Network-Constrained Clusters of Getis-Ord Gi*
Getis-Ord Gi*, one of the most widely used methods for evaluating clusters, can identify the actual locations where hotspots are clustered together based on a formal assessment of statistical significance [52,53]. It evaluates the degree to which each feature is surrounded by features with similarly high or low values within a specified geographical distance (neighborhood), and it can measure the concentration of high or low values over the study area. Large Z-values indicate that hotspots are clustered together, whereas low (strongly negative) Z-values indicate that cold spots are clustered together. In the Getis-Ord Gi* local statistic (Equation (4)), all locations beyond the threshold distance d are not neighbors (indicated by 0 in the W matrix) [54], and n is equal to the total number of features. In this case, the relationship between lixel unit i and the other lixel units within its neighborhood distance falls into four types: high-high, low-low, high-low and low-high. High-high and low-low mean that there is a positive correlation between the unit and its neighbors. High-low and low-high mean that there is a negative correlation between the unit and its neighbors. If the densities of a lixel unit and its neighbors are all high at a given level of statistical significance, that is, a high-high correlation, these high-high units (H-H segments) are merged as RRSs in this paper.
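For reference, the widely used z-score form of the Getis-Ord Gi* statistic is written out below, with $x_j$ the attribute value of feature $j$, $W_{ij}(d)$ the binary spatial weight between features $i$ and $j$ at threshold distance $d$, and $n$ the total number of features. It is the standard textbook form and is assumed, not confirmed, to match the authors' Equation (4) term by term.

```latex
G_i^*(d) =
\frac{\sum_{j=1}^{n} W_{ij}(d)\, x_j \;-\; \bar{X} \sum_{j=1}^{n} W_{ij}(d)}
     {S \sqrt{\dfrac{n \sum_{j=1}^{n} W_{ij}(d)^2 - \left(\sum_{j=1}^{n} W_{ij}(d)\right)^2}{n-1}}},
\qquad
\bar{X} = \frac{1}{n}\sum_{j=1}^{n} x_j ,
\qquad
S = \sqrt{\frac{1}{n}\sum_{j=1}^{n} x_j^2 - \bar{X}^2}
```

Because the statistic is already standardized, its value can be read directly as a z-score when judging significance.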
GLINCS, based on Getis-Ord Gi*, measures autocorrelation and concentration with essentially the same equation, Equation (4). Nevertheless, the definition of the weight matrix W is quite different. In GLINCS, the network is split into smaller segments to better reflect the scale of the research data and the research area. $W_{ij}$ in Equation (4) is defined as the connectivity between segments i and j and indicates whether segment i and segment j share a common node [55]. The resulting value indicates the autocorrelation and concentration around the observed link of interest i (i = 1, …, n). The Getis-Ord Gi* is added as a supplement to the result of the NKDE for two main reasons. First, the smoothed density surface can decrease the noise and bias of TC point locations. A traffic crash, regarded as a point event with a precise location on a map, may in fact occur over and affect a certain length of road. NKDE spreads the counts to nearby road segments and thus serves as a smoothing step before GLINCS. In addition, the density values can be interpreted as a risk index when used as the input of GLINCS, in contrast to a counting index. Second, the LISA process helps to formally evaluate the significance of the extent of locations with high density values. Although both local Moran's I and Getis-Ord Gi* are able to reveal hot spots, Getis-Ord Gi* is able to differentiate whether the autocorrelation is due to the spatial association of high-high values or of low-low values. Moreover, as Xie pointed out, different methods of local statistics measurement should be tested when KDE and LISA are integrated for cluster detection [47].
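The connectivity-based weight matrix and the resulting statistic can be illustrated with a short sketch. It assumes the standard z-score form of Gi*, represents each segment by the ids of its two end nodes, and counts a segment as its own neighbor (the "star" in Gi*); the function names and the density values are hypothetical.

```python
from math import sqrt

def shared_node_weights(segments):
    """Binary weights: w[i][j] = 1 if segments i and j share an end node (i == j included)."""
    n = len(segments)
    return [[1 if set(segments[i]) & set(segments[j]) else 0 for j in range(n)]
            for i in range(n)]

def getis_ord_gi_star(values, weights):
    """Gi* z-score for every segment, using the standard z-score form (an assumption)."""
    n = len(values)
    mean = sum(values) / n
    s = sqrt(sum(v * v for v in values) / n - mean * mean)
    scores = []
    for i in range(n):
        w = weights[i]
        sum_w = sum(w)
        sum_w2 = sum(x * x for x in w)
        numerator = sum(wj * xj for wj, xj in zip(w, values)) - mean * sum_w
        denominator = s * sqrt((n * sum_w2 - sum_w**2) / (n - 1))
        scores.append(numerator / denominator)
    return scores

# Toy chain of six segments; each tuple holds the two end-node ids of a segment.
segments = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6)]
densities = [0.9, 0.8, 0.7, 0.1, 0.1, 0.1]  # hypothetical NKDE densities
z = getis_ord_gi_star(densities, shared_node_weights(segments))
```

In the toy chain, the segment surrounded by high densities receives the largest positive score and the segment surrounded by low densities the most negative one. The sketch does not guard against constant input values or a segment connected to every other segment, both of which would make the denominator zero.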

Results and Discussion
Using the proposed methods, a case study was conducted with real-world TC data from the city of Wuhan. Firstly, the density values of TCs, obtained with the KDE and NKDE methods using the GeoDaNet software, were compared to indicate the TC pattern at the city level. Secondly, GLINCS was calculated on the NKDE density values and on the counting values, with 99 iterations of Monte Carlo simulation. Finally, the clusters and riskier segments were found by using the Z-value to test significance.

Results of the KDE and NKDE for Traffic Crash Events
To determine the most appropriate method for detecting the cluster pattern, four contrast experiments were conducted based on the KDE and NKDE methods. The parameters and density values are shown and compared in Table 1. To examine how the scale effect of NKDE influences cluster discrimination, the same lixel length and different search widths were set in experiments 1 and 2. To examine how the resolution effect of NKDE influences cluster discrimination, different lixel lengths and the same search width were set in experiments 2 and 3. To compare the characteristics of the cluster patterns from KDE and NKDE, the same search width was set in experiments 3 and 4. The search width was set at 40 m and 200 m, based on the minimum width of urban main roads. According to the specifications for urban road planning, the width of a main road is defined as 45-55 m, that of a secondary road as 40-50 m and that of a branch road as 15-30 m. Thus, the minimum value of r should be no less than 40 m. In addition, an applicable value of r of 100-300 m has been suggested, which is widely used in urban planning at the scale of neighborhood, block and street [56]. We used 200 m because it produced more obvious hot segments than the other search widths tested (100 m, 500 m and 1000 m). The lixel is analogous to the resolution of a raster: the smaller the lixel, the higher the precision. Hence, the lixel length in NKDE was set at 10 m and 40 m, values that were also identified by Xie [47]. In addition, a Gaussian kernel is generally robust and a usual choice for KDE.
As shown in Table 1, a stronger cluster pattern, indicated by a higher standard deviation and density sum, resulted when the search width was wider and the lixel was longer. Comparing experiments 1 and 2, the sum of the density values over all segments increased as the search width increased from 40 m to 200 m. Comparing experiments 2 and 3, the number of lixel segments increased approximately 1.6 times as the lixel length changed from 40 m to 10 m, which means that more segments were shorter than the lixel length when the length was set at 40 m. Moreover, the mean density remained stable between experiments 2 and 3, but the sum and standard deviation in experiment 2 were greater than those in experiment 3. In contrast, the density statistics in experiment 4 decreased greatly, with a lower sum and standard deviation.
A group of thematic maps was produced to show the NKDE density values on the road network (see Figure 2a-c), and a density surface was displayed as a raster map to visualize the result of KDE (see Figure 2d). In Figure 2a, the density values show a scattered distribution pattern consistent with the original location map of TCs. Moreover, the road segments with higher density values were located on the main roads in the city center. The patterns in Figure 2b,c presented quite similar distribution characteristics, with a distinct high-low cluster pattern in which the districts with higher density values were Jiang'an and Jianghan and the roads with higher density values were Hanyang Avenue, Zhongshan Avenue, Zhongbei Road, Jiefang Avenue and others. Nonetheless, the map in Figure 2b reveals more cluster details than the map in Figure 2c because of its smaller lixel length. Owing to the resolution effect of the smaller lixel, more information on distribution variation is captured over the same length of road in experiment 2 than in experiment 3. The cluster pattern in Figure 2d portrays approximate hot areas in planar space rather than exactly on the roads.

The GLINCS of NKDE
After the density value is calculated for each road segment using NKDE, it is used as an attribute for computing GLINCS in order to evaluate the NKDE result quantitatively. In this part, five experiments were conducted. As seen in Table 2, the input attributes in the first two experiments were the smoothed density values, whereas those in the last three experiments were crash events counted by allocating each crash point to its nearest network edge. To contrast NKDE-GLINCS with GLINCS, experiments 7 and 8 were carried out as a comparative study for experiments 5 and 6. Experiment 9 was performed to assess whether TC spatial clustering existed at the city level. To detect H-H road segments at different levels of statistical significance, Monte Carlo simulation was repeated 99 times via a conditional permutation process. Considering the intensive computational load of our experiments, 99 was chosen as a balance between the stability of the randomization and computational efficiency. The null hypothesis for Getis-Ord Gi* is complete spatial randomness. Under this hypothesis, the observed density values represent one of many possible spatial arrangements, which are generated by a conditional spatial permutation process that shuffles the density values over the network. The results revealed that when the significance level was relaxed from 0.01 to 0.1, the number of H-H segments in both experiment 5 and experiment 6 increased, first at a relatively high rate and then at a relatively low rate. In addition, the number of H-H road segments in experiment 5 remained a relatively constant multiple (2.6-2.8 times) of the number in experiment 6. Experiment 9 showed that spatial clustering exists in traffic crashes. In this experiment, the road network was not clipped into fixed lixels, and the result of Getis-Ord Gi* successfully revealed a global cluster pattern, in contrast to experiments 7 and 8.
Results in experiments 7 and 8 revealed that the GLINCS method failed to identify the H-H road segments. In contrast, results in experiments 5 and 6 demonstrated that the NKDE-GLINCS method, which uses kernel density as input, could disclose and identify H-H segments successfully. The distribution of H-H segments at the significance level of 0.1 in experiment 6 is shown in Figure 3a.
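The conditional permutation test with 99 repetitions can be sketched as follows. This is an illustrative assumption about how such a test is commonly set up, not the GeoDaNet procedure: the value at segment i is held fixed, the remaining values are shuffled, and a one-sided pseudo p-value is computed. The function names and the toy data are hypothetical.

```python
import random

def local_sum(i, values, neighbors):
    """Unstandardized Gi*-style local sum: value at i plus the values of its neighbors."""
    return values[i] + sum(values[j] for j in neighbors[i])

def pseudo_p_values(values, neighbors, permutations=99, seed=0):
    """Conditional permutation test: hold x_i fixed and reshuffle the other values."""
    rng = random.Random(seed)
    n = len(values)
    p_values = []
    for i in range(n):
        observed = local_sum(i, values, neighbors)
        others = [values[j] for j in range(n) if j != i]
        k = len(neighbors[i])
        extreme = 0
        for _ in range(permutations):
            # Draw k values without replacement to stand in for the neighbors of i.
            simulated = values[i] + sum(rng.sample(others, k))
            if simulated >= observed:
                extreme += 1
        # One-sided pseudo p-value for clustering of high values.
        p_values.append((extreme + 1) / (permutations + 1))
    return p_values

# Toy chain of ten segments with a high-density run at one end (hypothetical values).
values = [9, 8, 7, 1, 1, 1, 1, 1, 1, 1]
n = len(values)
neighbors = {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}
p = pseudo_p_values(values, neighbors)
```

With 99 permutations the smallest attainable pseudo p-value is (0 + 1) / (99 + 1) = 0.01, which matches the strictest significance level reported in the experiments. For a fixed segment, a permutation leaves the global mean and standard deviation unchanged, so comparing local sums gives the same ranking as comparing Gi* z-scores.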

The Detection of Riskier Road Segments
Although both experiments 5 and 6 can detect H-H segments, experiment 5 identifies them without excessive detail, which keeps the segments at a coherent and valid length. RRSs were detected by combining H-H road segments with the same road name at the significance level of 0.1. Here, the RRSs in experiment 5 were ranked according to the GLINCS value and the density value, from high to low. The first 20 RRSs are shown in Table 3. Moreover, the RRSs with TCs can be seen in the three-dimensional visualization produced with ESRI ArcScene in Figure 3b. The height of each red column represents the mean TC density value in the road segments. When the Z-score is greater than 1.65, which means that the road segments are significantly risky at the significance level of 0.1, a higher column denotes an RRS and a wider column denotes an intense cluster of RRSs. In Table 3 and Figure 3, the riskiest road in Wuhan was Hansha Road, with a considerably high density value and the largest G*, which is consistent with government announcements and news reports. One reason is the high speed of motor vehicles near the entrance of the Hancai Highway. Furthermore, these sections of high-risk road have been improved by the local department of transport in recent years. Moreover, Jiefang Main Avenue was shown to be a secondary risky road, with the maximum number of lixels and a steady but lower density value relative to those of other risky roads such as Dingziqiao Road. To date, Jiefang Main Avenue has consistently been flagged as a hazardous road for TCs because of its high traffic flow and especially the traffic jams during the morning and evening rush hours. This situation also exists on Zhongbei Road. In addition, risky roads are more likely to be located in the district of Jiang'an, considering the complexity of its road network.
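The merging and ranking step can be illustrated with a short sketch. The paper states that H-H segments with the same road name are combined and ranked by the GLINCS value and the density value; the aggregation chosen here (maximum z-score per road, then mean density as the tie-breaker) is an assumption, and the road names and numbers are hypothetical rather than taken from Table 3.

```python
from collections import defaultdict

Z_CRITICAL = 1.65  # threshold quoted in the text for the 0.1 significance level

def rank_riskier_roads(lixels, top=20):
    """Group significant high-high lixels by road name and rank the merged road segments."""
    groups = defaultdict(list)
    for lixel in lixels:
        if lixel["z"] > Z_CRITICAL:
            groups[lixel["road"]].append(lixel)
    summary = []
    for road, members in groups.items():
        summary.append({
            "road": road,
            "lixels": len(members),
            "max_z": max(m["z"] for m in members),
            "mean_density": sum(m["density"] for m in members) / len(members),
        })
    # Rank by the Gi* value first, then by mean density, both from high to low.
    summary.sort(key=lambda row: (row["max_z"], row["mean_density"]), reverse=True)
    return summary[:top]

lixels = [  # hypothetical values for illustration only
    {"road": "Road A", "z": 3.1, "density": 0.42},
    {"road": "Road A", "z": 2.4, "density": 0.38},
    {"road": "Road B", "z": 1.9, "density": 0.51},
    {"road": "Road B", "z": 0.7, "density": 0.20},
    {"road": "Road C", "z": 1.2, "density": 0.30},
]
ranking = rank_riskier_roads(lixels)
```

Roads whose lixels all fall below the threshold drop out of the ranking entirely, so the output lists only roads with at least one significant H-H lixel.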

Summary and Conclusions
The intense demand for roads to support rapid economic development has made road traffic crashes that cause traffic congestion one of the most pervasive "bottlenecks" in Wuhan, China. Secure and efficient transportation and mobility are key components of, and central to, the sustainable development of all-round urbanization. For an effective solution to the TC problem with a limited budget, risky road segments with a higher probability of TCs should be given priority for maintenance. The detection of spatial clusters of TCs is the first vital step toward the appropriate allocation of resources for safety improvement in a sustainable way. To identify riskier road segments in Wuhan, a two-step approach using NKDE, extended from KDE in planar space, and GLINCS, based on Getis-Ord Gi*, was illustrated. As presented, TCs on the road network in Wuhan, with a total of 3113 crashes between motor vehicles, were selected for testing and verification. The results confirm that NKDE-GLINCS performs better than traditional GLINCS in identifying clusters, owing to the NKDE smoothing used as preprocessing. The case study also provides evidence of the effectiveness and robustness of the NKDE-GLINCS method. In addition, the top 20 roads with high-high TC density at the significance level of 0.1 are listed and presented in a 3-D visualization. The results of this case study should be useful in helping transportation agencies and motorists to identify risky roads quickly, and they can play an important role in the further analysis and prediction of TCs.
Compared with conventional TC analysis methods, NKDE can be used not only for analyzing the properties of point events and measuring the variation in the mean values of spatial processes but also as a preprocessing step that produces smoothed density values from the original data. The main advantage of the NKDE method is that the uncertainty in the process can be understood and the method can be implemented easily. However, NKDE methods on their own may only be usable as visualization tools because of the absence of significance testing. Herein, the NKDE result was used as the input attribute for GLINCS so that the density indicator could be used to formally evaluate the significance of locations with high density values.
Although the NKDE-GLINCS method for detecting the cluster pattern of TCs is practical and advantageous, some aspects can still be improved. As discussed previously, only the spatial characteristics of TCs were analyzed in this study, whereas previous research has shown that the factors associated with TCs may be diverse and complicated [15,17,47,57,58]. Thus, further study is needed to add other parameters to the kernel function and weight matrix, such as road density, road accessibility and land use in the study area. Apart from these possible improvements, TC distribution analyses using NKDE-GLINCS in other areas or cities at different scales are still expected. In addition, other applications to geographical events constrained by or associated with networks are encouraged in the future.

Figure 1. Locations of traffic crashes in Wuhan in 2007.
In Equation (4), $G_i^*(d)$ is computed for a feature $i$ within a distance $d$, standardized as a z-score; $x_j$ is the attribute value for feature $j$ within distance $d$ of a given feature $i$; $W_{ij}(d)$ is a symmetric one/zero spatial weight matrix based on a threshold $d$ for the distance between features $i$ and $j$; and the threshold $d$ defines the distance within which all locations are considered neighbors (indicated by 1 in the W matrix).

Figure 2. Density values in experiments of NKDE (network-constrained kernel density estimate) and KDE (kernel density estimate). (a) Density map of experiment 1; (b) Density map of experiment 2; (c) Density map of experiment 3; (d) Density map of experiment 4.

Table 1. Parameters and density values in experiments of KDE (kernel density estimate) and NKDE (network-constrained kernel density estimate).

Table 2. Parameters and H-H (high-high) segment numbers in experiments of GLINCS.

Table 3. Top 20 RRSs (riskier road segments) under the significance level of 0.1.