Identify Road Clusters with High-Frequency Crashes Using Spatial Data Mining Approach

Abstract: This paper develops a three-step spatial data mining approach to directly identify road clusters with high-frequency crashes (RCHC). The first step, preprocessing, stores the roads and crashes in a spatial database. The second step describes the conceptualization of road–road and crash–road spatial relationships: the spatial weight matrix of roads (SWMR) is constructed to describe road–road spatial relationships, and crash–road spatial relationships are established using a crash spatial aggregation algorithm. The third step, spatial data mining, identifies RCHC using cluster and outlier analysis (local Moran's I index). The approach was validated using a spatial data set of roads and road-related crashes (2008–2018) from Polk County, Iowa, USA. The findings show that the proposed approach is successful in identifying RCHC and road outliers.


Introduction
According to the World Health Organization, ~1.25 million people die each year on the roads as a result of crashes (traffic accidents) [1]. The road traffic network is an integrated and complex system consisting of four elements: "people, vehicle, roads, and environment" [2,3]. As a carrier of traffic, roads and their ancillary facilities have an important impact on the frequency of crashes. From the perspective of transportation authorities and safety specialists, strategies such as renovating road facilities, improving road traffic conditions, and installing crash warning signs at road clusters with high-frequency crashes (RCHC) are effective in reducing crashes. Therefore, given the massive number of roads, identifying RCHC is one of the most significant challenges faced by transportation authorities and safety specialists.
A review of previous studies shows that data mining [4][5][6] has been widely applied to traffic crash analysis. Kumar et al. [7] applied latent class clustering and the k-modes clustering technique to road accident data from Haridwar, India. Castro and Kim [8] explored the role of different factors in injury risk using a Bayesian network, decision trees, and artificial neural networks to detect the factors with the greatest influence on car accidents. Taamneh et al. [9] established a set of rules that can be used by the United Arab Emirates Traffic Agencies to identify the main factors that contribute to accident severity. Li et al. [10] applied statistical analysis and data mining algorithms to a fatal accident dataset in an attempt to discover variables that are closely related to fatal accidents.
The above studies use data mining to obtain the relationships between non-spatial factors and traffic crashes, and neglect the geospatial features associated with traffic crashes. In contrast to non-spatial data mining, only a few studies have been dedicated to the spatial autocorrelation measure [11], an important method of spatial data mining [12][13][14], for identifying crash hotspots. Ouni and Belloumi [15] examined the stability of the performance of two spatial autocorrelation measures based on a road safety risk index by comparing the results in Tunisia. Xie and Yan [16] integrated network kernel density estimation with local Moran's I for hot spot detection of traffic accidents. Blazquez and Celis [17] identified critical areas with high child pedestrian crash risk in the city of Santiago, Chile, using kernel density estimation and Moran's I index in a GIS environment.
Among the spatial autocorrelation measures, including Moran's I [18], Geary's ratio [19], and Getis-Ord Gi* [20], Moran's I is the most favored by researchers, as its distributional characteristics are more desirable and the indicator has greater general stability and flexibility [18,21]. Moran's I, first known as a single global indicator for assessing spatial autocorrelation, can qualitatively detect whether features are dispersed, random, or clustered over the entire space with respect to their attribute values. In this context, it is important to note that the global Moran's I cannot quantitatively identify the roads on which traffic crashes are mainly concentrated.
Therefore, it is necessary to calculate the local Moran's I index [22] of each road and perform cluster and outlier analysis to reveal RCHC. The cluster and outlier analysis method examines the local Moran's I index of each individual road in comparison with its neighboring roads, which makes it an effective method to identify RCHC.
In addition, some studies have used network kernel density estimation [23][24][25] as a spatial data mining method to compute spatial concentrations of point-based crashes in a road network. In these studies, the spatial weight between crashes is defined by the distance or spatial closeness of crashes along the road network. These studies take point-based crashes as the research object and use point-based spatial clustering analysis, which is effective in detecting hazardous locations from clusters of crashes.
The above studies provide a foundation for the research content of this paper. However, previous studies using spatial data mining methods neglected some issues. First, these studies mainly focus on point-based spatial clustering to find crash hotspots, and thus cannot directly identify RCHC. Second, some studies neglected the road-road or crash-road spatial relationships, which affect the accuracy of the spatial data mining results.
To solve these issues, first, this paper focuses on a line-based spatial clustering method and takes linear roads as the research object, which makes it possible to identify RCHC directly. Second, road-road and crash-road spatial relationships are applied in the spatial data mining methods. Crashes are spatially aggregated into the attribute of the count of crashes of road (ACCR) by considering the road-crash geometric and attribute relationships. Then, a spatial weight matrix of roads (SWMR) [26][27][28] is established based on the road-road topological and geometric relationships. The ACCRs and the SWMR are used as the input parameters of the cluster and outlier analysis (local Moran's I) to improve the accuracy of the spatial data mining results.
The aim of this study is to (a) create an accurate SWMR of a complex road network that accounts for overpass and underpass crossings, to support further spatial statistical analysis (e.g., high-low clustering, hotspot analysis), and (b) identify the RCHC to help transportation authorities and safety specialists identify and prioritize roads that require more safety attention to reduce crashes.
The rest of the paper is organized as follows. Section 2 presents the methodology used in this study. Section 3 describes the spatial data, including the traffic crashes and roads of Polk County, Iowa, used in this study. Section 4 illustrates and discusses the results of applying the methodology within the study area. Section 5 recommends future work. Finally, Section 6 concludes the paper.

Process Map
This paper presents a three-step spatial data mining approach to directly identify RCHC: (1) preprocessing, (2) conceptualization, and (3) spatial data mining. The process map for the approach is shown in Figure 1.

1) Preprocessing
The first step (preprocessing) is to store the road network, crashes, and region boundary in a spatial database. In this paper, we use the PostgreSQL database [29] and the PostGIS spatial data engine [30] to store and query the massive spatial data of roads and crashes. The attribute and geometry information of the road and crash tables is inherited from the input road and crash shapefiles.

2) Conceptualization
The second step (conceptualization) is to build the conceptualization of spatial relationships. There are two types of conceptualization of spatial relationships: (1) the conceptualization of linear road relationships and (2) the conceptualization of road-crash relationships. The spatial weight matrix of roads (SWMR) is constructed to describe the conceptualization of road spatial relationships. The road network topology is created using pgRouting [31], which extends the PostGIS/PostgreSQL geospatial database with road network routing functions (e.g., pgr_createTopology, pgr_createVertices, pgr_analyzegraph, pgr_nodeNetwork). The road network DBF table with the count of crashes is calculated using the crash spatial aggregation algorithm to describe the conceptualization of spatial relationships between roads and crashes.

3) Spatial Data Mining
The third step (data mining) is to identify RCHC using the cluster and outlier analysis (local Moran's I). Building on studies of the point-based spatial clustering analysis method, this paper proposes a method of directly identifying RCHC by using a line-based Moran's I index. We can quantitatively identify RCHC (positive autocorrelation) and road outliers (negative autocorrelation) via map visualization of the road clusters and outliers shapefile.

File Types
Three types of files are used in the approach: (1) input files, (2) intermediate files, and (3) output results, as shown in Table 1. The input files include the point-based shapefile of crashes, the linear shapefile of roads, and the regional shapefile of the boundary. All the files contain the geometry and the important attributes (e.g., location and names).

The Intermediate Files
The intermediate files include the PostgreSQL traffic database and the SWMR file. We convert the input shapefiles to a traffic database that contains the road DBF table, the crash DBF table, and the boundary DBF table using the tool "PostGIS 2.0 Shapefile and DBF". The road DBF table contains the following information: geometry, attributes, network topology, and road-related crashes.

Output Results
The output results include the SWMR file and the road clusters and outliers shapefile. The SWMR file is constructed based on the geometric and topological adjacency of the road network to describe the conceptualization of road spatial relationships as a foundation of spatial analysis. To improve the versatility of the approach, we store the SWMR data in an ASCII-encoded gwt format file, which is compatible with spatial analysis software such as ArcGIS [32] and GeoDa [33]. The first line of the SWMR file in gwt format is the name of the unique identifier field (e.g., ID). After that, each row in the file is formatted into three columns: the ID of road i (ID_i), the ID of road j (ID_j), and the spatial weight (W_ij). The road clusters and outliers shapefile contains the high-high and low-low road clusters and the high-low and low-high outliers that result from the cluster and outlier analysis (local Moran's I index). The high-high road clusters are the RCHC that this research aims to identify.

Remark on Availability of Input Files
The availability of input data determines the practicability of the approach. The input data can be easily accessed from Geofabrik's free download server and the state Department of Transportation, as noted in Section 3. However, the traffic crash shapefile, which is not the core input data, may be difficult to obtain for a few cities or regions. Because this approach takes roads (lines) rather than crashes (points) as the core research objects, a plain-text CSV file or Excel XLS file of crashes containing the basic information (e.g., crash ID and location) can also be used to count the crashes on each road by applying the attribute matching method when a detailed crash shapefile is unavailable. This improves the practicability of the approach.

Crash Spatial Aggregation Algorithm
To find the RCHC, we aggregate crashes onto roads as the count of traffic crashes on each road. First, we add two fields (<crash_count>, <crashlist>) to the road DBF table. Second, we calculate these attributes (<crash_count>, <crashlist>) for each road. To determine whether a crash is on a road, the following two premises are considered.
Premise 1: As crash global positioning system (GPS) coordinates have a positioning error of approximately 10 m [34], a crash is considered to occur on a road if its coordinates are within the 10 m buffer of the linear road.
Premise 2: As the Interstate Highway standards for the U.S. Interstate Highway System use a 12 foot (3.7 m) standard lane width [35], a crash can be determined to occur on a road if its coordinates are within the 3.7 m buffer of the road.
Under the above premises, we determine that a crash is on a road if the geometric and attribute relationships between the crash and the road meet both Conditions 1 and 2 (geometric and attribute integrated matching). If a crash cannot be matched to any road by geometric and attribute integrated matching, it is spatially aggregated to the road that meets both Conditions 3 and 4 (spatial fuzzy matching).

1. Condition 1: The shortest distance between the traffic crash and the road is less than 47 m (10 lanes × 3.7 m lane width + 10 m GPS positioning error), with candidate roads ordered by distance.
2. Condition 2: The name of the road matches the location attribute of the crash.
3. Condition 3: The shortest distance between the traffic crash and the road is less than 10 m (GPS positioning accuracy).
4. Condition 4: The shortest distance between the traffic crash and the road is the minimum over all roads in the data set.
We use the Qt platform to implement the crash spatial aggregation algorithm on the PostGIS/PostgreSQL database following these conditions. Figure 2 shows the definition of the main data types. Figure 3 shows the main structure of the algorithm.
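The two-stage matching described by Conditions 1 to 4 can be illustrated with a minimal Python sketch. It is not the authors' Qt/PostGIS implementation: the dictionary layout of roads and crashes, the function names, the use of planar coordinates in metres, and the case-insensitive substring test for attribute matching are all assumptions made for illustration.

```python
import math

GPS_ERROR_M = 10.0                               # Premise 1: GPS positioning error
LANE_WIDTH_M = 3.7                               # Premise 2: standard lane width
NAME_MATCH_M = 10 * LANE_WIDTH_M + GPS_ERROR_M   # Condition 1 threshold (47 m)


def point_segment_distance(p, a, b):
    """Shortest planar distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    # Projection parameter of p on the segment, clamped to [0, 1]
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def road_distance(p, polyline):
    """Shortest distance from point p to a road given as a list of vertices."""
    return min(point_segment_distance(p, a, b)
               for a, b in zip(polyline, polyline[1:]))


def match_crash(crash, roads):
    """Return the id of the road a crash is aggregated to, or None."""
    ranked = sorted((road_distance(crash["xy"], r["geom"]), r["id"], r["name"])
                    for r in roads)
    # Conditions 1 and 2: within 47 m and road name found in the crash location
    # (substring matching is an assumed stand-in for attribute matching).
    for dist, road_id, name in ranked:
        if dist < NAME_MATCH_M and name.lower() in crash["location"].lower():
            return road_id
    # Conditions 3 and 4: otherwise take the nearest road if it is within 10 m.
    if ranked and ranked[0][0] < GPS_ERROR_M:
        return ranked[0][1]
    return None


def aggregate(crashes, roads):
    """Fill the <crash_count> and <crashlist> attributes for every road."""
    table = {r["id"]: {"crash_count": 0, "crashlist": []} for r in roads}
    for crash in crashes:
        road_id = match_crash(crash, roads)
        if road_id is not None:
            table[road_id]["crash_count"] += 1
            table[road_id]["crashlist"].append(crash["id"])
    return table
```

In a spatial database the distance ranking would normally be delegated to the database engine; the sketch keeps everything in memory only to make the order of the two matching stages explicit.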

SWMR Construction Algorithm
A spatial weight matrix (SWM) is a representation of the spatial structure of the data, and it is designed to generate, store, reuse, and share the conceptualization of the relationships among a set of features [36]. An SWM is a key input parameter of cluster and outlier analysis, and it directly determines the correctness of the calculated results. Consequently, the results of cluster and outlier analysis generally cannot be trusted when an inappropriate SWM is used.
As this paper takes linear roads as the research object, we need to build an SWMR that matches the spatial distribution characteristics of roads in order to identify RCHC using cluster and outlier analysis. Conceptually, the SWMR is an N × N matrix, as shown in Equation (1) [26,27], with one row and one column for every road.

$$
W = \begin{bmatrix}
W_{11} & W_{12} & \cdots & W_{1N} \\
W_{21} & W_{22} & \cdots & W_{2N} \\
\vdots & \vdots & \ddots & \vdots \\
W_{N1} & W_{N2} & \cdots & W_{NN}
\end{bmatrix} \tag{1}
$$

where N is the number of roads; i and j are the unique identifiers of roads; and W_ij is the weight (the cell value for a given combination of row i and column j) that quantifies the spatial relationship between roads i and j.
Typically, W_ij in an SWMR is defined using Euclidean distance measurements and contiguity, fixed, or inverse distance weighting schemes [37,38]. However, a road traffic system with crashes is based on a road network, which has complex topological and geometric relationships [39,40]. For the identification of RCHC, defining spatial relationships in terms of the road network is more appropriate.
The SWMR models road spatial relationships and should follow the topology between roads, restricted to the adjacency of the road network. At the most basic level, a binary strategy can be used to create W_ij: if road i is a 1st-order neighbor of road j in the road network, then W_ij = 1; otherwise, W_ij = 0.
In this paper, we treat topologically or geometrically adjacent roads, which share the same intersection or the same node, as 1st-order neighbors. That is, using the binary strategy, if there is a topological or geometric adjacency between road i and road j, then W_ij = 1; otherwise, W_ij = 0. However, finding the geometrically or topologically adjacent roads is difficult, since the different road types (e.g., highway, local, bridge, and tunnel) in a real road network have complex topological and geometric relationships. Note that roads with an overpass crossing or underpass crossing lack a true intersection. For instance, Hul Ave and the I-235 highway appear to have two intersections in the 2D map; however, these are actually two (nonintersecting) overpass crossings, as the photo in Figure 4 shows.
In GIS, there are different ways to model the topological and geometric relationships of a real-world road network. A total of 11 kinds of graphs are summarized in this paper, as shown in Figure 5, to demonstrate the typical topological and geometric relationships of a road network. Considering these typical relationships, we can find the geometrically or topologically adjacent roads and calculate the spatial weights of the SWMR:

• The spatial weights of SWMR are 1 if the roads are topologically adjacent for the following cases.

Figure 6 shows the main structure of the SWMR construction algorithm.
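The binary strategy can be sketched in a few lines of Python under a simplifying assumption: each road is given as a hypothetical (road_id, source_node, target_node) triple, as produced by a network topology step, and two roads are neighbors only if they share a node. The sketch does not reproduce the 11 graph cases of Figure 5 or the authors' algorithm in Figure 6; it only shows why an overpass crossing, which shares no node with the road beneath it, receives no weight.

```python
from collections import defaultdict
from itertools import combinations


def build_swmr(edges):
    """Build binary weights from (road_id, source_node, target_node) triples.

    Returns a sparse dict {(i, j): 1} holding only the non-zero weights.
    """
    roads_at_node = defaultdict(set)
    for road_id, source, target in edges:
        roads_at_node[source].add(road_id)
        roads_at_node[target].add(road_id)

    weights = {}
    for roads in roads_at_node.values():
        # Every pair of roads meeting at the same node is a 1st-order neighbor.
        for i, j in combinations(sorted(roads), 2):
            weights[(i, j)] = 1
            weights[(j, i)] = 1
    return weights


def to_gwt_lines(weights, id_field="ID"):
    """Serialize the weights as described in the text: a header line with the
    identifier field name, then one 'ID_i ID_j W_ij' row per non-zero weight."""
    lines = [id_field]
    lines += [f"{i} {j} {w}" for (i, j), w in sorted(weights.items())]
    return lines
```

For example, with edges `[(1, "a", "b"), (2, "b", "c"), (3, "b", "d"), (4, "x", "y")]`, roads 1, 2, and 3 become mutual neighbors through node "b", while road 4 (standing in for an overpass with its own nodes) has no neighbors.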

Cluster and Outlier Analysis (Local Moran's I)
The First Law of Geography [41], according to Professor Waldo Tobler, is "everything is related to everything else, but near things are more related than distant things." Based on this law, geographical phenomena or attributes are related to each other in their spatial distribution; that is, the closer things are, the more similar they are [41]. Accordingly, the spatial distribution of crashes can be of three types: dispersed, random, or clustered.
The global Moran's I, developed by Professor Moran in 1948, is one of the most preferred measures of spatial autocorrelation [42]. The local Moran's I index of each road is calculated as

$$
I_i = \frac{x_i - \bar{X}}{S_i^2} \sum_{j=1,\, j \neq i}^{n} W_{ij} \left( x_j - \bar{X} \right)
$$

where I_i is the calculated local Moran's I index of road i. I_i is a relative measure and can only be interpreted within the context of its computed z-score or p-value. The p-value is a probability that expresses the confidence level, and the z-score is a standardized local Moran's I index. W_ij is the weight in the SWMR (as discussed in Section 2.3.2) that quantifies the spatial relationship between road i and road j; x_i and x_j are the crash counts (as discussed in Section 2.3.1) of road i and road j; X̄ is the average crash count of all roads; and n is the total number of roads, with i = 1, 2, ..., n and j = 1, 2, ..., n. S_i^2 is the measure of sample variance, defined as

$$
S_i^2 = \frac{\sum_{j=1,\, j \neq i}^{n} \left( x_j - \bar{X} \right)^2}{n - 1}
$$

In the cluster and outlier analysis, it is necessary to make an assumption about the spatial distribution of crashes, namely the randomization null hypothesis. This hypothesis can be tested using the z-score and p-value that accompany the local Moran's I index. The z-score (z_{I_i}) for I_i is calculated as

$$
z_{I_i} = \frac{I_i - E[I_i]}{\sqrt{V[I_i]}}
$$

where

$$
E[I_i] = -\frac{\sum_{j=1,\, j \neq i}^{n} W_{ij}}{n - 1}, \qquad V[I_i] = E[I_i^2] - E[I_i]^2
$$

The z-score is a standardized local Moran's I value of road i. The z-scores and p-values are measures of statistical significance that tell us whether or not to reject the randomization null hypothesis, road by road [1]. In this study, a p-value ≤ 0.05 (95% confidence level) is used to indicate significant clusters, and this threshold is applied to each road. For any road, if its p-value is smaller than 0.05 and its z-score is greater than 1.96 or less than −1.96, that road is considered part of a cluster or an outlier. Roads whose z-scores are positive and greater than 1.96 and whose p-values are smaller than 0.05, and whose neighboring roads have similar z-scores and p-values, form the high-high road clusters. The high-high road clusters are the RCHC, which can be used to directly identify dangerous roads.
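As a small worked illustration of the local index, the following Python sketch computes I_i for each road from crash counts and binary neighbor sets, using the common leave-one-out form of the variance S_i^2. It is a sketch under stated assumptions, not the ArcGIS tool used in this study: the function names and the dictionary inputs are hypothetical, the quadrant label is assigned from signs alone, and the z-score and p-value are taken as given inputs to the significance rule rather than computed.

```python
def quadrant(deviation, lag):
    """Label a road by the sign of its own deviation and its neighbors' lag."""
    if deviation > 0 and lag > 0:
        return "HH"
    if deviation < 0 and lag < 0:
        return "LL"
    if deviation > 0 and lag < 0:
        return "HL"
    if deviation < 0 and lag > 0:
        return "LH"
    return None


def local_morans_i(counts, neighbors):
    """counts: {road_id: crash count}; neighbors: {road_id: set of road ids}.

    Returns {road_id: (I_i, quadrant label)} with binary weights W_ij = 1.
    """
    n = len(counts)
    mean = sum(counts.values()) / n
    result = {}
    for i, x_i in counts.items():
        # S_i^2: squared deviations of all roads except i, divided by n - 1
        s2 = sum((x_j - mean) ** 2 for j, x_j in counts.items() if j != i) / (n - 1)
        # Sum of neighbor deviations (the weighted sum with W_ij = 1)
        lag = sum(counts[j] - mean for j in neighbors.get(i, ()) if j != i)
        index = (x_i - mean) / s2 * lag if s2 > 0 else 0.0
        result[i] = (index, quadrant(x_i - mean, lag))
    return result


def is_significant(z_score, p_value):
    """Decision rule from the text: p-value <= 0.05 and |z-score| > 1.96."""
    return p_value <= 0.05 and abs(z_score) > 1.96
```

For a chain of four roads 1-2-3-4 with crash counts 10, 10, 0, 0, the mean is 5 and every S_i^2 equals 25, so road 1 gets I = (5/25) × 5 = 1 with label HH, and road 4 gets I = (−5/25) × (−5) = 1 with label LL. Whether such a road is reported as a cluster still depends on the significance test.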

Data Description
This study focuses on Polk County, Iowa, United States. Based on the 2010 census, its population was 430,640, representing 14% of the state's residents and making it Iowa's most populous county.
The study considers all types of roads (e.g., local, highway, bridge, and tunnel) and crashes that occur on roads (not at intersections) within the Polk County boundary. The county boundary data can be downloaded from the website: https://geodata.iowa.gov/dataset/county-boundaries-iowa.

The Spatial Data of Roads
The spatial data of Iowa statewide roads employed in this study can be downloaded from the OpenStreetMap [43] website (http://download.geofabrik.de/north-america.html). We extract the spatial data of roads from the Iowa statewide road data using (a) the select layer by location tool (roads that intersect the Polk County boundary) and (b) the select layer by attribute tool (roads suitable for cars) from the ArcGIS geoprocessing toolbox. A total of 27,606 roads are successfully recorded in ArcGIS, as shown in Figure 7. Note that the road shapefiles of other cities or regions can be easily downloaded from Geofabrik's free download server, which hosts the latest global spatial road data from the OpenStreetMap project, normally updated every day.

The Spatial Data of Road-Related Crashes
This study employs the spatial data of crashes provided by the Iowa Department of Transportation's public platform (https://data.iowadot.gov/), which has statewide data on general traffic crashes from the prior 10 years.
We extract the spatial data of crashes in Polk County from the Iowa statewide traffic crash shapefile. Note that the crash shapefiles of other cities or regions published by state Departments of Transportation in the USA can be accessed using the same extracting method.

The SWMR of Polk County
As described earlier, the SWMR is the critical input parameter of the local Moran's I analysis. Therefore, it is necessary to establish the SWMR of Polk County to accurately express the spatial relationships between roads.

As discussed in Section 2.2.3, we use an ASCII-encoded gwt format file to store the SWMR data. The first line of the SWMR file in gwt format is the name of the unique identifier field (we use the field 'ID' as the unique identifier field in this paper). The spatial weight (W_ij) is calculated by considering the topological and geometric relationships between roads. Due to space limitations, we take the typical highway-local link road graph shown in Figure 9 as an example to demonstrate the SWMR of Polk County. Table 2 lists the content of the ASCII-encoded gwt format file of these roads, including motorway links (ID.23970, 24103) and bridges (ID.3833, 3950), created by the SWMR construction algorithm (discussed in Section 2.3.2) to save the conceptualization data of road spatial relationships. It should be noted that the SWMR is a sparse matrix with a large number of zero W_ij values. Therefore, in this paper, the rows with a spatial weight of 0 are omitted, since the default spatial weight is 0 in the spatial data mining approach; this effectively reduces the file storage space.
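The effect of omitting zero-weight rows can be illustrated with a short Python sketch that reads a file laid out as described above (a header line with the identifier field name, then three-column rows) and returns 0 for any road pair that is not listed. The function names are hypothetical, integer road IDs are assumed, and the rows in the usage example are made-up values, not the content of Table 2.

```python
def parse_gwt(lines):
    """Parse 'ID_i ID_j W_ij' rows that follow a header naming the ID field."""
    id_field = lines[0].strip()
    weights = {}
    for line in lines[1:]:
        if not line.strip():
            continue  # skip blank lines
        i, j, w = line.split()
        weights[(int(i), int(j))] = float(w)
    return id_field, weights


def weight(weights, i, j):
    """Look up W_ij; pairs omitted from the file default to 0."""
    return weights.get((i, j), 0.0)
```

With the hypothetical rows `["ID", "1 2 1", "2 1 1", "2 3 1", "3 2 1"]`, `weight(weights, 1, 2)` is 1.0, while the unlisted pair (1, 3) falls back to 0.0 without ever being stored.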

The Results of Road Cluster and Outlier Analysis of Polk County
The study uses the cluster and outlier analysis (local Moran's I, as discussed in Section 2.3.3) from the geoprocessing tools in ArcGIS [44,45], with the input parameters shown in Table 3, to calculate the local Moran's I index, z-score, and p-value for each linear road and to obtain road clusters and outliers across Polk County. According to the best practice guidelines of cluster and outlier analysis, all roads should have at least one neighbor [32]. In this research, a total of 27,491 roads were selected in the input feature class, since 115 of the 27,606 OpenStreetMap roads of Polk County were found to have no neighbor when the SWMR was constructed.
In general, there are four types of road cluster and outlier in the road clusters and outliers shapefile, as discussed in Section 2.2.3: high-high cluster (cotype is HH), low-low cluster (cotype is LL), high-low outlier (cotype is HL), and low-high outlier (cotype is LH). The high-high clusters, high-low outliers, and low-high outliers are colored in red, black, and blue, respectively. The results, visualized on the map in Figure 10, clearly show the road clusters and outliers.
A high-high road cluster indicates positive autocorrelation: the roads in this cluster have high-frequency crashes, and their neighboring roads also have high-frequency crashes. That is, the high-high road clusters are the RCHC that we aim to identify among the 27,606 roads of Polk County.
The RCHC of Polk County, as shown in Figure 10, are centered along I-35, I-235, US 69, I-80, and US 6, which is relevant to the detection of hazardous roads. There are 738 roads in the RCHC of Polk County, accounting for 2.67% of all roads, and 24,652 crashes occurred in the RCHC, accounting for 59.07% of all crashes. That is, 59.07% of the crashes occurred on 2.67% of the roads in Polk County. In addition, we find that 85.60% of the crashes in the RCHC occurred on major roads, including motorway, trunk, primary, secondary, and tertiary roads, as shown in Figure 11. We can therefore quantitatively identify that 59.07% of the crashes in Polk County have strong positive spatial autocorrelation with the topological or geometric adjacency of roads. Based on the results in Table 4, transportation authorities can develop targeted mitigation strategies for the roads in the RCHC to effectively reduce the number of crashes.
The high-low and low-high outliers indicate negative spatial autocorrelation. The roads in the high-low outlier have high-frequency crashes, whereas their neighboring roads have low-frequency crashes. Conversely, the roads in the low-high outlier have low-frequency crashes, whereas their neighboring roads have high-frequency crashes. A portion of the calculated local Moran's I, z-score, and p-value of each road in the low-high and high-low outliers is shown in Table 5. Transportation authorities should also pay attention to roads in high-low and low-high outliers to determine why the negative autocorrelation occurs.
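To make the four cotype labels concrete, the following is a minimal sketch of one common labeling convention, not the exact rule applied by the GIS tool used in this study. The function name `classify_cotype`, its arguments, and the 0.05 significance threshold are assumptions made for illustration.

```python
def classify_cotype(crash_count, neighbor_mean, global_mean, p_value, alpha=0.05):
    """Label a road as 'HH', 'LL', 'HL', or 'LH'; return '' if not significant.

    Hypothetical helper: the first letter compares the road's own crash count
    with the global mean, the second compares its neighbors' mean crash count
    with the global mean. The threshold alpha=0.05 is an assumed default.
    """
    if p_value >= alpha:
        return ""  # no statistically significant cluster or outlier
    own = "H" if crash_count > global_mean else "L"
    neighbors = "H" if neighbor_mean > global_mean else "L"
    return own + neighbors


# A road with many crashes surrounded by roads with many crashes (RCHC member).
hh = classify_cotype(crash_count=40, neighbor_mean=35.0, global_mean=1.5, p_value=0.01)
# A road with many crashes surrounded by roads with few crashes (outlier).
hl = classify_cotype(crash_count=40, neighbor_mean=0.5, global_mean=1.5, p_value=0.01)
# A road with few crashes surrounded by roads with many crashes (outlier).
lh = classify_cotype(crash_count=0, neighbor_mean=35.0, global_mean=1.5, p_value=0.01)
print(hh, hl, lh)
```

Under this convention, HH and LL correspond to positive spatial autocorrelation, and HL and LH to negative spatial autocorrelation, which matches the interpretation given above.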

Recommendation of Future Work
In this paper, we have developed a spatial data mining approach to directly identify road clusters with high-frequency crashes (RCHC) by using the spatial weight matrix of roads (SWMR) and the local Moran's I for cluster and outlier analysis. We believe that the proposed approach can be extended to the following fields, which can be considered as future work.

Spatiotemporal Data Mining Approach.
In this study, all crashes from the ten-year period were treated equally in the spatial data mining approach. However, temporal crash factors, such as light, season, and weather, may vary. It is necessary to treat crashes differently according to these temporal factors. As such, we will develop a spatiotemporal data mining approach that considers the spatiotemporal correlations between the crashes and the main factors to discover the spatial and temporal patterns of traffic crashes according to different factors.

Identify Traffic Bottleneck.
In recent years, identifying traffic patterns, including traffic bottlenecks, has received much attention [46][47][48]. The proposed approach in this paper has the potential to identify traffic bottlenecks. The spatial relationship between the bottleneck and broad network can be established by a fuzzy spatial aggregation algorithm, and the spatial weight matrix of the road network can then be used to study the degree of autocorrelation of the traffic congestion and discover the spatial distribution pattern of the traffic bottleneck under the constraints of the road network.

Identify Certain Roadway Damages.
Due to various adverse factors, such as pounding and impact [49], chloride diffusion [50] and corrosion [51,52], and freeze and thaw [53,54], roadways are subject to damage, such as potholes and cracks, which negatively impacts traffic flow [55] and generates abnormal traffic patterns, such as sudden slowdowns and lane changes, that worsen over time. Therefore, a spatiotemporal data mining approach using historical data will be developed. Cluster and outlier analysis can be used to discover roadway damage by checking the degree of autocorrelation of traffic flow, with the outliers representing abnormal traffic flow.

Cloud-Based RCHC Identification
The proposed approach was developed and validated using a personal computer that has the capacity to identify RCHC in large cities or regions with populations of ~430,000 people, such as Polk County; however, it is difficult to process the big data of crashes and roads in megacities. As future work, we will deploy, test, and amend the proposed approach in a cloud computing environment [56] to provide high-performance computing solutions for identifying RCHC in megacities and to further improve the data processing capability of this approach.

Conclusions
As important carriers of traffic, roads and their ancillary facilities have important impacts on the frequency of crashes [57,58]. This paper successfully demonstrates a spatial data mining approach to directly identify road clusters with high-frequency crashes (RCHC). The application of the methodology was illustrated using a spatial data set (stored in SHP file format) that includes the traffic crashes (2008-2018) and roads of Polk County, Iowa, U.S.A. The proposed crash spatial aggregation algorithm uses geometric and attribute integrated matching and spatial fuzzy matching to build the crash-road spatial relationships while accounting for GPS location accuracy. The developed spatial weight matrix of roads (SWMR) algorithm can detect and accommodate overpass and underpass crossings by considering the 11 typical topological and geometric relationships of roads. Because the algorithm creates an accurate SWMR for a complex road network, it has the added value of supporting further spatial statistics (e.g., high-low clustering and Getis-Ord Gi* analysis) of road network crashes. As a major contribution, the research adopts a new idea and focuses on line-based local Moran's I analysis by taking line-based roads, instead of point-based crashes, as the core research objects. As a result, the proposed method can directly identify RCHC.

Figure 1. Process map for the approach of directly identifying road clusters with high-frequency crashes (RCHC).

Figure 2. Definition of main data types.

Figure 3. Structure of crash spatial aggregation algorithm.

Figure 4. Overpass crossing (nonintersecting) of the highway and local road: (a) intersections in 2D map; (b) overpass crossings in a photo.

• The spatial weights of SWMR are 1 if the roads are topologically adjacent for the following cases: (a) T-intersection with node, (c) T-intersection of highway bridge with node, (f) cross-intersection with node, (j) topological adjacency, (k) topological adjacency with bridge, and (l) topological adjacency with tunnel.
• The spatial weights of SWMR are 1 if the roads are geometrically adjacent for the following cases: (b) T-intersection without node, (d) T-intersection of highway bridge without node, and (e) cross-intersection without node.
• The spatial weights of SWMR are 0 if the roads are neither geometrically nor topologically adjacent for the following cases: (g) overpass crossing and (h) underpass crossing.
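The weighting rules listed above can be sketched as a small lookup. This is an illustrative encoding rather than the authors' implementation: it assumes each pair of roads has already been classified into one of the lettered cases, and the names `swmr_weight`, `build_swmr`, and the road identifiers are hypothetical. Treating cases (a), (c), (f), (j), (k), and (l) as the topologically adjacent group is inferred from the case descriptions.

```python
# Lettered cases from the list above, grouped by the weight they receive.
TOPOLOGICALLY_ADJACENT = {"a", "c", "f", "j", "k", "l"}  # weight 1 (inferred grouping)
GEOMETRICALLY_ADJACENT = {"b", "d", "e"}                 # weight 1
NOT_ADJACENT = {"g", "h"}                                # weight 0 (overpass/underpass)


def swmr_weight(case):
    """Return the spatial weight (1 or 0) for one lettered relationship case."""
    if case in TOPOLOGICALLY_ADJACENT or case in GEOMETRICALLY_ADJACENT:
        return 1
    if case in NOT_ADJACENT:
        return 0
    raise ValueError(f"unknown relationship case: {case!r}")


def build_swmr(road_ids, classified_pairs):
    """Build a symmetric, sparse weight matrix as a dict of dicts.

    classified_pairs maps (road_id_1, road_id_2) to a lettered case.
    Only nonzero weights are stored.
    """
    weights = {road_id: {} for road_id in road_ids}
    for (road_1, road_2), case in classified_pairs.items():
        if swmr_weight(case) == 1:
            weights[road_1][road_2] = 1
            weights[road_2][road_1] = 1
    return weights


# Hypothetical example: R2 passes over R3, so they are not neighbors.
pairs = {("R1", "R2"): "a", ("R2", "R3"): "g", ("R3", "R4"): "e"}
swmr = build_swmr(["R1", "R2", "R3", "R4"], pairs)
print(swmr)
```

Storing only the nonzero weights keeps the matrix compact, which matters for a network of tens of thousands of roads in which each road has only a handful of neighbors.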

Figure 5. Typical topological and geometric relationships of road network.

The global Moran's I uses a single index to detect the degree of autocorrelation of the same variable in the spatial region and can verify the spatial distribution pattern over the entire spatial extent. In this study, the local Moran's I (suggested by Professor Anselin based on the global Moran's I) [19] is used as a local indicator of spatial autocorrelation to find RCHC. The local Moran's I, one of the most widely used local indicators of spatial association statistics [16], is calculated for each road to reveal the degree of spatial autocorrelation and is used to analyze whether the same variable (<crash_count> in this research) has autocorrelation at a specific local location. The local Moran's I index is expressed as in [19,22,42].
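As an illustration of how a local Moran's I value is obtained for each road, the sketch below implements one commonly used form of the statistic, in which the variance term excludes the focal road. It is an assumption that this form matches the equation used in this paper, which is not reproduced here; the function name, the toy crash counts, and the chain-shaped weight matrix are hypothetical.

```python
def local_morans_i(values, weights):
    """Compute a local Moran's I value for every feature.

    values  : list of crash counts, one per road
    weights : square matrix (list of lists), weights[i][j] is the spatial
              weight between road i and road j (e.g., 0 or 1 from an SWMR)

    Assumed form: I_i = (x_i - mean) / S_i^2 * sum_{j != i} w_ij * (x_j - mean),
    with S_i^2 = sum_{j != i} (x_j - mean)^2 / (n - 1).
    S_i^2 is zero if all other roads have the same deviation of zero;
    that degenerate case is not handled in this sketch.
    """
    n = len(values)
    mean = sum(values) / n
    deviations = [v - mean for v in values]
    result = []
    for i in range(n):
        s2 = sum(deviations[j] ** 2 for j in range(n) if j != i) / (n - 1)
        lag = sum(weights[i][j] * deviations[j] for j in range(n) if j != i)
        result.append(deviations[i] * lag / s2)
    return result


# Four roads connected in a chain: 0 - 1 - 2 - 3.
crash_counts = [10, 12, 1, 0]
chain_weights = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
local_i = local_morans_i(crash_counts, chain_weights)
print(local_i)
```

A positive value means a road and its neighbors deviate from the mean in the same direction (a candidate cluster), and a negative value means they deviate in opposite directions (a candidate outlier). Significance testing via z-scores and p-values is a separate step that this sketch omits.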

Figure 6. Structure of the spatial weight matrix of roads (SWMR) construction algorithm.

Figure 7. Distribution of roads in Polk County.

(a) select layer by location tool (crashes intersect with the Polk County boundary) and (b) select layer by attribute tool (crashes are not intersection-related) from the ArcGIS geoprocessing toolbox. A total of 41,734 road-related crashes that happened in Polk County from 1 January 2008 to 6 August 2018 were successfully recorded in ArcGIS, as shown in Figure 8.
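The two selection steps can be imitated without GIS software. The sketch below is a library-free stand-in for the ArcGIS tools named above, not a reproduction of them: it assumes the county boundary is a single simple polygon, and the record fields `x`, `y`, and `intersection_related` are hypothetical.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon.

    polygon is a list of (x, y) vertices. Points exactly on the boundary
    are not treated specially in this sketch.
    """
    inside = False
    n = len(polygon)
    for k in range(n):
        x1, y1 = polygon[k]
        x2, y2 = polygon[(k + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def select_road_related(crashes, boundary):
    """Keep crashes that (a) fall inside the boundary and (b) are not intersection-related."""
    return [
        crash
        for crash in crashes
        if point_in_polygon(crash["x"], crash["y"], boundary)
        and not crash["intersection_related"]
    ]


# Toy data: a square "county" and three crash records.
boundary = [(0, 0), (10, 0), (10, 10), (0, 10)]
crashes = [
    {"id": 1, "x": 5, "y": 5, "intersection_related": False},   # kept
    {"id": 2, "x": 5, "y": 5, "intersection_related": True},    # dropped by (b)
    {"id": 3, "x": 15, "y": 5, "intersection_related": False},  # dropped by (a)
]
selected = select_road_related(crashes, boundary)
print([crash["id"] for crash in selected])
```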

Figure 8. Distribution of crashes in Polk County.

Figure 10. Distribution of road clusters and outliers of Polk County.

Figure 11. Statistics of crash counts in different road classes in RCHC.

Table 1. The overview of file types.

Table 2. The content of the ASCII encoded gwt format file.
4.2. The Results of Road Cluster and Outlier Analysis of Polk County

Table 3. Input parameters of cluster and outlier analysis (Anselin local Moran's I).

Table 4. Local Moran's I, z-score, and p-value of roads in high-high cluster (RCHC).

Table 5. Local Moran's I, z-score, and p-value of roads in high-low and low-high outliers.