A New Approach to Identifying Crash Hotspot Intersections (CHIs) Using Spatial Weights Matrices

Abstract: In this paper, we develop a new approach to directly detect crash hotspot intersections (CHIs) using two customized spatial weights matrices: the inverse network distance-band spatial weights matrix of intersections (INDSWMI) and the k-nearest distance-band spatial weights matrix between crash and intersection (KDSWMCI). This new approach has three major steps. The first step is to build the INDSWMI by forming the road network, extracting the intersections from road junctions, and constructing the INDSWMI with road network constraints. The second step is to build the KDSWMCI by obtaining the adjacency crashes for each intersection. The third step is to perform intersection hotspot analysis (IHA) by using the Getis–Ord Gi* statistic with the INDSWMI and KDSWMCI to identify CHIs and test the Intersection Prediction Accuracy Index (IPAI). This approach is validated by comparison of the IPAI obtained using OpenStreetMap (OSM) roads and intersection-related crashes (2008–2017) from Spencer city, Iowa, USA. The findings of the comparison show that higher prediction accuracy is achieved by using the proposed approach in identifying CHIs.


Introduction
The nation's transportation infrastructural systems are deteriorating [1] under adverse influences from multiple factors, such as corrosion [2,3], aging [4], impact [5,6], and vibration [7], and even with the recent advances in structural health monitoring [8–10] and intelligent transportation systems [11,12], traffic crashes still happen. The latest quick facts report from the National Highway Traffic Safety Administration indicates that there were 2,746,000 people injured in 6,452,000 police-reported crashes in 2017 in the USA [13]. As junctions of traffic flow and pedestrian flow, intersections with ancillary facilities have an important impact on the frequency of crashes. Intersection-related crashes, which account for a large portion of all crashes, need more research attention. For example, Iowa, USA, saw about 225,185 intersection-related crashes, about 40.41% of all crashes, from 2008 to 2018 [14]. Given the massive number of intersections, identifying crash hotspot intersections (CHIs) is an important but challenging task.
A review of previous studies shows that the Getis-Ord Gi*, well known in hotspot analysis, has been commonly used to detect crash hotspots [15]. Hotspot analysis examines the Getis-Ord Gi* statistic [16,17], a local indicator of spatial autocorrelation developed by Arthur Getis and J. Keith Ord, for individual crashes based on a comparison with neighboring crashes.
This paper presents a three-step approach to directly identifying crash hotspot intersections (CHIs): (1) construction of the inverse network distance-band spatial weights matrix of intersections (INDSWMI), (2) construction of the k-nearest distance-band spatial weights matrix between crash and intersection (KDSWMCI), and (3) intersection hotspot analysis (IHA). The process map for the approach is shown in Figure 1.

(1) INDSWMI construction
The first step (INDSWMI construction) includes the following three substeps:
(a) Road network construction. In this research, we apply osm2pgrouting [35], an open source software product, to build the road network, including road junction and segment tables, from the OpenStreetMap (OSM) [36] road spatial dataset.
(b) Intersection extraction. In this research, a road junction that links three or more road segments is considered an intersection, which is a junction of traffic flow and pedestrian flow [37,38]. We developed an intersection extraction algorithm to extract intersections, such as T-intersections, Y-intersections, cross-intersections, and X-intersections [39], from the road junction table and create the intersection table. The intersection extraction algorithm can be described by the following structured query language (SQL) script: "create table public.intersection as select * from public.road_junction a where a.degree > 2".
(c) INDSWMI construction. We developed the INDSWMI generation algorithm to construct the INDSWMI, which conceptualizes the spatial relationships between intersections under road network constraints, based on the intersection table and road segment table. The INDSWMI is saved in the swm file format, which is compatible with ArcGIS.
(2) KDSWMCI construction
The second step (KDSWMCI construction) is to conceptualize the crash-intersection spatial relationships. We developed the KDSWMCI generation algorithm to calculate the number of crashes and adjacency crashes of each intersection and save the crash-intersection spatial relationships in the KDSWMCI table.
(3) Intersection hotspot analysis
The third step, intersection hotspot analysis, is to identify CHIs using a statistical measure, the Getis-Ord Gi*. The intersection hotspot analysis (IHA) generates the intersection hotspots shapefile using the standardized Getis-Ord Gi* of each intersection, computed under the randomization null hypothesis [40]. We can detect CHIs through geographic information system (GIS) visualization of the intersection hotspots shapefile. The Intersection Prediction Accuracy Index (IPAI) was calculated to quantitatively evaluate the prediction performance of IHA.

Data Types
There are three types of data in this approach: (1) the input data, (2) the intermediate data, and (3) the output data, as shown in Table 1.

Table 1. Data types.
Name | Type | Format | Geometry | Description
Road junction | Intermediate | PostgreSQL table | Point | The road junction spatial table generated by osm2pgrouting software
Road segment | Intermediate | PostgreSQL table | Line | The road segment spatial table generated by osm2pgrouting software
Crash table | Intermediate | PostgreSQL table | Point | The crash spatial table converted by shp2pgsql tool
Intersection | Intermediate | PostgreSQL table | Point | The intersection spatial table with number of crashes extracted from road junction
KDSWMCI | | | |

The Input Data
Two data sets are required as the inputs for this approach: the OSM road data and the crash shapefile. The OSM road data should contain the geometric information (a list of point coordinates) and the "highway" attribute. Note that "highway" in British English is used to indicate any type of road, such as motorway, primary, route, footway, and pedestrian, within OSM. In this research, we selected only the highways for cars, such as motorway, primary, secondary, tertiary, residential, and service, since we focused on traffic intersections. The crash shapefile should contain the geometric information (x-coordinate, y-coordinate) that is used to locate crashes on intersections. Crash-related factors [41,42], such as environment and driver, and crash attributes, such as crash type and date, are not necessary but are suggested.

The Intermediate Data
Intermediate files include a series of tables, such as crash, road junction, road segment, KDSWMCI, and intersection tables. The crash table was converted by using the shp2pgsql tool. The road junction and road segment tables, generated by osm2pgrouting software, contain the topological information and form of the road network. Note that the osm2pgrouting software cannot generate the intersection table. We developed an intersection extraction algorithm to build the intersection table with attributes such as the degree and type based on road network topological information.
In this research, we used PostgreSQL [43], a widely used open source database management system (DBMS), and PostGIS [44], an open source geospatial engine for PostgreSQL, to store and query spatial data of roads, intersections, and crashes. The geometry and attributes of the above tables were inherited from the input OSM road data and crash shapefile.
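The intersection extraction rule described above (a junction linked to three or more road segments) can be illustrated outside the database. The following is a minimal sketch, not the authors' implementation: the function name and the in-memory list of (source, target) pairs are hypothetical stand-ins for the road segment table.

```python
from collections import Counter


def extract_intersections(segments):
    """Return the ids of junctions linked to three or more road segments.

    `segments` is an iterable of (source_junction, target_junction) pairs,
    a hypothetical in-memory stand-in for the road segment table.
    """
    degree = Counter()
    for source, target in segments:
        degree[source] += 1
        degree[target] += 1
    # Same rule as the SQL filter "degree > 2".
    return sorted(j for j, d in degree.items() if d > 2)


# Toy network: junction 2 joins three segments (a T-intersection).
toy_segments = [(1, 2), (2, 3), (2, 4)]
print(extract_intersections(toy_segments))
```

Junctions with degree 1 (dead ends) or 2 (a bend or a split in one road) are dropped, which is why the junction table cannot be used directly as the intersection table.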

Output Results
The INDSWMI file and intersection hotspot shapefile are the output results. The INDSWMI file is constructed based on the network distance of intersections to conceptualize the intersection spatial relationships as a foundation for IHA. To improve the effectiveness of the approach, we used the binary swm file format, which is compatible with ArcGIS [45], to store INDSWMI data. Each row in a binary swm format file is formatted into four columns: OBJECTID (row index), GID (the ID of intersection i), NID (the ID of intersection j), and WEIGHT (the spatial weight between intersections GID and NID). The intersection hotspot shapefile, the result of the IHA (Getis-Ord Gi*), contains the hotspot intersections, coldspot intersections, and non-significant intersections. The hotspot intersections are the results that this research aims to identify.

The INDSWMI Generation Algorithm
A spatial weights matrix (SWM), the conceptualization of spatial relationships between features [46], is the key input parameter for hotspot analysis. The SWMs can be mainly divided into contiguity-based spatial weights matrices (CSWMs) and distance-band spatial weights matrices (DSWMs). The CSWM, which expresses the existence of a neighbor relation as a binary value, 1 or 0, is widely used in region-based hotspot analysis [47]. The DSWM, which expresses the spatial relationships as distance weights [48], is intrinsically most appropriate for point-based hotspot analysis. We adopted the DSWM to express the spatial relationships between intersections since intersections are a typical point feature.
The inverse network distance-band spatial weights matrix of intersections (INDSWMI) based on the DSWM is an N×N matrix. Generally, $W_{ij}$ is defined using the inverse Euclidean distance. However, intersections are constrained by the road network, which has complex topological and geometric relationships [49]. The inverse network distance along the road is therefore more appropriate for measuring $W_{ij}$ of the INDSWMI, which is defined as
$$W_{ij} = \begin{cases} 1/nd_{ij}, & nd_{ij} \le \text{distance band} \\ 0, & \text{otherwise} \end{cases} \tag{1}$$
where the distance band is a cutoff distance that can be specified in accordance with the minimum number of neighbors of each intersection (the minimum number of neighbors is 1 in this research according to the suggestion of Getis and Aldstadt [29]), and $nd_{ij}$ is the network distance between intersections i and j. If $nd_{ij}$ is less than or equal to the distance band, then $W_{ij} = 1/nd_{ij}$ (inverse network distance); otherwise, $W_{ij} = 0$. $nd_{ij}$ is expressed as
$$nd_{ij} = \sum_{r_k \in sp(I_i, I_j)} \mathrm{length}(r_k) \tag{2}$$
where $sp(I_i, I_j)$ is the shortest path set of intersections i and j, calculated by Dijkstra's algorithm [50], which contains a series of road segments ($r_k$); $nd_{ij}$ is the total length of the road segments within $sp(I_i, I_j)$.
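The weight definition above (inverse network distance within a distance band, with network distance taken along the shortest path) can be sketched in a few lines. This is an illustrative sketch only: the adjacency dictionary, the function names, and the choice to leave out self-weights are assumptions made here, and Dijkstra's algorithm is written out directly rather than called through pgRouting.

```python
import heapq


def network_distances(adjacency, source):
    """Dijkstra shortest-path lengths from `source` along road segments.

    `adjacency` maps a node id to a list of (neighbor id, segment length).
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, length in adjacency.get(node, []):
            candidate = d + length
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist


def inverse_distance_weights(adjacency, intersections, distance_band):
    """W_ij = 1 / nd_ij when 0 < nd_ij <= distance_band.

    Zero weights are simply not stored (sparse form); self-weights are
    skipped, which is an assumption of this sketch.
    """
    weights = {}
    for i in intersections:
        dist = network_distances(adjacency, i)
        for j in intersections:
            nd = dist.get(j)
            if j != i and nd is not None and 0.0 < nd <= distance_band:
                weights[(i, j)] = 1.0 / nd
    return weights


# Toy road network with hypothetical segment lengths in meters.
toy_adjacency = {
    "A": [("B", 100.0)],
    "B": [("A", 100.0), ("C", 200.0)],
    "C": [("B", 200.0)],
}
print(inverse_distance_weights(toy_adjacency, ["A", "B", "C"], 250.0))
```

In the toy network, A and C are 300 m apart along the road, so the pair falls outside the 250 m band and receives no stored weight.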
Note that row standardization of $W_{ij}$ is suggested to create proportional weights in cases where intersections have an unequal number of neighbor intersections [29]. The row-standardized form of $W_{ij}$ for hotspot analysis is expressed as
$$W_{ij}^{std} = \frac{W_{ij}}{\sum_{j=1}^{N} W_{ij}} \tag{3}$$
Based on Equations (1) and (2), we used the Qt framework [51], an open source cross-platform framework, to develop the INDSWMI generation algorithm with PostgreSQL/PostGIS. Figure 2 shows the definitions of the main data types: intersection, network distance of intersections (NDI), network distance matrix of intersections (NDMI), spatial weight of intersections (SWI), and INDSWMI.
Appl. Sci. 2020, 10, x FOR PEER REVIEW 6 of 23
Using the main data types, we developed a five-step algorithm, with pseudocode shown in Figure 3, to generate the INDSWMI. The five steps are to (1) connect to the PostgreSQL/PostGIS database; (2) cache the intersection table in the PostgreSQL/PostGIS database to memory; (3) build the NDMI using the pgr_dijkstra function provided by the pgRouting extension for the PostgreSQL/PostGIS database; (4) generate the INDSWMI based on the NDMI restricted by the distance band; and (5) save the INDSWMI to the swm file format.
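The row standardization and the final save step can be sketched as follows. This is a hedged illustration, not the Qt implementation: the sparse dictionary keyed by (GID, NID) and both function names are assumptions, and the output is a list of plain tuples in the four-column layout (OBJECTID, GID, NID, WEIGHT) rather than the binary swm file itself.

```python
def row_standardize(weights):
    """Divide each weight W_ij by the sum of the weights in row i."""
    row_sums = {}
    for (i, _), w in weights.items():
        row_sums[i] = row_sums.get(i, 0.0) + w
    return {
        (i, j): (w / row_sums[i] if row_sums[i] else 0.0)
        for (i, j), w in weights.items()
    }


def to_swm_rows(weights):
    """Flatten weights to (OBJECTID, GID, NID, WEIGHT) tuples, omitting zeros."""
    rows = []
    for (gid, nid), w in sorted(weights.items()):
        if w != 0.0:
            rows.append((len(rows) + 1, gid, nid, w))
    return rows


# Hypothetical raw inverse-distance weights for two intersections.
toy_weights = {(1, 2): 0.01, (1, 3): 0.03, (2, 1): 0.01}
for row in to_swm_rows(row_standardize(toy_weights)):
    print(row)
```

After standardization the weights in each row sum to 1, so an intersection with many neighbors does not dominate one with few.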

The KDSWMCI Generation Algorithm
To identify the crash hotspot intersections (CHIs), we build the k-nearest distance-band spatial weights matrix between crash and intersection (KDSWMCI, k = 1), based on the CSWM, by obtaining the adjacency crashes and the number of crashes at each intersection.
Generally, an SWM is an N×N matrix whose elements are spatial weights. In this research, we expanded the SWM to an M×N matrix, as shown in Equation (4), which can accommodate different numbers of crashes and intersections. There is one row for each crash and one column for each intersection.
$$W = \begin{bmatrix} W_{11} & \cdots & W_{1n} \\ \vdots & \ddots & \vdots \\ W_{m1} & \cdots & W_{mn} \end{bmatrix} \tag{4}$$
Here, m is the number of crashes; n is the number of intersections; i is the unique identifier of a crash; j is the unique identifier of an intersection; and $W_{ij}$ is the weight that quantifies the spatial relationship between crashes and intersections. If crash i occurs on intersection j, then $W_{ij} = 1$; otherwise, $W_{ij} = 0$. We can determine that a crash is on an intersection if the geometric relationships between the crash and intersection meet both of Conditions 1 and 2:
1. Condition 1: The shortest distance between the crash and the intersection is less than or equal to the threshold distance.
2. Condition 2: The intersection-related crash relates to one, and only one, intersection. That means that the k-nearest neighbor is the same as the 1st-nearest neighbor. Therefore, if crash i occurs on intersection j, then the distance between crash i and intersection j should be the minimum of the distances between crash i and all intersections.
According to Conditions 1 and 2, the weight of the KDSWMCI is expressed as
$$W_{ij} = \begin{cases} 1, & d_{ij} \le \text{threshold distance and } d_{ij} = \min_{k} d_{ik} \\ 0, & \text{otherwise} \end{cases} \tag{5}$$
where $d_{ij}$ is the Euclidean distance between crash i and intersection j, and $d_{ik}$ is the Euclidean distance between crash i and intersection k. A threshold distance of 28. We implemented the KDSWMCI generation algorithm based on the Qt framework with the PostgreSQL/PostGIS database. Figure 4 shows the definitions of the main data types: crash, spatial weight between crash and intersection (SWCI), and KDSWMCI. Note that the structure of intersections used in the algorithm is defined in Figure 2.
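Conditions 1 and 2 amount to a nearest-neighbor assignment with a distance cutoff. The sketch below illustrates that rule under stated assumptions: coordinates are taken to be in a projected system so that Euclidean distance is meaningful, the function names are hypothetical, and the threshold is left as a parameter rather than fixed to any value from the paper.

```python
import math


def assign_crashes(crashes, intersections, threshold):
    """Map each crash id to its nearest intersection id within `threshold`.

    `crashes` and `intersections` map ids to (x, y) coordinates.
    A crash that is farther than `threshold` from every intersection
    is left unassigned (all of its weights are 0).
    """
    assignment = {}
    for crash_id, (cx, cy) in crashes.items():
        best_id, best_dist = None, float("inf")
        for inter_id, (ix, iy) in intersections.items():
            d = math.hypot(cx - ix, cy - iy)
            if d < best_dist:
                best_id, best_dist = inter_id, d
        # Condition 1: within the threshold. Condition 2: nearest only (k = 1).
        if best_id is not None and best_dist <= threshold:
            assignment[crash_id] = best_id
    return assignment


def crash_counts(assignment, intersections):
    """Number of assigned crashes per intersection (the x_j values)."""
    counts = {inter_id: 0 for inter_id in intersections}
    for inter_id in assignment.values():
        counts[inter_id] += 1
    return counts


# Hypothetical coordinates and threshold, for illustration only.
toy_intersections = {10: (0.0, 0.0), 20: (100.0, 0.0)}
toy_crashes = {1: (5.0, 0.0), 2: (90.0, 0.0), 3: (50.0, 40.0)}
toy_assignment = assign_crashes(toy_crashes, toy_intersections, 30.0)
print(toy_assignment, crash_counts(toy_assignment, toy_intersections))
```

Because each crash maps to at most one intersection, each row of the M×N matrix contains at most a single 1, and the column sums give the number of crashes per intersection.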

Intersection Hotspot Analysis (Getis-Ord Gi*)
The Getis-Ord G, including the Getis-Ord General G and the Getis-Ord Gi* [16], is one of the preferred measurement techniques for hotspot analysis [53]. The Getis-Ord General G is a single index that detects the degree of autocorrelation to verify the spatial distribution pattern over the entire spatial extent. The Getis-Ord Gi* is used as a local indicator [54] of spatial autocorrelation in IHA to identify CHIs. The Getis-Ord Gi* was calculated for each intersection to reveal the degree of spatial autocorrelation of a single variable (<num_crash> in this research). The Getis-Ord Gi* is expressed as [16]
$$G_i^* = \frac{\sum_{j=1}^{n} W_{ij} x_j}{\sum_{j=1}^{n} x_j} \tag{6}$$
where $G_i^*$ is the statistic that expresses the degree of spatial autocorrelation of intersection i with the number of crashes over all neighboring intersections; $W_{ij}$ is the weight in the INDSWMI (discussed in Section 2.3.1) that quantifies the spatial relationship between intersection i and intersection j; $x_j$ is the number of crashes at intersection j (discussed in Section 2.3.2 based on the KDSWMCI); and n is the total number of neighboring intersections.
According to the first law of geography, the numbers of crashes at intersections are related to each other in space and follow a spatial distribution pattern that is dispersed, random, or clustered [55]. The Gi* statistic is randomly distributed when the underlying number of crashes at intersections is randomly distributed. However, crashes have a clustered distribution pattern in reality. Therefore, IHA starts from an assumption about the spatial distribution of the number of crashes at intersections, namely the randomization null hypothesis.
Testing of the randomization null hypothesis of spatial distribution can be performed based on the z-score (a standardized statistic) of the Gi*, as shown in Equation (7) [56], along with the p-value (a probability value used to express the confidence level).
$$G_i^*\,ZScore = \frac{\sum_{j=1}^{n} W_{ij} x_j - \bar{X} \sum_{j=1}^{n} W_{ij}}{S \sqrt{\dfrac{n \sum_{j=1}^{n} W_{ij}^2 - \left(\sum_{j=1}^{n} W_{ij}\right)^2}{n - 1}}} \tag{7}$$
Here, Gi* ZScore is the standardized Gi* value of intersection i. The Gi* ZScores are measures of statistical significance that inform us whether or not to reject the randomization null hypothesis, intersection by intersection. In this study, p-values of ≤0.05 (95% confidence level) were used to indicate significant clusters, and this criterion was applied to each intersection. To be more specific, if an intersection's p-value was ≤0.05 and its Gi* ZScore was >1.96 [57], that intersection was considered a hotspot intersection at the 95% confidence level. $W_{ij}$ is the weight in the INDSWMI, $x_j$ is the number of crashes at intersection j calculated in the KDSWMCI, and $\bar{X}$ is the average number of crashes at all neighboring intersections. S is related to the measurement of sample variance and is defined as [56]
$$S = \sqrt{\frac{\sum_{j=1}^{n} x_j^2}{n} - \bar{X}^2} \tag{8}$$
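The standardized statistic and its significance test can be checked numerically with a short script. This is a sketch under stated assumptions rather than the ArcGIS implementation: it uses the commonly published z-score form of the Gi*, takes n to be the number of features in the weights matrix, expects a dense list-of-lists in which the caller decides the self-weight, and derives a two-sided p-value from the standard normal distribution. The weights and crash counts are made-up toy values.

```python
import math


def gi_star_zscores(weights, x):
    """Gi* z-scores; weights[i][j] is W_ij and x[j] is the crash count at j."""
    n = len(x)
    mean = sum(x) / n
    s = math.sqrt(max(sum(v * v for v in x) / n - mean * mean, 0.0))
    scores = []
    for i in range(n):
        w_sum = sum(weights[i])
        w_sq_sum = sum(w * w for w in weights[i])
        numerator = sum(w * v for w, v in zip(weights[i], x)) - mean * w_sum
        spread = max((n * w_sq_sum - w_sum ** 2) / (n - 1), 0.0)
        denominator = s * math.sqrt(spread)
        # A zero denominator means no variation; report a neutral score.
        scores.append(numerator / denominator if denominator > 0 else 0.0)
    return scores


def two_sided_p_value(z):
    """Two-sided p-value of a z-score under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# Toy data: two high-crash intersections that are mutual neighbors,
# and three low-crash intersections that are mutual neighbors.
toy_x = [10, 8, 1, 0, 1]
toy_w = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
]
for z in gi_star_zscores(toy_w, toy_x):
    print(round(z, 4), round(two_sided_p_value(z), 4))
```

High counts surrounded by high counts give positive z-scores (hotspot candidates), and low counts surrounded by low counts give negative ones (coldspot candidates).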

The Intersection Prediction Accuracy Index (IPAI)
Evaluation of hotspot analysis prediction performance is an important issue relating to the suitability of the proposed approach. The Prediction Accuracy Index (PAI) [58], first proposed by Chainey et al., is the ratio of the hit rate to the fraction of area covered [59]. The PAI has been widely applied to measure hotspot analysis results [17,59–61]. Previously, the Crash Prediction Accuracy Index (CPAI) was developed by using road length rather than area for evaluating traffic crash hotspot analysis performance [17,61]. Based on the previous studies, the Intersection Prediction Accuracy Index (IPAI) was developed in this research with road network constraints for evaluating IHA prediction performance as follows:
$$IPAI = \frac{\sum_{j=1}^{m} x_j \Big/ \sum_{i=1}^{n} x_i}{l_{sp(i,j)} \Big/ \sum_{i=1}^{r} l_i} \tag{9}$$
where HI is the set of hotspot intersections; I is the set of all intersections in the study region; R is the set of all roads in the study region; m, n, and r are the numbers of elements in HI, I, and R, respectively; $x_j$ is the number of crashes at intersection j within HI; $x_i$ is the number of crashes at intersection i within I; sp(i, j) is the shortest path set between intersections i and j within HI; $l_{sp(i,j)}$ is the total length of the shortest path set; and $l_i$ is the length of road i within R. Note that a road is counted only once, even if it lies on more than one shortest path.
IPAI is an indicator that quantifies the prediction performance of IHA. The total length of road and the total number of crashes in the study region are constants. Therefore, the IPAI is higher when the CHIs contain more crashes and the shortest paths between CHIs are shorter, and a higher IPAI indicates better prediction performance of the approach.
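The IPAI is a ratio of two fractions: the share of crashes captured by the hotspot intersections, and the share of road length spanned by the shortest paths between them. The following sketch shows the arithmetic only; the function name and all numbers are hypothetical, and the deduplicated shortest-path length is assumed to be computed elsewhere and passed in.

```python
def ipai(crash_counts, hotspot_ids, hotspot_path_length, total_road_length):
    """Hit rate of crashes at hotspots divided by the fraction of road length.

    `crash_counts` maps every intersection id to its number of crashes.
    `hotspot_path_length` is the total length of the roads on shortest
    paths between hotspot intersections, each road counted once.
    """
    total_crashes = sum(crash_counts.values())
    hotspot_crashes = sum(crash_counts[i] for i in hotspot_ids)
    hit_rate = hotspot_crashes / total_crashes
    length_fraction = hotspot_path_length / total_road_length
    return hit_rate / length_fraction


# Hypothetical values: half of the crashes fall on hotspots that span
# 5% of the road length.
toy_counts = {1: 30, 2: 20, 3: 25, 4: 25}
print(ipai(toy_counts, {1, 2}, 500.0, 10000.0))
```

A value well above 1 means that the hotspots concentrate crashes far more than their share of the network would suggest.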

Original Data
More than 50% of the population of the United States lives in small cities and towns [62]. Small cities have been ignored by researchers, and this neglect has profound consequences for urban studies [62]. In addition, megacities are usually composed of several small cities. Research identifying CHIs in small cities is therefore meaningful and important, and Spencer city, Iowa, United States, a small city, was selected for evaluation of the proposed approach. Spencer covers an area of approximately 28.96 km². We considered roads for cars, such as motorway, primary, secondary, tertiary, residential, and service roads, and crashes that occurred on intersections within the Spencer city boundary.

The Road Network
The raw input OSM roads employed in this study can be downloaded from the OSM website (https://www.openstreetmap.org). First, we exported the spatial data of all types of roads in the osm file format using an enveloping rectangle [43.1974, −95.1108, 43.1043, −95.201] of the study area. Second, we applied the osm2pgrouting tool to select the roads for cars and built the road network in the PostgreSQL/PostGIS database. Third, we used the ST_Intersects function provided by the PostGIS extension to clip the roads within the Spencer city boundary, which can be downloaded from OSM Boundaries 4.6.4 (https://wambachers-osm.website/boundaries/). A total of 2081 road segments for cars and 1456 junctions were selected, as shown in Figure 6.
Note that intersections should be extracted from road junctions since junctions are not always intersections. A total of 1065 intersections in Spencer city were selected using the intersection extraction algorithm (discussed in Section 2.1), as shown in Figure 7.

Intersection-Related Crashes
We employed the spatial data of crashes provided by the Iowa Department of Transportation's public platform (https://data.iowadot.gov/), which has statewide data of general traffic crashes from the previous 10 years. The crash data contain 49 types of information (e.g., crash_date, district, county_num, literal, light, weather, rdtype, xcoord, and ycoord) and meet the requirements of this research. For this research, a dataset of intersection-related crashes that occurred in Spencer city was selected and analyzed.
We extracted spatial data of crashes in Spencer city from the Iowa statewide crash data using the ST_Intersects function provided by the PostGIS extension to clip the intersection-related crashes within the Spencer city boundary. A total of 1149 intersection-related crashes occurred in Spencer city, as shown in Figure 8.
Note that it is difficult to determine the crash hotspot intersections from Figure 8 because the coordinates of several crashes overlap. Therefore, the proposed approach is needed to identify the crash hotspot intersections (CHIs).

The INDSWMI of Spencer City
As described earlier, the SWM is the critical input parameter for hotspot analysis. Therefore, it was necessary to establish the INDSWMI of Spencer city to accurately express the spatial relationships between intersections.
The spatial weights (Wij) of the INDSWMI were calculated based on the network shortest distance along the road segments, as discussed in Section 2.3.1. We take the typical T-intersections and cross-intersections shown in Figure 9 as an example to evaluate the intersection extraction algorithm and demonstrate the INDSWMI of Spencer city. We used the geometric method ST_Intersection, provided by the PostGIS extension, to evaluate the intersection extraction algorithm based on road network topological relationships. When all intersections coincide with the coordinates of the collisions using the geometric method, the accuracy of the intersection extraction algorithm can be verified. We took a typical T-intersection (ID. 921) as an example to demonstrate the evaluation results of the intersection extraction algorithm using the geometric method, as shown in Table 2. The evaluation results successfully demonstrate that the intersection extraction algorithm accurately extracts intersections. We used the geometric method ST_Intersection, provided by the PostGIS extension, to evaluate the intersection extraction algorithm based on road network topological relationships. When all intersections coincide with the coordinates of the collisions using the geometric method, the accuracy of the intersection extraction algorithm can be verified. We took a typical T-intersection (ID. 921) as an example to demonstrate the evaluation results of the intersection extraction algorithm using the geometric method, as shown in Table 2. The evaluation results successfully demonstrate that the intersection extraction algorithm accurately extracts intersections. According to a suggestion by Getis and Aldstadt [29], a distance band of 683.64 m was selected to ensure that each intersection had at least one neighbor. 
Note that the suggestion that each intersection should have at least one neighbor is one of the best-practice guidelines for Getis-Ord Gi* analysis and has been applied in the vast majority of scenarios. However, our distance band of 683.64 m is the result of following this suggestion for the study area in this research. Therefore, this distance band is not universally applicable and should be adjusted for different scenarios. Table 3 lists the results under a distance band of 683.64 m, including T-intersections (ID. 584, 921, 1609) and cross-intersections (ID. 113, 1174), created by the INDSWMI generation algorithm (discussed in Section 2.3.2) to store the intersections' spatial relationships. Note that Table 3 does not list all the neighbors of the intersections (for example, intersection 113 has 97 neighbors under a distance band of 683.64 m) due to space limitations. The spatial weights (Wij) of the INDSWMI are the inverse of the network distance (for example, W113,584 = 0.019263 when the network distance between intersections 113 and 584 is 51.9122 m), so nearer intersections have a larger weight than intersections that are farther away. The columns (objectid, gid, nid, row-standardized weight) in Table 3 can be used to generate the binary swm format file, as discussed in Section 2.2.3. Note that the INDSWMI is a sparse matrix: Wij is zero whenever the network distance between intersections i and j is greater than the distance band, and most entries fall into this category. Therefore, in this paper, the rows with a spatial weight of 0 were omitted, since the default setting for spatial weights in hotspot analysis is 0; this effectively reduces the required file storage space.
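The weighting rule described above (inverse network distance inside the distance band, zero outside it, followed by row standardization) can be illustrated with a minimal Python sketch. This is not the INDSWMI generation algorithm of Section 2.3.2 itself: the function name `build_indswmi`, the dictionary-based sparse layout, and all distances other than the 51.9122 m between intersections 113 and 584 are assumptions made only for illustration.

```python
from collections import defaultdict


def build_indswmi(network_distances, distance_band):
    """Return (raw, standardized) sparse weights keyed as weights[i][j].

    network_distances maps directed pairs (i, j) to a network distance in
    meters. Pairs beyond the distance band are simply not stored, which
    mirrors omitting zero-weight rows from the weights file.
    """
    raw = defaultdict(dict)
    for (i, j), distance in network_distances.items():
        if i == j or distance <= 0 or distance > distance_band:
            continue  # implicit weight of 0
        raw[i][j] = 1.0 / distance  # inverse network distance

    standardized = {}
    for i, neighbors in raw.items():
        row_total = sum(neighbors.values())
        # Row standardization: the weights of each intersection sum to 1.
        standardized[i] = {j: w / row_total for j, w in neighbors.items()}
    return dict(raw), standardized


distances = {
    (113, 584): 51.9122,  # distance quoted in the text
    (584, 113): 51.9122,
    (113, 1174): 120.0,   # hypothetical distance
    (1174, 113): 120.0,
    (113, 921): 900.0,    # hypothetical, beyond the 683.64 m band
    (921, 113): 900.0,
}
raw, standardized = build_indswmi(distances, distance_band=683.64)
print(round(raw[113][584], 6))  # 0.019263, the value quoted above
```

Storing only the non-zero entries keeps the structure proportional to the number of neighbor pairs rather than to the square of the number of intersections.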

The Results of Intersection Hotspot Analysis (Getis-Ord Gi*) of Spencer City
We used intersection hotspot analysis (IHA, as discussed in Section 2.3.3) based on geo-processing tools in ArcGIS, taking the input parameters listed in Table 4, to calculate the Getis-Ord Gi*, z-score, p-value, and Gi*-bin (confidence level bin) for each intersection and thereby identify the CHIs of Spencer city (2008-2014). Note that the "Distance Band", "Weights Matrix File", and "Input Field", as the key input parameters, were generated in Sections 2.3.1 and 2.3.2. In general, we consider a result statistically significant when the confidence level exceeds 95%. As discussed in Section 2.3.3, we formed a randomization null hypothesis about the spatial distribution of the number of crashes at intersections. If an intersection's p-value is <0.05 and its Gi*ZScore is >1.96, then the null hypothesis of a random spatial distribution of the number of crashes at that intersection is rejected at the 5% significance level, and the intersection is treated as part of a cluster of high values (positive correlation, hot spot). An intersection that meets these conditions is considered a CHI at the 95% confidence level. From this perspective, we can identify the CHIs that indicate a positive spatial autocorrelation of the number of crashes. The CHIs all have a high frequency of crashes, and their neighbor intersections also have a high frequency of crashes. The calculated Gi*ZScores and p-values of the CHIs (2008-2014) are listed in Table 5. Furthermore, the expected Gi*ZScore has a positive correlation with the number of crashes for each intersection. Based on linear regression, we used the number of crashes as the independent variable x and the Gi*ZScore as the dependent variable y in the output feature shapefile to draw a scatter plot, shown in Figure 10. The scatter plot indicates that there is a linear relationship between the number of crashes and the Gi*ZScore.
The CHIs in the output feature shapefile of IHA at the 95% confidence level are colored in red in Figure 11 using GIS visualization. Figure 11 clearly demonstrates that the CHIs are clustered along the roads including Grand Ave, S Grand Ave, 11th St SE, and 1st Ave E. Based on the spatial distribution of CHIs, transportation authorities can develop targeted mitigation strategies to effectively reduce the number of crashes.
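The hotspot test used in this section can be sketched in a few lines of Python. The sketch below uses the standard Getis-Ord Gi* z-score formula and a two-sided normal p-value; it is not the ArcGIS implementation used in the paper. The function names, the toy crash counts, the binary weights, and the choice to give each intersection a self-weight of 1 are all assumptions made for illustration.

```python
import math


def gi_star_z_scores(values, weights):
    """Gi* z-score per feature.

    values:  dict id -> crash count
    weights: dict i -> {j: w_ij}; for Gi* the row of i is assumed to
             include i itself (self-weight).
    """
    n = len(values)
    mean = sum(values.values()) / n
    variance = sum(x * x for x in values.values()) / n - mean ** 2
    s = math.sqrt(max(variance, 0.0))
    z = {}
    for i in values:
        row = weights.get(i, {})
        sum_w = sum(row.values())
        sum_w2 = sum(w * w for w in row.values())
        numerator = sum(w * values[j] for j, w in row.items()) - mean * sum_w
        spread = max((n * sum_w2 - sum_w ** 2) / (n - 1), 0.0)
        denominator = s * math.sqrt(spread)
        z[i] = numerator / denominator if denominator > 0 else 0.0
    return z


def two_sided_p_value(z_score):
    """Two-sided p-value of a z-score under the standard normal distribution."""
    return math.erfc(abs(z_score) / math.sqrt(2.0))


def identify_chis(values, weights, z_threshold=1.96, alpha=0.05):
    """Ids whose z-score exceeds the threshold with p-value below alpha."""
    z = gi_star_z_scores(values, weights)
    return {i for i, zi in z.items()
            if zi > z_threshold and two_sided_p_value(zi) < alpha}


# Toy data (hypothetical): six intersections on a line, binary weights.
crashes = {"a": 10, "b": 8, "c": 1, "d": 0, "e": 1, "f": 0}
neighbors = {
    "a": {"a": 1, "b": 1},
    "b": {"b": 1, "a": 1},
    "c": {"c": 1, "d": 1},
    "d": {"d": 1, "c": 1, "e": 1},
    "e": {"e": 1, "d": 1, "f": 1},
    "f": {"f": 1, "e": 1},
}
z_scores = gi_star_z_scores(crashes, neighbors)
chis = identify_chis(crashes, neighbors)
```

In this toy example only the two adjacent high-count intersections pass the 95% test, which is the pattern the CHI definition targets: a high value surrounded by high values.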

A Performance Comparison of IHA between INDSWMI and IEDSWMI
As discussed in Section 2.3.1, the inverse Euclidean distance-band spatial weights matrix of intersections (IEDSWMI) can also be used in IHA. To validate the performance of the proposed approach, we carried out a further IHA experiment comparing the GIS visualization and the Intersection Prediction Accuracy Index (IPAI, as discussed in Section 2.3.3) obtained with the INDSWMI and with the IEDSWMI.
The crash data of Spencer city for 2008-2014 were used as the training data in the IHA, and the crash data of Spencer city for 2015-2017 were used as the test data in the IPAI calculation. Note that the training and test crash data were both applied in the KDSWMCI generation algorithm to statistically analyze the number of crashes separately. To be more specific, the number of crashes in 2008-2014 at each intersection was used as the training data to identify CHIs. Then, the number of crashes in 2015-2017 at intersections which were within these CHIs was used as the test data to calculate the IPAI. By this approach, the same datasets and the same parameters (as shown in Table 4), except the weights matrix files (INDSWMI and IEDSWMI), were implemented in intersection hotspot analysis at the 95% confidence level, and the IPAI was then measured.
To contrast them, Figure 12 shows the results of intersection hotspot analysis for the INDSWMI (Figure 12a) and IEDSWMI (Figure 12b). By comparing Figure 12a,b, we can see that the intersection distribution pattern of the CHIs is slightly different. There are fewer CHIs in Figure 12a than in Figure 12b. However, the distribution pattern of roads within the CHIs is notably different. There are far fewer roads within CHIs in Figure 12a than in Figure 12b, which means that the CHIs in Figure 12a have stronger spatial aggregation than those in Figure 12b. These differences are expected since the road network distance between intersections is a more appropriate distance measurement than the Euclidean distance for IHA. Note that the above visual comparison can only qualitatively validate the performance of the proposed approach. In order to quantitatively validate the performance of the proposed approach, the IPAI values of IHA using the INDSWMI and IEDSWMI should be compared. In theory, the higher the IPAI, the better the CHI prediction performance of the approach. The calculated IPAI results, shown in Table 6, indicate that higher prediction accuracy was achieved by IHA with the INDSWMI (IPAI: 4.79) than by IHA with the IEDSWMI (IPAI: 3.45).
Furthermore, an interesting result is indicated: the predicted percentage of crashes in CHIs during 2015-2017 is similar to the identified percentage of crashes in CHIs during 2008-2014 in both IHA with INDSWMI (51.68%-47.90%) and IHA with IEDSWMI (52.42%-48.95%). From this perspective, the results reveal that the CHIs which are identified by this approach have a high probability of being intersections with a high frequency of crashes in the future.
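The percentages compared in the preceding paragraph are shares of crashes that fall at CHIs in a given period. A minimal sketch of that calculation is given below; it does not compute the IPAI itself, whose definition is given in Section 2.3.3. The function name and all crash counts are hypothetical and serve only to show the training/test split.

```python
def crash_share_in_chis(crash_counts, chi_ids):
    """Percentage of all crashes in a period that occurred at CHIs.

    crash_counts: dict intersection id -> number of crashes in the period
    chi_ids:      set of intersection ids identified as CHIs
    """
    total = sum(crash_counts.values())
    if total == 0:
        return 0.0
    inside = sum(count for i, count in crash_counts.items() if i in chi_ids)
    return 100.0 * inside / total


# Hypothetical counts per intersection for the two periods.
training_counts = {1: 12, 2: 8, 3: 3, 4: 2}  # e.g., 2008-2014
test_counts = {1: 4, 2: 2, 3: 2, 4: 2}       # e.g., 2015-2017
chi_ids = {1, 2}                              # CHIs found on the training data

identified_share = crash_share_in_chis(training_counts, chi_ids)
predicted_share = crash_share_in_chis(test_counts, chi_ids)
```

The CHIs are fixed from the training period and then reused unchanged on the test period, so the second share measures how well the earlier hotspots anticipate later crashes.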

Conclusions
Intersections have an important impact on the frequency of crashes. In this paper we demonstrated an approach to directly identify crash hotspot intersections (CHIs) using spatial weights matrices (SWMs). The application of this methodology was illustrated using a spatial data set, including roads and traffic crashes, of Spencer city, Iowa, USA. The proposed inverse network distance-band spatial weights matrix of intersections (INDSWMI) generation algorithm uses the network distance matrix of intersections (NDMI) and a distance band to conceptualize the spatial relationships between intersections with respect to road network constraints. The developed k-nearest distance-band spatial weights matrix between crash and intersection (KDSWMCI, k = 1) generation algorithm aggregates crashes to intersections while taking GPS location accuracy into account. The INDSWMI generation algorithm can also be applied to building the SWM of road network crashes, with the added value that it can be used to support further spatial analysis (e.g., high-low clustering and Local Moran's I analysis). As a major contribution, we developed the Intersection Prediction Accuracy Index (IPAI) to test the prediction performance of intersection hotspot analysis (IHA). According to the findings of a performance comparison between IHA with the INDSWMI and IHA with the inverse Euclidean distance-band spatial weights matrix of intersections (IEDSWMI), the proposed approach has higher accuracy in identifying CHIs.
The potential of this study can be further realized if we address the following two issues in our future work. First, in this study, the crashes were applied in the proposed approach without consideration of the different crash types. Crashes can be divided into different types (vehicle rollover, single-car accident, rear-end collision, side-impact collision, and head-on collision), and each type may have a different spatial pattern. It is therefore necessary to treat crashes differently according to their type. As such, we will develop a spatial data mining approach that considers different types of crashes to discover their different spatial patterns. Second, in the current approach, the IPAI was tested with one distance band chosen to ensure that each intersection had at least one neighbor. However, selecting a different distance band according to a different minimum neighbor count may influence the prediction accuracy of the proposed approach. As such, in future work we will analyze the prediction accuracy of the IHA with different distance bands and suggest a distance band selection that maximizes the prediction performance of the proposed approach.