NS-DBSCAN: A Density-Based Clustering Algorithm in Network Space

Abstract: Spatial clustering analysis is an important spatial data mining technique. It divides objects into clusters according to their similarities in both location and attribute aspects. It plays an essential role in density distribution identification, hot-spot detection, and trend discovery. Spatial clustering algorithms in the Euclidean space are relatively mature, while those in the network space are less well researched. This study aimed to extend a well-known clustering algorithm, density-based spatial clustering of applications with noise (DBSCAN), to network space and proposed a new clustering algorithm named network space DBSCAN (NS-DBSCAN). Basically, the NS-DBSCAN algorithm used a strategy similar to the DBSCAN algorithm. Furthermore, it provided a new technique for visualizing the density distribution and indicating the intrinsic clustering structure. Tested on the points of interest (POI) in Hanyang district, Wuhan, China, the NS-DBSCAN algorithm was able to accurately detect the high-density regions. The NS-DBSCAN algorithm was compared with the classical hierarchical clustering algorithm and the recently proposed density-based clustering algorithm with network-constrained Delaunay triangulation (NC_DT) in terms of their effectiveness. The hierarchical clustering algorithm was effective only when the cluster number was well specified; otherwise, it might separate a natural cluster into several parts. The NC_DT method excessively gathered most objects into a huge cluster. Quantitative evaluation using four indicators, including the silhouette, the R-squared index, the Davies–Bouldin index, and the clustering scheme quality index, indicated that the NS-DBSCAN algorithm was superior to the hierarchical clustering and NC_DT algorithms.


Introduction
The first law of geography states that closer spatial entities are more strongly related to each other than distant ones [1]. Therefore, detecting spatial clusters of spatial events is an important technique in spatial analysis [2]. Clustering analysis has been widely used in hot-spot detection [3][4][5][6][7][8], traffic accident analysis [9,10], climate regionalization [11,12], and earthquake clustering identification [13][14][15]. Clustering analysis is generally divided into two categories: one uses spatial point pattern analyses to discover aggregated points with statistical indicators, and the other obtains clusters from the perspective of data mining [16]. The spatial point pattern methods, such as the local K-function [17,18], local Moran's I [19,20], Getis-Ord Gi [10], scan statistics [21], and local indicators of mobility association (LIMA) [22], are commonly adopted for indicating aggregated regions and discovering the density trend of a spatial dataset. Some of them have already been applied to network-constraint events [10,17,23]. In contrast to spatial point pattern methods, generic clustering algorithms for multidimensional features not only delineate the aggregated configuration of a dataset but also precisely depict the specific shapes of separated clusters.
Clustering algorithms divide a dataset into several clusters, with similar objects in the same cluster and dissimilar objects in different clusters [24][25][26][27][28][29][30][31][32][33]. They have been widely used in geoscience for spatial data [2,34]. Conventional clustering algorithms can be separated into four general categories: partitioning, hierarchical, density-based, and grid-based algorithms. The partitioning algorithms divide a dataset into several subsets by continuously optimizing an objective function. The hierarchical algorithms obtain a dendrogram by merging or splitting clusters. The density-based algorithms expand from dense areas to obtain high-density clusters separated by low-density regions. The grid-based algorithms are a composite approach that first divides a dataset into several subregions and then applies other clustering algorithms to each subregion.
Taking the Euclidean distance between points as the measure of dissimilarity, generic point clustering algorithms are mostly designed for the planar space. The partitioning algorithms can easily detect spherical clusters. The k-means algorithm is a traditional partitioning algorithm, but it is susceptible to outliers. The k-medoids algorithms [25,26,[35][36][37] overcome this shortcoming. Among hierarchical clustering algorithms, agglomerative algorithms are commonly used. A hierarchical clustering dendrogram cannot be modified once it is formed, making it inapplicable to incremental clustering. The BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) algorithm [27] solves this problem by maintaining a clustering feature tree and is able to distinguish noises (objects that do not belong to any cluster). The CURE (Clustering Using REpresentatives) algorithm [28] introduces the concept of representative objects to enhance efficiency. Moreover, the CHAMELEON algorithm [29] performs hierarchical clustering by constructing a k-nearest neighbor graph and is able to find clusters of arbitrary shape and variable size. The DBSCAN algorithm [30] is a famous algorithm that excels in detecting arbitrarily shaped clusters and removing noises. Many algorithms improve and extend the DBSCAN algorithm. For example, the OPTICS (Ordering Points To Identify the Clustering Structure) algorithm [31] focuses on identifying variable-density clusters. The ADCN (Anisotropic Density-based Clustering for discovering spatial point patterns with Noise) algorithm [32] is able to find linear and ring clusters using an anisotropic elliptical neighborhood. In addition to the aforementioned traditional spatial clustering algorithms and their improvements, some scholars have also proposed new clustering paradigms, enriching the clustering methodology. The FTSC (Field-Theory based Spatial Clustering) [33] and ASCDT (Adaptive Spatial Clustering based on Delaunay Triangulation) [14] algorithms proposed by Deng draw on the basic principle of field theory and interpret the meaning of clusters from the perspective of cohesion. The ASCDT algorithm [14] breaks the long edges in the Delaunay triangulation by global and local constraints to obtain clusters. The ASCDT+ algorithm [38] is outstanding in clustering points in the presence of obstacles (e.g., mountains) and facilitators (e.g., highways). In some cases, the similarities of both spatial and nonspatial attributes have to be considered, based on two options. The first option deals with the spatial and nonspatial attributes separately, and the clusters meet both spatial and nonspatial criteria, such as the CLARANS (Clustering Large Applications based on RANdomized Search) [26] and DBSC (Density-Based Spatial Clustering) [11] algorithms. The second option handles the similarity of nonspatial attributes and spatial distances simultaneously to define an overall similarity between objects, such as the GDBSCAN (Generalized DBSCAN) algorithm [39] and the DBRS (Density-Based Spatial clustering with Random Sampling) algorithm [40]. Moreover, the SEClu (Spatial Entropy-based Clustering) algorithm [41] also takes the spatial autocorrelation (SAR) of a nonspatial attribute into account and discovers clusters with high SAR.
The aforementioned spatial clustering algorithms are mostly used in the planar space and rarely applied to network-constraint events [9] (e.g., shops alongside streets, car accidents on roads, and earthquakes along coastlines), even though many phenomena in the real world, especially in city environments, are constrained by spatial networks. The network-constraint events are distributed in a discrete and heterogeneous network space. If they are analyzed in a continuous and homogeneous planar space, the conclusions can be unsatisfactory or even erroneous [42]. In recent years, the methods of network spatial analysis have received much more attention [17,[43][44][45][46][47][48]. These methods remarkably change the distance measure from the Euclidean distance in planar space to the shortest-path distance on networks. At the same time, spatial clustering algorithms have also been extended for network-constraint events. Yiu [49] proposed the k-medoids algorithm, the ε-Link algorithm (extended from DBSCAN), and the single-link algorithm for network events. On the basis of the algorithms proposed by Yiu, Sugihara [50] further improved the hierarchical clustering algorithm for network events. He used five distance measures to hierarchically cluster event points. It is quite a mature clustering algorithm and has been integrated into Spatial Analysis along Networks (SANET) [51], a toolbox for network spatial analysis. Stefanakis [52] extended the DBSCAN algorithm to cluster events in dynamic networks whose edges might be temporarily inaccessible. Moreover, Chen [53] proposed a framework for clustering moving objects in a spatial network based on the notion of cluster blocks. Deng [16] extended the DBSCAN algorithm to network space and proposed a density-based clustering algorithm that requires no input parameters. Shi [54] proposed a new framework for cluster detection in traffic networks based on spatial-temporal flow modeling. Spatial clustering algorithms in network space were also proposed for different kinds of applications. Oliveira [55,56] extended the DBSCAN and OPTICS algorithms to water distribution pipe breakages to locate regions of high breakage density. Smaltschinski [57] clustered harvest stands on a road network using the single-linkage algorithm to reduce the movements of harvesters and forwarders.
Theoretically, taking the shortest-path distance as the distance measure, most clustering algorithms in the planar space can be extended to the network space. In fact, however, the shift is not that straightforward because of the high complexity of computing the shortest-path distance between two objects. For example, the partitioning algorithms are insufficient for clustering network events because they are very time-consuming due to repeated graph traversals and are not effective at all [49]. The hierarchical and density-based algorithms perform efficiently and effectively for network events, and the density-based algorithms have a natural advantage in detecting arbitrarily shaped clusters and distinguishing noises. Moreover, the number of clusters is an input parameter for partitioning algorithms and often the termination condition of hierarchical algorithms, while it is actually difficult to determine. The density-based algorithms, however, do not take the number of clusters as an input parameter, which is beneficial to the discovery of natural clusters [31]. Yiu, Oliveira, and Stefanakis each extended the DBSCAN algorithm to network-constraint events. These algorithms shared the same problem: two global input parameters, eps (the distance determining a point's neighborhood) and MinPts (the minimal density of dense objects), were compulsory. The clustering results were very sensitive to the input parameters; therefore, obtaining good clustering results heavily depended on the user's domain knowledge [31]. The network space DBSCAN (NS-DBSCAN) algorithm addresses these drawbacks in two aspects. (1) The algorithm provides a new way to visualize the density distribution of event points, according to which the two parameters (eps and MinPts) can be better determined. (2) A local shortest-path distance (LSPD) algorithm is proposed to improve efficiency.
The rest of the article is organized as follows. Section 2 briefly introduces the basic idea of the proposed algorithm and the notion of density ordering. In Section 3, the two core steps of the proposed algorithm are presented. Section 4 evaluates the effectiveness of the NS-DBSCAN algorithm on the points of interest (POI) in Hanyang district, Wuhan, China. Section 5 discusses the basic principles for determining the two input parameters and the ways to deal with one-way and dead-end roads. Section 6 concludes with a summary of the study, implications of the proposed algorithm for current research, and some hints for future research.

Basic Idea
Figure 1 is a simulated dataset used to explain cluster detection for network events. Starting from an event point p, the part of the network that can be reached within a distance eps is p's eps-neighborhood; p is a central point, and the event points within p's eps-neighborhood are p's eps-neighbors N_eps(p). The number of neighbors is defined as the density of p. In Figure 1, the density of P5 is 4. The idea of density ordering is based on the fact that the densities of spatially adjacent event points are similar. Starting from an event point, the NS-DBSCAN algorithm iteratively puts the surrounding points into a density ordering table in descending order of their densities. If the event points are spatially close, they tend to have similar densities and are also close in the density ordering table. The densities of event points in the density ordering table can be visualized as a bar chart, resulting in a density ordering graph that presents hills of varying sizes corresponding to implicit clusters.
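Under these definitions, a point's density is simply the size of its eps-neighborhood. The following minimal Python sketch illustrates this; `net_dist` is a hypothetical stand-in for a shortest-path (network) distance function, and the toy distance table is illustrative rather than the exact geometry of Figure 1:

```python
def eps_neighbors(p, points, net_dist, eps):
    """Return p's eps-neighbors: event points reachable within eps."""
    return [q for q in points if q != p and net_dist(p, q) <= eps]

def density(p, points, net_dist, eps):
    """The density of p is the number of its eps-neighbors."""
    return len(eps_neighbors(p, points, net_dist, eps))

# Toy network distances (hypothetical values): P1, P7, P8, and P9 lie
# within eps = 1.0 of P5, while P2 lies farther away.
dists = {("P5", "P1"): 0.6, ("P5", "P8"): 0.7, ("P5", "P7"): 0.9,
         ("P5", "P9"): 0.9, ("P5", "P2"): 1.4}

def net_dist(a, b):
    return dists.get((a, b), dists.get((b, a), float("inf")))

points = ["P1", "P2", "P5", "P7", "P8", "P9"]
print(density("P5", points, net_dist, eps=1.0))  # prints 4
```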
Figure 2 shows how the density ordering graph was obtained and how it reflected the density distribution of event points. The NS-DBSCAN algorithm randomly selected an event point as the central point. Irrespective of which event point was selected first, the NS-DBSCAN algorithm went straight to the peak of local density and visited every event point around it. Figure 2 shows that cp1 was assumed to be the first one selected as the central point. In cp1's neighborhood, cp2 was the highest-density point and hence was chosen as the next central point. Likewise, cp3 was chosen as the next central point after cp2 and was the density peak of this region. When the peak was reached, the algorithm visited the event points around it and recorded their densities from high to low in the density ordering table. This procedure stopped when no event point could be reached within the search window defined by eps. The algorithm then randomly selected one of the points that were not yet in the density ordering table as a new central point. In this way, the bars of the density ordering graph usually started with a peak and went down gradually until they encountered another peak. An implicit cluster was likely to appear between the two peaks. Therefore, the density ordering graph was able to depict the density distribution of event points and even indicate the intrinsic clustering structure.
When the density ordering graph was obtained, setting the density threshold MinPts accordingly led to practical clusters with the strategy of the DBSCAN algorithm.

NS-DBSCAN Algorithm
The NS-DBSCAN algorithm is essentially the DBSCAN algorithm extended to network-constraint events. The algorithm consists of two core steps. Step 1: Generating the density ordering. In this step, the density ordering table and graph were obtained with one parameter, eps. The step included two substeps: the first involved obtaining eps-neighbors, for which the LSPD algorithm was introduced; the second involved generating the density ordering table and graph from the densities of event points.
Step 2: Forming clusters: In this step, the second parameter MinPts was set according to the density ordering graph and clusters were formed by categorizing spatially adjacent and dense event points into the same cluster.

Obtaining Eps-Neighbors
The NS-DBSCAN algorithm obtained the eps-neighbors of a central point using the LSPD algorithm. First, as depicted in Figure 3, an undirected planar graph was constructed for the network and event points in Figure 1. A new attribute, the current distance to the central vertex (CDCV), was assigned to each vertex, indicating the shortest-path distance of a vertex to the central vertex at any time. As interpreted in Figure 4, a basic expansion was defined as the motion from a vertex (start vertex) to its adjacent vertex (end vertex) along the edges. The path between the start vertex and the end vertex was called the expansion path. After a basic expansion, the CDCV of the end vertex was updated as the sum of the CDCV of the start vertex and the weight of the expansion path. Basic expansions aimed to minimize the CDCV of every end vertex under the constraint of eps. That is, if a basic expansion failed to decrease the CDCV of an end vertex, or the CDCV of an end vertex exceeded eps, the expansion path would be blocked. In Figure 4, if CDCV(P1) + W(P1, P2) ≥ CDCV(P2) or CDCV(P1) + W(P1, P2) > eps, the expansion path P1→P2 would be blocked.

Figure 3. An undirected planar graph G = (P, V, E, W) was generated for the simulated dataset. P denotes the event vertices, representing the event points. V denotes the ordinary vertices, representing the locations where road segments intersect. An ordinary vertex is not created at a segment intersection if an event vertex already exists there, such as P1. E denotes the edges, representing the road segments between two vertices. W denotes the weights of the edges, defined in this study as the lengths of the edges.

Figure 4. A basic expansion is a motion from a start vertex to an end vertex, the path between which is the expansion path. CDCV(p) represents p's current distance to the central vertex, and W(p, q) denotes the weight (length) of the expansion path between vertices p and q. The CDCV of the end vertex is updated to the sum of the CDCV of the start vertex and the weight of the expansion path between them.

The following explains the general steps of the LSPD algorithm, and the corresponding pseudocode is presented in Table 1.
(1) Setting CDCV of central vertex to 0 and that of other vertices to ∞.
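The expansion rules above amount to a Dijkstra-style shortest-path search truncated at eps. A minimal Python sketch under those assumptions (the adjacency-list representation and function name are illustrative, not the paper's implementation):

```python
import heapq

def lspd(graph, center, eps):
    """Local shortest-path distances: CDCV of every vertex within eps.
    graph maps each vertex to a list of (adjacent vertex, edge weight)."""
    cdcv = {center: 0.0}        # CDCV of the central vertex is 0
    heap = [(0.0, center)]      # all other vertices implicitly at infinity
    while heap:
        d, start = heapq.heappop(heap)
        if d > cdcv.get(start, float("inf")):
            continue            # stale queue entry; a shorter path was found
        for end, w in graph[start]:
            nd = d + w
            # Block the expansion path if it fails to decrease the end
            # vertex's CDCV or the new CDCV would exceed eps.
            if nd >= cdcv.get(end, float("inf")) or nd > eps:
                continue
            cdcv[end] = nd
            heapq.heappush(heap, (nd, end))
    return cdcv

graph = {"A": [("B", 0.5), ("C", 0.9)],
         "B": [("A", 0.5), ("C", 0.25)],
         "C": [("A", 0.9), ("B", 0.25)]}
print(lspd(graph, "A", eps=1.0))  # {'A': 0.0, 'B': 0.5, 'C': 0.75}
```

Note how C's CDCV is first set to 0.9 via the direct edge and then lowered to 0.75 via B, exactly the minimization that the basic expansions perform.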

Generating the Density Ordering Table and Graph
The density ordering table recorded each event point, its density, and its eps-neighbors as follows: {Id, Density, N_eps}, where Id represents the identifier of the event point, Density is its density, and N_eps is its eps-neighbors. A noteworthy phenomenon was that the density of an eps-neighbor was similar to that of its central point, and the two were likely to be gathered into the same cluster. Hence, if the local density peak was visited first, followed by its surrounding event points, the order of event points in the density ordering table indicated implicit clusters. The density ordering graph of the road network in Figure 3 is depicted in Figure 5.
The pseudocode for generating the density ordering is shown in Table 2. For the simulated dataset, let eps = 1. An empty table was initialized. P1 was selected first (by the order of event point identifiers), with N_eps(P1) = {P5, P2, P4, P8, P3} and the queue Q = {P1}. As Q was not empty, P1 was dequeued from Q. P1's density and eps-neighbors were written into the density ordering table to get a record {Id: P1; Density: 5; N_eps: [P5, P2, P4, P8, P3]}. N_eps(P1) was inserted into Q, and the density of each point in N_eps(P1) was calculated. After that, the points in Q were sorted by density from high to low, that is, Q = {P8, P5, P2, P4, P3}, with densities 5, 4, 3, 3, and 3, respectively. Then, the first point, P8, was dequeued from Q and added to the density ordering table. Q was updated as {P5, P2, P4, P3, P7, P9, P6}, where the densities of P7, P9, and P6 were unknown. The densities of these points were calculated to be 4, 3, and 3, respectively, and Q was updated as {P5, P7, P2, P4, P3, P9, P6}. Afterward, the first point in Q was sequentially dequeued to perform the aforementioned operations until Q was empty. Eventually, the order of event points in the density ordering table was P1, P8, P5, P7, P2, P4, P3, P9, and P6.
The priority queue Q determined the order of event points in the density ordering table because it ensured that the high-density event points preferentially entered the density ordering table. This made the density ordering graph appear like hills starting with local density peaks. The corresponding steps from Table 2 are:

(4) p is inserted into Q, and N_eps(p) is obtained by the LSPD algorithm;
(5) end if
(6) while Q is not empty do
(7) the first point q is dequeued from Q; write the density of q and N_eps(q) into the density ordering table;
(8) for each point s in N_eps(q) do
(9) if s's density is unknown then
(10) calculate its density by the LSPD algorithm;
(11) end if
(12) if s is neither in the density ordering table nor in Q then
(13) add s into Q so that the densities of the points in Q remain ordered from high to low;
(14) end if
(15) end for
(16) else
(17) go to (2);
(18) end while
(19) draw a density ordering graph according to the density ordering table;
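The pseudocode above can be rendered in Python roughly as follows. Densities and eps-neighbors are assumed to be precomputed (e.g., by the LSPD step); the toy densities come from the worked example, while the neighbor lists are illustrative:

```python
def density_ordering(points, density, neighbors):
    """Build the density ordering: dequeue the highest-density point
    first so the ordering graph starts at local density peaks."""
    ordering, seen = [], set()
    for start in points:         # restart when a component is exhausted
        if start in seen:
            continue
        queue = [start]
        seen.add(start)
        while queue:
            queue.sort(key=lambda p: density[p], reverse=True)
            p = queue.pop(0)     # highest-density point leaves Q first
            ordering.append(p)
            for q in neighbors[p]:
                if q not in seen:
                    seen.add(q)
                    queue.append(q)
    return ordering

# Densities from the worked example; neighbor lists are illustrative.
density = {"P1": 5, "P2": 3, "P3": 3, "P4": 3, "P5": 4,
           "P6": 3, "P7": 4, "P8": 5, "P9": 3}
neighbors = {"P1": ["P5", "P2", "P4", "P8", "P3"],
             "P2": ["P1", "P4", "P3"], "P3": ["P1", "P2", "P4"],
             "P4": ["P1", "P2", "P3"], "P5": ["P1", "P8", "P7", "P9"],
             "P6": ["P8", "P7", "P9"], "P7": ["P5", "P8", "P9", "P6"],
             "P8": ["P5", "P9", "P1", "P7", "P6"],
             "P9": ["P5", "P8", "P7"]}
print(density_ordering(sorted(density), density, neighbors))
# ['P1', 'P8', 'P5', 'P7', 'P2', 'P4', 'P3', 'P9', 'P6']
```

Re-sorting Q on every dequeue keeps the sketch close to the pseudocode; a max-heap keyed on density would be the more efficient choice in practice.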

Forming Clusters
The density ordering table indicates implicit clusters. To explicitly obtain clusters, the density threshold MinPts was set according to the density ordering graph. The event points with a density greater than MinPts were core points, and their eps-neighbors were border points. The proposed algorithm constructed a cluster by initiating it with a core point and gradually bringing the border points of its core points into the cluster until no more points could be added. The pseudocode of forming clusters is shown in Table 3; its expansion loop reads:

(8) for each point q in cluster C do
(9) if q is a core point then
(10) for each point s in q's border points do
(11) if s is not a member of cluster C then
(12) add s to cluster C;
(13) end if
(14) end for
(15) end if
(16) end for
(17) end if
(18) end for

For the simulated dataset, eps = 1 and MinPts = 4, and the general process of forming clusters is explained as follows. Point P1 was first selected as a core point and did not belong to any cluster; therefore, a cluster C1 containing P1 was initiated. The border points of P1, {P5, P8, P2, P4, P3}, were added into C1, as shown in Figure 6a. The next core point in C1 was P5, and its border points {P1, P8, P7, P9} were added to C1, as shown in Figure 6b. Thereafter, the border points of P8, {P5, P9, P1, P7, P6}, were added into C1, as shown in Figure 6c. Next, P2, P4, P3, P7, P9, and P6 did not bring any other points into C1 because they were not core points. A cluster C1 = {P1, P5, P2, P4, P8, P3, P7, P9, P6} was eventually formed. The algorithm then marked P10 and P11 as noises and assigned P12-P16 to another cluster C2. The final clusters of the simulated dataset are shown in Figure 6d.
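The cluster-forming step can be sketched as follows. Following the worked example, in which P5 (density 4 with MinPts = 4) expands the cluster, points whose density reaches MinPts are treated as core points here; the toy densities and neighbor lists are illustrative:

```python
def form_clusters(points, density, neighbors, min_pts):
    """DBSCAN-style expansion: each unassigned core point seeds a
    cluster that absorbs the border points of every core point in it."""
    clusters, assigned = [], set()
    for p in points:
        if density[p] < min_pts or p in assigned:
            continue
        cluster = [p]
        assigned.add(p)
        i = 0
        while i < len(cluster):
            q = cluster[i]
            if density[q] >= min_pts:      # only core points expand
                for s in neighbors[q]:
                    if s not in assigned:
                        assigned.add(s)
                        cluster.append(s)
            i += 1
        clusters.append(cluster)
    noise = [p for p in points if p not in assigned]
    return clusters, noise

# Illustrative densities and neighbor lists; P10 and P11 are sparse.
density = {"P1": 5, "P2": 3, "P3": 3, "P4": 3, "P5": 4, "P6": 3,
           "P7": 4, "P8": 5, "P9": 3, "P10": 1, "P11": 1}
neighbors = {"P1": ["P5", "P8", "P2", "P4", "P3"],
             "P2": ["P1", "P4", "P3"], "P3": ["P1", "P2", "P4"],
             "P4": ["P1", "P2", "P3"], "P5": ["P1", "P8", "P7", "P9"],
             "P6": ["P8", "P7", "P9"], "P7": ["P5", "P8", "P9", "P6"],
             "P8": ["P5", "P9", "P1", "P7", "P6"],
             "P9": ["P5", "P8", "P7"], "P10": ["P11"], "P11": ["P10"]}
points = list(density)
clusters, noise = form_clusters(points, density, neighbors, min_pts=4)
print(len(clusters), noise)  # 1 ['P10', 'P11']
```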

Experiment
This section evaluates the effectiveness of the NS-DBSCAN algorithm both qualitatively and quantitatively by testing it on a real dataset and comparing it with the classical hierarchical clustering algorithm proposed by Sugihara [50] and the density-based clustering algorithm with network-constrained Delaunay triangulation (NC_DT algorithm) recently proposed by Deng [16]. Section 4.1 gives a brief introduction to the dataset used in the experiment. Section 4.2 presents the preprocessing of the dataset. The experiment and comparisons of the three algorithms are presented in Section 4.3.

Dataset
The Hanyang district is one of the important industrial areas in Wuhan, Hubei Province.It is a less-developed area where several large lakes are embedded.Therefore, highly populated regions are quite separated.As shown in Figure 7, the POIs in Hanyang district are mainly distributed in or around the residential areas (I & III), colleges (II), industrial parks (IV & V), and auto part shops (VI).The Wangjiawan and Longyang villages are the main residential areas, with lots of commercial housings, schools, banks, and shops.The Wuhan Technician College is an important college where thousands of technicians are trained every year.The Huangjinkou Urban Industrial Park contributes the largest share of tax among urban industrial parks in Wuhan.The Wantong Industrial Park and Shengyuan Yuanhua Industrial Park are also large industrial parks.The Huangjinkou Auto Parts Market is the largest auto parts market in central China, where more than 3000 shops sell car accessories.These highly populated regions are also regions where POIs easily form clusters.
The POI data were downloaded from Baidu Map (http://lbsyun.baidu.com/) on January 14, 2018. They included 4062 points and were classified into 11 categories according to their functions for citizens, as shown in Figure 7. The road network dataset was downloaded from OpenStreetMap (https://www.openstreetmap.org) on January 9, 2018. It included 944 road segments after preprocessing.

Preprocessing
The road network (Figure 8a) and POIs were preprocessed before they were used in the experiment. The preprocessing included the following four main steps:
(1) Deleting all flyovers and tunnels to make the road network planar.
(2) Extracting the skeletons of divided highways and splitting the road segments where they intersect (Figure 8b).
(3) Moving each point alongside the roads to its nearest road segment to create event vertices and establish a correspondence between event vertices and points, followed by creating ordinary vertices where road segments intersect (Figure 8c).
(4) Splitting road segments at event vertices (Figure 8d).
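Step (3) requires projecting each point onto its nearest road segment. A minimal planar sketch of that projection (the function name is illustrative; a real implementation would test the nearby segments and keep the closest foot point):

```python
def project_to_segment(p, a, b):
    """Project point p onto segment ab, clamping to the endpoints."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len2 = dx * dx + dy * dy
    if seg_len2 == 0.0:
        return a                               # degenerate segment
    # Parameter of the orthogonal projection, clamped into [0, 1].
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len2))
    return (ax + t * dx, ay + t * dy)

print(project_to_segment((1.0, 1.0), (0.0, 0.0), (2.0, 0.0)))  # (1.0, 0.0)
```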


Results
The three algorithms were tested on the same dataset of Hanyang district. The NS-DBSCAN and NC_DT algorithms were both implemented with the ArcGIS SDK. The hierarchical clusters were obtained using SANET version 4.1 [51], a plug-in program that statistically analyzes spatial patterns of events occurring on or alongside networks.
Figure 9 shows the clustering result of NS-DBSCAN algorithm.The six highly populated regions were all accurately detected and the less aggregated POIs became noise.The very high density regions (I, III, and IV) were portrayed as a large cluster, while those with subregions (II, V, and VI) were depicted with several separated clusters.The NS-DBSCAN algorithm was capable of distinguishing the separated highly populated regions, and the shape of clusters approximately portrayed the shape of these regions.In addition, some other regions with high local density were also detected and small clusters were formed there.
regions (I, III, and IV) were portrayed as a large cluster, while those with subregions (II, V, and VI) were depicted with several separated clusters.The NS-DBSCAN algorithm was capable of distinguishing the separated highly populated regions, and the shape of clusters approximately portrayed the shape of these regions.In addition, some other regions with high local density were also detected and small clusters were formed there.The hierarchical clustering algorithm for network-constraint events proposed by Sugihara [50] is a classical algorithm in literature.The algorithm evaluates five typical variants of distances between clusters, including the closest-pair distance (single-linkage distance in planar space), the farthest-pair distance (complete-linkage distance in planar space), the average distance (average-linkage distance in planar space), the median-pair distance (median distance in planar space), and the radius distance (centroid distance in planar space) on network events.Moreover, these distance measures were also evaluated in this study.Hierarchical clustering requires cluster number as an input parameter, which has a significant impact on the clustering result.The optimal cluster number was set as the one maximizing the silhouette (an evaluation indicator for clustering algorithms; the larger the silhouette, the better the clustering result).
Clusters of hierarchical clustering algorithm in Figure 10 delineated some high-density regions (II, III, IV, V, and VI) and the sparse POIs were also gathered to clusters.As amplified in Figure 10, region I was divided into two separate parts, although it was highly aggregated visually.The algorithm divided the dataset into several clusters according to their distances, without considering their density distribution.Although the clustering results were sensitive to the cluster number, the algorithm was still capable of delineating the density distribution of dataset as long as the cluster numbers were suitably set.The hierarchical clustering algorithm for network-constraint events proposed by Sugihara [50] is a classical algorithm in literature.The algorithm evaluates five typical variants of distances between clusters, including the closest-pair distance (single-linkage distance in planar space), the farthest-pair distance (complete-linkage distance in planar space), the average distance (average-linkage distance in planar space), the median-pair distance (median distance in planar space), and the radius distance (centroid distance in planar space) on network events.Moreover, these distance measures were also evaluated in this study.Hierarchical clustering requires cluster number as an input parameter, which has a significant impact on the clustering result.The optimal cluster number was set as the one maximizing the silhouette (an evaluation indicator for clustering algorithms; the larger the silhouette, the better the clustering result).
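The silhouette-based selection of the cluster number can be sketched as follows. This is a minimal illustration using planar Euclidean distances and scikit-learn's complete linkage (the planar counterpart of the farthest-pair distance) on synthetic points; the actual study evaluates the five Sugihara distance variants over network shortest paths, and the dataset and variable names here are illustrative.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Toy stand-in for network events: three compact planar blobs.
points = np.vstack([
    rng.normal(loc=c, scale=0.3, size=(30, 2)) for c in [(0, 0), (5, 0), (0, 5)]
])

# Try a range of cluster numbers and keep the one maximizing the silhouette.
best_k, best_score, best_labels = None, -1.0, None
for k in range(2, 10):
    labels = AgglomerativeClustering(n_clusters=k, linkage="complete").fit_predict(points)
    score = silhouette_score(points, labels)
    if score > best_score:
        best_k, best_score, best_labels = k, score, labels

print(best_k)  # 3 for three well-separated blobs
```

For network-constraint events, the Euclidean distances would be replaced by a precomputed matrix of shortest-path distances, but the silhouette-maximization loop stays the same.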
The NC_DT algorithm is a density-based algorithm for network-constraint events recently proposed by Deng [16]. The algorithm considers road segments as areas with a specific width rather than as geographical line entities. It constructs the Delaunay triangulation of all event points and deletes the edges that are not within the road areas. The remaining triangulation is defined as the network-constraint Delaunay triangulation (NC_DT), from which the shortest-path distances between event points are obtained. The major merit of this algorithm is that no input parameters are required: it determines the neighborhood size (eps) by network kernel density estimation and potential entropy. Moreover, statistical tests of each event point under a null hypothesis are introduced to decide which points finally become core points. Similar to the DBSCAN algorithm, the NC_DT algorithm obtains density-based clusters by expanding from core points, and the less aggregated points finally become noise.
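The construction step of NC_DT can be sketched as below. This is a simplified illustration: an edge-length threshold (`max_len`, an assumption of this sketch) stands in for the paper's test of whether an edge lies within the buffered road area, and the point set is synthetic.

```python
import itertools
import numpy as np
from scipy.spatial import Delaunay

rng = np.random.default_rng(1)
points = rng.uniform(0, 10, size=(40, 2))

# Build the Delaunay triangulation and collect its unique edges.
tri = Delaunay(points)
edges = set()
for simplex in tri.simplices:
    for i, j in itertools.combinations(simplex, 2):
        edges.add((min(i, j), max(i, j)))

# Stand-in for the "within the road area" test of NC_DT: here we simply
# drop edges longer than a threshold; the real algorithm tests containment
# in road polygons of a specific width.
max_len = 3.0
kept = {(i, j) for (i, j) in edges
        if np.linalg.norm(points[i] - points[j]) <= max_len}
print(len(edges), len(kept))
```

Shortest-path distances between event points would then be computed over the `kept` edge set, which is exactly where the disconnectedness problem discussed below arises: dropping edges can split the graph into components with no finite path between them.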
Figure 11 shows that the NC_DT algorithm did not perform effectively for this dataset; it gathered 89% of the POIs into a giant cluster, and many small clusters appeared, mostly in the less dense regions. It was not suitable for the dataset because the distances between some event points might be exaggerated due to the disconnectedness of the NC_DT. As a consequence, the parameter determined by the statistical method based on the distances between event points might be invalid. For example, the POIs in the amplified region of Figure 11 should be in the same cluster. However, they were separated into two different clusters because the points in the different clusters were not connected by the NC_DT. In fact, most small clusters were formed where the NC_DT was disconnected. A possible remedy is to add all the intersection and inflection points of road segments into the distance matrix and distinguish these points from event points while clustering. However, this inevitably adds to the already high time and space complexity of the algorithm. Therefore, the NC_DT algorithm is recommended only for very dense event points, in which case the effect of a disconnected NC_DT can be largely avoided.
The quantitative evaluation of the three clustering algorithms is listed in Table 4. Four indicators, namely the silhouette [25], the R-squared index (RS) [58], the Davies-Bouldin index (DB) [59], and the clustering scheme quality index (SD) [60], were used to evaluate the effectiveness of the clustering algorithms. All the indicators showed that the NC_DT algorithm was not effective for the dataset. Among the five distance measures of the hierarchical algorithm, the closest-pair distance performed badly, while the rest were acceptable. The NS-DBSCAN algorithm produced several clustering results with different groups of input parameters. Although not all the input parameters led to a good clustering result, all the indicators showed that, in general, the clustering results were better than those of the other two algorithms.
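Two of the four indicators, the silhouette and the Davies-Bouldin index, are available in scikit-learn and can be computed as sketched below on a toy clustering; RS and SD have no standard library implementation and are omitted here. Planar DBSCAN on synthetic points stands in for the network-based algorithms, so all data and parameters are illustrative.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score, davies_bouldin_score

rng = np.random.default_rng(2)
# Two compact blobs as a toy clustering result.
points = np.vstack([rng.normal(c, 0.2, size=(25, 2)) for c in [(0, 0), (4, 4)]])

labels = DBSCAN(eps=0.6, min_samples=4).fit_predict(points)
mask = labels != -1  # evaluate clustered points only, excluding noise

sil = silhouette_score(points[mask], labels[mask])  # larger is better
db = davies_bouldin_score(points[mask], labels[mask])  # smaller is better
print(sil, db)
```

Note that both indicators require at least two clusters, so noise points (label -1) must be excluded before evaluation, as done with `mask` above.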

Parameterization
To obtain satisfactory clusters, the two input parameters eps and MinPts need to be carefully set, and the density ordering graph can serve as guidance. The density ordering graph exhibits several useful characteristics. First, it delineates the overall density distribution of the event points and indicates implicit clusters. For example, with the dataset used in the experiment, Figure 12 depicts the density ordering graphs for four situations (eps = 100, 200, 300, and 400). The hills within the orange rectangle in Figure 12b delineate Wangjiawan (region I in Figure 7), a highly populated region where clusters were most likely to appear because the densities of spatially adjacent event points were quite high. Second, with the increase in eps, the density ordering graph loses more details about the density distribution, and large clusters are more likely to emerge. Finally, the hills passed through by the horizontal line at MinPts are likely to form clusters, and the number of clusters approximately equals the number of hills crossed at MinPts. For example, the red line in Figure 12b passed through 27 hills, and 31 clusters eventually came into being, as shown in Figure 9.
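The hill-counting rule described above can be sketched as a small helper. The toy `ordering` sequence is illustrative; the paper's actual mechanism for sequencing event points so that spatially adjacent points sit next to each other is not reproduced here.

```python
def count_hills(densities, min_pts):
    """Count contiguous runs of the density ordering that rise to MinPts or above.

    Each such "hill" crossed by the horizontal line at MinPts roughly
    corresponds to one expected cluster in the density ordering graph.
    """
    hills, inside = 0, False
    for d in densities:
        if d >= min_pts and not inside:
            hills, inside = hills + 1, True
        elif d < min_pts:
            inside = False
    return hills

# A toy density ordering with two hills above MinPts = 4.
ordering = [1, 2, 5, 7, 6, 3, 1, 2, 6, 8, 5, 2, 1]
print(count_hills(ordering, 4))  # → 2
```

Scanning MinPts over a range of values and watching how the hill count changes is one simple way to anticipate the number of clusters before running the full algorithm.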

Based on these characteristics of the density ordering graph, parameter setting followed two basic principles. First, eps should be set so that the density ordering graph has hills of medium size, which helps to obtain medium-sized clusters. For example, the hills in Figure 12b were preferable to those in Figure 12a (too slim) and Figure 12c,d (too fat). Second, MinPts should be set below the peaks where clusters are expected. Setting the two input parameters in accordance with these principles finally led to a relatively practical clustering result.

One-way and Dead-End Cases
All the paths of the road network were assumed to be bidirectional in this study. However, one-way roads sometimes exist in a city's road network, and they should be considered while clustering. In the LSPD algorithm, if an expansion path is unidirectional, it is blocked when it goes backward. For example, in Figure 4, the expansion path P1→P2 is blocked if the path from P2 to P1 is unidirectional.

Dead-end roads do not have any impact on the study. In a real dataset, dead-end roads do not intersect with other roads on one side or on both sides. Figure 13 demonstrates the dead-end side of a road and the basic expansion from P1 to the dead-end vertex P2 (usually an ordinary vertex). CDCV(P2) was updated to CDCV(P1) + W(P1, P2), and hence was greater than CDCV(P1). P2 became the next start vertex, but the basic expansion from it went nowhere because the expansion path from P2 to P1 was blocked. In this situation, no other vertices were influenced by the dead-end vertex.
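A minimal sketch of how one-way blocking can be realized in an LSPD-style expansion: representing the network as a directed adjacency map means a one-way road simply has no reverse entry, so backward expansion is blocked automatically, and a dead-end vertex is reached but expands nowhere. The function and toy graph below are illustrative, not the paper's implementation.

```python
import heapq

def eps_neighbors_directed(adj, center, eps):
    """Dijkstra-style expansion honoring one-way paths.

    adj maps each vertex to {target: weight} for traversable directions
    only; a missing reverse entry blocks the backward expansion.
    """
    cdcv = {center: 0.0}  # current distance to central vertex (CDCV)
    heap = [(0.0, center)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > cdcv.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj.get(u, {}).items():
            nd = d + w
            if nd <= eps and nd < cdcv.get(v, float("inf")):
                cdcv[v] = nd
                heapq.heappush(heap, (nd, v))
    return {v: d for v, d in cdcv.items() if v != center}

# One-way edge P1->P2 and a dead end at P2: the expansion reaches P2,
# then stops, influencing no other vertices.
adj = {"P1": {"P2": 3.0}, "P2": {}}
print(eps_neighbors_directed(adj, "P1", 5.0))  # → {'P2': 3.0}
```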

Conclusions
This study extended the DBSCAN algorithm to network events and proposed a new clustering algorithm named NS-DBSCAN. Although two input parameters are still required, a density ordering graph is provided as guidance for setting them. An experiment on a real dataset was conducted to evaluate the effectiveness of the proposed algorithm, and it accurately detected highly populated communities. Compared with the hierarchical algorithm and the NC_DT algorithm, the NS-DBSCAN algorithm could better delineate the density distribution of the dataset. Moreover, the four indicators showed that the NS-DBSCAN algorithm, in general, performed better than the two compared algorithms.
The proposed algorithm is important for the following reasons. First, it concentrates on clustering network-constraint events, which are very common in city environments. Second, it provides an efficient algorithm (the LSPD algorithm) to obtain the eps-neighbors of event points without constructing a distance matrix. Finally, it provides a new visualization of the density distribution of spatial point events, which can guide the parameterization of the segmentation threshold. The proposed visual parameterization is less time-consuming than approaches extending DBSCAN with statistical indicators. Since clustering results should meet different needs in real-life applications, the proposed visualization allows users to obtain suitable clustering results graphically.
Future research on NS-DBSCAN should take the following issues into consideration. First, each hill exceeding MinPts in the density ordering graph should, in general, correspond to one cluster. However, this correspondence is not strict. Therefore, a more practical mechanism for ordering event points could be explored to increase the certainty of a one-to-one correspondence between hills and clusters. Second, users may have to try different values of eps to get a suitable graph. The eps parameter is expected to be determined heuristically or statistically in future studies.
ISPRS Int.J. Geo-Inf.2018, 7, x FOR PEER REVIEW 4 of 21neighbors N_eps (p).The number of neighbors is defined as the density of p.In Figure1, the density of P5 is 4.

Figure 1. A simulated dataset including the road network and event points. The 1-neighborhood of the central point P5 is marked as a thick gray line, covering four event points (P1, P7, P8, and P9).

Figure 2. The network space density-based spatial clustering of applications with noise (NS-DBSCAN) algorithm reached the peak of local density from cp1 to cp3 (cp3 is the local density peak).


Figure 3. An undirected planar graph N = (V∪P, E, W) was generated for the simulated dataset. P is the set of event vertices, representing the event points. V denotes the ordinary vertices, representing the locations where road segments intersect. An ordinary vertex is not created at a segment intersection if an event vertex already exists there, such as P1. E is the set of edges, representing the road segments between two vertices. W is the weight of the edges, defined as the length of the edges in this study.



Figure 4. A basic expansion is a motion from a source (start) vertex to a target (end) vertex; the path between them is the expansion path. CDCV(p) represents p's current distance to the central vertex (CDCV), and W(p, q) denotes the weight (length) of the expansion path between vertices p and q. The CDCV of the end vertex is updated to the sum of the CDCV of the start vertex and the weight of the expansion path between them.

(2) Performing a basic expansion with the central vertex as the start vertex.
(3) Continuing the expansion from the new vertices, which are the end vertices of the last expansion.
(4) Repeating step 3 until all expansion paths are blocked. The vertices whose CDCV is neither ∞ nor 0 constitute the eps-neighbors of the central vertex.
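The expansion steps above can be sketched as follows, assuming an undirected graph stored as an adjacency map; step 1 (initializing the CDCV of the central vertex to 0 and all others to ∞) is implied by the text. The function name and toy graph are illustrative, not the paper's implementation.

```python
from math import inf

def lspd_eps_neighbors(adj, center, eps):
    """Enumerate eps-neighbors by the wave-by-wave expansion steps.

    adj: vertex -> {neighbor: edge_weight}. CDCV of the central vertex
    starts at 0, all others at infinity; expansions proceed wave by wave
    (steps 2-3), and a path is blocked when it cannot improve the target's
    CDCV or would exceed eps (step 4).
    """
    cdcv = {v: inf for v in adj}
    cdcv[center] = 0.0
    frontier = [center]
    while frontier:                       # step 4: repeat until all blocked
        next_frontier = []
        for u in frontier:                # steps 2-3: expand from new vertices
            for v, w in adj[u].items():
                nd = cdcv[u] + w
                if nd <= eps and nd < cdcv[v]:
                    cdcv[v] = nd
                    next_frontier.append(v)
        frontier = next_frontier
    # Vertices whose CDCV is neither infinity nor 0 are the eps-neighbors
    # (event and ordinary vertices alike; event vertices can be filtered after).
    return {v for v, d in cdcv.items() if d not in (0.0, inf)}

# Toy graph: event vertices P1-P3 joined through ordinary vertex V1.
adj = {
    "P1": {"V1": 2.0}, "V1": {"P1": 2.0, "P2": 2.0, "P3": 4.0},
    "P2": {"V1": 2.0}, "P3": {"V1": 4.0},
}
print(lspd_eps_neighbors(adj, "P1", 5.0))  # V1 and P2 lie within eps of P1
```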

Figure 5. A density ordering graph of the simulated dataset. It is a bar chart in which the horizontal axis is the identifier of the event points and the ordinate is the density of each event point.

Figure 6. The core points in a cluster gradually brought the border points into the cluster: (a) P1 brought P5, P2, P4, P8, and P3 into cluster 1; (b) P5 brought P7 and P9 into cluster 1; (c) P8 brought P6 into cluster 1. The points of the simulated dataset eventually formed two clusters, and two points became noise, as shown in (d).

Figure 8. Steps of preprocessing: (a) original dataset; (b) extraction of skeletons; (c) movement of POIs to the nearest road segment; and (d) splitting of road segments at event vertices.

Figure 10. Clusters of the hierarchical clustering algorithm (farthest-pair distance, cluster number = 36) basically delineated the regions of aggregated POIs.

Figure 11. Clusters of the NC_DT algorithm, which did not work effectively for the dataset.

Figure 13. A basic expansion on a dead-end road segment.

Table 2. Pseudocode for generating the density ordering table and graph.

Table 4. Indicators for evaluating the effectiveness of the clustering algorithms. 1 For the silhouette and RS, larger is better; for DB and SD, smaller is better.