Road Map Inference : A Segmentation and Grouping Framework

Abstract: We propose a new segmentation and grouping framework for road map inference from GPS traces. We first present a progressive Density-Based Spatial Clustering of Application with Noise (DBSCAN) algorithm with an orientation constraint to partition the whole point set of the traces into clusters that represent road segments. A new point cluster grouping algorithm, according to the topological relationship and spatial proximity of the point clusters to recover the road network, is then developed. After generating the point clusters, the robust Locally-Weighted Scatterplot Smooth (Lowess) method is used to extract their centerlines. We then propose to build the topological relationship of the centerlines by a Hidden Markov Model (HMM)-based map matching algorithm; and to assess whether the spatial proximity between point clusters by assuming the distances from the points to the centerline comply with a Gaussian distribution. Finally, the point clusters are grouped according to their topological relationship and spatial proximity to form strokes for recovering the road map. Experimental results show that our algorithm is robust to noise and varied sampling rates. The generated road maps show high geometric accuracy.


Introduction
A road network map is one of the most fundamental features of geospatial information, with a variety of applications, such as public management, urban planning and navigation.It is, however, challenging to promptly and consistently update existing road maps [1], especially in developing countries, where road networks change rapidly.Conventional map updating techniques mainly consist of ground-based and aerial surveys [2].Ground-based surveys are limited to a long cycle of information acquisition, whereas aerial surveys can obtain large-scale data about the Earth's surface in a short time.Although many types of algorithms have been proposed to extract road networks from high-resolution satellite imagery [3,4], it is still difficult to generate large-scale road networks with these algorithms, due to the complexity of road networks and occlusion caused by trees and buildings.
In order to overcome the shortcomings of traditional road map mapping, User-Generated Content (UGC) [5] is currently being used for road map generation in two ways: (1) collaborative mapping programs, such as OpenStreetMap, which are highly dependent on manual work; and (2) automatic road generation from Global Positioning System (GPS) trajectories [6], which is referred to as map inference and is now a common and efficient method for updating road networks.The GPS trajectories are collected by GPS-equipped vehicles, bicycles or pedestrians and can explicitly represent the geometry and topology of the latest road networks.
The purpose of map inference is the extraction of roads' geometric positions, topological connections, as well as some attribute information, such as lane counts [7] and road names [6].Most of the existing map inference methods are aimed at dealing with low-noise, densely-sampled and uniformly-distributed GPS traces [1].However, in many cases, GPS probes may reduce sampling rates, due to energy and communication costs [8].Normally, these types of data constitute the main part of GPS traces.Thus, the inference of high-quality road maps from sparsely-sampled and high-noise GPS traces is an important and challenging research topic.
In order to overcome the limitation of low GPS sampling rates, we propose a new segmentation and grouping framework to generate road maps.The primary contributions of our method can be summarised as follows: • A flexible segmentation and grouping framework for map inference that is robust to noise and the variable sampling rates of GPS traces.

•
An extended progressive Density-Based Spatial Clustering of Application with Noise (DBSCAN) algorithm with an orientation constraint for two-dimensional (2D) point cloud segmentation.

•
A Hidden Markov Model (HMM)-based map matching algorithm for point cluster topological relationship construction.

•
A progressive point cluster grouping algorithm for road map recovery according to the stroke principle [9].
The remainder of the paper is organised as follows: Section 2 presents a review of the related work.Section 3 outlines the procedure of our map inference algorithm.Section 4 describes the experiments on two GPS traces datasets.Section 5 evaluates the generated road maps; and Section 6 discusses the future work of our method.

Related Work
Current map inference methods can be roughly classified into four categories: clustering-based methods, trace merging-based methods, Kernel Density Estimation (KDE)-based methods and intersection linking-based approaches [10,11].
The clustering-based approaches begin with the determination of a series of cluster seeds by the location, bearing and heading of the traces.Seeds are then linked to form the road segments according to the raw traces.The existing typical methods include methods proposed by Edelkamp and Schrödl [12], Schrödl et al. [7] and Agamennoni et al. [13,14].These methods can infer road maps with high geometric accuracy, i.e., detailed lane information and intersection models.However, these methods are designed for densely-sampled GPS traces.The accurate geometric information can only be inferred from fine-grained GPS traces.They are not robust to the noise of GPS traces.
Trace merging-based approaches involve the consolidation of trace segments based on their geometric relations and shape similarity.The centerlines of roads are then generated by the merged traces.Cao and Krumm [15] proposed an aggregation technique to pull together traces on the same road.They associate two types of forces with each trace.The first force makes each trace attracted or pulled by nearby traces, and the other one makes each trace constrained by the original position.To refine the intersection of inferred road network, Wang et al. [16] presented a novel approach for generating routable road maps from GPS traces based on the research of [15].They introduced circular boundaries to isolate points near intersections and used a k-means clustering method to refine the intersections.Their methods perform well on densely-sampled GPS trace data, but are not robust to noise.In order to improve the performance of the trace merging-based method, some mathematic techniques are utilised.Niehoefer et al. [17] applied a filter mechanism to reduce the noise of GPS traces captured by mobile phone.Chen et al. [18] theoretically guarantee the geometric accuracy of the inferred road map based on the assumptions that the traces are densely sampled and stay within the road surface.Wang et al. [1] applied Hausdorff distance to measure the distance between GPS traces to filter noise.The trace merging approach is a type of line-based method [19], and its performance is highly dependent on the sampling rates.If two consecutive points of traces are far from each other, especially when the vehicle turns at an intersection, the segments of GPS traces cannot align with the roads.In this case, spurious road segments are produced.
The KDE-based methods treat the map inference as a digital image processing procedure.Raw GPS traces are rasterised to a 2D grey image, and a global threshold is then enforced to filter the image.Finally, skeletonisation approaches, such as the Voronoi diagram and morphological operations, are applied to extract the centerlines of the road segments [20][21][22].This type of method is sensitive to the density distribution, and it is difficult to select an appropriate global threshold.In order to overcome the limitation of KDE, Biagioni and Eriksson [23] proposed a progressive KDE-based method.They developed a hybrid map inference pipeline, including the generation of an initial map by grey-scale skeletonisation, pruning of the map with a Viterbi map matching algorithm and refinement of the topology and geometry of the roads.Their method is robust to noise and disparity; however, since it is a line-based approach, sparsely-sampled traces cannot be handled.
Intersection linking approaches consist of two main steps.The first step is to detect the intersections of the road map.The second step is to link the intersections according to the raw traces.Fathi and Krumm [24] designed a local shape descriptor to represent the GPS traces distribution and trained a classifier on the descriptor to find the intersections from GPS traces.Karagiorgou and Pfoser [25] proposed a similar algorithm.They used a speed threshold and changes in direction as indicators to identify the intersections.These methods work well for road networks with simple intersections and densely-sampled traces.However, they are also sensitive to the sampling rates of GPS traces.
Most of the methods described above have utilised densely-sampled GPS traces, and some of the methods considered noise and disparity [23].In some cases, however, the collected GPS traces are not dense enough for these algorithms.Although the total amount of the traces is huge, the sampling rates for a single trace are sparse.The point-based KDE method [19] can deal with sparsely-sampled GPS traces, but is sensitive to the disparities.
The above review reveals that a more generic algorithm for map inference needs to be developed.The algorithm should have the ability to deal with both densely-and sparsely-sampled traces.Recently, Qiu et al. [26] proposed a method to segment sparsely-sampled GPS traces to point clusters for the generation of the centerlines of road segments.Based on this work, we propose a new segmentation and grouping framework for map inference.To the best of our knowledge, this is the first time that such a framework for map inference from GPS traces has been presented.The method works well for both densely-and sparsely-sampled GPS traces.

Method
The dataset of GPS traces Tr for our method was collected by GPS-equipped vehicles: Tr = {tr i |i = 1, 2, . . ., N}, where N is the number of the traces.GPS trace tr i is collected by a vehicle in a specific time span and consists of a sequence of GPS points p i,r , with each point recording the vehicle's position at a specific time, denoted by tr i = {p i,r |r = 1, 2, . . ., n} with p i,r = (lon i,r , lat i,r , t i,r ), where lon i,r ∈ [−180, 180] and lat i,r ∈ [−90, 90] are the longitude and latitude of the point; t i,r ∈ R + is the time at which the point was collected; n is the number of points; and time t i,1 < t i,2 < . . .< t i,n .With a large amount of GPS traces collected by vehicles over space and time, we can generate the road network by the accumulated traces.In the experiment, the longitude and latitude of all points are transformed to a projected coordinate system, denoted by (x, y).
Our method of map inference from GPS traces includes two components: point cloud segmentation and point cluster grouping.For segmentation, we compute the orientation of the points using two consecutive points as a vector.We then partition all of the GPS points into clusters with an extended progressive DBSCAN algorithm with an orientation constraint, where each cluster represents a short, nearly straight curve of road.
In the grouping component, we generate the centerlines of the point clusters by the robust Locally-Weighted Scatterplot Smooth (Lowess) [27] algorithm.An HMM-based map matching algorithm [28] is then used to match raw GPS traces with the centerlines of the clusters to build the topological relationship between clusters.The proximity between clusters is assessed by assuming that the distances of the points to the centerline comply with a Gaussian distribution.Finally, we group clusters according to the topological relationship and proximity to form the stroke for road map recovery.
Before segmentation, traces are preprocessed.If the distance between two consecutive points in a raw trace is greater than a threshold (i.e., 300 m), the trace is separated into two traces.If two points are too far from each other, the line segment between these two points will not properly align with the real road.

DBSCAN Algorithm with an Orientation Constraint
The DBSCAN algorithm [29] is a spatial clustering algorithm for points.It is designed to discover clusters with arbitrary shapes by using a density constraint, which is estimated based on the number of points in a specific area.The algorithm is robust to noise and does not require users' assistance to specify the number of clusters.In the classic DBSCAN algorithm, the ε − neighbourhood is employed to represent the density, which is defined by Definition 1.The input of the DBSCAN algorithm includes ε and MinPts, where ε is the search radius for finding neighbours of a point, and MinPts is the threshold for determining whether a point is a core point (Definition 2) or not.The generation of point cluster c m is an incremental procedure.First, a core point is found; and its neighbours are used to initialize c m .Then, c m is expanded by adding neighbours of the core points in c m until no point can be added.Definition 1. ε − neighbourhood of point p, denoted by N p , is defined by N p = {q ∈ P|dis(p, q) ≤ ε}, where P is the set of points and dis(p, q) is the Euclidean distance of points p and q.Definition 2. Core point.Point p is called a core point if its ε − neighbourhood N p contains at least MinPts points (minimum number of points).
In the context of our point cloud segmentation, we intend to determine point clusters that represent nearly straight curves.The range of the coordinates x and y values (x, y ∈ R) is different from the range of points' orientations ([0 • , 360 • ] or [0, 2π]).Therefore, we redefine the neighbourhood of a point with two components: search radius ε and difference of orientation α between two points (Definition 3).We also redefine the core point by the density of the traces (Definition 4.) The orientation starts from 0 • , which coincides with the x-axis, and increases in the counter-clockwise direction.Definition 3. ε, α − neighbourhood of point p, denoted by point set N p , is defined by N p = {q ∈ P|dis(p, q) ≤ ε ∧ di f f Ori(p, q) ≤ α}, where P is the set of points, dis(p, q) is the Euclidean distance of points p and q and di f f Ori(p, q) is the difference in the orientations of points p and q.Definition 4. Core point.Point p is called the core point if the point set of its ε, α − neighbourhood N p belongs to no less than MinLns GPS traces (minimum number of GPS traces).
The procedure of our DBSCAN algorithm with an orientation constraint is shown as Algorithm 1.
In the first stage, we use ε, α and MinLns to generate point clusters (Lines 1 to 15), where ε and α are the search radius and the difference of orientations between points for determining the neighbourhood, and MinLns is the number of GPS traces for determining if a point is a core point.In the second stage, we use Principal Component Analysis (PCA) to generate a line segment to fit the point cluster and filter the point cluster by the length and orientation of the line segment (Lines 16 to 21).If the length of the line segment is smaller than threshold MinLen (minimum length), we discard the point cluster.If the orientation of the line segment (range of [0 • , 180 • ]) does not meet with the mean orientation of the points in the cluster (modulo 180 • ), i.e., if the difference of the orientations is larger than threshold MaxDoOs (maximum difference of orientations), then also we discard the point cluster.Parameters MinLen and MaxDoOs are aimed at generating a relatively long segment and also help to eliminate clusters generated by noise.The variable nStartId is the identity for the first generated point cluster.
Input: RawPoints, ε, α, MinLns, MinLen, MaxDoOs, nStartId Output: PointClusters, nId 1 ∀p i ∈ P set p i .classId= −1 and p i as unvisited; Different parameters for finding neighbours result in different point clusters.Among parameters ε, α and MinLns, the most sensitive is α. Figure 1 illustrates the sensitivity of α. Figure 1a is a point set of raw GPS traces.We generate point clusters with α = 1 • and α = 5 • .Figure 1b,c shows two of the point clusters with α = 1 • and α = 5 • , respectively.The two point clusters are then fit by line segments computed with PCA (red lines in Figure 1).In the comparison of Figure 1b,c, we can observe that the point cluster generated with a small α can be well represented by a line segment.However, at the same time, a strong constraint (small α) leads to many points that cannot be clustered.
In order to balance this trade-off, we implement a progressive DBSCAN algorithm with an orientation constraint to generate point clusters, which is shown as Algorithm 2. First, we generate point clusters with a small α.We then enlarge α and use the points that are not classified as the input to generate new clusters.This second step is repeated until the constraint α reaches the predefined threshold.Our progressive DBSCAN algorithm with an orientation constraint needs three parameters for cluster generation and two parameters for cluster filtering.In the cluster generation procedure, search radius ε was fixed at 15 m.This value was estimated based on the lane width and number of the real urban roads.The lane width in cities is about 3.75∼4.0m, and we assume that the roads that the vehicles visited contained at least 4 lanes.Parameter α was set in the range of [1 • , 30 • ].The lower bound setting was based on the fact that the points close to each other in the same road have similar orientations; and the upper bound setting took into account the noise and sampling rate of GPS traces.The third parameter MinLns was set at 2, meaning that the actual road segment was not placed on the areas that were visited by vehicles less than twice.
In cluster filtering, we assume that the length of the generated road segment is larger than the width of the actual road (15 m).Therefore, we fixed MinLen to 15 m.Parameter MaxDoOs was set to 5 • , according to the experimental tests.

Point Clusters Grouping
Road segments generated by our progressive DBSCAN algorithm with an orientation constraint contain many short segments with similar orientations.To reduce redundancy, they need to be grouped to form long curves.For each point cluster, we assume that the distances from the points to the centerline of the cluster complied with Gaussian distribution G ∼ (µ, σ 2 ), where µ = 0 represents the centerline and σ is the standard deviation.Based on this assumption, we build the topological relationship and assess the proximity between point clusters to group them.
In this section, we employ the robust Lowess algorithm to generate centerline l i from point cluster c i and use the Median Absolute Deviation (MAD) formula to estimate the corresponding σ i of the Gaussian distribution.We then apply an HMM-based map matching algorithm to match raw GPS traces with generated centerlines to construct a topological relationship between point clusters.Finally, we group point clusters in a progressive manner based on their spatial proximity.

Centerline Generation from Point Cluster
There are many types of centerline generation algorithms for unorganised point clouds in the computer science community [30].The point clusters generated by our algorithm represent nearly straight curves.The curve does not self-intersect, but is distorted by noise and outliers.Thus, we use the robust Lowess method to generate the centerline of the point cluster.The method uses residual analysis to overcome the problem caused by noise and outliers.
Figure 2a is a typical point cluster generated with our map inference method.In order to apply the robust Lowess algorithm, we need to pre-process the point cluster.First, we rotate the points to the new coordinate system computed by PCA.The abscissa is shown as the line in Figure 2a.We then sort the points in ascending order according to their x values in the new coordinate system and connect them to form an unsmooth curve, which is shown in Figure 2b.After these two steps, we can use robust Lowess to generate a smooth curve, as shown in Figure 2c.Finally, we rotate the curve back to the original coordinate system.
(a) (b) The general procedure of the robust Lowess method can be summarised as follows: 1.
Apply a locally-weighted linear least-squares regression for each point to smooth the curve.

2.
Calculate the residual for each point according to the smoothed curve.

3.
Compute the value of MAD to remove outliers and update local weights.The residuals of the outliers are more than 6-times that of the value of MAD.

4.
Smooth the curve as described in Step 1, again using the new weights.5.
Repeat Steps 2 through 4 for a total of five iterations to obtain a smooth curve.
In the algorithm, the local linear regression for each point is performed on its k nearest neighbours, which is computed by their x values in the rotated coordinate system.We need to manually determine k for each cluster.Since the density (numbers of points in a fixed area) of raw GPS points varies over space, we use the average number of the points contained in a specified span dSpan to estimate k: k = dSpan/l × n, where l is the length of the line segment generated from the point cluster with PCA and n is the number of the points in the cluster.We fixed the dSpan to 10 m according to experimental tests.
For each point cluster c i = {p i,r |r = 1, 2, . . ., m}, we generate its centerline l i : l i = {v i,r |r = 1, 2, . . ., m ∧ m ≤ m}, where v i,r is the nodes that constitute the centerline.Due to outliers, the number of nodes (m ) that constitute the centerline l i is less than or equal to the number of points (m) in cluster c i .We can then estimate the corresponding σ i with the MAD formula, as shown in Equation (1).
where dis(p i,r , l i ) = min{dis(p i,r , v i,r )}, (r = 1, 2, . . ., m ) is the distance from point p i,r to centerline l i , dis(p i,r , v i,r ) is the Euclidean distance from point p i,r to node v i,r , and 1.4826 is the scaling factor for normal distribution data.Although the distribution of point clusters does not strictly comply with the Gaussian distribution, we can obtain a reasonable estimation with the MAD formula.

Topological Relationship Construction between Clusters
According to Newson and Krumm [28], matching of GPS traces to the actual road map (i.e., map matching) is achieved through finding the most probable route that the vehicle visited based on the location of the vehicle measured by GPS.They modelled the map matching problem as an HMM, which is a Markov process with a set of hidden states and observations.Transitions from state to state are defined by transition probabilities.Each state has a probability distribution called the emission probability distribution over the possible observations.Given a sequence of observations, a sequence of hidden states can be generated by maximizing the overall probability.
In the context of map matching, the hidden states are road segments; and observations are the points measured by GPS.The transition probabilities provide the probabilities that the vehicle transitions from one segment to another; and the emission probability for a state over an observation denotes the probability that the vehicle actually visited the segment given the point (position of vehicle) measured by GPS.
In the HMM-based map matching algorithm proposed by Krumm et al. [28,31], they projected each point measured by GPS to its nearby road segments to generate a set of candidates for matching and calculated the emission probability based on the distance from the point to the projected point of each candidate.For a vehicle moving from one point to the next point (measured by GPS), a set of possible transitions from one road segment to another could then be generated.The transition probabilities for the segments are calculated according to the difference of two distances: (1) the Euclidean distance of the two GPS points; and (2) the shortest path length of the two projected points on road segments.Finally, they use the Viterbi algorithm to match the GPS trace generated by a vehicle with the path that had the maximum probability among all possible paths.The algorithm simultaneously considered the locations of GPS points and the topological relationship of the actual road map.This method is robust to noise.
The proposed method matches the raw GPS traces with the centerlines generated by point clusters to build the topological relationships between the point clusters.The hidden state is the centerline, and the observation is the point measured by GPS.The matching result also can be used to remove the redundant point clusters that are caused by noises.
The transition probabilities in our method provide the probability of a vehicle moving from one centerline l i (generated by c i ) to another centerline l j (generated by c j ).It is derived from the two point clusters c i and c j , where c i = {p i,r |r = 1, 2, . . ., m}, c j = {p j,s |s = 1, 2, . . ., n} and m and n are the sizes of the clusters.The sequences of the points are indicated by the raw GPS traces.Intuitively, if there are m i,j (m i,j ≤ m) points in c i that have their next points contained in c j , we define the transition probability for transitioning from state l i to l j as Pr(l i , l j ) = m i,j /m.Based on this definition, the sum of the non-zero transition probabilities Pr(l i , l j ) for transitioning from l i to all of the other possible l j is equal to 1; and we call the number of possible l j the out-degree d(l i ) of l i .
In experiments with actual GPS data, we observed two facts for some centerlines: (1) Pr(l i , l i ) Pr(l i , l j ), where i = j, which means that the transition probability for transitioning from centerline l i to itself is much bigger than transitioning from l i to others; and (2) Pr(l i , l j ) Pr(l r , l s ), where d(l i ) d(l r ) and i = j, r = s, which denotes that the transition probabilities for transitioning from lower out-degree centerlines are much larger than the transition probabilities for transitioning from higher out-degree centerlines.In order to avoid the preference for matching GPS traces with low out-degree centerlines, we refined the transition probabilities Pr(l i , l j ) to be independent of the out-degree of centerlines by introducing a weight factor ω. The weight factor is derived from the out-degree of centerlines.The results based on this setting are more reasonable.The refining steps are as follows: 1.
For a point cluster c i ∈ C, obtain the number m of the points included in c i that their successive points are not included in c i .

2.
For a point cluster c j ∈ C, obtain the number m i,j of the points included in c i that have their successive points in c j (i = j).

3.
Compute the transition probability for transitioning from centerline l i to l j , Pr(l i , l j ) = m i,j /m .4.
Repeat Steps 2 and 3, computing transition probabilities for transiting from the l i to all possible l j . 5.

6.
Repeat Steps 1 to 5, computing the non-zero transition probabilities of all centerlines l i ∈ L.

7.
Find minimum out-degree d min among all of the out-degrees of all centerlines l i ∈ L.

8.
Update transition probabilities as Pr(l i , l j ) = ω i × Pr(l i , l j ), where l i ∈ L, l j ∈ L ω i is a weight factor, and The emission probability for a particular point p t measured by GPS at time t associated with a nearby centerline l i , denoted by Pr(l i |p t ), gives the likelihood that p t was observed from l i .It indicates the probability that the vehicle visited the centerline l i at time t given the measurement p t .We calculate the emission probability based on the assumption that the distance distribution of points to the centerline complies with Gaussian distribution G ∼ (µ, σ 2 ).For each point of raw GPS traces, the probabilities are estimated with Equation (2): where dis(p t , l i ) = min{dis(p t , v i,r )|v i,r ∈ l i } is the distance from point p t to centerline l i .
Variance σ is estimated in the centerline generation procedure.After computing σ i for each point cluster c i , we use max i σ i as σ to estimate the emission probabilities.In practice, we find up to 10 matched centerline candidates for each point.If the dis(p t , l i ) is greater than 6 × σ, the emission probability is close to 0, and we do not consider l i as a candidate for matching with p t .The emission probability Pr(l i |p t ) of point p t is normalised by scaling to [0, 1].
After initializing the transition probabilities and emission probabilities, we use the Viterbi map matching algorithm [28,32] to match raw GPS traces with generated centerlines to find the most probabilistic path that vehicles visited.The path with the highest probability indicates the connectivity between clusters.Clusters that do not match with any raw trace are treated as noise.

Point Cluster Grouping
A set of routes O = {o i |i = 1, 2, . . ., N} is obtained by matching raw GPS traces with centerlines, where N is the number of routes, which corresponds to the number of traces.Each route o i is produced by matching a trace tr i = {p i,r |r = 1, 2, . . ., n} with centerlines.Thus, o i consists of a set of centerlines, o i = {l i 1 , l i 2 , . . ., l i n }, where l i r is matched with point p i,r of trace tr i .Since we know a vehicle moves from point p i,r to the next point, p i,r+1 , we can infer that centerlines l i r and l i r+1 are topologically connected.In the next step, we employ the geometric positions of these two centerlines to determine if they can be grouped.
The concept of a stroke is that a curve can be drawn in one smooth movement [9].It is derived from the principle of good continuation, which indicates that elements that follow similar directions tend to be grouped together.Using the stroke concept, we group point clusters to generate smooth curves to the farthest possible extent.For two point clusters c i and c j that produce two topologically-connected curves l i and l j , respectively, we temporarily group c i and c j to obtain a new in Dis s (l i , l g ) as the α-quantile mean distance dis α (l i , l g ) from centerline l i to l g .The α-quantile mean distance dis α (l j , l g ) from l j to l g can also be obtained by the same procedure.Finally, if the sum of the two α-quantile mean distances is smaller than or equal to 6 × (σ i + σ j ), we group the two point clusters c i and c j to obtain c g .The α-quantile mean distance represents the deviation from the two original centerlines to the new centerline that produced by the two grouped clusters.dis α (l i , l g ) + dis α (l j , l g ) ≤ 6 × (σ i + σ j ) means the new centerline is in the union of the two buffer regions of l i and l j .According to the experimental testing, we set the α to a small value (i.e., 0.02).
Figure 3. Illustration of point cluster grouping.Black solid lines are l i and l j that are generated from point clusters c i and c j ; black dashed lines indicate the 6σ buffer regions; and red solid line l g is generated by c g , which is produced by grouping c i and c j .
Another issue is that the sampling rates of the raw GPS traces are not consistent.Two topologically-connected clusters may not be spatially near to each other.We define the distance between two clusters as in Equation (3).If the distance between two clusters dis(c i , c j ) is greater than a threshold (i.e., 100 m), they cannot be grouped.In the experiment of sparsely-sampled GPS traces, we found that the most probabilistic path could not discover the relationship between two adjacent clusters.Thus, as a final step of cluster grouping, we find the clusters for grouping according to the distance between them.If the distance between two clusters is less than or equal to a threshold (i.e., 100 m), we apply the grouping algorithm with an angle threshold β equal to 60 • to these two clusters.
dis(c i , c j ) = min p i,r ∈c i ( min p j,s ∈c j dis(p i,r , p j,s )) (3)

Experiments
We used two datasets to test our algorithm.The dataset collected in Chicago, USA (Dataset 1), consists of densely-sampled traces; and the other dataset collected in Wuhan, China (Dataset 2), consists of sparsely-sampled traces.The characteristics of the two datasets are described below: 1.
The Chicago dataset was GPS traces from a bus serving the University of Illinois at Chicago campus [23].The dataset covered an area of 3.8 km × 2.4 km, with the sampling rates varying from 1 s to 29 s (with a mean of 3.61 s and standard deviation of 3.67 s).The distances between two consecutive points ranged from 19.97 m to 96.6 m (with a mean of 24.4 m and standard deviation of 3.31 m).There were about 118,000 points.

2.
The Wuhan dataset was GPS traces collected by taxis.The dataset covered an area of 4.8 km × 5.5 km, with the sampling rates varying from 1 s to 81 s (with a mean of 37.42 s and standard deviation of 17.66 s).The distances between two consecutive points ranged from 0 m to 496.255 m (with a mean of 218.84 m and standard deviation of 140.38 m).There were about 350,000 points.
Figure 4a shows the raw GPS traces of the Chicago dataset displayed as points.We applied our progressive DBSCAN algorithm with an orientation constraint to the dataset and obtained the 2D GPS points segmentation results, as shown Figure 4b.A total of 1099 point clusters were obtained, and those that were spatially adjacent to each other are shown in different colours.In Figure 4b, we can visually detect that the clusters in the rectangle could be grouped to produce a nearly straight curve, due to their similar orientations and spatial adjacency.The reason for generating clusters with similar orientations are the following: (1) we applied the segmentation algorithm beginning with a strong constraint (α for finding neighbours of a point is 1 • ); and (2) orientations computed by two consecutive points were inaccurate due to the noise of GPS points and the sampling rate.
We then matched raw GPS traces with the centerlines that were generated by the point clusters to construct the topological relationship between clusters.Clusters that were topologically connected were grouped according to their geometric position.As the final step of point clusters' progressive grouping procedure, we also used the HMM-based map matching method to filter the clusters.If the centerline l i was matched with only two (or less) raw GPS traces, we regarded l i to be generated by noise and removed the corresponding point cluster c i .
Figure 4c shows the point cluster grouping results.Fifty-five clusters were obtained.A comparison of Figure 4b,c shows that the point clusters in the rectangle have been grouped by our algorithm.In fact, spatially-adjacent clusters with similar orientations were grouped properly.
As a final step, we used the robust Lowess algorithm to generate centerlines from the grouped point clusters.The inferred road map (blue lines) was overlapped with the ground truth map (light yellow lines) in Figure 4d.According to Figure 4d, the inferred road segments in the rectangle did not represent the ground truth map well: the roads were not frequently visited by the GPS-equipped vehicles; therefore, there was a lack of sufficient points to recover the roads with our algorithm.In the ellipse shown in Figure 4d, some spurious road segments were generated.As can be observed in Figure 4a, GPS points in this area were noisy, and the accumulation of the noise produced the spurious road segments.We applied the same procedure to the Wuhan dataset.Figure 5a shows the preprocessed GPS traces (grey lines) overlapped with the corresponding points (black points).According to this figure, the sparsely-sampled GPS traces could not align well with the real position of the vehicles, especially for the traces near the road network's intersections.We then used our segmentation algorithm to generate point clusters, as shown as Figure 5b.Due to the inaccuracy of the GPS points' orientations, 4970 clusters were generated.After grouping, the 216 clusters shown in Figure 5c were obtained.Finally, we generated the centerlines of the road network with the robust Lowess algorithm.The centerlines are shown as blue lines in Figure 5d, and the yellow lines represent the ground truth map. Figure 5d indicates that some road segments (e.g., road segments in the rectangular area) could not be recovered due to an insufficient number of GPS points.We compare our map inference method with four typical methods, including a hybrid map inference pipeline [23], a clustering-based method [12], a KDE-based method [20] and a trace merging-based method [15] using the Wuhan dataset.The results are shown in Figure 6. Figure 6a shows the roads generated by Edelkamp and Schrödl's method.Their method cannot generate road centerlines for these traces.The reason is that the points' orientations of sparsely-sampled GPS traces are inaccurate, and hence, the cluster seeds cannot be grouped together.We used 3000 traces for the method (the total number of the traces in Wuhan dataset is 12,832).According to Figure 6a, the 3000 traces cover the road network on the ground truth map. Figure 6b shows the road network generated by Biagioni and Eriksson's method.We applied the full number of traces to this method.However, the algorithm failed at the step of collapsing nodes into intersections.We then applied the method to 3000 traces, as in Figure 6b.Due to the sparse sampling rate, many roads could not be recovered by the algorithm.Figure 6c shows the road network generated by Davies' method with the full number of traces in the Wuhan dataset.The method is robust to the sampling rate, and we compare our method to this method in the next section.We also evaluated Cao and Krumm's method using the Wuhan dataset.However, due to the sparse sampling rate, the algorithm did not converge, and adjacent traces were not merged.
Thus far, we have only compared the inferred road centerlines to the ground truth maps to qualitatively evaluate our map inference algorithm.In the next section, we present a quantitative evaluation for the geometric accuracy of our map inference algorithm.

Evaluation
Several quantitative evaluation methods for the accuracy of the geometry and topology of inferred road maps have recently been proposed.Liu et al. [19] performed their quantitative evaluation by measuring the recall and precision of the roads on the inferred map (M) with respect to the roads on the ground truth map (Truth).The recall denoted the fraction of roads on the ground truth map that were retrieved, and the precision was the fraction of roads on the inferred map that were relevant.They used distance and orientation as constraints to match the roads of the two maps with each other by nearest distance and to determine the true positive length tp = M ∩ Truth by the matched proportion.The recall, precision and F-measure were then calculated as follows: recall = tp/||Truth||, precision = tp/||M||, and F − measure = (2 × precision × recall)/(precision + recall), where ||.|| measures the total length of the road segments in the corresponding set.Their method evaluated the geometric accuracy of the inferred map.
Biagioni and Eriksson [23] introduced a new method to measure the geometric and topological similarities of the inferred and ground truth maps at the same time.They resampled the two types of maps and placed holes on the ground truth map and marbles on the inferred map.They then matched marbles with holes by the distances between them and counted the unmatched marbles and empty holes.The accuracy of the inferred map with respect to the ground truth map was quantified by the spurious marbles and empty holes.reason that the total length of the roads on the ground truth map and inferred road map in Figure 7b is longer than the roads in Figure 7a.We evaluate the two inferred road maps by different matching thresholds of distance, where the results are shown in Tables 1 and 2. In Tables 1 and 2, the length of well-matched roads is the total length of roads on the ground truth map (or the inferred map) that were matched with roads on the inferred map (or the ground truth map), given the matching threshold of distance.Figure 8 presents the plots of the recall, precision and F-measure of the two methods.Figure 8a shows that the values of the recall, precision and F-measure criteria of Biagioni and Eriksson's method [23] increase as the matching threshold increase from 5 m to 20 m and became constant after the threshold reached 20 m. Figure 8b demonstrates that the values of these three criteria did not change much with our method after reaching the matching threshold of 10 m.The values of the criteria of our method were larger than the values of Biagioni and Eriksson's method when the matching threshold was smaller than 20 m.These results indicate that the road map inferred by our method has better accuracy in terms of the geometry.In addition, Biagioni and Eriksson demonstrated using the same dataset [23] that, in terms of geometry, their method outperformed the existing map inference methods, i.e., clustering-based methods, trace merging-based methods and KDE-based methods.The ground truth and inferred maps for Dataset 2 (Wuhan dataset) are displayed in Figure 5d.We separated the centerlines with opposite directions for bidirectional roads; however, the ground truth map only had one centerline for each road segment.Thus, all sampled points of the ground truth map could be matched with two points of the inferred roads.The total length of roads on the ground truth map in Figure 5d is 117.67 km, and the total length of the roads inferred by our method was 148.38 km.The matching results are shown in Table 3.The length of well-matched roads represents the total length of roads on the ground truth map that matched with roads on the inferred map, given the matching threshold of distance.The length of the correctly inferred roads is the total length of roads on the inferred map that matched with roads on the ground truth map, given the matching threshold of distance.Since the total length of roads on the inferred road map was larger than the total length of roads on the ground truth map, we modified the standard formula of recall and precision as follows: recall = ||well matched roads||/||Truth||, and precision = ||correctly in f erred roads||/||M||.Table 4 shows the matching results of the road network generated by Davies' method (Figure 6c).The total length of the inferred roads in Figure 6c is 97.97 km.  Figure 9 displays the graphical plots of recall, precision and F-measure for the sparsely-sampled Wuhan dataset (Dataset 2). Figure 9a is the plot of our method; and Figure 9b is the plot of Davies' method.According to Figure 9a, the precision is greater than 0.9, when the matching threshold of distance is larger than 15 m.Therefore, we conclude that our method can infer a road map from sparsely-sampled GPS traces with high geometric accuracy.When the threshold is equal to 5 m, the precision is found to be less than 0.5, because we generated two centerlines for the bidirectional roads and the ground truth map only used one centerline to represent bidirectional roads.On comparing Figure 9a,b, we conclude that the road network generated by our method has a better geometric accuracy than the road network inferred by Davies' method.

Conclusions
We have proposed a new segmentation and grouping framework for road map inference from GPS traces.The key aspect is the partitioning of the whole points of the GPS traces into clusters that represent nearly straight curves to recover the road map.In summary, our method has the capability to deal with both densely-and sparsely-sampled GPS traces.According to the experimental and evaluation results, we can conclude that our method performs well in terms of the geometric accuracy.The framework we proposed is open and flexible.We can develop a more sophisticated 2D point cloud segmentation algorithm to replace the current one, thereby further improving the overall performance of the proposed method.We can also use other types of skeletonisation algorithms that are more robust to noise to generate the centerlines of the point clusters.
In this research, we used two consecutive points to estimate the orientations of points.However, noise and sparse sampling considerably affect the accuracy of orientation estimation.In future work, we plan to integrate the connection between points and the point cloud distribution to estimate orientations to improve the performance of our algorithm.The method is designed to generate nearly straight curves to recover the road network.However, not all roads are straight.In future work, we will simulate curves with a sequence of line segments to infer winding roads from GPS traces.In addition, we need to process the intersections of the centerlines to generate the graph model of the road network.

11 if 12 set p k as visited; 13 else 14 classify
Card(N p k ) < MinLns then p k to class nId; 15 ∀p j ∈ N p k ∧ p j = p k push p j into Q, classify p j to class nId; 16 calculate the length L nId and Ori nId ([0 • , 180 • ]) of the cluster nId by PCA; 17 calculate the mean orientation mOri nId of all the points in cluster nId, modulo 180 • ; 18 calculate the difference of these two orientations DoOs; 19 if L nId < MinLen or DoOs > MaxDoOs then 20 set the class Id of the point p i ∈ cluster nId as -1; 21 continue; 22 nId + +, nStartId = nId; 23 return nId and point clusters according to their classId;

Figure 1 .Algorithm 2 : 8 α
Figure 1.An illustration of the sensitivity of α.(a) Point set of raw GPS traces; (b) one of the generated point clusters with α = 1 • ; (c) one of the generated point clusters with α = 5 • .

Figure 2 .
Figure 2. Procedure of generating centerline from points by the robust Locally-Weighted Scatterplot Smooth (Lowess) algorithm.(a) Points and the abscissa (x-axis) calculated by PCA; (b) points sorted in ascending order according to their x values; (c) line generated by the robust Lowess algorithm.

Figure 4 .
Figure 4. Experiment result of the Chicago dataset.(a) Raw GPS traces of the Chicago dataset displayed as points; (b) point clusters generated by the progressive Density-Based Spatial Clustering of Application with Noise (DBSCAN) algorithm with an orientation constraint; (c) point clusters after progressive grouping; (d) the generated road map (blue lines) overlapped with the ground truth map (yellow lines) of Chicago.

Figure 5 .
Figure 5. Experiment result of Wuhan dataset.(a) Raw GPS traces overlapped the GPS points collected in Wuhan; (b) point clusters generated by the progressive DBSCAN algorithm with an orientation constraint; (c) point clusters after progressive grouping; (d) the generated road map (blue lines) overlapped with the ground truth map (yellow lines) of Wuhan.

Figure 6 .
Figure 6.Inferred road networks overlapped with the ground truth map of the Wuhan dataset.(a) Road network generated by Edelkamp and Schrödl's method; (b) road network generated by Biagioni and Eriksson's method; (c) road network generated by Davies' method.

Figure 8 .
Figure 8. F-measure of Biagioni and Eriksson's method and our method.(a) Recall, precision and F-measure of Biagioni and Eriksson's method for Dataset 1; (b) recall, precision and F-measure of our method for Dataset 1.

Figure 9 .
Figure 9.Comparison of our method and Davies' method for the Wuhan dataset.(a) Recall, precision and F-measure of our method for the Wuhan dataset; (b) recall, precision and F-measure of Davies' method for the Wuhan dataset.

Table 1 .
Matching results of Biagioni and Eriksson's method on the Chicago dataset.

Table 2 .
Matching results of our method on the Chicago dataset.

Table 3 .
Matching results of our method for the Wuhan dataset.

Table 4 .
Matching results of Davies' method for the Wuhan dataset.