Extracting Urban Road Footprints from Airborne LiDAR Point Clouds with PointNet++ and Two-Step Post-Processing

: In this paper, a novel framework for the automatic extraction of road footprints from airborne LiDAR point clouds in urban areas is proposed. The extraction process consisted of three phases: The ﬁrst phase is to extract road points by using the deep learning model PointNet++, where the features of the input data include not only those selected from raw LiDAR points, such as 3D coordinate values, intensity, etc., but also the digital number (DN) of co-registered images and generated geometric features to describe a strip-like road. Then, the road points from PointNet++ were post-processed based on graph-cut and constrained triangulation irregular networks, where both the commission and omission errors were greatly reduced. Finally, collinearity and width similarity were proposed to estimate the connection probability of road segments, thereby improving the connectivity and completeness of the road network represented by centerlines. Experiments conducted on the Vaihingen data show that the proposed framework outperformed others in terms of completeness and correctness; in addition, some narrower residential streets with 2 m width, which have normally been neglected by previous studies, were extracted. The completeness and the correctness of the extracted road points were 84.7% and 79.7%, respectively, while the completeness and the correctness of the extracted centerlines were 97.0% and 86.3%, respectively.


Introduction
Airborne light detection and ranging (LiDAR) is an active remote sensing technique for acquiring 3D geospatial data over the Earth's surface [1,2], with which a point clouds dataset encoding 3D coordinate values under a given geographic coordinate system can be generated [3]. The point clouds can be further processed to extract thematic information and to generate geo-mapping products, such as man-made objects [4], stand-alone plants [5], a digital elevation model (DEM) [6], etc. The latest commercial LiDAR systems, such as the Leica ALS80 airborne LiDAR, can provide a planar accuracy better than 10 cm and a vertical accuracy better than 6 cm at a typical flying height of 1000 m.
LiDAR point clouds are becoming one of the main data sources for road extraction because of at least four advantages compared with optical imagery. First, the elevation information from LiDAR points can be used to separate elevated structures, such as buildings and vegetation from roads [7,8]. Second, LiDAR points can cover an urban road continuously. The ability of LiDAR's penetration through tree canopies reduces the occlusion effect caused by trees; its relatively narrow scanning angle also ensures the integrity of road information [9]. Third, LiDAR data contains extra information, such as intensity and full-waveform, which help to improve the accuracy of road extraction [10]. Finally, as acquired by an active sensor, LiDAR data are not affected by shadow and are less likely to be affected by weather [11]. incident angle, etc., and cannot describe the texture and structure of ground objects fully, making it difficult to extract high-quality road footprints from ground points even with the application of LiDAR intensity. In addition, point clouds usually lack spectral information, which poses difficulties for reliable object recognition. Moreover, because of the irregular distribution of point clouds, more effort is needed to extract accurate break lines or features, such as road edges. Due to the aforementioned problems of road extraction from the single dataset of point clouds, some studies have started to focus on using the combination of optical images and point clouds for road extraction.
A current airborne LiDAR system usually integrates a high-resolution charge coupled device (CCD) metric camera, from which high-resolution aerial images can be collected along with laser scanning data. Optical images can provide spectral and texture information simultaneously, which can be fused with LiDAR point clouds to identify ground objects, including mandate ones such as roads. Zhu et al. [24] presented an innovative automatic road extraction technique that combined information from digital images and laser scanning data, which could recover the hidden road edges in digital images by removing any high objects identified by point clouds. Hu et al. [25] focused on the integrated processing of high-resolution image and LiDAR data for automatic extraction of grid structured urban road network. The method firstly segmented the road areas from the LiDAR data using the intensity and height information. Then, road detection was performed by iterative Hough transform based on the contextual targets (i.e., parking lots and grasslands) obtained by image analysis. Youn et al. [26] presented a method for automatically extracting urban road networks from orthophotos and LiDAR data. The method started from the subdivision of a study area into small regions based on the homogeneity of the dominant road directions from the orthophotos. A process called "acupuncture" was used to select the road candidates in each region. Finally, extracted road candidates were edited to avoid collocation with non-road features such as buildings and grass fields. Wang et al. [27] preprocessed both point clouds and aerial images, and then used an improved mean shift algorithm to classify the LiDAR data fused with spectrum attributes into clusters, such as roads, buildings, vegetation, etc. Sameen and Pradhan [28] proposed a two-stage optimization strategy based on fuzzy object analysis for extracting urban road networks from LiDAR data combined with spectral information. In the strategy, the scale parameter was optimized using the Bhattacharyya distance, while the shape and compactness parameters were optimized using the Taguchi method. Milan [29] developed an integrated objectbased analysis framework for detecting and extracting various types of urban roads from high-resolution optical images and LiDAR data.
Nevertheless, the problem of road detection and extraction from the combination of point clouds and optical images still poses challenges to the research community: (1) The accuracy of road extraction results in complex urban scenes is low. Although the fusion of point clouds and optical images can improve the quality of road extraction, a large number of high buildings, tree shadows, vehicles, and road traffic signs reduce the integrity of road data and affect the extraction of road features. (2) The current mainstream road points extraction methods mainly adopt intensity and elevation of point clouds as the input features to a classifier. However, even if these two features are used [16,27], they cannot fully describe the difference between road areas and non-road areas (such as parking lots, low grass, bare land, etc.), resulting in low quality of the extracted roads.
As the development of deep learning and its applications to remote sensing data processing progresses, it shows that deep learning networks can learn the deep features of input data [30][31][32][33][34][35], which can be used as compensation for artificially constructed features to achieve high-accuracy classification and segmentation of remotely sensed data, including point clouds. Maturana and Scherer [31] proposed a method called VoxNet, which converted point clouds into voxels, and successfully applied 3-D convolutional neural networks (CNN) to process voxel data, achieving the recognition of geometric shapes in point clouds. Sun et al. [32] converted the point clouds into a normalized digital surface model (NDSM) image, then combined it with the spectral image, and adopted a Remote Sens. 2022, 14, 789 4 of 23 classic autoencoder to semantically segment the point clouds. However, the first step of each above-mentioned deep learning model is to rasterize the point clouds by interpolation. PointNet [33] is the first deep learning model that is directly applicable to point clouds. It was designed to be invariant to the permutations of input points; thereby, the process of classifying points into specific categories was greatly simplified into an end-to-end method, without the need for preprocessing steps such as rasterization. PointNet could learn global features from a set of point data, but it lost the description of local geometric features. To solve this problem, the authors further proposed a hierarchical network PointNet++ [34] to capture fine geometric structures from the neighborhood of each point. Following these two models, several modifications and variations were developed. Inspired by the scale invariance feature transform (SIFT) feature descriptor, PointSIFT [35] added a series of orientation-encoding units in the architecture of the PointNet++, to combine features from different orientations. Because the simple max or average pooling used in PointNet++ did not make the best use of local contextual information, PointWeb [36] proposed an adaptive feature adjustment module by which each feature could be pulled or pushed by the other features in a given local region according to the adaptively-learned impact indicators. Local contextual information could be encoded in this manner and the experimental results outperformed the compared ones. Liu et al. [37] proposed a point context encoding (PointCE) module to learn a set of scaling factors, by which the network could selectively focus on a few important intermediate features. Similarly, LAE-Conv [38] learned the coefficients that represented the importance of a neighborhood to the central point by multilayer perceptron (MLP), then aggregated the central point features as a weighted sum of its neighbors' features. Hu et al. [39] proposed an efficient and lightweight network called RandLA-Net for large-scale point clouds segmentation. This network utilized random point sampling to achieve remarkably high efficiency in terms of memory and computation. A local feature aggregation module was proposed to capture and preserve geometric details by progressively increasing the receptive field for each 3D point. The experiments showed that RandLA-Net could process 1 million points in a single pass up to 200× faster than existing approaches.
Compared with the extensive application of deep learning to terrestrial laser scanning (TLS) and mobile laser scanning (MLS) point clouds, there are relatively fewer studies on airborne laser scanning (ALS) data. Ödemir et al. [40] classified aerial point clouds into ground-level objects, including roads and pavements, vegetation, buildings' facades and roofs, and three deep learning algorithms and one other machine learning algorithm were tested and evaluated. Wen et al. [41] presented a global-local graph attention convolution neural network (GACNN) that combined edge attention and density attention, which was applied to the classification of unstructured 3D point clouds obtained by airborne LiDAR and which achieved good results. However, in these applications, road and other ground-level objects, such as pavements and parking lots, belong to the same class of impervious surfaces. Bearing all the abovementioned in mind, the purpose of this article is to introduce a new urban road extraction framework from airborne LiDAR point clouds and high-resolution co-registered images. The framework consists of three phases: (1) Road points extraction by PointNet++, where the features of the input data include not only those selected from raw LiDAR points, such as 3D coordinate values, intensity, etc., but also the DN values of co-registered images and generated geometric features to describe a strip-like road. (2) Two-step post-processing of the extracted road points from PointNet++ by graph-cut and constrained triangulation irregular networks (CTINs) to smooth road points and to remove clustered non-road points. (3) Centerline extraction based on a modified method originally proposed in [21], where road segment gaps and holes caused by the occlusion of vehicles, trees, and buildings are recovered by estimating the connection probability of the road segment through collinearity and width similarity.
In summary, there are two main contributions of the work: (1) Strip-like descriptors are proposed to characterize the geometric property of a road, which can be applied to both point clouds and optical images. Strip-like descriptors, together with other features, are input to the PointNet++ network, resulting in a more accurate road point classification. (2) Collinearity and width similarity are proposed to estimate the connection probability of road segments to repair the road gaps and holes caused by the occlusion of vehicles, trees, buildings, and some omission errors.
Experimental results show that the proposed framework not only outperformed previous methods [14,21,42] in terms of completeness and correctness, but also extracted narrower residential streets, which have normally been neglected by previous studies.

Sample Materials
Experiments were conducted on the Vaihingen dataset provided by the International Society for Photogrammetry and Remote Sensing (ISPRS) to compare the results of the proposed framework with previous methods. The data were acquired in August 2008 by Leica ALS50 with an average flying height of 500 m and a 45 • field of view. The point density was approximately 4.0 points/m 2 . In this dataset, all LiDAR points contain the x-, y-, z-coordinates in the WGS84 reference coordinate system, intensity values, and other attributes that were not likely to be useful for classification. In addition, orthophotos that had been co-registered with the point clouds could provide DN values of R, G, B channels for the corresponding laser scanning point. Ground truth was manually classified as roads, public patches, buildings, and low, medium, and tall vegetations. The testing area was located in the center of the city of Vaihingen, with an area of 500 m × 600 m. The testing area included diverse landcover, mainly roads, buildings, trees, lawns, vehicles, a river, etc., which was an adequately complicated landscape for road detection in practice. The training data contained about 3,474,448 points covering an area of 1000 m × 1100 m. The training and testing areas are illustrated in Figure 1.

Overview of Method
The workflow of the proposed approach is described in Figure 2, which is broadly divided into three phases: road points classification, extracted road points optimization and road centerline extraction. In the first phase, the road points were extracted from the input data using PointNet++ network, which were termed as initial road points. The input data are those featured by three coordinate values of the point clouds, intensity, RGB DN values from the co-registered image, and nine geometric features that were generated by the proposed approach to describe road geometry both in point clouds and image space. In the second phase, graph cut and non-road cluster filtering based on CTINs were performed to refine the initial road classification results. In the third phase, collinearity and width similarity were proposed to estimate the connection probability of the road segments, thereby improving the connectivity and completeness of the road network. The experimental results were evaluated on different quantitative metrics.

Road Points Classification
As mentioned in the workflow of the proposed approach, the initial road points were labeled by PointNet++. A crucial point of PointNet++ is to learn multiple layers of local context information progressively. First, the raw dataset was grouped into a set of overlapping sub-datasets. Local features were learned from these sub-datasets to generate low-level features. Higher-level features were learned from these low-level ones, subsequently, with the same procedure. These grouping and learning processes were repeated until the global features of all input points were obtained. A hierarchical propagation strategy was adopted by propagating the features extracted from sub-dataset points back to the original dataset.
PointNet++ can recognize fine-grained patterns, which helped produce better results in complex scenes. Moreover, PointNet++ showed higher robustness to changes in point density. Therefore, we chose PointNet++ to label initial road points in the first phase.
However, as the original of PointNet++ is mainly for the segmentation of point clouds acquired by terrestrial LiDAR, the input data were merely the three-dimensional coordinate values of point clouds. Some studies [43,44] expanded the input features to include intensity and RGB color information acquired by fusing optical images or by multispectral LiDAR. Although the PointNet++ network claims it can provide many deep learning features during the learning process, these features are based on the entire dataset and are not always optimal for the extraction of specific objects, such as roads. This is mainly caused by the application of unsupervised learning in the model initialization stage, where MLP mapping with shared weights showed difficulty in fully characterizing the spatial distribution of the point clouds, resulting in the network lacking the ability to learn geometric features of some specific objects, such as a strip-like road. Therefore, several additional features, such as intensity, RGB DN values, point density, and strip descriptors were used as input features to the PointNet++ model in the proposed framework. Most of the features are straightforward; two parameters to describe a strip-like road are expounded in the following subsections.
First, project the point clouds onto the x-y plane. Given a point P, which is termed a reference point, define a virtual point Q, which is a point dis far from point P and the direction to it is θ, measured counterclockwise from the North, as shown in Figure 3. With the variance of dis and θ, lots of virtual points can be defined. The virtual points which possess the same or similar intensity values are identified as similar virtual points. Length and width are used to characterize the elongation and narrowness of a road, both of which need a defined main direction. Given a reference point, count the number of consecutive similar virtual points in a given direction. Define the direction with the largest number of consecutive similar virtual points as the main direction of the reference point, as shown in Figure 4. It is worth noting that there may be several main directions of a given reference point, depending on the similarity of paved material in the neighborhood of the reference point. Armed with the main direction, the length of a road T pt_length can be defined by the number of consecutive similar virtual points along the main direction. In the case of multiple main directions, T pt_length is the same along each direction, and any one of these directions can be the main direction. Considering that road width varies in intersections, main direction divergence T pt_div is proposed to describe road width variation in road intersections. It is defined as the number of directions that are clockwise and counterclockwise rotated from the main direction, under the condition that, along each direction, the number of consecutive similar virtual points is less than the difference between T pt_length and a given threshold. Figure 5 shows the strip-like descriptors of point clouds in different regions. Calculation details are as follows: (1), where dL = 2.5 × the maximum width o f the roads. The 0 • direction is defined as the direction where P points to North. The increase of θ by 10 • can reduce computation load while ensuring the accuracy of strip descriptors [21]. (3) Calculate the intensity difference between each virtual point and the reference point and mark the virtual points with an intensity difference lower than a given threshold T I as similar virtual points. Count the number of consecutive similar virtual points starting from the reference point in 36 directions from 0 • , 10 • , 20 • , · · · , 350 • , and denote them by {N 0 , N 1 , · · · , N 35 }. This step can be formulated as follows: where I p is the intensity value of the reference point and I q(dis,θ) is the intensity value of the virtual point at the distance dis in the θ direction with regard to the reference point. T I is the intensity difference threshold, and, as noted in [20], the range of road intensity difference in a local road area is around -15~+15 if the road is paved with the same material, so the absolute value of T I is set to 15 for the data sets tested in this study. "M dis,θ = 1" is used to mark the similar virtual points. Therefore, the number of consecutive similar virtual points in any direction can be counted by the following steps: a.
Set the initial direction ω to 0 • , denote the virtual points in this direction by Q ω = {q 1 , q 2 , · · · , q L }, where the points are sequenced by the increasing distance from the virtual points and the reference point. c.
For a given point q i (i = 1, 2, · · · , dL) from Q ω , if its corresponding M dis,θ = 1, then go to the next point q i+1 and add 1 to N i . Otherwise, stop the counting for this direction, set ω = ω + 10 • , and go to step b. d.
Repeat the steps until ω = 360 • . (4) Suppose ε is the main direction of P, which, according to the definition of the main direction, is the direction with the largest number of consecutive similar virtual points. Let N ε be the number of continuous similar virtual points in the ε direction. Iteratively calculate the difference between N ε and N i , record them as {n 0 , n 1 , · · · , n 35 }, which can be expressed as:  The divergence of the main direction ε of the reference point can be obtained by calculating two parameters: M εL and M εR . M εL refers to the number of consecutive directions in which n i is less than a given threshold T dL in a counterclockwise direction starting from the main direction ε. While M εR refers to the number of consecutive directions in which n i is less than a given threshold T dL in a clockwise direction starting from the main direction ε. T dL is the difference threshold. A too large value of T dL causes roads with large widths to be labeled as parking lots or other patches of paved surface, while a too small value of T dL will distinguish some narrow lanes through which a car can hardly pass. It is set to half of the width of the widest road in the area, which is obtained by trial-and-error. The detailed steps are as follows.
a. Rotate from the main direction ε counterclockwise, every 10 • , to check whether n i is less than the given threshold T dL where subscript i = ε 10 • + 1 or i = 0 if ε = 350 • . If it is true and i = ε/10 • , then, Otherwise, if n i > T dL , or i = ε/10 • , continue to the next step. b. Rotate from the main direction ε clockwise, every 10 • , to check whether n j is less than the given threshold T dL where subscript j = ε 10 • − 1 or j = 35 if ε = 0 • . If it is true and j = ε, then, Otherwise, if n j > T dL , or j = ε/10 • , continue to the final step. Finally, T pt_length , T pt_div_ε and T pt_div are calculated by the following formulas: T pt_div = T pt_div_ε i f ε is the only main direction o f the re f erence point max(T pt_div_εi ) εi(i < 36) are the main directions o f the re f erence point Denote the counterpart of T pt_length and T pt_div in the image space as T img_ * length and T img_ * div (where * collectively denotes R, G, B). The calculation of these features is the same as that of T pt_length and T pt_div , except that T img_ * length and T img_ * div are calculated from the image space. Reference point and virtual points are projected onto the co-registered image, followed by the steps described above, where laser points replaced by image pixels and channels R, G and B are viewed as 3 different intensity values; therefore, 6 features, namely T img_Rlength , T img_Rdiv , T img_Glength , T img_Gdiv , T img_Blength and T img_Bdiv , are formed. Figure 6 shows the strip descriptors conducted to the filtered point clouds, where only road points and other ground-level points remained. Table 1 summarizes the employed features used for the experiment.

Features Symbol Definition
Coordinate values x, y, z The coordinate values of a given point.

Intensity I
The intensity of a given point.

Point density T pt_density
The number of points per unit area.
Strip descriptors for point clouds T pt length T pt_div T pt_length and T pt_div are used to describe the length and main direction divergence (width) of a road in point clouds, see the text for details.
Color R, G, B The DN values of a co-registered image.
Strip descriptors for optical image T img_ * length and T img_ * div ,,) are used to describe the length and main direction divergence (width) of a road in an image, see the text for details.
Prior to the training of the PointNet++, point clouds of the training and testing data are divided into small 3D cubic voxels. The size of a voxel is 200 m. Each point is represented by a 16D vector where x-, y-, z-coordinates, intensity, density, T pt_length and T pt_wid are obtained from LiDAR point clouds, while digital numbers of R, G, B channels, T img_ * length and T img_ * div ( * = R, G, B) are provided by the co-registered image. Figures 7 and 8 show the schemes of the basic steps of using PointNet++ for point clouds classification.  In the experiments, training samples and the standard testing data were obtained by hand labeling using the editing tools provided by LiDAR_Suite, an airborne point clouds processing software developed by the author's R&D group that has been commercialized. Figure 9 shows the results of the initial road points extracted at this stage. There were many small clustered and isolated points (Figure 9b), indicating commission errors. Furthermore, the edges of the roads were not smooth, and there were holes inside the road caused by vehicles, trees, etc., as shown in Figure 9c,d.

Two-Step Post-Processing of Initial Road Points
The initial road points extracted by the first phase of the proposed workflow contained both commission and omission errors. Therefore, two-step post-processing by graph-cut and CTINs were applied to optimize the classification results.

Road Point Smoothing with Graph-Cut
As pointed out in [45], local smoothing of the initial classification results can effectively remove isolated "road points", while the omitted road points located on the edge and inside the road will be relabeled. A graph-cut-based mathematical framework was proposed in [45] for this purpose. As illustrated in Figure 10, this approach can be divided into the following three steps: construction of the graphical model, construction of the energy function, and redefinition of point labels. The model was constructed by the undirected adjacency graph, expressed as {V, E}, where V denotes the nodes (representing the initial road points) and E denotes a set of edges connecting the nodes. Unlike [45], we did not directly use a k-nearest neighbors (KNN) algorithm to search for nodes; rather, we predefined a flattened cuboid as the neighborhood of a given point whose height is much smaller than its length and width, as shown in Figure 11, to prevent the points reflected from low objects, such as bushes, along the road from the construction of the graph, because a road surface can be viewed as a flat plane along its direction. The interested reader is referred to the cited reference for a detailed description of the method.

Clustered Non-Road Points Removal Based on CTINs
Though some commission and omission errors can be removed after the graph-cut smoothing, small clusters of non-road points can remain. CTINs is applied to eliminate these clusters. The mechanism behind CTINs is based on the following observation: The side length of a triangle formed by road points is smaller than the side length of a triangle formed by the edge points of the road and the adjacent non-road points. Therefore, side length can be used as the constraint to divide the triangles constructed by Delaunay triangulation performed on the smoothed initial road points into an unconnected cluster of triangles and isolated points. Since a road surface is well connected, and it extends for several or even several tens of kilometers in urban areas, the area of the road surface is normally larger than many clusters of non-road points, which can be removed by setting a coverage area threshold. For the elimination of other large area clusters composed of non-road points, such as parking lots, public parks, etc., we utilized the following facts: (1) The shape of a road is a regular rectangle or curved rectangle, where the length is far greater than the width, as shown in Figure 12a. (2) For a given urban area, most are occupied by residential and commercial buildings, which means the total area of roads is a small fraction of the total area of the given region, as shown in Figure 12b. (3) Clusters of non-road points from patches such as parking lots, public parks, etc., stretch omnidirectionally, so that the length and width of a bounding rectangle for the patch holds the same level of values, as shown in Figure 12c. Zhang et al. [46] pointed out that parameters describing object shape, such as length, width, etc., could effectively detect man-made objects. Therefore, it is possible to filter out large-area non-road point clusters, such as parking lots, by determining the ratio of length to width (aspect ratio) and point coverage of the bounding rectangle of such a cluster. The specific steps are as follows: (1) The Bower-Watson algorithm [47] is applied to construct the Delaunay triangulation of the graph-cut processed initial road points. (2) Traverse the edges of the triangles and remove the edges with lengths greater than T L , which is set to twice the mean average point spacing in the raw point clouds. Figure 13 shows the triangles before and after this step is performed. (3) Remove isolated points, resulting in separated clusters of triangles, as shown in Figure 13b. (4) Calculate the total area of individual clusters of triangles obtained in step 3. Remove those clusters with areas less than a given threshold T S , which is the minimum area of a road in the urban setting. The empirical value of Ts for urban areas is 100 m 2 . The total area of clustered triangles can be calculated as follows: the points constructed from the triangles are projected onto the x-y plane. Their maximum and minimum X, Y values, which are denoted by X max , X min , Y max and Y min , respectively, are obtained. A rectangle can be formed by them. Grid the rectangle with the cell size of 1.5 × average point spacing o f the raw dataset. Count all cells with at least one point inside and then sum up the area of these cells, which is the approximation of the total area of the clustered triangles. We used this approximation method rather than calculating the area of individual triangles, and then summed them up to reduce computational load. (5) For the remaining clusters of triangles, the minimum bounding rectangle (MBR) method proposed in [48] was used to form a bounding rectangle for each cluster. Denote the length, width and area of the rectangle by L MBR , W MBR and S MBR . Calculate aspect ratio lw CDT and point coverage pc CDT as follows: where S CDT is the area of a given triangle cluster calculated in step 4. If lw CDT > T lw and pc CDT > T pc , where T lw , pc CDT are two predefined thresholds, then remove the cluster and all points inside. The values of T lw and T pc are 6 and 0.3, respectively, which are based on prior knowledge acquired by inspecting these parameters in a given urban region.  The results of initial road points followed by the two-step post-processing are illustrated in Figure 14. It is obvious that most of the patches can be removed; however, gaps in a continuous road remain, as shown by the rectangle in Figure 14c,d which will be connected in the third phase of the proposed framework for centerline extraction.

Road Centerlines Extraction
Road centerlines are not only functional for the extraction of a complete road network, but also for quantitative comparison with previous studies. We mainly adopted the approach presented in Hui et al. [21], which is a hierarchical fusion and optimization grouping strategy for road centerline extraction, for this purpose. In this approach, several linear structural elements (SE) of different lengths are used to yield different road extraction outcomes of different levels, and the results of each level are combined and optimized with those of the previous level in a bottom-up manner. However, the method cannot effectively solve the road interruption caused by the occlusion of vehicles, trees or buildings, resulting in poor connectivity of the extracted centerlines; this inconsistent topological relationship makes it difficult to form a complete road network. Therefore, this paper proposes the connection probability of two disconnected road segments as the criteria to determine if the segments should be connected. The main principles of the method are that the same road segment usually has the same width, and most roads are straight. Collinearity and width similarity are defined first to calculate the connection probability of road segments. The calculations of the two parameters are as follows: (1) First, define the connection range. To prevent connecting two lines that are far apart, a radius R is defined to determine the connection range. If the distance between the endpoints of two road segments is greater than R, the connection probability of the two lines sets to 0. The value of R is set to the width of the smallest residential block in the study area, which is 50 m in the testing region. (2) Collinearity (C line ): A road with enough length may be curve in some part, which can change the extension direction of the road at the endpoints. To keep the true extension direction, a segment with length L at the end part of a road is selected to fit a straight line, as shown in Figure 15. A too-small value of L causes a large deviation between the direction of the fitted straight line and the actual direction, while a too-large L value may include the curved part of the road in the straight fitted line. It was set to 20 m in the study, by trial-and-error, which may be set as a constant parameter if similar studies are performed. Then, collinearity is defined by the following formula: where, d 1 and d 2 are the distances from one end of a road to another fitted straight line (see Figure 15). θ is the angle between the two fitted straight lines. L 1 and L 2 are the lengths of the matched lines. A matched line is defined as follows: create the buffer zone of a fitted straight line. Project the road segment surrounded by the buffer zone onto the direction of the fitted straight line, then L i (i = 1, 2) is the length of the projected segment. The buffer width is half the largest road width in the studied area. (3) Width similarity C width is another parameter for calculating connection probability, which is defined by the following: where, w 1 and w 2 are the widths of the two road segments. Max(dw) is the difference between the possible largest and smallest widths of roads in study area. (4) Connection probability of the two road segments can then be calculated by the following formula: where, g 1 and g 2 are the weights for collinearity measure and width similarity, respectively, both of which can be set to 0.5 if the two parameters are of similar importance. If the connection probability is greater than a given threshold T p , then the two road segments are connected.  Figure 16 shows the results of the road centerlines extraction by [21] and by the proposed method, as well as the true centerlines determined by hand editing. Interruptions seen in Figure 16b have all been connected by the proposed method, as shown in Figure 16c.

Evaluation Metrics
Three standard evaluation metrics, completeness (E C ), correctness (E CR ) and quality (Q) [49,50] are adopted to quantitatively assess intermediate and final results: where the true positive (TP) and the false positive (FP) are, respectively, the correctly and wrongly extracted road points, and the false negative (FN) refers to the missed road points. For road centerlines, if the deviation of the extracted centerline from the true one (which is extracted manually) is less than the average point space, then it is a true positive centerline, otherwise, it is a false positive one.

Results
As stated in the flowchart, the first phase of the proposed framework is to distinguish road from non-road points by PointNet++. To verify the effectiveness of PointNet++ for road points labeling, three experiments were conducted: (1) Input the raw point clouds directly to PointNet++ with each point represented by a 16D vector, described in Section 2. Six output classes are set to the PointNet++, namely, roads, public patches, buildings, and low, medium and tall vegetations. The result is denoted by M _class6 . (2) The same input as (1) but only two outputs are set to the classifier: road points and non-road points; the result is denoted by M _class2 . (3) Raw point clouds are firstly filtered by auto-adaptive progressive TIN proposed in [51] to obtain ground points, then input into PointNet++ to distinguish road and non-road points with and without the 9 strip descriptors, respectively; results are denoted by M _class2_from_ground_points and M _without_geometric_features . Figure 17 shows all the experimental results. In addition, the minimal width of roads is set to 2.0 m, through which most cars can pass.
where the true positive (TP) and the false positive (FP) are, respectively, the correctly and wrongly extracted road points, and the false negative (FN) refers to the missed road points. For road centerlines, if the deviation of the extracted centerline from the true one (which is extracted manually) is less than the average point space, then it is a true positive centerline, otherwise, it is a false positive one.

Results
As stated in the flowchart, the first phase of the proposed framework is to distinguish road from non-road points by PointNet++. To verify the effectiveness of PointNet++ for road points labeling, three experiments were conducted: (1) Input the raw point clouds directly to PointNet++ with each point represented by a 16D vector, described in Section 2. Six output classes are set to the PointNet++, namely, roads, public patches, buildings, and low, medium and tall vegetations. The result is denoted by M_class6.
(2) The same input as (1) but only two outputs are set to the classifier: road points and non-road points; the result is denoted by M_class2.
(3) Raw point clouds are firstly filtered by auto-adaptive progressive TIN proposed in [51] to obtain ground points, then input into PointNet++ to distinguish road and nonroad points with and without the 9 strip descriptors, respectively; results are denoted by M_class2_from_ground_points and M_without_geometric_features. Figure 17 shows all the experimental results. In addition, the minimal width of roads is set to 2.0 m, through which most cars can pass.  Table 2 shows the quantitative assessment of the results from PointNet++ with different inputs, as well as the results after two-step post-processing. It can be seen from  Table 2 shows the quantitative assessment of the results from PointNet++ with different inputs, as well as the results after two-step post-processing. It can be seen from Table 2 that the PointNet++ classification result with the input of ground points obtained by autoadaptive progressive TIN filtering the raw data, and with the 16D vector representation of the input points, achieved the highest accuracy, both in terms of completeness and quality. Correctness is the lowest, but it is still at the same level as M _class6 , M _class2 and M _class_without_geometric_features . It is obvious from Figure 17b-e that there are many patches of road points that actually belong to parking lots, public parks, etc. Furthermore, pepperand-salt noises are common. Therefore, two-step post-processing aiming to improve the accuracy of the initial road points is necessary; the initial road points with the highest road integrity M _class2_from_ground_points were selected for the two-step post-processing.
Results of two-step post-processing are shown in Figure 18, where Figure 18a illustrates isolation points elimination by graph-cut, and Figure 18b displays the final result after clustered non-road points removal by CTINs. The last two rows in Table 2 show that the improvement obtained by the three evaluation metrics is slight after the first postprocessing step, but that they are improved significantly after the second post-processing step, indicating that, in general, patches such as parking lots, public parks, etc., affect the road extraction more than other factors such as vegetations, vehicles, and buildings on or along a road.  Centerlines that are extracted in the final stage of the proposed framework not only serve for the generation of the road network, but also allow for further comparison with previous studies, as shown in Figure 19 and Table 3. The completeness, accuracy and quality achieved by the proposed approach are 97.0%, 86.3% and 84.1%, respectively. They are much higher than others, which the authors attribute to the following three reasons: (1) The deep features learned by PointNet++ and the geometric features generated in the first phase of the framework jointly improved the completeness of the extracted road points. (2) Two-step post-processing is applied to decrease both omission and commission errors in the initial road points distinguished by PointNet++. (3) Collinearity and width similarity are introduced to calculate the connection probability to further improve the completeness of extracted roads. Table 3. Three evaluation metrics comparing the proposed method to previous studies. The indicators of PCD [14] are provided by [42].

Discussion
Though the experimental results show that the proposed framework outperformed the other methods compared, a significant point that is worthy of further study is that the framework adopted many features other than 3D coordinate values of point clouds to the deep learning model, such as intensity, DN values from optical images, point density and strip descriptors. This motivated us to investigate more specific features describing the characteristics of roads and more intermediate features generated in the learning stage, so that higher quality road footprints could be extracted in the first phase of the framework. The quality of the final road network is mainly determined by the completeness of the extracted road centerlines. A reasonable connection between the endpoint of a road and a road intersection can improve the completeness of the centerlines, but how to measure the reasonability remains a challenge.
KNN search and fixed radius searches were used in the experiment when features were generated. Though they performed well, both of them possess drawbacks: (1) When the KNN search is applied to filter point clouds where only ground points (including road footprints) remained, the distance between a pair of points in the search region may be too large, resulting in useless features. That is, though the features can be generated, they are insignificant to the classifier or even decrease the classification accuracy. (2) When a fixed radius search is applied, some points may lack enough neighbor points for feature calculation, causing some feature loss.
In the selection of different deep learning models for the first step of the proposed framework, we found that the recently proposed model RandLA-Net takes similar training time as PointNet++, but, as for the efficiency of the application of the learned model to testing data, the former is much higher than the latter, though the classification accuracy of PointNet++ is slightly higher than RandLA-Net. The higher efficiency of RandLA-Net can be attributed to its random sampling mechanism. Considering that deep learning classification was the first step in our approach, which was followed by refined steps and the slightly higher classification achieved by PointNet++ in the experiment, we selected PointNet++ as the deep learning classifier in the proposed framework. However, it is worth noting that RandLA-Net can be a better choice in terms of its high efficiency in the model application. Further studies should be conducted on this issue.
Another observation that is interesting enough to study is the relationship between point density and road detectability. A road that can be detected and extracted from point clouds is strongly correlated to the point density of the dataset. Narrower roads or stripshaped objects can be distinguished from higher density point clouds. To analyze the influence of point density on the extracted road width, the raw point clouds are resampled to certain average point spacing to form 3 datasets with a point density of 4.0 points/m 2 , 1.0 points/m 2 , and 0.25 points/m 2 , respectively. The proposed framework was applied to extract road points from these datasets. Figure 20 and Table 4 show the results. It is worth noting that, when the point density was greater than 1.0 points/m 2 , meaning there were at least three points on the cross section of a road with a width of 3 m, the reduction in the point density displayed little influence on the extraction of the minimum width road. However, when the point density continued decreasing, such as when the point density was 0.25 points/m 2 , many narrow roads were not detected and extracted, as shown by the rectangle in Figure 20c.

Conclusions
A novel framework for extracting urban roads from LiDAR point clouds was proposed. Firstly, the recently developed deep learning model, PointNet++, was used to directly distinguish road points from non-road points, where two geometric features describing a strip-like road were generated both from point clouds and optical images, and input to PointNet++ with other features of the raw point clouds. Then, a two-step post-processing algorithm composed of graph-cut smoothing and CTINs-based clustered non-road points removal was performed to refine the initially-labeled road points. Finally, a connection probability based on centerline tracking was proposed to generate road network. The effectiveness of the proposed framework was verified by the experimental results using the Vaihingen dataset. Quantitative evaluations show that the proposed method outperformed others in terms of higher evaluation metrics and narrow street detection.
Although the proposed method was tested here with the Vaihingen dataset for the purpose of comparison with other methods, where point density was about 4.0 points/m 2 , it should be applicable to higher point density datasets, as new generation airborne LiDAR, such as Geiger-mode LiDAR, single-photon LiDAR, etc., are available for acquiring point clouds with a density over 100 points/m 2 . More studies should be focused on the applicability of the proposed framework and strip-like descriptors characterizing road geometric properties in very high-density point clouds, which are the future research topics of the authors' group.