In a real scene, the points returned by the LIDAR are never perfect. The difficulties in handling LIDAR points lie in scan point sparsity, missing points, and unorganized patterns. The surrounding environment also adds more challenges to the perception as the surfaces may be arbitrary and erratic. Sometimes it is even difficult for human beings to perceive useful information from a visualization of the scan points.

#### Segmentation Algorithms

Perceiving 3D point cloud information normally involves two steps: segmentation and classification. Some approaches include a third step, time integration, to improve accuracy and consistency. Segmentation of a point cloud is the process of clustering points into multiple homogeneous groups, while classification identifies the class of each segmented cluster, e.g., bike, car, pedestrian, road surface, etc.

As summarized in the survey paper [30], the algorithms for 3D point cloud segmentation can be divided into five categories: edge based, region based, attribute based, model based, and graph based. In this section, we provide supplementary reviews of recent developments in this field. As a result, a new category is identified, which is based on deep learning algorithms.

Edge based methods are mainly used for particular tasks in which the object must have strong artificial edge features, such as road curb detection [31,32]. However, this approach is not useful for natural scene detection and is susceptible to noise. To improve the robustness, in [33], the elevation gradients of principal points are computed, and a gradient filter is applied to filter out points with fluctuations.

Region based methods make use of region growing mechanisms to cluster neighboring points based on certain criteria, e.g., Euclidean distance [34,35] or surface normals [36]. In most cases, the process starts by generating some seed points and then grows regions from those points according to predefined criteria. Compared with the edge based methods, this approach is more general and practical. It also avoids the local view problem, as it takes neighborhood information into account. In [37], a scan-line based algorithm was proposed to identify the local lowest points, and those points were taken as the seeds to grow into ground segments based on slope and elevation. A feature based on the normal vector and flatness of a point neighborhood was developed in [38] to grow the regions in trees and non-planar areas. To make the growing process more robust, a self-adaptive Euclidean clustering algorithm was proposed in [34]. In [39], a new attribute “unevenness,” which was derived from the difference between the ranges of successive scanning rings from each laser beam, was proposed as the growing criterion. As claimed in [40,41,42], it was more capable of detecting small obstacles and less sensitive to the presence of ground slopes, vehicle pitch, and roll.
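
The basic growing loop with a Euclidean distance criterion can be sketched as follows. This is a minimal illustration and not the implementation of any cited work: the function name `euclidean_region_growing`, the fixed `radius` threshold, and the brute-force neighbor search are assumptions made for brevity.

```python
from collections import deque
import math


def euclidean_region_growing(points, radius):
    """Cluster points by growing regions outward from seed points.

    A point joins a region when it lies within `radius` of any point
    already in that region. `radius` is a hypothetical fixed threshold.
    """
    labels = [-1] * len(points)  # -1 means "not yet assigned"
    current = 0
    for seed in range(len(points)):
        if labels[seed] != -1:
            continue  # already absorbed by an earlier region
        labels[seed] = current
        queue = deque([seed])
        while queue:
            i = queue.popleft()
            # Brute-force neighbor search; a k-d tree would normally replace this.
            for j in range(len(points)):
                if labels[j] == -1 and math.dist(points[i], points[j]) <= radius:
                    labels[j] = current
                    queue.append(j)
        current += 1
    return labels


points = [(0.0, 0.0, 0.0), (0.3, 0.0, 0.0), (0.6, 0.1, 0.0),
          (5.0, 5.0, 0.0), (5.2, 5.1, 0.0)]
labels = euclidean_region_growing(points, radius=0.5)
print(labels)
```

Here every unassigned point is tried as a seed in index order, which is the simplest possible seed policy; the works discussed next replace it with more careful seed selection.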

Some researchers have also looked into generating the seed points more effectively by taking additional heuristics into account, so that the region growing process becomes more effective and robust. In [43], Vieira et al. first removed points at sharp edges based on curvatures before selecting the seed points, since good seed points are typically found in the interior of a region rather than at its boundaries. In [44], the normal of each point was first estimated, and the point with the minimum residual was then selected as the initial seed point, while in [45], the local plane, instead of the normal, at each point was extracted and a corresponding score was computed, followed by the selection of seed planes based on the score. A multi-stage seed generation process was proposed in [28]. Non-empty voxels were grouped into segments based on proximity, and these segments served as the seeds for the next segmentation process, which made use of the coherence and proximity of the coplanar points. Finally, the neighboring coplanar point segments were merged based on plane connection and intersection.

The region based segmentation methods have been implemented widely in the literature. However, as pointed out in [29,30,46,47], the segmentation results depend too heavily on the selection of the seed points. Poorly selected points may result in inadequate and inefficient segmentations, and different choices of seed points usually lead to different segmentations [25]. Additionally, all of the region based methods require extensive computational resources, taxing both time and memory [29,48].

Model based methods, also known as parametric methods, first fit the points to predefined models. These models, such as planes, spheres, cones, and cylinders, can normally be expressed effectively and compactly in a closed mathematical form. The inliers to a particular model are clustered as one segment. Most of the model based methods are designed to segment the ground plane. The two most widely implemented model fitting algorithms in the literature are RANSAC (Random Sample Consensus) and HT (Hough Transform). Therefore, the model based methods share the same pros and cons as these two algorithms.

In [24,27,32,49,50], the authors implemented the RANSAC algorithm to segment the ground plane in the point cloud under the assumption of a flat surface. However, as mentioned in [23,51], this model fitting method is not adequate for non-planar surfaces, such as undulating roads, uphill and downhill sections, and humps.
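
A minimal sketch of RANSAC ground plane fitting under the flat-surface assumption is given below. It is illustrative only: the names `plane_through` and `ransac_plane`, the inlier `threshold`, the iteration count, and the synthetic points are all assumptions, not values from the cited works.

```python
import random


def plane_through(p, q, r):
    """Plane (a, b, c, d) with unit normal through three points, or None if collinear."""
    ux, uy, uz = q[0] - p[0], q[1] - p[1], q[2] - p[2]
    vx, vy, vz = r[0] - p[0], r[1] - p[1], r[2] - p[2]
    # The normal is the cross product of the two edge vectors.
    a, b, c = uy * vz - uz * vy, uz * vx - ux * vz, ux * vy - uy * vx
    norm = (a * a + b * b + c * c) ** 0.5
    if norm < 1e-12:
        return None
    a, b, c = a / norm, b / norm, c / norm
    return a, b, c, -(a * p[0] + b * p[1] + c * p[2])


def ransac_plane(points, threshold, iterations=200, seed=0):
    """Return the plane with the most inliers and the indices of those inliers."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    best_plane, best_inliers = None, []
    for _ in range(iterations):
        plane = plane_through(*rng.sample(points, 3))
        if plane is None:
            continue  # degenerate (collinear) sample
        a, b, c, d = plane
        inliers = [i for i, (x, y, z) in enumerate(points)
                   if abs(a * x + b * y + c * z + d) <= threshold]
        if len(inliers) > len(best_inliers):
            best_plane, best_inliers = plane, inliers
    return best_plane, best_inliers


# Synthetic scene: a flat 5 x 5 ground patch plus a few elevated obstacle points.
ground = [(float(x), float(y), 0.0) for x in range(5) for y in range(5)]
obstacles = [(1.0, 1.0, 1.0), (1.0, 2.0, 1.5), (2.0, 1.0, 2.0),
             (3.0, 3.0, 1.2), (0.5, 2.5, 1.8)]
points = ground + obstacles
plane, inliers = ransac_plane(points, threshold=0.05)
print(plane, len(inliers))
```

Because a single plane is fitted to the whole cloud, the same sketch would reject large parts of a sloped or undulating road as outliers, which is the limitation noted above.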

To mitigate these defects, Oniga et al. [52] used RANSAC to fit a quadratic surface instead of a plane. A region growing process was then designed to refine the quadratic surface. Asvadi et al. [51] divided the space in front of the vehicle into several equidistant (5 m) strips and fitted one plane to each strip by least squares fitting. In [23], a piecewise ground surface estimation was proposed, which consists of four steps: slicing, gating, plane fitting, and validation. The slicing step slices the space in front of the vehicle into regions with an approximately equal number of LIDAR points, whereas the gating step rejects outliers in each region based on the interquartile range method. RANSAC plane fitting is then applied to each sliced region to find all the piecewise planes, and a final validation step is carried out by examining the normal and height differences between consecutive planes.
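
The slicing and gating steps can be sketched as below; plane fitting and validation are omitted. This is a rough illustration under stated assumptions rather than the procedure of [23]: slicing is done along the forward axis `x`, gating is applied to the height `z`, and the conventional factor of 1.5 times the interquartile range is used. The function names are hypothetical.

```python
import statistics


def slice_equal_count(points, n_slices):
    """Sort points by forward distance x and split them into near-equal-size slices."""
    ordered = sorted(points, key=lambda p: p[0])
    base, extra = divmod(len(ordered), n_slices)
    slices, start = [], 0
    for k in range(n_slices):
        size = base + (1 if k < extra else 0)  # spread the remainder over the first slices
        slices.append(ordered[start:start + size])
        start += size
    return slices


def iqr_gate(region, factor=1.5):
    """Keep points whose height lies within [Q1 - factor*IQR, Q3 + factor*IQR]."""
    heights = [p[2] for p in region]
    q1, _, q3 = statistics.quantiles(heights, n=4)
    iqr = q3 - q1
    low, high = q1 - factor * iqr, q3 + factor * iqr
    return [p for p in region if low <= p[2] <= high]


# Synthetic returns: nearly flat ground with one elevated obstacle point at x = 4.
points = [(float(i), 0.0, 0.01 * (i % 10)) for i in range(20)]
points[4] = (4.0, 0.0, 2.0)

slices = slice_equal_count(points, n_slices=2)
gated = [iqr_gate(region) for region in slices]
print([len(s) for s in slices], [len(g) for g in gated])
```

Each gated region would then be passed to a plane fitting routine, and consecutive planes compared by normal and height.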

HT model fitting methods can also be found in the literature for fitting different models, e.g., planes, cylinders, and spheres. In [53,54], the 3D HT was applied to points and normal vectors to identify planar structures in the point clouds, whereas in [55], the authors proposed a sequential HT algorithm to detect cylinders in the point cloud. This sequential approach reduced the time and space complexity compared with the conventional approach, which requires a 5-D Hough space.

As elaborated above, the model based methods are well established in the literature for planar surface extraction. Normally, these methods are used as a primary step in segmentation to remove the ground plane, while other methods, e.g., region growing, are then applied to cluster the remaining points. However, the major disadvantage of model based methods is that they do not take neighborhood and context information into account, and thus they may force random points into a particular model. Furthermore, the segmentation is sensitive to the point cloud density, position accuracy, and noise [29].

Attribute based methods normally take a two-step approach, where the first step is to compute the attributes for each point, and the second step is to cluster the points based on the associated attributes. As mentioned in [30], this set of methods allows more cues to be incorporated into the formulation on top of spatial information. However, the success of the segmentation also depends strongly on the derived hidden attributes.

Besides the works reviewed in [30], the attribute based algorithm proposed in [56] was shown to be capable of segmenting pole-like objects, which is considered challenging because of their thin shape. In this algorithm, the optimal neighborhood size of each point was first calculated. The geometric features, which take the neighboring information into account, were derived based on PCA (Principal Component Analysis). Each point was then assigned one of three types of attributes (linear, planar, and spherical) using LIBSVM [57], with the geometric features as input. Finally, segmentation rules were designed to cluster the points based on their associated attributes.

The other group of methods widely used in the literature is graph based methods. These methods cast the point cloud into a graph structure, with each point as a vertex (node) and the connections between neighboring points as graph edges. The graph based method has demonstrated its strength in image semantic segmentation, as it is able to incorporate local and global cues, neighborhood information, context, smoothness, and other customized features into its formulation and to optimize the segmentation globally across the entire image.

Following the graph cut methods in image segmentation, graph based methods in the context of point clouds typically take the form of a CRF (Conditional Random Field [58]) or an MRF (Markov Random Field), and the optimization is normally carried out through the min-cut/max-flow algorithm or its variations.
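
For reference, a generic pairwise energy of the kind minimized in such formulations can be written as follows. The notation is an assumption introduced here for illustration: $x_i$ is the label of point $i$, $\mathcal{V}$ the set of graph nodes, $\mathcal{E}$ the set of graph edges, $\psi_i$ a unary (data) potential, and $\psi_{ij}$ a pairwise (smoothness) potential.

$$
E(\mathbf{x}) = \sum_{i \in \mathcal{V}} \psi_i(x_i) + \sum_{(i,j) \in \mathcal{E}} \psi_{ij}(x_i, x_j)
$$

In a CRF, both potentials are additionally conditioned on the observed data. When the labels are binary and the pairwise terms are submodular, the minimizer of this energy can be found exactly with a single min-cut/max-flow computation.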

In [59,60], the authors first created a k-nearest neighbors graph, weighted each node according to a background penalty function, added hard foreground constraints, and solved the foreground and background segmentation through min-cut. Moosmann et al. [25] used a graph based method to segment ground and objects using a unified and generic criterion based on local convexity measures.
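
The k-nearest neighbors plus min-cut idea can be sketched as follows. This is a simplified illustration, not the method of [59,60]: the constant `background_penalty`, the Gaussian edge weight with `sigma`, the single hard foreground seed, and the plain Edmonds-Karp max-flow are all assumptions chosen to keep the example short.

```python
from collections import deque
import math


def knn_edges(points, k):
    """Undirected k-nearest-neighbor edges as (i, j) pairs with i < j."""
    edges = set()
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: math.dist(p, points[j]))
        for j in others[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges


def min_cut_source_side(capacity, source, sink):
    """Run Edmonds-Karp max-flow and return the nodes on the source side of the cut."""
    residual = {u: dict(nbrs) for u, nbrs in capacity.items()}
    while True:
        parent, queue = {source: None}, deque([source])
        while queue:  # breadth-first search for an augmenting path
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 1e-12 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return set(parent)  # no augmenting path left: reachable set is the source side
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        flow = min(residual[u][v] for u, v in path)
        for u, v in path:
            residual[u][v] -= flow
            residual[v][u] = residual[v].get(u, 0.0) + flow


def segment_foreground(points, seed_index, k=2, background_penalty=0.5, sigma=1.0):
    """Label points as foreground via min-cut on a k-NN graph."""
    src, snk = "fg", "bg"
    cap = {u: {} for u in list(range(len(points))) + [src, snk]}

    def add(u, v, c):
        cap[u][v] = cap[u].get(v, 0.0) + c
        cap[v].setdefault(u, 0.0)  # make sure the reverse residual edge exists

    for i, j in knn_edges(points, k):
        w = math.exp(-(math.dist(points[i], points[j]) / sigma) ** 2)
        add(i, j, w)
        add(j, i, w)
    for i in range(len(points)):
        add(i, snk, background_penalty)  # constant pull toward the background
    add(src, seed_index, float("inf"))   # hard foreground constraint on the seed
    reachable = min_cut_source_side(cap, src, snk)
    return sorted(i for i in reachable if isinstance(i, int))


points = [(0.0, 0.0, 0.0), (0.2, 0.0, 0.0), (0.4, 0.0, 0.0), (0.6, 0.0, 0.0),
          (5.0, 0.0, 0.0), (5.2, 0.0, 0.0), (5.4, 0.0, 0.0), (5.6, 0.0, 0.0)]
foreground = segment_foreground(points, seed_index=0)
print(foreground)
```

In this toy scene the seed pulls its tightly connected neighbors into the foreground because cutting their strong edges costs more than paying their background penalties.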

As will be shown later, the graph based methods have also been implemented as pipelines for sensor fusion between LIDAR and vision. Compared to other methods, graph based ones are more robust in dealing with complex scenes because of the global formulation described above. The major issue with these methods is that they normally take more time to compute, especially in the optimization step.

With the recent development of machine learning algorithms in computer vision, some researchers have also looked into applying machine learning architectures, which are normally applied to 2D images, to 3D point clouds for segmentation and detection. A commonly used dataset was proposed in [61], which contains a colored 3D point cloud of several Haussmannian style facades.

In [62], the author implemented a Random Forest classifier to classify each point into one semantic class. The classifier was trained on light-weight 3D features. Afterwards, individual facades were separated by detecting differences in the semantic structure. To improve memory efficiency and segmentation accuracy, Riegler et al. [63] proposed an Octree Network based on 3D convolution. It exploits the sparsity of the point cloud to focus memory allocation and computation where they are needed, enabling a deeper network without compromising resolution.

This set of algorithms has been developed only recently and thus still has some crucial practical issues that make real time operation difficult to achieve. However, these algorithms do provide new insights into the point cloud segmentation problem. As will be shown in the discussion of detection algorithms, they can provide a unified pipeline that combines the segmentation and detection processes.