Review: deep learning on 3D point clouds

A point cloud is a set of points defined in a 3D metric space. Point clouds have become one of the most significant data formats for 3D representation, and they are gaining popularity as a result of the increased availability of acquisition devices such as LiDAR, as well as increased application in areas such as robotics, autonomous driving, and augmented and virtual reality. Deep learning is now the most powerful tool for data processing in computer vision and has become the preferred technique for tasks such as classification, segmentation, and detection. While deep learning techniques are mainly applied to data on a structured grid, point clouds are unstructured, which makes applying deep learning to them directly very challenging. Earlier approaches overcome this challenge by preprocessing the point cloud into a structured grid format, at the cost of increased computation or loss of depth information. Recently, however, many state-of-the-art deep learning techniques that operate directly on point clouds have been developed. This paper surveys recent state-of-the-art deep learning techniques that focus mainly on point cloud data. We first briefly discuss the major challenges faced when using deep learning directly on point clouds, and the earlier approaches that overcome these challenges by preprocessing the point cloud into a structured grid. We then review the various state-of-the-art deep learning approaches that directly process point clouds in their unstructured form. We introduce the popular 3D point cloud benchmark datasets, and we further discuss the application of deep learning to popular 3D vision tasks including classification, segmentation and detection.


Introduction
We live in a three-dimensional world, but since the invention of the camera in 1888, visual information about the 3D world has been projected onto 2D images. 2D images, however, lose depth information and the relative positions of objects in the real world, which makes them less suitable for applications that require depth and positioning information, such as robotics, autonomous driving, virtual reality and augmented reality. To capture the 3D world with depth information, the early convention was to use stereo vision, where two or more calibrated digital cameras are used to extract the 3D information. A point cloud is a data structure that is often used to represent 3D geometry, making it the immediate representation of the 3D information extracted from stereo vision cameras as well as of the depth map produced by RGB-D sensors. Recently, 3D point clouds have been booming as a result of the increasing availability of sensing devices such as LiDAR and, more recently, mobile phones with time-of-flight (ToF) depth cameras, which allow easy acquisition of the 3D world as a 3D point cloud.
A point cloud is simply a set of data points in a space. The point cloud of a scene is the set of 3D points sampled around the surfaces of the objects in the scene. In its simplest form, a 3D point cloud is represented by the XYZ coordinates of the points; however, additional features such as surface normals and RGB values can also be used. The point cloud is a very convenient format for representing the 3D world, and it has a range of applications in areas such as robotics, autonomous vehicles, augmented and virtual reality, and other industrial purposes like manufacturing and building rendering.
In the past few years, processing of point clouds for visual intelligence has been based on handcrafted features [1,2,3,4,5,6]. A review of handcrafted feature learning techniques is conducted in [7]. Handcrafted features do not require large training data and were seldom used, as there was not enough point cloud data and deep learning was not popular. However, with the increasing availability of acquisition devices, point cloud data is now readily available, making the use of deep learning for its processing feasible. Even so, applying deep learning to point clouds is not easy because of the nature of the data. In this paper, we review the challenges that point clouds pose for deep learning, the early approaches devised to overcome these challenges, and the recent state-of-the-art approaches that operate directly on point clouds, focusing more on the latter. This paper is intended to serve as a guide for new researchers in the field of deep learning on point clouds, as it presents the recent state-of-the-art approaches.
The rest of the paper is organized as follows. Section 2 discusses the challenges of point clouds that make the application of deep learning difficult. Section 3 reviews the methods that overcome these challenges by converting the point cloud into a structured grid. Section 4 contains an in-depth review of the various deep learning methods that process point clouds directly. Section 5 presents 3D point cloud benchmark datasets. Section 6 discusses the application of the various approaches to 3D vision tasks. We summarize and conclude the paper in section 7.

Challenges of deep learning on point clouds
Applying deep learning to 3D point cloud data comes with many challenges. These include occlusion, caused by cluttered scenes or blind sides; noise and outliers, which are unintended points; and point misalignment. However, the more pronounced challenges for deep learning on point clouds can be categorized as follows:
Irregularity: Point cloud data is irregular, meaning that the points are not evenly sampled across the different regions of an object or scene, so some regions have dense points while others have sparse points. This can be seen in figure 1a.
Unstructured: Point cloud data is not on a regular grid. Each point is scanned independently and its distance to neighboring points is not always fixed. In contrast, pixels in images are represented on a two-dimensional grid, and the spacing between two adjacent pixels is always fixed.
Unorderedness: The point cloud of a scene is the set of points (usually represented by XYZ) obtained around the objects in the scene, usually stored as a list in a file. As a set, the order in which the points are stored does not change the scene represented. For illustration, we show the unordered nature of point sets in figure 1c.
These properties of point clouds are very challenging for deep learning, especially for convolutional neural networks (CNNs). This is because convolutional neural networks are based on the convolution operation, which is performed on data that is ordered, regular and on a structured grid. Early approaches overcome these challenges by converting the point cloud into a structured grid format (section 3). Recently, however, researchers have been developing approaches that apply deep learning directly to raw point clouds (section 4), doing away with the need for conversion to a structured grid.

Structured grid based learning
Deep learning, specifically the convolutional neural network, is successful because of the convolution operation. The convolution operation is used for feature learning, doing away with handcrafted features. Figure 2 shows a typical convolution operation on a 2D grid. The convolution operation requires a structured grid. Point cloud data, on the other hand, is unstructured, which is a challenge for deep learning; to overcome it, many approaches convert the point cloud data into a structured form. These approaches can be broadly divided into two categories, voxel based and multiview based. In this section, we review some of the state-of-the-art methods in both categories, along with their advantages and drawbacks.

Voxel based
The convolution operation on 2D images uses a 2D filter of size ẋ × ẏ to convolve a 2D input represented as a matrix of size Ẋ × Ẏ, with ẋ ≤ Ẋ and ẏ ≤ Ẏ. Voxel based methods [8,9,10,11,12] use a similar approach: they convert the point cloud into a 3D voxel structure of size X × Y × Z and convolve it with 3D kernels of size x × y × z, with x ≤ X, y ≤ Y and z ≤ Z. Two important operations take place in these methods, the offline (preprocessing) operation and the online (learning) operation. The offline operation converts the point cloud into fixed-size voxels, as shown in figure 3. Binary voxels [13] are often used to represent the occupancy. In [11], a normal vector is added to each voxel to improve discrimination capability. The online operation is the learning stage, in which a deep convolutional neural network is designed, usually from a number of 3D convolutional, pooling, and fully connected layers.
[13] represented 3D shapes as a probability distribution of binary variables on a 3D voxel grid and was the first work to use 3D deep convolutional neural networks. The input to the network (a point cloud, CAD model or RGB-D image) is converted into a 3D binary voxel grid and processed using a convolutional deep belief network [14]. [8] uses a 3D CNN for landing zone detection for unmanned rotorcraft. LiDAR on the rotorcraft is used to obtain a point cloud of the landing site, which is then voxelized into 3D volumes, and a 3D CNN binary classifier is applied to classify the landing site as safe or otherwise.
In [9], a 3D convolutional neural network is proposed for object recognition. As in [13], the input to the network in [9] is converted into a 3D binary occupancy grid before 3D convolution operations are applied to generate a feature vector, which is passed through fully connected layers to obtain class scores. Two voxel based models were proposed in [10]. The first model addresses overfitting by using auxiliary training tasks to predict the object from partial subvolumes, and the second model mimics multiview CNNs by convolving the 3D shapes with an anisotropic probing kernel.
Figure 1: Challenges of point cloud data. (a) Irregular: sparse and dense regions. (b) Unstructured: no grid; each point is independent and the distance between neighboring points is not fixed. (c) Unordered: as a set, a point cloud is invariant to permutation.
Although voxel based methods have shown good performance, they suffer from high memory consumption due to the sparsity of the voxels (figure 3), which results in wasted computation when convolving over the non-occupied regions. The memory consumption also limits the voxel resolution, usually to between 32³ and 64³. These drawbacks are in addition to the artifacts introduced by the voxelization operation.
To overcome the challenges of voxelization, [15,16] proposed adaptive representations. These representations are much more complex than regular 3D voxels; however, they are still limited to only 256³ voxels.
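The offline voxelization step and the sparsity it produces can be illustrated with a minimal sketch. This is not the preprocessing of any specific cited method: the function name `voxelize`, the cubic grid, and the synthetic sphere used as input are assumptions made for illustration.

```python
import numpy as np

def voxelize(points, resolution=32):
    """Convert an (N, 3) point cloud into a binary occupancy grid of shape (R, R, R)."""
    points = np.asarray(points, dtype=np.float64)
    mins = points.min(axis=0)
    extent = (points.max(axis=0) - mins).max()
    extent = extent if extent > 0 else 1.0  # guard against a degenerate cloud
    # Scale coordinates into [0, resolution]; points on the upper boundary
    # are clipped into the last cell.
    idx = np.floor((points - mins) / extent * resolution).astype(int)
    idx = np.clip(idx, 0, resolution - 1)
    grid = np.zeros((resolution,) * 3, dtype=bool)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True  # a voxel is "on" if any point falls in it
    return grid

rng = np.random.default_rng(0)
# Hypothetical test shape: 2048 points sampled on the surface of a unit sphere.
pts = rng.normal(size=(2048, 3))
pts /= np.linalg.norm(pts, axis=1, keepdims=True)

grid = voxelize(pts, resolution=32)
occupancy = grid.mean()
print(grid.shape, f"occupied fraction: {occupancy:.3f}")
```

Because the points lie on a surface, only a small fraction of the 32³ cells is occupied, while a dense 3D convolution still visits every cell. This is the wasted computation referred to above.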

Multiview based
These methods [17,18,19,10,20,21,22] carry the already mature 2D CNNs over to 3D. Because images are representations of the 3D world squashed onto a 2D grid by a camera, methods in this category follow the same idea: they convert the point cloud data into a collection of 2D images and apply existing 2D CNN techniques to them (see figure 4). Compared to their volumetric counterparts, multiview based methods have better performance, as the multiview images contain richer information than 3D voxels even though the latter contain depth information. [17] is the first work in this direction; it aimed to bypass the need for 3D descriptors for recognition and achieved state-of-the-art accuracy. [18] proposed a stacked local convolutional autoencoder (SLCAE) for 3D object retrieval. [10] introduced multi-resolution filtering, which captures information at multiple scales, and in addition used data augmentation to improve on [17].
Multiview based networks perform better than voxel based methods for two reasons: 1) they use already well researched 2D techniques, and 2) they can contain richer information, as they do not have the quantization artifacts of voxelization.
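As a rough illustration of turning a point cloud into a collection of 2D images, the sketch below produces orthographic depth images from several viewpoints around the vertical axis. It is an assumption-laden simplification: the cited methods typically render shaded views of a surface rather than raw depth maps, and the names `rotation_about_y` and `depth_image`, the number of views, and the image size are all hypothetical.

```python
import numpy as np

def rotation_about_y(angle):
    """3x3 rotation matrix about the y (vertical) axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def depth_image(points, size=64):
    """Orthographic projection along z; each pixel keeps the nearest (smallest) depth."""
    mins = points.min(axis=0)
    extent = max((points.max(axis=0) - mins).max(), 1e-12)
    p = (points - mins) / extent                       # normalize into the unit cube
    cols = np.clip((p[:, 0] * size).astype(int), 0, size - 1)
    rows = np.clip((p[:, 1] * size).astype(int), 0, size - 1)
    img = np.full((size, size), np.inf)
    np.minimum.at(img, (rows, cols), p[:, 2])          # unbuffered per-pixel minimum
    hit = np.isfinite(img)                             # pixels covered by at least one point
    img[~hit] = 0.0                                    # background value
    return img, hit

rng = np.random.default_rng(0)
cloud = rng.normal(size=(4096, 3))                     # hypothetical input cloud

angles = np.linspace(0.0, 2.0 * np.pi, 12, endpoint=False)
views = [depth_image(cloud @ rotation_about_y(a).T) for a in angles]
images = np.stack([img for img, _ in views])           # (12, 64, 64), ready for a 2D CNN
masks = np.stack([hit for _, hit in views])
print(images.shape)
```

Each of the resulting images could be passed through a shared 2D CNN, with the per-view features pooled into one descriptor; that pooling stage is not shown here.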

Higher dimensional lattices
There are other deep learning methods for point cloud processing that convert the point cloud into a higher dimensional regular lattice. SplatNet [23] processes the point cloud directly; however, its primary feature learning operation occurs in the bilateral convolutional layer (BCL). The BCL converts the features of the unordered points into a six-dimensional (6D) permutohedral lattice and convolves it with a kernel on a similar lattice. SFCNN [24] uses a fractalized regular icosahedral lattice to map points onto a discretized sphere and defines a multi-scale convolution operation on the regular spherical lattice.

Deep learning directly on raw point cloud
Deep learning on raw point clouds has been receiving a lot of attention since PointNet [25] was released in 2017. Many state-of-the-art methods have been developed since then. These techniques process the point cloud directly despite the challenges described in section 2. In this section, we review the state-of-the-art techniques that work in this direction. We begin with PointNet, which is the bedrock for most of the techniques. Other techniques improve on PointNet by modeling local region structure.

PointNet
Convolutional neural networks are largely successful because of the convolution operation, which enables learning on local regions in a hierarchical manner as the network gets deeper. Convolution, however, requires a structured grid, which is lacking in point cloud data. PointNet [25] is the first method that applies deep learning to unstructured point clouds, and it is the basis on which most other techniques are built. In this subsection we give a review of PointNet.
The architecture of PointNet is shown in figure 5. The input to PointNet is a raw point cloud P ∈ R^{N×D}, where N is the number of points in the point cloud and D the dimension, usually D = 3 for the XYZ values of each point, although additional features can be used. Because points are unordered, PointNet is built from symmetric functions, that is, functions whose output is the same irrespective of the input order. PointNet rests on two basic components: a multilayer perceptron (MLP) with learnable parameters, and a max pooling function. The MLPs are feature transformations that map the feature dimension of the points from D = 3 to a D = 1024 dimensional space, and their parameters are shared by all the points in each layer. To aggregate the global feature, the max pooling symmetric function is employed to produce one global 1024-dimensional feature vector. This feature vector is the feature descriptor of the input and can be used for recognition and segmentation tasks.
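The combination of a shared MLP and max pooling can be sketched in a few lines of NumPy to show why the resulting global feature does not depend on point order. This is a minimal illustration with randomly initialized, untrained weights; the layer widths (3, 64, 128, 1024) and the helper names `init_mlp`, `shared_mlp` and `global_feature` are assumptions, and the input and feature transform sub-networks of the full architecture are omitted.

```python
import numpy as np

def init_mlp(dims, rng):
    """Random (untrained) weights and zero biases for an MLP with the given layer widths."""
    return [(rng.normal(scale=np.sqrt(2.0 / d_in), size=(d_in, d_out)), np.zeros(d_out))
            for d_in, d_out in zip(dims[:-1], dims[1:])]

def shared_mlp(x, params):
    """Apply the same MLP to every row (point) of x: (N, D_in) -> (N, D_out)."""
    for w, b in params:
        x = np.maximum(x @ w + b, 0.0)  # linear layer followed by ReLU
    return x

def global_feature(points, params):
    """Per-point features followed by max pooling over the point axis."""
    return shared_mlp(points, params).max(axis=0)

rng = np.random.default_rng(0)
params = init_mlp([3, 64, 128, 1024], rng)

cloud = rng.normal(size=(1024, 3))                # N = 1024 points, D = 3
shuffled = cloud[rng.permutation(len(cloud))]     # same set, different storage order

f1 = global_feature(cloud, params)
f2 = global_feature(shuffled, params)
print(f1.shape, np.allclose(f1, f2))
```

Because every point passes through the same weights and the maximum is taken over the point axis, reordering the rows of the input leaves the 1024-dimensional descriptor unchanged up to floating point rounding.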
PointNet achieves state-of-the-art performance on several benchmark datasets. The design of PointNet, however, does not consider local dependency among points, and thus it does not capture local structure. The global max pooling selects the feature vector on a "winner-takes-all" [26] principle, making it very susceptible to targeted adversarial attacks, as demonstrated in [27]. After PointNet, many approaches were proposed to capture local structure.

Approaches with local structure computation
Many state-of-the-art approaches that capture local structure were developed after PointNet. These techniques capture local structure hierarchically, in a similar fashion to grid convolution, with each level of the hierarchy encoding a richer representation.
Because point clouds are inherently unordered, local structure modeling rests on three basic operations: sampling; grouping; and a mapping function, usually approximated by a multilayer perceptron (MLP), which maps the features of the nearest neighbor points into a feature representation that encodes higher level information (see figure 6). We briefly explain these operations before reviewing the various approaches.
Sampling Sampling is employed to reduce the resolution of the points across layers, analogous to how the convolution operation reduces the resolution of feature maps via convolutional and pooling layers. Given a point cloud P ∈ R^{N×3} of N points, sampling reduces it to M points P̂ ∈ R^{M×3}, where M ≤ N. The subsampled M points, also referred to as representative points or centroids, are used to represent the local regions from which they were sampled. Two approaches are popular for subsampling: 1) random point sampling, where each of the N points is equally likely to be sampled, and 2) farthest point sampling (FPS), where the M points are sampled iteratively such that each newly sampled point is the point most distant from the points already selected. Other sampling methods include uniform sampling and Gumbel Subset Sampling [28].
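A minimal sketch of the two popular subsampling strategies is given below. The greedy loop is the standard iterative form of farthest point sampling; the function name `farthest_point_sampling`, the `start` parameter and the toy input are assumptions for illustration.

```python
import numpy as np

def farthest_point_sampling(points, m, start=0):
    """Greedy FPS: return the indices of m points, each farthest from those already chosen."""
    selected = np.empty(m, dtype=int)
    selected[0] = start
    # Distance from every point to its nearest already-selected point.
    min_dist = np.linalg.norm(points - points[start], axis=1)
    for i in range(1, m):
        selected[i] = int(np.argmax(min_dist))
        d = np.linalg.norm(points - points[selected[i]], axis=1)
        min_dist = np.minimum(min_dist, d)
    return selected

rng = np.random.default_rng(0)
cloud = rng.normal(size=(1024, 3))                 # hypothetical input, N = 1024

fps_idx = farthest_point_sampling(cloud, m=128)    # M = 128 centroids
rand_idx = rng.choice(len(cloud), size=128, replace=False)  # random point sampling
centroids = cloud[fps_idx]
print(centroids.shape)
```

FPS costs O(N·M) distance evaluations but tends to cover the shape more evenly than random sampling, which simply inherits the density of the input.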
Grouping With the representative points sampled, the k-nearest neighbor (kNN) algorithm is used to select the points nearest to each representative point and group them into a local patch (figure 7). The points in a local patch are used to compute the local feature representation of the neighborhood. The local patch plays the role of the receptive field in grid convolution, which is the set of pixels on the feature map under a kernel. Either kNN is used directly, where the k nearest points to a centroid are selected, or a ball query is used. With a ball query, points are selected only when they are within a given radius of the centroid.
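The two grouping strategies can be sketched as follows. This is a brute-force illustration using a full distance matrix; the function names are assumptions, and practical implementations of ball query usually also cap or pad the number of points per group so that all patches have the same size, which is not done here.

```python
import numpy as np

def pairwise_dist(centroids, points):
    """Euclidean distances between M centroids and N points, shape (M, N)."""
    diff = centroids[:, None, :] - points[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def knn_group(centroids, points, k):
    """Indices of the k nearest points to each centroid, shape (M, k)."""
    d = pairwise_dist(centroids, points)
    return np.argsort(d, axis=1)[:, :k]

def ball_query(centroids, points, radius):
    """For each centroid, indices of all points within `radius` (variable length)."""
    d = pairwise_dist(centroids, points)
    return [np.flatnonzero(row <= radius) for row in d]

rng = np.random.default_rng(0)
cloud = rng.uniform(-1.0, 1.0, size=(512, 3))      # hypothetical input cloud
centroids = cloud[:16]                             # stand-in for sampled centroids

knn_idx = knn_group(centroids, cloud, k=8)         # fixed-size patches
ball_idx = ball_query(centroids, cloud, radius=0.3)  # patch size depends on local density
patches = cloud[knn_idx]                           # (16, 8, 3) local patches
print(patches.shape, [len(b) for b in ball_idx][:4])
```

kNN always returns k points regardless of how far away they are, whereas a ball query fixes the spatial extent of the patch and lets the point count vary with density.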
Non-linear mapping function Once the nearest points to each representative point are obtained, the next step is to map them into a feature vector that represents the local structure. In grid convolution, the receptive field is mapped into a feature neuron using a simple matrix multiplication and summation with convolutional kernels. This is not easy for point clouds because the points are not structured. Most approaches therefore approximate the function using PointNet [25] based methods, which are composed of symmetric functions consisting of a multilayer perceptron, h(·), and a max pooling function, g(·), as shown in equation 1, where x_1, ..., x_k denote the points in a local region.
$$f(x_1, x_2, \dots, x_k) \approx g\big(h(x_1), h(x_2), \dots, h(x_k)\big) \tag{1}$$
Figure 6: M centroid points are sampled from the N input points, and the k-NN points of each centroid are selected to form M groups. Each of the M groups represents a local region (receptive field). A non-linear function, usually approximated by a PointNet based MLP, is then applied on the local region to learn a C-dimensional local region feature.

Approaches that do not explore local correlation
Several approaches follow a PointNet-like MLP in which the correlation between points within a local region is not considered. Instead, individual point features are learned via a shared MLP, and the local region feature is aggregated using a max pooling function on a winner-takes-all principle.
PointNet++ [29] extended PointNet to local region computation by applying PointNet hierarchically in local regions. Given a point set P ∈ R^{N×3}, the farthest point sampling algorithm is used to select centroids, and a ball query is used to select the nearest neighbor points of each centroid. PointNet is then applied on the local regions to generate a feature vector for each region. This process is repeated hierarchically, reducing the point resolution as the network goes deeper. In the last layer of the hierarchy, the features of all remaining points are passed through a PointNet to produce one global feature vector. PointNet++ achieves state-of-the-art accuracy on many public datasets, including ModelNet40 [13] and ScanNet [30].
VoxelNet [31] proposed Voxel Feature Encoding (VFE). A point cloud is first cast into 3D voxels of resolution D̂ × Ĥ × Ŵ, and the points are grouped according to the voxel they fall into. Because of the irregularity of the point cloud, T points are sampled in each voxel in order to have a uniform number of points per voxel. In a VFE layer, the centroid of each voxel is computed as the local mean of the T points within the voxel, and the T points are then processed using a fully connected network (FCN) to aggregate information from all the points, similar to PointNet. The VFE layers are stacked and a max pooling layer is applied to obtain a global feature vector for each voxel, so that the feature of the input point cloud is represented by a sparse 4D tensor of size C × D̂ × Ĥ × Ŵ. To fit VoxelNet into figure 6, the centroid of each voxel is the representative point, the T points in each voxel are the nearest neighbor points, and the FCN is the non-linear mapping function.
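The grouping, per-voxel sampling and per-voxel aggregation described for VFE can be sketched as below. This is a simplified, single-layer illustration with untrained weights, not the original implementation: the function name, the choice to append the offset from the voxel mean to each point, and the values of the voxel size, T and C are assumptions.

```python
import numpy as np

def voxel_feature_encoding(points, voxel_size, t, w, b, rng):
    """Group points by voxel, keep at most t per voxel, and max pool a shared FCN output."""
    coords = np.floor((points - points.min(axis=0)) / voxel_size).astype(int)
    # Non-empty voxels and, for each point, the index of the voxel it falls into.
    voxels, inverse = np.unique(coords, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)
    features = np.zeros((len(voxels), w.shape[1]))
    for v in range(len(voxels)):
        members = np.flatnonzero(inverse == v)
        if len(members) > t:                           # sample T points in crowded voxels
            members = rng.choice(members, size=t, replace=False)
        pts = points[members]
        centroid = pts.mean(axis=0)                    # local mean of the voxel's points
        augmented = np.concatenate([pts, pts - centroid], axis=1)  # (<=T, 6)
        h = np.maximum(augmented @ w + b, 0.0)         # shared fully connected layer + ReLU
        features[v] = h.max(axis=0)                    # one feature vector per voxel
    return voxels, features

rng = np.random.default_rng(0)
cloud = rng.uniform(0.0, 4.0, size=(500, 3))           # hypothetical input cloud
c_out = 16                                             # hypothetical feature width C
w = rng.normal(scale=0.5, size=(6, c_out))             # untrained weights
b = np.zeros(c_out)

voxels, feats = voxel_feature_encoding(cloud, voxel_size=1.0, t=35, w=w, b=b, rng=rng)
print(voxels.shape, feats.shape)
```

The returned voxel coordinates together with the feature rows form a sparse representation: scattering each row into a C × D̂ × Ĥ × Ŵ array at its voxel coordinate would give the dense layout, with zeros for empty voxels.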
The self organizing map (SOM), originally proposed in [32], is used to create a self organizing network for point clouds in SO-Net [33]. While random point sampling, farthest point sampling or uniform sampling is used to select centroids in most of the methods discussed, in SO-Net a SOM is constructed with a ...
Pointwise convolution is proposed in [34]. In this technique there are no subsampled or representative points, because the convolution operation is performed on all the input points. At each point, nearest neighbor points are selected based on the size or radius of a kernel centered on the point. The radius can be adjusted to obtain a different number of neighbor points in any layer. Each pointwise convolution is applied independently on the input and transforms the input points from 3 dimensions to 9 dimensions. The final feature is obtained by concatenating the outputs of all the pointwise convolutions for each point, and it has a resolution equivalent to that of the input. This final feature is then used for segmentation using a convolution layer, or for classification using fully connected layers.

Approaches that explore local correlation
Several approaches explore the local correlations between points in a local region to improve discriminative capability. This is intuitive because points do not exist in isolation; rather, multiple points together are needed to form a meaningful shape.
PointCNN [35] improved on PointNet++ by proposing an X-transformation on the k-nearest neighbor points of each centroid before applying a PointNet-like MLP. The centroids (representative points) are randomly sampled, and kNN is used to select the neighborhood points, which are passed through an X-transformation block before the non-linear mapping function is applied. The purpose of the X-transform is to permute the input into a more canonical form, which in essence also takes into consideration the relationship between points within a local region. In PointWeb [36], "a local web of points" is designed by densely connecting the points within a local region, and the impact of each point on the other points is learned using an Adaptive Feature Adjustment (AFA) module. In [37], the authors proposed a "PointConv" operation, which similarly explores the intrinsic structure of points within a local region by computing the inverse density scale of each point using kernel density estimation (KDE). The kernel density estimate is computed offline for each point and is fed into an MLP to obtain the inverse density scale.
In [38], the centroids are selected using a uniform sampling strategy, and the nearest neighbor points to the centroids are selected using a spherical neighborhood. The non-linear function is also approximated using a multilayer perceptron (MLP), but with additional discriminative capability obtained by considering the relation between each centroid and its nearest neighbor points. The relationship between neighboring points is based on the spatial layout of the points. Similarly, GeoCNN [39] explores the geometric structure within a local region by weighting the features of neighboring points based on their distance to their respective centroid; however, the authors perform pointwise convolution without reducing the point resolution across layers. [40] argues that the overlapping receptive fields caused by the multiscale architecture of most PointNet based approaches can result in computational redundancy, because the same neighboring points can be included in regions of different scales. To address this redundancy, the authors proposed annular convolution, a ring based approach that avoids overlaps between the hierarchy of receptive fields and also captures the relationship between points within a receptive field.
A PointNet-like MLP is the popular mapping function for turning the points in a local patch into a feature vector. However, [41] argues that the MLP does not account for the geometric prior of point clouds and also requires a sufficiently large number of parameters. To address these issues, the authors proposed a family of filters composed of two functions: a step function that encodes local geodesic information, followed by a third-order Taylor expansion. The approach learns hierarchical representations and achieves state-of-the-art performance in classification and segmentation tasks.
Point Attention Transformers (PAT) are proposed in [28]. The authors proposed a new subsampling method termed "Gumbel Subset Sampling (GSS)" which, unlike farthest point sampling (FPS), is permutation invariant and robust to outliers. The authors used absolute and relative position embedding, where each point is represented by a set consisting of its absolute position and its position relative to other points in a local patch; PointNet is then applied on the set. To further capture the relationship between points, a modified Multi-Head Attention (MHA) mechanism is used. New sampling and grouping techniques with learnable parameters were proposed in [42], in a module termed the dynamic points agglomeration module (DPAM). The module learns an agglomeration matrix which, when multiplied with the incoming point features, reduces the resolution (similar to sampling) and produces an aggregated feature (similar to grouping and pooling).

Graph based
Graph based approaches were proposed in [43,44,45,47]. These approaches represent the point cloud with a graph structure by treating each point as a node. A graph structure is good for modelling the correlation between points, as it is explicitly represented by the graph edges. [43] uses a kd-tree, which is a special kind of graph. The kd-tree is built in a top-down manner on the point cloud to create a feed-forward Kd-network with learnable parameters in each layer. The computation performed in the Kd-network is bottom-up. The leaves represent the input points, and two nearest neighbor (left and right) nodes are used to compute their parent node using the shared parameters of a weight matrix and a bias. The Kd-network captures hierarchical representations along the depth of the kd-tree; however, because of the tree design, nodes at the same depth level do not capture overlapping receptive fields. [44,45,47] are based on a typical graph network G = {V, E}, whose vertices V represent the points and whose edges E are represented as a V × V matrix. In [44], edge convolution is proposed. The graph is represented as a k-nearest neighbor graph over the inputs. In each edge convolution layer, the features of each point (vertex) are computed by applying a non-linear function on its nearest neighbor vertices, as captured by the edge matrix E. The non-linear function is a multilayer perceptron (MLP). After the last EdgeConv layer, global max pooling is employed to obtain a global feature vector, similar to [25]. One distinct difference between [44] and a normal graph network is that the edges are updated after each EdgeConv layer based on the features computed by the previous layer, hence the name Dynamic Graph CNN (DGCNN).
In DGCNN there is no resolution reduction as the network goes deeper, which leads to increased computational cost. In contrast, [45] defined a spectral graph convolution in which the resolution of the points reduces as the network gets deeper. In each layer, k-nearest neighbor points are sampled, but instead of using an MLP-like operation on the k local point sets as in [29], a graph G_k = {V, E} is defined on each set. The vertices V of the graph are the points, and the edges E ⊆ V × V are weighted based on the pairwise distance between the XYZ spatial coordinates of the points. The graph Fourier transform of the points is then computed and filtered using spectral filtering. After the filtering, the resolution of the points is still the same, so a recursive cluster pooling technique is proposed to aggregate the information in each graph into one vertex.
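The graph Fourier transform and spectral filtering on a single local patch can be sketched as follows. The sketch makes several assumptions that are not taken from [45]: edge weights come from a Gaussian kernel on pairwise distance, the combinatorial Laplacian is used, and the spectral response is a fixed low-pass function standing in for learned filter coefficients.

```python
import numpy as np

def graph_fourier_filter(points, features, sigma=1.0, response=lambda lam: np.exp(-lam)):
    """Filter per-vertex features in the spectral domain of a fully connected patch graph."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    w = np.exp(-(d ** 2) / (2.0 * sigma ** 2))   # assumed Gaussian edge weights
    np.fill_diagonal(w, 0.0)                     # no self loops
    lap = np.diag(w.sum(axis=1)) - w             # combinatorial graph Laplacian L = D - W
    lam, u = np.linalg.eigh(lap)                 # eigenvalues ascending, orthonormal eigenvectors
    spectrum = u.T @ features                    # graph Fourier transform
    filtered = u @ (response(lam)[:, None] * spectrum)  # scale each frequency, transform back
    return filtered, lam, u

rng = np.random.default_rng(0)
patch = rng.normal(size=(16, 3))                 # k = 16 points of one hypothetical local patch
feats = rng.normal(size=(16, 8))                 # 8 feature channels per vertex

smoothed, lam, u = graph_fourier_filter(patch, feats)
identity, _, _ = graph_fourier_filter(patch, feats, response=lambda lam: np.ones_like(lam))
print(smoothed.shape, np.allclose(identity, feats))
```

The output has the same number of vertices as the input, which is why a separate pooling step is needed afterwards to reduce each patch graph to a single vertex.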
In [47], the authors proposed a graph network that fully explores not only local correlation but also non-local correlation. The correlation is explored in three ways: self correlation, which explores the channel-wise correlation of a node's feature; local correlation, which explores the local dependency among nodes in a local region; and non-local correlation, which captures a better global feature by considering long-range local features. Table 1 summarizes the approaches, showing their sampling, grouping and mapping function methods.
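The edge convolution of [44] discussed earlier in this subsection, including the recomputation of the k-nearest neighbor graph after each layer, can be sketched as below. The weights are random and untrained, a single linear layer with ReLU stands in for the MLP, and the edge input is assumed to be the concatenation of the center feature and the neighbor offset; the function names and layer widths are hypothetical.

```python
import numpy as np

def knn_indices(features, k):
    """k nearest neighbors of every vertex in the current feature space, shape (N, k)."""
    d = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                   # a vertex is not its own neighbor
    return np.argsort(d, axis=1)[:, :k]

def edge_conv(features, k, w, b):
    """One EdgeConv layer: shared layer on edge features, max over each vertex's edges."""
    idx = knn_indices(features, k)                            # graph rebuilt from current features
    neighbors = features[idx]                                 # (N, k, C)
    center = np.repeat(features[:, None, :], k, axis=1)       # (N, k, C)
    edge_feat = np.concatenate([center, neighbors - center], axis=-1)  # (N, k, 2C)
    h = np.maximum(edge_feat @ w + b, 0.0)                    # shared linear layer + ReLU
    return h.max(axis=1)                                      # (N, C_out), resolution unchanged

rng = np.random.default_rng(0)
cloud = rng.normal(size=(256, 3))                             # hypothetical input cloud
w1, b1 = rng.normal(scale=0.3, size=(6, 64)), np.zeros(64)    # untrained weights
w2, b2 = rng.normal(scale=0.1, size=(128, 128)), np.zeros(128)

x1 = edge_conv(cloud, k=16, w=w1, b=b1)   # neighbors found in XYZ space
x2 = edge_conv(x1, k=16, w=w2, b=b2)      # neighbors recomputed in feature space
global_feat = x2.max(axis=0)              # global max pooling, as in PointNet
print(x1.shape, x2.shape, global_feat.shape)
```

Every layer keeps all N vertices and evaluates an N × N distance matrix, which illustrates the computational cost of having no resolution reduction.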

Benchmark Datasets
A considerable number of point cloud datasets have been published in recent years. Most of the existing datasets are provided by universities and industry, and they enable a fair comparison when testing diverse approaches. These public benchmark datasets consist of virtual or real scenes and focus particularly on point cloud classification, segmentation, registration and object detection. They are notably useful in deep learning, since they provide large amounts of ground truth labels for training networks. The point clouds are obtained by different platforms and methods, such as Structure from Motion (SfM), Red Green Blue-Depth (RGB-D) cameras, and Light Detection And Ranging (LiDAR) systems. The availability of benchmark datasets usually decreases as size and complexity increase. In this section, we introduce some popular datasets for 3D research.

3D Model Datasets
ModelNet [13]. This dataset was developed by the Princeton Vision & Robotics Labs. ModelNet40 has 40 man-made object categories (such as airplane, bookshelf and chair) for shape classification and recognition. It consists of 12,311 CAD models, which are split into 9,843 training and 2,468 testing shapes. The ModelNet10 dataset is a subset of ModelNet40 that contains only 10 categories. It is divided into 3,991 training and 908 testing shapes.
ShapeNet [48]. This large-scale dataset was developed by Stanford University et al. It provides per-model semantic category labels, rigid alignments, parts and bilateral symmetry planes, physical sizes and keywords, as well as other planned annotations. ShapeNet had indexed almost 3,000,000 models when the dataset was published, of which 220,000 models had been classified into 3,135 categories. ShapeNetCore is a subset of ShapeNet that consists of nearly 51,300 unique 3D models. It provides 55 common object categories and annotations. ShapeNetSem is also a subset of ShapeNet and contains 12,000 models. It is smaller but covers a more extensive set of 270 categories.
Augmenting ShapeNet. [49] created detailed part labels for 31,963 models from the ShapeNetCore dataset. It provides 16 shape categories for part segmentation. [50] provided 1,200 virtual partial models from the ShapeNet dataset.
iQmulus [73]. This large-scale urban scene dataset was developed by Mines ParisTech et al. in January 2013. The entire 3D point cloud has been classified and segmented into 50 classes. The data was collected by Stereopolis II MLS, a system developed by the French National Mapping Agency (IGN), using a Riegl LMS-Q120i sensor to acquire 300 million points.
Oxford RobotCar [74]. This dataset was developed by the University of Oxford. It consists of around 100 traversals of a route through central Oxford (a total of 1,010.46 km) between May 2014 and December 2015. This long-term dataset captures many challenging environmental changes, including season, weather and traffic. The dataset provides images, LiDAR point clouds, and GPS and INS ground truth for autonomous vehicles. The LiDAR data were collected by two SICK LMS-151 2D LiDAR scanners and one SICK LD-MRS 3D LiDAR scanner.
NCLT [75]. It was developed by the University of Michigan. It contains 27 times trajectories through the University of Michigans North Campus between January 2012 to April 2013. This dataset also provides images, LiDAR, GPS and INS ground truth for long-term autonomous vehicles. The LiDRA point cloud was collected by a Velodyne-32 LiDAR scanner.
Semantic3D [76]. This high-quality, high-density dataset was developed by ETH Zurich. It contains more than four billion points acquired by static terrestrial laser scanners. Eight semantic classes are provided: man-made terrain, natural terrain, high vegetation, low vegetation, buildings, hard scape, scanning artefacts and cars. The dataset is split into 15 training scenes and 15 testing scenes.
DBNet [77]. This real-world LiDAR-video dataset was developed by Xiamen University et al. Unlike previous outdoor datasets, it is aimed at learning driving policy. DBNet provides LiDAR point clouds, video recordings, GPS and driver behaviors for the study of driving behavior. It contains 1,000 km of driving data captured by a Velodyne laser.
NPM3D [78].
nuScenes [81]. The nuTonomy scenes (nuScenes) dataset was developed by nuTonomy (an APTIV company) and proposes a novel metric for 3D object detection. The metric combines several aspects of each detected object: classification, velocity, size, localization, orientation, and attribute estimation. The dataset was acquired by an autonomous vehicle sensor suite (6 cameras, 5 radars and 1 lidar) with a 360 degree field of view. It contains 1000 driving scenes collected in Boston and Singapore, two cities with heavy traffic. The objects in this dataset span 23 classes and 8 attributes, and they are all labeled with 3D bounding boxes.
BLVD [82]. This dataset was developed by Xi'an Jiaotong University and was collected in Changshu (China). It introduces a new benchmark that focuses on dynamic 4D object tracking, 5D interactive event recognition and 5D intention prediction. The BLVD dataset consists of 654 video clips totaling 120k frames at a frame rate of 10 fps. All frames are annotated, yielding 249,129 3D annotations. In total there are 4,902 unique objects for tracking, 6,004 fragments for interactive event recognition, and 4,900 objects for intention prediction.

Application of deep learning in 3D vision tasks
In this section we discuss the application of the methods reviewed in section 4 to three popular 3D vision tasks, namely classification, segmentation and object detection (see figure 8). We review the performance of the methods on popular benchmark datasets: the ModelNet40 dataset [13] for classification, and the ShapeNet [48] and Stanford 3D Indoor Semantics Dataset (S3DIS) [61] datasets for part and semantic segmentation respectively.

Classification
Object classification has been one of the primary areas in which deep learning is used. The objective is: given a point cloud, a network should classify it into a certain category. Classification is the pioneering task in deep learning because early breakthrough deep learning models such as AlexNet [83], VGGNet [84], and ResNet [85] are classification models. For point clouds, most early deep learning techniques for classification relied on a structured grid (section 3); here, however, we limit ourselves to approaches that process the point cloud directly.
The features learned by the techniques reviewed in sections 3 and 4 can easily be used for classification by passing them through a fully connected network whose last layer represents the classes. Other machine learning classifiers such as SVM can also be used, as in [9,86]. Figure 9 shows a timeline of the performance of point-based deep learning approaches on ModelNet40.
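As an illustration of the classification step described above, the following is a minimal NumPy sketch of a fully connected head applied to a learned global feature. The layer sizes (1024, 256, 40), the random weights and the function names are assumptions chosen for illustration; they are not taken from any of the reviewed methods, and a real network would learn the weights by training.

```python
import numpy as np


def softmax(z):
    # Subtract the row-wise maximum for numerical stability.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def classify(global_feature, w1, b1, w2, b2):
    # One hidden fully connected layer with ReLU, then a final layer
    # whose outputs correspond to the object classes.
    hidden = np.maximum(global_feature @ w1 + b1, 0.0)
    logits = hidden @ w2 + b2
    return softmax(logits)


# Hypothetical sizes: a 1024-d global feature and 40 classes (as in ModelNet40).
rng = np.random.default_rng(0)
feature_dim, hidden_dim, num_classes = 1024, 256, 40
global_feature = rng.standard_normal((2, feature_dim))  # two point clouds
w1 = rng.standard_normal((feature_dim, hidden_dim)) * 0.01
b1 = np.zeros(hidden_dim)
w2 = rng.standard_normal((hidden_dim, num_classes)) * 0.01
b2 = np.zeros(num_classes)

probs = classify(global_feature, w1, b1, w2, b2)
predicted_class = probs.argmax(axis=1)  # one class index per point cloud
```

The same global feature could instead be handed to a separate classifier such as an SVM; only the head changes, not the feature learning.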

Segmentation
Segmentation of a point cloud is the grouping of points into homogeneous regions. Traditionally, segmentation is done using edges [87] or surface properties such as normals, curvature and orientation [87,88]. Recently, feature-based deep learning approaches have been used for point cloud segmentation, with the goal of segmenting the points into different aspects. The aspects could be different parts of an object, referred to as part segmentation, or different class categories, referred to as semantic segmentation.
In part segmentation, the input point cloud represents a certain object and the goal is to assign each point to a certain part, as shown in figure 8b, hence the name "part" segmentation.
In [25,33,44] the learned global descriptor is concatenated with the features of the points and then passed through an MLP to classify each point into a part category. [29,35] propagate the global descriptor into high-resolution predictions using interpolation and deconvolution methods respectively. In [34] the learned per-point features are used to achieve segmentation by passing them through dense convolutional layers. An encoder-decoder architecture is used in [43] for both part and semantic segmentation. In table 8b the results of various techniques on the ShapeNet parts dataset are shown.
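The concatenation strategy described above can be sketched as follows. This is a simplified NumPy illustration, not the implementation of any cited method: the global descriptor is obtained here by max pooling over the per-point features (an assumption), and the feature sizes, random weights and function name are hypothetical.

```python
import numpy as np


def per_point_part_scores(point_features, global_descriptor, w1, b1, w2, b2):
    # Repeat the global descriptor for every point and concatenate it
    # with that point's own feature vector.
    n = point_features.shape[0]
    tiled = np.broadcast_to(global_descriptor, (n, global_descriptor.shape[0]))
    fused = np.concatenate([point_features, tiled], axis=1)
    # Shared MLP: the same weights are applied to every point.
    hidden = np.maximum(fused @ w1 + b1, 0.0)
    return hidden @ w2 + b2


rng = np.random.default_rng(0)
n_points, c_local, n_parts = 128, 64, 4  # hypothetical sizes
point_features = rng.standard_normal((n_points, c_local))

# Assumption: the global descriptor is a max pool over per-point features.
global_descriptor = point_features.max(axis=0)

w1 = rng.standard_normal((2 * c_local, 32)) * 0.1
b1 = np.zeros(32)
w2 = rng.standard_normal((32, n_parts)) * 0.1
b2 = np.zeros(n_parts)

scores = per_point_part_scores(point_features, global_descriptor, w1, b1, w2, b2)
part_labels = scores.argmax(axis=1)  # one part index per point
```

Because the MLP is shared across points, the number of input points can vary without changing the weights.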
In semantic segmentation, the goal is to assign each point to a particular class. For example, in figure 8d, the points belonging to chairs are shown in red, while the ceiling and floor are shown in green and blue respectively. Popular public datasets for evaluating semantic segmentation are S3DIS [61] and ScanNet [30]. Table 4 shows the performance of some of the state-of-the-art methods on the S3DIS and ScanNet datasets.
Instance segmentation on point clouds receives less attention than part and semantic segmentation. In instance segmentation, the grouping is based on instances, so that multiple objects of the same class are uniquely identified. Some state-of-the-art works on instance segmentation on point clouds are [89,90,91,92,93], which are built on a PointNet/PointNet++ feature learning backbone. Table 8b shows the performance of the methods discussed in section 4 on the popular ShapeNet datasets.

Object detection
Object detection is an extension of classification in which multiple objects can be recognized and each object is localized using a bounding box, as shown in figure 8c. R-CNN [96] first proposed 2D object detection using selective search, where different regions are selected and passed to the network one at a time. Several variants were later proposed [97,98,99].
Another family of state-of-the-art 2D object detectors is YOLO [100] and its variants such as [101,102]. In summary, 2D object detection is based on two major stages: region proposal and classification.
Like in 2D images, detection in 3D point clouds is also based on the two stages of proposal and classification. The proposal stage in 3D point clouds, however, is more challenging than in 2D because the search space is three-dimensional and the sliding window or region to be proposed is also three-dimensional. Vote3D [103] and Vote3Deep [104] convert the input point cloud into a structured grid and perform an extensive sliding window operation. In VoxelNet [31], the sparse 4D feature tensor is passed through a region proposal network to generate 3D detections. FrustumNet [105] proposes regions in 2D, obtains the 3D frustum of each region from the point cloud, and passes it through PointNet to predict a 3D bounding box. [89] first uses PointNet/PointNet++ to obtain a feature vector for each point and, based on the hypothesis that points belonging to the same object are closer in feature space, proposes a similarity matrix that predicts whether a given pair of points belongs to the same object. In [106], PointNet and PointNet++ are used to design a generative shape proposal network that generates proposals, which are further processed using PointNet for classification and segmentation. PointNet++ is used in [107] to learn point-wise features, which are used to segment foreground points from background points; a bottom-up 3D proposal scheme then generates 3D box proposals from the foreground points. The 3D box proposals are further refined using another PointNet++-like structure. [108] uses PointNet++ to learn point-wise features, which are considered to be seeds. The seeds then independently cast votes using a Hough voting module based on an MLP. Votes for the same object are close in space, which allows for easy clustering. The clusters are further processed using a shared PointNet-like module for vote aggregation and proposal. PointNet is also utilized in [109] with the Single Shot Detector (SSD) [110] for object detection.
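The hypothesis used in [89], that points of the same object lie close together in feature space, can be illustrated with a pairwise distance matrix. The sketch below is only an illustration of that idea: the use of squared Euclidean distance, the fixed threshold, the toy 2D features and the function names are all assumptions rather than the formulation of [89].

```python
import numpy as np


def similarity_matrix(features):
    # Pairwise squared Euclidean distances between per-point feature
    # vectors; entry (i, j) is small when points i and j are close
    # in feature space.
    sq_norms = (features ** 2).sum(axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * features @ features.T
    return np.maximum(d2, 0.0)  # clip tiny negatives from round-off


def same_object_mask(features, threshold):
    # Predict that a pair belongs to the same object when its feature
    # distance falls below a (hypothetical) threshold.
    return similarity_matrix(features) < threshold


# Toy per-point features: points 0 and 1 form one object, 2 and 3 another.
features = np.array([[0.0, 0.0],
                     [0.1, 0.0],
                     [5.0, 5.0],
                     [5.0, 5.1]])
mask = same_object_mask(features, threshold=1.0)
```

Each row of the mask can be read as a group proposal: the set of points predicted to share an object with the point indexing that row.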
One of the popular object detection datasets is the KITTI dataset [69,70]. The evaluation on KITTI is divided into easy, moderate and hard levels depending on the occlusion level, the minimum height of the bounding box and the maximum truncation. We report the performance of various object detection methods on the KITTI dataset in tables 5 and 6.

Summary and Conclusion
The increasing availability of point clouds as a result of evolving scanning devices, coupled with increasing application in autonomous vehicles, robotics, AR and VR, demands fast and efficient algorithms for point cloud processing in order to achieve improved visual perception such as recognition, segmentation and detection. Due to scarce data availability and the limited popularity of deep learning at the time, early methods for point cloud processing relied on handcrafted features. However, with the revolution brought about by deep learning in 2D vision tasks, and the evolution of point cloud acquisition devices, which has made point cloud data widely available, the computer vision community is focusing on how to utilize the power of deep learning on point cloud data. Point clouds provide more accurate 3D information, which is vital in applications that require it. Due to the unstructured nature of point clouds, it is very challenging to use deep learning for their processing. Most approaches resort to converting the point cloud into a structured grid for easy processing by deep neural networks. These approaches, however, lead to either loss of depth information or conversion artifacts, and require higher computational cost. Recently, deep learning directly on point clouds has been receiving a lot of attention. Learning on point clouds directly does away with conversion artifacts and mitigates the need for higher computational cost. PointNet is the basic deep learning method that processes point clouds directly. PointNet, however, does not capture local structures. Many approaches were developed to improve on PointNet by capturing local structures. In order to capture local structures, most methods follow three basic steps: sampling, to reduce the resolution of the points and to obtain centroids representing local neighborhoods; grouping, based on K-NN, to select the neighboring points of each centroid; and a mapping function, usually approximated by an MLP, which learns the representation of the neighboring points.
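The three steps of sampling, grouping and mapping can be sketched in NumPy as below. This is a minimal sketch under stated assumptions: farthest point sampling is used for the sampling step (the text does not name a specific sampler), neighbors are expressed relative to their centroid, the mapping is a single shared layer followed by max pooling, and all sizes, weights and function names are hypothetical.

```python
import numpy as np


def farthest_point_sampling(points, num_centroids):
    # Sampling: iteratively pick the point farthest from those already chosen.
    chosen = np.zeros(num_centroids, dtype=int)  # starts from point 0 (arbitrary)
    min_dist = np.full(points.shape[0], np.inf)
    for i in range(1, num_centroids):
        diff = points - points[chosen[i - 1]]
        min_dist = np.minimum(min_dist, (diff ** 2).sum(axis=1))
        chosen[i] = int(min_dist.argmax())
    return chosen


def knn_group(points, centroid_idx, k):
    # Grouping: indices of the k nearest points to each centroid.
    centroids = points[centroid_idx]                           # (M, 3)
    d2 = ((centroids[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
    return np.argsort(d2, axis=1)[:, :k]                       # (M, k)


def local_features(points, centroid_idx, neighbor_idx, w, b):
    # Mapping: a shared one-layer MLP on centroid-relative coordinates,
    # followed by max pooling over each neighborhood.
    relative = points[neighbor_idx] - points[centroid_idx][:, None, :]
    hidden = np.maximum(relative @ w + b, 0.0)                 # (M, k, C)
    return hidden.max(axis=1)                                  # (M, C)


rng = np.random.default_rng(0)
points = rng.random((256, 3))            # a toy point cloud
w = rng.standard_normal((3, 64)) * 0.1   # untrained, illustrative weights
b = np.zeros(64)

centroid_idx = farthest_point_sampling(points, num_centroids=32)
neighbor_idx = knn_group(points, centroid_idx, k=16)
features = local_features(points, centroid_idx, neighbor_idx, w, b)
```

The max pooling makes each neighborhood feature independent of the order of its points, which is why a symmetric aggregation of this kind suits unordered point sets.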
Several methods resort to approximating the MLP with a PointNet-like network. However, because PointNet does not explore inter-point relationships, several approaches explore inter-point relationships within a local patch before applying PointNet-like MLPs. Taking into account the point-to-point relationships has proven to increase the discriminative capability of the networks.
While deep learning on 3D point clouds has shown good performance on several tasks including classification and part and semantic segmentation, other areas are receiving less attention. Instance segmentation on 3D point clouds, where individual objects are segmented in a scene, remains a largely uncharted direction. Most current object detection methods rely on 2D detection for region proposal, and few works are available on detecting objects directly on point clouds. Scaling to larger scenes also remains largely unexplored, as most current works rely on cutting large scenes into smaller pieces. At the time of this review, only a few works [120,121] have explored deep learning on large-scale 3D scenes.

Table 6: Performance on the KITTI 3D object detection benchmark