Deep-Learning-Based Point Cloud Semantic Segmentation: A Survey



Introduction
In recent years, the rapid development of emerging industries such as smart cities, automotive navigation systems, augmented reality, and environmental assessment has motivated a large amount of research on 3D scene perception. This research invariably requires the processing and analysis of huge amounts of 3D data. How to enhance the understanding of 3D scenes and extract effective high-level features has become an important scientific problem in 3D computer vision.
As a key form and essential information carrier of 3D data, a point cloud is a collection of points representing the information of objects in 3D scenes, which can be used as a digital representation of the real world. Point clouds usually contain coordinates, color, intensity values, and other attributes so that the original geometric structure of the objects in 3D scenes can be retained to the maximum extent. As a key step in understanding 3D scenes, point cloud semantic segmentation is a technique that divides the original point cloud into several subsets with different semantic information and classifies each point into a specific group according to the degree of attribute similarity. At present, point cloud semantic segmentation is widely applied in areas of national strategic need, such as autonomous driving [1], augmented reality [2], and transmission line inspection [3]. It has important research significance and broad development prospects.
In recent years, deep learning techniques have made breakthroughs in computer vision, and more and more computer vision tasks rely on convolutional neural networks (CNNs), recurrent neural networks (RNNs), generative adversarial networks (GANs), and other derived neural network architectures. Owing to their excellent feature learning capacity, deep neural networks have achieved remarkable results and occupy a dominant position in point cloud semantic segmentation. Deep-learning-based point cloud semantic segmentation methods can be subdivided into point-based methods and rule-based methods. The latter transform the original point cloud into regular structures, such as 2D images and voxels, and automatically extract features through neural networks to segment different categories of objects in 3D scenes at the semantic level. However, because point clouds are sparse and unstructured, such operations not only increase the computational overhead but also lead to a large loss of key information, seriously affecting the accuracy of the segmentation methods. Therefore, it is crucial to explore how to further improve the performance of point cloud segmentation methods while keeping the original information as complete as possible.
There have been some review papers on point cloud semantic segmentation [4][5][6][7][8], but a systematic summary and analysis of the latest segmentation methods and datasets is still needed. This paper aims to provide researchers with a comprehensive and systematic understanding of the current state of research in point cloud segmentation by summarizing and analyzing the representative methods proposed from 2015 to 2023. As shown in Figure 1, this paper focuses on point cloud semantic segmentation, introducing and discussing the latest research progress in detail through the following seven sections. First, we analyze the characteristics of point clouds, and to address the challenges they pose, we classify point cloud semantic segmentation methods into rule-based segmentation and point-based segmentation according to how they process the data. The representative and innovative implementations of each type of method are elaborated in detail. Furthermore, we introduce mainstream evaluation metrics in the field of point cloud semantic segmentation, summarize more than 20 datasets, and compare the performance of different methods on the datasets. Finally, the future development trends and research focus of point cloud semantic segmentation are discussed.
Electronics 2023, 12, x FOR PEER REVIEW 2 of 26

Point Cloud Characteristics
Compared with 2D images, 3D point clouds not only avoid limitations of the image acquisition process, such as complex object structure, random lighting conditions, and partial object occlusion or adhesion, but also offer diversity and rich information. Accordingly, the processing and analysis of point clouds have become a focus of research in 3D computer vision. However, since point clouds are nonuniform, unstructured, and unordered, they must be processed in a way that respects these characteristics. This section gives a comprehensive introduction to point cloud characteristics to provide a reference for future research on point cloud semantic segmentation.

Diversity of Point Clouds
Depending on the data acquisition principles and methods, point clouds can be roughly categorized into three types: image-derived point clouds, light detection and ranging (LiDAR) point clouds, and other point clouds. The image-derived point cloud is mainly obtained by stereo matching of RGB-D images acquired from depth sensors using time of flight (ToF), structured light, and other technologies. The LiDAR point cloud is obtained by using the time delay between the emission of a laser pulse and its reflection back to the receiver to measure the distance to the object's surface, and combining it with the position and attitude information. According to the carrier of the LiDAR system, it can be classified as fixed, handheld, vehicle-borne, airborne, and so on. Driven by the rapid development of sensor technology and the demand for applications, novel point clouds have been proposed, such as multisource fusion point clouds [9] and interferometric synthetic aperture radar (InSAR) point clouds [10]. Compared with regular point clouds, novel point clouds have demonstrated their value and unique advantages in related research over the past few years [11][12][13], which opens up possibilities for new application scenarios. Figure 2 shows the various types of point clouds acquired by different acquisition methods and devices.

Figure 2. Types of point clouds acquired by different acquisition methods and devices: (a) ... ([14]), (b) points generated from CAD models (ModelNet [15]), (c) point clouds scanned by Matterport (S3DIS [16]), (d) point clouds scanned by a vehicle-mounted laser scanning system (A2D2 [17]), (e) point clouds scanned by a mobile laser scanning system (Paris-Lille-3D [18]).

Information Richness of Point Clouds
The point cloud is a direct and significant data carrier for describing the real world in the digital era. It contains rich information and plays a vital role in national requirements and scientific research. This information is derived from multiple dimensions, each of which is essential in processing and analyzing point clouds. Specifically, geometric information provides the spatial position and structure of the objects. Texture information describes the fine-grained features of the object surface. Color information contains RGB values or reflective intensities obtained from other sensors, which makes point clouds more visually realistic and enhances visualization. Normal information describes the direction of the surface normal at each point, which is necessary for tasks such as 3D reconstruction and geometry analysis. Semantic information indicates the object category to which a point belongs, which is necessary to achieve a deep understanding of 3D scenes. The high-dimensional and varied information carried by the point cloud provides a rich data resource for further research in 3D computer vision.

Nonuniformity of Point Clouds
Point cloud acquisition methods based on laser scanning equipment and depth sensors provide a direct and effective means for the 3D digital representation of the real world. In most 3D scenes, the object categories and point density distributions vary across scales. In addition, the penetration capacity of the acquisition equipment is limited: it captures only the surface of an object and almost completely ignores its internal structure. Together, these factors lead to large differences in point density between regions, as shown in Figure 3. Therefore, extracting and understanding high-level features in 3D scenes is challenging.

Nonstructure of Point Clouds
While 2D images are represented in the computer as matrices, point clouds are more flexible. As shown in Figure 4, the spatial distribution of points is not limited to a particular structured representation: each local region contains a different number of points, and the relative positions of pairs of points differ. This unstructured characteristic makes it difficult to apply conventional convolution to the original point cloud. For this reason, some researchers have attempted to transform point clouds into regular data that retain the original geometric structure by constructing voxels. Because of the limited voxel resolution and the computing power required, voxelization methods inevitably lose a large amount of key information, and the complexity of the algorithms grows cubically as the voxel grid is refined. Therefore, such methods are poorly suited to the processing of large-scale 3D scenes.
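The trade-off described above can be illustrated with a minimal sketch. It assumes a uniform axis-aligned grid and a synthetic toy scene; the function names `voxelize` and `dense_grid_cells` are hypothetical and not taken from any cited method. Halving the voxel size multiplies the number of dense grid cells by eight, while points that fall into the same voxel become indistinguishable.

```python
import math
import random


def voxelize(points, voxel_size):
    """Map each (x, y, z) point to an integer voxel index; return the occupied set."""
    return {
        (
            math.floor(x / voxel_size),
            math.floor(y / voxel_size),
            math.floor(z / voxel_size),
        )
        for x, y, z in points
    }


def dense_grid_cells(extent, voxel_size):
    """Number of cells in a dense cubic grid covering [0, extent)^3."""
    per_axis = math.ceil(extent / voxel_size)
    return per_axis ** 3


random.seed(0)
# Hypothetical toy scene: 2000 surface points on the plane z = 0 of a unit cube.
points = [(random.random(), random.random(), 0.0) for _ in range(2000)]

for voxel_size in (0.5, 0.25, 0.125):
    occupied = voxelize(points, voxel_size)
    total = dense_grid_cells(1.0, voxel_size)
    # The dense grid grows cubically, but a scanned surface fills only a thin slice.
    print(f"voxel size {voxel_size}: {total} cells, {len(occupied)} occupied")
```

Because the toy surface occupies a single layer of voxels, most cells of the dense grid stay empty, which is the sparsity that makes dense 3D convolution wasteful on scanned scenes.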

Disorder of Point Clouds
The disorder of point clouds means that a point cloud is essentially a collection of unordered points in 3D space. The order of the collected points varies greatly with the object's posture, the sensor type, and the observation platform. The coordinates of each point independently characterize its spatial position, whereas for a cluster of points the initial input order carries no meaning, and no point is explicitly associated with the other points in its neighborhood. If a point cloud of n points is fed to a neural network as an n × 3 matrix, there are n! possible row orderings. As shown in Figure 5, changing the order of the key points describing the same desk generates different point cloud matrices, although the object they describe does not depend on the order in which the points are stored in the computer.
How to effectively handle this disorder has become the key to point cloud registration, point cloud classification, and point cloud semantic segmentation.
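A minimal sketch can make the ordering problem concrete. The desk coordinates below are hypothetical, and the column-wise maximum is used only as one example of a symmetric (order-independent) function; it is an assumption for illustration, not a method prescribed by this section.

```python
from itertools import permutations


def flatten(points):
    """Row-major flattening of the n x 3 matrix: depends on the row order."""
    return [coord for point in points for coord in point]


def max_pool(points):
    """Column-wise maximum over all points: a symmetric function of the rows."""
    return tuple(max(column) for column in zip(*points))


# Hypothetical key points of a desk (n = 4, so there are 4! = 24 orderings).
desk = [(0.0, 0.0, 0.7), (1.2, 0.0, 0.7), (1.2, 0.6, 0.7), (0.0, 0.6, 0.0)]
reordered = desk[::-1]

print(flatten(desk) == flatten(reordered))    # the two matrices differ
print(max_pool(desk) == max_pool(reordered))  # the pooled descriptor does not

orderings = list(permutations(desk))
print(len(orderings), len({max_pool(p) for p in orderings}))
```

Any operation that reads the matrix row by row sees a different input for each ordering, whereas a symmetric aggregation returns the same result for all of them.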

Public Datasets
With the rise of artificial intelligence, computer vision tasks need to utilize deep neural networks with larger parameter sizes and more complex architectures for high-level feature extraction. High-quality point cloud datasets are an important guarantee for the effective training of networks and for verifying the performance of proposed segmentation algorithms. However, the collection and labeling of massive data require not only a lot of labor, material, and financial resources but also the guidance of domain experts and professional skills in related industrial software. To promote research on point cloud semantic segmentation, some research institutions provide semantically informative and reliable public datasets. Using these mainstream public datasets for network training and validation not only guarantees the fairness and validity of comparisons with other networks but also provides a low-cost and feasible route to building deep networks with excellent performance. This section highlights five datasets commonly used for point cloud semantic segmentation: ShapeNet [19], S3DIS [16], ScanNet [20], Semantic3D [21], and SemanticKITTI [22]. Figure 6 shows annotation examples from these datasets.
ShapeNet: ShapeNet is a large dataset of 3D CAD models with rich annotations. It consists of two parts, ShapeNetCore and ShapeNetSem. ShapeNetCore contains about 51,300 3D models in 55 common categories, and each model annotation consists of 2-5 parts. ShapeNetSem is a smaller, more densely annotated subset that validates and annotates more than 12,000 3D models in 270 categories with size, volume, shape, and other attributes.
S3DIS: The Stanford 3D Indoor Scene Dataset (S3DIS) is a large indoor scene dataset generated using a Matterport 3D laser scanner. The dataset covers 6 indoor regions with a total area of more than 6000 m² and consists of more than 215 million points, 70,496 regular RGB images, 1413 equirectangular RGB images, and 272 indoor scenes with instance-level semantic annotations. It has 13 categories, and each point carries surface normals, coordinates, semantic annotations, and other attributes. This dataset plays a key role in the learning of indoor scene features in 3D vision.
ScanNet: ScanNet is a dataset of indoor scenes composed of RGB-D video sequences.The dataset consists of 1513 scans of 707 indoor environments, generating 2.5 million RGB-D views with 21 categories.The attributes include not only the precalibration parameters, textures, and coordinates but also instance-level semantic annotations.This dataset is an important contribution to the realization of 3D scene perception.
Semantic3D: Semantic3D is a representative large-scale outdoor scene point cloud dataset, providing more than 30 different scenes, such as churches, stations, squares, soccer fields, and villages. Among them, 15 scenes are used for network training, and the remaining scenes are used for network testing. The scenes contain over 4 billion points with attributes such as coordinates, colors, and intensity values, covering 8 categories: artificial terrain, natural terrain, high vegetation, low vegetation, buildings, hardscape, cars, and scanning artifacts. To accommodate the hardware available in researchers' development environments, two subdatasets are provided, semantic-8 and reduced-8. Semantic-8 has the complete test data, while reduced-8 contains only 4 subsets as test cases.
SemanticKITTI: SemanticKITTI is a large point cloud dataset of outdoor scenes around Karlsruhe, Germany, generated by automotive LiDAR, which plays a vital role in the study of the semantic segmentation of road traffic scenes in the field of autonomous driving.The dataset contains about 4.5 billion points in 28 categories, covering 22 sets of scene sequences, including city traffic, residential areas, highways, and rural roads.The sequences 0-10 are used for network training and the sequences 11-21 are used for network testing.This dataset provides a reliable benchmark for evaluating the performance of the models in the task of 3D outdoor scene target recognition and semantic segmentation.
The mainstream datasets of point cloud semantic segmentation are summarized according to name, year, type, application scenario, category, size, and sensor, as shown in Table 1.

Evaluation Metrics
For the quantitative evaluation of the model's performance, mainstream evaluation metrics are needed to sufficiently guarantee the fairness and validity of the experimental results.At present, researchers mostly use execution time, complexity, and accuracy as the benchmark for evaluating models.However, the time overhead of the segmentation algorithms is closely related to the hardware systems used by researchers, and few researchers provide data about the time and space complexity of the proposed methods.Therefore, this paper focuses on the accuracy evaluation metrics of the methods.
Presently, overall accuracy (OA), mean class accuracy (mAcc), and mean intersection over union (mIoU) are used as the metrics to evaluate the performance of point cloud semantic segmentation methods. For convenience, the notation used below is introduced here: assuming that there are N + 1 semantic classes (including the empty class), $M_{ij}$ denotes the number of units with actual semantic type i but predicted type j, and vice versa for $M_{ji}$. $M_{ii}$ denotes the number of units with actual semantic type i and predicted type i.
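As a rough sketch of how OA, mAcc, and mIoU can be computed from the confusion matrix M introduced above, the following uses their standard definitions: correct predictions over all units, the mean of per-class accuracy, and the mean of per-class intersection over union. The function name `segmentation_metrics` and the toy matrix are hypothetical, and the code assumes that every class occurs at least once so that no denominator is zero.

```python
def segmentation_metrics(M):
    """Return (OA, mAcc, mIoU) for a square confusion matrix M.

    M[i][j] counts the units with actual class i and predicted class j.
    Assumes every class appears at least once among the actual labels.
    """
    n = len(M)
    total = sum(sum(row) for row in M)
    correct = sum(M[i][i] for i in range(n))
    overall_accuracy = correct / total

    class_accuracy = []
    class_iou = []
    for i in range(n):
        actual_i = sum(M[i])                          # sum over j of M[i][j]
        predicted_i = sum(M[j][i] for j in range(n))  # sum over j of M[j][i]
        class_accuracy.append(M[i][i] / actual_i)
        # Union = actual + predicted - intersection (the diagonal entry).
        class_iou.append(M[i][i] / (actual_i + predicted_i - M[i][i]))

    return overall_accuracy, sum(class_accuracy) / n, sum(class_iou) / n


# Hypothetical two-class confusion matrix.
oa, macc, miou = segmentation_metrics([[8, 2], [1, 9]])
print(oa, macc, miou)
```

mIoU penalizes both missed units and false positives of each class, which is why it is usually lower than OA on the same predictions.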
OA: OA is the ratio of the number of samples correctly predicted by the segmentation algorithms to the total number of samples, as shown in Equation (1):

\[ \mathrm{OA} = \frac{\sum_{i=0}^{N} M_{ii}}{\sum_{i=0}^{N} \sum_{j=0}^{N} M_{ij}} \tag{1} \]

mAcc: mAcc is an improvement of OA, which calculates the accuracy for each category separately and then averages the results over the number of categories, as shown in Equation (2):

\[ \mathrm{mAcc} = \frac{1}{N+1} \sum_{i=0}^{N} \frac{M_{ii}}{\sum_{j=0}^{N} M_{ij}} \tag{2} \]

mIoU: mIoU is the most important index for evaluating the performance of segmentation methods. It first calculates, for each category, the ratio of the intersection to the union of the predicted and true regions, and then averages the results over the number of categories, as shown in Equation (3):

\[ \mathrm{mIoU} = \frac{1}{N+1} \sum_{i=0}^{N} \frac{M_{ii}}{\sum_{j=0}^{N} M_{ij} + \sum_{j=0}^{N} M_{ji} - M_{ii}} \tag{3} \]

Considering simplicity and representativeness, the three evaluation metrics OA, mAcc, and mIoU are selected in this paper to compare and analyze different point cloud semantic segmentation methods for researchers' reference.

Point Cloud Semantic Segmentation Methods
In the early stage, deep-learning-based methods could not deal with 3D data effectively and required dimensionality reduction. Su et al. [43] projected the original point cloud from multiple viewpoints to obtain 2D images, then used the proposed network MVCNN to extract features and aggregate them in the pooling layer, and finally remapped the aggregated features back to the point cloud to achieve segmentation. This method improved accuracy and was a pioneering attempt to address the unstructured nature of point clouds. Feng et al. [44] improved MVCNN by increasing the number of projection views, obtaining feature vectors by CNN for the images from 12 views individually, and grouping the prediction scores from the fully connected layers. The group-level features are combined into the object features by weighting and then averaging between different groups. To reduce the information loss caused by structured processing, You et al. [45] proposed a point cloud segmentation network, PVRNet, that fully considers the relationship between points and views. It integrates view features and point features through a correlation prediction module and proposes two correlation feature fusion methods: point cloud correlation feature fusion with a single view and with multiple views. Finally, the features of both are aggregated to further improve the network's capacity to understand the deep-level features of objects in 3D scenes. Milioto et al. [46] proposed an efficient GPU-based k-NN postprocessing method that can be used to address discretization and inferential ambiguity. Robert et al. [47] computed occlusion-aware mappings between 3D points and 2D pixels and then aggregated relevant image features for each point through observation conditions based on an attention scheme. This method achieved 74.4 mIoU on S3DIS with sixfold cross-validation, which set a new state of the art for large-scale indoor semantic segmentation. However, since multiple views are only an approximate abstraction of an object, the objects themselves may be partially occluded or defective, and it is difficult to cover all objects in large-scale scenes with multiview image-based methods. Therefore, few such methods have been used for point cloud semantic segmentation in recent studies.
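The aggregation step of a multiview network such as MVCNN can be sketched as pooling per-view feature vectors into a single descriptor. The snippet below assumes an element-wise maximum across views; the feature values and the function name `view_pool` are hypothetical, and the CNN that would produce the per-view features is omitted.

```python
def view_pool(view_features):
    """Element-wise maximum over the feature vectors of all views."""
    return [max(values) for values in zip(*view_features)]


# Hypothetical 4-dimensional features extracted from three rendered views.
views = [
    [0.1, 0.9, 0.3, 0.0],
    [0.4, 0.2, 0.8, 0.1],
    [0.3, 0.5, 0.1, 0.7],
]

pooled = view_pool(views)
print(pooled)
# The descriptor does not depend on the order in which the views are supplied.
print(view_pool(views[::-1]) == pooled)
```

Each pooled entry keeps only the strongest response among the views, so information visible in a single view survives, but the spatial relation between views is discarded.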
(2) RGB-D Image-Based. A depth image records the distance from the laser scanning device to each object in space and thereby reflects the geometry of the object's surface. Depth images are usually generated from point clouds in spherical coordinates based on azimuth and zenith angles. Boulch et al. [48] proposed SnapNet, a semantic segmentation network that fuses depth image features, and achieved impressive results on semantic-8. The method first preprocesses the point cloud and generates viewpoints, renders RGB images and depth images from the selected viewpoints, labels these images with a fully convolutional neural network, and finally back-projects the labels into the point cloud to obtain the semantic segmentation results. Guerry et al. [49] improved SnapNet by proposing SnapNet-R, which processes multiple views simultaneously and thus obtains denser labels and better performance. Since the maximum pooling operation during feature aggregation leads to a partial loss of local information, Wu et al. [50] proposed SqueezeSeg, a point cloud semantic segmentation network based on a conditional random field (CRF) and depth images. It uses spherical projection to transform sparse point clouds into 2D images, feeds them into SqueezeNet [51] for 3D classification and semantic segmentation, and uses a CRF as a recurrent layer to further refine the results. However, the accuracy of this method is sensitive to the noise generated in the point cloud acquisition process. SqueezeSegV2 [52] improves SqueezeSeg by adding a context aggregation module (CAM) that enlarges the receptive field of the network and makes better use of contextual information, which makes the network more robust to the noise and outliers generated during point cloud acquisition. Considering the nonuniform distribution of spatial features in point clouds, Xu et al. [53] proposed SqueezeSegV3 with spatially adaptive convolution (SAC), which applies different filters to different locations in the images generated by point cloud projection, thus making full use of the capacity of the network. In a recent study, Yang et al. [54] proposed a novel framework, SAM3D, which leverages the Segment Anything Model (SAM) for 3D vision. It first uses SAM to predict the segmentation results of RGB images and then adopts a bidirectional merging approach to project the 2D masks of adjacent frames into 3D point clouds. Finally, the 3D masks predicted from different frames are gradually merged into the 3D mask of the whole 3D scene. Table 2 compares the performance of image-based point cloud semantic segmentation methods on the datasets.
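The spherical projection that underlies range-image methods such as the SqueezeSeg family can be sketched as follows. This is a minimal illustration rather than the implementation of any surveyed network: the function name, the image size, and the vertical field-of-view defaults are assumptions chosen for the example.

```python
import numpy as np

def spherical_projection(points, height=64, width=1024,
                         fov_up_deg=3.0, fov_down_deg=-25.0):
    """Project an (N, 3) point cloud onto a (height, width) range image.

    Returns the range image (empty pixels are -1) and the pixel row and
    column of every point. The field-of-view defaults are illustrative
    assumptions, not values taken from the surveyed papers.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    depth = np.maximum(np.linalg.norm(points, axis=1), 1e-8)

    azimuth = np.arctan2(y, x)                           # in [-pi, pi]
    elevation = np.arcsin(np.clip(z / depth, -1.0, 1.0))

    fov_up = np.radians(fov_up_deg)
    fov = fov_up - np.radians(fov_down_deg)

    # Map angles to continuous pixel coordinates, then to integer pixels.
    u = 0.5 * (1.0 - azimuth / np.pi) * width            # column
    v = (fov_up - elevation) / fov * height              # row
    u = np.clip(np.floor(u), 0, width - 1).astype(np.int64)
    v = np.clip(np.floor(v), 0, height - 1).astype(np.int64)

    # Keep the nearest point when several points fall into one pixel.
    image = np.full((height, width), np.inf)
    np.minimum.at(image, (v, u), depth)
    image[np.isinf(image)] = -1.0
    return image, v, u

pts = np.array([[10.0, 0.0, 0.0], [20.0, 0.0, 0.0], [0.0, 5.0, 0.0]])
range_image, rows, cols = spherical_projection(pts)
```

Given a label image predicted by a 2D network, labels can be back-projected to the points with `label_image[rows, cols]`, which mirrors the final back-projection step described above in its simplest form.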

Voxel-Based Methods
Voxelization is another way of transforming unstructured data into structured data. It represents an object by the cells of a regular 3D grid (voxels) that are closest to the object. VoxNet [55] was the first to use voxelization to transform unstructured point clouds into regular voxels and then use a 3D CNN to predict the semantic labels of the occupied voxels by standard convolution operations. Although this method solved the problem of unstructured point clouds, it is inefficient because the voxel grid is sparse and the 3D CNN has high computational complexity. Su et al. [56] designed SPLATNet for sparse voxels, which first interpolates the original point cloud onto the sparse voxels by a splat operation, then convolves the occupied voxels by a convolve operation, and finally interpolates the output features back to the original point cloud by a slice operation. This method significantly improves efficiency by using an index structure to convolve only the occupied voxels. To alleviate the impact of point cloud scale on performance, Rosu et al. [57] proposed LatticeNet with PointNet as the backbone, which can convolve sparse voxels quickly while keeping the computational overhead low and then projects the features back to the point cloud through the DeformSlice module. This method has shown effectiveness in handling large-scale point clouds. Tchapmi et al. [58] proposed an end-to-end semantic segmentation network, SEGCloud, built on a 3D fully convolutional network. It first voxelizes the point cloud, then applies a 3D CNN to generate downsampled voxel labels, transforms the voxel labels back to point labels by a trilinear interpolation layer, and finally combines the point features with the interpolated scores in a 3D fully connected conditional random field as postprocessing to obtain fine-grained semantic information. However, due to the sparsity of the point cloud itself, the voxelized units are still sparse and discrete, which causes unnecessary computational overhead. In response, researchers have tried to transform sparse point clouds into nonuniform voxels, for example, using an octree instead of fixed-size voxels. OctNet, proposed by Riegler et al. [59], uses the octree to divide 3D scenes into nonuniform voxels of varying sizes according to the distribution density of points and allows computational resources to be concentrated in dense regions, which saves computational overhead to some extent. O-CNN, proposed by Wang et al. [60], uses the octree to divide the point cloud into several nodes, takes the average normal vector of the nodes as the input of the network, and uses a 3D CNN for feature learning. The complexity of the method grows quadratically with the depth of the octree, which saves computational resources to some extent and makes the method suitable for 3D classification and semantic segmentation of high-resolution voxels. For more effective handling of sparsely distributed points, Meng et al. [61] proposed a kernel-based interpolated variational autoencoder architecture that encodes the local geometry within each voxel and uses radial basis functions to compute a local, continuous representation within each voxel. This method provides richer fine-grained features without increasing the number of parameters, improving the expressive capacity and leading to more robust results. More recently, PCSCNet [62] reduces the discretization error of large voxels through a cross-entropy loss and a position-aware loss, maintaining efficiency at lower voxel resolutions. SIEV-Net [63] uses a hierarchical voxel feature encoding module and a height information complement module to minimize the height information lost during point feature aggregation. Table 3 compares the performance of voxel-based point cloud semantic segmentation methods on the datasets.
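As a minimal sketch of the basic voxelization step shared by these methods, the following example assigns points to a fixed-size grid, keeps only the occupied voxels, and labels each voxel by majority vote. The function name and the majority-vote rule are assumptions made for illustration; the surveyed networks use their own encodings.

```python
import numpy as np

def voxelize(points, labels, voxel_size):
    """Group points into occupied voxels of a regular grid.

    Returns the integer coordinates of the occupied voxels, a majority-vote
    label per voxel, and for every point the index of its voxel.
    """
    # Integer grid coordinates relative to the minimum corner of the cloud.
    coords = np.floor((points - points.min(axis=0)) / voxel_size).astype(np.int64)
    # Only occupied voxels are stored, which sidesteps the empty cells.
    voxels, inverse = np.unique(coords, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)

    num_classes = int(labels.max()) + 1
    votes = np.zeros((len(voxels), num_classes), dtype=np.int64)
    np.add.at(votes, (inverse, labels), 1)      # count labels per voxel
    return voxels, votes.argmax(axis=1), inverse

pts = np.array([[0.0, 0.0, 0.0], [0.1, 0.1, 0.1], [0.2, 0.0, 0.0], [1.5, 0.0, 0.0]])
lbl = np.array([0, 1, 1, 2])
voxels, voxel_labels, point_to_voxel = voxelize(pts, lbl, voxel_size=1.0)
point_labels = voxel_labels[point_to_voxel]     # nearest-voxel back-projection
```

Copying the voxel label back to every contained point, as in the last line, is the crudest form of back-projection; it shows the discretization error (the first point loses its original label) that trilinear interpolation or per-voxel encodings aim to reduce.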

Point-Based Segmentation
Rule-based segmentation methods overcome the limitation that 2D CNNs cannot be directly applied to point clouds, but they face challenges such as loss of key information and high complexity. To address these challenges, researchers have turned to point-based segmentation, which can be divided into multilayer perceptron-based (MLP-based), recurrent neural network-based (RNN-based), graph convolution network-based (GCN-based), and transformer-based methods. Figure 8 shows the basic framework of the point-based segmentation network. It should be noted that the internal structures of the encoder and decoder are different for each network.

MLP-Based Methods
MLP is a commonly used architecture in point cloud processing, which is a feedforward neural network consisting of multiple fully connected layers. MLP-based methods usually use shared MLPs as the basic structure of the network, which means that all points in the point cloud share the same parameters. Qi et al. [64] proposed a pioneering network, PointNet, which takes the original point cloud as input, extracts the features of each point independently using shared MLPs, and then aggregates the features of all points with a symmetric function, maximum pooling, which keeps the maximum value in each feature dimension, to obtain the global representation. PointNet effectively addresses the permutation invariance and rotation invariance of point clouds. However, the local information and the interactions with neighboring points learned by PointNet are insufficient because deeper layer features cannot cover a larger spatial extent. To address this limitation, Qi et al. [65] improved PointNet by proposing a deep hierarchical network, PointNet++, which consists of a sampling layer, a grouping layer, and a PointNet backbone network. First, the farthest point sampling (FPS) algorithm repeatedly selects the point farthest from those already chosen as the center of a local region, so that the number of points is reduced while the main geometric structure is preserved, and then the local regions are constructed by the grouping module. Finally, the backbone network is used to recursively learn the features of the local regions. Although this network solves the problem of local feature extraction, its capacity to capture information such as direction and distance between points is still insufficient. Jiang et al. [66] developed a PointSIFT module that can efficiently explore the neighborhoods in multiple directions. The module uses orientation-encoding units to describe eight crucial orientations and learns multiscale features by stacking several orientation-encoding units. To capture the correlation between neighboring points, Zhao et al. [67] designed PointWeb, a network based on the adaptive feature adjustment (AFA) module. The network densely connects each point with the others in a local region, exploring in depth the interactions between point pairs. For each local region, an impact map carrying the element-wise impact between point pairs is applied to the feature difference map. Then, the features are adaptively pushed and pulled according to the learned impact indicators, which achieves a dynamic adjustment and assignment of features that benefits point cloud classification and segmentation. SO-Net, proposed by Li et al. [68], models the spatial distribution of point clouds by constructing a self-organizing map (SOM), performs hierarchical feature extraction on each point and SOM node, and then aggregates the obtained set of feature vectors into global features by average pooling. Finally, the semantic features representing the input point cloud are recovered from the global features. Zhang et al. [69] proposed a novel yet effective convolution operator, ShellConv, which uses the statistics of concentric spherical shells to define representative features and resolve the ambiguity of point order, enabling conventional convolution to be performed on these features. Based on ShellConv, an efficient neural network named ShellNet is further built, which recursively computes each spatial neighborhood and aggregates the statistics of different regions by maximum pooling while using fewer layers, balancing efficiency and accuracy.
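The farthest point sampling step used by PointNet++ can be sketched as follows. This is a straightforward O(N x M) NumPy illustration under the assumption of a fixed starting point; the function name and the `start_index` parameter are hypothetical, and practical implementations usually pick the first point at random and run on the GPU.

```python
import numpy as np

def farthest_point_sampling(points, num_samples, start_index=0):
    """Return the indices of `num_samples` points chosen by FPS.

    Each new point is the one whose distance to the already selected set
    is largest, which spreads the samples over the whole shape.
    """
    n = points.shape[0]
    selected = np.empty(num_samples, dtype=np.int64)
    # Squared distance from every point to its nearest selected point.
    min_dist = np.full(n, np.inf)
    current = start_index
    for i in range(num_samples):
        selected[i] = current
        dist = np.sum((points - points[current]) ** 2, axis=1)
        min_dist = np.minimum(min_dist, dist)
        current = int(np.argmax(min_dist))
    return selected

pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [10.0, 0.0, 0.0]])
centers = farthest_point_sampling(pts, num_samples=3)
```

In this toy example the sampler starts at the first point, jumps to the far outlier, and only then picks an interior point, which illustrates why FPS preserves the overall extent of a shape better than uniform random sampling.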
To further improve the networks' capacity to understand 3D scenes, researchers have introduced the attention mechanism into MLP-based methods. Yan et al. [70] designed PointASNL, which is robust to noisy point clouds thanks to an adaptive sampling (AS) module based on the attention mechanism. This module adaptively adjusts the features of the sampled points by augmenting the neighborhood points obtained from the FPS algorithm and reweighting the features according to the learned attention weights, thus effectively mitigating the bias caused by outliers in the original point clouds. Hu et al. [71] proposed a lightweight neural network, RandLA-Net, for large-scale point cloud processing, which introduces local spatial encoding (LocSE) units to preserve geometric features and uses an attention-based pooling unit for feature aggregation. By stacking LocSE units and pooling units to enlarge the receptive field, the network effectively enhances the understanding of local regions and achieves a significant improvement in computational efficiency. Ma et al. [72] provided a new perspective by designing the pure residual MLP network PointMLP, a model equipped with a lightweight geometric affine module that achieves state-of-the-art performance on the ScanObjectNN dataset. Table 4 compares the performances of MLP-based point cloud semantic segmentation methods on the datasets.
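The idea of attention-based pooling over a local neighborhood can be illustrated with a short sketch. It is written in the spirit of the pooling unit described for RandLA-Net but is not that network's implementation: the single linear scoring layer, the function names, and the random untrained weights are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attentive_pooling(neighbor_features, weight):
    """Aggregate (N, K, D) neighbor features into (N, D) point features.

    A shared linear layer produces one score per neighbor and channel; the
    scores are normalized over the K neighbors and used as weights in a sum.
    """
    scores = softmax(neighbor_features @ weight, axis=1)   # (N, K, D)
    return (scores * neighbor_features).sum(axis=1)        # (N, D)

rng = np.random.default_rng(0)
feats = rng.normal(size=(5, 8, 4))       # 5 points, 8 neighbors, 4 channels
w = rng.normal(size=(4, 4))              # untrained placeholder weights
pooled = attentive_pooling(feats, w)
```

With all-zero weights the scores become uniform and the operation reduces to average pooling, so learned weights can be read as a soft, data-dependent alternative to maximum or average pooling.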

RNN-Based Methods
In the field of 2D image processing, RNNs can capture the contextual information between pixels and thereby significantly improve the learning ability of deep neural networks. In the field of point cloud processing, RNNs can also be used to learn the contextual information between point pairs. Fan et al. [73] proposed a point recurrent neural network for moving point cloud processing, which fuses pointwise features and state features by correlating spatiotemporal information and thus overcomes the limitation that features of points from different time steps cannot be combined directly because point clouds are unordered. To better capture multiscale contextual interaction information and extract adjacent features, Ye et al. [74] proposed a novel end-to-end semantic segmentation network named 3P-RNN to solve the problem of extracting local geometric features under different point density distributions. 3P-RNN consists of two main components, namely, a pointwise pyramid pooling module and a bidirectional hierarchical RNN; the former extracts contextual interaction information at different scales to achieve multilevel semantic feature fusion, and the latter captures long-range spatial relations. Huang et al. [75] designed a lightweight segmentation network, RSNet, which can efficiently learn the local geometric structure. The network consists of a slice pooling layer, RNN layers, and a slice unpooling layer. Specifically, the slice pooling layer maps the features of unordered points into an ordered sequence of feature vectors, which is then fed into the RNN layers for processing and updating, thus achieving effective interaction of spatial contextual information. In the end, the slice unpooling layer reverses the projection and assigns the updated features to each point to obtain the semantic segmentation results.
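The slice pooling and unpooling steps described for RSNet can be sketched as follows. This is a simplified illustration that slices along a single axis and uses maximum pooling; the function names, the `slice_width` parameter, and the zero fill for empty slices are assumptions, and the RNN layers that would update the slice sequence are omitted.

```python
import numpy as np

def slice_pool(points, features, slice_width, axis=0):
    """Map unordered point features to an ordered sequence of slice features."""
    coord = points[:, axis]
    slice_idx = np.floor((coord - coord.min()) / slice_width).astype(np.int64)
    num_slices = int(slice_idx.max()) + 1

    pooled = np.full((num_slices, features.shape[1]), -np.inf)
    np.maximum.at(pooled, slice_idx, features)      # max over points per slice
    empty = np.bincount(slice_idx, minlength=num_slices) == 0
    pooled[empty] = 0.0                             # slices without points
    return pooled, slice_idx

def slice_unpool(slice_features, slice_idx):
    """Assign every point the feature of the slice it belongs to."""
    return slice_features[slice_idx]

pts = np.array([[0.0, 0, 0], [0.4, 0, 0], [1.2, 0, 0], [3.1, 0, 0]])
feats = np.array([[1.0, 0.0], [2.0, 5.0], [3.0, 1.0], [4.0, 4.0]])
sequence, idx = slice_pool(pts, feats, slice_width=1.0)
# An RNN would process `sequence` here before the features are unpooled.
per_point = slice_unpool(sequence, idx)
```

Because the slices are ordered along the chosen axis, `sequence` is a well-defined input for a recurrent layer even though the points themselves are unordered.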
Zhao et al. [76] proposed DAR-Net, a point cloud segmentation network supporting dynamic feature aggregation that takes into account the differences between the sizes of objects in complex 3D scenes. The network uses an RNN to recursively process unordered point clouds, forms a backbone of key points by aggregating middle-level features, and adaptively adjusts the receptive field of the model as well as the key point weights, thus capturing both local and global features accurately. Experimental results show that the proposed approach significantly outperforms static pooling methods when dealing with large-scale point clouds. The 3DCNN-DQN-RNN proposed by Liu et al. [77] uses a 3D CNN to learn and encode the location, color, and other attributes of points at multiple scales; efficiently locates the points belonging to a particular category through a deep Q-network (DQN); and feeds the correlated feature vectors into a residual RNN to extract richer high-level features. Table 5 compares the performances of RNN-based point cloud semantic segmentation methods on the datasets.

GCN-Based Methods
A graph convolutional network (GCN) models a real-world problem as the interaction and information transfer between neighboring nodes in a graph and has been widely used in knowledge graphs, recommendation systems, and other fields. Researchers have extended the applicability of GCNs by transforming the point cloud into a graph structure and formulating computational strategies for nodes and edges, fully exploiting the interaction between point pairs and effectively transferring the learned information, which provides a new paradigm for deeper semantic perception. To solve the problem of feature homogeneity in graphs, Simonovsky et al. [78] designed a point cloud segmentation network that divides spatial regions with fixed radii, connects the neighboring points in the same region with edges, and assigns attributes such as coordinates, color, and intensity values to construct the graph structure. By performing edge-conditioned convolution (ECC) in the neighborhood, the edge features between point pairs in the local area are extracted. Wang et al. [79] proposed the dynamic graph convolutional neural network (DGCNN), which extracts the features of each center point by constructing local neighborhood graphs and using dynamic edge convolution (EdgeConv) to obtain the edge feature vectors between the center point and its k-nearest neighboring points. Then, the global features and the local spatial features output by each EdgeConv are fused to further improve the network's capacity to recognize similar features in the feature space and its semantic segmentation performance. However, DGCNN has a high computational complexity when performing EdgeConv and suffers from vanishing gradients. Lei et al. [80] proposed a discrete spherical convolution (SPH3D) operator, which divides the spatial region nonuniformly in the spherical coordinate system and assigns a set of trainable parameters to extract features. This metric-based kernel is applied in a GCN without relying on edge convolution, which improves computational efficiency. Lu et al. [81] designed PointNGCNN, which takes the feature matrix and Laplacian matrix of each neighborhood as inputs and uses a neighborhood graph filter constructed from Chebyshev polynomials to learn neighborhood geometric features in Cartesian space and feature space. Finally, the pointwise semantic descriptors are obtained by fully connected layers. Experimental results show that PointNGCNN achieves good performance in 3D recognition and segmentation tasks. Li et al. [82] proposed point convolution (Pconv) and point pooling (Ppool) for 3D points based on the graph structure and designed a novel point cloud feature learning network, PointVGG. Pconv learns the geometric information between the center point and its neighboring points, and Ppool acquires a more detailed local geometric representation by aggregating points. Zhang et al. [83] proposed the AF-GCN architecture based on graph convolution and the self-attention mechanism. The network uses graph convolution to learn local features in the shallow encoding stages, and in the deeper stages, long-range contexts are modeled more efficiently by the graph attentive filter (GAF).
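An EdgeConv-style layer of the kind described for DGCNN can be sketched in a few lines. This is a minimal illustration under stated assumptions: a single linear layer with ReLU stands in for the edge MLP, the weights are random placeholders, and the brute-force k-nearest-neighbor search needs O(N^2) memory; the function names are hypothetical.

```python
import numpy as np

def knn_indices(features, k):
    """Indices of the k nearest neighbors of every row (self excluded)."""
    diff = features[:, None, :] - features[None, :, :]
    dist = np.sum(diff ** 2, axis=-1)
    np.fill_diagonal(dist, np.inf)
    return np.argsort(dist, axis=1)[:, :k]

def edge_conv(features, k, weight):
    """One EdgeConv-style layer: edge features, shared layer, max over neighbors."""
    idx = knn_indices(features, k)                        # (N, k)
    neighbors = features[idx]                             # (N, k, D)
    center = np.repeat(features[:, None, :], k, axis=1)   # (N, k, D)
    # Edge input combines the center feature and the relative offset.
    edge_input = np.concatenate([center, neighbors - center], axis=-1)
    edge_features = np.maximum(edge_input @ weight, 0.0)  # linear layer + ReLU
    return edge_features.max(axis=1)                      # (N, D_out)

pts = np.array([[0.0, 0, 0], [1.0, 0, 0], [3.0, 0, 0], [10.0, 0, 0]])
rng = np.random.default_rng(0)
w1 = rng.normal(size=(6, 8))      # untrained placeholder weights
out1 = edge_conv(pts, k=2, weight=w1)
w2 = rng.normal(size=(16, 8))
out2 = edge_conv(out1, k=2, weight=w2)   # graph rebuilt in feature space
```

Calling the layer again on its own output recomputes the neighbors in feature space rather than in coordinate space, which is what makes the graph dynamic.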
For most GCNs, convolution operations are usually only suitable for feature extraction on graphs with a fixed structure. Considering the complexity of graph structures and the heterogeneity of connection modes, Zhang et al. [84] efficiently organized the point cloud by constructing a hybrid index structure based on Kd-Octree and generated patch-based feature descriptors at leaf nodes as input for 3D pairwise point cloud matching. Li et al. [85] designed an adaptive graph convolutional neural network, AGCN, which can take graphs of arbitrary size as input. The network uses spectral graph convolution (SGC) to adapt the graph topology to the scale of the inputs and the relevance of contextual information, which alleviates the problem of inadequate learning of contextual information and geometric features. Landrieu et al. [86] designed a novel deep-learning-based network to address the challenge of large-scale point clouds in semantic segmentation, which partitions the original point cloud into geometrically homogeneous elements in an unsupervised manner, represents them as superpoints, and constructs a superpoint graph (SPG). SPGs provide rich edge features and accurate representations of contextual relationships between object parts in point clouds by embedding superpoints and using a gated recurrent unit (GRU), and the method shows impressive results on the Semantic3D and S3DIS datasets. Geng et al. [87] proposed a structural representation algorithm for local embedding superpoint graphs (LE-SPG) and then designed a gated integration graph convolutional network (GIGCN) for feature learning and semantic segmentation of the graphs. To prevent vanishing or exploding gradients during training, the hidden states of the gated recurrent units (GRUs) in each layer are integrated using a new layer called gated hidden state integration (GHSI), and backpropagation is strengthened by giving the loss function direct access to each layer, fully absorbing the features from different layers.
GCN-based methods extend convolution operations and graph representations to 3D space, which provides a new research idea for processing raw point clouds. At present, researchers have enhanced the learning capacity of networks for local and global information by introducing attention mechanisms and constructing dynamic graphs, which have led to significant achievements of GCNs in the field of point cloud processing. Table 6 compares the performances of GCN-based point cloud semantic segmentation methods on the datasets.

Transformer-Based Methods
Transformer is a deep learning architecture based on self-attention mechanisms, which was originally applied to natural language processing (NLP) tasks, such as sentiment analysis and machine translation. In recent years, inspired by the fruitful results in NLP, researchers have applied Transformer to the field of computer vision and achieved impressive results [88][89][90]. Point clouds are essentially sets of unordered, unstructured sparse points, and the core of the Transformer architecture is the self-attention mechanism and the feed-forward neural network, neither of which depends on the order of the points; the architecture is therefore more suitable for point cloud processing than CNN architectures.
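The claim that self-attention does not depend on the order of the points can be checked with a small sketch. It implements plain single-head scaled dot-product attention without positional encoding, which is an assumption made for illustration and not the attention variant of any particular surveyed network; the weights are random placeholders.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over N point features."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[1])            # (N, N) pairwise scores
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ v                                # (N, D_out)

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 3))                           # 6 points with xyz features
w_q, w_k, w_v = (rng.normal(size=(3, 4)) for _ in range(3))

perm = rng.permutation(6)
out = self_attention(x, w_q, w_k, w_v)
out_perm = self_attention(x[perm], w_q, w_k, w_v)
same_up_to_order = np.allclose(out_perm, out[perm])
```

Reordering the input rows only reorders the output rows in the same way (permutation equivariance), so a subsequent symmetric pooling over the points yields an order-independent global feature.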
Guo et al. [91] introduced the Transformer architecture into point cloud processing and proposed a novel network, PCT, for point cloud classification and semantic segmentation. The network uses coordinate-based input embedding modules and robust offset-attention modules, relies on the inherent order invariance of transformers to avoid having to define an ordering of the point cloud, and conducts feature learning through the self-attention mechanism. Zhao et al. [92] designed self-attention layers for point clouds and used them to construct Point Transformer networks for point cloud processing. The layer is based on self-attention operators that use the subtraction relation and add a trainable, parameterized position encoding to both the attention vector and the transformed features. In addition, residual point transformer blocks are constructed with the Point Transformer layer as the core to facilitate the exchange of information between local feature vectors. Engel et al. [93] designed a robust multihead attention network for point clouds, which constructs input sequences by top-k operations and extracts the latent features of local geometric and spatial relations from different subspaces based on the scores learned by SortNet. Then, the local features are related to the global features through a multihead attention mechanism, which better captures spatial relationships and geometric features and demonstrates competitive performance in point cloud classification and segmentation tasks.
To address the large computational overhead of multihead attention mechanisms, Yang et al. [94] designed a point cloud processing network named PAT with group shuffle attention (GSA) and Gumbel subset sampling (GSS) as the core operations, which largely improved performance by deeply mining the relationships between the elements of point sets. GSA is a parameter-efficient self-attention operation for learning relationships between points. GSS serves as an effective alternative to the widely used FPS, with the advantages of being permutation-invariant, task-agnostic, and differentiable, which enables effective learning on high-dimensional representations. Zhong et al. [95] designed a novel point-based network named multilevel multiscale transformer (MLMST), which consists of three modules: point pyramid transformer (PPT), multiscale transformer (MST), and multilevel transformer (MLT). PPT captures context information from different resolutions and scales, MST models the context interaction across different scales and enhances the expressive capability of the network, and MLT learns the cross-level information interaction to further aggregate geometric and semantic features. Han et al. [96] designed a deep neural network named DTNet, mainly consisting of dual point cloud transformer (DPCT) modules. DPCT enhances information transfer and interaction by aggregating pointwise and channelwise multihead self-attention models, which efficiently learn contextual features at different resolutions and scales from the perspectives of spatial position and channel, and by combining the outputs of the different modules element by element, which in turn improves the expressive capability of the network. Lai et al. [97] proposed Stratified Transformer, which captures long-range contexts and demonstrates high performance in point cloud segmentation. For each query point, it samples nearby points densely and distant points sparsely in a stratified way. In addition, to cope with the challenges posed by irregular point arrangements, the network's representation and generalization capabilities are further enhanced by adaptive contextual relative position encoding and point embedding, which achieve an effective fusion of local and long-range features. Most existing Transformer-based methods provide the same feature-learning paradigm for all 3D points, ignoring the huge differences in object sizes in 3D scenes. In this regard, Zhou et al. [98] designed a novel size-aware Transformer (SAT) framework that introduces multiscale features to each attention layer and allows each point to adaptively choose its attentive fields through the multigranular attention (MGA) scheme and the reattention module. Experimental results show that SAT achieves balanced performance across the categories of the S3DIS and ScanNet datasets, which demonstrates its advantage in modeling categories of different sizes. Table 7 compares the performance of Transformer-based point cloud semantic segmentation methods on the datasets.

Prospects
As the focus of research in 3D computer vision, point cloud semantic segmentation is playing an increasingly prominent role in a large number of emerging industries, including smart cities, automatic navigation systems, and virtual reality. Based on the existing research, this paper summarizes the key issues and development trends and provides the following outlook on future research directions.
(1) Multimodal data processing. Point cloud semantic segmentation methods from different research perspectives are based on different data forms (e.g., 2D images, voxels, point clouds). However, data of a single form can hardly support an all-around understanding and representation of 3D scenes. To this end, Xu et al. [99] proposed a point cloud semantic perception network based on voxels and graph-structured data. The network transforms the raw point cloud into voxels, constructs an adjacency graph for spatial contexts, and encodes the representation to associate local geometric features between voxels. Liu et al. [100] proposed a dual-branch network named PVCNN for parallel processing of points and voxels, in which the voxel-based feature extraction branch aggregates coarse-grained features in the neighborhood, and the point-based branch uses MLP to extract fine-grained features. Therefore, designing lightweight and efficient multimodal data processing networks is a promising way to improve the performance of point cloud semantic segmentation methods.
(2) Point cloud semantic segmentation in remote sensing. The point cloud is one of the common data carriers in the field of remote sensing. At present, although there are some point cloud datasets with large data volumes, such as SemanticKITTI, Semantic3D, and DALES for outdoor scenes, the existing data are still insufficient to satisfy the demand for semantic segmentation of super-large-scale urban scenes. For this reason, it is important to construct high-quality and reliable spatiotemporal remote sensing datasets to support research on remote sensing point cloud semantic segmentation. In a recent study, Unal et al. [101] proposed a scribble-based annotation strategy that can effectively simplify data annotation and published the first LiDAR point cloud dataset based on this strategy, ScribbleKITTI. This weak annotation approach does not need to finely annotate the boundaries of objects but simply determines the start and end points of a line annotation, thus saving human, material, and financial resources to a great extent. Therefore, using this strategy to simplify the annotation of datasets may be a research direction and development trend of remote sensing point cloud semantic segmentation in the future. In addition, because remote sensing and computer vision have different focuses, the performance evaluation system used in computer vision is not fully applicable in remote sensing. How to build a standardized and unified performance evaluation system is a focus of future research on remote sensing point cloud semantic segmentation.
(3) Weakly supervised and unsupervised learning. The performance of deep-learning-based methods relies on a large amount of data, but the existing datasets are far from satisfying this need. By training networks with a weakly supervised learning strategy that needs only a small amount of weakly labeled data, or with an unsupervised learning strategy, the data hunger problem caused by insufficient datasets can be largely alleviated. In this regard, Yang et al. [102] proposed an unsupervised point cloud semantic segmentation network that combines co-contrastive learning with a mutual attention sampling strategy, which deeply explores the contextual interactions between point pairs and accurately identifies points with strong cross-domain correlations through an object sampler and a background sampler, showing impressive performance on the ScanObjectNN and S3DIS datasets. Xie et al. [103] designed an unsupervised pretraining strategy, PointContrast, which dynamically adjusts the distance between features by comparing the matching of points before and after point cloud transformation in different views of the same scene. The method demonstrates its effectiveness in point cloud semantic segmentation and 3D object detection tasks across six different benchmarks covering indoor, outdoor, and synthetic datasets, while also showing that the learned representation can generalize across domains.
(4) Few-shot and zero-shot learning. Deep learning is a data-driven technique that relies heavily on labeled samples. Because available datasets are limited by small size, uneven quality, and unbalanced data volumes across categories, few-shot and zero-shot learning strategies have been developed to reduce the overdependence on sample data. Specifically, the few-shot learning strategy [104,105] extracts key information from only a small number of labeled samples so that the pretrained model can generalize to categories that did not occur during training. The zero-shot learning strategy [106,107] trains models on a limited number of samples whose categories have no intersection with those in the test sets and builds cross-domain representations by learning cross-domain features. The few-shot and zero-shot learning strategies provide an instructive research concept for achieving point cloud classification and semantic segmentation in the absence of sample data.

Conclusions
Point cloud semantic segmentation is a popular research topic in 3D computer vision. To segment large-scale point clouds more efficiently and robustly, researchers have developed different types of methods in the past few years and achieved some significant progress. In this paper, we discuss the diversity, information richness, nonuniformity, nonstructure, and disorder of point clouds and summarize representative public datasets and mainstream evaluation metrics. Based on a broad review, we believe that the size, quality, and diversity of datasets are the key factors for training deep models. The main challenges of existing 3D datasets can be summarized as follows: (1) Differences in sensor types and data acquisition platforms create obstacles for models that must process different datasets. (2) The density of point clouds in 3D space is extremely nonuniform and the datasets are commonly long-tailed, which leads to the uneven focus of the models on different object categories in scene understanding. (3) The diversity of 3D dataset types leads to large differences in the categories and numbers of objects in each scene, which poses challenges to the cross-domain learning capability of the models.

Figure 1. Illustration of survey pipeline. Different colors represent specific sections. Best viewed in color.

Figure 3. Illustration of the nonuniformity of point clouds.

Figure 4. Illustration of the nonstructure of point clouds. Different colors represent different categories.

If a point cloud of size N × 3 is input into the neural network, there are N! possible orderings of its points. As shown in Figure 5, changing the order of the key points describing the same desk generates different point cloud matrices, although the point cloud itself is not affected by the order in which it is physically stored in the computer. How to effectively solve the problem of disorder has become the key to the tasks of point cloud registration, point cloud classification, and point cloud semantic segmentation.
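The disorder property can be illustrated with a short sketch. It assumes one widely used remedy, a shared per-point transform followed by a symmetric aggregation (max pooling, as popularized by PointNet-style networks); the random weights, the feature width of 8, and the function name `global_feature` are hypothetical and stand in for a trained network.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

# A toy point cloud with N = 5 points, stored as an N x 3 matrix.
points = rng.normal(size=(5, 3))

# Reversing the row order gives a different matrix for the same set of points.
shuffled = points[::-1]

# Number of distinct orderings of N points: N! (here 5! = 120).
num_orderings = math.factorial(len(points))

def global_feature(pts, weights):
    """Shared per-point linear map + ReLU, then max pooling over points.

    Max pooling is a symmetric function, so the result does not depend
    on the order of the rows in `pts`.
    """
    per_point = np.maximum(pts @ weights, 0.0)
    return per_point.max(axis=0)

weights = rng.normal(size=(3, 8))  # hypothetical, untrained weights
f_original = global_feature(points, weights)
f_shuffled = global_feature(shuffled, weights)
```

Although `points` and `shuffled` differ as matrices, the two global features agree, which is the invariance that order-sensitive operators applied directly to the matrix do not provide.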

Figure 5. Illustration of the disorder of point clouds.

Figure 7. Illustration of point cloud regularization. Points of different colors represent different categories.

Figure 8. Basic frameworks of point-based CNNs. Points of different colors represent the learned features of different car parts.

Table 1. Summary of mainstream datasets for point cloud semantic segmentation (where R ← real-world environment, S ← synthetic environment in the Type column; Oc ← object classification, Ps ← part segmentation, Is ← indoor segmentation, Os ← outdoor segmentation, Hs ← heritage segmentation, Us ← urban segmentation in the Application Scenario column; Tm ← thousand models, Tf ← thousand frames, To ← thousand objects, Mp ← million points in the Size column; ALS ← airborne laser scanning, MLS ← mobile laser scanning, TLS ← terrestrial laser scanning, - ← information not available in the Sensor column).

Table 2. Comparison of image-based point cloud semantic segmentation methods.

Table 3. Comparison of voxel-based point cloud semantic segmentation methods.

Table 4. Comparison of MLP-based point cloud semantic segmentation methods.

Table 5. Comparison of RNN-based point cloud semantic segmentation methods.

Table 6. Comparison of GCN-based point cloud semantic segmentation methods.

Table 7. Comparison of Transformer-based point cloud semantic segmentation methods.