Semantic Point Cloud Segmentation with Deep-Learning-Based Approaches for the Construction Industry: A Survey

Abstract: Point cloud learning has recently gained strong attention due to its applications in various fields, like computer vision, robotics, and autonomous driving. Point cloud semantic segmentation (PCSS) enables the automatic extraction of semantic information from 3D point cloud data, which makes it a desirable task for construction-related applications as well. Yet, only a limited number of publications have applied deep-learning-based methods to address point cloud understanding for civil engineering problems, and there is still a lack of comprehensive reviews and evaluations of PCSS methods tailored to such use cases. This paper aims to address this gap by providing a survey of recent advances in deep-learning-based PCSS methods and relating them to the challenges of the construction industry. We introduce its significance for the industry and provide a comprehensive look-up table of publicly available datasets for point cloud understanding, with evaluations based on data scene type, sensors, and point features. We address the problem of class imbalance in 3D data for machine learning, provide a compendium of commonly used evaluation metrics for PCSS, and summarize the most significant deep learning methods developed for PCSS. Finally, we discuss the advantages and disadvantages of the methods for specific industry challenges. Our contribution, to the best of our knowledge, is the first survey paper that comprehensively covers deep-learning-based methods for semantic segmentation tasks tailored to construction applications. This paper serves as a useful reference for prospective research and practitioners seeking to develop more accurate and efficient PCSS methods.


Introduction
To understand the semantic segmentation of point clouds in the 3D domain, it is helpful to first understand the origins of this technique in the 2D domain. Semantic segmentation, in the first place, is a computer vision technique that involves partitioning an image into multiple semantic regions or segments, where each segment corresponds to a specific region of interest within the image. The goal of semantic segmentation is to classify and assign a label to every pixel in an image, indicating the category or class it belongs to. Unlike other types of image segmentation, such as instance segmentation or boundary detection, semantic segmentation focuses on understanding the high-level content of an image rather than individual objects or their boundaries. It aims to capture the meaning or semantics of the image by assigning a meaningful label to each pixel. Before the introduction of machine learning, early adopters of semantic segmentation in images relied on traditional computer vision techniques. They extracted low-level features like color, texture, edges, and gradients from the images. Various segmentation algorithms were then applied to partition the image into regions based on these features. Techniques such as thresholding, region growing, and edge-based segmentation were commonly used [1]. This process aimed to identify distinct regions or boundaries within the image for further analysis and classification. However, these approaches were limited by the need for handcrafted features and explicit rules, making them less flexible and robust compared to machine-learning-based methods. Machine learning models, such as deep neural networks, have since emerged as powerful tools that can automatically learn and extract complex features from images, significantly improving the accuracy and generalizability of semantic segmentation. The development of convolutional deep neural networks alleviated the need for engineered features and produced a powerful representation that 
captures texture, shape, and contextual information [2]. The introduction of fully convolutional networks (FCNs) to semantic segmentation tasks alleviated the need for preprocessing images into low-resolution superpixels; an FCN is trained end-to-end to predict a semantic label for each pixel in an image [3]. The FCN was able to achieve state-of-the-art performance on the PASCAL VOC 2012 dataset [4] and was the first model to surpass the performance of traditional computer vision techniques.
Similarly, the semantic segmentation of point cloud data refers to the process of classifying individual points in three-dimensional space into semantic categories or labels. It involves assigning a meaningful label to each point based on its characteristics and the context of the surrounding points. The objective is to segment the point cloud into different regions or objects, as shown in Figure 1, where each region or object is associated with a specific semantic class. Deep-learning-based segmentation for point clouds opens up new opportunities for applications such as robotics, industrial automation, and 3D scene understanding, including for the construction industry, enabling machines to perceive and interact with the environment in three-dimensional space. Among various scene understanding problems, 3D semantic segmentation allows for finding accurate object boundaries along with their labels in 3D space, which is useful for fine-grained tasks [5]. The architecture, engineering, and construction (AEC) sector ranks as one of the most intensive fields where vision-based systems are used to facilitate decision-making processes during the construction phase. Clutter and disorder make efficient monitoring of construction sites extremely tedious and difficult. Due to their intrinsic nature, job sites are prone to management failures such as unsatisfactory construction quality, schedule delays, extra costs, and even injuries [6]. Multiple surveillance tasks, especially site monitoring, are already performed with the help of image-based computer vision. Adding 3D data has the potential to significantly increase information density and the level of automation. Below, we list practical examples of applications where learning-based point cloud segmentation can benefit quality, efficiency, and safety in construction. Shape and position control in concrete casting and precast installation requires finding accurate object/segment boundaries [7]. Volume calculations in the 
construction and demolition (C&D) phases can be automated from laser scanner, UAV, or satellite synthetic aperture radar (SAR) footage with the help of semantic segmentation [8,9]. Reverse engineering digital twins of existing buildings (as built and scan to BIM) requires assigning element classes to scan data to reconstruct element geometries [10,11]. Site progress monitoring can be automated through 3D component recognition and comparison with a planned database [12,13].
In addition to many surveillance applications, semantic segmentation of real-time sensor data as a subfield of computer vision is one crucial component for guiding (partially) autonomous robots. We expect robots to take over even more construction tasks in the future, but sites are constantly changing environments and scattered with moving obstacles. Thus, real-time scene understanding will be mandatory. Price drops for sensor hardware have ultimately made it economically feasible to install the necessary hardware. However, software development is needed to utilize the information, as the following examples of non-learning approaches illustrate.
Wang et al. [14] used colored point cloud segmentation to estimate the position of precast concrete rebar elements. With appropriate training data and a learning-based approach, this can be applied to all types of construction equipment that robots interact with. Automated guided vehicles (AGVs) require vehicle guidance and collision avoidance. This can be performed with cameras, but imaging sensors rely on good lighting conditions. This is not always the case on construction sites, and light detection and ranging (LiDAR) can be a solution for this [15]. Finally, semantic-enhanced sensor data can dramatically improve the overall safety of construction sites where humans and machines work together. Ray and Teizer [16] used ray tracing to calculate the blind spots of construction equipment, but the algorithm cannot distinguish between multiple obstacles. Machine safety decision making can be improved with learning-based point cloud understanding.
The scope of this paper is to focus on learning-driven PCSS with deep neural networks, which can learn deep geometric connections of complex structures by transforming the feature space into a high-dimensional representation problem. The early adopters, PointNet [17] and PointNet++ [18], have proven that deep learning techniques are fundamental for many computer vision tasks in the sparse 3D point cloud domain. Multiple strategies were published in the following years, transferring approaches from image-based computer vision and natural language processing to point cloud processing (convolution-, recurrent-, graph-, and transformer-based methods). Within the civil engineering domain, only a few papers covered the use of deep learning to perform PCSS. Some remarkable publications automated as-built BIM generation [19,20], building reconstruction [21,22], and bridge part semantic segmentation [23,24]. Nevertheless, up to now, there is no settled trend for the civil engineering branch, and most approaches only feature outdated PointNet network architectures.
There have been previous reviews of deep learning for construction applications, from which this survey stands apart. Guo et al. [25] and He et al. [26] have provided extensive surveys on deep learning with point clouds, with the latter focusing specifically on semantic segmentation. However, these surveys primarily concentrate on 3D scene understanding with a general approach. Further, they do not address the developments in the recently dominant Transformer network architectures. Other related review studies, such as those conducted by Jacobsen and Teizer [27], Akinosho et al. [28], and Khallaf and Khallaf [29], have explored machine learning applications in the construction industry. While they touch on point-cloud-based object classification, they focus on imaging methods for construction site monitoring and lack an in-depth and unbiased comparison of various deep learning methods specifically tailored for point cloud segmentation. Therefore, this paper presents an unbiased assessment and detailed comparison of the most popular deep learning methods developed in recent years, emphasizing their applicability to point cloud segmentation in the construction industry. For the first time, we address a critical challenge in this domain: the scarcity of industry-specific datasets for training deep learning models. To fill this gap, we have compiled an extensive table of relevant training and validation datasets, which, to our knowledge, has not been previously published to this extent.
By offering novel insights and original perspectives in comparison to existing review studies, our research reinforces its significance and contribution to the broader field of deep-learning-based approaches for the construction industry. Our goal with this article is to support future research in the fields of applied civil engineering and computer vision by establishing a solid foundation for further exploration. By outlining the most recent methodologies and resources available for training and validating new algorithms, we hope to unify benchmarking efforts and encourage wider participation within the research community.
The structure of this paper is as follows: Section 2 describes the methodology adopted in this study. Section 3 addresses the state of the art in industry-specific datasets for the construction domain. Section 4 discusses the issue of unbalanced datasets during training and presents potential solutions. Section 5 introduces and compares the most popular developments in deep-learning-based point cloud segmentation in recent years. Section 6 examines the findings of this survey, and finally, Section 7 concludes the paper.

Methodology
To provide a thorough view of PCSS with application in civil engineering, this literature review followed the steps described below. First, we searched for all topic-related publicly available datasets and compiled a look-up table of the resulting datasets. We started the internet search from the free and open paperswithcode.com [30] database and expanded this list by identifying further datasets mentioned in papers that used these datasets for training and validation on point cloud scene understanding and semantic segmentation tasks. These papers were found in a grid search conducted with the following online search engines: scopus.com, semanticscholar.org, and scholar.google.com, using a combination of keywords including "3D", "dataset", "point cloud", "semantic segmentation", "instance segmentation", and "scene understanding". The criterion for inclusion in our list was that the datasets must be freely accessible to all and relatable to civil engineering in a broader sense. The exploration resulted in 52 datasets that met our criteria and were considered further. First, we evaluated all datasets by the covered scene type, the utilized sensors, the data representation, the intended use case, the available features, and the dataset size. Second, we looked into the class imbalance problem in 3D data and how this is dealt with in the literature. Based on this, we collected solutions presented in the past to handle class imbalance. Third, we summarized the evaluation metrics that are commonly used by authors to evaluate and compare the performance of scene understanding tasks in 3D space. Lastly, we summarized and compared a selection of significant deep learning methods developed in the past for solving the 3D PCSS problem. We followed up on the work of [25,26], who had both previously conducted surveys on the general topic of deep learning on point clouds. We deepened this review to cover specific topics of the industry and collected the advantages and disadvantages of 
applying certain methods. As a measure of relevance in this survey, we quantified PCSS methods by their benchmark results (viz. mIoU) on different datasets, originally sourced from the paperswithcode.com [30] database and cross-validated with the associated papers. Representative benchmark results on the S3DIS dataset [31] are given in Section 5 and Table A1. This survey only considers publications published up until the end of December 2022.
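The mIoU metric used here for ranking can be computed per class as the intersection over union of the predicted and ground-truth point sets, averaged over all classes. The following is a minimal illustrative sketch (our own code, not taken from any surveyed method):

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union for flat per-point label arrays."""
    ious = []
    for c in range(num_classes):
        inter = np.sum((pred == c) & (gt == c))
        union = np.sum((pred == c) | (gt == c))
        if union > 0:  # skip classes absent from both prediction and ground truth
            ious.append(inter / union)
    return float(np.mean(ious))
```

Note that mIoU weights every class equally regardless of its point count, which is why it is preferred over plain accuracy for the imbalanced datasets discussed in Section 4.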

Public Datasets for Scene Understanding Methods
Efficient training in machine learning and the applicability of deep neural networks heavily depend on the available training data. While the performance of most shallow machine learning algorithms converges at some point, deep learning can increase performance relative to the amount of training data [32]. Creating huge datasets is not always feasible for individuals. Most protagonists in the machine learning community, therefore, necessarily rely on the use of freely available data. However, famous achievements demonstrate that the need for public datasets and for people to work together does not have to be a burden but offers great opportunities for accelerated progress.
The MNIST dataset, which features 70,000 gray-scale images of handwritten digits, was first published in 1999 [33] under a free-to-use, copy, share, and adapt license (Creative Commons license: CC BY-SA 3.0) and is still widely used as a benchmark for evaluating the performance of convolutional neural network (CNN) models [34]. The ImageNet database and its associated challenge, described by Russakovsky et al. [35], have been instrumental in driving the development of deep learning algorithms in the community toward groundbreaking innovation. Three-dimensional point cloud semantic segmentation (PCSS) is a fast-growing field of research, and the amount and variety of open-source datasets to train, validate, and compare new model architectures are almost exclusively credited to the car industry and autonomous driving [36-40].
Three-dimensional datasets designed for computer vision scene understanding applications in the construction industry are rare and deviate in pure volume from their automotive counterparts. A literature review was conducted to summarize existing civil-engineering-related open-source datasets and make them more visible for future research. The review was performed with a combination of search items including "3D", "dataset", "point cloud", "scene understanding", "semantic segmentation", and "instance segmentation", which revealed a range of different datasets. In total, 52 datasets were found that met the authors' criteria. To the best of the authors' knowledge, this list is the most complete database of construction-related 3D datasets for computer vision and scene understanding. A tree structure is presented in Figure 2 to visualize the extent of combinations in the wider scope of 3D scene understanding. The tree structure shows how certain sensors, data types, and data formats are commonly paired for given use cases in the literature. The main scene understanding tasks are arranged as use cases on top of the tree. The layers represent certain objectives, attributes, and methods that we extracted from the reviewed datasets. The connecting edges show the configurations found within the papers in the following Table 1. This table includes a comprehensive compilation of a wide range of freely available datasets that are relevant for different 3D point cloud scene understanding tasks in the scope of the construction industry. We categorized the available data by the binary type of acquisition {real-world; synthetic}, the content of the dataset {scene type}, the utilized hardware {sensors}, the dataset file format {representation}, the intended use case {application task}, and the available point cloud features {features}. In addition, the number of used semantic annotation classes is given. The presented datasets consider different scene types and were recorded with different 
sensors to capture 3D spaces. A quantitative evaluation of the look-up table for scene types and sensors is given in Figure 3. Sub-figure (a) shows the occurrence frequency of each scene type in our database. Sub-figure (b) breaks down which sensors were used to capture the scene types. As shown, some included datasets feature dynamic urban scenes from autonomous driving (D). We considered these datasets for two reasons. On the one hand, urban driving data contain road infrastructure and building facades in the peripheral field of view of the car's sensors. On the other hand, autonomous driving is a leading force for scene understanding and Simultaneous Localization And Mapping (SLAM) methods, which implies that the latest algorithms are often developed and benchmarked with urban driving data. The SemanticKITTI dataset contains semantic classes of structural elements (roads, sidewalks, buildings, poles, road signs, and other structures) besides the standard traffic unit classes, and it became a benchmark for autonomous driving and outdoor SLAM algorithms [74]. The NCLT dataset consists of omnidirectional imagery, 3D LiDAR, planar LiDAR, GPS, and odometry of outdoor and indoor scenes from the University of Michigan campus, captured by an autonomous Segway robot [49]. Unfortunately, the point data are not annotated; therefore, application for supervised machine learning is not possible without considerable effort to prepare the data. Besides autonomous driving, the second most covered spaces in the included datasets are residential indoor scenes. Most real-world indoor datasets were captured with RGB-D devices and were released between 2015 and 2017. This trend can be explained by the price drop of 3D imaging hardware in the mid-2010s, which caused a leap in the development of many successful semantic segmentation methods that use both RGB and depth features [56,83]. Consumer RGB-D cameras, like the Microsoft Kinect [84], are easy to use but not necessarily best suited for 
capturing large-scale spaces. Their small field of view requires stitching together many frames per scene in postprocessing, which often leads to registration errors and incomplete room geometry, as shown in Figure 4. Almost all indoor datasets have in common that they present fully furnished rooms of residential buildings and office spaces. In most cases, the class annotations focus on furniture besides the main building components (wall, floor, ceiling, door, and window). One of the first large indoor scene understanding datasets was the SUN RGB-D benchmark suite [52], which holds 10K RGB and depth images of residential rooms. Objects and room geometry are annotated in 2D by polygons and in 3D by bounding boxes and orientation. SceneNN [56] holds comparable content, but the scenes are reconstructed into triangle meshes and possess per-vertex and per-pixel annotation. One of the most impactful 3D datasets in the indoor domain with point-wise semantic class annotations available is ScanNetV2 [62], now in its second generation. The dataset offers RGB-D video streams and 3D camera poses of 1.5K residential scenes captured with low-cost sensor setups and crowd-sourced instance-level semantic annotation. VASAD [22] is a synthetic volume and semantic architectural dataset composed of six buildings. The focus of VASAD is to improve semantic segmentation and volumetric reconstruction for BIM modeling. VASAD's semantic classes only consist of building components, and its authors introduce a method to automatically simulate terrestrial laser scanning (TLS) behavior by raytracing virtual scanlines from iteratively added viewpoints to sample point clouds from mesh-based CAD models. The Stanford 3D Indoor Scene Dataset (S3DIS) [59] stands out as the biggest fully annotated indoor point cloud dataset and is heavily used for training and benchmarking in semantic segmentation tasks [85]. The S3DIS sources from joint 2D-3D semantic data [31] and contains 6 large-scale indoor areas with 271 
rooms. These areas show diverse architectural styles and appearances and mainly include office areas, educational and exhibition spaces, restrooms, open spaces, lobbies, stairways, and hallways. Each point in the scene point cloud is annotated with one of the 13 semantic classes shown in Figure 5. Besides five furniture classes, S3DIS contains seven classes (ceiling, floor, wall, column, beam, window, and door) relevant to the construction industry [59]. Because S3DIS is the only real-world dataset of this size and holds a reasonable number of classes to classify structures of buildings, we adopted the S3DIS dataset as our reference dataset in the following sections. In the presented datasets, no content of the scene type industrial can be found, and only four datasets include scenes showing infrastructure objects. A few publications exist about deep learning approaches for semantic segmentation of bridge components and industrial environments [20,23,24], but at the time of this publication, only RC Bridges [86], which contains ten reinforced concrete bridges in the area around Cambridge, is freely available. This dataset is intended to cluster point clouds into the bridge parts slab, pier, pier cap, and girder, but Lu et al. [86] did not publish ground truth semantic annotations. The only two sources featuring semantically annotated point clouds of infrastructure content have low quality, representing urban areas captured from an aerial view with sparse point density and a low Level of Detail (LoD) [19,78].
Object detection on 3D data is becoming increasingly relevant for the construction industry in terms of detecting building parts (e.g., doors, railings, pipes, rebars, and built-in parts) in the context of BIM but is also crucial for autonomous robots on construction sites to detect interactive objects and obstacles. The available datasets predominantly feature models of furniture and decoration from residential buildings [70,73]. ShapeNet, the most famous 3D-object dataset, holds a collection of human-made objects divided into 270 categories, including vehicles, clothing, weapons, and household equipment, but only a few construction elements, like towers [51].

Dataset Quality
A backlog in data quality (Figure 4), caused by poor sensor technologies used in the process of capturing indoor/building datasets, is a relevant issue for industrial application and research. Most indoor datasets originate between 2015 and 2017, while most large-scale outdoor datasets originate from the period between 2019 and today. This observation correlates with the finding in Figure 3b, where indoor datasets tend to be captured on RGB-D devices, while most urban datasets rely on laser scanning technology. Sensor arrays and stationary rotating cameras help with registration accuracy and scene completeness, as they produce 360° panoramic depth images like in the S3DIS dataset [59], but they still suffer from higher measuring inaccuracy compared to Light Detection And Ranging (LiDAR) (manufacturer's information from Matterport: precision of ±1% within a max. 4.5 m range [87]). Most outdoor datasets use costly mobile or terrestrial laser scanning technologies in combination with synchronized panoramic cameras to color the point clouds, Global Positioning System (GPS) for georeferencing, and inertial measurement units (IMUs) for motion correction. The industry-standard rotating LiDAR sensors for mobile laser scanning (MLS) applications from companies like Velodyne Inc. and Ouster Inc. claim range accuracy down to ±0.5 cm within a 35 m range (down to ±2.0 cm within a max. 200 m range) [88]. Companies like Leica Geosystems AG sell laser scanners for TLS applications with 3D accuracy down to ±3.2 mm within a 35 m range (±5.6 mm within 100 m) [89]. By utilizing laser scanners and multi-sensor fusion, the latest urban datasets achieve significantly higher data quality [37,82].

Why Class Imbalance is a Problem for Deep Learning
The availability of datasets that are designed for semantic segmentation tasks, applicable in the construction industry, and acquired from real-world scenes is limited, and the existing ones are often characterized by a lack of object class balance. To achieve optimal performance and avoid overfitting on individual classes, deep learning models must be trained on training sets that are as balanced as possible. The evaluation of the well-adopted S3DIS dataset [31] in Figure 5 shows that, in this real-world example, wall instances are heavily over-represented in common indoor scenes. In educational and office areas, a predominant amount of site-specific furniture can be found. In the S3DIS example, chairs dominate the object classes. A deeper examination on the per-point level in Figure 6 reveals that chair instances often appear in the dataset but occupy only a small number of data points. The situation is the opposite for the structural elements ceiling and floor, as these elements are single connected large segments with a wide surface area. Comparing the number of scan points by their semantic labels shows that only three classes (wall, ceiling, and floor) occupy about 63% of the training data (85% of the building component objects), while the beam and column, but also the window and door, classes are marginalized.
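The per-point class distribution discussed above can be quantified directly from a dataset's label array. A minimal sketch (the label values below are hypothetical, not actual S3DIS data):

```python
import numpy as np

# hypothetical per-point semantic labels, e.g. 0 = wall, 1 = door, 2 = chair
labels = np.array([0, 0, 0, 0, 1, 2, 2, 0, 0, 2])

counts = np.bincount(labels)       # number of scan points per class
shares = counts / labels.size      # fraction of the dataset per class
for cls in range(counts.size):
    print(f"class {cls}: {counts[cls]} points ({100 * shares[cls]:.0f}%)")
```

Running such a tally over the training split is a cheap first diagnostic before choosing any of the imbalance countermeasures discussed below.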
If a class imbalance exists within training data, learners typically over-classify the majority group due to its increased prior probability. As a result, the instances belonging to the minority group are misclassified more often than those belonging to the majority group [90]. Dataset imbalances of this form are commonly categorized [91] as follows: intrinsic imbalance is caused by the naturally occurring frequency of data, e.g., rooms usually have four walls but only one door, and measurable damage in sensor data streams only occurs in very rare catastrophic events; extrinsic imbalance is caused by external factors, e.g., ceilings did not get scanned due to camera viewing angles, or most buildings' architecture is biased toward a country's particular style.
The mechanism behind the poor network adaptation for subordinate classes is explained by Anand et al. [92]. Within the training process of feedforward networks, they showed that the backpropagation (BP) gradient vector lengths are proportional to the sizes of the training sets. In standard BP, the sum of all gradient vectors corresponds to the direction to follow to change perceptron weights. In imbalanced scenarios, the majority class gradients dominate the net gradient vector, which can result in an update direction that increases the error for minority classes. Consequently, standard BP will make major improvements in reducing the net error in the first steps and will likely get stuck in a slow mode of error reduction. Besides poor convergence behavior, class imbalance can lead to further problems. In the case of strong object variation, it can happen that the model does not see enough data on the underrepresented class to classify this class correctly, even if the dataset seems extensive and detailed. For performance evaluation, some metrics, such as accuracy, might mislead the analyst with high scores that incorrectly indicate good performance [90]. Finally, class imbalance makes it difficult to create appropriate validation sets if the sample pool to choose from is small. Nevertheless, creating well-balanced datasets is not just a challenge of balancing the numbers. Japkowicz [93] concluded, through an experiment with an artificial dataset, that a system's sensitivity to training data imbalance increases with the degree of complexity of the system and that non-complex, linearly separable domains do not appear sensitive to any amount of imbalance. For the use case of semantically segmenting building elements, this entails that distinguishing between a vertical wall and a horizontal floor is easier for a neural network than distinguishing between a vertical cylindrical pipe and a cylindrical column, because more analogous classes add extra complexity to the model. To illustrate 
this, we refer to the segmentation results published by Perez-Perez et al. [20], where point cloud data of industrial areas containing plenty of mechanical/electrical/plumbing (MEP) building systems were segmented with the help of a support vector machine (SVM) model. The confusion matrix in Figure 7c reveals that floors and pipes get classified with high precision, while columns and beams suffer from poor classification accuracy. These results correlate with the available data from the used dataset represented in Figure 7a, where a strong discrepancy in the training/test split is recognizable due to the poor class distribution in real-world scenes, which, in turn, forces the analyst into an imbalanced data split. In practice, this effect is further amplified by the observation that the degree of complexity is inverse to the likelihood of occurrence in the real-world dataset. Scan points of object classes with a low level of complexity (e.g., floors) appear more frequently than object classes with a high level of complexity (e.g., columns). The confusion matrix representations show the extent to which misclassification impacts the prediction result. In the example by Perez-Perez et al. [20], the model prefers to classify vertical columns as geometrically similar objects of which it has seen more data in training, like walls and pipes. This dataset holds a large number of beam instances, but with a low number of points assigned, so that the model's classification decision becomes polluted by similar vertical objects, like pipes. In general, learners will typically over-classify the majority group due to its increased prior probability. As a consequence, the instances belonging to the minority group are misclassified more often [90].
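One common algorithm-side countermeasure to the gradient domination described above is to weight each class's loss contribution inversely to its point frequency, so that minority-class gradients are amplified during backpropagation. A minimal sketch of computing such weights (our own illustration; the normalization to a mean weight of 1 is an arbitrary choice, not from any surveyed method):

```python
import numpy as np

def inverse_frequency_weights(labels, num_classes, eps=1e-6):
    """Per-class loss weights inversely proportional to point frequency,
    normalized so that the average weight is 1."""
    counts = np.bincount(labels, minlength=num_classes).astype(float)
    freqs = counts / counts.sum()
    weights = 1.0 / (freqs + eps)   # rare classes receive large weights
    return weights / weights.mean()
```

Such a weight vector can then be passed to a weighted cross-entropy loss, a standard facility in most deep learning frameworks.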

Solutions to Handle Dataset Imbalance
With the knowledge that class imbalance affects learning performance, application-oriented solutions can be discussed. Different techniques for classical machine learning have been published to compensate for the immense hindering effects that class imbalance has on standard learning algorithms. These techniques generally approach the problem from one of two sides. They either try to change the condition of the class composition on the dataset side (active) or keep the dataset unchanged and adjust training or inference on the algorithm side (passive). Techniques that combine data-based and algorithm-based adjustments can be summarized as hybrid approaches [90].
Active techniques (basic concept depicted in Figure 8), which change the class distribution, include oversampling the minority classes, undersampling the majority classes, and a combination of the two methods. A collection of frequently used sampling techniques and their approach to the imbalance problem is given in Table 2. These techniques alter the dataset to make standard training algorithms work. Active sampling techniques are well developed. A detailed review of various oversampling and undersampling techniques is given by [91,94]. Lemaitre et al. [95] published an open-source Python toolbox including a wide range of state-of-the-art methods to cope with the problem of imbalanced datasets.

Borderline Oversampling: Generates synthetic samples for the minority class that are close to the decision boundary between the minority and majority classes.

Backpropagation-Based Oversampling: Generates synthetic samples for the minority class using backpropagation to learn a transformation from the majority class to the minority class.
Undersampling the majority class is a common first choice, as this can help reduce the overall storage size and training time, which can be beneficial for large datasets and slow classifiers. However, undersampling carries the inherent risk of losing potentially useful information from the majority class and the risk of creating a biased data selection. The sample might not accurately represent the real world, causing the model to output inaccurate results. Random undersampling of the majority class (Figure 8, left) is easy to implement and fast to execute. At the same time, it is associated with an uncontrollable risk of information loss. To address this limitation, more advanced techniques have been developed that leverage meta-information, like the distance between two samples, neighborhood knowledge, clustering, and even evolutionary algorithms, like simulating bird flocking behavior [96]. Generating more instances of the underrepresented class through oversampling (Figure 8, right) does not lead to information loss, which makes it a popular technique for balancing imbalanced datasets [97]. Applying oversampling techniques to the minority class can enhance the performance of a classifier on that particular class. The downsides of oversampling are less obvious. By replicating synthetic instances of the original dataset, the likelihood of creating random noise and causing class overfitting increases. Overfitting, in particular, occurs when classifiers produce multiple clauses in a rule for multiple copies of the same example, which causes the rule to become overly specific. Although the training accuracy will be high in this scenario, the classification performance on the unseen testing data is generally far worse [91,96]. Oversampling for multi-class datasets potentially produces large amounts of data, which is storage-intensive and time-consuming to train with. It is important to note that oversampling before splitting the data into training/testing sets can lead to identical copies
in both sets. This causes overfitting and poor generalization. Therefore, splitting the data before applying oversampling techniques is mandatory to ensure accurate evaluation and generalization of the model. Combined over- and undersampling can be a solution to the overfitting problem. Gustavo E. A. P. A. Batista et al. [98] combined SMOTE oversampling with Tomek links undersampling as a cleaning technique, removing overlapping samples in the training set to improve classification in a binary case. However, active sampling may not be the most effective approach to deal with imbalanced 3D point cloud data. Sampling techniques have proven to be effective for binary datasets [97] and datasets with a small number of features [99], but point clouds typically hold multiple features (e.g., X, Y, and Z coordinates and r, g, and b colors for each point), which can make it difficult to pick meaningful samples or to determine which samples to discard.
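As a minimal illustration of the split-before-sampling rule, the following sketch balances only the training portion by random duplication. The toy dataset and the `random_oversample` helper are illustrative, not code from the cited works:

```python
import random
from collections import Counter

def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples until every class matches the majority count."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(samples), list(labels)
    for cls, n in counts.items():
        pool = [x for x, y in zip(samples, labels) if y == cls]
        for _ in range(target - n):
            out_x.append(rng.choice(pool))
            out_y.append(cls)
    return out_x, out_y

# Imbalanced toy dataset: 90 "floor" points vs. 10 "column" points.
floors = [((float(i), 0.0, 0.0), "floor") for i in range(90)]
columns = [((float(i), 0.0, 3.0), "column") for i in range(10)]

# Split FIRST (80% of each class for training), then oversample only the
# training portion, so no duplicated sample can leak into the test set.
train = floors[:72] + columns[:8]
test = floors[72:] + columns[8:]
train_x, train_y = random_oversample([s for s, _ in train], [l for _, l in train])

print(Counter(train_y))  # both classes now hold 72 training samples each
```

Oversampling the test set, or oversampling before the split, would place identical copies on both sides of the evaluation boundary and inflate the measured performance.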
Passive methods, which keep the dataset unchanged, include penalizing misclassification through cost-sensitive learning, adjusting the decision threshold of the classifier, and changing the performance metric to help the learner deal with the imbalance without the need to change the training dataset configuration. Accuracy and error rates are easy-to-implement metrics to evaluate the performance of multi-class classifiers and are frequently used in point cloud classification [17,100,101]. However, a singular assessment criterion does not provide adequate information, particularly in the context of imbalanced datasets [102], where the classifier prefers to ignore minority classes in favor of high accuracy on the majority classes. An illustrative case arises when the majority class represents 99% of all cases and the classifier simply assigns the label of the majority class to all test cases. An excellent but misleading accuracy of 99% will be assigned to a classifier that classified none of the minority cases correctly [103].
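This accuracy paradox can be reproduced in a few lines; the class names and counts below are illustrative:

```python
# 990 majority ("floor") vs. 10 minority ("column") test points.
y_true = ["floor"] * 990 + ["column"] * 10
# A degenerate classifier that always predicts the majority class.
y_pred = ["floor"] * 1000

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Per-class recall exposes the failure that accuracy hides.
def recall(cls):
    predictions_for_cls = [p for t, p in zip(y_true, y_pred) if t == cls]
    return sum(p == cls for p in predictions_for_cls) / len(predictions_for_cls)

print(accuracy)           # 0.99  -- looks excellent
print(recall("floor"))    # 1.0
print(recall("column"))   # 0.0   -- every minority point is missed
```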
More informative assessment metrics are necessary for conclusive evaluations of performance in the presence of imbalanced data. In general, the machine learning literature proposes the following evaluation metrics to evaluate the performance of imbalanced data classification from the confusion matrix: receiver operating characteristic (ROC) curves, the area under the ROC curve (AUC), precision-recall (PR) curves, and cost curves [91]. However, these graphical methods are designed for binary classification and cannot be applied directly to multi-class classification problems. In the multi-class case, the confusion matrix must be binarized. This can be done in two different ways: the One-vs-Rest scheme compares each class against all the others (treated as one), and the One-vs-One scheme compares every unique pairwise combination of all classes [104]. State-of-the-art benchmarks use multi-class-suited metrics to compare performance in 3D point cloud tasks in practice. The most relevant ones are listed in Table 3, and application-oriented performance metrics will be discussed in Section 4. Finally, a confusion matrix (binary case in Figure 9 and multi-class in Figure 7c) is a powerful tool to visualize classification results.
Table 3. Summary of frequently used evaluation metrics for 3D point cloud scene understanding tasks. Overall accuracy (OA), mean overall accuracy (mAcc), mean Intersection over Union (mIoU), mean average precision (mAP), average precision scores with IoU thresholds set to 25% and 50% (mAP@25 and mAP@50, respectively), mean precision (mPrec) and mean recall (mRec). Cost-sensitive learning considers the cost of prediction errors and assigns penalties to each class through a cost matrix. Increasing the cost of the minority group is equivalent to increasing its importance and decreases the likelihood that the learner will misclassify instances from this group [90,111]. A compilation of cost-sensitive learning approaches for neural networks is given by Buda et al. [103]. While implementing cost-sensitive learning into the training algorithm is a comparatively easy task and most common algorithms are modified to take a class penalty or weight into consideration, one of the biggest challenges in cost-sensitive learning is the assignment of an effective cost matrix [90]. The cost matrix can be defined empirically based on past experience or by a domain expert with knowledge of the problem. However, the structure of multi-class datasets is not definite, and the relations among classes are not always obvious. For example, one class might be a majority compared to one other class, but a minority or well balanced for the rest of them [111]. Another way is to set the false negative prediction cost to a fixed value and vary only the false positive cost. The ideal cost matrix is then identified using a validation set. This has the advantage of exploring a range of costs but can be expensive and even impractical if the dataset size or number of features is too large [90].
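A common, simple way to populate such class penalties is inverse-frequency weighting, similar in spirit to the "balanced" heuristic found in mainstream ML libraries. The helper below is an illustrative sketch, not the cost-matrix procedure of [90,111]:

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weights: w_c = N / (K * n_c), so that rare classes
    receive a proportionally larger misclassification cost in the loss."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

# Illustrative per-point labels from an imbalanced indoor scan.
labels = ["floor"] * 800 + ["wall"] * 150 + ["column"] * 50
weights = balanced_class_weights(labels)
print(weights)  # the rare "column" class receives the largest weight
```

During training, each point's loss term would then be multiplied by `weights[true_class]`, raising the penalty for errors on minority classes without touching the dataset itself.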

Thresholding methods adjust the decision threshold of a classifier to convert a predicted probability or scoring into a singular class label. In the context of a binary classification problem with class labels 0 and 1, a common approach is to utilize normalized predicted probabilities and apply a threshold of 0.5. Values below the threshold are attributed to class 0, while values equal to or above the threshold are assigned to class 1. The presence of imbalanced data or low acceptance of false predictions poses a challenge, as the default threshold may not accurately reflect the optimal interpretation of predicted probabilities. Adjusting the threshold through hyperparameter tuning can enhance the classifier recall in a binary classification scenario. Nonetheless, multi-class classification, as typically encountered in point cloud classification, where each scan point can carry a single class label, cannot rely extensively on thresholding methods. Instead, class labels are typically assigned based on the highest predicted probability. In conclusion, it is noteworthy that the solutions discussed for addressing dataset imbalance are representative of a broad range of techniques in machine learning. Johnson and Khoshgoftaar [90] suggest that the current research on the use of deep learning to tackle class imbalance in non-image data is limited and should be further explored. Since point cloud data are a novel research domain, they pose a unique challenge for deep learning methods, and therefore, there is still limited progress in developing effective solutions.
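For the binary case, threshold adjustment reduces to a one-line change at inference time; the scores below are synthetic:

```python
def predict(probs, threshold=0.5):
    """Convert predicted probabilities into binary class labels."""
    return [1 if p >= threshold else 0 for p in probs]

def recall(y_true, y_pred, cls=1):
    predictions_for_cls = [p for t, p in zip(y_true, y_pred) if t == cls]
    return sum(p == cls for p in predictions_for_cls) / len(predictions_for_cls)

# Scores for a rare positive class tend to sit below the default threshold.
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
probs  = [0.1, 0.2, 0.1, 0.3, 0.2, 0.4, 0.1, 0.45, 0.6, 0.35]

print(recall(y_true, predict(probs)))        # default 0.5: 1 of 3 positives found
print(recall(y_true, predict(probs, 0.35)))  # lowered threshold: all 3 found
```

Lowering the threshold trades precision for recall; the acceptable operating point depends on the relative cost of false positives and false negatives.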

Evaluation Metrics
Understanding common evaluation schemes and the metrics to measure performance is necessary for developing and applying deep learning methods. Although the overall accuracy (OA) is a significant index on a minimum unit basis (e.g., pixel, point, voxel, and mesh), it is not a sufficiently informative metric to evaluate model predictions. Accuracy assigns equal cost to false positive and false negative cases, which, in reality, is rare. For example, engineering is known for high safety requirements, where false negative predictions can have a crucial impact, while false positives may be neglected. Binary test results can be best visualized in a confusion matrix, as shown in Figure 9. Within the confusion matrix, the capital letters TP/FP refer to the number of positive predictions that are true/false-classified, and similarly, FN/TN refer to the number of negative predictions that are false/true-classified. The sum, N, of all four cells is equal to the total number of classified points. The capital letters PP/PN correspond to the horizontal sums of positive/negative predictions, and vice versa, RP/RN correspond to the vertical sums of positive/negative real units. By convention, the lowercase letters tp, fp, fn, tn and rp, rn and pp, pn refer to the joint and marginal probabilities, and the four contingency cells and the two pairs of marginal probabilities (rp + rn = 1 and pp + pn = 1) each must sum to 1 [112]. Figure 7c shows an example of a confusion matrix for multi-class classification. In the multi-class case, FP, FN, and TN cannot be obtained directly from the confusion matrix as in the binary case; instead, they must be determined individually for each class according to the summation scheme in Figure 10. Common statistical performance measures can be calculated from the evaluation of a binary classification problem and the per-class discretization. Precision (also known as confidence) denotes the proportion of predicted positive cases that are correctly real positives. Recall (also
known as sensitivity) aims to identify all real positive (often called relevant) cases. The F1-score denotes the weighted average (harmonic mean) of precision and recall and is a measure of a test's fidelity. It is usually more informative than accuracy, since both precision and recall must be high at the same time for the F1-score to rise, especially when dealing with uneven class distributions [113,114]. The formulas to calculate precision, recall, and the F1-score are reported in Equations (1), (2), and (3), respectively. Based on the confusion matrix kernel, different evaluation metrics were proposed to test the performance of various point cloud understanding tasks, not only in the binary scenario but also in multi-class cases. We assume a set of K + 1 classes, where p_ij denotes the smallest instance (e.g., pixel, voxel, mesh, and point) of class i implied to belong to class j (i = j represents true and i ≠ j represents false classifications). To generalize the implementation, matrix indexing is applied so that p_ii and p_jj represent true positives and true negatives, while p_ij and p_ji represent false positives and false negatives, respectively [26]. The formulas to calculate the overall accuracy (OA) and the mean class accuracy (mAcc) are reported in Equations (4) and (5), respectively.
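For reference, the standard definitions behind Equations (1)–(3), in the TP/FP/FN notation introduced above, read:

```latex
\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (1)

\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (2)

F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (3)
```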
OA (sometimes abbreviated oAcc) measures the overall effectiveness of a classifier by computing the ratio between the number of correctly classified samples and the total number of samples.
mAcc measures the average per-class effectiveness by computing the OA per class and averaging over the total number of classes [115].
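In the p_ij notation introduced above (with K + 1 classes), Equations (4) and (5) can be written as:

```latex
\mathrm{OA} = \frac{\sum_{i=0}^{K} p_{ii}}{\sum_{i=0}^{K}\sum_{j=0}^{K} p_{ij}} \qquad (4)

\mathrm{mAcc} = \frac{1}{K+1}\sum_{i=0}^{K} \frac{p_{ii}}{\sum_{j=0}^{K} p_{ij}} \qquad (5)
```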
OA and mAcc are commonly used as performance metrics for point cloud classification [17,100,101]. However, for reasons already explained in the last section (Section 3.3), both are hardly meaningful in the presence of imbalanced classes and for evaluating segmentation results. Accounting for the area of segment overlap improves the informative value. The formula to calculate the mean Intersection over Union (mIoU) is reported in Equation (6). mIoU, as shown in Figure 11, computes the intersection ratio between ground truth and predicted values per class and averages the sum over the total number of classes K [116].
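In the same notation, Equation (6) takes the form (per-class intersection divided by union, averaged over the classes):

```latex
\mathrm{mIoU} = \frac{1}{K+1}\sum_{i=0}^{K} \frac{p_{ii}}{\sum_{j=0}^{K} p_{ij} + \sum_{j=0}^{K} p_{ji} - p_{ii}} \qquad (6)
```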
mIoU and mAcc are the most frequently used performance metrics for 3D point cloud semantic segmentation [40,59,74], assuming L_I, I ∈ [0, K] is the number of instances in every class and c_ij is the number of points of instance i inferred to belong to instance j (i = j represents correct and i ≠ j represents incorrect segmentation) [26]. The formulas to calculate the average precision (AP) and the mean class average precision (mAP) are reported in Equations (7) and (8), respectively.

IoU = Area of Overlap / Area of Union. AP is often used as the metric for 3D object detection [117]. It is calculated class-wise as the area under the precision-recall curve. mAP is frequently used for 3D object detection and 3D instance segmentation when discriminating between multiple instances within one class. It is an extension of AP, averaging the per-class precision over the total number of classes K [26,109].
Apart from these commonly used metrics in machine learning, other problem-specific metrics are worth mentioning: overall average category Intersection over Union (Cat. mIoU) and overall average instance Intersection over Union (Ins. mIoU) can be used for 3D part segmentation [26]. Precision and success are commonly used to evaluate the overall performance of 3D single-object tracking. Average multi-object tracking accuracy (AMOTA) and average multi-object tracking precision (AMOTP) are the most frequently used metrics for 3D multi-object tracking [25,37].
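To make the relationship between the confusion matrix and the segmentation metrics concrete, the following sketch computes OA, mAcc, and mIoU from a toy multi-class confusion matrix; the matrix values and class roles are illustrative:

```python
def metrics_from_confusion(conf):
    """conf[i][j] = number of points of true class i predicted as class j."""
    k = len(conf)
    total = sum(sum(row) for row in conf)
    diag = sum(conf[i][i] for i in range(k))
    oa = diag / total
    accs, ious = [], []
    for i in range(k):
        true_i = sum(conf[i])                       # row sum: all points of class i
        pred_i = sum(conf[r][i] for r in range(k))  # column sum: all predicted i
        accs.append(conf[i][i] / true_i)
        ious.append(conf[i][i] / (true_i + pred_i - conf[i][i]))  # |A∩B| / |A∪B|
    return oa, sum(accs) / k, sum(ious) / k

conf = [
    [90,  5,  5],   # floor
    [10, 80, 10],   # wall
    [20, 20, 10],   # column: under-represented and poorly segmented
]
oa, macc, miou = metrics_from_confusion(conf)
print(round(oa, 3), round(macc, 3), round(miou, 3))  # → 0.72 0.633 0.495
```

Note how the weak minority class drags mAcc and mIoU well below the OA, which is exactly the sensitivity to imbalance that makes these metrics preferable.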

Deep-Learning-Based Methods for Point Cloud Understanding
Deep learning for computer vision in the 3D domain has garnered significant attention and interest in recent years, particularly since 2015, when various approaches emerged to extend the application of 2D convolutions for image object recognition to the 3D model space [118][119][120]. However, the transition from 2D raster images to 3D point clouds presents unique challenges due to the unordered and unstructured nature of point cloud data [121]. Different preprocessing steps are required to transform the raw point cloud data into a structured format that can be processed by a neural network. According to Guo et al. [25], common learning-based PCSS approaches can be divided into three baseline categories: projection-based, discretization-based, and point-based methods. We adopted this taxonomy for 3D point cloud segmentation in Figure 12 and expanded it with the latest achievements in point-based methods. The choice to apply a particular type or a hybrid method depends on the problem to solve. In the following, we provide a concise overview of the basic characteristics of each method, followed by a discussion of the advantages and disadvantages of applying certain models in the construction industry, considering the customary point density and cloud size for point clouds of building objects.
Projection-based and discretization-based methods share a common initial step. Both transform the unstructured point cloud into a regular intermediate representation to enable data handling. This regular representation is usually a d-dimensional grid, d ∈ N+ (e.g., an image with d = 2 and a voxel grid with d = 3), in which points can be referenced by indices and the neighborhood relationship is defined by contiguousness. Semantic segmentation is performed on the structured data in the intermediate space, and the result is subsequently projected back to the original, unstructured point cloud. In contrast to projection- and discretization-based methods, point-based methods operate directly on irregular point clouds without using an intermediate representation [25]. We evaluate the publicly available semantic segmentation results of 54 methods on the S3DIS benchmark in Figure 13 on a timeline. For a performance comparison of the state of the art, we selected the mIoU as the most meaningful metric in the presence of class imbalance and the most adopted metric in public databases for PCSS tasks. The data for this comparison, secondary performance metrics (mIoU, mAcc, and oAcc), and an allocation of the deployed network architectures can be found in Appendix A1.
Figure 13. Reported results for the semantic segmentation task on the large-scale indoor S3DIS benchmark (including all 6 areas, 6-fold cross-validation) [59]. Performance is expressed in terms of mean Intersection over Union (mIoU). Item numbers represent the ranking within this database according to mIoU performance. The data correspond to the evaluation in Table A1.

Projection-Based Methods
Projection-based methods render raster-image intermediate data from spatial 3D point clouds to enable the use of powerful 2D convolutional networks for semantic segmentation [3]. Several approaches have been proposed to project the 3D space onto 2D surfaces to cover large landscapes.
Multi-view representation approaches are among the earliest approaches for 3D data semantic segmentation. The limitations of 3D unstructured points can be avoided by taking snapshots of point cloud scenes with predefined image sizes, as shown in Figure 14a. These uniform, synthetic images can serve as input to any 2D CNN intended for semantic segmentation. The synthetic raster images may encompass features like trichromatic color values, XYZ surface normals, and scalar-field depth information [3]. One significant advantage of projection-based PCSS is the high availability of annotated training data, since developers can revert to the knowledge base of standard image semantic segmentation resources and adopt pre-trained models through transfer learning [122,123]. Nevertheless, only a few studies have used multi-view-based deep learning approaches for PCSS in construction because of fundamental limitations. First, the performance of multi-view PCSS methods is sensitive to viewpoint selection and occlusions. For large scenes, which are common in the construction environment, it is challenging to select adequate viewpoints and camera angles to cover the whole scene; hence, the projection step inevitably introduces information loss [25]. Despite these limitations and the intense competition among the different approaches (Figure 13), the latest DeepViewAgg model [124] produces competitive state-of-the-art semantic segmentation results by leveraging multi-view aggregation to merge features from images taken at arbitrary positions. This method has the advantage of not being dependent on colorized point clouds, facilitating a more general sensor choice and higher recording speed. Spherical representation methods project the whole point cloud, captured by a single scan, onto a sphere with its origin at the capturing device's position [127]. Several LiDAR sensors already represent the raw input data in a range-image-like fashion [128]. Because of its synergy with 360° LiDAR
sensors, spherical projection representation, as shown in Figure 14b, has proven to be extremely valuable in autonomous driving applications [127][128][129], where real-time and robust perception of the environment is indispensable, while point cloud density and a low level of detail are secondary. Because of its runtime advantage, Xu et al. [130] called projection methods the "de facto" method for large-scale point cloud segmentation. Compared to single-view projection, spherical projection retains more information. However, this intermediate representation inevitably introduces several problems, such as discretization errors and blurred CNN outputs, due to subsampling to the resolution of the synthetic image [128]. If the center of the projection is not equal to the initial sensor position, the method suffers from the same problems as multi-view representation, i.e., occlusion and translucency. Due to the dependence on the sample locations, projection methods are only suitable to a limited extent for practical application in the construction industry. No published benchmark results of projection methods are known for the labyrinthine S3DIS indoor dataset for comparison in Figure 13.

Discretization-Based Methods
Discretization methods aim to transform raw, unstructured point clouds into a grid structure to enable 3D indexing and to store neighborhood information. Discretization-based approaches can be classified into two types: dense volumetric discretization and sparse permutohedral discretization.
Dense discretization representation methods first compute the bounding box surrounding the total point cloud and then subdivide the box space into a user-defined grid. Grid instances are called voxels and can be imagined as pixels in 3D. The position of a voxel is inferred from its relation to other voxels. Neighboring voxels can be retrieved by altering the index. The point cloud is subdivided by the grid, and every voxel is checked for the occupancy of points. In the simplest case, a voxel is labeled as true if occupied by points and false if not. In a more sophisticated version, the algorithm calculates a density value by counting how many points lie within each cell [131]. This regular data can be fed to a 3D fully convolutional neural network for voxel-wise semantic segmentation. Finally, all points within a voxel are assigned the same predicted semantic label as the voxel. Among deep learning on 3D data, volumetric CNNs were the pioneers in applying 3D CNNs to voxelized shapes [17].
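The dense voxelization steps described above (bounding box, user-defined grid, occupancy check) can be sketched as follows; the helper name and toy points are illustrative:

```python
def dense_occupancy(points, voxel_size):
    """Dense voxel grid: allocate the full bounding box, mark occupied cells."""
    mins = [min(p[d] for p in points) for d in range(3)]
    maxs = [max(p[d] for p in points) for d in range(3)]
    dims = [int((maxs[d] - mins[d]) // voxel_size) + 1 for d in range(3)]
    grid = [[[False] * dims[2] for _ in range(dims[1])] for _ in range(dims[0])]
    for p in points:
        i, j, k = (int((p[d] - mins[d]) // voxel_size) for d in range(3))
        grid[i][j][k] = True
    return grid, dims

# An elongated toy scene: two clusters 9+ meters apart.
points = [(0.1, 0.1, 0.1), (0.2, 0.3, 0.1), (9.6, 0.2, 0.1)]
grid, dims = dense_occupancy(points, voxel_size=1.0)
occupied = sum(cell for plane in grid for row in plane for cell in row)
print(dims, occupied)  # 10 cells allocated for only 2 occupied voxels
```

Even in this tiny example, most of the allocated grid is empty, which previews the memory overhead of dense voxelization for the large, sparse scenes typical of construction sites.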
Adapted from Tchapmi et al. [5], Figure 15 represents the basic steps of voxel discretization, the prediction of class labels per voxel, and the mapping back to the unordered point cloud. The stepwise snapshots illustrate one of the main drawbacks of voxelized data representation: voxelization involves point cloud downsampling, which introduces data loss and causes classification artifacts from back-projection. An inherent limitation of this technique is its sensitivity to the granularity of the voxels. Increasing the voxel resolution can alleviate some of these issues. However, the voxel size is a complex trade-off between the level of detail (LOD) of the output, memory usage, and computational complexity. Thus, identifying an optimal voxel resolution is a challenging task [25]. Furthermore, it should be noted that dense discretization strategies with fixed-size voxel structures entail the storage of not only occupied spaces but also free or inner spaces [121]. As a result, this approach may lead to memory overhead, particularly in the case of the large scenes common in construction. Sparse discretization representation methods decrease these inefficiencies by omitting unoccupied space. Earlier approaches like OctNet [132] hierarchically partition the sparse point cloud using a set of unbalanced octrees. Tree structures allow memory allocation and computation to focus on relevant dense voxels without sacrificing resolution. However, empty space still imposes computational and memory burdens in OctNet. Graham et al.
[133] proposed a novel submanifold sparse convolutional network (SSCN) based on indexing, which does not perform computations in empty regions, overcoming this drawback of OctNet. Based on these findings, MinkowskiNet [134] enables the direct processing of 3D sequences (e.g., LiDAR streams) with 4D spatiotemporal ConvNets with generalized sparse convolutions. Contrary to cubic discretization, approaches like LatticeNet [135] tessellate the scene space with d-dimensional permutohedral lattices into d-dimensional simplices (simplices are triangles for d = 2 and tetrahedrons for d = 3). The vertices of the permutohedral lattice store only the simplices that contain non-empty regions. This sparse allocation allows for an efficient implementation of all typical operations in CNNs. Spatial discretization is favorable in the construction industry where true volumetric scene reconstruction is required and the level of reconstruction detail is secondary. This is the case with navigation and localization applications through the creation of building occupancy maps [136,137].
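For contrast, a sparse occupancy structure stores only non-empty cells, e.g., in a hash map keyed by integer voxel indices. This is an illustrative sketch of the general idea, not the octree or permutohedral data structures of [132,135]:

```python
def sparse_occupancy(points, voxel_size):
    """Sparse grid: store only occupied voxels, keyed by integer indices."""
    mins = [min(p[d] for p in points) for d in range(3)]
    grid = {}
    for p in points:
        idx = tuple(int((p[d] - mins[d]) // voxel_size) for d in range(3))
        grid[idx] = grid.get(idx, 0) + 1  # per-voxel point count (density)
    return grid

# Same elongated toy scene as before: two clusters 9+ meters apart.
points = [(0.1, 0.1, 0.1), (0.2, 0.3, 0.1), (9.6, 0.2, 0.1)]
grid = sparse_occupancy(points, voxel_size=1.0)
print(grid)  # only 2 entries, regardless of the bounding-box size
```

Memory now scales with the number of occupied voxels rather than with the bounding-box volume, which is the property that sparse convolutional networks exploit.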

Point-Based Methods
Point-based methods do not rely on an intermediate representation. However, the lack of canonical order and permutation invariance makes raw point data infeasible for convolutional network architectures that require regular input data. This was first solved in 2017 with PointNet [17]. Unlike the previously mentioned projection and discretization methods, points are processed through a fully connected multi-layer perceptron (MLP) network to learn per-point features. The key to this approach was a single max-pooling layer that was trained to select a subset of informative points. The output is a global signature of the input set, which can be used for shape classification tasks. However, pointwise semantic segmentation requires local and global knowledge; for segmentation, Qi et al. [17] therefore concatenate the global signature with the per-point features. Point-based convolutional neural networks (CNNs) apply convolving kernels at each point in the point cloud to learn pointwise features. However, directly convolving kernels against features associated with the points results in losing shape information and variance to point ordering [106]. Many approaches have been proposed to address these issues. Hua et al. [138] introduced a convolution operator that bins nearest neighbors into kernel cells before convolving with kernel weights, which can be applied at each point of a point cloud (Figure 16, center). Wang et al. [139] proposed continuous convolution, a new learnable operator that operates over non-grid-structured data and uses kernel functions defined for arbitrary points in the continuous support domain. Similarly, Boulch [140] replaced discrete kernels with continuous ones to generalize convolution for point clouds. This formulation allows arbitrary point cloud sizes and can easily be used for designing neural networks similar to 2D CNNs. Li et al.
[106] proposed a method to learn a χ-transformation that weights and permutes input points and features before convolving them. The jointly learned χ-operator is explicitly dependent on the input order, as χ is trained to permute the feature vector to ensure permutation invariance. According to Li et al. [106], this only works to a limited extent but is still significantly better than the direct application of typical convolutions on point clouds.
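The core idea that a symmetric function such as max pooling yields order invariance can be demonstrated with a toy stand-in for the shared MLP; the feature function below is an arbitrary fixed map, not PointNet's learned weights:

```python
import random

def point_mlp(p):
    """A fixed 'shared MLP' stand-in: maps one 3D point to a 4D feature."""
    x, y, z = p
    return [x + y, y + z, x * z, max(x, y, z)]

def global_feature(points):
    """Symmetric aggregation: elementwise max over all per-point features."""
    feats = [point_mlp(p) for p in points]
    return [max(f[d] for f in feats) for d in range(4)]

cloud = [(0.2, 0.5, 0.1), (0.9, 0.1, 0.4), (0.3, 0.8, 0.6)]
shuffled = list(cloud)
random.Random(7).shuffle(shuffled)

# Max pooling is order-independent, so the global signature is identical.
print(global_feature(cloud) == global_feature(shuffled))  # True
```

Any symmetric aggregation (max, sum, mean) would give the same invariance; PointNet's insight was that max pooling over learned per-point features is also expressive enough to act as a global shape signature.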
Real-world data are typically associated with inhomogeneous point density. Methods based on a fixed number of samples deteriorate when point density fluctuates [18,141]. Hermosilla et al. [142] used Monte Carlo convolution for non-uniformly sampled point clouds by phrasing the convolution integral as a Monte Carlo approximation, thus providing a new form of robust sampling invariance. Similarly, Wu et al. [143] extended the Monte Carlo approximation method with an MLP in each filter to approximate the weight functions and a density scale to re-weight the learned weight functions. They call this permutation-invariant convolution operation PointConv. However, the "naive" implementation of PointConv is memory-consuming and inefficient. A reformulation that reduces PointConv to two standard operations (matrix multiplication and 2D convolution) takes advantage of GPU parallel computing and allows easy implementation with mainstream deep learning frameworks. Thomas et al. [100] presented kernel point convolution (KPConv) for flexible and deformable convolutions on sparse point clouds, as shown on the right of Figure 16. This method is inspired by image-based convolution, but instead of kernel pixels, it uses a set of kernel points to define the volume where a linear correlation function applies each kernel weight. For greater flexibility, the number of kernel points is not fixed, and the positions of the kernel points are formulated as an optimization problem of the best coverage in a sphere space and trained to fit the point cloud geometry. Liu et al.
[144] (FG-Net) published a general deep learning framework for large-scale point clouds to tackle issues (noise, outliers, and dynamically changing environments) associated with methods designed for small point clouds. They introduce a deformable convolution for modeling local object structures with kernels that dynamically adapt to geometries. Pointwise attention aggregation is used to capture the distributed contextual information in spatial locations and semantic features across a long spatial range.
Tang et al. [145] showed how to improve the segmentation results of a well-known ConvNet baseline model [140] by applying contrastive boundary learning (CBL) to enhance feature discrimination between points across boundaries. Experiments demonstrate that CBL can help to improve predictions, especially in cluttered regions, and reduce classification artifacts. The CBL framework is not limited to the ConvNet architecture but can be coupled to any other multi-stage backbone, as results with the RandLA-Net backbone [146] have shown.
Point-based graph neural networks (GNNs) allow a natural representation of data (point clouds and meshes) within non-Euclidean space. The nature of these data does not imply familiar properties, such as an orthonormal coordinate system, vector space structure, or shift invariance. Consequently, basic operations such as convolution, which are taken for granted in the Euclidean case, are not even well defined in non-Euclidean domains [147]. In recent decades, researchers have been working on how to conduct convolutional operations on graphs. Graph convolutional network (GCN) models are neural networks that can leverage the graph structure and aggregate node information from the neighborhoods in a convolutional fashion [148].
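The neighborhood aggregation that GCNs build on can be sketched in a few lines. This is a minimal mean-aggregation variant under our own simplifying assumptions (dense adjacency matrix, a single layer, ReLU activation), not the normalized propagation rule of any specific paper:

```python
import numpy as np

def gcn_layer(A, X, W):
    """One basic graph convolution step: average features over each node's
    neighborhood (plus a self-loop), then apply shared weights and a ReLU."""
    A_hat = A + np.eye(A.shape[0])             # adjacency with self-loops
    deg = A_hat.sum(axis=1, keepdims=True)     # node degrees
    H = (A_hat / deg) @ X                      # mean aggregation: D^-1 A_hat X
    return np.maximum(0.0, H @ W)              # shared linear transform + ReLU
```

Stacking such layers grows each node's receptive field by one hop per layer, which is also why very deep stacks tend to over-smooth node features.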
Landrieu and Simonovsky [149] suggested a superpoint graph (SPG) (Figure 17) constructed from simple yet meaningful shapes, which are partitioned from the global point cloud by local geometric features (linearity, planarity, scattering, and verticality). Superedges interconnect superpoints in a global graph to capture spatial context information, which is then exploited by a GCN. Wang et al. [150] proposed the dynamic graph CNN (DGCNN), an EdgeConv-based model to group points in Euclidean and semantic space over potentially long distances. The EdgeConv module is applied to dynamically updated graphs in each network layer. Zhiheng and Ning [151] published the PyramNet architecture as a combination of a graph embedding module (GEM) and a pyramid attention network (PAN). To improve local feature expression ability, the GEM utilizes a covariance matrix to replace the Euclidean distance. The PAN combines features of different resolutions and different semantic strengths from four convolution kernels to improve segmentation. Wang et al. [152] used graph attention convolution (GAC) with learnable kernel shapes that adapt to the structure of the objects. This is achieved by dynamically assigning attention weights to different neighboring points and feature channels based on their spatial positions and feature differences. Ma et al. [153] tested a plug-and-play point global context reasoning (PointGCR) module, which can help to capture context information along the channel dimension by using an undirected graph representation and a self-attention mechanism. The authors show a significant boost in performance when the PointGCR module is appended to several well-known encoder-decoder networks for point cloud segmentation on outdoor and indoor scenes. Xie et al.
[154] published the multi-resolution graph neural network (MuGNet) architecture, which translates large-scale point clouds into directed connectivity graphs and efficiently segments the point clouds with a bidirectional graph convolution network. Using a multi-resolution feature fusion network reduces memory consumption while conserving rich contextual relationships. A key to this algorithm's success was its ability to train deep networks reliably. However, training very deep GCNs comes with certain shortcomings. First, stacking multiple GCN layers leads to the vanishing gradient problem, just as in CNNs. Second, the over-smoothing problem can occur when many GCN layers are applied repeatedly [155]. Thus, most state-of-the-art GCNs are limited to shallow network architectures, usually no deeper than four layers [156,157]. Point-based recurrent neural network (RNN) methods aim to improve scene understanding by treating the point input as a sequence of features. RNNs have proven to be highly effective for processing sequence data with variable lengths, like sensor data streams, human speech, and written text [158]. Few attempts exist to leverage the power of RNNs and their ability to memorize prior input for point cloud semantic segmentation. Engelmann et al. [159] improved on the first PointNet [17], which is bound to subdivide larger point clouds into a grid of blocks and process each block individually. In the proposed network, one recurrent consolidation unit (RCU) takes as input a sequence of block features originating from four spatially nearby blocks and returns a sequence of corresponding updated block features. The RCU is configured as a gated recurrent unit (GRU), where updated block features are returned only after the GRU has seen the whole input sequence. The GRU retains relevant information about the scene in its internal memory and updates it according to new observations. Huang et al.
[160] (RSNet) suggested a lightweight local dependency module that uses three slice pooling layers along the X, Y, and Z axes to convert unordered point feature sets into an ordered sequence of feature vectors. By modeling one slice as one timestamp, information interacts with other slices as it flows through the RNN internal state (memory). Comparably, Ye et al. [161] (3P-RNN) utilized a combination of pointwise pyramid pooling to capture local geometries and a two-directional sliding-window hierarchical RNN to explore long-range spatial dependencies in the X and Y directions. Notably, the above RNN-based methods rely on static pooling to aggregate features from the local neighborhood, without adapting to density changes [162]. Similar to the aforementioned dense voxel discretization methods, this leads to computational inefficiencies and poor segmentation results. At this point, the latest state-of-the-art benchmark results, like S3DIS's large-scale indoor scene benchmark [59] (Figure 13), show that RNN-based and GCN-based methods currently cannot keep up with modern point-based convolution, MLP-pooling, and transformer-based methods.
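The slice-pooling idea of turning an unordered point set into an RNN-ready sequence can be sketched briefly. This is a simplified, single-axis illustration in the spirit of RSNet's slice pooling, with our own function name and binning choices, not the published implementation:

```python
import numpy as np

def slice_pooling(points, feats, axis=0, num_slices=4):
    """Bin points into equal-width slices along one axis and max-pool the
    features per slice, yielding an ordered sequence for an RNN."""
    lo, hi = points[:, axis].min(), points[:, axis].max()
    edges = np.linspace(lo, hi, num_slices + 1)
    ids = np.clip(np.digitize(points[:, axis], edges) - 1, 0, num_slices - 1)
    seq = np.full((num_slices, feats.shape[1]), -np.inf)  # -inf marks empty slices
    for s in range(num_slices):
        if np.any(ids == s):
            seq[s] = feats[ids == s].max(axis=0)          # max-pool per slice
    return seq  # ordered sequence of slice features
```

An RNN then treats each slice feature as one timestamp; a matching "slice unpooling" step would broadcast the processed slice features back to the member points.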
Point-based multi-layer perceptron networks (MLPs) usually process each point individually through a shared MLP to perform local feature aggregation. MLPs are highly efficient but often face difficulties in grasping the global context and spatial features of point clouds [17,163]. Among the point-based MLP networks, state-of-the-art methods can be subdivided into pooling-based and attention-based transformer approaches, both of which capture a wider context for each point and learn richer local structures.
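The shared-MLP principle is simple enough to sketch: the same weights are applied to every point independently (equivalent to a 1 × 1 convolution), and a symmetric function such as max-pooling then produces an order-invariant global feature. A minimal illustration, with our own function name and a single layer for brevity:

```python
import numpy as np

def shared_mlp(points, W, b):
    """Apply identical weights to every point independently, then max-pool
    into a permutation-invariant global feature."""
    per_point = np.maximum(0.0, points @ W + b)   # (N, Cout), same W for all points
    global_feat = per_point.max(axis=0)           # symmetric pooling over points
    return per_point, global_feat
```

Because the pooling is symmetric, reordering the input points leaves the global feature unchanged, which is exactly the property unordered point sets require.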
Pooling-based methods learn to consolidate features in the local neighborhood of a point group. The hierarchical neural network PointNet++ [18] can recognize fine-grained patterns and generalize to complex scenes by recursively applying PointNet on a nested partitioning of the input point set. Information from different neighborhood sizes is aggregated for each point of interest with neighboring feature pooling. Thus, MLP-based methods can improve general feature learning for non-uniform point density in different areas [18]. In contrast to PointNet++, which defines the neighborhood in the metric world space, Engelmann et al. [164] employed two grouping mechanisms to define neighborhoods both in the world space, by clustering points using K-means, and in the feature space, by computing the k nearest neighbors (KNNs). Assuming that points from the same semantic class are likely to be nearby in the feature space, they defined a pairwise similarity loss function and found, with [165], that semantic similarity can be measured as a distance in the feature space. Qiu et al. [166] (BAAF-Net) resolved three major drawbacks of existing works that affect segmentation quality: ambiguity in close points, redundant features, and inadequate global representations. While most architectures are designed to solve a broad range of scene understanding tasks (including object detection and shape classification), the BAAF-Net architecture focuses entirely on large-scale point cloud semantic segmentation. This is achieved by augmenting the local context bilaterally and adaptively fusing multi-resolution features for each point. Lin et al. [85] proposed a unified framework called PointMeta. Its building blocks can be summarized in terms of four meta functions: a neighbor update function, a neighbor aggregation function, a point update function, and a position embedding function (implicit or explicit). By modifying the existing approaches, Lin et al.
[85] derived a basic building block (PointMetaBase). PointMetaBase-XXL surpasses previous methods in several PCSS benchmarks (Figure 13) in a configuration with multiple stacked PointMetaBase blocks, max-pooling as the neighborhood aggregation function, and explicit position embedding (EPE, adopted from Point Transformer [108]).
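The neighbor aggregation meta function used by such pooling-based blocks can be sketched as KNN gathering followed by max-pooling. This is a brute-force illustration under our own assumptions (exhaustive pairwise distances, self included as a neighbor), not the PointMetaBase implementation:

```python
import numpy as np

def knn_max_aggregate(points, feats, k=3):
    """Gather each point's k nearest neighbors (including itself) and
    max-pool their features into the point's new feature."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)  # (N, N)
    idx = np.argsort(d, axis=1)[:, :k]       # indices of the k nearest neighbors
    return feats[idx].max(axis=1)            # (N, C) pooled neighborhood feature
```

Real pipelines replace the O(N²) distance matrix with a spatial index or grid, but the aggregation step itself stays this simple.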
Transformer-based methods, first introduced in the field of natural language processing (NLP) [167], are designed to process sequential input data. The Transformer is an encoder-decoder structure with three main modules for input (word) embedding, positional (order) encoding, and self-attention. The self-attention module is the core component that relates different positions of a single sequence to compute a representation of the sequence. Following the success of Transformer network architectures in the language and image domains, where competitive performance was achieved compared to their CNN counterparts [168][169][170], Transformers have lately been applied to process unordered 3D point clouds. Since point clouds essentially are sets of points with positional attributes, the self-attention mechanism seems particularly suitable for this data type [108,167].
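Why self-attention suits point sets can be seen in a minimal sketch: every point attends to every other point, so the operation is permutation-equivariant and its receptive field spans the whole input. A single-head, scaled dot-product illustration with our own toy projections (no positional encoding):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a set of point features."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])              # pairwise affinities
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)             # row-wise softmax
    return attn @ V                                     # (N, C) attended features
```

Permuting the input points simply permutes the output rows, which is the set-friendly behavior that motivates point cloud Transformers.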
The simplest approach to adapting the Transformer [167] for point cloud processing is to treat the entire cloud as a sentence and each point as a word. Guo et al. [171] proposed a naïve point cloud Transformer (PCT) by implementing a coordinate-based point embedding, which ignores interactions between points, and instantiating the attention layer with the self-attention proposed in the original Transformer [167]. Without further modification and purely based on coordinate features, the naïve PCT was capable of outperforming state-of-the-art methods. Yang et al. [172] presented a point attention Transformer (PAT) with two core operations: group shuffle attention (GSA) for mining relations between elements in the feature set and Gumbel subset sampling (GSS) for selecting a representative subset of input points. To improve frame rates in real-time applications, Hu et al. [146] published RandLA-Net, a lightweight architecture that directly infers per-point semantics for large-scale point clouds. It uses random point sampling instead of more complex and computationally heavy point-selection approaches. RandLA-Net introduces a novel local feature aggregation module that increases the receptive field for each 3D point and leverages attentive pooling to preserve local features and overcome the loss of key features expected from random sampling. Despite the significant downsampling, the method can retain the features necessary for accurate segmentation. Engel et al. [173] designed a Point Transformer architecture that utilizes a SortNet module and global feature generation to extract local and global features and relates both representations by introducing a local-global attention mechanism. The output of Point Transformer is a sorted and permutation-invariant feature list that can be directly incorporated into common computer vision applications. Under the same name, Zhao et al.
[108] proposed another Point Transformer architecture, entirely based on vector self-attention and pointwise operations, as a general backbone for several 3D scene understanding tasks. The encoder and decoder consist of modular point transformer layers, stacked in multiple stages with transition layers for down- and upsampling. For dense prediction tasks such as semantic segmentation, Point Transformer adopts a U-Net [174] design, as shown in Figure 18. The number of stages and the sampling rates can be varied depending on the application, e.g., to construct a lightweight backbone for fast processing. To overcome its early limitations, Wu et al. [175] proposed group vector attention in a revised Point Transformer V2 architecture with improved position encoding and an efficient partition-based pooling scheme. Lai et al. [176] proposed a Stratified Transformer to compensate for the drawbacks of the Point Transformer (V1) [108] and improve the capture of long-range context in point clouds by using standard multi-head self-attention [167]. Partitioning a point cloud into non-overlapping cubic windows limits the effective receptive field (EFR) [177] to the local region, which causes false predictions. Enlarging the window size helps to increase the receptive area but comes at a higher memory cost. To extend the attention beyond the limited local region, Lai et al. [176] used a stratified strategy to ensure that each subgroup (stratum) is adequately represented. Points within a small window next to a query point are sampled densely. Points within a second, larger window are sampled sparsely as a trade-off between EFR and memory consumption. Wang et al.
[107] built upon the Stratified Transformer and achieved state-of-the-art point cloud semantic segmentation results on the S3DIS large-scale indoor scene benchmark, as shown in Table A1, with leading accuracy scores on structural component classification (ceiling, floor, wall, and column). Improvements stem from the proposed window normalization with prior knowledge, which accounts for varying local point cloud density in the downsampling step.
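The dense/sparse key-sampling trade-off behind the stratified strategy can be sketched in a few lines. This is a radius-based simplification under our own assumptions (spherical instead of cubic windows, fixed stride for the sparse stratum), not the windowed implementation of [176]:

```python
import numpy as np

def stratified_keys(points, query, r_small, r_large, sparse_step=4):
    """Select attention keys for a query point: all points within a small
    radius densely, plus every `sparse_step`-th point from a larger radius,
    trading receptive field against memory."""
    d = np.linalg.norm(points - query, axis=1)
    dense = np.where(d <= r_small)[0]                  # dense near-field keys
    far = np.where((d > r_small) & (d <= r_large))[0]  # candidate far-field keys
    sparse = far[::sparse_step]                        # keep only a sparse subset
    return np.concatenate([dense, sparse])
```

The query thus attends to fine-grained local structure while still receiving long-range context at a fraction of the memory cost of a fully enlarged window.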

Discussion
With this survey, we have shown that point cloud semantic segmentation has made great strides through recent developments in machine learning. It has become a significant research area in computer vision, with broad applications in fields such as robotics, autonomous driving, and geodesy. The construction industry can greatly benefit from this. However, several challenges must be addressed before these methods can be fully integrated into industrial applications. In Section 3, we showed that there are currently no significant datasets available for construction-related point cloud learning. The state-of-the-art publications we surveyed used one of two workarounds to circumvent this shortage. Most publications focus on one of the few well-established datasets [59,62,131] for indoor scenes, which are, however, polluted by furniture and often sourced from outdated RGB-D sensors. Fewer publications established their own private task-specific datasets [19,24,86] but avoided sharing them with the public. Not sharing the deployed data makes the reproduction of results impossible and complicates their further use. This study highlights a significant challenge in the training of neural networks through supervised learning, wherein the availability of substantial human-labeled data remains crucial. Specifically, the creation of extensive, well-documented, and open-source datasets comprising point clouds combined with technical metadata (such as component labels, object properties, and damage classification) emerges as a major hurdle. Despite extensive research, viable alternatives to large annotated datasets for effectively training machine learning models have not yet been identified.
Transfer learning [122,123] is a powerful tool that can lower the amount of required training data; yet, context-specific data to fine-tune the model do not become obsolete. The opposite is the case: fine-tuning a pre-trained model raises the demand for high-quality and versatile data [178,179]. The need for data also applies to the required number of different datasets, designed for different sectors and applications in the construction industry. Sectors include building structures, civil engineering (over- and underground), infrastructure, and pre-cast fabrication. Applications include object detection, shape classification, semantic and instance segmentation, 3D reconstruction, and SLAM.
The preparation of such datasets is a non-trivial task. First of all, it requires expensive hardware to capture the data. Secondly, human resources are needed to annotate the data, an activity that is even more challenging in the 3D domain than in the 2D domain. Furthermore, domain experts are needed to create and review the dataset. Datasets must be balanced both extrinsically (regarding architecture and style, location, and environmental changes) and intrinsically (in terms of their constituent classes). The huge effort needed to create high-quality datasets has led companies to keep their elaborately compiled data under lock and key, as data have become a valuable commodity in all industries. However, open-sourcing these data is necessary for interoperability (enabling organizations or systems to work together), saving economic costs (avoiding collecting new data), and improving data quality (crowd-sourced debugging) and verification (reproducibility). Comprehensive and insightful databases, like the one we proposed in Table A1, help make such data more visible and accessible and, thus, boost development.
Balancing datasets retroactively is only possible to a limited extent, as we explain in Section 3. One lesson we learned from reviewing multiple datasets within this survey is to keep track of the object distribution already during the process of creating a new dataset. Statistics about class distribution, as proposed in Section 2, in terms of points per class as well as instances per class, help to keep an overview and to counteract imbalance iteratively. From the findings in the first half of this paper, we advocate for the community to put more effort into the creation of industry-standard datasets and for the recipients to honor the hard work of developing datasets.
Synthetic data are considered a complementary solution against dataset scarcity because synthetic data annotation is essentially free [180]. Table 1 features synthetic datasets and singular attempts that utilize synthetic training data to improve real-world performance. An outstanding example is the VASAD dataset [22] for building reconstruction. Researchers argue that the transfer of knowledge can improve the ability to perform complex tasks when they are initially performed in simulation [181,182]. Furthermore, the class imbalance problem becomes obsolete if class-object appearance in scenes is user-defined. However, experience with synthetic data for improving point cloud semantic segmentation is still limited, and its effects are currently under investigation. For example, the effect known as the "Sim2Real (sim-to-real) gap" [182,183] describes the discrepancy between simulated and real environment data, a phenomenon that can result in poor performance when a network is trained with synthetic data but the model is applied to the real world [184].
Unlike training data, which are content- and context-locked, general machine learning models can be transferred independently of their initial application, as this survey shows. Algorithms initially developed to segment, e.g., cars and pedestrians can be successfully deployed to segment buildings if trained with the appropriate data. One objective of this survey has been to identify future trends and the most promising methods for PCSS to channel our future research and share this with everyone with the same intentions. Several authors described their methods as the most suitable for PCSS and scene understanding, but this review leads to the conclusion that, still, no method has proven dominant. Convolutional neural networks (CNNs) have traditionally been preferred for processing image data, while recurrent neural networks (RNNs) have been preferred for sequential data. However, we find no consensus on the most appropriate method to analyze the 3D domain. Projection-based methods have legitimacy in autonomous driving and mapping applications because of fast processing and their synergy with directed stereo cameras and spherical LiDAR if the demand for resolution is low. Discretization-based learning methods for PCSS appear not to be competitive due to the memory overhead of fully connected CNNs and the lack of canonical order for permutation invariance. However, voxel discretization remains important for all kinds of point cloud processing [185] and map building in SLAM. The reviewed RNN methods for point cloud understanding struggle to adapt to changing point cloud density and rely on extensive partitioning procedures. Only a few approaches of this kind could be found that deal with large-scale PCSS. As in natural language processing, transformers lately look more promising for sequence data types [167]. Graph neural networks (GNNs) and graph convolution offer a natural representation of point clouds and claim to solve the issue of capturing long-range spatial context
information. However, stacking multiple GCN layers leads to vanishing gradients and over-smoothing. Autonomous vehicles and a network of sensors on our roads may soon produce a huge 3D roadmap dataset [186]. There is little expertise in this field for now, but GNNs might be a valuable approach for scene understanding in infrastructure civil engineering, with low requirements for detail but the need to process huge spatial datasets. We find that point-based networks, among all the mainstream methods, look the most promising for PCSS and support this statement with the leaderboard in Table A1 and additional public benchmarks [30,62]. Pooling-based MLPs, attention-based transformers, and point-based CNNs perform on par, with small deviations depending on which benchmark is investigated. Yet, multiple sources have lately commented on transformers being the most versatile general-purpose architecture for different types of data and computer vision tasks [168,187,188]. Even so, many of the new transformer models still incorporate the best parts of convolutions. That means future models are more likely to use both than to abandon CNNs entirely.
There are, of course, some drawbacks associated with the topic of deep learning in general and PCSS for the construction industry, which we want to discuss. We find in Section 3.2 that PCSS results are sensitive to geometrically similar shapes, and models easily get confused in the presence of class imbalance. This problem can be illustrated with the practical example of bridge part segmentation. Even though most bridges have similar components with the same function, these components look very different across the several bridge types. If models cannot generalize geometries, the industry faces the risk that the needed number of specialized datasets adds up fast, eventually becoming impossible to cover. A further cost associated with deep learning is the high demand for computing power, particularly in the training phase, but also for inference. This is true for all deep learning, but transformers are particularly affected [189,190]. Training the models requires expensive GPU server clusters, which are not common in conventional civil engineering offices. Cloud services can be a workaround for the training phase, but real-time application on-site will require equipping robots with powerful hardware. Finally, point cloud semantic segmentation is only a first step for machine learning applications in civil engineering. Instance segmentation is becoming increasingly important as the construction industry continues to integrate digital technologies into the building process and construction surveillance [191][192][193]. Instance segmentation can be considered a refined version of semantic segmentation. Where semantic segmentation assigns all segments (points) of the same recognized class to one global group, instance segmentation can differentiate between different objects of the same group. With the ability to automatically identify and label objects in a 3D point cloud, instance segmentation can provide valuable information for various construction-related tasks, including
fully autonomous robot operation, as-built reconstruction, automated building information modeling (BIM), progress tracking, and quality control.

Conclusions
Point cloud learning has gained strong attention due to its numerous applications in various fields of computer vision, but the visual understanding of construction sites by deep learning, such as semantic segmentation, is hardly mentioned in the literature. Fortunately, general learning-based approaches can also be adapted to solve construction-related tasks, with the right training data at hand. This paper presented a contemporary survey of the state-of-the-art algorithms for learning-based PCSS methods tailored toward the construction industry and its unique demands. We conducted a comprehensive literature review of the available datasets and well-established model architectures. An unprecedented database of scene understanding training datasets was evaluated, and an evaluation of the largest indoor scene benchmark for PCSS methods was presented. From the review of the latest state-of-the-art publications, we found a strong future trend toward transformer-based model architectures in the presence of dense point clouds with a high level of detail. Applications requiring very fast inference times still profit from point projections to leverage well-adopted 2D convolutional image segmentation. However, this comes at the expense of a lower level of detail. Very large sparse point clouds seem to profit from graph-based methods. Considering that, in the recent past, the majority of research in the field of deep learning has focused on Transformers as a versatile tool for various applications, we expect that this trend will also impact the construction industry. The initial indications of this development were evident in our benchmark comparison. Falling prices for sensor hardware and increasing success in all areas suggest that the number of applications will only increase. One challenge that should not go unmentioned is the question of how to meet the increasing demand for computing power. This question remains unanswered today, but should be
solvable in the future with the development of more efficient tensor processors and cloud computing.
The extensive review of public datasets revealed a significant data scarcity for training, testing, and validating supervised learning, which must be addressed by domain experts from inside the industry itself. This challenge can best be tackled together with a large community. We hope to contribute to this success with this work and to accelerate future developments by summarizing and providing an extensive database of the state of the art. Future research is encouraged to close this gap but also needs societies that provide the necessary funding.

Figure 1.
Figure 1. Concept of point cloud segmentation with learned building classes. Right: raw point cloud with color information. Left: semantically enriched point cloud with predicted class labels output by a neural network. All points with the same predicted class are colored in the same color.

Figure 4. Figure 5.
Figure 4. Examples of poor registration in 3D reconstruction from depth frames in the SceneNN dataset [56]. (a) Misalignment of the ceiling in two point clouds, highlighted by the red circle. (b) Deviation from the ground plane and large parts missing from the wall behind the desk. Highlighted by the two red lines and the opening angle, the left part of the room tilts, and the floor is not level.

Figure 6.
Figure 6. Quantitative evaluation of the per-point semantic class distribution of the S3DIS (2D-3D semantic dataset) [31]. (a) Per-point per-scene class distribution. (b) Class allocation of the total S3DIS dataset. The S3DIS dataset is partitioned into six subsets with diverse properties in architectural style and appearance. Area_1 to Area_6 represent the six subsets.

Figure 7.
Figure 7. Quantitative evaluation of the Perez-Perez dataset configuration [20]. (a) Test and training data split partitioned by semantic classes. Numbers show the ratio between the testing and training split. (b) Class distribution available in the dataset, comparing instances vs. raw per-point segments. (c) Exemplary confusion matrix representing the average precision of semantic labeling per segment from a support vector machine (SVM) model. Axes flipped for the unified convention. Published by Yeritza Perez-Perez et al. [20].

Figure 8.
Figure 8. Graphical representation of the two principles of under- and oversampling for handling imbalanced training data in machine learning applications.
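The two resampling principles in Figure 8 can be sketched as index selection. A minimal random variant, assuming equal target counts per class (our own simplification; one-sided selection and other informed strategies choose samples more carefully):

```python
import numpy as np

def rebalance(labels, mode="under", seed=0):
    """Random under-/oversampling of sample indices so every class ends up
    with the same count: 'under' discards majority samples, 'over' duplicates
    minority samples."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.min() if mode == "under" else counts.max()
    picked = []
    for c in classes:
        idx = np.where(labels == c)[0]
        picked.append(rng.choice(idx, size=target, replace=(mode == "over")))
    return np.concatenate(picked)
```

Undersampling throws information away, while oversampling risks overfitting to duplicated minority samples, which is why both are often combined with class-weighted losses instead.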

Figure 9.
Figure 9. Confusion matrix for binary testing. Note: other references may use a different convention for axes. Declaration: real positive (RP), real negative (RN), predicted positive (PP), predicted negative (PN).

Figure 10.
Figure 10. Multi-class confusion matrix summation scheme to determine the following: true positive (TP), false positive (FP), false negative (FN), and true negative (TN). An example is shown for the class wall. Note: other references may use a different convention for axes.
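The summation scheme of Figure 10, and the per-class IoU scores commonly derived from it, can be written out briefly. The sketch below assumes rows hold the ground truth and columns the predictions (as the caption notes, other references flip the axes), with our own function name:

```python
import numpy as np

def per_class_metrics(cm):
    """Derive TP/FP/FN/TN, per-class IoU, and mIoU from a multi-class
    confusion matrix (rows: ground truth, columns: prediction)."""
    tp = np.diag(cm).astype(float)     # correctly predicted per class
    fp = cm.sum(axis=0) - tp           # predicted as class c but wrong
    fn = cm.sum(axis=1) - tp           # class c missed by the model
    tn = cm.sum() - tp - fp - fn       # everything else
    iou = tp / (tp + fp + fn)          # intersection over union per class
    return tp, fp, fn, tn, iou, iou.mean()
```

Because mIoU averages over classes regardless of their size, it penalizes poor performance on rare classes, which makes it the standard PCSS metric under class imbalance.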

Figure 12.
Figure 12. Taxonomy of deep learning methods for 3D point cloud segmentation. Adapted from [25] and extended with the latest transformer-based methods.

Figure 15.
Figure 15. Process steps for 3D semantic segmentation on voxelized point clouds. The steps show the voxel discretization, including point cloud downsampling, the per-voxel class prediction, and the label back-projection onto the unstructured point cloud. Prediction artifacts can be optimized to produce the final results. Originally shown in [5].
achieved global awareness with an additional segmentation network, which feeds the global point cloud feature vector back to the local per-point features by simple matrix-vector concatenation. Per-point class scores can be predicted from the combination of local and global point features. Based on the success of PointNet, various segmentation methods have been proposed: point-based convolution networks, graph-based networks, RNN-based networks, and point-based MLP networks.

Figure 16.
Figure 16. Left: pixel-wise convolution with kernel size 3 × 3. Center: pointwise convolution: for each point, nearest neighbors are searched and binned into kernel cells before convolving with kernel weights. Graphic originally shown in [138]. Right: kernel point convolution, 2D example. Input points with a constant scalar feature (in grey) are convolved through a KPConv that is defined by a set of kernel points (in black) with filter weights on each point. Graphic originally shown in [100].

Figure 17.
Figure 17. Visualization of the individual steps in the superpoint graph pipeline. Left: geometric partitions from superpoints. Right: visualization of the interconnecting superpoint graph (SPG); originally shown in Landrieu and Simonovsky [149].

Table 1.
List of publicly available datasets for 3D scene understanding, categorized by data acquisition method, dataset content, hardware used, data representation, and extent of available annotation classes. The digital version (.csv) of this table is provided in the Supplementary Materials. Declaration of data type: real-world (R), synthetic (S).
One-Sided Selection: selects a subset of the majority class such that the samples are more dissimilar to the minority class than the majority class samples that are not selected.