Uncertainty Estimation in Deep Neural Networks for Point Cloud Segmentation in Factory Planning

The digital factory undoubtedly provides great potential for future production systems in terms of efficiency and effectiveness. A key aspect on the way to realizing the digital copy of a real factory is the understanding of complex indoor environments on the basis of 3D data. In order to generate an accurate factory model including the major components, i.e. building parts, product assets and process details, the 3D data collected during digitalization can be processed with advanced methods of deep learning. In this work, we propose a fully Bayesian and an approximate Bayesian neural network for point cloud segmentation. This allows us to analyze how different ways of estimating uncertainty in these networks improve segmentation results on raw 3D point clouds. We achieve superior model performance for both the Bayesian and the approximate Bayesian model compared to the frequentist one. This performance difference becomes even more striking when incorporating the networks' uncertainty into their predictions. For evaluation we use the scientific data set S3DIS as well as a data set collected by the authors at a German automotive production plant. The methods proposed in this work lead to more accurate segmentation results, and the incorporation of uncertainty information makes this approach especially applicable to safety critical applications.

well before implementation (Kuhn, 2006), i.e. before factory ramp-up, before new machinery is ordered, before construction is under way or before the production process is detailed. This is due to the fact that the structure and layout of the building influence several other domains. A changing building model can entail changes in the spatial availability for production or logistics assets. Thus, the layout of production lines or the concept of machines may have to be adapted accordingly. Further, virtual planning reduces travel efforts as planners do not have to meet on-site to discuss modifications or reorganizations. They can instead meet in a multi-user simulation model or a virtual reality supported 3D environment, which saves a substantial amount of travel time and cost. Digital 3D models are the basis for building reorganizations as well as the introduction of completely new or modified manufacturing process steps. In order to determine the as-is state of a production plant, several challenges have to be tackled. First of all, current data in the respective plant have to be collected. In order to acquire 3D information, laser scanning and photogrammetry are useful digitalization techniques. After plant digitalization the collected data have to be pre-processed, including data cleaning and the fusion of inputs from different sources. Working solely on the basis of point clouds for the sake of factory or process simulation is not possible, as point clouds generated by laser scanners and photogrammetry techniques suffer from occlusions, which result in holes within the point cloud. For instance, the outcomes of collision checking are not reliable when the point cloud is incomplete. Additionally, a point cloud does not contain any information on how to separate different objects. Therefore, the introduction of new or the displacement of existing objects is time-consuming, as the respective set of points has to be selected manually.
In order to separate different objects from one another automatically, a segmentation step has to be introduced (Petschnigg et al, 2020). Most of the existing deep learning architectures make use of the frequentist notion of probability. However, these so-called frequentist neural networks suffer from two major drawbacks. First, they do not quantify the uncertainty in their predictions. The softmax output of frequentist neural networks is often interpreted as network uncertainty, which is, however, not a good measure: the softmax merely normalizes an input vector and cannot as such be interpreted as network (un)certainty (Gal and Ghahramani, 2016). Especially for out-of-distribution samples the softmax output can give rise to misleading interpretations (Sensoy et al, 2018). When deep learning frameworks are integrated into safety critical applications like autonomous driving, it is important to know what the network is uncertain about. One infamous accident was caused by a partly autonomous car that confused the white trailer of a lorry with the sunlit sky or a bright overhead sign (Banks et al, 2018). By considering network uncertainties similar scenarios could be mitigated. The second shortcoming of frequentist neural networks is their tendency to overfit on small data sets with a high number of features. In this work, however, we focus on uncertainty estimation rather than the challenge of feature selection. We present a novel Bayesian 3D point cloud segmentation framework based on PointNet (Qi et al, 2017a) that is able to capture uncertainty in network predictions. The network is trained using variational inference with multivariate Gaussians with a diagonal covariance matrix as the variational distribution. This approach adds hardly any additional parameters to be optimized during each backward pass (Steinbrener et al, 2020).
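The inadequacy of the softmax output as an uncertainty measure can be illustrated with a small numerical sketch (plain NumPy, illustrative only): scaling the logits drives the softmax towards a one-hot vector, so the output can look arbitrarily "confident" without reflecting any calibrated notion of uncertainty.

```python
import numpy as np

def softmax(z):
    # Subtract the maximum for numerical stability before exponentiating.
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5])
# The same relative ordering of logits, merely rescaled, yields a
# near-certain softmax output:
print(softmax(logits))        # moderate "confidence"
print(softmax(10 * logits))   # close to a one-hot vector
```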
Further, we formulate an approximate Bayesian neural network by applying dropout training as suggested in (Gal and Ghahramani, 2016). We use an entropy based interpretation of uncertainty in the network outputs and distinguish between overall, data related and model related uncertainty. These types of uncertainty are called predictive, aleatoric and epistemic uncertainty, respectively (Chen et al, 2017). This differentiation is worth considering as it shows which predictions are uncertain and to what extent this uncertainty can be reduced by further model refinement. The uncertainty remaining after model optimization and training is then inherent to the underlying data set. Other notions of uncertainty based on the variance or credible intervals of the predictive network outputs are discussed and evaluated. To the best of our knowledge no other work has treated the topic of uncertainty estimation and Bayesian training of 3D segmentation networks that operate on raw and unordered point clouds without a previous transformation into a regular format. Aside from a data set collected by the authors at a German automotive manufacturing plant, the proposed networks are evaluated on a scientific data set in order to ensure comparability with other state-of-the-art frameworks. Summing up, the contributions of this paper are:
- Workflow: We describe how to quantify uncertainty in segmentation frameworks that operate on raw and unstructured point clouds. Further, we discuss how this information assists in generating current factory models.
- Framework: We formulate a PointNet (Qi et al, 2017a) based 3D segmentation model that is trained in a fully Bayesian way using variational inference, as well as an approximate Bayesian model derived by the application of dropout training.
- Experiment: We evaluate how the different sources of uncertainty affect the neural networks' segmentation performance in terms of accuracy.
Further, we outline how the factory model can be improved by considering uncertainty information.
The remainder of this paper is organized as follows. Section 2 conducts a thorough literature review on 3D point cloud processing frameworks including deep neural networks, Bayesian neural networks and uncertainty quantification. In the subsequent Section 3 the frequentist, the approximate Bayesian and the fully Bayesian models are described in more detail. Section 4 discusses the scientific and industrial data sets that are used for the evaluation of our models and elaborates on their characteristics. The models are evaluated with respect to their performance in Section 5. Finally, Section 6 provides a discussion, which describes the bigger scope of this work and concludes the paper.

Literature Review
The following paragraphs shed light on prior research in the areas of segmentation of 3D point clouds as well as Bayesian neural networks and uncertainty estimation. The neural networks discussed in the first section are all based on the classical or frequentist interpretation of probability. Bayesian neural networks instead take on the Bayesian interpretation of probability, which views probability as a personal degree of belief.

3D Segmentation
In contrast to images, which have a regular pixel structure, point clouds are irregular and unordered. Further, they do not have a homogeneous point density due to occlusions and reflections. Neural networks that process 3D point clouds have to tackle all of these challenges. Most networks are based on the frequentist interpretation of probability and can be divided into three classes based on the format of their input data. There are deep learning frameworks that consume voxelized point clouds (Lei et al, 2020; Qi et al, 2016; Wu et al, 2015; Zhou and Tuzel, 2018), collections of 2D images derived by projecting 3D point clouds to 2D space from different views (Chen et al, 2017; Feng et al, 2018; Yang et al, 2018) and raw unordered point clouds (Qi et al, 2017a,b; Ravanbakhsh et al, 2016). On the one hand, voxelization of point clouds has the advantage of providing a regular structure apt for the application of 3D convolutions. On the other hand, it renders the data unnecessarily big, as unoccupied areas of the point cloud are still represented by voxels. Generally, this format conversion introduces truncation errors (Qi et al, 2017a). Further, voxelization reduces the resolution of the point cloud in dense areas, leading to a loss of information (Xie et al, 2020). Transforming 3D point clouds to 2D images from different views allows the application of standard 2D convolutions with their elaborate kernel optimizations. Yet, the transformation to a lower-dimensional space can cause the loss of structural information embedded in the higher-dimensional space. Additionally, in complex scenes a high number of viewports has to be taken into account in order to describe the details of the environment (Xie et al, 2020). For these reasons, the following work focuses on the segmentation of raw point clouds.
In order to generate a factory model out of raw point clouds, the objects of interest have to be detected and their pose needs to be estimated. One approach that extracts six degrees-of-freedom (DoF) object poses, i.e. the translation and orientation with respect to a predefined zero point, in order to generate a simulation scene is presented in (Avetisyan et al, 2019a). The framework is called Scan2CAD and describes a frequentist deep neural network that consumes voxelized point clouds as well as computer-aided design (CAD) models of eight household objects and directly learns the 6DoF CAD model alignment within the point cloud. The system presented in (Avetisyan et al, 2019b) has similar input data and estimates the 9DoF pose, i.e. translation, rotation and scale, of the same household objects. A framework for the alignment of CAD models, which is based on global descriptors computed using the Viewpoint Feature Histogram approach (Rusu et al, 2010) rather than neural networks, is discussed in (Aldoma et al, 2011). Generally, direct 6DoF or 9DoF pose estimation on the basis of point clouds and CAD models can be used to set up environment models and simulation scenes. However, these approaches always require the availability of CAD models, which is not the case for many building and inventory objects in real-world factories. Thus, we follow the approach of semantic segmentation instead of direct pose estimation. Semantic segmentation allows us to extract reference point clouds of objects for which no CAD model is available. These objects can either be modelled in CAD automatically by using meshing techniques or by hand if the geometry is too difficult to capture realistically. Further, the segmentation approach enables us to partition the point cloud into bigger contexts, i.e. subsets of points belonging to the construction, assembly or logistics domain.
These smaller subsets of points can be sent to the respective departments for further processing, reducing the computational burden of the point cloud to be processed. Mere pose estimation is not sufficient to fulfil this task. Aside from the semantic segmentation of point clouds, this work focuses on the formulation of Bayesian neural networks and how to leverage the uncertainty information that can be calculated in order to increase the models' accuracy.

Bayesian Deep Learning and Uncertainty Quantification
In contrast to frequentist neural networks, where the network parameters are point estimates, Bayesian neural networks (BNNs) place a distribution over each of the network parameters. To this end, a prior distribution is defined over the parameters. After observing the training data the aim is to calculate the respective posterior distribution, which is difficult as it requires the solution of a generally intractable integral. Several approximate solution approaches exist, including variational inference (VI) (Blundell et al, 2015; Graves, 2011), Markov Chain Monte Carlo (MCMC) methods (Brooks et al, 2011; Gelfand and Smith, 1990; Hastings, 1970), Hamiltonian Monte Carlo (HMC) algorithms (Duane et al, 1987) and integrated nested Laplace approximations (INLA) (Rue et al, 2009). VI provides a fast approximation to the posterior distribution; however, it comes without any guaranteed quality of approximation. MCMC methods, in contrast, are asymptotically correct but computationally much more expensive than VI. Even the generally faster HMC methods are clearly more time consuming than VI (Blei et al, 2017). As the data sets used for evaluating this work are large, we apply VI for efficiency reasons.
In the literature there are several ways of quantifying uncertainty in BNNs. It is possible to distinguish between data and model related uncertainty, which are referred to as aleatoric and epistemic uncertainty, respectively (Der Kiureghian and Ditlevsen, 2009). The overall uncertainty inherent to a prediction can be computed as the sum of aleatoric and epistemic uncertainty and is called predictive uncertainty. Such a distinction is beneficial for practical applications in order to determine to what extent model refinement can reduce predictive uncertainty and to what extent uncertainty stems from the data set itself. One possibility to describe predictive uncertainty U_pred is based on entropy, i.e. U_pred = H[p(y* | x*, X, Y)], the entropy of the posterior predictive distribution (Gal et al, 2017). Another way of quantifying uncertainty in the network parameters of BNNs is presented in (Steinbrener et al, 2020). This approach introduces only two additional uncertainty parameters per network layer, which allows us to grasp uncertainty layer-wise without impairing network convergence. The overall model uncertainty is measured by estimating credible intervals of the predictive network outputs p(y* | ŵ_k, x*). This is based on the notion that higher uncertainty in the network parameters results in higher uncertainty in the network outputs. Further, the predictive variance can be used for uncertainty estimation as well.

Model Descriptions
In the following the frequentist, the approximate Bayesian and the fully Bayesian model are explained. In order to formulate these models let X = {x_1, . . . , x_n} be the input data and Y = {y_1, . . . , y_n} the corresponding labels, where y_i ∈ {1, . . . , m}, m ∈ N, i ∈ {1, . . . , n}, n ∈ N. Further, let W and B denote all the network weights and biases, respectively. The weights and biases of the i-th network layer are denoted by W_i and B_i, i ∈ {1, . . . , d}, where d ∈ N is the network depth. The described network architectures mainly apply convolutional layers; thus, we write conv(i, j) for a convolutional layer with input dimension i ∈ N and output dimension j ∈ N. In the following σ(·) denotes a non-linear function. In the sequel, uncertainty estimation is explained in more detail and its practical implementation is discussed.

Frequentist PointNet
The baseline for the following derivations and evaluations is the PointNet segmentation architecture (Qi et al, 2017a). This framework consumes raw and unordered point sets in a block structure. The number of points in each block is exactly 4096, either due to random down-sampling or due to up-sampling by repeated drawing of points. Each input point is represented by a vector x containing xyz-coordinates centred about the origin and RGB values. For later illustration purposes, we add another three dimensions, which hold the original point coordinates, i.e. dim(x) = 9. The actual network input is a tensor of dimension bs × 4096 × 6, where bs ∈ N represents the batch size, i.e. the number of input blocks treated at a time. The last dimension is 6 instead of 9 because the centred point coordinates rather than the original ones are used for network training. In this architecture a symmetric input transformation network is applied first. It is followed by a convolutional layer conv(6, 64) and a feature transformation network. After the feature transformation, another two convolutional layers conv(64, 128) and conv(128, 1024) are applied before extracting global point cloud features using a max pooling layer. These global features are concatenated with the local features, which correspond to the direct output of the feature transformation network. The resulting network scores are generated by four convolutional layers conv(1088, 512), conv(512, 256), conv(256, 128) and conv(128, m), where m ∈ N is the number of classes. The rectified linear unit (ReLU) is used as the non-linear activation function in this network.
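The layer sequence described above can be sketched in PyTorch as follows. This is a simplified reconstruction for illustration, omitting the input and feature transformation networks, and is not the authors' implementation:

```python
import torch
import torch.nn as nn

class PointNetSegSketch(nn.Module):
    """Simplified PointNet segmentation trunk: per-point MLPs as 1x1
    convolutions, a max-pooled global feature, and a segmentation head
    on the concatenated local (64-d) and global (1024-d) features."""
    def __init__(self, num_classes: int):
        super().__init__()
        # The input and feature transform networks are omitted here.
        self.local = nn.Sequential(nn.Conv1d(6, 64, 1), nn.ReLU())
        self.to_global = nn.Sequential(
            nn.Conv1d(64, 128, 1), nn.ReLU(),
            nn.Conv1d(128, 1024, 1), nn.ReLU())
        self.head = nn.Sequential(
            nn.Conv1d(1088, 512, 1), nn.ReLU(),
            nn.Conv1d(512, 256, 1), nn.ReLU(),
            nn.Conv1d(256, 128, 1), nn.ReLU(),
            nn.Conv1d(128, num_classes, 1))

    def forward(self, x):                  # x: (bs, 6, 4096)
        local = self.local(x)              # (bs, 64, n) per-point features
        feat = self.to_global(local)       # (bs, 1024, n)
        glob = feat.max(dim=2, keepdim=True).values   # global feature
        glob = glob.expand(-1, -1, x.shape[2])        # repeat per point
        return self.head(torch.cat([local, glob], dim=1))  # (bs, m, n)

scores = PointNetSegSketch(num_classes=13)(torch.randn(2, 6, 4096))
print(scores.shape)  # torch.Size([2, 13, 4096])
```

The max pooling over the point dimension is the symmetric function that makes the output invariant to the ordering of the input points.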

Approximate Bayesian PointNet
For the approximate Bayesian PointNet segmentation network we use the notion that dropout training in neural networks corresponds to approximate Bayesian inference (Gal and Ghahramani, 2016). In the following, this network is referred to as dropout PointNet. Dropout in a neural network with a single hidden layer can be defined by sampling binary vectors c_1 ∈ {0, 1}^{d_1} and c_2 ∈ {0, 1}^{d_2} from Bernoulli distributions such that c_{1,q} ∼ Be(p_1) and c_{2,k} ∼ Be(p_2), where q = 1, . . . , d_1 and k = 1, . . . , d_2. The variables d_1 and d_2 correspond to the number of units in the respective layer and p_1, p_2 ∈ [0, 1] are the dropout probabilities. Then dropout can be interpreted as

ŷ = W_2 (c_2 ∘ σ(W_1 (c_1 ∘ x) + b_1)),

where ∘ denotes the element-wise (Hadamard) product. The bias in the second layer is omitted, which corresponds to centring the output. The network output ŷ is normalized using the softmax function

softmax(ŷ)_j = exp(ŷ_j) / Σ_l exp(ŷ_l).

The log of this function results in the log-softmax loss. In order to improve the generalization ability of the network, L2 regularization terms for the network weights and biases can be added to the loss function. The optimization of such a neural network acts as approximate Bayesian inference in deep Gaussian process models (Gal and Ghahramani, 2016). This approach neither changes the model nor the optimization procedure, i.e. the computational complexity during network training does not increase. It is suggested to apply dropout before every weight layer in the network; however, empirical results for convolutional neural networks show inferior performance when doing so. Thus, we place dropout before the last three layers in the PointNet model with a dropout probability of 0.1. Other than that the frequentist model is left unchanged. Placing dropout within the input or feature transform network results in considerably lower performance.
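At test time, the dropout PointNet keeps its dropout layers active and averages several stochastic forward passes. A model-agnostic sketch of this Monte Carlo dropout procedure (the toy model and function names are illustrative, not the authors' code):

```python
import torch

def mc_dropout_predict(model, x, k=50):
    """Average k stochastic forward passes with dropout kept active."""
    # train() keeps nn.Dropout sampling; no weights are updated here.
    # Caution: for models with batch norm, put only the dropout modules
    # into training mode instead of the whole model.
    model.train()
    with torch.no_grad():
        probs = torch.stack(
            [torch.softmax(model(x), dim=1) for _ in range(k)])
    return probs.mean(dim=0), probs   # mean prediction and all samples

# Toy stand-in for the last PointNet layers: linear layer plus dropout.
net = torch.nn.Sequential(torch.nn.Linear(4, 3), torch.nn.Dropout(p=0.1))
mean_p, samples = mc_dropout_predict(net, torch.randn(8, 4), k=20)
print(mean_p.shape, samples.shape)
```

The returned sample stack is exactly what the entropy and credible-interval based uncertainty measures of Section 3 are computed from.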

Bayesian PointNet
As already mentioned, BNNs place a distribution over each of the network parameters. In Bayesian deep learning all of the network parameters including weights and biases are expressed as one single random vector W. The prior knowledge about these parameters is captured by the a priori distribution p(w). After observing some data (X, Y) the a posteriori distribution can be derived. Using Bayes' theorem the posterior density reads

p(w | Y, X) = p(Y | w, X) p(w) / ∫ p(Y | w', X) p(w') dw'.

The likelihood p(Y | w, X) is given by ∏_{i=1}^{n} BNN(x_i; w)_{y_i}, which corresponds to the product of the BNN outputs for all training inputs under the assumption of stochastic independence. However, the integral in the denominator is usually intractable, which makes the direct computation of the posterior difficult. In Section 2.2 different methods for posterior approximation are discussed. As already mentioned, we use VI as it is most efficient in the case of a huge amount of training data. The idea of VI is to approximate the posterior p(w | Y, X) by a parametric distribution q_φ(w). To this end the Kullback-Leibler divergence (KL-divergence) between the variational and the posterior density is minimized, i.e.

φ* = argmin_φ KL(q_φ(w) || p(w | Y, X)).
The KL-divergence is not a true distance metric as neither the triangle inequality nor symmetry holds. Nevertheless it is frequently used in the BNN literature to measure the distance between two distributions. Due to the unknown posterior inside the KL-divergence, it cannot be minimized directly. According to (Bishop, 2006) the minimization of the KL-divergence is equivalent to the maximization of the log evidence lower bound (ELBO), which reads

ELBO(φ) = E_{q_φ(w)}[log p(Y | w, X)] − KL(q_φ(w) || p(w)).

After the optimization of the variational distribution it can be used to approximate the posterior predictive distribution for unseen data. Let x* be an unseen input with corresponding label y*. The posterior predictive distribution represents the belief in a label y* for an input x* and is given by

p(y* | x*, Y, X) = ∫ p(y* | w, x*) p(w | Y, X) dw.

The two factors under the integral correspond to the (future) likelihood and the posterior. The intractable integral can be approximated by Monte Carlo integration using K ∈ N terms, where the posterior distribution is replaced by the variational distribution, i.e.

p(y* | x*, Y, X) ≈ (1/K) Σ_{k=1}^{K} BNN(x*; ŵ_k), ŵ_k ∼ q_φ(w),
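For a diagonal Gaussian variational distribution and an isotropic Gaussian prior, the KL term of the ELBO has the standard closed form KL(q || p) = Σ_j [log(σ_p / σ_{q,j}) + (σ_{q,j}² + (μ_{q,j} − μ_p)²) / (2 σ_p²) − 1/2]. A NumPy sketch of this generic formula (not the authors' code):

```python
import numpy as np

def kl_diag_gaussians(mu_q, sigma_q, mu_p=0.0, sigma_p=1.0):
    """KL(q || p) between a diagonal Gaussian q and an isotropic
    Gaussian prior p, summed over all parameter dimensions."""
    return np.sum(
        np.log(sigma_p / sigma_q)
        + (sigma_q**2 + (mu_q - mu_p)**2) / (2 * sigma_p**2)
        - 0.5)

# The KL term vanishes when q equals the prior ...
print(kl_diag_gaussians(np.zeros(5), np.ones(5)))
# ... and is positive otherwise.
print(kl_diag_gaussians(np.full(5, 0.3), np.full(5, 0.8)))
```

During training this closed-form term is added to the Monte Carlo estimate of the expected log likelihood to obtain the (negative) ELBO loss.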
with BNN denoting a forward pass through the network and ŵ_k denoting the k-th weight sample. Finally, the prediction ŷ is given by the index of the largest element in the mean of the posterior predictive distribution and thus reads

ŷ = argmax_{j ∈ {1,...,m}} [ (1/K) Σ_{k=1}^{K} BNN(x*; ŵ_k) ]_j.

After having discussed the theoretical background, we describe our Bayesian model and the corresponding variational distribution. The model we suggest has a structure similar to the framework in (Steinbrener et al, 2020). The weights W_i and biases B_i of the i-th network layer, i ∈ {1, . . . , d}, are defined as

W_i = μ_wi + (τ_wi 1_{d_i}) ∘ ε_wi,
B_i = μ_bi + (τ_bi 1_{d_i}) ∘ ε_bi,

with

τ_wi := log(1 + exp(δ_wi)) (9)
τ_bi := log(1 + exp(δ_bi)) (10)

where δ_wi ∈ R, δ_bi ∈ R, μ_wi ∈ R^{d_i} and μ_bi ∈ R^{d_i} are the variational parameters. Further, 1_{d_i} denotes the d_i-dimensional vector consisting of all ones, ε_wi ∈ R^{d_i} as well as ε_bi ∈ R^{d_i} are multivariate standard normally distributed and ∘ represents the Hadamard product. Thus, the weights and biases follow a multivariate normal distribution with a diagonal covariance matrix, i.e.

W_i ∼ N(μ_wi, τ_wi² I_{d_i}), B_i ∼ N(μ_bi, τ_bi² I_{d_i}).
For more detailed insights on the respective gradient updates see (Steinbrener et al, 2020). Due to the dying ReLU problem, we use the leaky ReLU activation function in the Bayesian model with a negative slope of 0.01. The mean of the weights is initialized using the Kaiming normal initialization with the same negative slope as for the leaky ReLU activation.
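The parameterisation of Equations (9) and (10) amounts to a reparameterised sampling step in each forward pass: a mean per weight, a single softplus-transformed scale per layer, and standard normal noise. A hedged PyTorch sketch of such a layer (all names are illustrative, not the authors' implementation):

```python
import torch
import torch.nn.functional as F

class VariationalLinear(torch.nn.Module):
    """Linear layer with Gaussian weights: one mean per weight and a
    single scale parameter per layer, tau = softplus(delta)."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.mu_w = torch.nn.Parameter(torch.zeros(d_out, d_in))
        self.mu_b = torch.nn.Parameter(torch.zeros(d_out))
        self.delta_w = torch.nn.Parameter(torch.tensor(-3.0))
        self.delta_b = torch.nn.Parameter(torch.tensor(-3.0))

    def forward(self, x):
        tau_w = F.softplus(self.delta_w)   # tau = log(1 + exp(delta))
        tau_b = F.softplus(self.delta_b)
        # Reparameterisation trick: W = mu + tau * eps, eps ~ N(0, I),
        # so gradients flow to mu and delta through the sampled weights.
        w = self.mu_w + tau_w * torch.randn_like(self.mu_w)
        b = self.mu_b + tau_b * torch.randn_like(self.mu_b)
        return F.linear(x, w, b)

layer = VariationalLinear(6, 4)
out = layer(torch.randn(10, 6))
print(out.shape)  # torch.Size([10, 4])
```

Because the scale is a single scalar per layer, only two additional parameters per layer are optimized compared to the frequentist network, as stated above.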

Uncertainty Estimation
As already described, the estimated uncertainty can be split into predictive, aleatoric and epistemic uncertainty. In practice, predictive uncertainty U_pred is approximated by marginalization over the weights,

U_pred = H[ (1/K) Σ_{k=1}^{K} p(y* | ŵ_k, x*) ]. (15)

In Equation (15) p(y* | ŵ_k, x*) corresponds to the predictive network output of label y* for an input data point x* and the k-th weight sample ŵ_k of the variational distribution, and H denotes the Shannon entropy. The total number of Monte Carlo samples is given by K ∈ N. Aleatoric uncertainty U_alea is interpreted as the average entropy over all the weight samples,

U_alea = (1/K) Σ_{k=1}^{K} H[ p(y* | ŵ_k, x*) ]. (16)

Finally, epistemic uncertainty U_ep is the difference between predictive and aleatoric uncertainty, i.e. U_ep = U_pred − U_alea. Further, uncertainty in the network prediction can be quantified by calculating the variance of the predictive network outputs. Another way is to calculate a credible interval of the network outputs for each class, for instance the 95 %-credible interval. In case the 95 %-credible interval of the predicted class overlaps with the 95 %-credible interval of any other class, the prediction is considered uncertain.
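Given K sampled softmax outputs per point, the entropy-based decomposition can be computed directly. A NumPy sketch under the definitions above (the Dirichlet samples merely stand in for network outputs):

```python
import numpy as np

def uncertainty_decomposition(probs, eps=1e-12):
    """probs: array of shape (K, n_points, n_classes) holding the
    softmax outputs of K Monte Carlo weight samples."""
    mean_p = probs.mean(axis=0)                          # (n, m)
    # Predictive uncertainty: entropy of the averaged prediction.
    u_pred = -(mean_p * np.log(mean_p + eps)).sum(axis=-1)
    # Aleatoric uncertainty: average entropy of the individual samples.
    u_alea = -(probs * np.log(probs + eps)).sum(axis=-1).mean(axis=0)
    # Epistemic uncertainty: the difference of the two (>= 0 by
    # concavity of the entropy).
    return u_pred, u_alea, u_pred - u_alea

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(13), size=(50, 4096))          # K = 50 samples
u_pred, u_alea, u_ep = uncertainty_decomposition(p)
print(u_pred.shape)
```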

Data Sets
Two different data sets are used to evaluate our Bayesian and approximate Bayesian segmentation approaches, which form the core contribution of this work. The first one is the Stanford large-scale 3D indoor spaces data set, which is open to scientific use and thus ensures the comparability of our approach to other methods. The second data set is a large-scale point cloud data set collected and pre-processed by the authors at a German automotive OEM.

Stanford Large-Scale 3D Indoor Spaces Data Set
The Stanford large-scale 3D indoor spaces (S3DIS) data set (Armeni et al, 2016) is an RGB-D data set of 6 indoor areas. It features more than 215 million points collected over an area totalling more than 6,000 m². The areas are spread across three buildings comprising educational facilities, offices, sanitary facilities and hallways. The annotations are provided on instance level and distinguish structural from furniture elements. This yields 13 classes in total: the building structures ceiling, floor, wall, beam, column, window and door as well as the furniture elements table, chair, sofa, bookcase, board and clutter. The data set can be downloaded from http://buildingparser.stanford.edu/dataset.html.

Automotive Factory Data Set
This data set was collected using both a static Faro Focus3D X 130HDR laser scanner and two DSLR cameras. In more detail, a Nikon D5500 camera with an 8 mm fish-eye lens and a Sony Alpha 7R II with a 25 mm fixed focal length lens were used. We generate a global point cloud comprising 13 tacts of car body assembly by registering several smaller point clouds collected at each scanner position. The final point cloud comprises more than one billion points before further pre-processing. Coarse cleaning is achieved using noise filters and fine-tuning is done by hand. The resulting point set consists of 594 147 442 points, which corresponds to a reduction of about 40 % of the points during point cloud cleaning. Most of the removed points are noise points caused by reflections and the blur of moving objects like people walking by the laser scanner. The data set is divided into 9 different classes, namely car, hanger, floor, band, lineside, wall, column, ceiling and clutter. The labelling is done manually by the authors. The class clutter is a placeholder for all objects that cannot be assigned to one of the other classes. All of the remaining classes are either building structures or objects that can only be moved with high effort; thus, they are essentially immovable and have to be considered during planning tasks. The resulting data set is highly imbalanced with respect to the class distribution. Figure 1 (a) depicts the class distribution of this data set. Clearly, there is a notable excess of points belonging to the class ceiling and relatively few points belong to the classes wall and column. This is mainly due to the layered architecture of the ceiling, which results in points belonging to this structure at various heights.
As walls and columns are mostly draped with other objects like tools, cables, fire extinguishers, posters and information signs, only a small number of points truly belongs to the classes wall and column. This is also the reason why especially these two classes suffer from a high degree of missing data, i.e. holes in the point cloud. Any segmentation system therefore has to cope with this inhomogeneous class distribution. Figure 1 (b) illustrates the point cloud of one tact of car body assembly.
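One common way of coping with such class imbalance, not necessarily the one used here, is to weight the loss inversely to the class frequency. A sketch of such inverse-frequency weights:

```python
import numpy as np

def inverse_frequency_weights(labels, num_classes):
    """Per-class weights proportional to inverse class frequency,
    rescaled so the weights average to 1."""
    counts = np.bincount(labels, minlength=num_classes).astype(float)
    counts = np.maximum(counts, 1.0)   # guard against absent classes
    w = 1.0 / counts
    return w * num_classes / w.sum()

# Toy label distribution with a dominant "ceiling"-like class 0:
labels = np.array([0] * 90 + [1] * 9 + [2] * 1)
print(inverse_frequency_weights(labels, 3))
```

Such a weight vector can, for instance, be passed to a weighted cross-entropy loss so that rare classes like wall and column contribute more per point.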

Results and Analysis
The proposed networks are evaluated on our custom automotive factory data set as well as the scientific data set S3DIS. The segmentation performance is measured with respect to accuracy and mean intersection over union. Further, the described ways of uncertainty quantification are evaluated in terms of accuracy after disregarding uncertain predictions. All the considered models are implemented using Python's open source library PyTorch (Paszke et al, 2019). The input point clouds comprising rooms or assembly tacts are cut into blocks and the number of points within these blocks is sampled to 4096. These blocks serve as input for all networks. The frequentist, dropout and Bayesian networks are all trained using mini-batch stochastic gradient descent with a batch size of 16 on the S3DIS and the automotive factory data set. The momentum parameter is set to 0.9 for all models. A decaying learning rate lr is used with an initial value of lr = 0.001 for the frequentist and the dropout model and lr = 0.01 for the Bayesian model. The learning rate is decayed every 10 epochs by a factor of 0.7 during frequentist and dropout training and by a factor of 0.9 during Bayesian training. The batch size and the learning rate are optimized by grid search and cross-validation. In the approximate Bayesian neural network dropout is applied before the last three convolutional layers with a dropout rate of 0.1 for the automotive factory data set. In the case of the S3DIS data set dropout is only applied before the last convolutional layer, also with a dropout rate of 0.1. As we do not have dedicated prior information for the Bayesian model, the prior merely indicates that the parameter values should not diverge. Thus, we choose a prior expectation of zero for all parameters and a standard deviation of 4 and 8 for all weights and biases, respectively.
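The training schedule described above for the frequentist model (SGD with momentum 0.9, initial learning rate 0.001, decay by 0.7 every 10 epochs) maps directly onto PyTorch's optimizer and scheduler API; a sketch with a placeholder model:

```python
import torch

model = torch.nn.Linear(6, 13)  # placeholder for the segmentation network
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
# Multiply the learning rate by 0.7 every 10 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.7)

for epoch in range(100):
    # ... mini-batch training with batch size 16 would go here ...
    scheduler.step()

print(optimizer.param_groups[0]["lr"])
```

After 100 epochs the learning rate has been decayed ten times, i.e. it equals 0.001 · 0.7^10.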
In terms of approximating the posterior predictive distribution, we draw K = 50 Monte Carlo samples. All the considered models converge and training is stopped after 100 epochs.

Segmentation Accuracy
The three architectures, i.e. the frequentist, dropout and Bayesian PointNet (PN), described in Section 3 are evaluated in the following. The evaluation metrics used are the accuracy and the mean Intersection over Union (IoU). The accuracy is calculated as the number of correctly classified points divided by the total number of points. The IoU or Jaccard coefficient describes the similarity between two sets of finite cardinality. It is defined as the number of points in the intersection divided by the number of points in the union of the two sets. In this case, we evaluate the overlap between the points classified as class i by the model and the points of class i in the ground truth. Thus, the IoU for class i reads

J_i = (# points correctly classified as i) / (# points classified as i + # points of class i in the ground truth − # points correctly classified as i),

where i ∈ {1, . . . , m} indicates the class label. The mean IoU is calculated as the mean of the IoU values over all classes. Table 1 illustrates that Bayesian PointNet clearly surpasses the performance of frequentist and dropout PointNet with respect to accuracy as well as mean IoU on the test set of both data sets. For the S3DIS data set we test the models on area 6 and for the automotive factory data set we set aside two distinct assembly tacts. The prior information in the Bayesian model acts as additional observations and is thus able to reduce overfitting and increase model performance. An even more striking difference in performance is illustrated in the next section, where the information provided by uncertainty estimation is considered.
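The per-class IoU and the mean IoU defined above can be computed from predicted and ground truth labels as follows (a generic sketch, not the authors' evaluation code):

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Per-class Jaccard index |pred_i ∩ gt_i| / |pred_i ∪ gt_i| and
    its mean over all classes (classes absent from both sets are
    skipped via NaN)."""
    ious = []
    for i in range(num_classes):
        inter = np.sum((pred == i) & (gt == i))   # correctly classified as i
        union = np.sum((pred == i) | (gt == i))   # predicted or true i
        ious.append(inter / union if union > 0 else float("nan"))
    ious = np.array(ious)
    return ious, np.nanmean(ious)

pred = np.array([0, 0, 1, 1, 2, 2])
gt   = np.array([0, 1, 1, 1, 2, 0])
ious, miou = mean_iou(pred, gt, 3)
print(ious, miou)  # [0.333..., 0.666..., 0.5] 0.5
```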

Uncertainty Estimation
As already mentioned, we estimate uncertainty using an entropy based approach, the predictive variance, and an approach based on estimating credible intervals of the probabilistic network outputs. Predictive and aleatoric uncertainty are calculated as suggested in Equations (15) and (16); epistemic uncertainty is the difference of these two quantities. The predictive variance is determined on the basis of K = 50 forward passes of the input through the network using the unbiased estimator of the variance. Based on the same sample, the 95 %-credible intervals of the network outputs for each class are calculated. Table 2 contains the results of Bayesian PointNet for one room of each room type in area 6 of the S3DIS data set as well as one tact of car body assembly belonging to the test data set. The leftmost column contains the accuracy of Bayesian PointNet. The next column contains the accuracy when considering only predictions whose predictive uncertainty is smaller than or equal to the mean predictive uncertainty plus two standard deviations of the predictive uncertainty. The same is displayed in the next columns with respect to aleatoric and epistemic uncertainty as well as the variance of the predictive network outputs. In the last column, only the predictions for which the 95 %-credible interval of the predicted class overlaps with no other class' 95 %-credible interval are considered certain and used for prediction. It can be seen that the accuracy increases considerably when only looking at certain predictions with respect to any of the uncertainty measures. Generally, the results for predictive and aleatoric uncertainty as well as the credible interval based method are most promising. This confirms our notion that the predictive uncertainty value is determined mainly by aleatoric uncertainty after thorough network training. The percentage of predictions found to be uncertain is displayed in Table 3.
Generally, between about 3 % and 11 % of the predictions are dropped using the above parameters. The number of dropped predictions decreases when the uncertainty threshold is raised, e.g. when only predictions with an uncertainty greater than or equal to the mean uncertainty plus three standard deviations are dropped. Conversely, the lower the threshold for uncertain predictions, i.e. the more predictions are dropped, the higher the resulting accuracy. Thus, a trade-off between the number of retained predictions and segmentation accuracy needs to be found, which largely depends on the specific use case. Table 4 illustrates the results of dropout PointNet for one room of each room type in area 6 of the S3DIS data set as well as one tact of car body assembly belonging to the test data set. The accuracy of Bayesian PointNet surpasses the accuracy of dropout PointNet for most of the evaluated rooms. In terms of the percentage of predictions dropped, the results are similar. Table 5 presents the percentage of disregarded predictions compared to the baseline containing all predictions. Again, about 2 % to 11 % of the predictions are dropped by dropout PointNet using the same uncertainty threshold as before. Usually, dense point clouds are generated when building up an environment model of a factory in order to capture as many details as possible. Thus, it is important to keep a high number of point-wise predictions after uncertainty estimation in order to guarantee high quality when placing object geometries in a simulation model. However, a higher prediction accuracy in the segmentation step also increases the quality of the resulting environment model. As discussed above, a higher accuracy can be achieved by dropping a larger number of uncertain predictions.
In the case of environment modelling, it is vital to drop as few predictions as possible, because otherwise building structures, and the exact locations necessary for model generation, can get lost. Generally, we notice that the dropout model is more difficult to train than the Bayesian one, which manifests itself in a higher epistemic uncertainty of the dropout model. It has been shown empirically that standard dropout exhibits inferior performance in convolutional architectures (Gal and Ghahramani, 2016), which could explain the increased epistemic uncertainty values. Further, the impact of incorporating uncertainty on the segmentation performance is more striking in the Bayesian model. However, for applications where one or two percent of accuracy can be sacrificed, the dropout model is a good alternative to the Bayesian model: users can take a frequentist network and simply add dropout during training and test time, without having to define and optimize a distribution over all the network parameters.
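The practical appeal of the dropout variant is exactly this sampling scheme: dropout is simply kept active at test time and the input is passed through the network K times. The following toy NumPy sketch illustrates such a Monte Carlo dropout loop with a two-layer network for illustration only, not the actual PointNet architecture; all names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mc_dropout_forward(x, W1, W2, p=0.5, K=50):
    """Toy MC-dropout classifier: dropout stays ACTIVE at test time and a
    fresh mask is drawn on each of the K passes; the spread of the resulting
    softmax samples approximates the posterior predictive distribution."""
    samples = []
    for _ in range(K):
        h = np.maximum(x @ W1, 0.0)        # ReLU hidden layer
        mask = rng.random(h.shape) > p     # fresh dropout mask per pass
        h = h * mask / (1.0 - p)           # inverted dropout scaling
        samples.append(softmax(h @ W2))
    return np.stack(samples)               # shape (K, N, C)
```

The returned `(K, N, C)` array has exactly the shape expected by the uncertainty measures discussed above, so the same thresholding can be applied to the dropout model without any change.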
Overall it can be concluded that, without considering uncertainty information, Bayesian PointNet has superior performance and dropout PointNet has performance similar to the frequentist model. When only considering certain predictions, Bayesian as well as dropout PointNet clearly surpass the performance of the frequentist model. In terms of the uncertainty measure, the best results are achieved with the credible interval based approach as well as with predictive or aleatoric uncertainty. However, the credible interval based method drops considerably more predictions in some of the examples.

In the corresponding segmentation figure, uncertain predictions are displayed in red. The applied model corresponds to Bayesian PointNet with the credible interval based uncertainty measure. It shows that the network is certain about the majority of its predictions. Uncertain predictions are concentrated at the ceiling, walls and columns, which is in line with our expectations of Section 4.2. As the ceiling, but especially the walls and columns, are hung with clutter objects, the point clouds of these classes are incomplete due to holes, which leads to uncertain predictions. Further, Figure 3 shows the boxplots of the predictive softmax outputs in the Bayesian model for a correct and a wrong prediction of two single points in the automotive factory data set. In Figure 3 (a) it can be seen that the network is certain about its correct prediction, i.e. all the predictive softmax output values of the correct class are close to one, while the network outputs for all other classes are close to zero. In the case of a wrong prediction, see Figure 3 (b), the boxes of the true and the predicted label overlap, indicating an uncertain prediction. The white box corresponds to the correct class and the shaded box to the wrongly predicted class.

Discussion and Conclusion
The use of Bayesian neural networks instead of frequentist ones allows the quantification of network uncertainty. On the one hand, this leads to more robust and accurate models. On the other hand, in safety critical applications uncertain predictions can be identified and treated with special care. For the use case of point cloud segmentation in modelling production sites there are hardly any safety critical issues; however, an increased model accuracy leads to more accurate reconstructions of the real-world production system in a simulation engine or in CAD software. A first methodology for systematic data collection and processing in a large-scale industrial environment was presented in (Petschnigg et al., 2020). In future work this will be extended to a more thorough workflow including technology specifications and a mathematical concept of how to place the segmented objects in a simulation model.

In summary, in this work we present a novel Bayesian neural network that is capable of 3D deep semantic segmentation of raw point clouds and allows the estimation of uncertainty in the network predictions. Additionally, a network using dropout training to approximate Bayesian variational inference in Gaussian processes is described. We compare the uncertainty information gained by the Bayesian and the dropout model. Both models effectively increase the network performance at test time compared to the frequentist framework when the information gained by network uncertainty is taken into account. The dropout model shows on-par performance with frequentist PointNet when network uncertainty is not taken into account. Bayesian PointNet is more robust against overfitting than the frequentist one and achieves higher test accuracy even without considering uncertainty information.
Bayesian segmentation makes it possible to work with fewer example data, as the prior information acts like additional observations, while the computational complexity stays basically the same. All of the proposed networks are embedded in an industrial prototype that aims at generating static simulation models from a raw point cloud alone. The complete system will be evaluated in our next work.

Conflict of interest
The authors declare that they have no conflict of interest.