Semantic Segmentation of Large-Scale Outdoor Point Clouds by Encoder–Decoder Shared MLPs with Multiple Losses

Abstract: Semantic segmentation of large-scale outdoor 3D LiDAR point clouds has become essential for understanding the scene environment in various applications, such as geometry mapping and autonomous driving. Although 3D LiDAR point clouds have the advantage of lying in a 3D metric space, they pose a challenge for deep learning approaches because they are unstructured, unordered, irregular, and large-scale. This paper therefore presents an encoder–decoder shared multi-layer perceptron (MLP) with multiple losses to address this semantic segmentation problem. The challenge gives rise to a trade-off between efficiency and effectiveness. To balance this trade-off, we propose common mechanisms that are simple yet effective: a random point sampling layer, an attention-based pooling layer, and a summation of multiple losses, integrated with the encoder–decoder shared MLPs method for large-scale outdoor point cloud semantic segmentation. We conducted our experiments on two large-scale benchmark datasets, Toronto-3D and DALES. Our method achieved an overall accuracy (OA) and a mean intersection over union (mIoU) of 83.60% and 71.03% on the Toronto-3D dataset, and 76.43% and 59.52% on the DALES dataset, respectively. Additionally, our proposed method has a small number of model parameters and is about three times faster than PointNet++ during inference.


Introduction
3D LiDAR point clouds have become one of the most significant 3D data representations for depth information, and have been deployed in various applications, such as urban geometry mapping, autonomous driving, virtual reality, augmented reality, and more [1][2][3]. A point cloud is a set of points in a 3D metric space, which provides rich 3D information, such as geometry, color, intensity, normal, and more, to accurately measure the surrounding objects. This information can be utilized for scene understanding. Among the tasks related to point cloud scene understanding, semantic segmentation assigns a meaningful label to each point. This means that it not only tells the location of an object, but also describes what kind of object is in the scene. In this paper, we propose a method for the semantic segmentation of outdoor 3D point clouds.
The deep learning (DL) approach has proved to have an outstanding performance on classification, detection, and segmentation on 2D images [4]. Compared to 2D images [5][6][7], point clouds of outdoor scenes have the following properties [1,2]: (1) the points are unstructured, because they are not arranged in a regular grid, and are generally sparse in the 3D world space; (2) they are irregular, because the density of the point coordinates is not uniform, and generally varies with the distance to the sensor; (3) they are unordered, because the order of storing information of points in the dataset

• We propose a simple, yet effective, strategy that combines the aforementioned mechanisms (random point sampling, attention-based pooling, and multiple losses summation) with the encoder-decoder shared MLPs method, for large-scale outdoor point cloud semantic segmentation;
• We show that our method achieves good results and has a lower computational cost than PointNet++ [11].
The remainder of the paper is organized as follows. We review previous methods using DL methods in Section 2. Then, we describe our proposed method in Section 3. Next, we describe our experimental setup and analyze our results in Sections 4 and 5, respectively. In Section 6, we discuss and compare our experimental results with other methods. Finally, Section 7 concludes our study.

Related Works
Prior works rely on 2D projection [8] and voxelization [9]. The 2D projection approach [8] first projects 3D point clouds, from multiple views, into a collection of 2D images, and then utilizes the mature structured 2D CNN. Such projection is simple, but loses 3D geometric information. Thus, it is not suitable for scene semantic segmentation. Another approach [9] voxelizes the point clouds into a regular 3D grid, and then utilizes the structured 3D CNN. Such voxelization incurs a high computational cost for large-scale dense 3D data, because the computation and memory usage grow cubically when the data is scaled up. Thus, it is not suitable for outdoor scene semantic segmentation. Later, the point-based DL approach was proposed, which feeds the raw point clouds into the DL network directly. The prior works of the point-based DL approach are point-wise multi-layer perceptrons (MLPs) [10][11][12][13][14][15][16][17][18][19][20], which learn per-point features using shared MLPs as a base network to achieve high efficiency. Such shared MLPs extract point-wise features independently and lose point relation information. Therefore, to capture richer local and global structures, several integrated mechanisms have been introduced, including neighborhood sampling, attention-based pooling, and local-global feature aggregation. Point convolution methods [21][22][23][24][25][26][27] define a point-wise convolution operator, to permute local points into a canonical order. The discrete convolution operator uses points to carry kernel weights, but it loses neighboring information. Therefore, a 3D point continuous convolution operator was proposed. However, there are challenges in defining the point convolution operator, learning an accurate permutation matrix, and reducing the computational cost at the preprocessing step.
Graph-based methods [28][29][30][31][32][33][34][35][36] structure point clouds as a super graph, to extract local shape information from the neighbors and feed this to a graph convolution network. Such a graph-based method uses point relationships that are defined as edges, and interpolates them into the network. However, it is not easy for the interpolation function to define the spatial extent of the neighborhood needed to extract sufficient local and global point features.
We are interested in DL approaches on raw point clouds. Therefore, we review semantic segmentation methods on point-based DL approaches in detail, as follows.

Point-Wise MLPs Method
The point-wise MLPs method learns per-point features, using shared MLPs as a base network to achieve high efficiency. The method was pioneered by PointNet [10]. The shared MLPs learn from raw point clouds to extract point-wise features, and then global max pooling aggregates the features of all the points. However, each point is learned individually, and the point relation is not considered. Therefore, to capture richer local and global structures, PointNet++ [11] proposed hierarchical representation learning, with a farthest point sampling (FPS) layer, a K-nearest neighbor grouping layer, and a max pooling layer. PointSIFT [12] proposed an orientation-encoding module, integrated with shared MLPs. The module captures the information of point clouds from eight crucial orientations, to achieve strong scale awareness. Engelmann et al. [13] added a pairwise distance loss and a centroid loss to the shared MLPs, for better structuring of the learned point feature space. Further, 3DContextNet [14] proposed the Kd-tree structure, to learn representative features progressively. The method reduces the computation cost by learning on an implicit partition of the space and skipping the learning on empty space. ShapeContextNet [15] proposed a shape context kernel, to learn features in concentric shells and aggregate the features using dot-product self-attention, capturing both local and global shape information. PointWeb [16] proposed the adaptive feature adjustment (AFA) module, to enhance the local neighborhood features. PAT [17] proposed the group shuffle attention (GSA) module, with self-attention and Gumbel subset sampling. The module aims to define a stronger representative feature, with a lower computation cost. LSANet [18] proposed a local spatial awareness layer to learn spatially oriented distribution weights, and integrates it into the shared MLPs.
ShellNet [19] proposed a shell convolution, to learn representative features from concentric spherical shell statistics. The method aims to enlarge the receptive fields with fewer layers. RandLA-Net [20] proposed the local spatial-encoding module, to randomly select representative features, and attentive pooling modules, to aggregate features using the attention mechanism. The method aims to effectively extract features from a wide neighborhood, with a low computation cost.

Point Convolution Method
The point convolution method defines a point-wise convolution operator, to permute local points into a canonical order. PointCNN [21] proposed the X-transformed convolution operator, which can weight input features and permute points into a latent and canonical order. PCCN [22] proposed a continuous convolution operator, which is parameterized by MLPs and supports the full continuous vector space. A-CNN [23] proposed an annular convolution operator, which can learn to capture the local feature correlation through a ring-shaped structure and direction. ConvPoint [24] proposed a continuous convolution operation. This method learns the weighted summation from the feature convolution operation and a simple MLP operation on the spatial features. KPConv [25] proposed a kernel point convolution operator. The convolution takes radius neighborhoods as the input and learns the weights that are located in the Euclidean space, by a subset of kernel points. DPC [26] proposed a dilated point convolution operator. The convolution learns the kernel weights over the dilated neighborhood. This method increases the receptive field size of the point convolution. InterpCNN [27] proposed an interpolated convolution operator that is invariant to the permutation and sparsity of the points. The interpolation function takes a set of discrete kernel weights and interpolates the point features into the neighboring kernel weights.
Remote Sens. 2021, 13, 3121

Graph-Based Method
The graph-based method structures the local shape information by the edges of a graph defined on the point relations, before feeding it into the convolution network. DGCNN [28] proposed an edge convolution, to learn global shape properties by incorporating local neighborhood information. SPG [29] proposed a super point graph, to structure the point clouds into a collection of interconnected shapes. The graph represents edge features of each object's points. Then, the graph is fed into PointNet [10] for embedding, before feeding it to a gated recurrent unit (GRU) for the final prediction. Landrieu et al. [30] proposed a graph-structured contrastive loss, which can learn to embed local points. The local points embedder is a lightweight model that has been adopted from PointNet [10]. GACNet [31] proposed a graph attention convolution. The method assigns attention weights to different neighboring points, based on both the spatial geometry and the features of the points. PAG [32] defined edge-preserved pooling, edge-preserved unpooling, and point atrous convolution. The model is used to exploit multi-scale edge features by introducing a sampling rate parameter to enlarge the receptive fields. HDGCN [33] proposed a depth-wise graph convolution to aggregate the features channel-wise, and a point-wise graph convolution to learn the features across the different channels. Jiang et al. [34] proposed a hierarchical graph framework, to incorporate the point branch and edge branch. The edge branch is used to integrate the point features and generate edge features. Lei et al. [35] proposed a spherical convolution network, to learn point neighborhoods through the spherical convolution kernel. The spherical convolution kernel extracts local features by an octree partitioning structure. DPAM [36] proposed a graph convolution network that can learn a soft points cluster agglomeration by exploring the relation of points in the semantic space. The method defines a learning architecture to sample and group the points dynamically.

Methodology
In this paper, we propose an encoder-decoder shared MLP with multiple losses, for large-scale outdoor point cloud semantic segmentation. We adopt the network architecture of PointNet++ [11] as a base network. PointNet++ [11] performs hierarchical feature learning through sampling, grouping, shared MLPs, and pooling layers. We replace the farthest point sampling and max pooling used by PointNet++ [11] with random point sampling and attention-based pooling, respectively. However, we fully adopt the K-nearest neighbor grouping and shared MLPs from PointNet++ [11]. Additionally, we propose a summation of multiple loss scores to help to refine the learned feature structure.
Our proposed network architecture is composed of a feature encoding and a decoding network. The raw point clouds are fed into the feature encoding network to extract local neighboring features. Then, the learned local neighboring features are up-pooled to achieve up-pooled features. The up-pooled features are concatenated with the skip linked features to generate propagated features for semantic point labeling. We describe, in more detail, our complete network, feature encoding network, feature decoding network, and multiple losses summation architecture in Section 3.1, Section 3.2, Section 3.3 and Section 3.4, respectively.

Network Architecture
Our complete network architecture is shown in Figure 1. The network architecture is an encoder-decoder shared MLP with multiple losses. Firstly, the raw point clouds are fed into the feature encoding network. The feature encoding network consists of the set abstraction (SA) [11] modules, followed by the attention-based pooling (AP) layers [20,39]. The SA and AP modules are used to down-pool the point features over four levels, to extract high-level local feature abstraction. We enlarge the receptive fields of the shared MLPs from 32 to 512 filter sizes, to widely capture the important information.
Then, the learned local neighboring features are up-pooled to obtain up-pooled features. The up-pooled features are concatenated with the skip linked features, to generate propagated features for semantic point labeling. The feature decoding network consists of four levels of the feature propagation (FP) modules, followed by the fully connected (FC) layers. We narrow the filter size from 256 to 128. The FC layers condense the filters of size 256 and 128 into the number of categories (C). The filter size of the FC layer is defined with respect to the filter size of the shared MLPs from the FP module. Finally, we sum the four loss scores to obtain the final loss, to help to refine the structure of the learned feature space.

Feature Encoding Network
To extract the local features, we adopt the set abstraction module (SA) [11] and attention-based pooling [20,39] for the feature encoding, as shown in Figure 2. The SA module [11] learns from the point clouds, to encode the local features through sampling, grouping, shared MLP, and a pooling layer. In the sampling layer, we propose random point sampling (RPS), to randomly select representative points. Then, in the grouping layer, the K-nearest neighbor method finds the K nearest points within a radius of each sampled point. Finally, shared MLPs with the attention-based pooling layer aggregate features of the K nearest points that are within the same radius as the local neighboring features.
An input point cloud is represented as $\{p_i \mid i = 1, 2, \ldots, n\}$, with $p_i \in \mathbb{R}^F$, where $F$ is the number of features of the input points, such as xyz coordinates, color, normal, etc.

Sampling Layer
From the set of N input features of dimensionality F, we randomly select a subset of the points. Compared with the farthest point sampling (FPS), the random point sampling (RPS) has less coverage of the entire point set. However, RPS achieves a low computational cost of O(1), which is suitable for large-scale points. We simply compute the random point sampling with the Python NumPy package. We use numpy.random.choice() to generate the indices. Then, we select the corresponding point features through these indices.
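The sampling step described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `random_point_sampling`, the array sizes, and the choice of sampling without replacement are assumptions made here.

```python
import numpy as np

def random_point_sampling(points, num_samples):
    """Randomly pick `num_samples` rows of an (N, F) point array.

    Assumption: indices are drawn without replacement; the text only
    states that numpy.random.choice() generates the indices.
    """
    idx = np.random.choice(points.shape[0], size=num_samples, replace=False)
    return points[idx], idx

np.random.seed(0)                      # fixed seed only for reproducibility
cloud = np.random.rand(8192, 3)        # hypothetical patch: 8192 points, xyz only
sampled, idx = random_point_sampling(cloud, 1024)
print(sampled.shape)
```

Unlike farthest point sampling, this never computes distances between points, which is why it stays cheap on large inputs at the price of less even coverage.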

Grouping Layer
Unlike 2D images, point clouds are not arranged in a regular grid, so computing the neighborhoods allows us to capture the local point relationships. The neighboring features vary depending on the sparsity of the point set. Therefore, we adopt the grouping layer from [11], to compute the neighboring features as follows. For the $i$th point, we first query its neighboring points within a radius of each sampled point, using the K-nearest neighbor method, which is based on point-wise Euclidean distances. The K value is set to 32. Then, we compute the relative point position by concatenating the features of each point with its neighboring features. For each of the K neighbor points $\{p_i^1, \ldots, p_i^k, \ldots, p_i^K\}$ of the centroid point $p_i$, the relative point position $r_i^k \in \mathbb{R}^F$ is computed from the xyz coordinates $p_i$ and $p_i^k$ using the concatenation operation $\oplus$.
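The grouping step can be illustrated with a brute-force K-nearest neighbor query. This is a sketch under stated assumptions: the exact terms concatenated in the relative point position are not recoverable from the text, so the encoding below (centroid coordinates concatenated with the neighbor offset) is a hypothetical choice, the radius constraint is omitted, and the function names are illustrative.

```python
import numpy as np

def knn_group(points_xyz, centroids_xyz, k=32):
    """Return the k nearest neighbors (Euclidean) of each centroid."""
    diff = centroids_xyz[:, None, :] - points_xyz[None, :, :]   # (M, N, 3)
    dist2 = np.sum(diff ** 2, axis=-1)                          # (M, N)
    idx = np.argsort(dist2, axis=1)[:, :k]                      # (M, k)
    return points_xyz[idx], idx                                 # (M, k, 3)

def relative_position(centroids_xyz, neighbors_xyz):
    """ASSUMED encoding: centroid xyz concatenated with (neighbor - centroid)."""
    center = np.broadcast_to(centroids_xyz[:, None, :], neighbors_xyz.shape)
    return np.concatenate([center, neighbors_xyz - center], axis=-1)  # (M, k, 6)

rng = np.random.default_rng(0)
cloud = rng.random((1024, 3))                          # hypothetical input points
centroids = cloud[rng.choice(1024, 256, replace=False)]  # randomly sampled centroids
neighbors, nbr_idx = knn_group(cloud, centroids, k=32)
rel = relative_position(centroids, neighbors)
print(neighbors.shape, rel.shape)
```

The brute-force distance matrix is fine at this patch size; a spatial index would be the usual choice for larger inputs.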

Shared MLPs Layer
We learn a representation of this relative point position by the point-wise shared multilayer perceptrons (MLPs). For each of the K point positions $\{r_i^1, \ldots, r_i^k, \ldots, r_i^K\}$ relative to the centroid point $p_i$, we encode the relative point position as follows:
$$f_i^k = \mathrm{MLP}(r_i^k),$$
where $r_i^k$ is the relative point position and $f_i^k$ is the corresponding learned local feature with $f_i^k \in \mathbb{R}^F$.
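"Shared" here means that the same weights are applied to every point independently. The sketch below illustrates that property with plain NumPy; the layer widths (6, 32, 64), the ReLU activation, and the omission of batch normalization are assumptions for illustration only.

```python
import numpy as np

def shared_mlp(x, weights, biases):
    """Apply the same dense layers (with ReLU) to the last axis of x."""
    for W, b in zip(weights, biases):
        x = np.maximum(x @ W + b, 0.0)
    return x

rng = np.random.default_rng(0)
r = rng.standard_normal((256, 32, 6))        # hypothetical relative positions (M, K, 6)
widths = [6, 32, 64]                         # assumed layer sizes
weights = [rng.standard_normal((a, b)) * 0.1 for a, b in zip(widths[:-1], widths[1:])]
biases = [np.zeros(b) for b in widths[1:]]

f = shared_mlp(r, weights, biases)           # (256, 32, 64)

# Reordering the neighbors only reorders the outputs, because every
# point goes through identical weights.
perm = rng.permutation(32)
f_perm = shared_mlp(r[:, perm], weights, biases)
print(f.shape, np.allclose(f_perm, f[:, perm]))
```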

Attention-Based Pooling Layer
The pooling layer is used to aggregate the set of neighboring point features $f_i^k$. PointNet++ [11] used max pooling to hard-aggregate the neighboring features, which results in losing a lot of information. Inspired by [20,39], we adopt the attention-based pooling layer, which is a robust pooling mechanism used to automatically learn meaningful local features through attention-weighted scores and their summation. Firstly, we compute the attention-weighted scores. From the set of learned local features $\{f_i^1, \ldots, f_i^k, \ldots, f_i^K\}$, the attention-weighted score for each feature is computed by a shared MLP, followed by Softmax, as follows:
$$s_i^k = \mathrm{Softmax}(\mathrm{MLP}(f_i^k, W)),$$
where $W$ is the weights of $\mathrm{MLP}(\cdot)$ and $s_i^k$ is an attention-weighted score with $s_i^k \in \mathbb{R}^F$. Then, we sum the dot product of the local features $f_i^k$ and the learned attention scores $s_i^k$ to automatically capture the meaningful features, namely, the encoded features, as follows:
$$g_i = \sum_{k=1}^{K} \left( f_i^k \cdot s_i^k \right),$$
where $(\cdot)$ is the dot product operation and $g_i$ is the encoded features with $g_i \in \mathbb{R}^F$.
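A small NumPy sketch may make the attention-based pooling concrete. It assumes a single linear layer for the scoring MLP, a Softmax taken over the K neighbors separately for each feature channel, and an element-wise product between features and scores; none of these details is stated explicitly in the text, and the function names are illustrative.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_pooling(f, W):
    """f: (M, K, F) neighbor features, W: (F, F) scoring weights (assumed shape)."""
    scores = softmax(f @ W, axis=1)           # (M, K, F), sums to 1 over the K neighbors
    return np.sum(f * scores, axis=1)         # (M, F) encoded features

rng = np.random.default_rng(0)
f = rng.standard_normal((4, 32, 16))          # hypothetical features: 4 centroids, K = 32
W = rng.standard_normal((16, 16))
g = attention_pooling(f, W)

# With zero scoring weights every neighbor gets weight 1/K, so the
# pooling reduces to a plain average.
g_uniform = attention_pooling(f, np.zeros((16, 16)))
print(g.shape, np.allclose(g_uniform, f.mean(axis=1)))
```

Because the scores form a convex combination, the pooled value always lies between the smallest and largest neighbor feature, whereas max pooling keeps only the largest one.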

Feature Decoding Network
After the original point set is down-pooled to extract the learned local features in the feature encoding network, we propagate the learned local features for point-wise labeling in the feature decoding network. We fully adopt the feature propagation (FP) module from PointNet++ [11], as shown in Figure 3. The FP module uses distance-based interpolation, computing an inverse-distance-weighted average of the features. Then, the interpolated features are concatenated with the skip linked features, followed by shared MLPs. We denote the decoded features as follows:
$$\tilde{f}_i = \mathrm{FP}(g_i^l, g_i^{l-1}),$$
where $\mathrm{FP}(\cdot)$ is the feature propagation module, $g_i^l$ is the encoded features at the $l$th level with $g_i^l \in \mathbb{R}^F$, $g_i^{l-1}$ is the skip linked features at the $(l-1)$th level with $g_i^{l-1} \in \mathbb{R}^F$, and $\tilde{f}_i$ is the decoded features with $\tilde{f}_i \in \mathbb{R}^F$.
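The feature propagation step can be sketched as inverse-distance-weighted interpolation followed by concatenation and a shared layer. The values k = 3 neighbors and squared-distance weights follow the settings we recall from the PointNet++ paper and are assumptions here, as are all array sizes and function names.

```python
import numpy as np

def interpolate_features(xyz_dense, xyz_sparse, feat_sparse, k=3, eps=1e-8):
    """Up-pool features from M sparse points to N dense points."""
    diff = xyz_dense[:, None, :] - xyz_sparse[None, :, :]
    dist2 = np.sum(diff ** 2, axis=-1)                    # (N, M) squared distances
    idx = np.argsort(dist2, axis=1)[:, :k]                # k nearest sparse points
    w = 1.0 / (np.take_along_axis(dist2, idx, axis=1) + eps)
    w = w / w.sum(axis=1, keepdims=True)                  # normalized inverse-distance weights
    return np.sum(feat_sparse[idx] * w[..., None], axis=1)  # (N, C)

def feature_propagation(xyz_dense, skip_feat, xyz_sparse, feat_sparse, W, b):
    up = interpolate_features(xyz_dense, xyz_sparse, feat_sparse)
    x = np.concatenate([up, skip_feat], axis=-1)          # skip link concatenation
    return np.maximum(x @ W + b, 0.0)                     # one shared MLP layer (assumed)

rng = np.random.default_rng(0)
xyz_dense = rng.random((256, 3))
xyz_sparse = xyz_dense[:64]                               # hypothetical down-pooled level
feat_sparse = rng.standard_normal((64, 32))
skip_feat = rng.standard_normal((256, 16))
W, b = rng.standard_normal((48, 64)) * 0.1, np.zeros(64)

decoded = feature_propagation(xyz_dense, skip_feat, xyz_sparse, feat_sparse, W, b)
const_up = interpolate_features(xyz_dense, xyz_sparse, np.ones((64, 32)))
print(decoded.shape)
```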

Multiple Loss Scores
The loss function helps to shape the feature space during the training. We introduce multiple loss scores to enhance the learning structure of the features. We use the cross-entropy function to calculate the error between the predicted probabilities and the ground truth labels. We compute the cross-entropy loss at every level of the FP module, as shown in Figure 1. Firstly, we map the decoded features of each FP module to the number of categories by the fully connected layer (FC). Then, we calculate the error between the predicted probabilities $\hat{y}_i$ and the ground truth labels $y_i$, as follows:
$$\hat{y}_i = \mathrm{FC}(\tilde{f}_i), \qquad L_j = \mathrm{CE}(\hat{y}_i, y_i),$$
where $\mathrm{FC}(\cdot)$ is the fully connected layer, $\tilde{f}_i$ is the decoded features with $\tilde{f}_i \in \mathbb{R}^F$, $\mathrm{CE}(\cdot)$ is the cross-entropy function, and $L_j$ is the loss score at the $j$th level. Then, we sum all losses, as follows:
$$L = \sum_{j=1}^{4} L_j.$$
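The summation of per-level losses can be sketched as follows. The text does not say how labels are matched to the points at each decoder level, so this example simply assumes that each level comes with its own label array; the point counts per level and the function names are hypothetical.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean cross-entropy of (n, C) logits against (n,) integer labels."""
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def summed_loss(levels):
    """levels: list of (logits, labels) pairs, one per FP level."""
    losses = [cross_entropy(logits, labels) for logits, labels in levels]
    return sum(losses), losses

rng = np.random.default_rng(0)
num_classes = 8
# Hypothetical point counts for the four decoder levels.
levels = [(np.zeros((n, num_classes)), rng.integers(0, num_classes, n))
          for n in (64, 128, 256, 1024)]

total, per_level = summed_loss(levels)
# All-zero logits give uniform predictions, so every level costs ln(8).
print(total, 4 * np.log(num_classes))
```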

Experimental Setup
We conducted our experiments on a moderate computer with an Intel Core i7-7700 CPU @ 3.60 GHz and 16.0 GB RAM from Intel Corporation, Seoul, South Korea, and an NVIDIA GeForce GTX 1070 from NVIDIA Corporation, Seoul, South Korea. The code was written in Python with the PyTorch framework and the CUDA library, for accelerating the training. The training was carried out with the Adam optimizer, with a learning rate of $1 \times 10^{-3}$, a weight decay of $2 \times 10^{-5}$, a batch size of 8, and 100 epochs. The fully connected layer was followed by ReLU activation, batch normalization, and dropout with a ratio of 0.5.

Datasets
We evaluate our proposed method on the semantic segmentation of point clouds with the following two large-scale outdoor benchmark datasets: Toronto-3D [37] and DALES [38].

Toronto-3D Dataset
The Toronto-3D dataset [37] is an outdoor scene point cloud semantic segmentation benchmark. It contains around 78M points, from 1 km of urban road scene in Canada. The dataset is labeled into eight categories: road, road marking, natural, building, utility line, pole, car, and fence. Each point provides properties such as xyz coordinates, rgb color, intensity, GPS time, scan angle rank, and category label. The data is divided into four blocks, L001, L002, L003, and L004, of around 250 m each, with varying numbers of points. We use L001, L003, and L004 for the training set, and L002 for the testing set, following the guideline from the original paper of the Toronto-3D dataset, as shown in Table 1.

DALES Dataset
The DALES dataset [38] is an outdoor scene point cloud semantic segmentation benchmark. It contains around 505M points, from 330 km² of urban road scene in Canada. The dataset is labeled into eight categories: ground, vegetation, cars, trucks, power lines, poles, fences, and buildings. Each point provides properties such as xyz coordinates, reflectance, and class label. The data is divided into 40 tiles, of around 0.5 km² each, with varying numbers of points. We use 29 tiles for the training set and 11 tiles for the testing set, following the guideline from the original paper of the DALES dataset, as shown in Table 2.

Data Pre-Processing
We divide the original point data into small patches, following the order in which the points are stored in the original data file. As presented in Table 3, each patch consists of 8192 × F values, i.e., 8192 points with F features. In the Toronto-3D dataset [37], we use two types of data properties: 8192 × 3 for the xyz coordinates of the points (xyz) and 8192 × 6 for a combination of the xyz coordinates and rgb colors of the points (xyz + rgb). In the DALES dataset [38], we only use one type of data property: 8192 × 3 for the xyz coordinates of the points (xyz). We do not perform random jitter, rotation, or other data augmentations. Only the raw point data are used; however, all the points are normalized to zero mean within a unit patch. Then, we randomly select 1024 points out of the 8192 points to suit the input shape of our network model. The random selection uses the simple Python NumPy random choice function.
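The pre-processing pipeline above can be sketched end to end. Several details are assumptions made for this illustration: leftover points that do not fill a whole patch are dropped, only the xyz columns are normalized, and "unit patch" is interpreted as scaling into a unit sphere. The function names are hypothetical.

```python
import numpy as np

def make_patches(points, patch_size=8192):
    """Split an (N, F) array into consecutive patches in storage order."""
    n_full = points.shape[0] // patch_size          # assumption: remainder is dropped
    return points[: n_full * patch_size].reshape(n_full, patch_size, -1)

def normalize_patch(patch):
    """Zero-mean the xyz columns and scale them into a unit sphere (assumed)."""
    out = patch.astype(np.float64).copy()
    out[:, :3] -= out[:, :3].mean(axis=0)
    scale = np.linalg.norm(out[:, :3], axis=1).max()
    if scale > 0:
        out[:, :3] /= scale
    return out

def sample_patch(patch, num_points=1024):
    idx = np.random.choice(patch.shape[0], size=num_points, replace=False)
    return patch[idx]

np.random.seed(0)
raw = np.random.rand(20000, 3) * 100.0              # hypothetical scene, xyz only
patches = make_patches(raw)                         # (2, 8192, 3)
norm = normalize_patch(patches[0])
net_input = sample_patch(norm)                      # (1024, 3)
print(patches.shape, net_input.shape)
```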

Evaluation Metrics
We follow the evaluation metrics of a general semantic segmentation study. We use the overall accuracy (OA) and mean intersection over union (mIoU) as the main evaluation metrics, to evaluate the overall quality of the segmentation. Firstly, we compute the per-class IoU, as follows:
$$IoU_c = \frac{TP_c}{TP_c + FP_c + FN_c},$$
where $TP$, $FP$, and $FN$ represent true positive, false positive, and false negative, respectively, and $c$ is the $c$th category label. The mIoU is simply the mean across all eight categories, excluding the unclassified category, as follows:
$$mIoU = \frac{1}{C} \sum_{c=1}^{C} IoU_c,$$
where $C$ is the number of category labels ($C = 8$) and $c$ is the $c$th category in $C$. The overall accuracy (OA) is computed as the sum of all the correctly predicted points over the total number of points, as follows:
$$OA = \frac{1}{N} \sum_{c=1}^{C} TP_c,$$
where $N$ is the total number of points.

Results on Toronto-3D Dataset
Table 4 shows our results on the Toronto-3D dataset. We experimented with our method, with the properties that are mentioned in Section 4.3. For the experiment on the xyz coordinates of the points, our proposed method achieved an overall accuracy (OA) and a mean intersection over union (mIoU) of 72.55% and 66.87%, respectively. We noticed that our method performed well in the road, natural, building, and utility line categories, with IoUs of 92.74%, 88.66%, 93.52%, and 81.03%, respectively. The pole and fence categories achieved around-average scores, with IoUs of 67.71% and 56.90%, respectively. Additionally, we conducted an ablation study on our proposed method, between the xyz coordinates of the points (xyz), and the combination of xyz coordinates and rgb colors of the points (xyz + rgb), as shown in Table 4. We noticed that the results predicted on the xyz + rgb of the points outperformed the prediction on only the xyz of the points, in both OA and mIoU, with 83.60% and 71.03%, respectively. Additionally, the predicted results on the xyz + rgb of the points also outperformed the prediction on the xyz of the points in all the categories.
Our weakness is in the road marking and car categories. The IoUs of the prediction on the xyz of the points were 14.75% and 39.65%, and those of the prediction on the xyz + rgb of the points were 27.43% and 44.41%, which were below the average score. We assume that this is because these categories have fewer points than the others, which is the data imbalance challenge.
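The evaluation metrics defined above can be computed from a confusion matrix, as in the following sketch. The function name and the toy labels are illustrative; the last lines check that the eight per-class IoUs reported for the xyz-only Toronto-3D experiment average to the reported mIoU.

```python
import numpy as np

def segmentation_metrics(y_true, y_pred, num_classes):
    """Return (OA, mIoU, per-class IoU) from integer label arrays."""
    cm = np.bincount(y_true * num_classes + y_pred,
                     minlength=num_classes ** 2).reshape(num_classes, num_classes)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp          # predicted as c, but labeled otherwise
    fn = cm.sum(axis=1) - tp          # labeled c, but predicted otherwise
    denom = tp + fp + fn
    iou = np.where(denom > 0, tp / np.maximum(denom, 1), np.nan)
    return tp.sum() / cm.sum(), np.nanmean(iou), iou

# Toy example with three classes.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
oa, miou, iou = segmentation_metrics(y_true, y_pred, 3)
print(oa, miou, iou)

# Per-class IoUs reported in the text for xyz-only Toronto-3D:
# road, natural, building, utility line, pole, fence, road marking, car.
reported = [92.74, 88.66, 93.52, 81.03, 67.71, 56.90, 14.75, 39.65]
mean_reported = float(np.mean(reported))   # agrees with the reported mIoU of 66.87
print(round(mean_reported, 2))
```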
The qualitative assessment is illustrated in Figure 4. The L002 scene, which was used as the testing dataset, is rendered in rgb colors, and its ground truth label is rendered in categorical color code, as shown in Figure 4a,b, respectively. Figure 4c demonstrates the qualitative result of our proposed method that was predicted on the xyz coordinates of the points. We can observe that our method performed well in the road, building, and natural categories. However, the road marking category was confused by the road category, and the pole category was also confused by one block of the road category. Additionally, only some points of the car, fence, and utility line categories were able to be predicted. Figure 4d illustrates our results that were predicted on the combination of xyz coordinates and rgb colors of the points (xyz + rgb). We noticed that the road, natural, and building categories were still well predicted. Additionally, the pole, utility line, car, and fence categories were improved. They were predicted better than using only the xyz of the points. However, the nature category was confused by some part of the building category. Remote Sens. 2021, 13, x FOR PEER REVIEW 11 of 18  Table 5 shows our results on the DALES dataset. We experimented with our method, with the properties mentioned in the above Section 4.3. For the experiment on the xyz  Table 5 shows our results on the DALES dataset. We experimented with our method, with the properties mentioned in the above Section 4.3. For the experiment on the xyz coordinates of the points, our proposed method achieved an overall accuracy (OA) and a mean intersection over union (mIoU) of 76.43% and 59.52%, respectively. We noticed that our method performed well in the ground, vegetation, and fences categories, with IoUs of 86.78%, 85.40%, and 84.89%, respectively. The power lines, pole and car categories achieved around average scores, with IoUs of 67.47%, 50.76%, and 50.63%, respectively. 
However, our method is weak on the trucks and buildings categories, with IoUs of 32.59% and 17.66%, respectively, which were below the average score. The qualitative assessment is illustrated in Figure 5. The DALES dataset defines 29 tiles (scenes) as the training set and 11 scenes as the testing set. We followed this guideline and conducted the testing on all 11 scenes of the testing set; Table 5 reports the results over all 11 scenes. The detailed explanation of how we prepared the DALES dataset is given in Section 4.2.2. For the qualitative assessment, however, we randomly selected one of the 11 testing scenes, the 5100_54440 scene. Because the DALES dataset does not provide the rgb colors of the points, this scene can be rendered only by its ground truth labels in the categorical color code, as shown in Figure 5a; Figure 5b shows the qualitative result of our proposed method predicted from the xyz coordinates of the points. In this scene, our method predicted only the ground and vegetation categories well and could not detect the other categories. We assume that this is because the DALES dataset has a huge volume (around 505 M points), while the Toronto-3D dataset has around 78 M points: our method used an input shape of 1024 points, which is small and could not capture large neighborhoods of points very well. On the other hand, our method performed well on the Toronto-3D dataset. Therefore, we can assume that our method, with the small input shape of 1024 points, can segment point clouds of up to around 78 M points.
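As a cross-check of the reported DALES scores, the sketch below computes OA, per-class IoU, and mIoU from a confusion matrix. It assumes the standard definitions: OA is the fraction of correctly labeled points, IoU is TP / (TP + FP + FN) per class, and mIoU is the unweighted mean over classes. The function name `segmentation_metrics` and the toy confusion matrix are illustrative assumptions, not the authors' evaluation code. The last lines average the eight per-class IoUs quoted in this section.

```python
def segmentation_metrics(confusion):
    """Return (OA, per-class IoU list, mIoU) for a square confusion matrix.

    confusion[i][j] = number of points whose true class is i and whose
    predicted class is j (assumed layout, chosen for this sketch).
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[c][c] for c in range(n))
    oa = correct / total

    ious = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                        # true c, predicted other
        fp = sum(confusion[r][c] for r in range(n)) - tp   # predicted c, true other
        denom = tp + fp + fn
        ious.append(tp / denom if denom else float("nan"))

    # Classes that never occur (NaN IoU) are skipped in the mean.
    valid = [v for v in ious if v == v]
    miou = sum(valid) / len(valid)
    return oa, ious, miou


# Toy two-class example (hypothetical counts).
toy_oa, toy_ious, toy_miou = segmentation_metrics([[3, 1], [1, 5]])

# Per-class IoUs (%) quoted in the text for DALES with xyz input.
dales_iou = {
    "ground": 86.78, "vegetation": 85.40, "fences": 84.89,
    "power lines": 67.47, "poles": 50.76, "cars": 50.63,
    "trucks": 32.59, "buildings": 17.66,
}
dales_miou = sum(dales_iou.values()) / len(dales_iou)
print(f"DALES mIoU from per-class IoUs: {dales_miou:.2f}%")
```

Averaging the eight quoted IoUs gives 59.5225, which rounds to the reported mIoU of 59.52%.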

Discussion on Toronto-3D Dataset
We conducted a comparison study against point-wise MLP methods (i.e., PointNet++ [11] and RandLA-Net [20]), a point convolution method (i.e., KPConv [25]), and a graph-based method (i.e., DGCNN [28]). The results on the Toronto-3D dataset are shown in Table 6. The results of PointNet++ [11], RandLA-Net [20], KPConv [25], and DGCNN [28] were taken from the original paper of the Toronto-3D dataset, which reports that they were trained on the xyz coordinates of the points; we ran only our own method on the Toronto-3D dataset. For a fair comparison, we therefore compare against them using our results on the xyz coordinates of the points. Compared with those methods, our proposed method achieved the lowest overall accuracy (OA), with 72.55%, but the second highest mean IoU (mIoU), with 66.87%, behind only RandLA-Net, with 77.71%. Our method led to the highest IoU in the building and fence categories, and the second highest IoU in the road and road marking categories. Even though the natural, utility line, and pole categories did not achieve the highest or the second highest IoU among the compared methods, they were predicted well, with IoUs of 88.66%, 81.03%, and 67.71%, respectively. In Table 6, the bold red number represents the highest score, and the bold blue number represents the second highest score.

Discussion on DALES Dataset
We conducted a comparison study against a point-wise MLP method (i.e., PointNet++ [11]), a point convolution method (i.e., KPConv [25]), and a graph-based method (i.e., SPG [29]). The results on the DALES dataset are shown in Table 7. The results of PointNet++ [11], KPConv [25], and SPG [29] were taken from the original paper of the DALES dataset, which reports that they were trained on the xyz coordinates of the points; we ran only our own method on the DALES dataset. Compared to those methods, our proposed method achieved the lowest score in both overall accuracy (OA), with 76.43%, and mean IoU (mIoU), with 59.52%. However, our method led to the highest IoU in the fence category, and the second highest IoU in the trucks and poles categories. Even though the ground, vegetation, and power lines categories did not achieve the highest or the second highest IoU among the compared methods, they performed well, with IoUs of 86.78%, 85.40%, and 67.47%, respectively. The cars and poles categories achieved IoU scores of 50.63% and 50.76%, respectively, which were around the average score. Our weakness is on the trucks and buildings categories, with IoUs of 32.59% and 17.66%, respectively, which were below the average score. Overall, our method achieved better performance on the Toronto-3D dataset than on the DALES dataset.
Table 8 shows a comparison of computational complexity and inference time per forward pass. The dash represents information that we could not study. For the neighboring strategy, PointNet++ [11] used farthest point sampling (FPS), with a complexity of O(N^2), while our method and RandLA-Net [20] used random point sampling (RPS), with a complexity of O(1). KPConv [25] used a Kd-tree for projecting, with a complexity of O(KN log N). We could not determine the complexity of DGCNN [28].
Besides our own method, we tested only PointNet++ [11] on the Toronto-3D and DALES datasets, with an input shape of 1024 points and a batch size of 8; we could not test the other methods. Our model has around 1.98 M parameters, which is the second fewest among the compared methods. Additionally, our experimental comparison shows that our method (inference time of 102.45 ms) is about three times faster than PointNet++ (inference time of 370.37 ms), a ratio of roughly 3.6. In Table 8, the bold number represents the lowest score.
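The difference between the two sampling strategies can be illustrated with a minimal pure-Python sketch. It is not the implementation used in any of the compared methods; the function names and the toy points are assumptions made for illustration. Farthest point sampling rescans all N points for every selected point, so selecting a number of points proportional to N costs on the order of N^2 distance evaluations, whereas random point sampling does a constant amount of work per selected point and never looks at the coordinates.

```python
import random


def random_point_sampling(points, m, seed=0):
    """Pick m distinct indices uniformly at random (no distance computations)."""
    rng = random.Random(seed)
    return rng.sample(range(len(points)), m)


def farthest_point_sampling(points, m, start=0):
    """Greedy FPS: each new point maximizes its distance to the chosen set.

    Every iteration scans all N points, so the total cost is O(m * N).
    """
    n = len(points)
    min_d2 = [float("inf")] * n   # squared distance to the nearest chosen point
    chosen = [start]
    for _ in range(m - 1):
        last = points[chosen[-1]]
        for i, p in enumerate(points):
            d2 = sum((a - b) ** 2 for a, b in zip(p, last))
            if d2 < min_d2[i]:
                min_d2[i] = d2
        chosen.append(max(range(n), key=min_d2.__getitem__))
    return chosen


# Toy point cloud (hypothetical coordinates).
toy_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0), (0.5, 0.0, 0.0)]
fps_idx = farthest_point_sampling(toy_points, 3)
rps_idx = random_point_sampling(toy_points, 3)

# Ratio of the two inference times quoted in the text (ms per forward pass).
speedup = 370.37 / 102.45
print(fps_idx, rps_idx, round(speedup, 2))
```

FPS spreads the selected points evenly over the scene at a higher cost, while RPS trades that coverage guarantee for speed, which is the trade-off the comparison in Table 8 refers to.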

Effect of Our Proposed Mechanism
We conducted a study on the effect of random point sampling (RPS), attention-based pooling (AP), and multiple losses (ML), on the Toronto-3D dataset, as shown in Table 9. We studied four scenarios, as follows: (1) We applied farthest point sampling (FPS), AP, and ML. The FPS + AP + ML gave an overall accuracy (OA) and a mean intersection over union (mIoU) of 70.88% and 65.67%, respectively. (2) We applied the RPS, max pooling (MP), and ML. The RPS + MP + ML gave an OA and mIoU of 64.79% and 65.37%, respectively. (3) We applied the RPS, AP, and a single loss score (SL). The RPS + AP + SL gave an OA and mIoU of 61.42% and 60.12%, respectively. (4) We applied the RPS, AP, and ML. The RPS + AP + ML gave an OA and mIoU of 72.55% and 66.87%, respectively. Through this study, we can see that the fourth scenario (RPS + AP + ML) gave the highest OA and the highest mIoU among the four scenarios. For the semantic segmentation task, the mIoU metric is considered to be more informative than OA, because OA is dominated by the most frequent categories. Therefore, we can assume that RPS + AP + ML is the best choice among those scenarios.
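To make the difference between the two pooling variants in this study concrete, the following sketch contrasts max pooling with an attention-weighted summation over the K neighbor features of one point. The exact scoring function of the proposed layer is not given in this section, so the sketch assumes a common formulation: a learnable linear map (the hypothetical `weights` matrix) produces one score per neighbor and channel, a softmax over the neighbors turns the scores into attention weights, and the output is the weighted sum of the neighbor features.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def max_pooling(neighbors):
    """Channel-wise maximum over K neighbor feature vectors."""
    return [max(channel) for channel in zip(*neighbors)]


def attention_pooling(neighbors, weights):
    """Attention-weighted sum over K neighbor feature vectors of size D.

    weights is a D x D matrix (assumed, stands in for a learned layer).
    """
    dim = len(neighbors[0])
    # scores[k][d]: score of neighbor k for output channel d.
    scores = [
        [sum(f[j] * weights[j][d] for j in range(dim)) for d in range(dim)]
        for f in neighbors
    ]
    pooled = []
    for d in range(dim):
        attention = softmax([s[d] for s in scores])   # sums to 1 over neighbors
        pooled.append(sum(a * f[d] for a, f in zip(attention, neighbors)))
    return pooled


# Two neighbors with two feature channels each (hypothetical values).
toy_neighbors = [[1.0, 4.0], [3.0, 0.0]]
uniform = attention_pooling(toy_neighbors, [[0.0, 0.0], [0.0, 0.0]])
peaked = attention_pooling(toy_neighbors, [[10.0, 0.0], [0.0, 10.0]])
hard_max = max_pooling(toy_neighbors)
print(uniform, peaked, hard_max)
```

With zero weights the attention is uniform and the layer reduces to average pooling; with large weights it approaches max pooling. A learned weighting can therefore sit anywhere between the two, which is one plausible reading of why AP outperforms MP in Table 9.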

Conclusions
In this paper, we proposed random point sampling, attention-based pooling, and a summation of multiple losses, integrated with the encoder-decoder shared MLPs method, for the semantic segmentation of large-scale outdoor point clouds. We experimented on the following two large-scale outdoor benchmark datasets: Toronto-3D and DALES. We achieved an overall accuracy (OA) and a mean intersection over union (mIoU) of 83.60% and 71.03% on the Toronto-3D dataset, and of 76.43% and 59.52% on the DALES dataset, respectively. Our experimental results show that our method performed well on the segmentation of the road, natural, building, utility line, and pole categories of the Toronto-3D dataset, and of the ground and vegetation categories of the DALES dataset.
We showed that our proposed method can (1) speed up the neighboring point selection process, through the random point sampling layer, with a computational complexity of O(1); (2) capture the important information for aggregation, through the attention-based pooling layer, by the attention-weighted summation; and (3) refine the structure of the learned feature space, through the summation of multiple losses. Additionally, in terms of computational cost, our model has a small number of parameters and is about three times faster than PointNet++ during inference. However, our method has limitations caused by imbalanced, unorganized, and huge-volume data, which we leave for future research.