Multi-Scale Attentive Aggregation for LiDAR Point Cloud Segmentation

Semantic segmentation of LiDAR point clouds has implications for self-driving, robotics, and augmented reality, among others. In this paper, we propose a Multi-Scale Attentive Aggregation Network (MSAAN) to achieve global consistency of point cloud feature representation and superior segmentation performance. First, building on RandLA-Net, a baseline encoder-decoder architecture for point cloud segmentation, we propose an attentive skip connection to replace the commonly used concatenation and to balance the encoder and decoder features of the same scale. Second, we introduce a channel attentive enhancement module into the local attention enhancement module to boost local feature discriminability and aggregate local channel structure information. Third, we develop a multi-scale feature aggregation method to capture the global structure of a point cloud from both the encoder and the decoder. The experimental results show that MSAAN significantly outperforms state-of-the-art methods, improving mIoU over the RandLA-Net baseline by 15.3% on scene-2 of the CSPC dataset, 5.2% on scene-5 of the CSPC dataset, and 6.6% on the Toronto3D dataset.


Background
Point clouds contain 3-dimensional (3D) information. Benefitting from the progress of modern sensor technology, high-quality point clouds can be obtained relatively easily. In computer vision and remote sensing, point clouds can be obtained by four main techniques including photogrammetric methods, Light Detection and Ranging (LiDAR) systems, Red Green Blue-Depth (RGB-D) cameras, and Synthetic Aperture Radar (SAR).
LiDAR point clouds are now widely used in many 3D understanding tasks, such as classification, semantic segmentation, and object detection. Among them, semantic segmentation of LiDAR point clouds is a crucial step toward high-level 3D point cloud understanding, with significant implications for automatic driving, robotics, augmented reality (AR), and smart cities, among others. In this work, we focus on developing effective deep learning-based models for the semantic segmentation of LiDAR point clouds, building on the recent developments outlined in the review section below.

Reviews
Most conventional segmentation methods design and extract handcrafted features, such as geographic features, spatial attributes of 3D shapes, and histogram statistics, from point clouds and then apply machine learning methods such as Support Vector Machine (SVM) [1], Random Forest (RF) [2], Conditional Random Field (CRF) [3], and Markov Random Field (MRF) [4] to model the designed features for segmentation. Handcrafted features rely on the prior knowledge of their designers, introduce additional uncertainties from hyper-parameter settings, and generalize poorly. Recently, with the emergence of open-source point cloud datasets and the rapid development of GPU technology, deep learning-based methods, which automatically learn high-level semantic representations end-to-end, have come to dominate the field of point cloud semantic segmentation.
Convolutional Neural Networks (CNNs) are widely used in applications such as image processing, video analysis, and natural language processing [5]. However, a conventional CNN can only process structured data such as images and cannot be directly applied to unordered and unstructured point clouds. To apply CNNs to 3D point clouds, point clouds have been transformed into structured data using multi-view representation [6,7], spherical representation [8–10], volumetric representation [11–13], lattice representation [14,15], and hybrid representation [16,17]. However, these methods suffer from high memory consumption and inaccurate representation, among other problems.
Currently, the mainstream methods process unstructured point clouds directly and can be separated into four categories: point-wise MLP methods, point convolution methods, RNN-based methods, and graph-based methods [18].
Point-wise MLP methods. These methods apply shared MLPs as basic units. The pioneering method of this kind is PointNet [19]. PointNet applies MLPs and a symmetric pooling function to learn global features of the input points. However, global features cannot capture local structural information and relations between points, which limits the ability of the network. PointNet++ [20] divides the point cloud into a set of small point clouds and extracts local features by using PointNet as a basic unit. Inspired by and based on PointNet and PointNet++, further modules have been introduced to learn local features better. Ref. [21] proposed a PointSIFT module to achieve orientation encoding and scale awareness, adapting to eight orientations and to patterns of different scales. PointWeb [22] proposed an Adaptive Feature Adjustment (AFA) module to learn relationships among local neighboring points. RandLA-Net [23] proposed an efficient and lightweight network to process 3D point clouds by applying random down-sampling to boost efficiency and save memory, introducing a local geometric feature extraction module to capture geometric information, and utilizing an attentive pooling module to aggregate local features.
Point convolution methods. Efficient convolutional operations are proposed for point clouds. PointCNN [24] learned an X-transformation from the input points, which could weight input features associated with the points and permute the points into a latent and potentially canonical order. KPConv [25] proposed a new convolutional operation, KPConv, where the weights of convolutional kernels were assigned to the input points close to them based on the Euclidean distances.
RNN-based methods. These methods are applied to capture intrinsic contextual features of point clouds. Ref. [26] proposed a point-wise pyramid pooling module to capture local coarse-to-fine structures and utilized two-directional hierarchical RNNs to obtain long-range spatial dependencies.
Graph-based methods. These methods focus on capturing potential shapes and geometric structures. SPG [27] defined point clouds as a set of simple shapes and super-points and generated directed graphs to capture structural and contextual information. GACNet [28] proposed a graph attention convolution to learn features from local regions selectively, with learnable shapes of kernels to adapt to objects with different shapes.
Despite the current progress of point cloud segmentation, further improvements are envisioned. First, the attention mechanism [29,30], which has been shown to be effective for the global balance and consistency of the encoder and decoder features in recent image segmentation [31,32], has not been applied to point cloud segmentation methods. Refs. [20,22–25,28] only apply the traditional skip connection, i.e., a concatenation operation, to combine the encoder and decoder, which results in a semantic gap between the feature layers. Second, multi-scale convolutional features are critical for grasping the entire structure of a point cloud, but most recent methods [20–25,27,28] only focus on the structure of the input format and the encoder and neglect multi-scale information fusion in the decoder. The accuracy of point cloud segmentation is expected to improve with a careful design that considers both attention modules and multi-scale information fusion. In addition, the details of the CNN structures can also be improved towards better local feature representation.

Our Works
In this paper, we propose a Multi-Scale Attentive Aggregation Network (MSAAN) for LiDAR point cloud semantic segmentation to address the above-mentioned limitations. Our contributions are summarized as follows.
(1) An Attentive Skip Connection (ASC) module based on the attention mechanism is proposed to replace the traditional skip connection and bridge the semantic gap between point cloud features in the encoder and decoder.
(2) A multi-scale aggregation is introduced to fuse point cloud features of different scales not only from the decoder but also from the encoder.
(3) A Channel Attentive Enhancement (CAE) module is introduced into the local spatial encoding module of RandLA-Net [23] to further increase the representation ability of local features.
(4) Our MSAAN significantly outperforms state-of-the-art methods on the CSPC and Toronto3D datasets, improving the mean intersection over union (mIoU) score over the RandLA-Net baseline by at least 5%.

Methods
The proposed Multi-Scale Attentive Aggregation Network (MSAAN) takes a large-scale point cloud as a single input and predicts a segmentation map that assigns each point to a category. MSAAN was developed on top of the recent RandLA-Net [23].
Several key adjustments were made for improvement. The framework, an encoder-decoder structure, is shown in Figure 1 and detailed in Section 2.1. The input data were first processed by a Point Feature Enrichment (PFE) module, which is detailed in Section 2.2. In each encoder layer, the features passed through a Local Attention Enhancement (LAE) module and were then randomly sampled to learn the local features of the point cloud efficiently, as detailed in Section 2.3. In Section 2.4, we describe how the encoder and decoder features at the same scale were fused by the proposed Attentive Skip Connection (ASC) module, instead of a common skip connection, to obtain more balanced and distinctive semantic information. We describe our multi-scale output aggregation for fusing features from different scales in Section 2.5. In Figure 1, the numbers in and below each blue block are the channel number of the features and the point number of the inputs, respectively; RS, US, MLP, and "nclass" denote random sampling, up-sampling, multi-layer perceptron, and the number of classes, respectively.

Backbone of the Encoder
We followed the strategy of RandLA-Net [23], which samples the whole point cloud probabilistically. At each batch of the learning loop, the single point with the minimum probability was selected as a center point to query N points from a pre-constructed K-D tree based on the Euclidean distance. These N points constituted the input data, denoted as $F \in \mathbb{R}^{N \times d}$ (Figure 1), where d is the number of observed values per point; for example, d = 6 typically corresponds to the coordinates (x, y, z), i.e., longitude, latitude, and altitude, and the three color bands R, G, B. The probability of the selected N points was then increased to ensure that new points would be selected in later batches. This strategy samples the point cloud quickly and evenly and avoids splitting an object into many parts, and it has been shown to be superior to previous studies [20–22,24] that take sliced local patches as the network input. The input data F was passed through the PFE layer to obtain a richer feature $F_E \in \mathbb{R}^{N \times [(d+3) \cdot K]}$, where 3 is the number of coordinate values and K is the number of nearest neighbors, so that the information of the neighboring points is integrated into the features of the current point. Typically, N is far greater than K; in this work, N = 4096 and K = 16. $F_E$ then passed through a fully connected layer to obtain the input point cloud feature $F_{in} \in \mathbb{R}^{N \times 8}$ for the encoder. The backbone network consisted of layers at four scales. In the encoder, the features passed through the LAE and the random sampling layer; the latter created the next scale with a down-sampling rate of one quarter.
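The block-selection step described above can be illustrated with a small NumPy sketch. It is only an illustration under stated assumptions: the function name `sample_input_block`, the brute-force distance search (a stand-in for the pre-constructed K-D tree), and the rule used to raise the selection score of visited points are all hypothetical and are not taken from the released code.

```python
import numpy as np

def sample_input_block(points, possibility, n_query):
    """Pick the least-visited point as the center and return the indices
    of its n_query nearest points (center included).

    points:      (P, d) array whose first three columns are x, y, z
    possibility: (P,) per-point selection score, updated in place
    """
    center_idx = int(np.argmin(possibility))
    # Brute-force squared Euclidean distances to the center.
    # ASSUMPTION: this replaces the K-D tree query for brevity.
    dist2 = np.sum((points[:, :3] - points[center_idx, :3]) ** 2, axis=1)
    # Indices of the n_query closest points (not sorted by distance).
    block_idx = np.argpartition(dist2, n_query - 1)[:n_query]
    # Raise the score of the selected points so that later batches start
    # from regions not yet visited. ASSUMPTION: the increment 1 / (1 + d^2)
    # is a hypothetical rule; it gives the center the largest increase.
    possibility[block_idx] += 1.0 / (1.0 + dist2[block_idx])
    return block_idx

rng = np.random.default_rng(0)
cloud = rng.random((2000, 6))             # x, y, z, R, G, B  (d = 6)
scores = rng.random(2000) * 1e-3          # small random initial scores
block = cloud[sample_input_block(cloud, scores, n_query=128)]
print(block.shape)                        # one network input F of shape (N, d)
```

With the paper's settings (d = 6, K = 16), the enriched feature has (d + 3) * K = 144 values per point before the fully connected layer reduces it to 8 channels.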

PFE (Point Feature Enrichment) Module
We applied a PFE module [33] as the preprocessor of the original input data. The PFE layer applies a gated fusion strategy to enrich the input data of the segmentation network by incorporating information from the current point and its neighboring points. The PFE module is illustrated in Figure 2 (please refer to [33] for a detailed description). In Figure 2, the first number in each bracket is the point number, and the second is the feature number.

LAE (Local Attention Enhancement) Module
The original LAE module was proposed by [23]. In this paper, we revised the LAE structure by adding a channel attentive enhancement (CAE) branch into the second branch of the Local Feature Enhancement (LFE) layer. The original LAE only extracts relative geographic features to obtain the spatial structure of the point cloud, whereas the CAE captures the discrepancies among channels and re-balances them along the channel direction. The structure of the revised LAE is illustrated in Figure 3. We mainly introduce the newly added CAE here and refer to [23] for detailed descriptions of the other parts, such as LFE and Relative Geographic Extraction (RGE). CAE was constructed as follows. First, we obtained the feature map $F \in \mathbb{R}^{N \times K \times d}$ of the K neighboring points of the N input points and transposed F into $B \in \mathbb{R}^{N \times d \times K}$. The third dimension of the product of B and F, which has shape $N \times d \times d$, was reduced to 1 with a max pooling operation and restored to d with a copy operation, and the product was then subtracted from this pooled copy. From the resulting difference, we obtained the attentive weight matrix $W \in \mathbb{R}^{N \times d \times d}$, and the weight W was then used to update F.
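The CAE steps above can be sketched in NumPy as follows. The text here does not give the exact formulas that turn the difference into W and that update F, so two choices in this sketch are assumptions rather than the authors' definitions: W is obtained with a softmax over the last dimension, and F is updated by the batched matrix product F W. The function names are hypothetical.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def channel_attentive_enhancement(F):
    """F: (N, K, d) features of the K neighbors of N points.
    Returns the updated features (N, K, d) and the weights W (N, d, d)."""
    B = np.transpose(F, (0, 2, 1))             # (N, d, K), transpose of F
    S = B @ F                                  # (N, d, d), channel-by-channel product
    S_max = S.max(axis=2, keepdims=True)       # max pooling: third dimension -> 1
    S_max = np.broadcast_to(S_max, S.shape)    # copy operation: restore to d
    D = S_max - S                              # product subtracted from the pooled copy
    W = softmax(D, axis=2)                     # ASSUMPTION: softmax normalization
    F_new = F @ W                              # ASSUMPTION: update by matrix product
    return F_new, W

rng = np.random.default_rng(0)
F = rng.standard_normal((4096, 16, 8))         # N = 4096, K = 16, d = 8
F_new, W = channel_attentive_enhancement(F)
print(F_new.shape, W.shape)
```

Under these assumptions each row of W sums to one, so every output channel is a convex combination of the input channels; channels that are already strongly correlated with a given channel receive smaller weights because of the subtraction from the maximum.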

ASC (Attentive Skip Connection) Module
Inspired by [34], we proposed an ASC module for balancing the encoder and decoder features. The ASC module bridges the semantic gap between features in the encoder and decoder to achieve a better feature representation with global consistency. The module is detailed in Figure 4. Here, the low-level features are the features of the encoder stage, and the high-level features are the features of the decoder stage at the same scale. We computed attentive scores from the high-level features with a squeeze, an MLP, and a softmax operation; the scores were then multiplied with the low-level features that had passed through an MLP and been squeezed. Finally, we concatenated the attentive low-level features and the squeezed high-level features and applied an expansion operation to produce the final output of this module.
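A minimal NumPy sketch of this data flow is given below for a single sample. Several details are assumptions made only for illustration: features are stored as (N, 1, C) arrays so that "squeeze" and "expansion" remove and restore the singleton axis, each MLP is replaced by one shared linear layer with hypothetical weights `w_low` and `w_high`, and the softmax is taken over the channel axis.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attentive_skip_connection(low, high, w_low, w_high):
    """low:  (N, 1, C_l) encoder features, high: (N, 1, C_h) decoder features.
    w_low: (C_l, C) and w_high: (C_h, C) are stand-ins for the two MLPs."""
    high_s = np.squeeze(high, axis=1)                 # squeeze: (N, C_h)
    # Attentive scores from the high-level features: squeeze -> MLP -> softmax.
    scores = softmax(high_s @ w_high, axis=-1)        # (N, C), ASSUMPTION: channel axis
    # Low-level features: MLP (linear + ReLU here), then squeeze.
    low_s = np.squeeze(np.maximum(low @ w_low, 0.0), axis=1)   # (N, C)
    attentive_low = scores * low_s                    # re-weighted encoder features
    fused = np.concatenate([attentive_low, high_s], axis=-1)   # (N, C + C_h)
    return np.expand_dims(fused, axis=1)              # expansion: (N, 1, C + C_h)

rng = np.random.default_rng(0)
low = rng.standard_normal((1024, 1, 32))
high = rng.standard_normal((1024, 1, 64))
out = attentive_skip_connection(
    low, high,
    w_low=rng.standard_normal((32, 32)) * 0.1,
    w_high=rng.standard_normal((64, 32)) * 0.1,
)
print(out.shape)
```

Compared with a plain concatenation, the only change is that the encoder features are scaled by scores derived from the decoder features before being concatenated.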

Multi-Scale Aggregation
The multi-scale feature aggregation of the decoder has proved effective in image segmentation but has not yet been demonstrated in point cloud segmentation. We proposed an aggregation method for point cloud segmentation that differs from the strategies commonly used in image processing, which only utilize the information of the decoder [35,36]. It is worth noting that encoder information was introduced into image segmentation only very recently [37]. In this work, we first upsampled the features of each scale in the decoder to the spatial dimension of the input and then concatenated them with the output features of the first LAE layer in the encoder. The concatenated features passed through an MLP, i.e., a fully connected layer without spatial dimension reduction, to obtain new features with 32 channels. The new features at the four scales were concatenated to form a 128-d feature map, as shown in Figure 5. The map was then compressed with two fully connected layers, a dropout layer, and a final fully connected layer to output the categories, as shown in Figure 1.
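The aggregation can be sketched as below. This is a sketch under assumptions, not the released implementation: nearest-neighbor interpolation is assumed for the up-sampling, each MLP is a single linear layer with ReLU and hypothetical weights, and the per-scale channel widths in the usage example are placeholders.

```python
import numpy as np

def nearest_upsample(feat, coarse_xyz, full_xyz):
    """Copy to every full-resolution point the feature of its nearest coarse point.
    ASSUMPTION: nearest-neighbor interpolation, computed by brute force."""
    d2 = ((full_xyz[:, None, :] - coarse_xyz[None, :, :]) ** 2).sum(axis=-1)  # (N, M)
    return feat[d2.argmin(axis=1)]                                            # (N, C)

def multi_scale_aggregate(decoder_feats, decoder_xyz, full_xyz, enc_first, weights):
    """decoder_feats[s]: (M_s, C_s) decoder features at scale s
    decoder_xyz[s]:   (M_s, 3) coordinates of the points at scale s
    enc_first:        (N, C_e) output of the first LAE layer in the encoder
    weights[s]:       (C_s + C_e, 32) stand-in for the per-scale MLP"""
    branches = []
    for feat, xyz, w in zip(decoder_feats, decoder_xyz, weights):
        up = nearest_upsample(feat, xyz, full_xyz)       # to input resolution
        cat = np.concatenate([up, enc_first], axis=1)    # add encoder information
        branches.append(np.maximum(cat @ w, 0.0))        # 32 channels per scale
    return np.concatenate(branches, axis=1)              # (N, 32 * number of scales)

rng = np.random.default_rng(0)
n = 512
full_xyz = rng.random((n, 3))
sizes = [n, n // 4, n // 16, n // 64]      # four scales, quarter down-sampling
chans = [32, 64, 128, 256]                 # placeholder channel widths
xyz = [full_xyz[:m] for m in sizes]
feats = [rng.standard_normal((m, c)) for m, c in zip(sizes, chans)]
enc_first = rng.standard_normal((n, 32))
ws = [rng.standard_normal((c + 32, 32)) * 0.1 for c in chans]
print(multi_scale_aggregate(feats, xyz, full_xyz, enc_first, ws).shape)  # 128-d map
```

The resulting 128-d map would then go through the fully connected and dropout layers described above to produce per-point class scores.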

Experiment Design
We evaluated the proposed method on two datasets, CSPC [38] and Toronto3D [39]. CSPC (Complex Scene Point Cloud dataset) is the most recent point cloud dataset for semantic segmentation of large-scale outdoor scenes, covering five urban and rural scenes: scene-1 shows a simple street, scene-2 a busy urban street, scene-3 a busy urban street at night, scene-4 a campus, and scene-5 a rural street. This dataset includes 68 million points in six object categories: ground, car, building, vegetation, bridge, and pole. The number of points in each category is listed in Table 1. Every point has six property values, three for position and three for RGB color. Toronto3D covers a street of 1000 m length, divided into four areas, L001, L002, L003, and L004, and contains 78.3 million points. Every point carries its 3D position, RGB color, intensity, GPS time, scan angle rank, and category. There are eight categories: road, road marking, natural, building, utility line, pole, car, and fence. The number of points in each category is listed in Table 2. We applied three representative metrics, the Intersection over Union (IoU) of each class, the mean IoU (mIoU), and the Overall Accuracy (OA), to evaluate the performance of our method and of the methods compared. The mIoU was considered the main index.
We set K to 16 in the K-nearest-neighbor search and N to 4096 in each batch. The number of training epochs was set to 100 for both CSPC and Toronto3D. The learning rate was set to 0.01. Our algorithm was implemented with TensorFlow 1.11 and CUDA 9.0 on an Ubuntu 16.04 system, using an NVIDIA GeForce GTX 1080 Ti GPU with 11 GB of memory. The source code is available at http://gpcv.whu.edu.cn/data/ (accessed on 8 February 2021).
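For reference, the three metrics can be computed from a confusion matrix as in the following sketch, which uses the standard definitions IoU = TP / (TP + FP + FN) per class, mIoU as the mean of the per-class IoUs, and OA as the fraction of correctly labeled points. The function names are hypothetical, and the treatment of classes absent from both prediction and ground truth (excluded from the mean) is an assumption.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    # Rows are ground-truth classes, columns are predicted classes.
    idx = np.asarray(y_true) * n_classes + np.asarray(y_pred)
    return np.bincount(idx, minlength=n_classes ** 2).reshape(n_classes, n_classes)

def segmentation_metrics(y_true, y_pred, n_classes):
    cm = confusion_matrix(y_true, y_pred, n_classes)
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp            # predicted as the class but wrong
    fn = cm.sum(axis=1) - tp            # belonging to the class but missed
    denom = tp + fp + fn
    # Classes that never occur get NaN and are skipped in the mean (assumption).
    iou = np.where(denom > 0, tp / np.maximum(denom, 1.0), np.nan)
    miou = float(np.nanmean(iou))
    oa = float(tp.sum() / cm.sum())
    return iou, miou, oa

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
iou, miou, oa = segmentation_metrics(y_true, y_pred, n_classes=3)
print(iou, miou, oa)   # IoU = [1/3, 2/3, 1/2], mIoU = 0.5, OA = 4/6
```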

Experiments and Analysis
CSPC Dataset. We used scene-1, scene-3, and scene-4 as the training set and scene-2 and scene-5 as the test set. We compared our method with SnapNet [40], 3D-CNN [41], DeepNet [42], PointNet++ [20], KPConv [25], and RandLA-Net [23]. The results are shown in Tables 3 and 4. First, our method significantly outperformed RandLA-Net, the baseline and third-best method. The introduction of the attentive skip connection, the multi-scale aggregation, and the Channel Attentive Enhancement (CAE) branch in LAE led to mIoU improvements of 15.3% and 5.2% on scene-2 and scene-5, respectively. Second, our method outperformed the second-best method, KPConv, by 10.2% and 3.0% on mIoU. KPConv proposed a convolutional operation, also named KPConv, to capture local features, whose weights are defined by a set of convolution kernels. By contrast, our method processes the complete grid by random sampling and multi-scale feature aggregation, which firmly grasps the global information at each learning loop. Third, there was a large gap in performance between the earlier studies (SnapNet, 3D-CNN, DeepNet, and PointNet++) and the recent RandLA-Net and our method (Tables 3 and 4). The earlier methods performed much worse; for example, their mIoU scores were at least 20% lower than the recent scores. SnapNet, which projects the 3D point cloud into multi-view 2D images and uses deep learning-based methods to segment these images to realize point cloud segmentation, performed the worst. PointNet++ applies point-wise MLPs to extract local features. 3D-CNN transforms the point cloud into sparse voxels as the inputs of a 3D CNN for segmentation. DeepNet transforms the 3D point cloud into voxels as well. All of them lack the ability to grasp the complete and global point cloud structure. In addition, our method shows strength in identifying sparse points, as indicated by the significant improvement in classifying poles and bridges.
We show some predicted samples of local regions in Figure 6 to demonstrate the difference between our results and those of the baseline RandLA-Net. The details reveal the better performance of our method.
Toronto3D Dataset. We used L001, L003, and L004 as the training set and L002 as the test set. We compared our method with PointNet++ [20], DGCNN [43], KPConv [25], MS-PCNN [44], TGNet [45], and RandLA-Net [23]. DGCNN proposes an operation dubbed edge convolution (EdgeConv) that acts on graphs. MS-PCNN uses dynamic point-wise convolutional operations at multiple scales for point cloud segmentation. TGNet proposes a graph convolution function named TGConv to extract point features of neighbors. The results are shown in Table 5. Conclusions similar to those for the CSPC dataset can be drawn. Our method exceeded the second-best method, RandLA-Net, by 6.6% and the remaining methods by at least 15% on mIoU. Compared with RandLA-Net, our method improved on all three evaluation metrics; in particular, the IoU of road marking, pole, car, and fence improved by over 10 percent.

Ablation Study
To better understand the effect and influence of each proposed module, namely the Multi-Scale Aggregation (MS), the Channel Attention (CA), and the Attentive Skip Connection (ASC), we conducted an ablation study. Specifically, we gradually added MS, CA, and ASC to the backbone network, i.e., the second-best RandLA-Net, to evaluate the model performance. From Table 6, it is observed that the introduction of MS, the combination of MS and CA, and the combination of MS, CA, and ASC increased the mIoU score by 5%, 6.4%, and 11.7%, respectively. This demonstrates the effectiveness of all of the introduced modules. MS and ASC are the main contributors, each adding about 5% mIoU (5.0% for MS and 11.7% − 6.4% = 5.3% for ASC), whereas CA added 6.4% − 5.0% = 1.4%. The reasons behind this progress can be summarized as follows: the multi-scale aggregation makes up for the lack of critical point cloud information fusion in both the encoder and the decoder of the original RandLA-Net, and the attentive skip connection, instead of an arbitrary concatenation, reweights and balances the features from the encoder and the decoder to achieve global consistency of feature representation. The multi-scale aggregation for both the encoder and decoder and the attention mechanism for the encoder-decoder fusion provide useful references for the future design of point cloud segmentation models. RandLA-Net+MS (decoder) denotes the variant in which the multi-scale aggregation is only applied to the multi-scale features of the decoder [36], whereas our multi-scale aggregation utilizes information from both the encoder and the decoder. Our aggregation yields a 1.7% mIoU improvement over this variant, showing the effectiveness of integrating low-level features into the final global feature representation.

Conclusions
We proposed MSAAN (Multi-Scale Attentive Aggregation Network) for large-scale point cloud semantic segmentation. Three contributions were made. First, we proposed an attentive skip connection (ASC) module to replace the commonly used concatenation and to balance the encoder and decoder features of the same scale. Second, we introduced a channel attentive enhancement (CAE) module to boost local feature discriminability and aggregate local channel structure information. Finally, we fused the multi-scale features of the network to achieve global consistency. The experimental results on the CSPC and Toronto3D datasets demonstrated the effectiveness of our method. The attention mechanism plays an important and even indispensable role in modern CNN-based image feature representation. Our work further extends the application of attention modules to point cloud processing.