1. Introduction
Airborne LiDAR point clouds, acquired by light detection and ranging sensors mounted on aerial platforms, are collections of points that preserve the original geometric properties of the scanned surfaces. With the rapid development of computer vision and remote sensing technology, the application of airborne LiDAR point cloud data to urban scenes has attracted increasing attention, especially in fields such as navigation and positioning, autonomous driving, smart cities, and 3D vision [1]. Point clouds in urban scenes are important information carriers that comprise complex surface features. To understand 3D urban scenes accurately at the point level, the concept of point cloud semantic segmentation was proposed. Semantic segmentation, an important technique for LiDAR point cloud processing, aims to subdivide a point cloud into specific point sets with independent attributes, recognize the target type of each set, and assign semantic labels [2]. Semantic segmentation of airborne LiDAR point clouds in urban scenes can quickly extract typical feature information and support the understanding of complex urban scenes, thereby effectively reflecting the spatial layout, development scale, and greening level of a city; it plays a crucial role in urban development planning, smart cities, and geo-databases [3]. Nevertheless, semantic segmentation of point clouds remains a great challenge because airborne LiDAR point clouds are highly redundant, incomplete, and complex [4,5].
To extract surface features from 3D point clouds, traditional methods usually construct a segmentation model from manually chosen geometric attributes and statistical features, using models such as the support vector machine (SVM) [6], random forest (RF) [7], conditional random field (CRF) [8], and Markov random field (MRF) [9]. However, the selection of statistical features relies mainly on the prior knowledge of operators, which introduces considerable arbitrariness, limits feature extraction from point clouds, and generalizes poorly. With the growth of computing power and the continuous emergence of 3D scene datasets, deep learning has come to play a dominant role in point cloud semantic segmentation.
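As an illustration of such handcrafted features, the sketch below computes classic eigenvalue-based descriptors (linearity, planarity, scattering) from local point neighborhoods; descriptors of this kind are typical inputs to SVM or RF classifiers. This is a generic NumPy sketch, not code from any cited work, and the neighborhood size `k` is an arbitrary choice.

```python
import numpy as np

def eigen_features(points, k=16):
    """Classic eigenvalue descriptors (linearity, planarity, scattering)
    computed from the covariance of each point's k nearest neighbours."""
    n = len(points)
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    knn = np.argsort(d2, axis=1)[:, :k]               # neighbour indices (incl. self)
    feats = np.zeros((n, 3))
    for i in range(n):
        cov = np.cov(points[knn[i]].T)                # 3x3 local covariance
        lam = np.sort(np.linalg.eigvalsh(cov))[::-1]  # eigenvalues, descending
        lam = np.maximum(lam, 1e-12)
        feats[i] = [(lam[0] - lam[1]) / lam[0],       # linearity
                    (lam[1] - lam[2]) / lam[0],       # planarity
                    lam[2] / lam[0]]                  # scattering
    return feats

rng = np.random.default_rng(0)
roof = rng.uniform(size=(64, 3)) * [10.0, 10.0, 0.01]  # nearly planar patch
f = eigen_features(roof)
```

On the nearly planar patch above, the planarity channel dominates while scattering stays close to zero, which is exactly the kind of separation a classifier such as an RF can exploit.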
Deep learning [10] was first applied to semantic segmentation of point clouds through rasterization. Su et al. [11] proposed the Multi-View Convolutional Neural Network (MVCNN), which obtains segmentation results by convolving and aggregating 2D images of point clouds rendered from different perspectives. Boulch et al. [12] produced pairs of snapshots containing RGB views and depth maps of geometric features, assigned labels to the corresponding pixels of each pair, and then mapped the labeled pixels back onto the original data. Wu et al. [13] extracted features from projected 2D images with a CNN, output a pixel-by-pixel label map, refined it with a conditional random field (CRF) model, and finally obtained instance-level labels through a traditional clustering algorithm. In addition, voxelization is a common way for researchers to process irregular 3D point clouds. Maturana et al. [14] proposed the VoxNet network, which classifies voxelized point clouds with a supervised 3D convolutional neural network (CNN). Tchapmi et al. [15] generated coarse voxel labels with a 3D fully convolutional network on voxelized point clouds and then refined the predictions by combining trilinear interpolation with a fully connected CRF to learn fine-grained details. Wang et al. [16] performed multi-scale voxelization of point clouds, extracted features, adaptively learned local geometric features, and globally optimized the predicted class probabilities with a CRF that fully accounts for the spatial consistency of point clouds. These multi-view and voxel-based semantic segmentation methods solve the structural problems of point clouds and have practical value. However, multi-view methods inevitably lose 3D spatial information during rasterization, while voxel-based methods increase spatial complexity and incur high storage and computation costs.
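A minimal occupancy-grid voxelization sketch makes the storage cost concrete: memory grows cubically with the grid resolution, regardless of how sparse the point cloud is. The resolution and point count below are arbitrary illustrative values, not settings from any cited network.

```python
import numpy as np

def voxelize(points, res=32):
    """Bin points into a fixed res^3 occupancy grid (True = occupied)."""
    mins, maxs = points.min(0), points.max(0)
    idx = ((points - mins) / (maxs - mins + 1e-9) * res).astype(int)
    idx = np.clip(idx, 0, res - 1)                 # keep boundary points in range
    grid = np.zeros((res, res, res), dtype=bool)   # allocated even where empty
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return grid

pts = np.random.default_rng(1).normal(size=(2048, 3))
g = voxelize(pts, res=32)
```

Even at a modest 32^3 resolution the grid already holds 32,768 cells while only a small fraction can be occupied by 2048 points, illustrating why voxel methods pay a heavy memory price at high resolutions.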
Therefore, several effective frameworks for processing point cloud data directly have been proposed. Qi et al. [17] designed PointNet, which encodes each point with a multilayer perceptron (MLP) and obtains global features through an aggregation function. Nevertheless, it ignores local spatial structure and lacks the extraction and utilization of local features. Qi et al. [18] proposed an improved version of PointNet, denoted PointNet++, which introduces density-adaptive layers, learns point-set features at different scales through hierarchical sampling and grouping, and captures local details. However, PointNet++ still processes each point independently, without considering connections among neighboring points, and its K-nearest-neighbor search results suffer from a single-direction bias. Jiang et al. [19] designed a scale-aware descriptor for the ordered encoding of information from different directions and the effective capture of local point cloud information. Wang et al. [20] built local neighborhood graphs with KNN and used the EdgeConv module to capture local geometric features, making full use of point neighborhood information. Building on the local neighborhood processing of PointNet++, Zhao et al. [21] added an adaptive feature adjustment module to transform and aggregate upper- and lower-level information, then integrated information across channels through an MLP and max pooling, strengthening the ability of features to describe the local neighborhood. Xie et al. [22] proposed the selection, aggregation, and transformation of key components by building shape context kernels, capturing and propagating local and global information to express the intrinsic attributes of object points; the transformation component follows the overall PointNet architecture. Landrieu et al. [23] partitioned point clouds into super-points according to geometric shape and then learned features for each super-point with a shared PointNet to predict semantic labels. Li et al. [24] proposed the X-Conv operator based on the spatial local correlation of point cloud data; it standardizes unordered point clouds through the weighting and permutation of input points and then extracts local features with a CNN. Based on the SA module of PointNet++, Qian et al. [25] introduced the InvResMLP module to achieve efficient and practical model scaling, which alleviates vanishing gradients and improves feature extraction. Hua et al. [26] determined the features of each point through pointwise convolution and thereby realized semantic segmentation. Hu et al. [27] replaced the farthest point sampling (FPS) of PointNet++ with random sampling and gradually enlarged the receptive field of each 3D point through a local feature aggregation module, effectively retaining geometric details. Nong et al. [
28] densely connected point pairs on the basis of PointNet++, supplemented center-point features to learn contextual information, and proposed an interpolation method with adaptive elevation weights to propagate point features; however, the method lacks global information connections. Owing to the great success of the transformer model [29] in capturing contextual information, researchers have introduced it into 3D point cloud processing [30]. Li et al. [31] proposed a geometry-aware convolution to handle a large number of geometric instances, supplemented the receptive field with a dense hierarchical architecture, and designed an elevation-attention module to refine the classification. Zhao et al. [32] used the transformer to exchange local feature information and fit the geometric spatial layout. Guo et al. [33] proposed an offset-attention module to better understand point cloud features and captured local geometric information with a neighbor embedding strategy. Zhang et al. [34] introduced a bias into the transformer model to extract relationships between local points and address the sparsity of point cloud data, and proposed a standardization set abstraction module to extract global information that complements topological relationships.
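The shared per-point MLP plus symmetric aggregation at the heart of these PointNet-style networks can be sketched as follows; this is a simplified NumPy illustration with random weights, not the authors' implementation, and the layer sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

def shared_mlp(x, w1, w2):
    # The same weights are applied to every point (a "shared" MLP),
    # so each point is encoded independently of the others.
    return np.maximum(x @ w1, 0) @ w2   # ReLU hidden layer, linear output

n, d_in, d_hid, d_out = 1024, 3, 32, 64
pts = rng.normal(size=(n, d_in))
w1 = rng.normal(size=(d_in, d_hid))
w2 = rng.normal(size=(d_hid, d_out))

point_feats = shared_mlp(pts, w1, w2)   # (n, d_out) per-point features
global_feat = point_feats.max(axis=0)   # symmetric max-pool aggregation

# Permuting the input points leaves the global descriptor unchanged.
perm = rng.permutation(n)
assert np.allclose(shared_mlp(pts[perm], w1, w2).max(axis=0), global_feat)
```

Because max pooling is a symmetric function, the global descriptor is invariant to the order of the input points, which is what makes this design suitable for unordered point clouds; the price, as noted above, is that each point is encoded without reference to its neighbors.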
Although the above methods have made some progress in point cloud semantic segmentation, they do not adequately consider the relations among point features and lack deep feature interaction. Semantic segmentation of urban LiDAR point clouds remains challenging because airborne laser point clouds are unevenly distributed in space, points from neighboring surfaces mix at their boundaries, and objects with the same semantics appear at different scales. To address these problems, this study designs a convolutional network based on a fusion attention mechanism, called SMAnet, which builds on PointNet++ and operates directly on 3D point clouds. The fusion attention mechanism processes the self-attention mechanism (SAM) [35] and the multi-head attention mechanism (MAM) [29] in parallel. The essence of SAM is to compute similarity from the global features of each point and to allocate different weights accordingly. By fully considering the interactions among points, SAM can effectively distinguish the mixed point clouds at surface boundaries. The MAM was introduced to account for the influence of correlations among different local features. Its core idea is to divide the high-dimensional features of points into different feature subspaces containing different attribute information; it then evaluates feature similarity among these subspaces and adjusts the subspace channel information accordingly. The MAM captures connections among different aspects of point features, makes full use of the information correlation in local features, increases the granularity of the network, and can effectively recognize surface points at different scales.
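The subspace idea behind multi-head attention can be sketched as follows. This toy NumPy version (head count and sizes chosen arbitrarily, and without the learned query/key/value projections of a full transformer layer) only illustrates how the feature dimension is split into subspaces that each compute their own softmax similarity among points; it is not the SMAnet fusion-attention module itself.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # stabilized softmax
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(feats, heads=4):
    """Split features into subspaces, attend within each, re-concatenate."""
    n, d = feats.shape
    assert d % heads == 0
    w = d // heads
    out = []
    for h in range(heads):
        sub = feats[:, h * w:(h + 1) * w]            # one feature subspace
        attn = softmax(sub @ sub.T / np.sqrt(w))     # (n, n) point similarity
        out.append(attn @ sub)                       # similarity-weighted mix
    return np.concatenate(out, axis=1)               # back to (n, d)

feats = np.random.default_rng(2).normal(size=(128, 32))
mixed = multi_head_attention(feats)
```

Each head sees only its own slice of the feature vector, so different heads can weight point similarity by different attribute information, which is the property the MAM exploits to relate different aspects of point features.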
The network also adopts lightweight multi-scale feature extraction, supplementing local geometric information with local features at different levels. Moreover, unlike previous global feature extraction based on aggregation functions, a global information extraction method based on SoftMax-stochastic pooling (SSP) was designed, which expands the receptive field of the network model and increases computational efficiency as well as segmentation accuracy.
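As a rough illustration of softmax-based stochastic pooling, the sketch below draws, for each channel, one point's activation according to the softmax distribution over all points, so strong responses are favored without always taking the maximum. This follows the generic stochastic-pooling idea and is not necessarily the exact SSP module proposed here.

```python
import numpy as np

def softmax_stochastic_pool(feats, rng):
    """Pool (n, d) point features to a (d,) global vector by sampling one
    point per channel with softmax probabilities over its activations."""
    n, d = feats.shape
    pooled = np.empty(d)
    for c in range(d):
        p = np.exp(feats[:, c] - feats[:, c].max())
        p /= p.sum()                         # softmax over the n points
        pooled[c] = feats[rng.choice(n, p=p), c]
    return pooled

rng = np.random.default_rng(3)
feats = rng.normal(size=(256, 16))
g = softmax_stochastic_pool(feats, rng)
```

Compared with max pooling, the sampling step lets lower-but-informative activations contribute to the global descriptor, which is one way a stochastic pooling scheme can widen the effective receptive field.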
The remainder of this study is organized as follows.
Section 2 introduces the proposed SMAnet method and its principles.
Section 3 describes the experimental details and results.
Section 4 presents the discussion, including comparative analysis, ablation studies, and additional experiments.
Section 5 summarizes the conclusions.