Multispectral LiDAR Point Cloud Classification Using SE-PointNet++

Abstract: A multispectral light detection and ranging (LiDAR) system, which simultaneously collects spatial geometric data and multi-wavelength intensity information, opens the door to three-dimensional (3-D) point cloud classification and object recognition. Because of the irregular distribution of point clouds and the massive data volume, point cloud classification directly from multispectral LiDAR data remains challenging. In this paper, a point-wise multispectral LiDAR point cloud classification architecture, termed SE-PointNet++, is proposed by integrating a Squeeze-and-Excitation (SE) block with an improved PointNet++ semantic segmentation network. PointNet++ extracts local features from unevenly sampled points and represents local geometrical relationships among the points through multi-scale grouping. The SE block is embedded into PointNet++ to strengthen important channels and increase feature saliency for better point cloud classification. Our SE-PointNet++ architecture has been evaluated on the Titan multispectral LiDAR test datasets and achieved an overall accuracy, a mean Intersection over Union (mIoU), an F1-score, and a Kappa coefficient of 91.16%, 60.15%, 73.14%, and 0.86, respectively. Comparative studies with five established deep learning models confirm that the proposed SE-PointNet++ achieves promising performance in multispectral LiDAR point cloud classification tasks.


Introduction
Airborne single-channel light detection and ranging (LiDAR) has been widely used in many applications, such as topographic mapping, urban planning, forest inventory, and environmental monitoring, owing to its ability to quickly acquire large-scale and high-precision information about the Earth's surface [1,2]. Based on highly-accurate three-dimensional (3-D) height information and single-wavelength infrared intensity information, point cloud classification has become an active research direction in the fields of photogrammetry, remote sensing, and computer science [3][4][5]. However, LiDAR data alone achieve unsatisfactory fine-grained point cloud classification results due to the lack of rich spectral information. Therefore, LiDAR data are commonly combined with optical images to better understand Earth surface dynamics and to monitor ground objects and their changes [6][7][8]. Some competitions and special issues were organized on 2-D multispectral feature images converted from 3-D multispectral LiDAR data [25,26]. Although rasterizing 3-D multispectral LiDAR point clouds into 2-D multispectral feature images greatly reduces the computational cost when classifying large areas, such conversion introduces conversion errors and spatial information loss for some objects (e.g., powerlines and fences), leading to incomplete and possibly unreliable point cloud classification results.
The 3-D multispectral point cloud based methods directly perform point-wise classification without data conversion. Wichmann et al. [27] first performed data fusion through a nearest neighbor approach, and then analyzed the spectral patterns of several land covers to perform point cloud classification. Morsy et al. [28] separated land-water and vegetation-built-up classes by using three normalized difference feature indices derived from the three Titan multispectral LiDAR wavelengths. Sun et al. [29] performed 3-D point cloud classification on a small test scene by integrating PCA-derived spatial features with the laser return intensity at different wavelengths. Ekhtari et al. [30,31] proved that multispectral LiDAR data have good feature recognition abilities by directly classifying the data into ten ground object classes (i.e., building 1, building 2, asphalt 1, asphalt 2, asphalt 3, soil 1, soil 2, tree, grass, and concrete) via a support vector machine (SVM) method. Ekhtari et al. [31] also performed an eleven-class classification task and achieved an OA of 79.7%. In addition, some multispectral point cloud classification studies have demonstrated that 3-D multispectral LiDAR point cloud based methods were superior to 2-D multispectral feature image based methods, with an OA improvement of 10% in [32] and 3.8% in [33]. Recently, Wang et al. [34] extracted geometric-and-spectral features from multispectral LiDAR point clouds through a tensor representation. Compared with a vector-based feature representation, a tensor preserves more information for point cloud classification due to its high-order data structure. Most of the aforementioned methods achieved good point cloud classification performance in most cases, even correctly identifying some very long and narrow objects (e.g., powerlines and fences) [2] and some ground objects in complex environments (e.g., roads occluded by tree canopies) [1].
These methods classified objects from multispectral LiDAR data according to the spectral, geometrical, and height-derived features of the data. Feature extraction and selection play an important part in point cloud classification, but there is no simple way to determine the optimal number of features and the most appropriate features in advance to ensure robust point cloud classification accuracy. Therefore, to further improve point cloud classification accuracy, we explore deep learning methods that operate directly on 3-D multispectral LiDAR point clouds.
To effectively process unstructured, irregularly-distributed point clouds, a set of networks/models have been proposed, such as PointNet [35], PointNet++ [36], DGCNN [37], GACNet [38], and RSCNN [39]. Specifically, PointNet [35] used a simple symmetric function and a multi-layer perceptron (MLP) to handle the unordered points and permutation invariance of a point cloud. However, PointNet neglected point-to-point spatial neighboring relations, which contain fine-grained structural information for object segmentation. DGCNN [37], via EdgeConv, constructed local neighborhood graphs to effectively capture the local domain information and global shape features of a point cloud. To avoid feature pollution between objects, GACNet [38] used a novel graph attention convolution (GAC) with learnable kernel shapes to dynamically adapt to the structures of the objects of concern. To obtain an inductive local representation, RSCNN [39] encoded the geometric relationship between points by applying a weighted sum of neighboring point features, which yielded greater shape awareness and robustness. However, GACNet and RSCNN have a high cost of data structuring, which limits their ability to generalize to complex scenarios. PointNet++ [36], a hierarchical extension of PointNet, is capable of both extracting local features and dealing with unevenly sampled points through multi-scale grouping (MSG), thereby improving the robustness of the model. Chen et al. [40], based on the PointNet++ network, performed LiDAR point cloud classification by considering both the point-level and global features of centroid points, and achieved good classification performance. Due to the simplicity and robustness of PointNet++, in this paper, we select it as our backbone for point-wise multispectral LiDAR point cloud classification. However, the features learned by PointNet++ contain some ineffective channels, which consume heavy computational resources and result in a decrease of classification accuracy.
Therefore, to emphasize important channels and suppress the channels unconducive to prediction, a Squeeze-and-Excitation block (SE-block) [41] is integrated into PointNet++, termed SE-PointNet++. The proposed SE-PointNet++ architecture is applied to the Titan multispectral LiDAR data collected in 2014. We select 13 representative regions and label them manually with six categories by taking into account the ground object distributions in the study areas and the geometrical and spectral properties of the Titan multispectral LiDAR data. The main contributions of the study include the following: (1) A novel end-to-end SE-PointNet++ is proposed for the point-wise multispectral LiDAR point cloud classification task. Specifically, to improve point cloud classification performance, a Squeeze-and-Excitation block (SE-block), which emphasizes important channels and suppresses the channels unconducive to prediction, is embedded into the PointNet++ network. (2) We investigate, through comprehensive comparisons, the feasibility of multispectral LiDAR data and the superiority of the proposed architecture for point cloud classification tasks, as well as the influence of the sampling strategy on point cloud classification accuracy.

Multispectral LiDAR Test Data
As the first commercial system available for scientific research and topographic mapping, the Titan multispectral airborne LiDAR system contains three active laser wavelengths of 1550 nm, 1064 nm, and 532 nm. Capable of capturing discrete and full-waveform data at all three wavelengths, the Titan system has a combined ground sampling rate of up to 1 MHz. Table 1 lists the detailed specifications of the system. The scan angle varied between ±20° across track from the nadir, and the Titan system acquired points at around 1075 m altitude with a 300 kHz Pulse Repetition Frequency (PRF) per wavelength and a 40 Hz scan frequency. All recorded points were stored in LAS files [28]. Therefore, the Titan multispectral LiDAR system provided three independent LiDAR point clouds corresponding to the three wavelengths, contributing to 3-D point cloud classification tasks. The study area is located in a small town (centered at latitude 43°58′00″, longitude 79°15′00″) in Whitchurch-Stouffville (ON, Canada). As shown in Figure 1, we selected thirteen representative regions containing rich object types, such as asphalt roads, forest and individual trees, open soil and grass, one- and two-story gable roof buildings, industrial buildings, and powerlines. There were nineteen flying strips (ten strips vertically intersecting nine strips), covering an area of about 25 km². Note that, in this study, because no metadata (such as system parameters and trajectories) were provided with the nineteen flying strips, absolute intensity calibration was not performed. The nineteen strips were roughly registered by an iterative closest point (ICP) method. Similarly, because control/reference points were unavailable, the geometric alignment quality was not statistically reported. In this study, as seen in Figure 1, the thirteen study areas (red rectangles for model training and blue rectangles for model testing) were selected for assessing our SE-PointNet++ architecture.

Data Preprocessing
To generate the required input to our SE-PointNet++ architecture, a two-step data preprocessing (i.e., data fusion and data annotation) is proposed, as shown in Figure 2. Data fusion aims to merge the three individual point clouds of wavelengths (532 nm, 1064 nm, and 1550 nm) into a single point cloud, in which each point contains its coordinates and three-wavelength intensity values. Data annotation aims to manually label the selected thirteen Titan multispectral point cloud regions into several categories of interest and obtain a training dataset for our proposed architecture.



Data Fusion
The Titan multispectral LiDAR data consist of three independent point clouds, corresponding to the three laser wavelengths of 1550 nm, 1064 nm, and 532 nm. The three laser beams were tilted by +3.5°, 0°, and +7° from the nadir direction. To fully use the reflectance characteristics of the three wavelengths and improve the point density of the point clouds, the three independent point clouds are first merged into a single, high-density multispectral point cloud, each point of which contains its own intensity and two other assigned intensities from the other wavelengths. In this study, we adopt a 3-D spatial join technique [27] to merge the three independent point clouds. Specifically, among the three point clouds, each single-wavelength point cloud is taken as the reference in turn. For each point of the reference point cloud, its neighbors in the other two wavelengths' point clouds are found by a nearest neighbor searching algorithm, and the intensities are then calculated from the neighbors by a bilinear interpolation method. The search radius is determined according to the point density. For example, as seen in Table 1, for each wavelength, the average point density is about 3.6 points/m²; thus, the maximum search distance is set to 1.0 m to prevent grouping points located on different objects. Note that, for a point in the reference point cloud, if no neighbors are found in one of the other two wavelengths, the intensity value of this wavelength is set to zero. Finally, a single multispectral point cloud is obtained, each point of which contains its coordinates ([x, y, z]) and its three-wavelength laser return intensity (LRI) values (LRI_λ1550, LRI_λ1064, LRI_λ532).
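As a minimal sketch of the spatial-join step described above (a hypothetical helper, not the authors' code; for simplicity it borrows the single nearest neighbour's intensity rather than interpolating bilinearly, and `fuse_channel` and its parameter names are our own):

```python
import numpy as np
from scipy.spatial import cKDTree

def fuse_channel(ref_xyz, other_xyz, other_intensity, max_dist=1.0):
    """Assign each reference point an intensity from another wavelength's
    cloud; points with no neighbour within max_dist get intensity 0."""
    tree = cKDTree(other_xyz)
    # query returns dist=inf (and an out-of-range index) when no point
    # lies within distance_upper_bound
    dist, idx = tree.query(ref_xyz, k=1, distance_upper_bound=max_dist)
    fused = np.zeros(len(ref_xyz))
    found = np.isfinite(dist)
    fused[found] = other_intensity[idx[found]]
    return fused
```

In the full pipeline this would run twice per reference wavelength, once against each of the other two channels, yielding the (x, y, z, LRI_λ1550, LRI_λ1064, LRI_λ532) records.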

Data Annotation
In this study, we select thirteen representative test areas from the collected Titan multispectral point cloud dataset, covering six categories (i.e., road, building, grass, tree, soil, and powerline). The total number of points is 8.52 million for the thirteen test areas. The six categories were manually labeled point by point using the CloudCompare software. Table 2 shows the number of points of each category in each test area. Each of the selected areas is saved in a separate file, where each point contains seven attributes: coordinates (x, y, z), three-wavelength intensities (LRI_λ1550, LRI_λ1064, LRI_λ532), and the category label.

SE-PointNet++ Framework
Usually, some of the features learned by deep learning methods, such as PointNet++, might be ineffective for point cloud classification tasks, resulting in high computational costs and a decrease in classification accuracy. Therefore, to emphasize important channels and suppress the channels unconducive to prediction, an improved PointNet++ architecture, termed SE-PointNet++, is proposed by embedding a Squeeze-and-Excitation block into the PointNet++ architecture. Figure 3 illustrates the SE-PointNet++ framework.
As seen in Figure 3, the SE-PointNet++ takes a multispectral LiDAR point cloud (N is the number of points) as the input and outputs an identical-spatial-size point cloud, where each point is labelled with a specific category in an end-to-end manner. The SE-PointNet++ architecture involves an encoder network, a decoder network, and a set of skip link concatenations. The encoder network consists of four set abstraction modules, which recursively extract multi-scale features. The decoder network is operated by four feature propagation modules. The feature propagation modules aim to gradually recover a semantically-strong feature representation to accurately classify the point cloud. The skip link concatenations, to enhance the capability of feature representation, integrate the features selected from the set abstraction modules with the features having the same size in the feature propagation modules. The following subsections first detail the Squeeze-and-Excitation block, followed by a description of the SE-PointNet++.

Squeeze-and-Excitation Block (SE-Block)
The SE-block aims to improve the expressive ability of the network by explicitly modelling the interdependencies among the channels of its convolutional features, without introducing a new spatial dimension for the fusion of the feature channels. Figure 4 illustrates the SE-block structure. Denote F_tr as a traditional convolution structure. The SE-block is a computational unit built on the transformation F_tr, which maps an input X = [x^1, x^2, ..., x^C′] (X ∈ R^(H′×W′×C′), where H′, W′, and C′ are, respectively, the height, width, and channel number of X) to a feature map U ∈ R^(H×W×C) (where H, W, and C are, respectively, the height, width, and channel number of U). The F_tr transformation is defined as follows:

u_c = v_c ∗ X = Σ_{s=1}^{C′} v_c^s ∗ x^s,

where ∗ denotes convolution, v_c = [v_c^1, v_c^2, ..., v_c^C′] refers to the parameters of the c-th filter, x^s is the s-th input channel of X, v_c^s is a 2-D spatial kernel which acts on the corresponding channel of X and represents a single channel of v_c, and u_c ∈ R^(H×W) refers to the c-th 2-D matrix in U. The block then involves two processes: squeeze and excitation.
• Squeeze: Global Information Embedding. The squeeze process F_sq(u_c) is designed to compress the spatial information of the feature map by performing a global average pooling over each channel, so that only channel information is retained. To reduce channel dependencies, the global spatial information is squeezed into a channel descriptor. To this end, a global average pooling is used to generate channel-wise statistics. Formally, a statistic z ∈ R^C is generated by shrinking U through its spatial dimensions, H × W. The c-th element of z is calculated by:

z_c = F_sq(u_c) = (1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} u_c(i, j).

• Excitation: Adaptive Recalibration. The excitation process F_ex(z, W) aims to assign a weight to each element of the 1 × 1 × C channel descriptor generated by the squeeze process through two fully connected layers. In this module, a simple gating mechanism with a sigmoid activation is employed to fully capture channel-wise dependencies, which is similar to the gating mechanism in a recurrent neural network. Specifically, F_ex(z, W) exploits the channel-wise interdependencies in a non-mutually-exclusive manner by appending two fully connected layers after the squeeze process F_sq(u_c). The outputs of the two fully connected layers are activated using the Rectified Linear Unit (ReLU) and the sigmoid functions, respectively. In this way, the output of the second fully connected layer constitutes a channel-wise attention descriptor, denoted as s. The attention descriptor s acts as a weight function to recalibrate the input feature map to highlight the contributions of the informative channels. The attention descriptor s is defined as follows:

s = F_ex(z, W) = σ(W_2 δ(W_1 z)),

where δ refers to the ReLU function, W_1 ∈ R^((C/r)×C) and W_2 ∈ R^(C×(C/r)), and r is a reduction ratio in the dimensionality-reduction layer. σ refers to the sigmoid function, which limits the importance of each channel to the range of [0, 1]; the resulting weights are multiplied with U in a channel-wise manner to form the input of the next level.
The final output of the block is obtained by rescaling U with the activations s:

x̃_c = F_scale(u_c, s_c) = s_c · u_c,

where X̃ = [x̃_1, x̃_2, ..., x̃_C] and F_scale(u_c, s_c) refers to the channel-wise multiplication between the scalar s_c and the feature map u_c ∈ R^(H×W).
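The squeeze, excitation, and rescaling steps above can be sketched as a compact PyTorch module (a generic 2-D formulation of the SE-block, not the exact layer sizes used inside SE-PointNet++; `SEBlock` and its parameter names are our own):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, r=16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // r)  # W1: C -> C/r
        self.fc2 = nn.Linear(channels // r, channels)  # W2: C/r -> C

    def forward(self, u):                  # u: (B, C, H, W)
        z = u.mean(dim=(2, 3))             # squeeze: global average pooling
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(z))))  # excitation
        return u * s[:, :, None, None]     # rescale channels by weights s
```

Because s lies in (0, 1), the block only attenuates channels; informative channels keep weights close to 1 while unhelpful ones are suppressed.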

Training Sample Generation
In the collected airborne Titan multispectral LiDAR point clouds, different categories demonstrate different spatial attributes, such as point density, due to the top-down data acquisition. That is, the features learned in the dense sampling areas may not extend to the sparse sampling areas, and a model trained with sparse points may not be able to identify fine-grained local structures. Therefore, to address the density inconsistency and sparsity issues of point clouds, a density-consistent method is proposed for processing scenarios with different point densities. The input of the proposed method is defined as a point set S = {p_1, ..., p_n | p_i = (x, y, z, f)}, where x, y, z are the coordinates of the point, and f is the feature vector, such as color and surface normal. We first normalize S to the range of [−1, 1] and output a new point set Ŝ. Considering the limited GPU capacity, it is impossible to feed an entire training point set directly into the network. We therefore grid the normalized training set Ŝ, according to the processed test areas and multispectral LiDAR points, into a set of non-overlapping blocks using a block size of 0.12 × 0.12 m², each of which contains a different number of points. To obtain a fixed number of samples for each point block, a farthest point sampling (FPS) algorithm is used to down-sample it to a given sample size, N. Note that, for each point block, the higher the number of training samples, the more information is learned by the proposed architecture. However, computational performance should be taken into account when the number of samples is defined. Due to point density inconsistency, some point blocks might contain fewer points than N; in that case, data interpolation is required to obtain the defined number of sampling points.
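The farthest point sampling step can be sketched in plain NumPy (an illustrative implementation; the starting index is an arbitrary choice and `farthest_point_sampling` is our own name):

```python
import numpy as np

def farthest_point_sampling(points, n_samples):
    """points: (N, 3) array; returns indices of n_samples points that are
    iteratively chosen to be farthest from the already-selected set."""
    n = len(points)
    selected = np.zeros(n_samples, dtype=int)
    dist = np.full(n, np.inf)      # distance of each point to selected set
    selected[0] = 0                # start from an arbitrary point
    for i in range(1, n_samples):
        # squared distance of every point to the most recent selection
        d = np.sum((points - points[selected[i - 1]]) ** 2, axis=1)
        dist = np.minimum(dist, d)
        selected[i] = int(np.argmax(dist))  # pick the farthest remaining point
    return selected
```

Compared with uniform random sampling, FPS spreads the retained points evenly over the block, which is why it is preferred for fixing the per-block sample count.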

SE-PointNet++
After the generation of the training samples, the N multispectral LiDAR points with six attributes (coordinates (x, y, z) and three-wavelength intensities (LRI_λ1550, LRI_λ1064, LRI_λ532)) are directly input into our SE-PointNet++ architecture, which involves an encoder network, a decoder network, and a set of skip link concatenations. The encoder network consists of four set abstraction modules to recursively extract multi-scale features at the scales of {1/4, 1/16, 1/64, 1/256} with regard to the input point cloud with N points. Figure 5 shows the first set abstraction module. Specifically, a set abstraction module takes an N × 6 matrix as the input and outputs an N/4 × 64 matrix of N/4 subsampled points with 64-dimensional feature vectors summarizing the local contextual information. As seen in Figure 5, the set abstraction module in the encoder network consists of a Sampling layer, a Grouping layer, and a Channel Feature Attention layer. Firstly, the sampling layer defines the N/4 centroids of local regions by selecting a set of points through an iterative FPS algorithm. Given the input points {x_1, x_2, ..., x_n}, FPS selects a subset of points {x_{i1}, x_{i2}, ..., x_{im}}, such that each x_{ij} is the most distant point from the already-selected set. The sampling layer thus retains 1/4 of the input points each time. Afterwards, a grouping layer is used to construct the corresponding local regions by searching for the neighboring points around the N/4 centroids with a ball query algorithm. For each centroid, via the ball query algorithm, all neighboring points are found within a given radius, from which K points are randomly selected to construct a local region (K is set to 32 in this study). After the implementation of the sampling and grouping layers, the multispectral LiDAR points are sampled into N/4 point sets, each of which contains 32 points with their 6 attributes.
The output involves a group of point sets with the size of N/4 × 32 × 6. Subsequently, we encode these local regions into feature vectors via our Channel Feature Attention layer. For each point, we extract its features by multi-layer perceptrons (MLPs), and emphasize its important channels and suppress its unimportant channels by the SE block. Specifically, in the SE block, each channel of the N/4 points is squeezed via a max-pooling, and then its weight value is calculated and normalized to the range of [0, 1] by the two MLP layers and the sigmoid function. The higher the weight value, the more important the channel. Finally, the important channels with higher weight values are excited. To avoid missing features when a weight is close to zero, a shortcut connection is used to connect the features before and after the channel feature attention layer. Because the learned features have different dimensions before and after the channel feature attention layer, a convolution operation is performed to match their dimensions in the shortcut connection.
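The ball query grouping described above can be sketched as follows (a hypothetical NumPy helper; padding by repeating indices when fewer than K neighbours fall inside the radius is a common convention, and details may differ from the authors' implementation):

```python
import numpy as np

def ball_query(points, centroids, radius=0.5, k=32):
    """Return an (len(centroids), k) index array of grouped neighbours."""
    groups = np.zeros((len(centroids), k), dtype=int)
    for i, c in enumerate(centroids):
        d = np.linalg.norm(points - c, axis=1)
        idx = np.flatnonzero(d <= radius)      # all points within the ball
        if len(idx) == 0:
            idx = np.array([int(np.argmin(d))])  # fall back to nearest point
        groups[i] = np.resize(idx, k)          # pad by repetition up to k
    return groups
```

Unlike k-nearest-neighbour grouping, the fixed radius keeps the local regions at a consistent metric scale, which matters for unevenly dense LiDAR returns.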
The decoder network includes four feature propagation modules, which gradually recover a semantically-strong feature representation to produce a high-quality classified point cloud. Figure 6 shows an example of the feature propagation module. As illustrated in Figure 6, the output data size of the encoder is N/256 × 512 (where 512 is the dimension of the features), which contains rich channel feature information. To propagate the learned features from the sampled points to the original points, interpolation is first employed through inverse distance weighting within the feature propagation module.
To enhance the capability of the feature representation, the interpolated features on the N/64 points are then concatenated with the skip-linked point features from the set abstraction modules via the skip link concatenations. Then, to capture features from the coarse-level information, the concatenated features are passed through a "unit pointnet", similar to a 1 × 1 convolution in CNNs. A few shared fully connected and ReLU layers update the feature vector of each point. The process is repeated until the features have been propagated to the original point set.
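The inverse-distance-weighted interpolation used in the feature propagation modules can be sketched as follows (an illustrative NumPy version; using the three nearest neighbours follows PointNet++'s convention, and `idw_interpolate` is our own name):

```python
import numpy as np

def idw_interpolate(known_xyz, known_feat, query_xyz, k=3, eps=1e-8):
    """Propagate features from known_xyz (M, 3) with features (M, D)
    to the denser query_xyz (N, 3) via inverse-distance weights."""
    out = np.zeros((len(query_xyz), known_feat.shape[1]))
    for i, q in enumerate(query_xyz):
        d = np.linalg.norm(known_xyz - q, axis=1)
        nn = np.argsort(d)[:k]            # k nearest known points
        w = 1.0 / (d[nn] + eps)           # inverse-distance weights
        w /= w.sum()                      # normalize to sum to 1
        out[i] = w @ known_feat[nn]
    return out
```

A query point that coincides with a known point thus recovers (up to eps) that point's feature, while points in between receive a smooth blend of their neighbours' features.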

Results
To assess the performance of SE-PointNet++ on multispectral point cloud classification, we conducted several groups of experiments on the Titan multispectral LiDAR data.

Experimental Setting
We implemented all tests in the PyTorch 1.5.0 framework and trained them on a GTX 1080Ti GPU. Each point of the Titan multispectral LiDAR data contained its coordinates ([x, y, z]), three-wavelength intensity values (LRI_λ1550, LRI_λ1064, LRI_λ532), and the category label. The number of sampling points, N, was set to 4096. The number of neighbors, K, was set to 32. The learning rate, batch size, decay rate, optimizer, and maximum epoch were set to 0.001, 8, 10⁻⁴, Adam, and 200, respectively. Among the selected thirteen test areas, the first ten (area_1 to area_10, about 70% of the data) were used as the training set, and the remaining three (area_11 to area_13, about 30% of the data) as the test set.
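Under the reported hyper-parameters, the optimizer setup might look like the following sketch (`model` is a stand-in placeholder, not the actual SE-PointNet++ network, and reading the reported "decay rate" as Adam's weight decay is our assumption):

```python
import torch

# stand-in placeholder for the SE-PointNet++ network (6 input attributes,
# 6 output categories); the real model is far larger
model = torch.nn.Linear(6, 6)

# reported settings: lr 0.001, decay 1e-4 (assumed to be weight decay),
# Adam optimizer, batch size 8, 200 epochs, 4096 points per sample
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)
criterion = torch.nn.CrossEntropyLoss()  # point-wise classification loss
batch_size, num_points, max_epoch = 8, 4096, 200
```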

Evaluation Metrics
The following evaluation metrics were used to quantitatively compare and analyze the multispectral LiDAR point cloud classification results: overall accuracy (OA) [42], mean intersection over union (mIoU) [43], Kappa coefficient (Kappa) [44], and F1-score [45]. In terms of the number of true positives (tp), true negatives (tn), false positives (fp), and false negatives (fn), the metrics are defined as follows:

OA = (tp + tn) / (tp + tn + fp + fn)

IoU = tp / (tp + fp + fn), with mIoU being the mean of the per-category IoU values

F1-score = 2tp / (2tp + fp + fn)

Kappa = (p_o − p_e) / (1 − p_e)

where p_o is the observed agreement (equal to the OA) and p_e is the expected agreement by chance.
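Assuming a standard confusion matrix, these metrics can be computed as in the following sketch (our own illustration using the usual per-class definitions; the paper's exact averaging choices, e.g., macro versus weighted F1-score, may differ):

```python
import numpy as np

def classification_metrics(cm):
    """Compute OA, mIoU, macro F1-score, and Kappa from a confusion matrix
    (rows = ground truth, columns = prediction)."""
    cm = np.asarray(cm, dtype=float)
    total = cm.sum()
    tp = np.diag(cm)             # per-class true positives
    fp = cm.sum(axis=0) - tp     # predicted as the class but wrong
    fn = cm.sum(axis=1) - tp     # points of the class that were missed

    oa = tp.sum() / total                        # overall accuracy
    miou = (tp / (tp + fp + fn)).mean()          # mean per-class IoU
    f1 = (2 * tp / (2 * tp + fp + fn)).mean()    # macro-averaged F1-score
    # Kappa: observed agreement vs. agreement expected by chance
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total**2
    kappa = (oa - pe) / (1 - pe)
    return oa, miou, f1, kappa
```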

Overall Performance
To evaluate the point cloud classification performance of the SE-PointNet++, we applied it to the Titan multispectral LiDAR data. Figure 7 shows the point cloud classification results of our SE-PointNet++ architecture. Visual inspection demonstrates that most categories (see Figure 7a–c) were correctly classified, compared with the ground truth (see Figure 7d–f). Specifically, the road, grass, tree, and building points in the three areas were all clearly classified. However, some soil points were misclassified as grass points. The reason behind this phenomenon might be the similar topological features and geographical distributions of these two categories. In addition, some powerline points were misclassified as tree points, which may also be caused by the mixing and the lack of obvious boundaries between the two categories.

To statistically assess the point cloud classification performance of our network, the classification confusion matrix was calculated and listed in Table 3. As seen in this table, our proposed architecture obtained an OA higher than 70% for five categories, i.e., road, building, grass, tree, and powerline. In particular, the grass and tree categories achieved higher classification accuracies, with OAs of 94.70% and 97.09%, respectively. One reason is that, in the Titan multispectral LiDAR system, the three channels, i.e., 532 nm, 1064 nm, and 1550 nm, provide rich vegetation information. In feature engineering-based methods, some vegetation indices can be derived from these three channels, improving the classification accuracies of grass and tree. Our architecture achieved a modest classification performance on soil, with an OA of 52.45%.
Most soil points were misclassified as grass ones. The reason behind this phenomenon might be the similarities of the spatial distributions and topological characteristics between soil and grass. Comparatively, although the study area contains few powerline points, our presented SE-PointNet++ correctly recognized them due to the distinctive geometrical and distribution characteristics of powerline. Owing to the use of LiDAR elevation information, objects at different elevations can be easily differentiated; for example, only 697 road points (accounting for 0.3% of all points) were misclassified as high-rise categories, such as building, tree, and powerline. Specifically, 928 building points (0.85%) were misclassified as road and grass points, and 404 soil points (1%) were misclassified as building and tree points.

Comparative Experiments
To fairly demonstrate the robustness and effectiveness of our architecture, we compared it with other representative point-based deep learning models, including PointNet [35], PointNet++ [36], DGCNN [37], GACNet [38], and RSCNN [39]. Although these models have achieved excellent performance on point cloud classification and semantic segmentation, they have not been tested on large-scale multispectral LiDAR data containing complex urban geographical features. A quantitative comparison between the SE-PointNet++ and the other five models is listed in Table 4. As shown in Table 4, in terms of point cloud classification accuracy, PointNet achieved the worst result, with an OA of 83.79%. The reason behind this is that PointNet neglects the point-to-point spatial neighboring relations, which contain fine-grained structural information for segmentation. GACNet behaved modestly, with an OA of 89.91%. Owing to the integration of an attention scheme with graph attention convolution, GACNet is capable of dynamically adapting its kernel shapes to the structures of the objects, thereby improving the quality of feature representation for accurate point cloud classification. PointNet++, DGCNN, RSCNN, and our SE-PointNet++ outperformed the aforementioned two methods, with OAs of over 90.0%. For DGCNN, benefiting from the local neighborhood graphs constructed by EdgeConv, similar local shapes can be easily captured, which contributed to the improvement of feature representation. RSCNN, via relation-shape convolutions, modeled the geometric relationships of points to obtain discriminative shape awareness, thereby achieving good point cloud classification performance. However, these methods came with a high cost of data structuring, which limits their ability to generalize to complex scenarios.
PointNet++ used a hierarchical structure to extract the local information of points, thereby achieving a better point cloud classification result. As seen from the other evaluation metrics, the SE-PointNet++ achieved the best scores of mIoU (60.15%), F1-score (73.14%), and Kappa coefficient (0.86). The introduction of the SE-blocks, which fuse local and global features of points, contributed to fine-grained information acquisition; as a result, the SE-PointNet++ increased the mIoU to 60.15%. Note that the six categories in the thirteen test areas were unevenly distributed. As seen in Table 2, the number of points varied greatly from one category to another, so the categories with sufficient points achieved good point cloud classification performance, while the categories with few points performed poorly. Figure 8 shows the comparative point cloud classification results of the area_11 test area, and Figure 9 shows a close-up view of these results. Three oval-shaped regions (i.e., Regions A, B, and C) are highlighted for further comparison. Visual inspection of Region A indicates that the DGCNN, RSCNN, and GACNet models misclassified most bare soil points as grass ones. This is due to the similar topological features and geographical distributions of the two categories; the three models rely heavily on the geometric relationships between points. Specifically, DGCNN used EdgeConv as a feature extraction layer to leverage neighborhood structures in both point and feature spaces, while RSCNN and GACNet used relation-shape convolution and graph attention convolution, respectively, to encode the geometric relationships of points. Similarly, as seen in Region C, in the DGCNN, RSCNN, and GACNet classification results, some bare soil points were misclassified as road and grass points.
Although the PointNet++ achieved satisfactory classification accuracy for the soil category (see Table 4 and Region B in Figure 9), some road points were misclassified as bare soil ones. In contrast, as seen from Region B, our SE-PointNet++ model accurately predicted the road points. This is because the SE-block embedded in the feature extraction process enhances important channels and suppresses useless ones, so that our architecture achieved better classification accuracies for the soil, road, and grass categories.
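The channel recalibration performed by the SE-block can be sketched as follows (a minimal NumPy illustration of the squeeze-and-excitation idea applied to a point-feature matrix; in the actual network it operates on learned feature maps inside the set abstraction layers, and the bottleneck weights w1 and w2 are trained rather than supplied):

```python
import numpy as np

def se_block(x, w1, w2):
    """Squeeze-and-Excitation channel recalibration for point features.
    x:  (n_points, channels) feature matrix
    w1: (channels, channels // r) squeeze FC weights (reduction ratio r)
    w2: (channels // r, channels) excitation FC weights"""
    squeeze = x.mean(axis=0)                       # global average pool over points
    hidden = np.maximum(squeeze @ w1, 0.0)         # FC + ReLU bottleneck
    scale = 1.0 / (1.0 + np.exp(-(hidden @ w2)))   # FC + sigmoid -> channel weights
    return x * scale                               # reweight each channel
```

Channels that the excitation branch deems important receive weights close to 1, while uninformative channels are suppressed toward 0 before the features continue through the network.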

Parameter Analysis
In the proposed SE-PointNet++, there are two parameters, input data and the number of sampling points, N. We designed the following experiments to investigate (1) the superiority of the multispectral LiDAR data, and (2) the feasibility of SE-PointNet++ to the selection of the parameter N.

Input Data
To assess the superiority of the fused multispectral LiDAR data for point cloud classification, we input different types of data into the proposed SE-PointNet++ architecture. Five experiments were conducted to investigate the input data types: (1) only geometrical data without any intensities (i.e., Case 1-1), (2) 1550 nm LiDAR data (i.e., Case 1-2), (3) 1064 nm LiDAR data (i.e., Case 1-3), (4) 532 nm LiDAR data (i.e., Case 1-4), and (5) multispectral LiDAR data (i.e., Case 1-5). In this group of experiments, N was set to 4096. Table 5 shows the average experimental results for the selected 13 test areas. As seen from Table 5, the point cloud classification accuracies of the 532 nm LiDAR data are very close to those of the 1550 nm LiDAR data. Compared with the other two single-wavelength datasets, the 1064 nm LiDAR data achieved better point cloud classification accuracies, with an OA of 90.47%, an mIoU of 58.09%, an F1-score of 68.40%, and a Kappa coefficient of 0.8503. The multispectral LiDAR data achieved the best point cloud classification performance, with improvements of 1% to 9% in OA, 1% to 9% in mIoU, 5% to 14% in F1-score, and 0.01 to 0.14 in Kappa coefficient.

Number of Sampling Points
To investigate the effect of the number of sampling points on point cloud classification results using the Titan multispectral LiDAR data, we varied N over 4096, 2048, 1024, 512, 128, and 32. Table 6 shows the experimental results for the six categories. As shown in Table 6, the point cloud classification accuracies dramatically decreased as the number of sampling points decreased from 4096 to 32. Our architecture achieved its best performance when N was set to 4096. The main reason is that the fewer the sampling points, the fewer the features represented for each category. In principle, the larger the number of points, the better the point cloud classification accuracy. However, due to the computation capacity of the computer being used, the number of sampling points could not be set excessively large in this study. As such, the best point cloud classification accuracies were obtained with N = 4096.

Analysis of Imbalanced Data Distribution
As seen in Table 4, our SE-PointNet++ achieved relatively better performance than most of the comparative methods; however, DGCNN obtained comparable performance and even outperformed the SE-PointNet++ in terms of overall accuracy. We believe that the main reason for this is the distribution imbalance of the categories in the study area. As shown in Table 2, the six categories are not balanced in distribution; for example, the dataset contains 3.48 million tree points but only twenty thousand powerline points. This imbalanced point distribution degraded the overall point cloud classification accuracies. To fairly evaluate the performance, we balanced the points of all categories to a certain extent by an oversampling and undersampling method. The experimental results are shown in Table 7. Comparatively, our SE-PointNet++ showed a significant improvement over the other methods, with an OA of 85.07%, an mIoU of 58.17%, an F1-score of 71.45%, and a Kappa coefficient of 0.80. Through the above experiments and discussion, we confirmed that the SE-PointNet++, that is, the integration of the SE-block with PointNet++, performed positively and effectively in improving the quality of the multispectral LiDAR point cloud classification results.
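The over/undersampling step can be illustrated with a simple per-class resampling sketch (our own simplified version, assuming resampling to a common target count per category; the paper's exact balancing scheme is not specified in detail):

```python
import numpy as np

def balance_classes(labels, target=None, seed=0):
    """Return indices that balance a label array: majority classes are
    undersampled and minority classes oversampled (with replacement)
    to a common target count per class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(labels, return_counts=True)
    if target is None:
        target = int(counts.mean())   # e.g., the average class size
    keep = []
    for c in classes:
        idx = np.flatnonzero(labels == c)
        # oversample small classes (replace=True), undersample large ones
        keep.append(rng.choice(idx, size=target, replace=len(idx) < target))
    return np.concatenate(keep)
```

Applied to a point cloud such as the Titan data, the returned indices would select an equal number of points per category (e.g., tree versus powerline) before re-training.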

Conclusions
The multispectral LiDAR point cloud data contain both geometrical and multi-wavelength information, which contributes to identifying different land-cover categories. In this study, we proposed an improved PointNet++ architecture, named SE-PointNet++, by integrating an SE attention mechanism into the PointNet++ for the multispectral LiDAR point cloud classification task. First, data preprocessing was performed for data merging and annotation, and a set of samples was obtained by the farthest point sampling (FPS) method. By embedding the SE-block into the PointNet++, the SE-PointNet++ is capable of both extracting local geometrical relationships among points from unevenly sampled data and strengthening important feature channels, which improves multispectral LiDAR point cloud classification accuracies.
We tested the SE-PointNet++ on the Titan airborne multispectral LiDAR data. The dataset was classified into six land-cover categories: road, building, grass, tree, soil, and powerline. Quantitative evaluations showed that our SE-PointNet++ achieved an OA, mIoU, F1-score, and Kappa coefficient of 91.16%, 60.15%, 73.14%, and 0.86, respectively. In addition, comparative studies with five established methods confirmed that the SE-PointNet++ is feasible and effective in 3-D multispectral LiDAR point cloud classification tasks.
Author Contributions: Z.J. designed the workflow and was responsible for the main structure and writing of the paper; H.G., D.L., Y.Y., and Y.Z. conceived and designed the experiments and discussed the results described in the paper; Z.J. and P.Z. performed the experiments; H.W. analyzed the data; H.G., Y.Y., and J.L. provided comments and edits to the paper. All authors have read and agreed to the published version of the manuscript.

Data Availability Statement:
No new data were created or analyzed in this study. Data sharing is not applicable to this article.