1. Introduction
Airborne laser scanning (ALS), also known as airborne light detection and ranging (LiDAR), is an important active remote sensing technique that has developed rapidly in recent years [1]. The technique can quickly acquire large-scale, high-density, and high-precision 3D ground data, and plays an increasingly important role in many applications, including topographic mapping [2], urban planning [3], forest biomass estimation [4], environmental monitoring [5], and power line detection [6]. By employing ALS for ground-based scanning, a massive and disordered point cloud can be obtained. Classifying LiDAR point clouds is a basic and key step in the application of LiDAR data, yet classifying ALS point clouds with high precision and high efficiency remains challenging due to the irregular distribution of point clouds and the complexity of scenes [7,8].
Point cloud classification usually involves assigning a category label to each point, a task often referred to as point cloud semantic segmentation in computer vision. In early studies, researchers classified point clouds by employing hand-engineered features and traditional classifiers [9,10,11,12] or preprocessed the point clouds before classification [13,14]. These traditional machine learning methods fail to learn high-level features, whereas methods based on deep learning can further improve classification accuracy precisely because they learn such features. Many works [15,16] transform 3D point clouds into 2D images or 3D grids and then apply deep learning techniques for classification; however, the transformation leads to information loss and a high computation cost. To avoid these problems, some studies directly process raw points with deep learning techniques, such as PointNet++ [17], SPG [18], and RandLA-Net [19]. The latter two networks have achieved good results on the considerably challenging large-scale point cloud classification task. Many works capture more local features of point cloud data by introducing a graph neural network [20] or a graph attention mechanism [21]. For example, GACNN [22] outperformed several prevalent networks on ALS point cloud datasets.
Because the encoder–decoder architecture can capture hierarchical semantic information, it has been successful in many 2D image semantic segmentation tasks, and many prevalent networks [17,21] also employ it for 3D point cloud semantic segmentation. The architecture needs appropriate methods for downsampling and feature extraction. Farthest point sampling (FPS) is widely used for downsampling, but its high time complexity of O(N²) makes it unsuitable for point cloud datasets with a large number of points [19]. Graph neural networks for point cloud classification can efficiently capture the local structure of point clouds, but the receptive field of many graph neural networks is usually not large enough to capture comprehensive contextual information.
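To illustrate why FPS scales poorly, the following minimal NumPy sketch (function and variable names are our own, not from the paper) shows that each of the M sampling iterations scans all N points, giving O(MN) cost, i.e., quadratic when M grows with N:

```python
import numpy as np

def farthest_point_sampling(points, m):
    """Select m points, each maximizing its distance to the chosen set.

    points: (N, 3) array. Each of the m iterations scans all N points,
    so the total cost is O(m * N) -- quadratic when m is proportional to N.
    """
    n = points.shape[0]
    chosen = np.zeros(m, dtype=int)
    dist = np.full(n, np.inf)          # distance to the nearest chosen point
    chosen[0] = 0                      # start from an arbitrary point
    for i in range(1, m):
        d = np.linalg.norm(points - points[chosen[i - 1]], axis=1)
        dist = np.minimum(dist, d)     # O(N) distance update per iteration
        chosen[i] = int(np.argmax(dist))
    return points[chosen]
```

This is the cost that motivates the cheaper voxel grid downsampling adopted in our network.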
Based on the above analysis, we present GAFFNet, a graph neural network with an encoder–decoder architecture. We abandon expensive downsampling methods such as FPS and adopt voxel grid downsampling, which balances efficiency and performance and helps the feature extraction module obtain features at different levels. On the basis of simplified preprocessing steps, we design a new feature extraction module. For ALS point cloud classification, our network achieves good results with high efficiency. Our main contributions are as follows:
(1) We present GAFFM, a new feature extraction module based on the graph attention mechanism. The module increases the receptive field for each point and fuses the features of different scales. Therefore, the module can effectively capture wider contextual features at different levels;
(2) We propose GAFFNet, and the experimental results verify its effectiveness for ALS point cloud classification. We achieve more satisfactory results on the ISPRS semantic 3D dataset when compared with the results of other methods;
(3) GAFFNet has strong adaptability: it reduces the preprocessing steps and improves efficiency. Because our network is insensitive to the number of points, the training set can be divided into point cloud blocks with different numbers of points and fed directly to the network, and the trained network can be applied directly to test sets with different numbers of points without additional preprocessing.
The remainder of this paper is organized as follows: Section 2 gives a brief summary of the research related to our work; Section 3 presents GAFFNet in detail; Section 4 reports our experiments and evaluates the performance of GAFFNet against other methods; finally, Section 5 provides the conclusion.
2. Related Work
Early studies on ALS point cloud classification mainly relied on hand-engineered features, using unsupervised, supervised, or combined methods. Unsupervised methods set certain rules and divide ground objects into a few categories [23,24]; they depend heavily on thresholds and therefore adapt poorly. Supervised methods provide hand-engineered features to traditional machine learning algorithms [9,10,11,12] for classification; lacking the ability to learn high-level features, they are difficult to improve further in classification accuracy. Combined methods usually preprocess the point clouds in unsupervised ways [13,14] and then apply supervised traditional machine learning algorithms to classify them; these improve classification accuracy to a certain extent, but their processing flow is complex.
Deep learning has gradually become one of the most important technologies in pattern recognition, computer vision, and other fields in recent years [25], and point cloud classification methods based on deep learning have gradually emerged. According to the input data format of the neural network, these methods can be divided into three types: multiview-based, voxel-based, and point-based. The first two types [15,16,26] transform 3D point clouds into 2D images or 3D grids and then process them with 2D or 3D CNNs; however, information loss is inevitable, and the voxel-based approach is unsuitable for large-scale point clouds because of its high computation cost. Point-based methods directly process the raw points. As a pioneering work, PointNet [27] uses MLPs and max pooling to extract global features of point clouds, but it struggles to fully capture contextual information. PointNet++ [17] improves on PointNet with a hierarchical neural network and achieved good results. Inspired by PointNet/PointNet++, researchers have proposed further networks [28,29].
In recent years, researchers have used graph structures to represent point clouds and have attempted to classify them with graph neural networks [20,30]. Wang et al. [31] proposed DGCNN, which uses a graph structure to capture local geometric information while ensuring permutation invariance. More recently, the attention mechanism has received growing attention, and various studies have introduced it to learn more adaptive local relationships within point clouds. Inspired by GAT [32], GACNet [21] captures the relationship information between points through the graph attention mechanism, thereby achieving better classification results. GACNN [22] performs better than other prevalent networks (e.g., PointNet++ and GACNet) on ALS point cloud datasets through its graph attention convolution module, which can learn local structural features and global contextual information. However, in many graph neural networks the receptive field is not large enough to capture comprehensive contextual information.
Figure 1 intuitively illustrates the different ALS point cloud classification methods. The significant difference between traditional machine learning methods and deep learning methods is that the former generally rely on low-level, hand-engineered features, whereas the latter can learn high-level features to improve classification accuracy. In addition, PointNet and its improved variants, as well as the methods based on graph neural networks, belong to the point-based methods because they directly process the raw points.
4. Experiments
4.1. Data Description
We evaluated the performance of our network on the ALS point cloud dataset provided by the International Society for Photogrammetry and Remote Sensing (ISPRS). The dataset was acquired in August 2008 by a Leica ALS50 system [35]. ISPRS provides this dataset from Vaihingen (Germany) as a benchmark for 3D semantic labeling. The ISPRS 3D dataset consists of 1,165,598 points in nine categories [12], namely, powerline (power), low vegetation (low_veg), impervious surfaces (imp_surf), car, fence/hedge, facade, roof, shrub, and tree. As shown in Figure 7, the dataset covers three areas corresponding to three scenes. Scene 1 is used as the training set with 753,876 points, whereas Scenes 2 and 3 are used as the test set, which has a total of 411,722 points.
Table 1 and Table 2 show the number and proportion of 3D points per category in the training and test sets; the distribution of categories is extremely unbalanced, with powerline points in particular accounting for only approximately 0.1%.
4.2. Implementation Details
Limited by GPU memory, it is almost impossible to feed the entire training set into the network directly. Therefore, we first divided the point cloud of the training set into small point cloud blocks of the same size and then fed them into the network. Because our network is insensitive to the number of points, point cloud blocks with different numbers of points can be fed directly, and the entire test set can also be fed directly into the trained network. Specifically, each point cloud block partitioned from the training set covers an area of 30 × 30 m, and each block overlaps its adjacent blocks: the distance between two adjacent blocks is 10 m in the x-axis or y-axis direction. Blocks containing fewer points than a fixed number are not fed to the network. Furthermore, 10% of the training blocks are selected as the validation set.
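The block partitioning described above can be sketched as follows (a simplified NumPy version with hypothetical names; only the 30 m block size and 10 m stride come from the paper, the minimum point count is an illustrative placeholder):

```python
import numpy as np

def split_into_blocks(points, block=30.0, stride=10.0, min_points=100):
    """Split an (N, >=3) point array into overlapping x-y blocks.

    Blocks cover block x block metres and are placed every `stride`
    metres, so adjacent blocks overlap; blocks with fewer than
    `min_points` points are discarded (not fed to the network).
    """
    xy_min = points[:, :2].min(axis=0)
    xy_max = points[:, :2].max(axis=0)
    blocks = []
    x0 = xy_min[0]
    while x0 < xy_max[0]:
        y0 = xy_min[1]
        while y0 < xy_max[1]:
            mask = ((points[:, 0] >= x0) & (points[:, 0] < x0 + block) &
                    (points[:, 1] >= y0) & (points[:, 1] < y0 + block))
            if mask.sum() >= min_points:
                blocks.append(points[mask])
            y0 += stride
        x0 += stride
    return blocks
```

Because the stride (10 m) is smaller than the block size (30 m), every point falls into several overlapping blocks, which augments the training data.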
In addition to the features provided by the ISPRS 3D dataset, the input features for the network include the height above the DTM, which is important for point cloud classification [12]. We employed the method proposed in [36] to obtain the height above the DTM for each point, with the DTM grid size set to 0.8 m and the rigidness set to 2. The input features for our network thus included the x-y-z coordinates, intensity, the product of the number of returns and the return number, and the height above the DTM.
Our network was implemented in PyTorch, and we employed the Adam optimizer. The initial learning rate was set to 0.002 and decreased by 20% every ten epochs. All processing steps of our method (such as downsampling and the KNN algorithm) were implemented in Python. The voxel grid sizes for downsampling were 0.6, 1.2, 2.4, and 4.8 m, and the K parameter of the KNN algorithm was set to 10. The hardware platform was equipped with an Intel i9-9900K CPU, 32 GB of RAM, and a single NVIDIA RTX 2080Ti GPU.
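The voxel grid downsampling and KNN steps can be sketched in a few lines of NumPy (our own naming, not the paper's code; the 0.6 m grid and K = 10 from above are used as defaults). Each occupied voxel is reduced to the centroid of its points:

```python
import numpy as np

def voxel_grid_downsample(points, voxel=0.6):
    """Replace the points in each occupied voxel by their centroid."""
    idx = np.floor(points / voxel).astype(int)   # voxel index per point
    _, inv, counts = np.unique(idx, axis=0, return_inverse=True,
                               return_counts=True)
    inv = inv.ravel()                            # 1-D per-point voxel id
    centroids = np.zeros((counts.size, points.shape[1]))
    np.add.at(centroids, inv, points)            # sum the points per voxel
    return centroids / counts[:, None]           # centroid per voxel

def knn_indices(query, support, k=10):
    """Brute-force k nearest neighbours of each query point in `support`."""
    d = np.linalg.norm(query[:, None, :] - support[None, :, :], axis=-1)
    return np.argsort(d, axis=1)[:, :k]
```

Unlike FPS, the voxel pass touches each point a constant number of times, which is what makes it attractive for large ALS scenes.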
4.3. Experiment Results
Figure 8 and Figure 9 show the classification results for Scenes 2 and 3 in the test set, respectively; our method correctly labelled most of the points in the test set.
In addition, we quantitatively evaluated the classification results with the standard evaluation metrics used by the ISPRS 3D labeling contest. The results, including the precision, recall, and F1 score for each category, are shown in Table 3.
From Table 3, we can see that our method correctly classifies most of the test points. We achieved satisfactory classification results (F1 score higher than 75%) for five categories, namely, low vegetation, impervious surfaces, roof, car, and tree, most of which are large objects. For example, the precision, recall, and F1 score for the roof category are 93.9%, 94.2%, and 94.1%, respectively. Our method performs poorly for the shrub and fence/hedge categories, which may be due to the similarity of their low-level features, including height and geometric distribution. In addition, some shrub points were wrongly labelled as trees or low vegetation, which may be due to the lack of clear boundaries between shrubs and those two categories.
4.4. Comparison with Other Methods
Due to the category imbalance in the ISPRS 3D dataset, it is not entirely reasonable to evaluate network performance using only the overall accuracy (OA). We therefore also report the macro average F1 score (abbreviated as macro avg F1), which is the unweighted mean of the F1 scores over all categories. This metric assigns equal weight to each category and is insensitive to category imbalance. Combining OA and the macro avg F1 score thus gives a more reasonable evaluation of classification performance on imbalanced datasets.
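For clarity, the two building blocks of this metric can be written out in plain Python (a sketch with our own function names; the roof-category values from Table 3 are reused below only to illustrate the arithmetic):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def macro_avg_f1(per_class_f1):
    """Unweighted mean of per-category F1 scores: every category
    counts equally, regardless of how many points it contains."""
    return sum(per_class_f1) / len(per_class_f1)
```

For example, the roof category's precision of 0.939 and recall of 0.942 give an F1 of about 0.941; averaging such per-category F1 scores without point-count weighting yields the macro avg F1, so a rare class like powerline influences it as strongly as roof does.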
We compared our method with six other existing methods published by the ISPRS organizers using the above evaluation metrics (see Table 4). The six methods, namely, UM [37], LUH [38], BIJ_W [8], RIT_1 [39], NANJ2 [40], and WhuY4 [41], achieved the top six performances in terms of OA on the ISPRS 3D dataset. Among them, UM and LUH rely on hand-engineered features to classify point clouds, whereas the other four employ deep learning techniques. Relatively speaking, the deep learning-based methods perform better.
During training, a network can obtain good OA by focusing on the categories with many points, which dominate OA, while ignoring the small categories that have little impact on it. To obtain a good macro avg F1 score, however, every category must be treated equally, including the small ones. Yet too much focus on small categories tends to lead to overfitting and reduces OA. Therefore, it is not easy to balance OA and the macro avg F1 score.
As can be seen from Table 4, compared with the other methods, our method balances OA and the macro avg F1 score well and achieves the highest macro avg F1 score. In addition, our method displays the best F1 scores for some categories; it is worth mentioning that it achieves a significantly high classification performance for the facade category. To illustrate this intuitively, Figure 10 plots the classification results of the three methods with the top three performances in Table 4: NANJ2, WhuY4, and GAFFNet. Facade points are easily occluded and misclassified. As shown in Figure 10, NANJ2 and WhuY4 misclassify many facade points into the roof or other categories, whereas our method correctly classifies far more points (green) than the other two. This is most likely because NANJ2 and WhuY4 lose 3D spatial information when converting the point cloud into 2D images.
To evaluate the superiority of our method more comprehensively, we compared it with further methods, specifically those based on graph neural networks. We compared our method not only with GACNN [22] but also with prevalent graph neural networks such as GACNet [21], DGCNN [31], and GAT [32], all of which were proposed recently. The authors of [22] reported the classification results of GACNet and DGCNN on the ISPRS 3D dataset, and these results are shown in Table 5.
In a point cloud classification network, the feature extraction module is one of the most critical parts in determining classification performance, but other parts (such as downsampling, graph construction, and preprocessing) can also affect the results. To demonstrate the superiority of our feature extraction module more fairly, we used improved versions of GACNet and DGCNN for comparison: only their feature extraction modules were kept the same as in the original networks, while all other parts were the same as in our method. The improved versions of GACNet and DGCNN are called GACNet-voxel and DGCNN-voxel, respectively. In addition, we compared our method with GAT, which first proposed the graph attention mechanism. GAT was not designed for point cloud classification, so we adapted it in the same way as GACNet; the improved version is called GAT-voxel. The classification results of all of the above methods are shown in Table 5.
Our method achieved better OA than the other methods and obtained a satisfactory macro avg F1 score (as shown in Table 5). In addition, it outperformed GACNet (GACNet-voxel) and DGCNN (DGCNN-voxel) by a large margin, which implies the superiority of our feature extraction module. The lower performance of the other methods may be due to their receptive field being too limited to sufficiently capture semantic features.
It can be seen from Table 5 that the classification results of GACNet-voxel and DGCNN-voxel are no worse than those of their original versions. Furthermore, GACNN and the original GACNet employ FPS for downsampling, which has a high time complexity, whereas our method and the improved versions all use voxel grid downsampling, which has a lower time complexity than FPS.
4.5. Ablation Study
To evaluate the influence of the various modules and features in GAFFNet, we conducted the following ablation experiments:
(1~3) The addition of max/sum/mean pooling. After each GAFFM, we added the widely used max/sum/mean pooling;
(4) The replacement of voxel grid downsampling with random sampling. The downsampling method in our network was replaced by random sampling, which has high time efficiency (abbreviated as GAFFNet-RS), and 25% of the points were retained after each random sampling;
(5~6) A change in the number of NFFU stacks. In GAFFNet, we stacked two NFFUs to form a standard GAFFM. In the ablation study, we conducted two groups of experiments with one NFFU or three NFFUs in a GAFFM;
(7) The removal of the height above the DTM feature. The computational cost is required to obtain the height above the DTM feature; therefore, we evaluated the impact of removing this feature on the network performance;
(8) The removal of statistical features. In our NFFU, we fused the statistical features, and in the ablation study, we evaluated the impact of removing this feature on the network performance.
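The max/sum/mean pooling variants tested in experiments (1~3) can be sketched as simple reductions over each point's neighbour features (a NumPy sketch with our own naming; the real module operates on learned feature tensors inside the network):

```python
import numpy as np

def neighbor_pool(feats, mode="max"):
    """Pool a (N, K, C) neighbour-feature tensor down to (N, C).

    feats[i, j] is the C-dim feature of the j-th neighbour of point i;
    max/sum/mean aggregate across the K neighbours of each point.
    """
    if mode == "max":
        return feats.max(axis=1)
    if mode == "sum":
        return feats.sum(axis=1)
    if mode == "mean":
        return feats.mean(axis=1)
    raise ValueError(f"unknown pooling mode: {mode}")
```

All three reductions discard the per-neighbour structure in a fixed, mechanical way, which is the behaviour examined in ablations (1~3).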
Table 6 shows the OA and macro avg F1 scores for all ablation networks, from which we draw the following conclusions:
(1) The widely used max/sum/mean pooling tends to select features mechanically, so their performance is not necessarily optimal, and the mean pooling is the worst of the three;
(2) Random sampling can shorten the sampling time, but the distribution of sampling points is random; this method may lose part of the point cloud information, which leads to the performance loss of the OA and macro avg F1;
(3) The performance degrades regardless of whether there is one NFFU or three NFFUs in GAFFM. This may be due to the limited receptive field size of one NFFU, which makes it difficult to improve the performance, and the excessive number of trainable parameters for three NFFUs, which leads to overfitting;
(4) After removing the height above the DTM feature, the OA performance decreases, whereas the macro avg F1 score is the highest among all networks. Therefore, if we do not prioritize a high OA performance, we can remove this input feature;
(5) After removing the statistical features, the performance losses in OA and macro avg F1 are considerably different: the former is small, whereas the latter is large.
5. Conclusions
For ALS point cloud classification, we propose GAFFNet, which is based on the graph attention mechanism and reduces the preprocessing steps. We use the graph attention network as the base network and design a new feature extraction module (i.e., GAFFM) that fuses multi-level features and effectively increases the receptive field of each point at a low computational cost. The module can therefore capture wider contextual features at different levels, which improves classification performance. In addition, GAFFNet is insensitive to the number of points, which greatly reduces the corresponding preprocessing steps and improves classification efficiency. The superiority of our method is demonstrated via extensive experiments. To further improve classification accuracy, we plan to further mine the relationships between neighboring points in the neighbor set, which should help us calculate the attention weight of each neighboring point more accurately.