1. Introduction
The development of 3D scanning and 3D imaging technologies has made point cloud data easier to acquire and has widened its range of applications. In the domain of remote sensing and geoinformation, the point cloud was at first primarily utilised to produce the digital surface model (DSM) and the digital terrain model (DTM) [1,2]. Nowadays, the point cloud has become a major data source for 3D model reconstruction and 3D mapping [3,4]. In the domain of computer vision and robotics research, the point cloud can be utilised in object detection, tracking, and 3D modelling [5,6]. In forestry applications, the point cloud provides the measurements required for forest type classification and tree species identification [7,8]. In addition, the point cloud can be used to record the shape and exterior of ancient relics or historical buildings for the purpose of digital preservation [9,10]. In recent years, driven by the needs of autonomous driving, the point cloud has been extensively utilised to detect and identify all types of traffic objects for road inventory and the production of high-definition maps (HD maps) [11,12,13]. Among these applications, a very common task is point cloud classification [14,15], also known as point cloud semantic segmentation [16,17].
The primary objective of point cloud classification is to assign a semantic class label to each point of the point cloud. However, the automatic classification of the point cloud is rendered challenging by several characteristics of the data, such as the irregularity of point distributions, the enormous number of points, the non-uniform point density, and the complexity of the observed scenes [14,16,18]. One approach to simplifying point cloud classification is to convert the unordered and irregular point cloud into a regular raster format before classification [19,20]. However, the classification result is then affected by the details inevitably lost during the data conversion [21]. To avoid the problems that occur during data conversion, another classification approach first extracts handcrafted features from the point cloud and then classifies them. If the extracted features are representative and discriminative, regular classification algorithms, such as the maximum likelihood classifier (MLC), can normally generate a satisfactory classification result [21]. Nevertheless, the identification of effective handcrafted features hinges on the application requirements, previous experience, or the prior domain knowledge provided by experts [15]. To advance the level of automation in classification, many studies have attempted to extract as many features from the original point cloud as possible and then perform automatic feature selection and classification by means of machine learning (ML) algorithms [22,23]. Despite the well-developed learning theories and the powerful classification efficacy of many ML algorithms, such as support vector machines (SVM) and random forests (RF), most of them are shallow learning models. As a result, how to extract effective features remains a primary concern when adopting these approaches [24].
In recent years, deep learning (DL) has exhibited rather satisfactory results in image processing and big data analysis [25]. Unlike handcrafted features devised on the basis of domain knowledge, DL directly learns effective features from the original data by means of convolution and multilayer neural networks, and it has gradually been applied in point cloud classification as well [15,18,26,27]. However, because of the unordered and irregular nature of point cloud data, regular DL models, such as convolutional neural networks (CNNs) or recurrent neural networks (RNNs), cannot be directly applied to point cloud classification. The simplest approach is still to convert the point cloud to regularly distributed 2D images [28] or 3D voxels [29] and then use the DL model to classify the converted data. However, the data conversion may result in the loss of details or in data ambiguity/distortion caused by the introduction of artefacts [27]. To avoid the problems that occur during data conversion, Qi, Su, Mo and Guibas [27] proposed a pioneering 3D deep learning model, PointNet, which allows the direct input of the original point cloud data without having to convert it to another form. PointNet stands out with its simple model, small number of parameters, and fast computation, yet it is unable to extract local features. Many DL models applicable to point cloud classification have since been proposed, such as PointNet++ [30], PointCNN [31], PointSIFT [32], and KPConv [33]. These DL models lack a common framework, and some of the model structures have been deliberately expanded to enhance the classification efficacy, which not only increases the model complexity and the calculation time but also results in overfitting [34], rendering the trained model inapplicable to other scenes. In addition, most of the DL models originate from the computer vision field, and the point clouds they process are mostly of indoor scenes, without considering the characteristics and application requirements of outdoor point cloud data in the remote sensing field, such as airborne laser scanning (ALS) or mobile laser scanning (MLS) point clouds [15].
In summary, while specific handcrafted features are effective in point cloud classification to a certain extent, their efficacy hinges on sufficient prior domain knowledge. In contrast, while powerful DL models are capable of learning features automatically, a complicated network architecture and a relatively long calculation time are often required to achieve a better classification result. The primary purpose of this study was to enhance the performance of point cloud classification by combining the advantages of handcrafted features and learned features without increasing the complexity of the network model, building on existing 3D DL models and the accumulated experience and knowledge of remote sensing. In this study, handcrafted features whose efficacy had been verified by previous relevant studies were integrated into a simple DL model that lacked the ability to extract local features, enabling the network to recognise other effective features through automatic learning. The experimental results illustrate that a simple DL model with handcrafted features can achieve, and even exceed, the performance of a complex DL model, and can fulfil the application requirements of the remote sensing domain.
3. Methodology
Figure 1a illustrates the traditional feature-based point cloud classification process. Assume that $p$ is a point in the point cloud and $(x_p, y_p, z_p)$ are the 3D coordinates of the point $p$. $\mathcal{N}_p$ represents the neighbourhood of point $p$, and $F_p$ is the handcrafted feature set derived from the K points in the neighbourhood $\mathcal{N}_p$. These handcrafted features are the major inputs of the classifier; therefore, the quality of the classification result hinges on whether the handcrafted features are discriminative.
Figure 1b illustrates the classification process based on 3D deep learning, in which the 3D coordinates of the points are directly input and the point cloud classification is executed after the features are learned by the DL network itself. This method possesses a higher level of automation, and generally, the more complex the network architecture, the better the classification results achieved.
Figure 1c illustrates the point cloud classification process proposed in this study, which combines the advantages of both the handcrafted and the learned features, aiming to achieve with a simple deep learning model (e.g., PointNet) the effects produced by a complex one (e.g., PointNet++ or KPConv).
Furthermore, in order to evaluate the applicability of the proposed method to different point cloud datasets, we conducted a classification and result analysis of ALS and MLS point cloud data, respectively. Although both scanning approaches generate point cloud data in the form of 3D coordinates, their properties differ considerably. The ALS point cloud generally has a lower point density, a higher acquisition speed, and a wider coverage area, whereas the MLS point cloud normally has a higher density and more distinct details. Another major distinction lies in the scanning direction toward the ground objects. ALS scans are performed vertically downward from a high altitude, resulting in a sparse point cloud on the vertical surfaces of ground objects (e.g., the walls of buildings), whereas MLS scans are conducted horizontally at ground level and may miss points on the horizontal surfaces of some ground objects (e.g., the roofs of buildings).
3.1. Extraction of Handcrafted Features
The network architecture of PointNet is characterised by a simple model, few parameters, and a fast training speed, yet it is deficient in the extraction of local features. As a result, the handcrafted features used in this study were mainly local features of the point cloud, chosen to compensate for this deficiency of PointNet. In addition, intrinsic properties of the point cloud data, such as return intensity and elevation information, were utilised as features, and their effects on the classification efficacy were evaluated.
Table 2 presents the handcrafted features used in this study. Their definitions and calculation methods are as follows.
3.1.1. Covariance Features
Covariance features are the most representative type of common local features, and many researchers have confirmed their positive effect on classification [49,50,51]. The covariance features of point $p$ are generated mainly by calculating the (3 × 3) covariance matrix of the coordinates of all the points in its neighbourhood [51]. In this study, we first found the neighbourhood $\mathcal{N}_p$ via K-nearest neighbours (KNN) and calculated the covariance matrix of all the points in the neighbourhood. Then, by eigendecomposition, we obtained the three eigenvalues of the covariance matrix, arranged from small to large as $\lambda_1 \le \lambda_2 \le \lambda_3$, and the three corresponding eigenvectors $\mathbf{e}_1$, $\mathbf{e}_2$, $\mathbf{e}_3$. Many geometric features can be derived from the eigenvalues, among which the three common shape features, linearity (L), planarity (P), and scattering (S), as proposed by Demantké, Mallet, David and Vallet [50], can be utilised to determine the shape behaviour of the points within the neighbourhood $\mathcal{N}_p$. Their calculation methods are illustrated in Formula (1) to Formula (3). When the point cloud in the neighbourhood is in a linear form, $\lambda_3 \gg \lambda_2 \approx \lambda_1$ and the value of linearity (L) is close to 1; when it is in a planar form, $\lambda_3 \approx \lambda_2 \gg \lambda_1$ and the value of planarity (P) is close to 1; when it is in a dispersed, volumetric form, $\lambda_1 \approx \lambda_2 \approx \lambda_3$ and the value of scattering (S) is close to 1. Furthermore, in this study, we used the verticality (V) put forth by Guinard and Landrieu [52]; its calculation is illustrated in Formula (4). Verticality (V) characterises how vertical the point distribution is: a horizontal neighbourhood produces a value close to 0, a vertical linear neighbourhood produces a value close to 1, and a vertical planar neighbourhood (e.g., a façade) produces an intermediate value.
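With the eigenvalues sorted in ascending order as above, the shape features follow the usual eigenvalue-based definitions; the verticality expression given here is one common formulation built from the eigenvalue-weighted, element-wise absolute eigenvectors, consistent with the behaviour just described:

$$L = \frac{\lambda_3 - \lambda_2}{\lambda_3} \quad (1)$$

$$P = \frac{\lambda_2 - \lambda_1}{\lambda_3} \quad (2)$$

$$S = \frac{\lambda_1}{\lambda_3} \quad (3)$$

$$V = \hat{u}_z, \qquad \hat{\mathbf{u}} = \frac{\sum_{i=1}^{3} \lambda_i \, \lvert \mathbf{e}_i \rvert}{\left\lVert \sum_{i=1}^{3} \lambda_i \, \lvert \mathbf{e}_i \rvert \right\rVert} \quad (4)$$

where $\lvert \mathbf{e}_i \rvert$ denotes the element-wise absolute value of the eigenvector $\mathbf{e}_i$ and $\hat{u}_z$ is the vertical component of the unit vector $\hat{\mathbf{u}}$.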
In addition, the normal vector of each point in the point cloud is generally regarded as one of the important features by many researchers [49]. The normal vector can be computed using many different methods [53]. In this study, we used the eigenvector $\mathbf{e}_1$ corresponding to the minimal eigenvalue $\lambda_1$ of the covariance matrix of point $p$ as the normal vector N of the point and decomposed it along the 3D coordinate axes into three normal components serving as the normal features of the point, as demonstrated in Formula (5):
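That is, with $\mathbf{e}_1$ as defined above,

$$\mathbf{N} = \mathbf{e}_1 = (N_x, N_y, N_z) \quad (5)$$

where $N_x$, $N_y$, and $N_z$ are the three normal features of the point.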
3.1.2. Height Features
Another effective feature involves the elevation information of the point cloud. Both the ALS and the MLS data contain a large number of points on the ground surface. Almost all other above-ground objects are connected to the ground, and the junction is where classification errors are most likely to occur. To solve this problem, many researchers filter out the ground points before classifying the above-ground objects [54], while some introduce the height difference ∆z between the ground and the above-ground object as a feature for classification [38,40,54]. As the filtering of ground points cannot yet be fully automated, it remains a time-consuming and laborious task to thoroughly remove the points on the ground [15]. In view of this, in this study, we utilised the height difference ∆z as the height feature, computed by subtracting the height of the lowest point in the scene from the height of the point (∆z = z_p − z_min).
3.1.3. Intensity Features
Apart from acquiring the 3D coordinates of the target point, the laser scanner often records the return intensity I of the laser at the same time. The intensity value may be affected by the texture and roughness of the target surface, the laser wavelength, the emitted energy, and the incidence angle, and can therefore facilitate classification to some extent [39,49].
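To make these definitions concrete, the following Python sketch (an illustration under the definitions above, not the authors' implementation; the function name and the neighbourhood size k = 20 are assumptions) computes the covariance, normal, height, and intensity features for every point:

```python
import numpy as np
from scipy.spatial import cKDTree

def handcrafted_features(xyz, intensity=None, k=20):
    """Per-point covariance, normal, height, and intensity features."""
    tree = cKDTree(xyz)
    _, knn_idx = tree.query(xyz, k=k)          # K-nearest neighbours of each point
    z_min = xyz[:, 2].min()                    # lowest point in the scene
    feats = []
    for i in range(len(xyz)):
        nbrs = xyz[knn_idx[i]]
        cov = np.cov(nbrs.T)                   # (3 x 3) covariance matrix
        w, v = np.linalg.eigh(cov)             # eigenvalues ascending: l1 <= l2 <= l3
        l1, l2, l3 = np.maximum(w, 1e-12)      # guard against degenerate neighbourhoods
        L = (l3 - l2) / l3                     # linearity, Formula (1)
        P = (l2 - l1) / l3                     # planarity, Formula (2)
        S = l1 / l3                            # scattering, Formula (3)
        u = (w * np.abs(v)).sum(axis=1)        # eigenvalue-weighted absolute eigenvectors
        V = u[2] / np.linalg.norm(u)           # verticality, Formula (4)
        n = v[:, 0]                            # normal: eigenvector of smallest eigenvalue
        dz = xyz[i, 2] - z_min                 # height feature
        feats.append([L, P, S, V, *n, dz])
    feats = np.asarray(feats)
    if intensity is not None:                  # append intensity feature when recorded
        feats = np.hstack([feats, intensity[:, None]])
    return feats
```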
3.2. Feature Selection and Model Configuration for ALS Point Cloud Classification
Figure 2a illustrates the ALS point cloud data utilised in this study. These test data constitute a subset of an ALS dataset collected over Tainan, Taiwan, in August 2010. The data acquisition was carried out with a RIEGL Q680i scanner at a flight altitude of approximately 800 m, yielding a point density of approximately 10 pt/m². The primary application of these data was 3D urban mapping, and four classes, namely Ground, Building, Car, and Tree, were manually labelled beforehand and utilised as the reference data for training and testing, as illustrated in Figure 2b. These test data contain a total of 2,262,820 points, and the number of points and percentage belonging to each class are shown in Table 3.
On the basis of the two deep networks PointNet and PointNet++, we integrated the different types of handcrafted features discussed in the previous section into the ALS data experiments and produced the different models shown in Table 4.
According to the design of the PointNet and PointNet++ models, the input point cloud is first divided into several blocks; then, a fixed number of points are sampled from each block for training. Since the original studies offer no suggestions about the ideal block size, we determined the best block size via experiments (see Section 4.1 for details) and divided the ALS point cloud data into 15 m × 15 m blocks, with each block containing 2048 sampled points. The training strategy for the models basically corresponded to the setup suggested by the original studies. The settings of the relevant hyperparameters are listed in Table 5; their definitions and effects can be found in [55].
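As an illustration of this block-preparation step (our sketch, not the original implementation; the grid origin and the with-replacement resampling rule are assumptions), the point cloud can be tiled in the XY plane and resampled to a constant number of points per block:

```python
import numpy as np

def make_blocks(xyz, block_size=15.0, n_points=2048):
    """Divide a point cloud into block_size x block_size tiles (in XY) and
    sample a fixed number of points from each tile, as required by
    PointNet/PointNet++-style training."""
    ij = np.floor((xyz[:, :2] - xyz[:, :2].min(axis=0)) / block_size).astype(int)
    blocks = []
    for key in np.unique(ij, axis=0):
        idx = np.where((ij == key).all(axis=1))[0]
        # Resample with replacement when a tile holds fewer than n_points.
        pick = np.random.choice(idx, n_points, replace=len(idx) < n_points)
        blocks.append(xyz[pick])
    return np.stack(blocks)          # (num_blocks, n_points, 3)
```

For the MLS data described in Section 3.3, the same routine would be called with block_size=5.0 and n_points=4096.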
Unlike the two aforementioned models, which require the point cloud to be divided into blocks, KPConv directly classifies the point cloud of the whole scene. However, an excessive number of points might exhaust the memory and cause the calculation to fail, hence the need to subsample the original point cloud. Moreover, the convolution kernels in KPConv come in two types, rigid and deformable. The training strategy for the model followed the setup of the original study, with a maximum of 500 epochs, a batch size of 12, 15 kernel points, and a radius of influence of the point convolution of 4 m [33].
3.3. Feature Selection and Model Configuration for MLS Point Cloud Classification
Figure 3a illustrates the MLS point cloud data utilised in this study. The test area is located in the Tainan High Speed Rail Station District (Shulan) in Tainan, Taiwan. These MLS data were collected with an Optech Lynx-M1 scanner in December 2017, with a point density of approximately 200 pt/m², and were primarily used for road inventory and the production of HD maps. In comparison with the ALS data, the observed scene in the MLS data contained more numerous and more complex ground objects and object classes, comprising a total of eight classes: Ground, Tree, Street lamp, Traffic sign, Traffic light, Island (divisional island), Car, and Building, as shown in Figure 3b. There are 14,899,744 points in this observed scene, and the number of points and percentage belonging to each class are listed in Table 6.
On the basis of the two models PointNet and PointNet++, we tested different feature combinations for the MLS point cloud classification, as shown in Table 7; compared with the ALS experiments, the intensity feature was added and the normal features were removed.
In consideration of the high density and large volume of the MLS data, and in order to effectively capture the local geometric features of the point cloud, we first subsampled the data and determined via experiments that the best block size was 5 m × 5 m, with 4096 points extracted from each block for training. The training strategy for the PointNet and PointNet++ models followed the setup from the original studies; the settings of the relevant hyperparameters are listed in Table 8. The setup of KPConv basically resembled that described in Section 3.2. However, in view of the relatively complex scene and the larger number of points in the MLS data, the maximum number of epochs was set to 600, while the batch size was set to 8.
3.4. Classification Performance Evaluation
In order to assess the efficacy of each model in point cloud classification, we used classification performance metrics frequently employed in machine learning, namely the overall accuracy (OA), precision, recall, F1-score, and Matthews correlation coefficient (MCC) [56]. For binary classification, these indicators can be expressed as follows:
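In their standard form (with equation numbering consistent with the references to Equations (7)–(9) below):

$$\mathrm{OA} = \frac{TP + TN}{TP + TN + FP + FN} \quad (6)$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \quad (7)$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \quad (8)$$

$$\mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \quad (9)$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \quad (10)$$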
Here, TP, FP, FN, and TN represent true positives, false positives, false negatives, and true negatives, respectively, all of which can be calculated from the point-based confusion matrix.
In the case of a multi-class problem with K classes, the macro-averaging procedure is commonly employed to calculate the overall mean of the per-class measures [57]. By this procedure, the precision, recall, and F1-score are computed for each class according to Equations (7)–(9) and then averaged via the arithmetic mean. In addition, a multi-class extension of MCC in terms of the confusion matrix was also considered in this study [58], which is defined as follows:
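With the quantities defined immediately below, this extension takes the standard form

$$\mathrm{MCC}_K = \frac{c \cdot s - \sum_{k=1}^{K} p_k t_k}{\sqrt{\left(s^2 - \sum_{k=1}^{K} p_k^2\right)\left(s^2 - \sum_{k=1}^{K} t_k^2\right)}} \quad (11)$$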
where c is the total number of samples correctly predicted, s is the total number of samples, $p_k$ is the number of samples predicted as class k, and $t_k$ is the number of samples that truly belong to class k.
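As a usage illustration (a minimal sketch assuming scikit-learn is available; the label arrays are placeholders), all of these indicators can be computed directly from per-point labels and predictions:

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             matthews_corrcoef, precision_score, recall_score)

# y_true, y_pred: per-point ground-truth and predicted class labels.
y_true = [0, 0, 1, 2, 2, 2, 1, 0]
y_pred = [0, 1, 1, 2, 2, 0, 1, 0]

oa        = accuracy_score(y_true, y_pred)                  # overall accuracy
precision = precision_score(y_true, y_pred, average="macro")
recall    = recall_score(y_true, y_pred, average="macro")
f1_macro  = f1_score(y_true, y_pred, average="macro")       # average F1-score
mcc_k     = matthews_corrcoef(y_true, y_pred)               # multi-class MCC
print(oa, precision, recall, f1_macro, mcc_k)
```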
In comparison with OA, the average F1-score is less vulnerable to the problem of imbalanced data. As a result, many studies have used the average F1-score to assess point cloud classification performance [14,18,59]. As an alternative measure unaffected by the imbalanced dataset issue, MCC_K is more informative than the average F1-score and OA in evaluating classification problems [45,60].
6. Summary and Conclusions
This study focused on the two deep learning networks PointNet and PointNet++ and analyzed the effects of adding various types of handcrafted features on point cloud classification efficacy. In addition, two point cloud datasets, an ALS dataset covering a simple scene and an MLS dataset covering a complex scene, were used to test the performance of the proposed method.
For the PointNet model, the various types of handcrafted features introduced in this study are clearly useful for classifying both ALS and MLS point cloud data. In particular, the shape features, which encode local geometric structure, brought the most significant improvement in classification performance. For ALS point cloud classification, the addition of the shape features considerably rectified the misclassification problems, i.e., buildings being misclassified as trees or ground, ground being misclassified as buildings and trees, and trees being misclassified as buildings. For MLS point clouds, the misclassification of pole-like objects such as street lamps, traffic signs, and traffic lights can be significantly rectified by adding the intensity and shape features to the PointNet model. In addition, the inclusion of these local features also effectively solves the problem of cars and buildings being misclassified as trees. For PointNet++, despite its intrinsic ability to extract local features, the addition of the handcrafted features still improved the classification performance to a small extent for both the ALS and the MLS data. We also found that height features are beneficial for ALS data classification but not for MLS data, which is likely due to the different point distributions of ALS and MLS point clouds.
Comparing the aforementioned results with those produced by RF and KPConv, we found that PointNet with the added features performed better on the ALS data, while KPConv, equipped with 3D convolution kernels, performed better on the complex MLS data, albeit with a complex model architecture and a considerable calculation time. With the addition of local features, PointNet attained results on the MLS data similar to those of PointNet++ and KPConv, but with the advantages of a simple model architecture and a short calculation time. As a result, the PointNet model incorporating handcrafted features is the more practical choice for classifying simple observed scenes or for analyzing complex scenes efficiently.
Through the experiments, we also identified ample room for discussion and improvement. First, regarding the influence of the number of ground object classes: in the ALS data used in this study, only four types of ground objects were classified in the experiment, whereas many real scenes are considerably more complex. Testing and discussion on more complex ALS scenes containing more ground object classes should therefore be conducted in the future. Furthermore, we observed that in both the ALS and the MLS data, ground points constituted the majority of the data, resulting in a data imbalance problem; if this problem is solved, better results and performance can be expected. Finally, in this study, we tested only some point-based features. The efficacy of other features, such as contextual features [54], object-based features [22], and full-waveform features [38], should be investigated in the future.