Airborne Laser Scanning Point Cloud Classification Using the DGCNN Deep Learning Method

Abstract: Classification of aerial point clouds with high accuracy is significant for many geographical applications, but not trivial as the data are massive and unstructured. In recent years, deep learning for 3D point cloud classification has been actively developed and applied, but notably for indoor scenes. In this study, we implement the point-wise deep learning method Dynamic Graph Convolutional Neural Network (DGCNN) and extend its classification application from indoor scenes to airborne point clouds. This study proposes an approach to provide cheap training samples for point-wise deep learning using an existing 2D base map. Furthermore, essential features and spatial contexts to effectively classify airborne point clouds colored by an orthophoto are also investigated, in particular to deal with class imbalance and relief displacement in urban areas. Two airborne point cloud datasets of different areas are used: Area-1 (city of Surabaya, Indonesia) and Area-2 (cities of Utrecht and Delft, the Netherlands). Area-1 is used to investigate different input feature combinations and loss functions. The point-wise classification for four classes achieves 91.8% overall accuracy when using the full combination of spectral color and LiDAR features. For Area-2, different block size settings (30, 50, and 70 m) are investigated. It is found that using an appropriate block size of, in this case, 50 m helps to improve the classification up to 93% overall accuracy but does not necessarily ensure better classification results for each class. Based on the experiments on both areas, we conclude that using DGCNN with proper settings is able to provide results close to production.


Introduction
Autonomous and reliable 3D point cloud classification or semantic segmentation is an important capability in applications ranging from mapping, 3D modeling, navigation to urban planning. However, this task is considered nontrivial [1] as extracting semantic information is challenging due to the high redundancy, uneven sampling density, and lack of explicit structure in point clouds [2,3]. Earlier approaches overcame this challenge by transforming the point cloud into a structured grid (image or voxel) which led to an increase in computational costs or loss of depth information [4]. PointNet, the first neural network directly consuming raw point cloud data, employs a series of multilayer perceptrons to learn higher dimensional features for each individual point and concatenates them to obtain global context within a small 3D block, which shows effective and efficient performance for classification [5]. Nevertheless, it is still largely unknown how training data should be prepared in terms of quality, variety, and numbers to obtain acceptable (e.g., >90%) classification accuracy.
Airborne Laser Scanning (ALS) point clouds and aerial photos are the two main very high-resolution and accurate input data available to map cities. Both data have different Point cloud data have particular characteristics that make classification even more challenging: they are unordered and unstructured, often with large variations in point density and occlusions [15]. Deep learning for 3D point cloud data has been developed. Some methods apply dimensionality reduction by converting 3D data into multiview images (MVCNN, SnapNet, etc.); other methods organize point clouds into voxels (Seg-Cloud, OctNet, etc.) or directly use 3D points as inputs (PointNet, PointNet++, SuperPoint Graph, etc.). Inspired by PointNet [5], several point-wise deep learning methods classify 3D point cloud data using a network composed of a succession of fully connected layers. However, PointNet limitations on capturing the spatial correlation between points triggered several alternative point-wise deep learning network architectures such as SuperPoint Graph [16], PointCNN [17], and DGCNN [18].
In the context of neural networks, a model may have difficulties in learning meaningful features [19]. Most experiments on point-wise deep learning use benchmark indoor point clouds (e.g., Stanford S3DIS dataset) with input features consisting of 3D coordinates (x, y, z), color information or Red, Green, Blue (RGB), and normalized coordinates (n_x, n_y, n_z). Implementations for airborne point clouds with different input features are available in the literature. Soilán et al. [20] implemented a multiclass classification workflow (ground, vegetation, building) using PointNet applied to an ALS point cloud. They replaced the RGB features used in the original PointNet publication by LiDAR-derived features: intensity, return number, and height of the point with respect to the lowest point in a 3 × 3 m neighborhood. Even though the classification accuracy was 87.8%, there was high confusion between vegetation and buildings. Wicaksono et al. [21] used a DGCNN to classify an ALS point cloud into building and nonbuilding classes with two different feature combinations: with and without color features. Based on their results, they stated that color features do not improve the classification and suggested further research to address the incorporation of color information. In contrast, using a so-called sparse manifold CNN, Schmohl and Soergel [22] obtained a 0.8% higher overall accuracy when using additional color information on their test set segmentation. Xiu et al. [23] classified ALS point cloud data concatenated with color (RGB) features from an orthophoto using a PointNet architecture. By applying RGB features, overall accuracy increased by 2%, from 86% to 88%. Additionally, Poliyapram et al. [24] propose a so-called end-to-end point-wise LiDAR and image multimodal fusion network (PMNet) for classification of an ALS point cloud of Osaka city in combination with aerial image RGB features. Their results show that the combination of intensity and RGB features could improve overall accuracy from 65% to 79%, while the performance in identifying buildings improved by 4%. We conclude that the beneficial effect of using RGB features in ALS point cloud classification remains inconclusive. A possible explanation for the inconsistency of the results is problems in the fusion of the ALS point cloud and the color information.
One source of fusion problems could be the effect of relief displacement in areas with high-rise buildings which is, so far, hardly discussed in the literature. In a (ground) orthoimage, relief displacement disrupts the true orthogonality of highly elevated objects (e.g., high buildings) and results in horizontal displacement of up to several meters from their real position [25]. As a consequence, LiDAR points in the displacement area may have incorrect RGB values.
Another major challenge of designing deep learning systems for spatial-spectral data classification is the lack of labeled training samples [26]. Yang et al. [27] propose automatic training sample generation using a 2D topographic map and an unsupervised segmentation by first separating ground from nonground points and then performing a point-in-polygon operation. Unsupervised segmentation was performed to reduce noise and improve accuracy of the previous task. A SuperPoint Graph network was trained and tested on the labeled points and achieved an average F1 score of 74.8%. However, the F1 score for water (41.6%) continued to underperform. A further challenge is effective classification with imbalanced classes, in which some classes in the data have a significantly higher number of examples in the training set than other classes. These circumstances add difficulty, as most classifiers will exhibit bias towards the majority class and may ignore the minority class altogether [28]. Winiwarter et al. [29] investigated the applicability of PointNet++ for ALS point cloud classification on the ISPRS Vaihingen benchmark and a Vorarlberg dataset. They also mention that classes with high occurrences tend to have higher classification accuracies than those that appear less frequently in the training (and evaluation) data. Typically, imbalanced class distribution results in performance loss [30]. Lin et al. [31] introduce a focal loss function to address class imbalance in object detection in a case of extreme imbalance between foreground and background pixels. Huang et al. [32] stated that for deep learning techniques (e.g., PointNet), the results of classification depend on the manner of point-sampling and block-cutting during preprocessing, and the manner of interpolation during postprocessing.
Based on the aforementioned related work, we conclude that finding the optimal input feature combination for ALS point cloud classification incorporating RGB color remains an open issue due to inconsistencies between different research results. Optimization of several deep learning parameter settings (e.g., loss function, block cutting), which are not intuitive, has the potential to improve the classification results. Furthermore, as class imbalance is naturally inherent in many remote sensing classification problems, providing a sufficient amount of good quality training samples without overfitting the data is still an important research topic.

Experiments
This study takes a point cloud colored by an orthophoto as an input to automatically estimate 2D urban map objects. These consist of building blocks and road networks in vector format (polygon or polyline). Our methodological workflow consists of two main tasks, training set preparation and classification, and involves two different test areas (Figure 1). Point cloud classification as defined in this study refers to the task of assigning a predefined class or semantic label (e.g., bare land, building, tree, road) to each individual 3D point of a given point cloud, which is also known as semantic segmentation or class labeling.

DGCNN
To examine different parameter settings for ALS point cloud classification, this study uses a Dynamic Graph CNN (DGCNN) architecture proposed by Wang et al. [19]. DGCNN is a point-wise neural network architecture that combines PointNet and a graph CNN approach. The network architecture uses a spatial transformation module and estimates global information, akin to PointNet. The Dynamic Graph CNN approach captures local geometric information while ensuring permutation invariance. It extracts edge features through the relationship between a central point and neighboring points by constructing a nearest-neighbor graph that is dynamically updated from layer to layer.
Based on the architecture of PointNet, the DGCNN architecture (see Figure 2) incorporates a so-called EdgeConv module to capture local geometric features from points, a capability that is missing in previous point-wise deep learning architectures [33]. EdgeConv constructs a local graph between a point and its k-nearest neighbor points and applies convolution-like operations on the graph edges. DGCNN uses PointNet [5] as the basic architecture but combines it with graph CNNs. Instead of using fixed graphs, as other graph CNN methods do, EdgeConv updates its neighborhood graphs dynamically for each layer of the network, thereby effectively increasing the spatial coverage of the neighborhoods as the convolution step between layers downsamples the point cloud.
[Figure: (a) The DGCNN semantic segmentation network architecture; (b) spatial transformation block; (c) EdgeConv block.]

Each EdgeConv block applies an asymmetric edge function $h_\Theta(x_i, x_j) = \bar{h}_\Theta(x_i, x_j - x_i)$ across all layers to combine both the global shape structure (by capturing the coordinates of the patch center $x_i$) and the local neighborhood information (by capturing $x_j - x_i$) as shown in Figure 3. Similar to PointNet and PointNet++, the aggregation operation to downsample the input representation in DGCNN is max pooling.
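As an illustration of the EdgeConv operation described above, the following NumPy sketch builds a k-nearest-neighbor graph, forms the edge features (x_i, x_j − x_i), applies a shared linear map with ReLU, and max-pools over the neighbors. It is a minimal sketch under stated assumptions and not the DGCNN implementation: the names `knn_indices` and `edge_conv`, the single linear layer standing in for the shared multilayer perceptron, the exclusion of a point from its own neighborhood, and the brute-force distance computation are all chosen here for brevity.

```python
import numpy as np


def knn_indices(feats, k):
    """Indices of the k nearest neighbors of every point (self excluded)."""
    # Brute-force pairwise squared distances, shape (N, N); fine for small N only.
    diff = feats[:, None, :] - feats[None, :, :]
    dist2 = np.sum(diff ** 2, axis=-1)
    np.fill_diagonal(dist2, np.inf)  # assumption: a point is not its own neighbor
    return np.argsort(dist2, axis=1)[:, :k]


def edge_conv(feats, weight, bias, k):
    """One EdgeConv-style layer: edge features -> linear + ReLU -> max pooling."""
    idx = knn_indices(feats, k)                        # (N, k)
    center = np.repeat(feats[:, None, :], k, axis=1)   # x_i, shape (N, k, F)
    neighbor = feats[idx]                              # x_j, shape (N, k, F)
    # Asymmetric edge feature: patch center plus offset to each neighbor.
    edge = np.concatenate([center, neighbor - center], axis=-1)  # (N, k, 2F)
    hidden = np.maximum(edge @ weight + bias, 0.0)     # (N, k, F_out)
    return hidden.max(axis=1)                          # max over neighbors


rng = np.random.default_rng(0)
points = rng.normal(size=(50, 3))
w1, b1 = rng.normal(size=(6, 8)), np.zeros(8)
w2, b2 = rng.normal(size=(16, 8)), np.zeros(8)

layer1 = edge_conv(points, w1, b1, k=5)   # neighbors found in coordinate space
layer2 = edge_conv(layer1, w2, b2, k=5)   # neighbors recomputed in feature space
```

Calling `edge_conv` on its own output recomputes the neighbors in feature space, which is what makes the graph dynamic from layer to layer. Because the aggregation is a maximum over neighbors, reordering the input points only reorders the output rows.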
To demonstrate the feasibility of DGCNN to classify huge ALS point cloud data, this study uses two study areas of different sizes, characteristics, and input point cloud specifications. Area-1 (city of Surabaya, Indonesia) represents a metropolitan urban area dominated by dense settlements while Area-2 (city of Utrecht, the Netherlands) has more variation in urban land use. Datasets of both study areas exhibit imbalances in their class distribution. Area-1 has a total size of 5 km² while Area-2 has a total size of 25 km².

Area-1
The first test area is located in the second-largest Indonesian metropolitan area, Surabaya city in East Java Province. The city is characterized by dense settlement areas with various types of well-connected roads. Surabaya city is home to numerous high-rise buildings and skyscrapers. Many parks exist and vegetation in Surabaya city is dominated by trees (see Figure 4). For this study area, we classify the 3D point cloud into four classes: bare land, trees, buildings, and roads. Due to the limited number of LiDAR points covering water in the study area, a water class is not included.
Area-1 covers 21.5 km² and consists of 354.2 million points. The ALS point cloud was captured by an Optech Orion H300 instrument and has an average density of about 30 points/m². The aerial photos captured at the same time by a tandem camera have a spatial resolution of 8 cm with less than 15 cm positional accuracy. The ALS point cloud is divided into two classes: ground and nonground points. The dataset was projected in the UTM49 South coordinate system using the WGS84 geoid. Both the LiDAR point cloud and the aerial photos were acquired from the same platform at the same time in 2016. The reference data used to label the points and to evaluate the final results are an Indonesian 1:1000 base map from 2017. The base map was acquired by manual 3D delineation from the same aerial photos.

Training Set Preparation
To efficiently process the points, the 3D point cloud of Area-1 is divided into 8 grids (see Figure 5a) and each grid is split into 16 tiles. Tile no. 5, located in the top left part of Area-1, is used as a test area and the remaining tiles are used for training.
In preparing the training set, we first projected RGB (Red, Green, Blue) color information from an orthophoto onto the ALS point cloud data by nearest neighbor. Next, the point cloud data were downsampled to 1 m 3D spacing for efficiency and to facilitate the capturing of global information.
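The two preprocessing steps, coloring the points from the orthophoto and thinning them to 1 m spacing, can be sketched as follows. This is a hedged illustration, not the processing chain used in the study: the helper names `colorize` and `voxel_downsample`, the orthophoto georeferencing convention (upper-left corner origin, square pixels, north-up), and the choice of a voxel grid that keeps the first point per 1 m cell are all assumptions.

```python
import numpy as np


def colorize(points_xy, ortho, origin_xy, pixel_size):
    """Assign each point the RGB value of the orthophoto pixel it falls in.

    Assumption: origin_xy is the map coordinate of the upper-left corner of the
    image, rows run north to south, and pixels are square.
    """
    col = np.floor((points_xy[:, 0] - origin_xy[0]) / pixel_size).astype(int)
    row = np.floor((origin_xy[1] - points_xy[:, 1]) / pixel_size).astype(int)
    # Clamp points that fall marginally outside the image extent.
    row = np.clip(row, 0, ortho.shape[0] - 1)
    col = np.clip(col, 0, ortho.shape[1] - 1)
    return ortho[row, col]


def voxel_downsample(xyz, spacing=1.0):
    """Indices of one point (the first encountered) per voxel of the given size."""
    keys = np.floor(xyz / spacing).astype(np.int64)
    _, first = np.unique(keys, axis=0, return_index=True)
    return np.sort(first)


# Tiny example: a 2 x 2 pixel orthophoto with 1 m pixels and three points.
ortho = np.arange(12, dtype=np.uint8).reshape(2, 2, 3)
xyz = np.array([[0.1, 1.9, 0.1], [0.2, 1.8, 0.2], [1.5, 0.5, 0.1]])
rgb = colorize(xyz[:, :2], ortho, origin_xy=(0.0, 2.0), pixel_size=1.0)
kept = voxel_downsample(xyz, spacing=1.0)
```

Taking the pixel that contains the point is equivalent to choosing the nearest pixel center. Points displaced by relief in the orthophoto would still receive the wrong color under this scheme.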
To label the training data, 2D building and road polygons from the 1:1000 base map of Surabaya city were used. Although the point cloud data used in this study were already classified into ground and nonground points, two challenges needed to be solved when the 2D base map was used to label the point cloud. First, the corresponding base map does not provide information on trees. Second, as we use 2D base map polygons to label the points, the labeled building and road points may include many mislabeled points in cases where trees exist above buildings or roads. Therefore, we performed hierarchical filtering to label tree and bare land points based on surface roughness. The method is also intended to improve the quality of the training samples by removing likely mislabeled points on buildings and road points.
The labeling criteria are as follows:
a. nonground points are labeled as buildings using 2D building polygons of the base map. Using the same method, ground points were labeled as road. Remaining points were labeled as bare land;
b. from the points labeled as building or road, any point that has surface roughness above a threshold is relabeled as tree. The surface roughness was estimated for each point based on the distances to the best fitting plane estimated using all neighboring points inside an area of 2 m × 2 m. Given the resulting roughness values for both trees and building points in selected test areas, and given the fact that tree canopies in the study area have minimum diameters of about 3 m, the roughness threshold was set to 0.5 m;
c. a Statistical Outlier Removal (SOR) algorithm was performed to remove remaining outliers. We set the number of neighbors used to compute the average distance ($\bar{d}$) to 30 and the multiplier of the standard deviation ($\sigma$) to 2. This means that the algorithm calculates the average distance to the 30 nearest neighboring points and then removes any point having a distance of more than $\bar{d} + 2\sigma$;
d. as the final step, training samples data were converted to the hdf5 (.h5) format by splitting each part of the area into blocks of size 30 × 30 m with a stride of 15 m. Based on our experiments in Area-1, these parameter values give the best accuracy. However, to ensure that the network uses an efficient spatial range, the block size may require an adjustment if applied to study areas of different characteristics.
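Steps b and c of the labeling criteria can be sketched in NumPy as below. This is an illustration under assumptions, not the tooling used in the study: the class ids, the function names, the least-squares regression plane z = ax + by + c used as the "best fitting plane", and the brute-force neighbor search are hypothetical choices, and the SOR variant shown thresholds on the global mean and standard deviation of the per-point average neighbor distances.

```python
import numpy as np

BARE_LAND, TREE, BUILDING, ROAD = 0, 1, 2, 3  # hypothetical class ids


def roughness(points, window=2.0):
    """Distance of each point to a plane fitted in its window x window neighborhood."""
    half = window / 2.0
    out = np.zeros(len(points))
    for i, p in enumerate(points):
        # Neighbors inside a square window centered on the point (XY only).
        inside = np.all(np.abs(points[:, :2] - p[:2]) <= half, axis=1)
        nbrs = points[inside]
        if len(nbrs) < 3:
            continue  # too few points to fit a plane; roughness stays 0
        # Fit z = a*dx + b*dy + c with coordinates centered on the point itself,
        # which keeps the system well conditioned for large map coordinates.
        dxy = nbrs[:, :2] - p[:2]
        design = np.column_stack([dxy, np.ones(len(nbrs))])
        (a, b, c), *_ = np.linalg.lstsq(design, nbrs[:, 2], rcond=None)
        # Perpendicular distance from the point to the fitted plane.
        out[i] = abs(c - p[2]) / np.sqrt(a * a + b * b + 1.0)
    return out


def relabel_trees(labels, rough, threshold=0.5):
    """Relabel rough building/road points as tree (step b)."""
    out = labels.copy()
    rough_candidates = np.isin(labels, [BUILDING, ROAD]) & (rough > threshold)
    out[rough_candidates] = TREE
    return out


def sor_keep_mask(points, k=30, n_sigma=2.0):
    """Statistical Outlier Removal (step c): True for points that are kept."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt(np.sum(diff ** 2, axis=-1))
    dist.sort(axis=1)
    mean_d = dist[:, 1:k + 1].mean(axis=1)  # column 0 is the distance to itself
    threshold = mean_d.mean() + n_sigma * mean_d.std()
    return mean_d <= threshold
```

The per-point loop and the full distance matrix are only practical for small samples; a spatial index would be needed for a complete tile.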

The Choice of Feature Combinations and Loss Functions
Similar to PointNet and DGCNN, each point in the input point cloud was assigned a nine-dimensional feature vector consisting of three spatial coordinates (x, y, z) and six additional features. Candidate additional features are: spectral color information (Red, Green, Blue), normalized 3D coordinates (n_x, n_y, n_z), and off-the-shelf LiDAR features (intensity, return number, and number of returns). Normalized 3D coordinates (n_x, n_y, n_z) are used as additional features by PointNet, PointNet++, DGCNN, and other networks to boost the translational invariance of the algorithm (Qi et al. 2018). Normalized 3D coordinates, in which the original point cloud coordinates are transformed to a common coordinate space (ranging from 0 to 1) by subtracting the central 3D coordinates of each tile from the x, y, z values, are expected to give global information. In the indoor case, normalized point coordinates provide a strong indication of the type of object (e.g., floors always have Z values close to 0, walls have X or Y values at 0 or 1, etc.). This study investigates the effectiveness of such normalized coordinates in the outdoor scenarios of orthogonal airborne point clouds. To evaluate the contribution of each feature, we compare four different sets of feature vectors as presented in Table 1.
Deep neural networks learn to map inputs to outputs based on the training data. At each training step, the network compares the model predictions to the actual labels to determine and improve the model performance. Typically, gradient descent is used as the optimization algorithm to minimize errors and update the current model parameters (weights and biases). To calculate the model error, a loss function is used.
A class imbalance exists in Area-1 because it is heavily dominated by buildings (59%) and trees (19%). Therefore, we investigated two different loss functions that are incorporated within the DGCNN architecture: softmax cross entropy loss and focal loss [31].

1. Softmax Cross Entropy (SCE) loss. This is a combination of a softmax activation function and cross entropy loss. Softmax is frequently appended to the last fully connected layer of a classification network. Softmax converts logits, the raw scores output by the last layer of the neural network, into probabilities in the range 0 to 1. The function exponentiates each logit and divides it by the sum of the exponentials of all logits; this ratio is the output of softmax. Cross entropy describes the loss between two probability distributions. It measures the similarity of the predictions to the actual labels of the training samples. Consider a training batch $D = \{(x_i, y_i) \mid i \in \{1, 2, \dots, M\}\}$ of M input points $x_i$, where $y_i$ is the target class (one-hot vector) of the i-th point among C classes and $y_{ij}$ is its j-th entry. $f(x_i)$ denotes the feature vector before the last fully connected layer of C classes. $W_j$ and $b_j$, $j \in \{1, 2, \dots, C\}$, represent the trainable weights and biases of the j-th class in softmax regression, respectively. Then the SCE loss is written as follows:

$$L_{SCE} = -\frac{1}{M} \sum_{i=1}^{M} \sum_{j=1}^{C} y_{ij} \log \frac{\exp\left(W_j^{\top} f(x_i) + b_j\right)}{\sum_{c=1}^{C} \exp\left(W_c^{\top} f(x_i) + b_c\right)}$$

2. Focal loss. Focal loss was introduced to address accuracy issues due to class imbalance in one-stage object detection. Focal loss is a cross entropy loss that weighs the contribution of each sample to the loss based on the classification error. The idea is that, if a sample is already classified correctly by the network, its contribution to the loss decreases. Lin et al. [31] claim that this strategy solves the problem of class imbalance by making the loss implicitly focus on problematic classes. Moreover, the algorithm weights the contribution of each class to the loss in a more explicit way using Sigmoid activation. The focal loss function for multiclass classification is defined as:

$$FL = -\sum_{i=1}^{C} \alpha_i \, y_i \, (1 - p_i)^{\gamma} \log(p_i)$$

where C denotes the number of classes; $y_i$ equals 1 if the ground-truth belongs to the i-th class and 0 otherwise; $p_i$ is the predicted probability for the i-th class; $\gamma \in [0, +\infty)$ is a focusing parameter; $\alpha_i \in [0, 1]$ is a weighting parameter for the i-th class. The loss is similar to categorical cross entropy, and they would be equivalent if $\gamma = 0$ and $\alpha_i = 1$.
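To make the relation between the two losses concrete, the sketch below computes both from raw logits with NumPy. It is a minimal illustration under assumptions: the function names are hypothetical, class probabilities come from a softmax (the sigmoid-based weighting mentioned above is not reproduced), labels are integer class ids rather than one-hot vectors, and the focal loss is averaged over the batch.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax; subtracting the row maximum avoids overflow."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def sce_loss(logits, labels):
    """Softmax cross entropy averaged over the batch."""
    probs = softmax(logits)
    true_class_prob = probs[np.arange(len(labels)), labels]
    return -np.mean(np.log(true_class_prob))


def focal_loss(logits, labels, gamma=2.0, alpha=None):
    """Focal loss: cross entropy down-weighted by (1 - p)^gamma per sample."""
    probs = softmax(logits)
    n_classes = logits.shape[1]
    # alpha holds one weight per class; all ones means no class re-weighting.
    alpha = np.ones(n_classes) if alpha is None else np.asarray(alpha, dtype=float)
    true_class_prob = probs[np.arange(len(labels)), labels]
    weights = alpha[labels] * (1.0 - true_class_prob) ** gamma
    return -np.mean(weights * np.log(true_class_prob))


rng = np.random.default_rng(0)
logits = rng.normal(size=(10, 4))        # 10 points, 4 classes
labels = rng.integers(0, 4, size=10)
loss_sce = sce_loss(logits, labels)
loss_focal_g0 = focal_loss(logits, labels, gamma=0.0)  # reduces to SCE
loss_focal_g2 = focal_loss(logits, labels, gamma=2.0)  # easy samples down-weighted
```

With γ = 0 and all α_i = 1 the focal loss coincides with the cross entropy, and for γ > 0 it is strictly smaller because every sample is scaled by a factor below one.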

Training Settings
During training, 4096 points are uniformly sampled from each training block of size 30 × 30 m (Section 4.1) to form data batches with a consistent number of points, while all points are used during testing. This study uses nine features for training; therefore, the size of the data fed into the network is 4096 × 9. We used k = 20 nearest neighbors for each point to construct the k-nearest neighbor graph. For all experiments, the final model was obtained after running 51 epochs, optimized by an Adam optimizer with an initial learning rate of 0.001, a momentum of 0.9, and a mini batch size of 16. The 3D point cloud semantic segmentation using DGCNN was performed in the High-Performance Computing (HPC) environment of Delft University of Technology, consisting of 26 computing nodes. For training, two Tesla P100-16GB GPUs were used.
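The following sketch shows one way the block cutting and fixed-size sampling described above could be arranged so that each training sample has shape 4096 × 9. It is not the study's data loader: the function names are hypothetical, the min-max scaling used for the normalized coordinates is an assumption, and padding small blocks by sampling with replacement is likewise an assumed choice.

```python
import numpy as np


def axis_origins(lo, hi, block_size, stride):
    """Start positions of blocks along one axis so that [lo, hi] is covered."""
    n = max(1, int(np.ceil((hi - lo - block_size) / stride)) + 1)
    return lo + stride * np.arange(n)


def split_into_blocks(xy, block_size=30.0, stride=15.0):
    """Index arrays of the points in each non-empty, overlapping block."""
    (x_lo, y_lo), (x_hi, y_hi) = xy.min(axis=0), xy.max(axis=0)
    blocks = []
    for x0 in axis_origins(x_lo, x_hi, block_size, stride):
        for y0 in axis_origins(y_lo, y_hi, block_size, stride):
            inside = ((xy[:, 0] >= x0) & (xy[:, 0] <= x0 + block_size) &
                      (xy[:, 1] >= y0) & (xy[:, 1] <= y0 + block_size))
            idx = np.flatnonzero(inside)
            if idx.size:
                blocks.append(idx)
    return blocks


def sample_block(block_idx, n_points, rng):
    """Draw exactly n_points indices; sparse blocks are padded by resampling."""
    replace = block_idx.size < n_points
    return rng.choice(block_idx, size=n_points, replace=replace)


def build_features(xyz, extra):
    """Stack x, y, z, three extra features, and normalized coordinates (N x 9)."""
    lo, hi = xyz.min(axis=0), xyz.max(axis=0)
    normalized = (xyz - lo) / np.maximum(hi - lo, 1e-9)  # assumed min-max scaling
    return np.hstack([xyz, extra, normalized])


rng = np.random.default_rng(0)
xyz = rng.uniform([0, 0, 0], [60, 60, 10], size=(20000, 3))  # synthetic tile
rgb = rng.uniform(0, 1, size=(20000, 3))                     # synthetic colors
features = build_features(xyz, rgb)
blocks = split_into_blocks(xyz[:, :2], block_size=30.0, stride=15.0)
sample = features[sample_block(blocks[0], 4096, rng)]        # one network input
```

A 60 × 60 m tile cut with 30 m blocks and a 15 m stride yields a 3 × 3 arrangement of overlapping blocks, so interior points appear in several training samples.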

Area-2
Area-2 uses open source point cloud data of the Dutch up-to-date Elevation Archive File, version 3 (AHN3) downloaded from PDOK [34] and a ready-made geodata webservice [35]. Each individual 3D point came labeled as one of the following classes: bare land, buildings, water, bridges, and others. The "others" class consists mostly of trees and other vegetation, but also includes objects such as railways and cars. Since points classified as bridges are sparse, in this study, bridge points were merged into the bare land class.
This study uses the same AHN3 point cloud dataset used by Soilán et al. [20]. The dataset consists of four grids of AHN3 point clouds located in the surroundings of the cities of Utrecht and Delft (see Figure 6), in which buildings are shown in yellow, water in blue, and "others" in green.

If a deep learning method such as DGCNN is successful in classifying AHN3 data based on available AHN3 labels for different areas, this opens new possibilities for cheaply labeling future versions of AHN. Thus, AHN4 or AHN5 could be automatically classified using available class labels from previous editions.

Training Set Preparation
Each grid of the AHN3 point cloud was split into 25 tiles and downsampled uniformly with a point interval of 1 m. In total, 12 tiles from 38FN1, 37EN2 and 31HZ2 were used and only eight tiles from 32CN1 were used since this grid contains a large amount of vegetation, which leads to more points due to multiple returns. The number of used points is summarized in Table 2. Apart from x, y and z coordinates, all points from AHN3 are also provided with extra attributes, such as return number, intensity, GPS time, etc. For Area-2, nine features were used, including 3D coordinates (x, y, z), LiDAR features (return number, number of returns, and intensity), and normalized coordinates (n_x, n_y, and n_z). Classification of AHN3 point clouds by the DGCNN architecture was implemented in the PyTorch framework [36].

The Choice of Block Size
For DGCNN, a point cloud needs to be split into 3D blocks with a certain block size (see Figure 7). In training mode, N points with F point features are randomly sampled from a single block and fed into the neural network. In DGCNN, the k-nn graph is dynamically updated in feature space from layer to layer. Thus, it is difficult to compute the effective range using a simple equation. What we know for sure is that the effective range is limited by the block size and affected by the size of the neighborhood of each point as defined in the k-nn graphs. Based on our tests with different k-values, k = 20 gives the best result. To investigate the feasibility of DGCNN for classification of aerial point clouds and the influence of different effective ranges, using the determined k-value (k = 20), we experimented with three different block sizes: 30, 50, and 70 m (Figure 7).
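The two ingredients that bound the effective range, the block partition and the k-nn neighborhood, can be illustrated with a minimal sketch. It is not the implementation used in this study: the function names `split_into_blocks` and `knn_indices`, the regular ground-plane grid, and the brute-force neighbor search are assumptions made for illustration only, and the toy coordinates are invented.

```python
import math
from collections import defaultdict


def split_into_blocks(points, block_size):
    """Group point indices by the square ground-plane cell (block) they fall in.

    Assumption: blocks form a regular grid in x and y; z is not used for the split.
    """
    blocks = defaultdict(list)
    for idx, (x, y, _z) in enumerate(points):
        key = (math.floor(x / block_size), math.floor(y / block_size))
        blocks[key].append(idx)
    return dict(blocks)


def knn_indices(features, k):
    """Return, for every point, the indices of its k nearest neighbors.

    Distances are Euclidean in whatever feature space is passed in, so the same
    function can be re-run on the features of each layer (brute force, O(n^2)).
    """
    neighbors = []
    for i, fi in enumerate(features):
        dists = sorted(
            (math.dist(fi, fj), j) for j, fj in enumerate(features) if j != i
        )
        neighbors.append([j for _, j in dists[:k]])
    return neighbors


# Invented toy coordinates in meters (x, y, z).
points = [(5.0, 5.0, 1.0), (45.0, 10.0, 2.0), (55.0, 10.0, 8.0), (120.0, 60.0, 0.5)]
blocks = split_into_blocks(points, block_size=50.0)
graph = knn_indices(points, k=2)
```

In this toy case the first two points share a block while the third, only 10 m away from the second, falls into the next one. A neighbor search restricted to a block can therefore never reach across the block edge, which is one way to read the statement that the effective range is limited by the block size.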

Training Settings
For the Area-2 experiment, 4096 points were randomly sampled from each block during training before being fed into the neural network. To ensure some overlap between different blocks, we used 1.5 as the sample rate for each point. This means that, to determine the number of blocks, the total number of points in the point cloud is multiplied by a sample rate of 1.5 and then divided by 4096. In the test stage, the sample rate was set to 1.0 and all points in each block were used. During training, the batch size was 8, which meant eight blocks could be processed at the same time. However, we used a batch size of only 1 during testing, since the number of points differed per block when no random sampling was used, and blocks of different sizes cannot be stacked together. The network was optimized by an Adam optimizer with an initial learning rate of 0.001, as suggested as the default in DGCNN [18]. For all experiments, the model used in testing was obtained by choosing the best model after training for 50 epochs. For Area-2, one NVIDIA GeForce RTX 2080 Ti GPU was used.
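The block-count rule described above can be written as a short sketch. The helper names are hypothetical, the tile size of 1,000,000 points is an invented example value, and both the rounding up and the fallback to sampling with replacement for small blocks are assumptions that the text does not specify.

```python
import math
import random


def number_of_blocks(total_points, sample_rate, points_per_block):
    """Number of blocks = total points x sample rate / points per block.

    Rounding up is an assumption; the text only states the multiply-and-divide rule.
    """
    return math.ceil(total_points * sample_rate / points_per_block)


def sample_block(point_indices, points_per_block, rng):
    """Randomly draw a fixed number of points from one block.

    Assumption: blocks with too few points are sampled with replacement.
    """
    if len(point_indices) >= points_per_block:
        return rng.sample(point_indices, points_per_block)
    return [rng.choice(point_indices) for _ in range(points_per_block)]


# Invented example: a tile of 1,000,000 points with the training settings above.
n_blocks = number_of_blocks(1_000_000, sample_rate=1.5, points_per_block=4096)
print(n_blocks)  # 367
```

With a sample rate of 1.0, as used in the test stage, the same invented tile would yield 245 blocks under these assumptions.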

Evaluation Metrics
Given the complexity of digital classification, selecting reliable quality metrics to assess the classification results is crucial [37]. For assessing the classification results, this study used evaluation metrics that are mainly used in deep learning and remote sensing classification research [38][39][40]. This study used different feature combinations, loss functions, and block sizes for the point cloud classification. Several selected quality metrics are described as follows:

• Overall accuracy: the percentage of correctly classified points of all classes out of the total number of reference points. This metric shows the general performance of the model and thus may provide limited information in case of class imbalance.

• Confusion matrix: a summary table reporting the number of true positives (TP), true negatives (TN), false negatives (FN), and false positives (FP) of each class. The matrix provides information on the prediction metrics per class and the types of errors made by the classification model.

• Precision, recall, and F1 score: Precision and recall are metrics commonly used for evaluating classification performance in information technology and are related to the false and true positive rates [41,42]. Recall (also known as completeness) refers to the percentage of reference points of a class that are correctly predicted by the model, while precision (also known as correctness) refers to the percentage of correctly classified points among all positive predictions. The F1 score is the harmonic mean of precision and recall and measures model accuracy. The metrics are formulated as follows:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × Precision × Recall / (Precision + Recall)
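As an illustration of how these metrics follow from a confusion matrix, the sketch below computes overall accuracy and per-class precision, recall, and F1 score using the standard definitions. The function name `classification_metrics` and the 2 x 2 toy matrix are made up for this example and are not taken from the study's tables.

```python
def classification_metrics(confusion):
    """Compute overall accuracy and per-class precision, recall, and F1.

    confusion[i][j] is the number of reference points of class i that were
    predicted as class j (rows: reference, columns: prediction).
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))
    per_class = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                       # reference c, predicted otherwise
        fp = sum(confusion[r][c] for r in range(n)) - tp  # predicted c, reference otherwise
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        per_class.append({"precision": precision, "recall": recall, "f1": f1})
    return correct / total, per_class


# Invented two-class example.
toy = [[8, 2],
       [1, 9]]
oa, per_class = classification_metrics(toy)
```

For the toy matrix, 17 of 20 points are correct (overall accuracy 0.85), while the first class has a recall of 8/10 and a precision of 8/9, which shows how the per-class values can differ from the overall figure.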

Area-1
For ALS point cloud classification in Area-1, four different feature combinations and two loss functions were compared. The total number of samples used for training was 30,929,919 points, dominated by building points (59%). The tree, bare land, and road classes account for 21%, 13%, and 7% of the points, respectively.

Results of Different Feature Combinations
To investigate the best feature combination for classifying an ALS point cloud colored by an orthophoto using a deep learning approach, three different metrics (completeness/recall, correctness/precision, and F1 score) along with overall accuracy (OA) were used. Some results are visualized in Figure 7. Table 3 shows the classification results of all predefined feature combinations and loss functions used in this study. Based on the evaluation results, Feature Set 4 achieved the highest overall accuracy (91.8%) and F1 score for all classes. In general, the use of normalized coordinate features (n_x, n_y, n_z) in combination with other features is not as effective as the combination of spectral color with LiDAR features. The use of full RGB color and off-the-shelf LiDAR features significantly improves the F1 score of trees by at least 7% and of buildings by 5.7%.
Based on the class quality metrics presented in Table 4, the potential of different feature combinations to predict different land cover classes in our test area is discussed below:

• Bare land
... which makes the precision rate higher. A combination of LiDAR intensity and normalized coordinates (Sets 2 and 3) effectively maintains a high number of points correctly classified as bare land, indicated by high recall (92.1%). On average, the road class had the lowest recall rates, while the bare land class always had the lowest precision. This indicates that there is high confusion between bare land and roads, which we assume mainly happens due to the presence of open areas with similar heights and the same color, such as parking areas, front yards, and backyards.

• Trees
Feature Set 4 obtains the highest recall and precision rates with scores of 84.3% and 93.3%, respectively. The use of both RGB color and LiDAR information in Feature Set 4 significantly increased the tree detection by almost 11% compared to the other feature sets. In general, the main source of error was trees misclassified as buildings, which particularly occurs for trees adjacent to buildings. Our results also show that more trees are misclassified as buildings than buildings detected as trees, which results in recall rates that are always lower than precision.

• Buildings
The recall and precision rates of building detection improved remarkably when using Feature Set 4. It is likely that the decreasing number of confusions between buildings and trees leads to higher building classification accuracy. One of the biggest error sources for building classification is small details on roofs and building façades that are classified as trees.

• Roads
Although the road detection accuracy is not as good as that of the other classes, the highest recall and precision rates were achieved by Feature Set 4 with scores of 81.6% and 86.9%, respectively. Using RGB and intensity (Set 4) as input features significantly improved the recall rate of roads by reducing the number of road points detected as bare land. As our study focuses on urban classification and base map generation, points on cars or trucks were labeled as roads. The results indicate that the road classification was not much affected by the presence of cars.

Figure 8 visualizes the classification results of different feature combinations over a subset of our test area in comparison to the following data sources: base map, orthophoto, LiDAR intensity, and digital surface model (DSM). The white rectangle highlights an area where most classification results fail to detect a highway and an adjacent road of different heights. Feature Set 1 resulted in a misclassification of some points on the overpass highways as buildings, and the adjacent road below the highway was falsely classified as bare land (white rectangle in Figure 8a). Because roads, buildings, and bare land have similar geometric characteristics (e.g., planarity), additionally using the LiDAR intensity feature is beneficial for increasing the road classification accuracy.
A sand pile existed in the study area due to construction work at the time of data acquisition (yellow ellipses in Figure 8). Only Feature Set 4 correctly classified the sand pile points as bare land, while the other feature sets falsely classified points on the sand pile as buildings. This suggests that using complementary airborne LiDAR and spectral orthophoto features increases detection accuracy.

Remote Sens. 2021, 13, x FOR PEER REVIEW

Figure 8 (partial caption). In (a-e), blue represents bare land, green represents trees, orange represents buildings, and red represents roads.

Results of Different Loss Functions
Even though class imbalance exists in our study area, the overall accuracy was not necessarily increased by applying a focal loss (FL) function, as might be expected. Table 5 shows the results of Feature Set 4 when using two different loss functions: SCE and FL (α = 0.2, γ = 2). The overall accuracy (OA) of the results of Feature Set 4 decreased by 3.7% when FL was used. Moreover, the F1 score for the bare land and road classes dropped by ~6% and ~15%, respectively, when FL was used. Our explanation for this is that the loss function focuses on decreasing the loss of the classes that produce large amounts of misclassified points, in this case buildings and trees, thereby somehow neglecting bare land and, notably, roads.
Based on the confusion matrix presented in Table 6, Feature Set 4 in combination with FL has the highest precision rate (86.7%) for trees, but the number of correctly detected tree points is lower than for the other feature sets. This is because FL focuses on increasing the detection rate by evaluating the errors of the dominant class so that the number of misclassified tree points decreases. For building classification, the highest recall (93.7%) was achieved when using FL, with only a 0.3% recall difference to the results obtained by SCE. For road classification, the use of FL doubled the number of false negatives compared to SCE.
The white rectangles and yellow ellipses in Figure 9 indicate (a) an area where some parts of the highway were misclassified as buildings and (b) a sand pile that was misclassified as a building when using focal loss (FL). For our purposes, the SCE loss function performed better than FL in the DGCNN architecture.
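To make the difference between the two loss functions concrete, the following sketch evaluates softmax cross entropy and focal loss for a single point, using the values α = 0.2 and γ = 2 reported above. It uses plain Python instead of the PyTorch code of the study, and treating α as one scalar shared by all classes is an assumption; the logits are invented.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Softmax cross entropy (SCE) for one point: -log(p_t)."""
    return -math.log(softmax(logits)[target])


def focal_loss(logits, target, alpha=0.2, gamma=2.0):
    """Focal loss for one point: -alpha * (1 - p_t)^gamma * log(p_t).

    Assumption: alpha is a single scalar applied to every class.
    """
    p_t = softmax(logits)[target]
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


# Invented logits for four classes; the reference class is index 0 in both cases.
easy = [4.0, 0.0, 0.0, 0.0]   # confidently correct point
hard = [0.0, 1.0, 0.0, 0.0]   # point currently assigned to the wrong class

easy_ratio = focal_loss(easy, 0) / cross_entropy(easy, 0)
hard_ratio = focal_loss(hard, 0) / cross_entropy(hard, 0)
```

The ratio of focal loss to cross entropy is much smaller for the confidently classified point than for the misclassified one, so well-classified points contribute little to training. Which classes end up dominating the remaining loss depends on the data, in line with the explanation given above for the weaker road and bare land results.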

Results on Area with Relief Displacement
One drawback of using aerial photos is the positional shift of highly elevated objects (e.g., high-rise buildings). This effect is called relief displacement and is caused by variations in the camera angle. Displacement errors increase with the height of the object and the distance to the acquisition location. Objects suffering from relief displacement in photos usually have bigger sizes in the photo than in reality, as some parts of vertical walls are exposed and buildings appear to lean in one specific direction. In aerial photo classification, relief displacement is considered as one of the main sources of mislabeling (Chen et al., 2018). Figure 10 shows a relief displacement error in an orthophoto of a leaning building that blocks a lower building and nearby trees.
ALS point clouds can be used to detect objects blocked by high-rise buildings that even a human operator cannot identify in an orthophoto. For example, part of a building in Figure 10c (highlighted by a white ellipse) was automatically and correctly detected by our method (yellow outline) but is missing in the building reference (pink outline). This means that even though we used ground orthophotos to color the point clouds, which resulted in wrongly colored points in case of relief displacement, the network we employed still classified the points correctly. It is likely that, during training, DGCNN learns to give smaller weights or big penalties to the color features where relief displacement exists, thereby favoring the geometric point cloud information.

Area-2
With a total of 109,389,471 points, Area-2 is dominated by the "others" class (57.6%). The ground and building classes account for 30.9% and 11.9%, respectively. Water has the smallest representation, with 0.6%. Table 7 summarizes the quantitative results of point cloud semantic segmentation over Area-2 for different block sizes and a fixed neighborhood size of k = 20. The best overall accuracy (93.28%) and average F1 score were achieved with a block size of 50 m. A block size of 30 m gave the lowest overall accuracy (91.7%) and per-class F1 scores, indicating that, in Area-2, more balanced results among the classes can be obtained when the block size is larger.
Considering the recall and precision values shown in Table 8, points from the "others" and ground classes were identified well for all block sizes, with high values in both recall and precision. This is not surprising considering the high number of points in both classes compared to the other categories. When the network processes bigger blocks, the predicted building points have better recall but lower precision, which means the model misses fewer building points but labels more points of other classes as buildings. For the water class, when the block size is very large (70 m), the precision is worse than the recall, indicating that the model is not accurate enough for detecting water points. Compared to the other classes, the water class always has the lowest precision and recall rates. This is because the number of water points in Area-2 is much smaller than that of the other classes.
Figure 11 illustrates the point cloud classification results for different block sizes. Points on big buildings are often classified as bare land for a block size of 30 m (see black rectangles). This is likely because, in some areas with large building roofs, the block only contains building points. As points on both building roofs and bare land share similar characteristics, a block mainly containing building points is falsely classified as bare land. In this subarea, a block size of 30 or 70 m classifies most of the water points as bare land, while a block size of 50 m results in a correct classification of most water points.
As indicated by the blue box, a large number of points from the "others" class (in this case parked cars) are labeled as buildings when the block size is 30 m. This happens, presumably, because the points on cars and buildings both have planar surfaces at different heights. With a bigger block size, most points on parked cars are correctly classified as "others". In this case, a bigger block size provides a sufficient effective spatial range for the network to recognize the differences between buildings and cars.
Other examples in Figure 12a,e, respectively, show 3D and 2D visualizations of classification results when using a block size of 30 m. Some points on building façades are labeled as "others" (see red ellipse). There is also a "block effect" highlighted by the white ellipses, where the edges of some blocks can be clearly seen. The "block effect" no longer exists and fewer points on building façades are classified as "others" when using block sizes of 50 and 70 m.
Using a larger block size, however, does not always result in better classification accuracy. Based on our results, more "others" points, typically tree points, were misclassified as buildings when using bigger block sizes. As shown in Figure 13, a block size of 30 m classifies "others" (tree) points best, while a block size of 70 m has the highest misclassification rate on "others" (trees); see the white circles. A possible explanation is that a larger block size results in a lower point density, which causes some key points and details to be missing. For example, tree canopies may appear flatter, which makes them resemble building roofs to some extent. This suggests that the network requires a higher point density to capture the essential characteristics of trees. Thus, the optimal block size for point cloud classification should be selected carefully because there is an accuracy tradeoff between different object classes; in our case, this is between trees and buildings.
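The density argument can be made tangible with a small worked calculation. It assumes square blocks in the ground plane and exactly 4096 sampled points per block, as in the training settings; the helper name is hypothetical. The result is only an upper bound on the sampled density, because the data were also downsampled to a 1 m point interval, so small blocks may contain fewer than 4096 distinct points.

```python
def sampled_points_per_square_meter(block_size_m, points_per_block=4096):
    """Upper bound on the sampled ground-plane density for a square block."""
    return points_per_block / (block_size_m ** 2)


for size in (30, 50, 70):
    print(size, round(sampled_points_per_square_meter(size), 2))
# 30 4.55
# 50 1.64
# 70 0.84
```

Under these assumptions, only the 70 m block drops below one sampled point per square meter, which is consistent with the observation that fine details such as tree canopies are represented less completely at the largest block size.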

As an additional evaluation, this study provides per-class probability distributions over the test dataset as obtained by the network with different block sizes (see Figure 14). The histograms contain the confidence level of the models in predicting the "winning" classes. In general, the histograms are consistent with the F1 score results presented in Table 7.
The model has high confidence when predicting bare land with any block size, as most of the bare land points have more than 80% confidence. The building class shows a wider range of confidence levels, between 40% and 80%. With a block size of 30 m, the building class has more points with a lower confidence (below 70%). This suggests that using a smaller block size may result in lower building accuracy.
Compared to the other classes, the histograms for the water and "others" classes differ significantly between block sizes. The water class has the lowest confidence (40-60%) when using a block size of 30 m and the highest confidence (above 90%) when using a block size of 50 m. With a block size of 70 m, the water class has the second-best confidence level, ranging between 75% and 90%. Considering that the water class has the smallest point representation, using an appropriate block size can mitigate the influence of class imbalance.
The confidence level when predicting the "others" class is high (around 80%) for block sizes of both 30 and 50 m. However, when predicting the "others" class with a block size of 70 m, the model confidence drops to values between 50% and 80%. This is consistent with our finding that using a bigger block size results in lower tree classification accuracy.
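The confidence histograms discussed above can be reproduced in principle from the network outputs. The sketch below is an assumption-laden illustration rather than the study's code: it takes the confidence of a point to be the softmax probability of its winning class, groups points by predicted class (grouping by reference class would be an equally plausible reading), and uses invented logits and hypothetical function names.

```python
import math
from collections import defaultdict


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confidence_by_class(all_logits, class_names):
    """Collect the winning-class probability of every point, grouped by predicted class."""
    grouped = defaultdict(list)
    for logits in all_logits:
        probs = softmax(logits)
        winner = max(range(len(probs)), key=probs.__getitem__)
        grouped[class_names[winner]].append(probs[winner])
    return dict(grouped)


def histogram(values, n_bins=10):
    """Count values in [0, 1] per equal-width bin; 1.0 falls into the last bin."""
    counts = [0] * n_bins
    for v in values:
        counts[min(int(v * n_bins), n_bins - 1)] += 1
    return counts


class_names = ["bare land", "buildings", "water", "others"]
# Invented network outputs for two points.
toy_logits = [[5.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.1, 0.0]]
conf = confidence_by_class(toy_logits, class_names)
bins = {name: histogram(values) for name, values in conf.items()}
```

In the toy data, the first point is assigned to bare land with high confidence, while the second is assigned to water with a probability only slightly above that of the other classes. A histogram concentrated in the lower bins therefore signals predictions that are easily flipped.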

Figure 14. Per-class probability distribution obtained by the network over the test area with different block sizes (a-c). From top to bottom: bare land (1st row), buildings (2nd row), water (3rd row), and "others" (4th row).

Conclusions and Recommendations
This study investigated the feasibility of DGCNN for ALS point cloud classification over urban areas and discussed how different settings of input feature combinations, loss function, and block size affect classification results. Several experiments on two different areas indicate that using DGCNN with proper settings is able to provide accurate results close to production requirements in classifying airborne point clouds. In Area-1, ALS point clouds colored by orthophotos were used to investigate different input feature combinations and loss functions. We labeled the training samples used for classification using the best available public vector data from a 1:1000 base map. Based on the classification results, the combination of full RGB image features and airborne LiDAR features outperforms other feature sets and significantly increases the classification accuracy by 6%. The softmax cross entropy loss function performed better than focal loss, although the latter loss function was included in the testing because of some class imbalance in our input data.
In Area-2, training samples were labeled using available class labels from the Dutch AHN distribution. We tested three different block sizes (30, 50, and 70 m) of AHN3 point clouds using LiDAR off-the-shelf input features (X, Y, Z, intensity, return number, and number of returns). A block size of 50 m provided the highest classification accuracy (93.3%) and efficiently reduced the misclassification of building points as bare land. Moreover, balanced and good F1 scores for all kinds of objects were obtained when using a block size of 50 m. There was a trade-off between building and tree (class "others") classification accuracy when increasing the block size. This implies that, to classify outdoor point clouds using DGCNN or other PointNet-based deep learning architectures, block size is a crucial parameter to be carefully tuned.
In our experiments, Area-2 (91-93%) achieved a higher classification accuracy than Area-1 (83.9%) when using LiDAR input features. This could be related to the number of training samples used for Area-2, which is twice as large as for Area-1.
Further research should include the development of a procedure for selecting the optimal input features and block size. Such a procedure should largely replace the current empirical approach to increasing deep learning classification accuracy. Our research indicates that 3D deep learning has matured so much that it is now able to extract geometric information as required for digital maps or digital point cloud repositories at near-operational quality, but in a much shorter time than traditional workflows. Comparisons to other machine learning approaches, which also include computational costs, would be interesting to study. Furthermore, the applicability of our method to data representing other cities and countries, as well as possible extensions to rural environments, are beneficial directions for future research.
Author Contributions: E.W. designed the workflow and was responsible for the main structure and writing of the paper. E.W., Q.B. and M.K.F. conducted the experiments and discussed the results described in the paper. R.C.L. provided comments and edits on the paper. All authors have read and agreed to the published version of the manuscript.