Deep-Learning-Based Classiﬁcation for DTM Extraction from ALS Point Cloud

: Airborne laser scanning (ALS) point cloud data are suitable for digital terrain model (DTM) extraction given its high accuracy in elevation. Existing ﬁltering algorithms that eliminate non-ground points mostly depend on terrain feature assumptions or representations; these assumptions result in errors when the scene is complex. This paper proposes a new method for ground point extraction based on deep learning using deep convolutional neural networks (CNN). For every point with spatial context, the neighboring points within a window are extracted and transformed into an image. Then, the classiﬁcation of a point can be treated as the classiﬁcation of an image; the point-to-image transformation is carefully crafted by considering the height information in the neighborhood area. After being trained on approximately 17 million labeled ALS points, the deep CNN model can learn how a human operator recognizes a point as a ground point or not. The model performs better than typical existing algorithms in terms of error rate, indicating the signiﬁcant potential of deep-learning-based methods in feature extraction from a point cloud.


Introduction
In recent decades, airborne laser scanning (ALS) has become more important in the process of digital terrain model (DTM) production [1]. ALS can provide a description of a surface on a terrain with high accuracy and density. However, ALS also records the information of non-terrain objects, such as buildings and trees. Thus, the ALS filtration is important in the processing of ALS data. Given the various non-ground objects on the surface and the lack of topology among the points, filtering of the ALS point cloud can be difficult and troublesome. In fact, point filtering often occupies approximately 80% of the workload of ALS data processing in DTM production. The algorithms of ALS filtration can be divided into three categories based on their characteristics, as follows: (1) Slope-based methods. The kernel foundation of these methods considers that two adjacent points are likely to belong to different categories if they have a mutation in height [2,3]. Slope-based methods are fast and easy to implement. Their shortcoming is their dependency on different thresholds in different terrains. (2) Mathematical morphology-based methods. These methods are composed of a series of 3D morphological operations on the ALS points. The results of morphological methods heavily rely on the filter window size. Small windows can only filter small non-ground objects, such as telegraph poles or small cars. By contrast, large windows often filter several ground points and make the results of filtration smooth. Zhang [4] proposed progressive morphological filters, 2 of 16 which can filter large non-ground objects with ground points preserved by varying the filter window size, to overcome this problem. (3) Progressive triangular irregular network (TIN)-based method. Axelsson [5] proposed the iterative TIN; this network has been used in some business software. The TIN selects the coarse lowest points as ground points and builds a triangulated surface from them. Then, the TIN adds new points to the triangular surface under many constrains for slope and distance. However, the method is easily affected by negative outliers; these outliers draw the triangular surface downward. (4) Surface-based methods. These methods maintain a surface model of the ground based on the interpolation of ground points [6][7][8][9]. However, these methods are sensitive to input parameters and negative outliers.
Other recent algorithms try to use optimization to obtain accurate classification. For instance, semi-global filtering (SGF) [10] employs a novel energy function balanced by adaptive ground saliency to adapt to steep slopes, discontinuous terrains, and complex objects. Then, the SGF uses semi-global optimization to determine labels by minimizing the energy.
Although the existing methods have done well in ALS filtration, they still need much human labor to generate DTM based on the filtration results. We want to make full use of the existing ALS and responding DTM by learning a deep neural network from a big amount of the existing data. Neural networks has been used in pattern recognition and classification for a long time [11,12]. The deep convolutional neural networks (CNN) [13] are inspired by biological vision systems; these networks have recently shown their ability to extract high-level representations through compositions of low-level features [14]. In the present study, we propose a new filtering algorithm based on deep CNN. First, training samples are obtained from many labeled points. Each image is generated from the point and its neighboring points; the image can be a positive or negative training sample depending on the label. Second, a deep CNN model is trained using the labeled data. Images generated from points are treated as input of the deep CNN model. Then, the input will be processed by several components being comprised of a convolution layer, a batch Normalization layer, an activation layer and a pooling layer after some components. At last, the results of the last pooling layer will be connected to subsequent three fully-connected layers, the last fully-connected layer will produce the probability for the input to be a ground point or a non-ground point. Detailed construction of deep CNN model can be seen in Section 2.2. The deep CNN model can learn the important feature of the input automatically from the huge training data, which usually work better than hand-craft features. Finally, each point is mapped to an image to classify a raw ALS point cloud; this image is classified as an image belonging to a ground point or is not used by the trained CNN model. This paper is organized as follows: Section 2 describes the proposed method. Section 3 presents the ALS filtration results and analysis. Section 3.3 compares the proposed method with other methods. Section 4 concludes this study and identifies several aspects for improvement.

Methods
The workflow of our approach for filtering is shown in Figure 1. ALS filtering means to find and delete all non-ground points from ALS data. We treat filtering as a binary classification problem to classify all the points of ALS data as ground points or non-ground points. The major steps include the calculation of context information for each point from the neighboring points in a window, the transformation of the information of the window into an image, and the training and classification based on the images using the CNN model. Training sample points are selected from a large number of point clouds with different terrain complexities. A deep CNN model is trained from the labeled images.

Information Extraction and Image Generation
For each ALS point ( i P ), its surrounding points within its "square window" are divided into many cells. The "square window" means a square in (x, y) spatial coordinates. It is a two dimensional window. In the method, the size of "square window" is 96 m × 96 m, which is divided into 128 × 128 cells. Each "square window" extracted for a point can be transferred to a 128 × 128 image by mapping each cell to a pixel with red, blue, and green colors. For each cell, the maximum  (2); these integers would be the red, green, and blue values of the corresponding pixel in the image transformed from the cells, as follows: Figure 1. Workflow of the proposed approach. "T" means the samples of ground points and "F" means the samples of non-ground points.

Information Extraction and Image Generation
For each ALS point (P i ), its surrounding points within its "square window" are divided into many cells. The "square window" means a square in (x, y) spatial coordinates. It is a two dimensional window. In the method, the size of "square window" is 96 m × 96 m, which is divided into 128 × 128 cells. Each "square window" extracted for a point can be transferred to a 128 × 128 image by mapping each cell to a pixel with red, blue, and green colors. For each cell, the maximum (Z max ), minimum (Z min ), and mean (Z mean ) of the height among all points within the cell are obtained. Then, the difference values between Z max , Z min , and Z mean and the height (Z i ) of point (P i ) are transferred to three integers within 0 to 255 following Equations (1) and (2); these integers would be the red, green, and blue values of the corresponding pixel in the image transformed from the cells, as follows: The sigmoid function is expressed in Equation (2), as follows: An example for the point-to-image transformation is shown in Figure 2.
The sigmoid function is expressed in Equation (2), as follows: x Sigmoid x e − − = + (2) An example for the point-to-image transformation is shown in Figure 2.

Convolutional Neural Network
CNNs have been the focus of considerable attention for a wide range of vision-related [15][16][17][18], audio-related [19], or language-related [20] tasks. The existing best-performing models [21][22][23] on ImageNet ILSVRC have all been based on deep CNNs since 2012. CNNs are designed to process data that come in the form of multiple arrays, such as 1D arrays for signals like language and 2D arrays for images or audio spectrograms. CNNs have four key ideas, namely, local connections, shared weights, pooling, and use of many layers [24]. More detailed description of CNN can be found in [25].
The architecture of the CNN model used in our approach is shown in Figure 3.

Convolutional Neural Network
CNNs have been the focus of considerable attention for a wide range of vision-related [15][16][17][18], audio-related [19], or language-related [20] tasks. The existing best-performing models [21][22][23] on ImageNet ILSVRC have all been based on deep CNNs since 2012. CNNs are designed to process data that come in the form of multiple arrays, such as 1D arrays for signals like language and 2D arrays for images or audio spectrograms. CNNs have four key ideas, namely, local connections, shared weights, pooling, and use of many layers [24]. More detailed description of CNN can be found in [25].
The architecture of the CNN model used in our approach is shown in Figure 3.
The sigmoid function is expressed in Equation (2), as follows: x Sigmoid x e − − = + (2) An example for the point-to-image transformation is shown in Figure 2.

Convolutional Neural Network
CNNs have been the focus of considerable attention for a wide range of vision-related [15][16][17][18], audio-related [19], or language-related [20] tasks. The existing best-performing models [21][22][23] on ImageNet ILSVRC have all been based on deep CNNs since 2012. CNNs are designed to process data that come in the form of multiple arrays, such as 1D arrays for signals like language and 2D arrays for images or audio spectrograms. CNNs have four key ideas, namely, local connections, shared weights, pooling, and use of many layers [24]. More detailed description of CNN can be found in [25].
The architecture of the CNN model used in our approach is shown in Figure 3.   Deep CNN model is comprised of 6 kinds of layers, the size of layers can be defined as width × height × depth in which width × height describes the spatial size and depth refers to the number of channels of its feature maps. The detailed explanation of the layers in Figure 3 can be seen below: (1) An input layer is denoted as Input here. The input layer contains the input data for the network.
The input of the deep CNN model is a three-channel (red, green, blue) 128 × 128 image generated from an ALS points. (2) A convolution layer is denoted as Conv here. The convolution layer is the core building block of a convolutional network that performs most of the computational heavy lifting. Convolutional layers convolve the input image or feature maps with a learnable linear filter, which have a small receptive field (local connections) but extend through the full depth of the input volume. The output feature maps represent the responses of each filter on the input image or feature maps. Each filter is replicated across the entire visual field and the replicated unit share the same weights and bias (shared weights), which allows for features to be detected regardless of their position in the visual field. As a result, the network learns filters that activate when they see a specific type of feature at some spatial position in the input. In our model, all of the convolution layers use the same sized 3 × 3 convolution kernel. (3) A batch normalization layer is denoted as BN here. Given that the deep CNN often has a large number of parameters, taking care to prevent overfitting is necessary, particularly when the number of training samples is relatively small. Batch normalization [26] normalizes the data in each mini-batch, rather than merely performing normalization once at the beginning, using the following equation: The input of BN is normalized to zero mean and unit variance and then linearly transformed. During training, µ and σ 2 are the mean and variance of the input mini-batch. During testing, µ and σ 2 are the average statistics calculated from the training data. γ and β are learned parameters which scale and shift the normalized value. ε is a constant added to the mini-batch variance for numerical stability.
Batch normalization can significantly reduce overfitting, allow higher learning rates and accelerate the training for deep network.
(1) A rectified linear units layer is denoted as ReLU here. Activation layers are neuron layers that apply nonlinear activations on input neurons. They increase the nonlinear properties of the decision function and of the overall network without affecting the receptive fields of the convolution layer. Rectified linear units (ReLU) proposed by Nair and Hinton in 2010 [27] is the most popular activation function. ReLU can be trained faster than typical smoother nonlinear functions and allows the training of a deep supervised network without unsupervised pretraining. The function of ReLU can be demonstrated as f (x) = max(0, x). (2) A pooling layer is denoted as Pooling here. Pooling layers are nonlinear downsampling layers that achieve maximum or average values in each sub-region of input image or feature maps. The intuition is that once a feature has been found, its exact location is not as important as its rough location relative to other features. Pooling layers increase the robustness of translation and reduce the number of network parameters. (3) A fully-connected layer is denoted as FC here. After several convolutional and max pooling layers, high-level reasoning in the neural network is performed via fully-connected layers. A fully-connected layer takes all neurons in the previous layer and connects it to every single neuron it has. Fully-connected layers are not spatially located anymore, thereby making them suitable for classification rather than location or semantic segmentation.
A BN and a ReLU are applied after every conv layer and the first two FC layers. Thus, layers 1 to 6 are composed of Conv → BN → ReLU and layers 7 and 8 are composed of FC → BN → ReLU . Pooling layers are applied after layers 1, 2, and 6. The output of the last FC layer is fed to a 2-way softmax, which produces a distribution over the 2 class labels. Our network maximizes the multinomial logistic regression objective.
To train the model of the deep CNN, over 150 million parameters need to be learned. Two measures are taken to avoid overfitting: huge amount of training data and batch normalization layers which are proven to be effective.

Experimental Data
A total of 17,280,000 labeled points are sampled evenly from 900 airborne ALS datasets in south China to be used to train a general model tested in variety types of terrains to evaluate the proposed approach. Each dataset has an area size of 500 m by 500 m and an average density of 4 points/m 2 . Moreover, 40 scenes outside the training areas and the International Society for Photogrammetry and Remote Sensing (ISPRS) benchmark datasets provided by the ISPRS Commission III/WG2 [28] are classified to validate the trained CNN model. The 40 scenes have the same area size as the training data and approximately 40 million points. All training and testing ground truths are produced by a procedure of DTM production, including automatic filtering by TerraScan software and post manual editing. Examples of the training dataset and feature maps of the training samples are shown in Figures 4 and 5, respectively.
It is easy to see from Figure 5 that as the ground points usually being lower than their surrounding points while non-ground points more probable being higher than their surrounding points, most of the F red , F green and F blue calculated from surrounding cells of ground points by Equation (1) are much bigger than non-ground points, which causes that the feature images of ground points are much brighter than non-ground points. layer is fed to a 2-way softmax, which produces a distribution over the 2 class labels. Our network maximizes the multinomial logistic regression objective. To train the model of the deep CNN, over 150 million parameters need to be learned. Two measures are taken to avoid overfitting: huge amount of training data and batch normalization layers which are proven to be effective.   It is easy to see from Figure 5 that as the ground points usually being lower than their surrounding points while non-ground points more probable being higher than their surrounding points, most of the red F , green F and blue F calculated from surrounding cells of ground points by

Experimental Data
Equation (1) are much bigger than non-ground points, which causes that the feature images of ground points are much brighter than non-ground points.

Training
Batch gradient descent with a batch size of 256 examples, momentum of 0.9, and weight decay of 0.0005 to estimate the CNN parameters is used for the training. To find a local minimum of a function, gradient descent takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point. In batch gradient descent, the gradient is approximately estimated by the mini-batch in each iteration.
The loss function of the CNN model can be calculated as: where m is the size of batch 256, and k is the number of classes (in here k = 2 because there are 2 classes, ground points and non-ground points), w is the parameters of the model, x is the output of the upper layer and for the first hidden layer, and x is the input layer.
where t is the iteration index, mini-batch t D is the m training samples which will be used to estimate the gradient in this iteration, v is the momentum variable, ε is the learning rate, and

Training
Batch gradient descent with a batch size of 256 examples, momentum of 0.9, and weight decay of 0.0005 to estimate the CNN parameters is used for the training. To find a local minimum of a function, gradient descent takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point. In batch gradient descent, the gradient is approximately estimated by the mini-batch in each iteration.
The loss function of the CNN model can be calculated as: where m is the size of batch 256, and k is the number of classes (in here k = 2 because there are 2 classes, ground points and non-ground points), w is the parameters of the model, x is the output of the upper layer and for the first hidden layer, and x is the input layer. y (i) is the label of training sample i. The value of 1 y (i) = j equals 1 while y (i) = j and 0 otherwise. The update rule for weight w was: where t is the iteration index, mini-batch D t is the m training samples which will be used to estimate the gradient in this iteration, v is the momentum variable, ε is the learning rate, and ∂L ∂w W t D t is the average over the t th batch D t of the derivative of the objective loss function with respect to W, evaluated at W t [21]. The training of the CNN model takes approximately three weeks on a PC with Intel i7-4790 CPU, 32 GB RAM, and a NIVIDIA GTX TitanX GPU.

Results and Comparison with Other Filtering Algorithms
We compare the deep CNN model with the popular commercial software TerraSolid TerraScan, Mongus's parameter-free ground filtering algorithm in 2012 [1], SGF [10], Axelsson's algorithm, and Mongus's connected operators-based algorithm in 2014 [29] on the ISPRS benchmark dataset. TerraScan uses the TIN-based filtering method; this software produces a significantly low average total error when a set of tunable parameters of the data is processed using the algorithm. The classification that CNN used for this test over two datasets is the one trained by the 900 airborne ALS datasets in south China to challenge the versatility of the CNN.
The filtering accuracy is measured based on the Type I error, which is the percentage of rejected bare ground points; Type II error, which is the percentage of accepted non-ground points; and total error, which is the overall probability of points being incorrectly classified. The results are shown in Table 1 and Figures 6-8. The classification using deep CNN model takes approximately 200 s on a test ALS dataset with a million points using a computer with an i7-4790 CPU and a TitanX GPU.
We also compare deep CNN model with TerraScan on 40 cases from the test ALS data with various terrain complexities in the aspects of both error rates and root mean square error of DTM. The comparison of error rates is shown in Figure 7 and comparison of root mean square error (RMSE) between the generated DTM with the ground truths is shown in Figure 8.
is the average over the t th batch t D of the derivative of the objective loss function with respect to W , evaluated at t W [21].
The training of the CNN model takes approximately three weeks on a PC with Intel i7-4790 CPU, 32 GB RAM, and a NIVIDIA GTX TitanX GPU.

Results and Comparison with Other Filtering Algorithms
We compare the deep CNN model with the popular commercial software TerraSolid TerraScan, Mongus's parameter-free ground filtering algorithm in 2012 [1], SGF [10], Axelsson's algorithm, and Mongus's connected operators-based algorithm in 2014 [29] on the ISPRS benchmark dataset. TerraScan uses the TIN-based filtering method; this software produces a significantly low average total error when a set of tunable parameters of the data is processed using the algorithm. The classification that CNN used for this test over two datasets is the one trained by the 900 airborne ALS datasets in south China to challenge the versatility of the CNN.
The filtering accuracy is measured based on the Type I error, which is the percentage of rejected bare ground points; Type II error, which is the percentage of accepted non-ground points; and total error, which is the overall probability of points being incorrectly classified. The results are shown in Table 1 and Figures 6-8. The classification using deep CNN model takes approximately 200 s on a test ALS dataset with a million points using a computer with an i7-4790 CPU and a TitanX GPU.  We also compare deep CNN model with TerraScan on 40 cases from the test ALS data with various terrain complexities in the aspects of both error rates and root mean square error of DTM. The comparison of error rates is shown in Figure 7 and comparison of root mean square error (RMSE) between the generated DTM with the ground truths is shown in Figure 8.
(a) We also compare deep CNN model with TerraScan on 40 cases from the test ALS data with various terrain complexities in the aspects of both error rates and root mean square error of DTM. The comparison of error rates is shown in Figure 7 and comparison of root mean square error (RMSE) between the generated DTM with the ground truths is shown in Figure 8.  The comparison of total error over 40 various complex terrains can be seen in Table 2   The tests strongly indicate that the CNN model produces significantly low Type I error; this result indicates only a slightly tedious manual post-editing for DTM production because removing the non-ground points (Type II error) is usually easier than finding the incorrectly rejected ground points. The proposed CNN-based classification can generate high-quality DTM, particularly to retain subtle, micro, and steep terrains; existing handcrafted algorithms may produce more Type I error.  The comparison of total error over 40 various complex terrains can be seen in Table 2   The tests strongly indicate that the CNN model produces significantly low Type I error; this result indicates only a slightly tedious manual post-editing for DTM production because removing the non-ground points (Type II error) is usually easier than finding the incorrectly rejected ground points. The proposed CNN-based classification can generate high-quality DTM, particularly to retain subtle, micro, and steep terrains; existing handcrafted algorithms may produce more Type I error. The comparison of total error over 40 various complex terrains can be seen in Table 2 Figure 11 shows that deep CNN model conserve the terrain feature well even in the mountains which often has little ground points caused by the shield of trees. The ground hit density in mountain area of Figure 11 is about 1-2 ground points per square meter while the ground hit density in plain area in the same data is about 4-6 ground points per square meter.
However, there are still some cases that deep CNN model cannot deal with very well, as shown in Figure 13. The tests strongly indicate that the CNN model produces significantly low Type I error; this result indicates only a slightly tedious manual post-editing for DTM production because removing the non-ground points (Type II error) is usually easier than finding the incorrectly rejected ground points. The proposed CNN-based classification can generate high-quality DTM, particularly to retain subtle, micro, and steep terrains; existing handcrafted algorithms may produce more Type I error. Figures 10 and 12 show that deep CNN model does well in some big scale non-ground situations which are hard for TerraScan such as buildings and farmlands. Figure 11 shows that deep CNN model conserve the terrain feature well even in the mountains which often has little ground points caused by the shield of trees. The ground hit density in mountain area of Figure 11 is about 1-2 ground points per square meter while the ground hit density in plain area in the same data is about 4-6 ground points per square meter.
However, there are still some cases that deep CNN model cannot deal with very well, as shown in Figure 13.
The CNN model generates some wrong results (type II error) as shown in Figure 13. This is mainly because these wrong non-ground points belong to very low (down to 10 cm) man-made structures, which are too close to the ground. Modifying points-image transformation to represent shape and structure information may improve the accuracy in the similar cases. model conserve the terrain feature well even in the mountains which often has little ground points caused by the shield of trees. The ground hit density in mountain area of Figure 11 is about 1-2 ground points per square meter while the ground hit density in plain area in the same data is about 4-6 ground points per square meter.
However, there are still some cases that deep CNN model cannot deal with very well, as shown in Figure 13. The CNN model generates some wrong results (type II error) as shown in Figure 13. This is mainly because these wrong non-ground points belong to very low (down to 10 cm) man-made structures, which are too close to the ground. Modifying points-image transformation to represent shape and structure information may improve the accuracy in the similar cases.

Conclusions
To the best of our knowledge, this study is the first one that reports using deep CNN-based classification for DTM extraction from ALS data. Relative elevation differences between each point and its surrounding points are extracted and transformed into an image representing the point feature. Then, the deep CNN model is used to train and classify the images. Each point to be processed can be classified as a ground or non-ground point by the trained deep CNN. A total of 40 ALS point clouds with 40 million points and the ISPRS benchmark dataset with various scene complexities and terrain types are tested using one CNN trained by 17 million labeled points. The results show the high accuracy of the proposed method. The developed method provides a general framework for ALS point cloud classification.
However, the drawback of deep-learning-based methods is that they usually require large labeled data and powerful computational resources. Future work should focus on better point-image transformation and making more compact classification through the deep CNN model to improve the training and classification. Future work should also perform tests on larger datasets.

Conclusions
To the best of our knowledge, this study is the first one that reports using deep CNN-based classification for DTM extraction from ALS data. Relative elevation differences between each point and its surrounding points are extracted and transformed into an image representing the point feature. Then, the deep CNN model is used to train and classify the images. Each point to be processed can be classified as a ground or non-ground point by the trained deep CNN. A total of 40 ALS point clouds with 40 million points and the ISPRS benchmark dataset with various scene complexities and terrain types are tested using one CNN trained by 17 million labeled points. The results show the high accuracy of the proposed method. The developed method provides a general framework for ALS point cloud classification.
However, the drawback of deep-learning-based methods is that they usually require large labeled data and powerful computational resources. Future work should focus on better point-image transformation and making more compact classification through the deep CNN model to improve the training and classification. Future work should also perform tests on larger datasets.