A Novel LiDAR Data Classification Algorithm Combined CapsNet with ResNet

LiDAR data contain feature information such as the height and shape of ground targets and play an important role in land classification. Convolutional neural networks (CNNs) are very effective at extracting features from LiDAR data; however, a CNN cannot adequately resolve the spatial relationships between features. The capsule network (CapsNet) can identify spatial variations of features and is widely used in supervised learning. In this article, CapsNet is combined with the residual network (ResNet) to design a deep network, ResCapNet, for improving the accuracy of LiDAR classification. The capsule network represents features by vectors, which can account for the direction of a feature and the relative positions between features, so more detailed feature information can be extracted. ResNet protects the integrity of information by passing the input directly to the output, which, to a certain extent, solves the problem of network degradation caused by information loss in the traditional CNN propagation process. Two different LiDAR data sets and several classic machine learning algorithms are used for comparative experiments. The experimental results show that the ResCapNet proposed in this article improves the performance of LiDAR classification.


Introduction
LiDAR was introduced in the 1980s and successfully measured the lunar surface for the American Apollo missions. Because of its huge technical potential, many researchers have studied it, continuously promoting the development and progress of its theory and technology. Thus, it has become an indispensable detection technology in science and engineering. LiDAR has many advantages, such as high resolution, good concealment, and strong anti-interference ability, and it is widely used in many different fields. For example, it can improve the measurement accuracy of projects that are difficult to measure in construction engineering [1]; it can build 3D models of historical buildings to record information about cultural relics; it can detect underwater distances to provide data for environmental protection programs [2]; and it can be used to detect landslides and other disasters [3]. In recent years, deep learning has developed rapidly and achieved remarkable results in various fields [4][5][6][7]. Therefore, this article also uses deep learning algorithms for pixel-level classification of LiDAR data.
The data used in this article are LiDAR-derived rasterized Digital Surface Models (LiDAR-DSM), which were obtained by denoising and rasterizing the point cloud data acquired from an airborne LiDAR system [8]. LiDAR-DSM mainly captures the terrain variation of the target area.
In addition, for the traditional CNN, as the depth of the network increases, the performance of the network may degrade; that is, when the training accuracy tends to plateau, the training error becomes larger. The residual network (ResNet) [40] was proposed to solve this problem. ResNet establishes a bypass connection and sends the input directly to the output to avoid the loss of information and mitigate the degradation of the network. ResNet has shown significant benefits in many areas. In 2018, Mou et al. proposed a novel network architecture, a fully Conv-Deconv network, for unsupervised spectral-spatial feature learning of hyperspectral images, which can be trained in an end-to-end manner [41]. In the same year, Zhong et al. designed an end-to-end spectral-spatial residual network (SSRN) that takes raw 3-D cubes as input data without feature engineering for hyperspectral image classification [42], and Qin et al. proposed a leukocyte classifier built on a deep residual neural network, which can imitate a domain expert's cell recognition process and extract salient features robustly and automatically [43]. In 2019, Paoletti et al. presented a new deep CNN architecture specially designed for HSI data, which seeks to improve the spectral-spatial features uncovered by the convolutional filters of the network [44]. Zhan et al. proposed an attention residual learning convolutional neural network (ARL-CNN) model for skin lesion classification in dermoscopy images, which is composed of multiple ARL blocks, a global average pooling layer, and a classification layer [45].
We combine the advantages of ResNet and CapsNet to design the ResCapNet to obtain more detailed information of LiDAR data for classification applications. The main contributions of this article are as follows.
(1) Combine the CapsNet and ResNet to form a new network framework named ResCapNet.
The input features are extracted using ResNet, and the outputs of ResNet are sent to CapsNet for further classification.
(2) The proposed method is tested on two different LiDAR data sets to predict, for each pixel, the land type associated with that pixel when the number of training samples is limited.
The organization of this article is as follows. Sections 2 and 3 present the CapsNet and ResNet, respectively. Section 4 is dedicated to the details of the proposed classification method, and Section 5 reports the experimental results and analysis. Section 6 concludes the article.

Capsule Network
The CapsNet is made up of capsules rather than neurons. A capsule is a small group of neurons that examines a particular object, such as a rectangle, and learns from a certain area of the feature maps. The output of a capsule is an n-dimensional vector. The length of each vector represents the estimated probability that the object exists, and the direction of each vector records the pose parameters of the object, such as its exact position, rotation, thickness, inclination, and size. If the object changes slightly, such as by moving, rotating, or changing size, the CapsNet will produce an output vector of the same length but with a slightly different direction. Therefore, the feature extraction of CapsNet is not affected by spatial changes of the features. Traditional CNNs require additional components to identify each detail of an object automatically, whereas CapsNet can directly represent the hierarchical structure between the detail parts. CapsNet has two main characteristics: the first is layer-based compression, and the second is dynamic routing.

Layer-Based Compression
As shown in Figure 1, both the input u_i and the output v_j are vectors. The transformation matrix W_ij is multiplied by the output u_i of the previous capsule to turn u_i into the prediction vector û_(j|i). Then, as in Equations (1) and (2), the weighted sum s_j is calculated according to the weights c_ij. The coupling coefficient c_ij is computed through the iterations of the dynamic routing process, and the sum of c_ij over j is 1. c_ij measures how likely capsule i is to activate capsule j.
The activation function applied to s_j is squash instead of ReLU, so the length of the final output vector v_j of the capsule is between 0 and 1. This function compresses small vectors toward zero and large vectors toward unit vectors. The squash activation function is shown in Equation (3).
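The squash non-linearity of Equation (3) can be sketched as follows. This is a minimal NumPy illustration, not the authors' code; the small epsilon is an addition for numerical stability:

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Equation (3): scale vector s so its length lies in [0, 1).
    Short vectors are compressed toward zero; long vectors approach unit length."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    scale = sq_norm / (1.0 + sq_norm)          # length factor in [0, 1)
    return scale * s / np.sqrt(sq_norm + eps)  # rescaled unit direction
```

For example, a vector of length 100 keeps its direction but is scaled to a length just under 1, while a vector of length 0.001 is compressed almost to zero.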

Dynamic Routing
A capsule calculates its output by computing the intermediate coupling coefficients c_ij through iterative dynamic routing. In Equations (1) and (2), the prediction vector û_(j|i) is the prediction (vote) from capsule i and has an impact on the output of capsule j. If the activation vector has a high similarity with the prediction vector, the two capsules are highly correlated. This similarity is measured by the scalar product of the prediction vector and the activation vector.
Therefore, in Equation (4), the similarity score b_ij considers both the possibility that a feature exists and the attributes of the feature, unlike neurons, which only consider the possibility of existence. At the same time, if the activation u_i of capsule i is very low, then, since the length of û_(j|i) is proportional to u_i, b_ij will still be low; that is, if the capsule of a detail feature is not activated, the correlation between the detail feature and the overall feature is very low. The coupling coefficient c_ij is calculated by the softmax of b_ij in Equation (5).
Dynamic routing is not a complete replacement for backpropagation. The transformation matrix W_ij is still trained by backpropagation, while dynamic routing is only used to calculate the outputs of the capsules; c_ij quantifies the connection between a child capsule and its parent capsule. The logits b_ij are re-initialized to 0 for each data point before performing the dynamic routing calculation [43].
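The routing procedure described above (the agreement update of Equation (4) and the softmax of Equation (5)) can be sketched as follows. This is an illustrative NumPy version for a single example; the shapes are chosen for the sketch, not taken from the article:

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Equation (3): compress vector length into [0, 1)."""
    sq = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)

def dynamic_routing(u_hat, num_iters=3):
    """u_hat: prediction vectors û_(j|i), shape (num_in, num_out, dim).
    Returns the output capsules v_j, shape (num_out, dim)."""
    num_in, num_out, _ = u_hat.shape
    b = np.zeros((num_in, num_out))  # logits b_ij start at 0 for each example
    for _ in range(num_iters):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # Eq. (5): softmax over j
        s = (c[:, :, None] * u_hat).sum(axis=0)               # weighted sum s_j
        v = squash(s)                                          # Eq. (3): output v_j
        b = b + np.einsum('iod,od->io', u_hat, v)             # Eq. (4): agreement û·v
    return v
```

The article sets the number of routing iterations to 3, which is the default here.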

Residual Network
Deep convolutional networks integrate features at different levels, such as global features and detail features, and the feature hierarchy can be enriched by deepening the network. Therefore, a deeper network structure is generally used to obtain more detailed features. However, a traditional CNN suffers from degradation when too many layers are used: when the network reaches a certain depth and becomes too complicated, the accuracy saturates and then decreases rapidly.
ResNet was proposed by He et al. in 2015 [40]. Because deep hierarchical networks contain many redundancies, ResNet is designed to optimize the network layers. The aim of ResNet is to complete an identity mapping and ensure that the input and output of an identity layer are the same; the identity layers of the network are determined automatically through training. ResNet turns several layers of the original network into a residual block.
The specific structure of the residual block is shown in Figure 2, where x is the input of the residual block and F(x) is the residual: the output of the first layer's linear transformation and activation followed by the second layer's linear transformation. The input x is then added to F(x), and the sum is activated by ReLU to obtain the output. The path that adds the initial input x to the output of the second layer is called a shortcut connection. Establishing a direct channel between the input and the output lets the parameterized layers focus on learning the residual between the input and the output.
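The residual block logic can be illustrated with a minimal fully connected stand-in. The article's blocks use convolution and batch normalization, which are omitted here for brevity; `Ws` is the optional projection used when the input and output dimensions differ:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, b1, W2, b2, Ws=None):
    """y = ReLU(F(x) + shortcut(x)), where F is a two-layer transform.
    If Ws is given, the shortcut projects x to the output dimension."""
    f = W2 @ relu(W1 @ x + b1) + b2          # residual branch F(x)
    shortcut = x if Ws is None else Ws @ x   # identity or projected shortcut
    return relu(f + shortcut)
```

Note that if the residual branch weights are zero, the block reduces to the identity (followed by ReLU), which is why stacking such blocks does not degrade a shallower network's solution.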

ResCapNet for LiDAR Classification
The proposed method is shown in Figure 3. The network structure consists of two parts: the upper part is ResNet for extracting features, and the lower part is CapsNet for classification. The residual operation is shown in Equations (6)-(8), where σ in Equation (6) represents the non-linear ReLU function. In Equation (7), y is the common output of the shortcut and the second ReLU. In Equation (8), when the input and output dimensions need to change, such as when the number of channels changes, a linear transformation W_s can be applied to x by the shortcut operation.


Proposed Network Structure

We adopt the structure of ResNet-34 and modify it to fit LiDAR data. ResNet-34 consists of four parts, which contain three, four, six, and three identity blocks, respectively; the identity blocks in the four parts have 64, 128, 256, and 512 filters, respectively. In the experiments of this article, because the input size is small, we reduced the kernel size of the first convolution layer from 7 to 3 to ensure that the network can extract useful information. Meanwhile, we reduced the number of filters of the identity blocks in the four parts to 16, 28, 40, and 52, respectively, and no output classification layer is used. Figure 4 shows the identity block used in this article, which consists of two convolutional layers and two batch normalization (BN) layers. The number of dynamic routing iterations in the digit caps is set to 3 for both data sets. The kernel size of the convolution in the primary caps is 3 × 3 and the number of channels is set to 3. Because there are seven land classes in the Bayview Park data set, the number of vectors in the primary caps and digit caps are both set to 7, and the number of capsules in the digit caps is also set to 7. Meanwhile, there are 11 land classes in the Recology data set, so the number of vectors in the primary caps and digit caps and the number of capsules in the digit caps are all set to 11.
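The modified backbone described above can be summarized in a small configuration sketch. The stage layout follows the text; the constant and helper names are our own:

```python
# (identity blocks, filters) per stage of the modified ResNet-34 backbone;
# the stem convolution kernel is reduced from 7x7 to 3x3 for small input patches.
STAGES = [(3, 16), (4, 28), (6, 40), (3, 52)]

def conv_layer_count(stages, convs_per_block=2):
    """Count convolutional layers: the stem conv plus two per identity block
    (each identity block in Figure 4 has two convolutional layers)."""
    return 1 + sum(blocks * convs_per_block for blocks, _ in stages)
```

This gives 16 identity blocks and 33 convolutional layers before the capsule layers; the classification head of the original ResNet-34 is dropped, as stated above.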

Adaptive Learning Optimization Algorithm
In this article, Stochastic Gradient Descent (SGD) with momentum is used to back-propagate and update the network parameters to obtain the optimal ResCapNet, as shown in Equations (9) and (10):

v = β·v − α·∇ω (9)
x ← x + v (10)

where α is the learning rate, β is the momentum coefficient, and v is the accumulated velocity. The gradient acts on v directly. When the direction of the negative gradient is the same as the direction of v, the update direction is correct and the weight is updated quickly.
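Equations (9) and (10) translate directly into an update step. The quadratic toy problem below uses an illustrative learning rate, not the article's 0.001:

```python
def sgd_momentum_step(w, v, grad, lr=0.1, beta=0.9):
    """One SGD-with-momentum update following Equations (9) and (10)."""
    v = beta * v - lr * grad   # Equation (9): accumulate velocity
    w = w + v                  # Equation (10): apply the velocity
    return w, v

# Minimize f(w) = w^2 (gradient 2w) as a toy check of the update rule.
w, v = 1.0, 0.0
for _ in range(100):
    w, v = sgd_momentum_step(w, v, grad=2.0 * w)
```

The momentum term lets consecutive updates in a consistent direction accumulate, which is the acceleration effect described above.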

Loss and Activation Function

This article uses the ReLU function as the activation function of the network. As shown in Equation (11), some outputs of the neurons are set to zero, which can reduce the dependency between the parameters and alleviate overfitting of the network.
We adopt the softmax function for classification and choose the exponential form of softmax in Equation (12). The input of the last layer is Z_j^L, the output of the last layer is a_j^L, and e is the base of the exponential; the denominator sums e^(Z_K^L) over all neurons K of the L-th layer. Therefore, the loss function is the cross-entropy loss in Equation (13).
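Equations (12) and (13) can be written out as follows. This is the standard numerically stable formulation; the variable names are ours:

```python
import numpy as np

def softmax(z):
    """Equation (12): a_j = exp(z_j) / sum_K exp(z_K), shifted for stability."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(z, label):
    """Equation (13): negative log-probability assigned to the true class."""
    return -np.log(softmax(z)[label] + 1e-12)
```

For uniform logits over n classes the loss is log(n), the value expected of an untrained classifier.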


Data Description

In this article, two different LiDAR data sets were used to evaluate the proposed method: one is the Bayview Park data set and the other is the Recology data set. They were obtained from the 2012 IEEE GRSS Data Fusion Contest. The Bayview Park data set was collected in June 2010 by the WorldView2 sensor in San Francisco, USA, as shown in Figure 5. The data set has a spatial resolution of 1.8 m and contains 300 × 200 pixels. It has seven land classes: building1, building2, building3, road, trees, soil, and seawater.

Figure 6 shows the Recology data set, which was also acquired in an urban location in San Francisco, USA. It contains 200 × 250 pixels and has a spatial resolution of 1.8 m. It has 11 land classes: building1, building2, building3, building4, building5, building6, building7, trees, parking lot, soil, and grass.

Experimental Setup
The experiments in this article were carried out under the Windows system and accelerated with an Nvidia RTX 2060 (Asus, Taiwan, China) graphics card. The code uses TensorFlow as the backend and is implemented through Keras and Python (Anaconda, Austin, TX, USA). The data sets were divided into training sets and test sets: we randomly selected 400, 500, 600, and 700 samples from each data set as the training set and used the rest for testing the model. Verified by experiments, it was better to set the input size of ResCapNet to 38 × 38 pixels; the input size of all comparative experiments was also set to 38 × 38 pixels, and the DSM data were linearly mapped to [−0.5, 0.5]. The training batch size was 32. We set 150 epochs for training, and when the classification accuracy of the network no longer increases for more than 20 epochs, the training stops early. The fill pattern of each layer's feature maps is set to 'same', so that the length and width of the inputs and outputs of each layer are unchanged. The structure of the CNN is shown in Table 1.
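The preprocessing described above, mapping the DSM to [−0.5, 0.5] and taking a 38 × 38 neighborhood per pixel, can be sketched as follows. The edge-padding strategy for border pixels is our assumption, since the article does not specify it:

```python
import numpy as np

def normalize_dsm(dsm):
    """Linearly map DSM values to [-0.5, 0.5]."""
    lo, hi = dsm.min(), dsm.max()
    return (dsm - lo) / (hi - lo) - 0.5

def extract_patch(dsm, row, col, size=38):
    """Return the size x size patch around pixel (row, col), edge-padded
    so that every pixel of the raster yields a full-size input."""
    half = size // 2
    padded = np.pad(dsm, half, mode='edge')
    return padded[row:row + size, col:col + size]
```

Each labeled pixel then contributes one normalized patch as a training or test sample.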
We use the SGD algorithm with momentum as the gradient optimizer; the momentum was set to 0.9 and the decay rate to 10^−6. When training the ResCapNet model, the initial learning rate for both the Bayview Park data set and the Recology data set was set to 0.001, and when training the CNN and ResNet models, the initial learning rate was also set to 0.001. For the Bayview Park data set, the maximum depth of the decision tree was set to 100, and for the Recology data set, it was set to 25. The kernel function of the SVM was set to the radial basis function (RBF), the RBF coefficient was left at its default of "auto", and the penalty parameter of the error term was set to 100. The value of k for the KNN was set to 1, the leaf_size was set to 30, and the distance metric was the Euclidean distance. The number of estimators of the Random Forest was set to 30 for both data sets.
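The comparison classifiers configured above map onto scikit-learn as in the sketch below. This is a plausible reconstruction, not the authors' code; `make_baselines` is our own helper, and `tree_depth` is 100 for Bayview Park and 25 for Recology:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

def make_baselines(tree_depth=100):
    """Comparison classifiers with the hyperparameters reported above."""
    return {
        'Decision Tree': DecisionTreeClassifier(max_depth=tree_depth),
        'SVM': SVC(kernel='rbf', gamma='auto', C=100),
        'KNN': KNeighborsClassifier(n_neighbors=1, leaf_size=30,
                                    metric='euclidean'),
        'Random Forest': RandomForestClassifier(n_estimators=30),
    }
```

Each model is then fit on the flattened training patches and evaluated on the held-out pixels.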

Experimental Results and Analysis
We adopted the overall accuracy (OA), average accuracy (AA), kappa coefficient (K), recall, precision, and RGB false-color maps to evaluate the performance of the model. Tables 2 and 3 provide the classification results of the different methods for the Bayview Park data set and the Recology data set when selecting 400, 500, 600, and 700 training samples, respectively. We can see that ResCapNet always achieved the highest accuracy; the best OA was 96.12% ± 0.51% for the Bayview Park data set and 96.39% ± 0.79% for the Recology data set. The best OA on the Bayview Park data set was 0.70%, 1.33%, 5.95%, 5.51%, 5.69%, 10.06%, 18.91%, and 19.27% higher than OctSqueezeNet, ResNet, CapsNet, CNN, Random Forest, KNN, SVM, and Decision Tree, respectively. The best OA on the Recology data set increased by 0.48%, 0.67%, 6.22%, 3.91%, 4.68%, 8.03%, 19.18%, and 20.09% compared to the same methods. Figure 7 compares the test results of the different methods when 700 training samples were selected for the two data sets; it can be intuitively seen that the proposed method had the best classification effect. Tables 4 and 5 give the precision and recall of each class for 700 samples on the Bayview Park data set and the Recology data set, and Tables 6 and 7 give the per-class classification accuracy on the two data sets. According to the per-class results shown in these four tables, when CapsNet was used alone, the classification effect on land classes with lower height was good, because it is sensitive to spatial features, but its overall classification accuracy was not high; when ResNet was used alone, the classification accuracy on land classes with greater height was high, but it had difficulty identifying the land classes with lower height.
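The evaluation metrics above (OA, AA, and kappa) can all be computed from a confusion matrix, as in the sketch below. This is the standard formulation; note that AA assumes every class appears at least once in `y_true`:

```python
import numpy as np

def classification_scores(y_true, y_pred, num_classes):
    """Return (OA, AA, kappa) computed from the confusion matrix."""
    cm = np.zeros((num_classes, num_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                      # rows: true class, cols: predicted
    n = cm.sum()
    oa = np.trace(cm) / n                  # overall accuracy
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))  # mean per-class accuracy
    pe = (cm.sum(axis=0) @ cm.sum(axis=1)) / n ** 2  # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa
```

Kappa discounts the agreement expected by chance, which is why it is reported alongside OA for imbalanced land-class distributions.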
The combination of the two greatly reduced the influence of land-class height on the classification results, and the classification accuracy of each category was very high.

[Table residue omitted: precision and recall of each class for 700 samples on the Recology data set.]

Figures 8 and 9 visually show the classification results of each class on the two data sets. It can be clearly seen that the classification results of ResCapNet for each class were excellent. Figures 10 and 11 provide classification maps for the different classifiers.

Conclusions
This article designs a deep learning model, ResCapNet, which combines the advantages of ResNet and CapsNet to effectively classify remotely sensed LiDAR data. Two well-known LiDAR data sets are considered, and eight established algorithms are compared with the proposed method. The proposed method is competitive with state-of-the-art classification methods for LiDAR and achieves better classification results: 96.12% and 96.39% OA on the Bayview Park and Recology data sets, respectively, when 700 training samples are selected.
The shortcut channel of ResNet can retain more complete feature information and alleviate the performance degradation caused by an inappropriate CNN depth, while automatically extracting effective features from the data. This enables the subsequent CapsNet to learn more useful feature information. Meanwhile, because of the sensitivity of CapsNet to spatial transformations of features, it can extract more detailed feature information and retain more valuable information than ordinary CNNs. Thus, the combination of the two structures obtains a very good classification effect.
In addition, the practical effect of this method on other remote sensing data sets needs to be verified further. Meanwhile, we need to further explore how to automatically generate an optimal network model suitable for LiDAR classification.