Three-Dimensional Urban Land Cover Classification by Prior-Level Fusion of LiDAR Point Cloud and Optical Imagery

Abstract: The heterogeneity of the urban landscape in the vertical direction should not be neglected in urban ecology research, which requires urban land cover products to move from two dimensions to three dimensions using light detection and ranging (LiDAR) point clouds. Previous studies have demonstrated that the performance of two-dimensional land cover classification can be improved by fusing optical imagery and LiDAR data using several strategies. However, few studies have focused on the fusion of LiDAR point clouds and optical imagery for three-dimensional land cover classification, especially using a deep learning framework. In this study, we proposed a novel prior-level fusion strategy and compared it with the no-fusion strategy (baseline) and three other commonly used fusion strategies (point-level, feature-level, and decision-level). The proposed prior-level fusion strategy uses two-dimensional land cover derived from optical imagery as the prior knowledge for three-dimensional classification. Then, a LiDAR point cloud is linked to the prior information using the nearest neighbor method and classified by a deep neural network. Our proposed prior-level fusion strategy has higher overall accuracy (82.47%) on data from the International Society for Photogrammetry and Remote Sensing, compared with the baseline (74.62%), point-level (79.86%), feature-level (76.22%), and decision-level (81.12%). The improved accuracy reflects two points: (1) fusing optical imagery with LiDAR point clouds improves the performance of three-dimensional urban land cover classification, and (2) the proposed prior-level strategy directly uses semantic information provided by the two-dimensional land cover classification rather than the original spectral information of optical imagery. Furthermore, the proposed prior-level fusion strategy connects two- and three-dimensional land cover classification in series, filling the gap between them.


Introduction
Sustainable development of the urban environment is related to human well-being, and monitoring and managing the urban environment have long been a research hotspot wherein two-dimensional land cover products have played an important role [1][2][3]. Geo-objects in the urban environment are diverse and have unique three-dimensional structures, such as a building with a roof and façade, or a tree with a height and diameter. While these three-dimensional structures cannot be derived from current two-dimensional land cover products, they should not be neglected in the study of the urban environment, including urban form analysis [4], local climate zone [5], and urban woody biomass estimation [6]. Thus, urban land cover should proceed from two-dimensional to three-dimensional analysis, in which the type of land cover is indexed by the points of a point cloud instead of by pixels (Figure 1).
The point cloud for indexing three-dimensional land cover can be provided by a light detection and ranging system (LiDAR), which uses a laser beam to measure the Earth's surface and has become the most important instrument for acquiring three-dimensional geospatial data. A LiDAR point cloud is the optimal data source for three-dimensional land cover classification and storage. Support vector machines, random forest, and other supervised learning methods are often used for point cloud classification [7][8][9]. These supervised classification methods require features that can express the characteristics of the point and its neighborhood; these features are vital to the performance of the classification. Commonly used features include histogram and covariance features. Histogram features, such as the fast point feature histogram [10], accumulate information about the spatial interconnection between a point and its neighbors into a histogram representation [11,12]. Covariance features, including line, plane, and volume attributes, are calculated from the covariance matrix of all points in the point's neighborhood [13,14]. Although these manually constructed features are useful for land cover classification, they cannot produce three-dimensional land cover classification of sufficient quality owing to the complexity and diversity of actual geo-objects.
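As an illustration of the covariance features mentioned above, the following minimal sketch computes line, plane, and volume attributes from the eigenvalues of a neighborhood covariance matrix. The names `linearity`, `planarity`, and `scattering` and their formulas follow a commonly used eigenvalue-based definition; they are assumptions made for illustration and are not necessarily the exact attributes used in the cited works.

```python
import numpy as np

def covariance_features(neighborhood):
    """Eigenvalue-based shape attributes of a (N, 3) point neighborhood.

    The three formulas are a common convention (an assumption here),
    with l1 >= l2 >= l3 the eigenvalues of the 3 x 3 covariance matrix.
    """
    cov = np.cov(neighborhood, rowvar=False)
    eigvals = np.linalg.eigvalsh(cov)[::-1]      # sort in descending order
    l1, l2, l3 = np.maximum(eigvals, 0.0)        # clamp tiny negative round-off
    return {
        "linearity": (l1 - l2) / l1,   # close to 1 for line-like neighborhoods
        "planarity": (l2 - l3) / l1,   # close to 1 for plane-like neighborhoods
        "scattering": l3 / l1,         # close to 1 for volumetric neighborhoods
    }

# Points along a straight line (e.g., a power line).
line = np.linspace(0.0, 1.0, 20)[:, None] * np.array([1.0, 2.0, 3.0])
line_features = covariance_features(line)

# Points on a flat, square grid (e.g., a roof patch).
gx, gy = np.meshgrid(np.arange(5.0), np.arange(5.0))
plane = np.column_stack([gx.ravel(), gy.ravel(), np.zeros(gx.size)])
plane_features = covariance_features(plane)
```

The function assumes the neighborhood is not degenerate (l1 > 0); a neighborhood of identical points would need a guard against division by zero.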
Deep neural networks (DNNs) learn the features of objects in "end-to-end" ways, and as such can achieve high performance in many computer vision and remote sensing classification tasks. In particular, three-dimensional DNNs, such as PointNet [15], PointCNN [16], and SSCNs [17] have overcome the difficulty caused by the sparseness and disorder of the point cloud for learning features. With these developments, deep learning has achieved rapid development in point cloud classification and has been used in the processing of outdoor LiDAR data [18]. For example, Yousefhussien et al. used multi-scale PointNet to improve the accuracy of urban LiDAR point cloud classification [19]. Zhang et al. used smoothing error enhanced data to solve the overfitting of PointCNN in urban LiDAR point cloud classification [20].
Although three-dimensional urban land cover is indexed by point clouds, the data used for this task are not limited to LiDAR point clouds; optical images can also provide supplementary information. Numerous studies have demonstrated that fusing optical imagery and LiDAR point clouds can improve the performance of two-dimensional land cover classification [21,22]. For example, Singh et al. integrated structural and intensity surface models extracted from LiDAR data with Landsat Thematic Mapper (TM) imagery to derive large-area urban land cover [23]; Paisitkriangkrai et al. trained a [...] [25]; Rasti et al. fused hyperspectral information with spatial and elevation information extracted from hyperspectral imagery and rasterized LiDAR features using orthogonal total variation component analysis [26]. In these studies, LiDAR represents auxiliary data for two-dimensional urban land cover classification, where optical imagery is the primary data. Thus, LiDAR is usually rasterized into a digital surface model (DSM) [27,28] and other structural features including height difference and deviation angle [29]. Unlike in two-dimensional land cover classification, LiDAR point clouds play a key role in three-dimensional land cover classification, where the space occupied by geo-objects is sparse. In this case, optical imagery represents the auxiliary data and its spectral information is often simply interpolated as the attributes of the LiDAR point cloud, also known as point-level fusion [19]. Apart from the point-level fusion strategy, feature-level and decision-level fusion [30] can also be adapted to three-dimensional land cover classification. However, they rarely receive attention, especially under a deep learning framework. In addition, deep learning models require sufficient training data or pre-trained models. Few training datasets are available for large-scale outdoor LiDAR point clouds, whereas several are available for optical imagery, such as the International Society for Photogrammetry and Remote Sensing (ISPRS) two-dimensional semantic labeling dataset, the WHU building dataset [31], and DeepGlobal [32].
Thus, to make full use of pre-trained two-dimensional neural network models and to comprehensively compare different fusion strategies in three-dimensional land cover classification, we proposed a prior-level fusion of LiDAR point cloud and optical imagery for three-dimensional land cover classification under a deep learning framework. We then compared our proposed method with the no-fusion strategy (baseline) and three other fusion strategies (point-level, feature-level, and decision-level). The proposed prior-level fusion strategy assumes that there is a certain relationship between two-dimensional and three-dimensional land covers, that is, two-dimensional land cover can provide prior knowledge for the three-dimensional land cover classification. For example, vegetation in the two-dimensional classification may be shrubs or trees in the three-dimensional classification, and the façade is under the building edge. The proposed prior-level strategy is based on widely used DNNs, whereby optical imagery is classified by a fully convolutional network and its result, namely two-dimensional land cover prior knowledge, is assigned to the LiDAR point cloud. Then, the LiDAR point cloud assigned with the prior knowledge is classified by a three-dimensional deep learning network to obtain the three-dimensional urban land cover classification. Thus, our proposed prior-level fusion strategy can fill the gap between two- and three-dimensional land cover by connecting the two classifications in series.
In the following, Section 2 provides a comprehensive description of the proposed strategy, including two kinds of DNNs and three other fusion strategies. The experimental data and results are given in Section 3 and discussed in Section 4. The conclusions and proposed future work are given in Section 5.

Methods
The proposed prior-level fusion of LiDAR point clouds and optical imagery for three-dimensional urban land cover classification includes three main parts (Figure 2).
(1) Obtain two-dimensional land cover, namely prior knowledge, from the optical image (see Section 2.1). Here, optical imagery is classified by a deep convolutional neural network (DCNN), and the output of the DCNN is the probability of each pixel belonging to each class. This probability serves as the prior for the subsequent three-dimensional classification. The DCNN used in this study was SegNet [33].
(2) Assign the prior knowledge to the LiDAR point cloud (see Section 2.2). The prior derived from the optical imagery is two-dimensional; however, the LiDAR point cloud is three-dimensional. These can be linked through their coordinates. We use (x, y) of the LiDAR point to search for its nearest pixel in the optical image to obtain a prior.
(3) Classify the LiDAR point cloud that has been assigned the prior knowledge to produce the three-dimensional urban land cover by three-dimensional DNN (see Section 2.3). The LiDAR point cloud is sparse and irregular, which renders traditional convolution unusable. PointNet++ represents pioneering work on point clouds to overcome this problem [34] and was used, with redesigned hyper-parameters, to classify urban LiDAR point clouds in this study.

Figure 2. Framework of the proposed prior-level fusion for three-dimensional (3D) land cover classification. In this study, the two-dimensional (2D) deep convolutional neural network (DCNN) was SegNet [33] and the 3D deep neural network (DNN) was PointNet++ [34].

Obtaining Prior Knowledge from Optical Image Using Deep Convolutional Neural Network (DCNN)
Obtaining prior knowledge corresponds to optical image semantic segmentation, which gives every pixel a classification vector and can be accomplished by using fully convolutional networks (FCNs), a popular type of DCNN. There are many FCNs, such as UNet [35], SegNet [33], and PSPNet [36]. Among these, SegNet exhibits a good balance between operating efficiency, required memory, and classification accuracy [33]. Thus, we selected SegNet as the base model for optical imagery semantic segmentation.
SegNet consists of a trainable encoding network and a corresponding decoding network, with a pixel-level Softmax classifier after the decoding network. The encoding network is the convolutional neural network VGG-16 [37] without its fully connected layers, and it extracts the encoding features. The encoding network contains five groups of encoders. Each group uses a convolutional layer, a batch normalization layer, a rectified linear unit (ReLU) activation layer, and a max-pooling layer to extract features and expand their receptive field. After five 2 × 2 max-pooling steps, the output of the encoding network has 1/32 of the spatial resolution of the original image. The parameters in the encoding network can be initialized by a VGG-16 pre-trained model, which is convenient for learning an improved classifier on the remote sensing data.
Unlike the encoding network, the decoding network up-samples the low-resolution features (1/32 of the original image) through the up-sampling layer, convolutional layer, batch normalization layer, and ReLU activation layer to obtain a feature map of the same size as the original image. The up-sampling layer uses the indices of the corresponding max-pooling layer to obtain sparse features with higher resolution, and the sparse features are densified through a convolution layer, a batch normalization layer, and a ReLU activation layer. The feature map with the same size as the original image is classified by the pixel-level Softmax classifier to obtain the needed prior for each pixel.
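The index-based up-sampling described above can be illustrated with a small NumPy sketch. It is not the SegNet implementation: it works on a single-channel 2D array with non-overlapping 2 × 2 windows, and the function names `max_pool_with_indices` and `max_unpool` are hypothetical.

```python
import numpy as np

def max_pool_with_indices(x, k=2):
    """Non-overlapping k x k max pooling that also remembers where each max was."""
    h, w = x.shape
    assert h % k == 0 and w % k == 0
    # Rearrange the image into (h//k, w//k, k*k) windows.
    windows = (x.reshape(h // k, k, w // k, k)
                .transpose(0, 2, 1, 3)
                .reshape(h // k, w // k, k * k))
    idx = windows.argmax(axis=-1)      # position of the maximum inside each window
    pooled = windows.max(axis=-1)
    return pooled, idx

def max_unpool(pooled, idx, k=2):
    """Put each pooled value back at its remembered position; other cells stay zero."""
    ph, pw = pooled.shape
    windows = np.zeros((ph, pw, k * k), dtype=pooled.dtype)
    rows, cols = np.indices((ph, pw))
    windows[rows, cols, idx] = pooled
    return (windows.reshape(ph, pw, k, k)
                   .transpose(0, 2, 1, 3)
                   .reshape(ph * k, pw * k))

x = np.array([[1., 2., 0., 1.],
              [3., 4., 5., 0.],
              [0., 1., 2., 2.],
              [7., 0., 1., 9.]])
pooled, idx = max_pool_with_indices(x)   # encoder side: halve the resolution
sparse = max_unpool(pooled, idx)         # decoder side: sparse map at full resolution
```

The unpooled map is sparse (one non-zero value per window), which is why the decoder follows it with convolution layers that densify the features.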

Assigning Prior Knowledge to the Light Detection and Ranging (LiDAR) Point Cloud
The prior knowledge obtained from the optical imagery is raster data indexed by pixel (r, c), where r is the row and c is the column of the pixel relative to the upper left corner of the raster. Each pixel contains a classification probability vector p:

$$p = (p_1, p_2, \ldots, p_k) \qquad (1)$$

where p_k is the probability of the k-th type of two-dimensional land cover. We can use the x and y in the coordinates (x, y, z) of a LiDAR point to calculate the corresponding row and column (r′, c′) in the raster data as follows:

$$r' = \frac{Y - y}{gsd}, \qquad c' = \frac{x - X}{gsd} \qquad (2)$$

where (X, Y) are the coordinates of the upper left corner of the raster, and gsd is the ground sample distance, namely, the spatial resolution of the raster. The prior value is assigned to the LiDAR point cloud according to its corresponding calculated (r′, c′) value. Then, a point in the LiDAR point cloud can be represented by (x, y, z, p_1, p_2, ..., p_k) instead of (x, y, z), which establishes a link between the two-dimensional and three-dimensional land cover classification.
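The assignment step can be sketched in a few lines of NumPy. The function name `assign_prior`, the rounding down to integer pixel indices, and the clamping of indices at the raster border are assumptions made for this illustration; the text only states that each point takes the prior of its corresponding pixel.

```python
import numpy as np

def assign_prior(points_xyz, prior, X, Y, gsd):
    """Append the 2D prior of the matching pixel to every LiDAR point.

    points_xyz : (N, 3) array of x, y, z coordinates
    prior      : (rows, cols, k) array of class probabilities
    X, Y       : coordinates of the upper left corner of the raster
    gsd        : ground sample distance of the raster
    """
    # Row grows as y decreases, column grows as x increases.
    rows = np.floor((Y - points_xyz[:, 1]) / gsd).astype(int)
    cols = np.floor((points_xyz[:, 0] - X) / gsd).astype(int)
    # Assumption: points on or just outside the border use the closest valid pixel.
    rows = np.clip(rows, 0, prior.shape[0] - 1)
    cols = np.clip(cols, 0, prior.shape[1] - 1)
    return np.hstack([points_xyz, prior[rows, cols]])

# Toy raster with 2 x 2 pixels and k = 2 classes.
prior = np.array([[[0.9, 0.1], [0.6, 0.4]],
                  [[0.3, 0.7], [0.2, 0.8]]])
points = np.array([[100.4, 199.7, 5.0],
                   [101.2, 198.1, 3.0]])
points_with_prior = assign_prior(points, prior, X=100.0, Y=200.0, gsd=1.0)
```

Each output row has the form (x, y, z, p_1, ..., p_k), which is the input representation used by the three-dimensional network in the next step.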

Classification of LiDAR Point Cloud Assigned Prior to Three-Dimensional Deep Neural Network (DNN)
Unlike optical imagery, whose regular grid makes it convenient for convolution and automatic feature extraction in the end-to-end framework, a LiDAR point cloud is disordered and irregular, which makes it difficult to design DNNs for learning point cloud features. In PointNet, an MLP-Max operation is designed to overcome this difficulty: a multi-layer perceptron (MLP) is applied to (x, y, z, p_1, p_2, ..., p_k) to extract a feature for every point, and then maximum pooling is used to summarize the extracted features of all points within the spherical neighborhood into a single vector [15].
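A minimal sketch of the MLP-Max idea is given below, using random, untrained weights purely to show the data flow. The layer sizes, the number of prior classes `k`, and the function names are hypothetical; the point of the example is that the pooled vector does not depend on the order of the points.

```python
import numpy as np

rng = np.random.default_rng(0)

def shared_mlp(points, weights, biases):
    """Apply the same small MLP (with ReLU) to every point independently."""
    h = points
    for W, b in zip(weights, biases):
        h = np.maximum(h @ W + b, 0.0)
    return h

def mlp_max(points, weights, biases):
    """Per-point MLP followed by max pooling over the neighborhood."""
    return shared_mlp(points, weights, biases).max(axis=0)

k = 4                                             # hypothetical number of prior classes
neighborhood = rng.normal(size=(32, 3 + k))       # 32 points: (x, y, z, p_1..p_k)
sizes = [3 + k, 16, 32]                           # hypothetical layer widths
weights = [rng.normal(size=(a, b)) * 0.1 for a, b in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]

feature = mlp_max(neighborhood, weights, biases)

# Shuffling the points leaves the summary vector unchanged.
shuffled = neighborhood[rng.permutation(len(neighborhood))]
feature_shuffled = mlp_max(shuffled, weights, biases)
```

Because the maximum is a symmetric function, the operation copes with the disorder of the point cloud without imposing a grid.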
PointNet++ extends PointNet to extract hierarchical point features and forms an encoder-decoder structure for point cloud semantic segmentation [34]. PointNet++ includes sampling and grouping, feature extraction, up-sample, and feature set propagation layers. The sampling and grouping layer uses the farthest point sampling method to obtain abstract points and their spherical neighborhoods. The feature extraction layer uses PointNet to extract abstract features for the abstract points. The sampling and grouping layer and the feature extraction layer are repeated to form an encoder network. For point cloud semantic segmentation, a decoder network is needed to up-sample the abstract points back to the original point cloud size. The up-sample layer is accomplished by distance-based interpolation and cross-level skip links, and the features of the up-sample layer are readjusted through a feature set propagation layer (i.e., a PointNet). Finally, the Softmax classifier is used to derive the three-dimensional classification result.
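The sampling and grouping layer can be sketched as follows. This is a plain NumPy illustration of farthest point sampling and spherical (ball) grouping, not the PointNet++ code; the function names, the fixed starting point, and the toy radius are assumptions.

```python
import numpy as np

def farthest_point_sampling(xyz, n_samples, start=0):
    """Greedily pick points that are as far as possible from those already picked."""
    selected = np.empty(n_samples, dtype=int)
    selected[0] = start
    # Distance from every point to the nearest selected point so far.
    dist = np.linalg.norm(xyz - xyz[start], axis=1)
    for i in range(1, n_samples):
        selected[i] = int(dist.argmax())
        dist = np.minimum(dist, np.linalg.norm(xyz - xyz[selected[i]], axis=1))
    return selected

def ball_query(xyz, centers, radius):
    """Indices of the points inside the sphere around each abstract point."""
    return [np.flatnonzero(np.linalg.norm(xyz - c, axis=1) <= radius) for c in centers]

xyz = np.array([[0., 0., 0.], [1., 0., 0.], [2., 0., 0.],
                [10., 0., 0.], [11., 0., 0.]])
selected = farthest_point_sampling(xyz, n_samples=3)
groups = ball_query(xyz, xyz[selected], radius=1.5)
```

Each group would then be passed through the MLP-Max operation to produce the feature of its abstract point.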
PointNet++ was originally designed for small-scale indoor point clouds and cannot be directly used for urban LiDAR point clouds. Therefore, we redesigned the hyperparameters of every layer in PointNet++ (Table 1).

Fusion Strategies on Three Other Different Levels
To evaluate our proposed prior-level fusion strategy, we compared it with three other commonly used fusion strategies: point-level, feature-level, and decision-level fusion [30]. We implemented all the fusion strategies under the DNN framework by using SegNet and PointNet++ to make the comparison as fair as possible (Figure 3).
The point-level fusion strategy assigns multispectral information from optical imagery to the points and then trains the classifier using three-dimensional DNN to classify the point cloud with spectral information (point-level in Figure 3). The feature-level fusion strategy first concatenates the features extracted from the multispectral image by DCNN and the features extracted from the LiDAR point cloud by three-dimensional DNN, and then the concatenated features are fed to an MLP to derive the three-dimensional land cover classification result (feature-level in Figure 3). Unlike point-level and feature-level fusion, the decision-level fusion strategy directly classifies the optical imagery and LiDAR point cloud to obtain two- and three-dimensional classification results, which are then combined using a heuristic fusion rule (decision-level in Figure 3). The heuristic fusion rule used in this study updates the probabilities of the three-dimensional classification result based on the two-dimensional classification result.
The updating procedure included two steps: (1) three-dimensional classification probabilities are multiplied by two-dimensional classification probabilities according to land cover type; for example, the probabilities of a façade and roof in three-dimensional land cover are multiplied by the probability of a building in two-dimensional land cover, and the probabilities of a shrub and tree in three-dimensional land cover are multiplied by the probability of vegetation in two-dimensional land cover; (2) the multiplied probabilities are then normalized so that the probabilities over the three-dimensional land cover types sum to one. The final classification result of the decision-level fusion strategy is the type with the maximum probability.
Remote Sens. 2021, 13, 4928

Experimental Data
The LiDAR point cloud and optical imagery used in this experiment were provided by the International Society for Photogrammetry and Remote Sensing (ISPRS) and downloaded from https://www.isprs.org/education/benchmarks.aspx on 12 July 2019. The LiDAR point cloud is an airborne LiDAR dataset that was collected by Leica Geosystems in Vaihingen using the Leica ALS50 system with a 45° field of view. Its geographic coordinate system is WGS84 and the projected coordinate system is UTM-32N. The average point density is 8 pts/m². The ISPRS working group labeled some parts of these data as training and testing data to evaluate the three-dimensional land cover classification (Figure 4a,b). The labeled categories are power line, low vegetation, impervious ground, car, fence, roof, façade, shrub, and tree.

The optical multispectral imagery provided by the ISPRS consists of orthophoto images comprising three bands: near-infrared, red, and green (IR-R-G; Figure 4c,d). The spatial resolution of the optical multispectral image is 1 m. The projected coordinate system of the orthophoto images is the same as that of the airborne LiDAR point cloud. Thus, registration of LiDAR data and optical imagery was not needed in this experiment. The ISPRS working group selected 16 blocks from Vaihingen's orthophoto images and manually labeled six categories including impervious surface, building, low vegetation, tree, car, and background. The background includes water bodies and other objects.
To simplify the design of the rule of the decision-level fusion strategy, the categories used for the point cloud were low vegetation, shrub, tree, impervious surface, façade, and roof. The categories used for the optical imagery were impervious surface, building, low vegetation, and tree.
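With these category sets, the two-step updating rule of the decision-level fusion strategy (multiply, then normalize) can be sketched for a single point as follows. The correspondence table `PARENTS` is an assumption: the text only states that façade and roof are paired with building, and shrub and tree with vegetation; here "vegetation" is taken as the sum of the two-dimensional low vegetation and tree probabilities, and the remaining pairings are guesses made for illustration.

```python
import numpy as np

CLASSES_3D = ["low_vegetation", "shrub", "tree", "impervious_surface", "facade", "roof"]
CLASSES_2D = ["impervious_surface", "building", "low_vegetation", "tree"]

# Hypothetical mapping from each 3D class to the 2D classes that support it.
PARENTS = {
    "low_vegetation": ["low_vegetation"],
    "shrub": ["low_vegetation", "tree"],       # "vegetation" = low vegetation + tree
    "tree": ["low_vegetation", "tree"],
    "impervious_surface": ["impervious_surface"],
    "facade": ["building"],
    "roof": ["building"],
}

def decision_level_fusion(p3d, p2d):
    """Multiply 3D probabilities by the matching 2D probabilities, then renormalize."""
    p3d = np.asarray(p3d, dtype=float)
    p2d = np.asarray(p2d, dtype=float)
    support = np.array([sum(p2d[CLASSES_2D.index(c)] for c in PARENTS[name])
                        for name in CLASSES_3D])
    fused = p3d * support            # step 1: multiply according to land cover type
    fused = fused / fused.sum()      # step 2: normalize so the probabilities sum to one
    return fused, CLASSES_3D[int(fused.argmax())]

# A point that the 3D network labels as tree, over a pixel that is clearly a building.
p3d = [0.05, 0.10, 0.40, 0.05, 0.05, 0.35]
p2d = [0.05, 0.80, 0.05, 0.10]
fused, label = decision_level_fusion(p3d, p2d)
```

In this toy case the two-dimensional result moves the decision from tree to roof. The sketch assumes the product is not zero for every class; a real implementation would need a fallback for that case.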

Details of Experimental Setting
The optical imagery used in this experiment has three spectral bands, so SegNet can be used directly; its encoder network parameters were initialized using a pre-trained VGG-16 model. We randomly selected 12 blocks of the optical image to fine-tune SegNet and set aside four other blocks for evaluation. The input image block for SegNet was a randomly cropped 256 × 256 image unit. For training SegNet, we set a batch size of 16 and used the Stochastic Gradient Descent (SGD) method as the parameter optimizer. The loss function used in SegNet was the weighted cross-entropy loss, calculated as:

$$L = -\sum_{i} w_i \, y_{g,i} \log y_{p,i} \qquad (3)$$

where y_p is the predicted probability vector, y_g is the ground truth, i indexes the classes, and w is the weight vector for every class, which is calculated by dividing the class frequency by the median of all class frequencies.
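The weighted cross-entropy loss can be sketched in NumPy as below. The sketch uses the usual form of the loss, with one-hot ground truth and class weights passed in as an argument; averaging over samples and the small `eps` guard against log(0) are assumptions, and the weight values in the example are arbitrary.

```python
import numpy as np

def weighted_cross_entropy(y_p, y_g, w, eps=1e-12):
    """Mean over samples of -sum_i w_i * y_g[i] * log(y_p[i]).

    y_p : (N, C) predicted probabilities
    y_g : (N, C) one-hot ground truth
    w   : (C,) class weights
    """
    y_p = np.clip(np.asarray(y_p, dtype=float), eps, 1.0)   # avoid log(0)
    y_g = np.asarray(y_g, dtype=float)
    w = np.asarray(w, dtype=float)
    per_sample = -(w * y_g * np.log(y_p)).sum(axis=-1)
    return per_sample.mean()

y_p = [[0.5, 0.5], [0.25, 0.75]]
y_g = [[1, 0], [0, 1]]
loss = weighted_cross_entropy(y_p, y_g, w=[2.0, 1.0])
unweighted = weighted_cross_entropy(y_p, y_g, w=[1.0, 1.0])
```

Errors on the first class count twice as much as errors on the second in this example, which is how class weighting shifts the training signal between frequent and rare classes.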
Although the three-dimensional geometry information (x, y, z) of the LiDAR point cloud was the same for all four fusion strategies, we trained different PointNet++ models because the different strategies have different auxiliary information. The baseline trained PointNet++ using only three-dimensional geometry information. The point-level fusion strategy trained PointNet++ using geometry and spectral information [i.e., (x, y, z, IR, R, G)]. The prior-level fusion strategy trained PointNet++ using geometry and prior information [i.e., (x, y, z, p_1, p_2, ..., p_k)]. The batch size of these models was set to 16, and the parameters were initialized using the Xavier initializer provided in TensorFlow. The optimizer was the adaptive moment estimation (Adam) method. The loss function was the weighted cross-entropy loss (Equation (3)). The learning rate decayed exponentially. The input unit of PointNet++ was a point set of 8192 points. Thus, we split the LiDAR training data (Figure 4a) into 30 m × 30 m blocks and resampled them into 8192 points for training PointNet++. When classifying the LiDAR testing data (Figure 4b), we down-sampled the original data using the same procedure as for the training data and classified the down-sampled point cloud with the trained PointNet++ model; we then assigned every point of the testing data the type of its nearest point in the down-sampled point cloud.
Figure 5 shows the ground truth and results of the four fusion strategies. All four fusion strategies achieved acceptable performance. In particular, three dominant geo-objects, namely tree, impervious surface, and roof, presented high accuracy (Table 2). Figure 6 shows the error distribution for different fusion strategies.
Compared with the baseline, the red area in the other classification error distribution plots is smaller, indicating that the four fusion strategies had fewer classification errors and improved overall classification accuracy. The increase in overall classification accuracy over the baseline was 5.24 percentage points for the point-level, 1.60 for the feature-level, 6.50 for the decision-level, and 7.85 for the prior-level strategy (Table 2). The F1-scores of the decision-level and prior-level were >80%. Among the fusion strategies, prior-level had the highest accuracy and lowest error (Table 2, Figures 5 and 6).
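The block splitting, resampling, and nearest-point label transfer used to prepare and classify the LiDAR data in the experimental setting can be sketched as follows. The function names are hypothetical, random resampling is an assumption (the text does not state how blocks were resampled to 8192 points), and the brute-force nearest-neighbor search is only suitable for small examples; a spatial index such as a k-d tree would be the usual choice for a full point cloud.

```python
import numpy as np

def block_ids(xy, origin, block_size=30.0):
    """Integer (column, row) index of the square block containing each point."""
    return np.floor((xy - origin) / block_size).astype(int)

def resample_indices(n_available, n_points=8192, rng=None):
    """Indices that resample a block to exactly n_points.

    Sampling is done with replacement only when the block has too few points.
    """
    rng = np.random.default_rng() if rng is None else rng
    return rng.choice(n_available, size=n_points, replace=n_available < n_points)

def propagate_labels(full_xyz, sampled_xyz, sampled_labels):
    """Give every original point the label of its nearest down-sampled point."""
    d2 = ((full_xyz[:, None, :] - sampled_xyz[None, :, :]) ** 2).sum(axis=-1)
    return sampled_labels[d2.argmin(axis=1)]

xy = np.array([[0.0, 0.0], [29.9, 10.0], [30.0, 0.0], [65.0, 31.0]])
ids = block_ids(xy, origin=np.array([0.0, 0.0]))

idx = resample_indices(n_available=5, n_points=8, rng=np.random.default_rng(0))

full = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [5.0, 5.0, 0.0], [5.2, 5.0, 0.0]])
sampled = full[[0, 2]]
labels = propagate_labels(full, sampled, np.array([1, 2]))
```

The label transfer means that the accuracy reported on the full testing data also depends on how well the down-sampled points represent their neighbors.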

Discussion
Studying the heterogeneity of urban landscapes is important for managing the urban environment, and requires a three-dimensional urban land cover product. Three-dimensional LiDAR classification is a fundamental task for producing this three-dimensional land cover product. Traditionally, LiDAR data have acted as auxiliary data to improve accuracy in two-dimensional land cover classification, where an optical image is the core data. We fused optical images into LiDAR classification and found that the three-dimensional accuracy could also be improved by the fusion (Table 2, Figure 5). Among the different fusion strategies, our proposed prior-level fusion approach had the highest accuracy. We analyzed this result using the loss during the training process (see Section 4.1). Moreover, we checked the error region in Figure 6 to identify the data bottleneck in the three-dimensional land cover classification (see Section 4.2). Finally, we compared the results with other methods to indicate the limitations of the approach and the scope (see Section 4.3).

Loss Variation during Training
The loss used in this study was the cross-entropy loss, which measures the difference between two probability distributions (Equation (3)): the lower the loss, the better the prediction of the model. Loss variation is therefore an important indicator of the DNN training process. During the training process, the loss was decreased by updating the parameters of the DNN (Figures 7 and 8). Note that the decision-level strategy has no single overall loss; instead, it involves two separate losses, the loss of the baseline and the loss of SegNet (Figure 8). When the training loss was stable, the largest loss occurred with the feature-level, followed by the baseline, point-level, and prior-level. Moreover, the prior-level offered the fastest convergence because it directly used the two-dimensional land cover classification result, which contained semantic information. When the test loss was stable, the largest loss occurred with the baseline, followed by the feature-level, point-level, and prior-level, consistent with the overall accuracy of the classification in the prediction results. These phenomena imply that, after embedding the information from the optical imagery, the loss becomes smaller and reaches a stable state faster.
Of the two losses with the decision-level, the loss with the baseline was greater than the loss with SegNet in both training and testing (Figures 7 and 8). This confirms that it is reasonable to train separate DNNs for the optical imagery and the LiDAR point cloud. It also indicates that the prior-level fusion strategy can make use of a pre-trained two-dimensional neural network model in three-dimensional land cover classification when the training data are insufficient. The training loss with the feature-level was the greatest because the MLP classifier in the feature-level had only two layers and included dropout, resulting in a weak learning capability.
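To make the loss comparison above concrete, the following minimal sketch computes the standard cross-entropy for one-hot labels. It is an illustration only and may differ in detail from Equation (3); the function names `cross_entropy` and `mean_cross_entropy` and the clipping constant `eps` are assumptions introduced here, not part of the original study.

```python
import math


def cross_entropy(true_label, predicted_probs, eps=1e-12):
    """Cross-entropy between a one-hot label and a predicted distribution.

    true_label is the index of the correct class; predicted_probs is a
    list of class probabilities. eps (an assumed constant) avoids log(0).
    """
    p = max(predicted_probs[true_label], eps)
    return -math.log(p)


def mean_cross_entropy(labels, probs):
    """Average the per-point cross-entropy over a batch of points."""
    total = sum(cross_entropy(y, p) for y, p in zip(labels, probs))
    return total / len(labels)


if __name__ == "__main__":
    # A confident correct prediction gives a lower loss than an uncertain one.
    print(cross_entropy(0, [0.9, 0.1]))
    print(cross_entropy(0, [0.5, 0.5]))
```

A lower value means the predicted distribution places more probability on the correct class, which is why a smaller stable loss in Figures 7 and 8 is read as better prediction.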

Detailed Analysis of the Error Region
Among all fusion strategies, the error region of the prior-level was smallest (Figure 6), and the performance of the prior-level fusion strategy was highest (Table 2 and Figure 5). Therefore, we selected eight typical regions that were misclassified by the prior-level fusion strategy (Figure 9). Generally, the elevation of the misclassified regions varied abruptly (Figure 10). For example, there were significant elevation differences between grassland and adjoining impervious surface (i.e., region 1 in Figure 10), and between the bottom of a building and the adjacent grassland (i.e., a narrow ditch in region 2 of Figure 10). Grassland was misclassified as shrubs in region 4 of Figure 10, because the elevation difference suddenly increased after the road bifurcated. Compared with regions with gentle elevation change, the density of the point cloud in these areas was lower, and the distribution of the point cloud was sparser. Thus, there were insufficient points to resolve local features, likely leading to the misclassification.
Errors also occurred when two geo-objects of the same type had large elevation differences. For example, when high and low trees were mixed, the prior-level strategy misclassified some of the smaller trees as shrubs (i.e., regions 3 and 5 in Figure 10). Some shrubs were also misclassified as trees (i.e., region 6 in Figure 10). Similarly, when two buildings were connected and the lower roof was near the ground level, the lower roof was likely to be misclassified as ground (i.e., region 7 in Figure 10).
These errors may be significantly reduced by incorporating vertical information (e.g., tree trunks and building façades); therefore, integrating multi-platform LiDAR, such as backpack LiDAR and vehicle LiDAR, is necessary for three-dimensional urban land cover classification. Although some errors existed in the classification results of the prior-level fusion strategy, in the majority of situations the results were more accurate than the manually labeled results (e.g., region 8 in Figure 10).
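The observation that misclassified regions coincide with sparse point coverage can be checked with a simple density map. The sketch below counts points per square cell in the XY plane and flags cells below a threshold; the function names, the cell size, and the density threshold are hypothetical choices for illustration and were not used in this study.

```python
import math
from collections import Counter


def cell_density(points, cell_size):
    """Return {(ix, iy): points per unit area} for square XY cells.

    points is an iterable of (x, y, z) tuples; cell_size is the cell
    edge length in the same units as x and y.
    """
    counts = Counter(
        (math.floor(x / cell_size), math.floor(y / cell_size))
        for x, y, _ in points
    )
    area = cell_size * cell_size
    return {cell: n / area for cell, n in counts.items()}


def sparse_cells(points, cell_size, min_density):
    """List occupied cells whose density falls below min_density."""
    density = cell_density(points, cell_size)
    return sorted(cell for cell, d in density.items() if d < min_density)


if __name__ == "__main__":
    cloud = [(0.1, 0.1, 0.0), (0.2, 0.3, 0.0), (0.5, 0.5, 1.0), (1.5, 0.5, 3.0)]
    print(sparse_cells(cloud, cell_size=1.0, min_density=2.0))
```

Overlaying such flagged cells on the error map would indicate how much of the misclassification falls in sparsely sampled areas.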

Comparison with Other Methods
The purpose of this study was to explore fusion strategies for LiDAR point clouds and optical imagery in three-dimensional urban land cover classification. Therefore, we used the basic PointNet++ and SegNet models. Apart from PointNet++, other machine learning methods have also been used for point cloud classification. To compare our prior-level fusion strategy with these methods and determine its limitations, we used the prior-level fusion strategy to classify the ISPRS LiDAR point cloud into the nine original categories and then compared the results with those of other methods (Table 3). The methods in Table 3 are divided into non-deep learning and deep learning methods. ISS_7 [38] first extracts supervoxels with the help of point cloud geometry and optical spectral information, and then uses machine learning to classify the supervoxels. UM [39] uses multiple attributes of the point cloud (intensity, echo number, etc.), texture features (locally fitted surfaces), and morphological features (differential morphological profiles) to train a machine learning classifier with a one-vs-one strategy. HM_1 uses k-nearest neighbors (KNN) to select neighborhood points for feature extraction, and then uses a conditional random field (CRF) to complete the contextual classification. LUH [40] uses a high-order CRF to complete the classification with the help of extracted supervoxels. RIT_1 [19] extracts the ground to obtain a normalized elevation and then uses PointNet to process the LiDAR point cloud fused with optical imagery. WhuY4 [41] uses a multi-scale CNN to process feature images obtained from LiDAR point clouds. The features used include normalized elevation, intensity, normal vector, and local plane features. PointCNN is the baseline in the A-XCR method [42], and its processing is similar to that of the point-level fusion strategy described in this paper. Building on PointCNN training, A-XCR introduces an error smoothing process generated by a CRF to avoid over-fitting of PointCNN.
The deep learning methods were superior to the non-deep learning methods (Table 3). Among the deep learning methods, normalizing the elevation of the LiDAR point cloud and extracting additional features as network input can achieve higher accuracy. Furthermore, a more advanced neural network architecture can also improve accuracy; for example, PointCNN uses a dilated convolution technique to obtain multiple models and integrates them to obtain superior results. Thus, in the future, we plan to embed PointCNN or other more advanced three-dimensional classification networks, such as KPConv [43], into the prior-level fusion strategy; such embedding will be simple owing to the serial form of the proposed prior-level fusion strategy (Figure 2).
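Elevation normalization, mentioned above as a helpful preprocessing step, replaces absolute height with height above the local ground. The sketch below uses the lowest point in each XY grid cell as a crude ground estimate; this is a simplifying assumption for illustration, not the ground extraction used by RIT_1 or WhuY4, and the function name and `cell_size` parameter are hypothetical.

```python
import math


def normalize_elevation(points, cell_size):
    """Return points with z replaced by height above a per-cell ground level.

    The ground level of a cell is approximated by its lowest point, which
    is only a rough stand-in for a proper ground-filtering algorithm.
    """
    def cell_of(x, y):
        return (math.floor(x / cell_size), math.floor(y / cell_size))

    # First pass: find the lowest elevation observed in each cell.
    ground = {}
    for x, y, z in points:
        key = cell_of(x, y)
        ground[key] = min(z, ground.get(key, z))

    # Second pass: subtract the cell's ground level from each point.
    return [(x, y, z - ground[cell_of(x, y)]) for x, y, z in points]


if __name__ == "__main__":
    cloud = [(0.1, 0.1, 100.0), (0.5, 0.5, 105.0), (1.2, 0.3, 200.0)]
    print(normalize_elevation(cloud, cell_size=1.0))
```

With normalized elevation, a low roof and the adjacent ground differ by their height above terrain rather than by absolute altitude, which is the kind of feature the compared methods feed to their classifiers.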

Conclusions
In this study, a novel prior-level fusion strategy of LiDAR point clouds and optical imagery for three-dimensional land cover classification was proposed and compared with other fusion strategies, namely point-level, feature-level, and decision-level. The proposed prior-level fusion strategy builds a link between two-dimensional and three-dimensional land cover through the prior knowledge obtained from the optical imagery. The point-level fusion strategy directly assigns multispectral information of the optical imagery to the point cloud, and classifies the point cloud with multispectral information. The feature-level fusion strategy concatenates the features extracted from the optical image and the features from the LiDAR point cloud, and then the concatenated feature is used to obtain classification results. The decision-level fusion strategy fuses the results of two-dimensional land cover from optical imagery and three-dimensional land cover from a LiDAR point cloud based on a heuristic rule. The experimental results using ISPRS data show that the proposed prior-level fusion strategy delivers the best performance, which is manifested mainly in the lowest losses in the training process and the highest F1-score (82.79%) in the classification results. The F1-scores of the point-level, feature-level, decision-level, and prior-level strategies were 80.15%, 76.46%, 81.35%, and 82.79%, respectively.
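The linking step of the prior-level strategy, in which each LiDAR point receives the two-dimensional land cover label of its nearest pixel, can be sketched as follows. This is a minimal illustration under stated assumptions: the label raster is a north-up grid with a known top-left corner and square pixels, and the function name `attach_prior_labels` and its parameters are hypothetical rather than taken from the study's implementation.

```python
import math


def attach_prior_labels(points, label_raster, x_min, y_max, pixel_size):
    """Append the nearest pixel's 2D land cover label to each 3D point.

    points: iterable of (x, y, z) tuples in map coordinates.
    label_raster: list of rows of class ids, row 0 at the top (north).
    x_min, y_max: map coordinates of the raster's top-left corner (assumed).
    pixel_size: pixel edge length in map units (assumed square pixels).
    """
    n_rows, n_cols = len(label_raster), len(label_raster[0])
    fused = []
    for x, y, z in points:
        # For a regular grid, the nearest pixel center is the pixel that
        # contains the point; points outside the raster are clamped to
        # the closest edge pixel.
        col = min(max(math.floor((x - x_min) / pixel_size), 0), n_cols - 1)
        row = min(max(math.floor((y_max - y) / pixel_size), 0), n_rows - 1)
        fused.append((x, y, z, label_raster[row][col]))
    return fused


if __name__ == "__main__":
    raster = [[1, 2], [3, 4]]  # toy 2 x 2 land cover map
    cloud = [(0.5, 1.5, 10.0), (1.5, 0.5, 3.0)]
    print(attach_prior_labels(cloud, raster, x_min=0.0, y_max=2.0, pixel_size=1.0))
```

The resulting four-value points carry the semantic prior as an extra per-point attribute, which the three-dimensional network can then consume alongside the coordinates.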
Through detailed analysis of the error distribution of the prior-level fusion strategy, we found that some errors arose from data problems: the airborne LiDAR point cloud was very sparse at locations where elevation changed abruptly, because airborne LiDAR lacks vertical information. If other LiDAR platforms, such as backpack LiDAR and vehicle LiDAR, were integrated with the airborne LiDAR, more reliable three-dimensional urban land cover could be achieved, which would help urban ecology research. In addition, since the pioneering work of PointNet++, a few three-dimensional deep learning structures with better performance have emerged to encode the point cloud neighborhood relationship. We anticipate that it will be necessary to adopt a more advanced neural network structure in the prior-level fusion strategy to improve the performance of three-dimensional land cover classification.