Multi-View Features Joint Learning with Label and Local Distribution Consistency for Point Cloud Classification

In outdoor Light Detection and Ranging (LiDAR) point cloud classification, finding discriminative features for point cloud perception and scene understanding is one of the great challenges. The features derived from defect-laden (i.e., noisy, outlier-ridden, occluded and irregular) raw outdoor LiDAR scans usually contain redundant and irrelevant information, which adversely affects the accuracy of semantic point labeling. Moreover, point cloud features of different views can express different attributes of the same point. Simply concatenating these features of different views cannot guarantee the applicability and effectiveness of the fused features. To solve these problems and achieve outdoor point cloud classification with fewer training samples, we propose a novel joint learning framework for multi-view features and classifiers. The proposed framework uses label consistency and local distribution consistency of multi-space constraints for multi-view point cloud feature extraction and classification. In the framework, manifold learning is used to carry out subspace joint learning of multi-view features by introducing three kinds of constraints, i.e., local distribution consistency of feature space and position space, label consistency between multi-view predicted labels and the ground truth, and label consistency among multi-view predicted labels. The proposed model can be trained well with fewer training points, and an iterative algorithm is used to solve the joint optimization of multi-view feature projection matrices and linear classifiers. Subsequently, the multi-view features are fused and used for point cloud classification effectively. We evaluate the proposed method on five different point cloud scenes, and experimental results demonstrate that its classification performance matches or outperforms that of the compared algorithms.


Introduction
In recent years, with the rapid advancement of computer vision and Light Detection and Ranging (LiDAR) technology, an increasing number of point clouds are acquired and widely used in various remote-sensing applications. In applications such as autonomous driving, understanding outdoor scenes through semantic labeling of point clouds has become a hot topic [1][2][3][4]. Point cloud classification assigns a specific semantic attribute label to each point in the point cloud [2], which is a key step in environmental perception and scene understanding. Due to the disorder, sparsity and irregularity of point clouds, as well as the possible presence of uncertainty (noise, outliers and missing data), intra-class points in the same scene can be quite different while inter-class differences may not be obvious [3]. Therefore, effective classification of point clouds is a challenging problem.
For the past few years, the classification algorithms proposed in [3][4][5][6][7][8][9] have achieved good performance for classifying images and point clouds. For example, Zhang et al. [10] proposed the DKSVD (discriminative K-SVD [11]) algorithm, which introduces a classification error term to optimize feature extraction and the classifier simultaneously. To explore the prior knowledge of label information, Jiang et al. [12] introduced label consistency constraints into the objective functions of the LC-KSVD1 (label consistent K-SVD) and LC-KSVD2 algorithms, and showed better classification results. Zhang et al. [1] used discriminative dictionary learning to construct multi-level point set features for point cloud classification. Li et al. [4] proposed a deep-learning network based on multi-level voxel feature fusion for point cloud classification. The feature dimensions used by the above methods are relatively high and often carry noise and redundant information [13]. To overcome this drawback, dimensionality reduction and sparse representation are widely used. Dimensionality reduction can be seen as a special case of subspace learning, which projects high-dimensional data into a low-dimensional subspace through algorithms such as ICA (independent component analysis) [14], PCA (principal component analysis) [15], optimal mean robust PCA [16] and other variants.
In addition, most supervised classification methods usually require a large number of training samples to learn features and classifiers that achieve very high classification accuracy. As training set generation through point cloud labeling is time-consuming, it significantly lowers algorithmic efficiency [17]. Therefore, if we can successfully classify large volumes of point clouds using only a small percentage of training samples, the approach has great practical applicability because the time and labor costs will be significantly reduced [5]. To solve this problem, semi-supervised or supervised classification methods for joint learning of the feature transformation matrix and classifier were proposed in [13][17][18][19]. For example, Mei et al. [17] concatenated multiple single-point features to form high-dimensional features for each point, and then used the joint constraints of margin, adjacency graph and labels to train the model with a small portion of samples in a semi-supervised framework. Zhu et al. [18] used a feature vector composed of multi-scale features in a series of images to express each image, and then introduced constraints on the local connection relations of labels and samples to jointly learn the feature projection matrix and classifier with local and global consistency. These methods directly fuse different types of sample features, or the same features defined at multiple scales, for classification. However, this kind of feature fusion and its variants have relatively limited ability to express sample attributes and limited classification performance. In this case, the effectiveness of feature fusion cannot be guaranteed.
To express and classify multimedia data effectively, researchers have proposed a variety of multi-view learning algorithms. Each point can be described by high-dimensional features from multiple views, e.g., eigenvalue features of the covariance matrix [1], spin image features [20], the normal vector, FPFH (fast-point feature histogram) [21] and VFH (viewpoint feature histogram) [22]. The features created in each view contain unique information that differs from the features derived from other views. Meanwhile, it should be noted that multi-view features also include some overlapping information to a certain extent, although they are generalized from different views. Features of different views describe properties of different aspects of a point, but they commonly represent the same point. Typically, multi-view learning methods [23][24][25][26][27][28][29][30][31] can effectively fuse features from different views. The authors leverage this diversity and consistency of different view features to obtain more discriminative feature representations. More specifically, Nie et al. [25] proposed an adaptive weighted multi-view learning algorithm (MLAN) for image clustering and semi-supervised classification. This algorithm introduces different view weights and learns the local structure from different view data based on a manifold learning method. In [28], the low-rank representations of each view's features are jointly optimized by introducing exclusivity and category-consistency constraints. In contrast to [28], the method in [31] jointly learned the low-rank representations of multiple views by introducing an error term with an adaptive weight for each view and a diversity regularization term for reducing redundant information between different views. Then, the joint projection graph of the multiple views was constructed from the low-rank representations for clustering/classification.
All these methods outperform single-view feature-learning methods; however, they are not directly applicable to outdoor point cloud classification. In contrast to the above multi-view learning methods, several multi-view Convolutional Neural Network (CNN) based methods (deep learning mechanisms) for point cloud processing have been proposed in recent years [32]. Generally, this kind of method relies on 2D rendered views instead of the 3D data. For example, Su et al. [33] used multi-view CNNs to extract features from 2D renderings of a 3D object, which shows good performance for 3D object model classification. For point cloud semantic segmentation of outdoor scenes, multi-view CNN-based methods, e.g. [34][35][36], need to render the point cloud to generate multi-view images with multi-modal representations, which can include depth, color, normal and other features. Then the generated images are semantically segmented by networks such as U-Net [37], SegNet [38] and the Fully Convolutional Network (FCN) [39]. After that, the semantic segmentation results of the multiple images are projected onto meshes to jointly determine the label of each mesh vertex. Finally, the labeled vertices are projected back to the original point cloud. Although these deep learning-based methods have obtained good results, they rely on full 3D meshes to generate multi-view renderings, and reliable 3D meshes are difficult to obtain for outdoor point clouds. These multi-view methods thus operate on images, not on 3D point clouds. In addition, to the best of our knowledge, no multi-view learning method has been directly applied to point cloud classification.
To fill this gap, we propose a feature extraction and point cloud classification model based on multiple views and space representation consistency under constraints of label consistency (MvsRCLC). The overall flowchart of the proposed algorithm is shown in Figure 1. Firstly, the multi-view features of each point are extracted. Then, the features of each view are used to jointly learn the subspace of each view to remove redundant information, making the feature representation more suitable for classification tasks. Secondly, the local distribution consistency constraint of feature space and position space is used to express the adjacency graph of each point in the local neighborhood. After that, the label consistency constraint is used to ensure the consistency between the predicted labels of all views and the ground truth label, as well as the consistency of the predicted labels across views. Label consistency includes label consistency of grouped points (LCG) and label consistency of single points (LCS). Finally, an iterative optimization algorithm for the objective function is proposed to progressively learn a subspace projection matrix and an optimal linear classifier by solving a minimization problem. In the experimental section, two airborne laser-scanning (ALS) scenes, two mobile laser-scanning (MLS) scenes and a terrestrial laser-scanning (TLS) scene with different complexities are used to evaluate the proposed algorithm. We state the original contributions explicitly as below: (1) A multi-view features and classifiers fusion framework for point cloud classification is proposed. By introducing multiple constraints among different views and combining the classification error terms, the feature subspaces of the multiple views can be jointly learned. This subspace learning method can effectively remove redundant information and noise.
Moreover, by simultaneously optimizing the feature projection matrices and linear classifiers of all views on the unified objective function, different view features can be fully explored, and more discriminant feature subspaces and optimal linear classifiers can be obtained.
(2) Unlike previous methods that simply concatenate multi-view features for classification, we propose a multi-view subspace learning method using diversity and consistency constraints between multi-view features, and then multi-view features and the classifier are coupled to classify point clouds, thereby improving labeling accuracy.
(3) Our algorithm takes multiple constraints into account, including local distribution consistency in feature space and position space, LCG constraints, and predicted-label consistency constraints of multi-view features. This enhances the subspace representations of point clouds and improves classification accuracy. The proposed method performs particularly well when only a small portion of training samples is used.
(4) The joint optimization method for the multi-view objective function, based on the iterative algorithm proposed in this paper, converges rapidly on the point cloud scenes used in this paper. The proposed algorithm is superior to other state-of-the-art methods in labeling ALS, MLS and TLS point clouds.

Materials and Methods
In this section, the multi-view point features used in our method, i.e., the normal vector, covariance eigenvalue features and spin image features, are first presented. Then, the joint learning model for multi-view point feature extraction and classification is constructed, which mainly includes subspace learning, the local distribution consistency constraint of multi-view feature and position spaces, and multi-view label constraints. Next, the proposed model is optimized before it is used to classify point clouds.

Multi-View Point Cloud Feature Extraction
The different types of features represent different attributes of point clouds. Thus, we extract different types of point cloud features to fully express point cloud attributes. In this paper, point cloud features of different views are constructed using the normal vector, covariance eigenvalue features [1] and spin image features [20]. Here, we also generate a series of single point features extracted at multi-scale regions to enhance object recognition abilities. The feature extraction process is described as follows: the neighborhood points of point p within radius r are regarded as the support region of point p. We construct features defined at different scales by progressively changing the radius r; different radii r are selected to construct features of different scales for each view. The normal vector and covariance eigenvalues are calculated according to the methods in [3]. Since the normal vector and covariance eigenvalue features mainly represent the geometric properties of the point cloud, we treat them as features of the same view. In this way, the multi-scale features of point clouds belonging to different views are obtained.
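As a concrete illustration of the multi-scale construction above, the covariance eigenvalue features over growing radii can be sketched as follows. This is a hedged sketch, not the exact definitions of [1][3]: the radii, the eigenvalue-ratio feature set (linearity, planarity, scattering) and all function names are illustrative assumptions.

```python
import numpy as np

def covariance_features(points, center, radius):
    """Eigenvalue-based features of the neighborhood of `center` within
    `radius` (illustrative feature set, not the paper's exact one)."""
    d = np.linalg.norm(points - center, axis=1)
    nbrs = points[d <= radius]
    if len(nbrs) < 3:                       # degenerate support region
        return np.zeros(4)
    cov = np.cov(nbrs.T)                    # 3x3 covariance of the neighborhood
    w = np.sort(np.linalg.eigvalsh(cov))[::-1]   # eigenvalues, descending
    w = w / (w.sum() + 1e-12)               # normalize to sum to 1
    linearity  = (w[0] - w[1]) / (w[0] + 1e-12)
    planarity  = (w[1] - w[2]) / (w[0] + 1e-12)
    scattering = w[2] / (w[0] + 1e-12)
    return np.array([linearity, planarity, scattering, w[2]])

def multi_scale_features(points, center, radii=(0.5, 1.0, 2.0)):
    """Concatenate the single-scale descriptors over progressively larger radii."""
    return np.concatenate([covariance_features(points, center, r) for r in radii])
```

Concatenating the per-radius vectors yields one view's multi-scale descriptor; the spin image features would form a separate view in the same way.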

Multiple Views and Space Representation Consistency under Constraints of Label Consistency (MvsRCLC)
In this section, we discuss multiple views and space representation consistency with label consistency constraints for feature extraction and point cloud classification (MvsRCLC).

Reconstruction Independent Component Analysis (RICA) Subspace Learning
In the multi-scale point cloud feature representation, redundant and/or noisy information is usually present. To solve this problem, a feature projection technique is usually employed. It transforms the data in the high-dimensional space into a lower-dimensional space to achieve dimensionality reduction and subspace learning. To project the high-dimensional features into the subspace through a feature transformation matrix with low reconstruction error, MvsRCLC minimizes the RICA (reconstruction independent component analysis) objective function [14], min_W λ‖WX‖₁ + ‖WᵀWX − X‖²_F, to extract the optimal feature transformation matrix W.
For multi-view features, different views can be projected into the same subspace through different feature transformation matrices. Therefore, the subspace learning objective function of the multi-view features can be defined as min_{W_v} ∑_{v=1}^{m} (λ‖W_v X_v‖₁ + ‖W_vᵀ W_v X_v − X_v‖²_F), where W_v is the feature transformation matrix of the vth view, X_v is the feature matrix of the vth view, and m is the number of views.
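To make the single-view building block concrete, the RICA objective referenced above can be evaluated numerically. The sketch below assumes the standard reconstruction-plus-L1 form of [14]; λ and the variable names are illustrative.

```python
import numpy as np

def rica_loss(W, X, lam=0.1):
    """RICA objective (after Le et al. [14]): L1 sparsity of the projected
    features plus the error of reconstructing X with W^T W.
    W is k x d (k < d), X is d x n."""
    recon = W.T @ W @ X - X                 # reconstruction residual
    return lam * np.sum(np.abs(W @ X)) + np.sum(recon ** 2)
```

When the columns of X lie in the row space of W (e.g., W is an orthonormal basis of that subspace), the reconstruction term vanishes and only the sparsity penalty remains, which is the regime the transformation matrix is driven toward.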

Multi-View Local Distribution Consistency Constraints
(1) Point Cloud Spatial Position Constraint Term
Although each point has different view features, the intrinsic spatial relationships are explicitly embedded in the point clouds. Therefore, the constructed spatial position constraint is applicable to subspace learning of all views. Intuitively, for point cloud data, the K neighboring points in the spatial position space tend to belong to the same category, and points of the same category also have a similar data distribution in the feature space. Based on this observation, a coordinate distance constraint from each point to its K neighboring points is imposed, which is expressed by a spatial position weight matrix. Thus, the spatial position weight matrix S_p of the point cloud is expressed as follows: where c_i is the 3D coordinate of point p_i, and N_K(·) represents the set of the K nearest neighbor points in the spatial position space.
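The weight matrix and its Laplacian can be sketched as below. The heat-kernel weighting and σ are assumptions (the text only fixes K nearest neighbours and zero weight for non-neighbours); the Laplacian follows the standard L = D − S construction used by graph-smoothness terms of the form tr(Z L Zᵀ).

```python
import numpy as np

def knn_weight_matrix(coords, K=5, sigma=1.0):
    """K-nearest-neighbour weight matrix in position space.
    Heat-kernel weights are an assumption; non-neighbours get weight 0."""
    n = len(coords)
    D = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    S = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(D[i])[1:K + 1]          # skip the point itself
        S[i, nbrs] = np.exp(-D[i, nbrs] ** 2 / (2 * sigma ** 2))
    return np.maximum(S, S.T)                      # symmetrize

def laplacian(S):
    """Graph Laplacian L = D - S with D the diagonal degree matrix."""
    return np.diag(S.sum(axis=1)) - S
```

Because the weights are non-negative, L is positive semi-definite, so a smoothness penalty tr(Z L Zᵀ) on projected features Z is always non-negative and small exactly when neighbouring points have similar projections.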
(2) Point Cloud Feature Space Constraint Term
Similarly, in the feature space, the K neighboring points tend to belong to objects of the same category. Thus, a feature distance constraint from each point to its K neighboring points in feature space can be constructed, which is expressed by a feature space weight matrix. The point cloud feature space weight matrix S_f is constructed as follows: where N_K^f(p_j) represents the K nearest neighbors of point p_j in the feature space.
Considering points of the same category also have similar data distribution in the spatial position space and feature space, spatial position space and feature space constraints in the same view can be jointly expressed as Equation (5). More details can be found in [17][18][19].
where D_p and D_f are diagonal matrices with (D_p)_ii = ∑_j (S_p)_ij and (D_f)_ii = ∑_j (S_f)_ij, and L_p = D_p − S_p and L_f = D_f − S_f are the corresponding graph Laplacians. The operator tr(·) is the trace of a matrix, and the remaining coefficient is a trade-off parameter. As shown in Figure 2, points of the same color belong to the same category in Figure 2(a). Figure 2(b) shows the 5 neighboring points of p1 (the red point) in the spatial position space. Figure 2(c) shows the different view relationships of p1 and its neighboring points in the feature space. The weight of each point in the blue circle (Figure 2(b)) and the red circle (Figure 2(c)) depends on its distance to p1. According to Equations (3) and (4), the weights of neighboring points in the different spaces are described by S_p and S_f; if two points are not neighbors, the weight is set to 0. Although features of different views differ to a certain degree, points from the same category need to have similar adjacency relationships in the feature space of every view. To ensure that local points have a similar relationship graph across multi-view features, the relationship graph can be constrained by the features of different views. Figure 2(d) shows the relationship of neighboring points for the projected features, which is obtained under the constraints of spatial position space and feature space. Generally, neighboring points in the spatial position space and feature space usually belong to the same category; thus, these constraints can guarantee the similarity of the projected features of neighboring points. To minimize the feature discrepancy of points in the subspace, the joint constraints of spatial position space and feature space are used to construct the multi-view objective function, which is shown below: where L_f^(v) is the Laplacian matrix of the vth view in the feature space.

Label Consistency
(1) Label Consistency of Grouped Points (LCG)
It should be noted that the corresponding labels need to be consistent before and after the feature transformation of points of the same category. Based on this constraint, a label matrix Q of grouped points is built. Assuming, for example, that points p1 and p2 belong to the first category, p3 and p4 belong to the second category, and p5 and p6 belong to the third category, the matrix Q can be expressed accordingly. After defining Q, the corresponding objective function of LCG can be expressed as follows: where G is the weight matrix, and the term ‖G‖²_F is a regularization constraint used to prevent overfitting.
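A minimal sketch of building the grouped-points label matrix Q for the toy assignment above (two points per category, three categories). The one-hot, classes-by-points layout is an assumption consistent with label-consistency constructions such as LC-KSVD [12]; the paper's exact Q may differ.

```python
import numpy as np

def grouped_label_matrix(labels, num_classes):
    """One-hot grouped-points label matrix Q (num_classes x num_points):
    column j is the indicator of point j's class, so points of the same
    category share identical columns."""
    Q = np.zeros((num_classes, len(labels)))
    for j, c in enumerate(labels):
        Q[c, j] = 1.0
    return Q
```

With this layout, the LCG term pushes the transformed grouped labels of same-category points toward the same target column, which is what "labels consistent before and after the transformation" requires.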
In addition, the ground-truth labels of grouped points should be consistent with the predicted grouped labels of each view, and the predicted grouped labels should be consistent between different views. Therefore, we introduce the difference between the predicted grouped labels of different views as a constraint, computed as the squared Frobenius norm of their difference. The corresponding multi-view objective function can then be expressed as follows: where the two additional coefficients are trade-off parameters.
(2) Label Consistency of Single Point (LCS)
While the classification results of the different views need to be as consistent as possible, each view's classification result also needs to be close to the ground-truth label. After the point cloud feature transformation, the classification results obtained by the linear classifier from the projected features should be consistent with the ground truth. The LCS can be expressed as: where H is the linear classification matrix, and the term ‖H‖²_F is a regularization constraint used to prevent overfitting. F is the ground-truth label matrix, which is the same for all views. To make the predicted labels of the different views consistent, we introduce the discrepancy between the predicted labels of each point at different views as a constraint, computed as the squared Frobenius norm of their difference. The objective function of the multi-view LCS can then be expressed as follows: where the two additional coefficients are trade-off parameters.

Objective Function of MvsRCLC
To give the point cloud features stronger expressive ability, MvsRCLC learns the discriminative optimal feature expression through the subspace learning method. To leverage the diversity of subspace feature expressions from different views, joint constraint terms of position space and feature space are introduced. To make sure that the subspace feature expressions of the same category are consistent across views, and that the subspace feature expression has the best discriminability, MvsRCLC introduces the LCG and LCS constraints. The grouped-point labels Q and the single-point labels F are used to optimize the discriminability of the multi-view subspace expression for each category. Therefore, the objective function of MvsRCLC is as follows: where the associated coefficients are trade-off parameters. In Equation (11), W_v, G_v and H_v need to be optimized. After the problem has been solved, more discriminative point features, i.e., the projected features Z_v = W_v X_v, and the linear classifier of each view can be obtained.

Optimization Technique
Since the objective function is highly non-linear, conventional techniques such as gradient descent or Newton's method cannot be used directly in our situation. Instead, an iterative optimization strategy is utilized to solve Equation (11). The detailed pseudocode is given in Table 1. For convenience, we drop the view subscript in Equation (11).
As mentioned before, we use the MvsRCLC optimization algorithm listed in Table 1 to solve the objective function. More precisely, in each iteration, we update only one variable while the rest of the variables are fixed. By doing this, W, G and H can be optimized individually.
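The alternating scheme can be written generically as block-coordinate descent. The skeleton below is an illustrative stand-in for Table 1, not the paper's pseudocode; in MvsRCLC the update callbacks would be L-BFGS for W and the closed-form solutions for G and H.

```python
import numpy as np

def alternating_minimize(loss, updates, params, iters=30):
    """Block-coordinate descent skeleton: each callback in `updates`
    recomputes one variable with the others fixed; the loss history
    tracks convergence."""
    history = [loss(params)]
    for _ in range(iters):
        for name, update in updates.items():
            params[name] = update(params)     # update one block, others fixed
        history.append(loss(params))
    return params, history
```

When each callback exactly minimizes the objective over its own block, the recorded loss sequence is non-increasing, which matches the rapid convergence reported for the scenes in this paper.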

Update of W
Once we fix G and H, Equation (11) can be transformed into Equation (12) with W as the only variable. Note that Equation (12) can be considered an unconstrained optimization problem. We take the derivative of Equation (12) with respect to W, where w_i represents the ith row of the matrix W. Given the original feature matrix of the training data, the unconstrained optimization method L-BFGS [40] is used to update W.

Update of G
Similarly, fixing H and W, Equation (11) can be turned into Equation (14) with G as the only variable. Equation (14) is also an unconstrained optimization problem. We take the derivative of Equation (14) with respect to G. Then, by setting the derivative ∂J/∂G = 0, we obtain the solution of G. Thus, the weight matrix G can be updated by Equation (15).
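If the G-subproblem is reduced to a regularized least-squares fit of the projected features Z to the grouped labels Q (dropping the cross-view consistency terms for brevity, which is an assumption), setting the derivative to zero gives the familiar ridge-regression closed form:

```python
import numpy as np

def update_G(Z, Q, lam=0.1):
    """Closed-form solution of min_G ||G Z - Q||_F^2 + lam * ||G||_F^2,
    a sketch of the kind of update Equation (15) yields:
    G = Q Z^T (Z Z^T + lam I)^(-1)."""
    k = Z.shape[0]
    return Q @ Z.T @ np.linalg.inv(Z @ Z.T + lam * np.eye(k))
```

The update for H in the next subsection has the same shape with the single-point labels F in place of Q, which is why both variables admit cheap exact sub-solves inside the iteration.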

Update of H
When G and W are fixed, Equation (11) can be converted into Equation (16) with H as the only variable. Equation (16) is an unconstrained optimization problem. We take the derivative of Equation (16) with respect to H. By setting the derivative ∂J/∂H = 0, we obtain the solution of H. Thus, the weight matrix H can be updated by Equation (17).
After optimization, the optimal weight matrices of W, G and H can be calculated. More details regarding the detailed optimization process are shown in Table 1.

Point Cloud Labeling
After the objective function has been solved, the feature transformation matrix and the label projection matrix (linear classifier) of each view have been learned. For the testing set, the point cloud can be classified using the learned projection matrices and linear classifiers. Since W_v and H_v are the optimal solutions, feeding in the features X_v of new points, the classification result of the point cloud can be obtained: where c is the number of categories, ŷ is the predicted multi-view classification label, and ϱ_v is the weight of each view for point cloud classification, with ∑_v ϱ_v = 1; ϱ_v is determined by the corresponding ratio in each view.
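The weighted fusion of the per-view classifier outputs can be sketched as follows; the matrices and the weight values are illustrative, and the exact rule for deriving the view weights ϱ_v is the ratio defined in the paper.

```python
import numpy as np

def classify_multiview(features, Ws, Hs, weights):
    """Fuse per-view linear-classifier scores H_v (W_v X_v) with view
    weights summing to 1, then take the arg-max class per point.
    features: list of d_v x n matrices; Ws: k x d_v; Hs: c x k."""
    assert abs(sum(weights) - 1.0) < 1e-9     # weights must sum to 1
    score = sum(w * (H @ (W @ X))
                for X, W, H, w in zip(features, Ws, Hs, weights))
    return np.argmax(score, axis=0)           # predicted class per column
```

Fusing scores before the arg-max lets a confident view outvote an uncertain one, rather than forcing a hard per-view decision first.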

Performance Evaluation
In this section, we first briefly describe the experimental data and evaluation metrics. Afterwards, we compare the classification results of the proposed algorithm with other related algorithms in two different experiments. Finally, we analyze the parameters and convergence of the proposed method.

Experiment Data and Evaluation Metrics
Five different point cloud scenes (see Figure 3) are used to evaluate the performance of the proposed algorithm. The five scenes are divided into three types according to the platform on which the LiDAR sensor is mounted. The first type is ALS point cloud data, including Scene1, scanned in a residential area, and Scene2, scanned in an urban area. These two scenes were collected by a Leica ALS50 system with a mean flying height of 500 m above ground and a 45° field of view in Tianjin, China [41]. The point density is approximately 20-30 points/m². The ground points of these two scenes have been manually filtered out. As shown in Figures 3(a) and 3(b), the data of these two scenes contain only non-ground points, i.e., buildings, trees and cars. Note that the training and testing samples are defined in [41]. The second type of data used in this paper is MLS point clouds, including Scene3 and Scene4. The point clouds in Scene3 and Scene4 were acquired by a backpack mobile acquisition device [42], i.e., a mobile laser scanner carried by a person, in Shenyang and Beijing, China. As shown in Figures 3(c) and 3(d), the data of these two scenes (with ground points filtered out manually) were manually labeled by ourselves into four and nine categories, respectively. Moreover, Scene4 is complex data published in [4], with a large fluctuation in point numbers among the nine classes, i.e., buildings, trees, cars, pedestrians, wire poles, street lamps, traffic signs, wires and pylons. The third type is TLS point cloud data, i.e., Scene5, which was acquired by a terrestrial laser scanner (RIEGL MS-Z620) in an urban area [5]. Affected by the distance of objects to the scanner, the point density of Scene5 varies greatly, and many objects are incomplete and noisy. As shown in Figure 3(e), similar to the ALS point clouds, the ground points of Scene5 have been filtered out manually.
Four categories, i.e., Cars, Trees, Pedestrians, and Buildings, are used to evaluate the classification performance. The specific information of the five scenes is shown in Table 2. The data of Scene1, Scene2 and Scene5 can be downloaded from the author's website (http://geogother.bnu.edu.cn/teacherweb/zhangliqiang/). The point clouds of Scene3 (https://pan.baidu.com/s/1WA_YwOACBcy5jArUAmd6xA) and Scene4 (https://pan.baidu.com/s/1lOCe39sfPvkpPDTY1-TOrw) are also public datasets and have been extensively used in previously published works [4,43]. In our experiments, the proposed algorithm is implemented in MATLAB 2017b. All experiments were run on a personal computer equipped with a 4.20 GHz Intel Core i7-7700K CPU and 24 GB of main memory. To be comprehensive, we use four popular evaluation indicators to evaluate the classification performance of each category: precision, recall, intersection over union (IoU), and F1-score. The overall classification results of each scene are evaluated by four other popular metrics, namely overall accuracy (OA), mean intersection over union (mIoU), Kappa and mF1. The detailed definitions of these metrics are presented in [3][4].
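For reference, the per-category and overall metrics listed above can be computed from a confusion matrix as sketched below; these are the standard definitions, and the small smoothing constant is an implementation convenience rather than part of [3][4].

```python
import numpy as np

def classification_metrics(C):
    """OA, per-class IoU and F1 from confusion matrix C, where C[i, j]
    counts points of true class i predicted as class j."""
    C = np.asarray(C, dtype=float)
    tp = np.diag(C)
    fp = C.sum(axis=0) - tp        # false positives per predicted class
    fn = C.sum(axis=1) - tp        # false negatives per true class
    oa = tp.sum() / C.sum()
    iou = tp / (tp + fp + fn + 1e-12)
    f1 = 2 * tp / (2 * tp + fp + fn + 1e-12)
    return oa, iou, f1
```

mIoU and mF1 are then simply `iou.mean()` and `f1.mean()` over the categories.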

Experimental Results
To prove the effectiveness of the proposed algorithm from different perspectives, two groups of experiments are conducted using different types of point clouds. The first experiment group mainly verifies the classification results of the proposed method when selecting a small number of training samples from point cloud scenes with large point sets. The second experiment group mainly compares the proposed method with popular multi-view joint classification methods, demonstrating its classification performance when using a small number of training samples from point cloud scenes with small point sets. To avoid ambiguity, the terms large- and small-volume point sets in our context refer to point clouds with more than and fewer than 50 thousand points, respectively.

The First Experimental Group
Experiments were carried out on five different point cloud scenes shown in Table 2 to test the classification performance of the proposed algorithm on point clouds with a small percentage of training samples. A small number of points are selected for model training from each scene. The remaining points are regarded as testing points that need to be classified.
(1) Comparison methods
To highlight the performance of the proposed algorithm, we compare nine algorithms, which can be divided into two categories, i.e., single-view feature-based methods and multiple-feature-fusion-based methods. To prove the effectiveness of multi-view feature fusion, we compare our method with FC(our) and FSI(our). To show the advantages of multi-view feature joint learning, we compare the proposed method, FC(our) and FSI(our) with the following methods, where T is the sparsity factor, D is the dictionary, A is the linear transformation matrix, S is the sparse code, and Q is the discriminative sparse code in Equation (20). According to [11], the model can obtain the optimal classification results when the two trade-off parameters are set equal across multiple groups of experiments. In our experiments, the two parameters are each tuned from the set {0.001, 0.01, 0.1, 1, 10}. The remaining parameters are the default parameters of [11]. We adopt 128 dictionary atoms and set both parameters to 0.001 in the LC-KSVD1 and LC-KSVD2 experiments to obtain the optimal results.

9) LC-KSVD2 [11]: Based on the direct fusion of covariance eigenvalue features and spin image features, LC-KSVD2 is used for classification. The classification model of this method is shown below,
where Y is the predicted label matrix, and H is the linear classifier matrix.

(3) The results
For ALS point clouds, we use Scene1 and Scene2 shown in Table 2 to carry out experiments on the same hardware. This experiment group verifies the algorithms' labeling performance under the premise of using a small number of samples for training and a larger number of points for testing. The number of selected training points is less than 5% of the testing points. More specifically, 10,000 and 9602 points are selected from the training samples of Scene1 and Scene2 for training, and all of the testing point clouds are used as testing samples. To make an unbiased comparison, the same training and testing samples are used for the other methods. The classification result statistics for the various evaluation measures are shown in Table 3 and Table 4. To verify the efficiency of our method for point cloud classification, we compare the classification running time of the different methods with multiple features on the test sets of Scene1 (422,355 points) and Scene2 (236,802 points). The running time comparison is shown in Table 5. From Table 3, Table 4 and Table 5, it is easy to draw the conclusions below: 1) The classification performance of our method outperforms the other comparison methods on ALS point clouds. FC(our) and FSI(our) also achieve good classification performance. Moreover, not all multiple-feature-based methods are superior to single-feature-based methods. In this experiment group, among the multiple-feature-based methods, the OA and Kappa of our method are at least 4.3% higher than those of the other comparison methods, and the mIoU and mF1 of our method are at least 1.2% higher. Compared with the single-feature-based methods, our method outperforms FC(our), FSI(our), FC(SVM) and FSI(SVM), which indicates that the proposed method can effectively fuse multi-view features and that this fusion improves point cloud classification.
2) The classification performances of LC-KSVD1 and LC-KSVD2 outperform DKSVD, which proves the effectiveness of the LCG and LCS constraints. Adaboost cannot achieve better classification performance by directly concatenating the features of different views. Although RICA-SVM also concatenates the features of different views, it achieves relatively good classification results because it learns a transformation matrix for the fused features, thereby making the projected features more distinguishable. The proposed method achieves the highest values of OA, mIoU, Kappa and mF1 for Scene1 and the best OA, mIoU and Kappa for Scene2. These quantitative results demonstrate the effectiveness of jointly learning the feature projection matrices and multi-view classifiers under the label constraints.
3) As shown in Table 5, for the point cloud classification of Scene1 and Scene2, our method requires less than 4.5%, 2.7%, 1.3%, 0.9% and 44.2% of the running time of Adaboost, LC-KSVD1, LC-KSVD2, DKSVD and RICA-SVM, respectively, so our method is the fastest of the compared methods. Although the running time of RICA-SVM is relatively close to ours, the OA and Kappa of our method are at least 5.4% higher than those of RICA-SVM according to the classification accuracies for Scene1 and Scene2 in Table 3 and Table 4. These comparisons demonstrate that our method is superior to Adaboost, LC-KSVD1, LC-KSVD2, DKSVD and RICA-SVM in terms of both accuracy and efficiency.
To illustrate the advantages of the proposed method, we also compared its performance with RICA-SVM, the second-ranked classification approach based on multi-view fusion features. Figures 4(b) and 4(c) compare the labeling results on Scene2. As shown in Figure 4, both our method and RICA-SVM classify cars poorly. This is mainly due to the small number of car training points and the great similarity between the features extracted for cars and buildings. From the enlarged black boxes, we can see that RICA-SVM misclassifies a large number of trees as buildings, probably because the discriminability of the tree features is weakened by the direct fusion of multiple features. In contrast, our method is more accurate owing to the effective fusion of multiple features.

We used Scene3-Scene5, described in Table 2, to verify the labeling performance on MLS and TLS point clouds. More precisely, 3200, 7200 and 3200 points are randomly chosen as training points from Scene3, Scene4 and Scene5, which accounts for only 0.7%, 0.8% and 0.8% of the points of the corresponding scenes. The remaining points of each scene are used as testing data.
To make an unbiased comparison, the same training and testing samples were used for all methods in this group. The classification results of the different methods on Scene3-Scene5 are shown in Table 6-Table 8. From Tables 6-8, our method achieves similar classification performance on MLS and TLS point clouds, and it has clear advantages over the other compared algorithms: in all three scenes, the OA and Kappa of our method are at least 1% and 0.47% higher, respectively, than those of the other algorithms. FC(SVM), FSI(SVM), Adaboost, DKSVD, LC-KSVD1 and LC-KSVD2 can classify most points (OA ≥ 51%), which shows that the features of each view are relatively effective. RICA-SVM obtains one or two of the highest per-class values in each scene, which indicates that learning projected features from simply fused features through RICA helps the expression of discriminative features. In addition, our method is jointly trained with multi-view features, which enables FC(our) and FSI(our) to obtain better classification results: in the two MLS scenes and the TLS scene, FC(our) and FSI(our) outperform FC(SVM) and FSI(SVM). This proves that the per-view classification models jointly learned with multi-view features are applicable.
To further show the superiority of the proposed method, we compared it with LC-KSVD1, LC-KSVD2, DKSVD and RICA-SVM. Figure 5 compares the labeling results on Scene4. From the enlarged black boxes, we can see that the proposed method has obvious advantages in the classification of buildings and trees. In terms of overall classification performance, our results are closer to the ground truth, owing to the more discriminative features extracted by multi-view joint learning and the effective fusion of the multiple features.

The Second Experimental Group
(1) Comparison methods
To evaluate the performance of our method, we compared it with nine related algorithms. To highlight the performance of the proposed multi-view features joint learning and fusion method, we compared it with three representative multi-view learning methods, i.e., adaptively weighted Procrustes (AWP) [26], automatic multi-view graph and weights learning (AMGL) [23] and multi-view learning with adaptive neighbors (MLAN) [25]. AWP is an unsupervised method; comparing it with AMGL, MLAN and our method contrasts unsupervised and supervised multi-view classification. To prove the effectiveness of multi-view learning, we compared our method with the non-negative sparse graph (NNSG) method [19] and SVM [44]. Besides, we compare our method with FC(our), FSI(our), FC(SVM) and FSI(SVM) to show the effectiveness of multi-view feature fusion. The compared algorithms and their recommended parameters are described below:
1) AWP 2018 [26]: a multi-view unsupervised classification algorithm based on adaptively weighted Procrustes.
2) AMGL 2016 [23]: a new framework for automatic multi-view graph and weights learning. This method uses a small number of samples to train the classifier and then predicts the unlabeled samples. In this paper, the neighborhood parameter of AMGL is set to 5, and the remaining parameters keep the default values of the source code released in [23].
3) MLAN 2018 [25]: an effective multi-view model for adaptive learning of local data structure. The adaptive neighborhood value of MLAN is set to 9, and the remaining parameters keep the default values of the source code in [25].
4) NNSG 2015 [19]: a non-negative sparse graph method for learning linear regression. We set its three parameters to 0.001, 0.5 and 0.01, respectively.
5) SVM [44]: the multi-view features are directly concatenated and then provided as input to an SVM classifier for training and classification.
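The direct-fusion baseline above concatenates per-view features and trains a single classifier on the result. The sketch below illustrates that pipeline; a ridge-regularized least-squares classifier stands in for the SVM of [44] purely to keep the example self-contained, and all function names are ours, not the paper's:

```python
import numpy as np

def concat_views(views):
    """Stack per-view feature matrices (each n_points x d_v) column-wise."""
    return np.hstack(views)

def fit_linear(X, y, n_classes, reg=1e-3):
    """Ridge-regularized least-squares classifier on one-hot labels
    (an illustrative stand-in for the SVM used in the paper)."""
    Y = np.eye(n_classes)[y]  # one-hot targets, shape (n_points, n_classes)
    W = np.linalg.solve(X.T @ X + reg * np.eye(X.shape[1]), X.T @ Y)
    return W

def predict(X, W):
    """Assign each point the class with the largest linear score."""
    return np.argmax(X @ W, axis=1)
```

In this simplified setting, a point is labeled by concatenating its per-view feature vectors and applying one linear classifier, which is exactly the "direct fusion" strategy the paper argues against.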
Apart from the comparisons with the above five methods, we also compared our results with FC(our), FSI(our), FC(SVM) and FSI(SVM) (see Section 3.2.1 for detailed descriptions of these four configurations).
(2) The results
For testing ALS point clouds, we randomly selected 9062 points from Scene2, from which 300 points (3.31%) were randomly selected for training. The remaining points were used as testing samples. Figure 6 shows a qualitative comparison of the four multi-view methods, and the quantitative comparison on all evaluation metrics is shown in Table 9 (classification results of Scene2 in terms of precision/recall/intersection over union (IoU)/F1-score, OA, mIoU, Kappa and mF1 (%); the highest values are highlighted in bold).

From the results listed in Table 9 and illustrated in Figure 6, we can make the following observations: 1) In Figure 6, the results of the proposed method and AMGL are closest to the ground truth, while the remaining qualitative results show relatively large errors. From Table 9, our method attains the highest overall measures, 79.7% (OA), 56.6% (mIoU), 64.3% (Kappa) and 69.4% (mF1), which demonstrates its superiority.
2) AWP has a high OA, but it cannot classify cars because the number of car samples is relatively small in the experimental data; few correctly categorized car points are visible in Figure 6(c). In contrast, the AMGL algorithm mistakenly classifies numerous tree points as cars (see Figure 6(d)). Although the classification precision of trees and the recall of buildings reach 100% with the MLAN algorithm, numerous building points are misclassified as trees, and car points are not recognized, as demonstrated in Figure 6(e). NNSG can classify most cars, but a large number of building and tree points are falsely classified as cars, as evident in Figure 6(f), so its overall labeling result is relatively poor. Although the SVM algorithm classifies trees best, its overall labeling results are inferior to ours.
3) For the covariance eigenvalue features, the four overall evaluation metrics of the classification model obtained by our joint training method (FC(our)) are at least 4% higher than those of FC(SVM). For the spin image features, the model trained by joint learning (FSI(our)) also has certain advantages over FSI(SVM). Therefore, we can safely conclude that the classification models jointly trained by our method are more effective than the SVM-based models using single-view features. In addition, our multi-view fusion method outperforms single-view feature classification.
For testing MLS point clouds, we randomly selected 12,000 points from Scene3 as experimental data, from which 1200 points (10%) were randomly selected for training; the remaining points were used as testing samples. Figure 7 shows a qualitative comparison of the multi-view methods, and the quantitative comparison on all evaluation metrics is shown in Table 10. From the results listed in Table 10 and illustrated in Figure 7, we can make the following observations: 1) In both the qualitative and the quantitative comparison, our method classifies the Scene3 point cloud better than the other methods, achieving the highest value on each of the four overall evaluation metrics.
2) AWP, AMGL and NNSG perform poorly on every class, with a large number of misclassified points, as demonstrated in Figures 7(c)-7(f). MLAN has the best overall classification result for poles, and although it has the highest recall for trees, it produces more misclassifications for the other classes (see Figure 7(e)). SVM can distinguish trees and cars, achieving 36.0%/52.9% and 67.6%/80.6% in terms of IoU/F1-score. Although MLAN and SVM each show the best labeling result for a particular class, our method still has a clear advantage in the overall measures, obtaining 70.0% (OA), 49.7% (mIoU), 56.0% (Kappa) and 65.7% (mF1).
3) The classification results obtained by our multi-view fusion method are significantly better than those of the single-view feature classification methods. In addition, the jointly trained multi-view model outperforms direct fusion of the multi-view features.

Effectiveness on ISPRS 3D Semantic Labeling Dataset
To prove the effectiveness of the proposed method on the International Society for Photogrammetry and Remote Sensing (ISPRS) 3D Semantic Labeling Dataset [45], we conducted an experiment on this dataset. Here, we add three kinds of features, i.e., fast point feature histograms (FPFH), the normal angle distribution histogram (NAD) and the latitude sampling histogram (LSH) [3], as multi-view features. The experimental results, shown in Figure 8 and Table 11, demonstrate that the proposed method achieves promising classification performance for cars, trees, buildings and ground.
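As an illustration of the kind of histogram feature added here, the sketch below gives one plausible construction of a normal-angle distribution histogram from per-point unit normals; the binning and normalization choices are our assumptions, not necessarily those of [3]:

```python
import numpy as np

def normal_angle_histogram(normals, n_bins=8):
    """Normalized histogram of angles between unit normals and the
    vertical axis (one plausible reading of the NAD feature; the bin
    count and angle range are illustrative assumptions)."""
    # |cos(angle to vertical)| is the absolute z-component of a unit normal
    z = np.clip(np.abs(normals[:, 2]), 0.0, 1.0)
    angles = np.degrees(np.arccos(z))  # angles in [0, 90] degrees
    hist, _ = np.histogram(angles, bins=n_bins, range=(0.0, 90.0))
    return hist / max(hist.sum(), 1)   # normalize to a distribution
```

Flat horizontal surfaces (ground, roofs) concentrate mass in the first bin, while vertical structures (walls, poles) push mass toward the last bins, which is what makes such a histogram discriminative.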

Effectiveness of Multiple Constraints
In order to verify the contribution of each independent constraint (IC) in the objective function of our method, we compared different ICs for point cloud classification. We use three configurations to prove the effectiveness of the individual constraints:
(1) IC1: the subspace learning term alone, which removes redundant information by jointly learning the multi-view feature transformation matrices and linear classifiers; the resulting classifiers form the point cloud classification model. Its objective function retains only the subspace learning and regularization terms of Equation (11).
(2) IC2: based on IC1, we introduce the spatial position and feature space constraints into the objective function, as shown in Equation (22). The point cloud classification model is obtained by learning the transformation matrices and classifiers under these constraints.
(3) IC3: in order to verify the contributions of multi-view joint training and multi-view joint classification, we construct the classification model shown in Equation (23), which directly concatenates the derived features of all views.
To show the effectiveness of the spatial position and feature space constraints, we compare IC1 with IC2. To show the advantages of multi-view learning, we compare our method and IC2 with IC3. Besides, we also compare our method with IC2 to prove the effectiveness of LCG. We select Scene1 (ALS point clouds with 3 classes) and Scene4 (MLS point clouds with 9 classes) to verify the effectiveness of the individual constraints. In the experiment, 10,000 and 7200 training points are selected from Scene1 and Scene4, respectively. For IC1, IC2, IC3 and the proposed model, we keep the total dimension of the transformed features consistent (86 dimensions in total), and the maximum iteration number is set to 20. The classification results of the three modified configurations and of our full model are shown in Table 12, from which we make the following observations.
(1) IC2 outperforms IC1. The main reason is that the consistency constraint of position and feature spaces is introduced in different views, so the learned subspace can better reflect the local geometry of the data and the consistency of the local space distribution among different views.
(2) Our method achieves 77.49%/80.93% OA and 57.15%/68.62% Kappa on Scene1/Scene4, which are 4.54%/1.64% and 6.66%/1.25% higher than IC1, and 3.03%/0.25% and 3.73%/0.32% higher than IC2. These results prove that the LCG constraint and the label consistency among different views are effective, and that jointly learning the feature transformation matrices and classifiers improves the accuracy of point cloud classification. The main reason is that LCG makes the learned subspace more discriminative, while the constraints among different views make the feature expressions and classification results of the different views more consistent.
(3) The overall performance of our method also outperforms IC3. This shows that projecting and fusing features by multi-view joint learning is more effective than direct fusion classification. The main reason is that different features reflect different attributes of the same point; through the joint constraints, the multi-view features can make full use of these attributes and achieve better classification results than direct fusion of the multiple features.
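One simple way to combine per-view classifiers at prediction time, as opposed to IC3's direct feature concatenation, is weighted score averaging. The sketch below is an illustrative simplification, not the paper's exact fusion rule:

```python
import numpy as np

def fuse_predictions(view_scores, weights=None):
    """Combine per-view class-score matrices (each n_points x n_classes)
    by a weighted average, then take the arg-max label per point."""
    if weights is None:
        weights = [1.0 / len(view_scores)] * len(view_scores)  # uniform
    fused = sum(w * s for w, s in zip(weights, view_scores))
    return np.argmax(fused, axis=1)
```

The intuition matches the discussion above: a view whose features are weak for one class can be compensated by another view, whereas concatenation forces a single classifier to resolve all views at once.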

Parameters Analysis
In the proposed method, there are mainly seven key parameters that need to be tuned. In the experiments, four of them are fixed to 0.1, 0.1, 0.01 and 0.001, respectively. The remaining three parameters weight the position and feature space constraint term, the multi-view LCG constraint term and the multi-view LCS constraint term, respectively. We first jointly vary the LCG and LCS weights with the position/feature-space weight fixed: Figures 9(a) and 9(b) show the resulting classification accuracy on Scene1 and Scene4. The accuracy differences are small, within 0.8% and 0.05% for Scene1 and Scene4, respectively, and the fluctuations are smooth, which demonstrates that these two weights have little influence on the classification result within the given ranges. In addition, as shown in Figures 9(c) and 9(d), when another of the three weights is fixed and the remaining two (including the position/feature-space weight) vary within the given ranges, the accuracy changes are larger, within 5.5% and 0.25% for Scene1 and Scene4, respectively; nevertheless, the fluctuation is relatively stable except for a few parameter values. As shown in Figures 9(e) and 9(f), in the third situation the accuracy differences are within 2.5% and 0.2% for Scene1 and Scene4, respectively, and the fluctuations are again relatively small, i.e., the proposed method is relatively stable with respect to the varying weights. From Figure 9, we can observe that the accuracy fluctuation trends of the two scenes are similar in each situation (two parameter values vary while the third is fixed).
Besides, comparing the three situations shows that the position and feature space constraint weight has the largest influence on classification accuracy, while the LCG and LCS weights have relatively small influence. From the above analysis we can safely conclude that, within a certain range, the parameters of our method have a low impact on point cloud classification; that is, our method is relatively insensitive to these three weights.
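The sensitivity study above can be organized as a small grid evaluation: fix one weight, vary the other two over a grid, and record the spread (max minus min) of the resulting accuracies. A generic sketch, with the hypothetical `evaluate` callback standing in for a full train/test run:

```python
from itertools import product

def sensitivity_grid(evaluate, grid_a, grid_b, fixed):
    """Vary two hyper-parameters over a grid with the third fixed and
    report the spread (max - min) of the resulting accuracies.

    evaluate(a, b, fixed) -> accuracy is a placeholder for training and
    testing the full model with those parameter values.
    """
    accs = {(a, b): evaluate(a, b, fixed) for a, b in product(grid_a, grid_b)}
    spread = max(accs.values()) - min(accs.values())
    return accs, spread
```

A small spread (e.g., the 0.8% and 0.05% observed for Scene1 and Scene4) is what justifies calling the method insensitive to those two parameters.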

Convergence Analysis
The objective function shown in Equation (11) is highly non-linear; although it is solvable, it is difficult to optimize all of its variables simultaneously. The objective function is therefore optimized by the proposed update rules depicted in Table 1. Figure 10 reports the objective function values over 25 iterations: Figures 10(a) and 10(b) show the curves for Scene1 and Scene4, respectively. We can see that our optimization method quickly converges to a local (or even global) minimum on both scenes.
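The alternating scheme of Table 1 updates one block of variables while the others are held fixed and monitors the objective value. Its loop structure can be sketched generically as follows (all names are placeholders, not the paper's actual update rules):

```python
def alternating_minimize(step_P, step_W, objective, P0, W0,
                         max_iter=25, tol=1e-6):
    """Generic alternating optimization: update each block of variables
    with the other fixed, and stop when the objective stops decreasing.

    step_P / step_W are placeholder per-block update rules; objective
    returns the current objective function value.
    """
    P, W = P0, W0
    history = [objective(P, W)]
    for _ in range(max_iter):
        P = step_P(P, W)  # update projections, classifiers fixed
        W = step_W(P, W)  # update classifiers, projections fixed
        history.append(objective(P, W))
        if abs(history[-2] - history[-1]) < tol:
            break  # converged: objective no longer decreasing
    return P, W, history
```

If each block update does not increase the objective, the recorded history is non-increasing and bounded below, which is the usual argument for the kind of monotone convergence curves shown in Figure 10.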

Conclusions
In this paper, we have proposed a multi-view joint learning framework for point cloud classification. The framework includes a multi-view subspace learning term for removing redundant information and representing low-dimensional features; a local distribution consistency constraint of feature space and position space for expressing the adjacency of neighborhood points; and label consistency terms enforcing consistency between the predicted labels of all views and the ground truth as well as consistency among the predicted labels of the individual views. These terms are combined to learn the transformation matrices and optimal classifiers through an iterative optimization of the objective function, which converges quickly. Experiments performed on two ALS point cloud scenes, two MLS point cloud scenes and a TLS point cloud scene with few training points confirm that our method outperforms the compared algorithms.

Although our method achieves more promising classification accuracy than the compared algorithms, some drawbacks remain and several ideas could extend the research reported in this paper. Currently, the proposed method cannot accurately label city-scale point clouds with complex geometric shapes and diverse objects. In addition, the derived multiple features are relatively simple; point-set features and high-level features are not fully explored. In future work, the multi-view joint learning method could be combined with deep-learning technologies to learn a multi-view deep network for improving the semantic labeling of point clouds.
Author Contributions: Y.L. and D.C. analyzed the data and wrote the Matlab source code. G.T., and S.-B.X. helped with the project and study design, paper writing, and analysis of the results. J.P. and Y.W. helped with the data analysis, experimental analysis, and comparisons. All authors have read and agreed to the published version of the manuscript.