TUM-MLS-2016: An Annotated Mobile LiDAR Dataset of the TUM City Campus for Semantic Point Cloud Interpretation in Urban Areas

Abstract: In the past decade, a large number of strategies, methods, and algorithms have been developed to explore the semantic interpretation of 3D point clouds for extracting desirable information. To assess the performance of the developed algorithms or methods, public standard benchmark datasets should invariably be introduced and used, which serve as an indicator and ruler in the evaluation and comparison. In this work, we introduce and present large-scale Mobile LiDAR point clouds acquired at the city campus of the Technical University of Munich, which have been manually annotated and can be used for the evaluation of related algorithms and methods for semantic point cloud interpretation. We created three datasets from a measurement campaign conducted in April 2016, including a benchmark dataset for semantic labeling, test data for instance segmentation, and test data for annotated single 360° laser scans. These datasets cover an urban area of approximately 1 km long roadways and include more than 40 million annotated points with eight classes of objects labeled. Moreover, experiments were carried out in which the results of several baseline methods were compared and analyzed, revealing the quality of this dataset and its effectiveness when using it for performance evaluation.


Introduction
Geospatial data plays a vital role in a wide variety of urban applications like road mapping, field navigation, and building reconstruction [1]. Recently, the spreading uses of 3D point clouds generated from Light Detection and Ranging (LiDAR) systems or multi-view stereo vision provide more diverse options of using geospatial data, with accurate geometric and rich radiometric information [2]. In particular, point clouds measured with the LiDAR system mounted on a mobile platform (MLS) can directly map large-scale urban areas with accurate and detailed 3D measures [3,4]. However, for applying this appealing type of geospatial data in practical jobs, a semantic interpretation of the acquired point clouds is often an obligatory procedure [5]. In this regard, in the past decade, a vast amount of strategies, methods, and algorithms have been developed to explore the semantic interpretation of 3D point clouds for extracting desirable information. To assess the performance of the developed algorithms or methods, public standard benchmark datasets should invariably be introduced and used, which serve as an indicator and ruler in the evaluation and comparison [6]. The innovative contributions of this paper are twofold: (1) We introduce and present three large-scale annotated point cloud datasets with point-wise labels and instance cases for semantic interpretation in urban areas. (2) We give an extensive comparison on the performance of semantic labeling methods on the proposed benchmark dataset.
The remainder of this paper is organized as follows: A brief literature review of Mobile LiDAR datasets is given in Section 2. The description of our proposed datasets is provided in Section 3. Subsequently, an experimental evaluation using the proposed benchmark dataset and related discussions are given in Section 4. Finally, conclusions about the proposed datasets are drawn in Section 5.

Benchmark Datasets from MLS Point Clouds for Semantic Interpretation
With the rapid development of point cloud processing techniques, a wide range of benchmark datasets for various tasks have been presented. With respect to semantic segmentation and semantic labeling, plenty of benchmark datasets have already been presented, such as the Oakland outdoor MLS dataset [7], the Semantic3D.net TLS dataset (Semantic3D) [8], our own but unannotated TUM-City-Campus MLS (2016) dataset [4], the Paris-Lille-3D MLS dataset [9], the Toronto-3D MLS dataset [10], the Daimler urban segmentation dataset [11], and the A2D2 dataset [12]. However, each benchmark point cloud dataset is specific to the platform used for measuring the 3D points. This means that the attributes, accuracy, density, and quality of different types of point clouds vary significantly due to the different platforms used in the measuring [13]. Thus, for evaluating algorithms and methods designed for different applications, the different types of point clouds used for generating benchmark datasets should be considered. Moreover, the costs and difficulties of generating these benchmark datasets differ considerably as well. Thus, there are only a few accessible benchmarks from MLS point clouds, and the representative ones include Oakland 3D [7], the Sydney Urban Objects Dataset [14], iQmulus [15], Paris-Lille-3D [9], SemanticKITTI [16], Toronto-3D [10], the Daimler urban segmentation dataset [11], and the A2D2 dataset [12]. A brief introduction of these datasets is as follows:
• The Oakland 3D dataset [7] is one of the earliest publicly accessible MLS datasets for semantic labeling. The dataset was acquired by a side-looking SICK LMS sensor in a push-broom manner around the Carnegie Mellon University campus in Oakland, Pittsburgh, PA. By default, this dataset has been separated into training, validation, and test parts, with a total of about 1.6 million points. All the points in this dataset were assigned labels of 44 classes of objects, but only 5 classes among them can be used for the evaluation.

• The Sydney Urban Objects dataset [14] is a dataset containing a variety of common urban road objects. This dataset was collected in the CBD of Sydney, Australia, by a Velodyne HDL-64E LiDAR sensor. The entire dataset consists of 631 individual scans. All points were labeled with four classes of objects, including vehicles, pedestrians, traffic signs and trees. As an evaluation dataset, it was designed to test matching and classification algorithms, with a large variability in viewpoint and occlusion.

• The iQmulus dataset [15] is also an early published MLS dataset, which served the IQmulus and TerraMobilita Contest. This dataset was acquired in the 6th district of Paris by the Stereopolis II system with a Riegl LMS-Q120i LiDAR sensor. The entire dataset comprises more than 300 million points. All the points in this dataset were assigned labels of 22 classes of objects. However, only a 200 m long subset, including 12 million points of 8 classes, is available for public evaluation.

• Paris-Lille-3D [9] is a recently published MLS dataset for both semantic labeling and instance segmentation. The dataset was acquired in the streets of Paris and Lille by an MLS system with a Velodyne HDL-32E LiDAR sensor. The entire dataset comprises more than 140 million points, covering approximately 2 km of roadways. All the points in this dataset were assigned labels of 50 classes of objects. For public benchmark evaluation, labels of 9 classes are provided. Moreover, in addition to point-wise labels, individual objects like cars and trees are segmented as instances for evaluation use.

• The SemanticKITTI dataset [16] is one of the newest publicly accessible MLS datasets for semantic segmentation. This dataset was created by annotating the renowned KITTI dataset [17]. It comprises about 4.5 billion points, covering a roadway of 40 km, and is presented as a sequence of scans. The points of each sequential scan were labeled with 25 classes for evaluation purposes.
• Toronto-3D [10] is a recent MLS dataset for semantic labeling. This dataset was acquired on Avenue Road in Toronto, Canada, via a vehicle-mounted MLS system with a 32-line LiDAR sensor. It comprises approximately 78.3 million points, covering approximately 1 km of roadways. All the points in this dataset were assigned labels of 7 classes of objects and 1 class of unclassified ones. By default, this dataset has been separated into four parts, and each part covers a road length of about 250 m. For evaluation purposes, theoretically, any part can be used as test data and the rest as training data, or vice versa.

• The Daimler urban segmentation dataset [11] is not an MLS dataset, but can still be considered related (3D): It consists of 5000 rectified stereo image pairs, of which 500 frames come with pixel-level semantic class annotations into five classes. Dense disparity maps, computed using semi-global matching, are provided as a reference.

• Audi's recent A2D2 dataset [12] is provided for research in the context of autonomous driving. This dataset was acquired in three cities in the south of Germany, namely: Gaimersheim, Ingolstadt, and Munich. In total, six cameras and five Velodyne VLP-16 LiDAR sensors were used. 41,277 images have semantic and instance segmentation labels for 38 categories. The annotation of the point clouds is generated by projecting the points to the 38,481 semantically labeled images with calibrated relative position and orientation of the sensors.
Even though all the above-mentioned datasets contain semantically labeled 3D data, our MLS dataset differs from them in several aspects. Typically, related work focuses on either real-time computer vision tasks, such as autonomous driving, or on mobile mapping, e.g., for the generation of city models or other tasks related to geoinformatics. Typical representatives of the first type are the datasets provided by car companies. For example, Audi's A2D2 dataset and Daimler's urban segmentation dataset are designed for developments in view of autonomous driving and focus on traffic participants at the street level. A typical representative of the second type is the iQmulus dataset, where the focus is on high density data acquisition and offline scene analysis of large urban areas. Real-time aspects like the perception of current events during data acquisition are completely ignored in this case.
With our TUM-MLS-2016 dataset, we want to bridge the gap between different communities, e.g., computer vision, robotics, and geoinformatics. Our dataset covers the time course of the measuring run and the street scene with real-time events, but also a consistent and area-wide representation of the surveyed urban area including high-rise facades of buildings. A representation of a large urban area by a 3D point cloud and its semantic interpretation could leverage research on real-time applications like self-localization and traffic monitoring, for which we provide an all-in-one dataset. In addition, we provide different kinds of labels, including semantic labels and single instances of relevant objects. Compared with image-based 2D labels transferred to 3D points by projection, annotations of our point clouds have been made directly in 3D, which has been a labor-intensive operation but can be considered more reliable.
A special sensor configuration with two obliquely rotating laser scanners was chosen to provide a data basis as universal as possible under the two aspects mentioned above, real-time applications and mobile mapping. Although this configuration is special, the sensor data can represent or simulate that of state-of-the-art mapping sensor systems as well as sensors discussed and designed for autonomous driving (e.g., forward looking solid-state LiDAR sensors, flash LiDAR cameras with overlapping fields-of-view). We have put a lot of effort into correct georeferencing of the 3D data using an inertial navigation system including RTK-GNSS. On the one hand, this is necessary to be able to use the dataset in the context of geoinformatics. On the other hand, and with the existence of loops in the trajectories, it can be a perfect testbed for sophisticated investigations like LiDAR-based Semantic SLAM.
To give a better impression of current benchmark datasets of MLS point clouds, we compare comprehensive indicators of the above-mentioned datasets in Table 1. As seen from Table 1, current MLS benchmark datasets have several remarkable limitations. As a result, different algorithms and methods usually show inconsistent performance on different datasets. Accordingly, the algorithms or methods designed for certain tasks should be assessed with corresponding benchmark datasets. Otherwise, the evaluation would be biased.

TUM-City-Campus MLS Dataset
Based on the analysis of the problems existing in the current benchmark datasets of MLS point clouds, we present our large-scale Mobile LiDAR datasets termed TUM-City-Campus MLS (2016), which are designed for semantic interpretation of MLS point clouds in urban areas. Video clips illustrating our dataset and further information are available on the website (https://www.pf.bgu.tum.de/en/pub/tst.html) with supplementary material [18].

Data Acquisition, Preparation, and Annotation
The MLS data were acquired in April 2016 by Fraunhofer IOSB with their MODISSA mobile sensor platform. At Fraunhofer IOSB, the experimental multi-sensor vehicle MODISSA (Mobile Distributed Situation Awareness) is used for hardware evaluation and software development in the contexts of automotive safety and security applications. At the time of the data acquisition in 2016, MODISSA was equipped with two Velodyne HDL-64E LiDAR sensors above the windshield, where each Velodyne HDL-64E was configured to have a rotational frequency of 10 Hz and acquired 130,000 range measurements (3D points) per rotation with distances up to 120 m. Each sensor consists of 64 laser rangefinders, which divide the vertical field of view of 26.8° into 64 scan lines. Both laser scanners were positioned on wedges at a 25° angle to the horizontal, rotated outwards at a 45° angle (see Figure 2a). Reasons to use two obliquely rotating laser scanners have already been given in Section 2. There are additional positive features of this configuration. To a large extent, it prevents measurements of the vehicle's roof and still guarantees a good coverage of the roadway in front and to the sides of the vehicle. At the same time, the facades of buildings are captured in their entire height, which is useful for mobile mapping purposes (see Figure 2b). With the data of both sensors being synchronized in time, the overlap between the two sensors can simulate the overlapping fields-of-view of directed LiDAR cameras to be used for autonomous driving. In addition, this configuration increases the overall point density. The LiDAR data were recorded synchronously with position and orientation data of an Applanix POS LV 520 inertial navigation system (INS), which was augmented by RTK correction data of the German SAPOS network.
All lever arms and boresight directions of the system components had been thoroughly calibrated beforehand [19], such that it was possible to perform direct georeferencing of the LiDAR data and aggregate all resulting 3D points in a common local ENU coordinate frame. Although the data acquisition is continuous, for convenience and by convention we split the stream of georeferenced 3D points into a sequence of scans of 1/10 second duration, corresponding to single 360° scans of the scanner heads rotating at 10 Hz. A comprehensive description of the sensor system can be found in [20].
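The splitting of the continuous point stream into 1/10 s scans can be illustrated with a minimal sketch. It assumes that every point carries a timestamp in seconds, which is not stated in the text; the tuple layout `(t, x, y, z)` and the function name `split_into_scans` are hypothetical and do not describe the actual processing chain.

```python
from collections import defaultdict

SCAN_PERIOD_S = 0.1  # one rotation of the scanner head at 10 Hz


def split_into_scans(points, t0=None, period=SCAN_PERIOD_S):
    """Group (t, x, y, z) tuples into scans by elapsed-time bin.

    Assumption: timestamps are given in seconds; t0 defaults to the
    earliest timestamp in the stream.
    """
    if not points:
        return {}
    if t0 is None:
        t0 = min(p[0] for p in points)
    scans = defaultdict(list)
    for t, x, y, z in points:
        # The small epsilon guards against floating-point round-off
        # for timestamps that lie exactly on a bin boundary.
        index = int((t - t0) / period + 1e-9)
        scans[index].append((x, y, z))
    return dict(scans)
```

With 130,000 points per rotation, each bin of such a split would hold roughly one full rotation of one scanner.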
The data acquisition took place in the area of the city campus of Technical University of Munich (TUM) in Munich, Germany. Figure 2b illustrates the data acquisition and shows the footprint of each laser scanner in different color [4]. MLS data in more than 17 thousand 360° scans have been acquired by each of the two laser scanners and directly georeferenced while driving along the roads around the TUM city campus and the inner yard. This covers an urban scenario consisting of building facades, trees, bushes, parked vehicles, wedges, roads, grass and so on. Each point has 3D x-, y-, and z-coordinates and intensities of the laser reflectance. In Figure 3, we illustrate the aerial image of the measured areas and the acquired and georeferenced MLS point clouds. In the annotation, all the measured points in the scene were manually labeled with eight semantic classes following the ETH standard (Semantic3D.net benchmark) [21] and one unclassified class. The point-wise labels were assigned manually using CloudCompare 2.10 (https://www.danielgm.net/cc/). In Figure 1d, an illustration of points with these classes is given, with points of various labels rendered with different colors. The details of these eight different classes are given in Table 2. Based on the annotation of points, we created three datasets serving the evaluation of related methods and algorithms, including a benchmark dataset for semantic labeling, test data for instance segmentation, and test data for individual labeled 360° scans. These three datasets are related to the two core tasks of semantic interpretation, namely semantic labeling and object segmentation. In Figure 4, we give an illustration of the creation of these three datasets with the involved processing steps and a workflow.

Benchmark for Semantic Labeling
As shown in the workflow of Figure 4, for generating the benchmark dataset for semantic labeling, we started with re-merging all points of all the georeferenced single scans into a large point cloud. Then, the merged point cloud was preprocessed by the statistical outlier removal (SOR) filter and downsampled, with duplicated points deleted. These duplicated points were mainly caused by repetitive scans when the vehicle was waiting for traffic lights. After these two steps, the number of points has been greatly reduced. Sequentially, the distant points in the scan with a sparse density were cropped and removed. Based on the filtered and cropped point cloud, we conducted the annotation of points manually according to the standard stated in the previous subsection. The total number of annotated points is more than 40 million. With these annotated points, we created a benchmark dataset for the evaluation of semantic labeling. Here, only annotated points of the eight semantic classes are kept, and those which belong to the unclassified class are removed. In Figure 5, we show the entire annotated benchmark dataset with eight object classes. For evaluation purposes, the entire labeled dataset of the TUM city campus has been evenly divided into three areas according to the covered area size, and the numbers of points in these three areas are around 20 million, 16 million, and 13 million, respectively. In Figure 3a, the separation of these three areas is displayed. Points of each area are saved in the same ply files.
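The two preprocessing steps named above, outlier filtering and downsampling with duplicate removal, can be sketched as follows. This is an illustrative brute-force version under stated assumptions, not the tooling used for the dataset: the voxel size, the neighbor count `k`, the `std_ratio`, and both function names are hypothetical, and a one-point-per-voxel rule stands in for the unspecified downsampling step.

```python
import math
from statistics import mean, pstdev


def statistical_outlier_removal(points, k=4, std_ratio=1.0):
    """Drop points whose mean distance to their k nearest neighbors
    exceeds (global mean + std_ratio * global standard deviation)."""
    mean_knn = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        mean_knn.append(mean(dists[:k]))
    threshold = mean(mean_knn) + std_ratio * pstdev(mean_knn)
    return [p for p, d in zip(points, mean_knn) if d <= threshold]


def voxel_downsample(points, voxel=0.05):
    """Keep the first point seen in each voxel, which also collapses
    duplicated points such as those scanned while the vehicle stood still."""
    seen = {}
    for p in points:
        key = tuple(math.floor(c / voxel) for c in p)
        seen.setdefault(key, p)
    return list(seen.values())
```

A call such as `voxel_downsample(statistical_outlier_removal(points))` follows the order given in the text. The quadratic neighbor search is only practical for small examples; a spatial index would be needed at the scale of tens of millions of points.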
In Figure 6, a statistic overview of labeled points in different areas is illustrated. As seen from the statistics, we can find that for all three areas, the percentages of points from different objects show different distributions and occur in an imbalanced way. The various distributions of points from different objects reveal different scenarios in these three areas. The scenario in Area 1 is a street scene hybridizing man-made objects and vegetation with a relatively balanced distribution. However, the numbers of points from hardscape and scanning artifacts are still considerably less than those of buildings and ground. The scenario in Area 2 is the inner yard of the campus, with all types of objects but very few points of vehicles. Actually, there were only one or two parked cars inside the yard. The scenario of Area 3 is a street scene with closely packed facades and almost no vegetation. Plenty of parked cars were also scanned in this scene. As a summary, we can comment that building points show a dominating tendency in all three areas. The ground points of man-made ground occupy the second-largest percentage for all areas. This would be beneficial for tasks like building reconstruction and vehicle detection. However, the points of artifacts and hardscape are less than 3% among all the annotated points in no matter which area, which makes the recognition of such objects a challenge. For the supervised method, it is recommended to use the points of Area 1 as the training data, and points of Areas 2 and 3 as the test data. The imbalanced distributions of percentages of labeled points in each area should be considered when making the evaluation.

Annotated Data for Instance Segmentation
Based on the annotated benchmark for semantic labeling, we also conducted an instance segmentation to the labeled points, so that points of the same instance can be separated and assigned with a unique label. For example, all points of a parked car are labeled as an individual object and rendered with a unique color. The instance segmentation is comparable to the labeled instances in the Paris-Lille-3D dataset. The creation of this dataset was achieved by two steps. The first step is the automatic segmentation using an unsupervised clustering [22]. Then, in the second step, a manual modification was carried out to correct errors and refine the boundaries of objects. In Figure 7, the instance segmentation of the labeled points is displayed. In total, there are 1002 objects of eight classes mentioned above labeled and segmented. The numbers of objects from different classes in the dataset for instance segmentation are given in Figure 8. As seen from the figure, we can find that there are a vast number of trees and vehicles that have been segmented and annotated. This could be a useful aspect for related tasks.
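To make the automatic first step more concrete, the following sketch groups points into instances by plain Euclidean distance clustering. This is a swapped-in, simpler technique and not the unsupervised clustering method of [22]; the `radius` value and the function name are hypothetical.

```python
import math
from collections import deque


def euclidean_clusters(points, radius=0.5):
    """Assign an instance id to every point so that points connected by
    a chain of neighbors closer than `radius` share the same id."""
    labels = [-1] * len(points)
    next_id = 0
    for seed in range(len(points)):
        if labels[seed] != -1:
            continue  # already part of an earlier instance
        labels[seed] = next_id
        queue = deque([seed])
        while queue:
            i = queue.popleft()
            for j, q in enumerate(points):
                if labels[j] == -1 and math.dist(points[i], q) <= radius:
                    labels[j] = next_id
                    queue.append(j)
        next_id += 1
    return labels
```

Running such a clustering separately on the points of each semantic class would keep instances from mixing classes. Touching objects, for example neighboring tree crowns, would still be merged, which is the kind of error the manual second step corrects.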

Annotated Data for Single 360° Laser Scans
For the single 360° laser scans from both scanners, we conducted a nearest neighbor search for assigning the points sequentially with a possible label, according to the annotated points in the large-scale benchmark dataset. In Figure 9, the trajectory of the MLS vehicle is illustrated by red dots (representing equal time steps). As can be seen from the trajectory, the scanning data include two loops covering the inner yard and outer area of the campus, which can be used for the evaluation of SLAM methods like LiDAR-based Semantic SLAM. To do that, the georeferenced 360° scans can be transformed back to the sensor's coordinate frame, i.e., the motion compensation and direct georeferencing achieved by the INS can be undone, and the task for the SLAM method under consideration would be to replace the INS. The labeled sequence of scans is comparable to the labeled sequences of scans in the SemanticKITTI dataset. For the points in each scan, a point was given the same label as the nearest neighbor in the annotated scene within a given threshold (0.3 m). If no point was found within the given radius, the point was assigned to the unlabeled class. In Figure 10, annotated points of single 360° laser scans are displayed. Compared to the annotated points in the large benchmark dataset, the annotated single laser scans contain both the geometric characteristics of the scanners and the classification labels of the points. In Figure 11, we also provide an illustration of the annotation result of a sequence of single scans with time index of 04068, 04108, 04148, respectively. These three 360° scans are not continuous but have a separation of 4 s.
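The label transfer rule described above (nearest annotated neighbor within 0.3 m, otherwise unlabeled) can be written down as a short sketch. The brute-force search, the function name, and the id `0` for the unlabeled class are assumptions made for illustration only.

```python
import math

UNLABELED = 0  # hypothetical id for the unlabeled class


def transfer_labels(scan_points, ref_points, ref_labels, max_dist=0.3):
    """Give each scan point the label of its nearest annotated point,
    or UNLABELED if that neighbor is farther away than max_dist (meters)."""
    labels = []
    for p in scan_points:
        best_dist, best_label = math.inf, UNLABELED
        for q, lab in zip(ref_points, ref_labels):
            d = math.dist(p, q)
            if d < best_dist:
                best_dist, best_label = d, lab
        labels.append(best_label if best_dist <= max_dist else UNLABELED)
    return labels
```

For point counts of the size reported here, the inner loop would normally be replaced by a k-d tree query; the decision rule itself stays the same.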

Evaluation
To give a brief evaluation of the dataset, we applied supervised classification to this dataset for labeling the points with five aforementioned classes of objects in the scene. In the experiments, we use points of Area 1 as the training data, while we use points of Areas 2 and 3 as the test data.

Baselines of Semantic Labeling
In the experiments, four point-based semantic labeling methods were tested on the proposed dataset as baselines:
• PointNet [23]: PointNet is a neural network that directly processes point clouds and respects the permutation invariance of points in the input. It provides a unified architecture for applications such as object classification.
• PointNet++ [24]: PointNet learns global features with MLPs for raw point clouds. PointNet++ applies PointNet to local neighborhoods of each point to capture local features, and a hierarchical approach is taken to capture both local and global features.

• Detrended geometric features and graph-based optimization (DEGO) [25]: This is a conventional handcrafted feature-based method using eigenvalue-based geometric features [5] with a detrended feature enhancement strategy and a random forest classifier. A post-processing step with graph-structured optimization is applied for the refinement of initial labels.

• Hierarchical deep feature learning (HDL) [13]: This is a deep feature learning method based on the original PointNet++ [24], in which hierarchical data augmentation is used to create multi-scale pointsets as input. Pointsets subdivided at various scales contain different levels of contextual information and are concatenated into a multi-scale deep feature vector, which is then classified by a random forest. The joint manifold-based embedding (JME) and global graph-based optimization (GGO) used in [13] are not included here.
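The permutation invariance mentioned for PointNet comes from applying the same function to every point and then aggregating with a symmetric operation. The toy sketch below shows only this principle, with a single linear layer plus ReLU and channel-wise max pooling; the weights are arbitrary placeholders rather than trained parameters, and the function names are hypothetical.

```python
def shared_mlp(point, weights):
    """Apply the same linear layer followed by ReLU to one point (x, y, z)."""
    return [max(0.0, sum(w * c for w, c in zip(row, point))) for row in weights]


def global_feature(points, weights):
    """Max-pool the per-point features channel by channel.

    Because max() does not depend on the order of its inputs, the result
    is identical for any ordering of the same points.
    """
    feats = [shared_mlp(p, weights) for p in points]
    return [max(channel) for channel in zip(*feats)]
```

Shuffling the input list leaves the output of `global_feature` unchanged, which is the property that lets such a network consume unordered point sets.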

Evaluation Metrics
For the evaluation of the classification results, we follow the Pascal VOC challenges [26] and use Intersection over Union (IoU) averaged over all classes. The evaluation measure for class i is defined as
$$\mathrm{IoU}_i = \frac{TP_i}{TP_i + FP_i + FN_i}.$$
The main evaluation measure is the mean IoU, which is the average of $\mathrm{IoU}_i$ over all $N$ classes:
$$\overline{\mathrm{IoU}} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{IoU}_i.$$
Moreover, the overall accuracy is also calculated.
Here, for the labeled result of each class, TP denotes the true positives, i.e., the number of points correctly labeled as this class. FP stands for the false positives, i.e., the number of points that belong to other classes but were incorrectly labeled as this class. FN denotes the false negatives, i.e., the number of points that belong to this class but were incorrectly labeled as other classes. Moreover, the precision (Pre.) and recall (Rec.) values are also given for assessing the performance, and finally the overall accuracy (OA) is calculated as well.
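As a worked illustration of these measures, the sketch below derives per-class IoU, precision, and recall as well as the mean IoU and the overall accuracy from a confusion matrix. The row/column convention (rows are reference classes, columns are predicted classes) and the function name are assumptions of this example.

```python
def per_class_metrics(conf):
    """conf[i][j] = number of points of reference class i predicted as class j.

    Returns (list of per-class dicts, mean IoU, overall accuracy).
    """
    n = len(conf)
    total = sum(sum(row) for row in conf)
    results = []
    for i in range(n):
        tp = conf[i][i]
        fn = sum(conf[i]) - tp                        # class i labeled as another class
        fp = sum(conf[r][i] for r in range(n)) - tp   # other classes labeled as class i
        results.append({
            "iou": tp / (tp + fp + fn) if tp + fp + fn else 0.0,
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
        })
    mean_iou = sum(r["iou"] for r in results) / n if n else 0.0
    overall_accuracy = sum(conf[i][i] for i in range(n)) / total if total else 0.0
    return results, mean_iou, overall_accuracy
```

For a two-class matrix `[[8, 2], [1, 9]]`, the first class has TP = 8, FN = 2, and FP = 1, so its IoU is 8/11, while the overall accuracy is 17/20.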

Training Settings
In this semantic labeling experiment, for the DEGO method, the feature extraction and initial classification were implemented in C++ and ran on an Intel i7-6700 CPU @ 3.4 GHz with 32.0 GB RAM. The graph-based optimization was implemented in MATLAB 2018b on the same hardware. Regarding the key parameters, the size of the voxels (their edge length) is 0.3 m, while the seed resolution of supervoxels is 1.0 m. For the weight factors in the boundary refinement process, w_n is set to one, while w_d is set to the reciprocal value of the size of voxels. The number of trees used in our RF classifier is 200. The threshold for the graph cut is set to 0.5. As input to the PointNet and PointNet++ methods, the entire point cloud is subdivided into thousands of sub-point chips, each containing 10,000 points. These chips are downsampled to 8192 points which represent the main structure of each chip, and the downsampled chips serve as the input for PointNet and PointNet++. Each point in the chip is represented by a 3D vector containing the coordinates (x, y, z). Similarly, for the HDL method, we generated sub-point chips with different scales. The training of networks with these chips of points with different scales is carried out individually as well. Thus, we can acquire encoded features encapsulating different levels of contextual information from points. Considering the real scale of objects (e.g., buildings, trees, low vegetation), the sizes of the chips of points are empirically set to 10,000, 20,000, and 30,000 for different scales, respectively, since they showed satisfying performance in the experiments. For the training process of all three deep learning-based methods, each training batch contained 16 chips in total. The stochastic gradient descent algorithm with a learning rate η = 0.001 and a momentum value of p = 0.9 was used. For adjusting the learning rate, we decayed its value by the factor of 0.7 in every 40 training chips. The training process lasted for a total of 500 epochs. We monitored the progress of the validation loss and saved the weights whenever the loss improved. These three methods were implemented via Tensorflow and carried out on an NVIDIA TITAN X (Pascal) 12GB GPU.
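Two of the settings above lend themselves to a small sketch: reducing a chip of 10,000 points to 8192 points, and decaying the learning rate by a factor of 0.7 at a fixed interval. Uniform random sampling is used here only as a stand-in, since the text does not name the downsampling strategy. The unit of the decay interval is passed in as a generic step counter, which is likewise an assumption, and both function names are hypothetical.

```python
import random


def downsample_chip(chip, n_out=8192, seed=0):
    """Randomly keep n_out points of a chip without replacement.

    Assumption: uniform random sampling; the actual strategy is not
    specified in the text.
    """
    if len(chip) <= n_out:
        return list(chip)
    return random.Random(seed).sample(chip, n_out)


def decayed_learning_rate(step, base_lr=0.001, decay_rate=0.7, decay_every=40):
    """Staircase decay: multiply the rate by decay_rate every decay_every steps."""
    return base_lr * decay_rate ** (step // decay_every)
```

With these defaults the rate stays at 0.001 for steps 0 to 39, drops to 0.0007 at step 40, and to 0.00049 at step 80.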

Results And Discussion
In Table 3, we compare the classification results. The results of the PointNet, PointNet++, and DEGO methods have been reported in our previous work [25]. As seen from the results, the DEGO method has advantages over the other three baseline methods in terms of overall accuracy. In particular, when compared with PointNet and PointNet++, the DEGO method outperforms them significantly in labeling buildings, man-made terrain, and high vegetation. However, we should be aware of the fact that DEGO includes graph-based optimization as post-processing. As we analyzed in [25], a possible reason is that these three kinds of objects have generally isotropic geometric characteristics, facilitating the graph-based optimization process considering the local contextual information. However, the deep learning-based methods also have a strong advantage for classifying points of vehicles and low vegetation. Rather than relying on handcrafted features, deep networks learn features adaptively under supervision for such irregularly shaped objects, which better fits the reality. For the PointNet and PointNet++ methods, the reason that buildings and man-made terrain cannot be classified with satisfying results is the scale factor. In other words, the PointNet and PointNet++ methods were originally designed for computer vision applications and mainly for indoor scenarios. However, when it comes to outdoor scenarios with large changes of object scales, the sampling of points for the input of the network should be further considered. In the HDL method, a hierarchical strategy for data augmentation is used for preparing the input, in which multi-scale pointsets are created. Consequently, the result of the HDL method reveals a performance comparable to that of DEGO, even without any post-processing as refinement. Compared with PointNet++, HDL can get an improvement of around 7%.
Since the HDL method is just an improved version of PointNet++, we can confirm that the modification of input scales can significantly enhance the power of networks for outdoor scenarios. In some recent publications like [27], the hierarchical structure has been used and achieved excellent performance. By applying the trained models to the entire dataset, we can get a complete visualization of the results of every single method, cf. Figure 12. As seen from this figure, the visualization results can also support the quantitative evaluation: although all four methods can assign correct labels for the majority of points, the DEGO method, benefiting from the post-processing optimization, performs better when discriminating buildings and high vegetation. For the visualized results from the HDL method, we can observe that not only buildings and high vegetation but also scanning artifacts and man-made terrain can achieve good results even without post-refinement. From the aspect of the dataset, all these results and the feedback of the various methods have supported the feasibility and effectiveness of the proposed dataset when using it for performance evaluation.
Figure 12. (a) Classification result using PointNet [23], (b) classification result using PointNet++ [24], (c) classification result using DEGO [25], and (d) classification result using HDL [13].

Conclusions
In this work, we presented a large-scale annotated MLS dataset using Mobile LiDAR point clouds acquired at the city campus of the Technical University of Munich for the evaluation of semantic interpretation. We presented three datasets including a benchmark dataset for semantic labeling, test data for instance segmentation, and test data for annotated single 360° laser scans. The benchmark dataset for semantic labeling covers an urban area of approximately 1 km long roadways, and it includes more than 40 million annotated points with eight classes of objects labeled. The dataset for instance segmentation provides annotated and segmented points of more than 1000 objects. The dataset for labeled single scans provides labeled points from more than 17,000 sequential 360° scans (rotations of the LiDAR scanner's head). Compared with other representative MLS benchmark datasets, the proposed MLS dataset provides not only a benchmark for point-wise semantic labeling but also annotated instances of individual objects, as well as labeled points in a sequence of 360° scans. Moreover, we also reported evaluation experiments using the benchmark dataset with several reference methods. In conclusion, we make the following two remarks:
• The creation of benchmark datasets is essential to assess the performance of developed algorithms and methods. The analysis of our proposed large-scale annotated dataset has revealed its good potential for the evaluation of semantic interpretation in complex urban scenarios.

• Experiments have validated the feasibility and quality of the dataset for semantic labeling. The comparison with methods of different strategies also reveals the importance of considering the scale factors in deep learning-based feature descriptions.
In the future, the quality of these datasets could be further improved and updated according to the feedback from comprehensive experiments and evaluations.