Disturbance analysis in the classification of objects obtained from urban lidar point clouds with convolutional neural networks

: Mobile Laser Scanning (MLS) systems have proven their usefulness in the rapid and accurate acquisition of the urban environment. From the generated point clouds, street furniture can be extracted and classiﬁed without manual intervention. However, this process of acquisition and classiﬁcation is not error-free, caused mainly by disturbances. This paper analyses the effect of three disturbances (point density variation, ambient noise, and occlusions) on the classiﬁcation of urban objects in point clouds. From point clouds acquired in real case studies, synthetic disturbances are generated and added. The point density reduction is generated by downsampling in a voxel-wise distribution. The ambient noise is generated as random points within the bounding box of the object, and the occlusion is generated by eliminating points contained in a sphere. Samples with disturbances are classiﬁed by a pre-trained Convolutional Neural Network (CNN). The results showed different behaviours for each disturbance: density reduction affected objects depending on the object shape and dimensions, ambient noise depending on the volume of the object, while occlusions depended on their size and location. Finally, the CNN was re-trained with a percentage of synthetic samples with disturbances. An improvement in the performance of 10–40% was reported except for occlusions with a radius larger than 1 m.


Introduction
Although point clouds have proven to be a data source with many advantages, they also have some limitations that cause undesired behaviour in processing algorithms. The main disturbances in data acquisition are strong point density variations, noise, and occlusions. These disturbances reach such an extent that a data pre-processing phase is usually required to correct anomalies by density reductions, outliers and hole filling [1].
Point density is one of the main characteristics to consider for each point cloud [2]. The average density and the point distribution vary according to the type of laser scanning [3]. The density defines the number of points in a specific space and it limits the measurement and detection of the elements. Many authors prefer to work directly with high-density point clouds, ensuring there are always enough points to identify the desired targets [4]. However, the density in the point cloud is not constant, since point density varies with the distance to the laser and with the geometry of the environment. In [5], the authors reported a variation up to 41 times in point density between the highest and the lowest density areas of a Mobile Laser Scanning (MLS) acquisition. The distance between points is a determining factor in point cloud processing. Segmentation or classification methods focus the feature extraction in point-to-point relations, either by nearest neighbours, spherical neighbourhood, raster, voxels, or convolutions [6][7][8][9][10][11]. When extracting features, it is important to consider both the point density to be handled and the possible variations to give robustness to the proposed method.
Noise is another common disturbance in point clouds, although it does not appear as significant as point density. Isolated points of noise can be found in the acquisitions due to isolated reflections. Besides, a slight amount of noise may appear on the surface of the objects depending on the LiDAR quality [12]. Noise does not affect all methods equally, for example, methods based on edge detection are more sensitive than those based on region growing [13]. To deal with noise, many authors design and implement methods for denoising and removing outliers [14,15]. Some methods are especially robust and filter high-density noise, specific to certain atmospheric and meteorological conditions, such as fog, rain, or airborne dust [16][17][18], even though usually in those environments, acquisitions are no longer performed. Another option is to design or adapt the methods so that they can deal with a certain amount of noise in the point cloud [19].
Occlusions are a geometrical problem related to the visibility of objects in the environment. The most common occlusions are related to occlusions between several objects (one object occludes another), but also an object can partially occlude itself (most of the time, one side of the object is not acquired). Although occlusions are a common problem in all environments [20], there are no specific datasets for working with occlusions. In autonomous driving, the occlusions considered most important to avoid collisions are those affecting pedestrians, objects already with few acquired points [21]. Few authors who focus on this problem choose to generate synthetic occlusions. Zachmann and Scamati [22] generated occlusions in Aerial Laser Scanning (ALS) data to detect and classify targets under trees. Despite the great difficulty of considering occlusions in point cloud processing, Habib et al. [23] designed a method to classify objects based on the interpretation of occlusion in ALS data.
Generally, works that assume the existence of disturbances, analyse them and take advantage of them achieve better results [5,24]. Although normally, it is easier and more straightforward to work with undisturbed data to show the advantages of the proposed methods in ideal situations, without anomalies that worsen the perception of the results.
The objective of this work is to evaluate how these three disturbances (point density variations, noise, and occlusions) affect the classification of urban objects in point clouds. The classification method is based on the generation of multi-views from point clouds, coupled with the use of a Convolutional Neural Network (CNN). Prior to the conversion of the point cloud into images, synthetic disturbances are generated automatically with different degrees of intensity. Finally, the CNN is re-trained with new objects with synthetic disturbances and the results are compared. To the best of the authors' knowledge, there are no other works that generate and analyse the effect of these three disturbances in a classification method based on multi-view and CNN. The major contributions of the work are listed below: • Automated generation of point density variation and noise addition at different intensity levels. • Automated generation of occlusions at five positions with different sizes. • Analysis of accuracy and confusion in ten classes of urban objects: bench, car, lamppost, motorbike, pedestrian, traffic light, traffic sign, tree, wastebasket, and waste container. This paper is structured as follows. In Section 2, the method for image generation, classification and automated disturbance generation is explained. In Section 3, the case study is presented, and the results are analysed and discussed. Section 4 concludes this paper.

Method
The method for the classification of the objects in point clouds used in this work is based on multi-view extraction. Multiview-based methods require lower computational cost and enable further data augmentation than working directly on unordered point Remote Sens. 2021, 13, 2135 3 of 20 clouds, while taking advantage of the wide variety of convolutional neural network (CNN) architectures available [25].
Since disturbances are inherent to the acquisition of point clouds, and the generation of disturbances in this work is performed on the point clouds, even if they are later converted to images. The workflow of the experiment is shown in Figure 1. Disturbances can also influence the extraction and individualisation of objects; however, the behaviour of other algorithms in the presence of disturbances will not be analysed in this paper, as we focus exclusively on classification.

Method
The method for the classification of the objects in point clouds used based on multi-view extraction. Multiview-based methods require lower cost and enable further data augmentation than working directly on un clouds, while taking advantage of the wide variety of convolutional n (CNN) architectures available [25].
Since disturbances are inherent to the acquisition of point clouds, and of disturbances in this work is performed on the point clouds, even if the verted to images. The workflow of the experiment is shown in Figure 1. D also influence the extraction and individualisation of objects; however, th other algorithms in the presence of disturbances will not be analysed in th focus exclusively on classification.

Image Generation
The generation of the images is adapted from [26]. The selected vi objects to project in an image is the lateral view, as it provides more inform proves the success rates in the classification. In addition, only the lateral the maximum extent to improve visualisation and classification, and to ma ances more visible and not hidden by overlapping points. To generate the imum lateral perspective, the point cloud object is rotated on th on the maximum distribution of points in the XY plane. The orientation of culated by Principal Component Analysis is applied on the and (Equation (1)). From the eigen vectors, the angle of rotation α between components is calculated (Equation (2)). Then, α is used to generate a 3D ro over Z, which is applied to the input object point cloud P (Equation (3)).

Image Generation
The generation of the images is adapted from [26]. The selected view of acquired objects to project in an image is the lateral view, as it provides more information and improves the success rates in the classification. In addition, only the lateral view is used to the maximum extent to improve visualisation and classification, and to make the disturbances more visible and not hidden by overlapping points. To generate the images in maximum lateral perspective, the point cloud object P(P X P Y P Z ) is rotated on the Z-axis based on the maximum distribution of points in the XY plane. The orientation of the points calculated by Principal Component Analysis is applied on the P X and P Y components (Equation (1)). From the eigen vectors, the angle of rotation α between the P X and P Y components is calculated (Equation (2)). Then, α is used to generate a 3D rotation R matrix over Z, which is applied to the input object point cloud P (Equation (3)).
Once the object point cloud is rotated so that its maximum extent is shown on the Y-plane, the image is generated on the Y-plane. The P Y component (indicative of depth) of the rotated point cloud is removed. A grid of cells (pixels) is generated on which the cloud Remote Sens. 2021, 13, 2135 4 of 20 points are structured. The size of the grid is defined by the size of the CNN input. The generated image is a binary image: black in the pixels whose cells contain at least one point and white in the pixels whose cells are empty. Although various cloud characteristics (such as intensity) can be extracted to generate a greyscale image, binary images are generated instead of greyscale images because from [26] it is concluded that the most important factor for the classification of the point cloud is the silhouette and not so much the colour. The result is an image where the silhouette and the object are well defined and contrasted against the white background. The process of image generation is illustrated in Figure 2.
Remote Sens. 2021, 13, x FOR PEER REVIEW 4 of 21 the rotated point cloud is removed. A grid of cells (pixels) is generated on which the cloud points are structured. The size of the grid is defined by the size of the CNN input. The generated image is a binary image: black in the pixels whose cells contain at least one point and white in the pixels whose cells are empty. Although various cloud characteristics (such as intensity) can be extracted to generate a greyscale image, binary images are generated instead of greyscale images because from [26] it is concluded that the most important factor for the classification of the point cloud is the silhouette and not so much the colour. The result is an image where the silhouette and the object are well defined and contrasted against the white background. The process of image generation is illustrated in Figure 2.

Point Density Reduction
Point density depends on the proximity and orientation of the object to the MLS; therefore, different urban objects have different densities even if they are acquired with the same MLS device. Density reduction is applied on each point cloud before generating the images. Before applying a density reduction, the current average point density must be known to establish values for effective density reduction. In this article, density is calculated as the average point-to-point distance [27] between the k = 5 nearest points with the knn algorithm, assuming regular scanning patterns.
Point density reduction is performed using the 3D grid method or voxelization [28]. The input point cloud is enclosed in a bounding box, which is divided into 3d cells of side l, with equal width, depth, and height. The value of l is variable and increases, as the distance between output points and point density decreases. Once the input cloud is structured in a 3D grid, the points are filtered by cell. If the cell contains one point, this point is retained. If the cell contains more than one point, the one nearest to the cell centre is selected. In the case of empty cells, no points are selected [29].

Noise Addition
Ambient noise is produced by airborne dust or rain. This noise is measured as in the previous section. The noise point density is calculated as the average distance from a point to the five nearest points with the knn algorithm. However, the generation of noise points is performed by the number of points enclosed in the bounding box object. Since each object has a different volume and the points are generated randomly, it is very difficult to generate noise of an exact point density value d for the test. For this reason, an iterative noise generation process is chosen, generating an initial number of points n which is then adjusted (Figure 3).
Before generating any number of noise points, a test is conducted to establish a relationship between the number of points and the point density in 1 m 3 . For 500 points, an average distance of 0.115 m is measured with k = 5. The process starts by calculating the bounding box of the object to obtain the volume. From the volume, n points are generated

Point Density Reduction
Point density depends on the proximity and orientation of the object to the MLS; therefore, different urban objects have different densities even if they are acquired with the same MLS device. Density reduction is applied on each point cloud before generating the images. Before applying a density reduction, the current average point density must be known to establish values for effective density reduction. In this article, density is calculated as the average point-to-point distance [27] between the k = 5 nearest points with the knn algorithm, assuming regular scanning patterns.
Point density reduction is performed using the 3D grid method or voxelization [28]. The input point cloud is enclosed in a bounding box, which is divided into 3d cells of side l, with equal width, depth, and height. The value of l is variable and increases, as the distance between output points and point density decreases. Once the input cloud is structured in a 3D grid, the points are filtered by cell. If the cell contains one point, this point is retained. If the cell contains more than one point, the one nearest to the cell centre is selected. In the case of empty cells, no points are selected [29].

Noise Addition
Ambient noise is produced by airborne dust or rain. This noise is measured as in the previous section. The noise point density is calculated as the average distance from a point to the five nearest points with the knn algorithm. However, the generation of noise points is performed by the number of points enclosed in the bounding box object. Since each object has a different volume and the points are generated randomly, it is very difficult to generate noise of an exact point density value d for the test. For this reason, an iterative noise generation process is chosen, generating an initial number of points n which is then adjusted (Figure 3). and the average distance between k = 5 nearest neighbours is calculated. In case of coincidence of the desired and calculated point density (±10%), the noise points are added to the object point cloud and the image is generated. In case the current density is outside the threshold, the number of points is increased/decreased by 10% accordingly. The process iterates until the point density is within the threshold.

Occlusion Generation
Occlusions are caused by the existence of an object between the laser and the surface to be acquired. The occluding element can be an object other than the occluded object, or the acquired object itself (hidden unacquired side of the occluding object). The occlusions are based on two variables, location and size, unlike the previously generated disturbances which only depend on one variable.
Location is relevant as it can erase key shapes of the object even if the occlusion is small. Occlusion location describes the area or part of the object where the occlusion is located. To uniform the possible locations in urban objects with large geometry variations, it is decided to locate in each object five different occlusion origins on which occlusions of different sizes are generated. The authors assume that five occlusion origins cover most of the object to distinguish occlusion effects. The location of the occlusion origins is selected using the k-means algorithm applied to the object cloud [13]. The k-means algorithm clusters the point cloud into k groups where each point belongs to the group whose mean value is the nearest. The central point of each group is selected as the origin of the occlusion.
The size of the occlusion indicates how much of the object is hidden. The generation of the occlusion is performed as a sphere from each origin calculated above. Therefore, from the cloud of the input object, points with a radius to the origin smaller than r are detected and removed. Even though this method generates exclusively round-shaped occlusions in the image projection, for the CNN the loss of object information is more relevant than the shape of the occluded object's boundaries.

Dataset and Pre-Trained CNN
The dataset was composed of 200 objects acquired by LYNX Mobile Mapper of Optech [30] in the cities of Vigo (Spain) and Porto (Portugal). In order not to propagate errors from the process of object segmentation and individualisation, the objects were manually extracted from the point cloud. However, there are numerous techniques for the extraction of objects from the urban point cloud. One option is to detect ground and Before generating any number of noise points, a test is conducted to establish a relationship between the number of points and the point density in 1 m 3 . For 500 points, an average distance of 0.115 m is measured with k = 5. The process starts by calculating the bounding box of the object to obtain the volume. From the volume, n points are generated and the average distance between k = 5 nearest neighbours is calculated. In case of coincidence of the desired and calculated point density (±10%), the noise points are added to the object point cloud and the image is generated. In case the current density is outside the threshold, the number of points is increased/decreased by 10% accordingly. The process iterates until the point density is within the threshold.

Occlusion Generation
Occlusions are caused by the existence of an object between the laser and the surface to be acquired. The occluding element can be an object other than the occluded object, or the acquired object itself (hidden unacquired side of the occluding object). The occlusions are based on two variables, location and size, unlike the previously generated disturbances which only depend on one variable.
Location is relevant as it can erase key shapes of the object even if the occlusion is small. Occlusion location describes the area or part of the object where the occlusion is located. To uniform the possible locations in urban objects with large geometry variations, it is decided to locate in each object five different occlusion origins on which occlusions of different sizes are generated. The authors assume that five occlusion origins cover most of the object to distinguish occlusion effects. The location of the occlusion origins is selected using the k-means algorithm applied to the object cloud [13]. The k-means algorithm clusters the point cloud into k groups where each point belongs to the group whose mean value is the nearest. The central point of each group is selected as the origin of the occlusion.
The size of the occlusion indicates how much of the object is hidden. The generation of the occlusion is performed as a sphere from each origin calculated above. Therefore, from the cloud of the input object, points with a radius to the origin smaller than r are detected and removed. Even though this method generates exclusively round-shaped occlusions in the image projection, for the CNN the loss of object information is more relevant than the shape of the occluded object's boundaries.

Dataset and Pre-Trained CNN
The dataset was composed of 200 objects acquired by LYNX Mobile Mapper of Optech [30] in the cities of Vigo (Spain) and Porto (Portugal). In order not to propa-gate errors from the process of object segmentation and individualisation, the objects were manually extracted from the point cloud. However, there are numerous techniques for the extraction of objects from the urban point cloud. One option is to detect ground and façade planes to remove the corresponding points and break the continuity between objects; then, the objects are individualised with connected components [26]. Other options are to structure the cloud into super voxels [31], to cluster points with superpoint graphs [32] or to implement techniques based on Deep Learning [33]. The ten selected classes were: bench, car, lamppost, motorbike, pedestrian, traffic light, traffic sign, tree, wastebasket, and waste container. A total of twenty objects were selected from each class. Ten objects per class were used for the generation of disturbances and analysis of results. All selected objects were checked for correct classification without disturbances by the pre-trained CNN, in order to start from a correct classification and to find the degree of disturbance for the first misclassification. The other ten objects per class were used for the CNN re-training (Section 3.5).
The method selected to test analyse the disturbance was presented in [26]. Although there are many methods for point cloud object classification, it was not possible to analyse them all in the present work, so a state-of-the-art method with good classification rates was chosen, based on the use of views and projections to classify the point cloud as an image with Deep Learning. The pre-trained CNN was described in [26]. The network architecture is an Inception V3 [34]. The softmax layer was changed to obtain a 1 × 1 × 10 output corresponding to the number of classes ( Figure 4). The samples used for training were (for each class) 450 images from internet sources (Google Images and ImageNet) and 50 images of objects in point clouds. An additional 50 images of objects in point clouds were used in the validation. The point clouds for pre-training were different from those used in disturbance generation and re-training. façade planes to remove the corresponding points and break the continuity between objects; then, the objects are individualised with connected components [26]. Other options are to structure the cloud into super voxels [31], to cluster points with superpoint graphs [32] or to implement techniques based on Deep Learning [33]. The ten selected classes were: bench, car, lamppost, motorbike, pedestrian, traffic light, traffic sign, tree, wastebasket, and waste container. A total of twenty objects were selected from each class. Ten objects per class were used for the generation of disturbances and analysis of results. All selected objects were checked for correct classification without disturbances by the pretrained CNN, in order to start from a correct classification and to find the degree of disturbance for the first misclassification. The other ten objects per class were used for the CNN re-training (Section 3.5).
The method selected to test analyse the disturbance was presented in [26]. Although there are many methods for point cloud object classification, it was not possible to analyse them all in the present work, so a state-of-the-art method with good classification rates was chosen, based on the use of views and projections to classify the point cloud as an image with Deep Learning. The pre-trained CNN was described in [26]. The network architecture is an Inception V3 [34]. The softmax layer was changed to obtain a 1 × 1 × 10 output corresponding to the number of classes ( Figure 4). The samples used for training were (for each class) 450 images from internet sources (Google Images and ImageNet) and 50 images of objects in point clouds. An additional 50 images of objects in point clouds were used in the validation. The point clouds for pre-training were different from those used in disturbance generation and re-training.

Density Results and Analysis
The average point density per class measured using the knn algorithm with k = 5 is presented in Table 1. The maximum difference between classes with the highest and the lowest point density is 2 cm. The average of all densities was calculated to be 0.0256 metres-this value was taken as a starting point to apply density reduction and it was increased on a pseudo-logarithmic scale until 1 m was reached. Examples of the generated images with point density reduction are shown in Figure 5.

Class
Average Point-to-Point Distance (meters) Tree 0.0335 Bench 0.0219

Density Results and Analysis
The average point density per class measured using the knn algorithm with k = 5 is presented in Table 1. The maximum difference between classes with the highest and the lowest point density is 2 cm. The average of all densities was calculated to be 0.0256 mthis value was taken as a starting point to apply density reduction and it was increased on a pseudo-logarithmic scale until 1 m was reached. Examples of the generated images with point density reduction are shown in Figure 5.   Table 2 shows the precision and Table 3 shows the recall of the classification according to the density value l. Table 4 compiles three confusion matrices along with the l variation. The reduction in the average precision and recall was constant up to l = 0.1 m. Above l = 0.25 m, practically no class was well identified. The misclassification behaviour was observed in two ways: an abrupt decline from very high accuracy to almost zero accuracy, as in the case of cars and traffic signs; and a gradual decline as traffic signs and streetlights showed for several consecutive l values a medium precision before being completely misclassified.
In general, it was observed that objects were affected to a greater or lesser extent by density reduction depending on the dimensions and shape of the object. The larger an object was, the less it was affected by density reduction, as the main features could still be seen. In particular, the volumetric point distribution in the tree canopy, produced by the multibeam return, resulted in the preservation of more points than in other classes. As can be seen in Table 4, when other classes such as cars and benches are difficult to distinguish with l = 0.5 m, trees remain completely identifiable.
The shape of the object is the most important factor for classification. The CNN looked at certain key points to identify each class. Simple shapes, such as squared waste containers, were very often confused with waste baskets. Additionally, other objects, such  Table 2 shows the precision and Table 3 shows the recall of the classification according to the density value l. Table 4 compiles three confusion matrices along with the l variation. The reduction in the average precision and recall was constant up to l = 0.1 m. Above l = 0.25 m, practically no class was well identified. The misclassification behaviour was observed in two ways: an abrupt decline from very high accuracy to almost zero accuracy, as in the case of cars and traffic signs; and a gradual decline as traffic signs and streetlights showed for several consecutive l values a medium precision before being completely misclassified.  In general, it was observed that objects were affected to a greater or lesser extent by density reduction depending on the dimensions and shape of the object. The larger an object was, the less it was affected by density reduction, as the main features could still be seen. In particular, the volumetric point distribution in the tree canopy, produced by the multibeam return, resulted in the preservation of more points than in other classes. As can be seen in Table 4, when other classes such as cars and benches are difficult to distinguish with l = 0.5 m, trees remain completely identifiable.
The shape of the object is the most important factor for classification. The CNN looked at certain key points to identify each class. Simple shapes, such as squared waste containers, were very often confused with waste baskets. Additionally, other objects, such as motorbikes with many variations in shapes between models, were affected as the CNN did not find common features between models. Among the pole-like objects, streetlights and traffic lights showed similar behaviour, while traffic signs were more similar to pedestrians.
From the confusion matrix (Table 4) with l = 0.065 m, a well-defined diagonal was observed, except for the very early misclassification of the classes "Waste containers" and "Motorbikes", while with l = 0.1 m, the class "Cars" had already joined the misclassified classes and "Pedestrian" and "Traffic lights" showed great confusion. From the confusion matrix with l = 0.5 m, the "Bench" and "Waste basket" classes were the ones towards which the confusion was directed. From these confusion results and the gradual precision loss of Table 2 and the reduction in the recall in Table 3, it can be deduced that in the absence of a class "Others", the network considered waste baskets and benches as such a class.

Noise Results and Analysis
The generation of random noise points was directly related to the volume of the bounding box that encloses the point cloud object. The objects with the largest volume were trees, streetlights, and traffic lights. The largest objects required a larger number of random points ( Table 5). The noise generation with the same density affected the image generation due to the depth of the objects. For the same noise level, some classes could be identified by a human observer, while in other classes, the projection of points generated a practically black image ( Figure 6). The selection of the noise values was limited between d = 0.6 m and d = 0.02 m in a linear distribution, considering no significant changes occurred beyond these two limits.    Table 6 shows the precision and Table 7 shows the recall of the classification results according to ten noise levels, while Table 8 compiles confusion matrices obtained from three noise levels. The precision loss was gradual until a very high misclassification with a noise density of d = 0.04 m was reached. Comparing Tables 6 and 8, there is a direct relation between misclassification and volume. The objects with the highest volume (tree light and traffic light) were the first to be confused. They lost the shape of the object due  Table 6 shows the precision and Table 7 shows the recall of the classification results according to ten noise levels, while Table 8 compiles confusion matrices obtained from three noise levels. The precision loss was gradual until a very high misclassification with a noise density of d = 0.04 m was reached. Comparing Tables 6 and 8, there is a direct relation between misclassification and volume. The objects with the highest volume (tree light and traffic light) were the first to be confused. They lost the shape of the object due to a large number of projected noise points. Without considering large objects, no specific behaviour of each class in relation to their shape was observed. The largest drop in precision occurred between d = 0.2 m and d = 0.04 m. The "Motorbikes" and "Pedestrian" classes, despite losing precision, maintained high recall values until they were misclassified. From the confusion matrices (Table 8), the confusion was gradual, starting with trees and streetlights and predicting all objects erroneously as waste baskets. The network had considered the class "Waste basket" as the class "Others", as indicated by high precision values and low recall.

Occlusions Results and Analysis
The analysis of the influence of occlusions on the classification focused on the different origins and sizes of occlusions. Therefore, five possible origins for each object and ten different occlusion sizes per origin were selected. In total, 5000 images were generated in the study of the occlusions. The occlusion origins for an object of each class are shown in Figure 7. The k-mean estimated origins matched between the different objects in each class. An occlusion radius range between r = 0.1 m and r = 5 m was selected to analyse both small-size occlusions and those with a size potentially affecting larger objects. Large occlusions applicable to small objects imply the disappearance of these objects. An example of the generation of occlusions of different sizes is shown in Figure 8.   Table 9 shows the precision and Table 10 shows the recall of the classification with different occlusion sizes. In Table 11, three confusion matrices are compiled. The incidence of occlusions varied mainly according to the size of the object, since if the object is large, a large occlusion is needed to remove a relevant part of the points. The location of the occlusion was another important factor, as in certain objects it can remove key features that characterise the object. This is the case of trees, streetlights, some traffic lights, or signs, where if the occlusion is at the top and is large, the image was simply a vertical line. In some traffic lights, no misclassification was observed since they had a double light (top and middle) (Figure 8iii) and the elimination of both only happened in extreme cases. The "Streetlight" class showed very high robustness, with high precision and recall rates even in extreme cases. Other classes, such as trees, also showed good precision, although in the worst case, the CNN tended to over detect them. Remote Sens. 2021, 13, x FOR PEER REVIEW 13 of 21  Table 9 shows the precision and Table 10 shows the recall of the classification with different occlusion sizes. In Table 11, three confusion matrices are compiled. The incidence of occlusions varied mainly according to the size of the object, since if the object is large,  Table 9 shows the precision and Table 10 shows the recall of the classification with different occlusion sizes. In Table 11, three confusion matrices are compiled. The incidence of occlusions varied mainly according to the size of the object, since if the object is large,  The class "Bench", being small-sized objects, was affected from occlusions of size r = 1 m. Although partially occluded benches were classified correctly, occlusions larger than r = 1.5 m eliminated the benches completely. The class "Car" showed similar behaviour. Additionally, the occlusions which caused the most errors were the central ones that broke the car in two parts, or the occlusions located on the wheels. The class "Waste containers" was not affected by the position of the occlusions since the origins were centred in the object. Similarly, the motorbike class showed the same behaviour, as there was no area where the occlusion generated the most error.
The waste baskets behaved in two ways depending on the occlusion size. Small-sized occlusions were mistaken for benches. The class "Bench" already showed some behaviour as class "Others" in the point density analysis. As the size of the occlusions increased, the confusions between classes were dispersed, and with large occlusions when many objects were completely occluded, the class "Waste basket" acted as a class "Others", showing high precision and low recall.
Pedestrians began to be misclassified as having occlusions with more than r = 0.5 m located on the torso.
The choice of five points was a compromise solution to cover several parts along the entire length of the object. Many more points could have been chosen; however, given the low effect of small occlusions on the classification, there was no need to increase the number of occlusion origins. On the other hand, the large occlusions covered several origins, so it was not necessary to increase the number of occlusion origins for this reason either.

Re-Train of CNN
The re-training aims to assess whether adding data with disturbances in the network training can minimise misclassification. A second dataset not previously used in the analysis of results was used to re-train the network (Figure 9). In this second dataset, composed of the same number and classes of objects as the previous dataset, disturbances were generated, added to the objects, and transformed into images. From all the images generated, 10 images with point density variations, 10 images with noise and 30 images with occlusions were randomly selected per class. More occlusion samples were selected due to the variation in occlusion positioning and size, which generated a larger amount of data than the other two disturbances. The selected images were added to the training set that was used to pre-train Inception V3 for the previous classification. The hyperparameters of the training were the same as in the pre-training (optimisation method sgdm, learning rate 10−4, Momentum 0.9, L2 Regularisation 10−4, Max Epochs 20, and Mini Batch Size 16). The network was trained in 200 min with a GPU NVIDIA GTX1050 4 GB GDDR5, CPU i7-7700HQ 2.8 Ghz, and 16 GB RAM DDR4.
were generated, added to the objects, and transformed into images. From all the images generated, 10 images with point density variations, 10 images with noise and 30 images with occlusions were randomly selected per class. More occlusion samples were selected due to the variation in occlusion positioning and size, which generated a larger amount of data than the other two disturbances. The selected images were added to the training set that was used to pre-train Inception V3 for the previous classification. The hyperparameters of the training were the same as in the pre-training (optimisation method sgdm, learning rate 10−4, Momentum 0.9, L2 Regularisation 10−4, Max Epochs 20, and Mini Batch Size 16.). The network was trained in 200 min with a GPU NVIDIA GTX1050 4 GB GDDR5, CPU i7-7700HQ 2.8 Ghz, and 16 GB RAM DDR4.  The use of images with noise points in re-training produced a consistent im ment of 10%-20% for all d values ( Figure 11). However, in cases with a high level the CNN was still not able to correctly classify the objects. The use of images with noise points in re-training produced a consistent improvement of 10-20% for all d values ( Figure 11). However, in cases with a high level of noise, the CNN was still not able to correctly classify the objects. The use of images with noise points in re-training produced a consistent i ment of 10%-20% for all d values ( Figure 11). However, in cases with a high level the CNN was still not able to correctly classify the objects. Re-training with occlusions showed a turning point in the results obtained 12). Up to r = 1 m, the classification with small occlusions improved up to 10% misclassification effect was slowed down. However, from r = 1 m onwards, the effect occurred, and the results worsened. This turning point is related to the si occlusion concerning the size of the object, as many small objects with occlusion m to 1.5 m started to occlude almost entirely. Applying occlusions of radius grea r = 1.5 m, most of the objects that conserved information were the large ones, main like objects. When these objects were partially occluded, the resulting sample was only a pole, which was very difficult to identify. Re-training with occlusions showed a turning point in the results obtained ( Figure 12). Up to r = 1 m, the classification with small occlusions improved up to 10% and the misclassification effect was slowed down. However, from r = 1 m onwards, the reverse effect occurred, and the results worsened. This turning point is related to the size of the occlusion concerning the size of the object, as many small objects with occlusions of r = 1 m to 1.5 m started to occlude almost entirely. Applying occlusions of radius greater than r = 1.5 m, most of the objects that conserved information were the large ones, mainly pole-like objects. When these objects were partially occluded, the resulting sample was usually only a pole, which was very difficult to identify.

Discussion
The objects were extracted from the urban point cloud manually. Although ances may affect detection and individualisation, only the classification of urba was evaluated in this work. Effects such as extreme density reduction, or total o would logically discard the extraction of samples from the environment as objec not be recognisable by the extracting algorithm. In the opposite case, the existen treme noise would not allow clustering algorithms such as DBSCAN or connect ponents correctly, and occlusions can break the object into several parts produc segmentation.
The selection of values for the synthetic generation of the disturbances and t

Discussion
The objects were extracted from the urban point cloud manually. Although disturbances may affect detection and individualisation, only the classification of urban objects was evaluated in this work. Effects such as extreme density reduction, or total occlusion, would logically discard the extraction of samples from the environment as objects would not be recognisable by the extracting algorithm. In the opposite case, the existence of extreme noise would not allow clustering algorithms such as DBSCAN or connected components correctly, and occlusions can break the object into several parts producing over segmentation.
The selection of values for the synthetic generation of the disturbances and their distribution was based on several considerations. The values were delimited between those that produced almost no effect on the classification and those that rendered the samples completely useless. Between the two limits, the range of values was selected in such a way as to provide constant information at classification intervals and the range was reduced where changes were more abrupt. The distributions were in linear or pseudo-logarithmic intervals. In addition, the maximum, minimum and intervals had to apply to objects of very different dimensions, ranging from waste baskets of just 1 m to trees of more than 5 m in height.
The process of generating synthetic disturbances was automated to produce all samples easily, reliably, and with the ability to be replicable. As far as possible, the design of the algorithm corresponded to the disturbances produced in the real world, although they were taken to extreme situations to seek to break the classification algorithm. As mentioned above, very distant objects to MLS (with low point density) would not be extractable from the urban point cloud or would be mistaken for noise. On the other hand, in conditions of extreme environmental noise (such as that generated by torrential rain or large amounts of dust-sand in suspension), MLS acquisitions are not scheduled or executed. Occlusions, although generated in various parts of the objects, are typically only located in low areas for the tall objects [35], where it was found to have little effect on the misclassification of pole-like objects, trees, or even pedestrians. Completely occluded objects were also not extracted from the urban point cloud for further classification.
Re-training with samples containing synthetic errors provides robustness to the CNN classifier. The classification of objects with low point density, noise, and small occlusions was improved; however, re-training did not solve the most serious problems. From these results, it is recommended to use some samples with synthesised disturbances in the training to make the classifier less sensitive to point density variations, ambient noise, and small occlusions.

Conclusions
In this work, three disturbances (point density, noise, and occlusions) were evaluated as affecting a CNN to classify the main classes of urban objects acquired with MLS. Synthetic disturbances with different levels of intensity were generated and applied to point clouds of urban objects acquired the cities of Vigo and Porto. The CNN was also re-trained with some of the new synthetic samples to assess whether the effects of the disturbances could be mitigated.
The main disturbances affecting the classification were density reduction and large occlusions. Reducing the density to a distance between points of 0.1 m, the precision was halved (0.56), and with a distance of 0.25 m, few samples were classified well. Noise showed a behaviour linked to the volume of the object and did not significantly affect the classification until very extreme levels that would never occur in a real acquisition. Thus, at the same noise level, bulky elements such as streetlights were unrecognisable, while smaller ones, such as waste baskets, were clearly visible. In samples with occlusion of radius 1 m, the precision fell to 0.69, while with 1.5 m the precision stood at 0.51. Small occlusions (less than 1 m of radius) had little effect in misclassification, while large occlusions showed problems when small objects were completely hidden or when occlusion was located on top of tall objects. However, if occlusions obscure an entire object, it would no longer be removable from the environment, and occlusions in high areas of objects do not usually occur in the urban MLS acquisition. With occlusion of radius 1 m, the precision fell to 0.69, while with 1.5 m the precision stood at 0.51.
Re-training the CNN with synthetic disturbance data provided some robustness to the classifier. The classification of samples with low density was improved up to 40%, ambient noise up to 20%, and occlusions up to 1 m radius up to 10%. In extreme cases of disturbances, there was not enough improvement to obtain a good classification. At the most extreme level of disturbances, none of the precisions exceeded 30%, even after network re-training.
In future work, other classification methods will be tested, and the simultaneity of the disturbances will be compared to see which disturbances are affected to a greater degree.