3D Scene Reconstruction Using Omnidirectional Vision and LiDAR: A Hybrid Approach

In this paper, we propose a novel approach to obtain accurate 3D reconstructions of large-scale environments by means of a mobile acquisition platform. The system incorporates a Velodyne LiDAR scanner, as well as a Point Grey Ladybug panoramic camera system. It was designed with genericity in mind, and hence, it does not make any assumption about the scene or about the sensor set-up. The main novelty of this work is that the proposed LiDAR mapping approach deals explicitly with the inhomogeneous density of point clouds produced by LiDAR scanners. To this end, we keep track of a global 3D map of the environment, which is continuously improved and refined by means of a surface reconstruction technique. Moreover, we perform surface analysis on consecutive generated point clouds in order to assure a perfect alignment with the global 3D map. In order to cope with drift, the system incorporates loop closure by determining the pose error and propagating it back in the pose graph. Our algorithm was exhaustively tested on data captured at a conference building, a university campus and an industrial site of a chemical company. Experiments demonstrate that it is capable of generating highly accurate 3D maps in very challenging environments. We can state that the average distance of corresponding point pairs between the ground truth and estimated point cloud approximates one centimeter for an area covering approximately 4000 m2. To prove the genericity of the system, it was tested on the well-known Kitti vision benchmark. The results show that our approach competes with state of the art methods without making any additional assumptions.


Introduction
Reconstructing the 3D scene using a mobile observer, also referred to as mobile mapping, is used in a variety of applications ranging from navigational tasks for autonomous vehicles to facility condition assessment or cartography. The former application imposes a real-time constraint on the processing time of the used algorithms, whereas the latter allows the system to take more computation time. The application that we have in mind aims at construction site monitoring. The goal there is to obtain a 3D model of the environment in a reasonable amount of time, which spans the duration of the intervention. To this end, we propose a system that combines both a LiDAR sensor (Velodyne HDL32-e) and a panoramic camera system (Ladybug 3) to perceive the environment. Some systems exist that rely on the output of a regular camera to obtain 3D reconstructions [1,2], known in the literature as (visual) structure from motion (SfM). In the case of sequential (or incremental) SfM, the problem is also referred to as simultaneous localization and mapping (SLAM). However, visual SfM produces rather sparse 3D reconstructions, which require much post-processing to make them more dense. Moreover, they are prone to drift and tend to fail when the images are overexposed because of the abundance of sunlight. For this reason, the core of our mobile mapping system is based on the output of a LiDAR sensor, aided by images of a regular camera to perform loop detection. Often times, mobile mapping systems are also relying on GPS to estimate the trajectory of the observer. However, in the environments we have in mind, i.e., construction sites with many overhanging or protruding structures (e.g., pipelines), the GPS signal is too weak. In closed indoor environments, it is even completely lacking. Therefore, we developed a system that is fully GPS-independent. In summary, we propose an entire 3D reconstruction system that covers both the odometry and mapping problem. It incorporates the alignment of consecutive point clouds, the fusion with a global 3D map and a loop closure mechanism to cope with drift. The main goal of our approach is to make the system as generic as possible and, hence, to make as little assumptions as possible about the environment and about the sensor set-up. Doing so, we are able to use our system in combination with a drone-mounted LiDAR sensor and camera in the future. The main contribution of this work is that our solution deals explicitly with the inhomogeneous density of point clouds produced by LiDAR scanners. This latter refers to the fact that some parts of the point cloud have a higher point density than others due to the inclination of the laser beams and the proximity of the objects present in the scene. Generally, regions close to the sensor origin will be densely sampled, whereas more distant objects will be sparsely sampled. As has been shown in numerous studies [3][4][5], traditional point-to-point registration techniques (e.g., Iterative Closest Point (ICP)) fail in these situations. For that reason, we propose a number of improvements over standard point-to-point strategies and incorporate them in one framework. The main contributions can be summarized as follows: 1.
We propose a surface analysis technique that is used to build a topological space on top of the point cloud, providing for each point its ideal neighborhood and taking into account the underlying surface.

2.
We keep track of a global 3D map that is continuously updated by means of a surface reconstruction technique that allows us to resample the point cloud. This way, the global 3D map is continuously improved, i.e., the noise is reduced, and the alignment of the following point clouds can be conducted more accurately.

3.
The topological space is used to compute low-level features, which will be incorporated in an adapted version of the ICP algorithm to make this latter more robust. The alignment of consecutive local point clouds, as well as the alignment of a local point cloud with the global 3D map will be conducted using this improved ICP algorithm. 4.
We incorporate the residual of the ICP process in the loop closure process. Based on this residual, we can predict the share of each pose estimation in the final pose error when a loop has been detected.

Related Work
Regarding LiDAR odometry and mapping, existing solutions can roughly be divided in four main classes: based on ICP, normal distribution transform (NDT), features and planar surfaces. The majority of the approaches presented in the literature are based on the iterative closest point (ICP) algorithm in order to perform scan matching and to identify the transformation between sensor poses [6][7][8]. However, the standard ICP algorithm has many limitations, and therefore, many improvements have been presented in literature. One of the main drawbacks lies in the fact that it is often difficult to find good point to point correspondences between two point clouds, especially in the case of a sparse point density or when the speed of the moving platform is too high. The authors of velocity ICP [9] therefore propose to perform velocity estimation over the ICP iterations. Doing so, the distortion of a scan due to motion can be compensated by re-projecting all points acquired during one sweep to the position of the sensor at the beginning of the sweep. In [10], the authors tackle the correspondence problem by defining low-level features that are used to guide the ICP process. The authors focus on pair-wise registration of datasets that do not exhibit large changes. They do not investigate a full sequential 3D reconstruction approach that incorporates a global 3D map or loop closure. The authors of Velodyne SLAM [11], on the other hand, incrementally build a map of the environment by aligning each newly arrived Velodyne scan with the whole map using ICP. Their global map is improved over time by adapting incoming measurements according to already existing adapted neighbors. However, the refinement of the global map is limited as they do not perform a surface analysis or re-sampling of the point cloud. Moreover, they do not incorporate a loop closure mechanism to cope with drift. Finally, some studies focus on combining the ICP-based scan matching with (stereo) visual odometry. For example, in [12], the authors propose a solution that uses the output of visual odometry to obtain a rough estimate of the ego motion upon which consecutive point clouds are registered. The generalized ICP algorithm [13] is then used to refine the motion estimation, and finally, the output is combined with reduced IMU outputs. Although these latter solutions seem effective, these studies, in contrast to ours, do not investigate how LiDAR odometry itself can be improved.
Aside from ICP, the normal distribution transform (NDT) [14] is another common technique that is often used for laser scan matching [15][16][17]. It is based on an alternative representation of a point cloud similar to an occupancy grid. In each 3D cell, a normal distribution is stored, and both scans are matched using Newton's method. Compared to ICP-based methods, the main advantage of this method is that there is no need for correspondence estimation. The main limitation is however that the accuracy of NDT is strongly related to the cell size, which is difficult to define in the case of inhomogeneous point clouds.
Another way to cope with correspondence estimation is to introduce the use of features. Feature-based methods have significant advantages as they concentrate on strong cues, such as corners and lines and filter out irregular points, such as vegetation or tree leaves. One very interesting feature-based system has been developed by Zhang et al. and described in [18,19]. Their technique is based on the extraction of only two different shape features, representing sharp edges and planar surface patches. Edge points of the source cloud are associated with edge lines, while planar patches are matched with other planar patches in the target cloud. In [19], the same authors elaborated on their method by incorporating data from a regular camera to perform visual odometry.
Finally, in [20][21][22], the authors use yet another approach based on the registration of planar surfaces. These planes can lead to very accurate alignments as they serve as strong cues and can be estimated by an accumulation of data, hence reducing the noise. However, although these methods are very effective, the lack of planes in the scene can cause them to fail and, hence, make them less generic.
In this work, we will adopt a hybrid approach in the sense that we combine ICP-based registration with feature extraction and surface reconstruction. To this end, we adopt the strategy of [10] by incorporating low-level features that describe the underlying surface in a simple way, i.e., using three main classes, linear, planar and volumetric. We exploit these properties to identify for each point an ideal neighborhood and, hence, a topological space representing the captured scene. The low-level features will further be used to guide the ICP process in such a way that the alignment concentrates itself near the most reliable regions, i.e., lines and corners instead of irregular points or clutter. Our approach can benefit from large planar surfaces while at the same time remaining generic. It is thus more widely applicable. In addition, we keep track of a global map, which will continuously be improved and refined with newly-captured points. More specifically, we use a surface reconstruction technique after which we resample the point cloud and improve its topological space. Finally, we incorporate loop closure to cope with the remaining drift.

Acquisition Platform
Our acquisition platform consists of a Velodyne High Definition LiDAR (HDL-32e) scanner combined with a Ladybug panoramic camera system. Because of this combination, we named it the Vellady platform. As can be seen in Figure 1, the Ladybug camera is mounted perpendicular to the ground plane, whereas the Velodyne is tilted on its head, making an angle of approximately 66 • with the ground plane. The Velodyne LiDAR scanner is equipped with 32 lasers mounted collinear and covering a vertical FOV of 41.3 • , hence resulting in a vertical resolution of 1.29 • . The vertical FOV covers 30.67 • below the middle point of the Velodyne and 10.67 • above it. The head is continuously spinning at approximately 10 Hz, resulting in a horizontal FOV of 360 • . Figure 2 shows an example of a point cloud obtained by this Velodyne sensor. The Ladybug on the other hand is a fixed system that comes in the form of a pentagonal prism consisting of five vertical-oriented sides, each one incorporating a camera. A sixth camera is mounted on top pointing upwards. It is shooting images at one frame per second. A stitched image obtained by this Ladybug camera system, corresponding with the point cloud of Figure 2, is depicted in Figure 3. We mounted this Vellady platform on a moving vehicle, i.e., a kitchen cart or four-wheel trolley (see Figure 1). Our platform is hence not able to rotate around its roll angle. A rotation about its pitch angle is only possible in the case that the ground plane has a slope, but these situations were not encountered during the experiments. However, we did not exploit these facts as we want to retain the option to mount the platform on a drone in the future. Our system thus operates as a full six degrees of freedom SLAM algorithm incorporating three unknown position coordinates x, y, z and three rotation angles θ x , θ y , θ z . The main features of our platform are summarized in Table 1.

Terminology
As mentioned in the previous section, the Velodyne is spinning its head, thereby producing 360 • point clouds. A full rotation of the head is also referred to as a sweep. Throughout this paper, we use right subscript k, k ∈ Z + to indicate the sweep number and P k to indicate the point cloud perceived during sweep k. This point cloud is expressed using the local coordinate system of the Velodyne V k at the time that sweep k has started. We will use right superscript i to denote a single point p i k . Finally, we define I j to denote the j-th set of images. Recall that the Ladybug only outputs six images simultaneously each second, and hence, some point clouds will have no accompanying visual information. However, both sensors were synchronized, and as a result, the following condition is always warranted: ∀ j ∈ Z + , ∃ k ∈ Z + : t(I j ) = t(P k ). In this expression, t is a function that returns the timestamp at which respectively the j-th set of images was captured and sweep k was started. As the Velodyne is spinning at a frequency of 10 Hz, there will only be visual information available for every tenth point cloud P k .

Reference Coordinate System
As both sensor outputs P k and I k are expressed in their own coordinate system, we need to define a reference coordinate system to harmonize both. To ease this task, the platform was designed in a way that the origin of both sensors is collinear. Since the Ladybug was put perpendicular to the ground plane, it is sufficient to determine the orientation of the Velodyne sensor with respect to the ground plane and nullify the offset of the origins of both sensors. As the Velodyne scanner incorporates three gyroscopes and three associated two-axis accelerometers, the transformation can be derived from one of its accelerometers. Specifically, given the accelerometer output a = (a y , a z ) (cf. Figure 4), the angle about the x-axis, i.e., the pitch angle, can be computed as α = atan(a z /a y ). The final transformation T (vel2lady) ∈ R 4×4 from the Velodyne coordinate system V to the Ladybug coordinate system L is then given by Equation (1): Herein, the values of t = (t x , t y , t z ) were determined by subtracting the two vectors that project both origins to the ground plane. In the case of the Velodyne coordinate system V, the projection of the origin is performed after applying R. Next, the coordinate system W denotes the global coordinate system in which the final 3D point cloud is expressed. We decided to set the coordinate system of the Ladybug at the start of the first sweep L 0 as the global world coordinate system W.  The angle about the x-axis, i.e., the pitch angle, was computed using the output of the two-axis accelerometer a = (a y , a z ) from the Velodyne HDL. In this figure, the subscript L denotes the coordinate system of the Ladybug, whereas V denotes the coordinate system of the Velodyne.

Problem Statement
The problem can be formulated as finding the trajectory, i.e., the sequence of sensor poses S = {S 1 , . . . , S k , . . . , S N }, that transforms each point cloud P k in the world coordinate system W. The poses S k are defined as in Equation (2): Herein, T k,k−1 denotes the transformation of the point clouds P k and P k−1 acquired in two consecutive sweeps. The reconstructed 3D point cloud W k after k sweeps have been processed is given by W k = {P 0 , . . . , S k P k }. The reconstructed point cloud of the entire sequence containing N sweeps is finally given by W = {P 0 , . . . , S k P k , . . . , S N P N }. Figure 5 depicts a schematic drawing of this problem statement.

Approach
A schematic overview of our approach is depicted in Figure 6. As can be seen, it is implemented as a sequential, i.e., incremental, process in which every newly-arrived point cloud P k is first processed and subsequently added to the current world model W k . The process repeated for every point cloud is conducted in five steps. First, we project the generated point clouds on a 2D grid (Step 1). Subsequently, we conduct a surface analysis (Step 2), after which we perform pairwise alignment of two consecutive point clouds (Step 3), which serves as an initial guess for the current pose. Next, we register the aligned point cloud with a global 3D map (Step 4) and fuse the new points with this 3D map (Step 5). This fusion consists of re-sampling the point cloud by means of a surface reconstruction technique. Optionally, when a loop has been detected, we adopt loop closure to preserve global consistency (Step 6). This is done by means of pose graph optimization, which propagates the estimated error back in the pose graph. In the following sections, we will further clarify each of these steps.  Figure 6. Overview of the system, which is implemented as an incremental process. Every time a consecutive point cloud P k has arrived, it is first projected on a 2D grid (Step 1). Next, it is analyzed to obtain some low-level surface features (Step 2), after which it is aligned (registered) with the point cloud P k−1 acquired during the previous sweep k−1 (Step 3). The obtained transformation serves as an initial guess to subsequently align P k with the current world model W k−1 (Step 4) and to fuse them together to form W k (Step 5). This fusion consists of re-sampling the point cloud by means of a surface reconstruction technique. Optionally, when a loop has been detected, loop closure is performed to cope with the drift and to preserve global consistency (Step 6).

2D Projection
The 3D points captured by the Velodyne are organized since the 32 lasers are placed collinear in the vertical direction. Because of that, we can project the 3D points onto a two-dimensional spherical grid; cf. Figure 7. Figure 8 shows an example of a 360 • range image corresponding to the point cloud of Figure 2. Using this 2D projection, we can exploit the adjacency in the pixel domain, e.g., to quickly find neighboring points in 3D. We will exploit this knowledge to perform a surface analysis as explained in Section 4.2.
. The 3D points generated by the Velodyne scanner are projected onto a 2D spherical grid. This way, we can exploit the adjacency in the 2D domain, e.g., to quickly find neighboring points in 3D.  Figure 2 onto a 2D grid. The blue color means that points are close-by, whereas the red color denotes that points are located further away.

Surface Analysis
The surface analysis step is performed twice each iteration, once on the point cloud P k perceived during sweep k and once on the point cloud map W k−1 after the fusion of the previous point cloud P k−1 . Our proposed algorithm is an extension of the one presented in [10]. It aims to find for each 3D point the optimal neighborhood size or the most suitable local point set that describes the underlying geometry. This is an interdependence problem, as geometrical features largely depend on the choice of the neighborhood, whereas a good neighborhood definition should rely on the local geometry and, thus, on geometrical features. Let us first define a neighborhood function n as n r (p) : (x, y, z) → (x, y, z) N . In this equation, r denotes the search radius for neighboring points, and N denotes the number of points belonging to the neighborhood. Using this neighborhood, we determine the principal components of the covariance matrix of the points belonging to it. These principal components are used to decide whether the underlying surface is linear (1D), planar (2D) or volumetric (3D), cfr. Figure 9. Hence, we first compute the three eigenvalues λ 1 , λ 2 and λ 3 and their corresponding eigenvectors v 1 , v 2 and v 3 . Next, we define the standard deviation along an eigenvector as σ i = √ λ i for i ∈ 1, 2, 3. Using these three values, we can obtain a measure on how linear, planar or scattered the underlying surface is. To this end, we define the three dimensionality values as follows: These three values are normalized such that ψ 1 + ψ 2 + ψ 3 = 1; hence, they represent a partition of unity Ψ. Finally, the dimensionality label l of each point is given by Equation (5): [1,3] (ψ i ). Figure 9. Visual representation of the low-level surface features. The values σ i = √ λ i represent the standard deviation along the eigenvectors. When σ 1 ≈ σ 2 ≈ σ 3 , the points are volumetric, often times representing a scatter, such as vegetation. When σ 1 σ 2 , σ 3 , the respective points are lying on a line between planar surfaces or are part of thin structures, such as pipelines. Finally, when σ 1 , σ 2 σ 3 , the points are lying on a planar surface.
When σ 1 σ 2 , σ 3 , then ψ 1 will be larger than the two other features. This corresponds to lines between planar surfaces or thin structures, such as pipelines. On the other hand, when σ 1 , σ 2 σ 3 , then ψ 2 will be larger, corresponding to planar surfaces. Finally, when σ 1 ≈ σ 2 ≈ σ 3 , then ψ 3 will be larger often times representing a scatter, such as bushes or tree leaves. To determine the optimal radius r * , we use the concept of Shannon entropy. To this end, we compute the geometrical features for a growing radius size r ∈ [r min , r max ]. The Shannon entropy for the partition . This value gives a measure for the uncertainty about the dimensionality label. The optimal neighborhood radius r * is then defined as the minimum of the entropy function E: This radius r * leads to an initial guess of the optimal neighborhood V * p of a point p. As points within the optimal radius r * can still belong to different surfaces and can have different dimensionality labels, we define a similarity measure S denoting the homogeneity within the neighborhood of each point: In this equation, 1 represents the indicator function, and N is the cardinality of n r (p). If this value S is smaller than 0.5, we reconsider the optimal radius r * using Equation (8): In the other case, we remove the points belonging to the neighborhood of p that have another dimensionality label. Hence, we define the optimal neighborhood V * p of a point p as in Equation (9): In summary, the proposed method tries to find the ideal topological space T * representing the underlying surface of the point cloud. For each node (or point) in T * , we store a 19-dimensional feature vector, which is the combination of the eigen values λ = {λ 1 , λ 2 , λ 3 }, the eigen vectors V = {v 1 , v 2 , v 3 }, the dimensionality values Ψ = {ψ 1 , ψ 2 , ψ 3 }, the dimensionality label l and the values r * and E * . The last feature we define represents the omnivariance and is given by O = ∏ i∈ [1,3] σ i . One of the most important features in this vector is the local surface normal n i of a point p i , which can be approximated by the eigenvector v 3 corresponding to the smallest eigenvalue λ 3 of the principal components, which was computed using the optimal neighborhood.

Local Registration
The majority of the solutions presented in the literature to find the alignment of consecutive scans or point clouds are based on the iterative closest point (ICP) algorithm. As the name suggests, it is an iterative method that consists of four main steps in each iteration: (1) point selection; (2) correspondence estimation; (3) weighting; and (4) transformation estimation. In every iteration, the transformation is updated until convergence has been reached. In Figure 10, a graphical representation of the algorithm is depicted. However, the standard ICP approach has some severe limitations. One of the main issues originates from the fact that the point clouds generated by the Velodyne scanner have a sparse and inhomogeneous density. As a result, it is hard to find good point correspondences between two consecutive point clouds (referred to as source and target point cloud), as most of the time, the determined corresponding pairs will not represent the same physical point in space. This leads to point clouds that are not aligned accurately, even after a large number of iterations and, hence, to wrongly estimated sensor poses. In addition, the convergence process will be very slow. Motivated by this shortcoming, we propose to incorporate the low-level features that we derived in the surface analysis step to guide the ICP process. Doing so, we will not use the closest points in 3D as target points directly as is done in traditional ICP. Instead, we present an approach in which we use the local surface normals that were computed using the optimal neighborhood (cf. Section 4.1). We then minimize the distance from each source point to the tangent plane of its corresponding surface patch in the target point cloud. In the following, we will briefly discuss the four main sub-parts of the process. (1) Point Selection The goal of the point selection step is to focus on the most reliable areas for accurate registration. An area is considered as unreliable if it is located near the border between several objects or surfaces or if it belongs to a geometrically complex object. As the points corresponding with a volumetric label (l = 3) are mainly corresponding to scattered points, e.g., bushes or tree leaves, we will exclude them from the estimation process, as we consider them less reliable. This is mainly due to the fact that it is hard to estimate accurate surface normals for these points, as they represent complex structures and are too sparsely sampled. In addition, we will exclude planar points for which the uncertainty, i.e., the entropy, is too high. In the experiments, the threshold for the entropy was set to 0.8.

(2) Correspondence Estimation
In contrast with existing systems, we will select the corresponding target point p j k−1 as the point having the most similar neighborhood as the source point p i k in terms of the geometry of the underlying surface. To this end, we compute the distance from the source point to the target point in feature space. However, as it is too infeasible to compare every point in the target point cloud with the source point, we first determine corresponding candidates by selecting the closest points in Euclidean space using the 2D projection of both point clouds. More specifically, we will look for the points that are only a few pixels away in the 2D domain. Finally, the target point that has the smallest distance in feature space is chosen as the corresponding point.

(3) Weighting
In order to make the procedure more robust against correspondence outliers computed in the previous section, we incorporate robust M-estimators in the iteration process. Thus, instead of using fixed weights, we adapt them through the iterations resulting in an iteratively reweighted least squares (IRLS) ICP optimization. Compared to traditional solutions, this weighting is carried out using the distance of the corresponding pairs in feature space instead of Euclidean 3D space. This is important as the point clouds are inhomogeneous, and the distance between correct correspondence pairs, i.e., representing the same physical point, will vary greatly based on the distance of the points to the origin of the sensor. We want to emphasize that the feature vector of a point itself is not changing during consecutive ICP iterations, but its corresponding point can be chosen differently, as the matching process also depends on their Euclidean distance (cf. Section 4.3). Thus, the distance in feature space can be varying during the iterations.

(4) Transformation Estimation
In traditional ICP approaches, the transformation between point clouds is estimated by minimizing the sum of the Euclidean distances between corresponding points, known as the point-to-point distance. In our approach, we will not match points from the source point cloud P k with points in the target point cloud P k−1 , but rather match points of P k with surface patches of P k−1 . The goal is to minimize the distance between the points in P k with the tangent plane of the corresponding surface patch in P k−1 , known as the point-to-plane distance and given by the following error metric: Herein, T k,k−1 is the estimated transformation matrix; P k is the source point cloud; P k−1 is the target point cloud; n i k−1 is the surface normal according to target point; p i k−1 , w i is the weight vector; and c is the vector containing the indices of the N corresponding points. Equation (11) gives the expression to derive the final transformation matrix T k,k−1 : In order to solve this optimization problem, we adopt the method proposed by Low et al. in [23]. In that paper, a method is derived to approximate the nonlinear optimization problem with a linear least squares one in the case that the relative orientation between the two input point clouds is small. Since we consider point clouds acquired in two consecutive sweeps, this assumption is guaranteed in our case.

Global Registration
Once we have found the initial transformation T k,k−1 between point clouds P k and P k−1 , the next step consists of fusing P k with the current registered point cloud W k−1 . To this end, we first transform P k in the world coordinate system using the estimate of the current pose, which is given bŷ S k = T k,k−1 S k−1 resulting inP k =Ŝ k P k . As the point cloudP k is still not accurately aligned with W k−1 , we perform a second registration step based on ICP as explained in the previous section. However, regarding the transformation estimation, the cost function that we want to minimize is now given by Equation (12): To find the correspondence points ofP k in W k−1 , we can still use the criterion explained in Section 4.3. However, as the world map is growing, it is practically infeasible to look in the entire map for correspondences. Fortunately, this is not necessary as corresponding points in the map will be located close to the newly-added point cloudP k , as this latter is already transformed with an estimate of the current poseŜ k . Therefore, we can first filter the map by means of frustum culling using the oriented bounding box of the current sweep as a box filter extended with a small offset. The principal axes of this oriented bounding box are already known as these are previously determined by the accelerometer output. This box filtering will lower the processing speed without harming the guarantee to find good point correspondences. As we cannot use the organized structure of the separated point clouds any more after they have been fused with the point cloud map, we first create a kd-tree of the points remaining after the box filtering. After the registration ofP k with W k−1 , we define S k = T k,wŜk .

Map Fusion
After adding P k to the former world model W k−1 , the point density in this world model will increase. We can now refine the ideal topological space T * and hence the local surface normals in the global 3D map by re-estimating them using their new neighborhood. However, as this point cloud contains noise and has an inhomogeneous point density, we will resample it using the moving least squares (MLS) surface reconstruction algorithm [24]. In summary, this method will try to locally approximate the underlying surface by higher order polynomial interpolations between surrounding data points. Using these polynomials one can resample the point cloud and obtain more accurate estimates for the surface normals. The procedure can be described as follows. Using the ideal topological space T * , we have for each point p an optimal neighborhood V * p , as well as the tangent plane to p defined as H p [n, d]. For all points lying within this neighborhood, we can compute the distance to this tangent plane. Subsequently, we fit a polynomial in the set of distances from these points to the surface. To this end, we define a local approximation of degree m by a polynomial p ∈ Π m minimizing, among all p ∈ Π m , the weighted least-squares error of Equation (13): In this equation, I is the vector of indices representing the points in V * p ; {x i } i∈I are the orthogonal projections of the points {p i } i∈I onto H p ; and f i p i , n − d is the distance of p i to the tangent plane represents the weighting function that is based on the distances to the tangent plane and the average separation σ r of the 3D points.
Once the parameters of the polynomials are known, we finally project the data points back on the moving least squares surface. This procedure will hence manipulate the sampled data points in such a way that they represent the underlying surface in a better way. In addition, we upsample the point cloud using voxel grid dilation in order to fill small gaps. This latter procedure will first dilate a voxel grid representation of the world model built using a predefined voxel size. After that, the resulting new points are projected to the MLS surface of the closest point in the world point cloud. With time, W k−1 is becoming larger and larger, and as a result, the search for nearest neighbors becomes quite intractable. For that reason, we store the map as an octree data structure. Using this octree, we can work on a downsampled version of the point cloud map W k−1 . The main benefit of this octree representation is that we can keep all points, but at the same time, we can freely choose the level of density we want to use.

Loop Closure
Detecting loops using the acquired point clouds would be too cumbersome, as it would take ages to compare every newly-generated point cloud with the entire 3D map. For this reason, we perform loop detection on the images of the Ladybug. We adopt the same strategy presented in [25] (DBOW2) and improved by [1] (ORB-SLAM). More specifically, we adopt a bag of words (BOW) implemented as a hierarchical tree that uses a visual vocabulary. This vocabulary is built offline using ORB descriptors and converts an image into a sparse numerical vector. Each image is thus converted into a bag of word vector that is used to compare it with other images and to measure the similarity. This latter is conducted in the same way as described in [25]. In addition, an index is maintained that stores for each word in the vocabulary the list of images where it is present. Doing so, we are able to perform comparisons only against images that have some word in common with the query image.  Figure 11. Overview of the loop closure procedure. When a loop has been detected, its loop transform ∆ is considered as an error and is propagated back in the pose graph. To this end, we use the residual of the minimization step (cf. Section 4.3) to assign a weight to each link in the pose graph. We assign a higher weight for those transformations that had a high residual in the previous minimization step. The idea is that a high residual indicates that two consecutive point clouds were potentially inaccurately aligned.
Let us assume that a loop has been detected between image set I j and I i . We now consider all point clouds P l captured in the interval [t(I j ), t(I j+1 )[ (recall that there are ten sweeps in the interval) and try to find the most similar point cloud P k captured in the interval [t(I i ), t(I i+1 )[. As the sensor could have visited the same location from different directions, it can happen that both ends of the loop have different orientations. Therefore, we first determine the matrix R k,l that describes the rotation between P l and P k . It is important to note that we compute the transformation on the local point clouds, which are expressed in their own coordinate system. The rotation estimation is a two-step process, consisting of an initial rough alignment and a refinement step using the ICP algorithm. The two point clouds P l and P k that have the lowest residual after the ICP step are considered the most similar point clouds and are denoted by P e and P s , representing both ends of the loop. Their poses are respectively S e and S s . Next, the difference in pose, i.e., the loop transform, is given by ∆ = (R s,e S e ) −1 S s . This loop transform is considered as an error as both poses R s,e S e and S s should be equal. The next step consists of propagating the error ∆ back in the pose graph. To this end, we use the residual of the minimization step (cf. Section 4.3) to assign a weight c i,j to each link in the pose graph. We assign a higher weight for those transformations that had a high residual in the previous minimization step. The idea is that a high residual indicates that two consecutive point clouds were potentially inaccurately aligned. On the other hand, the ICP process will have already been converging in the right direction. Next, we define the distance between two poses S k and S l as d(S k , S l ) = ∑ i,j c i,j . Herein, {i, j} denotes the set of all edges in the path from S k to S l . Finally, we define a weight w i = d(S s ,S i ) d(S s ,S e ) for each pose in the graph that specifies the fraction of the matrix ∆ by which the pose has to be transformed. The poses S k are than updated replacing t k by t k w k ∆ and R k by slerp(R k , w k ∆). In this latter assignment, slerp denotes the spherical linear interpolation function as described in [26]. A graphical representation of this loop closure procedure is depicted in Figure 11.

Evaluation
To evaluate the system, we considered two different datasets. The first dataset we have captured ourselves. It covers different environments, including a university campus and an industrial site. The latter was captured at a chemical site of the Dow company in Terneuzen in the Netherlands. In order to be able to compare our system with the state of the art, we also performed some experiments on the well-known Kitti vision benchmark that was presented in [27].

Our Dataset
The first set of sequences was captured at a chemical site of the Dow company in Terneuzen. This environment is part of a disused area that is planned to be demolished. It consists of many pipelines that were formerly used to carry liquids or gases, as can be seen in Figure 3. Although this environment seems outdoors, the GPS signal is far too unreliable due to the abundance of pipelines. The acquired data consists of video sequences recorded with our Vellady platform. The speed of the platform approximated walking speed, i.e., 4 km/h. Furthermore, accurate ground truth information by means of terrestrial laser scanning was captured using a Leica system.(www.leica-geosystems.be) An image of this ground truth point cloud is depicted in Figure 12. As we do not have ground truth of the actual trajectory, we have to compare the reconstructed point cloud with the ground truth point cloud. For this comparison, it is necessary that both point clouds are aligned. Hence, we will first perform a rough alignment based on key points that we manually selected from both point clouds. Subsequently, we perform ICP to align them in a fine way. The final residual of the ICP process, i.e., the Euclidean distance between all closest point pairs, can be considered as a measure for the accuracy of the reconstruction. The experiments demonstrated that the final residual of this ICP process is approximately 2.1 cm for our reconstruction. However, as the closest point pairs do not necessarily represent the same physical points in space, the ICP residual is not a perfect measure for the evaluation of the accuracy. For that reason, we conducted another experiment in which we use the dominant planes in the scene to make the comparison. More specifically, in both point clouds, we estimate the parameters of the most dominant planes after which we determine the corresponding planes. Subsequently, we compute for all inliers of the planes from the source point cloud the distance to its corresponding plane in the target point cloud. We hence define the average distance of two corresponding planes H s [n s , In addition, we also define the difference in angle of the two corresponding planes as φ(H s , H t ) acos(n s · n t ). The results are summarized in Table 2 for the eight most dominant planes in the scene. The table shows that on average, the deviation in angle of two corresponding planes is approximately 0.84 • , whereas the average distance between two planes is approximately 1 cm. In Figure 13, an image is depicted in which both the ground truth and estimated point cloud are shown after alignment seen from a bird's eye view. In Figure 14, the same models are depicted, but this time after they have been projected onto the ground plane to make it easier to evaluate the accuracy. Visually, there are little to no discrepancies noticeable between the two 3D point cloud models.    By means of comparison, the trajectory of the mobile observer is plotted in Figure 15. Herein, six plots are depicted representing the trajectories obtained by the visual structure from motion (SfM) approach using SIFT features, presented in [28] (in red) and the trajectory obtained by our proposed LiDAR mapping system (in blue). Each red graph is the result of the visual SfM method run on the output of one camera of the Ladybug system. As can be clearly seen, many poses are missing for this approach. This is due to the fact that sometimes too few good matches could be found between one image and the others. As a consequence, the resulting point cloud of the visual SfM is far more sparse than the one obtained by the LiDAR system. In addition, many outlier poses are present. The graphs show that the performance of visual SfM is less compared to odometry and mapping using LiDAR data, and hence, the former is less suited to obtain an accurate 3D model of the environment using a mobile observer. Finally, Figure 16 shows an image of the estimated trajectory, as well as the obtained 3D reconstruction using our proposed approach using the LiDAR data acquired by our Vellady platform. Lidar odometry Figure 15. Five plots of the trajectories estimated by the visual structure from motion (SfM) framework presented in [28] using the images of each Ladybug camera separately (red color). The result of the camera pointing upwards is omitted. The blue plot represents the trajectory estimated using our method. For the visual SfM approach, many poses are missing due to the lack of good feature matches. Moreover, many outlier poses are present. By means of further evaluation, a second set of sequences was captured at a campus of Ghent University. Some example images of 3D reconstructions are shown in Figure 17. Unfortunately, there is no ground truth available for these environments. Visually, we can see that the reconstructions are adequate. The reconstructions of the indoor environment (images at the bottom) however have a mirror effect due to the reflectance of the windows.

Kitti Vision Benchmark
To further evaluate our system, we performed experiments on the Kitti vision benchmark presented by A. Geiger in [27], currently one of the main benchmarks related to (visual) odometry. For that benchmark, a van was driving in the streets of the German city Karlsruhe, thereby recording data from different modalities, among them a Velodyne HDL-64e LiDAR scanner and a stereo camera rig. To evaluate our odometry system, we ran our algorithm on the eleven test sequences that were recorded. Important to note is that the sensor set-up by which the sequences were captured differs from our set-up in the sense that the LiDAR scanner was placed perpendicularly. In order to evaluate our results quantitatively, we projected the ground truth and estimated poses on the ground plane as is suggested by the Kitti vision benchmark. The plots of these trajectories are depicted in Figure 18. Visually, the graphs show that our method is capable of approximating the ground truth in most cases. For the very long sequences, the system suffers from drift as is inherent to the SLAM problem. These results were also obtained without the loop detection and loop closure as described in Section 4.6. Next, we derived the translational error e t and rotational error e r from the trajectories. In order to evaluate both errors, the trajectories were quantized in intervals of length ∆. The sensor pose S f corresponding with sweep f represents the first sensor pose of the interval, whereas S l represents the last sensor pose of the interval. For S f and S l , the following condition is valid: ||t f − t l || 2 ≈ ∆. We now define the difference of the two ground truth poses S f and S l and estimated posesŜ f andŜ l as S ∆ = S −1 f S l andŜ ∆ =Ŝ −1 fŜ l , respectively. The final pose error E ∆ is then given by Equation (14): Finally, the rotational error e r and translational error e t are given by Equations (16) and (17): The graphs that summarize the translational and rotational errors for ∆ ∈ {100, . . . , 800} are depicted in Figures 19 and 20. These graphs are the average numbers for the eleven different sequences of the Kitti dataset. As can be seen, the rotational error decreases from 0.016 • per meter to 0.006 • per meter when the path length ∆ increases from 100 to 800 meter. The translational error on the other hand increases from 1% to 1.75% per meter for the path length increasing from 100 to 500. After that, it stabilizes more or less. Regarding the translational error, our algorithm competes with the methods ranked seventh to 20th in the Kitti vision benchmark. Figure 20 also shows the translational and rotational errors for an increasing speed. The rotational error varies only slightly when the speed increases. Regarding the translational error, we can see that it increases for higher speeds, which is expected. For speeds higher than 70 km/h, the error increases drastically. The reason for this failure is clear, as for higher speeds, consecutive point clouds begin to differ greatly from each other and no good correspondences can be found. However, we want to point up that our 3D mapping system is meant to operate in industrial plants or indoor environments where the speed of a mobile observer or a drone is always limited. Our system can perfectly cope with speeds up to 30 km/h. Moreover, our method is more generic, as it does not pose any restriction on the sensor set-up nor any prior knowledge of the type of the scene. Furthermore, it does not rely on the detection of the ground plane, and as a result, it does not need to be present in the LiDAR image. Although the latter assumption could improve the accuracy of the system, we want it to be generic and applicable in any environment and in combination with any sensor set-up. Finally, some example images of 3D reconstructions using the Kitti benchmark are presented in Figure 21. These reconstructions are part of the validation sequences for which no ground truth is provided. Ground Truth Lidar Odometry Figure 18. Eleven plots of the estimated (blue) and ground truth (red) trajectories of the Kitti benchmark presented in [27]. The sequences "00" to "10" are presented from left to right and from top to bottom. The algorithm was run without the loop detection and closure algorithm. The majority of the results are satisfying.

Conclusions
In this paper, a novel mobile mapping system using a multi-modal sensor set-up was presented. The platform is unique in the sense that it copes with the inhomogeneity of the point clouds produced by LiDAR scanners. To this end, it integrates both an intensive surface analysis, as well as a global map that is continuously updated and improved. Moreover, loop detection is conduction using the output of the Ladybug images, and a loop closure technique was proposed based on the output of the LiDAR odometry. Experiments demonstrated that our system is able to reconstruct challenging environments, such as chemical plants, and can compete with state of the art methods concerning LiDAR mapping in the field of autonomous vehicles; cf. the Kitti vision benchmark.