Sensor Fusion of Cameras and a Laser for City-Scale 3D Reconstruction

This paper presents a sensor fusion system of cameras and a 2D laser sensor for large-scale 3D reconstruction. The proposed system is designed to capture data on a fast-moving ground vehicle. The system consists of six cameras and one 2D laser sensor, and they are synchronized by a hardware trigger. Reconstruction of 3D structures is done by estimating frame-by-frame motion and accumulating vertical laser scans, as in previous works. However, our approach does not assume near 2D motion, but estimates free motion (including absolute scale) in 3D space using both laser data and image features. In order to avoid the degeneration associated with typical three-point algorithms, we present a new algorithm that selects 3D points from two frames captured by multiple cameras. The problem of error accumulation is solved by loop closing, not by GPS. The experimental results show that the estimated path is successfully overlaid on the satellite images, such that the reconstruction result is very accurate.

image-based methods have shown impressive results in terms of accuracy and scaling, owing to the improvement in computing devices and techniques related to structure-from-motion (SFM) approaches and bundle adjustment [1]. For example, Snavely et al. [2] collected images from the Internet and reconstructed 3D structures of several tourist attractions. The final result from a huge amount of data is obtained by utilizing parallel computing resources [3] and visualized via multi-view stereo [4]. Pollefeys et al. [5] used multiple cameras with a small degree of overlap and utilized a GPU to implement a plane sweeping method. However, camera-based methods have common limitations. The depth of a feature point computed by triangulation is not highly accurate unless it is seen in various directions and matched correctly. For this reason, only stable features are reconstructed accurately, while homogeneous areas cannot be reconstructed without assumptions. Three dominant planes (Manhattan world) [5,6] and vertical walls [7] are reasonable assumptions when reconstructing urban or indoor scenes, but they are not suitable for general scenes.
Laser sensors provide an accurate depth of their field of view without triangulation. Howard et al. [8] and Frueh et al. [9] used 2D laser sensors to reconstruct urban scenes. The range data scanned by the vertical sensors are accumulated based on the localization result computed by the horizontal sensor. Smith et al. [10] focused on compressing accumulated laser scans into a small number of meshes. Fentanes et al. [11] analyzed 3D data scanned by rotating vertical laser sensor for outdoor navigation and reconstruction. Banno et al. [12] obtained range data using 3D laser sensors and 2D laser sensors mounted on sliding modules. Xiao and Furukawa [13] detected lines from the point cloud obtained by laser sensors and merged them into a plane-based 3D model. The only process required for 3D reconstruction using laser sensors is the registration of the local range data. Registering data from 3D laser sensors is relatively easy, because we can obtain data with enough overlap. For example, Allen et al. [14] extracted planes and lines to compute a proper transformation between two 3D scans. Since the frame rate of 3D laser sensors is lower than that of 2D laser sensors, the structures scanned by 3D sensors are distorted more than those scanned by 2D sensors if they are mounted on a fast-moving ground vehicle. The high price of 3D laser sensors is also a problem of using them. If scanned data by 2D laser sensors can be accumulated accurately, using 2D laser sensors is a better solution to several applications than using 3D sensors.
Fusion of different sensors is a solution to the problems mentioned above. Sensors have different characteristics, strengths and weaknesses. Utilizing multiple sensors in a system can complement their weaknesses each other. A combination of cameras and laser sensors is a popular example of sensor fusion. It has been applied to the 2D-based localization of robots in indoor environments [15][16][17][18] and outdoor applications [19,20]. Recently, RGB-D sensors, such as Kinect, became popular in the robotics community. They are utilized in various research issues: SLAM [21][22][23], photometry [24][25][26], recognition [27,28] and other applications [29].
Bok et al. [30,31] proposed a new concept of camera-laser fusion for 3D reconstruction. The accurate depth provided by the laser sensor is independent of the pose of the system and the target scenes (even homogeneous areas can be scanned). The motion of the 2D laser sensor in 3D space (including an absolute scale) is computed by utilizing the scanned points projected onto the images. The system proposed in [30] is capable of reconstructing arbitrary scenes without any assumptions. However, it is not appropriate for urban scenes, because it is designed to be carried by a human operator to reconstruct narrow scenes. It does not work well if the target objects are distant or the system moves rapidly. Recently, Moghadam et al. [32] developed a hand-held system, which consists of a 2D laser sensor, camera and IMU for 3D reconstruction. It also has problems in large-scale outdoor environments, similar to the system by Bok et al. [30].
In this paper, we propose a sensor fusion system and algorithms for large-scale 3D reconstruction. The system is designed to be mounted on a ground vehicle to reconstruct scenes with roads. A vertical 2D laser sensor scans structures, and the reconstruction is done by accumulating the data scanned by the sensor. The difference between the proposed method and previous ones involving the use of two 2D laser sensors is that we do not assume 2D motion, but estimate motion in 3D space. Other researches utilizing vertical laser sensors tried to overcome the 2D assumption [33], but they are not completely free from 'ground vehicle movement'. In this paper, the motion of the system is estimated accurately even if the system moves freely in 3D space, so that it may be carried by any types of platforms, such as human operators or helicopters. Moreover, a high frame rate and good synchronization allow the system to be mounted on a fast-moving ground vehicle, whereas the previous system [30] required being carried at a low speed. We present new methods of estimating frame-by-frame motion and reducing accumulated error by utilizing a few closed loops. The results show that the motion is very accurate over thousands of frames and over tens of kilometers using several closed loops without GPS.
This paper is an extended version of [34]. The motion estimation algorithm (described in Section 3) of the proposed system is published in [35]. The performance of the algorithms in this paper is verified by several quantitative analyses and experiments using real data.

3D Reconstruction Using Range Sensors
In this paper, 3D structures are reconstructed by accumulating vertical laser scans (see Figure 1). This methodology looks old-fashioned compared to 3D sensors, such as Velodyne, time-of-flight camera or Kinect. However, 2D-laser-based reconstruction is still a cheap and efficient solution. Using any range sensor, 3D reconstruction requires its pose in 3D space. Moreover, any range sensor with known poses can be used for 3D reconstruction. In this approach, the only problem is the motion estimation of the sensor. Usually, 3D sensors do not require additional sensors for motion estimation: 3D-3D registration is enough for that. However, few sensors are appropriate for outdoor reconstruction. For example, Kinect [36] works only in an indoor environment, because the infrared pattern from a built-in projector is not recognized well in daylight condition. Time-of-flight cameras [37,38] provide low-resolution range images from a narrow field of view. Velodyne [39] provides high-quality information in outdoor environments, but it is relatively expensive.
We use a vertical 2D laser sensor to obtain 3D information. It is impossible to estimate the horizontal motion of a vertical laser sensor using only scan data from it, because adjacent scans have no relation (overlap). Usually, additional sensors are attached to estimate the motion of the vertical laser sensor. As we mentioned in Section 1, a number of research using vertical laser sensors has been published. We utilize a number of CCD cameras to estimate the motion of the sensor.

Sensor Configuration
Our previous system [30] consists of four cameras and two 2D laser sensors. We did not have to pay attention to the synchronization issue, because the movement of the system carried by a human operator was slow. However, the synchronization issue becomes serious if the system is mounted on a ground vehicle moving at a high speed. Since we cannot adjust the angular velocity of the mirror in a 2D laser sensor, two laser sensors cannot be synchronized if there is a slight difference between their scanning speeds. We decided to limit the use of laser units to one to avoid the synchronization issue. Figure 2 shows the proposed sensor system, which consists of six cameras and one 2D laser sensor. The laser sensor (SICK LMS151) scans 270 degrees, and the center (135 degrees) of the angular range is headed toward the sky, such that the laser sensor scans both sides of the vehicle. Two cameras (1 and 4 in Figure 2) see the 225-and 45-degree directions of the laser's scanning angle. Most laser data measured at the interval of 180-270 degrees and 0-90 degrees are projected onto Cameras 1 and 4, respectively. These cameras are rotated 90 degrees (roll) to project as many laser points onto them as possible. The other four cameras track only image features. We avoided the forward and backward heading directions, because there are usually few static features in those directions in a scene with roads. Figure 3 shows an example of the data captured by the system. Figure 2. The proposed sensor fusion system contains six cameras and one 2D laser sensor. The vertical laser sensor scans the structures. The non-overlapping cameras provide a wide field of view, which makes the estimated motion accurate.

Synchronization
The capturing speed of the laser sensor is 50 fps, while that of the camera is 60 fps. This difference makes the sensors capture data at different moments, while also making the calibration result useless. In order to prevent this problem, we develop a synchronization system for one laser sensor and multiple cameras. The laser sensor used in the proposed system does not give information regarding its shooting time; therefore, an infrared detector is attached onto the front of the laser sensor. The detected laser signal is sent to every camera after passing through a noise filter and an amplifier. The trigger signal is generated slightly (0.4 ms) after the infrared ray is received, due to hardware processing time.

Laser Points as 3D Points
In the motion estimation process, the most important difference between our method and typical camera-based methods is that we have additional laser points with known depths. The system is fully calibrated using a planar pattern (cameras [40]) and the point-line constraint (between cameras and laser sensor [30]). The points scanned by the laser sensor can be transformed into the camera coordinate and projected onto the images. We track those points on the images to obtain 3D-2D correspondences, which are useful in the motion estimation process (see Figure 4). Any template-based methods, such as the KLT (Kanade-Lucas-Tomasi) tracker [41], can be used to find their correspondences, but feature-based methods, such as SIFT [42] or SURF [43], cannot, because the locations of the laser points on images are different from those of feature points. Although a small number of them can be tracked properly, an advantage of using the laser points is that the motion is estimated without the previous motions, which are required to triangulate the image points. . Using the camera-laser calibration result, the laser points can be transformed into the camera coordinate and, thus, can be projected onto the image. Tracking them on the image generates 3D-2D correspondences, which are useful for the motion estimation of the system.

Current frame
Next frame Projection 3D-2D correspondence

Tracking
We estimate relative pose between adjacent frames using laser points and their tracking results. RANSAC [44] is a well-known method of pose estimation using correspondences, including outliers. We sample a number of point sets-three 3D-2D correspondences for each set-and compute pose candidate for each set. The candidate with the maximum number of inliers (points with error below a user-defined threshold) is selected as an initial solution. It is refined via non-linear optimization, which minimizes projection errors of inliers (laser points and image features). Laser points and image features are tracked through multiple frames (5 and 20 frames for laser points and image features, respectively, in our experiments). All projection errors of inliers in tracked frames are minimized in the optimization process. Bundle adjustment [1] may be applied to the final result, but it must be modified to include laser points. A modification of bundle adjustment for sensor fusion system is proposed in [31].
In order to compute an initial solution of frame-by-frame motion, we can use typical image-based algorithms, such as a perspective three-point algorithm [45], for single-camera-based systems or a generalized three-point algorithm [46] for multiple-camera-based systems. However, these algorithms experience degeneration when using three 'collinear' points. It is because three points on a line in 3D space is not changed by any rotation whose axis is equal to the line. Since we obtain 3D points from a 2D laser sensor, the degeneration appears frequently if a plane, such as a vertical wall or the ground, is scanned (see Figure 5). Figure 5. Scanning a plane with a 2D laser sensor leads to degeneration when using the typical three-point algorithms, because all laser points are collinear in 3D space. Collinear points are not changed by any rotation on the line that contains the points.

Multiple solutions exist !
Bok et al. [47] presented the laser three-point algorithm to avoid degeneration. In order to estimate the motion between the consecutive frames, this algorithm utilizes the laser points from both frames. The union set of the laser points from both frames is expected not to be collinear if the system moves while capturing data. In Figure 6, points Q 1 and Q 2 scanned at Frame 1 and point Q 3 scanned at Frame 2 are projected onto the corresponding images and tracked to the other frames. The angles among the rays (θ 1 , θ 2 , θ 3 , φ 1 , φ 2 , φ 3 ) and the distance to the points at the scanned frame (L 1 , L 2 , L 3 ) are known. The unknown variables are the distances to the points (l 1 , l 2 , l 3 ) at the tracked frames. We can obtain three Equations (1)-(3) with three unknowns from Figure 6 and compute solutions by solving a four-degree polynomial equation. Figure 6. Laser three-point algorithm [47]: two points Q 1 and Q 2 are selected from Frame 1 and tracked to Frame 2. One point Q 3 is selected from Frame 2 and tracked to Frame 1.

Generalized Laser Three-Point Algorithm
In this paper, we present a generalized version (i.e., a modified one for the multiple-camera setups) of the laser three-point algorithm. Similar to the laser three-point algorithm, we assume that two points Q 1 = [x 1 y 1 z 1 ] T and Q 2 = [x 2 y 2 z 2 ] T are scanned at Frame 1 and that a point Q 3 = [x 3 y 3 z 3 ] T is scanned at Frame 2. Q 1 and Q 2 are tracked to Frame 2, and Q 3 is tracked to Frame 1. The objective of the proposed algorithm is to compute the relative pose H = [R T ] between two frames, which moves each point Q n (n=1,2,3) onto its corresponding ray with the camera center of P n = [A n B n C n ] T and the direction vector of V n = [a n b n c n ] T , as described in Equations (4)- (6). The ray P n + λ n V n on which the point Q n should lie is referred to as 'ray n' in the rest of this paper.
In order to solve the problem, we compute the transformation, which makes the two points Q 1 and Q 2 lie on their corresponding rays via four steps. The first and the second steps are to transform Frames 1 and 2 into their canonical positions, as described in Figure 7. Without loss of generality, the canonical positions are defined to simplify following equations, which describes the process of 'two points on two rays' (see Figure 8). Since points Q 1 and Q 2 are scanned by the laser sensor, the points and the origin of the laser sensor are not collinear (unless one instance of the range data corresponding to Q 1 and Q 2 is equal to zero). The transformation H 1 = [R 1 T 1 ] of the first step is computed using the following equations, where O L is the origin of the laser sensor in the camera coordinate (it can be computed easily using the camera-laser calibration result).  In the second step, Ray 1, corresponding to point Q 1 , is aligned to be equal to the y-axis. The line connecting the points on the rays that are closest to the other ray is set to the z-axis. With this transformation H 2 = [R 2 T 2 ], Ray 2 becomes parallel to the x-y plane. l 1 and l 2 are constants that make the point on Ray 1 (P 1 + l 1 V 1 ) closest to Ray 2 and the point on Ray 2 (P 2 + l 2 V 2 ) closest to Ray 1, respectively.
The objective of the third and the fourth step is to transform points Q 1 and Q 2 onto their corresponding rays (see Figure 8). Because Rays 1 and 2 are parallel to the x-y plane, we do not have to consider the z-axis if the corresponding points have proper z-coordinates. The third step is to make the z-coordinates of Q 1 and Q 2 equal to their corresponding rays. The z-coordinate of Q 1 should be zero, and this is already satisfied. The z-coordinate of Q 2 should be equal to that of P 2 transformed into the canonical position of Frame 2. Let [D 2 0 d 2 ] T be the transformed coordinate of Q 2 . D 2 , d 2 and the transformation H 3 = [R 3 T 3 ] (T 3 = 0) are computed by the following equations. φ is the rotation angle of R 3 .  The fourth step is to transform points Q 1 and Q 2 onto their corresponding rays in the x-y plane. Let y = mx + n be the equation of the line connecting the points on the rays after the transformation. The transformation changes the x-y coordinates of Q 1 and Q 2 in the following manner: Let y = rx be the equation of Ray 2 in the canonical position of Frame 2, while ignoring the z-axis. n can be expressed in terms of r and m, because the transformed Q 2 should be on Ray 2. M is defined to simplify the equations.
The transformation H 4 = [R 4 T 4 ], which satisfies in Equations (18) and (19), is as follows: The fourth step 'two points on two rays' removes four degrees of freedom, while two degrees remain free. One of the free degrees is the slope m of the line connecting the two points mentioned in Equation (19). The other is the rotation angle θ of Frame 1 on the x-axis in its canonical position. The coordinates of Q 1 and Q 2 are not changed by this rotation, because they are on the x-axis.
We transform point Q 3 to the canonical position of Frame 1 to find solutions of m and θ, as shown in Figure 9. Let P 3 = [A 3 B 3 C 3 ] T and V 3 = [a 3 b 3 c 3 ] T be the camera center and the direction vector of Ray 3, which is transformed into the canonical position of Frame 1 by H 1 (= [R 1 T 1 ]). Let By means of the inverse transformation of H 4 (= [R 4 T 4 ]), Q 3 is changed in the following manner: Applying the inverse transformation of This point should be on Ray 3 rotated on the x-axis.
Three equations are derived from the constraint of Equation (32).
λ 3 is expressed in terms of m from Equation (33). θ is eliminated by computing the squared sum of Equations (34)   After solving the polynomial equation, we verify each solution candidate of m using the fact that the scale parameters λ 1 , λ 2 and λ 3 in Equations (4)-(6) must be positive (i.e., positive depth constraint). For each candidate of m, first we compute λ 3 using Equation (33) and discard the candidate if λ 3 is negative. Next, we compute Q 1 and Q 2 in the canonical position of Frame 2 using Equations (18) and (19). Their z-coordinates are equal to d 2 . Camera centers P 1 , P 2 and direction vectors V 1 , V 2 in that position can be computed easily by applying the transformation H 2 (= [R 2 T 2 ]) to the original ones P 1 , P 2 , V 1 and V 2 .
Scales λ 1 and λ 2 can be computed by Equations (4) and (5), because the relationship among P n , V n and Q n is not changed by any metric transformation. We also discard the candidate of m if the scales are negative. If all scales are positive, we know the coordinates of Q 1 , Q 2 and Q 3 in both frames. Their relative pose H is computed by registering point sets or accumulating transformations. In Equation (40), all transformations are modified into 4 × 4 matrices by adding fourth row [0 0 0 1] to them.

Experimental Validation
We generated a synthetic data set to verify the performance of the proposed algorithm. Three points and the relative pose are generated randomly. After generating the ground truth, we added Gaussian noise to the 3D points and projected their locations on the images. We generated data simulating single-camera system and non-overlapping two-camera system (heading in the opposite direction) using non-collinear points and collinear points. We compared four different algorithms mentioned above: perspective three-point algorithm (P3P, [45]), laser three-point algorithm (L3P, [47]), generalized three-point algorithm (G3P, [46]) and generalized laser three-point algorithm (GL3P, proposed). The simulation results displayed in Figure 10 are identical to what we expected. In the case of the two-camera setup, the generalized algorithms show better performance than the single-camera-based setup (see Figure 10c,d). We propose L3P and GL3P to avoid degeneration, and indeed, they provide more accurate results than the typical three-point algorithms in such a case (see Figure 10b,d). The proposed algorithm shows the best performance among the three-point algorithms in terms of accuracy.
We also examined the computation time of the proposed algorithm. Every three-point algorithm tested in this paper computes the solutions in three steps: (1) computes the coefficients of a four-or eight-degree polynomial equation; (2) solves the equation; and (3) computes the transformation matrix for each solution. In Table 1, 'Equation' and 'Matrix' indicate the average time for (1) + (2) and (1) + (2) + (3), respectively. Table 1 shows that the proposed algorithm computes the solution matrices faster than L3P and G3P and that its computation time is short enough to be applied to real-time implementations. Figure 10. The graphs show the rotation errors of four different three-point algorithms (perspective three-point algorithm (P3P, [45]), laser three-point algorithm (L3P, [47]), generalized three-point algorithm (G3P, [46]) and generalized laser three-point algorithm (GL3P, proposed)) using synthetic data: (a) a single camera and non-collinear points; (b) a single camera and collinear points; (c) non-overlapping cameras and non-collinear points; and (d) non-overlapping cameras and collinear points. The results of P3P and G3P are not displayed in (b) and (d), because the errors resulting from them are very large (about 0.5 rad) compared to those of the other algorithms. Graphs showing translation errors are not displayed, because they resemble those showing rotation errors. The proposed algorithm is also verified using real data. Because we do not know the ground truth of the motion between the adjacent frames, the final result, which is locally refined using inliers and globally refined by loop closing, is referred to as the ground truth (i.e., the desired result). We used a frame with no structures nearby to test the worst case of scanning only the ground. We selected three points randomly and computed the candidate 20,000 times using both the G3P and GL3P algorithms. The difference between the candidate and the ground truth is shown in Figure 11. The proposed method provides solutions with less serious errors than the G3P algorithm. Furthermore, the number of the samples that give no solution by the proposed method is much smaller than that by the G3P algorithm.
Although both algorithms compute an appropriate candidate with a large number of RANSAC iterations, the result shows that the proposed algorithm has a higher probability of computing an accurate initial solution within a small number of iterations.    We performed another experiment using short sequences with a closed loop. Again, we consider the final result refined using all of the inliers among the image features and the laser points as the ground truth. The results of the RANSAC process (1000 iterations) using the G3P and the proposed GL3P are compared to that result. The results are similar to the ground truth if we utilize all of the laser data (see Figure 12a,c). However, if we reduce the maximum range of the laser sensor (five meters in our experiment), both results become erroneous. In Figure 12b,d, the result using the proposed algorithm contains a small amount of drift in the rotation, while the result using the G3P algorithm contains severe distortion. When using real data, the G3P may estimate something with the laser points on the ground, because they are not perfectly collinear. However, they do not guarantee the accuracy of the estimation result using the G3P algorithm. This can be guaranteed only by the laser points on non-ground structures or with a large amount of laser data on the ground. This result shows that the proposed algorithm shows better performance than the G3P algorithm with the limitation of an actual experiment. Figure 12. The motion of short sequences with a closed loop is estimated using the G3P [46] and the proposed GL3P algorithms. The initial motion (the result of the RANSAC process) is compared to the final result (denoted by GT (ground truth)). (a,c) Using all laser points, the results contain small accumulation errors; (b,d) When using only some of the laser points (maximum range limited to five meters), the result using the GL3P algorithm contains a small amount of drift on rotation, while the result using the G3P algorithm contains severe distortion. The lengths of the sequences are about 440 m (a,b) and 850 m (c,d). (c) (d) In order to avoid degeneration in real experiments, we recommend using long-range laser sensors to scan non-ground objects as much as possible. If there are no objects nearby, we recommend moving the system continuously (without stopping) while the system captures data. Only the proposed GL3P algorithm can provide a good initial solution in that case. Any existing algorithms and even GL3P cannot compute a proper motion if the laser sensor does not move and scans a plane. Detecting and ignoring duplicated frames at the same location may be an alternative solution for that case.

Reducing Accumulated Error
The proposed method of 3D reconstruction accumulates frame-by-frame estimation results. This type of method always experiences an error accumulation problem. This is usually solved using global sensors, such as GPS, or by closing a few loops. In this section, we present a novel method of reducing the accumulated error by utilizing a few closed loops only and not GPS. The loops to be closed are selected by a human operator without considering the additional issue of automated loop closing.
We consider the case of visiting the same place twice, while capturing data continuously. Although the local motion estimation in this case is very accurate, usually, two visits are not registered well due to accumulated error. In order to reduce the error by loop closing, the relative pose between the visits must be computed first. In this paper, this is done by the registration of 3D point clouds. The local structures of the first visit and the second visit to be registered are reconstructed using hundreds of neighboring frames. Various methods of 3D-3D registration are available, such as the ICP (iterative closest point, [48]) and EM-ICP (expectation-maximization iterative closest point, [49]) algorithms. Moreover, additional information, such as the color or template of laser points, may be utilized [23,50,51] because their projected location on images are already known.
After the loops are closed, the error is distributed to the frames. In order to distribute the error equally and reasonably, we divide the frame-by-frame accumulation result into 'local parts', so that the end frames of the parts are equal to the first visit or the second visit of the closed loops, as shown in Figure 13. Let us assume that we have N closed loops (i.e., the accumulated result is divided into 2N + 1 local parts). P i and Q i (i = 1 ∼ N ) are the projection matrix of the first visit and relative pose between two visits of the i-th loop, respectively. Q i is computed by the 3D-3D registration mentioned above. We adjust the relative poses among the first visits to minimize the error to be distributed to each frame, while satisfying the closed-loop constraint. We call this process the 'registration' of local parts. P 2 ∼ P N are the unknown variables of the registration (P 1 is set to I as the reference coordinate). The cost function of the registration is the sum of the errors to be distributed in all parts: E k is the error of the k-th local part caused by the difference between accumulation (B k A −1 k ) and loop closing (b k a −1 k ). A k and B k are the projection matrices of the start frame and the end frame of the k-th local part, respectively, computed by frame-by-frame estimation. They are not changed during the registration process. a k and b k are the projection matrices of the closed-loop visits corresponding to the start frame and the end frame of the k-th local part, respectively. They are determined by P i and Q i (i = 1 ∼ N ). If the start frame of the k-th local part corresponds to the first visit of the i-th loop, a k is set to P i . If the start frame corresponds to the second visit of the j-th loop, a k is set to Q j P j . b k is computed in the same way as a k . The weight n k is the number of frames included in the k-th local part. The initial value of P i is set to the projection matrix of the corresponding frame in the accumulated result.
The error function g(E k ) consists of rotation error and translation error. Since rotation error is independent of translation error, we divide the refinement process into two steps-rotation and translation-and refine rotation matrices first. In the first step, the error function g 1 (E k ) is defined as the rotation error computed from R k . After optimizing rotation matrices of P 2 ∼ P N , rotation error R k is distributed equally to the frames included in the k-th local part using the method proposed by Sharp et al. [52], while translation vectors are fixed. After refinement of rotation matrices, translation vectors are refined by using error function g 2 (E k ). Figure 13. After computing the relative pose between two visits, the initial estimation is divided into local parts. In the registration process, the relative pose among the first visits of the loops (red dots) are adjusted in order to minimize the error, which should be distributed to the frames. We investigated the effect of the error distribution using the 'institute' dataset shown in Section 5. The sequence has 23 local parts, because we utilized 12 closed loops for refinement. The rotation error and the translation error distributed to each frame are smaller than 0.001 degrees and one millimeter, respectively. For your information, the initial result (before refinement) is computed by the GL3P algorithm explained in Section 3 followed by a local refinement using inliers among image features and laser points. The closed-loop error distributed to each frame is small enough to keep the projection error of the features small. The mean value of the projection error of the laser points is increased from 0.4649 to 0.4691 (+0.59%), while that of the image features is increased from 0.2019 to 0.2060 (+2.06%). The refined result after error distribution is similar to the frame-by-frame result in terms of the projection error. The projection error is computed using the tracking result as a reference. Although the tracking result may contain some error, the process of the motion estimation minimizes the projection error using the tracking result. Hence, the error distribution process does not affect the motion estimation process significantly and allows the local estimation result to satisfy the closed-loop constraint.

Experimental Results
We captured three places to verify the performance of the proposed system. The first place is a research institute in Daejeon, Korea. The result of motion estimation is shown in Figure 14. The length of the entire path is 3.5 km, and 67,300 frames were captured ('institute' dataset). The vehicle was driven at a low speed to obtain a dense scan result. This was possible, because there were few cars in the area. The accumulated error in the initial result is reduced by 12 closed loops. The 3D structure reconstructed by accumulating laser scans is shown in Figure 15. In order to compare the proposed method to the camera-based SFM (structure from motion) methods, we applied an SFM method [53] to the first 20,000 frames of the same dataset and reconstructed 3D structures. The results using the SFM method and the proposed method are compared in Figure 16. The camera-based SFM method computed system path and 3D structure using the triangulation results of tracked image features. The path contains some drift, and the reconstruction result is barely recognizable as an outdoor scene with a number of buildings. The proposed method computed the path and structure using both the image features and laser points. The result using the proposed method has a smaller drift on path and a small number of erroneous points. The structures are recognized easily due to the accuracy of the result.    The second place is an urban part of Daejeon, Korea. We captured 14 adjacent blocks, and most roads were captured twice to generate closed loops. The length of the entire path is 20 km, and 174,000 frames were captured ('city' dataset). The vehicle was driven at 40 km/h, and we sometimes stopped at traffic lights, while the data were captured continuously. The upper image of Figure 17 shows the initial estimation result, which includes the accumulated error. Applying 34 closed loops, the result is refined enough to be overlaid on the satellite image (the lower image of Figure 17). The uppermost block is reconstructed and magnified in Figure 18.  The last place we reconstructed is the KAIST (Korea Advanced Institute of Science and Technology) campus. Most scenes contain two types of roads: for ground vehicles and those for pedestrians only. We utilized both the proposed system and the previous hand-held system in [30]. The basic rationale behind the use of both systems is 'coarse-to-fine'. Structures near wide roads for vehicles are captured by the proposed vehicle-mounted system. Local environments in which vehicles are not allowed are reconstructed by the hand-held system. The relative pose between the results reconstructed by the different systems is computed by the registration of the local structures, as mentioned in Section 4. The method in Section 4 can be extended easily to handle multiple sequences. Two sequences are captured using the proposed system mounted on a ground vehicle, and the other two are captured using the hand-held system. The initial estimation results are shown in Figure 19. The total length of the sequence is 17.27 km, and the number of frames is 317,120 ('campus' dataset). The final result is overlaid on the satellite image in Figure 20. Several parts of the reconstruction result are magnified in Figure 21.   Figure 20. The initial results in Figure 19 are registered using 70 closed loops. The final result is successfully overlaid on the satellite image. The sequences captured by the vehicle-mounted system and the hand-held system are denoted by red and green curves, respectively. Figure 21. Parts of the reconstruction result of the 'campus' dataset.

Conclusions
In this paper, we proposed a new version of a sensor fusion system of cameras and a 2D laser sensor for large-scale 3D reconstruction. The proposed system is designed to be mounted on a ground vehicle. In order to capture data when the vehicle travels at speed, we increased the frame rate of the system, so that it was higher than that of our previous system [30], and synchronized the entire system using an infrared receiver. The reconstruction is done by accumulating vertical laser scans without the assumption of 2D motion. In order to avoid degeneration when using one 2D laser sensor, we proposed a generalized laser three-point algorithm for motion estimation of the proposed system. After several loops are closed by conventional 3D-3D registration algorithms, the frame-by-frame accumulation result is adjusted slightly to satisfy the closed-loop constraint while maintaining the projection error of the features. The accuracy of the estimated path was verified by overlaying the paths on satellite images and computing the error distributed to the frames.
There are several works that could improve the proposed system and method. The proposed system uses six non-overlapping cameras to obtain a near-omnidirectional field of view. We will attempt several different arrangements of cameras to find the optimal solution. The loops are detected and closed manually in this paper, but the process can be automated using recent techniques on scene matching. Additional sensors, such as GPS and IMU, can be attached to improve the accuracy of the motion estimation.