CuFusion: Accurate Real-Time Camera Tracking and Volumetric Scene Reconstruction with a Cuboid

Given a stream of depth images with a known cuboid reference object present in the scene, we propose a novel approach for accurate camera tracking and volumetric surface reconstruction in real-time. Our contribution in this paper is threefold: (a) utilizing a priori knowledge of the precisely manufactured cuboid reference object, we keep drift-free camera tracking without explicit global optimization; (b) we improve the fineness of the volumetric surface representation by proposing a prediction-corrected data fusion strategy rather than a simple moving average, which enables accurate reconstruction of high-frequency details such as the sharp edges of objects and geometries of high curvature; (c) we introduce a benchmark dataset CU3D that contains both synthetic and real-world scanning sequences with ground-truth camera trajectories and surface models for the quantitative evaluation of 3D reconstruction algorithms. We test our algorithm on our dataset and demonstrate its accuracy compared with other state-of-the-art algorithms. We release both our dataset and code as open-source (https://github.com/zhangxaochen/CuFusion) for other researchers to reproduce and verify our results.


Introduction
Real-time camera tracking and simultaneous dense scene reconstruction has been one of the most actively studied problems in computer vision over recent years. The advent of depth cameras based either on structured light (e.g., Asus Xtion, Kinect 1.0) or time-of-flight (ToF) (e.g., Kinect 2.0) sensing offers dense depth measurements directly in real-time as video streams. Such dense depth sensing technologies have drastically simplified the process of dense 3D modeling, which turns the widely available Kinect-style depth cameras into consumer-grade 3D scanners.
KinectFusion [1] is one of the most famous systems for registering each incoming frame of depth images captured during the scanning into one integrated volumetric representation of the scene. An iterative closest point (ICP) algorithm [2] is performed to align the current depth map to the reconstructed volumetric truncated signed distance function (TSDF) [3] surface model to get the camera pose estimation. Each depth measurement is fused into the TSDF model directly to update the reconstruction. A triangulated 3D mesh model could finally be extracted using a Marching Cubes type algorithm [4].
Existing geometric alignment approaches based on ICP and its variants [5] are prone to drift in the presence of structure-less surfaces. Drift may accumulate and even cause camera tracking to fail when scanning larger man-made environments. Meanwhile, the weighted moving average TSDF fusion strategy assumes a Gaussian noise model on the depth measurements, together with a naïve surface visibility predicate that every surface point is visible from all sensor viewpoints [6]. This predicate is only locally true and is usually violated by surface occlusions [1]. Although truncation of the signed distance function (SDF) is performed to keep surfaces from interfering, surface blurring and the inflation problem (as shown in Figure 1c) may occur when scanning around tiny objects or sharp geometries in the scene.
Figure 1. (c) KinectFusion: mild accumulated camera drift and simple moving average truncated signed distance function (TSDF) fusion result in reconstruction inflation; (d) Our approach, CuFusion, keeps drift-free camera tracking with additional constraints of a cuboid reference object and preserves the fidelity of the reconstructed objects using our prediction-corrected TSDF fusion strategy. Note the sharpness of the cuboid edges and the thinness of the character's ears in our reconstruction.
Existing algorithms have been proposed to keep globally consistent camera trajectory estimation. Pose graphs are created and optimized when large loop closures are found [7], which may substantially reduce the odometry error accumulation. On the task of scanning small-sized scenes or objects, however, even small camera drift may cause deformation of the reconstruction. We propose a novel algorithm called CuFusion, which particularly focuses on the application of reconstructing small-sized scenes and objects precisely in real-time, with the accuracy of both camera tracking and data fusion significantly improved. With a priori knowledge of the planar faces and occluding contours of the cuboid reference object partly or totally present in the scene, each data frame is aligned against both the reconstructed scene and the localized cuboid model, and thus drift-free camera trajectories are maintained.
The predicate that every surface point is visible from all sensor viewpoints is only locally true due to surface occlusions [1]. In our work, we drop such assumptions and implement a "prediction-corrected" data fusion algorithm to integrate all incoming data into one geometrically consistent 3D model in the global reference frame. Instead of a simple moving average surface reconstruction, our work extends the TSDF representation by adding components storing the locally consistent TSDF value, the pixel ray and surface normal vector in each voxel grid for the detection of the camera view variation and correction of the global TSDF value. Experimental results (Figure 1) show the ability of our fusion method to keep the structural details of surfaces, which is on par with, or better than, existing state-of-the-art reconstruction systems that focus mostly on camera tracking accuracy.
Many scanning and reconstruction systems use both RGB and depth images. Feature-based registration is combined with dense ICP shape matching to estimate the best alignment between consecutive frames. Our system exploits only depth information as input to maximize tracking accuracy, for the following reasons. First, some depth cameras, such as the ASUS Xtion PRO, are not accompanied by RGB cameras. Second, for RGB-D cameras that provide both color and depth streams, the spatiotemporal alignment of RGB and depth information at the pixel level may not be perfect. Third, by using only depth data, our system enables scanning in complete darkness, regardless of the ambient lighting conditions. We evaluate our algorithm qualitatively and quantitatively using both noiseless synthetic data and noisy real-world data captured by a hand-held Kinect. The synthesized data provide both ground-truth (GT) camera trajectories and GT mesh models, enabling both the trajectories and the reconstructions to be quantitatively evaluated. For real-world image sequences, unfortunately, we do not have GT camera trajectories. We 3D-printed several rigid models using a high-precision 3D printer (http://www.dowell3d.com/3d/3.html) for scanning, and evaluate the quality of our reconstructions by direct comparison with the GT models.

Related Work
Real-time 3D model reconstruction has been studied extensively in recent decades. Advances in range sensing technology have facilitated the development of real-time interactive range scanners for dense 3D surface model acquisition. Such range sensors, particularly those based on active sensing technologies, can be categorized into laser scanners [8,9], time-of-flight (ToF) sensors [10,11] and structured-light cameras [12]. The introduction of Microsoft's Kinect, based on structured-light sensing, has brought dense depth sensors within reach of consumers.
KinectFusion [1] of Newcombe et al. is one of the founding systems for real-time dense SLAM, taking a sequence of depth maps streamed from a Kinect-style sensor as input to create a globally consistent 3D model of the scene. Despite its influence, this algorithm has several limitations. First, the pure geometric alignment of ICP is prone to drift in the presence of structure-less surfaces. Second, the regular volumetric representation is memory consuming, which limits the reconstructed model to medium-sized rooms at limited resolution. Third, it cannot detect loop closures and therefore cannot recover from accumulated drift, leading to mesh artifacts.
Researchers have been making efforts to address the problems mentioned above. Henry et al. [13] were the first to combine texture feature matching with Generalized-ICP [14] using RGB-D data, to reduce the drift resulting from pure geometric alignment and to increase the robustness of visual odometry [15]. Loop closure is detected when a previously seen region is revisited, and a pose graph is optimized to create a globally consistent map in [13], as well as in the work of Endres et al. [7,16], Whelan et al. [17,18] and Kerl et al. [19]. Whelan et al. [20] further proposed ElasticFusion, a novel algorithm for loop closure optimization without a pose graph. Moreover, higher-level primitives such as edges [21,22], occluding contours [23], curvature information [24], lines [25] and planes [26–29] are used as additional information to constrain the pose estimation process.
On dense scene representation, Whelan et al. [17] extended the KinectFusion algorithm spatially to support large unbounded scenes, with a cyclical buffer data structure. Endres et al. [7,16] used the octree-based mapping framework OctoMap [30] to generate a volumetric 3D map of the environment at scale, yet no mesh model is created. Other researchers have used points and surfels [20,24,31–34] to represent the scene and render it with the surface-splatting technique [35]. Such point-based scene representations significantly reduce the computational complexity and memory overhead compared with volumetric approaches, and are therefore adequate for reconstructing large-scale environments. Note that Lefloch et al. [24] use curvature information as an independent surface attribute for their real-time reconstruction, leading not only to camera drift reduction but also to improved scene reconstruction.
However, despite the efforts exerted, both the camera pose estimation and the reconstructed models are far from perfect. On small-sized scenes particularly, slight camera drift may lead to reconstruction deformation and sharp depth edges or highly concave scenes are problematic for these approaches [36]. We tackle these problems and focus on fidelity preservation in this paper.

Method
We base our work on an open-sourced implementation of the KinectFusion algorithm from the PCL library [37]. Our reconstruction pipeline is illustrated in Figure 2, which is described in detail in the following sections.

Notation
We define the image domain as $\Omega \subset \mathbb{N}^2$, and a depth image $D_k : \Omega \to \mathbb{R}$ at time $k$. We represent the camera pose at time $k$ in the global coordinate frame $\mathcal{F}_g$ by a rigid transformation matrix:
$$T_{g,k} = \begin{bmatrix} R_{g,k} & t_{g,k} \\ \mathbf{0}^{\top} & 1 \end{bmatrix} \in SE(3),$$
with a $3 \times 3$ rotation matrix $R_{g,k} \in SO(3)$ and a $3 \times 1$ translation vector $t_{g,k} \in \mathbb{R}^3$, which transforms a point $p_k \in \mathbb{R}^3$ in the camera coordinate frame $\mathcal{F}_k$ to a global point $p_g = R_{g,k}\,p_k + t_{g,k} \in \mathbb{R}^3$. We model the depth camera by the simple pinhole model, and use a constant camera intrinsic matrix $K$ to transform points on the sensor plane into image pixels:
$$K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix},$$
where $(f_x, f_y)$ are the horizontal and vertical focal lengths and $(c_x, c_y)$ is the image coordinate of the principal point.
We define the 3D back-projection of an image pixel $u \in \Omega$ as $p = K^{-1}\dot{u}\,D(u)$, where $\dot{u} := (u^{\top} \mid 1)^{\top}$ is the homogeneous form of $u$. Inversely, we define the perspective projection of a point $p = (x, y, z)^{\top}$ as $u = \pi(Kp)$, where the function $\pi(p) = (x/z, y/z)^{\top}$ performs perspective projection including the de-homogenization process. Prior to registration, an organized vertex map $V_k$ is computed by bilateral-filtering and back-projecting the raw depth image $D_k$. The normal map $N_k$ is computed using the PCA method. Given the camera pose $T_{g,k}$ at time $k$, we can transform both $V_k$ and $N_k$ to the global coordinate frame:
$$V^g_k(u) = T_{g,k}\,\dot{V}_k(u), \qquad N^g_k(u) = R_{g,k}\,N_k(u).$$
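The back-projection and perspective projection defined above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `back_project` and `project` are hypothetical, and the intrinsic values are placeholders rather than the values of Table 1.

```python
def back_project(u, depth, fx, fy, cx, cy):
    """Back-projection p = K^{-1} * (u, v, 1)^T * D(u) for a pinhole camera."""
    x = (u[0] - cx) * depth / fx
    y = (u[1] - cy) * depth / fy
    return (x, y, depth)


def project(p, fx, fy, cx, cy):
    """Perspective projection u = pi(K p): apply K, then divide by z."""
    x, y, z = p
    return (fx * x / z + cx, fy * y / z + cy)


# Placeholder intrinsics (assumed for illustration only).
fx, fy, cx, cy = 525.0, 525.0, 319.5, 239.5

# A pixel with a 1.2 m depth reading is lifted to 3D and projected back.
p = back_project((400.0, 300.0), 1.2, fx, fy, cx, cy)
u = project(p, fx, fy, cx, cy)
```

Projecting the back-projected point returns the original pixel, which is a quick sanity check on any intrinsic matrix used with the pipeline.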

Cuboid Localization
Given a depth image $D_k$ and a rectangular cuboid with edge lengths $P_{cu} = (a, b, c)$ present in the image, we localize the cuboid and calculate its pose in the global coordinate frame $\mathcal{F}_g$. Live depth frames are later aligned against the reference cuboid when scanning around it, to mitigate the accumulating camera drift.
We first perform plane segmentation using the Agglomerative Hierarchical Clustering (AHC) algorithm [38], as illustrated in Figure 3c. Then we check the orthogonality of the segmented planes. Two planes are considered to be orthogonal if the angle $\Theta_p$ between their normal vectors is approximately 90° (i.e., $|\Theta_p - 90°| < 5°$). Once we find three planes that are orthogonal to each other, we check the lengths of the intersecting line segments between the planes. If the lengths of the three line segments approximately match the cuboid edge length parameter $P_{cu}$ (differences below a threshold $\varepsilon_P = 10$ mm), we declare the cuboid found and mark the three planes as its adjacent planes.
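The two acceptance tests described above (pairwise orthogonality within 5° and edge lengths within 10 mm) can be sketched as follows. The helper names are hypothetical, and comparing the sorted edge lengths is an assumption of this sketch; the text does not say how measured segments are paired with $(a, b, c)$.

```python
import math


def angle_deg(n1, n2):
    """Angle in degrees between two plane normals (not necessarily unit length)."""
    dot = sum(a * b for a, b in zip(n1, n2))
    norm = math.sqrt(sum(a * a for a in n1)) * math.sqrt(sum(b * b for b in n2))
    cos_theta = max(-1.0, min(1.0, dot / norm))  # clamp against rounding
    return math.degrees(math.acos(cos_theta))


def is_orthogonal(n1, n2, tol_deg=5.0):
    """Planes count as orthogonal when their normals are within tol_deg of 90 degrees."""
    return abs(angle_deg(n1, n2) - 90.0) < tol_deg


def edges_match(measured_mm, cuboid_mm, eps_mm=10.0):
    """Assumption: pair measured and reference edge lengths after sorting both."""
    pairs = zip(sorted(measured_mm), sorted(cuboid_mm))
    return all(abs(m - c) < eps_mm for m, c in pairs)
```

For example, normals `(1, 0, 0)` and `(0.03, 1, 0)` differ from a right angle by less than 2° and pass the test, while measured edges of `(380, 300, 250)` mm fail against a `(400, 300, 250)` mm cuboid.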

Figure 3. (c) Planes segmented by the AHC algorithm [38] are labeled with random colors, the cuboid is localized with its vertices marked as green circles, and the axes of the cuboid frame are drawn in CMY colors; (d) The localized cuboid is drawn as a red wireframe in the depth image, and the "contour generators" proposed in [23] are drawn as white lines.
We then define the cuboid coordinate frame of reference. We set the frame origin $O_{cu}$ to the intersection point of the three orthogonal planes, and draw the system axes from the normal vectors. Due to the inaccuracy of the depth measurement and camera intrinsic calibration, orthogonality between the normal vectors of the segmented adjacent planes is not strictly guaranteed. We obtain the nearest orthogonal axes $[X_{cu}, Y_{cu}, Z_{cu}]$ of the frame by solving the Orthogonal Procrustes Problem. The cuboid pose in the camera frame at time $k$ is:
$$T_{k,cu} = \begin{bmatrix} X_{cu} & Y_{cu} & Z_{cu} & O_{cu} \\ 0 & 0 & 0 & 1 \end{bmatrix}.$$
Assuming the camera pose $T_{g,k}$ at time $k$ is known, the cuboid pose $T_{g,cu} = \begin{bmatrix} R_{g,cu} & t_{g,cu} \\ \mathbf{0}^{\top} & 1 \end{bmatrix}$ in the global coordinate frame can then be derived: $T_{g,cu} = T_{g,k}\,T_{k,cu}$. Figure 4 illustrates the notations used in the paper.
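The composition $T_{g,cu} = T_{g,k}\,T_{k,cu}$ is an ordinary product of 4 × 4 homogeneous matrices. The sketch below uses hypothetical helpers `make_pose` and `mat_mul` and made-up poses, purely to show how the rotation and translation combine.

```python
def make_pose(R, t):
    """Stack a 3x3 rotation and a translation into a 4x4 homogeneous matrix."""
    return [list(R[0]) + [t[0]],
            list(R[1]) + [t[1]],
            list(R[2]) + [t[2]],
            [0.0, 0.0, 0.0, 1.0]]


def mat_mul(A, B):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
Rz90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]  # 90 degrees about Z

# Made-up example poses: camera in the global frame, cuboid in the camera frame.
T_g_k = make_pose(Rz90, (1.0, 0.0, 0.0))
T_k_cu = make_pose(I3, (0.0, 2.0, 0.0))

# Cuboid pose in the global frame.
T_g_cu = mat_mul(T_g_k, T_k_cu)
t_g_cu = [row[3] for row in T_g_cu[:3]]
```

Here the cuboid origin, 2 units along the camera Y axis, lands at `(-1, 0, 0)` in the global frame, i.e. $R_{g,k}\,t_{k,cu} + t_{g,k}$.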

Camera Pose Estimation
Since we use depth maps as input sequences, only geometric alignment is performed. For each input frame $D_k$ at time $k$, we estimate the pose $T_{g,k}$ of the depth camera frame $\mathcal{F}_k$ with respect to the global frame $\mathcal{F}_g$ by registering the live depth map to both the global reconstructed surface model and the cuboid reference object.

A. Frame to Model Registration
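Given the implicit TSDF surface model $S$, the surface prediction w.r.t. the camera pose $T_{g,k-1}$ is obtained as an organized vertex and normal map $(\hat{V}_{k-1}, \hat{N}_{k-1})$, and transformed into the global frame as $(\hat{V}^g_{k-1}, \hat{N}^g_{k-1})$. For frame-to-model registration, a transformation $T_{g,k}$ is sought that minimizes the point-to-plane error between $T_{g,k}\dot{V}_k$ and $\hat{V}^g_{k-1}$:
$$E_{f2m} = \sum_{(u,\hat{u}) \in \mathcal{K}_1} \left\| \left( T_{g,k}\,\dot{V}_k(u) - \hat{V}^g_{k-1}(\hat{u}) \right)^{\top} \hat{N}^g_{k-1}(\hat{u}) \right\|^2, \qquad (7)$$
where $\mathcal{K}_1 = \{(u, \hat{u})\}$ is the set of correspondences obtained by projective data association [1]:
$$\hat{u} = \pi\left( K\,T_{k-1,k}\,\dot{V}_k(u) \right),$$
and $T_{k-1,k}$ denotes the transformation from the current time $k$ to time $(k-1)$ during each ICP iteration.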
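A minimal sketch of the point-to-plane error, assuming correspondences have already been established and are passed as parallel lists. The function names are hypothetical and the data are made up; the real system evaluates this per pixel on the GPU.

```python
def transform(T, p):
    """Apply a 4x4 rigid transform (nested lists) to a 3D point."""
    return tuple(T[i][0] * p[0] + T[i][1] * p[1] + T[i][2] * p[2] + T[i][3]
                 for i in range(3))


def point_to_plane_error(T, src_pts, dst_pts, dst_normals):
    """Sum of squared distances from transformed source points to the
    tangent planes at their corresponding destination points."""
    total = 0.0
    for p, q, n in zip(src_pts, dst_pts, dst_normals):
        tp = transform(T, p)
        residual = sum((tp[i] - q[i]) * n[i] for i in range(3))
        total += residual * residual
    return total


identity = [[1.0, 0.0, 0.0, 0.0],
            [0.0, 1.0, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0],
            [0.0, 0.0, 0.0, 1.0]]

# Two made-up correspondences on a plane z = 1 with normal (0, 0, 1).
src = [(0.0, 0.0, 1.02), (0.5, 0.0, 1.0)]
dst = [(0.0, 0.0, 1.0), (0.1, 0.0, 1.0)]
normals = [(0.0, 0.0, 1.0), (0.0, 0.0, 1.0)]

err = point_to_plane_error(identity, src, dst, normals)
```

Only the first pair contributes (0.02 squared); the second pair is offset purely within the plane and costs nothing. This is why a point-to-plane metric alone cannot prevent sliding along structure-less surfaces.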

B. Frame to Cuboid Registration
Assume that the cuboid pose w.r.t. the global coordinate frame is already known. For each camera pose $T_{g,k}$, per-pixel ray casting is performed on the global cuboid to synthesize a proxy depth map $D^{cu}_k$. An organized vertex and normal map in the global frame, $(V^g_{cu,k-1}, N^g_{cu,k-1})$, is then obtained by back-projecting the depth map and applying the local-to-global transformation. Similar to the frame-to-model registration, a frame is aligned against the cuboid surface in the global coordinate frame by minimizing the point-to-plane error:
$$E_{f2c} = \sum_{(u,\hat{u}) \in \mathcal{K}_2} \left\| \left( T_{g,k}\,\dot{V}_k(u) - V^g_{cu,k-1}(\hat{u}) \right)^{\top} N^g_{cu,k-1}(\hat{u}) \right\|^2, \qquad (9)$$
where $\mathcal{K}_2$ is the corresponding set of projective correspondences on the cuboid maps. In addition, we adopt the edge-to-edge error metric as a constraint to mitigate the potential camera drift. Given the inpainted depth map $D_k$, we find the edge points (i.e., pixels at depth discontinuities) on the live depth map along the contour generator set $C_k$ as proposed in [23]:
$$C_k = \left\{ s \in D_k \;\middle|\; \exists\, t \in N^8_s : D_k(t) - D_k(s) > \delta_c \right\},$$
where $N^8_s$ is the 8-neighborhood of pixel $s \in D_k$ and $\delta_c$ is the depth discontinuity threshold, set to 50 mm according to the sensor noise magnitudes [39]. Figure 3d shows the contour generators as white lines on the depth map. The edge point set $\mathit{Ve}_k$ of the live depth map is obtained by back-projection of $C_k$.
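The contour generator test can be sketched on a toy depth image. This is an illustration under assumptions, not the authors' code: it reads the criterion as "some 8-neighbor is farther than the pixel by more than the threshold", so contour pixels lie on the nearer (occluding) side of a discontinuity, and it treats a depth of 0 as invalid.

```python
def contour_generators(depth, delta_c=50):
    """Return the set of (row, col) pixels that have an 8-neighbor more than
    delta_c (mm) farther away. depth is a 2D list in mm; 0 marks invalid pixels
    (an assumption of this sketch)."""
    h, w = len(depth), len(depth[0])
    contour = set()
    for r in range(h):
        for c in range(w):
            d = depth[r][c]
            if d <= 0:
                continue
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if dr == 0 and dc == 0:
                        continue
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < h and 0 <= cc < w and depth[rr][cc] - d > delta_c:
                        contour.add((r, c))
    return contour


# Toy image: a foreground surface at 1000 mm next to a background at 1200 mm.
toy = [[1000, 1000, 1200, 1200],
       [1000, 1000, 1200, 1200],
       [1000, 1000, 1200, 1200]]
edges = contour_generators(toy)
```

On the toy image only the last foreground column is returned, which matches the intent of using occluding contours of the object rather than the background pixels behind them.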
On the other hand, the cuboid edges are discretized into a 3D point set $\mathit{Ve}^g_{cu}$ in the global frame with an interval of 1 mm. $\mathit{Ve}^g_{cu}$ is invariant to the camera pose, and is obtained once the cuboid is successfully localized, prior to the ICP registration procedure. We also build a KD-tree over $\mathit{Ve}^g_{cu}$ beforehand for fast correspondence search for each point in $\mathit{Ve}_k$. The edge-to-edge error to minimize is:
$$E_{e2e} = \sum_{(s,t) \in \mathcal{K}_3} \left\| T_{g,k}\,\dot{\mathit{Ve}}_k(s) - \mathit{Ve}^g_{cu}(t) \right\|^2, \qquad (11)$$
where $\mathcal{K}_3 = \{(s, t)\}$ is the correspondence set obtained by nearest neighbor search with the KD-tree.
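The edge discretization and the edge-to-edge error can be sketched as below. Several things are assumptions of this sketch: the cuboid is expressed in its own frame with one corner at the origin, live edge points are assumed to be already transformed into that frame, and a brute-force nearest neighbor search stands in for the KD-tree used in the paper.

```python
import itertools


def cuboid_edge_points(a, b, c, step=1.0):
    """Sample the 12 edges of an a x b x c cuboid (mm) every `step` mm.
    Corner points appear once per incident edge."""
    dims = (a, b, c)
    pts = []
    for axis in range(3):
        o0, o1 = [i for i in range(3) if i != axis]
        n = int(round(dims[axis] / step))
        for s0, s1 in itertools.product((0.0, 1.0), repeat=2):
            for i in range(n + 1):
                p = [0.0, 0.0, 0.0]
                p[axis] = i * step
                p[o0] = s0 * dims[o0]
                p[o1] = s1 * dims[o1]
                pts.append(tuple(p))
    return pts


def edge_to_edge_error(live_pts, edge_pts):
    """Sum of squared distances from each live edge point to its nearest
    cuboid edge sample (brute force instead of a KD-tree)."""
    return sum(min(sum((p[i] - q[i]) ** 2 for i in range(3)) for q in edge_pts)
               for p in live_pts)


edge_pts = cuboid_edge_points(400, 300, 250)
err = edge_to_edge_error([(10.0, 0.0, 2.0)], edge_pts)
```

For the 400 × 300 × 250 mm cuboid this yields 3812 samples, small enough that the lookup structure can be built once after localization and reused for every frame.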

C. Joint Optimization
We combine Equations (7), (9) and (11) to form a joint cost function:
$$E_{track} = E_{f2m} + w_{f2c}\,E_{f2c} + w_{e2e}\,E_{e2e},$$
where $w_{f2c}$ and $w_{e2e}$ are the weights that determine the influence of correspondences on the cuboid surfaces and edges. When setting $w_{f2c} = w_{e2e} = 0$, our optimization objective is equivalent to that of KinectFusion. We empirically set $w_{f2c} = 1$ and $w_{e2e} = 4$ in our experiments, enforcing the constraint of the edge correspondences. The camera pose $T_{g,k}$ is then obtained by minimizing the overall cost function $E_{track}$ iteratively. A linear approximation is made to solve the system, assuming the orientation change between consecutive frames is very small [1,40]. Using the small angle assumption at each iteration, we approximate the incremental rotation matrix as:
$$R \approx \begin{bmatrix} 1 & -\gamma & \beta \\ \gamma & 1 & -\alpha \\ -\beta & \alpha & 1 \end{bmatrix},$$
where $\alpha$, $\beta$, and $\gamma$ are the rotations in radians about the X, Y, and Z axes, respectively. Similar to KinectFusion, we compute and sum the linear system in parallel on the GPU, and solve it on the CPU using a Cholesky decomposition.
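The weighted combination and the small-angle linearization can be checked numerically. In this sketch the function names are hypothetical, the error values fed to `joint_cost` are made up, and the approximate rotation is compared against an exact rotation about Z only.

```python
import math


def joint_cost(e_f2m, e_f2c, e_e2e, w_f2c=1.0, w_e2e=4.0):
    """Weighted sum of the three registration errors (default weights from the text)."""
    return e_f2m + w_f2c * e_f2c + w_e2e * e_e2e


def small_angle_rotation(alpha, beta, gamma):
    """First-order rotation for small angles (radians) about X, Y and Z."""
    return [[1.0, -gamma, beta],
            [gamma, 1.0, -alpha],
            [-beta, alpha, 1.0]]


def rotation_z(gamma):
    """Exact rotation about the Z axis, used only as a reference."""
    c, s = math.cos(gamma), math.sin(gamma)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


approx = small_angle_rotation(0.0, 0.0, 0.01)
exact = rotation_z(0.01)
max_diff = max(abs(approx[i][j] - exact[i][j])
               for i in range(3) for j in range(3))
```

For a 0.01 rad rotation the largest entry-wise difference is on the order of 5e-5, dominated by the dropped $1 - \cos\gamma$ term. That is why the linearization is only applied to the small incremental rotation inside each iteration.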

Improved Surface Reconstruction
Although camera tracking is stabilized, surface reconstruction is still imperfect. The TSDF volumetric representation allows for online surface extraction as a polygon mesh, but the simple moving average TSDF fusion strategy proposed in KinectFusion suffers from the inflation problem, which lowers the reconstruction accuracy. Figure 5 illustrates one of our synthetic datasets, "armadillo." Even with noiseless depth images and the GT camera trajectory as input, the surface reconstruction is smoothed and inflated, particularly at the cuboid edges and at the claws and ears of the armadillo, and is far less satisfactory than the GT surface model.
The reason for fusion inflation is illustrated in Figure 6. Because the simple moving average TSDF fusion algorithm relies on the predicate that every surface point is visible from all sensor viewpoints [6], voxel grids with negative TSDF values interfere with the positive ones. To tackle this problem, we extend the storage of the TSDF $S(p)$ from the truncated signed distance value $F(p)$ and its weight $W(p)$ to:
$$S(p) \mapsto \left[ F(p),\, W(p),\, F'(p),\, W'(p),\, R^g(p),\, N^g(p),\, C_v(p),\, C_n(p) \right],$$
where for each voxel grid $p$:
1. $[F(p), W(p)]$ are the original TSDF components, and $[F'(p), W'(p)]$ are the "ghost" distance value and weight for correction of the existing TSDF prediction;
2. $R^g(p)$ and $N^g(p)$ are the view ray and the normal vector in the global coordinate frame, respectively, which are used to check whether a new surface patch is observed from a different view;
3. $C_v(p)$ and $C_n(p)$ are two integer counters serving as the confidence indices of voxel $p$ and its normal vector $N^g(p)$. When $C_v(p) > \delta_v$, we consider the distance value $F(p)$ of the voxel to be robustly estimated; when $C_n(p) > \delta_n$, the normal vector $N^g(p)$ is believed to be stable enough against the measurement noise.
The thresholds are set empirically to $\delta_v = 15$, $\delta_n = 5$, $\theta_r = 15°$ and $\theta_n = 30°$. We define a weight map $W_k$ for each input frame $D_k$ from the incidence angle $\theta_I = \mathrm{Angle}\left(R_{D_k}(p), N_{D_k}(p)\right)$ of the view ray to the surface and from a distance transform map $L_k$ obtained from the contour generator map $C_k$. For each grid $p$ in the TSDF volume, we obtain the adaptive fusion weight $W_k(p)$ and the truncation distance threshold $\mu_k(p)$ from the weight map at $u$, where $u$ is the projection of $p$ given the camera pose $T_{g,k}$, and from the base weight $W_{base}$ and base truncation distance $\mu_{base}$, which are set empirically. Our prediction-corrected TSDF fusion algorithm is detailed as a flowchart in Figure 7. We categorize the fusion procedure into three sub-strategies:
Moving Average: Identical to the TSDF update procedure of KinectFusion, simple moving average TSDF fusion is performed when a voxel has high uncertainty (e.g., at a glancing incidence angle or too close to a depth discontinuity edge).
Ignore Current: We ignore the TSDF value at the current time when a previously robustly estimated voxel is at a glancing incidence angle along the view ray. This is also the case when the current TSDF value, with higher uncertainty, is observed from a new perspective.
Fix Prediction: When a voxel with a previously stable TSDF value $F_{k-1}(p) < 0$ is observed to increase from a new point of view (either with $F_{D_k}(p) > 0$, or with $F_{D_k}(p) \le 0$ and $F_{D_k}(p) > F_{k-1}(p)$), we believe the live TSDF estimation is more trustworthy as a correction of the previous prediction. To guard against measurement noise, we fuse the live estimation into the ghost storage $[F'(p), W'(p)]$, and replace the global TSDF with the ghost storage once the ghost weight $W'_k(p)$ is above a threshold.
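The inflation caused by plain moving-average fusion can be reproduced with a single voxel. The sketch below implements the standard weighted running average used by KinectFusion-style systems; the scenario and numbers are made up, with TSDF values normalized to the truncation distance, and the function name is hypothetical.

```python
def moving_average_update(f_prev, w_prev, f_live, w_live=1.0):
    """Weighted running average of the TSDF value, with accumulated weight."""
    f_new = (w_prev * f_prev + w_live * f_live) / (w_prev + w_live)
    return f_new, w_prev + w_live


# A voxel that lies just OUTSIDE a thin structure (e.g. an ear).
f, w = 0.0, 0.0

# Front view: the voxel is in free space in front of the surface.
for _ in range(10):
    f, w = moving_average_update(f, w, 0.6)

# Back view: the same voxel now falls behind the surface seen from the other
# side, within the truncation band, so the live value is negative.
for _ in range(10):
    f, w = moving_average_update(f, w, -0.8)
```

After both passes the voxel holds about -0.1, so it is classified as inside the object although it lies in free space. The zero crossing has moved outward, which is the inflation effect that the "Fix Prediction" and "Ignore Current" strategies are meant to prevent.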

Note that the update of $R^g(p)$, $N^g(p)$, $C_v(p)$ and $C_n(p)$ is performed independently of the three fusion strategies. With our subdivided fusion algorithm, different surface areas are reconstructed elaborately, resulting in good preservation of high-curvature surface areas, as illustrated in Figure 6e.

Evaluation
We compare our algorithm with four other dense tracking and mapping approaches: KinectFusion [1] (PCL implementation [37]), the work of Zhou et al. [23], ElasticFusion [20] of Whelan et al., and the NICP algorithm [32] of Serafin et al. ElasticFusion jointly aligns RGB and depth information, while the other four are pure depth camera tracking and reconstruction approaches. We set the weight $w_{rgb} = 0$ for the RGB alignment component in ElasticFusion, so that it relies only on depth for camera tracking, as the other approaches do. For the evaluation of NICP, we run its CPU implementation offline at full resolution, with the default configuration (only the camera parameters are updated). We use the resulting point clouds for reconstruction accuracy evaluation.
Since our scanned objects are small, we use a volume of size $1\ \mathrm{m}^3$ with $256^3$ voxels for all the compared algorithms, so that each voxel is approximately 3.9 mm on a side.
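The voxel size follows directly from the volume side length and the grid resolution. The short sketch below also maps a point to its voxel index, assuming (for illustration only) that the volume origin sits at one corner and that coordinates are in millimetres.

```python
VOLUME_SIDE_MM = 1000.0   # 1 m cube
RESOLUTION = 256          # voxels per side

voxel_size_mm = VOLUME_SIDE_MM / RESOLUTION


def voxel_index(p_mm):
    """Index of the voxel containing point p_mm (origin at a volume corner)."""
    return tuple(int(c // voxel_size_mm) for c in p_mm)


idx = voxel_index((500.0, 0.0, 999.9))
```

The exact side length is 3.90625 mm, which bounds the finest detail a single voxel can represent at this resolution.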

A. Noiseless Synthetic Data
We synthesize three depth image sequences with ground-truth (GT) mesh surface models and GT camera trajectories. A camera intrinsic matrix K s is given to generate images of resolution 640 × 480, as shown in Table 1. We choose from "The Stanford Models" [41] the armadillo, dragon and bunny, and scale and place them respectively on top of a synthetic cuboid of edge lengths P cu = (400, 300, 250) mm. We then move the camera freely around the scene to generate GT trajectories and depth images, as illustrated in Figure 8. Note that neither the depth measurement noise nor the motion blur is modeled and the only measurement inaccuracy comes from data type casting from floats to integers when saving the depth images. Table 1. Camera intrinsic parameters used in our dataset, including the focal lengths f x , f y and the optical center c x , c y . Note that on real-world data the RGB and the depth camera share one intrinsic matrix K r since they are pre-aligned together.

Scenario
Intrinsic
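The float-to-integer cast mentioned for the synthetic sequences bounds the depth error by the storage unit. The sketch below assumes that depth is stored as an integer number of millimeters and that values are rounded to the nearest integer; both are assumptions for illustration (the text does not state the unit or the rounding mode), and the sample depths are hypothetical.

```python
def quantize_depth_mm(depth_m: float) -> int:
    """Round a metric depth to an integer number of millimeters (assumed storage unit)."""
    return int(round(depth_m * 1000.0))


def quantization_error_mm(depth_m: float) -> float:
    """Absolute error, in millimeters, introduced by the integer cast."""
    return abs(quantize_depth_mm(depth_m) - depth_m * 1000.0)


# Hypothetical rendered depths in meters.
samples_m = [0.5004, 0.82345, 1.23456, 1.4999]
errors_mm = [quantization_error_mm(d) for d in samples_m]
worst_mm = max(errors_mm)
print(f"worst-case quantization error: {worst_mm:.3f} mm")
```

Under these assumptions the error never exceeds 0.5 mm per pixel; if the cast truncated instead of rounding, the bound would be just under 1 mm.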

B. Noisy Real-World Data
We manufacture six rigid objects using a 3D printer and put them on a precisely manufactured cuboid with dimensions 400 × 300 × 250 mm³, the same as the one used in our synthetic data. The cuboid is placed on a turntable that is turned by hand, and we hold and move a Kinect camera slowly to capture more details of the objects. Pre-aligned RGB-D images of resolution 640 × 480 are generated at 30 Hz, with the camera intrinsic matrix K_r (Table 1). We pre-process the depth sequences by discarding depth pixels with values larger than 1.5 m, to remove static background areas. Figure 9 shows our GT mesh models, the 3D printed objects and the captured depth images with the scanned objects placed on top of the cuboid reference object. Note that in data "lambunny," a simplified bunny model with merely 640 vertices and 1247 faces is used, and in data "mug," a regular hexagonal mug resting upside down on the cuboid is scanned.
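The depth pre-processing step described above can be sketched as a simple per-pixel threshold. This is a minimal illustration, not the released implementation: it assumes depth is stored in integer millimeters and that 0 marks an invalid pixel (a common convention for Kinect-style depth maps), and the function name and sample image are hypothetical.

```python
MAX_DEPTH_MM = 1500  # 1.5 m threshold from the text
INVALID = 0          # assumed marker for "no measurement"


def truncate_depth(depth_mm, max_depth_mm=MAX_DEPTH_MM, invalid=INVALID):
    """Return a copy of a depth image (list of rows) with far pixels marked invalid."""
    return [
        [d if d <= max_depth_mm else invalid for d in row]
        for row in depth_mm
    ]


# Hypothetical 2 x 3 depth image in millimeters.
frame = [
    [800, 1500, 1501],
    [0, 2300, 1200],
]
filtered = truncate_depth(frame)
print(filtered)
```

Pixels at exactly 1.5 m are kept, since the text removes only values larger than the threshold.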

Error Metrics
On synthetic data, both GT camera trajectories and GT mesh surfaces are provided. We quantify the accuracy of the camera trajectory using the absolute trajectory error (ATE) proposed by Sturm et al. [42], evaluating the root mean squared error (RMSE) of the translational components over all time indices, which gives more influence to outliers. We further quantify the surface reconstruction accuracy using the cloud-to-mesh (C2M) distance metric [43] after aligning the GT model with the reconstructed model using the CloudCompare software [44]. We use two standard statistics, Mean and Std., over the C2M distances for all vertices in the reconstruction. On our real-world data, neither GT camera trajectories nor GT surface models of the entire scenes are available, so we focus on the evaluation of the reconstructed 3D printed models using the C2M error metric.
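The translational RMSE used for the ATE can be sketched as follows. This is a simplified illustration: it assumes the estimated and GT trajectories are already time-associated and expressed in the same coordinate frame, so the rigid alignment step of the full ATE procedure is omitted. The function names and the toy trajectories are hypothetical.

```python
import math


def translational_errors(estimated, ground_truth):
    """Per-frame Euclidean distance between estimated and GT camera positions."""
    if len(estimated) != len(ground_truth) or not estimated:
        raise ValueError("trajectories must be non-empty and of equal length")
    return [math.dist(e, g) for e, g in zip(estimated, ground_truth)]


def ate_rmse(estimated, ground_truth):
    """Root mean squared translational error over all frames."""
    errors = translational_errors(estimated, ground_truth)
    return math.sqrt(sum(err * err for err in errors) / len(errors))


# Toy trajectories in millimeters: only the last frame deviates, by 4 mm.
gt_mm = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (20.0, 0.0, 0.0), (30.0, 0.0, 0.0)]
est_mm = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (20.0, 0.0, 0.0), (30.0, 0.0, 4.0)]

rmse_mm = ate_rmse(est_mm, gt_mm)
mean_mm = sum(translational_errors(est_mm, gt_mm)) / len(gt_mm)
print(f"RMSE: {rmse_mm} mm, mean error: {mean_mm} mm")
```

In this toy case a single outlying frame yields an RMSE twice as large as the mean error, which illustrates why the RMSE gives more influence to outliers.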

Camera Trajectory Accuracy
We evaluate the absolute trajectory error (ATE) of the camera trajectories on the synthetic depth image sequences. Although planar surfaces of the cuboid occupy the majority of the depth images, KinectFusion [1], Zhou et al. [23], ElasticFusion [20] and our approach achieve decent camera trajectories without prominently accumulating drift, as listed in Table 2. NICP [32], however, produces inaccurate trajectories with ATE up to hundreds of millimeters. With the additional prior information from the cuboid, our approach significantly outperforms the reference algorithms, reducing the RMS odometry error from 3 to 8 mm down to less than 2 mm.
Since the errors of all the trajectory estimates (NICP excluded) on synthetic data are small (< 10 mm), we plot the per-frame ATE for each algorithm (Figure 10) rather than the trajectory overviews. Our approach (cyan line) exhibits the least drift on most of the frames compared with the other three algorithms.

Surface Reconstruction Accuracy
The surface reconstruction accuracy is evaluated with the cloud-to-mesh (C2M) distances between the reconstructions and the ground-truth mesh models. For our synthetic data, GT models of the whole scenes are provided, while for our real-world data we have GT models only for the 3D printed objects placed on the reference cuboid. Surface reconstructions are first aligned against the GT models for C2M distance computation, and heat maps of the C2M distances are plotted in Figure 11 for a qualitative evaluation of reconstruction accuracy. Rows 1 to 3 show the reconstructions of the synthetic data inputs, and rows 4 to 9 show the real-world ones. The outputs of ElasticFusion in column 3 are not watertight, since it outputs point clouds instead of meshes. NICP is excluded from the comparison, since its inaccurate camera trajectories result in invalid scene clouds on our benchmark dataset. Note how tightly our approach preserves the scale of the reconstruction and maintains high fidelity, particularly on sharp geometries.

Figure 11. Heat maps of the C2M distances. Columns from left to right: KinectFusion [1], Zhou et al. [23], ElasticFusion [20], ours.
We quantitatively evaluate the C2M errors for each algorithm with Mean and Std. statistics, as shown in Tables 3 and 4. Our approach attains the minimum values of both Mean and Std. on all experimental datasets, indicating its superiority in accuracy over the compared algorithms. Close-up views of the reconstructions are shown in Figure 12 for further comparison between KinectFusion and our approach.
Table 4. Surface reconstruction accuracy on our real-world data, with C2M error metric (Mean ± Std.) in millimeters. Note that for real-world data, the evaluation is performed only on the 3D printed objects, not on the whole scene.
Additionally, we evaluate the reversed C2M errors, namely the distances from the point clouds of the GT models to the meshes of the surface reconstructions. ElasticFusion is excluded from this comparison, since it produces no surface meshes. Tables 5 and 6 show the quantitative results of this evaluation. On average, the error of this metric is slightly larger than that of the normal C2M distance metric shown in Tables 3 and 4, which results from the inability of the compared algorithms to accurately reconstruct extremely sharp surface geometries.
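The asymmetry between the normal and the reversed C2M metric can be illustrated with a toy example. The sketch below is not the CloudCompare computation: as a simplifying assumption it uses nearest-neighbor point-to-point distances as a stand-in for point-to-mesh distances, and it uses the population standard deviation, which may differ from the tool's definition. All point sets are hypothetical.

```python
import math
from statistics import mean, pstdev


def nearest_distances(source, target):
    """Distance from each source point to its nearest target point (brute force)."""
    return [min(math.dist(p, q) for q in target) for p in source]


def mean_std(distances):
    """Mean and population standard deviation of a list of distances."""
    return mean(distances), pstdev(distances)


# Toy GT model with a sharp spike at (1, 0, 2); the reconstruction misses it.
gt_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
recon_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]

forward = mean_std(nearest_distances(recon_points, gt_points))  # reconstruction -> GT
reverse = mean_std(nearest_distances(gt_points, recon_points))  # GT -> reconstruction
print("forward (Mean, Std.):", forward)
print("reverse (Mean, Std.):", reverse)
```

Every reconstructed point lies on the GT model, so the forward error is zero, while the missing spike is penalized only in the reverse direction. This mirrors the explanation above for the slightly larger reversed C2M errors.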

Discussion and Conclusions
We have presented a novel approach called CuFusion for real-time 3D scanning and accurate surface reconstruction using a Kinect-style depth camera. A man-made cuboid, the scale of which is accurately known, is used as a reference object for accurate camera localization without explicit loop closure detection, and a novel prediction-corrected TSDF fusion strategy is employed for reconstruction update. By solving the surface inflation problem introduced by the simple moving average fusion strategy, our approach preserves the surface details especially when scanning tiny objects or edge areas with high curvatures, resulting in high-fidelity surface reconstruction, which also improves the camera odometry accuracy in turn. We provide a dataset CU3D for the quantitative evaluation of our algorithm and have made our code open-source for scientific verification.
There are several limitations for future work to overcome. First, our modified dense volumetric representation needs 16 bytes per voxel, four times as much memory as KinectFusion at the same resolution, which limits our reconstruction to small scenes. Second, to reconstruct high-curvature geometries, the camera should be moved as steadily as possible to reduce motion blur and uncertainty in the depth measurements. Our algorithm trades robustness for reconstruction accuracy and may fail in the presence of camera jitter or large motion. Figure 13 shows an example of a reconstruction failure resulting from depth motion blur artifacts. Although no noticeable tracking drift occurs, the reconstruction is fragile due to our prediction-corrected TSDF fusion strategy. Third, despite our efforts, the reconstructions are still imperfect due to sensor noise and the limited volume resolution. As illustrated in Figure 12, engraved surfaces such as the armadillo shell and the facial expressions of the owl, wingedcat and buddhahead are smoothed out. Additionally, very thin geometries such as the owl's ears and the mug's handle are partly lost.
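The memory limitation can be made concrete with a short calculation. The sketch below uses the 16 bytes per voxel stated above and derives the KinectFusion figure from the stated factor of four; the 512³ case is a hypothetical higher resolution added for illustration and does not appear in the experiments.

```python
MIB = 1024 * 1024

CUFUSION_BYTES_PER_VOXEL = 16                                 # from the text
KINECTFUSION_BYTES_PER_VOXEL = CUFUSION_BYTES_PER_VOXEL // 4  # "four times as much"


def volume_bytes(resolution: int, bytes_per_voxel: int) -> int:
    """Memory footprint of a dense cubic voxel grid, in bytes."""
    return resolution ** 3 * bytes_per_voxel


cufusion_256 = volume_bytes(256, CUFUSION_BYTES_PER_VOXEL)
kinectfusion_256 = volume_bytes(256, KINECTFUSION_BYTES_PER_VOXEL)
cufusion_512 = volume_bytes(512, CUFUSION_BYTES_PER_VOXEL)  # hypothetical resolution

print(f"256^3, 16 B/voxel: {cufusion_256 // MIB} MiB")
print(f"256^3,  4 B/voxel: {kinectfusion_256 // MIB} MiB")
print(f"512^3, 16 B/voxel: {cufusion_512 // MIB} MiB")
```

Doubling the resolution per axis multiplies the footprint by eight, which is why a dense grid at this per-voxel cost restricts either the scene size or the voxel size.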
Our future work will focus on the memory efficiency of our modified volumetric representation, enabling higher volume resolution and a larger scale of reconstruction. The octree-based framework OctoMap [30] could be used for volume data compression. Another interesting challenge might be the surface smoothing problem, which we will focus on mitigating using the surface curvature consistency among the captured frames.
