Sensors
  • Article
  • Open Access

Published: 15 May 2014

A Comparative Study of Registration Methods for RGB-D Video of Static Scenes

Institute for Computing Research, University of Alicante, P.O. Box 99, Alicante, Spain
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue State-of-the-Art Sensors Technology in Spain 2013

Abstract

The use of RGB-D sensors for mapping and recognition tasks in robotics or, in general, for virtual reconstruction has increased in recent years. The key aspect of these sensors is that they provide both depth and color information using the same device. In this paper, we present a comparative analysis of the most important methods used in the literature for the registration of subsequent RGB-D video frames in static scenarios. The analysis begins by explaining the characteristics of the registration problem, dividing it into two representative applications: scene modeling and object reconstruction. Then, a detailed experimental evaluation is carried out to determine the behavior of the different methods depending on the application. For both applications, we used standard datasets and a new one built for object reconstruction.

1. Introduction

Registration of multiple 3D datasets is a fundamental problem in many areas, such as computer vision, medical imaging [1], object reconstruction, mobile robotics, augmented reality [2], etc. The increasing availability of low-cost RGB-D sensors has opened new lines of research in this area.

Three-dimensional data can be obtained from different devices: 3D lasers, stereo cameras, time-of-flight cameras, RGB-D cameras, etc. Depending on the input sensor, some algorithms provide better results than others. 3D lasers are usually active, non-contact sensors. They emit a pulse of light and measure the time the light takes to return to the device. Other 3D lasers use triangulation with a camera to measure the deviation of the light depending on the depth of the object from which the light was reflected. Some 3D laser systems do not provide color information, so algorithms that need visual features are not suitable for them. Other 3D laser systems do provide color information (using different approaches to incorporate color into the depth data), but their cost is prohibitive. Stereo cameras use two or more conventional cameras to obtain the disparity (the difference in position of a point from one camera to the others, usually found by correlation) and, from it, the depth (objects close to the camera have larger disparity than farther ones). However, stereo cameras suffer from the lack of texture: image areas without texture do not provide depth information. Another sensor is the Photonic Mixer Device (PMD) (also known as a time-of-flight camera), which measures distances directly for a two-dimensional field of pixels, based on the time of flight of modulated infrared light. The visual information of PMD cameras, like the SR4000, is infrared. It is affected by natural light and is normally noisy. In our previous work [3], we performed some experiments using the SIFT visual feature method [4] with this kind of camera. As the SR4000 camera provided noisy images, the repeatability of the SIFT features was low.

Low-cost RGB-D sensors, such as the Microsoft Kinect, Primesense Carmine ( http://www.primesense.com) or Asus Xtion ( http://www.asus.com/Multimedia/Xtion PRO), represent a great advance for the robotics area. They are composed of an IR (infrared) projector, an IR CMOS camera and an RGB camera. The IR pair provides the depth information. The IR projector sends out a fixed pattern of bright and dark speckles. Using structured light techniques, depth is calculated by triangulation against the known pattern from the projector. The pattern is memorized at a known depth and, for each pixel, a correlation between the known pattern and the current pattern is computed, providing the current depth at that pixel. In this work, we have used the Kinect camera for the experiments. The Kinect camera has a resolution of 640 × 480 (307,200 pixels), a working range between approximately 1 and 8 m, and a frame rate of up to 30 fps. A detailed analysis of the accuracy and resolution of this camera can be found in [5].
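Both the stereo disparity and the structured-light scheme described above reduce to the same triangulation geometry. As a minimal sketch in our own notation (not the paper's): with focal length f in pixels, baseline b between the two cameras (or between projector and IR camera) and measured disparity d, the depth Z and its sensitivity to a disparity error Δd are approximately:

```latex
Z = \frac{f\,b}{d},
\qquad
\Delta Z \approx \frac{Z^{2}}{f\,b}\,\Delta d
```

The quadratic growth of the depth error with distance is consistent with the Kinect accuracy analysis referenced in [5] and explains why the quality of the depth data degrades rapidly at the far end of the working range.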

In this paper, we are focusing on low-cost RGB-D sensors. For the rest of the paper, when we mention RGB-D sensors or data, we are referring to the low-cost ones. With these specific kinds of sensors, we are interested in the study of the behavior of different registration methods for incremental video reconstruction of scenes and small objects. The work focuses on the registration of subsequent frames of a slowly moving Kinect camera in a static Lambertian scene.

The registration problem can be addressed in two ways. The first is searching for the solution in the correspondence space. In this case, the problem comprises two related sub-problems: correspondence selection and motion (or transformation) estimation. In the former, candidate correspondences between datasets are chosen, while in the latter, the transformation minimizing the distances between corresponding points is estimated. The second is searching for the solution in the transformation space. An objective function is defined (for example, the distance between the two datasets), and a search over different transformations is performed to find the one that minimizes the objective function.
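The two formulations can be summarized compactly as follows (a sketch in our own notation, not the paper's: P and Q are the two point sets, c(i) the selected correspondences, T a rigid transformation and F a generic alignment objective):

```latex
% Correspondence-space formulation: given correspondences c(i), estimate the
% rigid transformation minimizing the point-to-point distances
T^{*} = \arg\min_{T \in SE(3)} \sum_{i} \left\| T(p_i) - q_{c(i)} \right\|^{2}

% Transformation-space formulation: search directly over transformations for
% the one minimizing a global objective F (e.g., a dataset-to-dataset distance)
T^{*} = \arg\min_{T \in SE(3)} F\!\left(T(P),\, Q\right)
```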

Several reviews related to the registration problem can be found in the literature. In [6], a complete color image registration survey is presented. Tam et al. [7] made a survey of registration methods for rigid and non-rigid point clouds and meshes. In [8], a comparison among different iterative closest point (ICP) methods is presented, while in [9], a similar study is proposed, but with real-world datasets.

In this paper, we are focused on an experimental review of state-of-the-art rigid registration methods using RGB-D images. Therefore, our main contributions are:

  • A study of the most widely used approaches for registering static environments with RGB-D sensors, testing different methods on a state-of-the-art dataset.

  • A study of current methods for object reconstruction, a discussion of the problems of registering small objects using low-cost RGB-D sensors, and the creation of a new real-world dataset.

With respect to registration methods, RANSAC (random sample consensus) [10] usually works with features (visual or 3D). Since the global properties of objects are vulnerable to occlusions and clutter in the scene [11], local invariant features are used for this purpose. Moreover, local features can also be used with non-rigid objects, i.e., articulated or deformable objects. RANSAC is faster than other methods, and it allows a proper registration in the presence of noisy data. However, it depends on the ratio between inliers and outliers. If there are many more outliers than inliers and the number of inliers is low, the probability of finding the best solution is low. Furthermore, if the number of matched features is low, there is a high probability of obtaining a small number of inliers. Iterative closest point (ICP [12,13]) uses all of the points in the scene. ICP needs an initial alignment to register the scene. However, for small and smooth camera or scene motion in rigid scenes, incremental methods, such as ICP, achieve good results. ICP is more suitable for the local motion of noisy surfaces, while RANSAC achieves better results for global motion with precise correspondences containing outliers.
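As a minimal sketch of the RANSAC scheme discussed above, assuming putative 3D correspondences are already available (the thresholds, iteration count and use of Eigen's umeyama estimator are our own illustrative choices, not the implementation evaluated in this paper):

```cpp
// Minimal RANSAC sketch for rigid registration from putative 3D correspondences.
#include <Eigen/Dense>   // also provides Eigen::umeyama
#include <random>
#include <vector>

// src[i] is hypothesized to correspond to dst[i] (e.g., from feature matching).
Eigen::Matrix4d ransacRigid(const std::vector<Eigen::Vector3d>& src,
                            const std::vector<Eigen::Vector3d>& dst,
                            int iterations = 500, double inlierThresh = 0.02)
{
    std::mt19937 rng(42);
    std::uniform_int_distribution<size_t> pick(0, src.size() - 1);
    Eigen::Matrix4d best = Eigen::Matrix4d::Identity();
    size_t bestInliers = 0;

    for (int it = 0; it < iterations; ++it) {
        // 1. Sample a minimal set of three correspondences.
        Eigen::Matrix3d s, d;
        for (int k = 0; k < 3; ++k) {
            size_t idx = pick(rng);
            s.col(k) = src[idx];
            d.col(k) = dst[idx];
        }
        // 2. Estimate a rigid transform from the sample (no scaling).
        Eigen::Matrix4d T = Eigen::umeyama(s, d, false);

        // 3. Count inliers: correspondences consistent with T.
        size_t inliers = 0;
        for (size_t i = 0; i < src.size(); ++i) {
            Eigen::Vector3d p = T.topLeftCorner<3, 3>() * src[i] + T.topRightCorner<3, 1>();
            if ((p - dst[i]).norm() < inlierThresh) ++inliers;
        }
        if (inliers > bestInliers) { bestInliers = inliers; best = T; }
    }
    return best;  // In practice, re-estimate from all inliers of the best model.
}
```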

There exist several variants of the ICP method that are computed more quickly by reducing the number of points or by extracting fewer features. In this work, we have compared against the original ICP, only adding a kd-tree to speed up the correspondence search. Due to the high variety of methods, we have restricted our study to registration methods that use RGB-D sensors in static scenarios.
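For reference, a baseline ICP of this kind can be set up in a few lines with PCL, whose IterativeClosestPoint performs the nearest-neighbor correspondence search with an internal kd-tree; the parameter values below are illustrative, not the settings used in our experiments:

```cpp
// Minimal sketch of a kd-tree-accelerated baseline ICP using PCL.
#include <pcl/point_types.h>
#include <pcl/point_cloud.h>
#include <pcl/registration/icp.h>

Eigen::Matrix4f alignICP(const pcl::PointCloud<pcl::PointXYZ>::Ptr& source,
                         const pcl::PointCloud<pcl::PointXYZ>::Ptr& target)
{
    pcl::IterativeClosestPoint<pcl::PointXYZ, pcl::PointXYZ> icp;
    icp.setInputSource(source);              // cloud to be moved
    icp.setInputTarget(target);              // reference cloud
    icp.setMaxCorrespondenceDistance(0.05);  // ignore pairs farther than 5 cm
    icp.setMaximumIterations(50);
    icp.setTransformationEpsilon(1e-8);      // convergence criterion on the transform

    pcl::PointCloud<pcl::PointXYZ> aligned;
    icp.align(aligned);                      // kd-tree-based nearest-neighbor search inside
    return icp.getFinalTransformation();     // 4x4 rigid transform source -> target
}
```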

In order to review and describe the state-of-the-art of the rigid registration approaches, we decided to classify them into coarse and fine methods.

Following the definition of Salvi et al. [14], coarse and fine registration differ mainly in the accuracy of the provided solution. Coarse registration aims at computing an initial estimation of the rigid motion between data points. The robustness of these methods varies widely and, in theory, their lower accuracy allows higher speed. Most coarse registration methods are iterative, despite the existence of linear approaches. It is important to highlight that many coarse approaches use a subset of the data (downsampling or keypoints) in order to reduce the computational cost.

Fine registration, on the other hand, is focused on providing the most accurate solution. These methods generally use a rough initial estimation to unify all views in a common coordinate system (avoiding falling into local minima) and then refine this initial solution.

In [14], a table is presented where important aspects of coarse and fine registration methods are classified: kind of correspondences, motion estimation, robustness and registration strategy.

This classification is used in [14,15], but many others can be found in the literature, such as dense/sparse, intrinsic/extrinsic, etc. Regardless of the classification, most registration methods use a hybrid approach: they first coarsely register, pre-aligning the data into a global coordinate system, and then refine the result using fine registration methods. In order to analyze the performance and accuracy of the reviewed methods, we have divided the experiments into two categories. The first one is scene reconstruction, often used to perform map building. The mapping problem consists of registering the point sets obtained by the robot at different positions in order to get a map of the environment around the robot. The second one is object reconstruction, very similar to the previous one, but focused on reconstructing objects, such as sculptures, tools, plants, etc. We will carry out a detailed experimental evaluation to determine the best methods to solve both problems. This paper does not intend to be an exhaustive review of the state-of-the-art, but a comparative study among different approaches to estimate registration, tested in two important and representative applications in which registration is used as a part of the general process: scene mapping and object reconstruction. These applications cover most of the expected requirements for the output of a registration process. Specifically, they allow one to reach important conclusions about the performance of registration methods using RGB-D sensors.

The remainder of this paper is organized as follows: Section 2 presents the review of the state-of-the-art, presenting first the registration problem. Then, the two frameworks used to make the comparison (scene and object reconstruction) and the metrics used to compare them are presented in Section 3. Results are presented and discussed in Section 4. Finally, conclusions are drawn in Section 5.

3. Comparison Framework

In this section, we present the comparison framework, including the considered scenarios (Section 3.1), to evaluate some of the aforementioned methodologies for different situations, and the metrics (Section 3.2) used to evaluate the results.

3.1. Considered Scenarios

In order to evaluate the different methodologies presented in the previous sections, we studied different scenarios that represent the wide range of registration problems arising in many applications, including full scene and small object reconstruction. We also present here the specific methods we are going to evaluate, which are the most relevant in this area.

3.1.1. Scene Reconstruction

In order to register large scenes, several RGB-D datasets have to be registered into a common coordinate system. Most scene mapping methods use a simultaneous localization and mapping (SLAM) [53–56] scheme in order to register consecutive RGB-D datasets. SLAM methods use a global rectification step in order to reduce the incremental error of the consecutive estimations.

We implemented some of the state-of-the-art registration methods in order to test and compare their results on representative scene RGB-D datasets. The methods tested are visual features, dense visual odometry and KinectFusion. The visual features method is a hybrid system, which uses the Features from Accelerated Segment Test (FAST) detector [57,58] and the Binary Robust Independent Elementary Features (BRIEF) descriptor [59] and then a RANSAC algorithm to estimate the correspondences and the transformation between the features of the scene and model datasets. In order to refine the final estimation, we also applied an iterative closest point algorithm. The FAST detector and the BRIEF descriptor are implemented in the OpenCV library ( http://opencv.org/). The RANSAC and ICP methods are implemented in the Point Cloud Library (PCL) ( http://pointclouds.org). We also implemented some variations of this method in order to see the difference of applying visual features or ICP individually to estimate the transformation. The dense visual odometry method is provided as a package in the Robot Operating System (ROS) ( http://vision.in.tum.de/data/software/dvo). Finally, the KinectFusion method is also provided by PCL. KinectFusion was not originally designed to register large scenarios, but PCL includes a modification of this algorithm to extract the model as a polygon mesh and update the model.
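As an illustration of the visual front end just described, the following sketch detects FAST keypoints, computes BRIEF descriptors and matches them with the Hamming distance using OpenCV (BRIEF lives in the opencv_contrib xfeatures2d module); the matched pixels would then be back-projected to 3D with the depth image and passed to the RANSAC and ICP stages. Function and parameter choices are ours, not the exact implementation evaluated here.

```cpp
// Sketch of a FAST + BRIEF front end for frame-to-frame matching (OpenCV).
#include <opencv2/features2d.hpp>
#include <opencv2/xfeatures2d.hpp>
#include <vector>

std::vector<cv::DMatch> matchFrames(const cv::Mat& rgbModel, const cv::Mat& rgbScene)
{
    auto detector  = cv::FastFeatureDetector::create(/*threshold=*/20);
    auto describer = cv::xfeatures2d::BriefDescriptorExtractor::create();

    std::vector<cv::KeyPoint> kpModel, kpScene;
    cv::Mat descModel, descScene;
    detector->detect(rgbModel, kpModel);
    detector->detect(rgbScene, kpScene);
    describer->compute(rgbModel, kpModel, descModel);
    describer->compute(rgbScene, kpScene, descScene);

    // BRIEF is a binary descriptor, so the Hamming distance is the natural metric.
    cv::BFMatcher matcher(cv::NORM_HAMMING, /*crossCheck=*/true);
    std::vector<cv::DMatch> matches;
    matcher.match(descModel, descScene, matches);
    return matches;
}
```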

In order to test the implemented scene mapping systems on large scenarios, we used the Technische Universität München (TUM) RGB-D dataset [60]. This dataset provides RGB-D and ground-truth data with the goal of evaluating the visual odometry and visual SLAM systems. The dataset contains the color and depth images of a Microsoft Kinect sensor along the ground-truth trajectory of the sensor. It provides images at a full frame rate (30 Hz) and sensor resolution (640 × 480). The ground-truth trajectory was obtained from a high-accuracy motion-capture system with eight high-speed tracking cameras (100 Hz). Further, it provides the accelerometer data from the Kinect.
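For reference, the ground-truth trajectory is distributed as a plain-text file with one "timestamp tx ty tz qx qy qz qw" line per pose (comment lines start with '#'). A minimal reader could look as follows; the struct and function names are ours, only the file layout follows the dataset documentation:

```cpp
// Sketch of a reader for the TUM RGB-D ground-truth trajectory file.
#include <Eigen/Geometry>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

struct Pose { double stamp; Eigen::Vector3d t; Eigen::Quaterniond q; };

std::vector<Pose> loadTumTrajectory(const std::string& path)
{
    std::vector<Pose> poses;
    std::ifstream file(path);
    std::string line;
    while (std::getline(file, line)) {
        if (line.empty() || line[0] == '#') continue;   // skip header/comments
        std::istringstream ss(line);
        Pose p; double qx, qy, qz, qw;
        ss >> p.stamp >> p.t.x() >> p.t.y() >> p.t.z() >> qx >> qy >> qz >> qw;
        p.q = Eigen::Quaterniond(qw, qx, qy, qz);       // Eigen expects (w, x, y, z)
        poses.push_back(p);
    }
    return poses;
}
```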

This original dataset contains 39 sequences recorded in two different scenarios. The “fr1” datasets are recorded in a typical office environment, and the “fr2” datasets are recorded in a large industrial hall. Furthermore, some sequences are recorded using a hand-held Kinect and the rest using a Kinect mounted on a wheeled robot. Later, this dataset was extended with more sequences in order to test scenarios with different texture and structure appearances or scenes with dynamic objects. Figure 5 shows an example of the first 200 ground-truth positions of the camera as a yellow line and the reconstructed map.

Table 1 shows the average translational and rotational velocities of the different datasets. We observe that some datasets, like “fr1 xyz”, “fr2 xyz”, “fr2 desk” or all of “fr3”, have slow velocities. Other datasets, like “fr1 desk” and “fr1 desk2”, have high translational velocities, so the movement between frames is larger. The “fr1 360” dataset has a slow translational velocity, but its high rotational movement also influences the registration results.

The authors concluded from their calibration measures that the relative error of the Kinect camera position on a frame-to-frame basis in the ground-truth data is lower than 1 mm.

3.1.2. Small Object Reconstruction

The problem of object reconstruction is a well-known topic in computer vision [61], but with the appearance of low-cost RGB-D sensors, a wide variety of approaches have been proposed. Object reconstruction ranges from objects with big volumes, such as chairs or tables, to small and intricate ones, like plants, tools, etc. Relative to the size of the object, the resolution of the RGB-D device is an important aspect that affects the acquisition and, hence, the registration and reconstruction. The depth resolution expresses the minimum difference in depth that the camera is able to distinguish. The resolution is affected by the noise of the data: as the level of noise increases, the performance of the registration methods decreases.

Regarding big elements, traditional algorithms can be used for registering the views, because each view has a large number of object points. Moreover, this sort of object is fairly well described with visual and 3D features, making the pre-alignment easier by using RANSAC techniques for a coarse registration. Nevertheless, small objects have different aspects to be considered, i.e., size and geometry. Related to the size, object reconstruction is performed with a subset of the points. Irrespective of the size of the object, RGB-D sensors only work properly in a certain range of depth. Thus, a minimum distance has to be preserved, and the object will only be represented with a part of the possible data. Hence, the smaller the object, the less data is available.

Another important issue is the object surface geometry. Due to the technique used by low-cost RGB-D sensors, smooth surfaces are better estimated than rough ones. In scene mapping, the signal-to-noise ratio (SNR) is high due to the fact that scenes usually have big smooth surfaces (roof, floor, walls, etc.) that are well extracted, and then, the registration methods obtain a good transformation to align the views. However, objects do not always follow this kind of geometry. Normally, big objects have regions of smooth surfaces, except intricate ones, such as trees. Nevertheless, when the target is small, fewer smooth surfaces appear; therefore, the SNR decreases. Hence, traditional techniques cannot be applied directly to the point cloud.

In this section, five different registration techniques are tested for the reconstruction of objects acquired using RGB-D sensors:

  • Coarse registration (Section 2.1): a feature-based approach has been evaluated using RANSAC. The number of features in objects is small compared to the scene case. 3D features are time-consuming and, due to the noise of the RGB-D sensors, not usually reliable; hence, visual features are more often used. We use SIFT feature extraction and description, as it is one of the most used algorithms in object reconstruction. RANSAC is used to estimate the best transformation that registers the SIFT descriptors of two views.

  • Fine registration (Section 2.2): in this section, the registration is performed with the well-known iterative closest point. In particular, the Chen and Medioni ICP variant with edge rejection has been tested. In order to be able to use fine registration directly, several views, close to each other, are registered.

  • A combination of coarse and fine registration methods is applied to evaluate a common process of pre-alignment and refinement (a sketch of this pipeline is given after this list).

  • A well-known implementation for reconstruction with RGB-D sensors is KinectFusion [19,62]. This method was developed for environments, not for objects. It tends to smooth the objects by using a truncated signed distance function (TSDF) and a model of the scene, which makes shapes rounded or even disappear when they are too small.

  • The last presented method is the RGBDemo [21], which has been specifically developed for object reconstruction using RGB-D sensors. It uses color markers (ARToolKit markers) to make an initial coarse registration. Once the initial alignment is done, an ICP and then a subsampling process return the final result.
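The pre-alignment plus refinement pipeline mentioned in the list above can be sketched with PCL as follows: RANSAC keeps the correspondences consistent with a single rigid motion and yields a coarse transform, which then seeds the ICP refinement. The correspondences are assumed to come from matched SIFT keypoints back-projected to 3D; thresholds and iteration counts are illustrative, not the values used in our experiments.

```cpp
// Hedged sketch of coarse (RANSAC over correspondences) + fine (ICP) registration.
#include <pcl/point_types.h>
#include <pcl/point_cloud.h>
#include <pcl/correspondence.h>
#include <pcl/registration/correspondence_rejection_sample_consensus.h>
#include <pcl/registration/icp.h>

Eigen::Matrix4f coarseThenFine(const pcl::PointCloud<pcl::PointXYZ>::Ptr& source,
                               const pcl::PointCloud<pcl::PointXYZ>::Ptr& target,
                               const pcl::Correspondences& putative)
{
    // Coarse stage: keep only correspondences consistent with one rigid motion.
    pcl::registration::CorrespondenceRejectorSampleConsensus<pcl::PointXYZ> ransac;
    ransac.setInputSource(source);
    ransac.setInputTarget(target);
    ransac.setInlierThreshold(0.02);      // 2 cm, illustrative value
    ransac.setMaximumIterations(1000);
    pcl::Correspondences inliers;
    ransac.getRemainingCorrespondences(putative, inliers);
    Eigen::Matrix4f coarse = ransac.getBestTransformation();

    // Fine stage: ICP refinement seeded with the coarse estimate.
    pcl::IterativeClosestPoint<pcl::PointXYZ, pcl::PointXYZ> icp;
    icp.setInputSource(source);
    icp.setInputTarget(target);
    icp.setMaxCorrespondenceDistance(0.05);
    pcl::PointCloud<pcl::PointXYZ> aligned;
    icp.align(aligned, coarse);           // use the coarse transform as initial guess
    return icp.getFinalTransformation();
}
```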

Figure 6 shows different objects with specific features associated with their shapes, which affect the registration. The dataset has been created for this experimentation, due to the lack of a dataset of objects acquired using RGB-D sensors on which KinectFusion and RGBDemo results are reported. The objects have been acquired using a Microsoft Kinect and a turntable. Three hundred twenty views of each object have been acquired around it (about 1.13 degrees per step) in order to be able to use the fine registration method without pre-alignment. The sensor is placed one meter away from the objects, at an upper diagonal position, so that the RGBDemo markers remain visible. The first object, Figure 6a, is a Taz toy of 15 cm in height with a large variety of colors. The second object (Figure 6b) is a wooden cube of 8 cm, some of whose faces show a knot, while others are less varied in color. In the third column, a tool (Figure 6c) is presented. It is 30 cm long; the thinnest part is 0.5 cm, and the widest is 2.8 cm. The last object is a bomb toy, shown in Figure 6d, with a height of 8 cm and a width of 5 cm. It has different colors, thin parts, such as the white one on the top, a body with a smooth curve and a back part with a key attached.

A previous segmentation of the region of interest has been performed in order to isolate the object. A segmentation combining color and depth information has been used to extract the object from the scene for the ICP and RANSAC algorithms. The background of the scene has been carefully set up using a blue chroma. Moreover, the distance from the camera to the object is known beforehand. Regarding RGBDemo, the method uses a white floor with specific markers (printed on white paper) to localize the space where the object is placed. RGBDemo uses only that region in the registration process (markers for coarse registration and ICP for fine registration). However, KinectFusion works with the whole data supplied by the camera, which results in two different motions: moving parts (the object on the turntable) and static parts (the rest of the scene).
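A minimal sketch of such a color-plus-depth segmentation is shown below, assuming the known camera-object distance and a blue chroma background; the thresholds are placeholders, not the values actually used in our setup.

```cpp
// Illustrative depth pass-through + blue-chroma segmentation with PCL.
#include <pcl/point_types.h>
#include <pcl/point_cloud.h>
#include <pcl/filters/passthrough.h>

pcl::PointCloud<pcl::PointXYZRGB>::Ptr
segmentObject(const pcl::PointCloud<pcl::PointXYZRGB>::Ptr& cloud,
              float knownDepth /* meters */)
{
    // 1. Keep only points near the known object distance.
    pcl::PointCloud<pcl::PointXYZRGB>::Ptr nearObject(new pcl::PointCloud<pcl::PointXYZRGB>);
    pcl::PassThrough<pcl::PointXYZRGB> pass;
    pass.setInputCloud(cloud);
    pass.setFilterFieldName("z");
    pass.setFilterLimits(knownDepth - 0.25f, knownDepth + 0.25f);
    pass.filter(*nearObject);

    // 2. Discard the blue chroma background: blue clearly dominant over red and green.
    pcl::PointCloud<pcl::PointXYZRGB>::Ptr object(new pcl::PointCloud<pcl::PointXYZRGB>);
    for (const auto& p : nearObject->points) {
        bool isBackground = (p.b > 1.3f * p.r) && (p.b > 1.3f * p.g);
        if (!isBackground) object->points.push_back(p);
    }
    object->width = static_cast<uint32_t>(object->points.size());
    object->height = 1;          // unorganized cloud after filtering
    object->is_dense = false;
    return object;
}
```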

3.2. Metrics and Performance Measures

As previously mentioned, we used the TUM RGB-D dataset [60] for scene reconstruction. This dataset proposes some evaluation measures based on the comparison of the estimated trajectories of the camera and the ground-truth ones.

The relative pose error measures the local accuracy of the trajectory over a fixed time interval, Δ. Therefore, the relative pose error corresponds to the drift of the trajectory, which is useful for the evaluation of visual odometry systems. The dataset authors define the relative pose error at time step i as:

$E_i := \left(Q_i^{-1} Q_{i+\Delta}\right)^{-1} \left(P_i^{-1} P_{i+\Delta}\right)$

where $Q_1, \ldots, Q_n$ are the ground-truth poses and $P_1, \ldots, P_n$ the estimated ones.

From a sequence of n camera poses, we obtain in this way m = n − Δ individual relative pose errors along the sequence. From these errors, they propose to compute the root mean squared error (RMSE) over all time indices of the translational component as:

$\mathrm{RMSE}(E_{1:n}, \Delta) := \left( \frac{1}{m} \sum_{i=1}^{m} \left\lVert \operatorname{trans}(E_i) \right\rVert^{2} \right)^{1/2}$

where trans($E_i$) refers to the translational component of the relative pose error, $E_i$. The time parameter, Δ, needs to be chosen. For visual odometry systems that match consecutive frames, Δ = 1 is an intuitive choice; $\mathrm{RMSE}(E_{1:n}, 1)$ then gives the drift per frame, which we will use to measure the quality of the implemented systems.
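The two formulas above translate directly into code. The following Eigen-based sketch (our own function, with poses stored as 4 × 4 homogeneous matrices, Q the ground truth and P the estimates, both already time-aligned) computes the translational RPE RMSE:

```cpp
// Translational relative pose error (RPE) RMSE, as defined above.
#include <Eigen/Dense>
#include <cmath>
#include <vector>

double translationalRpeRmse(const std::vector<Eigen::Matrix4d>& Q,
                            const std::vector<Eigen::Matrix4d>& P,
                            std::size_t delta = 1)
{
    const std::size_t m = Q.size() - delta;            // number of relative pose errors
    double sumSq = 0.0;
    for (std::size_t i = 0; i < m; ++i) {
        // E_i = (Q_i^-1 Q_{i+delta})^-1 (P_i^-1 P_{i+delta})
        Eigen::Matrix4d gtRel  = Q[i].inverse() * Q[i + delta];
        Eigen::Matrix4d estRel = P[i].inverse() * P[i + delta];
        Eigen::Matrix4d E = gtRel.inverse() * estRel;
        sumSq += E.block<3, 1>(0, 3).squaredNorm();     // trans(E_i)
    }
    return std::sqrt(sumSq / static_cast<double>(m));   // RMSE over all time indices
}
```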

For object reconstruction, we will use a visual appearance analysis, due to the non-existence of a common dataset and its corresponding ground-truth.

4. Results and Discussion

In this section, we present the results of the experiments done for both scene and object reconstruction. For each part, we discuss the obtained results.

4.1. Scene Reconstruction

We have performed two different experiments to evaluate the scene registration methods. In the first one, we analyze the results of each method on the different scene sequences. In order to improve clarity, some y-axes are trimmed, because there are some high error values that represent a total misalignment or registration failure. Moreover, some graphs are not complete, since the implementation fails and is not able to recover or register the following frames. The blue lines represent the translational error. The ground-truth translational magnitude (in meters) is included in the following graphics as red lines and reflects the relative error with respect to the real translation. Following the considerations of the dataset authors' evaluation method, we do not show the rotational error because the camera is in continuous motion; an error in the rotation estimation also shows up as a translational error.

Figure 7 shows the results of the KinectFusion implementation. We observe that KinectFusion fails on the “fr1 desk” and “fr1 360” datasets. The rest of the results look quite smooth, despite some high errors. Some of these datasets have parts where there is a lack of geometry, so the method gets a high error or gets lost. According to the real relative movement of the camera, represented with red lines, the “fr1 xyz” dataset has continuous changes in velocity and direction. In the “fr2 desk” dataset, the camera describes sudden, abrupt movements. These movements are caused by a lack of ground-truth information, so the distance between “consecutive” frames is relatively high.

Dense visual odometry (DVO) results are shown in Figure 8. We observe that DVO has more variability in “fr1 xyz”, caused by sudden direction changes. DVO gives good results on the datasets with smooth movements, like “fr2 desk”. The “fr1 360” dataset presents high errors caused by the high rotational camera motion. In general, the DVO method works better than KinectFusion, except on “fr2 xyz”.

The visual features with ICP refinement results, shown in Figure 9, have mostly the same or higher errors than those of DVO and KinectFusion. There are no relatively high errors on most of the datasets. The “fr2 desk” dataset shows a higher error in one of the last frames (Frame 659). The “fr2 xyz” dataset presents a high error due to the error introduced by the localization of visual features, which is not corrected by the ICP refinement.

The results of the visual features method (Figure 10) show a similar structure to the visual + ICP method, but the errors are higher in almost all cases. We can now observe the difference made by the ICP refinement step: in general, the visual features method without ICP refinement has higher errors.

Figure 11 shows the results of the iterative closest point method. We observe errors similar to those obtained with the visual features method, but there are also some high errors, like at the beginning of “fr1 desk2” and in “fr2 desk”. As expected, ICP gets better results than visual features on the datasets with less movement, since the frames to register are initially very close. In the “fr1 360” dataset, with high rotational movements, ICP gets several high errors. In general, we can observe that visual features with ICP refinement works better than both methods individually. The implementations of the visual features and ICP methods have no correction step, which is why high errors are obtained that make no sense for a particular application, such as scene reconstruction. These errors could be detected and corrected or discarded by a simple movement boundary limitation, like the one that KinectFusion and DVO use.

To summarize, Figure 12 shows the average errors and the standard deviation of the five tested systems on the different datasets. We observe that the datasets with lower velocities (“fr1 xyz” and “fr2 xyz”) have lower error values. In “fr1 360”, we see the influence of the rotational velocity on the registration methods. Despite its low translational velocity (0.21 m/s), its rotational velocity increases the error values of the methods, as can be observed in the error bars and their standard deviation. In general, we observe the improvement of the visual features method with the ICP refinement. This improvement is particularly noticeable in the “fr1 360” dataset. We conclude that the dense visual odometry method is one of the most robust methods, and it provides low errors.

Finally, Figure 13 shows the error in the camera pose of the KinectFusion, DVO and visual features + ICP methods over the “fr1 desk2” dataset. Due to the incremental estimation of the camera pose, a high error in one estimation can lead to a totally misaligned trajectory. This is the case of the KinectFusion. When a high error is detected, an additional method should be implemented in order to discard the frame and use the previous correct position.

In the second experiment, we focus on the analysis of the results of the three main tested methods in scenarios with special features. The TUM dataset has a special “fr3” set of scenes where different combinations of texture and geometry appearance are presented. We use four different combinations of texture and structure appearance. The first column of Table 2 shows some of the used scenes. Table 2a represents a scene with detailed texture and geometry appearance. Table 2b shows an empty and white floor, so it represents a non-texture and non-structure dataset. Table 2c only has some posters on a wall, so it represents the texture and non-structure situation. The last dataset, Table 2d, represents the non-texture and structure situation.

To improve the interpretation of the results, we did not include the ICP and visual feature methods in this experiment; the first experiment already showed that the combined features with ICP refinement worked better than the two methods separately.

Table 2 shows the relative pose translational errors of the three tested methods. The results show a generally poor registration, since most of the errors are higher than the real translational movement. However, the errors are mostly lower than 2 cm. Despite the simplicity of the scenes, we observe a high error in most of the scenes with the different methods. In the datasets with a lack of structure information, Table 2b,c, the results of the three methods are similar. In general, we observe that the KinectFusion method gets the best results, with the exception of the dataset with texture but no structure information, where the visual features with ICP refinement gets the best results.

Figure 14 shows the average errors and the standard deviation of the three tested methods on the different datasets. We can observe that the dense visual odometry method obtains the largest errors on these datasets. The KinectFusion method (Kinfu) gets the best results on the datasets with geometry information and results closest to the best on the other two datasets.

4.2. Small Object Reconstruction

For small object reconstruction, we have carried out an experiment applying some representative methods to the objects in Figure 6. Figure 15 shows the results of the different registration algorithms. The first row shows the coarse registration results; the second one presents the ICP results. In the third, a combination of coarse pre-alignment and fine registration using the RANSAC and ICP methods is shown. The fourth row presents the KinectFusion results. Lastly, Figure 15q–t shows the RGBDemo results.

Figure 15a–d shows the coarse registration of the objects with RANSAC and SIFT visual features. The Taz toy in Figure 15a has been well registered in the front part, where a large variety of colors appears; however, the error in the back part spoils the final result. The cube in Figure 15b shows some views that are bent, caused by features on the top being matched with similar ones on the lateral part, producing a bad registration. In Figure 15c,d, the registration error is caused by these shapes having few features, so it is not possible to register them properly.

The fine registration ICP applied to the test objects (Figure 15e–h) shows that, in general, the results are better than in coarse registration, but still with a considerable error. These errors are caused by the low SNR, mainly in thin or sharp parts, where the RGB-D sensor cannot return the depth information accurately.

The registration results of combining pre-alignment using RANSAC and refinement with ICP (Figure 15i–l) show that, if the pre-alignment is wrong, ICP cannot return a proper registration.

The KinectFusion results show that it does not work properly for object reconstruction. For example, in the cube in Figure 15n, the edges of the shape are rounded. In addition, thin parts close to each other are joined, making them distorted. This method uses only 3D information; hence, for objects such as the bomb toy, where the shape is rounded, the method cannot put the different views in the right place. Finally, the tool (Figure 15o) is joined to the floor, due to its thin geometry. Figure 16 shows different moments of the reconstruction, where, step by step, the algorithm mixes the object and the floor.

The RGBDemo results are presented in Figure 15q–t. The Taz toy (Figure 15q) and the cube (Figure 15r) are well registered. The cube has rounded edges due to the subsampling. The tool in Figure 15s is well registered in the thickest part, but the thin part disappeared, due to several factors. One is the subsampling, but the main reason is the way in which low-cost RGB-D sensors recover the depth information. They use a correlation window ( http://wiki.ros.org/kinect calibration/technical) to estimate the depth. If the 9 × 7 speckle correlation window (they use a speckle pattern) contains more points on the floor than on the object, the floor depth information dominates. Figure 17 shows the color (Figure 17a), depth (Figure 17b) and infrared (Figure 17c) images of the tool acquired with the RGB-D sensor. The depth image has no information, because the speckles that reach the object are not enough to estimate the depth. The bomb toy in Figure 15t is properly registered at the top part, but the lower part of the body is incomplete. This is produced by self-occlusion: RGBDemo uses visual markers, which always have to be visible. If the object has to be fully reconstructed, it should be acquired from different positions and, finally, all the registration results aligned together.

Considering the four objects that have been selected to evaluate different aspects of the registration methods, the analysis shows that, as the Taz toy is the largest object and has several colors, enough visual features can be found for the RANSAC algorithm. It also has a varied geometry, which allows ICP-based methods to align the views properly. Despite these features, the traditional methods (RANSAC, RANSAC + ICP and ICP) cannot achieve an accurate final result, due to the cumulative error between views, which produces a wrong final closure. KinectFusion obtains a reasonable result, despite the fact that it joins several parts, due to the smoothing characteristics of the method.

The cube is a low-textured object, which makes feature extraction difficult on some of its faces. Thus, both methods based on RANSAC cannot achieve good results. On the other hand, despite the geometry being simple, the RGB-D sensors provide noisy data, mainly at edges, which makes it difficult to estimate the point cloud correspondences and the general transformation between two consecutive views. For that reason, ICP cannot obtain a proper final registration. KinectFusion tends to smooth surfaces, because it uses a model of the scene and a TSDF method for point-to-model correspondence estimation, which produces the rounded edges.

The tool has been selected due to its thin and low-textured shape. Neither enough features can be extracted, nor does its geometry provide cues suitable for traditional algorithms. Thus, RANSAC and ICP cannot register the views. KinectFusion does not work properly due to the aforementioned smoothing characteristics of the method, producing an inaccurate final reconstruction.

Finally, the bomb toy is a small object with different areas: some parts have color features and others a varied geometry. These features produce good results for RANSAC where the featured parts are registered, but it fails in the rest. On the other hand, ICP works properly with the back part, where a key is attached to the object, but fails when the round part of the bomb is aligned. KinectFusion cannot achieve a good result due to the lack of geometry.

Different conclusions can be extracted by analyzing the above experimental results. In general, regarding the size of the object, all methods find more difficulties in registering views of small objects, because either few features appear or the noise becomes more relevant. That is the reason why the best result in general terms is achieved by the RGBDemo algorithm, which uses external visual markers to help the registration. The coarse alignment is based on these visual markers, so the lack of texture does not affect it, and the fine registration works properly because of the good coarse pre-alignment. For thin and small objects, such as the tool and the bomb toy, a slight filtering process should be used in order to avoid over-rejection of data, since those objects are originally represented with few data points. Although RGBDemo presents good results, there is still work to do in order to allow different angles of the sensor with respect to the object to minimize self-occlusions. There exist new proposals focused on this problem, such as [63], where a model-based multi-view registration method with 3D markers is presented to allow the registration of objects.

5. Conclusions

In this paper, we have first made a description of different registration methods using low-cost RGB-D sensors, dividing them into coarse and fine methods. This classification facilitates the description of the main features of the algorithms. Then, we have developed an experimental validation of different registration methods in two representative applications: scene and object reconstruction.

For scene registration, we have tested five different registration methods and quantitatively measured the error in the pose estimation using a state-of-the-art RGB-D dataset for visual odometry and SLAM systems. Using the evaluation measures and tools provided by the dataset, we analyzed the results of the tested methods, which are the KinectFusion, the dense visual odometry, the ICP, a visual feature-based method and visual features with the ICP refinement method. Results showed that the DVO method gets the lowest registration error, and it is the most robust. KinectFusion does not work properly with datasets where frames have a lack of geometry.

For object registration, we have also tested five different registration methods and qualitatively measured the error using a new RGB-D dataset created for this evaluation; specifically, a visual feature-based method using SIFT and RANSAC, the ICP variant of Chen and Medioni, the visual features with ICP refinement method and the newer methods, KinectFusion and RGBDemo. KinectFusion is one of the most used methods for object reconstruction. However, RGBDemo is a very different approach, due to the use of ARToolKit markers. Results show that the traditional algorithms did not provide accurate results, while visual marker-based methods obtain better registrations. Different areas of research still remain a challenge in this topic, such as better techniques focused on registering objects acquired with low-cost RGB-D sensors.

As future work, we plan to continue comparing new methods and analyzing new RGB-D sensors for these problems. Moreover, quantitative techniques to evaluate the object registration error need to be developed, since visual inspection is currently the most used way to assess object registration and reconstruction.

Acknowledgments

This work has been supported by a grant from the Spanish Government, DPI2013-40534-R, the University of Alicante project GRE11-01 and a grant from the Valencian Government, GV/2013/005.

Conflict of Interest

The authors declare no conflict of interest.

References

  1. Yang, F.; Ding, M.; Zhang, X.; Wu, Y.; Hu, J. Two Phase Non-Rigid Multi-Modal Image Registration Using Weber Local Descriptor-Based Similarity Metrics and Normalized Mutual Information. Sensors 2013, 13, 7599–7617. [Google Scholar]
  2. Duan, L.; Guan, T.; Yang, B. Registration Combining Wide and Narrow Baseline Feature Tracking Techniques for Markerless AR Systems. Sensors 2009, 9, 10097–10116. [Google Scholar]
  3. Cazorla, M.; Viejo, D.; Pomares, C. Study of the SR4000 camera. Proceedings of the XI Workshop of Physical Agents, Valencia, Spain, September 2010.
  4. Lowe, D.G. Distinctive Image Features from Scale-Invariant Keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar]
  5. Khoshelham, K.; Elberink, E.O. Accuracy and resolution of kinect depth data for indoor mapping applications. Sensors 2012, 2, 1437–1454. [Google Scholar]
  6. Zitová, B.; Flusser, J. Image registration methods: A survey. Image Vis. Comput. 2003, 21, 977–1000. [Google Scholar]
  7. Tam, G.; Cheng, Z.Q.; Lai, Y.K.; Langbein, F.; Liu, Y.; Marshall, D.; Martin, R.; Sun, X.F.; Rosin, P. Registration of 3D Point Clouds and Meshes: A Survey from Rigid to Nonrigid. IEEE Trans. Vis. Comput. Gr. 2013, 19, 1199–1217. [Google Scholar]
  8. Rusinkiewicz, S.; Levoy, M. Efficient variants of the ICP algorithm. Proceedings of the 3rd International Conference on 3-D Digital Imaging and Modeling, Quebec City, QC, Canada, 28 May–1 June 2001; pp. 145–152.
  9. Pomerleau, F.; Colas, F.; Siegwart, R.; Magnenat, S. Comparing ICP variants on real-world data sets. Auton. Robots 2013, 34, 133–148. [Google Scholar]
  10. Fischler, M.A.; Bolles, R.C. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 1981, 24, 381–395. [Google Scholar]
  11. Hetzel, G.; Leibe, B.; Levi, P.; Schiele, B. 3D object recognition from range images using local feature histograms. Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Kauai, HI, USA, 8–14 December 2001; Volume 2, pp. 394–399.
  12. Chen, Y.; Medioni, G. Object modeling by registration of multiple range images. Proceedings of 1991 IEEE International Conference on Robotics and Automation, Sacramento, CA, USA, 9–11 April 1991; Volume 3, pp. 2724–2729.
  13. Besl, P.; McKay, N. A method for registration of 3-D shapes. IEEE Trans. Pattern Anal. Mach. Intell. 1992, 14, 239–256. [Google Scholar]
  14. Salvi, J.; Matabosch, C.; Fofi, D.; Forest, J. A review of recent range image registration methods with accuracy evaluation. Image Vis. Comput. 2007, 25, 578–596. [Google Scholar]
  15. Campbell, R.J.; Flynn, P.J. A Survey of Free-form Object Representation and Recognition Techniques. Comput. Vis. Image Underst. 2001, 81, 166–210. [Google Scholar]
  16. Crum, W.R.; Hartkens, T.; Hill, D.L.G. Non-rigid image registration: Theory and practice. Br. J. Radiol. 2004, 77, S140–S153. [Google Scholar]
  17. Henry, P.; Krainin, M.; Herbst, E.; Ren, X.; Fox, D. RGB-D mapping: Using depth cameras for dense 3D modeling of indoor environments. In Experimental Robotics; Springer: Berlin Heidelberg, Germany, 2014; pp. 477–491. [Google Scholar]
  18. Raguram, R.; Frahm, J.M.; Pollefeys, M. A Comparative Analysis of RANSAC Techniques Leading to Adaptive Real-Time Random Sample Consensus. Proceedings of the 10th European Conference on Computer Vision, Marseille, France, 12–18 October 2008; pp. 500–513.
  19. Izadi, S.; Kim, D.; Hilliges, O.; Molyneaux, D.; Newcombe, R.A.; Kohli, P.; Shotton, J.; Hodges, S.; Freeman, D.; Davison, A.J.; et al. KinectFusion: Real-time 3D reconstruction and interaction using a moving depth camera. Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology, Honolulu, HI, USA; 2011; pp. 559–568. [Google Scholar]
  20. Steinbrucker, F.; Sturm, J.; Cremers, D. Real-time visual odometry from dense RGB-D images. Proceedings of 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), Barcelona, Spain, 6–13 November 2011; pp. 719–722.
  21. Kramer, J.; Burrus, N.; Echtler, F.; Parker, M.; Herrera, C.D. Object Modeling and Detection. In Hacking the Kinect; Apress: New York, NY, USA, 2012; Chapter 9; pp. 173–206. [Google Scholar]
  22. Bay, H.; Ess, A.; Tuytelaars, T.; Gool, L.V. Speeded-up Robust Features (SURF). Comput. Vis. Image Underst. 2008, 110, 346–359. [Google Scholar]
  23. Gil, A.; Mozos, O.; Ballesta, M.; Reinoso, O. A comparative evaluation of interest point detectors and local descriptors for visual SLAM. Mach. Vis. Appl. 2010, 21, 905–920. [Google Scholar]
  24. Głomb, P. Detection of Interest Points on 3D Data: Extending the Harris Operator. In Computer Recognition Systems 3; Kurzynski, M., Wozniak, M., Eds.; Springer: Berlin Heidelberg, Germany, 2009; Volume 57, pp. 103–111. [Google Scholar]
  25. Rusu, R.; Blodow, N.; Beetz, M. Fast Point Feature Histograms (FPFH) for 3D registration. Proceedings of IEEE International Conference on Robotics and Automation, Kobe, Japan, 12–17 May 2009; pp. 3212–3217.
  26. Johnson, A. Spin-Images: A Representation for 3-D Surface Matching. PhD Thesis, Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, USA, 1997. [Google Scholar]
  27. Viejo, D.; Cazorla, M. 3D Model Based Map Building. Proceedings of the IX Workshop of Physical Agents (Workshop de Agentes Físicos), Vigo, Spain, 11–12 September 2008.
  28. Viejo, D.; Cazorla, M. A robust and fast method for 6DoF motion estimation from generalized 3D data. Auton. Robots 2014, 36, 295–308. [Google Scholar]
  29. Koser, K.; Koch, R. Perspectively Invariant Normal Features. Proceedings of IEEE 11th International Conference on Computer Vision, Rio de Janeiro, Brazil, 14–20 October 2007; pp. 1–8.
  30. Wu, C.; Clipp, B.; Li, X.; Frahm, J.M.; Pollefeys, M. 3D model matching with Viewpoint-Invariant Patches (VIP). Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AL, USA, 24–26 June 2008; pp. 1–8.
  31. Zeisl, B.; Köser, K.; Pollefeys, M. Automatic Registration of RGBD Scans via Salient Directions. Proceedings of IEEE 16th International Conference on Computer Vision, Sydney, Australia, 3–6 December 2013.
  32. Brunnstrom, K.; Eklundh, J.; Uhlin, T. Active Fixation for Scene Exploration. Int. J. Comput. Vis. 1996, 17. [Google Scholar] [CrossRef]
  33. Stückler, J.; Behnke, S. Model Learning and Real-Time Tracking Using Multi-Resolution Surfel Maps. Proceedings of the 26th AAAI Conference on Artificial Intelligence, Toronto, ON, Canada, 22–26 July 2012.
  34. Lucas, B.D.; Kanade, T. An Iterative Image Registration Technique with an Application to Stereo Vision. Proceedings of the 7th International Joint Conference on Artificial Intelligence, San Francisco, CA, USA, 24–28 August 1981; Volume 2, pp. 674–679.
  35. Druon, S.; Aldon, M.J.; Crosnier, A. Color Constrained ICP for Registration of Large Unstructured 3D Color Data Sets. Proceedings of 2006 IEEE International Conference on Information Acquisition, Shandong, China, 20–23 August 2006; pp. 249–255.
  36. Kerl, C.; Sturm, J.; Cremers, D. Robust Odometry Estimation for RGB-D Cameras. Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Karlsruhe, Germany, 6–10 May 2013.
  37. Turk, G.; Levoy, M. Zippered polygon meshes from range images. Proceedings of the 21st Annual Conference on Computer Graphics and Interactive Techniques, Orlando, FL, USA, 24–29 July 1994; pp. 311–318.
  38. Masuda, T.; Sakaue, K.; Yokoya, N. Registration and Integration of Multiple Range Images for 3-D Model Construction. Proceedings of the 1996 International Conference on Pattern Recognition (ICPR ’96), Vienna, Austria, 29 August 1996; Volume 1, p. 879.
  39. Weik, S. Registration of 3-D partial surface models using luminance and depth information. Proceedings of International Conference on Recent Advances in 3-D Digital Imaging and Modeling, Ottawa, ON, Canada, 12–15 May 1997; pp. 93–100.
  40. Simon, D.A. Fast and Accurate Shape-Based Registration. PhD Thesis, Carnegie Mellon University, Pittsburgh, PA, USA, 1996. [Google Scholar]
  41. Godin, G.; Rioux, M.; Baribeau, R. Three-dimensional registration using range and intensity information. Proceedings of Photonics for Industrial Applications. International Society for Optics and Photonics, Boston, Massachusetts, 31 October–4 November 1994; pp. 279–290.
  42. Pulli, K. Multiview registration for large data sets. Proceedings of the 2nd International Conference on 3-D Digital Imaging and Modeling, Ottawa, ON, Canada, 4–8 October 1999; pp. 160–168.
  43. Zinsser, T.; Schmidt, J.; Niemann, H. A refined ICP algorithm for robust 3-D correspondence estimation. Proceedings of 2003 International Conference on Image Processing, Barcelona, Spain, 14–18 September 2003; pp. 695–698.
  44. Zhang, L.; Choi, S.I.; Park, S.Y. Robust ICP Registration Using Biunique Correspondence. Proceedings of 2011 International Conference on 3D Imaging, Modeling, Processing, Visualization and Transmission (3DIMPVT), Hangzhou, China, 16–19 May 2011; pp. 80–85.
  45. Curless, B.; Levoy, M. A Volumetric Method for Building Complex Models from Range Images. Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, New Orleans, LA, USA, 4–9 August 1996; pp. 303–312.
  46. Whelan, T.; Kaess, M.; Fallon, M.; Johannsson, H.; Leonard, J.; McDonald, J. Kintinuous: Spatially Extended KinectFusion. Proceedings of RSS Workshop on RGB-D: Advanced Reasoning with Depth Cameras, Sydney, Australia, 9–10 July 2012.
  47. Whelan, T.; Johannsson, H.; Kaess, M.; Leonard, J.; McDonald, J. Robust Real-Time Visual Odometry for Dense RGB-D Mapping. Proceedings of IEEE International Conference on Robotics and Automation, Karlsruhe, Germany, 6–10 May 2013.
  48. Huang, A.S.; Bachrach, A.; Henry, P.; Krainin, M.; Maturana, D.; Fox, D.; Roy, N. Visual Odometry and Mapping for Autonomous Flight Using an RGB-D Camera. Proceedings of International Symposium on Robotics Research (ISRR), Flagstaff, AZ, USA, 28 August–1 September 2011.
  49. Whelan, T.; Kaess, M.; Leonard, J.; McDonald, J. Deformation-based Loop Closure for Large Scale Dense RGB-D SLAM. Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Tokyo, Japan, 3–8 November 2013.
  50. Koch, R. Dynamic 3D Scene Analysis through Synthesis Feedback Control. IEEE Trans. Pattern Anal. Mach. Intell. 1993, 15, 556–568. [Google Scholar]
  51. Comport, A.; Malis, E.; Rives, P. Accurate Quadrifocal Tracking for Robust 3D Visual Odometry. Proceedings of 2007 IEEE International Conference on Robotics and Automation, Roma, Italy, 10–14 April 2007; pp. 40–45.
  52. Audras, C.; Comport, A.; Meilland, M.; Rives, P. Real-time dense RGB-D localisation and mapping. Proceedings of Australian Conference on Robotics and Automation, Melbourne, Australia, 7–9 December 2011.
  53. Dissanayake, M.; Newman, P.; Clark, S.; Durrant-Whyte, H.; Csorba, M. A solution to the simultaneous localization and map building (SLAM) problem. IEEE Trans. Robot. Autom. 2001, 17, 229–241. [Google Scholar]
  54. Endres, F.; Hess, J.; Engelhard, N.; Sturm, J.; Cremers, D.; Burgard, W. Evaluation of the RGB-D SLAM System. Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), St. Paul, MN, USA, 14–18 May 2012.
  55. Chang, P.; Shen, J.; Cheung, S.C. A Robust RGB-D SLAM System for 3D Environment with Planar Surfaces. Proceedings of the IEEE International Conference on Image Processing, Melbourne, Australia, 15–18 September 2013.
  56. Shen, J.; Su, P.C.; Cheung, S.; Zhao, J. Virtual Mirror Rendering With Stationary RGB-D Cameras and Stored 3-D Background. IEEE Trans. Image Process. 2013, 22, 3433–3448. [Google Scholar]
  57. Rosten, E.; Drummond, T. Fusing points and lines for high performance tracking. Proceedings of IEEE International Conference on Computer Vision, Beijing, China, 17–20 October 2005; Volume 2, pp. 1508–1511.
  58. Rosten, E.; Drummond, T. Machine learning for high-speed corner detection. Proceedings of European Conference on Computer Vision, Graz, Austria, 7–13 May 2006; Volume 1, pp. 430–443.
  59. Calonder, M.; Lepetit, V.; Ozuysal, M.; Trzcinski, T.; Strecha, C.; Fua, P. BRIEF: Computing a Local Binary Descriptor very Fast. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 34, 1281–1298. [Google Scholar]
  60. Sturm, J.; Engelhard, N.; Endres, F.; Burgard, W.; Cremers, D. A Benchmark for the Evaluation of RGB-D SLAM Systems. Proceedings of the International Conference on Intelligent Robot Systems (IROS), Vilamoura, Algarve, Portugal, 7–12 October 2012.
  61. Chane, C.S.; Schütze, R.; Boochs, F.; Marzani, F.S. Registration of 3D and Multispectral Data for the Study of Cultural Heritage Surfaces. Sensors 2013, 13, 1004–1020. [Google Scholar]
  62. Newcombe, R.A.; Izadi, S.; Hilliges, O.; Molyneaux, D.; Kim, D.; Davison, A.J.; Kohli, P.; Shotton, J.; Hodges, S.; Fitzgibbon, A.W. KinectFusion: Real-time dense surface mapping and tracking. Proceedings of 2011 10th IEEE International Symposium on Mixed and Augmented Reality (ISMAR), Basel, Switzerland, 26–29 October 2011; pp. 127–136.
  63. Saval-Calvo, M.; Azorin-Lopez, J.; Fuster-Guillo, A. Model-Based Multi-view Registration for RGB-D Sensors. Proceedings of the 12th International Work-Conference on Artificial Neural Networks (IWANN 2013), Tenerife, Spain, 12–14 June 2013; pp. 496–503.
