
This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

We present a novel approach for autonomous location estimation and navigation in indoor environments using range images and prior scene knowledge from a GIS database (CityGML). What makes this task challenging is the arbitrary relative spatial relation between the GIS model and the Time-of-Flight (ToF) range camera, further complicated by a markerless configuration. We propose to estimate the camera's pose solely by matching GIS objects to their detected locations in image sequences. We develop a coarse-to-fine matching strategy that is able to match point clouds without any initial parameters. Experiments with a state-of-the-art ToF point cloud show that our proposed method delivers an absolute camera position with decimeter accuracy, which is sufficient for many real-world applications (e.g., collision avoidance).

Although indoor positioning is a comparatively young research topic, it has become a major research field. Three-dimensional (3D) geometries of objects or scenes need to be captured for a variety of applications, such as scene mapping, robot navigation, and video surveillance, where physically deployed infrastructure should not be required during data acquisition in order to minimize costs. Several hundred approaches have been proposed in the past few years, as no single localization technology is able to cover indoor space the way Global Navigation Satellite Systems (GNSS) do under the open sky. In [

An alternative technique that is able to rapidly acquire large amounts of indoor depth data in a video-like fashion is range imaging (RIM). ToF range cameras measure depth information directly, without the need for stereo matching. State-of-the-art systems offer depth ranges of several tens of meters at centimeter to decimeter precision, which is largely sufficient for indoor applications.

We propose to estimate camera positions by matching range image sequences to already available GIS data. Since the relative orientations of objects and their absolute positions are known therein, we can use this information to pose our measuring device once newly acquired point clouds are accurately matched to the model. Such models have become widely available because various disciplines, like Computer Aided Architectural Design (CAAD)/BIM, Computer Graphics, and Geographic Information Systems (GIS), deal with 3D interior building models (e.g., IFC [

Our method combines absolute and relative orientation to achieve decimeter accuracy with a mono-camera system (ToF range camera), while requiring neither dedicated markers nor any other locally deployed infrastructure (e.g., Wi-Fi hot spots) inside the building [

In the following we first review works related to ours. After a conceptual overview of our approach, we provide a detailed methodological step-by-step explanation of the proposed point cloud matching and camera positioning procedure. Thereafter, experiments with a challenging dataset are described and discussed. Finally, we give conclusions and an outlook.

In [

In [

In [

Based on this background, we will present in this work how the transformation from acquired point clouds to an object model is realized. The main challenge of our approach is that the two datasets contain vastly different numbers of points. Furthermore, a 3D model needs to be matched to a 2.5D scan. The present system uses geometric information only; the semantic data that is the particular advantage of CityGML can be exploited in the future for room identification.

The basic concept is to position a ToF range camera indoors via matching of range image sequences to GIS building models (

The database needs to be given a priori by a Building Information Model (BIM). Such models are nowadays established during the construction phase of a building or measured by other sensors, e.g., laser scanners. Recall that we assume neither particular markers nor other locally deployed infrastructure inside the building. Although this assumption makes our method applicable to a wide range of objects and tasks, it makes sensor orientation and positioning challenging. Furthermore, the ToF range camera is used as a self-contained sensor; no additional information from other sensors such as IMUs or odometers is used. Positioning relies solely on the range camera and a given building model.

A matching procedure for different spatial datasets being somehow located apart from each other always consists of two main parts: First, a suitable transformation that is capable of mapping the input to the reference and second, a metric that measures the fitting quality between both datasets after each transformation. Such metric can also be interpreted as a score function where an optimal score is aimed at. The exterior camera orientation is determined by a Cartesian 3D coordinate transformation with three shift and three rotational parameters. Transformation parameters are improved iteratively, for example using gradient descent methods, and after each run the metric measures the quality of fit. The goal is to find those transformation parameters that lead to an optimal fit between the datasets with respect to the metric. In practice, this sequence of transformation and metric evaluation is repeated iteratively until convergence of the metric value.
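The two ingredients can be made concrete with a minimal sketch: a rigid-body transformation with rotation matrix R and translation vector t, and a root-mean-square (RMS) distance metric over corresponding points. The names and values below are purely illustrative and not taken from the paper's implementation, which builds on PCL.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <vector>

using Point = std::array<double, 3>;
using Matrix3 = std::array<std::array<double, 3>, 3>;

// Apply a rigid-body transformation: p' = R * p + t.
Point transformPoint(const Matrix3& R, const Point& t, const Point& p) {
    Point out{};
    for (int i = 0; i < 3; ++i)
        out[i] = R[i][0] * p[0] + R[i][1] * p[1] + R[i][2] * p[2] + t[i];
    return out;
}

// Metric: RMS distance between corresponding points of two clouds.
double rmsError(const std::vector<Point>& a, const std::vector<Point>& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double dx = a[i][0] - b[i][0];
        const double dy = a[i][1] - b[i][1];
        const double dz = a[i][2] - b[i][2];
        sum += dx * dx + dy * dy + dz * dz;
    }
    return std::sqrt(sum / a.size());
}
```

An optimizer would repeatedly adjust the six parameters behind R and t, re-evaluate the metric, and stop once the metric value converges.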

It should be noted that coordinates of the building model are usually given in an absolute national coordinate system (e.g., LV95 in Switzerland [

In a first step, the absolute coordinates of the GIS object model are reduced such that the camera is located inside the building of interest. To this end, the integer part of the first point of the object model is subtracted from all model points, thus accounting for the large offsets. This translation
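The offset reduction can be sketched as follows; the function name and the example coordinates are our own illustration, not from the paper.

```cpp
#include <array>
#include <cmath>
#include <vector>

using Point = std::array<double, 3>;

// Subtract the integer part of the first model point from all points,
// removing the large absolute offset of the national coordinate system.
std::vector<Point> reduceCoordinates(const std::vector<Point>& model) {
    const Point offset{std::floor(model[0][0]), std::floor(model[0][1]),
                       std::floor(model[0][2])};
    std::vector<Point> reduced;
    reduced.reserve(model.size());
    for (const auto& p : model)
        reduced.push_back({p[0] - offset[0], p[1] - offset[1], p[2] - offset[2]});
    return reduced;
}
```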
where R_{i} is a standard 3 × 3 rotation matrix, T_{i} is a 3D translation vector, and V_{i} is a noise vector [

Generally, it is essential to keep prior assumptions as relaxed as possible because they could potentially limit applicability in practice. In a real-world scenario any kind of orientation and positioning from one point cloud with respect to the other one inside a building is possible and our method has to cope with this situation. Therefore, we use a coarse-to-fine matching procedure. A first coarse matching is done without need for precise initial values.

After the initial translation

A point cloud matching method without the need for relatively precise initial transformation parameters is proposed in [

Because the point clouds contain vastly different numbers of points, the NDT algorithm might not converge to its best solution. Therefore, NDT is used to achieve a coarse registration, which is followed by a fine registration with Correspondence Grouping (CG) [

Because the ToF point cloud contains not only points of the object of interest, and the CG algorithm is not based on a voxel data structure like NDT, the ToF point cloud needs to be filtered. Plane model segmentation is a promising approach to remove, for example, the floor and walls. In [

After the first translation of the object model coordinate system into the camera coordinate system, the acquired point cloud is from then on transformed into the object model. Once the point clouds have been matched, the camera pose can be estimated at a user-specified interval (e.g., every 80 frames) by adding the translation vector

It has to be mentioned that the camera coordinate system of the MESA® ToF range camera is by construction not the one usually used in the literature on perspective camera projection models. It lies in the center of the lens, and X and Y are rotated around Z positively about

It is a “right-handed” camera coordinate system, with the X-coordinate increasing horizontally to the left, the Y-coordinate increasing vertically upwards, and the Z-coordinate increasing along the optical axis away from the camera.
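Assuming the axis convention quoted above (X left, Y up, Z along the optical axis) and the common computer-vision convention (X right, Y down, Z forward), the conversion amounts to negating X and Y. The helper below is our own sketch, not part of the SR4000 API:

```cpp
#include <array>

using Point = std::array<double, 3>;

// Convert a point from the MESA camera frame (X left, Y up, Z forward)
// to the common computer-vision frame (X right, Y down, Z forward).
// This corresponds to a 180-degree rotation about the Z axis.
Point mesaToVisionFrame(const Point& p) {
    return {-p[0], -p[1], p[2]};
}
```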

We implement all previously described algorithms in the framework of the open source Point Cloud Library (PCL), which contains a wide range of state-of-the-art algorithms for filtering, registration, model fitting,

We evaluate the proposed method on point clouds acquired with a MESA® ToF camera SwissRanger 4000 [

Our chosen test object is a block of wood that is uniquely identifiable in its orientation from all view directions. The object model was generated as a small VRML model in the Swiss coordinate system LV95. The VRML model is only represented by the ten edge points and the information which point is a member of which plane (

The measuring principle of the MESA® ToF camera SwissRanger 4000, schematically shown in, is based on near-infrared light that is amplitude-modulated with frequency f_{mod}. The sensor samples the reflected light regularly and calculates the phase shift

The maximal non-ambiguity distance d_{max} depends on the modulation frequency f_{mod} as d_{max} = c/(2 f_{mod}).
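The relations can be illustrated numerically. The snippet below encodes the standard phase-shift equations; the modulation frequency used in the test is an assumed example value, not the camera's documented setting.

```cpp
#include <cmath>

// Speed of light (m/s).
constexpr double kC = 299792458.0;

// Distance from measured phase shift phi (radians):
// d = c * phi / (4 * pi * f_mod).
double distanceFromPhase(double phi, double fMod) {
    return kC * phi / (4.0 * M_PI * fMod);
}

// Maximal non-ambiguity distance: d_max = c / (2 * f_mod).
double maxUnambiguousDistance(double fMod) {
    return kC / (2.0 * fMod);
}
```

A phase shift of pi (half a modulation period) thus corresponds to half the non-ambiguity distance.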

To increase the precision of the result, the camera's interior orientation has to be determined beforehand. The SR4000 camera used had been calibrated by the manufacturer, MESA® Imaging.

The camera was given a warm up phase for at least one hour prior to data acquisition to ensure it reached internal temperature stability [

The a priori known GIS object models are stored as Virtual Reality Modeling Language (VRML) files with spatio-semantic information in CityGML [

As mentioned before, we chose a wooden block with plane surfaces as the test object for our experiments. However, our proposed matching procedure works with any kind of object available in the database, such as cupboards, chairs, tables

In order to successfully register the point cloud acquired by the ToF camera with the GIS object model point cloud, the number of points representing the model had to be increased to achieve roughly equal point densities in both datasets. The model point count was increased by randomly distributing between 300 and 1,000 additional points on each plane.
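The densification step can be sketched as follows: random points are scattered on a planar face spanned by two edge vectors. This is a simplified illustration assuming parallelogram-shaped faces; the function and parameter names are ours.

```cpp
#include <array>
#include <random>
#include <vector>

using Point = std::array<double, 3>;

// Scatter n random points on the parallelogram spanned by edge vectors
// u and v at the corner point origin: p = origin + a*u + b*v, a,b in [0,1).
std::vector<Point> samplePlane(const Point& origin, const Point& u,
                               const Point& v, int n, unsigned seed = 42) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    std::vector<Point> pts;
    pts.reserve(n);
    for (int i = 0; i < n; ++i) {
        const double a = dist(gen), b = dist(gen);
        pts.push_back({origin[0] + a * u[0] + b * v[0],
                       origin[1] + a * u[1] + b * v[1],
                       origin[2] + a * u[2] + b * v[2]});
    }
    return pts;
}
```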

The modification of the scale dependent parameters of the NDT algorithm in our approach is based on the example in [

We performed a grid search to find the optimal NDT parameter settings, which are shown in

Based on the chosen parameters, the algorithm took eight seconds to determine the location and calculate the transformation parameters

The implemented CG algorithm is based on the tutorial of [

Recall that our ToF point cloud covers much more than only the object itself. Thus we have to filter out non-relevant parts prior to fine registration. We adopt the plane model segmentation algorithm of [
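Plane model segmentation is commonly realized with RANSAC; the core idea can be illustrated with the following self-contained sketch, a deliberately simplified stand-in for PCL's segmentation pipeline with our own names and parameters:

```cpp
#include <array>
#include <cmath>
#include <cstdlib>
#include <vector>

using Point = std::array<double, 3>;

// Distance of point p from the plane n . x + d = 0 (n assumed unit length).
static double planeDistance(const Point& n, double d, const Point& p) {
    return std::fabs(n[0] * p[0] + n[1] * p[1] + n[2] * p[2] + d);
}

// Remove all points within `threshold` of the dominant plane found by a
// simple RANSAC loop (simplified stand-in for PCL plane segmentation).
std::vector<Point> removeDominantPlane(const std::vector<Point>& cloud,
                                       double threshold,
                                       int iterations = 200) {
    std::srand(0);  // fixed seed for reproducibility
    Point bestN{0, 0, 1};
    double bestD = 0.0;
    std::size_t bestInliers = 0;
    for (int it = 0; it < iterations; ++it) {
        // Pick three random points and derive a candidate plane.
        const Point& a = cloud[std::rand() % cloud.size()];
        const Point& b = cloud[std::rand() % cloud.size()];
        const Point& c = cloud[std::rand() % cloud.size()];
        const Point u{b[0] - a[0], b[1] - a[1], b[2] - a[2]};
        const Point v{c[0] - a[0], c[1] - a[1], c[2] - a[2]};
        Point n{u[1] * v[2] - u[2] * v[1], u[2] * v[0] - u[0] * v[2],
                u[0] * v[1] - u[1] * v[0]};
        const double len = std::sqrt(n[0] * n[0] + n[1] * n[1] + n[2] * n[2]);
        if (len < 1e-9) continue;  // degenerate (collinear) sample
        for (double& x : n) x /= len;
        const double d = -(n[0] * a[0] + n[1] * a[1] + n[2] * a[2]);
        std::size_t inliers = 0;
        for (const auto& p : cloud)
            if (planeDistance(n, d, p) < threshold) ++inliers;
        if (inliers > bestInliers) { bestInliers = inliers; bestN = n; bestD = d; }
    }
    std::vector<Point> kept;
    for (const auto& p : cloud)
        if (planeDistance(bestN, bestD, p) >= threshold) kept.push_back(p);
    return kept;
}
```

In our setting, one such pass would strip the floor; repeating it removes further dominant planes such as walls, leaving the object of interest for the CG stage.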

The model point cloud fits significantly better to the SR4000 point cloud than the original output from the NDT algorithm does (

Maximum overestimation occurs in the region where multiple reflection paths are maximal in both number and intensity. This explains the shape of the measured values, as shown in orange in

The Template Alignment algorithm (TA) presented in [. After calling SAC-IA's accessory method, the final transformation matrix (

The template alignment algorithm took the longest computation time of the three tested algorithms and did not work with our input data; the ToF point cloud was transformed to a completely wrong location. We conclude that the large difference in point density between the two input point clouds caused the failure of the TA approach. In order to provide a better initial solution for the algorithm, the input model point cloud was exchanged with the output point cloud from the NDT algorithm, as was done for the CG algorithm. Nevertheless, there was no improvement.

Efficient absolute positioning of a ToF range camera based on object acquisition in the form of measured points in 3D Cartesian coordinates is possible. The absolute position of the camera can be calculated with decimeter accuracy from the transformation parameters obtained in the presented coarse and fine registration with the NDT and CG algorithms. The position of the ToF camera can be transformed into the reference coordinate system,

However, ToF cameras still suffer from a set of error sources that hamper the goal of infrastructure-free indoor positioning. State-of-the-art range imaging sensors measure distances unambiguously between 0.5–10 m at an accuracy level of centimeters. Besides the influence of incidence angle [

Nevertheless, the algorithms provided in the Point Cloud Library can deal with any kind of point cloud. The point clouds generated by ToF cameras are small in data volume compared with laser scanner data and can therefore be processed in near real time. Due to the fast processing of 3D data at high frame rates, ToF cameras are well suited for kinematic applications, such as 3D obstacle avoidance, gesture recognition [

The proposed localization approach using only a range imaging sensor could be improved by self-calibration as well as by considering additional sensors. Adding observations from an electronic compass and/or a tilt sensor would provide approximate values for some transformation parameters and therefore stabilize the search for the correct transformation set. Furthermore, if our approach is to be used on a UVS equipped with a ToF camera, additional problems such as so-called mixed pixels or motion artifacts have to be solved using filtering methods (e.g., [

Overall concept of the presented approach.

Acquired point cloud from ToF range camera.

Origin (x,y,z) as delivered by the camera. Reproduced with permission from MESA Imaging AG, SR4000 User Manual; published by MESA Imaging AG, 2011 [

VRML model of the test object with seven surfaces and ten object points (

Time-of-flight principle. Reproduced with permission from T. Kahlmann and H. Ingensand, Proc. SPIE 6758; published by SPIE, 2007 [

Final test results after calibration by manufacturer. All deviations are within the 20 mm tolerance. The absolute error represents the deviation between a reference distance and the distance measurements. Reproduced with permission from MESA Imaging AG, SR4000 Final Test Report; published by MESA Imaging AG, 2011 [

Acquired and model point cloud after

Solution of NDT algorithm (input data: object model point cloud in green and SR4000 point cloud in white).

Solution of CG algorithm from different viewing angles; filtered SR4000 point cloud (white) and GIS object model point cloud (green).

Multiple path reflections on a concave scene-in that case a corner between two walls. Reproduced with permission from MESA Imaging AG, SR4000 User Manual; published by MESA Imaging AG, 2011 [

SR4000 specifications (modified from [

Measurement range (m) | calibrated 0.8–8 |

Sensor pixels | 176 × 144 |

Field of view (degree) | 43.6 × 34.6 |

Scan resolution at 3 m (mm) | 13.6 |

Footprint area at 3 m (m^{2}) | 4.48 |

Camera weight (g) | 470 |

Camera dimensions (mm) | 65 × 65 × 68 |

Frame rate (f/s) | 54 |

Illumination wavelength (nm) | 850 |

Price (€) | ∼5500 |

Parameters of NDT used to produce

pcl::NormalDistributionsTransform<pcl::PointXYZ, pcl::PointXYZ> ndt; | |

ndt.setTransformationEpsilon (0.01); | // min. transformation difference for termination condition (m) |

ndt.setStepSize (0.3); | // max. step size for More-Thuente line search (m) |

ndt.setResolution (2.0); | // side length of NDT grid structure (VoxelGridCovariance) (m) |

ndt.setMaximumIterations (100); | // max. number of registration iterations |

Code example for the used CG algorithm.

// Compute descriptors for keypoints |

pcl::SHOTColorEstimationOMP<PointType, NormalType, DescriptorType> descr_est; |

descr_est.setRadiusSearch (0.1f); // unit: (m) |

// Clustering |

pcl::Hough3DGrouping<PointType, PointType, RFType, RFType> clusterer; |

clusterer.setHoughBinSize (0.07f); |

clusterer.setHoughThreshold (16.5); |