
This article is an open-access article distributed under the terms and conditions of the Creative Commons Attribution license

In this article, we present a real-time 6DoF egomotion estimation system for indoor environments that uses a wide-angle stereo camera as the only sensor. The stereo camera is carried in hand by a person walking at normal walking speeds (3–5 km/h). We present the basis for a vision-based system that would assist the navigation of the visually impaired by either providing information about their current position and orientation or guiding them to their destination through different sensing modalities. Our system combines two different types of feature parametrization, inverse depth and 3D, in order to provide orientation and depth information at the same time. Natural landmarks are extracted from the image and are stored as 3D or inverse depth points, depending on a depth threshold. This depth threshold is used for switching between the two parametrizations and is computed by means of a non-linearity analysis of the stereo sensor. The main steps of our approach are presented, as well as an analysis of the optimal way to compute the depth threshold. When each landmark is initialized, the normal of the patch surface is computed using the information of the stereo pair. In order to improve long-term tracking, patch warping is performed using the normal vector information. Experimental results in indoor environments and conclusions are presented.

Real-time egomotion estimation has a key role in robotics and computer vision applications. Ever since the seminal work by Broida

In this work, a 6DoF metric stereo SLAM system with a hand-held camera as the only sensor is proposed for people egomotion estimation, in order to provide on-line metric maps and localization to the users. Our system lays down the bases towards a high-level 6DoF SLAM for the visually impaired. Since the users of the system are humans, there are no special constraints about camera movement (

We think that vision-based localization and mapping systems can provide the visually impaired with information about their current position and orientation and/or guide them to their destination through diverse sensing modalities [

Some interesting works about navigation assistance for the visually impaired using vision and other sensors such as GPS can be found in the literature [

Our system follows Davison’s SLAM approach [

We introduce an adaptation of well-known techniques in the robotics community and apply them to the problem of people egomotion estimation for assisting the visually impaired in navigation. The two main contributions of our work are the determination of a depth threshold for switching between inverse depth and 3D features by means of a non-linearity analysis, and a new 2D homography warping method that considers information from both cameras of the stereo pair. This article is organized as follows: The general structure of the system is explained in Section 2. In Section 3, the non-linearity analysis of depth and angular information and how to obtain an optimal depth threshold for switching between 3D and inverse depth features are explained. Then, in Section 4, we briefly explain the details of our EKF SLAM implementation. In Section 5, the 2D homography warping for patch adaptation is explained. Finally, some experimental results are shown in Section 6. Conclusions and future works are presented in Section 7.

Our system consists of a hand-held stereo camera with wide angle lenses and a laptop for image processing.

The global state vector _{c}_{x} u_{y} u_{z}^{t}

Due to the use of wide-angle lenses, it is necessary to use a distortion model to correct the distorted images. Unlike other SLAM systems [

For 3D features, the feature’s state vector encodes the information about the 3D position of the feature in the global map reference system.

For inverse depth features, the feature’s state vector encodes the information of the 3D optical center pose from which the feature was first seen _{ori}

Research in MonoSLAM has shown the benefits of using an inverse depth parametrization, since this parametrization allows undelayed initialization of features and mapping of features at infinity as well as close points [

We propose to use a non-linearity analysis for finding an optimal depth threshold. A function is linear in an interval if the first derivative is constant in that interval and, therefore, the second derivative is equal to zero. Considering the Taylor expansion for the first derivative of a continuous function _{f}

If the non-linearity index _{f}_{i}

If the non-linearity index _{f}

Considering an ideal stereo system, the depth of one point can be determined by means of the following equation:
Z = f_x · B / d, where f_x is the focal length in pixels, B is the baseline and d = u_L − u_R is the disparity.
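As a concrete illustration of the ideal stereo model, the sketch below computes depth from disparity and the first-order absolute depth error obtained by differentiating Z with respect to the disparity. The focal length and baseline values are illustrative assumptions, not our rig's calibration:

```python
def stereo_depth(fx, B, d):
    """Ideal stereo: depth from disparity, Z = fx * B / d
    (fx: focal length in pixels, B: baseline in meters, d: disparity in pixels)."""
    return fx * B / d

def depth_error(fx, B, d, dd=1.0):
    """First-order absolute depth error for a disparity error dd:
    |dZ| = Z**2 * dd / (fx * B), obtained by differentiating Z = fx*B/d."""
    Z = stereo_depth(fx, B, d)
    return Z ** 2 * dd / (fx * B)

# Illustrative values: fx = 200 px, B = 0.15 m
Z_near = stereo_depth(200.0, 0.15, 10.0)    # 3.0 m at 10 px disparity
Z_far = stereo_depth(200.0, 0.15, 2.0)      # 15.0 m at 2 px disparity
# The absolute error grows quadratically with depth:
assert depth_error(200.0, 0.15, 2.0) > depth_error(200.0, 0.15, 10.0)
```

This quadratic error growth is what makes the plain 3D parametrization unreliable for distant points and motivates the switch to inverse depth.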

The angular non-linearity index _{a}

Given the baseline _{x}_{uMAX}_{uMIN}_{x}_{x}
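A minimal sketch of how such a depth threshold could be derived: for the depth function Z(d) = f_x·B/d, the non-linearity index |Z''(d)·Δd / Z'(d)| simplifies analytically to 2Δd/d. The index limit and camera values below are illustrative assumptions, not the ones used in our experiments:

```python
def depth_nonlinearity_index(d, dd=1.0):
    """For Z(d) = fx * B / d the index |Z''(d) * dd / Z'(d)| simplifies
    analytically to 2 * dd / d: it grows as the disparity d shrinks,
    i.e., for distant points."""
    return 2.0 * dd / d

def depth_threshold(fx, B, index_limit, dd=1.0):
    """Depth at which the index reaches index_limit: solve 2*dd/d = index_limit
    for the minimum acceptable disparity, then convert to depth via Z = fx*B/d."""
    d_min = 2.0 * dd / index_limit
    return fx * B / d_min

# Under this model, doubling the baseline doubles the switching depth
# (all values illustrative):
zt_short = depth_threshold(fx=200.0, B=0.15, index_limit=0.1)
zt_long = depth_threshold(fx=200.0, B=0.30, index_limit=0.1)
assert abs(zt_long - 2.0 * zt_short) < 1e-9
```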

The type of graphs shown in

We performed several experiments in which we computed depth and angular non-linearity indexes for different baselines and considering the infinitesimal changes in disparity and depth as: Δ


As can be observed in _{t}

Assuming that


For modeling the camera motion between two consecutive frames, we use a general motion model to predict the camera pose in the next frame. Since in this case we are using a hand-held camera, we assume 6DoF and we expect smooth motions. Our motion model assumes that the camera linear and angular velocities may change in every frame, but they are expected to be constant on average. This means that the camera movements are approximated using the constant linear and angular velocity motion model [^{W}^{C}^{W}^{W}_{v}_{i}
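The constant linear and angular velocity prediction can be sketched as follows; this is a simplified version in which the zero-mean acceleration impulses of the process noise are omitted, and the variable names are illustrative:

```python
import numpy as np

def quat_from_omega(theta_vec):
    """Quaternion [w, x, y, z] for a rotation vector (angle * axis)."""
    angle = np.linalg.norm(theta_vec)
    if angle < 1e-12:
        return np.array([1.0, 0.0, 0.0, 0.0])
    axis = theta_vec / angle
    return np.concatenate(([np.cos(angle / 2)], np.sin(angle / 2) * axis))

def quat_mult(q, p):
    """Hamilton product of two quaternions [w, x, y, z]."""
    w1, x1, y1, z1 = q
    w2, x2, y2, z2 = p
    return np.array([w1*w2 - x1*x2 - y1*y2 - z1*z2,
                     w1*x2 + x1*w2 + y1*z2 - z1*y2,
                     w1*y2 - x1*z2 + y1*w2 + z1*x2,
                     w1*z2 + x1*y2 - y1*x2 + z1*w2])

def predict_camera_state(r, q, v, w, dt):
    """Constant-velocity prediction: the position integrates the linear
    velocity and the orientation quaternion integrates the angular velocity;
    velocities are carried over unchanged."""
    r_new = r + v * dt
    q_new = quat_mult(q, quat_from_omega(w * dt))
    return r_new, q_new / np.linalg.norm(q_new), v, w

# One prediction step at 30 Hz with purely forward motion:
r1, q1, v1, w1 = predict_camera_state(np.zeros(3),
                                      np.array([1.0, 0.0, 0.0, 0.0]),
                                      np.array([1.0, 0.0, 0.0]),
                                      np.zeros(3), dt=1.0 / 30)
```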

Visual measurements are obtained from the set of map points that are predicted to be visible. For each feature i, the measurement is the vector z_i = (u_L, v_L, u_R, v_R)^t, i.e., the projections of the feature in the left and right images.

In order to decide which features are going to be measured, we predict the visibility of every feature in the map. In this way, we can predict whether the appearance of a given feature is close enough to its original appearance when the feature was initialized. Our visibility criterion is based on a length and angle heuristic. Feature visibility is calculated considering the difference between the viewpoint from which the feature was initially seen and the new viewpoint. This difference in viewpoint has to be below some length and angle ratio, and the feature has to be predicted to lie within the image, in order to be predicted as visible. Usually the feature is expected to be visible if the length ratio
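The length/angle visibility heuristic can be sketched as follows; the ratio and angle limits are illustrative assumptions, and the check that the predicted projection lies inside the image is omitted:

```python
import numpy as np

def predicted_visible(p_init, p_now, feat,
                      max_length_ratio=1.4, max_angle_deg=45.0):
    """Length/angle visibility heuristic: the feature is predicted visible
    if the current viewing ray is neither much longer/shorter than, nor
    rotated too far from, the ray at initialization. The limits here are
    illustrative values, and the in-image check is omitted."""
    v0 = feat - p_init                      # viewing ray at initialization
    v1 = feat - p_now                       # current viewing ray
    n0, n1 = np.linalg.norm(v0), np.linalg.norm(v1)
    ratio = n1 / n0
    cosang = np.clip(np.dot(v0, v1) / (n0 * n1), -1.0, 1.0)
    angle = np.degrees(np.arccos(cosang))
    return (1.0 / max_length_ratio < ratio < max_length_ratio) and angle < max_angle_deg

# Visible from the original viewpoint, but not after moving 80% closer:
feat = np.array([0.0, 0.0, 5.0])
assert predicted_visible(np.zeros(3), np.zeros(3), feat)
assert not predicted_visible(np.zeros(3), np.array([0.0, 0.0, 4.0]), feat)
```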

Prior to performing the actual measurement, we need to obtain the value of the predicted vector _{i}

In order to obtain the measurement vector for each feature _{i}_{i}_{XX}_{YY}_{i}_{i}_{iL}_{iR}_{i}
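Under standard EKF assumptions, the elliptical search region for each image measurement follows from the eigen-decomposition of its 2 × 2 innovation covariance; a minimal sketch:

```python
import numpy as np

def search_ellipse_axes(S, n_sigma=3.0):
    """Axes of the elliptical search region implied by a 2x2 innovation
    covariance S: the eigenvectors give the axis directions and
    n_sigma * sqrt(eigenvalue) the corresponding half-lengths in pixels."""
    vals, vecs = np.linalg.eigh(S)
    return n_sigma * np.sqrt(vals), vecs

# An axis-aligned covariance of 4 px^2 by 1 px^2 yields a 3-sigma
# search ellipse with half-axes of 6 px and 3 px:
half, axes = search_ellipse_axes(np.array([[4.0, 0.0], [0.0, 1.0]]))
```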

Once the search areas are defined, we try to measure each of the features. At the initialization stage of each feature, we store an 11 × 11 2D image template centered on the interest point, as well as an estimate of its normal vector, assuming that the feature is located on a plane. We then modify the original 2D image template according to the current camera pose with the 2D image warping described in Section 5, perform a correlation search over the whole search area, and compare the best correlation value to a threshold. If the correlations in the two views (left, right) are good enough, the newly measured projection coordinates are saved in order to perform the filter update. Otherwise, the feature is marked as
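A minimal sketch of the correlation search, using zero-mean normalized cross-correlation as the similarity score; the 0.8 acceptance threshold is an illustrative value:

```python
import numpy as np

def ncc(a, b):
    """Zero-mean normalized cross-correlation between equal-size patches."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0

def correlate_template(image, template, threshold=0.8):
    """Exhaustive correlation search of a template over a search area;
    returns (position, score) of the best match, or (None, score) if the
    best score falls below the acceptance threshold."""
    th, tw = template.shape
    best, best_pos = -1.0, None
    for r in range(image.shape[0] - th + 1):
        for c in range(image.shape[1] - tw + 1):
            s = ncc(image[r:r + th, c:c + tw], template)
            if s > best:
                best, best_pos = s, (r, c)
    return (best_pos, best) if best >= threshold else (None, best)

# Recover a synthetic 5x5 template embedded in an otherwise flat image:
rng = np.random.default_rng(0)
template = rng.standard_normal((5, 5))
image = np.zeros((20, 20))
image[7:12, 4:9] = template
pos, score = correlate_template(image, template)
```

In practice the search is restricted to the high-probability ellipse rather than the whole image, which keeps this exhaustive loop cheap.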

To perform the filter update, the Kalman gain _{i}_{cam}_{i}_{i}_{tot}_{tot}
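The filter update follows the standard EKF correction equations, sketched below for a generic state, measurement, and Jacobian; the composition of the full Jacobian from the camera and feature blocks is omitted:

```python
import numpy as np

def ekf_update(x, P, z, h, H, R):
    """Standard EKF correction: innovation covariance S = H P H^t + R,
    Kalman gain K = P H^t S^-1, then state and covariance update."""
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x_new = x + K @ (z - h)                  # correct with the innovation
    P_new = (np.eye(len(x)) - K @ H) @ P     # reduce the uncertainty
    return x_new, P_new

# 1D example: prior N(0, 1), measurement z = 2 with unit noise
# gives posterior N(1, 0.5):
x1, P1 = ekf_update(np.array([0.0]), np.array([[1.0]]),
                    np.array([2.0]), np.array([0.0]),
                    np.array([[1.0]]), np.array([[1.0]]))
```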

In order to build the map incrementally, we need to define a criterion for adding new features and deleting those whose tracking was poor during previous frames. When a new feature is added to the system, it is initialized with its respective 3D feature uncertainty plus the current camera pose uncertainty. The rule for adding new features is to maintain at least 10 visible features at the same time. In addition, at least 7 features have to be successfully measured at the same time in order to avoid the complete loss of the camera tracking.

Besides, some of the features that are in the total state vector can be

Harris corners are extracted from the images and are classified as 3D features or stored as inverse depth features, depending on the estimated optimal depth threshold. Once the features are predicted in the EKF prediction step, it is necessary to determine whether the original parametrization of the features has to be changed (

When an inverse depth feature is switched to a 3D parametrization, it is necessary to adapt the feature’s state and the covariances implied in the filtering process by means of
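The state part of this conversion can be sketched as follows, assuming the common six-parameter inverse depth coding (initial optical center, azimuth/elevation angles of the viewing ray, inverse depth). The angle convention below is one common choice, not necessarily our exact coding, and the covariance must additionally be propagated through the Jacobian of this function:

```python
import numpy as np

def inverse_depth_to_xyz(f):
    """Convert an inverse depth feature [x0, y0, z0, theta, phi, rho]
    (optical center at initialization, azimuth/elevation of the viewing
    ray, inverse depth) into a 3D point: p = c0 + m(theta, phi) / rho.
    The direction convention for m is one common choice (illustrative)."""
    x0, y0, z0, theta, phi, rho = f
    m = np.array([np.cos(phi) * np.sin(theta),
                  -np.sin(phi),
                  np.cos(phi) * np.cos(theta)])
    return np.array([x0, y0, z0]) + m / rho

# A feature seen straight ahead (theta = phi = 0) at inverse depth
# 0.5 m^-1 lies 2 m in front of its initialization point:
p3d = inverse_depth_to_xyz(np.array([1.0, 2.0, 3.0, 0.0, 0.0, 0.5]))
```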

When a feature is going to be measured, the estimation of the left camera position and orientation, both of which are obtained from the SLAM state vector, and the normal surface patch vector are used for transforming the initial image template appearance (due to changes in viewpoint) by warping the initial template using a 2D homography. Our approach is related to the previous works of [

Considering two camera-centered coordinate systems, the transformation between two generic coordinate systems C_1 and C_2 is defined by X_2 = R X_1 + t, where X_1 is a point on the plane defined by n^t X_1 = d.

^{RL}^{RL}^{RL}^{t}^{CO}^{CO}
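A plane-induced homography of this kind can be sketched as follows, with the plane expressed as n^t X = d in the first camera frame; sign conventions vary across texts, and the intrinsic matrix here is illustrative:

```python
import numpy as np

def warp_homography(K, R, t, n, d):
    """Plane-induced homography between two views of the plane n^t X = d
    (unit normal n, distance d, both in the first camera frame):
    H = K (R + t n^t / d) K^{-1}, mapping first-view pixels to the second."""
    return K @ (R + np.outer(t, n) / d) @ np.linalg.inv(K)

# With no relative motion the warp is the identity:
K = np.array([[200.0, 0.0, 160.0],
              [0.0, 200.0, 120.0],
              [0.0, 0.0, 1.0]])       # illustrative intrinsics
H = warp_homography(K, np.eye(3), np.zeros(3), np.array([0.0, 0.0, 1.0]), 2.0)
assert np.allclose(H, np.eye(3))
```

Applying this H to the stored 11 × 11 template yields the predicted appearance under the current camera pose.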

In order to test the system performance, many indoor sequences have been evaluated. In this work, we present only the results of three of them. The cameras used were Unibrain Fire-i IEEE1394 modules with additional 1.9 mm wide-angle lenses, which provide a field of view of around 100° both horizontally and vertically. Camera calibration is done in a previous setup process according to the one described in [

The first sequence is a typical indoor corridor sequence. The corridor has a length of 10 m, and the camera moves in a straight, tilted-left trajectory. This scenario is suitable for the inverse depth parametrization, since we can find very distant features that are parametrized as inverse depth points. We performed a comparison between inverse depth and 3D parametrizations and studied three different cases: without inverse depth parametrization, and with both parametrizations using two different depth thresholds of _{t}_{t}

The final map and trajectory of the L and loop sequences are displayed in




When a feature is predicted as visible and is going to be measured, we perform a correlation search over an area of high probability of finding the measurement. For this correlation search, we can use the original image patch that was captured when the feature was initialized, or we can modify the appearance of this original patch with respect to the current camera pose by means of the 2


According to

With respect to the processing time, real-time implementation imposes a restriction: the total processing time shall not exceed 33 ms for a 30 frames/second capture rate. _{t}^{3}), being

In this article we have presented a system that allows self-localization of a stereo camera by combining depth and angular information from different natural landmarks. We think that our vision-based localization system can in the future help the visually impaired community for navigation purposes, by either providing information about their current position and orientation or guiding them to their destination through diverse sensing modalities. One of the contributions of our work is the determination of a depth threshold for switching between inverse depth and 3D features by means of a non-linearity index analysis of the stereo sensor. In addition, the benefits of using an inverse depth parametrization for mapping features at infinity have been shown. However, depending on the application (the scenario and computation time constraints), the overhead due to the use of the inverse depth parametrization can be unnecessary, and higher depth values for switching can be chosen if the map quality is not altered. According to the results of

However, using the proposed depth threshold can exceed real-time constraints due to the inverse depth parametrization overhead. We are very interested in studying the use of a dynamic threshold as a function of the kind of environment, instead of the static one that is currently used, so as to obtain the same map quality keeping real-time constraints.

Considering 2D image templates together with the normal vector of the plane that contains the point in space improves the tracking considerably with respect to using 2D image templates alone. However, since the normal vector is only estimated once per feature, an update of the patch normal estimates would likely be of benefit. Moreover, we are interested in using scale-invariant features and descriptors such as center surround extrema features [

In future work, a high-level SLAM system will be developed for mapping large outdoor environments. We plan to use a submapping approach similar to the one described in [

In addition, we are interested in fusing the stereo system with inertial sensors such as pedometers and/or GPS for outdoor experiments. Besides, the motion model must be improved, due to the great variability of movements that a walking person can perform. Another interesting alternative is to use fast 6DoF visual odometry priors, replacing general camera motion models.

As we are interested in the application of visual SLAM techniques to the navigation of the visually impaired, we plan to get feedback from visually impaired organizations.

This work was supported in part by the Spanish Ministry of Science and Innovation (MICINN) under grant TRA2008-03600/AUT (DRIVER-ALERT Project) as well as by the Community of Madrid under grant CM: S-0505/DPI/000176 (RoboCity2030 Project).

Stereo vision system for our 6DoF visual SLAM.

Inverse depth point coding.

(a) Absolute and (b) relative depth estimation errors for a stereo rig considering a focal length _{x}

Depth and angular non-linearity indexes with focal length _{x}

Stereo geometry and locally planar surfaces.

Inverse depth and 3D comparison. (a) Total state vector size, (b) Without inverse depth par. L sequence, (c) with inverse depth par. Z = 5.7 m L sequence, (d) without inverse depth par. loop sequence, (e) with inverse depth par. Z = 10 m loop sequence, (f) with inverse depth par. Z = 5.7 m loop sequence.

Optimal depth thresholds for different stereo baselines and fixed focal length _{x}

Baseline | Optimal Depth Threshold Z_t (m)
---|---
15 | 5.71
20 | 6.69
30 | 8.35
40 | 9.81

Inverse depth and 3D comparison: absolute errors in trajectory and map uncertainty.

Sequence | Parametrization | Inv. Depth Features (%) | Error X | Error Z | σ_YY
---|---|---|---|---|---
Corridor | Without Inverse Par. | 0.00 | 0.9394 | 0.4217 | 0.1351
Corridor | With Inverse Par., Z_t | 5.23 | 0.9259 | 0.4647 | 0.0275
Corridor | With Inverse Par., Z_t | 24.32 | 0.7574 | 0.3777 | 0.0072
L | Without Inverse Par. | 0.00 | 0.5047 | 0.3985 | 0.1852
L | With Inverse Par., Z_t | 7.85 | 0.5523 | 0.1017 | 0.0245
L | With Inverse Par., Z_t | 19.21 | 0.5534 | 0.2135 | 0.0078
Loop | Without Inverse Par. | 0.00 | 0.4066 | 0.9801 | 0.2593
Loop | With Inverse Par., Z_t | 5.27 | 0.3829 | 0.6303 | 0.0472
Loop | With Inverse Par., Z_t | 12.36 | 0.2191 | 0.3778 | 0.0310

Comparison of patch matching techniques: no patch transformation and 2D homography warping.

Method | Sequence | Map Features | Attempted Measurements | Successful Measurements | Success Ratio (%)
---|---|---|---|---|---
No Patch Transformation | Corridor | 85 | 6,283 | 5,612 | 89.32
2D Warping | Corridor | 68 | 6,398 | 5,781 | 90.35
No Patch Transformation | L | 116 | 11,627 | 8,922 | 76.73
2D Warping | L | 105 | 10,297 | 9,119 | 88.71

Processing times.

Step | Time (ms)
---|---
Feature Initialization (15) | 18.00
Prediction | 0.47
Measurement | 10
Update | 4.96