Real-world Comparison of Visual Odometry Methods

Positioning is an essential aspect of robot navigation, and visual odometry is an important technique for continuously updating a robot's internal estimate of its position, especially indoors where GPS is unavailable. Visual odometry uses one or more cameras to find visual clues and estimate the robot's relative movements in 3D. Recent progress has been made, especially with fully integrated systems such as the Intel RealSense T265, which is the focus of this article. We compare three visual odometry systems and one wheel odometry system against each other on a ground robot. We do so in 8 scenarios, varying the speed, the number of visual features, and the presence or absence of humans walking in the field of view. We continuously measure the position error in translation and rotation against a ground-truth positioning system. Our results show that all odometry systems are challenged, but in different ways. On average, ORB-SLAM2 has the poorest results, while the RealSense T265 and the Zed Mini have comparable performance. In conclusion, a single odometry system might still not be sufficient, so using multiple instances and sensor fusion approaches remains necessary while waiting for additional research and further improved products.


Introduction
Robot localization within its environment is one of the fundamental problems in mobile robotics. One way of tackling this problem is vision-based odometry (VO), which is capable of accurately localizing a robot's position with low drift over long trajectories, even in challenging conditions. Many VO algorithms have been developed; they are categorized into direct, semi-direct and feature-based, depending on what image information is used to estimate egomotion. The hardware setup varies, as camera images can be captured in monocular or stereo vision. Many augmentations of VO are available, which fuse the computed egomotion with other sensors that can refine the trajectory, such as IMUs and depth sensors.
There are many benchmark comparisons of VO algorithms, usually focusing on one VO application. Benchmarks compare various VO algorithms in terms of translation and rotation error, memory usage, computation time and CPU consumption. [1] compared monocular visual-inertial odometry for six-degree-of-freedom (6DoF) trajectories of flying robots. [2] assessed the performance of VO algorithms using image and depth sensors (RGB-D) for mobile-device applications. [3] evaluated VO techniques in challenging underwater conditions. The KITTI dataset [4] features a benchmark for visual odometry where researchers can evaluate their algorithms on 11 trajectories unknown beforehand, and the best performing VO methods in terms of rotation and translation error are listed online. [5] compared filtering-based and optimization-based methods of VI-SLAM through experiments; it also foresees the trend of running SLAM systems on dedicated hardware, for example the Intel RealSense.
Preprint posted on www.preprints.org, 13 May 2020, doi:10.20944/preprints202005.0221.v1 (not peer-reviewed).
The purpose of this research is mainly to assess the accuracy of a new commercial hardware-software technology, the RealSense T265 from Intel, and compare it with a few other alternatives.
The conducted evaluation of VO solutions is especially meant for practitioners, serving as a guideline for choosing the right VO solution. The Zed Mini and RealSense provide out-of-the-box hardware with dedicated software solutions, bringing out the best of their hardware-software synergy. The performance of the RealSense and Zed Mini will be compared to the well-established ORB-SLAM2 algorithm.
ORB-SLAM2 was chosen for this purpose since it is one of the most widely used SLAM algorithms for VO purposes, making it easier for readers to establish a common frame of reference regarding accuracy. Moreover, the Zed Mini and RealSense T265 will be compared to wheel odometry and evaluated against ground truth obtained with the OptiTrack motion capture system.

Test environment
The experiments were conducted indoors, in a controlled area, on a flat, non-slippery surface.
As visible in Figure 1, we used pieces of dark textile to make the scene more, or less, feature-rich, i.e. to adjust the quantity of visual clues in the robot's field of view. Indeed, visual odometry systems are especially challenged when facing uniform surfaces such as a long white wall. Another important parameter affecting the quality of visual odometry is whether those visual clues are static (not moving) or whether some of them might be moving (dynamic). In order to compare the robustness of the different odometry systems against moving visual elements, we asked three persons to walk repeatedly along the walls of the experiment area.
It is important to note that, in order to ensure a fair test of the different systems, all visual odometry systems were running in parallel, meaning that they were exposed to exactly the same environment. We believe this is an interesting approach to comparing VO systems.

Ground truth
An OptiTrack system was used as ground truth. It is a motion capture system capable of tracking objects with a positional error below 0.3 mm and a rotational error below 0.05°, using seven Prime 13 cameras (cf. Figure 1) that detect passive markers placed on the tracked object. Five markers placed on top of the robot were used to track the robot's 6DoF position. The pivot point is the marker location for which the final position is calculated; in the experiment, it was located at the centre of the camera, which was ~25 cm in front of the centre of the robot.

Robot setup
The platform on which the tests were performed is a Parallax Arlo Robot Platform System (cf.

Software setup
The robot software packages run on ROS (Robot Operating System, from the Open Source Robotics Foundation), more precisely ROS Kinetic under the GNU/Linux distribution Ubuntu 16.04 LTS. We used modified "ROS packages for ArloBot" on a Raspberry Pi to communicate with the "Parallax Activity Board" (microcontroller) on the robot.

Intel RealSense Tracking Camera T265
The RealSense T265 camera is a tracking camera that was released by Intel in March 2019 at a price of 199 USD. It includes two fisheye lens sensors as well as an Inertial Measurement Unit (IMU).
The visual SLAM algorithm runs directly on the built-in Intel Movidius Myriad 2 VPU. This gives very low latency between a movement and its reflection in the pose, as well as low power consumption, which stays around 1.5 W. Since all computations are performed in real time onboard, no computations need to be carried out on the master computer.

ZED Mini
The ZED Mini is a visual-inertial depth camera that features dual high-speed 2K image sensors and a 110° field of view. With an eye separation of 63 mm, the camera senses depth from 0.1 m to 12 m, with improved accuracy and fewer occlusions in the near range. Using visual-inertial odometry technology, inertial measurements are fused at 800 Hz with visual data from the stereo camera. Sensor fusion allows for accurate tracking even when visual odometry gets lost due to an insufficient number of feature matches.

ORB-SLAM2
ORB-SLAM2 is a complete SLAM system [6] for monocular, stereo and RGB-D cameras that achieves state-of-the-art accuracy in many environments. In this study, the monocular setup was used.

Scenarios
For each scenario, the robot starts by driving forward for 3 meters. Then it makes three full turns plus 180° (1260° in total) on the spot. The process is repeated four times during each scenario.
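As a concrete illustration, the open-loop command plan for one scenario could be generated as below. This is a sketch of ours, not the robot's actual control code; the speed arguments are placeholders, since the per-scenario speed values are discussed later.

```python
import math

def scenario_plan(lin_speed, ang_speed, repeats=4):
    """Command plan for one scenario: drive 3 m forward, then rotate
    1260 deg (three full turns plus 180 deg) on the spot, repeated."""
    drive_t = 3.0 / lin_speed                # seconds to cover 3 m
    turn_t = math.radians(1260) / ang_speed  # seconds to turn 1260 deg
    plan = []
    for _ in range(repeats):
        plan.append(("forward", lin_speed, drive_t))
        plan.append(("rotate", ang_speed, turn_t))
    return plan
```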
We repeated the experiments for 3 different parameters, giving a total of 8 combinations (cf. Table 1):
• Speed: We drove the robot at two speeds: either "Slow" or "Fast", the latter with a linear speed ~3 times higher and an angular speed ~7 times higher.
• Quantity of visual features: We changed the number of visual features in the field of view: either "Many", with several paper posters on the walls to increase the number of visual clues, or "Few", with mostly grey walls. The floor is unchanged between conditions.
• Moving visual elements: We made the visual environment more, or less, stable: either "Static", with nothing moving, or "Dynamic", with some persons constantly walking along the walls around the room.

Data Analysis
The datasets come from three different sources, namely the OptiTrack system, the Raspberry Pi and the Jetson TX2. The first step is to transform the data into the same format. Because the robot moves only in a 2D plane, the output of each method can be reduced to a robot position (x, y) and a robot orientation θ. Afterwards, the three datasets are synchronized and merged into one. In order to analyse the performance of the different visual odometry systems relative to the OptiTrack, some columns of the dataset, such as the velocity and the OptiTrack poses, are interpolated (empty cells are filled with the previous values), since the OptiTrack data does not arrive at the same timestamps as the others.
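The pose reduction and forward-fill synchronization described above can be sketched as follows. This is a minimal illustration in plain Python; the function names are ours, not taken from the released analysis code.

```python
import math
from bisect import bisect_right

def quat_to_yaw(x, y, z, w):
    """Reduce a 3D orientation quaternion to the planar heading theta."""
    return math.atan2(2.0 * (w * z + x * y), 1.0 - 2.0 * (y * y + z * z))

def forward_fill(query_times, ref_times, ref_values):
    """For each query timestamp, take the most recent reference sample
    (forward fill); None if no reference sample precedes the query."""
    out = []
    for t in query_times:
        i = bisect_right(ref_times, t) - 1
        out.append(ref_values[i] if i >= 0 else None)
    return out
```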
Before calculating the errors, the ORB-SLAM2 data is scaled; the scale coefficient is found by gradient descent. In addition, the robot wheel odometry data needs to be transformed to the camera centre so that all measurements are in the same coordinate system. An example of how the data looks at this stage can be seen in Figure 3.
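The two corrections mentioned here, the monocular scale recovered by gradient descent and the rigid offset from the robot centre to the camera centre, can be sketched as follows. This is our own minimal reimplementation, not the released code; the learning rate and iteration count are illustrative.

```python
import math

def fit_scale(est, gt, lr=0.01, iters=500):
    """Find the scalar s minimizing sum ||s * est_i - gt_i||^2 by
    gradient descent (monocular ORB-SLAM2 output has no metric scale)."""
    s = 1.0
    for _ in range(iters):
        grad = sum(2 * ((s * ex - gx) * ex + (s * ey - gy) * ey)
                   for (ex, ey), (gx, gy) in zip(est, gt))
        s -= lr * grad / len(est)
    return s

def to_camera_frame(x, y, theta, offset=0.25):
    """Move a wheel-odometry pose from the robot centre to the camera
    centre, ~25 cm ahead along the robot's heading."""
    return (x + offset * math.cos(theta), y + offset * math.sin(theta), theta)
```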

Descriptive statistics
In order to get a better understanding of the data, a first round of descriptive statistics is performed. The two most informative visualisations are reported in Figure 4 and Figure 5, respectively for the translation error (i.e. the robot {x, y} position estimation error) and the rotation error (i.e. the robot orientation error).
We observe that wheel odometry always provides a poor translation (Figure 4) and rotation (Figure 5) estimation, but does so in a quite consistent manner: wheel odometry is indeed not much affected by the scenarios, not even by speed, which is not surprising in non-sliding conditions. The measurements are more consistent during translations than during rotations.
As expected, we observe that the visual odometry systems are much affected by the challenging scenarios, with occasionally large errors for all of them, especially for ORB-SLAM2, and especially during rotations.

Figure 4. For each of the odometries (wheel encoders, RealSense T265, Zed Mini, ORB-SLAM2), we report the median translation error (red horizontal line), the range containing 75% of the observed translation errors (blue rectangle box), the range containing 95% (blue error lines), as well as outliers (red crosses above the rest). For each odometry, from left to right, the scenarios are: "many slow static", "many slow dynamic", "few slow static", "few slow dynamic", "many fast static", "many fast dynamic", "few fast static", "few fast dynamic".
Figure 5. For each of the odometries, we report the median rotation error (red horizontal line), the range containing 75% of the observed rotation errors (blue rectangle box), the range containing 95% (blue error lines), as well as outliers (red crosses above the rest). For each odometry, from left to right, the scenarios are: "many slow static", "many slow dynamic", "few slow static", "few slow dynamic", "many fast static", "many fast dynamic", "few fast static", "few fast dynamic".
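The quantities shown in these box plots (the median, a box covering 75% of the errors, whiskers covering 95%, and outliers beyond them) can be computed as sketched below. This is our own helper, assuming linear-interpolated percentiles.

```python
def box_stats(errors):
    """Median, central-75% box, central-95% whiskers and outliers,
    matching the elements drawn in Figures 4 and 5."""
    s = sorted(errors)

    def pct(p):
        # linear-interpolated percentile
        k = (len(s) - 1) * p
        lo = int(k)
        hi = min(lo + 1, len(s) - 1)
        return s[lo] + (s[hi] - s[lo]) * (k - lo)

    box = (pct(0.125), pct(0.875))       # central 75% of the errors
    whiskers = (pct(0.025), pct(0.975))  # central 95% of the errors
    outliers = [e for e in s if e < whiskers[0] or e > whiskers[1]]
    return pct(0.5), box, whiskers, outliers
```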

Statistical analysis
In order to help identify relevant differences, we performed a light statistical analysis, with a series of t-Tests of the type "Two-Sample Assuming Unequal Variances" from the "Analysis ToolPak" of Excel (Microsoft Office 365 version 1910). Table 2 contains the results for the average translation error, while Table 3 contains those for the average rotation error, across all scenarios. P-values marked with one asterisk are better than 0.05; with two asterisks, better than 0.01.
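Excel's "Two-Sample Assuming Unequal Variances" test is Welch's t-test; its statistic and Welch-Satterthwaite degrees of freedom can be reproduced with a few lines of the standard library, as sketched below (the p-value is then obtained from the t-distribution with df degrees of freedom):

```python
from statistics import mean, variance

def welch_t(a, b):
    """Welch's two-sample t statistic and Welch-Satterthwaite degrees
    of freedom, as used by Excel's unequal-variance t-Test."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / (va + vb) ** 0.5
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df
```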
The statistical analysis confirmed the main trends observed in the descriptive statistics.

Main findings
Wheel odometry is not much affected by the different scenarios, not even by the change of speed, leading to more consistent values, especially during translation. This is not surprising because the floor surface remained identical. However, the standard deviation of wheel odometry is typically higher than for the visual odometries, making it generally less precise, especially during the easy scenarios (i.e. one or more of: low speed, many features, static environment).
The scenarios do, however, have a significant effect on the visual odometries. In our tests, speed had the greatest effect (in the "Fast" scenarios, the linear speed was ~3 times higher and the angular speed ~7 times higher), followed by the number of features, while the static vs. dynamic environment had the smallest effect.
Among the visual odometries, ORB-SLAM2 shows the poorest results in our experiments, both in translation (p < 4E-2) and in rotation (p < 4E-5), and for all scenarios. This materialises in higher imprecision, a higher standard deviation, and more outliers than for the other methods.
Except for a few outliers, the RealSense T265 and the Zed Mini have comparable results on average (p > 0.1). The RealSense is somewhat more negatively affected by speed than the Zed, especially during translation.

Camera lens types
The RealSense T265 has a wide field of view, making it able to potentially spot many more visual features than the ZED Mini, but the drawback is, in principle, a poorer image quality. In our experiments, in the end, this did not seem to make a significant difference, although we cannot tell which part of the results is due to the lens and which part is due to a difference in processing.

Processing power
On a robot or drone, aspects such as total weight, price, and power consumption are essential factors. On those factors, the RealSense T265 globally wins over the Zed Mini and ORB-SLAM2, as it comes with built-in data processing, while the other visual odometries require an additional powerful computer board such as an NVIDIA Jetson or similar.

Multiple sensors & Sensor fusion
In our experiments, we compared the different methods with a single sensor for each of them, but it would be possible to combine several cameras for potentially better quality. This is especially doable on larger robots.
A similar approach is to combine different types of sensors with a sensor fusion approach.
Outdoors, such sensor fusion could be done with e.g. GPS, and indoors, for instance, with fiducial markers such as ArUco markers or other 2D barcodes. Notably, the RealSense T265 offers a built-in sensor fusion mechanism that can be fed with wheel odometry, but this was outside the scope of these experiments.
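As a toy illustration of such a fusion (not the T265's internal mechanism, and far simpler than the Kalman-style filters typically used), two (x, y, θ) estimates can be blended with a fixed weight; note that headings must be averaged on the circle, not linearly:

```python
import math

def fuse_poses(vo, wheel, w=0.7):
    """Fixed-weight blend of a visual-odometry pose and a wheel-odometry
    pose, both given as (x, y, theta). Purely illustrative; a real system
    would weight each sensor by its current uncertainty."""
    x = w * vo[0] + (1 - w) * wheel[0]
    y = w * vo[1] + (1 - w) * wheel[1]
    # circular mean of the two headings
    theta = math.atan2(w * math.sin(vo[2]) + (1 - w) * math.sin(wheel[2]),
                       w * math.cos(vo[2]) + (1 - w) * math.cos(wheel[2]))
    return (x, y, theta)
```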

Conclusions
As shown by the experiments, the Intel RealSense T265 compares well with the state of the art, especially when accounting for price, built-in processing power, and sensor fusion abilities. However, a single RealSense T265 does not solve the visual odometry challenge fully. Therefore, even for basic indoor navigation needs, several sensors or techniques must be combined. For the time being, visual odometry remains a domain with room for additional research and improvements.

Supplementary Materials:
The source code of the data analysis, as well as the raw data (in ROS Bag format) for the different odometry systems is available online from: https://github.com/DTU-R3/visual_odometry_comparison