
This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

One of the greatest difficulties in stereo vision is the appearance of ambiguities when matching similar points from different images. In this article we analyze the effectiveness of using a fusion of multiple baselines and a range finder from a theoretical point of view, focusing on the results of using both prismatic and rotational articulations for baseline generation, and offer a practical case to prove its efficiency on an autonomous vehicle.

In this article we analyze, from an analytical point of view, the possibilities and limitations of fusing multiple baselines with a range finder.

One of the most useful techniques to rebuild three-dimensional scenes from two-dimensional images is stereo vision, which uses the horizontal disparity between corresponding points in different images to calculate their depth position. Matching which objects from each image correspond to one another is a complex process, especially if the analyzed scene contains repeated objects. For instance, two similar pedestrians may be matched incorrectly between the images, making their calculated depth wrong.
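The depth calculation from horizontal disparity can be sketched for an ideal pinhole pair; the function name and the sample values below are illustrative assumptions, not taken from the article.

```python
# Minimal sketch of depth-from-disparity triangulation for an ideal
# pinhole stereo pair; f (focal distance), b (baseline) and the sample
# values are illustrative placeholders.

def depth_from_disparity(x_left, x_right, f, b):
    """Depth of a point from its horizontal image coordinates."""
    d = x_left - x_right          # horizontal disparity
    if d == 0:
        return float("inf")       # parallel rays: point at infinity
    return f * b / d

# A point 20 m away seen by cameras 1 m apart with f = 0.05 m
# produces a disparity of f*b/z = 0.0025 m.
z = depth_from_disparity(0.01, 0.0075, f=0.05, b=1.0)
```

Identical projections (zero disparity) yield an infinite depth, which is precisely the degenerate case where matching ambiguities become unsolvable.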

One of the main techniques to solve these ambiguities is the usage of multiple baselines, which provides the reconstruction process with more than two images of the same scene to compare. This tool has been studied before for different camera configurations and applying several processing algorithms, such as the SSSD (sum of sum of squared differences) in inverse distance.

We intend to create a set of theoretical guidelines and restrictions to aid the design and construction of multiple and variable baseline systems, by accurately describing the relation between the usual configuration parameters and the matching process difficulty. Our research follows a strictly numerical approach, assuming ideal pinhole cameras with exact precision, and for two particular configurations: a classic linear baseline for two cameras at a variable distance, and a pair at a fixed distance with a variable orientation.

A laser range finder is then included in the vision system to enhance the results, greatly improving their precision for a large set of configurations. This device has been widely studied and customized in the past, for instance through the addition of dynamic range enhancement.

A multiple baseline system is bound to produce ambiguous results, in the form of sets of points that must be matched to provide the location of the corresponding physical objects. The fusion procedure consists of comparing the output data of a laser range finder device with that of the baseline calculations, so that only one of the possible interpretations of its results is proved right.

This work is part of a project for the development of a low-cost smart system, designed for passenger transportation and surveillance in unstructured environments, namely the Technical and Renewable Energies Institute (ITER) facilities in Tenerife (Spain), which feature a housing complex of twenty-five environmentally friendly dwellings. One of the key elements of this task is obstacle detection and scene understanding, so the first stage of this process is to determine potential obstacles' locations and approximate shapes.

The base system used for this project was the VERDINO prototype, a modified EZ-GO TXT-2 golf cart. This fully electric two-seat vehicle was to operate automatically, so it was equipped with computerized steering, braking and traction control systems. Its sensor system consists of a differential GPS, relying on two different GPS stations, one of them being a fixed ground station. The latter's position is known very accurately, so it is used to estimate the error introduced by each satellite and to send corrections to the roving GPS stations mounted on the vehicle. The prototype's three-dimensional orientation is measured using an Inertial Measurement Unit (IMU). This unit features three accelerometers, three gyroscopes, three magnetometers, and a processor that calculates its Euler angles in real time, based on the information these sensors provide. An odometer is also available and serves as a back-up positioning system in the event of a GPS system failure.

The prototype also includes several Sick LMS221-30206 laser range finders and a front platform that supports a vision system, consisting of conventional and thermal stereo cameras. This platform moves to adjust the cameras' height and rotation, allowing them to see around curves and on irregular terrain. This vision system was built to detect obstacles and the unpainted edges of roads, and serves as the basis for our current research.

In this section we analyze two forms of baseline generation—variation of length or rotation—and generalize the common aspects of both procedures. We also present their fusion with a range finder system, study the effect of cardinality differences among sets of point projections, and evaluate possible irresolvable configurations.

Our nomenclature assumes ideal cameras whose only relevant parameter is a constant focal distance.

The distance between two hypothetical cameras 1 and 2 is named b_{12}. For simplicity's sake we will normalize all cameras' focal distances.

Ambiguities occur when similar pixels produce a set of more than one possible spatial location for the observed points.

This spatial coincidence is defined as the Euclidean distance between the point sets calculated with the correct matching and with each of the alternative ones.

The first camera configuration lets camera 2 move towards or away from camera 1; its displaced position is treated as a third camera, camera 3. The distance it moves from its original position will be named Δb. Using cameras 1 and 3 we obtain a second estimate of each point's location, and for a correct matching the distance between (x_{12}, z_{12}) and (x_{13}, z_{13}) is expected to be zero.
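This zero-distance consistency check between the estimates of camera pairs (1, 2) and (1, 3) can be sketched as follows; the projection model, the names (F, b12, Δb) and the sample point are assumptions for illustration.

```python
import math

# Sketch of the consistency check between the point locations computed
# from camera pairs (1,2) and (1,3); camera positions, F and the
# ground-truth point are illustrative placeholders.

F = 0.05  # normalized focal distance

def project(point_x, point_z, cam_x):
    """Image coordinate of a point for a camera at (cam_x, 0)."""
    return F * (point_x - cam_x) / point_z

def triangulate(x1, x2, baseline):
    """Recover (x, z) from the projections of cameras at 0 and baseline."""
    d = x1 - x2                   # disparity
    z = F * baseline / d
    return x1 * z / F, z

# Ground-truth point, fixed baseline b12 and displacement Δb.
P = (2.0, 15.0)
b12, delta_b = 1.0, 0.4
x1 = project(*P, 0.0)
p12 = triangulate(x1, project(*P, b12), b12)
p13 = triangulate(x1, project(*P, b12 + delta_b), b12 + delta_b)
# For the correct matching, both estimates coincide.
gap = math.dist(p12, p13)
```

A spurious matching would combine projections of different physical points, so its two estimates would not coincide and `gap` would be clearly non-zero.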

We express the difference amongst the candidate matchings through the depth values they produce; an ambiguity appears when the value calculated from one camera pair equals that of another. This can happen for two reasons.

If the distance between camera 1 and either of the other two is zero (b_{12} = 0 or b_{12} + Δb = 0) both would provide the same information. In this case said camera is redundant and its data will not contribute to the result.

Otherwise the light beams that reach the cameras from the ambiguous points must be parallel.

Further study of this case showed an important detail. Let us consider two hypothetical points, P_{a} and P_{b}, whose depths are z_{1a} and z_{1b} for the former camera pair and z_{2a} and z_{2b} for the latter; we will assume z_{1a} = z_{2b}. Resorting to the previous equations we can compare both combinations.

Since they are different combinations, we know that at most one of the two assignments of P_{a} and P_{b} can correspond to the real scene.

Let us now suppose the light beams are not parallel, but divergent; this would mean that z_{1a} > z_{2b} if b_{12} > 0 and z_{1a} < z_{2b} if b_{12} < 0. In either case, the ambiguous combination can be rejected.

If more than two points are being considered, the rejection of one of them would immediately invalidate its whole set. This must be taken into account, since when a particular combination is discarded for two cameras, adding a new baseline would be unnecessary and would only add useless data. Therefore, since the light beams are more likely to be divergent the closer the cameras are, for ideal pinhole cameras ambiguity can be solved by making baselines shorter.

The second camera configuration sets the two cameras on the edges of a rotary rigid body. Its axis is placed exactly between the cameras, so that the rotation radius is half the baseline length, b_{12}/2; the orientation angle is named θ.

Since cameras 1 and 2's relative position remains equal to the previously studied case, their points' location can still be calculated using the same equations.
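The geometry of this rotary configuration can be sketched as below; the midpoint-centred rotation and radius b_{12}/2 follow the description above, while the function name and sample values are ours.

```python
import math

# Sketch of the camera positions for the rotary configuration: the
# rigid body rotates about the midpoint of the pair, so each camera
# sits at a radius of b12/2 from the axis.

def rotated_cameras(b12, theta, center=(0.0, 0.0)):
    """Positions of cameras 1 and 2 for orientation angle theta."""
    r = b12 / 2.0
    cx, cy = center
    cam1 = (cx - r * math.cos(theta), cy - r * math.sin(theta))
    cam2 = (cx + r * math.cos(theta), cy + r * math.sin(theta))
    return cam1, cam2

# At theta = 0 the pair lies on the horizontal axis, b12 = 10 apart.
cam1, cam2 = rotated_cameras(10.0, 0.0)
```

Note that for any θ the distance between the cameras stays b_{12}; only their orientation relative to the scene changes, which is why the divergence analysis must be redone for this case.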

Divergence conditions must also be taken into consideration, but the relation between the depth values of the fixed and rotated pairs is now more complex. It can be simplified by checking that z_{13} must be strictly greater than zero for all valid configurations.

Our starting data will be one set of projections, {p_{1}}, {p_{2}} and {p_{3}}, for cameras 1, 2 and 3 correspondingly. We assume each camera perceives the same number of points, P.

This value will be close to zero when the combination of permutations results in the correct point set. We intend to calculate the optimal value of Δb.
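A brute-force version of this minimum-distance matching, assuming each camera pair yields a small candidate point set, could look like the sketch below; the sets and values are synthetic.

```python
import math
from itertools import permutations

# Brute-force search over permutations of one candidate set against a
# reference set, keeping the pairing whose summed Euclidean distance
# is minimal; the point values are synthetic examples.

def best_matching(reference, candidates):
    best, best_cost = None, float("inf")
    for perm in permutations(candidates):
        cost = sum(math.dist(a, b) for a, b in zip(reference, perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    return best, best_cost

ref = [(0.0, 10.0), (3.0, 12.0)]
cand = [(3.1, 12.0), (0.1, 10.0)]   # same points, shuffled and noisy
match, cost = best_matching(ref, cand)
```

The exhaustive search grows factorially with the set size, which is why the cardinality and augmentation analysis later in the article matters in practice.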

Most methods that calculate distance maps from conventional cameras are based on the usage of multiple baselines.

By placing the range finder device directly over camera 1, most non-infinite threshold configurations can be easily solved using the depth value of all visible points, without needing a third camera. Our fusion system deals exclusively with ambiguous point distributions, which are reduced subsets of the stereo output. The stereo pair will detect an ambiguity every time the calculations of the point locations produce multiple results; as explained in Section 2, this will occur when the perceived distance between any two points of the input set is shorter than the camera baseline.
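The ambiguity test described above, flagging a point set when any two of its points lie closer together than the baseline, can be sketched as follows; the sample points and baseline are invented.

```python
import math
from itertools import combinations

# Sketch of the ambiguity predicate: a point set is flagged as
# ambiguous when any two of its points are closer together than the
# stereo baseline length.

def is_ambiguous(points, baseline):
    return any(math.dist(p, q) < baseline
               for p, q in combinations(points, 2))

# The first two points are only 0.6 apart, below the 1.0 baseline.
pts = [(0.0, 10.0), (0.6, 10.0), (9.0, 10.0)]
flag = is_ambiguous(pts, baseline=1.0)
```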

The calculated point locations are then compared to the corresponding beams of the range finder, in order to find the most probable combination. Therefore, we do not need a generic sensor fusion system, such as traditional Kalman filters.

Consider the range finder provides its information as a certain depth value for each visible direction; the corresponding horizontal coordinate is then solved using the projection equations.

However, a real range finder system will not provide these values accurately, but limited to a certain number of angles. The point cloud it returns is processed into a set of centroids and radii that describe the approximate location and size of visible objects. Our particular case, a vision system for a vehicle prototype, deals with pedestrians and other highly vertical objects as obstacles, which allows a theoretical simplification: the measurements derived from a single horizontal sweeping plane can be extrapolated for the whole visible surface.
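A minimal sketch of this centroid/radius preprocessing, assuming a simple gap-based clustering of one horizontal sweep (the gap threshold and the sample scan are invented):

```python
import math

# Sketch of turning one horizontal range-finder sweep into
# centroid/radius pairs: consecutive scan points are grouped whenever
# the gap to the previous point is below a threshold.

def scan_to_objects(points, gap=0.5):
    clusters, current = [], [points[0]]
    for prev, p in zip(points, points[1:]):
        if math.dist(prev, p) <= gap:
            current.append(p)
        else:
            clusters.append(current)
            current = [p]
    clusters.append(current)
    objects = []
    for c in clusters:
        cx = sum(p[0] for p in c) / len(c)   # cluster centroid
        cz = sum(p[1] for p in c) / len(c)
        radius = max(math.dist((cx, cz), p) for p in c)
        objects.append(((cx, cz), radius))
    return objects

# Two obstacles: three close hits around x = 0.2, two around x = 4.1.
scan = [(0.0, 5.0), (0.2, 5.0), (0.4, 5.0), (4.0, 6.0), (4.2, 6.0)]
objs = scan_to_objects(scan)
```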

These new sets of centroids and radii will be named {c} and {r}. From the {p_{1}} and {p_{2}} sets several point groups can be calculated, and their proximity to the centroids determines the most likely matching. For a particular point couple and a particular centroid/radius pair, this proximity can be calculated from the distance between the point and the centroid, compared with the radius.
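Since the exact proximity formula is garbled in this copy of the text, the sketch below uses a plausible stand-in that scores a candidate point by how close it lies to the surface of a centroid/radius object; higher values mean better agreement, matching the convention used later for discarding spurious sets.

```python
import math

# Illustrative proximity measure (a stand-in, not the article's exact
# formula): a point scores highest when it lies on the circle of the
# given centroid and radius, and lower the further it strays.

def proximity(point, centroid, radius):
    """Higher values mean the point agrees better with the object."""
    return 1.0 / (1.0 + abs(math.dist(point, centroid) - radius))

# A point on the object's boundary scores 1.0; a distant point scores low.
on_surface = proximity((1.0, 5.0), (0.0, 5.0), 1.0)
far_away = proximity((6.0, 5.0), (0.0, 5.0), 1.0)
```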

The relation between a point and a centroid is not univocal, since their sets are not necessarily defined by the same number of elements: several points may correspond to a single centroid and vice versa.

Certain restrictions regarding the sets' cardinalities must be taken into account.

If all the sets’ cardinalities are different and greater than one there will be hidden points. The only way to solve this situation is to augment the smaller sets by repeating some of their elements, so that they all have the same cardinality. All possible combinations will have to be tested and their permutations generated so the problem can be solved normally.

Although this process would apparently increase the problem's complexity, the number of permutations that must be considered may actually decrease for augmented sets. A set containing m distinct elements augmented to size n holds n − m repeated elements, so many of its n! permutations become indistinguishable.
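This reduction can be verified by enumerating the distinct permutations of an augmented set; the two-element example below is our own.

```python
from itertools import permutations
from math import factorial

# Repeated elements introduced by augmentation make many of the n!
# orderings indistinguishable, so fewer permutations must be tested.

aug = ("a", "a", "b")              # the set {a, b} augmented to size 3
distinct = set(permutations(aug))  # duplicate orderings collapse
total = factorial(len(aug))        # naive count without deduplication
```

Here only 3 of the 6 naive orderings are distinct, and the gap widens quickly as more repetitions appear.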

For our calculations, the maximum number of valid permutations over all the possible augmentations of a size-m set was obtained recursively, considering the positions available to each k-th element.

Each time an element reappears it will steal available spaces from the following ones, leaving at least one for each. Therefore, in each augmentation an element can only occupy the spaces not already taken by the elements before it.

The number of calculations can still be reduced if we consider that some combinations of these permutations will be redundant. For instance, combining one set with each of the others in turn repeats part of the comparisons already performed.

Consider S the largest of the three sets, and S_{1} and S_{2} the other two. We will try matching them by pairs, remembering that S_{1} and S_{2} will have had to be augmented to suit S's cardinality.

Since only part of the augmented permutations of S_{1} will be valid, the number of combinations to evaluate depends on the cardinalities |S|, |S_{1}| and |S_{2}| rather than on the full factorial counts.

Sometimes three cameras may not be enough to distinguish real points from spurious ones: this happens when one such point is hidden from all of them, including the range finder device. This may only be solved by modifying the cameras' position once again or by adding a fourth camera.

For instance, consider a configuration in which each device perceives a different subset of the points: camera 1 can only see some of them, camera 2 sees a partially different group, and camera 3 will show yet another, so no device can reject the spurious candidates.

Once the cameras have provided their data, the quality of a given configuration can be measured through its differentiability.

This differentiability is defined as the difference between the distance values for the correct point set and the closest wrong one. We studied this threshold value for all possible variations of both baseline length and rotation, along with the position and alignment of the original points. This way, the cameras can be arranged optimally depending on the location of the points, in order to improve the result of the matching algorithm. They are then compared with the values obtained by using proximities.
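The differentiability margin defined above can be computed directly from the candidate scores; the function name and the sample values below are synthetic.

```python
# Sketch of the differentiability threshold: the gap between the
# distance score of the correct point set and that of the closest
# wrong candidate; larger margins mean a safer matching decision.

def differentiability(correct_score, wrong_scores):
    """How safely the correct matching beats its closest rival."""
    return min(wrong_scores) - correct_score

# The correct set scores 0.05; the best wrong candidate scores 0.80.
margin = differentiability(0.05, [0.80, 1.30, 2.10])
```

A margin near zero means a wrong matching is almost as consistent as the correct one, which is exactly the situation the baseline or rotation should be adjusted to avoid.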

The studied arrangement included two points, (x_{1}, z_{1}) and (x_{2}, z_{2}), such that their middle point is (x_{m}, z_{m}).

The threshold value for variations of baselines between cameras 1 and 2 and between 1 and 3 is shown in the corresponding figure. Ambiguities disappear when both baselines are shorter than the gap between the points (|b_{12}| < 5 and |b_{12} + Δb| < 5 in the studied case), and null threshold values appear whenever a baseline vanishes (b_{12} = 0 or b_{12} + Δb = 0).

In this experiment b_{12} = 10. All ambiguities disappear whenever the gap between the points is greater than b_{12}, regardless of the other camera's position. Otherwise, the problem remains solved as long as the varying baseline is shorter than the gap between points, that is, |b_{12} + Δb| smaller than the gap.

The threshold value for variations of the points' position was studied for b_{12} = 10. Null values happen whenever two of the cameras superimpose (Δb = 0 or Δb = −b_{12}).

The joint effect of Δb and the gap between the studied points was analyzed for b_{12} = 10. The former parameters are closely related since, as we studied for the previous case, anytime a baseline is shorter than the gap the ambiguity disappears.

The threshold value for variations of point rotation was analyzed next.

In this case b_{12} = 10. Not only are the results better the further the points are from the cameras, but also for certain rotations of the point pair.

As expected, whenever a baseline becomes shorter than the gap between the points (albeit greater than zero) the light beams become divergent. It was confirmed that variations of the baseline length behave as the theoretical analysis predicts.

When studying the results of varying rotation, the optimal orientation angle proved to depend directly on the position of the analyzed points.

Complementing the described system, the range finder system can increase the number of situations in which a third camera is not necessary. We studied a new arrangement with two possible points, aligned along the same line of sight from camera 1, such that only one of them may be real. A centroid is then swept over the visible area and the proximity-based threshold value is analyzed. For those areas where the combined threshold is greatest, the problem can be solved using only two cameras and the range finder values.


Considering all calculations, an optimal combination of baseline and range finder can be designed. By placing this device over camera 1, all non-immediate point configurations can be first analyzed using their depth information, given as centroid/radius sets. Only when these data are not conclusive as means of scene reconstruction may the baseline parameters be modified.

Even though a rotary baseline might be easier to build than an extendable one, we have confirmed that the latter produces much better results than the former. Since our research requires one camera to remain static at all times in order to efficiently compare the calculated point sets, baseline rotation becomes inadequate. This is so because the third camera position is generally unable to produce infinite threshold situations—that is, whenever only two cameras are needed to find the point locations.

On the other hand, length variation avoids this problem so easily that a new possibility arises: according to our calculations and assuming ideal or precise enough cameras, a fixed 3-camera configuration could provide an optimal solution for all cases as long as two of them are as close as possible. Most point configurations could be solved immediately by these alone.

We tested the presented algorithm in real life by placing two pedestrians in a partially controlled environment—namely the parking lot of La Laguna University’s Computer Engineering School—and using the VERDINO’s sensor system as input source.

The range finder system consists of Sick LMS221-30206 laser models, which offer an angular resolution of 0.5°, a maximum range of 80 m, and systematic and statistical errors (at a range of 1–20 m) of ±35 mm and ±10 mm respectively.

The stereo cameras are two Santachi DSP220x models, with a pixel resolution of 320 × 240, three color channels, an adjustable focal distance between 3.9 and 85.8 mm, and a lens angle of 47°. Their maximum precision at a distance of

The data from the camera system recognize the obstacles, but they must be correctly matched to properly calculate their distance. By crossing this information with the range finder preprocessed output, spurious points were easily discarded as those that produced a minimum proximity value to the resulting set of centroids, as explained in Section 2.4.

The input and output data for a practical application of the sensor fusion algorithm are shown in the corresponding figure. The stereo calculations produce two candidate point sets, one of which must be spurious. The depth information given by the range finder is shown as a gray line, and the processed centroid/radius couples as blue circles. As can be seen, the wrong point set does not match any centroids, therefore producing a much lower proximity value and being discarded.

In order to compare the performance of our sensor fusion system and a trinocular camera arrangement, both methods were tested in an environment with two pedestrians at various relative positions. These included variations of the pedestrians’ distance from the sensor system, horizontal position, separation, and rotation, similarly to our theoretical experiments in Section 3. The separation was always kept shorter than the baseline length, otherwise no ambiguities would occur.

The average percentage of correctly solved ambiguities is shown in the corresponding figures.

We have studied two different ways to obtain multiple baselines from a two-camera configuration, as well as the fusion of a single baseline with a range finder system, and analyzed their optimal settings to most accurately locate the observed points’ spatial positions. Our research has defined equations to efficiently evaluate solutions for the correspondence problem and proved which parameters provide the best results.

Considering ideal pinhole cameras, a variable length baseline matches points more reliably the closer the cameras are to each other. A rotary baseline is more difficult to configure, as its optimal orientation angle is directly related to the position of the analyzed points.

The performance of a stereo vision/range finder fusion system was tested and compared with a trinocular baseline arrangement. The experimental results proved that the fusion system provides higher precision over a wider range of configurations.

This research was partially supported by the project SAGENIA DPI2010-18349 of the Spanish Government.

Example of images for stereo vision with ambiguities: the pedestrians must be correctly matched between the images or else their calculated depth will be wrong.

Parameters

Number of possible permutations for variations of both the size desired for the sets and their original cardinality.

Irresolvable point-camera configuration.

Threshold value for variations of baselines between cameras 1 and 2 and between 1 and 3.

Threshold value for variations of the gap between points

Threshold value for variations of

Threshold value for variations of rotation

Threshold value for variations of _{12} and

Threshold value for variations of

Threshold value for variations of

Proximity based threshold value for variations of the centroid position.

Range finder error between actual objects and calculated centroids and radii.

Input (1,2) and output (3) data of the sensor fusion algorithm. Units for (3) are given in meters and measured from camera 1.

Experimental performance for variations of object distance and separation, for a trinocular system

Comparison of experimental performance between a trinocular system and the stereo vision/range finder fusion system, for a fixed separation between objects.