This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

Stereo matching is an open problem in Computer Vision, in which local features are extracted to identify corresponding points in pairs of images, and the results are heavily dependent on these initial steps. We apply image decomposition into multiresolution levels to reduce the search space, the computational time, and the errors. We propose a solution to the problem of how deep (coarse) the stereo measures should start, trading off error minimization against time consumption, by starting the stereo calculation at a varying resolution level for each pixel, according to fuzzy decisions. Our heuristic enhances the overall execution time, since it only employs deeper resolution levels when strictly necessary. It also reduces errors, because it measures similarity between windows with enough detail. We also compare our algorithm with a very fast multiresolution approach and with one based on fuzzy logic. Our algorithm performs faster and/or better than all those approaches, becoming, thus, a good candidate for robotic vision applications. We also discuss the system architecture that efficiently implements our solution.

The goal of stereo vision is to recover 3D information given incomplete and possibly noisy information of the scene [

Our approach consists of performing an initial coarse matching between low resolution versions of the original images. The result is refined on small areas of increasingly higher resolution, until the matching is done between pixels in the original images resolution level. This is usually termed “coarse to fine” or “cascade correlation”.

Multiresolution procedures can, in principle, be performed in any order, even in a backwards and forwards scheme, but our choice is based upon computational considerations aiming at reducing the required processing time. Multiresolution matching, in particular, is known to reduce the complexity of several classes of image processing applications, including the matching problem, leading to fast implementations. The general problem with multiresolution algorithms is that, more often than not, they start with the coarsest resolution for all pixels and thus spend a long time. Our approach improves the search for an optimal resolution where to find correspondence points.

The main contribution of this work is proposing, implementing and assessing a multiresolution matching algorithm with starting points whose levels depend on local information. Such levels are computed using a new heuristic based on fuzzy decisions, yielding good quality and fast processing.

The paper unfolds as follows. Section 2 presents a review of image matching, focused on the use of multilevel and fuzzy techniques. Section 3 formulates the problem. Section 4 presents the main algorithms, and Section 5 discusses relevant implementation details. Section 6 presents results, and Section 7 closes with the main contributions, drawbacks and possible extensions of this work.

Vision is so far the most powerful biological sensory system. Since computers appeared, several artificial vision systems have been proposed, inspired by their biological versions, aiming at providing vision to machines. However, the heterogeneity of techniques necessary for modeling complete vision algorithms makes the implementation of a real-time vision system a hard and complex task.

Stereo vision is used to recover the depth of scene objects, given two different images of them. This is a well-defined problem, with several text books and articles in the literature [

Stereo matching is generally defined as the problem of discovering points or regions of one image that match points or regions of the other image on a stereo image pair. That is, the goal is finding pairs of points or regions in two images that have local image characteristics most similar to each other [

There are several stereo matching algorithms, generally classified into two categories: area matching and feature (element) matching [

Area based algorithms are usually slower than feature based ones, but they generate full disparity maps and error estimates. Area based algorithms usually employ correlation estimates between image pairs for generating the match. Such estimates are obtained using discrete convolution operations between image templates. The algorithm performance is, thus, very dependent on the correlation and search window sizes. Small correlation windows usually generate maps that are more sensitive to noise, but less sensitive to occlusions, better defining the objects [

In order to exploit the advantages of both small and big windows, algorithms based on variable window size were proposed [

Several models have been proposed in the literature for image data reduction. Most of them treat visual data as a classical pyramidal structure. The scale space theory is formalized by Witkin [

Wavelets [

Multiresolution algorithms mix both area and feature matching for achieving fast execution [

Besides the existence of these

Several works use fuzzy logic clustering algorithms in stereo matching in order to accelerate the correspondence process [

Fuzzy theory is also applied to determine the best window size with which to process correlation measures in images [

Our proposed approach is rather different from the above-listed works and integrates multiresolution procedures with fuzzy techniques. As stated above, the main problem with the multiresolution approach is how to determine the level with which to start correlation measures. A second problem is that, even if a good level is determined for a given pixel, this will not be the best for all the other image pixels, because this issue is heavily dependent on local image characteristics. So, we propose the use of fuzzy rules in order to determine the optimal level for each region in the image. This proposal leads to the precise determination of matching points in real time, since most of the image area is not considered in full resolution.

Our algorithm performs faster and better than plain correlation, and it presents improved results with respect to a very fast multi-resolution approach [

This paper extends results by Medeiros and Gonçalves [

In the stereo matching problem, we have a pair of pictures of the same scene taken from different positions, and possibly orientations, and the goal is to discover corresponding points, that is, pixels in both images that are projections of the same scene point. The most intuitive way of doing that is by comparing groups of pixels of the two images to obtain a similarity value. After similarities are computed, one may or may not include restrictions and calculate the matching that maximizes the global similarity. Our proposal assumes (i) continuity of disparity, and (ii) uniqueness of the correct matching.

In general, given a point in one image, the comparison is not made with all points of the other image. Using the epipolar restriction [

We measure similarity with the normalized sample cross correlation between image windows $f = (f_{ij})_{1 \le i \le m,\, 1 \le j \le n}$ and $g = (g_{ij})_{1 \le i \le m,\, 1 \le j \le n}$, estimated by the linear Pearson correlation coefficient as

$$ r(f, g) = \frac{\sum_{i,j} (f_{ij} - \bar{f})(g_{ij} - \bar{g})}{\sqrt{\sum_{i,j} (f_{ij} - \bar{f})^2 \, \sum_{i,j} (g_{ij} - \bar{g})^2}}, $$

where $\bar{f}$ and $\bar{g}$ denote the sample means of the two windows.
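The coefficient above can be sketched as a small routine. The function name `pearson` and the flattened, row-major window representation are our illustrative choices, not part of the paper's library:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Sample Pearson correlation between two equally-sized windows,
// flattened row-major. Returns 0 when either window is constant
// or empty (the coefficient is undefined in those cases).
double pearson(const std::vector<double>& f, const std::vector<double>& g) {
    const std::size_t n = f.size();
    if (n == 0 || g.size() != n) return 0.0;
    double mf = 0.0, mg = 0.0;
    for (std::size_t k = 0; k < n; ++k) { mf += f[k]; mg += g[k]; }
    mf /= n; mg /= n;
    double num = 0.0, df = 0.0, dg = 0.0;
    for (std::size_t k = 0; k < n; ++k) {
        num += (f[k] - mf) * (g[k] - mg);
        df  += (f[k] - mf) * (f[k] - mf);
        dg  += (g[k] - mg) * (g[k] - mg);
    }
    const double den = std::sqrt(df * dg);
    return den > 0.0 ? num / den : 0.0;
}
```

The coefficient lies in $[-1, 1]$: windows related by a positive linear transform score 1, which is why the measure is insensitive to local brightness and contrast changes.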

If the objects are known to lie within a distance range, the search for the best match can be restricted to a subset of the epipolar line. We will refer to this subset as the “search interval”, to avoid confusion with the refining interval that will be defined later.

Small search intervals, if they can be defined, improve the quality of the resulting matching and avoid false positives far from the desired match on the epipolar line. While this is convenient for many problems, in some, notably robotic vision, near objects are the most important ones, thus requiring a full matching between the images.

We compare here the plain correlation and multiresolution matching approaches. Both algorithms have as common attribute the window size. Although some authors recommend the use of a 7 × 7 window for plain correlation (see, for instance, the work of Hirschmüller [

Traditional plain correlation calculates the normalized, linear cross correlation between all possible windows of both images. For each point in one image, the matching point is chosen in the other image such as to maximize the correlation coefficient.

When matching square images of side $w$, this requires $w^3$ correlations; when a search interval of width $w_s < w$ is imposed, the count drops to $w_s w^2$. Of course, in the worst case, we should assume that the plain correlation approach has $O(w^3)$ complexity.

Multi-resolution stereo matching uses several pairs of images of the same scene, sampled with different levels of detail, as a double pyramidal representation of the scene [

Multiresolution algorithms in stereo matching calculate the disparity of all pixels (or blocks of pixels) of a coarse level image and refine them, matching the pixels of finer level images with a small number of pixels around the coarser match. We refer to the interval that contains those pixels as the “refining interval”.

For example, a multiresolution algorithm with fixed depth that matches the points of two 256 × 256 pixel images, say $L_0$ and $R_0$, may use three pairs of images having, thus, level 3, of sizes 128 × 128, 64 × 64 and 32 × 32; we denote these pairs $(L_\ell, R_\ell)$, $1 \le \ell \le 3$, where each pixel $L_\ell(i, j)$ is computed from the four pixels $L_{\ell-1}(2i, 2j)$, $L_{\ell-1}(2i + 1, 2j)$, $L_{\ell-1}(2i, 2j + 1)$ and $L_{\ell-1}(2i + 1, 2j + 1)$, and analogously for $R_\ell$. The reduction is applied successively to $L_0$ in order to obtain $L_1$, $L_2$ and $L_3$. We omit the dependence on the coordinates $(i, j)$ when there is no ambiguity.

The classical approach would attempt to match all the 32 × 32 pixels of the pair $(L_3, R_3)$ to, then, proceed to their refinement. The refinement of a pixel $L_3(i, j)$ consists of matching the pixels $L_2(2i, 2j)$, $L_2(2i + 1, 2j)$, $L_2(2i, 2j + 1)$ and $L_2(2i + 1, 2j + 1)$ against a small neighborhood of the corresponding coarse match in $R_2$. This is repeated until the matching is done on the $(L_0, R_0)$ pair, obtaining the final result.
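The coarse-to-fine loop just described can be sketched on 1-D scanlines; this is our simplification, with a refining interval of ±2 and absolute sample difference as a stand-in for window correlation, and all names are illustrative:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Row = std::vector<double>;

// Halve the resolution by averaging adjacent pairs of samples.
Row downsample(const Row& r) {
    Row out;
    for (std::size_t i = 0; i + 1 < r.size(); i += 2)
        out.push_back(0.5 * (r[i] + r[i + 1]));
    return out;
}

// Best disparity for left[i] among candidates [lo, hi], by absolute
// difference of single samples (a stand-in for window correlation).
int best_match(const Row& left, const Row& right, int i, int lo, int hi) {
    int best = lo; double bestCost = 1e300;
    for (int d = lo; d <= hi; ++d) {
        int j = i - d;                       // epipolar candidate
        if (j < 0 || j >= (int)right.size()) continue;
        double c = std::abs(left[i] - right[j]);
        if (c < bestCost) { bestCost = c; best = d; }
    }
    return best;
}

// Full search at the coarsest level, then: double the coarse disparity
// and search only within a small refining interval at each finer level.
int refine_disparity(const Row& left, const Row& right, int i, int levels,
                     int refine = 2) {
    if (levels == 0 || left.size() < 4)
        return best_match(left, right, i, 0, (int)right.size() - 1);
    int dc = refine_disparity(downsample(left), downsample(right), i / 2,
                              levels - 1, refine);
    return best_match(left, right, i, 2 * dc - refine, 2 * dc + refine);
}
```

At each level the search touches only a constant-width interval around the doubled coarse estimate, which is where the speed-up over the brute force search comes from.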

This approach is known to be faster than the brute force search on $(L_0, R_0)$ (plain correlation). In fact, in the extreme case where the images are square and the smallest ones are single pixels, the matching requires $O(w^2 \log w)$ operations for images of side $w$. Compare the cost of building the pyramids, which is $O(w^2)$, with the complexity of the matching given above: the total is, anyway, $O(w^2 \log w)$.

Reducing the search interval is not very efficient at improving this algorithm, since the gain in operations comes at the expense of more errors. Often, important characteristics are lost in the smaller images, reducing correlation precision. Those errors can sometimes be alleviated by a larger refining interval, which increases the execution time.

In practice, some implementations report that the processing time spent building the multiresolution pyramid often cancels out the time gained on optimizing the correlations [

As previously seen, plain correlation matching is very expensive and prone to errors such as ambiguity or lack of correspondence when there is not enough texture detail. On the other hand, multiresolution matching with fixed depth also tends to generate errors, although most pixels are still assigned nearly correctly. Also, the number of errors increases with the depth of the algorithm, since they are due to loss of information in the coarser images.

To get the best of both algorithms, one could assign for each pixel a different level: hard-to-compute positions should be treated at the highest resolution, while the others could be treated at an optimum, coarser level with just enough information. This adaptive approach, which is the proposed multiresolution matching with variable depth, will be shown to be able to reduce errors while still requiring less computational effort. The optimal level is computed on one of the images, and then each displacement is calculated in the same way as is done on the fixed depth algorithm.

A heuristic is then needed to calculate the desired depth. We also need to generate the lower resolution images.

The proposed algorithm uses, for each image, a scale pyramid with several resolution versions of the original image, and one or more detail images. Scale images are obtained by a sub-band filter applied to the original images, while detail images are obtained by filtering the contents of the scale image at the same level. We assessed two distinct approaches for the pyramid creation, which differ mainly in the manner in which the detail images are calculated: wavelets, and Gaussian and Laplacian operators. They are described in the following sections.

We used a discrete wavelet transform to build the pyramids. With this approach, at a given level $i$, the scale image $I_i$ and the detail images $D_i$ are obtained from the decomposition of the previous level.

We build two multiresolution pyramids by successively convolving the previous images with the low-pass Gaussian ($\Upsilon_G$) and the high-pass Laplacian ($\Upsilon_L$) masks.

With this, we generate a pyramid of images and another of details. Convolving the original image $I_0$ with the high-pass filter ($\Upsilon_L$), the detail image $D_0$ is generated. $I_0$ is then convolved with the low-pass filter defined by the mask $\Upsilon_G$, generating image $I_1$. This last image is then convolved again with the high-pass filter defined by the mask $\Upsilon_L$, generating $D_1$. A second low-pass filter ($\Upsilon_G$), applied to $I_1$, generates image $I_2$, which is finally filtered by $\Upsilon_L$ into $D_2$.
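One step of this construction might look as follows in 1-D; the binomial low-pass mask [1, 2, 1]/4 and the Laplacian mask [−1, 2, −1] are illustrative stand-ins for the paper's 2-D Gaussian and Laplacian masks:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Signal = std::vector<double>;

// Low-pass (binomial approximation of a Gaussian) followed by
// decimation by 2: produces the next scale image.
Signal scale_step(const Signal& s) {
    Signal out;
    for (std::size_t i = 0; i + 2 < s.size(); i += 2)
        out.push_back(0.25 * s[i] + 0.5 * s[i + 1] + 0.25 * s[i + 2]);
    return out;
}

// High-pass Laplacian applied at the same level: produces the detail
// image. No decimation is involved, so the details remain aligned
// with (and shift along with) the scale image they were computed from.
Signal detail_step(const Signal& s) {
    Signal out(s.size(), 0.0);
    for (std::size_t i = 1; i + 1 < s.size(); ++i)
        out[i] = -s[i - 1] + 2.0 * s[i] - s[i + 1];
    return out;
}
```

Applying `scale_step` repeatedly yields the scale pyramid, and `detail_step` at each level yields the detail pyramid; on a perfectly flat signal the detail image is identically zero, which is exactly the cue the level-selection heuristic exploits.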

These two pyramids are able to retain enough information in order to allow an efficient search for matching points.

The use of a sub-band filtering makes this algorithm much faster than the one proposed by Hoff and Ahuja [

Due to decimation, the construction of the scale images of the pyramid cannot be made shift-invariant. However, the detail images can be shift-invariant and this is a key difference between the two techniques. In the case of wavelets, the detail images are sensitive to shifts, but with 2D filtering they are invariant.

The wavelet transform is invertible. The 2D filtering based transform, instead, is invertible only if both the high-pass and low-pass filters are ideal filters [

We use a propositional logic based on fuzzy evidence to derive a heuristic for calculating the desired level from which the matching will be performed. Such a level is the coarsest one that can be labeled as “reliable”, in the sense that it provides enough information for the matching.

Fuzzy logic is composed of propositions whose truth values vary continuously between 0 and 1, rather than being restricted to true or false.

We define a predicate $R_\ell(i, j)$ stating that the information at position $(i, j)$ of level $\ell$ is reliable for starting the matching.

If the detail at $(i, j)$ of level $\ell$ is strong enough, the pixel is considered reliable at that level.

The deeper the classification, the less reliable it is: $R_{\ell+1}(i, j)$ is obtained by combining the values $R_\ell(v)$ over the pixels $v \in K_{\ell+1}(i, j)$, the set of level-$\ell$ pixels that collapse into $(i, j)$ at level $\ell + 1$.

Because short execution time is our main objective, the heuristic has to be easy to compute by general purpose computers, leading to

We define, for any _{ℓ}

The ideal values of

However, the amount of texture is not known a priori. So, in this work, an empirical value is assigned for

The fuzzy heuristic presented above is able to assign a proper level to every pixel of an image, identifying detailed and flat areas. A successful technique for our purposes should be able to detect the level of detail of each image region based on texture. Flat regions should be treated at coarser, deeper levels, while detailed regions should be treated at finer ones.
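As a hedged sketch of this idea (the paper's exact membership functions and thresholds are not reproduced here), a ramp membership over the local detail energy can select the coarsest level whose reliability passes an α-cut; the names `T` and `alpha` are hypothetical parameters of ours:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Fuzzy degree of reliability in [0, 1] of a pixel's detail value:
// a linear ramp membership, saturating at the texture threshold T.
// (Illustrative assumption, not the paper's exact membership.)
double reliability(double detailEnergy, double T) {
    double r = std::fabs(detailEnergy) / T;
    return r > 1.0 ? 1.0 : r;
}

// details[l] holds the detail value of the pixel at level l,
// l = 0 (finest) ... L-1 (coarsest). Returns the starting level:
// the coarsest one whose reliability degree passes the alpha-cut,
// falling back to full resolution when no coarse level qualifies.
int starting_level(const std::vector<double>& details, double T,
                   double alpha) {
    for (int l = (int)details.size() - 1; l > 0; --l)
        if (reliability(details[l], T) >= alpha)
            return l;
    return 0;
}
```

A textured pixel (large detail values at coarse levels) thus starts deep in the pyramid, while a pixel in a flat region with no reliable coarse detail is deferred to the finest levels.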

As will be shown, at the coarsest level, the variable depth multiresolution matching also makes fewer mistakes than the fixed depth approaches. Because of that, we were able to obtain good results even with a refining interval as small as four pixels wide, leading to very fast execution.

The implementation of our proposal requires complex memory management, allocating and freeing amounts of memory equivalent to several pages of the most common processors. Most operating systems lose performance under such conditions. So, also as a contribution of this work, we implemented a secondary memory management strategy that uses a buffer allocated only once, at the beginning of execution. This pre-allocated memory is then managed by our own procedure, avoiding repeated calls to the operating system for this task. This approach reduces the execution time, rendering a still faster procedure.
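A minimal sketch of such a strategy is a bump-pointer arena: the buffer is requested from the system once, subsequent allocations are a pointer increment, and release is a single reset. This is our illustration of the general idea, not the library's actual manager:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Bump-pointer arena: one buffer obtained at start-up; allocation is
// an aligned pointer increment; deallocation is a wholesale reset.
class Arena {
public:
    explicit Arena(std::size_t bytes) : buf_(bytes), used_(0) {}

    void* allocate(std::size_t bytes, std::size_t align = alignof(double)) {
        std::size_t p = (used_ + align - 1) / align * align; // align up
        if (p + bytes > buf_.size()) return nullptr;         // exhausted
        used_ = p + bytes;
        return buf_.data() + p;
    }

    void reset() { used_ = 0; }          // frees everything at once
    std::size_t used() const { return used_; }

private:
    std::vector<unsigned char> buf_;     // allocated once, up front
    std::size_t used_;
};
```

The design trades generality for speed: individual blocks cannot be freed, which suits a pipeline like ours where all per-frame structures die together at the end of a matching pass.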

The proposed technique was implemented as a C++ library and a collection of test programs. This library generates disparity maps using the default correlation method and our approach, using multi-resolution with variable depth, considering or not a search interval. Due to the complexity of this library, its implementation was divided into several modules as shown in

The

Each module is detailed in the following.

This module contains the library

This module contains the template

The data structures that store pyramids of images, regardless the technique (wavelets or 2D filtering), are created by the classes

Class

The result of the heuristic that calculates the desired depth for each pixel requires a complex data structure. We implemented linked lists that contain objects of class

By using the class

The class

Class

Class

Note that each image pixel can be represented at more than one depth. In such a case, matching must be performed at the lowest resolution depth in which the pixel is found. For example, if the sixth list contains position (1, 1), this means that, for all pixels of the original image that lie in positions $(x, y)$ with $1 \times 2^6 \le x < 2 \times 2^6$ and $1 \times 2^6 \le y < 2 \times 2^6$, the greatest level that can be used is 5 (starting from zero). It is possible for a pixel to appear twice in the lists: for instance, if position (2, 3) appears in the fourth list, then for all pixels $(x, y)$ with $2 \times 2^4 \le x < 3 \times 2^4$ and $3 \times 2^4 \le y < 4 \times 2^4$, the depth must be at most 3, and no longer 5, even if a coarser list also covers them.

The easiest way of obtaining the depth for each pixel is, thus, traversing these lists starting from the least coarse level and marking positions already visited. For that, pixels of the type

The main classes for our application are

Objects of classes

Classes

After windows are initialized, the matching is performed using

Memory allocation is always done in a transparent way to the programmer. All necessary memory is allocated at the creation of the objects of classes

An example of pyramids is shown in

We performed stereo measures using both approaches, but the use of wavelets (both Daubechies and Haar) for computing the pyramid turned out not to be as efficient for the subsequent phases as our proposal. Differently from other works [

We contrasted plain correlation and multiresolution with variable depth matching using them on two well-known pairs of images, namely the Tsukuba and Corridor data sets, and comparing the results with the available ground truth.

The matching results are compared with the desired ones in two ways, by visual analysis and by using an error metric. We use the mean error (
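Since the exact definition is truncated here, the following sketch assumes the two reported statistics are the mean and the standard deviation of the absolute per-pixel disparity differences against the ground truth:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct ErrorStats { double mean; double stddev; };

// Mean absolute disparity error and its standard deviation against a
// ground-truth map; both maps are flattened row-major and assumed to
// have the same size. (Our reading of the metric, not a quote.)
ErrorStats disparity_errors(const std::vector<double>& got,
                            const std::vector<double>& truth) {
    const std::size_t n = got.size();
    if (n == 0) return {0.0, 0.0};
    double sum = 0.0, sum2 = 0.0;
    for (std::size_t k = 0; k < n; ++k) {
        double e = std::fabs(got[k] - truth[k]);
        sum += e; sum2 += e * e;
    }
    double mean = sum / n;
    double var = sum2 / n - mean * mean;
    return {mean, std::sqrt(var > 0.0 ? var : 0.0)};
}
```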

These error measurements are insensitive to the shape of the objects, but they are not so good at describing the quality of the results in regions close to borders and edges. In those cases we use visual inspection which, on the other hand, is good at such tasks at the expense of being subjective. We therefore use these two complementary methods.

We used square correlation windows of side 3, 5, 7, 9, and 11 pixels, in order to test our approach with more than one window size. This means that, for a certain resolution level, given a pixel in one image (say the left) to be matched to a pixel in the other image (say right), a template window of a specified size will be taken around the pixel in the left image. Correlation measures will be calculated for this window with several windows of the same size taken around pixels in the epipolar line in the right image, within a certain search interval. When using the plain correlation algorithm, if a search interval is defined, it is always 70 pixels wide (not the whole epipolar line). We remark that, even with this optimization, plain correlation is still a time consuming algorithm. On the multiresolution matching, the refining interval is always 4 pixels wide.

We performed tests with two versions of our multiresolution matching. The first uses only scale images in all levels based on correlation measures. The second uses the detail images in each level and the scale images at the coarsest level, since at this level there is less detail.

Disparity maps generated by both versions of our multiresolution algorithm are shown in

With the new fuzzy heuristic, multi-resolution matching is likely to start at the lowest level where there is a border adjacent to the pixel under assessment. The correlation of the images at the coarsest depth is, thus, highly prone to errors due to occlusions. Matching the details, instead of the raw images, should, in principle, lead to higher resistance to occlusions. That behavior was confirmed in our experiments, as the results obtained matching the scale images at each level were consistently better than those that employed detail information.

Here we contrast plain correlation with multiresolution algorithm. Disparity maps obtained by both algorithms are shown in

We performed experiments with both approaches for window sizes of 3, 5, 7, 9 and 11. The standard deviation and mean distance of the measured errors for the multiresolution approach with variable depth are shown in

We observe that larger windows generate smaller errors in both approaches. Multiresolution incurred smaller errors than plain correlation in most cases, while making mistakes about as often. Plain correlation spreads its errors over bigger areas than our algorithm, which is hard to visualize in the disparity figures. Overall, by these results, our approach performed better than plain correlation.

We tested both algorithms also in the Corridor image, and the results are shown in

The time needed for the matching processes is shown in

We have proposed a new approach to multiresolution stereo matching in which the starting level varies as a function of the image content. That is, in a given region, for example a smooth one without edges, our algorithm starts at coarser (deeper) levels in order to improve precision; in regions with edges or rich texture, it starts at finer (lower) levels, thus achieving better execution times. Our approach uses fuzzy logic to define the level at which to start the matching for each image region. By the results, this fuzzy decision process has proven to be excellent for this calculation.

The ideal value for

The ideal window size is also dependent on the amount of texture in the original image pair. This parameter can also be estimated using a procedure similar to the one proposed for

Initial experiments using wavelets to calculate the multiresolution pyramid were not good enough, due to the use of the detail coefficients. We then decided to apply a sub-band filtering based on a low-pass Gaussian and a high-pass Laplacian mask to generate the two multiresolution pyramids: one of images and another of details. With this approach, stereo matching performed much better, that is, faster and with better precision in the stereo measurements.

The main contribution of this work is the multiresolution approach, which differs from usual methods, as seen above, by using a new fuzzy logic heuristic for calculating the starting level.

Our algorithm was able to generate disparity maps faster than plain correlation, with smaller errors. We conjecture that the use of Gaussian and Laplacian masks reduced even further the errors that occur close to borders. That is, those filters have a smoothing effect in such regions, allowing the algorithm to better treat occlusions.

Recent research on stereo matching based on multi-resolution and fuzzy techniques has been conducted, as discussed in Section 2. However, when facing the problem of real-time stereo matching, as in robotic vision, correlation based algorithms are known to be the best [

In the fast multi-resolution approach [

The fuzzy approach by Kumar and Chatterji [

These two techniques are, therefore, outperformed by our proposal when both precision and performance are required.

The authors would like to thank Brazilian Sponsoring Agency CNPq for the grants of Marcos Medeiros, Luiz Gonçalves and Alejandro C. Frery.

Creation of a pyramid with wavelet transform.

Illustration of the creation of a pyramid with three levels.

Cartoon image and

Scheme of the software architecture.

Computed pyramids. Left to right: original image, Daubechies wavelet levels, and levels computed by our proposal.

Tsukuba data set. From left to right: left image, right image, desired disparity map.


Disparity maps generated by multiresolution matching using the detail images at the coarsest level (left), and using always the scale images (right).

Errors measured with both algorithms: mean distance

Disparities obtained by plain correlation (right) and multiresolution (left) with correlation windows of size 3 (top) and 5 (bottom) pixels, using

Measured errors for multiresolution with variable depth: Tsukuba pair.

Measured errors for plain correlation with no search interval: Tsukuba pair.

Visual comparison between disparity maps generated by correlation (right column) and multiresolution matching with

Visual comparison for the Corridor images between disparity maps generated by correlation (right column) and multiresolution matching with

Disparity maps generated by multiresolution matching with

Disparity maps generated, for the Corridor pair, by correlation (right column) and multiresolution matching with

Time needed for computing the disparity by our approach in the Corridor pair.

Error and standard deviation for the Corridor images.

Required time.

Disparity maps, Kumar and Chatterji algorithm, for window of sizes 3, 5, 7, 9, and 11 (from top to bottom).

Performance measures, Kumar and Chatterji’s algorithm, as a function of the window size.

Window Size | Mean Error | Standard Deviation | Execution Time
---|---|---|---
3 | 14.12 | 19.74 | 21.00
5 | 10.66 | 16.04 | 53.31
7 | 8.92 | 13.96 | 98.48
9 | 8.09 | 13.07 | 161.06
11 | 7.61 | 12.56 | 241.15