1. Introduction
Stereo correspondence is a fundamental building block in many computer vision tasks, such as 3D reconstruction, navigation, and recognition [1,2,3], and has been studied extensively over the last two decades. A typical pipeline for finding matching pixels in a rectified stereo pair has four steps: building a cost volume for the reference image over all candidate disparities, aggregating costs in a neighborhood to filter out noise, assigning a label to each pixel, and post-processing to enhance the result. The aim is to find a locally smooth solution whose discontinuities are aligned with the edges in the reference image. Traditional stereo matching approaches can be categorized into local filtering [4,5,6,7,8] and global optimization approaches [9,10,11,12].
Local filtering methods compute a weighted average or sum of matching costs in a support window, where the weight between neighboring pixels depends on their intensity similarity and spatial affinity. Local edge-aware filters, such as the bilateral filter (BF) [13] and the guided filter (GF) [14], produce appealing results for highly textured images. However, these methods incorporate information only from a local support region that is not geometrically adaptive, and they cannot properly handle pixels in homogeneous regions. To aggregate information over the whole image, Yang [7,15] proposed the non-local filter (NL), which treats the reference image as an undirected, 4-connected graph and extracts a minimum spanning tree (MST) from this graph by removing edges with large gradients. The aggregation can be implemented by traversing the MST in two passes, first from the leaf nodes to the root node and then from the root node to the leaf nodes. The segment-tree (ST) of Mei et al. [16] aims to enforce tight connections between pixels in a local region, but the structure of the tree used for message propagation depends heavily on super-pixel segmentation [17]. The recursive non-local filter (RNLF) [18] builds four trees for the input image based on the relative spatial relationships of neighboring pixels, and uses the Chebyshev distance to compute the weight between any two pixels. However, the intensity distance between two pixels measured along the tree is much larger than the direct intensity difference between them. Weights in highly textured regions therefore decrease rapidly as the spatial distance increases, which prevents informative messages from propagating over a wide range. Although these cost filtering methods produce appealing results for highly textured stereo pairs, they struggle to resolve the ambiguity in homogeneous regions or tend to overuse the piecewise-constant assumption.
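The two-pass traversal used by tree-based aggregation can be sketched as follows for a single disparity level. The sketch assumes the tree is given as a parent array with the root at index 0 and every parent preceding its children, and that the similarity between a node and its parent has the exponential form exp(-w / sigma) for edge weight w; the function name, argument layout, and sigma value are illustrative assumptions rather than the interface of any cited implementation.

```python
import math


def tree_aggregate(cost, parent, edge_weight, sigma=0.1):
    """Aggregate per-node costs over a tree in two passes.

    cost[v]        matching cost of node v at one disparity level
    parent[v]      parent index of v, -1 for the root (assumed: parent[v] < v)
    edge_weight[v] intensity difference on the edge (parent[v], v)
    """
    n = len(cost)
    sim = [0.0 if parent[v] < 0 else math.exp(-edge_weight[v] / sigma)
           for v in range(n)]

    # Pass 1, leaves to root: each node collects the costs of its subtree.
    up = list(cost)
    for v in range(n - 1, 0, -1):
        up[parent[v]] += sim[v] * up[v]

    # Pass 2, root to leaves: each node adds support from the rest of the tree.
    agg = list(up)
    for v in range(1, n):
        agg[v] = sim[v] * agg[parent[v]] + (1.0 - sim[v] ** 2) * up[v]
    return agg


# Toy chain 0 - 1 - 2 with hypothetical costs and edge weights.
agg = tree_aggregate([1.0, 2.0, 4.0], [-1, 0, 1], [0.0, 0.1, 0.2], sigma=0.1)
```

After both passes, each entry of `agg` equals the sum over all nodes of their cost multiplied by the product of the edge similarities on the connecting path, obtained in time linear in the number of nodes.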
Global methods minimize a global energy function composed of two terms, a data term and a smoothness term. The data term ensures that two matching pixels are similar, while the smoothness term encourages the discontinuities in the disparity image to align with edges in the reference image. A popular way to minimize this energy is to use graph-based energy minimization methods in the Markov random field (MRF) framework [19,20], for example graph cut (GC) [10,11] and belief propagation (BP) [12,21]. These methods treat the reference image as an undirected graph and pass messages across the entire graph to compute a maximum a posteriori (MAP) estimate. Although many improvements have been made to the efficiency and convergence rate of these global methods, they remain computationally intensive.
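For concreteness, a generic form of such an energy can be written as below. The notation is an assumption made for illustration and is not taken from a specific cited method: D is the disparity map, C(p, D_p) is the matching cost of pixel p at disparity D_p, N is the set of neighboring pixel pairs, V penalizes disparity differences between neighbors, w_pq is an edge-aware weight that is small across strong intensity edges, and lambda balances the two terms.

```latex
E(D) = \sum_{p} C(p, D_p)
     + \lambda \sum_{(p,q) \in \mathcal{N}} w_{pq} \, V(D_p, D_q)
```

The first sum is the data term and the second is the smoothness term; lowering w_pq at image edges is one common way to let disparity discontinuities coincide with those edges.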
Semi-global matching (SGM) [22] is an efficient strategy for minimizing a global energy function by approximating a 2D MRF minimization with multiple 1D optimizations. Inference along each scan line is performed separately, and the outputs from multiple directions are fused to determine the label of each pixel. Because the 1D optimizations along the scan lines in each pass are independent of each other, several approaches [23,24,25,26,27,28] take advantage of field-programmable gate arrays (FPGAs) or graphics processing units (GPUs) to accelerate SGM for real-time applications. However, only pixels on the scan lines that intersect at the current pixel of the reference image contribute to the aggregated cost of that pixel (the root node), which degrades the performance of SGM under challenging conditions. Another shortcoming of SGM is that two adjacent pixels share support pixels only along the scan line they have in common. When the matching costs on this line are unreliable, messages from other directions can produce different results for these pixels, resulting in streak artifacts in the disparity image. SGM-forest [29] treats the solutions from multiple directions as independent disparity proposals and formulates the fusion procedure as a classification problem that chooses the optimal estimate from the given proposals. MGM [30] takes messages from the nodes visited in the previous scan line into account, aiming to make full use of 2D information in cost aggregation along the 1D path. However, it overemphasizes information from neighboring pixels and prevents informative messages from propagating over a wide range to handle pixels in weakly textured areas. Triple-SGM [31] extends SGM to three images from a triplet stereo rig composed of a horizontal and a vertical camera pair. SGM-Net [32] learns the penalties between neighboring pixels using convolutional neural networks (CNNs). In our approach, useful information is propagated in a certain direction along each tree, and all pixels on the tree contribute to the aggregated cost of the root node, so our method both reduces the streak artifacts of traditional SGM and alleviates the ambiguities in homogeneous regions.
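The 1D optimization along one path can be sketched with the standard SGM recurrence, in which a disparity change of one level between neighboring pixels is penalized by P1 and any larger change by P2. The code below is an illustrative sketch for a single scanline with two opposite paths; the function names and the penalty values are assumptions made for this example, and a full implementation would fuse more path directions.

```python
def sgm_scanline(cost, p1=1.0, p2=8.0):
    """Aggregate costs along one left-to-right path.

    cost[x][d] is the matching cost of pixel x at disparity d. The result has
    the same shape and follows
    L(x, d) = C(x, d) + min(L(x-1, d), L(x-1, d-1) + P1, L(x-1, d+1) + P1,
                            min_k L(x-1, k) + P2) - min_k L(x-1, k).
    """
    num_disp = len(cost[0])
    inf = float("inf")
    agg = [list(cost[0])]
    for x in range(1, len(cost)):
        prev = agg[-1]
        prev_min = min(prev)
        row = []
        for d in range(num_disp):
            best = min(
                prev[d],
                prev[d - 1] + p1 if d > 0 else inf,
                prev[d + 1] + p1 if d + 1 < num_disp else inf,
                prev_min + p2,
            )
            # Subtracting prev_min keeps the values bounded along the path.
            row.append(cost[x][d] + best - prev_min)
        agg.append(row)
    return agg


def fuse_paths(cost, p1=1.0, p2=8.0):
    """Sum the left-to-right and right-to-left path costs of one scanline."""
    forward = sgm_scanline(cost, p1, p2)
    backward = sgm_scanline(cost[::-1], p1, p2)[::-1]
    return [[f + b for f, b in zip(fr, br)] for fr, br in zip(forward, backward)]


# Toy input: the middle pixel is ambiguous, both neighbors prefer disparity 1.
cost = [[5, 0, 5], [2, 2, 2], [5, 0, 5]]
fused = fuse_paths(cost)
disparity = [row.index(min(row)) for row in fused]
```

In this toy example the middle pixel has identical costs at all disparities, and the messages from its two neighbors resolve it to disparity 1. Only pixels on the same scanline influence the result, which is the limitation discussed above.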
In this paper, we propose a new version of SGM, named omni-directional SGM (OmniSGM), which can be viewed as performing 1D optimization along all directions. We also present an iterative cost update scheme that uses the aggregated cost from the previous pass to improve the robustness of the initial matching cost. Specifically, our method performs SGM along tree structures in four directions, namely left-to-right, right-to-left, top-to-bottom, and bottom-to-top, as shown in the last row of Figure 1. In each pass, we recursively estimate the contribution of each pixel on the tree from the leaf nodes to the root node, so that all pixels on the tree contribute to the aggregated cost of the root node. We then fuse the outputs of the four trees to obtain the final aggregated cost; each pixel thus obtains support from pixels in the whole image, which alleviates some limitations of SGM, such as streak artifacts. Compared with SGM-based methods that incorporate information from multiple scan lines, our method can be regarded as aggregating information from all pixels along all directions. To fully exploit the reliable information in the aggregated cost volume, we integrate it with the initial cost volume according to the confidence of each pixel. With this successive cost volume update scheme, the initial cost volume becomes more robust, and reliable information tends to propagate extensively across the entire image. In the post-processing step, we extend the widely used non-local refinement method [15] to efficiently propagate disparities from stable pixels to unstable pixels.
The rest of this paper is organized as follows. Section 2 first introduces the traditional semi-global matching method and then describes the proposed omni-directional SGM, the cost volume update scheme, and the efficient refinement strategy. Parameter settings and extensive experiments on widely used data sets are provided in Section 3. Conclusions and remarks are given in Section 4.
4. Conclusions and Remarks
In this paper, we present a novel omni-directional semi-global stereo matching framework. Messages propagate along all directions, and each pixel obtains support from pixels in the whole image. The contribution of each pixel can be computed recursively along the tree structures. Specifically, we divide the entire image into four parts and compute the contributions of pixels on four tree structures, namely the trees to the left, right, top, and bottom of the root node, and then fuse the results to obtain contributions from pixels in the whole image. We also propose a cost volume update scheme that enhances the robustness of the initial cost volume, so that the quality of the disparity image can be improved in the following pass. Finally, an efficient strategy for propagating stable disparities along the MST is presented for disparity refinement.
We validate the effectiveness of our method on challenging datasets and find that a stereo matching algorithm can benefit from combining handcrafted features with feature maps from a CNN, as each has strengths in handling pixels in different regions. We will pursue this direction in future work.