Hierarchical Guided-Image-Filtering for Efficient Stereo Matching

Featured Application: Potential applications of the work include autonomous navigation, 3D reconstruction, and vision-based object handling.

Abstract: Stereo matching is complicated by the uneven distribution of textures on the image pairs. We address this problem by applying the edge-preserving guided-image-filtering (GIF) at different resolutions. In contrast to most multi-scale stereo matching algorithms, the parameters of the proposed hierarchical GIF model are combined in an innovative weighted-combination scheme to generate an improved matching cost volume. Our method draws its strength from exploiting texture at various resolution levels and performing an effective mixture of the derived parameters. This novel approach advances our recently proposed algorithm, the pervasive guided-image-filtering scheme, by equipping it with hierarchical filtering modules, leading to disparity images with more details. The approach ensures that as many different-scale patterns as possible are involved in the cost aggregation and hence improves matching accuracy. The experimental results show that the proposed scheme achieves the best matching accuracy when compared with six well-recognized cutting-edge algorithms using version 3 of the Middlebury stereo evaluation data sets.


Introduction
Stereo vision aims at providing rich distance information of the captured scenes via image pairs. This is normally accomplished by matching algorithms that generate dense disparity maps. The maps can be transformed into three-dimensional information of the scene by the principle of triangulation, with many potential applications such as autonomous navigation, 3D reconstruction, and vision-based object handling.
Although the stereo matching problem has been under extensive research for decades, it is still difficult to obtain accurate matching under ill-posed conditions such as texture-less regions, repeated patterns, occlusion areas, and reflective surfaces. The current stereo matching algorithms can mainly be divided into two categories: the conventional matching algorithms [1] and the deep-learning-based stereo matching approaches.
Stereo matching algorithms based on deep learning regard the derivation of the disparity map as a classification problem or a regression problem. For instance, Zbontar [2] used convolutional neural networks (CNNs) to estimate the similarity of image blocks and used these measures as the matching cost in a traditional stereo matching algorithm. Similarly, Nahar [3] proposed unsupervised pre-trained networks to estimate hierarchical features and combined them with a pixel-based intensity matching cost in a global energy minimization framework for dense disparity estimation. By combining a disparity estimation network with a CNN that was trained on a synthetically generated dataset, Mayer [4] demonstrated the effectiveness of deep learning in stereo matching. Pang [5] proposed a cascaded CNN architecture that is composed of two stages: the first stage advances the work of Reference [4] by equipping it with extra up-convolution modules, while the second stage generates residual signals across multiple scales. The summation of the outputs from the two stages gives the final disparity. Kendall [6] used deep unary features to compute a stereo matching cost volume. In this approach, disparity values are regressed from the cost volume, which is aggregated using 3D convolutions.
Another way to implement deep-learning-based stereo matching is to use the networks to exploit context information. For example, Chang [7] developed a spatial pyramid pooling module to aggregate context at different scales to form a cost volume. The cost volume is regularized by a stacked network to further improve the utilization of global context information. Besides, Williem [8] used the deep learning technique for the cost volume aggregation based on self-guided filtering.
Deep-learning-based methods are promising as they can apply high-level object detection as a guideline for within-object matching. However, most of the current schemes use supervised learning methods that assume the true disparity is known in advance. This assumption is impractical for many applications [9]. Moreover, these approaches might be invalid for an unknown environment and cannot be easily transplanted to robotic and embedded systems [10].
The conventional stereo matching approaches are classified as global or local according to the construction of an objective function that rates the degree of match between an image pair [1]. The objective function of the global methods consists of a data term (the measurement part) and a regularization term (the penalty part). The data term designates the similarity between aggregated matching costs of pixels on the images, and the regularization term is included to provide constraints from neighboring pixels. Belief propagation [11] and dynamic programming [12,13] are the major global methods. However, global approaches require substantial computing resources and are generally not suitable for real-time applications.
In contrast to the global approaches, the objective function of the local methods contains only the measurement part. The local methods generally perform the stereo matching in four stages [1]: (1) calculation of the preliminary matching cost, (2) aggregation of the cost over support windows, (3) estimation of the disparity, and (4) refinement of the disparity. Among them, the cost aggregation step is usually transformed into an image filtering procedure of the matching cost, and the disparity maps are obtained by the winner-takes-all method [1]. The local methods require less computation and are popular for fast disparity calculations.
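The first three stages of a local method can be illustrated on a single scanline. The following is only a toy sketch, not the scheme of this paper: it assumes an absolute-intensity-difference cost, a plain box filter for aggregation, and the common rectified convention in which left pixel x is compared with right pixel x - d. The function name and the out-of-range cost value are hypothetical.

```python
def local_stereo_scanline(left, right, max_disp, radius=1):
    """Toy 1-D local stereo matcher: cost, box aggregation, winner-takes-all."""
    n = len(left)
    big = 255.0  # assumed cost when the candidate match falls outside the image
    # Stage 1: preliminary matching cost (absolute intensity difference).
    cost = [[abs(left[x] - right[x - d]) if x - d >= 0 else big
             for x in range(n)]
            for d in range(max_disp + 1)]
    # Stage 2: aggregate each disparity slice over a support window (box filter).
    agg = []
    for d in range(max_disp + 1):
        row = []
        for x in range(n):
            lo, hi = max(0, x - radius), min(n, x + radius + 1)
            row.append(sum(cost[d][lo:hi]) / (hi - lo))
        agg.append(row)
    # Stage 3: winner-takes-all, i.e. the disparity with the smallest aggregated cost.
    return [min(range(max_disp + 1), key=lambda d: agg[d][x]) for x in range(n)]


# Hypothetical scanlines: the right line is the left line shifted by two pixels.
left_line = [0, 0, 10, 20, 30, 0, 0, 0]
right_line = [10, 20, 30, 0, 0, 0, 0, 0]
disparities = local_stereo_scanline(left_line, right_line, max_disp=3)
```

The refinement stage is omitted here; edge-preserving filters such as GIF replace the box filter of stage 2.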
Cost aggregation is crucial for matching performance in the local algorithms. Bilateral filtering [14] is among the early approaches, but its computational complexity increases with the support-window size. Later, tree filtering [15], domain transformation [16], the recursive edge-aware filter [17], and full-image guided filtering [18] were proposed to decouple the computational complexity from the support-window size. However, these approaches all suffer from the weight-decay problem when there is a significant intensity difference between neighboring pixels. This behavior deteriorates information propagation and impairs the resulting matching performance.
Hosni [19] suggested treating the generation of disparity as a labeling problem, which is implemented by constructing a cost volume, filtering the cost volume, and then selecting labels by winner-takes-all. Along this line, the guided-image-filter (GIF) [20] is widely used in cost volume filtering because it can generate clear edge profiles free from gradient-reversal artifacts. Later, Li [21] introduced an edge-aware weighting, denoted as the weighted guided-image-filtering (WGIF), to improve GIF. Kou [22] proposed a gradient-domain guided-image-filter (GDGIF) to reduce halo artifacts by incorporating an explicit first-order edge-aware constraint. Nevertheless, due to the lack of pixel information outside the fixed window, the implementation of WGIF by Hong [23] results in restricted performance.
To remove the fixed-window limitation, approaches with adaptive guided filters were proposed [24]. However, information outside the support windows is still missing. In a recent paper [25], we introduced weights that take both distance and intensity differences into account to extend the scheme of GIF. We called our approach the pervasive guided-image-filtering, denoted as PGIF [25], which exploits the whole image for aggregation.
Also, recent years have seen the development of coarse-to-fine (CTF) strategies to enhance the stereo matching accuracy. For instance, Hu [26] proposed to reduce the search space of local stereo matching by introducing a candidate set of neighbor pixels. Tang [27] introduced a multi-scale pixel feature vector to provide effective matching under radiometric differences. Advanced techniques to find an improved disparity range were provided in the work of Li [28] by recursive multi-scale decomposition. These methods assume the existence of disparity consistency. In contrast to these approaches, the matching cost integration method of Zhang [29] resulted in superior matching, where cost aggregation is formulated to enforce the consistency of the cost volume among the neighboring scales.
Inspired by the multi-scale scheme of Zhang [29], we extend the pervasive guided-image-filtering, PGIF [25], to exploit the cross-scale features in the cost volume. In our approach, the consistency in the neighboring-scale direction is imposed on the GIF parameters, rather than on the cost volume as in Zhang [29]. We call it hierarchical guided-image-filtering, denoted as HGIF.
The main contributions of this paper can be summarized as follows: (1) We created an innovative aggregation approach that efficiently combines the model parameters of PGIF [25] to allow the features of the image pairs at different resolutions to be considered; (2) The scheme is unique in its parameter-based aggregation, rather than the cost-volume-based approaches in the current literature, allowing efficient calculation with superior performance; (3) The proposed scheme outperforms most of the state-of-the-art algorithms in terms of disparity accuracy even without the refinement procedure.

The Cost Aggregation Based on the Pervasive Guided-Image-Filtering (PGIF)
This section presents the basic procedure of the pervasive guided-image-filtering (PGIF) [25] scheme. It also defines some of the variables and algorithms used by the proposed scheme in the next section.
In the GIF-based stereo matching algorithms discussed in the last section, such as WGIF [21], GDGIF [22], and PGIF [25], the aggregated cost, $C_{d,0}(p)$, is a linear function of the guidance image, $I_G$, at a local disparity patch (Equation (1)). In order to solve for the model parameters $a_{d,0}(p)$ and $b_{d,0}(p)$, we proposed an objective function in Reference [25] (Equation (3)), which includes a weighted sum of the squared difference between the linear model and a primary matching cost, denoted as $C_{d,0}(q)$, and a regularization term. Here, $\Omega_0$ is the support window for a pixel located at $p$, $\omega_0(p, q)$ is the weight that reflects the contribution of a pixel located at $q$, and $\varepsilon(p)$ is the regularization factor that limits the magnitude of $a_{d,0}(p)$.
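For orientation, the linear model and the objective function described in words above would plausibly take the following form. This is a reconstruction from the verbal description, not a copy of the published Equations (1) and (3): the primed symbol for the primary matching cost (used only to keep it distinct from the aggregated cost) and the placement of the regularization term outside the sum are assumptions.

```latex
% Linear model: aggregated cost as a linear function of the guidance image
C_{d,0}(p) = a_{d,0}(p)\, I_G(p) + b_{d,0}(p)

% Assumed form of the objective: weighted squared error plus regularization
E\bigl(a_{d,0}(p),\, b_{d,0}(p)\bigr)
  = \sum_{q \in \Omega_0} \omega_0(p, q)\,
    \bigl(a_{d,0}(p)\, I_G(q) + b_{d,0}(p) - C'_{d,0}(q)\bigr)^2
  + \varepsilon(p)\, a_{d,0}(p)^2
```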
Because of an efficient iterative computation scheme, to be presented below, the support window is the whole image. The primary matching cost, $C_{d,0}(q)$, represents the degree of match between a pixel located at $q$ of the left image and a pixel located at $q + d$ of the right image. This cost is smaller for a higher degree of match. We employ a truncated version of the absolute gradient difference to define the cost, in which $I_{L,0}$ and $I_{R,0}$ are the left and right images of the stereo pair in the original resolution, $\nabla_x$ and $\nabla_y$ are the horizontal and vertical gradients, respectively, and $\tau$ is the truncation threshold, normally set to 2. The threshold helps to reduce mismatches in obscured or noisy regions.
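A truncated gradient-difference cost of this kind can be sketched as follows. The sketch makes several assumptions that the text does not fix: gradients are forward differences, the horizontal and vertical absolute differences are summed before truncation, the disparity shift is applied along the horizontal axis only, and matches that fall outside the image receive the truncation value. The function names are hypothetical.

```python
def gradients(img):
    """Forward-difference horizontal and vertical gradients (zero at the far border)."""
    h, w = len(img), len(img[0])
    gx = [[img[y][x + 1] - img[y][x] if x + 1 < w else 0.0 for x in range(w)]
          for y in range(h)]
    gy = [[img[y + 1][x] - img[y][x] if y + 1 < h else 0.0 for x in range(w)]
          for y in range(h)]
    return gx, gy


def truncated_gradient_cost(left, right, d, tau=2.0):
    """Cost slice for disparity d: truncated sum of absolute gradient differences."""
    h, w = len(left), len(left[0])
    lgx, lgy = gradients(left)
    rgx, rgy = gradients(right)
    # Assumption: matches outside the image are given the truncation value.
    cost = [[tau] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            xr = x + d  # matching column in the right image, following the text's q + d
            if 0 <= xr < w:
                diff = abs(lgx[y][x] - rgx[y][xr]) + abs(lgy[y][x] - rgy[y][xr])
                cost[y][x] = min(diff, tau)
    return cost


# Hypothetical 2 x 4 image used for both views.
image = [[0, 1, 3, 6], [0, 1, 3, 6]]
cost_d0 = truncated_gradient_cost(image, image, d=0)
cost_d1 = truncated_gradient_cost(image, image, d=1)
```

Stacking the slices over all candidate disparities gives the primary cost volume with dimensions (x, y, d).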
The optimum values of the parameters in Equation (3), represented as $a^*_{d,0}(p)$ and $b^*_{d,0}(p)$, are obtained by minimizing the objective function of Equation (3). In PGIF [25], the weight $\omega_0(p, q)$ is decomposed into horizontal and vertical weighting factors. With the convention that $q = (i, j)$ and $p = (x, y)$, the weighting factors, $W^H_{i,x,0}$ and $W^V_{j,y,0}$, can be calculated recursively, where the parameter $\beta$ is a constant factor. By this scheme, the weight $\omega_0(p, q)$ depends on both the spatial and intensity differences, and the contribution of any pixel located at $q$ to the pixel located at $p$ can be effectively involved. Besides, the introduction of the function $f$ in the recursion alleviates the effects of abrupt changes in intensity, enhancing the immunity to noise caused by the recursive calculation.
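Minimizing a weighted least-squares objective with a penalty on the slope has a closed-form solution. The sketch below assumes the objective is the sum over q of w_q (a I_q + b - C_q)^2 plus eps a^2, for one pixel p and one disparity; under that assumption, setting both partial derivatives to zero gives a slope equal to the weighted covariance divided by the weighted variance plus eps over the weight sum. The function name and the flat-list interface are hypothetical, and the recursive PGIF weights are not reproduced here.

```python
def weighted_linear_fit(weights, guide, cost, eps=0.0):
    """Return (a, b) minimizing sum_q w_q (a*I_q + b - C_q)^2 + eps * a^2."""
    w_sum = sum(weights)
    mean_i = sum(w * i for w, i in zip(weights, guide)) / w_sum
    mean_c = sum(w * c for w, c in zip(weights, cost)) / w_sum
    cov_ic = (sum(w * i * c for w, i, c in zip(weights, guide, cost)) / w_sum
              - mean_i * mean_c)
    var_i = sum(w * i * i for w, i in zip(weights, guide)) / w_sum - mean_i ** 2
    # Note: the denominator is zero for a constant guide when eps == 0.
    a = cov_ic / (var_i + eps / w_sum)
    b = mean_c - a * mean_i
    return a, b


# Hypothetical samples lying exactly on C = 2 * I + 1.
a_fit, b_fit = weighted_linear_fit([1, 1, 1, 1], [0, 1, 2, 3], [1, 3, 5, 7])
a_reg, _ = weighted_linear_fit([1, 1, 1, 1], [0, 1, 2, 3], [1, 3, 5, 7], eps=1.0)
```

A positive eps shrinks the slope toward zero, which is the role the text assigns to the regularization factor.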

Stereo Matching Based on Hierarchical Guided-Image-Filtering (HGIF)
A major shortcoming of approaches that start from an energy function such as Equation (3), including WGIF [21], GDGIF [22], and PGIF [25], is the restriction of the cost metrics to local regions of the support window $\Omega_0(p)$. Even though PGIF [25] exploits the whole image for matching, the effects of patterns far from the pixel under aggregation decay with distance.
The proposed scheme begins with a down-sampling of the original image pair, $I_{L,0}$ and $I_{R,0}$, where the left image is selected as the reference image and the right image as the target image. The images are repeatedly down-sampled by a factor of 2, and we denote the resultant image pairs as $I_{L,z}$ and $I_{R,z}$ for $z \in \{0, 1, \ldots, K\}$, with $K$ being the coarsest level.
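Building the image pyramid can be sketched as below. The down-sampling filter is not specified in the text, so 2 x 2 block averaging is an assumption made here for illustration, and the function names are hypothetical.

```python
def downsample_by_two(img):
    """Halve both dimensions by averaging each 2 x 2 block (assumed filter)."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [[(img[2 * y][2 * x] + img[2 * y][2 * x + 1]
              + img[2 * y + 1][2 * x] + img[2 * y + 1][2 * x + 1]) / 4.0
             for x in range(w)]
            for y in range(h)]


def build_pyramid(img, K):
    """Return levels z = 0..K, where level 0 is the original image."""
    levels = [img]
    for _ in range(K):
        levels.append(downsample_by_two(levels[-1]))
    return levels


# Hypothetical 4 x 4 image, down-sampled twice (K = 2).
image = [[float(4 * y + x) for x in range(4)] for y in range(4)]
pyramid = build_pyramid(image, K=2)
```

The same routine would be applied to both the left and the right image so that each level z holds a matching pair.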
Firstly, the optimal parameters, $a^*_{d,z}(q)$ and $b^*_{d,z}(q)$, of each image pair, $I_{L,z}$ and $I_{R,z}$, are calculated using Equation (6). We define a set of unknown parameters, $\hat{a}_{d,z}(p)$ and $\hat{b}_{d,z}(p)$, as the aggregated parameters to be found. In order to aggregate the cross-scale patterns, we create an objective function $F_d(p)$ (Equation (9)) that is the weighted mean-square error between the optimized parameters, $a^*_{d,z}(q)$ and $b^*_{d,z}(q)$, and the unknown parameters, $\hat{a}_{d,z}(p)$ and $\hat{b}_{d,z}(p)$, throughout each scale. The weight $\omega_z(p, q)$ is defined similarly to Equations (7) and (8), with the subscript 0 replaced by $z$. This objective function inherits the weighting scheme of Reference [25], which reflects the effects of distance and intensity difference, and additionally applies constraints on the GIF parameters in the scale direction. In Equation (9), the positive constant $\gamma$ is a constraint factor on the squared difference between the parameters $\hat{a}_{d,z}(p)$ and $\hat{a}_{d,z-1}(p)$, and between $\hat{b}_{d,z}(p)$ and $\hat{b}_{d,z-1}(p)$, respectively. The constraint weights, $\gamma^z$ for $z \in \{0, 1, \ldots, K\}$, are larger between layers that are far from the original image, layer 0, if $\gamma > 1$. This trend is reversed for $\gamma < 1$.
There are in total $2(K + 1)$ unknown parameters in Equation (9), namely $\hat{a}_{d,0}(p)$, $\hat{b}_{d,0}(p)$, ..., $\hat{a}_{d,K}(p)$, and $\hat{b}_{d,K}(p)$. They can be obtained by setting the partial derivatives of Equation (9) with respect to these parameters to zero (Equation (10)). Equation (10) can be simplified into Equations (11) and (12), where $g_z$ is a nominal average function of the parameters, $a^*_{d,z}(q)$ and $b^*_{d,z}(q)$, with scale-dependent weights $\omega_z(p, q)$ (Equation (13)). The aggregated parameters, $\hat{a}_{d,0}(p)$ and $\hat{b}_{d,0}(p)$, can be solved using the system of linear Equations (11) and (12). Taking the case of $K = 2$ (down-sampled twice) and $\gamma = 1.5$ as an example, Equation (11) can be written in matrix form and solved for $\hat{a}_{d,0}(p)$ (Equation (14)). Likewise, $\hat{b}_{d,0}(p)$ is obtained from Equation (12). Thus, these aggregated parameters, $\hat{a}_{d,0}(p)$ and $\hat{b}_{d,0}(p)$, include the effects of features that come from different resolutions. These parameters are three-dimensional with dimensions $(x, y, d)$, where the $x$ and $y$ dimensions are in the image plane and the $d$ dimension is in the disparity direction.
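The structure of such a cross-scale linear system can be illustrated with a small solver. The sketch below does not reproduce the published Equations (9) to (14). It assumes an objective of the form sum over z of (a_hat_z - g_z)^2 plus sum over z >= 1 of gamma^z (a_hat_z - a_hat_{z-1})^2, with normalized weights, which leads to a symmetric tridiagonal system A a_hat = g. Because A depends only on K and gamma, the mixing weights for the finest scale can be computed once in advance. All function names are hypothetical.

```python
def cross_scale_matrix(K, gamma):
    """Tridiagonal matrix of the assumed cross-scale normal equations."""
    n = K + 1
    A = [[0.0] * n for _ in range(n)]
    for z in range(n):
        A[z][z] = 1.0
        if z > 0:  # coupling with the finer neighbor, weight gamma**z
            A[z][z] += gamma ** z
            A[z][z - 1] = -(gamma ** z)
        if z < K:  # coupling with the coarser neighbor, weight gamma**(z + 1)
            A[z][z] += gamma ** (z + 1)
            A[z][z + 1] = -(gamma ** (z + 1))
    return A


def solve(A, rhs):
    """Gauss-Jordan elimination with partial pivoting for a small dense system."""
    n = len(A)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [u - f * v for u, v in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]


def finest_scale_weights(K, gamma):
    """Weights w_z with a_hat_0 = sum_z w_z * g_z (row 0 of the inverse of A)."""
    A = cross_scale_matrix(K, gamma)
    # A is symmetric, so row 0 of its inverse equals the solution of A x = e_0.
    return solve(A, [1.0] + [0.0] * K)


weights = finest_scale_weights(K=2, gamma=1.5)
```

Under these assumptions the weights sum to one and decrease toward the coarser scales, so the finest-scale estimate dominates while still receiving contributions from every level.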
After obtaining the aggregated parameters, the cost volume $\hat{C}_d(p)$ is calculated according to Equation (1). This calculation is conducted using elementwise multiplications and additions in the image plane for each disparity. Finally, the disparity map, $D(p)$, can be obtained using the winner-take-all (WTA) procedure of Equation (2).

The procedure of our proposed algorithm for stereo matching is depicted in Figure 1, taking $K = 2$ as an example, and summarized in the following steps:

1. Down-sample the image pairs to build a pair of image pyramids, $I_{L,z}$ and $I_{R,z}$ for $z \in \{0, 1, \ldots, K\}$, with $K + 1$ levels of resolution;
2. Calculate the weight matrix $\omega_z(p, q)$ as the product of two weighting factors, $W^H_{i,x,z}$ and $W^V_{j,y,z}$, which are recursively calculated for $z \in \{0, 1, \ldots, K\}$ (Equations (19) and (20));
3. Generate the primary matching cost volume for each scale $z \in \{0, 1, \ldots, K\}$;
4. Find the optimum parameters, $a^*_{d,z}(p)$ and $b^*_{d,z}(p)$, for each resolution $z \in \{0, 1, \ldots, K\}$;
5. Calculate the nominal average functions of the parameters, $g_z(a^*_{d,z}(p))$ and $g_z(b^*_{d,z}(p))$, according to Equation (13) for $z \in \{0, 1, \ldots, K\}$;
6. Solve for $\hat{a}_{d,0}(p)$ and $\hat{b}_{d,0}(p)$ based on the system of linear Equations (11) and (12);
7. Aggregate the cost volume across multiple scales according to Equation (17);
8. Find the disparity map using the WTA procedure of Equation (18).
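The last two steps, the linear-model cost volume and the WTA selection, can be sketched as follows. The sketch assumes the aggregated parameter volumes are already available as nested lists indexed as [d][y][x] and that the guidance image is the reference (left) image; the function name and the sample values are hypothetical.

```python
def aggregate_and_select(a_hat, b_hat, guide):
    """Evaluate the linear model per disparity, then pick the minimum-cost disparity."""
    n_disp, h, w = len(a_hat), len(guide), len(guide[0])
    disparity = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Elementwise multiplication and addition for each disparity.
            costs = [a_hat[d][y][x] * guide[y][x] + b_hat[d][y][x]
                     for d in range(n_disp)]
            # Winner-take-all: index of the smallest aggregated cost.
            disparity[y][x] = min(range(n_disp), key=costs.__getitem__)
    return disparity


# Hypothetical 1 x 2 guidance image with two candidate disparities.
guide = [[1.0, 2.0]]
a_hat = [[[1.0, 0.0]], [[0.0, 1.0]]]
b_hat = [[[0.0, 1.0]], [[0.5, 0.0]]]
disparity_map = aggregate_and_select(a_hat, b_hat, guide)
```

No cost volume needs to be filtered at this point, since the cross-scale mixing has already been applied to the parameters.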

As depicted in Figure 1, the parameters $\hat{a}_{d,0}(p)$ and $\hat{b}_{d,0}(p)$ are found by solving the system of linear equations composed of Equations (11) and (12). As the matrix inverse can be computed in advance, the calculation can be simplified into a matrix multiplication, as demonstrated in Equation (14).
Besides, the cost volume $\hat{C}_d(p)$ is calculated using the linear model of Equation (17). In contrast to the current multi-scale approaches in the literature, such as References [27-29], where the cross-scale aggregation is based on the costs themselves, the parameter-based cross-resolution aggregation of the proposed procedure is unique and efficient.

Experimental Results
In the proposed scheme, there are two design parameters, $\beta$ in Equation (20) and $\gamma$ in Equation (9).
According to Equations (19) and (20), a larger $\beta$ causes $W^H_{i,x,z}$ and $W^V_{j,y,z}$ to increase, so $\omega_z$ is larger and $D(p)$ will be smoother. Based on Equation (9), a larger $\gamma$ gives a stronger constraint between the scale layers.
We conducted experiments using the training dataset of the KITTI Vision Benchmark Suite [30] to determine the proper values for these two parameters. The dataset contains 200 image pairs and 200 ground truth disparity maps. The performance of the proposed scheme on this dataset is summarized in Figure 2.
Figure 2a,b display the mean values of the percentage of the erroneous pixels on the disparity maps, denoted as the average error rates, and the standard deviations. Each of the mean values, along with the standard deviation, was calculated using these 200 image pairs. The best performance was achieved with $\beta = 2$ and $\gamma = 1.5$. We fixed these values for all the experiments presented in this section.

To validate the effectiveness of the proposed scheme, extensive comparative experiments were conducted. We studied six state-of-the-art stereo matching algorithms to compare with the proposed scheme:

• The fast cost volume filtering scheme of Reference [19], denoted as FCVF;
• A combination of the cross-scale cost aggregation scheme of Reference [29] and FCVF [19], denoted as CS-FCVF;
• The pervasive guided-image-filter scheme of Reference [25], denoted as PGIF;
• A combination of the cross-scale cost aggregation scheme of Reference [29] and PGIF [25], denoted as CS-PGIF;
• The deep self-guided cost aggregation scheme of Reference [8], denoted as DSG;
• The sparse representation over discriminative dictionary scheme of Reference [13], denoted as SRDD;
• The proposed scheme, which implements a hierarchical guided-image-filter, denoted as HGIF.
We tested these frameworks using the Middlebury (version 3) benchmark stereo database, downloaded via the URL: vision.middlebury.edu/stereo/ [31]. The "trainingQ" image set, which contains 15 stereo pairs, was used for the performance evaluation, starting from "Adirondack" to "Vintage". Of them, only five are shown in Figure 3 due to the space limitation. They are: "Adirondack", "Pipes", "Playroom", "Playtable", and "Shelves". These pictures have a typical resolution of 720 by 480. In the proposed scheme, these image pairs are down-sized twice, for example to 360 by 240 and 180 by 120. This arrangement corresponds to $K = 2$.

Besides, the Middlebury benchmark defines two measures for evaluating average error rates: non-occluded (non-occ) regions and all regions. The weighted average error rate is an official metric of the benchmark, which measures the accuracy of matching by using different weights for different image pairs. These weights are employed to compensate for the variation in the matching difficulty, as remarked on the website. Specifically, the image pairs "PianoL", "Playroom", "Playtable", "Shelves", and "Vintage" contribute only half of their error rates.
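The weighting described above can be written as a short calculation. This sketch assumes a weight of 0.5 for the five listed pairs and 1.0 for all others, normalized by the sum of the weights; the function name and the error values in the usage lines are hypothetical and are not results from this paper.

```python
HALF_WEIGHT_PAIRS = {"PianoL", "Playroom", "Playtable", "Shelves", "Vintage"}


def weighted_average_error(error_rates):
    """Weighted mean of per-pair error rates (%), halving the listed hard pairs."""
    total = 0.0
    weight_sum = 0.0
    for name, err in error_rates.items():
        w = 0.5 if name in HALF_WEIGHT_PAIRS else 1.0
        total += w * err
        weight_sum += w
    return total / weight_sum


# Hypothetical error rates for two pairs, for illustration only.
example = weighted_average_error({"Adirondack": 10.0, "Shelves": 40.0})
```

With these hypothetical inputs the harder pair counts half as much, so the result is lower than the plain mean of the two rates.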
Figure 4 shows the disparity maps obtained by these algorithms. Among them, only the results of SRDD [13] were improved by the disparity refinement procedure. The corresponding error rates in the non-occluded region and the all-region are summarized in Tables 1 and 2, respectively.
A close look at the disparity maps of Figure 4 shows a significant improvement in the matching quality of CS-FCVF [19,29] over FCVF [19], especially in the texture-less regions of the "Playtable" and "Shelves" cases. A similar but less significant improvement can be observed for CS-PGIF [25,29] over PGIF [25]. These improvements were due to the effective cross-scale cost aggregation scheme of Reference [29].
According to the weighted average error rates listed in Tables 1 and 2, DSG [8] performed worst. Also, the matching performance of SRDD [13] was better than CS-FCVF [19,29] but slightly worse than CS-PGIF [25,29]. It is worth noting that SRDD [13] applies semi-global cost aggregation and post-processing refinement to further improve the matching accuracy. However, the proposed scheme, even without refinement, performed better than SRDD [13] in most of the cases and achieved the smallest weighted average error rates in both the all-region and the non-occluded (non-occ) region.
The experiment was executed in MATLAB 2017b using an Intel Core i5-8300H and 16 GB RAM. Table 3 summarizes the execution time of these algorithms. The multi-scale versions, CS-FCVF [19,29] and CS-PGIF [25,29], required more computation time than their original versions, FCVF [19] and PGIF [25], respectively, as expected.
Similarly, because the proposed algorithm needed to calculate the filtering parameters in multiple scale layers, it ran longer than the FCVF algorithm [19] and the PGIF algorithm [25]. In addition, the CS-FCVF [19,29] and CS-PGIF [25,29] algorithms only incorporated one matching cost parameter, while the proposed algorithm has to compute two parameters, $\hat{a}_{d,0}(p)$ and $\hat{b}_{d,0}(p)$, in Equations (11) and (12). As a result, the proposed algorithm ran slightly longer than CS-FCVF [19,29] and CS-PGIF [25,29]. However, the increase in computation time was marginal and still within the same order of magnitude. Moreover, both DSG [8] and SRDD [13] required much more computational resources than the other algorithms due to their deep neural network and semi-global cost aggregation scheme, respectively, as pointed out in Reference [13]. Based on Table 3, we may conclude that the proposed scheme had the best performance when taking both the matching correctness and the calculation efficiency into consideration.

Figure 4. All the results are obtained without the refinement procedure except SRDD (the sparse representation over discriminative dictionary scheme) [13]. (a-g) are disparity maps generated by: (a) FCVF (the fast cost volume filtering scheme) [19], (b) CS-FCVF (a combination of the cross-scale cost aggregation scheme [29] and FCVF), (c) PGIF (the pervasive guided image filter scheme) [25], (d) CS-PGIF (a combination of [29] and PGIF), (e) DSG (the deep self-guided cost aggregation scheme) [8], (f) SRDD (the sparse representation over discriminative dictionary scheme) [13], and (g) the proposed scheme. Among them, the images of (e) DSG and (f) SRDD are by courtesy of the Middlebury (version 3) benchmark stereo database via the URL: vision.middlebury.edu/stereo/ [31].

Conclusions
In this work, we propose a novel stereo matching scheme that makes use of hierarchical pattern information. The scheme exploits features at different scales for the matching metrics.
Inspired by the scheme of Zhang [29] for multi-scale disparity cost aggregation, the scheme uses a hierarchy of parameters of the GIF-based linear models and exploits the pervasive guided-image-filtering [25] for efficient matching cost calculation. The resultant multi-scale features are collected to form an improved cost volume for disparity estimation by using a linear combination of the guidance image.
A performance evaluation on version 3 of the Middlebury stereo evaluation data set [31] showed that the proposed solution provided superior disparity accuracy and comparable processing speed when compared with representative stereo matching algorithms. Besides, the scheme outperformed most of the state-of-the-art algorithms even without the refinement procedure.
In the linear model of Equation (1), $p = (x, y)$ is the location of the central pixel and $d$ indicates the disparity. Note that $C_{d,0}(p)$ has three dimensions $(x, y, d)$ and is often denoted as the aggregated cost volume. Likewise, $a_{d,0}(p)$ and $b_{d,0}(p)$ can also be called the parameter volumes of the model. The second subscript, 0, indicates that these values are related to the original resolution. For the following stereo matching operations, we regard the left image as the reference image, such that $I_G = I_{L,0}$, and the right image as the target image. The disparity map, $D(p)$, is composed of the disparity value corresponding to the minimum aggregated cost at each location $p$: $D(p) = \arg\min_d C_{d,0}(p)$.

Figure 1. An overview of the proposed stereo matching process with two levels of down-sampling as an example.

Figure 2. The effect of parameter selection on the performance of the proposed scheme using 200 stereo image pairs of the training dataset in the KITTI Vision Benchmark Suite [30]. The dataset was downloaded from the URL: www.cvlibs.net/datasets/kitti/index.php. Each figure shows the mean values of the error rates and their corresponding standard deviations. (a) The average error rate with respect to the parameter β when γ = 1.5. (b) The average error rate with respect to the parameter γ when β = 2.

Figure 3. Datasets and their corresponding ground truth disparity maps selected from the experimented data for visual comparison. (a) Left images of the image pairs: Adirondack, Pipes, Playroom, Playtable, and Shelves; (b) ground truth disparity maps of these images. Image courtesy of the Middlebury (version 3) benchmark stereo database via the URL: vision.middlebury.edu/stereo/ [31].

Table 1. Comparison of the weighted average error rates in the non-occluded (non-occ) region between seven algorithms (%). All of the results are obtained without the refinement procedure except SRDD [13]. The lowest error records are marked in bold.

Table 2. Comparison of the weighted average error rates in the all-region between seven algorithms (%). All of the results are obtained without the refinement procedure except SRDD [13]. The lowest error records are marked in bold.

Table 3. Comparison of the computation time for four selected image sets between five algorithms (s).