Objective Image Quality Measures for Disparity Maps Evaluation

The evaluation of disparity (range) maps includes the selection of an objective image quality (or error) measure. Among existing measures, the percentage of bad matched pixels is commonly used. However, it requires a disparity error tolerance and ignores the relationship between range and disparity. In this research, twelve error measures are characterized in order to provide the bases to select accurate stereo algorithms during the evaluation process. Adaptations of objective quality measures for disparity maps’ accuracy evaluation are proposed. The adapted objective measures operate in a manner similar to the original objective measures, but allow special handling of missing data. Additionally, the adapted objective measures are sensitive to errors in range and surface structure, which cannot be measured using the bad matched pixels. Their utility was demonstrated by evaluating a set of 50 stereo disparity algorithms known in the literature. Consistency evaluation of the proposed measures was performed using the two conceptually different stereo algorithm evaluation methodologies—ordinary ranking and partition and grouping of the algorithms with comparable accuracy. The evaluation results showed that partition and grouping make a fair judgment about disparity algorithms’ accuracy.


Introduction
A stereo correspondence algorithm uses a stereo image pair as an input and produces an estimated disparity map (a new image) as an output [1,2]. The estimation of a disparity map is a fundamental problem in computer vision. This problem has been addressed in multiple domains such as outdoor mapping and navigation [3] and 3DTV [4].
With additional information about the stereo vision system, a disparity map can be transformed into a map of distances. Therefore, in addition to the term disparity map, the terms range map and depth map also appear in the literature.
The accuracy of stereo correspondence algorithms can be assessed by the evaluation of disparity maps, either qualitatively or quantitatively. A quantitative (or objective) approach is robust against several human-related biasing factors, offering advantages over a qualitative (subjective) approach. This assessment has practical applications such as component and procedure comparison, parameter tuning, supports decision-making by researchers and practitioners, and in general, to measure the progress in the field. It is useful to be able to quantify the quality of these disparity maps in order to benchmark range-finding devices and stereo correspondence algorithms [5].
Among the existing quantitative approaches used to estimate the accuracy/quality of disparity maps, Mean Squared Error (MSE), Root Mean Squared Error (RMSE) [6], Mean Relative Error (MRE) [7], and the percentage of Bad Matched Pixels (BMP) [8] have been used extensively. A modification of the Multi-Scale Structural Similarity (MS-SSIM) index for range image quality assessment was proposed in [9], while the Sigma Z-Error (SZE) objective measure was introduced in [10]. The strengths of the BMP and MRE measures are combined in the Bad Matching Pixel Relative Error (BMPRE) measure [11].
The results of objective measures are used as inputs for quantitative evaluation methodologies, which aim to evaluate the performance of stereo algorithms. Among the existing quantitative evaluation methodologies, Middlebury's methodology is commonly used [12]. It uses the percentage of BMP as the error measure. As an alternative to this evaluation methodology, the A * groups methodology has been proposed [13].
Among the current research in the field of stereo vision, disparity estimation based on deep learning methods [14] and real-time disparity estimation for high-resolution images [15] have been extensively studied, and significant progress has been made.
This paper proposes new disparity map evaluation measures, as modified versions of the original Image Quality Assessment (IQA) methods. The modification consists of the introduction of the capability to handle missing data in both the ground-truth disparity map (due to the lack of measures to determine ground-truth data) and in the estimated disparity map (due to the lack of an applied algorithm for image disparity estimation). The paper is focused also on consistency evaluation of the proposed measures using the two conceptually different state-of-the-art evaluation methodologies described in [12,13]. The obtained results show that the methodology extensions proposed in this research provide more accurate image disparity algorithm ranking.
The rest of this paper is organized as follows. The next section will describe the methodologies and dataset used for disparity maps' evaluation. This is followed by the sections for experimental evaluation of known and the newly proposed objective error measures. The comparison of the results is shown in the next section. Finally, the conclusions are stated in the last section.

Methodologies and Dataset Used for Disparity Maps' Evaluation
Different methodologies have been used to compare the results of stereo correspondence algorithms. Publicly available datasets of stereo image pairs with ground-truth disparity maps are the basis for the application of the methodologies. Furthermore, different objective measures can be used when comparing stereo algorithms.

Dataset Description
Middlebury's stereo benchmark dataset [12] is used in this research. This dataset was selected because it is widely used and known by the stereo vision research community. It is composed of four indoor stereo image pairs (the Tsukuba, the Venus, the Teddy, and the Cones) captured under controlled conditions, which are illustrated, with their associated disparity ground-truth data, in Figure 1. The disparity ground-truth data of these images were generated using various methods. The Tsukuba stereo image pair is composed of front-parallel objects, while it has a disparity range of 16 pixels. The disparity ground-truth data of the Tsukuba stereo image pair was generated manually and is of integer precision. It excludes a border of 18 pixels, where no disparity value is provided. The Venus stereo image is composed of piecewise, planar slanted objects, while having a disparity range of 20 pixels. Each planar component was manually labeled, and a direct alignment technique was used on each planar region for estimating the affine motion of each patch. The horizontal component of these motions was used to compute the ground-truth disparity map. The Teddy and the Cones stereo images contain several objects with a different geometry, with the disparity range of 60 pixels. Their disparity ground-truth data were generated using a structured light technique. These techniques rely on projecting one or more special light patterns onto a scene, in order to directly acquire a scene range map. By observing stereo image pairs and the corresponding disparity maps from Figure 1, it can be noticed that objects that are closer to the observer on the disparity maps have higher gray level values (brighter regions). Therefore, disparity maps can be used to determine the depth of a scene and to determine the distance to objects.

Middlebury's Methodology
Middlebury's methodology was introduced in [6] and implemented for online stereo evaluation at [12]. The methodology is based on a testbed of four stereo image pairs [16], and the BMP measure is used as an error measure. It considers three error criteria related to different image regions: the entire image (all), the visible area in both images of the stereo pair (not occluded (nonocc)), and regions near depth discontinuities (disc), as seen in Figure 2.
The all criteria cover all regions known in the ground-truth (white pixels on Figure 2b); the nonocc criteria include all regions that are not occluded (white pixels on Figure 2c); while the disc criteria address all the areas that are near depth discontinuities and near occlusions (white pixels on Figure 2d). The evaluation model is based on comparing, sorting, and ranking the scores produced by the BMP measure, using the three error criteria. An overall ranking, which acts as an accuracy indicator, is computed as the average of the ranks obtained in each error criterion. The interpretation of the results is based on the final ranking. It is assumed that the algorithm with lower ranking is more accurate. Based on this methodology, it is possible to determine a set of top-ranking algorithms. However, the cardinality of this set is a free parameter, and this may lead to discrepancies among researchers about algorithm accuracy [13].

The A * Groups Methodology
The evaluation of disparity maps was addressed as a multi-objective optimization problem in [17], and the whole evaluation process is termed the A * methodology. The introduced methodology computes the set A * , which is composed by the stereo correspondence methods, with comparable accuracy among them (i.e., not better, nor worst) and, at the same time, superior to the rest of the algorithms under evaluation, according to the scores of selected error functions and considered error criteria. In this way, it avoids a subjective interpretation of evaluation results. This research is focused on presenting the methodology rather than on discussing the results achieved by introducing the evaluation model. Additionally, the BMP measure is used as the error measure for evaluating disparity maps' accuracy. Moreover, the A * methodology is only capable of determining which algorithms are the most accurate overall, ignoring the rest of the algorithms.
The A * methodology was extended in [13] by introducing the A * groups methodology. It incorporates partitioning and grouping algorithms that allow a complete evaluation of the entire set of algorithms. In this way, multiple sets A * are computed iteratively and associated with a label of accuracy that identifies the particular set as a group. The group A * 1 is of special interest since it is composed of the top performing stereo algorithms under evaluation.

Disparity Evaluation Using Objective Measures
Although there are a number of techniques for assessing the similarity between two images [18], simple pixel-based objective measures are generally used to assess the similarity between the ground-truth and the estimated disparity maps.
In most stereo applications, range images are obtained through range-finding devices or from structured light techniques. These methods often produce unknown regions (regions where the stereo algorithm fails to compute a depth estimate or regions where the range finder is shadowed by an obstacle). These unknown regions must be handled properly in order to obtain an accurate score from the quality metric. The local-based quality metric, termed R-SSIM [9], is capable of handling missing data in both the disparity map under evaluation and ground-truth data. Missing data in the disparity map may be associated with a fault of the stereo correspondence algorithm under evaluation.
In this section, pixel-based objective measures will be described first, and then, the local objective measures based on the Structural SIMilarity Index (SSIM) [19] will be introduced.

Disparity Evaluation Using Pixel-Based Objective Measures
The five commonly used pixel-based error measures are considered for the disparity map evaluation-MSE, MRE, SZE, BMP, and BMPRE. MSE is formulated as: where D t and D e are the ground-truth and estimated disparity maps, respectively, and N is the image size in pixels. However, the MSE measure does not distinguish well between disparity estimates with a large number of small errors and disparity estimates with a small number of large errors.
The MRE measure is formulated as: while the SZE measure is defined as [10]: where f is the focal length, B is the stereo system baseline (i.e., the distance between camera optical centers), and µ is a constant added to avoid the instability due to missing estimations. In our research, it is assumed that f · B = 1, while the value of the constant is µ = 1. This measure is based on the inverse relation between disparity and depth using the magnitude of the disparity estimation error. It aims to measure the impact of a disparity estimation error on the 3D reconstruction. This impact depends on the true distance along the Z optical axis between the stereo camera system and the captured point and the point position according to the estimated disparity map.
The BMP measure is formulated as [12]: where ε(x, y) is a binary function defined as: and δ is the error threshold, which was fixed to one pixel in our research (commonly used error threshold).
The BMPRE measure is formulated as [11]: where τ is a function defined in order to avoid divisions by zero: ∆ is the disparity error magnitude, computed as the absolute difference between the estimated disparity and the true disparity value: and ρ is the ratio between ∆ and the true disparity value-the relative error: The BMPRE measure considers, in a simple way, both the error magnitude and the inverse relation between depth and disparity. Moreover, the impacts of these considerations are not simple at all, since they allow a proper and deeper quantitative evaluation of disparity maps than the evaluation obtained using the BMP. In contrast to error measures that consider the inverse relation between depth and disparity (SZE), the BMPRE does not require additional information about the stereo camera system.
The pixel-based objective scores for the disparity maps of Tsukuba, Venus, Teddy, and Cones, estimated by the PMF and the ADCensus stereo correspondence algorithms [12], are shown in Table 1 (better objective scores of stereo algorithms for all stereo image pairs, criteria, and the complete dataset are highlighted in grey). The averaging of the 12 objective scores (4 stereo pairs × 3 criteria) points out the success of the algorithm on the global (dataset) level. These objective values can also be viewed as errors introduced by stereo algorithms (the lower, the better), i.e., the accuracy of the algorithms can be observed through them. The obtained PMF and ADCensus scores using the BMP measure are identical to the Middlebury stereo evaluation results [12]. Furthermore, the BMP and BMPRE measures, for the Cones and the Teddy images, indicate a superior accuracy of the PMF algorithm. On the contrary, they indicate a superior accuracy of the ADCensus algorithm for the Tsukuba and the Venus images. On the other side, the MRE scores indicate a superior accuracy of the PMF algorithm. Although the SZE and MSE scores depend on the stereo image pair selection and criteria used, it can be concluded that the PMF algorithm provides better disparity accuracy than the ADCensus algorithm. If the accuracy of the disparity map is observed through the average value, based on the BMP measure, the advantage is on the side of the ADCensus algorithm, while for the remaining four objective measures, the PMF algorithm provides a disparity map with better accuracy. Consequently, the selection of the error measure, in the evaluation process, has a great impact on the algorithm selection.

Disparity Evaluation Using SSIM-Based Objective Measures
The SSIM index is a very popular algorithm for image quality evaluation. The basic idea behind the SSIM technique is that the images of natural scenes are rich in structures and that the human eye is sensitive to structural distortions. The index describes the quality by comparing the local luminance (mean value), contrast (local variance), and structure (local correlation) between the reference and the test image using an 11 × 11 window [19]. The dynamic range of this measure is [−1, 1]. The Universal Image Quality Index (UIQI) [20] is a special case of the SSIM index. It is used to quantitatively assess a structural distortion between two images, and the dynamic range of this measure is also [−1, 1]. Both measures are calculated in the local regions of the image, using a moving window. The final quality indexes are obtained as the mean of all local quality values.
The Multi-Scale SSIM or MS-SSIM index [21] is the most popular variation of the SSIM index. It utilizes the SSIM algorithm over several scales. The reference and distorted images are iteratively driven through a low-pass filter and down-sampled by a factor of two. The resulting image pairs are processed with the SSIM algorithm. The MS-SSIM algorithm compares details across resolutions in a multiplicative manner, providing the overall image quality score [21].
The R-SSIM algorithm is a variation of the MS-SSIM algorithm with the ability to handle unknown regions or missing data in disparity maps' evaluation [9]. Treating unknown regions depends on their appearance-on the reference (ground-truth) image or on the distorted (estimated) image. Pixels in the unknown regions of the reference image should be ignored in objective measure calculations. Pixels in the unknown regions of the distorted image should be ignored when they fall inside the sliding window used to calculate the local objective value, but the objective value of the unknown pixels themselves should be set to zero. Figure 3 shows and explains how the R-SSIM algorithm was implemented on one scale. The same idea in this research is applied for the UIQI and SSIM objective measures, and these modified measures we named as UIQI m and SSIM m .    Figure 3b shows the same patch in the estimated (test) disparity image, which also contains unknown (black) pixels. Figure 3c shows the Gaussian weighting function, which is used in local SSIM/MS-SSIM calculations, and Figure 3d shows it in the renormalized variant with ignored unknown (black) pixels. Finally, Figure 3e shows a map of the SSIM values of that patch, and Figure 3f shows its modification, where the pixel in the middle was calculated from Figure 3a,b,d. In Figure 3f, the unknown region from Figure 3b is indicated in black with a SSIM score of zero, while the unknown region from Figure 3a (shown in white) will be ignored in the final R-SSIM score.
The SSIM-based objective evaluations for the disparity maps of the Tsukuba, Venus, Teddy, and Cones images, estimated using the AdaptWeight and the TreeDP stereo correspondence algorithms [12], are shown in Table 2 (better objective scores of stereo algorithms for all stereo image pairs and complete dataset are highlighted in grey). UIQI local scores are calculated using a sliding window of 8 × 8 pixels and a uniform weighting function [20]. For these window-based measures, it is not reasonable to consider three criteria (all, nonocc, and disc) used for pixel-based error measures [9]. The averaging of the four stereo pairs' obtained error scores points out the success of the algorithm on the global (dataset) level.
Generally, we can notice that the scores after the modification (UIQI m , SSIM m , and R-SSIM) are higher than the scores provided by the original objective measures (UIQI, SSIM, and MS-SSIM).
Objective image quality measures without and with modification provide equal scores in situations where both disparity maps (ground-truth and algorithm computed) do not contain unknown regions (Venus image and AdaptWeight algorithm). Nevertheless, modified objective measures yield more real and precise results. Furthermore, it can be noticed that the objective measures give results that differ on the used algorithm and the selected image pair. Based on the objective measures, it can be concluded that the AdaptWeight algorithm provides disparity estimations that are closer to the ground-truth data than the TreeDP disparity map calculations.  Confirmation of the precedence of the AdaptWeight stereo algorithm over the TreeDP algorithm (measured by objective scores; Table 2) can be seen in Figure 4.  By visual inspection of the disparity maps from Figure 4, it can be concluded that the disparity map obtained with the AdaptWeight algorithm is closer to the ground-truth disparity map than the map obtained with the TreeDP algorithm. Therefore, the objective similarity scores of the AdaptWeight obtained map are higher than the TreeDP generated one ( Table 2).

A New Local-Based Objective Measures for Disparity Evaluation
In this research, new objective measures are proposed for disparity image evaluation. Despite the SSIM-based objective measures, we used four state-of-the-art local-based objective image quality measures, which we adapted for range image quality assessment.
The gradient-based objective image quality assessment measure, Q AB , is based on the preservation of gradient magnitudes and orientations [22]. A comparison of gradient information is carried out at the local level in a 3 × 3 window, after which the local values of information preservation are used to determine the final quality score. As a result, a numerical Q AB value is obtained that reflects the quality of the test image-the accuracy with which the original gradient information is presented in the test image. The Q AB values are in the range [0, 1].
Gradient Magnitude Similarity Deviation (GMSD) is also a gradient-based image quality assessment measure, which calculates a local similarity between the gradient magnitude maps of reference and test images in order to create a Local Quality Map (LQM) of the image degradation/similarity, in a 3 × 3 window. After this step, the standard deviation of the LQM is calculated in order to achieve a final image quality estimate: GMSD [23]. In this research, we used the mean value of the LQM (Gradient Magnitude Similarity Mean (GMSM)) as a final quality score.
The Riesz transform and Visual contrast sensitivity-based feature SIMilarity index (RVSIM) is a full reference IQA method, which combines Riesz transform and visual contrast sensitivity [24]. RVSIM takes full advantage of the monogenic signal theory and log-Gabor filter by exploiting the contrast sensitivity function to allocate the weights of different frequency bands. At the same time, gradient magnitude similarity is introduced to obtain the gradient similarity matrix using a 3 × 3 window. Then, the monogenic phase congruency matrix is used to construct the pooling function and obtain the RVSIM index.
Perceptual fidelity Aware MSE (PAMSE) is a variant of MSE, produced by introducing an l2-norm structural error term to it, using a 7 × 7 Gaussian filter [25]. Lower PAMSE values indicate a higher similarity between the compared images, contrary to the other three local-based measures (Q AB , GMSM and RVSIM) for which higher values indicate higher similarity.
For local-based objective measures Q AB , GMSM, RVSIM, and PAMSE, in this research, a new filtering method has been proposed to treat the unknown regions from disparity maps. Unlike the R-SSIM in which, for each pixel influenced by the unknown regions, a new computed value is calculated from the sliding window, in this approach, the new value for pixels affected by unknown regions is determined as the average value of valid (unaffected) local objective scores.
The range adapted objective measures, based on the four previously mentioned measures, are named in this research as Q AB m , GMSM m , RVSIM m , and PAMSE m . Figure 5 shows and explains how this adaptation was implemented for GMSM, which is also valid for Q AB and RVSIM. Figure 5a depicts a patch in the ground-truth (reference) image where there are some unknown pixels (shown in black), while Figure 5b shows the same patch in the computed range image, which also contains unknown (black) pixels. Figure 5c shows these unknown pixels and pixels affected by them (shown in white), in a 3 × 3 local-based objective calculations. Figure 5d illustrates the proposed filtering method, where the objective score for the pixel numbered with one should be calculated from four local scores of unaffected pixels (shown as red squares), while the objective score for the pixel numbered with two should be the same as a local score of the only one unaffected pixel in a 3 × 3 neighborhood. Finally, Figure 5e shows a map of the GMSM values of that patch, and Figure 5f shows its modification, where affected pixels were calculated using the proposed filtering method. In Figure 5f, the same unknown region from Figure 5b is indicated in black, with a GMSM score of zero, while the unknown region in Figure 5a (shown in white) will be ignored in the final GMSM m score. Since PAMSE has an inverted scale (lower is better), the unknown pixels from the computed disparity image are penalized using the maximum disparity value, which is 255 for 8 bit/pixel grayscale images. Furthermore, for the PAMSE objective measure, it should be taken into account that the window size is 7 × 7 pixels, which will give a larger number of affected pixels. The local-based objective evaluations for the disparity maps of the Tsukuba, Venus, Teddy, and Cones images, estimated using the AdaptWeight and the TreeDP stereo algorithms [12], are shown in Table 3 (better objective scores of stereo algorithms for all stereo image pairs and the complete dataset are highlighted in grey). Image quality scores after the modification (Q AB m , GMSM m , RVSIM m , and PAMSE m ) differ from image quality scores provided using original implementations that do not handle unknown regions (Q AB , GMSM, RVSIM, and PAMSE). Objective image quality measures without and with modification provide equal scores for the Venus image and the AdaptWeight algorithm, for the same reason mentioned for the SSIM-based measures-neither disparity map contains unknown regions.
The results vary according to the choice of the objective algorithm, where the AdaptWeight algorithm provides disparity estimations that are closer to the ground-truth data than the TreeDP disparity map calculations (see Figure 4), which is also valid for the previously described SSIM-based objective measures (see Table 2).

Results
In this research, fifty disparity stereo algorithms used in Middlebury's dataset [12] were analyzed, where 34 algorithms were used in [9], and the 16 other algorithms were chosen among the state-of-the-art stereo algorithms. In order to perform image disparity algorithm evaluation, two methodologies were applied: Middlebury's methodology and the A * groups methodology.
The ranking of these algorithms using pixel-based objective measures was done by sorting the average ranks obtained for different stereo image pairs (Tsukuba, Venus, Teddy and Cones) and criteria (all, nonocc, and disc); this means that for the 12 rank values' averaging while using SSIM-based and proposed local-based objective measures, the algorithms ranking was done by sorting the average ranks obtained for the four stereo image pairs (four ranking values averaging). The final algorithm rank was obtained by sorting the average ranks of all objective measures-based ranks.
A part of the ranked stereo correspondence algorithms according to Middlebury's evaluation methodology (BMP-based) and the rest of the considered objective measures are listed in Table 4. It can be observed that Table 4 lists 20 different stereo algorithms, and the best one of them is TSGO [12], which is reported among the top-five by all objective measures' rankings. Furthermore, it can be noticed that the algorithm rank in some cases differs significantly based on the objective measure used, i.e., the IGSM algorithm is the best ranked according to the BMP pixel-based measure, but it is ranked 25th using the UIQI m objective measure. This algorithm is positioned in 13th place on the global (final) level. For the bottom ranked algorithms (DP , SSD+MF , PhaseDiff , and SO [12]), there is a high level of agreement between objective-based rankings.
The affiliation of the above 20 selected algorithms with A * groups for the different objective error measures, according to the A * groups methodology [13], using 12 (pixel-based)/four (local-based) score errors, is shown in Table 5. The lower label values identify the algorithms with better accuracy. The last row of the table shows the number of A * subsets (labels) for each objective measure. It can be seen that the first four average-based ranked stereo correspondence algorithms ( Table 4) are members of almost all the instances of the A * 1 sets. There are two exceptions: the GC+LocalExp algorithm for the SSIM m measure and PM-Forest for the UIQI m error measure. The worst average-based ranked algorithms belong to the A * subsets with higher labels.
If we look at the individual ranks of the stereo correspondence algorithm AdaptingBP (Table 4), it can be concluded that this algorithm is positioned from first to 13th place. Based on all objective measures, it is in third place. The accuracy of this algorithm was better evaluated using the A * groups methodology, where Table 5 shows that it belongs to the A * 1 group by all objective measures. The Linear Correlation (LC) between average error values of objective measures applied for all the 50 stereo correspondence algorithms is shown in Table 6. From this table, it can be noticed that SZE, UIQI m , Q AB m , and PAMSE m have a weak mean correlation with the other objective measures (italicized scores in the last row). Moreover, we can see that proposed local-based measures GMSM m and RVSIM m have a high correlation with the other measures (bolded scores in the last row). Knowing that BMP and BMPRE are the most frequently used measures in the literature (anchor points), we can see that the proposed GMSM m and RVSIM m modifications can be good alternatives for them (bolded mutual scores), where GMSM m is computationally more efficient.  The correlation between objective measures can be illustrated through scatter plots. Figure 6 shows the scatter plots of GMSM m and Q AB m , local-based objective error measures, versus BMP and BMPRE, pixel-based error measures, where each point (star) represents the average error of the utilized stereo algorithms (four values averaging for GMSM m /Q AB m and 12 values averaging for BMP/BMPRE). Scatter plots between these objective measures confirm the results given in Table 6. It can be observed that GMSM m correlates better with BMPRE (Table 6, LC = 0.96) than BMP (Table 6, LC = 0.90). This can be explained by the fact that the GMSM m objective measure is also sensitive to disparity error magnitudes as the BMPRE measure. For the disparity maps that are closer to the ground-truth (higher GMSM m and lower BMP/BMPRE values), the correlation between GMSM m and BMP/BMPRE objective measures is almost linear (in the upper left corner in the scatter plots). The degree of agreement between Q AB m and BMP/BMPRE is lower, which is visible through a big spreading of the objective scores on the scatter plots.  Figure 7 shows that the two objective methods can provide very different rankings to the same disparity maps. The two algorithms being assessed are graded in nearly reverse rank-order by the two assessment methods, BMP and GMSM m . Therefore, in the performance analysis of stereo correspondence algorithms, it is desirable to use more objective quality measures. Visual inspection of the two disparity maps suggests that in this instance, the GMSM m algorithm delivers a more meaningful assessment of the quality of the estimated disparity map. Nevertheless, it can be noticed that the GMSM m objective values of quality are very close (0.958 vs. 0.968) and that the ranks are significantly different (21 vs. eight). Therefore, it is reasonable to use the A * groups methodology in the performance analysis of stereo correspondence algorithms, because algorithms whose accuracy is close will be classified in the same group.
SubPixSearch disparity map BMP rank/score (all criteria): 6/6.87 GMSM m rank/score: 21/0.958 OverSegmBP disparity map BMP rank/score (all criteria): 21/8.81 GMSM m rank/score: 8/0.968 We make the results for all 50 stereo algorithms publicly available to the research community as three Excel files in Supplementary Materials. These files contain extended results that are not presented in this paper: error scores for all analyzed stereo algorithms by all objective measures for each stereo pair and criteria, their rankings, and A * groups subsets. Furthermore, in these files, researchers can find error scores before and after the proposed modifications for local-based objective measures.

Conclusions
In this paper, a characterization of error measures for evaluating disparity maps is presented. The impact on the results caused by the selection of the error measure was analyzed using Middlebury's and the A * groups methodologies. This is the first attempt to analyze disparity maps using a plethora of pixel-based and local-based objective measures, using both methodologies.
We proposed modified objective measures as a new and needed quality metric for range images and demonstrated their utility by the evaluation of 50 stereo algorithms in the Middlebury Stereo Vision page, using four stereo image pairs with a rigid baseline. We confirmed that the new objective measures are effective for range image quality assessment; they complement the evaluation of stereo algorithms using conventional bad matched pixels approaches, but they can be stand-alone error measures. These modifications measure more than a loss of depth values; they are sensitive to errors in depth and surface structure, which cannot be measured using bad matched pixels approaches.
Experimental evaluations showed that the objective error measures may lead to different and contradictory ordinary rankings of stereo algorithms. These differences are less using the A * groups methodology, where this methodology makes a fair judgment about the algorithms' accuracy. Consequently, the fair comparison of stereo correspondence algorithms may be a more difficult task than it has been considered so far.

Conflicts of Interest:
The authors declare no conflict of interest.