Structural Similarity Measurement Based Cost Function for Stereo Matching of Automotive Applications

Human visual perception uses structural information to recognize stereo correspondences in natural scenes. Structural information is therefore important for building an efficient stereo matching algorithm. In this paper, we demonstrate that the structural similarity between two patches, extracted either directly from image intensity (SSIM) or from image gradients (GSSIM), accurately describes the patch structures and thus provides more reliable initial cost values. We also address one of the major problems faced in stereo matching of real-world scenes: radiometric changes. The performance of the proposed cost functions was evaluated in two stages: the first considers the costs without any aggregation process, while the second uses a fast adaptive aggregation technique. The experiments were conducted on the real road traffic scenes of the KITTI 2012 and KITTI 2015 benchmarks. The results demonstrate the potential merits of the proposed stereo similarity measurements under radiometric changes.


Introduction
Intelligent vehicles rely on active sensors (e.g., time-of-flight cameras [1], LiDAR [2]) to produce point clouds of the surrounding environment. However, low-cost passive computer vision offers the potential to produce richer geometric representations. In particular, our attention is directed to the stereo matching task, as it is vital for applications linked to intelligent vehicles.
The aim of the stereo matching process is to estimate the depth of a scene viewed in two stereo images. Stereo matching algorithms can be roughly split into two categories: sparse algorithms, which rely on feature-based matching methods and are generally used in camera calibration or orientation tasks [3,4], and dense algorithms, which estimate a depth value at every pixel in the image.
Dense algorithms can be classified into global and local approaches. Global approaches formulate the stereo correspondence problem as an energy function over all image pixels, subject to smoothness constraints. This function is then minimized by global methods, such as the commonly used dynamic programming [5], belief propagation [6], and graph cuts [7]. These approaches generally alleviate matching ambiguities effectively and therefore provide quite accurate depth results. However, they are ill-suited to real-time applications because of their slow convergence to optimal values. By contrast, local approaches apply a local smoothness assumption around each individual pixel to estimate its depth value [8][9][10]. This makes them computationally inexpensive, but they produce less accurate disparity results, especially in textureless areas.
A stereo matching algorithm can be performed in four steps [11]: cost computation, cost aggregation, disparity selection, and disparity refinement. The first step consists of matching pixels of the two images of the stereo pair. Several cost functions can be adopted in this step, each with different characteristics that make it suited to specific image regions. The second step, cost aggregation, filters out noisy matches that may have occurred during the first step. In the third step, disparity values are selected, often with the Winner-Take-All (WTA) strategy: for each pixel, WTA keeps the disparity with the lowest aggregated matching cost (or the highest one, when the cost is a similarity measure). The last step, disparity refinement, is optional and aims to correct erroneous disparity values by filtering out wrong matches using global smoothness assumptions.
Although all of these steps are required for accurate disparity results, cost computation is the most critical: ambiguous initial cost values considerably affect the accuracy of the final results, regardless of the stereo matching algorithm. Therefore, obtaining a robust disparity map in real traffic situations requires a cost function that remains effective under radiometric distortions.
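The disparity selection step can be illustrated with a minimal Python sketch. It assumes that the matching scores are stored in a NumPy array of shape (D, H, W), one slice per candidate disparity; the function name `winner_take_all` and the `similarity` flag are illustrative choices and are not taken from the original method description.

```python
import numpy as np

def winner_take_all(volume, similarity=False):
    """Select one disparity index per pixel from a (D, H, W) volume.

    similarity=False: entries are costs, so the lowest value wins.
    similarity=True: entries are similarity scores, so the highest wins.
    """
    volume = np.asarray(volume, dtype=np.float64)
    if similarity:
        return np.argmax(volume, axis=0)
    return np.argmin(volume, axis=0)

# Toy volume: 3 candidate disparities for a 1 x 2 image (hypothetical values).
toy = np.array([[[0.9, 0.2]],
                [[0.1, 0.5]],
                [[0.4, 0.8]]])
disp_from_cost = winner_take_all(toy)
disp_from_similarity = winner_take_all(toy, similarity=True)
```

The same volume yields different disparity maps depending on whether its entries are read as costs or as similarities, which is why the selection rule has to match the type of matching function.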
In this paper, we propose two new cost functions based on structural similarity (SSIM): C_SSIM and its gradient variant, C_GSSIM. The performance of the proposed costs was evaluated both with aggregation [10] and without aggregation. The local WTA strategy was adopted to generate disparity maps. Experiments were conducted on two challenging datasets, the real road traffic stereo pairs of KITTI 2012 [12] and KITTI 2015 [13].
The remainder of the paper is organized as follows. In Section 2, we review related work on matching cost functions. In Section 3, we present the proposed cost functions. Experimental results and discussion are given in Section 4. Finally, we draw conclusions in Section 5.

Related Work
A wide range of cost functions has been proposed in the literature, including absolute intensity differences, squared intensity differences, cross-correlation, and normalized cross-correlation. Non-parametric cost functions have been introduced for their robustness against radiometric distortions [14]. The authors of [15] proposed a cost function based on mutual information in order to handle complex radiometric relationships between images. Several works have focused on improving the performance of traditional cost functions, either by proposing enhanced costs or by merging multiple cost functions into more efficient variants of the existing ones. In [16], the authors fused the absolute difference on image color with the absolute difference on the gradient along the horizontal direction. Other studies have exponentially fused the absolute difference on image color with the Census Transform (CT) cost function [17]. The authors of [18] fused three cost functions using an exponential function: the absolute difference on image color, the absolute difference on image gradients, and the CT computed on image gradients. The authors of [19] proposed an adaptive fusion method for multiple matching cost functions. The efficiency of state-of-the-art cost functions has been widely examined in several studies [20][21][22]. The study presented in [20] compared the robustness of six matching cost functions with respect to photometric distortion and noise. The study in [21] is more extensive and evaluated fifteen different cost functions under various optimization schemes. Its results showed that costs based on the CT give the best results, particularly under radiometric changes. More recently, the authors of [22] investigated cost functions for automotive applications using two different stereo matching algorithms: one based on global energy optimization (graph cuts) [7], and the other on a local adaptive method [10]. The results of that study showed that cost functions derived from the CT or its variants, such as the Cross-Comparison Census (CCC) combined with the mean sum of relative pixel intensity differences within a CT window, provide good overall performance on the KITTI 2012 benchmark. A variant of the CCC cost function [23] was proposed in order to better handle radiometric distortions. Its authors claim that it outperforms conventional cost functions on the KITTI 2012 benchmark. These studies show that it is quite difficult to estimate disparity under radiometric distortions when relying only on intensity-based cost functions. Some studies have investigated SSIM for stereo matching algorithms. In [24], the authors proposed computing the final matching cost with the SSIM index over filtered left and right patches obtained from the non-local means algorithm [25]. In [26], the SSIM index was introduced for multiview stereo to compute the matching cost in a coarse-to-fine workflow.
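To illustrate why census-based costs tolerate radiometric changes, the following Python sketch implements a basic census transform with a Hamming-distance cost. It is a generic textbook version and not the specific variants studied in [22,23]; the function names, the 3 × 3 window, and the edge padding are assumptions made for this example.

```python
import numpy as np

def census_transform(img, radius=1):
    """Return an (H, W, K) boolean array with one bit per neighbour.

    A bit is True when the neighbour is darker than the centre pixel.
    Borders are handled by edge replication (an assumption of this sketch).
    """
    img = np.asarray(img, dtype=np.float64)
    h, w = img.shape
    padded = np.pad(img, radius, mode="edge")
    bits = []
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dy == 0 and dx == 0:
                continue  # skip the centre pixel itself
            shifted = padded[radius + dy: radius + dy + h,
                             radius + dx: radius + dx + w]
            bits.append(shifted < img)
    return np.stack(bits, axis=-1)

def census_cost(bits_a, bits_b):
    """Hamming distance between two census bit arrays, per pixel."""
    return np.count_nonzero(bits_a != bits_b, axis=-1)

# A gain/offset change preserves intensity ordering, so the census bits match.
rng = np.random.default_rng(0)
left = rng.integers(0, 100, size=(6, 6)).astype(np.float64)
distorted = 2.0 * left + 10.0  # hypothetical radiometric change
cost = census_cost(census_transform(left), census_transform(distorted))
```

Because the transform only encodes the ordering of intensities inside the window, any strictly increasing intensity mapping leaves the bit string, and therefore the cost, unchanged.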

SSIM Based Cost Function (C_SSIM)
When the stereo matching problem is considered as a visual task, extracting the information that the Human Visual System (HVS) relies on most can provide a consistent description of the considered patch and facilitate the matching process. In this context, we propose a new cost function based on structural information [27]. Let p(x, y) be a pixel in the reference image I_1 with intensity value I_p, and let q(x, y−d) be its hypothetical correspondence at disparity d in the target image I_2, with intensity value I_q. The C_SSIM between p and q is defined in Equation (1), where l(p, q, d) is the luminance, c(p, q, d) the contrast, and s(p, q, d) the structure measurement between p and q, defined in Equations (2)-(4), respectively.
In these equations, C is a small constant that prevents the denominator from being zero. µ_p and µ_q are the mean values computed in the neighborhoods N_p of p in I_1 and N_q of q in I_2, respectively. σ_p and σ_q are the standard deviations of p and q, respectively. The standard deviation of p is computed over the support window N_p, where ||N_p|| denotes the number of pixels in N_p. σ_pq is the covariance between p and q, estimated over the same support windows. Finally, α > 0, β > 0, and γ > 0 are parameters that control the influence of each of the three components.
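As a rough illustration, the similarity between two patches could be computed as in the Python sketch below. It assumes the standard SSIM component forms of [27] written with a single constant C, a population (1/||N_p||) normalization for the standard deviation and the covariance, and a clamp of negative structure values to zero so that the fractional exponent stays real. These choices, and the name `ssim_similarity`, are assumptions rather than the authors' exact formulation. The default exponents are the values reported in the experimental section.

```python
import numpy as np

def ssim_similarity(patch_p, patch_q, alpha=0.9, beta=0.1, gamma=0.2, c=1e-6):
    """SSIM-style similarity between two equally sized intensity patches."""
    p = np.asarray(patch_p, dtype=np.float64).ravel()
    q = np.asarray(patch_q, dtype=np.float64).ravel()

    mu_p, mu_q = p.mean(), q.mean()
    sigma_p, sigma_q = p.std(), q.std()           # population standard deviations
    sigma_pq = np.mean((p - mu_p) * (q - mu_q))   # population covariance

    lum = (2.0 * mu_p * mu_q + c) / (mu_p ** 2 + mu_q ** 2 + c)
    con = (2.0 * sigma_p * sigma_q + c) / (sigma_p ** 2 + sigma_q ** 2 + c)
    struct = (sigma_pq + c) / (sigma_p * sigma_q + c)
    struct = max(struct, 0.0)  # assumption: anti-correlated patches score zero

    return lum ** alpha * con ** beta * struct ** gamma

# Hypothetical 3 x 3 patches: an identical pair and a structurally reversed pair.
ref = np.array([[10, 20, 30], [40, 50, 60], [70, 80, 90]], dtype=np.float64)
same = ssim_similarity(ref, ref)
flipped = ssim_similarity(ref, ref[::-1, ::-1])
```

Identical patches give a score close to one, while the reversed patch, which shares the same mean and variance but has the opposite structure, is rejected through the structure term alone.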

SSIM Gradient Variant (C_GSSIM)
Besides structural information, the human visual system is capable of extracting gradient-based structural features, such as edges and points. To take this into account, the structural information is extracted from the image derivatives ∂I/∂x and ∂I/∂y rather than from image intensities. To do so, the luminance (l), the contrast (c), and the structure measurement (s) in Equation (1) are modified to incorporate the gradient. The gradient-based structural information cost function C_GSSIM is then defined from the components l_g, c_g, and s_g, which are the gradient counterparts of l, c, and s. Here, ∂I_1 and ∂I_2 are the gradient images of I_1 and I_2, computed along the x and y directions. ∂µ_p and ∂µ_q are the mean values computed in the neighborhoods ∂N_p of p in ∂I_1 and ∂N_q of q in ∂I_2, respectively. ∂σ_p and ∂σ_q are the standard deviations of p in ∂I_1 and of q in ∂I_2, and ∂σ_pq is the corresponding covariance; they are defined in the same way as their intensity counterparts, with gradient values in place of intensities. In contrast to Equation (1), this allows the structural features to be computed on the principal image derivatives with respect to the x and y coordinates.
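The gradient variant can be sketched in Python along the same lines. This example assumes that the two derivatives are combined into a gradient magnitude computed by central differences, that the gradient is taken on the patch itself rather than on the full image, and that the components follow the standard SSIM forms with a single constant C. The helper names `gradient_magnitude` and `gssim_similarity` are hypothetical, and the authors' exact formulation may differ.

```python
import numpy as np

def gradient_magnitude(img):
    """Gradient magnitude of a 2-D array using central differences."""
    gy, gx = np.gradient(np.asarray(img, dtype=np.float64))
    return np.hypot(gx, gy)

def gssim_similarity(patch_p, patch_q, alpha=0.9, beta=0.1, gamma=0.2, c=1e-6):
    """SSIM-style similarity computed on gradient magnitudes of two patches."""
    p = gradient_magnitude(patch_p).ravel()
    q = gradient_magnitude(patch_q).ravel()

    mu_p, mu_q = p.mean(), q.mean()
    sd_p, sd_q = p.std(), q.std()
    cov = np.mean((p - mu_p) * (q - mu_q))

    lum = (2.0 * mu_p * mu_q + c) / (mu_p ** 2 + mu_q ** 2 + c)
    con = (2.0 * sd_p * sd_q + c) / (sd_p ** 2 + sd_q ** 2 + c)
    struct = max((cov + c) / (sd_p * sd_q + c), 0.0)  # clamp is an assumption

    return lum ** alpha * con ** beta * struct ** gamma

# A constant brightness offset leaves the gradients, and thus the score, intact.
rng = np.random.default_rng(0)
patch = rng.integers(0, 200, size=(7, 7)).astype(np.float64)
brighter = patch + 40.0  # hypothetical additive radiometric change
score = gssim_similarity(patch, brighter)
```

Working on gradients removes any additive intensity offset before the comparison, so the luminance term no longer penalizes a uniform brightness difference between the two views.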

Experimental Results
In this section, we evaluate the ability of the proposed cost functions to discriminate stereo correspondences. We test the proposed costs with two different algorithms: a stereo matching algorithm without an aggregation stage, and one with a fast local adaptive aggregation technique. The proposed cost functions are then compared with the top-performing cost functions C_DIFFCensus [22] and C_GCCC [23], for which the optimal parameter values proposed in [22,23] were retained. Experiments were conducted on the KITTI 2012 [12] and KITTI 2015 [13] training datasets in order to evaluate the proposed approach in the context of intelligent vehicle applications. For KITTI 2012, the evaluation measure is the percentage of erroneous disparities with respect to the ground truth. For KITTI 2015, the D1-all error measure is computed: it represents the percentage of pixels for which the estimation error is larger than three pixels and larger than 5% of the ground truth disparity. The parameters of both cost functions, C_SSIM and C_GSSIM, were set experimentally to α = 0.9, β = 0.1, and γ = 0.2 to minimize the overall error rate. The parameter C is set to a very small value to prevent division by zero. In the aggregation stage, the spatial and color similarity thresholds were fixed at L = 9 and τ = 20, respectively. The local WTA strategy was adopted to generate disparity results. Since the proposed costs are built upon a similarity measurement, we select the disparity with the highest matching cost instead of the lowest one.
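The two error measures described above can be written as a short Python sketch. It assumes that pixels without ground truth are stored as zero in the ground truth disparity map and are excluded from the statistics; the function names are illustrative and are not taken from the KITTI development kit.

```python
import numpy as np

def bad_pixel_rate(disp, gt, threshold=3.0):
    """Percentage of valid pixels whose disparity error exceeds `threshold`."""
    disp = np.asarray(disp, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    valid = gt > 0  # assumption: zero marks pixels without ground truth
    err = np.abs(disp - gt)[valid]
    return 100.0 * np.mean(err > threshold)

def d1_all(disp, gt):
    """Percentage of valid pixels with error > 3 px and > 5% of ground truth."""
    disp = np.asarray(disp, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    valid = gt > 0
    err = np.abs(disp - gt)[valid]
    return 100.0 * np.mean((err > 3.0) & (err > 0.05 * gt[valid]))

# Hypothetical values: the last pixel has no ground truth and is ignored.
gt = np.array([10.0, 100.0, 50.0, 0.0])
est = np.array([14.0, 104.0, 50.0, 7.0])
rate_3px = bad_pixel_rate(est, gt)
d1 = d1_all(est, gt)
```

In this toy case, the 4-pixel error at a ground truth disparity of 100 counts as an error under the fixed three-pixel threshold but not under D1-all, because it stays below 5% of the true disparity.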

Evaluation of the Discriminative Ability of the Proposed Costs
In this section, the effectiveness of the proposed cost functions is studied on both KITTI datasets without using any cost aggregation method. Figure 1 visualizes the output disparity maps of each cost function for both stereo algorithms. Column one shows the results obtained without aggregation, while column two shows the results obtained with the adaptive aggregation method. The results shown are for stereo pair #0 of the KITTI 2012 training dataset. In both cases, the figure illustrates that the proposed cost functions lead to promising results, while the conventional costs produce highly noisy disparity maps. These results demonstrate the discriminative power of the proposed costs even without cost aggregation, which supports the effectiveness of SSIM information for capturing reliable local information for stereo matching. The next section investigates the efficiency of these costs when an aggregation technique is used.

Evaluation of the Proposed Costs Using the Adaptive Aggregation Technique
To further reduce noise and construct refined cost functions, the adaptive aggregation method [10] was applied. This choice is motivated by the fact that the method is fast and accurate, which makes it suitable for real-time applications.
The effectiveness of the proposed method was first evaluated with respect to the support window size on the KITTI 2012 training dataset. Figure 2 presents the mean error rate, in both non-occluded and all regions, computed at the default three-pixel threshold over all training images. The size of the support window strongly affects the performance of the algorithm for both cost functions: the performance of the local stereo matching algorithm improves significantly as the support window grows. For example, when the window size increases from 3 to 5, the error of the C_SSIM cost function improves by 1.65% in non-occluded regions and by 1.61% in occluded zones.
In the following, we evaluate the robustness of the proposed cost functions with adaptive aggregation against the state-of-the-art cost functions. Tables 2 and 3 present the average percentage of erroneous pixels in both non-occluded and all regions. In Table 3, the errors were calculated at three different pixel error thresholds, while in Table 2 the D1-all error was computed. The results indicate that the proposed C_GSSIM cost function outperforms the others by a significant margin. C_GSSIM provides the lowest mean disparity errors on both datasets, followed by the proposed C_SSIM cost function, under the different scenarios. In Table 3, at the default three-pixel threshold, the improvement obtained by C_GSSIM is about 2.23 and 3.47 for the non-occluded region and about 2.84 and 3.4 for other zones, with respect to the C_DIFFCensus and C_GCCC costs, respectively. Table 2 likewise shows that the performance of our methods is significantly better than that of all other cost functions in both regions. For example, the improvement obtained by C_GSSIM is about 1.87 and 2.91 for the non-occluded region and about 1.82 and 2.84 for other zones, with respect to the C_GCCC and C_DIFFCensus costs, respectively. This evaluation shows that the proposed C_GSSIM cost function is more appropriate for real outdoor disparity computation than the top performers C_DIFFCensus and C_GCCC.

Sensitivity of the Cost Functions in the Presence of Radiometric Distortions
In this section, we study the impact of radiometric distortions on the different cost functions. These distortions are generated using the absolute color difference between corresponding pixels, as in [22]. At each level of radiometric distortion, we compute the mean disparity error over the whole KITTI training set for the C_SSIM, C_GSSIM, C_DIFFCensus [22], and C_GCCC [23] cost functions. Figure 3 shows that the proposed C_GSSIM cost gives the lowest error rate at all radiometric distortion levels.

Discussion
In the literature, it has been shown that cost functions based on pixel intensities are very sensitive to radiometric changes. In this paper, new intensity-based cost functions have been proposed. They take the local intensity, luminance, and contrast into account, which provides significant local information for describing the considered pixel within a support window. This gives the proposed cost functions the ability to deal with radiometric changes (see Figure 3). The results reported in Tables 2 and 3 demonstrate that the proposed cost functions outperform the top performers, C_DIFFCensus and C_GCCC, on both the KITTI 2012 and KITTI 2015 datasets. Although the latter yield better results when combined with aggregation techniques, the proposed costs with aggregation lead to the best results (see Tables 2 and 3). It must be noted that the overall performance of the proposed cost functions depends on the support window size. Both cost functions perform better as the size of the support region increases, as shown in Figure 2. This is expected, since large support regions hold enough information to describe the considered patch more accurately and thus lead to more accurate initial costs.

Conclusions
In this paper, we presented a new stereo matching algorithm with new structural-information-based cost functions for the cost computation step. Two cost functions were proposed and evaluated using real road scenes from the challenging KITTI 2012 and KITTI 2015 training datasets. The results demonstrate that both cost functions lead to lower mean disparity errors than the top performers on these datasets under different scenarios, which indicates that our cost functions are more robust to radiometric distortions than conventional cost functions. The evaluation of the proposed local stereo matching algorithm, using the best-performing cost function, against current state-of-the-art algorithms has demonstrated the potential merits of the proposed stereo similarity measurement.