Stereo Image Matching Using Adaptive Morphological Correlation

A stereo matching method based on adaptive morphological correlation is presented. The point correspondences of an input pair of stereo images are determined by matching locally adaptive image windows using the suggested morphological correlation, which is optimal with respect to an introduced binary dissimilarity-to-matching ratio criterion. The proposed method is capable of determining the point correspondences in homogeneous image regions and at the edges of scene objects of input stereo images with high accuracy. Furthermore, unknown correspondences of occluded and unmatched points in the scene can be recovered using a simple post-processing step, which is also proposed. The performance of the proposed method is exhaustively tested for stereo matching in terms of objective measures using known database images. In addition, the obtained results are discussed and compared with those of two similar state-of-the-art methods.


Introduction
Stereo vision recovers three-dimensional (3-D) information about the observed scene by processing at least two images of the scene captured from different viewpoints. Stereo vision is widely used in high-impact technologies, such as robot navigation, autonomous vehicles, augmented reality and medical diagnosis, among others [1,2]. Stereo vision has many advantages over other existing 3-D technologies; for instance, simplicity and flexibility, high-rate performance, large field of view, and low cost. A fundamental task in stereo vision is disparity estimation. This task, also known as stereo matching, consists of determining the correspondence of all points in a pair of stereo images. The 3-D distribution of the scene can be retrieved from the disparity by triangulation [3].
Over the years, several approaches for stereo matching have been proposed. These approaches can be classified as local, global or hybrid [4,5]. In many applications, the local approach is preferred over the global and hybrid approaches because it is suitable for high-rate performance. In general, local methods estimate the disparity of each point of the scene by matching local windows centered at given corresponding points in each stereo image. Local methods usually perform the following steps: matching-cost computation, cost aggregation, disparity computation, post-processing and refinement [5][6][7]. The matching cost quantifies the similarity of two corresponding image points for a given disparity value. Commonly, the matching cost is computed by comparing the intensity values of two given image points. The cost aggregation reduces the uncertainty in the association of matching points. This step is usually carried out by matching adaptive windows [8,9] or adaptive weight support functions [10]. Disparity computation is performed by selecting the best aggregation cost value for each corresponding point. Post-processing recovers the disparity of occluded image points. Finally, the refinement reduces estimation errors [11,12].
Within the state-of-the-art, several methods for matching-cost computation have been suggested [13]. The absolute difference (AD), squared difference (SD) and normalized cross-correlation (NCC) are widely known intensity-based matching measures [4]. Stereo matching based on the AD, SD or NCC is computationally efficient and possesses good tolerance to image noise. However, it tends to produce incorrect disparity estimates in image regions of low texture, nonstationary intensity or that are partially occluded [6]. Alternatively, matching-cost measures based on the relative order of pixel intensities have been considered [14]. The census transform (CT) is a non-parametric technique based on the local spatial structure [7,15]. The CT maps a given image point to a binary string. Each element of this string is true if the intensity of a given point is higher than that of a prespecified reference point; otherwise, it is false. Usually, the cost aggregation in CT-based methods is computed with the Hamming distance of two resultant binary strings. The CT is more accurate than intensity-based matching methods [6]. However, it is more sensitive to image noise [16].
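The census transform and its Hamming-distance cost described above can be illustrated with a minimal sketch. The function names `census_bits` and `hamming`, the 3 × 3 window and the toy intensity values are assumptions made for illustration only; they are not taken from the cited methods.

```python
def census_bits(img, x, y, radius=1):
    """Census bit string of pixel (x, y): one bit per neighbour in the window.

    A bit is True when the neighbour is brighter than the central
    (reference) point, following the description in the text.
    """
    center = img[y][x]
    bits = []
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dx == 0 and dy == 0:
                continue  # the reference point is not compared with itself
            bits.append(img[y + dy][x + dx] > center)
    return bits


def hamming(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    return sum(p != q for p, q in zip(a, b))


# Hypothetical 3 x 4 patch and the same patch with a brightness offset of +30.
left = [
    [10, 12, 11, 50],
    [9, 11, 13, 52],
    [8, 11, 10, 51],
]
right = [[v + 30 for v in row] for row in left]

c_left = census_bits(left, 1, 1)
c_right = census_bits(right, 1, 1)
cost = hamming(c_left, c_right)  # intensity ordering is unchanged by the offset
print(c_left, cost)
```

Because only the relative order of intensities enters the bit string, a uniform brightness change leaves the cost at zero, whereas an intensity-based measure such as the AD would report a large difference.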
Several variants of the CT have been suggested to improve the stereo matching accuracy and noise robustness. A simple approach consists of replacing the intensity value of the central element of the matching window (reference point) with the mean intensity value of its neighboring elements when computing the binary string [17]. Another approach is to compute the binary string from different pairs of image points within the matching window, excluding the central point [15,16]. Recently, the use of a weighting mask in the CT matching-cost computation has been suggested [18]. In addition, a trade-off between intensity-based AD and CT has been considered [19]. This approach, known as AD-Census, has good tolerance to image noise and accuracy of disparity estimation.
Although existing local methods for stereo matching have had great success, new alternatives still need to be explored to improve their performance. For instance, in the matching-cost and cost aggregation steps, it is desirable to obtain a low cost for image points with high similarity to those belonging to the object formed at the origin of the reference window and a high cost for the remaining points. To do this, we propose a robust method for stereo matching based on adaptive morphological correlation optimized with respect to a new criterion called binary dissimilarity-to-matching ratio (BDMR). First, locally adaptive windows constructed for a reference point and a potential corresponding point in the stereo image pair are preprocessed using binary threshold decomposition. Next, the morphological correlation is computed between the two preprocessed adaptive windows for different disparity values. Finally, a disparity estimate is obtained by finding the corresponding point coordinate of the maximum correlation. In addition, we propose a simple post-processing method to recover the disparity in occluded image points.
The main contributions of this research are as follows. A binary dissimilarity-to-matching ratio (BDMR) is introduced. By minimizing the BDMR, a matching-cost measure based on adaptive morphological correlation is derived. A locally adaptive cost aggregation method for stereo matching based on morphological correlation is proposed. An efficient post-processing method for recovering the disparity of occluded and unmatched stereo image points is proposed.
This paper is organized as follows. Section 2 presents the proposed method for stereo image matching. Section 3 presents the results obtained with the proposed stereo matching method using images from the Middlebury stereo dataset [20][21][22]. These results are discussed and compared with those obtained with two recent, similar methods. Finally, Section 4 presents our conclusions.

Stereo Matching with Adaptive Morphological Correlation
This section provides details of the proposed approach for stereo matching. First, we review the preliminaries of stereo vision. Secondly, we present the proposed method for image matching based on adaptive morphological correlation. Finally, we introduce the suggested approach for disparity post-processing.

Stereo Vision
Consider the stereo imaging system depicted in Figure 1. A pair of cameras project a point $P$ onto their corresponding image planes as the points $p_1$ and $p_2$, respectively. This setup assumes that the cameras are horizontally aligned, and the captured images $I_1(x, y)$ and $I_2(x, y)$ are rectified [23,24]. Thus, the points $p_1$ and $p_2$ can be located along the horizontal epipolar line, as shown in Figure 1. The location of the points $p_1$ and $p_2$ with coordinates $(x_1, y_1)$ and $(x_2, y_1)$, respectively, allows us to compute the disparity as
$$\delta = x_1 - x_2. \tag{1}$$
The depth $D$ to point $P$ from the stereo baseline can be obtained as
$$D = \frac{fB}{\delta}, \tag{2}$$
where $f$ is the focal length of the camera lens and $B$ is the distance between the optical camera centers. It should be noted that the parameters $f$ and $B$ are obtained by camera calibration, and the disparity $\delta$ is determined by stereo matching.
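The triangulation step can be sketched in a few lines, assuming the standard rectified-stereo relations: disparity as the horizontal coordinate difference of the two projections, and depth as the product of focal length and baseline divided by the disparity. The function name and all numeric values (focal length in pixels, baseline in metres, point coordinates) are hypothetical.

```python
def depth_from_disparity(x1, x2, focal_length_px, baseline_m):
    """Depth of a scene point from the x-coordinates of its two projections.

    Assumes rectified images, so both projections share the same row and
    the disparity is the horizontal offset x1 - x2.
    """
    disparity = x1 - x2
    if disparity <= 0:
        # Zero disparity corresponds to a point at infinity; negative
        # values are treated the same way in this sketch.
        return float("inf")
    return focal_length_px * baseline_m / disparity


# Hypothetical calibration: f = 700 px, B = 0.10 m.
depth = depth_from_disparity(x1=320, x2=300, focal_length_px=700.0, baseline_m=0.10)
print(depth)  # about 3.5 m for a disparity of 20 px
```

Since depth is inversely proportional to disparity, a one-pixel matching error matters far more for distant points (small disparity) than for nearby ones.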

Proposed Method for Stereo Matching
The block diagram of the proposed method is shown in Figure 2. The first step is the estimation of the disparity map from the input pair of rectified stereo images $I_1(x, y)$ and $I_2(x, y)$. Let $w_1(x, y)$ and $w_2(x, y)$ be two image windows, both of size $N_w \times N_w$, obtained from $I_1(x, y)$ and $I_2(x, y)$ at the coordinates $(x_0, y_0)$, respectively. According to the theory of morphological image processing, the image window $w_i(x, y)$ can be represented by the binary threshold decomposition in a given range $[q_0, q_N]$, Equation (3) [25][26][27], in which each term, defined in Equation (4), is a binary image of $w_i(x, y)$ for the $q$-th intensity value. Note that if $q_0 = 1$, then the decomposed window coincides with $w_i(x, y)$.
Now, assuming the horizontal epipolar constraint, we introduce the binary dissimilarity-to-matching ratio (BDMR) in Equation (5) as the ratio $D(\tau)/M(\tau)$, where the denominator $M(\tau)$ is a point-wise binary matching measure between $w_1(x, y)$ and $w_2(x - \tau, y)$, and the numerator $D(\tau)$ quantifies the binary dissimilarity of $w_1(x, y)$ and $w_2(x - \tau, y)$. Note that the BDMR produces zero when $w_1(x, y)$ and $w_2(x - \tau, y)$ are identical and infinity when there are no matches. We want to derive a matching-cost measure by minimization of the BDMR.
Based on the properties of the absolute value, Equation (5) can be rewritten as Equation (6). The summation terms in Equation (6) can be calculated as in Equation (7), where $Q = q_N - q_0$ is the number of quantization levels in the binary threshold decomposition. Moreover, by considering the relation in Equation (8), Equation (6) can be rewritten as Equation (9). The minimum value of Equation (9) is obtained by maximizing the expression in Equation (10), where the term $1/(Q N_w^2)$ is added to the denominator to avoid singularities. Now, by interchanging the order of summations and considering the identity in Equation (11) [25,27], Equation (10) can be rewritten as Equation (12). Equation (12) is a nonlinear correlation that minimizes the BDMR when the maximum correlation value is reached.
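The role of the binary threshold decomposition in this derivation can be checked numerically. The sketch below relies on two textbook properties for integer-valued windows: summing the absolute differences of the binary slices gives the sum of absolute intensity differences, and summing the products of the binary slices gives the sum of point-wise minima. The latter is why interchanging the order of summations leads to a min-based (morphological) correlation. The names, the range $[1, Q]$ and the toy windows are assumptions, and the sketch is not claimed to match the paper's equations term by term.

```python
def threshold_decompose(values, q0, q_n):
    """Binary slices: slice q holds 1 where the value is >= q, for q = q0..q_n."""
    return [[1 if v >= q else 0 for v in values] for q in range(q0, q_n + 1)]


w1 = [3, 0, 5, 2]  # hypothetical flattened window from the first image
w2 = [1, 4, 5, 0]  # hypothetical flattened window from the second image
Q = 5              # integer intensities are assumed to lie in [0, Q]

s1 = threshold_decompose(w1, 1, Q)
s2 = threshold_decompose(w2, 1, Q)

# Stacking the binary slices recovers the original window.
rebuilt = [sum(column) for column in zip(*s1)]

# Sum of |slice difference| over thresholds and pixels equals sum |w1 - w2|.
binary_dissimilarity = sum(
    abs(a - b) for r1, r2 in zip(s1, s2) for a, b in zip(r1, r2)
)

# Sum of slice products over thresholds and pixels equals sum min(w1, w2).
binary_overlap = sum(a * b for r1, r2 in zip(s1, s2) for a, b in zip(r1, r2))

print(rebuilt, binary_dissimilarity, binary_overlap)
```

In practice this means the correlation never has to be evaluated slice by slice: once the windows are quantized, a single pass of point-wise minima gives the same sum.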
For the problem of stereo matching, the maximum value of Equation (12) should occur at the coordinate $\tau = \delta$; that is, at the location where the sliding window $w_2(x - \tau, y)$ matches the reference window $w_1(x, y)$. To improve the accuracy and robustness of stereo matching using Equation (12), the values $\{q_0, q_N\}$ can be chosen to properly describe the implicit object formed at the origin $(x_0, y_0)$ of the window $w_i(x, y)$, identified as the target. Thus, the values $\{q_0, q_N\}$ are specified by Equation (13) in terms of $\sigma_{w_i}$, the standard deviation of $w_i(x, y)$ with respect to $w_i(x_0, y_0)$, and a dispersion parameter $v$. Equation (12) can then be adapted to each point of the pair of stereo images, which yields the adaptive morphological correlation of Equation (14). Its arguments, given in Equation (15), are preprocessed image windows of $I_1(x, y)$ and $I_2(x, y)$, respectively, obtained by adaptive binary threshold decomposition with the quantization step given in Equation (16).
To perform stereo matching using Equations (14)-(16), consider a reference point $p_1$ with coordinates $(x_0, y_0)$ in the image $I_1(x, y)$. The corresponding point $p_2$ in image $I_2(x, y)$ can be detected and located as depicted in the block diagram shown in Figure 3. First, a reference window $w_1(x, y)$ with origin at the point $p_1$ and size of $N_w \times N_w$ is constructed from $I_1(x, y)$, where $N_w = 2s + 1$ and $s$ is computed adaptively by Equation (17). In Equation (17), $\beta$ is a scalar, $s_0$ is a prespecified parameter defining the maximum allowable window size and $\sigma^2_{s_0}$ is the standard deviation of the intensity values of the points within the reference window with a maximum size of $(2s_0 + 1) \times (2s_0 + 1)$ with respect to $p_1$. Then, a sliding window $w_2(x - \tau, y)$, $\tau \in [0, \delta_{\max}]$, with a size of $N_w \times N_w$ is constructed from $I_2(x, y)$. Note that $w_2(x - \tau, y)$ is shifted along the horizontal epipolar line of $I_2(x, y)$. Afterward, $w_1(x, y)$ and $w_2(x - \tau, y)$ are preprocessed by binary threshold decomposition as described in Equation (15).
Next, the adaptive morphological correlation given in Equations (14)-(16) is computed for all values of $\tau$. Finally, a disparity estimate is obtained from Equation (18) as the value of $\tau$ at which the correlation is maximum. The disparity maps $\delta_1(x, y)$ and $\delta_2(x, y)$ can be obtained by applying the proposed method to all points of the stereo images $I_1(x, y)$ and $I_2(x, y)$.
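The search along the epipolar line can be illustrated with a simplified sketch. It is not the proposed adaptive correlation. It uses a fixed window size, skips the adaptive threshold decomposition, and scores each shift with a stand-in measure (sum of point-wise minima divided by sum of point-wise maxima) whose normalization is an assumption. The function names and the synthetic image pair are hypothetical.

```python
def window(img, xc, yc, s):
    """Flattened (2s+1) x (2s+1) window centred at (xc, yc)."""
    return [img[y][x]
            for y in range(yc - s, yc + s + 1)
            for x in range(xc - s, xc + s + 1)]


def min_max_correlation(a, b):
    """Sum of point-wise minima over sum of point-wise maxima (assumed form)."""
    num = sum(min(p, q) for p, q in zip(a, b))
    den = sum(max(p, q) for p, q in zip(a, b))
    return num / den if den else 1.0


def estimate_disparity(img1, img2, x0, y0, s, d_max):
    """Return the shift tau in [0, d_max] that maximizes the correlation."""
    ref = window(img1, x0, y0, s)
    best_tau, best_score = 0, -1.0
    for tau in range(d_max + 1):
        if x0 - tau - s < 0:
            break  # the sliding window would leave the image
        score = min_max_correlation(ref, window(img2, x0 - tau, y0, s))
        if score > best_score:
            best_tau, best_score = tau, score
    return best_tau


# Synthetic pair: img1 is img2 shifted 3 pixels to the right.
width, height, true_shift = 16, 5, 3
img2 = [[x * x + y for x in range(width)] for y in range(height)]
img1 = [[img2[y][x - true_shift] if x >= true_shift else 0 for x in range(width)]
        for y in range(height)]

disparity = estimate_disparity(img1, img2, x0=10, y0=2, s=1, d_max=6)
print(disparity)
```

The stand-in score equals one only when the two windows are identical, so on this synthetic pair the maximum is reached at the true shift. With real images, the adaptive window size and adaptive thresholds described above are what make the maximum well defined in low-texture regions and near object edges.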

Disparity Post-Processing
The estimated disparity maps $\delta_1(x, y)$ and $\delta_2(x, y)$ can be verified with the binary mask $m_k(x, y)$ defined in Equation (19), where $\delta$ is a tolerance parameter, $k = \{1, 2\}$ and $l \neq k$. Note that a value of $m_k(x, y) = 1$ in Equation (19) indicates a verified estimated disparity, whereas a value of $m_k(x, y) = 0$ denotes an incorrectly estimated disparity caused by an occlusion or any other perturbation. Let $T = \{(x_T, y_T) : m_i(x_T, y_T) = 1\}$ be the set of coordinates of all verified estimated disparities and $F = \{(x_F, y_F) : m_i(x_F, y_F) = 0\}$ be the set of coordinates of all incorrectly estimated disparities. A desirable post-processing method requires replacing the incorrectly estimated disparity value $\delta_i(x_F, y_F)$ with verified disparity values from the set $\{\delta_i(x_T, y_T)\}$. In this context, we consider the prior probability $P(x, y)$ that a verified estimated disparity at arbitrary coordinates $(x, y)$ can replace the incorrect disparity at the coordinates $(x_F, y_F)$, which is given by Equation (20), where a normal distribution with variance $\sigma_1^2$ is assumed. Furthermore, the probability density function that an image point with intensity value $I_i(x, y)$ has a similar disparity as that expected at the coordinates $(x_F, y_F)$ is given by Equation (21), where $\sigma_2^2$ is the variance of the target's intensity values. According to Bayesian theory, the posterior probability that an image point with disparity $\delta(x, y)$ and intensity $I_i(x, y)$ can replace the unknown disparity $\delta_i(x_F, y_F)$, given that $I_i(x_F, y_F)$ is the intensity of $I_i(x, y)$ at the coordinates $(x_F, y_F)$, is given as
$$P(\delta_i(x, y) \mid I(x_F, y_F)) = \frac{P(I_i(x, y) \mid \delta_i(x_F, y_F))\, P(x, y)}{P(I_i(x, y))}, \tag{22}$$
where $P(I_i(x, y))$ is the prior probability density function of the intensity of $I_i(x, y)$.
As a result, the coordinates $(x_T, y_T) \in T$ of the disparity $\delta(x_T, y_T)$ with the highest probability of corresponding to the unknown disparity $\delta(x_F, y_F)$ can be obtained by maximizing the posterior probability over $T$, Equation (23). By substituting Equations (20) and (21) into Equation (23), and by applying the logarithm function, we get the estimator in Equation (24), where $\delta_i(\hat{x}_T, \hat{y}_T)$ is an estimate of the incorrect disparity $\delta_i(x_F, y_F)$. Thus, by applying the estimator given in Equation (24) to all elements of the set $F$, one can obtain the improved post-processed disparity maps $\delta_1(x, y)$ and $\delta_2(x, y)$.
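The verification and replacement steps can be sketched as follows, under stated assumptions. The check is written in the common left-right consistency form, which compares a disparity with the one found at the matched position in the other map. The replacement minimizes a sum of squared spatial distance scaled by $\sigma_1^2$ and squared intensity difference scaled by $\sigma_2^2$, which is what the logarithm of a product of two Gaussians gives when constant terms and the evidence term are dropped. Function names, parameter values and the toy data are hypothetical.

```python
def lr_check(d1, d2, tol=1):
    """Mask of disparities in d1 confirmed by d2 (left-right consistency)."""
    h, w = len(d1), len(d1[0])
    mask = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            x2 = x - d1[y][x]  # matched column in the other image
            if 0 <= x2 < w and abs(d1[y][x] - d2[y][x2]) <= tol:
                mask[y][x] = 1
    return mask


def fill_unverified(disp, mask, img, sigma1=5.0, sigma2=10.0):
    """Replace each unverified disparity with the best verified candidate."""
    h, w = len(disp), len(disp[0])
    verified = [(x, y) for y in range(h) for x in range(w) if mask[y][x]]
    out = [row[:] for row in disp]
    if not verified:
        return out  # nothing to copy from
    for yf in range(h):
        for xf in range(w):
            if mask[yf][xf]:
                continue

            def cost(p):
                xt, yt = p
                spatial = ((xt - xf) ** 2 + (yt - yf) ** 2) / sigma1 ** 2
                photometric = (img[yt][xt] - img[yf][xf]) ** 2 / sigma2 ** 2
                return spatial + photometric  # negative log-posterior up to constants

            xt, yt = min(verified, key=cost)
            out[yf][xf] = disp[yt][xt]
    return out


# Toy scanline: a dark object (disparity 2) next to a bright one (disparity 5).
img = [[10, 10, 10, 200, 200, 200]]
disp = [[2, 2, 9, 5, 5, 5]]   # the value 9 is an outlier
mask = [[1, 1, 0, 1, 1, 1]]   # the outlier failed verification
filled = fill_unverified(disp, mask, img)
print(filled)

consistency = lr_check([[0, 1, 1, 1]], [[1, 1, 1, 0]], tol=0)
print(consistency)
```

The unverified pixel has equally close verified neighbours on both sides, and the intensity term breaks the tie in favour of the dark object. The same mechanism explains the failure case reported for the wood knot in the Results section, where nearby verified points have very different intensities.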

Results
This section presents the results obtained with the proposed approach for stereo matching using images from the Middlebury stereo dataset [20][21][22]. The results are discussed and compared with those obtained with two recent variants of the CT, namely, the improved weighted census transform (IWCT) [18] and the improved AD-Census (AD-C) algorithm [19]. The accuracy of disparity estimation by the proposed, IWCT and AD-C methods is quantified in terms of the bad-matched pixels (BMP) and root mean squared (RMS) error between estimated and ground truth disparities. For the BMP measure, we set the tolerance of δ = 2. First, we quantify the performance of the proposed and considered methods for disparity estimation of non-occluded regions in input stereo images. Next, we evaluate the performance of the suggested disparity post-processing method. Additionally, we show refined disparity maps obtained with the proposed approach using a generic refinement method. Finally, we present the statistical performance results of the proposed and considered methods for stereo matching with twenty-five images from the Middlebury stereo dataset.
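The two accuracy measures can be sketched as follows, assuming the usual definitions: BMP as the fraction of evaluated pixels whose absolute disparity error exceeds the tolerance, and RMS as the root of the mean squared error over the same pixels. Whether the comparison is strict and whether BMP is reported as a fraction or a percentage are assumptions, as are the function name and the convention of marking unknown ground truth with `None`.

```python
import math


def bmp_and_rms(estimated, ground_truth, tolerance=2.0):
    """Bad-matched-pixel fraction and RMS error between two disparity maps.

    Pixels whose ground-truth value is None are excluded from both measures.
    """
    errors = [abs(e - g)
              for est_row, gt_row in zip(estimated, ground_truth)
              for e, g in zip(est_row, gt_row)
              if g is not None]
    bmp = sum(err > tolerance for err in errors) / len(errors)
    rms = math.sqrt(sum(err * err for err in errors) / len(errors))
    return bmp, rms


# Hypothetical 2 x 2 disparity maps: absolute errors are 0, 3, 1 and 0.
est = [[10, 12], [7, 3]]
gt = [[10, 15], [8, 3]]
bmp, rms = bmp_and_rms(est, gt, tolerance=2.0)
print(bmp, rms)
```

BMP counts how many estimates are wrong by more than the tolerance regardless of how wrong they are, while RMS is dominated by the largest errors, so the two measures are complementary.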
The proposed method, IWCT and AD-C were implemented in Python 3.10.7 on a personal computer with an Intel Core i5 2.4 GHz processor, 16 GB of RAM and the Linux Ubuntu 20.04 operating system. Figure 4a shows the right image from eight different stereo image pairs from the dataset. The window size for all tested methods is $N_w \times N_w$, where $N_w = 2s_0 + 1$ and $s_0 = 6$. For the proposed method we set $Q = 31$, $v = 1.5$ and $\beta = 2.5 s_0$. Figure 4b shows the ground truth disparities of non-occluded regions of the input images shown in Figure 4a. The non-occluded regions are obtained by applying the verification method given in Equation (19) to the ground truth disparities provided by the dataset. The estimated disparity maps obtained with the IWCT, AD-C and proposed method are presented in Figure 4c-e, respectively. Notice that the proposed method produces the lowest values of BMP and RMS measures in all cases compared to those obtained with the IWCT and AD-C methods. The proposed method is able to estimate the disparity in homogeneous regions with high accuracy. This feature is obtained when the specified number $Q$ of quantization levels for the binary threshold decomposition is sufficiently large ($Q > 8$). Furthermore, it can be seen that the proposed method is also able to correctly estimate the disparity at the edges of the objects in the scene. This feature is due to the dynamic adaptation of the sliding windows employed for point matching given in Equation (17). On the other hand, the IWCT method produces the worst results of all tested methods. This approach yields many incorrectly estimated disparity values in homogeneous image regions. Note that the test images shown in Figure 4a present several challenges for stereo matching, such as image regions with little texture, partial occlusions, nonstationary intensity changes, objects with sharp edges and abrupt disparity variations.
According to the obtained results shown in Figure 4c-e, the proposed method adapts better to challenging situations than the other tested methods. However, the lack of texture in image regions larger than the search space of the algorithm causes the matching method to be unable to determine the point correspondences. For instance, this can be seen in the central area of the Recycle image. The AD-C algorithm yields good results in the majority of the performed tests. This algorithm can estimate the disparity values at the edges of the scene objects very well. However, its performance is lower than that of the proposed method.
Afterward, we evaluated the performance of the post-processing method described in Section 2.3. We applied the suggested post-processing to the estimated disparity maps obtained with the IWCT, AD-C and proposed method, see Figure 4c-e. The resultant post-processed disparity maps are presented in Figure 5b-d. It can be seen that the suggested post-processing is successful in retrieving the unknown disparity values in occluded regions of the input stereo images. Furthermore, it can also retrieve several incorrectly estimated disparity values in homogeneous image regions, which were not verified by Equation (19). Additionally, Figure 5b-d presents the BMP and RMS values of all tested methods between the estimated and ground truth disparities shown in Figure 5a. The IWCT and AD-C methods yield higher BMP and RMS values in comparison with those obtained with the proposed method. The AD-C algorithm yields slightly better performance than the IWCT. However, the post-processed disparity maps obtained with the proposed method yield the best results of all the tested methods. It is worth mentioning that the large occluded regions on the right side of the estimated disparity maps shown in Figure 4c-e were correctly recovered by the suggested post-processing in all tested methods.
However, note that the post-processing was unable to recover the disparity values in the wood knot shown in the Wood2 image. This is because the verified disparity values in the vicinity of this region are associated with image points with intensity values that are significantly different from those of the wood knot.
Next, the post-processed disparity maps shown in Figure 5b-d were refined by applying the well-known weighted least-squares filter [28]. The refined disparity maps are shown in Figure 6. Note that the refinement significantly reduces anomalous disparity errors for all tested methods. The refined disparity maps using the proposed adaptive morphological correlation approach produce the best results of all tested methods in terms of the BMP and RMS measures. Furthermore, we see that the refined disparity maps obtained with the IWCT and AD-C methods for images such as Adirondack, Recycle and Rocks1 contain very noticeable artifacts, while the refined disparity maps using the proposed approach contain fewer artifacts. This result is expected because any refinement method performs better when the input disparity map contains fewer incorrect disparity estimates, such as those obtained with the proposed approach.
Finally, we compare the statistical performance of the proposed method, IWCT and AD-C, in terms of both BMP and RMS measures. In this experiment, we estimated the disparity map of twenty-five different stereo images from the Middlebury stereo dataset using each of the considered stereo matching methods. The mean value and standard deviation of the BMP and RMS measurements were computed for each tested method. The results are presented in Figure 7 and Table 1. Figure 7a shows the statistical results for all tested stereo matching methods considering only the non-occluded regions of the input stereo images. Note that the proposed approach yields the best results of all tested methods. In contrast, the IWCT produces the worst results. This low performance is because the IWCT produces many wrong disparity estimates in homogeneous image regions, as shown in Figure 4c. The AD-C algorithm yields good statistical results in general terms. The AD-C approach produces fewer incorrect disparity estimates in homogeneous image regions than the IWCT. Additionally, it correctly estimates the disparity values at the edges of the objects present in the scene. However, the performance of both AD-C and IWCT methods is lower than that of the proposed approach.
Figure 7b presents the statistical results of the post-processed disparity maps using the suggested approach. It should be noted that the reference for computing the BMP and RMS measures consists of the ground truth disparities provided by the dataset, see Figure 5a. Note that the proposed approach yields the best results, whereas IWCT yields the worst results. The AD-C algorithm produces acceptable results in general terms. The results shown in Figure 7a and Table 1 confirm that the proposed method based on adaptive morphological correlation is effective and robust for stereo image matching.
Additionally, the results presented in Figure 7b and Table 1 indicate that the suggested post-processing method is successful in retrieving the disparity values in occluded image regions.

Conclusions
An accurate and robust method for stereo image matching based on adaptive morphological correlation was presented. The correspondence of non-occluded points in a pair of rectified stereo images was accurately determined by matching locally adaptive image windows using the suggested morphological correlation operation, which is optimal with respect to the newly introduced criterion called the binary dissimilarity-to-matching ratio. In addition, a simple disparity post-processing method for recovering point correspondences of occluded points was suggested. The performance of the proposed method for stereo matching was exhaustively tested in terms of the bad-matched pixels and root mean squared error objective measures using images of the well-known Middlebury stereo dataset. The obtained results were discussed and compared with two recent state-of-the-art methods based on the census transform. According to the performed experiments and obtained results, the proposed method for stereo matching outperformed the existing tested methods in terms of the considered performance measures. Additionally, the obtained results confirmed that the suggested post-processing method allowed the disparity values of partially occluded image points to be successfully recovered.