Robust Cost Volume Generation Method for Dense Stereo Matching in Endoscopic Scenarios

Stereo matching in binocular endoscopic scenarios is difficult due to the radiometric distortion caused by restricted lighting conditions. Traditional matching algorithms perform poorly in challenging areas, while deep learning methods are limited in generalizability and computational cost. We introduce a non-deep-learning cost volume generation method whose performance is close to that of a deep learning algorithm, but with far less computation. To deal with the radiometric distortion problem, the initial cost volume is constructed using two radiometric-invariant cost metrics: the histogram of gradient angle and amplitude descriptors. We then propose a new cross-scale propagation framework that improves the matching reliability in small homogeneous regions without increasing the running time. Experimental results on the Middlebury Version 3 Benchmark show that the combination of our method and Local-Expansion, an optimization algorithm, ranks at the top among non-deep-learning algorithms. Further quantitative results on a surgical endoscopic dataset and on our binocular endoscope show that the accuracy of the proposed algorithm is at the millimeter level, comparable to that of deep learning algorithms. In addition, our method is 65 times faster than its deep learning counterpart in terms of cost volume generation.


Introduction
Stereo matching is the task of finding the corresponding pixels along the horizontal epipolar lines in the rectified image pairs captured by calibrated binocular cameras [1]. It is a low-level vision task that serves as the basis for other 3D applications [2]. 3D reconstruction and measurement in confined spaces are considered to be future directions of endoscopy in both medical and industrial scenarios [3]. However, given the limited lighting conditions, the matching results need to be robust to radiometric distortions.
Many stereo methods, both traditional and deep learning based, have been proposed to obtain reliable matching results, the so-called disparities [4,5]. According to the current public benchmarks, deep learning algorithms are far ahead of traditional algorithms in terms of accuracy [6]. However, deep learning methods are highly data-dependent and cannot guarantee the same performance in real scenarios outside their training distribution. Moreover, their high computational complexity limits real-time application on embedded devices [5]. On the other hand, although traditional algorithms have advantages in these respects, their reliability needs to be improved, especially in challenging scenes such as homogeneous [7] and radiometrically distorted areas [8].
The cost volume evaluates the dissimilarity between a pixel in the base image and its candidate corresponding pixels in the matching image [4]. Its width and height are equal to those of the input images, while the length of its third dimension is the maximum disparity. Both handcrafted and learning-based functions can be used to generate the cost values in the volume; e.g., Local-Expansion [9] uses both kinds of cost volume [10] as its data term. Previous research has demonstrated that matching performance is largely determined by the quality of the cost volume. However, generating a cost volume requires significant computation. Thus, in this paper, we aim to produce a robust cost volume suitable for binocular endoscopic scenarios using traditional methods, with reliability comparable to that of deep learning methods but with lower computational complexity.
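To make this data structure concrete, the following sketch builds a toy cost volume whose shape follows the description above (height x width x maximum disparity), using a plain absolute-difference cost as a stand-in for the metrics discussed later; the function name and the toy cost are illustrative only, assuming rectified grayscale inputs.

```python
import numpy as np

def build_cost_volume(base, match, max_disp):
    """Build an H x W x max_disp cost volume with absolute-difference costs.

    base, match: rectified grayscale images (H x W float arrays).
    The cost at (y, x, d) compares base[y, x] with match[y, x - d].
    """
    h, w = base.shape
    volume = np.full((h, w, max_disp), np.inf)
    for d in range(max_disp):
        # candidate pixels lie d columns to the left along the epipolar line
        volume[:, d:, d] = np.abs(base[:, d:] - match[:, :w - d])
    return volume
```

A disparity map then follows from a per-pixel minimum over the third dimension, e.g., `volume.argmin(axis=2)`.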
Traditional methods can be further divided into local and global ones based on their optimization goals [11]. Local methods focus on improving the cost volume's reliability by gathering support from nearby pixels in a local region, while global methods iteratively minimize an energy function composed of a data term and a smoothness term. As the cost volume can serve as the data term with a few modifications, we refer to both uniformly as cost volumes. Our algorithm framework consists of two parts: the proposed cost generation method, which generates cost volumes with high robustness, and the cost optimization methods (local or global), which optimize the cost volumes to obtain accurate disparities.
Our cost generation method contains three consecutive processes. First, we build the initial cost volume using the histograms of gradient angles (HOG angle) and amplitudes (HOG amplitude), both of which are radiometric invariant under the linear radiation distortion model (Section 3.1) [12]. Then, noise in the cost volume is suppressed by the guided filter under the same linear assumption as the HOG metrics (Section 3.2) [13]; the shared underlying assumption ensures the effectiveness of the filtering. The robustness of the cost volume in low-texture areas is further improved by the new cross-scale propagation scheme (Section 3.3). As for cost optimization, two traditional methods [9,14] are used to obtain the disparity results after the final cost volume is generated, which demonstrates its advantage in reliability (Section 3.4). Section 4 presents quantitative experimental results on the commonly used Middlebury datasets, a newly proposed surgical endoscopic dataset, and real image pairs of standard objects captured by our binocular endoscope.

Related Works
Considering the aim of our method, we mainly review the literature related to cost volume generation in this section. Various cost metrics have been proposed to construct the initial cost volume. Absolute Intensity Differences (AD) and Squared Intensity Differences (SD), along with their summation forms, are early metrics with low computational complexity [15]. However, they perform poorly in regions with radiometric distortion. In comparison, Normalized Cross-Correlation (NCC) is insensitive to contrast changes but causes edge-fattening artifacts [16]. The subsequent zero-mean forms of these metrics show higher robustness at the cost of increased computation [8]. The Census Transform (CT), a representative non-parametric metric, has a relatively simple calculation process yet good accuracy [17]. Ref. [12] uses the histogram of oriented gradients (HOG) descriptor [18] as a cost metric for the first time and proves its radiometric invariance under the linear radiation distortion assumption. Furthermore, a random forest classifier is trained in [19] to improve the confidence of a combination of conventional metrics.
Cost aggregation is a key step in the stereo matching framework and can be used to classify algorithms into three categories: local [20,21], non-local [22,23], and semi-global [4]. Non-local and semi-global methods are similar in that their aggregation processes follow particular paths. However, the former restrict aggregation to areas of similar color via adaptive aggregation weights, while the latter aggregate along scan lines across the whole image. Aggregation in local methods is limited to fixed, multi-fixed, or adaptive windows around the center pixel based on the local consistency assumption, which can cause blurring in discontinuity regions [24]. With the same de-noising purpose, edge-preserving filters have been introduced to improve the cost volume's robustness, achieving impressive results. In this paper, we choose the guided filter with a small radius for three reasons: it ensures model consistency by using the same window size as the cost metric computation; it prevents error propagation over large areas during filtering; and it has low computational complexity [25].
As for deep learning methods, some optimization algorithms [9,26] using the cost volume generated by MC-CNN [10] are proven to be able to get more accurate disparity results than those using traditional metrics. Ref. [27] calculates the cost volumes at multiple levels using the shared features extracted by CNN. To improve the models' generalization abilities and save the running memory and runtime, cascade cost volume structures in a coarse-to-fine manner are proposed [28,29]. Furthermore, these models can be used in optical flow estimation tasks [30].

Materials and Methods
Our cost volume construction process consists of steps similar to cost calculation and cost aggregation, the first two procedures of the classical local method [11]. In addition, we introduce the cross-scale propagation method to further improve the robustness of the cost volume.

Pixel-Wise Cost Calculation
Radiometric distortion, caused by unavoidable differences in the settings of the binocular cameras and by noise in the imaging process, is a common factor that deteriorates stereo matching accuracy [8]. Although ZNCC (zero-mean normalized cross-correlation) and CT (Census Transform) have been verified to be robust cost metrics [31], we use the improved HOG cost for its better matching accuracy. The accuracy of these metrics is evaluated in the experimental section.
Based on the linear radiation distortion model (1) and the linearity assumption within the small window, the gradient-vector angles θ_b(p_b) in (2) and θ_m(p_m) in (3) are radiometric invariant [12]:

g_b(p_b) = c · g_m(p_m) + t, (1)

θ_b(p_b) = arctan(G_y(p_b) / G_x(p_b)), (2)

θ_m(p_m) = arctan(G_y(p_m) / G_x(p_m)). (3)

In (1), g_b(p_b) and g_m(p_m) are the gray intensities of the corresponding pixels p_b in the base image and p_m in the matching image, and c and t are the model coefficients. In (2) and (3), G_x and G_y are the horizontal and vertical gradients of the pixel p; likewise, the subscripts b and m distinguish the base and matching images.
As shown in Figure 1a,b, we first compute the gradient angles in a predefined window centered at the pixel p, then obtain the histogram of gradient angles by counting the number of vectors falling into each bin, where the bins divide 360 degrees into 12 equal parts. After calculating each pixel's normalized HOG_angle in the base and matching images, we can construct the cost volume C_HOG,angle by

C_HOG,angle(p_b, d) = HOG_angle(p_b) * HOG_angle(p_b − d), (4)

where HOG_angle represents the angle histogram descriptor of the pixel and d is the horizontal distance between corresponding pixels, i.e., the disparity. The * operator denotes the sum of the absolute differences of each bin of the 12-dimensional vectors.

Unlike the CT metric used in [12], we further develop the relative HOG_amplitude descriptor to compensate for the gradient-vector amplitude information discarded by the HOG_angle descriptor. The relative amplitude ρ of the gradient vector is calculated by

ρ(p) = sqrt(G_x(p)^2 + G_y(p)^2) / sqrt(G_x(p_center)^2 + G_y(p_center)^2), (5)

where G_x and G_y have the same meanings as in (2) and (3), and p_center denotes the center pixel p_b,center or p_m,center of the local window in the base or matching image, respectively. We obtain the amplitude histogram descriptor HOG_amplitude by concatenating the relative amplitude values of each pixel in the window, as shown in Figure 1c. The gradient computation and the division operation eliminate the influences of the coefficients t and c of the linear radiation distortion model respectively, so HOG_amplitude is also a radiometric-invariant metric.
Then the corresponding cost volume is constructed by

C_HOG,amplitude(p_b, d) = HOG_amplitude(p_b) * HOG_amplitude(p_b − d), (6)

where the meanings of the symbols are the same as in (4).
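As an illustration of the two descriptors, the sketch below computes the normalized HOG angle histogram and the relative-amplitude vector for one window, together with the * operator as a sum of absolute bin differences. The arctan2-based binning and the normalization by the center pixel's gradient magnitude are our reading of the descriptor definitions above, not the authors' exact code.

```python
import numpy as np

def hog_descriptors(gx, gy, n_bins=12):
    """Compute the HOG angle histogram and the relative-amplitude vector
    for one local window, given its horizontal/vertical gradients gx, gy."""
    angles = np.arctan2(gy, gx) % (2 * np.pi)           # gradient angles in [0, 2*pi)
    bins = np.minimum((angles / (2 * np.pi / n_bins)).astype(int), n_bins - 1)
    hog_angle = np.bincount(bins.ravel(), minlength=n_bins).astype(float)
    hog_angle /= hog_angle.sum()                        # normalized 12-bin histogram

    amp = np.hypot(gx, gy)                              # gradient magnitudes
    cy, cx = gx.shape[0] // 2, gx.shape[1] // 2
    center = amp[cy, cx] if amp[cy, cx] > 0 else 1.0
    hog_amplitude = (amp / center).ravel()              # relative amplitudes
    return hog_angle, hog_amplitude

def sad_cost(desc_b, desc_m):
    """The '*' operator: sum of absolute differences of each bin."""
    return np.abs(desc_b - desc_m).sum()
```

Note that scaling all gradients by a common factor c, as the linear radiation distortion model implies, leaves both descriptors unchanged, which is the radiometric-invariance property claimed above.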
Finally, C_HOG,angle(p_b, d) and C_HOG,amplitude(p_b, d) are averaged to obtain the initial cost volume C_HOG. Figure 2 shows the difference between the disparities generated from C_HOG,angle with and without fusing C_HOG,amplitude, using the WTA (winner-takes-all) strategy. The disparities that fail the LRC (left-right consistency) check are shown as white points [32].

The computational complexity of the HOG building process is relatively low when using a fixed window. However, both HOG_angle and HOG_amplitude descriptors are sensitive to large, outlying gradient values. As illustrated in Figure 3a,b, the pixel near the cup handle in the left image tends to match a pixel in the same region of the right image because of the dominant influence of the top-right pixels in the cost calculation. The final disparity in Figure 3b shows that pixels in the background area inside the cup handle are assigned the wrong foreground disparity. To alleviate the influence of such dominant gradients within the window, we set an RGB distance threshold and a gradient amplitude threshold. When building the HOG descriptors in the base image, if the color difference or the gradient difference between a local pixel and the center pixel exceeds the corresponding threshold, we adapt the shape of the local window to exclude that position, and use the same adaptive window to calculate the HOG metrics in the matching image. Figure 3c shows the result of this modification: the specific pixel now matches the correct position, and the matching accuracy around the cup handle is improved.
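The WTA and LRC steps used above for intermediate evaluation can be sketched as follows; this is a minimal, unoptimized illustration rather than the authors' implementation.

```python
import numpy as np

def wta_disparity(cost_volume):
    """Winner-takes-all: pick the disparity with the minimum cost per pixel."""
    return cost_volume.argmin(axis=2)

def lrc_check(disp_left, disp_right, tol=1):
    """Left-right consistency: a left disparity d at (y, x) must agree with
    the right disparity at (y, x - d). Returns a boolean validity mask;
    failed pixels correspond to the white points in the figures."""
    h, w = disp_left.shape
    valid = np.zeros((h, w), dtype=bool)
    for y in range(h):
        for x in range(w):
            d = disp_left[y, x]
            if 0 <= x - d < w and abs(d - disp_right[y, x - d]) <= tol:
                valid[y, x] = True
    return valid
```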

Cost Volume Filtering
The initial cost volume C_HOG is usually noisy, which means the minimum of a pixel's cost curve may not coincide with its correct disparity. To improve the robustness of the cost volume, this noise needs to be suppressed by a cost volume filtering process. The joint bilateral filter [21] and the guided image filter [13] are the two commonly used edge-preserving filters. In stereo matching applications, the guided image filter shows advantages in both edge-preserving performance and computational complexity [33]. More importantly, the guided filter is based on the same local linear model assumption as the HOG metrics. We therefore compute the output of the guided filter as [13]
q_i = (1/|ω|) Σ_{k: i ∈ ω_k} (a_k I_i + b_k), (7)

where I_i is the value of pixel i in the guidance image and q_i is the corresponding filter output. a_k and b_k are the coefficients of the linear model of the window ω_k, computed from the filter input and the guidance image. |ω| is the number of pixels in the window ω_k, and the summation runs over all windows containing the pixel i.
The filtering process is shown in Figure 4. The initial cost volume is sliced along the disparity dimension; each slice has the same size as the matching image, whose grayscale version is used as the guidance image. The filtered cost slices are then stacked to form the filtered cost volume.
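A minimal per-slice guided filtering sketch, following the standard formulation of He et al. [13]; the naive box-filter helper and the parameter values are illustrative, not the paper's exact settings.

```python
import numpy as np

def box(img, r):
    """Mean over a (2r + 1) x (2r + 1) window with edge padding."""
    h, w = img.shape
    pad = np.pad(img, r, mode='edge')
    out = np.zeros((h, w))
    k = 2 * r + 1
    for dy in range(k):
        for dx in range(k):
            out += pad[dy:dy + h, dx:dx + w]
    return out / (k * k)

def guided_filter(I, p, r=2, eps=1e-4):
    """Guided filter: q_i = mean(a)_i * I_i + mean(b)_i, with per-window
    linear coefficients a_k, b_k estimated from guidance I and input p."""
    mean_I, mean_p = box(I, r), box(p, r)
    cov_Ip = box(I * p, r) - mean_I * mean_p
    var_I = box(I * I, r) - mean_I * mean_I
    a = cov_Ip / (var_I + eps)          # slope of the local linear model
    b = mean_p - a * mean_I             # offset of the local linear model
    return box(a, r) * I + box(b, r)

def filter_cost_volume(volume, guide):
    """Filter each disparity slice with the grayscale guidance image."""
    return np.dstack([guided_filter(guide, volume[:, :, d])
                      for d in range(volume.shape[2])])
```

A 5 x 5 filter window (r = 2) matches the window size used for the HOG metric computation, which is the model-consistency argument made above.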

Cross-Scale Cost Volume Propagation
Matching pixels in texture-less areas is difficult for stereo methods, especially local ones [34], because there is insufficient detail to distinguish similar pixels.
Inspired by the fact that human eyes use multi-scale information in stereo perception, a cross-scale cost aggregation framework was proposed in [35]. Although its effectiveness has been validated experimentally, it is not suitable for our method for two reasons: it down-samples in both the width and height dimensions, and it uses fixed aggregation coefficients over all image areas.
Down-sampling the input image in the horizontal direction reduces the disparity range accordingly, so the cost curves of the same pixel at coarse scales must be interpolated to the same length as those at the finest scale, and the interpolation errors decrease the robustness of the aggregated cost volume. In our scheme, images are therefore down-sampled only in the vertical direction, which lets the HOG descriptor perceive more texture information, while the horizontal resolution remains unchanged to preserve accuracy.
Aggregation coefficients determine the proportion contributed by the cost volume at each scale during aggregation. The cost values tend to be more reliable in the complex-texture areas of the image at the finer scale, because down-sampling loses detail, so we raise the proportion of the finer-scale cost volume in inhomogeneous areas. We adopt the method in [36] to determine the homogeneity of each area; its basic idea is to convolve the image with a Gaussian kernel sign-flipped in the horizontal or vertical direction and to use the sum of squares of the two directional convolution results as the homogeneity indicator. The horizontal sign-flipped Gaussian kernel is calculated by

K(x, y) = sign(x) · exp(−(x^2/(2σ_x^2) + y^2/(2σ_y^2))), (8)

where x and y represent the pixel position in the convolution window, and σ_x and σ_y control the shape of the Gaussian kernel in the two directions. As shown in Figure 5, the colors of complex-texture areas are darker than those of simple-texture areas.

We call this process propagation rather than aggregation because of its directional property, as shown in the right part of Figure 6. First, the input images are down-sampled from the finest to the coarsest scale. Then, after the initial cost volumes are generated at each scale, propagation proceeds in the reverse direction of the down-sampling. There are three propagation paths for each cost volume to be propagated:

1. From the initial cost volume of the same pixels at the same scale (black solid arrow in the right part of Figure 6);
2. From the initial or propagated cost volume of the pixels on its even lines at the coarser scale (purple dot-dash arrow);
3. From the initial or propagated cost volume of the pixels on its odd lines at the coarser scale (red dash arrow).

The cost volumes from paths 2 and 3 are propagated to the even and odd rows of the finer-scale cost volume, respectively, and fused with the cost volume from path 1. The fusion weights are computed using (9). The final cost volume at the finest scale is obtained after the whole propagation process is finished.
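The homogeneity indicator and the row-wise fusion described above can be sketched as follows for a two-scale setup. The kernel form, the hard texture threshold, and the mapping of the weights (a larger fine-scale share w_c for complex-texture pixels, w_s for simple ones) are our reading of the description, not the authors' exact implementation.

```python
import numpy as np

def homogeneity(img, radius=2, sigma_x=2.5, sigma_y=1.5):
    """Texture indicator: sum of squares of the responses to a Gaussian
    kernel sign-flipped along x and along y (near zero in flat areas)."""
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    g = np.exp(-(xs ** 2 / (2 * sigma_x ** 2) + ys ** 2 / (2 * sigma_y ** 2)))
    kx, ky = g * np.sign(xs), g * np.sign(ys)
    h, w = img.shape
    pad = np.pad(img, radius, mode='edge')
    rx = np.zeros((h, w))
    ry = np.zeros((h, w))
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            win = pad[dy:dy + h, dx:dx + w]
            rx += kx[dy, dx] * win
            ry += ky[dy, dx] * win
    return rx ** 2 + ry ** 2

def propagate(fine_cost, coarse_cost, img, w_s=0.3, w_c=0.7, tex_thresh=1.0):
    """Fuse a coarser-scale cost volume (half the rows, same width and
    disparity range) into the finer one: coarse row k feeds fine rows
    2k and 2k + 1. Complex-texture pixels keep a larger share (w_c)
    of the fine-scale cost."""
    weight = np.where(homogeneity(img) > tex_thresh, w_c, w_s)
    fused = np.empty_like(fine_cost)
    for y in range(fine_cost.shape[0]):
        w = weight[y][:, None]
        fused[y] = w * fine_cost[y] + (1 - w) * coarse_cost[y // 2]
    return fused
```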
Although increasing the number of scales results in a proportional rise in computational complexity, the total running time remains relatively constant because the cost volume and its fusion weights at each scale can be computed in parallel.

Cost Volume Optimization
The above computation of the final cost volume is mainly based on the local linear assumption, so the window sizes are kept small to prevent error propagation from nonlinear areas. On the other hand, the resulting lack of neighborhood information leads to relatively poor disparity results, so optimization methods are needed. Among the various existing optimization methods, we choose AD-Census [14] among local methods and Local-Expansion [9] among global methods to illustrate the effectiveness of the cost volume generated by our algorithm.

AD-Census Optimization
The optimization procedure in the AD-Census method consists of two consecutive steps: cross-based cost aggregation and scanline optimization. The aggregation is based on the assumption that the pixels with similar colors have similar disparities in the local area, so the local information is taken into consideration. The scanline optimization uses the smoothness constraints that incorporate the global information.
We replace the cost volume in [14], originally generated by summing the scaled AD and CT metrics, with our final cost volume. Figure 7 displays the results before and after the replacement, where green regions indicate occluded areas and red regions represent pixels that failed the LRC check. It is evident that our cost volume is more robust on challenging datasets such as MotorcycleE (according to the dataset website, the exposure-changed version of Motorcycle).

Local-Expansion Optimization
The combination of graph cuts (GC) and spatial propagation in the Local-Expansion method minimizes the global energy function efficiently. Because the matching quality is greatly affected by the data term of the energy function, the authors replace the borrowed cost volumes in [9] with ones generated by a CNN [10] when testing on the Middlebury 2014 training dataset (interpolated to full precision). Here we use the propagated final cost volume as the data term and apply the Local-Expansion optimization to generate the disparity results.
The disparity results are shown in Figure 8, where green regions indicate occluded areas, and red regions represent wrong matches (bad1.0) in non-occluded regions. Intuitively, the error rates of the results based on these two data terms are relatively close which shows the reliability of our cost volume. Further quantitative analysis is performed in the experimental section.

Results and Discussion
Our method is designed to generate robust cost volumes using the improved handcrafted functions, HOG. In this section, we use the Local-Expansion algorithm to optimize the generated cost volumes and to get final disparity results. The combination of these two methods is designed to estimate the depth information from the images captured by the binocular endoscope which is the practical application scenario of our algorithm. Considering the similarities between the endoscopic scenes and the indoor scenes in terms of low and repetitive textures, we mainly use the Middlebury 2014 dataset to evaluate the accuracy of our algorithm quantitatively.
We set the fusion ratio of the two kinds of HOG metrics to 1:1 in the initial cost volume generation process. The window sizes of the histogram statistics and the guided filter are both set to 5 × 5 to ensure the consistency of the linear model. We use three scales in the cost volume propagation step, considering both performance and the amount of computation. The parameters of the Gaussian kernel used in homogeneity determination and the propagation weights of the simple- and complex-texture regions are set as follows: σ_x = 2.5, σ_y = 1.5, ω_s = 0.3, ω_c = 0.7. As for the parameters of Local-Expansion, the optimization method in our framework, we keep them consistent with the original paper [9].

Robustness Evaluation of Cost Metrics
The initial cost volume, computed directly from a cost metric, is the basis of further optimizations. To verify the performance advantage of the proposed HOG metric over other metrics, we compare the disparities generated by the proposed metric and other handcrafted radiometric-invariant metrics using the WTA strategy without optimization. Figure 9 illustrates the comparison, where white points denote pixels that failed the LRC check. According to the proportion and distribution of white points in Figure 9, the proposed HOG metric is more reliable than the previous HOG angle metric (based on the ratio of LRC-passing pixels) and resolves small texture-less areas (framed by red rectangles) better than the Census and NCC metrics.

Table 1 lists the quantitative comparison computed from the ground-truth disparities using the analysis tool provided by the Middlebury dataset; numbers in bold mark the best performance among the listed cost metrics. These results show that our proposed metric outperforms the other metrics in both accuracy (avgErr) and reliability (totBad). The criteria are defined as follows: bad1.0, the percentage of "bad" pixels whose error exceeds 1.0 pixel compared with the ground truth; invalid, the percentage of pixels that failed the LRC check; totBad, the sum of the bad1.0 and invalid percentages; avgErr, the average absolute error in pixels.
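The four criteria can be computed as in the sketch below; the masking conventions (e.g., whether avgErr covers all pixels or only LRC-valid ones) are an assumption here, and the official Middlebury tool may differ in such details.

```python
import numpy as np

def evaluate(disp, gt, valid_mask, thresh=1.0):
    """Middlebury-style statistics over pixels with ground truth.
    valid_mask marks pixels that passed the LRC check."""
    err = np.abs(disp - gt)
    n = err.size
    invalid = 100.0 * (~valid_mask).sum() / n           # failed LRC check
    bad = 100.0 * (valid_mask & (err > thresh)).sum() / n
    avg_err = err[valid_mask].mean()                    # avgErr over valid pixels
    return {'bad1.0': bad, 'invalid': invalid,
            'totBad': bad + invalid, 'avgErr': avg_err}
```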

Effectiveness Evaluation of Cross-Scale Propagation
Cross-scale propagation is designed to improve the matching reliability in small homogeneous areas. The multi-scale images generated by down-sampling the original images enlarge the perception region of the HOG descriptor along the vertical direction while preserving its resolving ability in the horizontal direction. Figure 10 displays the disparity comparisons after propagation with various numbers of scales, with white points indicating pixels that failed the LRC check. The ratio of badly matched pixels decreases as the number of scales increases, particularly in the regions inside the red rectangles.
However, excessively coarse scales violate the local linear assumption and add a large amount of computation, so we set the number of scales to three as a balance. Note that the horizontal stripes in the disparity map, caused by fusing the even- and odd-row sub-scale cost volumes, are eliminated in the optimization step by the aggregation process in local methods or by the smoothness constraint in global methods.

Accuracy Evaluation on Middlebury Datasets
The ultimate purpose of the stereo matching methods is to get the high accuracy disparities, so we evaluate the accuracy of the disparities generated by the combined method (our cost volume generation method and Local-Expansion optimization) using the Middlebury Version 3 Benchmark.
As listed in Table 2, the results of our method are compared with SGM (a semi-global method), INTS (a non-local method), and the original Local-Expansion (using the cost volumes generated by MC-CNN). Our method shows significant performance improvement on nearly all datasets compared with the first two purely traditional methods. Compared with the original Local-Expansion, we achieve similar or even better performance on most image pairs, but the performance drops considerably in specific scenes (italics in Table 2).

Figure 11 compares the disparities generated by the different algorithms, with the areas inside the yellow rectangles highlighting relatively noticeable performance differences. The top three rows correspond to the datasets MotorcycleE, Piano, and Playroom, where our method performs best; the bottom three rows correspond to the italicized entries in Table 2, where the performance of our method drops. We further analyzed the reason for these performance changes. As shown in the top three rows of Figure 11, our method performs better than Local-Expansion in textured and small texture-less areas (yellow rectangles). However, in the large texture-less areas in the bottom three rows, our method performs far worse than Local-Expansion, and no better than SGM and INTS within the yellow rectangles. This is reasonable because the HOG metric cannot perceive a large area given its limited window size, and cross-scale propagation only improves the reliability of the costs in small texture-less areas. This shortcoming should not greatly affect the application of our method to the reconstruction of endoscopic scenes, because large texture-less areas such as floors and walls do not occur in these scenarios, especially in medical endoscopy.
On the other hand, our method ranks top among the purely traditional algorithms in Middlebury Benchmark which proves its effectiveness.
Note that the ranks of our method drop significantly under the bad2.0 and avgErr criteria, which means the percentage of "bad" pixels whose error lies between 1.0 and 2.0 pixels is relatively low. This supports the high reliability of our algorithm from another perspective: after the pixels with large matching errors are removed by the LRC check, the accuracy of the remaining pixels is high.

Accuracy Evaluation on Surgical Endoscopic Dataset
To evaluate the 3D reconstruction accuracy in a real endoscopic scenario, we test the proposed combined method on the stereoscopic surgical endoscopic dataset SERV-CT [37]. Through manual alignment of CT and endoscopic images, SERV-CT provides accurate depth for 32 image pairs, which can be used to evaluate the reconstruction accuracy of stereo matching algorithms. Figure 12 compares the 3D point clouds generated from the disparities calculated by our combined method with the ground truths. Because there are no large homogeneous regions in the endoscopic scenario, most of the non-occluded areas are reconstructed well. Based on the point clouds, we further calculate the 3D distances of the corresponding points and their average RMSE to represent the reconstruction accuracy. Table 3 lists the quantitative comparison between our method and several deep learning stereo matching algorithms. The reconstruction accuracy of our algorithm reaches about 5 mm, exceeding that of some deep learning algorithms, which proves the effectiveness of the proposed method in this surgical endoscopic scenario.
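The accuracy computation can be sketched as back-projecting disparities with the pinhole model and measuring 3D distances to the ground truth; the focal length, baseline, and principal point below are placeholders, not SERV-CT's calibration.

```python
import numpy as np

def reconstruct(disp, f, baseline, cx, cy):
    """Back-project a disparity map to 3D points (pinhole model):
    Z = f * B / d, X = (u - cx) * Z / f, Y = (v - cy) * Z / f."""
    v, u = np.indices(disp.shape)
    z = np.where(disp > 0, f * baseline / np.maximum(disp, 1e-9), 0.0)
    x = (u - cx) * z / f
    y = (v - cy) * z / f
    return np.dstack([x, y, z])

def rmse(points, gt_points, mask):
    """RMSE of the 3D distances between corresponding valid points."""
    d = np.linalg.norm(points - gt_points, axis=2)[mask]
    return float(np.sqrt((d ** 2).mean()))
```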

In Table 3, the 3D distance is reported in millimeters, and the data in bold is the best precision.

Reconstruction Results Using Our Endoscope
The outer diameter and baseline length of our binocular endoscope are 6 mm and 2 mm, respectively, and the resolution of a single image is 720 × 1280. As shown in Figure 13, the soft cable behind the cameras makes the device well suited to 3D reconstruction and measurement in tight spaces, such as industrial and medical endoscopy. We performed surface fitting tests on two spheres and one cylinder, 3D printed with known parameters. The smaller sphere (radius: 6 mm) and the cylinder (radius: 5 mm, height: 15 mm) are similar in size to a tumor, and the larger sphere (radius: 25 mm) approximates the surfaces inside the body. We painted patterns on the surfaces of these objects to reduce the area of the homogeneous regions. The point clouds and the fitting results of the three standard objects are shown in Figure 14. Such reconstruction accuracy (millimeter precision) enables the system to be used for measurement in endoscopic scenarios. Compared with recent binocular endoscope reconstruction results [38], our system performs better in terms of the standard deviation of the fitting errors.
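A standard algebraic least-squares sphere fit of the kind used in such surface-fitting tests might look like the sketch below; it is a generic method, not necessarily the paper's exact fitting procedure.

```python
import numpy as np

def fit_sphere(points):
    """Algebraic least-squares sphere fit for an N x 3 point array.
    Solves |p - c|^2 = r^2 in the linearized form
    2 c . p + (r^2 - |c|^2) = |p|^2 for the center c and radius r."""
    A = np.c_[2 * points, np.ones(len(points))]
    b = (points ** 2).sum(axis=1)
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    center = w[:3]
    radius = float(np.sqrt(w[3] + center @ center))
    return center, radius
```

The residuals of the fitted surface against the reconstructed point cloud then give the standard-deviation errors reported above.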

Evaluation of Runtime
The runtime of an algorithm is another key factor affecting its application in practical scenarios. We port the MC-CNN algorithm provided by [9] and our method to the same GPU platform (NVIDIA GeForce RTX 2080 Ti) and compare their cost volume generation runtimes on the Middlebury dataset. As presented in Table 4, the average runtime of MC-CNN is 65 times that of our method, which clearly demonstrates the computational efficiency of our approach. The cost volume generation of Jadeplant and Vintage failed with MC-CNN due to the limited resources of our GPU platform, so these two datasets were excluded when calculating the average runtime.

Conclusions
In this paper, a novel non-deep-learning cost generation algorithm is proposed to generate robust cost volumes for improving the matching accuracy in endoscopic scenarios. The new form of HOG metrics and the subsequent guided filtering guarantee the cost volume's reliability in radiometrically distorted regions, showing the best performance among the radiometric-invariant metrics. The robustness of the cost volume in small homogeneous areas is improved by the proposed cross-scale propagation framework, and the different propagation weights in textured and texture-less regions preserve the accuracy of the disparity details. Quantitative experimental results demonstrate the effectiveness of the combination of our method and the Local-Expansion algorithm on the datasets and in real scenes. Combining these advantages with the runtime savings from its relatively low computational complexity, the algorithm is suitable for binocular endoscopic applications. However, our method cannot handle large homogeneous areas well; this is not a problem in endoscopic applications, but it should be considered carefully in other scenarios. In addition, the complexity of the optimization methods limits the overall efficiency, so accelerating them will be our future work.