Stereo Matching with Spatiotemporal Disparity Refinement Using Simple Linear Iterative Clustering Segmentation

Abstract: Stereo matching is a challenging problem in computer vision applications such as three-dimensional television (3DTV) and 3D visualization, where disparity maps must be estimated from video streams. However, the estimated disparity sequences may contain undesirable flickering errors, which degrade the visual quality of the synthesized video and reduce video coding efficiency. To solve this problem, we propose a spatiotemporal disparity refinement method for local stereo matching based on a simple linear iterative clustering (SLIC) segmentation strategy, outlier detection, and refinements in the temporal and spatial domains. In the outlier detection, the segmented regions of the initial disparity map are used to distinguish errors in the binocular disparity. Based on color similarity and disparity difference, we recalculate the aggregated cost to determine adaptive disparities that recover the disparity errors in the disparity sequences. The flickering errors are also effectively removed, and object boundaries are well preserved. Experiments on public datasets demonstrate that our proposed method creates high-quality disparity maps and obtains a higher peak signal-to-noise ratio than state-of-the-art methods.


Introduction
Over the past several decades, disparity estimation has been a central problem in stereo vision and remains an active research topic in computer vision [1]. It is a fundamental technology for next-generation network services that synthesize virtual viewpoints, including 3DTV, multiview video, free-view video, and virtual reality applications.
Stereo matching is an important vision problem in which the disparities of a given stereo image pair are estimated. Many stereo matching techniques have been proposed, and they can be categorized as either global or local methods [2]. Global methods produce more accurate disparity maps; they are typically derived from an energy minimization framework that allows the explicit integration of disparity smoothness constraints and can thus regularize the solution in weakly textured areas. However, the minimization relies on iterative strategies or graph cuts, which incur a high computational cost; these methods are often quite slow and thus unsuitable for processing large amounts of data. In contrast, local methods compute the disparity value by optimizing a matching cost, which compares candidate target pixels and reference pixels while simultaneously considering the neighboring pixel information in a support window [3]. Local methods, which are typically built upon the winner-takes-all (WTA) framework [4][5][6], have the lowest computational costs and are suitable for implementation on parallel graphics processors. In the WTA framework, local stereo methods consider a range of disparity hypotheses and compute a cost volume using various pixel-wise dissimilarity metrics between the reference image and the matched image at every considered disparity value. The final disparities are selected from the cost volume by traversing its values and choosing, for every pixel of the reference image, the disparity associated with the minimum matching cost. Although stereo matching methods provide excellent matching results for image pairs, some complicated problems remain, such as object occlusion and flickering artifacts. Object occlusion, where no corresponding pixels are captured, is a serious problem that can prevent a stereo matching algorithm from finding the correct pixels.
Thus, incorrect disparity values are produced in the occluded regions. In addition, flickering artifacts usually occur in disparity sequences generated by a stereo matching method as a result of inconsistent disparity maps. These flickers significantly reduce the subjective quality of a video clip, and the incorrect disparity values cause visual discomfort, especially when they are used to synthesize a virtual view in depth-image-based rendering (DIBR). To address this issue, we propose a spatiotemporal strategy to reduce inconsistent disparities and refine the disparity maps of a video clip. The main contribution of our proposed method is to improve inconsistent disparities and reduce flickering errors in the estimated disparity sequences for the synthesized video. The advantages of this method are as follows. First, outliers are detected and removed using a superpixel-based segmentation method. Second, disparity refinement is achieved using our proposed technique in the temporal and spatial domains. In the temporal domain, we improve the cost aggregation based on segmentation, color difference, and disparity difference and then refine the outliers in each frame. In the spatial domain, the remaining disparity errors of each frame are further refined based on refinement within superpixel bounds, a propagation mechanism, and filtering. Finally, our proposed method is easy to implement and efficient.

Preliminary Techniques
In this section, we briefly describe the techniques related to our proposed approach.

Cross-Based Local Stereo Matching
In recent years, local stereo matching methods have commonly been used to generate disparity maps. The well-known cross-based local stereo matching of Zhang et al. [5], built on a shape-adaptive support region, is an efficient and simple method. The main idea is to create a local upright cross for each anchor pixel and then construct an adaptive support region. This local support region should contain only the neighboring pixels with the same depth as the anchor pixel under consideration.
First, we construct a local cross support region for a pixel using the color information. We define pixel p within the shape-adaptive support region and then compute the cost aggregation based on a WTA strategy corresponding to each pixel, as shown in Figure 1. Using a variable cross support region instead of a fixed size for each pixel can efficiently reduce the computational complexity.
Cross-based aggregation proceeds in two steps, as shown in Figure 1. In the first step, an upright cross is constructed for each pixel. The support region of pixel p is modeled by merging the horizontal arms of the pixels (e.g., q) lying on the vertical arms of pixel p, as shown in Figure 1b. As noted in [5], the alternative of merging the vertical arms of the pixels lying on the horizontal arms of p is also possible. In the second step, the aggregated cost based on the WTA strategy is computed over the support region in two passes: the first pass sums the matching costs horizontally and stores the intermediate results; the second pass then aggregates the intermediate results vertically to obtain the final cost, corresponding to the horizontal and vertical aggregation illustrated in Figure 1. More details about the method can be found in [5]. Each pixel p has a horizontal section H(p) and a vertical section V(p), and q ∈ V(p) is a pixel belonging to the vertical section. Each section consists of two arms; e.g., H(p) consists of the left and right arms. The symbol "∪" denotes the union operation.
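As an illustration, the two-pass aggregation described above can be sketched in Python as follows. The function name and the per-pixel arm-length arrays `h_arms`/`v_arms` are our own simplified assumptions, not the authors' implementation; the sketch only shows the horizontal-then-vertical summation structure for a single disparity hypothesis.

```python
import numpy as np

def aggregate_cross(cost, h_arms, v_arms):
    """Two-pass cross-based cost aggregation (sketch).

    cost   : (H, W) matching cost for one disparity hypothesis
    h_arms : (H, W, 2) left/right arm lengths per pixel
    v_arms : (H, W, 2) up/down arm lengths per pixel
    """
    H, W = cost.shape
    # Pass 1: sum costs horizontally over each pixel's own arms.
    horiz = np.zeros_like(cost)
    for y in range(H):
        for x in range(W):
            l, r = h_arms[y, x]
            horiz[y, x] = cost[y, max(0, x - l):min(W, x + r + 1)].sum()
    # Pass 2: aggregate the intermediate sums along the vertical arms.
    out = np.zeros_like(cost)
    for y in range(H):
        for x in range(W):
            u, d = v_arms[y, x]
            out[y, x] = horiz[max(0, y - u):min(H, y + d + 1), x].sum()
    return out
```

With unit arms everywhere, each interior pixel aggregates a 3 × 3 neighborhood, which matches the cross-shaped support of Figure 1 in its smallest form.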

SLIC Segmentation and Merging
The SLIC segmentation method was proposed in [21]. Using a k-means clustering approach, it groups the pixels of an image into perceptually meaningful atomic regions, called superpixels, that can replace the rigid structure of the pixel grid. The method adheres to object boundaries very well. Its only parameter is n_0, the desired number of superpixels of approximately equal size; the choice of n_0 is left to the user.
The SLIC segmentation method adopts the CIELAB color space. The clustering procedure begins with an initialization step, where the n_0 initial cluster centers C_i = [l_i, a_i, b_i, x_i, y_i]^T are sampled on a regular grid spaced S pixels apart. To generate superpixels of roughly equal size, the grid interval is S = sqrt(N/n_0), where N is the number of pixels in the image, so the SLIC algorithm [21] obtains the desired number of clusters of approximate size N/n_0. Additionally, to avoid centering a superpixel on an edge or a noisy pixel, the centers are moved to seed locations corresponding to the lowest gradient positions in 3 × 3 neighborhoods. An edge or a noisy pixel is often positioned at a point with the largest gradient variance; therefore, selecting the lowest-gradient point when positioning the center of a superpixel efficiently reduces the chance of seeding a superpixel with an edge or a noisy pixel. After performing the SLIC segmentation, we obtain the desired number of clusters that adhere to object boundaries. However, some neighboring clusters belonging to the same object are segmented into different clusters. To obtain more useful segmented information, we further merge neighboring clusters into a larger cluster when the average color difference between neighboring clusters i and j is less than a threshold (T_s). A superpixel (i.e., "cluster") i is denoted by SE_c(i) = {I_c(p) | p = (x_p, y_p) ∈ S_i}, c ∈ {R, G, B}, where S_i is the set of pixel locations, (x_p, y_p) are the coordinates of pixel p in the two-dimensional (2D) image space, and the total number of pixels in superpixel i is N_si = card(S_i). The average color difference between neighboring clusters i and j is then computed as follows.
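The grid seeding with the lowest-gradient adjustment can be sketched as below. This is a minimal illustration on a grayscale image (the actual SLIC method [21] operates in CIELAB), and the function name and the squared-gradient measure are our assumptions.

```python
import numpy as np

def init_slic_seeds(gray, n0):
    """Place ~n0 SLIC cluster centers on a regular grid of spacing
    S = sqrt(N / n0), then nudge each to the lowest-gradient pixel
    in its 3x3 neighborhood (sketch; gray is a 2-D intensity image)."""
    H, W = gray.shape
    S = int(np.sqrt(H * W / n0))
    # Simple squared gradient magnitude via central differences.
    gy, gx = np.gradient(gray.astype(float))
    grad = gx ** 2 + gy ** 2
    seeds = []
    for y in range(S // 2, H, S):
        for x in range(S // 2, W, S):
            # Search the 3x3 neighborhood for the lowest gradient.
            y0, y1 = max(0, y - 1), min(H, y + 2)
            x0, x1 = max(0, x - 1), min(W, x + 2)
            patch = grad[y0:y1, x0:x1]
            dy, dx = np.unravel_index(np.argmin(patch), patch.shape)
            seeds.append((y0 + dy, x0 + dx))
    return seeds
```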
It is noted that the symbol "card(·)" denotes the cardinality operator. The threshold T_s is set to 8, a value obtained through experimentation. Based on our experiments, if this threshold is greater than the specified value, the boundary information of the clusters will be lost, which will impact the subsequent inpainting process. Figure 2 shows an example of the SLIC segmentation and merging results.
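A minimal sketch of the cluster-merging rule follows, assuming per-superpixel mean RGB colors and a single greedy pass over 4-connected neighbor pairs; the pairing and pass order are our simplifications of the merging procedure.

```python
import numpy as np

def merge_clusters(labels, image, t_s=8):
    """Merge adjacent superpixels whose average colors differ by less
    than t_s (sketch; one greedy pass over 4-connected neighbors)."""
    labels = labels.copy()
    # Collect pairs of labels that touch horizontally or vertically.
    pairs = {tuple(sorted(t))
             for t in zip(labels[:, :-1].ravel(), labels[:, 1:].ravel())
             if t[0] != t[1]}
    pairs |= {tuple(sorted(t))
              for t in zip(labels[:-1, :].ravel(), labels[1:, :].ravel())
              if t[0] != t[1]}
    for a, b in sorted(pairs):
        ma, mb = labels == a, labels == b
        if not ma.any() or not mb.any():
            continue                     # label already merged away
        diff = np.abs(image[ma].mean(axis=0) - image[mb].mean(axis=0)).mean()
        if diff < t_s:
            labels[mb] = a               # merge cluster b into cluster a
    return labels
```

With T_s = 8, two adjacent regions whose mean colors differ by 5 are merged, while a difference of 50 keeps them separate.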

Proposed Method
This section describes the proposed method, which includes SLIC segmentation and merging, outlier detection, and temporal and spatial refinements. Figure 3 illustrates the flowchart of the proposed system. The procedures are described in detail in the following subsections.

Outlier Detection Based on Disparity Variation
The initial disparity map calculated by cross-based local stereo matching [5] contains many outliers. To detect them, we use the consistency of the disparity variation within each cluster to label the inconsistent pixels in the current frame; these pixels are marked as outliers. Owing to disparity variation, one cluster may include many disparity values, which causes inconsistent disparities within the same cluster. Hence, we estimate the variation of the disparity values within each cluster to detect the inconsistent pixels. First, all of the pixels in the disparity map are given different initial labels. Given a pixel p = (x_p, y_p) ∈ S_p with label l_p, for every pixel q = (x_q, y_q) ∈ S_p, a constraint is enforced on its labeling as follows: l_q = l_p if |d(p) − d(q)| ≤ α_e,
where S_p is the cluster containing pixel p, d(·) denotes the disparity corresponding to a pixel, α_e is a threshold, and |·| denotes the absolute value. A cluster may contain pixels carrying different labels. The pixels with the most frequently occurring label are taken as the target, and the most frequent disparity among those pixels is taken as the correct disparity; the remaining pixels are denoted as the outliers of the cluster. However, the leftmost or rightmost disparities of the disparity sequences may contain serious errors because they have no corresponding information in the left or right image; the unmatched region at the leftmost or rightmost of the disparity map therefore receives a disparity value of zero, which indicates an error. We use the following procedure to detect the disparity errors positioned at the leftmost or rightmost of the disparity sequences. First, we detect segmented regions that include zero-valued disparities in the leftmost or rightmost region. Then, using the characteristic of the disparity variation from an endpoint to an inner point, we compute the average disparity of the segmented region. If the average disparity is less than a threshold (T_d = 4), the segmented region is labeled as containing a disparity error.
After performing the above procedures, the disparity errors generated at the leftmost or rightmost of a disparity map can be reduced.
The detected outliers are recorded on the disparity map of the left view, as shown in Figure 4, where the red points mark the outliers. In other words, these outliers are regarded as disparity error points on the left-view disparity map and must be refined.
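The per-cluster consistency check and the border-region test (T_d = 4) might be sketched as follows. Treating the most frequent disparity value directly as the cluster target is our simplification of the labeling procedure, and the border test here flags any cluster touching the left or right image edge whose mean disparity is below T_d.

```python
import numpy as np

def detect_outliers(disp, labels, alpha_e=1, t_d=4):
    """Mark inconsistent disparities per SLIC cluster (sketch).

    Within each cluster, the most frequent disparity is taken as the
    target; pixels deviating from it by more than alpha_e are flagged.
    Border clusters whose mean disparity is below t_d are flagged
    entirely, modeling the leftmost/rightmost zero-disparity errors."""
    outlier = np.zeros(disp.shape, dtype=bool)
    for lab in np.unique(labels):
        mask = labels == lab
        vals = disp[mask]
        # Most frequent disparity value acts as the cluster target.
        uniq, cnt = np.unique(vals, return_counts=True)
        target = uniq[np.argmax(cnt)]
        outlier |= mask & (np.abs(disp - target) > alpha_e)
        # Border clusters with near-zero mean disparity are errors.
        touches_border = mask[:, 0].any() or mask[:, -1].any()
        if touches_border and vals.mean() < t_d:
            outlier |= mask
    return outlier
```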

Disparity Refinement in Temporal Domain
After detecting the outliers, we can determine the outliers in each frame of a video clip. When the camera or the captured objects move slowly, the disparity of an object cannot change dramatically within a short period of time. An occlusion, however, causes a large disparity change against the background, resulting in disparity errors, and a dramatic change may also be caused by the appearance of another object. Hence, this approach refines these disparity errors whether they are caused by the appearance of another object or by an occlusion.
These errors must be removed and refined in the disparity map. In our approach, the refining procedures are divided into two phases: refining in the temporal domain and in the spatial domain. The system design uses an object-based strategy.
We combine the segmentation information, region movement, color difference, and disparity difference to achieve the refinement in the temporal domain at this stage. We simply use the previous frame and its corresponding information to perform the refinement. The details of the procedures at this stage are described in the following.

Searching for Matching Point in Previous Frame
First, we need to search the previous frame for the matching positions corresponding to the outliers in the current frame. Given a pixel p in the current frame I_nf (at time t), with p ∈ R_p, where R_p denotes a cluster obtained by the SLIC segmentation and merging, candidate pixels p′ are drawn from an 11 × 11 search window W_s in the previous frame, i.e., p′ ∈ W_s, with the matching window of p′ corresponding to R_p denoted by R_p′. The best matching point p* in the previous frame (at time t−1), I_(nf−1), for pixel p of the current frame I_nf is the candidate minimizing the cost function J(p, p′), which is composed of a color matching cost J_c(p, p′) and a disparity matching cost J_d(p, p′) weighted by a parameter β; the symbol "card(·)" denotes the cardinality operator, and p_nf and p_(nf−1) index the matching windows in frames I_nf and I_(nf−1), respectively.
J_pd(p_nf, p_(nf−1)) denotes the pixel-level disparity matching cost, and N_d(R_p, R_p′) denotes the number of points possessing correct disparities in the matching windows of frames I_nf and I_(nf−1); D_E(p_nf, p_(nf−1)) identifies the pixels with correct disparities in the candidate matching region. Here, E_p_nf indicates whether the disparity of point p in the current frame I_nf is an error, and E_p_(nf−1) indicates the same for the corresponding point in the previous frame I_(nf−1). Hence, E_p_nf = false means that the disparity of point p in the current frame is correct, and likewise for the previous frame. According to Equation (3), the best matching point in the previous frame corresponding to an outlier in the current frame can be obtained; Figure 5 shows the profile of the matching operation. Afterward, we compute and record the displacement between the coordinates of p and p*. Based on this displacement information, the matching points of other outliers within the same segmented region can be obtained more rapidly, because they are clustered in the same region.
Hence, when we search the previous frame for matching points corresponding to other disparity errors in the same segmented region of the current frame, we first check whether the displacement information for that cluster has been recorded. If so, the matching point corresponding to the disparity error can be found more easily and quickly using the recorded displacement.
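The search with a per-cluster displacement cache could look roughly like this. A plain absolute intensity difference stands in for the full cost J(p, p′) of Equation (4), so this is only a structural sketch; the function name and cache layout are our assumptions.

```python
import numpy as np

def find_match(prev_gray, cur_gray, p, cluster_id, cache, win=5):
    """Search an 11x11 window in the previous frame for the best
    match to outlier p, reusing a per-cluster displacement cache
    (sketch; matching cost is a plain absolute intensity difference)."""
    if cluster_id in cache:
        dy, dx = cache[cluster_id]          # reuse recorded displacement
        return (p[0] + dy, p[1] + dx)
    H, W = prev_gray.shape
    y, x = p
    best, best_q = np.inf, p
    for qy in range(max(0, y - win), min(H, y + win + 1)):
        for qx in range(max(0, x - win), min(W, x + win + 1)):
            cost = abs(float(cur_gray[y, x]) - float(prev_gray[qy, qx]))
            if cost < best:
                best, best_q = cost, (qy, qx)
    cache[cluster_id] = (best_q[0] - y, best_q[1] - x)  # record displacement
    return best_q
```

Once the displacement for a cluster is cached, subsequent outliers in the same cluster skip the window search entirely, which is the speed-up described above.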

Disparity Refinement of Outliers
In terms of disparity refinement, based on the matching result from the previous process, we obtain the matching point corresponding to each outlier. However, this matching point may not have a correct disparity. To avoid matching an erroneous disparity, we use the color difference between point p in the current frame I and the neighborhood pixels of the matching point p* in the previous frame I_pre to select an appropriate matching point, computed using Equations (11) and (12), where R_g is a 3 × 3 region centered at p*, as shown in Figure 5. After the computation using Equation (11), we determine whether the disparity of point k is correct and then compute the color difference between point k and point p. To maintain temporal consistency for the pixels in the temporal sequence, a threshold (α_t) is used to test whether the pixel colors remain within a certain distribution. When the color difference is less than α_t and the disparity of point k is correct, the disparity error is replaced by the disparity of point k. The disparity refinement of the outliers in the temporal domain is expressed as follows, where d(p) and d(k) denote the disparities of pixel p and pixel k, respectively, and E_k indicates whether the disparity of point k at time t−1 is an error. Hence, E_k = false means that the disparity of pixel k is correct; when, at the same time, the color difference between pixel p at time t and pixel k at time t−1 is less than α_t, the disparity of pixel p in the current frame is replaced by the disparity of pixel k.
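A sketch of the 3 × 3 neighborhood replacement around p* follows, assuming single-channel colors and a hypothetical default value for the threshold α_t; the function name and argument layout are our own.

```python
import numpy as np

def refine_temporal(disp_cur, disp_prev, err_prev, color_cur, color_prev,
                    p, p_star, alpha_t=10):
    """Replace the outlier disparity at p with the disparity of the
    most color-similar, error-free pixel k in the 3x3 region around
    the matching point p* in the previous frame (sketch)."""
    H, W = disp_prev.shape
    y, x = p_star
    best, best_k = None, None
    for ky in range(max(0, y - 1), min(H, y + 2)):
        for kx in range(max(0, x - 1), min(W, x + 2)):
            if err_prev[ky, kx]:
                continue                 # skip erroneous disparities
            diff = abs(float(color_cur[p]) - float(color_prev[ky, kx]))
            if best is None or diff < best:
                best, best_k = diff, (ky, kx)
    # Accept the replacement only if the color difference stays
    # within alpha_t, i.e., temporal consistency is maintained.
    if best is not None and best < alpha_t:
        disp_cur[p] = disp_prev[best_k]
    return disp_cur
```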
After refining the disparities of the outliers at time t in the temporal domain, the refined current frame will be reused for the refinement in the next frame. In other words, the correct disparities for the object in the current frame can propagate to the following frames and be used to refine the same object under the specified camera lens motion.

Disparity Refinement in Spatial Domain
After performing the refinement in the temporal domain, a large number of the disparity errors have been refined. However, some errors still remain in the temporal sequence; these will be refined in the spatial domain.

Refinement within Superpixel Bounds
Based on the SLIC segmentation and merging, we search for correct disparities to refine the remaining disparity errors using pixel-based processing of each frame. We compute the color differences between a pixel with a disparity error and the candidate pixels with correct disparities that are segmented into the same cluster. The disparity of the candidate pixel with the minimum color difference is used to replace the erroneous disparity.
To refine the remaining disparity errors in the spatial domain, first, given a pixel p of the disparity image in a segmented region N_s, i.e., p ∈ N_s, and a 37 × 37 search region denoted by S_s, a truth table for the candidate pixels representing correct disparities within the search region S_s can be formulated for j ∈ S_s, where E_j indicates whether the disparity of pixel j is an error; E_j = false denotes that the disparity of pixel j is correct, and α_s is a threshold on the color difference between point p and point j.
Based on the segmentation process, if two pixels in the spatial domain are segmented into the same cluster and the color difference between them is less than the threshold, these pixels act as candidates for refinement. Next, the best candidate among all of the candidate pixels, denoted by S_c, is selected as follows, and the disparity value at p is replaced by that at b. Based on Equation (14), we obtain the best pixel b, and the disparity of pixel p is replaced by the disparity of pixel b.
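The within-cluster replacement can be sketched as below, assuming single-channel colors; the candidate mask corresponds to the truth table of Equation (13), and the minimum-difference selection corresponds to Equations (14) and (15). The threshold value and function name are our assumptions.

```python
import numpy as np

def refine_spatial(disp, err, labels, color, p, alpha_s=10):
    """Replace the disparity error at p with the disparity of the
    most color-similar, error-free pixel in the same cluster (sketch)."""
    # Candidates: same cluster as p, with correct (non-error) disparity.
    mask = (labels == labels[p]) & ~err
    if not mask.any():
        return disp
    ys, xs = np.nonzero(mask)
    diffs = np.abs(color[ys, xs].astype(float) - float(color[p]))
    j = np.argmin(diffs)
    if diffs[j] < alpha_s:
        disp[p] = disp[ys[j], xs[j]]     # best candidate replaces the error
        err[p] = False
    return disp
```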

Propagation from Horizontal Lines
After performing the disparity refinement of the outliers in the temporal and spatial domains, the object boundaries may look uneven, a few outliers may remain uncorrected, or spatial noise may exist in the frame. Thus, we use the following procedures to process the remaining outliers.
The recovery procedure is described as follows. We use a propagation method in the horizontal direction to recover the disparity error of an outlier. Assume that pixel p is an outlier. We search for pixels with correct disparities along the left arm and the right arm over equal distances; this search ends when the first correct disparity is obtained from the left or right arm. Then, we compute the color differences between the outlier and these pixels within the two arms and obtain the maximum color difference for each arm, computed as follows, where arm_l(p, i) denotes all of the pixels from pixel p to pixel i on the left arm, and arm_r(p, i) denotes all of the pixels from pixel p to pixel i on the right arm. Figure 6 shows an example of the color difference computation in the horizontal direction. From Figure 6, it is clear that the pixels on the left arm may belong to the same cluster because their color variation is small. In contrast, the color variation on the right arm is extreme, which may indicate an object edge or noise. Hence, we compute the maximum color difference within the search interval for the left and right arms using Equation (16). In addition, if a correct disparity appears on only one arm within the equal distances, the disparity of the outlier is directly replaced by that correct disparity.
Next, we take the pixel with the minimum color difference using Equations (16) and (17) to obtain a reliable disparity. The disparity of this outlier p is determined and replaced by f = argmin(cm l (p, i), cm r (p, i)),
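A single-scanline sketch of this horizontal propagation follows, assuming single-channel colors; the two scans correspond to the left- and right-arm searches, and the final comparison of maximum color differences corresponds to Equations (16) and (17). Names and the early-stop behavior are our simplifications.

```python
import numpy as np

def propagate_horizontal(disp, err, color, p):
    """Scan left and right from outlier p for the nearest correct
    disparities, then copy the one reached through the arm with the
    smaller maximum color difference (sketch, single scanline)."""
    y, x = p
    W = disp.shape[1]

    def scan(step):
        # Walk along one arm, tracking the maximum color difference,
        # until the first correct disparity is found.
        cmax, i = 0.0, x + step
        while 0 <= i < W:
            cmax = max(cmax, abs(float(color[y, i]) - float(color[y, x])))
            if not err[y, i]:
                return cmax, disp[y, i]
            i += step
        return None, None

    cl, dl = scan(-1)
    cr, dr = scan(+1)
    if dl is None and dr is None:
        return disp                      # nothing correct on either arm
    if dr is None or (dl is not None and cl <= cr):
        disp[p] = dl
    else:
        disp[p] = dr
    return disp
```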

Filtering
After performing the above procedures, in order to obtain good visual quality, we finally apply a 9 × 9 median filter to the disparity maps refined by all of the preceding steps. The filtered output constitutes the final refined disparity maps.
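The final filtering step is a plain 9 × 9 median filter; a straightforward numpy-only sketch (edge-padded, so border behavior is an assumption) is:

```python
import numpy as np

def median_filter(disp, size=9):
    """Plain size x size median filter over a disparity map (sketch)."""
    r = size // 2
    pad = np.pad(disp, r, mode='edge')   # replicate borders
    out = np.empty_like(disp)
    H, W = disp.shape
    for y in range(H):
        for x in range(W):
            out[y, x] = np.median(pad[y:y + size, x:x + size])
    return out
```

A single impulse outlier inside a flat region is removed entirely, which is exactly the smoothing effect sought here.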

Summary of Procedures
Temporal domain:
Input: Video clip
Output: A refined disparity map in the temporal domain
Step (1) Create the initial disparity maps using a local cross-based stereo matching method [5].
Step (2) Detect the outliers for the initial disparity maps under our proposed method.
Step (3) Check whether the displacement information in the segmented region is recorded or not. If it is recorded, the matching point in the previous frame corresponding to an outlier in the current frame can be found using the displacement information, and go to Step (6), else go to Step (4).
Step (4) Search for the matching region and matching point p* in the previous frame for these outliers based on the segmented region, color difference, and disparity difference using Equations (5) and (6); compute the aggregated cost using Equation (4); use Equation (3) to find p*.
Step (5) Compute and record the displacement information of the matching pair using Equation (10).
Step (6) Replace the disparity value of an outlier by computing the color difference between the outlier in the current frame and the neighborhood points of the matching point in the previous frame based on Equations (11) and (12).
Step (7) Repeat Steps (3)-(6) until all of the frames in the video clip are done.
Spatial domain:
Input: The refined disparity maps of a video clip obtained by the processing in the temporal domain
Output: The final refined disparity maps
Step (1) Search for the rest of the disparity errors.
Step (2) Select all of the candidate pixels in the search region (N s ) based on the color difference using Equation (13).
Step (3) Compute the minimum color difference and obtain the best candidate pixel based on Equations (14) and (15). Then, the disparity of the candidate pixel is used to replace the disparity of the error point.
Step (4) Refine the remaining outliers using propagation from horizontal lines based on Equations (16) and (17).
Step (5) Replace the disparity of each remaining outlier with the reliable disparity obtained in Step (4).
Step (6) Use a median filter to improve the results of Step (5).

Experimental Results and Discussion
To verify the performance of the proposed method, the experimental results were compared to the disparity refinement results of Zhang et al. [5], Lin et al. [3], Jung et al. [12], Zhan et al. [14], and Yang et al. [7]. All compared methods were implemented as described in their respective papers and evaluated under the same experimental datasets and setup. In our experiments, we used cross-based local stereo matching [5] to obtain the initial disparities, which were then refined by our proposed method.

Experimental Datasets and Setup
Most research papers in this field use two popular datasets, the Middlebury and KITTI datasets. The Middlebury dataset does not contain video datasets with labeled ground truth. The KITTI dataset contains image pairs from video clips, but their frame intervals are as large as 5 s, making them unsuitable for evaluating our algorithm. Hence, in our experiments, we used the datasets provided by [9], which include the 'tanks', 'tunnel', 'temple', and 'street' video clips with known ground truths. Each video clip comprises 100 frames with a size of 400 × 300. Table 1 lists the parameters used in the experiments; these parameter values can be modified by the user based on the data. The experimental programs were implemented in Microsoft Visual Studio C++ on an Intel® Core i5-4570 3.2 GHz computer (Intel, Santa Clara, CA, USA) with 4 GB of RAM running a 64-bit Windows 7 platform.

Performance Evaluation
For the performance evaluation, we adopted the peak signal-to-noise ratio (PSNR) and the bad pixel rate (BPR_kn) to demonstrate the benefits of our proposed method. The PSNR is expressed as PSNR = 10·log10(255^2 / MSE), with MSE = (1/(M·N)) Σ_n (I_n − I'_n)^2, where I_n and I'_n denote the refined disparity of the nth pixel and the corresponding ground-truth disparity, respectively. We then computed the average PSNR value over each dataset to verify the performance. The BPR_kn is expressed as BPR_kn = (1/(M·N)) Σ_{x,y} [|d(x, y) − d_GT(x, y)| > kn], where M and N denote the height and width of the disparity map, respectively, d(x, y) and d_GT(x, y) denote the refined disparity map and the ground truth at coordinates (x, y), respectively, and kn is a threshold for the disparity difference, set to 1, 2, and 4.
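Both metrics are standard and can be computed as follows; the 255 peak value assumes 8-bit disparity maps, as is common for these datasets.

```python
import numpy as np

def psnr(disp, gt):
    """PSNR between a refined disparity map and the ground truth,
    assuming an 8-bit peak value of 255."""
    mse = np.mean((disp.astype(float) - gt.astype(float)) ** 2)
    if mse == 0:
        return float('inf')
    return 10.0 * np.log10(255.0 ** 2 / mse)

def bpr(disp, gt, kn=1):
    """Bad pixel rate: fraction of pixels whose disparity differs
    from the ground truth by more than kn."""
    return np.mean(np.abs(disp.astype(float) - gt.astype(float)) > kn)
```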

Experimental Results
The experimental results are presented in the following sections. Tables 2 and 3 list the quality results, and Table 4 lists the average execution time per frame for the sequences. From Table 4, it is obvious that the method of Jung et al. required a long execution time to achieve the disparity recovery; however, the corresponding quality was not much better, as shown in Table 2. Although the execution times of the methods of Zhang et al. and Lin et al. were shorter than that of our proposed method, the quality of their results was lower, as shown in Tables 2 and 4. Figures 7-10 illustrate the PSNR values per frame for the datasets, and Figures 11-13 show a portion of the refined disparities for the datasets.

Discussion
The experimental results were presented in the previous section. In summary, as a performance measure, a higher PSNR value indicates a better disparity refinement. The average PSNR values of our proposed method were greater than 35 dB for the datasets, and the overall average PSNR performance was superior to that of the compared methods, as shown in Tables 2 and 3. Table 4 presents the execution time results; our proposed method achieves disparity refinement in a short execution time. Although it is slower than the methods of Zhang et al. and Lin et al., the execution time is acceptable. Figures 11-13 show a portion of the refined disparity maps for visual inspection; evidently, our proposed system provides good visual perception.

Conclusions
This paper proposed an efficient spatiotemporal stereo matching with disparity refinement method based on SLIC segmentation. Based on the segmentation information, the disparity errors in the disparity map are first detected. Next, we search for the matching region in the previous frame based on the segmented region. Then, based on the motion information of the matching region in the previous frame, we can find the correct disparities. The disparity refinement is performed using our proposed technique in the temporal and spatial domains. Finally, in order to obtain a more comfortable visual perception, a median filter is used on the refined disparity maps.
As presented in the experimental results, the proposed method can efficiently improve the disparity quality and produce smooth disparity maps for a video clip. However, the current system cannot find the correct matching point for temporal-domain refinement if the variation of the object aspect is too abrupt. Handling such abrupt variation in the temporal domain will be a major focus of future research.

Data Availability Statement: Publicly available datasets were analyzed in this study. These data can be found in [9]: https://www.cl.cam.ac.uk/research/rainbow/projects/dcbgrid/ (accessed on 5 February 2021).