Sparse Cost Volume for Efficient Stereo Matching

Abstract: Stereo matching has been solved as a supervised learning task with convolutional neural networks (CNNs). However, CNN-based approaches generally require huge amounts of memory. In addition, it is still challenging to find correct correspondences between images in ill-posed dim regions and under sensor noise. To solve these problems, we propose the Sparse Cost Volume Network (SCV-Net), which achieves high accuracy, low memory cost and fast computation. The idea of the cost volume for stereo matching was initially proposed in GC-Net. In our work, by making the cost volume compact and proposing an efficient similarity evaluation for the volume, we achieve faster stereo matching while improving the accuracy. Moreover, we propose to use weight normalization instead of the commonly used batch normalization for stereo matching tasks. This improves the robustness not only to sensor noise in images but also to the batch size in the training process. We evaluated our network on the Scene Flow and KITTI 2015 datasets; its performance overall surpasses that of the GC-Net. Compared with the GC-Net, our SCV-Net (1) reduces the GPU memory cost by 73.08%; (2) reduces the processing time by 61.11%; (3) improves the 3PE from 2.87% to 2.61% on the KITTI 2015 dataset.


Introduction
Depth images have been widely used as input to many computer vision applications such as 3D reconstruction [1], object detection [2], and visual odometry [3]. A classical, cost-effective way to acquire depth images is to use a stereo camera. Given a calibrated stereo camera, the depth at a pixel can be computed from the disparity between the two images. The process of computing this disparity is generally referred to as stereo matching. Owing to image rectification based on the epipolar constraint, the search space for matching can be limited to a 1D horizontal line, as opposed to the 2D search required for optical flow.
As summarized in [4], stereo matching is traditionally formulated as a problem with several optimization stages: (1) matching cost calculation; (2) cost aggregation; (3) disparity computation; (4) disparity refinement. By leveraging recent advances in machine learning such as deep learning, stereo matching methods using neural networks have been proposed [5,6]. Such methods have shown strong ability in correspondence matching by taking advantage of massive training data [7]. However, issues remain in both computational and memory costs. In addition, it is still challenging to find correct correspondences in ill-posed regions. For example, stereo matching normally fails at object occlusions, repeated patterns, and texture-less or dim regions. Furthermore, sensor noise harms the matching because the local texture can be largely affected by the noise. As discussed in Section 4, our detailed evaluation indicated that matching in dim and noisy regions remains challenging even with state-of-the-art methods. Although this is one of the main causes of decreased accuracy and is especially crucial in outdoor environments, it has not been much discussed in the literature. For these issues, further improvements are clearly required to widen the use of stereo matching.
In this paper, we propose the Sparse Cost Volume Network (SCV-Net), which costs less GPU memory and runtime while achieving accuracy comparable to state-of-the-art methods. Our network architecture is inspired by GC-Net [8]. In the GC-Net, the idea of the cost volume was introduced to arrange local and global features for all possible disparities in a dense manner. However, this structure requires huge memory space and makes the execution slow. Since such a volume is redundant in terms of feature representation, we propose a sparse structure for the cost volume and an efficient similarity evaluation for the sparse volume. In addition, we propose to use weight normalization [9] instead of the batch normalization commonly used in neural networks for stereo matching tasks. Weight normalization not only improves the robustness to image noise but also suppresses the influence of the batch size in the training process. Overall, we achieved savings of 73.08% in GPU memory and 61.11% in runtime compared with the GC-Net. In Section 4, we detail the evaluation results on the Scene Flow and KITTI 2015 benchmarks, and finally discuss the advantages and limitations of our network. Our source code can be found at https://github.com/rairyuu/SCVNet.

Related Work
We briefly review state-of-the-art stereo matching methods based on deep learning. Deep learning has been used in stereo matching and has shown its superiority over traditional methods in recent literature [5,6,10,11]. Zbontar et al. trained a siamese network to extract patch features and then found the correspondences between the features of the two images [5]. Komodakis and Zagoruyko proposed an approach to learn a general similarity function for comparing image patches directly from image data [10], which can also be used in stereo matching. Inspired by their work [5,10], Luo et al. treated stereo matching as multi-class classification over all possible disparities and used the inner product as the similarity to accelerate the calculation [6]. Seki et al. constructed a semi-global matching (SGM) network by training the network to predict the penalties of small image patches [11]. Compared with traditional methods, deep learning based ones usually have higher hardware requirements, but they bring significant improvements in accuracy and processing time.
Given input stereo images, end-to-end deep learning methods have also been proposed to directly output the final disparity map [7,12-16]. Mayer et al. trained a network to learn the disparity directly from the input images and supervised the result at multiple scales [7]. Gidaris et al. proposed a method to detect incorrect disparities, replace them with new ones, and finally refine the renewed disparity map [12]. Pang et al. proposed a framework to refine an input disparity map estimated by other methods [13]. Jie et al. introduced a recurrent neural network in which the predicted disparity map is refined at each recurrent step [14]. To improve the accuracy, Liang et al. constructed a network with three parts: one for feature extraction, one for matching cost calculation and one for result refinement [15]. Chang et al. proposed a pyramid network with 3D convolutions to improve the performance in ill-posed regions [16].
To leverage geometric knowledge in stereo matching, Kendall et al. proposed a novel architecture named GC-Net [8]. They treated stereo matching as a regression problem, which gives the GC-Net the ability to predict the disparity with sub-pixel accuracy. Although the GC-Net performs better than other methods, it has drawbacks: the network is large and slow. Moreover, as discussed in Section 4.4, its accuracy degrades in dim lighting conditions and sensor noise regions.
Since the computational cost mainly comes from the large search space of the cost volume, search-space reduction has long been a research topic in the stereo matching literature. Wang et al. proposed a two-stage matching to reduce the search space for Markov Random Field-based stereo algorithms [17]. For Graph Cuts based stereo matching, Veksler et al. proposed to use fast local correspondence methods to limit the disparity search range [18]. By using support points and triangulation geometry, Geiger et al. reduced the matching ambiguities, which also reduced the search space [19]. In Geiger's work, the support points are defined as pixels that can be robustly matched owing to their texture and uniqueness. Gurbuz et al. proposed a sparse recursive cost aggregation, achieving O(1) complexity local stereo matching [20]. Khamis et al. proposed to use a coarse-resolution cost volume in the network; although its accuracy is not top notch, it achieves real-time processing on high-end GPUs [21].
To reduce the computational costs of the GC-Net [8] while keeping its accuracy, we propose a structure named the sparse cost volume, based on the GC-Net. Even though our design strategy is simply to make the volume compact, it largely improves all aspects of the GC-Net. As illustrated in Section 3.2, different from the cost volume in the GC-Net, we form the sparse cost volume by introducing a stride S. Together with our novel similarity evaluation, our SCV-Net achieves higher accuracy, lower memory cost and faster computation. Moreover, our network is robust to dim and sensor noise regions owing to the weight normalization incorporated into the network.

Network Architecture
Figure 1 illustrates our architecture, which comprises four stages: feature extraction, sparse cost volume construction, similarity evaluation and loss evaluation. Table 1 provides a layer-by-layer definition of our network. In this section, we explain our design strategy for each stage in detail.

Figure 1. Our Sparse Cost Volume Network (SCV-Net). By making the cost volume compact and proposing an efficient similarity evaluation for the structure, we achieve stereo matching with high accuracy, low memory cost and fast computation.

Feature Extraction
First, we explain our deep representation for computing the stereo matching cost. At this stage, we basically follow the architecture of the GC-Net [8].
The feature extraction consists of a number of 2D convolution operations. Each convolution layer has a weight normalization [9] and a ReLU non-linearity. To reduce the computational demand, we initially apply a 5 × 5 convolution filter with a stride of 2 to down-sample the input images. Following this layer, we append eight residual blocks [23], each of which consists of two 3 × 3 convolution filters in series.
In this feature extraction network, the final output layer has no weight normalization or ReLU non-linearity. This allows our network to represent the absolute features of the input images. Additionally, we concatenate the input of an earlier layer (Conv8) to the input of the output layer (Conv18). Since Conv8 is a relatively shallow layer, its input contains local features; concatenating the two inputs makes our network concentrate more on local features. This slightly increases the memory cost, but gives more accurate results in our experiments. Since we extract the features of the left and right images with the same network, its parameters are naturally shared.
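The feature extraction stage described above can be sketched as follows. This is a minimal illustration, not the exact SCV-Net definition (Table 1): channel counts and the Conv8-to-Conv18 skip concatenation are simplified, but the structure follows the text, a 5 × 5 stride-2 down-sampling convolution, eight residual blocks of two 3 × 3 convolutions, weight normalization with ReLU on every layer except the output.

```python
import torch
import torch.nn as nn
from torch.nn.utils import weight_norm

class ResidualBlock(nn.Module):
    """Two 3x3 convolutions with a skip connection, as in [23]."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = weight_norm(nn.Conv2d(channels, channels, 3, padding=1))
        self.conv2 = weight_norm(nn.Conv2d(channels, channels, 3, padding=1))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        return self.relu(out + x)

class FeatureExtractor(nn.Module):
    def __init__(self, channels=32, num_blocks=8):
        super().__init__()
        # 5x5 convolution with stride 2 halves the spatial resolution
        self.down = weight_norm(nn.Conv2d(3, channels, 5, stride=2, padding=2))
        self.relu = nn.ReLU(inplace=True)
        self.blocks = nn.Sequential(
            *[ResidualBlock(channels) for _ in range(num_blocks)])
        # output layer: no weight normalization, no ReLU (absolute features)
        self.out = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        x = self.relu(self.down(x))
        x = self.blocks(x)
        return self.out(x)

left = torch.randn(1, 3, 64, 96)
feat = FeatureExtractor()(left)  # the same (shared) network processes the right image
print(feat.shape)                # (1, 32, 32, 48): half the input resolution
```

Because the left and right images pass through one shared module, parameter sharing comes for free.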

Sparse Cost Volume Construction
Next, we explain the computation of the stereo matching cost by forming a cost volume. The cost volume in the GC-Net [8] requires a lot of memory during its calculation because the cost volume itself is redundant in terms of the feature representation. Therefore, we form a sparse one for less memory use and faster computation.
As illustrated in Figure 2, the cost volume in the GC-Net is formed by moving the right feature maps to the right pixel by pixel, with a stride of 1. This has a clean geometric interpretation and is theoretically reasonable. However, it introduces redundancies in feature learning, as follows. Generally, the features extracted by the feature extraction stage consist of two parts: local features and global ones.
The local features at different pixels differ from each other, whereas adjacent pixels usually share similar global features. In other words, the global features at adjacent pixels are redundantly computed multiple times in the GC-Net. Since this costs large amounts of GPU memory and computation, it is the bottleneck of the GC-Net. An ideal solution would separate out the common features. However, the features extracted by neural networks are usually intricate and cannot be easily separated.
An alternative is to train the network itself to arrange these features. As illustrated in Figure 3, our sparse cost volume is formed by moving the right feature maps to the right with a stride S, a parameter that controls the sparseness. This parameter should be big enough to bring a considerable improvement in memory use and runtime, but not so big that too many features are dropped, which would decrease the accuracy. In this paper, we use S = 3 by default. With the sparse cost volume, we can train our network to compress the features of adjacent pixels into the central pixel. In this way, the skipped pixels (disparities) are compared not directly but in an encoded way; the result is decoded later, as described in Section 3.3. Even though this parameterization is simple and straightforward, it effectively suppresses the redundancy. In Section 4.3, we provide a detailed comparison of GPU memory, runtime and accuracy for different strides.
Figure 3. We propose a stride parameter S to control the redundancy.
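The construction above can be sketched directly: for every S-th candidate disparity, shift the right feature map and concatenate it with the left one. This is a sketch under assumptions not fixed by the text, zero-padding for pixels shifted out of view and channel-wise concatenation (as in the GC-Net), and `max_disp` here is the disparity range at the half resolution of the feature maps.

```python
import torch

def sparse_cost_volume(left_feat, right_feat, max_disp, stride):
    """Stack left features with right features shifted by every S-th
    candidate disparity. Out-of-view pixels are zero-padded (an
    assumption; the paper does not specify the padding)."""
    volumes = []
    for d in range(0, max_disp, stride):
        if d == 0:
            shifted = right_feat
        else:
            shifted = torch.zeros_like(right_feat)
            # a pixel at column x in the left image matches column x-d on the right
            shifted[:, :, :, d:] = right_feat[:, :, :, :-d]
        volumes.append(torch.cat([left_feat, shifted], dim=1))
    # shape: [Batch, Disparity/S, 2*Feature, Height, Width]
    return torch.stack(volumes, dim=1)

left = torch.randn(1, 32, 40, 60)
right = torch.randn(1, 32, 40, 60)
vol = sparse_cost_volume(left, right, max_disp=96, stride=3)
print(vol.shape)  # (1, 32, 64, 40, 60): 96/3 = 32 disparity hypotheses
```

With stride 1 this reduces to the dense GC-Net volume; stride S divides the disparity dimension, and hence memory, by S.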

Similarity Evaluation
The cost volume in the GC-Net [8] has the shape [Batchsize, Feature, Disparity/2, Height/2, Width/2], which is a 5D tensor. To process this tensor, a series of 3D convolutions and transposed convolutions were used. Although they expand the field of view of the network, much of their computation is wasted because of the redundancy. Moreover, they force Disparity to be a multiple of 32 to ensure that the transposed convolutions work properly. In our similarity evaluation, we propose to merge the Batchsize and Disparity/2 dimensions. Since the input images are large and our network only needs to process one pair of stereo images at a time, the Batchsize is set to 1. This makes our sparse cost volume have the shape [Disparity/6, Feature, Height/2, Width/2]. Note that Disparity/6 is 1/S = 1/3 of Disparity/2 because our cost volume is formed with a stride of S = 3, as in Figure 3. Since this is a 4D tensor, we can process it with 2D convolutions, which are faster than 3D ones.
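The dimension merge is a plain reshape. With Batchsize fixed to 1, each disparity hypothesis becomes a "batch" entry of a 4D tensor, so ordinary 2D convolutions can process all hypotheses at once. The shapes below are illustrative, not the exact SCV-Net sizes.

```python
import torch
import torch.nn as nn

# 5D volume [1, D/6, 2F, H/2, W/2]; dropping the singleton batch dimension
# turns it into a 4D tensor [D/6, 2F, H/2, W/2] that 2D convolutions accept.
vol5d = torch.randn(1, 32, 64, 40, 60)
vol4d = vol5d.squeeze(0)            # disparity hypotheses now act as the batch

conv2d = nn.Conv2d(64, 32, kernel_size=3, padding=1)
out = conv2d(vol4d)                 # all 32 hypotheses processed in parallel
print(out.shape)                    # (32, 32, 40, 60)
```

Because convolution weights are shared across the batch dimension, every disparity hypothesis is evaluated by the same filters, which is exactly what makes this trick valid.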
As illustrated in Figure 1 and Table 1, we use an hour-glass structure, which applies a series of down-sampling and up-sampling operations to extract features [24,25]. Figure 4 provides an intuitive view of our network. Each down-sampling layer is followed by a similarity evaluation branch; to make the evaluation more effective, each branch consists of three convolutional layers. The evaluated similarity maps are then up-sampled, added to the similarity maps of the same resolution, and finally up-sampled until they reach the resolution of the input stereo images. Layers with lower resolution tend to evaluate the similarity of global features, while layers with higher resolution tend to evaluate that of local features. Combining these similarity maps helps our network give a more comprehensive evaluation using both local and global features. Since Disparity/6 is moved to the Batchsize dimension, all disparity pairs are processed independently. To output an absolute similarity, we remove the weight normalization and ReLU from the output layer. In the feature extraction, we train the network to compress the features of adjacent pixels into one pixel, so the features are processed in an encoded way; we decode them in the output layer of our network to recover the full set of disparities. The output similarity map has the shape [Disparity/6, 6, Height, Width].

Loss Function
We use the soft argmax to estimate disparities, similar to [8,26]. This gives our network the ability to achieve sub-pixel accuracy. Before applying the soft argmax, we merge the first two dimensions of the similarity map into the shape [Disparity, Height, Width]. The predicted disparity $\hat{D}$ is the weighted average of all possible disparities:

$$\hat{D} = \sum_{d=0}^{D_{\max}} d \times \mathrm{softmax}(s_d),$$

where the maximum possible disparity $D_{\max}$ equals Disparity − 1 and $s_d$ represents the similarity at disparity $d$. We train our model with supervised learning using ground truth disparity data. When LIDAR is used to generate the ground truth, as in the KITTI dataset [27,28], the labels may be sparse. Therefore, we use the average absolute error between the ground truth disparity $D_n$ and the predicted disparity $\hat{D}_n$ as the loss:

$$L = \frac{1}{N} \sum_{n=1}^{N} \left| D_n - \hat{D}_n \right|,$$

where $N$ is the number of labeled pixels in the image.
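The soft argmax regression can be sketched in a few lines: softmax the similarities along the disparity axis, then take the expectation over the disparity indices. This is a generic sketch of the technique, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def soft_argmax_disparity(similarity):
    """Soft argmax over a [Disparity, Height, Width] similarity map:
    the prediction is the softmax-weighted average of all candidate
    disparities, which yields sub-pixel (continuous) estimates."""
    d_max = similarity.shape[0]
    weights = F.softmax(similarity, dim=0)                   # per-pixel distribution
    disparities = torch.arange(d_max, dtype=similarity.dtype).view(-1, 1, 1)
    return (weights * disparities).sum(dim=0)                # expectation over d

sim = torch.randn(192, 40, 60)
pred = soft_argmax_disparity(sim)
print(pred.shape)  # (40, 60); every value lies in [0, 191]
```

Because the expectation is differentiable, the whole network can be trained end to end with the absolute-error loss above, unlike a hard argmax.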

Weight Normalization
Most neural networks for stereo matching utilize batch normalization to improve performance, and batch normalization usually performs well in various recognition tasks. However, neural networks for stereo matching are usually large, which forces a small Batchsize during training; for example, the Batchsize is lower than 4 in most networks [8,13,15,16]. Moreover, the distribution of a dataset is usually uneven, especially when the dataset is small. In these situations, batch normalization is not a suitable choice because it fails to adapt to some images in the dataset.
In our network, we use weight normalization throughout. As described in [9], weight normalization has the following advantages: (1) it does not introduce any dependencies between the samples in a minibatch; (2) it is not sensitive to noise; (3) it has lower computational overhead. This helps our network perform better on these datasets. In Section 4.4, we show the advantages of using weight normalization.
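In PyTorch, weight normalization [9] is applied by wrapping a layer: it reparameterizes the weight as w = g · v/‖v‖, decoupling magnitude from direction. Unlike batch normalization, no statistic is computed across the minibatch, which is why it behaves well with a Batchsize of 1. A minimal illustration:

```python
import torch
import torch.nn as nn
from torch.nn.utils import weight_norm

# Wrapping replaces the raw `weight` parameter with a magnitude (`weight_g`)
# and a direction (`weight_v`); the effective weight is rebuilt on each
# forward pass, independently of any other sample in the batch.
conv = weight_norm(nn.Conv2d(32, 32, kernel_size=3, padding=1))
print(sorted(name for name, _ in conv.named_parameters()))
# ['bias', 'weight_g', 'weight_v']

x = torch.randn(1, 32, 16, 16)   # Batchsize of 1 is unproblematic here
print(conv(x).shape)             # (1, 32, 16, 16)
```

The same wrapper applies to every convolution in the network; at the output layers, the paper simply leaves it off.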

Experiments
We evaluated our SCV-Net on the Scene Flow [7] and KITTI 2015 [28] datasets. To achieve better performance, we first trained our network on the Scene Flow dataset. Then, we fine-tuned it with the KITTI dataset.

Implementation Detail
We implemented our model in PyTorch as follows. The network was randomly initialized. Our model was optimized with RMSProp [29] using a multi-step learning rate schedule. Specifically, to train the network on the Scene Flow dataset, the learning rate was set to 1 × 10−4 for all 210k iterations. For the fine-tuning on the KITTI 2015 dataset, the learning rate was initially set to 2 × 10−4, reduced by half at the 20k-th and 40k-th iterations, and the training was stopped at the 60k-th iteration.
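The KITTI fine-tuning schedule above maps directly onto PyTorch's built-in scheduler. This sketch steps the scheduler once per iteration (not per epoch) to match the iteration counts in the text; the dummy parameter stands in for the real network.

```python
import torch
from torch.optim import RMSprop
from torch.optim.lr_scheduler import MultiStepLR

# RMSProp starting at 2e-4, halved at the 20k-th and 40k-th iterations,
# stopped at 60k iterations.
params = [torch.nn.Parameter(torch.randn(2, 2))]   # placeholder for model params
optimizer = RMSprop(params, lr=2e-4)
scheduler = MultiStepLR(optimizer, milestones=[20000, 40000], gamma=0.5)

lrs = []
for it in range(60000):
    optimizer.step()        # real training would compute a loss and backprop here
    scheduler.step()
    lrs.append(optimizer.param_groups[0]["lr"])

print(lrs[0], lrs[25000], lrs[45000])  # 0.0002 0.0001 5e-05
```

The Scene Flow pre-training stage needs no scheduler at all, since its rate stays at 1 × 10−4 throughout.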
We trained our network with a Batchsize of 1 using 768 × 320 image pairs randomly cropped from the inputs. Before inputting them to the network, we normalized each image pair so that the pixel intensities ranged from −1 to 1. Additionally, we performed data augmentation on the KITTI 2015 dataset to improve the adaptability of our network.
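The cropping and normalization just described can be sketched as below. The use of identical crop offsets for both images is required to preserve rectification; treating inputs as 8-bit [0, 255] images is an assumption.

```python
import torch

def preprocess(img_pair, crop_h=320, crop_w=768):
    """Randomly crop a stereo pair with identical offsets (keeping the
    pair rectified) and scale intensities from [0, 255] to [-1, 1]."""
    _, h, w = img_pair[0].shape
    top = torch.randint(0, h - crop_h + 1, (1,)).item()
    left = torch.randint(0, w - crop_w + 1, (1,)).item()
    out = []
    for img in img_pair:
        crop = img[:, top:top + crop_h, left:left + crop_w].float()
        out.append(crop / 127.5 - 1.0)   # [0,255] -> [-1,1]
    return out

# KITTI-sized dummy pair (3 x 375 x 1242)
pair = [torch.randint(0, 256, (3, 375, 1242)) for _ in range(2)]
l, r = preprocess(pair)
print(l.shape, float(l.min()) >= -1.0, float(l.max()) <= 1.0)
```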
For both the Scene Flow and KITTI 2015 datasets, Disparity was set to 192. The whole training took about 21 h on a single NVIDIA GTX 1080Ti GPU. It should be noted that the maximum disparity in the datasets is larger than 192; to train the network correctly, we discarded all pixels with disparities outside the range [0, 192).
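Discarding out-of-range pixels amounts to masking the loss. A minimal sketch, assuming the absolute-error loss of Section 3.4 and treating negative values as invalid labels (an assumption about the label encoding):

```python
import torch

MAX_DISP = 192

def masked_l1(pred, gt):
    """Average absolute disparity error over valid pixels only; pixels
    whose ground truth falls outside [0, MAX_DISP) are discarded."""
    mask = (gt >= 0) & (gt < MAX_DISP)
    return (pred[mask] - gt[mask]).abs().mean()

pred = torch.full((4, 4), 10.0)
gt = torch.full((4, 4), 12.0)
gt[0, 0] = 250.0   # out of range: excluded from the loss
print(masked_l1(pred, gt).item())  # 2.0
```

The same mask naturally handles the sparse LIDAR ground truth of KITTI, where many pixels simply carry no label.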

Computational Efficiency
We implemented the GC-Net and our SCV-Net in the same environment, and evaluated both networks on a single NVIDIA 1080Ti GPU. The computational overhead of the two networks is described in Table 2. When processing the same data, our SCV-Net saved more than 73.08% GPU memory and 61.11% runtime compared to the GC-Net. This significant improvement comes from our efficient sparse cost volume. In addition, the experiments in Section 4.3 indicate that our sparse cost volume did not harm the accuracy.

Benchmark Results
First, we validated our network on the test set of the Scene Flow dataset. We evaluated the end point error (EPE) and the percentages of pixels with errors larger than 1, 3 and 5 pixels (1PE, 3PE and 5PE). As indicated in Table 3, our network surpassed the GC-Net [8] by a noteworthy margin in all indexes except EPE. As described in Section 4.1, a part of the Scene Flow dataset has disparities larger than 192. We discarded those pixels during training, which led to worse performance on such regions at test time and thus a larger EPE, which measures the average error. Note that the GPU memory cost and processing time of the GC-Net are absent in Table 3: the image size (960 × 540) is too large to be processed on a single NVIDIA GTX 1080Ti, so the GC-Net result is quoted from its original paper [8].
Next, we investigated the performance of our network with different stride parameters S for constructing the sparse cost volume. As described in Table 3, Ours-S2, which has a stride of 2, performs better on 3PE and 5PE but is larger and slower. Ours-S4 performs worse on 1PE and 3PE, but has a reasonable 5PE in practice with small memory use and fast computation. Since the 1PE and 3PE of Ours-S4 were already worse than those of the GC-Net, we did not test S > 4. By introducing the stride parameter, we can control the balance between accuracy, memory use and computational cost. Since Ours-S3 gives a good balance, we chose it for the following experiments.
In addition, as shown in Table 3, we investigated how much weight normalization improved the performance. Ours-S3-BN, which uses batch normalization, partly surpassed Ours-S3 in accuracy. However, batch normalization increased the GPU memory cost by 14%. Moreover, as discussed in Section 4.4, weight normalization achieves better performance in dim and noisy regions. For these reasons, we decided to use weight normalization in our network.
Finally, we evaluated our network (Ours-S3) on the KITTI 2015 benchmark [28]. Figure 5 illustrates the results, and Table 4 provides a detailed comparison. The Foreground index refers to dynamic object pixels such as vehicles and pedestrians; Background refers to static pixels such as streets and trees, while Overall refers to all pixels. The results show the percentage of pixels with an error greater than 3 pixels over all 200 test image pairs. The Runtime index refers to the average processing time. As shown in Table 4, our network surpassed the GC-Net in almost all indexes.
Figure 5. Results on the KITTI 2015 stereo benchmark (including occluded pixels). Our network gives better results in the foreground, which corresponds to cars and pedestrians.

Discussions
Stereo matching in dim and noisy regions has been an issue for outdoor applications. In our experiments, the GC-Net did not perform well in these regions. As discussed in Section 3.5, the commonly used batch normalization performs poorly when the Batchsize is too small. As illustrated in Figure 6, Ours-S3, which uses weight normalization, performed well in these ill-posed regions. On the other hand, Ours-S3-BN, which uses batch normalization, gave results similar to the GC-Net. We also evaluated the GC-Net with weight normalization; it likewise overcame the limitations in dim lighting and sensor noise regions.
There are still some issues left for further improvement. As illustrated in the top of Figure 7, our network still cannot deal well with specular reflection. In some complex conditions, for example when an object occlusion happens in a texture-less region, our network may fail to separate the foreground and the background, leading to wrong estimates. As shown in the bottom of Figure 7, our network failed to separate the pedestrians from the street.

Figure 7. Failure examples. Our network still cannot deal well with specular reflection and object occlusion.

Conclusions
In this paper, we proposed the SCV-Net, which contains a sparse cost volume and saves more than 73.08% GPU memory and 61.11% runtime compared with the GC-Net. By parameterizing the stride of the sparse cost volume, the network can be tuned toward higher accuracy or toward smaller, faster models. Moreover, we use weight normalization to address the problem of processing dim and noisy regions. Our network can satisfy the requirements of most practical applications.
Author Contributions: All authors conceived and designed the study. C.L. designed the network, performed the training and testing of the network, and drafted most of the manuscript under the supervision and suggestion from H.U., D.T., A.S. and R.T. All authors contributed to the result analysis and discussions. All authors approved the submitted manuscript.
Funding: A part of this research was funded by JSPS KAKENHI grant number JP17H01768.