An Efficient Stereo Matching Network Using Sequential Feature Fusion

Abstract: Recent stereo matching networks adopt 4D cost volumes and 3D convolutions for processing those volumes. Although these methods show good performance in terms of accuracy, they have an inherent disadvantage in that they require a great deal of computing resources and memory. These requirements limit their applications in mobile environments, which are subject to inherent computing hardware constraints. Both accuracy and consumption of computing resources are important, and improving both at the same time is a non-trivial task. To deal with this problem, we propose a simple yet efficient network, called the Sequential Feature Fusion Network (SFFNet), which sequentially generates and processes the cost volume using only 2D convolutions. The main building block of our network is the Sequential Feature Fusion (SFF) module, which generates a 3D cost volume covering a part of the disparity range by shifting and concatenating the target features, and processes this cost volume using 2D convolutions. A series of SFF modules in our SFFNet is designed to gradually cover the full disparity range. Our method avoids heavy computations and allows for efficient generation of an accurate final disparity map. Various experiments show that our method has an advantage in terms of accuracy versus efficiency compared to other networks.


Introduction
Stereo matching is a fundamental computer vision problem and has been studied for decades. It aims to estimate the disparity for every pixel in the reference image from a pair of images taken from different points of view. Disparity is the difference in horizontal coordinates between corresponding pixels in the reference and target stereo images. If the pixel (x, y) in the reference left image corresponds to the pixel (x − d, y) in the target right image, the disparity of this pixel is d. Using the disparity value d, the focal length f of the camera, and the distance B between the centers of the two cameras, the depth can be obtained as fB/d. Stereo matching allows us to obtain 3D information in a relatively inexpensive manner compared to other methods which leverage active 3D sensors [1] such as LiDAR, ToF, and structured light. The importance of stereo matching has recently been increasing, because 3D information is required in various emerging applications, including autonomous driving [2], augmented reality [3], virtual reality [4], and robot vision [5].
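The disparity-to-depth relation above can be sketched as follows; the focal length and baseline in the usage example are illustrative, KITTI-like numbers, not values from this paper:

```python
def disparity_to_depth(d, f, B):
    """Depth from disparity: Z = f * B / d, where f is the focal length in
    pixels, B is the baseline (distance between camera centers) in meters,
    and d is the disparity in pixels."""
    if d <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return f * B / d
```

For example, with f = 721 px and B = 0.54 m, a disparity of 50 px gives a depth of 721 × 0.54 / 50 ≈ 7.79 m; halving the disparity doubles the depth, which is why small disparity errors matter most for distant objects.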
Like other computer vision problems, much progress in terms of accuracy has been achieved by employing deep learning for stereo matching. Following conventional stereo matching methods [6], the structure of existing deep learning-based methods includes four steps: feature extraction, cost volume construction, cost volume processing (or aggregation), and final disparity (or depth) map estimation. Early approaches [7][8][9] using deep learning for stereo matching focus on extracting features with a convolutional neural network (CNN) and computing similarity scores for pairs of corresponding image patches. Zbontar and LeCun [7] proposed the first deep learning-based stereo matching network, which learns to match corresponding image patches with a CNN. Luo et al. [9] also use a CNN to compute matching costs from robust deep features extracted by a Siamese network. These early approaches show a significant increase in accuracy compared to previous conventional methods which use hand-crafted features. However, they share a common limitation: heavy computation is required to forward-pass all potentially corresponding patches. In addition, the gain in accuracy from deep learning is limited, because they still rely on post-processing functions to obtain a final disparity map.
Mayer et al. [10] proposed DispNet, the first end-to-end network including feature extraction, cost volume generation, and disparity regression by processing the cost volume. Pang et al. [11] proposed an encoder-decoder network using 2D convolutions with cascaded residual learning. For the cost volume construction, these approaches [10,11] create a 3D cost volume with dimensions of width, height, and disparity range. To this end, the corresponding deep features are processed in a hand-crafted manner, such as computing correlations between features. Cost volume processing using 2D convolutions is then performed to obtain a final disparity map. However, these methods still suffer from a lack of context information, because they rely on hand-crafted operations such as correlation or dot-product between corresponding features for the cost volume generation.
To overcome this limitation, most of the latest stereo estimation networks create a 4D cost volume by stacking the corresponding deep features [12,13], instead of relying on the correlations between corresponding features. A typical 4D cost volume has width, height, disparity range, and feature dimensions. Unlike a 3D cost volume, a 4D cost volume retains the feature dimension, so more information can be processed. This 4D cost volume is processed and regularized using 3D convolutions [12][13][14]. In addition, the soft argmin function suggested by [12] is fully differentiable and able to predict smooth sub-pixel disparity. These techniques have become mainstream because they show excellent accuracy compared to previous methods. The gain in accuracy comes from learning the entire process, including cost volume generation and processing, which is not done in the 2D convolution-based methods.
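The soft argmin of [12] can be illustrated with a minimal NumPy sketch: matching costs are converted into a softmax distribution over candidate disparities (low cost, high weight), and the expected disparity is returned, which is differentiable and can take sub-pixel values:

```python
import numpy as np

def soft_argmin(cost, axis=0):
    """Soft argmin disparity regression: softmax over negated costs,
    then the expectation of the disparity index."""
    c = np.asarray(cost, dtype=float)
    # Subtracting the minimum cost keeps the exponentials numerically stable.
    w = np.exp(-(c - c.min(axis=axis, keepdims=True)))
    w /= w.sum(axis=axis, keepdims=True)
    d = np.arange(c.shape[axis], dtype=float)
    shape = [1] * c.ndim
    shape[axis] = -1
    return (w * d.reshape(shape)).sum(axis=axis)
```

For a single pixel with costs [5, 5, 1, 0.9, 5] over disparities 0..4, the estimate lands between 2 and 3 because the two low-cost candidates share the weight, illustrating the sub-pixel behavior.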
However, most of the 3D convolution-based methods have an inherent disadvantage in that they consume a large amount of computing resources as the number of elements in the dimensions of the cost volume increases. For this reason, a 4D cost volume stacked over the full disparity range requires a great deal of memory. In addition, the 3D convolutions for processing the cost volume also require sizeable amounts of computation and memory. These requirements limit their applications in mobile environments, which are inherently constrained in terms of computing hardware. However, the number of applications that need to predict and use depth directly on mobile devices is steadily increasing. Also, in many real-world applications, including autonomous driving, augmented reality, and robotics, reliable real-time processing is essential. Therefore, many studies have been conducted on efficient stereo matching networks that can be used on mobile devices or executed in real time with reliable accuracy. Recently, AnyNet [15] was proposed to deal with this problem. It predicts a disparity map at a low scale and subsequently corrects it with the residual error at the up-sampled scale. Because AnyNet processes the full range of disparities only at the smallest scale, and computations for the other scales are performed residually, real-time processing with small computation is realized. However, the accuracy of AnyNet [15] is severely degraded compared to other 3D convolution-based methods. Although both accuracy and consumption of computing resources are important, improving both at the same time is a non-trivial task.
To deal with this problem, we propose a simple yet efficient network, called the Sequential Feature Fusion Network (SFFNet), which sequentially generates a 3D cost volume and processes it using only 2D convolutions. The main building block of our network is the Sequential Feature Fusion (SFF) module, which generates a 3D cost volume covering a part of the disparity range by shifting and concatenating the target features, and processes this cost volume using 2D convolutions. A series of SFF modules in our SFFNet is designed to gradually cover the full disparity range. Our method avoids heavy computations and allows for efficient generation of an accurate final disparity map. More specifically, with small complexity and a small number of parameters, our proposed network generates results comparable to those of previous 3D convolution-based methods.
The rest of the paper is organized as follows. Section 2 explains related works, and a detailed explanation of the proposed method follows in Section 3. Various experiments done for the purposes of comparative evaluations are provided in Section 4. Finally, Section 5 concludes the paper.

Classical Stereo Matching
Traditional stereo matching essentially consists of four steps: matching cost computation, cost aggregation, disparity computation/optimization, and disparity refinement [6]. These algorithms are divided into global matching methods [16,17] and local matching methods [18][19][20][21][22] according to the optimization method that is used. Although global matching methods usually show higher accuracy than local methods, they are relatively complicated and require a lot of computing resources such as memory. On the other hand, local matching methods have the advantage of being relatively light, but they are less accurate than the global methods. To overcome this limitation, various post-processing methods [23][24][25][26][27] and comprehensive methods [28,29] have been studied. However, these traditional stereo matching algorithms share a common limitation: their accuracy is good only under relatively simple conditions.

Deep Stereo Matching
Recently, the idea of applying deep learning to stereo matching has been revived. The seminal work of MC-CNN [7] began to establish the basic structure of stereo matching networks. The basic procedure of the classical stereo methods is still reflected in the deep learning-based stereo matching network structure. Some steps in the classical method mentioned above are replaced with convolutional neural networks (CNNs). Previous deep learning-based methods can be categorized into two classes, 2D convolution-based and 3D convolution-based methods, according to the process used for generation and processing of the cost volume. Detailed explanations are given below.

2D Convolution-Based Methods
Most early works using CNNs for stereo matching are 2D convolution-based methods. These methods leverage CNNs to extract features [9] and/or construct cost volumes and perform matching using 2D convolutions [7,8]. Some of these methods require additional post-processing to obtain the final disparity map. To overcome the drawbacks of these methods, Mayer et al. [10] proposed the first end-to-end network which directly regresses a disparity map by constructing a 3D cost volume using hand-crafted computations such as correlation between corresponding features. CRL [11] improved upon [10] based on cascade residual learning, which refines the initial disparity using residual components across multiple scales. Yang et al. [30] proposed a unified network based on [10] that performs both semantic segmentation and disparity estimation by using semantic features to improve the performance of disparity estimation. Yin et al. [31] proposed a matching network which estimates matching distribution by using feature correlation and composing multiple scale matching densities. Tonioni et al. [32] proposed a fast stereo network to perform effective online adaption.
The above-mentioned methods usually generate a 3D cost volume using the correlations between corresponding features and use 2D convolutions for cost volume processing. Like most deep learning-based computer vision techniques, these methods outperform classical stereo matching methods. However, their accuracy is usually lower than that of the 3D convolution-based methods described in the following section. Despite this shortcoming, they are often used and studied because of their advantages in terms of computing resources and/or execution time [31,32].

3D Convolution-Based Methods
The networks based on correlation and 2D convolutions introduced above do not deviate from the existing algorithms in that the matching cost is still generated in a hand-crafted manner. 3D convolution-based methods are designed to transform this step into a learnable form. Instead of constructing a 3D cost volume, GC-Net [12] proposed constructing a 4D cost volume by concatenating left-right features along the full disparity range. This cost volume is processed using CNNs comprising 3D convolutions with encoder-decoder architectures. Following [12], PSMNet [13] proposed a method of constructing 4D cost volumes using multi-scale features. PSMNet [13] uses a stacked hourglass structure comprising three encoder-decoder (hourglass) architectures. However, the disadvantage of using 3D convolutions is that doing so significantly increases the consumption of computing resources. To mitigate this increase, building on [12], Lu et al. [33] proposed a method to construct a sparse cost volume with stride to perform stereo matching efficiently. Duggal et al. proposed a deep learning-based matching network with a differentiable patch-match module [14] which prunes out most of the useless disparity range to reduce the complexity of the 3D convolutions. Tulyakov et al. [34] designed a practical network with a smaller memory footprint by compressing the cost volume into compact matching signatures before performing 3D convolution-based regularization.

Method
An overview of the proposed network is shown in Figure 1. As in other networks, we generate a feature vector for each pixel using a feature extraction network. Unlike previous works [13,14], which construct a heavy 4D cost volume by stacking corresponding features along the full disparity range and process it using 3D convolutions, our method sequentially performs cost volume construction and aggregation using the proposed Sequential Feature Fusion Network (SFFNet). Our SFFNet consists of a sequence of the proposed Sequential Feature Fusion (SFF) modules, where each module is based on the ResNet block structure [35] and Hierarchical Feature Fusion (HFF) [36]. Finally, a refine network is used to further refine the initial disparity map and obtain an accurate final disparity map. The whole structure of our network is summarized in Table 1. Detailed explanations are given in the next subsections.

Feature Extraction Network
The feature extraction network extracts a feature representation for each pixel of the input stereo images. Given a pair of stereo images I_L and I_R, features F_L(0) and F_R(0) capable of forming a cost volume are output for each viewpoint. To this end, we employ a 2D convolutional network using the Spatial Pyramid Pooling (SPP) module [37,38], similar to [13,14]. By extending pixel-level features to region level using different pooling sizes, the features generated by the SPP module incorporate hierarchical context information, which makes the feature representations more reliable. The parameters of the feature extraction networks for the left and right images are shared. For efficient computation, the size of the output feature map is 1/4 of the original input image size. This part is commonly used by other networks using 3D convolutions that show the best performance [13,14].
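The SPP idea can be illustrated with a minimal NumPy sketch. The grid sizes and nearest-neighbor upsampling below are illustrative simplifications only; the actual modules in [13,37,38] use fixed pooling windows followed by convolutions and bilinear upsampling:

```python
import numpy as np

def spp_features(feat, grids=(8, 4, 2, 1)):
    """Illustrative spatial pyramid pooling: average-pool a (C, H, W)
    feature map onto several coarse grids, upsample each back to (H, W)
    by nearest-neighbor repetition, and concatenate everything with the
    input so every pixel carries region-level context."""
    C, H, W = feat.shape
    branches = [feat]
    for g in grids:
        bh, bw = H // g, W // g  # pooling window (assumes H, W divisible by g)
        pooled = feat.reshape(C, g, bh, g, bw).mean(axis=(2, 4))  # (C, g, g)
        branches.append(pooled.repeat(bh, axis=1).repeat(bw, axis=2))
    return np.concatenate(branches, axis=0)  # (C * (len(grids) + 1), H, W)
```

The coarsest branch (g = 1) reduces to global average pooling, so each pixel in that branch carries image-level context.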
Figure 2 shows the proposed Sequential Feature Fusion Network (SFFNet), which consists of a series of M SFF modules. The first SFF module takes F_L(0) and F_R(0) from the feature extraction network as input, and the output of the n-th SFF module serves as the input of the (n + 1)-th SFF module. Only F_L(M) from the final SFF module is used in the refine network to produce the final disparity map. A single SFF module combines cost volume generation and aggregation for a part of the full disparity range using only 2D convolutions. Our SFFNet is motivated by the Hierarchical Feature Fusion (HFF) [36] method used in semantic segmentation. HFF produces a feature map that covers a large receptive field without directly performing convolutions with large kernel sizes. Instead, it hierarchically adds intermediate features with different small receptive fields before concatenating them. We adopt this idea for stereo matching, processing the full range of disparities by connecting modules, each of which processes only a subset of the disparity range. It is worth noting that the purpose of HFF is to efficiently obtain a feature map with a large receptive field in the spatial domain, whereas the purpose of our SFFNet is to efficiently enlarge the receptive field in the disparity domain. Specifically, the n-th SFF module deals with the disparity range [(n − 1)S, nS], where S represents the specific disparity range processed by a single SFF module. As shown in Figure 3, the (n + 1)-th SFF module generates output feature maps F_L(n + 1) and F_R(n + 1) from input feature maps F_L(n) and F_R(n).
Here, F_L(n + 1) and F_R(n + 1) are defined by

F_L(n + 1) = f(F_L^+(n)),  F_R(n + 1) = F_R^S(n),    (1)

where F_L^+(n) is the result of concatenating features of the reference (left) and target (right) images, and is defined by

F_L^+(n) = F_L(n) • F_R^1(n) • F_R^2(n) • ⋯ • F_R^S(n),    (2)

where • represents the concatenation operation, and F_R^i(n) denotes the feature that is shifted from the original feature F_R(n) by i pixels in the width direction. The function f(·) in Equation (1) sums the results of two 3 × 3 2D convolutions and one 1 × 1 2D convolution, as shown in Figure 3. The two 3 × 3 convolutions are used to increase the receptive field, while the 1 × 1 convolution plays the role of a projection shortcut [35] to form a residual function.
After the (n + 1)-th SFF module, a cumulative cost volume for the disparity range [0, (n + 1)S] has been generated, and the learned disparity range is widened by S pixels at each step while processing with a series of SFF modules. Concretely, F_L(n + 1) contains the processed and aggregated cost volume of the reference image for the disparity range [0, (n + 1)S], while F_R(n + 1) is the feature map of the target image shifted by (n + 1)S pixels for processing by the next, (n + 2)-th, SFF module. Please note that, unlike previous 3D convolution-based approaches, which generate a 4D cost volume covering the full disparity range and then aggregate it using 3D convolutions in a separate process, our SFFNet performs cost volume generation and aggregation simultaneously and gradually increases the range of the disparity search. The proposed SFFNet adjusts the full disparity range R through the number of SFF modules M and the number of shifts S as follows:

R = 4 × M × S,    (3)

where the factor of 4 reflects that the feature maps are 1/4 of the input image size. Although a large S value allows the network to learn a wide range of disparities in a single SFF module, the disparities cannot be learned in detail in that module. Meanwhile, the number of modules M controls the depth of the network, and a high value of M can slow the runtime.
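Under our reading of Equations (1) and (2), one SFF step can be sketched in NumPy as follows. The learned mapping f (two 3 × 3 convolutions plus a 1 × 1 projection shortcut in the paper) is replaced here by a single 1 × 1 projection (a plain matrix product over channels), purely as a placeholder, so the sketch shows only the shift-and-concatenate mechanics:

```python
import numpy as np

S = 2  # disparity span covered by one SFF module (S = 2 in the paper)

def shift_width(F, i):
    """Shift a (C, H, W) feature map by i pixels along the width axis, so
    that reference pixel (x, y) lines up with target pixel (x - i, y).
    Vacated columns are zero-filled."""
    out = np.zeros_like(F)
    if i == 0:
        out[...] = F
    else:
        out[:, :, i:] = F[:, :, :-i]
    return out

def sff_module(F_L, F_R, W_proj):
    """One SFF step: concatenate F_L(n) with F_R(n) shifted by 1..S pixels,
    fuse with a learned 2D mapping f (here a bare channel projection), and
    pass on F_R shifted by S pixels for the next module."""
    parts = [F_L] + [shift_width(F_R, i) for i in range(1, S + 1)]
    fused = np.concatenate(parts, axis=0)                    # ((S+1)*C, H, W)
    F_L_next = np.tensordot(W_proj, fused, axes=([1], [0]))  # f(.) placeholder
    F_R_next = shift_width(F_R, S)  # pre-shifted target for module n+1
    return F_L_next, F_R_next
```

Because each module shifts the target features S more pixels before handing them on, chaining M modules sweeps candidate disparities up to M × S at feature resolution without ever materializing a 4D volume.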

Refine Network and Loss Function
The feature map F_L(M) generated by the SFFNet is further processed using a light refine network similar to [14] to generate a final disparity map. As shown in Figure 1, the refine network takes F_L(M) obtained from the final, M-th, SFF module in the SFFNet and generates an initial disparity map d_init as well as a final disparity map d_refine. Use of both the initial disparity map d_init and the processed feature map F_L(M) allows the refine network to focus only on the residual component of the initial disparity map and to improve the quality of the final disparity map d_refine. Here, the initial disparity is simply generated by processing the feature map F_L(M) from the SFFNet through a 1 × 1 convolutional network [39] and bilinear upsampling. The final refined disparity map is generated using the processed feature map F_L(M) and the intermediate feature map obtained from the initial disparity processing; this step is composed of a 5 × 5 convolutional layer and bilinear upsampling. Now, the total loss function L used to learn the disparity map is defined by

L = γ_1 V_s(d_init − d_gt) + γ_2 V_s(d_refine − d_gt),    (4)

where d_init and d_refine denote the initial disparity map and the final disparity map, respectively, and d_gt is the ground-truth disparity map. Here, the smooth L1 loss function V_s(·) [40] is defined by

V_s(x) = 0.5x²  if |x| < 1,  and  V_s(x) = |x| − 0.5  otherwise.    (5)

The values of γ_1 and γ_2 in Equation (4) represent the weights of the loss of the initial disparity map and that of the final disparity map in the total loss function, respectively.
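A NumPy sketch of this training loss follows, assuming the per-pixel smooth L1 penalties are averaged over pixels with valid ground truth (the training setup restricts the loss to disparities between 0 and 192):

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 penalty V_s: 0.5*x^2 where |x| < 1, |x| - 0.5 elsewhere."""
    a = np.abs(x)
    return np.where(a < 1.0, 0.5 * x * x, a - 0.5)

def total_loss(d_init, d_refine, d_gt, gamma1=1.0, gamma2=1.3, max_disp=192.0):
    """Weighted sum of the smooth-L1 losses of the initial and refined
    disparity maps, averaged over pixels with valid ground truth."""
    valid = (d_gt > 0) & (d_gt < max_disp)
    loss_init = smooth_l1(d_init[valid] - d_gt[valid]).mean()
    loss_refine = smooth_l1(d_refine[valid] - d_gt[valid]).mean()
    return gamma1 * loss_init + gamma2 * loss_refine
```

The quadratic region near zero keeps gradients small for nearly correct pixels, while the linear region limits the influence of large outlier errors.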

Experimental Results
We evaluate our network on several datasets and demonstrate that the proposed SFFNet achieves better results in terms of consumption of computing resources vs. accuracy compared to the other methods. For purposes of comparison, we designed all experiments under the same conditions. Also, the training datasets, the maximum disparity range and all evaluation indicators for each network are the same. Next, we describe the experimental setup for each dataset, and then explain the performance using various evaluation indicators.

Datasets
We conducted experiments on two datasets: the Scene Flow dataset [41] and the KITTI-2015 dataset [42]. Accuracy is measured using the end-point error (EPE) and the three-pixel error (3PE). The EPE is the mean absolute difference, in pixels, between the predicted and ground-truth disparities; the 3PE represents the percentage of pixels for which the difference between the predicted disparity and the true one is more than 3 pixels.
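The two indicators can be computed as in the following NumPy sketch; note that the official KITTI-2015 benchmark additionally counts a pixel as erroneous only if its error also exceeds 5% of the true disparity, whereas this sketch implements the plain 3 px threshold described above:

```python
import numpy as np

def epe(pred, gt, valid):
    """End-point error: mean absolute disparity error over valid pixels."""
    return np.abs(pred[valid] - gt[valid]).mean()

def three_pixel_error(pred, gt, valid):
    """3PE: percentage of valid pixels whose absolute disparity error
    exceeds 3 pixels."""
    return 100.0 * (np.abs(pred[valid] - gt[valid]) > 3.0).mean()
```

Both metrics are evaluated only where ground-truth disparity exists, which is why a validity mask is passed explicitly.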

Implementation Details
We trained our network on the Scene Flow dataset and the KITTI training dataset. Input images from these two datasets were randomly cropped to a size of H = 256 and W = 512, and then normalized using the ImageNet [43] statistics, as in [13,14]. Adam (β_1 = 0.9, β_2 = 0.999) [44] was used as the optimization method for end-to-end training. We implemented our model using PyTorch [45] on Ubuntu 16.04 with CUDA version 10.1 and 4 Nvidia Titan-XP GPUs. The hyperparameters for the loss function in Equation (4) were set as γ_1 = 1 and γ_2 = 1.3, so that more weight was given to the final result of our network, similar to [13,14]. To create the same conditions as used for the other networks, the loss was calculated only for pixels with a ground-truth disparity value in the range of 0 to 192. The values of S and M in Equation (3) were set as S = 2 and M = 24, which cover the full disparity range of 192 at 1/4 of the input image size.
Training was done for a total of 678 epochs on the Scene Flow dataset, with a batch size of 44 and a learning rate of 0.001; the learning rate was re-adjusted to 0.0007, 0.0003, 0.0001, and 0.00007 at epochs 20, 40, 60, and 600, respectively. In the case of the KITTI dataset, the network trained on the Scene Flow dataset was transferred. Concretely, the batch size was 22 and the learning rate was set to 0.0007 and re-adjusted to 0.00004 and 0.00001 at epochs 200 and 900, respectively. We empirically determined these optimal learning rates and numbers of epochs for training.

Table 2 shows the comparative results of various methods using the test set of the Scene Flow dataset. "Ours (Initial)" represents the result using the initial disparity map of the proposed network without the refine network. "Ours" denotes the result of the proposed network with the refine network. Here, we compare our results with those of other recent 3D convolution-based networks. Table 3 further compares the runtime and EPE of the top-performing 3D convolution-based methods. For a fair comparison, the same feature extraction network is used for all methods. The results show that our SFFNet achieves lower EPE with lower runtime than PSMNet [13]. The runtime of our network is 2.8 times faster than that of DeepPruner-Best [14], while our EPE is 1.2 times higher. These results show that SFFNet is more efficient than other 3D convolution-based cost aggregation networks. Figure 4 shows a qualitative comparison of our method and others on the Scene Flow test set. Our method generates results that are comparable to those of other state-of-the-art methods [13,14] for most regions, including sharp boundaries and textureless regions.

Table 4 shows comparison results for various indicators, including runtime, error ratio, number of parameters, and FLOPs of competing algorithms on the KITTI-2015 stereo benchmark [42].
Here, the percentages of erroneous pixels in terms of 3PE averaged over the background (bg) and foreground (fg) regions and all ground-truth pixels (all) are measured separately. Noc (%) and All (%) represent the percentages of erroneous pixels for only non-occluded regions and for all pixels, respectively. Among these indicators, the computing resource-related ones (parameters, FLOPs, runtime) are not simply related to each other. For example, if an operation such as correlation [9,10] or patch-match [14] is included in the network, no additional parameters are added, but the number of floating-point operations may increase. Also, a structure with branches, such as the spatial pyramid pooling method [13,37,38], requires memory access for each branch; this can increase runtime and memory usage but not the number of parameters. As shown in the table, most of the 3D convolution-based methods require significantly more parameters and FLOPs, leading to slower runtime than ours. Concretely, the number of parameters in our method is 4.61 M, while that of DeepPruner-Fast [14] is 7.47 M, which is 1.62 times more than ours, while the runtime and FLOPs are comparable. It is worth noting that the number of parameters is an important measure of model complexity and is directly related to the efficiency of deep learning networks [47]. Thus, our method is simpler and more efficient than most of the 3D convolution-based methods listed in Table 4 in terms of model complexity. On the other hand, some 2D convolution-based methods require relatively few parameters or FLOPs, leading to faster runtime, but produce error ratios that are higher than those of the 3D convolution-based methods. The results show that our method is superior to the 3D convolution-based methods in terms of runtime, while the accuracy of all tested methods is comparable.
Although some of the 2D convolution-based methods are faster than our method, they show lower accuracy. Thus, in terms of accuracy and computing resources, our network represents a good compromise between the 2D convolution-based and 3D convolution-based methods. Figure 5 shows the results of qualitative comparisons on the test set of the KITTI-2015 benchmark [42]. The images for each method show the error maps of the predicted disparity maps and the predicted disparity maps for the red rectangle regions, where ground-truth disparities exist. In the error maps, the red and yellow colors represent regions with large errors. From these comparisons, it is observed that our method produces results comparable with those of the other methods for various scenes.

It can be seen that there is a trade-off among EPE, the number of parameters, and runtime. As mentioned before, S represents the specific disparity range processed in a single SFF module. As S increases, the range of disparity that a single module learns widens, and the number M of SFF modules required to process the full disparity range decreases. Due to the decreased M, the number of parameters of the whole network and the processing runtime are also reduced. However, EPE increases as S increases, because a larger value of S requires a correspondingly larger receptive field to be fully processed in a single SFF module. Table 5 shows that EPE reaches its lowest value when S = 2 and M = 24.

Conclusions
In this paper, we propose a simple yet efficient network, called the Sequential Feature Fusion Network (SFFNet), for stereo matching. Unlike previous 3D convolution-based networks, our method does not require the construction of a heavy 4D cost volume and 3D convolutions for processing it. Instead, our SFFNet sequentially and progressively generates a 3D cost volume and processes it using lightweight 2D convolutions. Our SFFNet consists of a series of Sequential Feature Fusion (SFF) modules, which sequentially generate 3D cost volumes to cover a part of the disparity range by shifting and concatenating target features, and then process the cost volume using 2D convolutions. Overall, SFFNet avoids heavy computations and allows for efficient generation of an accurate final disparity map. More specifically, with small complexity and a small number of parameters, our proposed network generates results comparable with previous 3D convolution-based methods. Various experiments show that our method is relatively faster and requires fewer parameters than previous 3D convolution-based methods, while achieving comparable accuracy and FLOPs. For example, on the Scene Flow test set, our SFFNet achieves lower EPE with faster runtime and a smaller number of parameters than PSMNet. The runtime of our network is 2.8 times faster than that of DeepPruner-Best, while our EPE is 1.2 times higher. In future work, we plan to improve the overall performance of our SFFNet. Specifically, to obtain a more accurate final disparity map in real time, we plan to gradually apply multi-scale approaches to the SFFNet.

Conflicts of Interest:
The authors declare no conflict of interest.