A Semi-Supervised Monocular Stereo Matching Method

Abstract: Learning-based supervised monocular depth estimation methods have shown promising results compared with traditional methods. However, these methods require a large amount of high-quality ground truth depth data as supervision labels. Due to the limitations of acquisition equipment, recording ground truth depth for different scenes is expensive and impractical. Self-supervised monocular depth estimation, which does not use ground truth depth, is a promising alternative, but self-supervised depth estimation from a single image is geometrically ambiguous and suboptimal. In this paper, we propose a novel semi-supervised monocular stereo matching method that builds on existing approaches to improve the accuracy of depth estimation. The idea is inspired by the experimental observation that, in the same self-supervised network model, depth estimation with a stereo pair as input is more accurate than with a monocular view as input. Therefore, we decompose the monocular depth estimation problem into two sub-problems: a right view synthesis process followed by a semi-supervised stereo matching process. To improve the accuracy of the synthetic right view, we extend the existing view synthesis method Deep3D by adding a left-right consistency constraint and a smoothness constraint. To reduce the error caused by the reconstructed right view, we propose a semi-supervised stereo matching model that uses disparity maps generated by a self-supervised stereo matching model as supervision cues and combines them with self-supervised cues to optimize the stereo matching network. At test time, the two networks are connected in a pipeline and predict the depth map directly from a single image. Both procedures not only obey geometric principles, but also improve estimation accuracy.
Test results on the KITTI dataset show that this method is superior to the current mainstream monocular self-supervised depth estimation methods under the same condition.


Introduction
Depth estimation is a fundamental problem of 3D scene reconstruction, which is widely used in virtual reality, self-driving cars, and other fields. It has become an active research direction with the development of these fields, and a large body of work on depth estimation now exists [1]. However, as depth estimation from a single image is an ill-posed and geometrically ambiguous problem, most traditional methods adopt feature matching algorithms based on epipolar geometry that operate on a binocular view or multiple views of the scene, such as stereo matching [2], structure from motion [3], photometric stereo [4], and depth cue fusion [5]. The 3D scenes reconstructed by these methods have low accuracy and are mostly sparse reconstructions.
In recent years, with the widespread application of deep learning in computer vision, researchers have begun to apply learning-based methods to depth estimation from stereo pairs [6,7] or from a single image [8][9][10]. These methods achieve relatively high depth estimation accuracy compared with the traditional methods. Stereo matching outperforms monocular depth estimation, but it requires costly special-purpose stereo camera rigs, which are prone to calibration errors and synchronization problems. Compared with a binocular camera, a monocular camera is therefore preferred in practical applications. However, the current mainstream learning-based monocular depth estimation methods, which are either supervised or self-supervised, have some common problems. First, supervised monocular methods rely almost completely on the semantic information of a single image and directly regress it to ground truth depth. It is difficult and impractical to obtain a large amount of high-quality ground truth depth data corresponding to the input scene images. More importantly, the ground truth depth data that can be obtained are sparse and noisy. Second, self-supervised monocular methods usually rely on a large amount of high-quality data and effective learning to train a deep network to predict the warp function that maps the left view onto the right view. The depth network model is optimized by an alignment loss between the reconstructed view and the original view. However, self-supervised depth estimation from a single image, which uses neither ground truth depth nor the right view as input, is in theory ill-posed and geometrically ambiguous. Therefore, the result of self-supervised monocular depth estimation is usually suboptimal.
To address the cost and sparsity of ground truth depth data, as well as the ill-posed nature of the self-supervised monocular method, we propose a novel semi-supervised monocular stereo matching method composed of a self-supervised view synthesis network and a semi-supervised stereo matching network. Table 1 shows that, in the same self-supervised model, the depth estimation accuracy of stereo matching is much better than that of monocular depth estimation, and even better than that of the supervised method. Motivated by these experimental results and by [11,12], we decompose the monocular depth estimation problem into a right view reconstruction problem and a stereo matching problem. The whole process is shown in Figure 1: the reconstructed right view, generated by the view synthesis network from a single left view, serves as the right view of the stereo pair that is input into the stereo matching network to predict scene depth. At test time, the two networks are piped together to map the RGB image directly to the disparity map. For view synthesis, we propose a novel model based on Deep3D [13] that adds a left-right consistency constraint and a smoothness constraint to improve the reconstructed view. To reduce the influence of errors in the reconstructed right view on the model during training, we propose a semi-supervised stereo matching model that is similar to the semi-supervised method of [14], but does not use ground truth depth data. Instead, this model uses as supervision labels the disparity maps generated by a self-supervised stereo matching network that takes the original stereo pairs as input. It combines these supervision cues with self-supervision cues to optimize the stereo matching model jointly, which we refer to as semi-supervised stereo matching.
The main problems that our paper addresses are: (1) compared with supervised depth estimation methods, our method adopts a semi-supervised model that does not use ground truth depth and produces photoconsistent dense depth maps; (2) compared with stereo matching methods, our model estimates depth from a single image, which avoids the calibration errors and synchronization problems of a stereo camera; (3) compared with monocular depth estimation methods, our model adopts semi-supervised monocular stereo matching, which not only uses the reconstructed stereo pairs as input data, but also uses the disparity maps generated by the self-supervised stereo matching network as supervision labels. This makes the whole procedure obey the primary geometric principles and yields more accurate results. We train and test our method on the popular KITTI dataset [15], and the experimental results show that our method is superior to the current state-of-the-art self-supervised monocular depth estimation model under the same conditions.
Table 1. Comparative results of self-supervised monocular depth estimation and stereo matching [12], as well as supervised depth estimation from [11]. The self-supervised depth estimation models adopt the VGG16 and ResNet50 networks. RMSE is the root mean squared error, ARD is the absolute relative difference, and SRD is the squared relative difference. Please see Section 4.1.2 for the evaluation metrics in detail.

[Table 1 body: columns grouped as (Lower Is Better) and (Higher Is Better); first row Monocular-VGG16 [12], self-supervised; values not recovered.]

Figure 1. Design idea of the semi-supervised monocular stereo matching method. We synthesize the right view from a single left view with the self-supervised view synthesis network and then use the semi-supervised stereo matching network to estimate the disparity map from the concatenated left and reconstructed right views.
Our method builds on previous work [11][12][13][14], and the main contributions of this article are as follows:
a. This paper proposes a novel monocular depth estimation method that does not use ground truth depth data and combines a view synthesis network with a stereo matching network to obtain a high-quality depth map from a single image. The model not only follows geometric principles, but also improves estimation accuracy.
b. To raise the quality of the reconstructed right view, the paper improves the existing view synthesis network Deep3D by adding a left-right consistency constraint and a smoothness constraint.
c. To improve the estimation accuracy and reduce the impact of the reconstruction error of the right view, we propose a semi-supervised stereo matching method to predict the depth.

Related Work
Learning-based methods perform better than traditional methods on the task of depth estimation. Therefore, more and more researchers have applied these methods to depth estimation. Here, we focus on works related to stereo matching and monocular depth estimation based on deep learning, and we make no assumptions about the scene geometry or the types of objects present.

Stereo Matching
The large majority of stereo matching methods find, for each pixel in the left view, the matching pixel in the right view. In general, the matching problem in 3D space can be transformed into a 1D search problem by rectifying the stereo pair. The matching result is the disparity between the left and right images, and the depth value can then be obtained from the geometric relation between depth and disparity, namely d = f × b/z, where d is the disparity between the views, z is the scene depth, f is the camera focal length, and the baseline b is the distance between the cameras.
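The relation d = f × b/z can be inverted to recover depth from a predicted disparity map. The following is a minimal sketch in Python with NumPy; the function name, the focal length, and the baseline value are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def disparity_to_depth(disparity, focal_length_px, baseline_m, eps=1e-6):
    """Invert d = f * b / z to get depth z (metres) from disparity d (pixels).

    Pixels whose disparity is not above eps are treated as infinitely far away.
    """
    disparity = np.asarray(disparity, dtype=np.float64)
    depth = np.full(disparity.shape, np.inf)
    valid = disparity > eps
    depth[valid] = focal_length_px * baseline_m / disparity[valid]
    return depth

# Illustrative camera parameters (assumed, not the paper's calibration).
f_px = 720.0   # focal length in pixels
b_m = 0.54     # baseline in metres
disp = np.array([[72.0, 36.0], [0.0, 7.2]])
depth = disparity_to_depth(disp, f_px, b_m)
```

Because depth is inversely proportional to disparity, a fixed disparity error causes a larger depth error for distant points than for nearby ones.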
Recently, many papers have shown that learning-based stereo matching outperforms hand-crafted similarity measures. These methods train a mapping function, using high-quality ground truth depth data as supervision labels, to estimate image depth directly. Luo et al. [6] proposed a faster and more accurate depth estimation network architecture. The architecture consists of a Siamese network and a product layer that computes the inner product of the feature vectors from its two branches. This method treats disparity estimation as a multi-class classification problem, that is, every possible disparity is treated as a class. Zbontar et al. [7] proposed a stereo matching architecture based on a convolutional neural network that uses ground truth disparity to construct a binary classification dataset. The approach focuses on the matching cost computation by learning a similarity measure on small image patches. Mayer et al. [16] presented a deep CNN with a fully-convolutional network (FCN) [17], called the disparity estimation network (DispNet), which is trained end-to-end on synthetic stereo pairs. The architecture of DispNet [16] is similar to that of FlowNet [18], which is applied to optical flow estimation. Pang et al. [19] proposed a cascade residual convolutional neural network architecture composed of two stages, which generate residual signals across multiple scales: an improved DispNet [16] with additional up-convolution modules, and a network that explicitly rectifies the disparity. Although the above learning-based methods outperform traditional stereo matching methods, they rely on large amounts of expensive ground truth depth data and high-quality stereo pairs at training time. Godard et al. [12] mentioned an unsupervised stereo matching method based on rectified stereo pairs, which uses the concatenation of the left and right views as the input of the stereo matching model.

Monocular Depth Estimation
The applicability of stereo matching is severely limited by the calibration errors and synchronization problems of the stereo camera. To avoid these problems, monocular cameras are more likely to be deployed in real applications. Therefore, monocular depth estimation, which refers to estimating depth from a single image at test time, has attracted considerable research interest and produced a series of results.
Saxena et al. [5] were the first to propose a supervised learning approach to depth estimation from monocular images. The model adopts a discriminatively-trained Markov random field (MRF) with multi-scale local and global image features and models both the depth at each point and the depth relations between different points. With the wide application of CNNs in computer vision, researchers began to apply deep learning to monocular depth estimation. Eigen et al. [9] were the first to address monocular depth estimation with a CNN architecture, employing two network models: one makes a coarse global prediction for the entire image, and the other refines this prediction locally. The loss function adopts a scale-invariant error rather than a scale-dependent error to optimize the model. Subsequently, the authors improved the network model and proposed a new multi-scale CNN architecture [20] with a fully-convolutional up-sampling network [17], which can complete three visual tasks: depth prediction, surface normal estimation, and semantic labeling. Laina et al. [21] proposed a fully-convolutional residual network model that estimates depth by learning the mapping between a single image and the corresponding ground truth depth data. The network architecture adopts a novel up-sampling block called up-projection to improve the output resolution and introduces the reverse Huber loss to optimize the model. Liu et al. [22] proposed a deep learning model based on a deep CNN and conditional random fields (CRF) for estimating monocular depth. On this basis, the authors further proposed an equally effective model based on an FCN and a new superpixel pooling method to accelerate the patch-wise convolutions in the estimation model. Ummenhofer et al. [23] trained an architecture of multiple stacked encoder-decoder networks to achieve end-to-end depth estimation and camera motion estimation from a pair of successive unconstrained monocular images. Due to the sparsity of ground truth depth data acquired by LiDAR, supervised learning cannot capture highly detailed depth variations. Therefore, Kuznietsov et al. [14] proposed a semi-supervised depth estimation model, which performs self-supervised learning on a dense correspondence field and uses sparse LiDAR depth data for further supervised learning. To impose the geometric constraint explicitly, Luo et al. [11] were the first to decompose the monocular depth estimation problem into two sub-problems: a self-supervised view synthesis process and a supervised stereo matching process. The synthetic right view generated by the view synthesis network is treated as part of the input data of the stereo matching network to predict scene depth. Wu et al. [24] proposed a novel monocular depth estimation method that uses the real-world size of an object as a sparse supervision label to train a deep network that produces a coarse depth map, and then refines the depth map by energy function optimization on a conditional random field.
The above methods achieve good depth estimation results, but they require vast quantities of corresponding ground truth depth data or other depth annotations, which are difficult and expensive to obtain in practical applications. To overcome this problem, researchers have begun to focus on monocular depth estimation without ground truth depth data. Xie et al. [13] proposed a self-supervised deep neural network model to convert 2D video to 3D video. During training, this network learns a warp function from stereo pairs extracted from existing 3D films, which serve as supervision labels; at test time, the function directly reconstructs the right view from a single left view. The network predicts a probabilistic disparity-like map and then combines it with the left view to reconstruct the right view. Garg et al. [25] proposed an unsupervised deep model based on epipolar geometry that implements end-to-end monocular depth estimation by optimizing the image alignment loss between the original right view and the reconstructed right view. The paper adopted a Taylor expansion to linearize the loss function, which is not fully differentiable. Godard et al. [12] improved this model by adding a left-right consistency constraint and adopted the locally fully-differentiable image sampler from the spatial transformer network (STN) [26] to achieve more accurate depth estimation.
Li et al. [27] proposed a novel unsupervised monocular visual odometry system (UnDeepVO) based on deep learning that can estimate the 6-DoF pose of the camera and the depth of a single view. During training, this paper added additional spatial and temporal dense constraints to the loss function. Zhan et al. [28] used stereo sequences as training data to train a novel deeper network for estimating depth and visual odometry. The loss function took into account both spatial and temporal photometric warp errors, which improved the estimation accuracy compared with a simple photometric warp loss alone. During testing, the model was able to estimate depth and two-view odometry from single-view sequences. The above two models adopted binocular video stereo pairs and monocular video temporal pairs as training data in the training stage. However, many works use only adjacent temporal pairs of monocular video as training data. Unlike the two cameras of a binocular rig, which are synchronized and rigidly coupled, the relative pose of a monocular camera varies from moment to moment. Therefore, estimating the camera pose and the depth map from monocular video sequences alone is a challenging task for a deep network model. Zhou et al. [29] used only monocular video sequences to train an unsupervised learning framework for depth estimation and camera pose prediction. The loss of the model was the alignment error between a video frame warped to a nearby frame and the original frame. The design ideas of [30][31][32][33] were similar to that of Zhou et al. [29], except that they decomposed scene motion into rigid and non-rigid components and processed them separately. Some approaches added novel constraints to the loss function during training to improve the estimation accuracy. Yang et al. [34] improved the estimation accuracy by adding a consistency loss between the output depth and the predicted surface normals. Mahjourian et al. [35] considered a novel 3D loss between the estimated 3D point clouds and ego-motion across consecutive frames. Wang et al. [36] found that the depth smoothness term may make the model unstable during training, so they used the normalized estimated depth maps to calculate the smoothness terms.
Although researchers have made various efforts to improve monocular depth estimation, its application is still limited by its accuracy. This paper proposes a novel semi-supervised monocular stereo matching method to improve the estimation accuracy. To be able to estimate depth from a static image, we use only single frames of stereo pairs as training data without considering temporal information. The method was inspired by the experimental results of Godard et al. [12], which show that the concatenated left and right views as input outperform the monocular view as input under the same self-supervised depth estimation model, as shown in Table 1. Therefore, we adopted a step-by-step design similar to that of Luo et al. [11], which converts monocular depth estimation into a self-supervised view synthesis process followed by a semi-supervised stereo matching process. However, unlike the method of [11], which uses ground truth depth data during stereo matching training, our method does not use ground truth depth at any stage of training. For the view synthesis method, inspired by Xie et al. [13], we improved the view synthesis method by adding a left-right consistency constraint and a smoothness constraint. For the stereo matching method, closest to [14] in spirit, we also made use of self-supervised and supervised cues to train the stereo matching network, but used disparity maps generated by the self-supervised stereo matching model as supervision labels. The implementation process of the semi-supervised monocular stereo matching method is illustrated in Figure 1.

Method
This section describes the semi-supervised monocular stereo matching method, including the principles of the method, the network architecture, and the loss function. The principles of the method explain how we decomposed the monocular depth estimation problem into two sub-problems. The network architecture and loss function are then described separately for the view synthesis network and the stereo matching network.

Analysis of Depth Estimation Method
The pixel values of a depth map reflect the distance between the objects in the scene and the camera. In monocular depth estimation, given a 2D image I, we use a function f to predict the depth z corresponding to each pixel in the image. The process can be described as z = f(I). Current monocular depth estimation methods based on supervised learning use a single RGB image as the input and the ground truth depth data as the label to train a neural network that represents the depth mapping function f. However, this approach requires expensive high-quality ground truth depth data corresponding to the input image as training labels. An alternative to supervised depth estimation is the self-supervised method, which uses synchronized stereo pairs to supervise the views reconstructed by the deep network and thus optimizes the estimation model without ground truth depth data. The key to this approach is that depth estimation is treated as an image reconstruction problem during training. That is, the network model is trained to obtain a warping function F that maps the left view of a stereo pair onto the right view, I_r = F(I_l). At test time, the disparity estimate for a single left image is obtained as an intermediate product of the model, as shown in the second half of Figure 2. The depth value can then be computed from the geometric relation between depth and disparity with known focal length f and camera baseline b, namely d = f × b/z. However, using this self-supervised method to estimate depth directly from a single image is ill-posed and geometrically ambiguous.
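The image reconstruction view of self-supervised training, I_r = F(I_l), can be illustrated with a small sketch in Python with NumPy. It assumes grayscale images, a rectified pair, the sign convention that the right view is sampled from the left view at x + d, and a plain L1 alignment loss; the function names are hypothetical and the paper's actual loss may contain further terms.

```python
import numpy as np

def warp_left_to_right(left, disp_right):
    """Reconstruct the right view as I_r_hat(y, x) = I_l(y, x + d_r(y, x)).

    Uses linear interpolation along each row; sample positions are clamped
    to the image border. The sign convention is an assumption of this sketch.
    """
    h, w = disp_right.shape
    xs = np.arange(w, dtype=np.float64)[None, :] + disp_right
    xs = np.clip(xs, 0.0, w - 1.0)
    x0 = np.floor(xs).astype(int)          # left neighbour column
    x1 = np.minimum(x0 + 1, w - 1)         # right neighbour column
    frac = xs - x0                         # interpolation weight
    rows = np.arange(h)[:, None]
    return (1.0 - frac) * left[rows, x0] + frac * left[rows, x1]

def alignment_loss(right, right_hat):
    """Mean absolute photometric difference between original and reconstruction."""
    return float(np.mean(np.abs(right - right_hat)))

# Toy example: intensity equals the column index, constant disparity of 2 px.
left = np.tile(np.arange(8.0), (4, 1))
right_hat = warp_left_to_right(left, np.full((4, 8), 2.0))
```

Because the sampling is linear in the disparity between neighbouring pixels, the loss is differentiable with respect to the predicted disparity almost everywhere, which is what allows the reconstruction error to train the network.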
Global Method Analysis: To address the shortcomings of the existing methods, we propose a novel semi-supervised monocular stereo matching method that improves on existing approaches [11,12]. As can be seen from Table 1, under the same self-supervised model, the estimation result with the concatenated left and right views as the model's input is much better than that with the monocular view in terms of the depth estimation metrics. Therefore, inspired by these results, we divided the monocular depth estimation problem into two sub-problems: a self-supervised view synthesis procedure followed by a semi-supervised stereo matching procedure. The overall solution is shown in Figure 2. To make both processing procedures obey geometric principles, we trained the two networks separately. At test time, the right view reconstructed by the view synthesis network was passed to the stereo matching network through a pipeline to predict the image depth. The pipeline was an automated mechanism that enabled data communication between the two models: the right view reconstructed by the view synthesis network was stored in the pipeline queue, and the stereo matching network read data from the queue for depth prediction. The whole process was completely automated. The accuracy of depth estimation largely depends on the quality of the reconstructed right view, so the view synthesis model plays an important role in the whole method.
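The queue-based hand-off between the two models can be sketched as a producer and a consumer. The Python sketch below uses the standard queue and threading modules; both model functions are placeholders introduced for illustration (they stand in for the trained networks and do no real inference), and the queue size is an arbitrary assumption.

```python
import queue
import threading
import numpy as np

def synthesize_right_view(left):
    # Placeholder for the view synthesis network (assumption): shift columns.
    return np.roll(left, -1, axis=1)

def match_stereo(left, right_hat):
    # Placeholder for the stereo matching network (assumption): zero disparity.
    return np.zeros(left.shape[:2], dtype=np.float32)

def run_pipeline(left_images, maxsize=4):
    """Pass reconstructed right views from the first model to the second via a queue."""
    q = queue.Queue(maxsize=maxsize)
    disparities = []

    def producer():
        for left in left_images:
            q.put((left, synthesize_right_view(left)))
        q.put(None)  # sentinel: no more data

    def consumer():
        while True:
            item = q.get()
            if item is None:
                break
            left, right_hat = item
            disparities.append(match_stereo(left, right_hat))

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return disparities

lefts = [np.full((2, 3), i, dtype=np.float32) for i in range(3)]
results = run_pipeline(lefts)
```

With a single producer and a single consumer, the FIFO queue preserves the order of the input images, so each disparity map can be matched back to its source frame by position.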
View Synthesis Method Analysis: Existing warp-based view synthesis approaches [12,25] generally require an accurate disparity prediction of the underlying geometry. Deep3D [13] adopts a different view synthesis idea: it estimates a probabilistic disparity-like map, which is used by the selection layer to reconstruct the right view in a differentiable way. To enhance the synthesis accuracy, inspired by the method in [12], we improved the Deep3D model by adding a left-right consistency constraint and a smoothness constraint. We trained the network to estimate the probabilistic disparity-like maps for the left and right views from the left view only, and then synthesized the left and right views with the selection layer, which computes the sum of the products between the disparity maps and the opposite image of the stereo pair. The model was optimized with an alignment loss between the original stereo pairs and the reconstructed stereo pairs. To impose spatial smoothness, we added a regularization term to the loss function. At test time, the model requires only the left view to generate the right view directly.
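A selection layer of the kind described above can be sketched as a probability-weighted sum of horizontally shifted copies of the source view. The Python/NumPy sketch below assumes a grayscale image, integer candidate disparities, edge replication at the border, and the convention that candidate d samples the source at column x + d; these choices and the function names are illustrative and may differ from Deep3D and from the paper.

```python
import numpy as np

def softmax(logits, axis=0):
    """Numerically stable softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def shift_columns(image, d):
    """Sample the image at column x + d, replicating the border."""
    w = image.shape[1]
    cols = np.clip(np.arange(w) + d, 0, w - 1)
    return image[:, cols]

def selection_layer(source, logits, disparities):
    """Reconstruct the opposite view as sum_d P_d(y, x) * source(y, x + d).

    logits has shape (D, H, W), one channel per candidate disparity.
    """
    probs = softmax(logits, axis=0)
    out = np.zeros(source.shape, dtype=np.float64)
    for p, d in zip(probs, disparities):
        out += p * shift_columns(source, d)
    return out

source = np.tile(np.arange(6.0), (3, 1))
candidates = [0, 1, 2]
peaked = np.zeros((3, 3, 6))
peaked[2] = 50.0  # almost all probability mass on disparity 2
recon = selection_layer(source, peaked, candidates)
```

Since the output is a smooth function of the per-pixel probabilities, gradients of the alignment loss flow to every candidate disparity, which avoids the non-differentiable argmax of a hard disparity choice.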
Figure 2. The implementation procedure of the semi-supervised monocular stereo matching method. This model consists of two parts, namely the right view synthesis network and the stereo matching network. The left view is first processed by the baseline network of the view synthesis network to generate probabilistic left and right disparity maps, which are input into the selection layer to reconstruct the right view. Then, the concatenation of the synthetic right view and the original left view is input into the stereo matching network and is processed by the encoder and decoder to estimate an accurate disparity.
Stereo Matching Method Analysis: The right view reconstructed by the view synthesis network inevitably contains some errors compared with the original right view. If we directly train the stereo matching network on the reconstructed right view and the original left view without using ground truth depth, the depth estimation accuracy is lower than that obtained from the original stereo pairs, as shown in Tables 2 and 3. Therefore, the depth estimated by the self-supervised stereo matching model from the original stereo pairs can act as a supervision signal for training our stereo matching model. Inspired by semi-supervised depth estimation [14], we propose a semi-supervised stereo matching approach that takes advantage of supervised and self-supervised training cues to optimize our model jointly. Instead of using ground truth depth data, we adopted the estimation results of the self-supervised stereo matching model as auxiliary supervision cues and trained our model with them jointly with the self-supervised cues. However, the depth values estimated by the self-supervised stereo matching model contain some errors compared with the ground truth depth data, so we introduced a weight on the supervision loss to adjust its proportion of the whole loss. We discuss how to set this parameter in Section 4.5.2.
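The test-time pipeline described in the global method analysis can be illustrated with a small producer-consumer sketch. This is only a minimal illustration under stated assumptions: `synthesize_right` and `estimate_disparity` are hypothetical callables standing in for the two trained networks, and a standard in-process queue stands in for whatever queue mechanism was actually used.

```python
import queue
import threading


def run_pipeline(left_views, synthesize_right, estimate_disparity):
    """Chain view synthesis and stereo matching through a FIFO queue.

    synthesize_right(left) -> right and estimate_disparity(left, right) -> disparity
    are hypothetical stand-ins for the two trained networks.
    """
    pairs = queue.Queue()
    results = []

    def producer():
        # Stage 1: reconstruct a right view for every input left view.
        for left in left_views:
            pairs.put((left, synthesize_right(left)))
        pairs.put(None)  # sentinel: no more data

    def consumer():
        # Stage 2: read (left, synthesized right) pairs and predict disparity.
        while True:
            item = pairs.get()
            if item is None:
                break
            left, right = item
            results.append(estimate_disparity(left, right))

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because a single consumer reads from a FIFO queue, the outputs come back in the same order as the input left views.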

View Synthesis Network
In this section, the network architecture and loss function of the view synthesis network are described in detail. For the network architecture, we improved the Deep3D model to estimate left-right disparity maps and reconstructed the left-right view by sampling from the opposite view. The loss function was composed of the image alignment constraint and smoothing constraint.

Network Architecture
We built an encoder-decoder view synthesis network based on the Deep3D model [11], in which the encoder used the VGG16 [38] architecture and the decoder adopted a de-convolution network to enable end-to-end training. Our model adds a left probabilistic disparity map branch that reconstructs the left view, which makes it possible to impose the left-right consistency constraint. The network architecture is shown in Figure 3. First, the input left views are processed by the baseline network into feature maps at different resolutions. To bring features from different levels into the final representation, the model adds a branch after each pooling layer that up-samples the features into left-right disparity predictions using a convolutional layer followed by de-convolution layers. After the de-convolution layers, the feature maps, which now share the same resolution, are each split into left and right feature representations along the channel dimension. Then, we sum the predicted left feature maps and the predicted right feature maps separately and feed each sum into a softmax layer to output the probabilistic left and right disparity-like maps. These maps are fed into the selection layer, which multiplies the left or right disparity-like map by the opposite view of the stereo pair at each corresponding pixel position to output the reconstructed views.

Loss Function
The reconstruction procedure based on geometric correspondence can be denoted as Equation (1), where (i, j) ∈ Ω, Ω is the image space of I, and i, j refer to the horizontal and vertical coordinates of the pixel position in the image.
However, Equation (1) is not differentiable with respect to D; thus, it cannot be used to optimize the deep neural network model. To overcome this problem, the softmax layer estimates probabilistic disparity distributions $D^l_d(i, j)$ and $D^r_d(i, j)$ across the channels, with one channel per disparity value d, at each pixel location. We denote this process as Equation (2), which defines the probability of each disparity value for the left or right view. This operation makes the whole system differentiable with respect to the probabilistic disparity maps. During training, the alignment loss between the reconstructed stereo pair and the original stereo pair was used as the primary constraint to optimize the model. We adopted a simple $L_1$ loss as the appearance alignment loss.
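The differentiable reconstruction described above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: it assumes a single-channel image, integer candidate disparities, edge replication at the image border, and the sign convention that candidate d reads the source pixel at column j + d. The function names are hypothetical.

```python
import numpy as np


def softmax(logits, axis=0):
    """Numerically stable softmax along the disparity-channel axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def selection_layer(source, logits, disparities):
    """Reconstruct a view as a probability-weighted sum of shifted source views.

    source:      (H, W) image of the opposite view
    logits:      (K, H, W) unnormalized scores, one channel per candidate disparity
    disparities: K integer shifts (assumed convention: out[i, j] reads source[i, j + d])
    """
    height, width = source.shape
    probs = softmax(logits, axis=0)  # per-pixel distribution over disparities
    cols = np.arange(width)
    out = np.zeros((height, width), dtype=np.float64)
    for k, d in enumerate(disparities):
        # Shift the source horizontally by d, replicating the border columns.
        shifted = source[:, np.clip(cols + d, 0, width - 1)]
        out += probs[k] * shifted
    return out
```

Every operation here (softmax, multiplication, summation) is differentiable with respect to the logits, which is what allows the alignment loss to be back-propagated to the disparity prediction.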
Without ground truth disparity, monocular depth estimation is an ill-posed problem in homogeneous regions of the scene, and depth discontinuities often occur at image gradients. Thus, as suggested by [14], we added an edge-preserving regularizer to the loss function, which smooths the estimated depth maps using the image gradient ∂I, with n ∈ {l, r}.
Here, we used x as the direction of the gradient and X as the position space of the image. We defined the total loss $L_{vs}$ as the sum of the loss functions to optimize our view synthesis model.
where N is the total number of pixels in an image and λ is the weight of the edge-preserving regularizer.
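As a rough illustration of how the alignment term and the edge-preserving regularizer combine, the following NumPy sketch computes a total view synthesis loss. It is a sketch under assumptions, not the authors' code: images are single-channel arrays, gradients are forward differences, the disparity passed to the regularizer is a single-channel map (for example an expected disparity), and the default λ = 0.01 is the value reported in the implementation details.

```python
import numpy as np


def edge_aware_smoothness(disp, image):
    """Sum of disparity gradients, down-weighted where the image has strong edges."""
    disp_dx = np.abs(disp[:, 1:] - disp[:, :-1])
    disp_dy = np.abs(disp[1:, :] - disp[:-1, :])
    img_dx = np.abs(image[:, 1:] - image[:, :-1])
    img_dy = np.abs(image[1:, :] - image[:-1, :])
    return (disp_dx * np.exp(-img_dx)).sum() + (disp_dy * np.exp(-img_dy)).sum()


def view_synthesis_loss(views, recons, disps, lam=0.01):
    """L1 alignment plus weighted smoothness, summed over n in {l, r}, divided by N.

    views, recons, disps: dicts with keys 'l' and 'r' holding (H, W) arrays.
    """
    n_pixels = views["l"].size
    total = 0.0
    for n in ("l", "r"):
        alignment = np.abs(views[n] - recons[n]).sum()
        total += alignment + lam * edge_aware_smoothness(disps[n], views[n])
    return total / n_pixels
```

The exponential weight means a disparity jump that coincides with a strong image edge is penalized far less than the same jump inside a homogeneous region.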

Stereo Matching Network
This section describes the network architecture and loss function of the stereo matching network in detail. The network architecture adopted an encoder-decoder scheme to process the RGB image into the depth map. The loss function incorporated the supervised cost and self-supervised cost to optimize the stereo matching model.

Network Architecture
The model architecture for the stereo matching network is shown in Table 4; it is similar to the self-supervised stereo matching network of [12]. We describe the network only briefly here; please refer to the original article [12] for the detailed network model. The network used an improved VGG16 architecture without the fully-connected layer as the encoder, and the decoder adopted nearest neighbor upsampling to estimate disparity at different scales. To reduce the number of parameters and extract deep features, we used multiple 3 × 3 convolutional layers to replace the 5 × 5 and 7 × 7 convolutional layers in the higher-resolution layers of the network. Instead of pooling layers, the model adopted convolution layers with stride two and employed more convolutions at the beginning of the network to perform feature extraction. The network introduced skip connections between each encoder activation block and the up-sampling layer of the decoder with the same resolution to recover higher-resolution details. We use convx_k_s (x ∈ {1, 2, ..., 20}) to denote the x-th convolution layer with filter size k × k and stride s. convx (x ∈ {1, 2, ..., 7}) denotes the x-th scale upsampling block, which consists of a nearest neighbor sampling layer and a convolution layer with filter size 3 × 3 and stride one.

Table 4. Stereo matching network encoder-decoder architecture. The encoder network is the improved VGG model without a fully-connected layer. The decoder network adopts skip connections, similar to the fully convolutional network [17], and upsamples multi-scale disparity using nearest neighbor interpolation. convx_k_s (x ∈ {1, 2, ..., 20}) denotes the x-th convolution layer with filter size k × k and stride s. convx (x ∈ {1, 2, ..., 7}) is the x-th scale upsampling block, which includes a nearest neighbor sampling function and a convolution with filter size 3 × 3 and stride 1.

Loss Function

Supervised Loss
Instead of using the ground truth depth data, we used the disparity maps predicted by the stereo matching network trained on the original stereo pairs as the supervised labels. The supervision error was the sum over all pixels of the absolute error between the predicted disparity maps and the supervised disparity maps.
where n ∈ {l, r} denotes the left or right view, Ω is the image space, and i, j refer to the horizontal and vertical coordinates of the pixel position in the image. $d^n(x)$ is the estimated disparity at pixel x, and $\hat{d}^n(x)$ is the corresponding supervision disparity. Inspired by [21], we adopted the reverse Huber (berHu) norm as the supervised loss. The berHu formula is shown in Equation (7), together with the setting of its threshold c.
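For reference, the reverse Huber norm can be sketched as below. The piecewise definition is the standard berHu form; the threshold rule c = 0.2 · max|residual| is the choice commonly paired with berHu in the depth estimation literature and is an assumption here, since the value of c is not reproduced in this text. The function names are illustrative.

```python
import numpy as np


def berhu(residual, c):
    """Reverse Huber: L1 for |r| <= c, scaled L2 for |r| > c (continuous at |r| = c)."""
    abs_r = np.abs(residual)
    return np.where(abs_r <= c, abs_r, (abs_r ** 2 + c ** 2) / (2.0 * c))


def supervised_loss(pred_disp, label_disp, frac=0.2):
    """Sum of berHu penalties between predicted and supervision disparities.

    frac is an assumed hyper-parameter: c = frac * max |pred - label|.
    """
    residual = pred_disp - label_disp
    c = frac * np.abs(residual).max()
    if c == 0:
        return 0.0  # prediction equals the labels everywhere
    return float(berhu(residual, c).sum())
```

Compared with a plain L1 loss, berHu keeps the L1 behaviour for small residuals and penalizes large residuals quadratically.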

Self-Supervised Loss
Image Alignment Loss: The main idea of self-supervised depth estimation is to treat the depth estimation problem as an image reconstruction problem during training. Therefore, the image alignment loss, which compares the pixels of the reconstructed image $\hat{I}$ with the pixel values at the same positions in the original image $I$, is the most important learning criterion. The network generates an image with the bilinear sampler from the spatial transformer network (STN) [26], which is locally fully differentiable and integrates seamlessly into the model. Therefore, our model does not need any randomization [13] or approximation [25] of the loss function. Inspired by [12], we adopted a combination of the $L_1$ loss and the SSIM [39] loss as the image reconstruction cost $L_{ia}$ of our model.
where:
$$L_{SSIM}\left(I^n(x), \hat{I}^n(x)\right) = 1 - \mathrm{SSIM}\left(I^n(x), \hat{I}^n(x)\right),$$
$$L_{l_1}\left(I^n(x), \hat{I}^n(x)\right) = \left| I^n(x) - \hat{I}^n(x) \right|.$$

Left-Right Disparity Consistency Loss: Taking inspiration from [12], we also added the left-right disparity consistency loss to estimate more accurate disparity maps. According to the principle of epipolar geometry, the right disparity map can be calculated from the left disparity map and vice versa. To ensure left-right disparity coherence, we used the $L_1$ loss to calculate the disparity alignment cost between the predicted disparity map and the calculated disparity map.
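The left-right disparity consistency idea can be sketched for the left view as follows. This is a simplified illustration under assumptions: disparities are positive and expressed in pixels, a left pixel at column j corresponds to the right pixel at column j − d, and nearest-neighbor sampling replaces the differentiable bilinear sampler that a trainable implementation would need. The function name is hypothetical.

```python
import numpy as np


def lr_consistency_left(disp_left, disp_right):
    """Mean L1 gap between the left disparity and the right disparity projected to the left.

    disp_left, disp_right: (H, W) disparity maps in pixels.
    """
    height, width = disp_left.shape
    rows = np.arange(height)[:, None]
    cols = np.arange(width)[None, :]
    # Column in the right view that each left pixel maps to (rounded, clamped to the image).
    target = np.clip(np.rint(cols - disp_left).astype(int), 0, width - 1)
    projected = disp_right[rows, target]
    return float(np.abs(disp_left - projected).mean())
```

The right-view term is obtained symmetrically by sampling the left disparity map at column j + d.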
Regularization loss: Just as in the loss function of the view synthesis network, we also introduced a regularization loss to smooth the estimated depth maps.
$$L_{rl} = \sum_{n \in \{l, r\}} \sum_{X \in \Omega} \left| \partial_x d^n(X) \right| e^{-\left| \partial_x I^n(X) \right|} + \left| \partial_y d^n(X) \right| e^{-\left| \partial_y I^n(X) \right|}.$$
Here, we used x as the direction of the gradient and X as the position space of the image.
We defined a total loss function $L_\theta$ that combines the supervised loss and the self-supervised loss. The self-supervised loss is composed of the image alignment loss, the left-right disparity consistency loss, and the disparity smoothness loss.
where N is the number of image pixels. β and γ are the weights of the supervised loss and the regularization loss, respectively, which adjust the proportion of each cost in the model loss function. By tuning β and γ, the model can be optimized more precisely. Each cost term contains two costs, one for the left image and one for the right image output by the model.
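The total loss equation itself is not reproduced in this text. One form consistent with the description above is sketched below; the symbols $L_{sl}$ (supervised berHu loss) and $L_{lr}$ (left-right disparity consistency loss) are names assumed here for illustration, and the unit weights on $L_{ia}$ and $L_{lr}$ are likewise an assumption.

```latex
L_\theta = \frac{1}{N}\left( \beta\, L_{sl} + L_{ia} + L_{lr} + \gamma\, L_{rl} \right)
```

In this form, setting β = 0 would recover a purely self-supervised objective, while increasing β gives more influence to the supervision labels.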

Experiments
This section introduces our experimental process in detail. To verify the adaptability of our model in complex scenarios, we conducted training and testing on KITTI [15], an autonomous driving dataset. We demonstrate the advantages of our method by comparing its performance with that of state-of-the-art monocular depth estimation methods on this popular dataset.

Evaluation Metrics
Evaluation metrics indicate the error and performance on the proposed prediction model. In this section, two different quantitative evaluation criteria are used for the view synthesis network and stereo matching network, respectively. Each network also adopted multiple evaluation metrics as follows.

Reconstruction Metrics of the View Synthesis Model
The quality of the synthesized right view is very important for the depth estimation accuracy of the stereo matching network, so we defined evaluation metrics for the output of the view synthesis network. We computed the peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM) [39] between the synthesized right view and the original right view.
where N is the number of pixels, which is the product of the image width and height. $I^r_i$ is the i-th pixel of the original right view, and $\hat{I}^r_i$ is the i-th pixel of the synthesized right view. MSE is the mean squared error, and b is the number of bits per pixel, generally 8.
where $\mu_I$ is the average of $I$, $\mu_{\hat{I}}$ is the average of $\hat{I}$, $\sigma^2_I$ is the variance of $I$, $\sigma^2_{\hat{I}}$ is the variance of $\hat{I}$, and $\sigma_{I\hat{I}}$ is the covariance of $I$ and $\hat{I}$. $C_1 = (k_1 L)^2$ and $C_2 = (k_2 L)^2$ are constants used to maintain stability, $L$ is the dynamic range of the pixel values, and $k_1 = 0.01$, $k_2 = 0.03$.
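The two reconstruction metrics can be computed along the following lines. This is a minimal sketch: PSNR follows the usual definition with peak value 2^b − 1, and the SSIM shown here is evaluated once over the whole image as a single window, which is a simplifying assumption (SSIM is commonly averaged over local windows). The function names are illustrative.

```python
import numpy as np


def psnr(original, synthesized, bits=8):
    """Peak signal-to-noise ratio in dB for images with `bits` bits per pixel."""
    diff = original.astype(np.float64) - synthesized.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    peak = 2 ** bits - 1
    return float(10.0 * np.log10(peak ** 2 / mse))


def global_ssim(original, synthesized, dynamic_range=255.0, k1=0.01, k2=0.03):
    """SSIM computed from whole-image statistics (single-window simplification)."""
    x = original.astype(np.float64)
    y = synthesized.astype(np.float64)
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    c1 = (k1 * dynamic_range) ** 2
    c2 = (k2 * dynamic_range) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return float(numerator / denominator)
```

Both metrics increase as the synthesized view approaches the original; identical images give an infinite PSNR and an SSIM of 1.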

Evaluation Metrics of the Stereo Matching Model
We used the following quantities as the evaluation metrics of the model; they measure the error and accuracy of our depth estimates against the ground truth depth data. The metrics are the same as those used by Eigen et al. [9]. In Equation (16), N is the number of pixels of the ground truth depth map $Z_{gt}$ and of the estimated depth map $Z$.
To compare our method with the current state-of-the-art self-supervised and semi-supervised monocular depth estimation methods, we cropped our images to match the resolution used by these models. Because these methods cap the evaluated depth to different ranges (Eigen et al. [9] and Godard et al. [12,32] to 0-80 m and Garg et al. [25] to 1-50 m), we provide comparative results for both depth ranges. If an estimated depth value fell outside the depth range, we set it to the lowest or highest value of the range.
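The evaluation procedure with a depth cap can be sketched as follows. The error measures are the standard ones from Eigen et al.; the key names reuse the abbreviations that appear in the results discussion (ARD, SRD, RMSE, RMSE(log)). The validity mask (ground truth greater than zero and inside the cap) and the small positive lower bound used to avoid division by zero are assumptions of this sketch.

```python
import numpy as np


def depth_metrics(pred, gt, min_depth, max_depth):
    """Eigen-style depth errors with predictions clamped to [min_depth, max_depth]."""
    valid = (gt > 0) & (gt >= min_depth) & (gt <= max_depth)  # sparse ground truth
    lower = max(min_depth, 1e-3)  # assumed floor so that ratios stay finite
    z = np.clip(pred[valid], lower, max_depth)
    z_gt = gt[valid]

    ratio = np.maximum(z / z_gt, z_gt / z)
    return {
        "ARD": float(np.mean(np.abs(z - z_gt) / z_gt)),
        "SRD": float(np.mean((z - z_gt) ** 2 / z_gt)),
        "RMSE": float(np.sqrt(np.mean((z - z_gt) ** 2))),
        "RMSE(log)": float(np.sqrt(np.mean((np.log(z) - np.log(z_gt)) ** 2))),
        "delta<1.25": float(np.mean(ratio < 1.25)),
        "delta<1.25^2": float(np.mean(ratio < 1.25 ** 2)),
        "delta<1.25^3": float(np.mean(ratio < 1.25 ** 3)),
    }
```

Calling `depth_metrics(pred, gt, 0.0, 80.0)` and `depth_metrics(pred, gt, 1.0, 50.0)` would correspond to the two caps used in the comparison.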

Dataset
The KITTI dataset is the most widely used image dataset in the field of autonomous driving. The dataset [40] records six hours of traffic scenarios with a series of sensors, including high-resolution color and grayscale stereo cameras, a 3D laser scanner, and a high-precision GPS/IMU inertial navigation system. The scenarios were captured by driving in the inner city of Karlsruhe, at high speed, and in rural areas, with many static and dynamic objects. The dataset is calibrated, synchronized, and timestamped, and its authors provide both the rectified and the raw image sequences.
We evaluated our method with rectified stereo pairs from 61 scenarios of the KITTI dataset, which cover the categories "city", "residential", and "road". To allow a direct comparison between our method and other methods, our experiment followed the data allocation scheme proposed by Eigen et al. [9]. We randomly selected 28 scenes from the 61 scenarios and then randomly selected 697 images from them as the test data. The remaining 33 scenes contained a total of 30,159 images, of which 29,000 images were used as training data and the rest as validation data. Because our goal is to estimate depth from static photos, we trained our model on single frames without using any time series information.

Implementation Details
We implemented our model in TensorFlow [41] on an experimental platform with an E5-2620v4 CPU, 32 GB RAM, and two NVIDIA GTX 1080 Ti GPUs with 11 GB of memory each. Since the stereo matching network needs a high-quality input formed by concatenating the original left view and the synthesized right view, we trained the two networks separately to obtain a better reconstructed right view.
View Synthesis: The network adopted VGG16 as the encoder, whose weights were initialized from the model pre-trained on ImageNet [42]. All other weights of the network were initialized from a Gaussian distribution with a standard deviation of 0.01. As in [11], we set the input image resolution to 640 × 192 to make the model more suitable for the KITTI dataset. Other settings also followed that model. To obtain the left-right feature representation, we set the number of feature maps after de-convolution to 256, of which the first 128 channels were the left upsampled disparity predictions and the last 128 channels were the right upsampled disparity predictions. Then, a 65-channel probabilistic disparity-like map was predicted by the softmax layer for each of the left and right views. We set the batch size to 2 and the number of training samples to 29,000, which resulted in 14,500 iterations per epoch. To train the model adequately, the number of epochs was set to 20. The initial learning rate was set to 0.002 and kept constant for the first 10 epochs; we then reduced it by a factor of 2 every 5 epochs until the end. The weighting coefficient of the smoothness regularizer was set to 0.01.
Stereo Matching: Our stereo matching network followed the model architecture of [12], which adopted the improved VGG16 model without the fully-connected layer as the encoder and used nearest neighbor upsampling followed by a convolution as the decoder. For the parts shared with the network of [12], we initialized the weights from the trained model; all other weights of our network were initialized from a Gaussian distribution with a standard deviation of 0.01. γ is the weight of the regularization loss term, and we followed [12] in setting its value. Because the network outputs disparity maps at multiple scales, each upsampled from the previous one by a factor of two, neighboring pixels cover different distances at each scale. To correct for this, we set the weighting coefficient of the smoothness regularizer to γ = 0.1/r, where r ∈ {1, 2, 4, 8} is the scaling factor associated with each disparity output layer. Because the supervision labels contain some errors relative to the ground truth disparity, which could reduce the accuracy of the model, we experimented with several values for the weight coefficient of the supervision loss rather than fixing it in advance. The experimental results for the different parameter values are shown in Table 2. We set the batch size to 8 and the number of epochs to 50. We reduced the initial learning rate to 10^-5 to avoid large oscillations caused by the errors of the reconstructed right view and used a decaying schedule: the learning rate was kept constant for the first 30 epochs and then reduced by a factor of 2 every 10 epochs until the end. Although we used the concatenation of the original left view and the synthesized right view as the input of the stereo matching model at test time, we used different stereo pair data to train the model. The training process is discussed in Section 4.5.2.
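The step-wise learning rate schedules and the per-scale smoothness weights described above can be written compactly. The sketch assumes zero-indexed epochs and that the first halving takes effect immediately after the constant phase; both are interpretations of the text rather than confirmed details, and the function names are illustrative.

```python
def learning_rate(epoch, base_lr, constant_epochs, step_epochs, factor=2.0):
    """Constant for `constant_epochs`, then divided by `factor` every `step_epochs`.

    Assumes epochs are counted from 0 and the first reduction happens at
    epoch == constant_epochs.
    """
    if epoch < constant_epochs:
        return base_lr
    n_drops = (epoch - constant_epochs) // step_epochs + 1
    return base_lr / factor ** n_drops


def smoothness_weights(scales=(1, 2, 4, 8), base=0.1):
    """Per-scale regularization weight gamma = base / r."""
    return [base / r for r in scales]


# Stereo matching: 1e-5 for epochs 0-29, then halved every 10 epochs.
stereo_schedule = [learning_rate(e, 1e-5, 30, 10) for e in range(50)]
# View synthesis: 0.002 for epochs 0-9, then halved every 5 epochs.
view_schedule = [learning_rate(e, 0.002, 10, 5) for e in range(20)]
```

Under these assumptions the stereo matching schedule ends at a quarter of its initial rate, and the coarsest disparity scale receives one eighth of the smoothness weight of the finest scale.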
We did not add batch normalization to either network because experiments showed that it did not play an important role in the results. We augmented the image data during data loading. We horizontally flipped each stereo pair with a probability of 0.5, swapping the two images at the same time so that both remained in the correct position relative to each other. We also adjusted the brightness, contrast, and color of the stereo pairs by applying changes to the pixels with factors sampled from uniform distributions in the range [0.8, 1.2] for each color channel, [0.8, 1.2] for gamma, and [0.5, 2.0] for brightness.
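A minimal sketch of this augmentation is given below. It assumes images are float arrays of shape (H, W, 3) with values in [0, 1] and that the same photometric factors are applied to both views of a pair; the order of the photometric operations and the final clipping are also assumptions, and the function names are illustrative.

```python
import numpy as np


def flip_and_swap(left, right):
    """Mirror both views horizontally and exchange their roles.

    The mirrored right image becomes the new left view (and vice versa), so the
    pair keeps a valid left/right geometry.
    """
    return right[:, ::-1].copy(), left[:, ::-1].copy()


def augment_pair(left, right, rng):
    """Random flip-and-swap plus gamma, brightness, and per-channel color shifts."""
    if rng.random() < 0.5:
        left, right = flip_and_swap(left, right)

    gamma = rng.uniform(0.8, 1.2)
    brightness = rng.uniform(0.5, 2.0)
    color = rng.uniform(0.8, 1.2, size=3)  # one factor per color channel

    def photometric(image):
        out = (image ** gamma) * brightness * color
        return np.clip(out, 0.0, 1.0)

    return photometric(left), photometric(right)
```

A typical call would be `augment_pair(left, right, np.random.default_rng(seed))`, applied to each pair as it is loaded.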
Post-processing: Stereo disocclusions can cause disparity ramps on the left side of the image and on the left side of occluders, so it is necessary to post-process the output disparity. Inspired by [12], we designed a post-processing method for test time: for an input image I, we also computed the disparity map d′ of the horizontally flipped image I′. Then, we flipped d′ horizontally to form the disparity map d″, which also had disparity ramps, but on the opposite side to those of the disparity map d. To form the final disparity map, we took the first 1% on the left from d″ and the last 1% on the right from d as the left and right edges, and we set the middle part to the average of the two disparity maps d and d″. Table 3 shows that the disparity map after post-processing had better accuracy and lower error.
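The blending step of this post-processing can be sketched as follows. The sketch assumes that the left edge is taken from the flipped-back disparity map and the right edge from the direct prediction, in the manner of [12]; `predict` is a hypothetical callable that maps an (H, W) or (H, W, C) image to an (H, W) disparity map.

```python
import numpy as np


def post_process(disp, disp_flipped_back, edge_fraction=0.01):
    """Blend the direct disparity with the one predicted from the mirrored image.

    disp:              (H, W) disparity predicted from the input image
    disp_flipped_back: (H, W) disparity predicted from the mirrored image,
                       mirrored back to the original orientation
    """
    width = disp.shape[1]
    n_edge = int(round(edge_fraction * width))
    out = 0.5 * (disp + disp_flipped_back)  # middle part: average of both maps
    out[:, :n_edge] = disp_flipped_back[:, :n_edge]    # left edge (assumed source)
    out[:, width - n_edge:] = disp[:, width - n_edge:]  # right edge (assumed source)
    return out


def predict_with_post_processing(predict, image, edge_fraction=0.01):
    """Run the hypothetical `predict` on the image and its mirror, then blend."""
    disp = predict(image)
    disp_mirror = predict(image[:, ::-1])
    return post_process(disp, disp_mirror[:, ::-1], edge_fraction)
```

This doubles the inference cost per image, since the network is evaluated once on the input and once on its mirror.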

Performance Analysis
We trained and tested our model on an experimental platform with an E5-2620v4 CPU, 32 GB RAM, and two NVIDIA GTX 1080 Ti GPUs with 11 GB of memory each. The view synthesis network contained about 371 million trainable parameters and took 58 hours to train on 29,000 single left views for 30 epochs. During training, the inference speed was more than 10.5 frames per second on a single GPU and more than 18.3 frames per second on two GPUs for a 640 × 192 resolution view. On a single GPU, this was lower than that of [11], which reached 10.6 frames per second with 370 million trainable parameters; on two GPUs, it was slightly higher than that model, by 0.2 frames per second. For the stereo matching network, we removed the fully-connected layer of VGG16, and the model had about 31 million trainable parameters. Training on 29,000 stereo pairs for 50 epochs took about 25 hours, and the inference speed reached up to 30 frames per second on two GPUs. Compared with the model of [14], which has 48 million trainable parameters, our model was faster by about 3 frames per second on the same experimental platform. For testing, the two networks were connected by the pipeline to automate depth estimation from the RGB image input to the depth map output. For 200 test samples, the method took 21.53 s from loading the model to outputting the disparity maps, of which loading the model took 6.25 s. Therefore, the estimated time for each frame was approximately 0.078 s. Compared with a similar method [11], which took about 0.076 s for a single image, our method had a comparable execution time.

Results
To evaluate the accuracy of the view synthesis network model, we compared it with the original Deep3D [11,13] model and the left-right consistency method [12] on the reconstruction metrics PSNR and SSIM. The larger these two metrics are, the closer the reconstructed right view is to the original view. Table 5 shows the quantitative comparison between the right view reconstructed by our method and those reconstructed by the current state-of-the-art view synthesis methods in terms of PSNR and SSIM. The original Deep3D method, which uses only the $L_1$ loss function to optimize the deep model, had the lowest PSNR and SSIM. Godard et al. [12] adopted a locally fully-differentiable upsampling mode and a left-right consistency constraint, which greatly improved the accuracy of reconstruction. As can be seen from Table 5, this method improved PSNR by 4.3 dB and SSIM by 0.049. We improved Deep3D [11] by adding the smoothness constraint and the left-right consistency constraint to synthesize the right view. Compared with the other methods, our model performed the best: it improved PSNR by 0.555 dB and SSIM by 0.011. Figure 4 shows the disparity, the reconstruction error, and the reconstructed right view of the three view synthesis models. To show the comparison more intuitively, we magnified the reconstruction error by a factor of 50. We can see from Figure 4 that our model revealed more detail in the disparity map and that its reconstruction error was lower than that of the original Deep3D method and the left-right consistency method.

Table 5. Comparison of different view synthesis network models. The last row is the evaluation index of our model, which outperformed the other strategies.

In Figure 4, ×50 refers to the reconstruction error multiplied by 50.

Comparison of Stereo Matching Model
Comparisons with depth estimation: Table 3 compares the depth estimated by our model with that of the current state-of-the-art monocular depth estimation methods on the test set of the KITTI benchmark. The first two rows in Table 3 show the depth results of the supervision labels obtained by the self-supervised stereo matching model with the original stereo pairs as input. Although the supervision labels were obtained by a self-supervised model, their depth results were even better than those of the supervised depth estimation method on most metrics. Table 3 shows that, under the same training conditions, our method outperformed the other methods on most metrics, and the results after post-processing were better still. For the depth evaluation cap of 80 m, compared with other methods under the same training conditions, our method before post-processing was only slightly inferior to that of Godard et al. [32] on the RMSE and SRD metrics, by about 0.016 and 0.131, respectively. On the other metrics, it was better than the other methods under the same training conditions. After post-processing, our method was better than every other method on all metrics under the same training conditions. In particular, for RMSE(log) and ARD, our method was better than the current state-of-the-art method by 0.019 and 0.008, respectively. When we set the cap of the predicted depth to 50 m, our method before post-processing was only slightly inferior to that of Garg et al. [25], by about 0.002 m on the RMSE metric, with the same training mode. After post-processing, the RMSE decreased by 0.017, making our method on average 0.001 m more accurate than that of Garg et al. [25]. Table 3 also reports the depth estimation results of the methods [28,32] that used additional video temporal sequences in the training stage.
Even though those methods used additional training data, our method still outperformed Zhan et al. [28], and four of its indices exceeded those of Godard et al. [32] at the depth evaluation cap of 80 m. As can be seen from the comparison, there was still a gap between our method and the state-of-the-art supervised depth estimation methods [11]. In conclusion, our method was superior to the current state-of-the-art self-supervised monocular depth estimation methods under the same training conditions. Figure 5 shows a qualitative comparison between our semi-supervised method and the current mainstream monocular depth estimation methods in the form of visualized disparity maps. Perhaps unsurprisingly, the ground truth disparities obtained by the 3D scanner provided better visual quality, but the recorded pixel points were sparse. As we can see from Figure 5, the methods of [12,25,32] were able to obtain a good depth estimation map for a single view of the scene, but our method presented the details of the depth map more clearly and smoothly.
Symmetry 2019, xx, 5 18 of 22 unsurprisingly, the ground truth disparities obtained by 3D scanner were able to provide better visual effects, but the recoding pixel points were sparse. As we can see from Figure 5, the method of [12,25,32] was able to obtain a good depth estimation map for a single view from the scene, but our method can present the details of the depth map more clearly and smoothly.  Table 5, we can see that there were some errors between the synthetic right view and the original right view, so it may not be appropriate to use the weightings of the stereo matching model trained from the original stereo pair as the weightings of our stereo matching model directly. In order to verify which training mode is more suitable for our stereo matching model, we conducted three experiments, which are respectively represented as the SSD,the OOD, and the SOD. Where the SSD refers to the original left view and synthesized right view as self-supervised input data and supervision labels, the OOD refers to the original stereo pairs as self-supervised input data and supervision labels, the SOD refers to the synthesized right view and the original left view as self-supervised input data and the original stereo pairs as supervised labels. Either way, in the test phase, the input data were a mixture of the reconstructed right view and the original left view. As can be seen from Table 6, no matter whether the evaluation cap was 50 m or 80 m, the SOD from the three training modes was better than the other two training modes in most estimation metrics, and the evaluation metric of the SSD training mode was the worst, even lower than monocular self-supervised depth estimation methods. At the depth cap of 80 m, the estimation metric of the OOD was only slightly less, 0.003 m, than that of the SOD in terms of the SRD metrics. 
Although the OOD mode produced good depth estimates for the original stereo pairs, it never saw the reconstructed right view during training, so it generalized poorly to the reconstructed right view. In the SSD mode, the supervision labels also came from the reconstructed right view, whose errors caused large deviations in the depth estimated for real scenes. When the depth cap was 50 m, the SOD was much better than the OOD in all estimation metrics.
Table 6. Evaluation results of our model and the current mainstream depth estimation models on the KITTI test set using the split of Eigen et al. [9]. Results are reported at two depth caps, 50 m and 80 m, applied to the ground truth and estimated depth.
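To make the role of the depth cap concrete, the following is a minimal sketch of how capped error metrics of the kind reported in Table 6 are commonly computed on KITTI. It is an illustration under stated assumptions, not the evaluation code used in this paper: the function name `depth_metrics`, the `min_depth` threshold, and the exact metric definitions follow the widely used evaluation protocol for the Eigen split and are assumed here, and SRD is taken to mean the squared relative difference (`sq_rel`).

```python
import numpy as np


def depth_metrics(gt, pred, min_depth=1e-3, cap=80.0):
    """Common monocular-depth error metrics, evaluated up to `cap` metres.

    Assumption: definitions follow the usual KITTI/Eigen-split protocol;
    the paper's own evaluation script is not shown in this section.
    """
    gt = np.asarray(gt, dtype=np.float64)
    pred = np.asarray(pred, dtype=np.float64)

    # Keep only pixels with a valid ground-truth reading below the cap
    # (the LiDAR ground truth is sparse, so most pixels are usually dropped).
    valid = (gt > min_depth) & (gt < cap)
    if not np.any(valid):
        raise ValueError("no valid ground-truth pixels within the depth cap")
    gt = gt[valid]

    # Clip predictions to the same range so far outliers cannot dominate.
    pred = np.clip(pred[valid], min_depth, cap)

    ratio = np.maximum(gt / pred, pred / gt)
    return {
        "abs_rel": float(np.mean(np.abs(gt - pred) / gt)),
        "sq_rel": float(np.mean((gt - pred) ** 2 / gt)),
        "rmse": float(np.sqrt(np.mean((gt - pred) ** 2))),
        "rmse_log": float(np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2))),
        "a1": float(np.mean(ratio < 1.25)),
        "a2": float(np.mean(ratio < 1.25 ** 2)),
        "a3": float(np.mean(ratio < 1.25 ** 3)),
    }
```

Calling the function once with `cap=50.0` and once with `cap=80.0` on the same predictions gives the two result sets; a lower cap excludes distant pixels, where disparity-based depth is least reliable, which is why the two caps can rank methods differently.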
Figure 6 shows the qualitative depth estimation results of the stereo matching models under the three training modes. It is difficult to tell from the images alone which training mode is better, but the SOD mode visibly achieves clearer separation at the edges of scene objects. Figure 6 also shows that, because the reconstructed right view served as the self-supervision label in the SSD mode, scene objects in its depth maps had blurred boundaries. The OOD mode did not see the features of the synthetic right view during training, which greatly reduced the generalization of the model. In the SOD mode, we trained the model with the concatenation of the reconstructed right view and the original left view as the input data and the original stereo pair as the supervision label.
This model can both learn the characteristics of the input data from the reconstructed right view and draw on the supervision of the original scene, so it produced the best depth estimation results among the three modes.
Comparison of Different Supervision Parameters: As can be seen from Table 3, compared with the ground truth depth data, the depth values predicted from the original stereo pairs by the stereo matching network contained some error. Therefore, using these disparity values directly as supervision labels with β = 1 would introduce greater error and reduce the generalization of the stereo matching network. To let the supervision loss play an appropriate role in optimization, we used the parameter β to adjust the proportion of this cost in the loss function and thereby constrain its effect on the model.
Table 2 reports the depth estimation metrics of the stereo matching network for different values of β. We restricted the supervision loss parameter to β ∈ (0, 1) and selected eight β values between zero and one to train our model. Table 2 shows that when only the supervision loss was used to optimize our model, generalization was poor because of the errors in the supervision labels. For β ∈ (0.1, 1), the depth estimation results gradually improved as β decreased, whereas for β ∈ (0, 0.1), they gradually worsened as β decreased. The trend resembles a parabola: the model performed best on all metrics and setups at β = 0.1. Therefore, in this paper, we set the supervision loss parameter β to 0.1.
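The weighting role of β can be illustrated with a small sketch. This is one plausible reading of the description above, not the loss actually implemented in the paper: the additive form, the L1 penalty on disparity, and the names `supervised_disparity_loss` and `total_loss` are all assumptions made for illustration.

```python
import numpy as np


def supervised_disparity_loss(pred_disp, pseudo_disp):
    """Mean absolute difference to the pseudo-label disparity.

    Assumption: an L1 penalty; the paper's exact supervised term is not
    given in this section.
    """
    pred_disp = np.asarray(pred_disp, dtype=np.float64)
    pseudo_disp = np.asarray(pseudo_disp, dtype=np.float64)
    return float(np.mean(np.abs(pred_disp - pseudo_disp)))


def total_loss(self_supervised_loss, pred_disp, pseudo_disp, beta=0.1):
    """Self-supervised loss plus a beta-weighted supervised term (assumed form)."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta is expected to lie in [0, 1]")
    supervised = supervised_disparity_loss(pred_disp, pseudo_disp)
    return self_supervised_loss + beta * supervised
```

Under this form, β = 0 removes the pseudo-label term entirely, while larger values let the errors in the pseudo-labels weigh more heavily on training, which is consistent with the trend described for Table 2.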
We also made a qualitative comparison of different values of β by visualizing the output disparity maps. As Figure 7 shows, when only the supervised loss was used to optimize the model, the depth estimate obtained through the warping function was very poor. When β was greater or less than 0.1, the visualized disparity map deviated noticeably from the original view. When β = 0, the disparity map looked similar to that of β = 0.1. However, both the quantitative results in Table 2 and the qualitative results in Figure 7 show that the disparity estimate for β = 0.1 was superior to that for β = 0 in accuracy, detail, and smoothness.

Conclusions
In this paper, we proposed a novel semi-supervised stereo matching method that works from a single image without using ground truth depth data. It decomposes the monocular depth estimation problem into two sub-problems, namely a right view synthesis process and a stereo matching process. We innovated beyond the existing self-supervised view synthesis method Deep3D by adding a left-right consistency constraint and a smoothness constraint to improve reconstruction accuracy. To estimate a high-quality depth map, we proposed a semi-supervised stereo matching method that reduces the effect of reconstruction errors in the right view. The two network models were connected in a pipeline during the test phase. The experimental results showed that our method further narrowed the gap with the supervised methods and made ill-posed monocular depth estimation obey geometric principles. Both procedures not only obeyed geometric principles, but also improved estimation accuracy. This will be a further incentive for methods that predict depth from a single image without using ground truth depth data.
In the future, we will add video time series information during training to further improve the prediction accuracy and geometric correctness. Although our method estimates depth from a single image at test time, it still needs stereo pairs for training. Therefore, we would be interested in using a single view and other information instead of stereo pairs as training data in future research. Finally, our method had certain advantages over monocular self-supervised methods, but it was still inferior to the self-supervised stereo matching method that uses the original stereo pairs as input. We therefore need to further improve the view synthesis model to reconstruct a higher-quality right view.
