Vision-Based Distance Measurement in Advanced Driving Assistance Systems

Featured Application: The outcome of this paper is a deep-learning-based application for measuring the distance between the subject vehicle and a target vehicle or pedestrian, which uses the forward-looking image captured by a vehicle-mounted vision sensor to achieve effective depth map estimation and distance measurement. The technique can be used in advanced driving assistance systems to further enhance driving safety.

Abstract: As forward-looking depth information plays a considerable role in advanced driving assistance systems, in this paper we first propose a method of depth map estimation based on semi-supervised learning, which uses the left and right views of binocular vision and sparse depth values as inputs to train a deep learning network with an encoding-decoding structure. Compared with unsupervised networks without sparse depth labels, the proposed semi-supervised network improves the estimation accuracy of depth maps. Secondly, this paper combines the estimated depth map with the results of instance segmentation to measure the distance between the subject vehicle and the target vehicle or pedestrian. Specifically, to measure the distance to a pedestrian, this paper proposes a depth-histogram-based method that calculates the average depth value of all pixels whose depth values lie in the peak range of the pedestrian's depth histogram. To measure the distance to a target vehicle, this paper proposes a method that first fits a 3-D plane to the locations of the target points in the camera body coordinate system using RANSAC (RANdom SAmple Consensus), then projects all the pixels of the target onto this plane, and finally uses the minimum depth value of these projected points to calculate the distance. Quantitative and qualitative comparisons on the KITTI dataset show that the proposed method can effectively estimate depth maps, and experimental results in real road scenarios and on the KITTI dataset confirm the accuracy of the proposed distance measurement methods.


Introduction
In order to improve road safety, both the scientific community and manufacturers must pay more attention to the development of automobile safety technology. As one of the key technologies, Advanced Driving Assistance Systems (ADASs) are developing rapidly [1]. Measuring the vehicle-to-vehicle and vehicle-to-pedestrian distance is one of the main tasks of an ADAS.

Related Work
Generally speaking, stereo-vision methods are not suitable for distance estimation in ADAS, for two reasons: firstly, these methods are very susceptible to errors in feature extraction and matching; secondly, they can only achieve relatively sparse and local depth values, which makes it difficult to compute the distances of multiple different targets at the same time. Therefore, in this section, we mainly discuss distance measurement methods based on monocular vision and the progress of the related key technologies.
Currently, monocular-vision methods for distance estimation in ADAS can be divided into two categories. The first category is based on the geometric relationship and the camera imaging model. In these methods, several parameters of the camera (e.g., its azimuth and elevation angles) and of the measured object (e.g., the width of the target vehicle) need to be provided in advance. Liu et al. used the geometric positional relationship of a vehicle in the camera coordinate system to construct the correspondence between key points in the world coordinate system and the image coordinate system, and then established a ranging model to estimate the target vehicle distance [11]. Kim et al. used the camera imaging model and the width of the target vehicle to estimate the distance to a moving vehicle far ahead [12]. The main disadvantage of such methods is that the accuracy of distance estimation depends heavily on the measurement accuracy of the parameters of the camera or the measured object. The second category involves constructing a regression model using machine learning. Wongsaree et al. trained a regression model on the correspondence between different positions in an image and their corresponding distances to complete distance estimation [13]. Gökçe et al. used target vehicle information to train a distance regression model for distance estimation [14]. The main disadvantage of these methods is that they have to collect a large amount of training data with real distances.
In the proposed method, the first and core task is to complete the depth map estimation. Traditional methods of vision-based depth map estimation are mostly realized using geometric constraints and handcrafted features (e.g., SIFT), such as Structure from Motion (SFM) [15]. The main disadvantage of these traditional methods is that they are very susceptible to errors in feature extraction and matching, and can only achieve relatively sparse and local depth maps. It is difficult to compute the distances of multiple different targets at the same time through these depth maps. In recent years, deep learning-based methods represented by convolutional neural networks (CNN) have been developed in various fields of computer vision [16], and several CNN-based approaches for depth map estimation have been studied [17,18]. Depending on whether they use real depth data as the labels during the training process, these methods can be divided into three categories: supervised, semi-supervised, and unsupervised methods. Moreover, depth values obtained by vision-based methods can be divided into the absolute depth, which denotes the true depth value of each pixel in the camera coordinate system, and the relative depth, which indicates the relative distance relationship of different pixels in the image.
As one of the representatives of the supervised methods, the Coarse-Fine method [19], proposed by Eigen et al., contains two CNNs of different scales: a coarse-scale CNN used to estimate the global depth of the input image, and a fine-scale CNN for optimizing the local details. On the basis of this method, Eigen et al. further proposed a multiscale network architecture [20] that can complete three tasks: depth estimation, surface normal estimation, and semantic segmentation. Li et al. proposed a combination method in which a CNN is used to regress the depth of superpixels, and a conditional random field is used for post-processing [21]. The supervised methods require the real depth value of each pixel in the input image as the training labels, which are difficult to obtain. Therefore, as a result of the lack of sufficient training samples, it is difficult for these methods to be widely adopted in different application fields, such as ADAS.
Unsupervised methods are generally divided into two subcategories. One is referred to as the self-supervised method, which uses the temporal information from a monocular video as the supervision information. Compared with supervised methods, the training samples for a self-supervised network can be easily obtained. However, self-supervised methods also have some shortcomings. Firstly, they have to complete pose estimation using additional approaches, which increases their complexity; as a result, the depth estimation results depend largely on the accuracy of the pose estimation. Secondly, because of the lack of scale information, these self-supervised methods can only obtain relative depth results and cannot obtain absolute depth values; this relative depth information does not meet the requirements of ADAS. The other subcategory of unsupervised methods uses the spatial constraint relationship from stereo vision as the supervision information. This means that stereo vision is used during offline training and monocular vision is used to estimate depth maps online. Generally, since the relative pose of the two cameras is known, the estimation results of this subcategory are better than those of a self-supervised network. Moreover, unlike self-supervised networks, because the relative location of the two cameras is known, these unsupervised methods can obtain absolute depth values, which is very important for ADAS. Garg et al. first used the spatial constraint of two views to propose a depth estimation method for unsupervised monocular vision based on convolutional neural networks [22]. Garg's method utilizes a network structure similar to a fully convolutional network (FCN), including encoding and decoding. In the unsupervised network, the depth map is first obtained by inputting the left view into the CNN, and the corresponding disparity map is then calculated according to the relationship between disparity and depth in stereo vision. Both this disparity map and the right view are used to reconstruct the left view, and the error between the original and reconstructed left views is used as the loss function of the encoding-decoding network. On the basis of this network structure, Godard et al. proposed a loss function that contains appearance matching loss, disparity smoothness loss, and left-right disparity consistency loss [23]. However, the estimation accuracy of unsupervised methods could be further improved if new information, including real depths, were added to the loss functions during training.
Generally speaking, compared to dense depth information corresponding to every pixel of a forward-looking image, sparse depth information corresponding to only parts of the pixels is easier to obtain. Therefore, semi-supervised methods using sparse and local depth information have recently been studied. Kuznietsov et al. combined a sparse ground truth depth map with a calibrated stereo image pair to train a semi-supervised network [24], which demonstrated state-of-the-art performance on the KITTI dataset. Moreover, Ji et al. proposed a novel semi-supervised adversarial learning framework that only utilizes a small number of image-depth pairs, in conjunction with a large number of easily available monocular images, to achieve depth estimation [25]. In summary, compared with unsupervised methods, semi-supervised methods can achieve better estimation results due to the introduction of local and sparse depth labels.
In this study, we used instance segmentation to build a bridge between the depth map estimation and the distance measurement of a specific object. Instance segmentation is based on object detection and semantic segmentation, providing different labels for separate instances of objects belonging to the same class. As a relatively flexible model for instance segmentation, Mask R-CNN inherits the basic framework of Faster R-CNN and adds an object mask prediction branch [26]. As Mask R-CNN is easy to transfer to other tasks, is superior to most methods of instance segmentation, and only increases the computational load slightly as compared to Faster R-CNN, this paper employs Mask R-CNN to segment the target from the background for target distance estimation. One of the earliest applications of instance segmentation for distance estimation was performed by Huang et al. They proposed a method that combines instance segmentation and a projection geometry model for distance estimation [27]. In the latest work of Huang et al., they obtained the vehicle attitude angle using an angle regression model and a segmentation algorithm, and then estimated the distance to the vehicle ahead by constructing an "area-distance" geometric model [4].
In this paper, we combine the results of depth map estimation, which provide the depth information of each pixel, with the results of instance segmentation, which provide the classification information of each pixel, to estimate the absolute distances to different participants on the road, e.g., cars, vans, trucks, and pedestrians.

Relationship between Disparity and Depth
The depth estimation principle based on the left and right views is shown in Figure 1, where one image pair contains two images I_l and I_r captured simultaneously by the left and right cameras; f and b are the focal length and baseline distance, respectively; P_l and P_r are the projection points of the object point P on the imaging planes of the left and right cameras, respectively; and (x_l, y_l, f) and (x_r, y_r, f) are the locations of P_l and P_r in the left and right camera body coordinate systems, respectively. To simplify the following description, we set y_l = 0 and y_r = 0. As shown in Figure 1, according to the property of similar triangles,

$$\frac{x_l}{f} = \frac{x_p^l}{z_p^l}, \qquad \frac{x_r}{f} = \frac{x_p^r}{z_p^r}, \tag{1}$$

where (x_p^l, z_p^l) and (x_p^r, z_p^r) are the locations of P in the left and right camera body coordinate systems, respectively. As x_p^l − x_p^r = b and z_p^r = z_p^l, as shown in Figure 1, Equation (1) can be rewritten as follows:

$$\frac{x_l - x_r}{f} = \frac{x_p^l - x_p^r}{z_p^l} = \frac{b}{z_p^l}, \tag{2}$$

and further,

$$z_p^l = \frac{f\, b}{d}, \tag{3}$$

where d = x_l − x_r is the left-right disparity and represents the difference in the location of point P in the left and right images. From Equation (3), we find that z_p^l, denoting the depth of point P in the left camera body coordinate system, can be obtained if the baseline distance b, the camera focal length f, and the disparity d are all known. Therefore, depth estimation can be transformed into the problem of disparity map computation. Consequently, the main task of the deep learning network is to compute the disparity map from the input image pair.

Figure 2 shows the training process of the semi-supervised learning network for depth map estimation. During training, the inputs are the left and right views and the corresponding sparse depth labels that have been matched with the left and right views, respectively. The outputs of the deep network are two disparity maps corresponding to the left and right views. The loss functions used for training this network comprise the appearance matching loss, disparity smoothness loss, left-right disparity consistency loss, and supervised loss.

As shown in Figure 2, the semi-supervised depth estimation network in this paper is based on an encoding-decoding network structure. It uses ResNet-50 as the feature extraction model in the encoding stage [28], and a mirrored ResNet-50 network in the decoding stage.

Appearance Matching Loss
In the training process, the left view I_l and right view I_r captured by the left and right cameras are simultaneously input into the network. When the left view I_l is input into the depth estimation network, the disparity map d_r that corresponds to the conversion from the left view to the right view can be predicted pixel by pixel. Similarly, the disparity map d_l is obtained when the input image is the right view I_r. Further, we can reconstruct the left view Ĩ_l and the right view Ĩ_r, which are respectively compared with the original left view I_l and right view I_r to form a loss function, known as the appearance matching loss, for network training. Following [29], this loss combines a structural similarity (SSIM) term with an L1 photometric term; for the left view, it takes the form:

$$L_{ap}^{l} = \frac{1}{N}\sum_{i,j}\left[\alpha\,\frac{1 - \mathrm{SSIM}\big(I_{l}(i,j),\, \widetilde{I}_{l}(i,j)\big)}{2} + (1-\alpha)\,\big\lVert I_{l}(i,j) - \widetilde{I}_{l}(i,j)\big\rVert\right],$$

where N is the number of pixels and SSIM can be calculated using Ref. [30].
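As a concrete illustration, the following NumPy sketch computes this loss for a pair of grayscale images; the 3 × 3 mean window and the weighting α = 0.85 are common choices in the literature and are assumptions here, not values reported in this paper.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified SSIM over a 3x3 mean window for grayscale float images."""
    mu_x, mu_y = uniform_filter(x, 3), uniform_filter(y, 3)
    var_x = uniform_filter(x * x, 3) - mu_x ** 2
    var_y = uniform_filter(y * y, 3) - mu_y ** 2
    cov_xy = uniform_filter(x * y, 3) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return num / den

def appearance_matching_loss(img, img_rec, alpha=0.85):
    """Weighted sum of an SSIM term and an L1 term between a view and its reconstruction."""
    ssim_term = np.clip((1.0 - ssim(img, img_rec)) / 2.0, 0.0, 1.0)
    l1_term = np.abs(img - img_rec)
    return float(np.mean(alpha * ssim_term + (1.0 - alpha) * l1_term))
```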

Disparity Smoothness Loss
Disparity smoothness loss consists of two parts: (1) the gradient values of the disparity map in two directions, d_l(i + 1, j) − d_l(i − 1, j) and d_l(i, j + 1) − d_l(i, j − 1), are used to encourage local smoothness; (2) considering that depth discontinuities often occur at image edges, we weight the gradient values of the disparity map with an edge-aware term based on the image gradients in the two directions.
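A minimal NumPy sketch of this edge-aware term, using the central differences written above, might look as follows; the exponential edge weighting is the formulation popularized by [23] and is assumed here.

```python
import numpy as np

def disparity_smoothness_loss(disp, img):
    """Edge-aware smoothness: disparity gradients are down-weighted at image edges."""
    ddx = np.abs(disp[:, 2:] - disp[:, :-2])   # d(i, j+1) - d(i, j-1)
    ddy = np.abs(disp[2:, :] - disp[:-2, :])   # d(i+1, j) - d(i-1, j)
    idx = np.abs(img[:, 2:] - img[:, :-2])     # image gradients in x
    idy = np.abs(img[2:, :] - img[:-2, :])     # image gradients in y
    return float(np.mean(ddx * np.exp(-idx)) + np.mean(ddy * np.exp(-idy)))
```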

Left-Right Disparity Consistency Loss
In order to achieve the consistency of the left and right disparity maps, the left-right disparity consistency loss is introduced to make the left-view disparity map equal to the projected right-view disparity map. Following the formulation of [23], this loss can be written as:

$$L_{lr}^{l} = \frac{1}{N}\sum_{i,j}\left| d_{l}(i,j) - d_{r}\big(i,\, j + d_{l}(i,j)\big) \right|,$$

with a symmetric term defined for the right view.
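The sketch below evaluates this consistency term with simple nearest-pixel sampling of the right-view disparity map; bilinear sampling, as used in practice, is omitted for brevity.

```python
import numpy as np

def lr_consistency_loss(d_left, d_right):
    """Mean absolute difference between d_left and d_right sampled at columns shifted by d_left."""
    h, w = d_left.shape
    rows = np.arange(h)[:, None].repeat(w, axis=1)
    cols = np.arange(w)[None, :].repeat(h, axis=0)
    shifted = np.clip(np.rint(cols + d_left).astype(int), 0, w - 1)
    return float(np.mean(np.abs(d_left - d_right[rows, shifted])))
```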

Supervised Loss
The main difference between unsupervised and semi-supervised depth estimation is that the semi-supervised method adds a supervised loss to the above three losses. The prerequisite for using the supervised loss is knowledge of both the true depth values and the predicted depth values. In the training process, the true depth values corresponding to parts of the pixels are first obtained and matched with the image. The predicted depth values of the pixels with true depth values can be converted from the predicted disparity map using Equation (3). The supervised loss is defined as the deviation of the predicted depth values Ẑ from the available ground truth Z:

$$L_{sup} = \frac{1}{N}\sum_{(i,j)\in\Omega}\big\lVert \hat{Z}(i,j) - Z(i,j) \big\rVert_{\delta},$$

where Ω is the set of all pixels with true depth values, N is the number of pixels with true depth values, and ‖·‖_δ is the berHu norm, defined as follows:

$$\lVert x \rVert_{\delta} = \begin{cases} |x|, & |x| \le \delta, \\[4pt] \dfrac{x^{2} + \delta^{2}}{2\delta}, & |x| > \delta. \end{cases}$$

Loss Function for Depth Estimation

The total loss function for depth estimation consists of four parts: appearance matching loss, disparity smoothness loss, left-right disparity consistency loss, and supervised loss. As a weighted combination of the four losses, it can be expressed as:

$$L = \lambda_{ap}\big(L_{ap}^{l} + L_{ap}^{r}\big) + \lambda_{ds}\big(L_{ds}^{l} + L_{ds}^{r}\big) + \lambda_{lr}\big(L_{lr}^{l} + L_{lr}^{r}\big) + \lambda_{sup} L_{sup},$$

where the λ terms weight the individual losses. This paper presents a semi-supervised method that adds a supervised loss term to the unsupervised method to complete the depth estimation. Compared with unsupervised methods that only use the left and right views, this semi-supervised method can improve the estimation accuracy by introducing sparse and local depths corresponding to parts of the pixels. Since its resolution and scan range are limited, a 3-D LiDAR can only scan points corresponding to parts of the image pixels and thus obtains sparse depth information of the scene in front of the vehicle.
During the training process, in order to make full use of the sparse and local depth information, we add the supervised loss function to the unsupervised framework. Specifically, for pixels with true depth values, we employ this depth information as the ground truth and apply the supervised loss; for pixels without depth labels, the unsupervised losses based on the principle of binocular reconstruction are used. This combination of unsupervised and supervised methods is referred to as semi-supervised depth estimation.
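The combination can be sketched as follows: pixels with a sparse LiDAR label contribute a berHu-penalized supervised term, while the remaining pixels are covered only by the unsupervised reconstruction losses. The weight `w_sup` and the berHu threshold rule (one fifth of the maximum residual) are assumptions based on common practice, not values confirmed by the paper.

```python
import numpy as np

def berhu(residual, delta):
    """berHu norm: L1 below the threshold delta, scaled L2 above it."""
    r = np.abs(residual)
    return np.where(r <= delta, r, (r ** 2 + delta ** 2) / (2.0 * delta))

def supervised_loss(pred_depth, gt_depth, valid_mask):
    """berHu loss over the pixels that have a sparse ground truth label."""
    r = (pred_depth - gt_depth)[valid_mask]
    if r.size == 0:
        return 0.0
    delta = 0.2 * np.max(np.abs(r))   # a common berHu threshold choice
    return float(np.mean(berhu(r, delta)))

def total_loss(unsup_loss, pred_depth, gt_depth, valid_mask, w_sup=1.0):
    """Semi-supervised objective: unsupervised losses plus a weighted supervised term."""
    return unsup_loss + w_sup * supervised_loss(pred_depth, gt_depth, valid_mask)
```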

Depth Map Estimation
When the offline training is completed, we obtain a pretrained depth estimation network. In the online test process, only a single test image is input into this pretrained network, and the disparity map corresponding to this input image is calculated. Finally, according to Equation (3), the depth value of each pixel of the input image in the left camera body coordinate system can be estimated by combining the focal length and baseline distance of the binocular camera used for training. The depth values of all pixels form a depth map corresponding to the input image.
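A minimal sketch of this online stage is given below; `predict_disparity` is a stand-in for the pretrained network, and the focal length and baseline are illustrative KITTI-like magnitudes rather than the paper's calibrated values.

```python
import numpy as np

F_TRAIN = 721.5   # focal length in pixels (illustrative)
BASELINE = 0.54   # baseline in meters (illustrative)

def predict_disparity(image):
    # Stand-in for the pretrained depth estimation network: returns a dummy
    # uniform disparity map with the same spatial size as the input image.
    return np.full(image.shape[:2], 30.0)

def depth_map(image, f=F_TRAIN, b=BASELINE):
    """Convert the predicted disparity map to depth via Equation (3): z = f * b / d."""
    disp = predict_disparity(image)
    return f * b / np.maximum(disp, 1e-6)   # guard against zero disparity
```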

Pixel-Level Depth Map of the Target
Using the trained semi-supervised depth estimation network, we can obtain the pixel-level depth map. In order to measure the distance between a target and the subject vehicle, it is necessary to detect the pixels belonging to the target in the input forward-looking image. It is well known that instance segmentation can achieve pixel-level target classification. As a general instance segmentation architecture, Mask R-CNN is based on the Faster R-CNN detector and identifies the pixel-level regions of the target by adding a branch for the segmentation task. According to the results of instance segmentation, we can obtain the depth value of each pixel of the target. Figure 3 shows two forward-looking images captured by an on-board ADAS camera: the left image contains a car, together with the pixels of this car from instance segmentation and the corresponding depth maps in 2D and 3D space; the right image contains a pedestrian, together with the pixels of this pedestrian from instance segmentation and the corresponding depth maps in 2D and 3D space.

Target Distance Measurement

Figure 4 shows the pixel depth values of the car and the pedestrians in Figure 3 in the camera body coordinate system. Generally speaking, to ensure a certain safety margin in ADAS, the minimum depth value among all pixels of a target can be regarded as the distance between the target and the subject vehicle. However, in order to reduce the influence of noise and errors in depth map estimation, we present different methods to measure this distance for different objects. The three-dimensional structure of a vehicle can roughly be considered as being composed of multiple planes, whereas the shape of the human body is a curved surface; these spatial structures of the vehicle and the pedestrian can be observed in Figure 4a,b. Therefore, in the proposed methods, we divide road targets into two types based on their 3-D shapes, vehicles (e.g., cars, vans, trucks) and pedestrians, and use a separate approach to measure the distance for each type.

Distance Measurement of the Target Vehicle
If the target is a vehicle, we first fit a plane in the camera body coordinate system to the object points corresponding to the pixels of the vehicle using the RANSAC algorithm [31]. Suppose that (x_i^l, y_i^l, z_i^l) is the coordinate of the point corresponding to each pixel of the target vehicle in the camera body coordinate system, S is the number of these object points, and i = 1, ..., S. Using RANSAC, a fitted plane in the camera body coordinate system can be determined and expressed as z = ax + by + c. Secondly, by projecting the object points (x_i^l, y_i^l, z_i^l) onto this plane, specifically ẑ_i^l = a x_i^l + b y_i^l + c, we obtain a set Ẑ = {ẑ_i^l | i = 1, 2, ..., S}. Finally, the distance between the target vehicle and the subject vehicle is the minimum value of this set Ẑ.
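The procedure can be sketched as follows; the iteration count and inlier threshold are hypothetical choices, not parameters reported in the paper.

```python
import numpy as np

def fit_plane_ransac(pts, iters=200, thresh=0.05, seed=0):
    """Fit z = a*x + b*y + c to an (S, 3) point array with a basic RANSAC loop."""
    rng = np.random.default_rng(seed)
    best_model, best_inliers = None, 0
    for _ in range(iters):
        sample = pts[rng.choice(len(pts), 3, replace=False)]
        A = np.c_[sample[:, 0], sample[:, 1], np.ones(3)]
        try:
            a, b, c = np.linalg.solve(A, sample[:, 2])   # plane through 3 points
        except np.linalg.LinAlgError:
            continue                                     # degenerate sample
        resid = np.abs(pts[:, 0] * a + pts[:, 1] * b + c - pts[:, 2])
        n_inliers = int(np.count_nonzero(resid < thresh))
        if n_inliers > best_inliers:
            best_model, best_inliers = (a, b, c), n_inliers
    return best_model

def vehicle_distance(pts):
    """Project each point's depth onto the fitted plane and take the minimum."""
    a, b, c = fit_plane_ransac(pts)
    z_proj = pts[:, 0] * a + pts[:, 1] * b + c
    return float(np.min(z_proj))
```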

Distance Measurement of the Target Pedestrian
Different from a target vehicle, the object points of a pedestrian in the camera body coordinate system cannot be fitted as a plane. Consequently, this method first uses a histogram to count the number of object points with different depth values. Specifically, suppose that (x_i^l, y_i^l, z_i^l) is the coordinate of an object point of the target pedestrian, S is the number of object points, and min_z and max_z are the minimum and maximum depth values of the object points, respectively; the statistical range of depth values in the histogram is then from ⌊min_z⌋ to ⌈max_z⌉ with a bin width of 1, where ⌊·⌋ and ⌈·⌉ indicate rounding down and rounding up to an integer, respectively. Secondly, the peak of the histogram and the corresponding depth range are obtained. Finally, the distance between the target pedestrian and the subject vehicle is the average of the depth values in this range.
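A minimal sketch of this histogram-based rule, assuming the unit bin width described above, is given below.

```python
import numpy as np

def pedestrian_distance(depths):
    """Average of the depth values falling in the peak bin of a unit-width depth histogram."""
    depths = np.asarray(depths, dtype=np.float64)
    lo, hi = np.floor(depths.min()), np.ceil(depths.max())
    edges = np.arange(lo, hi + 1.0)          # bin edges with unit spacing
    if edges.size < 2:                       # degenerate case: all depths in one bin
        return float(depths.mean())
    counts, edges = np.histogram(depths, bins=edges)
    k = int(np.argmax(counts))               # peak of the histogram
    in_peak = (depths >= edges[k]) & (depths <= edges[k + 1])
    return float(depths[in_peak].mean())
```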

Proposed Method Implementation
On the basis of the pretrained semi-supervised network for depth map estimation and the Mask R-CNN network for instance segmentation, the flow chart of the proposed distance measurement method is as follows (shown in Figure 5). It is important to note that when the binocular vision equipment used in offline training differs from the image sensor used in the online test, in order to obtain the absolute depth value, it is necessary to adjust and calibrate the focal length of the online sensor based on the image resolution and focal length of the training equipment. The relationship between the two focal lengths is as follows:

$$f_{test} = f_{train} \cdot \frac{w_{test}}{w_{train}},$$

where f_train and f_test are the focal lengths of the sensors for training and testing, respectively, and w_train and w_test are the widths of the training and test images, respectively. As mentioned above, the output of the pretrained network for depth estimation is the disparity map d. In order to obtain the depth value of each pixel of the input image, we use the following equation:

$$D(i,j) = \frac{b\, f}{d(i,j)},$$

where D(i, j) is the depth value of each pixel, d(i, j) is the disparity value of each pixel, and b is the baseline length of the binocular vision equipment used in offline training.
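Under the proportional relationship above, the adjustment reduces to scaling the training focal length by the ratio of image widths; the numbers below are illustrative, not the paper's calibration values.

```python
def adjusted_focal_length(f_train, w_train, w_test):
    """Scale the pixel focal length by the ratio of image widths."""
    return f_train * (w_test / w_train)

# e.g., a network trained on 1242-pixel-wide images with f_train = 721.5 px,
# applied to 640-pixel-wide test images (illustrative values):
f_test = adjusted_focal_length(721.5, 1242, 640)   # ~371.8 px
```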

Implementation Details
Firstly, we trained the semi-supervised network for depth estimation on a hardware platform with an NVIDIA GeForce GTX 1080Ti (NVIDIA, Santa Clara, CA, USA), the Ubuntu 14.04 (Canonical, London, UK) operating system, and the TensorFlow 1.4.0 (Google, Mountain View, CA, USA) development tool. From the KITTI dataset, we selected 7322 groups as the training set, in which each group contained right and left views and two corresponding sparse depth maps [32]. KITTI is a popular dataset for vision algorithm testing in ADAS; it contains a large number of stereo image pairs captured from a car driving in urban scenarios and also provides sparse depth data matched with the stereo views. These depth data served not only as the sparse depth labels in the training process, but also as the ground truth for algorithm evaluation. The network was trained for 50 epochs with an initial learning rate of 0.0001; from the 30th to the 40th epoch, the learning rate was reduced to 1/2 of the initial value, and for the last 10 epochs it was reduced to 1/4 of the initial value. The batch size was 8. We used the Adam optimizer to optimize the model, with β1 = 0.9 and β2 = 0.999. The Mask R-CNN model used in this paper was downloaded from https://github.com/matterport/Mask_RCNN.
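The stepwise schedule described above can be expressed as a simple function of the epoch index:

```python
def learning_rate(epoch, base_lr=1e-4):
    """Initial rate for epochs 0-29, halved for epochs 30-39, quartered afterwards."""
    if epoch < 30:
        return base_lr
    if epoch < 40:
        return base_lr / 2.0
    return base_lr / 4.0
```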
In order to assess the performance of depth map estimation, which is the key to the distance measurements, we used the following depth evaluation metrics [33]:

(1) Absolute relative error (AbsRel): $\mathrm{AbsRel} = \frac{1}{N}\sum_{i=1}^{N}\frac{|\hat{z}_i - z_i|}{z_i}$;

(2) Root-mean-square error (RMSE): $\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(\hat{z}_i - z_i)^2}$;

(3) Threshold accuracy: the percentage of pixels for which $\delta = \max\left(\frac{\hat{z}_i}{z_i}, \frac{z_i}{\hat{z}_i}\right)$ is below a threshold, which usually takes three values, 1.25, 1.25², and 1.25³, giving the accuracies δ < 1.25, δ < 1.25², and δ < 1.25³;

where N is the number of pixels with ground truth in the test set, and ẑ_i and z_i are the predicted and true depth values, respectively. Regarding these evaluation indexes, the smaller the first two (AbsRel and RMSE), the higher the accuracy of the depth estimation result; conversely, the larger the threshold accuracy, the better the depth estimation result.
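For reference, the three metrics can be computed as follows (z_hat denotes the predicted depths and z the ground truth over the valid pixels):

```python
import numpy as np

def abs_rel(z_hat, z):
    """Absolute relative error."""
    return float(np.mean(np.abs(z_hat - z) / z))

def rmse(z_hat, z):
    """Root-mean-square error."""
    return float(np.sqrt(np.mean((z_hat - z) ** 2)))

def threshold_accuracy(z_hat, z, thr=1.25):
    """Fraction of pixels with max(z_hat/z, z/z_hat) below thr (also 1.25**2, 1.25**3)."""
    delta = np.maximum(z_hat / z, z / z_hat)
    return float(np.mean(delta < thr))
```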

Quantitative Comparison with the Other Four Methods
In this subsection, we provide a comprehensive comparison of the proposed depth map estimation method with four other methods: Eigen's method [20], a supervised depth estimation method; Zhou's [34] and Godard's [23] methods, which are unsupervised; and Kuznietsov's method [24], which is semi-supervised. We conducted the quantitative comparison experiments on the KITTI dataset, using 200 images from the KITTI 2015 dataset as test samples. Table 1 illustrates the comparative results. In terms of absolute relative error, the semi-supervised method presented in this paper shows a substantial improvement over the supervised method of Eigen et al. and the unsupervised method of Zhou et al. The results obtained by our method are also significantly better than those of Godard's method, which uses left-right consistency. In addition, our method outperforms Kuznietsov's method, which is also semi-supervised. From the perspective of the root-mean-square error, the proposed method improves significantly on the other four methods. In terms of threshold accuracy, our method achieved accuracies of more than 90% for all three thresholds, all higher than those of the other methods.

Ablation Study of Depth Map Estimation
Compared with unsupervised networks, semi-supervised networks can improve the performance of depth map estimation by introducing sparse and local depth labels. In this subsection, we present an experiment to demonstrate the function of these sparse labels. Figure 6 shows the comparative results of the unsupervised and semi-supervised methods. The first row contains three forward-looking images captured by an on-board vision system and containing vehicles and pedestrians. The second row shows three 3-D point cloud maps, which are the sparse and local depth labels corresponding to the three images in the first row. From these depth labels, we can observe that the ground truth of the depth values is mainly concentrated in the middle area of each image, with no ground truth at the upper and lower edges. The third row contains the depth map estimation results obtained by the unsupervised method, which only uses left-right consistency. The last row contains the depth maps estimated by the proposed semi-supervised method. By comparing the areas in white rectangles in the third and fourth rows, and considering the areas in red rectangles in the first row, we can see that the semi-supervised method estimates depth values more accurately; for example, the depth prediction of thin, strip-like objects such as pedestrians and traffic signs is more accurate.

Depth Map Estimation in Real Road Scenarios
In order to further verify the generalization ability of the proposed method, we directly used the depth estimation model trained on the KITTI dataset to run tests in real road scenarios. The results, shown in Figure 7, clearly demonstrate that even though no road driving scene images with similar perspectives were added to the training sample set, owing to its good generalization ability, the proposed depth estimation network can still effectively recover depth information from the test scenes and meet the requirements of distance estimation in ADAS.


Distance Measurement of the Pedestrian
In this subsection, we evaluate our distance measurement method for pedestrians. Our method computes the average depth of all pixels whose depth values are in the peak range of the depth histogram. To verify the effectiveness of our measurement method, we compared it with a method that calculates the average depth of all pixels belonging to the pedestrian region. In the experiment, with the camera position fixed, we arranged for people to stand at positions L and 2L away from the camera, and conducted quantitative comparative experiments under four different L values, as shown in Figure 8. Figure 8a is a schematic diagram of our experimental setup; Figure 8b,c shows the four original images corresponding to the different L values (2.82, 3.93, 5.87, and 7.86 m) and their depth maps obtained using the proposed depth map estimation network. The distance measurement results for the pedestrians are given in Table 2. According to these results, we came to the following two conclusions: (1) the pedestrian distance measurement error of our method is significantly smaller than the error obtained using the average depth value; (2) the measurement results of our method effectively reflect the relative distance between the two pedestrians, i.e., the distance between the second pedestrian and the camera is twice the distance between the first pedestrian and the camera.
Distance Measurement of the Vehicle

In the vehicle distance measurement experiment, we selected a car and an SUV (shown in Figure 9), which are common in road scenarios, as the experimental objects. We used a laser rangefinder to measure the minimum horizontal distance to the target, starting from 2.5 m and taking a picture every 2.5 m up to 12.5 m, using a camera with a focal length of 4.58 mm. To test the performance of our vehicle distance measurement method, we compared it with the aforementioned average depth method and the method used in the pedestrian distance measurement. The comparative results are shown in Table 3, from which we drew the following conclusions: (1) compared with the car distance measurement results, the measurement accuracy for the SUV was better; this is because the rear of an SUV is closer to a plane, so the plane fitting error is smaller; (2) no matter which method was used, as the real distance between the subject camera and the target vehicle increased, the measurement accuracy decreased significantly. We believe there are two reasons for this. First, when the distance is greater, the mask of the target becomes smaller, and thus the mask error becomes larger, i.e., pixels that do not belong to the target area are extracted for the distance measurement. Secondly, according to the principle of binocular vision, since the baseline length is fixed, the longer the distance, the lower the measurement accuracy; (3) our vehicle distance measurement method demonstrated a better performance than the other two methods. Specifically, for the SUV distance measurement, the results of our method were better than those of the other methods for all five distances, and for the car distance measurement, our method was the best for three of the five distances.

Distance Measurement on the KITTI Dataset
In this section, we provide several distance measurement results using the proposed method, which combines semi-supervised depth estimation, Mask R-CNN, and the pedestrian and vehicle distance measurement methods. Figure 10 shows several forward-looking images from the KITTI dataset; the red rectangles mark the targets, i.e., pedestrians (P), trucks (T), vans (V), and cars (C). The true distances of these targets and the corresponding estimated results are shown in Figure 11. In Figure 11, there are 28 targets, comprising 3 pedestrians, 1 truck, 3 vans, and 21 cars. The average distance error rate over the 28 targets was 5.56%; the average error rate for the pedestrians was 4.02%, and for the vehicles it was 5.74%. Compared with the measurement results using images captured by our own camera, the accuracy of the distance measurements on the KITTI dataset was clearly higher. This is mainly because our model for depth map estimation used KITTI images as training samples. It is worth noting that the average processing time of the proposed distance measurement method for each image was 1.824 s, of which the online running time for the depth map estimation was 0.085 s, the time required for Mask R-CNN was 1.682 s, and the computational time of the distance measurement was 0.057 s. Therefore, in order to improve the real-time performance of the proposed method, it is necessary to improve the computational efficiency of the instance segmentation algorithm.

We also tested the performance of depth map estimation when the target brightness changed due to the angle of light irradiation. In Figure 12a, as a result of the reflection of light, the brightness values of some areas on the rear windshield of the vehicle in the red rectangle are too high. Conversely, the rear of the vehicle in the red rectangle in Figure 12b is in shadow. Figure 12c,d shows the corresponding depth maps of Figure 12a,b, respectively.
From these two depth maps, we can observe that the depth values of pixels in the overexposed areas and shadowy areas have not changed significantly. Therefore, in a daytime road environment, the direction and intensity of light had little effect on the results of depth map estimation and distance measurement using the proposed method.

Figure 11. Ground truths (GT) of the distance and corresponding estimated values of all targets in Figure 10 using the proposed architecture (unit: m).
Conclusions
The distance information between the target vehicle or pedestrian and the subject vehicle plays a very important role in ADAS. Therefore, this paper first proposed a semi-supervised depth map estimation algorithm, and then combined it with the Mask R-CNN instance segmentation algorithm to propose different distance measurement methods for target pedestrians and target vehicles. The depth map estimation algorithm in this paper used the left and right views of binocular vision and sparse depth ground truth to pretrain an encoding-decoding network. In the process of depth estimation, we used the known camera focal length, the baseline length of the training samples, and the pretrained deep model to compute the absolute depth map of a single input image. On the basis of the estimated depth map and the pixel-level classification results of Mask R-CNN, for the pedestrian target, this paper proposed a distance measurement method that calculates the average of the depth values corresponding to all pixels whose depth values are in the peak range of the target region's depth histogram. For the vehicle target, this paper proposed a distance measurement method which first fits a plane using RANSAC, then projects all the pixels of the target onto this plane, and finally uses the minimum depth value of these projected points to calculate the distance to the target vehicle. Extensive tests using a public dataset were conducted to assess the results of depth map estimation, and real experiments were performed to evaluate the results of the distance measurements. The experimental results using the public dataset proved the superior performance of the proposed depth map estimation method, and the experimental results in real road scenarios confirmed the effectiveness of the distance measurement methods.
Since the accuracy of the proposed distance measurement results depends to some extent on the results of instance segmentation, we plan to combine the depth map and the shape of the target to improve the location precision of masks obtained by instance segmentation and further improve the accuracy of distance measurement. Additionally, in bad visibility conditions caused by illumination, gas particles, dust, fog, etc., the proposed method using images from the visible light sensor does not achieve satisfactory results. Therefore, our research group is studying a completely new method that uses infrared images for distance estimation for ADAS in cases of low visibility.