Real-Time Vehicle Detection Framework Based on the Fusion of LiDAR and Camera

Abstract: Vehicle detection is essential for driverless systems. However, the current single-sensor detection mode is no longer sufficient in complex and changing traffic environments. Therefore, this paper combines a camera and light detection and ranging (LiDAR) to build a vehicle-detection framework that has the characteristics of multi-adaptability, high real-time capacity, and robustness. First, a multi-adaptive high-precision depth-completion method was proposed to convert the 2D LiDAR sparse depth map into a dense depth map, so that the two sensors are aligned with each other at the data level. Then, the You Only Look Once Version 3 (YOLOv3) real-time object detection model was used to detect the color image and the dense depth map. Finally, a decision-level fusion method based on bounding box fusion and improved Dempster–Shafer (D–S) evidence theory was proposed to merge the two results of the previous step and obtain the final vehicle position and distance information, which not only improves the detection accuracy but also improves the robustness of the whole framework. We evaluated our method using the KITTI dataset and the Waymo Open Dataset, and the results show the effectiveness of the proposed depth completion method and multi-sensor fusion strategy.


Introduction
Autonomous vehicles can improve the efficiency and safety of transportation systems and have become a central topic of future traffic development. In the study of autonomous vehicles, vehicle detection is key to ensuring safe driving. Autonomous vehicles are usually equipped with many different sensors to sense environmental information, such as cameras, light detection and ranging (LiDAR), radar, and ultrasonic radar. Among these sensors, the camera and LiDAR have become the most commonly used in the object detection field due to their superior performance.
The camera is widely used because of its high resolution, and there is a large body of literature on image-based object detection. In recent years, with the continuous development of deep learning, many scholars have introduced convolutional neural networks (CNNs) into the field of object detection and achieved excellent results. Object detection methods based on deep learning are usually divided into two categories: two-stage methods and one-stage methods. The two-stage object detection method is also called the region-based object detection method; classic models include regions with CNN features (R-CNN) [1], the spatial pyramid pooling network (SPP-Net) [2], Fast R-CNN [3], Faster R-CNN [4], the multi-scale CNN (MS-CNN) [5], and the subcategory-aware CNN (SubCNN) [6]. Deep learning combined with images can achieve not only 2D object detection but also 3D object detection.
Earlier LiDAR-camera fusion approaches also have limitations: some cannot estimate the unknown pixels accurately, and their detection processes rely on traditional machine learning methods with poor results. Kang et al. [23] designed a complete CNN framework that fuses LiDAR and color images to achieve multi-target detection. The framework consists of independent unary classifiers and a fusion CNN, but its complexity is high: although it achieves good detection accuracy, it requires a huge amount of computation and cannot guarantee real-time performance. Chavez-Garcia et al. [24] chose Yager's improved D-S evidence theory as the decision-level fusion method to improve the detection and tracking of moving objects. However, this method does not truly solve the conflict problem, which reduces the anti-interference performance of the system. Therefore, there are two main problems in existing decision-level fusion frameworks. First, the processing speed is too slow to meet real-time requirements. Second, the advantages of LiDAR are not fully utilized, so detection precision at night remains very low and the distance to the vehicle is not obtained.
Aiming at the above problems, this paper proposes a real-time decision-level fusion framework that considers both day and night and combines a camera and LiDAR. The framework first introduces a multi-adaptive, high-precision completion method, which improves adaptability to the detection environment and performs a preliminary fusion of the two sensors' data, laying a good foundation for the subsequent steps. Then, fast and accurate object detection is realized through the selected YOLOv3 [25] real-time object detection model and the proposed decision-level fusion strategy. The framework not only achieves higher detection precision during daytime driving but also obtains the distance between the preceding vehicle and the detecting vehicle. Moreover, when driving at night and the camera is not working properly, objects can still be detected effectively.
The organization of this paper is as follows. In Section 2, a vehicle detection framework including depth completion, vehicle detection, and decision-level fusion is proposed. Experimental results and discussion are described in Section 3, and Section 4 contains conclusions and future work.

Methodology
The framework consists of three parts, data generation, vehicle detection, and decision-level fusion. The overall structure of the framework is shown in Figure 1. First, the 3D LiDAR point cloud was transformed into a 2D sparse depth map by the joint calibration of camera and LiDAR, and then it was converted into a dense depth map by depth completion so that the laser data and image have the same resolution and are aligned with each other in space and time. Then the color image and dense depth map were input into the YOLOv3 detection network and the bounding box and confidence score of each detected vehicle were obtained. Finally, bounding box fusion and the improved Dempster-Shafer (D-S) evidence theory were proposed to obtain the final detection results.

Depth Completion
Before the depth completion, a pre-processing operation is required to convert the 3D LiDAR point cloud into a 2D sparse depth map. In pre-processing, precise calibration, joint calibration, and synchronization of the LiDAR and camera are needed so that each 3D LiDAR point can be projected accurately onto the 2D image plane to form the sparse depth map. The coordinate conversion relationship between the sensors is shown in Figure 2.
After the pre-processing work is completed, the sparse depth map is transformed into a dense depth map through the depth completion framework so that the resolution of the LiDAR data and the image is the same. Depth-completion methods can be divided into two types: guided depth completion [22,26–29] and non-guided depth completion [21,30].
In the daytime, the camera can capture a clear, high-resolution image. Obviously, the image at this time is very useful for guiding depth completion because it can help to distinguish object boundaries and continuous smooth surfaces. However, at night, the sharpness of the image is greatly reduced. At this time, the image guidance will not help the result of the depth completion but will cause it to go in the wrong direction. Therefore, using only LiDAR data for depth completion will result in better outcomes. However, the commonly used depth completion methods have only a single completion mode, resulting in low image quality after completion. Low-quality images lose a lot of detailed features, which create difficulties for the later detection stages and will cause a large number of false detections and missed detections, which is not practical. Therefore, this paper proposes a depth completion method that can switch between different completion modes according to day or night. Thus, this paper introduces the anisotropic diffusion tensor [31] and the proportionality coefficient, which can not only make the details of the dense depth map clearer but also switch between completion methods that require image guidance according to whether the image is clear or not.
This method first judges whether the image can positively guide the completion of the sparse depth map based on whether the acquired image was taken during the day or at night. There are many methods of day-night image classification, such as the Bayesian classifier [32], the SVM classifier, and CNNs. In the daytime, image-guided depth completion is used; at night, only LiDAR data are used for completion. The specific flowchart is shown in Figure 3.

Our completion method is based on three hypotheses. The first is that pixels with similar distances have similar depth values. The second is that similar color regions have similar depth values. The third is that changes in texture edges correspond to mutations of depth values. For each pixel p with unknown depth, its depth value D_p can be obtained using Equation (1):

D_p = \frac{1}{W_p} \sum_{q \in \Omega} G_{\sigma_D}(a \lVert p - q \rVert) \, G_{\sigma_I}(b \lvert I_p - I_q \rvert) \, G_{\sigma_T}(c \lVert T_p - T_q \rVert) \, D_q    (1)

where W_p is the normalization factor (the sum of the weights over Ω), p and q represent the position coordinates of the pixels, G represents the Gaussian function, I represents the pixel value of the image, D represents the depth value corresponding to the image, Ω represents the kernel of the Gaussian function, and T represents the anisotropic diffusion tensor. σ_I, σ_D, and σ_T are the σ values of the Gaussian functions of the color, distance, and anisotropic diffusion tensor, respectively. For the Gaussian function, an excessively large convolution kernel results in fuzzy completion images, whereas a kernel that is too small cannot fill the depth of unknown pixels in sparse surroundings. In addition, the smoothness of the weight distribution depends on the size of σ: the larger the σ value, the smoother the weight distribution. After parameter tuning, the convolution kernel size is set between 5 and 15 with σ_I = σ_D = σ_T = 5–10; the convolution kernel size is usually set to 9 and σ to 7. The anisotropic diffusion tensor and the proportional coefficients are detailed below.
(1) Anisotropic diffusion tensor

The anisotropic diffusion tensor is calculated directly from the color image, but it strongly indicates where the dense depth map should change, because most texture edges correspond to depth-value mutations. We use the anisotropic diffusion tensor to emphasize regions of mutated depth values and produce more accurate completion results.
Therefore, we include an anisotropic diffusion tensor T, which is calculated as

T = \exp(-\beta \lVert \nabla I_H \rVert^{\gamma}) \, n n^{T} + n^{\perp} (n^{\perp})^{T}

where ∇I_H is the image gradient, n is the normalized direction (unit vector) of the image gradient, n = ∇I_H / |∇I_H|, and n^⊥ is the vector normal to the image gradient. β and γ adjust the magnitude and sharpness of the tensor.
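To make this concrete, the short NumPy sketch below computes a per-pixel 2 × 2 tensor of the reconstructed form exp(-β‖∇I‖^γ) n nᵀ + n⊥ n⊥ᵀ from a grayscale guide image. The function name and the default values of β and γ are illustrative assumptions, not the authors' settings.

```python
import numpy as np

def anisotropic_diffusion_tensor(gray, beta=9.0, gamma=0.85, eps=1e-8):
    """Per-pixel tensor T = exp(-beta*||grad I||^gamma) n n^T + n_perp n_perp^T.

    `gray` is a float image in [0, 1]; beta/gamma control magnitude and sharpness.
    Returns an array of shape (H, W, 2, 2).
    """
    gy, gx = np.gradient(gray)                  # image gradient (dI/dy, dI/dx)
    mag = np.sqrt(gx ** 2 + gy ** 2) + eps
    nx, ny = gx / mag, gy / mag                 # n: unit gradient direction
    px, py = -ny, nx                            # n_perp: direction along the edge
    w = np.exp(-beta * mag ** gamma)            # edge-dependent attenuation

    T = np.empty(gray.shape + (2, 2))
    T[..., 0, 0] = w * nx * nx + px * px
    T[..., 0, 1] = w * nx * ny + px * py
    T[..., 1, 0] = T[..., 0, 1]
    T[..., 1, 1] = w * ny * ny + py * py
    return T
```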
(2) Proportional coefficient

a, b, and c are the proportional coefficients of the distance variation, color variation, and anisotropic diffusion tensor variation, respectively. In the daytime, the values of the three coefficients can be adjusted to enlarge the details of the guide image so that the contour of the dense map is more evident. After parameter tuning, the error of the dense depth map is smallest when a = 1 and b = c = 10–20; usually, we set b and c to 15.
At night, we set a = 1 and b = c = 0, so that the G_{σ_I}(b|I_p − I_q|) and G_{σ_T}(c‖T_p − T_q‖) terms become constant. At this time, only the distance information is valid for the completion process, and the whole equation degrades to relying solely on LiDAR for depth completion, which enables switching between the two modes. However, if we rely only on distance information for completion, the quality of the completion map suffers, so we apply morphological dilation and closing operations as pre-processing to improve image sharpness.
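The following sketch illustrates how the completion filter of Equation (1) and the day/night coefficient switch could be implemented. It is a simplified, brute-force version under several assumptions: the tensor term is replaced by a scalar per-pixel edge response, the morphological pre-processing is omitted, and all function and parameter names are illustrative rather than taken from the authors' code.

```python
import numpy as np

def gaussian(x, sigma):
    return np.exp(-(x ** 2) / (2.0 * sigma ** 2))

def complete_depth(sparse, gray, edge, daytime,
                   kernel=9, sigma_d=7.0, sigma_i=7.0, sigma_t=7.0):
    """Joint-bilateral-style depth completion in the spirit of Equation (1).

    sparse : HxW depth map, 0 where no LiDAR return was projected.
    gray   : HxW guide image in [0, 1] (only used in the daytime mode).
    edge   : HxW scalar edge response standing in for the diffusion tensor term.
    daytime: if True use a=1, b=c=15 (guided); if False use a=1, b=c=0 (non-guided).
    """
    a, b, c = (1.0, 15.0, 15.0) if daytime else (1.0, 0.0, 0.0)
    h, w = sparse.shape
    r = kernel // 2
    dense = sparse.copy()

    for y in range(h):
        for x in range(w):
            if sparse[y, x] > 0:
                continue                          # keep measured LiDAR depths
            y0, y1 = max(0, y - r), min(h, y + r + 1)
            x0, x1 = max(0, x - r), min(w, x + r + 1)
            patch = sparse[y0:y1, x0:x1]
            valid = patch > 0
            if not valid.any():
                continue                          # nothing to interpolate from
            yy, xx = np.mgrid[y0:y1, x0:x1]
            dist = np.hypot(yy - y, xx - x)       # spatial distance ||p - q||
            w_d = gaussian(a * dist, sigma_d)
            w_i = gaussian(b * (gray[y0:y1, x0:x1] - gray[y, x]), sigma_i)
            w_t = gaussian(c * (edge[y0:y1, x0:x1] - edge[y, x]), sigma_t)
            weights = w_d * w_i * w_t * valid     # night mode: w_i and w_t are constant
            norm = weights.sum()
            if norm > 0:
                dense[y, x] = (weights * patch).sum() / norm
    return dense
```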

Vehicle Detection
Because the YOLOv3 object detection model offers not only very fast detection speed but also excellent detection precision, this paper chooses YOLOv3 for vehicle detection. YOLOv3 is trained on two training sets (color images and dense depth maps), and two trained models are finally obtained.
YOLO is a state-of-the-art real-time object detection model that has evolved through three iterations. YOLOv1 and YOLOv2 [33] are the first two generations of YOLO. They can process images at a rate of 45 frames per second (FPS), but they suffer from low detection precision. In comparison, SSD, which also belongs to the one-stage object detection models, offers a similar detection speed but better detection ability for small objects.
However, the emergence of YOLOv3 compensates for the imperfect detection ability of the previous two generations for small objects and maintains its speed advantage. YOLOv3 has a mean Average Precision (mAP) value of 57.9% on the COCO dataset, which is slightly higher than SSD and RetinaNet, but it is 2-4 times faster than them, 100 times faster than Fast R-CNN and 1000 times faster than R-CNN [24].

Decision-Level Fusion
In this section, based on the detection results of the dense depth map and the color image in YOLOv3, the obtained bounding box information and the corresponding confidence score are fused to obtain the final detection result.

Bounding Box Fusion
We choose different fusion strategies by judging the Intersection over Union (IoU) of the bounding boxes in the dense depth map and the color image. When the IoU is less than 0.5, the boxes are considered two independent detection objects and are not fused. When the IoU is between 0.5 and 0.8, the two bounding boxes overlap only partially, so the overlapping area is used as the final target area. When the IoU is between 0.8 and 1, the two bounding boxes basically coincide; at this time, all the model boundaries are considered valid, and we use the extended area of the bounding boxes as the new detection area. The effect is shown in Figure 4.

Figure 4. Bounding box fusion. The yellow area represents the bounding box detected from the dense depth map, the blue area represents the bounding box detected from the color image, and the green area is the final detection result after fusion.
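As an illustration of the rule above, the sketch below fuses one depth-map box with one color-image box according to the three IoU intervals. Boxes are assumed to be (x1, y1, x2, y2) tuples, and the "extended area" in the 0.8–1 case is interpreted here as the union box, which is an assumption since the text does not spell it out.

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def fuse_boxes(box_depth, box_rgb):
    """Fuse one depth-map box with one color-image box following the IoU rule."""
    overlap = iou(box_depth, box_rgb)
    if overlap < 0.5:
        return [box_depth, box_rgb]               # two independent detection objects
    if overlap < 0.8:                             # partial overlap: keep the intersection
        return [(max(box_depth[0], box_rgb[0]), max(box_depth[1], box_rgb[1]),
                 min(box_depth[2], box_rgb[2]), min(box_depth[3], box_rgb[3]))]
    # boxes nearly coincide: keep the extended (union) area
    return [(min(box_depth[0], box_rgb[0]), min(box_depth[1], box_rgb[1]),
             max(box_depth[2], box_rgb[2]), max(box_depth[3], box_rgb[3]))]
```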

Confidence Score Fusion
Confidence Score Fusion
For the fused bounding box, we take the corresponding confidence scores of the original bounding boxes as a benchmark and obtain a new confidence score using the improved D-S evidence theory. D-S evidence theory is an inexact reasoning theory introduced by Dempster and developed by Shafer. It is one of the most widely used methods for multi-sensor information fusion and is very suitable for decision-level fusion [34,35]. The specific flow of the algorithm is as follows.

Let Θ be an identification framework, i.e., a set of mutually exclusive propositions. A basic probability assignment (BPA) over Θ assigns a mass to each subset of Θ such that the mass of the empty set is 0 and the masses of all subsets sum to 1. Suppose there are two evidences E_1 and E_2 under the identification framework Θ. The BPA and the focal elements of E_1 are m_1 and A_1, A_2, ..., A_k, respectively; the BPA and the focal elements of E_2 are m_2 and B_1, B_2, ..., B_k, respectively.
According to Dempster's combination rule of Equation (8), the above evidence can be fused:

m(A) = \frac{1}{1 - K} \sum_{A_i \cap B_j = A} m_1(A_i) \, m_2(B_j), \quad A \neq \emptyset, \qquad K = \sum_{A_i \cap B_j = \emptyset} m_1(A_i) \, m_2(B_j)    (8)

where K measures the conflict between the two evidences. However, when Dempster's combination rule is used to combine highly conflicting evidence, it may lead to a wrong conclusion. At present, there are two ways to improve it: one is to modify the combination rule, and the other is to modify the evidence before combination. Modifying the combination rule would destroy the desirable commutativity and associativity of the Dempster rule. Therefore, this paper chooses to modify the evidence to solve this problem.
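For concreteness, the sketch below applies Dempster's rule of Equation (8) to two BPAs represented as dictionaries from focal elements (frozensets) to masses. The two-element frame {vehicle, not_vehicle} and the mass values in the usage example are purely illustrative.

```python
from itertools import product

def dempster_combine(m1, m2):
    """Combine two BPAs (dicts: frozenset -> mass) with Dempster's rule."""
    combined, conflict = {}, 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb                 # K: mass assigned to the empty set
    if conflict >= 1.0:
        raise ValueError("totally conflicting evidence")
    return {a: w / (1.0 - conflict) for a, w in combined.items()}

# Illustrative frame and evidence from the two detectors (values are made up).
vehicle, other = frozenset({"vehicle"}), frozenset({"not_vehicle"})
theta = vehicle | other
m_rgb = {vehicle: 0.85, other: 0.05, theta: 0.10}     # color-image detector
m_depth = {vehicle: 0.70, other: 0.10, theta: 0.20}   # depth-map detector
print(dempster_combine(m_rgb, m_depth))
```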
First, we introduce the distance between two evidences [36] to measure the degree of conflict between them. The distance between m_1 and m_2 is defined as

d(m_1, m_2) = \sqrt{\tfrac{1}{2} (\vec{m}_1 - \vec{m}_2)^{T} D \, (\vec{m}_1 - \vec{m}_2)}

where D is a 2^Θ × 2^Θ matrix of Jaccard coefficients, whose elements are

D(A, B) = \frac{|A \cap B|}{|A \cup B|}, \quad A, B \subseteq \Theta.

For n evidences, a distance matrix whose (i, j) element is d(m_i, m_j) represents the distance between each pair of evidences. The similarity Sim [37] between two evidences can be obtained from their distance, that is, Sim(m_i, m_j) = 1 − d(m_i, m_j), and the similarity matrix SM between the evidences is obtained in the same way.
The degree of support for each evidence from the other evidences can be defined as

Sup(m_i) = \sum_{j=1, j \neq i}^{n} Sim(m_i, m_j).

Then the trust factor (weight) ω_i of the i-th evidence E_i is obtained as

\omega_i = \frac{Sup(m_i)}{\sum_{j=1}^{n} Sup(m_j)}.

After weighted averaging of the evidence, the expected evidence is obtained as

M = \sum_{i=1}^{n} \omega_i \, m_i.

Finally, using D-S evidence theory, the result of n − 1 iterative combinations of the expected evidence M is regarded as the synthesis result of the n evidences.
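The sketch below strings the pieces together: Jousselme distance, pairwise similarity, support-based weights, the weighted-average (expected) evidence, and n − 1 Dempster combinations. Because the formulas above were reconstructed from the standard versions in the cited literature, the details here (in particular the distance definition) should be read as assumptions rather than the authors' exact implementation.

```python
import numpy as np
from itertools import product

def dempster_combine(m1, m2):
    # Dempster's rule (same helper as in the previous sketch).
    out, conflict = {}, 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        if a & b:
            out[a & b] = out.get(a & b, 0.0) + wa * wb
        else:
            conflict += wa * wb
    return {a: w / (1.0 - conflict) for a, w in out.items()}

def jousselme_distance(m1, m2):
    """d(m1, m2) = sqrt(0.5 * (m1-m2)^T D (m1-m2)), with D[A,B] = |A∩B| / |A∪B|."""
    focal = sorted(set(m1) | set(m2), key=lambda s: (len(s), sorted(s)))
    v1 = np.array([m1.get(a, 0.0) for a in focal])
    v2 = np.array([m2.get(a, 0.0) for a in focal])
    D = np.array([[len(a & b) / len(a | b) for b in focal] for a in focal])
    diff = v1 - v2
    return float(np.sqrt(0.5 * diff @ D @ diff))

def weighted_ds_fusion(evidences):
    """Weighted-average evidence combined (n-1) times with Dempster's rule."""
    n = len(evidences)
    sim = np.ones((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            sim[i, j] = sim[j, i] = 1.0 - jousselme_distance(evidences[i], evidences[j])
    support = sim.sum(axis=1) - 1.0           # Sup(m_i): similarity to all other evidences
    weights = support / support.sum()          # trust factor (weight) of each evidence
    focal = set().union(*[set(m) for m in evidences])
    expected = {a: sum(w * m.get(a, 0.0) for w, m in zip(weights, evidences))
                for a in focal}                # weighted-average (expected) evidence M
    result = expected
    for _ in range(n - 1):                     # combine the expected evidence n-1 times
        result = dempster_combine(result, expected)
    return result
```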

Experimental Results and Discussion
We evaluated our method using the KITTI dataset [38] and the Waymo Open Dataset [39]. The KITTI dataset is the largest computer vision evaluation dataset for autonomous driving scenarios in the world. The data acquisition vehicle is equipped with a color camera and a Velodyne HDL-64E LiDAR. The Waymo Open Dataset is currently one of the largest and most diverse autonomous driving datasets in the world, with data from five LiDARs and five cameras. Our test platform is configured with an Intel Xeon E5-2670 CPU and an NVIDIA GeForce GTX 1080Ti GPU.

Depth Completion Experiment
The KITTI dataset provides calibration data for the camera and LiDAR, including the rigid transformation matrix Tr_velo_to_cam from the LiDAR coordinate system to the camera coordinate system, the camera internal parameter matrix P, and the camera rectification matrix R0_rect. Using Equation (16), we can project the LiDAR point cloud onto the camera plane to form a sparse depth map:

z_c \, [u, v, 1]^{T} = P \cdot R0\_rect \cdot Tr\_velo\_to\_cam \cdot [x, y, z, 1]^{T}    (16)

where u and v are camera image coordinates, x, y, z are 3D LiDAR coordinates, z_c is the depth of the point in the camera frame, and R0_rect and Tr_velo_to_cam are extended to homogeneous form. In this process, points projected outside the image boundary need to be discarded.

The conversion result of the sparse depth map is shown in Figure 5. In the fusion image, we can see that the LiDAR points are well aligned with the image pixels at the pillar. However, the generated depth map is too sparse to obtain useful information from directly. We processed the LiDAR data in two ways at the same time and conducted experiments separately. When using only LiDAR data for completion, the results are shown in Figure 6. The contour edges of the image are more apparent after pre-processing.
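A sketch of the projection of Equation (16) using the calibration matrices named above (Tr_velo_to_cam, R0_rect, P). Matrix shapes follow the usual KITTI convention (3 × 4, 3 × 3, 3 × 4), and keeping the nearest depth when several points fall on the same pixel is an assumption of this sketch.

```python
import numpy as np

def lidar_to_sparse_depth(points, Tr_velo_to_cam, R0_rect, P, height, width):
    """Project 3D LiDAR points (N, 3) onto the image plane as a sparse depth map."""
    n = points.shape[0]
    pts_h = np.hstack([points, np.ones((n, 1))])               # homogeneous (N, 4)
    cam = Tr_velo_to_cam @ pts_h.T                              # (3, N) camera coordinates
    cam = R0_rect @ cam                                         # rectified camera coordinates
    cam_h = np.vstack([cam, np.ones((1, n))])                   # (4, N)
    img = P @ cam_h                                             # (3, N): pixel coords * depth
    depth = img[2]
    keep = depth > 0                                            # points in front of the camera
    u = np.round(img[0, keep] / depth[keep]).astype(int)
    v = np.round(img[1, keep] / depth[keep]).astype(int)
    z = depth[keep]
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)   # discard out-of-image points
    u, v, z = u[inside], v[inside], z[inside]

    sparse = np.zeros((height, width), dtype=np.float32)
    order = np.argsort(-z)                                      # write far points first
    sparse[v[order], u[order]] = z[order]                       # nearest point wins per pixel
    return sparse
```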
When the image is used for depth completion, the result is shown in Figure 7. The edge contour of the dense map is made more explicit by enlarging the edge information of the guide image. The basic outline of the vehicle can be seen clearly from the figure.
A full example of depth completion is shown in Figure 8. We visually compared our algorithm with the most commonly used joint bilateral upsampling (JBU) method and with the ground truth. It can be seen from the figure that depth completion significantly improves the resolution of the LiDAR data and makes up for its low resolution. The completion map produced by the JBU method is blurred and of poor image quality. In the non-guided depth completion map, the edges of objects are clear, and each object can be identified easily. The guided depth completion map is rich in detail, and the outline of each object in the map is clear and recognizable.

To objectively evaluate the quality of the completion map, we introduce the root mean square error (RMSE), mean absolute error (MAE), inverse root mean square error (iRMSE), and inverse mean absolute error (iMAE):

RMSE = \sqrt{\frac{1}{MN} \sum_{m=1}^{M} \sum_{n=1}^{N} \left(R(m,n) - I(m,n)\right)^2}, \qquad MAE = \frac{1}{MN} \sum_{m=1}^{M} \sum_{n=1}^{N} \left|R(m,n) - I(m,n)\right|

with iRMSE and iMAE computed in the same way on the inverse depths 1/R(m,n) and 1/I(m,n), where R(m,n) and I(m,n) represent the reference image and the target image, respectively. The reference image has the true depth values, and M and N represent the size of the image. We experimented with 1000 groups of data with ground truth in the KITTI depth completion dataset and averaged all the errors. The results are shown in Table 1 and Figure 9. It is evident that our proposed method has the smallest error.
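A sketch of the four error metrics; it assumes that only pixels with valid (non-zero) ground-truth depth are evaluated and that iRMSE/iMAE are computed on inverse depths 1/R and 1/I, as in the KITTI depth-completion benchmark.

```python
import numpy as np

def depth_errors(reference, target):
    """RMSE, MAE, iRMSE, iMAE between a reference (ground-truth) and a target depth map.

    Only pixels where the reference depth is valid (> 0) are evaluated; the inverse-depth
    metrics additionally require the target depth to be non-zero.
    """
    valid = reference > 0
    r, t = reference[valid], target[valid]
    diff = r - t
    rmse = np.sqrt(np.mean(diff ** 2))
    mae = np.mean(np.abs(diff))

    inv_valid = t > 0
    inv_diff = 1.0 / r[inv_valid] - 1.0 / t[inv_valid]
    irmse = np.sqrt(np.mean(inv_diff ** 2))
    imae = np.mean(np.abs(inv_diff))
    return rmse, mae, irmse, imae
```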

Vehicle Detection and Fusion Experiment
The KITTI object detection dataset contains 7481 frames of training data and 7518 frames of test data. Each frame contains a synchronized color image and LiDAR point cloud. There are nine classes of label information in the dataset: 'Car', 'Van', 'Truck', 'Pedestrian', 'Person sitting', 'Cyclist', 'Tram', 'Misc', and 'Don't Care'. We merged the classes 'Car', 'Van', 'Tram', and 'Truck' into the new class 'vehicle' and only detected vehicles. Since the ground truth of the testing set has not yet been released, we divided the 7481 frames of the training set randomly into two parts: 3741 frames for training and 3740 frames for testing.
Since the KITTI dataset only contains daytime driving data, to test on night driving data this paper further introduces the Waymo Open Dataset, which contains high-resolution images and LiDAR data from 1000 segments recorded under various conditions. We validated the method using its 64-beam mid-range LiDAR data and the front-camera images in night environments. We trained YOLOv3 on the color images, the guided depth completion maps (daytime), and the non-guided depth completion maps (nighttime), and fused the results of the color image and the depth completion map. Mini-batch gradient descent (MBGD) was used to optimize the network, which was trained for about 180 epochs. Throughout the training process, the network with 1242 × 375 input was trained with a batch size of 8. The initial learning rate was 10^-3, which changed to 10^-4 after 100 epochs and to 10^-5 after a further 40 epochs. The momentum and weight decay were set to 0.9 and 0.0005, respectively.
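The training schedule described above can be written as a simple step function. Reading "after 40 epochs" as 40 further epochs (i.e., the second drop at epoch 140) is an interpretation of the wording, so treat the sketch below as an assumption.

```python
def learning_rate(epoch):
    """Step learning-rate schedule for YOLOv3 training, as described in the text."""
    if epoch < 100:
        return 1e-3
    if epoch < 140:   # "changed to 1e-4 after 100 epochs, and 1e-5 after [a further] 40 epochs"
        return 1e-4
    return 1e-5

# Other settings quoted from the text: mini-batch gradient descent, batch size 8,
# input resolution 1242 x 375, momentum 0.9, weight decay 0.0005, ~180 training epochs.
MOMENTUM, WEIGHT_DECAY, BATCH_SIZE, EPOCHS = 0.9, 0.0005, 8, 180
```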
Consistent with the evaluation method of the KITTI dataset, we used average precision (AP) and IoU [40] to evaluate the detection performance. A detection was considered successful when its IoU overlap with the ground truth was greater than 0.7. According to bounding box height, occlusion level, and truncation, the KITTI dataset is divided into three difficulty levels: easy, moderate, and hard. Figure 10 shows the precision-recall (P-R) curves for day detection, night detection, and the fusion result. Table 2 shows the AP of the day detection results, and Table 3 shows the AP of the night detection results.
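A simplified sketch of this evaluation: detections are greedily matched to ground-truth boxes at an IoU threshold of 0.7, and AP is taken as the area under the interpolated precision-recall curve. The official KITTI evaluation additionally filters by difficulty and uses its own sampling of the curve, so this is only an approximation.

```python
import numpy as np

def average_precision(detections, ground_truths, iou_threshold=0.7):
    """detections: list of (score, box); ground_truths: list of boxes; box = (x1, y1, x2, y2)."""
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
        return inter / union if union > 0 else 0.0

    detections = sorted(detections, key=lambda d: -d[0])     # highest confidence first
    matched = [False] * len(ground_truths)
    tp = np.zeros(len(detections))
    fp = np.zeros(len(detections))
    for i, (_, box) in enumerate(detections):
        ious = [iou(box, gt) for gt in ground_truths]
        best = int(np.argmax(ious)) if ious else -1
        if best >= 0 and ious[best] >= iou_threshold and not matched[best]:
            tp[i], matched[best] = 1.0, True
        else:
            fp[i] = 1.0
    recall = np.cumsum(tp) / max(len(ground_truths), 1)
    precision = np.cumsum(tp) / np.maximum(np.cumsum(tp) + np.cumsum(fp), 1e-9)
    for i in range(len(precision) - 2, -1, -1):              # interpolated precision envelope
        precision[i] = max(precision[i], precision[i + 1])
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recall, precision):                      # area under the P-R curve
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap
```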
Table 3. Performance evaluation of each detector at night.

Detector | Color | Non-Guided Depth Completion | Fusion
AP | 65.47% | 60.13% | 69.02%

As can be seen from the charts, both day and night, the dense depth maps and color images achieve excellent detection precision, and after fusion the precision is further improved. Compared with the results of daytime image detection, the results of daytime fusion detection are 3.45%, 2.63%, and 1.72% higher in easy, moderate, and hard, respectively. Compared with the results of night image detection, night fusion detection increased AP by 3.55%.
An example of the fusion detection process is shown in Figures 11-14. Good detection results can be obtained from the color image and the dense depth image alone. After fusion, the detection advantages of both are considered comprehensively, and more accurate results are obtained.

Figure 11. Fusion detection process. The images from top to bottom are: detection results on color images (red), detection results on dense depth maps (white), the fusion process of the former two, fusion results (yellow), and ground truth (green).
The distance information of the vehicle can be obtained through the final bounding box and the dense depth map. We first removed the overlapped area between the bounding boxes to filter out other vehicle information. An example is shown in Figure 12.
However, the remaining bounding box still contains some background information and invalid points. Therefore, we first remove the invalid points with a depth value of 0. Then we remove the largest 30% and the smallest 10% of the remaining depth values and average the depth values of the final remaining part to obtain the distance from the LiDAR to the preceding vehicle. Finally, we subtract the 2.89 m distance between the LiDAR and the vehicle front-body to obtain the final vehicle distance. Figure 13 shows the final calculation results.
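The distance-estimation step can be sketched as follows: invalid (zero) depths inside the fused box are removed, the largest 30% and smallest 10% of the remaining depths are discarded, the rest are averaged, and the 2.89 m LiDAR-to-front-body offset is subtracted. The function name and the handling of an empty box are assumptions.

```python
import numpy as np

def vehicle_distance(dense_depth, box, lidar_to_front=2.89):
    """Estimate the distance to the vehicle inside `box` (x1, y1, x2, y2), in meters."""
    x1, y1, x2, y2 = box
    depths = dense_depth[y1:y2, x1:x2].ravel()
    depths = depths[depths > 0]                     # drop invalid (zero-depth) pixels
    if depths.size == 0:
        return None                                 # no valid depth inside the box
    depths = np.sort(depths)
    lo = int(0.10 * depths.size)                    # drop the smallest 10% of depths ...
    hi = int(0.70 * depths.size)                    # ... and the largest 30%
    trimmed = depths[lo:hi] if hi > lo else depths
    return float(trimmed.mean()) - lidar_to_front   # subtract LiDAR-to-front-body offset
```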
Similarly, an example of the night detection process is shown in Figure 14. The detection result is more accurate after fusion and the distance information of the vehicle is obtained.
To further evaluate the effectiveness of the proposed algorithm, we compared our method with state-of-the-art object detection methods. The results are shown in Table 4.
For daytime detection, ranked by precision at the moderate difficulty level, our method ranks fifth out of the 10 compared methods; it reaches a high detection precision and fully meets the requirements of practical applications.
In terms of speed, our method has a very fast detection time of 0.057 s per frame, only 0.027 s slower than YOLOv2, while its average AP is 15.35% higher. Compared with Faster R-CNN, which has a similar AP, our method is 35 times faster. Compared with R-SSD, which has similar performance, our method has stronger anti-interference ability. Compared with the MS-CNN, SubCNN, 3DOP, and Mono3D methods, which have high AP, our method is 7, 35, 53, and 73 times faster, respectively. For nighttime detection, the method in this paper still performs well, ranking third in detection precision and third in detection speed among all the compared methods.
In conclusion, compared with other models, our method achieves advanced detection precision, fast detection speed, and strong anti-interference ability, so it is fully capable of performing the vehicle detection task for autonomous vehicles.

Conclusions
This paper proposes a multi-adaptive real-time decision-level fusion framework combining LiDAR and camera. The framework consists of three parts, multi-adaptive completion, real-time detection, and decision-level fusion. The three parts are complementary. First, a multi-adaptive high-precision depth completion method is proposed, which improves the quality of the dense depth map. Then, we chose the YOLOv3 object detection model to ensure real-time performance. Finally, the bounding box fusion method and improved D-S evidence theory were designed to fit the application environment of this framework better. These decision-level fusion methods combine the detection results of the two sensors to achieve complementary advantages.
The experimental results show that the depth completion algorithm proposed in this paper is beneficial for vehicle detection, and the average detection accuracy is improved by 2.84% through the decision-level fusion scheme. The processing time of each frame of data is only 0.057 s, which is much shorter than the response time of 0.2 s for human drivers, and fully meets the real-time requirements.
Although our depth completion algorithm is designed for vehicle detection, it can also be applied to popular research fields such as Simultaneous Localization and Mapping (SLAM), 3D object detection, and optical flow. The proposed decision-level fusion method is also universal in the field of sensor fusion.