Sensors
  • Article
  • Open Access

Published: 17 March 2023

3D Object Detection for Self-Driving Cars Using Video and LiDAR: An Ablation Study

1 Cerema Occitanie, Research Team “Intelligent Transport Systems”, 1 Avenue du Colonel Roche, 31400 Toulouse, France
2 Cerema Centre-Est, Research Team “Intelligent Transport Systems”, 8-10 Rue Bernard Palissy, 63017 Clermont-Ferrand, France
3 Institut de Recherche en Informatique de Toulouse (IRIT), University of Toulouse, UPS, 31062 Toulouse, France
4 Department of Computer Science and Engineering, Universidad Carlos III de Madrid, Leganés, 28911 Madrid, Spain
This article belongs to the Special Issue Single Sensor and Multi-Sensor Object Identification and Detection with Deep Learning

Abstract

Methods based on 64-beam LiDAR can provide very precise 3D object detection. However, highly accurate LiDAR sensors are extremely costly: a 64-beam model can cost approximately USD 75,000. We previously proposed SLS–Fusion (sparse LiDAR and stereo fusion) to fuse low-cost four-beam LiDAR with stereo cameras, which outperforms most advanced stereo–LiDAR fusion methods. In this paper, we analyze how the stereo and LiDAR sensors contribute to the performance of the SLS–Fusion model for 3D object detection, according to the number of LiDAR beams used. Data coming from the stereo camera play a significant role in the fusion model. However, it is necessary to quantify this contribution and to identify how it varies with the number of LiDAR beams used inside the model. Thus, to evaluate the roles of the parts of the SLS–Fusion network that represent the LiDAR and stereo camera architectures, we propose dividing the model into two independent decoder networks. The results of this study show that, starting from four beams, increasing the number of LiDAR beams has no significant impact on the SLS–Fusion performance. The presented results can guide design decisions by practitioners.

1. Introduction

Object detection is one of the main components of computer vision aimed at detecting and classifying objects in digital images. Although there is great interest in the subject of 2D object detection, the scope of detection tools has increased with the introduction of 3D object detection, which has become an extremely popular topic, especially for autonomous driving. In this case, 3D object detection is more relevant than 2D object detection since it provides more spatial information: location, direction, and size.
For each object of interest in an image, a 3D object detector produces a 3D bounding box with its corresponding class label. A 3D bounding box can be encoded as a set of seven parameters []: (x, y, z, h, w, l, θ), including the coordinates of the object center (x, y, z), the size of the object (height, width, and length), and its heading angle (θ). At the hardware level, the technology involved in the object detection process mainly includes mono and stereo cameras (visible-light or infrared), RADAR (radio detection and ranging), LiDAR (light detection and ranging), and gated cameras. In fact, the current top-performing methods in 3D object detection are based on the use of LiDAR (Figure 1) [,].
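To make this parameterization concrete, the following minimal Python sketch shows one possible way to represent such a box; the field names and example values are illustrative only and are not taken from the SLS–Fusion code.

```python
from dataclasses import dataclass

@dataclass
class Box3D:
    """Seven-parameter 3D bounding box (x, y, z, h, w, l, theta) plus a class label."""
    x: float          # object centre, metres
    y: float
    z: float
    h: float          # height, metres
    w: float          # width, metres
    l: float          # length, metres
    theta: float      # heading angle around the vertical axis, radians
    label: str = "Car"
    score: float = 1.0  # detector confidence attached to the box

# Example: a car roughly 15 m ahead of the sensor, facing forward.
car = Box3D(x=-1.2, y=1.6, z=15.0, h=1.5, w=1.7, l=4.2, theta=0.0, score=0.92)
```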
Figure 1. Example of a road scene with detection results obtained using LiDAR. Detected objects are surrounded by bounding boxes: the green boxes represent detections, while the red ones represent the ground truth.
However, highly accurate LiDAR sensors are extremely costly (the price of a 64-beam model is around USD 75,000 []), which incurs a hefty premium for autonomous driving hardware. Alternatively, systems based only on camera sensors have also received much attention because of their low cost and wide range of use. For example, in [], the authors claim that, instead of using expensive LiDAR sensors for accurate depth information, pseudo-LiDAR based solely on stereo images is a promising alternative at a much lower cost. That paper advances the pseudo-LiDAR framework through improvements in stereo depth estimation. Similarly, in [], instead of using a LiDAR sensor, the authors provide a simple and effective one-stage stereo-based 3D detection pipeline that jointly estimates depth and detects 3D objects in an end-to-end learning manner. The authors claim that this method outperforms previous stereo-based 3D detectors and even achieves performance comparable to a few LiDAR-based methods on the KITTI 3D object detection leaderboard. Another example is presented in []. To tackle the high variance in depth estimation accuracy with a video sensor, the authors propose CG-Stereo, a confidence-guided stereo 3D object detection pipeline that uses separate decoders for foreground and background pixels during depth estimation and leverages the confidence estimation from the depth estimation network as a soft attention mechanism in the 3D object detector. The authors report that their approach outperforms all state-of-the-art stereo-based 3D detectors on the KITTI benchmark.
Another interesting solution presented in the literature is the combination of LiDAR and a stereo camera. These methods exploit the fact that LiDAR complements the vision and information provided by the camera by adding notions of size and distance to the different objects that make up the environment. For example, the method proposed in [] takes advantage of the fact that a 3D environment can be reconstructed from stereo camera images: a depth map is extracted from the stereo camera information and enriched with the data provided by the LiDAR sensor (height, width, length, and heading angle).
In [], we proposed a new method called SLS–Fusion (sparse LiDAR and stereo fusion network). This is an architecture based on DeepLiDAR [] as a backbone network and on the pseudo-LiDAR pipeline [] to fuse information coming from a four-beam LiDAR and a stereo camera via a neural network. The fusion was carried out to improve depth estimation, resulting in better dense depth maps and, thereby, improved 3D object detection performance. This architecture is extremely attractive in terms of cost-effectiveness, since a 4-beam LiDAR is much cheaper than a 64-beam LiDAR (the price of a 4-beam model is around USD 600 []). The results presented in Table 1 and in [] show that the performance offered by the 64-beam LiDAR is not far from that reached by the stereo camera and four-beam LiDAR model. The 64-beam results were obtained with PointRCNN [] on the KITTI dataset [] for “Car” objects with IoU = 0.5 on three levels of difficulty (defined in []): easy (fully visible, max. truncation 15%) = 97.3, moderate (partly occluded, max. truncation 30%) = 89.9, and hard (difficult to see, max. truncation 50%) = 89.4. The best stereo plus four-beam LiDAR results on the same benchmark were easy = 93.16 and moderate = 88.81 with SLS–Fusion, and hard = 84.6 with Pseudo-LiDAR++. For this comparison, the satisfactory results obtained by the 64-beam LiDAR were modeled directly in PointRCNN, while the combination of video and LiDAR requires the generation of a new point cloud, usually referred to as the pseudo-point cloud.
Other solutions presented in the literature are based only on stereo cameras, such as the CG-Stereo method presented in [], which achieves outstanding results in the easy mode (easy = 97.04, see Table 1). However, implementing two sensors (e.g., LiDAR and a stereo camera) instead of one brings robustness to the 3D object detection system, as demonstrated in []. In addition, the SLS–Fusion and Pseudo-LiDAR++ methods show better results in the hard mode, as illustrated in Table 1.
Table 1. Evaluation of the 3D object detection part of SLS–Fusion compared to other competitive methods. Average precision AP_BEV [] results on the KITTI validation set [] for the “Car” category with IoU at 0.5 and on three levels of difficulty (defined in []): easy, moderate, and hard. S, L4, and L64, respectively, denote stereo, simulated 4-beam LiDAR, and 64-beam LiDAR. According to the inputs, the maximum average precision values are highlighted in bold.
Method | Input | Easy | Moderate | Hard
TLNet [] | S | 62.46 | 45.99 | 41.92
Stereo-RCNN [] | S | 87.13 | 74.11 | 58.93
Pseudo-LiDAR [] | S | 88.40 | 76.60 | 69.00
CG-Stereo [] | S | 97.04 | 88.58 | 80.34
Pseudo-LiDAR++ [] | S+L4 | 90.30 | 87.70 | 84.60
SLS–Fusion [] | S+L4 | 93.16 | 88.81 | 83.35
PointRCNN [] | L64 | 97.30 | 89.90 | 89.40
From the above, the interest in using a low-cost LiDAR and stereo camera model as an alternative solution is justifiable. However, there is still a need to understand the scope and limitations of a 3D object detection model composed of LiDAR and a stereo camera. Knowing exactly what the role of each sensor is in the performance of the architecture will make it possible to optimize the synergy of these two sensors, possibly reaching higher accuracy levels at lower costs.
In this study, after analyzing the fusion between the stereo camera and LiDAR for 3D object detection, we studied the respective role of each sensor involved in the fusion process. In particular, an ablation study was conducted considering LiDARs with different numbers of beams for object detection. Thus, LiDARs with 4, 8, 16, and 64 beams were tested, either alone or fused with the stereo camera. Regardless of the number of beams, fusion with stereo video always brought the best results. On the other hand, to reduce the overall equipment costs, the fusion between a 4-beam LiDAR and a stereo camera was enough to obtain acceptable results. Thus, when merging LiDAR with video, it is not necessary to use a LiDAR with a higher number of beams (which is more expensive).
Thus, a detailed study of the relationship between the number of LiDAR beams and the accuracy obtained in 3D object detection using the SLS–Fusion architecture is presented here. Roughly, two important results are presented: (1) an analysis of the stereo camera and LiDAR contributions to the performance of the SLS–Fusion model for 3D object detection; and (2) an analysis of the relationship between the number of LiDAR beams and the accuracy achieved by the 3D object detector. Both analyses were carried out with an ablation method [], in which one component is removed from the architecture to understand how the other components of the system perform. This characterizes the impact of every component on the overall performance and ability of the system.
After this introduction, to make the paper more self-contained, Section 2 presents the work related to LiDAR sensor fusion techniques. Section 3 describes the framework by detailing the main contributions. Section 4 explains the main characteristics of combining a stereo camera and LiDAR in the SLS–Fusion architecture. Section 5 evaluates the contribution of each component to the neural network fusion architecture. Finally, Section 6 presents the concluding remarks, lessons learned, and some advice for practitioners.

3. Analysis of the Role of Each Sensor in the 3D Object Detection Task

SLS–Fusion is a fusion method for LiDAR and stereo cameras based on a deep neural network for the detection of 3D objects (see Figure 2). Firstly, an encoder–decoder based on a ResNet network is designed to extract and fuse features from the left/right stereo camera images and the projected LiDAR depth maps. Secondly, the decoder network constructs left and right depth maps from the optimized features through a depth cost volume model to predict the corrected depth. After the expected dense depth map is obtained, a pseudo-point cloud is generated using the calibrated cameras. Finally, a LiDAR-based method for detecting 3D objects (PointRCNN []) is applied to the predicted pseudo-point cloud.
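To illustrate the pseudo-point cloud step, the sketch below back-projects a dense depth map with the standard pinhole model, as done in pseudo-LiDAR pipelines; the intrinsics (fx, fy, cx, cy) and the random depth map are illustrative assumptions, not values from the SLS–Fusion implementation.

```python
import numpy as np

def depth_to_pseudo_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a dense depth map (H x W, metres) into an N x 3 point cloud
    using pinhole camera intrinsics (the usual pseudo-LiDAR construction)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    z = depth
    x = (u - cx) * z / fx                            # lateral coordinate
    y = (v - cy) * z / fy                            # vertical coordinate
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]                  # keep pixels with valid depth

# Example with a KITTI-like image size and illustrative intrinsics.
depth = np.random.uniform(5.0, 80.0, size=(375, 1242)).astype(np.float32)
cloud = depth_to_pseudo_point_cloud(depth, fx=721.5, fy=721.5, cx=609.6, cy=172.9)
```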
Figure 2. Overall structure of the SLS–Fusion neural network: red, blue, and red–blue boxes represent, respectively, stereo, LiDAR, and fusion networks: The LiDAR and stereo camera data are considered as inputs. Subsequently, in the encoder/decoder process, the resulting features are merged to obtain a depth map. Afterward, the depth map is converted into a point cloud, which makes it possible to estimate the depth of the objects detected by the two sensors.
Section 1 shows previous results of SLS–Fusion on the KITTI dataset, which uses the refined work of PointRCNN to predict the 3D bounding boxes of detected objects. Experience with the KITTI benchmark and the low-cost four-beam LiDAR shows that the SLS–Fusion method proposed by us outperforms most advanced methods, as presented in Table 1. However, compared to the original PointRCNN detector that uses the expensive 64-beam LiDAR, the SLS–Fusion performance is lower. The superiority of the 64-beam LiDAR, used without fusing with stereo cameras, is expected because LiDARs with a high number of beams can provide very precise depth information, but highly accurate LiDAR sensors are extremely costly. In this case, the higher the number of LiDAR beams (i.e., the higher the number of points generated), the higher the cost of the LiDAR sensor (from USD 1000 to 75,000). This paper, thus, analyzes how the stereo and LiDAR sensors contribute to the performance of the SLS–Fusion model for 3D object detection. In addition, the performance impact of the number of LiDAR beams used in the SLS–Fusion model was also studied. As shown in Figure 2, to separate the parts of the SLS–Fusion network that represent the LiDAR and stereo camera architectures, it is only necessary to divide the model’s decoder into independent decoder networks. The decoder inside the SLS–Fusion model is the only component responsible for fusing features between the LiDAR and stereo sensors.
Given a pair of images from a stereo camera and a point cloud from a LiDAR as input to detect 3D objects, the SLS–Fusion deep learning approach [] has shown a high performance in the 3D object detection task. The analysis of this performance focuses on the contribution of the neural network component of each sensor (LiDAR or stereo) and of the type of LiDAR selected for the overall architecture of the system. In this work, LiDAR sensors are compared in terms of the number of beams and are grouped into 3 main types: low-cost (4 or 8 beams), medium-cost (16 beams), and high-cost (32 or 64 beams). This kind of study, particularly in artificial intelligence, is known as an ablation study [,], which is used to understand the contribution of each component in the system by removing it, analyzing the output changes, and comparing them against the output of the complete system. This characterizes the impact of every action on the overall performance.
This type of study has become the best practice for machine learning research [,], as it provides an overview of the relative contribution of individual architectures and components to model performance. It consists of several trials such as removing a layer from a neural network, removing a regularizer, removing or replacing a component from the model architecture, optimizing the network, and then observing how that affects the performance of the model. However, as machine learning architectures become deeper and the training data increase [], there is an explosion in the number of different architectural combinations that must be assessed to understand their relative performances. Therefore, we define the notion of ablation for this study as follows:
  • Consequences of varying the number of LiDAR beams (layers), from 4 to 64, on the results of SLS–Fusion (a sketch of how lower-beam LiDARs can be simulated from 64-beam data is given after this list).
  • Consequences of retraining SLS–Fusion by separating the parts of stereo cameras and LiDAR architectures.
  • Analyzing and discussing the characteristics of the neural network architecture used.
  • Applying metrics based on precision–recall curves (areas under the curves, F1-scores, etc.) to evaluate the detection results achieved by the study.
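Since KITTI provides 64-beam LiDAR scans, the lower-beam configurations have to be simulated by subsampling the original point cloud. The following minimal sketch shows one common way to do this, binning points by elevation angle and keeping a regular subset of the bins; the binning scheme is an illustrative assumption and does not reproduce the exact simulation used for SLS–Fusion.

```python
import numpy as np

def simulate_sparse_lidar(points, n_beams_in=64, n_beams_out=4):
    """Approximate a low-beam LiDAR by splitting a dense point cloud (N x 3, x/y in the
    ground plane, z up) into elevation-angle bins and keeping every k-th bin."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    elevation = np.arctan2(z, np.sqrt(x ** 2 + y ** 2))
    # Equal-width angular bins over the observed vertical field of view.
    edges = np.linspace(elevation.min(), elevation.max(), n_beams_in + 1)
    beam_id = np.clip(np.digitize(elevation, edges) - 1, 0, n_beams_in - 1)
    kept_beams = np.arange(0, n_beams_in, n_beams_in // n_beams_out)
    return points[np.isin(beam_id, kept_beams)]

# Example: reduce a synthetic 64-beam cloud to a simulated 4-beam one.
cloud_64 = np.random.randn(120000, 3) * np.array([20.0, 10.0, 1.5])
cloud_4 = simulate_sparse_lidar(cloud_64, n_beams_out=4)
```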

4. Characteristics of the Neural Network Architecture Used

The main component of the SLS–Fusion neural network, used to fuse or separate LiDAR and stereo camera features (for an ablation study), is the encoder–decoder component (see Figure 2 and Figure 3). It is the main part of the SLS–Fusion network that aims to enrich the feature maps and, thus, lead to better-predicted depth maps from the stereo camera and the projected LiDAR images. To understand all of this, we outline how the encoder–decoder component works and how it will help to improve the precision of the system when using low-, medium-, or high-cost LiDAR.
Figure 3. SLS–Fusion encoder–decoder architecture: The residual neural network blocks (ResNet blocks) within the encoder are used to extract features from the LiDAR and stereo inputs. The fusion process inside the decoder is accomplished through the use of addition and up-projection operators.
As shown in Figure 3, both the stereo camera and LiDAR encoders are composed of a series of residual blocks from the ResNet neural network, followed by a step-down convolution to reduce the feature resolution of the input. ResNet is a group of residual neural network blocks, and each residual block is a stack of layers placed in such a way that the output of one layer is taken and added to another deeper layer within the block, as shown in Figure 4. The main advantage of ResNet is its ability to prevent the accuracy from saturating and degrading rapidly during the training of deeper neural networks (networks with more than 20 layers). This advantage helps in choosing a network as deep as needed for the problem at hand. What we needed in this case was to extract as many detailed features as possible from the sparse LiDAR data and the high-resolution stereo images. This considerably assisted the decoder network in fusing the extracted features well.
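For reference, the sketch below shows a minimal PyTorch-style residual block with a skip connection; the channel count and layer arrangement are illustrative and do not reproduce the exact SLS–Fusion blocks.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: two 3x3 convolutions whose output is added back to the
    block input (the skip connection), which eases the training of deep networks."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)   # add the input back in before the final activation

features = ResidualBlock(64)(torch.randn(1, 64, 96, 320))  # toy feature map
```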
Figure 4. The structure of Stereo and LiDAR residual blocks inside the encoder/decoder of the SLS–Fusion model. A stack of layers is grouped into blocks for stereo and LiDAR networks, conducted by a step-down convolution direction and followed by a set of fusion blocks.
The decoder network adds the features of the LiDAR and stereo encoders and then up-projects the result to progressively increase the resolution of the features and generate a dense depth map as the decoder output. Because the sparse LiDAR input is heavily linked to the depth decoder output, features related to the LiDAR sensor should contribute more to the decoder than features related to the stereo sensor. However, as the add operation promotes features on both sides [], the decoder is encouraged to learn more features related to the stereo images in order to be consistent with the features related to the sparse depth from LiDAR. In this way, whatever the type and associated resolution of the selected LiDAR (low-, medium-, or high-cost), the decoder network will correctly learn the merged features. Consequently, the SLS–Fusion network always outperforms the corresponding LiDAR sensor used alone in 3D object detection, as shown in the next section.
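As an illustration of this decoder step, the following sketch fuses two feature maps by element-wise addition and then up-projects the result with a transposed convolution; the channel sizes and the exact up-projection operator are illustrative assumptions rather than the SLS–Fusion implementation.

```python
import torch
import torch.nn as nn

class FusionUpBlock(nn.Module):
    """One decoder stage: add stereo and LiDAR feature maps, then up-project the sum
    (here a transposed convolution) to double the spatial resolution."""
    def __init__(self, in_channels: int = 128, out_channels: int = 64):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_channels, out_channels, kernel_size=2, stride=2)
        self.refine = nn.Sequential(
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, stereo_feat, lidar_feat):
        fused = stereo_feat + lidar_feat   # the "add" fusion promotes features from both sides
        return self.refine(self.up(fused))

out = FusionUpBlock()(torch.randn(1, 128, 24, 80), torch.randn(1, 128, 24, 80))
```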

5. Assessment of the Different Network Architectures Implemented

To assess the operation of the SLS–Fusion system, the KITTI dataset [,], one of the most common datasets for autonomous driving, is used to train the neural network for dense depth estimation, pseudo-point cloud generation, and 3D object detection. It has 7481 training samples and 7518 testing samples for both stereo and LiDAR.
In this section, the results obtained with each component of the SLS–Fusion model (stereo camera, 4- and 64-beam LiDAR) are presented to understand the impact of each component on the final 3D object detection performance and to show how the results are affected. To do this, a complete ablation study was performed by disabling each component, as previously explained, or by changing the number of beams of the LiDAR component. As shown in Figure 5, increasing the number of LiDAR beams increases the number of points that represent the targets detected by the LiDAR. The aim of this illustration is to show the difficulty of dealing with LiDAR data processing: depending on the environment, some areas are full of detected points, while others are empty. Consequently, the LiDAR contribution to the performance of the object detection model will be enhanced. However, as shown in Table 2, increasing the number of beams from 4 to 64 significantly increases the cost of the LiDAR sensor. An optimized solution involves selecting the appropriate number of LiDAR beams that can provide the desired performance level. For a more comprehensive survey of the LiDARs available on the market, the reader is referred to [].
Figure 5. LiDAR point clouds representing the measured environment. The point cloud is colored according to the information coming from the RGB image. The number of targets (points) varies according to the version of the LiDAR (number of beams): very dense for 64 beams (upper left) and dispersed for 4 beams (bottom right).
Table 2. Comparison of some LiDAR sensors. Channels indicates the number of laser beams of the LiDAR sensor in the vertical direction. Range indicates the maximum distance at which a LiDAR can detect objects. HFoV/RES and VFoV/RES denote the horizontal and vertical field of view and angular resolution, respectively. For some LiDARs, the resolution depends on the scanning frequency.

5.1. Metrics

The indicators used here may seem basic to specialists, but it is important to recall them briefly because they are used in the analysis later. To better understand the detection process and the results achieved by this study, detection assessment measurements were used to quantify the performance of our detection algorithm in various situations. Among the popular measures for reporting results, the basic concepts and evaluation criteria used for object detection [] are as follows:
  • Confidence level: the object detection model output score linked to the bounding box of the detected object.
  • Intersection over union (IoU): the ratio of the area of overlap between the predicted bounding box and the ground truth bounding box to the area of union between the two boxes. The most common IoU thresholds used are 0.5 and 0.7 (see the sketch after this list).
  • Basic measures: true positive (TP), true negative (TN), false positive (FP), and false negative (FN).
  • Precision: the number of true positive predictions divided by the total number of positive predictions.
  • Recall: the number of true positive predictions divided by the total number of ground truth objects.
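The sketch below illustrates these definitions for axis-aligned 2D boxes; the (x1, y1, x2, y2) box format is an illustrative convention, and rotated BEV or 3D boxes require an additional rotation-aware intersection.

```python
def iou_2d(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def precision_recall(tp, fp, n_ground_truth):
    """Precision over all positive predictions; recall over all ground truth objects."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / n_ground_truth if n_ground_truth > 0 else 0.0
    return precision, recall

print(iou_2d((0, 0, 10, 10), (5, 5, 15, 15)))   # ~0.143: below a 0.5 threshold, so an FP
```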
Precision–recall curve: The precision–recall curve [] is a good way to evaluate the performance of an object detector as the confidence is changed. In the case of 3D object detection, to make things clearer, we provide an example to better understand how the precision–recall curve is plotted. Considering the detections as seen in Figure 6, there are 6 images with 10 ground truth objects represented by the red bounding boxes and 21 detected bounding boxes shown in green. Each green bounding box must have a confidence level greater than 50% to be considered as a detected object and is identified by a letter (B1, B2, …, B21).
Figure 6. Example of how the precision–recall curve is generated for six different images. Red bounding boxes show ground truth objects while green bounding boxes indicate detected objects.
Table 3 shows the bounding boxes with their corresponding confidences. The last column identifies the detections as TP or FP. In this example, a TP is considered if the IoU is greater than or equal to 0.2, otherwise, it is a FP.
Table 3. True and false positive-detected bounding boxes with their corresponding confidence levels. Det. and Conf. denote detection and confidence, respectively.
For some images, there is more than one detection overlapping a ground truth (see images 2, 3, 4, 5, 6 from Figure 6). In those cases, the predicted box with the highest IoU is considered a TP and all others as FPs (in image 2: B5 is a TP while B4 is a FP because the IoU between B5 and the ground truth is greater than the IoU between B4 and the ground truth).
The precision–recall curve is plotted by calculating the precision and recall values of the accumulated TP or FP detections. For this, first, we need to order the detections by their confidence levels, then we calculate the precision and recall for each accumulated detection as shown in Table 4 (note that for the recall computation, the denominator term is constant and equal to 10 since ground truth boxes are constant irrespective of detection).
Table 4. Precision and recall for each accumulated detection bounding box ordered by the confidence measure. Det., Conf., and Acumm. denote detection, confidence, and accumulated, respectively.
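A minimal sketch of this accumulation, mirroring the structure of Table 4, is given below; the toy detections are invented for illustration only.

```python
def precision_recall_curve(detections, n_ground_truth):
    """Sort detections by confidence and compute precision/recall after each
    accumulated detection (the recall denominator stays fixed at the number
    of ground truth boxes)."""
    detections = sorted(detections, key=lambda d: d[0], reverse=True)
    tp, fp, curve = 0, 0, []
    for confidence, is_tp in detections:
        tp += int(is_tp)
        fp += int(not is_tp)
        curve.append((tp / n_ground_truth, tp / (tp + fp)))   # (recall, precision)
    return curve

# Toy example: five detections against ten ground truth objects.
dets = [(0.95, True), (0.90, False), (0.80, True), (0.70, True), (0.60, False)]
print(precision_recall_curve(dets, n_ground_truth=10))
```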

5.2. Ablation Results

This section uses precision–recall curves to better understand the effect and the role of each component of SLS–Fusion on the performance of the entire model, namely the stereo component and the LiDAR component (changing the number of LiDAR beams from 4 to 64). For this evaluation, we used the KITTI evaluation benchmark of 3D bounding boxes or 2D bounding boxes in BEV (bird’s-eye view) to compute precision–recall curves for detection, as explained in the previous section. The BEV for autonomous vehicles is a vision monitoring system that is used for better evaluation of obstacle detection. Such a system normally includes between four and six fisheye cameras mounted around the car to provide right, left, and front views of the car’s surroundings.
Figure 7 shows the precision–recall (P–R) curves obtained by taking into account, respectively, stereo cameras, 4-beam LiDAR, 8-beam LiDAR, 16-beam LiDAR, and 64-beam LiDAR. As shown in that figure, an object detector is considered good if its precision stays high as the recall increases, which means that only relevant objects are detected (0 false positives = high precision) when finding all ground truth objects (0 false negatives = high recall). On the other hand, a poor object detector is one that needs to increase the number of detected objects (increasing false positives = lower precision) in order to retrieve all ground truth objects (high recall). That is why the P–R curve usually starts with high precision values, decreasing as recall increases. Finally, detection results are divided into three levels of difficulty (easy, moderate, or hard) mainly depending on the dimension of the bounding box and the level of occlusion of the detected objects, especially for cars.
Figure 7. Precision–recall (P–R) curves obtained for the detection of 3D objects (right column) and 2D objects in BEV (left column). In this graph, the minimal recall shows the first recall value obtained when the P–R curve starts to drop sharply and the precision score is still higher or equal to 0.7.
In summary, the P–R curve represents the trade-off between precision (positive predictive value) and recall (sensitivity) for a binary classification model. In object detection, a good P–R curve is close to the top-right corner of the graph, indicating high precision and recall. To provide a comprehensive evaluation of the performance of the object detection models, represented by the shape of the P–R curves, a new metric called “minimal recall” is added to the graph. The minimal recall is defined as the first recall value obtained when the P–R curve starts to drop sharply while the precision score is still higher than or equal to 0.7 (this value was fixed experimentally). The best detector is then the one that can achieve a high precision score (higher than 0.7) while the minimal recall score is closest to 1. Graphically, this means that a model that achieves a low level of detection will have a “minimal recall” toward the left side of the graph, while a model that achieves a high level of detection will have a “minimal recall” toward the right side of the graph.
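One possible reading of this definition is sketched below: the minimal recall is taken as the largest recall reached while precision is still at least 0.7, i.e., the recall at which the curve starts to fall below that precision floor. This is an interpretation for illustration, not the exact procedure used to produce Figure 7.

```python
def minimal_recall(curve, precision_floor=0.7):
    """Largest recall whose precision is still at least the floor (0.7 here),
    approximating the point where the P-R curve starts to drop below it."""
    valid = [recall for recall, precision in curve if precision >= precision_floor]
    return max(valid) if valid else 0.0

# Toy P-R curve as (recall, precision) pairs.
curve = [(0.1, 1.0), (0.2, 0.95), (0.4, 0.85), (0.5, 0.72), (0.6, 0.40)]
print(minimal_recall(curve))   # 0.5: precision falls below 0.7 just after recall 0.5
```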
Based on this idea, the P–R curves obtained for 2D objects in BEV are always better than those obtained for 3D objects. This is because the level of inaccuracy in detecting bounding boxes in 3D is always greater than in 2D. However, detecting the surrounding cars in the BEV projection view reduces the precision of estimating the distance of detected objects (cars) from the autonomous vehicle. P–R curves for the stereo camera show better results than four-beam LiDAR (BEV/3D minimal recall is 0.6/0.4 for stereo; BEV/3D minimal recall is 0.4/0.18 for LiDAR for the hard level of difficulty). However, fusing the two sensors (stereo camera and four-beam LiDAR) improves the detection performance (BEV/3D minimal recall is 0.63/0.42 in the hard level of difficulty). On the other hand, when the number of beams of LiDAR passes from a low-cost 4-beam LiDAR to a high-cost 64-beam LiDAR, the detector provides the best P–R curves (BEV/3D minimal recall is 0.65/0.45 in the hard level of difficulty).
Another way of comparing object detection performance is to compute the area under the curve (AUC) of the P–R. The AUC can also be interpreted as the approximated average precision (AP) for all recall values between 0 and 1. In practice, AP is obtained by interpolating through all n points in such a way that:
AP = \sum_{i=0}^{n} (r_{i+1} - r_i) \, \max_{\tilde{r}:\, \tilde{r} \geq r_{i+1}} p(\tilde{r})
where p(\tilde{r}) is the measured precision at recall \tilde{r}.
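A short sketch of this all-point interpolation is given below; it takes the (recall, precision) pairs of a P–R curve (such as the toy curve above) and sums the recall steps weighted by the maximum precision to the right of each step.

```python
def average_precision(curve):
    """All-point interpolated AP: weight each recall step by the maximum precision
    attained at any recall greater than or equal to that step."""
    curve = sorted(curve)                             # sort by recall
    recalls = [0.0] + [r for r, _ in curve]
    precisions = [p for _, p in curve]
    ap = 0.0
    for i in range(len(precisions)):
        p_interp = max(precisions[i:])                # interpolated precision
        ap += (recalls[i + 1] - recalls[i]) * p_interp
    return ap

print(average_precision([(0.1, 1.0), (0.2, 0.95), (0.4, 0.85), (0.5, 0.72), (0.6, 0.40)]))
```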
The statistical properties of various methods to estimate AUC were investigated by [], together with different approaches to constructing 95% of the confidence interval (CI). The CI represents the range within which 95% of the values from the P–R curve are distributed. Thus, this parameter corresponds to the dispersion around the AUC.
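As a rough illustration of such an interval, the sketch below builds a Wald-style 95% CI on the logit scale and transforms it back; the binomial-style standard error and the sample size are simplifying assumptions, and the estimators analyzed in the cited work may be more refined.

```python
import math

def logit_auc_ci(auc, n, z=1.96):
    """95% CI for an AUC estimate: Wald interval on the logit scale, back-transformed.
    The standard error below is a simple binomial approximation (an assumption)."""
    se = math.sqrt(auc * (1.0 - auc) / n)        # assumed standard error of the AUC
    eta = math.log(auc / (1.0 - auc))            # logit transform
    tau = se / (auc * (1.0 - auc))               # delta-method scaling of the SE
    expit = lambda e: 1.0 / (1.0 + math.exp(-e))
    return expit(eta - z * tau), expit(eta + z * tau)

print(logit_auc_ci(auc=0.875, n=3769))   # ~ (0.864, 0.885) with these illustrative values
```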
Hence, using the AUC for performance and the asymmetric logit intervals presented in [] for constructing the CI, Table 5 presents the 3D obstacle detection performance and the corresponding CI for an IoU of 0.7. Each cell of this table contains a pair of numbers (A/B) corresponding to the results obtained with the AP_BEV/AP_3D metrics. In the upper part of the table, we consider the stereo camera and the different LiDAR sensors taken separately. When the sensors are taken separately, the stereo camera provides the following results: 82.38/68.08%, 65.42/50.81%, and 57.81/46.07%, going from easy to hard. Among the LiDARs taken separately, the 64-beam LiDAR provides the best results: 87.83/75.44%, 75.75/60.84%, and 69.07/55.95%, going from easy to hard. Considering the progression of the detection as a function of the number of beams, an almost linear progression from 4 to 64 beams can be observed.
Table 5. Evaluation of 3D object detection performance by a stereo camera, different types of LiDARs, and the fusion of those. In the upper part of the table, the performance (measured using the area under the curve (AUC) in the P–R curve) and the confidence interval(s) (CI) of the stereo camera and LiDAR 4, 8, 16, and 64 beams are shown with respect to three levels of difficulties for objects to detect (easy, moderate, and hard). The bottom of the table presents the detection performance and CI using fusion between a stereo camera and LiDAR, i.e., 4-beam (S+L4), 8-beam (S+L8), 16-beam (S+L16), and 64-beam (S+L64). Each result is provided according to two indicators: average precision BEV (left)/average precision 3D (right).
The bottom of Table 5 presents the performance and CI resulting from the fusion between the stereo camera and the different types of LiDAR. In Table 5, we immediately notice that, when compared with the stereo camera alone, there is an improvement in 3D object detection when the stereo camera is fused with the LiDAR with the lowest number of beams (four); this is true for all levels of object detection difficulty (easy, moderate, and hard). In addition, we note that the stereo camera and 4-beam LiDAR combination provides slightly better results than those obtained with the 64-beam LiDAR alone in the easy and moderate modes. On the other hand, surprisingly, the detection performance barely improves when the number of beams increases (less than a 1% difference between S+L4 and S+L64). Moreover, when the CI values of S+L4 and S+L64 (the cheapest and most expensive combinations) are compared, there is an overlap between the CI of S+L4 (e.g., easy = [86.96, 88.04]) and the CI of S+L64 (easy = [87.52, 88.58]) for all levels of object detection difficulty, meaning that, for the fusion between the stereo camera and LiDAR, the difference between the two architectures is not statistically significant.
The obtained results could be related to the dataset processed. Thus, to deepen this analysis, datasets other than KITTI must be used; this is left as future work. In any case, the best solution is obtained by fusing both sensors. This proves that each component of the SLS–Fusion architecture effectively contributes to the final performance of the model, and none of these components of the neural network architecture can be eliminated, whatever the LiDAR used (low-, medium-, or high-cost).

6. Conclusions

In this work, we analyzed the contribution of a stereo camera and different versions of LiDAR (4 to 64 beams) to the performance of the SLS–Fusion model in detecting 3D obstacles, through an ablation study. Based on the ablation analysis and the different measurements used to evaluate our detection algorithm, it has been shown that the sensors perform better when fused. The quantitative results showed that the detection performance drops moderately when either component (stereo camera or LiDAR) is disabled or when the number of LiDAR beams is modified, and that the full model works best. Moreover, this fusion approach was found to be very useful for detecting 3D objects in foggy weather conditions [].
This analysis allowed us to identify several inherent characteristics of video and LiDAR. The camera’s resolution provides an undeniable and important advantage over LiDAR, as it captures information through pixels, which makes a significant difference even if the number of layers for the LiDAR is increased. Using two cameras makes it possible to measure distances to obstacles while keeping the same resolution because depth is calculated on all the pixels (dense stereo vision). LiDAR is mainly useful for determining distances. By extension, it also allows us to know the size and volume of objects very precisely, which can be extremely useful when classifying objects (cars, pedestrians, etc.).
In terms of resolution, LiDAR is limited by the fact that each of its pixels is a laser. Laser light is focused: it forms a point that does not deform, which allows high precision. However, it is more complicated to multiply the lasers in a very small space, and that is why, for the moment, LiDAR has a much lower resolution than the camera. A classic smartphone-type camera provides 8 million pixels per image, while a LiDAR will have around 30,000 pixels (at most). An advantage of LiDAR is its ability to adapt to changes in light, which are a strong disadvantage for imaging. As a consequence, the two types of sensors must be used in a complementary way.
In conclusion, SLS–Fusion is an effective obstacle detection solution for low- and high-cost LiDARs when combined with a stereo camera; an optimal cost-effective solution is achieved with the most economical four-beam LiDAR component. To better generalize the SLS–Fusion model and find the optimal balance between obstacle detection performance and the cost of the LiDAR component, it is desirable to test the model on various datasets and environments, such as Waymo [], nuScenes [], or Argoverse 2 [].

7. Perspectives to Go Further

To better understand the role and contribution of each technology to obstacle detection, it is necessary to make a more detailed analysis of the objects detected by one sensor or the other. Each type of sensor detects a list of objects with their 3D positions.
It is then necessary to merge the two lists of objects by following a rigorous procedure. In our system, when we consider the sensors separately, each provides a list of detected objects (enclosed in 3D boxes) belonging to the same scene. We can develop a fusion module that takes as input the two lists of objects detected by the two types of sensors and delivers a fused list of detected objects. For each object, we have the centroid of the bounding box, the class of the object, and the number of sensors that detected the object. To perform fusion between data from LiDAR and stereo vision objects, we could, for example, project the objects detected by stereo vision processing onto the laser plane.
Object association: In this step, we determine which stereo objects are to be associated with which LiDAR objects from the two object lists using, for example, the nearest-neighbor technique. We could define a distance between the centroids of the objects detected by stereo and by LiDAR, and then associate the current stereo object with the nearest LiDAR object, using as a reference point the coordinates of the sensors installed on the vehicle. Exploiting the depths calculated by the stereo and the LiDAR, we only need to compare objects whose centroids are very close to each other (within a threshold) from the reference point. The result of this fusion process is a new list of fused objects: it contains the associated objects, together with the LiDAR objects that could not be associated with any stereo object and the stereo objects that could not be associated with any LiDAR object. By doing this, we can more objectively analyze the advantages and disadvantages of the two technologies: in what circumstances, for what type of object, at what distance, and with what brightness. For each object in the fused list, we have position (centroid) information, dynamic state information, classification information, and a count of the number of sensors (and of how many beams) detecting this object. A sketch of such an association step is given below. This work is under development.
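The minimal sketch below implements the nearest-neighbour association described above with a simple centroid-distance threshold; the object fields and the 1.5 m threshold are illustrative assumptions, not part of an existing implementation.

```python
import math

def fuse_detections(stereo_objs, lidar_objs, max_dist=1.5):
    """Associate stereo and LiDAR detections by nearest centroid (metres) and return a
    fused list; unmatched objects are kept and marked as seen by a single sensor."""
    fused, matched_lidar = [], set()
    for s in stereo_objs:
        candidates = [(math.dist(s["centroid"], l["centroid"]), j)
                      for j, l in enumerate(lidar_objs) if j not in matched_lidar]
        best = min(candidates, default=None)
        if best is not None and best[0] <= max_dist:
            matched_lidar.add(best[1])
            fused.append({**s, "sensors": 2})     # seen by both sensors
        else:
            fused.append({**s, "sensors": 1})     # stereo-only object
    for j, l in enumerate(lidar_objs):
        if j not in matched_lidar:
            fused.append({**l, "sensors": 1})     # LiDAR-only object
    return fused

stereo = [{"centroid": (2.0, 0.5, 15.0), "label": "Car"}]
lidar = [{"centroid": (2.1, 0.4, 15.2), "label": "Car"}]
print(fuse_detections(stereo, lidar))   # one fused object with sensors == 2
```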

Author Contributions

Conceptualization, P.H.S., J.M.R.V. and N.A.M.M.; methodology, P.H.S., J.M.R.V., G.S.P. and L.K.; software, P.H.S. and N.A.M.M.; validation, P.H.S., J.M.R.V., L.K. and S.A.V.; formal analysis, P.H.S., J.M.R.V., G.S.P., L.K., N.A.M.M., P.D., A.C. and S.A.V.; investigation, P.H.S., J.M.R.V. and L.K.; resources, N.A.M.M. and P.D.; data curation, P.H.S. and N.A.M.M.; writing—original draft preparation, P.H.S., J.M.R.V., L.K., P.D., A.C. and S.A.V.; writing—review and editing, P.H.S., J.M.R.V., G.S.P., L.K., P.D., A.C. and S.A.V. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Qi, C.R.; Liu, W.; Wu, C.; Su, H.; Guibas, L.J. Frustum PointNets for 3D Object Detection from RGB-D Data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 918–927. [Google Scholar]
  2. Shi, S.; Guo, C.; Jiang, L.; Wang, Z.; Shi, J.; Wang, X.; Li, H. PV-RCNN: Point-Voxel Feature Set Abstraction for 3D Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10526–10535. [Google Scholar] [CrossRef]
  3. He, C.; Zeng, H.; Huang, J.; Hua, X.S.; Zhang, L. Structure Aware Single-Stage 3D Object Detection From Point Cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11870–11879. [Google Scholar] [CrossRef]
  4. Velodyne’s HDL-64E Lidar Sensor Looks Back on a Legendary Career. Available online: https://velodynelidar.com/blog/hdl-64e-lidar-sensor-retires/ (accessed on 20 February 2022).
  5. You, Y.; Wang, Y.; Chao, W.L.; Garg, D.; Pleiss, G.; Hariharan, B.; Campbell, M.; Weinberger, K.Q. Pseudo-LiDAR++: Accurate Depth for 3D Object Detection in Autonomous Driving. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual Conference, 26 April–1 May 2020. [Google Scholar]
  6. Chen, Y.; Liu, S.; Shen, X.; Jia, J. DSGN: Deep Stereo Geometry Network for 3D Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 12533–12542. [Google Scholar] [CrossRef]
  7. Li, C.; Ku, J.; Waslander, S.L. Confidence Guided Stereo 3D Object Detection with Split Depth Estimation. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October 2020–24 January 2021; pp. 5776–5783. [Google Scholar] [CrossRef]
  8. Wang, Y.; Chao, W.L.; Garg, D.; Hariharan, B.; Campbell, M.; Weinberger, K.Q. Pseudo-LiDAR From Visual Depth Estimation: Bridging the Gap in 3D Object Detection for Autonomous Driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 8437–8445. [Google Scholar] [CrossRef]
  9. Mai, N.A.M.; Duthon, P.; Khoudour, L.; Crouzil, A.; Velastin, S.A. Sparse LiDAR and Stereo Fusion (SLS-Fusion) for Depth Estimation and 3D Object Detection. In Proceedings of the the International Conference of Pattern Recognition Systems (ICPRS), Curico, Chile, 17–19 March 2021; Volume 2021, pp. 150–156. [Google Scholar] [CrossRef]
  10. Qiu, J.; Cui, Z.; Zhang, Y.; Zhang, X.; Liu, S.; Zeng, B.; Pollefeys, M. DeepLiDAR: Deep Surface Normal Guided Depth Prediction for Outdoor Scene From Sparse LiDAR Data and Single Color Image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 3308–3317. [Google Scholar] [CrossRef]
  11. Valeo Scala LiDAR. Available online: https://www.valeo.com/en/valeo-scala-lidar/ (accessed on 17 February 2022).
  12. Shi, S.; Wang, X.; Li, H. PointRCNN: 3D Object Proposal Generation and Detection From Point Cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 770–779. [Google Scholar] [CrossRef]
  13. Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The KITTI dataset. Int. J. Rob. Res. 2013, 32, 1231–1237. [Google Scholar] [CrossRef]
  14. Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA, 16–21 June 2012; pp. 3354–3361. [Google Scholar] [CrossRef]
  15. Mai, N.A.M.; Duthon, P.; Khoudour, L.; Crouzil, A.; Velastin, S.A. 3D Object Detection with SLS-Fusion Network in Foggy Weather Conditions. Sensors 2021, 21, 6711. [Google Scholar] [CrossRef] [PubMed]
  16. Chen, X.; Ma, H.; Wan, J.; Li, B.; Xia, T. Multi-view 3D Object Detection Network for Autonomous Driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6526–6534. [Google Scholar] [CrossRef]
  17. Qin, Z.; Wang, J.; Lu, Y. Triangulation Learning Network: From Monocular to Stereo 3D Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 7607–7615. [Google Scholar] [CrossRef]
  18. Li, P.; Chen, X.; Shen, S. Stereo R-CNN Based 3D Object Detection for Autonomous Driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 7636–7644. [Google Scholar] [CrossRef]
  19. Meyes, R.; Lu, M.; de Puiseau, C.W.; Meisen, T. Ablation Studies in Artificial Neural Networks. arXiv 2019, arXiv:1901.08644. [Google Scholar] [CrossRef]
  20. Rivera Velázquez, J.M.; Khoudour, L.; Saint Pierre, G.; Duthon, P.; Liandrat, S.; Bernardin, F.; Fiss, S.; Ivanov, I.; Peleg, R. Analysis of Thermal Imaging Performance under Extreme Foggy Conditions: Applications to Autonomous Driving. J. Imaging 2022, 8, 306. [Google Scholar] [CrossRef]
  21. Chabot, F.; Chaouch, M.; Rabarisoa, J.; Teuliere, C.; Chateau, T. Deep MANTA: A Coarse-to-Fine Many-Task Network for Joint 2D and 3D Vehicle Analysis from Monocular Image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1827–1836. [Google Scholar] [CrossRef]
  22. Xu, B.; Chen, Z. Multi-level Fusion Based 3D Object Detection from Monocular Images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 2345–2353. [Google Scholar] [CrossRef]
  23. Chang, J.R.; Chen, Y.S. Pyramid Stereo Matching Network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 5410–5418. [Google Scholar] [CrossRef]
  24. Bello, S.A.; Yu, S.; Wang, C.; Adam, J.M.; Li, J. Review: Deep Learning on 3D Point Clouds. Remote Sens. 2020, 12, 1729. [Google Scholar] [CrossRef]
  25. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 77–85. [Google Scholar] [CrossRef]
  26. Beltran, J.; Guindel, C.; Moreno, F.M.; Cruzado, D.; Garcia, F.; De La Escalera, A. BirdNet: A 3D Object Detection Framework from LiDAR Information. In Proceedings of the IEEE International Conference Intelligent Transportation Systems (ITSC), Maui, HI, USA, 4–7 November 2018; pp. 3517–3523. [Google Scholar] [CrossRef]
  27. Liu, T.; Yang, B.; Liu, H.; Ju, J.; Tang, J.; Subramanian, S.; Zhang, Z. GMDL: Toward precise head pose estimation via Gaussian mixed distribution learning for students’ attention understanding. Infrared Phys. Technol. 2022, 122, 104099. [Google Scholar] [CrossRef]
  28. Liu, T.; Wang, J.; Yang, B.; Wang, X. NGDNet: Nonuniform Gaussian-label distribution learning for infrared head pose estimation and on-task behavior understanding in the classroom. Neurocomputing 2021, 436, 210–220. [Google Scholar] [CrossRef]
  29. Meyer, G.P.; Laddha, A.; Kee, E.; Vallespi-Gonzalez, C.; Wellington, C.K. LaserNet: An Efficient Probabilistic 3D Object Detector for Autonomous Driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 12669–12678. [Google Scholar] [CrossRef]
  30. Gigli, L.; Kiran, B.R.; Paul, T.; Serna, A.; Vemuri, N.; Marcotegui, B.; Velasco-Forero, S. Road segmentation on low resolution LiDAR point clouds for autonomous vehicles. arXiv 2020, arXiv:2005.13102. [Google Scholar] [CrossRef]
  31. Engelcke, M.; Rao, D.; Wang, D.Z.; Tong, C.H.; Posner, I. Vote3Deep: Fast object detection in 3D point clouds using efficient convolutional neural networks. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 1355–1361. [Google Scholar] [CrossRef]
  32. Xu, D.; Anguelov, D.; Jain, A. PointFusion: Deep Sensor Fusion for 3D Bounding Box Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 244–253. [Google Scholar] [CrossRef]
  33. Ku, J.; Mozifian, M.; Lee, J.; Harakeh, A.; Waslander, S.L. Joint 3D Proposal Generation and Object Detection from View Aggregation. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 1–8. [Google Scholar] [CrossRef]
  34. Zhou, Y.; Tuzel, O. VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 4490–4499. [Google Scholar] [CrossRef]
  35. Zhu, H.; Deng, J.; Zhang, Y.; Ji, J.; Mao, Q.; Li, H.; Zhang, Y. VPFNet: Improving 3D Object Detection with Virtual Point based LiDAR and Stereo Data Fusion. IEEE Trans. Multimedia 2022, 1–14. [Google Scholar] [CrossRef]
  36. Hameed, I.; Sharpe, S.; Barcklow, D.; Au-Yeung, J.; Verma, S.; Huang, J.; Barr, B.; Bruss, C.B. BASED-XAI: Breaking Ablation Studies Down for Explainable Artificial Intelligence. arXiv 2022, arXiv:2207.05566. [Google Scholar] [CrossRef]
  37. Liu, T.; Wang, J.; Yang, B.; Wang, X. Facial expression recognition method with multi-label distribution learning for non-verbal behavior understanding in the classroom. Infrared Phys. Technol. 2021, 112, 103594. [Google Scholar] [CrossRef]
  38. Li, X.; Li, T.; Li, S.; Tian, B.; Ju, J.; Liu, T.; Liu, H. Learning fusion feature representation for garbage image classification model in human–robot interaction. Infrared Phys. Technol. 2023, 128, 104457. [Google Scholar] [CrossRef]
  39. Alzubaidi, L.; Zhang, J.; Humaidi, A.J.; Al-Dujaili, A.; Duan, Y.; Al-Shamma, O.; Santamaría, J.; Fadhel, M.A.; Al-Amidie, M.; Farhan, L. Review of deep learning: Concepts, CNN architectures, challenges, applications, future directions. J. Big Data 2021, 8, 1–74. [Google Scholar] [CrossRef] [PubMed]
  40. Thompson, N.C.; Greenewald, K.; Lee, K.; Manso, G.F. The computational limits of deep learning. arXiv 2020, arXiv:2007.05558. [Google Scholar] [CrossRef]
  41. Cvišić, I.; Marković, I.; Petrović, I. Recalibrating the KITTI dataset camera setup for improved odometry accuracy. In Proceedings of the 2021 European Conference on Mobile Robots (ECMR), Bonn, Germany, 31 August–3 September 2021; pp. 1–6. [Google Scholar]
  42. Yeong, D.J.; Velasco-Hernandez, G.; Barry, J.; Walsh, J. Sensor and Sensor Fusion Technology in Autonomous Vehicles: A Review. Sensors 2021, 21, 2140. [Google Scholar] [CrossRef] [PubMed]
  43. Alpha Prime. Available online: https://velodynelidar.com/products/alpha-prime/ (accessed on 10 January 2023).
  44. AT128—HESAI. Available online: https://www.hesaitech.com/en/AT128 (accessed on 17 February 2023).
  45. Pandar128—HESAI. Available online: https://www.hesaitech.com/en/Pandar128 (accessed on 17 February 2023).
  46. Pandar64—HESAI. Available online: https://www.hesaitech.com/en/Pandar64 (accessed on 17 February 2023).
  47. Velodyne’s HDL-32E Surround LiDAR Sensor. Available online: https://velodynelidar.com/products/hdl-32e/ (accessed on 17 February 2023).
  48. RS-LiDAR-32-RoboSense LiDAR—Autonomous Driving, Robots, V2X. Available online: https://www.robosense.ai/en/rslidar/RS-LiDAR-32 (accessed on 20 February 2023).
  49. Puck LiDAR Sensor, High-Value Surround LiDAR. Available online: https://velodynelidar.com/products/puck/ (accessed on 20 February 2023).
  50. LS LiDAR Product Guide. Available online: https://www.lidarsolutions.com.au/wp-content/uploads/2020/08/LeishenLiDARProductguideV5.2.pdf (accessed on 20 February 2023).
  51. Betke, M.; Wu, Z. Data Association for Multi-Object Visual Tracking. Springer International Publishing: Cham, Switzerland, 2017; pp. 29–35. [Google Scholar] [CrossRef]
  52. Everingham, M.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes (VOC) Challenge. Int. J. Comput. Vision 2010, 88, 303–338. [Google Scholar] [CrossRef]
  53. Boyd, K.; Eng, K.H.; Page, C.D. Area under the Precision-Recall Curve: Point Estimates and Confidence Intervals. In Proceedings of the Machine Learning and Knowledge Discovery in Databases, Prague, Czech Republic, 22–26 September 2013; pp. 451–466. [Google Scholar] [CrossRef]
  54. Mai, N.A.M.; Duthon, P.; Salmane, P.H.; Khoudour, L.; Crouzil, A.; Velastin, S.A. Camera and LiDAR analysis for 3D object detection in foggy weather conditions. In Proceedings of the of the International Conference on Pattern Recognition Systems (ICPRS), Saint-Etienne, France, 7–10 June 2022; pp. 1–7. [Google Scholar] [CrossRef]
  55. Waymo Dataset. Available online: https://waymo.com/open/ (accessed on 17 February 2023).
  56. Nuscenes Dataset. Available online: https://www.nuscenes.org/ (accessed on 20 February 2023).
  57. Argoverse2 Dataset. Available online: https://www.argoverse.org/av2.html (accessed on 20 February 2023).
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
