Evaluation of 3D Vulnerable Objects’ Detection Using a Multi-Sensors System for Autonomous Vehicles

One of the primary tasks undertaken by autonomous vehicles (AVs) is object detection, which precedes object tracking, trajectory estimation, and collision avoidance. Vulnerable road objects (e.g., pedestrians and cyclists) pose a greater challenge to the reliability of object detection because their behavior changes continuously. Most commercially available AVs, and most research into them, depend on expensive sensors, which hinders further research on AV operation. In this paper, therefore, we focus on the use of a lower-cost single-beam LiDAR in addition to a monocular camera to achieve multiple 3D vulnerable object detection in real driving scenarios while maintaining real-time performance. This research also addresses the problems faced during object detection, such as the complex interaction between objects where occlusion and truncation occur, and the dynamic changes in the perspective and scale of bounding boxes. The video-processing module is built on a deep-learning detector (YOLOv3), while the LiDAR measurements are pre-processed and grouped into clusters. The output of the proposed system is object classification and localization in the form of bounding boxes accompanied by a third, depth dimension acquired by the LiDAR. Real-time tests show that the system can efficiently detect the 3D location of vulnerable objects in real-time scenarios.


Introduction
Autonomous vehicles (AVs) have been considered a major research subject in recent years due to their multiple benefits. The average driver in England spends 235 h driving every year [1]; therefore, AVs offer passengers extra free time during their journeys. They also offer mobility to those who cannot drive, they reduce emissions and congestion, and they have the potential to enhance road safety [1][2][3][4]. Since the early 1990s, AVs have been the focus of attention in many research fields. Thus, several highly automated driver assistance capabilities have reached mass production.
In addition to the advantages of AVs, there are some challenges facing their widespread use, such as: legal terms, cybersecurity, traffic management strategies, moral and ethical challenges, and operational challenges [5,6].
The National Highway reports that 76% of all accidents are based solely on human error, while 94% involve human error [1]. Furthermore, in 2019, 25,080 motor vehicle fatalities were recorded by the Department of Transport in the United Kingdom [6]. The autonomous driving operation can be summarized in the following steps [7][8][9]

Automation Levels
Due to the differences in terminology used to describe autonomous driving, the Society of Automotive Engineers (SAE) has established a ranking for autonomous driving [10], which ranges from Level 0 (no automation) to Level 5 (automation under any Operational Design Domain). To date, no Level 5 AVs have reached the market; however, there are concept cars, such as the Mercedes Benz S-Class, the VW Sedric, and the Rinspeed Snap, which are expected to be available by 2030. Ahangar et al. have explored the technical evolution in autonomous cars in [6].

Autonomous Vehicles' Sensory Systems
Environment perception is achieved using the appropriate exteroceptive sensory system. Examples of exteroceptive sensors include: monocular and stereo-cameras, short-and long-range RADARs, ultrasonic sensors, and LiDARs, which is short for Light Detection and Ranging.

Related Work
In this subsection, we discuss the efficacy of images and point clouds when used in isolation and when integrated together in AVs in order to achieve object detection.

Images Acquired by Cameras
Cameras are considered the primary vision sensor used for object detection for two reasons: they are one of the cheapest sensors that can be used on AVs, and they can acquire rich texture information. However, monocular cameras suffer from the lack of a third dimension for the detection of objects.
Three-dimensional object detection can still be achieved by extrapolating the detected 2D bounding boxes using reprojection constraints or regression models; nevertheless, the accuracy of the resulting depth calculations is low.
The stereo camera, on the other hand, a more expensive alternative, provides distance calculations but with higher computational requirements. Multiple monocular cameras have been used in [20] to achieve multi-object tracking. Additionally, in [21,22], various algorithms were developed to perform object detection and localization. So far, though, results have suffered from relatively low accuracy in depth estimation, especially at longer ranges.

Point Clouds Acquired by LiDARs
LiDAR uses the Time of Flight (ToF) principle to measure the distance between the sensor and the detected objects. The maximum working detection range of LiDARs is 200 m [23]. LiDARs can withstand different weather and lighting conditions. Different LiDAR types project a different number of laser beams. Two-dimensional LiDARs project a single beam onto a rotating mirror, while 3D LiDARs use multiple laser diodes that rotate at a very high speed; the higher the number of laser diodes, the more measurements can be acquired and the more accurate the perception task becomes [24]. Multiple 2D (single-beam) LiDARs have been used in vehicle detection [25] and pedestrian detection [26,27] by applying pattern recognition techniques; however, this approach restricts detection to a small set of object classes.
There are three main methods for achieving 3D detection using point clouds [11]:
• Projection of a point cloud into a 2D plane in order to apply 2D detection frameworks to acquire 3D localization on projected images.
• Volumetric methods by voxelization [30,31]. However, 3D convolutional operations are computationally expensive.
• The use of PointNets [32][33][34] by applying raw point clouds directly to predict 3D bounding boxes. This method is also computationally expensive and increases running time.

Sensor Fusion
Using a single type of sensor has proven to be insufficient and unreliable; sensor fusion is therefore mandatory in order to overcome these limitations. As a result of using multiple sensors, sensor fusion enhances the reliability and accuracy of measurements and reduces their uncertainty [5].
Many papers have applied sensor fusion to multi-beam LiDARs and cameras to achieve obstacle detection and avoidance. In these systems, the LiDAR is responsible for detecting the accurate position of objects, while the camera detects their features and class. Responding to the object detection problem, Han et al. [35] developed a framework that applied decision-level sensor fusion techniques on a Velodyne 64-beam LiDAR with an RGB camera in order to improve the detection of dim objects such as pedestrians and cyclists. Additionally, a 3D object detector that processes in a Bird's Eye View (BEV) is outlined in [36]; it fuses image features by learning to project them into the BEV space. Some approaches have targeted the detection of specific object classes: pedestrian pattern matching and recognition [37], vehicle detection [38], and passive beacon detection [39]. However, all of these papers either used expensive 3D LiDARs, which acquire extensive amounts of data, or suffered from limitations on the classes of detected objects. Table 1 lists the most popular and recent 3D object detection networks and frameworks along with their limitations.
Although LiDAR-based 3D detections have attracted many researchers, point clouds still lack the texture information that enables them to classify objects. Moreover, point clouds suffer from sparsity and decreased density when detecting distant objects. In this paper, therefore, a 2D LiDAR and a monocular camera are fused together in order to achieve real-time dynamic object detection for AVs. This research acts as a foundation for the employment of 2D LiDARs on AVs as a lower cost substitute for 3D LiDARs. It also addresses the challenge of the presence of multiple overlapping moving objects in the same scene with real-time constraints.

Paper Organization
The rest of this paper is organized as follows: Section 2 discusses the real-time object detection module using a monocular camera; Section 3 illustrates how the LiDAR measurements were processed; Section 4 explains the fusion methodology between the video-processing module and LiDAR measurements; Section 5 shows and discusses the results obtained from the work; finally, Section 6 presents the paper's conclusion and discussion.

Table 1. Three-dimensional object detection networks and frameworks.

Paper Modality Limitation
Multi-task multi-sensor fusion for 3D object detection [40] RGB

Object detection can be defined as the process of detecting, localizing, and identifying the class of detected objects. Object detection methods output bounding boxes around detected objects, along with an associated predicted class and confidence score [19]. Different criteria affect the choice of the object detection algorithm, and, as a result, diverse driving scenarios impose different object detection challenges. For example:
• Variable weather and lighting conditions.
• Reflective objects.
• Diverse object sizes.
• The occlusion and truncation of obstacles.
In autonomous driving, objects that need to be detected are either static or dynamic. Traffic lights and signs, buildings, bridges, and curbs are considered static objects. Pedestrians, cyclists, animals, and vehicles, on the other hand, are considered dynamic objects due to their continuously varying locations and features. The detection of static objects is considered a straightforward task, which has been addressed in many previous studies (examples are shown in [45][46][47][48]). In this paper, therefore, we focus on the detection of vulnerable objects (i.e., dynamic objects) due to the greater levels of danger they pose during an AV's driving process.
In this research, a pre-trained deep-learning (DL)-based real-time object detection network, namely YOLOv3, is employed. YOLOv3 is built on Darknet, a neural network framework created by Joseph Redmon [49]. Darknet is an open-source framework written in C and CUDA, and it serves as the basis for YOLO. The original repository can be found in [50]. YOLOv3's object detection network outputs 2D bounding boxes along with the classification of the detected objects. The model we used is pre-trained on KITTI [51,52], the largest computer vision evaluation dataset for autonomous driving scenarios in the world. It contains 7481 frames of training data and 7518 frames of test data. It has nine classes of labelled objects, which we merged into six classes (Car, Van, Truck, Tram, Pedestrian, Cyclist).

Overlapping Detection
In order to use the 2D LiDAR accurately in diverse and challenging driving environments, and taking into consideration the presence of multiple different objects that could be overlapping and interacting with each other, an overlapping detection algorithm was necessary in order to detect which objects were overlapping with each other and forming clusters of objects. This algorithm will come into use during the LiDAR and camera integration step below.
In order to perform the overlapping detection, two attributes were added to each detected object: the first one holds the pixel ranges which are overlapping with other detected objects ($P_{OL}$), and the second one holds the IDs of other objects that are sharing the pixels ($P_{OL}$) with the current object.
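The two attributes described above can be illustrated with a minimal Python sketch. The class name `Detection`, the field names, and the restriction to horizontal pixel ranges are assumptions made for illustration (the horizontal restriction is motivated by the single scan plane of a 2D LiDAR); the paper does not specify the data structure it uses.

```python
from dataclasses import dataclass, field


@dataclass
class Detection:
    """A 2D detection; only the horizontal extent is kept (assumption)."""
    obj_id: int
    x_left: int    # left edge of the bounding box, in pixels
    x_right: int   # right edge of the bounding box, in pixels
    overlap_ranges: list = field(default_factory=list)  # P_OL: shared pixel ranges
    overlap_ids: list = field(default_factory=list)     # IDs of objects sharing P_OL


def annotate_overlaps(detections):
    """Fill both overlap attributes for every pair of overlapping boxes."""
    for i, a in enumerate(detections):
        for b in detections[i + 1:]:
            left = max(a.x_left, b.x_left)
            right = min(a.x_right, b.x_right)
            if left <= right:  # the horizontal extents intersect
                a.overlap_ranges.append((left, right))
                a.overlap_ids.append(b.obj_id)
                b.overlap_ranges.append((left, right))
                b.overlap_ids.append(a.obj_id)
    return detections


boxes = annotate_overlaps([
    Detection(0, 100, 300),
    Detection(1, 250, 400),
    Detection(2, 500, 600),
])
```

In this example, objects 0 and 1 share the pixel range 250 to 300 and therefore form one cluster of objects, while object 2 stays isolated.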

Hokuyo UTM-30LX
The LiDAR used in this research is the Hokuyo UTM-30LX (shown in Figure 1). It is a 2D radial LiDAR that measures 1081 distance points in a range from −135° to 135°, where orientation 0° corresponds to the front of the LiDAR. The following represent its other specifications [

Conversion of Radial Measurements into Perpendicular Measurements
LiDAR works by measuring the distances in an angular rotational pattern; hence, the measurements acquired are radial, as shown in Figure 2. In order to normalize the LiDAR measurements, these radial measurements must therefore be converted into perpendicular measurements (as shown in Equation (1)).
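As an illustration of this conversion, the sketch below maps a measurement index to its scan angle using the figures given for the Hokuyo UTM-30LX (1081 points from −135° to 135°) and then projects the radial distance onto the sensor's forward axis. The cosine projection is an assumed form of Equation (1), which is not reproduced in this text, and the function names are hypothetical.

```python
import math

NUM_POINTS = 1081          # measurements per scan (from the text)
START_ANGLE_DEG = -135.0   # angle of the first measurement (from the text)
END_ANGLE_DEG = 135.0      # angle of the last measurement (from the text)
# Angular step between consecutive measurements: 270 / 1080 = 0.25 degrees.
STEP_DEG = (END_ANGLE_DEG - START_ANGLE_DEG) / (NUM_POINTS - 1)


def index_to_angle_deg(index):
    """Scan angle of a measurement; 0 degrees is the front of the LiDAR."""
    return START_ANGLE_DEG + index * STEP_DEG


def radial_to_perpendicular(radial_distance, angle_deg):
    """Assumed form of Equation (1): project the range onto the forward axis."""
    return radial_distance * math.cos(math.radians(angle_deg))
```

For example, a radial reading of 2 m taken at 60° corresponds to a perpendicular distance of about 1 m under this assumption.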

Linearization and Smoothing
Data acquired by the LiDAR for objects bounded by the bounding boxes cannot be considered a straightforward distance to be added as a third dimension of the detected objects. This is due to many factors; for example, the uneven surfaces of the detected objects, the sensors' uncertainty, and the continuous overlapping and truncation of objects. Therefore, data acquired by the LiDAR are smoothed and linearized to give a better understanding of the surrounding objects.
LiDAR measurements are known to have a Gaussian noise distribution with an uncertainty of approximately ±3 cm. Therefore, a filter with a Gaussian impulse response function is a good candidate for noise suppression. Objects' surfaces in real scenarios include both flat and rough surfaces or edges. Due to the sparsity of LiDAR measurements of distant objects, it is also assumed that the empty-depth pixels contain the same or similar measurements, which also need to be reconstructed. Object edges are observed as a discontinuity in the LiDAR measurements, while flat surfaces have smoothly varying values. Because a Gaussian filter would blur these discontinuities, it is natural that the reconstruction of the depth measurements uses a form of edge-preserving filter. In order to maintain real-time performance, a median filter was used to perform the filtering while preserving edge features.
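The edge-preserving behavior of a median filter can be seen in a short sketch. The window size and the sample values below are illustrative assumptions, not parameters reported in the paper.

```python
from statistics import median


def median_filter(values, window=5):
    """Sliding-window median; the window shrinks at the scan borders."""
    half = window // 2
    filtered = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        filtered.append(median(values[lo:hi]))
    return filtered


# A noisy spike at index 2 and a genuine depth edge between indices 4 and 5.
scan = [2.00, 2.01, 2.60, 1.99, 2.00, 5.00, 5.01, 4.99, 5.00]
smoothed = median_filter(scan, window=3)
```

The isolated spike is replaced by a neighboring value, whereas the step from roughly 2 m to roughly 5 m stays sharp, which a Gaussian smoothing kernel would not guarantee.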

Grouping of LiDAR Measurements into Clusters with Unique IDs
Different studies have addressed the problem of object segmentation on LiDAR point clouds [25,28,29,54,55]; however, they either used 3D point clouds or assumed that the world consists of separate objects that are not physically overlapping with each other. In this paper, we address the challenge of having multiple dynamic and interacting/overlapping objects.
After filtering LiDAR measurements, clustering is performed by grouping similar neighboring data readings and assigning a unique ID for each of them along with an average distance value. The two main variables in this step are:
• Minimum cluster size: in order to avoid the creation of numerous unneeded mini-clusters that may represent objects' subregions, different cluster sizes were tested. The smaller the size of clusters, the more false clusters were created.
• Difference threshold: the difference between consecutive readings that marks the edge between consecutive clusters.
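The grouping step can be sketched as follows. The threshold of 0.3 m, the minimum cluster size of 3, and the dictionary layout are assumptions chosen for illustration; the paper does not report the values it finally used.

```python
def cluster_scan(distances, edge_threshold=0.3, min_cluster_size=3):
    """Group consecutive readings into clusters with an ID and a mean distance."""
    clusters = []
    start = 0
    for i in range(1, len(distances) + 1):
        # Close the current group at the end of the scan or at a depth jump.
        at_end = i == len(distances)
        if at_end or abs(distances[i] - distances[i - 1]) > edge_threshold:
            size = i - start
            if size >= min_cluster_size:  # discard mini-clusters
                group = distances[start:i]
                clusters.append({
                    "id": len(clusters),
                    "start": start,          # first scan index in the cluster
                    "end": i - 1,            # last scan index in the cluster
                    "mean_distance": sum(group) / size,
                })
            start = i
    return clusters


readings = [2.0, 2.1, 2.0, 2.1, 6.0, 4.0, 4.1, 4.0]
clusters = cluster_scan(readings)
```

Here the single reading of 6.0 m forms a group that is smaller than the minimum cluster size and is therefore dropped, leaving two clusters.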

Sensor Placement
The vehicle's ground clearance (i.e., ride height) must be considered during the placement of sensors: it is the shortest distance between a flat-level surface and the lowest part of a vehicle, other than those parts designed to be in contact with the ground such as tires and SIS. Eighty-two different vehicles (including SUVs) were surveyed in order to estimate the average ground clearance of vehicles in the UK so as to place the LiDAR at a height between the maximum ground clearance and the minimum vehicle height. It was concluded that the LiDAR's optimal height was 559 mm above the ground. In the proposed setup, the camera and the LiDAR should have a common horizontal center point.

Mapping between Image and LiDAR Coordinates
The output from the video-processing module consists of:
• Two-dimensional bounding boxes drawn over image pixels.
• Object classes.
In order to apply complementary sensor fusion between pixels (bounding boxes) and LiDAR measurements, a mapping between image pixels and real-world angular coordinates is necessary. As we are using a 2D LiDAR, we are only concerned with the horizontal plane (x-axis), as the LiDAR has a constant vertical value.
Based on the pinhole camera model shown in Figure 3, a function was developed in order to convert image pixels into angular rotations. Inputs to this function are:
• Width of the frame (FrameWidth).
• Horizontal field of view of the camera (HFOV).
Assuming a straight line of length $d$ is drawn between the camera and the center of the image, two right-angled triangles can be drawn:
1. The hypotenuse goes from the camera to the edge of the image and forms an angle $\theta$ with $d$, where $\theta = \mathrm{HFOV}/2$.
2. The hypotenuse goes from the camera to xPixel and forms an angle $\varphi$ with $d$.
In this setup, the angle $\varphi$ must be calculated. The following trigonometric calculation is used to determine $\varphi$:
$$x = \mathrm{xPixel} - \mathrm{FrameCenter} \tag{4}$$
$$\tan\theta = \frac{\mathrm{FrameWidth}/2}{d} \tag{5}$$
$$\tan\varphi = \frac{x}{d} \tag{6}$$
Making use of the common $d$ in both Equations (5) and (6), we solve both for $d$ and equate the results, namely:
$$\varphi = \arctan\left(\frac{2x\,\tan(\mathrm{HFOV}/2)}{\mathrm{FrameWidth}}\right) \tag{7}$$
This process is performed on the left and right x-coordinates of each bounding box in order to convert the bounding boxes' horizontal pixel values into real-world angular coordinates.
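The pixel-to-angle conversion can be sketched in Python under the pinhole geometry described above. The function name, the symbol `d` for the camera-to-image-center distance (expressed in pixels), and the sign convention (positive angles to the right of the image center) are assumptions for illustration; the sign may need to be flipped to match the LiDAR's own angular convention.

```python
import math


def pixel_to_angle_deg(x_pixel, frame_width, hfov_deg):
    """Horizontal angle of an image column relative to the optical axis."""
    frame_center = frame_width / 2.0
    x = x_pixel - frame_center  # horizontal offset from the image center
    # Distance from the camera to the image center, in pixel units,
    # obtained from the triangle that spans half the field of view.
    d = frame_center / math.tan(math.radians(hfov_deg) / 2.0)
    return math.degrees(math.atan(x / d))


# Angular span of a bounding box in a 640-pixel-wide frame with a 60 degree HFOV.
left_deg = pixel_to_angle_deg(200, 640, 60.0)
right_deg = pixel_to_angle_deg(320, 640, 60.0)
```

The image center maps to 0° and the image edges map to plus or minus half the horizontal field of view, which gives a quick sanity check on the calibration values.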

Complementary Camera and LiDAR Fusion
The direct fusion between the video and LiDAR data is a straightforward task that outputs a horizontal line of pixels with an associated distance measurement. However, the target of this process is to associate a depth measurement with the 2D bounding boxes. Therefore, when fusion is performed between bounding boxes and LiDAR measurements, the result is bounding boxes with an associated distance measurement. This task involves one main challenge: objects are normally overlapping with each other; therefore, the bounding boxes are not separate from each other, and pixels bounded by bounding boxes may include LiDAR measurements corresponding to multiple objects. Figure 4 is a block diagram that illustrates the fusion process between the video and LiDAR data-processing modules. In this step, sensor fusion is performed between the bounding boxes generated by the video-processing module and the clusters generated by the LiDAR data-processing module. There are multiple instances of overlapping between objects:
• Object 'x' is fully in front of object 'y' (object 'x' is smaller than object 'y') (as shown in Figure 5a).
• Object 'x' is partially in front of object 'y' (as shown in Figure 5b).
• Object 'x' is partially behind object 'y' (as shown in Figure 5c).
In this operation, the algorithm analyzes the LiDAR clusters associated with the bounding boxes of each detected object. The flowchart for this operation is shown in Figure 6. This operation will complement the real-time object detection step made by the video-processing module because the bounding boxes are generally bigger than the true boundaries of the detected objects.
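To make the association between bounding boxes and LiDAR clusters concrete, the following sketch assigns to a bounding box the distance of the cluster that covers the largest share of the box's angular span. This selection rule is only a simple placeholder assumed for illustration; it is not the decision logic of the flowchart in Figure 6, which is not reproduced in this text, and the field names `start_deg`, `end_deg`, and `mean_distance` are hypothetical.

```python
def angular_overlap(a_start, a_end, b_start, b_end):
    """Width in degrees of the intersection of two angular intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_depth(box_span, clusters):
    """Return (cluster_id, distance) of the best-covering cluster, or None."""
    best = None
    best_cover = 0.0
    for c in clusters:
        cover = angular_overlap(box_span[0], box_span[1],
                                c["start_deg"], c["end_deg"])
        if cover > best_cover:
            best, best_cover = c, cover
    if best is None:
        return None  # no LiDAR cluster falls inside this bounding box
    return best["id"], best["mean_distance"]


lidar_clusters = [
    {"id": 0, "start_deg": -10.0, "end_deg": -2.0, "mean_distance": 4.0},
    {"id": 1, "start_deg": -2.0, "end_deg": 12.0, "mean_distance": 9.5},
]
near_object = assign_depth((-12.0, -1.0), lidar_clusters)
no_match = assign_depth((20.0, 30.0), lidar_clusters)
```

A complete implementation would additionally use the overlap attributes of each detection to decide which cluster belongs to which object when one bounding box contains clusters from several objects.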

Real-Time Object Detection
YOLOv3 has been chosen as the real-time object detector along with the KITTI dataset [51,52]. Our platform is configured with an Intel® Core™ i7-8750H CPU and an NVIDIA GeForce GTX 1050Ti GPU, which is considered an average-performance GPU. When YOLOv3 is tested on the KITTI raw dataset, it achieves the results shown in Table 2.
There are object detectors that perform better on the KITTI dataset (e.g., Faster R-CNN); however, due to their slow execution speed, they cannot be used in real-time autonomous driving scenarios. Further comparisons between YOLOv3 and other deep learning object detection methods on different datasets are presented in [5,19].

Processing of LiDAR Measurements
The first step in processing LiDAR measurements is performing median filtering in order to smooth the measurements while maintaining the edges. Figure 7a shows a sample of LiDAR measurements of rough surfaces before filtering, and Figure 7b shows the same measurements after filtering. The second step is dividing the LiDAR measurements into groups and assigning each group a unique ID. Figure 8 shows a sample of LiDAR measurements when two cars were present (one car is partially in front of the other). The output showed the detection of four clusters. Due to the lack of available 2D LiDAR point-cloud datasets, all the testing was performed in real-time driving scenarios, and the performance was manually measured and validated.

Adding a Third Dimension to Visual Bounding Boxes
The last step was making use of the LiDAR measurements (after filtering and grouping) in order to add a third dimension (depth) to the 2D bounding boxes generated by the real-time visual object detector. The system was tested in real-time scenarios, and it coped with the real-time constraints by running at 18 FPS while maintaining dynamic object detection and adding a depth dimension to the bounding boxes. The achieved running time of the proposed system is a major advancement compared to other approaches (refer to Table 1). The system was tested on moving vehicles, pedestrians, and cyclists in dynamic driving scenarios in which objects were overlapping and interacting with each other. However, one limitation of the system is its sensitivity to weather conditions, as the video-processing module is not robust enough for adverse weather conditions such as rain, snow, and fog.

Conclusions
A monocular vision-based system is inexpensive and can achieve the required accuracy for obstacle detection in autonomous vehicles, but it only gives a 2D localization of objects. Therefore, a range-finder sensor should be used. However, 3D LiDARs are expensive and are hindering the widespread rollout of autonomous driving in both industry and research. In this study, a 2D LiDAR was adopted to develop a prototype for achieving reliable real-time multiple object detection in driving scenarios using lower cost sensors.

Limitations and Future Work
The proposed research encourages the use of low-cost 2D LiDARs in AVs, which could advance the adoption of autonomous driving in more vehicles. One limitation of the proposed method is that its performance is bound by the performance of the video-processing module (i.e., YOLOv3); therefore, further work should be conducted towards improving this, such as applying de-raining techniques. In terms of future work on the problem of multiple object detection based on the proposed research, the following approaches could be pursued:

• The use of multiple cameras in order to cover a wider horizontal field of view without causing much image distortion.
• Since the KITTI dataset only has daytime driving data, we suggest evaluating the real-time image-based object detection module on the Waymo Open Dataset.