Article

A LiDAR Multi-Object Detection Algorithm for Autonomous Driving

School of Communication and Information Engineering, Xi’an University of Science and Technology, Xi’an 710600, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(23), 12747; https://doi.org/10.3390/app132312747
Submission received: 5 October 2023 / Revised: 19 November 2023 / Accepted: 21 November 2023 / Published: 28 November 2023

Abstract
Three-dimensional object detection is the core of an autonomous driving perception system: it detects and analyzes targets around the vehicle to obtain their sizes, shapes, and categories, providing reliable operational decisions for autonomous driving. To improve the detection and localization accuracy of multiple object classes such as surrounding vehicles and pedestrians in autonomous driving scenarios, a three-dimensional object detection algorithm based on a channel attention mechanism, ECA Modules-PointPillars, is proposed on top of the fast PointPillars object detection network. The improved algorithm first uses pillar-based point cloud features to convert the three-dimensional point cloud into a two-dimensional pseudo-image. It then combines the 2D backbone network for feature extraction with Efficient Channel Attention (ECA) modules to enhance the positional feature information in the pseudo-image and weaken irrelevant feature information such as background noise. Finally, the single-shot multibox detector (SSD) algorithm is used to complete the 3D object detection task. The experimental results show that, compared with PointPillars, the improved algorithm improves the mAP by 3.84% in BEV mode and 4.04% in 3D mode; compared with F-PointNet, by 4.64% and 5.89%; compared with VoxelNet, by 11.78% and 14.19%; and compared with SECOND, by 9.47% and 6.55%, demonstrating the effectiveness and reliability of the improved algorithm in autonomous driving scenarios.

1. Introduction

Three-dimensional object detection is the core of an autonomous driving perception system, which detects and analyzes targets around the vehicle to obtain their size, shape, category, and motion characteristics, providing reliable operational decisions for realizing autonomous driving [1,2,3]. In recent years, researchers worldwide have proposed a variety of effective and feasible 3D object detection algorithms, categorized by the sensors used in the visual perception system; among them, methods based on LiDAR point clouds have been studied extensively owing to their excellent detection performance [4].
Depending on how the point cloud is processed, these approaches are further categorized into multi-view-based, point-based, and voxel-based methods [5]. Multi-view-based methods first convert the point cloud into a front view or bird's eye view (BEV) and then use mature 2D object detection algorithms for feature extraction [6]. For example, VeloFCN [7] first converts the point cloud into a front-view 2D feature map and then applies a 2D object detection algorithm; combined with VoxelNet [8] and Fully Convolutional Networks (FCNs), it performs well in multi-category, multi-object detection tasks and can detect pedestrians, vehicles, bicycles, and other objects simultaneously. However, the algorithm consumes a large amount of memory and computational resources, resulting in slower network training and inference, and it requires high-performance computing devices and longer training and testing time. PIXOR [9] maps the point cloud height information to color channels to obtain a bird's eye view representation and then applies a 2D detection network; it is applicable to various road lighting environments for cars, avoids the occlusion problem of the front view, and has high accuracy and robustness. However, the algorithm has high computational complexity and a large data volume, and it requires GPU acceleration.
Point-based methods process the sparse point cloud directly, which enhances point cloud feature extraction and characterization. For example, PointNet [10], proposed by Qi C R et al. from Stanford University, learns spatial information directly from the original point cloud without conversion to other forms and can be applied to various point cloud processing tasks. To address PointNet's insensitivity to local features and its poor local extraction, the PointNet++ [11] algorithm optimizes the handling of global features in PointNet and improves the ability to perceive both local and global structures through hierarchical aggregation and local feature extraction; it is also stable for data with noise and varying sampling density. However, the algorithm has difficulty capturing the local feature information of non-uniformly distributed or sparse point clouds in some complex and diverse 3D scene tasks. F-PointNet [12] introduces fine-grained feature representations and multi-scale contextual information, further improving the PointNet algorithm; by adding a branch network and extracting fine-grained feature representations of each point, it can effectively classify, segment, and estimate the point cloud with better performance [13]. However, for point clouds with different scales and densities the performance of this model may degrade, and further optimization is needed; the output of the model may also vary greatly with different orderings of the input point cloud, and its interpretability is poor.
Voxel-based methods usually convert irregular point clouds into regular voxel representations [14]. For example, Zhou Y et al. proposed an end-to-end deep neural network, VoxelNet [8], for 3D object detection in driverless scenarios, which preserves the original features of the point cloud; however, its traditional 3D convolutional network is computationally cumbersome and slow. To improve the speed of 3D convolution on point clouds, Yan Y et al. proposed the SECOND [15] method, which adapts well to complex scenes, handles vehicles, pedestrians, and buildings efficiently and almost in real time, and is applicable to all kinds of LiDAR data, giving it good versatility. However, the algorithm is based on a sliding window, which wastes some computational resources and reduces processing efficiency and speed. To accelerate the computation further, Lang A H et al. proposed the PointPillars [16] method, which structures the point cloud into pillars, transforms them into a two-dimensional pseudo-image, and performs feature extraction with a two-dimensional convolutional neural network. By discarding the traditional three-dimensional convolutional [17] layers, it reaches a detection speed of 62 FPS, addressing the slow computation that had kept 3D object detection from practical application, and it has become one of the more practical algorithms at present. Although all the above algorithms have achieved some success in point cloud object detection, there are still limitations and challenges. The sparsity and irregularity of point cloud data is a common challenge for these algorithms, calling for better feature extraction and representation methods, more powerful models and deep learning frameworks, and optimized computational algorithms.
In summary, 3D object detection algorithms can detect and analyze the size, shape, and category of targets around the vehicle, provide accurate spatial perception that improves detection accuracy and reliability, and play an important role in recognizing and localizing objects in autonomous driving scenes. Therefore, improving 3D object detection algorithms is of great significance for further enhancing the performance and safety of autonomous driving technology. In this paper, to address the information loss and inaccurate localization caused by the pillar division of the point cloud in the PointPillars algorithm, the Efficient Channel Attention module [18] is introduced to improve the original algorithm's 2D backbone network. The contributions of our work are:
(1)
Based on the PointPillars algorithm, ECA Modules–PointPillars is proposed by reconstructing the backbone network, connecting ECA modules in series with the convolutions of the down-sampling modules.
(2)
The improved algorithm is evaluated on the KITTI datasets, and the visual detection results of the algorithm are also shown for three scenarios: highways, rural roads, and urban roads.
(3)
The optimization effect of the proposed algorithm is analyzed, and a preliminary solution is proposed for its limited accuracy gains on the pedestrian category.

2. Methods

The PointPillars algorithm uses the pillar feature extraction method, which efficiently extracts information such as shape, position, and orientation from a point cloud and converts it into an efficient 3D feature representation. This representation allows the algorithm to keep the computational cost low while achieving high detection accuracy, improving its real-time performance.

2.1. PointPillars Algorithm

The PointPillars algorithm has a fast processing speed, producing real-time results even above the LiDAR scanning frequency. It consists of three main stages: (1) conversion of the point cloud into a pseudo-image through the Pillar Feature Net; (2) feature extraction by a 2D backbone while preserving the 3D features of the point cloud; (3) output of the detection results by the Detection Head, obtaining the position and attitude of the 3D bounding box [19]. The network architecture is shown in Figure 1.

2.1.1. Point Cloud Pseudo-Image

Phase 1: the point cloud is processed. Let a point $a$ of the point cloud have coordinates $(x, y, z, r)$, where $x$, $y$, and $z$ are the position coordinates of point $a$ and $r$ is the reflectance. First, the z-axis information is ignored and the point cloud is divided into an $H \times W$ grid on the top-view x-y plane; all points falling into the same grid cell constitute a pillar $P$. The points in each pillar are augmented with $x_c$, $y_c$, $z_c$, $x_p$, and $y_p$, i.e., the augmented point $a$ has coordinates $(x, y, z, r, x_c, y_c, z_c, x_p, y_p)$ and is now $D = 9$ dimensional. Here $(x_c, y_c, z_c)$ is the offset of point $a$ from the arithmetic mean of all points in the pillar, and $x_p$ and $y_p$ are the offsets of point $a$ from the pillar center in the x and y directions. Because of the sparsity of the point cloud, most of the generated pillars $P$ are empty, and a non-empty pillar $P$ contains only a few points. Therefore, if the number of points in a non-empty pillar is greater than $N$, they are randomly sampled down to $N$; otherwise, the pillar is zero-padded to $N$ points. A tensor of size $(D, P, N)$ is thus created, where $N$ is the number of sampled points and $P$ is the number of pillars.
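The following is a minimal NumPy sketch of this pillarization and point-augmentation step, assuming a point cloud array of shape (num_points, 4) with columns (x, y, z, r); the detection range, the helper name `pillarize`, and the default values P = 12,000 and N = 100 are illustrative assumptions rather than the exact implementation used in the paper.

```python
import numpy as np

def pillarize(points, x_range=(0.0, 69.12), y_range=(-39.68, 39.68),
              pillar_size=0.16, max_pillars=12000, max_points=100):
    # Keep only points inside the detection range on the x-y plane.
    keep = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
            (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]))
    points = points[keep]
    # Assign each point to a grid cell (the z axis is ignored for gridding).
    ix = ((points[:, 0] - x_range[0]) / pillar_size).astype(np.int64)
    iy = ((points[:, 1] - y_range[0]) / pillar_size).astype(np.int64)
    keys = ix * 100000 + iy                              # one id per occupied cell

    pillars = np.zeros((max_pillars, max_points, 9), dtype=np.float32)  # D = 9
    coords = np.zeros((max_pillars, 2), dtype=np.int64)
    for p, key in enumerate(np.unique(keys)[:max_pillars]):
        mask = keys == key
        pts = points[mask]
        cx, cy, cz = pts[:, :3].mean(axis=0)             # arithmetic mean of the pillar
        if len(pts) > max_points:                        # random sampling down to N
            pts = pts[np.random.choice(len(pts), max_points, replace=False)]
        # Pillar center in metric coordinates; x_p, y_p are offsets from it.
        px = x_range[0] + (ix[mask][0] + 0.5) * pillar_size
        py = y_range[0] + (iy[mask][0] + 0.5) * pillar_size
        aug = np.column_stack([pts,                      # x, y, z, r
                               pts[:, 0] - cx, pts[:, 1] - cy, pts[:, 2] - cz,
                               pts[:, 0] - px, pts[:, 1] - py])
        pillars[p, :len(aug)] = aug                      # zero-padded up to N points
        coords[p] = ix[mask][0], iy[mask][0]
    # Transposing the first array gives the (D, P, N) layout described in the text.
    return pillars, coords
```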
Phase 2: A simplified PointNet network is used to learn from the original D-dimensional data of each point to obtain C-dimensional features, to generate a tensor of size C , P , N . Then the maximum feature in N is extracted on the channel using the maximum pooling layer and calculated by forward propagation to obtain a tensor of C , P size. The feature vector output from the network is decoded into pictures [20,21,22] and the information such as the position, size, and orientation of the target object is restored in the two-dimensional image plane, thus realizing the conversion of 3D data to 2D pseudo-images [23]. The processing flow is shown in Figure 2.
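A minimal PyTorch sketch of this simplified PointNet encoder and the scatter step that forms the pseudo-image is given below; the class name `PillarFeatureNet`, the default channel count C = 64, and the batched (B, D, P, N) input layout are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class PillarFeatureNet(nn.Module):
    def __init__(self, in_dim=9, out_dim=64):
        super().__init__()
        # A 1x1 convolution acts as the shared per-point linear layer of the
        # simplified PointNet, followed by BatchNorm and ReLU.
        self.linear = nn.Conv2d(in_dim, out_dim, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_dim)

    def forward(self, pillars, coords, grid_hw):
        # pillars: (B, D, P, N) augmented points; coords: (B, P, 2) grid indices.
        x = torch.relu(self.bn(self.linear(pillars)))    # (B, C, P, N)
        x = x.max(dim=-1).values                         # max over the N points -> (B, C, P)
        H, W = grid_hw
        canvas = x.new_zeros(x.shape[0], x.shape[1], H, W)
        for b in range(x.shape[0]):                      # scatter each pillar back to its cell
            ys, xs = coords[b, :, 1].long(), coords[b, :, 0].long()
            canvas[b, :, ys, xs] = x[b]
        return canvas                                    # 2D pseudo-image (B, C, H, W)
```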

2.1.2. Two-Dimensional Backbone Network

The 2D backbone network is a pyramid structure containing two sub-networks as shown in Figure 3: (1) top-down extraction of features at smaller and smaller spatial resolutions, and (2) up-sampling and concatenation of the extracted features.
The top-down subnetwork consists of a series of convolutional blocks $\mathrm{Block}(S, L, F)$, where $S$ is the stride, $L$ is the number of $3 \times 3$ two-dimensional convolutional kernels in the block, and $F$ is the number of output channels. Each block is followed by a BatchNorm [24] and a ReLU [25] to accomplish pseudo-image feature extraction by convolution. The first convolution in a block has stride $S / S_{in}$, so that the block operates at stride $S$ after receiving an input with stride $S_{in}$, keeping the input and output sizes consistent. The subsequent convolutions in the block all have stride 1, which avoids over-compressing the input signal and losing important details through a large first convolution stride in the network layer.
The up-sampling network combines feature maps of different scales by up-sampling them to the same size through deconvolution and then concatenating them. First, the feature maps at different scales are up-sampled by $\mathrm{UP}(S_{in}, S_{out}, F)$, where $S_{in}$ is the initial stride and $S_{out}$ is the final stride. Then all the modules from the different strides are concatenated. The up-sampling network enlarges the receptive field of the model through the deconvolution operation and compensates for the information loss produced during down-sampling.
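A minimal PyTorch sketch of one top-down block $\mathrm{Block}(S, L, F)$ and one up-sampling module $\mathrm{UP}(S_{in}, S_{out}, F)$ follows; the helper names and the way the stride ratio is passed in are illustrative assumptions.

```python
import torch.nn as nn

def make_block(in_ch, out_ch, num_layers, stride):
    # The first convolution applies the stride ratio S / S_in; the remaining
    # L - 1 convolutions use stride 1, each followed by BatchNorm and ReLU.
    layers = [nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
              nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True)]
    for _ in range(num_layers - 1):
        layers += [nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1, bias=False),
                   nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

def make_up(in_ch, out_ch, stride):
    # Transposed convolution brings the feature map back to the common
    # resolution before the branches are concatenated along the channel axis.
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, kernel_size=stride, stride=stride, bias=False),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
```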

2.1.3. SSD Detection Head

The detection module adopts the classical single-stage detection algorithm SSD [26] to complete the classification and regression of the 2D bounding boxes, offering fast detection speed and high accuracy. To handle the large scale variations in point clouds, the SSD network adapts to the multi-scale object detection task by introducing the idea of anchors. It contains three main parts:
(1)
Backbone: SSD uses VGG16 as the base network; after the first five convolutional stages Conv1-Conv5, the fully connected layers FC6 and FC7 of VGG16 are converted into a 3 × 3 convolutional layer Conv6 and a 1 × 1 convolutional layer Conv7. This part performs the preliminary feature extraction from the image.
(2)
Extra Layers: Conv8, Conv9, Conv10, and Conv11 convolutional layers are added on top of the backbone to obtain more feature maps for detection.
(3)
Multi-box Layers: final object classification and bounding box regression, followed by non-maximum suppression.
Through the SSD algorithm, the 3D point cloud feature tensor obtained after feature extraction and voxel segmentation is transformed into a series of bounding box predictions. In practice, many of the large number of generated bounding box predictions are redundant or inaccurate, so classification and non-maximum suppression are then applied to determine the final detection results. PointPillars matches the output 2D detection boxes against the ground-truth boxes using Intersection over Union (IoU) [27], with the height handled separately as an additional regression target.
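The sketch below shows a PointPillars-style detection head in PyTorch: per-location 1 × 1 convolutions predict class scores, the seven box residuals $(x, y, z, w, l, h, \theta)$, and a two-bin direction classifier. The channel count of 384 and the six anchors per location (three classes at two orientations) are illustrative assumptions.

```python
import torch.nn as nn

class DetectionHead(nn.Module):
    def __init__(self, in_ch=384, num_anchors=6, num_classes=3):
        super().__init__()
        self.cls = nn.Conv2d(in_ch, num_anchors * num_classes, 1)  # class scores
        self.box = nn.Conv2d(in_ch, num_anchors * 7, 1)            # box residuals
        self.dir = nn.Conv2d(in_ch, num_anchors * 2, 1)            # orientation bins

    def forward(self, feats):
        # feats: concatenated backbone output of shape (B, in_ch, H, W).
        return self.cls(feats), self.box(feats), self.dir(feats)
```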

2.1.4. Loss Function

The output of the PointPillars network mainly includes the object class and the 3D bounding box parameters. The total loss function contains the 3D bounding box localization loss, the classification loss, and the orientation loss, as shown in Equation (1).
$L = \frac{1}{N_{pos}}\left(\beta_{loc} L_{loc} + \beta_{cls} L_{cls} + \beta_{dir} L_{dir}\right)$ (1)
$N_{pos}$ denotes the number of positive anchors, i.e., the number of generated boxes whose IoU exceeds the specified threshold; $L_{loc}$ is the regression loss of the 3D bounding box; $L_{cls}$ is the classification loss; $L_{dir}$ is the object orientation loss. $\beta_{loc} = 2$, $\beta_{cls} = 1$, and $\beta_{dir} = 0.2$ are the corresponding weight coefficients of the three loss terms.
The localization loss of the 3D bounding box contains the position regression loss, the size regression loss, and the orientation loss of the 3D bounding box.
$L_{loc} = \sum_{b \in (x, y, z, w, l, h, \theta)} \mathrm{SmoothL1}(\Delta b)$ (2)
$\Delta x = \frac{x_{gt} - x_{p}}{d_{p}}, \quad \Delta y = \frac{y_{gt} - y_{p}}{d_{p}}, \quad \Delta z = \frac{z_{gt} - z_{p}}{d_{p}}$ (3)
$\Delta w = \log\frac{w_{gt}}{w_{p}}, \quad \Delta l = \log\frac{l_{gt}}{l_{p}}, \quad \Delta h = \log\frac{h_{gt}}{h_{p}}$ (4)
$\Delta \theta = \sin\left(\theta_{gt} - \theta_{p}\right)$ (5)
where $(x, y, z, w, l, h, \theta)$ characterizes the 3D bounding box: $(x, y, z)$ are the coordinates of the box center in the LiDAR coordinate system; $w$, $l$, and $h$ are the width, length, and height of the 3D box, respectively; and $\theta$ is the yaw rotation angle of the 3D box around the z-axis. $\Delta b$ denotes the difference between the predicted 3D bounding box and the ground-truth box, i.e., the coordinate residual of the predicted box. $\mathrm{SmoothL1}$ is the L1 smoothing function, used to reduce the effect of coordinate errors of the predicted bounding box on the loss function and to limit the influence of extreme values. The subscript $gt$ denotes the ground-truth value of a quantity, the subscript $p$ denotes its predicted or anchor value, and $d_{p} = \sqrt{w_{p}^{2} + l_{p}^{2}}$.
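A minimal PyTorch sketch of the residual encoding of Equations (3)-(5) and the SmoothL1 localization loss of Equation (2) is given below; it assumes matched ground-truth boxes and anchors stored as (M, 7) tensors in the order $(x, y, z, w, l, h, \theta)$, and the helper names are illustrative.

```python
import torch
import torch.nn.functional as F

def encode_residuals(gt, anchor):
    # gt, anchor: (M, 7) tensors with columns (x, y, z, w, l, h, theta).
    d = torch.sqrt(anchor[:, 3] ** 2 + anchor[:, 4] ** 2)     # d_p = sqrt(w_p^2 + l_p^2)
    dx = (gt[:, 0] - anchor[:, 0]) / d                        # Equation (3)
    dy = (gt[:, 1] - anchor[:, 1]) / d
    dz = (gt[:, 2] - anchor[:, 2]) / d
    dw = torch.log(gt[:, 3] / anchor[:, 3])                   # Equation (4)
    dl = torch.log(gt[:, 4] / anchor[:, 4])
    dh = torch.log(gt[:, 5] / anchor[:, 5])
    dtheta = torch.sin(gt[:, 6] - anchor[:, 6])               # Equation (5)
    return torch.stack([dx, dy, dz, dw, dl, dh, dtheta], dim=1)

def localization_loss(pred_residuals, gt, anchor):
    # Equation (2): SmoothL1 summed over the seven residual terms.
    target = encode_residuals(gt, anchor)
    return F.smooth_l1_loss(pred_residuals, target, reduction="sum")
```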
The classification loss function of the 3D bounding box uses the focal loss function [28], as shown in Equation (6).
$L_{cls} = -\alpha_{a}\left(1 - P_{a}\right)^{\gamma} \log P_{a}$ (6)
where $P_{a}$ is the predicted class probability of an anchor; $\alpha = 0.25$ and $\gamma = 2$.
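A minimal PyTorch sketch of the focal classification loss of Equation (6), together with the weighted combination of Equation (1), is shown below; `probs` is assumed to hold the predicted probability of the assigned class for each anchor, and `loc_loss` and `dir_loss` stand in for the other two loss terms.

```python
import torch

def focal_loss(probs, alpha=0.25, gamma=2.0, eps=1e-6):
    # Equation (6): L_cls = -alpha * (1 - P_a)^gamma * log(P_a), summed over anchors.
    return (-alpha * (1.0 - probs) ** gamma * torch.log(probs + eps)).sum()

def total_loss(loc_loss, cls_loss, dir_loss, num_pos,
               beta_loc=2.0, beta_cls=1.0, beta_dir=0.2):
    # Equation (1): weighted sum normalized by the number of positive anchors.
    return (beta_loc * loc_loss + beta_cls * cls_loss + beta_dir * dir_loss) / max(num_pos, 1)
```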
The localization loss cannot distinguish whether the 3D box is flipped, so $L_{dir}$ uses a softmax function to learn the orientation of the 3D box over discretized direction bins.
The loss function is optimized using the Adam optimizer, which updates the parameters with an exponentially weighted average of the gradients and a momentum mechanism, and can effectively handle non-smooth objective functions and large-scale datasets. The initial learning rate is $2 \times 10^{-4}$, and a learning rate decay strategy is adopted that decays the learning rate to 0.8 times its current value every 15 epochs.
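A minimal PyTorch sketch of this optimizer setup is shown below; the stand-in model and the empty epoch loop are placeholders, and only the Adam setting and the step decay schedule follow the text.

```python
import torch
import torch.nn as nn

model = nn.Linear(9, 7)                                    # stand-in for the detector
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)  # initial learning rate 2e-4
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=15, gamma=0.8)

for epoch in range(60):
    # ... one training epoch over the KITTI training split would run here ...
    optimizer.step()                                       # stand-in for the parameter update
    scheduler.step()                                       # decay the lr by 0.8 every 15 epochs
```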

2.2. The PointPillars Algorithm Based on the ECA Module

The attention mechanism is a concept inspired by human behavior: when making decisions, important parts of the data are used selectively rather than treating all information equally. In the attention mechanism, the representation vectors of data points are weighted by their corresponding weight vectors and summed to obtain a weighted representation vector. In this way, the attention mechanism enables the network to learn the important information in the pseudo-image and suppress the unimportant information. Advantages such as a small number of parameters, high interpretability, and high robustness make it popular.

2.2.1. ECA Module

The ECA attention module is a type of channel attention that is often used in vision models because it is plug-and-play: it enhances the input feature map along the channel dimension, and its output does not change the size of the input feature map. Compared with SE-Net [29], it replaces the fully connected layers originally used to learn channel attention with a lightweight convolution of kernel size $k$ that captures interactions between neighboring channels. This reduces the number of parameters, eliminates the negative impact of the dimensionality reduction adopted by SE-Net on channel attention prediction, and avoids the inefficient and unnecessary modeling of the relationships between all channels at once. The structure of the ECA module is shown in Figure 4.
The kernel size $k$ of this convolution determines the range over which interactions between different channels are captured, and it scales with the channel dimension; i.e., there is a mapping relationship between the two:
$C = f(k)$ (7)
The simplest mapping is a linear function, i.e., $f(k) = p \times k - q$. Since the channel dimension is usually a power of 2, Equation (7) can be written as:
$C = f(k) = 2^{(p \times k - q)}$ (8)
When the number of channels is known, the kernel size can be derived from Equation (8):
$k = g(C) = \left| \frac{\log_{2} C + q}{p} \right|_{odd}$ (9)
where $k$ denotes the convolutional kernel size; $C$ denotes the number of channels; $|A|_{odd}$ denotes the odd number closest to $A$; and $p = 2$ and $q = 1$ set the ratio between the number of channels and the convolution kernel size.
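A minimal PyTorch sketch of the ECA module described above follows: global average pooling, a 1D convolution across channels whose kernel size is chosen from the channel count via Equation (9) with p = 2 and q = 1, and a sigmoid reweighting of the input. The class name and the rounding to an odd kernel size are illustrative choices.

```python
import math
import torch
import torch.nn as nn

class ECAModule(nn.Module):
    def __init__(self, channels, p=2, q=1):
        super().__init__()
        k = int(abs((math.log2(channels) + q) / p))    # Equation (9)
        k = k if k % 2 == 1 else k + 1                 # force an odd kernel size
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):                              # x: (B, C, H, W)
        y = x.mean(dim=(2, 3))                         # global average pooling -> (B, C)
        y = self.conv(y.unsqueeze(1)).squeeze(1)       # 1D convolution across channels
        w = torch.sigmoid(y).unsqueeze(-1).unsqueeze(-1)
        return x * w                                   # channel-wise reweighting of the input
```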

2.2.2. ECA Modules–PointPillars

The use of ECA modules can improve the expressiveness of the model by enabling the network to focus on the important information in the pseudo-image features and ignore the irrelevant information, without imposing much extra burden on the training, running, computation, and storage of the algorithm. In the PointPillars algorithm, the pillar division of the point cloud and the random sampling cause a loss of information, which affects object detection accuracy. To address this, the two-dimensional backbone network of the PointPillars model is modified: the original top-down sub-network consists of a series of convolutional blocks, and ECA modules are connected in series behind the down-sampling modules Block1, Block2, and Block3 to reconstruct the backbone network, yielding the proposed ECA Modules–PointPillars algorithm. The ECA attention modules adaptively strengthen the important features and suppress the irrelevant features within the pillar channels to compensate for the loss of spatial information. The network structure is shown in Figure 5.
The output features of the network after point cloud pseudo-imaging are input to the ECA1 module $(C, K)$, where $C$ denotes the channel dimension of the input features and $K$ denotes the size of the convolution kernel. The ECA modules connected after the convolutional blocks of the top-down 2D backbone sub-network are denoted ECA2 module $(C, K)$, ECA3 module $(2C, K)$, and ECA4 module $(4C, K)$. After the ECA1 module, the input features are fed into Block1 $(S, 4, F)$, where $S$ denotes the stride, Block1 has four $3 \times 3$ 2D convolutional layers, and $F$ denotes the number of output channels. Each block is followed by a BatchNorm and a ReLU. To ensure that the network layer keeps as much information as possible when processing the input, the first convolution in the block has stride $S / S_{in}$ and the subsequent convolutions have stride 1, so that the block operates at stride $S$ after receiving an input with stride $S_{in}$. This avoids over-compression of the input signal by a large first convolution stride and minimizes information loss.
As shown in Figure 5, the output features of Block1 pass sequentially through the ECA2 module, Block2 $(2S, 6, 2F)$, the ECA3 module, Block3 $(4S, 6, 4F)$, and the ECA4 module. The output of the ECA2 module used for the up-sampling branch is processed by $\mathrm{UP1}(S, S, 2C)$, whose parameters are the initial stride $S$, the final stride $S$, and $2C$ output channels obtained through transposed convolution. Similarly, the outputs of the ECA3 and ECA4 modules used for up-sampling are processed by $\mathrm{UP2}(2S, S, 2C)$ and $\mathrm{UP3}(4S, S, 2C)$, respectively. Each up-sampling module is followed by a BatchNorm and a ReLU, and the modules from the different strides are then concatenated to generate the final output features, which are fed to the SSD detection head.
In summary, the network output features after point cloud pseudo-imaging are input to the ECA1 module, and the output features of Block1 pass sequentially through ECA2, Block2, ECA3, Block3, and ECA4; each time features are extracted, the ECA module enhances the positional feature information and weakens irrelevant feature information such as background noise. The outputs of ECA2, ECA3, and ECA4 are used for up-sampling and concatenation, which generates the final output features passed to the SSD detection head.
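A minimal PyTorch sketch of this reconstructed backbone is given below, reusing the `ECAModule`, `make_block`, and `make_up` helpers sketched earlier; the base channel count C = 64 and the relative strides are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ECABackbone(nn.Module):
    def __init__(self, c=64):
        super().__init__()
        self.eca1, self.block1 = ECAModule(c), make_block(c, c, 4, stride=2)
        self.eca2, self.block2 = ECAModule(c), make_block(c, 2 * c, 6, stride=2)
        self.eca3, self.block3 = ECAModule(2 * c), make_block(2 * c, 4 * c, 6, stride=2)
        self.eca4 = ECAModule(4 * c)
        self.up1 = make_up(c, 2 * c, stride=1)        # UP1(S, S, 2C)
        self.up2 = make_up(2 * c, 2 * c, stride=2)    # UP2(2S, S, 2C)
        self.up3 = make_up(4 * c, 2 * c, stride=4)    # UP3(4S, S, 2C)

    def forward(self, x):
        x1 = self.eca2(self.block1(self.eca1(x)))     # ECA1 -> Block1 -> ECA2
        x2 = self.eca3(self.block2(x1))               # Block2 -> ECA3
        x3 = self.eca4(self.block3(x2))               # Block3 -> ECA4
        # Up-sample the three branches to a common stride and concatenate them.
        return torch.cat([self.up1(x1), self.up2(x2), self.up3(x3)], dim=1)
```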
Compared with the algorithms proposed in references [8,12,15,16,22], the improved algorithm first divides the point cloud data into different voxels and extracts features, such as position and reflection intensity, for the points in each voxel, which helps the algorithm better understand the geometric and semantic information of the point cloud data. The ECA attention module is then introduced: in the feature extraction stage, the globally average-pooled input features are mapped into channel attention weights through a lightweight convolutional layer, and after feature weighting the resulting channel attention weights are multiplied with the input features, where a larger weight amplifies important information and a smaller weight suppresses irrelevant information. Attention to important information such as the shape, size, position, and orientation of the target object is strengthened, and the interference of redundant information such as background noise and sensor noise is reduced. By learning the geometric and semantic information in the point cloud data, the improved algorithm can better understand the contextual relationships and shape features of the target, helping the network focus on useful features and reducing the influence of occlusion on the prediction. In addition, the global average pooling operation in the ECA module helps the network attend to the statistical information of the overall features instead of only local regions or individual channel features. By considering the global distribution of features, the network can capture more comprehensive and diverse feature information and better cope with unexpected obstacles that could affect the prediction. Through validation on the KITTI datasets, the improved algorithm shows good performance, indicating that it is more robust when dealing with challenges such as occlusion, sensor noise, and unexpected obstacles, which improves the reliability and safety of the autonomous driving system. However, in extreme occlusion scenarios, a completely occluded target may still not be detected and localized correctly. Therefore, further research and improvement are necessary, and combination with other sensors may be required to meet the demand for high-performance target detection and localization in autonomous driving systems [30].

3. Results and Discussion

3.1. Experimental Setup

The experiments were conducted using the OpenPCDet 0.6.0 object detection framework in an environment consisting of the Ubuntu 18.04 operating system, Python 3.8, CUDA 11.3, PyTorch 1.11.0 [31], cuDNN 8, NVCC, torchvision 0.12.0, and torchaudio 0.11.0. The processor was an Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30 GHz, the graphics card was an NVIDIA GeForce RTX 3080 Ti with 12 GB of memory and a GPU bandwidth of 912.10 GB/s, the number of available CPU cores was 8, and the hard disk bandwidth was 377.42 MB/s.

The KITTI datasets used in this experiment are among the most important datasets in the field of autonomous driving. They focus on image processing techniques for automated driving and their applications in perception and prediction, also covering localization and SLAM, and were jointly established by the Karlsruhe Institute of Technology (KIT), Germany, and the Toyota Technological Institute at Chicago (TTIC). The datasets contain point cloud and image data acquired by a 64-line 3D LiDAR, two grayscale cameras, two color cameras, and four optical lenses in scenes such as urban roads, countryside, and highways, and are used to evaluate the performance of computer vision technologies such as stereo imagery, optical flow, visual odometry, 3D object detection, and 3D tracking in in-vehicle environments [32]. Each image contains up to 15 vehicles and 30 pedestrians, with obstacles at different levels of occlusion; only point clouds are used in the training process. To evaluate model performance and prevent overfitting, the data are divided into 7481 training samples and 7518 test samples, and the training data are further split into 3712 training samples and 3769 validation samples [33] so that model performance can be monitored during training and the model adjusted to better adapt to new data. The samples are categorized into three levels, easy, moderate, and hard, according to the truncation value and the degree of occlusion; the specific criteria are shown in Table 1.

Default parameters: the x and y resolution of the pillars is set to 0.16 m, the maximum number of pillars P is 12,000, and the maximum number of points per pillar N is 100. The 3D anchors for each object category are defined at 0° and 90° orientations. The width, length, and height of the car anchor are 1.60 m, 3.90 m, and 1.56 m, with a z-center of −1.00 m and matching positive and negative thresholds of 0.60 and 0.45. The width, length, and height of the pedestrian anchor are 0.60 m, 0.80 m, and 1.73 m, and those of the cyclist anchor are 0.60 m, 1.76 m, and 1.73 m, both with a z-center of −0.60 m and matching positive and negative thresholds of 0.5 and 0.35 [34].
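For reference, the pillar and anchor settings listed above can be collected into a single configuration dictionary as sketched below; the dictionary layout is an illustrative assumption and does not reproduce the actual OpenPCDet configuration format.

```python
# Pillar and anchor settings from the experimental setup (values from the text).
CONFIG = {
    "pillar": {"xy_resolution_m": 0.16, "max_pillars": 12000, "max_points_per_pillar": 100},
    "anchors": {
        "Car":        {"wlh_m": (1.60, 3.90, 1.56), "z_center_m": -1.00,
                       "rotations_deg": (0, 90), "match_thresholds": (0.60, 0.45)},
        "Pedestrian": {"wlh_m": (0.60, 0.80, 1.73), "z_center_m": -0.60,
                       "rotations_deg": (0, 90), "match_thresholds": (0.50, 0.35)},
        "Cyclist":    {"wlh_m": (0.60, 1.76, 1.73), "z_center_m": -0.60,
                       "rotations_deg": (0, 90), "match_thresholds": (0.50, 0.35)},
    },
}
```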

3.2. Experimental Results and Analysis

Average precision (AP) is used as the evaluation metric for the 3D object detection algorithms, reported for the 2D, bird's eye view (BEV), 3D, and Average Orientation Similarity (AOS) modes [35]. Based on the degree of occlusion of the target, the KITTI datasets are categorized into three difficulty levels: easy, moderate, and hard. The IoU overlap threshold was set to 0.7 for the car category at all three difficulty levels, and to 0.5 for the cyclist and pedestrian categories. Table 2 compares the detection performance with the algorithms of references [8,12,15,16,22] on the KITTI cars, pedestrians, and cyclists in BEV mode at the three difficulty levels. Table 3 compares the detection performance with the algorithms of references [8,12,15,16,22,36,37] in 3D mode at the three difficulty levels. Table 4 compares the detection performance with the algorithms of references [15,16] in AOS mode at the three difficulty levels.
The experimental results show that, at moderate difficulty, the mAP of the improved algorithm is 3.84% higher than that of the PointPillars algorithm in BEV mode and 4.04% higher in 3D mode, while the mAOS is 2.02% lower in AOS mode. In addition, at moderate difficulty the AP of cars improves to 87.39% in BEV mode, 76.96% in 3D mode, and 89.39% in AOS mode, and the AP of cyclists improves to 67.84% in BEV mode, 64.15% in 3D mode, and 71.39% in AOS mode; except for the pedestrian category, the improved algorithm outperforms the other detection models in the tables for the remaining two categories. From Table 2 and Table 4, it can be seen that the detection results for pedestrians in AOS mode are not optimal. The attention mechanism helps the network focus on the key information in the input data and suppress the unimportant information, but several factors may affect the detection of small objects: for example, pedestrians may be confused with other slender objects such as poles and tree trunks, and the presence of certain vertical objects may also reduce the number of points on the target. Real-world environments with different lighting, viewing angles, and complex backgrounds also affect the detection results. To solve these problems, combining the method with image-based approaches [38] may be a viable solution.
In addition to the official KITTI evaluation protocol, i.e., an IoU threshold of 0.7 for cars and 0.5 for pedestrians and cyclists, results are also listed with IoU thresholds of 0.7, 0.5, and 0.5 for the bbox AP, BEV AP, 3D AP, and AOS values of car detection, and with IoU thresholds of 0.5, 0.25, and 0.25 for the bbox AP, BEV AP, 3D AP, and AOS values of pedestrian and cyclist detection. Depending on how the recall axis is sampled at equal intervals, the interpolated precision is reported as AP11 (11-point interpolation: the precision is averaged at the equally spaced recall values R11 = {0, 0.1, 0.2, ..., 1}) and AP40 (40-point interpolation: R40 = {1/40, 2/40, 3/40, ..., 1}); the results are shown in Table 5, Table 6 and Table 7.
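The sketch below shows how the 11-point and 40-point interpolated AP described above can be computed with NumPy, assuming precision and recall arrays obtained by sweeping the detection confidence threshold; the function name is illustrative.

```python
import numpy as np

def interpolated_ap(recall, precision, num_points=40):
    # R11 = {0, 0.1, ..., 1};  R40 = {1/40, 2/40, ..., 1}
    if num_points == 11:
        samples = np.linspace(0.0, 1.0, 11)
    else:
        samples = np.linspace(1.0 / 40, 1.0, 40)
    ap = 0.0
    for r in samples:
        mask = recall >= r
        # Interpolated precision: the maximum precision at any recall >= r.
        ap += precision[mask].max() if mask.any() else 0.0
    return ap / num_points
```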
Theoretically, the ECA attention module is plug-and-play and does not add much extra burden to the training, running, computation, and storage of the algorithm. However, to verify whether the added complexity is reasonable, whether it limits real-time processing in autonomous driving scenarios, and whether it affects the scalability of the algorithm, the speed of the algorithm before and after the improvement is compared on an NVIDIA GeForce RTX 3080 Ti graphics card; the comparison results are shown in Table 8.
Here, FPS denotes the number of scenes that can be processed per second. It can be seen that the inference speed of the improved algorithm remains at essentially the same level as that of the PointPillars algorithm, showing that the additional ECA modules do not noticeably affect real-time performance: the improved algorithm increases detection accuracy while maintaining computational efficiency, which preliminarily verifies its effectiveness and scalability. However, the actual scalability is still affected by factors such as the hardware platform, dataset size, and computational resources, so further tuning and adaptation are still necessary to meet the needs of specific application scenarios.
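A simple PyTorch sketch of how the average inference time and FPS in Table 8 could be measured is given below; `model` and `samples` are placeholders for a detector and a list of preprocessed scenes, and the timing logic is an illustrative assumption rather than the exact procedure used in the paper.

```python
import time
import torch

@torch.no_grad()
def benchmark(model, samples):
    model.eval()
    if torch.cuda.is_available():
        torch.cuda.synchronize()                 # start timing from an idle GPU
    start = time.perf_counter()
    for batch in samples:
        model(batch)                             # forward pass for one scene/batch
    if torch.cuda.is_available():
        torch.cuda.synchronize()                 # wait for queued GPU work to finish
    per_scene = (time.perf_counter() - start) / len(samples)
    return per_scene, 1.0 / per_scene            # average inference time and FPS
```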

3.3. Object Detection Results and Analysis

To facilitate observation and explanation, the RGB images and the corresponding 3D box prediction results in BEV mode are shown; the object detection results are given in Figure 6 (green for cars, yellow for cyclists, and blue for pedestrians).
Figure 6 shows the original RGB images and their corresponding point cloud detection results for three different road scenes. The left side, Figure 6a,c,e, shows the original RGB images of the highway, rural road, and urban road driving scenarios, respectively, and the right side, Figure 6b,d,f, shows the corresponding point cloud prediction results. Through point cloud detection, the shape and position information of target objects such as cars, pedestrians, and cyclists is obtained, and the detection result maps demonstrate accurate detection and recognition of the target objects in these scenes. As the RGB images and the point cloud detection images show, the point cloud data acquired by the LiDAR sensor contain 3D information about the whole environment, including targets, ground, and background, covering the space from the sensor out to the farthest reachable distance in the scene. However, only the points within the camera's field of view are meaningful for the vision task, so filtering out the points outside the camera's field of view before detection is the next step of this work.

4. Conclusions

To address the information loss caused by the pillar segmentation of the PointPillars algorithm, the ECA module is introduced and connected in series with the convolutions of the down-sampling modules to reconstruct the backbone network. The improved algorithm increases the mAP in 3D mode by 4.04% compared to PointPillars, 5.89% compared to F-PointNet, 14.19% compared to VoxelNet, and 6.55% compared to SECOND, showing that it is effective and reliable for multi-object detection in autonomous driving scenarios. For the pedestrian category, where the improved algorithm performs only moderately, an approach based on fusing point clouds and images may be a feasible solution, with two specific options: building a complete semantic understanding framework by mounting cameras around the vehicle and fusing the image information of other targets with the point cloud information; and incorporating a de-fogging and de-raining model to process the data before fusion. In addition to multi-sensor collaboration and multi-modal data fusion, lightweighting of the network model is another direction for further exploration, to balance accuracy and inference speed, improve the real-time performance of the algorithm, and enable deployment on mobile platforms.

Author Contributions

Conceptualization, S.W. and M.C.; methodology, S.W. and M.C.; software, S.W. and M.C.; validation, S.W.; formal analysis, M.C.; investigation, M.C.; resources, S.W. and M.C.; data curation, M.C.; writing—original draft preparation, M.C.; writing—review and editing, S.W.; visualization, M.C.; supervision, S.W.; project administration, S.W.; funding acquisition, S.W. All authors have read and agreed to the published version of the manuscript.

Funding

The research was funded by the National Natural Science Foundation of China (NSFC52174197) and the UWB Radar Life Information Feature Extraction and Quantitative Identification for Mine Drill Hole Rescue project.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data is contained within the article.

Acknowledgments

I would like to acknowledge Shuqi Wang for inspiring my interest in the development of innovative technologies.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Chen, X.D.; Zhang, J.C.; Pang, W.S.; Ai, D.; Wang, Y.; Cai, H. Key technology and application algorithm of intelligent driving vehicle LiDAR. Opto-Electron. Eng. 2019, 46, 34–46. [Google Scholar]
  2. Fan, X.; Xu, G.; Li, W.; Wang, Q.; Chang, L. Target segmentation method for three-dimensional LiDAR point cloud based on depth image. Chin. J. Lasers 2019, 46, 292–299. [Google Scholar]
  3. Huo, W.; Jing, T.; Ren, S. Review of 3D Object Detection for Autonomous Driving. Comput. Sci. 2023, 50, 107–118. [Google Scholar]
  4. Zhao, L.; Hu, J.; Liu, H. Deep learning based on semantic segmentation for three-dimensional object detection from point clouds. Chin. J. Lasers 2021, 48, 177–189. [Google Scholar]
  5. Zhao, Y.; Arxidin, A.; Chen, R.; Zhou, Y.; Zhang, Q. 3D point cloud object detection method in view of voxel based on graph convolution network. Infrared Laser Eng. 2021, 50, 281–289. [Google Scholar]
  6. Qin, J.; Wang, W.; Zou, Q.; Wang, Z.; Ji, C. Review of 3D Target Detection Methods Based on LIDAR Point Clouds. Comput. Sci. 2023, 50, 259–265. [Google Scholar]
  7. Li, B.; Zhang, T.; Xia, T. Vehicle detection from 3D lidar using fully convolutional network. arXiv 2016, arXiv:1608.07916. [Google Scholar]
  8. Zhou, Y.; Tuzel, O. VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 4490–4499. [Google Scholar]
  9. Yang, B.; Luo, W.; Urtasun, R. Pixor: Real-time 3D object detection from point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7652–7660. [Google Scholar]
  10. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 652–660. [Google Scholar]
  11. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 5100–5109. [Google Scholar]
  12. Qi, C.R.; Liu, W.; Wu, C.; Su, H.; Guibas, L.J. Frustum pointnets for 3D object detection from rgb-d data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 918–927. [Google Scholar]
  13. Wang, T.; Wang, W.J.; Cai, Y. Research of deep learning based semantic segmentation for 3D point cloud. Comput. Eng. Appl. 2021, 57, 18–26. [Google Scholar]
  14. Huang, Z.; Wang, Y.; Li, D. A survey of 3D detection algorithms. Chin. J. Intell. Sci. Technol. 2023, 5, 7–31. [Google Scholar]
  15. Yan, Y.; Mao, Y. SECOND: Sparsely Embedded Convolutional Detection. Sensors 2018, 18, 3337. [Google Scholar] [CrossRef] [PubMed]
  16. Lang, A.H.; Vora, S.; Caesar, H.; Zhou, L.; Yang, J.; Beijbom, O. PointPillars: Fast Encoders for Object Detection from Point Clouds. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 12697–12705. [Google Scholar]
  17. Engelcke, M.; Rao, D.; Wang, D.Z.; Tong, C.H.; Posner, I. Vote3deep: Fast object detection in 3D point clouds using efficient convolutional neural networks. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1355–1361. [Google Scholar]
  18. Shu, X.; Chang, F.; Zhang, X.; Shao, C.; Yang, X. ECAU-Net: Efficient channel attention U-Net for fetal ultrasound cerebellum segmentation. Biomed. Signal Process. Control 2022, 75, 103528. [Google Scholar] [CrossRef]
  19. Chen, D.; Yu, W.; Gao, Y. Lidar 3D Target Detection Based on Improved PointPillars. Laser Optoelectron. Prog. 2023, 60, 447–453. [Google Scholar]
  20. Chen, X.; Ma, H.; Wan, J.; Li, B.; Xia, T. Multi-view 3D object detection network for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1907–1915. [Google Scholar]
  21. Ku, J.; Mozifian, M.; Lee, J.; Harakeh, A.; Waslander, S.L. Joint 3D proposal generation and object detection from view aggregation. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1–8. [Google Scholar]
  22. Simon, M.; Milz, S.; Amende, K.; Gross, H.-M. Complex-YOLO: Real-time 3D Object Detection on Point Clouds. arXiv 2018, arXiv:1803.06199. [Google Scholar] [CrossRef]
  23. Li, R.; Wu, C.; Zhu, M. 3D object detection in voxelized point cloud scene. Chin. J. Liq. Cryst. Disp. 2022, 37, 1355–1363. [Google Scholar] [CrossRef]
  24. Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, ICML, Lile, France, 6–11 July 2015; pp. 448–456. [Google Scholar]
  25. Nair, V.; Hinton, G.E. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), Haifa, Israel, 21–24 June 2010; pp. 807–814. [Google Scholar]
  26. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision-ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Part I 14. Springer International Publishing: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar]
  27. Everingham, M.; Van Gool, L.; Williams, C.K.I.; Winn, J. The pascal visual object classes (voc) challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
  28. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  29. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  30. Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; Lerer, A. Automatic Differentiation in PyTorch. 2017. Available online: https://openreview.net/forum?id=BJJsrmfCZ (accessed on 28 October 2017).
  31. Qin, C.; Wang, Y.; Zhang, Y.; Yin, C. 3D Object Detection Based on Extremely Sparse Laser Point Cloud and RGB Images. Laser Optoelectron. Prog. 2022, 59, 447–458. [Google Scholar]
  32. Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The kitti dataset. Int. J. Robot. Res. 2013, 32, 1231–1237. [Google Scholar] [CrossRef]
  33. Chen, X.; Kundu, K.; Zhu, Y.; Berneshawi, A.G.; Ma, H.; Fidler, S.; Urtasun, R. 3D object proposals for accurate object class detection. Adv. Neural Inf. Process. Syst. 2015, 5, 424–432. [Google Scholar]
  34. Zhan, W.; Ni, R.; Yang, B. An attention-based PointPillars+3D object detection. J. Jiangsu Univ. (Nat. Sci. Ed.) 2020, 41, 268–273. [Google Scholar]
  35. Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The kitti vision benchmark suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; IEEE: Piscataway, NJ, USA, 2012; pp. 3354–3361. [Google Scholar]
  36. Deng, J.; Shi, S.; Li, P.; Zhou, W.; Zhang, Y.; Li, H. Voxel r-cnn: Towards high performance voxel-based 3D object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 2–9 February 2021; Volume 35, pp. 1201–1209. [Google Scholar]
  37. Yang, Z.; Sun, Y.; Liu, S.; Jia, J. 3DSSD: Point-based 3D single stage object detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11040–11048. [Google Scholar]
  38. Song, S.; Lichtenberg, S.P.; Xiao, J. Sun RGB-D: A RGB-D scene understanding benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 567–576. [Google Scholar]
Figure 1. The algorithm flow network architecture diagram of PointPillars.
Figure 2. The flowchart of the pseudo-image.
Figure 3. 2D backbone.
Figure 4. The structure diagram of the ECA module.
Figure 5. The algorithm structure diagram of ECA Modules–PointPillars.
Figure 6. Object detection results. (a) Highway driving scenario; (b) object detection results corresponding to figure (a); (c) in rural road driving scenarios; (d) object detection results corresponding to figure (c); (e) in urban driving scenarios; (f) object detection results corresponding to figure (e).
Table 1. Classification of difficulty levels.

Difficulty Level | Easy | Moderate | Hard
Occlusion | Fully visible | Partly occluded | Mostly occluded, hard to see
Truncation value | ≤15% | 15~50% | ≥50%
Number of pixels | ≥40 | ≥25 |
Table 2. Experimental results of the BEV model.

Method | mAP | Cars (Easy / Moderate / Hard) | Pedestrians (Easy / Moderate / Hard) | Cyclists (Easy / Moderate / Hard)
F-PointNet [12] | 65.39 | 88.70 / 84.00 / 75.33 | 58.75 / 51.05 / 47.20 | 68.09 / 57.48 / 50.77
VoxelNet [8] | 58.25 | 89.35 / 79.26 / 77.39 | 46.13 / 40.74 / 38.11 | 66.70 / 54.76 / 50.55
SECOND [15] | 60.56 | 88.07 / 79.37 / 77.95 | 55.10 / 46.27 / 44.76 | 73.67 / 56.04 / 48.78
PointPillars [16] | 66.19 | 88.35 / 86.10 / 79.83 | 58.66 / 50.23 / 47.19 | 79.14 / 62.25 / 56.00
Complex-YOLO [22] | 62.84 | 85.89 / 79.26 / 77.39 | 46.08 / 45.90 / 44.20 | 72.37 / 63.36 / 60.27
This paper | 70.03 | 89.91 / 87.39 / 83.32 | 58.73 / 54.86 / 49.95 | 83.76 / 67.84 / 63.62
Table 3. Experimental results of the 3D model.

Method | mAP | Cars (Easy / Moderate / Hard) | Pedestrians (Easy / Moderate / Hard) | Cyclists (Easy / Moderate / Hard)
F-PointNet [12] | 57.35 | 81.20 / 70.39 / 62.19 | 51.21 / 44.89 / 40.23 | 71.96 / 56.77 / 50.39
VoxelNet [8] | 49.05 | 77.47 / 65.11 / 57.73 | 39.48 / 33.69 / 31.50 | 61.22 / 48.36 / 44.37
SECOND [15] | 56.69 | 83.13 / 73.66 / 66.20 | 51.07 / 42.56 / 37.29 | 70.51 / 53.85 / 46.90
PointPillars [16] | 59.20 | 79.05 / 74.99 / 68.30 | 52.08 / 43.53 / 41.49 | 75.78 / 59.07 / 52.92
Complex-YOLO [22] | 54.01 | 67.72 / 64.00 / 63.01 | 41.79 / 39.70 / 35.92 | 68.17 / 58.32 / 54.30
Voxel R-CNN [36] | – | 89.41 / 84.52 / 78.78 | – | –
3DSSD [37] | – | 89.71 / 79.45 / 78.67 | – | –
This paper | 63.24 | 86.76 / 76.96 / 74.07 | 53.83 / 48.60 / 45.21 | 81.14 / 64.15 / 60.21
Table 4. Experimental results of the AOS model.

Method | mAOS | Cars (Easy / Moderate / Hard) | Pedestrians (Easy / Moderate / Hard) | Cyclists (Easy / Moderate / Hard)
SECOND [15] | 54.53 | 87.84 / 81.31 / 71.95 | 51.56 / 43.51 / 38.78 | 80.97 / 57.20 / 55.14
PointPillars [16] | 68.86 | 90.19 / 88.76 / 86.38 | 58.05 / 49.66 / 47.88 | 82.43 / 68.16 / 61.96
This paper | 66.84 | 90.76 / 89.39 / 88.20 | 42.07 / 39.73 / 37.55 | 84.47 / 71.39 / 66.98
Table 5. Car detection results by ECA Modules–PointPillars on the KITTI datasets.

Detection Type | Cars–Easy | Cars–Moderate | Cars–Hard
bbox AP11 | 90.82 | 89.62 | 88.60
bbox AP40 | 95.52 | 92.03 | 90.99
BEV AP11 | 90.82 | 90.03 | 89.37
BEV AP40 | 95.59 | 94.62 | 93.75
3D AP11 | 90.82 | 89.95 | 89.22
3D AP40 | 95.56 | 94.43 | 93.30
AOS AP11 | 90.76 | 89.39 | 88.20
AOS AP40 | 95.45 | 91.78 | 90.56
Table 6. Pedestrian detection results by ECA Modules–PointPillars on the KITTI datasets.

Detection Type | Pedestrians–Easy | Pedestrians–Moderate | Pedestrians–Hard
bbox AP11 | 64.78 | 60.99 | 57.57
bbox AP40 | 64.75 | 60.72 | 57.06
BEV AP11 | 69.74 | 66.77 | 63.48
BEV AP40 | 71.06 | 67.33 | 63.56
3D AP11 | 69.68 | 66.59 | 63.12
3D AP40 | 70.99 | 66.95 | 63.18
AOS AP11 | 42.07 | 39.73 | 37.55
AOS AP40 | 41.69 | 39.23 | 36.75
Table 7. Cyclist detection results by ECA Modules–PointPillars on the KITTI datasets.

Detection Type | Cyclists–Easy | Cyclists–Moderate | Cyclists–Hard
bbox AP11 | 85.49 | 73.92 | 69.51
bbox AP40 | 88.74 | 74.71 | 70.74
BEV AP11 | 84.88 | 70.85 | 66.63
BEV AP40 | 87.95 | 71.55 | 67.57
3D AP11 | 84.88 | 70.85 | 66.62
3D AP40 | 87.95 | 71.54 | 67.47
AOS AP11 | 84.47 | 71.39 | 66.98
AOS AP40 | 87.49 | 71.94 | 67.88
Table 8. Algorithmic inference speed comparison.

Method | Average Inference Time | Average Inference Speed | Batch Size
PointPillars | 0.116 s | 34.483 FPS | 4
This paper | 0.133 s | 30.075 FPS |