LiDAR-Based Intensity-Aware Outdoor 3D Object Detection

LiDAR-based 3D object detection and localization are crucial components of autonomous navigation systems, including autonomous vehicles and mobile robots. Most existing LiDAR-based 3D object detection and localization approaches primarily use geometric or structural feature abstractions from LiDAR point clouds. However, these approaches can be susceptible to environmental noise due to adverse weather conditions or the presence of highly scattering media. In this work, we propose an intensity-aware voxel encoder for robust 3D object detection. The proposed voxel encoder generates an intensity histogram that describes the distribution of point intensities within a voxel and is used to enhance the voxel feature set. We integrate this intensity-aware encoder into an efficient single-stage voxel-based detector for 3D object detection. Experimental results obtained using the KITTI dataset show that our method achieves results comparable to those of state-of-the-art methods for car objects in 3D and bird’s-eye-view detection, and superior results for pedestrian and cyclist objects. Furthermore, our model achieves a detection rate of 40.7 FPS during inference, which is higher than that of the state-of-the-art methods and incurs a lower computational cost.


Introduction
The pace of research into 3D vision perception has accelerated over the past few years, as it is an essential component of indoor and outdoor navigation systems. Examples of applications of navigation systems include autonomous vehicles (AVs) [1][2][3], robots [4,5], and augmented reality [6]. In regard to AVs, 3D perception in outdoor, urban environments remains an open challenge [1,7,8]. This challenge is even greater in very complex scenarios, such as dense intersections with a high volume of traffic and uncertain pedestrian actions. To develop AVs that operate safely and in a hazard-free manner, it is important to understand what AVs perceive in environments that present on-road and off-road traffic objects, especially in dense and occluded environments. For 3D visual perception in AVs, LiDAR and stereo camera sensors are considered the primary sensing modalities. Unlike 2D stereo camera images, LiDAR 3D point clouds provide accurate depth information about the surrounding objects, such as object scale, relative positions, and occlusion. However, due to the inherent sparsity and high density variance of 3D point cloud data, it is very difficult to capture the geometric abstraction of objects. To this end, different point-cloud-encoding techniques have been proposed that implement sparse-to-dense feature representation conversion while preserving geometric abstraction. The proposed encoders are then followed by 2D convolution filters for object detection and localization [9][10][11][12].
Recent 3D detection and localization methods rely on geometric encoding techniques for feature extraction [2,10,11,13,14]. Geometric encoding consumes all the points in a 3D cloud or a subset of points resulting from quantization. The existence of false and noisy points in a 3D cloud can have a detrimental impact on the performance of geometric encoding techniques for feature extraction [15,16]. Specifically, 3D point clouds from LiDAR are highly susceptible to scattering media such as fog, snow, rain, or dust in outdoor environments [17][18][19][20][21], and in these scenarios, feature extraction becomes more challenging. To improve feature extraction in noisy point clouds, denoising methods have been developed [18,19,22]. However, these methods are computationally expensive, and their efficiency has not been evaluated on standard benchmark datasets [7,23,24,25].
To develop robust feature extraction methods for 3D object detection, we argue that voxel-wise point cloud intensity histograms can complement geometric features, as false and true points might have different intensity properties. In this paper, we propose an intensity-based encoder for extracting voxel-wise features, which are then used in a 3D and 2D backbone detection network. Our intensity-based encoder integrates the underlying geometric structure with the intensity histogram to produce sparse feature maps. Using a custom CUDA kernel, we generate voxel-wise intensity histograms in a parallel fashion to increase efficiency. The extracted sparse feature map is followed by a sparse convolution stage that produces a dense feature map. Finally, a Region Proposal Network (RPN) is used for classification and 3D bounding box estimation.

Related Work
Geometric feature representation from 3D point clouds can be broadly categorised into three point-cloud-encoding schemes, namely, grid-based, point-based, and hybrid. In this section, we review each of these categories.

Grid-Based Feature Encoders
Grid-based approaches apply fixed grid structures to 3D world scenes. This allows for point vectorization techniques to be subsequently used on a grid-cell basis. Grid-based encoding offers a trade-off between efficiency and memory size but incurs information loss due to the quantization of the grid size. Early attempts to encode irregular and sparse 3D point clouds imposed a regular structure onto the 3D space and performed sparse-to-dense feature conversion. Sparse-to-dense feature conversion was then followed by 2D convolutional neural networks (CNNs). In [26], 3D point clouds were projected onto a 2D bird's-eye-view (BEV) plane to form a range image, after which a fully connected CNN was used to perform vehicle detection. Similarly, Refs. [27][28][29] used deep multi-sensor fusion for 3D detection and 2D BEV, resulting in more-robust schemes. VoxelNet [12] uses end-to-end 3D object detection on point clouds while using simple sampling techniques to suppress sparse regions and develop dense feature representations for 3D object detection. This approach was extended in [30] by fusing RGB and point clouds. Based on VoxelNet, the SECOND approach [31] improved the inference speed of VoxelNet at the expense of an increased complexity and memory footprint while using sparse 3D convolutions.

Point-Set-Based Feature Encoders
Unlike grid-based methods, point-set-based encoders consume all raw points while preserving the permutation invariance of the input point cloud. The authors of [10,11] used point-set-based techniques in combination with deep learning pipelines. PointRCNN [32] generates 3D bounding boxes by consuming raw point clouds using segmentation in the first stage of the model followed by 3D box refinement instead of using prior-fixed anchor boxes. In a related work [33], sparse-to-dense 3D object detection was used, reaching an improved detection rate of up to 10 frames per second (FPS). The EdgeConv model, proposed in [34], used a dynamic graph-based technique on raw point clouds and achieved better discrimination of local features than that demonstrated in [10]. However, EdgeConv presents a high computational load as it requires the evaluation of pairwise k-nearest neighbor queries.

Hybrid-Based Feature Encoders
Recent works attempting to combine the best of both grid- and point-set-based methods have shown the best results in 3D object detection. The authors of [9] proposed the PointPillars method, a hybrid approach that combines both voxelization and point-set-based approaches. In this method, the 3D space is first converted into 3D pillars, and then PointNet is applied to each pillar containing raw points for feature representation. The obtained feature representation is used by standard 2D convolutions for object detection. Although PointPillars is a computationally efficient approach, it relies heavily on manually adjusting the pillar size to improve its performance. The work in [35] proposed a two-stage feature framework, Fast Point R-CNN, where the initial feature representation from voxelization is fused with raw points to increase the accuracy of localization. Fast Point R-CNN achieved a detection rate of 15 FPS. Similarly, PointVoxel-RCNN (PV-RCNN) [36], an extension of the PointRCNN technique [32] and similar to sparse-to-dense models [33], integrates two feature-encoding methods to aid in learning more-accurate 3D bounding boxes for 3D object detection. In PointVoxel-RCNN, the first stage encodes the scene from voxel to keypoint, and the second stage converts keypoints to grids for abstracting region-of-interest features.

Methodology
In this section, we formulate the problem of object detection in 3D point clouds. Then, we present our proposed 3D object detection pipeline. This pipeline includes an intensity-based encoder that enriches a voxel-wise set of geometric features. Finally, we describe our experimental setup, including the dataset used, preliminary exploration stages, evaluation strategy, and training process.

Problem Formulation
Let us define a scene $S$ as a 3D point cloud instance produced by a single sweep of a LiDAR sensor. In this paper, we will consider LiDAR sensors that record the 3D spatial coordinates $x_i$, $y_i$, and $z_i$ and the reflected intensity value $\rho_i$ of each scene point $p_i$. A scene is therefore a set of $N$ points $S = \{p_i \mid 1 \le i \le N\}$, where $p_i = [x_i, y_i, z_i, \rho_i]^T$. In the context of 3D object detection, an annotated scene is defined as a scene $S$ that is equipped with a label $Y$ that includes the location, orientation, and class of every object in $S$. The location and orientation of an object can be described using a 3D bounding box, and examples of object classes in a typical urban scene include car and pedestrian classes. Figure 1 shows an example of an urban annotated scene consisting of a 3D point cloud produced by a LiDAR sensor.
Given a scene $S$ produced by a LiDAR sensor, the 3D object detection problem consists of identifying the location, orientation, and class of all the objects in the scene. A possible solution to this problem comes in the form of a computational pipeline $l$ that uses a scene $S$ as an input and produces a predicted label $\hat{Y}$ as its output. Mathematically, we can express the functionality of this pipeline as follows:
$$\hat{Y} = l(S).$$
To determine the class that one bounding box in $\hat{Y}$ belongs to, the computational pipeline $l$ produces a probability value $c$ for each class; this value can be interpreted as the pipeline's confidence that the bounding box embeds an object of each defined class. If the confidence for a particular class is greater than a predetermined threshold $c_T$, the pipeline decides that an object of that class has been detected.
The performance of 3D object detection pipelines can be evaluated using datasets of annotated scenes. If there is only one class of objects in a scene, the performance of a pipeline $l$ can be assessed as follows. Firstly, given an annotated scene $S_k$ with a ground truth label $Y_k$ and a predicted label $\hat{Y}_k$ produced by pipeline $l$, objects in $\hat{Y}_k$ are matched with objects in $Y_k$. Matching is conducted by obtaining the Intersection over Union (IoU) between the bounding boxes of each pair of objects from $\hat{Y}_k$ and $Y_k$, respectively. The IoU, which we denote as $O$, is computed from the bounding box $B$ of an object in $Y_k$ and the bounding box $\hat{B}$ of an object in $\hat{Y}_k$ as follows:
$$O = \frac{\mathrm{vol}(B \cap \hat{B})}{\mathrm{vol}(B \cup \hat{B})},$$
where $\mathrm{vol}(\cdot)$ computes the volume of its 3D argument, $\cap$ is the intersection between two 3D objects, and $\cup$ is their union. IoU values close to one indicate that there is a high overlap between bounding boxes $B$ and $\hat{B}$, whereas IoU values close to zero indicate limited overlap. Object matching occurs when the IoU value is above a predefined threshold $O_T$.
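The IoU computation can be illustrated with a minimal sketch. It assumes axis-aligned boxes, which is a simplification: bounding boxes in driving datasets are generally rotated about the vertical axis, and computing their intersection requires polygon clipping. The `Box3D` type and the `iou_3d` function are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box3D:
    """Axis-aligned 3D box (hypothetical helper type, not from the paper)."""
    x_min: float
    y_min: float
    z_min: float
    x_max: float
    y_max: float
    z_max: float

    def volume(self) -> float:
        # Degenerate boxes (max < min on any axis) are treated as empty.
        return (max(0.0, self.x_max - self.x_min)
                * max(0.0, self.y_max - self.y_min)
                * max(0.0, self.z_max - self.z_min))


def iou_3d(b: Box3D, b_hat: Box3D) -> float:
    """Intersection over Union: vol(B ∩ B_hat) / vol(B ∪ B_hat)."""
    # Overlap length along each axis; a non-positive value means no overlap.
    dx = min(b.x_max, b_hat.x_max) - max(b.x_min, b_hat.x_min)
    dy = min(b.y_max, b_hat.y_max) - max(b.y_min, b_hat.y_min)
    dz = min(b.z_max, b_hat.z_max) - max(b.z_min, b_hat.z_min)
    if min(dx, dy, dz) <= 0.0:
        return 0.0
    intersection = dx * dy * dz
    # Inclusion-exclusion gives the volume of the union.
    union = b.volume() + b_hat.volume() - intersection
    return intersection / union if union > 0.0 else 0.0
```

For two 2 × 2 × 2 boxes shifted by one unit along x, the intersection volume is 4 and the union volume is 12, so the IoU is 1/3, which would fail both the 0.5 and the 0.7 matching thresholds.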
In addition, if several bounding boxes in $\hat{Y}$ overlap with the same ground truth bounding box in $Y$, the one with the highest IoU value is selected. This process, which is known as non-maximum suppression, is used to eliminate potential duplicate detections. Figure 2 shows the bounding boxes in a predicted label $\hat{Y}$ produced by a pipeline $l$ for a subset of points of the scene in Figure 1.
Once the matching process is completed, the numbers of true positives (TPs), false positives (FPs), and false negatives (FNs) are computed for every scene $S_k$. These concepts are defined as follows. An object in $\hat{Y}$ that matches one object in $Y$ is a TP. In contrast, an object in $\hat{Y}$ that does not match any object in $Y$ is an FP. Finally, an FN is any object in $Y$ that is not matched by an object in $\hat{Y}$. It is worth noting that the values TP, FP, and FN depend on the confidence threshold $c_T$. In general, low $c_T$ values will result in many FPs and few FNs, and vice versa. This is illustrated in Figure 3a,b for the given scene instance, which shows, alongside the bounding boxes of the true objects, the bounding boxes of the objects predicted by a pipeline $l$ using two different confidence thresholds, namely, $c_T = 0.1$ and $c_T = 0.9$. For $c_T = 0.1$, the prediction pipeline produces 6 TPs, 2 FPs, and 1 FN. In contrast, using the confidence threshold $c_T = 0.9$ leads to 1 TP, 0 FPs, and 6 FNs.
By aggregating the TP, FP, and FN values across all scenes $S_k$ in a dataset, we can compute the precision $\gamma$ and recall $r$ metrics of the pipeline $l$ for a given confidence threshold $c_T$:
$$\gamma = \frac{\sum_k TP_k}{\sum_k TP_k + \sum_k FP_k}, \qquad r = \frac{\sum_k TP_k}{\sum_k TP_k + \sum_k FN_k},$$
where $TP_k$, $FP_k$, and $FN_k$ denote the TP, FP, and FN values for scene $k$. By gradually changing the confidence threshold from $c_T = 0$ to $c_T = 1$, a precision-recall curve can be obtained. The area under the precision-recall curve defines a performance metric known as the average precision (AP). The AP, which we denote as $\bar{\gamma}$, can be estimated from a dataset of annotated scenes:
$$\bar{\gamma} = \frac{1}{L} \sum_{l=1}^{L} \max_{r \in R_l} \gamma(r),$$
where $R_l$ denotes the $l$-th segment resulting from partitioning the recall interval $[0, 1]$ into $L$ equal parts, and $\gamma(r)$ is the precision of the pipeline when its corresponding recall value is $r$. Given a collection of 3D object detection pipelines, the AP value can be used as a metric to compare their performance.
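The metrics above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the benchmark's official evaluation code: the function names are hypothetical, and the AP is approximated by averaging, over $L$ equal recall segments, the highest precision observed in each segment (segments without any observed point contribute zero).

```python
from typing import Iterable, Sequence, Tuple


def precision_recall(tp: Sequence[int], fp: Sequence[int],
                     fn: Sequence[int]) -> Tuple[float, float]:
    """Aggregate per-scene TP/FP/FN counts into dataset-level precision and recall."""
    tp_sum, fp_sum, fn_sum = sum(tp), sum(fp), sum(fn)
    precision = tp_sum / (tp_sum + fp_sum) if (tp_sum + fp_sum) > 0 else 0.0
    recall = tp_sum / (tp_sum + fn_sum) if (tp_sum + fn_sum) > 0 else 0.0
    return precision, recall


def average_precision(pr_points: Iterable[Tuple[float, float]],
                      num_segments: int = 40) -> float:
    """Approximate the area under a precision-recall curve.

    pr_points holds (recall, precision) pairs, one per confidence threshold.
    The recall interval [0, 1] is split into num_segments equal parts and the
    best precision found in each part is averaged (an assumed estimator).
    """
    best = [0.0] * num_segments
    for recall, precision in pr_points:
        if not 0.0 <= recall <= 1.0:
            raise ValueError("recall must lie in [0, 1]")
        # recall == 1.0 is assigned to the last segment.
        idx = min(int(recall * num_segments), num_segments - 1)
        best[idx] = max(best[idx], precision)
    return sum(best) / num_segments
```

With the counts quoted for the confidence threshold of 0.1 in a single scene (6 TPs, 2 FPs, 1 FN), `precision_recall([6], [2], [1])` yields a precision of 0.75 and a recall of 6/7.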
In the case of 3D scenes consisting of objects of multiple classes, we decompose the problem into several detection problems by considering each class separately. For instance, in a traffic scene consisting of cars, pedestrians, and bicycles, we would formulate three separate problems focusing on car detection, pedestrian detection, and bicycle detection, respectively. Finally, BEV approaches provide an alternative to directly detecting objects in a 3D point cloud. By projecting 3D point clouds onto a 2D plane corresponding to a top-down view, BEV allows the problem of 3D detection to be recast as 2D detection.
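The BEV projection can be illustrated with a short sketch. It assumes that the ground plane is spanned by the x and y axes, that points are given as `(x, y, z, rho)` tuples, and that each BEV cell stores the maximum height of the points falling into it; the function name, the cell size, and the choice of stored value are assumptions made for this example.

```python
import math
from typing import Dict, Iterable, Tuple

Point = Tuple[float, float, float, float]  # (x, y, z, rho)


def project_to_bev(points: Iterable[Point],
                   x_range: Tuple[float, float],
                   y_range: Tuple[float, float],
                   cell_size: float) -> Dict[Tuple[int, int], float]:
    """Project 3D points onto a top-down 2D grid, keeping the max height per cell."""
    x_min, x_max = x_range
    y_min, y_max = y_range
    grid: Dict[Tuple[int, int], float] = {}
    for x, y, z, _rho in points:
        # Points outside the region of interest are discarded.
        if not (x_min <= x < x_max and y_min <= y < y_max):
            continue
        cell = (math.floor((x - x_min) / cell_size),
                math.floor((y - y_min) / cell_size))
        grid[cell] = max(z, grid.get(cell, -math.inf))
    return grid
```

Only occupied cells appear in the returned dictionary, which mirrors the sparsity of LiDAR scenes; a dense 2D image can be obtained by filling the remaining cells with a default value.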

3D Object Detection Pipeline
Our proposed intensity-aware voxel encoder combines both geometric and intensity features of 3D point clouds and can be embedded within 3D object detection pipelines. To evaluate our intensity-aware voxel encoder, we embedded it within a single-stage 3D object detection pipeline. This single-stage 3D object detection pipeline is illustrated in Figure 5 and consists of a voxel-wise feature map generation stage, followed by a 3D backbone stage in which convolution operations are used to produce dense feature maps, and finally a 2D backbone stage for producing the final prediction, namely, object classification and 3D bounding box estimation. Within this pipeline, our proposed encoder generates voxel-wise feature maps by extracting geometric and intensity histogram features for each voxel in a parallel fashion. The three stages of the 3D object detection pipeline are described below.
Two voxel-feature-encoding (VFE) stages convert the sparse 3D point cloud into a dense feature representation. VFE stages operate voxel-wise as follows. Let $V$ be a collection of points $p_i$ within a given voxel. First, with an eye on computational efficiency, a subset of 35 points is randomly extracted from among all the points within each voxel. Given this random subset of points, an augmented representation $\hat{p}_i$ for each point $p_i$ is obtained by including the offset between each point and the mean of the voxel point cloud, whose coordinates are $(v_x, v_y, v_z)$. This augmented representation is defined as
$$\hat{p}_i = [x_i, y_i, z_i, \rho_i, x_i - v_x, y_i - v_y, z_i - v_z]^T.$$
Each augmented point $\hat{p}_i$ is then transformed using a fully connected network into a complex feature $f_i$. The purpose of this network is to aggregate element-wise features, and it encodes the shape of the surface presented within a voxel. The fully connected network consists of a linear layer, batch normalization, and a rectified linear unit layer. After obtaining the element-wise feature representation $f_i$, we perform max pooling on $f_i$ to obtain a locally aggregated feature $\tilde{f}_i$. Then, each complex feature $f_i$ is concatenated with $\tilde{f}_i$ to form a point-wise concatenated feature $f_i^{\mathrm{out}}$. Our intensity-aware voxel feature encoder includes two VFE blocks, namely, VFE-1 followed by VFE-2. The output from VFE-2 is concatenated with an intensity vector $I^{\mathrm{out}}$ produced by an intensity histogram generator that operates in a voxel-wise fashion. In our study, the intensity histogram generator uses 10 bins and normalised intensity values within the range of 0 to 1.
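The voxel-wise operations described above can be sketched in plain Python. This is a CPU illustration only and not the custom CUDA kernel used in the pipeline; the function names are hypothetical, points are assumed to be `(x, y, z, rho)` tuples with intensities already normalised to [0, 1], and the histogram is assumed to hold raw counts computed over whichever points are passed in (the source does not state whether counts are normalised or whether the histogram uses the sampled subset).

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float, float]  # (x, y, z, rho), rho in [0, 1]


def sample_voxel(points: Sequence[Point], max_points: int = 35,
                 seed: int = 0) -> List[Point]:
    """Randomly keep at most max_points points of a voxel."""
    if len(points) <= max_points:
        return list(points)
    return random.Random(seed).sample(list(points), max_points)


def augment_points(points: Sequence[Point]) -> List[Tuple[float, ...]]:
    """Append the offset of each point from the voxel mean (v_x, v_y, v_z)."""
    n = len(points)
    v_x = sum(p[0] for p in points) / n
    v_y = sum(p[1] for p in points) / n
    v_z = sum(p[2] for p in points) / n
    return [(x, y, z, rho, x - v_x, y - v_y, z - v_z)
            for x, y, z, rho in points]


def intensity_histogram(points: Sequence[Point], num_bins: int = 10) -> List[int]:
    """Count the points of a voxel falling in each of num_bins equal intensity bins."""
    counts = [0] * num_bins
    for p in points:
        rho = min(max(p[3], 0.0), 1.0)  # clamp to the normalised range
        # rho == 1.0 is assigned to the last bin.
        idx = min(int(rho * num_bins), num_bins - 1)
        counts[idx] += 1
    return counts
```

In a full encoder, the histogram vector of each voxel would be concatenated with the learned VFE-2 feature of the same voxel; because voxels are independent, one thread or thread block per voxel is a natural parallelisation.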

3D and 2D Backbone Stages
A 3D backbone stage inspired by [31] was implemented. This backbone performs 3D sparse convolutions, which aggregate additional context with the feature descriptor produced by the intensity-aware voxel feature encoder. After performing the 3D sparse convolutions and reshaping the feature vector, a 2D backbone implementing the RPN is used for classification and 3D bounding box estimation. The RPN has two output heads, namely, a classification head and a regression head. The classification head is used to predict an object's class, and the regression head is used to produce an estimation of the object's bounding box. To improve the computational efficiency of the RPN stage, we use a set of predefined bounding boxes called anchors, each of which is associated with a different object class. Specifically, in a scenario where we are interested in detecting objects from three classes, e.g., 'Car', 'Pedestrian', and 'Cyclist', three anchor boxes are created.
The detailed architectures of both 3D and 2D backbones follow the architectures presented in [31,36]. The 3D backbone stage uses sparse convolutions for dimensionality reduction and feature extraction. Convolutions are followed by batch normalization and rectified linear unit activation. Sparse convolutional layers use a kernel size of (3, 1, 1) and a stride of (2, 1, 1) and produce a feature map with 128 output channels. This feature map is then processed via the 2D backbone stage, which consists of two blocks, each having five layers of 2D convolutions. We use the notation Conv2D($C_{\mathrm{out}}$, $k$, $s$, $p$) to describe a 2D convolutional layer, where $C_{\mathrm{out}}$ represents the number of output channels, $k$ is the kernel size, $s$ stands for the stride, and $p$ denotes the padding. In the first block, we employ Conv2D(128, 3, 1, 1) layers, and in the second block, we employ Conv2D(256, 3, 1, 1) layers. The final feature map produced by the 2D backbone is then sent to the classification and regression heads for object class prediction and bounding box estimation.
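The effect of these layer parameters on feature map size can be checked with the standard convolution output-size formula, $\lfloor (n + 2p - k)/s \rfloor + 1$ for an input of size $n$ with unit dilation. The helper below is a hypothetical illustration; the zero padding used for the sparse convolution example and the input sizes are assumptions, as the source does not state them.

```python
def conv_output_size(n: int, k: int, s: int, p: int) -> int:
    """Output length along one axis for input size n, kernel k, stride s, padding p."""
    if n + 2 * p < k:
        raise ValueError("kernel does not fit in the padded input")
    return (n + 2 * p - k) // s + 1


# Conv2D(C_out, 3, 1, 1): kernel 3, stride 1, padding 1 keeps the spatial size.
same_size = conv_output_size(200, k=3, s=1, p=1)

# Kernel 3 with stride 2 along one axis (padding 0 assumed) roughly halves it.
reduced = conv_output_size(41, k=3, s=2, p=0)
```

Under these assumptions, the Conv2D(128, 3, 1, 1) and Conv2D(256, 3, 1, 1) layers change only the number of channels, while the strided sparse convolutions reduce the size of the axis along which the stride of 2 is applied.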

Experimental Setup
We used the KITTI dataset [37] to train and evaluate our proposed 3D object detection pipeline. Before training, we used the KITTI dataset and the Canadian Adverse Driving Condition (CADC) dataset [38] to explore the nature of the intensity value of LiDAR points in the context of object detection. The KITTI and CADC datasets, training environment, and evaluation approach are described below.

Datasets
The KITTI dataset [37] is a popular dataset used for autonomous driving applications that offers annotated 2D camera images (375 × 1242 pixels) and 3D LiDAR images of 15K urban traffic scenes, together with other navigation data. Labels in the KITTI dataset include the location, size, and orientation of every object. Object location, size, and orientation are represented using a 3D bounding box in the LiDAR 3D image and a 2D bounding box in the corresponding 2D image. Object class name, truncation level, and occluded state are also given for each bounding box. The KITTI dataset defines nine different object classes, namely, 'Car', 'Pedestrian', 'Cyclist', 'Van', 'Truck', 'Person (sitting)', 'Tram', 'Misc', and 'Don't-Care'. The truncation value describes the fraction of an object lying outside the image boundary. Finally, the occlusion level, which takes on the values 0 through 3, describes the degree to which an object is occluded by other objects in a scene, where 0 indicates clearly visible and increasing values indicate greater occlusions.
The CADC dataset [38] provides a collection of scenes captured under adverse weather conditions. The dataset consists of 56K 2D camera images with a resolution of 1280 × 1024 pixels and 7K LiDAR instances. The CADC dataset includes 10 annotation classes, namely, 'Car', 'Pedestrian', 'Truck', 'Bus', 'Garbage Container on Wheels', 'Traffic Guidance Object', 'Bicycle', 'Pedestrian With Object', 'Horse and Buggy', and 'Animal'. In this work, the CADC dataset is used to explore the impact of adverse weather conditions on LiDAR intensity distributions.
A preliminary exploration was carried out to investigate the possible impact of the surrounding environment, including scattering media such as rain, snow, and fog, on the LiDAR intensity values associated with traffic objects. We explored LiDAR scenes recorded in clear weather conditions from the KITTI dataset and scenes recorded in adverse weather conditions from the CADC dataset. Our preliminary exploration produced average profiles for the intensity distributions of objects from different classes. Intensity distributions were obtained using kernel density estimation (KDE), a non-parametric method that expresses a distribution as a linear combination of kernel functions centered around each dataset sample. In our implementation of KDE, we chose a Gaussian kernel. The bandwidth of Gaussian kernels is a parameter that needs to be set before applying KDE. We used Scott's estimation method to select the value of the bandwidth. Scott's method produces a bandwidth value that minimizes the mean integrated square error of the estimated distribution.
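A one-dimensional Gaussian KDE with a Scott-type bandwidth can be sketched as follows. The bandwidth convention used here, $h = \hat{\sigma}\, n^{-1/5}$ with $\hat{\sigma}$ the sample standard deviation, follows the one-dimensional form of Scott's factor as implemented in common libraries and is an assumption about the exact variant used in the study; the function names and the sample intensities are hypothetical.

```python
import math
from statistics import stdev
from typing import Callable, Optional, Sequence


def scott_bandwidth(samples: Sequence[float]) -> float:
    """Scott-type bandwidth for 1D data: sample std times n^(-1/5) (needs n >= 2)."""
    return stdev(samples) * len(samples) ** (-1.0 / 5.0)


def gaussian_kde(samples: Sequence[float],
                 bandwidth: Optional[float] = None) -> Callable[[float], float]:
    """Return a density estimate built from one Gaussian kernel per sample."""
    h = scott_bandwidth(samples) if bandwidth is None else bandwidth
    if h <= 0.0:
        raise ValueError("bandwidth must be positive")
    norm = 1.0 / (len(samples) * h * math.sqrt(2.0 * math.pi))

    def density(x: float) -> float:
        # Sum of Gaussian kernels centred on each sample.
        return norm * sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in samples)

    return density


# Hypothetical normalised intensities of the points belonging to one object.
intensities = [0.1, 0.2, 0.2, 0.4, 0.5]
pdf = gaussian_kde(intensities)
```

Averaging such per-object densities over all objects of a class on a common intensity grid would give one class-level profile; this averaging step is one plausible reading of the "average profiles" mentioned above.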

Training and Evaluation
We used the KITTI benchmark dataset [23] for training and evaluation. This benchmark dataset consists of 7481 training instances and 7518 testing instances. We split the benchmark training dataset further into two subsets consisting of 3712 instances for training and 3769 instances for validation. Validation was conducted in accordance with the protocol described in [2,9,27,36]. We trained and evaluated our 3D object detection pipelines using an RTX 3080 10 GB GPU and an AMD Ryzen 9 3900 CPU with the PyTorch-based MMDetection3D framework. The CUDA mixed-precision method was employed during training, allowing us to combine FP16 (16-bit, half-precision) and FP32 (32-bit, single-precision) floating-point formats to enhance computational speed. We trained our 3D object detection pipeline in an end-to-end fashion using the AdamW optimization algorithm with a decaying learning rate of 0.01.
We chose the AP as our prediction performance metric. We obtained AP values separately for the object classes 'Car', 'Pedestrian', and 'Cyclist', and within each class for 3D point cloud detection and BEV detection. We followed the KITTI benchmark evaluation criterion, which is based on the PASCAL criterion [37,39], for 3D point clouds and 2D BEV point clouds. According to this criterion, different classes use different IoU thresholds $O_T$ to produce a match between an object in $Y$ and an object in $\hat{Y}$. Specifically, the threshold value $O_T$ is 0.7 for 'Car' objects and 0.5 for both 'Pedestrian' and 'Cyclist' objects. In addition to the AP, we obtained the detection rate of each 3D object prediction pipeline, measured in frames per second (FPS). In our study, we report both the detection rate values originally reported by the authors and those measured on our hardware. Finally, three evaluation scenarios of different difficulty levels, namely, 'Easy', 'Moderate', and 'Hard', were considered. Each difficulty level defines a set of constraints on the characteristics of the objects that are included, and they are defined in Table 1. For instance, at the 'Easy' difficulty level, only objects that are fully visible, truncated up to 15%, and have a bounding box whose height is at least 40 pixels are included. We compared our 3D object detection pipeline against state-of-the-art models in terms of AP and detection rate. We used the AP values reported by their authors and obtained new detection rate values using our hardware to ensure fairness.
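The evaluation criteria can be expressed as a small filter. The 'Easy' constraints and the per-class IoU thresholds below are the ones stated in the text; the 'Moderate' and 'Hard' constraints are taken from the public KITTI benchmark description and are assumed, not verified here, to match Table 1. The function names and the tuple layout are hypothetical.

```python
# (minimum box height in pixels, maximum occlusion level, maximum truncation)
DIFFICULTY = {
    "Easy": (40, 0, 0.15),       # stated in the text
    "Moderate": (25, 1, 0.30),   # assumed from the public KITTI description
    "Hard": (25, 2, 0.50),       # assumed from the public KITTI description
}

# IoU thresholds O_T per class, as stated in the text.
IOU_THRESHOLD = {"Car": 0.7, "Pedestrian": 0.5, "Cyclist": 0.5}


def is_included(height_px: float, occlusion: int, truncation: float,
                level: str) -> bool:
    """Check whether an object satisfies the constraints of a difficulty level."""
    min_height, max_occlusion, max_truncation = DIFFICULTY[level]
    return (height_px >= min_height
            and occlusion <= max_occlusion
            and truncation <= max_truncation)


def is_match(iou: float, object_class: str) -> bool:
    """A prediction matches a ground truth object when the IoU exceeds O_T."""
    return iou > IOU_THRESHOLD[object_class]
```

For example, an IoU of 0.6 would count as a match for a 'Pedestrian' object but not for a 'Car' object, which illustrates why the 'Car' benchmark is the stricter one in terms of localization.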

Results
Figure 7 illustrates the effects of the weather conditions on the reflected intensities of LiDAR cloud points. Compared to clear weather conditions (Figure 7a), the spatial distribution of LiDAR points when there are adverse weather conditions (Figure 7b) is noisy due to collisions with air particles that do not correspond to true objects in a scene. In addition, the intensity values in adverse weather conditions are lower than those under clear weather conditions. LiDAR intensity values can therefore provide useful information about scenes that can contribute to the interpretation of LiDAR point spatial distribution. We also observed that different scene objects reveal different intensity profiles. The average intensity distributions for the objects 'Car', 'Van', 'Truck', 'Pedestrian', 'Person (sitting)', and 'Cyclist' in scenes from the KITTI dataset are shown in Figure 8. Each object reveals a unique intensity profile; consequently, the intensity distribution of the LiDAR points associated with an object can provide information about its class. We based the design of our proposed encoder on this fundamental observation, namely, that LiDAR intensity distributions depend on the class of the underlying object. Therefore, the LiDAR intensity distribution can be used to improve object classification.
Tables 2-4 compare the AP values of the selected state-of-the-art 3D object detection models for 'Car', 'Pedestrian', and 'Cyclist' objects, respectively. The highest AP values in the 'Hard' evaluation scenario are highlighted, as they provide the strongest comparison between existing detection models. Despite its simplicity, the 3D object detection pipeline that we have designed to illustrate our proposed intensity-aware voxel encoder has a performance that is comparable with state-of-the-art models. Our proposed model ranks third in the AP benchmark for the 'Car' object, as shown in Table 2. Moreover, for the 'Pedestrian' and 'Cyclist' classes, our 3D object detection pipeline demonstrates superior performance, as shown in Tables 3 and 4, respectively. Therefore, not only does this comparatively simpler architecture have a performance that is close to that of the more complex state-of-the-art models in regard to the 'Car' benchmark, it also surpasses this performance for the 'Pedestrian' and 'Cyclist' benchmarks.
Computational performance is compared in Tables 5 and 6, which show the detection rates reported by the authors and the detection rates obtained in our computing environment, respectively. Our 3D object detection pipeline, which includes our proposed intensity-based encoder, achieved the second-highest detection rate, namely, 40.7 FPS, among the considered 3D object detection pipelines. It is worth noting that PointPillars, the model that achieved the highest frame rate, also presents a comparatively lower AP. The detection and computational performance of every 3D object detection pipeline under consideration are summarized in Figure 9, where the detection rate and AP for 'Car', 'Pedestrian', and 'Cyclist' objects under 3D detection and BEV detection are represented as coordinates on a 2D plane. The highest-performing models are situated close to the upper right corner, where AP is close to 100 and the detection rate is close to the real-time value of 60 FPS. Figure 9 demonstrates that the simple 3D object detection pipeline built around our intensity-aware voxel encoder achieves results that are superior to those obtainable by state-of-the-art models.

Conclusions and Discussion
In this paper, we have presented an intensity-aware voxel encoder for 3D LiDAR object detection and localization. The proposed encoder achieves AP values comparable to the state-of-the-art models while yielding higher detection rates during inference. In addition to this, a computationally efficient implementation of a voxel-wise histogram generator has been developed. Our results indicate that 3D object detection pipelines simpler than the state of the art can be developed to achieve accurate and robust 3D detection. The combination of our feature extractor and histogram generator can contribute to the development of 3D object detection models with higher inference rates.
Our preliminary analysis of the reflected intensity values of points associated with objects of different classes in the KITTI dataset suggests that each class of objects has a different intensity signature; therefore, LiDAR intensity values can be useful during 3D LiDAR object detection. In addition, we compared the intensity values from scenes under favorable and adverse weather conditions and observed the impact of rain and snow on the spatial distribution of LiDAR points and their intensity values. Based on this preliminary exploration, we proposed an intensity-aware voxel encoder that generates intensity histograms of point clouds within each voxel to capture the intensity profiles of objects within voxels. Voxel intensity histograms were then integrated as features together with conventional voxel feature sets.
We built a simple 3D object detection pipeline that included our intensity-aware voxel encoder to evaluate the potential impact of voxel intensity features on detection performance. Using the KITTI test dataset, we compared the detection performance and computational performance of our 3D object detection pipeline with state-of-the-art 3D object detection pipelines. Our 3D object detection pipeline outperforms the state-of-the-art models in regard to the 'Pedestrian' and 'Cyclist' classes. Although grid- and point-set-based feature encoders can implicitly consume intensity information together with the spatial distribution of 3D point clouds, our proposed encoder stands out in that it explicitly builds LiDAR intensity distributions by generating intensity histograms.
Current state-of-the-art 3D object detection pipelines have significantly improved over the past years with respect to their performance on clear-weather datasets, but their performance deteriorates considerably when they are applied to adverse weather scenarios [15,43]. The reason for this deterioration is that LiDAR 3D point clouds are highly susceptible to adverse weather conditions or scattering media (fog, snow, rain, or dust) [17][18][19][20][21]. Therefore, adverse weather conditions produce noisy 3D point clouds that result in poor 3D object detection accuracy. To contribute to the development of 3D perception, academic and industrial partners have shared their pools of ever-growing datasets obtained under different environmental conditions while using different sensing modalities including LiDAR, cameras, and radar. Most of these benchmark datasets were recorded in clear weather [7,23,44], and some were obtained under adverse weather conditions [24,25,45]. We hypothesize that intensity-aware encoders might also improve 3D object detection performance under adverse weather conditions. Therefore, a future avenue for research on our proposed model will be to investigate its impact on 3D object detection performance under a wider range of weather conditions. This investigation should include adverse weather augmentation approaches to create controlled datasets of different complexity. Having robust 3D object detection pipelines that perform well under any weather conditions and in real time is essential to achieve level 5 autonomy, which is defined as autonomy with no human intervention in any driving conditions. Due to the diversity of hazardous situations, collecting a complete dataset seems impractical. Employing simulation techniques based on physical and behavioral models of traffic objects and actors in different weather conditions is a promising direction. Creating custom simulated environments with complex situations can help develop robust and accurate 3D detection and tracking models for autonomous vehicles.

Figure 1. LiDAR-annotated scene consisting of a 3D point cloud and seven labelled objects indicated by green 3D bounding boxes.

Figure 1 .
The bounding boxes of the objects in Y are also shown, demonstrating that the bounding box of every object in Y overlaps with the bounding box of one object in Ŷ. In one case, the bounding box of one of the objects in Y overlaps with the bounding boxes of two objects in Ŷ. If both predicted bounding boxes are matched to the true object, non-max suppression is triggered to select the predicted object with the highest IoU value.

Figure 2. Bounding boxes corresponding to true objects (green) and objects predicted by a detection pipeline (blue). True and predicted objects are matched by computing the degree of overlap between their bounding boxes.
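
The overlap-based matching described in these captions can be sketched as follows. This is a minimal illustration and not the evaluation code used in this work: it assumes axis-aligned 2D boxes given as (x_min, y_min, x_max, y_max), whereas the benchmark works with oriented 3D boxes, and the function names `iou` and `match_boxes` and the 0.5 threshold are hypothetical choices made for the example.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_boxes(true_boxes, pred_boxes, threshold=0.5):
    """Match each true box to the unused predicted box with the highest IoU.

    Returns a dict {true_index: pred_index}. Predictions left unmatched
    would count as false positives. The threshold is an assumed value.
    """
    matches = {}
    used = set()
    for t, true_box in enumerate(true_boxes):
        best_pred, best_iou = None, threshold
        for p, pred_box in enumerate(pred_boxes):
            if p in used:
                continue
            value = iou(true_box, pred_box)
            if value >= best_iou:
                best_pred, best_iou = p, value
        if best_pred is not None:
            matches[t] = best_pred
            used.add(best_pred)
    return matches
```

When two predictions overlap the same true object, only the one with the higher IoU is kept as the match, which mirrors the selection step described in the caption.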

Figure 3. Superimposed on the annotated scene shown in Figure 1 are the bounding boxes of the objects detected by a pipeline l using a confidence score of 10% (a) and 90% (b). Ground truth objects are enclosed in green bounding boxes, whereas predicted bounding boxes are blue. Predicted bounding boxes that match ground truth ones are TPs, whereas those that do not match a ground truth bounding box are FPs.

Figure 4 illustrates the principle of BEV. A 2D camera image and 3D LiDAR point cloud captured during the same urban traffic scene are shown in Figure 4a,b, respectively. The BEV point cloud resulting from projecting the 3D LiDAR point cloud onto a top-down view plane is shown in Figure 4c. Objects in the original 3D point cloud scene also appear in the BEV point cloud, which allows us to recast a 3D object detection and localization problem as a BEV 2D object detection and localization problem.
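
The top-down projection can be sketched in a few lines. The example below is only an illustration under stated assumptions: the function name `to_bev`, the x/y ranges, the cell size, and the choice of storing the maximum height per cell are all hypothetical and are not taken from this work.

```python
def to_bev(points, x_range=(0.0, 4.0), y_range=(-2.0, 2.0), cell=1.0):
    """Project (x, y, z) points onto a top-down grid.

    Each cell stores the largest z value of the points that fall into it,
    or None if the cell is empty. Ranges and cell size are illustrative.
    """
    n_rows = int((x_range[1] - x_range[0]) / cell)
    n_cols = int((y_range[1] - y_range[0]) / cell)
    grid = [[None] * n_cols for _ in range(n_rows)]
    for x, y, z in points:
        # Discard points outside the region of interest.
        if not (x_range[0] <= x < x_range[1] and y_range[0] <= y < y_range[1]):
            continue
        # Clamp to guard against floating-point rounding at the upper edge.
        row = min(int((x - x_range[0]) / cell), n_rows - 1)
        col = min(int((y - y_range[0]) / cell), n_cols - 1)
        if grid[row][col] is None or z > grid[row][col]:
            grid[row][col] = z
    return grid
```

Dropping the vertical axis in this way keeps the ground-plane layout of the scene, which is why objects that are separable in the 3D point cloud usually remain separable in the BEV grid.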

Figure 4. Example of a scene from the KITTI dataset: (a) 2D camera image, (b) 3D LiDAR point cloud, and (c) resulting 2D BEV point cloud. The BEV point cloud is produced by projecting the 3D LiDAR point cloud onto a top-down view plane. Objects recognizable in the 3D LiDAR point cloud (b) are also recognizable in the 2D BEV point cloud (c).

Figure 5. Our proposed 3D object detection pipeline consists of three stages, namely, an intensity-aware voxel encoder, which includes intensity features; a 3D backbone for dense feature extraction; and a 2D backbone that produces the final prediction (object classification and bounding box estimation).
3.2.1. Intensity-Aware Voxel Feature Encoding
Our proposed voxel encoder is illustrated in Figure 6. The first step is scene voxelization. Given a 3D box with dimensions of $D \times H \times W$ containing scene $S$ and a predefined voxel with dimensions of $v_D \times v_H \times v_W$, we first partition the 3D box into a grid with the following dimensions: $T_D \times T_H \times T_W = (D/v_D) \times (H/v_H) \times (W/v_W)$, i.e., the number of voxels along each axis is the scene extent divided by the voxel extent. Once the voxel grid has been generated, the points occupying each voxel are identified and grouped. Due to the sparse nature of 3D point clouds, some voxels contain a large number of points, whereas others contain only a few or none.
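
The voxelization step can be sketched as follows, with the number of voxels along each axis computed as the scene extent divided by the voxel extent. This is a minimal sketch and not the implementation used in this work: the function name `voxelize`, the (x, y, z, intensity) point layout, the axis ordering, and the truncation of over-full voxels are assumptions. The cap of 35 points per voxel is taken from the tensor shape given in the Figure 6 caption.

```python
from collections import defaultdict


def voxelize(points, origin, scene_size, voxel_size, max_points=35):
    """Group points into voxels.

    points:      iterable of (x, y, z, intensity) tuples
    origin:      minimum corner of the scene box, one value per axis
    scene_size:  extent of the scene box along each axis
    voxel_size:  extent of one voxel along each axis
    Returns (grid_dims, voxels), where voxels maps a voxel index (i, j, k)
    to the list of points it contains. Only occupied voxels are stored.
    """
    # Number of voxels per axis: scene extent divided by voxel extent.
    grid_dims = tuple(int(round(s / v)) for s, v in zip(scene_size, voxel_size))
    voxels = defaultdict(list)
    for p in points:
        idx = tuple(int((p[k] - origin[k]) // voxel_size[k]) for k in range(3))
        # Skip points that fall outside the scene box.
        if not all(0 <= i < n for i, n in zip(idx, grid_dims)):
            continue
        # Assumed policy: keep the first max_points points, drop the rest.
        if len(voxels[idx]) < max_points:
            voxels[idx].append(p)
    return grid_dims, dict(voxels)
```

Storing only occupied voxels in a dictionary reflects the sparsity noted above: most cells of the grid contain no points at all.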

Figure 6. Architecture of the proposed intensity-aware voxel encoder. After voxelization, a scene is represented as a tensor with dimensions of $T_v \times 35 \times 4$, where $T_v = T_D \times T_H \times T_W$, and $T_D$, $T_H$, and $T_W$ are the number of voxels along the depth, height, and width dimensions of the scene. After augmentation, a tensor whose dimensions are $T_v \times 35 \times 7$ is generated and then processed via cascaded encoders VFE-1 and VFE-2. A voxel-wise intensity histogram $I_{out}$, whose dimensions are $T_v \times 10$, is concatenated to the output of VFE-2, whose dimensions are $T_v \times 128$, to produce the final $T_v \times 138$ voxel-wise feature map.
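
The histogram branch of the encoder can be illustrated with a short sketch. Only the sizes (10 histogram bins, a 128-dimensional VFE-2 output, a 138-dimensional result) come from the caption above. The rest is assumed for the sake of the example: the function names, intensities normalized to [0, 1], uniform bin edges, and normalization of the counts by the number of points in the voxel.

```python
def intensity_histogram(intensities, n_bins=10):
    """Histogram of point intensities within one voxel.

    Assumes intensities lie in [0, 1] and uses uniform bins. Counts are
    divided by the number of points so that the bins sum to 1 (an assumed
    normalization). An empty voxel yields an all-zero histogram.
    """
    hist = [0.0] * n_bins
    for rho in intensities:
        # Clamp so that rho == 1.0 falls into the last bin.
        b = max(0, min(int(rho * n_bins), n_bins - 1))
        hist[b] += 1.0
    total = len(intensities)
    return [h / total for h in hist] if total else hist


def augment_voxel_feature(vfe_feature, intensities, n_bins=10):
    """Concatenate a voxel's learned feature vector with its intensity histogram."""
    return list(vfe_feature) + intensity_histogram(intensities, n_bins)
```

Applied to a 128-dimensional feature vector, `augment_voxel_feature` returns a 138-dimensional vector, matching the per-voxel width of the final feature map.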

Figure 7. Examples of 3D LiDAR point cloud scenes where the reflected intensity $\rho_i$ has been color-coded. (a) KITTI instance taken in clear weather conditions. (b) CADC instance taken during adverse weather conditions, where the highlighted region (red ellipse) shows low intensity values.

Table 2. Performance comparison of 3D object detection pipelines on the KITTI test set for the 'Car' object. "L" indicates a LiDAR-only method, while "R + L" indicates a multi-modality method including both RGB images and LiDAR sensors.

Table 3. Performance comparison of 3D object detection pipelines on the KITTI test set for the 'Pedestrian' object. "L" indicates a LiDAR-only method, while "R + L" indicates a multi-modality method including both RGB images and LiDAR sensors.

Table 4. Performance comparison of 3D object detection pipelines on the KITTI test set for the 'Cyclist' object. "L" indicates a LiDAR-only method, while "R + L" indicates a multi-modality method including both RGB images and LiDAR sensors.