NeXtFusion: Attention-Based Camera-Radar Fusion Network for Improved Three-Dimensional Object Detection and Tracking

Abstract: Accurate perception is crucial for autonomous vehicles (AVs) to navigate safely, especially in adverse weather and lighting conditions where single-sensor networks (e.g., cameras or radar) struggle with reduced maneuverability and unrecognizable targets. Deep Camera-Radar fusion neural networks offer a promising solution for reliable AV perception under any weather and lighting conditions. Cameras provide rich semantic information, while radars act like X-ray vision, piercing through fog and darkness. This work proposes a novel, efficient Camera-Radar fusion network called NeXtFusion for robust AV perception with an improvement in object detection accuracy and tracking. Our proposed approach of utilizing an attention module enhances crucial feature representation for object detection while minimizing information loss from multi-modal data. Extensive experiments on the challenging nuScenes dataset demonstrate NeXtFusion's superior performance in detecting small and distant objects compared to other methods. Notably, NeXtFusion achieves the highest mAP score (0.473) on the nuScenes validation set, outperforming competitors like OFT (35.1% improvement) and MonoDIS (9.5% improvement). Additionally, NeXtFusion demonstrates strong performance in other metrics like mATE (0.449) and mAOE (0.534), highlighting its overall effectiveness in 3D object detection. Furthermore, visualizations of nuScenes data processed by NeXtFusion further demonstrate its capability to handle diverse real-world scenarios. These results suggest that NeXtFusion is a promising deep fusion network for improving AV perception and safety for autonomous driving.


Introduction
There has been rapid advancement in the development of sensing systems for autonomous driving in recent years, notably elevating the effectiveness of perception tasks such as object detection. Despite these achievements in research and development, there remains a lack of widespread adoption of level 4 or 5 autonomous driving capabilities in commercial vehicles, owing to autonomous vehicles' (AVs') reliance on single-sensor perception of the real world and the substantial research-and-development commitment required to guarantee the continuous evolution and enhancement of the technology over time [1]. Furthermore, the prediction and decision-making processes in AVs that rely on a single sensor can be hampered by external factors such as bad weather, occlusion, or poor lighting conditions, because cameras struggle in low-light environments, whereas radars cannot detect objects with rich visual features. This limitation of cameras and radar, and the potential consequences of reliance on a single sensor for object detection in AVs, has generated significant attention in the research field toward the utilization of multi-modal sensing in the automotive domain, especially in perception systems that fuse camera and radar inputs [2][3][4].
An ideal fusion system utilizing both camera and radar sensor information can effectively leverage the advantages of both sensors while concurrently addressing the limitations inherent in each. While a camera offers detailed texture and semantic information, its performance diminishes with long-range small objects, occlusions, and poor lighting conditions; radar, in contrast, offers reliable performance in all weather and lighting conditions, detects small objects at long ranges, and operates without hindrance from occlusions. However, radar sensors encounter difficulties in precisely identifying objects due to the absence of detailed texture and semantic features [2,5,6]. The work presented within this paper revolves around the primary objective of determining how to effectively harness the benefits of both modalities (camera and radar sensors) to attain precise and dependable object detection.
An optimal Camera-Radar fusion network must capitalize on the advantages offered by both sensors. Simultaneously, it should also guarantee that the limitations of one sensor do not impact the performance of the other. Previous studies in the fusion of camera and radar modalities have often employed the mapping of radar data onto the camera's data [7]. However, this technique imposes limitations on performance, particularly in scenarios involving object occlusion, thus resulting in inefficient use of radar sensor data. More sophisticated, cutting-edge methodologies engage in fusion at the feature level instead of directly mapping features. For instance, the approach proposed for the AVOD network [8] extracts bird's-eye-view features concurrently from the camera and the radar's sensor input. This approach then merges these features on a per-object basis to capitalize on the unique information extracted by each modality to perform camera-radar sensor fusion. However, it is observed that the concurrent approach of extracting features and performing fusion does not hold up in instances where the camera sensor becomes unreliable, such as with occluded objects or adverse conditions like rain or fog. In such scenarios, radar-based sensors are unaffected and work well, but the reliability of camera-based sensors can significantly decrease, resulting in a notable overall performance decline for object detection in the AV system under such adverse conditions.
Evidently, there is a necessity to enhance the dependability of camera-radar systems to attain satisfactory performance, particularly in situations where the quality of camera input is compromised by external factors. The multi-modal sensor fusion network proposed within this paper posits that by independently extracting valuable information from both camera and radar sensors, one can leverage the advantages of each modality without compromising either when one degrades due to external factors. This multi-modal fusion approach is rooted in the acknowledgment that cameras and radars offer complementary attributes. The detailed texture and semantic-rich information from cameras can be utilized to identify multiple different objects, while radars offer the advantage of detecting objects over long distances, unobstructed by occlusion, and remain reliable in adverse weather conditions such as fog or rain. Thus, this ability to individually extract information serves as a method to enhance the overall reliability of the 3D object detection and tracking neural network, named NeXtFusion, proposed within this paper.
The NeXtFusion network utilizes point-cloud data representation for object detection, extracting all information obtained from the radar sensor(s), whereas the semantic information obtained from the camera sensor(s) primarily serves to distinctly identify objects within an input image/video. Drawing inspiration from the authors' most recent work, NeXtDet [9], significant advancements are proposed in this paper, where the literature on camera-based semantic information extraction is relied upon to independently and efficiently extract semantic-rich features from camera-based RGB images and fuse them with the radar's point-cloud data, as illustrated in Figure 1. The structure of this paper is organized as follows: Section 2 offers the reader information on related literature and studies. Section 3 provides information about the methodology of the proposed multi-modal sensor-fusion 3D object detection model called NeXtFusion. Section 4 assesses the performance of the proposed network through benchmarking on a multi-modal large-scale autonomous driving dataset and visualizing the results. Finally, Section 5 presents the concluding remarks for this paper.

Camera-Based Object Detection
A convolutional neural network (CNN) represents a subset of the broader deep neural network (DNN) family extensively utilized in the realm of AI. It is frequently employed for devising inventive methodologies and algorithms within the computer vision realm, facilitating complex tasks such as image classification and object detection. This is achieved through the utilization of multiple layers of neurons, simulating the natural visual perception of human beings [10]. Object detection involves a computer vision process aimed at identifying instances of objects belonging to a specific class, categorizing the types of objects, pinpointing their locations, and precisely labeling them within an input image or video.
A state-of-the-art object detector essentially pinpoints the position and type of objects within an image [9]. It is a system with three main modules (parts):
1. Backbone: This module acts as the foundation, extracting salient features from the image and producing a compressed representation through a robust image classifier. Imagine it like a skilled photographer capturing the essence of a scene.
2. Neck: This module acts as the bridge, connecting the backbone and skillfully merging features extracted from different levels of the backbone. Consider it like a data sculptor, gathering and harmonizing different perspectives to create a richer understanding.
3. Head: This module is the decision maker, responsible for drawing bounding boxes around objects and classifying their types. Think of it as a detective, analyzing the combined information and identifying what each object is and where it lies.
Modern object detectors employ a specialized component known as the head, tasked with both object localization and classification, as mentioned above. Two predominant approaches prevail in head design: single-stage and two-stage architectures.

Single-stage detectors, epitomized by YOLO [11], SSD [12], RetinaNet [13], CornerNet [14], and CenterNet [15], prioritize speed and efficiency. They execute both tasks, generating bounding boxes and classifying object types, within a single, unified module. This streamlined approach enables faster inference, making them attractive for real-time applications on computationally constrained platforms like edge devices. However, their focus on efficiency might compromise detection accuracy compared to their two-stage counterparts. In contrast, two-stage detectors, exemplified by the R-CNN family (including Fast R-CNN [16], Faster R-CNN [17], Mask R-CNN [18], Cascade R-CNN [19], and Libra R-CNN [20]), prioritize accuracy over speed. These detectors follow a two-step process: first, a dedicated region proposal network (RPN) generates potential object regions within the image. Subsequently, these proposed regions are forwarded to a separate module for classification and precise bounding-box refinement. This division of labor leads to superior detection performance but incurs a higher computational cost, limiting their real-time capabilities.
When choosing an object detector, striking a balance between accuracy and inference speed is crucial. Single-stage detectors like YOLO offer impressive speed, making them suitable for real-time applications like autonomous vehicles. However, for tasks demanding high accuracy, two-stage detectors like Faster R-CNN might be preferred despite their slower performance [21]. Ultimately, the optimal choice hinges on the specific requirements and resource constraints of the application.
The neck of an object detector acts as a bridge, meticulously combining features extracted from various depths within the backbone. This intricate network of interconnected pathways, both descending (top-down) and ascending (bottom-up), allows for the seamless integration of multi-scale information. Popular strategies employed within the neck include the addition of specialized modules, like spatial pyramid pooling [22], which further enhance feature-fusion capabilities. Alternatively, path-aggregation blocks such as feature pyramid networks [23] or path-aggregation networks [24] may be employed. The backbone architecture typically relies on a resilient image classifier for implementation, such as CondenseNeXt [25], an effective and resilient image classification network developed to efficiently utilize the computational resources needed for real-time inference on edge devices with limited processing power. The CondenseNeXt CNN is employed as the backbone of the NeXtDet object detection network and has been further modified to achieve even greater lightweight characteristics than its original image-classification design by deprecating the final classification layers, ensuring compatibility and facilitating the link between the backbone and the neck module of NeXtDet. Thus, the NeXtDet network serves as the foundational reference point for the multi-modal sensor-fusion research and enhancements presented within this paper.

Radar Point Cloud and Camera-Based Sensor Fusion for Object Detection
Radar sensors actively perceive the surroundings by constantly emitting radio waves and analyzing the reflected waves to determine the position and speed of objects. The world of radar signal processing for autonomous vehicles offers a rich tapestry of techniques to extract meaningful information from the sensor's output, for example, Doppler processing [26], occupancy grid maps [27], multi-input multi-output (MIMO) radar [28], and representations using radar point clouds [29]. The work presented in this paper focuses on radar point-cloud data representation for radar signal processing using the proposed deep neural network for 3D object detection.
Typically, radars in automotive applications report objects as 2D reference points in bird's eye view (BEV), synthesizing information on the radial distance to the object. Each radar detection provides information about the object's instantaneous velocity in the radial direction, along with the azimuth angle. The azimuth angle represents the horizontal angle of the object with respect to the observer, which, in this case, is the radar sensor. The azimuth angle is crucial for determining the location of detected objects in the radar's field of view. Figure 2 provides a graphical representation of the difference between the radial velocity and the true velocity.
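The gap between radar-reported radial velocity and true velocity can be illustrated with a short sketch (the radar is assumed at the origin, and the positions and speeds below are hypothetical example values):

```python
import math

def radial_velocity(px, py, vx, vy):
    """Project an object's true velocity (vx, vy) onto the line of
    sight from a radar at the origin to the object at (px, py)."""
    r = math.hypot(px, py)
    # Unit vector from radar to object, dotted with the velocity,
    # gives the radial component the radar actually measures.
    return (px * vx + py * vy) / r

# Object A moving directly away along the line of sight:
# true speed equals radial speed.
vA = radial_velocity(10.0, 0.0, 5.0, 0.0)

# Object B moving at the same true speed, but at 45 degrees to the
# line of sight: the radar under-reports its speed.
vB = radial_velocity(10.0, 10.0, 5.0, 0.0)

print(vA)  # 5.0
print(vB)  # ≈ 3.54 m/s (= 5/√2), less than the true 5.0 m/s
```

This is exactly the situation depicted for objects A and B in Figure 2.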

Figure 2. Comparison between velocities: the true velocity (v_A) of object A is equal to its radial velocity (v_r), whereas, for object B, the radar-reported radial velocity (v_r) indicated by the red arrow is not equal to the true velocity (v_B).

The Proposed NeXtFusion 3D Object Detection Network
This paper introduces NeXtFusion, a novel approach to 3D object detection for autonomous vehicles. It leverages sensor fusion, combining camera and radar data within a unified framework. By building upon the efficient NeXtDet framework [9], NeXtFusion prioritizes both performance and resource efficiency, making it particularly suitable for real-time applications. The key contribution of the work presented in this paper is in adapting the NeXtDet network for multi-modal sensor fusion and significantly improving 3D object detection through a lightweight object detection architecture. Section 4 provides details about extensive nuScenes experiments performed in order to demonstrate the proposed NeXtFusion network's ability to surpass existing benchmarks. These results pave the way for more robust and resource-conscious object detection for fully autonomous driving applications. Figure 3 illustrates the NeXtDet architecture that serves as the foundation for the proposed NeXtFusion network.

Backbone
The core of any cutting-edge object detector is a powerful image-processing module called the backbone. This module scans an input image and distills features at different depths (levels of detail). Backbones typically rely on a resilient image classifier for implementation, one developed to efficiently utilize the computational resources needed for real-time inference on edge devices with limited processing power. NeXtDet utilizes the CondenseNeXt CNN [25] in its backbone module, which also serves as the backbone for the proposed NeXtFusion network.
CondenseNeXt belongs to the DenseNet [30] family. It utilizes an innovative approach to capture spatial details from individual layers and transmit them in a feed-forward manner to all subsequent layers. This process enables the extraction of information at varying coarseness levels. At the core of CondenseNeXt are several dense blocks. Additionally, it incorporates depthwise separable convolution and pooling layers between these blocks to alter feature-map dimensions accordingly. This strategy facilitates more effective extraction of features at diverse resolutions from an input image. Subsequently, the obtained information is fused to mitigate the vanishing-gradient issue, resulting in efficient inference and a reduced size of the trained weights, owing to a reduction in both the number of parameters and floating-point operations (FLOPs), as seen in [10].
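The parameter savings of depthwise separable convolution, one ingredient of CondenseNeXt's efficiency, can be shown with simple arithmetic (the layer sizes below are hypothetical examples, not CondenseNeXt's actual configuration):

```python
def conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Depthwise k x k convolution (one filter per input channel)
    followed by a 1 x 1 pointwise convolution that mixes channels."""
    return k * k * c_in + c_in * c_out

# Example layer: 3 x 3 kernel mapping 128 channels to 256.
standard = conv_params(3, 128, 256)                  # 294,912 weights
separable = depthwise_separable_params(3, 128, 256)  # 33,920 weights
print(standard // separable)  # roughly 8x fewer parameters
```

The same factorization also reduces the FLOP count by a similar ratio, which is why it features so prominently in edge-oriented backbones.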

Neck
In modern object detectors, the neck module plays an important role as a feature-fusion hub. It collects feature maps extracted at different depths within the backbone, often using pyramid networks such as feature pyramid networks (FPNs) [23] and path-aggregation networks (PANs) [24]. NeXtDet [9], for instance, leverages a combined PAN-FPN architecture. While FPNs generate feature maps of various sizes to capture diverse information, merging them can be challenging due to size discrepancies. To bridge this gap, a PAN is integrated with an FPN and upsampled using nearest-neighbor interpolation. This allows bottom-up features, rich in positioning information, to connect with top-down features, strong in semantic understanding. This fusion, visualized in Figure 4, ultimately enhances the network's performance.
Spatial pyramid pooling (SPP) [22] presents an innovative max-pooling technique designed to enhance the accuracy of CNNs. It achieves this by pooling the responses of each filter within individual local spatial bins, preserving the spatial information. This concept draws inspiration from the well-known bag-of-words approach [31] in computer vision. The strategy employs three distinct sizes of max-pooling operations to discern analogous feature maps, regardless of the diverse resolutions of input feature patterns.
After applying max pooling, the resulting information is flattened and merged before being fed into the fully connected layer. This final layer delivers an output of fixed size, independent of the initial input dimensions, as shown in Figure 5. Notably, it is the fully connected layer, not the convolution layer, that restricts the final output size of CNNs. This integration typically occurs in the later stages of feature fusion. Figure 5 showcases the SPP block employed within the neck of the proposed NeXtFusion architecture.
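The fixed-size property of SPP can be sketched as follows (a single-channel toy example; real SPP pools each filter's response map separately, and the bin counts here are illustrative):

```python
import numpy as np

def spp(feature_map, levels=(1, 2, 4)):
    """Spatial-pyramid max pooling: pool a (H, W) feature map into
    n x n bins for each pyramid level, then concatenate.  The output
    length (1 + 4 + 16 = 21 here) is fixed regardless of H and W."""
    h, w = feature_map.shape
    pooled = []
    for n in levels:
        # Bin edges for an n x n grid laid over the feature map.
        ys = np.linspace(0, h, n + 1, dtype=int)
        xs = np.linspace(0, w, n + 1, dtype=int)
        for i in range(n):
            for j in range(n):
                pooled.append(feature_map[ys[i]:ys[i + 1],
                                          xs[j]:xs[j + 1]].max())
    return np.array(pooled)

# Different input resolutions yield the same fixed-length descriptor,
# which is what lets the fully connected layer accept any input size.
print(spp(np.random.rand(13, 17)).shape)  # (21,)
print(spp(np.random.rand(32, 32)).shape)  # (21,)
```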

Head
The final stage of an object detector, the head module, takes center stage in defining bounding boxes and generating detailed detection results like object class, confidence score, location, and size. To achieve this, modern detectors often employ multiple head modules working together. These modules share features extracted earlier in the network and specialize in accurately identifying objects and predicting their confidence scores. In the NextDet architecture [9], three distinct heads, each equipped with a spatial attention module (SAM), tackle this crucial task. The SAM module, originally introduced in the convolutional block attention module (CBAM) [32], plays a key role in feature aggregation. It accomplishes this by creating a spatial attention map that highlights critical areas within the image. This map is generated by analyzing the relationships between different features using both max-pooling and average-pooling operations along the channel axis.
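A rough sketch of how a SAM-style spatial attention map is formed (in CBAM the concatenated pooled maps pass through a learned 7x7 convolution; the per-pixel weighted sum below is a simplified stand-in for that learned layer, and the weights are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(features, w_avg=0.5, w_max=0.5, bias=0.0):
    """Sketch of spatial attention over a (C, H, W) feature tensor:
    pool along the channel axis with mean and max, combine the two
    (H, W) maps, squash to (0, 1), and reweight every channel."""
    avg_map = features.mean(axis=0)   # (H, W) average over channels
    max_map = features.max(axis=0)    # (H, W) max over channels
    attention = sigmoid(w_avg * avg_map + w_max * max_map + bias)
    return features * attention       # broadcast over the C axis

refined = spatial_attention(np.random.rand(16, 8, 8))
print(refined.shape)  # (16, 8, 8)
```

The effect is that spatial positions with strong responses across channels are emphasized, while weakly responding positions are suppressed before the box and class predictions are made.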

Bounding-Box Regression
The object detection task can be streamlined by dividing it into two simpler subtasks: identifying objects and pinpointing their locations. Finding these objects relies on a technique called bounding-box regression (BBR). This method essentially draws a rectangular box around the predicted object location within the image, maximizing the overlap with the actual object. The extent of this overlap is measured by mean squared error (MSE) or the intersection-over-union (IoU) losses. This popular metric evaluates the similarities and differences between two arbitrary shapes. Mathematically, it represents the ratio between the area shared by the predicted bounding box (denoted as A) and the ground-truth bounding box (B) as follows:

IoU = |A ∩ B| / |A ∪ B|

While IoU loss is widely used in BBR, it encounters a challenge when the predicted bounding box and the actual object's ground-truth box do not intersect, i.e., when the IoU of A and B equals zero, it fails to provide an overlap ratio. To overcome this limitation, the NextDet object detector has integrated a more robust approach called generalized IoU (GIoU) [33]. GIoU addresses this issue by actively encouraging a greater overlap between the predicted and ground-truth boxes, effectively steering the prediction closer toward the target. Mathematically, GIoU loss (L_GIoU) is expressed as follows:

L_GIoU = 1 − IoU + |C \ (A ∪ B)| / |C|

Here, C represents the smallest enclosing box that incorporates both the predicted bounding box (A) and the ground-truth box (B). As defined in [33], experiments reveal that GIoU loss delivers superior performance compared to both mean squared error (MSE) and standard IoU losses. Notably, it also demonstrates effectiveness in tackling vanishing gradients when the predicted and ground-truth boxes fail to intersect. Figures 6 and 7 provide a visual overview of the difference between IoU and GIoU.
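The behavior of IoU versus GIoU for non-overlapping boxes can be checked with a small example (axis-aligned boxes only, following the definitions in the GIoU paper [33]):

```python
def iou_and_giou(a, b):
    """Boxes given as (x1, y1, x2, y2).  Returns (IoU, GIoU)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    iou = inter / union
    # Smallest enclosing box C of A and B.
    cx1, cy1 = min(a[0], b[0]), min(a[1], b[1])
    cx2, cy2 = max(a[2], b[2]), max(a[3], b[3])
    area_c = (cx2 - cx1) * (cy2 - cy1)
    giou = iou - (area_c - union) / area_c
    return iou, giou

# Non-overlapping boxes: IoU is 0 (no gradient signal), but GIoU is
# negative and grows as the boxes approach each other, so the loss
# L_GIoU = 1 - GIoU still steers the prediction toward the target.
print(iou_and_giou((0, 0, 1, 1), (2, 0, 3, 1)))  # (0.0, -0.333...)
print(iou_and_giou((0, 0, 1, 1), (0, 0, 1, 1)))  # (1.0, 1.0)
```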

Extracting Radar Features
The proposed multi-modal object detection network in this paper utilizes features of each object within the image to predict all other properties associated with the object.To maximize the utilization of radar point-cloud data associated with the object in this context, it is essential to initially establish a connection between radar detections and their corresponding objects of interest detected within the image.

of 22
An autonomous vehicle's movement can be visualized using the right-handed coordinate system as it travels in the forward direction. A commonly employed method to determine the right-hand rule involves extending the index finger along the positive x-direction, bending the middle finger (and/or ring and pinky fingers) inward to indicate the positive y-direction, and raising the thumb to represent the positive z-direction, as denoted by Figure 8. In this configuration, the x-axis denotes the direction of motion, the y-axis runs parallel to the front axle of the vehicle, acting as the reference point, and the z-axis runs perpendicular to the x- and y-axes and points out through the roof of the vehicle. For handling and analyzing point-cloud data representations, i.e., data obtained from radar sensors, a polar coordinate system is used. In this type of system, a point in space is represented by a distance (d) from a reference point (origin) and an azimuth angle (α) measured from a reference direction (usually the positive x-axis). This system is particularly suitable for representing spatial information in scenarios where the distance and angle of objects from a reference point are essential, as is often the case with radar measurements and point-cloud data. By determining the distance (r) of an object from point A and its azimuth (α) from the radar, one can make an approximation of the location of the object of interest (A) within the global coordinate system [34].
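The polar-to-Cartesian conversion implied above is a one-line trigonometric identity; a minimal sketch (function name assumed, azimuth in radians measured from the positive x-axis):

```python
import math

def polar_to_cartesian(r, azimuth):
    """Convert a radar detection given as range r and azimuth angle
    (radians from the positive x-axis) to Cartesian (x, y)."""
    return r * math.cos(azimuth), r * math.sin(azimuth)
```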
Within the global coordinate system of x, y, and z, each radar detection is expressed as a 3D point relative to the sensor's position, thus characterizing it as (x, y, z, v_x, v_y), where x, y, and z denote the position of the point in 3D space, and v_x and v_y signify the radial velocity of the object along the x and y axes. For each scenario, three sequential radar point-cloud sweeps are combined, with a 0.25 s interval between each. Each camera within the nuScenes dataset is pre-calibrated, featuring both intrinsic and extrinsic parameters.
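The sweep-combination step can be sketched as stacking the three sweeps into one point cloud while tagging each point with the age of its sweep, so the network can distinguish current from older returns. This is a simplified illustration (function name and the extra age channel are assumptions, and ego-motion compensation between sweeps is omitted for brevity):

```python
import numpy as np

def accumulate_sweeps(sweeps, dt=0.25):
    """Stack sequential radar sweeps (each an (N, 5) array of
    x, y, z, v_x, v_y) into a single point cloud, appending the
    relative age of each sweep (i * dt seconds) as a sixth channel.
    sweeps[0] is the most recent sweep."""
    tagged = []
    for i, sweep in enumerate(sweeps):
        age = np.full((sweep.shape[0], 1), i * dt)
        tagged.append(np.hstack([sweep, age]))
    return np.vstack(tagged)
```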
The intrinsic parameter is a 3 × 3 matrix that defines the internal characteristics of the camera, including focal length, principal point, and distortion coefficients, which are typically acquired through specialized calibration procedures that involve checkerboard patterns or similar techniques, and can be defined as follows:

K = [ f_x   0    c_x
       0   f_y   c_y
       0    0     1  ]

Here, f_x and f_y represent the focal lengths in the x and y planes, respectively, and c_x and c_y represent the offset points of the camera in the x and y planes, respectively. On the other hand, the extrinsic parameter is a combination of a rotation matrix and a translation vector that define the camera's position and orientation with respect to the vehicle position and, thus, plays a vital role in the projection of radar detections from global coordinates onto the camera's image plane. Extrinsic parameters for a camera in the nuScenes dataset can be defined as follows:

[ r_xx  r_xy  r_xz  t_x
  r_yx  r_yy  r_yz  t_y
  r_zx  r_zy  r_zz  t_z
   0     0     0     1  ]

Here, the nine elements, r_xx to r_zz, represent the 3 × 3 rotation matrix that describes the camera's orientation relative to the world coordinate system, and the three elements, t_x to t_z, represent the translation vector that describes the camera's position in 3D space relative to the world coordinate system. These calibration data are provided by the nuScenes dataset [35] and are utilized along with the camera-radar dataset for the experiments outlined in Section 4 of this paper. The radar detections can, therefore, be associated with their corresponding representations obtained from the camera sensor. Following this mapping process, detections located outside the image are discarded. The projection of radar detections from global coordinates onto the camera's image plane can be defined as follows:

P_camera = [R | t] P_world

Here, P_camera is a 3 × 1 vector representing a 3D point in the camera's coordinate system, i.e., [x y z]^T, P_world is a 4 × 1 vector representing a 3D point in the global coordinate system, i.e., [X Y Z 1]^T, and [R | t] denotes the top three rows of the 4 × 4 extrinsic matrix above.
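The projection pipeline above can be sketched end to end: apply the extrinsic transform to move a homogeneous world point into the camera frame, then apply the intrinsic matrix and divide by depth to obtain pixel coordinates. This is a hedged illustration (function name and the behind-the-camera check returning None are assumptions, not the paper's implementation):

```python
import numpy as np

def project_to_image(p_world, t_extrinsic, k_intrinsic):
    """Project a homogeneous world point (4,) into pixel coordinates.
    t_extrinsic: 4x4 world-to-camera transform; k_intrinsic: 3x3 matrix K.
    Returns (u, v), or None if the point lies behind the camera
    (such detections would be discarded)."""
    p_cam = (t_extrinsic @ p_world)[:3]   # [x, y, z] in the camera frame
    if p_cam[2] <= 0:
        return None                       # behind the image plane
    u, v, w = k_intrinsic @ p_cam
    return np.array([u / w, v / w])       # perspective division by depth
```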

Associating Radar Data to the Image Plane
The proposed NeXtFusion network utilizes a modified frustum generation mechanism to associate image data with radar data, similar to CenterFusion's approach [36]. This technique leverages the object's two-dimensional bounding box from the camera sensor along with estimations of the object's three-dimensional size, depth, and orientation from the radar sensor. By doing so, a tightly defined region of interest (RoI), called a frustum, is constructed around the object of interest. This frustum then facilitates the filtering of radar detections. Only radar detections located within the frustum are then considered for association (concatenation) with the camera-detected object.
Instead of utilizing only one radar detection for each object proposal individually, as described in [36], the proposed NeXtFusion network utilizes a modified mechanism that operates on the entire cluster of radar detections that fall within the object's designated RoI. This allows the network to make use of the collective information within the cluster, resulting in a more robust camera-radar data association. It was observed that the entire radar detection cluster, encompassing its shape, size, and orientation, holds valuable information, resulting in an improvement in multi-modal object detection compared to the approach described in [36], which focuses solely on individual detections that only provide information about their specific location and velocity.
Figure 9 provides a visual representation of the architecture of the proposed NeXtFusion multi-modal (sensor fusion) 3D object detection network. The design of the proposed network incorporates several modifications to the baseline single-modal (camera-based) NextDet [9] architecture. The proposed architecture comprises two CondenseNeXt [25] CNNs integrated into the backbone module to extract feature map representations from data acquired through camera and radar sensors. These feature maps are subsequently transmitted to the neck for feature fusion, facilitated by connections indicated by dashed lines. Furthermore, the design of the network incorporates a strategy called early fusion, where feature maps from cameras and radars are concatenated at an initial stage. This merges information from both sensors (camera for visual details and radar for object distance and presence) to create a richer feature representation for object detection. Thus, by combining these features early on, the network can learn a more robust representation of the environment. Inspired by the work of [37], the design of NeXtFusion incorporates these connections in the network, which aim to improve gradient backpropagation, alleviate gradient fading, and minimize the loss of feature information, especially in scenarios involving small objects in adverse weather conditions, resulting in an improved and robust performance of the proposed object detection network.
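The early-fusion step described above amounts to a channel-wise concatenation of the two feature maps, which requires them to share the same spatial resolution. A minimal sketch with NumPy stand-ins for the actual feature tensors (shapes and function name are illustrative assumptions):

```python
import numpy as np

def early_fusion(cam_feat, radar_feat):
    """Concatenate camera and radar feature maps along the channel axis.
    Both maps are (C, H, W) and must share the spatial dims (H, W);
    the result is (C_cam + C_radar, H, W)."""
    assert cam_feat.shape[1:] == radar_feat.shape[1:], "spatial dims must match"
    return np.concatenate([cam_feat, radar_feat], axis=0)
```

Concatenation (rather than, e.g., element-wise addition) preserves both modalities' channels intact, leaving it to subsequent layers to learn how to weight them.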

Experiments and Results
A modern object detector is usually designed to identify the location and type of objects present in each input image. Typically, such a detector undergoes training on a dataset consisting of labeled images, referred to as ground-truth values. In this section, the proposed network, NeXtFusion, is evaluated on the nuScenes dataset as part of the research presented within this paper. A comparative analysis of the proposed network against other existing object detection neural networks is conducted in Section 4.4. Additionally, samples from the nuScenes dataset are visualized in Section 4.5 to better understand the object detection and tracking performance of the proposed network.


Dataset
Extensive experiments are performed on the nuScenes dataset [35], a multi-modal dataset for autonomous driving, which provides challenging urban driving scenarios captured using the full suite of sensors from a real autonomous vehicle. It offers annotated images, bounding boxes, and point-cloud radar data suitable for object detection, tracking, and forecasting tasks involving vehicles, everyday objects, and humans, using camera and radar sensors. Figure 10 provides a sample from the nuScenes dataset.
Future Internet 2024, 16, x FOR PEER REVIEW 13 of 23


Evaluation Metrics
This paper establishes two main criteria for an object detector to be considered successful in identifying target objects using the proposed multi-modal detection approach. Notably, a GIoU threshold of 0.5 is consistently applied across all models and datasets examined, selected via a grid-based search to ensure consistency. This metric extends IoU by considering the minimum bounding box that can enclose both the predicted and ground-truth boxes, as explained in Section 3.4 within this paper. This technique penalizes predictions that are far away from the ground truth, even if they have some overlap.
Evaluating the performance of object detectors requires a variety of metrics [38]. A popular choice for both assessment and comparison is the mean average precision (mAP). This metric represents the average performance across all object categories, calculated by taking the mean of the average precision (AP) for each class. AP itself is derived from the area under the precision-recall (PR) curve. Within this curve, precision (P) reflects a model's ability to correctly identify true objects, indicating the percentage of positive predictions that are actually correct. Conversely, recall (R) assesses a model's ability to find all actual positive instances present in the ground-truth data. These metrics can be expressed mathematically as follows:

P = TP / (TP + FP)
R = TP / (TP + FN)

Here, TP, FP, and FN denote the numbers of true positives, false positives, and false negatives, respectively.
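These definitions can be sketched directly in code, with AP approximated as the area under the PR curve via the trapezoidal rule. A minimal illustration (function names are assumptions; real benchmarks such as nuScenes use their own interpolation and matching rules):

```python
def precision_recall(tp, fp, fn):
    """P = TP / (TP + FP); R = TP / (TP + FN)."""
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    return p, r

def average_precision(precisions, recalls):
    """Area under the PR curve (trapezoidal rule); the points must be
    sorted by increasing recall."""
    ap = 0.0
    for i in range(1, len(recalls)):
        ap += (recalls[i] - recalls[i - 1]) * (precisions[i] + precisions[i - 1]) / 2
    return ap
```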

The experiments outlined in this section were conducted using the following software environment:
• Python 3.7.9: the general-purpose programming language used for the research.
• CUDA 11.3: enables efficient utilization of the NVIDIA GPUs for computations.

Experiment Results
To assess the performance of the proposed multi-modal 3D object detection network and understand the benefit of multi-modal over single-modal object detection networks, NeXtFusion has been compared to the camera-based 3D object detection networks orthographic feature transform (OFT) [40] and monocular 3D object detection (MonoDIS) [41], as well as InfoFocus [42], a LiDAR-based 3D object detection neural network, and benchmarked on the nuScenes dataset.
The effectiveness of any object detection network relies heavily on its training process. Table 1 presents the training parameters used for both the existing and the proposed object detection networks analyzed in this study. As is evident from this table, all network models employ the Adam optimizer, a learning rate of 0.0001, a batch size of 64, a weight decay of 0.0005, and a momentum of 0.85, and are trained for 400 epochs. Table 2 presents a comparison of several 3D object detection methods on the nuScenes validation dataset, indicating performance metrics such as mAP, mATE, mASE, mAOE, mAVE, and mAAE, described in Section 4.2 of this paper. The models compared are 3D object detection networks that utilize data primarily from camera or LiDAR sensors, such as OFT, MonoDIS, and InfoFocus. As is evident in this table, the proposed NeXtFusion network achieves a remarkable performance, securing the highest mAP score among all compared 3D object detection networks. This metric signifies the overall accuracy of the proposed multi-modal object detection model across different confidence levels. Compared to its single-sensor-based 3D object detection competitors, the proposed NeXtFusion network exhibits significant improvements in mAP. Notably, it outperforms MonoDIS by 9.5% and surpasses OFT by a margin of 35.1%. While InfoFocus exhibits a very strong mAP score, its reliance solely on LiDAR data limits its capabilities compared to NeXtFusion's fusion of multiple sensor modalities. This multi-sensor (multi-modal) approach enables NeXtFusion to achieve a significant improvement in velocity error compared to camera- and LiDAR-based methods, demonstrating the added value of utilizing diverse sensor information for a more comprehensive understanding of the environment in adverse external environmental conditions.


Ablation Studies
Table 3 outlines ablation studies conducted on the nuScenes validation dataset to evaluate the effectiveness of the proposed multi-modal fusion network. This table outlines the single-modal object detection performance of two networks: MonoDIS and NeXtDet. It serves as a baseline for understanding how NeXtFusion, which builds upon the efficient NeXtDet architecture, makes use of additional sensor (radar) data to improve object detection performance. While both networks achieve reasonable mean average precision (mAP) scores, indicating their ability to identify objects, NeXtDet demonstrates a 16.61% improvement over MonoDIS. This suggests that NeXtDet's architecture is better suited for extracting relevant features from the sensor data, resulting in increased object detection accuracy (mAP). However, the differences in the other metrics are very subtle. While MonoDIS exhibits slightly lower errors in terms of mATE, mASE, and mAOE, these differences are negligible. Conversely, NeXtDet shows a slight improvement in mAVE and mAAE.
The second half of the ablation study focuses on the proposed NeXtFusion network and the impact of two crucial methods responsible for associating radar detections with objects in the image plane, especially for multi-modal networks. The first is the naïve method (NM), where each radar detection point is projected directly onto the image plane using the sensor calibration information. If the projected radar point falls within the two-dimensional bounding box of the detected object inside an image, then it is associated with that object. NM is compared to the modified frustum association method (FAM) outlined in Section 3.5 of this paper and utilized in the design of the proposed NeXtFusion network. The results of this study demonstrate the potential benefits of sensor fusion. Compared to the camera-only results, NeXtFusion achieves significant improvements in several key metrics when utilizing FAM, demonstrating a substantial increase in the network's ability to correctly identify objects. Additionally, there are considerable reductions in errors related to object location, scale, and orientation, as observed in Table 3.
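The naïve method (NM) described above reduces to a point-in-box test on the projected radar point, which illustrates why it is fragile: it uses no depth information and can match a point to an overlapping foreground box. A minimal sketch (function name assumed):

```python
def naive_associate(point_uv, bbox):
    """Naïve method (NM): associate a projected radar point with an object
    iff it falls inside the object's 2D bounding box (x1, y1, x2, y2).
    Unlike the frustum method, no depth or cluster information is used."""
    u, v = point_uv
    x1, y1, x2, y2 = bbox
    return x1 <= u <= x2 and y1 <= v <= y2
```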

Visualization of Samples
The nuScenes dataset, with its diverse collection of 1000 driving scenes, captures various scenarios that provide a platform to evaluate, benchmark, and compare 3D object detection algorithms. Each scene in this dataset is roughly 20 s long and provides information including camera images, radar data, and meticulously labeled objects. Although LiDAR data are also provided as a point-cloud representation, similar to radar data in this dataset, LiDAR and radar point-cloud data are fundamentally different despite both appearing as point clouds. LiDAR provides highly detailed 3D representations while radar excels at long-range detection; therefore, the work presented within this paper focuses on camera-radar sensor fusion. Hence, LiDAR data are not utilized for the experiments outlined in this section.

Ablation Studies
Table 3 outlines ablation studies conducted on the nuScenes validation dataset to evaluate the effectiveness of the proposed multi-modal fusion network in this paper.This table outlines the single-modal object detection performance of two networks: MonoDIS and NeXtDet.It serves as a baseline for understanding how NeXtFusion, which builds upon the efficient NeXtDet architecture, makes use of additional sensor (radar) data to improve object detection performance.While both networks achieve reasonable mean average precision (mAP) scores, indicating their ability to identify objects, NeXtDet demonstrates a 16.61% improvement over MonoDIS.This suggests that NeXtDet's architecture is better suited for extracting relevant features obtained from the sensor data, resulting in an increased accuracy in object detection (mAP).However, the differences in other metrics are very subtle.While MonoDIS exhibits slightly lower errors in terms of mATE, mASE, and mAOE, these differences are negligible.Conversely, NeXtDet shows a slight improvement in mAVE and mAAE.The second half of the ablation study focuses on the proposed NeXtFusion network and the impact of two crucial methods responsible for associating radar detections with objects in the image plane, especially for multi-modal networks.The first is the naïve method (NM), where each radar detection point is projected directly onto the image plane using the sensor calibration information.Therefore, if the projected radar point falls within the two-dimensional bounding box of the detected object inside an image, then it is associated with that object.NM is compared to the modified frustum association method (FAM) outlined in Section 3.5 of this paper and utilized in the design of the proposed NeXtFusion network.The results of this study demonstrate the potential benefits of sensor fusion.Compared to the camera-only results, NeXtFusion achieves significant improvements in several key metrics when utilizing FAM, demonstrating a 
substantial increase in the network's ability to correctly identify objects.Additionally, there are considerable reductions in errors related to object location, scale, and orientation, as observed in Table 3.

Visualization of Samples
The nuScenes dataset, with its diverse collection of 1000 driving scenes, captures various scenarios that provide a platform to evaluate, benchmark, and compare 3D object detection algorithms.Each scene in this dataset is roughly 20 s long and provides information, including camera images, radar data, and meticulously labeled objects.Although LiDAR data are also represented as a point-cloud representation, similar to radar data in this dataset, LiDAR and radar point-cloud data are fundamentally different despite both appearing as point clouds.LiDAR provides highly detailed 3D representations while radar excels at long-range detection, and therefore, the work presented within this paper focuses on camera-radar sensor fusion.Hence, LiDAR data are not utilized for experiments outlined in this section.diverse sensor information for a more comprehensive understanding of the environment in adverse external environmental conditions.

Ablation Studies
Table 3 outlines ablation studies conducted on the nuScenes validation dataset to evaluate the effectiveness of the proposed multi-modal fusion network in this paper.This table outlines the single-modal object detection performance of two networks: MonoDIS and NeXtDet.It serves as a baseline for understanding how NeXtFusion, which builds upon the efficient NeXtDet architecture, makes use of additional sensor (radar) data to improve object detection performance.While both networks achieve reasonable mean average precision (mAP) scores, indicating their ability to identify objects, NeXtDet demonstrates a 16.61% improvement over MonoDIS.This suggests that NeXtDet's architecture is better suited for extracting relevant features obtained from the sensor data, resulting in an increased accuracy in object detection (mAP).However, the differences in other metrics are very subtle.While MonoDIS exhibits slightly lower errors in terms of mATE, mASE, and mAOE, these differences are negligible.Conversely, NeXtDet shows a slight improvement in mAVE and mAAE.The second half of the ablation study focuses on the proposed NeXtFusion network and the impact of two crucial methods responsible for associating radar detections with objects in the image plane, especially for multi-modal networks.The first is the naïve method (NM), where each radar detection point is projected directly onto the image plane using the sensor calibration information.Therefore, if the projected radar point falls within the two-dimensional bounding box of the detected object inside an image, then it is associated with that object.NM is compared to the modified frustum association method (FAM) outlined in Section 3.5 of this paper and utilized in the design of the proposed NeXtFusion network.The results of this study demonstrate the potential benefits of sensor fusion.Compared to the camera-only results, NeXtFusion achieves significant improvements in several key metrics when utilizing FAM, demonstrating a 
substantial increase in the network's ability to correctly identify objects.Additionally, there are considerable reductions in errors related to object location, scale, and orientation, as observed in Table 3.

Visualization of Samples
The nuScenes dataset, with its diverse collection of 1000 driving scenes, captures various scenarios that provide a platform to evaluate, benchmark, and compare 3D object detection algorithms.Each scene in this dataset is roughly 20 s long and provides information, including camera images, radar data, and meticulously labeled objects.Although LiDAR data are also represented as a point-cloud representation, similar to radar data in this dataset, LiDAR and radar point-cloud data are fundamentally different despite both appearing as point clouds.LiDAR provides highly detailed 3D representations while radar excels at long-range detection, and therefore, the work presented within this paper focuses on camera-radar sensor fusion.Hence, LiDAR data are not utilized for experiments outlined in this section.diverse sensor information for a more comprehensive understanding of the environmen in adverse external environmental conditions.

Ablation Studies
Table 3 outlines ablation studies conducted on the nuScenes validation dataset t evaluate the effectiveness of the proposed multi-modal fusion network in this paper.Th table outlines the single-modal object detection performance of two networks: MonoDI and NeXtDet.It serves as a baseline for understanding how NeXtFusion, which build upon the efficient NeXtDet architecture, makes use of additional sensor (radar) data t improve object detection performance.While both networks achieve reasonable mean av erage precision (mAP) scores, indicating their ability to identify objects, NeXtDet demon strates a 16.61% improvement over MonoDIS.This suggests that NeXtDet's architectur is better suited for extracting relevant features obtained from the sensor data, resulting i an increased accuracy in object detection (mAP).However, the differences in other metric are very subtle.While MonoDIS exhibits slightly lower errors in terms of mATE, mASE and mAOE, these differences are negligible.Conversely, NeXtDet shows a slight improve ment in mAVE and mAAE.The second half of the ablation study focuses on the proposed NeXtFusion networ and the impact of two crucial methods responsible for associating radar detections wit objects in the image plane, especially for multi-modal networks.The first is the naïv method (NM), where each radar detection point is projected directly onto the image plan using the sensor calibration information.Therefore, if the projected radar point fal within the two-dimensional bounding box of the detected object inside an image, then is associated with that object.NM is compared to the modified frustum associatio method (FAM) outlined in Section 3.5 of this paper and utilized in the design of the pro posed NeXtFusion network.The results of this study demonstrate the potential benefit of sensor fusion.Compared to the camera-only results, NeXtFusion achieves significan improvements in several key metrics when utilizing FAM, demonstrating a substantia increase in 
the network's ability to correctly identify objects.Additionally, there are con siderable reductions in errors related to object location, scale, and orientation, as observe in Table 3.

Visualization of Samples
The nuScenes dataset, with its diverse collection of 1000 driving scenes, captures var ious scenarios that provide a platform to evaluate, benchmark, and compare 3D objec detection algorithms.Each scene in this dataset is roughly 20 s long and provides info mation, including camera images, radar data, and meticulously labeled objects.Althoug LiDAR data are also represented as a point-cloud representation, similar to radar data i this dataset, LiDAR and radar point-cloud data are fundamentally different despite bot appearing as point clouds.LiDAR provides highly detailed 3D representations while ra dar excels at long-range detection, and therefore, the work presented within this pape focuses on camera-radar sensor fusion.Hence, LiDAR data are not utilized for exper ments outlined in this section.
Future Internet 2024, 16, x FOR PEER REVIEW

… diverse sensor information for a more comprehensive understanding of the environment in adverse external environmental conditions.

Ablation Studies
Table 3 outlines ablation studies conducted on the nuScenes validation dataset to evaluate the effectiveness of the proposed multi-modal fusion network in this paper. This table outlines the single-modal object detection performance of two networks: MonoDIS and NeXtDet. It serves as a baseline for understanding how NeXtFusion, which builds upon the efficient NeXtDet architecture, makes use of additional sensor (radar) data to improve object detection performance. While both networks achieve reasonable mean average precision (mAP) scores, indicating their ability to identify objects, NeXtDet demonstrates a 16.61% improvement over MonoDIS. This suggests that NeXtDet's architecture is better suited for extracting relevant features obtained from the sensor data, resulting in an increased accuracy in object detection (mAP). However, the differences in other metrics are very subtle. While MonoDIS exhibits slightly lower errors in terms of mATE, mASE, and mAOE, these differences are negligible. Conversely, NeXtDet shows a slight improvement in mAVE and mAAE.

The second half of the ablation study focuses on the proposed NeXtFusion network and the impact of two crucial methods responsible for associating radar detections with objects in the image plane, especially for multi-modal networks. The first is the naïve method (NM), where each radar detection point is projected directly onto the image plane using the sensor calibration information. Therefore, if the projected radar point falls within the two-dimensional bounding box of the detected object inside an image, then it is associated with that object. NM is compared to the modified frustum association method (FAM) outlined in Section 3.5 of this paper and utilized in the design of the proposed NeXtFusion network. The results of this study demonstrate the potential benefits of sensor fusion. Compared to the camera-only results, NeXtFusion achieves significant improvements in several key metrics when utilizing FAM, demonstrating a substantial increase in the network's ability to correctly identify objects. Additionally, there are considerable reductions in errors related to object location, scale, and orientation, as observed in Table 3.
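As a concrete illustration of the naïve method, the sketch below projects radar points through a pinhole camera model and tests 2D bounding-box containment. The function names, calibration matrix, and detections are toy values for illustration only, and the frustum association method (FAM) is not shown:

```python
import numpy as np

def project_to_image(points_3d, intrinsics):
    """Project 3D points (N, 3) in the camera frame onto the image plane.

    `intrinsics` is the 3x3 pinhole camera matrix K; points behind the
    camera (z <= 0) are returned as NaN.
    """
    pts = np.asarray(points_3d, dtype=float)
    uvw = pts @ intrinsics.T                      # (N, 3): [u*z, v*z, z]
    z = uvw[:, 2:3]
    return np.where(z > 0, uvw[:, :2] / np.where(z > 0, z, 1.0), np.nan)

def naive_associate(radar_points, boxes_2d, intrinsics):
    """Associate each radar detection with the first 2D box containing
    its image projection; returns a list of box indices (or None)."""
    uv = project_to_image(radar_points, intrinsics)
    matches = []
    for u, v in uv:
        hit = None
        for i, (x1, y1, x2, y2) in enumerate(boxes_2d):
            if x1 <= u <= x2 and y1 <= v <= y2:
                hit = i
                break
        matches.append(hit)
    return matches

# Toy calibration: 1000 px focal length, principal point at (800, 450).
K = np.array([[1000.0, 0.0, 800.0],
              [0.0, 1000.0, 450.0],
              [0.0, 0.0, 1.0]])
radar = [(0.0, 0.0, 20.0),      # straight ahead, 20 m -> projects to (800, 450)
         (5.0, 0.0, 25.0)]      # offset right -> projects to (1000, 450)
boxes = [(700, 350, 900, 550)]  # one detection box around the image centre
print(naive_associate(radar, boxes, K))  # -> [0, None]
```

The second radar point projects outside the only box, so it stays unassociated; this is exactly the failure mode that motivates a depth-aware frustum check.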

Visualization of Samples
The nuScenes dataset, with its diverse collection of 1000 driving scenes, captures various scenarios that provide a platform to evaluate, benchmark, and compare 3D object detection algorithms. Each scene in this dataset is roughly 20 s long and provides information, including camera images, radar data, and meticulously labeled objects. Although LiDAR data are also represented as a point-cloud representation, similar to radar data in this dataset, LiDAR and radar point-cloud data are fundamentally different despite both appearing as point clouds. LiDAR provides highly detailed 3D representations while radar excels at long-range detection, and therefore, the work presented within this paper focuses on camera-radar sensor fusion. Hence, LiDAR data are not utilized for experiments outlined in this section.


This section delves into visualizing the proposed network's validation performance in terms of 3D object detection on the nuScenes dataset. In this dataset, the scenes are annotated every half a second (i.e., at 2 Hz). A sample is defined as an annotated keyframe of a scene at a specific timestamp, where the timestamps of data from all sensors closely align with the sample's timestamp. For illustration, let us examine the first annotated sample in a scene described as 'Night, pedestrians on sidewalk, pedestrians crossing crosswalk, scooter, with difficult lighting conditions' by the annotators of the dataset. This scene was captured in Holland Village, Singapore, using a comprehensive sensor suite mounted on a vehicle. Each snapshot of a scene references a collection of data from these sensors, accessible through a data key in the dataset. These sensors, which are mounted on the vehicle, include the following:

•
One LIDAR (light detection and ranging) sensor;
•
Five RADAR (radio detection and ranging) sensors.

Figure 11 provides a visual representation of 3D object detection performed by the proposed NeXtFusion network on a scene with occlusion and poor lighting conditions, and Figure 12 provides an example of multi-modal 3D object detection involving multiple objects at an intersection, such as humans (blue bounding box), cars (yellow bounding box), and trucks (red bounding box). It also demonstrates the object detection performance on distant, tiny objects in the shadow of a building.
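The keyframe alignment described above (annotations at 2 Hz, sensor readings matched to the nearest sample timestamp) can be sketched as follows; the function name and microsecond values are illustrative and not part of the nuScenes devkit:

```python
import bisect

def nearest_keyframe(sample_timestamps, sensor_timestamp):
    """Return the annotated keyframe timestamp closest to a sensor reading.

    `sample_timestamps` must be sorted ascending (microseconds, as in
    nuScenes); binary search keeps each lookup O(log n).
    """
    i = bisect.bisect_left(sample_timestamps, sensor_timestamp)
    candidates = sample_timestamps[max(0, i - 1):i + 1]
    return min(candidates, key=lambda t: abs(t - sensor_timestamp))

# Keyframes every 0.5 s (2 Hz), timestamps in microseconds.
keyframes = [0, 500_000, 1_000_000, 1_500_000]
print(nearest_keyframe(keyframes, 480_000))    # -> 500000
print(nearest_keyframe(keyframes, 1_240_000))  # -> 1000000
```

Each camera image or radar scan is then grouped under the keyframe it resolves to, which is how a "sample" bundles closely aligned data from every sensor.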

Figure 13 provides an example that plots radar point-cloud data for the same image from Figure 12. Unlike LiDAR, which excels in dense, close-range measurements, radar boasts a significantly larger operational range. While this extended reach comes at the expense of point density, it enables the detection of distant objects that might be invisible to LiDAR. Consider a scenario where a car is approaching a sharp bend on a highway. While LiDAR can meticulously map the immediate surroundings, radar's wider net can detect vehicles or obstacles further down the road, providing crucial information for safe navigation. This complementary perspective offered by radar, despite its sparser data, paints a more comprehensive picture of the environment, enhancing perception and decision-making capabilities in scenarios demanding long-range awareness.

Figure 14 illustrates a point-cloud plot generated by combining radar data from five sweeps. This data is obtained from the nuScenes dataset utilized for experiments outlined in this paper. As seen in this plot, radar detection lines visually align with their respective bounding boxes of two vehicles, providing an estimate of the detected objects.
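Accumulating several radar sweeps, as in Figure 14, relies on mapping each sweep into a common ego frame using the recorded ego poses. The simplified 2D sketch below uses assumed function names and toy poses (real nuScenes poses are 3D with quaternion rotations):

```python
import numpy as np

def accumulate_sweeps(sweeps, ego_poses):
    """Merge several radar sweeps into the latest ego frame.

    `sweeps` is a list of (N_i, 2) arrays of x/y detections, each in the
    ego frame at its own capture time; `ego_poses` gives (yaw, tx, ty) of
    the ego vehicle in a global frame for each sweep. All points are
    mapped to global coordinates, then into the frame of the last pose.
    """
    def to_matrix(yaw, tx, ty):
        c, s = np.cos(yaw), np.sin(yaw)
        return np.array([[c, -s, tx], [s, c, ty], [0.0, 0.0, 1.0]])

    ref_inv = np.linalg.inv(to_matrix(*ego_poses[-1]))
    merged = []
    for pts, pose in zip(sweeps, ego_poses):
        hom = np.column_stack([pts, np.ones(len(pts))])      # homogeneous (N, 3)
        merged.append((hom @ to_matrix(*pose).T @ ref_inv.T)[:, :2])
    return np.vstack(merged)

# Two sweeps; the ego vehicle drives 2 m forward (+x) between them.
sweep_old = np.array([[10.0, 0.0]])  # static object 10 m ahead at the old pose
sweep_new = np.array([[8.0, 0.0]])   # same object seen from the new pose
poses = [(0.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
print(accumulate_sweeps([sweep_old, sweep_new], poses))
# both rows land at x = 8 in the latest ego frame
```

Without this ego-motion compensation, points from older sweeps would smear behind moving sensors instead of stacking on the objects that produced them.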
Figure 15 shifts the evaluation and visualization focus to object tracking, for which a tracking pipeline was developed for evaluation purposes. By providing more reliable starting points in each frame, the proposed network analyzes data about objects, which are individual entities within the environment that an autonomous vehicle (AV) needs to detect, track, and potentially interact with based on the conditions. Examples include specific vehicles, pedestrians, traffic signs, or other relevant objects. Understanding these instances and their associated metadata is crucial for safe and efficient navigation.
Consider a hypothetical scenario where the camera sensor is affected by lens glare from sun rays directly hitting the camera sensor, resulting in an occlusion in the field of vision. Here, an AV encounters pedestrians crossing an intersection on the road. Examining the instance metadata associated with this object unveils valuable information beyond its simple presence, even in adverse conditions such as lens glare. These rich metadata are generated by fusing information from radar and cameras in NeXtFusion, which plays a critical role in AV perception and decision making. By tracking and analyzing instance metadata, the AV can accomplish the following:

•
Continuously monitor the movement of objects in its environment.

•
Classify and differentiate between different types of objects and understand their potential intentions, even under unfavorable conditions.

•
Make informed decisions by planning safe maneuvers based on perceived information about the environment.
While Figure 15 serves as a static representation to understand how the proposed network performs 3D object detection and tracking, AV perception is always dynamic in nature and performs object detection and tracking for each frame. The work presented within this paper focuses on processing individual camera images and radar scans simultaneously. Also, metadata continuously updates as sensors gather new information, allowing the AV to adapt its understanding of the surrounding world in real time. This dynamic interplay between sensor data, object detection and tracking, and rich metadata forms the foundation for safe and intelligent navigation for autonomous driving.
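To make the per-frame detect-track-update loop concrete, the minimal sketch below associates each detection with the nearest existing track of the same category within a distance gate. All names and values are illustrative; this is far simpler than the tracking pipeline used for Figure 15, which adds motion prediction and fused radar velocity cues:

```python
from dataclasses import dataclass, field
from itertools import count
from math import hypot

@dataclass
class Track:
    track_id: int
    category: str
    position: tuple          # latest (x, y) in the ego frame, metres
    history: list = field(default_factory=list)

class NearestNeighbourTracker:
    """Minimal per-frame tracker: a detection is matched to the closest
    live track of the same category within `gate` metres, else a new
    track is created."""

    def __init__(self, gate=2.0):
        self.gate = gate
        self.tracks = []
        self._ids = count()

    def update(self, detections):
        """`detections` is a list of (category, (x, y)) for one frame."""
        for category, pos in detections:
            candidates = [t for t in self.tracks if t.category == category]
            best = min(candidates,
                       key=lambda t: hypot(t.position[0] - pos[0],
                                           t.position[1] - pos[1]),
                       default=None)
            if best and hypot(best.position[0] - pos[0],
                              best.position[1] - pos[1]) <= self.gate:
                best.history.append(best.position)  # keep instance metadata
                best.position = pos
            else:
                self.tracks.append(Track(next(self._ids), category, pos))
        return self.tracks

tracker = NearestNeighbourTracker()
tracker.update([("pedestrian", (10.0, 3.0))])            # frame 1: new track
tracks = tracker.update([("pedestrian", (10.5, 2.6))])   # frame 2: same track
print(len(tracks), tracks[0].history)  # -> 1 [(10.0, 3.0)]
```

The accumulated `history` is the kind of instance metadata the text describes: it lets the AV reason about an object's motion rather than only its momentary position.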


Conclusions
This paper introduces NeXtFusion, a novel deep camera-radar fusion network designed to enhance the perception capabilities of autonomous vehicles (AVs). By effectively combining the strengths of camera and radar data, NeXtFusion overcomes the limitations of single-sensor networks, particularly in challenging weather and lighting conditions. Utilizing an attention module, NeXtFusion extracts crucial features from both modalities, leading to improved object detection accuracy and tracking.
Extensive evaluations on the nuScenes dataset demonstrate NeXtFusion's superior performance, achieving a significant mAP score improvement compared to existing methods. Additionally, strong performance in other metrics like mATE and mAOE further highlights its overall effectiveness in 3D object detection. Visualizations of real-world data processed by NeXtFusion showcase its ability to handle diverse scenarios. By leveraging the complementary information from cameras and radars, NeXtFusion offers a robust solution for navigating complex environments and ensuring reliable operation under various conditions.
However, recent research explores the potential of utilizing additional sources of information to enhance an AV's awareness of its perceived surroundings, including but not limited to roadside sensors, such as surveillance cameras and radars, along with unmanned aerial vehicles (UAVs). These technologies contribute to the digitalization of traffic scenes, providing more comprehensive information about the surrounding environment to AVs. By establishing communication between AVs and this intelligent infrastructure, researchers envision a future where object detection relies not only on onboard sensors but on off-vehicle sensors as well.
The research work and findings presented in this paper, particularly the ability to combine and extract meaningful and complementary information from camera and radar sensors, could potentially contribute to the development of such infrastructure-vehicle cooperation systems. Exploring this exciting potential application is a promising avenue for future research.

Figure 1 .
Figure 1.An overview of the proposed NeXtFusion multi-modal sensor fusion neural network.


Figure 2 .
Figure 2. Comparison between velocities: the true velocity for object A is equal to the radial velocity, whereas, for object B, the radar-reported radial velocity indicated by the red arrow is not equal to the true velocity.
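The distinction in the Figure 2 caption can be checked numerically: the Doppler measurement a radar reports is the projection of the true velocity onto the line of sight. A minimal sketch (function name and toy values are illustrative):

```python
import numpy as np

def radial_velocity(position, velocity):
    """Component of an object's true velocity along the radar line of
    sight, i.e. what the radar's Doppler measurement reports."""
    p = np.asarray(position, dtype=float)
    v = np.asarray(velocity, dtype=float)
    return float(v @ (p / np.linalg.norm(p)))

# Object A: moving straight along the line of sight -> v_r equals |v|.
print(radial_velocity([20.0, 0.0], [-5.0, 0.0]))  # -> -5.0 (approaching)

# Object B: moving perpendicular to the line of sight -> v_r is zero,
# even though the object's true speed is 5 m/s.
print(radial_velocity([0.0, 20.0], [5.0, 0.0]))   # -> 0.0
```

Object B is why radar-only velocity estimates must be fused with other cues before being treated as true motion.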

Figure 3 .
Figure 3. NextDet architecture utilized as the baseline for the proposed NeXtFusion architecture.


Figure 4 .
Figure 4.A visual depiction of top-down and bottom-up implementation of FPN and PAN networks.


Figure 5 .
Figure 5.A visual depiction of the spatial pyramid pooling operation.


Figure 6 .
Figure 6. Three sets of examples (a-c) where (a) represents a perfect overlap between the predicted bounding box (A) and the ground-truth box (B), (b) represents a partial overlap resulting in 0.5 IoU losses, and (c) represents the disjoint problem of IoU when A and B do not overlap.


Figure 7 .
Figure 7. Two sets of examples (a,b) where (a) represents a perfect overlap between the predicted bounding box (A) and the ground-truth bounding box (B) and (b) represents a non-overlapping case of A and B, which solves the IoU's disjoint problem by introducing a third smallest enclosing bounding box called C that encompasses both A and B bounding boxes.
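The enclosing-box construction from the Figure 6 and Figure 7 captions can be sketched in a few lines. This is a generic IoU/GIoU computation for axis-aligned boxes, offered as an illustration rather than the paper's exact loss implementation:

```python
def iou_and_giou(a, b):
    """IoU and generalized IoU for axis-aligned boxes (x1, y1, x2, y2).

    GIoU subtracts the fraction of the smallest enclosing box C that is
    covered by neither A nor B, so disjoint boxes still yield a useful
    gradient signal (GIoU < 0) where plain IoU saturates at 0."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    iou = inter / union
    # Smallest enclosing box C spanning both A and B.
    cx1, cy1 = min(a[0], b[0]), min(a[1], b[1])
    cx2, cy2 = max(a[2], b[2]), max(a[3], b[3])
    c = (cx2 - cx1) * (cy2 - cy1)
    return iou, iou - (c - union) / c

print(iou_and_giou((0, 0, 2, 2), (0, 0, 2, 2)))  # perfect overlap: (1.0, 1.0)
print(iou_and_giou((0, 0, 2, 2), (4, 0, 6, 2)))  # disjoint: IoU 0, GIoU < 0
```

In the disjoint case the GIoU grows more negative as the boxes move further apart, which is what restores a training signal when A and B do not overlap.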

Figure 9 .
Figure 9. Architecture of the proposed NeXtFusion 3D object detection network.


Figure 10 .
Figure 10.An example of an occluded image obtained from the nuScenes dataset, captured by a front camera positioned at the vehicle's top-front location.



Figure 11 .
Figure 11.An example of camera-radar fusion-based 3D object detection involving occlusion at night from the nuScenes dataset using the proposed NeXtFusion network.Different colors of the bounding boxes indicate different objects detected.


Figure 12 .
Figure 12. An example of camera-radar fusion-based 3D object detection at an intersection using the proposed NeXtFusion network. Here, blue bounding boxes denote people, yellow bounding boxes denote cars, and red bounding boxes denote buses.

Figure 13 .
Figure 13. An illustration of radar point-cloud data of the environment from the nuScenes dataset generated by NeXtFusion. Dots represent objects, with darker shades indicating closer proximity to the sensor.


Figure 14 .
Figure 14. An example of very confident radar returns from two vehicles, captured and combined using five radar sweeps. Light blue lines indicate radar detections, showing their length and orientation. The red cross indicates the vehicle's ego-centric position. Here, blue bounding boxes denote people, yellow bounding boxes denote cars, and red bounding boxes denote buses.


Figure 15 .
Figure 15. A static representation to analyze how the proposed network performs multi-modal 3D object detection and tracking as part of the validation tests. nuScenes provides data for a scene captured from camera and radar sensors mounted on the AV as follows: (a) Front left camera identifies and starts tracking the object. In this example, only a single pedestrian is being tracked for analysis purposes. (b) Front center camera continues to track the identified object while it is in its field of view. (c) Front right camera continues to track the identified object until it leaves its field of view.


Table 1 .
Training parameters for experimental analysis on the nuScenes training dataset.

Table 2 .
Comparison of 3D Object Detection Methods on the nuScenes Validation Dataset (↑ indicates Higher = Better, ↓ indicates Lower = Better, and Y indicates the use of nuScenes validation dataset).

Table 3 .
Results of the ablation studies on nuScenes validation dataset (C: camera, R: radar, NM: naïve method, and FAM: frustum association method).
