3.1. Preliminary
LSS [8] is the pioneering work that generates BEV scene representations from multi-view RGB images. The input consists of $n$ RGB images $\{I_i \in \mathbb{R}^{3 \times H \times W}\}_{i=1}^{n}$, where $H$ and $W$ denote the height and width of each image, respectively. Each image is associated with an extrinsic matrix $E_i \in \mathbb{R}^{3 \times 4}$ and an intrinsic matrix $K_i \in \mathbb{R}^{3 \times 3}$. LSS then seeks to generate a rasterized representation of the scene in the BEV coordinate frame, denoted as $B \in \mathbb{R}^{C \times X \times Y}$, where $X \times Y$ defines the spatial extent of the horizontal plane in the physical world, and $C$ represents the feature dimension at each spatial location.
The core operations of LSS consist of “lift” and “splat”. The “lift” operation aims to recover the depth of each pixel in the image, thereby projecting the image from the 2D plane into 3D space. This process is divided into two steps. The first step is the generation of the 3D frustum point cloud: given an image of size $H \times W$, each pixel is associated with $D$ discrete depth values, representing all possible depth positions that the pixel may occupy. This step produces a frustum point cloud of size $D \times H \times W$. The second step is the generation of the context feature point cloud: a convolutional neural network is employed as the backbone to extract image features, and for each point on the feature map, a $C$-dimensional feature vector and a probability distribution over the $D$ discrete depth values are predicted. The outer product of the feature vector and the depth distribution is then computed, yielding the context feature point cloud.
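To make the lift step concrete, the sketch below (PyTorch; the module name, channel widths, and number of depth bins are illustrative assumptions, not the implementation used in this paper) predicts a per-pixel depth distribution and context vector from a backbone feature map and combines them with an outer product.

```python
import torch
import torch.nn as nn

class LiftHead(nn.Module):
    """Per-pixel context features and depth distribution, combined by an
    outer product (the "lift" step of LSS). Shapes are illustrative."""
    def __init__(self, in_channels=512, context_channels=64, depth_bins=59):
        super().__init__()
        self.C, self.D = context_channels, depth_bins
        # a single 1x1 conv predicting D depth logits + C context channels per pixel
        self.head = nn.Conv2d(in_channels, depth_bins + context_channels, kernel_size=1)

    def forward(self, feat):                      # feat: (B, in_channels, Hf, Wf)
        x = self.head(feat)
        depth = x[:, : self.D].softmax(dim=1)     # (B, D, Hf, Wf) depth distribution
        context = x[:, self.D :]                  # (B, C, Hf, Wf) context features
        # outer product over the depth axis -> frustum context features (B, C, D, Hf, Wf)
        frustum = depth.unsqueeze(1) * context.unsqueeze(2)
        return frustum, depth

# usage with an illustrative backbone feature map
feat = torch.randn(2, 512, 16, 44)
frustum, depth = LiftHead()(feat)
print(frustum.shape)  # torch.Size([2, 64, 59, 16, 44])
```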
The “splat” operation refers to the projection of the context features onto the BEV grid to construct the BEV representation. The procedure is as follows: first, by leveraging both the intrinsic and extrinsic matrices of the camera, the entire frustum point cloud is transformed into the ego-vehicle coordinate system. Second, the frustum point cloud is mapped from the ego-vehicle coordinate system onto the BEV grid, and points that fall outside the grid boundaries are discarded. Finally, the context features associated with points residing in the same grid cell are aggregated through sum pooling, yielding the final BEV features.
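The splat step can be sketched as follows (PyTorch; the grid size, ranges, and function name are illustrative assumptions): frustum points already transformed into the ego-vehicle frame are mapped to BEV cell indices, out-of-range points are dropped, and features sharing a cell are sum-pooled.

```python
import torch

def splat_to_bev(points_xyz, features, bev_size=(128, 128),
                 x_range=(-51.2, 51.2), y_range=(-51.2, 51.2)):
    """Sum-pool frustum context features into a BEV grid (the "splat" step).
    points_xyz: (N, 3) frustum points in the ego-vehicle frame; features: (N, C)."""
    X, Y = bev_size
    C = features.shape[1]
    # continuous ego coordinates -> integer BEV cell indices
    x_idx = ((points_xyz[:, 0] - x_range[0]) / (x_range[1] - x_range[0]) * X).long()
    y_idx = ((points_xyz[:, 1] - y_range[0]) / (y_range[1] - y_range[0]) * Y).long()
    # discard points that fall outside the grid boundaries
    keep = (x_idx >= 0) & (x_idx < X) & (y_idx >= 0) & (y_idx < Y)
    x_idx, y_idx, feats = x_idx[keep], y_idx[keep], features[keep]
    # sum pooling: accumulate features of points that share a grid cell
    bev = torch.zeros(X * Y, C, dtype=features.dtype)
    bev.index_add_(0, x_idx * Y + y_idx, feats)
    return bev.view(X, Y, C).permute(2, 0, 1)     # (C, X, Y)

bev = splat_to_bev(torch.randn(10000, 3) * 40, torch.randn(10000, 64))
print(bev.shape)  # torch.Size([64, 128, 128])
```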
3.4. PV Fusion
First, we generate a radar image for each camera view using the radar image generation module. Subsequently, the camera encoder and radar encoder are applied to extract the respective PV features. The camera encoder utilizes a backbone network suitable for visual tasks (e.g., ResNet [36]) and a neck module (e.g., FPN [37]) to extract 16× downsampled image feature maps (i.e., camera PV features). The radar encoder is designed based on ResNet and consists of two main components: the stem and the block. The stem is the original stem module of ResNet and is responsible for processing the input data. The block follows the architecture of the first stage of ResNet50, utilizing two residual blocks to generate 16× downsampled radar feature maps (i.e., radar PV features). Finally, the cross-modal feature fusion module fuses the PV features extracted from both modalities, enabling the integration of complementary information from the camera and radar data. Next, we provide a detailed description of the radar image generation module and the cross-modal feature fusion module.
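A minimal sketch of such a radar PV encoder is given below (PyTorch). The basic residual blocks, channel widths, input channel count, and the use of stride-2 blocks to reach the stated 16× downsampling are assumptions for illustration; the paper's encoder follows the bottleneck design of ResNet50's first stage.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic ResNet-style residual block with optional stride."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, 1, 1, bias=False),
            nn.BatchNorm2d(out_ch))
        self.skip = (nn.Identity() if stride == 1 and in_ch == out_ch else
                     nn.Sequential(nn.Conv2d(in_ch, out_ch, 1, stride, bias=False),
                                   nn.BatchNorm2d(out_ch)))

    def forward(self, x):
        return torch.relu(self.body(x) + self.skip(x))

class RadarPVEncoder(nn.Module):
    """Radar image -> 16x downsampled radar PV features (stem + two blocks)."""
    def __init__(self, in_ch=3, width=64, out_ch=256):
        super().__init__()
        # ResNet stem: 7x7 stride-2 conv + stride-2 max pooling (4x downsampling)
        self.stem = nn.Sequential(
            nn.Conv2d(in_ch, width, 7, 2, 3, bias=False),
            nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, 2, 1))
        # two residual blocks taking the features from 4x to 16x downsampling
        self.blocks = nn.Sequential(
            ResidualBlock(width, width * 2, stride=2),
            ResidualBlock(width * 2, out_ch, stride=2))

    def forward(self, radar_image):
        return self.blocks(self.stem(radar_image))

x = torch.randn(1, 3, 256, 704)       # radar image for one camera view
print(RadarPVEncoder()(x).shape)      # torch.Size([1, 256, 16, 44])
```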
Radar Image Generation. The radar processes scan data to detect and identify targets, yielding a set of identified objects. Each identified target includes measurements such as position, velocity, and radar cross-section (RCS). Using the radar's position information, we project the radar data into the camera view. The projected image location $p_{img}$ of each target is computed as follows:
$$
p_{img} = K \, T_{r \to c} \, p_{radar},
$$
where $K$ represents the camera's intrinsic parameter matrix, $T_{r \to c}$ is the extrinsic calibration matrix from radar to camera, and $p_{radar}$ denotes the target's location in the radar coordinate system. Both $p_{img}$ and $p_{radar}$ are represented in homogeneous coordinates. To mitigate the influence of radar measurement uncertainty, previous works [38,39] marked each target's position in the image as a small circle rather than a single pixel, as shown in Figure 2a. The pixels inside the circle are filled with the radar's depth or velocity information, whereas other areas are filled with zeros. Additionally, for overlapping circles, only the information of the closer target is retained. The image generated through this process is referred to as the radar image.
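The projection step can be sketched as follows (NumPy; the function name and the example calibration values are illustrative assumptions):

```python
import numpy as np

def project_radar_to_image(p_radar, K, T_r2c):
    """Project radar targets into the camera view.
    p_radar: (N, 3) target positions in the radar coordinate system.
    K: (3, 3) camera intrinsics; T_r2c: (4, 4) radar-to-camera extrinsics."""
    # homogeneous radar coordinates -> camera coordinates
    p_hom = np.concatenate([p_radar, np.ones((len(p_radar), 1))], axis=1)   # (N, 4)
    p_cam = (T_r2c @ p_hom.T)[:3]                                           # (3, N)
    # keep targets in front of the camera, then apply the intrinsics
    in_front = p_cam[2] > 0
    p_img = K @ p_cam[:, in_front]
    u, v = p_img[0] / p_img[2], p_img[1] / p_img[2]     # pixel coordinates
    depth = p_cam[2, in_front]                          # per-target depth (m)
    return np.stack([u, v], axis=1), depth, in_front

# two example targets, already expressed in camera axes (T_r2c = I for illustration)
uv, depth, mask = project_radar_to_image(
    np.array([[2.0, 0.5, 10.0], [-4.0, 1.0, 35.0]]),
    np.array([[1260.0, 0.0, 800.0], [0.0, 1260.0, 450.0], [0.0, 0.0, 1.0]]),
    np.eye(4))
print(uv, depth)
```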
Previous works set an empirically determined, fixed circle radius $r$ to define the projected area of radar data. However, different targets vary in size and distance, making such a fixed projection area inherently inaccurate and potentially misleading for the extraction of radar PV features. As shown in the red box of Figure 2a, the radar information corresponding to the building in the background is incorrectly projected onto the car in the foreground.
In this work, we propose a radar image generation module based on radar RCS and depth information, aiming to enhance the accuracy of the radar projection area in the camera view. Specifically, radar RCS information provides size-related characteristics of the target: a larger target leads to a larger RCS measurement [4]. Therefore, we dynamically adjust the projection area of each radar target based on the RCS information. The circle radius $r$ is scaled by an RCS modulation factor $\alpha_{rcs}$, as described by the following equations:
$$
\hat{\sigma} = \frac{\sigma - \sigma_{min}}{\sigma_{max} - \sigma_{min}}, \qquad \alpha_{rcs} = 1 + \hat{\sigma}, \qquad r_{rcs} = \alpha_{rcs} \cdot r,
$$
where $\sigma$ denotes the RCS value of the radar target, measured in square meters ($\mathrm{m}^2$), $\sigma_{max}$ and $\sigma_{min}$ represent the maximum and minimum RCS values of the radar, respectively, and $\hat{\sigma}$ indicates the normalized RCS value of the radar target. The radar image adjusted based on the RCS information is shown in Figure 2b, where the radar projection area of the building in the background is noticeably expanded, covering a larger portion of the target. However, this also increases the erroneous projection onto the car in the foreground. This occurs because imaging must obey the perspective rule that nearer objects appear larger and farther objects appear smaller: even if a target is large, its projected area in the camera view will be smaller when it is farther away. Therefore, we further dynamically adjust the projection area of each radar target based on depth information. The circle radius is modulated by an additional depth factor $\alpha_{depth}$, as described by the following equations:
$$
\alpha_{depth} = 1 - \frac{d}{d_{max}}, \qquad r' = \alpha_{rcs} \cdot \alpha_{depth} \cdot r,
$$
where $d$ represents the radar target depth value, measured in meters (m), and $d_{max}$ refers to the maximum depth value in the scene. The radar image adjusted based on both RCS and depth information is shown in Figure 2c, where the radar projection areas for both the car in the foreground and the building in the background are more accurate. According to the experiments presented in Section 4, the radar image generation method based on RCS and depth information attains 57.2 NDS and 48.4 mAP, demonstrating superior performance compared to the other approaches.
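A sketch of the modulated radar image rasterization is shown below (NumPy/OpenCV), following the modulation described above. The RCS normalization bounds, the base radius, the clipping, and the far-to-near drawing order used to keep the closer target in overlapping regions are assumptions for illustration, not the exact implementation.

```python
import numpy as np
import cv2

def render_radar_image(uv, depth, rcs, img_hw, base_radius=8,
                       rcs_range=(-10.0, 40.0)):
    """Rasterize projected radar targets as filled circles whose radius is
    enlarged by normalized RCS and shrunk by depth. uv: (N, 2) pixel
    locations; depth, rcs: (N,) per-target measurements."""
    H, W = img_hw
    radar_img = np.zeros((H, W), dtype=np.float32)       # zeros outside circles
    rcs_min, rcs_max = rcs_range
    d_max = depth.max()
    # normalized RCS in [0, 1]; radius modulation factors
    rcs_norm = np.clip((rcs - rcs_min) / (rcs_max - rcs_min), 0.0, 1.0)
    alpha_rcs = 1.0 + rcs_norm                           # larger RCS -> larger radius
    alpha_depth = 1.0 - depth / (d_max + 1e-6)           # farther -> smaller radius
    radius = np.maximum(1, (base_radius * alpha_rcs * alpha_depth).astype(int))
    # draw far-to-near so overlapping circles keep the closer target's depth
    for i in np.argsort(-depth):
        u, v = int(round(uv[i, 0])), int(round(uv[i, 1]))
        if 0 <= u < W and 0 <= v < H:
            cv2.circle(radar_img, (u, v), int(radius[i]), float(depth[i]), thickness=-1)
    return radar_img

img = render_radar_image(np.array([[1052.0, 513.0], [300.0, 400.0]]),
                         np.array([10.0, 35.0]), np.array([5.0, 25.0]),
                         img_hw=(900, 1600))
```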
Cross-modal Feature Fusion. While radar provides a wealth of useful information, it also presents several challenges, such as noisy measurements induced by multi-path effects or clutter [7]. Through the radar image generation module, all radar targets, including noisy ones, are projected into the camera view, so the radar PV features extracted from the radar image are inherently noisy. Naive fusion methods, such as channel-wise concatenation or summation, do not resolve this problem and may even introduce adverse effects. In this work, we propose a dynamic fusion approach based on the attention mechanism [40] to fuse camera PV features with radar PV features, achieving promising results.
Specifically, given the camera PV features denoted by $F_C$ and the radar PV features denoted by $F_R$, we first leverage the accurate camera PV features to update the noisy radar PV features. To this end, $F_C$ is converted into queries $Q_C$, and $F_R$ is treated as keys and values. Then we apply deformable cross-attention [41] to update the radar PV features, as shown in the following equation:
$$
\tilde{F}_R(q) = \sum_{m=1}^{M} W_m \Big[ \sum_{k=1}^{K} A_{mqk} \cdot W'_m \, F_R\big(p_q + \Delta p_{mqk}\big) \Big],
$$
where $m$ indexes the attention head, $q$ indexes the query element, $k$ indexes the sampled keys, $M$ is the total number of attention heads, and $K$ is the total number of sampled keys.
$p_q$ represents the 2D reference point. $\Delta p_{mqk}$ and $A_{mqk}$ denote the sampling offset and attention weight of the $k$-th sampling point of the $q$-th query element in the $m$-th attention head, respectively. The scalar attention weight $A_{mqk}$ is normalized in the range $[0, 1]$. The sampling offsets $\Delta p_{mqk}$ are 2D real numbers with unconstrained range. Both $\Delta p_{mqk}$ and $A_{mqk}$ are obtained via linear projection over the query $Q_C$. $W_m$ and $W'_m$ are the output projection matrix and the input value projection matrix ($C_v = C/M$ by default), both of which are learnable. Once the radar PV features $\tilde{F}_R$ are updated, we then use them as queries $\tilde{Q}_R$ and treat $F_C$ as keys and values. Similarly, we apply deformable cross-attention to update the camera PV features, as shown in the following equation:
$$
\tilde{F}_C(q) = \sum_{m=1}^{M} W_m \Big[ \sum_{k=1}^{K} A_{mqk} \cdot W'_m \, F_C\big(p_q + \Delta p_{mqk}\big) \Big].
$$
After updating the PV features from both modalities, they are concatenated and processed through a residual block, obtaining the final fused PV features $F_{PV}$. As demonstrated in Section 4, our attention-based method achieves 57.2 NDS and 48.4 mAP, which is nearly 1.0 higher than that of the naive fusion methods.
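The sketch below illustrates this bidirectional fusion with a compact, single-scale deformable cross-attention in the spirit of Deformable DETR [41] (PyTorch). The class and variable names, shape conventions, offset normalization, and the reuse of a single attention module for both directions are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableCrossAttention(nn.Module):
    """Minimal single-scale deformable cross-attention: M heads, K sampled
    points per head, offsets and weights predicted from the query."""
    def __init__(self, dim=256, heads=8, points=4):
        super().__init__()
        self.heads, self.points, self.head_dim = heads, points, dim // heads
        self.offset_proj = nn.Linear(dim, heads * points * 2)   # delta p_mqk
        self.weight_proj = nn.Linear(dim, heads * points)       # A_mqk
        self.value_proj = nn.Linear(dim, dim)                   # W'_m (all heads)
        self.out_proj = nn.Linear(dim, dim)                     # W_m (all heads)

    def forward(self, query, ref_points, value, spatial_shape):
        # query: (B, Q, C); ref_points: (B, Q, 2) as (x, y) normalized to [0, 1]
        # value: (B, H*W, C) flattened key/value map; spatial_shape: (H, W)
        B, Q, C = query.shape
        H, W = spatial_shape
        M, K, D = self.heads, self.points, self.head_dim

        v = self.value_proj(value).view(B, H, W, M, D).permute(0, 3, 4, 1, 2)
        v = v.reshape(B * M, D, H, W)                            # per-head value maps

        offsets = self.offset_proj(query).view(B, Q, M, K, 2)
        weights = self.weight_proj(query).view(B, Q, M, K).softmax(-1)

        # sampling locations in [0, 1], then mapped to [-1, 1] for grid_sample
        loc = ref_points[:, :, None, None, :] + offsets / torch.tensor(
            [W, H], dtype=query.dtype, device=query.device)
        grid = (2.0 * loc - 1.0).permute(0, 2, 1, 3, 4).reshape(B * M, Q, K, 2)

        sampled = F.grid_sample(v, grid, mode="bilinear", align_corners=False)  # (B*M, D, Q, K)
        weights = weights.permute(0, 2, 1, 3).reshape(B * M, 1, Q, K)
        out = (sampled * weights).sum(-1)                        # weighted sum over K points
        out = out.view(B, M, D, Q).permute(0, 3, 1, 2).reshape(B, Q, C)
        return self.out_proj(out)

# Bidirectional PV fusion with illustrative shapes (one shared module for brevity)
B, C, H16, W16 = 2, 256, 16, 44
cam = torch.randn(B, C, H16, W16)   # camera PV features
rad = torch.randn(B, C, H16, W16)   # radar PV features

attn = DeformableCrossAttention(dim=C)
ys, xs = torch.meshgrid(torch.linspace(0.5 / H16, 1 - 0.5 / H16, H16),
                        torch.linspace(0.5 / W16, 1 - 0.5 / W16, W16), indexing="ij")
ref = torch.stack([xs, ys], -1).view(1, -1, 2).expand(B, -1, -1)   # cell-center reference points

flat = lambda t: t.flatten(2).transpose(1, 2)                      # (B, H*W, C)
rad_upd = attn(flat(cam), ref, flat(rad), (H16, W16))              # camera queries update radar
cam_upd = attn(rad_upd, ref, flat(cam), (H16, W16))                # updated radar queries update camera
fused = torch.cat([rad_upd, cam_upd], dim=-1)                      # concatenated, reshaped to a map downstream
print(fused.shape)  # torch.Size([2, 704, 512])
```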
3.5. BEV Fusion
We adopt the conventional BEV feature generation and BEV feature fusion methods to obtain the fused BEV features, and finally, the 3D object detection results are obtained through the detection head. The detection head is based on CenterPoint [42], which predicts the center heatmap using an anchor-free, multi-group head [43]. Next, we introduce the components of the BEV fusion module.
Image BEV Feature Generation. We generate image BEV features based on the LSS framework. For each camera view, we first perform PV fusion to obtain the fused PV features $F_{PV} \in \mathbb{R}^{\frac{H}{16} \times \frac{W}{16}}$ (ignoring the channel dimension). Then, based on the fused PV features, we predict the depth distribution over $D$ discrete depth bins and the $C$-dimensional semantic features for each pixel and compute their outer product to obtain the frustum view features $F_{frustum}$. After completing the above processing, we use the splat operation to convert the frustum view features $F_{frustum}$ into the unified BEV features $B_{img}$. For further details, we refer the reader to LSS [8].
Radar BEV Feature Generation. We generate radar BEV features based on the PointPillars framework. First, we voxelize the radar point cloud in the frustum view (ignoring the feature dimension) in a pillar-style manner, i.e., the vertical dimension is not discretized. Next, we use PointNet [44] and sparse convolution [45] to encode the non-empty radar pillars into frustum view features $F_{radar}$. Finally, we apply the pooling operation [46] to convert the frustum view features $F_{radar}$ into the unified BEV features $B_{radar}$. For further details, we refer the reader to PointPillars [47].
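A simplified sketch of the pillar-style radar encoding is given below (PyTorch). For brevity it voxelizes directly on a 2D grid and stands in for the sparse-convolution stage and the frustum-to-BEV pooling with a PointNet-style max pooling per pillar; the grid size, ranges, and feature dimensions are assumptions.

```python
import torch
import torch.nn as nn

class SimplePillarEncoder(nn.Module):
    """Simplified pillar-style radar encoder: group points into pillars on a
    2D grid, encode each pillar with a tiny shared MLP + max pooling, and
    scatter the pillar features onto a dense canvas."""
    def __init__(self, in_dim=5, out_dim=64, grid=(128, 128),
                 x_range=(0.0, 102.4), y_range=(-51.2, 51.2)):
        super().__init__()
        self.grid, self.x_range, self.y_range = grid, x_range, y_range
        self.mlp = nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU(inplace=True),
                                 nn.Linear(out_dim, out_dim))

    def forward(self, points):                       # points: (N, in_dim), dims 0-1 are x, y
        X, Y = self.grid
        xi = ((points[:, 0] - self.x_range[0]) / (self.x_range[1] - self.x_range[0]) * X).long()
        yi = ((points[:, 1] - self.y_range[0]) / (self.y_range[1] - self.y_range[0]) * Y).long()
        keep = (xi >= 0) & (xi < X) & (yi >= 0) & (yi < Y)
        xi, yi, pts = xi[keep], yi[keep], points[keep]
        feats = self.mlp(pts)                        # per-point features
        pillar_id = xi * Y + yi                      # flat pillar index per point
        canvas = feats.new_zeros(X * Y, feats.shape[1])
        # max-pool point features within each pillar (PointNet-style aggregation)
        canvas.index_reduce_(0, pillar_id, feats, reduce="amax", include_self=False)
        return canvas.view(X, Y, -1).permute(2, 0, 1)   # (out_dim, X, Y)

radar_points = torch.rand(300, 5) * torch.tensor([100.0, 80.0, 1.0, 1.0, 1.0]) \
               - torch.tensor([0.0, 40.0, 0.0, 0.0, 0.0])
bev = SimplePillarEncoder()(radar_points)
print(bev.shape)   # torch.Size([64, 128, 128])
```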
BEV Feature Fusion. We fuse the image BEV features and radar BEV features based on the CRN framework. First, the image BEV features and radar BEV features are flattened, after which each is passed through a layer normalization layer. Then, the features are concatenated and transformed into a $C$-dimensional query feature via a linear projection layer. Finally, the feature map is aggregated through the multi-modal deformable cross-attention (MDCA) module. We refer the reader to CRN [7] for more details.
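The query-formation part of this fusion can be sketched as follows (PyTorch; the channel sizes are assumptions, and the MDCA aggregation of CRN [7] is not reproduced here):

```python
import torch
import torch.nn as nn

class BEVQueryFusion(nn.Module):
    """Form the C-dimensional fusion query from image and radar BEV features:
    flatten -> per-modality LayerNorm -> concatenate -> linear projection."""
    def __init__(self, c_img=80, c_radar=64, c_out=256):
        super().__init__()
        self.norm_img, self.norm_radar = nn.LayerNorm(c_img), nn.LayerNorm(c_radar)
        self.proj = nn.Linear(c_img + c_radar, c_out)

    def forward(self, bev_img, bev_radar):            # (B, C_img, X, Y), (B, C_radar, X, Y)
        f_img = self.norm_img(bev_img.flatten(2).transpose(1, 2))       # (B, X*Y, C_img)
        f_radar = self.norm_radar(bev_radar.flatten(2).transpose(1, 2))
        query = self.proj(torch.cat([f_img, f_radar], dim=-1))          # (B, X*Y, C_out)
        return query

q = BEVQueryFusion()(torch.randn(2, 80, 128, 128), torch.randn(2, 64, 128, 128))
print(q.shape)   # torch.Size([2, 16384, 256])
```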