Article

BEVCorner: Enhancing Bird’s-Eye View Object Detection with Monocular Features via Depth Fusion

1 Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen 518055, China
2 Peng Cheng Laboratory, Shenzhen 518066, China
3 Department of Automation, BNRist, Tsinghua University, Beijing 100084, China
4 Streamax Technology Co., Ltd., 21-23/F B1 Building, Zhiyuan, No. 1001 Xueyuan Avenue, Shenzhen 518057, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(7), 3896; https://doi.org/10.3390/app15073896
Submission received: 9 March 2025 / Revised: 27 March 2025 / Accepted: 29 March 2025 / Published: 2 April 2025
(This article belongs to the Section Transportation and Future Mobility)

Abstract

This research paper presents BEVCorner, a novel framework that synergistically integrates monocular and multi-view pipelines for enhanced 3D object detection in autonomous driving. By fusing depth maps from Bird’s-Eye View (BEV) with object-centric depth estimates from monocular detection, BEVCorner enhances both global context and local precision, addressing the limitations of existing methods in depth precision, occlusion robustness, and computational efficiency. The paper explores four fusion techniques—direct replacement, weighted fusion, region-of-interest refinement, and hard combine—to balance the strengths of monocular and BEV depth estimation. Initial experiments on the NuScenes dataset yield a 38.72% NDS, which is lower than the baseline BEVDepth’s 43.59% NDS, highlighting the challenges in monocular pipeline alignment. Nevertheless, the upper-bound performance of BEVCorner is assessed under ground-truth depth supervision, and the results show a significant improvement, achieving a 53.21% NDS, despite a 21.96% increase in parameters (from 76.4 M to 97.9 M). The upper-bound analysis highlights the promise of camera-only fusion for resource-constrained scenarios.

1. Introduction

Accurate 3D object detection is essential for autonomous driving systems, enabling safe navigation and reliable environment perception. Current advancements in this field focus on multi-modal fusion, multi-view perception, depth estimation, and robustness to occlusion. While multi-view camera systems provide rich spatial context through Bird’s-Eye View (BEV) representations, they often struggle with depth ambiguity and occlusion handling. Conversely, monocular 3D object detection methods, though cost-effective, face inherent challenges in estimating precise depth and orientation from single-view imagery. To bridge these gaps, we propose BEVCorner, a novel framework that synergistically integrates monocular and multi-view pipelines to enhance 3D object detection accuracy and robustness.
Existing monocular methods like MonoFlex [1] excel in leveraging 2D contextual cues for object localization but suffer from depth estimation inaccuracies, particularly for distant or occluded objects. Meanwhile, multi-view approaches such as BEVDepth [2] generate dense BEV feature maps with LiDAR-supervised depth estimation but lack fine-grained object-centric details. BEVCorner addresses these limitations by fusing depth maps from BEV pipelines with object-specific depth estimates from monocular 3D object detection, creating a unified representation that balances global scene context and local object precision.
Our contributions are as follows.
  • We introduce a novel framework, BEVCorner, designed for BEV object detection. It integrates BEV features and monocular features using a depth fusion module.
  • We explore four fusion techniques—direct replacement, weighted fusion, region-of-interest refinement, and hard combine—to balance the strengths of monocular and BEV depth estimation.
  • Through extensive experiments on the NuScenes [3] dataset, we demonstrate BEVCorner’s potential, achieving a 53.21% NDS when leveraging ground-truth depth supervision. While our baseline result of 38.72% NDS lags behind BEVDepth due to challenges in monocular pipeline alignment, our upper-bound analysis highlights the promise of camera-only fusion for resource-constrained scenarios.
Code is available at https://github.com/jesslyn1999/BEVCorner (accessed on 28 March 2025).

2. Related Work

2.1. Monocular 3D Object Detection

Monocular 3D object detection methods estimate 3D bounding boxes from a single camera. These methods are crucial for cost-effective applications where only a single camera is available. Despite having only a 2D image, these methods infer 3D positions, mimicking human depth perception by learning from vast image datasets. Anchor-based methods [4,5,6] have been a cornerstone of early monocular 3D object detection, drawing from 2D object detection paradigms like Faster R-CNN [7,8]. These methods rely on predefined 3D bounding boxes (anchors) of various sizes and orientations, predicting whether or not each anchor contains an object and refining it to fit the object’s dimensions. The rise of anchor-free methods [1,9,10] eliminates the need for predefined anchors, predicting objects based on keypoints, centers, or direct regression. They are preferred due to their effectiveness and simplicity [11]. CenterNet [12] pioneered this approach, and many works extend from this framework. Refs. [13,14] focus on depth estimation. FCOS3D [15] redefines centerness using a 2D Gaussian distribution based on the 3D center and is three times faster in learning compared to CenterNet. MonoPSR [16] uses proposals and shape reconstruction for accurate 3D localization. MonoPair [17] improves detection for occluded samples by encoding spatial constraints from adjacent objects, optimizing predictions with nonlinear least squares. MoVi-3D [18] leverages virtual views to normalize object appearances. MonoCD [19] introduces complementary depth predictions to mitigate errors in depth estimation. Refs. [20,21] predict depth maps from the image, creating a pseudo-3D representation. These methods demonstrate the potential of monocular approaches, particularly in handling size variability and depth estimation from a single view, with applications in scenarios with limited camera availability. However, they may struggle with accurate position and orientation estimation due to the lack of multi-view constraints, which the proposed method seeks to address by integrating multi-view data.

2.2. Multi-View 3D Object Detection

Multi-view 3D object detection methods utilize multiple camera views to achieve accurate 3D object localization and recognition. BEV space construction methods [2,22,23] project multi-view images into a BEV space—a top-down representation of the scene—and employ a BEV-based detector to identify 3D objects. BEV space projection is inherently ambiguous because projecting a pixel from a camera image to its correct position in BEV requires resolving depth information. Traditionally, creating a BEV map from camera images involves first projecting 3D space onto the 2D image plane using camera intrinsics, then associating 3D features with corresponding 2D image features, and finally placing each pixel in the BEV map [2]. However, recent transformer-based methods [23,24,25,26] can implicitly predict the BEV feature map without explicit projection. BEVDet [27] introduced a view transformation module that converts camera features to BEV features. BEVDet4D [28] extended this by incorporating temporal information, recognizing the importance of image sequences for capturing dynamic scenes. StreamPETR [26] is object-centric, focusing on individual objects’ temporal behavior, unlike BEVDet4D’s broader BEV space approach. BEVFormer [29] utilizes a transformer to learn the BEV representation, focusing on spatial cross-attention to map camera features to BEV and addressing multi-view feature fusion challenges. Transformer-based methods generate 3D object queries from the BEV and use transformers to perform cross-view attention between these queries and the features extracted from multi-view images. These methods implicitly form the BEV space without constructing a dense BEV feature [30,31]. DETR3D [32] projects 3D object queries onto 2D image planes to gather relevant features. The integration of LiDAR with camera data has further advanced BEV 3D object detection, enhancing accuracy and robustness. It leverages the complementary strengths of LiDAR (precise 3D spatial data) and cameras (rich semantic information), particularly in the BEV space. MV2DFusion-e [25] integrates camera and LiDAR data using object queries specific to each modality. SparseLIF [33] presents a high-performance, fully sparse detector for end-to-end multi-modality 3D object detection. BEVFusion [34] proposes a generic framework for multi-task, multi-sensor fusion in BEV space, unifying multi-modal features in a shared BEV representation. BEVFusion4D [35] further extends BEVFusion by integrating LiDAR and camera information in BEV space with cross-modality guidance and temporal aggregation. Misalignment between sensor modalities, however, affects feature alignment in the BEV feature map, leading to blurred or incorrect object representations, which can result in false positives, missed detections, or inaccurate bounding box predictions, especially for occluded or partially visible objects.

2.3. Fusion Techniques in 3D Object Detection

Fusion techniques in 3D object detection aim to combine information from different data sources or methods to enhance detection accuracy, addressing the limitations of single-modal approaches. While most research focuses on fusing camera and LiDAR data, whether by early fusion [36,37], late fusion [38], or cascade fusion [39,40,41], camera-only fusion, particularly the combination of monocular and multi-view methods, is most relevant to the proposed work. DPFusion [42] proposes fusing camera and LiDAR data with a dense depth map-guided BEV transform and multi-modal feature adaptive fusion, achieving competitive results on NuScenes [3]. However, papers exploring camera-only fusion, especially combining monocular and multi-view information, are less common. View-to-Label [43] proposes self-supervising monocular 3D detection using multi-view consistency, leveraging RGB sequences alone for improved performance. MVC-MonoDet [44] introduces a semi-supervised framework, enforcing consistency across unlabeled multi-view data (stereo or video) with box-level and object-level regularizers, enhancing monocular detection. These works suggest that combining monocular and multi-view information can improve detection accuracy, particularly in data-scarce scenarios. This aligns with the proposed method, which leverages the strengths of both monocular and BEV object detection by fusing their outputs through depth map enhancement, providing a novel approach to improving overall detection accuracy.

2.4. Depth Estimation in 3D Object Detection

Depth estimation is a critical component in both multi-view and monocular 3D object detection, enabling accurate 3D localization and projection into BEV space. Its accuracy significantly impacts detection performance, making it a focal point for research. In multi-view methods, depth estimation facilitates feature projection. BEVDepth [2] explicitly focuses on improving depth estimation, using explicit depth supervision and a depth refinement module. In monocular methods, depth estimation is more challenging due to the ill-posed nature of 2D to 3D mapping. MonoFlex [1] uses a depth refinement module to enhance estimates, while MoVi-3D [18] normalizes depth-related appearances with virtual views. Ref. [45] shows that ground-truth depth improves detection accuracy, but estimated depth’s effectiveness depends on factors like network choice and integration strategy, proposing an early concatenation strategy for higher mAP. Ref. [46] integrates ground-referenced geometric priors into monocular models, achieving unparalleled accuracy, highlighting depth estimation’s role in enhancing detection.

3. BEVCorner

3.1. Baseline and Motivation

BEVCorner is the proposed method designed to improve 3D object detection accuracy in autonomous driving scenarios by synergistically combining the strengths of both monocular and multi-view camera systems. It aims to fuse the outputs from these two systems through depth map enhancement, providing a novel approach to address the challenges of depth estimation and scene understanding in BEV space. This method is particularly motivated by the need to leverage the complementary information from single-camera (monocular) and multiple-camera (multi-view) setups to achieve more robust and accurate detection (see Table 1), especially in complex urban environments. To establish a foundation for BEVCorner, two state-of-the-art methods are selected as baselines: BEVDepth [2] for multi-view depth estimation and MonoFlex [1] for monocular 3D object detection. The following paragraphs provide the details of each method, their capabilities, and their limitations.
BEVDepth [2] has demonstrated superior performance over several baseline methods on the NuScenes [3] dataset and is a state-of-the-art method for depth estimation from multi-view images in autonomous driving, achieving competitive depth accuracy on metrics such as Mean Absolute Error (MAE) and Root Mean Square Error (RMSE). It obtains reliable depth by explicitly supervising its depth network with depth labels derived from LiDAR point clouds during training, exploiting the relationship between image pixels and their corresponding positions in the BEV, which naturally encodes depth information; no LiDAR input is required at inference time. However, BEVDepth may struggle with depth estimation for distant or occluded objects due to the limitations of image-based depth estimation. Its depth network can be less effective in scenarios with sparse visual cues or complex occlusions, potentially leading to inaccuracies in long-range depth predictions.
MonoFlex [1] excels in monocular 3D object detection, particularly on the KITTI [47] dataset, where it achieves high mean Average Precision (mAP) for 3D object detection, demonstrating its effectiveness in leveraging rich visual cues and contextual information from individual camera views. It is a flexible single-stage monocular 3D object detection framework capable of handling different object categories effectively without category-specific architectures. It is extended from CenterNet [12] and incorporates a size-adaptive module and a depth refinement module to estimate 3D bounding boxes from single images. The size-adaptive module adjusts to varying object sizes, while the depth refinement module enhances depth estimation by leveraging mathematical priors and uncertainty modeling. As a monocular method, MonoFlex faces ambiguity in depth estimation from a single view, which can lead to inaccuracies in 3D positioning, especially for objects at varying distances or in occluded scenarios. This depth ambiguity is a known challenge in monocular vision, limiting its robustness compared to multi-view approaches.
The motivation for BEVCorner stems from the recognition that both methods, while individually strong, have complementary strengths and limitations that can be addressed through their integration. The primary goal is to enhance 3D object detection accuracy by leveraging accurate depth estimation from BEVDepth [2] and rich contextual information from MonoFlex [1]. By combining these two approaches, BEVCorner aims to enhance depth estimation by integrating the multi-view depth estimates and monocular depth refinement. It also improves scene understanding by utilizing detailed object recognition capabilities and spatial object-centric depth. This fusion is expected to provide a more robust and accurate solution for 3D object detection, particularly in urban driving scenarios where both depth accuracy and object context are critical.
One of the primary challenges in combining BEVDepth and MonoFlex is aligning the image features from the monocular and multi-view pipelines to achieve collaborative perception. Initially, it might seem beneficial to use a shared backbone for feature extraction to ensure consistency in feature representation across both pipelines. However, the two pipelines have different feature requirements. The multi-view BEV pipeline requires features that are contextual across multiple views in both spatial and temporal dimensions to improve the accuracy of Inverse Perspective Mapping (IPM). Monocular pipelines benefit from features that are rich in 2D contextual information from individual images, such as textures, object appearance, and semantic context, which are crucial for object recognition and localization in single views; these features are optimized for monocular depth estimation and 3D bounding box prediction, leveraging the size-adaptive and depth refinement modules. Therefore, BEVCorner employs separate backbones for feature extraction in the multi-view BEV and monocular detection pipelines, extracting different types of image features tailored to each task’s requirements. The fusion process then enhances the depth map using the multi-view BEV representation from BEVDepth and integrates it with the object detection outputs from MonoFlex, potentially through a depth map enhancement module, to achieve improved detection accuracy. BEVCorner thus represents a promising approach to enhancing 3D object detection, aiming to improve detection accuracy and robustness and setting the stage for more reliable autonomous driving systems.

3.2. Overall Architecture

The overall architecture is shown in Figure 1. Two image backbones extract 2D features, $F_{mono\_i}^{2d}$ and $F_{BEV\_i}^{2d}$, for the monocular and multi-view BEV pipelines, respectively (see Equation (1)), from the N-view input images $I = \{ I_i, i = 1, 2, \ldots, N \}$, where H, W, and $C_F$ denote the feature’s height, width, and number of channels.
$F^{2d} = \{ F_i^{2d} \in \mathbb{R}^{C_F \times H \times W},\ i = 1, 2, \ldots, N \}$ (1)
Each image feature $F_i^{2d}$ goes through a different pipeline. $F_{mono\_i}^{2d}$ is processed by a monocular pipeline extended from CenterNet [12]. Prediction heads process $F_{mono\_i}^{2d}$ by regressing object properties, including the 2D bounding box, 3D dimensions, geometry-based 3D bounding box, orientation, keypoints, object depth, and keypoint depths. The final depth estimate is an uncertainty-guided combination of the regressed depth and the depths computed from the estimated keypoints and dimensions, following the MonoFlex [1] framework. The purpose of the monocular pipeline is to obtain each detected object’s depth distribution, $D_{mono}(k) = \{ D_{mono}(k) \in \mathbb{R}^{C_D \times H \times W},\ k = 1, 2, \ldots, K \}$, where $C_D$ is the number of depth bins set in the BEV pipeline and K is the number of objects detected in the i-th frame.
$F_{BEV\_i}^{2d}$ is processed by the BEV pipeline. Combined with the camera parameters, $F_{BEV\_i}^{2d}$ is fed into the depth network to generate depth-aware context features, $C_{BEV}(x, y) \in \mathbb{R}^{C_C \times H \times W}$, and an estimated depth distribution, $D_{BEV}(x, y) \in \mathbb{R}^{C_D \times H \times W}$, where $(x, y)$ refers to an image pixel in the i-th frame, $C_C$ is the number of context channels, and $C_D$ is the number of depth bins. The depth estimate, $D_{BEV}(x, y)$, is explicitly supervised via LiDAR point clouds.
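To make the two depth representations concrete, the sketch below builds tensors with the shapes defined above: a per-pixel softmax depth distribution for $D_{BEV}$ and a per-object distribution $D_{mono}(k)$ obtained by scattering a scalar object depth into the same bin grid inside the object’s 2D box. The one-hot binning, the bin range, and the helper name are illustrative assumptions, not details taken from the paper or its released code.

```python
import torch

# Minimal sketch, assuming uniform depth bins over [d_min, d_max]; the binning
# scheme and helper name are illustrative assumptions.
def object_depth_to_distribution(depth_value, box_mask, num_bins=112,
                                 d_min=2.0, d_max=58.0):
    """depth_value: scalar depth (m) of object k from the monocular head.
    box_mask: (H, W) bool tensor, True inside the object's 2D bounding box."""
    H, W = box_mask.shape
    dist = torch.zeros(num_bins, H, W)
    bin_idx = int((depth_value - d_min) / (d_max - d_min) * (num_bins - 1))
    bin_idx = max(0, min(num_bins - 1, bin_idx))
    dist[bin_idx][box_mask] = 1.0     # one-hot along the depth axis (assumption)
    return dist                        # D_mono(k), shape (C_D, H, W)

# D_BEV: per-pixel softmax over C_D depth bins from the BEV depth network.
C_D, H, W = 112, 16, 44
D_BEV = torch.softmax(torch.randn(C_D, H, W), dim=0)
mask = torch.zeros(H, W, dtype=torch.bool)
mask[4:9, 10:20] = True               # hypothetical 2D box of one detected object
D_mono_k = object_depth_to_distribution(23.5, mask)
```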
A depth fusion module merges the two depth estimates, $D_{mono}(k)$ and $D_{BEV}(x, y)$, at the i-th frame; this is described in detail in Section 3.3. The fused depth distribution, $D^{pred}$, combines the BEV depth map with the monocular, object-specific depths.
$D^{pred} = D^{fused} = D_{BEV}^{pred} \oplus D_{mono}^{pred}$ (2)

where $\oplus$ denotes the depth fusion operation defined in Section 3.3.
A view transformer projects $F_i^{2d}$ into a 3D representation $F_i^{3d}$ using Equation (3) and then pools the result into a unified BEV representation, $F_{bev}$. A 3D detection head predicts the class, 3D box offsets, and other attributes from $F_{bev}$.
$F_i^{3d} = F_i^{2d} \otimes D_i^{pred}, \qquad F_i^{3d} \in \mathbb{R}^{C_F \times C_D \times H \times W}$ (3)

where $\otimes$ denotes the per-pixel outer product between the context features and the depth distribution, as in the lift step of BEVDepth [2].
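The following sketch illustrates Equation (3) under that outer-product reading: the depth distribution spreads each pixel’s context features across the $C_D$ depth bins. The channel counts and the einsum formulation are illustrative assumptions.

```python
import torch

# Minimal sketch of the lifting step in Equation (3), assuming the outer-product
# (Lift-Splat style) view transform used by BEVDepth; shapes are illustrative.
def lift_to_3d(context_feat, depth_dist):
    """context_feat: (C_F, H, W) depth-aware context features.
    depth_dist:   (C_D, H, W) per-pixel depth distribution D^pred."""
    # F_3d[c, d, h, w] = context_feat[c, h, w] * depth_dist[d, h, w]
    return torch.einsum('chw,dhw->cdhw', context_feat, depth_dist)

# Example: C_F = 80 context channels, C_D = 112 depth bins on a 16 x 44 feature grid.
F_2d = torch.randn(80, 16, 44)
D_pred = torch.softmax(torch.randn(112, 16, 44), dim=0)
F_3d = lift_to_3d(F_2d, D_pred)   # shape (80, 112, 16, 44)
```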

3.3. Depth Fusion Module

This module focuses on ways to fuse two pieces of depth information. We have the following depths:
  • Depth map ($D_{BEV}$, from the LiDAR-supervised multi-view pipeline of BEVDepth): a base layer that provides a soft depth distribution over each pixel, i.e., the overall depth estimate for the entire scene.
  • Object depth ($D_{mono}$, from the monocular image pipeline of MonoFlex): provides accurate, object-specific depth estimates.
In vanilla BEVDepth, only $D_{BEV}$ is used. We argue that this depth map is a soft probabilistic estimate that lacks object-level context: through LiDAR supervision, each image pixel is assumed to concentrate on a single depth value, so occlusions cannot be handled.
With this in mind, we propose fusing these two depths to obtain a more contextual depth map. Four different methods are used here (see Figure 2; a consolidated code sketch follows this list):
  • Direct Replacement
For each object i detected by the monocular pipeline, the object-specific depth $D_{mono}(i)$ directly replaces the depth map distribution $D_{BEV}(i)$ for regions corresponding to the object in the BEV grid. The depth map is retained in areas outside the detected objects.
For each 3D pixel $(x, y, z)$, we combine the depth map and the object-specific depth predictions as follows:
$E[D_{fused}(x, y, z)] = \begin{cases} E[D_{mono}(i)], & \text{if } (x, y) \in \text{object } i,\ z \in D_{mono}(i) \\ E[D_{BEV}(x, y)], & \text{otherwise} \end{cases}$ (4)
Here, $D_{mono}(i)$ represents the object-level depth from the monocular object detection pipeline for object i, while $E[D_{BEV}(x, y)]$ is the expected depth from the BEVDepth softmax distribution at pixel $(x, y)$. The value of $E[D_{mono}(i)]$ can either be a constant (e.g., 1), the 2D bounding box regression loss $L_{dim}$, or a Gaussian kernel-fused value to ensure smooth transitions at object boundaries.
This approach is straightforward: it replaces depth values within object regions while leaving the rest of the grid unchanged. However, it has limitations. Since the object-specific depth from monocular detection cannot always be fully trusted, this method only works well when the monocular depth is highly accurate and should take precedence in detected object regions.
  • Weighted Fusion
To avoid abrupt transitions between object-level depth and depth maps, we apply a confidence map, $C(x, y, z)$, that weighs the depth estimates based on the reliability of each source (monocular vs. BEV). The confidence map, $C(x, y, z)$, indicates the trustworthiness of the estimated monocular depth at pixel $(x, y, z)$, taking into account factors such as object detection certainty and distance to the object. It is defined as:
$C(x, y, z) = \alpha \in \mathbb{R}^{C_D \times H \times W}$ (5)
where $(x, y) \in \text{object } i$ and $z \in D_{mono}(i)$. The parameter $\alpha$ is a fixed or learned weight that reflects the confidence in the object-specific depth from the monocular pipeline.
A simple approach is to set α as a fixed constant for all pixels inside the object region, typically close to 1, since monocular depth is generally more reliable in object regions than BEV depth. For example:
$\alpha = 0.9,$
indicating 90% confidence in the monocular depth.
Alternatively, α can be learned using a neural network:
$\alpha = f\left( E[D_{mono}(i)] + E[D_{BEV}(x, y)] \right)$
where f is a function implemented by the neural network.
α can also vary based on object size or distance. For smaller or more distant objects, depth estimates are often less reliable, so α can be decreased accordingly. For instance:
$\alpha = \dfrac{1}{1 + \mathrm{distance}(i)}$
or based on the area of the bounding box:
$\alpha = \dfrac{\text{area of bounding box}}{\text{max area}}$
where max area is the maximum area of any bounding box in the dataset or scene.
Finally, the fused depth D fused ( x , y , z ) is computed as a weighted sum:
$D_{fused}(x, y, z) = C(x, y, z) \cdot E[D_{mono}(i)] + \big(1 - C(x, y, z)\big) \cdot E[D_{BEV}(x, y)]$
This approach allows the object-specific depth to dominate when the object is detected (i.e., higher C ( x , y , z ) ), while the depth map distribution from BEVDepth remains influential in areas without objects (i.e., lower C ( x , y , z ) ).
  • Region-of-Interest Refinement
Another way to combine the two depth sources is to treat the monocular depth as a refinement over the BEV distribution in specific Regions Of Interest (ROIs). These ROIs correspond to the detected objects in the scene. This approach combines elements of both the direct replacement and weighted fusion methods. Like direct replacement, the refinement is confined strictly to the object regions, leaving the depth map from BEV untouched outside these regions. Like weighted fusion, the same fusion process takes place, but it occurs only within each enclosed detected object region.
The blend of the two depths is achieved using a weighted sum, where BEV depth and monocular depth are mixed based on a confidence map or ROI-specific weighting:
$D_{i\_fused}(x, y, z) = C(x, y, z) \cdot E[D_{mono}(i)] + \big(1 - C(x, y, z)\big) \cdot E[D_{BEV}(x, y)],$
where $D_{i\_fused}(x, y, z)$ is the fused depth applied to the region enclosing object i, and $C(x, y, z)$ follows Equation (5). For the learned $\alpha$ approach, the focus shifts to the object itself:
$\alpha = f\left( E[D_{mono}(i)] \right)$
The final fused depth is computed as:
$D_{fused}(x, y, z) = \begin{cases} D_{i\_fused}(x, y, z), & \text{if } (x, y, z) \in \mathrm{ROI}_i \\ E[D_{BEV}(x, y)], & \text{otherwise} \end{cases}$
This method ensures that only the depth within the object regions is refined, leaving other regions unaffected.
  • Hard Combine
Hard Combine takes a different approach to depth fusion, integrating the two pipelines at the feature level rather than directly manipulating depth values. This method leverages the complementary strengths of the BEV features (global scene context and geometric consistency) and the monocular features (fine-grained object details such as texture and keypoints) to create a unified 3D feature representation. It is an extension of Equation (3).
$F^{3d} = \underbrace{F^{2d} \otimes D_{BEV}}_{\text{BEV-guided lifting}} + \underbrace{F^{2d} \otimes D_{mono}}_{\text{Object-centric lifting}}$
where $F^{3d} \in \mathbb{R}^{C_F \times C_D \times H \times W}$ denotes the fused 3D feature volume. This technique may produce unrealistic depth values in general, as it does not account for the relative trustworthiness or scale differences between $D_{BEV}$ and $D_{mono}$. Consequently, it is less practical than the other methods and is included primarily for comparative analysis.
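The consolidated sketch below illustrates the four strategies over per-pixel depth distributions of shape $(C_D, H, W)$. The boolean object masks, the fixed $\alpha$, and the hypothetical alpha_fn module are illustrative assumptions; this is a sketch, not the released implementation.

```python
import torch

# Illustrative sketch of the four fusion strategies in Section 3.3, written over
# per-pixel depth distributions of shape (C_D, H, W). Helper names, boolean
# object masks, and the fixed alpha are assumptions for illustration only.

def direct_replacement(d_bev, d_mono_per_obj, obj_masks):
    """Inside each detected object, the monocular depth replaces the BEV depth."""
    d_fused = d_bev.clone()
    for d_obj, mask in zip(d_mono_per_obj, obj_masks):   # one entry per object i
        d_fused[:, mask] = d_obj[:, mask]
    return d_fused

def weighted_fusion(d_bev, d_mono_per_obj, obj_masks, alpha=0.9):
    """Confidence-weighted blend with a fixed alpha (e.g., 0.9) inside object regions."""
    d_fused = d_bev.clone()
    for d_obj, mask in zip(d_mono_per_obj, obj_masks):
        d_fused[:, mask] = alpha * d_obj[:, mask] + (1.0 - alpha) * d_bev[:, mask]
    return d_fused

def roi_refinement(d_bev, d_mono_per_obj, obj_masks, alpha_fn=None):
    """Blending restricted to each object ROI; alpha may be predicted from the
    monocular depth alone (alpha_fn is a hypothetical learned module)."""
    d_fused = d_bev.clone()
    for d_obj, mask in zip(d_mono_per_obj, obj_masks):
        a = alpha_fn(d_obj) if alpha_fn is not None else 0.9
        d_fused[:, mask] = a * d_obj[:, mask] + (1.0 - a) * d_bev[:, mask]
    return d_fused

def hard_combine(f_2d, d_bev, d_mono_full):
    """Feature-level combine: both depth sources lift the same 2D features and
    the resulting 3D volumes are summed, as in the Hard Combine equation above."""
    lift = lambda d: torch.einsum('chw,dhw->cdhw', f_2d, d)
    return lift(d_bev) + lift(d_mono_full)
```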

4. Experiments

4.1. Experimental Setup

4.1.1. Dataset and Metrics

The NuScenes dataset [3] is a large-scale benchmark for autonomous driving research. It includes data from six cameras, one LiDAR sensor, and five radars. The dataset comprises 1000 scenarios, each approximately 20 s in duration. Key samples are annotated at 2 Hz and are divided into 700, 150, and 150 scenes for training, validation, and testing, respectively. Each sample consists of RGB images captured by six cameras, providing a 360° horizontal Field Of View (FOV). For the 3D object detection task, the dataset contains 1.4 million annotated 3D bounding boxes spanning 10 object classes. We adopt the official evaluation metrics for the 3D object detection task. The mean Average Precision (mAP) for NuScenes is computed using the center distance on the ground plane to match predicted results with ground truth annotations, while the NuScenes Detection Score (NDS) incorporates five types of True Positive (TP) error metrics: the mean Average Translation Error (mATE) is the Euclidean center distance in 2D (in meters), ensuring objects are correctly located; the mean Average Scale Error (mASE) evaluates size estimation accuracy as $(1 - \mathrm{IoU})$ after alignment, reflecting how well object dimensions are captured; the mean Average Orientation Error (mAOE) measures orientation accuracy in radians as the smallest yaw angle difference between prediction and ground truth; the mean Average Velocity Error (mAVE) gauges velocity prediction accuracy in meters per second as the L2 norm of the 2D velocity difference, which is crucial for moving objects; and the mean Average Attribute Error (mAAE) evaluates the accuracy of attribute classification, such as object type or state, as $(1 - \mathrm{acc})$. These metrics are combined into the NDS score as follows:
$\mathrm{NDS} = \frac{1}{10} \left[ 5 \cdot \mathrm{mAP} + \sum_{\mathrm{mTP} \in \mathbb{TP}} \big( 1 - \min(1, \mathrm{mTP}) \big) \right]$
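As a quick check of the formula, the short sketch below recomputes NDS from mAP and the five TP errors; the helper name is illustrative.

```python
# Minimal sketch of the NDS formula above: mAP is weighted by 5 and each of the
# five TP errors contributes 1 - min(1, error); the function name is illustrative.
def nds(m_ap, tp_errors):
    """m_ap in [0, 1]; tp_errors: the five mTP values (mATE, mASE, mAOE, mAVE, mAAE)."""
    tp_score = sum(1.0 - min(1.0, e) for e in tp_errors)
    return (5.0 * m_ap + tp_score) / 10.0

# Example with the BEVDepth baseline numbers from Table 2:
print(nds(0.3313, [0.7009, 0.2796, 0.536, 0.5533, 0.2273]))   # ~0.4359, i.e., 43.59% NDS
```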

4.1.2. Implementation Details

We process RGB inputs from six cameras: CAM_FRONT_LEFT, CAM_FRONT, CAM_FRONT_RIGHT, CAM_BACK_LEFT, CAM_BACK, and CAM_BACK_RIGHT. The images in the dataset have an original resolution of 900 × 1600. These images are normalized using a mean of [123.675, 116.28, 103.53] and a standard deviation of [58.395, 57.12, 57.375].
To augment the image data, we apply transformations consistent with prior works [2,27], including random cropping, random scaling, random flipping, and random rotation. Random resizing is applied within a range of ( 0.386 , 0.55 ) , random rotation is constrained to (−5.4°, 5.4°), and random flipping is performed. After augmentation, the resulting image size becomes 256 × 704 . In addition to image-level augmentations, we also employ BEV (Bird’s Eye View) data augmentations, which include random scaling, random flipping, and random rotation. Rotation is applied within (−22.5°, 22.5°), scaling ranges between ( 0.95 , 1.05 ) , and flipping occurs with a probability of 0.5 along both the x- and y-axes. The BEV data are processed in a spatial grid of 128 × 128 . The chosen augmentation parameters aim to simulate realistic variations, such as minor misalignments in camera orientation or slight tilts due to road inclines. For example, the image rotation range of ±5.4° is subtle enough to reflect these realistic scenarios without introducing unrealistic distortions.
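For concreteness, the augmentation and normalization settings listed above can be collected into configuration dictionaries, as sketched below; the key names are illustrative assumptions, and only the numeric values come from the text.

```python
# Sketch of the augmentation and normalization settings described above; key
# names are illustrative, only the numeric values are taken from the paper.
ida_aug_conf = {                  # image-level augmentation
    'final_dim': (256, 704),      # output image size after augmentation
    'resize_lim': (0.386, 0.55),  # random resizing range
    'rot_lim': (-5.4, 5.4),       # random rotation range, degrees
    'rand_flip': True,
}
bda_aug_conf = {                  # BEV-level augmentation on a 128 x 128 grid
    'rot_lim': (-22.5, 22.5),     # degrees
    'scale_lim': (0.95, 1.05),
    'flip_dx_ratio': 0.5,         # flip probability along x
    'flip_dy_ratio': 0.5,         # flip probability along y
}
img_norm_cfg = {
    'mean': [123.675, 116.28, 103.53],
    'std': [58.395, 57.12, 57.375],
}
```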
For the monocular pipeline, we use the DLA-34 backbone [48], consistent with prior works [1,49,50]. This backbone processes augmented images with a resolution of 256 × 704. The detector head consists of multiple prediction branches, each responsible for regressing different object properties along with their associated uncertainties. Each branch follows a two-layer structure: a 3 × 3 × 256 convolutional feature extraction layer with batch normalization and ReLU activation, followed by a 1 × 1 × $c_o$ convolutional prediction layer, where $c_o$ denotes the output channel dimension specific to each task. As in [1], we incorporate an edge fusion mechanism to enhance feature learning. This mechanism improves the network’s ability to capture object boundaries and structural relationships in the image. Additionally, we adopt an adaptive depth ensemble of M + 1 estimates, combining direct regression with M geometric solutions derived from keypoints, weighted by their uncertainties.
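A minimal sketch of one such prediction branch is given below; the input channel count and the per-task output channels are assumptions for illustration.

```python
import torch.nn as nn

# Sketch of one prediction branch of the monocular detector head as described
# above: a 3x3, 256-channel conv with BatchNorm and ReLU, then a 1x1 conv with
# c_o task-specific output channels. in_channels = 64 is an assumed value for
# the DLA-34 feature maps, not confirmed by the paper.
def make_branch(in_channels: int, c_o: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(in_channels, 256, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(256),
        nn.ReLU(inplace=True),
        nn.Conv2d(256, c_o, kernel_size=1),
    )

# Example branches (output channel counts are illustrative):
heads = nn.ModuleDict({
    'heatmap': make_branch(64, 10),   # one channel per NuScenes detection class
    'dim_3d':  make_branch(64, 3),    # length, width, height
    'depth':   make_branch(64, 2),    # depth and its uncertainty
})
```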
For the BEV pipeline, we use ResNet-50 [51] pretrained on ImageNet (via torchvision) as the image backbone to process augmented images with a resolution of 256 × 704 . To extract BEV features, we employ ResNet-18 [51]. Two-dimensional context features are transformed into a unified 3D space, specifically a frustum of dimensions 112 × 16 × 44 , representing each camera’s field of view. Using the voxel pooling mechanism proposed by [2], we construct BEV data with dimensions of 128 × 128 .

4.1.3. Training Protocol

The model is optimized using AdamW [52] with an initial learning rate of $2 \times 10^{-4}$ and a batch size of 2, distributed across two Quadro RTX 6000 24 GB GPUs. Due to limited computational resources, we trained the monocular and BEV pipelines separately for approximately 6 epochs each and subsequently fine-tuned the unified BEVCorner model for an additional 2 epochs. For the ablation studies, all experiments were trained for approximately 6 epochs. Despite the limited training duration, the upper-bound experiments (Section 4.2.2) demonstrate significant improvements over the baseline [2].

4.1.4. Loss Calculation

The spatial distribution of object centers is modeled using a heatmap, with the loss computed via Focal Loss [53]. This loss addresses the class imbalance between foreground (object center) and background pixels by assigning higher weights to hard-to-classify examples. Bounding box dimensions (length, width, height), depths, and corners are regressed using the L1 loss (Mean Absolute Error). Orientation prediction employs a multi-bin approach to estimate the yaw angle of detected objects in 3D space. The yaw angle range (0 to 360 degrees) is discretized into 8 bins, each spanning 45 degrees, following common practice in 3D detection [1]. For each object, the model uses a classification head with cross-entropy loss to predict the correct bin, followed by a regression head with L1 loss to refine the residual angle within the selected bin. This hybrid approach balances coarse categorization with fine-grained adjustment, aiming to achieve robust orientation estimates. In BEVCorner, the final orientation is predicted by the 3D detection head in the BEV pipeline, leveraging the fused depth information, $D^{pred}$, to construct the BEV feature map, $F_{bev}$, which informs the orientation output.
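The multi-bin orientation loss can be sketched as follows; the tensor layouts, the bin-center parameterization of the residual, and the function name are assumptions for illustration rather than the exact implementation.

```python
import math
import torch
import torch.nn.functional as F

# Sketch of the multi-bin orientation loss described above: 8 bins of 45 degrees,
# cross-entropy on the bin class plus L1 on the residual angle within the
# ground-truth bin. Tensor layouts and the bin-center residual are assumptions.
def multibin_orientation_loss(bin_logits, residual_pred, yaw_gt, num_bins=8):
    """bin_logits: (N, num_bins); residual_pred: (N, num_bins); yaw_gt: (N,) in [0, 2*pi)."""
    bin_size = 2.0 * math.pi / num_bins
    bin_gt = (yaw_gt / bin_size).long().clamp(max=num_bins - 1)   # which 45-degree bin
    residual_gt = yaw_gt - (bin_gt.float() + 0.5) * bin_size      # offset from the bin center
    cls_loss = F.cross_entropy(bin_logits, bin_gt)
    # L1 only on the residual predicted for the ground-truth bin.
    res_pred = residual_pred.gather(1, bin_gt.unsqueeze(1)).squeeze(1)
    reg_loss = F.l1_loss(res_pred, residual_gt)
    return cls_loss + reg_loss
```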

4.2. Ablation Study

4.2.1. Depth Fusion Module

We conducted experiments using the depth fusion strategies described in Section 3.3. The results are presented in Table 2. Unfortunately, the outcomes were not encouraging when compared to the baseline BEVDepth [2]. Specifically, the accuracy achieved by the monocular 3D object detector did not reach the expected level, which may suggest that the approach proposed in MonoFlex [1] is not particularly effective on the NuScenes dataset. One of the main reasons could be the complexity of the method, especially the projections from the global frame onto six different image planes, one per camera. Additionally, the trained model exhibited incorrect depth estimates from the monocular pipeline, which affects the entire framework. The output also revealed issues related to class imbalance: object categories that appear frequently in the NuScenes [3] dataset, such as barriers and traffic cones, showed high mAP values, whereas other categories performed significantly worse, as shown in Table 3. Consistent with the previous work, MonoFlex [1] was originally developed and tested on the KITTI [47] benchmark and may therefore perform well on simpler scenes with fewer dynamic objects (such as cars, pedestrians, and cyclists) and less complex environments. MonoFlex is also not optimized for NDS’s broader criteria, especially velocity and attribute prediction, which are not emphasized in KITTI, whose evaluation focuses on mAP for 3D boxes. Furthermore, augmenting both image-level and BEV-level data increased the overall complexity of the pipeline. Despite these challenges, we proceeded to evaluate the upper performance limit that BEVCorner can achieve.

4.2.2. Upper Bound Performance Estimation

In BEVCorner, to achieve the maximum performance limit and determine which depth fusion method generates the highest NDS, we utilize ground truth depth and bounding box information to train the overall architecture. Specifically, we replace the monocular pipeline directly with this ground truth data during training. Surprisingly, the results improved significantly, as demonstrated in Table 4. Additionally, we visualized our significant improvement through Figure 3.

4.2.3. Predictive Capacity of Object Orientation Model

To evaluate the predictive capacity of BEVCorner’s object orientation model, we analyze the mean Average Orientation Error (mAOE), which measures the smallest yaw angle difference between predicted and ground-truth orientations in radians. As shown in Table 2 and Table 4, BEVCorner with predicted depth achieves mAOE values ranging from 0.57 to 0.89 radians across fusion methods, higher than the baseline BEVDepth’s 0.536 radians, indicating less accurate orientation predictions. This degradation is primarily due to inaccuracies in the fused depth map $D^{pred}$, particularly from the monocular pipeline’s depth estimates, which introduce noise into the BEV feature map, $F_{bev}$, used by the detection head. BEVCorner also employs a straightforward representation in which axis-aligned bounding boxes describe the detected objects. These boxes remain parallel to the image axes relative to the camera and therefore do not inherently capture rotation information in 3D space. In real-world scenarios, many objects—such as vehicles or pedestrians—exhibit slight rotations, making axis-aligned boxes suboptimal for precise orientation modeling. This imprecise orientation information leads BEVCorner to an average orientation error approximately 5–21 degrees (converted from radians) higher than that of the baseline BEVDepth, highlighting the critical role of depth quality in orientation accuracy. These results underscore the limitations of BEVCorner while also pointing to the need for improved monocular depth estimation in future work.

4.2.4. Runtime and Efficiency

As expected, BEVCorner runs slower than the baseline BEVDepth [2], since the framework extends it. The evaluation was conducted on the NuScenes [3] Eval dataset using two Quadro RTX 6000 GPUs, each equipped with 24 GB of memory. A comparison of the resources used is provided in Table 5.

5. Conclusions and Future Work

A novel framework, BEVCorner, was presented in this work for enhancing BEV object detection by integrating monocular features via a depth fusion module. The idea is rooted in the observation that, while multi-view systems like BEVDepth [2] provide dense BEV representations, they lack fine-grained object details, and monocular approaches such as MonoFlex [1] struggle with depth ambiguity. The strengths of both BEV and monocular approaches were therefore combined through the depth fusion module. Four fusion techniques—direct replacement, weighted fusion, region-of-interest refinement, and hard combine—were explored to balance the strengths of monocular and BEV depth estimation. Through extensive experiments on the NuScenes dataset [3], the potential of BEVCorner was demonstrated. The baseline result of 38.72% NDS lags behind the BEVDepth baseline [2] of 43.59% due to challenges in aligning the monocular pipeline, specifically MonoFlex [1]; further insights into this observation are provided in the ablation study in Section 4.2. Despite this limitation, the approach remains promising. To evaluate the upper-bound performance, experiments were conducted in which the network was trained using object-centric ground-truth depth. A significant improvement was obtained, with an NDS score of 53.21%, representing a 9.62% increase without introducing new modalities or excessive computational resources. The number of parameters increased from 76.4 M to 97.9 M, a 21.96% rise in trainable parameters. This trade-off, a moderate increase in parameters for a substantial gain in NDS, demonstrates the value of the approach.
Two potential enhancements could further improve the performance of BEVCorner. First, multi-camera fusion could be implemented to address limitations in monocular object detection methods, which rely solely on 2D images from a single camera view. Data from six cameras are provided by the NuScenes dataset [3], offering a more comprehensive field of view. In future work, transformers [54] could be leveraged to fuse features from all camera views before they are passed to the detection head, potentially improving coverage and accuracy. Second, alternative representations for detected objects could be explored. In this experiment, axis-aligned bounding boxes were used to represent all detected objects, as shown in Figure 2. While this approach is straightforward, rotation information cannot be captured, as the bounding boxes remain parallel to the image axes. In reality, slight rotations may be exhibited by many detected objects, making axis-aligned boxes suboptimal. Therefore, oriented bounding boxes or circular regions could be investigated in future work to better capture the spatial orientation of objects.

Author Contributions

Conceptualization, J.N. and L.L.; methodology, J.N.; validation, J.N.; formal analysis, Q.L.; investigation, J.N. and Z.L.; resources, J.N.; data curation, J.N.; writing—original draft preparation, J.N. and L.L.; writing—review and editing, J.N., Q.L., and Y.G.; visualization, J.N.; supervision, Z.L., L.L., and Y.G.; project administration, Y.G.; funding acquisition, Z.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Shenzhen Science and Technology Innovation Committee under Grant JSGG20211029100204006.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article, further inquiries can be directed to the corresponding author.

Acknowledgments

The authors would like to acknowledge the financial support from the Shenzhen Science and Technology Innovation Committee, as well as the experimental support from Streamax Technology Co., Ltd.

Conflicts of Interest

Authors Liming Liu (L.L.) and Yipeng Gao (Y.G.) are employed by the company Streamax Technology Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Zhang, Y.; Lu, J.; Zhou, J. Objects are different: Flexible monocular 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 3289–3298. [Google Scholar]
  2. Li, Y.; Ge, Z.; Yu, G.; Yang, J.; Wang, Z.; Shi, Y.; Sun, J.; Li, Z. Bevdepth: Acquisition of reliable depth for multi-view 3d object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 1477–1485. [Google Scholar]
  3. Caesar, H.; Bankiti, V.; Lang, A.H.; Vora, S.; Liong, V.E.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; Beijbom, O. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11621–11631. [Google Scholar]
  4. Luo, S.; Dai, H.; Shao, L.; Ding, Y. M3dssd: Monocular 3d single stage object detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 6145–6154. [Google Scholar]
  5. Huang, K.C.; Wu, T.H.; Su, H.T.; Hsu, W.H. Monodtr: Monocular 3d object detection with depth-aware transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4012–4021. [Google Scholar]
  6. Brazil, G.; Liu, X. M3d-rpn: Monocular 3d region proposal network for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9287–9296. [Google Scholar]
  7. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  8. Roh, M.C.; Lee, J.y. Refining faster-RCNN for accurate object detection. In Proceedings of the 2017 Fifteenth IAPR International Conference on Machine Vision Applications (MVA), Nagoya, Japan, 8–12 May 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 514–517. [Google Scholar]
  9. Li, Z.; Gao, Y.; Hong, Q.; Du, Y.; Serikawa, S.; Zhang, L. Keypoint3D: Keypoint-based and Anchor-Free 3D object detection for Autonomous driving with Monocular Vision. Remote Sens. 2023, 15, 1210. [Google Scholar] [CrossRef]
  10. Guan, H.; Song, C.; Zhang, Z.; Tan, T. MonoPoly: A practical monocular 3D object detector. Pattern Recognit. 2022, 132, 108967. [Google Scholar] [CrossRef]
  11. Chen, W.; Zhao, J.; Zhao, W.L.; Wu, S.Y. Shape-aware monocular 3D object detection. IEEE Trans. Intell. Transp. Syst. 2023, 24, 6416–6424. [Google Scholar] [CrossRef]
  12. Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. Centernet: Keypoint triplets for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6569–6578. [Google Scholar]
  13. Wang, T.; Xinge, Z.; Pang, J.; Lin, D. Probabilistic and geometric depth: Detecting objects in perspective. In Proceedings of the Conference on Robot Learning, PMLR, Auckland, New Zealand, 14–18 December 2022; pp. 1475–1485. [Google Scholar]
  14. Liu, Z.; Wu, Z.; Tóth, R. Smoke: Single-stage monocular 3d object detection via keypoint estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 13–19 June 2020; pp. 996–997. [Google Scholar]
  15. Wang, T.; Zhu, X.; Pang, J.; Lin, D. Fcos3d: Fully convolutional one-stage monocular 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 1–17 October 2021; pp. 913–922. [Google Scholar]
  16. Ku, J.; Pon, A.D.; Waslander, S.L. Monocular 3d object detection leveraging accurate proposals and shape reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 11867–11876. [Google Scholar]
  17. Chen, Y.; Tai, L.; Sun, K.; Li, M. Monopair: Monocular 3d object detection using pairwise spatial relationships. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 12093–12102. [Google Scholar]
  18. Simonelli, A.; Bulo, S.R.; Porzi, L.; Ricci, E.; Kontschieder, P. Towards generalization across depth for monocular 3d object detection. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXII 16; Springer: Berlin/Heidelberg, Germany, 2020; pp. 767–782. [Google Scholar]
  19. Yan, L.; Yan, P.; Xiong, S.; Xiang, X.; Tan, Y. Monocd: Monocular 3d object detection with complementary depths. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 10248–10257. [Google Scholar]
  20. Weng, X.; Kitani, K. Monocular 3d object detection with pseudo-lidar point cloud. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  21. Tao, C.; Cao, J.; Wang, C.; Zhang, Z.; Gao, Z. Pseudo-mono for monocular 3d object detection in autonomous driving. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 3962–3975. [Google Scholar]
  22. Li, Y.; Bao, H.; Ge, Z.; Yang, J.; Sun, J.; Li, Z. Bevstereo: Enhancing depth estimation in multi-view 3d object detection with temporal stereo. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 1486–1494. [Google Scholar]
  23. Li, Z.; Wang, W.; Li, H.; Xie, E.; Sima, C.; Lu, T.; Yu, Q.; Dai, J. Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In Proceedings of the Computer Vision—ECCV 2022: 17th European Conference, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 1–18. [Google Scholar]
  24. Liu, F.; Huang, T.; Zhang, Q.; Yao, H.; Zhang, C.; Wan, F.; Ye, Q.; Zhou, Y. Ray denoising: Depth-aware hard negative sampling for multi-view 3d object detection. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 200–217. [Google Scholar]
  25. Wang, Z.; Huang, Z.; Gao, Y.; Wang, N.; Liu, S. Mv2dfusion: Leveraging modality-specific object semantics for multi-modal 3d detection. arXiv 2024, arXiv:2408.05945. [Google Scholar]
  26. Wang, S.; Liu, Y.; Wang, T.; Li, Y.; Zhang, X. Exploring object-centric temporal modeling for efficient multi-view 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 3621–3631. [Google Scholar]
  27. Huang, J.; Huang, G.; Zhu, Z.; Ye, Y.; Du, D. Bevdet: High-performance multi-camera 3d object detection in bird-eye-view. arXiv 2021, arXiv:2112.11790. [Google Scholar]
  28. Huang, J.; Huang, G. Bevdet4d: Exploit temporal cues in multi-camera 3d object detection. arXiv 2022, arXiv:2203.17054. [Google Scholar]
  29. Yang, C.; Chen, Y.; Tian, H.; Tao, C.; Zhu, X.; Zhang, Z.; Huang, G.; Li, H.; Qiao, Y.; Lu, L.; et al. Bevformer v2: Adapting modern image backbones to bird’s-eye-view recognition via perspective supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 17830–17839. [Google Scholar]
  30. Liu, H.; Teng, Y.; Lu, T.; Wang, H.; Wang, L. Sparsebev: High-performance sparse 3d object detection from multi-camera videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 18580–18590. [Google Scholar]
  31. Lin, X.; Lin, T.; Pei, Z.; Huang, L.; Su, Z. Sparse4d: Multi-view 3d object detection with sparse spatial-temporal fusion. arXiv 2022, arXiv:2211.10581. [Google Scholar]
  32. Wang, Y.; Guizilini, V.C.; Zhang, T.; Wang, Y.; Zhao, H.; Solomon, J. Detr3d: 3d object detection from multi-view images via 3d-to-2d queries. In Proceedings of the Conference on Robot Learning PMLR, Auckland, New Zealand, 14–18 December 2022; pp. 180–191. [Google Scholar]
  33. Zhang, H.; Liang, L.; Zeng, P.; Song, X.; Wang, Z. SparseLIF: High-performance sparse LiDAR-camera fusion for 3D object detection. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 109–128. [Google Scholar]
  34. Liu, Z.; Tang, H.; Amini, A.; Yang, X.; Mao, H.; Rus, D.L.; Han, S. Bevfusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 2774–2781. [Google Scholar]
  35. Cai, H.; Zhang, Z.; Zhou, Z.; Li, Z.; Ding, W.; Zhao, J. Bevfusion4d: Learning lidar-camera fusion under bird’s-eye-view via cross-modality guidance and temporal aggregation. arXiv 2023, arXiv:2303.17099. [Google Scholar]
  36. Vora, S.; Lang, A.H.; Helou, B.; Beijbom, O. Pointpainting: Sequential fusion for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4604–4612. [Google Scholar]
  37. Xu, D.; Anguelov, D.; Jain, A. Pointfusion: Deep sensor fusion for 3d bounding box estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 244–253. [Google Scholar]
  38. Pang, S.; Morris, D.; Radha, H. CLOCs: Camera-LiDAR object candidates fusion for 3D object detection. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, Nevada, USA, 25–29 October 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 10386–10393. [Google Scholar]
  39. Shin, K.; Kwon, Y.P.; Tomizuka, M. Roarnet: A robust 3d object detection based on region approximation refinement. In Proceedings of the 2019 IEEE Intelligent Vehicles Symposium (IV), Paris, France, 9–12 June 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 2510–2515. [Google Scholar]
  40. Wang, Z.; Jia, K. Frustum convnet: Sliding frustums to aggregate local point-wise features for amodal 3d object detection. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1742–1749. [Google Scholar]
  41. Yang, Z.; Sun, Y.; Liu, S.; Shen, X.; Jia, J. Ipod: Intensive point-based object detector for point cloud. arXiv 2018, arXiv:1812.05276. [Google Scholar]
  42. Chen, Z.; Hu, B.J.; Luo, C.; Chen, G.; Zhu, H. Dense projection fusion for 3D object detection. Sci. Rep. 2024, 14, 23492. [Google Scholar] [CrossRef] [PubMed]
  43. Mouawad, I.; Brasch, N.; Manhardt, F.; Tombari, F.; Odone, F. View-to-Label: Multi-View Consistency for Self-Supervised 3D Object Detection. arXiv 2023, arXiv:2305.17972. [Google Scholar] [CrossRef]
  44. Lian, Q.; Xu, Y.; Yao, W.; Chen, Y.; Zhang, T. Semi-supervised monocular 3d object detection by multi-view consistency. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 715–731. [Google Scholar]
  45. Cetinkaya, B.; Kalkan, S.; Akbas, E. Does depth estimation help object detection? Image Vis. Comput. 2022, 122, 104427. [Google Scholar] [CrossRef]
  46. Liu, Y. Scalable Vision-Based 3D Object Detection and Monocular Depth Estimation for Autonomous Driving. Ph.D. Thesis, Hong Kong University of Science and Technology (Hong Kong), Hong Kong, China, 2024. [Google Scholar]
  47. Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? the kitti vision benchmark suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; IEEE: Piscataway, NJ, USA, 2012; pp. 3354–3361. [Google Scholar]
  48. Yu, F.; Wang, D.; Shelhamer, E.; Darrell, T. Deep layer aggregation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2403–2412. [Google Scholar]
  49. Ma, X.; Zhang, Y.; Xu, D.; Zhou, D.; Yi, S.; Li, H.; Ouyang, W. Delving into localization errors for monocular 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 4721–4730. [Google Scholar]
  50. Liu, X.; Xue, N.; Wu, T. Learning auxiliary monocular contexts helps monocular 3d object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 36, pp. 1810–1818. [Google Scholar]
  51. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  52. Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar]
  53. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  54. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 6000–6010. [Google Scholar]
Figure 1. Framework of BEVCorner. The framework comprises dual pipelines with separate backbones for monocular and multi-view feature extraction, ensuring optimal adaptation to the unique demands of each modality. The extracted features are fused through a Depth Fusion module, and Voxel Pooling is employed to unify all point features into a single coordinate system, projecting them onto the BEV feature map.
Figure 1. Framework of BEVCorner. The framework comprises dual pipelines with separate backbones for monocular and multi-view feature extraction, ensuring optimal adaptation to the unique demands of each modality. The extracted features are fused through a Depth Fusion module, and Voxel Pooling is employed to unify all point features into a single coordinate system, projecting them onto the BEV feature map.
Applsci 15 03896 g001
Figure 2. Different depth fusion strategies. The top-left image displays the object across various depth bins. In the remaining images, different colors represent distinct depth estimation sources: blue indicates regions derived from the BEV depth map, red denotes regions from the monocular depth estimation of the object, and purple highlights regions where both depth estimates are fused. Additionally, in the visualization of the weighted fusion method, light blue signifies regions where the BEV depth map contributes less.
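To make the colour coding concrete, the snippet below sketches how the per-pixel depth distributions could be combined inside one object's 2D box for three of the strategies. The function, its mode names, and the blending weight alpha are simplified assumptions for illustration rather than the exact BEVCorner fusion code; in particular, the geometry- and learning-based weighting variants are not reproduced here.

import torch

def fuse_in_box(bev_depth, mono_depth, box, mode="weighted", alpha=0.7):
    """Toy per-pixel depth fusion inside one object's 2D box.
    bev_depth, mono_depth: (D, H, W) depth-bin distributions; box: (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    fused = bev_depth.clone()                       # blue region: BEV depth kept everywhere else
    roi_bev = bev_depth[:, y1:y2, x1:x2]
    roi_mono = mono_depth[:, y1:y2, x1:x2]
    if mode == "direct_replacement":                # red region: monocular depth overrides BEV
        fused[:, y1:y2, x1:x2] = roi_mono
    elif mode == "weighted":                        # purple region: convex blend of both sources
        mix = alpha * roi_mono + (1.0 - alpha) * roi_bev
        fused[:, y1:y2, x1:x2] = mix / mix.sum(dim=0, keepdim=True)
    elif mode == "hard_combine":                    # keep both depth hypotheses, renormalised
        both = roi_mono + roi_bev
        fused[:, y1:y2, x1:x2] = both / both.sum(dim=0, keepdim=True)
    return fused

# Example with random distributions and one box.
D, H, W = 64, 32, 32
bev = torch.rand(D, H, W).softmax(dim=0)
mono = torch.rand(D, H, W).softmax(dim=0)
out = fuse_in_box(bev, mono, box=(4, 4, 20, 20), mode="weighted")

In this toy version a fixed alpha plays the role of the fusion weight; the geometry and learned variants evaluated in Tables 2 and 4 would replace it with a per-pixel weight map.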
Figure 3. Experiment results. (a) Vanilla BEVDepth fails to detect distant occluded objects. (b) BEVCorner detects such objects but is less accurate in regions where detected bounding boxes overlap. (c) BEVCorner trained with ground-truth depth supervision achieves higher accuracy on occluded objects. Under the same confidence threshold, the number of predicted bounding boxes varies significantly between methods.
Table 1. Characteristics of BEV and monocular object detection.
Aspect | BEV Object Detection | Monocular Object Detection
Input Data | Typically uses LiDAR point clouds or fused multi-sensor data (e.g., LiDAR + camera). | Uses a single RGB camera image.
Depth Information | Directly available from LiDAR or depth sensors, providing accurate 3D spatial information. | Inferred from 2D images using monocular depth estimation, which may be less accurate.
Strengths | ✓ Precise depth and spatial data improve object positioning in 3D space. ✓ Objects appear at consistent scales in BEV maps. ✓ Ideal for tasks needing spatial awareness, such as path planning. | ✓ Rich semantic information, e.g., detailed visual features (texture, color) that enhance classification. ✓ Requires only a single camera, lowering hardware costs.
Weaknesses | − LiDAR data can miss distant or small objects due to sparsity. − Processing point clouds or BEV maps requires significant resources. | − Struggles with accurate 3D localization due to inferred depth. − Occluded objects in the image plane are hard to detect. − Objects at different distances vary in size, complicating detection. − Performance drops in poor conditions (e.g., rain, darkness).
Typical Applications | Autonomous driving (e.g., obstacle detection, path planning); robotics (e.g., navigation in structured environments). | Surveillance and security; consumer electronics (e.g., smartphones, AR/VR); low-cost autonomous systems.
Computational Cost | High, due to multiple cameras and processing of large point clouds or multi-sensor fusion. | Lower, though depth estimation can increase complexity.
Table 2. Performance of depth fusion methods in BEVCorner.
Configuration | mAP (%) ↑ | mATE ↓ | mASE ↓ | mAOE ↓ | mAVE ↓ | mAAE ↓ | NDS (%) ↑
Baseline: BEVDepth [2]
Vanilla: non EMA + non CBGS + 1 key | 33.13 | 0.7009 | 0.2796 | 0.5360 | 0.5533 | 0.2273 | 43.59
BEVCorner (ours)
direct_replacement (fixed) | 20.68 | 0.8725 | 0.2945 | 0.6675 | 0.8654 | 0.2516 | 30.83
weighted_fusion (fixed) | 16.68 | 0.9176 | 0.3451 | 0.8296 | 1.2502 | 0.3679 | 23.74
weighted_fusion (geometry) | 27.64 | 0.7372 | 0.2938 | 0.5690 | 0.6631 | 0.2465 | 38.72
weighted_fusion (learned) | 21.87 | 0.8340 | 0.2930 | 0.6437 | 0.8314 | 0.2424 | 32.49
roi_refinement (fixed) | 20.77 | 0.8652 | 0.2930 | 0.6594 | 0.8249 | 0.2420 | 31.54
roi_refinement (geometry) | 26.65 | 0.7525 | 0.2930 | 0.5798 | 0.6975 | 0.2560 | 37.53
roi_refinement (learned) | 22.67 | 0.8247 | 0.2914 | 0.6255 | 0.7853 | 0.2531 | 33.54
hard_combine (fixed) | 23.32 | 0.7369 | 0.2945 | 0.5780 | 0.7504 | 0.2391 | 35.67
Note: ↑ indicates that higher values are better, while ↓ indicates that lower values are better.
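For reference, the NDS values in Tables 2 and 4 follow the standard nuScenes Detection Score, which combines mAP with the five true-positive error metrics (each clipped at 1). Substituting the baseline row above reproduces its reported score, confirming that the tabulated numbers are on the standard scale:

\[
\mathrm{NDS}=\tfrac{1}{10}\Big[5\,\mathrm{mAP}+\sum_{\mathrm{mTP}}\big(1-\min(1,\mathrm{mTP})\big)\Big],\qquad \mathrm{mTP}\in\{\mathrm{mATE},\mathrm{mASE},\mathrm{mAOE},\mathrm{mAVE},\mathrm{mAAE}\},
\]
\[
\mathrm{NDS}_{\text{baseline}}=\tfrac{1}{10}\big[5(0.3313)+0.2991+0.7204+0.4640+0.4467+0.7727\big]\approx 0.4359=43.59\%.
\]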
Table 3. Object detection performance metrics for the BEVCorner weighted_fusion (geometry) configuration.
Object | AP ↑ | ATE ↓ | ASE ↓ | AOE ↓ | AVE ↓ | AAE ↓
car | 0.385 | 0.599 | 0.170 | 0.262 | 0.909 | 0.252
truck | 0.178 | 0.787 | 0.228 | 0.285 | 0.756 | 0.241
bus | 0.322 | 0.714 | 0.227 | 0.190 | 1.259 | 0.320
trailer | 0.117 | 1.111 | 0.270 | 0.560 | 0.478 | 0.189
construction_vehicle | 0.031 | 1.090 | 0.542 | 1.144 | 0.133 | 0.390
pedestrian | 0.272 | 0.755 | 0.299 | 0.985 | 0.622 | 0.325
motorcycle | 0.295 | 0.643 | 0.272 | 0.715 | 0.838 | 0.246
bicycle | 0.291 | 0.550 | 0.275 | 0.719 | 0.310 | 0.010
traffic_cone | 0.429 | 0.545 | 0.368 | nan | nan | nan
barrier | 0.443 | 0.578 | 0.285 | 0.261 | nan | nan
Note: ↑ indicates that higher values are better, while ↓ indicates that lower values are better.
Table 4. Upper-bound performance of BEVCorner.
Configuration | mAP (%) ↑ | mATE ↓ | mASE ↓ | mAOE ↓ | mAVE ↓ | mAAE ↓ | NDS (%) ↑
Baseline: BEVDepth [2]
Vanilla: non EMA + non CBGS + 1 key | 33.13 | 0.7009 | 0.2796 | 0.5360 | 0.5533 | 0.2273 | 43.59
BEVCorner (ours)
direct_replacement (fixed) | 55.79 | 0.4461 | 0.2950 | 0.7014 | 0.8257 | 0.2566 | 52.65
weighted_fusion (fixed) | 54.84 | 0.4705 | 0.3035 | 0.8963 | 1.0790 | 0.2971 | 47.75
weighted_fusion (geometry) | 43.11 | 0.5622 | 0.2972 | 0.6121 | 0.6978 | 0.2484 | 47.38
weighted_fusion (learned) | 55.84 | 0.4466 | 0.2935 | 0.6796 | 0.8000 | 0.2561 | 53.16
roi_refinement (fixed) | 55.81 | 0.4463 | 0.2947 | 0.6910 | 0.7998 | 0.2590 | 53.00
roi_refinement (geometry) | 45.88 | 0.5360 | 0.2952 | 0.6245 | 0.7284 | 0.2625 | 48.47
roi_refinement (learned) | 55.07 (+21.94) | 0.45 (−0.25) | 0.2919 (+0.01) | 0.6627 (+0.13) | 0.7727 (+0.22) | 0.2561 (+0.03) | 53.21 (+9.62)
hard_combine (fixed) | 23.32 | 0.7370 | 0.2945 | 0.5780 | 0.7504 | 0.2391 | 35.67
Values in parentheses for roi_refinement (learned) denote the change relative to the BEVDepth baseline.
Note: ↑ indicates that higher values are better, while ↓ indicates that lower values are better.
Table 5. Resource usage of BEVCorner.
Method | # Params | Eval Time | Peak GPU Memory (Eval)
BEVDepth [2] | 76.4 M | 115 s | 1311.20 MB
MonoFlex [1] as monocular pipeline | 21.5 M | – | –
BEVCorner (ours) | 97.9 M (+21.96%) | 135 s (+20 s) | 2447.44 MB
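As a quick worked arithmetic on the figures above (not an additional measurement): the parameter overhead equals the monocular pipeline, and the tabulated percentage matches the added parameters expressed as a share of the BEVCorner total:

\[
97.9\,\mathrm{M}-76.4\,\mathrm{M}=21.5\,\mathrm{M}\ (\text{the MonoFlex pipeline}),\qquad \frac{21.5\,\mathrm{M}}{97.9\,\mathrm{M}}\approx 21.96\%.
\]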
