1. Introduction
The rapid development of autonomous driving technologies has significantly elevated the importance of 3D perception systems in this field. As the demand for autonomous systems to comprehensively understand their surroundings grows, 3D object detection has emerged as a cornerstone technology for enabling safe and efficient navigation. Traditional 3D detection approaches rely on lidar sensors, leveraging point cloud data for object recognition and localization, with representative algorithms including Lvp [1], VoxelNet [2], and PV-RCNN [3]. While lidar offers advantages in precision and stability, the inherent sparsity of point clouds leads to a sharp decline in detection performance for distant targets. Furthermore, its prohibitive cost and hardware requirements have constrained its application in large-scale autonomous driving systems. According to RoboSense’s latest financial data, the average unit price of lidar products for advanced driver-assistance system (ADAS) applications in the first quarter of 2024 was USD 358 [4]. Industry research reports indicate that the unit price of ordinary surround-view automotive camera modules ranges from USD 20 to USD 28, while the unit price of ADAS automotive camera modules ranges from USD 41 to USD 69 [5,6]. These budgetary constraints have spurred research into monocular 3D object detection, which strives to enable accurate 3D perception using only a single camera sensor. As the automotive industry progresses toward higher levels of automation, the development of cost-effective and robust 3D detection solutions has become increasingly critical. Motivated by these economic and practical considerations, recent research has shifted focus toward advancing monocular 3D detection techniques, as explored in the following sections.
Recent years have witnessed remarkable advancements in monocular 3D detection techniques. These methodologies can be broadly categorized into three paradigms: direct regression, depth-guided estimation, and pseudo-lidar point cloud generation. Direct regression approaches bypass intermediate representations by leveraging geometric priors and uncertainty-aware depth estimation to predict 3D bounding box parameters (center coordinates, dimensions, and orientation) directly from RGB images. Pioneering works such as MonoCon [7] and MonoDLE [8] map image features to 3D space through keypoint detection and geometric constraints. SMOKE [9] eliminates the traditional 2D detection branch, instead employing a single-stage network to jointly estimate keypoints and 3D attributes, achieving real-time inference at 30 frames per second (FPS). MonoDGP [10] refines model predictions by explicitly modeling geometric errors. Recent innovations, including MonoCAPE [11], introduce coordinate-aware position embedding generators to enhance the model’s spatial understanding, addressing the historical neglect of spatial context in earlier methods. MonoDiff [12] employs a diffusion-based framework for monocular 3D detection and pose estimation, improving feature representation without requiring additional annotation costs. However, the interdependency of predicted parameters (e.g., depth and center offsets) often amplifies estimation errors, posing a persistent challenge.
Building on the challenges of direct regression, another prominent paradigm within monocular 3D detection is depth-guided estimation, which integrates depth estimation as an auxiliary task to strengthen spatial reasoning. For instance, AuxDepthNet [13] presents a depth-fusion transformer architecture designed to effectively combine visual information with depth data. The architecture incorporates a depth position mapping unit (DPM) alongside an auxiliary depth feature unit (ADF). By guiding the fusion process with depth cues, the network achieves holistic integration of visual and depth-related characteristics, leading to dependable and optimized detection. Exploiting Ground Depth Estimation [14] designs a transformer-based RGB-D fusion network to effectively unify RGB and depth information. MonoDFNet [15] further incorporates multi-branch depth prediction and weight-sharing modules to enhance depth information acquisition and integration. While MonoGRNet [16] and D4LCN [17] utilize sparse depth supervision or depth-adaptive convolutions to minimize computational overhead, their performance remains constrained by the accuracy of pretrained depth estimators: depth prediction errors propagate into 3D detection, particularly for distant or occluded objects.
In contrast to depth-guided methods that leverage depth as an auxiliary feature, the pseudo-lidar framework adopts a different strategy by transforming depth estimates into a 3D point cloud. Early implementations such as pseudo-lidar [18] project depth maps into point clouds and apply point-based detectors such as Frustum PointNet [19]. Subsequent works [20,21] enhance robustness through multimodal fusion with low-cost few-beam lidar. AM3D [22] and Mono3D_PLiDAR [23] extract point cloud frustums from pseudo-lidar data using 2D detection masks, while Sparse Query Dense [24] improves data richness through synthetic point clouds. To address the high computational demands of pseudo-lidar methods, Meng et al. [25] propose a lightweight detection framework that achieves minimal latency, ensuring real-time performance. Gao et al. [26] introduced a 2D detection mask channel as a guiding layer, reshaping the pseudo-lidar representation to facilitate 3D object detection. This method has achieved good results on several datasets, particularly strong progress on the KITTI dataset, demonstrating its potential in autonomous driving applications. However, the pseudo-lidar approach also faces challenges, most notably the structural mismatch between estimated depth and the real scene geometry. This mismatch degrades the geometric fidelity of the point cloud, leading to reduced object detection accuracy [18,21].
To address the above problems, this paper proposes MSFNet3D, a pseudo-lidar optimization method based on a multi-scale attention mechanism. It aims to improve the multi-scale feature representation capability of depth estimation, optimize the efficiency of cross-modal feature fusion, and generate more accurate point cloud data through semantic guidance. Specifically, we design a multi-scale channel spatial attention module (MS_CBAM) that strengthens the network’s ability to extract multi-scale geometric information through a hierarchical feature pyramid and an adaptive weight allocation mechanism, effectively alleviating the multi-scale feature loss of traditional depth estimation. Secondly, we propose a dynamic fusion strategy based on local gradient consistency, which adjusts the feature fusion between RGB images and depth maps according to local gradient information, improving the expressiveness of cross-modal features. Finally, we introduce a semantic-guided point cloud generation method: semantic features extracted from an instance segmentation network are embedded into the pseudo-lidar point cloud generation process to enhance the semantic attributes of the point cloud, enabling the detection network to identify and distinguish different object categories more accurately.
The organization of this paper is as follows: Section 2 presents the design of the MSFNet3D algorithm in detail. Section 3 introduces the experimental setup and comparative analysis and verifies the effectiveness of each module through ablation experiments. Section 4 presents the limitations of our approach, along with a discussion of necessary future work. Section 5 summarizes the key contributions of this paper and underscores its potential impact on the advancement of autonomous driving technology.
2. MSFNet3D: Framework and Operational Flow
2.1. System Framework
MSFNet3D primarily comprises three stages: a depth map generation network, pseudo-point cloud semantic fusion generation, and 3D detection.
As illustrated in Figure 1, a robust depth estimation network first generates depth maps. Subsequently, based on the PENet [27] encoder embedded with MS_CBAM and consistency weights, a dual-branch feature fusion is performed through the dynamic interaction of color-dominant and depth-dominant pathways, yielding a high-precision depth map. A pseudo-point cloud is then generated and enhanced by target region augmentation guided by Mask R-CNN [28]. The final stage utilizes mature 3D detection techniques to identify objects within the scene.
2.2. MS_CBAM
Precise depth estimation requires the effective capture of features at various scales and the enhancement of information-rich feature expressions. To this end, we propose a novel multi-scale channel spatial attention module (MS_CBAM). Based on the Convolutional Block Attention Module (CBAM) [
29], this module introduces a multi-scale feature extraction mechanism and an adaptive weight allocation strategy so as to more effectively utilize contextual information and focus on important features.
The structure of MS_CBAM, as shown in
Figure 2, mainly includes three parts: multi-scale feature extraction, adaptive weight allocation, and channel and spatial attention mechanisms.
The multi-scale feature extraction module employs average pooling operations with downsampling ratios of 2, 4, and 8, ensuring comprehensive coverage of scene geometry. Each scale of pooling operation is followed by a 3 × 3 convolution and a bilinear interpolation operation to restore the original spatial dimension and adjust the number of channels. Following this, the multi-scale feature maps are combined, resulting in a more integrated and complete representation of the data. At the same time, in order to effectively fuse the multi-scale features, we introduce an adaptive weight allocation strategy. The strategy uses a global average pooling layer and two fully connected layers to learn the weight of the feature map of each scale. These weights are activated by a sigmoid function to ensure that their values are between 0 and 1. Subsequently, the learned weights are multiplied by the corresponding multi-scale feature maps to achieve adaptive feature fusion. This adaptive weighting strategy enables the network to dynamically emphasize information-rich scales while suppressing less-relevant scales. Then, the module uses channel and spatial attention processes to refine the fused multi-scale features.
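To make this procedure concrete, a minimal PyTorch-style sketch of the multi-scale extraction and adaptive weight allocation is given below. The channel widths, the hidden size of the weight branch, and the additive combination of the weighted scale branches are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleExtraction(nn.Module):
    """Sketch of the multi-scale branch of MS_CBAM (details assumed)."""
    def __init__(self, channels, scales=(2, 4, 8)):
        super().__init__()
        self.scales = scales
        # One 3x3 convolution per pooled scale; channel count kept fixed (assumption)
        self.convs = nn.ModuleList(
            [nn.Conv2d(channels, channels, kernel_size=3, padding=1) for _ in scales]
        )
        # Adaptive weight allocation: global average pooling + two fully connected
        # layers + sigmoid, yielding one weight per scale branch (assumed granularity)
        self.weight_fc = nn.Sequential(
            nn.Linear(channels, channels // 4),
            nn.ReLU(inplace=True),
            nn.Linear(channels // 4, len(scales)),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, h, w = x.shape
        branches = []
        for s, conv in zip(self.scales, self.convs):
            y = F.avg_pool2d(x, kernel_size=s)                    # downsample by ratio s
            y = conv(y)                                           # 3x3 conv at that scale
            y = F.interpolate(y, size=(h, w), mode="bilinear",
                              align_corners=False)                # restore resolution
            branches.append(y)
        gap = F.adaptive_avg_pool2d(x, 1).flatten(1)              # (B, C) global statistics
        scale_w = self.weight_fc(gap)                             # (B, num_scales) in (0, 1)
        fused = x
        for i, y in enumerate(branches):
            fused = fused + scale_w[:, i].view(b, 1, 1, 1) * y    # weighted fusion
        return fused
```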
As shown in
Figure 3, the channel attention module applies global average pooling and global maximum pooling, with the output of each pooling operation then being directed to its own dedicated fully connected layer. Following the summation of the outputs from the two fully connected layers, a sigmoid activation is employed to create the channel attention weights. The mathematical formulation is provided in Equation (1).
$M_c(F) = \sigma\big(\mathrm{MLP}(F^c_{avg}) + \mathrm{MLP}(F^c_{max})\big) = \sigma\big(W_1(W_0(F^c_{avg})) + W_1(W_0(F^c_{max}))\big)$  (1)

where $M_c(F)$ denotes the channel attention weights, $\sigma$ is the sigmoid activation function, $W_0 \in \mathbb{R}^{C/r \times C}$ and $W_1 \in \mathbb{R}^{C \times C/r}$ represent the weight matrices of the fully connected layers within the multilayer perceptron (MLP), and $r$ is the reduction ratio (controlling the number of parameters in the MLP).
These channel attention weights are multiplied with the corresponding multi-scale features. After the channel attention processing, the spatial attention branch takes the channel-weighted features and performs average and max pooling across the channel dimension. As illustrated in
Figure 4, the resulting pooled maps are then concatenated. The combined feature representation is subsequently processed with a 7 × 7 convolution, and a sigmoid activation is applied to produce the spatial attention weights, as mathematically expressed in Equation (2).
$M_s(F) = \sigma\big(f^{7 \times 7}([F^s_{avg}; F^s_{max}])\big)$  (2)

where $M_s(F)$ denotes the spatial attention weights, $F^s_{avg}$ and $F^s_{max}$ represent the intermediate feature maps (obtained by average and max pooling across the channel dimension) within the spatial attention module, and $f^{7 \times 7}$ signifies a convolution operation with a 7 × 7 kernel. To obtain the final output feature map, an element-wise multiplication is carried out, combining the spatial attention weights and the initial feature representation.
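The channel and spatial attention stages follow the CBAM formulation of Equations (1) and (2); a compact sketch is shown below. Whether the two pooled descriptors share one MLP is not fully specified in the text, so a shared two-layer MLP (the CBAM default) is assumed here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelSpatialAttention(nn.Module):
    """Sketch of the CBAM-style refinement used in MS_CBAM (Equations (1)-(2))."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        # Two-layer MLP with weights W0 and W1 (sharing across pooled paths is assumed)
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        # 7x7 convolution over the concatenated channel-pooled maps, Equation (2)
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        b, c, _, _ = x.shape
        # Channel attention: sigmoid(MLP(AvgPool(F)) + MLP(MaxPool(F)))
        avg = self.mlp(F.adaptive_avg_pool2d(x, 1).flatten(1))
        mx = self.mlp(F.adaptive_max_pool2d(x, 1).flatten(1))
        m_c = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        x = x * m_c
        # Spatial attention: sigmoid(conv7x7([avg_c(F); max_c(F)]))
        avg_map = x.mean(dim=1, keepdim=True)
        max_map = x.max(dim=1, keepdim=True).values
        m_s = torch.sigmoid(self.spatial_conv(torch.cat([avg_map, max_map], dim=1)))
        return x * m_s
```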
Through the integration of multi-scale feature extraction, adaptive weight allocation, and channel and spatial attention mechanisms, the MS_CBAM module effectively enhances salient feature representations and significantly improves the network’s ability to perceive information across various scales while suppressing irrelevant features. This provides a more discriminative feature representation for downstream depth completion tasks. Furthermore, the lightweight design of MS_CBAM ensures high efficiency despite the introduction of additional computational overhead, making it suitable for integration within large convolutional neural networks.
Then, we integrate the proposed MS_CBAM module into the PENet encoder architecture. The original PENet encoder uses strided convolution for downsampling in the BasicBlockGeo module and fuses features of different scales through skip connections. We replace the first 3 × 3 convolutional layer (when the stride is 2) in the BasicBlockGeo module with the MS_CBAM module so that it can extract multi-scale features and perform channel and spatial attention weighting while downsampling.
Specifically, we set the stride of the MS_CBAM module to 2, enabling it to perform downsampling operations. For the residual connection part, we keep the original 1 × 1 convolution and BatchNormalization operations to maintain the function of the residual connection and use an MS_CBAM module with a stride of 1 to process the residual path to further enhance multi-scale feature fusion. The integration method of the MS_CBAM module is shown in
Figure 5.
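Combining the two sketches above, the following fragment illustrates one plausible way to place MS_CBAM inside a PENet-style residual block. The geometric-encoding inputs of the original BasicBlockGeo and the exact channel arrangement are omitted as simplifying assumptions, and the placement of the stride-1 MS_CBAM on the residual path is one possible interpretation.

```python
import torch.nn as nn

class MSCBAM(nn.Module):
    """Placeholder composition: multi-scale extraction + channel/spatial attention,
    followed by a (possibly strided) 3x3 convolution for downsampling."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            MultiScaleExtraction(in_ch),            # sketch from Section 2.2
            ChannelSpatialAttention(in_ch),         # sketch from Section 2.2
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.body(x)

class MSCBAMBasicBlock(nn.Module):
    """Simplified residual block: MS_CBAM (stride 2) replaces the first strided
    3x3 convolution; the residual path keeps the 1x1 conv + BatchNorm and adds
    a stride-1 MS_CBAM, as described above."""
    def __init__(self, in_ch, out_ch, stride=2):
        super().__init__()
        self.mscbam_down = MSCBAM(in_ch, out_ch, stride=stride)
        self.conv2 = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        self.shortcut = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
            nn.BatchNorm2d(out_ch),
            MSCBAM(out_ch, out_ch, stride=1),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.conv2(self.mscbam_down(x))
        return self.relu(out + self.shortcut(x))
```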
We believe that integrating the MS_CBAM module into the PENet encoder can better capture contextual information at different scales and enhance important feature expressions, thereby enhancing the precision of the generated pseudo-lidar point cloud.
2.3. Consistency Weight
Discrepancies in depth accuracy and detail representation exist between depth maps generated from depth estimation and their corresponding RGB images, stemming from inherent differences in input modalities and data distributions. To mitigate this and enhance the quality of depth estimation after fusing these sources, we introduce “Consistency Weights”, a weight adjustment method based on local consistency differences. Consistency Weights dynamically adjust the contribution of features from the color-guided and depth-guided branches, promoting more adaptive feature fusion and thereby improving the quality of the final depth map. These weights are generated dynamically by comparing local regions, integrating both pixel value differences and gradient information from the depth map and RGB image. Since depth maps and RGB images have differing value ranges, a normalization step is first applied to the pixel values of the depth map,
D, and the RGB image,
I, mapping them to the range [0, 1], as shown in Equation (3).
$\hat{D} = \dfrac{D - D_{min}}{D_{max} - D_{min}}, \qquad \hat{I} = \dfrac{I - I_{min}}{I_{max} - I_{min}}$  (3)

where $\hat{D}$ represents the normalized depth map, $D_{min}$ is the minimum value, and $D_{max}$ is the maximum value within the depth map. $\hat{I}$ represents the normalized RGB image pixel values, with $I_{min}$ and $I_{max}$ being the minimum and maximum values, respectively.
Image gradient computation is commonly used to detect edges and regions of significant change. Furthermore, depth map pixel values, corresponding to the distance from the point to the camera center, can represent spatial variations via their gradient. Therefore, to obtain a richer feature representation, a 3 × 3 Sobel operator is applied to both the depth map and RGB image, as formulated in Equation (4).
$G^D_x = S_x * D, \quad G^D_y = S_y * D, \quad G^I_x = S_x * I, \quad G^I_y = S_y * I$  (4)

where $G^D_x$ and $G^D_y$ denote the horizontal and vertical gradients of the depth map D, respectively; $G^I_x$ and $G^I_y$ denote the corresponding gradients for the RGB image; $S_x$ and $S_y$ are the horizontal and vertical 3 × 3 Sobel kernels; and $*$ denotes convolution.
Following normalization and gradient computation, to obtain a more comprehensive representation, pixel value differences and gradient information from the depth map and RGB image are integrated and compared within local regions, as formulated in Equation (5).
After obtaining the local region comparison results, these values are used to assign the consistency weights W. This assignment is formulated in Equation (6).
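The computation of the consistency weights can be sketched as follows. Since the exact local-comparison and weight-assignment rules of Equations (5) and (6) are not reproduced here, an average-pooled combination of absolute pixel and gradient differences followed by a sigmoid is used as a stand-in; the window size, the sign convention, and the grayscale proxy for the RGB image are assumptions.

```python
import torch
import torch.nn.functional as F

def consistency_weights(depth, rgb, window=7, eps=1e-6):
    """Sketch of the consistency-weight computation (Equations (3)-(6)).
    depth: (B, 1, H, W); rgb: (B, 3, H, W); returns W with shape (B, 1, H, W)."""
    # Equation (3): min-max normalization of both modalities (global, for brevity)
    d = (depth - depth.min()) / (depth.max() - depth.min() + eps)
    i = (rgb - rgb.min()) / (rgb.max() - rgb.min() + eps)
    gray = i.mean(dim=1, keepdim=True)            # single-channel proxy for the RGB image

    # Equation (4): 3x3 Sobel gradients of both modalities
    sx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      device=depth.device).view(1, 1, 3, 3)
    sy = sx.transpose(2, 3).contiguous()
    gxd, gyd = F.conv2d(d, sx, padding=1), F.conv2d(d, sy, padding=1)
    gxi, gyi = F.conv2d(gray, sx, padding=1), F.conv2d(gray, sy, padding=1)

    # Equation (5), stand-in: local comparison of pixel and gradient differences
    pix_diff = F.avg_pool2d((d - gray).abs(), window, stride=1, padding=window // 2)
    grad_diff = F.avg_pool2d((gxd - gxi).abs() + (gyd - gyi).abs(),
                             window, stride=1, padding=window // 2)

    # Equation (6), stand-in: map the combined difference to a weight in (0, 1)
    return torch.sigmoid(-(pix_diff + grad_diff))
```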
In the proposed architecture, a consistency weighting mechanism is applied to the dual-branch network (the color-dominant branch and the depth-dominant branch) to dynamically adjust the contribution ratio during the feature fusion process. The final fused feature is calculated according to Equation (7):

$F_D = W \odot C_D + (1 - W) \odot D_D$  (7)

where $C_D$ and $D_D$ represent the features extracted from the color-guided and depth-guided branches, respectively, while $F_D$ denotes the fused feature. Through the consistency weight W, the model can adaptively adjust the feature ratio from both branches based on local region differences, thereby enhancing the flexibility and adaptability of feature fusion.
In the implementation, to ensure dimensional compatibility between the consistency weight and the network’s feature map, a broadcasting mechanism is employed to extend the weight tensor.
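Continuing the sketch above, the broadcast fusion of Equation (7) reduces to a few lines; here W is assumed to have shape (B, 1, H, W) and to match the spatial size of the branch features (otherwise it would be interpolated first).

```python
# Fuse the color-dominant and depth-dominant features with the consistency weight.
# C_D, D_D: (B, C, H, W) branch features; W broadcasts across the channel dimension.
W = consistency_weights(depth, rgb)      # (B, 1, H, W)
F_D = W * C_D + (1.0 - W) * D_D          # Equation (7), element-wise with broadcasting
```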
2.4. Integrated Dual-Branch Network Architecture
To further refine the depth maps, we introduce a dual-branch network, integrating the MS-CBAM and the consistency weighting mechanism into its architecture. This dual-branch network effectively processes and integrates complementary information from RGB images and initial depth maps. Furthermore, the MS-CBAM and consistency weighting mechanism enhance the network’s capacity to capture intricate details within the depth maps. Illustrated in
Figure 6, the dual-branch network architecture consists of a color-dominant branch and a depth-dominant branch. Within this architecture, multi-scale features (1) to (5) extracted from the color-dominant branch are fused with depth features in the depth-dominant branch.
With the RGB image and depth map D as input, the color-dominant branch employs a 5-level encoder-decoder architecture to extract multi-scale representations. It learns depth around object boundaries by capturing color features and structural information in the color image, generating a dense depth map C_D and a confidence map C_cd. Fed with the dense depth map C_D and the depth map D, the depth-dominant branch concatenates the color-dominant decoder features with its own encoder features, outputting the dense depth map D_D and confidence map D_cd. The confidence maps are generated by a convolutional neural network running in parallel with the backbone network, consisting of feature extraction layers and confidence prediction layers. By combining the dense depth maps, the confidence maps, and the consistency weight, the final dense depth map F_D is obtained through fusion, as described in Equation (8).
After obtaining the dense depth map F_D with size H×W, a pixel grid of the same size as F_D is generated, and the position of each pixel is computed. Depth points with a value of zero or invalid points generally lack valid depth information. These points tend to increase unnecessary computational load and introduce noise. Therefore, an effective depth mask valid_mask = (D > 0) is applied to the depth map and the pixel grid to remove invalid depth values and their corresponding pixels.
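A short sketch of the pixel grid construction and validity masking is given below; the variable names are illustrative, and a single image is assumed for clarity.

```python
import torch

# Build the H x W pixel grid for the fused dense depth map and keep only valid pixels.
depth_map = F_D.squeeze()                               # (H, W) dense depth (single image)
H, W = depth_map.shape
v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
valid_mask = depth_map > 0                              # drop zero/invalid depth values
u, v, z = u[valid_mask], v[valid_mask], depth_map[valid_mask]
```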
2.5. Pseudo-Point Cloud Generation
Our approach adheres to the technical framework of pseudo-lidar, necessitating the conversion of the fused depth map F_D into a 3D point cloud representation. Utilizing the camera’s intrinsic matrix, each pixel (u, v) in the image is mapped to 3D space using the following back-projection formula:

$z = F_D(u, v), \qquad x = \dfrac{(u - c_u)\, z}{f_u}, \qquad y = \dfrac{(v - c_v)\, z}{f_v}$

Here, $(c_u, c_v)$ represents the pixel location corresponding to the camera center, $f_v$ is the vertical focal length, and $f_u$ is the horizontal focal length. We obtain a three-dimensional point cloud $\{(x_i, y_i, z_i)\}_{i=1}^{N}$ by transforming pixel positions to their corresponding locations in 3D space through back-projection, where N signifies the aggregate pixel population. The resulting pseudo-lidar point cloud is then used for 3D object detection. After mapping the depth map to 3D space, an initial point cloud P_r is obtained, with each point P_i containing coordinate information $(x_i, y_i, z_i)$.
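Continuing the snippet at the end of Section 2.4, and assuming a pinhole intrinsic matrix K is available (e.g., from the KITTI calibration files), the back-projection can be written as:

```python
# Back-project the valid pixels (u, v) with depth z into camera coordinates.
f_u, f_v = K[0, 0], K[1, 1]           # horizontal and vertical focal lengths
c_u, c_v = K[0, 2], K[1, 2]           # pixel location of the camera center
x = (u.float() - c_u) * z / f_u
y = (v.float() - c_v) * z / f_v
P_r = torch.stack([x, y, z], dim=-1)  # (N, 3) initial pseudo-lidar point cloud
```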
Despite its cost-effectiveness for 3D object detection, pseudo-lidar generates relatively coarse point clouds. This lower point cloud quality leads to reduced object detection accuracy compared to using lidar-derived point clouds. To address this, the Pseudo-lidar++ [21] approach was developed. By refining the network used for depth prediction and optimizing its loss function, this method achieves better detection of distant objects. The system also incorporates a four-beam lidar sensor for pseudo-point cloud calibration. However, this method still relies on lidar, making it unsuitable for 3D detection with monocular cameras alone. We address this limitation by integrating an image segmentation network to segment objects, generating a semantic point cloud that is subsequently fused with the pseudo-point cloud to enhance its characteristics. In this work, we use Mask R-CNN [28] to process the RGB image; the segmentation result is shown in Figure 7.
To develop a more detailed semantic description for every point, semantic features from the RGB image are projected into the point cloud space. This projection is limited to the masks of vehicles, pedestrians, and bicycles. For each point P_i, its corresponding pixel coordinates (u_i, v_i) in the RGB image are computed using the inverse projection formula based on the depth map. The RGB-based semantic feature F_rgb(u_i, v_i) at that pixel location is then assigned to P_i. For the point cloud cluster projected from the 2D mask into 3D space, the minimum and maximum coordinates of all points are identified to generate an axis-aligned bounding box, characterized by its center and its dimensions (length, width, and height). A monocular 3D object detection model (MonoRun [30] in this work) is employed to predict the 3D bounding box parameters (center, size, and orientation) of the target. The next step involves projecting the eight corner vertices that define the predicted 3D bounding box onto the point cloud plane. As illustrated in Figure 8, the red box represents the detection from the monocular 3D object detection model, while the green box indicates the bounding box generated from the segmentation mask.
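The axis-aligned box obtained from a mask-projected cluster (the green box in Figure 8) can be computed directly from the cluster’s extreme coordinates. A short sketch is shown below, where cluster_pts is assumed to be the (M, 3) array of points belonging to one instance.

```python
import numpy as np

# Axis-aligned bounding box of one instance's point-cloud cluster.
mins, maxs = cluster_pts.min(axis=0), cluster_pts.max(axis=0)
center = (mins + maxs) / 2.0          # box center
dims = maxs - mins                    # box extents along x, y, z
```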
For each object instance, the intersection-over-union (IoU) between its segmentation area and the projected bounding box is calculated. Instances with an IoU below 0.5 are considered false detections. Specifically, for each 3D point P_i, its source pixel coordinates are evaluated to determine whether they fall within the mask region of a high-IoU target. If so, P_i is labeled as a foreground point; otherwise, it is treated as a non-foreground point, and all non-foreground points are discarded. The initial pseudo-point cloud P_r is fused with the semantic features to obtain the final semantically enhanced point cloud P_sem. The fusion process is expressed as follows:

$P_{sem} = \{\, (x_i, y_i, z_i, F_{rgb}(u_i, v_i)) \mid P_i \in P_r,\ P_i \text{ is a foreground point} \,\}$

where F_rgb(u_i, v_i) is the RGB semantic feature corresponding to point P_i.
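A sketch of the foreground selection and semantic fusion is given below; the variable names (pix_uv, feat_map, fg_mask) are illustrative, and appending the RGB semantic feature to each point’s coordinates is the assumed fusion rule.

```python
import numpy as np

# points: (N, 3) pseudo-lidar coordinates; pix_uv: (N, 2) integer source pixels;
# feat_map: (H, W, C) RGB semantic features; fg_mask: (H, W) union of high-IoU masks.
u_px, v_px = pix_uv[:, 0], pix_uv[:, 1]
is_fg = fg_mask[v_px, u_px]                              # foreground test per point
fg_points = points[is_fg]                                # discard non-foreground points
fg_feats = feat_map[v_px[is_fg], u_px[is_fg]]            # F_rgb(u_i, v_i)
P_sem = np.concatenate([fg_points, fg_feats], axis=1)    # semantically enhanced cloud
```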
2.6. Loss Function
The loss function proposed in this paper consists of two key components: the depth optimization network loss $L_{depth}$ and the gradient consistency regularization loss $L_{grad}$. The depth optimization network loss follows the loss function used by the PENet network. The model is trained using this loss, and supervision signals are applied during the intermediate stages of the depth optimization process, as expressed in Equation (11):

$L_{depth} = L(F_D) + \lambda_1 L(C_D) + \lambda_2 L(D_D)$  (11)

where $\lambda_1$ and $\lambda_2$ are hyperparameters set empirically. The complete loss function is given in Equation (12). Since the ground truth includes invalid pixels, only those pixels with valid depth values are considered:

$L(\hat{D}) = \dfrac{1}{|V|} \sum_{p \in V} \big(\hat{D}(p) - D_{gt}(p)\big)^2$  (12)

where $D_{gt}$ denotes the ground truth used for supervision and $V$ is the set of pixels with valid ground-truth depth. To ensure consistency between the depth map and RGB image, especially in the edge regions, we introduce a gradient consistency regularization loss term:
where $\nabla F_D$ and $\nabla I$ represent the gradients of the predicted depth map and the RGB image, respectively.
The final combined loss function combines the depth optimization loss $L_{depth}$ with the gradient consistency regularization loss $L_{grad}$.
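For reference, a hedged PyTorch-style sketch of the overall objective is given below. The masked mean-squared depth terms follow the PENet formulation, while the specific form of the gradient consistency regularizer and all weighting coefficients are assumptions.

```python
import torch
import torch.nn.functional as F

def total_loss(F_D, C_D, D_D, gt_depth, rgb, lam1=0.3, lam2=0.3, lam_grad=0.1):
    """Sketch: PENet-style masked depth losses with intermediate supervision plus a
    gradient consistency term. F_D, C_D, D_D, gt_depth: (B, 1, H, W); rgb: (B, 3, H, W)."""
    valid = gt_depth > 0

    def masked_mse(pred):                     # Equation (12): valid pixels only
        return F.mse_loss(pred[valid], gt_depth[valid])

    # Equation (11): supervise the fused map and both intermediate branch outputs
    l_depth = masked_mse(F_D) + lam1 * masked_mse(C_D) + lam2 * masked_mse(D_D)

    # Gradient consistency (stand-in): penalize depth/RGB gradient disagreement
    gray = rgb.mean(dim=1, keepdim=True)
    l_grad = ((F_D[..., :, 1:] - F_D[..., :, :-1]) -
              (gray[..., :, 1:] - gray[..., :, :-1])).abs().mean() + \
             ((F_D[..., 1:, :] - F_D[..., :-1, :]) -
              (gray[..., 1:, :] - gray[..., :-1, :])).abs().mean()

    return l_depth + lam_grad * l_grad
```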
5. Conclusions
In this paper, we propose MSFNet3D, an advanced monocular 3D detection framework designed to address the quality limitations of pseudo-lidar point clouds. Firstly, the proposed multi-scale channel spatial attention module (MS_CBAM) overcomes the scale-sensitivity limitations of traditional attention mechanisms. This module significantly enhances the depth estimation network’s ability to analyze complex scenes through a hierarchical feature pyramid structure and an adaptive weight allocation strategy. Secondly, the proposed dynamic consistency weight achieves optimized coupling of pixel-level image features and depth features through local gradient consistency analysis and differentiable weighting, which effectively mitigates the feature conflicts caused by fixed-weight fusion in dual-branch networks and demonstrates enhanced robustness in dense traffic scenarios. In addition, the semantic-guided point cloud optimization strategy enhances the semantic interpretability of the point cloud by fusing instance segmentation results with the pseudo-point cloud data, verifying the key role of semantic information in 3D reconstruction.
These advancements not only push the boundaries of monocular 3D perception but also present a cost-effective solution for electric vehicle manufacturers. The proposed method can operate with an 8-megapixel monocular camera, saving roughly 80% of the cost compared with an automotive lidar. Our approach aligns with the industry’s urgent demand for affordable yet reliable autonomous driving technologies.