Robust 3D Target Detection Based on LiDAR and Camera Fusion
Abstract
1. Introduction
2. Related Work
2.1. 3D Target Detection Technology Based on Pure Laser Radar
2.2. 3D Target Detection Technology Based on Multi-Sensor Fusion
2.3. 3D Target Detection Technology Based on Multimodal Continuous Frames
3. Multimodal Video Stream 3D Object Detection Based on Reliability Evaluation
3.1. Reliability Evaluation Module
3.2. Confidence Weighted Feature Fusion Module
Algorithm 1. Reliability Assessment Process (Reliability Evaluation Module pseudo-code).
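The body of Algorithm 1 is not reproduced above. As a minimal sketch only, assuming a simple pooling-plus-MLP scoring head (the name `ReliabilityHead`, the channel width, and the three modality keys are illustrative choices, not the paper's), a reliability evaluation step of this kind maps each modality's BEV feature map to a confidence score in [0, 1], which the confidence-weighted fusion of Section 3.2 can then use to scale each modality's contribution:

```python
import torch
import torch.nn as nn

class ReliabilityHead(nn.Module):
    """Hypothetical sketch: map one modality's BEV feature map to a scalar confidence in [0, 1]."""
    def __init__(self, channels: int):
        super().__init__()
        self.score = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),            # (B, C, H, W) -> (B, C, 1, 1)
            nn.Flatten(),                        # -> (B, C)
            nn.Linear(channels, channels // 4),
            nn.ReLU(inplace=True),
            nn.Linear(channels // 4, 1),
            nn.Sigmoid(),                        # confidence score in [0, 1]
        )

    def forward(self, bev_feat: torch.Tensor) -> torch.Tensor:
        return self.score(bev_feat)              # (B, 1)

# One head per modality: LiDAR BEV, camera BEV, and the pure-visual BEV branch.
heads = nn.ModuleDict({m: ReliabilityHead(128) for m in ["lidar", "cam_bev", "vis_bev"]})
feats = {m: torch.randn(2, 128, 180, 180) for m in heads}        # dummy BEV feature maps
conf = {m: heads[m](f) for m, f in feats.items()}                # per-modality confidence scores

# Confidence-weighted fusion (illustrative): each modality contributes in
# proportion to its predicted reliability.
weights = torch.softmax(torch.cat(list(conf.values()), dim=1), dim=1)   # (B, 3)
fused = sum(weights[:, i].view(-1, 1, 1, 1) * f for i, f in enumerate(feats.values()))
```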
1. Point cloud features guide the cross-attention that enhances the image features, in the standard scaled dot-product form:
$$\mathbf{F}_I' = \mathrm{Softmax}\!\left(\frac{Q_I K_P^{\top}}{\sqrt{d}}\right) V_P,$$
where the query $Q_I$ is projected from the image BEV features and the key and value $(K_P, V_P)$ are projected from the point cloud BEV features.
2. Image features guide the cross-attention that enhances the point cloud features, with the roles reversed:
$$\mathbf{F}_P' = \mathrm{Softmax}\!\left(\frac{Q_P K_I^{\top}}{\sqrt{d}}\right) V_I,$$
where $Q_P$ comes from the point cloud BEV features and $(K_I, V_I)$ from the image BEV features. A code sketch of this bidirectional cross-attention is given after this list.
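As a minimal sketch of the two guidance directions above, assuming standard multi-head scaled dot-product cross-attention over flattened BEV grids (token counts, channel width, and the residual connections are illustrative assumptions rather than the paper's exact configuration):

```python
import torch
import torch.nn as nn

d = 128                                             # BEV feature channels (illustrative)
img_tokens = torch.randn(2, 400, d)                 # flattened image BEV grid: (B, H*W, C)
pts_tokens = torch.randn(2, 400, d)                 # flattened point cloud BEV grid

p2i_attn = nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)
i2p_attn = nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)

# (1) Point cloud features guide image feature enhancement:
#     image tokens act as queries, point cloud tokens provide keys and values.
img_enh, _ = p2i_attn(query=img_tokens, key=pts_tokens, value=pts_tokens)

# (2) Image features guide point cloud feature enhancement: the roles are swapped.
pts_enh, _ = i2p_attn(query=pts_tokens, key=img_tokens, value=img_tokens)

# Residual connections keep each modality's original information (a common design choice).
img_enh = img_enh + img_tokens
pts_enh = pts_enh + pts_tokens
```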
3.3. Cross-Frame Optimization Module Based on Target Temporal Consistency
1. Historical feature caching mechanism. In order to achieve cross-frame target-level semantic modeling, the model needs to perceive the target's state in historical frames. We use the RoI feature of target k in the BEV space as its state representation. Taking the current frame as t and defining a sliding time window of length T, the historical memory bank of the target at the current time can be expressed as
$$\mathcal{B}_k^{t} = \left\{ \mathbf{f}_k^{t-T+1}, \ldots, \mathbf{f}_k^{t-1}, \mathbf{f}_k^{t} \right\},$$
where $\mathbf{f}_k^{t}$ denotes the feature of target k at time t. The entire feature sequence is arranged in chronological order, forming a continuous historical state trajectory of target k up to time t. A code sketch of this caching scheme is given after this list.
2. Target temporal feature alignment. To enhance the feature expression ability of the current-frame target, this section introduces spatial and temporal attention mechanisms to fuse the target's historical feature sequence. The process consists of two stages: intra-frame spatial self-attention and cross-frame temporal cross-attention. First, $\mathbf{f}_k^{t}$ is reshaped into a two-dimensional feature sequence $\mathbf{F}_k^{t} \in \mathbb{R}^{N \times d}$, and spatial self-attention is computed on this sequence:
$$\tilde{\mathbf{F}}_k^{t} = \mathrm{Softmax}\!\left(\frac{Q_k^{t} \left(K_k^{t}\right)^{\top}}{\sqrt{d}}\right) V_k^{t},$$
where $Q_k^{t}$, $K_k^{t}$, and $V_k^{t}$ are linear projections of $\mathbf{F}_k^{t}$. After obtaining the spatial enhancement feature $\tilde{\mathbf{F}}_k^{t}$, the model further needs to model a consistent representation between the current frame and the historical frames. For the current frame t, its spatial enhancement feature $\tilde{\mathbf{F}}_k^{t}$ serves as the query, and the enhancement features of the historical frames $\{\tilde{\mathbf{F}}_k^{t-T+1}, \ldots, \tilde{\mathbf{F}}_k^{t-1}\}$ serve as keys and values for temporal cross-attention fusion:
$$\hat{\mathbf{F}}_k^{t} = \mathrm{Softmax}\!\left(\frac{Q\!\left(\tilde{\mathbf{F}}_k^{t}\right) K\!\left(\tilde{\mathbf{F}}_k^{\mathrm{his}}\right)^{\top}}{\sqrt{d}}\right) V\!\left(\tilde{\mathbf{F}}_k^{\mathrm{his}}\right),$$
where $\tilde{\mathbf{F}}_k^{\mathrm{his}}$ denotes the concatenation of the historical enhancement features. A code sketch of both attention stages is given after this list.
3. Consistency optimization based on a feature metric. Although the target temporal feature alignment mechanism enhances the semantic representation of the current-frame target, feature-level enhancement alone cannot guarantee that the predictions of the same target are structurally consistent across frames. Because each frame is detected independently, the model is prone to position fluctuations and pose jumps for moving targets, partial occlusion, or illumination changes, which degrades the plausibility of the detection results and the stability of downstream tasks. To this end, we introduce target-level structure optimization and adjust the temporal prediction results in reverse based on the target's cross-frame feature consistency. The idea is similar to traditional Bundle Adjustment [27]: if the semantically key regions of a target correspond across frames, their spatial projection positions should agree; if they do not, the pose or position estimate contains an error.

Based on this idea, we construct a feature-metric consistency mechanism built on cross-frame semantic correspondence to constrain the geometric consistency of the target's multi-frame predictions. Unlike traditional bundle adjustment, this mechanism does not rely on explicit geometric projection or 3D point reconstruction; instead, it constructs cross-frame semantic consistency constraints in the feature-metric space. As shown in Figure 5, by establishing feature-level correspondences of the target's key regions across the multi-frame views, a semantic-point soft matching graph is formed, and the target pose parameters are optimized so that the same semantic region is aligned at its projected position in every frame. The mechanism imposes structural consistency constraints directly in feature space, is differentiable, supports end-to-end training, and compensates for the structural weakness of single-frame detectors in temporal modeling.

Specifically, for any two frames t and t′, we first compute the feature similarity between the feature points of target k in the two frames:
$$s_{ij}^{t,t'} = \left\langle \mathbf{f}_{k,i}^{t},\, \mathbf{f}_{k,j}^{t'} \right\rangle,$$
where $\mathbf{f}_{k,i}^{t}$ denotes the i-th local feature of $\hat{\mathbf{F}}_k^{t}$ and $\langle \cdot, \cdot \rangle$ denotes the inner product. All positions are then normalized over the spatial dimension to obtain the cross-frame soft matching probability:
$$P_{ij}^{t,t'} = \frac{\exp\!\left(s_{ij}^{t,t'}\right)}{\sum_{j'} \exp\!\left(s_{ij'}^{t,t'}\right)},$$
where $P_{ij}^{t,t'}$ measures how well the i-th feature point in frame t matches the j-th point in frame t′ in feature space.

We expect the learned feature representation to satisfy structural consistency; that is, semantically matched points should align in three-dimensional space after the pose transformation of their respective frames. To this end, we introduce Featuremetric Object Bundle Adjustment [28], which extends structural consistency modeling from the traditional geometric space to the feature space, so that the alignment of the target structure is constrained indirectly by the semantic similarity between features. The reprojection error can be expressed as
$$E_{\mathrm{rep}}^{k} = \sum_{t,t'} \sum_{(i,j)} \left\| \mathbf{T}_{k}^{t}\, \mathbf{p}_{k,i}^{t} - \mathbf{T}_{k}^{t'}\, \mathbf{p}_{k,j}^{t'} \right\|_2^2, \qquad (18)$$
where $\mathbf{T}_{k}^{t}$ is the estimated pose transformation of target k in frame t and $\mathbf{p}_{k,i}^{t}$ is the position of its i-th semantic point. Here, we replace the L2 loss of Equation (18) with the semantic-point feature soft matching graph as the measure of structural consistency, take the logarithm of the matching score, and add a negative sign to obtain the maximum-likelihood loss
$$\mathcal{L}_{\mathrm{fc}}^{k} = -\sum_{t,t'} \sum_{(i,j)} \log P_{ij}^{t,t'},$$
where the sum runs over the matched semantic point pairs of target k within the time window. Compared with directly computing the Euclidean distance between cross-frame semantic points, the feature soft matching graph is derived from the feature distribution itself, which avoids the dependence on explicit geometric correspondence and yields better training stability. A structural consistency loss is constructed for every trackable target k in the time window, and the overall training objective of the model becomes
$$\mathcal{L} = \mathcal{L}_{\mathrm{det}} + \lambda \sum_{k} \mathcal{L}_{\mathrm{fc}}^{k},$$
where $\lambda$ is the loss weight coefficient. A code sketch of the soft matching loss is given after this list.
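For item 1, a minimal sketch of the historical feature caching mechanism, assuming one fixed-length sliding window per tracked target ID (the class name `TargetMemoryBank`, the window length T = 4, and the RoI feature shape are illustrative choices):

```python
from collections import defaultdict, deque

import torch

class TargetMemoryBank:
    """Cache the last T RoI features of each tracked target in BEV space (hypothetical sketch)."""

    def __init__(self, window: int = 4):
        self.bank = defaultdict(lambda: deque(maxlen=window))   # target id -> chronological features

    def update(self, target_id: int, roi_feat: torch.Tensor) -> None:
        # roi_feat: (N, C) RoI feature f_k^t of target k at the current frame t.
        self.bank[target_id].append(roi_feat.detach())

    def history(self, target_id: int) -> torch.Tensor:
        # Returns f_k^{t-T+1}, ..., f_k^t stacked along the time axis, oldest first.
        return torch.stack(list(self.bank[target_id]), dim=0)

bank = TargetMemoryBank(window=4)
for t in range(6):                                   # old frames fall out of the sliding window
    bank.update(target_id=7, roi_feat=torch.randn(49, 128))
print(bank.history(7).shape)                         # torch.Size([4, 49, 128])
```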
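For item 2, a minimal sketch of the two-stage temporal alignment, using PyTorch's built-in multi-head attention in place of the single-head formulas above (window length, token count, and channel width are illustrative):

```python
import torch
import torch.nn as nn

d, T, N = 128, 4, 49               # channels, window length, RoI tokens per frame (illustrative)
history = torch.randn(T, N, d)     # cached features f_k^{t-T+1} ... f_k^t, each as an N-token sequence

spatial_attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
temporal_attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)

# Stage 1: intra-frame spatial self-attention, applied to every frame in the window
# (frames are treated as the batch dimension).
spat_enh, _ = spatial_attn(history, history, history)          # (T, N, d)

# Stage 2: cross-frame temporal cross-attention. The current frame's enhanced
# feature is the query; the historical frames provide keys and values.
query = spat_enh[-1:]                                          # frame t:     (1, N, d)
kv = spat_enh[:-1].reshape(1, (T - 1) * N, d)                  # frames < t:  (1, (T-1)*N, d)
temp_enh, _ = temporal_attn(query, kv, kv)                     # (1, N, d)
```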
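For item 3, a minimal sketch of the feature-metric consistency loss, assuming L2-normalized local features and, purely for illustration, that the i-th semantic point in frame t should match the i-th point in frame t′:

```python
import torch
import torch.nn.functional as F

def featuremetric_consistency_loss(feat_t: torch.Tensor, feat_tp: torch.Tensor) -> torch.Tensor:
    """Sketch of the cross-frame soft-matching loss for one target and one frame pair.

    feat_t, feat_tp: (N, d) local features of target k in frames t and t',
    assumed L2-normalized so the inner product acts as a cosine similarity.
    """
    sim = feat_t @ feat_tp.T                 # s_ij = <f_i^t, f_j^t'>,  shape (N, N)
    match = F.softmax(sim, dim=1)            # soft matching probability P_ij over frame t'
    # Maximum-likelihood form: negative log of the matching score. For simplicity we
    # assume point i in frame t corresponds to point i in frame t' (diagonal matches);
    # the soft matching graph itself does not require this one-to-one assumption.
    return -torch.log(match.diagonal() + 1e-8).mean()

feat_t = F.normalize(torch.randn(49, 128), dim=1)
feat_tp = F.normalize(torch.randn(49, 128), dim=1)
l_fc = featuremetric_consistency_loss(feat_t, feat_tp)

# Overall objective (schematic): detection loss plus the weighted consistency term,
# summed over all trackable targets k in the time window.
# loss = loss_det + lam * sum(l_fc_k for k in targets)
```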
4. Experimental Verification
4.1. Experiment Setting
4.2. Detection Index Results and Visual Analysis
4.3. Ablation Experiment
5. Conclusions
- The Reliability Evaluation Module dynamically perceives the reliability of each modal feature and outputs confidence scores. It not only leverages the complementarity of point cloud and image BEV (Bird’s-Eye View) features but also introduces pure visual BEV features to address extreme scenarios where point cloud quality degrades severely, ensuring the system’s robustness under different environmental conditions.
- The Confidence Weighted Feature Fusion Module adaptively adjusts the contribution ratio of each feature in the fusion process based on the aforementioned confidence scores. This enables the fused features to more accurately reflect the target’s geometric structure and semantic information, improving detection robustness and accuracy.
- The Cross-Frame Optimization Module Based on Target Temporal Consistency explicitly models the correspondence between the internal semantic features of targets in consecutive frames, achieving cross-time-scale target structure consistency constraints and collaborative optimization. This extends single-frame perception to temporal memory and cross-frame structure inference, effectively enhancing the accuracy and stability of 3D target prediction to adapt to dynamic and complex scenarios.
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
1. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 652–660.
2. Shi, S.; Wang, X.; Li, H. Pointrcnn: 3d object proposal generation and detection from point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 770–779.
3. Yang, J.; Song, L.; Liu, S.; Mao, W.; Li, Z.; Li, X.; Sun, H.; Sun, J.; Zheng, N. Dbq-ssd: Dynamic ball query for efficient 3d object detection. arXiv 2022, arXiv:2207.10909.
4. Wang, D.Z.; Posner, I. Voting for voting in online point cloud object detection. In Robotics: Science and Systems; Springer Proceedings in Advanced Robotics; Springer: Rome, Italy, 2015; Volume 1, pp. 10–15.
5. Yan, Y.; Mao, Y.; Li, B. Second: Sparsely embedded convolutional detection. Sensors 2018, 18, 3337.
6. Du, L.; Ye, X.; Tan, X.; Feng, J.; Xu, Z.; Ding, E.; Wen, S. Associate-3ddet: Perceptual-to-conceptual association for 3d point cloud object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 13329–13338.
7. Lang, A.H.; Vora, S.; Caesar, H.; Zhou, L.; Yang, J.; Beijbom, O. Pointpillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12697–12705.
8. Wang, Y.; Fathi, A.; Kundu, A.; Ross, D.A.; Pantofaru, C.; Funkhouser, T.; Solomon, J. Pillar-based object detection for autonomous driving. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Part XXII 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 18–34.
9. Zhang, J.; Liu, H.; Lu, J. A semi-supervised 3d object detection method for autonomous driving. Displays 2022, 71, 102117.
10. Qi, C.R.; Liu, W.; Wu, C.; Su, H.; Guibas, L.J. Frustum pointnets for 3d object detection from rgb-d data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 918–927.
11. Vora, S.; Lang, A.H.; Helou, B.; Beijbom, O. Pointpainting: Sequential fusion for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 4604–4612.
12. Yin, T.; Zhou, X.; Krähenbühl, P. Multimodal virtual point 3d detection. Adv. Neural Inf. Process. Syst. 2021, 34, 16494–16507.
13. Chen, X.; Ma, H.; Wan, J.; Li, B.; Xia, T. Multi-view 3d object detection network for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1907–1915.
14. Liang, M.; Yang, B.; Wang, S.; Urtasun, R. Deep continuous fusion for multi-sensor 3d object detection. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 641–656.
15. Bai, X.; Hu, Z.; Zhu, X.; Huang, Q.; Chen, Y.; Fu, H.; Tai, C. Transfusion: Robust lidar-camera fusion for 3d object detection with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1090–1099.
16. Pang, S.; Morris, D.; Radha, H. Clocs: Camera-lidar object candidates fusion for 3d object detection. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 25–29 October 2020; pp. 10386–10393.
17. Pang, S.; Morris, D.; Radha, H. Fast-clocs: Fast camera-lidar object candidates fusion for 3d object detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 187–196.
18. Gong, R.; Fan, X.; Cai, D.; Lu, Y. Sec-clocs: Multimodal back-end fusion-based object detection algorithm in snowy scenes. Sensors 2024, 24, 7401.
19. Koh, J.; Lee, J.; Lee, Y.; Kim, J.; Choi, J.W. Mgtanet: Encoding sequential lidar points using long short-term motion-guided temporal attention for 3d object detection. In Proceedings of the 37th AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 1179–1187.
20. Yin, J.; Shen, J.; Gao, X.; Crandall, D.J.; Yang, R. Graph neural network and spatiotemporal transformer attention for 3d video object detection from point clouds. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 45, 9822–9835.
21. Xu, J.; Miao, Z.; Zhang, D.; Pan, H.; Liu, K.; Hao, P.; Zhu, J.; Sun, Z.; Li, H.; Zhan, X. Int: Towards infinite-frames 3d detection with an efficient framework. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2022; pp. 193–209.
22. Chen, X.; Shi, S.; Zhu, B.; Cheung, K.C.; Xu, H.; Li, H. Mppnet: Multi-frame feature intertwining with proxy points for 3d temporal object detection. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2022; pp. 680–697.
23. He, C.; Li, R.; Zhang, Y.; Li, S.; Zhang, L. Msf: Motion-guided sequential fusion for efficient 3d object detection from point cloud sequences. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 5196–5205.
24. Hou, J.; Liu, Z.; Zou, Z.; Ye, X.; Bai, X. Query-based temporal fusion with explicit motion for 3d object detection. Adv. Neural Inf. Process. Syst. 2023, 36, 75782–75797.
25. Yuan, Y.; Sester, M. Streamlts: Query-based temporal-spatial lidar fusion for cooperative object detection. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2025; pp. 34–51.
26. Van Geerenstein, M.R.; Ruppel, F.; Dietmayer, K.; Gavrila, D.M. Multimodal object query initialization for 3d object detection. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; pp. 12484–12491.
27. Triggs, B.; McLauchlan, P.F.; Hartley, R.I.; Fitzgibbon, A.W. Bundle adjustment—A modern synthesis. In International Workshop on Vision Algorithms; Springer: Berlin/Heidelberg, Germany, 1999; pp. 298–372.
28. Lindenberger, P.; Sarlin, P.; Larsson, V.; Pollefeys, M. Pixel-perfect structure-from-motion with featuremetric refinement. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 5987–5997.
29. Caesar, H.; Bankiti, V.; Lang, A.H.; Vora, S.; Liong, V.E.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; Beijbom, O. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 11621–11631.
30. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
31. Li, Z.; Wang, W.; Li, H.; Xie, E.; Sima, C.; Lu, T.; Yu, Q.; Dai, J. Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. arXiv 2022, arXiv:2203.17270.
32. Huang, J.; Huang, G. Bevdet4d: Exploit temporal cues in multi-camera 3d object detection. arXiv 2022, arXiv:2203.17054.
33. Wang, Y.; Guizilini, V.C.; Zhang, T.; Wang, Y.; Zhao, H.; Solomon, J. Detr3d: 3d object detection from multi-view images via 3d-to-2d queries. In Proceedings of the Conference on Robot Learning, London, UK, 8–11 November 2021; PMLR: Auckland, New Zealand, 2022; pp. 180–191.
34. Yoo, J.H.; Kim, Y.; Kim, J.; Choi, J.W. 3d-cvf: Generating joint camera and lidar features using cross-view spatial feature fusion for 3d object detection. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Part XXVII 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 720–736.
35. Zhu, B.; Jiang, Z.; Zhou, X.; Li, Z.; Yu, G. Class-balanced grouping and sampling for point cloud 3d object detection. arXiv 2019, arXiv:1908.09492.
36. Sheng, H.; Cai, S.; Liu, Y.; Deng, B.; Huang, J.; Hua, X.; Zhao, M. Improving 3d object detection with channel-wise transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 2743–2752.
37. Yin, T.; Zhou, X.; Krahenbuhl, P. Center-based 3d object detection and tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 11784–11793.
38. Chen, X.; Zhang, T.; Wang, Y.; Wang, Y.; Zhao, H. Futr3d: A unified sensor fusion framework for 3d detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 172–181.







| Name | Configuration |
|---|---|
| CPU | 11th Gen Intel(R) Core(TM) i7-11700K |
| GPU | NVIDIA GeForce RTX 3090 |
| Memory | 24 GB |
| Storage | 30 GB |
| Operating System | Ubuntu 20.04 |
| CUDA | 11.2 |
| Deep Learning Framework | PyTorch 1.10.0 |
| Model | mAP (%) | NDS (%) | Vehicle (%) | Truck (%) | Construction Vehicle (%) | Bus (%) | Trailer (%) | Obstruction (%) | Motorcycle (%) | Cyclist (%) | Pedestrian (%) | Traffic Cone (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PointPillar [7] | 40.1 | 55.0 | 76.0 | 31.0 | 11.3 | 32.1 | 36.6 | 56.4 | 34.2 | 14.0 | 64.0 | 45.6 |
| Pointpainting [11] | 46.4 | 58.1 | 77.9 | 35.8 | 15.8 | 36.2 | 37.3 | 60.2 | 41.5 | 24.1 | 73.3 | 62.4 |
| BEVFormer [31] | 48.1 | 56.9 | 67.7 | 39.2 | 22.9 | 35.7 | 39.6 | 62.5 | 47.9 | 40.7 | 54.4 | 70.3 |
| BEVDet [32] | 42.4 | 48.8 | 64.3 | 35.0 | 16.2 | 35.8 | 35.4 | 61.4 | 44.8 | 29.6 | 41.1 | 60.1 |
| DETR3D [33] | 41.2 | 47.9 | 60.3 | 33.3 | 17.0 | 29.0 | 35.8 | 56.5 | 41.3 | 30.8 | 45.5 | 62.7 |
| 3D-CVF [34] | 52.7 | 62.3 | 83.0 | 45.0 | 15.9 | 48.8 | 49.6 | 65.9 | 51.2 | 30.4 | 74.2 | 62.9 |
| CBGS [35] | 52.8 | 63.3 | 81.1 | 48.5 | 10.5 | 54.9 | 42.9 | 65.7 | 51.5 | 22.3 | 80.1 | 70.9 |
| CT3D [36] | 59.5 | 65.3 | 84.8 | 53.7 | 19.4 | 64.2 | 55.4 | 72.1 | 59.5 | 24.8 | 83.4 | 77.9 |
| CenterPoint [37] | 60.3 | 67.3 | 85.2 | 53.5 | 20.0 | 63.6 | 56.0 | 71.1 | 59.5 | 30.7 | 84.6 | 78.4 |
| SECOND [5] | 62.0 | 50.9 | 81.6 | 51.9 | 18.0 | 68.5 | 38.2 | 57.8 | 40.1 | 18.1 | 77.4 | 56.9 |
| FUTR3D [38] | 64.2 | 68.0 | 86.3 | 61.5 | 26.0 | 71.9 | 42.1 | 64.4 | 73.6 | 63.3 | 82.6 | 70.1 |
| MVP [12] | 66.4 | 70.5 | 86.8 | 58.5 | 26.1 | 67.4 | 57.3 | 74.8 | 70.0 | 49.3 | 87.9 | 83.6 |
| Ours | 67.3 | 70.6 | 88.1 | 61.8 | 29.4 | 72.7 | 59.7 | 80.1 | 68.3 | 47.2 | 89.2 | 86.3 |
| Model | mAP (%) | NDS (%) | Vehicle (%) | Pedestrian (%) | Cyclist (%) |
|---|---|---|---|---|---|
| Ours-L | 62.1 | 68.6 | 85.4 | 86.4 | 27.1 |
| Ours | 67.3 ↑ 5.2 | 70.6 ↑ 2.0 | 88.1 ↑ 2.7 | 89.2 ↑ 2.8 | 47.2 ↑ 20.1 |
| Model | Confidence Weighted Feature Fusion Module | Cross-Frame Optimization Module Based on Target Temporal Consistency | mAP (%) | NDS (%) | Vehicle (%) | Pedestrian (%) | Cyclist (%) |
|---|---|---|---|---|---|---|---|
| Ours-L | | | 62.1 | 68.6 | 85.4 | 86.4 | 27.1 |
| Ours | ✓ | | 64.8 | 69.2 | 86.7 | 87.7 | 36.4 |
| Ours | ✓ | ✓ | 67.3 | 70.6 | 88.1 | 89.2 | 47.2 |
| Model | Input Modality (L = LiDAR, C = Camera) | Ideal Condition (mAP/NDS, %) | Front View Missing (mAP/NDS, %) | Only Front View Remains (mAP/NDS, %) | 50% Lens Occlusion (mAP/NDS, %) |
|---|---|---|---|---|---|
| DETR3D | C | 41.2/47.9 | 30.6/41.8 | 8.8/26.3 | 18.0/32.5 |
| 3D-CVF | L + C | 52.7/62.3 | 47.3/58.1 | 36.4/49.9 | 45.2/57.2 |
| FUTR3D | L + C | 64.2/68.0 | 53.7/59.8 | 51.2/52.6 | 50.3/58.5 |
| MVP | L + C | 66.4/70.5 | 61.2/67.7 | 58.8/63.9 | 60.6/66.4 |
| Ours | L + C | 67.3/70.6 | 63.4/70.1 | 62.4/69.3 | 61.8/70.0 |
