Research on Multi-Source Heterogeneous Collaborative Perception System Based on Unmanned Aerial Vehicle and Unmanned Ground Vehicle
Highlights
- This study proposes a dynamic SLAM method for UAVs that integrates an improved YOLOv8 with ORB-SLAM3 to build a high-precision, reliable prior map. For the environmental perception of unmanned ground vehicles (UGVs), it develops an end-to-end multimodal spatiotemporal joint optimization framework based on Transformer and BEV models.
- A cross-platform feature fusion method for 3D object perception and an incremental map update algorithm are proposed in this study. By employing a fusion algorithm based on a cross-attention mechanism, aerial top-view features and ground fused BEV features are integrated across platforms, thereby enabling the construction of high-precision maps.
- The proposed UAV-based dynamic SLAM algorithm provides prior map information for unmanned ground vehicles, supporting their navigation and collaborative perception tasks, while the ground perception model fuses multi-sensor data in space and time. Together these improve object detection accuracy and system robustness in complex urban scenarios (71.8% mAP and 74.3% NDS in the 3D detection comparison).
- Unmanned aerial and ground vehicles can achieve cross-platform collaborative perception, ranging from target-level alignment to semantic-level fusion. This enables real-time and precise perception of complex dynamic environments.
Abstract
1. Introduction
- (1) UAV dynamic SLAM with precise dynamic feature elimination. Unlike prior dynamic SLAM methods that simply remove all feature points inside object bounding boxes, our approach integrates an improved YOLOv8 (with BiFPN and a small-target detection layer) with ORB-SLAM3 and MobileSAM instance segmentation. Dynamic feature points are removed only after geometric verification, while static background points inside the bounding boxes are preserved. This yields a high-precision, reliable prior map for UGVs.
- (2) UAV-UGV BEV feature fusion. A cross-attention mechanism is designed to fuse the UAV’s overhead BEV features with the UGV’s ground BEV features. The UAV’s overhead features are first transformed into the UGV’s BEV coordinate frame via a calibrated spatial mapping that accounts for GPS/IMU uncertainty and scale ambiguity. To our knowledge, this is the first end-to-end learnable fusion of aerial and ground BEV features for urban collaborative perception.
- (3) Incremental map updating with dynamic Bayesian probability and D-S evidence theory. A semantic-assisted conflict resolution strategy is introduced: when the conflict coefficient K exceeds a learned threshold (0.3), semantic context is used to resolve contradictions before the Dempster–Shafer combination rule is applied. This provides robust map updates under occlusion and dynamic changes, a capability absent in conventional Bayesian filters.
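
The conflict check in contribution (3) can be illustrated with a minimal sketch of Dempster's rule for a single map cell. The conflict coefficient K and the 0.3 threshold come from the text; everything else is an assumption made for illustration: the two-hypothesis frame (occupied or free), the function names, the example masses, and in particular the use of Shafer discounting with hypothetical reliability weights as a stand-in for the semantic conflict resolution, whose actual form is not given in this section.

```python
from itertools import product

OCC, FREE = frozenset({"occ"}), frozenset({"free"})
FRAME = OCC | FREE  # full frame of discernment, i.e. "unknown"

def conflict(m1, m2):
    """Conflict coefficient K: total mass assigned to pairs of disjoint hypotheses."""
    return sum(w1 * w2
               for (h1, w1), (h2, w2) in product(m1.items(), m2.items())
               if not (h1 & h2))

def combine(m1, m2):
    """Dempster's rule: multiply masses, intersect hypotheses, renormalize by 1 - K."""
    k = conflict(m1, m2)
    if k >= 1.0:
        raise ValueError("sources are in total conflict")
    fused = {}
    for (h1, w1), (h2, w2) in product(m1.items(), m2.items()):
        common = h1 & h2
        if common:
            fused[common] = fused.get(common, 0.0) + w1 * w2 / (1.0 - k)
    return fused

def discount(m, reliability):
    """Shafer discounting (assumed stand-in): keep `reliability` of each mass,
    move the remainder to FRAME."""
    out = {h: reliability * w for h, w in m.items() if h != FRAME}
    out[FRAME] = 1.0 - reliability + reliability * m.get(FRAME, 0.0)
    return out

def fuse_cell(m_uav, m_ugv, r_uav, r_ugv, k_threshold=0.3):
    """Soften both sources first when K exceeds the threshold, then combine."""
    if conflict(m_uav, m_ugv) > k_threshold:
        m_uav, m_ugv = discount(m_uav, r_uav), discount(m_ugv, r_ugv)
    return combine(m_uav, m_ugv)

# Hypothetical masses: the UAV prior says "occupied", the UGV observation says "free".
m_uav = {OCC: 0.7, FREE: 0.2, FRAME: 0.1}
m_ugv = {OCC: 0.1, FREE: 0.8, FRAME: 0.1}
k = conflict(m_uav, m_ugv)                              # 0.7*0.8 + 0.2*0.1
fused = fuse_cell(m_uav, m_ugv, r_uav=0.5, r_ugv=1.0)   # trust the UGV more here
```

In this example K is well above 0.3, so the less trusted source is weakened before combination and the fused cell leans toward the UGV's "free" reading instead of being dominated by the renormalization of a highly conflicting product.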
2. Related Work
2.1. Visual SLAM for UAVs
2.2. Multimodal End-to-End Perception for Unmanned Ground Vehicles
2.3. UAV-UGV Collaborative Perception
3. Materials and Methods
3.1. System Overall Architecture
3.2. UAV Dynamic SLAM
3.2.1. Small Target Detection and Segmentation Optimization Based on YOLOv8
- (1) Improved YOLOv8
- (2) MobileSAM Instance Segmentation Module
3.2.2. Fusion of Improved YOLOv8 with ORB-SLAM3
- (1) Dynamic Authenticity Judgment of Objects
- (2) Fusion and Extension of ORB-SLAM3
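
The idea of removing a feature point only when it is both on a segmented object and geometrically inconsistent can be sketched as below. This is an illustration under stated assumptions, not the paper's procedure: the geometric verification is assumed to be an epipolar-distance test, the fundamental matrix `F` is assumed to be known, `mask` stands for an instance mask such as MobileSAM might produce, and all function names and the pixel threshold are hypothetical.

```python
import numpy as np

def epipolar_distance(F, p_prev, p_curr):
    """Pixel distance from each current point to the epipolar line of its previous match."""
    ones = np.ones((len(p_prev), 1))
    x1 = np.hstack([p_prev, ones])            # (N, 3) homogeneous points, previous frame
    x2 = np.hstack([p_curr, ones])            # (N, 3) homogeneous points, current frame
    lines = x1 @ F.T                          # row i is the line F @ x1_i
    num = np.abs(np.sum(lines * x2, axis=1))
    den = np.hypot(lines[:, 0], lines[:, 1])
    return num / np.maximum(den, 1e-12)

def keep_static(p_prev, p_curr, mask, F, thresh_px=1.0):
    """True for points to keep. A point is dropped only if it lies on the object
    mask AND violates the epipolar constraint."""
    cols = np.clip(np.round(p_curr[:, 0]).astype(int), 0, mask.shape[1] - 1)
    rows = np.clip(np.round(p_curr[:, 1]).astype(int), 0, mask.shape[0] - 1)
    on_object = mask[rows, cols]
    moving = epipolar_distance(F, p_prev, p_curr) > thresh_px
    return ~(on_object & moving)

# Toy setup: camera translating along x, so epipolar lines are image rows.
F = np.array([[0.0, 0.0, 0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0, 0.0]])
mask = np.zeros((40, 40), dtype=bool)
mask[15:36, 15:36] = True                      # hypothetical instance mask
p_prev = np.array([[10.0, 10.0], [20.0, 20.0], [30.0, 30.0]])
p_curr = np.array([[12.0, 10.0], [22.0, 25.0], [33.0, 30.0]])
keep = keep_static(p_prev, p_curr, mask, F)
```

The third point sits on the object mask but obeys the epipolar constraint, so it is kept; this is the case of a static point on a potentially dynamic object, such as a parked car.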
3.3. Multimodal Fusion Perception Model for UGV
3.3.1. BEV-Based Multimodal Fusion Perception Model
3.3.2. Multimodal Unified BEV Feature Encoder with Temporal Optimization
3.3.3. Adaptive Dynamic Weighting BEV Feature Fusion
3.3.4. Multi-Task Head Decoding
- (1) 3D detection head. To meet the real-time and accuracy requirements of urban scenarios, a 3D detection head based on Deformable DETR is adopted. The decoder takes the fused BEV features as input. Through the deformable attention mechanism, it adaptively samples and aggregates features at different positions and scales in the BEV feature map according to the actual distribution of target objects, which yields more accurate detection boxes.
- (2) Semantic segmentation head. The semantic segmentation head is an important component of the multi-task perception framework. It performs multi-class binary semantic segmentation on the fused unified BEV features: the task is decomposed into multiple independent binary classification problems, with one segmentation head per class.
- (3) Occupancy prediction network. To support scene-level modeling, an occupancy prediction network is added on top of the two tasks above. Occupancy prediction is a core technique for modeling the geometry and semantics of 3D scenes. Through voxel-wise classification or probability prediction, it determines whether each voxel is occupied and, if so, by which category of object. Compared with traditional 3D bounding box detection, occupancy describes the geometric details of irregular objects and dynamic scenes more finely.
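
The per-class binary formulation of the segmentation head can be sketched in a few lines. The sketch assumes each head emits one logit per BEV cell and that a fixed probability threshold is applied; the function names, the threshold value, and the toy logits are illustrative and not taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_binary_heads(logits, thresholds):
    """logits: (C, H, W), one map per class; thresholds: (C,).
    Each class is decided independently, so a BEV cell may carry several labels."""
    probs = sigmoid(logits)
    return probs >= thresholds[:, None, None]

def binary_iou(pred, target):
    """IoU of two boolean masks for a single class."""
    union = np.logical_or(pred, target).sum()
    return np.logical_and(pred, target).sum() / union if union else 1.0

# Toy 2-class, 2x2 BEV grid (class names are only examples).
logits = np.array([[[2.0, -2.0], [2.0, -2.0]],    # e.g. "drivable"
                   [[2.0, 2.0], [-2.0, -2.0]]])   # e.g. "pedestrian crossing"
masks = decode_binary_heads(logits, thresholds=np.array([0.5, 0.5]))
target0 = np.array([[True, False], [False, False]])
iou0 = binary_iou(masks[0], target0)
```

Because each class has its own sigmoid rather than competing in a softmax, the top-left cell is labelled with both classes, which suits map layers that overlap spatially.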
3.4. UAV–UGV Collaborative Perception Model
3.4.1. Cross-Platform Feature Fusion for 3D Object Perception
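
A minimal single-head cross-attention sketch for fusing two BEV feature maps is given below. It assumes the UAV features have already been warped into the UGV BEV frame, that the UGV cells act as queries while the UAV cells supply keys and values, and that the fused map is a residual sum; these choices, the function names, and the random projection weights are assumptions for illustration, not the paper's architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(ugv_bev, uav_bev, Wq, Wk, Wv):
    """ugv_bev: (H, W, C) queries; uav_bev: (H2, W2, C) keys and values.
    Returns the fused (H, W, C) map and the (H*W, H2*W2) attention weights."""
    H, W, C = ugv_bev.shape
    q = ugv_bev.reshape(-1, C) @ Wq            # (H*W, d)
    k = uav_bev.reshape(-1, C) @ Wk            # (H2*W2, d)
    v = uav_bev.reshape(-1, C) @ Wv            # (H2*W2, C)
    attn = softmax(q @ k.T / np.sqrt(q.shape[1]), axis=-1)
    fused = ugv_bev + (attn @ v).reshape(H, W, C)   # residual connection
    return fused, attn

rng = np.random.default_rng(0)
ugv = rng.normal(size=(4, 4, 8))
uav = rng.normal(size=(3, 3, 8))
Wq, Wk = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
Wv = rng.normal(size=(8, 8))
fused, attn = cross_attention_fuse(ugv, uav, Wq, Wk, Wv)
```

Each ground cell receives a convex combination of aerial features, so an occluded ground cell can borrow evidence from whichever aerial cells its query matches best.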
3.4.2. UAV-UGV Spatiotemporal Synchronization
- (1) Collaborative coordinate transformation
- (2) Time synchronization
- (3) Calibration experiment
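
The two synchronization steps listed above can be sketched generically. The sketch assumes a planar similarity transform (scale, yaw, translation) obtained from calibration for the coordinate step, and nearest-timestamp pairing with a maximum allowed gap for the time step; the paper's actual models are not given in this section, and all names and numbers here are hypothetical.

```python
import numpy as np

def uav_to_ugv(points, scale, yaw, t):
    """Map (N, 2) points from the UAV map frame to the UGV BEV frame:
    p_ugv = scale * R(yaw) @ p_uav + t."""
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, -s], [s, c]])
    return scale * points @ R.T + t

def nearest_in_time(t_query, t_ref, max_gap):
    """For each query timestamp, index of the nearest reference timestamp,
    or -1 when the gap exceeds max_gap. t_ref must be sorted, length >= 2."""
    idx = np.clip(np.searchsorted(t_ref, t_query), 1, len(t_ref) - 1)
    left, right = t_ref[idx - 1], t_ref[idx]
    idx = np.where(t_query - left <= right - t_query, idx - 1, idx)
    gap = np.abs(t_ref[idx] - t_query)
    return np.where(gap <= max_gap, idx, -1)

# Hypothetical calibration: 90 degree yaw offset, 1 m shift along x, unit scale.
p_ugv = uav_to_ugv(np.array([[1.0, 0.0]]), scale=1.0, yaw=np.pi / 2,
                   t=np.array([1.0, 0.0]))

# Hypothetical timestamps in seconds: UAV frames at 10 Hz, UGV frames unaligned.
t_uav = np.array([0.00, 0.10, 0.20])
t_ugv = np.array([0.04, 0.16, 0.50])
pairs = nearest_in_time(t_ugv, t_uav, max_gap=0.05)
```

The last UGV frame has no UAV frame within 50 ms and is marked -1, so a caller could skip fusion for it rather than pair stale data.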
3.4.3. Incremental Map Updating Algorithm
4. Experiments and Results
4.1. Experimental Configuration
4.2. UAV Dynamic SLAM Experiments
4.2.1. Small-Target Detection Results Analysis
4.2.2. Dynamic Feature Point Elimination Experiment
4.2.3. Dense Mapping Experiment
4.3. Multimodal Fusion Perception Experiments on UGV
4.3.1. Dataset and Experimental Setup
4.3.2. 3D Object Detection Results Analysis
4.3.3. Semantic Segmentation Results Analysis
4.3.4. Ablation Study Results
4.4. UAV–UGV Collaborative Perception Experiments
4.4.1. Collaborative 3D Object Detection Experiments
4.4.2. Perception Map Updating Experiments
- (1) Direct replacement method: The local grid map of the UGV directly overwrites the corresponding octree region, ignoring confidence and temporal information;
- (2) Fixed-weight linear fusion: Statically assigns fusion weights to the maps from the two platforms;
- (3) Traditional Bayesian method: Based on conventional Bayesian updating rules;
- (4) Proposed method: Dynamic Bayesian probability updating combined with D-S evidence theory.
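
The first three baselines can be contrasted on a single cell with a short sketch. It assumes each map stores an occupancy probability per cell, that the linear baseline uses an equal weight, and that the Bayesian baseline is the standard log-odds occupancy update with a uniform 0.5 prior; these settings and the function names are assumptions, and the proposed method is not reproduced here.

```python
import math

def direct_replacement(p_prior, p_obs):
    """Baseline (1): the new observation overwrites the prior cell value."""
    return p_obs

def linear_fusion(p_prior, p_obs, w_obs=0.5):
    """Baseline (2): fixed-weight average (weight value is an assumption)."""
    return (1.0 - w_obs) * p_prior + w_obs * p_obs

def logit(p):
    return math.log(p / (1.0 - p))

def bayes_update(p_prior, p_obs):
    """Baseline (3): log-odds occupancy update, assuming a uniform 0.5 map prior."""
    l = logit(p_prior) + logit(p_obs)
    return 1.0 / (1.0 + math.exp(-l))

# Hypothetical cell: UAV prior map says 0.7 occupied, UGV observes 0.2.
p_prior, p_obs = 0.7, 0.2
replaced = direct_replacement(p_prior, p_obs)
averaged = linear_fusion(p_prior, p_obs)
updated = bayes_update(p_prior, p_obs)
```

Direct replacement discards the prior entirely, the fixed-weight average ignores how confident each source is, and the log-odds update lets a confident observation move the cell further than a weak one.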
4.4.3. Robustness to Communication and Sensor Failures
4.5. Quantitative Real-World Evaluation
4.6. Real-Time Performance Analysis
5. Conclusions
- (1) To address the problem that the prior map built by a UAV in a collaborative system is easily degraded by dynamic objects in the environment, a dynamic feature point elimination framework integrating ORB-SLAM3 with an improved YOLOv8 is constructed. The resulting map provides the basis for UAV-UGV collaborative navigation.
- (2) An end-to-end perception model for the UGV is developed. By fusing the BEV features from the camera and LiDAR onboard the UGV, the model improves 3D object detection and semantic map segmentation, reaching 71.8% mAP, 74.3% NDS, and 65.3% mIoU compared with 67.9%, 71.0%, and 62.7% for BEVFusion.
- (3) A cross-attention-based cross-platform feature fusion strategy for UAV-UGV collaboration is proposed, which improves the recognition accuracy of occluded objects and objects under bridges. Moreover, dynamic Bayesian probability updating and D-S evidence theory are employed to incrementally merge the fine-grained perceptual map of the UGV into the UAV’s prior map, resulting in an accurate semantic map under heterogeneous collaboration.
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| ORB-SLAM3 | Oriented FAST and Rotated BRIEF Simultaneous Localization and Mapping 3 |
| SLAM | Simultaneous Localization and Mapping |
| UAV | Unmanned Aerial Vehicle |
| UGV | Unmanned Ground Vehicle |
| V-SLAM | Visual Simultaneous Localization and Mapping |
| BEV | Bird’s-Eye-View |
| mAP | mean Average Precision |
| NDS | NuScenes Detection Score |
| mIoU | mean Intersection over Union |

| Target Type | Typical Objects | Classification Label |
|---|---|---|
| Static target | Trees, buildings, traffic signs, etc. | S |
| Potentially dynamic target | Cars, crowds, animals, etc. | / |
| Dynamic target | Pedestrians, cyclists, etc. | D |

| Model Version | mAP@0.5 (%) | mAP@0.5:0.95 (%) | Params (M) | GFLOPs |
|---|---|---|---|---|
| YOLOv8s | 39.7 | 23.9 | 11.2 | 28.6 |
| YOLOv8m | 41.7 | 26.2 | 25.9 | 78.9 |
| YOLOv8s + SAHI | 44.5 (+12%) | 27.4 (+15%) | 11.2 (unchanged) | 85.2 (+197%) |
| Proposed Algorithm | 48.2 (+21%) | 30.5 (+27%) | 13.5 (+20%) | 36.5 (+27%) |

| Sequence | ORB-SLAM3 | Dyna-SLAM | RDS-SLAM | Ours |
|---|---|---|---|---|
| sitting_static | 0.0355 | 0.0365 | 0.0328 | 0.0283 |
| walking_rpy | 0.0239 | 0.0282 | 0.0277 | 0.0202 |
| walking_xyz | 0.0154 | 0.0247 | 0.0183 | 0.0134 |

| Detector | Mod. | Car | Truck | Bus | Ped. | Bike | mAP (%) | NDS (%) |
|---|---|---|---|---|---|---|---|---|
| CVCENT | L | 81.0 | 48.6 | 55.0 | 80.2 | 25.3 | 53.1 | 63.4 |
| CenterPoint | L | 85.8 | 59.1 | 72.5 | 85.6 | 42.5 | 59.5 | 66.4 |
| LargeKernel3D | L | 85.4 | 59.9 | 72.7 | 85.3 | 56.1 | 63.6 | 69.2 |
| FUTR3D | C + L | 86.3 | 61.5 | 71.9 | 82.6 | 63.3 | 64.2 | 68.0 |
| TransFusion | C + L | 86.2 | 56.7 | 66.3 | 86.1 | 44.2 | 65.5 | 70.2 |
| FusionPainting | C + L | 87.0 | 63.0 | 70.7 | 88.4 | 64.4 | 66.5 | 70.7 |
| BEVFusion | C + L | 88.6 | 60.0 | 68.3 | 88.7 | 52.9 | 67.9 | 71.0 |
| DeepInteraction | C + L | 87.1 | 65.0 | 75.4 | 88.4 | 65.8 | 68.6 | 71.9 |
| Ours | C + L | 88.5 | 66.5 | 76.5 | 89.0 | 66.9 | 71.8 | 74.3 |
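
For readers relating the last two columns, the nuScenes detection score combines mAP with five mean true-positive error metrics (translation, scale, orientation, velocity, attribute): NDS = (5·mAP + Σ(1 − min(1, mTP))) / 10. The sketch below implements that definition and inverts it to show the mean true-positive score implied by a reported mAP and NDS pair. The table does not list the individual error metrics, so only the implied mean is computed; the function names are illustrative.

```python
def nds(map_score, tp_errors):
    """nuScenes detection score. map_score in [0, 1]; tp_errors holds the five
    mean TP errors (mATE, mASE, mAOE, mAVE, mAAE)."""
    if len(tp_errors) != 5:
        raise ValueError("expected five true-positive error metrics")
    tp_scores = [1.0 - min(1.0, e) for e in tp_errors]
    return (5.0 * map_score + sum(tp_scores)) / 10.0

def implied_mean_tp_score(map_score, nds_score):
    """Mean of the five (1 - min(1, error)) terms implied by an mAP/NDS pair."""
    return (10.0 * nds_score - 5.0 * map_score) / 5.0

ours = implied_mean_tp_score(0.718, 0.743)        # "Ours" row
bevfusion = implied_mean_tp_score(0.679, 0.710)   # "BEVfusion" row
```

Under this definition, half of the score comes from mAP and half from the quality of the matched boxes, so a gain in NDS that exceeds half the gain in mAP indicates better box quality as well as more detections.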

| Detector | Mod. | Drivable | Ped. Cross | Walkway | Stop Line | Carpark | mIoU (%) |
|---|---|---|---|---|---|---|---|
| LSS | C | 75.4 | 38.3 | 46.3 | 30.3 | 39.1 | 44.4 |
| CVT | C | 74.3 | 36.8 | 39.9 | 25.8 | 35.0 | 40.2 |
| CenterPoint | L | 75.1 | 47.4 | 57.6 | 35.8 | 32.7 | 47.6 |
| PointPillars | L | 72.0 | 43.1 | 53.1 | 29.7 | 27.7 | 43.8 |
| PointPainting | C + L | 75.9 | 48.5 | 57.1 | 36.9 | 34.5 | 49.1 |
| MVP | C + L | 76.1 | 48.7 | 57.0 | 36.9 | 33.0 | 49.0 |
| BEVFusion | C + L | 85.5 | 60.6 | 67.6 | 52.0 | 57.0 | 62.7 |
| Ours | C + L | 88.5 | 62.3 | 70.5 | 53.6 | 57.5 | 65.3 |

| Method | Modules enabled (of CAFM, TFM, AWFM) | mAP (%) | NDS (%) | mIoU (%) |
|---|---|---|---|---|
| Baseline | 0 | 67.9 | 71.0 | 62.7 |
| | 1 | 68.2 | 72.5 | 63.6 |
| | 2 | 69.8 | 73.8 | 64.0 |
| | 2 | 71.0 | 73.4 | 64.2 |
| | 3 | 71.8 | 74.3 | 65.3 |

| Scene Type | Total Frames | Number of Objects | Occlusion Ratio | Lighting Condition |
|---|---|---|---|---|
| Residential Alley | 1120 | 3581 | 40% | Normal lighting |
| Urban Arterial Road | 1245 | 5872 | 32% | Normal lighting |
| Industrial Park | 980 | 2120 | 25% | Complex lighting |
| Suburban/Rural | 760 | 950 | 17% | Backlight/low light |

| Mod. | Car (0–30%) | Ped. (0–30%) | Cyclists (0–30%) | Car (30–60%) | Ped. (30–60%) | Cyclists (30–60%) | Car (>60%) | Ped. (>60%) | Cyclists (>60%) |
|---|---|---|---|---|---|---|---|---|---|
| UAV | 78.5 | 65.2 | 68.4 | 62.1 | 48.7 | 52.3 | 41.5 | 29.6 | 32.8 |
| UGV | 82.3 | 74.8 | 71.6 | 68.9 | 58.4 | 55.1 | 53.2 | 38.4 | 40.1 |
| Coo. | 86.7 | 78.9 | 75.2 | 75.4 | 65.8 | 63.7 | 59.8 | 47.8 | 49.5 |

| Experiment Method | Scene A RMSE | Scene A F1-Score | Scene B RMSE | Scene B F1-Score | Scene C RMSE | Scene C F1-Score |
|---|---|---|---|---|---|---|
| UAV | 4.8 ± 0.3 | 0.79 | 10.8 ± 1.2 | 0.76 | 13.8 ± 1.2 | 0.64 |
| UGV | 3.1 ± 0.2 | 0.86 | 8.7 ± 0.8 | 0.84 | 10.5 ± 0.8 | 0.71 |
| Collaborative Map | 2.9 ± 0.1 | 0.93 | 5.9 ± 0.5 | 0.91 | 6.3 ± 0.5 | 0.94 |

| Experiment Method | Scene A RMSE | Scene A F1-Score | Scene B RMSE | Scene B F1-Score | Scene C RMSE | Scene C F1-Score |
|---|---|---|---|---|---|---|
| Direct Replacement | 4.2 ± 0.3 | 0.82 | 12.8 ± 1.2 | 0.68 | 18.5 ± 1.2 | 0.42 |
| Linear Weighted | 3.5 ± 0.2 | 0.85 | 9.7 ± 0.8 | 0.74 | 16.7 ± 0.8 | 0.61 |
| Classical Bayesian | 3.1 ± 0.4 | 0.87 | 8.2 ± 0.9 | 0.78 | 15.0 ± 0.9 | 0.67 |
| Proposed Method | 2.1 ± 0.1 | 0.93 | 6.3 ± 0.5 | 0.89 | 6.3 ± 0.5 | 0.89 |

| Condition | mAP (%) | NDS (%) |
|---|---|---|
| Baseline (no failure) | 88.7 | 74.3 |
| Delay 50 ms | 88.2 | 73.9 |
| Delay 100 ms | 86.5 | 72.8 |
| Delay 200 ms | 81.5 | 69.2 |
| Packet loss 5% | 87.9 | 74.0 |
| Packet loss 10% | 86.1 | 72.5 |
| Packet loss 20% | 80.2 | 67.8 |
| Bandwidth 10 Mbps | 86.9 | 73.2 |
| Bandwidth 1 Mbps | 71.2 | 62.5 |
| UAV offline (fallback to UGV) | 68.5 | 63.1 |
| UGV LiDAR failure | 72.1 | 66.4 |

| Metric | UAV Only | UGV Only | Collaborative |
|---|---|---|---|
| Detection mAP (%) | 68.3 | 72.1 | 84.5 |
| Localization RMSE (m) | 0.35 | 0.28 | 0.19 |
| Map F1-score | 0.71 | 0.79 | 0.88 |

| Metric | Value |
|---|---|
| End-to-end latency | 87 ms |
| Overall pipeline frame rate | 18 fps |
| UAV SLAM frame rate | 35 fps |
| UGV BEV fusion frame rate | 22 fps |
| UGV total module runtime | 65 ms |
| Communication bandwidth | 28 Mbps |
| UGV GPU memory usage | 5.2 GB |
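
The timing figures above can be related to one another with simple arithmetic. The sketch converts frame rates to frame periods and estimates the average communication payload per pipeline frame; it assumes 1 Mbps = 10^6 bits per second and that the reported bandwidth is spread evenly over frames, neither of which is stated in the table.

```python
def frame_period_ms(fps):
    """Time between consecutive frames, in milliseconds."""
    return 1000.0 / fps

def payload_per_frame_kb(mbps, fps):
    """Average payload per frame in kilobytes, assuming 1 Mbps = 1e6 bit/s."""
    return mbps * 1e6 / fps / 8.0 / 1e3

pipeline_period = frame_period_ms(18)       # overall pipeline at 18 fps
frames_in_flight = 87 / pipeline_period     # 87 ms end-to-end latency
payload = payload_per_frame_kb(28, 18)      # 28 Mbps at 18 fps
```

An end-to-end latency of 87 ms against a frame period of roughly 56 ms means about one and a half frames are in flight at once, which is what one would expect if the stages overlap rather than run strictly in sequence.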
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Li, Y.; Tian, E.; Chen, X.; Han, H.; Zhang, X. Research on Multi-Source Heterogeneous Collaborative Perception System Based on Unmanned Aerial Vehicle and Unmanned Ground Vehicle. Drones 2026, 10, 470. https://doi.org/10.3390/drones10060470
