Research on Multi-Source Heterogeneous Collaborative Perception System Based on Unmanned Aerial Vehicle and Unmanned Ground Vehicle

Li, Yufeng; Tian, Erming; Chen, Xiaofeng; Han, Huiyan; Zhang, Xinya

doi:10.3390/drones10060470

by Yufeng Li ^1,2,*, Erming Tian ^1,2, Xiaofeng Chen ^3, Huiyan Han ^1 and Xinya Zhang ^1,2

^1 Shanxi Key Laboratory of Machine Vision and Virtual Reality, North University of China, Taiyuan 030051, China
^2 Institute for Intelligent Weapon Research, North University of China, Taiyuan 030051, China
^3 ShanXi PingYang Industry Machinery Co., Ltd., Linfen 043003, China
^* Author to whom correspondence should be addressed.

Drones 2026, 10(6), 470; https://doi.org/10.3390/drones10060470 (registering DOI)

Submission received: 10 April 2026 / Revised: 10 June 2026 / Accepted: 12 June 2026 / Published: 19 June 2026

(This article belongs to the Section Innovative Urban Mobility)

Highlights

What are the main findings?

This study proposes a dynamic SLAM method for UAVs that integrates an improved YOLOv8 with ORB-SLAM3, enabling the construction of a high-precision and reliable prior map; additionally, it develops an end-to-end multimodal spatiotemporal joint optimization framework based on Transformer and BEV models for the environmental perception of unmanned ground vehicles.
A cross-platform feature fusion method for 3D object perception and an incremental map update algorithm are proposed in this study. By employing a fusion algorithm based on a cross-attention mechanism, aerial top-view features and ground fused BEV features are integrated across platforms, thereby enabling the construction of high-precision maps.

What are the implications of the main findings?

The proposed UAV-based dynamic SLAM algorithm can provide prior map information for unmanned ground vehicles, supporting their navigation and collaborative perception tasks, while achieving efficient spatiotemporal fusion of multi-sensor data. This significantly improves object detection accuracy and system robustness in complex urban scenarios.
Unmanned aerial and ground vehicles can achieve cross-platform collaborative perception, ranging from target-level alignment to semantic-level fusion. This enables real-time and precise perception of complex dynamic environments.

Abstract

Complex urban scenarios impose high demands on the environmental perception capabilities of unmanned systems, which serve as a prerequisite for executing autonomous missions such as disaster response, infrastructure inspection, and smart city operations. UAVs, leveraging their high mobility, can provide accurate prior maps and wide-area aerial observation for unmanned ground vehicles. However, their long-range perception accuracy is limited. Conversely, UGVs can achieve high-precision environmental perception along their navigation paths using prior maps, but suffer from a constrained field of view. The collaboration between the two platforms complements their respective strengths, thereby enhancing 3D object perception and mapping accuracy in complex scenarios. To address the aforementioned challenges, this study proposes a cross-platform feature fusion method for 3D object perception and an incremental map updating approach for UAVs and UGVs. First, a dynamic SLAM method that integrates an optimized YOLOv8 with ORB-SLAM3 is employed to mitigate map blurring caused by dynamic noise, providing prior map information for UGVs. Second, a multimodal fusion perception model is constructed for UGVs, utilizing attention mechanisms to achieve deep fusion of multimodal Bird’s-Eye-View (BEV) features. This overcomes issues such as diminishing complementarity between modalities and weak temporal feature associations. Finally, an air-ground fusion model based on a cross-attention mechanism is developed to fuse aerial view features with ground-based fused BEV features across platforms, yielding a unified feature representation for 3D object detection and generating a fused high-precision map.
Experimental results demonstrate that under complex occlusion scenarios in a simulated dataset, the proposed collaborative perception system improves the mean Average Precision (mAP) by 12.7% and 15.7% compared to using a single UAV or a single UGV, respectively, while increasing the map accuracy F1-score by 0.21. This study provides technical support for achieving real-time and accurate air-ground collaborative perception in complex dynamic environments.

Keywords:

visual SLAM; UAV-UGV; multimodal fusion; air-ground collaborative perception

1. Introduction

With the rapid advancement of unmanned systems technology, modern society is witnessing a profound transformation in autonomous operations, shifting from manual control to intelligent automation [1]. In complex urban environments, such as those encountered in disaster response, infrastructure inspection, and intelligent transportation, unmanned platforms face unprecedented challenges in autonomous environmental perception [2]. Typical urban scenes are characterized by dense buildings, intricate alleyways, severe visual occlusion, and frequent dynamic objects, making it difficult for a single unmanned platform to independently achieve high-precision, highly robust environmental perception and localization [3].

UAVs, with their high mobility and wide field of view, can provide aerial reconnaissance information and prior map guidance. Recent advances in UAV communication and sensing, such as terahertz integrated sensing and communication [4] and secure UAV-enabled massive MIMO networks [5], further highlight the potential of UAVs in complex urban operations. However, their long-range perception accuracy is limited, and they are susceptible to occlusion and dynamic noise in complex urban environments [6]. In contrast, unmanned ground vehicles can achieve high-precision close-range perception using multimodal sensors, yet they suffer from a constrained field of view and poor penetrability, resulting in blind spots in non-line-of-sight areas. UAVs, leveraging their expansive aerial perspective and rapid maneuverability, can supply UGVs with global situational awareness and prior map guidance, while UGVs equipped with multimodal sensors such as LiDAR and cameras enable high-precision local environmental modeling. The deep integration of the two platforms provides a new technical pathway for 3D object detection and real-time mapping in complex scenarios, making the UAV-UGV collaborative perception system a critical approach for enhancing the situational awareness capabilities of unmanned systems in urban environments [7].

However, current air-ground collaborative perception systems face three major challenges. First, significant differences between UAVs and UGVs in terms of spatial perspective, sensor characteristics, and data modalities hinder the efficient correlation and deep fusion of cross-platform features [8]. Second, factors such as dynamic objects, frequent occlusions, and illumination variations in urban environments cause traditional simultaneous localization and mapping (SLAM) systems to suffer from map blurring and trajectory drift, making it difficult to construct high-precision prior maps. Third, collaborative perception systems are highly dependent on communication links; issues such as communication interruptions, bandwidth limitations, and partial sensor failures in real-world applications pose severe tests to system robustness and fault tolerance [9].

To address the above issues, this study proposes a UAV-UGV collaborative perception system tailored for urban environments, focusing on solving key technical challenges such as cross-platform heterogeneous feature fusion, high-precision mapping in dynamic environments, and collaborative localization. The goal is to enhance the environmental perception accuracy, robustness, and collaborative efficiency of heterogeneous unmanned systems in urban scenarios, providing critical technical support for achieving real-time and accurate air-ground collaborative perception in complex dynamic environments. This study makes the following main contributions:

(1): UAV dynamic SLAM with precise dynamic feature elimination. Unlike prior dynamic SLAM methods that simply remove all feature points inside object bounding boxes, our approach integrates an improved YOLOv8 (with BiFPN and a small-target detection layer) with ORB-SLAM3 and MobileSAM instance segmentation. Dynamic feature points are removed only after geometric verification, while static background points inside the bounding boxes are preserved. This yields a high-precision, reliable prior map for UGVs.
(2): UAV-UGV BEV feature fusion. A cross-attention mechanism is designed to fuse the UAV’s overhead BEV features with the UGV’s ground BEV features. The UAV’s overhead features are first transformed into the UGV’s BEV coordinate frame via a calibrated spatial mapping that accounts for GPS/IMU uncertainty and scale ambiguity. To our knowledge, this is the first end-to-end learnable fusion of aerial and ground BEV features for urban collaborative perception.
(3): Incremental map updating with dynamic Bayesian probability and D-S evidence theory. A semantic-assisted conflict resolution strategy is introduced. When the conflict coefficient K exceeds a learned threshold (0.3), semantic context is used to resolve contradictions before applying the Dempster-Shafer combination rule. This provides robust map updates under occlusion and dynamic changes, a capability absent in conventional Bayesian filters.
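To make the conflict check in contribution (3) concrete, the following is a minimal sketch of Dempster's combination rule for a single map cell over the frame {occupied, free}, with the remaining mass assigned to "unknown". The conflict coefficient K and the 0.3 threshold come from the text above. The semantic resolution step is only an assumption for illustration: here the source contradicted by a hypothetical semantic flag is discounted (Shafer discounting with a hypothetical factor `alpha`) before combination. The function names and mass values are likewise illustrative and are not taken from the paper.

```python
def conflict(m1, m2):
    """Conflict coefficient K: mass assigned to contradictory pairs."""
    return m1["occ"] * m2["free"] + m1["free"] * m2["occ"]


def discount(m, alpha):
    """Shafer discounting: shift (1 - alpha) of the committed mass to 'unk'."""
    occ, free = alpha * m["occ"], alpha * m["free"]
    return {"occ": occ, "free": free, "unk": 1.0 - occ - free}


def combine(m1, m2):
    """Dempster's rule for the focal sets {occ}, {free} and the full frame."""
    k = conflict(m1, m2)
    if k >= 1.0:
        raise ValueError("total conflict: sources cannot be combined")
    norm = 1.0 - k
    occ = (m1["occ"] * m2["occ"] + m1["occ"] * m2["unk"] + m1["unk"] * m2["occ"]) / norm
    free = (m1["free"] * m2["free"] + m1["free"] * m2["unk"] + m1["unk"] * m2["free"]) / norm
    unk = m1["unk"] * m2["unk"] / norm
    return {"occ": occ, "free": free, "unk": unk}


def fuse_cell(m_prior, m_obs, semantic_static, k_threshold=0.3, alpha=0.5):
    """Fuse prior-map and new-observation masses for one cell.

    Assumption (not from the paper): when K exceeds the threshold, the source
    that disagrees with the semantic cue is discounted before combination.
    """
    k = conflict(m_prior, m_obs)
    if k > k_threshold:
        if semantic_static:
            m_obs = discount(m_obs, alpha)      # trust the prior map
        else:
            m_prior = discount(m_prior, alpha)  # trust the new observation
    return combine(m_prior, m_obs), k


# Prior map says "occupied", new observation says "free" (e.g. a car drove away).
prior = {"occ": 0.8, "free": 0.1, "unk": 0.1}
obs = {"occ": 0.1, "free": 0.8, "unk": 0.1}
fused, k = fuse_cell(prior, obs, semantic_static=False)
print(k, fused)
```

In this example the two sources conflict strongly, so the prior is discounted and the fused belief leans toward "free" while keeping some mass on "unknown".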

These contributions are built upon well-established components (ORB-SLAM3, YOLOv8, BEVFormer-style BEV generation, PointNet++, and ResNet101+FPN), but are tightly coupled through shared coordinate frames, synchronized timestamps, and a unified BEV representation to address specific bottlenecks in air-ground collaboration.

The remainder of this paper is organized as follows: Section 2 reviews related work on UAV-UGV collaborative perception. Section 3 details the proposed methodology. Section 4 presents the experiments and results. Finally, Section 5 concludes the paper and outlines directions for future work.

2. Related Work

2.1. Visual SLAM for UAVs

With the widespread application of UAVs in fields such as disaster rescue, infrastructure inspection, and smart cities, visual simultaneous localization and mapping (V-SLAM) technology has become one of the key technologies supporting UAV autonomous navigation and environmental perception [10]. Compared to traditional positioning methods that rely on the Global Navigation Satellite System (GNSS), V-SLAM enables high-precision localization in GPS-denied environments using only onboard visual sensors and has therefore attracted significant attention [11].

Among classical visual SLAM algorithms, the ORB-SLAM series is one of the most influential open-source frameworks in recent years. ORB-SLAM achieves efficient feature extraction and tracking using ORB features, supports monocular, stereo, and RGB-D cameras, and integrates loop closure detection and global optimization modules [12]. ORB-SLAM2 further introduces support for stereo vision and RGB-D sensors, while ORB-SLAM3 incorporates multi-map management, multi-session operation, and tight coupling with inertial measurement unit (IMU) data, significantly enhancing system robustness and flexibility [13]. For UAV applications, researchers have proposed integrating deep learning-based monocular depth estimation models into the ORB-SLAM2 system. In real-world tests with a Tello UAV, this method reduced trajectory errors by 34% to 54% compared to conventional approaches [14].

In terms of dynamic environment adaptability, traditional visual SLAM systems typically assume static environments, leading to significant degradation in localization accuracy when dynamic objects such as pedestrians or vehicles are present [15]. To address this issue, DFF-SLAM (Dynamic Feature Filtering-based SLAM) proposes a dynamic feature filtering method that identifies a priori dynamic objects in the scene through a semantic detection thread, and combines optical flow tracking with epipolar geometry constraints to assess the motion state of feature points, effectively filtering out dynamic features. This approach significantly improves UAV localization accuracy in complex IoT-enabled environments. Meanwhile, the Deep-UAV SLAM framework introduces deep learning features such as SuperPoint and SuperGlue into the SLAM system, enhancing UAV navigation robustness in dynamic outdoor environments through superior feature matching capabilities [16].

Regarding deep learning-enabled approaches, the fusion of convolutional neural networks (CNNs) and vision transformers (ViTs) has emerged as a frontier in UAV visual SLAM research [17]. Studies indicate that CNNs excel in real-time performance and are well-suited for high-frequency reactive perception tasks, whereas ViTs possess strong global contextual reasoning capabilities and demonstrate greater robustness in object occlusion scenarios, albeit with higher computational overhead [18].

UAV visual SLAM technology is rapidly evolving from single-sensor static scene localization toward multi-sensor fusion, dynamic environment robustness, and deep learning-enabled paradigms. Future advancements require continued breakthroughs in efficient computing architectures, cross-scenario generalization capabilities, and deep multimodal synergy to meet the demands of high-precision autonomous navigation for UAVs in complex dynamic environments.

2.2. Multimodal End-to-End Perception for Unmanned Ground Vehicles

With the rapid advancement of large-scale model research, perception models based on Transformer and Bird’s-Eye-View (BEV) frameworks have become the mainstream paradigm for end-to-end perception systems in unmanned ground vehicles [19]. In the area of BEV-based collaborative perception, BEVFormer, as a foundational work in BEV perception, employs spatiotemporal Transformers to achieve a unified BEV representation of multi-camera features, providing an important basis for subsequent air-ground collaborative BEV feature fusion [20]. In the context of multimodal fusion frameworks, SpaRC (Sparse Radar-Camera Fusion) achieves efficient radar-camera fusion through a sparse attention mechanism, offering a more robust perception solution for unmanned ground vehicles operating in adverse environments [21]. The sparse frustum fusion and range-adaptive radar aggregation modules within this framework hold significant value for the fusion of heterogeneous multi-source sensor data in UAV-UGV collaborative scenarios. Furthermore, FusionFormer integrates deformable attention mechanisms with residual connections to fuse 2D image and 3D voxel features during the feature encoding stage, enhancing model stability in cases where certain input modalities are missing, a critical consideration for collaborative perception systems facing potential communication disruptions or sensor failures [22].

With the development of large model technologies, VLA unified models have begun to be introduced into collaborative perception tasks [23]. MindVLA-o1, released by Li Auto, adopts a native multimodal Mixture-of-Experts (MoE) Transformer as its core architecture, leveraging 3D spatial understanding and unified behavior generation to offer new insights into semantic-level collaboration across heterogeneous platforms [24]. Meanwhile, the HERMES framework integrates a multimodal driving module that combines multi-view perception, historical motion cues, and semantic guidance, demonstrating strong performance in risk-aware trajectory planning under long-tail scenarios [25]. Such technologies hold promise for further improving the adaptability of UAV-UGV collaborative systems in complex dynamic environments. Air-ground collaborative multimodal end-to-end perception technologies are currently in a phase of rapid development. Future efforts should focus on continuous breakthroughs in cross-platform feature alignment, efficient communication mechanisms, and real-world data accumulation, so as to advance the collaborative perception capabilities of heterogeneous unmanned systems in complex environments to a higher level.

2.3. UAV-UGV Collaborative Perception

In recent years, UAV-UGV collaborative perception technology has emerged as a research hotspot in the field of intelligent unmanned systems, with significant progress made by research institutions worldwide. In the area of collaborative mapping, Carnegie Mellon University’s LAMP (Large-Scale Autonomous Mapping and Perception) system achieved high-precision 3D mapping in large-scale unknown environments through the collaboration of UAVs and ground robots, with UAVs providing a global perspective and ground vehicles contributing detailed information [26]. Research related to the DARPA Subterranean Challenge (SubT), a competition focused on autonomous exploration in underground environments, demonstrated the collaborative perception capabilities of UAVs and ground robots in complex underground settings, significantly improving environmental coverage and target recognition accuracy through multi-platform information sharing [27]. Ai et al. proposed a LiDAR-based coarse-to-fine optimization framework that jointly refines initial point clouds from UAVs and UGVs through graph optimization post-processing. Experimental results showed that this method significantly enhanced point cloud map and trajectory estimation accuracy [28].

In the realm of multi-sensor fusion, the multi-robot collaborative SLAM framework developed by ETH Zurich leveraged the complementary nature of UAVs’ aerial perspective and UGVs’ local perception to effectively address map consistency maintenance in dynamic scenes [29]. A research team at Stanford University focused on real-time air-ground collaborative object detection and tracking, achieving continuous localization and environmental awareness of moving targets through cross-platform feature association [30]. Cheng et al. proposed a dynamic autonomous docking scheme for UAV-UGV systems in GPS-denied environments, realizing precise docking under complex conditions through multi-sensor fusion and visual guidance [31]. Wang et al. introduced a distributed multi-robot collaborative SLAM method based on iterative registration assisted by semantic and geometric features [32]. By employing a multi-level partitioning DPGO optimization strategy, this approach effectively addressed challenges such as large viewpoint differences and difficult backend optimization convergence in air-ground cross-domain collaboration. Zhang et al. proposed an air-ground collaborative perception framework for drivable area detection, achieving accurate identification of drivable areas in complex environments by fusing UAV aerial views with multimodal sensor data from UGVs [33]. Additionally, the BEV-based collaborative perception method developed at the Karlsruhe Institute of Technology deeply integrated bird’s-eye-view features from both UAVs and UGVs, achieving superior object detection performance in urban environments compared to single-platform approaches [34].

Recent studies have explored simplified spatial data fusion approaches [35] and analyzed sensor robustness under adverse weather conditions in BEV perception systems [36]. These works highlight the importance of handling environmental disturbances and computational efficiency, which are complementary to our contributions.

Although these studies have achieved notable results in collaborative mapping, multimodal fusion, and object perception, several critical issues remain to be addressed. First, existing methods are primarily focused on structured environments or static scenarios, demonstrating insufficient adaptability to highly dynamic targets and frequent occlusions in complex urban environments; thus, system robustness requires improvement. Second, cross-platform feature fusion still relies predominantly on shallow feature concatenation or simple weighting, lacking effective modeling of deep semantic correlations among heterogeneous sensor data, which limits the full exploitation of modal complementarity. Third, current collaborative perception systems generally assume stable and reliable communication links, lacking fault-tolerant mechanisms to handle communication interruptions, latency, or partial sensor failures—posing significant challenges to system reliability in complex real-world environments. Finally, most studies validate algorithm performance in simulated environments and lack real-world data from diverse operational scenarios, leaving the transferability of algorithms to real-world conditions in need of further verification.

3. Materials and Methods

3.1. System Overall Architecture

Based on a cross-platform feature fusion strategy, this paper constructs a collaborative perception system for UAVs and unmanned ground vehicles (UGVs). The overall system architecture is shown in Figure 1, and the information interaction flow is illustrated in Figure 2. The architecture elucidates the bidirectional data flow interaction mechanism between heterogeneous nodes. From the perspective of system functionality, its core components include multimodal data channel multiplexing among airborne sensing units, ground-based sensing analysis units, and edge computing nodes on the unmanned ground platform, as well as a cross-platform closed-loop feedback network within the heterogeneous intelligent framework of UAVs and UGVs.

The collaborative perception process of the system is as follows:

First, the UAV scans a wide area to generate a regional 3D point cloud and constructs a prior map. The unmanned ground vehicle performs prior path planning based on this map, thereby significantly reducing the probability of falling into local optima. During this process, an improved YOLOv8 fused with ORB-SLAM3 is employed to mitigate map blurring caused by dynamic noise.

Second, a branch feature extraction strategy is adopted, where a lightweight feature extraction network is deployed on the UAV side. Cross-platform feature fusion is then achieved via a cross-attention mechanism, yielding a unified fused image feature representation.

Finally, BEV semantic segmentation decoding is performed to generate a precise local semantic map. This map is fed back to the initial prior map for updating, while 3D object detection is also accomplished.

Through the above technical approach, the UAV and UGV achieve cross-platform collaborative perception ranging from target-level alignment to semantic-level fusion. When the UAV’s line-of-sight is obstructed in complex environments (e.g., under bridges or inside buildings), the system can automatically switch to an independent operation mode of the UGV, thereby maintaining the robustness of the collaborative system while significantly enhancing environmental understanding capabilities in complex scenarios.

3.2. UAV Dynamic SLAM

To address the challenge that unmanned ground vehicles face in acquiring prior environmental maps in complex urban environments, this paper utilizes the aerial perspective of UAVs for preliminary map construction, thereby providing a foundation for subsequent perception and navigation tasks. To this end, a dynamic small-target point cloud elimination strategy integrating an improved YOLOv8 with ORB-SLAM3 is investigated. By constructing dense maps and octree maps to optimize visual SLAM, the recognition accuracy of aerial dynamic small targets is enhanced, resulting in high-quality dense and grid maps. The UAV dynamic SLAM process is illustrated in Figure 3.

First, the YOLOv8-based small-target detection algorithm is improved. Then, the enhanced YOLOv8 is fused with the ORB-SLAM3 algorithm, employing a feature point elimination strategy that combines object detection and semantic segmentation. After dynamic target judgment, dynamic feature points are removed in real time, generating a hierarchical data structure containing semantic objects and a dynamic visual map. Finally, experimental validation is conducted using the TUM dataset and a self-built UAV platform. The proposed algorithm maintains localization accuracy while ensuring high-quality static map construction, providing a reliable prior map for subsequent collaborative perception.

3.2.1. Small Target Detection and Segmentation Optimization Based on YOLOv8

(1): Improved YOLOv8

By default, YOLOv8 detects on feature layers P3 to P5, whereas P2, a higher-resolution feature map derived from shallower layers, is better suited to small-target detection. To exploit it, this paper improves the connection mode of the YOLOv8 neck network: a 1 × 1 convolution is attached after the first, second, and third C2f modules and the SPPF module in the backbone network, and the BiFPN connection pattern is adopted to achieve bidirectional cross-scale connections. In addition, an extra connection is added between the input node and the output node to fuse more features. The improved neck network is shown in Figure 4.

This structure establishes a cross-layer association mechanism by constructing bidirectional channels, and adds inter-layer connections and bidirectional links to realize cross-scale bidirectional connections and inter-layer weight fusion. The multi-scale features extracted from the encoding network are directly fused with the feature tensors in the bottom-up path, thereby effectively capturing shallow feature details while preserving the high-level semantic representations extracted by the deep network to the greatest extent.
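The "inter-layer weight fusion" of a BiFPN node can be sketched as follows. The paper does not state which weighting it uses, so this example assumes the fast normalized fusion from the original BiFPN design: each input gets a learnable non-negative scalar weight, and the weights are normalized by their sum plus a small epsilon. The feature maps are assumed to have already been resized to a common shape, and all array names and values are placeholders.

```python
import numpy as np


def fast_normalized_fusion(features, weights, eps=1e-4):
    """Fuse same-shaped feature maps with normalized non-negative weights.

    features: list of arrays with identical shapes (already resized).
    weights:  one learnable scalar per input; ReLU keeps them >= 0.
    """
    w = np.maximum(np.asarray(weights, dtype=np.float64), 0.0)
    w = w / (w.sum() + eps)
    return sum(wi * f for wi, f in zip(w, features))


# A node with three inputs: the same-level input, the top-down feature,
# and the extra input-to-output skip connection mentioned above.
p_in = np.full((2, 2), 1.0)
p_td = np.full((2, 2), 2.0)
p_skip = np.full((2, 2), 3.0)

fused_equal = fast_normalized_fusion([p_in, p_td, p_skip], [1.0, 1.0, 1.0])
fused_clipped = fast_normalized_fusion([p_in, p_td], [-1.0, 1.0])
print(fused_equal)
print(fused_clipped)
```

With equal weights the output is approximately the mean of the inputs; a negative weight is clipped to zero, so that input is effectively ignored.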

These improvements raise the detection rate of small targets, which traditional object detection models handle poorly, while maintaining good detection performance for large targets.

(2): MobileSAM Instance Segmentation Module

Instance segmentation not only achieves precise boundary delineation but also identifies individual objects; however, it typically requires processing large amounts of data. To address this issue, this paper combines YOLOv8 and MobileSAM to create a learning framework that can perform both object detection and semantic segmentation tasks simultaneously.

MobileSAM (Mobile Segment Anything Model) is a lightweight version of the Segment Anything Model (SAM) released by Meta AI. It features a lightweight network architecture and supports multimodal inputs such as depth maps and grayscale images. Through knowledge distillation, it reduces model size while maintaining performance. It also adopts an optimized Vision Transformer (ViT) as its backbone network to capture long-range dependencies in images, adapting to mobile device requirements by reducing computational complexity. Therefore, MobileSAM is well-suited for applications that demand high-performance image segmentation under limited computational resources. In this paper, the small-target detection boxes obtained from the improved YOLOv8 are used as prompt information and fed into MobileSAM for segmentation. This approach enables simultaneous object detection and semantic segmentation. The effect of the improved YOLOv8+MobileSAM framework for object detection and instance segmentation is shown in Figure 5.

3.2.2. Fusion of Improved YOLOv8 with ORB-SLAM3

(1): Dynamic Authenticity Judgment of Objects

The improved YOLOv8, trained on the VisDrone2019-DET dataset, can recognize three categories of objects: static objects (e.g., trees, buildings, traffic signs), dynamic objects (e.g., pedestrians, cyclists), and potentially dynamic objects (e.g., cars, trucks). This paper adopts a random sample consensus (RANSAC)-based method to judge the dynamic authenticity of potentially dynamic objects.

First, a stepwise labeling strategy is applied: feature points located within static object detection regions are uniformly labeled as class S, while feature points within dynamic object detection regions that do not spatially overlap with static or potentially dynamic regions are labeled as class D, as shown in Table 1. Second, for each suspected moving feature point inside a potentially dynamic region, the RANSAC algorithm is used to compute the geometric deviation between the point’s position and its epipolar line, which is derived from the static feature point set. Finally, the dynamic attribute is determined against a preset threshold: if the deviation exceeds the threshold, the point is labeled as D (dynamic); otherwise, it is labeled as S (static). Dynamic feature points are marked in white.
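The epipolar check can be illustrated with a short sketch. It assumes a fundamental matrix F is already available; in a real pipeline F would typically be estimated from the static (class S) matches with a RANSAC-based solver such as OpenCV's `cv2.findFundamentalMat` with the `cv2.FM_RANSAC` flag. To keep the example self-contained, F is instead built here from a known camera motion. The intrinsics, motion, 3D points, and 1-pixel threshold are all hypothetical values chosen for illustration.

```python
import numpy as np


def skew(t):
    """Cross-product matrix [t]_x."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])


def fundamental_from_motion(K, R, t):
    """F = K^-T [t]_x R K^-1 for the convention X2 = R @ X1 + t."""
    K_inv = np.linalg.inv(K)
    return K_inv.T @ skew(t) @ R @ K_inv


def project(K, X):
    """Project 3D camera-frame points (N, 3) to pixels (N, 2)."""
    x = (K @ X.T).T
    return x[:, :2] / x[:, 2:3]


def epipolar_distance(F, pts1, pts2):
    """Pixel distance of each pts2 point from the epipolar line F @ p1."""
    ones = np.ones((len(pts1), 1))
    p1 = np.hstack([pts1, ones])
    p2 = np.hstack([pts2, ones])
    lines = p1 @ F.T                       # row i is F @ p1[i]
    num = np.abs(np.sum(lines * p2, axis=1))
    den = np.hypot(lines[:, 0], lines[:, 1])
    return num / den


def label_points(F, pts1, pts2, threshold_px=1.0):
    """Label each match 'D' (dynamic) or 'S' (static)."""
    d = epipolar_distance(F, pts1, pts2)
    return np.where(d > threshold_px, "D", "S"), d


# Hypothetical camera and motion: sideways translation, no rotation.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.array([0.2, 0.0, 0.0])
F = fundamental_from_motion(K, R, t)

X1 = np.array([[0.0, 0.0, 5.0], [1.0, 0.5, 4.0]])   # points in frame 1
X2 = (R @ X1.T).T + t                               # same points in frame 2
X2[1] += np.array([0.0, 0.3, 0.0])                  # second point moved by itself

labels, dist = label_points(F, project(K, X1), project(K, X2))
print(labels, dist)
```

The first point obeys the epipolar constraint and is labeled S; the second has moved independently of the camera, lies about 37.5 pixels off its epipolar line, and is labeled D.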

(2): Fusion and Extension of ORB-SLAM3

On the basis of the ORB-SLAM3 algorithm, the improved YOLOv8 small-target detection algorithm is integrated, as shown in Figure 6. First, the coordinate points within the bounding boxes of detected dynamic objects are collected and semantically segmented. Then, these coordinates are compared with the coordinates of relevant feature points in the corresponding frame of ORB-SLAM3: if an extracted feature point lies inside a dynamic object’s bounding box, it is further compared with the segmentation mask coordinates and removed only if it falls on the mask; otherwise, it is retained. This method retains the static background points within dynamic bounding boxes to the greatest extent. After removing dynamic feature points, the remaining static feature points are fed into the pose estimation module, thereby preventing feature points on dynamic objects from interfering with the SLAM system. Using the optimized feature point information, a high-precision map suitable for navigation is established.
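The box-then-mask filtering described above can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: keypoints are plain (u, v) pixel coordinates, each dynamic detection is a box `(x1, y1, x2, y2)` with one boolean segmentation mask of image size, and all names and values are hypothetical.

```python
import numpy as np


def filter_keypoints(keypoints, boxes, masks):
    """Drop keypoints that lie inside a dynamic box AND on its mask.

    keypoints: (N, 2) array of (u, v) pixel coordinates.
    boxes:     list of (x1, y1, x2, y2) dynamic-object boxes.
    masks:     list of boolean (H, W) arrays, one per box.
    Returns the kept keypoints and the boolean keep vector.
    """
    keypoints = np.asarray(keypoints, dtype=float)
    u, v = keypoints[:, 0], keypoints[:, 1]
    keep = np.ones(len(keypoints), dtype=bool)
    for (x1, y1, x2, y2), mask in zip(boxes, masks):
        in_box = (u >= x1) & (u <= x2) & (v >= y1) & (v <= y2)
        cols = np.clip(np.round(u).astype(int), 0, mask.shape[1] - 1)
        rows = np.clip(np.round(v).astype(int), 0, mask.shape[0] - 1)
        on_mask = mask[rows, cols]
        # Background points inside the box but off the mask are kept.
        keep &= ~(in_box & on_mask)
    return keypoints[keep], keep


# One dynamic object: a 40 x 40 box whose mask covers only the centre.
mask = np.zeros((100, 100), dtype=bool)
mask[40:60, 40:60] = True
boxes = [(30, 30, 70, 70)]

kps = np.array([[50.0, 50.0],   # on the object: removed
                [35.0, 35.0],   # in the box, on background: kept
                [90.0, 10.0]])  # outside the box: kept
static_kps, keep = filter_keypoints(kps, boxes, [mask])
print(keep)
```

Only the keypoint on the object itself is discarded, which is what distinguishes this strategy from removing everything inside the bounding box.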

Unlike sparse reconstruction, which provides only a limited number of feature points, dense reconstruction can capture more detailed and realistic scene details. The ORB-SLAM3 algorithm focuses on vision-based tracking, object localization, and sparse 3D structure reconstruction. To extend its functionality to support dense reconstruction, the dense module introduced in this study works based on the camera trajectory and sparse feature points already computed by ORB-SLAM3, and leverages fused depth data to achieve high-density 3D point cloud reconstruction. The core algorithm module relies on the Point Cloud Library (PCL) to build the data processing pipeline. The detailed workflow is illustrated in Figure 7.

First, in the class constructor implementation of the PointCloudMapping.cc file, the mpGlobalCloud object is instantiated to maintain the 3D point cloud data, and the relevant parameters for point cloud noise reduction are configured. Meanwhile, an independent mptViewerThread thread is started in the constructor body, and a visualization Viewer is implemented. Second, the InsertKeyFrame function is designed to insert keyframe data into the tracking thread in real time, and point clouds are generated by the GeneratePointCloud function. Then, the 3D point cloud is iteratively optimized using global bundle adjustment (BA) to obtain the optimal camera pose parameters, after which the spatial coordinate data are updated. Finally, before the main program terminates, the optimized point cloud model is stored in a standardized format.

The dense map cannot be used directly for UGV navigation; it must first be converted into a grid map. Because subsequent air-ground collaboration will update the prior map, the octree map format is adopted.
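As a rough illustration of the dense-map-to-grid conversion, the sketch below voxelizes a point cloud at a single resolution (standing in for the leaf level of an octree) and projects the voxels within a height band onto a 2D occupancy grid. It is not an octree implementation such as OctoMap, which additionally stores a multi-resolution hierarchy and per-voxel occupancy probabilities. The resolution and height limits are assumed values.

```python
import numpy as np


def voxelize(points, resolution):
    """Return the unique integer voxel indices occupied by the points."""
    return np.unique(np.floor(points / resolution).astype(np.int64), axis=0)


def to_grid_map(points, resolution=0.1, z_min=0.1, z_max=1.5):
    """Project points within [z_min, z_max] onto a 2D occupancy grid.

    Returns (grid, origin): grid[i, j] is True if any voxel in that column
    is occupied; origin is the voxel index of grid[0, 0].
    """
    pts = points[(points[:, 2] >= z_min) & (points[:, 2] <= z_max)]
    if len(pts) == 0:
        return np.zeros((0, 0), dtype=bool), np.zeros(2, dtype=np.int64)
    ij = voxelize(pts, resolution)[:, :2]
    origin = ij.min(axis=0)
    shape = ij.max(axis=0) - origin + 1
    grid = np.zeros(shape, dtype=bool)
    grid[ij[:, 0] - origin[0], ij[:, 1] - origin[1]] = True
    return grid, origin


cloud = np.array([
    [0.05, 0.05, 0.5],   # obstacle
    [0.06, 0.04, 0.7],   # same column as the first point
    [1.02, 0.05, 1.0],   # second obstacle about 1 m away
    [0.50, 0.50, 0.02],  # ground return: below z_min, ignored
    [0.50, 0.50, 3.0],   # overhead structure: above z_max, ignored
])
grid, origin = to_grid_map(cloud)
print(grid.shape, int(grid.sum()), origin)
```

The height band is a design choice: it keeps obstacles the UGV could collide with and drops ground returns and overhanging structure that it can pass under.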

3.3. Multimodal Fusion Perception Model for UGV

3.3.1. BEV-Based Multimodal Fusion Perception Model

The overall framework of the model is shown in Figure 8. A multi-sensor fusion 3D object detection architecture based on a unified BEV representation is constructed. In the unified BEV space, cross-modal alignment and feature enhancement are performed between the 3D spatial features of point clouds and the semantic features of images. To address the issues that image perception for unmanned ground vehicles heavily relies on accurate depth prediction, that point cloud compression leads to loss of height information, and that models exhibit poor understanding of dynamic scenes, a novel cross-modal unified feature extractor based on deformable attention is designed. This extractor learns cross-modal spatial correlations through shared BEV queries across modalities, while a temporal attention-based BEV feature fusion module is introduced to enhance temporal understanding. Second, an adaptive dynamic weighting BEV feature fusion method is designed, which dynamically fuses the two BEV feature streams by learning from different scenes, combined with real-time weight prediction. Finally, a multi-task head decoding module is added to simultaneously perform 3D object detection, semantic segmentation, and occupancy grid mapping.

The collaborative perception network described in Figure 8 adopts a dual-stream heterogeneous feature encoding architecture for cross-modal representation learning. In the feature encoding stage, the system processes multi-source sensing data through parallel pathways: the first branch extracts point cloud features from LiDAR point clouds using PointNet++; the second branch extracts multi-view, multi-scale image features using a ResNet101+FPN. Subsequently, a deformable attention-based multimodal unified BEV feature encoder is employed to fully preserve the spatial geometric information of point clouds and the semantic information of images through a shared query mechanism, yielding two BEV feature streams with strong representational capacity. Finally, these two BEV feature streams are input into an adaptive dynamic weighting BEV feature fusion module for fusion processing, and a multi-task prediction head simultaneously outputs object detection and scene segmentation results.

Feature dimension alignment: Before fusion, the image BEV features (32 channels) are projected to 256 channels using a 1 × 1 convolutional layer. The LiDAR BEV features are already 256 channels after PointNet++ aggregation. All attention modules use a hidden dimension d_k = 256 with 8 attention heads. The final fused BEV features also have 256 channels, ensuring consistent representation across all processing stages.

PointNet++ configuration: The PointNet++ encoder uses three Set Abstraction (SA) layers with radii of 0.1 m, 0.2 m, and 0.4 m. The output feature dimensions are 64, 128, and 256 respectively. After the final SA layer, per-point features are aggregated by max pooling onto the BEV grid (128 × 128 cells), resulting in a LiDAR BEV feature tensor of size 128 × 128 × 256.
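As a rough illustration of the pooling step described above, the following sketch scatters per-point features into BEV cells and keeps the channel-wise maximum. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `max_pool_to_bev`, the ego-centred 64 m × 64 m range (borrowed from the BEV query range given in Section 3.3.2), and the sparse dictionary output are all hypothetical choices made for readability.

```python
def max_pool_to_bev(points, features, grid=128, extent_m=64.0):
    """Scatter per-point features onto a BEV grid with channel-wise max pooling.

    points:   list of (x, y) in metres, ego-centred (assumed range)
    features: list of equal-length feature vectors, one per point
    returns:  dict {(row, col): pooled feature vector}; empty cells are omitted
    """
    cell = extent_m / grid          # metres per BEV cell
    half = extent_m / 2.0
    cells = {}
    for (x, y), feat in zip(points, features):
        col = int((x + half) // cell)
        row = int((y + half) // cell)
        if not (0 <= row < grid and 0 <= col < grid):
            continue  # point lies outside the assumed BEV range
        if (row, col) in cells:
            # keep the larger value in every channel
            cells[(row, col)] = [max(a, b) for a, b in zip(cells[(row, col)], feat)]
        else:
            cells[(row, col)] = list(feat)
    return cells


pooled = max_pool_to_bev(
    points=[(0.1, 0.1), (0.2, 0.3), (100.0, 0.0)],
    features=[[1.0, 5.0], [3.0, 2.0], [9.0, 9.0]],
)
```

In this toy call the first two points fall into the same cell and are merged channel by channel, while the third point is discarded because it lies outside the grid.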

ResNet101+FPN configuration: ResNet101 is used as the backbone for the four camera views. Feature pyramid network (FPN) outputs multi-scale feature maps at 1/8, 1/16, and 1/32 scales, all with 256 channels. These are transformed to BEV space using the implicit projection method (BEVFormer-style), yielding an image BEV feature map of size 128 × 128 × 32. This is later projected to 256 channels via a 1 × 1 convolution before attention fusion.

3.3.2. Multimodal Unified BEV Feature Encoder with Temporal Optimization

A unified deformable attention mechanism is employed to directly generate aligned BEV features from the original sensor coordinate frames, thereby eliminating cross-modal discrepancies. To address the issue of information loss caused by the temporary occlusion of dynamic targets in single-frame perception, a sparse temporal perception module is introduced. This module adds temporal optimization while preserving the real-time performance of the algorithm. A schematic diagram of the proposed feature encoder is shown in Figure 9.

The multimodal unified BEV feature encoder adopts a fundamental principle of “alignment before fusion.” After feature extraction, the LiDAR features

F_{L}

and camera features

F_{C}

are still represented in different coordinate frames. To enable subsequent unified projection and alignment, a set of learnable query vectors

Q \in R^{H \times W \times N}

is defined to capture BEV features. The spatial resolution of the queries is 128 × 128, covering a 64 m × 64 m area around the vehicle, where

H \times W

denotes the BEV grid resolution and

N

is the number of channels. Each BEV query at position (x,y) is responsible for representing a small 3D reference region

P \in R^{H \times W \times 3}

corresponding to that grid cell. The key idea is that all modalities share the same set of query vectors, thereby aligning features from different modalities into the common BEV space.
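The relation between a BEV query index and its 3D reference region can be sketched as follows. With a 128 × 128 grid covering 64 m × 64 m, each cell spans 0.5 m. The ego-centred origin, the cell-centre convention, and the pillar heights in `heights` are assumptions introduced for illustration; the source does not specify them.

```python
BEV_SIZE = 128                      # cells per side (from the text)
BEV_EXTENT_M = 64.0                 # metres per side (from the text)
CELL_M = BEV_EXTENT_M / BEV_SIZE    # 0.5 m per cell


def bev_reference_points(row, col, heights=(-1.0, 0.0, 1.0, 2.0)):
    """Return the 3D reference points (x, y, z) of one BEV query.

    The query at (row, col) is assumed to sit at the centre of its cell,
    with the vehicle at the centre of the grid. `heights` is a hypothetical
    set of pillar heights in metres.
    """
    half = BEV_EXTENT_M / 2.0
    x = (col + 0.5) * CELL_M - half
    y = (row + 0.5) * CELL_M - half
    return [(x, y, z) for z in heights]


refs = bev_reference_points(64, 64)
```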

The unified projection operation maps the 3D reference points in the BEV space to the original coordinate frames of different sensor modalities, so that semantic information at the corresponding locations can be extracted from image or point cloud features. This operation consists of two parts: camera projection and LiDAR projection. Camera projection uses the camera extrinsic matrix

T_{i}

to project a 3D reference point

P

into the 2D image coordinate frame of the

i

-th camera, as expressed in Equation (1):

P_{C}^{i} = α (P, T_{i})

(1)

where

α

denotes the projection function. Similarly, the reference point can be mapped to the spatial coordinates of the LiDAR feature map, as expressed in Equation (2):

P_{L} = β (P)

(2)

where

β

represents the projection function from the 3D reference point to the LiDAR feature map coordinates.
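One possible form of the camera projection in Equation (1) is a pinhole model, sketched below. The source names only the extrinsic matrix, so the intrinsic parameters `(fx, fy, cx, cy)`, the 3 × 4 layout of `extrinsic`, and the function name `project_to_image` are assumptions for illustration rather than the authors' definition of α.

```python
def project_to_image(point_xyz, extrinsic, intrinsic):
    """Project a 3D reference point into pixel coordinates of one camera.

    extrinsic: 3x4 matrix [R | t] (list of rows) mapping the BEV/ego frame
               to the camera frame (assumed layout)
    intrinsic: (fx, fy, cx, cy) pinhole parameters (assumed, not in the text)
    returns:   (u, v) in pixels, or None if the point is behind the camera
    """
    x, y, z = point_xyz
    cam = [r[0] * x + r[1] * y + r[2] * z + r[3] for r in extrinsic]
    if cam[2] <= 1e-6:
        return None  # not visible from this view
    fx, fy, cx, cy = intrinsic
    return (fx * cam[0] / cam[2] + cx, fy * cam[1] / cam[2] + cy)


identity = [[1.0, 0.0, 0.0, 0.0],
            [0.0, 1.0, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0]]
K = (500.0, 500.0, 368.0, 128.0)
uv_center = project_to_image((0.0, 0.0, 5.0), identity, K)
uv_offset = project_to_image((1.0, 0.0, 5.0), identity, K)
uv_behind = project_to_image((0.0, 0.0, -1.0), identity, K)
```

Reference points that project outside every camera view, or behind a camera, would simply not contribute to the sum over views in the camera BEV encoder.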

After coordinate projection is completed, six layers of deformable self-attention and deformable cross-attention between the BEV queries and the sensor feature maps are used to construct the BEV feature map of each modality, denoted

F_{C}^{BEV'}

for the camera branch. The computation for the first layer of the camera BEV encoder is shown in Equation (3). The output of the last layer is the final camera BEV feature map

F_{C}^{BEV'}

, which is then passed to the fusion module. The image features extracted by ResNet101+FPN have 32 channels. They are linearly projected to 256 channels before entering the attention module. The LiDAR features from PointNet++ are aggregated into a 128 × 128 × 256 BEV grid. The cross-attention module takes the UGV’s BEV features as Query

Q \in R^{128 \times 128 \times 256}

and the UAV’s BEV features as Key and Value

K, V \in R^{128 \times 128 \times 256}

. The output fused BEV features have the same spatial size of 128 × 128 with 256 channels.

F_{C}^{BEV'} = \sum_{i = 1}^{V} \sum_{z = 1}^{D} \mathrm{DeformAttn}(Q, P_{C}^{i}(z), F_{C}^{i})

(3)

where

V

is the number of camera views, and

D

is the number of depth sampling points associated with each BEV query.

Correspondingly, the LiDAR BEV encoder performs the same operation. The first-layer feature map is expressed in Equation (4):

F_{L}^{BEV'} = \sum_{z = 1}^{D} \mathrm{DeformAttn}(Q, P_{L}(z), F_{L})

(4)

Furthermore, a temporal attention-based BEV feature fusion module is designed. It receives a sequence of fused BEV features from the current frame and the previous

N

= 3 frames, each of size 128 × 128 × 256. After three layers of temporal cross-attention and feed-forward networks, it outputs temporally enhanced BEV features of the same dimensions. The framework of the BEV feature temporal fusion module is shown in Figure 10. Through the above steps, the geometric consistency between camera and LiDAR features in the BEV space is significantly enhanced.

A temporal perception module is introduced, incorporating a temporal fusion encoder that leverages historical features. Through multi-layer BEV temporal attention and feed-forward networks, historical frame information is effectively exploited, and the queries are updated via a self-attention mechanism, resulting in more accurate temporally fused BEV features. This module adopts a three-layer encoding structure, where each layer consists of BEV temporal cross-attention and a feed-forward network. The detailed computation process consists of the following steps.

First, the BEV features of the current frame and the historical BEV features of the previous

N

frames are extracted, as shown in Equation (5).

F = \{F_{t}, F_{t - 1}, F_{t - 2}, \dots, F_{t - N}\}

(5)

All historical features are aligned to the coordinate frame of the current frame through geometric transformation (e.g., motion compensation). The aligned features are then augmented with temporal positional encoding, as shown in Equation (6).

\begin{cases} \{F_{t - 1}^{'}, F_{t - 2}^{'}, \dots, F_{t - N}^{'}\} \\ F_{τ} = F_{τ}^{'} + E_{τ}, \quad τ \in \{t, t - 1, \dots, t - N\} \end{cases}

(6)

Subsequently, the temporally position-encoded features from multiple frames are concatenated into a sequence

X

, and the correlations among features are computed via Multi-Head Self-Attention (MHSA). The process is expressed as shown in Equation (7).

\begin{cases} X = [F_{t}, F_{t - 1}, \dots, F_{t - N}] \in ℝ^{(N + 1) \times D} \\ X^{'} = \mathrm{LayerNorm}(X + \mathrm{MHSA}(X)) \end{cases}

(7)

Finally, a feed-forward network is used to further extract nonlinear features, followed by feature aggregation to output the temporally enhanced features of the current frame. The process is expressed as shown in Equation (8).

X_{out} = \mathrm{LayerNorm}(X^{'} + \mathrm{FFN}(X^{'}))

(8)

For clarity, we summarize the complete data flow with tensor dimensions. The UAV captures an RGB image of size 640 × 640 × 3, which is processed by the improved YOLOv8 network (with BiFPN) to produce multi-scale feature maps (80 × 80 × 256) and a depth map (640 × 640 × 1). These are then projected into BEV space, generating UAV BEV features of size 128 × 128 × 256. On the UGV side, the 32-beam LiDAR point cloud (N × 4 points) is fed into PointNet++ with three Set Abstraction layers (output dimensions 64, 128, 256). After max pooling onto a 128 × 128 BEV grid, LiDAR BEV features of 128 × 128 × 256 are obtained. The four GMSL cameras produce images of resolution 256 × 736 × 3 each; they are processed by ResNet101+FPN to produce multi-scale image features, which are then transformed into BEV space using the implicit projection method (BEVFormer-style), resulting in image BEV features of 128 × 128 × 32. A 1 × 1 convolution lifts the channel dimension to 256. All three BEV feature streams (UAV, LiDAR, camera) are fed into a six-layer cross-attention module with hidden dimension d_k = 256 and 8 attention heads. The output fused BEV features remain of size 128 × 128 × 256. This fused representation is then passed to a three-layer temporal fusion module that also takes the previous three frames (each 128 × 128 × 256) as input, producing temporally enhanced BEV features of the same dimensions. Finally, a multi-task head decodes the features into 3D bounding boxes, BEV semantic segmentation, and an occupancy grid. The 3D detection results are subsequently used in an incremental map update module based on Bayesian probability and D-S evidence theory to maintain an octree map with 0.1 m voxel resolution.

3.3.3. Adaptive Dynamic Weighting BEV Feature Fusion

To address the issue of varying perception confidence of different sensors under different environments, a dynamic adaptive channel-wise weighting fusion method combined with temporal attention is adopted. First, adaptive dynamic weight learning is used to compute the multimodal BEV weights, which are then summed to obtain the raw fused representation. Second, a temporal attention mechanism is applied to incorporate temporal information, yielding temporal fusion coefficients that generate a temporal correction term. Finally, the correction term is added to the raw fused representation to produce the final fusion result. A dynamic adaptive network framework is designed, as illustrated in Figure 11.

Considering the characteristics of each channel, the channel weights are adaptively adjusted. The mean fusion method is extended by incorporating a real-time weight network, resulting in a dynamic adaptive channel-wise weighting fusion module. This module consists of two stages: channel-level weight learning and adaptive fusion. In the learning stage, for each modality

m

, a weight vector

W_{m}

of length equal to the total number of channels is learned, where each element

W_{m} (i)

represents the relative importance of modality

m

in the

i

-th channel of the fused result. The mathematical expression is given in Equation (9). The weights for each modality and each feature channel are dynamically quantified using trainable parameters, which are optimized during training through backpropagation. During inference, the weights are dynamically allocated according to the actual environment.

\begin{cases} W_{lidarbev} = σ(\mathrm{MLP}(F_{lidarbev})) \\ W_{cambev} = σ(\mathrm{MLP}(F_{cambev})) \end{cases}

(9)

Here

W_{lidarbev}, W_{cambev}

represent the fusion weights for the LiDAR and camera BEV features computed by the dynamic adaptive channel-wise weighting fusion module.

σ

denotes the Sigmoid weight constraint function, which restricts the weight range to [0, 1]. The weight generation network has a two-layer fully connected structure, with the intermediate layer activated by a ReLU function.

F_{lidarbev} \in R^{H \times W \times C}

and

F_{cambev} \in R^{H \times W \times C}

are the BEV features from the LiDAR branch and the camera branch, respectively.

Before fusion, the weights are normalized with the Softmax function so that, for each channel, the weights of the two modalities sum to one. Writing W_{lidar} and W_{camera} as shorthand for W_{lidarbev} and W_{cambev}, the normalization is given in Equation (10):

W_{lidar}^{'} = \frac{e^{W_{lidar}}}{e^{W_{lidar}} + e^{W_{camera}}}, \quad W_{camera}^{'} = 1 - W_{lidar}^{'}

(10)

After obtaining the channel weights, the process proceeds to the weighted fusion stage. The expression for the weighted fusion BEV feature is given by Equation (11).

F_{fusion} = W_{lidar}^{'} ⊙ F_{lidar} + W_{camera}^{'} ⊙ F_{camera}

(11)

where ⊙ denotes element-wise (channel-wise) multiplication.
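Equations (9) to (11) can be traced for a single BEV cell with the short sketch below. It is a simplified illustration under stated assumptions: the MLP of Equation (9) is not modelled, so its pre-activation outputs are passed in directly as `logit_lidar` and `logit_cam`, and the function name `fuse_channels` is hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fuse_channels(f_lidar, f_cam, logit_lidar, logit_cam):
    """Channel-wise adaptive fusion of LiDAR and camera features for one cell.

    logit_*: per-channel MLP outputs before the sigmoid (assumed given).
    returns: (fused feature vector, list of normalized (lidar, camera) weights)
    """
    fused, weights = [], []
    for fl, fc, zl, zc in zip(f_lidar, f_cam, logit_lidar, logit_cam):
        w_l, w_c = sigmoid(zl), sigmoid(zc)      # Eq. (9): weights in [0, 1]
        e_l, e_c = math.exp(w_l), math.exp(w_c)
        w_l_norm = e_l / (e_l + e_c)             # Eq. (10): two-way softmax
        w_c_norm = 1.0 - w_l_norm
        fused.append(w_l_norm * fl + w_c_norm * fc)  # Eq. (11)
        weights.append((w_l_norm, w_c_norm))
    return fused, weights


fused, weights = fuse_channels(
    f_lidar=[2.0, 4.0], f_cam=[0.0, 0.0],
    logit_lidar=[0.0, 3.0], logit_cam=[0.0, -3.0],
)
```

Because the sigmoid outputs lie in [0, 1], the two-way softmax of Equation (10) keeps each normalized weight between roughly 0.27 and 0.73, so under this formulation neither modality is ever suppressed completely while both are present.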

When only one modality is available, Softmax is applied to each channel and the full weight is assigned to that modality. For example, when only the camera modality is available, the expression is shown in Equation (12):

F_{fusion} = F_{camera}

(12)

The model can automatically adjust the contribution of each modality according to different scenarios (e.g., day/night, degree of occlusion). It introduces a small number of learnable parameters, allowing it to optimize the relative importance of each modality in the fusion process, while still providing meaningful outputs even when only a single modality is available as input.

3.3.4. Multi-Task Head Decoding

A multi-task collaborative design is adopted, where a unified BEV feature representation is shared among all task heads. All task heads make predictions based on the same set of BEV features, avoiding the feature redundancy common in traditional multi-task models. This design is task-agnostic, allowing new task heads (e.g., dynamic object tracking, end-to-end path planning) to be flexibly added with minimal modification to the head structures, without affecting the underlying fusion framework. The architecture of the multi-task head module is shown in Figure 12.

(1): 3D detection head. To meet the requirements of high real-time performance and detection accuracy in urban scenarios, a 3D detection head based on Deformable DETR is adopted. Its design fully leverages the advantages of BEV features, using the fused BEV features as input to the decoder. By exploiting the deformable attention mechanism, it can adaptively sample and aggregate features at different positions and scales in the BEV feature map according to the actual distribution of target objects, resulting in more accurate detection boxes.
(2): Semantic segmentation head. The semantic segmentation head is an important component of the multi-task perception framework. It performs multi-class binary semantic segmentation based on the fused unified BEV features, decomposing the semantic segmentation task into multiple independent binary classification problems, with one segmentation head per class.
(3): Occupancy prediction network. To better enable scene-level modeling, an occupancy prediction network is introduced on top of the above two tasks. Occupancy prediction is a core technique for modeling the geometry and semantics of 3D scenes. It determines whether each voxel is occupied by an object and its category through voxel-wise classification or probability prediction. Compared with traditional 3D bounding box detection, occupancy provides a finer description of the geometric details of irregular objects and dynamic scenes.

3.4. UAV–UGV Collaborative Perception Model

3.4.1. Cross-Platform Feature Fusion for 3D Object Perception

A cross-domain collaborative perception system consisting of the UAV and UGV is established to achieve accurate perception of unknown environments during cooperative tasks such as infrastructure inspection, disaster response, and smart city operations. The UAV is employed to fill the visual blind spots of the UGV, while the UGV enhances local details, jointly improving the recall rate of object detection. Through the fusion and updating of collaborative information, the richness of mapping information is increased, resulting in a detailed perceptual map. A dual-encoder architecture with a cross-attention mechanism is adopted for cross-platform feature association, fusing multi-view features in urban scenarios to achieve holistic environmental understanding. The overall collaborative fusion workflow is shown in Figure 13.

The collaborative fusion algorithm uses separate encoder branches to process the UAV’s overhead features, avoiding information loss caused by premature fusion and thereby fully capturing the correlation between the two platforms. To preserve both texture and spatial information in the aerial images, a lightweight ResNet-50 network is employed to extract image features and depth features, respectively. The UAV’s overhead image

I_{drone} \in R^{H \times W \times 3}

is taken as the network input, and after processing, it outputs a feature map

F_{drone} = \mathrm{ResNet}(I_{drone}) \in R^{H \times W \times C}

that retains global semantic information. Figure 14a shows an original overhead image, and Figure 14b shows the RGB feature map from the layer4[0].conv1 layer of ResNet-50.

For depth images, to enable batch-level concatenation with RGB images, the depth maps are normalized to the range [0, 255] and converted into three-channel images via color mapping. They are then fed into the backbone network separately from the RGB images for feature extraction. RGB features contain rich texture information, whereas depth features focus on spatial location information. To effectively leverage both types of features, the implicit BEV feature generation method from BEVFormer is adopted, and the multi-view camera spatial features in that method are replaced with depth features. The framework of the designed cross-attention fusion module is shown in Figure 15.

After obtaining the UGV’s multi-camera front-view BEV features and the UAV’s overhead BEV features, the overhead BEV feature space is transformed via coordinate mapping into the UGV’s BEV space. This transformation facilitates subsequent BEV feature fusion for collaborative tasks, leading to more comprehensive and accurate 3D object perception.

After unifying the coordinate frames, an attention-based fusion module is employed for modality complementarity. The BEV features of the UGV serve as the Query (

Q

) to capture local ground details, while the UAV’s BEV features serve as the Key (

K

) and Value (

V

) to provide global information. A six-layer cross-attention stack is used to enhance feature interaction. The cross-attention calculation formula is shown in Equation (13):

\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}(\frac{Q K^{T}}{\sqrt{d_{k}}} + M_{geo}) V \in R^{H \times W \times D}

(13)

where

Q \in R^{N_{q} \times d_{k}}

,

K, V \in R^{N_{k} \times d_{k}}

are query, key, value matrices from UGV and UAV BEV features with

N_{q} = N_{k} = 128 \times 128

and

d_{k} = 256

. The spatial weight

M_{geo} \in R^{N_{q} \times N_{k}}

is computed by passing the UAV’s depth map through a two-layer CNN (kernel size 3, ReLU activation) followed by a sigmoid function, normalizing each element to [0, 1]. This weight strengthens attention on geometrically consistent regions between the two platforms.
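A single-head version of Equation (13) can be written in a few lines, as sketched below. It is illustrative only: the multi-head split, the learned projection matrices, and the CNN that produces M_geo are omitted, so `M_geo` is simply passed in as a precomputed bias matrix, and the function name `cross_attention` is hypothetical.

```python
import math


def cross_attention(Q, K, V, M_geo):
    """Single-head scaled dot-product cross-attention with an additive bias.

    Q:     N_q x d_k query rows (UGV BEV features)
    K, V:  N_k x d_k key/value rows (UAV BEV features)
    M_geo: N_q x N_k geometric bias added to the attention logits
    """
    d_k = len(Q[0])
    out = []
    for qi, q in enumerate(Q):
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) + M_geo[qi][ki]
            for ki, k in enumerate(K)
        ]
        top = max(scores)                       # subtract max for stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        attn = [e / total for e in exps]        # softmax over the keys
        out.append([sum(a * v[c] for a, v in zip(attn, V))
                    for c in range(len(V[0]))])
    return out


Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
plain = cross_attention(Q, K, V, M_geo=[[0.0, 0.0]])
biased = cross_attention(Q, K, V, M_geo=[[0.0, 5.0]])
```

In the toy call, the query initially attends mostly to the first key; adding a large bias on the second key shifts the attention there, which is the role the text assigns to the geometric weight.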

After the above steps, the BEV features generated by the UGV and the UAV’s overhead BEV features are fused, producing cross-domain fused multi-camera BEV features. Subsequently, these features are processed by a BEV feature fusion algorithm based on dynamic weighting and temporal attention, which can further incorporate LiDAR features, ultimately constructing a collaborative environmental perception model with three-dimensional spatial continuity.

3.4.2. UAV-UGV Spatiotemporal Synchronization

(1): Collaborative coordinate transformation

Figure 16 presents a schematic diagram of the sensor coordinate frames, illustrating the relative transformations between them.

A LiDAR-camera joint calibration method is adopted, in which the camera and the LiDAR synchronously acquire calibration checkerboard data. By combining planar calibration constraints with 3D LiDAR constraints, the extrinsic parameters are solved using the least squares method. A schematic diagram of the joint calibration for one camera and the LiDAR is shown in Figure 17. The intrinsic calibration is applied to the sensor calibration of both the UGV and the UAV. The extrinsic calibration is performed for the multi-sensor calibration of the UGV, which includes four cameras and one LiDAR; therefore, the extrinsic calibration procedure needs to be carried out four times to unify the camera and LiDAR coordinate systems.

The coordinate positions and transformation relationship between the UAV and the UGV are illustrated in Figure 18. The UAV obtains its longitude and latitude via an onboard GPS and calculates its global pose

T_{UAV}^{world}

by fusing GPS data with IMU measurements. The UGV acquires its pose

T_{UGV}^{world}

through an integrated navigation system (GNSS + IMU). After obtaining the poses of both platforms relative to the world coordinate system, the UAV’s BEV coordinate frame is transformed into the world coordinate system through coordinate conversion, as expressed in Equation (14).

F_{UAV}^{world} = T_{UAV}^{world} \cdot F_{UAV}^{BEV}

(14)

The resulting coordinates are then further transformed into the UGV’s coordinate system, as shown in Equation (15).

F_{UGV}^{UAV} = (T_{UGV}^{world})^{- 1} \cdot F_{UAV}^{world}

(15)
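The two-step mapping of Equations (14) and (15) can be illustrated with planar rigid transforms, as in the sketch below. It assumes that each pose reduces to a yaw angle and a 2D translation on the BEV plane, which ignores roll, pitch, and altitude; the helper names `se2`, `se2_inverse`, `apply`, and `uav_to_ugv` are hypothetical.

```python
import math


def se2(yaw, tx, ty):
    """Homogeneous 3x3 matrix for a planar pose (yaw in radians)."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, -s, tx], [s, c, ty], [0.0, 0.0, 1.0]]


def se2_inverse(T):
    """Closed-form inverse of a planar rigid transform: (R^T, -R^T t)."""
    c, s, tx, ty = T[0][0], T[1][0], T[0][2], T[1][2]
    return [[c, s, -(c * tx + s * ty)],
            [-s, c, -(-s * tx + c * ty)],
            [0.0, 0.0, 1.0]]


def apply(T, p):
    return (T[0][0] * p[0] + T[0][1] * p[1] + T[0][2],
            T[1][0] * p[0] + T[1][1] * p[1] + T[1][2])


def uav_to_ugv(p_uav, T_uav_world, T_ugv_world):
    """Map a point from the UAV BEV frame into the UGV BEV frame."""
    p_world = apply(T_uav_world, p_uav)               # Eq. (14)
    return apply(se2_inverse(T_ugv_world), p_world)   # Eq. (15)


T_uav = se2(0.0, 10.0, 0.0)
T_ugv = se2(math.pi / 2, 10.0, 0.0)
p_in_ugv = uav_to_ugv((0.0, 1.0), T_uav, T_ugv)
```

In a full implementation the same composition would be applied to every BEV cell centre, followed by resampling of the UAV feature map onto the UGV grid.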

In the proposed system, the BEV plane is defined as a two-dimensional grid centered on the UGV, divided into many small cells. The height

H

and width

W

of this plane represent the number of grid cells along the X-axis (lateral) and Y-axis (longitudinal), respectively. This plane intuitively reflects the physical space around the vehicle, where each cell corresponds to a fixed area in the real world. In this paper, the UGV coordinate system coincides with the BEV coordinate system used in the UGV fusion algorithm, and the UAV coordinate system coincides with its aerial BEV coordinate system; however, the BEV coordinate system is two-dimensional. During coordinate transformation, the information along the Z-axis of the transformation matrix must be properly handled. Through the above procedures, different BEV feature spaces can be transformed to achieve BEV feature fusion.

(2): Time synchronization

Figure 19 illustrates the hardware time synchronization scheme for the UGV.

After achieving clock signal synchronization, although the cumulative deviation of the time reference can be effectively suppressed, the discrete and independent sampling mechanisms of heterogeneous sensor nodes make it difficult to ensure strictly aligned acquisition of multi-source information in the time domain. To address this, this paper adopts the PPS (Pulse Per Second) signal sent by the GPS as a trigger mechanism to actively adjust the data acquisition period according to mission requirements, ensuring that data are collected at specific time points. Specifically, when the system reaches the preset GPS time reference, the rising edge of the PPS signal is used as a unified trigger source to simultaneously activate data acquisition from the LiDAR, cameras, inertial measurement unit (IMU), and odometry encoder, thereby achieving precise timestamp alignment of heterogeneous sensor data.

For time synchronization between the UAV and UGV, both platforms share GPS timestamps and calibrate the network delay using the NTP protocol. The synchronization process is expressed by the following formula:

t_{UAV} = t_{UGV} + Δt_{network}

(16)

where

Δt_{network}

represents the network delay during wireless transmission, which must be measured in real time.
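A simple way to use Equation (16) when pairing frames is sketched below: the measured delay is subtracted from each UAV timestamp, and each UGV frame is matched to the nearest corrected UAV frame. The function name `match_frames` and the 5 ms default tolerance are assumptions for illustration; the source does not describe its matching policy.

```python
def match_frames(ugv_stamps, uav_stamps, net_delay_s, tol_s=0.005):
    """Pair UGV and UAV frames by timestamp after delay compensation.

    ugv_stamps, uav_stamps: lists of timestamps in seconds
    net_delay_s:            measured network delay (Delta t_network)
    tol_s:                  maximum residual offset accepted (assumed value)
    returns:                list of (ugv_index, uav_index) pairs
    """
    # Invert Eq. (16): express UAV timestamps on the UGV time base.
    corrected = [t - net_delay_s for t in uav_stamps]
    pairs = []
    if not corrected:
        return pairs
    for i, t_ugv in enumerate(ugv_stamps):
        j = min(range(len(corrected)), key=lambda k: abs(corrected[k] - t_ugv))
        if abs(corrected[j] - t_ugv) <= tol_s:
            pairs.append((i, j))
    return pairs


pairs = match_frames(
    ugv_stamps=[0.0, 0.1, 0.2],
    uav_stamps=[0.021, 0.122, 0.250],
    net_delay_s=0.02,
)
```

Frames without a partner inside the tolerance are dropped rather than fused with a stale observation.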

The synchronization accuracy (timestamp uncertainty ±5 ms) and its impact on fusion performance are analyzed, showing that mAP degradation remains below 1% for synchronization errors up to 10 ms.

(3): Calibration experiment.

The joint calibration process consists of two parts: intrinsic calibration of the cameras and extrinsic calibration between the LiDAR and the cameras. Intrinsic calibration is performed using the Camera-Calibration package of the ROS platform. The calibration process and its results are shown in Figure 20, where the red box indicates the 3 × 3 camera intrinsic matrix and the blue box indicates the 1 × 5 distortion matrix.

The extrinsic calibration between the LiDAR and the cameras is performed using the Calibration-Toolkit. First, a checkerboard with 8 × 6 squares, each of size 7.4 cm × 7.4 cm, is prepared. Then, images of the checkerboard at different viewpoints and positions within the field of view of both the cameras and the LiDAR are captured and fed into the calibration package. After joint calibration, a 4 × 4 extrinsic matrix is obtained. The extrinsic calibration process for the camera and LiDAR is shown in Figure 21a,b, and the resulting 4 × 4 extrinsic matrix is presented in Figure 21c.

3.4.3. Incremental Map Updating Algorithm

UGVs are confined to terrestrial locomotion, with a narrow perceptual field of view but strong capability for perceiving ground objects. Although UAVs offer wide-area coverage, their aerial perspective makes them susceptible to occlusions, leading to degraded recognition accuracy for ground targets. To address this issue, an incremental map updating algorithm based on spatial registration is proposed. As the UGV advances, it fuses the local map with the global map, thereby constructing a semantically rich map while maintaining real-time performance. The framework of the proposed map updating algorithm is shown in Figure 22.

After the multi-sensor data from the UGV are dynamically fused by a unified encoder, the generated BEV features contain rich geometric and semantic information. Moreover, the segmentation results from its multi-task head can be directly assigned to the semantic attributes of the corresponding octree nodes, supporting semantic updating in dynamic scenes. This fusion process mainly consists of three parts: map matching, resolution alignment, and dynamic updating, which are detailed as follows:

First, the coordinate frame of the UGV in BEV perception (UGV frame) is aligned with the global coordinate frame of the prior octree map (world frame). The occupancy grid map and detection results output by BEV are both referenced to the UGV frame. The transformation formula is given in Equation (15).

Secondly, the local BEV occupancy grid and the corner points of the 3D detection bounding boxes are mapped. Given a BEV occupancy grid resolution of

δ_{bev} = 0.5

m per voxel, the coordinates of the center point of each voxel in the octree map are given by the following formula:

\begin{cases} x_{w} = i \cdot δ_{bev} \cdot \cos θ - j \cdot δ_{bev} \cdot \sin θ + T_{x} \\ y_{w} = i \cdot δ_{bev} \cdot \sin θ + j \cdot δ_{bev} \cdot \cos θ + T_{y} \\ z_{w} = k \cdot δ_{bev} + T_{z} \end{cases}

(17)

where

(θ, T_{x}, T_{y}, T_{z})

represents the yaw angle and translation of the UGV relative to the global coordinate system, updated in real time using GPS/IMU data.
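Equation (17) translates directly into code. The sketch below is a literal transcription; the function name `bev_voxel_to_world` and the argument layout are hypothetical.

```python
import math


def bev_voxel_to_world(i, j, k, theta, T, delta_bev=0.5):
    """Map BEV voxel indices (i, j, k) to world coordinates, per Eq. (17).

    theta: yaw of the UGV in the world frame (radians)
    T:     (T_x, T_y, T_z) translation of the UGV in the world frame
    """
    tx, ty, tz = T
    c, s = math.cos(theta), math.sin(theta)
    x_w = i * delta_bev * c - j * delta_bev * s + tx
    y_w = i * delta_bev * s + j * delta_bev * c + ty
    z_w = k * delta_bev + tz
    return (x_w, y_w, z_w)


aligned = bev_voxel_to_world(2, 4, 1, theta=0.0, T=(10.0, 20.0, 0.0))
turned = bev_voxel_to_world(2, 0, 0, theta=math.pi / 2, T=(0.0, 0.0, 0.0))
```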

For a 3D detection box of size

(w, h, l)

, yaw angle

ϕ

, and center coordinates

(x, y, z)

, the global coordinates of its eight corner points are calculated as shown in Equation (18).

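X_{corner} = R_{θ} \cdot (R_{ϕ} \cdot \begin{bmatrix} \pm w / 2 \\ \pm h / 2 \\ \pm l / 2 \end{bmatrix} + \begin{bmatrix} x \\ y \\ z \end{bmatrix}) + T

(18)

where

R_{ϕ}

is the rotation matrix of the detection box around the Z-axis (yaw angle ϕ), and R_{θ} and T = (T_{x}, T_{y}, T_{z}) are the rotation about the Z-axis by the UGV yaw angle θ and the translation of the UGV relative to the global coordinate system, as in Equation (17).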
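The corner computation of Equation (18) can be sketched as two successive planar rotations. The sketch reads the equation as a box-yaw rotation followed by the UGV-to-world transform of Equation (17); this reading, the function name `box_corners_world`, and the assignment of w, h, l to the local x, y, z axes are assumptions for illustration.

```python
import math
from itertools import product


def box_corners_world(center, size, phi, theta, T):
    """Return the eight corners of a 3D box in world coordinates.

    center: (x, y, z) box centre in the UGV frame
    size:   (w, h, l) box extents along the local x, y, z axes (assumed order)
    phi:    box yaw in the UGV frame; theta, T: UGV yaw and translation in world
    """
    w, h, l = size
    cx, cy, cz = center
    corners = []
    for sx, sy, sz in product((-1, 1), repeat=3):
        ox, oy, oz = sx * w / 2, sy * h / 2, sz * l / 2
        # rotate the local offset by the box yaw and shift to the box centre
        ux = math.cos(phi) * ox - math.sin(phi) * oy + cx
        uy = math.sin(phi) * ox + math.cos(phi) * oy + cy
        uz = oz + cz
        # transform from the UGV frame to the world frame
        corners.append((math.cos(theta) * ux - math.sin(theta) * uy + T[0],
                        math.sin(theta) * ux + math.cos(theta) * uy + T[1],
                        uz + T[2]))
    return corners


cs = box_corners_world((5.0, 2.0, 1.0), (2.0, 4.0, 6.0), 0.3, 1.0, (10.0, -3.0, 0.5))
```

A quick sanity check is that the mean of the eight corners equals the transformed box centre, whatever the two yaw angles are.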

If the minimum voxel resolution of the octree is

δ_{oct}

, the BEV grid voxels with resolution

δ_{bev}

are mapped to the octree by resolution alignment (maximum propagation or averaging), as shown in Equation (19).

P_{oct}(v) = \begin{cases} \max_{u \in U(v)} P_{bev}(u), & \text{if } δ_{bev} \geq δ_{oct} \\ \frac{1}{N} \sum_{u \in U(v)} P_{bev}(u), & \text{if } δ_{bev} < δ_{oct} \end{cases}

(19)

Here

P_{oct}

is the occupancy probability of an octree node, and

U(v)

is the set of BEV voxels corresponding to octree node

v

. When

δ_{bev} \geq δ_{oct}

, a single BEV voxel covers multiple octree nodes, and probability occupancy propagation is applied. When

δ_{bev} < δ_{oct}

, the BEV voxels are sampled.
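The two branches of Equation (19) amount to a maximum or a mean over the BEV voxels associated with an octree node. A minimal sketch follows; the function name `align_resolution` is hypothetical, and N is taken to be the number of voxels in U(v).

```python
def align_resolution(p_bev_in_node, delta_bev, delta_oct):
    """Occupancy probability of one octree node from its BEV voxels, Eq. (19).

    p_bev_in_node: occupancy probabilities P_bev(u) for u in U(v)
    """
    if not p_bev_in_node:
        raise ValueError("U(v) must contain at least one BEV voxel")
    if delta_bev >= delta_oct:
        # coarse BEV voxel over finer octree nodes: propagate the maximum
        return max(p_bev_in_node)
    # finer BEV voxels inside one octree node: average them
    return sum(p_bev_in_node) / len(p_bev_in_node)


coarse_to_fine = align_resolution([0.2, 0.9], delta_bev=0.5, delta_oct=0.1)
fine_to_coarse = align_resolution([0.2, 0.8], delta_bev=0.05, delta_oct=0.1)
```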

Finally, the octree is dynamically updated using a combination of dynamic Bayesian probability and D-S evidence theory. Dynamic Bayesian processing first handles the temporal decay of the prior map, followed by D-S evidence theory to fuse historical and current observations from multiple maps. The workflow of the incremental map updating algorithm is shown in Figure 23.

First, a time decay weight

w_{bev}

is set for dynamic Bayesian updating. Based on the observation time difference

Δ t

, an exponential decay factor

λ

is introduced to adjust the confidence

w_{bev} = e^{- λ Δ t}

of the BEV voxel data, preventing outdated data from excessively influencing the current state and thereby improving update accuracy. Then, for each octree voxel

v

, Bayesian probability fusion is performed by combining the prior probability

P_{prior}

with the occupancy probability

P_{bev}

from the fused BEV observations, as shown in Equation (20).

P = \frac{w_{bev} P_{bev} \cdot P_{prior}}{w_{bev} P_{bev} \cdot P_{prior} + (1 - w_{bev}) (1 - P_{bev}) \cdot (1 - P_{prior})}

(20)

Next, basic probability assignment is performed. The BEV evidence assignment and the prior map evidence assignment are given by Equation (21) and Equation (22), respectively.

m_{bev}\{occ\} = P, \quad m_{bev}\{free\} = 1 - P

(21)

m_{prior}\{occ\} = P_{prior}, \quad m_{prior}\{free\} = 1 - P_{prior}

(22)

Finally, the conflict coefficient

K \in [0, 1]

is calculated, which represents the degree of contradiction between two pieces of evidence over all possible conflicting propositions. The calculation formula is shown in Equation (23).

K = m_{bev}\{occ\} \cdot m_{prior}\{free\} + m_{bev}\{free\} \cdot m_{prior}\{occ\}

(23)

The threshold

K = 0.3

was empirically determined by sweeping

K

from 0.1 to 0.5 on a validation set, where it achieved the best trade-off between map F1-score and localization accuracy. In typical urban road scenarios, if

K \geq 0.3

, the conflict threshold is exceeded, triggering semantic-assisted decision-making. The decision method is expressed as shown in Equation (24).

P_{oct} = \begin{cases} C_{det}, & \text{if semantics are box-covered} \\ m_{bev}, & \text{if semantics are dynamic} \\ m_{prior}, & \text{if semantics are static} \end{cases}

(24)

If

K < 0.3

, the Dempster combination rule is directly applied to fuse and generate the voxel grid, as shown in Equation (25).

P_{oct}^{'} = \frac{\sum_{A \cap B = \{occ\}} m_{bev}(A) \, m_{prior}(B)}{1 - K} = \frac{m_{bev}\{occ\} \cdot m_{prior}\{occ\}}{1 - K}

(25)
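The update chain of Equations (20) to (25) can be followed for a single voxel with the sketch below. It is illustrative only: Equation (20) is transcribed as printed, the fusion step uses the standard Dempster combination for the occupied hypothesis on the two-element frame {occ, free}, and the semantic rule of Equation (24) is left to the caller. The function names and the return convention are hypothetical.

```python
import math


def decayed_bayes(p_bev, p_prior, dt, lam):
    """Time-decayed Bayesian fusion of one voxel, transcribed from Eq. (20)."""
    w_bev = math.exp(-lam * dt)
    num = w_bev * p_bev * p_prior
    den = num + (1.0 - w_bev) * (1.0 - p_bev) * (1.0 - p_prior)
    return num / den if den > 0.0 else p_prior


def ds_update(p, p_prior, k_threshold=0.3):
    """Dempster-Shafer fusion of BEV and prior evidence for one voxel.

    returns (fused occupancy or None, conflict coefficient K); None signals
    that K reached the threshold and the semantic rule of Eq. (24) applies.
    """
    m_bev = {"occ": p, "free": 1.0 - p}                    # Eq. (21)
    m_prior = {"occ": p_prior, "free": 1.0 - p_prior}      # Eq. (22)
    K = (m_bev["occ"] * m_prior["free"]
         + m_bev["free"] * m_prior["occ"])                 # Eq. (23)
    if K >= k_threshold:
        return None, K
    return m_bev["occ"] * m_prior["occ"] / (1.0 - K), K    # Dempster's rule


p = decayed_bayes(p_bev=0.8, p_prior=0.5, dt=1.0, lam=math.log(2.0))
agree, k_low = ds_update(0.8, 0.9)
conflict, k_high = ds_update(0.8, 0.2)
```

With λΔt = ln 2 the decay weight is 0.5 and Equation (20) reduces to the ordinary two-hypothesis Bayes update, which is a convenient check on the transcription. In the second call the two sources agree and reinforce each other, whereas in the third the conflict coefficient exceeds 0.3 and the decision is deferred to the semantic rule.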

4. Experiments and Results

4.1. Experimental Configuration

The UGV and UAV platforms used in the experiments are shown in Figure 24. The UGV is equipped with the following perception devices: four cameras (SG8-AR0820C-G2A, Sensing, Shanghai, China) with different viewing angles (yellow box), one 32-beam LiDAR (RS-LiDAR-32, RoboSense, Shenzhen, China) (green box), one millimeter-wave radar (ARS408-21, Continental, Frankfurt, Germany) (red box), one navigation system (i70 GNSS/INS, StarNeto, Beijing, China) (purple box), and one domain controller Jetson AGX Orin 64G (NVIDIA, Santa Clara, CA, USA) (blue box). The onboard computing environment is based on Ubuntu 20.04 LTS with CUDA 11.4 and cuDNN 8.2. The multimodal fusion model is implemented in PyTorch 1.14 with torchvision 0.15. The aerial platform is an AMVLAB P410 UAV (AMVLAB Ltd., Shenzhen, China), with its core computing unit being the XU3Pai X3 (RDK X3, Horizon Robotics, Beijing, China) (green box). The communication module (yellow box) adopts the Mini Homer image and telemetry data link (CUAV, Shenzhen, China), suitable for real-time data transmission and remote control. Onboard sensors include an M8N GPS (Ublox, Thalwil, Switzerland), a TF-MINI altitude-holding radar (Benewake, Beijing, China) (blue box), and an Intel Realsense D455 visual sensor (Intel Corp., Santa Clara, CA, USA) (red box). The takeoff weight of the UAV is 10 kg.

4.2. UAV Dynamic SLAM Experiments

4.2.1. Small-Target Detection Results Analysis

Two different enhancement algorithms are selected for comparative evaluation: YOLOv8+SAHI, which detects smaller targets by optimizing inference, and YOLOv8+BiFPN (the proposed algorithm), which optimizes the backbone network and fuses it with a small-target detection layer. The recognition performance of the two algorithms is evaluated on the VisDrone2019-DET dataset. The quantitative results and the percentage improvements over YOLOv8s are shown in Table 2.

As can be seen from Table 2, after applying the SAHI slicing algorithm, while the number of parameters remains unchanged, the mAP@0.5 and mAP@0.5:0.95 metrics increase by 12% and 15%, respectively, compared to YOLOv8s, but the computational cost increases by 56.6 G. In contrast, after optimizing the YOLOv8s small-target detection layer with BiFPN, the mAP@0.5 and mAP@0.5:0.95 metrics increase by 21% and 27%, respectively, with a computational cost increase of 7.9 G and a parameter increase of 2.3 M. A comprehensive comparison shows that the proposed algorithm significantly improves detection accuracy while maintaining high real-time performance.
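To illustrate why slicing-based inference raises the computational cost without adding parameters, the sketch below enumerates overlapping windows in the way slicing-aided inference typically does. Each window is passed through the same detector, so compute grows with the number of windows while the weights stay fixed. The slice size (640) and overlap ratio (0.2) are illustrative assumptions rather than values reported here, and `slice_windows` is a hypothetical helper, not the SAHI library API.

```python
def slice_windows(width, height, slice_size=640, overlap=0.2):
    """Return (x0, y0, x1, y1) windows that tile an image with overlap.

    The last window in each row and column is shifted back so that it
    ends exactly at the image border instead of running past it.
    """
    step = max(1, int(slice_size * (1.0 - overlap)))
    max_x0 = max(width - slice_size, 0)
    max_y0 = max(height - slice_size, 0)
    windows = []
    y = 0
    while True:
        y0 = min(y, max_y0)
        x = 0
        while True:
            x0 = min(x, max_x0)
            windows.append((x0, y0,
                            min(x0 + slice_size, width),
                            min(y0 + slice_size, height)))
            if x0 + slice_size >= width:
                break
            x += step
        if y0 + slice_size >= height:
            break
        y += step
    return windows
```

Under these assumed settings a 1920 × 1080 frame yields eight windows, so the detector runs eight times per frame (plus, in typical slicing pipelines, once on the full image).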

The experimental results in real-world scenes are shown in Figure 25. During small-target detection by the UAV, the improved algorithm achieves higher recognition accuracy for pedestrians and distant vehicles than the original algorithm. In Figure 25a, the car inside the purple box and the pedestrian inside the red box are not detected by the original algorithm, whereas both are successfully detected in Figure 25b.

4.2.2. Dynamic Feature Point Elimination Experiment

To better verify the dynamic mapping performance of the proposed algorithm, feature points in dynamic scenes are collected using the TUM dataset and a self-built experimental scenario. The simulation and experimental results are shown in Figure 26.

After obtaining the small-target sequences, the IDs of targets that are classified as dynamic are fed into the segmentation module. An accurate dynamic pixel range is obtained via the segmentation algorithm, based on which dynamic points are eliminated. The performance of the proposed algorithm is compared with that of the original algorithm before and after feature point elimination. As shown in Figure 26a,d, the two images respectively show a pedestrian from the TUM dataset and a pedestrian from the experimental scene; both pedestrians are in motion. Figure 26b,e present the object detection results and all feature points obtained by YOLOv8. The blue boxes indicate the approximate extents of the targets, the white areas inside the boxes represent the segmented dynamic objects, and the white labels indicate the target confidence scores. It can be observed that dynamic objects introduce a large number of interfering dynamic points during map construction. Figure 26c,f show the results after dynamic points are eliminated by the proposed algorithm. All dynamic feature points on the pedestrians have been removed, while background feature points inside the detection boxes (excluding the pedestrians) are retained. The experimental results demonstrate that the proposed method is effective in distinguishing dynamic from static features and possesses a strong capability for dynamic point elimination.
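The elimination step described above can be sketched as a mask lookup. This is a simplified illustration, assuming the segmentation module outputs a per-pixel boolean mask of dynamic objects and that feature points are given as (x, y) pixel coordinates; `remove_dynamic_keypoints` is a hypothetical name, not a function of ORB-SLAM3 or YOLOv8.

```python
def remove_dynamic_keypoints(keypoints, dynamic_mask):
    """Keep only feature points that do not fall on a dynamic pixel.

    keypoints:    list of (x, y) pixel coordinates.
    dynamic_mask: 2D list indexed as mask[row][col]; True marks pixels
                  segmented as belonging to a dynamic object.
    """
    height = len(dynamic_mask)
    width = len(dynamic_mask[0]) if height else 0
    static_points = []
    for x, y in keypoints:
        col, row = int(round(x)), int(round(y))
        inside = 0 <= row < height and 0 <= col < width
        if inside and dynamic_mask[row][col]:
            continue  # point lies on a dynamic object: discard it
        static_points.append((x, y))
    return static_points
```

Because the test is made against the segmentation mask rather than the detection box, background points that lie inside the box but off the pedestrian are retained, which matches the behavior reported for Figure 26c,f.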

4.2.3. Dense Mapping Experiment

Since the map constructed by ORB-SLAM3 is a sparse point cloud map, it cannot be directly used for navigation tasks such as path planning. Therefore, a dense mapping module is added on top of the sparse point cloud, and the Octomap package is used to convert the dense point cloud map into an octree map that can be directly used for navigation, providing a reliable prior map for subsequent tasks. The mapping results using the TUM dataset and a real-world scene are shown in Figure 27.
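As a rough illustration of the dense-cloud-to-voxel step, the sketch below bins points into cubic cells and keeps cells with enough hits. It is a deliberate simplification: the Octomap package maintains probabilistic occupancy in an octree and uses ray casting to clear free space, none of which is modeled here. The 0.1 m resolution and the `min_hits` filter are assumptions for illustration only.

```python
import math
from collections import Counter


def voxelize(points, resolution=0.1, min_hits=2):
    """Map 3D points to integer voxel keys and keep well-supported voxels.

    points:     iterable of (x, y, z) coordinates in meters.
    resolution: voxel edge length in meters.
    min_hits:   minimum number of points a voxel needs to count as occupied,
                which suppresses isolated noise points.
    """
    counts = Counter(
        tuple(math.floor(coord / resolution) for coord in point)
        for point in points
    )
    return {key for key, hits in counts.items() if hits >= min_hits}
```

The resulting set of integer keys is the kind of occupied-cell representation a path planner can query directly, which a sparse feature map does not provide.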

Figure 27a,d show the mapping results of the original ORB-SLAM3 algorithm in scenes containing dynamic objects. It can be observed that ORB-SLAM3 directly uses video frames as keyframes, resulting in numerous ghosting point clouds that severely degrade mapping quality. Figure 27b,e present the mapping results of the proposed algorithm. Compared with Figure 27a,d, the proposed algorithm successfully removes dynamic objects and achieves good mapping performance, demonstrating its high accuracy and robustness in visual dynamic SLAM. Figure 27c,f show the octree maps that can be directly used for path planning and navigation.

The localization accuracy of the improved ORB-SLAM3 algorithm is evaluated on the public TUM RGB-D dataset. The results are shown in Table 3. Three typical motion sequences from the dataset are selected as test objects: the sitting_static sequence (representing static scenes) and the walking_rpy and walking_xyz sequences (representing highly dynamic scenes). The official ORB-SLAM3, Dyna-SLAM, and RDS-SLAM algorithms are compared with the proposed algorithm. The evaluation tool is Evo, and the metric is the absolute trajectory error (ATE).

As shown in Table 3, Dyna-SLAM shows a slight decrease in pose estimation accuracy in low-dynamic sequences. In contrast, for the two high-dynamic sequences, the root-mean-square error (RMSE) of the ATE of the proposed algorithm is on average 13.0% lower than that of the original ORB-SLAM3, indicating that the improved algorithm achieves higher pose estimation accuracy in highly dynamic scenes.
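For reference, the ATE RMSE and the relative reduction quoted above can be computed as in the following sketch. It assumes the estimated and ground-truth trajectories have already been time-associated and aligned (tools such as Evo perform this step before computing the error); the function names are hypothetical.

```python
import math


def ate_rmse(estimated, ground_truth):
    """RMSE of the translational error between paired, aligned positions."""
    if not estimated or len(estimated) != len(ground_truth):
        raise ValueError("trajectories must be non-empty and of equal length")
    squared_errors = [
        sum((e - g) ** 2 for e, g in zip(est_pos, gt_pos))
        for est_pos, gt_pos in zip(estimated, ground_truth)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def relative_reduction(baseline_rmse, improved_rmse):
    """Percentage by which the improved RMSE is lower than the baseline."""
    return 100.0 * (baseline_rmse - improved_rmse) / baseline_rmse
```

The relative reduction is computed per sequence and then averaged over the sequences of interest, which is how an "on average" figure over the two high-dynamic sequences would be obtained.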

To provide a more intuitive comparison of trajectory accuracy, the absolute error comparison of 3D trajectories between ORB-SLAM3 and the improved algorithm is shown in Figure 28. Figure 28a shows the trajectory tracked by the original algorithm, and Figure 28b shows the trajectory of the proposed algorithm. It is evident that the improved algorithm yields significantly smaller errors. Figure 28c,d present the error data comparison, where all error metrics of the proposed algorithm are reduced.

4.3. Multimodal Fusion Perception Experiments on UGV

4.3.1. Dataset and Experimental Setup

The nuScenes dataset is used to validate the proposed algorithm. It contains 7481 training samples and 7518 testing samples. The training samples are organized into two independent subsets according to functional requirements: a training group (n = 700) for parameter optimization and an evaluation group (n = 3769) for performance validation. The nuScenes dataset adopts the NuScenes Detection Score (NDS) as the core evaluation metric, which combines the mean Average Precision (mAP) and five True Positive (TP) quality metrics. mAP quantifies the accuracy of object recognition, while NDS provides a multi-dimensional assessment of detection quality by measuring object size, location, orientation, and velocity. Independent Average Precision (AP) is evaluated for five core object categories: vehicles (car, bus, truck), two-wheeled non-motorized vehicles, and pedestrians (Ped.). The experimental results focus on the quantitative analysis of per-category detection performance on the nuScenes dataset.

For feature extraction network configuration, the image branch of the proposed multimodal 3D object detection network uses a CNN-based encoder network, ResNet101+FPN, as the backbone to extract multi-scale image features, while cross-scale feature fusion is performed by the Adaptive Dynamic Weighting Module (ADP). In point cloud preprocessing, the recommended parameter settings of the PointNet++ framework are adopted. The number of channels in the image BEV semantic features is set to 32, the number of groups in the semantic enhancement module is 64, the image resolution for BEV map segmentation is set to 256 × 736, and the voxel size is 0.1 m.

The NuScenes Detection Score (NDS) is calculated with the following weight allocation: mAP accounts for 50% (5/10), and each of the five TP metrics contributes 10%, as shown in Equation (26):

NDS = \frac{1}{10} \left[ 5 \cdot mAP + \sum_{mTP \in \{mATE, mASE, mAOE, mAVE, mAAE\}} \left( 1 - \min(mTP, 1) \right) \right]

(26)

where mAP is the mean Average Precision, and the other error terms are the translation error (mATE), scale error (mASE), orientation error (mAOE), velocity error (mAVE), and attribute error (mAAE).
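Equation (26) translates directly into a few lines of Python. The sketch below is only a restatement of the formula; the dictionary layout for the TP errors is an assumption made for illustration.

```python
TP_METRICS = ("mATE", "mASE", "mAOE", "mAVE", "mAAE")


def nds(mean_ap, tp_errors):
    """NuScenes Detection Score from mAP and the five mean TP errors.

    Each TP error is clipped to 1 and converted to a score (1 - error),
    so an error of 1 or more contributes nothing to the total.
    """
    tp_scores = [1.0 - min(tp_errors[name], 1.0) for name in TP_METRICS]
    return (5.0 * mean_ap + sum(tp_scores)) / 10.0
```

The clipping matters for unbounded errors such as mAVE: a velocity error above 1 m/s lowers NDS by at most 0.1, no matter how large it is.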

Mean Average Precision (mAP). mAP is based on the matching accuracy of 3D detection boxes. A center-distance threshold is used instead of IoU for matching, which reduces missed detections of small objects caused by slight positional offsets, as shown in Equation (27):

mAP = \frac{1}{4} \sum_{d \in \{0.5, 1, 2, 4\}} \frac{1}{|C|} \sum_{c \in C} AP_{c,d}

(27)

where

\{0.5, 1, 2, 4\}

is the set of center-distance thresholds (in meters) and C is the set of object classes. NDS thus rewards both the presence of detection boxes (mAP) and their accuracy (TP errors). Matching by center distance rather than IoU makes the evaluation of small objects less sensitive to slight positional offsets, which suits real-world autonomous driving scenarios. All metrics are normalized to the range [0, 1] to facilitate model comparison.
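The center-distance matching behind Equation (27) can be sketched as a greedy assignment. This is a simplified illustration for a single class and a single threshold, assuming detections and ground-truth objects are reduced to 2D centers on the ground plane; `match_by_center_distance` is a hypothetical helper, not the nuScenes devkit API.

```python
import math


def match_by_center_distance(predictions, ground_truths, threshold):
    """Label each prediction as a true or false positive.

    predictions:   list of (x, y, score) detection centers with confidence.
    ground_truths: list of (x, y) ground-truth centers.
    threshold:     maximum center distance in meters for a valid match.

    Predictions are processed in descending score order, and each
    ground-truth object can be matched at most once.
    """
    unmatched = set(range(len(ground_truths)))
    results = []
    for x, y, score in sorted(predictions, key=lambda p: p[2], reverse=True):
        best_index, best_distance = None, threshold
        for index in unmatched:
            gx, gy = ground_truths[index]
            distance = math.hypot(x - gx, y - gy)
            if distance < best_distance:
                best_index, best_distance = index, distance
        if best_index is not None:
            unmatched.discard(best_index)
        results.append((score, best_index is not None))
    return results
```

The resulting (score, is_true_positive) list is what a precision-recall curve, and hence the per-class, per-threshold AP, is built from; averaging AP over the four thresholds and all classes gives the mAP of Equation (27).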

4.3.2. 3D Object Detection Results Analysis

The proposed unified BEV encoding architecture and dynamic fusion weight strategy are validated through systematic experiments on the nuScenes dataset. Table 4 presents a comparison of the proposed method with other state-of-the-art approaches on the validation set. In the table, “Mod.” denotes the sensor modality used (C: camera, L: LiDAR). The reported results of the baseline methods are taken from their original papers. As shown in Table 4, the proposed fusion algorithm achieves significantly higher object detection accuracy than current mainstream multimodal fusion models such as FUTR3D, TransFusion, and FusionPainting. Compared with the strong baseline BEVFusion, the proposed method improves mAP by 3.9% and NDS by 3.3%. Moreover, relative to the most advanced DeepInteraction algorithm, it achieves a 2.9% gain in mAP and a 2.4% gain in NDS.

All improvements of the proposed method over the baselines (BEVFusion and DeepInteraction) are statistically significant (paired t-test, p < 0.05). The standard deviations of mAP across three independent runs are below 0.3% for all methods, indicating stable performance.

Partial validation results of the proposed algorithm on the nuScenes dataset are shown in Figure 29.

To verify the robustness and perception performance of the proposed multimodal fusion perception algorithm in real-world applications, real-time data were collected from a UGV equipped with four GMSL cameras (front, rear, left, right) and a 32-beam LiDAR. Real-time object detection experiments were conducted on public roads using the proposed fusion algorithm. The perception results from randomly selected real-vehicle data are shown in Figure 30.

Figure 30a shows a typical urban intersection scene, and Figure 30c shows a straight urban road scene. Figure 30b,d present the corresponding point-cloud BEV object recognition results. It can be observed that the proposed algorithm performs well in recognizing both sparse distant targets and overlapping targets. Moreover, it exhibits good detection and recognition capabilities for dynamic cyclists and pedestrians, demonstrating satisfactory 3D object detection performance and feasibility.

To further evaluate the performance of the fusion perception algorithm under different lighting conditions, 3D object detection tests were carried out in three real-world scenarios: sunlight, dusk, and night. The results are shown in Figure 31. The algorithm maintains good recognition ability under all three lighting conditions.

Meanwhile, to assess the fusion effectiveness and the robustness of the adaptive dynamic weighting BEV fusion network, recognition experiments were conducted using different input modalities. Quantitative evaluation and analysis were performed on three data configurations: vision-only (Camera), LiDAR-only, and multimodal fusion (Camera-LiDAR). The results are presented in Figure 32. Compared with single-modality configurations, the fusion algorithm achieves substantial improvements across all metrics. Under normal lighting conditions, compared to the camera-only and LiDAR-only configurations, the mAP increases by 11.8% and 5.1%, the NDS increases by 23.3% and 9.6%, and the mIoU increases by 10.8% and 14.8%, respectively. Under low-light conditions, the fusion algorithm also improves on the performance of each single modality, with the most significant improvement observed over the camera-only configuration.

4.3.3. Semantic Segmentation Results Analysis

In addition to the 3D detection task, this study also evaluates the performance of the proposed method on the BEV segmentation task. A systematic comparison is conducted against current mainstream single-modality algorithms (including PointPillars and CenterPoint) as well as the multimodal fusion method BEVFusion. The comparison results are shown in Table 5. The mean Intersection over Union (mIoU) is adopted as the evaluation metric, computed by quantitative analysis of IoU scores for semantic categories including drivable area, crosswalk, walkway, stop line, and lane marking. It can be observed that by integrating LiDAR geometric features with visual BEV representations, the proposed method significantly improves the segmentation accuracy of the detection model in complex scenes.

Figure 33 presents a comparison of the segmentation results obtained by the proposed method, BEVFusion, and LSS. It can be observed that in a typical T-junction scene, all three methods achieve reasonably satisfactory segmentation performance. However, when facing a complex intersection scene, the proposed method demonstrates a clear advantage in capturing local details. Compared with BEVFusion and LSS, the segmentation results of the proposed method are notably more precise, indicating that it is more effective and reliable for BEV map segmentation in complex scenarios.

4.3.4. Ablation Study Results

Table 6 presents the impact of the proposed multimodal BEV encoding architecture and its Adaptive Weighting Fusion Mechanism (AWFM) on the overall model performance. The feature encoder consists of a cross-attention fusion module (CAFM) with shared BEV queries and a temporal fusion module (TFM). The baseline is the BEVFusion algorithm, and the evaluation metrics are mAP, NDS, and mIoU.

As shown in Table 6, the most significant improvement over the baseline is contributed by the cross-attention fusion module within the unified BEV feature encoder, indicating that this deep fusion strategy effectively enhances the complementary advantages across different modalities. The temporal fusion module increases mAP by 0.8% and NDS by 0.9%, demonstrating that extracting and fusing temporal information from multiple frames effectively captures BEV features under complex distributions, thereby producing more comprehensive multimodal fusion results. After applying the adaptive weighting fusion module, the network achieves improvements of 1.0% in mAP, 0.5% in NDS, and 1.3% in mIoU, confirming that dynamic fusion of BEV features can effectively cope with adverse lighting and weather conditions, thus improving system robustness.

4.4. UAV–UGV Collaborative Perception Experiments

The onboard edge computing unit serves as the core of the collaborative system, responsible for processing the BEV perception data of the Unmanned Ground Vehicle and running the collaborative algorithms. In terms of software configuration, ImageNet pre-trained weights are adopted, with an initial learning rate of 2 × 10⁻⁴. The AdamW optimizer is used with a weight decay of 0.01 to prevent overfitting, and the weight of the heatmap loss is increased to 65%. The BEV grid is set to a size of 128 × 128 with a resolution of 0.5 m.

Due to the scarcity of multi-view annotated datasets for air-ground collaboration, a custom simulation dataset is built using the AirSim platform. Urban scenes are imported into Unreal Engine, and dynamic targets as well as static obstacles are added. The global clock function simSetDetectionFilterRadius is used for synchronized data collection, the client.simGetObjectPose function is employed to obtain 3D object poses, and object masks are generated via the semantic segmentation channel of AirSim to extract pixel-wise annotations. Table 7 details the composition of the dataset, covering diverse scenarios including residential alleys, urban arterial roads, industrial parks, and suburban/rural areas. The dataset consists of 4105 frames containing a total of 12,523 objects.

The mean average precision (mAP) is used as the evaluation metric for collaborative 3D object detection. In the octree map, precision and recall jointly reflect the reliability and completeness of obstacle detection. Precision indicates the proportion of voxels marked as “occupied” that are actually occupied. The calculation formula is given in Equation (28):

Precision = \frac{TP}{TP + FP}

(28)

where TP is the number of voxels that are actually occupied and correctly marked as occupied in the map, and FP is the number of voxels that are actually free or unknown but incorrectly marked as occupied.

Recall represents the proportion of actually occupied voxels that are correctly detected and marked as occupied in the map, as shown in Equation (29):

Recall = \frac{TP}{TP + FN}

(29)

where FN is the number of voxels that are actually occupied but incorrectly marked as free or unknown.

The F1-score is the harmonic mean of precision and recall, used to comprehensively evaluate classification performance, as given in Equation (30):

F1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}

(30)
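Equations (28)–(30) can be evaluated on voxel maps as in the following sketch, which assumes each map is represented by the set of integer voxel keys it marks as occupied, so that voxels absent from a set are free or unknown. The function name is hypothetical.

```python
def voxel_precision_recall_f1(predicted_occupied, true_occupied):
    """Precision, recall, and F1 of the occupied class over voxel keys."""
    predicted = set(predicted_occupied)
    actual = set(true_occupied)
    tp = len(predicted & actual)   # occupied and marked occupied
    fp = len(predicted - actual)   # free/unknown but marked occupied
    fn = len(actual - predicted)   # occupied but marked free/unknown
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, a map that marks three voxels as occupied, two of them correctly, against a ground truth of four occupied voxels has a precision of 2/3, a recall of 1/2, and an F1-score of 4/7.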

The voxel-wise root-mean-square error (RMSE) is used to assess map accuracy by quantifying the deviation between the generated map and the ground-truth map in 3D voxel space. The formula is shown in Equation (31):

RMSE = \sqrt{\frac{1}{N} \sum_{i = 1}^{N} \left( P_{update}(v_{i}) - P_{true}(v_{i}) \right)^{2}}

(31)

where

P_{update}(v_{i})

is the occupancy probability of the i-th voxel in the generated map,

P_{true}(v_{i})

is the occupancy probability of the corresponding voxel in the ground-truth map, and

N

is the total number of voxels considered in the calculation.
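A minimal sketch of the voxel-wise RMSE is given below. It assumes both maps are dictionaries from voxel key to occupancy probability and, as a further assumption not stated in the text, that a voxel missing from one map is treated as unknown with probability 0.5.

```python
import math


def voxel_rmse(p_update, p_true, unknown=0.5):
    """RMSE between two occupancy maps over the union of their voxels.

    p_update, p_true: dicts mapping voxel key -> occupancy probability.
    unknown:          probability assumed for voxels absent from a map.
    """
    keys = set(p_update) | set(p_true)
    if not keys:
        raise ValueError("at least one voxel is required")
    squared_error = sum(
        (p_update.get(key, unknown) - p_true.get(key, unknown)) ** 2
        for key in keys
    )
    return math.sqrt(squared_error / len(keys))
```

Note that the square is applied to each per-voxel difference before summation, so positive and negative deviations cannot cancel.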

4.4.1. Collaborative 3D Object Detection Experiments

Table 8 presents the object detection results under three different occlusion ranges (0–30%, 30–60%, and >60%). The detected objects mainly include common urban targets such as vehicles, pedestrians, and cyclists. Comparative analysis shows that the UAV alone achieves the lowest detection performance due to object occlusion, especially for distant targets, with a mAP of only 65.2%. In contrast, the UGV, equipped with the multimodal fusion perception algorithm, achieves significantly improved detection performance on the dataset, reaching a mAP of 76.5%. The proposed collaborative detection algorithm, which fuses multi-view data, improves the recognition rate for obstacles under all occlusion ranges. The most notable improvement is for objects with an occlusion rate > 60% (15.3%), followed by those with occlusion rates between 30% and 60% (12.7%).

The precision-recall curves are shown in Figure 34. Under 0–30% occlusion, the collaborative system maintains a precision of 90% at a recall of 0.8, a relative improvement of 38.5% over the UAV alone (0.65 → 0.90). At a recall of 0.5, the collaborative detection achieves a precision of 77.8%, significantly higher than the 55.3% of the UGV alone. By integrating the multi-view advantages of both UAV and UGV, the collaborative perception system outperforms either single platform, achieving a mAP of 88.7%, thereby demonstrating the superiority of the proposed collaborative algorithm in 3D object detection.

After simulation analysis, field experiments were conducted in representative real-world scenarios with obstacle occlusion and high-altitude (rooftop) areas. The UAV and the UGV followed the same route, with the UAV leading the UGV by approximately 8 m, and data were collected synchronously. The experimental results are shown in Figure 35.

In Scenario A of Figure 35, due to occlusion by trees and vehicles, the UGV alone can detect only one vehicle. In contrast, the UAV can detect more objects from the air; thus, after complementary feature fusion, more accurate perception is achieved. The same complementary effect is observed in Scenario B. In summary, the proposed collaborative perception algorithm effectively compensates for the dimensional limitations of single-sensor perception, enabling comprehensive 3D perception of objects through the complementary sensing of air-ground platforms.

Taken together, the above experiments and analyses demonstrate that the established air-ground collaborative 3D object perception system integrates the complementary spatial and accuracy advantages of both UAVs and UGVs, achieving local 3D stereo perception. This significantly enriches perceptual information in unknown environments, such as urban settings, and provides a solid foundation for executing unmanned tasks including inspection, rescue, and logistics supply.

4.4.2. Perception Map Updating Experiments

To comprehensively evaluate the performance of the incremental map updating algorithm, the UGV and UAV synchronously traverse the scene. Experiments are conducted from two perspectives: comparison of map accuracy before and after fusion, and comparison of results obtained with different fusion methods. Experiments are performed on a simulation dataset for both single-platform and multi-platform fusion, and the results are shown in Table 9. The experimental results demonstrate that the mapping performance after fusion is significantly improved. The most notable improvements in RMSE and F1-score occur in complex scenes: RMSE decreases by 7.5, and F1-score increases by 0.3. Moreover, mapping quality also improves in simple and medium-complexity scenes, verifying the superiority of the collaborative algorithm for collaborative map updating at the simulation level.

Four different fusion methods are compared:

(1): Direct replacement method: The local grid map of the UGV directly overwrites the corresponding octree region, ignoring confidence and temporal information;
(2): Fixed-weight linear fusion: Statically assigns fusion weights to the maps from the two platforms;
(3): Traditional Bayesian method: Based on conventional Bayesian updating rules;
(4): Proposed method: Dynamic Bayesian probability updating combined with D-S evidence theory.
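To make the differences between the first three baselines concrete, the sketch below applies each rule to a single voxel. It is illustrative only: the Bayesian rule shown is the common binary update for two independent observations under a uniform prior, which is an assumption about what "conventional Bayesian updating" denotes here, and the 0.5 default weight is arbitrary. The proposed method (4) is not sketched, since it depends on details beyond this list.

```python
def direct_replacement(p_prior, p_local):
    """(1) The UGV's local value simply overwrites the prior map value."""
    return p_local


def fixed_weight_fusion(p_prior, p_local, w_local=0.5):
    """(2) Static linear blend of the two occupancy probabilities."""
    return w_local * p_local + (1.0 - w_local) * p_prior


def bayesian_update(p_prior, p_local):
    """(3) Binary Bayes fusion of two independent occupancy estimates."""
    occupied = p_prior * p_local
    free = (1.0 - p_prior) * (1.0 - p_local)
    if occupied + free == 0.0:
        raise ValueError("sources are certain and contradictory")
    return occupied / (occupied + free)
```

Two agreeing observations of 0.8 illustrate the contrast: replacement and linear blending both return 0.8, whereas the Bayesian update raises the estimate to about 0.94 because agreement between independent sources counts as additional evidence.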

Table 10 presents a comparison of map accuracy under different map updating algorithms for the collaborative perception system. As shown in Table 10, the direct replacement method yields the highest RMSE, indicating a large deviation between the generated map and the ground-truth map in 3D voxel space. The fixed-weight linear fusion and traditional Bayesian methods achieve some improvement in simple scenes but produce low F1-scores in complex scenes. Overall, the proposed map updating algorithm exhibits the best comprehensive performance.

After simulation analysis, the map updating algorithm is further validated in real-world scenes. The UAV first builds a prior map and plans a path, then sends the path to the UGV. The UAV leads the UGV by approximately 5 m, and both platforms follow the predetermined path synchronously while collecting real-world data. The collaborative mapping algorithm is used to fuse the data from both platforms, and the experimental results are shown in Figure 36.

It can be observed from the experimental results that a single UAV has limited capability for perceiving small ground objects, and the prior map it builds may miss some ground objects, as indicated by the red circles in Figure 36a. By contrast, the UGV, equipped with a multi-sensor fusion perception method, can incorporate the local obstacles missed by the UAV into the prior map through incremental map updating. The local objects in Figure 36c compensate for the perception-deficient area in Figure 36a, resulting in the fused map shown in Figure 36d. Collaborative mapping significantly improves the richness of perceptual information.

4.4.3. Robustness to Communication and Sensor Failures

We evaluated system robustness under communication disruptions and sensor failures on the simulation dataset. Results are summarized in Table 11.

Latency and packet loss: With added delays of 50–200 ms and packet loss rates of 5–20%, the mAP degradation remains below 3% for delays ≤ 100 ms and loss ≤ 10%. Higher delays/loss cause mAP drops of 7–8%.

Bandwidth limitation: Reducing bandwidth to 10 Mbps and 1 Mbps (via feature compression) results in mAP drops of 1.8% and 17.5%, respectively. At 1 Mbps, the system still outperforms the single-UGV baseline (71.2% vs. 68.5%).

Single-platform failure: When the UAV goes offline, the system falls back to UGV-only mode (mAP 68.5%). If the UGV LiDAR fails, the system uses camera-only plus UAV features (mAP 72.1%). Recovery time after restoration is <0.8 s.

The system maintains acceptable performance under moderate communication disturbances and single-platform failures, confirming its suitability for real-world urban operations.

4.5. Quantitative Real-World Evaluation

We conducted field experiments on a 500 m campus route (300 manually annotated frames). The UAV and UGV followed the same trajectory under varying lighting conditions. Table 12 summarizes the quantitative results. The collaborative system achieves a detection mAP of 84.5%, significantly outperforming the UAV alone (68.3%) and the UGV alone (72.1%). Localization RMSE is reduced to 0.19 m (vs. 0.35 m for UAV, 0.28 m for UGV). The map F1-score reaches 0.88, compared to 0.71 and 0.79 for single platforms.

These results confirm the practical effectiveness of the proposed collaborative perception system in real-world urban environments.

4.6. Real-Time Performance Analysis

The system was deployed on the onboard computing units (UAV: XU3Pai X3; UGV: Jetson AGX Orin). Table 13 summarizes the real-time performance metrics.

End-to-end latency (UAV image capture to collaborative output) is 87 ms. The overall pipeline runs at 18 fps, with UAV SLAM at 35 fps and UGV BEV fusion at 22 fps. Module-wise runtime on the UGV totals 65 ms (LiDAR: 12 ms, image: 18 ms, BEV encoder: 25 ms, task head: 10 ms). Communication bandwidth for compressed BEV features is 28 Mbps, and GPU memory usage on the UGV is 5.2 GB.

These results confirm that the proposed system meets real-time constraints on embedded hardware, making it suitable for practical urban deployment.

5. Conclusions

This paper has conducted research on three aspects: small-target recognition for UAVs, multi-sensor fusion BEV environmental perception for UGVs, and UAV-UGV collaborative perception. The main contributions are summarized as follows:

(1): To address the problem that the prior map built by a UAV in a collaborative system is easily blurred by dynamic objects in the environment, a dynamic feature point elimination framework integrating ORB-SLAM3 with an improved YOLOv8 is constructed, which provides a navigation basis for UAV-UGV collaborative navigation.
(2): An end-to-end perception model for the UGV is developed. By thoroughly fusing the BEV features from the camera and LiDAR onboard the UGV, the model significantly improves the performance of 3D object perception and semantic map segmentation.
(3): A cross-attention-based cross-platform feature fusion strategy for UAV-UGV collaboration is proposed, which greatly enhances the recognition accuracy of occluded objects and objects under bridges. Moreover, Bayesian probability updating and D-S evidence theory are employed to incrementally merge the fine-grained perceptual map of the UGV into the UAV’s prior map, resulting in an accurate semantic map under heterogeneous collaboration.

Complex real-world environments (e.g., urban canyons, disaster sites, infrastructure corridors) demand high sensor stability, system performance, and algorithmic robustness. For UAV dynamic SLAM, the dynamic point elimination strategy still has limited generalization for unseen objects, and ORB-SLAM3 sometimes loses tracking in high-speed dynamic scenes. The current collaborative perception system is far from fully intelligent. Therefore, future work will focus on three directions: (1) developing a fully intelligent air-ground interactive perception system; (2) strengthening algorithm generalization and dynamic adaptability; and (3) providing extended validation with additional dynamic SLAM sequences (e.g., Bonn, AirSim), more recent baselines (e.g., DynaSLAM 2, Dynamic-SLAM, V2VNet, CoPerception), and a quantitative failure analysis. These additional experiments will be reported in a future extension of this study.

Author Contributions

Conceptualization, Y.L. and E.T.; methodology, H.H.; software, X.C.; validation, Y.L., X.C. and X.Z.; formal analysis, X.Z.; investigation, E.T.; resources, H.H.; data curation, X.C.; writing—original draft preparation, Y.L.; writing—review and editing, X.Z.; visualization, X.C.; supervision, E.T.; project administration, Y.L.; funding acquisition, H.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Foundation of Shanxi Key Laboratory of Machine Vision and Virtual Reality (No. 447-110103) and the Applied Basic Research Program of Shanxi Province (Nos. 202303021221119 and 202403021211093).

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

Author Xiaofeng Chen is employed by the ShanXi PingYang Industry Machinery Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

ORB-SLAM3	Oriented FAST and Rotated BRIEF Simultaneous Localization and Mapping 3
SLAM	Simultaneous Localization and Mapping
UAV	Unmanned Aerial Vehicle
UGV	Unmanned Ground Vehicle
V-SLAM	Visual Simultaneous Localization and Mapping
BEV	Bird’s-Eye-View
mAP	mean Average Precision
NDS	NuScenes Detection Score
mIoU	mean Intersection over Union

Figure 1. UAV-UGV collaborative perception overall architecture.

Figure 2. UAV-UGV collaborative perception information interaction flow.

Figure 3. UAV dynamic SLAM workflow.

Figure 4. Improved neck network.

Figure 5. MobileSAM segmentation results. (a) Before segmentation. (b) After segmentation.

Figure 6. Fusion framework of improved YOLOv8 and ORB-SLAM3.

Figure 7. Dense mapping module.

Figure 8. Multimodal fusion perception model based on attention and temporal optimization.

Figure 9. Unified BEV feature encoder.

Figure 10. BEV feature temporal fusion module.

Figure 11. Dynamic adaptive network framework.

Figure 12. Architecture of the multi-task head module.

Figure 13. UAV–UGV collaborative perception fusion workflow.

Figure 14. UAV overhead view feature extraction. (a) Overhead view. (b) RGB feature map.

Figure 15. Cross-attention fusion module.

Figure 16. Transformation relationships between coordinate systems.

Figure 17. Transformation relationships between coordinate systems in joint calibration.

Figure 18. Schematic diagram of UAV-UGV collaborative coordinate transformation.

Figure 19. Overall framework of the hardware time synchronization scheme for the UGV.

Figure 20. Intrinsic calibration process. (a) Intrinsic calibration. (b) Intrinsic calibration results.

Figure 21. Joint calibration process. (a) Joint calibration. (b) Joint calibration projection. (c) Joint calibration results.

Figure 22. Framework of the map updating algorithm.

Figure 23. Flowchart of the updating algorithm combining dynamic Bayesian probability and D-S evidence theory.

Figure 24. Experimental configuration of UAV-UGV collaborative perception. (a) UGV. (b) UAV. (c) UAV-UGV collaboration.

Figure 25. Comparison of small target detection results. (a) Small target detection results before improvement. (b) Small target detection results after improvement.

Figure 26. Comparison before and after feature point elimination. (a) Original image from the TUM dataset. (b) All feature points and object detection results on the TUM dataset. (c) Static feature points on the TUM dataset. (d) Original image from the experimental scene. (e) All feature points and object detection results on the experimental scene. (f) Static feature points on the experimental scene.

Figure 27. Comparison of dense mapping. (a) Before dynamic point removal on the dataset. (b) After dynamic point removal on the dataset. (c) Octree map of the dataset. (d) Before dynamic point removal in the real-world scene. (e) After dynamic point removal in the real-world scene. (f) Octree map of the real-world scene.

Figure 28. Comparison of EVO trajectory evaluation results. (a) Trajectory tracked by the original algorithm. (b) Trajectory tracked by the improved algorithm. (c) Error data of the original algorithm. (d) Error data of the improved algorithm.

Figure 29. 3D object detection results on the nuScenes dataset. (a) 3D object detection under sunny conditions. (b) 3D object detection under rainy conditions.

Figure 30. BEV 3D object perception results of the Unmanned Ground Vehicle. (a) Intersection. (b) 3D object recognition results at the intersection. (c) Straight road. (d) 3D object recognition results on the straight road.

Figure 31. BEV 3D object perception results under different lighting conditions. (a) Detection results on a sunny road. (b) Detection results on a road at dusk. (c) Detection results on a road at night.

Figure 32. Comparison of different modalities under different lighting conditions. (a) Performance comparison under normal lighting. (b) Performance comparison under low-light conditions.

Figure 33. Segmentation results of different algorithms on nuScenes. (a) Intersection scene. (b) LSS. (c) BEVFusion. (d) Ours.

Figure 34. Precision–recall curves. (a) 0–30% occlusion. (b) 30–60% occlusion. (c) >60% occlusion.

Figure 35. UAV-UGV collaborative target detection results. (a) Aerial view of Scenario A. (b) UGV detection results in Scenario A. (c) Collaborative perception feature heatmap of Scenario A. (d) Aerial view of Scenario B. (e) UGV detection results in Scenario B. (f) Collaborative perception feature heatmap of Scenario B.

Figure 36. UAV-UGV collaborative mapping experiment. (a) UAV mapping. (b) UAV octree map. (c) UGV local mapping. (d) UAV-UGV collaborative mapping.

Table 1. Target classification labels.

Target Type	Typical Objects	Classification Label
Static target	Trees, buildings, traffic signs, etc.	S
Potentially dynamic target	Cars, crowds, animals, etc.	/
Dynamic target	Pedestrians, cyclists, etc.	D

Table 2. Parameter comparison of small target detection results.

Model Version	mAP@0.5 (%)	mAP@0.5:0.95 (%)	Params (M)	GFLOPs
YOLOv8s	39.7	23.9	11.2	28.6
YOLOv8m	41.7	26.2	25.9	78.9
YOLOv8s + SAHI	44.5 (+12%)	27.4 (+15%)	11.2 (unchanged)	85.2 (+197%)
Proposed Algorithm	48.2 (+21%)	30.5 (+27%)	13.5 (+20%)	36.5 (+27%)
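
The bracketed percentages in Table 2 are relative changes with respect to the YOLOv8s row. The following minimal sketch recomputes them from the tabulated values. The function name `relative_change` and the dictionary layout are illustrative assumptions, not part of the original work, and the recomputed figures may differ slightly from the bracketed ones depending on how the table was rounded.

```python
def relative_change(value: float, baseline: float) -> float:
    """Return the percentage change of `value` relative to `baseline`."""
    return (value - baseline) / baseline * 100.0


# Values copied from Table 2, in column order:
# mAP@0.5 (%), mAP@0.5:0.95 (%), Params (M), GFLOPs.
baseline = (39.7, 23.9, 11.2, 28.6)  # YOLOv8s row
rows = {
    "YOLOv8s + SAHI": (44.5, 27.4, 11.2, 85.2),
    "Proposed Algorithm": (48.2, 30.5, 13.5, 36.5),
}

# Relative change of every column against the YOLOv8s baseline, to one decimal.
changes = {
    name: [round(relative_change(v, b), 1) for v, b in zip(values, baseline)]
    for name, values in rows.items()
}

for name, deltas in changes.items():
    print(name, deltas)
```

Read this way, the proposed algorithm gains accuracy at a far smaller compute cost than slicing-based inference with SAHI, whose GFLOPs roughly triple.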

Table 3. Comparison of trajectory errors (ATE RMSE, unit: m).

Sequence	ORB-SLAM3	Dyna-SLAM	RDS-SLAM	Ours
sitting_static	0.0355	0.0365	0.0328	0.0283
walking_rpy	0.0239	0.0282	0.0277	0.0202
walking_xyz	0.0154	0.0247	0.0183	0.0134
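
Table 3 reports the root-mean-square of the absolute trajectory error (ATE). The sketch below shows how that figure is obtained once the estimated trajectory has already been time-associated and aligned to the ground truth; the alignment step, which evaluation tools such as EVO perform, is omitted here. The function name `ate_rmse` and the toy trajectories are hypothetical, and each pose is assumed to be reduced to an (x, y, z) position in metres.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def ate_rmse(estimated: Sequence[Point], ground_truth: Sequence[Point]) -> float:
    """RMSE of the translational error between two aligned, equally long trajectories."""
    if len(estimated) != len(ground_truth) or not estimated:
        raise ValueError("trajectories must be non-empty and of equal length")
    # Squared Euclidean distance between each estimated and reference position.
    squared_errors = [
        sum((e - g) ** 2 for e, g in zip(est, gt))
        for est, gt in zip(estimated, ground_truth)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy data (assumed, not from the paper): three poses with small offsets.
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
est = [(0.0, 0.03, 0.0), (1.0, -0.04, 0.0), (2.0, 0.0, 0.05)]
rmse = ate_rmse(est, gt)
print(f"ATE RMSE: {rmse:.4f} m")
```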

Table 4. Comparison of 3D object detection results.

Detector	Mod.	Car	Truck	Bus	Ped.	Bike	mAP (%)	NDS (%)
CVCENT	L	81.0	48.6	55.0	80.2	25.3	53.1	63.4
CenterPoint	L	85.8	59.1	72.5	85.6	42.5	59.5	66.4
LargeKernel3D	L	85.4	59.9	72.7	85.3	56.1	63.6	69.2
FUTR3D	C + L	86.3	61.5	71.9	82.6	63.3	64.2	68.0
TransFusion	C + L	86.2	56.7	66.3	86.1	44.2	65.5	70.2
FusionPainting	C + L	87.0	63.0	70.7	88.4	64.4	66.5	70.7
BEVFusion	C + L	88.6	60.0	68.3	88.7	52.9	67.9	71.0
DeepInteraction	C + L	87.1	65.0	75.4	88.4	65.8	68.6	71.9
Ours	C + L	88.5	66.5	76.5	89.0	66.9	71.8	74.3

Table 5. Comparison of semantic segmentation results.

Method	Mod.	Drivable	Ped. Cross	Walkway	Stop Line	Carpark	mIoU (%)
LSS	C	75.4	38.3	46.3	30.3	39.1	44.4
CVT	C	74.3	36.8	39.9	25.8	35.0	40.2
CenterPoint	L	75.1	47.4	57.6	35.8	32.7	47.6
PointPillars	L	72.0	43.1	53.1	29.7	27.7	43.8
PointPainting	C + L	75.9	48.5	57.1	36.9	34.5	49.1
MVP	C + L	76.1	48.7	57.0	36.9	33.0	49.0
BEVFusion	C + L	85.5	60.6	67.6	52.0	57.0	62.7
Ours	C + L	88.5	62.3	70.5	53.6	57.5	65.3

Table 6. Ablation study.

Method	CAFM	TFM	AWFM	mAP (%)	NDS (%)	mIoU (%)
Baseline				67.9	71.0	62.7
	√			68.2	72.5	63.6
	√	√		69.8	73.8	64.0
	√		√	71.0	73.4	64.2
	√	√	√	71.8	74.3	65.3

Table 7. Composition of the simulated air–ground collaborative dataset.

Scene Type	Total Frames	Number of Objects	Occlusion Ratio	Lighting Condition
Residential Alley	1120	3581	40%	Normal lighting
Urban Arterial Road	1245	5872	32%	Normal lighting
Industrial Park	980	2120	25%	Complex lighting
Suburban/Rural	760	950	17%	Backlight/low light

Table 8. Comparison of 3D object detection performance.

Occlusion range	0–30%			30–60%			>60%
Mod.	Car	Ped.	Cyclists	Car	Ped.	Cyclists	Car	Ped.	Cyclists
UAV	78.5	65.2	68.4	62.1	48.7	52.3	41.5	29.6	32.8
UGV	82.3	74.8	71.6	68.9	58.4	55.1	53.2	38.4	40.1
Collaborative	86.7	78.9	75.2	75.4	65.8	63.7	59.8	47.8	49.5

Table 9. Map accuracy comparison between collaborative and non-collaborative modes.

Experiment Method	Scene A RMSE	Scene A F1-Score	Scene B RMSE	Scene B F1-Score	Scene C RMSE	Scene C F1-Score
UAV	4.8 ± 0.3	0.79	10.8 ± 1.2	0.76	13.8 ± 1.2	0.64
UGV	3.1 ± 0.2	0.86	8.7 ± 0.8	0.84	10.5 ± 0.8	0.71
Collaborative Map	2.9 ± 0.1	0.93	5.9 ± 0.5	0.91	6.3 ± 0.5	0.94

Table 10. Accuracy comparison of different UAV-UGV collaborative map updating algorithms.

Experiment Method	Scene A RMSE	Scene A F1-Score	Scene B RMSE	Scene B F1-Score	Scene C RMSE	Scene C F1-Score
Direct Replacement	4.2 ± 0.3	0.82	12.8 ± 1.2	0.68	18.5 ± 1.2	0.42
Linear Weighted	3.5 ± 0.2	0.85	9.7 ± 0.8	0.74	16.7 ± 0.8	0.61
Classical Bayesian	3.1 ± 0.4	0.87	8.2 ± 0.9	0.78	15.0 ± 0.9	0.67
Proposed Method	2.1 ± 0.1	0.93	6.3 ± 0.5	0.89	6.3 ± 0.5	0.89
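
The proposed method in Table 10 combines dynamic Bayesian probability with D-S evidence theory (Figure 23). As an illustration of the D-S part only, the sketch below applies Dempster's rule of combination to a single map cell with the hypotheses "occupied" and "free", plus mass assigned to the whole frame to represent ignorance. The mass values, the key names, and the choice of the UAV and the UGV as the two evidence sources are assumptions made for illustration; this is not the authors' update algorithm.

```python
from typing import Dict

# Mass function over one cell: "occ", "free", and "unknown" (mass on the whole frame).
Mass = Dict[str, float]


def dempster_combine(m1: Mass, m2: Mass) -> Mass:
    """Fuse two mass functions over {occ, free} with Dempster's rule."""
    # Conflict: one source supports "occ" while the other supports "free".
    conflict = m1["occ"] * m2["free"] + m1["free"] * m2["occ"]
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict; the rule is undefined")
    norm = 1.0 - conflict
    occ = (m1["occ"] * m2["occ"] + m1["occ"] * m2["unknown"]
           + m1["unknown"] * m2["occ"]) / norm
    free = (m1["free"] * m2["free"] + m1["free"] * m2["unknown"]
            + m1["unknown"] * m2["free"]) / norm
    unknown = m1["unknown"] * m2["unknown"] / norm
    return {"occ": occ, "free": free, "unknown": unknown}


# Hypothetical evidence for one cell from each platform.
uav = {"occ": 0.6, "free": 0.1, "unknown": 0.3}
ugv = {"occ": 0.7, "free": 0.2, "unknown": 0.1}
fused = dempster_combine(uav, ugv)
print({k: round(v, 3) for k, v in fused.items()})
```

When both sources lean toward the same hypothesis, the fused belief in it exceeds either input and the residual ignorance shrinks, which is the behavior that motivates evidence-based fusion over direct replacement or fixed linear weighting.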

Table 11. Performance under communication and sensor failures.

Condition	mAP (%)	NDS (%)
Baseline (no failure)	88.7	74.3
Delay 50 ms	88.2	73.9
Delay 100 ms	86.5	72.8
Delay 200 ms	81.5	69.2
Packet loss 5%	87.9	74.0
Packet loss 10%	86.1	72.5
Packet loss 20%	80.2	67.8
Bandwidth 10 Mbps	86.9	73.2
Bandwidth 1 Mbps	71.2	62.5
UAV offline (fallback to UGV)	68.5	63.1
UGV LiDAR failure	72.1	66.4

Table 12. Quantitative real-world performance.

Metric	UAV Only	UGV Only	Collaborative
Detection mAP (%)	68.3	72.1	84.5
Localization RMSE (m)	0.35	0.28	0.19
Map F1-score	0.71	0.79	0.88

Table 13. Real-time performance metrics.

Metric	Value
End-to-end latency	87 ms
Overall pipeline frame rate	18 fps
UAV SLAM frame rate	35 fps
UGV BEV fusion frame rate	22 fps
UGV total module runtime	65 ms
Communication bandwidth	28 Mbps
UGV GPU memory usage	5.2 GB

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Li, Y.; Tian, E.; Chen, X.; Han, H.; Zhang, X. Research on Multi-Source Heterogeneous Collaborative Perception System Based on Unmanned Aerial Vehicle and Unmanned Ground Vehicle. Drones 2026, 10, 470. https://doi.org/10.3390/drones10060470
