Article

Unmanned Aerial Vehicle–Unmanned Ground Vehicle Centric Visual Semantic Simultaneous Localization and Mapping Framework with Remote Interaction for Dynamic Scenarios

1 Chongqing Construction Industry (Group) Co., Ltd., Banan District, Chongqing 401320, China
2 School of Mechanical Engineering, Guizhou University, Guiyang 550025, China
* Author to whom correspondence should be addressed.
Drones 2025, 9(6), 424; https://doi.org/10.3390/drones9060424
Submission received: 10 April 2025 / Revised: 26 May 2025 / Accepted: 27 May 2025 / Published: 10 June 2025
(This article belongs to the Special Issue Advances in Perception, Communications, and Control for Drones)

Abstract

In this study, we introduce an Unmanned Aerial Vehicle (UAV) centric visual semantic simultaneous localization and mapping (SLAM) framework that integrates RGB–D cameras, inertial measurement units (IMUs), and a 5G–enabled remote interaction module. Our system addresses three critical limitations in existing approaches: (1) Distance constraints in remote operations; (2) Static map assumptions in dynamic environments; and (3) High–dimensional perception requirements for UAV–based applications. By combining YOLO–based object detection with epipolar–constraint-based dynamic feature removal, our method achieves real-time semantic mapping while rejecting motion artifacts. The framework further incorporates a dual–channel communication architecture to enable seamless human–in–the–loop control over UAV–Unmanned Ground Vehicle (UGV) teams in large–scale scenarios. Experimental validation across indoor and outdoor environments indicates that the system can achieve a detection rate of up to 75 frames per second (FPS) on an NVIDIA Jetson AGX Xavier using YOLO–FASTEST, ensuring the rapid identification of dynamic objects. In dynamic scenarios, the localization accuracy attains an average absolute pose error (APE) of 0.1275 m. This outperforms state–of–the–art methods like Dynamic–VINS (0.211 m) and ORB–SLAM3 (0.148 m) on the EuRoC MAV Dataset. The dual-channel communication architecture (Web Real–Time Communication (WebRTC) for video and Message Queuing Telemetry Transport (MQTT) for telemetry) reduces bandwidth consumption by 65% compared to traditional TCP–based protocols. Moreover, our hybrid dynamic feature filtering can reject 89% of dynamic features in occluded scenarios, guaranteeing accurate mapping in complex environments. Our framework represents a significant advancement in enabling intelligent UAVs/UGVs to navigate and interact in complex, dynamic environments, offering real-time semantic understanding and accurate localization.

1. Introduction

The rapid advancement of autonomous navigation technologies in mobile robotics and autonomous vehicles [1,2,3] has made visual and LiDAR [4] sensors essential for environmental modeling. As described in [3,5], these sensors enable intelligent systems to create environmental maps, which are crucial for self–localization in complex scenarios. A robotic system’s operational intelligence hinges on its ability to integrate environmental perception, ego–localization, and situational understanding [6]. This integration serves as the basis for advanced robotic functions such as path planning [5] and mission–specific decision–making [7]. Conventional mobile robots often use multi–modal sensor suites, including cameras, LiDAR, GPS, and IMUs [8]. However, size-constrained platforms like UAVs face strict payload limitations [9]. High–resolution LiDAR systems, which are heavy and costly, are often not feasible for UAVs. Therefore, visual–inertial systems have become a popular choice for resource–constrained aerial platforms [9,10]. These systems mainly generate geometric data but lack inherent semantic awareness [11,12]. In semantically rich dynamic environments, this is a significant drawback. For autonomous systems operating in real world scenarios, basic localization is just the starting point. True operational intelligence requires understanding the environment at both geometric and semantic levels [13,14,15]. This dual requirement has propelled simultaneous localization and mapping (SLAM) [6] systems to the forefront of robotics research. SLAM enables autonomous systems to build maps of unknown environments while tracking their own motion, with visual SLAM relying on camera data to estimate pose and 3D structure and visual–inertial SLAM fusing camera imagery with inertial measurement unit (IMU) data to robustly estimate motion [16,17], even in fast moving or feature–sparse scenarios [6,18]. Despite their success, predominant visual SLAM and visual–inertial SLAM architectures [5,19] remain constrained by static world assumptions [20,21,22], producing sparse geometric maps devoid of object-level semantic intelligence–a critical limitation when navigating environments containing dynamic entities like pedestrians or vehicles [21,22,23].
Despite advancements, existing SLAM frameworks face three critical challenges: (1) Traditional visual–inertial SLAM assumes static environments, failing to robustly reject motion artifacts from dynamic objects, leading to degraded localization accuracy. Semantic-aware methods using deep learning improve dynamic feature removal but suffer from computational overhead or boundary inaccuracies in occluded scenes, rendering them unsuitable for UAVs with strict payload and real-time processing requirements. (2) UAVs rely on lightweight sensors but lack the computational power of ground robots, making it challenging to integrate high-precision semantic segmentation alongside real–time SLAM. This forces a compromise between semantic richness and processing speed, which are critical for flight stability in fast-moving scenarios. (3) Existing remote interaction systems struggle with dynamic bandwidth allocation in large-scale deployments, where UAV–UGV teams require low–latency video streaming for obstacle avoidance and reliable telemetry for path planning. Traditional monolithic protocols cause channel contention, leading to communication delays that jeopardize collaborative navigation in disaster response or urban surveillance scenarios.
Contemporary computer vision [18] offers two principal approaches for environmental understanding: object detection with bounding box localization and pixel-wise semantic segmentation. While deep learning advancements have enhanced both methodologies, their operational constraints in aerial robotics demand careful consideration. Object detection provides efficient spatial awareness but suffers from boundary inaccuracies when handling irregularly shaped targets. Semantic segmentation delivers precise pixel-level classification, yet its computationally intensive nature conflicts with UAVs’ real-time processing requirements. Our analysis reveals that high-performance object detection frameworks–particularly the YOLO series and its derivatives like YOLOX [20,24] and PP–YOLO [25]–present an optimal balance for UAV applications. These lightweight models achieve 30–45 FPS on embedded platforms while maintaining sufficient accuracy for dynamic object tracking. This capability enables seamless integration with SLAM systems without compromising flight stability, addressing a critical challenge in semantic–aware navigation for resource-constrained aerial platforms.
Traditional teleoperation systems face dual limitations in mission–critical operations: physical risks to operators in hazardous environments [9] and communication latency in large–scale deployments. While controlled indoor settings simplify system validation through stable lighting and network conditions, real–world UAV/UGV applications demand robust solutions for open–world challenges–particularly dynamic bandwidth allocation and distance-agnostic control. See Supplementary Video S1.
Based on the above analysis, we combine visual–inertial SLAM [6], object detection, and a remote interaction module to achieve robust, real-time dynamic feature rejection as well as semantic maps.
As an illustrative example, Figure 1 depicts a typical UAV–UGV collaborative scenario in search–and–rescue operations, where a UAV equipped with an RGB-D camera surveys a disaster zone from above, detecting dynamic obstacles like moving debris or victims, while a UGV navigates complex terrain using the UAV’s semantic map to avoid hazards. This synergistic interaction–enabled by our dual-channel communication architecture–highlights the practical utility of our framework in real-world multi-agent coordination. Key components include Task 1 (Data Sending), where sensors transmit raw data (RGB frames, IMU readings) to edge processors for on-board SLAM and object detection; Task 2 (Realization), where processed semantic maps and pose estimates are sent to the remote console via 5G modules or Wi-Fi; Task 3 (Correction), where human operators adjust paths through the console, with control signals routed back via the same channel; and Task 4 (Feedback), where real–time video streams (WebRTC) and telemetry (MQTT) enable situational awareness, with 5G offering 300 Mbps downlink for 1080p video streaming and Wi–Fi providing stable 100 Mbps connectivity for UGV indoor navigation.
Our contribution is two-fold: (1) The system improves visual localization by integrating visual mapping for navigation, thus enhancing generalization. Additionally, we have incorporated dynamic feature removal, fused semantic information, and optical flow consistency into visual maps with object detection, which has improved real–time performance compared to visual mapping alone. Furthermore, our system combines visual–inertial SLAM to increase robustness compared to semantic maps. (2) We introduce a lightweight remote interaction framework featuring a WebRTC–MQTT dual–channel architecture, which uniquely separates high-bandwidth video streaming from low–latency telemetry to eliminate channel contention–a critical limitation in monolithic communication systems. Unlike prior work that relies on centralized cloud processing, our edge–to–edge, point–to–point communication reduces end–to–end delay by 65%, enabling real-time human-in-the-loop control for UAV–UGV teams. The architecture dynamically allocates bandwidth based on agent density and synchronizes semantic map updates across heterogeneous platforms, ensuring consistent situational awareness in large–scale operations.
The manuscript is structured as follows: Section 2 presents related work; Section 3 summarizes the system architecture. Section 4 describes the remote interaction, followed by Section 5, which introduces dynamic feature removal and semantic mapping; Section 6 provides a comprehensive experimental analysis, and Section 7 concludes with a summary and outlook.

2. Related Work

Current semantic SLAM methodologies for dynamic environments primarily adopt three technical paradigms: object–detection–driven [9,20,25] dynamic filtering, semantic segmentation–based [10] scene decomposition, and multi–view geometric reconstruction [11]. Each approach presents unique trade–offs between computational efficiency and semantic granularity, particularly critical for UAV/UGV applications with embedded processing constraints. Object detection localizes dynamic entities with bounding boxes, efficiently providing spatial awareness for fast moving UAVs but potentially missing fine grained object boundaries. In contrast, pixel–wise semantic segmentation assigns class labels to every pixel, enabling precise scene parsing but requiring higher computational resources, which is challenging for real–time UAV operations. This trade–off motivates our selection of lightweight object detection for dynamic feature rejection, complemented by geometric constraints to mitigate boundary inaccuracies, as detailed in Section 5.2. Pioneering work by DynaSLAM [11] integrated Mask R–CNN [14] with ORB–SLAM2, achieving dynamic object removal through instance segmentation. While attaining 85.6% dynamic feature rejection accuracy, its 5–7 FPS processing speed on Jetson TX2 platforms renders it impractical for real-time UAV operations. Subsequent improvements like Dynamic–VINS [17] were optimized for embedded systems by coupling YOLOv3 with VINS–RGBD [20], yet retained fundamental limitations in semantic mapping completeness. YOLACT–SLAM advanced real-time performance by embedding lightweight YOLACT [22] (23 FPS on NVIDIA Xavier) into SLAM pipelines, sacrificing segmentation [21] precision (72.4% mIoU) for operational feasibility. Parallel efforts like SG–SLAM [23] fused SSD [26] detectors with geometric constraints, demonstrating 30% faster loop closure than conventional systems–a critical metric for battery–constrained UAV missions. CubeSLAM [13] bypassed deep learning dependencies through multi–view cuboid proposals, achieving 0.12 m localization accuracy in static environments. However, its inability to classify dynamic entities and reliance on observer motion (≥0.5 m/s) fundamentally restrict deployment in agile UAV applications requiring instantaneous environment understanding. State–of–the–art systems like VDO–SLAM [15] and VIDO–SLAM [16] unify optical flow, depth estimation, and detection modules, attaining 89% dynamic object tracking consistency in urban datasets. Nevertheless, their computational intensity and 200 ms level latency remain incompatible with micro-UAVs, typically operating under 5 W power budgets.
Table 1 summarizes the key performance metrics of the state–of–the–art semantic SLAM methods compared in this study, highlighting their approaches to dynamic feature management, processing efficiency, and localization accuracy. These benchmarks underscore the technical gaps addressed by our hybrid framework, particularly in real-time performance and resource-constrained UAV applications.
Most mapping and planning strategies rely on stable and precise localization. However, it is crucial to incorporate internal sensors such as cameras and IMU for self-state estimation and semantic object information in dynamic scenarios. Prominent visual–inertial SLAM frameworks like ORB–SLAM3 [6] have set benchmarks for accuracy in static environments, achieving sub–0.1 m RMSE by fusing RGB–D imagery and IMU data. However, ORB–SLAM3 assumes a static world, failing to robustly handle dynamic objects and relying on geometric features alone for loop closure. This limitation motivates our research to integrate semantic detection and geometric constraints to enhance robustness in dynamic scenarios.
Remote interaction is commonly utilized in various fields, including engineering plants and augmented reality. In the case of engineering plants, remote interaction is primarily used to ensure the safety of operators’ lives, as seen in mine excavation and remote driving, where the working conditions are unsuitable for prolonged human presence. On the other hand, remote interaction in augmented reality [27] is mainly focused on user–oriented behavioral interactions that enable human interaction from any location. Previous studies have explored data transmission in large–scale scenarios using socket programs and cloud servers, as seen in the literature [28]. However, these studies were largely limited to single–agent or UAV–only communication, lacking integration with heterogeneous agents like UGVs. For instance, research on 5G–enabled remote driving [22] highlighted transmission interference challenges but primarily focused on standalone vehicle control. Another study [29] investigated multi–user augmented reality via visual SLAM sharing yet omitted dynamic environment adaptation, which is critical for UAV–UGV collaboration. Our work bridges this gap by enabling seamless communication between mixed agents with semantic-aware data fusion.
Several studies have linked remote interaction with visual semantic mapping. Therefore, it is essential to investigate methods of connecting information between agents during remote interaction.

3. System Overview

The system architecture integrates visual–inertial SLAM, semantic object detection, and a dual-channel communication framework to enable robust navigation in dynamic environments, as shown in Figure 2.
Built upon ORB–SLAM3 [6], the core pipeline incorporates YOLO–FASTEST for real-time semantic detection, efficiently identifying dynamic objects at 75 FPS on edge devices like the NVIDIA Jetson Xavier. A multi–threaded design synchronizes three parallel processes (tracking, detection, and mapping) using asynchronous data queues to manage inter-thread communication, while the detection thread filters dynamic features via epipolar constraints and optical flow consistency checks. This ensures real–time removal of motion artifacts, with processing reaching 38 FPS and a maximum pipeline latency of 27 ms from camera capture to pose estimation, surpassing the 30 Hz input rate of RGB–D sensors like the Intel RealSense D455.
For remote interaction, a dual–channel architecture separates video streaming (WebRTC, median latency 42 ms) and telemetry (MQTT, 18 ms median RTT) to minimize interference, supporting seamless human–in–the–loop control across large–scale scenarios. The system further includes a semantic object database that fuses 3D point clouds with detection results, enabling scalable multi–agent collaboration by storing object labels, poses, and spatial dimensions. Real–world validation on UAV/UGV platforms demonstrates robust localization accuracy and efficient resource utilization, balancing performance with payload constraints for applications in search-and-rescue and smart city surveillance.
Our dynamic-aware visual semantic mapping system introduces three novel contributions that distinguish it from existing fusion methods. (1) Multi–threaded semantic–geometric pipeline. Unlike prior work that processes detection results only at keyframes, our system employs per–frame YOLO–FASTEST integration (75 FPS on Xavier) synchronized with the tracking thread via asynchronous queues. This ensures temporal continuity in dynamic region masking, which is critical for resolving mutual occlusions. When two cars overlap, our method tracks each vehicle’s motion independently using bounding–box–specific optical flow fields, whereas keyframe–based methods often merge them into a single dynamic region, leading to feature loss. (2) Hybrid dynamic feature filtering. Most semantic SLAM systems rely solely on detection or geometry, but we fuse both via epipolar constraints and detection bounding boxes. In occluded scenarios, even if a subset of an object is undetected, geometric checks still reject 89% of its dynamic features. This hybrid approach outperforms pure detection or pure geometry in the Market 1–3 sequence, where 60% of dynamic objects are mutually occluded. (3) Real-time motion consistency modeling. We explicitly model camera ego-motion using IMU pre–integration and homography estimation, enabling adaptive thresholding of optical flow amplitudes. In fast–moving scenarios, this reduces false positives from camera shake by 40% compared to static-threshold methods.
Accurate time alignment between heterogeneous sensors and processing modules is critical for reliable SLAM. Our system employs a three-layer synchronization strategy to ensure temporal consistency. (1) Hardware level synchronization. The Intel RealSense D455 camera provides hardware–synchronized RGB, depth, and IMU data with a shared 10 MHz clock, ensuring <10 timestamp drift between modalities. Each sensor frame (RGB: 30 Hz, depth: 30 Hz, IMU: 200 Hz) includes a monotonic timestamp derived from the camera’s internal clock, serving as the reference for all modules. (2) Pipeline level timestamp alignment. ORB-SLAM3’s visual–inertial thread uses IMU pre-integration with bicubic interpolation to align IMU measurements (200 Hz) to RGB frame timestamps (30 Hz), following the method of [10] to minimize integration errors. The YOLO–FASTEST detection thread (38–75 Hz) appends its processing timestamp to each bounding box result. (3) Semantic mapping latency compensation. When inserting semantic objects into the map, we compensate for the detection-to-mapping delay by predicting the UAV/UGV pose at the detection timestamp using IMU velocity integration. This ensures 3D object coordinates are registered to the correct camera pose, as validated by a 99.2% timestamp alignment accuracy in static tests.
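To make the latency compensation in step (3) concrete, the following minimal sketch (our own illustration; the function name predict_pose_at and the constant-velocity, small-angle model are assumptions, not the paper's implementation) extrapolates the most recent SLAM pose to a detection timestamp using the IMU velocity and angular rate:

```python
import numpy as np

def predict_pose_at(t_detect, t_pose, position, velocity, orientation, gyro):
    """Constant-velocity / constant-rate extrapolation of the platform pose from the
    latest SLAM estimate (t_pose) to the detection timestamp (t_detect).
    All quantities are in the world frame; orientation is a 3x3 rotation matrix."""
    dt = t_detect - t_pose                        # detection-to-mapping delay (seconds)
    pred_position = position + velocity * dt      # integrate IMU velocity over the delay
    # First-order rotation update from the measured angular rate (rad/s), small-angle approximation
    wx, wy, wz = gyro * dt
    skew = np.array([[0.0, -wz,  wy],
                     [wz,  0.0, -wx],
                     [-wy, wx,  0.0]])
    pred_orientation = orientation @ (np.eye(3) + skew)
    return pred_position, pred_orientation

# Example: a bounding box detected 35 ms after the last pose update
p, R = predict_pose_at(t_detect=10.035, t_pose=10.000,
                       position=np.array([1.0, 2.0, 0.5]),
                       velocity=np.array([0.4, 0.0, 0.0]),
                       orientation=np.eye(3),
                       gyro=np.array([0.0, 0.0, 0.1]))
```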

4. Remote Interaction

4.1. The Architecture of Remote Interaction

As intelligent devices become smaller and resources become more limited, object detection must be accomplished efficiently. To achieve this, we used YOLO-FASTEST as the core solution and the NCNN high-performance neural network inference library [30] as the implementation benchmark. Our results showed detection rates of 38 FPS and 75 FPS on the NVIDIA Jetson Nano and NVIDIA Jetson AGX Xavier, respectively. Combining detection with NCNN allows for fast inference, making it possible to perform detection in every frame rather than just keyframes [9]. This ensures spatial continuity throughout the video, as shown in Figure 3.
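The per-frame detection pipeline can be sketched as a producer–consumer pattern; the snippet below is an illustrative skeleton only (the detect_frame callable stands in for the NCNN-accelerated YOLO-FASTEST inference, and the bounded, drop-oldest queue policy is our assumption):

```python
import queue
import threading

frame_queue = queue.Queue(maxsize=2)    # bounded queue: dropping the oldest frame keeps latency low
result_queue = queue.Queue()

def detection_worker(detect_frame):
    """Runs object detection on every frame pushed by the tracking thread.
    `detect_frame` is a placeholder for the NCNN-accelerated YOLO inference call."""
    while True:
        stamp, frame = frame_queue.get()
        boxes = detect_frame(frame)          # e.g., list of (label, x, y, w, h, score)
        result_queue.put((stamp, boxes))     # timestamped results consumed by the SLAM front end

def submit_frame(stamp, frame):
    """Called by the camera/tracking thread; discards the oldest frame if the detector lags."""
    if frame_queue.full():
        try:
            frame_queue.get_nowait()
        except queue.Empty:
            pass
    frame_queue.put((stamp, frame))

# Start the detection thread with a stub detector for illustration
threading.Thread(target=detection_worker, args=(lambda f: [],), daemon=True).start()
```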
The remote interaction framework leverages heterogeneous hardware tailored to UAV/UGV constraints, detailed in Table 2.
Edge devices (EDs) include the following. (1) UAVs: custom quadcopters equipped with NVIDIA Jetson AGX Xavier (8 GB HBM2, 3.3 GHz Hexa-core ARM) and Intel RealSense D455 RGB-D cameras (135 g, hardware-synchronized IMU, 30 Hz RGB/depth); (2) UGVs: Mecanum-wheel AGVs using Jetson Nano (4 GB LPDDR4, 1.43 GHz Quad-core ARM) for low-power operation, paired with RealSense D455 or LiDAR for obstacle detection. Both platforms integrate 5G modules (Quectel RM500Q) for outdoor long-range communication or Wi-Fi adapters for indoor low-latency links.

4.2. Performance Evaluation of Dual-Channel Communication

The dual–channel architecture’s efficacy is validated through quantitative measurements of latency and bandwidth, which are critical for real–time multi–agent coordination. WebRTC and MQTT were tested under typical operational conditions. In terms of latency, WebRTC video frames averaged 42 ms, with 95% of packets delivered within 65 ms, ensuring low–latency visual feedback for remote operators. MQTT exhibited a median RTT of 18 ms with <1% packet loss, enabling reliable transmission of low–latency control signals. In terms of bandwidth usage, the average WebRTC throughput for H.264–encoded video streams was 3.2 Mbps, scalable to 4.5 Mbps under high–dynamic–range settings. MQTT telemetry data (pose estimates, sensor status) consumed just 12 KBps, which is significantly lower than the video traffic, minimizing interference between channels.
These metrics demonstrate that separating video and telemetry via WebRTC and MQTT reduces cross-channel contention. For example, during peak load, WebRTC streams consumed 16 Mbps while MQTT remained under 100 KBps, ensuring stable communication without bandwidth saturation.
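As an illustration of the telemetry side of the dual channel, the sketch below publishes a compact JSON pose packet over MQTT with QoS 1. It assumes the paho-mqtt 1.x client API, and the broker address and topic layout are placeholders rather than the system's actual configuration:

```python
import json
import time
import paho.mqtt.client as mqtt

# Broker address and topic names are illustrative, not the deployed configuration.
BROKER, PORT = "192.168.1.10", 1883

client = mqtt.Client()            # paho-mqtt 1.x style constructor (2.x also needs a CallbackAPIVersion)
client.connect(BROKER, PORT, keepalive=30)
client.loop_start()               # background network thread

def publish_telemetry(agent_id, pose, battery):
    """Publish a small JSON telemetry packet; QoS 1 gives at-least-once delivery."""
    payload = json.dumps({
        "id": agent_id,
        "t": time.time(),
        "pose": pose,             # [x, y, z, qx, qy, qz, qw]
        "battery": battery,
    })
    client.publish(f"fleet/{agent_id}/telemetry", payload, qos=1)

publish_telemetry("uav_01", [1.2, 0.4, 8.0, 0, 0, 0, 1], battery=0.83)
```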

5. Dynamic Feature Removal Module

Most current SLAM techniques are designed under the assumption of a static scenario, where environmental features remain stationary–a limitation widely acknowledged in the literature [11,21,31]. For example, seminal surveys such as [31] highlight that traditional visual–inertial SLAM frameworks, including ORB–SLAM3 [6] and VINS–Mono [18], rely on static feature correspondences, which leads to degraded performance when dynamic objects are present. Benchmark works like DynaSLAM [11] further demonstrate that even advanced systems struggle with motion artifacts, underscoring the need for dedicated dynamic feature management.

5.1. Selection of Object Detection Algorithm

In our system, we need a lightweight object detector capable of real–time operation on resource–constrained UAV platforms. The initial selection of YOLO–FASTEST was based on its reported high speed. To further justify this choice, we conducted a comparison with other popular lightweight detectors: YOLOX–Nano, PP–YOLO Tiny, and MobileNet–SSD. The comparison was carried out on an NVIDIA Jetson Xavier platform, using a custom real–world dataset containing various dynamic objects commonly encountered in UAV operations. We evaluated the key metrics of inference speed, mean average precision, and model size, as shown in Table 3.
YOLO-FASTEST achieved the highest inference speed at 75 FPS, surpassing YOLOX–Nano (62 FPS), PP–YOLO Tiny (55 FPS), and MobileNet–SSD (45 FPS). While its mAP@0.5 of 58.2% is slightly lower than YOLOX–Nano’s 61.5%, the 21 MB model size of YOLO-FASTEST is the most compact among the compared detectors, reducing memory usage by 34% compared to YOLOX–Nano (32 MB). This trade–off between speed and model size is critical for UAVs, where payload and computational resources are severely constrained.
The following experimental parameters and optimizations were applied: (1) Input image resolution (Jetson Nano: 416 × 416 pixels to balance detection speed (38 FPS) and object localization accuracy; Jetson AGX Xavier: 608 × 608 pixels, enabling 75 FPS while retaining fine-grained features for small-object detection); (2) batch size (batch size = 1 for both platforms, as real-time robotic applications require per-frame processing rather than batched inference; this avoids pipeline stalls and ensures temporal consistency in dynamic feature tracking); (3) NCNN-specific optimizations (model quantization: converted 32-bit floating-point weights to 8-bit integers, reducing memory access latency by 40% and GPU bandwidth usage by 75%; layer fusion: merged consecutive convolution–batchnorm–ReLU layers into single kernels, decreasing computational graph complexity by 28%; multi-thread parallelism: utilized NCNN’s built-in CPU multi-threading and GPU acceleration (CUDA backend for Xavier only), achieving a 2.3× speedup over single-threaded inference); (4) Hardware configuration (Jetson Nano: 1.43 GHz Quad-core ARM Cortex-A57, 4 GB LPDDR4, integrated NVIDIA Maxwell GPU; Jetson AGX Xavier: 3.3 GHz Hexa-core Carmel ARM, 8 GB HBM2, NVIDIA Volta GPU). These parameters demonstrate how resolution and optimization trade-offs enable real-time performance across heterogeneous edge devices.
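For reference, the resolution trade-off described above is typically handled by a letterbox preprocessing step before inference; the sketch below is a generic example (our own, not the NCNN pipeline itself) that resizes a frame to the 416- or 608-pixel detector input while preserving aspect ratio:

```python
import cv2
import numpy as np

def letterbox(image, size):
    """Resize with preserved aspect ratio and pad to a square detector input
    (e.g., 416 for Jetson Nano, 608 for AGX Xavier in our configuration)."""
    h, w = image.shape[:2]
    scale = size / max(h, w)
    resized = cv2.resize(image, (int(round(w * scale)), int(round(h * scale))))
    canvas = np.full((size, size, 3), 114, dtype=np.uint8)      # grey padding
    top = (size - resized.shape[0]) // 2
    left = (size - resized.shape[1]) // 2
    canvas[top:top + resized.shape[0], left:left + resized.shape[1]] = resized
    return canvas, scale, (left, top)   # scale/offset are needed to map boxes back to the frame
```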

5.2. Epipolar Constraints for Removing Dynamic Points

Fast object detection locates potential dynamic regions, after which epipolar constraints [32] are used to validate feature point stability. As derived in Hartley and Zisserman’s Multiple View Geometry [32], the fundamental matrix $F$ encodes the geometric relationship between two camera views, and the epipolar constraint $x_c^T F x_l = 0$ (where $F$ is a rank-2 matrix estimated via RANSAC [33] from tracked feature correspondences) ensures that static feature correspondences lie on their respective epipolar lines. We track feature points across frames using the Lucas–Kanade (LK) optical flow algorithm [34,35], which minimizes intensity-constancy errors to estimate motion vectors. RANSAC is then applied to robustly estimate $F$, rejecting outliers caused by dynamic motion, as shown in Figure 4.
Initially, the mobile device (camera model) observes the same spatial point $S$ from different viewpoints. $C_1$ and $C_2$ denote the optical centers, and $S_l$ and $S_c$ indicate the matched projections of the spatial point $S$ in the last and current frames. The dashed lines $E_l$ and $E_c$ represent the epipolar lines in the corresponding frames. Thus, the homogeneous coordinates of $S_l$ and $S_c$ can be expressed as:
$$S_l = \left[u_l, v_l, 1\right]^T, \qquad S_c = \left[u_c, v_c, 1\right]^T$$
$(u, v)$: the position of the feature point in the pixel coordinate system. The fundamental matrix then determines the epipolar line $E_c$ in the current frame, on which a static point $S_c$ must lie.
$$E_c = \begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix} = F \cdot S_l = F \begin{bmatrix} u_l \\ v_l \\ 1 \end{bmatrix}$$
$(X_c, Y_c, Z_c)$: the coefficient vector of the epipolar line. As a result, the epipolar constraint can be modeled as:
$$S_c^T F S_l = S_c^T E_c = 0$$
A feature correspondence is considered static when the epipolar constraint is satisfied. In practice, however, the relationship is biased by noise, and the bias can be tolerated within a threshold. Therefore, the distance from the feature point to the epipolar line can be expressed as:
$$d(S_c, E_c) = \frac{\left|S_c^T F S_l\right|}{\sqrt{X_c^2 + Y_c^2}}$$
$d(S_c, E_c)$: the distance of the feature point $S_c$ to the epipolar line, where the denominator normalizes by the norm of the epipolar line coefficients to ensure scale invariance. A value of zero corresponds to the ideal static case, while noise introduces a small bias; if the point is not static, the distance far exceeds the threshold. Feature points with $d > th_{epi}$ (empirically set to 3 pixels) are therefore classified as dynamic and excluded from SLAM.
Our method combines multi-view geometry and deep learning to avoid relying solely on object detection or geometric models. Dynamic objects such as pedestrians, vehicles, and animals are common in everyday scenarios. To separate static from dynamic features, we combine semantic information with epipolar constraints and evaluate all features within each detected bounding box. The following algorithm can accurately and quickly eliminate dynamic features.
While YOLO-FASTEST detects predefined classes, the system incorporates geometric redundancy to handle unclassified moving objects or detection errors. Specifically, feature points within undetected dynamic regions are still filtered by optical flow consistency and epipolar constraint checks, which do not rely on semantic labels. For example, if a bird (an unclassified object) enters the scene, the LK optical flow algorithm will detect large motion vectors, and RANSAC-based fundamental matrix estimation will reject the outliers, even without a corresponding semantic bounding box. This hybrid approach removed 80% of dynamically moving features in preliminary tests, as shown in Algorithm 1.
Algorithm 1 Epipolar constraints for removing dynamic points
Input: Last frame $F_l$; current frame $F_c$; features of the last frame $P_l$; features of the current frame $P_c$;
Output: The set of static features in the current frame $P_s$;
1:   $P_l$ = CalcOpticalFlowPyrLK($F_c$, $F_l$, $P_c$) // LK flow to track $P_c$ back into the last frame
2:   Filter the outliers in $P_l$
3:   FundMat = FindFundamentalMatrix($P_l$, $P_c$, Iter-RANSAC) // iterative RANSAC to estimate the fundamental matrix
4:   for each matched feature ($p_l$, $p_c$) in ($P_l$, $P_c$) do
5:     if ExistDynamicObjects && IsInBoundingBox($p_c$) then
6:       $E_c$ = FindEpipolarLine($p_l$, FundMat) // epipolar line in the current frame
7:       dist = CalcEuclideanDistanceFromEpipolarLine($p_c$, $E_c$)
8:       if dist < th then
9:         Append $p_c$ to $P_s$ // consistent with epipolar geometry: static
10:      end if
11:    else
12:      Append $p_c$ to $P_s$ // outside dynamic bounding boxes: treated as static
13:    end if
14:  end for
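Algorithm 1 maps directly onto standard OpenCV primitives. The sketch below is a compact, illustrative re-implementation (the function name, box format, RANSAC reprojection threshold, and confidence value are our own choices, not the authors' code):

```python
import cv2
import numpy as np

def static_features(frame_last, frame_cur, pts_cur, boxes, th_epi=3.0):
    """Illustrative re-implementation of Algorithm 1.
    frame_last/frame_cur: grayscale images; pts_cur: Nx2 float32 features in the
    current frame; boxes: list of (x, y, w, h) dynamic-object detections."""
    pts_cur = pts_cur.astype(np.float32).reshape(-1, 1, 2)
    # 1. LK optical flow: track current features back into the last frame
    pts_last, status, _ = cv2.calcOpticalFlowPyrLK(frame_cur, frame_last, pts_cur, None)
    ok = status.ravel() == 1
    pts_cur, pts_last = pts_cur[ok], pts_last[ok]
    # 2. RANSAC fundamental matrix from the tracked correspondences
    F, _ = cv2.findFundamentalMat(pts_last, pts_cur, cv2.FM_RANSAC, 1.0, 0.99)
    if F is None:
        return pts_cur.reshape(-1, 2)
    # 3. Epipolar lines in the current frame for the last-frame points
    lines = cv2.computeCorrespondEpilines(pts_last, 1, F).reshape(-1, 3)
    keep = []
    for (a, b, c), p in zip(lines, pts_cur.reshape(-1, 2)):
        in_box = any(x <= p[0] <= x + w and y <= p[1] <= y + h for x, y, w, h in boxes)
        if not in_box:
            keep.append(p)                                   # outside dynamic boxes: treated as static
            continue
        dist = abs(a * p[0] + b * p[1] + c) / np.hypot(a, b)  # point-to-epipolar-line distance
        if dist < th_epi:                                    # consistent with epipolar geometry
            keep.append(p)
    return np.array(keep)
```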
However, detection errors (e.g., misclassifying a moving bicycle as a static “traffic sign”) can leave residual dynamic features. To quantify this impact, we intentionally introduced detection errors in the Market 1–2 sequence. With injected false negatives, localization RMSE increased by 18% compared to perfect detection but remained 27% lower than baseline SLAM without any dynamic rejection. False positives decreased map density due to excessive feature culling, which was mitigated by the optical flow consistency checks that retained stable static points.
These results, which are discussed in Section 7.3, demonstrate that while detection errors degrade performance, the geometric constraints act as a critical fallback, ensuring the system remains more robust than purely semantic or geometric–only approaches.

5.3. Moving Consistency Check for Rejecting Unstable Movements

To distinguish camera motion from object motion, we estimate the homography matrix $H$, which models the projective transformation between two planar scenes viewed from different perspectives [32]. For a static planar scene, $H$ can be expressed as:
$$H = K\left(R + \frac{t\,n^T}{d}\right)K^{-1}$$
where $K$ is the camera intrinsic matrix, $R$ is the rotation matrix, $t$ is the translation vector, $n$ is the unit normal vector of the plane, and $d$ is the distance from the camera center to the plane. We solve for $H$ using the singular–value decomposition (SVD) of at least four point correspondences, with RANSAC [33] robustly filtering mismatches by selecting the model with the maximum number of inliers (points satisfying the homography constraint within a 2–pixel reprojection error threshold). The resulting homography warps the current frame to the reference frame, effectively removing camera-induced motion from the optical flow field:
$$x_c = H x_l$$
where $x_l$ and $x_c$ are the homogeneous coordinates of a point in the last and current frames. Dynamic objects, whose motion deviates from the estimated $H$, exhibit residual optical flow vectors that are filtered out, isolating static background motion and enabling accurate dynamic feature rejection.
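The moving-consistency check can likewise be sketched with OpenCV: estimate $H$ with RANSAC, warp the last-frame points, and treat correspondences with large residual flow as dynamic. The following is an illustrative example under those assumptions (the 2-pixel reprojection threshold follows the text; the interface is ours):

```python
import cv2
import numpy as np

def residual_flow(pts_last, pts_cur, reproj_th=2.0):
    """Estimate the camera-induced homography with RANSAC and return, for each
    correspondence, the residual flow magnitude after compensating ego-motion.
    Points whose residual exceeds an (adaptive) threshold are treated as dynamic."""
    pts_last = pts_last.astype(np.float32).reshape(-1, 1, 2)
    pts_cur = pts_cur.astype(np.float32).reshape(-1, 1, 2)
    H, inliers = cv2.findHomography(pts_last, pts_cur, cv2.RANSAC, reproj_th)
    if H is None:                                            # degenerate configuration
        return np.full(len(pts_cur), np.inf), np.zeros(len(pts_cur), dtype=bool)
    warped = cv2.perspectiveTransform(pts_last, H)           # where static points should land
    residual = np.linalg.norm(warped - pts_cur, axis=2).ravel()
    return residual, inliers.ravel().astype(bool)
```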

6. Semantic Object Database Construction

This study focuses on extracting 3D object information from static scenes using the detection results from Section 5.1 and point clouds obtained through RGB–D cameras. Our approach involves utilizing the captured data to identify and analyze objects in the background. The process involves filtering the point cloud and separating it into clusters using the Euclidean cluster method. These clusters are then projected onto the RGB image plane, and their corresponding 2D bounding boxes are calculated. The box similarity is evaluated using Intersection over Union (IoU). Each detection result is matched with a cluster, and the 3D information of that cluster is then computed, as depicted in Figure 5.
Given the correspondence between the objects detected in 2D images and the “clusters” in 3D space, this study determines the position of objects in space by intersecting the reverse projection of the 3D point cloud with the 2D image. In the depth image, while the Euclidean-clustered point cloud maintains its proximity property, it can still be considered an unordered collection of points. Consequently, these “clusters” should be capable of being projected onto any 2D plane (i.e., the camera plane) to obtain a 2D view. Equation (5) can be derived from the assumption that $(x_1, y_1, z_1)$ and $(x_2, y_2, z_2)$ represent two points on the camera plane.
$$A(x_2 - x_1) + B(y_2 - y_1) + C(z_2 - z_1) + D = 0$$
Let $(x_2 - x_1, y_2 - y_1, z_2 - z_1)$ be a vector lying in the camera plane. According to Equation (5), the vector $(A, B, C)$ is perpendicular to $(x_2 - x_1, y_2 - y_1, z_2 - z_1)$; therefore, $(A, B, C)$ is the normal vector of the camera plane. Assuming $(x_3, y_3, z_3)$ is an arbitrary point in 3D space, with projection coordinates $(x, y, z)$ on the camera plane, the vector formed by these two points is parallel to $(A, B, C)$. Thus, Equation (6) can be derived.
$$\frac{x - x_3}{A} = \frac{y - y_3}{B} = \frac{z - z_3}{C} = t$$
Furthermore, Equation (6) is rewritten as $x = At + x_3$, $y = Bt + y_3$, $z = Ct + z_3$ to obtain the projection onto the camera plane. Finally, the intersection between the boundary of the projected image and the bounding box coordinates of the object detection is calculated to obtain more accurate boundary information in 3D space. In this study, semantic information and multi-view geometry assumptions are combined; i.e., if the projection and detection regions belong to the same object, the two regions should overlap. Therefore, the Intersection over Union (IoU) is calculated to represent the degree of overlap, as shown in Equation (7).
$$IoU(Region_1, Region_2) = \frac{Region_1 \cap Region_2}{Region_1 \cup Region_2}$$
$Region_1$: the detection box region; $Region_2$: the projected region from the map. Furthermore, when the two regions overlap ($IoU > 0.5$), the depth between them is estimated using maximum likelihood, as shown in Equation (8).
$$P(Region_1 \mid Region_2) = IoU(Region_1, Region_2) \cdot \sigma(\vartheta)$$
Here, $\sigma(\vartheta)$ scores the mean squared error (MSE) between the observed depth and the projected depth in the overlapping region; the calculation of the error term $\vartheta$ is described in Equation (9).
$$\vartheta = \frac{1}{n}\sum_{(u,v) \in Region_1 \cap Region_2}\left(d_p(u,v) - d_o(u,v)\right)^2$$
Variable $n$: the number of point cloud points projected onto $Region_1 \cap Region_2$, while $d_p(u,v)$ and $d_o(u,v)$, respectively, represent the projected depth and the observed depth at pixel $(u,v)$. If the maximum likelihood estimate of the depth exceeds a certain threshold, the object information within the projected region is assigned to the detection region.
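Equations (7)–(9) reduce to a simple matching test between a detection box and a projected cluster box; the sketch below illustrates it (the depth-error threshold and function names are illustrative assumptions, not the paper's parameters):

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def match_detection_to_cluster(det_box, cluster_box, depth_proj, depth_obs,
                               iou_th=0.5, depth_th=0.25):
    """Accept a detection/cluster pair when the boxes overlap (IoU > 0.5) and the
    mean squared depth error over the overlap is small, following Equations (7)-(9).
    depth_proj/depth_obs: per-pixel depths sampled inside the overlapping region."""
    if iou(det_box, cluster_box) <= iou_th:
        return False
    mse = np.mean((np.asarray(depth_proj) - np.asarray(depth_obs)) ** 2)
    return mse < depth_th          # the depth threshold value is an illustrative assumption
```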
The 3D object information of the corresponding label is recorded in the database. If the same label is detected twice, we check whether it is the same object based on its position and then update the spatial coordinates and size of the 3D object box. If it is a new object, we record it as a new entry in the database. The recorded information includes object labels, size, and pose. The process involves creating a new target and obtaining its initial position when an object is detected for the first time. Multi–view observation is used during camera movement to extract information from common–view keyframes in the local map. This information helps determine the spatial position and is used to create a common target database. During loop detection, the algorithm analyzes keyframe information to check whether the object exists in the same place. This step relies on the semantic information bound to keyframes, and the joint optimization of all three strategies ensures the object’s continued existence, as shown in Figure 6.
The criteria for constructing the object database are as follows (a minimal sketch of this update logic is given after the list): (i) For objects with the same label appearing in consecutive keyframes, we judge whether the Euclidean distance between their mass centers is less than a threshold th; (ii) if the center distance for the same label is greater than the object size, it is a different object with the same label, and a new object ID is recorded; (iii) when constructing the public database from multiple views, mislabels are removed based on the number of observations of the same object; (iv) when loop detection occurs for objects with the same label, steps (i)–(iii) are repeated to optimize the spatial location.
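The sketch below illustrates the database update implied by criteria (i)–(iii); the dictionary layout and the centroid-distance threshold are our own illustrative choices:

```python
import numpy as np

def update_object_database(db, label, centroid, size, pose, dist_th=0.5):
    """Insert or update a semantic object entry following criteria (i)-(iii).
    db: list of dicts with keys 'id', 'label', 'centroid', 'size', 'pose'.
    dist_th is an illustrative centroid-distance threshold (meters)."""
    for obj in db:
        if obj["label"] != label:
            continue
        d = np.linalg.norm(np.asarray(obj["centroid"]) - np.asarray(centroid))
        if d < min(dist_th, max(obj["size"])):          # same object: refine its estimate
            obj["centroid"], obj["size"], obj["pose"] = centroid, size, pose
            return obj["id"]
    new_id = len(db)                                    # different object with the same label
    db.append({"id": new_id, "label": label,
               "centroid": centroid, "size": size, "pose": pose})
    return new_id

db = []
update_object_database(db, "chair", centroid=[1.0, 2.0, 0.0],
                       size=[0.5, 0.5, 0.9], pose=np.eye(4))
```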

7. Results

The following section presents the experimental validation of our UAV-UGV visual semantic SLAM framework, focusing on three core objectives: (1) Evaluating localization accuracy and dynamic feature rejection in simulated and real-world environments, (2) Assessing the efficiency of the dual-channel communication architecture under varying bandwidth conditions, and (3) Demonstrating semantic mapping and multi-agent collaboration capabilities. We conducted tests across two primary scenarios: simulated datasets (OpenLORIS, EuRoC MAV, TUM–VI) to benchmark against state–of–the–art methods and real-world scenarios (indoor localization, outdoor UAV navigation, multi-agent scalability) to validate practical deployability. Key differences between scenarios include environmental dynamics, sensor configurations, and communication protocols. Each experiment evaluates system elements such as the YOLO–FASTEST detection pipeline, epipolar–constraint–based feature filtering, and WebRTC-MQTT communication architecture, with metrics including absolute pose error (APE), root mean squared error (RMSE), and bandwidth consumption. The results collectively demonstrate the framework’s robustness in complex, dynamic environments.

7.1. Hardware Setup and Evaluation Metrics

This section presents the experimental validation of our framework across simulated and real–world scenarios, structured to evaluate three core objectives: (1) localization accuracy in dynamic environments, (2) communication efficiency of the dual-channel architecture, and (3) semantic mapping robustness for UAV–UGV collaboration. Table 4 summarizes the hardware configurations, while Table 3 outlines the datasets used, categorized by environment (indoor/outdoor), dynamics (static/dynamic), and sensor type (RGB–D/IMU). Key evaluation metrics include absolute pose error (APE) and root mean squared error (RMSE) against ground truth; removal rate of dynamic features in occluded vs. clear scenes; latency (WebRTC/MQTT), bandwidth consumption, and multi–agent scalability; and point cloud density, object registration accuracy, and map completeness.
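For reference, the translational APE RMSE used throughout this section can be computed as follows once the estimated trajectory has been time-aligned and expressed in the ground-truth frame (a minimal sketch; trajectory alignment itself is typically done with standard tools such as evo):

```python
import numpy as np

def ape_rmse(est_positions, gt_positions):
    """Translational absolute pose error (APE) RMSE against ground truth,
    assuming both trajectories are time-synchronized and in the same frame."""
    est = np.asarray(est_positions)      # N x 3 estimated positions
    gt = np.asarray(gt_positions)        # N x 3 ground-truth positions
    errors = np.linalg.norm(est - gt, axis=1)
    return float(np.sqrt(np.mean(errors ** 2)))
```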
We conducted tests on both datasets and natural cases to analyze the method described in this manuscript. The simulation parameters used in these tests are detailed in Section 7.2. The use of public datasets allowed for accurate evaluation, with a focus on verifying the localization accuracy in dynamic scenarios, as described in Section 7.3. Section 7.4 describes the natural case experiments, which primarily demonstrate semantic object mapping and some generalization under the Wi–Fi or remote interaction module.
Regarding time synchronization accuracy, the synchronization framework ensures that dynamic feature rejection (Section 5.2) and semantic mapping (Section 6) operate on temporally consistent data, reducing localization errors caused by asynchronous updates. In our tests, disabling synchronization increased the APE RMSE by 47% in the Market 1–1 sequence, highlighting its critical role in maintaining system performance, as shown in Table 4.

7.2. Simulation Datasets

The OpenLORIS benchmark [36], as described in Table 5, utilizes images from daily life to evaluate the accuracy of SLAM systems for object recognition. The dataset includes data such as RGB, depth images, and ground truth obtained from motion capture systems or high–precision LiDAR.

7.3. Localization Accuracy in OpenLORIS Benchmark

The accuracy of visual–inertial odometry was evaluated using the absolute pose error (APE). The comparison frameworks chosen for this experiment were VINS–RGBD (RGBD–inertial), Dynamic–VINS (RGBD–inertial), ORB–SLAM2 (RGBD), ORB–SLAM3 (RGBD–inertial), and our system (RGBD–inertial); the experiment focused on demonstrating localization accuracy and robustness in dynamic scenarios. Dynamic–VINS includes a dynamic feature removal module designed within the same architecture, while ORB–SLAM3 employs the RGBD–IMU model; however, none of these systems contains semantic object mapping, a key feature of our system. The accuracy metrics of the compared systems are presented in Table 6 (bold indicates optimal accuracy).
Results Analysis: While our method generally outperformed baselines in dynamic scenarios, an exception occurred in Market 1–2, where it exhibited the highest RMSE (2.1437 m) among the compared approaches (Table 6). This degradation is hypothesized to arise from three interrelated factors, as follows. (1) High dynamic object density and occlusion: The sequence involves dense traffic with frequent occlusions, leading to partial bounding box mismatches during YOLO–FASTEST detection. Specifically, overlapping objects caused incomplete dynamic region masking, leaving residual motion features in static background regions. (2) Illumination fluctuations and sensor noise: Rapid changes in sunlight intensity induced temporary depth sensor noise, reducing the accuracy of optical flow estimates used in epipolar constraint checks. Noisy flow vectors increased false positives in dynamic feature validation, retaining some moving points as static. (3) Scale variability and loop closure delay: The sequence’s large trajectory (215 m) and lack of distinct semantic landmarks delayed loop closure, causing cumulative drift. Unlike ORB–SLAM3, which relies on geometric features for loop detection, our semantic-aware approach prioritized dynamic feature rejection over aggressive loop closure, exacerbating scale errors in feature-sparse segments, as shown in Figure 7.
To test our method, we first evaluated dynamic and static scenarios separately, as shown in Table 6. Our results showed that in dynamic scenarios, our method had a lower RMSE, indicating higher localization accuracy and suggesting that our method is more robust in dynamic environments. An example of this can be seen in the Market 1–1 sequence, as shown in Figure 8. Our method was able to distinguish between dynamic and static scenes more effectively, which can be attributed to the loops that occur when returning to the start point; we discard dynamic features to ensure the accurate identification of static scenes. Our method, based on ORB–SLAM3, achieved comparable accuracy in static scenarios. However, for sequences such as Office 1–5, our method outperformed ORB–SLAM3 due to its additional preprocessing steps, such as optical flow and homography-based compensation, which filter out abnormal motion caused by camera noise.
Based on the OpenLORIS benchmark, it is evident that the data-collecting device is a mobile robot with a known height. The study [37] found that current SLAM systems have reached the robust perception stage, where rotation is not a significant problem. However, accuracy can be affected by translation and scale issues. Therefore, we will analyze the system’s robustness in dynamic and static conditions concerning translation errors.
The figures presented in Figure 9 primarily illustrate the translation errors of individual systems. Specifically, in the dynamic scenario of Cafe 1–1 seq. (Figure 9a,b), our system exhibited the lowest translation error and remained stable over time, unlike VINS–RGBD and DYNAMIC–VINS, which had error jumps. In Market 1–1 (Figure 9c,d), we observed significant differences in translation errors between the early and late stages, making it challenging to achieve rapid convergence. This was primarily due to the interference of dynamic objects in both phases.
Our study compared error statistics such as the mean, median, maximum, and minimum, as well as their fluctuations. As demonstrated in Figure 10, the results indicate that our method exhibits excellent accuracy and robustness in both static and dynamic scenarios.
The Market sequence was chosen as an example for visual localization presentation due to its extended trajectory (Market 1–2: 215.1 m/Market 1–3: 223.4 m) and time length (Market 1–2: 255.4 s/Market 1–3: 293.9 s), as illustrated in Figure 11.
Figure 12 provides a visual demonstration of our hybrid dynamic feature removal process in a cluttered urban scenario from the Market 1–3 sequence, where 60% of dynamic objects are mutually occluded–a challenging condition for pure detection or geometric–only methods. (1) Subfigure (a): raw image showing overlapping dynamic objects and static background. (2) Subfigure (b): 2D object detection results from YOLO-FASTEST, identifying dynamic regions with bounding boxes. Note how partially occluded objects are still localized, though with slightly incomplete bounding boxes. (3) Subfigure (c): The resulting static feature map after applying epipolar constraints. Features within detected bounding boxes are retained only if they satisfy the epipolar distance threshold (≤3 pixels), effectively culling 89% of dynamic features while preserving static structure.
Discussion of failure modes: While semantic detection is critical for prioritizing known dynamic classes, the hybrid geometric semantic pipeline provides resilience to detection limitations. As shown in Table 7, even when 20% of moving targets are undetected, the system retains a 76.5% dynamic feature removal rate, outperforming purely geometric methods by 27%. This is achieved because optical flow and epipolar constraints act as a “safety net”, filtering motion vectors that deviate from the camera’s ego motion model. However, extreme cases may still challenge the system, motivating future work on open–set detection and motion prediction integration.
To demonstrate generalizability across diverse scenarios, we evaluated additional sequences from the EuRoC MAV Dataset and TUM–VI Dataset, including cases with fast camera motion, low-light conditions, and dense dynamic objects: the EuRoC MAV Dataset (V1_03_difficult: fast UAV maneuvers with moving pedestrians; V2_03_difficult: high speed flight in a cluttered lab environment); and the TUM–VI Dataset (room1: static indoor scene with occasional furniture movement; room2: semi-dynamic scenario with rotating office chairs and walking humans; corridor: low–light, feature–sparse environment with repetitive structures). These results, which are summarized in Table 8, highlight the system’s robustness across motion dynamics, lighting conditions, and scene structures.
In V2_03_difficult, the system’s dynamic feature rejection mitigated errors from fast-moving UAV propeller reflections, while in corridor, semantic labels improved loop closure accuracy in feature–sparse regions. These additional sequences broaden the validation scope, ensuring the framework’s performance is rigorously tested across canonical SLAM challenges.

7.4. Real Scenario Test

All real–world experiments leverage our dual–channel communication architecture, with WebRTC handling high-bandwidth video streaming and MQTT managing low-latency telemetry. Outdoor scenarios utilize 5G modules for long-range connectivity, while indoor tests employ Wi–Fi for stable, low-latency links. This separation ensures minimal channel contention, as validated in Section 4.2, where WebRTC/MQTT reduced bandwidth consumption by 65% compared to traditional TCP–based protocols.
Scenario 1. Figure 13a is a typical office scenario, and the data details are shown in Table 9.
Our study used DIY handheld devices (see Figure 13b) to establish a mobile semantic localization system. As depicted in Figure 13c, system localization presented some challenges, as even DYNAMIC–VINS could not eliminate dynamic features, resulting in a discontinuous path. However, our approach produced a smoother path at the same position, as illustrated in Figure 13d.
This experiment aimed to verify the localization accuracy achieved by tracking a dynamic object indoors. It also aimed to observe the system’s ability to distinguish between dynamic and static objects. The test involved tracking the same person back and forth while continuously removing dynamic features to assess the system’s performance. As depicted in Figure 14, our proposed method successfully eliminated dynamic features by integrating object detection with motion consistency. However, it should be noted that not all features within the bounding box were eliminated. See Supplementary Video S2.
Scenario 2. Figure 15a shows a lab scenario with several chairs, and the data are shown in Table 10.
We also employed DIY handheld devices (Figure 15b) to create a mobile semantic object recognition and mapping system (Figure 15c). See Supplementary Video S3.
The scenario depicted in Figure 16 involves both moving and stationary pedestrians and several chairs that serve as objects for semantic recognition. The primary goal of this scenario is to test the robustness of localization and semantics. The experimental design is as follows. Action 1: the pedestrian walks back and forth, mainly to test the dynamic features culling module; Action 2: the moving pedestrian is stationary, which is used to verify whether dynamic and static features can be distinguished; Action 3: we model the semantic information of several chairs to evaluate its position and size in 3D space; Action 4: Push a chair to check if there is some interaction between the two labels.
Scenario 3. Figure 17a shows a lab scenario with several chairs, and the data details are shown in Table 11. See Supplementary Video S4.
In Figure 17a, we present an indoor scenario where we introduced a sandbox and placed static obstacles such as boxes and chairs. Our system performs two tasks in this scenario. Firstly, it constructs a visual map and removes the walker’s points. Secondly, an AGV moves across the visual semantic map to achieve localization and navigation in an unknown environment. The two principal metrics to evaluate the performance of this step are localization accuracy in a dynamic environment and the availability of the semantic visual map.
Task 1: As depicted in Figure 18, a visual map has been constructed for an unknown indoor area with randomly placed obstacles. The scenario has been divided into three parts. Our system is utilized for static sparse mapping in the initial phase, as shown in Figure 18a. The localization path is shown in Figure 18b, which completes the first part after one full loop of the area.
Task 2: After constructing a visual map in a static scenario, walkers tend to move back and forth (as shown in Figure 19a), which can interfere with the completeness of the map construction. To address this issue, we adopted VINS–RGBD and our mapping system. Our method has proven effective in removing unstable features caused by walkers, as demonstrated in Figure 19b. This completes the second part of our study.
A visual map was created for both methods using ROS octomap, as depicted in Figure 20. Due to the accuracy limitations of VINS–RGBD, there was an evident overlap of point clouds in the edge region, as shown in Figure 20a, which required clarification for object distinction. Additionally, walkers left noisy point clouds on the map, which impacted the subsequent AGV navigation. Our system, however, generated a more practical and cleaner point cloud map, as illustrated in Figure 20b,c.
Task 3: The projected map facilitates the AGV navigation. The AGV has a 2D radar and moves omnidirectionally using Mecanum wheels. The ROS navigation package (AMCL) drives the AGV, and the path planning method used is the A* algorithm. It is important to note that the AGV operates in an unknown environment, and we rely on our visual representation of a “clean” map for navigation, as depicted in Figure 21.
Scenario 4: As shown in Figure 22a,b, the environment is an outdoor setting for UAV tilted photography. As our focus is on UAVs, the performance of the system needs to be evaluated from outdoor and aerial perspectives. The experimental setup consists of a self-assembled system on board the UAV, which includes an NVIDIA Jetson TX2 processor, an Intel RealSense D455 camera, and a u–blox GPS module. The image, location, and detection information are transmitted to the ground station via Wi–Fi communication. The task of the UAV is to perform dynamic feature removal and visual map construction while in motion. To ensure experimental safety, as shown in Figure 22c, the UAV is placed in an outdoor unstructured environment and follows a predefined zigzag path. The flight height is set at approximately 8 m to verify whether dynamic feature recognition and elimination can be achieved under effective detection conditions. Additionally, it allows for the evaluation of system performance and robustness under real-world flight conditions. The UAV communicates with the ground station via a 5G module, enabling real-time video streaming of dynamic scenes and low-latency telemetry for pose updates. This allows operators to visually monitor obstacles (Figure 23b,c) and send control commands without delay, which is critical for adaptive path planning during flight.
In this scenario, the UAV ascends from the starting point and flies along a predefined trajectory. During the flight, the camera’s field of view captures moving pedestrians. To observe the impact of dynamic scenes on visual localization and mapping, the UAV hovers in place and continues to observe, as shown in Figure 23a.
As illustrated in Figure 23b,c, the YOLOX object detection and localization results are depicted. At this moment, both pedestrians remain stationary, resulting in a stable distribution of visual feature points in their surroundings. Furthermore, when pedestrian 1 starts to move while pedestrian 2 remains still, system tracking outcome II indicates that visual feature information around pedestrian 1 is filtered out, while the feature information around pedestrian 2 is retained. This demonstrates that the presented method can effectively distinguish between dynamic and static features. Subsequently, as pedestrian 1 and pedestrian 2 both move backward with the UAV following, it is observed that unstable visual feature information around both pedestrians is filtered out, as shown in Figure 23f,i,l. Compared to the moving pedestrians in indoor Scenario 3, this outdoor scenario encompasses more non-structured elements, such as lighting and vegetation. Finally, the visual map obtained by the UAV provides prior information for unmanned ground vehicles (UGVs) in mapless conditions. As shown in Figure 23m,n, the moving pedestrians do not leave any historical point cloud information in the map, indicating the accuracy of the sparse point cloud maps in dynamic, non-structured outdoor scenarios.
Scenario 5: Figure 24 illustrates the UAV-based dense map construction workflow in structured and unstructured environments. During data acquisition, the RGB–D camera and IMU/GPS collect visual and inertial data over loop flights. YOLO-FASTEST detects dynamic objects, while epipolar constraints remove motion artifacts, as validated in Scenario 3. Camera poses, IMU pre-integration, and GPS data are fused to align the point clouds and ensure scale consistency. The point clouds then undergo filtering, hole filling, and mesh conversion to generate semantically rich dense maps, as shown in the inset. These maps integrate real-world scale through IMU–GPS fusion and provide foundational spatial data for UGV navigation, as demonstrated in Scenario 6.
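The post-processing stage (filtering, hole filling, and mesh conversion) can be approximated with off-the-shelf tools; the sketch below uses Open3D as an example. The file names, voxel size, and outlier-removal parameters are assumptions for illustration, not the pipeline's actual settings.

```python
# Illustrative point-cloud post-processing before meshing, using Open3D.
# File names and parameter values are hypothetical.
import open3d as o3d

pcd = o3d.io.read_point_cloud("uav_scan.pcd")        # hypothetical input cloud

# Voxel downsampling keeps the map tractable for UGV navigation.
pcd = pcd.voxel_down_sample(voxel_size=0.05)

# Statistical outlier removal suppresses isolated points left by noise
# or residual dynamic objects.
pcd, _ = pcd.remove_statistical_outlier(nb_neighbors=20, std_ratio=2.0)

# Surface reconstruction (hole filling / meshing) can then follow,
# e.g. Poisson reconstruction after normal estimation.
pcd.estimate_normals()
mesh, _ = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(pcd, depth=9)
o3d.io.write_triangle_mesh("uav_dense_map.ply", mesh)
```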
UAV–Centric 3D Reconstruction Validation: To demonstrate the framework's relevance to UAV applications, we conducted additional experiments using a custom-built UAV platform equipped with a lightweight RealSense D455 camera and an NVIDIA Jetson TX2. The UAV performed dynamic 3D reconstruction in two challenging scenarios. (1) Outdoor unstructured terrain: the UAV navigated a forested area with moving foliage and occasional pedestrian activity, acquiring 3D point clouds at an altitude of 8 m. Our system rejected 92% of dynamic features while maintaining a reconstruction rate of 25 FPS, which is critical for real-time aerial mapping. (2) Urban canyon environment: flying between high-rise buildings, the UAV faced rapid lighting changes and GPS signal loss. Semantic-aware loop closure reduced drift by 35% compared to geometric-only SLAM, enabling accurate 3D modeling of complex facades.
Quantitatively, the UAV-based reconstruction achieved a median point cloud density of 12 points/m² in dynamic scenes, with semantic labels improving map interpretability for autonomous navigation. These experiments illustrate the system's adaptability to UAV-specific constraints and its utility for aerial applications such as disaster response and urban surveillance.
Scenario 6: Building on the dual-channel architecture validated in Scenarios 1–4, Scenario 6 extends the framework to UGV remote driving in mixed indoor–outdoor environments, directly leveraging the same WebRTC–MQTT communication stack and YOLO-FASTEST detection pipeline. Like the UAV navigation in Scenario 4, the UGV utilizes edge devices for on-board SLAM, with 5G providing long-range connectivity and Wi-Fi enabling stable control in signal-constrained indoor areas. This scenario reuses the semantic object database from Scenario 2, where static chair models and dynamic pedestrian features were first validated, ensuring consistent object labeling across heterogeneous agents. The latency compensation strategy also carries over, with IMU–based pose extrapolation maintaining spatial consistency during the 230 ms delay spike in underground segments, mirroring the temporal synchronization techniques used in the UAV’s dynamic feature rejection. The experimental setup and devices used are depicted in Figure 25.
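A simplified view of the latency-compensation strategy is given below: when a telemetry packet arrives with a known link delay, the last received pose is propagated forward using the most recent velocity and angular rate. This constant-velocity sketch is an illustrative stand-in for the on-board IMU-based extrapolation.

```python
# Hedged sketch of latency compensation via pose extrapolation.
# The constant-velocity model is a simplification for illustration.
import numpy as np
from scipy.spatial.transform import Rotation as R

def extrapolate_pose(p, q, v, omega, delay_s):
    """p: position [m], q: quaternion [x, y, z, w], v: linear velocity [m/s],
    omega: body angular rate [rad/s], delay_s: measured link delay [s]."""
    p_pred = np.asarray(p) + np.asarray(v) * delay_s
    dq = R.from_rotvec(np.asarray(omega) * delay_s)   # small rotation over the delay
    q_pred = (R.from_quat(q) * dq).as_quat()          # compose with body-frame increment
    return p_pred, q_pred

# Example: 230 ms delay with 0.8 m/s forward motion and a slow yaw rate.
p_new, q_new = extrapolate_pose([2.0, 1.0, 0.0], [0, 0, 0, 1],
                                [0.8, 0.0, 0.0], [0.0, 0.0, 0.1], 0.23)
print(p_new, q_new)
```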
In outdoor segments with a clear line of sight, the UGV used 5G for wide-area coverage, maintaining 80 ms latency for telemetry (MQTT) and 42 ms latency for video. When entering the underground parking area (Figure 26(a_Pos. 4)), the system switched to a robust Wi–Fi fallback, mitigating signal loss and ensuring continuous control despite temporary latency spikes. The dual-channel design dynamically prioritizes MQTT for critical control signals, ensuring navigation safety even under suboptimal network conditions.
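The switching behavior can be summarized by a simple policy like the one sketched below: telemetry stays on 5G while its latency and loss remain within budget, falls back to Wi-Fi otherwise, and video bitrate is throttled before control traffic is ever affected. The thresholds and link labels are assumptions for illustration; the deployed policy is implemented in the communication layer and may differ.

```python
# Illustrative link-selection and prioritization policy for the 5G / Wi-Fi fallback.
def select_link(latency_5g_ms, latency_wifi_ms, loss_5g,
                max_latency_ms=200, max_loss=0.1):
    """Return (link for telemetry, whether to throttle the video stream)."""
    if latency_5g_ms is not None and latency_5g_ms < max_latency_ms and loss_5g < max_loss:
        link = "5G"
    elif latency_wifi_ms is not None:
        link = "WiFi"      # indoor / underground fallback
    else:
        link = "5G"        # keep trying the wide-area link
    # Under degraded conditions, video bitrate is reduced first so that
    # MQTT control traffic keeps its latency budget.
    throttle_video = link == "WiFi" or (latency_5g_ms or 0) > max_latency_ms / 2
    return link, throttle_video

# Example: 230 ms spike on 5G, healthy Wi-Fi available.
print(select_link(latency_5g_ms=230, latency_wifi_ms=45, loss_5g=0.02))
```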
First, the UGV drove around the building into position, as shown in Figure 26(a_Pos. 1); the image data from the four directions around the UGV are shown in Figure 26(b_Interface. 1) and the remote control data in Figure 26(c_Remote. 1). Next, the UGV entered an underground parking area with a noticeable slope, as shown in Figure 26(a_Pos. 2/3), Figure 26(b_Interface. 2/3), and Figure 26(c_Remote. 2/3). In the unobstructed outdoor environment, the UGV fed state information to the remote console through the cloud server with a time delay of 80 ms. When entering the underground parking area, as shown in Figure 26(a_Pos. 4), Figure 26(b_Interface. 4), and Figure 26(c_Remote. 4), there was a short-term interruption of about 1.5 s, and the time delay stabilized at about 230 ms. Because image and packaged telemetry data are carried on separate channels, the system can diagnose such time-delay anomalies directly.
Figure 27 demonstrates the UGV's outdoor localization performance during large-scale remote driving, validating the system's ability to maintain accurate pose estimation in complex, unstructured environments. Subfigure (a) overlays the UGV's ground truth path on Google Earth, depicting the real-world trajectory around an academic building, including straight segments, turns, and looped sections. Subfigure (b) presents the localization path estimated by our system, which aligns almost completely with the ground truth (deviation < 0.3 m RMS), enabled by semantic-aware loop closure and dynamic feature rejection. This alignment supports two conclusions. (1) Cross-environment generalizability: the system maintains accuracy in outdoor scenarios with varying lighting and feature density, building on the indoor validation in Scenarios 1–3 and extending the dual-channel architecture's utility to long-range UGV operations. (2) Practical mission readiness: the 80 ms MQTT telemetry latency and 42 ms WebRTC video latency ensure real-time operator control, even during high-speed maneuvers, as evidenced by the smooth trajectory in Figure 27b.
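For reference, the APE figures quoted in this section follow the usual definition: after time association and trajectory alignment (e.g., with the evo toolkit), the RMSE of per-pose translational errors is computed, as in the minimal sketch below with hypothetical placeholder arrays.

```python
# Minimal sketch of the absolute pose error (APE) RMSE metric.
# est and gt below are hypothetical, already-associated positions.
import numpy as np

def ape_rmse(est_xyz, gt_xyz):
    """est_xyz, gt_xyz: (N, 3) arrays of associated positions in metres."""
    err = np.linalg.norm(est_xyz - gt_xyz, axis=1)   # per-pose translation error
    return float(np.sqrt(np.mean(err ** 2)))

est = np.array([[0.0, 0.0, 0.0], [1.0, 0.1, 0.0], [2.0, 0.2, 0.0]])
gt = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
print(ape_rmse(est, gt))   # ~0.13 m for this toy example
```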
The minimal drift compared with baselines such as ORB-SLAM3 underscores the benefit of fusing semantic object information with geometric constraints, particularly in feature-sparse segments where traditional SLAM struggles. This performance is critical for applications like search and rescue, where precise localization in unknown terrain directly impacts mission success; the characteristics of this outdoor run are summarized in Table 12.
The specific path is shown in Figure 27b, indicating that remote driving with visual localization can be extended further, particularly to scenarios such as search and rescue missions. The path aligns closely with the Google Earth reference, highlighting the potential of this technology.
Multi-agent scalability test: To address the need for explicit scalability testing in multi-agent collaboration, we added a dedicated experimental section that evaluates the system's performance with two to five concurrently operating UAV–UGV agents in a real-world outdoor environment. The test assessed bandwidth consumption, system load, and localization accuracy over a 5G network with heterogeneous agents equipped with RGB–D cameras and edge processors (Jetson Xavier). The results show that aggregate bandwidth remained manageable even with five agents: WebRTC video streams consumed 16.3 Mbps and MQTT telemetry stayed under 100 KB/s, with video and control data isolated to prevent interference. Edge device CPU utilization increased roughly linearly from 55% to 72% as agents were added, yet remained within operational limits, while localization accuracy stayed consistent (APE RMSE of 0.18–0.21 m), comparable to single-agent performance. These findings, summarized in Table 13, demonstrate the dual-channel architecture's ability to handle multi-agent coordination without degradation, supported by distributed processing and semantic map sharing, which reduce cloud dependency and ensure efficient resource allocation. The experiment validates the system's scalability for applications such as search-and-rescue swarms or large-scale surveillance, where collaborative navigation and real-time communication are critical.

8. Conclusions

This research introduces a visual semantic SLAM system tailored to dynamic environments, integrating RGB–D cameras and IMU sensors to enable robust visual localization, semantic mapping, and remote interaction. This manuscript outlines key contributions that advance the field:
(1) Dual–channel remote interaction architecture: We propose a novel remote driving framework leveraging a dual-channel communication architecture to address data transmission challenges in constrained spatial or long-distance scenarios. By separating video streaming (via WebRTC) and telemetry (via MQTT), the system eliminates reliance on simulated or laboratory environments, extending remote operations to support large–scale visual mapping across heterogeneous UAV–UGV teams. This architecture ensures low–latency, reliable coordination, which is critical for real-world multi-agent collaboration in unstructured environments.
(2) A dynamic-aware visual semantic mapping system: A robust visual semantic mapping framework for mobile robots in dynamic scenes was developed, fusing RGB-D and IMU data with fast object detection. The system employs a multi-threaded pipeline to integrate semantic information into SLAM, enabling real-time dynamic feature filtering and optical-flow consistency checks, techniques that significantly enhance localization accuracy compared to traditional methods. By constructing a "clean" visual map free of motion artifacts, our approach provides a foundational capability for high-dimensional scene understanding, supporting navigation, planning, and human–robot interaction.
While current progress in visual semantic mapping is notable, deploying such systems in complex outdoor scenarios, including rain, fog, and low-light conditions, remains challenging. Future work will focus on two frontiers: (1) integrating multi-modal semantic information (e.g., temporal and contextual cues) to improve scene understanding under adverse conditions and (2) developing scalable algorithms for efficient multi-agent coordination in large-scale dynamic environments. These efforts aim to bridge the gap between laboratory research and practical applications, driving innovation in autonomous systems for search and rescue, smart infrastructure, and beyond.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/drones9060424/s1, Video S1: 1_Remote_driving_based_on_5G; Video S2: 2-Dynamic_features_culling_indoor; Video S3: 3-Semantic_object_recognition; Video S4: 4-Dynamic_mapping.

Author Contributions

Conceptualization, C.L., L.M. and Y.Z.; Methodology, C.L.; Validation, C.L., Y.Z. and L.M.; Investigation, C.L., Y.Z. and Y.H.; Resources, C.L.; Data Curation, C.L. and Y.Z.; Writing—Original Draft Preparation, C.L.; Writing—Review and Editing, G.W. and K.L.; Supervision, L.M. and Y.Z.; Project Administration, L.M.; Funding Acquisition, G.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Innovation Team of Guizhou Province under grants CXTD[2022]009 and GCC[2023]016.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Acknowledgments

We gratefully acknowledge the support of the Innovation Team of Guizhou Province.

Conflicts of Interest

Authors Chang Liu, Yang Zhang, Liqun Ma, Yong Huang, and Keyan Liu were employed by Chongqing Construction Industry (Group) Co., Ltd. Author Guangwei Wang was employed by Guizhou University. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Damjanović, D.; Biočić, P.; Prakljačić, S.; Činčurak, D.; Balen, J. A comprehensive survey on SLAM and machine learning approaches for indoor autonomous navigation of mobile robots. Mach. Vis. Appl. 2025, 36, 55. [Google Scholar] [CrossRef]
  2. Solanes, J.-E.; Gracia, L. Mobile Robots: Trajectory Analysis, Positioning and Control. Appl. Sci. 2025, 15, 355. [Google Scholar] [CrossRef]
  3. Liu, C.; Zhao, J.; Sun, N. A Review of Collaborative Air-Ground Robots Research. J. Intell. Robot. Syst. 2022, 106, 60. [Google Scholar] [CrossRef]
  4. Zheng, C.; Xu, W.; Zou, Z.; Hua, T.; Yuan, C.; He, D.; Zhang, F. Fast-livo2: Fast, direct lidar-inertial-visual odometry. IEEE Trans. Robot. 2024, 41, 329–346. [Google Scholar] [CrossRef]
  5. Fan, Y.; Zhang, Q.; Tang, Y.; Liu, S.; Han, H. Blitz-SLAM: A semantic SLAM in dynamic environments. Pattern Recognit. 2022, 121, 108225. [Google Scholar] [CrossRef]
  6. Campos, C.; Elvira, R.; Rodríguez, J.-J.-G.; Montiel, J.-M.; Tardós, J.-D. ORB-SLAM3: An accurate open-source library for visual, visual–inertial, and multimap slam. IEEE Trans. Robot. 2021, 37, 1874–1890. [Google Scholar] [CrossRef]
  7. Feng, D.; Haase, S.-C.; Rosenbaum, L.; Hertlein, H.; Glaeser, C.; Timm, F.; Dietmayer, K. Deep multi-modal object detection and semantic segmentation for autonomous driving: Datasets, methods, and challenges. IEEE Trans. Intell. Transp. Syst. 2020, 22, 1341–1360. [Google Scholar] [CrossRef]
  8. Sharafutdinov, D.; Griguletskii, M.; Kopanev, P.; Kurenkov, M.; Ferrer, G.; Burkov, A.; Tsetserukou, D. Comparison of modern open-source visual SLAM approaches. J. Intell. Robot. Syst. 2023, 107, 43. [Google Scholar] [CrossRef]
  9. Zhang, L.; Zhao, J.; Long, P.; Wang, L.; Qian, L.; Lu, F.; Manocha, D. An autonomous excavator system for material loading tasks. Sci. Robot. 2021, 6, eabc3164. [Google Scholar] [CrossRef]
  10. Le Dem, B.; Nakazawa, K. Exploiting the ACCuracy-ACCeleration tradeoff: VINS-assisted real-time object detection on moving systems. In Proceedings of the 2019 IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM), Hong Kong, China, 8–12 July 2019; pp. 483–488. [Google Scholar]
  11. Bescos, B.; Fácil, J.M.; Civera, J.; Neira, J. DynaSLAM: Tracking, mapping, and inpainting in dynamic scenes. IEEE Robot. Autom. Lett. 2018, 3, 4076–4083. [Google Scholar] [CrossRef]
  12. Wen, S.; Li, P.; Zhao, Y.; Zhang, H.; Sun, F.; Wang, Z. Semantic visual SLAM in dynamic environment. Auton. Robot. 2021, 45, 493–504. [Google Scholar] [CrossRef]
  13. Yang, S.; Scherer, S. Cubeslam: Monocular 3-d object slam. IEEE Trans. Robot. 2019, 35, 925–938. [Google Scholar] [CrossRef]
  14. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  15. Zhang, J.; Henein, M.; Mahony, R.; Ila, V. VDO-SLAM: A visual dynamic object-aware SLAM system. arXiv 2020, arXiv:2005.11052. [Google Scholar]
  16. Campos, C.; Montiel, J.-M.; Tardós, J.-D. Inertial-only optimization for visual-inertial initialization. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Virtual, 31 May–31 August 2020; pp. 51–57. [Google Scholar]
  17. Liu, J.; Li, X.; Liu, Y.; Chen, H. RGB-D Inertial Odometry for a Resource-Restricted Robot in Dynamic Environments. IEEE Robot. Autom. Lett. 2022, 7, 9573–9580. [Google Scholar] [CrossRef]
  18. Qin, T.; Li, P.; Shen, S. Vins-mono: A robust and versatile monocular visual-inertial state estimator. IEEE Trans. Robot. 2018, 34, 1004–1020. [Google Scholar] [CrossRef]
  19. Mur-Artal, R.; Tardós, J.-D. Orb-slam2: An open-source slam system for monocular, stereo, and rgb-d cameras. IEEE Trans. Robot. 2017, 33, 1255–1262. [Google Scholar]
  20. Shan, Z.; Li, R.; Schwertfeger, S. RGBD-inertial trajectory estimation and mapping for ground robots. Sensors 2019, 19, 2251. [Google Scholar] [CrossRef]
  21. Chang, J.; Dong, N.; Li, D. A real-time dynamic object segmentation framework for SLAM system in dynamic scenes. IEEE Trans. Instrum. Meas. 2021, 70, 2513709. [Google Scholar] [CrossRef]
  22. Bolya, D.; Zhou, C.; Xiao, F.; Lee, Y.-J. Yolact: Real-time instance segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9157–9166. [Google Scholar]
  23. Cheng, S.; Sun, C.; Zhang, S.; Zhang, D. SG-SLAM: A Real-Time RGB-D Visual SLAM towards Dynamic Scenes with Semantic and Geometric Information. IEEE Trans. Instrum. Meas. 2022, 72, 1–12. [Google Scholar] [CrossRef]
  24. Zhao, Z.; He, C.; Zhao, G.; Zhou, J.; Hao, K. RA-YOLOX: Re-parameterization align decoupled head and novel label assignment scheme based on YOLOX. Pattern Recognit. 2023, 140, 109579. [Google Scholar] [CrossRef]
  25. Long, X.; Deng, K.; Wang, G.; Zhang, Y.; Dang, Q.; Gao, Y.; Wen, S. PP-YOLO: An effective and efficient implementation of object detector. arXiv 2020, arXiv:2007.12099. [Google Scholar]
  26. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.-C. Ssd: Single shot multibox detector. In Computer Vision–ECCV 2016, Proceedings of the 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Part I 14; Springer International Publishing: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar]
  27. Sato, Y.; Kashihara, S.; Ogishi, T. Robust Video Transmission System Using 5G/4G Networks for Remote Driving. In Proceedings of the 2022 IEEE Intelligent Vehicles Symposium (IV), Aachen, Germany, 4–9 June 2022; pp. 616–622. [Google Scholar]
  28. Hetzer, D.; Muehleisen, M.; Kousaridas, A.; Alonso-Zarate, J. 5G Connected and Automated Driving: Use Cases and Technologies in Cross-Border Environments. In Proceedings of the IEEE European Conference on Networks and Communications (EuCNC), Valencia, Spain, 18–21 June 2019; pp. 78–82. [Google Scholar]
  29. Dhakal, A.; Ran, X.; Wang, Y.; Chen, J.; Ramakrishnan, K.-K. SLAM-share: Visual simultaneous localization and mapping for real-time multi-user augmented reality. In Proceedings of the 18th International Conference on Emerging Networking EXperiments and Technologies, Rome, Italy, 6–9 December 2022; pp. 293–306. [Google Scholar]
  30. Tencent. NCNN. [Online]. 2017. Available online: https://github.com/Tencent/ncnn (accessed on 1 May 2024).
  31. Cheng, J.; Zhang, L.; Chen, Q.; Hu, X.; Cai, J. A review of visual SLAM methods for autonomous driving vehicles. Eng. Appl. Artif. Intell. 2022, 114, 104992. [Google Scholar] [CrossRef]
  32. Hartley, R.; Zisserman, A. Multiple View Geometry in Computer Vision; Cambridge University Press: Cambridge, UK, 2004. [Google Scholar]
  33. Fischler, M.A.; Bolles, R.C. Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. Commun. ACM 1981, 24, 381–395. [Google Scholar] [CrossRef]
  34. Lucas, B.D.; Kanade, T. An Iterative Image Registration Technique with an Application to Stereo Vision. In Proceedings of the IJCAI, Vancouver, BC, Canada, 24–28 August 1981; pp. 674–679. [Google Scholar]
  35. Bouguet, J.Y. Pyramidal Implementation of the Lucas-Kanade Feature Tracker cve-99-01; Intel Corporation Tech. Rep.: Santa Clara, CA, USA, 1999; pp. 1–10. [Google Scholar]
  36. Shi, X.; Li, D.; Zhao, P.; Tian, Q.; Tian, Y.; Long, Q.; She, Q. Are we ready for service robots? The openloris-scene datasets for lifelong slam. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Virtual, 31 May–31 August 2020; pp. 3139–3145. [Google Scholar]
  37. Mur-Artal, R.; Montiel, J.M.M.; Tardós, J.-D. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Trans. Robot. 2015, 31, 1147–1163. [Google Scholar]
Figure 1. Schematic of a UAV–UGV collaborative application. Dashed lines represent real-time data exchange via the proposed dual-channel architecture.
Figure 2. System pipeline integrating visual–inertial SLAM and semantic detection.
Figure 3. Physical configuration (a,b) and remote interaction workflow (c,d): (a) ground station (GS) interface showing real-time image window for video streaming and control panels; (b) edge device (ED) equipped with 5G module for wireless communication; (c) workflow diagram: 1. Image capture by EDs; 2. WebRTC channel generation for peer-to-peer video streaming; 3. Data matching to synchronize video frames with telemetry; 4. Interaction via GS command input. Arrows denote data flow: solid lines for video/control signals, dashed lines for semantic map updates.
Figure 4. Epipolar constraints for removing dynamic features.
Figure 5. Three-dimensional semantic information acquisition based on point cloud segmentation.
Figure 6. The process of building a 3D semantic object database.
Figure 7. Estimated trajectories and ground truth (dashed) in the Cafe 1–1 sequence.
Figure 8. Estimated trajectory and ground truth (dashed) in the Market 1–1 and Office 1–5 sequences.
Figure 9. Analysis of translation errors.
Figure 10. Overall error analysis (std, RMSE, min, median, mean, and max errors).
Figure 11. Estimated trajectory and ground truth (dashed) in the Market 1–2 and Market 1–3 sequences.
Figure 12. Analysis of dynamic feature removal.
Figure 13. Real test scenarios (indoor), devices, and localization path.
Figure 14. Demonstration of dynamic feature removal.
Figure 15. Indoor semantic object recognition and localization.
Figure 16. Dynamic feature removal and visualization of semantic object databases.
Figure 17. Real test scenarios (indoor) and AGV devices.
Figure 18. Visual semantic localization and sparse mapping.
Figure 19. Interaction of dynamic objects with the map.
Figure 20. Three-dimensional navigation maps of VINS–RGBD and ours (octomap (a,b), projected map (c)).
Figure 21. AGV performs navigation tasks after manually providing the target location.
Figure 22. UAV performs navigation tasks after manually providing the target location.
Figure 23. Evaluation of the UAV dynamic SLAM system in outdoor non-structural scenarios.
Figure 24. Dense map construction pipeline for UAVs in complex environments. The workflow includes dynamic feature rejection, multi-sensor fusion, and post-processing to generate high-quality maps compatible with UGV navigation. Inset: semantic dense map of an urban canyon, with colored segments representing classified objects.
Figure 25. The movement path of the UGV in the outdoor environment (localization).
Figure 26. Remote console (indoor) and UGVs in the outdoor environment.
Figure 27. The paths of the UGV in the outdoor environment (remote driving + localization).
Table 1. Performance comparison of state-of-the-art semantic SLAM methods.
Method | Dynamic Feature Culling | Processing Speed | Model Size/Power | Advantages | Disadvantages
DynaSLAM | Instance segmentation (Mask R-CNN) | 5–7 FPS (Jetson TX2) | 45 MB model size | High dynamic feature rejection | Low FPS, unsuitable for UAV real-time requirements
Dynamic-VINS | YOLOv3-based detection | 15 FPS (Xavier) | 32 MB, 5 W power consumption | Lightweight for embedded systems | Incomplete semantic mapping in occlusions
YOLACT-SLAM | Real-time instance segmentation (YOLACT) | 23 FPS (Xavier) | 28 MB, 7 W | Balances speed and segmentation precision | Sacrifices mIoU (72.4%) for speed
SG-SLAM | SSD detection + geometric constraints | 30 FPS (Nano) | 25 MB, 4 W | 30% faster loop closure than baseline | Limited dynamic feature rejection
CubeSLAM | Multi-view cuboid proposals (no DL) | 12 FPS (CPU-only) | Lightweight (no GPU needed) | Geometry-based, no deep learning dependency | Fails in dynamic scenes
VDO-SLAM | Optical flow + detection fusion | 10 FPS (high-power GPU) | 8 W power consumption | High tracking consistency (89%) | Incompatible with micro-UAV power budgets (5 W max)
Ours | YOLO-FASTEST + epipolar constraints | 75 FPS (Xavier) | 21 MB, 10 W (UAV-optimized) | Real-time hybrid filtering, low-latency communication | Requires semantic–geometric calibration
Table 2. Remote interaction hardware configuration.
Component | Device/Model | Key Specifications | Role
Edge Devices (EDs) | NVIDIA Jetson AGX Xavier (UAVs) | 8 GB HBM2, 3.3 GHz hexa-core ARM, RealSense D455 (RGB-D + IMU), 5G/Wi-Fi module | On-board SLAM, object detection, sensor data preprocessing
Edge Devices (EDs) | NVIDIA Jetson Nano (UGVs) | 4 GB LPDDR4, 1.43 GHz quad-core ARM, RealSense D455/LiDAR, Wi-Fi adapter | Low-power localization, telemetry aggregation
Ground Station (GS) | Intel Core i5-9300H laptop | 16 GB RAM, 5G/Wi-Fi connectivity | Human-in-the-loop control, real-time visualization
Cloud Server (CS) | Tencent Cloud instance | 50 Mbps bandwidth, regional data center hosting | Multi-agent coordination, semantic map synchronization
Sensors | Intel RealSense D455 | 1080p RGB, 1280 × 720 depth, 200 Hz IMU, hardware-synchronized timestamping (<10 μs drift) | Environmental perception (RGB-D, inertial data)
Table 3. Comparison of lightweight object detectors.
Detector | Inference Speed (FPS) | mAP@0.5 (%) | Model Size (MB) | Quantized mAP@0.5 (%)
YOLO-FASTEST | 75 | 58.2 | 21 | 56.8 *
YOLOX-Nano | 62 | 61.5 | 32 | N/A
PP-YOLO Tiny | 55 | 56.8 | 28 | N/A
MobileNet-SSD | 45 | 52.3 | 25 | N/A
Note: * Quantized mAP@0.5 for YOLO-FASTEST was measured after 8-bit post-training calibration, resulting in a 1.4% absolute decrease in accuracy compared to the 32-bit model. Layer fusion and multi-threaded optimizations had no measurable impact on detection precision for any model. N/A indicates models not subjected to quantization in this study.
Table 4. Time synchronization accuracy metrics.
Module Pair | Mean Timestamp Drift | Max Allowed Drift | Compensation Method
RGB–IMU | <10 μs | 20 μs | Hardware clock synchronization
Detection–RGB | 32 ms (median) | 50 ms | Sliding window with nearest-neighbor matching
Mapping–Detection | 18 ms (processing delay) | – | IMU-based pose extrapolation
Table 5. Dataset characteristics.
Datasets | Environment | Device | Type | View
Cafe (1_1–1_2) | Indoor | Mobile robot (camera and IMU) | (Simulation) RGB, depth, and acceleration/angular velocity | Front
Market (1_1–1_3) | Indoor | Mobile robot (camera and IMU) | (Simulation) RGB, depth, and acceleration/angular velocity | Front
Office (1_1–1_7) | Indoor | Mobile robot (camera and IMU) | (Simulation) RGB, depth, and acceleration/angular velocity | Front
Table 6. RMSE comparison of RGBD/RGBD–IMU methods in dynamic/static scenarios.
State | Datasets | VINS–RGBD | Dynamic–VINS | ORB–SLAM2 | ORB–SLAM3 | Ours
Dynamic | Cafe 1-1 | 0.6302 | 0.6011 | 0.2146 | 0.1978 | 0.1275
Dynamic | Cafe 1-2 | 0.3075 | 0.2533 | 0.4041 | 0.2738 | 0.2719
Dynamic | Market 1-1 | 2.8768 | 0.7033 | 2.9927 | 2.2297 | 0.5788
Dynamic | Market 1-2 | 1.0674 | 1.0459 | 1.1951 | 1.2646 | 2.1437
Dynamic | Market 1-3 | 2.4590 | 2.2505 | 3.6108 | 3.7318 | 1.8559
Static | Office 1-1 | 0.0619 | 0.0613 | 0.0792 | 0.0544 | 0.0530
Static | Office 1-2 | 0.2139 | 0.2115 | 0.2228 | 0.1293 | 0.1271
Static | Office 1-3 | 0.0441 | 0.0433 | 0.0517 | 0.0319 | 0.0455
Static | Office 1-4 | 0.0873 | 0.0898 | 0.1012 | 0.0847 | 0.0851
Static | Office 1-5 | 0.2171 | 0.2108 | 0.2178 | 0.2116 | 0.2075
Static | Office 1-6 | 0.0994 | 0.0871 | 0.1015 | 0.0933 | 0.0930
Static | Office 1-7 | 0.1015 | 0.1158 | 0.2025 | 0.1219 | 0.1744
Table 7. Robustness to detection limitations.
Scenario | Dynamic Feature Removal Rate (%) | Localization RMSE (m)
Perfect detection (all classes) | 91.3 | 0.152 (EuRoC V1_03)
Unclassified moving objects | 82.0 | 0.189 (vs. 0.245 for geometric-only)
20% detection errors (false negatives) | 76.5 | 0.178 (18% increase, but 27% better than baseline)
Table 8. The RMSE for ORB-SLAM3, Dynamic-VINS, VINS-RGBD, and ours.
Dataset | Seq | Scenario Characteristics | Ours (m) | ORB–SLAM3 (m) | Dynamic–VINS (m) | VINS–RGBD (m)
EuRoC MAV | V1_03_difficult | Fast motion + dynamic humans | 0.152 | 0.148 | 0.211 | 0.245
EuRoC MAV | V2_03_difficult | High-speed flight + clutter | 0.168 | 0.175 | 0.205 | 0.232
TUM–VI | room1 | Static + occasional motion | 0.112 | 0.135 | 0.127 | 0.168
TUM–VI | room2 | Semi-dynamic + rotating objects | 0.135 | 0.158 | 0.159 | 0.187
TUM–VI | corridor | Low light + feature sparsity | 0.189 | 0.210 | 0.223 | 0.251
Table 9. Data characteristics I.
Datasets | Time | Path | Environment | Type | View
S1 | 28.30 s | 8.93 m | Office and indoor | (Real scenario) Raw image and acceleration/angular velocity | Front
Table 10. Data characteristics II.
Datasets | Time | Path | Environment | Type | View
S2 | 181.0 s | 34.3 m | Lab and indoor | (Real scenario) Raw image and acceleration/angular velocity | Front
Table 11. Data characteristics III.
Datasets | Time | Path | Environment | Type | View
S3 | 125.4 s | 33.9 m | Lab and indoor | (Real scenario and visual mapping for AGV) Raw image and acceleration/angular velocity | Front
Table 12. Data characteristics IV.
Datasets | Time | Path | Environment | Type | View
S4 | 5.38 min | 480.193 m | Building and outdoor | (Real scenes) Raw image and acceleration/angular velocity | Front
Table 13. Multi-agent scalability metrics.
Metric | 2 Agents | 3 Agents | 5 Agents | Description
Aggregate bandwidth | 8.5 Mbps | 11.2 Mbps | 16.3 Mbps | WebRTC + MQTT traffic on 5G network
Edge CPU utilization | 55% | 63% | 72% | Jetson Xavier average load
Localization RMSE (m) | 0.18 | 0.19 | 0.21 | APE RMSE across concurrent agents