1. Introduction
Within the framework of machine perception, SLAM technology serves as the essential mechanism for facilitating self-location and environmental modeling without pre-existing maps. It relies on the dynamic processing of onboard sensor inputs to derive a robot’s instantaneous pose while simultaneously populating an evolving map of its surroundings. The proliferation of autonomous navigation and immersive computing (AR/VR) has extended SLAM applications far beyond their traditional constraints [1,2,3]. Consequently, these frameworks are increasingly required to maintain robust performance in unpredictable, real-world landscapes characterized by high dynamics and a lack of structured geometry, rather than only in controlled settings. This shift imposes heightened demands on the adaptability and environmental understanding capabilities of SLAM algorithms. Standard visual SLAM frameworks, such as PTAM [4], DSO [5], VINS-Mono [6], and ORB-SLAM3 [7], can achieve accurate trajectory tracking and spatial mapping under the static-scene assumption. However, in scenarios characterized by high dynamics, these systems typically experience a substantial decline in operational effectiveness. Taking the widely used ORB-SLAM2 [8] as an example, when moving objects are present in the scene, its frontend association and backend refinement modules are vulnerable to dynamic features, resulting in significant motion drift, inconsistent map construction, and instability in system tracking.
To optimize the efficacy of visual SLAM frameworks for moving scenes, researchers have introduced various improvement strategies. Early approaches primarily relied on geometric constraints for dynamic feature identification. For instance, DynaSLAM [9] utilized multi-view geometric consistency to distinguish dynamic from static regions, while DS-SLAM [10] combined semantic masks and motion consistency checks for classification. With advances in deep learning, semantic-based approaches have gained prominence. Ref. [11] applies instance segmentation to detect moving targets like “people” and “vehicles”, then filters out the corresponding feature data. However, while geometry-based methods do not rely on semantic priors and semantic-based approaches leverage semantic information, both suffer from the misclassification of static points as dynamic in scenarios with severe occlusion or significant depth noise. In recent years, hybrid approaches integrating geometric and semantic information have garnered significant attention. Examples include SG-SLAM [12] and YPR-SLAM [13], which rely on geometric constraints supplemented by object detection to counteract dynamic object interference. Despite these advances, existing methods exhibit notable limitations. On the one hand, for objects moving along epipolar lines or lacking semantic cues, their apparent motion closely resembles that of static backgrounds, rendering geometric or semantic discrimination ineffective and leading to missed detections. On the other hand, most approaches fail to adequately analyze the temporal consistency of feature point motion, making it difficult to effectively remove residual dynamic features. Furthermore, in terms of mapping, sparse maps offer higher efficiency but lack geometric information, while dense point cloud maps provide comprehensive information at the cost of substantial memory overhead.
In this study, we propose RTS-SLAM, a visual-semantic SLAM architecture capable of real-time operation in dynamic environments. Building upon the ORB-SLAM2 framework, the proposed strategy employs a hierarchical, constraint-driven approach to filter moving feature points. This approach first integrates semantic prior knowledge with epipolar geometry to achieve coarse-grained dynamic feature removal. Subsequently, by introducing trajectory consistency constraints and analyzing the coherence between feature point 3D motion and camera pose transformations, it removes residual dynamic points, significantly improving the system’s localization accuracy in dynamic environments. Concurrently, a mapping strategy that incorporates global sparsification and critical-region refinement is proposed. This approach substantially reduces point cloud memory consumption while preserving the geometric structure and details of key static regions, thereby enabling low-redundancy and accurate dense map generation in dynamic environments.
This work primarily contributes the following:
- (1) Building on the ORB-SLAM2 architecture, we have designed RTS-SLAM, a real-time RGB-D visual SLAM framework. This system identifies and eliminates interfering features in dynamic environments by integrating multiple constraints: semantic, geometric, and trajectory consistency. Experiments demonstrate that, compared to existing methods, RTS-SLAM yields superior localization precision and operational reliability in dynamic scenes, while simultaneously constructing a structurally coherent dense point cloud map in real time.
- (2) The core of this strategy lies in a multi-layer constraint-driven refinement process for dynamic feature elimination. First, it integrates semantic priors from object detection with inter-frame epipolar geometry constraints to perform coarse-grained removal of dynamic features. Subsequently, trajectory consistency constraints are incorporated. By analyzing the consistency between the 3D motion of detected points across consecutive keyframes and camera pose transformations, this approach enables precise removal of residual dynamic points. This strategy eliminates dynamic features while preserving static ones, thereby providing more reliable data for subsequent stages of localization and mapping.
- (3) A mapping strategy combining global sparsification with local refinement of critical regions is proposed. This approach first performs global sparsification of point clouds to substantially reduce the memory consumption of point cloud maps. Concurrently, to prevent loss of key structural features (such as object contours), static critical regions are identified from semantic information and then locally densified and geometrically refined. Experiments demonstrate that the proposed methodology preserves the high fidelity of key scene structures while substantially reducing memory consumption, thereby achieving a favorable balance between processing speed and structural clarity of the map.
The rest of this article is organized as follows. Section 2 reviews relevant work. Methodological details of the system are provided in Section 3. Experimental results are analyzed and discussed in detail within Section 4. Section 5 concludes the paper and presents potential avenues for future work.
2. Related Works
In the field of visual SLAM, addressing the interference of moving objects with system stability has become a significant challenge. Conventional visual SLAM frameworks generally rely on the static-environment assumption. However, in dynamic scenarios, moving objects introduce incorrect feature matches, severely affecting both localization accuracy and operational stability. Based on different strategies for handling dynamic features, existing approaches can be divided into three main categories: geometric constraint methods, semantic-informed models, and fusion-based strategies that combine both.
Geometric strategies primarily rely on multi-view consistency or motion-based verification to identify and reject feature points produced by moving objects. Such approaches typically rest on the assumption that dynamic points violate the geometric constraints satisfied by the static scene. In pure vision systems, the fundamental matrix is frequently estimated with the seven-point algorithm inside a RANSAC loop [14]. Kundu et al. [15] identified dynamic elements by establishing geometric constraints derived from the fundamental matrix. Points are categorized as dynamic when their distance from the associated epipolar line surpasses a pre-established tolerance. Wang [16] proposed a technique for detecting moving objects by employing analytical mathematical models and geometric constraints. This approach first performs a preliminary assessment of an object’s motion through deep clustering and feature statistics, then introduces inter-frame disparity constraints to further validate its motion state. Additionally, ref. [17] enhanced dynamic semantic features in RGB-D point clouds by utilizing optical flow residuals, thereby achieving motion-static segmentation. Ref. [18] proposes an online removal strategy independent of semantic priors, directly filtering dynamic points using geometric motion features. However, such methods often fail to distinguish dynamic from static points in complex scenes featuring slow-moving objects, severe occlusions, or sparse textures, and their computational efficiency frequently falls short of real-time requirements.
The core concept of semantic information methods lies in utilizing deep learning models to acquire pixel-level semantic priors, actively identifying dynamic regions within images, and eliminating all feature points within these areas. This approach circumvents interference from dynamic features at the data source. For instance, the study in [19] proposed a graph-optimization SLAM framework combining YOLOv5 and Wi-Fi fingerprint sequence matching to improve the reliability of loop closure detection and reduce false loop detection in large-scale environments. DS-SLAM [10] integrates the SegNet [20] semantic segmentation network with optical flow techniques. It eliminates dynamic features via pixel-level moving-object detection and motion-consistency verification, thereby constructing a dense octree semantic map better suited for navigation tasks. Dynamic-DSO [21] employs Mask R-CNN [22] for semantic segmentation, deeply embedding semantic information into a direct SLAM framework to enhance system stability in dynamic environments. Detect-SLAM [23] distinguishes moving and stationary objects in keyframes via object detection, combining motion probability matching with the Random Sample Consensus (RANSAC) algorithm to identify moving targets and eliminate mismatches. By removing feature points from dynamic objects and assigning unique IDs to each, it achieves dense mapping of the target objects. Similarly, Dynamic-SLAM [24] adopts the approach of removing all interest points within detected bounding boxes to suppress disturbances caused by dynamic objects. However, purely semantic dynamic exclusion methods suffer from three inherent limitations. Firstly, their performance is highly dependent on neural network models, which makes it difficult to reconcile computational efficiency with segmentation accuracy. Secondly, dynamic recognition capabilities are constrained by predefined semantic categories, resulting in insufficient generalization to unknown dynamic targets beyond the training set. Thirdly, in regions where static and dynamic elements coexist, the absence of geometric constraints often leads to the erroneous rejection of static features that should be retained.
Over the past several years, dynamic SLAM research has evolved from relying on isolated information sources to developing hybrid frameworks that combine semantic and geometric constraints, significantly enhancing localization accuracy and operational efficiency in dynamic environments. Within this trend, multiple fusion strategies have been proposed. For instance, Chang et al. [11] combined YOLACT instance segmentation with optical flow compensation to establish a two-stage dynamic point filtering mechanism, though it remains limited under complex motion conditions. YOLO-SLAM [25] employs YOLOv3 [26] to detect potential moving targets, utilizing RANSAC to eliminate feature points within dynamic regions. OVD-SLAM [27] further integrates YOLOv5 semantic segmentation, optical flow, and depth information. By distinguishing foreground from background and averaging multi-frame reprojection errors, it restores static points on moving object surfaces, thereby adapting to non-rigid motion and low-dynamic scenes. Zhu et al. [28] proposed an improved YOLOv9S-based SLAM framework integrated with ORB-SLAM2, where enhanced object detection, epipolar geometry constraints, and depth separation strategies are employed to remove dynamic features. DIO-SLAM [29] employs YOLACT instance segmentation, dynamically classifying rigid masks using optical flow residuals and consistency criteria, while compensating for misdetections by leveraging temporal frame propagation techniques. While effective under typical dynamic conditions, these hybrid approaches still encounter constraints when faced with sophisticated movement patterns. Specifically, techniques utilizing optical flow consistency often struggle to identify non-rigid motion or objects moving along epipolar lines, which may lead to detection failures and the erroneous removal of static points.
To mitigate the detrimental impact of environmental dynamics on SLAM reliability, we introduce a hierarchical framework driven by multiple constraints to eliminate dynamic feature points. Initially, the proposed strategy incorporates semantic information alongside geometric constraints to perform preliminary removal of dynamic features. Subsequently, trajectory consistency constraints are introduced for refined filtering, enabling accurate detection and removal of potential dynamic outliers. The proposed strategy significantly improves the precision of feature removal while enhancing overall system efficiency.
3. System Overview
This section provides an overview of the RTS-SLAM system structure, which is divided into five components. First, it outlines the overall architecture of the system along with its basic operational sequence. Second, the object detection module is introduced, which provides semantic feature inputs to the system. Third, the dynamic feature classification criteria based on epipolar geometry constraints are described. Fourth, a refinement strategy for dynamic feature rejection based on trajectory consistency constraints is presented. Finally, the dense map construction method used in the system is detailed.
3.1. System Framework
Derived from ORB-SLAM2 [8], the RTS-SLAM framework augments the standard execution threads (tracking, local mapping, and loop detection) by integrating two parallel processing units for target recognition and dense mapping. By leveraging a multi-threaded parallel architecture, the framework preserves precise self-localization capabilities while significantly enhancing operational efficiency. The holistic design layout of this platform is visualized in Figure 1. Once initialized, the system maintains synchronized ingestion of both color frames and depth maps sourced via the RGB-D camera. The tracking thread executes ORB feature extraction on the incoming image frames, providing the fundamental input for the real-time localization algorithm. Subsequently, the system integrates semantic priors provided by the object detection thread to perform preliminary motion classification on the feature points. Utilizing epipolar geometry constraints, it completes an initial round of coarse-motion feature rejection based on semantic and geometric fusion. Building upon this, trajectory consistency constraints are incorporated. By analyzing whether the three-dimensional motion trajectories of feature points across consecutive key frames align with camera pose transformations, the system achieves refined rejection of residual motion features. Within the dense mapping thread, we employ a mapping strategy combining global sparsification with local refinement in critical regions. This approach significantly reduces point cloud storage redundancy through global sparsification while enhancing local point density and refining geometric structures in structurally critical static areas. Ultimately, the system generates dense point cloud representations in dynamic settings, thereby balancing global coherence with high-fidelity local attributes.
3.2. Object Detection
Dynamic environments pose significant interference to visual SLAM systems. To address this, the system incorporates a lightweight object detection module. This module leverages the NCNN architecture, which is specifically tailored for mobile deployment to facilitate efficient neural network inference. Developed entirely in native C++ without external library dependencies, its highly parallelized architecture ensures real-time operational efficiency on resource-constrained platforms.
In model selection, the system employs the lightweight Single Shot MultiBox Detector (SSD) [30] architecture. This framework achieves both object localization and classification through a single forward pass, with relatively straightforward post-processing, thereby significantly reducing computational latency. To further optimize efficiency, the SSD backbone utilizes the lightweight MobileNetV3 [31]. This network employs separable convolutions and neural architecture search techniques to substantially reduce parameter count and computational overhead while maintaining detection accuracy. The complete model leverages prior knowledge acquired from the Pascal VOC [32] corpus, facilitating the robust identification of prevalent non-static entities, including human figures and various automobiles.
In contrast to the computationally intensive semantic segmentation networks utilized in [9,10,11,33], the SSD detector integrated into our framework outputs only bounding boxes and category labels. This simplification bypasses the pixel-level mask generation process and markedly reduces processing delay. This matters because the SLAM system is the cornerstone of mobile robot state estimation, so its real-time responsiveness is a critical prerequisite for the successful execution of high-level missions.
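As a minimal sketch of how bounding-box output can feed the later filtering stages, the snippet below separates feature points that fall outside every detected box from those inside one. The function name `outside_boxes` and the `(x_min, y_min, x_max, y_max)` box layout are assumptions made for illustration; they are not taken from the RTS-SLAM implementation.

```python
import numpy as np

def outside_boxes(points, boxes):
    """Return a boolean mask: True where a point lies outside all boxes.

    points: (N, 2) pixel coordinates (x, y).
    boxes:  iterable of (x_min, y_min, x_max, y_max), an assumed layout.
    """
    points = np.asarray(points, dtype=float).reshape(-1, 2)
    keep = np.ones(len(points), dtype=bool)
    for x_min, y_min, x_max, y_max in boxes:
        inside = (
            (points[:, 0] >= x_min) & (points[:, 0] <= x_max)
            & (points[:, 1] >= y_min) & (points[:, 1] <= y_max)
        )
        keep &= ~inside  # a point inside any box is a dynamic candidate
    return keep

# Hypothetical example: one detected "person" box and three keypoints.
keypoints = [[10, 10], [50, 50], [200, 100]]
person_boxes = [(40, 40, 120, 120)]
static_candidates = outside_boxes(keypoints, person_boxes)
```

Points inside a box are only candidates for removal at this stage. Their actual motion state is decided by the geometric checks described in the following sections.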
3.3. Epipolar Constraints
In dynamic visual SLAM systems, accurate identification of static references within dynamic environments is essential to mitigate estimation drift and achieve high pose accuracy. While deep learning–based object detection methods provide semantic prior information for identifying potential moving objects such as pedestrians and vehicles, semantic information alone cannot determine their actual motion state at the observation moment. For instance, seated pedestrians or stationary vehicles detected as ‘people’ or ‘vehicles’ may contain feature points that remain static, contributing to camera motion estimation. Indiscriminately discarding all feature points associated with such semantic categories would not only discard valuable static observations but also compromise system stability in environments with many stationary structures. To further determine feature point motion states based on semantic prior knowledge, this paper introduces epipolar geometry constraints for preliminary dynamic feature screening between adjacent frames. Initially, the system establishes ORB feature correspondences across consecutive frames using a pyramidal iterative Lucas–Kanade (LK) optical flow scheme, while filtering out unreliable matches in peripheral image regions. Based on these correspondences, the inter-frame fundamental matrix is estimated from stable matches using the RANSAC-based seven-point method. By leveraging the geometric constraints defined by this fundamental matrix, the distance between each feature point and its corresponding epipolar line is computed. If this value remains below a preset threshold, the point is classified as static and retained; otherwise, it is flagged as dynamic and subsequently eliminated.
As illustrated in Figure 2, the pinhole camera model characterizes the geometric correspondence between the same spatial point $P$ across adjacent image frames. $O_1$ and $O_2$ denote the optical centers of the camera in the preceding and current frames, respectively. The spatial point $P$ is mapped to feature points $p_1$ and $p_2$, respectively, within the preceding and current frame-relative camera coordinate systems. The plane defined by $O_1$, $O_2$, and $P$ is designated as the epipolar plane; its intersections with the image planes of the two frames yield the epipolar lines $l_1$ and $l_2$ (illustrated by red dashed segments), while $P'$ and $p_2'$ denote the shifted map point and its reprojected image coordinates. Feature points $p_1$ and $p_2$ are expressed in homogeneous coordinates as follows:

$$p_1 = [x_1, y_1, 1]^{T}, \quad p_2 = [x_2, y_2, 1]^{T} \tag{1}$$

where $x$ and $y$ constitute the pixel coordinates of the feature point. The fundamental matrix $F$ is then utilized to determine the epipolar line $l_2$ within the current frame:

$$l_2 = [X, Y, Z]^{T} = F p_1 \tag{2}$$

where $X$, $Y$, and $Z$ constitute the elements of the line vector. Based on the principles in [15], the epipolar constraint is defined by

$$p_2^{T} F p_1 = 0 \tag{3}$$

For feature point $p_2$, the distance $D$ to its corresponding epipolar line is given by

$$D = \frac{\left| p_2^{T} F p_1 \right|}{\sqrt{X^{2} + Y^{2}}} \tag{4}$$
According to Formula (4), feature point $p_2$ ideally falls precisely onto the epipolar line $l_2$ in the current frame; consequently, the calculated distance $D$ should be zero. In practical operating scenarios, however, due to interference from diverse noise sources, feature points frequently deviate from the theoretical epipolar line, in which case the value of $D$ exceeds zero while remaining within the allowable tolerance margin. As shown in Figure 2a, the movement of point $P$ to $P'$ causes the deviation of the projected point $p_2'$ from the epipolar line to exceed the allowable tolerance margin, leading to its categorization as a dynamic feature and its immediate removal. Nevertheless, relying solely on the epipolar constraint is insufficient to address all dynamic scenarios. In Figure 2b, geometric degeneracy occurs for dynamic points moving within the epipolar plane: if spatial point $P$ translates radially along the extension of the ray $O_1P$, then upon reaching position $P'$ its projection $p_2'$ still lies on the epipolar line $l_2$, so that Formula (4) yields $D = 0$. The point is therefore misclassified as static. To handle this case, this paper introduces a rejection strategy based on trajectory consistency constraints. By analyzing the consistency between the 3D motion paths of observations and camera pose updates, this constraint identifies dynamic points moving along the epipolar line, thereby achieving a more refined elimination of residual dynamic features.
Figure 2.
Epipolar geometry constraints. (a) The general case of dynamic features subject to epipolar geometric constraints. (b) The degenerate scenario where dynamic features undergo collinear motion relative to the epipolar line direction.
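The epipolar test described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code: the fundamental matrix is assumed to be given (in practice it would come from a RANSAC estimate), and the names `epipolar_distances`, `flag_dynamic`, and the threshold `eps` are hypothetical.

```python
import numpy as np

def epipolar_distances(F, pts_prev, pts_cur):
    """Distance from each current-frame point to its epipolar line F @ p1."""
    pts_prev = np.asarray(pts_prev, dtype=float).reshape(-1, 2)
    pts_cur = np.asarray(pts_cur, dtype=float).reshape(-1, 2)
    ones = np.ones((len(pts_prev), 1))
    p1 = np.hstack([pts_prev, ones])          # homogeneous previous points
    p2 = np.hstack([pts_cur, ones])           # homogeneous current points
    lines = p1 @ np.asarray(F, float).T       # each row is [X, Y, Z] = F p1
    numerator = np.abs(np.sum(p2 * lines, axis=1))   # |p2^T F p1|
    denominator = np.hypot(lines[:, 0], lines[:, 1])  # sqrt(X^2 + Y^2)
    return numerator / np.maximum(denominator, 1e-12)

def flag_dynamic(F, pts_prev, pts_cur, eps=1.0):
    """True where the epipolar distance exceeds the (assumed) threshold eps."""
    return epipolar_distances(F, pts_prev, pts_cur) > eps

# Toy setup: identity intrinsics and pure camera translation along x,
# for which F reduces to the skew-symmetric matrix of t = (1, 0, 0).
F_toy = np.array([[0.0, 0.0, 0.0],
                  [0.0, 0.0, -1.0],
                  [0.0, 1.0, 0.0]])
prev_pts = [[0.1, 0.2], [0.1, 0.2]]
cur_pts = [[0.5, 0.2], [0.5, 0.5]]   # first slides along the line, second leaves it
dists = epipolar_distances(F_toy, prev_pts, cur_pts)
```

In this toy setup the first match moves only along its epipolar line and therefore has zero distance regardless of whether the underlying 3D point is moving, which is the degenerate case that motivates the trajectory consistency check.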
3.4. Trajectory Consistency Constraint Rejection Strategy
To address the limitations of geometric constraints and semantic detection in removing dynamic feature points, trajectory consistency constraints analyze the 3D motion patterns of features across successive frames. The fundamental principle relies on the spatial alignment between feature displacement and camera ego-motion. While the apparent motion of stationary landmarks follows the camera motion, descriptors from moving objects exhibit significant motion inconsistencies. Therefore, by evaluating the alignment between feature trajectories and camera motion, dynamic and static elements can be effectively differentiated. In practical implementation, matched keypoints from consecutive frames are first transformed into the 3D camera coordinate system using depth information. Subsequently, the spatial displacement of each feature point across two frames and the ego-motion displacement of the camera are computed. By comparing these two displacement magnitudes, the motion state of each point can be determined. If the difference between the feature displacement and camera motion exceeds a predefined threshold, the point is classified as dynamic and removed; otherwise, it is retained as static.
In the proposed methodology, suppose that $N$ pairs of matched feature points are identified between two consecutive frames, which can be formally represented as

$$\mathcal{M} = \left\{ \left( p_i^{k-1}, p_i^{k} \right) \right\}_{i=1}^{N} \tag{5}$$

Let $p_i^{k-1} = (x_i^{k-1}, y_i^{k-1})$ and $p_i^{k} = (x_i^{k}, y_i^{k})$ denote the image-plane coordinates of the $i$-th feature across consecutive frames. For the purpose of mapping these pixel locations into the camera coordinate system, they are reformulated into a homogeneous coordinate form

$$\tilde{p}_i^{k-1} = [x_i^{k-1}, y_i^{k-1}, 1]^{T}, \quad \tilde{p}_i^{k} = [x_i^{k}, y_i^{k}, 1]^{T} \tag{6}$$

By leveraging the information from the depth maps $D_{k-1}$ and $D_k$, the depth values associated with these two frames are represented as

$$d_i^{k-1} = D_{k-1}(x_i^{k-1}, y_i^{k-1}), \quad d_i^{k} = D_{k}(x_i^{k}, y_i^{k}) \tag{7}$$

The camera’s intrinsic matrix is

$$K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \tag{8}$$

The variables $f_x$ and $f_y$ correspond to the focal lengths along the horizontal and vertical image axes, with $(c_x, c_y)$ marking the principal point where the optical axis intersects the image plane. Utilizing these intrinsic parameters, feature points are back-projected into the 3D camera coordinate space as follows:

$$P_i^{k-1} = d_i^{k-1} K^{-1} \tilde{p}_i^{k-1}, \quad P_i^{k} = d_i^{k} K^{-1} \tilde{p}_i^{k} \tag{9}$$

Subsequently, the geometric displacement $\Delta P_i$ of each feature point in 3D space is evaluated across the sequential observations.

$$\Delta P_i = \left\| P_i^{k} - P_i^{k-1} \right\|_2 \tag{10}$$

The translation component $t$ is extracted directly from the camera’s estimated pose transition matrix $T$, where the magnitude of this vector represents the camera’s own motion displacement $\Delta C$:

$$\Delta C = \left\| t \right\|_2 \tag{11}$$

In static scenes, the feature displacement is primarily caused by the camera’s own motion, and its magnitude should satisfy:

$$\Delta P_i \approx \Delta C \tag{12}$$

Because the same pixel-level error maps to a larger 3D error at greater depth, the reliability of the computed 3D displacement varies with depth. Applying a uniform fixed threshold may therefore misclassify the small displacements of nearby dynamic points as static, or flag distant static points as dynamic when depth noise inflates their apparent displacement. To achieve discrimination that better aligns with the true geometric relationships, the trajectory consistency constraint must incorporate the depth of each feature point. This study introduces a depth-dependent tolerance threshold $\tau_i$ into the trajectory consistency constraint to adjust the acceptable displacement range at different depths.

$$\tau_i = \tau_0 + \lambda z_i \tag{13}$$

In formulation (13), $\tau_0$ denotes the baseline tolerance, primarily employed to compensate for camera intrinsic residuals, pixel quantization noise, and minor stochastic fluctuations during feature extraction. $\lambda$ is a scaling factor reflecting the influence of depth, designed to balance the heteroscedasticity of depth measurement uncertainty that intensifies with increasing distance. Although $d_i^{k}$ and $z_i$ both originate from the same depth map and encapsulate depth information, they fulfill distinct roles: $d_i^{k}$ serves as a scale factor for 3D back-projection in coordinate transformation, whereas $z_i$ acts as a depth scalar to modulate the trajectory consistency threshold during motion identification.

Potential dynamic outliers are isolated by assessing the absolute deviation between feature displacement and the camera’s ego-motion relative to the threshold $\tau_i$: a point is classified as dynamic when

$$\left| \Delta P_i - \Delta C \right| > \tau_i \tag{14}$$
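The trajectory consistency test can be sketched as follows. This is a hedged illustration under stated assumptions rather than the authors' implementation: the tolerance is assumed to take a linear form (a baseline plus a scale times depth), the parameter values `tau0` and `lam` are placeholders, and, as in the description above, only displacement magnitudes are compared, so camera rotation is not compensated.

```python
import numpy as np

def backproject(pixels, depths, K):
    """Back-project pixels (N, 2) with depths (N,) into camera coordinates."""
    pixels = np.asarray(pixels, dtype=float).reshape(-1, 2)
    depths = np.asarray(depths, dtype=float)
    homog = np.hstack([pixels, np.ones((len(pixels), 1))])
    rays = homog @ np.linalg.inv(K).T      # K^{-1} applied to each pixel
    return rays * depths[:, None]          # scale each ray by its depth

def trajectory_dynamic_mask(pix_prev, pix_cur, d_prev, d_cur, K, t,
                            tau0=0.02, lam=0.01):
    """True where |feature displacement - camera displacement| > tolerance.

    tau0 and lam are assumed values; the tolerance tau0 + lam * depth is an
    assumed linear form of the depth-dependent threshold.
    """
    d_prev = np.asarray(d_prev, dtype=float)
    d_cur = np.asarray(d_cur, dtype=float)
    P_prev = backproject(pix_prev, d_prev, K)
    P_cur = backproject(pix_cur, d_cur, K)
    feature_disp = np.linalg.norm(P_cur - P_prev, axis=1)
    camera_disp = np.linalg.norm(np.asarray(t, dtype=float))
    tol = tau0 + lam * d_cur
    valid = (d_prev > 0) & (d_cur > 0)     # skip points without depth
    return valid & (np.abs(feature_disp - camera_disp) > tol)

# Toy example: the camera translates 0.1 m along x; the first point is static,
# the second one has moved 0.3 m between the two frames.
K_toy = np.array([[500.0, 0.0, 320.0],
                  [0.0, 500.0, 240.0],
                  [0.0, 0.0, 1.0]])
mask = trajectory_dynamic_mask(
    pix_prev=[[320, 240], [320, 240]],
    pix_cur=[[295, 240], [395, 240]],
    d_prev=[2.0, 2.0], d_cur=[2.0, 2.0],
    K=K_toy, t=[0.1, 0.0, 0.0],
)
```

Points with missing depth in either frame are left unflagged here, which mirrors the requirement that valid depth values exist in both frames before the check is applied.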
The procedure for rejecting dynamic features is detailed in Algorithm 1. This method first performs preliminary screening of feature points based on epipolar constraints to obtain candidate static features. Subsequently, it combines depth information and introduces trajectory-consistency constraints to conduct a secondary, refined elimination of the remaining dynamic features.
| Algorithm 1 Dynamic Feature Rejection Strategy |
- Input: Previous frame’s feature points, $P_1$; current frame’s feature points, $P_2$; previous frame’s static feature points, $S_1$; current frame’s static feature points, $S_2$; previous frame, $I_1$; current frame, $I_2$; previous depth map, $D_1$; current depth map, $D_2$; camera intrinsic matrix, $K$; dynamic object culling threshold, $\varepsilon$; tolerance threshold, $\tau$; number of feature points threshold, $N_{th}$;
- Output: Set of identified non-static points in the current frame, $S_{dyn}$;
- 1: Track $P_1$ from $I_1$ into $I_2$ with pyramidal LK optical flow to obtain $P_2$;
- 2: Filter out potential moving features located near image boundaries;
- 3: for each matched pair $(p_1, p_2)$ within $(P_1, P_2)$ do
- 4:     if point $p_2$ is situated outside semantic dynamic regions then
- 5:         Append $(p_1, p_2)$ to $(S_1, S_2)$
- 6:     end if
- 7:     Feature++
- 8: end for
- 9: if Feature $\geq N_{th}$ then
- 10:     Compute fundamental matrix $F$ via RANSAC on $(S_1, S_2)$;
- 11: else if Feature $< N_{th}$ and DynamicFlag is active then
- 12:     Compute fundamental matrix $F$ via RANSAC on $(P_1, P_2)$;
- 13: end if
- 14: for each matched pair $(p_1, p_2)$ in $(P_1, P_2)$ do
- 15:     if distance from $p_2$ to its epipolar line $> \varepsilon$ then
- 16:         Append $p_2$ to $S_{dyn}$;
- 17:     else if valid depth values exist in both $D_1$ and $D_2$ then
- 18:         Recover 3D spatial coordinates $P^{1}$ and $P^{2}$ using $D_1$, $D_2$, and $K$;
- 19:         Calculate individual point displacement: $\Delta P = \| P^{2} - P^{1} \|_2$;
- 20:         Extract camera ego-motion displacement from pose: $\Delta C = \| t \|_2$;
- 21:         if displacement discrepancy $| \Delta P - \Delta C | > \tau$ then
- 22:             Record $p_2$ into the dynamic feature set $S_{dyn}$;
- 23:         end if
- 24:     end if
- 25: end for
3.5. Dense Mapping
Within the traditional ORB-SLAM2 [8] framework, maps are primarily constructed based on sparse feature points. While this approach ensures stable pose tracking, it struggles to accurately capture the fine-grained geometric structure of real-world scenes. These limitations are notably intensified in dynamic scenarios, where the limited descriptive power of sparse data, coupled with its susceptibility to disturbances stemming from mobile entities, significantly impairs the integrity and utility of the generated mapping output. To address these deficiencies, RTS-SLAM maintains sparse mapping for efficient pose estimation while concurrently integrating depth information to construct dense point clouds. This strategy substantially bolsters the system’s capability for comprehensive geometric representation of the environment.
In large-scale SLAM scenarios, dense mapping significantly increases the number of points in the global point cloud. This imposes a dual challenge: it extends the duration of the dense mapping phase, thereby reducing the execution speed of the overall SLAM system, and introduces substantial memory requirements. Such high memory demands create a risk of memory overflow, which consequently compromises system stability. Typically, research in this field prioritizes the detailed representation of salient objects within dense point clouds, while assigning lower importance to background elements such as walls and tabletops.
Based on the aforementioned scenarios and requirements, RTS-SLAM adopts a dense-mapping strategy that integrates global sparsification and key-region refinement, as described in Algorithm 2. Global sparsification refers to applying voxel grid filtering to the depth point cloud of the keyframe. This process downsamples the point cloud by retaining one representative point within each voxel cell.
The mathematical formulation of voxel grid filtering is given as follows. The original point cloud generated from the depth map $D_k$ of the keyframe $KF_k$ is defined as:

$$\mathcal{P}_k = \left\{ q_j = (x_j, y_j, z_j)^{T} \right\}_{j=1}^{N} \tag{15}$$

where $N$ denotes the number of points in the original point cloud. Given the voxel resolution $r$, the 3D space is partitioned into a set of cubic voxels with side length $r$, where each voxel corresponds to a cubic region in the Euclidean space. For each point $q_j$, its corresponding voxel index is defined as

$$v_j = \left( \left\lfloor \frac{x_j}{r} \right\rfloor, \left\lfloor \frac{y_j}{r} \right\rfloor, \left\lfloor \frac{z_j}{r} \right\rfloor \right) \in \mathbb{Z}^{3} \tag{16}$$

where $\lfloor \cdot \rfloor$ denotes the floor operation, and $v_j \in \mathbb{Z}^{3}$ represents the discrete voxel grid coordinate. All points sharing the same voxel index $v$ are grouped into a voxel cell, defined as

$$V_v = \left\{ q_j \in \mathcal{P}_k \mid v_j = v \right\} \tag{17}$$

For each non-empty voxel (i.e., $|V_v| > 0$), the representative point is computed as the centroid of all points within the voxel:

$$\bar{q}_v = \frac{1}{|V_v|} \sum_{q_j \in V_v} q_j \tag{18}$$

Thus, the downsampled point cloud after voxel grid filtering can be expressed as

$$\mathcal{P}_k' = \left\{ \bar{q}_v \mid |V_v| > 0 \right\} \tag{19}$$

where $|V_v|$ denotes the number of points contained in voxel $v$, and $\mathcal{P}_k'$ represents the downsampled point cloud after voxel grid filtering. The computational complexity of voxel grid filtering is approximately linear with respect to the number of input points, as each point is assigned to a voxel once, and centroid computation is performed only over occupied voxels.
The remaining points are subsequently transformed into the world coordinate system and integrated into the global point cloud map.
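The voxel grid filter described above can be sketched with NumPy as follows. The function name `voxel_grid_filter` is hypothetical, and the sort-based grouping via `np.unique` is chosen for brevity; a hash-based grouping would be closer to the linear-time behaviour discussed in the text.

```python
import numpy as np

def voxel_grid_filter(points, r):
    """Downsample a point cloud by replacing each voxel's points with their centroid.

    points: (N, 3) array of 3D points.
    r:      voxel side length (same unit as the points).
    """
    points = np.asarray(points, dtype=float).reshape(-1, 3)
    voxel_idx = np.floor(points / r).astype(np.int64)   # integer voxel coordinates
    _, inverse, counts = np.unique(
        voxel_idx, axis=0, return_inverse=True, return_counts=True
    )
    inverse = inverse.reshape(-1)          # robust across NumPy versions
    sums = np.zeros((len(counts), 3))
    np.add.at(sums, inverse, points)       # accumulate points per occupied voxel
    return sums / counts[:, None]          # centroid of each occupied voxel

# Two points share the voxel (0, 0, 0); the third sits alone in voxel (2, 0, 0).
cloud = [[0.1, 0.1, 0.1], [0.3, 0.3, 0.3], [1.2, 0.1, 0.1]]
sparse_cloud = voxel_grid_filter(cloud, r=0.5)
```

Only occupied voxels produce an output point, so memory use scales with the occupied volume of the scene rather than with the number of depth pixels.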
The key region refinement method builds on this global downsampling. Using the images and object detection results of several ordinary (non-keyframe) frames closest to the keyframe, only the point clouds of the detected static key objects are transformed into the world coordinate system and merged into the global point cloud map, thereby enhancing the details of the target objects in the dense map.
| Algorithm 2 Global Sparsification and Key-Region Refinement |
- Input: Keyframe $KF_k$; neighboring frames $\{F_j\}$; depth maps $\{D_j\}$; camera intrinsic matrix $K$; camera poses $\{T_j\}$; voxel resolution $r$; temporal window size $w$;
- Output: Updated global dense map $P$
- 1: Generate depth point cloud $\mathcal{P}_k$ from $D_k$ using $K$
- 2: Apply voxel grid filtering with resolution $r$ to obtain sparse point cloud $\mathcal{P}_k'$
- 3: Transform $\mathcal{P}_k'$ into world coordinates using $T_k$
- 4: Insert filtered points into global map $P$
- 5: Detect object bounding boxes in keyframe $KF_k$
- 6: for each neighboring frame $F_j$ do
- 7:     if $|j - k| \leq w$ then
- 8:         Identify corresponding object regions in frame $F_j$
- 9:         Extract depth points from $D_j$ within these regions
- 10:         Remove dynamic points using multi-constraint filtering
- 11:         Transform remaining points to world coordinates via $T_j$
- 12:         Merge refined points into $P$
- 13:     end if
- 14: end for
- 15: return $P$
Figure 3 illustrates the key region refinement method. In this example, a book on the desk is defined as the “key object”, with its target region localized by a 2D bounding box generated by the detection network. The image on the right represents the keyframe, where the overlaying points constitute the point cloud in the camera coordinate system, obtained by downsampling the keyframe’s depth map; the red points specifically denote the point cloud of the book. The multiple images on the left are frames preceding the keyframe, with the red points in each frame representing the book’s point cloud derived from the corresponding pixels in their respective camera coordinate systems. By integrating point clouds from both the left and right frames into the global map, the geometric completeness of the target object in the dense map is enhanced.
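The region extraction step of the refinement can be sketched as below: depth pixels inside one detected bounding box are back-projected and moved into the world frame. This is an illustrative sketch under assumptions, not the authors' implementation. The function name `region_points_world`, the half-open integer box layout `(x_min, y_min, x_max, y_max)`, and the camera-to-world 4x4 pose convention for `T_wc` are all hypothetical choices, and the dynamic-point filtering step is omitted.

```python
import numpy as np

def region_points_world(depth, K, box, T_wc):
    """Back-project the depth pixels inside a box and express them in world coordinates.

    depth: (H, W) depth map in metres, 0 where depth is missing.
    K:     3x3 intrinsic matrix.
    box:   (x_min, y_min, x_max, y_max), integer pixels, max exclusive (assumed).
    T_wc:  4x4 camera-to-world pose (assumed convention).
    """
    x_min, y_min, x_max, y_max = box
    ys, xs = np.mgrid[y_min:y_max, x_min:x_max]     # pixel grid of the region
    d = np.asarray(depth, dtype=float)[y_min:y_max, x_min:x_max]
    valid = d > 0                                    # drop pixels without depth
    xs, ys, d = xs[valid], ys[valid], d[valid]
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    pts_cam = np.stack([(xs - cx) * d / fx, (ys - cy) * d / fy, d], axis=1)
    return pts_cam @ T_wc[:3, :3].T + T_wc[:3, 3]    # rotate, then translate

# Toy example: a 4x4 depth map at 2 m with one missing pixel inside the box.
depth_toy = np.full((4, 4), 2.0)
depth_toy[1, 1] = 0.0
K_toy = np.array([[2.0, 0.0, 2.0], [0.0, 2.0, 2.0], [0.0, 0.0, 1.0]])
T_toy = np.eye(4)
T_toy[:3, 3] = [1.0, 0.0, 0.0]
book_points = region_points_world(depth_toy, K_toy, (1, 1, 3, 3), T_toy)
```

Calling this for the same object box in several frames around the keyframe and concatenating the results gives the denser object point cloud that the refinement step merges into the global map.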
5. Conclusions
To tackle the challenges of complex dynamic settings, this research introduces RTS-SLAM, a real-time semantic visual SLAM framework. Built upon the ORB-SLAM2 architecture, the system implements a hierarchical dynamic feature suppression strategy: it first fuses semantic priors with epipolar geometry constraints for coarse-grained suppression, followed by trajectory consistency constraints to remove residual dynamic noise across consecutive frames. This methodology effectively mitigates interference from moving entities in camera ego-motion estimation, significantly improving positioning stability and metric accuracy in highly dynamic scenarios. Furthermore, during the dense-mapping phase, RTS-SLAM employs global sparsification and localized critical-region refinement. By prioritizing the geometric fidelity of salient objects while substantially reducing point cloud redundancy, the system reduces memory overhead and facilitates dense map construction for large-scale scenes.
Further refinements are required for the proposed system in future research. At the algorithmic level, efforts will focus on enhancing the generalization capability of semantic models and optimizing the spatial density of the constructed maps. At the hardware level, the integration of active sensing modalities, such as LiDAR, will be explored to compensate for the inherent limitations of pure vision solutions in depth estimation within unstructured environments.