Article

DKB-SLAM: Dynamic RGB-D Visual SLAM with Efficient Keyframe Selection and Local Bundle Adjustment

1 College of Information and Communication Engineering, Harbin Engineering University, Harbin 150001, China
2 Institute of Space Information Innovation, Chinese Academy of Sciences, Beijing 100094, China
* Author to whom correspondence should be addressed.
Robotics 2025, 14(10), 134; https://doi.org/10.3390/robotics14100134
Submission received: 23 July 2025 / Revised: 18 September 2025 / Accepted: 22 September 2025 / Published: 25 September 2025
(This article belongs to the Special Issue SLAM and Adaptive Navigation for Robotics)

Abstract

Reliable navigation for mobile robots in dynamic, human-populated environments remains a significant challenge, as moving objects often cause localization drift and map corruption. While Simultaneous Localization and Mapping (SLAM) techniques excel in static settings, issues like keyframe redundancy and optimization inefficiencies further hinder their practical deployment on robotic platforms. To address these challenges, we propose DKB-SLAM, a real-time RGB-D visual SLAM system specifically designed to enhance robotic autonomy in complex dynamic scenes. DKB-SLAM integrates optical flow with Gaussian-based depth distribution analysis within YOLO detection frames to efficiently filter dynamic points, crucial for maintaining accurate pose estimates for the robot. An adaptive keyframe selection strategy balances map density and information integrity using a sliding window, considering the robot’s motion dynamics through parallax, visibility, and matching quality. Furthermore, a heterogeneously weighted local bundle adjustment (BA) method leverages map point geometry, assigning higher weights to stable edge points to refine the robot’s trajectory. Evaluations on the TUM RGB-D benchmark and, crucially, on a mobile robot platform in real-world dynamic scenarios, demonstrate that DKB-SLAM outperforms state-of-the-art methods, providing a robust and efficient solution for high-precision robot localization and mapping in dynamic environments.

1. Introduction

Simultaneous Localization and Mapping (SLAM) is a fundamental technology for mobile robot autonomy, enabling core capabilities such as autonomous navigation, interaction, and exploration. It is widely applied in service robotics, warehouse logistics, autonomous driving, and immersive technologies like augmented reality (AR) and virtual reality (VR). Visual SLAM, which utilizes data from cameras, has gained prominence due to its low cost, rich environmental information, and effectiveness in complex, unstructured environments where other sensors might fail.
In recent years, many remarkable visual SLAM algorithms have emerged [1,2,3,4,5,6,7,8,9], setting milestones in accuracy and robustness. However, a core assumption in most of these systems is a static world. This assumption severely limits their real-world applicability, particularly for robots operating in human-centric environments like homes, hospitals, or public spaces. The presence of moving people, vehicles, or other objects violates this static-world assumption, leading to incorrect feature matching, which in turn causes significant drift in the robot’s estimated trajectory and corruption of the environmental map.
To address the challenges of visual SLAM in dynamic scenes, researchers have focused on two main approaches—geometry-based methods [10,11,12,13] and semantic-based methods [14,15,16,17,18]. Although geometry-based methods perform well in low-dynamic environments, they can struggle with complex dynamic objects and scenes, leading to misjudgments or incorrect rejections. In contrast, semantic analysis-based dynamic point recognition methods perform well in both low- and high-dynamic environments but are computationally expensive, making real-time implementation on resource-constrained robotic platforms challenging, a gap that remains in recent works such as [18].
In a complete SLAM system, keyframe selection plays a crucial role in balancing computational load and mapping accuracy. Traditional methods often rely on static rules, such as fixed-distance or viewing-angle thresholds [1,3,4,5]. These approaches lack adaptability to a robot’s varying motion dynamics; for instance, they may generate redundant keyframes when the robot is stationary or moving slowly, and too few when it is turning or accelerating quickly. While more intelligent strategies exist [19], they often fall short in complex dynamic environments, negatively impacting the robot’s localization accuracy.
Local optimization is another key factor for improving accuracy and robustness in SLAM systems. Traditional approaches, such as [1], typically use a single-weighted strategy, treating all map points equally. Other methods [4,20,21] use sliding-window optimization. While strategies like ICE-BA [22] improve efficiency, they often fail to fully exploit the rich geometric structure of the environment. For a robot navigating a structured indoor space, distinguishing between geometrically stable features (like corners and edges) and less informative ones (like large, textureless planes) can significantly improve optimization. Optimizing all map points equally incurs high computational costs and can lead to suboptimal results.
To address these robotics-centric challenges, we propose DKB-SLAM, a real-time RGB-D visual SLAM system tailored for robust robot navigation in complex environments. The system enables accurate localization in both dynamic and static scenes through efficient keyframe selection and geometry-aware local BA optimization. In summary, the main contributions of this article, framed from a robotics application perspective, are as follows:
  • A Novel Hybrid Dynamic Feature Filtering Mechanism: We propose a lightweight yet robust pipeline that enables a robot to navigate reliably amidst dynamic obstacles. The system uses YOLO to quickly identify potential moving objects and then applies a combination of optical flow and statistical depth analysis. This hybrid approach efficiently removes unstable feature points caused by motion while preserving the static background, achieving a superior balance between localization accuracy and the real-time performance required for robotic platforms.
  • An Adaptive, Multi-Criteria Keyframe Selection Strategy: To handle a robot’s varied motion patterns, we introduce a sophisticated keyframe selection strategy that holistically evaluates frame quality based on the robot’s motion history. It goes beyond simple parallax to incorporate map point visibility and matching quality, ensuring that selected keyframes are information-rich and geometrically stable. This prevents map degradation and improves tracking robustness, whether the robot is moving quickly, slowly, or turning.
  • A Geometry-Aware Local Bundle Adjustment (BA) Scheme: We enhance the backend optimization to better leverage the structure of typical robotic operating environments. The method classifies map points as ‘planar’ or ‘edge’ categories and assigns higher weights to the more geometrically informative edge points. This geometry-aware approach improves the robot’s pose accuracy and stability, especially in structured indoor settings, by prioritizing more reliable environmental constraints.
  • An Integrated and Robust RGB-D SLAM System for Robotic Applications: We integrate these modules into a coherent system, DKB-SLAM, designed for practical robotic use. The synergy between dynamic object handling, intelligent keyframe selection, and refined optimization results in superior accuracy and robustness. We validate this performance extensively on the public TUM RGB-D benchmark and, critically, on a mobile robot platform operating in challenging real-world, high-dynamic scenarios.
The remainder of the paper is structured as follows: Section 2 reviews related research, Section 3 presents an overview of the proposed method, Section 4 experimentally validates the method and compares it with state-of-the-art approaches, and Section 5 concludes the paper and outlines future research directions.

2. Related Works

2.1. Visual SLAM in Dynamic Environments

Traditional visual SLAM performs exceptionally well in static environments, but its reliance on the assumption of a static environment makes it vulnerable to dynamic elements, which can severely impact the system’s robustness and accuracy, limiting its practical application. To address the challenges posed by dynamic scenes, researchers have developed various strategies to improve the adaptability of visual SLAM systems. These approaches primarily include geometry-based dynamic point detection, dynamic object recognition combined with semantic analysis, and motion estimation of dynamic objects using clustering techniques.
(1) Methods based on geometric information
Geometry-based methods primarily identify dynamic objects by analyzing the geometric features or motion consistency of rigid targets. These approaches allow the system to classify feature points that fail to adhere to rigid motion constraints as dynamic points and subsequently remove them. Yang et al. [10] introduced a dynamic point detection method that initially filters dynamic points using geometric constraints, then refines the filtering process through matrix calculations, and ultimately confirms the dynamic points with the polar constraint method. CoSLAM [11] detects dynamic points by examining the triangular consistency of map points. Kim et al. [12] proposed a visual odometry approach for dynamic scenes using background models. This method estimates a non-parametric background model from depth data, extracts the static parts of the scene, and employs an energy-based visual odometry technique to estimate camera motion. Hu et al. [23] applied a multi-view geometry approach to detect dynamic points, combined with an ant colony strategy, which selectively traverses dynamic point groups with a higher probability, thus improving detection efficiency.
(2) Methods based on semantic analysis
Combining semantic analysis with geometric methods is another widely used approach to address dynamic scenes. DynaSLAM [14] effectively reduced the impact of dynamic elements on the SLAM system by integrating Mask R-CNN [24] for semantic segmentation with multi-view geometry. Other methods, such as [15,16,17,25], also incorporate semantic segmentation modules to identify and remove dynamic objects, thus enhancing the system’s performance in complex environments. Zhang et al. [26] employed the YOLO [27] object detection method to extract semantic information of dynamic objects in the scene, then eliminated dynamic points based on this information, improving the system’s accuracy.
(3) Methods based on clustering
Clustering methods are crucial for motion estimation of dynamic objects. DRE-SLAM [28] combines the K-means clustering algorithm applied to depth images with multi-view geometric constraints to identify dynamic objects. Furthermore, ClusterSLAM [29] and ClusterVO [30] unify the clustering and estimation of both static and dynamic objects by analyzing 3D motion consistency among landmarks extracted from the same rigid body, thereby enhancing the performance of SLAM systems in dynamic environments. SAD-SLAM [31] improves the dynamic point removal process by classifying feature points and applying different processing techniques based on the type of feature points.

2.2. Keyframe Selection

Keyframe selection plays a central role in SLAM systems and has garnered significant attention in recent years. As a representative frame within the environment, a keyframe contains valuable pose and environmental information. By choosing keyframes with notable changes or distinctive features, the SLAM system ensures that map key points are evenly distributed and thoroughly covered. This approach reduces redundant data, enhances map accuracy and consistency, and prevents error accumulation due to repeated or similar frames. Dong et al.’s experiments [32,33] demonstrate that the proper selection of keyframes not only impacts positioning accuracy and map quality but is also directly linked to the system’s computational efficiency and real-time performance. Consequently, effective keyframe selection has become a key focus of current research.
(1) Motion-based methods
Traditional keyframe selection methods are typically based on fixed time or spatial intervals. For instance, Refs. [3,7,34] all adopt strategies that select keyframes at regular time or location intervals. While these methods are simple to implement, they often result in keyframes that carry less information, lack flexibility, and offer lower quality. To enhance keyframe quality, VINS-Mono employs a selection criterion based on disparity and tracking quality. When the average disparity between the current and previous frames exceeds a threshold or when the number of tracked feature points falls below a certain limit, the current frame is selected as a keyframe. Lin et al. [4] select keyframes based on camera perspective changes. When the camera moves quickly, more keyframes are selected, and fewer keyframes are selected when the camera moves slowly. Experimental results show that this approach significantly reduces the number of required keyframes while maintaining accuracy. ORB-SLAM2 [1] utilizes a hybrid method for keyframe selection, setting the current frame as a keyframe when one of the following conditions is met: (1) more than 20 frames have passed since the last global relocalization, (2) local map construction is idle, or more than 20 frames have passed since the last keyframe insertion, (3) fewer than 50 map points are tracked in the current frame, or (4) the current frame tracks less than 90% of the map points in the reference keyframe. This strategy enables ORB-SLAM2 to maintain robust performance while ensuring high accuracy.
(2) Appearance-based methods
Appearance-based keyframe selection methods typically evaluate images based on features such as grayscale, texture, and shape. Engel et al. [35] introduced an innovative keyframe selection approach that first selects a broad set of candidate keyframes based on criteria such as (1) significant changes in viewing angle, (2) substantial camera translation, and (3) notable variations in exposure time. The candidate set is then narrowed down, with the remaining keyframes selected as the final set. Additionally, Refs. [36,37] employed appearance-based pixel-level image or patch matching techniques, while [38] utilized a hybrid approach combining both feature and appearance strategies. These methods enhance the accuracy and robustness of keyframe selection by fully leveraging the appearance information within the images.

2.3. Backend Optimization

(1) Filter-based methods
The filtering-based backend method has been widely utilized in early SLAM systems [39,40,41,42,43,44]. This method typically relies on recursive state estimation, assuming Markov properties, where the system’s current state depends solely on its previous state. This assumption simplifies the state estimation process, enabling the filtering method to achieve high computational efficiency in practical applications.
(2) Optimization-based methods
Compared to the filtering method, the backend method based on nonlinear optimization performs global optimization by considering the relationships between global states, thereby enhancing the overall consistency and accuracy of the system. In recent years, optimization-based backends have been widely adopted in numerous SLAM systems [2,5,20,45]. One of the leading methods in the current field of visual SLAM, ORB-SLAM3, employs an efficient backend optimization strategy. ORB-SLAM3 utilizes factor graphs to represent the constraint relationships between keyframes and map points. To balance real-time performance with accuracy, ORB-SLAM3 introduces a hierarchical optimization strategy, which includes two stages—local BA and global BA. Local BA optimizes only the keyframes and associated map points within the current window, ensuring optimization efficiency and suitability for real-time applications. In contrast, global BA optimizes all keyframes and map points in the entire map, thereby enhancing global consistency and map accuracy. This hierarchical strategy allows ORB-SLAM3 to achieve high mapping accuracy while maintaining real-time performance.

2.4. Summary

While current visual SLAM methods have made significant progress in dynamic scene processing, keyframe selection, and backend optimization, they still exhibit several limitations, which can be summarized as follows:
  • Dynamic point recognition methods based on geometry or clustering perform well in low-dynamic environments, but they may lead to misidentifications when dealing with complex dynamic objects and scenes. On the other hand, dynamic point recognition methods based on semantic analysis exhibit strong performance in both low- and high-dynamic scenes; however, they face challenges in real-time execution on devices with limited computational resources.
  • Keyframe selection methods based on fixed time or spatial intervals lack flexibility and are susceptible to redundancy or the loss of critical information. Several methods, including those based on motion, deep learning, and appearance, have been proposed to address these issues. Among them, methods utilizing parallax and tracking quality can significantly improve selection accuracy, while deep learning-based methods excel at handling complex features, although they often suffer from poor real-time performance. While appearance-based methods can enhance accuracy in specific situations, they are highly sensitive to external factors such as lighting changes.
  • While filtering-based methods offer high computational efficiency, their application in SLAM problems is limited by their nonlinear nature, which can lead to errors, and their recursive characteristics may cause local error accumulation. Optimization-based methods, on the other hand, provide more accurate state estimations. However, due to resource limitations, current optimization techniques often trade off some accuracy and landmarks to prioritize real-time performance. Additionally, effectively utilizing environmental information to further enhance optimization remains a critical challenge that requires urgent attention.
To address the challenges discussed above, we propose a comprehensive visual SLAM system: DKB-SLAM. Building upon ideas from [46,47], this system filters dynamic points by combining optical flow with feature point depth Gaussian analysis. It employs a keyframe selection strategy that considers parallax, visibility, and matching quality, and introduces a heterogeneous weighting technique for local BA, which discretizes the weighting of feature point geometric properties. These innovations effectively resolve the identified limitations, enabling DKB-SLAM to deliver exceptional robustness, accuracy, and real-time performance in complex dynamic scenes, demonstrating its superior capabilities.

3. System Overview

The DKB-SLAM system proposed in this paper is an enhancement of ORB-SLAM3, with the overall framework shown in Figure 1. The system operates with four main threads—object detection, tracking, local mapping, and loop closing. First, the system captures RGB-D frames through the camera and extracts ORB feature points. Next, feature points from the YOLOv5 detection bounding boxes are processed to eliminate dynamic points that do not meet the criteria. The remaining feature points are then used for tracking local maps and selecting keyframes. After the tracking thread processes the keyframe, the local mapping thread refines the map points and optimizes the map structure. To ensure map accuracy and consistency, the system eliminates accumulated errors via loop closure detection and global BA.

3.1. Dynamic Point Recognition and Elimination

The presence of dynamic objects in real environments often leads to trajectory drift in SLAM systems. While semantic analysis has been proven effective in mitigating the impact of dynamic objects in most experiments, it is challenging to maintain real-time performance. To address this, we propose a method that identifies dynamic feature points using optical flow and depth Gaussian analysis. This approach effectively eliminates the influence of dynamic objects while preserving real-time performance.
(1) Optical flow
Optical flow is a technique for estimating object or feature point motion by calculating pixel displacement between consecutive frames. It can be used to identify dynamic points. The proposed strategy applies optical flow only to preprocess feature points within the YOLO detection bounding boxes. This localized approach significantly reduces computational load and enhances the algorithm’s real-time processing capability.
During SLAM system operation, the images captured by the camera change over time. Therefore, the image can be treated as a function of time, i.e., I(t). A pixel at time t, located at (x,y), has a grayscale value of I(x,y,t). The principle is illustrated in Figure 2.
Based on the theory of the optical flow method, the following core assumptions are made:
  • Constant brightness: The brightness of a given pixel remains unchanged between two consecutive frames.
  • Small motion: Pixels move a small distance between two consecutive frames.
  • Spatial consistency: Pixels within a local region exhibit similar motion patterns.
When the above assumption 1 is satisfied, the following holds true:
I(x, y, t) = I(x + dx, y + dy, t + dt)
Perform a first-order Taylor expansion on the right side of the above equation:
I(x + dx, y + dy, t + dt) \approx I(x, y, t) + \frac{\partial I}{\partial x} dx + \frac{\partial I}{\partial y} dy + \frac{\partial I}{\partial t} dt
Based on the assumption of grayscale invariance, the grayscale at the next moment is equal to that of the previous moment; thus, the following equation holds:
\frac{\partial I}{\partial x} \frac{dx}{dt} + \frac{\partial I}{\partial y} \frac{dy}{dt} = -\frac{\partial I}{\partial t}
In the above formula, dx/dt and dy/dt represent the velocities of the pixel along the x and y axes, respectively, while ∂I/∂x and ∂I/∂y represent the gradients of the image at that point in the x and y directions. ∂I/∂x and ∂I/∂y are denoted as I_x and I_y, dx/dt and dy/dt are denoted as u and v, and the temporal gradient ∂I/∂t is denoted as I_t. Therefore, Equation (3) can be written in matrix form:
\begin{bmatrix} I_x & I_y \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix} = -I_t
At least two equations are required to solve for the two unknowns u and v. Therefore, according to assumptions 2 and 3 above, we assume that the pixels within an n × n window surrounding the pixel share the same motion state as that pixel. From this, we can construct n² equations, as follows:
\begin{bmatrix} I_x & I_y \end{bmatrix}_k \begin{bmatrix} u \\ v \end{bmatrix} = -I_{t,k}, \quad k = 1, 2, \dots, n^2
A = \begin{bmatrix} [I_x, I_y]_1 \\ \vdots \\ [I_x, I_y]_{n^2} \end{bmatrix}, \quad b = -\begin{bmatrix} I_{t,1} \\ \vdots \\ I_{t,n^2} \end{bmatrix}
A \begin{bmatrix} u \\ v \end{bmatrix} = b
By solving the least squares problem, we obtain the following:
\begin{bmatrix} u \\ v \end{bmatrix}^{*} = (A^{T} A)^{-1} A^{T} b
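As an illustration of Equations (5)–(8), the following minimal NumPy sketch solves the windowed least-squares problem for a single pixel. The gradient images Ix, Iy, It and the helper name lk_flow_at are assumptions introduced for this example, not part of DKB-SLAM's implementation.

```python
import numpy as np

def lk_flow_at(Ix, Iy, It, x, y, win=5):
    """Solve Eq. (8) for the flow (u, v) at pixel (x, y) from the spatial and
    temporal gradients inside a win x win window (Eqs. (5)-(7))."""
    h = win // 2
    wx = Ix[y - h:y + h + 1, x - h:x + h + 1].ravel()
    wy = Iy[y - h:y + h + 1, x - h:x + h + 1].ravel()
    wt = It[y - h:y + h + 1, x - h:x + h + 1].ravel()
    A = np.stack([wx, wy], axis=1)               # rows [I_x, I_y]_k
    b = -wt                                      # right-hand side of Eq. (7)
    uv, *_ = np.linalg.lstsq(A, b, rcond=None)   # least-squares solution, Eq. (8)
    return uv                                    # (u, v) in pixels per frame
```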
Based on the velocity of the pixel along the x-axis and y-axis, we can infer the pixel position at different times and compute the displacement of the same feature point between consecutive frames using Equation (9):
\text{distance} = \sqrt{(x_{\text{cur}} - x_{\text{pre}})^2 + (y_{\text{cur}} - y_{\text{pre}})^2}
If the distance exceeds the displacement threshold d_th = 1, the feature point is classified as a dynamic feature point; otherwise, it is further evaluated using depth Gaussian distribution analysis. The choice of d_th = 1 is a common empirical value in motion analysis applications utilizing optical flow [48]. This threshold rests on the principle that, at typical camera frame rates, the displacement of static background points between consecutive frames, after initial ego-motion compensation, amounts to sub-pixel shifts. A displacement exceeding one full pixel is therefore a strong indicator of independent object motion. This value balances sensitivity to slow-moving objects against robustness to image noise and minor feature-tracking jitter, and it serves as an efficient first-pass filter before the more computationally intensive depth analysis.
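The following sketch shows how this first-pass displacement test can be restricted to feature points inside the YOLO detection boxes. It relies on OpenCV's pyramidal Lucas–Kanade tracker; the function name, the box format, and the return convention are illustrative assumptions rather than the authors' code.

```python
import numpy as np
import cv2

def filter_dynamic_by_flow(prev_gray, cur_gray, keypoints, boxes, d_th=1.0):
    """Flag keypoints inside detection boxes whose tracked displacement
    exceeds d_th pixels (Eq. (9)). Points below the threshold are returned
    as 'undecided' and handed to the depth Gaussian analysis.

    keypoints: (N, 2) array of pixel coordinates in the previous frame.
    boxes: list of (x1, y1, x2, y2) YOLO detection boxes.
    """
    pts = np.asarray(keypoints, dtype=np.float32).reshape(-1, 1, 2)
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, cur_gray, pts, None,
                                              winSize=(21, 21), maxLevel=3)
    nxt = nxt.reshape(-1, 2)
    status = status.ravel().astype(bool)

    def in_any_box(p):
        return any(x1 <= p[0] <= x2 and y1 <= p[1] <= y2
                   for x1, y1, x2, y2 in boxes)

    dynamic = np.zeros(len(keypoints), dtype=bool)
    undecided = []
    for i, (p0, p1) in enumerate(zip(np.asarray(keypoints), nxt)):
        if not in_any_box(p0):
            continue                      # only points inside boxes are checked
        if not status[i]:
            dynamic[i] = True             # tracking failure treated as unstable
            continue
        if np.linalg.norm(p1 - p0) > d_th:
            dynamic[i] = True             # displacement test of Eq. (9)
        else:
            undecided.append(i)           # passed on to the depth Gaussian test
    return dynamic, undecided
```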
(2) Gaussian-based Depth Distribution Analysis
In Simultaneous Localization and Mapping (SLAM), the depth values of feature points in static scenes typically follow a Gaussian distribution centered around their mean [46]. Leveraging this property, we propose a method to statistically analyze the depth distribution of feature points using a Gaussian distribution to differentiate static and dynamic points. By modeling depth values within a Gaussian framework, points with significant deviations from the expected distribution are identified as dynamic. To enhance the real-time performance of the DKB-SLAM system, we restrict this analysis to feature points within YOLOv5 bounding boxes, which are pre-identified as potential dynamic regions, thereby reducing computational overhead while maintaining accuracy in dynamic point filtering.
Firstly, the mean depth value of the feature points within the bounding box is calculated as μ using Equation (10):
\mu = \frac{1}{N} \sum_{i=1}^{N} d_i
where d i represents the depth value of the i-th feature point, and N denotes the total number of feature points. Next, the standard deviation σ is calculated to measure the dispersion of the depth values of the feature points, reflecting the width of the data distribution in terms of depth, as shown in Equation (11):
\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (d_i - \mu)^2}
Next, compute the standard deviation σ_h of the depth values greater than the mean μ and the standard deviation σ_l of those smaller than μ. Then, compute the ratio of these two standard deviations, s = σ_h / σ_l, which can be used to dynamically adjust the threshold to better adapt to the depth distribution characteristics of different scenarios. It is assumed that the depth distribution of static feature points follows a Gaussian distribution, meaning that the depth values of most static feature points should be close to μ, and points with larger deviations may represent dynamic points or noise. For each feature point, the normalized deviation of its depth value from the mean is calculated as in Equation (12):
th_i = \frac{d_i - \mu}{\sigma}
th_i indicates the degree of deviation of d_i from μ, normalized by the standard deviation σ. A dynamic threshold δ is set to determine whether the key point is dynamic: if th_i² > δ, the corresponding feature point is considered dynamic. δ is defined as follows, and its selection is grounded in statistical principles derived from the chi-squared (χ²) distribution. Since the normalized deviation th_i of static points is assumed to follow a standard normal distribution, its square, th_i², follows a χ² distribution with one degree of freedom. Thus, the threshold δ acts as a critical value for a statistical test. The value s = σ_h/σ_l serves as a heuristic to assess the modality of the depth distribution. When s < 8, the distribution is likely unimodal, and a threshold of δ = 2.8 is used, which approximates a 90% confidence interval and provides a robust filter for clear outliers. When s ≥ 8, the distribution is likely multi-modal (e.g., a dynamic object in front of a static background), indicating that the single Gaussian model is a poor fit. In this case, a much stricter threshold of δ = 9.0 is applied, corresponding to a confidence interval of over 99.7%. This conservative choice ensures that only extreme outliers are classified as dynamic, preventing the erroneous removal of static background points. This adaptive strategy, originally proposed in [46], provides a statistically justified method for dynamic point filtering.
\delta = \begin{cases} 2.8, & s < 8 \\ 9.0, & s \geq 8 \end{cases}
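A minimal sketch of the depth-based test of Equations (10)–(13) is given below, applied to the feature points of one detection box. The split of the depth values about the mean when computing σ_h and σ_l, as well as the function interface, are assumptions made for illustration.

```python
import numpy as np

def filter_dynamic_by_depth(depths):
    """Return a boolean mask marking depth values judged dynamic by the
    Gaussian analysis of Eqs. (10)-(13)."""
    d = np.asarray(depths, dtype=np.float64)
    mu = d.mean()                            # Eq. (10)
    sigma = d.std()                          # Eq. (11)
    if sigma < 1e-6:
        return np.zeros_like(d, dtype=bool)  # degenerate box, nothing to reject

    # Spread ratio s = sigma_h / sigma_l, splitting the depths about the mean.
    hi, lo = d[d > mu], d[d < mu]
    s = (hi.std() + 1e-6) / (lo.std() + 1e-6) if hi.size > 1 and lo.size > 1 else 1.0

    delta = 2.8 if s < 8.0 else 9.0          # adaptive threshold, Eq. (13)
    th = (d - mu) / sigma                    # normalized deviation, Eq. (12)
    return th ** 2 > delta
```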
In summary, we propose a dynamic feature point recognition method that combines the optical flow method and depth Gaussian analysis, which effectively addresses the drift problem in SLAM systems operating in dynamic environments. The optical flow method quickly identifies moving feature points by calculating the motion trajectories of pixels in consecutive frames, while depth Gaussian analysis accurately classifies the depths of the feature points based on the statistical properties of static point depths, thereby improving the detection accuracy of dynamic points. The combination of these two methods not only meets the real-time requirements of SLAM systems but also effectively suppresses the interference of dynamic objects on localization and map building, providing a robust and efficient solution for SLAM in dynamic scenes. Compared to traditional geometry-based methods like the short-term motion consistency check, which primarily relies on epipolar constraints, our hybrid approach offers distinct advantages. The motion consistency check identifies dynamic points as outliers to the fundamental matrix estimated between two views. However, this method can fail when dynamic objects move parallel to the camera’s epipolar lines or when motion is too erratic for reliable geometric model fitting. Furthermore, in scenes dominated by dynamic elements, the estimation process itself can be compromised. Our method, by contrast, leverages two complementary cues: pixel-level motion vectors from optical flow and a statistical model of depth distribution from Gaussian analysis. This dual-modality approach makes it more robust to varied motion patterns and less susceptible to the geometric degeneracies that can affect consistency checks. For instance, an object moving directly towards the camera might satisfy epipolar constraints but would be immediately flagged by our depth distribution analysis, showcasing the enhanced robustness of our method in complex, real-world scenarios.

3.2. Keyframe Selection Based on Parallax, Visibility, and Match Quality

In visual SLAM systems, keyframe selection plays a crucial role in the overall performance of the system. Proper keyframe selection not only reduces the storage and computational overhead caused by redundant frames but also ensures geometric consistency between keyframes, thereby improving localization accuracy and map construction quality. However, traditional keyframe selection methods often rely on simple threshold-based criteria, which are challenging to adapt to complex and dynamic motion scenarios. This is especially true in fast-moving or slow-moving scenes, where the system tends to select too many or too few keyframes, thus compromising robustness and efficiency. To address this issue, we propose a novel keyframe selection method, as illustrated in Figure 3. This method integrates several factors, including the spatial distribution of map points, parallax variations, angular differences, and the quality of image grid matches, ensuring that the system maintains high-accuracy tracking performance across various environments.
(1) Parallax-based current frame preprocessing
When evaluating the motion change between the current frame and the previous keyframe to select an appropriate keyframe, the pose estimate of the current frame must be updated, and the rotational and translational parallax between the two frames must be calculated. To quantify the rotational change between the two frames, the rotation matrices R current and R last of the current frame and the previous keyframe are extracted, and the rotational difference matrix Δ R between the two frames is computed using Equation (14):
\Delta R = R_{\text{last}}^{T} R_{\text{current}}
In the above equation, R_last^T represents the transpose of the rotation matrix of the previous keyframe. The rotational angle difference Δθ is then computed from the trace of the rotational difference matrix, as shown in Equation (15):
\Delta\theta = \arccos\left(\frac{\mathrm{trace}(\Delta R) - 1}{2}\right) \times \frac{180}{\pi}
trace(ΔR) in the above equation is the trace of ΔR. Next, extract the translation vector t_current of the current frame and the translation vector t_last of the previous keyframe, and compute the translation difference Δt between them using Equation (16):
\Delta t = \lVert t_{\text{current}} - t_{\text{last}} \rVert
In order to avoid misjudgments caused by parallax variation in a single frame, we introduce a sliding-window mechanism to smooth out the parallax variation. A sliding window of size N = 10 is defined to store the rotation angle differences Δθ and translation differences Δt from the last N frames. This value is inspired by the motion-based keyframe selection principles in robust systems like VINS-Mono [4], which require a balance between smoothing noisy motion estimates and reacting promptly to actual changes in trajectory. A 10-frame window is large enough to filter high-frequency jitter but small enough to adapt quickly to significant shifts in motion, such as the transition from straight-line movement to a sharp turn, thereby ensuring both stability and responsiveness. The historical average rotation angle difference Δθ̄ and historical average translation difference Δt̄ are then computed from the parallax values within the sliding window, as detailed in Equations (17) and (18):
\bar{\Delta\theta} = \frac{1}{N} \sum_{k=1}^{N} \Delta\theta_k
\bar{\Delta t} = \frac{1}{N} \sum_{k=1}^{N} \Delta t_k
The sliding window mechanism effectively smooths out abrupt parallax changes and prevents misjudgments caused by abnormal parallax in a single frame. Additionally, we adaptively adjust the thresholds based on the historical average parallax values to ensure the algorithm’s adaptability and robustness under different motion states. First, the weights w θ and w t for rotation and translation are calculated based on the historical average parallax values using Equations (19) and (20):
w_{\theta} = \frac{\bar{\Delta\theta}}{\bar{\Delta\theta} + \bar{\Delta t} + \epsilon}
w_{t} = \frac{\bar{\Delta t}}{\bar{\Delta\theta} + \bar{\Delta t} + \epsilon}
where ε = 1 × 10⁻⁵ is introduced to prevent the denominator from being zero. These weights dynamically adjust the keyframe selection criteria based on the nature of the robot’s motion. For example, during a pure rotation (turning on the spot), Δt̄ will be small, causing w_θ to approach 1 and w_t to approach 0. This makes the system highly sensitive to rotational changes while being robust to minor translational jitter, appropriately triggering a new keyframe. Conversely, during straight-line motion, w_t will dominate, making the system sensitive to the translational distance covered. This adaptive weighting scheme ensures that keyframes are inserted based on the most significant component of the robot’s motion, improving the robustness and efficiency of the selection process. Second, a logarithmic function is used to nonlinearly scale the rotational and translational components. The logarithmic function smooths out the effects of rotation and translation on the thresholds, preventing extreme values from dominating the results and improving numerical stability. The nonlinearly processed M_θ and M_t are defined as follows in Equations (21) and (22):
M_{\theta} = \log(1 + w_{\theta} \cdot \bar{\Delta\theta})
M_{t} = \log(1 + w_{t} \cdot \bar{\Delta t})
After completing the nonlinear scaling of the parallax components, the total mapping value M and the normalization factor F are calculated using Equations (23) and (24):
M = M_{\theta} + M_{t}
F = \frac{M}{\log(1 + w_{\theta} \cdot \Theta_{\max}) + \log(1 + w_{t} \cdot t_{\max})}
where Θ_max and t_max denote the assumed maximum rotation and translation values, respectively. Introducing these maximum values limits the rotation angle and translation allowed between keyframes. In most vision systems, the field of view is constrained, with wide-angle lenses typically not exceeding 90°. A rotation change of more than 45° between two frames results in insufficient overlap between consecutive frames, which reduces the usefulness of the keyframes. We therefore set Θ_max = 45° and t_max = 10, following the validation in VINS-Mono [4]. The normalization factor F ensures that the total mapping value is scaled uniformly, facilitating the subsequent threshold calculation. This ratio of the total mapping value to the normalization term is used to calculate the final adaptive parallax threshold T via Equation (25):
T = T_{\max} - F \cdot (T_{\max} - T_{\min})
where T_min = 10 and T_max = 30 represent the minimum and maximum values of the adaptive threshold, respectively, to avoid too few keyframes in fast-motion scenes or excessive keyframe generation in slow-motion scenes. When the rotation or translation is large, F increases, leading to a decrease in T, which makes keyframe selection easier. Conversely, in low-motion situations, T increases, reducing keyframe generation. If Δθ + Δt < T, the current frame is not used as a keyframe.
This adaptive threshold mechanism allows the algorithm to dynamically adjust the keyframe selection criteria according to the current motion state, enhancing its adaptability and robustness in different scenarios.
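The sliding-window threshold of Equations (14)–(25) can be summarized as in the following sketch. The class name and interface are illustrative; the constants follow the values quoted above (N = 10, Θ_max = 45°, t_max = 10, T_min = 10, T_max = 30).

```python
import numpy as np
from collections import deque

class ParallaxKeyframeGate:
    """Adaptive parallax test for keyframe candidacy (Eqs. (14)-(25))."""

    def __init__(self, N=10, theta_max=45.0, t_max=10.0,
                 T_min=10.0, T_max=30.0, eps=1e-5):
        self.win_theta = deque(maxlen=N)   # rotation differences of the last N frames
        self.win_t = deque(maxlen=N)       # translation differences of the last N frames
        self.theta_max, self.t_max = theta_max, t_max
        self.T_min, self.T_max, self.eps = T_min, T_max, eps

    def update(self, R_cur, t_cur, R_kf, t_kf):
        """Return (is_candidate, T) for the current frame w.r.t. the last keyframe."""
        dR = R_kf.T @ R_cur                                    # Eq. (14)
        cos_a = np.clip((np.trace(dR) - 1.0) / 2.0, -1.0, 1.0)
        d_theta = np.degrees(np.arccos(cos_a))                 # Eq. (15)
        d_t = float(np.linalg.norm(t_cur - t_kf))              # Eq. (16)
        self.win_theta.append(d_theta)
        self.win_t.append(d_t)

        m_theta, m_t = np.mean(self.win_theta), np.mean(self.win_t)  # Eqs. (17)-(18)
        w_theta = m_theta / (m_theta + m_t + self.eps)         # Eq. (19)
        w_t = m_t / (m_theta + m_t + self.eps)                 # Eq. (20)

        M = np.log1p(w_theta * m_theta) + np.log1p(w_t * m_t)  # Eqs. (21)-(23)
        F = M / (np.log1p(w_theta * self.theta_max) +
                 np.log1p(w_t * self.t_max))                   # Eq. (24)
        T = self.T_max - F * (self.T_max - self.T_min)         # Eq. (25)
        return (d_theta + d_t) >= T, T
```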
(2) Keyframe judgment based on visibility and match quality
In order to ensure the rationality of keyframe selection, DKB-SLAM must evaluate the geometric consistency and matching quality of the current frame after parallax filtering.
First, iterate through all the map points and calculate the Euclidean distance between each pair of map points. Let there be two map points, P_i = (x_i, y_i, z_i) and P_j = (x_j, y_j, z_j), whose Euclidean distance is given by Equation (26):
d(P_i, P_j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2 + (z_i - z_j)^2}
For each map point P_i, find its five nearest neighbors and compute the vectors V_k of these neighbors relative to the current map point (where k = 1, 2, …, 5), as shown in Equation (27):
V_k = P_k - P_i
The cross products of all pairwise combinations of the five vectors are computed and summed to obtain the cumulative vector S_i (Equation (28)). Subsequently, its magnitude ‖S_i‖ (Equation (29)) and the unit normal vector N_i (Equation (30)) are calculated as follows:
S_i = \sum_{k=1}^{4} \sum_{t=k+1}^{5} (V_k \times V_t)
\lVert S_i \rVert = \sqrt{S_{ix}^2 + S_{iy}^2 + S_{iz}^2}
N_i = \frac{S_i}{\lVert S_i \rVert}
The unit normal vector N i reflects the orientation of the local geometric structure around the map point P i . The visibility of a map point can be further assessed by analyzing the relationship between the unit normal vector and the direction of the camera’s line of sight. Specifically, the angle between the normal vector N i and the camera’s line-of-sight direction C i must be calculated for each map point. The principle is shown in Figure 4.
The angle θ i is calculated as shown in Equation (31):
\theta_i = \arccos\left(\frac{N_i \cdot C_i}{\lVert N_i \rVert \lVert C_i \rVert}\right) \times \frac{180}{\pi}
where N_i · C_i denotes the dot product of the two vectors, and ‖N_i‖ and ‖C_i‖ are their norms.
Map points are then assigned to different visibility levels based on the angle θ_i in order to evaluate their visibility status in the current frame, with each level spanning 10 degrees, as defined in Equation (32):
V_i = \begin{cases} 1, & 0 < \theta_i \leq 10^{\circ} \\ 2, & 10^{\circ} < \theta_i \leq 20^{\circ} \\ 3, & 20^{\circ} < \theta_i \leq 30^{\circ} \\ 4, & \theta_i > 30^{\circ} \end{cases}
where V_i denotes the visibility level of the map point P_i. By comparing the visibility levels of the matching points in the current frame and the previous keyframe, the number of matching points with inconsistent visibility levels, denoted as n_diff, is counted. At the same time, we adopt an adaptive threshold thres to dynamically adjust the judgment criteria for matching points with inconsistent visibility levels. thres is defined as follows in Equation (33):
\mathrm{thres} = (\theta - \beta - \alpha) \cdot S + S
S in the above equation is the similarity metric used to evaluate the similarity between the current frame and the previous keyframe, which is defined as follows in Equation (34):
S = n_{\text{diff}} \cdot \frac{n_{\text{matches}}}{r} + n_{\text{diff}} \cdot \frac{m}{C}
In the calculation of S, n_matches denotes the number of successful matches, r is the number of matched points between the reference frame and the previous keyframe, and m and C represent the total number of points in the current frame and the reference frame, respectively. The similarity metric S combines the visibility consistency between the current frame and the previous keyframe with the ratio of the number of matched points. The first term, n_diff · n_matches / r, measures the impact of visibility-inconsistent matching points relative to the number of historical matches, while the second term, n_diff · m / C, evaluates their impact relative to the number of historical map points. In addition, θ in the thres calculation measures the reduction in the number of current matches relative to the number of historical matches, and is defined as follows in Equation (35):
\theta = 1 - \frac{n_{\text{matches}}}{r}
On the other hand, α represents the proportionally adjusted value of a match with an inconsistent visibility class, which is defined as follows in Equation (36):
\alpha = \frac{n_{\text{diff}}}{n_{\text{matches}}} - 0.5
The scale was adjusted to be centered on 0 by subtracting 0.5, with positive values indicating a higher proportion of visibility-inconsistent match points, and negative values indicating the opposite. Finally, β denotes the change in the proportion of visibility-inconsistent match points between the current frame and the reference frame, further reflecting the change in matching quality. It is defined as follows in Equation (37):
\beta = \frac{n_{\text{diff}}}{n_{\text{matches}}} - \frac{n_{\text{diffr}}}{r}
where n_diffr denotes the number of matching points with inconsistent visibility levels between the reference frame and the previous keyframe. If β is positive, the proportion of visibility-inconsistent match points in the current frame has increased; otherwise, it has decreased.
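A compact sketch of Equations (33)–(37) follows. The variable names mirror the text, but the sign convention in Equation (33) is reconstructed here for illustration only.

```python
def visibility_threshold(n_diff, n_matches, n_diffr, r, m, C):
    """Adaptive threshold on the number of visibility-inconsistent matches
    (Eqs. (33)-(37)). All counts are assumed to be positive."""
    S = n_diff * n_matches / r + n_diff * m / C   # similarity metric, Eq. (34)
    theta = 1.0 - n_matches / r                   # match reduction, Eq. (35)
    alpha = n_diff / n_matches - 0.5              # centered inconsistency ratio, Eq. (36)
    beta = n_diff / n_matches - n_diffr / r       # change of the ratio, Eq. (37)
    thres = (theta - beta - alpha) * S + S        # Eq. (33), reconstructed signs
    return thres, S
```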
The adaptive thresholding criterion described above only models the number of changes in the current frame compared to the previous keyframe. Therefore, as these changes increase, the probability of a positive response to this criterion for selecting keyframes also increases. The number of changes must be controlled because if there are too many changes, this frame will not align with the previous keyframe. Thus, in order to select keyframes with a uniform distribution of valid points for creating a dense and accurate point cloud, another parallel criterion is employed.
The current frame image plane is divided into a 3 × 3 grid to construct the grid matrix G ∈ R^{3×3}. For each pair of matching points, the value in G is incremented by 1 based on the grid cell (i, j) in which its pixel coordinate (u, v) is located. The equilibrium of center of gravity (E_COG) is then computed by analyzing the distribution of matching points in G, as shown in Equation (38):
E_{COG} = \frac{D_{\max 2} \cdot G_{\max 2}}{2\sqrt{2} \cdot G_{\max}}
where G_max is the largest value in G, denoting the maximum number of matching points in a single grid cell, and G_max2 is the second-largest value in G. D_max2 denotes the Euclidean distance between the centroids of these two grid cells. E_COG reflects the spatial distribution of the matching points. A high value of E_COG indicates that the matching points are mainly concentrated in grid regions that are far apart; conversely, a low value indicates that they are concentrated in grid regions that are close together, meaning that the distribution of matching points is more centralized. The threshold E_th for E_COG can be obtained by computing the center of gravity of the matrix, defined as follows in Equations (39)–(41):
\text{centroidRow} = \frac{\sum_{j=0}^{2} G_{2,j} - \sum_{j=0}^{2} G_{0,j}}{\sum_{i=0}^{2}\sum_{j=0}^{2} G_{i,j}}
\text{centroidCol} = \frac{\sum_{i=0}^{2} G_{i,2} - \sum_{i=0}^{2} G_{i,0}}{\sum_{i=0}^{2}\sum_{j=0}^{2} G_{i,j}}
E_{th} = \sqrt{(\text{centroidRow})^2 + (\text{centroidCol})^2}
In the above equations, \sum_{j=0}^{2} G_{2,j} represents the total number of matching points in all grid cells of the third row, \sum_{j=0}^{2} G_{0,j} the total in the first row, \sum_{i=0}^{2} G_{i,2} the total in the third column, \sum_{i=0}^{2} G_{i,0} the total in the first column, and \sum_{i=0}^{2}\sum_{j=0}^{2} G_{i,j} the total number of matching points over all grid cells. E_th denotes the overall offset of the center of gravity of the matching points with respect to the center of the image.
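The grid statistics of Equations (38)–(41) can be computed as in the following sketch. The cell-indexing convention and the 2√2 normalization (the largest possible distance between cell centers of a 3 × 3 grid, in cell units) are assumptions made for illustration.

```python
import numpy as np

def grid_distribution_check(match_uv, width, height):
    """Return (E_COG, E_th) for the matched points of the current frame
    (Eqs. (38)-(41)) using a 3x3 occupancy grid."""
    G = np.zeros((3, 3))
    for u, v in match_uv:
        i = min(int(3 * v / height), 2)           # row index from pixel row v
        j = min(int(3 * u / width), 2)            # column index from pixel column u
        G[i, j] += 1

    flat = np.argsort(G, axis=None)               # cells sorted by occupancy
    i1, j1 = np.unravel_index(flat[-1], G.shape)  # most populated cell
    i2, j2 = np.unravel_index(flat[-2], G.shape)  # second most populated cell
    d_max2 = np.hypot(i1 - i2, j1 - j2)           # distance between the cell centers
    e_cog = d_max2 * G[i2, j2] / (2.0 * np.sqrt(2.0) * G[i1, j1] + 1e-12)  # Eq. (38)

    total = G.sum() + 1e-12
    centroid_row = (G[2, :].sum() - G[0, :].sum()) / total     # Eq. (39)
    centroid_col = (G[:, 2].sum() - G[:, 0].sum()) / total     # Eq. (40)
    e_th = np.hypot(centroid_row, centroid_col)                # Eq. (41)
    return e_cog, e_th
```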
By introducing a keyframe selection method based on parallax, geometric consistency, and matching quality assessment, we effectively enhance the robustness and accuracy of the visual SLAM system in complex scenes. This method not only adaptively adjusts the keyframe selection criteria to reduce the computational and storage overhead of redundant frames, but also further ensures the system’s efficiency and accuracy by optimizing the spatial distribution of map points and improving matching quality.

3.3. Local Bundle Adjustment Optimization with Heterogeneous Weighting of Map Point Geometry

In SLAM systems, the geometric distribution characteristics of map points directly impact the accuracy and efficiency of system optimization. Based on geometric characteristics, map points can be classified into planar points and edge points. We explore classification and differential weighting methods for different types of map points, and elaborate on the classification of planar points and edge points, as well as their applications in local BA.
During the optimization process, geometric consistency between observations and map points is a key factor in ensuring the reliability of the optimization. Therefore, observations and map points that do not satisfy geometric consistency must be filtered out before local bundle adjustment.
The pose of the keyframes is represented by the SE(3) transformation T c w i . First, the map point P j is transformed into the camera coordinate system (Equation (42)), and the projected position of P j on the image plane is computed as p proj (Equation (43)):
P_c = T_{cw}^{i} \cdot P_j
p_{\text{proj}} = \pi(P_c)
π ( · ) in the above equation denotes the projection function. Next, the reprojection error e reproj is computed as the Euclidean distance between the observed feature point p obs and the projected location p proj , as shown in Equation (44):
e_{\text{reproj}} = \lVert p_{\text{obs}} - p_{\text{proj}} \rVert
The reprojection error primarily arises from observation noise, which is assumed to follow a Gaussian distribution. Therefore, the square of the reprojection error, e_reproj², follows a chi-square distribution with 2 degrees of freedom. According to statistical theory, at a 95% confidence level (as applied in ORB-SLAM3), the critical value of the chi-square distribution is approximately 5.991. Therefore, the threshold for the reprojection error is set to θ_reproj = √5.991 ≈ 2.45. If the reprojection error e_reproj of an observation exceeds the preset threshold θ_reproj, the geometric consistency between the observation and the map point is considered unsatisfactory, and the observation is excluded to prevent its negative impact on the optimization results.
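The following sketch filters observations by this chi-square-derived threshold (Equations (42)–(44)). The pinhole projection with intrinsics fx, fy, cx, cy, the 4 × 4 pose convention, and the function name are assumptions for this example.

```python
import numpy as np

def reprojection_inliers(points_w, T_cw, observations, fx, fy, cx, cy,
                         thr=np.sqrt(5.991)):
    """Keep observations whose reprojection error (Eq. (44)) stays below the
    95%-confidence chi-square threshold (about 2.45 px)."""
    P = np.asarray(points_w, dtype=np.float64)        # (N, 3) world-frame map points
    obs = np.asarray(observations, dtype=np.float64)  # (N, 2) measured pixels
    R, t = T_cw[:3, :3], T_cw[:3, 3]
    Pc = (R @ P.T).T + t                              # camera-frame points, Eq. (42)
    proj = np.stack([fx * Pc[:, 0] / Pc[:, 2] + cx,
                     fy * Pc[:, 1] / Pc[:, 2] + cy], axis=1)   # projection, Eq. (43)
    err = np.linalg.norm(obs - proj, axis=1)          # reprojection error, Eq. (44)
    return err < thr                                  # inlier mask
```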
To further enhance the effectiveness of the optimization, we classify map points based on their observed spatial distribution across multiple keyframes. Specifically, map points are categorized as planar points and edge points, which exhibit different geometric properties and optimization values. Edge points are typically distributed in multiple directions, effectively capturing detailed changes in the environment and providing additional information to support the optimization of camera pose and map point positions. In contrast, planar points are primarily concentrated within the same plane and contain relatively limited geometric information. This can lead to an over-reliance on planar constraints during the optimization process, thereby restricting the accuracy and robustness of the optimization. Therefore, appropriately increasing the weight of edge points in the optimization allows the optimizer to better leverage their rich geometric information, enhancing the system’s resilience to noise and outliers, accelerating optimization convergence, and improving map accuracy and consistency.
For each map point, collect the positions of all its observed keyframes { C i } in the world coordinate system. Then, compute the mean position μ (Equation (45)) and the covariance matrix Σ (Equation (46)) of these positions:
\mu = \frac{1}{N} \sum_{i=1}^{N} C_i
\Sigma = \frac{1}{N} \sum_{i=1}^{N} (C_i - \mu)(C_i - \mu)^{T}
where N is the number of keyframes observing the map point. Eigenvalue decomposition is used to determine the eigenvalues of Σ, denoted as λ_1 ≥ λ_2 ≥ λ_3. The minimum eigenvalue, λ_3, reflects the extent of the geometric distribution in its least significant direction. The values of λ_3 for all map points are statistically analyzed to compute their mean λ̄_3 and standard deviation σ_{λ3}. The adaptive threshold is then set as defined in Equation (47):
\theta_{\text{planar}} = \bar{\lambda}_3 + \sigma_{\lambda_3}
Map points are classified as planar if λ 3 < θ planar ; otherwise, they are classified as edge points. After classification, differential weighting is applied to enhance optimization performance. In Local BA, map points are assigned different weights w depending on whether they are planar or edge points, with edge points typically receiving a larger w. By introducing w to weight the error, the optimizer places greater emphasis on edge points, where geometric information is more significant, thereby improving the overall effectiveness of the optimization. The optimization cost function is given by Equation (48):
E = \sum_{i=1}^{N_p} w(p_i) \cdot \lVert z_i - h(p_i) \rVert^2
where N p is the total number of map points, p i represents the ith map point, z i is the observed value of p i (i.e., its pixel coordinates), and h ( p i ) denotes the coordinates of p i projected onto the image plane. w ( p i ) is the weight of the ith map point, which is used to adjust its contribution to the optimization. If the map point is an edge point, w ( p i ) = 1.5 ; otherwise, w ( p i ) = 1 . The selection of 1.5 as the weight for edge points is based on the sensitivity analysis presented in Table 1. While the optimal value varies slightly per sequence, the analysis demonstrates that a weight of 1.5 provides a robustly strong performance across multiple dynamic scenarios, making it an effective general-purpose choice.
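The classification and weighting of Equations (45)–(48) can be sketched as follows. The function interface, in particular the per-map-point list of observing keyframe centers, is an illustrative assumption.

```python
import numpy as np

def edge_planar_weights(obs_keyframe_centers, w_edge=1.5, w_planar=1.0):
    """Classify each map point as planar or edge from the spread of its
    observing keyframe centers (Eqs. (45)-(47)) and return the per-point
    weights w(p_i) used in the local BA cost of Eq. (48)."""
    lam3 = []
    for centers in obs_keyframe_centers:          # one (N_i, 3) array per map point
        C = np.asarray(centers, dtype=np.float64)
        mu = C.mean(axis=0)                       # Eq. (45)
        cov = (C - mu).T @ (C - mu) / len(C)      # Eq. (46)
        lam3.append(np.linalg.eigvalsh(cov)[0])   # smallest eigenvalue, lambda_3
    lam3 = np.array(lam3)
    theta_planar = lam3.mean() + lam3.std()       # adaptive threshold, Eq. (47)
    return np.where(lam3 < theta_planar, w_planar, w_edge)
```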
By introducing discretized weighting, the optimizer can more effectively utilize information from edge points, thereby enhancing the system’s robustness to noise and outliers, accelerating optimization convergence, and improving the accuracy of the final localization.

4. Experimental Validation

In this section, we present a comprehensive experimental evaluation of the DKB-SLAM system. First, DKB-SLAM is validated using the TUM RGB-D dataset to assess its performance on standard benchmarks. To evaluate the effectiveness of each module, we perform ablation studies, analyzing the contribution of each component to the overall system performance. Next, we test DKB-SLAM in real-world scenarios to assess its adaptability to real environments. Finally, we measure the average time consumption of each module, demonstrating the real-time efficiency of DKB-SLAM. All experiments were conducted on computers equipped with an AMD Ryzen 9 7945HX CPU, an NVIDIA GeForce RTX 4060 GPU, and 48 GB of RAM.

4.1. Evaluation on the TUM RGB-D Dataset

The TUM dataset is a standardized dataset developed by the Technical University of Munich and is widely used for evaluating and testing visual SLAM algorithms. Among them, the TUM RGB-D dataset covers a diverse range of environments, making it highly valuable for benchmarking. To assess the accuracy and robustness of DKB-SLAM in both static and dynamic scenes, we use three types of TUM RGB-D datasets—static, low-dynamic, and high-dynamic. Each dataset is used to evaluate the system’s performance, and the experimental results are compared and analyzed against other state-of-the-art SLAM systems. Additionally, the impact of each module on DKB-SLAM’s performance is verified through ablation studies, further clarifying the role of its key components. The comparison algorithm, MSSD-SLAM, is run in RGB-D+IMU mode, while the other algorithms are run in RGB-D mode.
To quantitatively assess the global consistency of SLAM trajectories, we utilize the absolute pose error (APE) to evaluate both translational and rotational errors. The root-mean-square error (RMSE) and the mean error of the translational and rotational APEs are computed using the evo tool, where smaller values indicate better system performance.
Table 2 and Table 3 list the test results of ORB-SLAM3 and the three improved algorithms (DKB-SLAM-BA, DKB-SLAM-KF, DKB-SLAM-BA-KF) across five static test sequences. DKB-SLAM-BA is a SLAM system that incorporates a heterogeneous weighting local BA optimization module based on ORB-SLAM3. DKB-SLAM-KF is a SLAM system that employs an improved keyframe selection strategy, while DKB-SLAM-BA-KF combines both the heterogeneous weighting local BA optimization and improved keyframe selection modules. Bold entries indicate the best-performing results in each dataset. RMSE stands for root mean square error, and ‘mean’ refers to the average error.
The data in the table shows that DKB-SLAM-BA-KF achieves the lowest error values in most datasets, demonstrating the superiority of its comprehensive optimization module. In scenarios with a large number of edge points, DKB-SLAM-BA tends to perform better than DKB-SLAM-KF. This is because edge points provide more reliable data to the back-end local BA optimization module, which significantly improves localization accuracy. For test scenes with a large number of planar points, such as fr1/room and fr2/pioneer_360, the keyframe selection strategy of DKB-SLAM-KF proves more advantageous, allowing it to outperform DKB-SLAM-BA in these datasets. However, overall, the errors of both DKB-SLAM-BA and DKB-SLAM-KF are lower than those of ORB-SLAM3, significantly improving localization accuracy.
Table 4 and Table 5 present the translation and rotation APE results of various SLAM systems across six different dynamic sequences. The ‘walking’ series in the verified sequences represents high-dynamic scenes, while the ‘sitting’ series corresponds to low-dynamic scenes. To verify the effectiveness of the algorithm combination proposed in this paper, we designed several sets of ablation experiments, including (i) DKB-SLAM-DR, with only the dynamic feature point filtering module; (ii) DKB-SLAM-KF-DR, with improved keyframe selection and dynamic feature point detection and filtering modules; (iii) DKB-SLAM-BA-DR, with heterogeneous weighting local BA optimization and the dynamic feature point filtering module; and (iv) DKB-SLAM, with comprehensive deployment of the improved keyframe selection method, heterogeneous weighting local BA optimization, and dynamic feature point filtering modules.
The experimental results show that DKB-SLAM achieves the lowest error in most sequences, demonstrating excellent performance in both translation and rotation. In the sitting_half dataset, since the experimenters are almost stationary, some ultra-low dynamic feature points are not successfully identified and filtered out, causing the RMSE of DKB-SLAM’s translation APE to be slightly worse than that of DS-SLAM. From the perspective of rotation, in the walking_rpy dataset, there are many scenes where the experimenters walk in front of the camera, resulting in a significantly higher number of planar points compared to edge points. The improved keyframe selection method has a notable effect on system performance. In comparison, DKB-SLAM’s performance is slightly worse than that of DKB-SLAM-KF-DR, but still better than that of other systems. Overall, the experimental results demonstrate that DKB-SLAM has more advantages in most datasets, proving that the various modules proposed in this paper perform excellently across different dynamic scenes.
Figure 5 and Figure 6 show the trajectories of ORB-SLAM3, DKB-SLAM-BA, DKB-SLAM-KF, and DKB-SLAM-BA-KF in some static datasets, as well as the trajectories of DKB-SLAM-DR, DKB-SLAM-KF-DR, DKB-SLAM-BA-DR, and DKB-SLAM in dynamic datasets. Figure 7 and Figure 8 present the columnar comparison results of max, min, std, median, mean, and RMSE for the above experimental results.
From the test results on the public dataset, our proposed SLAM system demonstrates excellent performance in static, low-dynamic, and high-dynamic environments. The effectiveness of the proposed keyframe selection, heterogeneously weighted local BA optimization, and dynamic feature point identification and rejection is also confirmed.

4.2. Validation on Real-World Scenarios

To fully evaluate the effectiveness of the proposed algorithm, we conducted a series of tests using datasets collected in real-world scenarios. These datasets were designed to assess the adaptability and stability of our SLAM system in various complex environments. All data were collected by a mobile cart equipped with an Intel RealSense D435i RGB-D camera and a computing unit, as shown in Figure 9. The computing unit was primarily used for storing the recorded data and was not involved in the operation of DKB-SLAM.
Since no ground-truth trajectory is available for these recordings, the results can only be analyzed qualitatively. To enhance the credibility of the results, we designed two sets of experiments: 'straight' and 'static'. In the 'straight' experiment, the cart traveled in a straight line while the experimental personnel moved intermittently; to cover both low-dynamic and high-dynamic conditions, the experimenter moved in a low-dynamic manner in the first half and in a high-dynamic manner in the second half. In the 'static' experiment, the cart remained stationary while the experimenter moved quickly throughout, testing whether the system can maintain accurate localization in extreme dynamic scenarios. We used ORB-SLAM3 and SG-SLAM as comparison systems.
Figure 10 illustrates the trajectories of DKB-SLAM, SG-SLAM, and ORB-SLAM3 on the 'straight' dataset, as well as their positions in the x, y, and z directions over time. As shown in the figure, DKB-SLAM maintains a nearly ideal straight trajectory throughout the experiment, with minimal interference from the experimenter's movements. The position changes in the x, y, and z directions also follow a smooth trend without sudden shifts, demonstrating its excellent stability in this scenario. In contrast, SG-SLAM's performance is more variable. Although it maintains a relatively good straight trajectory during the first half of the experiment, it struggles in the highly dynamic second half. In particular, at position 1, the trajectory of SG-SLAM is significantly disrupted and deviates when the experimenters move side by side in a specific direction. The position change graph further reveals severe jitter along the x and y axes at this point, highlighting SG-SLAM's limited adaptability to highly dynamic environments; nevertheless, SG-SLAM remains competitive in low-dynamic scenes. Since ORB-SLAM3 lacks any dynamic culling capability, its trajectory deviates from the expected path.
Figure 11 shows the trajectories of DKB-SLAM, SG-SLAM, and ORB-SLAM3 during the 'static' experiment, along with their positions in the x, y, and z directions over time. Notably, the trajectory of DKB-SLAM remains almost stationary at the origin, which matches the designed baseline trajectory and is unaffected by the experimenter's movements. From Figure 12, it is clear that DKB-SLAM effectively eliminates dynamic points associated with the experimenter in both single- and multi-person fast-motion scenarios. The position change graphs in the x, y, and z directions further confirm that DKB-SLAM exhibits only minimal fluctuations in each direction, consistently staying close to 0. This demonstrates that DKB-SLAM remains unaffected by the experimenter even under these harsh conditions, successfully removing interference from dynamic features and ensuring high localization accuracy. In contrast, SG-SLAM struggles in this highly dynamic scene: its trajectory deviates significantly from the expected path, with increased errors along the x, y, and z directions and severe drift, highlighting its limitations in handling strongly dynamic environments.
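Because no ground truth exists for these recordings, a simple quantitative proxy for the qualitative comparison above is to measure how far each estimated trajectory wanders from its starting pose along each axis. The short sketch below illustrates this kind of check under the assumption that trajectories are stored as N×3 arrays of x, y, z positions; the helper name and the toy noise level are illustrative, not part of the paper's evaluation.

```python
import numpy as np

def stationarity_report(positions, axis_names=("x", "y", "z")):
    """Summarize drift of an estimated trajectory relative to its first pose.

    positions: (N, 3) array of estimated camera positions in meters.
    Returns the maximum absolute deviation and standard deviation per axis.
    """
    dev = positions - positions[0]          # deviation from the starting pose
    return {name: {"max_abs_dev_m": float(np.abs(dev[:, i]).max()),
                   "std_m": float(dev[:, i].std())}
            for i, name in enumerate(axis_names)}

# toy usage: a nearly stationary estimate with small noise
traj = np.zeros((500, 3)) + np.random.randn(500, 3) * 0.003
print(stationarity_report(traj))
```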

4.3. Timing Analysis

In addition to localization accuracy, real-time performance is a crucial metric for evaluating a SLAM system’s practical applicability. Table 6 presents a comparative analysis of the average processing time per frame for DKB-SLAM against several state-of-the-art dynamic SLAM systems. The results highlight DKB-SLAM’s computational efficiency, particularly when compared to methods reliant on heavy segmentation. Systems like DynaSLAM (240.91 ms) and DS-SLAM (87.84 ms) utilize computationally intensive, full-image semantic segmentation, leading to significant processing delays. In contrast, DKB-SLAM adopts a more streamlined two-stage approach. It first uses the efficient YOLOv5 to identify potential dynamic regions, and then applies lightweight optical flow and Gaussian distribution analysis only to feature points within these detected bounding boxes. This strategy avoids processing the entire scene, drastically reducing computational overhead and making it significantly faster than SG-SLAM (35.72 ms) and other segmentation-based methods.
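As a concrete illustration of this two-stage idea, the sketch below filters feature points inside detected person boxes using Lucas-Kanade optical flow and a simple Gaussian model of each box's depth distribution. The thresholds k_sigma and flow_ratio, the helper name, and the exact decision rule are illustrative assumptions, not the system's actual implementation.

```python
import cv2
import numpy as np

def keep_static_points(prev_gray, cur_gray, depth, keypoints, person_boxes,
                       k_sigma=2.0, flow_ratio=2.5):
    """Return indices of keypoints judged static (illustrative thresholds).

    keypoints: (N, 2) pixel coordinates in the previous frame.
    person_boxes: list of (x1, y1, x2, y2) detector boxes in pixels.
    """
    pts = keypoints.astype(np.float32).reshape(-1, 1, 2)
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, cur_gray, pts, None)
    flow = np.linalg.norm(nxt.reshape(-1, 2) - keypoints, axis=1)

    in_box = np.zeros(len(keypoints), dtype=bool)
    person_like = np.zeros(len(keypoints), dtype=bool)
    for (x1, y1, x2, y2) in person_boxes:
        roi = depth[int(y1):int(y2), int(x1):int(x2)]
        valid = roi[roi > 0]
        if valid.size == 0:
            continue
        mu, sigma = float(valid.mean()), float(valid.std()) + 1e-6
        for i, (u, v) in enumerate(keypoints):
            if x1 <= u <= x2 and y1 <= v <= y2:
                in_box[i] = True
                d = float(depth[int(v), int(u)])
                # depth close to the box's dominant (foreground) mode -> likely on the person
                if d > 0 and abs(d - mu) < k_sigma * sigma:
                    person_like[i] = True

    bg = flow[~in_box]
    bg_flow = float(np.median(bg)) if bg.size else float(np.median(flow))
    moving = flow > flow_ratio * (bg_flow + 1e-6)   # flow much larger than background flow
    dynamic = in_box & (person_like | moving) & (status.ravel() == 1)
    return np.where(~dynamic)[0]
```

Points outside all detection boxes, and in-box points whose depth and flow are consistent with the background, are retained for tracking, which is what keeps the per-frame overhead low compared with full-image segmentation.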
Our proposed DKB-SLAM achieves an average processing time of 30.97 ms per frame. This is slower than the static baseline ORB-SLAM3 (18.18 ms) and marginally slower than Crowd-SLAM (27.68 ms). The increased time compared to ORB-SLAM3 is the expected trade-off for adding robust dynamic object handling. Crowd-SLAM's slight speed advantage stems from its more direct filtering strategy: it simply removes all feature points that fall within a YOLO detection box. While this approach is computationally faster, it risks discarding valuable static background points, which can lead to poorer robustness and potential tracking loss. In contrast, the marginal extra processing time of DKB-SLAM is a direct result of its more sophisticated filtering pipeline, which validates points within each bounding box in order to preserve stable features. This deliberate trade-off of a few milliseconds buys higher accuracy and tracking stability. Ultimately, this performance, equivalent to processing more than 32 frames per second, confirms that DKB-SLAM strikes an excellent balance between robust dynamic object handling and computational efficiency, meeting the stringent real-time requirements of practical applications.

5. Conclusions

In this paper, we propose DKB-SLAM, an RGB-D visual SLAM system designed to enhance the navigational autonomy and robustness of mobile robots in dynamic environments. By integrating a lightweight dynamic feature filtering pipeline, an adaptive keyframe selection strategy tailored for robot motion, and a geometry-aware local bundle adjustment, our system effectively mitigates localization drift caused by moving objects. The experimental results, validated on both the public TUM RGB-D benchmark and in real-world scenarios, demonstrate that DKB-SLAM achieves state-of-the-art accuracy and stability in static, low-dynamic, and extremely high-dynamic scenes. The primary advantage of our method lies in its balanced approach, achieving high accuracy in dynamic scenes without the computational burden of full semantic segmentation, making it suitable for real-time robotic applications. However, a limitation is its potential to misclassify very slow-moving or intermittently static objects, as the geometric and motion cues may not be strong enough for detection. Additionally, its performance is dependent on the initial object detection by YOLO, meaning that a failure to detect a dynamic object will result in it not being filtered.
Looking ahead, we plan to extend the system's capabilities for more advanced robotic applications. Future work will focus on integrating the rich environmental perception of DKB-SLAM with motion planning and control modules, enabling not just robust localization but also dynamic obstacle avoidance and safer human–robot interaction, paving the way for more intelligent and autonomous robotic systems. In addition, inspired by valuable feedback from the review process, we will investigate an adaptive weighting scheme for the local BA that dynamically adjusts the weight assigned to edge points based on real-time scene characteristics, such as the level of dynamism or geometric complexity, which could offer further accuracy improvements. The source code for DKB-SLAM will be made publicly available at https://github.com/xu-no1/DKB-SLAM (accessed on 10 September 2025).

Author Contributions

Conceptualization, Y.L. and Q.S.; Methodology, Q.S.; Software, Q.S. and Z.X.; Validation, Q.S., Z.X. and Y.Z.; Formal Analysis, Q.S.; Investigation, Q.S.; Resources, Y.L.; Data Curation, Z.X.; Writing—Original Draft Preparation, Q.S.; Writing—Review & Editing, Y.L., F.Y. and Y.Z.; Visualization, Q.S.; Supervision, Y.L.; Project Administration, Y.L.; Funding Acquisition, Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Key Laboratory of Target Cognition and Application Technology, grant number 2023-CXPT-LC-005; the Excellent Youth Foundation of Heilongjiang Province, grant number YQ2024F017; the National Natural Science Foundation of China, grant number 52271311; and the Fundamental Research Funds for the Central Universities, grant number 3072024XX0802.

Data Availability Statement

The public TUM RGB-D dataset analyzed during this study is available at https://vision.in.tum.de/data/datasets/rgbd-dataset (accessed on 7 September 2025). The source code for DKB-SLAM is publicly available on GitHub at https://github.com/xu-no1/DKB-SLAM (accessed on 10 September 2025). The custom datasets generated and analyzed during the real-world experiments are not publicly available due to privacy considerations, but are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Mur-Artal, R.; Tardós, J.D. ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Trans. Robot. 2017, 33, 1255–1262. [Google Scholar] [CrossRef]
  2. Sahoo, B.; Biglarbegian, M.; Melek, W. Monocular Visual Inertial Direct SLAM with Robust Scale Estimation for Ground Robots/Vehicles. Robotics 2021, 10, 23. [Google Scholar] [CrossRef]
  3. Klein, G.; Murray, D. Parallel tracking and mapping for small AR workspaces. In Proceedings of the 6th IEEE/ACM International Symposium on Mixed and Augmented Reality, Nara, Japan, 13–16 November 2007; pp. 225–234. [Google Scholar]
  4. Qin, T.; Li, P.; Shen, S. VINS-Mono: A robust and versatile monocular visual-inertial state estimator. IEEE Trans. Robot. 2018, 34, 1004–1020. [Google Scholar] [CrossRef]
  5. Engel, J.; Schöps, T.; Cremers, D. LSD-SLAM: Large-scale direct monocular SLAM. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 834–849. [Google Scholar]
  6. Mahmoud, A.; Atia, M. Improved Visual SLAM Using Semantic Segmentation and Layout Estimation. Robotics 2022, 11, 91. [Google Scholar] [CrossRef]
  7. Forster, C.; Pizzoli, M.; Scaramuzza, D. SVO: Fast semi-direct monocular visual odometry. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Hong Kong, China, 31 May–7 June 2014; pp. 15–22. [Google Scholar]
  8. Tourani, A.; Bavle, H.; Avşar, D.I.; Sanchez-Lopez, J.L.; Munoz-Salinas, R.; Voos, H. Vision-Based Situational Graphs Exploiting Fiducial Markers for the Integration of Semantic Entities. Robotics 2024, 13, 106. [Google Scholar] [CrossRef]
  9. Geneva, P.; Eckenhoff, K.; Lee, W.; Yang, Y.; Huang, G. OpenVINS: A research platform for visual-inertial estimation. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 4666–4672. [Google Scholar]
  10. Yang, S.; Fan, G.H.; Bai, L.L.; Zhao, C.; Li, D. Geometric constraint-based visual SLAM under dynamic indoor environment. Comput. Eng. Appl. 2021, 57, 203–212. [Google Scholar]
  11. Zou, D.; Tan, P. CoSLAM: Collaborative visual SLAM in dynamic environments. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 35, 354–366. [Google Scholar] [CrossRef]
  12. Kim, D.H.; Kim, J.H. Effective background model-based RGB-D dense visual odometry in a dynamic environment. IEEE Trans. Robot. 2016, 32, 1565–1573. [Google Scholar] [CrossRef]
  13. Du, Z.J.; Huang, S.S.; Mu, T.J.; Zhao, Q.; Martin, R.R.; Xu, K. Accurate dynamic SLAM using CRF-based long-term consistency. IEEE Trans. Vis. Comput. Graph. 2020, 28, 1745–1757. [Google Scholar] [CrossRef]
  14. Bescos, B.; Fácil, J.M.; Civera, J.; Neira, J. DynaSLAM: Tracking, mapping, and inpainting in dynamic scenes. IEEE Robot. Autom. Lett. 2018, 3, 4076–4083. [Google Scholar] [CrossRef]
  15. Cheng, S.; Sun, C.; Zhang, S.; Zhang, D. SG-SLAM: A real-time RGB-D visual SLAM toward dynamic scenes with semantic and geometric information. IEEE Trans. Instrum. Meas. 2022, 72, 1–12. [Google Scholar] [CrossRef]
  16. Pirker, K.; Rüther, M.; Bischof, H. CD SLAM: Continuous localization and mapping in a dynamic world. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, San Francisco, CA, USA, 25–30 September 2011; pp. 3990–3997. [Google Scholar]
  17. Wang, Y.; Tian, Y.; Chen, J.; Chen, C.; Xu, K.; Ding, X. MSSD-SLAM: Multi-feature semantic RGB-D inertial SLAM with structural regularity for dynamic environments. IEEE Trans. Instrum. Meas. 2024, 74, 5003517. [Google Scholar] [CrossRef]
  18. Fan, J.; Ning, Y.; Wang, J.; Jia, X.; Chai, D.; Wang, X.; Xu, Y. EMS-SLAM: Dynamic RGB-D SLAM with semantic-geometric constraints for GNSS-denied environments. Remote Sens. 2025, 17, 1691. [Google Scholar] [CrossRef]
  19. Campos, C.; Elvira, R.; Rodríguez, J.J.G.; Montiel, J.M.M.; Tardós, J.D. ORB-SLAM3: An accurate open-source library for visual, visual-inertial, and multimap SLAM. IEEE Trans. Robot. 2021, 37, 1874–1890. [Google Scholar] [CrossRef]
  20. Leutenegger, S.; Furgale, P.; Rabaud, V.; Chli, M.; Konolige, K.; Siegwart, R. Keyframe-based visual-inertial SLAM using nonlinear optimization. In Proceedings of the Robotics: Science and Systems (RSS), Berlin, Germany, 24–28 June 2013. [Google Scholar]
  21. Li, P.; Qin, T.; Hu, B.; Zhu, F.; Shen, S. Monocular visual-inertial state estimation for mobile augmented reality. In Proceedings of the IEEE International Symposium on Mixed and Augmented Reality (ISMAR), Nantes, France, 9–13 October 2017; pp. 11–21. [Google Scholar]
  22. Liu, H.; Chen, M.; Zhang, G.; Bao, H.; Bao, Y. ICE-BA: Incremental, consistent and efficient bundle adjustment for visual-inertial SLAM. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1974–1982. [Google Scholar]
  23. Hu, Z.; Zhao, J.; Luo, Y.; Ou, J. Semantic SLAM based on improved DeepLabv3+ in dynamic scenarios. IEEE Access 2022, 10, 21160–21168. [Google Scholar] [CrossRef]
  24. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  25. Yu, C.; Liu, Z.; Liu, X.; Xie, F.; Yang, Y.; Wei, Q.; Fei, Q. DS-SLAM: A semantic visual SLAM towards dynamic environments. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 1168–1174. [Google Scholar]
  26. Zhang, L.; Wei, L.; Shen, P.; Wei, W.; Zhu, G.; Song, J. Semantic SLAM based on object detection and improved octomap. IEEE Access 2018, 6, 75545–75559. [Google Scholar] [CrossRef]
  27. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  28. Yang, D.; Bi, S.; Wang, W.; Yuan, C.; Qi, X.; Cai, Y. DRE-SLAM: Dynamic RGB-D encoder SLAM for a differential-drive robot. Remote Sens. 2019, 11, 380. [Google Scholar] [CrossRef]
  29. Huang, J.; Yang, S.; Zhao, Z.; Lai, Y.; Hu, S.M. ClusterSLAM: A SLAM backend for simultaneous rigid body clustering and motion estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5875–5884. [Google Scholar]
  30. Huang, J.; Yang, S.; Mu, T.J.; Hu, S.M. ClusterVO: Clustering moving instances and estimating visual odometry for self and surroundings. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2168–2177. [Google Scholar]
  31. Yuan, X.; Chen, S. SAD-SLAM: A visual SLAM based on semantic and depth information. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October 2020–24 January 2021; pp. 4930–4935. [Google Scholar]
  32. Dong, Z.; Zhang, G.; Jia, J.; Bao, H. Keyframe-based real-time camera tracking. In Proceedings of the IEEE 12th International Conference on Computer Vision, Kyoto, Japan, 29 September–2 October 2009; pp. 1538–1545. [Google Scholar]
  33. Hsiao, M.; Westman, E.; Zhang, G.; Kaess, M. Keyframe-based dense planar SLAM. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 5110–5117. [Google Scholar]
  34. Endres, F.; Hess, J.; Sturm, J.; Cremers, D.; Burgard, W. 3-D mapping with an RGB-D camera. IEEE Trans. Robot. 2013, 30, 177–187. [Google Scholar] [CrossRef]
  35. Engel, J.; Koltun, V.; Cremers, D. Direct sparse odometry. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 611–625. [Google Scholar] [CrossRef]
  36. Scaramuzza, D.; Siegwart, R. Appearance-guided monocular omnidirectional visual odometry for outdoor ground vehicles. IEEE Trans. Robot. 2008, 24, 1015–1026. [Google Scholar] [CrossRef]
  37. Zhang, A.M.; Kleeman, L. Robust appearance based visual route following for navigation in large-scale outdoor environments. Int. J. Rob. Res. 2009, 28, 331–356. [Google Scholar] [CrossRef]
  38. Nourani-Vatani, N.; Borges, P.V.K. Correlation-based visual odometry for ground vehicles. J. Field Robot. 2011, 28, 742–768. [Google Scholar] [CrossRef]
  39. Huang, G.P.; Mourikis, A.I.; Roumeliotis, S.I. A first-estimates Jacobian EKF for improving SLAM consistency. In Proceedings of the Experimental Robotics: The 11th International Symposium, Athens, Greece, 14–18 July 2008; pp. 373–382. [Google Scholar]
  40. Montemerlo, M. FastSLAM: A factored solution to the simultaneous localization and mapping problem. In Proceedings of the AAAI National Conference on Artificial Intelligence, Edmonton, AB, Canada, 28 July–1 August 2002. [Google Scholar]
  41. Montemerlo, M.; Thrun, S.; Koller, D.; Wegbreit, B. FastSLAM 2.0: An improved particle filtering algorithm for simultaneous localization and mapping that provably converges. In Proceedings of the International Joint Conference on Artificial Intelligence, Acapulco, Mexico, 9–15 August 2003; pp. 1151–1156. [Google Scholar]
  42. Eraghi, H.E.; Taban, M.R.; Bahreinian, S.F. Improved unscented Kalman filter algorithm to increase the SLAM accuracy. In Proceedings of the 9th International Conference on Control, Instrumentation and Automation (ICCIA), Kavar, Iran, 27–28 December 2023; pp. 1–5. [Google Scholar]
  43. Chandra, K.P.B.; Gu, D.W.; Postlethwaite, I. Cubature Kalman filter based localization and mapping. IFAC Proc. Vol. 2011, 44, 2121–2125. [Google Scholar] [CrossRef]
  44. Servières, M.; Renaudin, V.; Dupuis, A.; Antigny, N. Visual and visual-inertial SLAM: State of the art, classification, and experimental benchmarking. J. Sens. 2021, 2021, 2054828. [Google Scholar] [CrossRef]
  45. Leutenegger, S. OKVIS2: Realtime scalable visual-inertial SLAM with loop closure. arXiv 2022, arXiv:2202.09199. [Google Scholar] [CrossRef]
  46. Qian, S.; Xu, Z.; Liu, W.; Zou, J.Z.; Chen, H. Visual simultaneous localization and mapping algorithm for dim dynamic scenes. Exp. Technol. Manag. 2024, 41, 16–25. [Google Scholar]
  47. Azimi, A.; Ahmadabadian, A.H.; Remondino, F. PKS: A photogrammetric key-frame selection method for visual-inertial systems built on ORB-SLAM3. ISPRS J. Photogramm. Remote Sens. 2022, 191, 18–32. [Google Scholar] [CrossRef]
  48. Szeliski, R. Computer Vision: Algorithms and Applications, 2nd ed.; Springer: London, UK, 2022. [Google Scholar]
Figure 1. Overview of the DKB-SLAM system framework. ORB-SLAM3’s original work is shown in a light orange background, while our major new or modified works are presented in a teal background.
Figure 2. Schematic diagram of optical flow.
Figure 3. Logical flowchart of the keyframe selection algorithm. The process is a sequence of quantitative checks rather than visual transformations. This text-based representation is chosen for its clarity and precision in detailing the decision-making criteria, directly corresponding to the equations and logic presented in the text.
Figure 4. Schematic diagram of visibility level judgment. In the diagram, the triangles of the same color represent the matching points, the dots of the same color correspond to the map points, the black dots indicate the feature points, the direction indicated by the arrows represents the orientation of the normal vector of the map points, and the position of the polygonal star denotes the visibility grade.
Figure 5. Comparison plots of the trajectories of each system under specific static datasets are shown. The gray dashed trajectory represents the real trajectory, the red trajectory corresponds to the DKB-SLAM-BA-KF, the green trajectory is for the DKB-SLAM-KF, the orange trajectory is for the DKB-SLAM-BA, and the blue trajectory represents the ORB-SLAM3.
Figure 6. Comparison plots of the trajectories of each system under dynamic datasets are presented. The gray dashed trajectory represents the real trajectory, the red trajectory corresponds to the DKB-SLAM, the green trajectory represents the DKB-SLAM-KF-DR, the orange trajectory represents the DKB-SLAM-BA-DR, and the blue trajectory corresponds to the ORB-SLAM3-DR.
Figure 7. A comparison of the bar results for each system using selected static datasets is shown. The red columns represent the results of DKB-SLAM-BA-KF, the green columns represent the results of DKB-SLAM-KF, the orange columns show the results of DKB-SLAM-BA, and the blue columns represent the results of ORB-SLAM3.
Figure 8. Comparison of the bar results for each system using dynamic datasets is presented. The red columns represent the results of DKB-SLAM, the green columns show the results of DKB-SLAM-KF-DR, the orange columns display the results of DKB-SLAM-BA-DR, and the blue columns indicate the results of DKB-SLAM-DR.
Figure 9. Data acquisition equipment.
Figure 10. Trajectory plots of various SLAM systems running on the ‘straight’ dataset and their positions in the x, y, and z directions over time. The orange lines represent DKB-SLAM, the blue lines represent SG-SLAM, and the gray lines represent ORB-SLAM3.
Figure 11. Trajectory plots of various SLAM systems running on the ‘static’ dataset and their positions in the x, y, and z directions over time. The orange lines represent DKB-SLAM, the blue lines represent SG-SLAM, and the gray lines represent ORB-SLAM3.
Figure 12. Partial run chart of DKB-SLAM executing the ‘static’ dataset.
Table 1. Sensitivity analysis of edge point weights on translational RMSE for DKB-SLAM. The analysis across multiple dynamic TUM RGB-D sequences shows that, while the optimal weight varies slightly, a weight of 1.5 consistently yields near-optimal and robustly low errors, making it a suitable general choice. The bold values indicate the lowest RMSE for each sequence.

Translational RMSE (m) by edge point weight (w):
Sequence        | 1.1    | 1.2    | 1.3    | 1.4    | 1.5    | 1.6    | 1.7    | 1.8    | 1.9
walk-rpy        | 0.0360 | 0.0342 | 0.0315 | 0.0301 | 0.0293 | 0.0302 | 0.0321 | 0.0348 | 0.0365
walk-static     | 0.0072 | 0.0069 | 0.0065 | 0.0058 | 0.0059 | 0.0063 | 0.0066 | 0.0070 | 0.0073
walk-xyz        | 0.0138 | 0.0135 | 0.0128 | 0.0123 | 0.0120 | 0.0124 | 0.0129 | 0.0136 | 0.0140
sitting-static  | 0.0065 | 0.0062 | 0.0058 | 0.0056 | 0.0053 | 0.0057 | 0.0060 | 0.0063 | 0.0066
Table 2. RMSE and mean translational APE for the TUM RGB-D static dataset. Each cell lists RMSE (m) / Mean (m).

Sequence             | ORB-SLAM3        | DKB-SLAM-BA      | DKB-SLAM-KF      | DKB-SLAM-BA-KF
fr1/360              | 0.1315 / 0.1185  | 0.0947 / 0.0880  | 0.1172 / 0.1046  | 0.0865 / 0.0772
fr1/room             | 0.0823 / 0.0728  | 0.0664 / 0.0580  | 0.0568 / 0.0492  | 0.0522 / 0.0462
fr2/large_no_loop    | 0.2928 / 0.2638  | 0.1695 / 0.1588  | 0.2238 / 0.2047  | 0.1064 / 0.0940
fr2/large_with_loop  | 0.1841 / 0.1724  | 0.0884 / 0.0796  | 0.0929 / 0.0834  | 0.0833 / 0.0738
fr2/pioneer_360      | 0.0968 / 0.0831  | 0.0872 / 0.0666  | 0.0867 / 0.0744  | 0.0832 / 0.0719
Table 3. RMSE and mean rotation APE under the TUM RGB-D static dataset. Each cell lists RMSE (°) / Mean (°).

Sequence             | ORB-SLAM3        | DKB-SLAM-BA      | DKB-SLAM-KF      | DKB-SLAM-BA-KF
fr1/360              | 0.2494 / 0.2401  | 0.2051 / 0.1922  | 0.2921 / 0.2858  | 0.1940 / 0.1893
fr1/room             | 0.0997 / 0.0898  | 0.0839 / 0.0766  | 0.0681 / 0.0638  | 0.0686 / 0.0588
fr2/large_no_loop    | 0.2533 / 0.2399  | 0.2443 / 0.2424  | 0.2037 / 0.1905  | 0.1276 / 0.1236
fr2/large_with_loop  | 0.1346 / 0.1310  | 0.0852 / 0.0839  | 0.0851 / 0.0836  | 0.0711 / 0.0690
fr2/pioneer_360      | 0.1079 / 0.0961  | 0.0868 / 0.0799  | 0.0878 / 0.0832  | 0.0887 / 0.0846
Table 4. RMSE of the translational APE under the TUM RGB-D dynamic dataset. All values are RMSE (m).

Sequence        | ORB-SLAM3 | MSSD-SLAM | DRG-SLAM | Crowd-SLAM | SG-SLAM | DS-SLAM | DKB-SLAM-DR | DKB-SLAM-KF-DR | DKB-SLAM-BA-DR | DKB-SLAM
walk-rpy        | 0.1533    | 0.0328    | 0.0424   | –          | 0.0367  | 0.4462  | 0.0386      | 0.0322         | 0.0386         | 0.0293
walk-half       | 0.3265    | 0.0173    | 0.0258   | 0.0539     | 0.0203  | 0.0315  | 0.0196      | 0.0176         | 0.0236         | 0.0155
walk-static     | 0.0237    | 0.0179    | 0.0111   | 0.0071     | 0.0078  | 0.0070  | 0.0109      | 0.0061         | 0.0064         | 0.0061
walk-xyz        | 0.2801    | 0.0136    | 0.0217   | 0.0165     | 0.0154  | 0.0323  | 0.0144      | 0.0144         | 0.0139         | 0.0125
sitting-half    | 0.0221    | 0.0354    | 0.0734   | 0.0247     | 0.0209  | 0.0151  | 0.0240      | 0.0216         | 0.0185         | 0.0176
sitting-static  | 0.0074    | 0.0066    | 0.0072   | 0.0132     | 0.0077  | 0.0116  | 0.0065      | 0.0062         | 0.0057         | 0.0055
Table 5. RMSE and mean rotational APE under the TUM RGB-D dynamic dataset. Each cell lists RMSE (°) / Mean (°).

Sequence        | ORB-SLAM3        | Crowd-SLAM       | SG-SLAM          | DKB-SLAM-DR      | DKB-SLAM-KF-DR   | DKB-SLAM-BA-DR   | DKB-SLAM
walk-rpy        | 2.5574 / 2.5574  | – / –            | 0.0982 / 0.0972  | 0.1007 / 0.0998  | 0.0931 / 0.0922  | 0.1144 / 0.1138  | 0.0934 / 0.0929
walk-half       | 1.1078 / 1.0991  | 0.0345 / 0.0334  | 0.0235 / 0.0216  | 0.0226 / 0.0212  | 0.0260 / 0.0235  | 0.0337 / 0.0314  | 0.0152 / 0.0137
walk-static     | 0.4913 / 0.4908  | 0.1612 / 0.1612  | 0.2682 / 0.2682  | 0.2087 / 0.2086  | 0.1505 / 0.1505  | 0.1246 / 0.1246  | 0.1167 / 0.1166
walk-xyz        | 1.3309 / 1.3198  | 0.0202 / 0.0191  | 0.0146 / 0.0120  | 0.0180 / 0.0617  | 0.0184 / 0.0170  | 0.0152 / 0.0130  | 0.0144 / 0.0127
sitting-half    | 0.0269 / 0.0254  | 0.0671 / 0.0666  | 0.0282 / 0.0276  | 0.0287 / 0.0269  | 0.0225 / 0.0204  | 0.0187 / 0.0171  | 0.0152 / 0.0135
sitting-static  | 0.1055 / 0.1054  | 0.1019 / 0.1019  | 0.1366 / 0.1366  | 0.0796 / 0.0795  | 0.0893 / 0.0893  | 0.0912 / 0.0911  | 0.0918 / 0.0917
Table 6. Algorithm runtime comparison (ms).

System               | Average Processing Time per Frame (ms)
ORB-SLAM3            | 18.18
DS-SLAM              | 87.84
DynaSLAM             | 240.91
Crowd-SLAM           | 27.68
SG-SLAM              | 35.72
DKB-SLAM (Proposed)  | 30.97
