Article

Robust Autonomous Perception for Indoor Service Machines via Geometry-Aware RGB-D SLAM and Probabilistic Dynamic Modeling

1 The Engineering Research Center of Intelligent Control System and Intelligent Equipment, Ministry of Education, Hebei Key Laboratory of Intelligent Rehabilitation and Neuromodulation, Hebei Advanced Equipment Industry Technology Research Institute, Yanshan University, 438 West of Hebei Avenue, Haigang District, Qinhuangdao 066004, China
2 College of Intelligent Systems Science and Engineering, Harbin Engineering University, Harbin 150001, China
* Author to whom correspondence should be addressed.
Machines 2026, 14(2), 222; https://doi.org/10.3390/machines14020222
Submission received: 7 January 2026 / Revised: 3 February 2026 / Accepted: 11 February 2026 / Published: 12 February 2026
(This article belongs to the Section Robotics, Mechatronics and Intelligent Machines)

Abstract

Reliable autonomous perception is essential for indoor service machines operating in human-centered environments, where weak textures, repetitive structures, and frequent dynamic interference often degrade localization stability. Conventional RGB-D SLAM systems typically rely on static-scene assumptions or binary semantic masking, which are insufficient for handling persistent and fine-grained environmental dynamics. This paper presents a robust autonomous perception framework based on geometry-aware RGB-D SLAM, with a particular emphasis on probabilistic dynamic modeling at the feature level. The proposed system integrates multi-granularity geometric representations, including point features, parallel-line structures, and planar regions, to enhance geometric observability in low-texture indoor environments. On this basis, a probabilistic dynamic model is introduced to explicitly characterize feature reliability under motion, where dynamic probabilities are initialized by object detection and continuously updated through temporal consistency, spatial propagation, and multi-view geometric verification. Large-scale planar structures further serve as stable anchors to support robust pose estimation. Experimental results on the TUM RGB-D dynamic benchmark demonstrate that the proposed method significantly improves localization robustness, reducing the average ATE RMSE by approximately 66% compared with representative dynamic SLAM baselines. Additional evaluations on a real-world indoor dataset further validate its effectiveness for long-term autonomous perception under dense motion and frequent occlusions.

1. Introduction

With the increasing deployment of indoor service machines in human-centered environments such as homes, offices, hospitals, and commercial facilities, ensuring reliable autonomous perception has become a fundamental requirement for long-term autonomy and operational safety. Real-world indoor environments are often characterized by weak visual textures, repetitive human-made layouts, and frequent human-induced dynamics, which jointly challenge the stability of perception systems and may lead to localization degradation or failure. Developing perception frameworks that can maintain reliable localization under such complex and dynamic conditions therefore remains a critical problem for intelligent service machines [1].
Visual simultaneous localization and mapping (SLAM) serves as a core perception component for indoor service machines by enabling continuous pose estimation and spatial awareness. Among existing solutions, RGB-D SLAM is particularly attractive due to its ability to provide metric depth information and dense geometric cues with moderate hardware cost and power consumption [2,3]. However, most conventional RGB-D SLAM systems rely on static-scene assumptions and sparse point features, making them vulnerable to weak-texture regions, repetitive structural patterns, and dynamic elements commonly encountered in indoor service environments [4].
To improve robustness under such conditions, a variety of approaches have been proposed. Classical visual SLAM systems, including the ORB-SLAM family [5,6,7] and visual–inertial extensions [8], perform well in static and well-textured scenes but remain fragile in low-texture or repetitive environments. Structure-aware SLAM methods enhance geometric observability by exploiting human-made regularities such as line segments [9,10,11,12,13,14,15,16,17] and planar surfaces [18,19,20,21,22,23,24], while dynamic SLAM approaches typically rely on semantic segmentation or object detection to suppress the influence of moving objects [25,26,27,28,29,30,31,32,33]. Recent end-to-end methods attempt to model dynamics implicitly through learned photometric and geometric consistency [34,35,36,37].
Despite these advances, existing methods remain limited in realistic indoor service environments. Structure-aware approaches are often developed under static assumptions, whereas most dynamic SLAM methods handle scene dynamics in a binary manner by discarding features associated with detected objects. In practice, environmental dynamics are continuous and fine-grained: not all features on dynamic objects are unreliable, and many static features may be temporarily corrupted by occlusion or motion. Moreover, dynamic objects frequently appear near dominant structural elements, making it difficult to assess feature reliability using semantic cues alone. This reveals a fundamental limitation of existing approaches and highlights the need for feature-level uncertainty modeling in dynamic indoor scenes.
To address this issue, the aim of this work is to improve the robustness of autonomous perception for indoor service machines by jointly considering structural ambiguity and feature-level dynamic uncertainty. This paper proposes a geometry-aware RGB-D SLAM framework with probabilistic dynamic modeling in which multi-granularity geometric structures are integrated with feature-level dynamic probability estimation. Experimental results demonstrate that the proposed framework achieves robust and stable localization under dense motion and frequent occlusions, supporting long-term autonomous perception in complex indoor environments. The main contributions of this paper are summarized as follows:
  • A geometry-aware perception backbone for indoor service machines. A unified RGB-D perception framework is developed by jointly exploiting point features, parallel-line structures, and planar regions. By integrating multi-granularity geometric representations, the proposed backbone improves geometric observability and pose stability in weak-texture and structurally repetitive indoor environments.
  • A feature-level probabilistic dynamic modeling mechanism for reliability-aware perception. A probabilistic dynamic model is introduced to explicitly characterize the reliability of point and line features under environmental motion. Dynamic probabilities are initialized using object detection and continuously updated through temporal consistency analysis, spatial neighborhood propagation, and multi-view geometric verification, while large-scale planar structures serve as stable anchors in dynamic scenes.
  • A semantic–geometric dynamic observation handling strategy for back-end optimization. A structure-aware observation handling mechanism is designed by jointly considering near-frame motion residuals, multi-keyframe projection consistency, and epipolar-geometry-based constraints. This strategy enables adaptive weighting and suppression of unreliable observations caused by motion and occlusion during pose optimization.
The remainder of this paper is organized as follows. Section 2 reviews related work on robust indoor perception under weak texture and dynamic interference. Section 3 describes the proposed method and its core components in detail. Section 4 presents experimental results on benchmark and real-world indoor datasets. Section 5 concludes the paper and discusses future research directions.

2. Related Work

2.1. Geometry-Aware SLAM for Robust Indoor Perception

Conventional point-based SLAM systems often suffer from degraded robustness in indoor environments with weak textures or repetitive structures, where sparse or ambiguously distributed keypoints lead to unreliable data association and unstable pose estimation. To alleviate these limitations, a substantial body of research has explored the incorporation of higher-level geometric primitives, most notably line segments and planar surfaces, giving rise to point-line, point-plane, and unified point-line-plane SLAM frameworks.
Early point-line SLAM systems introduced line features as complementary constraints to ORB-SLAM-style pipelines. Representative PL-SLAM approaches [9,10] incorporated line endpoint reprojection residuals into pose optimization and improved robustness under partial occlusions and texture-deficient conditions. Subsequent work refined line representation and residual design to improve optimization stability. Structure PLP-SLAM [11] adopted Plücker coordinates with orthonormal parameterization to reduce redundancy in nonlinear optimization, while EPLF-VINS [12] reformulated line constraints using angular residuals to strengthen geometric consistency. Nevertheless, line segments are often short and fragmented in real images, making them sensitive to noise when used in isolation.
To further improve line reliability, later studies moved from individual segments to more coherent structural cues. PLFG-SLAM [13] employed clustering and graph-based modeling to capture relationships between point and line primitives, and DPLVO [14] introduced pixel-level collinearity residuals with uncertainty-aware weighting to suppress spurious line detections. In addition, UV-SLAM [15] and UL-SLAM [16] exploited vanishing directions to stabilize orientation estimation in structured scenes, while PPL-SLAM [17] explicitly modeled parallel-line groups to reinforce structural coherence and improve pose stability in indoor environments with repetitive layouts.
Planar primitives provide more stable and globally consistent geometric support in human-made environments. Point–plane SLAM methods leverage coplanarity, parallelism, and orthogonality constraints to reduce pose drift and improve long-term stability [18,19]. Building on these insights, unified point-line-plane SLAM frameworks have been proposed to jointly exploit multi-granularity geometric features within a single optimization formulation and to incorporate structural priors such as Manhattan-world constraints [20,21,22,23,24]. Although these geometry-aware approaches significantly enhance geometric observability in weak-texture indoor scenes, most of them are developed under static-scene assumptions and do not explicitly address feature reliability under dynamic interference, which limits their applicability in real-world indoor service environments.

2.2. Dynamic Scene Handling in Visual SLAM

Dynamic environments pose a fundamental challenge to visual SLAM, as moving objects violate the static-scene assumption and introduce erroneous observations that degrade data association and state estimation. In indoor service environments, pedestrians, movable furniture, and frequent occlusions further exacerbate this problem, making robust dynamic handling a critical component of practical SLAM systems. Existing dynamic SLAM approaches can be broadly categorized into geometry-based, learning-based, and hybrid semantic–geometric methods.
Geometry-based dynamic SLAM methods attempt to distinguish static and dynamic features using multi-view geometry, motion consistency, or depth stability without relying on semantic priors. Representative approaches include motion-statistics-based foreground modeling and adaptive filtering strategies, which estimate feature reliability from reprojection residuals or temporal motion patterns [25,26]. Other methods suppress dynamic interference by assigning static weights based on depth consistency or edge stability [27,28]. Although computationally efficient, these approaches rely on hand-crafted thresholds and heuristic rules, making them sensitive to parameter tuning in complex indoor scenes.
Learning-based methods explicitly introduce semantic information to detect potentially dynamic objects at the pixel or instance level. Systems such as DS-SLAM [29] and StaticFusion [30] integrate semantic segmentation or dense reconstruction into the SLAM pipeline to identify moving objects and recover static background structures. While these approaches provide intuitive and effective cues in cluttered environments, their performance is constrained by predefined semantic categories and the accuracy of segmentation masks. Moreover, their reliance on GPU resources and dense inference pipelines often limits deployability on lightweight service machines.
To balance robustness and computational efficiency, hybrid semantic–geometric methods combine semantic priors with geometric consistency constraints. Typical systems integrate object detection with optical flow, depth consistency, or motion clustering to refine dynamic region estimation while maintaining real-time performance [31,32,33]. For instance, SG-SLAM [31] enhances RGB-D SLAM by verifying semantic segmentation results using geometric consistency, whereas Dynamic-VINS [32] fuses RGB-D and inertial measurements to improve tracking robustness under motion disturbances. DRG-SLAM [33] jointly exploits semantic cues and geometric features to filter dynamic observations. These methods demonstrate that coupling semantics with geometry can significantly improve stability in dynamic indoor environments. However, most existing approaches adopt a binary object-level dynamic masking strategy in which all observations associated with detected objects are either entirely discarded or fully retained. Such hard decisions ignore the temporal evolution and geometric reliability of individual features, making the system vulnerable to over-suppression of informative constraints or insufficient rejection of transient dynamic interference.
More recently, neural implicit representation-based SLAM frameworks have emerged as an alternative paradigm for scene modeling and camera tracking, where the environment is represented using continuous neural fields optimized under photometric consistency constraints. Representative systems such as NICE-SLAM [34] and ESLAM [35] employ hierarchical feature grids or multi-scale tensor representations to reduce implicit query cost and enable dense reconstruction in static indoor scenes. These approaches achieve impressive mapping quality and geometric completeness under static assumptions. To extend neural implicit SLAM to dynamic environments, several works have incorporated semantic and geometric cues to explicitly handle motion. NID-SLAM [36] refines semantic masks using depth-based geometric information to improve robustness near object boundaries, while DN-SLAM [37] combines ORB feature tracking with neural implicit mapping and progressively removes dynamic observations using semantic segmentation and optical flow constraints. RoDyn-SLAM [38] further fuses semantic masks with optical flow to generate reliable motion masks and introduces a divide-and-conquer pose optimization strategy with edge-based geometric constraints to enhance inter-frame consistency. In contrast, DDN-SLAM [39] incorporates explicit probabilistic modeling by combining semantic features with a mixture Gaussian formulation and dynamic semantic loss functions to mitigate error propagation caused by occlusions. Although these neural implicit SLAM variants demonstrate strong capability in dense reconstruction and dynamic detail modeling, they fundamentally rely on dense photometric consistency and iterative implicit field optimization. This results in high computational cost, substantial memory overhead, and strong dependence on illumination stability and semantic prediction quality, often requiring GPU acceleration to maintain acceptable performance. 
Such characteristics limit their real-time localization robustness and long-term stability in practical indoor service machine scenarios, where computational resources are constrained and dynamic interference is persistent.
In contrast to the above methods, which treat dynamics primarily as a discrete classification problem, this work models dynamic behavior as a continuously evolving reliability probability at the feature level. By jointly exploiting semantic cues and geometric consistency across both temporal and spatial domains, the proposed probabilistic dynamic modeling framework enables soft reliability weighting rather than hard exclusion of observations. This formulation fundamentally differs from traditional binary semantic masking strategies and provides a more flexible and stable mechanism for handling dense dynamic interference while preserving informative geometric constraints in real indoor service environments.

3. Algorithm

3.1. System Overview

The proposed system, termed MGD-SLAM (Multi-Granularity Geometry-Aware Dynamic SLAM), is developed on top of the ORB-SLAM2 [6] framework to support robust autonomous perception for indoor service machines operating in environments with weak texture and dynamic interference. ORB-SLAM2 offers a reliable point-based optimization baseline, but its performance degrades in scenes where point features become sparse or unstable. To overcome this limitation at the perception level, MGD-SLAM extends the original framework by incorporating multi-granularity geometric representations and dynamic-aware feature processing. Specifically, point features are complemented by parallel line structures and planar regions to enhance geometric observability in low-texture and repetitive indoor environments. In addition, a probabilistic dynamic modeling mechanism is introduced to characterize feature reliability under motion, avoiding binary static-dynamic decisions.
As illustrated in Figure 1, MGD-SLAM is organized into three tightly coupled stages. For each incoming RGB-D frame, the front-end performs camera tracking and extracts multi-granularity geometric features, including ORB point features, parallel line structures reconstructed from two-dimensional line segments, and planar regions obtained through depth-based segmentation. In parallel, a YOLOv11-based object detector runs as an independent ROS node and provides semantic bounding boxes for potentially dynamic objects, which are associated with point and line features to initialize dynamic probabilities. Based on this initialization, the probabilistic dynamic modeling module estimates the motion likelihood of point and line features. In the back-end, observations are adaptively weighted according to their estimated dynamic probabilities, and point-parallel line-plane constraints are jointly optimized to refine the camera trajectory while preserving the structural consistency of the indoor map.

3.2. Multi-Granularity Feature Extraction

To enhance perception robustness in indoor service environments, the proposed system employs a multi-granularity geometric feature extraction strategy that integrates point features, parallel-line structures, and planar regions into a unified representation. By combining geometric cues at different spatial scales, the system maintains reliable data association under weak texture, repetitive patterns, and partial occlusion.
Point: Point features are extracted using the ORB operator due to its favorable balance between computational efficiency and robustness to viewpoint and illumination changes. For each RGB-D frame, detected keypoints $u_i^k$ are back-projected using depth measurements to obtain their corresponding 3D positions $X_i^k$. These point features provide appearance-based correspondences over short baselines and form the foundation of frame-to-frame tracking.
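As a concrete sketch (not the authors' implementation), back-projecting a pixel with metric depth under a standard pinhole camera model looks like the following; the intrinsics fx, fy, cx, cy are assumed calibration parameters:

```python
import numpy as np

def back_project(u, v, depth, fx, fy, cx, cy):
    """Back-project a pixel (u, v) with metric depth into a 3D camera-frame point."""
    z = depth
    x = (u - cx) * z / fx   # horizontal offset scaled by depth over focal length
    y = (v - cy) * z / fy   # vertical offset scaled by depth over focal length
    return np.array([x, y, z])
```

A keypoint at the principal point maps onto the optical axis, e.g. `back_project(320, 240, 2.0, 525, 525, 320, 240)` yields `[0, 0, 2]`.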
Parallel Line: To capture structural regularities beyond isolated line segments, MGD-SLAM represents line features at the level of parallel-line groups. Instead of relying on conventional detectors such as LSD [40] or EDLines [41], which output unorganized segments, an orientation-consistent grouping strategy is adopted to identify sets of line segments sharing similar directions [42].
Specifically, line segments are grouped based on the angular difference between their 2D orientations. Two segments are considered parallel candidates if the absolute orientation difference is smaller than a predefined threshold τ θ . In our implementation, τ θ is set to 3°, following the default configuration adopted in PPL-SLAM [17], which has been validated for structured indoor environments. This threshold provides a good balance between grouping robustness and sensitivity to local structural variations.
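A minimal sketch of such orientation-consistent grouping, assuming a greedy assignment to the first compatible group and accounting for the 180° ambiguity of undirected line segments:

```python
def group_parallel_segments(angles_deg, tau_theta=3.0):
    """Greedily group 2D segments by orientation (undirected lines, modulo 180 deg).
    angles_deg: segment orientations in degrees. Returns a list of index groups."""
    groups = []  # each entry: (reference angle, list of segment indices)
    for i, a in enumerate(angles_deg):
        a = a % 180.0
        for ref, idxs in groups:
            diff = abs(a - ref) % 180.0
            diff = min(diff, 180.0 - diff)  # wrap-around distance for undirected lines
            if diff < tau_theta:
                idxs.append(i)
                break
        else:
            groups.append((a, [i]))
    return [idxs for _, idxs in groups]
```

With the default 3° threshold, a segment at 179° is grouped with segments near 0°, since undirected orientations wrap at 180°.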
Each group reflects a coherent structural element commonly observed in human-made indoor environments. Within each group, an individual line segment is parameterized using three sampled image points corresponding to its start point $u_{n,q,s}^k = (u_s, v_s)$, midpoint $u_{n,q,m}^k = (u_m, v_m)$, and endpoint $u_{n,q,e}^k = (u_e, v_e)$:
$$s_{n,q}^k = \left( u_{n,q,s}^k,\ u_{n,q,m}^k,\ u_{n,q,e}^k \right),$$
where the midpoint $u_{n,q,m}^k$ provides an additional geometric sample along the segment. This three-point representation reduces depth back-projection ambiguity and improves the stability of inter-frame association compared with endpoint-only parameterizations. Using the depth map, the sampled points are back-projected into 3D space as
$$X_{n,q}^k = \left( X_{n,q,s}^k,\ X_{n,q,m}^k,\ X_{n,q,e}^k \right),$$
yielding a compact spatial representation that preserves the segment geometry while remaining efficient for optimization. To quantify the structural significance of each segment, its pixel length $D_{n,q}^k$ is defined as
$$D_{n,q}^k = \left\| u_{n,q,e}^k - u_{n,q,s}^k \right\|,$$
and a normalized salience score $\gamma_{n,q}^k$ is introduced as
$$\gamma_{n,q}^k = \frac{D_{n,q}^k}{L_{\max}},$$
where $L_{\max} = \max(H, W)$ denotes the dominant image dimension. This salience measure favors long, visually supported segments that are more likely to belong to persistent structural elements such as walls, shelves, or corridors.
Each segment is further associated with a set of parallel companions $\hat{S}_n^k$, determined using the same orientation-consistency criterion. Following PPL-SLAM [17], no additional hierarchical merging is applied across groups; instead, grouping is performed locally within each frame to maintain robustness under viewpoint changes and partial occlusions. The complete representation of a line segment is summarized as
$$K_{n,q}^k = \left( \theta_{n,q}^k,\ SL_{n,q}^k,\ D_{n,q}^k,\ \gamma_{n,q}^k,\ \hat{S}_n^k \right),$$
where $\theta_{n,q}^k$ denotes the segment orientation and $SL_{n,q}^k$ is the metric length derived from depth. These geometric attributes are integrated into an LBD-based descriptor, enabling appearance and structural cues to jointly support robust matching.
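The per-segment attributes above (three-point sampling, pixel length, normalized salience) can be sketched as follows; the function name and the returned dictionary layout are illustrative only:

```python
import numpy as np

def line_attributes(p_start, p_end, H, W):
    """Illustrative sketch: three-point sampling, pixel length, and normalized
    salience for one line segment in an H x W image."""
    p_s = np.asarray(p_start, float)
    p_e = np.asarray(p_end, float)
    p_m = 0.5 * (p_s + p_e)            # midpoint sample along the segment
    D = np.linalg.norm(p_e - p_s)      # pixel length of the segment
    gamma = D / max(H, W)              # salience normalized by the dominant image dimension
    return {"samples": (p_s, p_m, p_e), "length_px": D, "salience": gamma}
```

For a 640x480 image, a segment from (0, 0) to (300, 400) has pixel length 500 and salience 500/640, so long structural edges dominate the score.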
Plane: Planar regions are extracted from the depth-induced point clouds using an adaptive agglomerative hierarchical clustering (AHC) method, following the plane segmentation framework of Feng et al. [43], which provides a good balance between detection accuracy and computational efficiency for real-time indoor SLAM. The detected planes are subsequently parameterized using the minimal angular representation proposed by Zhang et al. [18] for pose optimization.
In AHC-based segmentation, locally fitted planar cells are iteratively merged based on spatial adjacency and normal consistency. Region merging is terminated when the plane fitting error exceeds a predefined threshold, as specified in [43]. To ensure geometric reliability, only planar regions with sufficient support are retained; in our implementation, planes with fewer than $T_{\text{plane}}$ supporting points (approximately $0.05\,\text{m}^2$ in image space) are discarded. During segmentation, neighboring planar regions with consistent parameters are automatically merged, preventing over-segmentation and preserving dominant indoor structures such as walls, floors, and tabletops.
Each detected plane is represented by its unit normal vector $\mathbf{n} = (n_x, n_y, n_z)$ and signed distance $d_\Pi$ to the origin. To improve numerical stability during pose optimization, the plane normal is further reparameterized using a minimal angular representation following [18]. Since the unit normal lies on the sphere and thus contains redundancy, its orientation is expressed by
$$\alpha = \operatorname{arctan2}(n_y, n_x), \qquad \beta = \arcsin(n_z),$$
where $\alpha$ denotes the azimuth of the normal projected on the horizontal plane and $\beta$ specifies its elevation. The plane is thus represented in the compact form:
$$q_\Pi = \left( \alpha, \beta, d_\Pi \right),$$
which avoids the over-parameterization of raw normal vectors and provides a numerically well-conditioned representation for optimization.
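A hedged sketch of the conversion between the raw (normal, distance) form and this minimal angular parameterization, including the round-trip back to the unit normal:

```python
import numpy as np

def plane_to_minimal(n, d):
    """Convert a unit plane normal n = (nx, ny, nz) and signed distance d
    to the minimal (alpha, beta, d) angular parameterization."""
    nx, ny, nz = n
    alpha = np.arctan2(ny, nx)                   # azimuth in the horizontal plane
    beta = np.arcsin(np.clip(nz, -1.0, 1.0))     # elevation of the normal
    return alpha, beta, d

def minimal_to_plane(alpha, beta, d):
    """Recover the unit normal from the angular parameterization."""
    n = np.array([np.cos(beta) * np.cos(alpha),
                  np.cos(beta) * np.sin(alpha),
                  np.sin(beta)])
    return n, d
```

The two angles plus distance give exactly three degrees of freedom, matching the plane's intrinsic dimensionality, unlike the redundant four-number (normal, distance) form.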
Through the integration of point features, parallel-line structures and planar regions, the feature extraction module yields a structurally coherent description of each frame. This multi-granularity representation provides a robust geometric foundation for the probabilistic modeling and dynamic filtering mechanisms introduced in subsequent sections, enabling stable tracking in environments with weak texture, repeated patterns and heterogeneous dynamic interference.

3.3. Dynamic Object Detection

In indoor service environments, dynamic objects such as pedestrians, movable furniture, and carried items represent a major source of geometric inconsistency for visual SLAM systems. Their motion introduces transient observations that violate the static-scene assumption and can severely corrupt feature association and pose estimation if treated indiscriminately. Rather than directly discarding observations based on semantic labels, MGD-SLAM employs object detection as a semantic prior to initialize feature-level dynamic uncertainty.
A YOLOv11-based detector [44] is adopted to identify potentially dynamic objects in real time. The detector is initialized with COCO [45] pretrained weights and fine-tuned on scene-specific samples to ensure reliable detection of common indoor dynamic categories, including pedestrians, backpacks, chairs, and other movable objects. Bounding-box predictions with a confidence score higher than 0.5 are retained as valid detections. Importantly, the detected bounding boxes are not used to directly discard all enclosed observations. Instead, they serve as coarse semantic indicators of regions that are likely to contain independent motion. Point and line features falling inside such regions are assigned an initial dynamic probability, which is subsequently refined through temporal consistency, spatial propagation, and geometric verification, as described in Section 3.4. This probabilistic treatment avoids over-aggressive feature rejection and allows the system to handle partial motion, intermittent dynamics, and detection uncertainty in a principled manner.
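The detection-based prior assignment can be illustrated with the following sketch; the 0.5 confidence cut-off matches the text above, while the prior of 0.5 for features inside boxes follows the initialization rule described in Section 3.4 (the function name and data layout are assumptions):

```python
def init_dynamic_priors(keypoints, detections, conf_thresh=0.5, p_dynamic=0.5):
    """Assign an initial dynamic probability to each 2D keypoint.
    detections: list of (x1, y1, x2, y2, confidence) bounding boxes."""
    # keep only confident detections of potentially dynamic objects
    boxes = [(x1, y1, x2, y2) for (x1, y1, x2, y2, c) in detections if c > conf_thresh]
    priors = []
    for (u, v) in keypoints:
        inside = any(x1 <= u <= x2 and y1 <= v <= y2 for (x1, y1, x2, y2) in boxes)
        priors.append(p_dynamic if inside else 0.0)
    return priors
```

Features inside low-confidence boxes are deliberately treated as static at initialization, deferring the decision to the temporal and geometric updates.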
To maintain modularity and computational efficiency, the detector runs as an independent ROS node and communicates with the SLAM system via lightweight message passing. Bounding-box predictions are synchronized with incoming RGB frames and associated with extracted geometric features without interfering with the core estimation pipeline. Representative detection results and the corresponding dynamic feature initialization are illustrated in Figure 2.

3.4. Probabilistic Dynamic Modeling

In realistic indoor environments, dynamic behavior cannot be reliably characterized using instantaneous semantic cues alone. Apparent motion induced by camera ego-motion, intermittent occlusions, and imperfect object detection often lead to ambiguous observations in which static and dynamic features are difficult to separate using binary decisions. To address this challenge, MGD-SLAM introduces a feature-level probabilistic dynamic model that treats motion state as a continuously evolving uncertainty rather than a discrete label.
Each geometric feature is associated with a dynamic probability that represents its likelihood of being influenced by independent motion. This probability is maintained and updated along the SLAM tracking thread, allowing dynamic behavior to be inferred from both current observations and historical evidence. While a unified probabilistic formulation is adopted, different geometric primitives exhibit distinct motion characteristics and are therefore modeled with tailored strategies within the same framework.
Point features are sensitive to both camera motion and object motion, making them the primary carriers of fine-grained dynamic information. For a 3D map point $X_i$ observed in frame $F_k$, its dynamic probability is recursively updated using an exponential smoothing formulation:
$$P_k^{\text{init}}(X_i) = (1 - \eta)\, P_{k-1}(X_i) + \eta\, S_k(x_i),$$
where $P_{k-1}(X_i)$ denotes the propagated probability from the previous frame $F_{k-1}$, $S_k(x_i)$ is the observation-based likelihood derived from semantic–geometric cues (Section 3.5), and the smoothing factor $\eta$ controls the balance between historical consistency and new evidence.
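This update reduces to a one-line convex combination; the parameter name `eta` is our placeholder for the smoothing factor:

```python
def update_dynamic_probability(p_prev, s_obs, eta=0.3):
    """Exponential smoothing of a feature's dynamic probability:
    blends the propagated history p_prev with the new observation likelihood s_obs.
    The default eta = 0.3 is an illustrative assumption."""
    return (1.0 - eta) * p_prev + eta * s_obs
```

A feature with history 0.5 that suddenly looks fully dynamic (s_obs = 1.0) only moves to 0.65 with eta = 0.3, so transient evidence cannot instantly flip a feature's state.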
Temporal coherence is explicitly enforced through feature association. When a feature in frame $F_k$ is successfully matched to an existing map point with a descriptor distance below a predefined threshold, its dynamic probability is inherited from the corresponding map representation. For newly observed features $x_i^k$, the initial probability is determined based on whether the feature lies within a region indicated as potentially dynamic by object detection:
$$P_k(x_i^k) = \begin{cases} P_{k-1}(x_i^{k-1}), & H\!\left( B(x_i^k), B(x_i^{k-1}) \right) < \delta \\ P(X_i^k), & H\!\left( B(x_i^k), B(X_i^k) \right) < \delta \\ P_{\text{init}} = 0, & k = 1,\ Flag_d = 0 \\ P_{\text{init}} = 0.5, & k = 1,\ Flag_d = 1 \end{cases}$$
where $B(\cdot)$ denotes the binary descriptor, $H(\cdot,\cdot)$ the Hamming distance, $\delta$ a matching threshold, and $Flag_d$ indicates whether the feature lies inside a detection box associated with a dynamic object class. This strategy assigns conservative priors in dynamic regions while avoiding premature rejection of newly emerging static features.
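The inheritance-or-initialization rule can be sketched for binary descriptors compared by Hamming distance; the threshold value `delta = 50` and the helper's interface are illustrative assumptions:

```python
def inherit_or_init(desc_cur, desc_prev, p_prev, inside_dynamic_box, delta=50):
    """Temporal inheritance of a feature's dynamic probability.
    desc_cur / desc_prev: binary descriptors as bytes; p_prev: previous
    probability or None for an unmatched (new) feature."""
    # Hamming distance via XOR popcount over the raw descriptor bits
    hamming = bin(int.from_bytes(desc_cur, "big") ^ int.from_bytes(desc_prev, "big")).count("1")
    if p_prev is not None and hamming < delta:
        return p_prev                               # matched: inherit probability
    return 0.5 if inside_dynamic_box else 0.0       # new feature: detection-based prior
```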
Beyond temporal propagation, dynamic behavior also exhibits strong spatial coherence, as neighboring features often undergo similar motion patterns. To exploit this property, a neighborhood-based probability diffusion mechanism is introduced. Let $x_i^k$ denote a feature in frame $F_k$ and $H_P(x^k)$ the set of neighboring features within a radius $r$. The refined dynamic probability is computed as
$$P_k(x_i^k) \leftarrow P_k(x_i^k) + \sum_{x_j^k \in H_P(x^k)} \lambda(d_{ij}) \left( P_k(x_j^k) - P_k(x_i^k) \right),$$
where $d_{ij}$ measures the Euclidean distance between features $x_i^k$ and $x_j^k$, and $\lambda$ is a distance-dependent attenuation function defined as
$$\lambda(d_{ij}) = C\, e^{-d_{ij}/r}, \quad d_{ij} \le r,$$
where C is a normalization constant. This formulation encourages locally consistent probability fields while preserving sharp transitions at motion boundaries. The overall procedure of frame-by-frame temporal propagation and neighborhood-based diffusion is summarized in Figure 3, which illustrates how feature-level dynamic probabilities evolve coherently across consecutive frames and spatial neighborhoods.
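One diffusion pass over all features might be sketched as follows; this uses a brute-force neighbor search for clarity (a spatial index would be preferable in practice), and the values of r and C are assumptions:

```python
import numpy as np

def diffuse_probabilities(positions, probs, r=20.0, C=0.5):
    """One pass of neighborhood-based diffusion of dynamic probabilities.
    Each feature is pulled toward its neighbors' probabilities with weight
    lambda(d) = C * exp(-d / r) for neighbors within radius r."""
    positions = np.asarray(positions, float)
    probs = np.asarray(probs, float)
    out = probs.copy()
    for i in range(len(probs)):
        d = np.linalg.norm(positions - positions[i], axis=1)
        mask = (d <= r) & (d > 0)                 # neighbors within r, excluding self
        lam = C * np.exp(-d[mask] / r)
        out[i] = probs[i] + np.sum(lam * (probs[mask] - probs[i]))
    return out
```

Because the exchange term is antisymmetric between a pair of neighbors, two mutually visible features trade probability mass without changing their sum, which keeps the field locally consistent while leaving isolated features untouched.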
Parallel-line structures provide medium-scale geometric constraints and inherently exert a global influence on pose estimation. To capture partial motion corruption along a line segment, the probabilistic modeling principle is extended by evaluating the dynamic probability at the start point, midpoint, and endpoint of each segment. If any representative sample exhibits a high dynamic likelihood, the entire line segment is treated as unreliable. This design choice is intentionally conservative. Unlike point features, structural line constraints contribute globally to pose optimization, and even partial corruption may introduce severe geometric inconsistency. Therefore, conservative suppression prioritizes robustness and consistency over constraint density, effectively preventing unstable structural constraints from dominating the optimization in highly dynamic environments.
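This conservative rule reduces to a simple predicate over the three sampled probabilities; the 0.5 decision threshold is an assumed value:

```python
def line_is_reliable(sample_probs, threshold=0.5):
    """Conservative suppression for a parallel-line segment: keep it only if
    all three sampled points (start, mid, end) are unlikely to be dynamic.
    The 0.5 threshold is an illustrative assumption."""
    return all(p < threshold for p in sample_probs)
```

A segment whose midpoint is occluded by a passing person (e.g. probabilities [0.1, 0.7, 0.1]) is suppressed entirely rather than contributing a partially corrupted structural constraint.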
Planar features in indoor environments predominantly correspond to large static structures such as walls, floors, and ceilings. Their spatial extent and semantic meaning make them significantly less susceptible to independent motion. Accordingly, planar features are treated as static anchors in the proposed framework and are directly incorporated as reliable constraints in pose optimization without dynamic probability propagation.

3.5. Semantic–Geometric Observation Likelihood Construction

The probabilistic dynamic model introduced in Section 3.4 relies on observation likelihoods to incorporate new evidence from incoming frames. In dynamic indoor environments, no single cue is sufficient to reliably distinguish static and dynamic features. Semantic information may be incomplete, while purely geometric cues are susceptible to camera ego-motion and noise. To address this challenge, the observation likelihood $S_k(x_i)$ is constructed by jointly integrating semantic priors with geometric consistency constraints at multiple temporal scales.
At the short-term temporal scale, local motion consistency is evaluated based on image-plane velocities. Let $x_i^k = (u_i^k, v_i^k)^T$ denote the image coordinates of a feature in frame $F_k$, and $x_i^{k-1}$ its matched location in frame $F_{k-1}$. The instantaneous velocity is defined as
$$v_i^k = \begin{bmatrix} u_i^k - u_i^{k-1} \\ v_i^k - v_i^{k-1} \end{bmatrix}.$$
To obtain a reference motion pattern for static structures, the mean velocity is computed using features located outside all detection bounding boxes:
$$\bar{v}^k = \frac{1}{N_k - M_k} \sum_{i \in p_{nbox}} v_i^k,$$
where $N_k$ is the total number of features in frame $F_k$, $M_k$ is the number of features inside detection boxes, and $p_{nbox}$ denotes the set of features outside these regions. For each feature, the deviation from the static background motion is quantified by the magnitude and directional residuals:
$$d_k = \left\| v_i^k - \bar{v}^k \right\|, \qquad \delta\theta_k = \arctan\!\left( \frac{\delta v}{\delta u} \right),$$
where $(\delta u, \delta v)^T = v_i^k - \bar{v}^k$. These residuals are mapped to a near-frame dynamic likelihood through a sigmoid function:
$$P_{LS} = \frac{1}{1 + \exp\!\left( -\left[ \alpha_{LS} \left( d_k - d_{th}^{LS} \right) + \left( 1 - \alpha_{LS} \right) \left( \delta\theta_k - \theta_{th}^{LS} \right) \right] \right)},$$
where $\alpha_{LS}$ balances magnitude and angular deviations, and $d_{th}^{LS}$ and $\theta_{th}^{LS}$ are the velocity-magnitude and angular-deviation thresholds. This term captures abrupt or inconsistent motion that cannot be explained by camera movement alone.
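A compact sketch of this near-frame likelihood follows. The weight `alpha` and the two thresholds are illustrative placeholders; only the functional form (a sigmoid over the weighted magnitude and angular residuals) mirrors the description above.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def short_term_likelihood(v_feat, v_bg, alpha=0.7, d_th=3.0, th_th=0.5):
    """Near-frame dynamic likelihood from velocity residuals (sketch).

    Combines the magnitude residual |v_i - v_bar| and the direction of
    the deviation vector through a single weighted sigmoid. Thresholds
    are hypothetical, chosen only for illustration.
    """
    du = v_feat[0] - v_bg[0]
    dv = v_feat[1] - v_bg[1]
    d_k = math.hypot(du, dv)                       # magnitude residual
    d_theta = abs(math.atan2(dv, du)) if d_k > 1e-9 else 0.0
    return sigmoid(alpha * (d_k - d_th) + (1 - alpha) * (d_theta - th_th))
```

A feature moving with the static background yields a likelihood well below 0.5, while a feature with a large velocity deviation is pushed toward 1.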
Short-term analysis is complemented by a long-term geometric consistency check across multiple keyframes. Let $KF_k$ denote the current keyframe, and let $C(KF_k)$ denote the set of its first- and second-order covisible keyframes. For a map point $P_j$ observed in $KF_k$, its 3D position is projected into each covisible keyframe and compared with the corresponding matched feature $P_j^k$. The forward reprojection error is defined as
$$d_j^k = \left\| P_j^k - P_j \right\|,$$
and its mean over the $K_n$ covisible keyframes as
$$\bar{d}_{jk} = \frac{1}{K_n} \sum_{j \in K_n} \left\| P_j^k - P_j \right\|.$$
A backward reprojection error $\bar{d}_{kj}$ is computed analogously by projecting points from covisible keyframes back into the current keyframe. The averaged bidirectional reprojection error is then converted into a long-term consistency likelihood:
$$P_{pro} = \frac{1}{1 + \exp\!\left( -\alpha_{pro} \left( \frac{\bar{d}_{jk} + \bar{d}_{kj}}{2} - d_{th}^{pro} \right) \right)},$$
where α p r o controls the sensitivity and d t h p r o is a reprojection error threshold. This term effectively captures slow-moving or intermittently moving objects that may not trigger strong short-term motion cues.
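The bidirectional consistency check reduces to averaging the forward and backward reprojection errors and passing the result through a sigmoid, as sketched below; `alpha` and `d_th` are illustrative values, not the paper's settings.

```python
import math

def reprojection_consistency(fwd_errors, bwd_errors, alpha=2.0, d_th=2.0):
    """Long-term consistency likelihood from bidirectional reprojection
    errors over covisible keyframes (sketch; parameters illustrative).

    Small averaged errors indicate a static map point (likelihood < 0.5);
    persistently large errors indicate slow or intermittent motion.
    """
    d_fwd = sum(fwd_errors) / len(fwd_errors)   # mean forward error
    d_bwd = sum(bwd_errors) / len(bwd_errors)   # mean backward error
    return 1.0 / (1.0 + math.exp(-alpha * ((d_fwd + d_bwd) / 2.0 - d_th)))
```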
Semantic detection may still miss dynamic objects whose categories are not covered by the detector. To handle such cases, a structural geometric likelihood based on epipolar constraints is introduced. An overview of the epipolar geometry employed for this structural constraint is shown in Figure 4, where the geometric relationship between point correspondences, epipolar lines, and the associated residuals is illustrated.
Specifically, using features outside all detected dynamic regions, the relative pose between frames $F_{k-1}$ and $F_k$ is first estimated, from which the fundamental matrix is constructed as
$$F = K^{-T} \left[ t_{k,k-1} \right]_{\times} R_{k,k-1} K^{-1},$$
where $K$ is the intrinsic calibration matrix, $R_{k,k-1}$ and $t_{k,k-1}$ denote the relative rotation and translation, and $[\cdot]_{\times}$ is the skew-symmetric operator. For a feature $x_i^{k-1}$ in frame $F_{k-1}$, its corresponding epipolar line in frame $F_k$ is given by
$$l_i = F\, x_i^{k-1} = \left( l_1, l_2, l_3 \right)^T,$$
and the distance from the observed feature $x_i^k$ to this line is computed as
$$d_i = \frac{\left| x_i^{k\,T} l_i \right|}{\sqrt{l_1^2 + l_2^2}}.$$
Static features are expected to satisfy the epipolar constraint with small residuals. Therefore, an epipolar-based dynamic probability is defined as
$$P_{mf} = \frac{1}{1 + \exp\!\left( -\alpha_{mf} \left( d_i - d_{th}^{mf} \right) \right)},$$
where $\alpha_{mf}$ regulates the influence of the residual and $d_{th}^{mf}$ is a distance threshold. Features with large deviations from the epipolar geometry receive higher dynamic likelihood even in the absence of semantic cues.
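The epipolar likelihood can be sketched with NumPy as follows. The fundamental-matrix construction follows the standard form stated above; the sigmoid parameters `alpha` and `d_th` are illustrative placeholders.

```python
import numpy as np

def skew(t):
    """Skew-symmetric matrix [t]_x such that [t]_x @ v = t x v."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def epipolar_likelihood(K, R, t, x_prev, x_curr, alpha=1.0, d_th=1.0):
    """Epipolar-based dynamic likelihood (sketch; parameters illustrative).

    Builds F = K^-T [t]_x R K^-1, maps the previous observation to its
    epipolar line in the current frame, and converts the point-to-line
    distance into a likelihood through a sigmoid.
    """
    Kinv = np.linalg.inv(K)
    F = Kinv.T @ skew(t) @ R @ Kinv
    l = F @ np.append(x_prev, 1.0)            # epipolar line (l1, l2, l3)
    x = np.append(x_curr, 1.0)                # homogeneous current point
    d = abs(x @ l) / np.hypot(l[0], l[1])     # point-to-line distance
    return 1.0 / (1.0 + np.exp(-alpha * (d - d_th)))
```

A point lying on its epipolar line receives a low dynamic likelihood, while a point far from the line is flagged as likely dynamic even without any semantic detection.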
Finally, the observation likelihood $S_k(x_i)$ is obtained by fusing the near-frame motion likelihood $P_{LS}$, the long-term geometric consistency likelihood $P_{pro}$, and the epipolar-based structural likelihood $P_{mf}$ using a weighted log-odds formulation:
$$S_k(x_i) = \frac{1}{1 + \exp\!\left( -\left[ \omega_{LS}\, \mathrm{logit}(P_{LS}) + \omega_{pro}\, \mathrm{logit}(P_{pro}) + \omega_{mf}\, \mathrm{logit}(P_{mf}) \right] \right)},$$
where $\mathrm{logit}(z) = \ln\!\left( z / (1 - z) \right)$ and $\omega_{LS}$, $\omega_{pro}$, $\omega_{mf}$ are non-negative weights. These weights can be adaptively modulated according to detector confidence, keyframe connectivity, and epipolar inlier ratios.
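The log-odds fusion is a few lines of code; the clamping constant `eps` guards against degenerate likelihoods of exactly 0 or 1 and is an implementation detail, not part of the paper's formulation.

```python
import math

def logit(z, eps=1e-6):
    """Log-odds of a probability, clamped away from 0 and 1."""
    z = min(max(z, eps), 1.0 - eps)
    return math.log(z / (1.0 - z))

def fuse_likelihoods(p_ls, p_pro, p_mf, w=(1.0, 1.0, 1.0)):
    """Weighted log-odds fusion of the three cues (sketch).

    With unit weights, an uninformative cue (p = 0.5) contributes zero
    log-odds and leaves the fused likelihood unchanged.
    """
    s = w[0] * logit(p_ls) + w[1] * logit(p_pro) + w[2] * logit(p_mf)
    return 1.0 / (1.0 + math.exp(-s))
```

A useful property of this form is that agreeing cues reinforce each other: three moderately confident likelihoods of 0.9 fuse into a value much closer to 1 than any single cue.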
Through this semantic–geometric integration, the observation likelihood $S_k(x_i)$ becomes sensitive to both short-term and long-term dynamic behavior, as well as to structural inconsistencies that are invisible to purely semantic or purely geometric approaches. Combined with the probabilistic propagation mechanism in Section 3.4, it provides a principled basis for dynamic feature suppression and adaptive weighting in back-end optimization.
The proposed probabilistic dynamic modeling involves several hyperparameters controlling temporal smoothing, spatial diffusion, and observation likelihood fusion. The specific values used in all experiments are summarized in Table 1. All hyperparameters are fixed across datasets and scenes to avoid sequence-specific tuning. Their values were determined through empirical stability analysis on representative sequences and prior experience with indoor RGB-D SLAM systems. We observed that moderate variations around these values do not lead to significant performance degradation, indicating that the proposed model is not overly sensitive to hyperparameter selection.

3.6. Tightly Coupled Pose Optimization with Multi-Granularity Geometric Fusion

Following the probabilistic dynamic modeling and observation likelihood evaluation in Section 3.4 and Section 3.5, features with high dynamic probability are suppressed, while the remaining features constitute a geometrically coherent and reliability-aware support set for pose estimation. Unlike hard feature removal strategies, the proposed framework retains partially reliable observations through adaptive weighting, enabling robust optimization under intermittent and partial dynamics.
The filtered point features, parallel-line structures, and planar elements are jointly incorporated into a tightly coupled optimization framework that integrates geometry-aware initialization with multi-constraint bundle adjustment. The overall factor graph structure is illustrated in Figure 5, where different geometric primitives contribute complementary constraints to the camera pose.
For point features, a 3D map point $X_j$ observed in frame $F_j$ is reprojected into frame $F_k$ according to
$$u_j^k = \pi\!\left( R_{k,j} X_j + t_{k,j},\, K \right),$$
where $\pi(\cdot)$ denotes the projection function, $K$ is the intrinsic matrix, and $(R \mid t)$ represents the relative pose. The point reprojection residual is then defined as
$$e_{j,k}^{p} = \hat{u}_k - u_j^k,$$
that is, the Euclidean difference between the observed pixel location $\hat{u}_k$ and the reprojection $u_j^k$ into frame $F_k$.
Parallel-line residuals incorporate both collinearity and parallelism constraints. Each structural line $s_{n,q}$ induces a normalized image line $l_{n,q}^k$ computed from its two endpoints:
$$l_{n,q}^k = \eta\, \tilde{l}_{n,q}^k,$$
where $\tilde{l}_{n,q}^k$ is the unnormalized line and $\eta$ is a scale factor. Given the 3D representative points $X_{n,q,s}^j$, $X_{n,q,m}^j$, and $X_{n,q,e}^j$ in frame $F_j$, their projections onto frame $F_k$ must lie on $l_{n,q}^k$. The point-to-line distance for each representative point is computed as
$$d_{n,q,x}^k = \frac{\left| l_{n,q}^{k\,T} u_{n,q,x}^k \right|}{\sqrt{l_1^2 + l_2^2}}, \qquad x \in \{s, m, e\}.$$
These distances constitute the point-to-line residuals used to enforce the collinearity of structural lines in frame $F_k$. The Jacobian of each distance with respect to the pose $\xi_k$ follows from the chain rule:
$$\frac{\partial d_{n,q,x}^k}{\partial \xi_k} = \frac{\partial d_{n,q,x}^k}{\partial u_{n,q,x}^k} \cdot \frac{\partial u_{n,q,x}^k}{\partial \xi_k}, \qquad x \in \{s, m, e\}.$$
Collinearity is enforced by requiring that all three representative points $\{s, m, e\}$ remain aligned:
$$e_{n,q}^{col} = \frac{d_{n,q,s}^k + d_{n,q,e}^k}{2} \times d_{n,q,m}^k.$$
Parallelism constraints use the most structurally consistent parallel line $l_{n,q}^{k'}$ identified in Section 3.2, with the primed distances computed with respect to this line. The corresponding residual is
$$e_{n,q}^{par} = \frac{d_{n,q,s}^{k'} + d_{n,q,e}^{k'}}{2} \times d_{n,q,m}^{k'}.$$
The full structural-line residual aggregates these constraints:
$$e_{n,q}^{l} = e_{n,q}^{par} + e_{n,q}^{col}.$$
Planar features offer strong mid- and high-level geometric support to pose estimation, particularly in indoor environments dominated by walls, floors, and ceilings. The reprojection error of a planar feature is constructed by comparing the plane $\Pi_k$ observed in the current frame $F_k$ with its corresponding map plane $\Pi_m$. Both planes are expressed using the angular parameterization introduced in Section 3.2, from which their unit normals $n(\Pi_k)$ and $n(\Pi_m)$ are reconstructed as
$$n(\Pi) = \left( \cos\beta\cos\alpha,\; \cos\beta\sin\alpha,\; \sin\beta \right)^T.$$
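The angular parameterization and the angular discrepancy between two normals can be sketched directly from this formula; the function names are hypothetical.

```python
import math

def plane_normal(alpha, beta):
    """Unit normal reconstructed from the angular plane parameters
    (alpha, beta), following n = (cos b cos a, cos b sin a, sin b)^T."""
    return (math.cos(beta) * math.cos(alpha),
            math.cos(beta) * math.sin(alpha),
            math.sin(beta))

def normal_angle(n1, n2):
    """Angle between two unit normals; the dot product is clamped to
    [-1, 1] to stay numerically safe near parallel normals."""
    dot = sum(a * b for a, b in zip(n1, n2))
    return math.acos(max(-1.0, min(1.0, dot)))
```

By construction the reconstructed normal has unit length for any $(\alpha, \beta)$, which keeps the angular residual well defined without explicit renormalization.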
To enforce geometric consistency, three relationships are considered implicitly within a unified residual. The first term encourages coplanarity by aligning both the normal direction and the plane-to-origin distance under the estimated pose. The second term promotes parallel consistency when two planar regions are expected to share a common dominant orientation, while the third term penalizes deviations from expected orthogonality relations between major structural surfaces. Combining these components, the planar residual is formulated compactly as
$$e_m^{\Pi} = \begin{bmatrix} \theta\!\left( R_W^k\, n(\Pi_m),\; n(\Pi_k) \right) \\ d_{\Pi_m} + n(\Pi_m)^T t_W^k - d_{\Pi_k} \end{bmatrix},$$
where $R_W^k$ and $t_W^k$ are the rotation and translation from the world map to the current frame, and $\theta(\cdot,\cdot)$ measures the angular discrepancy between two unit normals. This formulation allows coplanar, parallel, and perpendicular structural relations to be handled naturally within the same optimization framework without introducing separate case-specific constraints.
By embedding point, line, and plane residuals into a unified optimization framework, the solver exploits both fine-grained local correspondences and high-level structural regularities. Planar features act as stable anchors, parallel-line structures provide medium-scale directional constraints, and point features capture local geometric details. Under Gaussian noise assumptions, the final optimization problem is solved using the Levenberg–Marquardt algorithm with robust loss functions:
$$\xi^{*} = \arg\min_{\xi} \; \sum_{j} \rho_p\!\left( e_{j,k}^{p\,T} \Sigma_p^{-1} e_{j,k}^{p} \right) + \sum_{n,q} \rho_l\!\left( e_{n,q}^{l\,T} \Sigma_l^{-1} e_{n,q}^{l} \right) + \sum_{m} \rho_{\Pi}\!\left( e_m^{\Pi\,T} \Sigma_{\Pi}^{-1} e_m^{\Pi} \right),$$
where $\Sigma_p$, $\Sigma_l$, and $\Sigma_{\Pi}$ are the covariance matrices of the point, line, and plane residuals, and $\rho$ is the Huber robust loss.
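For reference, a common form of the Huber loss applied to a squared Mahalanobis residual is sketched below; the threshold `delta` is illustrative, and solvers such as g2o or Ceres provide equivalent built-in implementations.

```python
import math

def huber(s, delta=1.0):
    """Huber robust loss on a squared residual s = e^T Sigma^-1 e (sketch).

    Quadratic for small residuals, linear beyond delta, which limits the
    influence of outliers (e.g., residual dynamic features) on the
    Levenberg-Marquardt optimization.
    """
    if s <= delta * delta:
        return s
    return 2.0 * delta * math.sqrt(s) - delta * delta
```

Inliers therefore contribute their full squared error, while a gross outlier with a large residual contributes only a linearly growing penalty instead of a quadratic one.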
Through this tightly coupled formulation, the pose solver explicitly accounts for feature reliability inferred from probabilistic dynamic modeling, resulting in robust and stable localization in weak-texture, repetitive, and partially dynamic indoor environments.

4. Results

To verify the performance of the proposed algorithm, a series of experimental validations were conducted. All experiments were conducted on a laptop configured with an AMD Ryzen 7 5800H processor (3.2 GHz) and 16 GB RAM, running Ubuntu 18.04.

4.1. Experimental Setup

To comprehensively evaluate the effectiveness of the proposed dynamic multi-feature fusion SLAM system, an experimental framework was established that covers both standard benchmark datasets and real-world service machine scenarios.
The TUM RGB-D dataset [46] is selected as the primary benchmark for quantitative evaluation. This dataset provides synchronized RGB–D observations acquired using real sensing platforms operating in real indoor environments, rather than simulated or synthetic settings. As illustrated in Figure 6a, the RGB-D sensor is mounted on a handheld or mobile platform and captures natural indoor scenes with real human motion, occlusions, and viewpoint changes. The dynamic sequences include two representative motion patterns. The sitting sequences (“s-”) mainly contain static backgrounds with localized hand movements, whereas the walking sequences (“w-”) involve continuous human motion, frequent occlusions, and large-scale dynamic interference. Owing to these characteristics, the TUM RGB-D dataset has become a widely adopted benchmark for assessing the robustness of SLAM systems in dynamic indoor environments.
The proposed method is compared against several representative dynamic SLAM systems, including DS-SLAM [29], SG-SLAM [31], Dynamic-VINS [32], and DRG-SLAM [33]. Since the TUM RGB-D dataset does not provide inertial measurements, all compared systems are evaluated in RGB-D mode to ensure a fair and consistent comparison. To further position the proposed approach within the broader SLAM landscape, several state-of-the-art end-to-end neural SLAM methods, including NICE-SLAM [34], ESLAM [35], NID-SLAM [36], RoDyn-SLAM [37], and DDN-SLAM [38], are additionally included. These methods emphasize neural implicit scene modeling and dense reconstruction under dynamic conditions, providing a complementary perspective to geometry-driven approaches.
To analyze the independent contribution of probabilistic dynamic modeling, an ablation study is conducted by comparing the full system with a variant in which dynamic probability estimation and filtering are disabled, leaving only multi-granularity geometric features for pose estimation. ORB-SLAM3 operating in RGB-D mode is further included as a classical geometry-based baseline. This configuration allows a direct assessment of how probabilistic dynamic modeling improves robustness and suppresses drift in highly dynamic indoor scenes.
Beyond benchmark evaluation, the THUD dataset [47] is employed to assess the proposed system in realistic service machine scenarios. As shown in Figure 6b, THUD is collected using a mobile service robot platform operating in real indoor environments over long durations. The dataset features dense pedestrian flows, severe occlusions, abrupt viewpoint changes, and extended trajectories, providing a challenging testbed for evaluating long-term localization stability and map consistency in service-oriented applications.
In all experiments reported in this paper, loop closure is intentionally disabled for all evaluated methods. This design choice aims to focus the evaluation on front-end tracking robustness and the effectiveness of dynamic observation handling, without the influence of global pose graph optimization or loop-induced error redistribution. Disabling loop closure also reflects practical service machine scenarios, where long-term loop opportunities may be sparse or unreliable in highly dynamic indoor environments.
When computing ATE and RPE, only trajectory segments with successful and continuous tracking are considered.
For quantitative evaluation, Absolute Trajectory Error (ATE) is adopted as the primary metric to measure global trajectory consistency. Following the standard protocol, the estimated trajectory is first aligned to the ground-truth trajectory by a single rigid-body transformation $S \in SE(3)$, where $SE(3)$ denotes the group of 3D rigid transformations (rotation and translation). Given the estimated pose $T_t^e \in SE(3)$ and the ground-truth pose $T_t^g \in SE(3)$, ATE is computed as the root-mean-square (RMS) translational error of the aligned pose discrepancy:
$$\mathrm{ATE}_{RMSE} = \sqrt{ \frac{1}{N} \sum_{t=1}^{N} \left\| \mathrm{trans}\!\left( (T_t^g)^{-1} S\, T_t^e \right) \right\|^2 }.$$
To complement global accuracy, Relative Pose Error (RPE) is used to quantify short-term drift and local motion accuracy, particularly in sequences with rapid movements or dense dynamic interference. For a time interval $\Delta$, RPE is defined as
$$\mathrm{RPE}_{RMSE} = \sqrt{ \frac{1}{N - \Delta} \sum_{t=1}^{N - \Delta} \left\| \mathrm{trans}\!\left( \left( (T_t^g)^{-1} T_{t+\Delta}^g \right)^{-1} \left( (T_t^e)^{-1} T_{t+\Delta}^e \right) \right) \right\|^2 }.$$
While ATE reflects long-term drift accumulation, RPE characterizes frame-to-frame consistency and sensitivity to dynamic disturbances. When a method fails to maintain tracking and cannot re-initialize, the evaluation is terminated at the failure point, and no extrapolation or post-recovery alignment is applied. This protocol ensures a fair comparison of robustness under dynamic interference, where early tracking failure directly reflects the system’s inability to cope with challenging conditions rather than being masked by back-end optimization.
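The ATE computation reduces to a short loop over pose pairs, as sketched below; the alignment transformation $S$ is assumed to have been applied to the estimated trajectory beforehand, and `trans(.)` is realized by extracting the translation part of the pose discrepancy.

```python
import numpy as np

def ate_rmse(T_gt, T_est):
    """ATE RMSE over lists of aligned 4x4 homogeneous poses (sketch).

    Assumes the rigid alignment S has already been applied to T_est.
    For each timestamp, the translational part of the pose discrepancy
    (T_g)^-1 T_e is accumulated and the RMS value returned.
    """
    sq_errs = []
    for Tg, Te in zip(T_gt, T_est):
        E = np.linalg.inv(Tg) @ Te                 # aligned discrepancy
        sq_errs.append(float(np.sum(E[:3, 3] ** 2)))
    return float(np.sqrt(np.mean(sq_errs)))
```

Tools such as the EVO toolkit used in Section 4.2 implement this metric, together with the trajectory alignment step, in the same manner.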
For sequences without ground-truth trajectories, qualitative evaluation is conducted through visual inspection of reconstructed point clouds and dense maps, focusing on the structural integrity of walls, floors, and furniture, as well as the continuity of the estimated camera trajectory.
Through the integration of benchmark comparisons, ablation studies, and real-world service machine validation, the experimental setup provides a comprehensive assessment of the proposed SLAM system in terms of trajectory accuracy, local stability, robustness to dynamic interference, and long-term mapping consistency.

4.2. Quantitative Evaluation of Pose Accuracy on Dynamic RGB-D Sequences

Building upon the experimental setup described in Section 4.1, this subsection reports the quantitative evaluation results on the dynamic sequences of the TUM RGB-D dataset. ATE is adopted as the primary evaluation metric, and all RMSE values are computed using the EVO toolkit [48]. Table 2 compares the proposed MGD-SLAM with representative geometry-based dynamic SLAM systems, while Table 3 reports results against state-of-the-art end-to-end neural SLAM approaches. For clarity, the best and second-best results are highlighted accordingly.
As shown in Table 2, the proposed MGD-SLAM demonstrates a clear and consistent performance advantage across most dynamic sequences. On average, MGD-SLAM achieves an ATE RMSE of 0.023 m, representing an improvement of approximately 66% over the best-performing traditional dynamic SLAM baseline, Dynamic-VINS (0.068 m). This result indicates that the integration of multi-granularity geometric features with probabilistic dynamic modeling substantially enhances trajectory accuracy in dynamic indoor environments. In nearly static sequences such as “s_static”, all methods exhibit relatively low errors. Even in this favorable scenario, MGD-SLAM achieves the best or near-best performance, benefiting from the additional geometric constraints introduced by point–parallel-line–plane fusion. In moderately dynamic sequences (“s_half” and “s_xyz”), where local motion and partial occlusions are present, the probabilistic dynamic modeling mechanism effectively suppresses inconsistent observations, enabling stable tracking and low drift. The performance gap becomes more pronounced in highly dynamic walking sequences (“w_half”, “w_xyz”, and “w_rpy”). These sequences involve wide camera motion and dense human activity, where dynamic interference severely degrades traditional SLAM pipelines. MGD-SLAM consistently achieves the lowest ATE across all w-type sequences, demonstrating its robustness under strong dynamic disturbances. In contrast, DS-SLAM suffers noticeable degradation in “w_rpy”, where epipolar-geometry-based filtering becomes unreliable under rapid camera rotation. Dynamic-VINS, which operates without inertial measurements on this dataset, shows increased sensitivity to human motion and viewpoint changes. SG-SLAM performs reasonably when foreground motion dominates but becomes unstable in cluttered environments with complex spatial layouts. DRG-SLAM reaches competitive accuracy in isolated cases, yet its overall performance remains clearly inferior. 
It is worth noting that certain sequences, such as “w_static”, exhibit limited geometric diversity, where only a single dominant planar structure (e.g., a tabletop) is visible while larger structural elements such as the floor are partially occluded. In such cases, global structural regularities are weakly activated. Despite this limitation, MGD-SLAM maintains high accuracy, indicating that the proposed point-parallel line-plane fusion framework contributes to robustness even when strong global structural cues are unavailable.
Table 3 further compares MGD-SLAM with recent end-to-end neural SLAM approaches. Overall, the proposed method achieves accuracy comparable to or better than state-of-the-art neural SLAM systems across all evaluated sequences, attaining the lowest or near-lowest ATE in most cases. In terms of average performance, MGD-SLAM slightly outperforms the best competing neural method, DDN-SLAM, by approximately 5%. Neural implicit SLAM methods typically rely on dense photometric consistency, making them sensitive to weak-texture regions, transient occlusions, and dynamic appearance changes. This limitation is particularly evident in “w_rpy”, where NICE-SLAM and ESLAM suffer from severe drift. Although NID-SLAM and RoDyn-SLAM incorporate enhanced dynamic modeling strategies, their long-term stability remains limited. DDN-SLAM performs competitively on several sequences but still falls slightly short in overall accuracy.
Overall, these quantitative results demonstrate that MGD-SLAM achieves strong robustness across a wide range of dynamic intensities, motion patterns, and geometric configurations. Compared with traditional geometry-based SLAM methods, it delivers substantially improved accuracy in dynamic environments. Compared with end-to-end neural SLAM systems, its explicit geometric modeling and feature-level probabilistic reasoning enable more stable localization under weak texture, heavy occlusion, and rapid appearance variation. These results validate the effectiveness of integrating multi-granularity geometric features with probabilistic dynamic modeling for robust indoor RGB-D SLAM.

4.3. Effectiveness of the Dynamic Feature Removal Strategy

To evaluate the effectiveness of the proposed dynamic feature removal strategy, experiments are conducted on the highly dynamic walking sequences of the TUM RGB-D dataset. Three representative baselines are selected for comparison. ORB-SLAM3 serves as a classical point-based SLAM system without explicit dynamic handling. MWSLAM incorporates point, line, and plane features and therefore shares the multi-granularity geometric representation adopted in this work, but it does not include any dynamic modeling mechanism. MGD-SLAM-nD is an ablation variant of the proposed system, in which the probabilistic dynamic modeling and feature removal modules are disabled while all other components remain unchanged. This setup enables a direct and controlled assessment of the contribution introduced by the dynamic feature removal strategy.
The quantitative results reported in Table 4 and Table 5 demonstrate the substantial impact of dynamic feature removal on both translational and rotational accuracy. In terms of ATE RMSE, ORB-SLAM3 exhibits severe degradation in dynamic scenes, with an average error of 0.475 m across the four walking sequences. Introducing multi-granularity geometric constraints significantly improves performance, as MWSLAM and MGD-SLAM-nD reduce the average error to 0.192 m and 0.197 m, respectively. However, both methods remain sensitive to dense foreground motion, since dynamic observations are still treated as valid constraints during optimization. By explicitly modeling feature-level dynamic probability and removing motion-inconsistent observations, MGD-SLAM further reduces the average ATE RMSE to 0.020 m, corresponding to a reduction of over 95% compared with ORB-SLAM3 and nearly 90% relative to both MWSLAM and MGD-SLAM-nD. The improvement is particularly evident in sequences such as “w_half” and “w_xyz”, where dense pedestrian motion causes significant drift in methods without dynamic filtering. In these cases, eliminating dynamic features effectively prevents moving objects from corrupting global trajectory estimation.
A similar trend is observed for rotational accuracy. Although MGD-SLAM-nD benefits from point-parallel line-plane fusion, its RPE RMSE increases noticeably in sequences involving rapid rotation or heavy occlusion, where dynamic features and structural cues may simultaneously degrade. In contrast, the full MGD-SLAM consistently achieves the lowest rotational errors across all evaluated sequences, as shown in Table 5. This improvement indicates that dynamic probability estimation and multi-frame geometric consistency checks not only suppress translation drift but also stabilize orientation estimation under complex motion patterns.
To further illustrate the effect of dynamic feature removal on global trajectory consistency, Figure 7 presents qualitative comparisons of estimated trajectories against ground truth for four representative walking sequences. While ORB-SLAM3 and MGD-SLAM-nD exhibit visible divergence and accumulated drift, the trajectory produced by MGD-SLAM closely follows the ground truth, maintaining a smooth and consistent shape throughout the sequence. The reduced deviation in regions with strong dynamic interference confirms that the proposed dynamic filtering mechanism effectively preserves reliable geometric constraints for pose estimation.
Overall, these results demonstrate that multi-granularity geometric fusion alone is insufficient to ensure robustness in highly dynamic indoor environments. Explicit dynamic feature removal plays a critical role in preventing motion-induced corruption of the optimization process. By combining probabilistic dynamic modeling with geometric consistency across multiple frames, MGD-SLAM achieves significantly improved translational and rotational accuracy, enabling stable and reliable localization in challenging dynamic scenarios.

4.4. Evaluation in Complex Large-Scale Dynamic Indoor Environments

To further evaluate the robustness and practical applicability of the proposed MGD-SLAM in real-world service machine scenarios, experiments are conducted on two large-scale indoor sequences from the THUD dataset: the “store-c1” supermarket scene and the “canteen-p1-c1” cafeteria scene. These environments exhibit rich human-made structures together with different levels of dynamic interference, ranging from intermittent pedestrian motion to extremely dense and highly unpredictable human activity. MGD-SLAM is compared with ORB-SLAM3, MWSLAM, and the ablation variant MGD-SLAM-nD to analyze the role of dynamic modeling in large-scale deployments.
In large-scale indoor service scenarios, the evaluation protocol differs from that of short benchmark sequences, as tracking continuity and mapping stability become the primary concerns. For the “store-c1” sequence, all evaluated methods experience tracking interruptions caused by dynamic occlusions and repeated structural layouts, preventing the completion of a continuous trajectory over the entire sequence. Under such conditions, trajectory-based quantitative metrics are less informative. Therefore, performance is assessed through visual comparison of estimated trajectories and mapping results, with particular attention to trajectory smoothness, consistency, and failure behavior.
For the “canteen-p1-c1” sequence, the environment is dominated by extremely dense and irregular human motion. In this case, evaluation focuses on qualitative mapping results and trajectory continuity, emphasizing the geometric coherence of reconstructed structures and the absence of abnormal distortions, such as bent corridors, broken planar surfaces, or abrupt trajectory jumps. This evaluation strategy reflects practical service machine deployment scenarios, where maintaining stable localization and structurally consistent maps under persistent dynamic interference is of greater importance than frame-wise trajectory accuracy.
The “store-c1” sequence represents a typical service environment with narrow aisles, densely arranged shelves, and intermittent pedestrian motion, as shown in Figure 8a. From a perception perspective, such environments simultaneously challenge SLAM systems in two aspects: (i) repeated structural patterns that increase the ambiguity of data association, and (ii) dynamic occlusions that intermittently corrupt feature observations, especially during initialization. Figure 8b and Figure 9 present the estimated trajectories and mapping results produced by different systems. ORB-SLAM3 fails to maintain stable tracking and terminates early, indicating that purely point-based representations are insufficient to cope with dense occlusions in cluttered indoor layouts. MWSLAM benefits from point–line–plane fusion and remains operational for a short period; however, without explicit dynamic feature handling, motion-inconsistent observations accumulate and lead to noticeable trajectory oscillations. MGD-SLAM-nD further improves short-term stability by exploiting more coherent multi-granularity geometric constraints, yet it still drifts and eventually loses tracking as dynamic foreground features persistently enter the optimization. In contrast, MGD-SLAM exhibits substantially enhanced robustness. By explicitly modeling feature-level dynamic probabilities and suppressing motion-inconsistent observations, the system maintains stable tracking for the longest duration among all compared methods. Although the full sequence is not completed, the resulting trajectory remains smooth and structurally consistent, and the reconstructed map preserves clear aisle and shelf layouts, as illustrated in Figure 9d. This behavior confirms that probabilistic dynamic modeling is essential for preventing dynamic clutter from degrading global pose estimation in structurally repetitive service environments.
The “canteen-p1-c1” sequence constitutes an extreme dynamic scenario captured during peak dining hours, featuring dense crowds, frequent occlusions, and rapid, irregular human motion (Figure 10). In such conditions, dynamic objects dominate the visual field and severely violate the static-scene assumption underlying classical SLAM systems. As shown in Figure 11, ORB-SLAM3 fails to initialize due to overwhelming dynamic interference, while MWSLAM also breaks down at an early stage despite its use of structural features. MGD-SLAM-nD manages a brief period of operation by relying on large static elements such as ceilings and upper walls; however, without dynamic probability filtering, the continuous influx of moving features quickly destabilizes pose estimation. MGD-SLAM demonstrates markedly superior performance in this highly dynamic environment. The system selectively preserves reliable planar and structural constraints while aggressively down-weighting or removing dynamic features associated with human motion. As a result, MGD-SLAM maintains the longest continuous trajectory and produces the most coherent map among all evaluated methods, even under extreme crowd density. These results highlight the importance of combining structural geometry with probabilistic feature-level dynamic reasoning for maintaining localization stability in real-world service machine scenarios.
The experiments on the THUD dataset reveal that multi-granularity geometric fusion alone is insufficient for robust SLAM in large-scale, crowded indoor environments. While MWSLAM and MGD-SLAM-nD benefit from richer geometric constraints, their lack of explicit dynamic discrimination renders them vulnerable to persistent motion interference. In contrast, MGD-SLAM consistently achieves superior stability and mapping quality by integrating probabilistic dynamic modeling with structure-aware optimization. This capability is particularly critical for service machines operating in public indoor spaces, where long-term autonomy requires resilience to dense crowds, frequent occlusions, and continuously changing scene dynamics.

4.5. Algorithm Runtime Analysis

To evaluate the computational efficiency of MGD-SLAM, we report the average runtime of its main modules on the dynamic sequences of the TUM RGB-D dataset, measured on the same platform described in Section 4.1.
Table 6 reports the average per-frame time consumption of the major modules in the proposed system. The multi-granularity geometric feature extraction module, including point, parallel-line, and planar features, requires approximately 31 ms per frame. Although multiple feature types are extracted, they are processed in parallel, resulting in moderate overhead compared to point-only pipelines.
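As a rough illustration of why parallel extraction keeps the overhead moderate, the sketch below runs three stand-in extractors concurrently, so the per-frame cost approaches the slowest extractor rather than the sum of all three. All function names and timings here are invented for illustration; they are not the paper's implementation.

```python
from concurrent.futures import ThreadPoolExecutor
import time

# Hypothetical per-type extractors standing in for point detection,
# parallel-line grouping, and plane segmentation (names illustrative).
def extract_points(frame):
    time.sleep(0.015)          # stand-in for point feature extraction
    return ("points", frame)

def extract_parallel_lines(frame):
    time.sleep(0.020)          # stand-in for line detection + grouping
    return ("lines", frame)

def extract_planes(frame):
    time.sleep(0.025)          # stand-in for depth-based plane segmentation
    return ("planes", frame)

def extract_all(frame):
    # Running the three extractors concurrently: the wall-clock cost is
    # close to the slowest extractor, not the sum of all three.
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = [pool.submit(f, frame) for f in
                   (extract_points, extract_parallel_lines, extract_planes)]
        return {kind: data for kind, data in (fut.result() for fut in futures)}

start = time.perf_counter()
features = extract_all(frame=0)
elapsed = time.perf_counter() - start
assert set(features) == {"points", "lines", "planes"}
assert elapsed < 0.015 + 0.020 + 0.025   # faster than serial execution
```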
The proposed dynamic feature removal module consumes approximately 9 ms per frame, benefiting from lightweight feature-level probabilistic updates and efficient geometric consistency checks. Semantic object detection is performed using a YOLO-based detector running as an independent ROS node. On the test platform, the detector processes approximately 83 frames per second (≈12 ms per frame) and operates asynchronously, thus not blocking the main SLAM tracking thread.
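The asynchronous-detector arrangement described above can be sketched as follows. The queue-based hand-off, timings, and names are assumptions for illustration rather than the actual ROS node: the tracking loop never waits on the detector, and stale frames are simply dropped when the detector is busy.

```python
import queue
import threading
import time

# Illustrative sketch of an asynchronous detector (standing in for the
# YOLO ROS node) consuming frames at its own rate while tracking proceeds.
frame_queue = queue.Queue(maxsize=1)   # keep only the most recent frame
results = {}                           # frame_id -> detections

def detector_worker():
    while True:
        frame_id = frame_queue.get()
        if frame_id is None:           # shutdown sentinel
            break
        time.sleep(0.012)              # stand-in for ~12 ms inference
        results[frame_id] = [("person", 0.9)]   # dummy detections

worker = threading.Thread(target=detector_worker, daemon=True)
worker.start()

tracked = []
for frame_id in range(5):
    # Non-blocking hand-off: if the detector is busy, drop the stale
    # frame instead of stalling the tracking thread.
    try:
        frame_queue.put_nowait(frame_id)
    except queue.Full:
        pass
    tracked.append(frame_id)           # tracking proceeds regardless
    time.sleep(0.005)                  # stand-in for per-frame tracking work

frame_queue.put(None)
worker.join()
assert tracked == [0, 1, 2, 3, 4]      # tracking never waited on detection
```

Because the hand-off is non-blocking, the SLAM tracking thread runs at its own rate and consumes whichever detections are most recently available, which matches the design described above.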
Overall, MGD-SLAM achieves an average tracking time of approximately 46 ms per frame, corresponding to about 22 FPS, which demonstrates near real-time performance while significantly improving robustness in dynamic indoor environments.

5. Conclusions

This paper presented MGD-SLAM, a geometry-aware RGB-D SLAM framework designed for robust autonomous perception in dynamic indoor service machine environments. The proposed method integrates multi-granularity geometric features with a feature-level probabilistic dynamic modeling mechanism, enabling reliable suppression of motion-inconsistent observations while preserving stable structural constraints from points, parallel-line structures, and planar regions. Extensive experiments on public benchmarks and real-world service environments demonstrate that MGD-SLAM significantly improves localization robustness and trajectory consistency under dynamic interference, weak texture, and crowded conditions. Compared with traditional dynamic SLAM systems, the proposed framework achieves substantially lower trajectory errors, while maintaining performance comparable to recent neural SLAM approaches without relying on dense photometric optimization or heavy computational resources.
Despite these advantages, the performance of MGD-SLAM may degrade in challenging scenarios where RGB-D sensing itself becomes unreliable, such as environments with strong specular reflections (e.g., mirrors or glass surfaces), poor depth quality, or extreme illumination changes. In such cases, inaccurate depth measurements and unstable geometric observations can limit the effectiveness of geometry-based constraints and probabilistic feature modeling. Future work will therefore focus on incorporating more robust multi-modal sensing strategies, such as tighter fusion with inertial or polarization cues, as well as higher-level semantic reasoning to better handle reflective and visually degraded environments. These extensions are expected to further enhance robustness and support long-term deployment of service machines in diverse and complex indoor scenarios.

Author Contributions

Conceptualization, Z.W. and W.D.; methodology, Z.W.; software, Z.W.; validation, Z.W.; formal analysis, Z.W.; investigation, Z.W. and W.W.; resources, Z.W.; data curation, Z.W.; writing—original draft preparation, Z.W.; writing—review and editing, Z.W. and W.W.; visualization, Z.W.; supervision, W.D.; project administration, W.D.; funding acquisition, W.D. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Key Projects of the Natural Science Foundation of Hebei Province under Grant F2024203051, the Hebei Province Higher Education Scientific Research Project under Grant CYZD202505, the National Natural Science Foundation of China under Grants 62073279 and 62203378, and the Central Guidance on Local Science and Technology Development Fund of Hebei Province under Grant 246Z1804G.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The TUM dataset was obtained from https://cvg.cit.tum.de/data/datasets/rgbd-dataset/download, accessed on 15 October 2025. The THUD ROBOTIC dataset was obtained from https://jackyzengl.github.io/THUD-Robotic-Dataset.github.io/, accessed on 2 November 2025.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Li, L.; Li, L.; Li, M.; Liang, K. AI-Driven Robotics: Innovations in Design, Perception, and Decision-Making. Machines 2025, 13, 615. [Google Scholar] [CrossRef]
  2. Noomwongs, N.; Kiataramgul, K.; Chantranuwathana, S.; Phanomchoeng, G. GNSS-RTK-Based Navigation with Real-Time Obstacle Avoidance for Low-Speed Micro Electric Vehicles. Machines 2025, 13, 471. [Google Scholar] [CrossRef]
  3. Li, X.; Li, T.; Zhang, Y.; Li, Z.; Ban, L.; Ning, Y. GL-VSLAM: A General Lightweight Visual SLAM Approach for RGB-D and Stereo Cameras. Sensors 2025, 25, 7467. [Google Scholar] [CrossRef] [PubMed]
  4. Zhou, L.; Huang, G.; Mao, Y.; Wang, S.; Kaess, M. EDPLVO: Efficient Direct Point-Line Visual Odometry. In Proceedings of the 2022 IEEE International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 23–27 May 2022; pp. 7559–7565. [Google Scholar] [CrossRef]
  5. Mur-Artal, R.; Montiel, J.M.M.; Tardós, J.D. ORB-SLAM: A Versatile and Accurate Monocular SLAM System. IEEE Trans. Robot. 2015, 31, 1147–1163. [Google Scholar] [CrossRef]
  6. Mur-Artal, R.; Tardós, J.D. ORB-SLAM2: An Open-Source SLAM System for Monocular, Stereo, and RGB-D Cameras. IEEE Trans. Robot. 2017, 33, 1255–1262. [Google Scholar] [CrossRef]
  7. Campos, C.; Elvira, R.; Rodríguez, J.J.G.; Montiel, J.M.M.; Tardós, J.D. ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual-Inertial, and Multimap SLAM. IEEE Trans. Robot. 2021, 37, 1874–1890. [Google Scholar] [CrossRef]
  8. Qin, T.; Li, P.; Shen, S. VINS-Mono: A Robust and Versatile Monocular Visual-Inertial State Estimator. IEEE Trans. Robot. 2018, 34, 1004–1020. [Google Scholar] [CrossRef]
  9. Pumarola, A.; Vakhitov, A.; Agudo, A.; Sanfeliu, A.; Moreno-Noguer, F. PL-SLAM: Real-Time Monocular Visual SLAM with Points and Lines. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 4503–4508. [Google Scholar] [CrossRef]
  10. Gomez-Ojeda, R.; Moreno, F.-A.; Zuñiga-Noël, D.; Scaramuzza, D.; Gonzalez-Jimenez, J. PL-SLAM: A Stereo SLAM System through the Combination of Points and Line Segments. IEEE Trans. Robot. 2019, 35, 734–746. [Google Scholar] [CrossRef]
  11. Shu, F.; Wang, J.; Pagani, A.; Stricker, D. Structure PLP-SLAM: Efficient Sparse Mapping and Localization Using Point, Line and Plane for Monocular, RGB-D and Stereo Cameras. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 2105–2112. [Google Scholar] [CrossRef]
  12. Xu, L.; Yin, H.; Shi, T.; Jiang, D.; Huang, B. EPLF-VINS: Real-Time Monocular Visual-Inertial SLAM with Efficient Point-Line Flow Features. IEEE Robot. Autom. Lett. 2023, 8, 752–759. [Google Scholar] [CrossRef]
  13. Gao, R.; Wu, S.; Zhang, L.; Pan, L.; Zhang, G.; Wang, H.; Wang, X. PLFG-SLAM: A Visual SLAM for Indoor Weak-Texture Environments with Point-Line Feature Glue Matching. Measurement 2025, 256, 118435. [Google Scholar] [CrossRef]
  14. Zhou, L.; Wang, S.; Kaess, M. DPLVO: Direct Point-Line Monocular Visual Odometry. IEEE Robot. Autom. Lett. 2021, 6, 7113–7120. [Google Scholar] [CrossRef]
  15. Lim, H.; Jeon, J.; Myung, H. UV-SLAM: Unconstrained Line-Based SLAM Using Vanishing Points for Structural Mapping. IEEE Robot. Autom. Lett. 2022, 7, 1518–1525. [Google Scholar] [CrossRef]
  16. Jiang, H.; Qian, R.; Du, L.; Pu, J.; Feng, J. UL-SLAM: A Universal Monocular Line-Based SLAM via Unifying Structural and Non-Structural Constraints. IEEE Trans. Autom. Sci. Eng. 2025, 22, 2682–2699. [Google Scholar] [CrossRef]
  17. Wang, Z.; Ding, W.; Zhang, Y.; Hua, C. PPL-SLAM: RGB-D-Based Point and Parallel-Line Structural Constraints for Enhanced Pose Estimation. Measurement 2026, 260, 119853. [Google Scholar] [CrossRef]
  18. Zhang, X.; Wang, W.; Qi, X.; Liao, Z.; Wei, R. Point–Plane SLAM Using Supposed Planes for Indoor Environments. Sensors 2019, 19, 3795. [Google Scholar] [CrossRef]
  19. Sun, Q.; Yuan, J.; Zhang, X.; Duan, F. Plane-Edge-SLAM: Seamless Fusion of Planes and Edges for SLAM in Indoor Environments. IEEE Trans. Autom. Sci. Eng. 2021, 18, 2061–2075. [Google Scholar] [CrossRef]
  20. Yang, H.; Yuan, J.; Gao, Y.; Sun, X.; Zhang, X. UPLP-SLAM: Unified Point–Line–Plane Feature Fusion for RGB-D Visual SLAM. Inf. Fusion 2023, 96, 51–65. [Google Scholar] [CrossRef]
  21. Kim, P.; Coltin, B.; Kim, H.J. Linear RGB-D SLAM for Planar Environments. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 350–366. [Google Scholar] [CrossRef]
  22. Li, Y.; Yunus, R.; Brasch, N.; Navab, N.; Tombari, F. RGB-D SLAM with Structural Regularities. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 11581–11587. [Google Scholar] [CrossRef]
  23. Yunus, R.; Li, Y.; Tombari, F. ManhattanSLAM: Robust Planar Tracking and Mapping Leveraging Mixture of Manhattan Frames. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 6687–6693. [Google Scholar] [CrossRef]
  24. Wang, Z.; Ding, W.; Zhang, Y.; Hua, C. OTPS-VO: Enhanced RGB-D Odometry for Indoor Service Robots Leveraging Structural Features. Expert Syst. Appl. 2026, 298, 129704. [Google Scholar] [CrossRef]
  25. Li, S.; Lee, D. RGB-D SLAM in Dynamic Environments Using Static Point Weighting. IEEE Robot. Autom. Lett. 2017, 2, 2263–2270. [Google Scholar] [CrossRef]
  26. Song, S.; Lim, H.; Lee, A.J.; Myung, H. DynaVINS++: Robust Visual-Inertial State Estimator in Dynamic Environments by Adaptive Truncated Least Squares and Stable State Recovery. IEEE Robot. Autom. Lett. 2024, 9, 9127–9134. [Google Scholar] [CrossRef]
  27. Zhang, C.; Zhang, R.; Jin, S.; Yi, X. PFD-SLAM: A New RGB-D SLAM for Dynamic Indoor Environments Based on Non-Prior Semantic Segmentation. Remote Sens. 2022, 14, 2445. [Google Scholar] [CrossRef]
  28. Xue, C.; Huang, Y.; Zhao, C.; Li, X.; Mihaylova, L.; Li, Y.; Chambers, J.A. A Gaussian–Generalized-Inverse-Gaussian Joint-Distribution-Based Adaptive MSCKF for Visual-Inertial Odometry Navigation. IEEE Trans. Aerosp. Electron. Syst. 2023, 59, 2307–2328. [Google Scholar] [CrossRef]
  29. Yu, C.; Liu, Z.; Liu, X.-J.; Xie, F.; Yang, Y.; Wei, Q.; Fei, Q. DS-SLAM: A Semantic Visual SLAM toward Dynamic Environments. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 1168–1174. [Google Scholar] [CrossRef]
  30. Scona, R.; Jaimez, M.; Petillot, Y.R.; Fallon, M.; Cremers, D. StaticFusion: Background Reconstruction for Dense RGB-D SLAM in Dynamic Environments. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, QLD, Australia, 21–25 May 2018; pp. 3849–3856. [Google Scholar] [CrossRef]
  31. Cheng, S.; Sun, C.; Zhang, S.; Zhang, D. SG-SLAM: A real-time RGB-D visual SLAM toward dynamic scenes with semantic and geometric information. IEEE Trans. Instrum. Meas. 2023, 72, 7501012. [Google Scholar] [CrossRef]
  32. Liu, J.; Li, X.; Liu, Y.; Chen, H. RGB-D inertial odometry for a resource-restricted robot in dynamic environments. IEEE Robot. Autom. Lett. 2022, 7, 9573–9580. [Google Scholar] [CrossRef]
  33. Wang, Y.; Xu, K.; Tian, Y.; Ding, X. DRG-SLAM: A semantic RGB-D SLAM using geometric features for indoor dynamic scene. In Proceedings of the 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Kyoto, Japan, 23–27 October 2022; pp. 1352–1359. [Google Scholar] [CrossRef]
  34. Zhu, Z.; Peng, S.; Larsson, V.; Xu, W.; Bao, H.; Cui, Z.; Oswald, M.R.; Pollefeys, M. NICE-SLAM: Neural implicit scalable encoding for SLAM. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 12776–12786. [Google Scholar] [CrossRef]
  35. Johari, M.M.; Carta, C.; Fleuret, F. ESLAM: Efficient dense SLAM system based on hybrid representation of signed distance fields. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 17408–17419. [Google Scholar] [CrossRef]
  36. Xu, Z.; Niu, J.; Li, Q.; Ren, T.; Chen, C. NID-SLAM: Neural implicit representation-based RGB-D SLAM in dynamic environments. In Proceedings of the 2024 IEEE International Conference on Multimedia and Expo (ICME), Niagara Falls, ON, Canada, 15–19 July 2024; pp. 1–6. [Google Scholar] [CrossRef]
  37. Ruan, C.; Zang, Q.; Zhang, K.; Huang, K. DN-SLAM: A visual SLAM with ORB features and NeRF mapping in dynamic environments. IEEE Sens. J. 2024, 24, 5279–5287. [Google Scholar] [CrossRef]
  38. Jiang, H.; Xu, Y.; Li, K.; Feng, J.; Zhang, L. RoDyn-SLAM: Robust dynamic dense RGB-D SLAM with neural radiance fields. IEEE Robot. Autom. Lett. 2024, 9, 7509–7516. [Google Scholar] [CrossRef]
  39. Li, M.; Guo, Z.; Deng, T.; Zhou, Y.; Ren, Y.; Wang, H. DDN-SLAM: Real-time dense dynamic neural implicit SLAM. IEEE Robot. Autom. Lett. 2025, 10, 4300–4307. [Google Scholar] [CrossRef]
  40. von Gioi, R.G.; Jakubowicz, J.; Morel, J.-M.; Randall, G. LSD: A line segment detector. Image Process. Line 2012, 2, 35–55. [Google Scholar] [CrossRef]
  41. Akinlar, C.; Topal, C. EDLines: A real-time line segment detector with a false detection control. Pattern Recognit. Lett. 2011, 32, 1633–1642. [Google Scholar] [CrossRef]
  42. Ding, W.; Wang, Z.; Hu, S. OTPL: A novel measurement method of structural parallelism based on orientation transformation and geometric constraints. Signal Process. Image Commun. 2025, 137, 117310. [Google Scholar] [CrossRef]
  43. Feng, C.; Taguchi, Y.; Kamat, V.R. Fast Plane Extraction in Organized Point Clouds Using Agglomerative Hierarchical Clustering. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Hong Kong, China, 31 May–7 June 2014; pp. 6218–6225. [Google Scholar] [CrossRef]
  44. Ultralytics. YOLOv11: Ultralytics Object Detection Framework. Available online: https://github.com/ultralytics/ultralytics (accessed on 3 August 2025).
  45. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2014; pp. 740–755. [Google Scholar] [CrossRef]
  46. Sturm, J.; Engelhard, N.; Endres, F.; Burgard, W.; Cremers, D. A benchmark for the evaluation of RGB-D SLAM systems. In Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vilamoura-Algarve, Portugal, 7–12 October 2012; pp. 573–580. [Google Scholar] [CrossRef]
  47. Tang, Y.-F.; Tai, C.; Chen, F.-X.; Zhang, W.-T.; Zhang, T.; Liu, X.-P.; Liu, Y.-J.; Zeng, L. Mobile robot oriented large-scale indoor dataset for dynamic scene understanding. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; pp. 613–620. [Google Scholar] [CrossRef]
  48. Grupp, M. evo: Python Package for the Evaluation of Odometry and SLAM. Available online: https://github.com/MichaelGrupp/evo (accessed on 20 October 2025).
Figure 1. Overview of the proposed geometry-aware RGB-D SLAM–based autonomous perception framework for indoor service machines. (a) Front-end preprocessing (blue bracket), including object detection and multi-granularity geometric feature extraction; (b) feature-level probabilistic dynamic modeling (red bracket) for estimating motion likelihood; (c) back-end pose optimization (orange bracket) using static multi-granularity geometric constraints. Blue, red, and orange arrows indicate the data flow of the preprocessing, dynamic modeling, and optimization stages, respectively.
Figure 2. Examples of dynamic object detection and the corresponding dynamic feature identification results obtained after SLAM back-end geometric reasoning.
Figure 3. Illustration of temporal propagation and spatial diffusion of feature-level dynamic probabilities across consecutive frames.
Figure 4. Illustration of epipolar geometric constraints used for structural consistency evaluation in dynamic scenes. The red dots represent example feature correspondences (e.g., hand feature points) and their associated reconstructed 3D points. The blue line denotes the camera baseline between two consecutive frames. The green lines indicate the corresponding epipolar lines generated by the projection of 3D points between views, which are used to evaluate geometric consistency.
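The geometric consistency test that Figure 4 depicts can be illustrated with a minimal sketch. The fundamental matrix below corresponds to an assumed toy setup (pure sideways camera translation with identity intrinsics), not the paper's calibration; in that setup the epipolar lines are horizontal, so a correspondence is consistent with the camera motion only if its vertical coordinate is preserved between frames.

```python
import math

# Fundamental matrix for a pure x-translation with identity intrinsics:
# F = [t]_x with t = (1, 0, 0) (an assumed toy configuration).
F = [[0.0, 0.0,  0.0],
     [0.0, 0.0, -1.0],
     [0.0, 1.0,  0.0]]

def epipolar_distance(x1, x2, F):
    """Distance from x2 to the epipolar line l = F @ x1 (homogeneous points)."""
    l = [sum(F[i][j] * x1[j] for j in range(3)) for i in range(3)]
    num = abs(sum(x2[i] * l[i] for i in range(3)))
    return num / math.hypot(l[0], l[1])

# A static point keeps its row coordinate; a moving point (e.g. a hand
# feature, as in Figure 4) drifts off its epipolar line.
static_pt = epipolar_distance([100.0, 50.0, 1.0], [120.0, 50.0, 1.0], F)
moving_pt = epipolar_distance([100.0, 50.0, 1.0], [120.0, 58.0, 1.0], F)

d_th_mf = 2.0                      # epipolar distance threshold (Table 1)
assert static_pt < d_th_mf         # consistent with camera motion -> static
assert moving_pt > d_th_mf         # violates epipolar constraint -> dynamic
```

Correspondences whose epipolar distance exceeds the threshold contribute to the dynamic likelihood of the feature rather than being hard-rejected outright.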
Figure 5. Factor graph optimization model based on structural features. Blue nodes represent the optimized camera poses, while the connected factors encode structural feature constraints used for pose refinement.
Figure 6. Real-world data acquisition platforms used in the experimental evaluation. (a) Hand-held RGB-D sensing platform used for collecting the dynamic sequences of the TUM RGB-D dataset (freiburg3 sitting and walking series) in real office environments [46]; (b) mobile service robot platform used for collecting the THUD dataset during long-term operation in crowded indoor scenes [47].
Figure 7. Trajectory comparison among ORB-SLAM3, MGD-SLAM-nD and MGD-SLAM on four dynamic TUM RGB-D sequences. (a) “w_half” sequence; (b) “w_rpy” sequence; (c) “w_static” sequence; (d) “w_xyz” sequence.
Figure 8. Scene characteristics and trajectory comparison on the THUD “store-c1” sequence. (a) Representative RGB-D frame captured in a supermarket environment, featuring narrow aisles, densely arranged shelves, and intermittent pedestrian motion; (b) comparison of 3D trajectories generated by different algorithms.
Figure 9. Mapping results on the THUD “store-c1” sequence. (a) ORB-SLAM3; (b) MW-SLAM; (c) MGD-SLAM-nD (without probabilistic dynamic modeling); (d) MGD-SLAM (proposed).
Figure 10. Representative scene from the THUD “canteen-p1-c1” sequence. The scene is captured during peak dining hours in a large cafeteria, characterized by extremely dense crowds, frequent occlusions, and rapid, irregular human motion, posing severe challenges to robust visual SLAM in service machine scenarios.
Figure 11. Mapping results of different algorithms on the “canteen-p1-c1” sequence. (a) ORB-SLAM3; (b) MW-SLAM; (c) MGD-SLAM-nD (without probabilistic dynamic modeling); (d) MGD-SLAM (proposed).
Table 1. Hyperparameters used in probabilistic dynamic modeling.
Parameter | Description | Value
— | Temporal smoothing factor | 0.7
δ | Descriptor matching threshold | 0.6
r | Diffusion radius | 0.15 m
d_th^LS | Velocity magnitude threshold | 3
θ_th^LS | Velocity direction threshold | 2
d_th^pro | Reprojection error threshold | 2
d_th^mf | Epipolar distance threshold | 2
ω_LS, ω_pro, ω_mf | Likelihood fusion weights | 0.4, 0.4, 0.2
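One plausible reading of how these hyperparameters interact is sketched below. The exact update rule is defined in the methodology; this sketch only illustrates the weighted fusion of the three geometric likelihoods and the temporal smoothing step, and the constant names are our own illustrative assumptions.

```python
# Hedged sketch of a feature-level dynamic probability update using the
# Table 1 values. ALPHA and the function names are illustrative, not the
# paper's notation.
W_LS, W_PRO, W_MF = 0.4, 0.4, 0.2   # likelihood fusion weights (Table 1)
ALPHA = 0.7                          # temporal smoothing factor (Table 1)

def fuse_likelihoods(l_ls, l_pro, l_mf):
    """Weighted fusion of scene-flow, reprojection and epipolar evidence."""
    return W_LS * l_ls + W_PRO * l_pro + W_MF * l_mf

def update_dynamic_probability(p_prev, l_ls, l_pro, l_mf):
    """Blend the previous probability with the current fused evidence."""
    p_obs = fuse_likelihoods(l_ls, l_pro, l_mf)
    return ALPHA * p_prev + (1.0 - ALPHA) * p_obs

# A feature initialized inside a detected "person" box (high prior) whose
# geometric checks then look static decays toward a low probability:
p = 0.9
for _ in range(8):
    p = update_dynamic_probability(p, l_ls=0.1, l_pro=0.1, l_mf=0.1)
assert p < 0.2   # repeated static evidence overrides the semantic prior
```

The smoothing factor controls how quickly geometric evidence can override the detector-based initialization: higher values favor temporal stability, lower values favor responsiveness to new observations.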
Table 2. ATE RMSE Comparison between Traditional Dynamic SLAM and the Proposed MGD-SLAM on the TUM RGB-D Dynamic Dataset (Unit: m).
Sequence | ORB-SLAM3 | SG-SLAM | DRG-SLAM | Dynamic-VINS | DS-SLAM | MGD-SLAM
s_half | 0.019 | 0.357 | 0.073 | 0.079 | 0.015 | 0.034
s_rpy | 0.031 | 0.454 | 0.032 | 0.098 | 0.029 | 0.042
s_static | 0.009 | 0.163 | 0.007 | 0.011 | 0.012 | 0.009
s_xyz | 0.031 | 0.316 | 0.014 | 0.033 | 0.031 | 0.021
w_half | 0.427 | 0.521 | 0.567 | 0.059 | 0.032 | 0.022
w_rpy | 0.829 | 0.641 | 0.042 | 0.167 | 0.446 | 0.038
w_static | 0.144 | 0.349 | 0.011 | 0.049 | 0.007 | 0.007
w_xyz | 0.501 | 0.551 | 0.022 | 0.050 | 0.032 | 0.013
Avg. | 0.249 | 0.419 | 0.096 | 0.068 | 0.076 | 0.023
Bold values indicate the best performance; underlined values indicate the second-best performance.
Table 3. ATE RMSE comparison between state-of-the-art end-to-end dynamic SLAM methods and the proposed method on the TUM RGB-D dynamic dataset (Unit: m).
Sequence | NICE-SLAM | ESLAM | NID-SLAM | Rodyn-SLAM | DDN-SLAM | MGD-SLAM
w_half | 0.134 | 0.617 | 0.070 | 0.056 | 0.023 | 0.022
w_rpy | 0.734 | 0.573 | 0.678 | 0.045 | 0.039 | 0.038
w_static | 0.491 | 0.075 | 0.063 | 0.017 | 0.010 | 0.007
w_xyz | 0.421 | 0.435 | 0.072 | 0.083 | 0.013 | 0.013
Avg. | 0.445 | 0.425 | 0.221 | 0.050 | 0.021 | 0.020
Bold values indicate the best performance.
Table 4. Comparison of ATE RMSE on TUM RGBD dynamic dataset (Unit: m).
Sequence | ORB-SLAM3 | MWSLAM | MGD-SLAM-nD | MGD-SLAM
w_half | 0.427 | 0.293 | 0.340 | 0.022
w_rpy | 0.829 | 0.164 | 0.148 | 0.038
w_static | 0.144 | 0.019 | 0.020 | 0.007
w_xyz | 0.501 | 0.291 | 0.279 | 0.013
Avg. | 0.475 | 0.192 | 0.197 | 0.020
Bold values indicate the best performance.
Table 5. Comparison of RPE RMSE on TUM RGBD dynamic dataset (Unit: rad).
Sequence | ORB-SLAM3 | MWSLAM | MGD-SLAM-nD | MGD-SLAM
w_half | 0.521 | 1.268 | 0.705 | 0.424
w_rpy | 0.641 | 0.878 | 0.832 | 0.595
w_static | 0.349 | 0.585 | 0.551 | 0.184
w_xyz | 0.551 | 0.813 | 0.612 | 0.396
Avg. | 0.516 | 0.886 | 0.675 | 0.400
Bold values indicate the best performance.
Table 6. Average runtime of main modules in MGD-SLAM on the TUM RGB-D dataset.
Module | Average Time (ms)
Multi-granularity feature extraction | 31
Dynamic feature removal | 9
Semantic detection (asynchronous) | 12
Total tracking time | 46