2.1. Traditional SLAM
Simultaneous Localization and Mapping (SLAM) has been a foundational problem in robotics and computer vision for over two decades. Traditional SLAM methods assume a static environment and rely on geometric constraints to jointly estimate the camera trajectory and map the scene. Classical systems can be broadly categorized into feature-based and direct methods. Feature-based SLAM, exemplified by ORB-SLAM [20] and ORB-SLAM2 [21], detects and matches sparse visual features across frames, followed by bundle adjustment to optimize poses and landmarks. These systems are robust to viewpoint changes and efficient in well-textured environments.
On the other hand, direct methods such as LSD-SLAM [22] and DVO-SLAM [23] bypass feature extraction and minimize photometric error directly over pixel intensities. This allows them to operate in low-texture regions and under motion blur, making them suitable for semi-dense or dense tracking and mapping. DVO-SLAM further utilizes depth information and dense alignment, improving robustness in indoor environments where depth sensors are available.
Despite their success, most early SLAM systems were designed under the static-world assumption, where observations are treated as if they originate from a rigid, unchanging world. This assumption breaks down in real-world scenarios, where dynamic elements such as humans, vehicles, or animals are ubiquitous. These moving entities introduce inconsistent observations, causing tracking drift, erroneous loop closures, and corrupted map points. Since traditional SLAM lacks mechanisms for motion segmentation or semantic reasoning, it often misinterprets dynamic objects as part of the static scene.
Moreover, these methods rely purely on geometric consistency and ignore high-level semantics or learned priors. As a result, they perform poorly in visually ambiguous environments (e.g., textureless surfaces, repetitive patterns), and their maps are often sparse or semi-dense, limiting their utility in tasks requiring high-fidelity reconstructions or 3D understanding.
Monocular SLAM systems face an additional challenge: scale ambiguity. Without depth input, monocular systems can only recover structure up to an unknown scale. Techniques like scale-aware initialization or loop closure can alleviate this issue to some extent, but scale drift remains a persistent problem, particularly in long trajectories or dynamic scenes.
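To make the ambiguity concrete, perspective projection is invariant to a global scaling of the scene: for a camera pose $(R, t)$, a 3D point $X$, and any factor $s > 0$,
$$\pi(RX + t) \;=\; \pi\big(s(RX + t)\big) \;=\; \pi\big(R(sX) + st\big),$$
so a monocular reconstruction is recoverable only up to an unknown global scale $s$, and local errors in estimating $s$ accumulate along the trajectory as scale drift. (This is the standard gauge-freedom argument, not a derivation specific to any of the systems cited above.)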
To support diverse applications, extensions of traditional SLAM have incorporated various sensors and modules. For instance, ORB-SLAM3 [24] unifies monocular, stereo, and RGB-D inputs in a single framework and supports visual–inertial fusion. However, its performance in dynamic environments remains limited due to the absence of motion modeling or dynamic filtering.
Traditional SLAM pipelines such as ORB-SLAM and LSD-SLAM were originally designed under the static-scene assumption, which simplifies optimization but limits robustness in highly dynamic environments. Nevertheless, several extensions of traditional SLAM have been proposed to explicitly handle dynamic objects; DynaSLAM [7], for example, integrates semantic segmentation and motion-consistency checks to filter out dynamic regions. Rather than drawing a strict opposition between “traditional” and “dynamic” SLAM, we therefore adopt the terminology here to emphasize whether a method explicitly models dynamic objects or relies on the static-world assumption.
2.2. Dynamic SLAM
Dynamic SLAM aims to improve robustness in real-world environments by addressing the challenges posed by moving objects and non-rigid deformations. Traditional SLAM methods often fail in such scenarios due to the static-world assumption. To mitigate this issue, dynamic SLAM systems introduce strategies for detecting, filtering, or modeling dynamic regions. Early approaches primarily rely on semantic segmentation to identify and exclude potentially moving entities from the SLAM pipeline.
DynaSLAM is one of the earliest and most influential works in this direction, integrating Mask R-CNN into the ORB-SLAM2 framework to mask out dynamic objects such as humans and vehicles. By combining semantic segmentation with multi-view geometry, DynaSLAM identifies both a priori dynamic objects and those detected through scene inconsistency. Detect-SLAM [25] follows a similar idea but adopts lightweight object detectors for real-time performance, offering a better trade-off between speed and accuracy. These systems significantly reduce the negative impact of moving objects on pose estimation and map construction by filtering out segmented regions. However, their effectiveness depends heavily on the accuracy of the semantic segmentation network and on the set of predefined dynamic classes. If a moving object does not belong to a segmentation class or is incorrectly segmented, it may introduce severe tracking errors or corrupt the generated map.
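As a rough illustration of the mask-based filtering idea shared by these systems, the sketch below discards feature points that fall on a priori dynamic segments. The class list, label-map format, and helper names are illustrative assumptions, not details of DynaSLAM or Detect-SLAM.

```python
import numpy as np

# Classes treated as a priori dynamic (illustrative choice, not the exact
# list used by DynaSLAM or Detect-SLAM).
DYNAMIC_CLASSES = {"person", "car", "bicycle", "dog"}

def filter_dynamic_keypoints(keypoints, seg_labels, id_to_name):
    """Discard keypoints that fall on a priori dynamic segments.

    keypoints  : (N, 2) array of (u, v) pixel coordinates
    seg_labels : (H, W) integer label map from any semantic segmentation network
    id_to_name : dict mapping label id -> class name
    """
    kept = []
    for u, v in keypoints.astype(int):
        label = int(seg_labels[v, u])
        if id_to_name.get(label, "static") not in DYNAMIC_CLASSES:
            kept.append((u, v))
    return np.asarray(kept)
```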
Beyond semantic filtering, geometry-based dynamic SLAM systems utilize motion cues derived from scene flow, frame-to-frame residuals, or optical flow fields. DS-SLAM [9] integrates optical flow and feature tracking to separate dynamic and static components in the input frames. By computing the inconsistency between tracked features and optical flow predictions, it effectively filters out dynamic keypoints and maintains reliable tracking. Co-Fusion [26] employs real-time RGB-D segmentation to update the global map with only static elements while retaining dynamic segments in a separate object-level representation. These methods demonstrate improved adaptability in dynamic environments, particularly when semantic priors are unavailable or unreliable. However, a common limitation is their tendency to discard large amounts of image data, including occluded static background, resulting in sparse reconstructions and reduced map completeness.
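The moving-consistency idea behind such geometric checks can be sketched as follows: matches whose points lie far from the epipolar line predicted by the dominant (static) motion are flagged as dynamic. The function names and thresholds below are placeholders, and this is a simplification of what DS-SLAM actually implements.

```python
import numpy as np
import cv2

def flag_dynamic_by_epipolar(pts_prev, pts_curr, thresh_px=1.0):
    """Flag correspondences that violate the epipolar constraint.

    Estimates a fundamental matrix with RANSAC from two sets of matched
    points (float32 arrays of shape (N, 2)) and marks as 'dynamic' every
    match whose point-to-epipolar-line distance exceeds thresh_px pixels.
    """
    F, _ = cv2.findFundamentalMat(pts_prev, pts_curr, cv2.FM_RANSAC, 1.0, 0.99)
    if F is None or F.shape != (3, 3):
        return np.zeros(len(pts_prev), dtype=bool)

    ones = np.ones((len(pts_prev), 1))
    p1 = np.hstack([pts_prev, ones])          # homogeneous points, frame t-1
    p2 = np.hstack([pts_curr, ones])          # homogeneous points, frame t
    lines = (F @ p1.T).T                      # epipolar lines l = F x1 in frame t
    num = np.abs(np.sum(lines * p2, axis=1))  # |x2^T F x1|
    den = np.sqrt(lines[:, 0] ** 2 + lines[:, 1] ** 2) + 1e-12
    return (num / den) > thresh_px            # True -> likely dynamic point
```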
Recently, several neural implicit SLAM approaches have emerged that directly incorporate dynamic modeling into end-to-end optimization frameworks. Methods such as DN-SLAM [16], NID-SLAM [17], and DDN-SLAM [18] use neural signed distance functions (SDFs) or neural volumetric fields to jointly optimize scene structure and camera poses via differentiable rendering. RoDyn-SLAM [19] introduces ray-based residual modeling and jointly estimates dynamic masks and depth using motion segmentation and photometric error feedback. These methods exhibit strong reconstruction potential in dynamic scenes and can adaptively refine the scene geometry. However, the reliance on ray marching and MLP inference incurs significant computational overhead. Moreover, many of them retain a loosely coupled architecture that separates tracking from mapping updates, which limits the system’s ability to respond promptly to dynamic changes.
DG-SLAM [27] represents a more recent attempt to integrate 3D Gaussian Splatting into real-time SLAM for dynamic scenes. It introduces Gaussian-level dynamic labeling based on temporal consistency and classifies each Gaussian as static or dynamic. However, its binary labeling is deterministic and provides no mechanism for refining labels over time.
To overcome these limitations, several recent studies have explored integrating probabilistic inference and temporal fusion into dynamic SLAM. For example, approaches inspired by Bayesian filtering maintain belief distributions over the dynamic state of map elements. Rather than assigning hard labels to features, these methods accumulate multi-view evidence over time, yielding more robust decisions about whether a point belongs to a dynamic object. Such strategies reduce the risk of misclassification caused by transient occlusions or partial observations. However, incorporating probabilistic models into high-speed SLAM pipelines introduces additional computational challenges, especially in terms of maintaining temporal coherence and integrating evidence in a scalable manner.
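A minimal sketch of such a recursive update is the standard binary Bayes filter in log-odds form; the prior, clamping bounds, and interface below are illustrative assumptions rather than details of any specific system.

```python
import numpy as np

def logit(p):
    return np.log(p / (1.0 - p))

class DynamicBelief:
    """Per-map-element belief that the element is dynamic (log-odds form)."""

    def __init__(self, prior=0.2, p_min=0.01, p_max=0.99):
        self.l_prior = logit(prior)
        self.l = self.l_prior
        self.l_min, self.l_max = logit(p_min), logit(p_max)

    def update(self, p_obs):
        # Binary Bayes filter update in log-odds space: add the observation's
        # log-odds and subtract the prior, so evidence accumulates over views
        # and a single noisy detection cannot flip the label.
        self.l += logit(float(np.clip(p_obs, 0.01, 0.99))) - self.l_prior
        self.l = float(np.clip(self.l, self.l_min, self.l_max))

    @property
    def probability(self):
        return 1.0 / (1.0 + np.exp(-self.l))
```

Each per-frame observation, e.g. a segmentation or motion-consistency score in [0, 1], nudges the belief, so the decision about whether an element is dynamic reflects accumulated multi-view evidence rather than a single view.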
Overall, current dynamic SLAM methods can be broadly divided into two categories: segmentation-based filtering methods and reconstruction-centric implicit methods. The former ensures tracking stability by eliminating dynamic content but at the cost of map completeness. The latter provides high-fidelity map reconstructions but suffers from slow convergence, high memory usage, and limited real-time applicability. Thus, designing an efficient and robust dynamic SLAM framework remains an open and critical challenge, particularly for applications in autonomous driving, augmented reality, and human–robot interaction in unconstrained environments.
2.3. Three-Dimensional Gaussian Splatting and NeRF-Based SLAM
Recent advances in neural scene representation have sparked significant interest in integrating implicit and explicit neural rendering models into SLAM pipelines. Neural Radiance Fields (NeRF) [28] opened up new possibilities for photorealistic view synthesis, which inspired many dense SLAM frameworks such as ESLAM [29], Co-SLAM [30], iMAP [31], NICE-SLAM [32], Vox-Fusion [33], and Point-SLAM [34]. These methods jointly optimize camera poses and scene geometry using ray-based volumetric rendering. However, due to the high computational cost of ray sampling and MLP inference, they often require depth priors or render only a sparse subset of pixels for tractability, limiting their generalization and real-time performance.
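Concretely, these systems render a pixel color by the standard volume-rendering quadrature along each ray $\mathbf{r}$,
$$\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \big(1 - e^{-\sigma_i \delta_i}\big)\,\mathbf{c}_i, \qquad T_i = \exp\!\Big(-\sum_{j<i} \sigma_j \delta_j\Big),$$
where $\sigma_i$ and $\mathbf{c}_i$ are the density and color predicted at the $i$-th sample and $\delta_i$ is the spacing between samples. Every rendered pixel therefore requires on the order of $N$ network queries, which is the cost that the sparse-pixel and depth-prior strategies mentioned above try to amortize.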
Beyond single-agent settings, several works extend neural SLAM toward collaborative scenarios. MCN-SLAM [35] enables distributed agents to share scene reconstructions with low communication overhead, while MNE-SLAM [36] achieves real-time multi-agent mapping through cross-agent feature alignment. PLG-SLAM [37] improves global consistency via progressive bundle adjustment, and NeSLAM [38] enhances stability through depth completion and denoising. Although designed for multi-robot collaboration rather than dynamic scenes, these studies underline that scalability and robustness in SLAM benefit from probabilistic and modular designs, an idea also reflected in our BDGS-SLAM framework.
In contrast to NeRF-based pipelines, 3D Gaussian Splatting (3DGS) [39] provides an efficient and differentiable rendering mechanism using explicit anisotropic Gaussians. GS-SLAM [1] first introduced 3DGS into SLAM, enabling RGB-D mapping with adaptive Gaussian expansion. Photo-SLAM [2] extended the framework to various camera modalities and improved map fidelity with geometry-based densification and pyramid learning. MonoGS [4] explored monocular SLAM with 3DGS, addressing scale ambiguity using optical flow and priors. VPGS-SLAM [40] further scales 3DGS-SLAM to large scenes using a voxel-based progressive optimization strategy, balancing reconstruction fidelity and computational efficiency in outdoor and multi-room indoor environments.
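The efficiency gain stems from the rendering model itself: instead of per-ray network queries, 3DGS rasterizes the projected Gaussians and alpha-blends them front to back,
$$C = \sum_{i \in \mathcal{N}} \mathbf{c}_i\,\alpha_i \prod_{j=1}^{i-1}\big(1 - \alpha_j\big),$$
where $\mathcal{N}$ is the set of depth-sorted Gaussians overlapping the pixel and $\alpha_i$ combines the learned opacity with the projected 2D Gaussian falloff. Because the sum runs over explicit primitives, per-Gaussian attributes such as dynamic labels can be attached directly.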
SplaTAM [6] proposed a real-time pipeline that jointly optimizes camera poses and Gaussian parameters from RGB-D input. Its global optimization strategy significantly mitigates cumulative drift. DG-SLAM [27] further incorporated dynamic label optimization for each Gaussian, distinguishing dynamic and static content based on temporal consistency. However, it primarily relies on single-frame analysis and lacks probabilistic temporal fusion.
Compared to MLP-based NeRF SLAM methods, 3DGS-based SLAM systems offer superior runtime efficiency, differentiable rasterization, and high visual fidelity at lower computational cost. The explicit Gaussian parameterization also allows easier integration with classical geometry-based methods and opens possibilities for semantic labeling, uncertainty modeling, and instance-level manipulation. Recent works such as DeRainGS [41] and LLGS [42] further showcase the potential of 3DGS in challenging environments like rain or extreme darkness, highlighting its versatility across visual degradation scenarios. However, 3DGS-based methods still face several challenges, particularly in dynamic environments where frame-to-frame inconsistency can lead to degraded optimization and fragmented map updates.
Most current 3DGS-SLAM approaches focus on optimizing static scenes. Although some frameworks introduce heuristic segmentation or per-frame classification to filter out dynamic content, they often fail to maintain temporal consistency or probabilistic confidence across views. This leads to unstable performance in real-world scenarios involving occlusions, articulated motion, or partial rigidity. Moreover, the lack of online dynamic reasoning severely limits their applicability in robotics, AR, and autonomous systems where live interaction with changing environments is required. Several recent methods, such as Dy3DGS-SLAM [43] and GARAD-SLAM [44], attempt to mitigate these limitations by introducing dynamic segmentation and anti-dynamic filtering mechanisms, while DenseSplat [45] demonstrates the benefits of neural priors in densifying sparse Gaussian maps; yet these approaches still struggle with robust fusion of motion cues across time and views.
To address these limitations, we propose BDGS-SLAM, which improves robustness and preserves the integrity of static scene reconstruction in 3DGS-SLAM under dynamic environments, effectively suppressing dynamic artifacts and producing clean, high-quality maps. BDGS-SLAM introduces Bayesian filtering and multi-view probability updates to estimate dynamic probabilities, maps them onto Gaussians for label updating, and performs dynamic segmentation directly during rendering. This unified dynamic information flow improves tracking robustness and generates high-quality static scene maps in real time.
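As a conceptual illustration only (the names, fusion rule, and threshold below are ours and do not reflect the actual BDGS-SLAM implementation), per-pixel dynamic probabilities can be pushed onto the Gaussians they project to and then used to mask rendering:

```python
import numpy as np

def fuse_dynamic_probs(gauss_probs, gauss_px, prob_map, momentum=0.8):
    """Fuse a per-pixel dynamic-probability map into per-Gaussian labels.

    gauss_probs : (G,) current dynamic probability of each visible Gaussian
    gauss_px    : (G, 2) pixel coordinates where each visible Gaussian projects
                  (assumed to be provided by the rasterizer)
    prob_map    : (H, W) per-pixel dynamic probability for the current frame

    A simple exponential fusion is used here for brevity; a full system could
    substitute a recursive Bayesian update for each Gaussian instead.
    """
    px = gauss_px.astype(int)
    obs = prob_map[px[:, 1], px[:, 0]]
    return momentum * gauss_probs + (1.0 - momentum) * obs

def static_mask(gauss_probs, thresh=0.5):
    # Gaussians whose fused probability exceeds the threshold are excluded
    # from rasterization, keeping the rendered map free of dynamic content.
    return gauss_probs < thresh
```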
As 3DGS becomes more widely adopted, future SLAM frameworks may benefit from integrating probabilistic modeling, attention-based Gaussian filtering, and physically grounded motion priors. Such directions could further bridge the gap between explicit geometric SLAM and implicit neural modeling, enabling high-speed, high-fidelity, and dynamic-aware scene understanding.