3.1. System Overview
Our system comprises three main components: dynamic region segmentation, tracking, and mapping.
During the preprocessing stage, we apply DeepLab2 [31] to segment the scene and identify movable objects. Then, the RGB images, depth images, and corresponding instance masks are input into the system.
Based on the instance segmentation results, we perform multi-target tracking in a manner similar to ByteTrack [32]. We identify bounding boxes for potential moving objects and employ Kalman filters to predict their positions in the next frame. The Hungarian algorithm is used to associate the detected bounding boxes with the predicted ones, with Intersection over Union (IoU) and cosine similarity serving as similarity metrics for the matches. We create a Kalman tracker for each detected object, enabling the system to handle scenarios with multiple potential moving targets.
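To make this association step concrete, the sketch below shows how an IoU cost and a simple constant-velocity Kalman filter for a bounding box could be set up. The state layout, noise values, and the greedy matcher (a simplified stand-in for the Hungarian algorithm) are illustrative assumptions, not our exact implementation.

```cpp
#include <opencv2/core.hpp>
#include <opencv2/video/tracking.hpp>
#include <vector>

// Intersection over Union between two axis-aligned boxes.
static float boxIoU(const cv::Rect2f& a, const cv::Rect2f& b) {
    float inter = (a & b).area();
    float uni = a.area() + b.area() - inter;
    return uni > 0.f ? inter / uni : 0.f;
}

// Constant-velocity Kalman filter over (cx, cy, w, h, vx, vy):
// a simplified state layout; the actual tracker state may differ.
cv::KalmanFilter makeBoxKalman(const cv::Rect2f& box) {
    cv::KalmanFilter kf(6, 4, 0, CV_32F);
    cv::setIdentity(kf.transitionMatrix);
    kf.transitionMatrix.at<float>(0, 4) = 1.f;  // cx += vx
    kf.transitionMatrix.at<float>(1, 5) = 1.f;  // cy += vy
    kf.measurementMatrix = cv::Mat::eye(4, 6, CV_32F);
    cv::setIdentity(kf.processNoiseCov, cv::Scalar::all(1e-2));
    cv::setIdentity(kf.measurementNoiseCov, cv::Scalar::all(1e-1));
    kf.statePost = (cv::Mat_<float>(6, 1) << box.x + box.width / 2.f,
                    box.y + box.height / 2.f, box.width, box.height, 0.f, 0.f);
    return kf;
}

// Greedy IoU association (stand-in for the Hungarian algorithm): each
// detection is matched to the unmatched prediction with the highest IoU.
std::vector<int> associate(const std::vector<cv::Rect2f>& predicted,
                           const std::vector<cv::Rect2f>& detected,
                           float iouThresh = 0.3f) {
    std::vector<int> match(detected.size(), -1);
    std::vector<bool> used(predicted.size(), false);
    for (size_t d = 0; d < detected.size(); ++d) {
        float best = iouThresh; int bestIdx = -1;
        for (size_t p = 0; p < predicted.size(); ++p) {
            if (used[p]) continue;
            float iou = boxIoU(predicted[p], detected[d]);
            if (iou > best) { best = iou; bestIdx = static_cast<int>(p); }
        }
        if (bestIdx >= 0) { match[d] = bestIdx; used[bestIdx] = true; }
    }
    return match;
}
```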
For each successfully tracked potential moving object, we employ a hierarchical, coarse-to-fine strategy for motion analysis. This process involves two main stages: (1) Coarse Motion Analysis: We first perform an efficient check using sparse optical flow. Features are extracted in annular regions along the object’s contour, and their motion is analyzed via DBSCAN clustering to quickly classify targets as either clearly static or globally dynamic. (2) Fine-grained Motion Segmentation: For ambiguous objects that pass the initial screening (typically non-rigid bodies like humans), we then compute a dense optical flow field. Gaussian Mixture Models (GMMs) are subsequently used to cluster these dense flow features, enabling fine-grained segmentation of locally moving regions. By integrating coarse judgments and fine-grained segmentation, we can more accurately obtain masks for dynamic regions.
The tracking module then estimates the camera pose and constructs a sparse geometric map. This process uses only the filtered static feature points, selected to exclude dynamic elements that could compromise localization accuracy. Based on the camera pose estimates and the static feature points, the module constructs a sparse geometric map of the environment.
Meanwhile, our system adopts a hybrid map representation that combines geometric features with 3D Gaussians. On the one hand, it fully leverages the precise geometric information provided by feature map points to achieve rapid and accurate pose estimation. On the other hand, we design a 3D Gaussian optimization strategy based on a Gaussian pyramid and a 3D Gaussian densification algorithm based on geometric features to refine the 3D Gaussian parameters. By extracting static feature points with the dynamic region segmentation module, the system mitigates the impact of dynamic objects during the rendering process. This enables the creation of detailed static background maps, even in environments characterized by significant dynamic activity.
3.2. Dynamic Region Segmentation
The dynamic region segmentation module is crucial for the system to operate effectively in dynamic environments. Based on the results of instance segmentation and object tracking, we obtain matched object bounding boxes between consecutive frames. For each potential moving object, instead of directly applying computationally expensive methods, we propose a cascaded analysis pipeline that leverages both sparse and dense optical flow.
Sparse Optical Flow for Coarse Motion Analysis. As the first stage, we employ a lightweight, sparse optical flow method for an efficient coarse motion assessment. This approach, based on the Lucas–Kanade (LK) algorithm, tracks a limited set of high-quality feature points (corners) within annular regions surrounding the object’s contour. By clustering the resulting sparse motion vectors using DBSCAN and comparing the dominant motion against the background, we can rapidly filter out objects that are clearly static or undergoing simple, global translation. This pre-screening step significantly reduces the computational load by avoiding unnecessary dense analysis on every object.
Dense Optical Flow for Fine-Grained Motion Segmentation. For objects flagged as ambiguous by the sparse-flow stage (e.g., non-rigid bodies with complex internal movements), we proceed to a more detailed analysis using dense optical flow. Optical flow represents the instantaneous velocity of pixel movements between two frames, leveraging changes in pixel intensity to establish dense correspondences. This method generally relies on three fundamental assumptions: (1) brightness constancy, (2) small motion, and (3) spatial consistency. To precisely capture fine-grained local motion, this paper employs the RAFT (Recurrent All-Pairs Field Transforms) algorithm [33] for dense optical flow computation. Through an iterative optimization process, RAFT accurately estimates a dense field of optical flow vectors for each pixel within the object’s bounding box, providing a solid foundation for our subsequent segmentation modules.
The complete dynamic region segmentation module is thus divided into two main parts based on the outputs of our cascaded flow analysis: coarse object judgement and fine-grained motion segmentation.
3.2.1. Coarse Motion Judgement
Our cascaded analysis pipeline begins with an efficient global motion assessment, outlined in Algorithm 1, to rapidly identify objects that are either static or undergoing simple, uniform motion. Instead of the costly dense optical flow, this stage leverages a robust sparse optical flow approach combined with motion clustering.
First, for each tracked object, we generate annular masks representing regions immediately inside and outside its contour, as depicted in the left panel of Figure 4. This is achieved through morphological dilation and erosion of the object’s instance mask. These annular regions serve as precise masks for extracting a sparse set of high-quality feature points using the goodFeaturesToTrack algorithm. To enhance robustness, particularly for smaller or less-textured objects such as chairs, we dynamically adjust the feature extraction parameters based on the object’s class.
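The following sketch illustrates how the annular masks and ring-constrained corner extraction could be implemented with OpenCV. The ring width, corner count, and quality parameters are placeholder values, since in practice they are adapted per object class.

```cpp
#include <opencv2/imgproc.hpp>
#include <vector>

// Build inner/outer annular masks around an object's instance mask and
// extract sparse corners inside them. Kernel size and feature counts are
// illustrative; the actual parameters are adjusted per object class.
void extractRingFeatures(const cv::Mat& gray, const cv::Mat& instanceMask,
                         std::vector<cv::Point2f>& innerPts,
                         std::vector<cv::Point2f>& outerPts,
                         int ringWidth = 7, int maxCorners = 100) {
    cv::Mat kernel = cv::getStructuringElement(
        cv::MORPH_ELLIPSE, cv::Size(2 * ringWidth + 1, 2 * ringWidth + 1));
    cv::Mat dilated, eroded;
    cv::dilate(instanceMask, dilated, kernel);
    cv::erode(instanceMask, eroded, kernel);

    cv::Mat innerRing = instanceMask & ~eroded;  // band just inside the contour
    cv::Mat outerRing = dilated & ~instanceMask; // band just outside the contour

    // Shi-Tomasi corners restricted to each ring; quality and min-distance
    // values are examples and would be tuned per class (e.g. chairs vs. people).
    cv::goodFeaturesToTrack(gray, innerPts, maxCorners, 0.01, 5, innerRing);
    cv::goodFeaturesToTrack(gray, outerPts, maxCorners, 0.01, 5, outerRing);
}
```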
Subsequently, to track these sparse feature points between consecutive frames, we employ the GPU-accelerated Lucas–Kanade (LK) pyramidal optical flow algorithm (cv::cuda::SparsePyrLKOpticalFlow). In our implementation, we configure it with a 21 × 21 pixel search window, 3 pyramid levels, and a maximum of 30 iterations per level, a setup that ensures a strong balance between tracking accuracy and real-time performance. The resulting motion vectors from the inner ring are then fed into the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm. Using a neighborhood distance eps of 0.2 and a minimum of 5 points (min_samples), DBSCAN groups features into distinct motion clusters based on a composite distance metric that weights spatial proximity and flow-vector similarity. This allows the algorithm to automatically identify the dominant motion cluster within the object, effectively filtering out noise and minor local movements.
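For clarity, the sketch below shows a minimal CPU-side DBSCAN over the inner-ring flow vectors using a composite distance of this kind, mirroring the eps = 0.2 and min_samples = 5 setting. The weighting and normalization constants in flowDistance are illustrative assumptions rather than the tuned values used in our system.

```cpp
#include <opencv2/core.hpp>
#include <cmath>
#include <queue>
#include <vector>

// One tracked feature: its position and its optical-flow vector between frames.
struct FlowSample { cv::Point2f pos; cv::Point2f flow; };

// Composite distance mixing spatial proximity and flow-vector similarity.
// The weights and normalisation constants below are illustrative.
static float flowDistance(const FlowSample& a, const FlowSample& b,
                          float wSpace = 0.4f, float wFlow = 0.6f,
                          float spaceScale = 100.f, float flowScale = 10.f) {
    float dSpace = std::hypot(a.pos.x - b.pos.x, a.pos.y - b.pos.y) / spaceScale;
    float dFlow  = std::hypot(a.flow.x - b.flow.x, a.flow.y - b.flow.y) / flowScale;
    return wSpace * dSpace + wFlow * dFlow;
}

// Minimal DBSCAN over flow samples. Returns one label per sample
// (-1 = noise, 0..K-1 = cluster id).
std::vector<int> dbscanFlow(const std::vector<FlowSample>& pts,
                            float eps = 0.2f, int minSamples = 5) {
    const int n = static_cast<int>(pts.size());
    std::vector<int> label(n, -2);               // -2 = not yet visited
    auto neighbours = [&](int i) {
        std::vector<int> out;
        for (int j = 0; j < n; ++j)
            if (flowDistance(pts[i], pts[j]) <= eps) out.push_back(j);
        return out;
    };
    int cluster = 0;
    for (int i = 0; i < n; ++i) {
        if (label[i] != -2) continue;
        std::vector<int> nb = neighbours(i);
        if (static_cast<int>(nb.size()) < minSamples) { label[i] = -1; continue; }
        label[i] = cluster;
        std::queue<int> frontier;
        for (int j : nb) frontier.push(j);
        while (!frontier.empty()) {
            int q = frontier.front(); frontier.pop();
            if (label[q] == -1) label[q] = cluster;  // noise becomes border point
            if (label[q] != -2) continue;
            label[q] = cluster;
            std::vector<int> nb2 = neighbours(q);
            if (static_cast<int>(nb2.size()) >= minSamples)
                for (int j : nb2) frontier.push(j);  // expand only core points
        }
        ++cluster;
    }
    return label;
}
```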
The final decision is made through a hierarchical voting process. First, we identify the largest motion cluster from DBSCAN as the object’s dominant motion. If this cluster’s size relative to the total number of tracked points is below a threshold, which we set to 0.5, the object’s motion is deemed chaotic and immediately flagged for dense analysis. This value ensures that a "dominant" motion is shared by at least a majority of points; a lower value would risk misinterpreting noisy motion as coherent, while a higher value is too strict for non-rigid objects with complex movements. Otherwise, the representative motion of this dominant cluster, calculated as the median of its flow vectors for robustness against outliers, is compared against the median motion of the background points from the outer ring. If the difference in magnitude or angle exceeds predefined motion thresholds, the object is classified as globally dynamic. These thresholds (e.g., 2.0 pixels in magnitude, 15 degrees in angle) are critical for distinguishing true object movement from apparent motion caused by camera ego-motion. Stricter values could cause false positives on static objects due to minor noise, whereas more lenient values might fail to detect subtle but independent object movements. If the dominant motion is consistent with the background, we then check the ratio of non-dominant points (other clusters and noise). If this outlier ratio is significant, determined by a threshold set to 0.15, particularly for non-rigid objects such as a ’person’, the object is flagged for a more detailed local motion analysis using dense optical flow. This value is tuned to capture meaningful local dynamics, such as a waving arm, without being overly sensitive to minor tracking noise. A lower value would trigger the dense analysis too frequently, while a higher one could miss important non-rigid movements. This cascaded approach ensures that the computationally expensive dense flow is only invoked when absolutely necessary. This methodology is inherently more robust than methods that rely on a single, averaged motion vector. By segmenting motion into clusters, our system can identify the true dominant motion without being skewed by a minority of outlier points or localized movements, thus avoiding the common pitfalls of mean-based analysis.
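The voting logic described above can be summarized by the following sketch. The enum names and helper structure are illustrative, while the thresholds mirror the values quoted in the text (0.5 dominance, 2.0 px / 15° against the background, 0.15 outlier ratio); the chaotic case is treated as a trigger for dense analysis, as described above.

```cpp
#include <opencv2/core.hpp>
#include <algorithm>
#include <cmath>
#include <map>
#include <vector>

enum class MotionState { STATIC, DYNAMIC, NON_RIGID, UNKNOWN };

// Decide an object's coarse motion state from per-point DBSCAN labels and the
// corresponding inner-ring flow vectors; bgMedianFlow is the outer-ring median.
MotionState voteMotionState(const std::vector<int>& labels,
                            const std::vector<cv::Point2f>& flows,
                            const cv::Point2f& bgMedianFlow,
                            bool isNonRigidClass,
                            float dominanceThresh = 0.5f,
                            float magThresh = 2.0f, float angThreshDeg = 15.f,
                            float outlierThresh = 0.15f) {
    // 1. Find the dominant cluster and its share of all tracked points.
    std::map<int, int> counts;
    for (int l : labels) if (l >= 0) ++counts[l];
    if (counts.empty()) return MotionState::UNKNOWN;
    auto dom = std::max_element(counts.begin(), counts.end(),
        [](const auto& a, const auto& b) { return a.second < b.second; });
    float dominance = static_cast<float>(dom->second) / labels.size();
    if (dominance < dominanceThresh) return MotionState::NON_RIGID;  // chaotic

    // 2. Median flow of the dominant cluster (per-axis median for robustness).
    std::vector<float> xs, ys;
    for (size_t i = 0; i < labels.size(); ++i)
        if (labels[i] == dom->first) { xs.push_back(flows[i].x); ys.push_back(flows[i].y); }
    auto median = [](std::vector<float>& v) {
        std::nth_element(v.begin(), v.begin() + v.size() / 2, v.end());
        return v[v.size() / 2];
    };
    cv::Point2f domFlow(median(xs), median(ys));

    // 3. Compare against the background (outer-ring) median flow.
    float magDiff = std::abs(std::hypot(domFlow.x, domFlow.y) -
                             std::hypot(bgMedianFlow.x, bgMedianFlow.y));
    float angDiff = static_cast<float>(std::abs(
        std::atan2(domFlow.y, domFlow.x) -
        std::atan2(bgMedianFlow.y, bgMedianFlow.x)) * 180.0 / CV_PI);
    if (angDiff > 180.f) angDiff = 360.f - angDiff;
    if (magDiff > magThresh || angDiff > angThreshDeg) return MotionState::DYNAMIC;

    // 4. Dominant motion matches the background: check the outlier ratio.
    float outliers = 1.f - dominance;
    if (outliers > outlierThresh && isNonRigidClass) return MotionState::NON_RIGID;
    return MotionState::STATIC;
}
```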
The right panel of Figure 4 visualizes the clustering results from DBSCAN, effectively illustrating the capability of our motion segmentation approach. In this example, the feature points within the inner ring of the object are successfully partitioned into two dominant motion clusters: an upper-body cluster exhibiting motion (magenta points) and a static lower-body cluster (yellow points). This segmentation accurately captures the semantics of the action: a person bending over. Since the ratio of static points is significant, the system correctly identifies this as a case of complex, non-rigid motion in which only local parts of the object are moving. Consequently, it flags the object for fine-grained analysis using dense optical flow rather than misclassifying it as simple global movement. Based on the aforementioned process, we are able to reliably identify targets that exhibit significant motion amplitudes.
Figure 5 provides a clear visualization of the region segmentation results for the global judgement of dynamic objects within the image.
3.2.2. Fine-Grained Motion Segmentation
Our coarse-grained motion analysis, based on sparse optical flow and DBSCAN clustering, excels at efficiently identifying objects that are either clearly static or undergoing simple, global motion. However, this holistic judgment is insufficient for complex scenarios, such as a non-rigid body exhibiting localized movements (e.g., a seated person waving an arm). In this case, the dominant motion cluster (the torso) remains static relative to the background, while a significant secondary motion cluster (the arm) exists. Our coarse stage is designed specifically to detect this ambiguity. When an object, particularly a ’person’, is found to have a dominant static cluster but also a notable ratio of outlier or secondary motion points (as determined by the logic in Algorithm 1, which returns a NON_RIGID state), the framework transitions to a fine-grained analysis to resolve the uncertainty.
Algorithm 1 Robust Coarse Judgement via Sparse Flow and Clustering
Input: Current image I_t, previous image I_{t-1}, object instance mask M, tracked object O
Output: Motion state S ∈ {STATIC, DYNAMIC, NON_RIGID, UNKNOWN}
1: R_in, R_out ← annular regions just inside/outside the contour of M (morphological erosion/dilation)
2: P_in, P_out ← sparse corner features extracted within R_in and R_out
3: F_in, F_out ← flow vectors of P_in, P_out tracked from I_{t-1} to I_t with pyramidal LK
4: N ← number of successfully tracked inner-ring points
5: if N < N_min then
6:   return UNKNOWN {not enough points for a reliable judgement}
7: end if
8: C ← DBSCAN(F_in) with the composite spatial/flow distance (eps = 0.2, min_samples = 5)
9: C_dom ← largest cluster in C
10: r_dom ← |C_dom| / N
11: f_dom ← median flow vector of C_dom
12: if r_dom < τ_dom then
13:   return NON_RIGID {chaotic motion, flag for dense analysis}
14: end if
15: f_bg ← median flow vector of F_out
16: Δmag ← | ‖f_dom‖ − ‖f_bg‖ |, Δang ← ∠(f_dom, f_bg)
17: if Δmag > τ_mag or Δang > τ_ang then
18:   return DYNAMIC {global translation/rotation}
19: else
20:   r_out ← ratio of noise and non-dominant points, 1 − r_dom
21:   if r_out > τ_out and O belongs to a non-rigid class then
22:     return NON_RIGID {local motion detected}
23:   else
24:     return STATIC {globally stationary}
25:   end if
26: end if
This fine-grained stage is selectively activated only for these ambiguous objects, ensuring that computational resources are used efficiently. It begins by computing a dense optical flow field within the object’s bounding box using the RAFT [33] pretrained model (raft_things), which provides a complete, pixel-wise motion field. To robustly segment the locally moving parts, we model the distribution of these dense flow vectors using a Gaussian Mixture Model (GMM). The optical flow vectors (characterized by magnitude and direction) are first normalized to be invariant to lighting and motion speed. Then, the EM algorithm is employed to fit a two-component GMM (clustersNumber = 2) with a spherical covariance matrix type (EM::COV_MAT_SPHERICAL), set to terminate after 100 iterations or when the log-likelihood gain is less than 0.1. This process clusters the pixels into distinct motion patterns, allowing us to precisely segment the object’s moving components (e.g., the waving arm) from its static parts (e.g., the torso), enabling a more accurate exclusion of dynamic features from the SLAM process.
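A compact version of this dense-flow clustering step, assuming the flow field is available as an OpenCV CV_32FC2 matrix, might look as follows. The per-channel min–max normalization and the rule of taking the smaller cluster as the moving part are simplifying assumptions, not the exact normalization used in our system.

```cpp
#include <opencv2/core.hpp>
#include <opencv2/ml.hpp>
#include <cmath>

// Cluster a dense flow field (CV_32FC2, e.g. produced by RAFT) inside an
// object's bounding box into two motion components with a spherical-covariance
// GMM, mirroring the EM settings quoted in the text. Returns a CV_8U mask of
// the smaller component, taken here as the locally moving part (an assumption).
cv::Mat segmentLocalMotion(const cv::Mat& flow /* CV_32FC2, bbox-sized */) {
    // One (magnitude, angle) sample per pixel.
    cv::Mat samples(flow.rows * flow.cols, 2, CV_32F);
    for (int y = 0, i = 0; y < flow.rows; ++y)
        for (int x = 0; x < flow.cols; ++x, ++i) {
            const cv::Vec2f f = flow.at<cv::Vec2f>(y, x);
            samples.at<float>(i, 0) = std::hypot(f[0], f[1]);  // speed
            samples.at<float>(i, 1) = std::atan2(f[1], f[0]);  // direction
        }
    // Per-channel min-max normalisation (an illustrative normalisation scheme).
    for (int c = 0; c < 2; ++c) {
        cv::Mat col = samples.col(c);
        cv::normalize(col, col, 0.0, 1.0, cv::NORM_MINMAX);
    }

    // Two-component GMM fitted with EM: spherical covariance, 100 iterations
    // or log-likelihood gain below 0.1, as described in the text.
    cv::Ptr<cv::ml::EM> em = cv::ml::EM::create();
    em->setClustersNumber(2);
    em->setCovarianceMatrixType(cv::ml::EM::COV_MAT_SPHERICAL);
    em->setTermCriteria(cv::TermCriteria(
        cv::TermCriteria::COUNT + cv::TermCriteria::EPS, 100, 0.1));
    cv::Mat labels;
    em->trainEM(samples, cv::noArray(), labels, cv::noArray());

    // Assume the smaller cluster corresponds to the locally moving region.
    const int ones = cv::countNonZero(labels);
    const int movingLabel = (2 * ones < labels.rows) ? 1 : 0;
    cv::Mat mask(flow.size(), CV_8U);
    for (int y = 0, i = 0; y < flow.rows; ++y)
        for (int x = 0; x < flow.cols; ++x, ++i)
            mask.at<uchar>(y, x) = (labels.at<int>(i) == movingLabel) ? 255 : 0;
    return mask;
}
```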
This paper assumes that the optical flows of both dynamic and static objects follow Gaussian distributions. The combination of optical flows from these two types of objects forms a sample set, which can be regarded as a linear combination of Gaussian distributions. By clustering the optical flow features using a Gaussian Mixture Model, the samples of optical flow features can be divided into multiple sets, each following an independent Gaussian distribution. The Gaussian Mixture Model is given by

p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k),

where π_k is the mixing coefficient of each component, N represents the probability density function of the two-dimensional optical flow features, μ_k is the mean vector of each Gaussian component, and Σ_k is the covariance matrix of each Gaussian component.
To determine the category to which each optical flow feature belongs, the Expectation–Maximization (EM) algorithm is employed to perform maximum likelihood estimation of the parameters of the GMM. The EM algorithm typically consists of two steps:
E-step: Given the current parameters, calculate the posterior probability that each data point belongs to each Gaussian component:

\gamma(z_{nk}) = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)},

where x_n denotes the nth sample in X and k corresponds to the kth Gaussian component.
M-step: Re-estimate the parameters based on the obtained posterior probabilities and update the model:

\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n, \quad \Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k)(x_n - \mu_k)^{\top}, \quad \pi_k = \frac{N_k}{N}, \quad N_k = \sum_{n=1}^{N} \gamma(z_{nk}),

where \ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) is the log-likelihood function of the GMM. Repeat the EM algorithm until the change in the log-likelihood function is less than a set threshold or the maximum number of iterations is reached.
Figure 6 presents the visualization results of fine-grained segmentation for some scenes in the TUM and Bonn datasets [34]. Since the algorithm in this section calculates optical flow and makes motion judgments for individual bounding boxes, the size of each group of images varies with the actual target, leading to the inconsistent sizes shown in the figure. For each group of images, the left side shows the optical flow visualization based on combinations of color and brightness, where different colors represent different motion directions and the brightness reflects the speed of motion. The right side of each group highlights the identification and segmentation results of local motion regions in the image. Our method effectively captures complex, non-rigid body movements (TUM_walking_static), partial subjects (Bonn_crowd2), and walking persons (Bonn_person_tracking), demonstrating its robustness and accuracy.
However, we acknowledge that the performance of our fine-grained segmentation is linked to the accuracy of the underlying dense optical flow estimation from RAFT. Potential failure cases arise in scenarios where the fundamental assumptions of optical flow are violated. For instance, on large, textureless surfaces (e.g., a monochromatic T-shirt), RAFT may struggle to produce reliable flow vectors, potentially leading to incomplete or hollow dynamic masks where moving regions are missed. Similarly, abrupt illumination changes or severe motion blur can violate the brightness constancy assumption, which may introduce noisy flow vectors and result in small, falsely identified dynamic regions. Recognizing these limitations is crucial for understanding the operational envelope of the proposed method.
By employing fine-grained segmentation based on dense optical flow clustering, we obtain pixel-level identification of the object’s moving parts, achieving genuine detection of local object motion.
3.3. Three-Dimensional Gaussian Splatting
Maps serve as a bridge for mobile robots to interact with their environment. Accurate 3D reconstructed maps can help them better understand scenes and provide necessary information for subsequent tasks such as navigation and obstacle avoidance. Currently, the maps created by mainstream and mature visual SLAM systems are sparse and only used for robot localization. Traditional dense visual SLAM systems suffer from poor reconstruction quality and low detail clarity, while SLAM systems based on NeRF have long rendering times and do not meet real-time requirements. Therefore, this paper adopts a hybrid map representation based on geometric features and 3D Gaussians, combining ORB geometric features from visual SLAM with 3D Gaussians. Geometric feature points are used to estimate pose, while an efficient training strategy is designed to optimize the parameters of the 3D Gaussians.
The hybrid map consists of two types of map points: geometric feature points, which are used both for localization and for training the 3D Gaussians, and 3D Gaussian points, which contain rich spatial information. The geometric feature points are primarily derived from the 3D map points in the geometric map generated by visual SLAM, inheriting both the properties of ORB feature points and the characteristics of 3D Gaussians. Since SLAM systems excel at localization, the information of geometric feature points is mainly updated during the optimization of the geometric map and is synchronously updated in the hybrid map. The 3D Gaussian points, which do not carry ORB feature-related attributes, are mainly generated through densification algorithms and are used to fill in areas that cannot be covered by the sparse geometric feature points. Three-dimensional Gaussians possess excellent scene representation capabilities, playing a crucial role in precise scene fitting and high-accuracy three-dimensional reconstruction.
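Conceptually, the two point types of the hybrid map could be organized as in the following sketch. The field names and types are illustrative and omit bookkeeping such as keyframe observations and higher-order spherical harmonics.

```cpp
#include <opencv2/core.hpp>

// Attributes shared by every renderable Gaussian in the hybrid map.
struct GaussianAttributes {
    cv::Vec3f position;      // 3D center (mean) of the Gaussian
    cv::Matx33f covariance;  // 3D covariance (anisotropic scale + rotation)
    float opacity;           // alpha used during splatting
    cv::Vec3f color;         // base color / degree-0 spherical harmonics
};

// Geometric feature point: inherits ORB attributes from the SLAM map point
// and also carries Gaussian attributes so it can participate in rendering.
struct GeometricFeaturePoint {
    cv::Vec3f position;        // triangulated 3D map point, updated by BA
    cv::Mat orbDescriptor;     // 1 x 32 CV_8U ORB descriptor
    GaussianAttributes splat;  // synchronised copy used by the renderer
};

// Pure Gaussian point: created by densification, without ORB attributes.
struct DensifiedGaussianPoint {
    GaussianAttributes splat;
};
```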
The system is mainly divided into two threads: localization and mapping. Initially, the camera pose, consisting of rotation R and translation t, is estimated from consecutive image frames. Subsequently, a sparse map composed of keyframes and map points is constructed by the geometric mapping thread. Based on the keyframes and their corresponding map points in the sparse geometric map, a hybrid map is incrementally created. The hybrid map employs explicit 3D Gaussians as the basic units for rendering, each containing parameters such as a center position μ, an opacity α, a 3D covariance matrix Σ, and a color c. During the 3D Gaussian Splatting rendering process, the system projects the 3D Gaussians in space onto a pixel-based image plane, sorts these Gaussian projections, and computes the color value for each pixel.
Rendering a keyframe at a specific pose {R, t} involves calculating the contribution of all 3D Gaussian points in the hybrid map to the pixels. The 3D Gaussians in the hybrid map are projected onto the two-dimensional image plane, and the projected 2D Gaussians are sorted by depth to ensure correct handling of occlusion relationships in subsequent rendering. The final color of each pixel is then calculated using the following alpha compositing formula:

C = \sum_{i \in N} c_i \alpha_i \prod_{j=1}^{i-1} (1 - \alpha_j),

where N denotes the number of 3D Gaussians in the map, c_i represents the color values output by spherical harmonics, and α_i denotes the opacity.
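The per-pixel compositing can be illustrated with the following sketch, which accumulates depth-sorted splat contributions from front to back. The early-termination threshold is a common optimization shown for illustration, not a required part of the formula.

```cpp
#include <vector>

// Contribution of one depth-sorted 2D Gaussian splat to a pixel.
struct SplatSample {
    float color[3];  // color c_i from spherical harmonics
    float alpha;     // opacity after evaluating the 2D Gaussian at the pixel
};

// Front-to-back alpha compositing for one pixel, implementing
// C = sum_i c_i * alpha_i * prod_{j<i} (1 - alpha_j).
void compositePixel(const std::vector<SplatSample>& sorted, float outColor[3]) {
    float transmittance = 1.f;
    outColor[0] = outColor[1] = outColor[2] = 0.f;
    for (const SplatSample& s : sorted) {
        float weight = s.alpha * transmittance;
        outColor[0] += weight * s.color[0];
        outColor[1] += weight * s.color[1];
        outColor[2] += weight * s.color[2];
        transmittance *= (1.f - s.alpha);
        if (transmittance < 1e-4f) break;  // remaining splats contribute almost nothing
    }
}
```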
Due to the inability of the sparse geometric map from visual SLAM to meet the requirements of high-quality 3D reconstruction, this paper designs a 3D Gaussian densification algorithm to generate more 3D Gaussian points, capturing more detailed features and achieving a fine depiction of the scene. The original 3D GS method employs splitting and cloning techniques. In under-reconstructed areas where the current 3D Gaussians cannot accurately fit the necessary structures, resulting in blank regions, the algorithm simply creates Gaussian points of the same size and moves them along the direction of the positional gradient. For over-reconstructed areas, where a single 3D Gaussian has too broad a coverage to precisely outline complex shapes, the algorithm subdivides overly large 3D Gaussian points into smaller units.
However, the densification algorithm of 3D GS alone cannot provide sufficient parameters for the hybrid map, and many regional features remain difficult to capture. Additionally, splitting and cloning do not fully utilize the accurate geometric feature information. Therefore, this paper proposes a densification algorithm based on geometric features to further increase the information density of the map. Depending on the camera mode of the system, different strategies are adopted for monocular, stereo, and RGB-D modes. In RGB-D mode, the depth information of inactive two-dimensional feature points is directly used to project them into three-dimensional space. In monocular mode, the depth values of the closest active two-dimensional feature points are used to estimate the depths of these inactive points. In stereo mode, a stereo matching algorithm is employed to estimate the depths of inactive two-dimensional feature points. By utilizing geometric features from adjacent areas to densify the elements of the hybrid map, the system can significantly increase the density of feature points.
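As an example of the RGB-D branch, the following sketch back-projects inactive 2D feature points into camera-frame 3D points using the depth map and pinhole intrinsics; the function and parameter names are illustrative. The monocular and stereo branches would replace the depth lookup with the nearest active feature's depth or a stereo-matching estimate, respectively.

```cpp
#include <opencv2/core.hpp>
#include <cmath>
#include <vector>

// Back-project inactive 2D feature points into 3D using the depth map
// (RGB-D mode of the geometric densification). fx, fy, cx, cy are the
// pinhole intrinsics; points with invalid depth are skipped.
std::vector<cv::Point3f> backprojectFeatures(
    const std::vector<cv::Point2f>& inactivePts,
    const cv::Mat& depth /* CV_32F, meters */,
    float fx, float fy, float cx, float cy) {
    std::vector<cv::Point3f> newGaussianCenters;
    for (const cv::Point2f& p : inactivePts) {
        int u = cvRound(p.x), v = cvRound(p.y);
        if (u < 0 || v < 0 || u >= depth.cols || v >= depth.rows) continue;
        float z = depth.at<float>(v, u);
        if (z <= 0.f || !std::isfinite(z)) continue;         // invalid measurement
        newGaussianCenters.emplace_back((p.x - cx) * z / fx,  // X
                                        (p.y - cy) * z / fy,  // Y
                                        z);                   // Z (camera frame)
    }
    return newGaussianCenters;
}
```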
To clarify the interplay between our Gaussian pyramid-based optimization and the geometric feature-based densification, we present a detailed workflow in Figure 7. The process begins by initializing 3D Gaussians from the map feature points provided by the SLAM front-end. The optimization then iterates through the levels of a Gaussian pyramid, starting from the coarsest level to establish a robust global structure and progressively moving to finer levels to refine details. Crucially, within this optimization loop, our novel geometric densification and the standard 3DGS densification are periodically triggered. This ensures that new geometric details from the SLAM system are continually integrated and refined alongside the existing Gaussians, synergistically improving map quality and completeness.
The Gaussian Pyramid is a set of images that exhibit different levels of detail, constructed by repeatedly applying Gaussian smoothing and downsampling to the original image. At the initial stage of training, the features from the top layer of the pyramid (the coarsest image representation) are used to guide the optimization of the 3D Gaussian parameters. As the training iterations progress, on the one hand, the geometric parameters are densified to improve accuracy and detail representation, and on the other hand, the next layer of the pyramid is accessed to obtain new ground-truth image values. This process iterates repeatedly, gradually capturing and learning more detailed geometric features of the scene until the bottom layer of the Gaussian Pyramid is reached. The optimization process using a Gaussian Pyramid with n layers can be expressed by the following formula:

\min_{\theta} \; L\!\left(\hat{I}^{(l)}, I^{(l)}\right), \quad l = n-1, \dots, 0,

where I^{(l)} denotes the real image at the lth level of the pyramid and \hat{I}^{(l)} denotes the image rendered at the same resolution; optimization is achieved by minimizing the photometric loss between the rendered image and the real image at each level of the pyramid. For the rendered image \hat{I} and the real image I, the loss is calculated as

L = (1 - \lambda) \, L_1(\hat{I}, I) + \lambda \, L_{D\text{-}SSIM}(\hat{I}, I),

i.e., the loss is measured by the L_1 and D-SSIM losses between the rendered image \hat{I} and the real image I. Finally, the parameters are updated through gradient backpropagation to achieve the optimization of the 3D Gaussians.
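The pyramid construction and the coarse-to-fine schedule can be sketched as follows. The linear level-switching schedule is an illustrative assumption, since the actual switching points are tuned hyperparameters.

```cpp
#include <opencv2/imgproc.hpp>
#include <algorithm>
#include <vector>

// Build an n-level Gaussian pyramid of a keyframe (level 0 = full resolution,
// level n-1 = coarsest). Training starts from the coarsest level and moves
// down one level as the iterations progress.
std::vector<cv::Mat> buildGaussianPyramid(const cv::Mat& image, int levels) {
    std::vector<cv::Mat> pyramid{image};
    for (int l = 1; l < levels; ++l) {
        cv::Mat down;
        cv::pyrDown(pyramid.back(), down);  // Gaussian blur + 2x downsample
        pyramid.push_back(down);
    }
    return pyramid;
}

// Which pyramid level supplies the ground-truth image at a given iteration.
// A linear schedule is used here for illustration only.
int levelForIteration(int iter, int totalIters, int levels) {
    int step = std::max(1, totalIters / levels);
    int fromCoarse = iter / step;                 // 0, 1, 2, ... as iterations grow
    return std::max(0, levels - 1 - fromCoarse);  // coarsest first, finest last
}
```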
This paper combines 3D scene representation with Gaussian Pyramid training and a geometric densification algorithm, aiming to significantly improve training efficiency while ensuring that the 3D Gaussians can accurately capture the detailed features of the scene. Based on the dynamic region segmentation module of our system, we successfully identify and segment dynamic objects, which is crucial for ensuring that feature points belonging to dynamic objects do not interfere with map reconstruction. By effectively excluding these interfering feature points, our 3D GS model is able to produce a clearer and less noisy map reconstruction, as shown in Figure 8.
To better position our contribution within the landscape of existing research, we summarize the primary methodologies for dynamic SLAM in Table 1. While prior approaches based on multi-view geometry, semantic priors, or dense optical flow have their respective strengths, they also suffer from significant drawbacks, such as failure in cluttered dynamic scenes, inability to handle unknown objects, or prohibitive computational costs. Our proposed coarse-to-fine cascaded strategy is designed to combine the advantages of these methods while mitigating their limitations. By using an efficient sparse optical flow check to filter out the majority of objects, we achieve real-time performance, and by reserving the fine-grained dense flow analysis for only the most complex cases, we retain the ability to accurately segment both global and local movements, thereby enhancing both localization robustness and mapping quality.