Article

TAS-SLAM: A Visual SLAM System for Complex Dynamic Environments Integrating Instance-Level Motion Classification and Temporally Adaptive Super-Pixel Segmentation

1. Mechanical Electrical Engineering School, Beijing Information Science & Technology University, Beijing 100192, China
2. Intelligent Equipment Research Institute, Beijing Academy of Science and Technology, Beijing 100061, China
3. State Key Laboratory of Tribology, Department of Mechanical Engineering, Tsinghua University, Beijing 100084, China
4. China Academy of Safety Science and Technology, Beijing 100012, China
* Author to whom correspondence should be addressed.
ISPRS Int. J. Geo-Inf. 2026, 15(1), 7; https://doi.org/10.3390/ijgi15010007
Submission received: 13 November 2025 / Revised: 12 December 2025 / Accepted: 18 December 2025 / Published: 21 December 2025
(This article belongs to the Special Issue Indoor Mobile Mapping and Location-Based Knowledge Services)

Abstract

To address the decreased localization accuracy and robustness of existing visual SLAM systems in complex dynamic scenes, where imprecise identification of dynamic regions leads either to dynamic interference or to the loss of valid static feature points, this paper proposes a dynamic visual SLAM method integrating instance-level motion classification, temporally adaptive super-pixel segmentation, and optical flow propagation. The system first employs an instance-level motion classifier, combining residual flow estimation with a YOLOv8-seg instance segmentation model, to distinguish moving objects. Then, a temporally adaptive SLIC super-pixel segmentation algorithm (TA-SLIC) is applied to achieve fine-grained dynamic region partitioning. Subsequently, a dynamic-region missed-detection correction mechanism based on optical flow propagation (OFP) refines the missed-detection mask, enabling accurate identification and capture of motion regions containing non-rigid local object movements, undefined moving objects, and low-dynamic objects. Finally, dynamic feature points are removed, and the remaining valid static features are used for pose estimation. The localization accuracy of the visual SLAM system is validated on two widely adopted datasets, TUM and BONN. Experimental results demonstrate that the proposed method effectively suppresses interference from dynamic objects (particularly non-rigid local motions) and significantly enhances both localization accuracy and system robustness in dynamic environments.

1. Introduction

Simultaneous Localization and Mapping (SLAM) empowers autonomous agents to reconstruct environmental structures while estimating their pose through sensor data analysis. Among various sensing modalities, vision-based SLAM has gained prominence due to cameras’ low cost, portability, and rich information output. While established systems such as DSO [1], LSD-SLAM [2] and ORB-SLAM2 [3] demonstrate remarkable accuracy in static environments, their performance deteriorates substantially in dynamic settings where moving objects introduce feature contamination. Although conventional solutions like RANSAC [4] provide partial mitigation against dynamic interference, they prove insufficient in highly mobile scenes containing multiple non-rigid objects.
To address the above-mentioned challenges, researchers have developed a series of SLAM systems tailored for dynamic environments, which enhance both localization accuracy and robustness under such conditions. These systems can be broadly classified into three categories:
The first category integrates semantic segmentation models capable of providing precise object boundaries [5,6,7,8,9,10,11,12,13,14,15]. However, these approaches exhibit a pronounced dependency on training datasets and often fail to reliably detect undefined dynamic objects.
The second category incorporates geometric constraints and motion consistency cues [16,17,18,19,20,21,22,23,24,25,26,27]. Although these methods operate without prior knowledge, they are prone to misjudgment in scenarios involving multiple dynamic targets or potential motions.
The third category leverages optical flow information [28,29,30,31,32,33], benefiting from its computational efficiency and real-time performance. Nevertheless, optical flow-based methods are susceptible to environmental disturbances such as noise and illumination variations, which can lead to erroneous estimations.
In summary, current SLAM systems designed for dynamic environments still face considerable challenges when operating in complex scenarios involving multiple dynamic targets—such as non-rigid objects with localized motion, undefined moving objects, and low-dynamic objects. These systems often fail to process undefined moving objects effectively. Moreover, in dealing with the local motion of non-rigid objects, they tend to discard the entire object region, resulting in the unnecessary removal of numerous valid feature points. This significantly diminishes the amount of usable visual features, thereby adversely affecting the overall localization accuracy of the system.
To address these challenges, we propose TAS-SLAM, a visual SLAM system specifically designed for complex dynamic environments through the integration of Instance-Level Motion Classification, TA-SLIC and OFP. Constructed upon the ORB-SLAM2 framework, our approach incorporates three key technical contributions:
  • Instance-level motion classifier: A multi-threshold classification mechanism is proposed, leveraging YOLOv8-seg instance masks and residual flow statistics to categorize objects into three distinct states: rigid-consistent global motion, non-rigid local dynamic motion, or static. This approach enables targeted refinement of ambiguously moving objects and effectively overcomes the inherent limitation of semantic-based models in detecting unknown dynamic entities.
  • Temporally Adaptive Super-Pixel Segmentation: A novel segmentation module is proposed, integrating hysteresis thresholds, temporal priors, and flow-angle-aware superpixel aggregation. TA-SLIC effectively mitigates the instability inherent in pixel-level residual thresholding by generating spatially coherent and temporally consistent dynamic masks—particularly suited for handling non-rigid and partially moving objects.
  • Optical Flow Propagation: A selective high-threshold propagation mechanism is introduced, which transfers reliable posterior information across consecutive frames under strict magnitude and similarity constraints. In contrast to conventional propagation approaches that are prone to noise accumulation, the proposed OFP method enhances segmentation stability under challenging conditions including occlusions, semantic mask inaccuracies, and varying illumination.
Collectively, these contributions allow TAS-SLAM to precisely identify dynamic regions containing non-rigid local motion, undefined moving objects, and low-dynamic objects, eliminating dynamic features while retaining static ones, thereby substantially improving localization accuracy and system robustness in both low- and high-dynamic environments.

2. Related Work

2.1. Optical Flow Estimation

Optical flow [34] provides a detailed characterization of motion patterns by estimating the per-pixel displacement between consecutive image frames, reflecting both the movement of the camera and the motion of objects within the scene.
Optical flow estimation methods can be broadly categorized into two groups: traditional optimization-based approaches and deep learning-based techniques. Traditional methods rely on spatiotemporal image gradients and operate under assumptions such as brightness constancy and spatial smoothness. For instance, the Lucas–Kanade method [35] computes sparse optical flow via local least-squares optimization within a windowed region, making it suitable for small displacements. In contrast, the Horn–Schunck approach [36] incorporates a global smoothness constraint through an energy functional to estimate dense optical flow, though it often blurs motion boundaries. Region matching-based methods compute displacements by searching for correspondences across images but are susceptible to performance degradation under large motions or weak textures.
Deep learning-based methods have improved robustness through data-driven learning. Supervised models such as FlowNet [37,38] and PWC-Net [39] are trained end-to-end using synthetic datasets to predict optical flow directly. Unsupervised methods leverage photometric consistency between frames as a self-supervision signal. More recent architectures like RAFT [40] combine iterative refinement with convolutional networks, achieving notable improvements in accuracy for challenging cases including large displacements and occlusions.

2.2. Visual SLAM Systems for Dynamic Environments

Mainstream visual SLAM systems designed for dynamic environments are predominantly built upon established frameworks such as ORB-SLAM2 [3], ORB-SLAM3 [41], and VINS [42,43].
DS-SLAM [5] incorporates semantic information using the SegNet [44] segmentation network, combined with sparse optical flow analysis and motion consistency verification, to detect and remove dynamic objects. Berta et al. introduced Dyna-SLAM [6], which integrates Mask R-CNN [45] for semantic instance segmentation with multi-view geometric constraints to handle dynamic content. Ran et al. proposed RS-SLAM [7], which combines semantic segmentation with enhanced dynamic object detection to improve localization accuracy and facilitate the construction of static semantic maps in dynamic settings.
He et al. developed OVD-SLAM [8], a method that discriminates foreground from background by fusing semantic, depth, and optical flow information. Dynamic regions are identified based on mean re-projection error computed over multiple frames, allowing recovery of static points. Zhang et al. proposed VDO-SLAM [46], which overcomes the static-world assumption of traditional SLAM through tight coupling of semantic and geometric cues.
Liu et al. presented RDS-SLAM [9], which introduces independently operating semantic and semantic-aware optimization threads. This design allows flexible integration of segmentation methods with varying computational demands without stalling the tracking thread. Yan et al. developed DGS-SLAM [10], which employs a multinomial residual model for dynamic point detection and enhances pose estimation through feature classification and a semantic keyframe selection strategy aimed at identifying potentially dynamic objects.
Liu et al. also proposed a stereo visual odometry system [47] tailored for dynamic environments. It identifies dynamic points using optical flow filtering based on quantitative and angular optical flow histograms, along with a multi-feature fusion strategy for binary segmentation to derive bounding boxes of dynamic objects. Zhou et al. introduced RVD-SLAM [11], which leverages affine photometric consistency and sparse semantic segmentation while incorporating prior outlier information to reduce computational cost.
Zhuang et al. proposed Amos-SLAM [48], which employs optical flow tracking and model generation to rapidly identify potentially dynamic regions—both known and unknown. The system subsequently uses super-pixel extraction and geometric clustering to detect motion in unknown objects and refines pose estimation accordingly.
Drawing inspiration from optical-flow-based pipelines such as OVD-SLAM, our method employs dense optical flow as the primary cue for detecting potentially dynamic regions. In contrast to prior work, we place greater emphasis on refining the optical flow estimation and explicitly reducing the influence of rigid camera motion when determining dynamic regions.
Geometry-driven systems, such as Amos-SLAM, typically apply SLIC or similar clustering heuristics directly on color features to obtain candidate regions for motion analysis. In TAS-SLAM, however, cluster generation is driven by motion evidence: we first derive spatially coherent dynamic posteriors from refined flow residuals and then perform temporally adaptive superpixel aggregation within this posterior space. This motion-first clustering strategy yields more stable boundaries for partially moving or semantically undefined objects compared to purely color-space clustering.
This paper addresses the challenge of low localization accuracy in current SLAM systems operating in highly dynamic environments. To improve localization precision, we propose TAS-SLAM, a novel SLAM framework that integrates an instance-level motion segmentation classifier, a temporally adaptive superpixel segmentation algorithm, and an optical-flow-propagation mechanism for correcting missed dynamic object detections. The proposed system enables robust and accurate localization in challenging dynamic scenarios.

3. System Overview

The proposed TAS-SLAM framework integrates instance-level motion reasoning, temporally adaptive superpixel segmentation, and optical-flow propagation into a unified processing pipeline, as depicted in Figure 1. Specifically, given a sequential stream of RGB-D frames, the system first executes two parallel processing branches: residual flow computation and instance segmentation.
In the residual-flow branch, LiteFlowNet3 [49] estimates dense optical flow between consecutive RGB images, while a rigid flow field is derived from coarse feature matching and depth-based camera motion projection. The residual flow—highlighting motions inconsistent with global rigidity—is then obtained by comparing the dense and rigid flows. Concurrently, the instance-segmentation branch employs YOLOv8-seg [50] to produce instance-level object masks for the current RGB frame.
Subsequently, the residual flow and instance masks are passed to an instance-level motion classifier. This module categorizes each detected instance into one of three motion states: rigid-consistent (global motion), non-rigid (local dynamic motion), or static. At the frame level, the classifier also determines whether any non-rigid local motion is present. If no such motion is detected, the pipeline bypasses the subsequent TA-SLIC module and proceeds directly to temporal consistency verification. Otherwise, TA-SLIC is activated to refine the regions identified as non-rigid and locally dynamic. TA-SLIC employs hysteresis thresholds to establish adaptive decision boundaries and leverages temporal priors to enforce consistency across frames. Superpixel aggregation is then applied to sharpen object boundaries and suppress anomalies, yielding refined masks for non-rigid local motions.
Following motion classification (and TA-SLIC refinement when triggered), the system invokes a propagation-decision module to determine whether reliable temporal information transfer should be performed. This decision is governed by dual constraints: consistency in residual-flow magnitude and structural similarity of the flow field. When significant and kinematically consistent motion is confirmed, the Optical Flow Propagation (OFP) module transfers high-confidence posterior information from the preceding frame; otherwise, propagation is suppressed to prevent error accumulation.
Finally, the refined masks are fused to generate robust, temporally consistent dynamic masks. Based on these masks, a dynamic-point filter removes feature points located in dynamic regions while preserving static features for pose estimation. This filtering mechanism enhances localization accuracy and improves overall SLAM stability in complex, dynamic environments.

4. Residual Flow Calculation

4.1. Optical Flow

To accurately estimate motion information within image sequences, we employ LiteFlowNet3 [49], an efficient and accurate optical flow estimation network that utilizes deep learning techniques to model inter-frame pixel-level motion. The network effectively captures subtle motion variations through a combination of multi-scale feature extraction and iterative flow refinement. The model processes two consecutive images by first extracting hierarchical features via a convolutional neural network, then computing per-pixel motion vectors within its optical flow estimation module. This architecture enables robust and precise optical flow estimation even in challenging scenarios involving rapid motion or significant illumination changes.

4.2. Rigid Flow

Rigid flow [51] characterizes the scene transformation induced by the global motion of the camera. This type of flow captures the macroscopic displacement characteristics of the entire image, without accounting for independent motion of individual objects within the scene. The computation of rigid flow typically employs camera pose parameters or homography matrices to map pixels from the current frame to their corresponding positions in subsequent frames, thereby constructing a coherent flow field model.
The homography matrix is central to the computation of rigid flow. It enables precise point-to-point mapping between images without relying on depth information. Although the homography assumes that the corresponding scene points lie on a common plane, this assumption remains valid when the camera’s translational displacement is small relative to the scene depth. The matrix is estimated from matched feature points across consecutive frames and is subsequently used to project pixels from the current frame to the next. Specifically, the projection relation can be expressed as:
$$p_t[i] = H\, p_{t-1}[i], \quad i \in I$$
$$\begin{bmatrix} x_2 \\ y_2 \\ 1 \end{bmatrix} = \begin{bmatrix} h_1 & h_2 & h_3 \\ h_4 & h_5 & h_6 \\ h_7 & h_8 & h_9 \end{bmatrix} \begin{bmatrix} x_1 \\ y_1 \\ 1 \end{bmatrix}$$
where $p_t[i]$ denotes the $i$-th pixel in frame $t$, $p_{t-1}[i]$ the corresponding pixel in frame $t-1$, and $I$ the input image. Here, $H$ is an invertible homogeneous transformation matrix representing the projective transformation between the two image frames; $h_1$ to $h_9$ are its entries, $(x_1, y_1, 1)^T$ are the homogeneous coordinates of the pixel in frame $t-1$, and $(x_2, y_2, 1)^T$ those of the corresponding pixel in frame $t$.
Once the corresponding pixel coordinates between the current frame and the previous frame are established, the rigid flow can be computed as follows:
$$f_t^{r} = p_t[i] - p_{t-1}[i], \quad i \in I$$
where $f_t^{r}$ denotes the rigid flow vector at pixel $i$, representing the apparent motion resulting from camera motion under the planar scene assumption.
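As an illustration of the relations above, the following NumPy sketch (our own helper, not the authors' implementation) warps a pixel grid through a homography $H$ and subtracts the original coordinates to obtain the rigid flow field:

```python
import numpy as np

def rigid_flow_from_homography(H, height, width):
    """Rigid flow f^r: displacement of each pixel under the homography H."""
    ys, xs = np.mgrid[0:height, 0:width]
    # 3xN homogeneous pixel coordinates of frame t-1
    pts = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).T.astype(np.float64)
    proj = H @ pts
    proj = proj[:2] / proj[2]                       # dehomogenize
    # per-pixel displacement (dx, dy), reshaped back to image layout
    return (proj - pts[:2]).T.reshape(height, width, 2)

# A pure-translation homography: every pixel shifts by (2, 1)
H = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])
flow = rigid_flow_from_homography(H, 4, 5)
```

In practice $H$ would be estimated from matched feature points (e.g., with a RANSAC-based fit); the translation-only matrix here simply makes the expected flow obvious.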

4.3. Residual Flow

To detect dynamic objects while mitigating background interference induced by global camera motion, optical flow and rigid flow can be leveraged to decouple motion components: those caused by camera movement and those resulting from independent object motion. Specifically, a residual flow field is derived by computing the difference between the optical flow and the rigid flow. This residual flow captures motion signals exclusive to camera movement, thereby primarily indicating the presence of dynamic objects in the scene. Such a formulation enables accurate identification of dynamic regions.
We utilize instance masks extracted by YOLOv8-seg [50] and perform the following statistical operations within each instance region. Let $f^{opt}$, $f^{rigid}$, and $f^{res}$ denote the optical flow, rigid flow, and residual flow, respectively, and let $m(x)$ represent the magnitude of the residual flow. The residual flow field is then defined as:
$$f_i^{res} = f_i^{opt} - f_i^{rigid}$$
$$m(x) = \left\| f^{res} \right\|_2$$
As shown in the optical flow visualizations in Figure 2a–c, with reference to the color legend in Figure 2d, hue represents the direction of motion, while brightness corresponds to the magnitude of displacement; higher brightness indicates larger motion offsets.
During optical flow estimation, it can be observed that the resulting residual flow field in Figure 2c exhibits reduced brightness compared to the optical flow field in Figure 2a, particularly in regions containing dynamic objects such as the “person”. This attenuation effectively eliminates interference caused by camera motion.
The computed residual flow field in Figure 2c serves a dual purpose: it is utilized not only for detecting dynamic objects but also for computing optical flow magnitude. This integrated approach enhances the accuracy of motion segmentation and supports more robust motion interpretation in complex dynamic environments.
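Under the definitions above, forming the residual field is a direct elementwise subtraction; the sketch below (illustrative only) also computes the per-pixel magnitude $m(x)$:

```python
import numpy as np

def residual_flow(optical, rigid):
    """Residual flow f^res = f^opt - f^rigid and its per-pixel L2 magnitude."""
    res = optical - rigid
    mag = np.linalg.norm(res, axis=-1)
    return res, mag

# Camera motion shifts every pixel by (2, 0); a 2x2 patch also moves independently.
optical = np.full((4, 4, 2), [2.0, 0.0])
optical[1:3, 1:3] += [0.0, 3.0]          # independent object motion
rigid = np.full((4, 4, 2), [2.0, 0.0])
res, mag = residual_flow(optical, rigid)
```

After subtraction the static background has zero residual magnitude, while the independently moving patch retains its own motion, which is exactly the attenuation effect visible in Figure 2c.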

5. Instance-Level Motion Classifier

To leverage instance-level structural information, we designed an instance-level motion classifier that adaptively assigns each detected object to distinct refinement strategies. Using semantic instance masks generated by YOLOv8-seg segmentation, we analyze residual flow statistics within each instance region. Specifically, we calculate the dynamic ratio (defined as the proportion of pixels exceeding a robust residual threshold), the coefficient of variation in residual flow magnitudes, and the largest connected component ratio among dynamic pixels. Based on these metrics, each instance is classified into one of three motion states: rigid-consistent global motion, non-rigid local dynamic motion, or static.
We denote the candidate dynamic masks in the image as Ω , and employ the function T ( M ) for robust threshold estimation. Equation (6) effectively filters candidate dynamic masks through a binarization operation based on magnitude thresholding. The use of Median Absolute Deviation (MAD) and the median in Equation (7) enhances robustness against noise interference and mitigates excessive dispersion in the pixel-level magnitude values used for threshold computation. The constant κ in Equation (7) is set to 1.8.
$$\Omega_i = \{ x \in M_i : V(x) = 1 \}, \quad V(x) \in \{0, 1\}$$
$$T(M) = \mathrm{median}(M) + \kappa \cdot \mathrm{MAD}(M)$$
where $\Omega_i$ is the set of pixels in the instance mask $M_i$ whose binary residual indicator $V(x)$ equals 1, and MAD is defined as $\mathrm{MAD}(z) = \mathrm{median}(|z - \mathrm{median}(z)|)$.
We propose three threshold-based metrics to quantify the motion characteristics of instance objects in a scene, evaluating the proportion of motion, motion consistency, and spatial continuity of dynamic pixels, respectively:
Dynamic ratio ρ : measures the proportion of dynamic pixels within an instance. This metric can rapidly eliminate evidently static objects.
$$\rho_i = \frac{1}{|\Omega_i|} \sum_{x \in \Omega_i} \mathbf{1}\left[m(x) > \tau_s\right]$$
where $\tau_s = T(M_{bg})$, and $M_{bg}$ denotes the residual magnitude values over the background region outside the instance masks.
Coefficient of variation in magnitude c v : evaluates the dispersion of residual flow magnitudes within the instance region.
$$c_{v,i} = \frac{\mathrm{std}(M \mid M_i)}{\mathbb{E}\left[M \mid M_i\right]}$$
Connected component ratio λ : computes the ratio of the largest connected component of dynamic pixels to the total number of dynamic pixels within the instance. Let B denote the mask regions satisfying the magnitude threshold:
$$B_i = \{ x \in \Omega_i : m(x) > \tau_s \}$$
$$\lambda_i = \frac{\max_{C} |C(B_i)|}{|B_i| + \varepsilon}$$
Classification rule:
$$\mathrm{state}_i = \begin{cases} \text{Static}, & \rho_i \le \tau_\rho^{lo} \\ \text{rigid-consistent global motion}, & (\rho_i \ge \tau_\rho^{hi}) \wedge (c_{v,i} \le \tau_{cv}) \wedge (\lambda_i \ge \tau_\lambda) \\ \text{non-rigid local dynamic motion}, & \text{otherwise.} \end{cases}$$
Based on comprehensive experimental validation, the thresholds $\tau_\rho^{lo}$, $\tau_\rho^{hi}$, $\tau_{cv}$, and $\tau_\lambda$ are set to 0.65, 0.35, 0.8, and 0.2, respectively.
Unlike purely pixel-based methods that treat all regions equally, our classifier introduces instance-level adaptivity, which prevents over-segmentation by enforcing object-level coherence. It reduces computational load by skipping refinement on clearly static or fully dynamic objects, focusing TA-SLIC algorithm only on regions with ambiguity, thereby improving both efficiency and accuracy. As shown in column 3 of Figure 3, the heatmap of residual flow for the segmented “person” instance in column 2 exhibits significant changes. Column 4 further demonstrates that the “person” is categorized as a local motion object. The proposed classifier effectively distinguishes between globally moving objects, non-rigid locally moving regions, and static objects.
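The three metrics and the classification rule can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the helper names are ours, the inequality directions in the rule are reconstructed from context, and the published threshold values are taken at face value.

```python
import numpy as np
from collections import deque

def largest_component_ratio(binary):
    """lambda: size of the largest 4-connected dynamic component / total dynamic pixels."""
    h, w = binary.shape
    seen = np.zeros((h, w), dtype=bool)
    total = int(binary.sum())
    if total == 0:
        return 0.0
    best = 0
    for sy in range(h):
        for sx in range(w):
            if binary[sy, sx] and not seen[sy, sx]:
                seen[sy, sx] = True
                size, q = 0, deque([(sy, sx)])
                while q:                              # BFS flood fill
                    y, x = q.popleft()
                    size += 1
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w and binary[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            q.append((ny, nx))
                best = max(best, size)
    return best / (total + 1e-6)

def classify_instance(mag, inst_mask, tau_s,
                      tau_rho_lo=0.65, tau_rho_hi=0.35, tau_cv=0.8, tau_lam=0.2):
    m = mag[inst_mask]
    rho = float((m > tau_s).mean())                   # dynamic ratio
    cv = float(m.std() / (m.mean() + 1e-6))           # coefficient of variation
    lam = largest_component_ratio(inst_mask & (mag > tau_s))  # spatial continuity
    if rho <= tau_rho_lo:
        return "static"
    if rho >= tau_rho_hi and cv <= tau_cv and lam >= tau_lam:
        return "rigid-consistent global motion"
    return "non-rigid local dynamic motion"

# A coherent, strongly moving instance vs. a motionless one
mag = np.zeros((6, 6))
mag[1:5, 1:5] = 4.0
inst = np.zeros((6, 6), dtype=bool); inst[1:5, 1:5] = True
state_dyn = classify_instance(mag, inst, tau_s=1.0)
static_inst = np.zeros((6, 6), dtype=bool); static_inst[0, :] = True
state_static = classify_instance(mag, static_inst, tau_s=1.0)
```

A production version would use a vectorized connected-components routine (e.g., `scipy.ndimage.label`) instead of the BFS shown here.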

6. Temporal Adaptive SLIC (TA-SLIC)

Purely pixel-level residual thresholding is highly sensitive to noise and illumination changes, often producing fragmented masks and unstable boundaries. To overcome this limitation, we propose TA-SLIC (Temporally Adaptive Simple Linear Iterative Clustering), a novel segmentation module that integrates robust statistics, temporal priors, and superpixel-based aggregation.
As detailed in Algorithm 1, TA-SLIC first computes robust residual thresholds using median + MAD (Median Absolute Deviation) statistics, which are subsequently refined through Exponential Moving Average (EMA) hysteresis to generate adaptive high and low thresholds. This adaptive thresholding mechanism stabilizes dynamic/static transitions over time. A temporal prior is obtained by warping the previous frame’s posterior probability map into the current frame, ensuring temporal consistency across consecutive frames. The pixel-wise posterior probabilities are then estimated using a logistic regression model that incorporates three key elements: residual deviations, temporal priors, and dynamic thresholds. To enhance noise robustness, these posterior probabilities are aggregated within super-pixels based on spatial proximity, thereby enforcing local coherence. Finally, a Conditional Random Field (CRF)-based refinement step is applied to sharpen object boundaries and eliminate spurious regions.
Algorithm 1: Temporally Adaptive SLIC (TA-SLIC)
Input: residual flow $f_t^{res}$, residual magnitude $I_t^{mag}$, instance mask $M^{inst}$, previous posterior $P_{t-1}$
Output: motion mask $M_t^{tas}$, posterior $P_t$
1: Compute robust threshold $\mu = \mathrm{median}(I_t^{mag}) + \kappa \cdot \mathrm{MAD}$
2: Update temporal thresholds $\mu^{high}$, $\mu^{low}$
3: If $P_{t-1}$ exists, warp it with $f_t^{res}$ to get temporal prior $\tilde{P}$
4: Compute posterior $P = \sigma\left(\theta_0 + \theta_1 (I_t^{mag} - \mu) + \theta_2 \tilde{P}\right)$
5: Generate super-pixels $S$ within $M^{inst}$ by SLIC on $P$
6: For each super-pixel $s \in S$ do
7:   Compute the average posterior $\bar{P} = \sum_{x \in s} P(x) / |s|$
8:   Label $s$ as dynamic if $\bar{P} > \tau_{tas}$
9: End for
10: Aggregate super-pixel labels into binary mask $M_t^{tas}$
11: Refine $M_t^{tas}$ with morphological cleanup and CRF
12: Store $P_t$ and $M_t^{tas}$ for the next frame
Specifically, TA-SLIC achieves spatially coherent and temporally stable segmentation of ambiguous or partially moving regions—precisely the areas where naive pixel-level methods typically fail. To prevent excessive frame-to-frame fluctuations, the system leverages EMA to dynamically update the segmentation thresholds.
$$\mu_t^{high} = \beta\, \mu_{t-1}^{high} + (1 - \beta)\, \mu_t^{*}, \qquad \mu_t^{low} = \eta\, \mu_t^{high}, \qquad \mu_t^{*} = T(M \mid \Omega)$$
$$\mu_t(x) = \begin{cases} \mu_t^{low}, & P_{t-1}(x) \ge 0.5 \\ \mu_t^{high}, & \text{otherwise.} \end{cases}$$
In terms of temporal continuity, the residual flow from the previous frame is warped to align with the current frame, thereby establishing a temporal prior. The dynamic posterior probability for each pixel is then computed through logistic regression that integrates three key components: residual magnitude, threshold bias, and the temporal prior.
$$\tilde{P}_{t-1}(x) = P_{t-1}\!\left(x - f_{t-1 \to t}^{res}(x)\right)$$
$$P_t^{tas}(x) = \sigma\!\left(\theta_0 + \theta_1 \left(M(x) - \mu_t(x)\right) + \theta_2\, \tilde{P}_{t-1}(x)\right)$$
where $\sigma(\cdot)$ denotes the sigmoid function. Based on comprehensive experimental validation, the parameters are set to $\theta_0 = 0$, $\theta_1 = 1$, and $\theta_2 = 1$.
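The hysteresis update and logistic posterior reduce to a few lines. In the sketch below, $\beta$ and $\eta$ are assumed smoothing factors chosen for illustration (the paper does not report their values), and the function names are ours:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def update_thresholds(mu_prev_high, mu_star, beta=0.7, eta=0.6):
    """EMA hysteresis: smooth the high threshold, derive the low one from it."""
    mu_high = beta * mu_prev_high + (1.0 - beta) * mu_star
    return mu_high, eta * mu_high

def dynamic_posterior(mag, mu_map, prior, theta=(0.0, 1.0, 1.0)):
    """P^tas(x) = sigma(theta0 + theta1 * (M(x) - mu(x)) + theta2 * prior(x))."""
    t0, t1, t2 = theta
    return sigmoid(t0 + t1 * (mag - mu_map) + t2 * prior)

mu_high, mu_low = update_thresholds(4.0, 2.0)   # 0.7*4 + 0.3*2 = 3.4; low = 2.04
p_border = dynamic_posterior(np.array([2.0]), np.array([2.0]), np.array([0.0]))
p_strong = dynamic_posterior(np.array([10.0]), np.array([2.0]), np.array([0.0]))
```

At the threshold with no temporal prior the posterior is exactly 0.5; a residual well above the threshold, or a confident prior, pushes it toward 1, which is the hysteresis behavior the text describes.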
Based on SLIC [52], we replace the original color space D with the posterior probability space P t a s :
$$D = \sqrt{\left(P^{tas}\right)^2 + \left(\frac{D_s}{S}\right)^2}$$
where $D_s$ denotes the spatial distance between the cluster center and each pixel, and $S$ the average spacing of the super-pixels.
Let S k denote the k t h super-pixel block (the region obtained by SLIC segmentation).
$$\bar{P}_k = \frac{1}{|S_k|} \sum_{x \in S_k} P_t^{tas}(x)$$
$$M_t^{tas} = \mathbf{1}\left[\bar{P}_k > \tau_{tas}\right]$$
where the super-pixel segmentation threshold $\tau_{tas}$ is set to 0.65.
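Given any superpixel label map (from SLIC or otherwise), the per-superpixel averaging and thresholding step is straightforward; the following is a minimal sketch with hypothetical inputs:

```python
import numpy as np

def aggregate_superpixels(posterior, labels, tau_tas=0.65):
    """Average the posterior within each superpixel, then threshold at tau_tas."""
    mask = np.zeros(posterior.shape, dtype=bool)
    for k in np.unique(labels):
        region = labels == k
        if posterior[region].mean() > tau_tas:
            mask[region] = True          # the whole superpixel becomes dynamic
    return mask

posterior = np.array([[0.9, 0.9, 0.1, 0.2],
                      [0.8, 0.7, 0.2, 0.1]])
labels = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1]])
mask = aggregate_superpixels(posterior, labels)
```

Because the decision is made per region rather than per pixel, isolated noisy pixels inside a superpixel cannot flip its label, which is the noise-robustness property the text attributes to this step.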
As shown in Figure 4, the proposed TA-SLIC algorithm successfully identifies non-rigid local regions of target objects—such as the arm of the person in Figure 4a, the rotating head of the left person in Figure 4b, both the rotating head of the left person and the fully moving person rising from the chair in Figure 4c, as well as the arm, hand, and left leg of the person in Figure 4d—while generating relatively complete and smooth motion regions.
Following the binarization step, precise masks of non-rigid local dynamic regions can be generated. Since segmentation boundaries of these masks constitute high-gradient areas, feature extraction yields a substantial number of feature points. However, actual segmentation results often exhibit irregular boundaries and internal voids within the masks. To address these issues, morphological operations are employed to remove small connected regions, followed by CRF optimization for further boundary refinement.

7. Optical Flow Propagation

Instance segmentation methods are frequently compromised by challenges such as rapid motion, object occlusion, and variations in ambient illumination, which can lead to intermittent semantic information—temporary disappearance, delayed reappearance, or flickering responses under noisy and occluded conditions. Such inconsistencies significantly undermine the stability of dynamic masks and facilitate the propagation of errors to downstream visual SLAM processes. To mitigate this issue, we propose an enhanced optical flow propagation (OFP) method that explicitly transfers reliable posterior information from the previous frame to the current one, thereby effectively leveraging temporal coherence.
The core design of the OFP module focuses on propagating dynamic information from the preceding frame under strict conditional constraints, ensuring that propagation is confined to semantically and motion-consistent regions. The detailed procedure is summarized in Algorithm 2.
Algorithm 2: Optical Flow Propagation (OFP)
Input: previous posterior $P_{t-1}$, residual flow $f_t^{res}$, residual magnitudes $I_t^{mag}$, $I_{t-1}^{mag}$
Output: propagation mask $M_t^{ofp}$, posterior $P_t$
1: If $P_{t-1}$ is None then
2:   Set $M_t^{ofp}$ and $P_t$ to all zeros
3: Else
4:   Warp $P_{t-1}$, $I_{t-1}^{mag}$ with $f_t^{res}$ to obtain $\tilde{P}_{t-1}$, $\tilde{I}_{t-1}^{mag}$
5:   Compute forward–backward consistency check: $|I_t^{mag} - \tilde{I}_{t-1}^{mag}| < \varepsilon$
6:   Suppress static regions by keeping only pixels where $I_t^{mag} \ge \gamma$
7:   Compute flow similarity via a local shift: $\mathrm{sim} > \tau_{sim}$
8:   Define mask = consistency ∧ significance ∧ similarity
9:   Set $P_t = \text{mask} \odot \tilde{P}_{t-1}$
10:  Aggregate into binary mask $M_t^{ofp}$
We obtain the predicted optical flow magnitude and posterior for the current frame by warping the residual flow from the previous frame. In practice, a high-threshold strategy is employed in the OFP module to compensate for missed detections while restricting propagation only to regions exhibiting strong, consistent, and spatially coherent motion evidence. This design effectively suppresses noise amplification and long-term drift, thereby enhancing the temporal stability of the motion masks. Three stringent constraints are imposed during propagation: magnitude consistency, significant dynamic regions, and flow-field similarity.
$$G_{cons}(x) = \mathbf{1}\left[\left| I_t^{mag}(x) - \tilde{I}^{mag}(x) \right| < \varepsilon\right]$$
$$G_{rej}(x) = \mathbf{1}\left[I_t^{mag}(x) \ge \gamma\right]$$
$$G_{sim}(x) = \mathbf{1}\left[\mathrm{sim}(x) > \tau_{sim}\right]$$
G c o n s ensures temporal coherence of optical flow between consecutive frames, G r e j mitigates abrupt transitions in object motion states, and G s i m suppresses pixel-level artifacts in local flow regions. Experimental results demonstrate that the parameter configuration ( ε = 0.5 , γ = 3 , τ s i m = 0.5 ) yields relatively precise localization performance.
$$f_{\mathrm{shift}}(x) = f_t^{\mathrm{res}}(x + \Delta), \quad \text{with } \Delta = (0,\, 1)$$
$$\mathrm{sim}(x) = 1 - \frac{\lVert f_t^{\mathrm{res}}(x) - f_{\mathrm{shift}}(x) \rVert_2}{\lVert f_t^{\mathrm{res}}(x) \rVert_2 + \lVert f_{\mathrm{shift}}(x) \rVert_2 + \delta}$$
Here, $\delta > 0$ avoids division by zero.
$$G(x) = G_{\mathrm{cons}}(x) \wedge G_{\mathrm{rej}}(x) \wedge G_{\mathrm{sim}}(x)$$
$$P_t^{\mathrm{ofp}}(x) = G(x) \cdot \tilde{P}(x)$$
$$M_t^{\mathrm{ofp}} = \mathbb{1}\big[\, P_t^{\mathrm{ofp}} > \tau_{\mathrm{ofp}} \,\big]$$
where $\tilde{P}(x)$ and $\tilde{I}^{\mathrm{mag}}(x)$ are obtained by warping the previous frame's $P_{t-1}(x)$ and $I_{t-1}^{\mathrm{mag}}(x)$. In this paper, the optical flow propagation threshold $\tau_{\mathrm{ofp}}$ is set to 0.5.
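Putting the three gates together, the propagation step can be sketched in NumPy as follows. This is illustrative only: the shift direction for $\Delta = (0, 1)$ and the border handling are assumptions, and the paper's thresholds ($\varepsilon = 0.5$, $\gamma = 3$, $\tau_{\mathrm{sim}} = 0.5$, $\tau_{\mathrm{ofp}} = 0.5$) are used as defaults.

```python
import numpy as np

def ofp_propagate(I_mag_t, I_mag_warp, P_warp, flow,
                  eps=0.5, gamma=3.0, tau_sim=0.5, tau_ofp=0.5, delta=1e-6):
    """Conditional propagation of the warped posterior (gates G_cons, G_rej, G_sim).

    I_mag_t, I_mag_warp, P_warp: (H, W) arrays; flow: (H, W, 2) residual flow.
    """
    # G_cons: temporal coherence of the flow magnitude between frames.
    g_cons = np.abs(I_mag_t - I_mag_warp) < eps
    # G_rej: keep only significantly dynamic pixels.
    g_rej = I_mag_t >= gamma
    # G_sim: similarity against a one-pixel local shift of the flow field.
    # The shift axis for Delta = (0, 1) is an assumption; np.roll wraps at
    # the border, which a full implementation would mask out.
    f_shift = np.roll(flow, shift=-1, axis=1)
    num = np.linalg.norm(flow - f_shift, axis=-1)
    den = np.linalg.norm(flow, axis=-1) + np.linalg.norm(f_shift, axis=-1) + delta
    g_sim = (1.0 - num / den) > tau_sim
    # Combine the gates, mask the warped posterior, and binarize.
    gate = g_cons & g_rej & g_sim
    P_t = np.where(gate, P_warp, 0.0)
    M_t = (P_t > tau_ofp).astype(np.uint8)
    return P_t, M_t
```

In the full pipeline, the resulting mask would then be fused with the instance-level and TA-SLIC masks before dynamic feature removal.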
Figure 5 demonstrates that the proposed method effectively corrects missed detections across consecutive frames even in complex scenarios. The first row shows the instance segmentation results obtained by YOLOv8-seg [50]; however, the segmentation performance of YOLOv8-seg is somewhat compromised—specifically, the masks of the moving balloon are missing in several frames. In contrast, the second row presents the compensated detection results for dynamic regions using the optical flow method, indicating that even under such challenging conditions, dynamic regions (e.g., the “moving balloon”) can still be identified through optical flow compensation. It is noteworthy that although the “person” exhibits no significant motion, resulting in the absence of its corresponding mask in the second row, the optical flow method remains effective in detecting dynamic regions, thereby enhancing the overall robustness and accuracy of the system.

8. Results and Discussion

8.1. Experimental Environment

The experimental platform in this study is equipped with an AMD Ryzen 7 4800HS 2.9 GHz CPU and an NVIDIA GeForce RTX 2060 Super 8 GB GPU, operating under Ubuntu 20.04. The dynamic SLAM system is implemented in a hybrid programming environment using Python 3.8 and C++14. To evaluate camera localization accuracy, we employ two widely adopted metrics: Absolute Trajectory Error (ATE) and Relative Pose Error (RPE).
ATE assesses global alignment between the estimated and ground-truth trajectories, reflecting overall system accuracy. RPE evaluates local motion drift by measuring translational and rotational errors between consecutive poses over time intervals, often summarized via RMSE to quantify incremental drift. RPE can be further categorized into translational RPE and rotational RPE.

8.2. Experimental Dataset

The TUM dataset [53], captured with a Microsoft Kinect sensor, is widely adopted for SLAM performance evaluation. It provides dynamic scenes categorized into two types: high-dynamic (denoted as “w”, where individuals walk around causing large motion areas) and low-dynamic (denoted as “s”, where seated persons yield only local motions). Each scene includes four camera motion patterns: XYZ: slow translation along x, y, and z axes; halfsphere: hemispherical trajectory motion; static: stationary camera; rpy: rotation in roll, pitch, and yaw.
The BONN dataset [21], acquired with an ASUS Xtion Pro LIVE camera, offers more complex dynamic environments with diverse motion sequences. It includes challenging objects often missed by standard instance segmentation models, such as bursting balloons and moving boxes, providing rigorous use cases for evaluating SLAM robustness in highly dynamic settings.
We evaluate TAS-SLAM on two widely adopted RGB-D dynamic benchmarks: the TUM RGB-D dataset and the Bonn RGB-D Dynamic dataset. Both provide synchronized RGB and depth streams along with accurate 6-degree-of-freedom (6DoF) ground-truth sensor trajectories.
The TUM RGB-D dataset is recorded using a Microsoft Kinect sensor at 30 Hz and a resolution of 640 × 480. Ground-truth camera poses are obtained from an external motion-capture system, as provided by the dataset authors. For our evaluation, we specifically select sequences from the Dynamic Objects category (e.g., fr3/sitting_, fr3/walking_). In these sequences, human subjects exhibit non-rigid motion and partially moving behaviors—such as local limb movement in sitting scenarios and full-body deformation with occlusions in walking scenarios—which align with our target scenario of handling non-rigid and partially moving dynamic objects.
The Bonn RGB-D Dynamic dataset extends this setting by providing 24 highly dynamic indoor sequences that involve people interacting with objects (e.g., manipulating boxes, playing with balloons). Ground-truth poses are captured by an OptiTrack Prime 13 motion-tracking system and are supplied in the same format as the TUM dataset. Together, these two datasets offer reliable ground truth and a diverse range of non-rigid dynamic patterns, enabling a credible and representative assessment of TAS-SLAM’s robustness and generalization in complex RGB-D dynamic environments.
For trajectory evaluation, estimated poses are first temporally aligned to the ground truth using the provided timestamps, following the standard association formats of the TUM and Bonn datasets. After synchronization, a rigid SE(3) transformation is applied to remove global coordinate frame discrepancies. We then compute standard SLAM accuracy metrics—Absolute Trajectory Error (ATE) and Relative Pose Error (RPE)—using evaluation tools consistent with prior RGB-D SLAM literature. The Root Mean Square Error (RMSE) values of ATE and RPE are reported, defined as follows:
$$\mathrm{ATE}_{\mathrm{RMSE}} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \big\lVert \mathrm{trans}\big( (T_i^{gt})^{-1}\, S\, T_i^{est} \big) \big\rVert_2^2}$$
$$\mathrm{RPE}_{\mathrm{RMSE}} = \sqrt{\frac{1}{N-\Delta} \sum_{i=1}^{N-\Delta} \Big\lVert \mathrm{trans}\Big( \big( (T_i^{gt})^{-1} T_{i+\Delta}^{gt} \big)^{-1} \big( (T_i^{est})^{-1} T_{i+\Delta}^{est} \big) \Big) \Big\rVert_2^2}$$
where $T_i^{gt}, T_i^{est} \in SE(3)$ denote the ground-truth and estimated pose of the $i$-th frame, respectively; $S$ denotes the rigid transformation after alignment (SE(3) alignment); $\Delta$ denotes the frame interval for the relative error; $\mathrm{trans}(\cdot)$ extracts the translation component of a pose; and $\lVert \cdot \rVert_2$ denotes the Euclidean norm.
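Assuming time-associated pose lists of 4 × 4 homogeneous matrices, the two metrics can be sketched directly from the formulas above. The alignment $S$ is taken as precomputed (e.g. by the Horn/Umeyama method) and defaults to identity in this minimal sketch.

```python
import numpy as np

def ate_rmse(T_gt, T_est, S=None):
    """ATE RMSE over lists of 4x4 homogeneous poses (already time-associated).
    S is the SE(3) alignment transform; identity if None."""
    S = np.eye(4) if S is None else S
    sq = [np.sum((np.linalg.inv(Tg) @ S @ Te)[:3, 3] ** 2)
          for Tg, Te in zip(T_gt, T_est)]
    return float(np.sqrt(np.mean(sq)))

def rpe_trans_rmse(T_gt, T_est, d=1):
    """Translational RPE RMSE with frame interval d (the Delta in the text)."""
    sq = []
    for i in range(len(T_gt) - d):
        dG = np.linalg.inv(T_gt[i]) @ T_gt[i + d]    # ground-truth increment
        dE = np.linalg.inv(T_est[i]) @ T_est[i + d]  # estimated increment
        E = np.linalg.inv(dG) @ dE                   # relative pose error
        sq.append(np.sum(E[:3, 3] ** 2))
    return float(np.sqrt(np.mean(sq)))
```

Note that a constant translational offset between the trajectories inflates ATE but leaves translational RPE at zero, which is why both metrics are reported.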

8.3. Comparison Experiment

Experimental results are compared with those of the classical SLAM algorithm ORB-SLAM2 [3], advanced dynamic SLAM methods including DS-SLAM [5] and Dyna-SLAM [6]—which leverage learning and geometric structures—as well as state-of-the-art dynamic SLAM systems such as Blitz-SLAM [12] and DGS-SLAM [10]. Both ATE and RPE are reported in terms of RMSE. The best and second-best performance metrics are highlighted in bold green and blue, respectively.
Table 1 presents the global trajectory consistency evaluated via ATE, Table 2 summarizes the average translational drift per second, and Table 3 reports the average rotational drift per second based on RPE.
Analysis of the results in Table 1, Table 2 and Table 3 demonstrates that TAS-SLAM outperforms current state-of-the-art SLAM algorithms designed for dynamic environments. It achieves significant accuracy improvements in highly dynamic scenes while also showing superior performance in low-dynamic scenarios. Notably, TAS-SLAM attains the best performance in sequences such as s/half, w/xyz, and w/half, and consistently ranks first or second across other sequences. These results underscore the robustness and effectiveness of TAS-SLAM in dynamic settings, as validated on the TUM dataset.
In Table 1, for low-dynamic sequences where the two individuals exhibit only slight and discontinuous motions, ORB-SLAM2 [3] demonstrates considerable robustness by effectively filtering outliers via the RANSAC algorithm. Nevertheless, the proposed method achieves approximately 20.3% and 25.2% improvement in ATE over ORB-SLAM2 in the s/half and s/xyz sequences, respectively, while maintaining comparable performance in the s/static and s/rpy sequences. This demonstrates that when residual motion is weak or intermittent, the proposed masking strategy avoids excessive suppression of static features, thereby preventing degradation in pose optimization—a key advantage in low-dynamic scenes.
In highly dynamic sequences, however, the optimization strategy of ORB-SLAM2 fails to handle large-area dynamic objects adequately. In contrast, our algorithm successfully mitigates the influence of moving objects on pose estimation, improving ATE by over 90% across all high-dynamic sequences compared to ORB-SLAM2. We attribute this improvement primarily to the residual-flow-guided instance reasoning mechanism, which detects motions that are inconsistent with the global rigid model and removes dynamic features prior to pose estimation. By comparison, purely geometric outlier rejection methods—such as the RANSAC used in ORB-SLAM2—often prove inadequate in the presence of large-area dynamic interference. Moreover, when compared to state-of-the-art dynamic SLAM systems such as Blitz-SLAM [12] and DGS-SLAM [10], our approach accounts more comprehensively for the diversity of dynamic targets, resulting in superior overall trajectory consistency. However, in sequences where the motion is dominated by very small or distant objects, the performance improvement becomes less pronounced. This is likely because the instance segmentation or optical flow estimation may fail to capture subtle motion cues, resulting in a small proportion of dynamic features remaining unfiltered.
As shown in Table 2 and Table 3, TAS-SLAM demonstrates superior accuracy in both translational and rotational RPE. Compared to ORB-SLAM2, it achieves an overall reduction of 49.78% and 46.31% in the RMSE of translational and rotational RPE, respectively. When evaluated against state-of-the-art dynamic SLAM systems including Blitz-SLAM and DGS-SLAM, the proposed method also exhibits a consistent decrease in RPE RMSE. The advantage becomes more pronounced in high-dynamic scenarios, where temporal inconsistency in masks typically leads to inter-frame drift. In contrast, the improvement is less significant in low-dynamic sequences, as the optical-flow propagation mechanism is seldom triggered and the residual flow approaches the level of noise.
The experiments confirm that the proposed approach improves both ATE and RPE performance of SLAM systems in dynamic environments. These results indicate that TAS-SLAM offers enhanced robustness and stability.
As presented in Table 4, the proposed method achieves optimal or suboptimal performance across multiple sequences of the BONN dataset, demonstrating high localization accuracy in both low- and high-dynamic scenes. Although performance is comparable to other algorithms in certain low-dynamic sequences, our approach exhibits significantly superior results in high-dynamic environments. The ability of TAS-SLAM to generalize to dynamic patterns beyond those in the TUM dataset is evident, especially under challenging conditions involving non-rigid local motions and partial object movements.
StaticFusion [21] and ReFusion [20] show limited capability in both scenarios, particularly in high-dynamic sequences where their RMSE values substantially exceed those of other methods. While Dyna-SLAM [6] delivers competitive accuracy in some low-dynamic cases, it underperforms in high-dynamic scenes such as “balloon”, “place_no_box”, and “move_no_box”, where our algorithm demonstrates considerably stronger robustness. A plausible reason is that these sequences contain deformable objects and undefined dynamic regions; methods relying on category-limited priors or frame-wise masks may either over-remove usable static features or retain dynamic ones, while our TA-SLIC refinement and conditional temporal propagation jointly preserve static structure and stabilize dynamic regions.
These challenging scenarios often contain non-rigid object local motions and undefined dynamic objects, which can cause conventional methods to erroneously exclude usable static features or incorporate dynamic features into pose estimation, leading to increased drift. The results confirm that the proposed improvements effectively enhance SLAM performance in complex dynamic environments.
As illustrated in Figure 6, a qualitative analysis of ATE and RPE is presented. Each subfigure displays trajectories in three colors: black represents the ground-truth trajectory provided by the TUM dataset, blue indicates the estimated trajectory, and red lines denote the pose errors between corresponding ground-truth and estimated points. Shorter red lines indicate lower trajectory estimation error. Visual comparison clearly demonstrates that TAS-SLAM achieves higher accuracy and robustness, with consistently smaller deviations across sequences. This qualitative evidence is consistent with the quantitative ATE/RPE trends and further illustrates where our method delivers the most benefit—namely, sequences with large non-rigid dynamics and frequent occlusions.

8.4. Ablation Study

To evaluate the contribution of individual components, as shown in Table 5, we conduct an ablation study on the high dynamic sequence from the TUM dataset using the following four algorithmic variants:
ORB-SLAM2: the baseline system without dynamic object handling;
Instance + TA-SLIC: dynamic instances identified by YOLO are refined solely using temporally adaptive superpixel segmentation;
Instance + OFP: instance masks are propagated only through optical flow consistency without superpixel refinement;
TAS-SLAM (full): the complete proposed system integrating both TA-SLIC and optical flow propagation under residual reliability and fusion mechanisms.
As anticipated, ORB-SLAM2 exhibits significant performance degradation in highly dynamic sequences, with pronounced increases in both ATE and RPE due to contamination from dynamic features. The Instance + TA-SLIC variant substantially enhances boundary accuracy and effectively suppresses non-rigid local motions, resulting in improved localization performance. However, this version exhibits noticeable mask fluctuations across frames, leading to temporal inconsistencies. The Instance + OFP configuration achieves smoother temporal continuity by propagating prior masks through optical flow, which stabilizes performance in low-dynamic or partially dynamic scenes. Nevertheless, in the absence of spatial refinement, the propagated masks frequently encroach into static regions, thereby reducing overall accuracy.
Notably, the full TAS-SLAM system attains the best overall performance by integrating the spatial precision of TA-SLIC with the temporal coherence provided by OFP. The fused masks maintain fine-grained object boundaries while eliminating inter-frame jitter, yielding the lowest ATE and RPE. These results confirm that TA-SLIC and OFP offer complementary advantages, and their synergistic integration is essential for achieving robust SLAM performance in complex dynamic environments.

8.5. Runtime Analysis and Discussion

We evaluate the computational performance of TAS-SLAM using RGB-D inputs at a resolution of 640 × 480 on a GPU platform. The current prototype operates on a single-threaded, serial processing pipeline without multi-threading optimizations; consequently, the reported runtimes represent a conservative estimate. The TA-SLIC refinement step requires 40 ms per frame, and rigid-flow estimation takes 30 ms. The inference times for the deep networks—LiteFlowNet3 for dense optical flow and YOLOv8-seg for instance segmentation—are 20 ms and 18 ms per frame, respectively. Additional modules, including the instance-level motion classifier and optical flow propagation (OFP), contribute 13 ms and 12 ms, while mask fusion is relatively lightweight at 3 ms. Overall, TAS-SLAM processes each frame in 136 ms on average, which corresponds to approximately 7.35 frames per second (FPS).
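The per-module budget above can be tallied directly; module names and timings are taken from the text, and this is a bookkeeping sketch only.

```python
# Per-frame timing budget of the serial prototype as reported in the text
# (worst case: all modules active on a 640 x 480 RGB-D frame), in milliseconds.
budget_ms = {
    "TA-SLIC refinement": 40,
    "rigid-flow estimation": 30,
    "LiteFlowNet3 dense optical flow": 20,
    "YOLOv8-seg instance segmentation": 18,
    "instance-level motion classifier": 13,
    "optical flow propagation (OFP)": 12,
    "mask fusion": 3,
}
total_ms = sum(budget_ms.values())   # 136 ms per frame
fps = 1000.0 / total_ms              # ~7.35 frames per second
```

Running the two heaviest branches (dense flow and instance segmentation) in parallel would remove up to 38 ms from this serial total, which is the engineering direction discussed below.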
Although the present implementation does not achieve full real-time performance, the primary computational overhead stems from the TA-SLIC and rigid-flow computation stages. It is important to note that TA-SLIC and OFP are activated only when significant non-rigid motion is detected; therefore, the average runtime over the entire sequence reflects a worst-case scenario. Furthermore, the two main processing branches illustrated in Figure 1—dense optical flow estimation and instance segmentation—are inherently parallelizable. Employing multi-threaded execution and adopting lighter network variants are anticipated to substantially reduce system latency. Future work will focus on these engineering optimizations, while this study primarily demonstrates the effectiveness of the proposed dynamic masking strategy for achieving robust SLAM performance in highly dynamic environments.

9. Conclusions and Future Work

This paper proposes TAS-SLAM, a visual SLAM system that integrates Instance-Level Motion Classification and Temporally Adaptive Super-Pixel Segmentation to improve localization accuracy in complex dynamic environments. The approach combines the YOLOv8-seg model with the TA-SLIC super-pixel segmentation algorithm to detect dynamic regions, and introduces an optical flow propagation mechanism to compensate for missed detections and refine segmentation masks. This ensures precise removal of dynamic feature points while preserving reliable static features for pose estimation. Extensive evaluations on the TUM and BONN datasets demonstrate that TAS-SLAM achieves superior localization accuracy compared to state-of-the-art systems including DS-SLAM and Dyna-SLAM.
We explicitly acknowledge that the current TAS-SLAM prototype does not operate in full real-time. Owing to its single-threaded serial implementation and the computational overhead of the TA-SLIC and rigid-flow estimation modules, the system processes frames at an average runtime of approximately 136 ms (≈7.35 FPS).
We delineate that TAS-SLAM can reliably serve as a baseline or target system in RGB-D indoor dynamic environments characterized by non-rigid or partially moving objects, where robustness to local non-rigid motion is essential. However, since the method depends on dense depth input for residual-flow reasoning, its current formulation is less suitable for depth-free scenarios or large-scale outdoor settings lacking reliable depth support.
Future work will focus primarily on improving efficiency and broadening applicability. First, we plan to parallelize the two naturally independent branches—dense optical-flow estimation and instance segmentation (illustrated in Figure 1)—and integrate lightweight network variants with inference-acceleration tools (e.g., TensorRT/ONNX) to reduce per-frame computational overhead. Second, redundant computation on weakly dynamic frames will be minimized through conditionally triggered or periodically updated dynamic masking. Concurrently, we intend to extend the framework to stereo and monocular setups by incorporating learnable depth priors, thereby enhancing performance in the absence of true depth input.

Author Contributions

Conceptualization, Yiming Li and Liuwei Lu; Methodology, Yiming Li, Liuwei Lu and Guangming Guo; Software, Liuwei Lu, Luying Na and Xianpu Liang; Validation, Liuwei Lu, Luying Na and Xianpu Liang; Formal analysis, Yiming Li and Liuwei Lu; Investigation, Guangming Guo, Pengjiang Wang and Qi An; Resources, Liuwei Lu, Luying Na, Xianpu Liang, Pengjiang Wang and Guangming Guo; Data curation, Liuwei Lu and Guangming Guo; Writing—original draft, Liuwei Lu; Writing—review & editing, Yiming Li and Liuwei Lu; Visualization, Luying Na and Xianpu Liang; Supervision, Yiming Li; Project administration, Yiming Li and Liuwei Lu; Funding acquisition, Peng Su and Pengjiang Wang. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Key Research and Development and Achievement Transformation Project of Inner Mongolia Autonomous Region, grant No. 2025YFHH0001-01; Young Backbone Teachers Support Plan of Beijing Information Science & Technology University, grant No. YBT 202405.

Data Availability Statement

Publicly available datasets were analyzed in this study. The TUM dataset can be found here: https://cvg.cit.tum.de/data/datasets/rgbd-dataset/download, accessed on 30 September 2011. The BONN dataset can be found here: https://www.ipb.uni-bonn.de/data/rgbd-dynamic-dataset/index.html, accessed on 27 October 2019.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Engel, J.; Koltun, V.; Cremers, D. Direct sparse odometry. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 611–625. [Google Scholar] [CrossRef]
  2. Engel, J.; Schöps, T.; Cremers, D. LSD-SLAM: Large-scale direct monocular SLAM. In Computer Vision—ECCV 2014; Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2014; Volume 8690, pp. 834–849. [Google Scholar]
  3. Mur-Artal, R.; Tardos, J.D. ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Trans. Robot. 2017, 33, 1255–1262. [Google Scholar] [CrossRef]
  4. Fischler, M.A.; Bolles, R.C. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 1981, 24, 381–395. [Google Scholar] [CrossRef]
  5. Yu, C.; Liu, Z.; Liu, X.-J.; Xie, F.; Yang, Y.; Wei, Q.; Fei, Q. DS-SLAM: A semantic visual SLAM towards dynamic environments. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; IEEE: Madrid, Spain, 2018; pp. 1168–1174. [Google Scholar]
  6. Bescos, B.; Facil, J.M.; Civera, J.; Neira, J. DynaSLAM: Tracking, mapping, and inpainting in dynamic scenes. IEEE Robot. Autom. Lett. 2018, 3, 4076–4083. [Google Scholar] [CrossRef]
  7. Ran, T.; Yuan, L.; Zhang, J.; Tang, D.; He, L. RS-SLAM: A robust semantic SLAM in dynamic environments based on RGB-D sensor. IEEE Sens. J. 2021, 21, 20657–20664. [Google Scholar] [CrossRef]
  8. He, J.; Li, M.; Wang, Y.; Wang, H. OVD-SLAM: An online visual SLAM for dynamic environments. IEEE Sens. J. 2023, 23, 13210–13219. [Google Scholar] [CrossRef]
  9. Liu, Y.; Miura, J. RDS-SLAM: Real-time dynamic SLAM using semantic segmentation methods. IEEE Access 2021, 9, 23772–23785. [Google Scholar] [CrossRef]
  10. Yan, L.; Hu, X.; Zhao, L.; Chen, Y.; Wei, P.; Xie, H. DGS-SLAM: A fast and robust RGBD SLAM in dynamic environments combined by geometric and semantic information. Remote Sens. 2022, 14, 795. [Google Scholar] [CrossRef]
  11. Zhou, Y.; Tao, F.; Fu, Z.; Zhu, L.; Ma, H. RVD-SLAM: A real-time visual SLAM toward dynamic environments based on sparsely semantic segmentation and outlier prior. IEEE Sens. J. 2023, 23, 30773–30785. [Google Scholar] [CrossRef]
  12. Fan, Y.; Zhang, Q.; Tang, Y.; Liu, S.; Han, H. Blitz-SLAM: A semantic SLAM in dynamic environments. Pattern Recognit. 2022, 121, 108225. [Google Scholar] [CrossRef]
  13. Chen, J.; Xie, F.; Huang, L.; Yang, J.; Liu, X.; Shi, J. A robot pose estimation optimized visual SLAM algorithm based on CO-HDC instance segmentation network for dynamic scenes. Remote Sens. 2022, 14, 2114. [Google Scholar] [CrossRef]
  14. Miao, S.; Liu, X.; Ju, Y.; Qin, B. A visual SLAM combined with semantic-optical flow towards dynamic environment. In Proceedings of the 2021 China Automation Congress (CAC), Beijing, China, 22–24 October 2021; IEEE: Beijing, China, 2021; pp. 328–333. [Google Scholar]
  15. Wang, S.; Gou, G.; Sui, H.; Zhou, Y.; Zhang, H.; Li, J. CDSFusion: Dense semantic SLAM for indoor environment using CPU computing. Remote Sens. 2022, 14, 979. [Google Scholar] [CrossRef]
  16. Hu, X.; Zhang, Y.; Cao, Z.; Ma, R.; Wu, Y.; Deng, Z.; Sun, W. CFP-SLAM: A real-time visual SLAM based on coarse-to-fine probability in dynamic environments. Remote Sens. 2022, 14, 5142. [Google Scholar]
  17. Jiao, S.; Li, Y.; Shan, Z. DFS-SLAM: A visual SLAM algorithm for deep fusion of semantic information. IEEE Robot. Autom. Lett. 2024, 9, 11794–11801. [Google Scholar] [CrossRef]
  18. Zhang, C.; Zhang, R.; Jin, S.; Yi, X. PFD-SLAM: A new RGB-D SLAM for dynamic indoor environments based on non-prior semantic segmentation. Remote Sens. 2022, 14, 2445. [Google Scholar] [CrossRef]
  19. He, X.; Ding, L.; Lan, Y. DSK-SLAM: A dynamic SLAM system combining semantic information and a novel geometric method based on k-means clustering. IEEE Sens. J. 2024, 24, 23265–23279. [Google Scholar] [CrossRef]
  20. Palazzolo, E.; Behley, J.; Lottes, P.; Giguère, P.; Stachniss, C. ReFusion: 3D reconstruction in dynamic environments for RGB-D cameras exploiting residuals. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2019), Macau, China, 3–8 November 2019; IEEE: Macau, China, 2019; pp. 7855–7862. [Google Scholar]
  21. Scona, R.; Jaimez, M.; Petillot, Y.R.; Fallon, M.; Cremers, D. StaticFusion: Background reconstruction for dense RGB-D SLAM in dynamic environments. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, 21–25 May 2018; IEEE: Brisbane, Australia, 2018; pp. 3849–3856. [Google Scholar]
  22. Sun, L.; Bian, J.-W.; Zhan, H.; Yin, W.; Reid, I.; Shen, C. SC-DepthV3: Robust self-supervised monocular depth estimation for dynamic scenes. arXiv 2023, arXiv:2301.12345. [Google Scholar] [CrossRef]
  23. Cheng, J.; Wang, Z.; Zhou, H.; Li, L.; Yao, J. DM-SLAM: A feature-based SLAM system for rigid dynamic scenes. ISPRS Int. J. Geo-Inf. 2020, 9, 202. [Google Scholar] [CrossRef]
  24. Liao, G.; Yin, F. DOR-SLAM: A visual SLAM based on dynamic object removal for dynamic environments. In Proceedings of the 2023 China Automation Congress (CAC), Chongqing, China, 17–19 November 2023; IEEE: Chongqing, China, 2023; pp. 1777–1782. [Google Scholar]
  25. Wan, Y.; Gao, W.; Han, S.; Wu, Y. Dynamic object-aware monocular visual odometry with local and global information aggregation. In Proceedings of the IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates, 25–28 September 2020; IEEE: Abu Dhabi, United Arab Emirates, 2020; pp. 603–607. [Google Scholar]
  26. Labsir, S.; Pages, G.; Vivet, D. Lie group modelling for an EKF-based monocular SLAM algorithm. Remote Sens. 2022, 14, 571. [Google Scholar] [CrossRef]
  27. Slowak, P.; Kaniewski, P. Stratified particle filter monocular SLAM. Remote Sens. 2021, 13, 3233. [Google Scholar] [CrossRef]
  28. Wan, B.S.; Chen, W.; He, L.; Zhang, H. Data driven optical flow prediction for improving direct method visual SLAM systems. In Proceedings of the IEEE International Conference on Robotics and Biomimetics (ROBIO), Jinghong, China, 5–9 December 2022; IEEE: Jinghong, China, 2022; pp. 1017–1022. [Google Scholar]
  29. Tarasov, A.; Nikiforov, M. Detection and tracking of moving objects optical flow based. In Proceedings of the International Russian Automation Conference (RusAutoCon), Sochi, Russia, 8–14 September 2024; IEEE: Sochi, Russia, 2024; pp. 121–126. [Google Scholar]
  30. Muller, P.; Savakis, A. Flowdometry: An optical flow and deep learning based approach to visual odometry. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, Santa Rosa, CA, USA, 27–29 March 2017; IEEE: Santa Rosa, CA, USA, 2017; pp. 624–631. [Google Scholar]
  31. Pandey, T.; Pena, D.; Byrne, J.; Moloney, D. Leveraging deep learning for visual odometry using optical flow. Sensors 2021, 21, 1313. [Google Scholar] [CrossRef]
  32. Esparza, D.; Flores, G. The STDyn-SLAM: A stereo vision and semantic segmentation approach for VSLAM in dynamic outdoor environments. IEEE Access 2022, 10, 18201–18209. [Google Scholar] [CrossRef]
  33. Qin, L.; Wu, C.; Chen, Z.; Kong, X.; Lv, Z.; Zhao, Z. RSO-SLAM: A robust semantic visual SLAM with optical flow in complex dynamic environments. IEEE Trans. Intell. Transp. Syst. 2024, 25, 14669–14684. [Google Scholar] [CrossRef]
  34. Wang, H.; Zhang, X.; Yang, S.; Zhang, W.; Li, J.; Wang, H. Video anomaly detection via successive image frame prediction leveraging optical flows. In Proceedings of the International Conference on Pattern Recognition and Artificial Intelligence (PRAI), Chengdu, China, 1–3 June 2022; IEEE: Chengdu, China, 2022; pp. 643–650. [Google Scholar]
  35. Baker, S.; Matthews, I. Lucas-Kanade 20 years on: A unifying framework. Int. J. Comput. Vis. 2004, 56, 221–255. [Google Scholar] [CrossRef]
  36. Meinhardt-Llopis, E.; Sánchez Pérez, J.; Kondermann, D. Horn-Schunck optical flow with a multi-scale strategy. Image Process. Online 2013, 3, 151–172. [Google Scholar] [CrossRef]
  37. Fischer, P.; Dosovitskiy, A.; Ilg, E.; Häusser, P.; Hazırbaş, C.; Golkov, V.; van der Smagt, P.; Cremers, D.; Brox, T. FlowNet: Learning optical flow with convolutional networks. arXiv 2015, arXiv:1504.06852. [Google Scholar] [CrossRef]
  38. Ilg, E.; Mayer, N.; Saikia, T.; Keuper, M.; Dosovitskiy, A.; Brox, T. FlowNet 2.0: Evolution of optical flow estimation with deep networks. arXiv 2016, arXiv:1612.01925. [Google Scholar] [CrossRef]
  39. Sun, D.; Yang, X.; Liu, M.-Y.; Kautz, J. PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; IEEE: Salt Lake City, UT, USA, 2018; pp. 8934–8943. [Google Scholar]
  40. Teed, Z.; Deng, J. RAFT: Recurrent all-pairs field transforms for optical flow. In Proceedings of the European Conference Computer Vision (ECCV), Cham, Switzerland, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 402–419. [Google Scholar]
  41. Campos, C.; Elvira, R.; Rodriguez, J.J.G.; Montiel, J.M.M.; Tardos, J.D. ORB-SLAM3: An accurate open-source library for visual, visual–inertial, and multimap SLAM. IEEE Trans. Robot. 2021, 37, 1874–1890. [Google Scholar] [CrossRef]
  42. Qin, T.; Li, P.; Shen, S. VINS-Mono: A robust and versatile monocular visual-inertial state estimator. IEEE Trans. Robot. 2018, 34, 1004–1020. [Google Scholar] [CrossRef]
  43. Qin, T.; Pan, J.; Cao, S.; Shen, S. A general optimization-based framework for local odometry estimation with multiple sensors. arXiv 2019, arXiv:1901.03638. [Google Scholar] [CrossRef]
  44. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef]
  45. He, K.; Gkioxari, G.; Dollar, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; IEEE: Venice, Italy, 2017; pp. 2980–2988. [Google Scholar]
  46. Zhang, J.; Henein, M.; Mahony, R.; Ila, V. VDO-SLAM: A visual dynamic object-aware SLAM system. arXiv 2021, arXiv:2008.09730. [Google Scholar]
  47. Liu, Y.; Zhou, Z. Optical flow-based stereo visual odometry with dynamic object detection. IEEE Trans. Comput. Soc. Syst. 2023, 10, 3556–3568. [Google Scholar] [CrossRef]
  48. Zhuang, Y.; Jia, P.; Liu, Z.; Li, L.; Wu, C.; Lu, X.; Cui, W.; Liu, Z. Amos-SLAM: An anti-dynamics two-stage RGB-D SLAM approach. IEEE Trans. Instrum. Meas. 2024, 73, 5003410. [Google Scholar] [CrossRef]
  49. Hui, T.-W.; Loy, C.C. LiteFlowNet3: Resolving correspondence ambiguity for more accurate optical flow estimation. arXiv 2020, arXiv:2007.09319. [Google Scholar]
  50. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: Las Vegas, NV, USA, 2016; pp. 779–788. [Google Scholar]
  51. Ma, W.-C.; Wang, S.; Hu, R.; Xiong, Y.; Urtasun, R. Deep rigid instance scene flow. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; IEEE: Long Beach, CA, USA, 2019; pp. 3609–3617. [Google Scholar]
  52. Achanta, R.; Shaji, A.; Smith, K.; Lucchi, A.; Fua, P.; Süsstrunk, S. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 34, 2274–2282. [Google Scholar] [CrossRef]
  53. Sturm, J.; Engelhard, N.; Endres, F.; Burgard, W.; Cremers, D. A benchmark for the evaluation of RGB-D SLAM systems. In Proceedings of the IEEE International Conference on Intelligent Robots and Systems (IROS), Vilamoura-Algarve, Portugal, 7–12 October 2012; IEEE: Vilamoura-Algarve, Portugal, 2012; pp. 573–580. [Google Scholar]
Figure 1. Overview of TAS-SLAM.
Figure 2. Optical flow visualization results.
Figure 3. Instance-level motion classifier results. Each row shows a sample frame. From left to right: (a) input RGB image, (b) YOLO instance masks, (c) residual flow magnitude heatmap, and (d) classification results where instances are labeled as rigid-consistent global motion (red), non-rigid local dynamic motion (yellow), or static (green).
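The per-instance decision illustrated in Figure 3 can be sketched as a threshold on how many pixels inside each YOLO instance mask carry a large residual flow: uniformly small residuals indicate a static instance, uniformly large residuals a rigid-consistent global motion, and a partial exceedance a non-rigid local motion. The function and threshold names below (`classify_instances`, `t_static`, `frac_dynamic`) are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def classify_instances(residual_flow, instance_masks,
                       t_static=0.5, frac_dynamic=0.6):
    """Label each instance as 'static', 'rigid', or 'non_rigid' from the
    per-pixel residual flow field (H x W x 2).

    A mask whose residual flow is uniformly small is static; one that is
    uniformly large moves rigidly as a whole; one where only part of the
    pixels exceed the threshold exhibits non-rigid local motion.
    """
    mag = np.linalg.norm(residual_flow, axis=-1)  # H x W residual magnitude
    labels = {}
    for inst_id, mask in instance_masks.items():
        over = mag[mask] > t_static               # pixels inconsistent with ego-motion
        ratio = over.mean() if over.size else 0.0
        if ratio < 1.0 - frac_dynamic:
            labels[inst_id] = "static"
        elif ratio > frac_dynamic:
            labels[inst_id] = "rigid"             # globally consistent motion
        else:
            labels[inst_id] = "non_rigid"         # only a local part moves
    return labels
```

In practice the thresholds would be tuned per dataset; the fraction-based rule is just one way to separate whole-body motion from local limb motion.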
Figure 4. TA-SLIC results. (a–d) Each picture shows a sample frame, including instance masks (blue) and TA-SLIC segment masks (pink). The proposed super-pixel segmentation demonstrates high sensitivity to both rigid and non-rigid motion regions.
Figure 5. Optical flow propagation results.
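The missed-detection correction in Figure 5 rests on warping the previous frame's dynamic mask into the current frame along the optical flow, then fusing the warped mask with the current detections. A minimal forward-warping sketch (the helper name `propagate_mask` and the nearest-pixel rounding are assumptions; a real implementation would also dilate the warped mask and take its union with the current mask):

```python
import numpy as np

def propagate_mask(prev_mask, flow):
    """Warp the previous frame's boolean dynamic mask (H x W) into the
    current frame using forward optical flow (H x W x 2, (dx, dy))."""
    h, w = prev_mask.shape
    ys, xs = np.nonzero(prev_mask)
    xs_new = np.clip(np.round(xs + flow[ys, xs, 0]).astype(int), 0, w - 1)
    ys_new = np.clip(np.round(ys + flow[ys, xs, 1]).astype(int), 0, h - 1)
    cur = np.zeros_like(prev_mask)
    cur[ys_new, xs_new] = True   # pixel lands where the flow carries it
    return cur
```

The propagated mask recovers dynamic regions that the detector misses in the current frame, e.g. `corrected = detected_mask | propagate_mask(prev_mask, flow)`.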
Figure 6. ATE and RPE of TAS-SLAM on the TUM dataset in RGB-D mode.
Table 1. Comparison of ATE (RMSE) on the TUM dataset [m].

| | Sequence | ORB-SLAM2 | DS-SLAM | Dyna-SLAM | Blitz-SLAM | DGS-SLAM | Ours |
|---|---|---|---|---|---|---|---|
| Low Dynamic | fr3/s/xyz | 0.0092 | / | 0.0145 | 0.0148 | 0.0092 | 0.0087 |
| | fr3/s/half | 0.0192 | / | 0.0186 | 0.0160 | 0.0182 | 0.0153 |
| | fr3/s/static | 0.0087 | 0.0065 | / | / | 0.0057 | 0.0065 |
| | fr3/s/rpy | 0.0195 | / | / | / | / | 0.0228 |
| High Dynamic | fr3/w/xyz | 0.7214 | 0.0247 | 0.0164 | 0.0153 | 0.0156 | 0.0145 |
| | fr3/w/half | 0.4667 | 0.0303 | 0.0296 | 0.0256 | 0.0301 | 0.0254 |
| | fr3/w/static | 0.3872 | 0.0081 | 0.0068 | 0.0102 | 0.0059 | 0.0087 |
| | fr3/w/rpy | 0.7842 | 0.4442 | 0.0354 | 0.0354 | 0.0301 | 0.0302 |
| Top-2 Count (Top 1) | | 2(1) | 1(0) | 1(0) | 3(0) | 3(3) | 6(4) |
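ATE values such as those in Table 1 are the RMSE of the point-wise position error after rigidly aligning the estimated trajectory to ground truth (Horn/Umeyama alignment, as in the TUM benchmark tooling [53]). A compact sketch over N×3 position arrays; the function name `ate_rmse` is illustrative:

```python
import numpy as np

def ate_rmse(gt, est):
    """Absolute trajectory error (RMSE) after a rigid Horn/Kabsch
    alignment of the estimated positions (N x 3) to ground truth."""
    gt, est = np.asarray(gt, float), np.asarray(est, float)
    mu_g, mu_e = gt.mean(0), est.mean(0)
    # Cross-covariance between centred point sets, then SVD for rotation
    U, _, Vt = np.linalg.svd((gt - mu_g).T @ (est - mu_e))
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])  # avoid reflection
    R = U @ S @ Vt
    aligned = (R @ (est - mu_e).T).T + mu_g
    return np.sqrt(np.mean(np.sum((gt - aligned) ** 2, axis=1)))
```

For a trajectory that differs from ground truth only by a rigid motion, the aligned error is zero, so ATE isolates structural drift rather than the arbitrary choice of world frame.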
Table 2. Comparison of RPE in translational drift (RMSE) on the TUM dataset [m/s].

| | Sequence | ORB-SLAM2 | DS-SLAM | Dyna-SLAM | Blitz-SLAM | DGS-SLAM | Ours |
|---|---|---|---|---|---|---|---|
| Low Dynamic | fr3/s/xyz | 0.0117 | / | 0.0142 | 0.0144 | 0.0134 | 0.0116 |
| | fr3/s/half | 0.0231 | / | 0.0239 | 0.0165 | 0.0276 | 0.0162 |
| | fr3/s/static | 0.0090 | 0.0078 | / | / | 0.0082 | 0.0079 |
| | fr3/s/rpy | 0.0245 | / | / | / | / | 0.0292 |
| High Dynamic | fr3/w/xyz | 0.3944 | 0.0333 | 0.0217 | 0.0197 | 0.0228 | 0.0183 |
| | fr3/w/half | 0.3480 | 0.0297 | 0.0284 | 0.0253 | 0.0366 | 0.0262 |
| | fr3/w/static | 0.2349 | 0.0102 | 0.0089 | 0.0129 | 0.0101 | 0.0102 |
| | fr3/w/rpy | 0.4582 | 0.1503 | 0.0448 | 0.0473 | 0.0432 | 0.0412 |
| Top-2 Count (Top 1) | | 2(1) | 1(1) | 1(1) | 3(1) | 2(0) | 7(4) |
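Unlike ATE, the RPE of Tables 2 and 3 measures local drift: for each pair of poses a fixed interval apart, the relative motion of the estimate is compared with that of the ground truth, and the translational (or rotational) part of the discrepancy is accumulated as RMSE. A sketch over lists of 4×4 homogeneous poses (the timestamp matching and per-second normalization of the benchmark script [53] are omitted; `rpe_trans_rmse` is an assumed name):

```python
import numpy as np

def rpe_trans_rmse(gt_poses, est_poses, delta=1):
    """RMSE of translational relative pose error over a fixed frame
    interval `delta`, given matched lists of 4x4 pose matrices."""
    errs = []
    for i in range(len(gt_poses) - delta):
        dg = np.linalg.inv(gt_poses[i]) @ gt_poses[i + delta]   # true relative motion
        de = np.linalg.inv(est_poses[i]) @ est_poses[i + delta]  # estimated relative motion
        e = np.linalg.inv(dg) @ de                               # discrepancy
        errs.append(np.linalg.norm(e[:3, 3]))                    # translational part
    return np.sqrt(np.mean(np.square(errs)))
```

The rotational variant of Table 3 would instead extract the angle of the rotation block `e[:3, :3]`.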
Table 3. Comparison of RPE in rotational drift (RMSE) on the TUM dataset [°/s].

| | Sequence | ORB-SLAM2 | DS-SLAM | Dyna-SLAM | Blitz-SLAM | DGS-SLAM | Ours |
|---|---|---|---|---|---|---|---|
| Low Dynamic | fr3/s/xyz | 0.4890 | / | 0.5042 | 0.5024 | 0.5938 | 0.4766 |
| | fr3/s/half | 0.6015 | / | 0.7045 | 0.5981 | 0.7876 | 0.6158 |
| | fr3/s/static | 0.2850 | 0.2735 | / | / | 0.3206 | 0.2704 |
| | fr3/s/rpy | 0.7772 | / | / | / | / | 0.7659 |
| High Dynamic | fr3/w/xyz | 7.7846 | 0.8266 | 0.6284 | 0.6132 | 0.6425 | 0.6218 |
| | fr3/w/half | 7.2138 | 0.8142 | 0.7842 | 0.7879 | 0.8848 | 0.7530 |
| | fr3/w/static | 4.1856 | 0.2690 | 0.2612 | 0.3038 | 0.2639 | 0.2760 |
| | fr3/w/rpy | 8.8923 | 3.0042 | 0.9894 | 1.0841 | 0.9213 | 0.9931 |
| Top-2 Count (Top 1) | | 3(0) | 1(0) | 3(1) | 2(2) | 2(1) | 5(4) |
Table 4. Comparison of ATE (RMSE) on the BONN dataset [m].

| | Sequence | ORB-SLAM2 | StaticFusion | ReFusion | Dyna-SLAM | Ours |
|---|---|---|---|---|---|---|
| Low Dynamic | balloon_tracking | 0.0361 | 0.2210 | 0.3020 | 0.0418 | 0.0313 |
| | balloon_tracking2 | 0.0570 | 0.3660 | 0.3220 | 0.0311 | 0.0471 |
| | kidnapping_box | 0.0267 | 0.3360 | 0.1480 | 0.0296 | 0.0283 |
| | kidnapping_box2 | 0.0253 | 0.2630 | 0.1610 | 0.0240 | 0.0303 |
| | moving_no_box2 | 0.1805 | 0.3640 | 0.1790 | 0.0297 | 0.0364 |
| | placing_no_box2 | 0.0306 | 0.1770 | 0.1410 | 0.0199 | 0.0256 |
| | removing_no_box | 0.0161 | 0.1360 | 0.0410 | 0.0166 | 0.0159 |
| | removing_no_box2 | 0.0228 | 0.1290 | 0.1110 | 0.0208 | 0.0238 |
| High Dynamic | balloon | 0.2173 | 0.2330 | 0.1750 | 0.0302 | 0.0293 |
| | balloon2 | 0.4711 | 0.2930 | 0.2540 | 0.0248 | 0.0285 |
| | crowd | 0.9445 | 3.5860 | 0.2040 | 0.0163 | 0.0281 |
| | crowd2 | 1.3835 | 0.2150 | 0.1550 | 0.0261 | 0.0360 |
| | crowd3 | 1.1615 | 0.1680 | 0.1370 | 0.0383 | 0.0376 |
| | moving_no_box | 0.6081 | 0.1410 | 0.0710 | 0.0282 | 0.0227 |
| | person_tracking | 0.7205 | 0.4840 | 0.2890 | 0.0609 | 0.0426 |
| | person_tracking2 | 1.0450 | 0.6260 | 0.4630 | 0.0484 | 0.0461 |
| | placing_no_box | 0.7508 | 0.1250 | 0.1060 | 0.1401 | 0.0243 |
| | placing_no_box3 | 0.3177 | 0.2560 | 0.1740 | 0.0670 | 0.0400 |
| | removing_o_box | 0.3401 | 0.3340 | 0.2220 | 0.2684 | 0.1849 |
| | synchronous | 1.1306 | 0.4460 | 0.4100 | 0.0693 | 0.0103 |
| | synchronous2 | 1.5736 | 0.0270 | 0.0220 | 0.0091 | 0.0071 |
| Top-2 Count (Top 1) | | 6(2) | 0(0) | 2(0) | 17(6) | 18(12) |
Table 5. Ablation study results (ATE RMSE on the TUM dataset [m]).

| Sequence | ORB-SLAM2 | Instance + TA-SLIC | Instance + OFP | TAS-SLAM (Full) |
|---|---|---|---|---|
| fr3/w/xyz | 0.7214 | 0.0188 | 0.0171 | 0.0145 |
| fr3/w/half | 0.4667 | 0.0313 | 0.0307 | 0.0254 |
| fr3/w/static | 0.3872 | 0.0097 | 0.0095 | 0.0087 |
| fr3/w/rpy | 0.7842 | 0.0331 | 0.0449 | 0.0302 |

Share and Cite

MDPI and ACS Style

Li, Y.; Lu, L.; Guo, G.; Na, L.; Liang, X.; Su, P.; An, Q.; Wang, P. TAS-SLAM: A Visual SLAM System for Complex Dynamic Environments Integrating Instance-Level Motion Classification and Temporally Adaptive Super-Pixel Segmentation. ISPRS Int. J. Geo-Inf. 2026, 15, 7. https://doi.org/10.3390/ijgi15010007
