Review

A Roadmap of Mathematical Optimization for Visual SLAM in Dynamic Environments

1
College of Information, Mechanical and Electrical Engineering, Shanghai Normal University, Shanghai 201418, China
2
Institute of Machine Intelligence, University of Shanghai for Science and Technology, Shanghai 200093, China
*
Authors to whom correspondence should be addressed.
Mathematics 2026, 14(2), 264; https://doi.org/10.3390/math14020264
Submission received: 29 October 2025 / Revised: 18 December 2025 / Accepted: 6 January 2026 / Published: 9 January 2026
(This article belongs to the Section E2: Control Theory and Mechanics)

Abstract

The widespread application of robots in complex and dynamic environments demands that Visual SLAM is both robust and accurate. However, dynamic objects, varying illumination, and environmental complexity fundamentally challenge the static world assumptions underlying traditional SLAM methods. This review provides a comprehensive investigation into the mathematical foundations of V-SLAM and systematically analyzes the key optimization techniques developed for dynamic environments, with particular emphasis on advances since 2020. We begin by rigorously deriving the probabilistic formulation of V-SLAM and its basis in nonlinear optimization, unifying it under a Maximum a Posteriori (MAP) estimation framework. We then propose a taxonomy based on how dynamic elements are handled mathematically, which reflects the historical evolution from robust estimation to semantic modeling and then to deep learning. This framework provides detailed analysis of three main categories: (1) robust estimation theory-based methods for outlier rejection, elaborating on the mathematical models of M-estimators and switch variables; (2) semantic information and factor graph-based methods for explicit dynamic object modeling, deriving the joint optimization formulation for multi-object tracking and SLAM; and (3) deep learning-based end-to-end optimization methods, discussing their mathematical foundations and interpretability challenges. This paper delves into the mathematical principles, performance boundaries, and theoretical controversies underlying these approaches, concluding with a summary of future research directions informed by the latest developments in the field. The review aims to provide both a solid mathematical foundation for understanding current dynamic V-SLAM techniques and inspiration for future algorithmic innovations. 
By adopting a math-first perspective and organizing the field through its core optimization paradigms, this work offers a clarifying framework for both understanding and advancing dynamic V-SLAM.

1. Introduction

Visual Simultaneous Localization and Mapping (V-SLAM) serves as a foundational capability for autonomous systems, enabling robots to perceive, navigate, and interact within complex environments without relying on external infrastructure [1,2]. Its importance spans a broad spectrum of real-world applications, from autonomous driving and augmented reality to service robotics in public squares, hospitals, and logistics centers. However, despite significant advances in controlled or static settings, a critical bottleneck persists when V-SLAM operates in unstructured, dynamic domains [3,4,5]. In such scenarios, the fundamental assumption of a static world is violated by moving pedestrians, vehicles, and other transient objects. These dynamic elements introduce non-stationary noise that corrupts the measurement model. This challenge is concretely exemplified in widely-used benchmarks such as the KITTI dataset for urban traffic scenarios [6] and the TUM RGB-D dataset for indoor pedestrian environments [7], where moving vehicles and people systematically violate the static-world assumption and generate structured measurement outliers. Such scenarios lead to inconsistent data associations, biased state estimates, and ultimately, catastrophic failures in tracking and mapping. Addressing this robustness gap demands a deep re-examination of the underlying mathematical formulations and optimization frameworks [8]. This review is motivated by the urgent need to consolidate and analyze the latest mathematical optimization strategies that empower V-SLAM to function reliably amid real-world dynamics, thereby bridging the gap between laboratory performance and in-the-wild deployment [9,10].
In the filtering-based era before the 2000s, early V-SLAM systems predominantly relied on the Extended Kalman Filter (EKF) framework, whose mathematical foundation lies in first-order linearization of nonlinear systems around the current estimate, recursively updating the state vector and covariance matrix [11,12,13]. While theoretically sound, EKF-SLAM suffered from computational complexity that scaled quadratically with the number of landmarks, cumulative linearization errors, and sensitivity to data association errors. Nevertheless, modern adaptations continue to leverage its efficiency in resource-constrained systems. A recent implementation [14] demonstrates an enhanced EKF-based MonoSLAM framework for UAV navigation in GPS-denied environments. By fusing monocular visual data with range-to-base-station measurements (e.g., UWB or RF), this system achieves full observability and bounded-error localization using only a single fixed base station, proving particularly effective for long-term indoor or urban-canyon drone operations without requiring loop closure.
During the batch optimization era from approximately 2000 to 2010, the limitations of recursive filtering spurred the development of bundle adjustment and pose graph optimization methods [15,16]. The mathematical foundation shifted toward sparse nonlinear least squares optimization over the entire trajectory, dramatically improving accuracy and consistency. The introduction of factor graphs provided a unified probabilistic graphical model for representing diverse sensor constraints, while Lie group theory offered a mathematically sound approach to manifold optimization for pose variables [17]. This period also saw the development of efficient optimization frameworks, such as g2o, which leveraged the problem’s inherent sparsity [18]. This established g2o as the standard optimization framework for major systems like ORB-SLAM3 [19] (for visual and visual-inertial SLAM), with recent extensions adapting it for dynamic environments [20,21]. Modern visual-lidar fusion systems also leverage its factor graph architecture for tightly-coupled sensor fusion [22].
From approximately 2010 to 2020, the robust and semantic era saw SLAM research mature, with its focus expanding to handle challenging environments. Robust estimation theory was incorporated to address outliers, while semantic SLAM emerged by integrating discrete semantic variables into continuous optimization frameworks [3,9,23]. This period also saw the rise of multi-sensor fusion, particularly visual-inertial systems, requiring sophisticated mathematical formulations for state estimation. Representative implementations include VINS-Mono [24], which employed sliding-window, tightly-coupled graph optimization with M-estimators (e.g., Huber loss) to jointly estimate camera poses, IMU states, and landmarks while robustly rejecting visual outliers. Simultaneously, semantic-enhanced systems integrated real-time semantic segmentation to filter dynamic objects, demonstrating how discrete semantic labels could be fused with continuous bundle adjustment for improved robustness in dynamic environments [25].
Beginning in the 2020s and continuing to the present, the learning-integrated era has been marked by the profound influence of deep learning on V-SLAM. This influence manifests through differentiable optimization layers, neural scene representations, and foundation models for zero-shot dynamic object recognition [26,27]. The mathematical foundations are evolving from explicit geometric models toward data-driven implicit representations and end-to-end function approximation. Key algorithmic advances include DROID-SLAM [28], which implements fully differentiable bundle adjustment; NeRF-SLAM [29], combining neural radiance fields with simultaneous localization; and the integration of foundation models like Segment Anything (SAM) [30] for zero-shot dynamic object recognition.
From the above, it follows that the core mathematical difficulty in dynamic V-SLAM stems from the violation of fundamental assumptions in the traditional formulation. The standard observation model is $z_t = h(x_t, m_j) + n_t$, where $z_t$ represents the sensor observation, $h(\cdot)$ is the observation function, $x_t$ denotes the robot state, $m_j$ is the static landmark position, and $n_t \sim \mathcal{N}(0, \Sigma_t)$ is the observation noise, assumed to follow a zero-mean Gaussian distribution with covariance $\Sigma_t$. This model becomes invalid when dynamic objects are present: observations originating from moving objects constitute structured outliers that violate the Gaussian noise assumption, leading to biased state estimates [31,32,33].
Mathematically, this can be formalized as a mixture model:
$$p(z_t \mid x_t, M) = (1 - \epsilon)\, \mathcal{N}\big(h(x_t, M), \Sigma_t\big) + \epsilon\, p_{\text{dynamic}}(z_t)$$
where $\epsilon$ represents the contamination ratio and $p_{\text{dynamic}}$ characterizes the outlier distribution.
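The effect of this contamination on a least-squares estimate can be illustrated with a minimal sketch. It assumes a scalar state, the trivial observation function $h(x) = x$, and a hypothetical uniform outlier distribution standing in for $p_{\text{dynamic}}$; none of these choices come from the reviewed systems.

```python
import random
import statistics

def sample_measurements(x_true, n, eps, sigma, rng):
    """Draw n scalar measurements from (1 - eps) * N(h(x), sigma^2) + eps * p_dynamic."""
    zs = []
    for _ in range(n):
        if rng.random() < eps:
            # Hypothetical p_dynamic: uniform offset of 5 to 15 units (assumption).
            zs.append(x_true + rng.uniform(5.0, 15.0))
        else:
            # Static-world model with h(x) = x and Gaussian noise.
            zs.append(rng.gauss(x_true, sigma))
    return zs

rng = random.Random(0)
x_true = 2.0
z = sample_measurements(x_true, n=2000, eps=0.3, sigma=0.1, rng=rng)

ls_estimate = statistics.mean(z)        # least-squares estimate of x under h(x) = x
robust_estimate = statistics.median(z)  # L1 estimate, a simple robust alternative

ls_error = abs(ls_estimate - x_true)
robust_error = abs(robust_estimate - x_true)
print(f"least-squares error: {ls_error:.3f}, median error: {robust_error:.3f}")
```

With 30% one-sided contamination, the mean is pulled toward the outliers by several units, while the median stays close to the true state. This is the bias that the robust methods of Section 3.2 are designed to suppress.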
Furthermore, the state estimation problem expands significantly. For a system with $N$ camera poses, $M$ static landmarks, and $K$ dynamic objects observed over $T$ time steps, the state dimension grows as $O(N + M + KT)$, compared to $O(N + M)$ in static environments. This growth introduces fundamental challenges in computational complexity, identifiability, and robust inference.
This review makes several key contributions to the literature on dynamic environment V-SLAM:
  • A comprehensive mathematical treatment of V-SLAM foundations, from probabilistic formulation to Lie group optimization, with detailed derivations.
  • A structured taxonomy of dynamic V-SLAM methods centered on their core mathematical optimization strategies, incorporating cutting-edge research.
  • In-depth analysis of mathematical challenges, theoretical controversies, and promising future research directions informed by the latest developments.
  • Explicit connections between classical geometric methods and emerging learning-based approaches, highlighting their complementary mathematical strengths.
Our taxonomy is designed to organize methods by their core mathematical approach to dynamic elements, while also showing the field’s progression over time. This dual structure helps connect methodological developments with historical trends. The remainder of this review is organized as follows: Section 2 establishes the mathematical foundations of V-SLAM optimization. Section 3 presents our taxonomy and detailed analysis of methods for dynamic environments. Section 4 discusses mathematical challenges and controversies, and Section 5 concludes with future research directions.

2. Mathematical Foundations of V-SLAM Optimization in Static Environments

This section delineates the core mathematical underpinnings required to understand modern V-SLAM optimization. Rather than an exhaustive review of all mathematical tools, we concentrate on three foundational components that are both ubiquitous in the literature and critical for transitioning from problem formulation to numerical solution: probabilistic graphical models (for formulation), Lie group theory (for correct geometric representation), and nonlinear least squares (for efficient computation). The choice of these three is motivated by their representative role in forming the backbone of the majority of contemporary V-SLAM systems discussed subsequently.
A comprehensive summary of the mathematical notation used throughout this review is provided in Appendix A.

2.1. Probabilistic Formulation and Graphical Models

The SLAM problem is inherently probabilistic, concerned with estimating the joint posterior over the robot trajectory $x_{1:t}$ and the map $M$ given sensor observations $z_{1:t}$ and control inputs $u_{1:t}$:
$$p(x_{1:t}, M \mid z_{1:t}, u_{1:t})$$
The classical filtering approach addresses this through recursive Bayesian estimation [34]. The Extended Kalman Filter (EKF) maintains a Gaussian approximation of the belief state, while the Particle Filter (PF) represents the belief non-parametrically through weighted samples [2,35,36,37,38,39]. However, the modern paradigm formulates SLAM as a full trajectory optimization problem using Maximum a Posteriori (MAP) estimation:
$$X^*, L^* = \arg\max_{X, L}\, p(X, L \mid Z) = \arg\max_{X, L}\, p(Z \mid X, L)\, p(X)\, p(L) \tag{3}$$
where $X = \{x_1, \ldots, x_N\}$ represents the robot trajectory, $L = \{l_1, \ldots, l_M\}$ denotes the set of landmarks, and $Z$ encompasses all available observations.
The factor graph provides a powerful graphical model representation of this MAP estimation problem. This review focuses on the factor graph and MAP estimation paradigm as the primary probabilistic framework for modern optimization-based V-SLAM. While alternative approaches such as particle filters remain valuable for specific scenarios (e.g., multi-hypothesis tracking), the factor graph’s explicit representation of sparsity and its direct link to efficient nonlinear optimization make it the dominant and most extensible framework for integrating dynamic object states and semantic constraints, which are central themes in subsequent sections. The joint probability distribution factorizes as:
$$p(X, L \mid Z) \propto \prod_{k=1}^{K} f_k(X_k, L_k)$$
where each factor $f_k$ corresponds to a constraint involving subsets of variables $X_k \subseteq X$ and $L_k \subseteq L$. This factorization naturally captures the sparse connectivity of the SLAM problem and enables efficient inference.
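As a rough illustration of this factorization, the sketch below builds a one-dimensional toy factor graph in which every factor touches only a small subset of variables, so the negative log-posterior is a sum of local terms. The `Factor` class, the variable names, and all measurement values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

@dataclass
class Factor:
    keys: Tuple[str, ...]            # the subset of variables this factor touches
    residual: Callable[..., float]   # residual as a function of those variables
    sigma: float                     # measurement standard deviation

    def neg_log(self, values: Dict[str, float]) -> float:
        """Negative log of a Gaussian factor, up to an additive constant."""
        r = self.residual(*(values[k] for k in self.keys))
        return 0.5 * (r / self.sigma) ** 2

factors = [
    Factor(("x0",), lambda x0: x0 - 0.0, 0.01),              # prior on the first pose
    Factor(("x0", "x1"), lambda a, b: (b - a) - 1.0, 0.1),   # odometry x0 -> x1
    Factor(("x1", "x2"), lambda a, b: (b - a) - 1.0, 0.1),   # odometry x1 -> x2
    Factor(("x0", "l0"), lambda x, l: (l - x) - 2.5, 0.2),   # landmark seen from x0
    Factor(("x2", "l0"), lambda x, l: (l - x) - 0.5, 0.2),   # landmark seen from x2
]

def neg_log_posterior(values: Dict[str, float]) -> float:
    """-log p(X, L | Z) up to a constant: a sum over factors, not over all variable pairs."""
    return sum(f.neg_log(values) for f in factors)

consistent = {"x0": 0.0, "x1": 1.0, "x2": 2.0, "l0": 2.5}
perturbed = dict(consistent, x1=1.3)   # only the two factors touching x1 change

cost_consistent = neg_log_posterior(consistent)
cost_perturbed = neg_log_posterior(perturbed)
```

Perturbing `x1` changes only the two odometry factors attached to it, which is the locality that produces the sparse Jacobian and Hessian structure exploited in Section 2.3.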
Within the probabilistic framework, we emphasize factor graphs and maximum a posteriori estimation, as they constitute the prevailing optimization paradigm for handling dynamic and semantic information; the Extended Kalman Filter is introduced as a historical foundation to illustrate the developmental trajectory.

2.2. Lie Group Theory for Pose Representation

The Lie group structure addresses a fundamental challenge in SLAM optimization: performing calculus on the non-Euclidean manifold of rigid-body poses. Standard gradient-based optimization operates in vector spaces, where updates are additive [40,41,42]. However, directly adding an increment to a rotation matrix, $R \leftarrow R + \Delta R$, does not guarantee that the result remains in $\mathrm{SO}(3)$, violating the constraints of the rotation group.
Lie theory resolves this issue by providing a local parameterization via the Lie algebra. The algebra $\mathfrak{se}(3)$, being a vector space, permits unconstrained optimization. The current estimate $T$ is updated using a perturbation $\delta\xi$ in the algebra:
$$T_{\text{new}} = T \cdot \exp(\delta\xi^{\wedge})$$
This update rule ensures that $T_{\text{new}}$ remains in $\mathrm{SE}(3)$ for any $\delta\xi \in \mathbb{R}^6$. The optimization then proceeds by iteratively solving for the optimal perturbation $\delta\xi$ in the tangent space.
The Jacobian matrices required for optimization, such as $\partial e / \partial \delta\xi$, characterize the sensitivity of the error function to infinitesimal changes on the manifold [43,44,45,46]. This approach, known as on-manifold optimization, transforms a constrained optimization problem on a nonlinear manifold into an unconstrained one in a Euclidean vector space. Consequently, Lie theory provides the mathematical rigor necessary for stable and consistent estimation of pose variables in modern sparse, nonlinear least-squares solvers [28,29].
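The difference between an additive update and the on-manifold update can be checked numerically. The following sketch implements the $\mathrm{SE}(3)$ exponential map in closed form; the twist ordering (translation part first, rotation part second) and the helper names `hat` and `exp_se3` are assumptions made for this example, not conventions taken from the reviewed works.

```python
import numpy as np

def hat(w):
    """Map a 3-vector to its skew-symmetric matrix."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def exp_se3(xi):
    """Exponential map from a twist xi = (rho, phi) in R^6 to a 4x4 matrix in SE(3)."""
    rho, phi = xi[:3], xi[3:]
    theta = np.linalg.norm(phi)
    K = hat(phi)
    if theta < 1e-8:
        # Second-order Taylor expansion near the identity
        R = np.eye(3) + K + 0.5 * K @ K
        V = np.eye(3) + 0.5 * K + K @ K / 6.0
    else:
        a = np.sin(theta) / theta
        b = (1.0 - np.cos(theta)) / theta**2
        c = (theta - np.sin(theta)) / theta**3
        R = np.eye(3) + a * K + b * K @ K   # Rodrigues formula
        V = np.eye(3) + b * K + c * K @ K   # couples rotation and translation
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = V @ rho
    return T

T = np.eye(4)
delta_xi = np.array([0.1, -0.2, 0.05, 0.3, 0.1, -0.2])

# On-manifold update: T_new = T * exp(delta_xi^)
T_new = T @ exp_se3(delta_xi)
R_new = T_new[:3, :3]
orthogonality_error = np.linalg.norm(R_new.T @ R_new - np.eye(3))

# Naive additive update R + Delta R, for comparison
R_naive = np.eye(3) + hat(delta_xi[3:])
naive_error = np.linalg.norm(R_naive.T @ R_naive - np.eye(3))

print(f"on-manifold: {orthogonality_error:.2e}, additive: {naive_error:.2e}")
```

The on-manifold result is orthogonal to machine precision, whereas the additive update leaves the rotation group after a single step.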
For a camera at pose $T_i$ observing a landmark $l_j$, the reprojection error is:
$$e_{ij} = \pi(T_i^{-1} l_j) - z_{ij}$$
where $\pi: \mathbb{R}^3 \to \mathbb{R}^2$ is the camera projection function and $z_{ij}$ is the actual image measurement. Proper handling of 3D rigid body transformations is crucial for V-SLAM. The Special Euclidean Group $\mathrm{SE}(3)$ represents valid rigid body transformations:
$$\mathrm{SE}(3) = \left\{ T = \begin{bmatrix} R & t \\ \mathbf{0}^{\top} & 1 \end{bmatrix} \;\middle|\; R \in \mathrm{SO}(3),\ t \in \mathbb{R}^3 \right\}$$
where $\mathrm{SO}(3)$ is the Special Orthogonal Group representing rotations.
Lie theory provides the mathematical framework for manipulating elements of $\mathrm{SE}(3)$ during optimization [47]. The Lie algebra $\mathfrak{se}(3)$ is the tangent space at the identity of $\mathrm{SE}(3)$, providing a minimal representation for perturbations. The exponential map $\exp: \mathfrak{se}(3) \to \mathrm{SE}(3)$ and logarithm map $\log: \mathrm{SE}(3) \to \mathfrak{se}(3)$ connect the Lie algebra to the Lie group:
$$T' = T \exp(\xi^{\wedge})$$
where $\xi \in \mathbb{R}^6$ is the twist coordinate vector and $(\cdot)^{\wedge}$ is the hat operator mapping $\mathbb{R}^6$ to $\mathfrak{se}(3)$.
Regarding pose representation, we select the Lie groups $\mathrm{SE}(3)$ and $\mathrm{SO}(3)$ because they provide the minimal and complete mathematical objects for describing rigid-body motion, serving as the common foundation for the vast majority of V-SLAM systems.
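The reprojection error defined earlier in this subsection can be sketched in a few lines. The example assumes a pinhole camera with hypothetical intrinsics (`fx`, `fy`, `cx`, `cy`) and takes $T_i$ to be the camera-to-world pose, so that $T_i^{-1}$ maps world points into the camera frame; both are illustrative assumptions.

```python
import numpy as np

def project(p_cam, fx, fy, cx, cy):
    """Pinhole projection pi: R^3 -> R^2 for a point expressed in the camera frame."""
    x, y, z = p_cam
    return np.array([fx * x / z + cx, fy * y / z + cy])

def reprojection_error(T_wc, l_w, z_obs, fx=500.0, fy=500.0, cx=320.0, cy=240.0):
    """e_ij = pi(T_i^{-1} l_j) - z_ij, with T_wc the 4x4 camera-to-world pose."""
    R, t = T_wc[:3, :3], T_wc[:3, 3]
    p_cam = R.T @ (l_w - t)   # apply T^{-1} without forming the matrix inverse
    return project(p_cam, fx, fy, cx, cy) - z_obs

T = np.eye(4)
T[:3, 3] = [0.0, 0.0, -1.0]   # camera placed one unit behind the world origin
landmark = np.array([0.2, -0.1, 3.0])

# A noise-free observation, then the same observation shifted by (1, -2) pixels
z_perfect = project(landmark - T[:3, 3], 500.0, 500.0, 320.0, 240.0)
e = reprojection_error(T, landmark, z_perfect + np.array([1.0, -2.0]))
print(e)
```

A feature lying on a moving object produces exactly this kind of residual: the landmark position stored in the map no longer matches where the point is observed, and the error is absorbed by the pose estimate unless it is down-weighted or modeled.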

2.3. Nonlinear Least Squares Formulation

Under the assumption of zero-mean Gaussian noise, the MAP estimate in Equation (3) becomes a Nonlinear Least Squares (NLSQ) problem:
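$$X^*, L^* = \arg\min_{X, L} \sum_{i=1}^{N-1} \left\| e_i^{\text{odom}} \right\|^2_{\Omega_i^{\text{odom}}} + \sum_{i,j} \left\| e_{i,j} \right\|^2_{\Omega_{i,j}} + \sum_{k} \left\| e_k^{\text{loop}} \right\|^2_{\Omega_k^{\text{loop}}}$$
where $\|e\|_{\Omega}^2 \triangleq e^{\top} \Omega e$ is the squared Mahalanobis distance and $\Omega$ is the information matrix [48,49].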
The Gauss-Newton algorithm solves this iteratively by linearizing around the current estimate. At each iteration, we solve the linear system:
$$J^{\top} \Omega J \, \delta\theta = -J^{\top} \Omega e$$
where $J$ is the Jacobian matrix of the error vector $e$ with respect to the state perturbation $\delta\theta$.
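The Gauss-Newton iteration can be illustrated on a deliberately small problem. The sketch below estimates a 2D position from range measurements to known landmarks; the range-only setup, the landmark layout, and the function name `gauss_newton` are assumptions chosen to keep the example short, not a model of any specific V-SLAM front end.

```python
import numpy as np

def gauss_newton(x0, landmarks, ranges, omega, iters=20):
    """Minimize sum ||e||^2_Omega with e_i = ||x - l_i|| - z_i by Gauss-Newton."""
    x = x0.astype(float).copy()
    for _ in range(iters):
        diff = x - landmarks                  # shape (n, 2)
        dist = np.linalg.norm(diff, axis=1)   # predicted ranges
        e = dist - ranges                     # residual vector
        J = diff / dist[:, None]              # Jacobian of e with respect to x
        H = J.T @ omega @ J                   # Gauss-Newton Hessian approximation
        g = J.T @ omega @ e
        delta = np.linalg.solve(H, -g)        # normal equations: H delta = -J^T Omega e
        x = x + delta
        if np.linalg.norm(delta) < 1e-10:
            break
    return x

landmarks = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 3.0], [4.0, 3.0]])
x_true = np.array([1.0, 2.0])
ranges = np.linalg.norm(x_true - landmarks, axis=1)   # noise-free measurements
omega = np.eye(len(ranges)) / 0.01                    # information = inverse variance

x_hat = gauss_newton(np.array([2.5, 0.5]), landmarks, ranges, omega)
print(x_hat)
```

For pose variables the additive step `x = x + delta` would be replaced by the on-manifold update of Section 2.2; the linear system solved at each iteration keeps the same form.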
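The structure of the Hessian approximation $H = J^{\top} \Omega J$ reflects the connectivity of the factor graph. For SLAM problems with both pose and landmark variables, it exhibits a block structure:
$$H = \begin{bmatrix} H_{XX} & H_{XL} \\ H_{LX} & H_{LL} \end{bmatrix} \tag{11}$$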
The Schur complement trick can be applied to efficiently solve this system by first marginalizing the landmark variables:
$$\left[ H_{XX} - H_{XL} H_{LL}^{-1} H_{LX} \right] \delta X = b_X - H_{XL} H_{LL}^{-1} b_L$$
This dramatically reduces the problem size since the number of pose variables is typically much smaller than the number of landmarks. In terms of optimization solving, we analyze the Schur complement technique in detail, as it represents the key algorithmic core for addressing large-scale sparse problems, and its principle can be directly extended to the marginalization of object states in dynamic scenarios. The primary aim of this section is to demonstrate how these foundational tools are extended to address dynamism, rather than to enumerate all possible variants.

2.4. Sparsity, Solver Techniques, and Optimization Foundations

This section elucidates how the inherent sparsity of the SLAM problem is systematically exploited to enable efficient numerical optimization. The techniques discussed herein—particularly those leveraging the block structure of the Hessian and the Schur complement—form the computational backbone for solving large-scale nonlinear least-squares problems. While presented in the context of static environments, these foundational methods are directly extensible to dynamic SLAM, where they underpin advanced strategies for handling moving objects (e.g., through dynamic marginalization or object-aware sparsity exploitation).
As shown by the block structure of the Hessian matrix in Equation (11), the problem is highly sparse. This sparsity can be exploited through the Schur complement trick (also known as landmark marginalization). The core idea is to solve for the landmark variables first in terms of the poses, and then substitute back to obtain a smaller, denser system involving only poses. Formally, this is derived from the linear system $H \delta\theta = b$ by block Gaussian elimination.
The computational advantages of this manipulation are threefold. First, the size of the system to be solved is drastically reduced from $\dim(X) + \dim(L)$ to just $\dim(X)$. Since the number of landmarks $M$ often far exceeds the number of poses $N$ (i.e., $M \gg N$), this transforms an intractable $O((N + M)^3)$ problem into a manageable $O(N^3)$ problem [50]. Second, the inversion of $H_{LL}$, which is the most expensive step, is highly efficient because $H_{LL}$ is block-diagonal, allowing for its inverse to be computed by inverting each small, individual landmark block independently, an $O(M)$ operation. Finally, although the Schur complement $S$ is denser than the original $H_{XX}$, it inherits a structured sparsity pattern dictated by the pose graph, which permits the continued use of efficient solvers such as sparse Cholesky factorization or preconditioned conjugate gradient methods [51].
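The elimination described above can be verified numerically on a synthetic system. The sketch below builds a random Jacobian in which each observation touches exactly one pose block and one landmark block, applies the Schur complement with a block-wise inverse of $H_{LL}$, and compares the result with a direct solve. The problem sizes, the two-dimensional variable blocks, and the small damping term are assumptions made for this toy example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_poses, n_landmarks, d = 3, 20, 2   # toy sizes; d is the per-variable dimension
nx, nl = n_poses * d, n_landmarks * d

# Synthetic Jacobian: every landmark is observed from every pose, and each
# observation touches exactly one pose block and one landmark block.
rows = []
for i in range(n_poses):
    for j in range(n_landmarks):
        row = np.zeros((d, nx + nl))
        row[:, i * d:(i + 1) * d] = rng.normal(size=(d, d))
        row[:, nx + j * d:nx + (j + 1) * d] = rng.normal(size=(d, d))
        rows.append(row)
J = np.vstack(rows)
e = rng.normal(size=J.shape[0])

H = J.T @ J + 1e-6 * np.eye(nx + nl)   # small damping keeps H positive definite
b = -J.T @ e

H_XX, H_XL = H[:nx, :nx], H[:nx, nx:]
H_LX, H_LL = H[nx:, :nx], H[nx:, nx:]
b_X, b_L = b[:nx], b[nx:]

# H_LL is block-diagonal (landmarks share no factors), so invert it block by block.
H_LL_inv = np.zeros_like(H_LL)
for j in range(n_landmarks):
    s = slice(j * d, (j + 1) * d)
    H_LL_inv[s, s] = np.linalg.inv(H_LL[s, s])

S = H_XX - H_XL @ H_LL_inv @ H_LX                        # reduced pose-only system
delta_X = np.linalg.solve(S, b_X - H_XL @ H_LL_inv @ b_L)
delta_L = H_LL_inv @ (b_L - H_LX @ delta_X)              # back-substitution

delta_full = np.linalg.solve(H, b)                       # reference: full solve
print(np.max(np.abs(np.concatenate([delta_X, delta_L]) - delta_full)))
```

Here the reduced system has only `nx = 6` unknowns instead of 46, and the landmark update is recovered afterwards at a cost linear in the number of landmarks.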
This combination of exploiting the problem’s inherent topological sparsity (from the factor graph) with the algebraic simplification (via the Schur complement) forms the computational backbone of modern, large-scale SLAM systems, making them feasible for real-world applications [52].

3. Mathematical Optimization Methods for Dynamic Environments: A Taxonomy Based on Robust, Semantic, and Learning Paradigms

This section systematically elaborates on the core mathematical optimization methods for visual SLAM in dynamic environments. Starting from the joint state modeling of dynamic SLAM, it highlights three major challenges: identifiability, computational complexity, and robustness. Furthermore, this section evaluates the adaptability of each method under atypical observation scenarios such as extreme illumination, occlusion, transient dynamic objects, and sensor degradation, providing both theoretical foundations and practical guidance for the design and selection of robust dynamic V-SLAM systems. As illustrated in Figure 1, we categorize dynamic V-SLAM methods into three main optimization paradigms: (1) robust estimation, which treats dynamic measurements as statistical outliers; (2) semantic factor graphs, which explicitly model object states and motion; and (3) deep learning end-to-end optimization, which learns implicit representations through differentiable components. The following subsections detail the mathematical foundations of each paradigm.

3.1. Mathematical Formulation of Dynamic SLAM

The transition from static to dynamic environments necessitates a fundamental extension of the classical SLAM state vector to explicitly account for moving objects [53]. This evolution shifts the problem from one of pure geometry and localization to a joint estimation problem involving both the static environment and dynamic agents. Let the full state in a dynamic setting be defined as the tuple [54]:
$$\Theta = (X, M, D),$$
where $X = \{x_1, \ldots, x_N\}$ denotes the robot trajectory, $M = \{m_1, \ldots, m_M\}$ the static map, and $D = \{o_1, \ldots, o_K\}$ the temporally evolving states of $K$ dynamic objects. Crucially, the state of a dynamic object $o_k$ is not a single pose but a trajectory itself, often represented as $o_k = \{o_k^1, o_k^2, \ldots, o_k^{T_k}\}$ over the time steps it is observed.
This augmented state vector fundamentally alters the probabilistic structure of the SLAM problem. The joint posterior, factoring in sensor observations $Z$, now becomes
$$p(\Theta \mid Z) \propto p(Z \mid \Theta) \cdot p(X) \cdot p(M) \cdot p(D \mid X),$$
Here, the conditional prior $p(D \mid X)$ is particularly significant. It encodes the dependence of dynamic object states on the robot’s own trajectory, reflecting the fact that observations of objects are made from the robot’s moving frame of reference. This prior can incorporate dynamical models (e.g., constant velocity or acceleration) to constrain object motion. This formulation introduces a strong coupling between the robot’s path and the objects’ trajectories, as miscalibration in one directly biases the estimate of the other. Furthermore, the resulting objective function, often derived from this posterior, becomes highly non-convex due to the complex interactions and data association ambiguities between the robot, static landmarks, and dynamic objects.
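One way a constant-velocity model can enter such a prior is as a penalty on the second differences of an object's trajectory. The sketch below assumes that object states are 3D positions already expressed in a common world frame and that the motion noise is isotropic; the function name `constant_velocity_cost` and all numerical values are hypothetical.

```python
import numpy as np

def constant_velocity_cost(traj, omega):
    """Sum of squared Mahalanobis norms of the second differences of a trajectory."""
    traj = np.asarray(traj, dtype=float)
    accel = traj[2:] - 2.0 * traj[1:-1] + traj[:-2]   # zero under constant velocity
    return float(np.einsum("ti,ij,tj->", accel, omega, accel))

omega = np.eye(3) / 0.05**2   # information matrix of the motion model (assumed)

steps = np.arange(6)[:, None]
smooth = np.array([1.0, 0.0, 0.0]) + steps * np.array([0.5, 0.1, 0.0])
jerky = smooth.copy()
jerky[3] += np.array([0.0, 0.3, 0.0])   # one implausible sideways jump

cost_smooth = constant_velocity_cost(smooth, omega)
cost_jerky = constant_velocity_cost(jerky, omega)
print(cost_smooth, cost_jerky)
```

Each term couples three consecutive object states, so the prior adds a banded block to the Hessian for every tracked object, which is where the $KT$ contribution to the state dimension comes from.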
While this formulation elegantly generalizes the static SLAM problem to dynamic settings, it introduces profound theoretical and practical challenges in three key areas: scalability, identifiability, and robust optimization. First, regarding scalability, the state dimension grows with the product of the number of dynamic objects and the number of time steps over which they are observed. For a system with $N$ poses, $M$ static landmarks, and $K$ dynamic objects each observed over $T$ time steps, the state dimension grows as $O(N + M + KT)$, compared to $O(N + M)$ in static environments. The worst-case computational complexity for a batch solution becomes $O((N + M + KT)^3)$, posing a severe challenge for long-term operations in crowded environments. In practice, systems like the Komatsu FrontRunner autonomous haulage system [55] address this by employing sliding window approximations with $w = 10$, reducing effective complexity to $O(w^3)$ while tracking 5–10 dynamic vehicles in kilometer-scale open-pit mines. Second, the problem suffers from inherent identifiability ambiguities. A classic issue is the object-pose ambiguity, where the relative motion between the robot and a dynamic object cannot be uniquely disambiguated if the object is observed from only a limited set of viewpoints. This leads to unobservable directions in the state space, making the Fisher Information Matrix (FIM) rank-deficient. Finally, the presence of dynamic objects acts as a source of structured outliers that violate the standard static-world observation model, raising the challenge of robust optimization. Standard least-squares optimizers are highly sensitive to such outliers, which necessitates the integration of robust estimation techniques (e.g., M-estimators, switchable constraints) or probabilistic data association methods to prevent catastrophic failure, further complicating the optimization landscape.
Consequently, tackling dynamic SLAM requires not only a more complex state representation but also novel algorithmic approaches that address these intertwined challenges of scalability, identifiability, and robustness.

3.2. Robust Estimation Theory for Outlier Rejection

Robustness can be achieved at different levels. At the geometric level, methods such as RANSAC or epipolar geometric checks reject outliers through discrete hypothesis testing during motion estimation or feature matching. While effective for sparse, independent outliers, these approaches may struggle with pervasive or structured dynamic interference. Therefore, this article focuses on optimization-level robust estimators, which integrate robustness directly into the continuous optimization framework in a probabilistic and differentiable manner, offering a more principled solution for complex dynamic scenarios.

3.2.1. M-Estimators and Robust Kernels

The core idea of M-estimation is to replace the quadratic loss with a robust kernel function $\rho(\cdot)$ that reduces the influence of large residuals:
$$X^*, L^* = \arg\min_{X, L} \sum_{i,j} \rho\left( \left\| e_{i,j} \right\|_{\Omega_{i,j}} \right)$$
The robustness properties of M-estimators can be characterized through their influence function $\psi(r) = \partial \rho(r) / \partial r$, which quantifies the effect of residuals on the parameter estimates. A bounded influence function indicates robustness to outliers. For the Huber loss, the influence function is
$$\psi_{\text{Huber}}(r) = \begin{cases} r & \text{if } |r| \le \delta \\ \delta \cdot \operatorname{sign}(r) & \text{otherwise} \end{cases}$$
The asymptotic breakdown point $\epsilon^{*}$ provides a theoretical measure of robustness, representing the maximum proportion of outliers an estimator can tolerate before producing arbitrary estimates. For M-estimators with redescending influence functions like the Geman-McClure kernel, the breakdown point can approach 0.5, meaning they can handle up to 50% contamination in the data.
The statistical efficiency of M-estimators under Gaussian noise can be analyzed through their relative efficiency compared to the optimal maximum likelihood estimator. The Huber loss, for instance, maintains 95% efficiency at the Gaussian distribution while providing robustness to heavy-tailed distributions. This capability is leveraged in urban delivery drones like the DJI Matrice 300 series [56], where adaptive M-estimators handle 25–35% dynamic vehicle interference during package delivery, demonstrating practical robustness beyond theoretical guarantees.
The convergence properties of M-estimators in SLAM applications depend on the convexity of the robust kernel. While the Huber loss preserves convexity, redescending M-estimators like Cauchy and Geman-McClure introduce non-convexity, potentially leading to local minima. However, this non-convexity is precisely what provides stronger rejection of gross outliers in dynamic environments. The influence function $\psi(r) = \partial \rho(r) / \partial r$ characterizes how much each residual contributes to the solution. Common robust kernels include the following:
  • Huber: $\rho_{\text{Huber}}(r) = \begin{cases} \frac{1}{2} r^2 & \text{if } |r| \le \delta \\ \delta \left( |r| - \frac{1}{2} \delta \right) & \text{otherwise} \end{cases}$
  • Cauchy: $\rho_{\text{Cauchy}}(r) = \frac{c^2}{2} \log\left( 1 + (r/c)^2 \right)$
  • Geman-McClure: $\rho_{\text{GM}}(r) = \frac{r^2 / 2}{1 + r^2}$
For practitioners selecting a robust kernel, key characteristics differ: the Huber kernel has a linear-then-constant influence function and is convex (typical $\delta \approx 1.345$); the Cauchy kernel has a redescending influence function and is non-convex (typical $c \approx 2.385$); the Geman-McClure kernel also redescends and is non-convex (typical scale $= 1$). Choice depends on the expected outlier proportion and need for convexity.
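The three kernels and their influence functions can be compared directly. The sketch below implements $\rho$ and $\psi$ for each kernel with the typical tuning constants quoted above and prints the iteratively reweighted least-squares weight $w(r) = \psi(r)/r$; the weight interpretation is a standard device added here for illustration rather than something stated in the text.

```python
import numpy as np

def huber(r, delta=1.345):
    """Huber kernel: returns (rho, psi) evaluated elementwise."""
    a = np.abs(r)
    rho = np.where(a <= delta, 0.5 * r**2, delta * (a - 0.5 * delta))
    psi = np.where(a <= delta, r, delta * np.sign(r))
    return rho, psi

def cauchy(r, c=2.385):
    """Cauchy kernel: returns (rho, psi) evaluated elementwise."""
    rho = 0.5 * c**2 * np.log1p((r / c) ** 2)
    psi = r / (1.0 + (r / c) ** 2)
    return rho, psi

def geman_mcclure(r):
    """Geman-McClure kernel with unit scale: returns (rho, psi) elementwise."""
    rho = 0.5 * r**2 / (1.0 + r**2)
    psi = r / (1.0 + r**2) ** 2
    return rho, psi

r = np.array([0.1, 1.0, 5.0, 50.0])
for name, kernel in [("Huber", huber), ("Cauchy", cauchy), ("Geman-McClure", geman_mcclure)]:
    rho, psi = kernel(r)
    weights = psi / r   # IRLS weight; equals 1 everywhere for the quadratic loss
    print(name, np.round(psi, 4), np.round(weights, 4))
```

The Huber influence saturates at $\delta$ for large residuals, while the Cauchy and Geman-McClure influences fall back toward zero, which is what "redescending" means in practice: a gross outlier contributes almost nothing to the normal equations.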
Recent work has explored learning adaptive robust kernels directly from data. The researchers of [57] proposed a method to learn kernel parameters within a differentiable optimization framework:
$$\alpha_{\text{kernel}}^{*} = \arg\min_{\alpha_{\text{kernel}}} \sum_{\text{validation}} \left\| X^{*}(\alpha_{\text{kernel}}) - X_{\text{gt}} \right\|$$
where $X^{*}(\alpha_{\text{kernel}})$ is the solution obtained using kernel parameters $\alpha_{\text{kernel}}$.
To illustrate how M-estimation integrates into practical SLAM systems, Algorithm 1 presents a simplified pseudocode for robust cost computation using the Huber loss. This algorithm demonstrates the conditional treatment of residuals based on their magnitude, which forms the core of robust outlier rejection in dynamic environments.
Algorithm 1 Robust Cost Computation using M-Estimation
function RobustEstimation($e_{ij}$, $\rho$, $\delta$)
    $C \leftarrow 0$ ▹ Initialize total cost
    for each residual $e$ in $e_{ij}$ do
        if $|e| \le \delta$ then ▹ Inlier region: quadratic loss
            $C \leftarrow C + 0.5 \cdot e^2$
        else ▹ Outlier region: linear loss
            $C \leftarrow C + \delta \cdot (|e| - 0.5\,\delta)$
        end if
    end for
    return $C$ ▹ Robust cost for optimization
end function
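A direct, runnable transcription of Algorithm 1 is given below as a minimal sketch. It assumes scalar residuals that have already been whitened by their information matrices, and it drops the unused kernel argument; the residual values are hypothetical.

```python
def robust_cost(residuals, delta):
    """Total Huber cost of scalar residuals, following Algorithm 1."""
    total = 0.0
    for e in residuals:
        if abs(e) <= delta:
            total += 0.5 * e * e                      # inlier region: quadratic loss
        else:
            total += delta * (abs(e) - 0.5 * delta)   # outlier region: linear loss
    return total

residuals = [0.2, -0.5, 8.0]   # the last residual plays the role of a dynamic-object outlier
huber_total = robust_cost(residuals, delta=1.0)
quadratic_total = sum(0.5 * e * e for e in residuals)
print(huber_total, quadratic_total)
```

Under the quadratic loss the single outlier accounts for almost the entire cost, whereas under the Huber loss its contribution grows only linearly with the residual magnitude.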

3.2.2. Switch Variables and Probabilistic Data Association

The switch variable approach introduces a continuous variable $s_{i,j} \in [0, 1]$ for each potential dynamic constraint, leading to a joint estimation problem:
$$X^*, L^*, S^* = \arg\min_{X, L, S} \sum_{i,j} \left\| s_{i,j}\, r_{i,j}(X, L) \right\|^2_{\Omega_{i,j}} + \sum_{i,j} \Psi(s_{i,j})$$
where $\Psi(s_{i,j})$ is a prior penalty that encourages $s_{i,j}$ toward 0 or 1. This optimization problem is non-convex due to the bilinear terms $\| s_{i,j}\, r_{i,j}(X, L) \|^2$. However, under certain conditions, convergence guarantees can be established. Specifically, when the prior penalty $\Psi(s_{i,j})$ is chosen as a concave function, the overall objective becomes difference-of-convex, and block coordinate descent can be shown to converge to a stationary point, as analyzed in [58].
The convergence rate can be characterized by analyzing the Kurdyka-Łojasiewicz property of the objective function. Recent work by [59] has shown that with proper initialization and regularization, the switch variable formulation converges linearly to a local minimum in practice.
The sensitivity of switch variables to initialization can be mitigated through continuation methods, where the optimization starts with strong regularization on the switch variables and gradually reduces it, allowing the solver to first establish good pose estimates before determining data association.
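As a minimal sketch of block coordinate descent with continuation, consider a toy problem that is not taken from the surveyed systems: estimating a scalar location $x$ from measurements $z_i$ with residuals $r_i = z_i - x$. The penalty $\Psi(s) = \lambda (1 - s)$ is an illustrative assumption (it is linear and therefore trivially concave). Under this assumption the switch update has the closed form $s_i = 1$ if $r_i^2 < \lambda$ and $s_i = 0$ otherwise, and the state update is a weighted mean. The function name `switch_bcd` and the schedule parameters are hypothetical.

```python
import numpy as np

def switch_bcd(z, lam_start=100.0, lam_end=1.0, n_stages=8, n_inner=5):
    """Alternate between switch and state updates while lowering lambda."""
    z = np.asarray(z, dtype=float)
    x = float(z.mean())                   # deliberately poor, outlier-biased start
    s = np.ones_like(z)
    for lam in np.geomspace(lam_start, lam_end, n_stages):   # continuation schedule
        for _ in range(n_inner):
            r2 = (z - x) ** 2
            # Block 1: minimize s*r2 + lam*(1 - s) over s in [0, 1] (closed form).
            s = (r2 < lam).astype(float)
            if s.sum() == 0:
                break                     # no measurement is trusted; keep x
            # Block 2: minimize sum_i s_i * (z_i - x)^2 over x (weighted mean).
            x = float((s * z).sum() / s.sum())
    return x, s

z = [1.0, 1.1, 0.9, 1.05, 0.95, 10.0, 11.0]   # five inliers near 1, two outliers
x_hat, s_hat = switch_bcd(z)
```

With the large initial $\lambda$ every measurement is kept, so the first estimate is simply the contaminated mean; as $\lambda$ decreases, the two distant measurements are switched off and the estimate settles on the inlier cluster.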
From a probabilistic perspective, the switch variable approach can be interpreted as an approximation to marginalization over data association variables. The continuous relaxation $s_{i,j} \in [0, 1]$ approximates the discrete association probabilities, with the prior $\Psi(s_{i,j})$ acting as an entropy regularizer that encourages binary solutions while maintaining differentiability.
Building on this, a probabilistic data association method that models the correspondence between landmarks and observations as a distribution can be expressed as:
$$p(X, L, A \mid Z) \propto p(Z \mid X, L, A) \, p(X) \, p(L) \, p(A)$$
where $A = \{\alpha_{i,j}\}$ represents the data association variables. Variational inference is then employed to approximate the posterior:
$$q^*(X, L, A) = \arg\min_{q} \mathrm{KL}\big( q(X, L, A) \,\Vert\, p(X, L, A \mid Z) \big)$$

3.2.3. Robustness to Atypical Measurements

While M-estimators and switch variables provide considerable robustness to outliers under the commonly assumed i.i.d. noise models, their efficacy against structured, non-i.i.d. atypical measurements—such as temporally correlated noise induced by dynamic reflections, persistent sensor malfunctions, or systematic errors from adverse weather conditions—presents a more complex and less understood challenge. The asymptotic breakdown point provides a crucial theoretical upper bound on the proportion of generic outliers an estimator can tolerate before failing catastrophically. However, this classical metric does not fully capture the nuanced vulnerability to structured corruptions that violate its underlying i.i.d. assumption.
Generalization in unseen, highly dynamic real-world scenarios often hinges on a more subtle capability: the estimator’s ability to discriminate between informative geometric perturbations (carrying legitimate signal about robot or object motion) and deceptive perturbations (arising from structured, non-geometric artifacts). For instance, a moving vehicle’s specular highlights may generate a spatially and temporally correlated stream of feature measurements that appear geometrically consistent but are physically uninformative for pose estimation. Standard robust kernels, tuned for isolated large residuals, may fail to reject such coherent deceptive structures, leading to biased estimates.
This limitation motivates the integration of temporal reasoning and semantic-aware robustness frameworks. Recent approaches, such as those incorporating learned, context-dependent kernel parameters or leveraging sequential hypothesis testing within a filtering paradigm [60], aim to endow estimators with the discernment required for these challenging scenarios. The analysis of estimator performance must, therefore, evolve from a purely statistical outlier-rejection perspective to one that also considers the semantic and temporal structure of measurement errors in dynamic environments.

3.3. Optimization with Semantic Priors: A Methodological Bridge

Beyond handling dynamics as outliers, another optimization strategy is to incorporate semantic understanding as structural priors into the estimation framework. This approach, represented by semantic SLAM systems, extends the state vector and factor graph model to jointly reason about geometry and object categories.

3.3.1. Semantic Factor Graphs

In semantic SLAM, the state vector is extended to include object states: $\Theta = \{X, L, O\}$, where $O = \{o_1, \ldots, o_K\}$ represents dynamic objects. The factor graph incorporates additional semantic factors:
$$p(\Theta \mid Z, S) \propto \prod_k f_k^{\mathrm{geo}} \prod_m f_m^{\mathrm{sem}}(X, O_m) \prod_n f_n^{\mathrm{dyn}}(O_n) \prod_p f_p^{\mathrm{assoc}}$$
where $S$ represents semantic observations, $f^{\mathrm{sem}}$ are semantic observation factors, $f^{\mathrm{dyn}}$ are dynamic model factors, and $f^{\mathrm{assoc}}$ are data association factors. A fundamental theoretical question in semantic SLAM concerns parameter identifiability: whether the true state $\Theta = (X, L, O)$ can be uniquely determined from the observations $Z$. The Fisher Information Matrix (FIM) provides insights into this question:
$$I(\Theta) = \mathbb{E}\left[ \nabla_{\Theta} \log p(Z \mid \Theta) \, \nabla_{\Theta} \log p(Z \mid \Theta)^{\top} \right]$$
When $I(\Theta)$ is rank-deficient, certain state components are unobservable. One such ambiguity can be seen from a minimal derivation: observing a point $p$ on a moving object from a single camera gives a measurement $z = \pi(T_c^{-1} T_o \, p)$, which is invariant under any common rigid transformation $G$ applied to both $T_c$ and $T_o$, since $(G T_c)^{-1} (G T_o) = T_c^{-1} T_o$ and therefore $\pi\big((G T_c)^{-1} (G T_o) \, p\big) = z$. This symmetry induces a nullspace in the Fisher Information Matrix.
In semantic SLAM, common unobservable directions indicated by a rank-deficient Fisher Information Matrix include the object-pose ambiguity, where the relative motion between the camera and a moving object cannot be uniquely disambiguated if the object is observed by only one camera pose, leading to a nullspace in the FIM corresponding to simultaneous transformations of camera and object poses. Furthermore, scale ambiguity arises in monocular semantic SLAM, as the absolute scale of both the camera trajectory and object states may be unobservable without additional scale information from sensors like IMU or object size priors. Finally, object identity ambiguity presents a challenge when multiple objects of the same class have similar appearances, making data association difficult and potentially causing label switching during optimization.
The observability analysis reveals that at least two camera poses must observe the same dynamic object to make its motion observable relative to the static scene. Furthermore, the object’s motion model plays a crucial role in identifiability—constant velocity models provide better observability compared to random walk models.
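The invariance behind the object-pose ambiguity can be checked numerically. The sketch below is only an illustration: the helper names `random_pose` and `observe` are hypothetical, and a normalized pinhole projection (identity intrinsics) is assumed for $\pi$.

```python
import numpy as np

def random_pose(rng):
    """Random rigid transform: QR-based rotation plus a random translation."""
    Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    if np.linalg.det(Q) < 0:
        Q[:, 0] *= -1.0                   # flip one axis so that det(Q) = +1
    T = np.eye(4)
    T[:3, :3] = Q
    T[:3, 3] = rng.standard_normal(3)
    return T

def observe(T_c, T_o, p_obj):
    """z = pi(T_c^{-1} T_o p), with pi the normalized pinhole projection."""
    p_cam = np.linalg.inv(T_c) @ T_o @ np.append(p_obj, 1.0)
    return p_cam[:2] / p_cam[2]

rng = np.random.default_rng(0)
T_c, T_o, G = random_pose(rng), random_pose(rng), random_pose(rng)
p = np.array([0.3, -0.2, 0.5])            # point expressed in the object frame

z_before = observe(T_c, T_o, p)
z_after = observe(G @ T_c, G @ T_o, p)    # same G applied to camera and object
```

Because the two measurements coincide for any $G$, a single observation cannot distinguish the pair $(T_c, T_o)$ from $(G T_c, G T_o)$, which is the nullspace direction discussed above.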
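The system presented in [61] performs object-level SLAM with online object model estimation. Its optimization includes object shape parameters $s_k$:
$$\Theta^* = \arg\min_{\Theta} \sum_i \lVert e_i^{\mathrm{odom}} \rVert^2 + \sum_{i,j} \lVert e_{i,j}^{\mathrm{obs}} \rVert^2 + \sum_k \lVert e_k^{\mathrm{obj}} \rVert^2 + \sum_k \lVert e_k^{\mathrm{shape}} \rVert^2$$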
The introduction of object states significantly increases the computational complexity of the SLAM problem. For a system with $N$ camera poses, $M$ static landmarks, and $K$ dynamic objects each with $T_k$ temporal states, the state dimension becomes:
$$\dim(\Theta) = \underbrace{6N}_{\text{poses}} + \underbrace{3M}_{\text{landmarks}} + \underbrace{\sum_{k=1}^{K} d_k T_k}_{\text{object states}}$$
where $d_k$ is the state dimension per object per time step (typically 6 for SE(3) poses or 9 for poses with velocities).
The worst-case complexity for direct solving is $O\big((\dim(\Theta))^3\big)$. However, the sparsity pattern in semantic factor graphs exhibits a block structure that can be exploited:
$$H = \begin{bmatrix} H_{XX} & H_{XL} & H_{XO} \\ H_{LX} & H_{LL} & H_{LO} \\ H_{OX} & H_{OL} & H_{OO} \end{bmatrix}$$
The Schur complement can be applied hierarchically, first marginalizing landmarks and then objects, reducing the cost of the remaining solve to $O\big((N + \sum_k T_k)^3\big)$.
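The first marginalization step can be sketched as follows. This is a minimal two-block illustration (poses and landmarks only) on a synthetic system, assuming that $H_{LL}$ is block diagonal with one 3×3 block per landmark; the function name `schur_solve` and the random test matrices are assumptions made for the example.

```python
import numpy as np

def schur_solve(H_xx, H_xl, H_ll_blocks, b_x, b_l):
    """Solve [[H_xx, H_xl], [H_xl^T, H_ll]] [dx; dl] = [b_x; b_l]
    with H_ll block diagonal (one 3x3 block per landmark)."""
    n_l = 3 * len(H_ll_blocks)
    H_ll_inv = np.zeros((n_l, n_l))
    for j, blk in enumerate(H_ll_blocks):         # invert each small block independently
        H_ll_inv[3 * j:3 * j + 3, 3 * j:3 * j + 3] = np.linalg.inv(blk)
    S = H_xx - H_xl @ H_ll_inv @ H_xl.T           # reduced pose-only system
    g = b_x - H_xl @ H_ll_inv @ b_l
    dx = np.linalg.solve(S, g)                    # dense solve on poses only
    dl = H_ll_inv @ (b_l - H_xl.T @ dx)           # back-substitution for landmarks
    return dx, dl

rng = np.random.default_rng(1)
n_x, n_lm = 12, 4                                 # two 6-DoF poses, four landmarks
B = rng.standard_normal((n_x, n_x))
H_xx = B @ B.T + 12.0 * np.eye(n_x)
H_xl = 0.1 * rng.standard_normal((n_x, 3 * n_lm))
H_ll_blocks = []
for _ in range(n_lm):
    A = rng.standard_normal((3, 3))
    H_ll_blocks.append(A @ A.T + 3.0 * np.eye(3))
b_x, b_l = rng.standard_normal(n_x), rng.standard_normal(3 * n_lm)

dx, dl = schur_solve(H_xx, H_xl, H_ll_blocks, b_x, b_l)

# Reference: assemble and solve the full system directly.
H_ll = np.zeros((3 * n_lm, 3 * n_lm))
for j, blk in enumerate(H_ll_blocks):
    H_ll[3 * j:3 * j + 3, 3 * j:3 * j + 3] = blk
H_full = np.block([[H_xx, H_xl], [H_xl.T, H_ll]])
ref = np.linalg.solve(H_full, np.concatenate([b_x, b_l]))
```

The only dense factorization is on the pose block, while the landmark block is handled through independent 3×3 inversions; applying the same step again to the object block gives the hierarchical scheme described above.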
For real-time operation, sliding window approaches limit the number of poses and object states in the optimization window. The complexity then becomes $O(w^3)$, where $w$ is the window size, making the approach scalable to long-term operation in dynamic environments. Algorithm 2 presents the procedural steps for constructing and optimizing a semantic factor graph.
Algorithm 2 Semantic Factor Graph Construction and Optimization
function SemanticFactorGraph($X$, $L$, $O$, $Z$)
    Initialize factor graph $G$
    Add odometry factors between consecutive poses $x_i, x_{i+1} \in X$
    Add landmark observation factors between $X$ and $L$ based on $Z$
    for each dynamic object $o_k \in O$ do
        Add object state variables $\{o_k^t\}$ to $G$
        Add motion model factors between consecutive states $o_k^t, o_k^{t+1}$
        Add observation factors linking camera poses $x_i$ to object states $o_k^t$
    end for
    Construct full state vector $\Theta = \{X, L, O\}$
    $\Theta^* \leftarrow \arg\min_{\Theta} \sum_{f \in G} \lVert e_f(\Theta) \rVert^2_{\Omega_f}$ ▹ Joint optimization
    return $\Theta^*$ ▹ Optimized camera poses, landmarks, and object states
end function

3.3.2. Joint SLAM and Multi-Object Tracking

The tight coupling of SLAM with Multi-Object Tracking (MOT) aims to estimate the posterior:
$$p(X, M, O_{1:t} \mid z_{1:t})$$
where $M$ denotes the set of static landmarks. A unified factor graph framework that integrates visual-inertial SLAM with model-based object tracking has been proposed [62,63]. The motion model for object $o_k$ is:
$$o_k^{t+1} = A_k \, o_k^t + B_k \, u_k^t + w_k^t, \quad w_k^t \sim \mathcal{N}(0, Q_k)$$
This creates dynamic factors in the factor graph:
$$f_k^{\mathrm{dyn}} \propto \exp\left( -\tfrac{1}{2} \lVert o_k^{t+1} - A_k \, o_k^t - B_k \, u_k^t \rVert^2_{Q_k^{-1}} \right)$$
Consider a toy example tracking a single object ($k = 1$) over three time steps ($t = 1, 2, 3$). The factor graph would include static SLAM factors (odometry between camera poses $x_1, x_2, x_3$ and landmark observations), dynamic object factors (three object states $o_1^1, o_1^2, o_1^3$ connected by two dynamic factors $f_1^{\mathrm{dyn}}$ from Equation (30) representing the constant-velocity motion prior), and association factors (measurement factors linking each object state $o_1^t$ to its corresponding image detection $z_1^t$, obtained from camera pose $x_t$). This minimal example clarifies how dynamic objects introduce additional variables and factors that are jointly optimized with the traditional SLAM graph.
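The object part of this toy graph can be written as a small weighted linear least-squares problem. The sketch below makes several simplifying assumptions that are not in the text: the object moves along one axis with state $o^t = (\text{position}, \text{velocity})$, the camera poses are treated as known so that each detection directly measures the object position, there is no control input ($B_k u_k^t = 0$), and $Q_k$ is isotropic. The numerical values are illustrative.

```python
import numpy as np

dt = 1.0
A = np.array([[1.0, dt],
              [0.0, 1.0]])               # constant-velocity transition matrix A_k
z = np.array([0.0, 1.0, 2.0])            # position detections at t = 1, 2, 3
sigma_z, sigma_q = 0.1, 0.05             # measurement and process noise std (assumed)

rows, rhs = [], []

# Three measurement factors: position at time t should match detection z[t].
for t in range(3):
    row = np.zeros(6)
    row[2 * t] = 1.0
    rows.append(row / sigma_z)
    rhs.append(z[t] / sigma_z)

# Two dynamic factors: o^{t+1} - A o^t = 0, one row per state component.
for t in range(2):
    for d in range(2):
        row = np.zeros(6)
        row[2 * (t + 1) + d] = 1.0
        row[2 * t:2 * t + 2] -= A[d]
        rows.append(row / sigma_q)
        rhs.append(0.0)

J = np.vstack(rows)                      # stacked, whitened factor Jacobians
r = np.array(rhs)
theta, *_ = np.linalg.lstsq(J, r, rcond=None)
states = theta.reshape(3, 2)             # rows: (position, velocity) at t = 1, 2, 3
```

The detections only constrain positions, yet the velocities are recovered as well because the dynamic factors couple consecutive states; in a full system the camera poses and landmarks would be additional columns of the same Jacobian.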

3.4. Deep Learning-Based End-to-End Optimization

These methods represent the latest paradigm shift, extending differentiable optimization through neural networks and completing the evolution from explicit to implicit modeling.

3.4.1. Differentiable Optimization Layers

Differentiable optimization layers enable end-to-end learning of SLAM systems. A differentiable projective geometry module has also been proposed [64,65], in which feature extraction and matching are learned through a differentiable correspondence network:
$$C = \mathrm{SoftMax}\left( \frac{F_s \cdot F_t^{\top}}{\sqrt{d}} \right)$$
where $F_s$ and $F_t$ are source and target feature maps, $d$ is a normalization constant (typically the feature dimension), and $C$ is the correspondence matrix.
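A soft correspondence matrix of this kind can be sketched in NumPy. The scaling by $\sqrt{d}$ with $d$ equal to the feature dimension, the row-wise normalization, and the function name `soft_correspondence` are assumptions made for the illustration; the cited systems may differ in these details.

```python
import numpy as np

def soft_correspondence(F_s, F_t):
    """Row-wise softmax over scaled feature similarities.

    F_s: (n_s, d) source descriptors, F_t: (n_t, d) target descriptors.
    Returns C of shape (n_s, n_t); row i is a distribution over target features.
    """
    d = F_s.shape[1]
    logits = F_s @ F_t.T / np.sqrt(d)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
F_s = rng.standard_normal((5, 16))
F_t = rng.standard_normal((7, 16))
C = soft_correspondence(F_s, F_t)
```

Every entry of `C` is strictly positive and varies smoothly with the descriptors, which is what allows gradients to reach the feature extractor, in contrast to a hard nearest-neighbor assignment.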
The pose is estimated through a differentiable PnP solver:
$$T^* = \arg\min_{T} \sum_i \lVert \pi(T P_i) - p_i \rVert^2$$
with gradients flowing through the iterative optimization process.

3.4.2. Theoretical Foundations of Differentiable Optimization

Differentiable optimization layers introduce new theoretical challenges regarding gradient computation through iterative solvers. For the differentiable PnP problem in Equation (30), the gradient $\partial T^* / \partial p$ can be computed using the implicit function theorem, by differentiating the stationarity condition $\partial E / \partial T = 0$ at the optimum:
$$\frac{\partial T^*}{\partial p} = -\left( \frac{\partial^2 E}{\partial T^2} \right)^{-1} \frac{\partial^2 E}{\partial T \, \partial p}$$
where $E(T, p) = \sum_i \lVert \pi(T P_i) - p_i \rVert^2$ is the PnP energy function. The conditioning of the Hessian $\partial^2 E / \partial T^2$ determines the stability of gradient computation. Poor conditioning, which occurs when points are nearly coplanar or have limited parallax, can lead to exploding gradients during training. Algorithm 3 provides the concrete implementation of the differentiable PnP solver described in Equation (31).
Algorithm 3 Differentiable Perspective-n-Point (PnP) Solver
Require: 3D point set $\{P_i\} \subset \mathbb{R}^3$, 2D observation set $\{p_i\} \subset \mathbb{R}^2$, maximum iterations $K$, damping factor $\lambda$, convergence tolerance $\epsilon$
Ensure: Camera pose $T \in \mathrm{SE}(3)$ and complete gradient computation graph
Initialize: $T \leftarrow I_{4 \times 4}$ ▹ Identity transformation matrix
Initialize computational graph recorder
for $k = 1$ to $K$ do
    Compute projected points: $\hat{p}_i = \pi(T P_i)$ ▹ $\pi$ is the camera projection function
    Compute residuals: $r_i = \hat{p}_i - p_i$
    Compute total error: $E = \frac{1}{2} \sum_i \lVert r_i \rVert^2$
    if $k = 1$ or $k = K$ then
        Record current computational graph for gradient backpropagation
    end if
    Compute Jacobian matrix: $J = \partial r / \partial \xi$ ▹ $\xi \in \mathfrak{se}(3)$ is the Lie algebra perturbation
    Compute damped Gauss-Newton update: $\Delta\xi = -(J^{\top} J + \lambda I)^{-1} J^{\top} r$
    Apply update on Lie group: $T \leftarrow T \cdot \exp(\Delta\xi)$
    if $\lVert \Delta\xi \rVert < \epsilon$ then
        break ▹ Convergence check
    end if
end for
Construct gradient propagation path through all iterative steps
return $T$ ▹ Pose estimate with attached gradient computation graph
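The iterative core of Algorithm 3 can be sketched in NumPy. This is a simplified illustration rather than a differentiable layer: the computation-graph recording steps are omitted, the Jacobian is approximated by forward finite differences instead of automatic differentiation, $\pi$ is assumed to be a normalized pinhole projection, and the twist is ordered translation first. All function names are hypothetical.

```python
import numpy as np

def hat(w):
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def se3_exp(xi):
    """Exponential map for a twist xi = (rho, phi), translation part first."""
    rho, phi = xi[:3], xi[3:]
    theta = np.linalg.norm(phi)
    W = hat(phi)
    if theta < 1e-8:                       # first-order expansion near zero rotation
        R, V = np.eye(3) + W, np.eye(3) + 0.5 * W
    else:
        a = np.sin(theta) / theta
        b = (1.0 - np.cos(theta)) / theta**2
        c = (theta - np.sin(theta)) / theta**3
        R = np.eye(3) + a * W + b * W @ W
        V = np.eye(3) + b * W + c * W @ W
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, V @ rho
    return T

def project(T, P):
    Pc = P @ T[:3, :3].T + T[:3, 3]        # points in the camera frame
    return Pc[:, :2] / Pc[:, 2:3]          # normalized pinhole projection

def residuals(T, P, p):
    return (project(T, P) - p).ravel()

def solve_pnp(P, p, max_iters=50, lam=1e-6, eps=1e-10, h=1e-6):
    T = np.eye(4)
    for _ in range(max_iters):
        r = residuals(T, P, p)
        J = np.zeros((r.size, 6))
        for k in range(6):                 # finite-difference Jacobian w.r.t. the twist
            d = np.zeros(6)
            d[k] = h
            J[:, k] = (residuals(T @ se3_exp(d), P, p) - r) / h
        dxi = -np.linalg.solve(J.T @ J + lam * np.eye(6), J.T @ r)
        T = T @ se3_exp(dxi)               # right-multiplicative update on SE(3)
        if np.linalg.norm(dxi) < eps:
            break
    return T

rng = np.random.default_rng(0)
P = np.column_stack([rng.uniform(-1, 1, 8), rng.uniform(-1, 1, 8), rng.uniform(4, 8, 8)])
T_true = se3_exp(np.array([0.1, -0.2, 0.3, 0.05, -0.1, 0.08]))
p = project(T_true, P)                     # noise-free synthetic observations
T_est = solve_pnp(P, p)
```

In a learned pipeline the same loop would be expressed in an automatic-differentiation framework so that gradients can flow either through the unrolled iterations or through the implicit-function formula given earlier.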
Lipschitz continuity of the learned components is crucial for training stability. For the feature extraction network $f_\theta$, we require
$$\lVert f_\theta(I_1) - f_\theta(I_2) \rVert \le L \, \lVert I_1 - I_2 \rVert$$
where $L$ is the Lipschitz constant.
The convergence of end-to-end learned SLAM systems lacks the well-established guarantees of traditional geometric methods. However, recent work has shown that under mild assumptions about the network architecture and training procedure, the learned components can converge to approximations of geometric primitives, providing empirical if not theoretical guarantees of performance.
Several empirical heuristics are commonly employed to enforce the required theoretical properties in practice. Spectral normalization can enforce Lipschitz continuity during training by controlling the spectral norm of network layers. Curriculum learning gradually increases optimization complexity during training to avoid poor local minima. Geometry constraints, such as incorporating depth measurements or stereo constraints, help anchor scene geometry. Additionally, pose priors from inertial measurements or odometry are used to constrain camera pose estimation.
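Spectral normalization, the first of these heuristics, can be sketched with a power iteration on a single weight matrix. The function name `spectral_normalize`, the iteration count, and the example matrix are assumptions made for the illustration; deep learning frameworks typically keep the power-iteration vector between training steps instead of restarting it.

```python
import numpy as np

def spectral_normalize(W, n_iter=50, seed=0):
    """Estimate the largest singular value of W by power iteration and rescale W."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(W.shape[0])
    for _ in range(n_iter):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    sigma = float(u @ W @ v)               # estimate of the spectral norm of W
    return W / sigma, sigma

W = np.array([[3.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])            # singular values 3 and 1
W_sn, sigma = spectral_normalize(W)
```

After rescaling, the linear map $x \mapsto W_{sn} x$ is 1-Lipschitz in the Euclidean norm, so a network composed of such layers and 1-Lipschitz activations has a Lipschitz constant of at most 1.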

3.4.3. Neural Scene Representations

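Neural Radiance Fields (NeRF) have revolutionized scene representation [66,67]. Joint optimization of camera poses and the scene representation can be written as:
$$\Theta^*, X^* = \arg\min_{\Theta, X} \sum_{t, r} \lVert \hat{C}(r; \Theta, T_t) - C(r) \rVert^2$$
where $\Theta$ here denotes the parameters of the scene representation, $T_t$ is the camera pose at time $t$, and $\hat{C}(r)$ is the volume rendered color for ray $r$:
$$\hat{C}(r) = \int_{t_n}^{t_f} T(t) \, \sigma(r(t)) \, c(r(t), d) \, dt$$
with accumulated transmittance $T(t) = \exp\left( -\int_{t_n}^{t} \sigma(r(s)) \, ds \right)$, where $t$ in the integral parameterizes distance along the ray.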
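In practice the rendering integral is evaluated by quadrature over samples along the ray. The sketch below uses the standard piecewise-constant approximation, in which the density and color are constant within each bin; the function name `render_ray` and the test values are assumptions made for the illustration.

```python
import numpy as np

def render_ray(sigma, color, t_edges):
    """Piecewise-constant quadrature of the volume rendering integral.

    sigma: (N,) densities, color: (N, 3) colors, t_edges: (N + 1,) bin edges
    between the near bound t_n and the far bound t_f.
    """
    delta = np.diff(t_edges)                         # bin lengths
    alpha = 1.0 - np.exp(-sigma * delta)             # opacity of each bin
    optical_depth = np.concatenate([[0.0], np.cumsum(sigma * delta)[:-1]])
    trans = np.exp(-optical_depth)                   # transmittance up to each bin
    weights = trans * alpha
    return weights @ color

t_edges = np.linspace(2.0, 6.0, 65)                  # 64 bins on [t_n, t_f] = [2, 6]
sigma = np.full(64, 0.7)                             # homogeneous medium
base = np.array([0.2, 0.5, 0.9])
color = np.tile(base, (64, 1))
C_hat = render_ray(sigma, color, t_edges)
expected = base * (1.0 - np.exp(-0.7 * 4.0))         # closed form for constant sigma, c
```

For a homogeneous medium the quadrature weights telescope, so the discrete result matches the closed-form integral exactly; for a learned field the same weights are what make the rendered color differentiable with respect to both the scene parameters and the camera pose.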

3.4.4. Generalization Theory for Neural Representations

Neural scene representations like NeRF pose unique theoretical challenges due to their over-parameterization. The generalization error can be bounded using the concept of neural tangent kernel (NTK):
$$\mathcal{L}_{\mathrm{test}} \le \mathcal{L}_{\mathrm{train}} + O\left( \sqrt{\frac{\text{effective params}}{N_{\mathrm{rays}}}} \right)$$
The implicit regularization of gradient-based optimization in over-parameterized networks tends to find solutions with low complexity, as characterized by the frequency bias of neural networks—they preferentially learn low-frequency functions before fitting high-frequency noise.
For joint optimization of camera poses and scene representation, the identifiability problem becomes more severe. The network can compensate for incorrect camera poses by learning distorted scene geometries, leading to local minima that are photometrically accurate but geometrically inconsistent. In practice, the heuristics described at the end of Section 3.4.2 (spectral normalization, curriculum learning, geometry constraints, and pose priors) are applied to mitigate this ambiguity.
The sample complexity of neural scene representations grows with scene complexity rather than dataset size, making them particularly suitable for SLAM applications where we optimize a single scene with many observations [68,69]. Furthermore, their generalization to atypical measurements, such as those arising from adverse weather or sensor failure, remains an open challenge, as the training data distribution may not adequately cover these edge cases.
A direct comparison and summary of these methods is presented in Table 1.

4. Mathematical Challenges and Future Directions

4.1. Foundational and Emerging Challenges in Optimization for Dynamic V-SLAM

The transition from static to dynamic environments does not merely add complexity; it fundamentally invalidates core assumptions of traditional V-SLAM and introduces new layers to the optimization problem. The challenges selected for discussion here are not an exhaustive list but represent the most critical bottlenecks that arise directly from extending the standard optimization framework to the dynamic formulation. They can be categorized into foundational challenges, which are inherent to the problem definition, and emerging challenges, which are amplified or created by modern solution paradigms.
The foundational challenges form an interdependent triad that defines the essential difficulty of dynamic V-SLAM. They arise directly from augmenting the state vector and corrupting the observation model with structured outliers.
First, scalability and computational tractability represent the cardinal challenge, because the state dimension grows with every additional object and every additional temporal state. The real challenge lies in managing the effective complexity within real-time constraints. This necessitates approximate methods like sliding windows or incremental solvers, which trade optimality guarantees for tractability. Consequently, the scalability challenge is not just about processing larger problems but about designing optimization algorithms whose complexity grows gracefully with scene dynamics.
Second, parameter identifiability and observability present a key difficulty because dynamic SLAM is a severely under-constrained problem. A primary manifestation is the object-camera motion ambiguity, where the motion of an object observed from a limited set of viewpoints relative to the camera is not uniquely determinable. Mathematically, this induces a null space in the Fisher Information Matrix, making certain state components unobservable. This lack of identifiability leads to ill-posed optimization problems where the cost function has flat directions or multiple local minima that are geometrically distinct but photometrically similar. Ensuring identifiability therefore requires either sufficient viewpoint diversity, strong dynamical priors, or additional semantic or scale constraints.
Third, robustness to structured outliers is critical as dynamic objects generate measurements that are not random noise but structured outliers, spatially and temporally coherent signals that systematically violate the static observation model. Traditional M-estimators, designed for i.i.d. outliers, can fail against such structured corruption. Thus, the challenge evolves from simple outlier rejection to the more difficult task of inlier and outlier model selection within the optimization loop. This demands robust kernels or probabilistic frameworks that can discern geometrically consistent but physically invalid data associations.
In addition to these foundational issues, the adoption of advanced methods to solve them introduces new, secondary layers of difficulty, constituting the emerging challenges.
Specifically, the integration of deep learning into optimization pipelines, through learned features, differentiable layers, or implicit neural representations, shifts challenges from explicit modeling to generalization and stability. Key issues here include the Lipschitz continuity of learned components to guarantee stable gradients during optimization, the lack of convergence guarantees compared to geometric methods, and the identifiability problem in joint pose-NeRF optimization, where a neural radiance field can compensate for incorrect poses, leading to plausible yet geometrically inaccurate solutions.
Furthermore, theoretical guarantees for data-driven methods remain a significant gap. While empirical success is evident, providing certificates for learned models is essential. This includes understanding the sample complexity of neural scene representations, bounding their generalization error in unseen environments, and providing robustness guarantees against adversarial or out-of-distribution inputs. This challenge is particularly crucial for deploying learning-based SLAM in safety-critical applications.
Finally, the scalability of implicit representations themselves poses an open problem. Neural scene representations such as NeRF or Gaussian Splatting, while powerful, face scalability issues where their training and inference cost scales with scene complexity and required rendering resolution, posing a challenge for long-term, large-scale SLAM. Optimizing compact and efficient neural architectures for mapping therefore remains an active area of research.
In summary, the challenges highlighted here are selected because they represent fundamental limits arising from the core mathematical formulation of dynamic SLAM, namely scalability, identifiability, and robustness, and pivotal hurdles introduced by the field’s current trajectory towards learned, implicit models. Addressing them is synonymous with advancing the state of the art in robust, real-time perception for autonomous systems.

4.2. Future Research Directions

For lifelong SLAM and continual learning, mathematical frameworks must address catastrophic forgetting. Bayesian nonparametrics offer promising foundations:
$$G \sim \mathrm{DP}(\alpha, H), \quad \theta_i \mid G \sim G$$
where the Dirichlet Process (DP) prior allows for unbounded complexity growing with data.
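The way a Dirichlet Process prior lets model complexity grow with data can be illustrated with its Chinese restaurant process representation, in which the $i$-th item joins an existing cluster with probability proportional to the cluster size and opens a new cluster with probability proportional to $\alpha$. The sketch below is a generic sampler, not a SLAM component; the function name `crp_assignments` and the parameter values are assumptions made for the illustration.

```python
import numpy as np

def crp_assignments(n, alpha, seed=0):
    """Sample cluster labels for n items from a Chinese restaurant process."""
    rng = np.random.default_rng(seed)
    counts = []                            # current cluster sizes
    labels = []
    for i in range(n):
        # Existing cluster k: counts[k] / (i + alpha); new cluster: alpha / (i + alpha).
        probs = np.array(counts + [alpha], dtype=float) / (i + alpha)
        k = int(rng.choice(len(probs), p=probs))
        if k == len(counts):
            counts.append(1)               # open a new cluster
        else:
            counts[k] += 1
        labels.append(k)
    return labels, counts

labels, counts = crp_assignments(n=200, alpha=2.0)
```

The number of clusters is not fixed in advance and tends to increase slowly as more items arrive, which is the property that makes such priors attractive for maps that must keep absorbing new scene elements.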
For compositional scene understanding, future systems should reason about object relationships:
$$p(S \mid O) = \prod_{\text{relations } r} p\big(r(o_i, o_j)\big)$$
where $S$ represents scene structure and $r$ are spatial, semantic, or functional relations.

5. Conclusions

This review has systematically examined mathematical optimization methods and their progress in visual SLAM for robots operating in complex dynamic environments. By establishing a taxonomy centered on mathematical optimization paradigms, we have organized existing approaches into three main categories: robust estimation theory-based methods for outlier rejection, semantic information and factor graph-based methods for explicit dynamic object modeling, and deep learning-based end-to-end optimization methods.
Our analysis reveals that the field of dynamic V-SLAM is evolving along three distinct yet interconnected trajectories. First, tighter integration of geometry and semantics: mathematical frameworks are transitioning from purely geometric optimization to unified formulations that simultaneously handle discrete semantic variables and continuous geometric states. This enables robots to achieve more reliable localization and mapping with contextual scene understanding. Second, a paradigm shift from explicit modeling to implicit learning: optimization methods traditionally relying on precise geometric models and hand-crafted features are being complemented by data-driven implicit representations and differentiable optimization. This shift offers new possibilities for robots to adapt to unseen, unstructured dynamic environments. Third, algorithmic scalability for long-term autonomy: as robots operate for extended periods in complex environments, mathematical optimization algorithms must balance computational complexity, memory efficiency, and long-term consistency. Techniques such as incremental optimization and continual learning provide promising pathways to address this challenge.
Looking forward, we identify several critical research directions for robotic dynamic V-SLAM:
Development of Unified Mathematical Frameworks: Creating frameworks that seamlessly integrate geometry, semantics, physical constraints, and task objectives to enable more comprehensive and robust environmental understanding by robots.
Theoretical Guarantees for Learning-Based Approaches: While data-driven methods demonstrate strong empirical performance in handling dynamic environments, their stability, convergence, and generalization capabilities require more rigorous theoretical analysis—particularly crucial for safety-critical robotic applications.
Scalable Optimization Algorithms for Lifelong Operation: Developing theoretically sound incremental and scalable optimization algorithms that address the needs of robots operating long-term in large-scale, unstructured, and dynamically changing environments.
Exploration of Compositional Scene Understanding Models: Investigating spatial, semantic, and functional relationships among objects in scenes to develop mathematical representations and reasoning methods that support advanced robotic tasks such as manipulation, navigation, and human-robot interaction.
As robots transition from controlled laboratory settings to unstructured real-world environments, the mathematical sophistication of V-SLAM systems must correspondingly increase. We believe the methods and perspectives surveyed in this work provide a solid foundation for these future developments and will inspire continued innovation at the intersection of robotic perception and mathematical optimization.

Funding

This research was funded by the National Natural Science Foundation of China under Grant 62073245 and Shenzhen Science and Technology Program under Grant KCXFZ20240903093005008.

Data Availability Statement

No new data were created or analyzed in this study.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Notation and Symbols

This appendix provides a comprehensive summary of the mathematical notation and symbols used throughout this review. The symbols are categorized by their domain of application to facilitate clarity and ensure consistency.

Appendix A.1. States and Variables

Table A1. Notation for states and variables.
| Symbol | Type | Description |
|---|---|---|
| $t$ | Scalar | Time step or index. |
| $x_t$ | Vector | Robot state vector at time $t$. |
| $X = \{x_1, \ldots, x_N\}$ | Set | Robot trajectory from time 1 to $N$. |
| $m_j$ | Vector | 3D coordinates of the $j$-th static landmark. |
| $M = \{m_1, \ldots, m_M\}$ | Set | Set of all static landmarks. |
| $o_k^t$ | Vector | State of the $k$-th dynamic object at time $t$. |
| $D = \{o_1, \ldots, o_K\}$ | Set | Set of states for all dynamic objects. |
| $O$ | Set | Set of dynamic object states (synonymous with $D$, used in semantic SLAM contexts). |
| $\Theta$ | Tuple/Set | Full state vector, typically $\Theta = (X, M, D)$. |
| $T \in \mathrm{SE}(3)$ | Matrix | 3D rigid body transformation matrix, representing a pose. |
| $\xi \in \mathbb{R}^6$ | Vector | Lie algebra twist vector corresponding to an $\mathrm{SE}(3)$ transformation. |
| $s_{ij} \in [0, 1]$ | Scalar | Switch variable, representing the probability that measurement $z_{ij}$ is an inlier. |
| $a_{ij}$ | Discrete variable | Data association variable, denoting the correspondence between measurement $z_{ij}$ and landmark $m_j$. |
| $A$ | Set | Set of all data association variables. |

Appendix A.2. Observations, Models, and Functions

Table A2. Notation for observations, models, and functions.
| Symbol | Type | Description |
|---|---|---|
| $z_t$ | Vector | Sensor observation obtained at time $t$. |
| $z_{ij}$ | Vector | Observation of landmark $m_j$ from pose $x_i$. |
| $h(x_i, m_j)$ | Function | Observation model, predicting the measurement of landmark $m_j$ from state $x_i$. |
| $\pi: \mathbb{R}^3 \to \mathbb{R}^2$ | Function | Camera projection model, mapping a 3D point to 2D pixel coordinates. |
| $e_{ij}$ | Vector | Reprojection error, $e_{ij} = \pi(T_i^{-1} m_j) - z_{ij}$. |
| $e_i^{\mathrm{odom}}$ | Vector | Error term corresponding to an odometry factor. |
| $e_k^{\mathrm{loop}}$ | Vector | Error term corresponding to a loop closure factor. |
| $\rho(\cdot)$ | Function | Robust kernel function (e.g., Huber, Cauchy). |
| $\psi(r)$ | Function | Influence function, $\psi(r) = \partial \rho(r) / \partial r$. |
| $f_k(X_k, L_k)$ | Function | The $k$-th factor in a factor graph, representing a probabilistic constraint. |

Appendix A.3. Probability and Optimization

Table A3. Notation for probability and optimization.
| Symbol | Type | Description |
|---|---|---|
| $p(\cdot)$ | Function | Probability density function. |
| $n_t$ | Vector | Observation noise, typically assumed $n_t \sim \mathcal{N}(0, \Sigma_t)$. |
| $\Sigma$ | Matrix | Covariance matrix. |
| $\Omega$ | Matrix | Information matrix (inverse covariance), $\Omega = \Sigma^{-1}$. |
| $\epsilon$ | Scalar | Contamination ratio in a mixture model. |
| $\lVert e \rVert^2_{\Omega}$ | Scalar | Squared Mahalanobis distance, $\lVert e \rVert^2_{\Omega} = e^{\top} \Omega \, e$. |
| $J$ | Matrix | Jacobian matrix of the error vector $e$ with respect to the state perturbation $\delta\theta$. |
| $H$ | Matrix | Hessian matrix (or its Gauss-Newton approximation), $H = J^{\top} \Omega J$. |
| $H_{XX}, H_{LL}$ | Matrix | Blocks of the Hessian matrix corresponding to pose-pose and landmark-landmark terms, respectively. |
| $H_{XL}$ | Matrix | Block of the Hessian matrix corresponding to pose-landmark coupling terms. |
| $I(\Theta)$ | Matrix | Fisher Information Matrix. |

Appendix A.4. Sets and Algebraic Structures

Table A4. Notation for sets and algebraic structures.
| Symbol | Type | Description |
|---|---|---|
| $\mathrm{SO}(3)$ | Group | The 3D rotation group (Special Orthogonal Group). |
| $\mathrm{SE}(3)$ | Group | The 3D rigid body transformation group (Special Euclidean Group). |
| $\mathfrak{so}(3)$ | Lie algebra | The Lie algebra associated with $\mathrm{SO}(3)$. |
| $\mathfrak{se}(3)$ | Lie algebra | The Lie algebra associated with $\mathrm{SE}(3)$. |
| $\exp(\cdot)$ | Function | Exponential map from a Lie algebra to its Lie group. |
| $\log(\cdot)$ | Function | Logarithmic map from a Lie group to its Lie algebra. |
| $(\cdot)^{\wedge}$ | Operator | The "hat" operator mapping a vector $\xi \in \mathbb{R}^6$ to an element of $\mathfrak{se}(3)$. |

References

  1. Macario Barros, A.; Michel, M.; Moline, Y.; Corre, G.; Carrel, F. A comprehensive survey of visual SLAM algorithms. Robotics 2022, 11, 24.
  2. Tourani, A.; Bavle, H.; Sanchez-Lopez, J.L.; Voos, H. Visual SLAM: What are the current trends and what to expect? Sensors 2022, 22, 9297.
  3. Wang, Y.; Tian, Y.; Chen, J.; Xu, K.; Ding, X. A survey of visual SLAM in dynamic environment: The evolution from geometric to semantic approaches. IEEE Trans. Instrum. Meas. 2024, 73, 1–21.
  4. Ballester, I.; Fontán, A.; Civera, J.; Strobl, K.H.; Triebel, R. DOT: Dynamic object tracking for visual SLAM. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 11705–11711.
  5. Zhang, J.; Henein, M.; Mahony, R.; Ila, V. VDO-SLAM: A visual dynamic object-aware SLAM system. arXiv 2020, arXiv:2005.11052.
  6. Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 3354–3361.
  7. Sturm, J.; Engelhard, N.; Endres, F.; Burgard, W.; Cremers, D. A benchmark for the evaluation of RGB-D SLAM systems. In Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, Vilamoura-Algarve, Portugal, 7–12 October 2012; pp. 573–580.
  8. Gao, X.; Zhang, T. Introduction to Visual SLAM: From Theory to Practice; Springer Nature: Berlin/Heidelberg, Germany, 2021.
  9. Chen, W.; Shang, G.; Ji, A.; Zhou, C.; Wang, X.; Xu, C.; Li, Z.; Hu, K. An overview on visual SLAM: From tradition to semantic. Remote Sens. 2022, 14, 3010.
  10. Zhong, J.; Ren, H.; Chen, Q.; Zhang, H. A review of deep learning-based localization, mapping and 3D reconstruction for endoscopy. J. Micro Bio Robot. 2025, 21, 1.
  11. Strasdat, H.; Montiel, J.M.M.; Davison, A.J. Visual SLAM: Why filter? Image Vis. Comput. 2012, 30, 65–77.
  12. Li, M.; Mourikis, A.I. High-precision, consistent EKF-based visual-inertial odometry. Int. J. Robot. Res. 2013, 32, 690–711.
  13. Lategahn, H.; Geiger, A.; Kitt, B. Visual SLAM for autonomous ground vehicles. In Proceedings of the 2011 IEEE International Conference on Robotics and Automation, Shanghai, China, 9–13 May 2011; pp. 1732–1737.
  14. Munguia, R.; Trujillo, J.-C.; Grau, A. UAV Navigation Using EKF-MonoSLAM Aided by Range-to-Base Measurements. Drones 2025, 9, 570.
  15. Taketomi, T.; Uchiyama, H.; Ikeda, S. Visual SLAM algorithms: A survey from 2010 to 2016. IPSJ Trans. Comput. Vis. Appl. 2017, 9, 16.
  16. Dubbelman, G.; Browning, B. COP-SLAM: Closed-form online pose-chain optimization for visual SLAM. IEEE Trans. Robot. 2015, 31, 1194–1213.
  17. Zhang, H.; Wang, X.; Yin, X.; Du, M.; Liu, C.; Chen, Q. Geometry-constrained scale estimation for monocular visual odometry. IEEE Trans. Multimed. 2021, 24, 3144–3156.
  18. Sharafutdinov, D.; Griguletskii, M.; Kopanev, P.; Kurenkov, M.; Ferrer, G.; Burkov, A.; Gonnochenko, A.; Tsetserukou, D. Comparison of modern open-source visual SLAM approaches. J. Intell. Robot. Syst. 2023, 107, 43.
  19. Campos, C.; Elvira, R.; Rodríguez, J.J.G.; Montiel, J.M.; Tardós, J.D. ORB-SLAM3: An accurate open-source library for visual, visual-inertial and multi-map SLAM. IEEE Trans. Robot. 2021, 37, 1874–1890.
  20. Zang, Q.; Zhang, K.; Wang, L.; Wu, L. An Adaptive ORB-SLAM3 System for Outdoor Dynamic Environments. Sensors 2023, 23, 1359.
  21. Tong, W.; Dai, K.; Zeng, L. VSG-SLAM: A Dense Visual Semantic SLAM with Gaussian Splatting. In Proceedings of the 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Hangzhou, China, 19–25 October 2025; pp. 9044–9050.
  22. Shu, C.; Luo, Y. Multi-Modal Feature Constraint Based Tightly Coupled Monocular Visual-LiDAR Odometry and Mapping. IEEE Trans. Intell. Veh. 2023, 8, 3384–3393.
  23. Pu, H.; Luo, J.; Wang, G.; Hu, H.; Yin, X. Visual SLAM integration with semantic segmentation and deep learning: A review. IEEE Sens. J. 2023, 23, 22119–22138.
  24. Qin, T.; Li, P.; Shen, S. VINS-Mono: A robust and versatile monocular visual-inertial state estimator. IEEE Trans. Robot. 2018, 34, 1004–1020.
  25. Zhang, H.; Wu, Z.; Li, H.; Shangguan, Q.; An, K. Geometry-Constrained Monocular Scale Estimation Using Semantic Segmentation for Dynamic Scenes. IEEE Trans. Instrum. Meas. 2025, 74, 7514011.
  26. Bruno, H.M.S.; Colombini, E.L. LIFT-SLAM: A deep-learning feature-based monocular visual SLAM method. Neurocomputing 2021, 455, 97–110.
  27. Milz, S.; Arbeiter, G.; Witt, C.; Simon, S.; Yogamani, S. Visual SLAM for automated driving: Exploring the applications of deep learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 247–257.
  28. Teed, Z.; Deng, J. DROID-SLAM: Deep visual SLAM for monocular, stereo, and RGB-D cameras. Adv. Neural Inf. Process. Syst. 2021, 34, 16558–16569.
  29. Rosinol, A.; Leonard, J.J.; Carlone, L. Nerf-slam: Real-time dense monocular slam with neural radiance fields. In Proceedings of the 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Detroit, MI, USA, 1–5 October 2023; pp. 3437–3444. [Google Scholar] [CrossRef]
  30. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 4015–4026. [Google Scholar] [CrossRef]
  31. Yan, C.; Qu, D.; Xu, D.; Bao, H.; Tang, R.; Liu, Y.; Cui, Z. Gs-slam: Dense visual slam with 3d gaussian splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 19595–19604. [Google Scholar] [CrossRef]
  32. Matsuki, H.; Murai, R.; Kelly, P.H.J.; Davison, A.J. Gaussian splatting slam. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 18039–18048.
  33. Qu, Z.; Zhang, Z.; Liu, C. Visual slam with 3d gaussian primitives and depth priors enabling novel view synthesis. In Proceedings of the 2024 4th International Conference on Computer Science, Electronic Information Engineering and Intelligent Control Technology (CEI), Guangzhou, China, 8–10 November 2024; pp. 1–6.
  34. Bowman, S.L.; Atanasov, N.; Daniilidis, K.; Pappas, G.J. Probabilistic data association for semantic slam. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 1722–1729.
  35. Doherty, K.; Fourie, D.; Leonard, J. Multimodal semantic slam with probabilistic data association. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 2419–2425.
  36. Sumikura, S.; Shibuya, M.; Sakurada, K. OpenVSLAM: A versatile visual SLAM framework. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 2292–2295.
  37. Ćwian, K.; Nowicki, M.R.; Wietrzykowski, J.; Skrzypczyński, P. Large-scale LiDAR SLAM with factor graph optimization on high-level geometric features. Sensors 2021, 21, 3445.
  38. Chen, Y.; Xu, B.; Wang, B.; Na, J.; Yang, P. GNSS reconstrainted visual–inertial odometry system using factor graphs. IEEE Geosci. Remote Sens. Lett. 2023, 20, 1–5.
  39. Hashim, H.A.; Eltoukhy, A.E.E. Nonlinear filter for simultaneous localization and mapping on a matrix lie group using IMU and feature measurements. IEEE Trans. Syst. Man Cybern. Syst. 2021, 52, 2098–2109.
  40. Brossard, M.; Bonnabel, S.; Barrau, A. Unscented Kalman filtering on Lie groups for fusion of IMU and monocular vision. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 1–9.
  41. Heo, S.; Park, C.G. Consistent EKF-based visual-inertial odometry on matrix Lie group. IEEE Sens. J. 2018, 18, 3780–3788.
  42. Strasdat, H. Local Accuracy and Global Consistency for Efficient Visual SLAM. Ph.D. Thesis, Department of Computing, Imperial College London, London, UK, 2012.
  43. Huai, J.; Lin, Y.; Zhuang, Y.; Shi, M. Consistent right-invariant fixed-lag smoother with application to visual inertial SLAM. Proc. AAAI Conf. Artif. Intell. 2021, 35, 6084–6092.
  44. Ning, Y. Efficient Right-Decoupled Composite Manifold Optimization for Visual Inertial Odometry. Comput. Intell. 2025, 41, e70127.
  45. Zhang, M.; Zuo, X.; Chen, Y.; Zhang, W.; Liu, Y. Pose estimation for ground robots: On manifold representation, integration, reparameterization, and optimization. IEEE Trans. Robot. 2021, 37, 1081–1099.
  46. Ge, Y.; Zhang, L.; Wu, Y.; Wang, H.; Zhang, H. PIPO-SLAM: Lightweight visual-inertial SLAM with preintegration merging theory and pose-only descriptions of multiple view geometry. IEEE Trans. Robot. 2024, 40, 2046–2059.
  47. Vial, P.; Solà, J.; Palomeras, N.; Vidal-Calleja, T.; Andrade-Cetto, J. On Lie group IMU and linear velocity preintegration for autonomous navigation considering the Earth rotation compensation. IEEE Trans. Robot. 2024, 41, 1346–1364.
  48. Pedrosa, E.; Pereira, A.; Lau, N. A non-linear least squares approach to SLAM using a dynamic likelihood field. J. Intell. Robot. Syst. 2019, 93, 519–532.
  49. Sarikamis, F.A.; Alatan, A.A. Ig-slam: Instant gaussian slam. arXiv 2024, arXiv:2408.01126.
  50. Wu, F.; Liu, D.; An, K.; Zhang, H. Image retrieval based on dimensionality reduction of second-order information. Signal Image Video Process. 2024, 18, 2723–2731.
  51. Wang, H.; Wang, J.; Agapito, L. Co-slam: Joint coordinate and sparse parametric encodings for neural real-time slam. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 13293–13302.
  52. Blöchliger, F.; Fehr, M.; Dymczyk, M.; Schneider, T.; Siegwart, R. Topomap: Topological mapping and navigation based on visual slam maps. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, QLD, Australia, 21–25 May 2018; pp. 3818–3825.
  53. Vey, S.; Voigt, A. AMDiS: Adaptive multidimensional simulations. Comput. Vis. Sci. 2007, 10, 57–67.
  54. Zhang, H.; Wang, X.; Du, X.; Liu, M.; Chen, Q. Dynamic Environments Localization via Dimensions Reduction of Deep Learning Features. In Proceedings of the International Conference on Computer Vision Systems, Shenzhen, China, 10–13 July 2017; Springer: Cham, Switzerland, 2017.
  55. Komatsu Ltd. FrontRunner Autonomous Haulage System: Technical Specifications and Field Performance; Komatsu Ltd.: Tokyo, Japan, 2023; Available online: https://www.komatsu.com/en-us/technology/smart-mining/loading-and-haulage/autonomous-haulage-system (accessed on 7 December 2025).
  56. DJI Technology. Matrice 300 RTK: Enterprise Drone Platform Technical White Paper; DJI Innovation: Shenzhen, China, 2023. Available online: https://igidrone.com/dji-matrice-300-rtk-specifications/ (accessed on 10 December 2025).
  57. Sun, S.; Zhang, G.; Wang, C.; Jia, A.; Hu, J. Differentiable compositional kernel learning for Gaussian processes. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 4828–4837.
  58. Huang, T.; Gamal, H.E. An Efficient Difference-of-Convex Solver for Privacy Funnel. In Proceedings of the 2024 IEEE International Symposium on Information Theory Workshops (ISIT-W), Athens, Greece, 7–12 July 2024; pp. 1–6.
  59. James, G.; Witten, D.; Hastie, T.; Tibshirani, R.; Taylor, J. Linear model selection and regularization. In An Introduction to Statistical Learning: With Applications in Python; Springer International Publishing: Cham, Switzerland, 2023; pp. 229–288.
  60. Ge, Y.; Kaltiokallio, O.; Xia, Y.; Chouvardas, S.; Xia, Y.; Bugallo, M.F.; Koivunen, V.; Wymeersch, H. Batch SLAM with PMBM Data Association Sampling and Graph-Based Optimization. IEEE Trans. Signal Process. 2025, 73, 2139–2153.
  61. Wu, Y.; Zhang, Y.; Zhu, D.; Zhao, C.; Zhao, X.; Gu, F.; Zhang, J. An object slam framework for association, mapping, and high-level tasks. IEEE Trans. Robot. 2023, 39, 2912–2932.
  62. Van Nam, D.; Gon-Woo, K. Learning observation model for factor graph based-state estimation using intrinsic sensors. IEEE Trans. Autom. Sci. Eng. 2022, 20, 2049–2062.
  63. Lai, T. A review on visual-slam: Advancements from geometric modelling to learning-based semantic scene understanding using multi-modal sensor fusion. Sensors 2022, 22, 7265.
  64. Chen, L.; Rottensteiner, F.; Heipke, C. Feature detection and description for image matching: From hand-crafted design to deep learning. Geo-Spat. Inf. Sci. 2021, 24, 58–74.
  65. Kazi, A.; Cosmo, L.; Ahmadi, S.; Navab, N.; Bronstein, M.M. Differentiable graph module (dgm) for graph convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 1606–1617.
  66. Lee, M. The geometry of feature space in deep learning models: A holistic perspective and comprehensive review. Mathematics 2023, 11, 2375.
  67. Kratsios, A.; Papon, L. Universal approximation theorems for differentiable geometric deep learning. J. Mach. Learn. Res. 2022, 23, 1–73.
  68. Herrera-Granda, E.P.; Torres-Cantero, J.C.; Peluffo-Ordonez, D.H. Monocular visual SLAM, visual odometry, and structure from motion methods applied to 3D reconstruction: A comprehensive survey. Heliyon 2024, 10, e37356.
  69. Yin, H.; Li, S.; Tao, Y.; Wang, D.; Zhang, Y.; Liu, H. Dynam-SLAM: An accurate, robust stereo visual-inertial SLAM method in dynamic environments. IEEE Trans. Robot. 2022, 39, 289–308.
Figure 1. A taxonomy of dynamic V-SLAM optimization methods, organizing approaches into three core paradigms based on their mathematical treatment of dynamic elements. Each paradigm is characterized by its core idea, key techniques, and typical application scenarios. The left-to-right arrangement reflects the historical evolution from traditional robust estimation to modern learning-based methods.
Table 1. Method comparison summary.
| | Robust Estimation | Semantic | Deep Learning |
|---|---|---|---|
| Assumptions | Majority inliers | Reliable detection | Training coverage |
| | Sparse outliers | Simple dynamics | Generalization |
| | Gaussian noise | Correct association | Sufficient compute |
| Sensors | Monocular/Stereo | RGB-D/LiDAR | Monocular |
| Robustness | M-estimators | Explicit modeling | Diff. layers |
| | Switch variables | Semantic constraints | Learned features |
| | Prob. DA | Motion models | Implicit rejection |
| Failures | High outliers | Detection errors | OOD failures |
| | Structured noise | ID switches | Debug difficulty |
| | Param. sensitivity | Unobservable | High cost |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Zhang, H.; Zhao, X.; Luo, R.; Wang, Z.; Wang, G.; An, K. A Roadmap of Mathematical Optimization for Visual SLAM in Dynamic Environments. Mathematics 2026, 14, 264. https://doi.org/10.3390/math14020264


Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.
