However, since the front-end state update only considers constraints between consecutive frames, localization errors gradually accumulate over time due to sensor noise and motion uncertainty. To mitigate this issue, a back-end optimization module based on sliding-window factor graph optimization is introduced to further refine the robot state estimate.
Within the sliding window, several keyframes are maintained and constraints are established among them. These constraints include visual feature observations, IMU measurements, and wheel odometry information. During the factor graph optimization process, multiple types of factors are incorporated, including IMU factors, odometry factors, visual feature factors, scan-to-submap constraints, loop closure constraints, and the proposed roll angle smoothing factor. By jointly optimizing these factors, the system achieves improved global consistency and robustness.
4.4.1. Back-End Optimization
Although the front-end decoupled estimation algorithm can provide high-frequency local pose estimates, errors inevitably accumulate over long-term operation due to sensor noise, IMU bias random walk, and subtle wheel slip. To address this issue, this study introduces a sliding-window-based factor graph optimization framework to impose global consistency constraints on the local trajectory and perform error correction.
The back-end optimization of SLAM is essentially a state estimation problem. Given a sequence of sensor observations $Z$, the objective is to estimate the system state $X$ at different time steps. Within a probabilistic framework, the optimal state estimate can be obtained by maximizing the posterior probability $P(X \mid Z)$.
According to Bayes’ rule,
$$P(X \mid Z) = \frac{P(Z \mid X)\,P(X)}{P(Z)}.$$
Since $P(Z)$ is independent of the state $X$, maximizing the posterior probability is equivalent to maximizing the numerator:
$$X^{*} = \arg\max_{X} P(X \mid Z) = \arg\max_{X} P(Z \mid X)\,P(X).$$
Here, $P(Z \mid X)$ represents the likelihood, describing the probability of obtaining observation $Z$ under state $X$, which is defined by the sensor observation model. $P(X)$ denotes the prior distribution, representing prior knowledge or assumptions about the state.
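Assuming that the sensor observation noise and motion model noise are independent and follow Gaussian distributions, consider a measurement $z_{ij}$ associated with states $x_i$ and $x_j$. The error term is defined as
$$e_{ij}(x_i, x_j) = z_{ij} - h_{ij}(x_i, x_j),$$
where $h_{ij}(\cdot)$ is the measurement model that predicts $z_{ij}$ from the two states. The error follows a zero-mean Gaussian distribution
$$e_{ij} \sim \mathcal{N}(0, \Sigma_{ij}).$$
Therefore, the likelihood can be written as
$$P(z_{ij} \mid x_i, x_j) \propto \exp\left(-\tfrac{1}{2}\,\lVert e_{ij}(x_i, x_j) \rVert^{2}_{\Sigma_{ij}}\right),$$
where the squared Mahalanobis norm is defined as
$$\lVert e \rVert^{2}_{\Sigma} = e^{\top} \Sigma^{-1} e.$$
By incorporating all observations and prior information, the maximum a posteriori estimation can be converted into a nonlinear least-squares problem:
$$X^{*} = \arg\min_{X} \left( \lVert r_{p}(X) \rVert^{2}_{\Sigma_{p}} + \sum_{(i,j)} \lVert e_{ij}(x_i, x_j) \rVert^{2}_{\Sigma_{ij}} \right),$$
where $r_{p}(X)$ is the prior residual with covariance $\Sigma_{p}$ and the sum runs over all measurements.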
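As a small illustration of why maximizing a Gaussian posterior reduces to weighted least squares, the sketch below fuses independent measurements of a single scalar state. The function names and the numbers are hypothetical and are not taken from the proposed system. Because the measurement model is linear here, the minimizer has a closed form (the inverse-variance weighted mean); the actual back-end problem is nonlinear and would be solved iteratively.

```python
def mahalanobis_sq(e, variance):
    """Squared Mahalanobis norm of a scalar residual e with the given variance."""
    return e * e / variance


def total_cost(x, measurements, variances):
    """Sum of squared Mahalanobis norms for direct measurements z_k of the state x."""
    return sum(mahalanobis_sq(z - x, s2) for z, s2 in zip(measurements, variances))


def fuse_scalar(measurements, variances):
    """Closed-form minimizer of total_cost: the inverse-variance weighted mean."""
    weights = [1.0 / s2 for s2 in variances]
    return sum(w * z for w, z in zip(weights, measurements)) / sum(weights)


# Two hypothetical measurements of the same state; the second one is noisier.
z = [1.0, 3.0]
var = [1.0, 4.0]
x_map = fuse_scalar(z, var)  # lies closer to the more certain measurement
```

The estimate is pulled toward the measurement with the smaller variance, which is the same mechanism by which the covariance of each factor sets its weight in the factor graph.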
The factor graph provides an intuitive representation of this optimization problem. The entire graph consists of nodes and factors. Each node stores a system state at a particular time, typically representing the robot pose in the world coordinate frame. Factors represent constraints between nodes, which correspond to sensor measurements.
In this work, multiple types of constraints are incorporated into the factor graph, including IMU preintegration factors, wheel odometry factors, LiDAR scan matching factors, visual feature factors, loop closure factors, and the proposed roll angle smoothing factor. These factors constrain the relative transformation between adjacent states, forming a chain-like structure similar to a Markov chain. The relative pose transformation between nodes is estimated through scan-to-map matching. The factor graph structure used in the back-end optimization is shown in Figure 3.
Within this framework, observations from different sensors are uniformly modeled as factors and fused within a unified probabilistic graph optimization framework. In this manner, the complementary characteristics of multiple sensors can be fully exploited, effectively suppressing the errors of individual sensors while improving the overall estimation accuracy.
The IMU provides high-frequency acceleration and angular velocity measurements. Direct integration of these measurements would lead to significant drift over time. To address this issue, the IMU preintegration technique aggregates all IMU measurements between two consecutive keyframes into a single relative motion constraint. This approach effectively handles high-frequency measurements while modeling IMU noise and bias.
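In the factor graph formulation, the IMU preintegration factor connects two consecutive keyframe states $x_i$ and $x_j$ together with the corresponding IMU bias terms $b_i$ and $b_j$. The residual term is constructed from the difference between the predicted motion obtained through preintegration and the actual states, including rotation, position, and velocity components.
The IMU residual can be written as
$$r_{\mathrm{IMU}} = \begin{bmatrix} r_{\Delta R} \\ r_{\Delta v} \\ r_{\Delta p} \end{bmatrix} = \begin{bmatrix} \mathrm{Log}\left( \Delta\tilde{R}_{ij}^{\top} R_i^{\top} R_j \right) \\ R_i^{\top}\left( v_j - v_i - g\,\Delta t_{ij} \right) - \Delta\tilde{v}_{ij} \\ R_i^{\top}\left( p_j - p_i - v_i\,\Delta t_{ij} - \tfrac{1}{2}\, g\,\Delta t_{ij}^{2} \right) - \Delta\tilde{p}_{ij} \end{bmatrix},$$
where $R$, $v$, and $p$ denote the rotation, velocity, and position of the corresponding keyframe, $\Delta\tilde{R}_{ij}$, $\Delta\tilde{v}_{ij}$, and $\Delta\tilde{p}_{ij}$ are the bias-corrected preintegrated measurements, $g$ is the gravity vector, $\Delta t_{ij}$ is the time interval between the two keyframes, and $\mathrm{Log}(\cdot)$ is the logarithmic map of $SO(3)$.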
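To make the idea of preintegration concrete, the following sketch accumulates accelerometer samples between two keyframes into a single velocity increment and position increment, then compares them with the states at both ends. It is a deliberately reduced one-dimensional version that assumes no rotation, no gravity, and no bias, so it is not the full on-manifold formulation; all function names and values are hypothetical.

```python
def preintegrate_1d(accels, dt):
    """Accumulate velocity and position increments from accelerometer samples.

    Simplifying assumptions: 1-D motion, no rotation, no gravity, no bias.
    Returns (delta_v, delta_p, total_time).
    """
    delta_v, delta_p, total_time = 0.0, 0.0, 0.0
    for a in accels:
        # Position increment uses the velocity increment before this sample.
        delta_p += delta_v * dt + 0.5 * a * dt * dt
        delta_v += a * dt
        total_time += dt
    return delta_v, delta_p, total_time


def imu_residual_1d(p_i, v_i, p_j, v_j, delta_v, delta_p, total_time):
    """Difference between the state change and the preintegrated increments."""
    r_v = (v_j - v_i) - delta_v
    r_p = (p_j - p_i - v_i * total_time) - delta_p
    return r_v, r_p


# Ten samples of constant acceleration (2 m/s^2) at 10 Hz between two keyframes.
delta_v, delta_p, window = preintegrate_1d([2.0] * 10, 0.1)
# States consistent with that motion give a residual of (approximately) zero.
r_v, r_p = imu_residual_1d(0.0, 1.0, 2.0, 3.0, delta_v, delta_p, window)
```

The increments depend only on the IMU samples, not on the keyframe states, which is why they can be computed once and reused at every optimizer iteration.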
The LiDAR scan-matching factor estimates the relative pose transformation between the current keyframe scan and the local map using the ICP algorithm. This constraint is used to correct the accumulated drift from IMU and wheel odometry.
Since the front-end has already jointly optimized the gravity vector from the IMU and the ground normal extracted from RGB-D observations to estimate rotation, the visual feature matching constraint between keyframes can be obtained by matching feature points between two keyframes. The relative pose is then estimated using the PnP-RANSAC algorithm with depth information.
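To further improve coupling between visual and geometric constraints, 3D map points are directly introduced as optimization variables. For a map point with 3D coordinates $P_k$ observed in keyframe $i$, its projection in keyframe $j$ is
$$\hat{u}_{jk} = \pi\left( T_j^{-1} P_k \right),$$
where $\pi(\cdot)$ is the camera projection function and $T_j$ is the pose of keyframe $j$. The corresponding visual residual is defined as
$$r_{\mathrm{vis}} = u_{jk} - \hat{u}_{jk},$$
where $u_{jk}$ denotes the observed pixel coordinates of the map point in the image.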
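A reprojection residual of this kind can be sketched as follows. The example assumes a pinhole camera without distortion, a map point expressed in the world frame, and a keyframe pose given as a rotation matrix and translation of the camera in the world frame; the intrinsics and all names are hypothetical.

```python
def transform_to_camera(R_wc, t_wc, p_w):
    """Map a world point into the camera frame: p_c = R_wc^T (p_w - t_wc).

    R_wc (3x3 nested lists) and t_wc describe the camera pose in the world frame.
    """
    d = [p_w[k] - t_wc[k] for k in range(3)]
    return [sum(R_wc[r][c] * d[r] for r in range(3)) for c in range(3)]


def project(p_c, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame point to pixel coordinates."""
    x, y, z = p_c
    if z <= 0.0:
        raise ValueError("point is behind the camera")
    return (fx * x / z + cx, fy * y / z + cy)


def reprojection_residual(u_obs, R_wc, t_wc, p_w, intrinsics):
    """Observed pixel minus predicted pixel for one map point."""
    u_hat = project(transform_to_camera(R_wc, t_wc, p_w), *intrinsics)
    return (u_obs[0] - u_hat[0], u_obs[1] - u_hat[1])


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
intrinsics = (500.0, 500.0, 320.0, 240.0)  # hypothetical fx, fy, cx, cy
u_pred = project(transform_to_camera(identity, [0.0, 0.0, 0.0], [1.0, 0.0, 2.0]), *intrinsics)
residual = reprojection_residual((572.0, 239.0), identity, [0.0, 0.0, 0.0],
                                 [1.0, 0.0, 2.0], intrinsics)
```

In an optimizer, both the keyframe pose and the map point coordinates would be free variables of this residual.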
Roll Angle Smoothing Factor
Most existing VIO or LIO methods usually assume that the robot moves on a flat ground surface, where the roll angle variation is relatively small. However, the non-coaxial self-balancing two-wheeled robot considered in this work is a typical inverted-pendulum-like system. During forward motion, steering, or motion over uneven ground, the robot relies on the momentum wheel to perform high-frequency attitude regulation for maintaining lateral balance. Therefore, the roll angle of the robot does not remain constant during SLAM operation, but continuously exhibits high-frequency small- or medium-amplitude fluctuations.
If the pose estimation relies only on conventional frame-to-frame relative matching, the optimizer may be easily affected by abnormal observations when environmental features are sparse, visual measurements are degraded, or sensor data quality decreases. This may further lead to non-physical attitude jumps. To address this issue, a roll angle smoothing factor is introduced into the back-end factor graph to constrain the higher-order temporal continuity of the roll angle between adjacent keyframes.
Let three consecutive keyframes in the sliding window be denoted as $i-1$, $i$, and $i+1$, with timestamps $t_{i-1}$, $t_i$, and $t_{i+1}$, respectively. The corresponding roll angles extracted from the rotation matrices of the state nodes are denoted as $\phi_{i-1}$, $\phi_i$, and $\phi_{i+1}$. When the time interval between adjacent keyframes is sufficiently short, the roll angular velocity of the robot can be assumed to vary smoothly, which gives
$$\frac{\phi_{i+1} - \phi_i}{t_{i+1} - t_i} \approx \frac{\phi_i - \phi_{i-1}}{t_i - t_{i-1}}.$$
When the keyframe time intervals are approximately equal, namely
$$t_{i+1} - t_i \approx t_i - t_{i-1} = \Delta t,$$
the above relationship can be further written as a second-order difference constraint on the roll angle:
$$\phi_{i+1} - 2\phi_i + \phi_{i-1} \approx 0.$$
Therefore, the residual function of the roll angle smoothing factor is defined as
$$r_{\phi} = \phi_{i+1} - 2\phi_i + \phi_{i-1}.$$
After this residual is introduced into the back-end factor graph optimization, the corresponding cost term is formulated as
$$E_{\phi} = \lVert r_{\phi} \rVert^{2}_{\Sigma_{\phi}} = r_{\phi}^{\top} \Sigma_{\phi}^{-1} r_{\phi},$$
where $\Sigma_{\phi}$ denotes the covariance of the roll angle smoothing factor and controls the weight of this constraint in the overall optimization objective.
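The second-order difference residual on the roll angles of three consecutive keyframes can be sketched as below. The roll extraction assumes the ZYX (yaw-pitch-roll) Euler convention, which is an assumption of this example rather than something stated in the text, and the angles are assumed small enough that no wrap-around handling is needed. All names and numbers are hypothetical.

```python
import math


def rot_x(phi):
    """Rotation matrix for a pure roll of phi radians about the x axis."""
    c, s = math.cos(phi), math.sin(phi)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]


def roll_from_rotation(R):
    """Roll angle of a rotation matrix under the ZYX Euler convention."""
    return math.atan2(R[2][1], R[2][2])


def roll_smoothing_residual(phi_prev, phi_cur, phi_next):
    """Second-order difference of the roll angle over three keyframes."""
    return phi_next - 2.0 * phi_cur + phi_prev


def roll_smoothing_cost(residual, sigma):
    """Squared Mahalanobis norm of the scalar residual with std. dev. sigma."""
    return (residual / sigma) ** 2


# Smoothly varying roll: the residual is (numerically) zero.
rolls = [roll_from_rotation(rot_x(a)) for a in (0.10, 0.12, 0.14)]
r_smooth = roll_smoothing_residual(*rolls)

# An abrupt jump at the middle keyframe produces a large residual and cost.
r_jump = roll_smoothing_residual(0.10, 0.30, 0.14)
cost_jump = roll_smoothing_cost(r_jump, sigma=0.2)
```

A constant roll rate gives a zero residual, so the factor penalizes only changes in roll rate, not roll motion itself.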
Unlike purely empirical parameter tuning, the covariance in this work is determined according to the physical roll-dynamic boundary of the non-coaxial self-balancing two-wheeled robot. For this type of robot, the roll angle variation is mainly determined by the reaction torque generated by the momentum wheel, the gravity-induced restoring torque, and external disturbance torque. The roll dynamics around the longitudinal axis of the robot can be written as
$$I_x \ddot{\phi} = m g L \sin\phi + \tau_w + \tau_d,$$
where $I_x$ is the moment of inertia of the robot body around the roll axis, $m$ is the body mass, $g$ is the gravitational acceleration, $L$ is the height of the center of mass, $\tau_w$ is the torque generated by the momentum wheel, and $\tau_d$ denotes the external disturbance torque.
In practice, the covariance was selected according to the expected physical range of roll angle fluctuations observed during robot balancing. A relatively small covariance increases the contribution of the smoothing factor and suppresses high-frequency oscillatory roll motion, whereas an excessively small covariance may over-constrain the optimization and suppress physically valid rotational dynamics. Conversely, assigning an excessively large covariance weakens the smoothing effect and reduces robustness against balancing-induced rotational disturbances.
Similarly, the weighting parameters associated with the gravity and ground-normal constraints were empirically selected according to the sensor noise characteristics and environmental stability. In the experiments reported in this work, a single fixed parameter configuration was used across all tested environments without scene-specific retuning. Stable localization performance under different operating conditions suggests that the proposed framework maintains reasonable robustness with respect to parameter selection.
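Considering the limited output capability of the motor, the maximum instantaneous peak torque generated by the momentum wheel is denoted as $\tau_{\max}$. Without considering extreme external impacts, the roll angular acceleration of the robot is physically bounded by
$$\lvert \ddot{\phi} \rvert \le \ddot{\phi}_{\max},$$
where $\ddot{\phi}_{\max}$ is the largest roll angular acceleration attainable under the torque limit $\tau_{\max}$. In the discrete SLAM sliding window, when the adjacent keyframe interval is $\Delta t$, the roll angular acceleration can be approximated by the second-order central difference of the roll angle:
$$\ddot{\phi}_i \approx \frac{\phi_{i+1} - 2\phi_i + \phi_{i-1}}{\Delta t^{2}}.$$
Therefore, the roll angles of three consecutive keyframes should satisfy the following physical boundary:
$$\lvert \phi_{i+1} - 2\phi_i + \phi_{i-1} \rvert \le \ddot{\phi}_{\max}\, \Delta t^{2}.$$
The above inequality indicates that the roll angle smoothing residual $r_{\phi} = \phi_{i+1} - 2\phi_i + \phi_{i-1}$ has a clear physically feasible range. To ensure that the smoothing factor can suppress non-physical attitude jumps while avoiding excessive suppression of real roll motion, the standard deviation of the smoothing factor is defined as a scaled form of the physical boundary of the second-order roll angle variation:
$$\sigma_{\phi} = \lambda\, \ddot{\phi}_{\max}\, \Delta t^{2}.$$
The corresponding covariance is then given by
$$\Sigma_{\phi} = \sigma_{\phi}^{2}.$$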
Here, $\lambda$ is a confidence scaling coefficient that adjusts the relationship between the physical boundary and the optimization weight. In general, $\lambda$ is selected within a bounded range. A smaller $\lambda$ increases the weight of the smoothing constraint, making the optimizer more inclined to suppress abrupt roll angle changes. In contrast, a larger $\lambda$ weakens the smoothing constraint and allows the optimizer to respond more flexibly to rapid real roll motion. In this work, the same value of $\lambda$ is used in all experiments to ensure parameter consistency across different environments and disturbance conditions.
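The mapping from a physical acceleration bound to a factor covariance can be sketched as follows. The example assumes the physical boundary on the second-order roll difference is the maximum roll angular acceleration multiplied by the squared keyframe interval, and that the standard deviation is this boundary scaled by the confidence coefficient. The acceleration limit, keyframe interval, and default coefficient below are hypothetical values chosen only for illustration.

```python
def roll_difference_bound(alpha_max, dt):
    """Largest physically feasible |phi_next - 2*phi_cur + phi_prev| in radians.

    alpha_max: assumed maximum roll angular acceleration (rad/s^2).
    dt: keyframe interval (s).
    """
    return alpha_max * dt * dt


def roll_smoothing_sigma(alpha_max, dt, lam=1.0):
    """Standard deviation of the smoothing factor: the bound scaled by lam."""
    return lam * roll_difference_bound(alpha_max, dt)


def roll_smoothing_covariance(alpha_max, dt, lam=1.0):
    """Scalar covariance of the smoothing factor (sigma squared)."""
    return roll_smoothing_sigma(alpha_max, dt, lam) ** 2


def within_physical_bound(phi_prev, phi_cur, phi_next, alpha_max, dt):
    """Check whether three roll angles respect the acceleration bound."""
    second_diff = abs(phi_next - 2.0 * phi_cur + phi_prev)
    return second_diff <= roll_difference_bound(alpha_max, dt)


# Hypothetical robot: 20 rad/s^2 acceleration limit, keyframes every 0.1 s.
sigma = roll_smoothing_sigma(20.0, 0.1, lam=1.0)
covariance = roll_smoothing_covariance(20.0, 0.1, lam=1.0)
ok_smooth = within_physical_bound(0.10, 0.12, 0.14, 20.0, 0.1)
ok_jump = within_physical_bound(0.10, 0.30, 0.14, 20.0, 0.1)
```

Because the bound scales with the square of the keyframe interval, a system with irregular keyframe spacing would need to recompute the covariance per factor rather than use one constant.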
From the perspective of robust optimization, this smoothing factor can be regarded as a second-order Markov smoothness constraint with a physically adaptive boundary. When sensor mismatch or local observation degradation causes an abnormal roll estimate at a certain keyframe, the corresponding residual may exceed the physically reasonable range characterized by the standard deviation $\sigma_{\phi}$. In this case, the abnormal attitude estimate is penalized during optimization and pulled back toward a trajectory that is more consistent with the inverted-pendulum-like roll dynamics of the robot. Therefore, the proposed factor suppresses roll angle error leakage caused by motion degradation and improves the attitude continuity and overall trajectory stability of the back-end optimization.
In practice, the proposed covariance design exhibits relatively low sensitivity to moderate variations of the scaling coefficient $\lambda$. Since the covariance is derived from physically bounded roll dynamics rather than purely empirical tuning, changing $\lambda$ within a reasonable range mainly affects the strength of the smoothing constraint without altering the overall optimization behavior. Experimental observations indicate that the proposed system maintains stable trajectory consistency when $\lambda$ varies within the range of 0.8 to 1.2. Therefore, the proposed formulation provides a physically interpretable and robust mechanism for determining the covariance of the roll angle smoothing factor.
In addition, loop closure detection is introduced to eliminate long-term drift. When the robot revisits a previously mapped area, loop closure detection identifies this event and establishes a strong constraint between the current pose node and historical pose nodes. This constraint pulls the trajectory back to the correct position, resulting in a globally consistent map.
As time progresses, the size of the factor graph grows continuously, which may lead to computational explosion. To maintain real-time performance, a sliding-window optimization strategy is adopted, where only the most recent N keyframes are retained for optimization.
When the window becomes full and a new keyframe is added, the oldest keyframe is marginalized. Marginalization compresses all constraints associated with the removed keyframe into a prior factor and propagates the uncertainty to the remaining variables. This prior factor preserves the information contained in the marginalized keyframe while keeping the optimization problem size manageable.
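Marginalization of the oldest keyframe is commonly implemented with the Schur complement on the linearized system. The sketch below shows this for the simplest possible case, in which the variable being removed is a single scalar and the problem is already expressed as normal equations $Hx = b$. This is an illustrative reduction with hypothetical names and values; a real sliding-window back-end operates on multi-dimensional state blocks and relinearizes as needed.

```python
def marginalize_first(H, b):
    """Marginalize variable 0 (a scalar) out of the linear system H x = b.

    Applies the Schur complement:
        H_m = H_rr - H_r0 * H_00^{-1} * H_0r
        b_m = b_r  - H_r0 * H_00^{-1} * b_0
    The returned (H_m, b_m) acts as a prior on the remaining variables.
    """
    h00 = H[0][0]
    n = len(b)
    H_m = [[H[r][c] - H[r][0] * H[0][c] / h00 for c in range(1, n)]
           for r in range(1, n)]
    b_m = [b[r] - H[r][0] * b[0] / h00 for r in range(1, n)]
    return H_m, b_m


# Two scalar states coupled by one relative constraint, plus a prior on each.
H = [[2.0, -1.0],
     [-1.0, 2.0]]
b = [1.0, 0.0]

H_prior, b_prior = marginalize_first(H, b)
# Solving the reduced system gives the same estimate for the kept variable
# as solving the full system would (x_1 = 1/3 in this example).
x1 = b_prior[0] / H_prior[0][0]
```

The kept variable receives exactly the information that the removed variable carried about it, which is what allows the window size to stay fixed without discarding past constraints.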
4.4.2. Loop Closure Detection
Loop closure detection plays an important role in SLAM systems. Without loop closure constraints, the robot trajectory gradually accumulates drift over time due to sensor noise and estimation errors. When the robot revisits a previously mapped area, loop closure detection identifies this event and introduces additional constraints to correct the accumulated error, thereby improving the global consistency of the map.
In this work, a vision-based place recognition method is adopted to detect loop closure candidates. Recent RGB-D and visual SLAM studies in dynamic indoor environments have shown that dynamic feature suppression and robust visual front-end design are important for maintaining reliable loop detection and pose estimation under degraded visual observations [25]. Specifically, a Bag-of-Words (BoW) model is constructed using visual features extracted from RGB images. Each keyframe image is converted into a visual word vector, which enables efficient similarity search among previously observed keyframes.
Let the visual word vectors of two keyframes $i$ and $j$ be denoted as $v_i$ and $v_j$, respectively. The similarity between these two vectors is measured using cosine similarity:
$$s(v_i, v_j) = \frac{v_i^{\top} v_j}{\lVert v_i \rVert \, \lVert v_j \rVert}.$$
When the similarity score exceeds a predefined threshold, the corresponding keyframe pair is considered a loop closure candidate.
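The candidate selection step can be sketched as follows, with each visual word vector stored as a sparse dictionary from word index to weight. The data layout, function names, and threshold value are assumptions made for this example and do not describe a particular BoW library.

```python
import math


def cosine_similarity(v_i, v_j):
    """Cosine similarity of two sparse vectors stored as {word_id: weight}."""
    dot = sum(w * v_j.get(word, 0.0) for word, w in v_i.items())
    norm_i = math.sqrt(sum(w * w for w in v_i.values()))
    norm_j = math.sqrt(sum(w * w for w in v_j.values()))
    if norm_i == 0.0 or norm_j == 0.0:
        return 0.0  # an empty vector matches nothing
    return dot / (norm_i * norm_j)


def loop_candidates(query, database, threshold):
    """Return the keyframe ids whose similarity to the query exceeds the threshold."""
    return [kf_id for kf_id, vec in database.items()
            if cosine_similarity(query, vec) > threshold]


# Hypothetical word vectors for three stored keyframes and one query.
database = {
    0: {1: 1.0, 2: 1.0},  # same words as the query
    1: {3: 1.0},          # no shared words
    2: {1: 1.0},          # partial overlap
}
query = {1: 1.0, 2: 1.0}
candidates = loop_candidates(query, database, threshold=0.8)
```

In practice, recent keyframes are usually excluded from the search so that temporally adjacent frames are not reported as loops; that filter is omitted here for brevity.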
To verify the geometric consistency of the detected loop, feature correspondences between the two images are established and the relative pose transformation is estimated using the PnP-RANSAC algorithm. If the number of inlier correspondences exceeds a threshold, the loop closure is accepted.
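After the loop closure candidate is verified, LiDAR scan matching is further performed between the corresponding scans to refine the relative pose transformation. The loop-closure residual is formulated on the Lie group manifold as
$$r_{\mathrm{loop}} = \mathrm{Log}\left( \Delta\tilde{T}_{ij}^{-1}\, T_i^{-1}\, T_j \right),$$
where $T_i$ and $T_j$ are the poses of the two matched keyframes, $\Delta\tilde{T}_{ij}$ is the relative pose measured by scan matching, and $\mathrm{Log}(\cdot)$ denotes the logarithmic map from the Lie group $SE(3)$ to its corresponding Lie algebra representation.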
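A relative-pose residual of this type can be illustrated in the plane, where the algebra is short enough to write out in full. The sketch below is a planar $SE(2)$ analogue, not the $SE(3)$ residual used by the system: poses are tuples $(x, y, \theta)$, and the residual is the logarithm of the measured relative pose inverted and composed with the estimated relative pose. All names and numbers are hypothetical.

```python
import math


def wrap(angle):
    """Wrap an angle to (-pi, pi]."""
    return math.atan2(math.sin(angle), math.cos(angle))


def se2_inverse(T):
    x, y, th = T
    c, s = math.cos(th), math.sin(th)
    return (-(c * x + s * y), -(-s * x + c * y), -th)


def se2_compose(A, B):
    ax, ay, ath = A
    bx, by, bth = B
    c, s = math.cos(ath), math.sin(ath)
    return (ax + c * bx - s * by, ay + s * bx + c * by, wrap(ath + bth))


def se2_log(T):
    """Logarithmic map of SE(2): returns (rho_x, rho_y, theta)."""
    x, y, th = T
    th = wrap(th)
    if abs(th) < 1e-9:
        return (x, y, th)  # near zero rotation the log is the translation itself
    a = math.sin(th) / th
    b = (1.0 - math.cos(th)) / th
    det = a * a + b * b
    # Translation is V * rho with V = [[a, -b], [b, a]]; invert V to get rho.
    return ((a * x + b * y) / det, (-b * x + a * y) / det, th)


def loop_residual(T_i, T_j, Z_ij):
    """Log of (measured relative pose)^-1 composed with (estimated relative pose)."""
    estimated = se2_compose(se2_inverse(T_i), T_j)
    return se2_log(se2_compose(se2_inverse(Z_ij), estimated))


Z = (1.0, 0.0, 0.1)  # hypothetical relative pose from scan matching
r_consistent = loop_residual((0.0, 0.0, 0.0), (1.0, 0.0, 0.1), Z)
r_drift = loop_residual((0.0, 0.0, 0.0), (1.2, 0.0, 0.1), Z)  # 0.2 m of drift
```

When the estimated poses agree with the measurement the residual vanishes, and accumulated drift shows up directly as a nonzero residual that the optimizer then distributes along the trajectory.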
The loop closure factor is then added into the factor graph optimization framework, which effectively reduces accumulated drift and improves the global consistency of the estimated trajectory and map. In summary, the primary role of the back-end optimization module is to enforce global consistency on the front-end estimation results. By fusing multi-source constraints, it effectively suppresses error accumulation, thereby ensuring the accuracy and stability of the system during long-term operation.