Article

Extended Kalman Filter-Based Visual Odometry in Dynamic Environments Using Modified 1-Point RANSAC

Department of Mechanical Engineering, Inha University, Incheon 22212, Republic of Korea
*
Author to whom correspondence should be addressed.
Biomimetics 2025, 10(10), 710; https://doi.org/10.3390/biomimetics10100710
Submission received: 25 September 2025 / Revised: 17 October 2025 / Accepted: 18 October 2025 / Published: 20 October 2025
(This article belongs to the Special Issue Recent Advances in Bioinspired Robot and Intelligent Systems)

Abstract

Visual odometry in dynamic environments is particularly challenging, as moving objects often cause incorrect data associations and large pose estimation errors. Traditional EKF-based VO methods rely on 1-point RANSAC to reject outliers under the assumption of a static world, thereby discarding dynamic landmarks as noise. However, in practice, outliers may arise not only from measurement errors but also from the motion of objects. To address this issue, we propose a modified 1-point RANSAC framework that detects dynamic objects and leverages both static and dynamic landmarks for ego-motion estimation. Inspired by adaptive strategies observed in biological vision systems, our approach integrates EKF-based state estimation with dynamic object tracking to achieve simultaneous ego-motion and object-motion estimation, improving robustness in complex and dynamic scenes.

1. Introduction

Visual odometry (VO) is a method that estimates the ego-pose, i.e., the position and orientation of an agent, using image data, and it has been widely applied in autonomous driving, robotics, and drone navigation. The fundamental concept underlying VO is to infer the agent’s relative motion by analyzing the changes in visual observations across consecutive frames. Similar mechanisms are observed in biological vision, where animals and insects estimate self-motion from optic flow and the relative motion of surrounding objects. Gay et al. [1] presented a navigation model inspired by place and grid cells, and Wang et al. [2] proposed REVIO using event-based vision inspired by retinal ganglion cells. Based on this concept, ego-pose estimation methods can be categorized into three types depending on the nature of data association: 2D-2D, 2D-3D (or equivalently, 3D-2D), and 3D-3D matching [3]. In 2D-2D matching, the relative pose between consecutive frames is estimated by analyzing geometric constraints between 2D feature correspondences extracted from consecutive images; this approach is commonly used in monocular visual odometry [4], but it suffers from scale ambiguity, as depth information is not directly available. In 2D-3D matching, 2D image features are associated with known 3D landmarks obtained via triangulation (e.g., from stereo cameras) or direct depth sensors (e.g., RGB-D). The camera pose can then be estimated by solving the Perspective-n-Point (PnP) problem [5,6]. Finally, 3D-3D matching estimates the sensor’s motion by aligning two consecutive 3D point clouds. Since both point sets are defined in a common metric space, this approach inherently provides metric scale and high-precision motion estimation. It is widely employed in LiDAR-based and stereo camera-based ego-pose estimation pipelines, particularly where depth sensing is reliable and dense [7].
Beyond data association types, ego-pose estimation methods can also be classified by the underlying optimization framework used for state estimation. Among these, a common approach is to directly solve a bundle adjustment (BA) [8] problem over a sliding window of keyframes, refining both camera poses and landmarks by minimizing reprojection error. Alternatively, filtering-based methods such as the Extended Kalman Filter (EKF), and graph-based optimization techniques that model the problem as a factor graph, are also widely adopted.
EKF-based methods estimate the current state sequentially by propagating and updating the state vector and covariance using probabilistic models [9]. They are suitable for real-time applications due to their computational efficiency but may suffer from linearization errors and filter inconsistency in highly nonlinear environments. In contrast, graph-based optimization methods formulate the problem as a global nonlinear least-squares optimization over a factor graph, allowing for more accurate and globally consistent solutions by jointly optimizing all poses and measurements [10]. However, they are generally more computationally demanding than filtering methods, and may require additional engineering to meet strict real-time constraints. In recent years, graph-based methods have become dominant in VO and SLAM (Simultaneous Localization and Mapping) because they enable loop closure and globally consistent optimization [11,12,13]. Although EKF-based ego-pose estimation has been less actively studied in recent years, it still offers advantages such as low computational load, real-time recursive updates, and suitability for sensor fusion [14,15].
The presence of measurement noise and outliers significantly affects estimation accuracy. Several approaches have been proposed to distinguish outliers. In graph-based methods, robust graph-based optimization has been widely studied to mitigate the effects of outliers and measurement noise in visual and pose graph SLAM. Huber loss and other M-estimators have been commonly used to suppress the influence of large residuals in the cost function, providing robustness against occasional false matches in data association [16]. Dynamic Covariance Scaling (DCS) further improves robustness by down-weighting erroneous constraints based on their residual magnitudes in a closed-form, enabling efficient real-time operation without introducing additional variables [17]. Switchable Constraints (SC), on the other hand, explicitly model the reliability of each loop closure by introducing a latent switch variable that can deactivate incorrect constraints during optimization, achieving high resilience in the presence of false positives [18]. In EKF-based methods, Joint Compatibility Branch and Bound (JCBB) [19] is used to reject outliers in ego-pose estimation by evaluating all possible measurement associations and selecting the largest subset that meets joint statistical consistency criteria. Compared to JCBB, which performs a full branch-and-bound search over all possible data association hypotheses and selects the largest jointly compatible inlier set, 1-point RANSAC [20] significantly reduces computational cost by generating motion hypotheses from a single randomly selected feature (or landmark), estimating ego-pose through EKF, and identifying inlier features by thresholding their innovation residuals.
Since these general-purpose algorithms assume static environments, their performance degrades when dynamic objects are present, leading to incorrect data associations and large pose estimation errors. To address this issue, robust outlier rejection should be applied for ego-pose estimation in dynamic environments. Cao et al. [21] propose a method that integrates optical flow-based motion state estimation into the sliding window optimization of VINS-Mono, enabling robust visual-inertial odometry in dynamic environments without relying on semantic information. Likewise, Li et al. [22] enhance sliding-window BA VIO by incorporating IMU-PARSAC-based matching and robust pure-rotation handling. Ballester et al. [23] integrate segmentation and photometric optimization into ORB-SLAM2 to remove dynamic features and perform BA-based optimization. Similarly, Wu et al. [24] extend ORB-SLAM3 with optical flow segmentation to eliminate dynamic outliers before BA optimization. Ram [25] adds planar homography constraints to the BA framework of VINS-Mono to achieve stable optimization under dynamic conditions. As an EKF-based method, Eckenhoff et al. [26] propose a Schmidt-EKF that uses dynamic object features only for tracking the object itself, while preventing them from directly updating the ego-motion estimate, thus maintaining robustness under inaccurate target motion models. Also, Ma et al. [27] propose an IEKF framework that applies Schur complement-based marginalization to efficiently eliminate visual features from the state vector, improving scalability and accuracy. On the other hand, as a graph-based method, Song et al. [28] propose a dynamic-aware VIO system that suppresses the influence of dynamic features by adaptively reweighting their motion residuals based on Black-Rangarajan duality, without explicitly classifying or rejecting them. Song et al. [29] extend this approach by introducing Adaptive Truncated Least Squares to fully eliminate the effect of outliers and incorporate bias correction mechanisms for handling abruptly dynamic objects. Liu et al. [30] proposed IPL-SLAM, which filters dynamic features using instance segmentation and motion consistency checks while framing the fusion of semantic and geometric cues as a biologically inspired perception strategy. In these works, dynamic features are effectively excluded from ego-pose estimation through robust optimization. In natural vision, however, dynamic elements are not always regarded as noise but often provide informative cues for navigation. This perspective motivates the incorporation of dynamic landmarks into VO rather than discarding them.
The aforementioned methods reject features detected on dynamic objects as outliers, which reduces the amount of usable measurement information and makes them poorly suited to highly dynamic environments. Several studies have instead explored ways to incorporate dynamic features into ego-pose estimation rather than simply excluding them. As an EKF-based approach, Todoran et al. [31] propose an EKF-based SLAM framework that models dynamic objects as individual state variables, allowing the system to incorporate their motion into ego-pose estimation rather than discarding them as outliers. This object-level approach improves robustness in dynamic environments by leveraging stable object motion for pose correction. However, the method is sensitive to segmentation quality and association accuracy because pose updates are performed only after segments are grouped and tracked as coherent objects. As another EKF-based approach, Chiu [32] proposes a unified optimization-based framework that jointly estimates ego-pose, static landmarks, and dynamic object trajectories using both static and dynamic features. This approach integrates localization, mapping, and multi-object tracking within a single EKF-based backend, offering a coherent treatment of dynamic scenes. However, the method does not incorporate explicit outlier rejection, making it vulnerable to erroneous dynamic feature measurements. The evaluation is limited to simulation, and no mechanism is provided to ensure robustness under real-world perception errors. As a graph-based approach, Tian et al. [33] propose DL-SLOT, a dynamic LiDAR SLAM and object tracking framework that jointly estimates the ego-pose and the states of stationary and dynamic objects within a unified sliding window-based collaborative graph optimization. Also, Huang et al. [34] present a graph-based backend that explicitly clusters landmarks into rigid bodies, jointly estimates both ego-pose and per-object motions through decoupled factor graph optimization, and thereby mitigates the negative impact of dynamic objects on pose estimation without discarding their measurements as outliers.
Previous EKF-based methods that use dynamic landmarks for estimation did not provide a clear strategy for outlier rejection in dynamic environments, making the implementation of VO difficult in practice [31,32]. Conventional 1-point RANSAC assumes that all features belong to a static scene and thus rejects those affected by independently moving objects, resulting in the loss of informative visual cues and degraded pose estimation in dynamic environments. To address this limitation, we propose a novel EKF-based VO methodology that modifies 1-point RANSAC to detect dynamic objects and incorporate both static and dynamic landmarks into ego-pose estimation. A supplementary video shows the evaluation of the proposed method, including comparisons with classical methods (Video S1). This design is consistent with a bio-inspired perspective, where dynamic visual cues are treated as informative signals rather than discarded as outliers, reflecting adaptive strategies observed in biological perception. This paper focuses on applications for vehicle navigation with a stereo camera, and the main contributions are as follows:
(1)
A new EKF-based visual odometry algorithm using a stereo camera is proposed, which utilizes both static and dynamic landmarks for ego-pose estimation. The method uses instance segmentation to detect and track objects in the image, so that image-level characteristics can be exploited. In addition, for multiple object tracking on the road, our method is robust to truncation caused by the limited angle of view, which often shifts the observed point cloud centroid as an object moves near the boundary of the field of view.
(2)
Reasonable outlier rejection in dynamic environments is performed using 1-point RANSAC and RANSAC-based transformation estimation from point clouds. A method for rejecting outliers and for determining whether an object is static or dynamic is proposed.
(3)
Tests are conducted using a real-world image dataset to evaluate the performance of ego-pose estimation in dynamic environments with noise.
This paper is organized as follows. In Section 2, the overall structure of the modified 1-point RANSAC for dynamic environments is proposed. In Section 3, evaluations are conducted to determine whether the proposed algorithm is suitable for estimating the ego-pose in dynamic environments. Finally, conclusions are given in Section 4.

2. Modified 1-Point RANSAC for Dynamic Environments

In this section, the overall structure and the detailed methods of the proposed algorithm, which uses both static and dynamic landmarks under the assumption of dynamic environments, are introduced. Since an EKF is employed for the estimation, the state transition model and the measurement model used in the filter are introduced first. The proposed algorithm is divided into three steps: (1) measurement by stereo camera, (2) the classical 1-point RANSAC algorithm assuming a static environment, and (3) additional steps for considering dynamic landmarks. Since the proposed method incorporates the classical method, before introducing the ego-pose estimation method for dynamic environments, the classical 1-point RANSAC algorithm, which assumes static environments, is briefly introduced to define variables and facilitate comparison with the modified approach. Then, the detailed methods for state estimation of dynamic objects and the augmented estimation of the ego-pose and landmark positions used in the proposed algorithm are further explained.

2.1. Models for EKF

In the EKF-based VO framework adopted in this paper, the ego-pose evolves according to the state transition model
$$\xi_k = g(\xi_{k-1}, u_k) + q_\xi,$$
where $\xi_k \equiv [x, y, \psi]_k^\top$ denotes the ego-pose of the vehicle in global coordinates at the current frame $k$, $u_k$ denotes the control input, and $q_\xi \sim \mathcal{N}(0, Q_\xi)$ denotes the state noise of the ego-pose. $\mathbf{x} \equiv [x, y]^\top$ denotes the ego position on the road plane, and $\psi$ denotes the ego heading on the road plane. In this paper, the bicycle model is employed as the state transition model $g$.
Since landmark positions are assumed to be static in the classical 1-point RANSAC, the state transition of the landmark position is represented as
$$X_k^i = X_{k-1}^{m(i)} + q_X,$$
where $X_k^i \equiv [X, Y]_k^i$ denotes the position of the $i$-th landmark on the 2D road plane at the $k$-th frame in global coordinates, $m(i)$ denotes the index of the associated landmark in the previous frame, and $q_X \sim \mathcal{N}(0, Q_X)$ denotes the state noise of the landmark position, which is assumed to follow a Gaussian distribution.
The landmark measurements are represented in 2D polar coordinates relative to the camera coordinate system, so that the relationship between measurements and states is denoted as
$$z_k^i \equiv [\rho, \theta]_k^i = h(X_k^i, \xi_k) + r_{z,k}^i, \qquad \hat{z}_k^i = h(X_k^i, \xi_k) = \begin{bmatrix} \sqrt{(X_k^i - x_k)^2 + (Y_k^i - y_k)^2} \\ \arctan2\!\left(Y_k^i - y_k,\; X_k^i - x_k\right) - \psi_k \end{bmatrix},$$
where $z_k^i$ denotes the measurement of the $i$-th landmark position relative to the camera coordinate system, $\rho$ is the distance from the agent, $\theta$ is the bearing angle relative to the ego heading $\psi$, and $r_{z,k}^i \sim \mathcal{N}(0, R_z(z_k^i))$ represents the measurement noise of landmark $i$ at the $k$-th frame.
The polar representation is adopted to accurately model the characteristics of the measurements obtained in the vehicle body frame and to incorporate the uncertainty originating from stereo-based feature reconstruction. Because the landmark positions are defined relative to the vehicle, their measurements are directly affected by the vehicle’s rotational noise, and such relationships are more conveniently expressed in polar coordinates than in Cartesian form. The range-dependent uncertainty of stereo-reconstructed feature points caused by disparity and pixel localization errors can be naturally incorporated into the EKF measurement covariance by expressing the measurements in polar coordinates, thereby improving the consistency of state estimation.
In this paper, the covariance of the landmark measurement is not assumed to follow a Gaussian distribution with constant variance, but is instead modeled as a function of the radial distance $\rho$ and the angle $\theta$ of the landmark measurement, represented as
$$R_z(z_k^i) \equiv \mathrm{diag}\!\left(\sigma_\rho^2(\rho_k^i),\; \sigma_\theta^2(\theta_k^i)\right),$$
$$\sigma_\rho^2(\rho_k^i) \equiv \alpha_\rho\,(\rho_k^i)^4 \exp\!\left(-\beta_\rho S_{\Delta f,k}^i\right), \qquad \sigma_\theta^2(\theta_k^i) \equiv \alpha_\theta \cos^4(\theta_k^i) \exp\!\left(-\beta_\theta S_{f,k}^i\right),$$
where $\alpha$ and $\beta$ denote the constant parameters of each measurement, $S_{\Delta f}$ denotes the feature matching score, and $S_f$ denotes the feature detection score. The variance is designed to decrease with increasing score through the use of an exponential function. In our assumption, the variance of the radial distance $\rho$ is proportional to $\rho^4$ and the variance of the angle $\theta$ is proportional to $\cos^4(\theta)$. A detailed derivation supporting this assumption is given in Appendix D. The augmented state evolves according to the state transition model,
$$\chi_k = f(\chi_{k-1}, u_k) + q, \qquad \chi_k \equiv \begin{bmatrix} \xi_k \\ X_k^i \end{bmatrix},\; i \in I_k,$$
where $\chi_k$ denotes the total state that contains the ego-pose and the landmark positions, $q$ denotes the noise of the total state, and $I_k$ denotes the set of inlier indices at the $k$-th frame. The augmented state transition model is used for the final EKF update, denoted as
$$f(\chi_{k-1}, u_k) = \begin{bmatrix} g(\xi_{k-1}, u_k) \\ X_{k-1}^{m(i)} \end{bmatrix},\; i \in I_k.$$
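As an illustration of the measurement model and the range- and score-dependent covariance above, the following Python sketch evaluates $h(X_k^i, \xi_k)$ and $R_z(z_k^i)$ for a single landmark. The function names and the numeric values of the $\alpha$ and $\beta$ parameters are illustrative placeholders, not the settings used in this paper.
```python
import numpy as np

def measurement_model(landmark_xy, ego_pose):
    """Predict the polar measurement [rho, theta] of a landmark (the model h)."""
    x, y, psi = ego_pose                      # ego position and heading on the road plane
    dx, dy = landmark_xy[0] - x, landmark_xy[1] - y
    rho = np.hypot(dx, dy)                    # range to the landmark
    theta = np.arctan2(dy, dx) - psi          # bearing relative to the ego heading
    return np.array([rho, theta])

def measurement_covariance(rho, theta, s_match, s_detect,
                           a_rho=1e-4, b_rho=1.0, a_th=1e-3, b_th=1.0):
    """Diagonal covariance R_z with rho^4- and cos^4(theta)-scaled variances."""
    var_rho = a_rho * rho**4 * np.exp(-b_rho * s_match)
    var_theta = a_th * np.cos(theta)**4 * np.exp(-b_th * s_detect)
    return np.diag([var_rho, var_theta])
```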

2.2. Classical 1-Point RANSAC Assuming Static Environments

Figure 1 shows the pipeline of the classical 1-point RANSAC algorithm, and the detailed procedure is given in Appendix A. A random landmark $z_k^h$ is selected as the sample for a hypothesis, and a temporary state update is executed using only this one sample. This procedure is repeated $n_{\text{hyp}}$ times in order to find the sample with the most consensus, i.e., the one that maximizes the number of inliers. The number of hypothesis repetitions, $n_{\text{hyp}}$, is chosen to guarantee that at least one hypothesis is generated from an inlier with probability $p$, given an estimated outlier ratio $\epsilon$:
$$n_{\text{hyp}} = \frac{\log(1 - p)}{\log\!\left(1 - (1 - \epsilon)^{1}\right)}.$$
During each RANSAC iteration, landmarks with innovation below the threshold are considered inliers. Specifically, after performing the temporary state update using a randomly selected landmark, the predicted measurements of all other landmarks are computed, and their innovations $\nu^j = z_k^j - \hat{z}_k^j$ are evaluated. Each feature $j \in J_k$, where $J_k$ denotes the set of valid landmarks observed at time step $k$ (see Section 2.3.1), is classified as an inlier if its innovation magnitude satisfies $\|\nu^j\| < \gamma_{\text{innov}}$, where $\gamma_{\text{innov}}$ is a predefined innovation threshold. Based on each hypothesis, an inlier set $I^h$ is constructed. If the number of inliers, $|I^h|$, exceeds that of the best inlier set found so far, the best inlier set $\tilde{I}$ and the corresponding hypothesis for the ego-pose and landmarks are updated accordingly. This process is repeated until the number of evaluated hypotheses reaches the predefined maximum $n_{\text{hyp}}$.
Figure 1. Pipeline of the 1-point RANSAC algorithm.
The selected inlier set $\tilde{I}$ is then used to perform a full EKF update as part of the high-innovation inliers recovery procedure, resulting in an updated state estimate $\hat{\chi}_{k|k}$ that includes both the ego-pose and the inlier landmark positions.
After this update, all remaining candidate features $i \in J_k \setminus \tilde{I}$ are re-evaluated against the updated state using the Mahalanobis distance,
$$d_i = \nu_i^\top S_i^{-1} \nu_i,$$
where $\nu_i$ is the innovation and $S_i$ the corresponding innovation covariance.
A feature $i$ is additionally classified as an inlier if $d_i < \gamma_{\chi^2}$, where $\gamma_{\chi^2}$ is the chi-square threshold. Accordingly, the final inlier set is defined as
$$I_k = \tilde{I} \cup \left\{\, i \in J_k \setminus \tilde{I} \;\middle|\; d_i < \gamma_{\chi^2} \,\right\}.$$
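The hypothesis-and-gating loop described above can be summarized by the following Python sketch. It is a minimal illustration of the 1-point RANSAC selection step only: the helpers `temp_update` (a one-landmark EKF state update) and `predict_meas` (the measurement prediction) are hypothetical names assumed to be provided by the surrounding filter, and the default values of $p$ and $\epsilon$ are placeholders.
```python
import numpy as np

def num_hypotheses(p=0.99, eps=0.5):
    # eps is the assumed outlier ratio, so (1 - eps) is the inlier ratio.
    return int(np.ceil(np.log(1 - p) / np.log(1 - (1 - eps))))

def one_point_ransac(measurements, predict_meas, temp_update, gamma_innov,
                     p=0.99, eps=0.5, rng=None):
    """Pick the single-landmark hypothesis with the largest low-innovation inlier set."""
    rng = rng or np.random.default_rng()
    ids = list(measurements)                      # valid landmark ids (the set J_k)
    best_inliers, best_state = set(), None
    for _ in range(num_hypotheses(p, eps)):
        h = rng.choice(ids)                       # 1-point sample
        state_h = temp_update(h)                  # temporary EKF state update (no covariance)
        inliers = {h}
        for j in ids:
            if j == h:
                continue
            innov = measurements[j] - predict_meas(state_h, j)
            if np.linalg.norm(innov) < gamma_innov:   # low-innovation gate
                inliers.add(j)
        if len(inliers) > len(best_inliers):
            best_inliers, best_state = inliers, state_h
    return best_inliers, best_state
```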

2.3. Modified 1-Point RANSAC Assuming Dynamic Environments

In the proposed EKF-based framework, landmarks belonging to dynamic objects are not rejected but incorporated into the estimation process, considering the presence of dynamic environments. Figure 2 illustrates the overall structure of the proposed algorithm.
The proposed algorithm consists of five main stages, as summarized below. Each stage corresponds to the detailed procedures presented in Section 2.3.1, Section 2.3.2, Section 2.3.3, Section 2.3.4 and Section 2.3.5.
(1)
Measurement (Section 2.3.1): Landmark positions are reconstructed in the camera coordinate system from stereo image pairs. Instance segmentation is then applied to classify features into object and background regions.
(2)
Classical 1-point RANSAC (Section 2.3.2): Matched landmarks between consecutive frames are processed using the standard 1-point RANSAC to classify inliers and outliers. Outliers, as well as inliers with large innovation, are treated as potentially dynamic features. The estimated ego-pose at this stage serves as a temporary estimate for the following steps.
(3)
Object motion estimation (Section 2.3.3): The rigid-body motion of each detected object is estimated from its point cloud using a RANSAC-based Euclidean transform. A secondary Kalman filter is then applied to refine the state of each object using the estimated transformation.
(4)
Augmented state estimation (Section 2.3.4): The ego-pose and landmark states are jointly updated using inliers from both static and dynamic landmarks within the EKF framework, enabling consistent estimation under dynamic scenes.
(5)
Mapping of newly detected landmarks (Section 2.3.5): Newly observed or unmatched landmarks are initialized and incorporated into the map using the filtered ego-pose.

2.3.1. Feature Detection, Matching and Multiple Object Tracking

As visual measurements, features in pixel coordinates are detected and matched using a stereo camera. In this paper, SuperPoint [35] is employed for feature detection, and SuperGlue [36] is used for feature matching. Deep learning-based feature detection and matching are adopted to enhance feature matching reliability in outdoor environments where repetitive patterns such as windows, tiles, or crosswalks often cause correspondence errors, as the graph neural network in SuperGlue effectively leverages spatial relationships between keypoints to maintain robust matches under such conditions. The feature detector extracts keypoints from the left stereo image, and the feature matching algorithm finds correspondences with features in the right image based on those keypoints. The matched feature coordinates in the left and right images are represented as
$$F_k \equiv \left\{ f_k^i \equiv [v, w]_k^i \right\}_{i=1}^{N_f}, \qquad F_k^* \equiv \left\{ f_k^{*i} \equiv [v^*, w^*]_k^i \right\}_{i=1}^{N_f},$$
where $F_k$ and $F_k^*$ denote the sets of 2D pixel coordinates of the $N_f$ matched features in the left and right stereo images, respectively. The left features with valid correspondences in the right image are also matched to the left image of the previous frame, $k-1$. In this case, two sets of already computed pixel points are matched by the feature matching algorithm, in contrast to the left-right matching, where the features are tracked to determine their pixel positions in the right image. The left features with valid correspondences in both the right image and the previous frame are used to estimate the ego-pose, and this set of valid features is denoted as $J_k$; the set of new features with valid correspondences in the right image but invalid correspondences in the previous frame is denoted as $J_k^*$. Figure 3 indicates the procedure of the feature measurement (bottom: stereo image at the current frame; top: stereo image at the previous frame).
The 3D positions of the landmarks in camera coordinates, $\acute{X}_k^i$, are obtained via stereo triangulation for each valid feature $i \in J_k$. These positions are then converted to 2D polar coordinates relative to the camera, denoted as $z_k^i$, which serve as the measurement inputs to the filter. Landmarks are associated with objects by checking which instance segmentation mask each feature point falls into. The set of landmark indices located on object $o$ is denoted as $P_k^o$. The objects are also assigned IDs and tracked using a multiple object tracking (MOT) algorithm, so that each object's ID can be matched with its corresponding instance from previous frames. YOLOv8 is used in the proposed algorithm to perform instance segmentation, and it is integrated with the MOT algorithm ByteTrack [37] for object tracking.
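The measurement step can be illustrated with the short Python sketch below: one helper triangulates a matched stereo keypoint pair into a planar polar measurement $[\rho, \theta]$, and the other groups landmark indices by the instance mask they fall into (the sets $P_k^o$). The function names and the rectified-stereo, planar simplifications are assumptions for illustration, not the exact implementation of this paper.
```python
import numpy as np

def stereo_to_polar(kp_left, kp_right, fx, cx, baseline):
    """Triangulate a matched keypoint pair and convert it to a planar polar measurement."""
    disparity = kp_left[0] - kp_right[0]
    Z = fx * baseline / max(disparity, 1e-6)   # forward distance from the camera
    X = (kp_left[0] - cx) * Z / fx             # lateral offset
    rho = np.hypot(Z, X)                       # planar range
    theta = np.arctan2(X, Z)                   # bearing w.r.t. the optical axis
    return np.array([rho, theta])

def assign_to_objects(keypoints, masks):
    """Group landmark indices by the instance segmentation mask they fall into."""
    groups = {}
    for i, (u, v) in enumerate(keypoints):
        for obj_id, mask in masks.items():     # mask: boolean HxW array of one tracked object
            if mask[int(v), int(u)]:
                groups.setdefault(obj_id, []).append(i)
                break
    return groups
```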

2.3.2. 1-Point RANSAC Assuming Static Environment

The classical 1-point RANSAC, assuming a static environment, is conducted using the valid landmark positions $\{z_k^i \mid i \in J_k\}$, while the high-innovation inliers recovery procedure is excluded; the best inlier set $\tilde{I}$ is taken directly as the final inlier set $I_k$. The temporarily filtered ego-pose $\hat{\xi}_{k|k}$ is calculated, and the landmarks are divided into inliers and outliers. Whereas outliers are simply rejected in the classical method, in the proposed method they are treated as potentially dynamic landmarks that may still be used in the estimation.
When an object is considered potentially dynamic, it is assumed to have either outlier landmarks or landmarks that pass the innovation threshold but still exhibit motion. Since some dynamic landmarks can be mistakenly included as inliers by 1-point RANSAC due to their small innovation, the distance between each landmark’s current filtered position and its previous position is computed to detect such subtle motion. Whether a landmark is considered potentially dynamic depends on whether the object it belongs to is determined to be dynamic.
To identify such cases, a subset of landmarks that pass the innovation threshold $\gamma_{\text{innov}}$ but exhibit a significant positional change is extracted for each object. This set is defined as
$$D_{1,k}^o = \left\{\, i \in I_k \cap P_k^o \;\middle|\; \left\| \hat{X}_{k|k}^i - \hat{X}_{k|k-1}^i \right\| > \gamma_{\text{inliers}} \,\right\},$$
where $I_k$ denotes the inlier set and $P_k^o$ denotes the set of landmarks belonging to object $o$.
If the proportion of such landmarks for object $o$ exceeds a predefined threshold, the object is classified as dynamic. The set of such objects is denoted as
$$O_{1,k} = \left\{\, o \;\middle|\; \frac{|D_{1,k}^o|}{|P_k^o|} > \gamma_{\text{ratio}} \,\right\}.$$
Next, landmarks that were rejected as outliers by the 1-point RANSAC procedure and failed the innovation threshold test are grouped per object, denoted as
$$D_{2,k}^o = (J_k \setminus I_k) \cap P_k^o.$$
An object is also considered potentially dynamic if the ratio of outlier landmarks exceeds the threshold:
$$O_{2,k} = \left\{\, o \;\middle|\; \frac{|D_{2,k}^o|}{|P_k^o|} > \gamma_{\text{ratio}} \,\right\}.$$
The total set of potentially dynamic objects is defined as the union:
$$\tilde{O}_k = O_{1,k} \cup O_{2,k}.$$
For each object $o \in \tilde{O}_k$, the complete set of associated dynamic landmarks is given by:
$$\tilde{D}_k^o = D_{1,k}^o \cup D_{2,k}^o.$$
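A compact sketch of this per-object test, under the set definitions above, is given below; the container types and argument names are illustrative.
```python
def classify_dynamic_objects(inliers, valid, per_object, moved, gamma_ratio):
    """Return {object_id: potentially dynamic landmark ids} (the sets D~_k^o).

    inliers:    inlier landmark ids (I_k)
    valid:      valid landmark ids (J_k)
    per_object: {object_id: landmark ids} (the sets P_k^o)
    moved:      inlier ids whose filtered position moved by more than gamma_inliers
    """
    dynamic = {}
    for o, members in per_object.items():
        d1 = inliers & members & moved          # inliers that nevertheless moved (D_1)
        d2 = (valid - inliers) & members        # rejected outliers on this object (D_2)
        n = max(len(members), 1)
        if len(d1) / n > gamma_ratio or len(d2) / n > gamma_ratio:
            dynamic[o] = d1 | d2
    return dynamic
```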

2.3.3. Estimation of Planar Motion of Dynamic Objects with Kalman Filter

Another Kalman filter is applied to each object $o$ in the potentially dynamic object set $\tilde{O}_k$ in order to estimate the motions of dynamic objects. The Euclidean transformation of each object, consisting of the rotation $\Omega_k^o \in SO(2)$ and the translation $\tau_k^o \in \mathbb{R}^2$, is estimated by comparing the current point cloud with the previous one. The term $\Omega_k^o$ denotes the rotation, and $\tau_k^o$ denotes the translation of object $o$ at frame $k$.
The transformation is estimated from the raw point clouds of both the current and the previous frame, because an exact transformation is difficult to estimate from a point cloud of already-filtered landmarks. The position of each raw point at the current frame in global coordinates is computed using the temporarily filtered ego-pose, denoted as
$$\tilde{X}_k^i = \Phi(\hat{\psi}_{k|k})\, \acute{X}_k^i + \hat{\mathbf{x}}_{k|k}.$$
Similarly, the position of each raw point at the previous frame in global coordinates is computed using the finally filtered ego-pose of the previous frame, denoted as
$$\tilde{X}_{k-1}^i = \Phi(\hat{\psi}_{k-1|k-1})\, \acute{X}_{k-1}^{m(i)} + \hat{\mathbf{x}}_{k-1|k-1},$$
where $\Phi(\psi) \in SO(2)$ denotes the rotation matrix of angle $\psi$.
When estimating the transformation of each point cloud, RANSAC is conducted to distinguish the inlier landmarks that fit the estimated transform from the outliers that do not. The RANSAC-based Euclidean transform estimation is given in Appendix B. As outlined in Algorithm A3, the method iteratively selects minimal pairs of point correspondences and estimates a candidate rigid transformation using the least-squares solution of Algorithm A2. For each hypothesis, the transformation is applied to all source points, and the reprojection error is computed. Inliers are determined by thresholding the error, and the transformation yielding the largest inlier set is selected as the final result. This procedure provides robustness against outliers and ensures that the estimated transformation best aligns the two point sets under Euclidean constraints.
Because instance segmentation may erroneously assign background pixels to object instances, the resulting point clouds can contain spurious points. To address this, additional filtering using point cloud clustering is performed to suppress such noise. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) [38] is applied to cluster the inliers of the potentially dynamic point clouds. After clustering, the cluster containing the largest number of points is taken as the representative cluster, while the remaining clusters are treated as noise and excluded from further computations. The set of representative landmark indices for each object $o$ is denoted as $D_k^o$.
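A minimal Python sketch of these two steps (SVD-based rigid transform fitting and DBSCAN-based selection of the representative cluster) is shown below; the eps and min_samples values are illustrative placeholders rather than the settings used in this paper.
```python
import numpy as np
from sklearn.cluster import DBSCAN

def rigid_transform_2d(src, dst):
    """Least-squares 2D rotation/translation mapping src points onto dst points."""
    mc_s, mc_d = src.mean(axis=0), dst.mean(axis=0)
    H = (src - mc_s).T @ (dst - mc_d)           # 2x2 cross-covariance of centred points
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                    # enforce a proper rotation (det = +1)
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = mc_d - R @ mc_s
    return R, t

def representative_cluster(points, eps=0.8, min_samples=3):
    """Keep only the densest DBSCAN cluster; all other points are treated as noise."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
    kept = labels[labels >= 0]
    if kept.size == 0:
        return np.zeros(len(points), dtype=bool)
    largest = np.bincount(kept).argmax()
    return labels == largest
```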
The measurement used in the Kalman filter for the object state, $\zeta_k^o$, is calculated from the estimated transformation, as described in Appendix C. While the center of the point cloud is primarily assumed to correspond to the object center, the velocity and angular velocity are additionally considered to compensate for potential center shifts induced by occlusion.
The measurement of the object state is denoted as
$$\zeta_k^o = \begin{bmatrix} \dfrac{1}{|\tilde{D}_k^o|} \displaystyle\sum_{i \in \tilde{D}_k^o} \tilde{X}_k^i \\ \tau^o / \Delta t \\ \arctan2\!\left(\Omega_{(2,1)}^o,\, \Omega_{(1,1)}^o\right) / \Delta t \end{bmatrix}.$$
To ensure the reliability of object motion estimation, several validation steps are incorporated before accepting the estimated state as a valid measurement for the Kalman filter update. These checks are essential to eliminate ill-posed or noisy results that may compromise the accuracy of the estimation process. First, each object must be associated with at least three 3D points to enable rigid transformation estimation. Second, the rigid transformation matrix $M$ between the previous and current point sets is estimated using the RANSAC-based method and is accepted only if it yields a non-empty set of inliers that support the estimated motion; from the estimated rigid transform $M$, the translation vector $\tau$ and rotation matrix $\Omega$ are extracted. Third, the spatial distribution of the inlier points is evaluated using their covariance matrix; if the maximum eigenvalue of this matrix exceeds a threshold $T_{\text{cov}}$, indicating high spatial dispersion or unreliability, the object is discarded from further processing. An object is classified as an outlier and excluded from the Kalman filter update if it fails to satisfy any of these criteria.
The measurement model is represented as
$$\zeta_k^o = s_k^o + r_{\zeta,k}^o,$$
where $s_k^o \equiv [c, \dot{c}, \dot{\phi}]_k^o$ denotes the state, $c_k^o \equiv [\bar{X}, \bar{Y}]_k^o$ denotes the position, $\dot{c}_k^o$ denotes the velocity, $\dot{\phi}_k^o$ denotes the angular velocity of the point cloud of object $o$ at the $k$-th frame, and $r_{\zeta,k}^o \sim \mathcal{N}\!\left(0, R_\zeta(\zeta_k^o)\right)$ denotes the measurement noise of the object state.
The measurement error covariance $R_\zeta(\zeta_k^o)$ is calculated considering the number of points used to compute each component of the state, represented as
$$R_\zeta(\zeta_k^o) = \frac{\sigma^2}{|\tilde{D}_k^o|} \cdot \mathrm{diag}\!\left(\mu_{\text{pos}}, \mu_{\text{pos}}, \mu_{\text{vel}}, \mu_{\text{vel}}, \mu_{\text{angle}}\right).$$
Each object state is filtered assuming planar motion with constant velocity and constant angular velocity, represented as
$$s_k^o = A\, s_{k-1}^o + q_s, \qquad A = \begin{bmatrix} 1 & 0 & \Delta t & 0 & 0 \\ 0 & 1 & 0 & \Delta t & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{bmatrix},$$
where $q_s \sim \mathcal{N}(0, \Lambda)$ denotes the state noise of the objects.
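Under the constant-velocity, constant-yaw-rate model above, one predict/update cycle of the per-object Kalman filter can be sketched as follows (the measurement model is the identity, $\zeta_k^o = s_k^o + r$); variable names are illustrative.
```python
import numpy as np

def object_kf_step(s, P, zeta, R_zeta, Lam, dt):
    """One Kalman filter cycle for an object state s = [cx, cy, vx, vy, yaw_rate]."""
    A = np.array([[1, 0, dt, 0, 0],
                  [0, 1, 0, dt, 0],
                  [0, 0, 1,  0, 0],
                  [0, 0, 0,  1, 0],
                  [0, 0, 0,  0, 1]], dtype=float)
    # Predict with the constant-velocity planar motion model
    s_pred = A @ s
    P_pred = A @ P @ A.T + Lam
    # Update with the 5D measurement built from the RANSAC-estimated transform
    S = P_pred + R_zeta                       # identity measurement model
    K = P_pred @ np.linalg.inv(S)
    s_new = s_pred + K @ (zeta - s_pred)
    P_new = (np.eye(5) - K) @ P_pred
    return s_new, P_new
```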

2.3.4. Augmented Estimation with Both Static and Dynamic Landmarks

Since the motions of the dynamic objects have been estimated, both static and dynamic landmarks are ready to be used in the estimation. While the transition model of the static landmarks is the same as in Equation (2), the transition model of each dynamic landmark is denoted as
$$X_k^i = l\!\left(X_{k-1}^{m(i)}, c_{k-1}^o, \dot{c}_k^o, \dot{\phi}_k^o\right) + \eta_k^i,$$
$$l\!\left(X_{k-1}^{m(i)}, c_{k-1}^o, \dot{c}_k^o, \dot{\phi}_k^o\right) \equiv \Phi(\dot{\phi}_k^o \Delta t)\!\left(X_{k-1}^{m(i)} - c_{k-1}^o\right) + c_{k-1}^o + \dot{c}_k^o \Delta t,$$
where $\Phi(\dot{\phi}_k^o \Delta t) \in SO(2)$ denotes the rotation matrix of angle $\dot{\phi}_k^o \Delta t$ and $\eta^i$ denotes the state noise of landmark $i$.
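A direct transcription of the dynamic-landmark transition model $l(\cdot)$ reads, in Python:
```python
import numpy as np

def propagate_dynamic_landmark(X_prev, c_prev, c_dot, phi_dot, dt):
    """Rotate a landmark about its object's previous centre, then translate with the object."""
    a = phi_dot * dt
    Phi = np.array([[np.cos(a), -np.sin(a)],
                    [np.sin(a),  np.cos(a)]])
    return Phi @ (X_prev - c_prev) + c_prev + c_dot * dt
```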
Using dynamic landmarks to estimate the ego-pose can introduce more error than assuming a static environment if the object state estimation itself contains large errors. To avoid such adverse effects of the proposed algorithm, the validity of the estimated dynamic object states must be judged. The mean innovation of the estimated landmark positions of each object is calculated to judge the validity of each object, denoted as
$$\bar{\nu}_{1,k}^o = \frac{1}{|\tilde{D}_k^o|} \sum_{i \in \tilde{D}_k^o} \left\| \tilde{X}_k^i - l\!\left(\hat{X}_{k-1|k-1}^{m(i)}, c_{k-1}^o, \dot{c}_k^o, \dot{\phi}_k^o\right) \right\|,$$
$$\bar{\nu}_{2,k}^o = \frac{1}{|\tilde{D}_k^o|} \sum_{i \in \tilde{D}_k^o} \left\| \tilde{X}_k^i - \hat{X}_{k-1|k-1}^{m(i)} \right\|,$$
$$O_k = \left\{\, o \in \tilde{O}_k \;\middle|\; \bar{\nu}_{1,k}^o < \bar{\nu}_{2,k}^o \,\right\}.$$
The total transition model is represented as
$$f(\chi_{k-1}, u_k) = \begin{bmatrix} g(\xi_{k-1}, u_k) \\ \left\{ \begin{array}{ll} X_{k-1}^{m(i)} & \text{if } i \in I_k \setminus \bigcup_{o \in O_k} D_k^o \\ l\!\left(X_{k-1}^{m(i)}, c_{k-1}^o, \dot{c}_k^o, \dot{\phi}_k^o\right) & \text{if } i \in D_k^o,\; o \in O_k \end{array} \right. \end{bmatrix}, \qquad i \in I_k \cup \bigcup_{o \in O_k} D_k^o.$$
The state transition model $f(\chi_{k-1}, u_k)$ jointly updates the ego-pose and the landmark positions as components of an augmented state vector. The ego-pose is propagated using the motion model $g(\xi_{k-1}, u_k)$, while the landmarks are updated based on whether they are associated with dynamic objects or not. Dynamic landmarks are propagated according to the estimated motion of their corresponding object, whereas static ones retain their previous positions.
The Jacobian of the total transition model is likewise represented as
$$F(\chi_{k-1}, u_k) = \mathrm{blockdiag}\!\left( G(\xi_{k-1}, u_k),\; \left\{ \begin{array}{ll} \mathbf{I} & \text{if } i \in I_k \setminus \bigcup_{o \in O_k} D_k^o \\ \Phi(\dot{\phi}_k^o \Delta t) & \text{if } i \in D_k^o,\; o \in O_k \end{array} \right. \right), \qquad i \in I_k \cup \bigcup_{o \in O_k} D_k^o.$$
The Jacobian of the state transition model, $F(\chi_{k-1}, u_k)$, is in block-diagonal form. It consists of the ego-pose Jacobian $G(\xi_{k-1}, u_k)$ and landmark-specific blocks, where each block is the identity matrix for static landmarks and the rotation matrix $\Phi(\dot{\phi}_k^o \Delta t)$ for dynamic landmarks, depending on whether $i \in D_k^o$ for some $o \in O_k$.
The total state is updated following the standard EKF procedure, represented as
(1)
State Prediction: $\hat{\chi}_{k|k-1} = f(\hat{\chi}_{k-1|k-1}, u_k)$
(2)
Covariance Prediction: $P_{k|k-1} = F_k P_{k-1|k-1} F_k^\top + Q_k$
(3)
Residual: $\nu_k = z_k - h(\hat{\chi}_{k|k-1})$
(4)
Innovation Covariance: $S_k = H_k P_{k|k-1} H_k^\top + R_k$
(5)
Kalman Gain: $K_k = P_{k|k-1} H_k^\top S_k^{-1}$
(6)
State Update: $\hat{\chi}_{k|k} = \hat{\chi}_{k|k-1} + K_k \nu_k$
(7)
Covariance Update: $P_{k|k} = (I - K_k H_k) P_{k|k-1}$
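The corresponding predict/update cycle over the augmented state can be sketched generically as follows; the callables f, F_jac, h, and H_jac stand for the transition and measurement models defined above and are passed in rather than hard-coded.
```python
import numpy as np

def ekf_step(chi, P, u, z, f, F_jac, h, H_jac, Q, R):
    """Standard EKF cycle over the augmented state (ego-pose and landmark positions)."""
    # (1)-(2) Prediction
    chi_pred = f(chi, u)
    F = F_jac(chi, u)
    P_pred = F @ P @ F.T + Q
    # (3)-(7) Update
    nu = z - h(chi_pred)                        # residual
    H = H_jac(chi_pred)
    S = H @ P_pred @ H.T + R                    # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)         # Kalman gain
    chi_new = chi_pred + K @ nu
    P_new = (np.eye(len(chi)) - K @ H) @ P_pred
    return chi_new, P_new
```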

2.3.5. Mapping of Newly Detected Landmarks

Mapping of the newly detected landmarks, $\{\acute{X}_k^i \mid i \in J_k^*\}$, is conducted as the final step. Each landmark position, originally represented in the camera coordinate system, is transformed into the global coordinate frame using the estimated ego-pose:
$$\tilde{X}_k^i = \Phi(\hat{\psi}_{k|k})\, \acute{X}_k^i + \hat{\mathbf{x}}_{k|k},$$
where $\tilde{X}_k^i$ denotes the position of landmark $i$ in the global coordinate system.
Then, the states of both static and dynamic objects are calculated from the newly detected landmark positions, as denoted in Equation (17).

3. Evaluation

3.1. Environment Settings

The evaluation is conducted to compare the proposed algorithm with the classical 1-point RANSAC algorithm. The KITTI odometry dataset [39] is used to evaluate the performance of the classical 1-point RANSAC assuming static environments and the modified 1-point RANSAC assuming dynamic environments. The test is conducted under the assumption that the state transition model follows a bicycle model, considering two configurations: (1) incorporating the control inputs into the augmented state vector for joint estimation, and (2) treating them as measurements. In the first configuration, the control inputs are included in the augmented state vector, which is denoted as $\xi_k \equiv [x, y, \psi, V, \delta]_k^\top$. In the second configuration, the control inputs are treated as measurements. The control input vector $u_k \equiv [V, \delta]_k^\top$ consists of the longitudinal velocity $V$ and the steering angle $\delta$ of the vehicle. Gaussian noise is added to the velocity and steering angle to model sensor uncertainty for the test. The test is conducted on 200 frames of each of the 11 sequences. Among the 11 sequences, sequences 1, 4, and 8 contain dynamic objects, while the remaining sequences do not.
To evaluate the local accuracy of the proposed VO method, we adopt the Relative Pose Error (RPE) as an evaluation metric. The RPE quantifies the discrepancy between the estimated and ground-truth relative motions by comparing frame-to-frame transformations.
All parameters, including the EKF process and measurement noise covariances, RANSAC thresholds, and segmentation settings, were fixed across all experiments to ensure fair comparison. Only the initial state of the ego-pose was changed depending on the starting frame of each sequence.
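For reference, a minimal sketch of the frame-to-frame RPE computation is given below; poses are assumed to be homogeneous SE(2) matrices, and only the translational part of the relative-motion error is reported.
```python
import numpy as np

def relative_pose_error(gt_poses, est_poses):
    """Translational frame-to-frame RPE between two aligned pose sequences."""
    errs = []
    for k in range(1, len(gt_poses)):
        rel_gt = np.linalg.inv(gt_poses[k - 1]) @ gt_poses[k]      # ground-truth relative motion
        rel_est = np.linalg.inv(est_poses[k - 1]) @ est_poses[k]   # estimated relative motion
        err = np.linalg.inv(rel_gt) @ rel_est                      # relative-motion discrepancy
        errs.append(np.linalg.norm(err[:2, 2]))                    # translational error magnitude
    return np.array(errs)
```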

3.2. Evaluation Using an Augmented State Vector Including Control Inputs

Evaluation is first conducted using an augmented state vector including control inputs. Sequence 4, which was collected on a wide highway with dynamic vehicles, is presented as a representative example that demonstrates the effectiveness of the proposed method in dynamic environments. Figure 4 represents the comparison of the estimation results when control inputs are estimated as states (middle) and when they are used as measurements (bottom). The ego-pose is represented by a circle without an outline, indicating the agent’s position, and a short line, indicating its orientation. The agent’s trajectory is depicted as a line following its motion. Landmarks are plotted as small dots: inliers are shown as opaque points, while outliers are shown as translucent ones. In addition, the raw landmarks of dynamic objects are represented as squares. The distinction between objects is expressed through color differences in the landmarks. The positions of objects are indicated by circles, and the velocities of dynamic objects are visualized using arrows.
The scene includes one vehicle approaching the agent (green), another moving away from it (purple), and static landmarks distributed throughout the environment (brown). Since only a few static landmarks are available and they are difficult to match, assuming a static environment and applying the classical method can result in significant estimation errors or even lead to divergence. Although the absence of known control inputs during estimation makes the estimation highly sensitive to noise, the proposed method produces a coarse but informative estimate of object states that helps reduce ego-pose errors during estimation.
Figure 5 shows the estimated trajectory. Although the proposed method is not free from estimation errors, it demonstrates significantly better robustness compared to the classical approach, which suffers from severe scale drift and larger error in orientation estimation. This result confirms that incorporating the motion of dynamic objects into the estimation process helps compensate for ego-pose errors.
Figure 6 shows the estimated ego-pose state values, where the proposed method exhibits less perturbation in estimation of orientation and less scale drift.
Table 1 and Figure 7 further confirm the effectiveness of the proposed method. A substantial reduction in RPE is observed in dynamic sequences, achieving up to 53% improvement over the classical method. While performance in static environments is mostly comparable, the proposed method demonstrates strong robustness in handling dynamic objects through joint estimation of control inputs and states.

3.3. Evaluation Using Control Inputs as Measurements

The evaluation was also conducted by treating the control inputs as measurements. Figure 8 presents the estimation results of six selected frames from sequence 8.
For each frame, the upper figure shows the result of image-based measurement, including the detected features and instance segmentation masks. The lower figure shows the result of estimation.
Beginning with Frame 16 (Figure 8a), a dynamic object (pink) is present and is moving away from the agent. It is shown that the state of this moving object is successfully estimated.
In Frame 27 (Figure 8b), the dynamic object detected in Frame 16 remains dynamic in reality but is considered an outlier due to a large innovation, caused by having few landmarks and being too far from the agent. Instead, another dynamic object (green) appears, which is approaching the agent, and its state is also correctly estimated.
This dynamic object is consistently tracked in the subsequent frames. In Frames 39 (Figure 8d) and 42 (Figure 8e), the object state is shown to be estimated with minimal center shift caused by truncation during the filtering process.
Finally, in Frame 44 (Figure 8f), when the object is no longer visible in the image, its landmarks are excluded from the ego-pose estimation.
Figure 9 shows the estimated trajectories obtained by the classical 1-point RANSAC and the proposed method, together with the ego-pose values over time. The trajectory estimated by the proposed method follows the ground truth more closely than that of the classical 1-point RANSAC, and larger errors are observed with the classical method in the frames that contain dynamic objects. Figure 10 shows the yaw angle of the agent in the dynamic frames in detail; the yaw angle is estimated with less perturbation when the proposed method is used.
The evaluation results of the proposed algorithm are presented in Table 2 and Figure 11. In sequences 1, 4, and 8, which contain dynamic objects, a lower relative error is obtained with the proposed algorithm. There are slight variations in error in static environments, but the overall performance is nearly identical. A 9% lower error is obtained over all sequences, and a 20% lower error is obtained for the dynamic-environment sequences. In static environments, some factors improve performance while others degrade it: the additional filtering procedure that rejects high-innovation landmarks can improve the estimation, whereas static objects can be misclassified as dynamic due to measurement noise, which can degrade it.

4. Conclusions

We proposed a novel EKF-based visual odometry method designed for dynamic environments, where dynamic landmarks are explicitly incorporated into the ego-pose estimation process. Unlike conventional approaches that discard or ignore dynamic elements, our method leverages a 1-point RANSAC-based EKF framework, enabling robust state estimation even in the presence of moving objects and noisy measurements.
The proposed method demonstrated consistent improvements in relative pose error across various dynamic scenarios, validating its effectiveness in maintaining accurate ego-pose estimation under real-world conditions. We showed that the algorithm can estimate and track both ego-motion and object states without significant degradation caused by occlusion, landmark sparsity, or truncation artifacts.
Moreover, the framework is computationally efficient, achieving an average computation time of 0.0303 s per frame for the Kalman filter and 0.158 s per frame for image processing (feature matching and instance segmentation). This makes the system suitable for real-time applications and practical deployment in autonomous navigation tasks. The proposed method can be extended to various mobile platforms such as autonomous vehicles, drones, or robots operating in dynamic environments, where the assumption of static surroundings frequently fails.
While the method demonstrates reliable performance in typical dynamic scenarios, it may face challenges in extremely highly dynamic scenes or when measurements are severely degraded by noise. As part of future work, we plan to enhance the filter’s robustness in such conditions by incorporating motion priors and adaptive noise models that can better handle large landmark uncertainty and fast object motion. These improvements are expected to further increase the accuracy and reliability of ego-pose estimation in complex real-world environments.

Supplementary Materials

Video S1: EKF-based visual odometry in dynamic environments using modified 1-point RANSAC. The proposed method is available at: https://youtu.be/YT0lMQSoG2M (accessed on 17 October 2025).

Author Contributions

Conceptualization, J.L. and J.K.; methodology, J.L.; software, J.L.; writing—original draft preparation, J.L.; writing—review and editing, J.K.; supervision, J.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by INHA University Research Grant (INHA-75414-1).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used in this study are openly available at the KITTI Vision Benchmark Suite (http://www.cvlibs.net/datasets/kitti/ (accessed on 17 October 2025)).

Acknowledgments

This work was supported by INHA University Research Grant (INHA-75414-1).

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Appendix A. Classical 1-Point RANSAC

The 1-point RANSAC algorithm is described in Algorithm A1.
Algorithm A1 1-Point RANSAC-Based EKF Pose Estimation
  • Input: Predicted state and covariance $\hat{\chi}_{k|k-1}$, $P_{k|k-1}$; valid feature set $J_k$; innovation threshold $\gamma_{\text{innov}}$; number of RANSAC hypotheses $n_{\text{hyp}}$
  • Output: Updated state $\hat{\chi}_{k|k}$, covariance $P_{k|k}$, final inlier set $I_k$
  • Initialize best inlier set $\tilde{I} \leftarrow \emptyset$
  • for $h = 1$ to $n_{\text{hyp}}$ do
  •     Randomly select one landmark $z_k^i$ from $J_k$
  •     Perform a temporary EKF state update using $z_k^i$ (no covariance update) to obtain $\hat{\chi}_k^{(h)}$
  •     Initialize temporary inlier set $I^{(h)} \leftarrow \{i\}$
  •     for each $j \in J_k$, $j \neq i$ do
  •        Predict measurement $\hat{z}_k^j$ from state $\hat{\chi}_k^{(h)}$
  •        Compute innovation $\nu^j = z_k^j - \hat{z}_k^j$
  •        if $\|\nu^j\| < \gamma_{\text{innov}}$ then
  •            Add $j$ to inlier set $I^{(h)}$
  •        end if
  •     end for
  •     if $|I^{(h)}| > |\tilde{I}|$ then
  •        Update $\tilde{I} \leftarrow I^{(h)}$ and save $\hat{\chi}_k^{(h)}$
  •     end if
  • end for
  • Perform a full EKF update using inlier set $\tilde{I}$:
  •  Stack all residuals $\nu$ and Jacobians $H$ for the inliers
  •  Compute $S = H P_{k|k-1} H^\top + R$
  •  Compute Kalman gain $K = P_{k|k-1} H^\top S^{-1}$
  •  Update state: $\hat{\chi}_{k|k} = \hat{\chi}_{k|k-1} + K \nu$
  •  Update covariance: $P_{k|k} = (I - K H) P_{k|k-1}$
  • Initialize final inlier set $I_k \leftarrow \tilde{I}$
  • for each $i \in J_k \setminus \tilde{I}$ do
  •   Predict $\hat{z}_k^i$ from the updated state $\hat{\chi}_{k|k}$
  •   Compute innovation $\nu_i$ and innovation covariance $S_i$
  •   Compute Mahalanobis distance $d_i = \nu_i^\top S_i^{-1} \nu_i$
  •   if $d_i < \gamma_{\chi^2}$ then
  •    Add $i$ to final inlier set $I_k$
  •   end if
  • end for
  • return Updated state $\hat{\chi}_{k|k}$, covariance $P_{k|k}$, inlier set $I_k$

Appendix B. RANSAC-Based Euclidean Transform Estimation

The RANSAC-based Euclidean transform estimation is described in Algorithm A3, which relies on the least-squares rigid transformation of Algorithm A2.
Algorithm A2 compute_rigid_transform: Least-Squares Rigid Transformation Estimation
  • Input: Two corresponding 2D point sets $\acute{X}_k$ and $\acute{X}_{k-1}$ (each with $N_p$ points)
  • Output: Rotation matrix $\Omega \in SO(2)$, translation vector $\tau \in \mathbb{R}^2$
  • Compute the centroid (mean) of each point set:
    $m_k = \frac{1}{N_p} \sum_{i=1}^{N_p} \acute{X}_k^i, \quad m_{k-1} = \frac{1}{N_p} \sum_{i=1}^{N_p} \acute{X}_{k-1}^{m(i)}$
  • Center the points by subtracting the centroids:
    $\tilde{a}_i = \acute{X}_k^i - m_k, \quad \tilde{b}_i = \acute{X}_{k-1}^{m(i)} - m_{k-1}$
  • Compute the covariance matrix:
    $\Theta = \sum_{i=1}^{N_p} \tilde{a}_i \tilde{b}_i^\top$
  • Perform singular value decomposition (SVD) on $\Theta$:
    $\Theta = U \Sigma V^\top$
  • Compute the rotation matrix:
    $\Omega = V U^\top$
  • if $\det(\Omega) < 0$ then
  •       Negate the second row of $V$ and recompute $\Omega$
  • end if
  • Compute the translation vector:
    $\tau = m_{k-1} - \Omega\, m_k$
  • return $\Omega$ and $\tau$
Algorithm A3 estimate_rigid_transform_ransac: Deterministic RANSAC-Based 2D Rigid Transformation Estimation
  • Input: Two 2D point sets $\acute{X}_k$, $\acute{X}_{k-1}$ with $N_p$ points, reprojection threshold $T$, maximum iterations $N_{\max}$
  • Output: Best transformation matrix $M \in SE(2)$, inlier mask $I \in \{0,1\}^{N_p}$
  • Initialize best inlier count to zero
  • for each unique 2-point combination from $\acute{X}_k$ and $\acute{X}_{k-1}$ (up to $N_{\max}$ iterations) do
  •       Select two point pairs as a minimal sample set
  •       Estimate transformation $(\Omega, \tau)$ using Algorithm A2 (compute_rigid_transform)
  •       Apply $(\Omega, \tau)$ to all points in $\acute{X}_k$
  •       Compute the reprojection error for each transformed point
  •       Determine inliers where error $< T$
  •       if current inlier count > best inlier count then
  •             Update best transformation $M$ and inlier mask $I$
  •       end if
  • end for
  • return Best transformation $M$ and inlier set $I$

Appendix C. Calculation of Object State

The calculation of the object state is presented in Algorithm A4.
Algorithm A4 Adaptive RANSAC-Based Velocity and Rotation Estimation
  • Input: Current points $\acute{X}_k$, previous points $\acute{X}_{k-1}$
  • if $|\acute{X}_k| \geq 3$ then
  •       Estimate rigid transform $M$ between $\acute{X}_{k-1}$ and $\acute{X}_k$ using Algorithm A3 (estimate_rigid_transform_ransac)
  •       Compute the reprojection error for each transformed point
  •       Construct the inlier mask from the RANSAC result
  •       if inliers exist then
  •             Apply DBSCAN to the inlier points
  •             Extract the largest cluster and update the inlier mask
  •       end if
  •       if $M$ and valid inliers exist then
  •             Compute the covariance matrix of the inlier points
  •             if maximum eigenvalue of the covariance $< T_{\text{cov}}$ then
  •                   Compute cluster center $c = \mathrm{mean}(\acute{X}_k^{\text{inlier}})$
  •                   Compute velocity from $M$: $\dot{c} = \tau / \Delta t$
  •                   Extract rotation matrix $\Omega$ from $M$
  •                   Compute angular velocity $\dot{\psi} = \arctan2(\Omega_{21}, \Omega_{11}) / \Delta t$
  •             else
  •                   Reject the object as an outlier
  •             end if
  •       else
  •             Reject the object as an outlier
  •       end if
  • else
  •       Reject the object as an outlier
  • end if
  • return inlier mask, center $c$, velocity $\dot{c}$, angular velocity $\dot{\psi}$

Appendix D. Measurement Noise Covariance in Polar Coordinates

This appendix derives the measurement noise covariance in polar coordinates ( ρ , θ ) under the assumption that the stereo matching pixel disparity Δ v and pixel location v have independent, constant variance σ v 2 .
Given a stereo camera with baseline $B$ and focal length $\Phi$, the depth $X$ is computed from the disparity $\Delta v$ as:
$$X = \frac{\Phi B}{\Delta v}.$$
The horizontal pixel location $v$ determines the offset from the optical center. The corresponding 3D point in Cartesian coordinates is given by:
$$x = X = \frac{\Phi B}{\Delta v},$$
$$y = X \cdot \frac{v}{\Phi} = \frac{B v}{\Delta v}.$$
The polar coordinates $(\rho, \theta)$ are defined by:
$$\rho = \sqrt{x^2 + y^2} = \frac{B}{\Delta v}\sqrt{\Phi^2 + v^2},$$
$$\theta = \arctan\frac{y}{x} = \arctan\frac{v}{\Phi}.$$
Assuming that the measurement vector $z = [v, \Delta v]^\top$ has isotropic noise, we define the covariance as:
$$\mathrm{Cov}(z) = \sigma_v^2 \cdot I_{2 \times 2}.$$
The measurement noise in $(\rho, \theta)$ is obtained via first-order error propagation using the Jacobian $J = \frac{\partial(\rho, \theta)}{\partial(v, \Delta v)}$. Computing the partial derivatives:
$$\frac{\partial \rho}{\partial v} = \frac{B v}{\Delta v \sqrt{\Phi^2 + v^2}}, \qquad \frac{\partial \rho}{\partial \Delta v} = -\frac{B \sqrt{\Phi^2 + v^2}}{(\Delta v)^2},$$
$$\frac{\partial \theta}{\partial v} = \frac{\Phi}{\Phi^2 + v^2}, \qquad \frac{\partial \theta}{\partial \Delta v} = 0.$$
Then, the variances in polar coordinates are:
$$\mathrm{Var}(\rho) = \sigma_v^2 \left[\left(\frac{\partial \rho}{\partial v}\right)^2 + \left(\frac{\partial \rho}{\partial \Delta v}\right)^2\right] = \sigma_v^2 \left[\frac{B^2 v^2}{\Delta v^2 (\Phi^2 + v^2)} + \frac{B^2 (\Phi^2 + v^2)}{(\Delta v)^4}\right],$$
$$\mathrm{Var}(\theta) = \sigma_v^2 \cdot \left(\frac{\Phi}{\Phi^2 + v^2}\right)^2.$$
To interpret these results, we express the variances in terms of $(\rho, \theta)$.
From the definition of $\rho$:
$$\rho = \frac{B}{\Delta v}\sqrt{\Phi^2 + v^2} \;\;\Rightarrow\;\; \rho^2 = \frac{B^2 (\Phi^2 + v^2)}{(\Delta v)^2}.$$
Using this, we can express the second term in $\mathrm{Var}(\rho)$ as:
$$\frac{B^2 (\Phi^2 + v^2)}{(\Delta v)^4} = \frac{\rho^4}{B^2}.$$
Thus, $\mathrm{Var}(\rho)$ becomes:
$$\mathrm{Var}(\rho) = \sigma_v^2 \left[\frac{v^2}{\Phi^2 + v^2} \cdot \frac{B^2}{\Delta v^2} + \frac{\rho^4}{B^2}\right].$$
This shows that $\mathrm{Var}(\rho)$ includes a term proportional to $\rho^4$, and as depth increases (i.e., $\Delta v \to 0$), the second term dominates. Therefore, in the high-depth regime, the radial uncertainty approximately grows with the fourth power of range:
$$\mathrm{Var}(\rho) \approx \sigma_v^2 \cdot \frac{\rho^4}{B^2}.$$
For the angular variance, we express $v$ in terms of $\theta$:
$$\frac{v}{\Phi} = \tan\theta \;\;\Rightarrow\;\; \Phi^2 + v^2 = \Phi^2 (1 + \tan^2\theta) = \frac{\Phi^2}{\cos^2\theta}.$$
Substituting into the expression for $\mathrm{Var}(\theta)$ yields:
$$\mathrm{Var}(\theta) = \sigma_v^2 \cdot \left(\frac{\Phi}{\Phi^2 + v^2}\right)^2 = \sigma_v^2 \cdot \frac{\cos^4\theta}{\Phi^2}.$$
This indicates that the angular uncertainty scales with $\cos^4\theta$: it is largest near the image center ($\theta \approx 0$) and decreases toward the periphery as $\cos^4\theta$ decreases.
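The first-order propagation above can be checked numerically with the short Monte-Carlo sketch below; the baseline, focal length, pixel offset, disparity, and noise level are illustrative placeholders, not the calibration values used in this paper.
```python
import numpy as np

B, Phi = 0.54, 700.0            # baseline [m], focal length [px] (placeholders)
v, dv = 100.0, 10.0             # pixel offset and disparity [px]
sigma_v = 0.5                   # pixel noise standard deviation [px]

rng = np.random.default_rng(0)
vs = v + sigma_v * rng.standard_normal(500_000)
dvs = dv + sigma_v * rng.standard_normal(500_000)

rho = B / dvs * np.sqrt(Phi**2 + vs**2)        # sampled range
theta = np.arctan(vs / Phi)                    # sampled bearing

var_rho_lin = sigma_v**2 * (B**2 * v**2 / (dv**2 * (Phi**2 + v**2))
                            + B**2 * (Phi**2 + v**2) / dv**4)
var_theta_lin = sigma_v**2 * (Phi / (Phi**2 + v**2))**2

print(np.var(rho), var_rho_lin)       # sampled vs. linearised radial variance
print(np.var(theta), var_theta_lin)   # sampled vs. linearised angular variance
```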

References

  1. Gay, S.; Le Run, K.; Pissaloux, E.; Romeo, K.; Lecomte, C. Towards a predictive bio-inspired navigation model. Information 2021, 12, 100. [Google Scholar] [CrossRef]
  2. Wang, Y.; Shao, B.; Zhang, C.; Zhao, J.; Cai, Z. Revio: Range-and event-based visual-inertial odometry for bio-inspired sensors. Biomimetics 2022, 7, 169. [Google Scholar] [CrossRef] [PubMed]
  3. Scaramuzza, D.; Fraundorfer, F. Visual odometry [tutorial]. IEEE Robot. Autom. Mag. 2011, 18, 80–92. [Google Scholar] [CrossRef]
  4. Mur-Artal, R.; Montiel, J.M.M.; Tardos, J.D. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Trans. Robot. 2015, 31, 1147–1163. [Google Scholar] [CrossRef]
  5. Lepetit, V.; Moreno-Noguer, F.; Fua, P. EP n P: An accurate O (n) solution to the P n P problem. Int. J. Comput. Vis. 2009, 81, 155–166. [Google Scholar] [CrossRef]
  6. Ferraz, L.; Binefa, X.; Moreno-Noguer, F. Very fast solution to the PnP problem with algebraic outlier rejection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 501–508. [Google Scholar]
  7. Jiang, Y.; Chen, H.; Xiong, G.; Scaramuzza, D. Icp stereo visual odometry for wheeled vehicles based on a 1dof motion prior. In Proceedings of the 2014 IEEE International Conference on Robotics and Automation (ICRA), Hong Kong, China, 31 May–7 June 2014; IEEE: New York, NY, USA, 2014; pp. 585–592. [Google Scholar]
  8. Triggs, B.; McLauchlan, P.F.; Hartley, R.I.; Fitzgibbon, A.W. Bundle adjustment—A modern synthesis. In Proceedings of the International Workshop on Vision Algorithms, Corfu, Greece, 21–22 September 1999; Springer: Berlin/Heidelberg, Germany, 1999; pp. 298–372. [Google Scholar]
  9. Davison, A.J.; Reid, I.D.; Molton, N.D.; Stasse, O. MonoSLAM: Real-time single camera SLAM. IEEE Trans. Pattern Anal. Mach. Intell. 2007, 29, 1052–1067. [Google Scholar] [CrossRef]
  10. Klein, G.; Murray, D. Parallel tracking and mapping for small AR workspaces. In Proceedings of the 2007 6th IEEE and ACM International Symposium on Mixed and Augmented Reality, Nara, Japan, 13–16 November 2007; IEEE: New York, NY, USA, 2007; pp. 225–234. [Google Scholar]
  11. Büchner, M.; Dahiya, L.; Dorer, S.; Ramtekkar, V.; Nishimiya, K.; Cattaneo, D.; Valada, A. Visual Loop Closure Detection Through Deep Graph Consensus. arXiv 2025, arXiv:2505.21754. [Google Scholar] [CrossRef]
  12. Wang, N.; Chen, X.; Shi, C.; Zheng, Z.; Yu, H.; Lu, H. SGLC: Semantic graph-guided coarse-fine-refine full loop closing for LiDAR SLAM. IEEE Robot. Autom. Lett. 2024, 9, 11545–11552. [Google Scholar] [CrossRef]
  13. Lim, T.Y.; Sun, B.; Pollefeys, M.; Blum, H. Loop Closure from Two Views: Revisiting PGO for Scalable Trajectory Estimation through Monocular Priors. arXiv 2025, arXiv:2503.16275. [Google Scholar] [CrossRef]
  14. Ligorio, G.; Sabatini, A.M. Extended Kalman filter-based methods for pose estimation using visual, inertial and magnetic sensors: Comparative analysis and performance evaluation. Sensors 2013, 13, 1919–1941. [Google Scholar] [CrossRef]
  15. van Goor, P.; Mahony, R. An equivariant filter for visual inertial odometry. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; IEEE: New York, NY, USA, 2021; pp. 14432–14438. [Google Scholar]
  16. Bosse, M.; Agamennoni, G.; Gilitschenski, I. Robust estimation and applications in robotics. Found. Trends® Robot. 2016, 4, 225–269. [Google Scholar] [CrossRef]
  17. Agarwal, P.; Tipaldi, G.D.; Spinello, L.; Stachniss, C.; Burgard, W. Robust map optimization using dynamic covariance scaling. In Proceedings of the 2013 IEEE International Conference on Robotics and Automation, Karlsruhe, Germany, 6–10 May 2013; IEEE: New York, NY, USA, 2013; pp. 62–69. [Google Scholar]
  18. Sünderhauf, N.; Protzel, P. Switchable constraints for robust pose graph SLAM. In Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, Algarve, Portugal, 7–12 October 2012; IEEE: New York, NY, USA, 2012; pp. 1879–1884. [Google Scholar]
  19. Neira, J.; Tardós, J.D. Data association in stochastic mapping using the joint compatibility test. IEEE Trans. Robot. Autom. 2001, 17, 890–897. [Google Scholar] [CrossRef]
  20. Civera, J.; Grasa, O.G.; Davison, A.J.; Montiel, J. 1-point RANSAC for EKF-based structure from motion. In Proceedings of the 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, St. Louis, MO, USA, 11–15 October 2009; IEEE: New York, NY, USA, 2009; pp. 3498–3504. [Google Scholar]
  21. Cao, L.; Liu, J.; Lei, J.; Zhang, W.; Chen, Y.; Hyyppä, J. Real-time motion state estimation of feature points based on optical flow field for robust monocular visual-inertial odometry in dynamic scenes. Expert Syst. Appl. 2025, 274, 126813. [Google Scholar] [CrossRef]
  22. Ding, S.; Ma, T.; Li, Y.; Xu, S.; Yang, Z. RD-VIO: Relative-depth-aided visual-inertial odometry for autonomous underwater vehicles. Appl. Ocean. Res. 2023, 134, 103532. [Google Scholar] [CrossRef]
  23. Ballester, I.; Fontan, A.; Civera, J.; Strobl, K.H.; Triebel, R. DOT: Dynamic object tracking for visual SLAM. arXiv 2020, arXiv:2010.00052. [Google Scholar] [CrossRef]
  24. Wu, Y.; Zhang, Z.; Chen, H.; Li, J. A Motion Segmentation Dynamic SLAM for Indoor GNSS-Denied Environments. Sensors 2025, 25, 4952. [Google Scholar] [CrossRef]
  25. Ram, K.; Kharyal, C.; Harithas, S.S.; Krishna, K.M. RP-VIO: Robust plane-based visual-inertial odometry for dynamic environments. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021; IEEE: New York, NY, USA, 2021; pp. 9198–9205. [Google Scholar]
  26. Eckenhoff, K.; Geneva, P.; Merrill, N.; Huang, G. Schmidt-EKF-based visual-inertial moving object tracking. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; IEEE: New York, NY, USA, 2020; pp. 651–657. [Google Scholar]
  27. Ma, G.; Li, C.; Jing, H.; Kuang, B.; Li, M.; Wang, X.; Jia, G. Schur Complement Optimized Iterative EKF for Visual–Inertial Odometry in Autonomous Vehicles. Machines 2025, 13, 582. [Google Scholar] [CrossRef]
  28. Song, S.; Lim, H.; Lee, A.J.; Myung, H. DynaVINS: A visual-inertial SLAM for dynamic environments. IEEE Robot. Autom. Lett. 2022, 7, 11523–11530. [Google Scholar] [CrossRef]
  29. Song, S.; Lim, H.; Lee, A.J.; Myung, H. DynaVINS++: Robust visual-inertial state estimator in dynamic environments by adaptive truncated least squares and stable state recovery. IEEE Robot. Autom. Lett. 2024. [Google Scholar] [CrossRef]
  30. Liu, J.; Yao, D.; Liu, N.; Yuan, Y. Towards Biologically-Inspired Visual SLAM in Dynamic Environments: IPL-SLAM with Instance Segmentation and Point-Line Feature Fusion. Biomimetics 2025, 10, 558. [Google Scholar] [CrossRef]
  31. Todoran, H.G.; Bader, M. Extended Kalman Filter (EKF)-based local SLAM in dynamic environments: A framework. In Proceedings of the Advances in Robot Design and Intelligent Control: Proceedings of the 24th International Conference on Robotics in Alpe-Adria-Danube Region (RAAD), Cluj-Napoca, Romania, 5–7 June 2015; Springer: Berlin/Heidelberg, Germany, 2015; pp. 459–469. [Google Scholar]
  32. Chiu, C.Y. SLAM Backends with Objects in Motion: A Unifying Framework and Tutorial. arXiv 2022, arXiv:2207.05043. [Google Scholar]
  33. Tian, X.; Zhu, Z.; Zhao, J.; Tian, G.; Ye, C. DL-SLOT: Tightly-coupled dynamic LiDAR SLAM and 3D object tracking based on collaborative graph optimization. IEEE Trans. Intell. Veh. 2023, 9, 1017–1027. [Google Scholar] [CrossRef]
  34. Huang, J.; Yang, S.; Zhao, Z.; Lai, Y.; Hu, S. ClusterSLAM: A SLAM backend for simultaneous rigid body clustering and motion estimation. Comput. Vis. Media 2021, 7, 87–101. [Google Scholar] [CrossRef]
  35. DeTone, D.; Malisiewicz, T.; Rabinovich, A. SuperPoint: Self-supervised interest point detection and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 224–236. [Google Scholar]
  36. Sarlin, P.E.; DeTone, D.; Malisiewicz, T.; Rabinovich, A. SuperGlue: Learning feature matching with graph neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4938–4947. [Google Scholar]
  37. Zhang, Y.; Sun, P.; Jiang, Y.; Yu, D.; Weng, F.; Yuan, Z.; Luo, P.; Liu, W.; Wang, X. ByteTrack: Multi-object tracking by associating every detection box. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 1–21. [Google Scholar]
  38. Ester, M.; Kriegel, H.P.; Sander, J.; Xu, X. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), Portland, OR, USA, 2–4 August 1996; pp. 226–231. [Google Scholar]
  39. Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The kitti dataset. Int. J. Robot. Res. 2013, 32, 1231–1237. [Google Scholar] [CrossRef]
Figure 2. Overall structure of the modified 1-point RANSAC algorithm.
Figure 3. Feature detection and matching visualization. (Top): previous stereo images (left and right) with detected features and object masks. (Bottom): current stereo images. Feature correspondences are shown between the previous and current left images (temporal matching, left column), and between the current left and right images (stereo matching, bottom row).
Figure 4. Estimation results for Frames 17–20 of Sequence 4. Top: input image; Middle: estimation with control inputs as states; Bottom: with control inputs as measurements. Ego-pose is shown as a circle with a short line indicating heading; static landmarks are brown dots; dynamic object landmarks are colored squares (green = approaching, purple = receding); and arrows indicate estimated object velocities.
Figure 5. Comparison of estimated trajectories using the classical 1-point RANSAC (red), the proposed method (green), and the ground truth (blue). The proposed method jointly estimates control inputs as part of the state vector and produces a trajectory that more closely follows the ground truth, demonstrating improved accuracy and robustness over the classical approach.
Figure 6. Comparison of ego-pose estimation using the classical 1-point RANSAC (red), the proposed method with control inputs in the state vector (green), and the ground truth (blue). Top, middle, and bottom plots show x, y, and yaw angle ψ over time, respectively. The proposed method achieves higher accuracy and stability in dynamic scenes by jointly estimating control inputs as part of the state.
Figure 7. Quantitative evaluation of the proposed method, which incorporates control inputs into the state vector, using RPE on KITTI sequences. Significant improvements are observed in dynamic environments (Sequences 1, 4, and 8), with up to 53% lower error compared to the classical method. While performance in static environments remains comparable overall, noise-induced misclassification of static objects can slightly degrade accuracy.
Figure 8. Estimation results for six selected frames (Sequence 8). Top: input images with detected features and instance segmentation masks. Bottom: estimated ego-pose and object states obtained by the proposed method. The ego-pose is shown as a circle with a short line indicating heading. Static landmarks are plotted as brown dots, and dynamic object landmarks are colored squares (green = approaching object, pink = receding object). Arrows represent the estimated object velocities, and circles indicate each object’s current position. The figure demonstrates that the proposed algorithm successfully detects and tracks dynamic objects across multiple frames while maintaining consistent ego-pose estimation.
Figure 9. (a) Comparison of estimated trajectories using the classical 1-point RANSAC (red), the proposed method treating control inputs as measurements (green), and the ground truth (blue). The proposed measurement-based method produces a trajectory that more closely follows the ground truth, demonstrating improved accuracy and stability compared to the classical approach. (b) Comparison of ego-pose estimation over time using the classical 1-point RANSAC (red), the proposed method treating control inputs as measurements (green), and the ground truth (blue). Top, middle, and bottom plots show X, Y, and yaw angle ψ over time, respectively. Larger estimation errors are observed in the classical method when dynamic objects are present, while the proposed algorithm maintains better accuracy and smoother orientation estimation.
Figure 10. Yaw angle estimation at dynamic frames for the classical 1-point RANSAC (red), the proposed method treating control inputs as measurements (green), and the ground truth (blue). The red curve shows larger fluctuations in yaw estimation under dynamic conditions, while the proposed method (green) maintains a smoother and more stable estimate that closely follows the ground truth. This demonstrates the improved robustness of the proposed approach against disturbances caused by moving objects in the scene.
Figure 11. Quantitative evaluation of the proposed method using relative pose error (RPE) on KITTI sequences. Significant improvements are observed in dynamic environments (Sequences 1, 4, and 8), with up to 20% lower error compared to the classical method. While performance in static environments remains comparable overall, noise-induced misclassification of static objects can slightly degrade accuracy.
Table 1. Relative Pose Error on KITTI sequences with control inputs estimated as states.

Sequence         Classical    Proposed
00               0.2754       0.2130
01               1.8018       0.8112
02               0.3248       0.4334
03               0.2791       0.3508
04               0.9288       0.3193
05               0.3511       0.3559
06               0.1342       0.1094
07               1.0964       0.2852
08               0.6455       0.4526
09               0.5847       0.4866
10               0.2120       0.5997
Mean             0.6031       0.4016
Dynamic cases    1.1254       0.5277
Table 2. Relative Pose Error on KITTI sequences with control inputs as measurements.

Sequence         Classical    Proposed
00               0.0160       0.0173
01               0.0273       0.0214
02               0.0170       0.0166
03               0.0167       0.0144
04               0.0257       0.0187
05               0.0145       0.0133
06               0.0146       0.0160
07               0.0142       0.0112
08               0.0220       0.0202
09               0.0135       0.0149
10               0.0153       0.0157
Mean             0.0179       0.0163
Dynamic cases    0.0250       0.0201