Article

DOE-LVI: Tightly Coupled LiDAR-Visual-Inertial SLAM System with Dynamic Object Elimination

1
College of Information Engineering, Tarim University, Alar City 843300, China
2
School of Mechano-Electronic Engineering, Xidian University, Xi’an 710071, China
3
Baidu Inc., Beijing 100193, China
*
Author to whom correspondence should be addressed.
Sensors 2026, 26(12), 3717; https://doi.org/10.3390/s26123717
Submission received: 23 April 2026 / Revised: 3 June 2026 / Accepted: 9 June 2026 / Published: 11 June 2026
(This article belongs to the Section Environmental Sensing)

Abstract

In dynamic environments, Simultaneous Localization and Mapping (SLAM) systems often struggle with the challenges posed by moving objects. To address these issues, we propose Dynamic-Object-Elimination LiDAR-Visual-Inertial SLAM (DOE-LVI), an advanced tightly coupled LiDAR-Visual-Inertial SLAM system. DOE-LVI integrates two primary subsystems: the Visual-Inertial System (VIS) and the LiDAR-Inertial System (LIS). The VIS component extracts depth information from LiDAR scans and correlates it with visual features, providing accurate pose estimation by minimizing both visual and IMU residuals. The LIS uses this initial estimate to generate range images and perform preliminary removal of dynamic points. Misclassified points are then corrected through ground fitting and precise scan matching with the submap. For enhanced loop closure detection, DOE-LVI employs global LiDAR descriptors, which significantly improve both localization robustness and accuracy. Experimental evaluations on the KITTI and UrbanNav datasets demonstrate that DOE-LVI achieves robust localization and mapping performance, particularly in highly dynamic environments.

1. Introduction

Simultaneous Localization and Mapping (SLAM) is a fundamental capability for mobile robot navigation. By leveraging the unique characteristics of different sensors and fusing their data, SLAM systems can overcome the limitations of individual sensors, thereby improving accuracy, robustness, and applicability in complex environments [1,2,3]. For instance, LiDAR Odometry and Mapping (LOAM) [4] employs a loosely coupled approach, utilizing a Kalman filter to fuse LiDAR data with IMU measurements. High-frequency motion data from the IMU are utilized to correct the point cloud, enabling precise pose estimation. LiDAR-Inertial Odometry via Smoothing and Mapping (LIO-SAM) [5] is a tightly coupled LiDAR-Inertial Odometry system that leverages IMU pre-integration to compensate for point cloud distortion and provide an initial estimate for scan matching, resulting in faster processing and improved trajectory accuracy. LiDAR-Visual-Inertial Odometry via Smoothing and Mapping (LVI-SAM) [6] extends LIO-SAM by integrating a Visual-Inertial System alongside the LiDAR-Inertial System, with both subsystems tightly coupled to achieve robust and high-precision state estimation and mapping. Representative LiDAR-Inertial-Visual fusion systems, such as R3LIVE [7], further demonstrate that integrating LiDAR, inertial, and visual measurements can improve the robustness and accuracy of real-time state estimation and mapping. However, the mainstream SLAM algorithms [5,6,7,8,9] are primarily designed for static environments. In real-world scenarios, a more common situation involves complex scenes composed of both dynamic objects and static structures. Due to sensor occlusions, many features become associated with moving objects, leading to the failure of static-scene-based methods.
The primary methods for dynamic object removal typically involve comparing the point clouds across both time and space. Recent LiDAR moving object segmentation methods further exploit sequential range images to distinguish moving and static objects in 3D LiDAR scans [10]. For example, Kim et al. [11] utilized known global poses to construct a submap for each LiDAR scan and extracted dynamic point sets using a multi-resolution range image. Lim et al. [12] extracted descriptors from LiDAR scans and submaps, marking regions with low descriptor ratios as dynamic, where non-ground points were labeled as dynamic objects. However, these methods depend heavily on accurate global poses, and in environments with numerous dynamic objects, existing scan-matching techniques often struggle to provide the required accuracy. This limitation reduces the effectiveness of these methods for online use in dynamic environments. In contrast, Removal-First LiDAR-Inertial Odometry (RF-LIO) [13] offers an online solution by using adaptive multi-resolution range images to remove dynamic objects, followed by scan matching, enabling real-time operation in dynamic scenarios. Wen Lim et al. [14] proposed a label consistency-based dynamic point removal method, which reduces computational overhead and enables online localization and mapping.
In addition, loop closure is a critical step in SLAM, essential for correcting odometry drift and creating a globally consistent map. Most LiDAR-based loop closure detection methods rely on odometry. They use a k-dimensional tree (KD-tree) to find the closest keyframe in historical data as a loop closure candidate, then apply the Iterative Closest Point (ICP) algorithm to refine pose estimation. However, this method heavily relies on the system’s intrinsic accuracy and is significantly affected by odometry drift, leading to potential false loop closure candidates. In contrast, descriptor-based methods can mitigate the impact of accumulated errors. Scan-Context [15] encodes each LiDAR scan by representing the maximum height histogram in each bin of the horizontal plane as a point cloud descriptor. More recent structural place recognition methods, such as Scan Context++ [16], improve the robustness of LiDAR-based place recognition under rotation and lateral variations in urban environments. LiDAR-IRIS [17] further improved upon Scan-Context by using an 8-bit binary code to encode height information in each bin, providing rotation invariance and enhancing computational efficiency by avoiding brute-force matching.
In this work, we propose DOE-LVI, a tightly coupled LiDAR-Visual-Inertial Odometry system with dynamic object elimination, designed for real-time state estimation and mapping in long-term dynamic environments. DOE-LVI integrates a Visual-Inertial System (VIS) and a LiDAR-Inertial System (LIS). The VIS tracks visual features and estimates the pose by minimizing visual reprojection errors and IMU residuals. An active failure detection module assesses sensor reliability, providing the appropriate initial guess for LiDAR scan matching. The LIS first performs a coarse removal of dynamic points using range images between the current scan and the surrounding submap, then refines this process by recovering incorrectly identified points through ground plane fitting, followed by accurate pose estimation via scan matching. For loop closure detection, the LIS employs a global descriptor-based [17] method for place recognition, which is resilient to the influence of odometry drift. Compared with LIO-SAM and LVI-SAM, DOE-LVI is specifically designed for highly dynamic environments by introducing online dynamic object elimination before LiDAR scan matching. Unlike ERASOR, which relies on a pre-built static map, DOE-LVI removes dynamic points online by using the current scan and the surrounding submap. Compared with RF-LIO, DOE-LVI further integrates visual-inertial constraints, active failure detection, and global descriptor-based loop closure detection within a tightly coupled LiDAR-Visual-Inertial framework. Therefore, the main novelty of DOE-LVI lies in a dynamic environment-oriented system-level integration, together with an improved coarse-to-fine dynamic point removal strategy. It should be noted that LiDAR-IRIS is adopted as an existing global descriptor for loop closure detection, rather than being proposed as a new descriptor in this work. Our contribution lies in integrating it into the DOE-LVI framework to improve long-term mapping consistency in dynamic environments. 
The main contributions of our work can be summarized as follows:
(1) A dynamic environment-oriented, tightly coupled LiDAR-Visual-Inertial Odometry framework is proposed, integrating active failure detection, online dynamic object elimination, and an adopted LiDAR-IRIS-based place recognition module into a unified system.
(2) A real-time, coarse-to-fine dynamic point removal method is developed, in which an initial rough elimination is refined through ground fitting for improved accuracy.
(3) Comprehensive validations against state-of-the-art methods are conducted across various scales and environments.
The remainder of this paper is organized as follows: a framework for the proposed system is presented in Section 2. Experimental results are given in Section 3, with conclusions in Section 4.

2. The DOE-LVI Framework

The overall framework of DOE-LVI is shown in Figure 1. The proposed system takes measurements from an IMU, a 3D LiDAR, and a camera as inputs, and consists mainly of two subsystems: a Visual-Inertial System (VIS) and a LiDAR-Inertial System (LIS). The VIS renders the RGB color of the LiDAR scan, uses optical flow to track visual features, and obtains Visual-Inertial Odometry by minimizing visual reprojection and IMU measurement errors; it also evaluates the reliability of this odometry through active failure detection. The LIS selects appropriate initial values based on the system state to construct a submap. It preliminarily removes dynamic points by comparing the range images of the scan and the corresponding submap, and then refines the result by fitting the ground. The static scan is then matched to the submap, and after graph optimization, a global static map and pose are obtained. LiDAR-IRIS descriptors [17] are used for place recognition in loop closure detection to reduce cumulative errors. The optimized pose is further used to update the IMU bias, enabling the VIS and LIS to cooperate within the tightly coupled framework.

2.1. Visual-Inertial Odometry

We have adapted and extended the processing pipeline from [18] for our VIS, as detailed in Figure 2. The VIS first utilizes the image information to render the texture of the LiDAR scan, then extracts key points and establishes accurate correspondences between adjacent image frames using the Kanade–Lucas–Tomasi algorithm. Finally, bundle adjustment optimization is performed within a sliding window. Note that the VIS does not include loop closure detection. Given that Visual-Inertial Odometry (VIO) can fail under aggressive motion, illumination changes, and textureless conditions, relying on VIO alone for initial estimates may introduce significant errors. Therefore, fault detection is crucial to mitigate adverse impacts on LiDAR-Inertial Odometry (LIO). Due to space constraints, this section emphasizes our modifications, particularly in LiDAR scan rendering and active failure detection. For further details on visual residuals and visual feature depth association, please refer to [6,18].
Rendering the texture of a LiDAR scan: Rendering helps in visually inspecting and verifying the quality of the point cloud data and its alignment. It allows for an intuitive understanding of the spatial distribution and structure of the point cloud. The motion-compensated LiDAR scan is projected onto the nearest image frame along the time axis and matched with the corresponding pixels in the RGB image to generate a 3D colored point cloud. A LiDAR point in the LiDAR coordinate system is denoted as $p^{L} = [X^{L}, Y^{L}, Z^{L}]^{T}$, and its homogeneous form is denoted as $\tilde{p}^{L} = [X^{L}, Y^{L}, Z^{L}, 1]^{T}$. By using the extrinsic transformation from the LiDAR coordinate system to the camera coordinate system, the point is projected onto the image plane as follows:
$$p^{C} = \begin{bmatrix} X^{C} \\ Y^{C} \\ Z^{C} \end{bmatrix} = R_{CL}\, p^{L} + t_{CL}, \qquad \lambda \begin{bmatrix} u_c \\ v_c \\ 1 \end{bmatrix} = K \begin{bmatrix} R_{CL} & t_{CL} \end{bmatrix} \tilde{p}^{L}, \qquad \lambda = Z^{C}$$
where $K$ is the camera intrinsic matrix, $R_{CL}$ and $t_{CL}$ denote the rotation matrix and translation vector from the LiDAR coordinate system to the camera coordinate system, respectively, and $\lambda$ is the depth of the transformed point in the camera coordinate system. After normalization by $Z^{C}$, the pixel coordinates on the image plane are obtained as:
$$u_c = f_x \frac{X^{C}}{Z^{C}} + c_x \quad \text{and} \quad v_c = f_y \frac{Y^{C}}{Z^{C}} + c_y$$
where $(u_c, v_c)$ denotes the pixel coordinates on the image plane. The RGB values are then extracted from the color image and assigned to the corresponding 3D LiDAR points to obtain the colored LiDAR scan.
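The projection and coloring step can be illustrated with a minimal NumPy sketch. The function name `colorize_lidar_scan` and its argument layout are hypothetical and are not taken from the DOE-LVI implementation; the sketch assumes an undistorted pinhole image, an `N x 3` array of motion-compensated points, and nearest-pixel color lookup.

```python
import numpy as np


def colorize_lidar_scan(points_lidar, K, R_cl, t_cl, image):
    """Assign image colors to LiDAR points (hypothetical helper).

    points_lidar: (N, 3) points in the LiDAR frame.
    K: (3, 3) camera intrinsic matrix.
    R_cl, t_cl: rotation (3, 3) and translation (3,) from LiDAR to camera.
    image: (H, W, 3) color image.
    Returns the points that project inside the image and their colors.
    """
    points_lidar = np.asarray(points_lidar, dtype=float)

    # p^C = R_CL p^L + t_CL (LiDAR frame to camera frame)
    points_cam = points_lidar @ R_cl.T + t_cl

    # Points behind the camera have no valid projection.
    in_front = points_cam[:, 2] > 1e-6
    points_cam = points_cam[in_front]
    kept = points_lidar[in_front]

    # lambda [u, v, 1]^T = K p^C, followed by division by lambda = Z^C
    uvw = points_cam @ K.T
    u = uvw[:, 0] / uvw[:, 2]
    v = uvw[:, 1] / uvw[:, 2]

    # Nearest-pixel lookup, restricted to the image bounds.
    height, width = image.shape[:2]
    cols = np.floor(u).astype(int)
    rows = np.floor(v).astype(int)
    inside = (cols >= 0) & (cols < width) & (rows >= 0) & (rows < height)

    colors = image[rows[inside], cols[inside]]
    return kept[inside], colors
```

Points that fall outside the image or behind the camera are simply dropped in this sketch; a full system might instead keep them uncolored.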
Active failure detection: In scenarios involving rapid vehicle motion or textureless environments, visual feature tracking frequently fails, hindering the convergence of optimization. To address this, when the IMU bias $b_a$, $b_g$, velocity $v_{imu}$, and translation $\Delta t_s$ between adjacent frames exceed predefined thresholds, the VIS reinitializes and informs the LIS. Here, $b_a$ and $b_g$ denote the accelerometer bias and gyroscope bias, respectively; $v_{imu}$ denotes the IMU-derived velocity; and $\Delta t_s$ denotes the translation between two adjacent frames. The threshold values of these parameters are listed in Table 1. These thresholds were selected empirically according to the IMU noise characteristics and preliminary experiments. Once determined, they were kept fixed across all datasets and sequences, including KITTI, Semantic-KITTI, and UrbanNav, rather than being tuned separately for each sequence. The purpose of these thresholds is to conservatively detect abnormal VIS behavior and prevent unreliable VIO constraints from being added to the factor graph. It should be noted that these thresholds are not intended to be universal constants for all sensor platforms. For systems with different IMU noise levels, sensor configurations, motion characteristics, or operating environments, these thresholds may need to be recalibrated through preliminary experiments.
We observed that unstable VIS outputs can significantly affect the overall system performance. During the short recovery period after a VIS reboot, stable VIO constraints are not yet available. Therefore, VIO constraints are temporarily excluded from the factor graph to avoid degrading LiDAR-Inertial optimization. Instead, the initial value of the current frame is estimated using the optimized pose from the previous frame and IMU propagation. Then, point cloud registration is performed, and the current frame pose is obtained through factor graph optimization. The optimized LIO result is further used to update the IMU bias. In this way, active failure detection improves the stability of the system in complex scenarios.

2.2. Dynamic Point Removal

Dynamic point removal requires comparison of point clouds across different frames [9]. For each point cloud frame containing dynamic objects, a submap is constructed using nearby multi-frame point clouds. The difference between the range image of the current point cloud frame and the submap is used to generate a residual image. This residual information is then employed to partition the target map into two mutually exclusive subsets: the static point cloud map and the dynamic point cloud map.
The visibility-based dynamic point removal process is shown in Figure 3. Figure 3a presents a scene containing dynamic objects. Figure 3b,c show the range images generated from the current scan and the surrounding submap, respectively. By comparing these two range images, a residual image is obtained, as shown in Figure 3d. Pixels with large residual values indicate inconsistent observations between the current scan and the submap, and these points are identified as potential dynamic points.
Dynamic point filter: The construction process of the range image is illustrated in Figure 4. First, each LiDAR point is transformed from the Cartesian coordinate system to the spherical coordinate system, as shown in Figure 4a. Then, the full 3D LiDAR scan is unfolded into a 2D range image according to the azimuth and elevation angles, as shown in Figure 4b.
The vertical field of view of the LiDAR is divided into an upper part $F_{up}$ and a lower part $F_{down}$. The laser point $p(x, y, z)$ in the Cartesian coordinate system, as shown in Figure 4a, is represented in the spherical coordinate system as:
$$r = \sqrt{x^{2} + y^{2} + z^{2}}, \qquad \theta = \mathrm{yaw} = \arctan(y, x), \qquad \varphi = \mathrm{pitch} = \arcsin(z / r)$$
where $x$, $y$, and $z$ are the coordinates of the LiDAR point in the LiDAR coordinate system; $r$ denotes the range from the LiDAR origin to the point; yaw or $\theta$ denotes the horizontal azimuth angle; and pitch or $\varphi$ denotes the vertical elevation angle.
After scanning a full circle with the LiDAR, the point cloud is unfolded into a 2D image, with the $x$ direction as the front view. As shown in Figure 4b, the image center is set as the origin; the vertical coordinate is obtained from the projection of the pitch angle, and the horizontal coordinate is obtained from the projection of the yaw angle. To account for variations in the LiDAR field of view, horizontal resolution, and image size, both coordinates are normalized. The final coordinates $(u, v)$ of the laser point projected onto the range image are given by:
$$\begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} \frac{1}{2}\left[1 - \arctan(y, x) / \pi\right] w \\ \left[1 - \left(\arcsin(z / r) + F_{up}\right) / Fov\right] h \end{bmatrix}$$
where $u$ and $v$ denote the horizontal and vertical pixel coordinates of the projected point in the range image, respectively; $w$ and $h$ denote the width and height of the range image; $F_{up}$ and $F_{down}$ represent the upper and lower vertical fields of view of the LiDAR, respectively; and $Fov$ denotes the total vertical field of view.
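As an illustration of the spherical projection, the following sketch maps a single point to range-image pixel indices. The function name and parameters are hypothetical. The sketch offsets the pitch by the magnitude of the lower field of view so that the top beam maps to row 0 and the bottom beam to the last row; this offset convention is an assumption of the sketch and may differ from the authors' implementation.

```python
import math


def range_image_pixel(x, y, z, width, height, fov_up_deg, fov_down_deg):
    """Map a LiDAR point to (range, column, row) in a range image.

    fov_up_deg is the upper vertical field of view (positive, degrees);
    fov_down_deg is the lower one (negative, degrees).
    """
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        raise ValueError("the sensor origin has no defined bearing")

    yaw = math.atan2(y, x)      # horizontal azimuth angle
    pitch = math.asin(z / r)    # vertical elevation angle

    fov_up = math.radians(fov_up_deg)
    fov_down = math.radians(fov_down_deg)
    fov = fov_up - fov_down     # total vertical field of view

    # Normalized image coordinates; +x (front) lands on the center column.
    u = 0.5 * (1.0 - yaw / math.pi) * width
    # Assumed convention: pitch = fov_up gives v = 0, pitch = fov_down gives v = height.
    v = (1.0 - (pitch - fov_down) / fov) * height

    # Clamp to valid pixel indices.
    col = min(width - 1, max(0, int(u)))
    row = min(height - 1, max(0, int(v)))
    return r, col, row
```

For example, with a 1800 x 64 image and a vertical field of view from -24.8° to 2°, a point straight ahead at `(10, 0, 0)` falls on column 900, the image center.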
We compare the current scan $F_k$ with the corresponding submap $M_k$ to remove dynamic points. A submap refers to the set of keyframes adjacent to the current frame in the spatiotemporal dimension. We assemble frames with dynamic points removed into a global map. Each submap is constructed from the previous global map, so the frames used to build each submap have already undergone dynamic point removal.
The scan is divided into two mutually exclusive subsets: the static points set $F^{S}$ and the dynamic points set $F^{D}$.
$$F = F^{S} + F^{D}$$
The construction of the range image projects the 3D point cloud onto a 2D image, and when multiple 3D points project onto the same 2D point, the minimum distance from the 3D point $p \in \mathbb{R}^{3}$ to the current scan $F_k$ is selected as the pixel value on the range image:
$$I_k^{M}(i, j) = \min_{p \in P_{ij}^{M}} \mathrm{dist}(p), \qquad I_k^{F}(i, j) = \min_{p \in P_{ij}^{F}} \mathrm{dist}(p)$$
where $P_{ij}^{M}$ and $P_{ij}^{F}$ represent the spherical coordinates of the points $P^{M}$ and $P^{F}$, respectively, $i$ represents the azimuth angle, $j$ represents the elevation angle, and $\mathrm{dist}(\cdot)$ represents the distance of the 3D point to the local coordinate system of the scan. The visibility of points is computed by subtracting their matrices element-wise within the same fixed coordinate frame.
$$I_k^{diff} = I_k^{F} - I_k^{M}$$
When the pixel value $I_k^{diff}$ corresponding to the point $p \in P_{ij}^{F}$ is greater than the threshold $\tau$, the point is considered to be a dynamic point:
$$F_k^{D} = \left\{ p \in F_k \mid I_k^{diff} > \tau \right\}, \qquad \tau = k \times \mathrm{dist}(p)$$
where $k$ represents the sensitivity coefficient related to the point distance. The disparity in density between the scan and the submap, as observed in Figure 3, has resulted in the presence of numerous static points in the residual image, especially on the ground surface. This challenge can be mitigated by implementing an adaptive threshold and incorporating ground plane fitting methods.
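The coarse filter can be sketched as follows, assuming each point has already been assigned integer `(row, col)` range-image indices. The helper names and the default sensitivity `k=0.1` are hypothetical. The residual is taken as scan range minus submap range, following the order in which the two images appear in the text; an implementation that flags points nearer than the submap would swap the operands. Pixels with no submap observation are treated as static.

```python
import numpy as np


def build_range_image(pixels, ranges, shape):
    """Keep the minimum range per pixel; unobserved pixels stay at +inf."""
    pixels = np.asarray(pixels, dtype=int)
    ranges = np.asarray(ranges, dtype=float)
    image = np.full(shape, np.inf)
    # Unbuffered minimum so that repeated pixel indices are all considered.
    np.minimum.at(image, (pixels[:, 0], pixels[:, 1]), ranges)
    return image


def flag_dynamic_points(scan_pixels, scan_ranges, map_pixels, map_ranges,
                        shape, k=0.1):
    """Return a boolean mask over the scan points (True = dynamic candidate)."""
    scan_pixels = np.asarray(scan_pixels, dtype=int)
    scan_ranges = np.asarray(scan_ranges, dtype=float)

    scan_image = build_range_image(scan_pixels, scan_ranges, shape)
    map_image = build_range_image(map_pixels, map_ranges, shape)

    rows, cols = scan_pixels[:, 0], scan_pixels[:, 1]
    # Residual at each scan point's pixel: I^F - I^M.
    # A pixel missing from the submap gives -inf, which never exceeds tau.
    residual = scan_image[rows, cols] - map_image[rows, cols]

    # Distance-adaptive threshold tau = k * dist(p).
    tau = k * scan_ranges
    return residual > tau
```

Because the threshold grows with range, distant points tolerate larger residuals than nearby ones.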
Ground Plane Fitting: In real-world scenarios, most dynamic objects, such as moving vehicles and pedestrians, are above the ground. By classifying laser points as ground or non-ground, we can recover the ground points from the dynamic point set and add them to the static point set. For a scan $L$, the points are sorted by height, and the $n_0$ lowest points are selected to calculate the ground threshold. Based on this threshold, the initial set $S_0^{c}$ of candidate ground points is extracted as follows:
$$S_0^{c} = \left\{ p_k \mid z(p_k) < \bar{z} + T_{init},\; p_k \in L,\; k \in [1, n_0] \right\}$$
where $z(p_k)$ represents the height value of point $p_k$, $\bar{z}$ is the average value of the $n_0$ lowest points, and $T_{init}$ is the initial ground candidate point height threshold. Based on the initial set of candidate points, an iterative plane optimization is performed to re-extract the optimized ground points. After $i$ iterations, the covariance matrix $C_i$ of the ground points is obtained, and its eigenvalues and eigenvectors are computed through Principal Component Analysis (PCA):
$$C_i = \sum_{k=1}^{|S_i^{c}|} \left(p_k - \bar{p}_i\right)\left(p_k - \bar{p}_i\right)^{T}$$
$$C_i v_j = \lambda_j v_j, \qquad j \in \{0, 1, 2\}$$
where $|S_i^{c}|$ represents the size of the set $S_i^{c}$, $\bar{p}_i$ is the average position of all points in the candidate point set at the $i$-th iteration, $\lambda_j$ represents the eigenvalues of the covariance matrix, and $v_j$ represents the corresponding eigenvectors.
According to PCA, the eigenvector corresponding to the smallest eigenvalue of the covariance matrix can represent the normal vector of the plane [19]. Let the normal vector be $n_i = [a_i, b_i, c_i]^{T}$; the equation of the ground plane can be expressed as:
$$a_i x + b_i y + c_i z + d_i = 0$$
Based on the plane equation, the plane coefficient is $d_i = -n_i^{T} \bar{p}_i$. For each point $p_L$ in the point cloud frame, we define $d_L^{i} = -n_i^{T} p_L$. Then, the orthogonal point-to-plane distance from $p_L$ to the fitted plane is calculated as:
$$D_L^{i} = \frac{\left| d_i - d_L^{i} \right|}{\left\| n_i \right\|_2}, \qquad p_L \in L$$
When $D_L^{i}$ is less than the distance threshold $T_{dist}$, the point is considered a ground point. Therefore, the ground point set in the $(i+1)$-th iteration is updated as:
$$S_{i+1}^{c} = \left\{ p_L \in L \mid D_L^{i} < T_{dist} \right\}$$
where $D_L^{i}$ denotes the orthogonal point-to-plane distance from point $p_L$ to the fitted ground plane, $n_i$ is the normal vector of the fitted plane, $d_i$ is the plane coefficient, $d_L^{i}$ is the projection coefficient of point $p_L$, and $T_{dist}$ is the distance threshold for ground point extraction. This process iteratively optimizes the fitted ground plane and extracts the ground point set. After $n$ iterations, the ground point set $S_n^{c}$ and the non-ground point set $\bar{S}_n^{c} = L \setminus S_n^{c}$ are obtained. The portion of the dynamic points that belong to the ground is recovered and integrated back into the static point set. In Equations (8)–(14), $S_0^{c}$ denotes the initial candidate ground point set, $n_0$ is the number of lowest points used to estimate the initial ground height, $C_i$ denotes the covariance matrix at the $i$-th iteration, $n_i$ and $d_i$ denote the normal vector and offset of the fitted ground plane, and $T_{dist}$ is the distance threshold for ground point extraction.
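The iterative fitting and the recovery of ground points from the dynamic set can be sketched as below. The function names and all default values (`n_lowest`, `t_init`, `t_dist`, `n_iter`) are hypothetical placeholders rather than the parameters used in the experiments, and the sketch assumes a single dominant ground plane with the z axis pointing roughly upward.

```python
import numpy as np


def fit_ground_plane(points, n_lowest=20, t_init=0.2, t_dist=0.15, n_iter=3):
    """Return a boolean mask over `points` (N, 3) marking ground points."""
    points = np.asarray(points, dtype=float)

    # Initial candidates: among the n_lowest points, keep those whose
    # height is below the mean height of these points plus t_init.
    lowest = np.argsort(points[:, 2])[:n_lowest]
    z_bar = points[lowest, 2].mean()
    ground = np.zeros(len(points), dtype=bool)
    ground[lowest] = points[lowest, 2] < z_bar + t_init

    for _ in range(n_iter):
        candidates = points[ground]
        if len(candidates) < 3:
            break  # not enough points to define a plane

        # Covariance of the candidate set about its mean.
        mean = candidates.mean(axis=0)
        centered = candidates - mean
        cov = centered.T @ centered

        # eigh returns eigenvalues in ascending order, so column 0 is the
        # eigenvector of the smallest eigenvalue, i.e. the plane normal.
        _, eigvecs = np.linalg.eigh(cov)
        normal = eigvecs[:, 0]
        d = -normal @ mean

        # Orthogonal point-to-plane distance for every point in the scan.
        dist = np.abs(points @ normal + d) / np.linalg.norm(normal)
        ground = dist < t_dist

    return ground


def recover_ground_points(dynamic_mask, ground_mask):
    """Move dynamic candidates lying on the ground back to the static set."""
    return np.asarray(dynamic_mask, dtype=bool) & ~np.asarray(ground_mask, dtype=bool)
```

In this sketch, `recover_ground_points` returns the refined dynamic mask; its complement is the static point set passed on to scan matching.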
It should be noted that the core goal of this ground fitting module is to reduce the false removal of static ground points, which is the main source of precision loss in existing dynamic removal methods. For most dynamic points near the ground, the plane distance threshold can help exclude them from the ground point set during the iterative fitting process, thereby reducing the risk of misclassification. The slight decrease in recall after adding the ground fitting module is an acceptable trade-off for a significant improvement in precision, which ensures that the LIS can obtain sufficient static constraints for scan matching, avoiding system failure caused by excessive removal of static points.
However, low-height moving objects close to the ground may still be partially confused with ground points during the ground plane fitting stage. Examples include scooters, bicycles, wheelchairs, shopping carts, baby strollers, small mobile robots, animals, and moving road debris. Although range-image-based dynamic point detection can identify many of these objects through temporal inconsistency between the current scan and the submap, some low-height dynamic points may be recovered if they are very close to the fitted ground plane. This remains a limitation of the current geometric ground recovery strategy.
It should also be noted that the ground plane fitting module is mainly used to recover misclassified static ground points rather than to explicitly model all possible dynamic objects. For aerial dynamic objects or elevated moving structures, such as drones or objects moving above the ground surface, the ground fitting process usually does not recover them as ground points because they are far from the fitted ground plane. However, their removal still depends on the range-image-based dynamic point detection stage. If such objects are small, distant, sparsely scanned, or only temporarily observed, they may be difficult to detect reliably. Therefore, the current method is more suitable for ground-vehicle scenarios dominated by road users and ground structures, while aerial dynamic objects and elevated moving structures remain challenging cases for future work.

2.3. LiDAR-Inertial Odometry

Scan-Matching: Scan-matching is used to match the current scan $F_k$ with the surrounding submap $M_k$. In point cloud data, noise and outliers can disrupt point-to-point matching results. To address this, we convert the problem into finding correspondence between points in one point cloud and planes in another. By optimizing the distances between points and planes for alignment, the registration algorithm leverages plane geometry to achieve more accurate results.
For a point $p_i^{F}$ in $F_k$, the VIO or IMU odometry is first used as an initial guess to transform $p_i^{F}$ to the submap as $p_i^{M}$. Then, the nearest point $p_m^{M}$ is found on the submap. Two additional points $p_j^{M}$ and $p_k^{M}$ are found from the laser beam containing $p_m^{M}$ and the surrounding beams. The three points on the submap are used to construct a plane $M_{mjk}$ [4].
The local planar patch used for scan matching is illustrated in Figure 5. For a feature point in the current scan, the nearest point is first searched in the surrounding submap. Then, two additional neighboring points are selected from the same laser beam and adjacent laser beams. These three points are used to construct a local plane, which converts scan matching into a point-to-plane distance minimization problem.
Then, the point-to-plane distance d s m is calculated as follows:
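$$n_{mjk} = \left(p_m^{M} - p_j^{M}\right) \times \left(p_m^{M} - p_k^{M}\right), \qquad n_{mjk} = \frac{n_{mjk}}{\left\| n_{mjk} \right\|}$$
$$d_{sm} = \left| \left(p_i^{M} - p_m^{M}\right) \cdot n_{mjk} \right|$$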
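A minimal sketch of this residual for a single correspondence is given below. The function name is hypothetical, and the nearest-neighbor search that selects the three submap points is assumed to have been done already.

```python
import numpy as np


def point_to_plane_distance(p_i, p_m, p_j, p_k):
    """Distance from the transformed scan point p_i to the plane through
    the submap points p_m, p_j, and p_k."""
    p_i, p_m, p_j, p_k = (np.asarray(p, dtype=float) for p in (p_i, p_m, p_j, p_k))

    # Plane normal from the cross product of two in-plane vectors.
    normal = np.cross(p_m - p_j, p_m - p_k)
    norm = np.linalg.norm(normal)
    if norm < 1e-12:
        raise ValueError("the three submap points are (nearly) collinear")

    # Project the offset from p_m onto the unit normal.
    return float(abs(np.dot(p_i - p_m, normal / norm)))
```

In a full scan-matching loop, this distance would be accumulated over all feature correspondences and minimized with respect to the scan pose.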
Loop Closure: This work adopts the global LiDAR descriptor LiDAR-IRIS [17] for place recognition. The encoding process of the LiDAR scan into the LiDAR-IRIS descriptor is shown in Figure 6. Each LiDAR scan is first divided into $N_r \times N_a$ bins. The height information within each bin is then encoded as an 8-bit binary code and converted into a decimal value, which is used as the pixel intensity in the LiDAR-IRIS image representation.
The LiDAR-IRIS image representation is formulated as follows:
$$I = \left(a_{ij}\right) \in \mathbb{R}^{N_r \times N_a}, \qquad a_{ij} = \phi\left(B_{ij}\right)$$
where $\phi(B_{ij})$ represents the decimal encoding of the bin $B_{ij}$. The distance between the query frame and the candidate frame is computed using the Hamming distance. Specifically, for two binary feature images $b_{ij}^{q}$ and $b_{ij}^{c}$, the distance between the two LiDAR scans is given by:
$$d = \sum_{i=1}^{N_r} \sum_{j=1}^{N_a} b_{ij}^{q} \oplus b_{ij}^{c}$$
where $\oplus$ represents the XOR operation. We construct a row key KD-tree to speed up the loop closure candidate search. The binary features are re-encoded as an $N_r \times 1$ dimensional vector $K$:
$$K = \left[\varphi(r_1), \ldots, \varphi(r_{N_r})\right], \qquad \varphi(r_i) = \frac{\left\| r_i \right\|_2}{N_a}$$
where $r_i$ represents the row vector of the feature image, and $\| r_i \|_2$ denotes its $L_2$ norm. For each scan, we construct a KD-tree using the $K$ vector as the key, and then use the KD-tree to retrieve the top $N_l = 10$ candidates from the keyframe database, selecting the frame with the smallest Hamming distance as the loop closure candidate. Candidate matches are validated by verifying feature correspondences between the scan and the surrounding submap using Random Sample Consensus (RANSAC). If the set of inliers is sufficiently large, the loop closure is deemed successful.
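The candidate search can be sketched as follows. All function names are hypothetical, the binary feature images are assumed to be given as arrays of zeros and ones, and a brute-force nearest-key search stands in for the KD-tree. How the binary feature images are produced and how candidates are verified with RANSAC are outside this sketch.

```python
import numpy as np


def hamming_distance(b_query, b_candidate):
    """Number of positions at which two binary feature images differ (XOR sum)."""
    return int(np.count_nonzero(np.logical_xor(b_query, b_candidate)))


def row_key(feature_image):
    """Row key K: the L2 norm of each row divided by the number of columns N_a."""
    image = np.asarray(feature_image, dtype=float)
    n_a = image.shape[1]
    return np.linalg.norm(image, axis=1) / n_a


def best_loop_candidate(query_bits, query_key, db_bits, db_keys, n_candidates=10):
    """Shortlist keyframes by row-key distance, then pick the one with the
    smallest Hamming distance. Returns (database index, Hamming distance)."""
    db_keys = np.asarray(db_keys, dtype=float)

    # Brute-force stand-in for the KD-tree query on the row keys.
    key_dist = np.linalg.norm(db_keys - np.asarray(query_key, dtype=float), axis=1)
    shortlist = np.argsort(key_dist)[:n_candidates]

    distances = [hamming_distance(query_bits, db_bits[i]) for i in shortlist]
    best = int(np.argmin(distances))
    return int(shortlist[best]), distances[best]
```

The two-stage design keeps the expensive Hamming comparison limited to the shortlisted keyframes instead of the whole database.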
It should be noted that the LiDAR-IRIS-based loop closure detection operates mainly within the LIS and relies on LiDAR scan descriptors rather than visual features. Therefore, when the VIS is in a failed or rebooting state, the descriptor extraction and place recognition process can still be performed using LiDAR scans. During this period, the system uses the previous optimized pose and IMU propagation to provide the initial guess for LIO, while unreliable VIO constraints are not added to the factor graph. As a result, VIS failure does not directly interrupt LiDAR-IRIS-based place recognition. However, if the VIS remains unstable for a long time and LiDAR-Inertial Odometry accumulates large drift, the geometric verification of loop closure candidates may become more difficult. In such cases, successful loop closure can help correct accumulated drift once reliable LiDAR-based candidate matching is established.
IMU Bias Estimation: When using IMU measurements for pose prediction, due to the time-varying IMU bias, it is necessary to update the IMU bias of the previous time step in real-time using the information from the LiDAR-Inertial Odometry. As shown in Figure 7, the current LiDAR scan pose is extracted from the factor graph optimization, and the IMU pre-integration quantities are calculated between the previous and current scans. Then, the IMU pre-integration factor, IMU bias factor, and pose factor are added to the factor graph for optimization, updating the current scan’s pose, velocity, and IMU bias. The latest bias is used to calculate the IMU pre-integration quantities for the next scan, predicting the next pose and publishing the IMU odometry.

3. Experiments

We compared the proposed framework with other state-of-the-art methods on the KITTI [20] and UrbanNav [21] datasets and evaluated the precision and recall of the dynamic point detection and removal method on Semantic-KITTI [22]. The KITTI dataset consists of outdoor urban environment sequences captured using a Velodyne HDL-64 LiDAR (California, USA), while the UrbanNav dataset was captured in the urban areas of Hong Kong using a Velodyne HDL-32 LiDAR (California, USA). Semantic-KITTI provides point-wise annotations, and each LiDAR point has its unique semantic label. Detailed information about the datasets is provided in Table 2.
The datasets are categorized into low, medium, and high dynamic levels based on the number of consecutive dynamic frames: fewer than 150 for low, 150–300 for medium, and more than 300 for high. We used the root mean square error of the translational Absolute Trajectory Error (ATE-RMSE) to evaluate localization accuracy:
$$ATE_{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left\| t_i^{est} - t_i^{gt} \right\|_2^{2}}$$
where $n$ is the total number of poses, $t_i^{est}$ denotes the estimated position vector at time $i$, $t_i^{gt}$ denotes the corresponding ground-truth position vector, and $\| \cdot \|_2$ represents the Euclidean norm. In this work, only the translational component of the pose error is evaluated. In all experiments, we use the same parameters shown in Table 1.
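For reference, the metric can be computed with a few lines of NumPy. The function name is hypothetical, and the sketch assumes the two trajectories are already time-associated and expressed in the same coordinate frame; any alignment step is not shown.

```python
import numpy as np


def ate_rmse(t_est, t_gt):
    """Translational ATE-RMSE between estimated and ground-truth positions.

    t_est, t_gt: (n, 3) arrays of positions at matching timestamps.
    """
    t_est = np.asarray(t_est, dtype=float)
    t_gt = np.asarray(t_gt, dtype=float)
    if t_est.shape != t_gt.shape:
        raise ValueError("trajectories must have the same shape")

    # Squared Euclidean position error per pose, then root of the mean.
    squared_errors = np.sum((t_est - t_gt) ** 2, axis=1)
    return float(np.sqrt(squared_errors.mean()))
```

As a small worked case, two poses with position errors of 0 m and 5 m give $\sqrt{(0 + 25)/2} \approx 3.54$ m.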
Table 3 presents the precision (P), recall (R), and F1-score for all methods. ERASOR [12] is an offline method that requires a pre-built static map as input, whereas Ours and Ours (w/o GPF) are online methods that perform dynamic point removal using the current scan and the surrounding submap. Therefore, this comparison is mainly used to evaluate dynamic point removal behavior rather than to claim a strictly identical system-level setting. Ours (w/o GPF) refers to our dynamic point removal method without ground plane fitting to eliminate misclassified points. The results indicate that our method achieves a good balance between precision and recall, resulting in a higher F1-score. Although ERASOR detected most of the dynamic points, it mistakenly removed a large portion of the static points, leading to the lowest precision. Similarly, Ours (w/o GPF) misclassified many static points on the ground, resulting in lower precision. Figure 8 shows the static map generation results of different methods on sequence 05. Figure 8a shows the ground truth, while Figure 8b–d present the results of ERASOR, Ours (w/o GPF), and Ours, respectively. The red point clouds represent the dynamic points detected by each method.
We compared the proposed method with LIO-SAM [5] and LVI-SAM [6]. To demonstrate the effectiveness of dynamic point removal and LiDAR-IRIS, DOE-LVI-F and DOE-LVI-S represent our method without the dynamic point detection module and without LiDAR-IRIS, respectively. The results of all methods on the KITTI dataset are shown in Table 4.
It should be noted that ERASOR and our method are not compared under identical system assumptions. ERASOR uses an offline setting with a pre-built map, while our method performs online dynamic point removal. Therefore, the results in Table 3 should be interpreted as a comparison of dynamic point removal characteristics under different operating settings rather than as a direct comparison of complete SLAM systems under the same input conditions.
The results in Table 4 show that DOE-LVI achieves better accuracy than LIO-SAM and LVI-SAM on most KITTI sequences, especially in sequences with more dynamic objects or longer trajectories. However, on sequence 06, DOE-LVI performs slightly worse than LIO-SAM. Sequence 06 contains relatively few dynamic points and has a simple trajectory structure; therefore, the benefit of dynamic point removal and loop closure correction is limited. In such predominantly static scenarios, dynamic point removal may also remove a small number of useful static points or slightly perturb the scan-matching constraints, which can lead to a minor degradation in localization accuracy. Therefore, the result seen on sequence 06 indicates that the proposed dynamic point removal module is more beneficial in dynamic environments, while its advantage may be limited in nearly static scenes.
The ablation results also show that DOE-LVI-S performs worse than DOE-LVI-F on some sequences. Since DOE-LVI-F denotes the system without the dynamic point detection module and DOE-LVI-S denotes the system without the LiDAR-IRIS-based loop closure module, this result indicates that loop closure contributes more significantly than dynamic point removal in sequences where accumulated odometric drift is the dominant error source. Conversely, in highly dynamic sequences, dynamic point removal plays a more important role in improving scan-matching robustness. These results suggest that the relative contribution of each module depends on the dynamic level, trajectory length, and loop closure structure of each sequence.
The detailed trajectory generated by DOE-LVI on sequence 00 is shown in Figure 9a, where the green line represents the trajectory generated by the Visual-Inertial subsystem and the red line indicates the trajectory generated by the LIO subsystem after VIO tracking failure. Figure 9b presents the trajectory comparison on sequence 00. LVI-SAM is affected by VIO, resulting in larger errors in turns and on bumpy roads, whereas our method demonstrates better stability. Figure 10 presents the mapping result of DOE-LVI on KITTI sequence 07, including the map aligned with Google Earth and partial screenshots of dynamic objects. In Figure 10, “A-O” represents the original point cloud of location A, while “A-S” represents the static point cloud generated by DOE-LVI. The alignment with Google Earth indicates that the generated map is globally consistent.
To further evaluate our method in challenging, highly dynamic scenarios, we selected the UrbanNav dataset as the test dataset. The results of all methods on the UrbanNav dataset are shown in Table 5. LVI-SAM uses visual odometry to provide the initial guess for LiDAR scan matching, which results in lower trajectory accuracy than LIO-SAM on sequences 0428 and 0314. Figure 11 illustrates the trajectory comparison on UrbanNav 0428. Figure 12 shows the mapping result of DOE-LVI on UrbanNav 0428, with high consistency between the generated map and Google Earth. In Figure 12, “A-O” represents the original point cloud of location A, while “A-S” represents the static point cloud generated by DOE-LVI.
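The gap between LVI-SAM and DOE-LVI in Table 5 can also be expressed as a relative ATE-RMSE reduction. The sketch below is only an illustration of that arithmetic: the helper name `relative_reduction` is hypothetical, and the input values are copied from the LVI-SAM and DOE-LVI columns of Table 5.

```python
def relative_reduction(baseline_rmse: float, proposed_rmse: float) -> float:
    """Percentage by which the proposed RMSE is lower than the baseline RMSE."""
    if baseline_rmse <= 0.0:
        raise ValueError("baseline RMSE must be positive")
    return 100.0 * (baseline_rmse - proposed_rmse) / baseline_rmse


# (LVI-SAM RMSE, DOE-LVI RMSE) in meters, copied from Table 5.
table5 = {
    "0428": (7.60, 3.57),
    "0314": (2.51, 1.42),
    "0517": (3.65, 2.83),
}

for seq, (lvi_sam, doe_lvi) in table5.items():
    print(f"{seq}: {relative_reduction(lvi_sam, doe_lvi):.1f}% lower ATE-RMSE")
```

Reporting a relative figure alongside the absolute RMSE makes sequences with very different trajectory lengths easier to compare.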
The runtime experiments were conducted on a desktop computer equipped with an Intel Core i7 CPU, 16 GB RAM, and an NVIDIA GeForce RTX 3080 GPU, running Ubuntu 20.04. We compared the running time of DOE-LVI with that of the other methods. Table 6 lists the average per-frame processing time of each module on every sequence. The point cloud configuration parameters of LIO-SAM and LVI-SAM are set to their default values. The time consumption of our VIO subsystem is affected by image resolution, whereas that of the LIO subsystem is more strongly influenced by the density of the feature map. To demonstrate the point cloud rendering effect, we applied a lower downsampling rate. The time consumption of LIO is primarily concentrated in loop closure detection and backend optimization. In environments with rich features and a large number of dynamic objects, the time consumption of our dynamic point removal module increases.

4. Conclusions

We have proposed DOE-LVI, a tightly coupled LiDAR-Visual-Inertial Odometry framework with dynamic object elimination for real-time and robust state estimation and mapping in highly dynamic environments. The proposed framework integrates the complementary advantages of visual, inertial, and LiDAR measurements, and improves the reliability of pose estimation through the cooperation between the Visual-Inertial System (VIS) and the LiDAR-Inertial System (LIS). DOE-LVI estimates the reliability of IMU odometry and VIO through active failure detection and determines whether IMU measurements or VIO results should be used as the initial guess for LIO. This strategy helps reduce the influence of unreliable visual constraints when visual tracking fails under rapid motion, illumination changes, or textureless scenes.
In the LIS, dynamic points are first removed before scan matching, which reduces the negative impact of moving objects on point cloud registration. To avoid excessive removal of useful static structures, ground plane fitting is further introduced to recover misclassified static ground points, thereby preserving sufficient constraints for accurate pose optimization. In addition, DOE-LVI employs a LiDAR-IRIS-based global descriptor for loop closure detection, which improves the system’s ability to correct accumulated drift and enhances the consistency of long-term localization and mapping. Through evaluations on datasets of various scales and environments, the results show that DOE-LVI achieves more stable localization performance in highly dynamic scenes. Compared with LVI-SAM, DOE-LVI reduces the ATE by 22% to 71% in highly dynamic environments, demonstrating the effectiveness of dynamic object elimination, active failure detection, and global descriptor-based loop closure in improving SLAM robustness.
DOE-LVI also has limitations. The benefit of dynamic point removal may be limited in predominantly static environments, where unnecessary removal of a few static points may slightly affect scan-matching constraints. In addition, low-height moving objects close to the ground may be partially confused with ground points during ground plane fitting, especially when their geometric height is close to the fitted ground plane. Moreover, aerial dynamic objects or elevated moving structures are not explicitly modeled in the current framework, and their removal mainly depends on the range-image-based dynamic point detection stage. In future work, we will improve the VIS to mitigate the influence of dynamic objects on vision and investigate semantic cues and temporal consistency constraints to better distinguish low-height dynamic objects, aerial dynamic objects, and elevated moving structures from static points.

Author Contributions

Conceptualization, T.L. and S.Y.; methodology, X.L. and J.W.; software and validation, J.W.; writing—original draft preparation, J.W.; writing—review and editing, X.L. and S.Y.; supervision, T.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Shaanxi Province Natural Science Basic Research Program under Grant 2025SYSSYSZD-105.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are available on request from the authors.

Acknowledgments

The authors thank Junjie Wang for completing the software, validation, and draft writing work of the manuscript during his graduate studies at Xidian University.

Conflicts of Interest

Author Junjie Wang was employed by the company Baidu Inc. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Fan, Z.; Zhang, L.; Wang, X.; Shen, Y.; Deng, F. LiDAR, IMU, and Camera Fusion for Simultaneous Localization and Mapping: A Systematic Review. Artif. Intell. Rev. 2025, 58, 174.
  2. Yuan, Z.; Deng, J.; Ming, R.; Lang, F.; Yang, X. SR-LIVO: Tightly-Coupled LiDAR-Inertial-Visual Odometry and Mapping with Sweep Reconstruction. IEEE Robot. Autom. Lett. 2024, 9, 5110–5117.
  3. Hoang, Q.H.; Kim, G.W. IMU Augment Tightly Coupled Lidar-Visual-Inertial Odometry for Agricultural Environments. IEEE Robot. Autom. Lett. 2024, 9, 8483–8490.
  4. Zhang, J.; Singh, S. Low-drift and Real-time LiDAR Odometry and Mapping. Auton. Robot. 2017, 41, 401–416.
  5. Shan, T.; Englot, B.; Meyers, D.; Wang, W.; Ratti, C.; Rus, D. LIO-SAM: Tightly-coupled LiDAR Inertial Odometry via Smoothing and Mapping. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, Las Vegas, NV, USA, 25–29 October 2020; pp. 5135–5142.
  6. Shan, T.; Englot, B.; Ratti, C.; Rus, D. LVI-SAM: Tightly-coupled LiDAR-Visual-Inertial Odometry via Smoothing and Mapping. In Proceedings of the IEEE International Conference on Robotics and Automation, Xi’an, China, 30 May–5 June 2021; pp. 5692–5698.
  7. Lin, J.; Zhang, F. R3LIVE: A Robust, Real-Time, RGB-Colored, LiDAR-Inertial-Visual Tightly-Coupled State Estimation and Mapping Package. In Proceedings of the IEEE International Conference on Robotics and Automation, Philadelphia, PA, USA, 23–27 May 2022; pp. 10672–10678.
  8. Wang, S.; Sun, Z.; Xue, H.; Liu, B.; Fu, H.; Luo, Y. PGP-DOR: A Point-Grid-Point Scheme for Efficient Dynamic Object Removal. IEEE Robot. Autom. Lett. 2025, 10, 12780–12787.
  9. Song, S.; Lim, H.; Lee, A.J.; Myung, H. DynaVINS++: Robust Visual-Inertial State Estimator in Dynamic Environments by Adaptive Truncated Least Squares and Stable State Recovery. IEEE Robot. Autom. Lett. 2024, 9, 9127–9134.
  10. Chen, X.; Li, S.; Mersch, B.; Wiesmann, L.; Gall, J.; Behley, J.; Stachniss, C. Moving Object Segmentation in 3D LiDAR Data: A Learning-Based Approach Exploiting Sequential Data. IEEE Robot. Autom. Lett. 2021, 6, 6529–6536.
  11. Kim, G.; Kim, A. Remove, then Revert: Static Point Cloud Map Construction Using Multiresolution Range Images. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, Las Vegas, NV, USA, 25–29 October 2020; pp. 10758–10765.
  12. Lim, H.; Hwang, S.; Myung, H. ERASOR: Egocentric Ratio of Pseudo Occupancy-based Dynamic Object Removal for Static 3D Point Cloud Map Building. IEEE Robot. Autom. Lett. 2021, 6, 2272–2279.
  13. Qian, C.; Xiang, Z.; Wu, Z.; Sun, H. RF-LIO: Removal-First Tightly-coupled LiDAR Inertial Odometry in High Dynamic Environments. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, Prague, Czech Republic, 27 September–1 October 2021; pp. 4421–4428.
  14. Yuan, Z.; Wang, X.; Wu, J.; Cheng, J.; Yang, X. A Fast Dynamic Point Detection Method for LiDAR-Inertial Odometry in Driving Scenarios. arXiv 2024, arXiv:2407.03590.
  15. Kim, G.; Kim, A. Scan Context: Egocentric Spatial Descriptor for Place Recognition Within 3D Point Cloud Map. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, Madrid, Spain, 1–5 October 2018; pp. 4802–4809.
  16. Kim, G.; Choi, S.; Kim, A. Scan Context++: Structural Place Recognition Robust to Rotation and Lateral Variations in Urban Environments. IEEE Trans. Robot. 2022, 38, 1856–1874.
  17. Wang, Y.; Sun, Z.; Xu, C.Z.; Sarma, S.E.; Yang, J.; Kong, H. LiDAR Iris for Loop-Closure Detection. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, Las Vegas, NV, USA, 25–29 October 2020; pp. 5769–5775.
  18. Qin, T.; Li, P.; Shen, S. VINS-Mono: A Robust and Versatile Monocular Visual-Inertial State Estimator. IEEE Trans. Robot. 2018, 34, 1004–1020.
  19. Zermas, D.; Izzat, I.; Papanikolopoulos, N. Fast Segmentation of 3D Point Clouds: A Paradigm on LiDAR Data for Autonomous Vehicle Applications. In Proceedings of the IEEE International Conference on Robotics and Automation, Singapore, 29 May–3 June 2017; pp. 5067–5073.
  20. Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision Meets Robotics: The KITTI Dataset. Int. J. Robot. Res. 2013, 32, 1231–1237.
  21. Wen, W.; Bai, X.; Hsu, L.T.; Pfeifer, T. GNSS/LiDAR Integration Aided by Self-Adaptive Gaussian Mixture Models in Urban Scenarios: An Approach Robust to Non-Gaussian Noise. In Proceedings of the IEEE/ION Position, Location and Navigation Symposium, Portland, OR, USA, 20–23 April 2020; pp. 647–654.
  22. Behley, J.; Garbade, M.; Milioto, A.; Quenzel, J.; Behnke, S.; Stachniss, C.; Gall, J. SemanticKITTI: A Dataset for Semantic Scene Understanding of LiDAR Sequences. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9297–9307.
Figure 1. Overall framework of DOE-LVI.
Figure 2. The framework of our Visual-Inertial System.
Figure 3. Visibility-based dynamic point removal. (a) a scene containing dynamic objects. (b) the range image generated from the current scan. (c) the range image generated from the surrounding submap. (d) the residual image obtained by comparing the two range images.
Figure 4. Range image construction. (a) LiDAR point transformation. (b) the full 3D LiDAR scan is unfolded into a 2D range image.
Figure 5. Local planar patch for scan matching.
Figure 6. Encoding the LiDAR scan into the LiDAR-IRIS.
Figure 7. Estimation of IMU bias.
Figure 8. Comparison of static map generation results on sequence 05. (a) the ground truth. (b) the results of ERASOR. (c) the results of Ours without GPF. (d) the results of Ours.
Figure 9. Trajectory comparison on KITTI sequence 00. (a) The detailed trajectory generated by DOE-LVI on sequence 00. (b) The trajectory comparison on sequence 00.
Figure 10. Mapping results on KITTI sequence 07.
Figure 11. Trajectory comparison on 0428.
Figure 12. DOE-LVI map aligned with Google Earth and some dynamic objects on 0428.
Table 1. Parameters used in DOE-LVI.
Parameters b a b g v i m u Δ t s n 0 T i n i t T d i s t k
Values2.81.031.55.0201.00.30.03
Table 2. Dataset details.

| Dataset | Sequence | Scans | Trajectory Length (m) | Dynamic Level |
|---|---|---|---|---|
| KITTI | 00 | 4541 | 3724.187 | Low |
| KITTI | 02 | 4661 | 5067.233 | Low |
| KITTI | 05 | 2761 | 2205.576 | High |
| KITTI | 06 | 1101 | 1232.876 | Low |
| KITTI | 07 | 1101 | 694.697 | Medium |
| KITTI | 08 | 4071 | 3222.795 | Medium |
| KITTI | 09 | 1591 | 1705.051 | High |
| KITTI | 10 | 1201 | 919.518 | Low |
| Urban | 20190428 | 4871 | 1984.464 | High |
| Urban | 20200314 | 3005 | 1210.889 | Medium |
| Urban | 20210517 | 7848 | 3641.810 | High |
Table 3. Comparison of dynamic point removal on the Semantic-KITTI dataset.

| Sequence | Method | P (%) | R (%) | F1 |
|---|---|---|---|---|
| 00 | ERASOR (offline) | 4.250 | 93.200 | 0.081 |
| 00 | Ours (online, w/o GPF) | 7.510 | 53.845 | 0.132 |
| 00 | Ours (online) | 23.411 | 52.568 | 0.324 |
| 02 | ERASOR (offline) | 3.381 | 97.175 | 0.065 |
| 02 | Ours (online, w/o GPF) | 7.412 | 59.514 | 0.132 |
| 02 | Ours (online) | 17.623 | 53.675 | 0.265 |
| 05 | ERASOR (offline) | 10.690 | 91.275 | 0.191 |
| 05 | Ours (online, w/o GPF) | 18.687 | 60.249 | 0.285 |
| 05 | Ours (online) | 43.441 | 57.723 | 0.496 |
| 07 | ERASOR (offline) | 13.910 | 95.597 | 0.243 |
| 07 | Ours (online, w/o GPF) | 33.795 | 48.319 | 0.398 |
| 07 | Ours (online) | 50.274 | 46.944 | 0.486 |
Table 4. RMSE (m) comparison on the KITTI dataset.

| Seq. | LIO-SAM | LVI-SAM | DOE-LVI-F | DOE-LVI-S | DOE-LVI |
|---|---|---|---|---|---|
| 00 | FAIL | 11.30 | 3.46 | 4.10 | 3.25 |
| 02 | 4.55 | FAIL | 3.71 | 3.46 | 3.13 |
| 05 | 2.24 | 1.93 | 1.21 | 1.16 | 1.01 |
| 06 | 14.19 | 15.10 | 14.67 | 14.72 | 14.61 |
| 07 | 0.67 | 0.72 | 0.46 | 0.45 | 0.42 |
| 08 | 5.78 | 5.49 | 4.59 | 4.63 | 4.35 |
| 09 | 8.61 | 11.43 | 2.8 | 3.82 | 1.76 |
| 10 | 2.54 | 2.82 | 2.37 | 1.86 | 1.84 |
Table 5. RMSE (m) comparison on the UrbanNav dataset.

| Seq. | LIO-SAM | LVI-SAM | DOE-LVI-F | DOE-LVI-S | DOE-LVI |
|---|---|---|---|---|---|
| 0428 | 5.73 | 7.60 | 5.23 | 4.19 | 3.57 |
| 0314 | 1.74 | 2.51 | 1.83 | 1.95 | 1.42 |
| 0517 | 4.47 | 3.65 | 3.19 | 3.27 | 2.83 |
Table 6. Average processing time per frame for all methods.

| Sequence | Image Size | LIO-SAM LIO | LVI-SAM VIO | LVI-SAM LIO | DOE-LVI VIO | DOE-LVI LIO | DOE-LVI Removal |
|---|---|---|---|---|---|---|---|
| 00 | 1247 × 376 | -- | 63.54 | 65.59 | 43.35 | 71.78 | 24.15 |
| 02 | 1241 × 376 | 66.89 | -- | -- | 45.47 | 68.45 | 22.89 |
| 05 | 1226 × 370 | 60.59 | 62.86 | 61.34 | 40.89 | 65.23 | 23.65 |
| 06 | 1226 × 370 | 89.34 | 54.63 | 71.68 | 37.88 | 64.01 | 20.51 |
| 07 | 1226 × 370 | 59.71 | 56.16 | 56.09 | 40.39 | 58.36 | 22.18 |
| 08 | 1226 × 370 | 60.25 | 68.25 | 64.24 | 43.56 | 62.45 | 23.76 |
| 09 | 1226 × 370 | 57.39 | 63.14 | 54.97 | 37.72 | 60.68 | 19.57 |
| 10 | 1226 × 370 | 52.69 | 59.18 | 56.78 | 36.21 | 55.46 | 18.67 |
| 0428 | 1920 × 1200 | 57.34 | 64.08 | 55.98 | 53.94 | 60.07 | 20.14 |
| 0314 | 1920 × 1200 | 51.21 | 75.06 | 54.21 | 57.67 | 61.41 | 19.80 |
| 0517 | 672 × 376 | 49.23 | 45.94 | 55.02 | 36.82 | 47.19 | 22.52 |
