Article

An Efficient and Accurate UAV State Estimation Method with Multi-LiDAR–IMU–Camera Fusion

1 School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan 430074, China
2 School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan 430074, China
3 National Key Laboratory of Science and Technology on Electromagnetic Energy, Naval University of Engineering, Wuhan 430033, China
4 Institute of Computer Application, China Academy of Engineering Physics, Mianyang 621900, China
5 Qingjiang Research Center, Wuhan 430200, China
* Author to whom correspondence should be addressed.
Drones 2025, 9(12), 823; https://doi.org/10.3390/drones9120823
Submission received: 10 September 2025 / Revised: 25 November 2025 / Accepted: 26 November 2025 / Published: 27 November 2025
(This article belongs to the Special Issue Advances in Guidance, Navigation, and Control)

Highlights

What are the main findings?
  • The proposed DLIC method reformulates the complex, coupled UAV state estimation problem in multi-LiDAR–IMU–camera systems as an efficient distributed subsystem optimization framework. The designed feedback mechanism effectively constrains and optimizes the UAV state using the estimated subsystem states.
  • Extensive experiments demonstrate that DLIC achieves superior accuracy and efficiency on a resource-constrained embedded UAV platform equipped with only an 8-core CPU. It operates in real time while maintaining low memory usage.
What are the implications of the main finding?
  • This work demonstrates that the challenging, coupled UAV state estimation problem in multi-LiDAR–IMU–camera systems can be effectively addressed through distributed optimization techniques, paving the way for scalable and efficient estimation frameworks.
  • The proposed DLIC method offers a promising solution for real-time state estimation in resource-limited UAVs with multi-sensor configurations.

Abstract

State estimation plays a vital role in UAV navigation and control. With the continuous decrease in sensor cost and size, UAVs equipped with multiple LiDARs, Inertial Measurement Units (IMUs), and cameras have attracted increasing attention. Such systems can acquire rich environmental and motion information from multiple perspectives, thereby enabling more precise navigation and mapping in complex environments. However, efficiently utilizing multi-sensor data for state estimation remains challenging, as there is a complex coupling relationship between the IMU biases and the UAV state. To address these challenges, this paper proposes an efficient and accurate UAV state estimation method tailored for multi-LiDAR–IMU–camera systems. Specifically, we first construct an efficient distributed state estimation model. It decomposes the multi-LiDAR–IMU–camera system into a series of single LiDAR–IMU–camera subsystems, reformulating the complex coupling problem as an efficient distributed state estimation problem. Then, we derive an accurate feedback function to constrain and optimize the UAV state using estimated subsystem states, thus enhancing overall estimation accuracy. Based on this model, we design an efficient distributed state estimation algorithm with multi-LiDAR–IMU–camera fusion, termed DLIC. DLIC achieves robust multi-sensor data fusion via shared feature maps, effectively improving both estimation robustness and accuracy. In addition, we design an accelerated image-to-point cloud registration module (A-I2P) to provide reliable visual measurements, further boosting state estimation efficiency. Extensive experiments are conducted on 18 real-world indoor and outdoor scenarios from the public NTU VIRAL dataset. The results demonstrate that DLIC consistently outperforms existing multi-sensor methods across key evaluation metrics, including RMSE, MAE, SD, and SSE. More importantly, our method runs in real time on a resource-constrained embedded device equipped with only an 8-core CPU, while maintaining low memory consumption.

1. Introduction

State estimation aims to determine a UAV’s position, pose, velocity, and other motion states in real time [1,2]. It serves as a critical prerequisite for perception and localization in UAV systems [2,3,4]. With advances in manufacturing technology, the cost and size of sensors have decreased sharply. Consequently, state estimation using multi-sensor systems has attracted increasing attention [5,6], as such systems can greatly enlarge the field of view (FoV) and reduce the measurement noise inherent to single-sensor configurations [7]. This facilitates more accurate navigation and mapping in complex environments. Typical sensor combinations include light detection and ranging (LiDAR), camera, and inertial measurement unit (IMU) [8]. LiDAR and cameras collect distance and texture information of the surrounding environment, respectively, while IMUs provide measurements of the UAV’s motion dynamics. Thus, the multi-sensor-based state estimation method has wide application in UAV navigation and perception.
In general, multi-sensor configurations can be categorized into three cases: single-sensor-multi-type, multi-sensor-single-type, and multi-sensor-multi-type. The first case employs multiple sensors, with one sensor per type (e.g., single LiDAR + single IMU + single camera) [6,9,10]. The second case uses multiple sensors of the same type (e.g., multi-LiDAR) [11,12,13]. The third case combines multiple sensor types, where each type includes at least one sensor (e.g., single IMU + multi-LiDAR [14,15,16,17], or single IMU + single camera + multi-LiDAR [5]).
For general multi-sensor-based state estimation, state errors mainly arise from measurement errors of individual sensors, such as 3D point position errors from LiDARs, 2D pixel intensity errors from cameras, and acceleration and angular velocity errors from IMUs. Ensuring accurate and reliable state estimation remains a critical challenge in most UAV-based applications. It is well known that multiple measurements of the same physical quantity can significantly reduce measurement errors. Based on this fact, incorporating multiple IMUs, LiDARs, and cameras can effectively suppress measurement errors compared with using a single sensor [13]. Moreover, a multi-sensor-multi-type system is a feasible and practical configuration. On the one hand, the prices and weights of LiDARs and cameras have both decreased in recent years, enabling commercial UAVs to be equipped with multiple LiDARs and cameras [18]. On the other hand, IMU sensors are commonly embedded into LiDAR units to correct motion distortions in LiDAR scans [18], making multi-IMU configurations prevalent in real-world applications. Therefore, it is essential to study state estimation methods tailored for general multi-sensor-multi-type systems.
Although state estimation has achieved notable progress in single-sensor-multi-type and multi-sensor-single-type systems, few studies have explored state estimation in general multi-sensor-multi-type systems. Existing methods [5,14,15,16,17] primarily consider multi-sensor systems with a single IMU. Consequently, state estimation on general multi-sensor-multi-type systems remains a significant challenge. The major bottleneck lies in the optimization of multiple IMUs. According to the IMU pre-integration theory [19], IMU biases have a dominant influence on pre-integration accuracy. In multi-IMU scenarios, the biases of multiple IMUs are mutually coupled, and the IMU measurements received from multiple IMUs are often unordered. These two issues lead to an extremely complex Jacobian matrix for multi-IMU biases, which seriously impedes the efficiency of mainstream state estimation frameworks such as sliding window-based methods [20,21] and error state iterated Kalman filter (ESIKF)-based methods [22]. Furthermore, in practical applications, most UAVs have limited computational resources. Hence, performing efficient and accurate state estimation for general multi-sensor-multi-type embedded systems remains an urgent yet underexplored problem.
To address the aforementioned challenge, we observe that a multi-sensor-multi-type system can be decomposed into a series of single-sensor-multi-type subsystems. In this way, the original complex problem is reformulated as a distributed, low-complexity state estimation problem. Distributed state estimation is well-suited for embedded systems, as multi-core CPUs in such systems can process each low-complexity sub-problem in parallel. Lin et al. [13] developed a distributed state estimation method based on the extended Kalman filter (EKF), but their method is specifically designed for multi-LiDAR setups and does not fully exploit measurement constraints from multiple IMUs and multiple cameras. To this end, we propose an efficient and accurate distributed state estimation method with multi-LiDAR–IMU–camera fusion. First, we construct an efficient distributed state estimation model. It decomposes the multi-LiDAR–IMU–camera system into a series of single LiDAR–IMU–camera subsystems, reformulating the complex coupling problem into multiple simple sub-state estimation problems that can be solved efficiently in parallel. Next, we analyze the inter-relations among subsystem states and derive a feedback function based on the IMU pre-integration theory [19], which constrains and optimizes the UAV state using the estimated subsystem states. Based on this model, we propose an efficient and accurate distributed state estimation method for a multi-LiDAR–IMU–camera system, termed DLIC. DLIC achieves robust multi-sensor data fusion through shared feature maps, effectively improving both estimation accuracy and robustness. Furthermore, to accelerate state estimation, we design an efficient LiDAR–IMU-assisted image-to-point cloud registration (A-I2P) module. This module establishes high-quality 2D–3D correspondences based on sparse LiDAR depth, thereby constructing reliable visual measurements efficiently. Extensive experiments are conducted on 18 real-world indoor and outdoor scenes from the public NTU VIRAL dataset [18]. The results demonstrate that DLIC consistently outperforms existing multi-sensor methods across key evaluation metrics, including root mean square error (RMSE), mean absolute error (MAE), standard deviation (SD), and the sum of squares due to error (SSE). Moreover, resource consumption tests on an embedded device equipped with only an 8-core CPU show that DLIC achieves real-time performance while maintaining low memory usage. In all, the core contributions are:
  • We propose an efficient and accurate distributed state estimation method, DLIC, which fuses data from multiple LiDARs, IMUs, and cameras to achieve accurate UAV state estimation.
  • DLIC decomposes the complex, coupled multi-LiDAR–IMU–camera system into a series of single LiDAR–IMU–camera subsystems. A feedback function is then derived to effectively constrain and optimize the global UAV state based on the estimated subsystem states.
  • To further accelerate state estimation, we develop an efficient I2P module that establishes high-quality 2D–3D correspondences and constructs reliable visual measurements efficiently.
The remainder of this paper is organized as follows. Related work on multi-sensor-based state estimation is reviewed in Section 2. The bottleneck of current methods is analyzed in Section 3. The proposed DLIC method is then described in detail in Section 4. Extensive experiments are presented in Section 5. Finally, this work is concluded in Section 6.

2. Related Works

As discussed in Section 1, most current multi-sensor-based SLAM methods can be classified into three cases: single-sensor-multi-type, multi-sensor-single-type, and multi-sensor-multi-type.

2.1. Single-Sensor-Multi-Type Case

LiDAR–IMU is the most common and convenient system in the single-sensor-multi-type case [23], for three reasons. First, mainstream LiDAR manufacturers embed a low-cost IMU into commercial LiDAR sensors, because the IMU helps eliminate motion distortion in the LiDAR scan. Second, the LiDAR–IMU system does not require strict extrinsic parameters; in most cases, only the rotation from the LiDAR to the IMU coordinate system needs to be calibrated [22]. Third, the IMU decreases the drift (especially along the Z-axis) of LiDAR mapping [24], as it provides a pose prior for LiDAR scan-to-map registration. To fuse information from the IMU and the point cloud, loosely coupled fusion is the naive solution. In LeGO-LOAM [24], Shan and Englot integrated the gyroscope data to calculate a coarse rotation for point cloud registration. Recently, tightly coupled fusion has become the mainstream trend, as it estimates the IMU bias during state optimization and thus achieves more precise mapping results. Qin et al. [25] proposed a complete error-state Kalman filter framework to estimate the IMU bias and the robot state. Building on their work [25], Xu et al. [22] found the complexity of the Kalman filter to be quadratic in the number of feature measurements. They designed a new formula for the Kalman gain whose complexity depends only on the state dimension. Thus, Fast-LIO [22] can use more LiDAR features to estimate accurate odometry in real time. Another scheme is sliding window-based optimization; the error-state Kalman filter is a special case in which the window size is 1. Ye et al. [26] designed a fixed-lag smoother with prior marginalization, in which a rotational constraint is utilized to refine the global map. Shan et al. [20] presented a classical framework, LIO-SAM, in which the fixed-lag smoothing problem is solved by factor graph optimization. In the factor graph, loop-closure or GPS factors can be added for accurate state estimation. In all, LiDAR–IMU-based SLAM has matured in recent years; its state estimation schemes (i.e., filter-based and sliding window-based) have a profound impact on multi-sensor fusion-based SLAM [27].
The LiDAR–IMU–camera system has also attracted attention [6]. Its core problem is how to fuse the visual measurement into the tightly coupled optimization framework. In the filter-based branch, Zheng et al. [6] added the visual measurement into Fast-LIO [22] and presented Fast-LIVO. Following the work of Forster et al. [28], SVO, they built a visual global map, maintained by updating LiDAR points that fall into the camera field of view (FoV). The visual constraint is built by minimizing the photometric error of feature pixels between the current image and the global map. Based on previous work [22], Lin et al. [10,29] designed a filter-based framework, R3live. In their VIO subsystem, they minimize both the frame-to-frame and frame-to-map photometric errors. In their follow-up work, R3live++ [29], Lin et al. carefully considered the covariance of measurement errors and calibrated the time offset between the LiDAR and the camera online. In the sliding window-based optimization branch, Lin et al. [9] used both the ESIKF and sliding window-based optimization. In their VIO subsystem, a fixed-lag smoother [30] is leveraged to optimize the pose, and the pixel reprojection error is computed using the optimized pose. The ESIKF then exploits this reprojection error as the visual measurement to estimate the state. Based on LIO-SAM [20], Shan et al. [21] proposed a refined version, LVI-SAM. It contains a VIO subsystem based on VINS [30], in which the depth of 2D keypoints is aligned by projecting the reconstructed LiDAR map, thus enhancing the accuracy of the state estimated from the VIO. The odometry factor provided by the VIO system is then added to the factor graph. Recently, Lv et al. [5] designed a continuous-time fixed-lag smoother. In their factor graph optimization, they used B-splines [31] to represent the trajectory, so that control points need to be estimated. Further, the visual factor built by structure from motion (SFM) is independent of the LiDAR factor. In all, following the optimization schemes of the LiDAR–IMU case, SLAM with IMU, LiDAR, and camera emphasizes the scheme of establishing visual measurements.

2.2. Multi-Sensor-Single-Type and Multi-Sensor-Multi-Type Case

To deal with SLAM in the multi-sensor-single-type or multi-sensor-multi-type cases, researchers extend the state estimation methods of the single-sensor-multi-type case. Multi-LiDAR is the most common multi-sensor-single-type system. Jiao et al. [11] presented a framework, M-LOAM, for multi-LiDAR systems. First, they utilized a hand–eye calibration-based method to estimate the rigid transformation between the auxiliary LiDAR and the primary LiDAR. After initialization, they used a sliding window-based optimization scheme to estimate odometry and the calibration parameters between the auxiliary and primary LiDARs. Lin et al. [13] designed a SLAM method for decentralized LiDARs, in which the state of each LiDAR is estimated by one extended Kalman filter (EKF). The authors noted that the multiple EKFs run in parallel and share the robot state after state optimization. These decentralized EKFs decrease the computation cost per onboard computer, and they inspired the design of DLIC. The difference between the decentralized EKFs and DLIC is discussed in Section 2.3.
Recently, there have been some works for multi-sensor-multi-type-based SLAM. Wang et al. [12] leveraged sliding window-based optimization to add LiDAR, odometry, IMU, and GPS factors for SLAM on rail vehicles. Wang et al. [16] used the factor graph to process multi-LiDAR and single IMU data. The factor of LiDAR is derived by considering the occupancy probability, and the gravity factor is also added to the factor graph. Jung et al. [17] extended Fast-LIO [22] for the multi-LiDAR single-IMU case. They utilized B-spline interpolation to undistort the motion blur for multi-LiDAR. Nguyen et al. [15] designed a general sliding window-based SLAM framework, MILIOM. It contains one IMU factor and multiple LiDAR factors. The LiDAR factor is constructed by registering the current LiDAR point cloud to every keyframe. In 2023, Nguyen et al. [14] extended their previous work MILIOM [15] and presented SLICT. The major difference of SLICT is the usage of an octree-based surfel map [32]. The advantage of the surfel map is that it represents the planarity of planar features, so that it filters out low-quality 3D–3D correspondences. Moreover, the work of Lv et al. [5] can be used for the case of a single IMU, a single camera, and multi-LiDAR. With the help of B-Spline interpolation, point clouds from the other LiDARs are undistorted to build LiDAR factors directly. In all, the mainstream methods tend to utilize a sliding window-based optimization scheme, for it can be easily extended from multi-sensor-single-type to multi-sensor-multi-type.

2.3. Discussions

From the above discussion, current multi-sensor-based state estimation methods have made considerable progress. However, several open problems remain. The first is state estimation in a general multi-sensor-multi-type system with limited computational resources. Current optimization frameworks [5,14] cannot handle multi-IMU inputs, due to the extremely high complexity of multi-IMU bias optimization, as discussed in Section 1. The second problem is to explore an efficient and accurate way to construct the visual measurement. Current methods tend to embed a VIO-based subsystem into their state estimation frameworks [5,6,9,10,21,29]. However, this suffers from limited robustness and excessive CPU consumption in practical applications. In this paper, we propose an efficient and accurate distributed state estimation method, DLIC, to deal with the first problem, and then present a scheme, A-I2P, to address the second problem. Details are illustrated in Section 4.

3. Problem Statement

We briefly introduce the optimization framework of state estimation, and then analyze the challenge of state estimation in the general multi-sensor-multi-type case. The notation of symbols is presented in Table 1. An overview of the problem analysis is shown in Figure 1.

3.1. Basic State Estimation Model

For clarity, we first illustrate state estimation in a general single-sensor-multi-type system (i.e., IMU + camera + LiDAR). Two coordinate systems are used, namely the UAV coordinate system (denoted as $I$) and the world coordinate system (denoted as $G$). More specifically, $I$ is the coordinate system of the main IMU on the UAV, and $G$ is the coordinate system of the LiDAR at the initial time $k = 0$. State estimation aims to estimate the UAV state $\mathbf{x}$ by fusing the measurements of all sensors. State $\mathbf{x}$ is defined as:
$$\mathbf{x} \triangleq \left[ {}^{G}\mathbf{p}_{I}^{T},\ {}^{G}\mathbf{v}_{I}^{T},\ {}^{G}\boldsymbol{\xi}_{I}^{T},\ \mathbf{b}_{W}^{T},\ \mathbf{b}_{A}^{T},\ {}^{G}\mathbf{g}^{T} \right]^{T} \in \mathbb{R}^{18} \tag{1}$$
where ${}^{G}\mathbf{p}_{I} \in \mathbb{R}^{3}$, ${}^{G}\mathbf{v}_{I} \in \mathbb{R}^{3}$, and ${}^{G}\boldsymbol{\xi}_{I} \in \mathfrak{so}(3)$ represent the 3D position, velocity, and rotation vector of the UAV in the world coordinate system. $\mathbf{b}_{W} \in \mathbb{R}^{3}$ and $\mathbf{b}_{A} \in \mathbb{R}^{3}$ are the gyroscope and accelerometer biases of the IMU. ${}^{G}\mathbf{g} \in \mathbb{R}^{3}$ is the gravity vector in the world coordinate system. At the $k$-th time, the state of the UAV is denoted as $\mathbf{x}_{k}$. Mainstream methods estimate the UAV state with a Kalman filter [6,10,16,22,29]. As the core of the Kalman filter is the Maximum A Posteriori (MAP) estimate, the computation of $\mathbf{x}_{k}$ contains two steps. The first step is to calculate the initial state, denoted as $\hat{\mathbf{x}}_{k}$, which is computed from $\mathbf{x}_{k-1}$ via IMU pre-integration from the $(k-1)$-th to the $k$-th time. The second step is to estimate the error of the initial state, denoted as $\tilde{\mathbf{x}}_{k}$, which is optimized by minimizing the cost function $F(\tilde{\mathbf{x}}_{k})$. In all, the computation of $\mathbf{x}_{k}$ is represented by the following equations:
$$\mathbf{x}_{k} = \hat{\mathbf{x}}_{k} + \tilde{\mathbf{x}}_{k}, \tag{2}$$
$$\min_{\tilde{\mathbf{x}}_{k}} F(\tilde{\mathbf{x}}_{k}) = \left\| \mathbf{J}_{k} \tilde{\mathbf{x}}_{k} \right\|^{2}_{\hat{\mathbf{P}}_{k}^{-1}} + \sum_{i} \left\| \mathbf{y}_{i}^{k} + \mathbf{H}_{i}^{k} \tilde{\mathbf{x}}_{k} \right\|^{2}_{\mathbf{R}_{i}^{-1}} + \sum_{j} \left\| \mathbf{z}_{j}^{k} + \mathbf{L}_{j}^{k} \tilde{\mathbf{x}}_{k} \right\|^{2}_{\mathbf{Q}_{j}^{-1}} \tag{3}$$
where $\mathbf{J}_{k}$ denotes the Jacobian matrix of the pre-integration at $\hat{\mathbf{x}}_{k}$, and $\hat{\mathbf{P}}_{k}$ is the pre-integration covariance matrix at $\hat{\mathbf{x}}_{k}$. $\mathbf{y}_{i}^{k}$ and $\mathbf{z}_{j}^{k}$ are the measurements of point-to-point correspondences in the LiDAR point cloud and the image, respectively. $\mathbf{H}_{i}^{k}$ and $\mathbf{L}_{j}^{k}$ are the Jacobian matrices of the LiDAR and visual measurements at $\hat{\mathbf{x}}_{k}$. $\mathbf{R}_{i}$ and $\mathbf{Q}_{j}$ are the covariance matrices of the corresponding measurements. The analytic formulas of $\mathbf{J}_{k}$, $\hat{\mathbf{P}}_{k}$, $\mathbf{y}_{i}^{k}$, $\mathbf{z}_{j}^{k}$, $\mathbf{R}_{i}$, and $\mathbf{Q}_{j}$ can be found in previous works [6,22,29]. In Equation (3), $\tilde{\mathbf{x}}_{k}$ can be estimated via $\partial F(\tilde{\mathbf{x}}_{k}) / \partial \tilde{\mathbf{x}}_{k} = 0$, which means that $\tilde{\mathbf{x}}_{k}$ is obtained by solving a linear equation:
$$\begin{bmatrix} \mathbf{A}_{\mathrm{imu}}^{k} \\ \mathbf{A}_{\mathrm{lidar}}^{k} \\ \mathbf{A}_{\mathrm{cam}}^{k} \end{bmatrix} \tilde{\mathbf{x}}_{k} = \begin{bmatrix} \mathbf{b}_{\mathrm{imu}}^{k} \\ \mathbf{b}_{\mathrm{lidar}}^{k} \\ \mathbf{b}_{\mathrm{cam}}^{k} \end{bmatrix} \;\Longleftrightarrow\; \mathbf{A}^{k} \tilde{\mathbf{x}}_{k} = \mathbf{b}^{k} \tag{4}$$
where $(\mathbf{A}_{\mathrm{imu}}^{k}, \mathbf{b}_{\mathrm{imu}}^{k})$, $(\mathbf{A}_{\mathrm{lidar}}^{k}, \mathbf{b}_{\mathrm{lidar}}^{k})$, and $(\mathbf{A}_{\mathrm{cam}}^{k}, \mathbf{b}_{\mathrm{cam}}^{k})$ are computed from $\mathbf{J}_{k}$, $(\mathbf{H}_{i}^{k}, \mathbf{y}_{i}^{k})$, and $(\mathbf{L}_{j}^{k}, \mathbf{z}_{j}^{k})$ via $\partial F(\tilde{\mathbf{x}}_{k}) / \partial \tilde{\mathbf{x}}_{k} = 0$. Their analytic formulas can be found in previous works [6,22,29]. In practical applications, $\tilde{\mathbf{x}}_{k}$ is solved iteratively, and the state is updated via Equation (2).
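To make the update step concrete, the following C++/Eigen sketch stacks illustrative IMU, LiDAR, and camera blocks and performs one least-squares solve of the linear system in Equation (4). The block sizes, random residuals, and variable names are placeholder assumptions, not the actual implementation of the cited frameworks.

```cpp
// Minimal sketch of one iteration of Equation (4), assuming an 18-dim error state.
#include <Eigen/Dense>
#include <iostream>

int main() {
  const int kStateDim = 18;                                          // error-state dimension
  Eigen::MatrixXd A_imu   = Eigen::MatrixXd::Identity(kStateDim, kStateDim);
  Eigen::MatrixXd A_lidar = Eigen::MatrixXd::Random(30, kStateDim);  // 10 placeholder point residuals
  Eigen::MatrixXd A_cam   = Eigen::MatrixXd::Random(20, kStateDim);  // 10 placeholder pixel residuals
  Eigen::VectorXd b_imu   = Eigen::VectorXd::Zero(kStateDim);
  Eigen::VectorXd b_lidar = Eigen::VectorXd::Random(30);
  Eigen::VectorXd b_cam   = Eigen::VectorXd::Random(20);

  // Stack the three measurement blocks: A^k * x_err = b^k.
  Eigen::MatrixXd A(A_imu.rows() + A_lidar.rows() + A_cam.rows(), kStateDim);
  Eigen::VectorXd b(A.rows());
  A << A_imu, A_lidar, A_cam;
  b << b_imu, b_lidar, b_cam;

  // Least-squares solution of the stacked system via the normal equations.
  Eigen::VectorXd x_err = (A.transpose() * A).ldlt().solve(A.transpose() * b);
  std::cout << "error-state norm: " << x_err.norm() << std::endl;
  return 0;
}
```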

3.2. Challenge in Multi-Sensor-Multi-Type Case

State estimation in Equation (3) faces a challenge when it comes to the multi-sensor-multi-type case. Suppose that this system has multiple IMUs, cameras, and LiDARs, where the number of IMUs is L. State x of this system is rewritten as:
$$\mathbf{x} \triangleq \left[ {}^{G}\mathbf{p}_{I}^{T},\ {}^{G}\mathbf{v}_{I}^{T},\ {}^{G}\boldsymbol{\xi}_{I}^{T},\ \mathbf{b}_{W,1}^{T},\ \mathbf{b}_{A,1}^{T},\ \ldots,\ \mathbf{b}_{W,L}^{T},\ \mathbf{b}_{A,L}^{T},\ {}^{G}\mathbf{g}^{T} \right]^{T} \tag{5}$$
where $\mathbf{b}_{W,i}$ and $\mathbf{b}_{A,i}$ denote the gyroscope and accelerometer biases of the $i$-th IMU, so that $\mathbf{x}$ is a $(12 + 6L) \times 1$ vector. Specifically, Equation (3) suffers from two problems in this case. First, as IMU data flow into the system from unordered sources, the Jacobian matrix $\mathbf{J}_{k}$ of the multi-IMU pre-integration has an unknown and complex analytic structure [19]. Second, as $\mathbf{J}_{k}$ is a matrix with a complicated structure, there is a complex coupled relation between the UAV state and the multi-IMU biases, which makes Equation (3) difficult to optimize in the multi-sensor-multi-type case. In addition, practical embedded systems have limited computational resources, which makes it difficult to optimize the state in Equation (5) with many coupled variables. Thus, it is essential to explore an efficient state estimation method for general multi-sensor-multi-type embedded systems.

4. Proposed Method DLIC

In this paper, we observe that a multi-sensor-multi-type system can be decomposed into a series of single-sensor-multi-type subsystems. For example, a system with $L$ LiDARs, $L$ cameras, and $L$ IMUs can be decomposed into $L$ subsystems, each with a single LiDAR, a single camera, and a single IMU ($L \geq 2$). Then, state estimation in the multi-sensor-multi-type system can be relaxed into a distributed state estimation problem over multiple single-sensor-multi-type subsystems. Based on this observation, we propose an efficient distributed state estimation method, DLIC.

4.1. Efficient Distributed State Estimation Model

We develop an efficient distributed state estimation model for a general multi-sensor-multi-type system. Unlike the method of Lin et al. [13], the proposed DLIC method is suitable for a general multi-sensor system and is not limited to multi-LiDAR configurations.
To establish this model, we first define the distributed sensor system. A multi-sensor-multi-type system consists of $L$ subsystems, each with a single LiDAR, IMU, and camera, and each subsystem maintains a state $\mathbf{x}^{l}$ ($l = 1, \ldots, L$), where $\mathbf{x}^{l}$ is defined as in Equation (1). The rotation, position, and velocity in $\mathbf{x}^{l}$ describe the UAV in the world coordinate system. In practical applications, we use the position in $\mathbf{x}^{1}$ to represent the UAV trajectory. Extending Equations (2) and (3), the efficient distributed state estimation model is established as:
$$\mathbf{x}_{k}^{l} = \hat{\mathbf{x}}_{k}^{l} + \tilde{\mathbf{x}}_{k}^{l}, \quad l = 1, \ldots, L, \tag{6}$$
$$\min_{\tilde{\mathbf{x}}_{k}^{l}} \; F(\tilde{\mathbf{x}}_{k}^{l}) + \sum_{z=1, z \neq l}^{L} G\big(\tilde{\mathbf{x}}_{k}^{l} \,\big|\, \tilde{\mathbf{x}}_{k}^{z}, \Delta t_{zl}\big) \cdot \mathbb{1}\big(\Delta t_{zl} \in (-\tau, \tau)\big). \tag{7}$$
We now explain Equations (6) and (7) in detail. Extending Equation (2), $\mathbf{x}_{k}^{l} = \hat{\mathbf{x}}_{k}^{l} + \tilde{\mathbf{x}}_{k}^{l}$ denotes the state computation of the $l$-th subsystem. $\hat{\mathbf{x}}_{k}^{l}$ is computed by IMU pre-integration on the $l$-th subsystem, with details similar to Equation (2), and $\tilde{\mathbf{x}}_{k}^{l}$ is the error of the initial state $\hat{\mathbf{x}}_{k}^{l}$. A naive way of computing $\tilde{\mathbf{x}}_{k}^{l}$ is to minimize the cost function $F(\tilde{\mathbf{x}}_{k}^{l})$ alone. However, we find that $\mathbf{x}_{k}^{l}$ is strongly constrained by $\mathbf{x}_{k}^{z}$, where $z = 1, \ldots, L$ and $z \neq l$: in the ideal case, the rotation, position, and velocity in $\mathbf{x}_{k}^{l}$ are equal to those in $\mathbf{x}_{k}^{z}$. Based on this observation, we construct a self-defined feedback function $G$ in Equation (7) to constrain $\mathbf{x}_{k}^{l}$ with the states from the other subsystems. In all, Equation (7) provides an efficient way to optimize the residual error by fusing the measurements from the $l$-th subsystem (i.e., $F(\tilde{\mathbf{x}}_{k}^{l})$, defined in Equation (3)) and the measurements from the other subsystems (i.e., $G(\tilde{\mathbf{x}}_{k}^{l} \,|\, \tilde{\mathbf{x}}_{k}^{z}, \Delta t_{zl})$).
From the system viewpoint, $G$ in Equation (7) serves as feedback that corrects $\mathbf{x}_{k}^{l}$ with $\tilde{\mathbf{x}}_{k}^{z}$ ($z \in [1, L]$, $z \neq l$). In this context, we call $G(\tilde{\mathbf{x}}_{k}^{l} \,|\, \tilde{\mathbf{x}}_{k}^{z}, \Delta t_{zl})$ a self-defined feedback function. $\Delta t_{zl}$ is the time offset between the time of optimizing $\tilde{\mathbf{x}}_{k}^{z}$ and the time of optimizing $\tilde{\mathbf{x}}_{k}^{l}$. To ensure model stability, only measurements received within a short time interval $(-\tau, \tau)$ are used for the estimation of $\tilde{\mathbf{x}}_{k}^{l}$; this is achieved by the indicator function $\mathbb{1}(\Delta t_{zl} \in (-\tau, \tau))$, which filters the measurements from the other subsystems.
We then illustrate the theoretical solution of Equation (7). Similar to Equation (4), we first linearize the function $F(\tilde{\mathbf{x}}_{k}^{l})$ as $\mathbf{A}^{k} \tilde{\mathbf{x}}_{k}^{l} = \mathbf{b}^{k}$ via a Taylor expansion at $\tilde{\mathbf{x}}_{k}^{l} = 0$. In the same way, we linearize the function $G(\tilde{\mathbf{x}}_{k}^{l} \,|\, \tilde{\mathbf{x}}_{k}^{z}, \Delta t_{zl})$ at $\tilde{\mathbf{x}}_{k}^{l} = 0$ and obtain $\mathbf{G}_{zl} \tilde{\mathbf{x}}_{k}^{l} = \mathbf{g}_{zl}$. Hence, Equation (7) can be linearized as the following equation in $\tilde{\mathbf{x}}_{k}^{l}$:
$$\begin{bmatrix} \mathbf{A}^{k} \\ \alpha_{1l} \mathbf{G}_{1l} \\ \vdots \\ \alpha_{Ll} \mathbf{G}_{Ll} \end{bmatrix} \tilde{\mathbf{x}}_{k}^{l} = \begin{bmatrix} \mathbf{b}^{k} \\ \alpha_{1l} \mathbf{g}_{1l} \\ \vdots \\ \alpha_{Ll} \mathbf{g}_{Ll} \end{bmatrix} \;\Longleftrightarrow\; \tilde{\mathbf{A}}^{k} \tilde{\mathbf{x}}_{k}^{l} = \tilde{\mathbf{b}}^{k} \tag{8}$$
where $\alpha_{zl} = \mathbb{1}(\Delta t_{zl} \in (-\tau, \tau))$. From Equation (8), $\tilde{\mathbf{x}}_{k}^{l} = \big((\tilde{\mathbf{A}}^{k})^{T} \tilde{\mathbf{A}}^{k}\big)^{-1} (\tilde{\mathbf{A}}^{k})^{T} \tilde{\mathbf{b}}^{k}$. As $(\tilde{\mathbf{A}}^{k})^{T} \tilde{\mathbf{A}}^{k}$ is an $18 \times 18$ matrix, the complexity of Equation (7) is $O(L)$. With the multi-threading capability of modern CPUs, all states $\mathbf{x}^{l}$ ($l = 1, \ldots, L$) in Equation (7) can be solved in a distributed manner. Thus, the complexity of the theoretical distributed state estimation model is $O(L)$, which is friendly to efficient multi-sensor-multi-type embedded systems.
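The following C++/Eigen sketch illustrates our reading of this distributed solve: each subsystem stacks its own linearized block with the feedback blocks gated by the indicator $\alpha_{zl}$, and the $L$ solves run in parallel threads. The struct layout, dimensions, and timestamps are illustrative assumptions rather than the released DLIC code.

```cpp
// Minimal sketch of the per-subsystem solve in Equation (8) with parallel threads.
#include <Eigen/Dense>
#include <thread>
#include <vector>
#include <cmath>
#include <iostream>

struct SubsystemBlock {
  Eigen::MatrixXd A;   // linearized block (own measurements or feedback G_zl)
  Eigen::VectorXd b;
  double stamp;        // time at which this subsystem state was optimized
};

Eigen::VectorXd SolveSubsystem(int l, const std::vector<SubsystemBlock>& subs, double tau) {
  const int dim = subs[l].A.cols();
  std::vector<const SubsystemBlock*> used = {&subs[l]};
  for (size_t z = 0; z < subs.size(); ++z) {
    // Indicator alpha_zl: keep only feedback received within (-tau, tau).
    if (z != static_cast<size_t>(l) && std::abs(subs[z].stamp - subs[l].stamp) < tau)
      used.push_back(&subs[z]);
  }
  int rows = 0;
  for (auto* u : used) rows += u->A.rows();
  Eigen::MatrixXd A_tilde(rows, dim);
  Eigen::VectorXd b_tilde(rows);
  int r = 0;
  for (auto* u : used) {
    A_tilde.middleRows(r, u->A.rows()) = u->A;
    b_tilde.segment(r, u->b.size()) = u->b;
    r += u->A.rows();
  }
  // 18x18 normal equations: constant cost per subsystem, O(L) overall.
  return (A_tilde.transpose() * A_tilde).ldlt().solve(A_tilde.transpose() * b_tilde);
}

int main() {
  const int L = 2, dim = 18;
  std::vector<SubsystemBlock> subs(L);
  for (int l = 0; l < L; ++l)
    subs[l] = {Eigen::MatrixXd::Random(dim, dim), Eigen::VectorXd::Random(dim), 0.01 * l};

  std::vector<Eigen::VectorXd> x_err(L);
  std::vector<std::thread> workers;
  for (int l = 0; l < L; ++l)
    workers.emplace_back([&, l] { x_err[l] = SolveSubsystem(l, subs, 0.05); });  // tau = 0.05 s
  for (auto& w : workers) w.join();
  std::cout << "subsystem 0 error norm: " << x_err[0].norm() << std::endl;
  return 0;
}
```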

4.2. Feedback Function in Distributed State Estimation

$G(\tilde{\mathbf{x}}_{k}^{l} \,|\, \tilde{\mathbf{x}}_{k}^{z}, \Delta t_{zl})$ plays an important role in Equation (7). To ensure the stability of optimizing Equation (7), it should be an accurate and robust feedback function. Before designing $G(\tilde{\mathbf{x}}_{k}^{l} \,|\, \tilde{\mathbf{x}}_{k}^{z}, \Delta t_{zl})$, we need to study the relationship between $\tilde{\mathbf{x}}_{k}^{l}$ and $\tilde{\mathbf{x}}_{k}^{z}$. As the IMUs in different subsystems are independent of each other, the IMU biases in $\tilde{\mathbf{x}}_{k}^{l}$ and $\tilde{\mathbf{x}}_{k}^{z}$ are independent. In contrast, the other parameters, including the rotation, position, and velocity in $\tilde{\mathbf{x}}_{k}^{l}$ and $\tilde{\mathbf{x}}_{k}^{z}$, are correlated, because they all describe the UAV in the world coordinate system. This means that we can leverage the IMU pre-integration theory [19] to propagate the state error $\tilde{\mathbf{x}}_{k}^{z}$ to constrain $\tilde{\mathbf{x}}_{k}^{l}$. Based on this observation, we design $G(\tilde{\mathbf{x}}_{k}^{l} \,|\, \tilde{\mathbf{x}}_{k}^{z}, \Delta t_{zl})$ as:
$$G(\tilde{\mathbf{x}}_{k}^{l} \,|\, \tilde{\mathbf{x}}_{k}^{z}, \Delta t_{zl}) = \left\| \mathbf{E} \tilde{\mathbf{x}}_{k}^{l} - \mathbf{A}_{k}^{z} \tilde{\mathbf{x}}_{k}^{z} \right\|^{2}_{\mathbf{A}_{k}^{z} (\mathbf{P}_{k}^{z})^{-1} (\mathbf{A}_{k}^{z})^{T}}, \tag{9}$$
$$\mathbf{E} = \begin{bmatrix} \mathbf{I}_{9 \times 9} & & \\ & \mathbf{0}_{6 \times 6} & \\ & & \mathbf{I}_{3 \times 3} \end{bmatrix}, \tag{10}$$
$$\mathbf{A}_{k}^{z} = \begin{bmatrix} \mathbf{I} & \mathbf{I} \Delta t_{zl} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathbf{I} & -\mathbf{R}_{a} \mathbf{V}_{a} \Delta t_{zl} & \mathbf{0} & -\mathbf{R}_{a} \Delta t_{zl} & \mathbf{I} \Delta t_{zl} \\ \mathbf{0} & \mathbf{0} & \mathbf{R}_{w} & -\mathbf{I} \Delta t_{zl} & \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{I} \end{bmatrix}, \tag{11}$$
$$\mathbf{V}_{a} = \big(\mathbf{a} - \mathbf{b}_{A}^{z}\big)^{\wedge}, \tag{12}$$
$$\mathbf{R}_{a} = \exp\big((\boldsymbol{\xi}_{k}^{z})^{\wedge}\big), \quad \mathbf{R}_{w} = \exp\big((\mathbf{w} - \mathbf{b}_{W}^{z})^{\wedge} \Delta t_{zl}\big) \tag{13}$$
where $(\cdot)^{\wedge}$ denotes the operator that maps a vector to its skew-symmetric matrix. $\mathbf{I}_{n \times n}$ and $\mathbf{0}_{n \times n}$ denote the $n \times n$ identity and zero matrices, and $\mathbf{I} = \mathbf{I}_{3 \times 3}$, $\mathbf{0} = \mathbf{0}_{3 \times 3}$. $\boldsymbol{\xi}_{k}^{z}$, $\mathbf{b}_{W}^{z}$, and $\mathbf{b}_{A}^{z}$ belong to the state $\mathbf{x}_{k}^{z}$, where $\mathbf{x}_{k}^{z}$ is obtained via Equation (6), and $\mathbf{P}_{k}^{z}$ is the covariance matrix of $\mathbf{x}_{k}^{z}$. As $\tau$ is very small (set as 0.05 s) and the UAV does not move violently in such a short time, $\mathbf{a}$ and $\mathbf{w}$ are taken as the average acceleration and angular velocity of the $z$-th IMU within $\Delta t_{zl}$. Details of $\mathbf{A}_{k}^{z}$ can be found in [19]. For better understanding, we provide an intuitive explanation of $\mathbf{A}_{k}^{z}$ here. The computations of $\tilde{\mathbf{x}}_{k}^{l}$ and $\tilde{\mathbf{x}}_{k}^{z}$ are performed at different times, separated by the time offset $\Delta t_{zl}$; thus, we need to propagate the state error $\tilde{\mathbf{x}}_{k}^{z}$ over $\Delta t_{zl}$. As the position, rotation, and velocity satisfy the kinematic constraints, the first three rows of $\mathbf{A}_{k}^{z}$ are designed as shown in Equation (11); the derivations of Equations (12) and (13) follow the classic literature [19]. As the IMU intrinsic parameters are independent of each other, the fourth and fifth rows of $\mathbf{A}_{k}^{z}$ are zero matrices. As the gravity error is constant over time, the last row of $\mathbf{A}_{k}^{z}$ contains a $3 \times 3$ identity matrix.
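For illustration, a minimal C++/Eigen sketch of assembling the selector $\mathbf{E}$ in Equation (10) and the propagation matrix $\mathbf{A}_{k}^{z}$ in Equation (11) is given below. The IMU readings, biases, and rotation vector are placeholder values, and the block layout follows the state order $[\mathbf{p}, \mathbf{v}, \boldsymbol{\xi}, \mathbf{b}_{W}, \mathbf{b}_{A}, \mathbf{g}]$ assumed above.

```cpp
// Minimal sketch of building E (Eq. (10)) and A_k^z (Eq. (11)).
#include <Eigen/Dense>
#include <unsupported/Eigen/MatrixFunctions>  // matrix exponential .exp()
#include <iostream>

Eigen::Matrix3d Skew(const Eigen::Vector3d& v) {        // the (.)^ operator
  Eigen::Matrix3d m;
  m <<      0, -v.z(),  v.y(),
        v.z(),      0, -v.x(),
       -v.y(),  v.x(),      0;
  return m;
}

int main() {
  const double dt = 0.02;                               // time offset Delta t_zl (placeholder)
  Eigen::Vector3d a(0.1, 0.0, 9.8), w(0.0, 0.01, 0.0);  // averaged IMU readings (placeholders)
  Eigen::Vector3d b_A(0.01, 0.01, 0.01), b_W(0.001, 0.001, 0.001), xi(0.0, 0.0, 0.1);

  Eigen::Matrix3d I = Eigen::Matrix3d::Identity();
  Eigen::Matrix3d R_a = Skew(xi).exp();                 // exp((xi_k^z)^)
  Eigen::Matrix3d R_w = (Skew(w - b_W) * dt).exp();     // exp((w - b_W^z)^ dt)
  Eigen::Matrix3d V_a = Skew(a - b_A);                  // (a - b_A^z)^

  // Selector E keeps position/velocity/rotation and gravity, zeros the IMU biases.
  Eigen::Matrix<double, 18, 18> E = Eigen::Matrix<double, 18, 18>::Zero();
  E.topLeftCorner<9, 9>().setIdentity();
  E.bottomRightCorner<3, 3>().setIdentity();

  // Block rows follow the state order [p, v, xi, b_W, b_A, g].
  Eigen::Matrix<double, 18, 18> A = Eigen::Matrix<double, 18, 18>::Zero();
  A.block<3, 3>(0, 0) = I;   A.block<3, 3>(0, 3)  = I * dt;
  A.block<3, 3>(3, 3) = I;   A.block<3, 3>(3, 6)  = -R_a * V_a * dt;
  A.block<3, 3>(3, 12) = -R_a * dt;   A.block<3, 3>(3, 15) = I * dt;
  A.block<3, 3>(6, 6) = R_w;          A.block<3, 3>(6, 9)  = -I * dt;
  A.block<3, 3>(15, 15) = I;          // gravity error is constant over time

  std::cout << "E trace = " << E.trace() << ", A_k^z norm = " << A.norm() << std::endl;
  return 0;
}
```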

4.3. Distributed State Estimation in a Multi-LiDAR–IMU–Camera System

To implement the optimization model in Equation (7), we propose a distributed state estimation method, DLIC, for a multi-LiDAR–IMU–camera system. The DLIC pipeline is provided in Figure 2. The core is to optimize $\mathbf{x}_{k}^{l}$ with Equation (8). It mainly has three steps: (i) preparation of Equation (8), (ii) solving Equation (8), and (iii) post-processing.
The first step is the preparation of Equation (8). As the IMU has a higher publishing rate (≥100 Hz) than the other sensors, we follow previous work [6] and package the LiDAR–IMU–camera data within a time interval. IMU pre-integration calculates the prior state $\hat{\mathbf{x}}_{k}^{l}$ and $\hat{\mathbf{P}}_{k}$ from $\mathbf{x}_{k-1}^{l}$. Given $\hat{\mathbf{x}}_{k}^{l}$, the LiDAR measurements contain 3D–3D correspondences between the LiDAR scan and the shared LiDAR feature map $\mathcal{M}_{\mathrm{lid}}$; they construct $(\mathbf{y}_{i}^{k}, \mathbf{H}_{i}^{k})$ in Equation (3). The visual measurements denote 2D–3D correspondences from image keypoints to the shared camera feature map $\mathcal{M}_{\mathrm{cam}}$ (details are given in Section 4.4); they construct $(\mathbf{z}_{j}^{k}, \mathbf{L}_{j}^{k})$ in Equation (3). We retrieve the optimized states from the other subsystems under the condition $\Delta t_{zl} \in (-\tau, \tau)$ and construct the feedback constraint via Equation (9).
The second step is to solve Equation (8). With the help of the C++ matrix library Eigen, $\tilde{\mathbf{x}}_{k}^{l}$ is solved quickly, and $\mathbf{x}_{k}^{l}$ is obtained via Equation (6).
The third step is post-processing. $\mathbf{x}_{k}^{l}$ is added to the shared state information with its timestamp. With $\mathbf{x}_{k}^{l}$, we transform the LiDAR scan into the world coordinate system as $S_{k}^{l}$, and back-project the image keypoints with LiDAR depth (detailed in Section 4.4) into the world coordinate system as $V_{k}^{l}$. Then, $\mathcal{M}_{\mathrm{lid}}$ and $\mathcal{M}_{\mathrm{cam}}$ are updated as:
$$\mathcal{M}_{\mathrm{lid}} \leftarrow \mathcal{M}_{\mathrm{lid}} \cup \mathrm{Down\text{-}sample}(S_{k}^{l}), \qquad \mathcal{M}_{\mathrm{cam}} \leftarrow \mathcal{M}_{\mathrm{cam}} \cup V_{k}^{l} \tag{14}$$
where $\mathrm{Down\text{-}sample}(\cdot)$ is a downsampling operation with a voxel resolution of 0.1 m. With the shared feature maps, the 3D–3D and 2D–3D correspondences are stable and precise, ensuring an accurate state estimation result. For better understanding, we explain Equation (14) in detail. $\mathcal{M}_{\mathrm{lid}} \leftarrow \mathcal{M}_{\mathrm{lid}} \cup \mathrm{Down\text{-}sample}(S_{k}^{l})$ means that the LiDAR feature map is enlarged by merging the point cloud $S_{k}^{l}$ measured by the current scan; to avoid the LiDAR feature map becoming overly large, the common trick is to downsample $S_{k}^{l}$. $\mathcal{M}_{\mathrm{cam}} \leftarrow \mathcal{M}_{\mathrm{cam}} \cup V_{k}^{l}$ means that the camera feature map is merged with the new keypoints $V_{k}^{l}$. As the number of visual keypoints (nearly $10^{2}$) is much smaller than the number of LiDAR points (nearly $10^{4}$), it is not necessary to downsample $V_{k}^{l}$.
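A minimal C++ sketch of the map update in Equation (14) is given below: the registered scan is voxel-downsampled before being merged into the shared LiDAR feature map, while new visual keypoints are appended directly. The flat vector containers and the hash-based voxel grid are simplified placeholders for the actual map structures (e.g., the iKd-Tree).

```cpp
// Minimal sketch of Equation (14): voxel downsampling and feature map merge.
#include <Eigen/Dense>
#include <unordered_map>
#include <vector>
#include <cstdint>
#include <cmath>
#include <iostream>

using Cloud = std::vector<Eigen::Vector3d>;

// Keep one point per voxel of the given resolution (0.1 m in the paper).
Cloud DownSample(const Cloud& scan, double voxel) {
  std::unordered_map<uint64_t, Eigen::Vector3d> grid;
  auto idx = [voxel](double v) {
    return static_cast<uint64_t>(static_cast<int64_t>(std::floor(v / voxel)) & 0x1FFFFF);
  };
  for (const auto& p : scan) {
    uint64_t key = (idx(p.x()) << 42) | (idx(p.y()) << 21) | idx(p.z());
    grid.emplace(key, p);                       // first point in each voxel is kept
  }
  Cloud out;
  out.reserve(grid.size());
  for (const auto& kv : grid) out.push_back(kv.second);
  return out;
}

int main() {
  Cloud M_lid, M_cam, scan_world, keypoints_world;   // shared maps and current-frame data
  for (int i = 0; i < 1000; ++i) scan_world.push_back(Eigen::Vector3d::Random() * 5.0);
  for (int i = 0; i < 100; ++i)  keypoints_world.push_back(Eigen::Vector3d::Random() * 5.0);

  // M_lid <- M_lid U DownSample(S_k^l),  M_cam <- M_cam U V_k^l.
  Cloud ds = DownSample(scan_world, 0.1);
  M_lid.insert(M_lid.end(), ds.begin(), ds.end());
  M_cam.insert(M_cam.end(), keypoints_world.begin(), keypoints_world.end());
  std::cout << "LiDAR map: " << M_lid.size() << ", camera map: " << M_cam.size() << std::endl;
  return 0;
}
```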

4.4. Assisted Image-to-Point-Cloud Registration

As discussed in Section 2.3, current approaches exploit a VIO subsystem to establish the visual measurement. However, such a system fails to work if the initialization is not successful or keypoint tracking fails. Moreover, the VIO subsystem incurs a much greater computational burden due to the extra state optimization.
To deal with this issue, we present A-I2P, which requires neither initialization nor extra state optimization. A-I2P only needs the shared feature map $\mathcal{M}_{\mathrm{cam}}$, which has a lightweight maintenance cost. A-I2P aims to extract 2D–3D correspondences between the image $I^{l}$ and $\mathcal{M}_{\mathrm{cam}}$ with the aid of the LiDAR point cloud $P^{l}$ and IMU data. The procedure is shown in Figure 3 and contains four steps: keypoint detection, depth generation, kNN search, and map update. First, the Shi–Tomasi keypoint detector [33] is used to extract keypoints $\{\mathbf{I}_{k}\}_{k=1}^{K}$, shown as red points in Figure 3c. Second, depth generation projects the LiDAR scan, undistorted by IMU data, onto the image plane. This yields a sparse depth map, shown as the green points $\{\mathbf{L}_{s}\}_{s=1}^{S}$ in Figure 3c, where each pixel $\mathbf{L}_{s}$ has an associated depth.
The third step, 2D–3D kNN search, establishes 2D–3D correspondences from $\{\mathbf{I}_{k}\}_{k=1}^{K}$, $\{\mathbf{L}_{s}\}_{s=1}^{S}$, and $\mathcal{M}_{\mathrm{cam}}$, and is the crucial step of A-I2P. For each $\mathbf{I}_{k}$, we search for the $\mathbf{L}_{k} \in \{\mathbf{L}_{s}\}_{s=1}^{S}$ with the smallest distance to $\mathbf{I}_{k}$. We assign the depth of $\mathbf{L}_{k}$ to $\mathbf{I}_{k}$ when two constraints are satisfied: (i) the pixel distance between $\mathbf{I}_{k}$ and $\mathbf{L}_{k}$ is below $\tau_{2D}$, where $\tau_{2D}$ is empirically set to 5 pixels in practical applications; (ii) the gray-scale intensity difference between $\mathbf{I}_{k}$ and $\mathbf{L}_{k}$ is below $\tau_{I}$, where $\tau_{I}$ is empirically set to 24. Keypoints with aligned depths are denoted as $\{\mathbf{I}_{a}\}_{a=1}^{A}$, shown as the yellow points in Figure 3c. Each $\mathbf{I}_{a}$ is back-projected with its depth into 3D space as a 3D point $\mathbf{P}_{a}$. With the initial estimated state $\hat{\mathbf{x}}_{k}^{l}$, we transform $\mathbf{P}_{a}$ from the camera coordinate system to $G$ as ${}^{G}\mathbf{P}_{a}$. Then, we search for the 3D point ${}^{G}\mathbf{Q}_{a} \in \mathcal{M}_{\mathrm{cam}}$ closest to ${}^{G}\mathbf{P}_{a}$ using the iKd-Tree [34]. If the distance between ${}^{G}\mathbf{P}_{a}$ and ${}^{G}\mathbf{Q}_{a}$ is below $\tau_{3D}$ ($\tau_{3D}$ is empirically set to 5 cm for optimal performance), the pair $\langle \mathbf{I}_{a}, {}^{G}\mathbf{Q}_{a} \rangle$ is a valid 2D–3D correspondence and can be used as a visual measurement. As the final estimated state $\mathbf{x}_{k}^{l}$ differs only slightly from the initial estimated state $\hat{\mathbf{x}}_{k}^{l}$, the above correspondence construction scheme based on $\hat{\mathbf{x}}_{k}^{l}$ is relatively accurate. After the state optimization, the fourth step is the feature map update, shown in Equation (14). From the above discussion, A-I2P acts as a plug-and-play module. To ensure the correctness of the 2D–3D correspondences, it leverages the LiDAR depth to convert each 2D–3D correspondence into a 3D–3D correspondence and uses the 3D distance threshold to filter out incorrect matches. A-I2P only maintains a feature map of 3D keypoints and exploits the current LiDAR scan with IMU data, so it has a low computational burden. More analysis of DLIC is provided in Appendix A.
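The sketch below illustrates the three acceptance tests of this kNN step in C++: depth alignment by pixel distance and intensity, back-projection with the aligned depth, and the 3D distance check against the camera feature map. The brute-force searches stand in for the kNN / iKd-Tree queries, and the intrinsics, sample points, and values are hypothetical.

```cpp
// Minimal sketch of the A-I2P 2D-3D correspondence tests.
#include <Eigen/Dense>
#include <vector>
#include <limits>
#include <cmath>
#include <iostream>

struct Pixel { Eigen::Vector2d uv; double intensity; double depth; };  // depth < 0: unknown

int main() {
  const double tau_2d = 5.0, tau_I = 24.0, tau_3d = 0.05;  // thresholds from the paper
  const double fx = 400, fy = 400, cx = 320, cy = 240;      // illustrative camera intrinsics

  std::vector<Pixel> keypoints = {{{330, 250}, 120, -1}};           // Shi-Tomasi corners
  std::vector<Pixel> lidar_pix = {{{332, 251}, 118, 4.2}};          // projected LiDAR points
  std::vector<Eigen::Vector3d> M_cam = {{0.105, 0.105, 4.2}};       // shared camera feature map

  for (auto& kp : keypoints) {
    // (i)+(ii): nearest projected LiDAR pixel within tau_2d and tau_I.
    const Pixel* best = nullptr;
    double best_d = std::numeric_limits<double>::max();
    for (const auto& lp : lidar_pix) {
      double d = (kp.uv - lp.uv).norm();
      if (d < best_d && d < tau_2d && std::abs(kp.intensity - lp.intensity) < tau_I) {
        best_d = d; best = &lp;
      }
    }
    if (!best) continue;
    kp.depth = best->depth;

    // Back-project with the aligned depth (camera frame; world transform omitted here).
    Eigen::Vector3d P((kp.uv.x() - cx) / fx * kp.depth,
                      (kp.uv.y() - cy) / fy * kp.depth, kp.depth);

    // (iii): accept only if a map point lies within tau_3d of the back-projection.
    for (const auto& Q : M_cam) {
      if ((P - Q).norm() < tau_3d) {
        std::cout << "valid 2D-3D correspondence, depth " << kp.depth << " m\n";
        break;
      }
    }
  }
  return 0;
}
```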

5. Experiments and Discussions

5.1. Dataset Configuration

To investigate the performance of multi-sensor-multi-type SLAM methods, we conduct extensive experiments on the public NTU VIRAL [18] dataset, as shown in Figure 4. The NTU VIRAL dataset is selected for three main reasons. First, to the best of our knowledge, it is the first real-world dataset that provides multi-LiDAR–IMU–camera data collected on a single UAV platform equipped with multiple LiDARs, IMUs, and cameras. Second, it includes 18 diverse indoor and outdoor scenes, enabling a comprehensive performance evaluation under varying conditions. Third, this dataset has been widely adopted in previous studies for comparison; however, most existing works utilize only one IMU and conduct evaluations on the first 9 scenes [5,6,11]. In contrast, our experiments demonstrate that by effectively fusing data from multiple sensors, the proposed DLIC achieves superior performance across all 18 scenarios. Notably, we further evaluate DLIC on the remaining 9 scenes for a more complete and rigorous assessment.
Specifically, in the VIRAL dataset, the UAV is equipped with two 16-beam Ouster LiDARs (refresh rate of 10 Hz) and two monocular cameras (refresh rate of 10 Hz). Each LiDAR unit integrates an IMU (refresh rate of 100 Hz). The cameras capture 640 × 480 gray-scale images, and their intrinsic parameters are carefully calibrated in advance. Consequently, the UAV constitutes a typical multi-sensor-multi-type system. To obtain the ground truth (GT) position of the UAV, the dataset employs the ultra-wideband (UWB) technique and a 3D laser tracking system [18]. All sensor data in the NTU VIRAL dataset are recorded using the robot operating system (ROS, an open-source C++ robot platform) and saved as ROS bag files. During the evaluation of state estimation methods, these ROS bags are replayed to reproduce real-time data streams from all sensors. To ensure accurate and synchronized data acquisition, we construct C++ queues that store incoming sensor measurements with their corresponding timestamps.
For the proposed DLIC method, two LiDAR-camera-IMU subsystems are configured ( L = 2 ). The first subsystem consists of the horizontal LiDAR, its associated IMU, and the left camera; the second subsystem includes the vertical LiDAR, its IMU, and the right camera. IMU calibration is performed offline using the parameters provided by the NTU VIRAL dataset, including the extrinsic transformations between the IMU and the LiDAR/camera, as well as intrinsic gyroscope and accelerometer biases. For each subsystem, IMU initialization follows the classical open-source work, FAST-LIO [22]. Data synchronization among LiDAR, IMU, and camera also follows the FAST-LIO, using a first-in-first-out (FIFO) queue strategy. The hyperparameters in the ESIKF are kept consistent with those in Fast-LIVO [6]. To simulate the condition of a commercial UAV onboard system, DLIC is compiled with C++ on an Ubuntu 20.04 Linux system, and tested on a CPU with 8 cores and 8 GB memory.
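The following C++ sketch illustrates the FIFO-style packaging described above: measurements are buffered with their timestamps, and once a LiDAR scan interval closes, all IMU and camera data falling inside it are popped and handed to the corresponding subsystem. The message struct, rates, and interval are illustrative assumptions rather than the actual ROS callback code.

```cpp
// Minimal sketch of timestamp-based FIFO packaging of LiDAR-IMU-camera data.
#include <deque>
#include <vector>
#include <iostream>

struct Stamped { double t; int id; };   // placeholder for an IMU / camera / LiDAR message

struct Packet { double t_begin, t_end; std::vector<Stamped> imu, images; };

Packet Package(double t_begin, double t_end,
               std::deque<Stamped>& imu_q, std::deque<Stamped>& img_q) {
  Packet pkt{t_begin, t_end, {}, {}};
  while (!imu_q.empty() && imu_q.front().t <= t_end) {   // FIFO pop inside the scan interval
    pkt.imu.push_back(imu_q.front());
    imu_q.pop_front();
  }
  while (!img_q.empty() && img_q.front().t <= t_end) {
    pkt.images.push_back(img_q.front());
    img_q.pop_front();
  }
  return pkt;
}

int main() {
  std::deque<Stamped> imu_q, img_q;
  for (int i = 0; i < 20; ++i) imu_q.push_back({i * 0.01, i});   // 100 Hz IMU
  for (int i = 0; i < 2; ++i)  img_q.push_back({i * 0.10, i});   // 10 Hz camera
  Packet pkt = Package(0.0, 0.1, imu_q, img_q);                  // one 10 Hz LiDAR interval
  std::cout << pkt.imu.size() << " IMU and " << pkt.images.size()
            << " image messages packaged\n";
  return 0;
}
```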
To evaluate the accuracy of the estimated UAV position, we mainly use RMSE, MAE, SD, and SSE metrics. RMSE and MAE are evaluated in the following manner:
$$\mathrm{RMSE} = \sqrt{\frac{1}{U} \sum_{i}^{U} \left\| \mathbf{T}_{\mathrm{gt},i} - \mathbf{T}_{i} \right\|_{2}^{2}}, \qquad \mathrm{MAE} = \frac{1}{U} \sum_{i}^{U} \left\| \mathbf{T}_{\mathrm{gt},i} - \mathbf{T}_{i} \right\|_{2} \tag{15}$$
where $\mathbf{T}_{\mathrm{gt},i}$ and $\mathbf{T}_{i}$ are the ground-truth and estimated UAV 3D positions at the $i$-th time, $U$ is the total number of timestamps, and $\left\| \mathbf{T}_{\mathrm{gt},i} - \mathbf{T}_{i} \right\|_{2}$ is the 3D position error at the $i$-th time. To evaluate the overall trajectory error, RMSE and MAE aggregate the position errors in different manners. Following the guidance of [18], we align the temporal difference between the estimated and GT trajectories before computing the metrics. The metrics SD and SSE are computed from the set $\{\left\| \mathbf{T}_{\mathrm{gt},i} - \mathbf{T}_{i} \right\|_{2}\}_{i=1}^{U}$. The units of RMSE and MAE are meters. In the following experiments, we utilize EVO (https://github.com/MichaelGrupp/evo (accessed on 24 November 2025)) to compute the above metrics.
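For reference, a minimal C++ sketch of Equation (15) over already time-aligned trajectories is given below; the sample positions are placeholders, and the temporal alignment and EVO tooling mentioned above are not reproduced here.

```cpp
// Minimal sketch of RMSE, MAE, and SSE over aligned ground-truth / estimated positions.
#include <Eigen/Dense>
#include <vector>
#include <cmath>
#include <iostream>

int main() {
  std::vector<Eigen::Vector3d> T_gt  = {{0, 0, 0}, {1, 0, 0}, {2, 0, 0}};
  std::vector<Eigen::Vector3d> T_est = {{0.02, 0, 0}, {1.05, 0, 0}, {1.96, 0, 0}};

  double sse = 0.0, abs_sum = 0.0;
  for (size_t i = 0; i < T_gt.size(); ++i) {
    double e = (T_gt[i] - T_est[i]).norm();   // 3D position error at time i
    sse += e * e;
    abs_sum += e;
  }
  double rmse = std::sqrt(sse / T_gt.size());
  double mae  = abs_sum / T_gt.size();
  std::cout << "RMSE = " << rmse << " m, MAE = " << mae << " m, SSE = " << sse << std::endl;
  return 0;
}
```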

5.2. Comparison Results

This experiment investigates the performance of DLIC and existing methods across 18 real-world scenes. The compared approaches include visual-inertial methods (SVO [28], VINS-Fusion [30]), LiDAR-inertial methods (Fast-LIO [22], D-EKF [13], IGE-LIO [7]), the multi-LiDAR method M-LOAM [11], and LiDAR-inertial-visual methods (Fast-LIVO [6], R2live [9]). Considering the restricted computational resources of UAVs, only these methods are included in the comparison, as they have relatively lightweight computational complexity. Since the proposed method can also work in a multi-LiDAR–IMU configuration, we mark it as DLI in this special case. The main quantitative comparison results on the first nine scenes are presented in Table 2. It can be observed that visual-inertial methods [28,30] exhibit unsatisfactory performance compared to the other approaches. With the addition of LiDAR sensors, the performance improves notably. This contrast is primarily attributed to the high noise and instability of visual measurements in complex environments, which are less reliable than LiDAR data. In particular, the NYA_03 sequence, an indoor scene with weak texture features, further highlights this limitation. Similarly, without IMU assistance, DVL-SLAM [35] suffers from the largest localization errors, as its state estimation is highly sensitive to outliers in visual keypoint matching. These results collectively indicate that leveraging multi-sensor data can significantly enhance overall performance. In contrast, our proposed methods, DLI and DLIC, achieve superior robustness and accuracy by effectively fusing measurements from multiple sensors. This enables them to maintain stable estimation even in noisy or texture-degraded environments, achieving significantly lower RMSE values than competing approaches.
Then, we analyze the performance of state estimation methods with LiDAR-based multi-sensor configurations. From Table 2, the overall RMSE metrics of the proposed DLIC and DLI are smaller than those of current methods such as R2live [9], R3live [10], M-LOAM [11], IGE-LIO [7], and SR-LIVO [36]. The reasons are analyzed in depth. First, the proposed method utilizes multiple LiDARs, so it can receive LiDAR measurements with larger FoVs. Second, DLIC is more robust to IMU measurement noise, as it leverages multiple IMUs to decrease acceleration and angular velocity noise. Third, the proposed method constructs the visual measurement by matching the image against the local feature map, which increases both the accuracy and the efficiency of visual measurement construction.
From Table 2, it is observed that some methods, including Fast-LIO [22], Fast-LIVO [6], and D-EKF [13], achieve competitive performance with the proposed DLI and DLIC methods. These methods are further compared where the results are provided in Table 3 and Table 4. In Table 3, we analyze the maximum RMSE and MAE metrics of these methods in the first nine scenes on the NTU VIRAL dataset [18]. Maximum RMSE is an important metric to reflect the stability of state estimation methods. The maximum RMSE of DLIC is always smaller than that of others. It suggests that the proposed distributed state estimation method is robust to multi-sensor measurement noise. MAE is another common metric to evaluate the UAV positional error. MAE of DLI and DLIC is smaller than that of others. Combining the results in Table 2 and Table 3, as well as the visualization of estimated trajectories in Figure 5, the proposed DLIC method has achieved an accurate state estimation result in the various scenes.
In Table 4, we further investigate the generalization ability of these methods in the last nine scenes of the NTU VIRAL dataset [18]. Most methods do not report their performance in these scenarios [6,13,22]. For a fair comparison, we fix the hyper-parameters of all methods and test their generalization ability in unseen scenarios. In the RTP scene, camera images are not saved during the dataset collection, so Fast-LIVO and DLIC cannot work. The results in Table 4 indicate that the proposed DLI and DLIC methods outperform existing methods in the overall metrics (i.e., RMSE, max RMSE, and MAE). The reason is analyzed in-depth. Compared with Fast-LIO and Fast-LIVO, the proposed DLI and DLIC can make use of the measurements from multiple LiDARs, multiple cameras, and multiple IMUs in a distributed estimation manner, thus increasing the robustness to sensor measurement noise. Compared with D-EKF, the proposed method considers the fine-grained multi-IMU measurements. It makes DLI and DLIC more accurate than D-EKF in most scenes. In all, from the results in Table 4, the proposed method achieves a higher generalization ability than existing methods.
Furthermore, to better evaluate the error fluctuations caused by multi-sensor-multi-type noise during UAV state estimation, we use additional metrics, namely the standard deviation (SD) and the sum of squares due to error (SSE); the results are provided in Table 5. The SD and SSE of DLIC are smaller than those of the other methods, because the proposed method makes full use of the multi-sensor-multi-type measurements.
After that, we evaluate the robustness of the proposed DLIC method under different noise levels. Specifically, sensor noise (i.e., Gaussian random noise with standard deviation $\sigma_{\mathrm{LiDAR}}$) is added to the LiDAR point cloud to test the stability of multi-sensor-based state estimation methods. $\sigma_{\mathrm{LiDAR}} \in [0, 0.5]$ is divided into five noise levels. RMSE results are provided in Table 6. It is observed that, with the proposed distributed state estimation model, the estimated state is more robust to sensor noise than that of the other methods.
From the above experiments in the 18 scenes, it is concluded that the proposed DLIC makes full use of the multi-LiDAR–IMU–camera measurements, so that it achieves a more accurate and robust localization performance than current methods in most scenes.

5.3. Visualization Verifications

This experiment investigates the reconstruction performance of the proposed method. As the NTU VIRAL dataset [18] does not contain the GT of reconstruction results, we compare the mapping detail as well as visualize the trajectory difference of each approach. As Fast-LIVO [6], DLI, and DLIC have the closest performance in Table 2 and Table 4, these methods are compared in this experiment. The overall mapping results are shown in Figure 6. It is found that the proposed method has a scanning area larger than Fast-LIVO, for the proposed method utilizes the multi-sensor-multi-type information. From Figure 7, it is observed that the trajectory of DLIC is relatively smoother than D-EKF [13], so that DLIC has a smaller RMSE than D-EKF. From Figure 8, the local mapping details show that the reconstruction result by DLIC has less noise than D-EKF [13]. Mapping results of Fast-LIVO [6] and DLIC are provided in Figure 9. The reason why the left image is more blurred than the right one is that the point cloud reconstructed by Fast-LIVO has much greater noise than the proposed method. It indicates that the UAV trajectory estimated by the proposed DLIC method is more accurate than Fast-LIVO. After discussing the mapping results, it is concluded that the proposed DLIC method uses the multi-LiDAR–IMU–camera measurements effectively to obtain the fine-grained 3D mapping results.

5.4. Ablation Studies

This experiment investigates the performance of each module inside DLIC, with the result provided in Table 7. Baseline means Fast-LIO [22]. Baseline-1 or Baseline-2 means the usage of the horizontal or vertical LiDAR. In the outdoor scene, the horizontal LiDAR sensor has a wide FoV, so that Baseline-1 is more accurate than Baseline-2. Baseline+A-I2P is to add visual measurements in the ESIKF framework. With the usage of A-I2P, the localization accuracy of Baseline-1 and 2 is enhanced, because the 2D–3D correspondences can decrease the negative impact of the noisy 3D–3D correspondences in state estimation. On the other hand, it is found that DLIC also improves the performance of Baseline-1, for it makes use of the observations from Baseline-1 and 2. With the usage of Equation (9) and the A-I2P module, the proposed DLIC has achieved a significant improvement over Baseline-1. In all, the experiment demonstrates that the proposed modules are effective for multi-sensor-multi-type-based SLAM.
To further investigate the robustness of the methods in Table 7, we follow the experimental setting in Table 6 and conduct the noise experiments in Figure 10. With the proposed A-I2P and distributed state estimation model in Equation (8), the proposed method is more robust to sensor noise.
Further, we conduct an experiment to investigate the effect of the A-I2P hyper-parameters $\tau_{2D}$, $\tau_{I}$, and $\tau_{3D}$ on the performance of Baseline+A-I2P. The results are presented in Table 8, with $\tau_{2D} \in [1, 9]$ pixels, $\tau_{I} \in [6, 30]$, and $\tau_{3D} \in [2.5, 12.5]$ cm. At first, we vary $\tau_{2D}$, while $\tau_{I}$ and $\tau_{3D}$ are set to initial values of 6 and 2.5 cm, respectively. When $\tau_{2D} \geq 5$, the RMSE begins to increase, because a large 2D search region causes inaccurate keypoint depths; $\tau_{2D} = 5$ yields the best RMSE. Then, we fix $\tau_{2D} = 5$ and vary $\tau_{I}$, while $\tau_{3D}$ remains at its initial value. When $\tau_{I} \geq 30$, the alignment error between the LiDAR point and the keypoint increases, causing a large RMSE error; $\tau_{I} = 24$ is optimal. After that, we fix $\tau_{2D}$ and $\tau_{I}$, and study $\tau_{3D}$. If $\tau_{3D}$ is set too large, the 3D–3D correspondences found by the kNN search are not accurate, causing inaccurate mapping. Thus, $\tau_{2D} = 5$, $\tau_{I} = 24$, and $\tau_{3D} = 5.0$ cm are the best hyper-parameters of A-I2P.
Moreover, we conduct a runtime analysis; the results are provided in Table 9. The baseline is Fast-LIO [22]. Compared with Fast-LIO [22], the additional cost of DLIC mainly comes from updating the shared feature maps. Thanks to multi-threading, solving Equation (8) does not double the runtime, so DLI has a runtime close to that of Fast-LIO [22]. As A-I2P is an efficient module, the runtime of DLIC is ≈25 ms. Although the state-of-the-art D-EKF method [13] is faster than the proposed methods, it does not fully use the multi-LiDAR–IMU–camera sensor data, so its state estimation accuracy is lower than that of DLIC. The core of DLIC is the distributed state estimation in Equation (7) and A-I2P. From Table 9, solving Equation (7) costs nearly 1.24 ms and A-I2P costs nearly 5.99 ms, which shows that the proposed method is efficient in practical applications.
We also analyze the peak memory usage of the proposed method. In this test, the memory used by ROS RVIZ visualization is not included. In a SLAM system, memory usage is related to the mapping area, because feature map storage is the dominant factor. The peak memory usage in the EEE_03 scene is presented in Table 10. For a fair comparison, Fast-LIVO here denotes the combination of two LiDAR–camera–inertial groups, matching the sensor configuration used by DLIC. As DLIC filters out overlapping 3D keypoints when updating the iKd-Tree [34], it saves nearly 0.5 GB of memory compared to Fast-LIVO. Further, we investigate the effect of the map voxel size; the results are shown in Table 11. The default map voxel size is 0.050 m in the experiments, and we test the performance of DLIC with voxel sizes of 0.025 m and 0.100 m. When the voxel size is decreased to 0.025 m, the mapping resolution increases at the cost of excessive CPU memory. If the voxel size is enlarged to 0.100 m, most mapping details are lost, and the RMSE error becomes larger than 0.35 m. Thus, it is recommended to set a proper voxel size in practical applications. From the above experiments, DLIC is a real-time state estimation method with light memory usage.

5.5. Limitations and Future Works

From the above method comparisons, visualization analysis, and ablation studies, it is found that DLIC has made a certain amount of progress in the various scenarios. However, there are still some potential improvements for DLIC. First, since DLIC is not equipped with a loop closure detection and optimization module, this issue may limit the performance of DLIC in long-range state estimation. We will design a specific loop closure module for the multi-sensor-multi-type system. Second, moving objects have a certain negative impact (i.e., motion blur) in the proposed method, as shown in Figure 11. We will address this issue in future work. Third, although our current fusion strategy demonstrates reasonable performance, it remains handcrafted. Exploring deep learning or AI-driven adaptive fusion techniques may better handle complex environments and further enhance both robustness and efficiency. We will pursue this direction in future work. Fourth, as there are few datasets that support state estimation on multi-sensor-multi-type-based systems, it is essential to establish a simulation environment on ROS to collect the multi-sensor-multi-type measurements in future work. It will greatly benefit the algorithm development of state estimation on multi-sensor-multi-type-based systems.

6. Conclusions

We propose an efficient and accurate distributed state estimation method, DLIC, which effectively fuses data from multi-LiDAR–IMU–camera sensors and estimates the UAV’s state. At first, we develop a theoretical state estimation model. It decomposes the original complex estimation problem into multiple low-complexity optimization sub-problems, which can be easily solved with a modern embedded system with a multi-core CPU. The proposed distributed state estimation model has a feedback function that serves as a bridge to interact with state estimation results from all subsystems. After that, we implement the above theoretical model and propose a method, DLIC, for a system with multi-LiDAR–IMU–camera sensors. In DLIC, we propose an A-I2P module to construct accurate 2D–3D correspondences with the aid of LiDAR depth. Extensive experiments on the public real-world NTU VIRAL dataset demonstrate the effectiveness of the proposed DLIC method. Thus, we believe that the proposed DLIC method benefits the development of multi-sensor autonomous systems.

Author Contributions

Conceptualization, J.D. and P.A.; methodology, J.D. and P.A.; software, T.M.; validation, J.D.; formal analysis, K.Y. and B.F.; investigation, T.M.; resources, P.A.; data curation, J.D.; writing—original draft preparation, J.D. and P.A.; writing—review and editing, J.M.; visualization, K.Y. and B.F.; supervision, P.A. and T.M.; project administration, J.M.; funding acquisition, P.A. All authors have read and agreed to the published version of the manuscript.

Funding

This work is partially supported by the National Key R&D Program of China (Grant ID: 2024YFC3015303), the National Science Foundation of China (Grant ID: 62502171, 62201536), the National Key Laboratory of Electromagnetic Energy (Grant ID: 6142217242040101), and the China Postdoctoral Science Foundation (Grant ID: 2024M761014).

Data Availability Statement

The NTU VIRAL dataset can be found at https://ntu-aris.github.io/ntu_viral_dataset/ (accessed on 24 November 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
UAV	Unmanned aerial vehicle
FoV	Field of view
RGB	Red, green, blue
LiDAR	Light detection and ranging
IMU	Inertial measurement unit
CPU	Central processing unit
I2P	Image-to-point cloud
LIVO	LiDAR-inertial-visual odometry
LIO	LiDAR-inertial odometry
ESIKF	Error state iterative Kalman filter
SD	Standard deviation
SSE	Sum of squared errors
RMSE	Root mean square error
MAE	Mean absolute error
SLAM	Simultaneous localization and mapping
UWB	Ultra wide band
GT	Ground truth
kNN	k-nearest neighbor
MAP	Maximum a posteriori
EKF	Extended Kalman filter
GPS	Global positioning system

Appendix A. Further Discussion of DLIC

The effectiveness of multi-LiDAR–IMU-based state estimation methods has been analyzed in previous work [13]. In this appendix, we focus on discussing the effectiveness of the proposed multi-LiDAR–IMU–Camera-based state estimation method, DLIC.
First, measurements from multiple IMUs can alleviate the white noise inherent in a single IMU. If the intrinsic parameters (i.e., accelerometer and gyroscope biases of each IMU) and extrinsic parameters (i.e., rigid transformations between IMUs) are accurately calibrated, the combined measurements of multiple IMUs can approximate those of a single IMU with a higher sampling frequency, thereby reducing the negative impact of noise in IMU pre-integration [19]. Consequently, using multiple IMUs improves the accuracy of the 6-DoF pose estimated from IMU data.
Second, employing multiple LiDARs and cameras increases the number of 2D–3D and 3D–3D correspondences, significantly mitigating the negative effects of correspondence noise on state estimation. In DLIC, camera measurements are constructed through I2P registration. Multiple LiDARs and cameras enhance the field of view (FoV) of perception and enlarge the FoV overlap between LiDARs and cameras. Furthermore, multiple LiDARs alleviate the sparsity of point clouds captured by a single LiDAR. These factors collectively improve correspondence accuracy, ensuring the construction of reliable camera and LiDAR measurements.
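The following simplified sketch illustrates only the underlying I2P idea (it is not the A-I2P module itself; the intrinsics, thresholds, and helper names such as i2p_correspondences are hypothetical): LiDAR points are projected into the image with the camera intrinsics, and each detected 2D keypoint is matched to its nearest projected point to form a 2D–3D correspondence. With more LiDARs and cameras, more points fall inside the overlapping FoV, so more of these correspondences survive the distance threshold.

```python
# Simplified sketch of I2P correspondence construction (assumed names and values,
# not the full A-I2P module): project LiDAR points with the camera intrinsics,
# then match each 2D keypoint to its nearest projected point.
import numpy as np

K = np.array([[400.0, 0.0, 320.0],
              [0.0, 400.0, 240.0],
              [0.0, 0.0, 1.0]])               # pinhole intrinsics (toy values)

def project(points_cam):
    """Project 3D points expressed in the camera frame onto the image plane."""
    uv = (K @ points_cam.T).T
    return uv[:, :2] / uv[:, 2:3]

def i2p_correspondences(keypoints_2d, lidar_points_cam, max_pixel_dist=3.0):
    """Return (2D keypoint, 3D point) pairs whose projection lies within the threshold."""
    in_front = lidar_points_cam[lidar_points_cam[:, 2] > 0.1]   # keep points in front of the camera
    proj = project(in_front)
    matches = []
    for kp in keypoints_2d:
        d = np.linalg.norm(proj - kp, axis=1)                   # brute-force nearest neighbour
        j = int(np.argmin(d))
        if d[j] < max_pixel_dist:
            matches.append((kp, in_front[j]))                   # accepted 2D-3D correspondence
    return matches

# Toy usage: three LiDAR points and three image keypoints, one of which is an outlier.
pts = np.array([[0.5, 0.2, 4.0], [-0.3, 0.1, 6.0], [1.0, -0.4, 8.0]])
kps = project(pts) + np.array([[0.5, -0.4], [0.2, 0.3], [4.0, 4.0]])
print(len(i2p_correspondences(kps, pts)))  # 2 reliable correspondences survive the threshold
```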
Based on the above discussion, a system that fuses multiple sensors of multiple types can effectively suppress the noise of each sensor type. As a result, it achieves more accurate state estimation than other multi-sensor-based methods, as shown in Table 2, Table 3 and Table 4.

References

1. Cui, J.; Niu, J.; He, Y.; Liu, D.; Ouyang, Z. ACLC: Automatic Calibration for Nonrepetitive Scanning LiDAR-Camera System Based on Point Cloud Noise Optimization. IEEE Trans. Instrum. Meas. 2024, 73, 5001614.
2. Zhou, Y.; Li, J.; Ou, C.; Yan, D.; Zhang, H.; Xue, X. Open-Vocabulary Object Detection in UAV Imagery: A Review and Future Perspectives. Drones 2025, 9, 557.
3. Lv, M.; Zhang, B.; Duan, H.; Shi, Y.; Zhou, C. Unmanned Aerial Vehicles Formation Control and Safety Guarantee; Springer Nature: Singapore, 2025.
4. Li, K.; Bu, S.; Li, J.; Xia, Z.; Wang, J.; Li, X. Distributed Relative Pose Estimation for Multi-UAV Systems Based on Inertial Navigation and Data Link Fusion. Drones 2025, 9, 405.
5. Lv, J.; Lang, X.; Xu, J.; Wang, M.; Liu, Y.; Zuo, X. Continuous-Time Fixed-Lag Smoothing for LiDAR-Inertial-Camera SLAM. IEEE/ASME Trans. Mechatronics 2023, 28, 2259–2270.
6. Zheng, C.; Zhu, Q.; Xu, W.; Liu, X.; Guo, Q.; Zhang, F. FAST-LIVO: Fast and Tightly-coupled Sparse-Direct LiDAR-Inertial-Visual Odometry. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, Kyoto, Japan, 23–27 October 2022; pp. 4003–4009.
7. Chen, Z.; Zhu, H.; Yu, B.; Jiang, C.; Hua, C.; Fu, X.; Kuang, X. IGE-LIO: Intensity Gradient Enhanced Tightly Coupled LiDAR-Inertial Odometry. IEEE Trans. Instrum. Meas. 2024, 73, 8506411.
8. Zhao, X.; Wen, C.; Prakhya, S.M.; Yin, H.; Zhou, R.; Sun, Y.; Xu, J.; Bai, H.; Wang, Y. Multimodal Features and Accurate Place Recognition with Robust Optimization for Lidar-Visual-Inertial SLAM. IEEE Trans. Instrum. Meas. 2024, 73, 5033916.
9. Lin, J.; Zheng, C.; Xu, W.; Zhang, F. R2LIVE: A Robust, Real-Time, LiDAR-Inertial-Visual Tightly-Coupled State Estimator and Mapping. IEEE Robot. Autom. Lett. 2021, 6, 7469–7476.
10. Lin, J.; Zhang, F. R3LIVE: A Robust, Real-time, RGB-colored, LiDAR-Inertial-Visual tightly-coupled state Estimation and mapping package. In Proceedings of the IEEE International Conference on Robotics and Automation, Philadelphia, PA, USA, 23–27 May 2022; pp. 10672–10678.
11. Jiao, J.; Ye, H.; Zhu, Y.; Liu, M. Robust Odometry and Mapping for Multi-LiDAR Systems with Online Extrinsic Calibration. IEEE Trans. Robot. 2022, 38, 351–371.
12. Wang, Y.; Song, W.; Lou, Y.; Huang, F.; Tu, Z.; Zhang, S. Simultaneous Localization of Rail Vehicles and Mapping of Environment with Multiple LiDARs. IEEE Robot. Autom. Lett. 2022, 7, 8186–8193.
13. Lin, J.; Liu, X.; Zhang, F. A decentralized framework for simultaneous calibration, localization and mapping with multiple LiDARs. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, Las Vegas, NV, USA, 24 October 2020–24 January 2021; pp. 4870–4877.
14. Nguyen, T.; Duberg, D.; Jensfelt, P.; Yuan, S.; Xie, L. SLICT: Multi-Input Multi-Scale Surfel-Based Lidar-Inertial Continuous-Time Odometry and Mapping. IEEE Robot. Autom. Lett. 2023, 8, 2102–2109.
15. Nguyen, T.; Yuan, S.; Cao, M.; Lyu, Y.; Nguyen, T.H.; Xie, L. MILIOM: Tightly Coupled Multi-Input Lidar-Inertia Odometry and Mapping. IEEE Robot. Autom. Lett. 2021, 6, 5573–5580.
16. Wang, Z.; Zhang, L.; Shen, Y.; Zhou, Y. D-LIOM: Tightly-Coupled Direct LiDAR-Inertial Odometry and Mapping. IEEE Trans. Multim. 2023, 25, 3905–3920.
17. Jung, M.; Jung, S.; Kim, A. Asynchronous Multiple LiDAR-Inertial Odometry Using Point-Wise Inter-LiDAR Uncertainty Propagation. IEEE Robot. Autom. Lett. 2023, 8, 4211–4218.
18. Nguyen, T.; Yuan, S.; Cao, M.; Lyu, Y.; Nguyen, T.H.; Xie, L. NTU VIRAL: A visual-inertial-ranging-lidar dataset, from an aerial vehicle viewpoint. Int. J. Robot. Res. 2022, 41, 270–280.
19. Forster, C.; Carlone, L.; Dellaert, F.; Scaramuzza, D. On-Manifold Preintegration for Real-Time Visual-Inertial Odometry. IEEE Trans. Robot. 2017, 33, 1–21.
20. Shan, T.; Englot, B.J.; Meyers, D.; Wang, W.; Ratti, C.; Rus, D. LIO-SAM: Tightly-coupled Lidar Inertial Odometry via Smoothing and Mapping. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, Las Vegas, NV, USA, 24 October 2020–24 January 2021; pp. 5135–5142.
21. Shan, T.; Englot, B.J.; Ratti, C.; Rus, D. LVI-SAM: Tightly-coupled Lidar-Visual-Inertial Odometry via Smoothing and Mapping. In Proceedings of the IEEE International Conference on Robotics and Automation, Xi’an, China, 30 May–5 June 2021; pp. 5692–5698.
22. Xu, W.; Zhang, F. FAST-LIO: A Fast, Robust LiDAR-Inertial Odometry Package by Tightly-Coupled Iterated Kalman Filter. IEEE Robot. Autom. Lett. 2021, 6, 3317–3324.
23. Zhang, J.; Singh, S. Low-drift and real-time lidar odometry and mapping. Auton. Robot. 2017, 41, 401–416.
24. Shan, T.; Englot, B.J. LeGO-LOAM: Lightweight and Ground-Optimized Lidar Odometry and Mapping on Variable Terrain. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, Madrid, Spain, 1–5 October 2018; pp. 4758–4765.
25. Qin, C.; Ye, H.; Pranata, C.E.; Han, J.; Zhang, S.; Liu, M. LINS: A Lidar-Inertial State Estimator for Robust and Efficient Navigation. In Proceedings of the IEEE International Conference on Robotics and Automation, Paris, France, 31 May–31 August 2020; pp. 8899–8906.
26. Ye, H.; Chen, Y.; Liu, M. Tightly Coupled 3D Lidar Inertial Odometry and Mapping. In Proceedings of the IEEE International Conference on Robotics and Automation, Montreal, QC, Canada, 20–24 May 2019; pp. 3144–3150.
27. Li, T.; Pei, L.; Xiang, Y.; Zuo, X.; Yu, W.; Truong, T. P3-LINS: Tightly Coupled PPP-GNSS/INS/LiDAR Navigation System with Effective Initialization. IEEE Trans. Instrum. Meas. 2023, 72, 8501813.
28. Forster, C.; Zhang, Z.; Gassner, M.; Werlberger, M.; Scaramuzza, D. SVO: Semidirect Visual Odometry for Monocular and Multicamera Systems. IEEE Trans. Robot. 2017, 33, 249–265.
29. Lin, J.; Zhang, F. R3LIVE++: A Robust, Real-time, Radiance reconstruction package with a tightly-coupled LiDAR-Inertial-Visual state Estimator. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 11168–11185.
30. Qin, T.; Li, P.; Shen, S. VINS-Mono: A Robust and Versatile Monocular Visual-Inertial State Estimator. IEEE Trans. Robot. 2018, 34, 1004–1020.
31. Furgale, P.T.; Tong, C.H.; Barfoot, T.D.; Sibley, G. Continuous-time batch trajectory estimation using temporal basis functions. Int. J. Robot. Res. 2015, 34, 1688–1710.
32. Duberg, D.; Jensfelt, P. UFOMap: An Efficient Probabilistic 3D Mapping Framework That Embraces the Unknown. IEEE Robot. Autom. Lett. 2020, 5, 6411–6418.
33. Shi, J.; Tomasi, C. Good features to track. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 21–23 June 1994; pp. 593–600.
34. Cai, Y.; Xu, W.; Zhang, F. ikd-Tree: An Incremental K-D Tree for Robotic Applications. arXiv 2021, arXiv:2102.10808.
35. Shin, Y.; Park, Y.S.; Kim, A. DVL-SLAM: Sparse depth enhanced direct visual-LiDAR SLAM. Auton. Robot. 2020, 44, 115–130.
36. Yuan, Z.; Deng, J.; Ming, R.; Lang, F.; Yang, X. SR-LIVO: LiDAR-Inertial-Visual Odometry and Mapping with Sweep Reconstruction. IEEE Robot. Autom. Lett. 2024, 9, 5110–5117.
Figure 1. Illustration of the bottleneck of current methods and the new solution proposed in this paper.
Figure 2. DLIC pipeline. The core step is to optimize the state via Equation (8), whose inputs are the prior state, LiDAR measurements, visual measurements, and feedback. The prior state is estimated with IMU pre-integration, and the LiDAR and visual measurements are constructed with local feature maps. After optimization, DLIC maintains the shared state information and 3D keypoint maps, which are crucial for constructing the LiDAR measurements, visual measurements, and feedback at the next time step. Compared with Fast-LIVO [6], DLIC obtains a stable and accurate flight trajectory as well as a high-quality scene reconstruction.
Figure 3. Procedure of A-I2P. (a) Depth generation projects the undistorted LiDAR point cloud onto the image plane to obtain the LiDAR depth. (b) A-I2P contains four steps: keypoint detection, depth generation, kNN search, and update. (c) Visualization of A-I2P, which is very fast in practical applications.
Figure 4. Details of the NTU VIRAL dataset [18]. (a) UAV equipped with two LiDARs, two cameras, and multiple IMUs. (b) Experiment sites used to evaluate the proposed state estimation method.
Figure 5. Visualization of the estimated UAV trajectories in the EEE_01, EEE_02, and EEE_03 scenes of the NTU VIRAL dataset [18]. In most cases, the positions estimated by DLIC have relatively small errors. The black dotted line is the GT trajectory, while the colored line is the trajectory computed by our method.
Figure 6. Visualization of Fast-LIVO [6] (top) and the proposed DLIC (bottom) in different scenarios (a–e). The pseudo-color reflects the LiDAR intensity value. DLIC exploits multiple sensors with large FoVs, so it reconstructs the 3D scene more completely than Fast-LIVO.
Figure 7. Trajectory visualization of the proposed DLIC and D-EKF [13]. (a) EEE scene. (b) TNP scene. When the UAV moves fast, the trajectory of D-EKF tends to fluctuate, while the trajectory of DLIC is still smooth and accurate.
Figure 8. Mapping details of the different methods. (a) DLIC. (b) D-EKF [13]. The proposed method reconstructs the buildings with less noise.
Figure 9. Mapping results of Fast-LIVO (left) and DLIC (right). Compared with Fast-LIVO, DLIC achieves more accurate state estimation and therefore reconstructs the 3D outdoor scene with less noise.
Figure 10. RMSE of the different methods on the EEE_03 scene under various sensor noise levels (from 1 to 5). DLI is the baseline + Equation (8), and DLIC is the baseline + Equation (8) + A-I2P.
Figure 11. Moving objects have a certain negative impact (i.e., motion blur) on the proposed method. We will address this issue in future work.
Table 1. Main symbol notation.
Symbol | Meaning
x, x̂, x̃ | Ground-truth state, estimated state, residual error
I, W, A, k | Subscripts: IMU coordinate system, gyroscope, accelerometer, timestamp
p_I, v_I, ξ_I | 3D position, velocity, rotation vector
b_W, b_A, g | IMU biases of the gyroscope and accelerometer, gravity vector
J_k, P̂_k | Jacobian and covariance matrix of the pre-integration at x̂_k
M_l | Feature map of the l-th Kalman filter
P_l | 3D point in the current LiDAR coordinate system
P_l^G | 3D point in the global coordinate system G
I | 2D pixel coordinate in the image
y_i, z_j | Correspondence: LiDAR point, image point
Table 2. RMSE results of the compared methods on the first nine scenes in the NTU VIRAL dataset. L, I, and C denote the use of LiDAR, IMU, and camera; L1 and L2 denote the use of one and two LiDARs, respectively. Red bold is 1st, and blue bold is 2nd. A dash (–) denotes an unavailable entry.
RMSE Metric | Sensor Usage | EEE_01 | EEE_02 | EEE_03 | NYA_01 | NYA_02 | NYA_03 | SBS_01 | SBS_02 | SBS_03
Fast-LIO [22] | L1+I1 | 0.540 | 0.220 | 0.250 | 0.240 | 0.210 | 0.230 | 0.250 | 0.260 | 0.240
Fast-LIVO [6] | L1+I1+C1 | 0.280 | 0.170 | 0.230 | 0.190 | 0.180 | 0.190 | 0.290 | 0.220 | 0.220
DVL-SLAM [35] | L1+C1 | 2.880 | 1.650 | 3.080 | 2.090 | 1.450 | 1.820 | 1.080 | 2.310 | 2.230
SVO [28] | I1+C1 | Fail | Fail | 4.120 | 2.290 | 2.910 | 3.320 | 7.840 | Fail | Fail
VINS-Fusion [30] | I1+C1 | 0.608 | 0.506 | 0.494 | 0.397 | 0.424 | 0.787 | 0.508 | 0.564 | 0.878
R2Live [9] | L1+I1+C1 | 0.450 | 0.210 | 0.970 | 0.190 | 0.630 | 0.310 | 0.560 | 0.240 | 0.440
M-LOAM [11] | L2 | 0.249 | 0.166 | 0.232 | 0.123 | 0.191 | 0.226 | 0.173 | 0.147 | 0.153
D-EKF [13] | L2+I2 | 0.269 | 0.164 | 0.220 | 0.229 | 0.178 | 0.207 | 0.208 | 0.220 | 0.244
IGE-LIO [7] | L1+I1 | 0.209 | 0.197 | 0.217 | 0.231 | 0.195 | 0.194 | 0.207 | 0.219 | 0.212
R3Live [10] | L1+I1+C1 | 1.690 | 0.640 | 0.630 | 0.350 | 0.230 | 0.400 | 0.270 | 0.210 | –
DLI (Our) | L2+I2 | 0.256 | 0.152 | 0.211 | 0.196 | 0.166 | 0.183 | 0.185 | 0.181 | 0.182
DLIC (Our) | L2+I2+C2 | 0.237 | 0.146 | 0.208 | 0.166 | 0.143 | 0.170 | 0.162 | 0.160 | 0.173
Table 3. Other metrics of the compared methods on the first nine scenes in the NTU VIRAL dataset. Red bold is 1st, and blue bold is 2nd.
MAX RMSE Metric | EEE_01 | EEE_02 | EEE_03 | NYA_01 | NYA_02 | NYA_03 | SBS_01 | SBS_02 | SBS_03
Fast-LIO [22] | 0.633 | 0.732 | 0.638 | 0.649 | 0.463 | 0.542 | 0.654 | 0.587 | 0.513
Fast-LIVO [6] | 0.586 | 0.628 | 0.582 | 0.641 | 0.459 | 0.488 | 0.588 | 0.556 | 0.492
D-EKF [13] | 1.012 | 0.621 | 0.438 | 0.588 | 0.469 | 0.630 | 0.709 | 0.492 | 0.504
DLI (Our) | 0.973 | 0.541 | 0.411 | 0.520 | 0.545 | 0.466 | 0.439 | 0.435 | 0.491
DLIC (Our) | 0.519 | 0.407 | 0.364 | 0.608 | 0.443 | 0.455 | 0.420 | 0.415 | 0.464
MAE Metric | EEE_01 | EEE_02 | EEE_03 | NYA_01 | NYA_02 | NYA_03 | SBS_01 | SBS_02 | SBS_03
Fast-LIO [22] | 0.277 | 0.152 | 0.231 | 0.210 | 0.191 | 0.197 | 0.231 | 0.254 | 0.231
Fast-LIVO [6] | 0.248 | 0.136 | 0.219 | 0.215 | 0.184 | 0.175 | 0.261 | 0.201 | 0.205
D-EKF [13] | 0.252 | 0.141 | 0.208 | 0.190 | 0.161 | 0.181 | 0.198 | 0.207 | 0.232
DLI (Our) | 0.242 | 0.129 | 0.199 | 0.167 | 0.134 | 0.150 | 0.165 | 0.159 | 0.163
DLIC (Our) | 0.224 | 0.126 | 0.196 | 0.135 | 0.113 | 0.144 | 0.134 | 0.142 | 0.151
Table 4. Results of the compared methods on the last nine scenes in the NTU VIRAL dataset. Red bold is 1st, and blue bold is 2nd. A dash (–) denotes an unavailable entry.
RMSE Metric | RTP_01 | RTP_02 | RTP_03 | TNP_01 | TNP_02 | TNP_03 | SPMS_01 | SPMS_02 | SPMS_03
Fast-LIO [22] | 0.402 | 0.240 | 0.636 | 0.138 | 0.159 | 0.174 | 0.635 | 2.216 | 1.595
Fast-LIVO [6] | – | – | – | 0.114 | 0.107 | 0.195 | 0.975 | 1.211 | 2.043
D-EKF [13] | 0.298 | 0.165 | 0.633 | 0.096 | 0.109 | 0.178 | 0.363 | 2.047 | 1.505
DLI (Our) | 0.216 | 0.157 | 0.572 | 0.095 | 0.104 | 0.144 | 0.301 | 1.898 | 0.684
DLIC (Our) | – | – | – | 0.088 | 0.098 | 0.111 | 0.285 | 1.198 | 1.328
MAX RMSE Metric | RTP_01 | RTP_02 | RTP_03 | TNP_01 | TNP_02 | TNP_03 | SPMS_01 | SPMS_02 | SPMS_03
Fast-LIO [22] | 1.667 | 0.741 | 1.991 | 0.392 | 0.328 | 0.422 | 4.376 | 7.046 | 5.250
Fast-LIVO [6] | – | – | – | 0.304 | 0.421 | 0.377 | 2.165 | 4.607 | 4.631
D-EKF [13] | 1.902 | 0.498 | 2.003 | 0.332 | 0.485 | 0.604 | 1.957 | 8.055 | 5.695
DLI (Our) | 0.782 | 0.339 | 1.976 | 0.310 | 0.401 | 0.421 | 1.112 | 5.432 | 3.305
DLIC (Our) | – | – | – | 0.251 | 0.295 | 0.511 | 1.062 | 4.161 | 3.574
MAE Metric | RTP_01 | RTP_02 | RTP_03 | TNP_01 | TNP_02 | TNP_03 | SPMS_01 | SPMS_02 | SPMS_03
Fast-LIO [22] | 0.318 | 0.196 | 0.578 | 0.126 | 0.148 | 0.143 | 0.489 | 1.641 | 1.405
Fast-LIVO [6] | – | – | – | 0.096 | 0.097 | 0.181 | 0.897 | 0.989 | 1.779
D-EKF [13] | 0.287 | 0.150 | 0.563 | 0.078 | 0.121 | 0.152 | 0.287 | 1.554 | 1.318
DLI (Our) | 0.252 | 0.143 | 0.517 | 0.075 | 0.093 | 0.102 | 0.260 | 1.388 | 0.464
DLIC (Our) | – | – | – | 0.068 | 0.079 | 0.076 | 0.251 | 0.947 | 1.240
Table 5. State error turbulence evaluation of current state estimation methods on EEE_03 scene. Red bold is 1st.
Metrics | Fast-LIO [22] | Fast-LIVO [6] | D-EKF [13] | DLIC
SD | 0.101 | 0.071 | 0.068 | 0.062
SSE | 12.338 | 10.212 | 10.075 | 9.241
Table 6. Robustness evaluation of the current state estimation methods on EEE_01 scene. Red bold is 1st.
Noise Level | Fast-LIO [22] | Fast-LIVO [6] | D-EKF [13] | DLIC
0 | 0.540 | 0.280 | 0.269 | 0.237
1 | 0.548 | 0.285 | 0.272 | 0.241
2 | 0.552 | 0.288 | 0.278 | 0.242
3 | 0.563 | 0.291 | 0.280 | 0.245
4 | 0.578 | 0.293 | 0.284 | 0.247
5 | 0.583 | 0.298 | 0.290 | 0.256
Table 7. Ablation study of DLIC on EEE_03 scene. Red bold is 1st.
Method | RMSE Metric | MAE Metric
Baseline-1 | 0.250 | 0.210
Baseline-2 | 0.284 | 0.233
Baseline-1 + A-I2P | 0.219 | 0.207
Baseline-2 + A-I2P | 0.231 | 0.211
Baseline + Equation (8) (DLI) | 0.211 | 0.199
Baseline + Equation (8) + A-I2P (DLIC) | 0.208 | 0.196
Table 8. Ablation study of hyper-parameters of A-I2P on EEE_03 scene. Red bold is 1st.
τ_2D | 1 | 3 | 5 | 7 | 9
RMSE | 0.248 | 0.245 | 0.244 | 0.251 | 0.254
τ_I | 6 | 12 | 18 | 24 | 30
RMSE | 0.244 | 0.241 | 0.234 | 0.221 | 0.238
τ_3D | 2.5 | 5.0 | 7.5 | 10.0 | 12.5
RMSE | 0.221 | 0.219 | 0.228 | 0.231 | 0.234
Table 9. Runtime analysis of DLIC on EEE_03 scene.
Method | Average Runtime per Loop
D-EKF | 18.89 ms
Baseline | 18.32 ms
Baseline + A-I2P | 24.31 ms
Baseline + Equation (9) (DLI) | 19.56 ms
Baseline + Equation (9) + A-I2P (DLIC) | 25.86 ms
Table 10. Peak memory usage of DLIC on EEE_03 scene under different sensor configurations.
Method | Sensor Usage | Peak Memory Usage
Fast-LIVO [6] | L1+C1+I1 | 1821 MB
Fast-LIVO [6] | L2+C2+I2 | 3277 MB
DLIC | L2+C2+I2 | 2741 MB
Table 11. RMSE and peak memory usage of DLIC on EEE_03 scene under different voxel sizes.
Voxel Size | 0.025 m | 0.050 m | 0.100 m
RMSE | 0.201 m | 0.208 m | 0.352 m
Peak memory usage | 8421 MB | 2741 MB | 723 MB
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
