Article

GDVI-Fusion: Enhancing Accuracy with Optimal Geometry Matching and Deep Nearest Neighbor Optimization

by Jincheng Peng, Xiaoli Zhang, Kefei Yuan, Xiafu Peng and Gongliu Yang
1 Department of Automation, School of Aerospace Engineering, Xiamen University, Xiamen 361102, China
2 Wuhan Second Ship Design and Research Institute, Wuhan 430064, China
3 School of Mechanical Engineering, Zhejiang University, Hangzhou 310058, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(16), 8875; https://doi.org/10.3390/app15168875
Submission received: 18 December 2024 / Revised: 23 March 2025 / Accepted: 30 July 2025 / Published: 12 August 2025

Abstract

Visual–inertial odometry (VIO) systems are not sufficiently robust during long-term operation. In particular, coupled visual–inertial and Global Navigation Satellite System (GNSS) systems are prone to divergence of the position estimate when either the visual information or the GNSS information fails. To address these problems, this paper proposes GDVI-Fusion, a tightly coupled, nonlinear-optimization-based localization system that fuses RGBD vision, an inertial measurement unit (IMU), and global position measurements to overcome the insufficient robustness of carrier position estimation and the inaccurate localization that arise in environments where visual or GNSS information fails. Depth information is preprocessed during initialization to mitigate the influence of lighting and physical structure on the RGBD camera and to improve the accuracy of the depth of image feature points, thereby improving the robustness of the localization system. Feature points are matched with the K-Nearest-Neighbors (KNN) algorithm, and the matched pairs are used to construct optimal geometric constraints that eliminate matches whose connecting lines have abnormal length or slope; this improves the speed and accuracy of feature matching and, in turn, the localization accuracy of the system. The lightweight monocular GDVI-Fusion system proposed in this paper achieves a 54.2% improvement in operational efficiency and a 37.1% improvement in positioning accuracy compared with the GVINS system. We verify the system's operational efficiency and positioning accuracy on a public dataset and on a prototype.

1. Introduction

State estimation algorithms have garnered significant attention owing to the emergence of applications such as unmanned driving, autonomous navigation, and 3D reconstruction. The popularity of multi-sensor fusion localization methods has been on the rise, as they harness the complementary advantages of various sensors to bolster the robustness and accuracy of state estimation [1,2,3]. To meet the demand for accurate and fast position estimation in autonomous navigation and positioning, numerous open-source algorithms for visual–inertial navigation system (VINS) fusion have been developed [4,5,6]. These algorithms, combining visual and inertial sensors, can achieve high accuracy. However, the VINS combined navigation system has four unobservable directions [7], and the accuracy and robustness are influenced by the quantity of visual observations. The stability of monocular visual–inertial odometry (VIO) [5,8,9] relies on the initialization effect, and in stationary scenarios, scale information degrades to an unobservable state, resulting in localization information prone to dispersion. The performance of stereo VIO [10,11] is constrained by the computational resources of the embedded device. These fusion algorithms have some drift in the localization information when the visual information is subjected to external interference. In challenging environments where camera images may be ambiguous or affected by lighting conditions, Global Navigation Satellite Systems (GNSS) can provide global position information and restrain state estimation divergence. GNSS receivers offer unique coordinates such as latitude, longitude, and altitude, independent of environmental conditions. However, GNSS alone is insufficient for positioning in environments like tunnels or indoors. RGBD vision cameras, capable of capturing depth information, provide scale information for the real environment, mitigating the impact of inaccurate initial scale estimation in monocular VIO systems. The effective fusion of RGBD visual–inertial and GNSS measurement information enhances the robustness and accuracy of the positioning system. This fusion approach ensures that the localization information remains stable even in environments where visual or GNSS measurements are disturbed.
However, there are many challenges in the fusion of the three major sensor systems. First, RGBD cameras face limitations in perception range and structure, making it challenging to acquire accurate depth information in environments with depth discontinuity. Additionally, open-source algorithms [10,12] struggle to operate in scenarios characterized by poor image texture, rapid rotation of the device, and incomplete image depth information. Therefore, the system initialization stage needs to optimize the depth information with noise to obtain effective depth information. Second, the computational time complexity and space complexity required for visual feature point matching are high, and there are false matches. Finally, image information becomes unreliable in bright light environments, resulting in the failure of both visual and depth information. In scenarios like traveling through tunnels and indoor–outdoor transitions, this situation can lead to a brief loss of GNSS information. Therefore, it poses a challenge to maintain system localization information without dispersion when one sensor fails, relying on other sensor data.
To solve the above problems, this paper proposes an RGBD visual–inertial and GNSS fusion positioning algorithm. The GDVI-Fusion algorithm tightly integrates GNSS raw measurements with RGBD visual and inertial data through nonlinear optimization. It utilizes the visual residual factor, GNSS pseudorange factor, Doppler factor, and inertial measurement unit (IMU) factor to jointly constrain the optimized system state, depth bias, and scale information, effectively reducing accumulated drift errors. To improve feature matching accuracy and efficiency, the algorithm incorporates KNN feature matching and optimal geometric constraints, which reduce the reprojection error and improve the positioning accuracy of the system through the initial matching and matching optimization phases. Furthermore, the Nearest Neighbor Resampling (NNR) algorithm assigns effective depth values with minimal error to feature points lacking depth information, fusing the aligned monocular image information and depth information. This ensures accurate depth information for each feature point, enhancing the robustness of the localization system. To achieve stable and accurate localization even in the event of a sensor data failure, a fusion localization method using three sensors is proposed. Compared with the stereo VIO system [11], the monocular GDVI-Fusion localization system can run on the on-board computer TX2, making it a lightweight and robust solution. The main innovations and contributions of this article are as follows:
  • Fusion of depth information from RGBD cameras and raw measurement information from GNSS enables fast and accurate real-time estimation of the system state in case of failure of one of the sensors, such as image information or GNSS information.
  • The Nearest Neighbor Resampling (NNR) algorithm processes RGBD camera depth by assigning depth values with smaller errors to feature points lacking depth information. This increases the count of effective feature points. Additionally, depth bias and a scale factor are introduced to optimize the depth information of the feature points, enhancing the robustness of the positioning system.
  • The optimization of the best geometric constraints based on the KNN matching algorithm improves the accuracy and speed of feature matching and reduces the reprojection error, thereby enhancing the system’s positioning accuracy.

2. Related Work

Presently, various works on multi-sensor fusion state estimation algorithms show that multi-sensor localization systems are accurate and robust in localization. Most of these fusion-based localization algorithms rely on visual simultaneous localization and mapping (SLAM) techniques, which are classified into two main approaches: direct method [13,14,15] and indirect method [16,17]. Purely visual SLAM algorithms [18,19] are susceptible to challenges such as image ambiguity and strong exposure, which can result in the divergence of positional information. Additionally, monocular visual SLAM lacks the ability to observe scale information, leading to significant errors in localization outcomes due to cumulative long-term drift. Hence, to overcome these limitations and enhance localization performance, fusion algorithms integrate vision with other sensor modalities [20,21,22]. These fusion algorithms can be further classified into tightly coupled [23,24] and loosely coupled [11,25] frameworks. In the loosely coupled framework, estimates from different sensors are fused to make predictions. However, the accuracy of these estimates significantly influences the fusion modeling process and can lead to imprecise final results. The tightly coupled fusion [26,27] algorithm fuses raw measurements from different sensors with higher accuracy compared with the loosely coupled algorithm.
Algorithms for multi-sensor coupling can be divided into filter-based methods and optimization-based methods. Filter-based visual–inertial state estimation systems [28] can quickly and efficiently optimize the system state using residual constraints between camera keyframes. The nonlinear BA optimization-based visual–inertial fusion state estimation system [5] uses sliding window optimization to constrain the system state with multiple keyframes, giving better state estimation performance. However, the fusion of camera and inertial measurement unit (IMU) data only provides constraints on the local state between consecutive frames, failing to eliminate accumulated errors over long-term system operation. To mitigate this issue, the RGBD-VINS systems [29,30] incorporate depth information from RGBD cameras to reduce accumulated errors. Additionally, the fusion of GNSS raw measurement information with the visual–inertial system offers a means to decrease the accumulated drift error in the VINS system, as GNSS position measurements are unique in the Earth-Centered, Earth-Fixed frame. State estimation systems such as [31,32,33] adopt a loosely coupled approach, fusing GNSS signals, vision, and inertial sensors within the Extended Kalman Filter (EKF) framework. VINS-Fusion [11] applies a nonlinear optimization approach to loosely couple VIO state estimates with GNSS measurements. In contrast, the GNSS visual–inertial navigation system (GVINS) [1] utilizes a tightly coupled fusion method to combine GNSS raw measurement information with visual–inertial sensors. Furthermore, GVINS proposes a coarse-to-fine initial alignment method to ensure non-divergence of the system state in the presence of GNSS signal degradation. When GNSS signals deteriorate, GVINS relies solely on monocular visual and inertial data, making the positioning system prone to drift when the platform is stationary or moving slowly. Moreover, in environments with low texture or unstable lighting conditions, the system is also susceptible to drift or cumulative errors.
In this paper, GNSS raw measurement information, visual image information, depth information, and inertial sensor information are tightly coupled. The system's performance is evaluated in a complex urban environment, covering a total distance of 36.37 km. The experimental results show that the distance error between the start and end positions of the closed-loop trajectory is only 1.8 m. By employing the tightly coupled positioning method, the paper effectively addresses the issue of divergence in positioning information when visual or GNSS signals degrade. Moreover, the system autonomously performs coordinate system conversions, allowing it to operate both inside and outside tunnels while maintaining global consistency.

3. System Description

The coordinates involved in our GDVI-Fusion system include the following:
$o_i x_i y_i z_i$ (i-frame): The Earth-Centered Inertial (ECI) frame is the output reference frame of the inertial sensor, with the center of the Earth as its origin and the $o_i x_i$ axis pointing to the vernal equinox. The $o_i z_i$ axis is the Earth's rotation axis, pointing to the North Pole, and $o_i y_i$ lies in the Earth's equatorial plane. The i-frame does not rotate with the Earth.
$o_e x_e y_e z_e$ (e-frame): The Earth-Centered, Earth-Fixed (ECEF) frame has its origin at the center of the Earth. The $o_e x_e$ axis points to the prime meridian, the $o_e z_e$ axis is aligned with the Earth's rotation axis, pointing to the North Pole, and the $o_e y_e$ axis lies in the equatorial plane. The Earth's rotation rate $\omega_{ie}$ describes the angular motion between the ECEF frame and the ECI frame. GNSS measurement data are referenced to the ECEF frame.
$o_n x_n y_n z_n$ (n-frame): The navigation frame is the "East–North–Up" geographic coordinate system, to which the carrier refers when solving for navigation and positioning information, with the center of the carrier equipment as its origin and the $x_n$, $y_n$, and $z_n$ axes pointing geographic east, north, and up, respectively.
$o_b x_b y_b z_b$ (b-frame): The body frame takes the center of gravity of the carrier as its origin; $x_b$, $y_b$, and $z_b$ point right, forward, and up with respect to the carrier, respectively. We define the b-frame to coincide with the IMU frame.
$o_c x_c y_c z_c$ (c-frame): The camera frame takes the optical center of the camera as its origin; $x_c$, $y_c$, and $z_c$ point right, forward, and up with respect to the camera, respectively. The c-frame is rigidly attached to the IMU. $c_k$ denotes the kth image frame in the camera frame.
The implementation of the positioning system involves three main phases: data preprocessing, initialization, and optimization. In the preprocessing stage, the pseudorange information obtained from a GNSS signal is processed by the SPP algorithm [34], while the acceleration and angular velocity information obtained from IMU is pre-integrated, and the depth information obtained from RGBD camera is optimized [29], where the image information is subjected to feature point detection, extraction, KNN-optimized matching, and optical flow tracking. During the initialization phase, the visual construction structure from motion (SFM) [35] is aligned with the pre-integrated IMU information. If the initialization of visual–inertial odometry (VIO) fails, GNSS residual data are utilized. If VIO initialization succeeds, it undergoes alignment with GNSS for yaw alignment and anchor positioning alignment. If data alignment succeeds, the data are optimized using VIO-GNSS residual data; otherwise, only VIO residual data are utilized. In the optimization phase, the residual data go through a sliding window optimization process. The overall system implementation is depicted in Figure 1.

3.1. Measurement Preprocessing

As shown in Figure 1, the initialization module includes IMU pre-integration, optimal geometric constraints based on KNN matching, preprocessing of visual feature point depth, and processing of raw GNSS information.
IMU Pre-integration: Inertial measurement devices are susceptible to various sources of error, including accelerometer bias, accelerometer noise, gyroscope bias, and gyroscope noise. Additionally, measurements captured by an IMU on the Earth's surface contain a gravity component. For this discussion, we ignore errors related to scale factor, misalignment angle, and non-orthogonality in low-cost IMU measurements. Thus, the inertial measurement model is
$$\tilde{a}_k = a_k + b_a + R_{n}^{b_k} g^{n} + n_a, \qquad \tilde{\omega}_k = \omega_k + b_g + n_g$$
where $\tilde{a}_k$ and $\tilde{\omega}_k$ are the measured linear acceleration and angular velocity of the IMU at time k, and $a_k$ and $\omega_k$ are the corresponding true values. The accelerometer noise $n_a$ and gyroscope noise $n_g$ are assumed to follow Gaussian distributions, $n_a \sim N(0, \sigma_a^2)$ and $n_g \sim N(0, \sigma_g^2)$. $b_a$ and $b_g$ are the accelerometer bias and gyroscope bias, respectively. $R_{n}^{b_k}$ is the rotation matrix from the n-frame to the b-frame at time k, and $g^{n}$ is the gravitational acceleration in the n-frame.
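To make the measurement model concrete, the following minimal Python sketch synthesizes one IMU sample according to the equation above. It is an illustration only, not the authors' implementation; the function name, the noise standard deviations, and the gravity vector convention (here $g^n = (0, 0, -9.81)$ m/s² in the East–North–Up n-frame) are assumptions made for the example.

```python
import numpy as np

def simulate_imu_measurement(a_true, w_true, R_n_to_b, b_a, b_g,
                             sigma_a=0.02, sigma_g=0.001,
                             g_n=np.array([0.0, 0.0, -9.81])):
    """Synthesize one IMU sample following the measurement model above.

    a_true, w_true   : true linear acceleration / angular velocity in the body frame (3,)
    R_n_to_b         : rotation matrix from the n-frame to the b-frame (3, 3)
    b_a, b_g         : accelerometer / gyroscope biases (3,)
    sigma_a, sigma_g : standard deviations of the Gaussian noises n_a, n_g (assumed values)
    """
    n_a = np.random.normal(0.0, sigma_a, 3)
    n_g = np.random.normal(0.0, sigma_g, 3)
    a_meas = a_true + b_a + R_n_to_b @ g_n + n_a   # gravity leaks into the accelerometer
    w_meas = w_true + b_g + n_g
    return a_meas, w_meas
```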
Since the IMU output frequency exceeds the camera frame rate, the gyroscope and accelerometer measurements are pre-integrated [5] to obtain the relative position, velocity, and rotation of the camera between two consecutive frames over the interval $[t_k, t_{k+1}]$, as follows:
$$\alpha_{b_{k+1}}^{b_k} = \iint_{t \in [t_k, t_{k+1}]} R_t^{b_k}\left(\tilde{a}_t - b_a\right)\, dt^2, \qquad
\beta_{b_{k+1}}^{b_k} = \int_{t \in [t_k, t_{k+1}]} R_t^{b_k}\left(\tilde{a}_t - b_a\right)\, dt, \qquad
\gamma_{b_{k+1}}^{b_k} = \int_{t \in [t_k, t_{k+1}]} \tfrac{1}{2}\, \Omega\!\left(\tilde{\omega}_t - b_g\right) \gamma_t^{b_k}\, dt$$
where the pre-integrated terms $\alpha_{b_{k+1}}^{b_k}$, $\beta_{b_{k+1}}^{b_k}$, and $\gamma_{b_{k+1}}^{b_k}$ are the changes in position, velocity, and rotation between two consecutive camera frames over $[t_k, t_{k+1}]$, respectively. $R_t^{b_k}$ is the rotation matrix from the body frame at time t to the $b_k$ frame. $\Omega(\cdot)$ is defined as [5]
$$\Omega(\omega) = \begin{bmatrix} -\lfloor \omega \rfloor_\times & \omega \\ -\omega^{T} & 0 \end{bmatrix}, \qquad
\lfloor \omega \rfloor_\times = \begin{bmatrix} 0 & -\omega_z & \omega_y \\ \omega_z & 0 & -\omega_x \\ -\omega_y & \omega_x & 0 \end{bmatrix}$$
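The sketch below shows how the pre-integration above can be discretized with a simple Euler scheme; VINS-style systems typically use a midpoint rule with covariance propagation, so this is only a minimal illustration. It assumes scalar-last quaternions ($[x, y, z, w]$) so that the $\Omega(\omega)$ block layout matches the definition above; all function names are hypothetical.

```python
import numpy as np

def skew(w):
    """Skew-symmetric matrix [w]_x of a 3-vector."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def quat_to_rot(q):
    """Scalar-last unit quaternion q = [x, y, z, w] -> rotation matrix."""
    x, y, z, w = q
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)]])

def preintegrate(acc, gyr, dts, b_a, b_g):
    """Euler-discretized pre-integration of IMU samples between camera frames k and k+1.

    acc, gyr : (N, 3) accelerometer / gyroscope samples over [t_k, t_{k+1}]
    dts      : (N,) sample periods in seconds
    Returns alpha (position term), beta (velocity term), gamma (rotation, quaternion).
    """
    alpha = np.zeros(3)
    beta = np.zeros(3)
    gamma = np.array([0.0, 0.0, 0.0, 1.0])            # identity quaternion (scalar-last)
    for a, w, dt in zip(acc, gyr, dts):
        a_c, w_c = a - b_a, w - b_g                   # bias-corrected measurements
        R = quat_to_rot(gamma)                        # current rotation R_t^{b_k}
        alpha += beta * dt + 0.5 * (R @ a_c) * dt * dt
        beta += (R @ a_c) * dt
        # Quaternion kinematics: gamma_dot = 0.5 * Omega(w) * gamma
        Omega = np.zeros((4, 4))
        Omega[:3, :3] = -skew(w_c)
        Omega[:3, 3] = w_c
        Omega[3, :3] = -w_c
        gamma = gamma + 0.5 * (Omega @ gamma) * dt
        gamma /= np.linalg.norm(gamma)                # keep it a unit quaternion
    return alpha, beta, gamma
```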
Best geometric constraint algorithm based on KNN matching: We use the KNN algorithm for the initial matching of the extracted feature points. The sets of feature points in image $P_1$ and its matching image $P_2$ are denoted as $A = \{p_{11}, p_{12}, \ldots, p_{1n}\}$ and $B = \{p_{21}, p_{22}, \ldots, p_{2m}\}$, respectively.
The Euclidean distance between the descriptor of each feature point in image $P_1$ and that of each feature point in image $P_2$ is computed and stored in the matrix
$$\begin{bmatrix} d_{11} & \cdots & d_{1m} \\ \vdots & \ddots & \vdots \\ d_{n1} & \cdots & d_{nm} \end{bmatrix}_{n \times m},$$
where $d_{ij}$ is the Euclidean distance between the descriptors of feature points $p_{1i}$ and $p_{2j}$. The $i$th row vector $[d_{i1}, \ldots, d_{im}]$ contains the distances between point $p_{1i}$ and all candidate points in image $P_2$. The K (in this paper, $K = 2$) smallest values in this m-dimensional vector are identified, and the points in image $P_2$ corresponding to these K minima are taken as the K candidate matching points.
If the ratio of the distance to the nearest neighbor to the distance to the second-nearest neighbor is less than the threshold $D_{th}$ (set to 0.7 in this paper), the nearest neighbor is accepted as the matching point. After the initial matching is completed, the matching results are further optimized to reduce the false matching rate of the algorithm.
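A minimal sketch of this initial matching step using OpenCV's brute-force matcher is shown below, assuming floating-point descriptors so that the Euclidean distance described above applies (binary descriptors would use the Hamming norm instead). The function name and default threshold mirror the description above but are not the authors' code.

```python
import cv2

def knn_ratio_match(desc1, desc2, ratio=0.7):
    """Initial KNN matching (K = 2) with the distance-ratio test D_th = 0.7.

    desc1, desc2 : float descriptors of the feature points of two images
    Returns the surviving cv2.DMatch objects.
    """
    matcher = cv2.BFMatcher(cv2.NORM_L2)              # Euclidean distance, as in the text
    pairs = matcher.knnMatch(desc1, desc2, k=2)
    good = []
    for pair in pairs:
        if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
            good.append(pair[0])                      # keep the nearest neighbor only
    return good
```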
Mismatched feature points inevitably occur between neighboring frames, and when the number of mismatches is too large, system accuracy suffers. Therefore, the feature point matching algorithm needs to be optimized beyond the initial matching: false matches are eliminated using the geometric features of the matching lines, so that only point pairs satisfying the geometric constraints are retained, as shown in Figure 2. By comparing the length and slope of the matching lines of adjacent point pairs, abnormal matching pairs can be detected. This screening considers the distance $d_{ij}$ and slope $k_{ij}$ between matched points $p_i(u_i, v_i)$ and $p_j(u_j, v_j)$:
$$d_{ij} = \sqrt{\left(u_j - u_i\right)^2 + \left(v_j - v_i\right)^2}, \qquad k_{ij} = \frac{v_j - v_i}{u_j - u_i}$$
where $i, j \in [1, n]$ and n is a positive integer.
The filtering process of the KNN-based best geometric constraint matching algorithm is shown in Figure 3. Abnormal matching point pairs are filtered as follows:
(a) Construct the set of matched feature point pairs; let the set of matching-line lengths be $x_R$ and the set of slopes be $k_R$.
(b) Compute the mean value D and the root mean square error S of the sets $x_R$ and $k_R$; take D as the base value and S as the error range.
(c) Retain the feature matching pairs whose values fall within the interval $[D - S, D + S]$.
In this paper, we utilize a KNN-based matching algorithm with optimal geometric constraints to not only enhance matching efficiency but also improve the accuracy of feature point matching, ultimately enhancing the overall system localization accuracy.
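The following sketch illustrates one possible implementation of the geometric screening described above, interpreting the "root mean square error S" as the RMS deviation of the lengths and slopes from their mean; the function and variable names are illustrative and are not taken from the paper's implementation.

```python
import numpy as np

def filter_by_geometry(pts1, pts2):
    """Reject matches whose connecting-line length or slope is abnormal.

    pts1, pts2 : (N, 2) matched pixel coordinates in images P1 and P2.
    Keeps pairs whose length and slope both lie within [D - S, D + S],
    where D is the mean and S the RMS deviation of the respective set.
    """
    d = np.linalg.norm(pts2 - pts1, axis=1)                 # matching-line lengths d_ij
    du = pts2[:, 0] - pts1[:, 0]
    dv = pts2[:, 1] - pts1[:, 1]
    k = dv / np.where(np.abs(du) < 1e-9, 1e-9, du)          # matching-line slopes k_ij

    def in_band(x):
        base = x.mean()                                     # base value D
        err = np.sqrt(np.mean((x - base) ** 2))             # error range S
        return np.abs(x - base) <= err

    keep = in_band(d) & in_band(k)
    return pts1[keep], pts2[keep], keep
```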
Visual feature depth Nearest Neighbor Resampling preprocessing: The depth information measured by RGBD cameras is discontinuous and incomplete, and invalid depth readings affect the robustness of the system. In this paper, we preprocess the depth information to obtain valid initial depth values for the feature points, which makes robust and fast VIO initialization easier to achieve.
The images and depth information captured by the camera are aligned, and feature points of keyframes are extracted. This paper constructs a set $p \in \mathbb{R}^3$ of feature points with valid depth. Since the local variation of depth is gradual, a nearest neighborhood is defined around key points with invalid depth, and during feature tracking, the depth around these key points and their vicinity is resampled. The valid depths of the feature points within the nearest neighborhood of such a point are iterated with the RANSAC algorithm [36] to find a sufficient number of depth-value inliers, thereby recovering the approximate depth of the key point $p_r \in \mathbb{R}^3$. In some scenes, the recovered depth still has errors; therefore, the depth information needs to be optimized again.
The key point depth preprocessing method can roughly estimate the visual scale information and set the initial value of the scale, which is further optimized during VIO initialization to achieve fast initialization and improve the robustness of the system. Although resampled keypoints can be affected by depth Nearest Neighbor Resampling (NNR) errors, most resampled keypoints provide depth values that are very close to the corresponding expected values. Depth NNR focuses on extending the set of feature points with valid depth values, and we use the resampled keypoints when tracking feature points to improve robustness when the depth map is severely incomplete.
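As an illustration of the NNR idea under the stated assumptions (a small pixel neighborhood and a RANSAC-style consensus over candidate depths), the sketch below recovers an approximate depth for a keypoint whose own reading is invalid. The window radius, inlier tolerance, and iteration count are illustrative values, not those of the paper.

```python
import numpy as np

def nnr_depth(depth_map, u, v, radius=3, inlier_tol=0.025, min_inliers=5, iters=20):
    """Nearest Neighbor Resampling sketch: recover an approximate depth for a
    keypoint (u, v) whose own depth reading is invalid (0 or NaN).

    Collects valid depths in a (2*radius+1)^2 window and keeps the largest
    consensus set within `inlier_tol` metres, mimicking a 1-D RANSAC vote.
    """
    h, w = depth_map.shape
    u0, u1 = max(0, u - radius), min(w, u + radius + 1)
    v0, v1 = max(0, v - radius), min(h, v + radius + 1)
    patch = depth_map[v0:v1, u0:u1].ravel()
    candidates = patch[np.isfinite(patch) & (patch > 0)]
    if candidates.size < min_inliers:
        return None                              # not enough support, leave depth invalid

    rng = np.random.default_rng(0)
    best_inliers = np.array([])
    for _ in range(iters):
        hypothesis = rng.choice(candidates)      # sample one depth hypothesis
        inliers = candidates[np.abs(candidates - hypothesis) < inlier_tol]
        if inliers.size > best_inliers.size:
            best_inliers = inliers
    if best_inliers.size < min_inliers:
        return None
    return float(best_inliers.mean())            # approximate depth of the keypoint
```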
GNSS raw information measurement: We incorporate GNSS measurements, including pseudorange measurements and Doppler shift measurements, in our system. These measurements are used to estimate the receiver position using the Single Point Positioning (SPP) algorithm [34]. Throughout the process of GNSS measurements, various errors can arise, such as orbital errors, clock deviations, and ionospheric and tropospheric errors. Therefore, we formulate the pseudorange measurement model to account for these errors:
$$\tilde{P}_{r_t}^{s} = \left\| p_{s_t}^{i} - p_{r_t}^{i} \right\| + c\left( \delta_{r_t} - \delta_{s_t} \right) + I_{r_t}^{s} + T_{r_t}^{s} + \varepsilon_{r_t}^{s}$$
where, at time t, $p_{s_t}^{i}$ and $p_{r_t}^{i}$ denote the coordinates of satellite s and receiver r in the i-frame, respectively; $\delta_{r_t}$ denotes the clock bias of the receiver; $\delta_{s_t}$ denotes the clock bias of the satellite; c denotes the speed of light; $I_{r_t}^{s}$ and $T_{r_t}^{s}$ denote the ionospheric and tropospheric delays, respectively; $\varepsilon_{r_t}^{s}$ denotes the measurement noise; and $\|\cdot\|$ denotes the Euclidean norm.
The Doppler effect introduces a frequency difference between the signals transmitted by the GNSS satellites and received by the receiver, known as the Doppler shift. This Doppler shift is utilized to represent the relative motion between the GNSS receiver and the satellite. It is influenced by factors such as clock frequency deviations and measurement noise. Thus, the Doppler shift model is expressed as
$$\Delta \tilde{f}_{r_t}^{s} = \frac{1}{\lambda} \left[ \kappa_{r_t}^{s\,T} \left( v_{s_t}^{i} - v_{r_t}^{i} \right) + c \left( \dot{\delta t}_{r} - \dot{\Delta t}_{s} \right) + \eta_{r}^{s} \right]$$
where, in the i-frame at time t, $\kappa_{r_t}^{s}$ denotes the unit vector from the receiver to the satellite; $\lambda$ denotes the wavelength of the carrier signal; $v_{s_t}^{i}$ and $v_{r_t}^{i}$ denote the velocities of the satellite and the receiver, respectively; c denotes the speed of light; $\dot{\delta t}_{r}$ denotes the clock drift rate of the receiver; $\dot{\Delta t}_{s}$ denotes the clock drift rate of the satellite; and $\eta_{r}^{s}$ denotes the Doppler measurement noise.
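For reference, the two GNSS measurement models above can be transcribed directly into code. The sketch below is a plain rendering of the equations with assumed function names; it does not model any particular receiver's sign conventions for the clock terms.

```python
import numpy as np

C = 299792458.0   # speed of light, m/s

def pseudorange(p_sat, p_rcv, clk_rcv, clk_sat, iono, tropo, noise=0.0):
    """Pseudorange model: geometric range + clock terms + atmospheric delays."""
    return np.linalg.norm(p_sat - p_rcv) + C * (clk_rcv - clk_sat) + iono + tropo + noise

def doppler_shift(p_sat, p_rcv, v_sat, v_rcv, clk_drift_rcv, clk_drift_sat,
                  wavelength, noise=0.0):
    """Doppler model: line-of-sight relative velocity plus clock-drift terms."""
    kappa = (p_sat - p_rcv) / np.linalg.norm(p_sat - p_rcv)  # unit vector receiver -> satellite
    return (kappa @ (v_sat - v_rcv) + C * (clk_drift_rcv - clk_drift_sat) + noise) / wavelength
```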

3.2. Estimator Initialization

The state estimation of a multi-sensor fusion system is inherently nonlinear. As a result, the accuracy and robustness of the system heavily rely on the initialization values. In this paper, we propose a method where the scale initial values of depth information, IMU pre-integration, and visual information are loosely coupled and processed to obtain the initial values of gyroscope bias, acceleration bias, and gravity vector through online optimization.
With known scale initial values, the fusion of visual and IMU information requires knowing the initial values of gravity vector, velocity, and IMU deviation. We solve the problem in two steps.
Gyroscope bias: In consecutive frames k and k + 1 of the sliding window, according to the constraints on the gyroscope measurements and the camera keyframe orientation, the equation for the residuals of the rotation term is defined as follows:
$$r_k^{g}(b_g) = \mathrm{Log}\!\left( \left( \Delta R_{k,k+1}\, \mathrm{Exp}\!\left( J_{\Delta R}^{g}\, b_g \right) \right)^{T} R_k^{T} R_{k+1} \right)$$
where $\Delta R_{k,k+1}$ denotes the change in rotation from frame k to frame k+1; $J_{\Delta R}^{g}$ denotes the Jacobian of the pre-integrated rotation with respect to the gyroscope bias, whose analytic expression can be found in reference [37]; $R_k^{T}$ is the transpose of the rotation at time k; and $R_{k+1}$ is the rotation at time $k+1$.
The gyroscope bias is estimated iteratively with the Levenberg–Marquardt (LM) algorithm, and the optimization problem is
$$\chi_g = \arg\min_{b_g} \left\| r_p^{g}(b_g) \right\|_{p_g}^{2} + \sum_{k} \left\| r_k^{g}(b_g) \right\|_{k_g}^{2}$$
where $r_p^{g}(b_g)$ is the prior residual, which can be ignored because the iteration starts from zero bias, and $r_k^{g}(b_g)$ denotes the gyroscope bias residual term.
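A compact sketch of this gyroscope-bias estimation step is given below, using SciPy's Levenberg–Marquardt solver and rotation-vector Log/Exp maps in place of a hand-written optimizer. The inputs and names are assumptions for illustration; a real system would use the analytic Jacobians of [37].

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation as Rot

def estimate_gyro_bias(delta_R_preint, J_dR_bg, R_cam):
    """Estimate the gyroscope bias from pre-integrated rotations and the camera
    keyframe rotations of the sliding window (LM on the rotation residual).

    delta_R_preint : list of pre-integrated rotations Delta R_{k,k+1} (3x3 arrays)
    J_dR_bg        : list of Jacobians of Delta R w.r.t. the gyro bias (3x3 arrays)
    R_cam          : list of keyframe rotations R_k from visual SFM, length N + 1
    """
    def residuals(b_g):
        res = []
        for k, (dR, J) in enumerate(zip(delta_R_preint, J_dR_bg)):
            corrected = dR @ Rot.from_rotvec(J @ b_g).as_matrix()   # Delta R * Exp(J * b_g)
            err = corrected.T @ R_cam[k].T @ R_cam[k + 1]           # should be near identity
            res.append(Rot.from_matrix(err).as_rotvec())            # Log(.)
        return np.concatenate(res)

    sol = least_squares(residuals, x0=np.zeros(3), method='lm')
    return sol.x
```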
Accelerometer bias, gravity vector, scale, velocity initialization: After the gyroscope bias is initialized, the accelerometer bias, gravity vector, scale, and velocities are computed for two consecutive frames k and k+1 [38]. The variables to be estimated are
$$\chi_I = \left[ v_{b_0}, v_{b_1}, \ldots, v_{b_n}, g^{c_0}, s, b_a \right]$$
where $v_{b_k}$ is the velocity of the frame-k image in the body frame, $g^{c_0}$ is the gravitational acceleration in the camera $c_0$ coordinate system, $b_a$ is the accelerometer bias, and s is the scale parameter.
The pre-integrated IMU position term $\alpha_{b_{k+1}}^{b_k}$ and velocity term $\beta_{b_{k+1}}^{b_k}$ between two consecutive frames k and k+1 are expressed in the body frame as follows:
$$\begin{aligned}
\alpha_{b_{k+1}}^{b_k} &= R_{c_0}^{b_k}\left( s\left( \bar{p}_{c_{k+1}} - \bar{p}_{c_k} \right) - \tfrac{1}{2} g^{c_0} \Delta t_k^{2} - R_{b_k}^{c_0} v_{b_k} \Delta t_k \right) + J_{\Delta p_{k,k+1}}^{b_a} b_a \\
\beta_{b_{k+1}}^{b_k} &= R_{c_0}^{b_k}\left( R_{b_{k+1}}^{c_0} v_{b_{k+1}} + g^{c_0} \Delta t_k - R_{b_k}^{c_0} v_{b_k} \right) + J_{\Delta v_{k,k+1}}^{b_a} b_a
\end{aligned}$$
where $\bar{p}_{c_k}$ is the position at time k in the $c_0$ frame; $\Delta t_k$ is the time difference between times k and k+1; $R_{c_0}^{b_k}$ is the rotation matrix from the $c_0$ frame to the $b_k$ frame; and $J_{\Delta p_{k,k+1}}^{b_a}$ and $J_{\Delta v_{k,k+1}}^{b_a}$ are the Jacobian matrices of the pre-integrated position and velocity with respect to the accelerometer bias, respectively. Reference [37] gives the analytic expressions for the pre-integration Jacobians and the uncertainty propagation. The least squares optimization model is established as follows:
$$\min_{\chi_I} \sum_{k \in N} \left\| \hat{z}_{b_{k+1}}^{b_k} - H_{b_{k+1}}^{b_k} \chi_I \right\|^{2}$$
From (9), we obtain the following linear measurement model:
$$\hat{z}_{b_{k+1}}^{b_k} = \begin{bmatrix} \hat{\alpha}_{b_{k+1}}^{b_k} - p_{c}^{b} + R_{c_0}^{b_k} R_{b_{k+1}}^{c_0} p_{c}^{b} \\ \hat{\beta}_{b_{k+1}}^{b_k} \end{bmatrix} = H_{b_{k+1}}^{b_k} \chi_I + n_{b_{k+1}}^{b_k}$$
where
$$H_{b_{k+1}}^{b_k} = \begin{bmatrix}
-I \Delta t_k & 0 & \tfrac{1}{2} R_{c_0}^{b_k} \Delta t_k^{2} & R_{c_0}^{b_k}\left( \bar{p}_{c_{k+1}} - \bar{p}_{c_k} \right) & -J_{\Delta p_{k,k+1}}^{b_a} \\
-I & R_{c_0}^{b_k} R_{b_{k+1}}^{c_0} & R_{c_0}^{b_k} \Delta t_k & 0 & -J_{\Delta v_{k,k+1}}^{b_a}
\end{bmatrix}$$
The linear least squares model solution procedure is as follows:
$$H_{b_{k+1}}^{b_k} \chi_I = \hat{z}_{b_{k+1}}^{b_k}
\;\Rightarrow\;
\left( H_{b_{k+1}}^{b_k} \right)^{T} H_{b_{k+1}}^{b_k} \chi_I = \left( H_{b_{k+1}}^{b_k} \right)^{T} \hat{z}_{b_{k+1}}^{b_k}
\;\Rightarrow\;
\chi_I = \left( \left( H_{b_{k+1}}^{b_k} \right)^{T} H_{b_{k+1}}^{b_k} \right)^{-1} \left( H_{b_{k+1}}^{b_k} \right)^{T} \hat{z}_{b_{k+1}}^{b_k}$$
According to the least squares optimization model, the velocity of each frame in the body frame, the acceleration of gravity in the camera frame, and accelerometer bias are obtained.
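In code, this linear step reduces to stacking the per-frame measurement models and solving the normal equations. The sketch below uses a numerically safer least-squares solve, which is equivalent to $\chi_I = (H^T H)^{-1} H^T \hat{z}$ when H has full column rank; names are illustrative.

```python
import numpy as np

def solve_initial_state(H_blocks, z_blocks):
    """Stack the per-frame linear measurement models and solve for chi_I
    (velocities, gravity in c0, scale, accelerometer bias). Equivalent to
    chi_I = (H^T H)^{-1} H^T z, but solved with lstsq for numerical safety."""
    H = np.vstack(H_blocks)
    z = np.concatenate(z_blocks)
    chi_I, *_ = np.linalg.lstsq(H, z, rcond=None)
    return chi_I
```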

3.3. Factor Graph Model

Formulation: Our system uses a nonlinear optimization approach, and the system state can be summarized as
$$\chi = \left[ x_i, x_{i+1}, x_{i+2}, \ldots, x_j \right], \qquad
x_k = \left[ q_{b_k}^{n}, v_{b_k}^{n}, p_{b_k}^{n}, s_k, b_{a_k}, b_{g_k}, \delta t_k, \delta \dot{t}_k \right], \quad k \in [i, j]$$
where, within the sliding window spanning frames $[i, j]$, the state of frame k includes the attitude $q_{b_k}^{n}$, velocity $v_{b_k}^{n}$, position $p_{b_k}^{n}$, scale $s_k$, accelerometer bias $b_{a_k}$, gyroscope bias $b_{g_k}$, receiver clock bias $\delta t_k$, and receiver clock drift rate $\delta \dot{t}_k$.
We analyze the factor graph model, with residual factors constructed for each sensor, for back-end nonlinear optimization. The factor graph is shown in Figure 4. The maximum a posteriori probability model is built as follows:
$$\chi^{*} = \arg\max_{\chi} p(\chi \mid z) = \arg\max_{\chi} p(\chi)\, p(z \mid \chi)
= \arg\min_{\chi} \left\{ \left\| r_p \right\|^{2}
+ \sum_{k} \left\| r_B\!\left( \tilde{z}_{b_{k+1}}^{b_k}, \chi \right) \right\|_{k}^{2}
+ \sum_{k} \left\| r_c\!\left( \tilde{z}_{l}, \chi \right) \right\|_{k}^{2}
+ \sum_{k} \left\| r_p\!\left( \tilde{z}_{r_k}^{s_j}, \chi \right) \right\|_{k}^{2}
+ \sum_{k} \left\| r_D\!\left( \tilde{z}_{r_k}^{s_j}, \chi \right) \right\|_{k}^{2} \right\}$$
$r_p$ denotes the a priori information about the system state. $r_B(\tilde{z}_{b_{k+1}}^{b_k}, \chi)$, $r_c(\tilde{z}_l, \chi)$, $r_p(\tilde{z}_{r_k}^{s_j}, \chi)$, and $r_D(\tilde{z}_{r_k}^{s_j}, \chi)$ denote the IMU residuals, visual residuals, GNSS pseudorange residuals, and Doppler shift residuals, respectively. We decompose the optimization problem into individual factors related to states and measurements. Figure 4 shows the factor graph of our system. In addition to the factors derived from the measurements, an a priori factor constrains the sliding-window system state. We discuss each factor in detail below.
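Conceptually, the MAP problem above is a nonlinear least squares over the stacked residual factors. The sketch below shows that structure with a generic SciPy solver and hypothetical factor callables; a real implementation would rely on an incremental solver such as Ceres or GTSAM with analytic Jacobians, robust losses, and marginalization, none of which are shown here.

```python
import numpy as np
from scipy.optimize import least_squares

def optimize_window(x0, imu_factors, visual_factors, pseudorange_factors,
                    doppler_factors, prior_factor=None):
    """Sliding-window MAP sketch: each factor is a callable mapping the stacked
    state vector to a whitened residual vector; their squared norms are summed
    exactly as in the cost above."""
    factors = ([prior_factor] if prior_factor is not None else []) \
              + imu_factors + visual_factors + pseudorange_factors + doppler_factors

    def stacked_residuals(x):
        return np.concatenate([f(x) for f in factors])

    result = least_squares(stacked_residuals, x0, method='trf')
    return result.x
```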
Inertial measurement unit (IMU) residual factor: From the IMU measurement model in Equations (3) and (11), the residual of the IMU in the sliding window time is defined as
$$r_B\!\left( \tilde{z}_{b_{k+1}}^{b_k}, \chi \right) =
\begin{bmatrix}
\delta \alpha_{b_{k+1}}^{b_k} \\ \delta \beta_{b_{k+1}}^{b_k} \\ \delta \theta_{b_{k+1}}^{b_k} \\ \delta b_a \\ \delta b_g
\end{bmatrix} =
\begin{bmatrix}
R_{n}^{b_k}\!\left( p_{b_{k+1}}^{n} - p_{b_k}^{n} + \tfrac{1}{2} g^{n} \Delta t_k^{2} - v_{b_k}^{n} \Delta t_k \right) - \hat{\alpha}_{b_{k+1}}^{b_k} \\
R_{n}^{b_k}\!\left( v_{b_{k+1}}^{n} + g^{n} \Delta t_k - v_{b_k}^{n} \right) - \hat{\beta}_{b_{k+1}}^{b_k} \\
2\left[ \left( q_{b_{k+1}}^{n} \right)^{-1} \otimes q_{b_k}^{n} \otimes \hat{\gamma}_{b_{k+1}}^{b_k} \right]_{xyz} \\
b_{a_{k+1}} - b_{a_k} \\
b_{g_{k+1}} - b_{g_k}
\end{bmatrix}$$
where the vector part of the quaternion q is extracted by $[\cdot]_{xyz}$ to represent the error state. $\delta \alpha_{b_{k+1}}^{b_k}$, $\delta \beta_{b_{k+1}}^{b_k}$, $\delta \theta_{b_{k+1}}^{b_k}$, $\delta b_a$, and $\delta b_g$ denote the residual terms of position, velocity, attitude, accelerometer bias, and gyroscope bias between two consecutive frames, respectively. $\otimes$ denotes quaternion multiplication.
Depth-scaled BA based on NNR: NNR can extract enough effective depth information from discontinuous and incomplete depth images, but the recovered depths contain errors that can cause the SLAM system to fail. In this paper, we therefore introduce a depth bias $\delta\lambda$ for optimization, assuming that the depth is constant over a short time. After extracting the feature points of consecutive keyframes and correcting them for lens distortion, the visual projection is modeled as
$$\tilde{P} = \pi_c\!\left( R_{b}^{c}\left( R_{n}^{b} x^{n} + p_{n}^{b} \right) + p_{b}^{c} \right) + n_c$$
where $\{R_b^c, p_b^c\}$ are the rotation and translation from the IMU (body) coordinate system to the camera coordinate system; $\{R_n^b, p_n^b\}$ are the rotation and translation from the navigation frame to the body frame; $n_c$ is the measurement noise; $x^n$ is the 3D feature point in the navigation frame; and $\pi_c(\cdot)$ is the camera projection function. Therefore, at time k, the reprojection residual of feature point l projected from frame i to frame j can be expressed as
$$r_c\!\left( \tilde{z}_l, \chi \right) = \tilde{P}_l^{c_{k_j}} - \pi_c\!\left( \hat{x}_l^{c_{k_j}} \right)$$
where
$$\hat{x}_l^{c_{k_j}} = R_b^c\left( R_n^{b_{k_j}}\left( R_{b_{k_i}}^{n}\left( R_c^b\, \frac{\lambda_l}{1+\delta\lambda_l}\, \pi_c^{-1}\!\left( \tilde{P}_l^{c_{k_i}} \right) + p_c^b \right) + p_{b_{k_i}}^{n} \right) + p_{n}^{b_{k_j}} \right) + p_b^c,$$
$\delta\lambda_l$ is the depth bias of 3D feature point l. The depth value $\lambda_l^{\mathrm{new}}$ of feature point l is updated as
$$\lambda_l^{\mathrm{new}} = \frac{1}{1 + \delta\lambda_l}\, \lambda_l^{\mathrm{old}}$$
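The reprojection residual with the depth-bias correction can be sketched as follows under simplifying assumptions (pinhole intrinsics K, no distortion, rotation matrices instead of quaternions); the variable names and frame conventions are illustrative rather than the paper's exact notation.

```python
import numpy as np

def reprojection_residual(p_obs_j, p_obs_i, depth_i, delta_lambda,
                          R_bc, p_bc, R_nb_i, p_nb_i, R_nb_j, p_nb_j, K):
    """Reprojection residual of a feature seen in frame i and re-observed in frame j,
    with the depth-bias correction lambda_new = lambda_old / (1 + delta_lambda).

    p_obs_i, p_obs_j : observed pixel coordinates in frames i and j
    R_bc, p_bc       : camera-to-body rotation / translation (extrinsics)
    R_nb_*, p_nb_*   : body-to-navigation rotation / body position in the n-frame
    K                : 3x3 pinhole intrinsic matrix (no distortion assumed)
    """
    ray = np.linalg.inv(K) @ np.array([p_obs_i[0], p_obs_i[1], 1.0])
    x_ci = (depth_i / (1.0 + delta_lambda)) * ray           # 3D point in camera frame i
    x_bi = R_bc @ x_ci + p_bc                                # camera i -> body i
    x_n = R_nb_i @ x_bi + p_nb_i                             # body i   -> navigation
    x_bj = R_nb_j.T @ (x_n - p_nb_j)                         # navigation -> body j
    x_cj = R_bc.T @ (x_bj - p_bc)                            # body j   -> camera j
    proj = K @ x_cj
    return np.asarray(p_obs_j) - proj[:2] / proj[2]          # pixel-space residual
```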
GNSS signal residual factor: For a given system state, the receiver position in the ECEF frame can be computed at any time. Since GNSS measurements are time-stamped by the receiver, we define the ECI frame at the moment of signal reception to coincide with the ECEF frame, so that $p_r^i = p_r^e$ at reception. From the broadcast ephemeris and the pseudorange measurement, the satellite position in the ECEF frame at the moment of signal transmission, $p_s^e$, can be obtained. However, because of the Earth's rotation rate $\omega_{ie}$, the ECEF frame at the time the signal leaves the satellite differs from the ECEF frame at the time the signal arrives at the receiver. Therefore, the satellite position must be converted to the ECI coordinate system, as follows:
$$p_{s_j}^{i} = R_e^{i} \left( I + T\, \lfloor \omega_{ie} \rfloor_\times \right) p_{s_j}^{e}$$
The pseudorange model is given in (4). Therefore, the pseudorange residual of the $j$th satellite received by GNSS receiver r at epoch k can be expressed as
$$r_p\!\left( \tilde{z}_{r_k}^{s_j}, \chi \right) = \left\| p_{s_j}^{i} - \left( R_n^{i}\, p_{r_k}^{n} + p_b^{i} \right) \right\| + c\left( \delta t_{r_k} - \delta t_{s_j} \right) + T_{r_k}^{s_j} + I_{r_k}^{s_j} - \tilde{P}_{r_k}^{s_j}$$
The Doppler shift describes the phenomenon in which the frequency received by the receiver differs from the frequency transmitted by the satellite when there is relative motion between the two, as expressed by (5). The Doppler measurement is defined in the ECEF frame at the moment the satellite transmits the signal. By defining the ECI frame to coincide with the ECEF frame at reception, we have $v_r^i = v_r^e$. The satellite velocity $v_s^e$ in the transmission ECEF frame is then transformed into the ECI frame, as follows:
$$v_{s_j}^{i} = R_e^{i} \left( I + T\, \lfloor \omega_{ie} \rfloor_\times \right) v_{s_j}^{e}$$
Therefore, the Doppler shift residual of the $j$th satellite received by GNSS receiver r at epoch k can be expressed as
$$r_D\!\left( \tilde{z}_{r_k}^{s_j}, \chi \right) = \frac{1}{\lambda}\, \kappa_{r_k}^{s_j\,T} \left( v_{s_j}^{i} - R_n^{i}\, v_{r_k}^{n} \right) + \frac{c}{\lambda} \left( \dot{\delta t}_{r_k} - \dot{\Delta t}_{s_j} \right) + \Delta \tilde{f}_{r_k}^{s_j}$$
where $\{R_n^i, p_b^i\}$ denote the rotation and translation between the body frame and the ECI frame.
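A minimal sketch of the Earth-rotation correction and the pseudorange residual above is given below, assuming the ECI frame is defined coincident with the ECEF frame at the reception time (so that $R_e^i$ reduces to the identity) and that T is the signal flight time; function names and argument layouts are illustrative.

```python
import numpy as np

OMEGA_IE = 7.2921151467e-5   # Earth rotation rate, rad/s
C = 299792458.0              # speed of light, m/s

def sat_ecef_to_eci(p_sat_ecef, flight_time):
    """First-order Earth-rotation correction (I + T [omega_ie]x) applied to the
    satellite position, assuming R_e^i = I at the reception epoch."""
    w = np.array([0.0, 0.0, OMEGA_IE])
    w_skew = np.array([[0.0, -w[2], w[1]],
                       [w[2], 0.0, -w[0]],
                       [-w[1], w[0], 0.0]])
    return (np.eye(3) + flight_time * w_skew) @ p_sat_ecef

def pseudorange_residual(p_sat_eci, p_rcv_n, R_ni, p_bi, clk_rcv, clk_sat,
                         tropo, iono, pseudorange_meas):
    """Single pseudorange residual of one satellite at one epoch."""
    p_rcv_eci = R_ni @ p_rcv_n + p_bi             # receiver position lifted into the ECI frame
    rho = np.linalg.norm(p_sat_eci - p_rcv_eci)   # geometric range
    return rho + C * (clk_rcv - clk_sat) + tropo + iono - pseudorange_meas
```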

4. Experimental Results

We conduct experiments using the TUM RGBD dataset, as well as real UAV and vehicle tests, to evaluate the proposed GDVI-Fusion system. The TUM RGBD dataset provides time-stamped color and depth images, which we use to analyze the performance of the depth NNR algorithm. In the real UAV and vehicle experiments, this system is compared with renowned open-source systems such as VINS-Mono [5], VINS-Fusion [11], ORB-SLAM3 [10], and the GVINS [1] system. These comparative experiments focus on operational efficiency, scale error, and root mean square error, aiming to validate the localization accuracy and robustness of the GDVI-Fusion system. The system uses the processor Jetson TX2 with 256-core NVIDIA Pascal GPU, ARMv8 64-bit Cortex-A57 CPU, 8 GB memory, and D455 model camera, with the hardware shown in Figure 5 and Figure 6. The hardware parameters are shown in Table 1.

4.1. Dataset Comparison

We utilize the TUM RGBD dataset to validate the performance of the depth NNR method proposed in this paper. Depth NNR, with a radius of 3 pixels, is applied to each feature point that has valid depth information. We analyze the accuracy of depth NNR and its impact on the robustness of the system's localization. Table 2 displays the mean absolute deviation (MAD) and mean relative deviation (MRD) of the resampled depth values compared with the actual values. In the TUM/freiburg1 and TUM/freiburg3 sequences, almost 90% of the resampled depth values demonstrate high accuracy (MAD < 25 mm, MRD < 2.5%). In the TUM/freiburg2 sequences, about 70% of the resampled depth values achieve this level of accuracy. The depth NNR sampling method enhances the stability of visual scale information, mitigates the dispersion of system localization information due to scale information degradation, and effectively improves the robustness of the localization system.

4.2. UAV Experiments

In this experiment, the UAV flew in an indoor environment, an outdoor environment, and an outdoor environment passing through a courtyard building to evaluate the average scale error of the GDVI-Fusion system. For each flight, the scale and trajectory estimated by the system were aligned with the true scale and ground-truth trajectory to solve for the average scale error. As shown in Figure 7, Figure 8 and Figure 9 for the UAV tests, only the open-loop VIO was used, for the purpose of scale estimation comparison. The experiments show that the inertial measurement unit of the RealSense D455 has mediocre performance and the estimated scale information is inaccurate, which leads to a large drift of the monocular VINS.
Table 3 displays the mean scale errors estimated by different systems. The depth NNR and depth optimization algorithm, as proposed by GDVI-Fusion, achieves an average scale error of 0.57% in outdoor environment tests, 0.12% in indoor environment tests, and 0.63% in outdoor crossing courtyard building environment tests. These results indicate that GDVI-Fusion can enhance the accuracy of scale estimation by up to 20% compared with other algorithms.
Scale error significantly impacts the accuracy of system localization, with excessive scale error potentially leading to dispersion of localization information. This study compares the actual localization trajectories generated by different algorithms against the real ground trajectories and calculates the root mean square errors (RMSE) of these trajectories to evaluate the effects of scale errors on the positioning accuracy of the system. Table 4 presents the accuracy of UAV flight positioning trajectories generated by various algorithms in indoor, outdoor, and outdoor crossing courtyard building environments.

4.3. Vehicle-Based Experiments

The device is installed on the test vehicle to verify the tracking time per frame for feature matching, system operation time, and average reprojection error of different localization algorithms during vehicle operation. As shown in Table 5, GDVI-Fusion enhances feature matching efficiency by almost 66%, improves system operation efficiency by approximately 50%, and reduces the average reprojection error to only 0.89 compared with other algorithms. The experiment demonstrates that our KNN-based initial matching and optimization algorithms effectively increase the efficiency and accuracy of feature matching. Our GDVI-Fusion system requires minimal time for feature detection and tracking and maintains a low reprojection error, contributing to improved operational efficiency, positioning accuracy, and robustness of the localization system. The feature tracking match is shown in Figure 10.
To evaluate the localization accuracy and robustness of various algorithms in urban complex environments characterized by high-rise buildings, numerous obstructions, significant light variations, and tunnels that impact sensor performance, we conduct a vehicle test in the Jimei District of Xiamen City. This test spans a total distance of 36.37 km over 1 h. The vehicle’s trajectory, depicted in Figure 11, includes areas with tall buildings, dim and bright lighting conditions, rapid vehicle movement, and high dynamic scenarios that challenge the accuracy of single sensors.
In these testing conditions, VINS-Mono experiences substantial visual disruptions upon entering brightly lit areas, leading to significant localization offsets and noticeable drift. VINS-Fusion encounters large fluctuations in localization data as the vehicle accelerates or passes through tunnels, struggling to accurately maintain vehicle localization. Conversely, both the GVINS and the GDVI-Fusion algorithms, as proposed in this paper, successfully complete the urban complex environment test, demonstrating resilience against single-sensor positioning degradation. The test results, presented in Table 6, indicate that, over a 1 h period and a total distance of 36.37 km, the GVINS system incurs a horizontal positioning error of 6.782 m (95% CI: [5.43, 8.14] m) and a root mean square error of 8.729 m (95% CI: [6.98, 10.48] m). In contrast, the GDVI-Fusion system achieves a horizontal positioning error of 4.667 m (95% CI: [3.73, 5.60] m) and a root mean square error of 5.729 m (95% CI: [4.58, 6.87] m). We conducted a test in a GPS-denied environment (tunnel) to validate the positioning accuracy of different positioning algorithms. The horizontal positioning error and root mean square error (RMSE) for the VINS-Mono system are 45.352 m (95% CI: [44.352, 46.352] m) and 46.252 m (95% CI: [45.252, 47.252] m), respectively. For the VINS-Fusion system, the horizontal positioning error and RMSE are 57.755 m (95% CI: [56.755, 58.755] m) and 59.364 m (95% CI: [58.364, 60.364] m), respectively. For the GVINS system, the horizontal positioning error and RMSE are 26.224 m (95% CI: [25.224, 27.224] m) and 28.014 m (95% CI: [27.014, 29.014] m), respectively. For the GDVI-Fusion system, the horizontal positioning error is 4.017 m (95% CI: [3.017, 5.017] m) and the RMSE is 5.889 m (95% CI: [4.889, 6.889] m).
The horizontal positioning accuracy of the proposed algorithm improves by 87.2% compared with that of VINS-Mono, 90.1% compared with that of VINS-Fusion, and 37.1% compared with that of GVINS, as shown in Table 7. Therefore, these results validate the GDVI-Fusion system’s robustness and accuracy in navigating complex urban environments.

5. Conclusions

The RGBD visual–inertial and GNSS tightly coupled, nonlinear optimization positioning system (GDVI-Fusion) proposed here addresses the issue of positioning drift in navigation systems, particularly in complex environments or GNSS-failure scenarios. First, during initialization, the depth information is preprocessed, mitigating the effects of lighting and physical structure on the RGBD camera and thus improving the accuracy of the image feature point depth information. Second, the KNN algorithm is used to match the feature points, and optimal geometric constraints are applied to the matched point pairs. This process eliminates feature matches with abnormal matching-line lengths and slopes, thereby enhancing feature matching efficiency and accuracy. Finally, the image information and depth data from the RGBD camera are tightly coupled with the raw GNSS measurement data through factor graph optimization of the system state. In urban environment vehicle tests, our algorithm improves operational efficiency by 54.2% over GVINS and 40.1% over ORB-SLAM3, and our positioning accuracy improves by 37.1% compared with that of GVINS.

Author Contributions

Conceptualization, J.P. and G.Y.; methodology, J.P.; software, J.P. and X.P.; validation, J.P. and K.Y.; formal analysis, J.P. and X.Z.; writing—original draft preparation, J.P.; writing—review and editing, J.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
VIO   visual–inertial odometry
VINS  visual–inertial navigation system
GNSS  Global Navigation Satellite System
IMU   inertial measurement unit
NNR   Nearest Neighbor Resampling
KNN   K-Nearest-Neighbors

References

  1. Cao, S.; Lu, X.; Shen, S. GVINS: Tightly Coupled Gnss–visual–inertial Fusion for Smooth and Consistent State Estimation. IEEE Trans. Robot. 2022, 38, 2004–2021. [Google Scholar] [CrossRef]
  2. Du, T.; Shi, S.; Zeng, Y.; Yang, J.; Guo, L. An Integrated Ins/lidar Odometry/polarized Camera Pose Estimation via Factor Graph Optimization for Sparse Environment. IEEE Trans. Instrum. Meas. 2022, 71, 1–11. [Google Scholar] [CrossRef]
  3. He, X.; Li, B.; Qiu, S.; Liu, K. Visual–Inertial Odometry of Structured and Unstructured Lines Based on Vanishing Points in Indoor Environments. Appl. Sci. 2024, 14, 1990. [Google Scholar] [CrossRef]
  4. Wang, Y.; Zhang, D.; Su, Y.; Lian, C.; Deng, K. A Parallel Nonlinear Factor Recovery Method for VIO Based on a Factor Graph. IEEE Trans. Instrum. Meas. 2024, 73, 1–15. [Google Scholar] [CrossRef]
  5. Qin, T.; Li, P.; Shen, S. Vins-mono: A Robust and Versatile Monocular Visual-inertial State Estimator. IEEE Trans. Robot. 2018, 34, 1004–1020. [Google Scholar] [CrossRef]
  6. Engel, J.; Koltun, V.; Cremers, D. Direct Sparse Odometry. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 611–625. [Google Scholar] [CrossRef] [PubMed]
  7. Huang, G.; Kaess, M.; Leonard, J.J. Towards consistent visual-inertial navigation. In Proceedings of the 2014 IEEE International Conference on Robotics and Automation (ICRA), Hong Kong, China, 31 May–7 June 2014; pp. 4926–4933. [Google Scholar]
  8. Liu, C.; Wang, T.; Li, Z.; Tian, P. A Novel Real-Time Autonomous Localization Algorithm Based on Weighted Loosely Coupled Visual–Inertial Data of the Velocity Layer. Appl. Sci. 2025, 15, 989. [Google Scholar] [CrossRef]
  9. Tian, Z.; Cheng, Y.; Yao, S. An Adaptive Fast Incremental Smoothing Approach to INS/GPS/VO Factor Graph Inference. Appl. Sci. 2024, 14, 5691. [Google Scholar] [CrossRef]
  10. Campos, C.; Elvira, R.; Rodriguez, J.J.G.; Montiel, J.M.M.; Tardos, J.D. ORB-SLAM3: An Accurate Open-source Library for Visual, Visual–inertial, and Multimap SLAM. IEEE Trans. Robot. 2021, 37, 1874–1890. [Google Scholar] [CrossRef]
  11. Qin, T.; Cao, S.; Pan, J.; Shen, S. A general optimization-based framework for global pose estimation with multiple sensors. arXiv 2019, arXiv:1901.03642. [Google Scholar] [CrossRef]
  12. Mur-Artal, R.; Tardos, J.D. ORB-SLAM2: An Open-source SLAM System for Monocular, Stereo, and RGB-D Cameras. IEEE Trans. Robot. 2017, 33, 1255–1262. [Google Scholar] [CrossRef]
  13. Miao, R.; Qian, J.; Song, Y.; Ying, R.; Liu, P. Univio: Unified Direct and Feature-based Underwater Stereo Visual-inertial Odometry. IEEE Trans. Instrum. Meas. 2022, 71, 1–14. [Google Scholar] [CrossRef]
  14. Engel, J.; Sturm, J.; Cremers, D. Semi-dense Visual Odometry for a Monocular Camera. In Proceedings of the 2013 IEEE International Conference on Computer Vision, Sydney, NSW, Australia, 1–8 December 2013; pp. 1449–1456. [Google Scholar]
  15. Forster, C.; Pizzoli, M.; Scaramuzza, D. SVO: Fast Semi-direct Monocular Visual Odometry. In Proceedings of the 2014 IEEE International Conference on Robotics and Automation (ICRA), Hong Kong, China, 31 May–7 June 2014; pp. 15–22. [Google Scholar]
  16. Mur-Artal, R.; Montiel, J.M.M.; Tardos, J.D. ORB-SLAM: A Versatile and Accurate Monocular SLAM System. IEEE Trans. Robot. 2015, 31, 1147–1163. [Google Scholar] [CrossRef]
  17. Huang, J.; Zhao, S.; Zhang, L. E2-VINS: An Event-Enhanced Visual–Inertial SLAM Scheme for Dynamic Environments. Appl. Sci. 2025, 15, 1314. [Google Scholar] [CrossRef]
  18. Yin, H.; Liu, P.X.; Zheng, M. Stereo Visual Odometry with Automatic Brightness Adjustment and Feature Tracking Prediction. IEEE Trans. Instrum. Meas. 2023, 72, 1–11. [Google Scholar] [CrossRef]
  19. Wang, K.; Ma, S.; Ren, F.; Lu, J. SBAS: Salient Bundle Adjustment for Visual SLAM. IEEE Trans. Instrum. Meas. 2021, 70, 1–9. [Google Scholar] [CrossRef]
  20. Han, B.; Xiao, Z.; Huang, S.; Zhang, T. Multi-layer VI-GNSS Global Positioning Framework with Numerical Solution Aided MAP Initialization. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021; pp. 5448–5455. [Google Scholar]
  21. Nisar, B.; Foehn, P.; Falanga, D.; Scaramuzza, D. VIMO: Simultaneous Visual Inertial Model-based Odometry and Force Estimation. IEEE Robot. Autom. Lett. 2019, 4, 2785–2792. [Google Scholar] [CrossRef]
  22. He, M.; Rajkumar, R.R. Extended Vins-mono: A Systematic Approach for Absolute and Relative Vehicle Localization in Large-scale Outdoor Environments. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021; pp. 4861–4868. [Google Scholar]
  23. Li, T.; Zhang, H.; Gao, Z.; Niu, X.; El-sheimy, N. Tight Fusion of a Monocular Camera, MEMS-IMU, and Single-Frequency Multi-GNSS RTK for Precise Navigation in GNSS-Challenged Environments. Remote Sens. 2019, 11, 610. [Google Scholar] [CrossRef]
  24. Mascaro, R.; Teixeira, L.; Hinzmann, T.; Siegwart, R.; Chli, M. GOMSF: Graph-optimization Based Multi-sensor Fusion for Robust UAV Pose Estimation. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, QLD, Australia, 21–25 May 2018; pp. 1421–1428. [Google Scholar]
  25. Wu, D.; Zhong, X.; Peng, X.; Hu, H.; Liu, Q. Multimodal Information Fusion for High-robustness and Low-drift State Estimation of Ugvs in Diverse Scenes. IEEE Trans. Instrum. Meas. 2022, 71, 1–15. [Google Scholar] [CrossRef]
  26. Yu, Y.; Gao, W.; Liu, C.; Shen, S.; Liu, M. A Gps-aided Omnidirectional Visual-inertial State Estimator in Ubiquitous Environments. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 8 November 2019; pp. 7750–7755. [Google Scholar]
  27. Majumder, S.; Kehtarnavaz, N. Vision and Inertial Sensing Fusion for Human Action Recognition: A Review. IEEE Sensors J. 2021, 21, 2454–2467. [Google Scholar] [CrossRef]
  28. Geneva, P.; Eckenhoff, K.; Lee, W.; Yang, Y.; Huang, G. Openvins: A Research Platform for Visual-inertial Estimation. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 4666–4672. [Google Scholar]
  29. Shan, Z.; Li, R.; Schwertfeger, S. RGBD-Inertial Trajectory Estimation and Mapping for Ground Robots. Sensors 2019, 19, 2251. [Google Scholar] [CrossRef] [PubMed]
  30. Yuan, J.; Zhu, S.; Tang, K.; Sun, Q. ORB-TEDM: An RGB-D SLAM Approach Fusing ORB Triangulation Estimates and Depth Measurements. IEEE Trans. Instrum. Meas. 2022, 71, 1–15. [Google Scholar] [CrossRef]
  31. Angelino, C.V.; Baraniello, V.R.; Cicala, L. UAV position and attitude estimation using IMU, GNSS and camera. In Proceedings of the 2012 15th International Conference on Information Fusion, Singapore, 9–12 July 2012; pp. 735–742. [Google Scholar]
  32. Ha, C.; Yang, D.; Wang, G.; Kim, S.C.; Jo, H. Real-Time Drivable Region Mapping Using an RGB-D Sensor with Loop Closure Refinement and 3D Semantic Map-Merging. Appl. Sci. 2024, 14, 11613. [Google Scholar] [CrossRef]
  33. Li, X.; Wang, X.; Liao, J.; Li, X.; Li, S.; Lyu, H. Semi-tightly Coupled Integration of Multi-gnss PPP and S-VINS for Precise Positioning in Gnss-challenged Environments. Satell. Navig. 2021, 2, 1. [Google Scholar] [CrossRef]
  34. Kaplan, E.D.; Hegarty, C. (Eds.) Understanding GPS/GNSS: Principles and Applications, 3rd ed.; Artech House: Norwood, MA, USA, 2017. [Google Scholar]
  35. Lepetit, V.; Moreno-Noguer, F.; Fua, P. Epnp: An Accurate O(n) Solution to the Pnp Problem. Int. J. Comput. Vis. 2009, 81, 155–166. [Google Scholar] [CrossRef]
  36. Raza, M.A.; Awais Rehman, M.; Qureshi, I.M.; Habib Mahmood, M. A Simplified Approach to Visual Odometry Using Graph Cut RANSAC. In Proceedings of the 2018 5th International Multi-Topic ICT Conference (IMTIC), Jamshoro, Pakistan, 25–27 April 2018; pp. 1–7. [Google Scholar]
  37. Forster, C.; Carlone, L.; Dellaert, F.; Scaramuzza, D. On-manifold Preintegration for Real-time Visual–inertial Odometry. IEEE Trans. Robot. 2017, 33, 1–21. [Google Scholar] [CrossRef]
  38. Mur-Artal, R.; Tardos, J.D. Visual-inertial Monocular SLAM with Map Reuse. IEEE Robot. Autom. Lett. 2017, 2, 796–803. [Google Scholar] [CrossRef]
Figure 1. System architecture.
Figure 2. Optimal geometric constraints diagram (the blue line represents the correct match, and the red line represents the incorrect match).
Figure 3. KNN-based best geometric constraint matching algorithm filtering process.
Figure 4. Factor graph model. Red circles denote the system state $x_k$. Squares denote the factors for each sensor, including the pseudorange factor $e_k^s$, Doppler velocity factor $e_k^{DV}$, clock factor $c_k$, inertial factor $i_k$, and vision factor $\rho$. k is the index of the state frame; s denotes a satellite.
Figure 5. UAV equipment.
Figure 6. In-vehicle equipment.
Figure 7. Indoor environment.
Figure 8. Outdoor environment.
Figure 9. Outdoor flight through the courtyard building environment.
Figure 10. Feature tracking match.
Figure 11. Trajectory of RTK and GDVI-Fusion in vehicle experiment.
Table 1. Hardware parameters.

Sensor Type   Item                         Value
Camera        Sensor                       RealSense D455
              Shutter                      Global shutter
              RGB resolution               1280 × 800
              RGB FoV                      90° × 65°
              Depth resolution             1280 × 720
              Depth FoV                    87° × 58°
              Depth accuracy               Less than 2% at 4 m
              Frame rate                   20 fps
IMU           Sensor                       HY-INS-21B
              Frequency                    200 Hz
              Gyro bias                    0.003°/h
              Gyro random noise            0.0003°/h
              Accelerometer bias           10 μg
              Accelerometer random noise   10 μg
GNSS          Receiver                     MarinePak7
              Antenna                      Beitian BT-200
              Frequency                    10 Hz
Hardware      Processor                    Jetson TX2
              CPU                          6-core Denver and Cortex-A57
              GPU                          256-core Pascal
              Memory                       8 GB 128-bit LPDDR4
              Power                        7.5 W (or 15 W)
Table 2. MAD and MRD of resampled depth values (percentage of samples within each threshold).

Dataset   MAD ≤ 5 mm   MAD ≤ 15 mm   MAD ≤ 25 mm   MRD ≤ 0.5%   MRD ≤ 1.5%   MRD ≤ 2.5%
fr1*      37.56        69.35         88.65         40.12        79.96        93.52
fr2*      25.64        36.65         68.33         30.56        50.65        75.44
fr3*      36.33        72.23         91.45         42.21        81.49        94.33

fr1* includes fr1_360, fr1_floor, fr1_desk, fr1_desk2, and fr1_room; fr2* includes fr2/360_hemisphere, fr2/360_kidnap, fr2/desk, fr2/large_no_loop, and fr2/large_with_loop; and fr3* includes fr3_office and fr3_nst.
Table 3. Mean scale error.

Mean Scale Error (%)   Outdoor   Indoor   Indoor–Outdoor
GDVI-Fusion            1.07      0.62     1.13
VINS-Mono [5]          1.6       3.03     3.91
VINS-Fusion [11]       1.37      1.48     1.83
GVINS [1]              1.19      0.84     1.37
ORB-SLAM3 [10]         1.25      0.85     1.77
Table 4. Root mean square error.

RMSE (m)        Outdoor   Indoor   Indoor–Outdoor
GDVI-Fusion     0.125     0.116    0.133
VINS-Mono       0.808     0.298    1.287
VINS-Fusion     0.256     0.249    0.224
GVINS           0.155     0.132    0.162
ORB-SLAM3       0.85      0.35     0.63
Table 5. Feature tracking time consumption of different systems.

System           Feature Matching Tracking (ms)   System Runtime (ms)   Average Reprojection Error
VINS [11]        91.8                             499.0                 2.314144
GVINS [1]        81.8                             560.2                 1.259165
ORB-SLAM3 [10]   32.5                             428.2                 1.548526
GDVI-Fusion      30.2                             256.3                 0.895625
Table 6. Errors in vehicle tests.

                                              VINS-Mono   VINS-Fusion   GVINS   GDVI-Fusion
Horizontal positioning error (m)              -           -             6.782   4.667
Root mean square error of on-board test (m)   -           -             8.729   5.485
Table 7. Tunnel positioning error.

                                              VINS-Mono   VINS-Fusion   GVINS    GDVI-Fusion
Horizontal positioning error (m)              45.352      57.755        26.224   4.017
Root mean square error of on-board test (m)   46.252      59.364        28.014   5.889