The implementation of the positioning system involves three main phases: data preprocessing, initialization, and optimization. In the preprocessing stage, the pseudorange information obtained from the GNSS signal is processed by the SPP algorithm [34], the acceleration and angular velocity information obtained from the IMU are pre-integrated, and the depth information obtained from the RGBD camera is optimized [29]; the image information is subjected to feature point detection, extraction, KNN-optimized matching, and optical flow tracking. During the initialization phase, the visual structure from motion (SFM) [35] is aligned with the pre-integrated IMU information. If the initialization of visual–inertial odometry (VIO) fails, only GNSS residuals are utilized. If VIO initialization succeeds, it is aligned with GNSS for yaw alignment and anchor-position alignment. If this alignment succeeds, the data are optimized using both VIO and GNSS residuals; otherwise, only VIO residuals are used. In the optimization phase, the residuals go through a sliding-window optimization process. The overall system implementation is depicted in Figure 1.
3.1. Measurement Preprocessing
As shown in Figure 1, the measurement preprocessing module includes IMU pre-integration, optimal geometric constraints based on KNN matching, preprocessing of visual feature point depth, and processing of raw GNSS information.
IMU Pre-integration: Inertial measurement devices are susceptible to various sources of error, including accelerometer bias and noise and gyroscope bias and noise. Additionally, Earth-surface measurements captured by the IMU contain a gravity component. For the purpose of this discussion, we ignore the errors related to scale factor, misalignment angle, and non-orthogonality in low-cost IMU measurements. Thus, the inertial measurement model is

$$\hat{a}_k^{b} = a_k^{b} + b_{a_k} + R_n^{b_k} g^{n} + n_a, \qquad \hat{\omega}_k^{b} = \omega_k^{b} + b_{\omega_k} + n_\omega$$

where $\hat{a}_k^{b}$ and $\hat{\omega}_k^{b}$ are the actual measured values of the linear acceleration and angular velocity of the IMU at time $k$, and $a_k^{b}$, $\omega_k^{b}$ are the corresponding true values. The accelerometer noise $n_a$ and gyroscope noise $n_\omega$ are assumed to obey the Gaussian distributions $n_a \sim \mathcal{N}(0, \sigma_a^{2})$ and $n_\omega \sim \mathcal{N}(0, \sigma_\omega^{2})$. $b_{a_k}$ and $b_{\omega_k}$ are the accelerometer bias and gyroscope bias, respectively. $R_n^{b_k}$ is the rotation matrix from the $n$-frame to the $b$-frame at moment $k$, and $g^{n}$ is the acceleration of gravity in the $n$-frame.
With the IMU output frequency surpassing the camera frame rate, the gyroscope and accelerometer measurements are pre-integrated [5] to derive the changes in the camera's position, velocity, and rotation between two consecutive frames over the interval $[t_k, t_{k+1}]$, as follows:

$$\alpha_{b_{k+1}}^{b_k} = R_n^{b_k}\!\left(p_{b_{k+1}}^{n} - p_{b_k}^{n} - v_{b_k}^{n}\Delta t_k + \tfrac{1}{2} g^{n}\Delta t_k^{2}\right), \quad \beta_{b_{k+1}}^{b_k} = R_n^{b_k}\!\left(v_{b_{k+1}}^{n} - v_{b_k}^{n} + g^{n}\Delta t_k\right), \quad \gamma_{b_{k+1}}^{b_k} = \left(q_{b_k}^{n}\right)^{-1} \otimes q_{b_{k+1}}^{n}$$

where the pre-integration terms $\alpha_{b_{k+1}}^{b_k}$, $\beta_{b_{k+1}}^{b_k}$, and $\gamma_{b_{k+1}}^{b_k}$ are the changes in position, velocity, and rotation of the camera between two consecutive frames over $[t_k, t_{k+1}]$, respectively, and $R_t^{b_k}$ is the rotation matrix from the body frame at time $t$ to the $b_k$ frame. They are defined as [5]

$$\alpha_{b_{k+1}}^{b_k} = \iint_{t \in [t_k, t_{k+1}]} R_t^{b_k}\!\left(\hat{a}_t^{b} - b_{a_t}\right)\mathrm{d}t^{2}, \quad \beta_{b_{k+1}}^{b_k} = \int_{t \in [t_k, t_{k+1}]} R_t^{b_k}\!\left(\hat{a}_t^{b} - b_{a_t}\right)\mathrm{d}t, \quad \gamma_{b_{k+1}}^{b_k} = \int_{t \in [t_k, t_{k+1}]} \tfrac{1}{2}\,\Omega\!\left(\hat{\omega}_t^{b} - b_{\omega_t}\right)\gamma_t^{b_k}\,\mathrm{d}t$$
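As a concrete illustration of the pre-integration step above, the following Python sketch integrates bias-corrected accelerometer and gyroscope samples between two camera frames with a simple forward scheme. It is a minimal illustration, not the paper's implementation: the sample format and the helper names (`quat_mul`, `quat_rotate`, `small_angle_quat`) are our own assumptions, and the covariance/Jacobian propagation of [5,37] is omitted.

```python
import math

def quat_mul(q, r):
    # Hamilton product of two quaternions [w, x, y, z]
    w1, x1, y1, z1 = q; w2, x2, y2, z2 = r
    return [w1*w2 - x1*x2 - y1*y2 - z1*z2,
            w1*x2 + x1*w2 + y1*z2 - z1*y2,
            w1*y2 - x1*z2 + y1*w2 + z1*x2,
            w1*z2 + x1*y2 - y1*x2 + z1*w2]

def quat_rotate(q, v):
    # rotate vector v by unit quaternion q:  q * [0, v] * q^-1
    p = quat_mul(quat_mul(q, [0.0] + list(v)), [q[0], -q[1], -q[2], -q[3]])
    return p[1:]

def small_angle_quat(theta):
    # quaternion for a small rotation vector theta (rad)
    n = math.sqrt(sum(t * t for t in theta))
    if n < 1e-12:
        return [1.0, 0.5 * theta[0], 0.5 * theta[1], 0.5 * theta[2]]
    s = math.sin(n / 2) / n
    return [math.cos(n / 2), s * theta[0], s * theta[1], s * theta[2]]

def preintegrate(imu_samples, ba, bg):
    """Pre-integrate IMU samples between two camera frames.

    imu_samples: list of (dt, accel[3], gyro[3]) in the body frame,
    already gravity-free for this sketch. ba, bg: bias estimates.
    Returns (alpha, beta, gamma): position delta, velocity delta,
    rotation quaternion relative to the b_k frame.
    """
    alpha = [0.0, 0.0, 0.0]
    beta = [0.0, 0.0, 0.0]
    gamma = [1.0, 0.0, 0.0, 0.0]
    for dt, acc, gyr in imu_samples:
        a = [acc[i] - ba[i] for i in range(3)]          # bias-corrected accel
        w = [gyr[i] - bg[i] for i in range(3)]          # bias-corrected gyro
        a_bk = quat_rotate(gamma, a)                    # accel in b_k frame
        for i in range(3):
            alpha[i] += beta[i] * dt + 0.5 * a_bk[i] * dt * dt
            beta[i] += a_bk[i] * dt
        gamma = quat_mul(gamma, small_angle_quat([wi * dt for wi in w]))
    return alpha, beta, gamma
```

With a constant 1 m/s² acceleration along z for 1 s, the sketch reproduces the expected velocity change of 1 m/s and position change of 0.5 m.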
Best geometric constraint algorithm based on KNN matching: We use the KNN algorithm for the initial matching of the extracted feature points. The sets of feature points in an image and its matching image are denoted as $P = \{p_1, p_2, \dots, p_m\}$ and $Q = \{q_1, q_2, \dots, q_n\}$, respectively.
Calculate the Euclidean distance from the feature vector of each feature point in the first image to the feature vector of each feature point in the second image and store the distances in a matrix $D = (d_{ij})_{m \times n}$, where $d_{ij}$ is the Euclidean distance between the feature vectors of points $p_i$ and $q_j$. The row vector $D_i$ represents the Euclidean distances between point $p_i$ and all candidate points in the second image. Identify the $K$ (in this paper, $K = 2$) smallest values in this vector. The points in the second image corresponding to these $K$ distance minima are considered the $K$ candidate matching points.
If the distance ratio between the nearest neighbor and the next-nearest neighbor is less than the threshold (set to 0.7), the nearest neighbor is accepted as the matching point. After the initial matching is completed, the matching results are further optimized to reduce the false matching rate of the algorithm.
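The KNN search with the ratio test described above can be sketched as follows. The brute-force distance computation and list-based descriptors are illustrative assumptions (a real system would use an optimized matcher such as OpenCV's `knnMatch`); the 0.7 ratio threshold follows the paper.

```python
def knn_ratio_match(desc_a, desc_b, ratio=0.7):
    """For each descriptor in image A, find the K = 2 nearest descriptors in
    image B by Euclidean distance and keep the match only if the nearest
    distance is less than `ratio` times the second-nearest distance."""
    matches = []
    for i, da in enumerate(desc_a):
        # distances from descriptor i to every descriptor in image B
        dists = sorted(
            (sum((x - y) ** 2 for x, y in zip(da, db)) ** 0.5, j)
            for j, db in enumerate(desc_b)
        )
        # Lowe-style ratio test on the two nearest neighbours
        if len(dists) >= 2 and dists[0][0] < ratio * dists[1][0]:
            matches.append((i, dists[0][1]))
    return matches
```

A match is kept only when it is clearly better than the runner-up, which discards ambiguous correspondences in repetitive texture.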
Mismatched feature points will occur between neighboring frames, and when the mismatch rate is too high, it degrades the system accuracy. Therefore, the feature point matching algorithm needs to be optimized beyond the initial matching: false matches are eliminated using the geometric features of the feature point matching lines, so that only matching point pairs satisfying the geometric constraints are retained, as shown in Figure 2. By comparing the length and slope of the matching lines between adjacent point pairs, we can detect abnormal matching point pairs. This screening process considers the distance $d_i$ and slope $k_i$ between the matching points (denoted as points $p_i$ and $q_i$):

$$d_i = \sqrt{\left(u_{q_i} - u_{p_i}\right)^{2} + \left(v_{q_i} - v_{p_i}\right)^{2}}, \qquad k_i = \frac{v_{q_i} - v_{p_i}}{u_{q_i} - u_{p_i}}$$

where $(u_{p_i}, v_{p_i})$ and $(u_{q_i}, v_{q_i})$ are the pixel coordinates of $p_i$ and $q_i$, and $i = 1, 2, \dots, n$, with $n$ a positive integer.
The KNN-based best geometric constraint matching algorithm filtering process is shown in Figure 3. The method for filtering abnormal matching point pairs is as follows:
- (a)
Construct the set of feature point pairs; let the set of matching-line lengths be $L = \{d_1, d_2, \dots, d_n\}$ and the set of slopes be $K = \{k_1, k_2, \dots, k_n\}$.
- (b)
Compute the mean value D of the sets $L$ and $K$, as well as the root mean square error S; set D as the base value and S as the error range.
- (c)
Retain the pairs of feature matching points whose length and slope fall within the value domain $[D - S, D + S]$.
In this paper, we utilize a KNN-based matching algorithm with optimal geometric constraints to not only enhance matching efficiency but also improve the accuracy of feature point matching, ultimately enhancing the overall system localization accuracy.
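A minimal sketch of the length/slope screening in steps (a)–(c), assuming matches are given as pixel-coordinate pairs; the small epsilon that guards against vertical matching lines is our own assumption, not part of the paper.

```python
import math

def filter_matches(pairs):
    """Reject matches whose matching-line length or slope falls outside
    [D - S, D + S], where D is the mean and S the RMS error of each set.
    pairs: list of ((u1, v1), (u2, v2)) matched point coordinates."""
    lengths = [math.hypot(q[0] - p[0], q[1] - p[1]) for p, q in pairs]
    # epsilon avoids division by zero for vertical matching lines (assumption)
    slopes = [(q[1] - p[1]) / (q[0] - p[0] + 1e-12) for p, q in pairs]

    def inliers(values):
        d = sum(values) / len(values)                       # base value D
        s = math.sqrt(sum((v - d) ** 2 for v in values) / len(values))  # S
        return {i for i, v in enumerate(values) if d - s <= v <= d + s}

    # keep only pairs that pass both the length and the slope test
    keep = inliers(lengths) & inliers(slopes)
    return [pairs[i] for i in sorted(keep)]
```

Consistent matches share a common translation between frames, so an outlier's line length and slope both fall outside the one-sigma band and it is removed.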
Visual feature depth Nearest Neighbor Resampling preprocessing: The depth information measured by RGBD cameras is discontinuous and incomplete, and invalid depth information harms the robustness of the system. In this paper, we preprocess the depth information to obtain valid initial depth values for the feature points, which makes it easier to achieve robust and fast visual–inertial odometry (VIO) initialization.
The images and depth information captured by the camera are aligned, and feature points of keyframes are extracted, from which a set of feature points with valid depth is constructed. Since the local variation of depth is gradual, a nearest neighborhood is defined around each key point with invalid depth, and during feature tracking, the depth around these key points and their vicinity is resampled. The feature points with valid depth within the nearest neighborhood of such a point are iterated with the RANSAC algorithm [36] to find a sufficient number of inliers among the depth values, thereby recovering the approximate depth of the key point. In some scenes the recovered depth has errors; therefore, the depth information needs to be optimized again.
The key point depth preprocessing method can roughly estimate the visual scale, set the initial value of the scale, and further optimize it during VIO initialization to achieve fast initialization and improve the robustness of the system. Although resampled keypoints can be affected by depth Nearest Neighbor Resampling (NNR) errors, most of them provide depth values very close to the expected values. Depth NNR focuses on extending the set of feature points with valid depth values, and we use resampled keypoints when tracking feature points to improve robustness when the depth map is severely incomplete.
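The RANSAC-style depth recovery for a keypoint with invalid depth can be sketched as below. This is a simplified consensus vote over the neighbourhood's valid depths; the trial count, the 5% inlier tolerance, and the minimum inlier count are illustrative assumptions rather than the paper's parameters.

```python
import random

def recover_depth(neigh_depths, trials=50, tol=0.05, min_inliers=3):
    """Estimate the depth of a keypoint with invalid depth from the valid
    depths of feature points in its nearest neighborhood: repeatedly sample
    a candidate depth and keep the hypothesis supported by the most inliers
    (RANSAC-style vote, cf. [36]). Returns None if support is insufficient."""
    best_inliers = []
    for _ in range(trials):
        cand = random.choice(neigh_depths)
        # inliers: depths within a relative tolerance of the candidate
        inl = [d for d in neigh_depths if abs(d - cand) <= tol * cand]
        if len(inl) > len(best_inliers):
            best_inliers = inl
    if len(best_inliers) < min_inliers:
        return None                      # too few inliers: depth stays invalid
    return sum(best_inliers) / len(best_inliers)   # averaged inlier depth
```

The averaged inlier depth serves only as the initial value; as noted above, it is refined again during VIO initialization and in the depth-scaled bundle adjustment.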
GNSS raw information measurement: We incorporate GNSS measurements, including pseudorange and Doppler shift measurements, in our system. These measurements are used to estimate the receiver position with the Single Point Positioning (SPP) algorithm [34]. GNSS measurements are affected by various errors, such as orbital errors, clock deviations, and ionospheric and tropospheric delays. Therefore, we formulate the pseudorange measurement model to account for these errors:

$$\hat{P}_r^{s} = \left\| p_r - p^{s} \right\| + c\left(\delta t_r - \delta t^{s}\right) + I_r^{s} + T_r^{s} + n_{pr}$$

where, at time $t$, $p^{s}$ and $p_r$ denote the coordinates of satellite $s$ and receiver $r$ in the $i$-frame, respectively; $\delta t_r$ denotes the clock deviation of the receiver; $\delta t^{s}$ denotes the clock deviation of the satellite; $c$ denotes the speed of light; $I_r^{s}$ and $T_r^{s}$ denote the ionospheric and tropospheric errors, respectively; $n_{pr}$ denotes the measurement noise; and $\|\cdot\|$ is the vector norm.
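The pseudorange model above can be evaluated directly. The sketch below only computes the modeled measurement from given error terms (units: metres and seconds, our assumption); it is not an SPP solver, since the SPP algorithm [34] inverts this model over several satellites.

```python
def pseudorange(p_r, p_s, dt_r, dt_s, iono, trop, c=299_792_458.0):
    """Modeled pseudorange: geometric range plus receiver/satellite clock
    terms and ionospheric/tropospheric delays (measurement noise omitted)."""
    # geometric range ||p_r - p_s||
    rho = sum((a - b) ** 2 for a, b in zip(p_r, p_s)) ** 0.5
    return rho + c * (dt_r - dt_s) + iono + trop
```

With zero clock terms and 1 m + 2 m atmospheric delays, a 5 m geometric range yields an 8 m pseudorange.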
The Doppler effect introduces a frequency difference between the signal transmitted by a GNSS satellite and the signal received by the receiver, known as the Doppler shift. The Doppler shift represents the relative motion between the GNSS receiver and the satellite, and is influenced by factors such as clock frequency deviations and measurement noise. Thus, the Doppler shift model is expressed as

$$\hat{D}_r^{s} = \frac{1}{\lambda}\, \kappa_r^{s} \cdot \left(v_r - v^{s}\right) + \frac{c}{\lambda}\left(\delta \dot{t}_r - \delta \dot{t}^{s}\right) + n_{dp}$$

where, in the $i$-frame at time $t$, $\kappa_r^{s}$ denotes the unit vector pointing from the receiver to the satellite; $\lambda$ denotes the wavelength of the carrier signal; $v_r$ and $v^{s}$ denote the velocities of the receiver and the satellite, respectively; $c$ denotes the speed of light; $\delta \dot{t}_r$ denotes the clock offset rate of the receiver; $\delta \dot{t}^{s}$ denotes the clock offset rate of the satellite; and $n_{dp}$ denotes the Doppler measurement noise.
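Likewise, the Doppler model can be sketched as the line-of-sight projection of the relative velocity plus clock-drift terms; the sign convention and argument layout here are our assumptions for illustration, and noise is omitted.

```python
def doppler_shift(p_r, v_r, p_s, v_s, ddt_r, ddt_s, wavelength,
                  c=299_792_458.0):
    """Predicted Doppler measurement: relative velocity projected onto the
    receiver-to-satellite unit vector, plus receiver/satellite clock drift."""
    los = [s - r for r, s in zip(p_r, p_s)]        # line of sight, r -> s
    rng = sum(x * x for x in los) ** 0.5
    kappa = [x / rng for x in los]                 # unit LOS vector
    rel = sum(k * (vr - vs) for k, vr, vs in zip(kappa, v_r, v_s))
    return rel / wavelength + c / wavelength * (ddt_r - ddt_s)
```

A receiver closing on the satellite at 10 m/s with an L1-like 0.19 m wavelength gives a shift of about 52.6 Hz.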
3.2. Estimator Initialization
The state estimation of a multi-sensor fusion system is inherently nonlinear. As a result, the accuracy and robustness of the system heavily rely on the initialization values. In this paper, the initial scale from the depth information, the IMU pre-integration, and the visual information are loosely coupled, and the initial values of the gyroscope bias, accelerometer bias, and gravity vector are obtained through online optimization.
With the initial scale known, fusing the visual and IMU information requires the initial values of the gravity vector, velocity, and IMU biases. We solve the problem in two steps.
Gyroscope bias: For consecutive frames $k$ and $k+1$ in the sliding window, according to the constraints between the gyroscope measurements and the camera keyframe orientations, the residual of the rotation term is defined as follows:

$$r_{\gamma} = \left(\hat{\gamma}_{b_{k+1}}^{b_k} \otimes \begin{bmatrix} 1 \\ \frac{1}{2} J_{b_\omega}^{\gamma}\, \delta b_\omega \end{bmatrix}\right)^{-1} \otimes \left(q_{b_k}^{c_0}\right)^{-1} \otimes q_{b_{k+1}}^{c_0}$$

where $\hat{\gamma}_{b_{k+1}}^{b_k}$ denotes the pre-integrated change in rotation from frame $k$ to frame $k+1$; $J_{b_\omega}^{\gamma}$ denotes the Jacobian matrix of the pre-integrated rotation with respect to the gyroscope bias, whose analytic expression can be found in reference [37]; $\left(q_{b_k}^{c_0}\right)^{-1}$ is the inverse of the rotation at time $k$; and $q_{b_{k+1}}^{c_0}$ is the rotation at time $k+1$.
The gyroscope bias is refined by iterative optimization with the LM algorithm, and the optimization equation is

$$\min_{\delta b_\omega}\; \left\| r_p \right\|^{2} + \sum_{k \in \mathcal{B}} \left\| \left(\hat{\gamma}_{b_{k+1}}^{b_k} \otimes \begin{bmatrix} 1 \\ \frac{1}{2} J_{b_\omega}^{\gamma}\, \delta b_\omega \end{bmatrix}\right)^{-1} \otimes \left(q_{b_k}^{c_0}\right)^{-1} \otimes q_{b_{k+1}}^{c_0} \right\|^{2}$$

where $r_p$ is the a priori residual, which starts the iteration from zero bias and can be ignored, and the summation denotes the residual terms of the gyroscope bias over all frames $\mathcal{B}$ in the window.
Accelerometer bias, gravity vector, scale, and velocity initialization: After initializing the gyroscope bias, we calculate the accelerometer bias, gravity vector, scale, and velocity between two consecutive frames $k$ and $k+1$ [38], as follows:

$$s\,\bar{p}_{b_{k+1}}^{c_0} = s\,\bar{p}_{b_k}^{c_0} + R_{b_k}^{c_0}\, v_k^{b_k}\Delta t_k - \tfrac{1}{2} g^{c_0}\Delta t_k^{2} + R_{b_k}^{c_0}\, \alpha_{b_{k+1}}^{b_k}, \qquad R_{b_{k+1}}^{c_0}\, v_{k+1}^{b_{k+1}} = R_{b_k}^{c_0}\, v_k^{b_k} - g^{c_0}\Delta t_k + R_{b_k}^{c_0}\, \beta_{b_{k+1}}^{b_k}$$

where $v_k^{b_k}$ is the velocity of the frame-$k$ image in the body frame, $g^{c_0}$ is the acceleration of gravity in the camera $c_0$ coordinate system, $b_a$ is the accelerometer bias, and $s$ is the scaling parameter.
The pre-integrated position $\alpha_{b_{k+1}}^{b_k}$ and velocity $\beta_{b_{k+1}}^{b_k}$ of the IMU in the body frame between the moments of two consecutive frames $k$ and $k+1$ are computed as follows:

$$\alpha_{b_{k+1}}^{b_k} = R_{c_0}^{b_k}\!\left(s\left(\bar{p}_{b_{k+1}}^{c_0} - \bar{p}_{b_k}^{c_0}\right) + \tfrac{1}{2} g^{c_0}\Delta t_k^{2} - R_{b_k}^{c_0}\, v_k^{b_k}\Delta t_k\right) = \hat{\alpha}_{b_{k+1}}^{b_k} + J_{b_a}^{\alpha}\, b_a$$

$$\beta_{b_{k+1}}^{b_k} = R_{c_0}^{b_k}\!\left(R_{b_{k+1}}^{c_0}\, v_{k+1}^{b_{k+1}} + g^{c_0}\Delta t_k - R_{b_k}^{c_0}\, v_k^{b_k}\right) = \hat{\beta}_{b_{k+1}}^{b_k} + J_{b_a}^{\beta}\, b_a$$

where $\bar{p}_{b_k}^{c_0}$ is the (up-to-scale) position at moment $k$ in the $c_0$ frame; $\Delta t_k$ is the time difference between moments $k$ and $k+1$; $R_{c_0}^{b_k}$ is the rotation matrix from the $c_0$ frame to the $b_k$ frame; and $J_{b_a}^{\alpha}$ and $J_{b_a}^{\beta}$ are the Jacobian matrices of the pre-integrated position and velocity with respect to the accelerometer bias, respectively. Reference [37] gives the analytic expressions for the pre-integration Jacobians and the uncertainty propagation. The least squares optimization model is established as follows:

$$\min_{\mathcal{X}_I}\; \sum_{k \in \mathcal{B}} \left\| \hat{z}_{b_{k+1}}^{b_k} - H_{b_{k+1}}^{b_k}\, \mathcal{X}_I \right\|^{2}$$
From (9), we obtain the following linear measurement model:

$$\hat{z}_{b_{k+1}}^{b_k} = \begin{bmatrix} \hat{\alpha}_{b_{k+1}}^{b_k} \\ \hat{\beta}_{b_{k+1}}^{b_k} \end{bmatrix} = H_{b_{k+1}}^{b_k}\, \mathcal{X}_I + n_{b_{k+1}}^{b_k}$$

where $\mathcal{X}_I = \left[v_0^{b_0}, v_1^{b_1}, \dots, v_n^{b_n}, g^{c_0}, b_a, s\right]^{T}$ collects the frame velocities in the body frame, the gravity vector in the $c_0$ frame, the accelerometer bias, and the scale. The linear least squares model is solved through the normal equations, as follows:

$$\mathcal{X}_I^{*} = \left(\sum_{k} \left(H_{b_{k+1}}^{b_k}\right)^{T} H_{b_{k+1}}^{b_k}\right)^{-1} \sum_{k} \left(H_{b_{k+1}}^{b_k}\right)^{T} \hat{z}_{b_{k+1}}^{b_k}$$

According to the least squares optimization model, the velocity of each frame in the body frame, the acceleration of gravity in the camera frame, and the accelerometer bias are obtained.
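The linear least-squares solve at the core of this initialization can be sketched generically: stack the measurement models into $H$ and $z$ and solve the normal equations. Real systems exploit sparsity and use a Cholesky factorization; the dense Gaussian elimination here is purely illustrative.

```python
def solve_least_squares(H, z):
    """Solve min_x ||H x - z||^2 via the normal equations H^T H x = H^T z,
    using dense Gaussian elimination with partial pivoting.
    H: list of rows (each a list of floats); z: list of floats."""
    n = len(H[0])
    # form A = H^T H and b = H^T z
    A = [[sum(H[r][i] * H[r][j] for r in range(len(H))) for j in range(n)]
         for i in range(n)]
    b = [sum(H[r][i] * z[r] for r in range(len(H))) for i in range(n)]
    # forward elimination with partial pivoting
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # back substitution
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - sum(A[i][j] * x[j] for j in range(i + 1, n))) / A[i][i]
    return x
```

In the paper's setting, $x$ would be the stacked vector of frame velocities, gravity, accelerometer bias, and scale; here it is shown on a toy overdetermined system.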
3.3. Factor Graph Model
Formulation: Our system uses a nonlinear optimization approach, and the system state can be summarized as

$$\mathcal{X} = \left[x_0, x_1, \dots, x_n\right], \qquad x_k = \left[q_k, v_k, p_k, s, b_a, b_\omega, \delta t_r, \delta \dot{t}_r\right]$$

where, within the sliding window of $n$ frames, the frame-$k$ state $x_k$ includes the attitude $q_k$, velocity $v_k$, position $p_k$, scale $s$, accelerometer bias $b_a$, gyroscope bias $b_\omega$, receiver clock bias $\delta t_r$, and receiver clock drift rate $\delta \dot{t}_r$.
We analyze the factor graph model, with residual factors constructed for each sensor, for back-end nonlinear optimization. The factor graph is shown in Figure 4. The maximum a posteriori probability model is built as follows:

$$\min_{\mathcal{X}} \left\{ \left\| r_p - H_p \mathcal{X} \right\|^{2} + \sum_{k \in \mathcal{B}} \left\| r_{\mathcal{B}}\!\left(\hat{z}_{b_{k+1}}^{b_k}, \mathcal{X}\right) \right\|^{2} + \sum_{(l,j) \in \mathcal{C}} \left\| r_{\mathcal{C}}\!\left(\hat{z}_l^{c_j}, \mathcal{X}\right) \right\|^{2} + \sum_{s} \left\| r_{pr}\!\left(\hat{P}_r^{s}, \mathcal{X}\right) \right\|^{2} + \sum_{s} \left\| r_{dp}\!\left(\hat{D}_r^{s}, \mathcal{X}\right) \right\|^{2} \right\}$$

where $\{r_p, H_p\}$ denotes the a priori information about the system state, and $r_{\mathcal{B}}$, $r_{\mathcal{C}}$, $r_{pr}$, and $r_{dp}$ denote the IMU residuals, visual residuals, GNSS pseudorange residuals, and Doppler shift residuals, respectively. We decompose the optimization problem into individual factors related to states and measurements.
Figure 4 shows the factorization diagram of our system. In addition to the factors from the measurements, there is an a priori factor used to constrain the sliding window system state. We discuss each factor in detail below.
Inertial measurement unit (IMU) residual factor: From the IMU measurement model in Equations (3) and (11), the residual of the IMU within the sliding window is defined as

$$r_{\mathcal{B}}\!\left(\hat{z}_{b_{k+1}}^{b_k}, \mathcal{X}\right) = \begin{bmatrix} R_n^{b_k}\!\left(p_{b_{k+1}}^{n} - p_{b_k}^{n} + \frac{1}{2} g^{n}\Delta t_k^{2} - v_{b_k}^{n}\Delta t_k\right) - \hat{\alpha}_{b_{k+1}}^{b_k} \\ R_n^{b_k}\!\left(v_{b_{k+1}}^{n} + g^{n}\Delta t_k - v_{b_k}^{n}\right) - \hat{\beta}_{b_{k+1}}^{b_k} \\ 2\left[\left(\hat{\gamma}_{b_{k+1}}^{b_k}\right)^{-1} \otimes \left(q_{b_k}^{n}\right)^{-1} \otimes q_{b_{k+1}}^{n}\right]_{xyz} \\ b_{a_{k+1}} - b_{a_k} \\ b_{\omega_{k+1}} - b_{\omega_k} \end{bmatrix}$$

where the vector component of the quaternion $q$ is extracted by $[\cdot]_{xyz}$ to represent the error state; $\hat{\alpha}_{b_{k+1}}^{b_k}$, $\hat{\beta}_{b_{k+1}}^{b_k}$, $\hat{\gamma}_{b_{k+1}}^{b_k}$, $b_a$, and $b_\omega$ denote the observed terms of position, velocity, attitude, accelerometer bias, and gyroscope bias between two consecutive frames, respectively; and $\otimes$ represents the multiplication operation between two quaternions.
Depth-scaled BA based on NNR: NNR can extract enough effective depth information from discontinuous and incomplete depth images, but the depth data obtained this way contain errors that can cause the SLAM system to fail. In this paper, we introduce a depth bias $\delta d_l$ for optimization, assuming that the depth is constant over a short time. After extracting the feature points of consecutive keyframes and applying distortion correction to the feature points, the visual projection is modeled as

$$\hat{z}_l^{c_k} = \pi\!\left( \left(R_c^{b}\right)^{T}\!\left( \left(R_{b_k}^{n}\right)^{T}\!\left( P_l^{n} - p_{b_k}^{n}\right) - p_c^{b} \right) \right) + n_l^{c_k}$$

where $\left[R_c^{b}, p_c^{b}\right]$ is the rotation and translation between the IMU coordinate system and the camera coordinate system; $\left[R_{b_k}^{n}, p_{b_k}^{n}\right]$ is the rotation and translation between the body frame at frame $k$ and the navigation frame; $n_l^{c_k}$ is the measurement error; $P_l^{n}$ is the 3D feature point in the navigation frame; and $\pi(\cdot)$ is the camera projection function. Therefore, at moment $k$, the reprojection residual of feature point $l$ projected from frame $i$ to frame $j$ can be expressed as

$$r_{\mathcal{C}}\!\left(\hat{z}_l^{c_j}, \mathcal{X}\right) = \hat{z}_l^{c_j} - \pi\!\left( \left(R_c^{b}\right)^{T}\!\left( \left(R_{b_j}^{n}\right)^{T}\!\left( P_l^{n} - p_{b_j}^{n}\right) - p_c^{b} \right) \right), \qquad P_l^{n} = R_{b_i}^{n}\!\left( R_c^{b}\left(d_l + \delta d_l\right) \pi^{-1}\!\left(\hat{z}_l^{c_i}\right) + p_c^{b} \right) + p_{b_i}^{n}$$

where $\delta d_l$ is the depth bias of the 3D feature point $l$. The depth value $d_l$ of feature point $l$ is updated as

$$d_l \leftarrow d_l + \delta d_l$$
GNSS signal residual factor: Given the system state, we can compute the receiver position in the ECEF frame at any moment. Since the time of a GNSS measurement is stamped by the receiver, we define the ECI frame as coinciding with the ECEF frame at the moment of signal reception, so that the receiver position in the ECI frame, $p_r$, is known at reception. From the broadcast ephemeris and pseudorange measurements, the position of the satellite in the ECEF frame at the moment of signal transmission can be obtained as $p_{ecef}^{s}$. However, because of the Earth's rotation rate $\omega_{\oplus}$, the ECEF frame at the time the signal leaves the satellite differs from the ECEF frame at the time the signal arrives at the receiver. Therefore, the satellite position must be converted to the ECI coordinate system, as follows:

$$p_{eci}^{s} = R_z\!\left(\omega_{\oplus}\, t_f\right) p_{ecef}^{s}, \qquad R_z(\theta) = \begin{bmatrix} \cos\theta & \sin\theta & 0 \\ -\sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

where $t_f$ is the signal flight time. The pseudorange is shown in (4). Therefore, the single pseudorange residual of satellite $s$ received by the GNSS receiver $r$ at moment $k$ can be expressed as

$$r_{pr}\!\left(\hat{P}_r^{s}, \mathcal{X}\right) = \left\| p_{eci}^{s} - p_r \right\| + c\left(\delta t_r - \delta t^{s}\right) + I_r^{s} + T_r^{s} - \hat{P}_r^{s}$$
The Doppler shift describes the phenomenon in which the frequency of the wave received by the receiver differs from the frequency actually transmitted by the satellite when there is relative motion between the satellite and the receiver, as expressed by (5). The Doppler measurement is defined in the ECEF frame in which the satellite transmits the signal. By defining the ECI frame as the ECEF frame at reception, the receiver velocity $v_r$ in the ECI frame is known. The satellite's velocity $v_{ecef}^{s}$ in the transmitting ECEF frame is then transformed into the ECI frame, as follows:

$$v_{eci}^{s} = R_z\!\left(\omega_{\oplus}\, t_f\right) v_{ecef}^{s}$$

Therefore, the Doppler shift residual of satellite $s$ received by the GNSS receiver $r$ at moment $k$ can be expressed as

$$r_{dp}\!\left(\hat{D}_r^{s}, \mathcal{X}\right) = \frac{1}{\lambda}\, \kappa_r^{s} \cdot \left(v_r - v_{eci}^{s}\right) + \frac{c}{\lambda}\left(\delta \dot{t}_r - \delta \dot{t}^{s}\right) - \hat{D}_r^{s}$$

where the rotation and translation between the body frame and the ECI frame relate the receiver position $p_r$ and velocity $v_r$ to the system state.