A Visual–Inertial Pressure Fusion-Based Underwater Simultaneous Localization and Mapping System

Detecting objects, particularly naval mines, on the seafloor is a complex task. In naval mine countermeasures (MCM) operations, sidescan or synthetic aperture sonars have been used to search large areas. However, a single sensor cannot meet the requirements of high-precision autonomous navigation. Based on the ORB-SLAM3-VI framework, we propose ORB-SLAM3-VIP, which integrates a depth sensor, an IMU sensor and an optical sensor. This method integrates the measurements of depth sensors and an IMU sensor into the visual SLAM algorithm through tight coupling, and establishes a multi-sensor fusion SLAM model. Depth constraints are introduced into the process of initialization, scale fine-tuning, tracking and mapping to constrain the position of the sensor in the z-axis and improve the accuracy of pose estimation and map scale estimate. The test on seven sets of underwater multi-sensor sequence data in the AQUALOC dataset shows that, compared with ORB-SLAM3-VI, the ORB-SLAM3-VIP system proposed in this paper reduces the scale error in all sequences by up to 41.2%, and reduces the trajectory error by up to 41.2%. The square root has also been reduced by up to 41.6%.


Introduction
The manufacturing and deployment of mines is relatively cheap, while detecting them is a costly and dangerous effort.Due to its features, the seafloor can obscure the devices from sonars, making it hard to detect bottom mines.These mines usually carry larger warheads and more sophisticated sensors than moored ones, which make them much harder to sweep.To detect these bottom mines, autonomous underwater vehicles (AUVs) are increasingly being employed, which are equipped with a sidescan sonar or a synthetic aperture sonar to carry out survey missions [1].However, the underwater environment poses a unique challenge to vision-based state estimation.In particular, features caused by suspended particles, blurring, light and color attenuation are not as clear as those on the surface of the water.Therefore, the results from different vision-based state estimation packets show a large number of outliers, leading to inaccurate estimation or even complete tracking loss.In the complex underwater environment, single-sensor equipment, such as sonar or cameras, cannot meet the requirements of high-precision autonomous navigation [2,3].So, the method of multi-sensor fusion is the best choice [4,5].
At present, ORB-SLAM3-VI [6,7] still has the problems of inaccurate scale estimation and easy loss of tracking after initialization after underwater dataset testing.IMUs (inertial measurement units) and optical sensors are local sensors, which can only allow for local observations.But estimation based on local sensors inevitably has a level of cumulative drift.In contrast, global sensors (such as pressure gauges) can allow for global observations, and the output observation data do not have cumulative drift.However, due to noise and a low observation frequency, they cannot be directly used in SLAM (simultaneous localization and mapping) [8,9].The combination of these two sensors can improve the positioning accuracy of the detection equipment.
For this reason, we introduced new constraints on the scale factor and z-axis position in the optimization; that is, we used a pressure sensor to measure the underwater depth of the robot.By introducing depth constraints, we can obtain better estimates of scale information and location.In this paper, we propose a visual-inertial pressure fusion-based SLAM system based on ORB-SLAM3.

Data Pretreatment
We took the image acquisition time as the benchmark to conduct soft synchronization for multi-sensor data.The frequency of IMUs is generally higher than that of the camera.The IMU measurement data between the current frame and the previous frame are mapped to the current frame, which ensures that each image has a corresponding IMU measurement value during fusion.However, the frequency of the pressure sensor may be higher or lower than the camera frequency, so the two cases need to be discussed separately.
When the frequency of the pressure sensor is higher than that of the camera, each frame will correspond to multiple depth measurements, and the average of the measured values will be taken as the depth of the current frame; that is, At this point, the variance of depth measurement becomes σ depth / √ M. When the frequency of the pressure sensor is lower, for example, the camera frequency in the AQUALOC dataset [10] is 20 Hz and the measurement frequency of the pressure sensor is 5 Hz, and some images have no corresponding pressure measurement.At this point, we believe that the underwater robot is not dynamic; that is, the depth change is not significant.We can use interpolation to estimate the depth of this frame; that is:

Problem Description
A factor graph can decompose mathematical problems involving multiple variables into products of multiple functions.In the SLAM problem, the factor graph is used as the probability graph model, and the nodes are composed of random variables and probability distributions.The SLAM system needs to use the maximum a posteriori estimation to solve the random variables through observation.
The function of solving the maximum a posteriori estimation problem through factor graph is as follows: • The factor graph model can describe the mathematical relationship between random variable nodes and observation error nodes.

•
When adding a new node to the factor graph, if we first analyze the impact relationship between it and the other nodes, we can only optimize the variables associated with the new node.But, when the scale of the graph changes to some extent, we need to optimize the whole graph.This optimization strategy reduces unnecessary calculation and improves the optimization speed.
For example, the factor graph of a purely visual SLAM can be showed as Figure 1, where x is the pose node, l is the landmark node, z is the observation value and f is the equation of motion.
Generally, the conditional probability of each factor is taken as Gaussian distribution.For the equation of motion, it is: where   ∼ (0,   ).So, we can obtain: Similarly, for the observation equation, we can obtain: where   is the covariance of the noise term of the observation equation, and ℎ is the observation equation.So, we can solve Formula (10); that is, maximize (( −1 ),   ) and (ℎ(  ,   ),   ).
To maximize the Gaussian distribution  ∼ (μ, Σ), we can minimize its negative logarithm.The negative logarithm of the probability density of the Gaussian distribution is: Since the first term is independent of , we just need to minimize the quadratic term on the right.Combining this with Formula (10), we can obtain: The error items are defined as: For a least-squares problem, arg  , 1 2 ‖()‖ 2 , we can use the iterative method.First, we give an initial value, and then we can repeatedly calculate the increment to update the The maximum posterior probability is: Generally, the conditional probability of each factor is taken as Gaussian distribution.For the equation of motion, it is: where w k ∼ N(0, R k ).So, we can obtain: Similarly, for the observation equation, we can obtain: where Q kj is the covariance of the noise term of the observation equation, and h is the observation equation.
So, we can solve Formula (10); that is, maximize N( f (x k−1 ), R k ) and N h x k , l j , Q kj .
To maximize the Gaussian distribution x ∼ N(µ, Σ), we can minimize its negative logarithm.The negative logarithm of the probability density of the Gaussian distribution is: Since the first term is independent of x, we just need to minimize the quadratic term on the right.Combining this with Formula (10), we can obtain: The error items are defined as: For a least-squares problem, argmin x,l 1 2 e(x) 2 , we can use the iterative method.First, we give an initial value, and then we can repeatedly calculate the increment to update the optimization variables until the end of the iteration to obtain a better solution.The specific steps are as follows: Sensors 2024, 24, 3207 4 of 13 1.
Give the initial value x 0 .2.
In the kth iteration, calculate the increment ∆x k to minimize e(x k + ∆x k ) 2 .

3.
When ∆x k is small enough, stop the iteration; otherwise, solve x k+1 = x k + ∆x k and return to the previous step.
The increment ∆x can be obtained by solving the increment equation H∆x = g.In the L-M (Levenberg-Marquardt) method, H is J T J + λI.

Graph Optimization Based on g2o
This paper uses the general solver g2o (General Graph Optimization) [11] to solve the graph optimization problem.g2o is a collection of algorithms that can be used to solve the graph optimization problem.It sup ports the use of the definition point method, edge selection method and linear solver to solve nonlinear equations.The algorithm framework is shown in Figure 2.
optimization variables until the end of the iteration to obtain a better solution.The specific steps are as follows: 1. Give the initial value x0. 2. In the kth iteration, calculate the increment △  to minimize ‖  △  ‖ .3. When △  is small enough, stop the iteration; otherwise, solve   △  and return to the previous step.
The increment △  can be obtained by solving the increment equation  △  .In the L-M (Levenberg-Marquardt) method,  is   λ.

Graph Optimization Based on g2o
This paper uses the general solver g2o (General Graph Optimization) [11] to solve the graph optimization problem.g2o is a collection of algorithms that can be used to solve the graph optimization problem.It sup ports the use of the definition point method, edge selection method and linear solver to solve nonlinear equations.The algorithm framework is shown in Figure 2. In practice, we use the Ceres [12] toolkit to calculate the gradient automatically and the Eigen [13] toolkit to solve the linear equation, which is constructed in the form of the L-M method, namely   λ △  .

SLAM Algorithm Based on Visual-Inertial Pressure Fusion
In this section, we tightly couple IMU and pressure measurement with visual data.Because the optimization-based SLAM problem can be expressed by a factor graph, fusing new sensor measurements with the SLAM problem is equivalent to adding new factors and nodes into the graph.The process of our method is shown as Figure 3. Next, we will introduce the specific algorithm process.In practice, we use the Ceres [12] toolkit to calculate the gradient automatically and the Eigen [13] toolkit to solve the linear equation, which is constructed in the form of the L-M method, namely J T J + λI ∆x = g.

SLAM Algorithm Based on Visual-Inertial Pressure Fusion
In this section, we tightly couple IMU and pressure measurement with visual data.Because the optimization-based SLAM problem can be expressed by a factor graph, fusing new sensor measurements with the SLAM problem is equivalent to adding new factors and nodes into the graph.The process of our method is shown as Figure 3. Next, we will introduce the specific algorithm process.

Residual Construction
We need to estimate the pose   = [  ,   ] and velocity   in the global coordin system and the gyro and accelerometer deviation,    and    ; the state vector can be culated using the following formula:

Residual Construction
We need to estimate the pose T i = [R i , p i ] and velocity v i in the global coordinate system and the gyro and accelerometer deviation, b g i andb a i ; the state vector can be calculated using the following formula: To solve the state vector, we define the optimization problem as: When there is only a visual sensor, because the gravity direction is unknown, the direction of the z axis in the global coordinate system depends on the reference frame of the first frame.It cannot be guaranteed that the z axis is parallel to the gravity direction, and the depth information cannot be directly fused.Therefore, we use an IMU to estimate the gravity direction, align the z axis of the global coordinate system with the gravity direction, and add the information from the pressure sensor to the optimization constraint.For depth measurement, it can be assumed that the z axis of the world coordinate system is collinear with the depth axis, so the depth residual is: The IMU pre-product component rotation, velocity and pose between frame i and frame i + 1 are expressed as ∆R i,i+1 , ∆v i,i+1 , ∆p i,i+1 .According to the pre-product component and the state vectors S i and S i+1 , the construct inertial residuals can be calculated: Combined with the visual residual, inertial residual and depth residual, the visual-inertial pressure SLAM can be considered as an optimization problem.Given k + 1 key frames, the state vector S k = {S 0 . . .S k } and l 3D points X = {x 0 . . .x l−1 }, the optimization problem of visual-inertial pressure can be expressed as: where K is the set of keyframes in which point j can be observed.ρ Hub is the Huber kernel function [14] which can be used to reduce the influence of false matching on the re projection error.

Data Initialization
The purpose of the initialization step is to obtain good initial values for velocity, gravity direction, scale and IMU deviation.Referring to the steps in ORB-SLAM3, the steps of visual-inertial pressure (VIP) fusion initialization are also divided into three steps: 1.
Visual estimation.Run the visual SLAM for two seconds, insert keyframes at the speed of 4Hz, and build a map.Using visual residual to constrain pose and map points, obtain initial estimation of pose T 0:k = [R, p] 0:k , where the upper dash represents the scaled variable.Because the drift in the visual estimation is very small, it can be used as the constraint of inertial optimization.

2.
Inertial pressure estimation.Take the initial estimation of the position and attitude obtained from the previous optimization as the prior of the inertial pressure estimation, and fix the position and attitude T 0:k to optimize the state vector: where s is the scale factor, which aims to restore the constructed map to the true scale, and R wg is the direction of gravity.The gravity g in the global coordinate system can be expressed as g = R wg g I , Sensors 2024, 24, 3207 6 of 13 where g I = (0, 0, G) T , and G is the magnitude of gravity.b = (b a , b g ) ∈ R 6 represents the deviation in the gyroscope and accelerometer.During initialization, b is regarded as a constant.v 0:k is speed.In the inertial pressure estimation, the pose is not optimized because we think the pose estimated by vision is considered good enough.The problem of inertial pressure optimization can be expressed as: where Σ b is the prior of IMU deviation, which limits the value range of deviation b to ensure that the IMU deviation is closer to 0. A factor diagram of inertial pressure optimization is shown in Figure 4.The vertex of the circle represents the variable to be optimized, and the square represents the factor node/the residual term.{X 0 , X 1 • • • X k } represents the k + 1 frame keyframe, keyframe X 0:k corresponding to pose T 0:k and speed v 0:k .b is the deviation, which is regarded as a constant in this step and is constrained by a prior deviation.The dotted fixed box indicates that the variable (circular node in the factor graph) is fixed; that is, pose T is fixed, and it only participates in the residual calculation as a constraint, and will not be updated in this step.Depth residuals are only used to estimate the state vector y k and gravity direction R wg .
= {,   , , ̅ 0: } where  is the scale factor, which aims to restore the constructed map to the true scale, and   is the direction of gravity.The gravity  in the global coordinate system can be expressed as  =     , where   = (0,0, )  , and G is the magnitude of gravity. = (  ,   ) ∈  6 represents the deviation in the gyroscope and accelerometer.During initialization, b is regarded as a constant.̅ 0: is speed.In the inertial pressure estimation, the pose is not optimized because we think the pose estimated by vision is considered good enough.
The problem of inertial pressure optimization can be expressed as: where Σ  is the prior of IMU deviation, which limits the value range of deviation  to ensure that the IMU deviation is closer to 0. A factor diagram of inertial pressure optimization is shown in Figure 4.The vertex of the circle represents the variable to be optimized, and the square represents the factor node/the residual term.{ 0 ,  1 ⋯   } represents the  + 1 frame keyframe, keyframe  0: corresponding to pose  0: and speed ̅ 0: .b is the deviation, which is regarded as a constant in this step and is constrained by a prior deviation.The dotted fixed box indicates that the variable (circular node in the factor graph) is fixed; that is, pose  is fixed, and it only participates in the residual calculation as a constraint, and will not be updated in this step.Depth residuals are only used to estimate the state vector   and gravity direction   .After inertial optimization, the map can be scaled according to the estimated proportion, and the global coordinate system of the map can be rotated to align the z axis with the estimated gravity direction.After inertial optimization, the map can be scaled according to the estimated proportion, and the global coordinate system of the map can be rotated to align the z axis with the estimated gravity direction.

3.
Visual-inertial pressure joint optimization.Once we have a good estimate of the inertial and visual parameters, we can perform joint visual-inertial pressure optimization to further optimize the state vector and no longer fix the pose and map points.This optimization can be seen in Figure 5. Every frame X i corresponds to a state vector S i = T i , v i , b g i , b a i ; deviation b conforms to random walk model, and the deviation of adjacent frames is constrained by prior random walk residuals.The visual residual constrains the position and key frame pose T i of the 3D map points.The inertial residual constrains pose T i , speed v i and deviation b i .The depth residual constrains p i in pose  After determining the optimization variables and the corresponding optimiza ror, we can add the corresponding nodes and edges to g2o to solve the optimizatio ables.

Scale Adjustment
In some specific cases, such as insufficient robot motion, initialization cann verge to an accurate solution, so an additional scale adjustment is required.In clu justment, when the scale factor s is explicitly expressed as an optimization variabl than being implied in , , the convergence speed, is much faster [15].Scale adju is based on all the inserted key frames and only adjusts the scale and gravity di We do not assume that the IMU deviation is constant and use the estimated devi each frame.In the local mapping thread, we run the scale fine adjustment every 10 the number of key frames in the map exceeds 200 or initialization has been compl more than 75 s.A factor diagram of the scale adjustment is shown in Figure 6.The fixed box indicates that the variable (circular node in the factor graph) is fixed; tha After determining the optimization variables and the corresponding optimization error, we can add the corresponding nodes and edges to g2o to solve the optimization variables.

Scale Adjustment
In some specific cases, such as insufficient robot motion, initialization cannot converge to an accurate solution, so an additional scale adjustment is required.In cluster adjustment, when the scale factor s is explicitly expressed as an optimization variable rather than being implied in T, v, the convergence speed, is much faster [15].Scale adjustment is based on all the inserted key frames and only adjusts the scale and gravity direction.We do not assume that the IMU deviation is constant and use the estimated deviation of each frame.In the local mapping thread, we run the scale fine adjustment every 10 s until the number of key frames in the map exceeds 200 or initialization has been completed for more than 75 s.A factor diagram of the scale adjustment is shown in Figure 6.The dotted fixed box indicates that the variable (circular node in the factor graph) is fixed; that is, the posture T i when the scale is adjusted, speed v i and deviation b i are all fixed and they only participate in residual calculation.The scale factor s and gravity direction are optimized, which are constrained by the depth residuals and inertial residuals.
posture   ̅ when the scale is adjusted, speed   ̅ and deviation   are all fixed and they only participate in residual calculation.The scale factor s and gravity direction are optimized, which are constrained by the depth residuals and inertial residuals.The optimization problem of scale adjustment can be expressed as: During optimization, the parameters of the incremental update to the gravity direction can be expressed: R wg new = R wg old Exp ( g ,  g , 0) where Exp represents the conversion from the rotation vector to the rotation matrix.To ensure that the scale is positive, the update of scale s is expressed as: In actual use, the local mapping thread has optimized the state vector many times.The initialization and scale fine-tuning processes are shown in Figure 7. IMU initialization, VIPBA1 (Visual Inertial Pressure Bundle Adjustment) and VIPBA2 all call the initialization function Initialize IMU.The initialization function first estimates the IMU state with only inertial pressure constraints, and then optimizes the IMU state with the visual and inertial pressure constraints.The optimization problem of scale adjustment can be expressed as: During optimization, the parameters of the incremental update to the gravity direction can be expressed: R new wg = R old wg Exp δα g , δβ g , 0 (17) where Exp represents the conversion from the rotation vector to the rotation matrix.To ensure that the scale is positive, the update of scale s is expressed as: In actual use, the local mapping thread has optimized the state vector many times.The initialization and scale fine-tuning processes are shown in Figure 7. IMU initialization, VIPBA1 (Visual Inertial Pressure Bundle Adjustment) and VIPBA2 all call the initialization function Initialize IMU.The initialization function first estimates the IMU state with only inertial pressure constraints, and then optimizes the IMU state with the visual and inertial pressure constraints.

Tracking and Mapping
Visual-inertial pressure tracking is responsible for tracking altitude, speed, IMU deviation and other state variables at the frame rate.The motion model enables us to predict the camera's attitude in the next frame.After the camera's altitude is predicted, we can calculate the visual error, IMU error and depth error to optimize the current frame state.Different optimization strategies are used according to whether the map is updated.A map update refers to the generation of new keyframes or the detection of closed loops.
When there is a map update, only the state vector of the current frame is optimized.When the state vector S i is Equation ( 17), the error is Equation (21).To optimize the error, the optimization result and the calculated matrix H are used prior to the subsequent frames.
When the map is not updated, the state vectors of two adjacent frames are optimized.The corresponding state variable to be optimized becomes {S i , S i+1 }, and the error becomes: where E prior is the previous error calculated according to the previous frame.Its calculation formula is: where ρ is the first-order robust kernel function.In Formula (29), R

Tracking and Mapping
Visual-inertial pressure tracking is responsible for tracking altitude, speed, IMU deviation and other state variables at the frame rate.The motion model enables us to predict the camera's attitude in the next frame.After the camera's altitude is predicted, we can calculate the visual error, IMU error and depth error to optimize the current frame state.Different optimization strategies are used according to whether the map is updated.A map update refers to the generation of new keyframes or the detection of closed loops.
When there is a map update, only the state vector of the current frame is optimized.When the state vector   is Equation ( 17), the error is Equation (21).To optimize the error, the optimization result and the calculated matrix H are used prior to the subsequent frames.
When the map is not updated, the state vectors of two adjacent frames are optimized.The corresponding state variable to be optimized becomes {  ,  +1 } , and the error becomes: where   is the previous error calculated according to the previous frame.Its calculation formula is: For the local mapping thread, the local window retains N key frames and all points of these key frames, and then optimizes them.At the same time, all other key frames that can observe these points also participate in error calculation, constraining the positions of the map points, but the state of these key frames is fixed in the optimization.

Scale Optimization Results on Data Set AQUALOC
Multi-sensor fusion can improve the accuracy of SLAM mapping.For monocular SLAM, the most important role of the fusion of other sensors is to restore the true scale.

Scale Error Statistics
We define the scale error according to the length of the predicted track before and after alignment with the real track: We have produced statistics on the scale error in the prediction of trajectory of multiple sequences, and the results are shown in Table 1.A negative scale error means that the predicted map is smaller than the real map, and a positive scale error means that the predicted map is larger than the real map.It can be seen from the comparison of the results for the scale errors that the scale errors of ORB-SLAM3-VIP's predicted map tracks are all within 5.4%.Compared with ORB-SLAM3-VI, the scale errors are reduced by 2.6-41.2%.ORB-SLAM3-VI cannot restore the map to the true scale, because the IMU has no direct effect on the position p.It estimates the position p through two integration calculations, which is more vulnerable to noise.

Result Analysis for the Harbor_01 Sequence
Here, we analyze the process of scale optimization.According to the flow chart in Figure 7, the map scale has been adjusted many times in the local mapping thread, including three changes the to initialization functions and n scale adjustments.n depends on the number of key frames being greater than 200 or the cumulative interval time of all key frames in the optimization exceeding 75s.In the process of initialization function and scale adjustment, a scale factor will be calculated.Every time the scale factor is calculated, the size of the map is updated.
We recorded the scale factor obtained by each optimization when running SLAM on sequence harbor_01, and calculated the length ratio of the predicted trajectory and the predicted trajectory after scale alignment, which is recorded as the relative scale.The scale proportions of predicted trajectory of ORB-SLAM3-VI and ORB-SLAM3-VIP are 0.767 and 0.958, respectively.According to the optimized scale factor, we can deduce the relative scale of each optimized map and draw the image.The change in the relative scale in the optimization process is shown in Figure 8.
Sensors 2024, 24, x FOR PEER REVIEW 11 of 14 0.958, respectively.According to the optimized scale factor, we can deduce the relative scale of each optimized map and draw the image.The change in the relative scale in the optimization process is shown in Figure 8.

Scale Adjustment Initialize
Optimized Relative Scale If the relative scale in the figure is 1, which means that the scale is the same as the real map.The relative scale before initialization is the scale of the map constructed by the purely visual SLAM relative to the real map, which can be considered as random.It can be seen from Figure 8 that the scale estimation of the VIP system converges faster and more accurately than that of the VI system.It is proved that the proposed vision-inertial pressure fusion framework is effective in restoring the monocular SLAM scale.If the relative scale in the figure is 1, which means that the scale is the same as the real map.The relative scale before initialization is the scale of the map constructed by the purely visual SLAM relative to the real map, which can be considered as random.It can be seen from Figure 8 that the scale estimation of the VIP system converges faster and more accurately than that of the VI system.It is proved that the proposed vision-inertial pressure fusion framework is effective in restoring the monocular SLAM scale.
We compared the mapping results of ORB-SLAM3-VI and ORB-SLAM3-VIP without scale alignment, as shown in Figure 9.If the relative scale in the figure is 1, which means that the scale is the same as the real map.The relative scale before initialization is the scale of the map constructed by the purely visual SLAM relative to the real map, which can be considered as random.It can be seen from Figure 8 that the scale estimation of the VIP system converges faster and more accurately than that of the VI system.It is proved that the proposed vision-inertial pressure fusion framework is effective in restoring the monocular SLAM scale.

Scale Adjustment Initialize
We compared the mapping results of ORB-SLAM3-VI and ORB-SLAM3-VIP without scale alignment, as shown in Figure 9.The prediction of the z-axis position by the two algorithms is shown in Figure 10.The z axis direction represents the direction of gravity in the world coordinate system.It can be found that the introduction of pressure sensor makes the depth estimation of the algorithm closer to the true value.
Sensors 2024, 24, x FOR PEER REVIEW 12 of 14 The prediction of the z-axis position by the two algorithms is shown in Figure 10.The z axis direction represents the direction of gravity in the world coordinate system.It can be found that the introduction of pressure sensor makes the depth estimation of the algorithm closer to the true value.

Track Prediction Results on Data Set AQUALOC
In this section, ATE RMSE is also used to evaluate the mapping trajectory, and the scale of the predicted trajectory is aligned with the true value without considering the results of the scale optimization.We conducted experiments to produce statistics for ATE RMSE values of different methods, as shown in Table 2, where VI represents the joint optimization of visual inertia, and VIP represents the joint optimization of visual and inertia pressure.We tested on the AQUALOC dataset and calculated the absolute trajectory error RMSE of each algorithm after scale alignment.The test on the AQUALOC dataset showed that IMU's estimation of motion is not accurate when the lack of visual constraint is too long, and it is easy to "fly" along the trajectory.After we lose visual tracking on the har-bor_04 and harbor_07 sequences, we no longer use the IMU for tracking.

Track Prediction Results on Data Set AQUALOC
In this section, ATE RMSE is also used to evaluate the mapping trajectory, and the scale of the predicted trajectory is aligned with the true value without considering the results of the scale optimization.We conducted experiments to produce statistics for ATE RMSE values of different methods, as shown in Table 2, where VI represents the joint optimization of visual inertia, and VIP represents the joint optimization of visual and inertia pressure.We tested on the AQUALOC dataset and calculated the absolute trajectory error RMSE of each algorithm after scale alignment.The test on the AQUALOC dataset showed that IMU's estimation of motion is not accurate when the lack of visual constraint is too long, and it is easy to "fly" along the trajectory.After we lose visual tracking on the harbor_04 and harbor_07 sequences, we no longer use the IMU for tracking.
From Table 2, we can see that the RMSE of the VIP system is 6.1-41.6%lower than that of the VI system.It is shown that the introduction of the pressure sensor has improved the mapping accuracy.

Conclusions
Based on ORB-SLAM-VI, we propose the ORB-SLAM-VIP algorithm for visual-inertial pressure fusion.First, we introduce IMU and pressure sensor models, including the corresponding relationship between pressure and depth, the IMU motion model and IMU pre integration.Then, we introduce how to carry out tight coupling optimization.We give a method to calculate the error term of the IMU and pressure sensors, and introduce the optimization process to the initialization, scale fine-tuning, tracking and mapping steps.Finally, we tested our method on the AQUALOC dataset and evaluated it from the perspective of monocular scale recovery and trajectory error, respectively.The experimental results show that, compared with ORB-SLAM-VI, ORB-SLAM-VIP can not only restore the scale more correctly, but can also improve the accuracy of mapping to a certain extent.The proposed method can be applied to improve the operational efficiency of naval mine countermeasures.

Figure 1 .
Figure 1.Factor graph of visual SLAM.The maximum posterior probability is: arg  ,

Sensors 2024 ,Figure 3 .
Figure 3.The process of SLAM algorithm based on vision-inertial pressure fusion.

Figure 3 .
Figure 3.The process of SLAM algorithm based on vision-inertial pressure fusion.

Figure 4 .
Figure 4. Factor graph representation for inertial pressure optimization.

Figure 4 .
Figure 4. Factor graph representation for inertial pressure optimization.
tor   = {  ,   ,    ,    }; deviation  conforms to random walk model, and th ation of adjacent frames is constrained by prior random walk residuals.The residual constrains the position and key frame pose   of the 3D map poin inertial residual constrains pose   , speed   and deviation   .The depth r constrains   in pose   = [  ,   ].

Figure 6 .
Figure 6.Factor graph representation for scale and gravity optimization.

Figure 6 .
Figure 6.Factor graph representation for scale and gravity optimization.

Figure 7 .
Figure 7. Initialization and scale refinement flow diagram.

Figure 7 .
Figure 7. Initialization and scale refinement flow diagram.

Figure 9 .
Figure 9.Comparison of scale unaligned estimate trajectory for harbor_01 underwater sequence.Figure 9. Comparison of scale unaligned estimate trajectory for harbor_01 underwater sequence.

Figure 9 .
Figure 9.Comparison of scale unaligned estimate trajectory for harbor_01 underwater sequence.Figure 9. Comparison of scale unaligned estimate trajectory for harbor_01 underwater sequence.

Figure 10 .
Figure 10.Prediction of the z-axis position for harbor_01 sequence.

Figure 10 .
Figure 10.Prediction of the z-axis position for harbor_01 sequence.

Table 2 .
Absolute trajectory error in harbor sequences.

Table 2 .
Absolute trajectory error in harbor sequences.