Research on Underwater Complex Scene SLAM Algorithm Based on Image Enhancement

Underwater images typically suffer from indistinct feature point information and abundant redundant information due to harsh imaging conditions. To address these degradation problems, we propose an improved VINS-MONO algorithm for underwater scenes. Specifically, we first use the FAST feature point extraction algorithm to improve the extraction speed. Then, the inverse optical flow method is used to improve the accuracy of feature tracking. At the same time, the several kinds of residual information in the back-end marginalization are extracted and marginalized separately, in order to improve the marginalization speed. Extensive experiments on the underwater dataset HAUD-Dataset and the public dataset EuRoC show that our approach is superior to the original VINS-MONO algorithm. In addition, the optimized algorithm handles the situation in which the feature point information is not obvious and the redundant information is complex in the underwater environment, which effectively improves performance on underwater images.


Introduction
With the rapid development of artificial intelligence, sensors, and related fields, mobile robotics, which combines these fields, has also seen rapid progress. Simultaneous Localization and Mapping (SLAM) has become an essential technology for mobile robots. Visual SLAM using a camera alone is not very effective in practical applications; visual-inertial SLAM, which adds an Inertial Measurement Unit (IMU), overcomes this shortcoming of purely visual SLAM.
In the field of Visual Inertial Odometry (VIO), VINS-MONO is a relatively mature algorithm that has been well studied [1][2][3][4]. Although the VINS-MONO algorithm performs relatively well in actual scenes and is one of the best visual-inertial fusion SLAM algorithms currently available, it still has some shortcomings when used underwater. The VINS-MONO algorithm uses the Harris corner feature extraction algorithm [5], the KLT optical flow feature tracking algorithm [6], and an IMU pre-integration algorithm in initialization, marginalization, loop closure detection, etc. To resolve the shortcomings caused by weak feature point information in complex underwater environments, we propose to optimize the feature extraction algorithm, the feature tracking algorithm, and the marginalization, so as to improve the underwater performance of the VINS-MONO algorithm. The aspects that need to be improved for the underwater environment are:

1. The KLT optical flow method is adopted for tracking and matching, so robustness and accuracy are poor in environments with weak texture and few key points.

2. The corner extraction algorithm adopts Harris corners. The algorithm uses Gaussian filtering, which makes corner extraction slow.

3. Marginalization: several types of residual information are put together for marginalization and optimization, which is costly.

Related Work
Tong et al. [1] proposed a tightly coupled, optimization-based visual-inertial SLAM algorithm, VINS-MONO, to which the authors also added loop closure detection and global optimization. Experiments with the VINS-MONO algorithm, together with the results reported in [1], show that VINS-MONO is more stable and more accurate than the OKVIS algorithm [10] on most data sets.
In 2019, Shan et al. [11] replaced the monocular camera with an RGB-D camera to increase the observation information, which solved the unobservability problem of the original algorithm; the result was called VINS-RGBD. In 2020, Zhao et al. [12] discussed the application of VINS-MONO in underwater environments: because the corner feature extraction method adopted by VINS-MONO may generate a large number of loop candidate points and feature matching may contain mismatches, not enough loops are detected underwater. They improved robustness to outliers by using the Dark Channel Prior (DCP) to enhance the image, so that more loops can be detected with VINS-MONO. In 2021, M. He, R. R. Rajkumar et al. [13] extended VINS-MONO with GNSS and other absolute positioning methods, as well as relative positioning methods based on the Kalman filter; the resulting Extended VINS-MONO algorithm achieves better accuracy and more precise positioning.
In 2021, Y. Wang, J. Wang et al. [14] proposed a constant filling method to solve the problem of missing image edges in the FAST corner detection algorithm and to shorten the corner detection time. In 2020, H. Zhang et al. [15] combined FAST corner detection with LK pyramid optical flow, which can not only detect feature points quickly but also improve the accuracy of sub-pixel calculation. Mao et al. [16] proposed a double-threshold algorithm to solve the threshold setting problem in the optimization of the Harris algorithm. M. Zhao et al. [17] proposed an adaptive parameter algorithm based on the Harris algorithm to solve the inaccurate corner detection caused by fixed Gaussian parameters. S. Han et al. [18] used a B-spline function instead of the Gaussian window function to improve the accuracy of corner points, together with pre-selection of candidate corner points to improve the real-time performance of the algorithm. Liu Zhenbin et al. [19] improved the initialization of the VINS-MONO algorithm by adding acceleration bias to the initialization optimization. In 2021, M. He and R. R. Rajkumar et al. proposed to add a thermal imager to their Extended VINS-MONO algorithm [13]; when the visible-spectrum camera performs poorly in bad lighting conditions, the thermal vision provided by the thermal imager can make up for these shortcomings [20]. In 2019, L. J. Chen et al. [21] used the VINS-MONO algorithm to test a UAV in a room without GPS signal. In the experiment, a self-positioning UAV was constructed by integrating an onboard computer, a camera and an IMU, and a comparative study was carried out to determine the robustness and reliability of the VINS-Mono state estimator and the UAV system under various flight velocities and environment feature settings. In 2018, TD Chen, H. Jian et al. [22] proposed an image pyramid method to track fast-moving targets.
In comparison with the dense optical flow method and the color feature method, their results show that the proposed method has several advantages, for example, less computation, better handling of occlusions, and the ability to detect and track fast-moving objects. Although the pyramid LK optical flow method [23] can deal with large motions, it has accuracy problems. In 2018, Z. Wang et al. [24] improved optical flow tracking accuracy by layering each frame into an image pyramid, calculating the optical flow at the top layer, using the result as the starting point for the next layer, and repeating this process down to the bottom image of the pyramid.

Proposed Method
In this paper, we replace the Harris corner feature extraction algorithm with the FAST corner feature extraction algorithm, replace the KLT optical flow with the inverse optical flow method, and accelerate the back-end marginalization, as shown in Figure 1. Specifically, we accelerate the marginalization by separating the marginalization of the pose from that of the other information. The accuracy and speed of the VINS-MONO algorithm are improved from these three aspects.


FAST Corners and Harris Corners
The FAST corner point primarily uses local differences in image pixel gray levels to detect points of interest, and can do so quickly. The corner points extracted by FAST are selected based on the intensity of the pixels around the candidate feature point. For a circle around the candidate, if the intensity of the pixels on the circle differs significantly from the intensity of the pixel at the center of the circle, then the center is a key point. Empirically, a circle with a radius of three pixels obtains good results and keeps the calculation efficient when selecting key points. If, for more than 12 of the 16 points on the circle, the gray difference from the central point is greater than a threshold, the center is a candidate corner point, and the optimal corner points are then selected by non-maximum suppression. Non-maximum suppression generally retains, among adjacent candidates, the corner with the largest gray difference between the circle center and its ring as the best corner.
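The segment test described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function name is hypothetical, and the check for n contiguous ring pixels (n = 12 here) follows the common FAST formulation of the "more than 12 of 16 points" criterion in the text.

```python
import numpy as np

# Offsets of the 16 pixels on a Bresenham circle of radius 3 around the candidate.
CIRCLE = [(0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2), (1, 3),
          (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1), (-2, -2), (-1, -3)]

def is_fast_corner(img, x, y, t=20, n=12):
    """Segment test: (x, y) is a corner if at least n of the 16 circle
    pixels are all brighter than I_p + t or all darker than I_p - t."""
    p = int(img[y, x])
    ring = np.array([int(img[y + dy, x + dx]) for dx, dy in CIRCLE])
    brighter = ring > p + t
    darker = ring < p - t
    # Look for a run of n contiguous pixels on the (wrapped) ring.
    for mask in (brighter, darker):
        wrapped = np.concatenate([mask, mask])
        run = 0
        for v in wrapped:
            run = run + 1 if v else 0
            if run >= n:
                return True
    return False

# Tiny synthetic example: a bright dot on a dark background is detected.
img = np.zeros((7, 7), dtype=np.uint8)
img[3, 3] = 200
print(is_fast_corner(img, 3, 3))  # True
```

Because the test only compares pixel intensities on a small ring, no filtering pass over the image is needed, which is the source of the speed advantage over Harris discussed next.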
The Harris corner detector is a feature extraction algorithm based on gray-level images; it employs Gaussian filtering, which makes its operation comparatively slow. The principle is that corner points have large gradients in both the horizontal and vertical directions, edge points have a large gradient in either the horizontal or the vertical direction, and other points have small gradients in both directions. Therefore, once the gradients are computed, corner points can be determined based on this constraint.
The Harris feature detection method uses a small window near the feature point and observes the change of the intensity value when the window moves in a certain direction. Assuming a displacement (u, v), the intensity change can be written as a weighted sum of squared differences:

E(u, v) = Σ_{x,y} w(x, y) [I(x + u, y + v) − I(x, y)]²

Therefore, the steps of Harris feature detection are: first, find the direction in which the average intensity value changes most obviously, and then check whether the intensity value also changes greatly in the perpendicular direction; if it does, the point is a corner.

The above expression can be approximated by a first-order Taylor expansion:

E(u, v) ≈ Σ_{x,y} w(x, y) [I_x u + I_y v]²

In matrix form:

E(u, v) ≈ [u v] M [u v]^T, where M = Σ_{x,y} w(x, y) [[I_x², I_x I_y], [I_x I_y, I_y²]]

The corner response is obtained from the eigenvalues λ1, λ2 of M:

R = det(M) − k (trace(M))² = λ1 λ2 − k (λ1 + λ2)²

The parameter k is a constant coefficient that only regulates the shape of the response function and is used to adjust the behaviour of the results; it is typically taken in the range 0.04–0.06. As described above, the Harris corner feature extraction method adopts Gaussian filtering, which slows down feature extraction, whereas the FAST corner feature extraction method effectively compensates for this extraction-speed problem and performs better in real-time environments.
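For contrast with FAST, the Harris response above can be sketched directly in NumPy/SciPy. This is a minimal illustrative sketch under the standard formulation (function name and window parameters are ours, not the paper's); note that the Gaussian filtering of the gradient products is exactly the step that makes Harris slower.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def harris_response(img, k=0.04, sigma=1.0):
    """Harris response R = det(M) - k * trace(M)^2, where M is the
    Gaussian-weighted structure tensor of the image gradients."""
    img = img.astype(np.float64)
    Iy, Ix = np.gradient(img)
    # Gaussian weighting of the gradient products (the filtering step that
    # makes Harris slower than the FAST segment test).
    Sxx = gaussian_filter(Ix * Ix, sigma)
    Syy = gaussian_filter(Iy * Iy, sigma)
    Sxy = gaussian_filter(Ix * Iy, sigma)
    det = Sxx * Syy - Sxy * Sxy
    trace = Sxx + Syy
    return det - k * trace * trace

# A white square on black: positive responses appear at its four corners,
# negative responses along its edges.
img = np.zeros((32, 32))
img[8:24, 8:24] = 1.0
R = harris_response(img)
```

Corner points are then taken where R is large and positive, typically after non-maximum suppression, mirroring the eigenvalue discussion above.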

Optical Flow and Inverse Optical Flow
The LK optical flow method is representative of sparse optical flow methods. It relies on a premise, the gray-invariance assumption: the gray value of the same spatial point is unchanged in each image.
At time t, the gray level of the pixel at (x, y) (where x and y are the pixel coordinates in the window) can be written as I(x, y, t). When the pixel moves to (x + dx, y + dy) at time t + dt, the assumption that the pixel gray value remains unchanged gives:

I(x + dx, y + dy, t + dt) = I(x, y, t)

Expanding the left side to first-order Taylor terms:

I(x + dx, y + dy, t + dt) ≈ I(x, y, t) + (∂I/∂x) dx + (∂I/∂y) dy + (∂I/∂t) dt

Since the gray level is assumed unchanged, we obtain:

(∂I/∂x) dx + (∂I/∂y) dy + (∂I/∂t) dt = 0

Writing dx/dt = u, dy/dt = v, ∂I/∂x = I_x, ∂I/∂y = I_y, and ∂I/∂t = I_t, and dividing by dt, this becomes, in matrix form:

[I_x I_y] [u v]^T = −I_t

Collecting this constraint over all pixels in a window yields an over-determined linear system, from which the motion velocities u and v of pixels between images can be obtained by least squares.
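The windowed least-squares solution for (u, v) described above can be sketched as follows; this is an illustrative single-point, single-level solver (function name and window size are ours), not the KLT implementation used in VINS-MONO.

```python
import numpy as np

def lk_flow_at(prev, curr, x, y, win=2):
    """Solve the Lucas-Kanade system for the flow (u, v) of the pixel
    (x, y), stacking the constraint I_x u + I_y v = -I_t over every
    pixel in a (2*win+1)^2 window."""
    prev = prev.astype(np.float64)
    curr = curr.astype(np.float64)
    Iy, Ix = np.gradient(prev)
    It = curr - prev
    sl = (slice(y - win, y + win + 1), slice(x - win, x + win + 1))
    A = np.stack([Ix[sl].ravel(), Iy[sl].ravel()], axis=1)  # stacked [I_x I_y]
    b = -It[sl].ravel()                                     # stacked -I_t
    # Least-squares solution of A [u v]^T = b.
    uv, *_ = np.linalg.lstsq(A, b, rcond=None)
    return uv  # (u, v)

# Synthetic test: an intensity ramp shifted one pixel to the right
# should yield a flow of (u, v) = (1, 0).
xx = np.arange(16, dtype=np.float64)
prev = np.tile(xx, (16, 1))          # I(x, y) = x
curr = np.tile(xx - 1.0, (16, 1))    # same pattern moved +1 px in x
u, v = lk_flow_at(prev, curr, 8, 8)
```

In weak-texture underwater scenes the matrix A becomes nearly rank-deficient, which is precisely why the tracking robustness discussed in this paper degrades.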
When the camera moves too fast, the direct calculation of the single-layer optical flow may cause local extreme values due to excessive changes. It is necessary to scale the image through pyramid optical flow to improve this situation. Take the original image as the bottom layer of the pyramid and scale the image one layer up to achieve a pyramid shape, as shown in Figure 2.
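The pyramid construction described above can be sketched as follows; this is a simplified illustration (the function name is ours, and 2x2 block averaging stands in for the usual blur-and-subsample step).

```python
import numpy as np

def build_pyramid(img, levels=3):
    """Build an image pyramid: the original image is the bottom layer
    and each higher layer halves the resolution, so large motions at
    the bottom become small motions at the top."""
    pyramid = [img.astype(np.float64)]
    for _ in range(levels - 1):
        prev = pyramid[-1]
        h, w = prev.shape[0] // 2 * 2, prev.shape[1] // 2 * 2
        cropped = prev[:h, :w]
        # 2x2 block average as a cheap stand-in for blur + subsample.
        half = (cropped[0::2, 0::2] + cropped[1::2, 0::2] +
                cropped[0::2, 1::2] + cropped[1::2, 1::2]) / 4.0
        pyramid.append(half)
    return pyramid

img = np.random.rand(64, 48)
pyr = build_pyramid(img, levels=3)
```

The flow estimated at the top (coarsest) layer is then propagated downward as the initial guess for each finer layer, which is what lets the pyramid handle fast camera motion.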
The forward optical flow method computes the matrix H = J·J^T (J is the Jacobian matrix) for the least-squares solution at each iteration, which causes a large amount of calculation. The inverse optical flow method reverses the direction of the forward optical flow: forward optical flow tracks from a feature point in one image (denoted as X) to its position in the next image (denoted as Y) as the camera moves, while inverse optical flow tracks from image Y back to image X, that is, from the point after the motion to the point before the motion. In inverse optical flow, since X belongs to the image before the motion and does not move, the H matrix does not depend on the motion; hence H is constant when calculating the increment of the movement in each iteration. The H matrix only needs to be calculated once, in the first iteration, which greatly reduces the amount of calculation.
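The precomputation at the heart of the inverse method can be illustrated with a translation-only, inverse-compositional Gauss-Newton step. This is a sketch under our own naming, not the paper's code: the Jacobian J and H = J^T J are built from the fixed template (image X) once, so each subsequent iteration is only a matrix-vector product.

```python
import numpy as np

def inverse_lk_step(template, image_window, H_inv, J):
    """One inverse-compositional Gauss-Newton step for a pure translation.
    H_inv and J come from the fixed template, so they are reused in every
    iteration (and every frame) instead of being rebuilt."""
    err = (image_window - template).ravel()
    dp = H_inv @ (J.T @ err)   # increment computed with the precomputed H
    return -dp                 # inverse composition flips the update sign

# Precomputation (done once): Jacobian and H = J^T J from the template.
xx = np.arange(16, dtype=np.float64)
template = np.tile(xx, (16, 1))        # T(x, y) = x
Ty, Tx = np.gradient(template)
J = np.stack([Tx.ravel(), Ty.ravel()], axis=1)
H = J.T @ J
H_inv = np.linalg.pinv(H)              # pinv: Ty is zero here, so H is singular

# Per-iteration work is now just one matrix-vector product.
image = np.tile(xx - 1.0, (16, 1))     # template shifted +1 px in x
u, v = inverse_lk_step(template, image, H_inv, J)
```

In the forward formulation, J and H would have to be recomputed from the warped current image at every iteration; moving them outside the loop is exactly the saving described above.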

Marginalization Acceleration
If the change in the camera pose is calculated only from two frames, the computation is fast but the accuracy is low; if a global optimization method (such as Bundle Adjustment [25]) is adopted, the accuracy is high but the efficiency is low. Therefore, the sliding window method is introduced, which optimizes a fixed number of frames at a time and thus ensures both accuracy and efficiency. Since it is a sliding window, new image frames come in and old image frames leave during sliding. Marginalization is designed to make good use of the frames that leave: it deletes the old images but retains the information they carry, such as prior information and IMU information, converting it into a prior that is encapsulated and then added into the nonlinear optimization. Assume the state to be marginalized is x2 and the state to be retained is x1. The incremental equation Hδx = b becomes:

[[H11, H12], [H21, H22]] [δx1, δx2]^T = [b1, b2]^T

The marginalization method uses the Schur complement to solve the equation; eliminating δx2, the incremental equation for the retained state is derived as:

(H11 − H12 H22⁻¹ H21) δx1 = b1 − H12 H22⁻¹ b2

The equivalent prior after marginalization is therefore H' = H11 − H12 H22⁻¹ H21 and b' = b1 − H12 H22⁻¹ b2. The aim of marginalization acceleration is to first marginalize the parts other than the camera poses, and then marginalize the camera poses. The reason for this two-step process is that the amount of pose-related information differs greatly from the rest, so handling the two groups in separate threads accelerates the marginalization.
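The Schur complement step above can be sketched numerically as follows (an illustrative NumPy sketch with hypothetical names; the key property is that marginalizing x2 leaves the solution for x1 unchanged):

```python
import numpy as np

def marginalize(H, b, keep, marg):
    """Schur complement: eliminate the states `marg` from H dx = b and
    return the prior (H', b') acting only on the states `keep`."""
    H11 = H[np.ix_(keep, keep)]
    H12 = H[np.ix_(keep, marg)]
    H22 = H[np.ix_(marg, marg)]
    H22_inv = np.linalg.inv(H22)
    H_prior = H11 - H12 @ H22_inv @ H12.T   # H21 = H12^T for symmetric H
    b_prior = b[keep] - H12 @ H22_inv @ b[marg]
    return H_prior, b_prior

# Toy symmetric system: marginalizing x2 must not change the solution for x1.
A = np.array([[4.0, 1.0, 0.5],
              [1.0, 3.0, 0.2],
              [0.5, 0.2, 2.0]])
rhs = np.array([1.0, 2.0, 3.0])
full = np.linalg.solve(A, rhs)                      # solve for (x1a, x1b, x2)
Hp, bp = marginalize(A, rhs, keep=[0, 1], marg=[2])
reduced = np.linalg.solve(Hp, bp)                   # solve for (x1a, x1b) only
```

Because the inversion cost is concentrated in H22, splitting the marginalization into two passes over differently sized blocks, as the paper proposes, changes how much work each pass (and each thread) must do.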

Experiments
In this paper, we compare the VINS-MONO algorithm with the improved VINS-MONO algorithm on the public dataset EuRoC and the underwater dataset HAUD-Dataset. By printing the feature tracking speed and the marginalization speed, and by using the EVO trajectory evaluation tool, it is easy to see where the algorithm has been improved. EVO is a trajectory assessment tool for visual odometry and SLAM problems; its core functionality is plotting the trajectory of the camera and evaluating the error of the estimated trajectory against the ground truth. The absolute pose error (APE), often used as the absolute trajectory error, compares the estimated trajectory with the reference trajectory and computes statistics over the entire trajectory, which makes it suitable for testing the global consistency of the trajectory. The sequence_03.bag dataset in the underwater HAUD-Dataset is taken as an example to compare the trajectory errors, as shown in Figure 5.
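The APE statistic used here reduces, for the translation part, to the RMSE of the per-pose position errors. The following is a simplified stand-in sketch (the function name is ours, and it assumes the trajectories are already time-associated and aligned, steps that the EVO tool performs internally):

```python
import numpy as np

def ape_rmse(est, ref):
    """Translation-part absolute pose error: RMSE of the per-pose
    Euclidean errors between estimated and reference positions
    (trajectories assumed associated and aligned already)."""
    err = np.linalg.norm(est - ref, axis=1)      # per-pose error norms
    return float(np.sqrt(np.mean(err ** 2)))     # RMSE statistic

ref = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
est = ref + np.array([0.0, 0.1, 0.0])            # constant 0.1 m offset
rmse = ape_rmse(est, ref)
```

Because the statistic is global over the whole trajectory, a single large drift segment dominates it, which is why APE is used here to test global consistency rather than local drift.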

Accuracy Comparison of the Algorithms in Public Dataset EuRoC
The KLT pyramid optical flow tracking algorithm has poor robustness and accuracy in environments with weak texture and few key points. Therefore, this paper adopts the inverse optical flow method to replace the KLT pyramid optical flow feature tracking algorithm used in the original algorithm, in order to improve the accuracy of the algorithm. The two algorithms were run on all of the data sets provided by the EuRoC dataset, and the Root-Mean-Square Error (RMSE) was used to compare the accuracy of the VINS-MONO algorithm with that of the optimized VINS-MONO algorithm, as shown in Tables 1-4.


Comparison of the Algorithm's Corner Extraction Speed in the Public Dataset EuRoC
In terms of feature extraction speed, the Harris corner extraction algorithm adopts Gaussian filtering, which reduces the speed of corner extraction. Therefore, the FAST corner extraction algorithm is adopted in this paper to replace the Harris corner extraction algorithm, in order to improve the speed of feature extraction. The speed of the original algorithm and of the improved algorithm is compared through the results of running on the EuRoC dataset, as shown in Table 5.

The Algorithm Compares the Back-End Marginalization Speed in the Public Dataset EuRoC
To demonstrate the acceleration of marginalization, the algorithm prints out the marginalization time, and the marginalization times of the original algorithm and the improved algorithm are compared by running the EuRoC data set, as shown in Table 6.

Figure 6 shows the experimental scene of the underwater data set. The KLT pyramid optical flow tracking algorithm has poor robustness and accuracy in environments with weak texture and few key points; therefore, this paper adopts the inverse optical flow method to replace the KLT pyramid optical flow feature tracking algorithm in the original algorithm to improve accuracy. In terms of feature extraction speed, the Harris corner extraction algorithm adopts Gaussian filtering, which reduces the speed of corner extraction; therefore, the FAST corner extraction algorithm is adopted to replace the Harris corner extraction algorithm and improve the extraction speed. The underwater data sets sequence_03.bag, sequence_05.bag, sequence_06.bag and sequence_07.bag were selected, and the VINS-MONO algorithm and the optimized algorithm were run on these four data sets to compare the accuracy (as shown in Table 7), the feature point extraction speed (as shown in Table 8) and the marginalization speed (as shown in Table 9). The accuracy and speed of the algorithm are greatly improved after optimization. In addition, the accuracy of the optimized algorithm and the VINS-Fusion algorithm is compared on these four data sets (as shown in Table 10). The two algorithms were run on the sequence_03.bag dataset provided in HAUD-Dataset, and RMSE, rotation error and translation error were used to compare the accuracy of the optimized VINS-MONO algorithm and the VINS-MONO algorithm, as shown in Figures 7-9.

Accuracy Comparison and Speed Comparison of Algorithms in Underwater HAUD-Dataset
The two algorithms were run on the sequence_05.bag dataset provided in HAUD-Dataset, and RMSE, rotation error and translation error were used to compare the accuracy of the optimized VINS-MONO algorithm and the VINS-MONO algorithm, as shown in Figures 10-12.


Discussion and Analysis
The comparison experiments between the VINS-MONO algorithm and the optimized algorithm were carried out on the open dataset EuRoC and the underwater dataset HAUD-Dataset. Firstly, the accuracy of the original algorithm was compared with that of the improved algorithm on the open dataset EuRoC (as shown in Tables 11 and 12). On most EuRoC data sets, without loop closure, the accuracy of the optimized algorithm is higher than that of the original algorithm, and the overall accuracy is improved by 0.8 percent; with loop closure, the accuracy of the optimized algorithm is 0.2 percent higher than that of the original algorithm. From the data in the tables, it can be concluded that the use of inverse optical flow significantly increases the number of effective matching points and eliminates some outliers during triangulation, thereby improving the accuracy. In addition, on the MH_04_difficult data set, the accuracy of the optimized algorithm in dark scenes and under fast movement is significantly improved, with a nine percent increase under the loop-closure condition.
In the open dataset EuRoC, the original algorithm and the improved algorithm are compared in terms of feature extraction speed (as shown in Table 13), and the overall average time is shortened by 1.5 ms.
According to the data in the table, it can be concluded that using the FAST feature extraction algorithm to replace the Harris feature extraction algorithm can improve the speed of feature extraction and shorten the extraction time.
Finally, the marginalization time of the optimized algorithm is shortened by 2384 ms, on average, compared with the original algorithm, as shown in Table 14.
By comparing the accuracy of the original algorithm and the optimized algorithm on the HAUD-Dataset (as shown in Table 15), which has weak texture and fewer key points, the overall accuracy is found to improve by 4.2 percent. From the comparison of the accuracy data of the two algorithms on the underwater data set, it can be concluded that the optimized algorithm has higher accuracy and is more suitable for complex underwater scenes.
In the underwater dataset HAUD-Dataset, the original algorithm and the improved algorithm are compared in terms of feature extraction speed, and the overall average time is shortened by 1.0 ms, as shown in Table 16.
According to the comparison of feature point extraction speed, it can be concluded that the optimized algorithm is faster in the underwater complex situation.
In the underwater dataset HAUD-Dataset, the marginalization time of the optimized algorithm is shortened by 5892 ms on average compared with the original algorithm, as shown in Table 17. Comparing the accuracy of the optimized algorithm with the VINS-Fusion algorithm in the public dataset EuRoC and the underwater dataset HAUD-Dataset, it can be concluded that the accuracy of the optimized algorithm is improved by 1.6% and 3.75%, respectively, as shown in Tables 18 and 19. From the analysis of the experimental data, it can be concluded that on the open dataset EuRoC and the underwater dataset HAUD-Dataset, the optimized algorithm is superior to the original algorithm in terms of accuracy, feature point extraction speed and marginalization speed.

Conclusions
The VINS-MONO algorithm performs well in visual-inertial SLAM; however, there are still some shortcomings in feature extraction speed and recognition accuracy in complex underwater environments. The purpose of this study is to address the current shortcomings of the VINS-MONO algorithm, to put forward solutions that optimize the VINS-MONO algorithm, and to verify the feasibility of the solutions through comparative tests.
In this paper, the first measure of optimization of VINS-MONO is to optimize the feature extraction speed. The FAST corner feature extraction algorithm is used to replace the Harris corner feature extraction algorithm, which makes up for the disadvantage of slow feature extraction speed, so that the VINS-MONO algorithm has a significant improvement in feature extraction speed. The second measure is that we use an inverse optical flow method, rather than forward optical flow, which improves the recognition accuracy of the algorithm and greatly reduces the amount of calculation. The third measure is to optimize several types of residual information for the back-end marginalized part. In the original algorithm, these types of residual information were marginalized and optimized together, while in this paper, different strategies were used to reserve and optimize different residual information, thus improving the speed of marginalization.
In this paper, the public dataset EuRoC and the underwater dataset HAUD-Dataset are used for comparative experiments, which show that the optimized algorithm offers a good improvement in feature extraction speed, recognition accuracy and marginalization speed. In the future, we will compare other visual-inertial SLAM algorithms with the VINS-MONO algorithm in all aspects, identify the remaining shortcomings of the VINS-MONO algorithm, and optimize them. At the same time, we will apply the optimized algorithm in practical application research, find the remaining problems in the algorithm, and address them. Furthermore, in future experiments, the optimized algorithm will be applied to more scenes, from which its shortcomings under more constraints can be identified; based on these shortcomings, possible solutions will be proposed and the algorithm will be studied further.