LiDAR-Visual-Inertial Odometry Based on Optimized Visual Point-Line Features

: This study presents a LiDAR-Visual-Inertial Odometry (LVIO) based on optimized visual point-line features, which can effectively compensate for the limitations of a single sensor in real-time localization and mapping. Firstly, an improved line feature extraction in scale space and constraint matching strategy, using the least square method, is proposed to provide a richer visual feature for the front-end of LVIO. Secondly, multi-frame LiDAR point clouds were projected into the visual frame for feature depth correlation. Thirdly, the initial estimation results of Visual-Inertial Odometry (VIO) were carried out to optimize the scanning matching accuracy of LiDAR. Finally, a factor graph based on Bayesian network is proposed to build the LVIO fusion system, in which GNSS factor and loop factor are introduced to constrain LVIO globally. The evaluations on indoor and outdoor datasets show that the proposed algorithm is superior to other state-of-the-art algorithms in real-time efﬁciency, positioning accuracy, and mapping effect. Speciﬁcally, the average RMSE of absolute trajectory in the indoor environment is 0.075 m and that in the outdoor environment is 3.77 m. These experimental results can prove that the proposed algorithm can effectively solve the problem of line feature mismatching and the accumulated error of local sensors in mobile carrier positioning.


Introduction
Multi-sensor fusion localization technology based on Simultaneous Localization and Mapping (SLAM) is a fundamental technology in the field of high-precision localization of mobile carriers [1]. The SLAM-based multi-sensor fusion system applied to mobile carriers can be divided into two core parts: the front-end, and the back-end. The function of the front-end is used to analyze the environmental fingerprint information collected by the sensors, in order to estimate the positional information of the mobile carrier in time. In addition, the change in the surrounding environment with the movement of the carrier is restored. The function of the back-end is used to obtain the final positioning results by iteratively optimizing the position estimates obtained from the front-end analysis. Depending on the sensors used in the front-end, it can be divided into methods mainly based on LiDAR and vision [2,3]. Engineers and researchers in related fields have conducted a lot of research in both directions and produced a series of research-worthy results.
The main vision-based SLAM approach, namely visual odometry (VO), has long dominated the SLAM technology field due to the lower cost of the camera compared with LiDAR. However, pure monocular visual SLAM systems cannot recover metric scales. Thus, there is a growing trend to utilize low-cost inertial measurement units to assist monocular vision systems, which is called visual-inertial odometry (VIO). Monocular VIO provides high-quality self-motion simulation by using monocular cameras and inertial measurement unit (IMU) measurements, which has significant advantages in terms of size, cost, and power. Based on the method of feature association, visual SLAM can be classified into feature point method and direct method. The feature point-based method VIO accomplishes the inter-frame feature constraint by extracting and matching image feature points [4][5][6]. Therefore, rich environmental texture is required to ensure that the threshold of the number of effective feature points required for feature tracking is reached. Tracking loss of feature points is prone to occur in weak texture environments such as parking lots and tunnels, which in turn affects localization accuracy and real-time performance. The theoretical basis of the direct method-based VIO is the assumption of constant grayscale [7,8]. It only needs to capture environmental features by the changes in the grayscale image to establish constraints, which has a better real-time performance. Nevertheless, the tracking accuracy is greatly affected by environmental illumination changes. Therefore, stable and rich line feature models are required to be introduced into the front-end to provide stable and accurate feature constraints for visual back-end state estimation. In 2018, He et al. proposed PL-VIO based on point-line feature fusion, but too many optimization factors greatly limited the real-time performance in practical tests [9]. In 2020, Wen et al. proposed PLS-VIO to optimize the 6-DOF pose by minimizing the objective function and improving the line feature matching filtering strategy to reduce the probability of mismatching [10]. Although the VIO based on point-line features has a positive effect on the number of features [11,12], it still cannot solve the scale uncertainty problem of monocular cameras. The development of VIO in practical applications still has certain limitations.
As another important technical means of SLAM-based localization technology, SLAM mainly based on LiDAR is also widely used in the industry for its high resolution, high accuracy, and high utilization of spatial features. In 2016, Google proposed Cartographer, a 2D LiDAR based on particle filtering and graph optimization. In 2017, Zhang et al. proposed the LOAM for the first time, which uses the curvature of the LiDAR point cloud to register the effective point cloud features as planar points and edge points [13]. In 2018, Shan et al. proposed LeGO-LOAM based on LOAM, which uses the ground plane feature point cloud to further filter outliers from the scanned point cloud and improve the LOAM frame [2]. In 2020, Shan et al. further introduced the LIO-SAM algorithm based on the previous work, which uses IMU pre-integrated measurements to provide initial pose estimation for laser odometry [14]. In addition, a Bayesian network-based factor graph optimization framework is proposed, in which the global position is constrained by adding GPS factors, and an incremental smooth global voxel map is established. These schemes provide technical feasibility for the high-precision positioning by fusing LiDAR with other sensors.
However, due to the inherent shortcomings of the main sensing sensors, such as the limited scanning angle of LiDAR and the sensitivity of the mainly vision-based methods to light variations, these methods can hardly show excellent robustness in real-world applications. To further improve the localization performance, LiDAR-Visual-Inertial Odometry, as a multi-sensor fusion localization method, has become a research focus of SLAM with its advantages of multi-sensor heterogeneity and complementarity.
The existing LVIO multi-sensor fusion strategy can be described from the front-end and back-end perspectives. First, the front-end fusion strategy of LVIO is introduced. Generally, LiDAR acts as a feature depth provider for monocular VO as a way to improve the scale ambiguity of visual features. Meanwhile, VO performs state estimation from the extracted visual features, which is provided as the initial state for LiDAR scan matching. Therefore, the quantity and quality of visual features are closely related to the precision of state estimation of the fusion system. In existing fusion systems, the features extracted by camera are mainly point features [15,16]. Xiang et al. proposed a combination of fisheye camera and LiDAR based on a semantic segmentation model, which improved the confidence of the depth of visual features in the driving environment of unmanned vehicles [15]. Chen et al. proposed a method to construct a loopback constraint for LiDARvisual odometry by using the Distributed bag of Words (DboWs) model in the visual subsystem, although, without introducing IMU sensors to assist in the initial positional estimation [16]. In 2021, Lin et al. proposed R2LIVE to incorporate IMU into the fused localization system, in which the LiDAR odometry is used to establish depth constraints for VIO [17]. Although the above-mentioned algorithms exhibit superior performance to the VIO based on point features, it is still difficult to extract rich and effective features in weak texture environments, which leads to the failure in LiDAR scan matching. Therefore, additional feature constraints on the LiDAR need to be added with line features that are more robust to environmental texture and luminosity variations. Visual SLAM based on point-line features has been studied but not widely applied to LVIO systems in recent years [18,19]. In 2020, Huang et al. first proposed a LVIO based on a robust point and line depth extraction method, which greatly reduces the three-dimensional ambiguity of features [18]. Zhou et al. introduced line features in the direct method-based VIO to establish data association [19]. The above-mentioned algorithms provide technical feasibility for LVIO based on point-line features.
From the perspective of the back-end fusion strategy, LVIO can be classified into two categories based on different optimization algorithms: filter-based methods and factor graph methods. Although the filtering method is a traditional technology to realize multisensor fusion, its principle defect of frequent reconstruction of increasing or decreasing sensors limits its application in LVIO [20]. As an emerging method in recent years, the factor graph method can effectively improve the robustness of SLAM system when a single sensor fails because of its plug-and-play characteristics. Therefore, it is widely applied to deal with such heterogeneous aperiodic data fusion problems [21]. In addition, since LVIO is in the local frame, there are inherent defects such as accumulated errors. Thus GNSS measurements need to be introduced for global correction [22][23][24] to realize local accuracy and global drift-free position estimation, which makes full use of their complementarity [24]. The research on adding GNSS global constraints into the local sensor fusion framework are as follows: Lin et al. modified the extended Kalman filter to realize a loose coupling between GPS measurements and LiDAR state estimation, but there is a large single linearization error to be solved [17]. In 2019, Qin et al. proposed VINS-Fusion, which uses nonlinear optimization strategies to support Camera, IMU, and GNSS [25], but it assumes that GNSS is continuous and globally convergent, which is inconsistent with reality. In any case, these strategies presented above provide numerous reliable ideas.
Generally speaking, we can conclude that the existing LVIO fusion system has two problems that deserve further exploration. First, on the premise of ensuring the real-time performance, more abundant feature constraints are needed to improve the pose estimation accuracy of LVIO. Secondly, global constraints are needed to globally optimize the LVIO local pose estimation results. To address these issues, this study presents a LiDAR-Visual-Inertial Odometry based on optimized visual point-line features. First of all, an improved line feature extraction in scale space and constraint matching strategy based on the square method are proposed, which provides richer visual feature for the front-end of LVIO. Secondly, multi-frame LiDAR point clouds were projected into the visual frame for feature depth correlation, which improves the confidence of monocular visual depth estimation. At the same time, the initial visual state estimation can be used to optimize the scan matching of LiDAR. Finally, a factor graph based on the Bayesian network was used to build the LVIO fusion system, in which the GNSS factor and loop factor are introduced to constrain LVIO globally, to achieve locally accurate and globally drift-free position estimation in the complex environment.

System Overview
The general framework of the LiDAR-Visual-Inertial Odometry based on optimized visual point-line features proposed in this study is shown in Figure 1. The system consists of the front-end of LiDAR-Visual-Inertial Odometry tight combination and the back-end of factor graph optimization. In the front-end of our algorithm, the visual odometry not only extracts point features, but also further extracts line features in the improved scale space and performs geometric constraint matching on them, which improves the number of features in the weak texture environment. Then, the feature depth provided by LiDAR point clouds performed a role in correlating the depth of monocular visual features. IMU pre-integration provides all necessary initial values, including attitude, velocity, acceleration bias, gyroscope bias, and three-dimensional feature position, for completing the initial state estimation after time alignment with a camera. If VIO initialization fails, the IMU pre-integration value is used as the initial assumption to improve the robustness of the fusion system in the texture-free environment.
After the front-end initialization is successfully realized, the back-end optimizes the factor graph by using the estimated residual of each sensor's state. IMU pre-integration, visual residual and lidar residual were added to the factor graph as local state factors for maximum a posteriori estimation. In order to further correct the cumulative error of local state estimation, the residual of GNSS single-point positioning measurements was used as the global positioning factor to add to the factor graph. Besides, when the system detects the path loop, the loop factor will be added to the factor graph to participate in the nonlinear optimization and obtain the optimal global pose estimation.

Line Feature Extraction
Commonly used line feature extraction algorithms include Hough [26], LSWMS [27], EDLine [28], and LSD [29]. Weighing factors such as accuracy, real-time performance, and the need for parameter adjustment, we chose LSD to extract line features. According to the bottom parameter optimization strategy, we modified an improved LSD algorithm, and a minimum geometric constraint method to realize line feature constraint matching.
Given an N-layer Gaussian pyramid as the scale space of LSD line features, the scale ratio of images in each layer is defined to reduce or eliminate the sawtooth effect in images. After scaling the image s times, a downsampling was performed, and then the gradient was calculated for all pixels in the new image obtained after downsampling. By traversing the image and getting the gradient values of all pixels, the pixel gradient rectangle can be merged according to the density of same-sex points to obtain a rectangle-like line segment l. The density d of homogeneous points in the rectangle can be expressed as: where k is defined as the total number of pixels in the rectangle, and D is the density threshold of parity points. Different from the hypothesis in [12], a low co-location density threshold in the outdoor complex texture environment will extract a large number of invalid line features. Therefore, it is necessary to re-optimize the strategy according to the underlying parameters and select the following combinations near the original parameters (s = 0.8, D = 0.7), for real-time and accuracy experiments. We measured the positioning accuracy by the root mean square error of absolute trajectory error (APE_RMSE). The accuracy and real-time performance of different values of s and D on the Hong Kong 0428 dataset are shown in Figure 2. The Monte Carlo method was used in this experiment. Within the parameter range that ensures the stable operation of the line feature extraction algorithm, we conducted three experiments. First of all, as shown in Figure 2a, under the premise that the original scaling times s = 0.8, 100 random numbers were selected in the range of D ∈ (0.3, 0.9) to carry out the experiment of density threshold selection. Secondly, as shown in Figure 2b, we kept the original density threshold D = 0.7, and then selected 100 random numbers in the range of s ∈ (0.4, 0.9), which is to select the appropriate range of scaling times s. Finally, as shown in Figure 2c, within the appropriate parameter range obtained in the previous experiments, 100 groups of parameter combinations were randomly selected for line feature extraction to obtain the optimal value. According to Figure 2c it can be seen that the operation time is shorter when the value of (s, D) is around (0.5, 0.6) or around (0.6, 0.6). Furthermore, we compared the accuracy of the above two groups of parameters. It can be concluded that the accuracy of line feature extraction of the former group is slightly higher than that of the latter group. Considering the accuracy and real-time, we chose s = 0.5, D = 0.6 as the parameter combination for our system.

Inter-Frame Feature Constraint Matching
Different from the neighboring line merging of different line features within the same frame in feature extraction, the least square method-based line feature constraint matching is for the same line feature pair whose angle and distance change between two consecutive frames. Considering the angle and translation changes in the same line feature pair during the carrier movement, a minimized sparse matrix model can be constructed to ensure the minimum total error in matching the line features extracted between the front and back frames.
Given a line l W = n W T , v W T T ∈ R 6 extracted from the world coordinate system, where n W , v W ∈ R 3 is the normal vector and direction vector, respectively, of l W , let the transformation matrix from the world frame to camera frame be T W C = R W C , t W C , with R W C , t W C denoting the rotation and translation, respectively, then l W can be expressed in Plücker coordinates within the camera frame as: It can be seen that the matching of line feature pairs in the camera frame is a 6-DOF parametric matching problem. In order to improve the accuracy and simplify the line feature matching problem, it can be simplified as a 4-DOF parameter matching optimization problem. Let all the line feature pairs obtained by matching between two consecutive frames in the camera frame be: where l i and l j are certain line features extracted in the previous frame and subsequent frame, respectively, n is the total number of line features in the subsequent frame. According to the variation in the inter-frame line characteristics shown in Figure 3, the parameter matrix can be set as e ij = θ ij , µ ij , ρ ij , d ij T , θ ij and d ij are the included angle and translation distance between two consecutive frames, respectively, µ ij and ρ ij are the projection ratio and length ratio of the front-to-back interframe line features. Constructing the parameter matrix may establish a linear constraint matrix A i = [e i1 , . . . , e ij , e in ] of the subsequent keyframe for l i . The target vector of the matching judgment of l i is m i = m i1 , . . . , m ij , . . . , m in T . The value of each component is determined by the result of feature matching, where matching is 1 and non-matching is 0. If ∑ m in = 1, the linear constraint A i m i = t will be satisfied. Therefore, the line feature matching problem can be optimized into a constrained matching equation based on least squares: where λ is the weight coefficient and t = [0, 1, 1, 0] T is the constraint target vector.

LiDAR-Aided Depth Correlation of Visual Features
LiDAR-aided depth correlation of visual features can effectively improve the scale ambiguity of monocular cameras. Since the LiDAR resolution is much lower than that of the camera, the use of only a single frame of sparse point cloud depth correlation will result in a large number of visual feature depth deletions [30]. Therefore, this study purposes a strategy of superimposing multi-frame sparse point cloud to obtain the depth value of the point cloud, which is used to establish the depth correlation with the visual features.
As shown in Figure 4, f V 1 is a feature point in the visual frame {V}, and d L 1 , . . . , d L m is a group of depth point clouds in the lidar frame {L}. Projecting d L n onto a unit spherical surface V g with f V 1 as the spherical center to obtain a projection point d V g n : where R V g L and p V g L are the rotation matrix and external parameter matrix of {L} to V g , respectively. Taking f V 1 as the root node to establish KD tree to search for the three closest depth points d 1 , d 2 , d 3 on the sphere. Then, connecting f V 1 with the camera center O and intersecting

Construction of Factor Graph Optimization Framework
The framework of factor graph optimization based on the Bayesian network proposed in this study is shown in Figure 5. The state vector in that world frame construct according to the constraint factor shown in the figure is: where x n = p n , q n , v n , b a , b g represents the IMU state at the nth time, which includes the carrier position p i , the rotation quaternion q i and the velocity v n obtained by IMU preintegration in the world frame, b a and b g stand for the acceleration bias and the gyroscope bias in IMU body frame, respectively, λ p represents the inverse depth of the visual point feature in the visual frame from its initial observation in the first frame, l represents the orthogonal frame of the visual line feature, d e k and d p k stand for the distances between the LiDAR feature points and its corresponding edge or plane feature point cloud, respectively. Therefore, the Gaussian-Newton method can be used to minimize all cost functions to construct a maximum a posteriori estimation problem, to perform nonlinear optimization on the state vectors in the sliding window: where r p , J p contains the prior states after the marginalization in the sliding window, and J p is the Jacobian matrix, r B ẑ k k+1 , X represents the IMU residuals, and p i is the IMU covariance matrix; r f ẑ j i , X and r l ẑ j i , X represent the re-projection errors of visual point and line features, p c is the visual covariance matrix, and ρ represents Huber norm, with specific values as follows: ρ(e(s)) = The specific meaning of each sensor cost function in Formula (6) is as follows.

IMU Factor
The IMU state of the kth frame and the k + 1th frame in the global coordinate system can be defined as: Take the IMU state of the kth frame, x k , as an example, which includes position , accelerometer bias b ak and gyroscope bias b gk .
Next, the IMU residual equation can be constructed, which is defined as: where r p , r q , r v , r ba , r bg T represents the observation residual of IMU state between two consecutive keyframes in the sliding window, including the residual of position, rotation, velocity, accelerometer bias and gyroscope bias, R B k G represents the pose conversion matrix of the kth frame from the IMU coordinate system to GNSS global coordinate system, and p k k+1 ,q k k+1 ,v k k+1 represents the IMU pre-integration value of two keyframes in the sliding window within ∆t k .

Visual Feature Factor
The visual feature factor is essentially the re-projection error of the visual feature, that is, the difference between the theoretical value projected on the image plane and the actual observation value. In order to unify the coordinate system in Section 3.3, we provide the definition of re-projection error on the unit sphere instead of the generalized image plane. Specific schematic diagrams are shown in Figures 6 and 7.

Visual Point Feature Factor
In this study, the visual feature factors are built with reference to VINS-Mono [5]. As shown in Figure 6, the re-projection error of visual point features can be defined as the difference between the projection point on the unit spherical surface and the observation value after distortion correction. Given the ith normalized projection pointf in the jth frame, we use the first observation value in the jth frame to define the visual point feature factor as: where R V B represents the external parameter matrix between camera and IMU, which is obtained by calibration, R where K is the camera internal reference projection matrix. It can be seen from Equation (12) that the spatial coordinates of the line features projected onto the unit sphere are only related to n c . The two end points of the observation line are a j i and b j i , then the re-projection error of the line feature can be expressed by the dotted distance from the two end points of the observation line feature to the projection line feature:

LiDAR Factor
As mentioned in Section 3.3, after the LiDAR-assisted monocular visual depth correlation, the VIO will provide the LiDAR with visual initial positional estimates to correct the motion distortion of the LiDAR point cloud and improve the scan matching accuracy. The scanning matching error between adjacent keyframes of LiDAR involved in this study can be expressed by the distance from the feature point to the matched edge line and feature plane as: where X e (k+1,i) represents the edge feature point at the k + 1th time, X e (k,a) and X e

GNSS Factor and Loop Factor
When the carrier moves to a GNSS signal trusted environment, GNSS factors can be added to optimize with local sensors. The time interval of two frames of GNSS observations is ∆t, and given the GNSS measurements p G g k in the global frame and p V g k representing the observation of LVIO in the global frame, the GNSS factor can be expressed by the following observation residuals: Different from the assumption in [14] that GNSS factors are added to the system only when the GNSS measurement covariance is smaller than the LVIO measurement covariance, we noticed that the accuracy of outdoor GNSS positioning results is much higher than the LVIO local positioning results. The covariance threshold size for judging whether to add GNSS factors has little impact on the positioning accuracy. Therefore, we present that once the GNSS signal is detected by the system, the GNSS factor is added to the factor graph. In this way, even if the mobile carrier enters the GNSS rejection environment (such as the indoor parking lot or tunnel), it can also provide a more accurate initial observation value after GNSS correction. The fusion strategy of GNSS and LVIO is shown in Figure 8.
Further, considering the possible overlap of the mobile carrier travel area, i.e., the mobile carrier travels to the same position again after a period of time, we also added a loopback detection link to establish the loopback constraint that exists between non-adjacent frames. Unlike introducing another sensor (GNSS) for global correction of the local sensor (LVIO), the loopback factor establishes the correlation between the current observed frames and the historical data by the local sensor itself to obtain a globally consistent estimate. The conditions for adding the loopback factor are similar to those of GNSS. Once the carrier motion trajectory is detected to travel to the environment passed by the history, the loop factor is added to the factor graph. By registering with the point cloud of the prior map, the historical trajectory is corrected, and the global pose estimation result with higher accuracy is obtained.

Indoor Environment
For evaluating the real-time performance of our algorithm, we randomly selected the MH_01_easy dataset for indoor experiments. Since the strategy of adding line feature constraints to the VIO subsystem of our algorithm is referenced to PL-VIO, the time consumption of several threads involving line features of PL-VIO and this algorithm is compared. As shown in Figure 9, the appropriate selection of hidden parameters and the least-squares-based geometric constraint matching strategy have positive effects on real-time performance. The time cost of the line feature extraction and matching process and the line feature tracking process of the proposed algorithm is about one-third that of similar algorithms.
The time consumption of the line feature matching process is shown in Figure 9a. In the period of (110 s, 170 s), the carrier passes through the well-lit factory wall duct area. The number of line features extracted by both algorithms increases, and the corresponding time cost of line feature matching also increases with the number of line features. However, unlike PL-VIO which is significantly affected by the increase in the number of line features, the line feature matching the process time of our algorithm remains relatively stable within 1 ms. The reason is that the number of invalid line features is reduced due to the geometric constraint-based line feature matching strategy, which improves the accuracy of line feature matching between the front and current frames of the image. In the time-consuming of line feature tracking process shown in Figure 9b, it can be seen that in the initial stage (0 s, 5 s) of the visual subsystem, the line feature tracking process of the two systems takes longer. The reason is the UAV is at rest during this time and the VIO subsystem does not receive sufficient motion excitation, which leads to its incomplete initialization. After 5 seconds of initialization, the PL-VIO line feature tracking time remains stable at about 125 ms, while the time consumption of our algorithm is about 4/5 less than that of PL-VIO, about 25 ms. It has a strong positive effect on the real-time performance of the fusion system in the actual operating environment.
Although as shown in Figure 9c, the time-consuming cost of the line feature residual optimization process increases by about 10 ms, the time-consuming of the line feature tracking process is significantly reduced. Thus, the proposed method leads to a decrease in the total time cost of the three line feature-related processes in the fusion system, which still has a better real-time performance overall than before the improvement.

Outdoor Environment
Since the distribution characteristics of line features are different in indoor and outdoor environments, in order to fully evaluate the superior performance of this algorithm in terms of real-time, we selected the Hong Kong 0428 dataset for outdoor experiments. The experimental results are shown in Figure 10.
Different from the indoor environment, the outdoor environment has more complex conditions of light refraction and reflection, and the dynamic interference such as pedestrians and vehicles in the driving process of moving vehicles. The time consumption of the line feature matching process in the outdoor environment is shown in Figure 10a. It can be seen that the line feature matching time of PL-VIO in the outdoor environment is about 10 ms on average, and our algorithm still maintains the same good real-time characteristics as the indoor environment. In the line feature tracking process shown in Figure 10b, it can be seen that the line feature tracking process in the initialization phase (0 s, 5 s) of the visual subsystem is abnormally high for both systems. The same reason is that the VIO system is not provided sufficient motion excitation at the beginning of the vehicle stationary phase. It can be concluded that it is more difficult to match and track visual line features in the outdoor environment, and the time consumed for line feature tracking rises about 3-4 times compared with the indoor environment. However, the time consumed by our algorithm is still greatly shortened compared with similar algorithms, leaving more time for the optimization of a multi-sensor fusion at the back-end. In addition, as shown in Figure 10c, the time-consuming cost of the line feature residual optimization process is not much different from that of PL-VIO. Combining the above three time-consuming threads, it can be proved that our algorithm can achieve better real-time performance in different environments.

Indoor Environment
In this study, the EuROC dataset was used to compare and verify the positioning accuracy of each algorithm in the indoor environment. The experimental environment was in a factory with complex signal refraction and reflection conditions. LiDAR frequently fails in the experimental environment, so no comparison was made. The comparison of the point-line feature results extracted by PL-VIO and our algorithm in the experimental environment is shown in Figure 11. As seen in Figures 12 and 13 and Table 1, the introducing line features in the image frames to add additional feature constraints can reduce the positioning error of the system to some extent, especially in areas with dim light and poor textures. For example, during the (160 s, 240 s) time, the UAV flight area is nearly full of darkness. Thus it is difficult for Harris corner point detection method to extract the corner points with large grayscale difference from the surrounding pixel blocks. The reduction in the number of effective feature points directly leads to poor feature tracking accuracy. Therefore, the absolute trajectory error of VINS-Mono based on point features is larger in this interval (as shown in Figure 13a). In contrast, PL-VIO based on point-line features and the present algorithm are less negatively affected by illumination, and the absolute trajectory error remains within 0.6 m. In a longitudinal comparison of similar algorithms based on point and line features, the accuracy of our algorithm is significantly improved over PL-VIO. These results are attributed to the high quality of matching by the geometric constraint strategy, which avoids the missegmentation of long-line features and then misclassification as invalid matches. The experimental results demonstrate the robustness and accuracy of this algorithm in the case of single system failure, which is important for localization in complex indoor environments.

Outdoor Environment
To evaluate the performance of the algorithm we conducted in the outdoor environment, the Hong Kong dataset was used for performance evaluation and it was compared with other similar advanced algorithms. The experimental equipment and environment are shown in Figure 14. The sensor models are as follows: the camera is BFLY-U3-23S6C-C, the LiDAR is HDL 32E Velodyne, IMU is Xsens Mti 10, and the GNSS receiver is u-blox M8T. In addition, we utilized the high-grade RTK GNSS/INS integrated navigation system, NovAtel SPAN-CPT, as the ground truth. To verify the superior performance of each aspect of our system, we performed ablation experiments, constructed without GNSS global correction (*), without visual line features (#), and our complete system (proposed), respectively. The experimental results are shown in Figures 15 and 16 and Table 2.   From Figure 15, it can be seen that VIO and LIO, which are mainly based on a single sensor, each have different defects. First of all, VIO(VINS-Mono) is introduced. Before starting the movement, the moving carrier stopped at the roadside parking position for about 10 seconds. VIO was not given a large motion excitation during this period, which led to the VIO not being initialized properly. Secondly, the cumulative error caused by the scale uncertainty of the monocular camera increased significantly over time, and a large-scale estimation error was already generated at the second lap. Although the scale drift of LIO (LIO-SAM) is not large, it will immediately fail and keep restarting in the complex area of signal fold reflection. After LiDAR resumes operation, the translation and rotation of the current frame will be accumulated based on the positional estimation at the last frame that did not fail, resulting in the misjudgment of stopping the motion at the carrier motion to (50 m,150 m). When the carrier moves to the corner, LIO re-estimates the position and attitude. It was misjudged that the carrier stopped at (50 m,150 m) for a while and then began to turn, so it lost the estimated position and attitude for a period of time, which led to a large positioning error.
In a longitudinal comparison with the other LVIO system (LVI-SAM), we can conclude that our complete algorithm maintains a lower drift rate and localization integrity, which benefits from the extra constraint of line features and the global correction of GNSS.
In conclusion, even in complex outdoor environments, our algorithm still outperforms other advanced algorithms.

Mapping Performance
As a demonstration of the superiority of our algorithm in building maps, we compared the building results with other advanced algorithms on different datasets. The visual line feature extraction and map building results are shown in Figure 17. Compared with PL-VIO, our algorithm has a great improvement in the number of visual line features extracted, which is attributed to the improved line feature extraction strategy. In a factory environment with complex lighting conditions, the line features in the actual environment will look minutely curved due to the refraction of light. Due to the proper value of the threshold value D of the density of homogeneous points, the angle tolerance of fitting pixels to approximate rectangles in this environment can be improved, thus increasing the number of line feature extraction. Further, the accuracy of the bit pose estimation is also substantially improved by the combination of the improved line feature extraction and tracking optimization strategies. Further, comparison of the LiDAR point cloud detail views is shown in Figure 18. The more accurate VIO pose estimation after the line features are added provides a more accurate initial value for LiDAR scan matching, and reduces a large number of point cloud mismatching. Comparison of global point cloud trajectories is shown in Figure 19.
The area marked by circles demonstrates that the data drift caused by cumulative errors is significantly reduced by adding a GNSS factor and loop factor to our algorithm.

Discussion
Multi-sensor fusion positioning technology based on SLAM provides new opportunities for the high-precision positioning of mobile carriers. In this study, two problems that need to be further explored in the existing LVIO fusion system are proposed. The first problem is that LVIO system needs enough environmental feature information. According to the previous studies of Pumarola et al. [11] and Fu et al. [12], theoretically, the accuracy of the fusion system can be improved by increasing the constraint of visual line features. Huang et al. also proved that the average positioning error of the fusion system based on point-line feature can generally be improved, from the traditional 2.16% to 0.93% [18]. In this study, the steps of increasing visual line feature constraints are further optimized. The Monte Carlo method was used to select the appropriate scaling ratio and the density threshold of homogeneous points, which improves the angle tolerance of pixel fitting to line features. To a certain extent, it reduces the probability that short segments are wrongly judged as invalid features. According to the experiment of parameter selection in Section 3.1, compared with the indoor environment where the angle and translation of line features change little, the movement of line features between consecutive frames is more complicated in the outdoor environment. Therefore, the density threshold of homogeneous points needs to be lowered to reduce the probability that the valid line feature pairs are misjudged as invalid matches when turning sections. The results of outdoor real-time analysis shown in Section 5.1.2. show that the traditional method based on point-line features is difficult to match and track the visual line features in the outdoor environment, which takes a long time. However, the time consumption of this algorithm maintained a low level, which is beneficial to leave more time for back-end fusion optimization.
To solve the problem of line feature mismatching during the movement of carriers, Zhou et al. established a constraint equation using the 6-DOF Plücker coordinates of line features to perform matching optimization [19]. However, this increases the computational complexity of the fusion system, which is inconsistent with the lightweight requirements of autonomous driving positioning. In this study, the link of line feature constraint matching is simplified, and the original 6-DOF parameters are replaced by 4-DOF which represents the movement of line features for optimization. Thus, it can reduce the computational complexity of the system and effectively improve the inter-frame matching accuracy of line features. To explore the superiority of the proposed algorithm in real-time and positioning accuracy, we compared the precision and the time-consuming of three processes related to line features of this algorithm with several similar advanced algorithms in different environments. The experimental results show that our optimization strategy based on front-end point-line features effectively achieves the positive balance between reducing time consumption and improving accuracy.
The second problem to be solved is the global optimization of LVIO local pose estimation results by introducing global constraints. To further improve the positioning accuracy of local sensors, Qin et al. proposed a GNSS and local sensor fusion method to construct GNSS residual factors to correct the cumulative error of VIO [25]. Further, we propose a factor graph based on Bayesian network, in which GNSS observations are added as global constraint factors. The accumulated errors of LVIO are corrected by using GNSS observations within 0.1 s interval from LVIO keyframes as global constraint. In this study, it is proved that GNSS global constraint factor can effectively correct LVIO positioning error in the outdoor environment. It should be noted that since the coordinate of the current frame of LVIO is calculated from the coordinate of the previous frame, longterm observation or long moving distance will lead to more serious data drift. However, the GNSS observations are in the global coordinate, so long-term observation is not related to the data drift. Therefore, we can reasonably speculate that the longer the algorithm runs, the more obvious the correction effect of GNSS on LVIO will be. More comprehensively, LVIO will continue local positioning in GNSS rejection environment, so the positioning continuity of mobile carriers in different environments can be effectively guaranteed.

Conclusions
In this study, a LiDAR-Visual-Inertial Odometry based on optimized visual point-line features is proposed, taking advantage of the heterogeneous complementary characteristics of multiple sensors. First, a visual line feature extraction and matching optimization method is proposed. By improving the line feature extraction in the scale space and selecting the appropriate scaling ratio and same-sex point density threshold, the number of line features extracted in the light complex environment is largely improved to provide richer feature information for the front-end. Meanwhile, the original 6-DOF parameter optimization problem is further improved to a 4-DOF parameter optimization problem by using a least squares-based line feature constrained matching strategy. The complexity of the fusion system is reduced, and more accurate visual pose estimation is effectively accomplished. Second, the LiDAR point cloud is projected into the visual coordinates for depth correlation. Meanwhile, the initial pose estimation provided by the optimized VIO is used to help LiDAR scan matching. Finally, a factor graph method based on Bayesian networks is established. Two global constraint factors are added to the factor graph framework to constrain LVIO globally, which are the global constraint of GNSS factors from external sensors and the loop factor constraint of local sensors. The experimental results show that the algorithm can achieve real-time attitude estimation with good localization and mapping accuracy in different environments.
In the future, we will further improve and refine our work in the following aspects. First, the point cloud alignment algorithm of the loop factor in this study utilizes the traditional ICP algorithm, which is time-consuming to perform the nearest domain search using the KD tree. Thus, we will consider the improvement of the point cloud alignment algorithm next. Second, the inclusion of the GNSS factor in this study only utilizes the GNSS pseudo range single point positioning result. Although it is relatively simple and feasible on the vehicle platform with only one GNSS receiver, there is still room for improvement in the positioning accuracy of GNSS. A more accurate correction of LVIO by using higher accuracy RTK positioning results will be considered in the next step. Finally, since our proposed fusion system consists of two subsystems with high runtime computational resource requirements, we will work on reducing the resource occupation rate of the algorithm. Further, we will evaluate the positioning accuracy of the algorithm on vehicles with limited computing resources.