Improved Point-Line Feature Based Visual SLAM Method for Indoor Scenes

In the study of indoor simultaneous localization and mapping (SLAM) with a stereo camera, two types of primary features, points and line segments, have been widely used to calculate the pose of the camera. However, many feature-based SLAM systems are not robust when the camera moves sharply or turns quickly. In this paper, we propose an improved indoor visual SLAM method that better utilizes the advantages of point and line segment features and achieves robust results in difficult environments. First, point and line segment features are automatically extracted and matched to build two kinds of projection models. Second, for the optimization of line segment features, we add the minimization of an angle observation to the traditional re-projection error of the endpoints. Finally, our motion estimation model, which is adaptive to the motion state of the camera, builds a new combined Hessian matrix and gradient vector for iterated pose estimation. Our proposal has been tested on the EuRoC MAV datasets and on sequence images captured with our stereo camera. The experimental results demonstrate that the improved point-line feature based visual SLAM method improves localization accuracy when the camera moves with rapid rotation or violent fluctuation.


Introduction
Simultaneous localization and mapping (SLAM) is used to incrementally estimate the pose of a moving platform and simultaneously build a map of the surrounding environment [1][2][3]. Owing to its capability for autonomous localization and environmental perception, SLAM has become a key prerequisite for robots to operate autonomously in an unknown environment [4]. Visual SLAM, a system that uses a camera as its input sensor, is widely used on platforms moving in indoor environments. Compared with radar and other range-finding instruments, a visual sensor has the advantages of low power consumption and small volume, and it can provide more abundant environmental texture information for a moving platform. Consequently, visual SLAM has drawn increasing attention in the research community [5]. As a unique example, the integration of visual odometry (VO) with these strategies has been applied successfully to rover localization in many planetary exploration missions [6][7][8][9], and has assisted the rovers in traveling through challenging planetary surfaces by providing high-precision visual positioning results. Subsequently, many researchers have attempted to improve the efficiency and robustness of SLAM methods. In terms of improving efficiency, feature extraction algorithms such as Speeded-Up Robust Features (SURF) [10] and Binary Robust Invariant Scalable Keypoints (BRISK) [11] have been proposed. In terms of robustness, researchers have combined and weighted different features. Di et al. used the inverse of the error as the weights of different data sources in RGB-D SLAM and achieved good results [40], but the motion information of the camera was not considered.
In this paper, an improved point-line feature based visual SLAM method for indoor scenes is proposed. First, unlike the traditional nonlinear least square optimization model of line segment features, our method adds the minimization of an angle observation, which should be close to zero between the observed and re-projected line segments. Compared with the traditional model, which only includes the distances between the re-projected endpoints and the observed line segment, our method combines angle and distance observations and shows better performance at large turns. Second, our visual SLAM method builds an adaptive model of motion estimation, so that the pose estimation model adapts to the motion state of the camera. With these two improvements, our visual SLAM system can fully utilize point and line segment features irrespective of whether the camera is moving or turning sharply. Experimental results on the EuRoC MAV datasets and on sequence images captured with our stereo camera verify the accuracy and effectiveness of the improved point-line feature based visual SLAM method in indoor scenes.

Extraction and Matching of Point and Line Segment Features
In point feature tracking, the ORB algorithm [12] is adopted in our method to extract 2D point features and create binary descriptors for initial matching. The matching of the extracted point features in consecutive frames is refined with the random sample consensus (RANSAC) algorithm and a fundamental matrix constraint, which eliminates erroneous corresponding keypoints from the matched results. The fundamental matrix constraint is also called the epipolar constraint: given the point m in the left image, its corresponding point in the right image is constrained to lie on the epipolar line l', as Figure 2 shows. As a stereo camera has a baseline, we can calculate the depths and 3D coordinates of all the keypoints with respect to the optical center of the left camera.
We use the line segment detector (LSD) algorithm [41] for the extraction of line segment features. It extracts line segment features from indoor scenes in linear time, satisfying the real-time requirement of SLAM. Although indoor scenes contain many low-texture regions such as white walls, the LSD algorithm can still stably extract line features, as shown in Figure 3. Furthermore, the line band descriptor method [42] is employed to match line segment features in stereo and consecutive frames with binary descriptors. Similar to the matched point features, we can obtain the 3D coordinates of the two endpoints of each line segment feature as well as their 2D coordinates.
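As a concrete illustration of the depth recovery mentioned above, the following sketch triangulates matched keypoints from a rectified stereo pair via the disparity. The intrinsic values and baseline used in the example are hypothetical, not the calibration of our camera:

```python
import numpy as np

def triangulate_keypoints(pts_left, pts_right, fx, fy, cx, cy, baseline):
    """Recover 3D coordinates (left-camera frame) of matched keypoints
    from a rectified stereo pair via the disparity d = xL - xR."""
    pts_left = np.asarray(pts_left, dtype=float)
    pts_right = np.asarray(pts_right, dtype=float)
    disparity = pts_left[:, 0] - pts_right[:, 0]   # horizontal disparity
    Z = fx * baseline / disparity                  # depth from d = fx * b / Z
    X = (pts_left[:, 0] - cx) * Z / fx             # back-project x
    Y = (pts_left[:, 1] - cy) * Z / fy             # back-project y
    return np.stack([X, Y, Z], axis=1)
```

For a point at (1, 0.5, 2) m with fx = fy = 500, cx = 320, cy = 240 and a 0.1 m baseline, the left/right observations (570, 365) and (545, 365) recover the original coordinates.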


Nonlinear Least Square Optimization Models for Motion Estimation
Once the 3D and 2D coordinates of the point and line segment features are obtained, the relationships between two consecutive frames can be established using the homologous features. The point and line segment features are then re-projected from the previous frame to the current frame, and re-projection error models are built for both the points and the line segments. As these error models are nonlinear, the camera motion must be iteratively estimated using the nonlinear least square optimization method; in this study, we use the Gauss-Newton algorithm to minimize the re-projection errors. This section presents the nonlinear least square optimization models of point and line segment features.
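The Gauss-Newton scheme used throughout this section can be sketched generically: at each step, solve the normal equations built from the Jacobian and residual, and update the parameters. The toy residual in the usage note is illustrative only:

```python
import numpy as np

def gauss_newton(residual, jacobian, x0, iters=20, tol=1e-10):
    """Generic Gauss-Newton: at each iteration solve (J^T J) dx = -J^T e
    and update the parameter vector until the step is negligible."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        e = residual(x)
        J = jacobian(x)
        H = J.T @ J                  # Gauss-Newton approximation of the Hessian
        g = -J.T @ e                 # gradient vector
        dx = np.linalg.solve(H, g)   # normal equations
        x = x + dx
        if np.linalg.norm(dx) < tol:
            break
    return x
```

For example, minimizing the residual (x0² - 4, x1 - 1) from the start point (3, 0) converges to (2, 1) in a few iterations.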

Optimization Model of Point Features
For the point features, we use the perspective-n-point method to optimize the camera pose. The error model is the re-projection error, i.e., the difference between the re-projected 2D position and the observed (matched) 2D position. The optimization process can be divided into three steps. First, 3D map points at the current frame are obtained from stereo image matching and space intersection computation. Second, the world 3D map points are transformed into the coordinate system of the current frame using the iteratively estimated pose of the camera and re-projected into the image coordinate system of the current frame. Finally, by minimizing the distance error between the re-projected points and their corresponding observed points on the current frame, the error model of point features is established. This process is illustrated in Figure 4.
Figure 3. Common low-texture environment in an indoor scene with white walls; point features (green x marks) extracted using the ORB algorithm and line segment features (red lines) extracted using the LSD algorithm.


In the re-projection error model, the error of the $i$-th point feature can be described as follows:

$e_i^p(\zeta) = p(x, y) - \frac{1}{Z'} K \, T(\zeta) \, P_{world}$

Here, $\zeta$ is a six-dimensional vector of Lie algebra that represents the motion of the camera, and $T(\zeta) = \exp(\zeta^{\wedge})$ represents the transformation from the world coordinate system to the current camera coordinate system; it maps the world point $P_{world}(X, Y, Z)$ to $P'(X', Y', Z')$. $K$ represents the internal parameters of the camera, $p(x, y)$ is the observed point corresponding to the re-projected point $p'(x', y')$, and $e_i^p(\zeta)$ is the resultant error vector. To use the Gauss-Newton method, the partial derivative of the error function with respect to the pose, i.e., the Jacobian matrix $\partial e_i^p / \partial \zeta$, is required and can be obtained via the chain rule:

$\frac{\partial e_i^p}{\partial \zeta} = -\frac{\partial p'}{\partial P'} \cdot \frac{\partial P'}{\partial \zeta}$

By calculating $\partial p' / \partial P'$ and $\partial P' / \partial \zeta$, we obtain $\partial p' / \partial \zeta$ as follows:

$\frac{\partial p'}{\partial \zeta} =
\begin{bmatrix}
\frac{f_x}{Z'} & 0 & -\frac{f_x X'}{Z'^2} & -\frac{f_x X' Y'}{Z'^2} & f_x + \frac{f_x X'^2}{Z'^2} & -\frac{f_x Y'}{Z'} \\
0 & \frac{f_y}{Z'} & -\frac{f_y Y'}{Z'^2} & -f_y - \frac{f_y Y'^2}{Z'^2} & \frac{f_y X' Y'}{Z'^2} & \frac{f_y X'}{Z'}
\end{bmatrix}$

After these terms are calculated, we obtain the Jacobian matrix of point features, denoted $J_p$. In this study, the Gauss-Newton algorithm is used for the iterative estimation of the camera motion, so the Hessian matrix $H_i^p$ and gradient vector $g_i^p$ required by the algorithm must be calculated:

$H_i^p = J_p^T P J_p, \qquad g_i^p = -J_p^T P \, e_i^p(\zeta)$

where $P$ is the weight matrix of the point feature. Adding $H_i^p$ and $g_i^p$ of each point yields the Hessian matrix $H_p$ and gradient vector $g_p$ of all point features in the current frame:

$H_p = \sum_i H_i^p, \qquad g_p = \sum_i g_i^p$

Through the above steps, the optimization model of point features is established.
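The 2x6 Jacobian above can be written out directly. The sketch below uses the standard form under a left-perturbation convention for se(3) with translation components first and rotation components second; the paper's exact ordering convention may differ:

```python
import numpy as np

def reprojection_jacobian(P_cam, fx, fy):
    """Jacobian of the projected pixel p' with respect to an se(3)
    perturbation of the camera pose, convention (translation, rotation).
    P_cam is the 3D point already expressed in the camera frame."""
    X, Y, Z = P_cam
    Zi = 1.0 / Z
    Zi2 = Zi * Zi
    return np.array([
        [fx * Zi, 0.0, -fx * X * Zi2,
         -fx * X * Y * Zi2, fx + fx * X * X * Zi2, -fx * Y * Zi],
        [0.0, fy * Zi, -fy * Y * Zi2,
         -fy - fy * Y * Y * Zi2, fy * X * Y * Zi2, fy * X * Zi],
    ])
```

The translation block (first three columns) equals the derivative of the projection with respect to the camera-frame point, which can be verified numerically by finite differences.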

Optimization Model of Line Segment Features
As for the error model of line segment features, we use two kinds of error functions. One is the traditional minimization of the distances from the re-projected endpoints to the observed line segment. The other is our proposal in this study: the error of the angle observation, which should be close to zero between the observed and re-projected line segments. Each line segment feature thus has two distance errors and two angle observation errors. The process of establishing the error model of line segment features is shown in Figure 5. By minimizing both the distance errors and the angle observation errors, the optimization model of line segment features can be established. The error function of the $j$-th line segment feature is a $4 \times 1$ error vector $e_j^l(\zeta)$, which can be expressed as follows:

$e_j^l(\zeta) = \begin{bmatrix} e_1(\zeta) & e_2(\zeta) & e_3(\zeta) & e_4(\zeta) \end{bmatrix}^T$

where:

$e_1(\zeta) = \frac{a x'_p + b y'_p + c}{\sqrt{a^2 + b^2}}, \qquad e_2(\zeta) = \frac{a x'_q + b y'_q + c}{\sqrt{a^2 + b^2}}$

In Equation (8), $a$, $b$, and $c$ are the three coefficients of the general equation of the observed line. Point $p(x_p, y_p)$ and point $q(x_q, y_q)$ are the starting and ending endpoints of the observed line, respectively, and point $p'(x'_p, y'_p)$ and point $q'(x'_q, y'_q)$ represent the endpoints of the re-projected line segment. The error $e_1(\zeta)$ can be considered the distance between point $p'$ and the observed line $pq$, and $e_2(\zeta)$ is the distance between point $q'$ and the observed line. In addition to the two distance error functions, we add the angle error functions: $e_3(\zeta)$ and $e_4(\zeta)$ are cosine-based angular errors between the observed line segment $pq$ and the re-projected line segment $p'q'$, and both should be close to zero when the two segments are aligned. The 3D coordinates of the endpoints $P'(X'_P, Y'_P, Z'_P)$ and $Q'(X'_Q, Y'_Q, Z'_Q)$ are obtained using the pose transformation $\exp(\zeta^{\wedge})$. Using Equation (3), the Jacobian of the line error also requires the partial derivatives $\partial e_j^l / \partial p'$ and $\partial e_j^l / \partial q'$. They are functions of the re-projected points $p'(x'_p, y'_p)$ and $q'(x'_q, y'_q)$; for the distance errors, for example,

$\frac{\partial e_1}{\partial p'} = \frac{1}{\sqrt{a^2 + b^2}} \begin{bmatrix} a & b \end{bmatrix}$

where $a$ and $b$ are the coefficients of the general equation of the observed line.
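The two distance errors $e_1$ and $e_2$ can be sketched concretely: fit the normalized general line equation through the observed endpoints, then evaluate it at the re-projected endpoints. This is an illustrative helper, not the paper's implementation:

```python
import numpy as np

def line_endpoint_errors(p_obs, q_obs, p_proj, q_proj):
    """Signed distances of the re-projected endpoints p', q' to the
    observed line through p and q (general form a*x + b*y + c = 0,
    normalized so that a^2 + b^2 = 1)."""
    (xp, yp), (xq, yq) = p_obs, q_obs
    a, b = yp - yq, xq - xp              # normal of the line through p, q
    c = xp * yq - xq * yp
    n = np.hypot(a, b)
    a, b, c = a / n, b / n, c / n        # normalize coefficients
    e1 = a * p_proj[0] + b * p_proj[1] + c   # signed distance of p'
    e2 = a * q_proj[0] + b * q_proj[1] + c   # signed distance of q'
    return e1, e2
```

For an observed segment along the x-axis, a re-projected endpoint at height 3 yields a distance error of magnitude 3.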
Similar to the point features, the Hessian matrix $H_j^l$ and gradient vector $g_j^l$ of line segment features are required by the Gauss-Newton algorithm. With the Jacobian matrix denoted $J_l$, the Hessian matrix and gradient vector are obtained as follows:

$H_j^l = J_l^T P J_l, \qquad g_j^l = -J_l^T P \, e_j^l(\zeta)$

where $P$ is the weight matrix of the $j$-th line segment feature. As the dimensions of the two distance error functions and the two angle error functions differ, they are weighted in two ways. Subsequently, we add $H_j^l$ and $g_j^l$ of each line segment and obtain the Hessian matrix $H_l$ and gradient vector $g_l$ of all the line segment features in the current frame. Through these steps, the optimization model of line segment features is established. Compared with the traditional error model of line segment features used in the literature [38,39], we add the angular error functions. We have tested our error model on the EuRoC datasets [43] and compared the results with those obtained from the traditional error model. Figure 6 shows the resulting trajectories of the two models tested on the dataset Vicon room 1 "medium". As shown in Figure 6, A and B are two big turns. From the accuracy heat map of the positioning results at these two places, our extended error model with the added angular error functions is superior to the traditional error model. We use the relative pose error (RPE) as the evaluation metric, which describes the error between pairs of timestamps in the estimated trajectory.
Then we calculate the average RPE at the two big turns A and B to represent the average drift rate between the estimated trajectory and the ground truth. As shown in Table 1, the average RPE of our extended error model at A and B is smaller than that of the traditional error model, meaning that the proposed error model has a lower drift rate at these turns. This also shows that the proposed error model has good robustness at large turns. More detailed results are given in the experimental results section.
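The average translational RPE used above can be sketched as follows, assuming time-associated 4x4 homogeneous pose matrices; the frame gap `delta` and the pose representation are illustrative choices, not the exact evaluation setup:

```python
import numpy as np

def average_rpe_trans(est_poses, gt_poses, delta=1):
    """Average translational relative pose error over timestamp pairs
    (i, i+delta); poses are 4x4 homogeneous matrices, time-associated."""
    errs = []
    for i in range(len(est_poses) - delta):
        rel_est = np.linalg.inv(est_poses[i]) @ est_poses[i + delta]
        rel_gt = np.linalg.inv(gt_poses[i]) @ gt_poses[i + delta]
        E = np.linalg.inv(rel_gt) @ rel_est   # relative-motion discrepancy
        errs.append(np.linalg.norm(E[:3, 3])) # translation part only
    return float(np.mean(errs))
```

If the estimate drifts by 0.1 m per frame relative to the ground truth, the average RPE is 0.1 m.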

Adaptive Weighting Model of Motion Estimation
After the Hessian matrices and gradient vectors of both point features and line segment features are established, our motion estimation model, which is adaptive to the motion state of a camera, is applied to build a new recombined Hessian matrix and gradient vector for iterated pose estimation.
As shown in Figure 7, we collected the residual errors of nearly 1000 positions and the corresponding motion states of the camera with respect to the previous frame. We use the displacement of the current frame relative to the previous frame as a measure of the current motion state. The blue line in Figure 7 is the linear fit to these data. It can be observed from Figure 7 that there is a certain degree of correlation between the positioning residual errors and the motion state of the camera. We therefore calculated the correlation coefficient between them and obtained a value of 0.57, indicating a clear positive correlation: the motion state of the camera affects the positioning result to some extent. However, the reference method does not consider this. Accordingly, it can be observed from the accuracy heat maps in the experiment section that the reference method has a large absolute trajectory error (ATE) when the camera moves with large rotation or rapid fluctuation. Hence, we build an adaptive model in the iterative motion estimation that adapts to the motion state of the camera.

Table 1. Average RPE results at the A and B turns of Figure 6 for the traditional error model and our proposed error model. The numbers in bold indicate that these terms are better than those of the other model. The unit of RPE is meters.

Turn (Figure 6)    Traditional error model    Proposed error model
A                  0.149500                   0.138527
B                  0.142878                   0.131200

With each iteration, we obtain the motion of the current frame relative to the previous one. As the frame rate of the camera is fixed, the motion state along the three axes can be represented by the change of camera position ∆P(∆X, ∆Y, ∆Z). If the motion of the camera relative to the previous frame is greater in the image plane direction than in the direction perpendicular to the image plane, i.e., ∆X and ∆Y are larger than ∆Z, the camera is shaking, which may result in blurred or weakened image texture. According to our experience, line segment features provide significant structural information of the environment, and hence line segment detection is more robust than point feature detection in such poor-texture scenes.
It can also be observed from the experimental results in Figure 6 that line segment features play an important role when the camera makes a big turn. Thus, in such situations, the weight of the line segment features should be larger than that of the point features. Conversely, if the motion of the camera relative to the previous frame is greater in the direction perpendicular to the image plane, ∆Z will be larger; according to our experiments, the point features in this situation are relatively rich and stable, and the weight of the point features should be larger than that of the line segment features. Moreover, we use the inverse of the average re-projection error as a factor in weighting the point and line segment features. Based on the comparative experiments and the above analysis, we propose the adaptive weighting model of motion estimation given in Equation (15). With Equation (15), a new recombined Hessian matrix H and gradient vector g can be obtained, and the Gauss-Newton algorithm can then be used to estimate the motion of the camera iteratively. As new frames are acquired sequentially, our point-line based visual SLAM system calculates the new positions according to the adaptive weighting model of motion estimation.
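Since Equation (15) is not reproduced in this excerpt, the following is only a hypothetical sketch of the behaviour described above: weights derived from the inverse average re-projection error of each feature type, biased toward line features when in-plane motion dominates and toward point features when forward motion dominates. The boost factor of 2 is an assumed value, not the paper's:

```python
import numpy as np

def combine_hessians(H_p, g_p, avg_err_p, H_l, g_l, avg_err_l, dP):
    """Illustrative recombination of point and line Hessians/gradients.
    Weights start from the inverse average re-projection errors and are
    biased by the motion state dP = (dX, dY, dZ). Hypothetical sketch."""
    dX, dY, dZ = np.abs(dP)
    w_p, w_l = 1.0 / avg_err_p, 1.0 / avg_err_l
    if np.hypot(dX, dY) > dZ:   # shaking in the image plane: trust lines more
        w_l *= 2.0              # assumed boost factor (not from the paper)
    else:                       # dominant forward motion: trust points more
        w_p *= 2.0
    H = w_p * H_p + w_l * H_l   # recombined Hessian
    g = w_p * g_p + w_l * g_l   # recombined gradient
    return H, g
```

The recombined H and g then feed directly into the Gauss-Newton update of the camera pose.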

Experimental Results
In this section, to verify the actual performance of the proposed method, we performed a series of experiments using two types of data: public datasets with ground truth, and sequence images captured using our stereo camera. We also compared our method with the reference method adopted in the literature [38,39], which uses the traditional error model of line features and whose weighting model is based on residual errors. All the experiments were performed on a desktop computer with an Intel Core i7-6820HQ CPU at 2.7 GHz and 16 GB RAM, without GPU parallelization. The results of the experiments are described in detail below.

EuRoC MAV Datasets
The EuRoC MAV datasets were collected by an on-board micro aerial vehicle (MAV) [43]. They contain two batches of datasets: the first was recorded in the Swiss Federal Institute of Technology Zurich (ETH) machine hall and the second in two indoor rooms. Both were captured with a global shutter camera at 20 FPS. Each dataset contains stereo images and accurate ground truth, and calibration data such as the intrinsic and extrinsic parameters of the stereo camera are provided. We compared our proposed method with the reference method and tuned the optimization parameters of the reference method to better adapt to different scenarios for a fair comparison. We use the absolute trajectory error (ATE) as the evaluation metric, which directly calculates the error between the estimated trajectory and the ground truth [44]; we calculate both the translation and rotation parts of the ATE as an evaluation of the six degrees of freedom (DoF). Figure 8 shows the accuracy along the three coordinate axes on several datasets. The dotted line represents the ground truth of the dataset, and the solid lines in blue and red represent the results of the reference method and our proposed method, respectively. As shown in Figure 8a,b, when the Z-axis values fluctuate strongly while the X-axis and Y-axis change stably, our proposed method is superior to the reference method on the Z-axis. Further, as shown in Figure 8c, when the values of all three axes fluctuate greatly, our proposed method is more stable and accurate than the reference method in these quivering parts; for example, in the X-axis section of Figure 8c from 55-70 s and in the 40-60 s part of the Z-axis, our estimated trajectory is much closer to the ground truth than that of the reference method. We then calculate the average RPE at the places in Figure 8 where the camera fluctuates rapidly.
The average RPE represents the average drift rate between the estimated trajectory and the ground truth. As can be seen in Table 2, the average RPE of our proposed method is smaller than that of the reference method, which means our proposed method has a lower drift rate in these quivering parts. This good performance in the case of rapid camera fluctuation is mainly attributed to our extended error model and the adaptive weighting model of motion estimation: the adaptive weighting model considers both the average re-projection error and the motion state between frames, and therefore it can better utilize the advantages of the different features in different motion states and obtain better positioning results. Figure 8 and Table 2 also confirm that our proposed method performs better when the camera shakes quickly.
For quantitative evaluation, we employed the open-source package evo, an easy-to-use evaluation tool (github.com/MichaelGrupp/evo), to evaluate the reference method and our proposed method. Table 3 shows the root mean square error (RMSE) of the translation and rotation parts of the ATE. Histograms of the RMSE and the range of the translation part of the ATE are provided in Figure 9.
Table 3. Translation parts and rotation parts of the ATE of the two methods on several EuRoC MAV datasets. The numbers in bold indicate that these terms are better than those of the other method. The unit of the translation part is meters and the unit of the rotation part is degrees.
Table 3 shows that our proposed method performs better in almost all scenes of the EuRoC MAV datasets for the RMSE of both the translation and rotation parts of the ATE. From Figure 9a, in easy and medium scenes, such as MH_02_easy, V1_01_easy, and V2_02_medium, our proposal shows slightly improved accuracy. However, our method greatly improves the accuracy in difficult scenes, such as MH_04_difficult, MH_05_difficult, and V1_03_difficult. The main reason is that the camera shakes rapidly in these difficult scenes; our method considers this situation and better utilizes the respective advantages of point and line segment features through the adaptive weighting model of motion estimation. Furthermore, as shown in Figure 9b, our proposed method has a smaller range of the translation part of the ATE, indicating that the motion estimation is relatively stable.
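The RMSE of the translation part of the ATE reported in Table 3 can be sketched as follows, assuming the estimated and ground-truth positions are already time-associated and aligned (as evo does internally):

```python
import numpy as np

def ate_rmse(est_xyz, gt_xyz):
    """RMSE of the translation part of the absolute trajectory error,
    given time-associated, aligned position sequences (N x 3 arrays)."""
    err = np.linalg.norm(np.asarray(est_xyz) - np.asarray(gt_xyz), axis=1)
    return float(np.sqrt(np.mean(err ** 2)))
```

For instance, a trajectory offset from the ground truth by a constant 0.3 m yields an ATE RMSE of exactly 0.3 m.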

To demonstrate the results intuitively, several accuracy heat maps of trajectories estimated using the reference method and our proposed method are shown in Figure 10. The gray dotted line represents the ground-truth, the colored solid lines represent the estimated trajectories, and the color bar represents the size of the translation part of ATE; a change in color from blue to red indicates a gradual increase in the translation part of ATE. Each row shows the results of the two methods on the same dataset, and the two color bars of each row share the same maximum and minimum error. Comparing the estimated trajectories against the ground-truth, we can observe that our method shows better accuracy in areas with large camera rotations. This also shows that the angular error function added in our model performs well at large turns. Thus, we can conclude that our proposed method, with an adaptive motion model and angular error functions, yields smaller errors than the reference method when the camera moves with large rotation or rapid fluctuation.

Sequence Images Captured by Our Stereo Camera
In addition to testing the performance and accuracy of our proposed visual SLAM method on public datasets with ground-truth, we also test its universality with sequence images captured with our stereo camera. The sequence images acquired using the stereo camera must first be rectified in order to use them in high-accuracy SLAM processing. In this experiment, a ZED stereo camera is adopted as the data input sensor. It can capture images with a resolution of 720p at up to 60 fps. Although the ZED camera is calibrated in production, the factory calibration does not satisfy the requirements of this experiment. We used Stereo Camera Calibrator, a MATLAB-based software package, to complete the camera calibration process, through which the calibration parameters of the stereo camera, including the lens distortion coefficients and the internal and external parameters, were calculated. The calibration results are shown in Tables 4 and 5. Using these parameters, we can obtain the rectified stereo sequence images.
Figure 11 shows the ZED stereo camera used in this experiment. Figure 3 shows a typical image acquired in this experiment. In the acquisition of the sequence images, an operator (one of the co-authors of this paper) first placed the camera at the start point on the floor. Subsequently, he picked up the camera and walked a quadrilateral path along the indoor corridor. Finally, he returned to the starting point, so the whole image sequence forms a loop closure. In this experiment, we also present a simple comparison with the point-to-point ICP method adopted in [45]. As no ground truth of the trajectory is available for the sequence images captured with our stereo camera, we evaluate the performance by comparing the closure errors of the ICP method, the reference method used in the previous experiment (hereinafter referred to as the reference method), and our proposed method. Furthermore, for a fair comparison of the three methods, we do not use loop closure detection in this experiment.
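As a minimal illustration of what the rectification step does with the calibrated coefficients, the sketch below undistorts a single pixel under a radial-only distortion model with two coefficients k1 and k2, using fixed-point iteration. All parameter values are hypothetical, not the actual calibration results of Tables 4 and 5, and a real stereo pipeline additionally handles tangential distortion and re-aligns the epipolar lines of the two cameras:

```python
import numpy as np

def undistort_pixel(u, v, fx, fy, cx, cy, k1, k2, iters=10):
    """Undistort one pixel under a radial-only lens model.

    The forward model maps an ideal normalized point (x, y) to the
    distorted point (x*d, y*d) with d = 1 + k1*r^2 + k2*r^4 and
    r^2 = x^2 + y^2. Inverting it has no closed form, so a short
    fixed-point iteration is used instead.
    """
    xd = (u - cx) / fx           # distorted normalized coordinates
    yd = (v - cy) / fy
    x, y = xd, yd                # initial guess: no distortion
    for _ in range(iters):
        r2 = x * x + y * y
        d = 1.0 + k1 * r2 + k2 * r2 * r2
        x, y = xd / d, yd / d    # refine the undistorted estimate
    return fx * x + cx, fy * y + cy  # back to pixel coordinates

# With zero distortion the pixel maps to itself (hypothetical intrinsics).
print(undistort_pixel(400.0, 300.0, 700.0, 700.0, 640.0, 360.0, 0.0, 0.0))
```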
Figure 11. ZED stereo camera used in this experiment.
The statistical results are shown in Table 6 and the three estimated trajectories are shown in Figure 12. The percentage closure error of our proposed method is 1.07%, which is better than that of the ICP method (4.64%) and that of the reference method (1.82%). The path lengths calculated by the three methods are 64.393 m, 65.272 m, and 65.120 m, respectively, which are close to each other. However, our method shows only approximately a quarter of the closure error of the ICP method and half that of the reference method. As observed from the trajectories depicted in Figure 12a, the operator moved from the start point and went forward to corner A; after passing through corner A, he went straight to corner B. The trajectories of the three methods are very close to each other in this part. However, from corner B to corner C, the trajectories of the ICP method and the reference method deviate considerably, which eventually leads to larger closure errors than our method.
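The percentage closure error reported in Table 6 is simply the gap between the first and last estimated positions divided by the total path length. A minimal sketch, using a toy trajectory rather than our measured data:

```python
import numpy as np

def closure_error_percent(traj):
    """Percentage closure error of a trajectory that should end where
    it started: ||end - start|| / path length * 100.

    traj: (N, 3) array of estimated camera positions.
    """
    path_len = np.sum(np.linalg.norm(np.diff(traj, axis=0), axis=1))
    closure = np.linalg.norm(traj[-1] - traj[0])
    return 100.0 * closure / path_len

# Toy loop: out 10 m and back, stopping 0.5 m short of the start,
# so the closure error is 0.5 / 19.5 * 100, roughly 2.56%.
traj = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0], [0.5, 0.0, 0.0]])
print(closure_error_percent(traj))
```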
By analyzing the motion states at corners A and B, we observe that corner A starts at frame 295 and ends at frame 330 (35 frames in total), whereas corner B starts at frame 397 and ends at frame 422 (25 frames in total). As the frame rate of the camera was fixed, the camera took more time to pass corner A. In other words, the speed of the camera at corner B is higher than that at corner A, and hence the motion there is more intense. Further, as shown in Figure 12b, from the top view of the three trajectories, the trajectories of the ICP method and the reference method begin to deform after corner B. However, owing to the two improvements of our method, i.e., the adaptive weighting model for motion estimation and the angular error functions, the positioning result of our method is less affected by the rapid rotation at corner B than those of the ICP method and the reference method. Therefore, our trajectory is closer to the predetermined quadrilateral path. This experiment also shows that our proposed method is applicable to sequence images captured with our stereo camera.
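The frame counts above translate directly into a rough angular-speed comparison. Assuming each corner is an approximately 90° turn and taking an illustrative frame rate of 20 fps (both are assumptions for illustration, not measured values from this run), the mean angular speeds would be:

```python
# Mean angular speed of a turn, in degrees per second, from its frame count.
# TURN_DEG and FPS are illustrative assumptions, not measured values.
TURN_DEG = 90.0
FPS = 20.0

def mean_angular_speed(n_frames, turn_deg=TURN_DEG, fps=FPS):
    return turn_deg / (n_frames / fps)

speed_a = mean_angular_speed(35)  # corner A: frames 295-330
speed_b = mean_angular_speed(25)  # corner B: frames 397-422
print(speed_a, speed_b)  # corner B turns 35/25 = 1.4x faster than corner A
```

Whatever the true frame rate, the ratio of the two speeds is fixed by the frame counts alone, which is why corner B stresses the motion estimation more.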

Conclusions
In this paper, we have presented an improved point-line feature based visual SLAM method for indoor scenes. The proposed SLAM method has two main innovations: the angular error function added in the optimization process of line segment features, and the adaptive weighting model in iterative pose estimation. A line segment is a higher-dimensional feature than a point and carries more structural characteristics and geometric constraints; our optimization model of line segment features with added angular error functions can better utilize this advantage than the traditional optimization model. Furthermore, after the Hessian matrices and gradient vectors of the two kinds of features are established, our model of motion estimation, which is adaptive to the motion state of the camera, is applied to build a new recombined Hessian matrix and gradient vector for iterative pose estimation.
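The recombination step described above can be sketched as follows. The exact adaptive weights depend on the average re-projection error and the inter-frame motion state, so the weights below are plain placeholders; the sketch only shows how weighted point and line normal equations yield one Gauss-Newton pose increment:

```python
import numpy as np

def combined_gn_step(H_p, g_p, H_l, g_l, w_p, w_l):
    """One Gauss-Newton update from recombined point/line normal equations.

    H_p, H_l: 6x6 Gauss-Newton Hessians (J^T J) of the point and line
    re-projection errors; g_p, g_l: the matching gradients (J^T r).
    w_p, w_l: placeholder weights standing in for the adaptive
    weighting model described in the method section.
    """
    H = w_p * H_p + w_l * H_l        # recombined Hessian
    g = w_p * g_p + w_l * g_l        # recombined gradient
    return np.linalg.solve(H, -g)    # 6-DoF pose increment

# Toy check: identical point/line systems with equal weights reduce to
# the plain Gauss-Newton step -g.
H = np.eye(6)
g = np.ones(6)
step = combined_gn_step(H, g, H, g, 0.5, 0.5)
print(step)  # → [-1. -1. -1. -1. -1. -1.]
```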
We also presented the evaluation results of the proposed SLAM method as compared with the point-line SLAM method developed in [38,39], which uses the traditional error model of line features and a weighting model based on residual errors, on both the EuRoC MAV datasets and the sequence images captured with our stereo camera. We also compared against the point-to-point ICP method [45] using the sequence images from our stereo camera. According to the experimental results, we arrive at two conclusions. First, the proposed SLAM method has more geometric constraints than the traditional point-line SLAM method and the classic ICP method, because the angular error function is added to the optimization model of line segment features; consequently, it has good robustness and positioning accuracy at large turns. This is particularly useful for robot navigation in indoor scenes, as they include many corners. Second, the adaptive weighting model for motion estimation can better utilize the advantages of point and line segment features in different motion states. Thus, it can improve the system accuracy when the camera moves with rapid rotation or severe fluctuation.
At present, we have mainly used the 2D structural constraints of line segment features. In the future, we plan to further improve our SLAM method by introducing the 3D structural constraints of spatial line segments. Furthermore, topological relations between point features and line segment features will also be considered, so as to better match point and line segment features in indoor environments with repeated textures.
