Point-Line Visual Stereo SLAM Using EDlines and PL-BoW

Abstract: Visual Simultaneous Localization and Mapping (SLAM) technologies based on point features achieve high positioning accuracy and complete map construction. However, despite their time efficiency and accuracy, such SLAM systems are prone to instability and even failure in poorly textured environments. In this paper, line features are integrated with point features to enhance the robustness and reliability of stereo SLAM systems in poorly textured environments. First, the Edge Drawing lines (EDlines) detector is applied to reduce the line feature detection time. Meanwhile, the proposed method improves the reliability of features by eliminating line feature outliers based on the entropy scale and geometric constraints. Furthermore, this paper proposes a novel Bag of Words (BoW) model combining point and line features to improve the accuracy and robustness of the loop detection used in SLAM. The proposed PL-BoW technique achieves this by taking into account the co-occurrence information and spatial proximity of visual words. Experiments using the KITTI and EuRoC datasets demonstrate that the proposed stereo Point and EDlines SLAM (PEL-SLAM) achieves high accuracy consistently, including in challenging environments that are difficult to sense accurately. The processing time of the proposed method is reduced by 9.9% and 4.5% compared to the Point and Line SLAM (PL-SLAM) and stereo Point and Line based Visual Odometry (sPLVO) methods, respectively.


Introduction
Simultaneous Localization and Mapping (SLAM) was initially proposed by Smith in 1987 [1]. Since then, diverse systems that simultaneously estimate the position of the onboard sensors and construct a map of the surrounding environment from the captured scene information have been extensively developed. This has been especially important in the field of robot navigation using diverse types of camera systems (e.g., monocular, stereo, and panoramic) [2][3][4][5].
In recent years, most research has focused on improving the accuracy of point-feature SLAM, and many encouraging breakthroughs have been made [6,7]. Such research includes the development of the ORB-SLAM2 system, based on Parallel Tracking and Mapping (PTAM) ideas, which improves the place recognition and loop closure modules [8]. The invariance of ORB features to different viewpoints and illuminations has been employed to improve tracking, mapping, and loop closure. As a result, ORB-SLAM2 has become one of the most popular systems for richly textured scenes. Despite the success of the ORB-SLAM2 framework, the expanding applications of mobile robots, augmented reality, and autonomous driving impose new challenges, still to be resolved, related to low-textured or structured engineered environments [9,10]. Traditional point-based strategies are unstable and even fail in scenarios where the point features are sparse or unevenly distributed. Fortunately, Cadena et al. pointed out that one way to solve this problem is to find an alternative feature that is abundant in the environment [3]. The low sensitivity of line features to light changes and motion blur has made them a prevalent topic in SLAM research. Lu et al. proposed an RGB-D SLAM system that combines line features and depth information to address indoor illumination changes [11]. Scaramuzza et al. focused on extracting the vertical lines in the wide-field images captured by an omnidirectional camera to improve the accuracy and robustness of a mobile robot visual odometry system [12,13]. Pumarola et al. improved the initialization accuracy of a monocular SLAM system by synchronously computing point and line features in the initialization thread [14]. Ma et al. utilized vanishing points (VPs) to constrain line features, which greatly reduced line feature mismatching [15].
Despite these improvements, the above line segment detection and matching methods are time-consuming, making line-based SLAM methods difficult to run in real time.
To enhance the effectiveness of line detection methods, Gomez-Ojeda et al. combined the Line Segment Detector (LSD) with the Line Band Descriptor (LBD) algorithm to develop an enhanced stereo version of Point and Line SLAM (PL-SLAM). This method has shown higher translation accuracy than ORB-SLAM2 in feature-rich scenes, which proved that stereo systems are more accurate and resistant to interference than monocular systems in line detection [16]. Berenguer et al. proposed a method to estimate the relative attitude using holistic descriptors of the omnidirectional image. This method solves the problem that omnidirectional images cannot handle height changes of mobile robots, but further research is required to apply it to estimating motion with six degrees of freedom [17]. Inspired by the Scale Invariant Feature Transform (SIFT) algorithm, Li et al. proposed the scale-invariant mean-standard deviation LSD (SMLSD) to extract line features faster without sacrificing detection accuracy [18]. However, Zuo et al. showed that the performance of LSD line detection is still unsatisfactory for real-time applications [19]. There is an urgent need for line detection methods that can accurately extract line features at a fast rate regardless of the geometrical complexity of the environment. To realize real-time SLAM with point and line features, Zhang et al. proposed a line detector based on Canny edges that obtains line features iteratively [20]. In contrast to [20], Vakhitov et al. focused on improving the accuracy and robustness of line feature extraction by training a deep yet lightweight fully convolutional neural network [21]. Gomez-Ojeda et al. [22] and Luo et al. [23] introduced the Fast Line Detector (FLD) algorithm to reduce the detection time of the LSD method. The fast detection speed and straightforward logic of the FLD approach make it suitable for point-line SLAM.
Although effective, FLD requires prior information about the scene to determine the needed parameters, which limits its usefulness in previously unknown environments, such as those encountered inside collapsed buildings.
Another hindrance that prevents line features from being extensively used is the complex outlier culling required, which makes their implementation nontrivial. Line features have characteristics that change with viewpoint and occlusion, so methods for removing inappropriate features from the detected lines have been a focus of constant research. Shao et al. utilized a coplanarity constraint on line features to filter mismatches, but the method lacks real-time processing capability [24]. Ma et al. trained a general classifier to judge the correctness of any hypothetical match [25]. This method converts mismatch elimination into a binary classification problem, but the performance of the algorithm in unknown scenes still requires further experimental verification. Lim et al. proposed a structural model based on epipolar and vanishing points to remove degenerate LSD line features and improve the reliability of line features [26].
Although the above-mentioned approaches present many practical solutions to line feature detection and outlier elimination, they still do not effectively address loop detection with point-line features. Pumarola et al. detected only point features in the PL-SLAM loop closure to reduce computational complexity [14]. Gomez-Ojeda et al. built a line Bag of Words (BoW) model parallel to the point BoW to achieve independent line loop detection [22]. However, a point-line detection method that ignores the geometric proximity between point and line features does not significantly improve the accuracy of loop detection. Ma et al. modeled a BoW containing VPs and their connecting lines to improve the accuracy of the SLAM system in corridors [15]. To address false loop detections caused by similar point maps, researchers have pointed out that the accuracy and robustness of loop detection can be improved by building a BoW model that considers the spatial proximity and co-occurrence information of the visual vocabulary [27,28]. In this context, an effective Point and EDlines SLAM framework, named PEL-SLAM, is proposed to solve the above-mentioned issues of line feature detection, line outlier elimination, and loop detection.
Specifically, a point-line stereo SLAM system based on Edge Drawing lines (EDlines) [29] and the improved BoW model is proposed to realize localization and mapping in human-made environments. The proposed approach takes advantage of line features in low-textured environments and of PL-BoW for loop detection. Point and line features of the input stereo images are detected by ORB and EDlines, followed by calculating the uncertainty entropy that describes the accuracy of the detected line features. The PL-BoW is then obtained using a PL pair selection technique based on the ORB point and LBD line descriptors. Finally, a loop detection mechanism based on PL-BoW, a similarity score, and space consistency detection of the keyframes is employed to determine the best matching keyframe. The three main contributions of this paper are: (1) a stereo SLAM system based on the integration of point and line features, which employs the EDlines algorithm to speed up line feature detection in the front-end of the system, together with a comprehensive representation and transformation of line features; (2) a line feature outlier elimination method based on the entropy scale and geometric constraints, which improves the reliability of the detected line features; (3) a novel PL-BoW model that exploits the co-occurrence information and spatial proximity of point and line visual words to improve the accuracy and robustness of loop detection. The remainder of this paper is organized as follows. Section 2 provides the geometric expression, detection, and matching methods of line features. The graph optimization for line features and the improved loop detection mechanism are presented in Section 3. Section 4 compares and analyzes the experimental results of the proposed algorithm against existing methods. The conclusions are presented in Section 5.

Representation and Detection of Line Features
The schematic diagram of the proposed PEL-SLAM system is illustrated in Figure 1. Since PEL-SLAM builds on ORB-SLAM2, most modules of the system are the same as in ORB-SLAM2, except for the green parts in Figure 1.
Improvements of the line representation, line feature detection, and line feature matching in the front-end of the proposed method are presented in this section, while improvements of Bundle Adjustment (BA) optimization and loop closure in the back-end of the system are described in the following sections.

Geometric Representation of Lines
A spatial line can be expressed via its two endpoints in the image plane, $\mathbf{s}_k = [u_s, v_s, 1]^T$ and $\mathbf{e}_k = [u_e, v_e, 1]^T$ (see Figure 2). Transforming 3D lines from the world reference to the image plane is an essential operation of the visual odometry. The Plücker parameterization is observable and computationally simple, and it is used for the transformation and projection of line features. However, its over-parameterization hinders the performance of the system in back-end optimization. To address this, an orthonormal representation is introduced for optimization. The spatial line $\mathcal{L}$ in Plücker coordinates is expressed by $\mathcal{L} = [\mathbf{n}^T, \mathbf{d}^T]^T \in \mathbb{R}^6$, where $\mathbf{n} \in \mathbb{R}^3$ is the normal vector of the plane $\boldsymbol{\pi}_k$, and $\mathbf{d} \in \mathbb{R}^3$ represents the direction vector of the line. The plane $\boldsymbol{\pi}_k$ is spanned by the line $\mathcal{L}$ and the $k$-th camera coordinate origin $C_k$. As shown in Figure 2, the Plücker line coordinates are usually constructed by triangulation from two different camera frames. The parameters of the plane $\boldsymbol{\pi}_1 = [\pi_x, \pi_y, \pi_z, \pi_w]^T$ are determined by the back-projected endpoints $\mathbf{s}_1$, $\mathbf{e}_1$ and the camera origin $C_1 = [x_1, y_1, z_1]^T$ in the world reference as follows:
$$[\pi_x, \pi_y, \pi_z]^T = [\mathbf{s}_1 - C_1]_\times (\mathbf{e}_1 - C_1), \qquad \pi_w = -[\pi_x, \pi_y, \pi_z]\, C_1,$$
where $[\cdot]_\times$ is the skew-symmetric matrix of a vector in $\mathbb{R}^3$. Other planes (e.g., $\boldsymbol{\pi}_k$) are obtained by the same calculation. Given the planes $\boldsymbol{\pi}_1$ and $\boldsymbol{\pi}_2$, the line feature in the world reference is determined via the dual Plücker matrix
$$\mathcal{L}^* = \boldsymbol{\pi}_1 \boldsymbol{\pi}_2^T - \boldsymbol{\pi}_2 \boldsymbol{\pi}_1^T.$$
With the Plücker coordinates known, the transformation of the line becomes convenient in 3D Euclidean space.
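As a concrete sketch of this triangulation, the following NumPy snippet builds the two planes from 3D endpoints and camera origins and reads the Plücker normal and direction off the dual matrix. The function names are ours, and the endpoints are assumed already back-projected to 3D, unlike in a full stereo pipeline:

```python
import numpy as np

def plane_from_points(p1, p2, p3):
    """Plane [pi_x, pi_y, pi_z, pi_w] through three 3D points:
    the normal is a cross product, the offset makes pi . [p1, 1] = 0."""
    normal = np.cross(p2 - p1, p3 - p1)
    return np.append(normal, -normal @ p1)

def line_from_planes(pi1, pi2):
    """Intersect two planes via the dual Pluecker matrix
    L* = pi1 pi2^T - pi2 pi1^T and read off the line's moment n and
    direction d, each defined only up to a common scale and sign."""
    L_star = np.outer(pi1, pi2) - np.outer(pi2, pi1)
    n = L_star[:3, 3]                      # moment (normal) part
    M = L_star[:3, :3]                     # skew-symmetric block holds d
    d = np.array([M[1, 2], M[2, 0], M[0, 1]])
    return n, d
```

For example, the line through $(1,0,0)$ and $(1,1,0)$ triangulated from camera origins $(0,0,0)$ and $(0,0,1)$ yields a direction parallel to $(0,1,0)$, and the recovered pair satisfies the Plücker constraint $\mathbf{n} \cdot \mathbf{d} = 0$.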
Given the transformation matrix $T_{cw}$ from the world reference to the camera reference, with rotation $R_{cw}$ and translation $\mathbf{t}_{cw}$, the corresponding Plücker transformation matrix is defined by:
$$\mathcal{T}_{cw} = \begin{bmatrix} R_{cw} & [\mathbf{t}_{cw}]_\times R_{cw} \\ \mathbf{0} & R_{cw} \end{bmatrix}.$$
Then the Plücker representation of the given line feature can be transformed by $\mathcal{L}_c = \mathcal{T}_{cw} \mathcal{L}_w$. After calculating the representation of the line feature in the camera reference, the spatial line is projected to the image plane through $\mathbf{l} = \mathcal{K} \mathbf{n}_c$, where
$$\mathcal{K} = \begin{bmatrix} f_y & 0 & 0 \\ 0 & f_x & 0 \\ -f_y c_x & -f_x c_y & f_x f_y \end{bmatrix}$$
denotes the projection matrix of a given line, and $f_x$, $f_y$, $c_x$, $c_y$ represent the intrinsic parameters of the calibrated camera. It should be noted that when a line feature is projected onto a normalized image plane, $\mathcal{K}$ is an identity matrix. Since a spatial line in 3D space has only four degrees of freedom, using the six-parameter Plücker representation increases the computational complexity of the optimization process. The de-coupled orthonormal representation $(U, W) \in SO(3) \times SO(2)$ employed in the unconstrained optimization problem is obtained from the QR decomposition of the Plücker coordinates.
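The transformation and projection steps above can be sketched as follows (NumPy; names such as `plucker_transform` are ours, not from the PEL-SLAM implementation):

```python
import numpy as np

def skew(v):
    """Skew-symmetric matrix [v]_x of a 3-vector."""
    return np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])

def plucker_transform(R, t):
    """6x6 Pluecker motion matrix for the rigid motion X_c = R X_w + t."""
    T = np.zeros((6, 6))
    T[:3, :3] = R
    T[:3, 3:] = skew(t) @ R
    T[3:, 3:] = R
    return T

def line_projection_matrix(fx, fy, cx, cy):
    """Line projection matrix K (proportional to det(K_pt) * K_pt^{-T})."""
    return np.array([[fy, 0, 0], [0, fx, 0], [-fy * cx, -fx * cy, fx * fy]])

def project_line(L_c, K_line):
    """Project the normal part of a camera-frame Pluecker line to an image line."""
    return K_line @ L_c[:3]
```

A quick sanity check: translating the line through $(0,0,5)$ and $(1,0,5)$ by $\mathbf{t} = (0,0,-2)$ yields $\mathbf{n}_c = (0,3,0)$, and its projection on the normalized plane is the image line $v = 0$, which both projected endpoints satisfy.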
Furthermore, $U$ and $W$ are given by
$$[\mathbf{n} \ \ \mathbf{d}] = U \begin{bmatrix} w_1 & 0 \\ 0 & w_2 \\ 0 & 0 \end{bmatrix}, \qquad W = \frac{1}{\sqrt{\|\mathbf{n}\|^2 + \|\mathbf{d}\|^2}} \begin{bmatrix} \|\mathbf{n}\| & -\|\mathbf{d}\| \\ \|\mathbf{d}\| & \|\mathbf{n}\| \end{bmatrix},$$
where $U = R(\boldsymbol{\phi})$ and $\boldsymbol{\phi} = [\phi_1, \phi_2, \phi_3]^T$ denotes the angle of rotation from the camera reference to the line reference (see Figure 2). Since $W$ implies one-dimensional scale information parameterized by a single angle $\theta$, the minimum parameterization of a spatial line can be defined by the four-dimensional vector $\boldsymbol{\delta} = [\phi_1, \phi_2, \phi_3, \theta]^T$. Once the orthonormal representation of the optimized feature is obtained, the corresponding Plücker coordinates are calculated by
$$\mathcal{L} = [w_1 \mathbf{u}_1^T, \ w_2 \mathbf{u}_2^T]^T,$$
in which $\mathbf{u}_i$ denotes the $i$-th column of $U$, and $w_1$, $w_2$ are the entries of the first column of $W$.
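A minimal sketch of the QR-based construction and its inverse mapping, assuming a valid line with $\mathbf{n} \perp \mathbf{d}$ (function names are ours):

```python
import numpy as np

def orthonormal_from_plucker(n, d):
    """(U, W) from the QR decomposition of the 3x2 matrix [n | d];
    U is completed to SO(3) with a cross product, W lies in SO(2)."""
    Q, R = np.linalg.qr(np.column_stack([n, d]))   # reduced QR: Q 3x2, R 2x2
    U = np.column_stack([Q[:, 0], Q[:, 1], np.cross(Q[:, 0], Q[:, 1])])
    w = np.array([R[0, 0], R[1, 1]]) / np.hypot(R[0, 0], R[1, 1])
    W = np.array([[w[0], -w[1]], [w[1], w[0]]])
    return U, W

def plucker_from_orthonormal(U, W):
    """L = [w1 * u1, w2 * u2]: the line recovered up to an overall scale."""
    return np.concatenate([W[0, 0] * U[:, 0], W[1, 0] * U[:, 1]])
```

The round trip recovers $(\mathbf{n}, \mathbf{d})$ divided by $\sqrt{\|\mathbf{n}\|^2 + \|\mathbf{d}\|^2}$, which is harmless because Plücker coordinates are homogeneous.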

Extraction and Description of Line Features
The line segment detector LSD, which gives accurate line detection without parameter tuning and is widely applied in state-of-the-art PL-SLAM systems, is designed to extract line segments from noisy images with sub-pixel accuracy [30]. However, the computational complexity of LSD adds extra time to the front-end feature extraction. Since this limitation directly affects the tracking and matching of line features, it can lead to failure of the visual odometry.
To improve the real-time performance of line feature extraction, the EDlines algorithm is used instead of LSD to detect line features in the images captured by the stereo camera. EDlines has been proven to run faster than LSD, although its output tends to contain irrelevant lines [15]. The experiment in [31] shows that, with an appropriate line outlier elimination method, EDlines achieves a performance similar to LSD without sacrificing detection accuracy.
In this work, the individual performance of the LSD and EDlines mechanisms was validated on three different scenes. Referring to Figure 3, the middle three images show the line features detected by LSD, whereas the three images on the right show the detection results of EDlines. The results show that both algorithms are able to detect the line features of interest that correspond to real straight lines in the given environments. However, EDlines is more likely to detect curves that lie near straight lines, and such curves may degrade the performance of the system. Since these curves usually possess poor spatial positions, they can be easily recognized from different camera views. Thus, the tests indicate that EDlines can achieve effective line detection without corrupting the system, provided that suitable line rejection strategies are applied.
To make the line features detected by EDlines more recognizable, the LBD descriptor [32] is computed to represent each line feature. Similar to the ORB descriptor for point features, the LBD descriptor contains geometric attributes and an appearance description of the corresponding line feature. For two consecutive stereo frames, the similarity of line features is measured by calculating the consistency of the LBD descriptors between line pairs.

Matching and Outlier Elimination of Line Features
After extracting line features, LBD descriptors are generated to describe the corresponding line features in consecutive frames. The similarity of the descriptors is calculated to find matching line features. An extracted line feature is considered a good match only if it is the best match in both images of a stereo frame. Before eliminating outliers, preprocessing of the matched pairs is required to improve the matching accuracy. Given the occlusions and perspective changes that exist in real-world environments, a line pair is not considered a match if the lengths of the two lines differ by more than a factor of two. Likewise, a line pair is considered mismatched if the distance between the midpoints of the two lines in the image exceeds a given threshold.
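The pre-checks above (mutual best match, length ratio below two, bounded midpoint distance) can be sketched as follows; the binary descriptors here are hypothetical stand-ins for real LBD output, and the `mid_thresh` value is illustrative:

```python
import numpy as np

def filter_matches(lines_l, lines_r, desc_l, desc_r, mid_thresh=40.0):
    """Keep a left/right line pair only if it is the mutual best descriptor
    match and passes the length-ratio and midpoint-distance pre-checks.
    Lines are endpoint arrays [[us, vs, ue, ve], ...]; descriptors are
    binary {0, 1} arrays compared by Hamming distance."""
    D = (desc_l[:, None, :] != desc_r[None, :, :]).sum(-1)  # Hamming distances
    best_lr = D.argmin(1)
    best_rl = D.argmin(0)
    matches = []
    for i, j in enumerate(best_lr):
        if best_rl[j] != i:
            continue                      # not a mutual best match
        li, rj = lines_l[i], lines_r[j]
        len_l = np.hypot(li[2] - li[0], li[3] - li[1])
        len_r = np.hypot(rj[2] - rj[0], rj[3] - rj[1])
        if max(len_l, len_r) > 2.0 * min(len_l, len_r):
            continue                      # lengths differ by more than 2x
        mid_l = 0.5 * np.array([li[0] + li[2], li[1] + li[3]])
        mid_r = 0.5 * np.array([rj[0] + rj[2], rj[1] + rj[3]])
        if np.linalg.norm(mid_l - mid_r) > mid_thresh:
            continue                      # midpoints too far apart
        matches.append((i, j))
    return matches
```

A pair whose right-image line is five times longer than its left-image counterpart is rejected by the length check even when its descriptor match is perfect.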
For a stereo frame, a spatial line is represented as $\mathbf{l}_l$ and $\mathbf{l}_r$ in the left and right images, respectively, where $\mathbf{l}_r = [l_{r1}, l_{r2}, l_{r3}]^T$ is computed from the homogeneous coordinates of its endpoints. Since the two images of a stereo frame are rectified, matching points share the same vertical position. Therefore, the endpoints in the right image corresponding to those in the left image can be obtained by intersecting $\mathbf{l}_r$ with the horizontal lines $v = v_s$ and $v = v_e$, as illustrated in Figure 4.
The depths of $\mathbf{s}$ and $\mathbf{e}$ are calculated from the disparities as follows:
$$z_s = \frac{f_x b}{\Delta u_s}, \qquad z_e = \frac{f_x b}{\Delta u_e},$$
where $\Delta u_s = u_s - u'_s$ and $\Delta u_e = u_e - u'_e$ are the endpoint disparities between the left and right images, and $b$ denotes the baseline of the stereo camera. The 3D position $P_s$ is then obtained by the back-projection of the endpoint $\mathbf{s}$ as follows:
$$P_s = \left[ \frac{(u_s - c_x)\, z_s}{f_x}, \ \frac{(v_s - c_y)\, z_s}{f_y}, \ z_s \right]^T. \tag{13}$$
Since line features are more sensitive to image noise and mismatching than point features, the uncertainty of line features is modeled to quantify their reliability. Considering the spatial properties of line features, the covariance propagation method is used to construct the uncertainty matrix of the endpoint:
$$\Sigma_{P_s} = J_{P_s}\, \mathrm{cov}(\mathbf{s})\, J_{P_s}^T,$$
in which $J_{P_s}$ represents the Jacobian matrix of $P_s$ with respect to the image measurement, and $\mathrm{cov}(\mathbf{s})$ is the uncertainty matrix of $\mathbf{s}$. Based on the properties of $\mathbf{s}$ in the image plane, $\mathrm{cov}(\mathbf{s})$ is modeled as a bi-dimensional Gaussian with standard deviations $\sigma_u = \sigma_v = 1$ pixel. The matrix $J_{P_s}$ is derived by differentiating Equation (13). Because the scale of uncertainty varies between image pairs, the covariance values are not directly comparable. The entropy of the multivariate normal distribution is therefore introduced to abstract the uncertainty in the covariance matrix into a scalar value, defined as:
$$H_s = 0.5\, m (\ln 2\pi + 1) + 0.5 \ln(|\Sigma_{P_s}|), \tag{16}$$
where $m$ is the dimension of $P_s$. The uncertainty entropy $H_e$ of $P_e$ is calculated in the same way.
In the methods of [20,24], outliers are processed by removing matching pairs that do not meet a constant preset threshold. However, a constant threshold is not suitable because the uncertainty depends on the changes of motion and scene. A threshold determined from the uncertainty entropy of all line features in the current frame is therefore applied to the outlier elimination process. After calculating the average entropy $\bar{H}_s$ over all line features of the current frame, the threshold is set to $Td_c = 0.85\, \bar{H}_s$ according to experiments. If the uncertainty entropies $H_s$ and $H_e$ of a spatial line reconstructed from the stereo frame are greater than $Td_c$, the line is regarded as an accurate line feature, and vice versa.
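A minimal sketch of the entropy computation and the adaptive threshold rule described above (function names are ours; the keep/reject direction follows the rule as stated in the text):

```python
import numpy as np

def endpoint_entropy(cov):
    """Entropy of an m-dimensional Gaussian:
    H = 0.5 * m * (ln 2*pi + 1) + 0.5 * ln|Sigma|  (Equation (16))."""
    m = cov.shape[0]
    return 0.5 * m * (np.log(2 * np.pi) + 1) + 0.5 * np.log(np.linalg.det(cov))

def filter_lines_by_entropy(covs_s, covs_e, ratio=0.85):
    """Adaptive outlier rejection: compare each line's two endpoint
    entropies against ratio * mean entropy of the current frame and
    return a boolean keep-mask over the lines."""
    H_s = np.array([endpoint_entropy(c) for c in covs_s])
    H_e = np.array([endpoint_entropy(c) for c in covs_e])
    td = ratio * H_s.mean()
    return (H_s > td) & (H_e > td)
```

Because the threshold is recomputed per frame, the rule adapts to scenes where the overall uncertainty level shifts with motion and depth, which a constant threshold cannot do.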

Bundle Adjustment and Loop Closure with Points and Lines
The BA module in the back-end optimization consists of two main parts: the local BA of the local mapping thread and the global BA of the loop closure thread. The original BA module of ORB-SLAM2 includes camera poses and point landmarks as variables. However, the addition of line features complicates the optimization, as the cost function must be minimized over the co-visibility graph. Therefore, a graph BA optimization strategy considering both line and point features is adopted in this paper. In addition, the accuracy of the loop closure depends on the global BA and the loop detection. A novel point and line BoW is thus proposed to improve the stability and accuracy of the loop detection.

Graph Optimization with Point and Line Features
Because the length and angle of the same line differ between two images, projection errors cannot be obtained directly from the corresponding frames. In this work, the projection errors are computed by re-projecting the matched lines from the world reference back to the current image reference. Given the spatial line $\mathcal{L}$ in the world reference and its orthonormal expression $\mathcal{O}$, the corresponding line feature is transformed into the camera frame of reference by $T_{cw}$. The projected line $\hat{\mathbf{l}}_i^c$ is then obtained by projecting the line feature onto the normalized plane of the current frame. The reprojection error is defined as the distance between the projected line $\hat{\mathbf{l}}_i^c$ and the endpoints of the detected line $\mathbf{l}_i^c$, which is computed via Equation (17).
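Since the error is the normalized distance of each detected endpoint to the projected line, it reduces to a dot product with homogeneous endpoints; a minimal sketch (the function name is ours):

```python
import numpy as np

def line_reprojection_error(l_proj, s_det, e_det):
    """Signed distances of the two detected endpoints (homogeneous
    [u, v, 1]) to the projected line l_proj = [l1, l2, l3]."""
    return np.array([s_det @ l_proj, e_det @ l_proj]) / np.hypot(l_proj[0], l_proj[1])
```

For the image line $v = 0$, i.e. $\hat{\mathbf{l}} = [0, 1, 0]^T$, the endpoint $(3, 2)$ lies at distance 2 and $(5, -1)$ at signed distance $-1$.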
The overall cost function combines the point and line reprojection errors:
$$C = \sum_{i} \rho\!\left( \mathbf{r}_{p,i}^T\, \Sigma_{p,i}^{-1}\, \mathbf{r}_{p,i} \right) + \sum_{j} \rho\!\left( \mathbf{r}_{l,j}^T\, \Sigma_{l,j}^{-1}\, \mathbf{r}_{l,j} \right), \tag{18}$$
where $\rho$ denotes the robust Huber cost function, and $\Sigma_p^{-1}$, $\Sigma_l^{-1}$ denote the information (inverse covariance) matrices of the point and line reprojection errors, respectively. Compared to the original cost function of the ORB-SLAM2 system, the line reprojection error term is added in Equation (18). The camera poses and landmarks are calculated by minimizing the cost function $C$. The Jacobians of $\mathbf{r}_p$ are already derived in ORB-SLAM2, and the Jacobian $J_l$ of the line error with respect to the camera pose and line landmark can be expressed by the chain rule:
$$J_l = \frac{\partial \mathbf{r}_l}{\partial \mathbf{l}^c}\, \frac{\partial \mathbf{l}^c}{\partial \mathcal{L}_c}\, \frac{\partial \mathcal{L}_c}{\partial (\cdot)}, \tag{19}$$
where $(\cdot)$ stands for the camera state or the line parameters. With $l_n = \sqrt{l_1^2 + l_2^2}$, the Jacobian of the line reprojection error relative to the projected line $\mathbf{l}^c = [l_1, l_2, l_3]^T$ is expressed as:
$$\frac{\partial \mathbf{r}_l}{\partial \mathbf{l}^c} = \begin{bmatrix} \dfrac{u_s}{l_n} - \dfrac{l_1\, \mathbf{s}^T \mathbf{l}^c}{l_n^3} & \dfrac{v_s}{l_n} - \dfrac{l_2\, \mathbf{s}^T \mathbf{l}^c}{l_n^3} & \dfrac{1}{l_n} \\[2mm] \dfrac{u_e}{l_n} - \dfrac{l_1\, \mathbf{e}^T \mathbf{l}^c}{l_n^3} & \dfrac{v_e}{l_n} - \dfrac{l_2\, \mathbf{e}^T \mathbf{l}^c}{l_n^3} & \dfrac{1}{l_n} \end{bmatrix}.$$
According to the 3D line projection $\mathbf{l}^c = \mathcal{K} \mathbf{n}_c$, Equation (21) is obtained:
$$\frac{\partial \mathbf{l}^c}{\partial \mathcal{L}_c} = \begin{bmatrix} \mathcal{K} & \mathbf{0}_{3 \times 3} \end{bmatrix}. \tag{21}$$
The term $\frac{\partial \mathcal{L}_c}{\partial \delta \mathbf{x}_i}$ represents the Jacobian matrix of the line feature with respect to the translation and rotation errors in the camera reference, where $\delta \mathbf{x}_i = [\delta \mathbf{p}^T, \delta \boldsymbol{\theta}^T]^T$. The line features only constrain the translation $\mathbf{p}$ and the rotation $\boldsymbol{\theta}$ in the state variable, and, using a right perturbation, the corresponding Jacobians are calculated as follows:
$$\frac{\partial \mathcal{L}_c}{\partial \delta \mathbf{p}} = \begin{bmatrix} -[R_{cw} \mathbf{d}_w]_\times \\ \mathbf{0}_{3 \times 3} \end{bmatrix}, \qquad \frac{\partial \mathcal{L}_c}{\partial \delta \boldsymbol{\theta}} = \begin{bmatrix} -R_{cw} [\mathbf{n}_w]_\times - [\mathbf{t}_{cw}]_\times R_{cw} [\mathbf{d}_w]_\times \\ -R_{cw} [\mathbf{d}_w]_\times \end{bmatrix}.$$
The Jacobian matrix of the line in the camera reference with respect to the line in the world reference is the Plücker transformation matrix (i.e., the inverse of $\mathcal{T}_{wc}$):
$$\frac{\partial \mathcal{L}_c}{\partial \mathcal{L}_w} = \mathcal{T}_{cw}. \tag{24}$$
According to the orthonormal representation of a spatial line, the last term of Equation (19), the Jacobian with respect to the minimal parameters $\boldsymbol{\delta} = [\phi_1, \phi_2, \phi_3, \theta]^T$, can be defined as:
$$\frac{\partial \mathcal{L}}{\partial \boldsymbol{\delta}} = \begin{bmatrix} \mathbf{0}_{3 \times 1} & -w_1 \mathbf{u}_3 & w_1 \mathbf{u}_2 & -w_2 \mathbf{u}_1 \\ w_2 \mathbf{u}_3 & \mathbf{0}_{3 \times 1} & -w_2 \mathbf{u}_1 & w_1 \mathbf{u}_2 \end{bmatrix}. \tag{25}$$
With the analysis of the Jacobians completed, iterative algorithms, such as Gauss-Newton, can be employed to solve the local graph optimization of local mapping and the global graph optimization of loop closure.

Loop Closure with Points and Lines
In human-made environments, weak textures (e.g., white walls) and frequent light changes lead to false detections in traditional BoW-based loop closure. Insufficiently distinctive point features cause mismatches between frames. To address this, PL-BoW, a BoW that combines point and line features, is proposed; it utilizes the co-occurrence information and spatial proximity of visual words.
The visual words of point features are generated from the ORB descriptor, including the position $\mathbf{p}_p$ and direction $\theta_p$ of each point feature. At the same time, the direction $\theta_l$ and position of the corresponding line feature are obtained from its endpoints. As shown in Figure 5, the combination of a point and a line is defined as a PL pair only when the direction and distance between the line and the point are close enough. To improve the search speed for PL pairs, a K-D tree is constructed from the positions of the point features. The line features satisfying the PL pair condition are then selected to build the point and line K-D tree.
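A brute-force sketch of the PL pair selection (in a real system the nearest-neighbor query over point positions would use a K-D tree, e.g., `scipy.spatial.cKDTree`; the thresholds here are illustrative assumptions):

```python
import numpy as np

def select_pl_pairs(pts, pt_angles, line_mids, line_angles,
                    dist_th=30.0, ang_th=np.pi / 6):
    """For each line, find the nearest point feature and accept the
    (point, line) pair when both the distance and the orientation
    difference are small enough."""
    pairs = []
    for j, (mid, la) in enumerate(zip(line_mids, line_angles)):
        d = np.linalg.norm(pts - mid, axis=1)      # distances to all points
        i = int(d.argmin())                        # nearest point feature
        # wrap the angle difference into [-pi, pi] before comparing
        dang = np.abs((pt_angles[i] - la + np.pi) % (2 * np.pi) - np.pi)
        if d[i] < dist_th and dang < ang_th:
            pairs.append((i, j))
    return pairs
```

A point at the origin with orientation 0 and a line midpoint at $(5, 0)$ with orientation 0.1 rad form a valid PL pair under these thresholds, while a point 100 pixels away does not.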
Given a current keyframe $f_u$ and a candidate keyframe $f_c$, the corresponding BoW vectors are defined as $[\mathbf{v}_{p1}, \ldots, \mathbf{v}_{ps}, \mathbf{v}_{l1}, \ldots, \mathbf{v}_{lt}]$ and $[\mathbf{w}_{p1}, \ldots, \mathbf{w}_{ps}, \mathbf{w}_{l1}, \ldots, \mathbf{w}_{lt}]$, respectively. In loop closure detection, the similarity of the two keyframes is calculated through the BoW vectors as a weighted combination of the point and line similarities:
$$S(f_u, f_c) = \frac{N_p\, s(\mathbf{v}_p, \mathbf{w}_p) + N_l\, s(\mathbf{v}_l, \mathbf{w}_l)}{N_p + N_l}, \tag{26}$$
where $s(\cdot, \cdot)$ is the BoW vector similarity score, and $N_p$ and $N_l$ represent the total numbers of point and line features in the image, respectively. Mismatched keyframes with lower scores are removed from the set of candidate keyframes. The procedure of the improved PL-BoW based loop detection is given in Algorithm 1.
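A sketch of the scoring and candidate filtering, using the standard L1 BoW similarity and assuming the feature-count weighting of Equation (26):

```python
import numpy as np

def bow_score(v, w):
    """Standard L1 BoW similarity: 1 - 0.5 * || v/|v|_1 - w/|w|_1 ||_1."""
    return 1.0 - 0.5 * np.abs(v / np.abs(v).sum() - w / np.abs(w).sum()).sum()

def pl_similarity(vp, vl, wp, wl, n_p, n_l):
    """Point/line similarities combined with feature-count weights
    (an assumption standing in for the exact form of Equation (26))."""
    return (n_p * bow_score(vp, wp) + n_l * bow_score(vl, wl)) / (n_p + n_l)

def filter_candidates(scores, ratio=0.8):
    """Keep the candidate keyframes whose score reaches ratio * best score."""
    s = np.asarray(scores)
    return np.flatnonzero(s >= ratio * s.max())
```

Identical point and line BoW vectors yield a similarity of 1.0, and the 0.8-ratio filter mirrors the pruning step of Algorithm 1.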
Algorithm 1: PL-BoW based loop detection.
1: for each candidate keyframe $f_i \in F_c$ do
2:   Calculate the similarity $S_i$ between $f_u$ and $f_i$;
3: end
4: $S_{max} = \max_i \{ S_i \}$;
5: for each $f_i \in F_c$ do
6:   Remove $f_i$ with $S_i < 0.8\, S_{max}$;
7: end
8: Perform space consistency detection on the remaining set $F_{cm}$ to obtain the best match $f_{bm}$.

Experimental Verification
To test and analyze the proposed PEL-SLAM, a set of experiments was performed on an Intel Core i5-10600KF CPU @ 4.1 GHz with 32 GB RAM and no dedicated GPU. OpenCV and g2o were the main libraries, running on an Ubuntu 16.04 desktop. The proposed point-line visual stereo SLAM algorithm was compared with popular methods, including ORB-SLAM2 [8], PL-SLAM [16], and stereo Point and Line based Visual Odometry (sPLVO) [23]. The algorithms were tested on the KITTI stereo dataset [34] and the EuRoC micro aerial vehicle (MAV) dataset [33], which provide several challenging sequences of images in indoor and outdoor environments. The absolute Root Mean Square Errors (RMSEs) between the estimated translations and rotations of the SLAM systems and the ground truth given in the datasets were computed with the EVO [35] evaluation tool.
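For reference, the absolute RMSE reported by such tools reduces to the following computation once the estimated trajectory has been aligned to the ground truth (alignment itself is omitted in this sketch):

```python
import numpy as np

def absolute_rmse(estimate, groundtruth):
    """Absolute trajectory RMSE between aligned Nx3 position sequences:
    sqrt(mean over frames of ||p_est - p_gt||^2)."""
    err = estimate - groundtruth
    return np.sqrt((err ** 2).sum(axis=1).mean())
```

For a two-frame trajectory whose second position is off by 1 m along one axis, the RMSE is $\sqrt{0.5} \approx 0.707$ m.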

Stereo SLAM on KITTI Dataset
The test results in terms of the RMSE of ORB-SLAM2, PL-SLAM, and the proposed algorithm on the KITTI dataset are shown in Table 1. Since the strategy applied in sPLVO is not appropriate for outdoor environments, sPLVO was not tested on the KITTI dataset. The lowest absolute translation and rotation errors for each test are marked in bold. Overall, the proposed method performs better than the other two methods. The translation and rotation errors of the proposed method are about 24.8% and 29.6% lower than those of ORB-SLAM2, respectively. However, the application of line features in environments with insufficient line features can hinder the accuracy of the SLAM process. As shown in Figure 6, the KITTI 05 sequence is mostly forested with few line features, and the translation and rotation accuracy of the proposed method are reduced by 17.1% and 8.8%, respectively. Table 1 also shows that the proposed method outperforms PL-SLAM, which also employs both point and line features. One reason is that PL-BoW suppresses the long-term positioning drift of the system in large outdoor scenes. Another reason is that the proposed line feature outlier elimination method reduces front-end mismatching and further improves the accuracy of the constraints constructed in the back-end optimization.

Stereo SLAM on EuRoC Dataset
The accuracy of the four algorithms (PEL-SLAM, PL-SLAM, sPLVO, and ORB-SLAM2) was also compared on the EuRoC dataset, which includes indoor images of machine halls and rooms. Table 2 shows the absolute translation and rotation errors of these four algorithms. It is clear from Table 2 that the strategy used by sPLVO for indoor line features makes it competitive in indoor environments. Nevertheless, the performance of the proposed method on the EuRoC dataset is generally superior to PL-SLAM, ORB-SLAM2, and sPLVO. Compared with ORB-SLAM2, the translation and rotation accuracy of the proposed method are improved by nearly 5.7% and 7.3%, respectively. These improvements demonstrate that the incorporation of line features increases the accuracy of pose estimation and map construction. Although both sPLVO and the proposed method introduce line feature outlier elimination, the proposed method shows better accuracy in most test cases. Compared with the state-of-the-art sPLVO, the proposed method improves the translation and rotation accuracy by up to 17.2% and 24.8%, respectively. These results indicate that using the entropy scale to measure the uncertainty of line features and applying PL-BoW to loop detection benefit the accuracy of the SLAM process.
The tests also demonstrate that the proposed method adapts to environments with drastic changes and rapid motion of the carrier. Figure 7 illustrates the estimated trajectory comparisons on MH-02 and V1-01, which contain typical rapid motion and changing light. The accuracy of the two methods based on point and line features (i.e., sPLVO and PEL-SLAM) is higher than that of the point-only ORB-SLAM2. Meanwhile, the average rotation accuracy of the proposed method is nearly 14.3% higher than that of sPLVO.

Comparison of Processing Time
Since real-time performance is one of the important indicators of a SLAM system, the average processing time per frame of the different methods was compared (Table 3). Table 3 shows that the running time is directly affected by the image resolution, so the higher-resolution KITTI images require more processing time. At the same time, the use of line features increases the running time of the system, especially for line feature detection. PEL-SLAM reduces the time consumed by line feature detection. Compared with PL-SLAM, the proposed method saves approximately 7.7% and 12.2% of the average running time on the KITTI and EuRoC datasets, respectively. The running time of the proposed method is also 4.5% lower than that of sPLVO, without sacrificing the accuracy of pose estimation and map construction.

Conclusions
In this paper, a method that extracts point and line features and uses them to improve positioning, mapping, and loop detection is proposed. The PEL-SLAM system addresses the failure of point-based methods in poorly textured scenes. The proposed method uses the faster EDlines detector instead of the widely used LSD and leverages the entropy scale to measure the uncertainty of line features. Meanwhile, the proposed PL-BoW is constructed and applied to loop detection, which improves the accuracy of loop keyframe matching. These mechanisms enable SLAM to be performed in real time without loss of accuracy while producing a complete point-line map. Finally, detailed experiments on the KITTI and EuRoC datasets show that the proposed method outperforms several well-known methods in translation and rotation accuracy in challenging environments, with the additional benefit of consuming less time than state-of-the-art PL SLAM systems. In the proposed algorithm, the accuracy of the system depends on the performance of the detected visual features. Combining information from other sensors to improve the robustness of point-line SLAM will be the focus of future research.