A Monocular Visual Odometry Method Based on Virtual-Real Hybrid Map in Low-Texture Outdoor Environment

With the extensive application of robots, such as unmanned aerial vehicles (UAVs), in exploring unknown environments, visual odometry (VO) algorithms play an increasingly important role. The environments are diverse and not always well textured; low-texture scenes with insufficient point features are challenging for mainstream VO. However, in low-texture man-made environments, line segments are usually abundant due to the structural characteristics of the scene. In this paper, we propose a monocular visual odometry algorithm based on a virtual-real hybrid map. The core idea is to reprocess line segment features to generate virtual intersection matching points, which are used to build a virtual map. Introducing the virtual map improves the stability of the visual odometry algorithm in low-texture environments. Specifically, we first combine nonparallel matched line segments to generate virtual intersection matching points; then, based on these matching points, we triangulate a virtual map, which is combined with the real map built upon ordinary point features to form a virtual-real hybrid 3D map. Finally, using the hybrid map, the continuous camera pose estimation can be solved. Extensive experimental results demonstrate the robustness and effectiveness of the proposed method in various low-texture scenes.


Introduction
In recent years, intelligent robots have been widely developed and deployed. For instance, unmanned aerial vehicles (UAVs) have been widely used for resource surveys [1,2], terrain modeling [3], disaster rescue [4], and other applications. For robots, visual odometry (VO) and visual Simultaneous Localization and Mapping (vSLAM) are key technologies. With visual odometry, a robot can accurately estimate its trajectory in an unknown environment, which endows it with the ability to explore unknown environments on its own [5][6][7][8]. However, possible environments are highly diverse and not always textured; there are plenty of low-texture environments, such as streets in urban areas and artificial squares. In these environments, rich textures are absent, but lines are much more abundant due to the structure of the man-made scene. These low-texture environments create unique challenges for VO algorithms.
The VO algorithm is to estimate the motion of the robot using images, which is an important part of vSLAM technology. VO and vSLAM methods are usually divided into two categories: direct methods and feature-based methods.
The core of the direct method is to minimize photometric error. Since the direct method only requires the image to have sufficient image gradient, with no need for feature points, it can be applied to low-texture scenes. To avoid a high computational burden, direct methods generally exploit only pixels with strong gradients; LSD-SLAM [9], for example, is a semi-dense direct approach that minimizes the photometric error over such pixels. Among feature-based methods, line features have also been exploited: in stereo PL-SLAM, a novel bag-of-words approach was proposed, which employs points and line segments to enhance the loop closure process. These methods analyze lines directly, considering how to parameterize 3D lines and how to build an error model between 3D lines and their observations. Therefore, they do not consider utilizing line information from the perspective of feature combination. Furthermore, a significant drawback of using lines directly is that lines are often incompletely detected or partially occluded, leading to instability of the related algorithms. In this paper, by exploring the properties of line segments, we handle line segment information from the perspective of combining line segments, which is a novel way to deal with them. In this way, many benefits for VO algorithms can be obtained, as elaborated later in the paper.
Although some methods aimed at low-texture environments have been proposed, they mainly target indoor environments. Yang et al. [24] used a single-image pop-up plane model [25] to generate plane landmark measurements; based on these plane landmarks, a plane SLAM algorithm incorporating scene layout understanding was proposed. This approach was designed specifically for corridor scenes and is thus limited in scope. In addition, some methods use an RGBD camera as the sensor. Fu et al. [26] presented an RGBD SLAM algorithm using points and lines for indoor mobile robot pose estimation and 3D map generation. Fabian et al. [27] considered that depth maps can increase stability in poorly textured environments and that edges are more stable than raw intensity under varying illumination and, thus, presented an RGBD VO algorithm that uses a combination of edge images and depth maps to jointly minimize the edge distance and the point-to-plane error. Due to the use of an RGBD camera, these methods can only be used in indoor environments.
For outdoor low-texture environments, using only point features may not be sufficient for a VO algorithm; a feasible solution is to use extra features, such as line segments, as a complement. Therefore, the goal of this paper is to use line segment information effectively and to propose a corresponding VO algorithm. We utilize line information from the perspective of combining line segments and define the concept of virtual intersection matching points based on line segments. By exploring the properties of the virtual intersection matching points, we propose a virtual-real hybrid map based monocular VO algorithm that combines points and line segments. Our main contributions are as follows:
• First, we reprocess line segment features by combining line segments and introduce the concept of virtual intersection matching points.
• Second, we propose the concept of a virtual map built from the line segment-based virtual intersection matching points and demonstrate that the virtual map can play the same role as the real map constructed from ordinary point features in camera pose estimation.
• Third, we propose a monocular VO algorithm based on the virtual-real hybrid map, as shown in Figure 1b, which is built on virtual intersection matching points, endpoints of line segments, and ordinary point features.
The remainder of the paper is organized as follows: We describe the proposed concepts of virtual intersection matching points and virtual maps in Section 2. Section 3 describes the proposed virtual-real integrated VO algorithm. In Section 4, we describe the experimental results. In Section 5, we discuss the proposed method. Finally, we conclude the paper in Section 6.

LSVI Matching Points
We reprocess line segments from the perspective of combination, in order to generate a new kind of matching points, which we call virtual intersection matching points. As they are based on line segments, we abbreviate them as LSVI matching points. As shown in Figure 1a, in contrast to ordinary point features, LSVI matching points can be distributed in textureless regions. We first give the definition of LSVI matching points. Then, we explain how the virtual map generated by the LSVI matching points participates in the estimation of camera poses. Next, we geometrically prove that a pair of LSVI matching points is a correspondence pair and that there is a corresponding 3D point in space. Finally, we describe the significance of introducing LSVI matching points.

Definition
As shown in Figure 2, for the current frame and a reference keyframe, any two pairs of nonparallel matched line segments produce a virtual point on each of the two images, namely the intersection of the extended lines of the matched segments. If both virtual points lie within the image range, then, after selection (including epipolar constraint verification and a limit on quantity), a pair of LSVI matching points may be generated. The epipolar constraint [28] is a relationship from multiview geometry that a pair of corresponding points in two images must satisfy.
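To make the intersection step concrete, the following sketch computes the virtual point of two extended line segments using homogeneous coordinates, where a 2D line through two points is their cross product. The function name and image-bound check are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def virtual_intersection(seg1, seg2, width, height):
    """Intersect the infinite extensions of two line segments.

    Each segment is ((x1, y1), (x2, y2)) in pixel coordinates. Returns the
    intersection (x, y) if it lies inside the image, else None.
    Hypothetical helper illustrating the LSVI construction.
    """
    # A 2D line through two points is their cross product in homogeneous form.
    l1 = np.cross([*seg1[0], 1.0], [*seg1[1], 1.0])
    l2 = np.cross([*seg2[0], 1.0], [*seg2[1], 1.0])
    x = np.cross(l1, l2)          # homogeneous intersection of the two lines
    if abs(x[2]) < 1e-9:          # (near-)parallel lines meet at infinity
        return None
    px, py = x[0] / x[2], x[1] / x[2]
    if 0 <= px < width and 0 <= py < height:
        return (px, py)
    return None
```

Applying the same construction to the correspondingly matched segments in the second image yields the candidate pair of LSVI matching points.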

Virtual Map
We explain the virtual map and discuss the role of the virtual map in camera pose estimation.
As shown in Figure 3, two unrelated lines L_a, L_b in space are projected onto the reference image through the camera, producing two image lines l_a1, l_b1. From multiview geometry [28], when a line on an image is back-projected into space, we obtain the plane determined by the camera center and that image line. On the plane obtained from the back-projection of line l_a1, there exists a special line: it also lies exactly on the plane obtained from the back-projection of line l_b1. In other words, it is the intersection line of the two planes, a virtual line that does not exist in real space. Similarly, the two space lines L_a, L_b can be used to obtain another virtual line through projection onto the current image and back-projection into space. The two virtual lines intersect in space at the virtual 3D point X. The whole process can be summarized as follows: two unrelated lines in space produce a virtual 3D point through the participation of the camera. The projections of the virtual 3D point onto the two images are a pair of LSVI matching points. A large number of virtual 3D points constitute a virtual map. The structure of a virtual map may be quite different from that of the real world, because its points are regenerated from line information and do not exist in the real world.
The essence of estimating the robot's self-motion is to continuously estimate the epipolar geometry (the internal projective relationship between two views) between adjacent images. According to multiview geometry [28], the epipolar geometry is irrelevant to the structure of the scene; it is determined by the motion of the camera itself. We merely use the scene structure to estimate the camera trajectory, whether the scene is a mountain or a city street. From this perspective, we can equally estimate the camera trajectory with the help of the virtual map as the scene structure.
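The back-projection argument can be verified numerically. In the sketch below (synthetic cameras and a hypothetical virtual point, chosen for illustration), each image line l is back-projected through camera P to the space plane P^T l; the virtual 3D point is then recovered as the common null vector of the four stacked planes:

```python
import numpy as np

def backprojection_plane(P, l):
    """Back-project image line l (3-vector) through camera P (3x4) to the
    space plane pi = P^T l, containing all X with l . (P X) = 0."""
    return P.T @ l

def line_through(x, d):
    """An image line through homogeneous point x and the shifted point x + (d, 0)."""
    return np.cross(x, x + np.array([*d, 0.0]))

# Synthetic two-view setup (hypothetical numbers, for illustration only).
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])               # reference camera
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0], [0]])])   # translated camera
X = np.array([0.5, -0.3, 4.0, 1.0])                         # virtual 3D point

x1, x2 = P1 @ X, P2 @ X
# Two unrelated lines through each projection play the role of the
# (extended) matched line segments.
planes = np.stack([
    backprojection_plane(P1, line_through(x1, (1.0, 0.2))),
    backprojection_plane(P1, line_through(x1, (0.1, 1.0))),
    backprojection_plane(P2, line_through(x2, (1.0, -0.4))),
    backprojection_plane(P2, line_through(x2, (-0.3, 1.0))),
])
# The four planes meet in one point: the null vector of the stacked system.
_, _, Vt = np.linalg.svd(planes)
X_rec = Vt[-1] / Vt[-1][3]
```

Each pair of planes from one camera intersects in a virtual back-projected line, and the two virtual lines meet at X, so the recovered null vector coincides with the virtual 3D point.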

Demonstration
We give a short proof that, given two pairs of matched line segments on two images, after epipolar constraint verification, a correspondence pair and a corresponding 3D point in space can be determined.
Obviously, the intersection points of the two pairs of matched lines can be regarded as a correspondence pair, as the matching of line segments is unique and two (nonparallel) lines on a plane must meet at one point. We mainly prove that there exists a 3D point corresponding to them. We express the problem in mathematical form. First, each line segment is extended into a straight line on the image plane; the back-projection of a line on an image into space is a plane. In this way, the problem is transformed into whether the four planes, determined by the four matching lines in the two images, intersect in space, as shown in Figure 4. If the four planes intersect, a unique 3D point in space is determined. It should be pointed out that tensor notation is used in this proof, with conventions consistent with [28]. We assume that a line on the image is represented by a covariant vector l_i, and the condition that an image point x lies on this line is l_i x^i = 0. In addition, the projection of a 3D point X^j onto the image satisfies x^i = a^i_j X^j, where a^i_j represents the camera projection matrix. Therefore, the condition that a 3D point X^j projects onto the image line l_i is

l_i a^i_j X^j = 0.    (1)

Therefore, l_i a^i_j represents the plane consisting of all spatial points that project onto the image line l_i. To simplify the description, the four lines are expressed as l_p, l_q, l_r, and l_s, where each line is identified by its index letter. The corresponding four planes can then be represented as l_p a^p, l_q a^q, l_r b^r, and l_s b^s, where p, q, r, and s denote the row indices, and the ith rows of the two projection matrices are denoted as a^i (i = p, q) and b^i (i = r, s). In general, four planes in space do not intersect at a single point; the necessary and sufficient condition for them to intersect is that the determinant of the corresponding linear system is zero.
The corresponding homogeneous system is

[ l_p a^p ]
[ l_q a^q ] X = 0.    (2)
[ l_r b^r ]
[ l_s b^s ]

As the determinant is linear with respect to each row, the determinant A of the matrix on the left of Equation (2) can be expressed as

A = det[ l_p a^p ; l_q a^q ; l_r b^r ; l_s b^s ] = l_p l_q l_r l_s det[ a^p ; a^q ; b^r ; b^s ].    (3)

Then, we use B to denote the result of multiplying the determinant A by the nonzero scalars (ε^{ipq} ε_{ipq})(ε^{jrs} ε_{jrs}). Recombining the contractions (with fresh dummy indices u, v, m, n), B can be written as

B = (l_p l_q ε^{ipq}) (l_r l_s ε^{jrs}) ε_{iuv} ε_{jmn} det[ a^u ; a^v ; b^m ; b^n ].    (4)

The expression (l_p l_q ε^{ipq}) is the cross product of the two lines l_p and l_q, i.e., their intersection point x^i [28]; in the same way, the expression (l_r l_s ε^{jrs}) is equal to the intersection point x'^j in the second image. Therefore, Equation (4) can be expressed as

B = x^i x'^j ε_{iuv} ε_{jmn} det[ a^u ; a^v ; b^m ; b^n ].    (5)

By means of a result about the fundamental matrix presented in [28], its entries F_{ji} can be represented as

F_{ji} = ε_{iuv} ε_{jmn} det[ a^u ; a^v ; b^m ; b^n ].    (6)

When Equation (6) is substituted into Equation (5), we obtain

B = x^i x'^j F_{ji}.    (7)

As the pair (x^i, x'^j) has passed the epipolar constraint verification, it satisfies x^i x'^j F_{ji} = 0, so B = 0 and, from the above derivation,

A = B / [(ε^{ipq} ε_{ipq})(ε^{jrs} ε_{jrs})] = 0.    (8)

As the determinant A is zero, the four planes determined by the four matching lines on the two images intersect in space. Therefore, given two pairs of matched line segments on two images, a 3D point can be determined in space. For the details of the proof and the tensor notation, refer to [28].
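The vanishing determinant can be checked numerically. In the sketch below (synthetic projection matrices, a test point, and arbitrary line directions, all chosen for illustration), the 4 × 4 determinant of the stacked back-projected planes is zero for a true correspondence and nonzero once the point in the second image is moved off the epipolar line:

```python
import numpy as np

# Synthetic two-view setup (hypothetical numbers, for illustration only).
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])               # camera a
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0], [0]])])   # camera b
X = np.array([0.2, 0.7, 3.0, 1.0])                          # 3D point
x1, x2 = P1 @ X, P2 @ X                                     # true correspondence

def line_through(x, d):
    """Image line through homogeneous point x and the shifted point x + (d, 0)."""
    return np.cross(x, x + np.array([*d, 0.0]))

def plane_rows(xa, xb):
    """Stack the four back-projected planes of two lines through xa
    (in camera a) and two lines through xb (in camera b)."""
    return np.stack([
        P1.T @ line_through(xa, (1.0, 0.3)),
        P1.T @ line_through(xa, (0.2, 1.0)),
        P2.T @ line_through(xb, (1.0, -0.5)),
        P2.T @ line_through(xb, (0.4, 1.0)),
    ])

A = np.linalg.det(plane_rows(x1, x2))       # vanishes for a true correspondence
x2_bad = x2 + np.array([0.0, 0.5, 0.0])     # shifted off the epipolar line
A_bad = np.linalg.det(plane_rows(x1, x2_bad))
```

For this pure x-translation geometry, shifting the second point along y violates the epipolar constraint, and the determinant no longer vanishes.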

Significance of Introducing LSVI Matching Points
First, LSVI matching points are not affected by inconsistent line segment lengths. In the process of line segment matching, two line segments with different lengths may be matched, due to occlusion or incomplete detection of a line segment. As shown in Figure 5, although the line segments in the orange box have the correct direction, the endpoint matching of these segments is wrong because of incomplete detection. However, the LSVI matching points are not affected by inconsistent segment lengths, as their construction is determined by the position and direction of the lines, not by the lengths of the segments.
Second, the generation of LSVI matching points essentially reprocesses the line segment information, which significantly increases the number of features and makes the distribution of feature points more uniform. Figure 6a shows the matched line segments in two images, and the LSVI matching points generated from them are shown in Figure 6b. As shown by the yellow rectangles in Figure 6b, LSVI matching points can be distributed in places where there would not have been ordinary feature points, such as on lakes and asphalt pavement. Perspective-n-Point (PnP) [29] is a key step in vSLAM and VO algorithms, used to solve the initial camera pose of the current frame. The accuracy of PnP depends on the number of observations and their distribution in the image [29]. A large number of 2D-3D correspondences provides redundancy for estimation, while a uniform distribution of points avoids ill-conditioned configurations [30]. For this reason, ref. [30] proposed a scoring strategy that selects images with more, and more uniformly distributed, observations to participate in reconstruction.
In ORB-SLAM2, in order to make the extracted features more dispersed, the image is divided into a grid of root cells, in which features are extracted. Then, using a quadtree, each node is split into four children at each step; once a node contains only one keypoint, or the number of keypoints meets the requirement, the splitting stops. Finally, the point with the maximum Harris response in each node is retained. Compared with using only the endpoints of line segments, the use of LSVI matching points increases the number of features and makes their distribution more uniform, which is beneficial for the PnP algorithm, such that a more accurate initial camera pose estimate can be obtained.
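The idea behind such distribution strategies can be sketched with a simple grid-based selection, an illustrative stand-in for ORB-SLAM2's quadtree step (function name, grid size, and per-cell budget are assumptions, not the paper's values):

```python
import numpy as np

def select_uniform(points, responses, width, height, grid=8, per_cell=1):
    """Keep at most `per_cell` highest-response points per grid cell,
    so that retained features are spread across the image.
    Returns the sorted indices of the kept points."""
    cells = {}
    for i, (x, y) in enumerate(points):
        key = (int(x * grid / width), int(y * grid / height))  # cell coordinates
        cells.setdefault(key, []).append(i)
    keep = []
    for idx in cells.values():
        idx.sort(key=lambda i: responses[i], reverse=True)  # best response first
        keep.extend(idx[:per_cell])
    return sorted(keep)
```

A clustered set of points thus contributes only its strongest member per cell, while isolated points in otherwise empty cells are always kept.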
Third, in vSLAM or VO algorithms, in order to obtain a more accurate camera pose estimate, local optimization through bundle adjustment (BA) [31] is necessary. Obviously, with the use of the virtual map, the amount of data is increased, which can lead to better optimization results.

Virtual-Real Integrated VO Algorithm
The proposed virtual-real hybrid map based monocular VO algorithm is illustrated in Figure 7. When processing a new frame, the point and line segment features are first extracted and matched. According to the matched line segments, the LSVI matching points are constructed. Then, based on the processed data, the initial motion estimation of the current frame pose is performed and the decision of whether to insert a keyframe is made. Once a keyframe is inserted, the local mapping thread creates the hybrid map and performs local optimization to obtain a better estimate. The VO algorithm follows the classic keyframe-based vSLAM framework, with two threads, tracking and local mapping; however, it does not include a loop closing thread. The key parts of the algorithm, including the construction of the LSVI matching points and the management of frames and maps, are described in detail as follows.

LSVI Matching Points Construction
The construction of the LSVI matching points is based on the processing of line segments. The main steps are shown in Figure 8. To extract line segments, the Line Segment Detector (LSD) [32] is employed, which is fast and provides high precision. Before feature matching, we use the Line Band Descriptor (LBD) method [33] as the binary descriptor for the line segments. Then, we match the line segment features detected in the two images. After line segment matching, geometric filtering is required to remove mismatches. As a line segment has two endpoints, we perform double geometric filtering: for the matched line segments, the two endpoints of each segment are stored in two independent endpoint sets, and geometric validation is performed by estimating a fundamental matrix for each of the two endpoint sets. For each pair of matched line segments, if both endpoints pass their respective geometric verification, the pair is retained. The construction of the LSVI matching points includes two steps: initial construction and subsequent selection. In the initial construction, line segments in the reference keyframe are divided into N_g groups (e.g., 36 groups) according to their directions. Because segments are assigned by direction, some groups may contain multiple line segments, while others may contain none. We randomly take two nonparallel line segments from different groups and extend them into straight lines according to the image coordinates of their endpoints. Lines are expressed in slope-intercept form, and an intersection point is calculated by jointly solving the equations of the two lines, as shown in Figure 2. If the intersection point lies within the image range, it is valid and is retained. In the same way, using the two correspondingly indexed line segments in the current frame, the coordinates of an intersection point are calculated. If it is also valid, a pair of initial LSVI matching points is obtained. The initial construction of the LSVI matching points is completed after traversing the line segments in all groups.
The selection step includes two parts: first, limiting the number of matching points; second, geometric verification using the fundamental matrix. Given N line segments, in the extreme case, at most N(N − 1)/2 LSVI matching points can be generated. In practice, the number cannot reach N(N − 1)/2, as line segments from the same group cannot produce LSVI matching points, and a pair of LSVI matching points is regarded as valid only if both points are located within the image areas of their respective frames. However, the number of LSVI matching points may still be large. A larger number of map points benefits the robustness of the algorithm, but constructing a large number of map points restricts its efficiency. Our purpose is to introduce virtual map points to enhance the stability of the algorithm without significantly reducing its efficiency by constructing too many map points. Therefore, to limit the number of LSVI matching points, we select from the initial sets. Inspired by ORB-SLAM2, we divide the reference image into M grid cells and allocate the LSVI matching points to the corresponding cells according to their coordinates. If the number of points in a cell is less than N_grid, all LSVI matching points in this cell are preserved. If there are more than N_grid points in a cell, the points are sorted according to the response values of the line segments constituting them, and at most 4 × N_grid points with larger response values are reserved. In the second part of the selection, a fundamental matrix is first estimated between the reference keyframe and the current frame using the endpoints of the matched line segments and the ordinary point features. Then, geometric verification is performed to select the LSVI matching points conforming to the epipolar constraint.
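The geometric verification step can be sketched as a point-to-epipolar-line distance test, a minimal illustration assuming a known fundamental matrix F mapping points of image 1 to lines of image 2 (the symmetric check and the estimation of F are omitted):

```python
import numpy as np

def epipolar_filter(pts1, pts2, F, thresh=1.0):
    """Keep index i when pts2[i] lies within `thresh` pixels of the
    epipolar line F @ x1 of pts1[i]. Illustrative helper, not the
    paper's implementation."""
    keep = []
    for i, (p1, p2) in enumerate(zip(pts1, pts2)):
        x1 = np.array([*p1, 1.0])
        x2 = np.array([*p2, 1.0])
        l = F @ x1                                # epipolar line in image 2
        d = abs(x2 @ l) / np.hypot(l[0], l[1])    # point-to-line distance
        if d < thresh:
            keep.append(i)
    return keep
```

For a pure horizontal translation, for example, epipolar lines are horizontal, so a candidate pair with a large vertical offset is rejected.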

Hybrid Map
Three types of points are used: LSVI matching points, endpoints of line segments, and ordinary point features. We use SiftGPU [34], a Graphics Processing Unit (GPU) implementation of the Scale-Invariant Feature Transform (SIFT), as the ordinary point feature. We use the Direct Linear Transformation (DLT) algorithm [28] for triangulation to build the virtual-real integrated hybrid map.
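A minimal sketch of DLT triangulation for one correspondence follows; the camera matrices and test point are synthetic, and this is the textbook linear method [28], not the paper's exact implementation:

```python
import numpy as np

def triangulate_dlt(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one correspondence x1 <-> x2 (pixel
    coordinates) observed by cameras P1, P2 (3x4). Each observation
    contributes two rows of the homogeneous system A X = 0; the 3D point
    is the null vector of A."""
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]   # dehomogenize
```

Both real map points (from feature correspondences) and virtual map points (from LSVI matching points) can be triangulated in the same way.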
In order to obtain an accurate estimate, local BA is used to optimize the poses in a local window of covisible keyframes and the points in the hybrid map. The optimization problem is

{X^j, R_i, t_i} = argmin Σ_{i∈κ} Σ_{j∈χ} ρ( || x_ij − π(R_i X^j + t_i) ||^2 ),    (9)

where κ represents the group of local keyframes; χ represents the local hybrid map; x_ij represents an observation, which is either an LSVI matching point, an endpoint of a line segment, or an ordinary feature point; X^j represents the corresponding hybrid 3D map point, which is either a real map point or a virtual map point; R_i and t_i are the rotation matrix and translation vector of the ith keyframe, respectively; ρ is the robust Huber cost function; and π is the projection function, defined as

π([X, Y, Z]^T) = ( f_x X / Z + c_x , f_y Y / Z + c_y )^T,    (10)

where f_x, f_y are the focal lengths and (c_x, c_y) is the principal point, all of which are known from calibration. In the optimization, the hybrid 3D map points X^j, the rotation matrices R_i, and the translation vectors t_i are the variables to be optimized. To solve the optimization defined in Equation (9), we use the Levenberg-Marquardt algorithm [35], implemented in g2o [36], as the solver.
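One residual term of this objective can be sketched as follows, a minimal illustration of the pinhole projection and the Huber-robustified reprojection cost (intrinsics and point values are hypothetical; the actual optimization is performed by g2o):

```python
import numpy as np

def project(X_cam, fx, fy, cx, cy):
    """Pinhole projection pi of a point already in camera coordinates."""
    X, Y, Z = X_cam
    return np.array([fx * X / Z + cx, fy * Y / Z + cy])

def huber(r2, delta=1.0):
    """Huber robust cost rho applied to a squared residual r2."""
    r = np.sqrt(r2)
    return r2 if r <= delta else 2.0 * delta * r - delta**2

def reprojection_cost(obs, X, R, t, fx, fy, cx, cy):
    """One term of the local BA objective: rho(||x_ij - pi(R X + t)||^2)."""
    r = obs - project(R @ X + t, fx, fy, cx, cy)
    return huber(r @ r)
```

Real and virtual map points contribute identical residual terms; only the origin of the 2D observation differs.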

Frame Management
Up to N_kf keyframes are kept in a window. Each new frame is initially tracked with respect to a reference keyframe. According to two criteria, the current frame is either discarded or used to create a new keyframe. First, a new keyframe needs to be created when the field of view has changed significantly. We measure the shift of the endpoints of the matched line segments by

dist = (1/n) Σ_{i=1}^{n} || x_i − x'_i ||,    (11)

where n is the number of endpoints of the matched line segments and x_i, x'_i are the positions of the ith corresponding endpoint in the reference keyframe and the current frame, respectively. If dist is more than 0.2 × h pixels (h is the height of the image), a new keyframe is created. Second, another key factor is the number of matched points: if the number of matched ordinary feature points or LSVI matching points is less than a threshold (e.g., 500), a new keyframe is created.

Results
To evaluate the performance of the proposed virtual-real integrated monocular VO algorithm, we conducted extensive experiments, both qualitative and quantitative. First, we performed a qualitative evaluation to verify the robustness of the proposed method in various low-texture environments, using aerial data collected by a DJI Phantom 4. Then, quantitative experiments were carried out to further evaluate the performance of the proposed method using image sequences of simulated environments. The evaluations were carried out on a PC with an Intel Core i7 CPU at 2.8 GHz, 8 GB of RAM, and an NVIDIA GeForce GTX 1060 GPU with 6 GB of VRAM.

Qualitative Evaluation
We conducted qualitative experiments to validate the robustness of the proposed algorithm in different real low-texture outdoor scenes. Although we could not obtain the ground truth of the UAV flights, since we manually controlled the drone to follow preset flight paths, we knew the approximate trajectory of the drone and its length. By analyzing the completeness of the estimated trajectories, the robustness of the different algorithms can be compared. We compared our method against DSO [10] and ORB-SLAM3 [17], which are representative of direct and feature-based methods, respectively. We used a DJI Phantom 4 to capture aerial videos inside and outside the campus. Before the experiments, we calibrated the camera intrinsic parameters using the Matlab calibration tool Toolbox_calib, and distortion correction was then performed according to the distortion parameters calibrated in advance. The scenes included a stone corridor, a campus garden, a campus square, and two suburban roads in different seasons. In the first column of Figure 9, we show a panorama of each scene. A representative aerial video frame is shown at the bottom right of each panorama, and the approximate location where the image was captured is marked on the panorama. Due to the influence of manual control and wind, there was persistent vibration during the flights. In addition, restricted by the sight distance, there was some deviation when flying far away.
The first row of Figure 9 displays a flight over a stair-stepping stone corridor, which was 56 m long and descended 4 m. The UAV moved 6 m above the highest flat ground and captured a video with a resolution of 1280 × 720. The main content of the images was the stone ladder, and a few plants were captured when flying to the bottom of it. The comparison results are shown on the right of the first row of Figure 9. Our method estimated the complete trajectory. In contrast, DSO could not estimate the trajectory, because direct methods such as DSO require harsh conditions: small motion and constant illumination. The actual data did not meet these conditions: on the one hand, this data set exhibited significant changes in illumination; on the other hand, outdoor drones cannot fly as slowly as robots move indoors, which poses serious challenges to direct methods. Due to insufficient features, ORB-SLAM3 could not perform successful initialization immediately. Initialization refers to the establishment of a reliable initial map from sufficient features, after which trajectory estimation starts. Due to the delay in initialization, the trajectory estimated by ORB-SLAM3 was incomplete (i.e., the first 12.5% of the sequence was missing). However, our method performed initialization successfully and rapidly on this data set, since the upper limit on the number of LSVI matching points is very large. In the initialization stage, we temporarily ignore the problem of efficiency and place no limit on the number of LSVI matching points; that is, as long as a pair of LSVI matching points is within the image range and conforms to the epipolar constraint, it is considered valid.
As shown in the second row of Figure 9, the drone flew over a campus garden along a 9-shaped path. The content of this scene was relatively rich: in addition to stone materials, there were some green plants. The drone flew 10 m above the ground and captured a video with a resolution of 1280 × 720. Our method estimated the complete trajectory, which looked very smooth, with small fluctuations and even intervals. Although DSO generated part of the trajectory, it was obviously wrong. Again, ORB-SLAM3 could not perform initialization immediately and, thus, an incomplete trajectory was obtained, which looked like a rectangle instead of a 9-shaped path (i.e., the first 23.4% of the sequence was missing).
From the third row of Figure 9, it can be seen that the drone flew over a campus square along a spiral square polyline path. The drone flew 10 m above the ground and captured a video with a resolution of 1280 × 720. The content of this scene was very simple, comprising only stone floor bricks with no patterns; thus, this scene was extremely challenging, and neither DSO nor ORB-SLAM3 could estimate the trajectory. In particular, ORB-SLAM3 failed because it could not perform initialization successfully. In contrast, our method obtained a complete trajectory.
As shown in the fourth row of Figure 9, the drone made a roundtrip flight along a suburban road outside the campus: it first moved east along the road, then turned back and moved west. During the round trip, parts of the tracks did not overlap, which was caused by manual control and wind. It should be noted that the actual track was longer than that shown in the panorama. The UAV captured a video with a resolution of 1920 × 1080. Although the main content of the images was asphalt pavement, there was continuous vegetation along both sides of the road, and the feature distribution on the vegetation was much denser than on the asphalt. In addition, vehicles sometimes passed on the road, making it a dynamic environment, which further increased the difficulty. ORB-SLAM3 only estimated a fraction of the trajectory, for a section of the road with relatively rich feature points; for the low-texture sections, it could not estimate the trajectory, due to initialization delay and lost tracking. Lost tracking refers to the failure of camera pose estimation caused by insufficient feature points in some frames, which prevents the VO algorithm from continuing. For these two reasons, in total, 55% of the sequence was missing for ORB-SLAM3. DSO could not estimate the trajectory at all. In contrast, our method estimated a complete trajectory.
The fifth row of Figure 9 displays another roundtrip flight along a suburban road outside the campus, in winter. It should be noted that the actual flight track was longer than the road section shown in the panorama. Compared with the previous group of experiments, the trees and other vegetation on both sides of the road had withered in winter, such that features on the vegetation became sparser. In addition, the drone flew over a large artificial construction area, including a large parking lot. Thus, this scene was even more challenging. Due to lost tracking and initialization delay, ORB-SLAM3 only estimated a part of the trajectory, missing 93% of the sequence. DSO still could not estimate a trajectory. In contrast, our method estimated a complete trajectory.
From the qualitative evaluation, we can see that the proposed method has good robustness in different real outdoor low-texture scenes.

Quantitative Evaluation
We also conducted quantitative evaluations. Due to a lack of ground truth in the aerial videos collected by the DJI Phantom 4, we used Airsim [37], combined with Unreal Engine, to obtain aerial images with ground truth in various simulation scenes. Airsim is a simulator for drones or vehicles built on Unreal Engine. With this tool, we can simulate the flight of a drone and collect data, including RGB images and GPS values, from Unreal Engine. For the experiment, we adopted various environments, including a modern city, a remote industrial city, and a castle town, as shown in Figure 10.
In the simulation experiments, there were four image sequences, each with a resolution of 640 × 480. For the castle street scene, two image sequences were collected: downward-looking and forward-looking. For the modern city and remote industrial city scenes, downward-looking sequences were collected. To make the experiment more challenging, the selected scenes were man-made scenes built of stone materials and the drone flew at a low height; thus, most of the content in the downward-looking images was road surface, as shown in Figure 10. To generate the plots for quantitative evaluation, we used the evo package [38]. As the real scale is unknown in monocular VO, we adopted the alignment mode of the evo tool to scale and align the trajectories of ORB-SLAM3, STVO-PL, and the proposed method with the ground-truth trajectories. STVO-PL is a representative VO method combining point and line features. As DSO could not estimate camera trajectories for these image sequences, it could not be plotted for comparison. Figure 11a compares the computed trajectories against the ground truth. For the IndustrialCity-downward sequence, both STVO-PL and ORB-SLAM3 estimated only part of the trajectory before tracking was lost. Tracking was lost in a low-texture road section, where the road was narrow and the roofs on both sides were smooth, and only a few building facades were captured while the drone crossed this section. As a point and line combination method, STVO-PL estimated a slightly longer trajectory than ORB-SLAM3. In addition, the trajectories obtained by STVO-PL and ORB-SLAM3 showed frequent, severe fluctuations; in fact, the drone moved very smoothly in the simulation environment, so there were no fluctuations in the true trajectory. In contrast, our method performed robustly and obtained a complete and smooth trajectory.
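The scale-and-align step performed by the evo tool corresponds to a closed-form similarity (Umeyama) alignment of the estimated trajectory to the ground truth before the error is computed. A minimal NumPy sketch of this computation is given below; the function names are ours for illustration, not evo's actual API:

```python
import numpy as np

def align_umeyama(est, gt):
    """Similarity (scale + rotation + translation) alignment of an
    estimated trajectory (N x 3) to ground truth, as in Umeyama's
    closed-form solution used by trajectory-evaluation tools."""
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    E, G = est - mu_e, gt - mu_g                 # centered point sets
    U, D, Vt = np.linalg.svd(G.T @ E / len(est)) # SVD of the covariance
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1                             # avoid a reflection
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / E.var(axis=0).sum()
    t = mu_g - s * R @ mu_e
    return s, R, t

def ate_rmse(est, gt):
    """RMSE of the absolute trajectory error after alignment."""
    s, R, t = align_umeyama(est, gt)
    aligned = (s * (R @ est.T)).T + t
    return np.sqrt(np.mean(np.sum((aligned - gt) ** 2, axis=1)))
```

Because monocular VO only recovers the trajectory up to an unknown scale, the scale factor s must be estimated here; for stereo or RGB-D methods a rigid (SE(3)) alignment with s fixed to 1 would be used instead.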
In addition, Figure 11b compares the trajectories along the x, y, and z axes. As shown in Figure 11b, the trajectory obtained by the proposed method was closest to the ground truth on all three axes, demonstrating its good performance. Table 1 lists the absolute trajectory RMSE of the different methods. The experimental results indicate that our method obtained better performance on most sequences, except for the CastleStreet-f sequence, which was composed of forward-looking images. The images in the CastleStreet-f sequence showed the horizontal view captured by the UAV flying along the street, where the ordinary point features distributed on the building facades were relatively rich; therefore, ORB-SLAM3 could also work well on this sequence.

Discussion
We now discuss the combination of ordinary point features and LSVI matching points. Our main reason for retaining ordinary point features is to ensure the stability of the algorithm in environments with few lines: in such scenes, even though line segment features are scarce, the algorithm can still run stably on the rich point features. Conversely, for low-texture environments with abundant lines, it is feasible to use only LSVI matching points in the VO algorithm. We conducted experiments on the outdoor simulation data sets to verify the performance of the VO algorithm when using only LSVI matching points. Figure 12 shows an example on the ModernCity-downward sequence: the algorithm with only LSVI matching points obtained performance similar to the algorithm combining ordinary point features and LSVI matching points. The hybrid-feature variant performed better, as its features were more abundant and more uniformly distributed. Since our algorithm uses both kinds of features, its applicability is not limited to line-rich scenes, while it retains strong robustness in low-texture environments. Table 2 lists the average timing results on the simulation data set. The average processing time per frame was 56.2 ms for an image with a resolution of 640 × 480; for comparison, ORB-SLAM3 and STVO-PL took 33.1 ms and 47.8 ms per frame, respectively. The algorithm contains two parallel threads: tracking and local mapping. Tracking is executed for every frame, whereas the local mapping thread does not run all the time; when a new keyframe is added, it completes the triangulation and bundle adjustment (BA) operations. Although one round of mapping takes 65.1 ms, the average processing time of the total system remains 56.2 ms.
Since we use two kinds of features, line segments and ordinary points, the two can be computed synchronously and independently for efficiency. However, as line segment extraction and matching (39.1 ms and 2.84 ms, respectively) take longer than point feature extraction and matching (25.0 ms and 1.00 ms), the total time for feature extraction and matching is essentially that of the line segment features. Line segment processing is thus a time-consuming operation: although introducing line segment features improves the stability of the system in low-texture environments, it reduces the efficiency of the system, which is a limitation of this paper. Note that our method is independent of the specific line feature detection and matching algorithms, so improving their efficiency is beyond the scope of this paper.
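The synchronous computation described above can be organized as two concurrent workers per frame, so the feature stage costs roughly the maximum of the two detector times rather than their sum. A minimal sketch follows; the detector callables are hypothetical placeholders, not the detectors actually used in our system:

```python
from concurrent.futures import ThreadPoolExecutor

def extract_features(frame, detect_points, detect_lines):
    """Run point and line segment detection on one frame concurrently.

    detect_points / detect_lines are placeholder callables standing in
    for the real point and line segment detectors. With two workers,
    the wall-clock cost of this stage is roughly the slower of the two
    detectors (here, line segment processing), not their sum.
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        points_future = pool.submit(detect_points, frame)
        lines_future = pool.submit(detect_lines, frame)
        return points_future.result(), lines_future.result()
```

In practice this speedup assumes the detectors can actually run in parallel; native detectors (e.g., those in OpenCV) typically release the Python GIL, whereas pure-Python detectors would need process-based parallelism instead.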

Conclusions
Different from traditional methods, which perform visual odometry using only a real map generated from ordinary point features, in this paper we defined the concept of LSVI matching points and demonstrated that a virtual map constructed from LSVI matching points can be used to effectively estimate the camera pose. Based on this idea, we proposed a novel monocular visual odometry algorithm based on a virtual-real hybrid map. The camera pose estimation is solved using the hybrid 3D map, composed of the virtual map generated from LSVI matching points and the real map generated from ordinary point features. The proposed method improves the stability of VO in low-texture environments. Extensive experiments were carried out to verify the effectiveness and robustness of the algorithm. For future work, we plan to add a loop closure detection thread to the current VO algorithm, in order to realize a complete virtual-real integrated SLAM algorithm and, thus, eliminate scale drift.
Author Contributions: Conceptualization, X.X. and T.Y.; methodology, X.X.; data curation, Y.N.; writing-original draft preparation, X.X.; writing-review and editing, F.Z.; supervision, Y.Z. All authors have read and agreed to the published version of the manuscript.