MLSS-VO: A Multi-Level Scale Stabilizer with Self-Supervised Features for Monocular Visual Odometry in Target Tracking

In this study, a multi-level scale stabilizer for visual odometry (MLSS-VO), combined with a self-supervised feature matching method, is proposed to address the scale uncertainty and scale drift encountered in monocular visual odometry. First, the architecture of an instance-level recognition model is adopted to build a feature matching model based on a Siamese neural network. Combined with the traditional approach to feature point extraction, feature baselines on different levels are extracted and then treated as references for estimating the motion scale of the camera. On this basis, the size of the target in the tracking task is taken as the top-level feature baseline, while the motion matrix parameters obtained by the original feature-point-based visual odometry are used to solve the real motion scale of the current frame. The multi-level feature baselines are solved to update the motion scale while reducing scale drift. Finally, the spatial target localization algorithm and the MLSS-VO are combined into a framework for target tracking on a mobile platform. According to the experimental results, the root mean square error (RMSE) of localization is less than 3.87 cm, and the RMSE of target tracking is less than 4.97 cm, which demonstrates that the MLSS-VO method is effective in resolving scale uncertainty and restricting scale drift in target tracking scenes, thereby supporting the spatial positioning and tracking of the target.


Introduction
Visual odometry (VO), as the core of solving the autonomous positioning problem of robots, has been of great interest to researchers in the vision field. Although binocular vision has advantages in accuracy, monocular systems still hold appeal for automobiles [1], UAVs [2], and other industries, owing to the steady decline in the cost of consumer-level monocular cameras in recent years and the lower calibration workload. The challenges of monocular visual odometry are therefore both fundamental and practical. At present, typical monocular visual odometry methods include ORB-SLAM3 [3] based on the feature point method, DSO [4] based on a direct method, SVO [5] based on a semi-direct method, and VINS-Mono [6] combined with inertial navigation equipment. Meanwhile, with the development of neural networks, researchers have carried out many explorations of visual odometry and SLAM methods based on deep learning, such as SLAM combining SVO and a CNN [7], SLAM with semantic segmentation [8], localization from planar target features [9], and unsupervised features for visual odometry [10,11].
At the same time, scale drift and scale uncertainty have been another focus and difficulty in monocular vision research. Researchers have done much work on reducing the scale drift of monocular visual SLAM, such as a monocular SLAM scale correction method based on Bayesian estimation [12], a low-drift SLAM scheme estimated from geometric information and surface normals [13], and VIO methods that exploit the characteristics of inertial navigation devices [6,14,15]. Among them, the VINS method effectively solves the monocular scale uncertainty problem by incorporating inertial navigation devices, and greatly reduces the error accumulation caused by drift in a monocular system. However, schemes that rely on inertial navigation devices have three main disadvantages: (1) the sensor equipment is more expensive; (2) the calibration of the equipment, i.e., the synchronization mechanism between inertial data and visual data, is more complex; and (3) the data fusion brings higher complexity to both the front-end visual odometry and the back-end optimization algorithms.
Based on the above summary, and exploiting the fact that the target in a tracking problem often carries size reference information, we designed a multi-level scale stabilizer that uses feature baselines at different levels to solve the real scale of the camera. Using the size of the target as the top-level feature baseline, the baseline information is transmitted to the features of each level, and the real proportion of the displacement T in the motion of the monocular camera is then solved, thereby resolving the scale uncertainty and reducing drift. We note that feature matching in traditional visual odometry often directly uses feature points or pixel gradient information such as oriented FAST and rotated BRIEF (ORB) [16], and then uses the epipolar constraint [17] and a random sample consensus algorithm [18] to obtain reliable motion matching. At the same time, due to the scale equivalence of the essential matrix E in the epipolar constraint, the resulting displacement T also faces the scale uncertainty problem. To solve this problem, we propose a multi-level abstract feature mechanism for the target tracking problem. The top-level feature is the target in the tracking problem; in the second level, we obtain the required feature matching region set from the self-supervised feature matching model based on the Siamese neural network; and the third level is the feature point set of the traditional ORB class. Finally, spatial scale transfer is carried out using the prior information of the top-level target size, so as to obtain the baseline size of the feature area and solve the real motion scale T.
This research aims to solve the problem of failed navigation caused by the insufficient autonomous positioning and spatial depth estimation ability of mobile robots (autonomous vehicles, UAVs, etc.) carrying monocular vision sensors when performing target tracking and obstacle avoidance tasks in an unfamiliar environment. The main cause of the problem is the spatial depth estimation error introduced by the uncertain scale of traditional monocular visual odometry, together with the accumulation of positioning error resulting from scale drift. We combined a target tracking model and monocular visual odometry to obtain better depth estimation and reduce error accumulation, studying how the size information of the target can be used to solve and transfer the scale so as to reduce the drift of the monocular vision system in indoor target tracking. A multi-level scale stabilizer was defined using a self-supervised learning model. Based on the scale information of the target, the feature baseline extracted from the feature region obtained through self-supervised learning was treated as the basis of scale information. With the spatial positioning of the target achieved, the scale information was transmitted to the original visual odometry, which effectively reduced scale drift. In this method, the extraction and transmission of feature baselines are classified into three levels, with clear transmission relations and confidence weights defined between the levels. The main contributions of this paper are as follows:

1.
A multi-level scale stabilizer (MLSS-VO) based on monocular VO is proposed. The size of the target in the tracking task is taken as the prior information of the top level (Level 1), the feature region extracted by the self-supervised network is treated as the second level (Level 2), and the feature point set obtained by the traditional method is regarded as the third level (Level 3). In particular, priority is given to the feature points in the self-supervised matching region, which provide more reliable matching constraints. The feature points on the third level are then used to construct the original feature-point-based VO, obtaining the attitude and motion of the camera with scale error. On this basis, the size information of the top level and the feature baseline information of the second and third levels are combined to solve the real trajectory of camera motion;

2.
Based on the deep local descriptors intended for an instance-level recognition network model, a Siamese neural network model [19] suitable for matching features in a motion video stream is proposed, and a baseline acquisition mechanism in the feature region is designed;

3.
Through the combination of MLSS-VO and a target space positioning algorithm, an algorithm framework is designed for autonomous positioning and target tracking based on a monocular mobile platform. In multiple sets of indoor target tracking experiments, a motion capture system is adopted to verify that the root mean squared error (RMSE) of the algorithm is less than 3.87 cm in the indoor test environment, and that the RMSE of moving target tracking is less than 4.97 cm, which indicates the effectiveness of the algorithm in indoor autonomous positioning and target tracking.
The rest of this paper is organized as follows. Section 2 introduces the self-supervised learning model and the feature baseline extraction method. Section 3 elaborates on the process and framework of the MLSS-VO algorithm based on the multi-level scale stabilizer. Section 4 details how multi-level feature baselines are used to solve the scale, and proposes a target tracking algorithm framework for a monocular motion platform on the basis of the MLSS-VO. Section 5 presents the performance analysis of the MLSS-VO algorithm and the verification of the indoor tracking experiments. Section 6 concludes the paper.

Multi-Level Feature Extraction
The target recognition and feature baseline extraction in Level 1 follow the authors' previous work [20]. The ORB extraction in Level 3 has been covered by a large number of related studies. Therefore, the feature extraction methods for Levels 1 and 3 are not described in detail in this paper; the emphasis is on the extraction of the self-supervised feature matching area in Level 2.

Self-Supervised Feature Region Learning
We studied the feature matching model of deep local descriptors for instance-level recognition [19,21], and improved its training model to reduce the cost of self-supervised feature learning. The original model focuses on the description of local features. In this paper, a Siamese network is used to extract feature descriptors, shifting the focus from describing individual features to measuring the gap between features. This has higher universality and is better suited to visual applications such as VO. During training, the Siamese network extracts the W × H × d dimensional feature descriptors of the sample images for correlation calculation; if the similarity of the two input images is higher than 80%, the output is 1, otherwise 0. In practical applications, only a small number of reference images of similar environments are needed for training. Training samples are constructed by shifting and scaling the same image to form positive samples, and by pairing unrelated images to form negative samples.
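The sample construction described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's code: the function names, the circular shift as a stand-in for translation, and the nearest-neighbor rescale are our own simplifications.

```python
import numpy as np

def shift_scale(img, dy, dx, s):
    """Crude shift-and-rescale augmentation: circular shift by (dy, dx) pixels,
    then nearest-neighbor rescale by factor s (stand-ins for the shifting and
    scaling of the same image described in the text)."""
    out = np.roll(np.roll(img, dy, axis=0), dx, axis=1)
    h, w = out.shape[:2]
    ys = np.clip((np.arange(h) / s).astype(int), 0, h - 1)
    xs = np.clip((np.arange(w) / s).astype(int), 0, w - 1)
    return out[np.ix_(ys, xs)]

def make_pair(img, other, positive, rng):
    """Return (image A, image B, label): label 1 pairs the image with a
    shifted/rescaled copy of itself, label 0 pairs it with an unrelated image."""
    if positive:
        dy, dx = rng.integers(-10, 11, size=2)
        s = rng.uniform(0.9, 1.1)
        return img, shift_scale(img, dy, dx, s), 1
    return img, other, 0
```

In this scheme no manual labeling is needed, which is what keeps the self-supervised training cost low.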
The structure of the training model is shown in Figure 1. Its components are introduced as follows:

1.
Fully convolutional network (FCN): f(·) stands for a deep FCN, defined as the function f(·): R^(w×h×3) → R^(W×H×D), which maps an input image I to a 3D tensor of activations f(I) that is used as an extractor of feature descriptors;

2.
Feature strength and attention: defining the tensor of feature descriptors as u = f(I), the feature strength is estimated as the L2 norm of each feature descriptor u by the attention function:

w(u) = ||u|| (1)

The feature strength is used to weigh the contribution of each feature descriptor in matching, reducing the descriptor from D dimensions to 1. Since the influence of weak features during training is limited, w(u) can be used to select the strongest features as the preferred matches during testing.

3.
Local smoothing: we spatially smooth the activations by average pooling in a 3 × 3 neighborhood. The smoothing result is denoted by U = h(f(I)). Its main function is to make the features more dispersed and smoother;

4.
Descriptor whitening: local descriptor dimensions are de-correlated by a linear whitening transformation. Using the PCA dimension reduction method [22], a function o(·), implemented by 1 × 1 convolution with bias, reduces the dimension of the feature tensor.

Training based on a Siamese network: let U = f(I) and V = f(J) be the sets of dense feature descriptors in images I and J. Image similarity is computed from the aggregated descriptors, with the normalization factor γ(U) = 1/||∑_{u∈U} u||; in practice, each descriptor is weighted by w(u) = ||u||. When J is a positive sample, the target similarity is 1; for negative samples, it is 0. Accordingly, in the test phase, the feature matrix and the weight matrix are multiplied to obtain key-region feature matching, as shown in Figure 2, which depicts the structure of the training and test models. We use w(u) to select the N strongest feature matching regions, which provide reference regions for the extraction of the feature baseline below. Level 2 offers two advantages: (1) using the reliable feature matching areas from self-supervised learning as the scale transfer reference of Level 2 keeps the training cost low and the speed fast; (2) ORB feature points inside the Level 2 regions are preferred as the matching points of the original VO when solving the camera motion, exploiting the richer information and higher reliability of the self-supervised feature regions.
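The attention mechanism above can be condensed into a few lines. This is our own sketch under stated assumptions: the FCN output is simulated by a plain W × H × D tensor, w(u) = ||u|| is the feature strength, and the γ normalization from the text turns the aggregated-descriptor product into a cosine similarity.

```python
import numpy as np

def feature_strength(U):
    """w(u) = ||u||: L2 norm of each D-dim descriptor in a W x H x D tensor."""
    return np.linalg.norm(U, axis=-1)

def top_n_regions(U, n):
    """(row, col) positions of the n strongest local descriptors, used as the
    reference regions for feature baseline extraction."""
    w = feature_strength(U)
    flat = np.argsort(w, axis=None)[::-1][:n]
    return [tuple(map(int, np.unravel_index(i, w.shape))) for i in flat]

def image_similarity(U, V):
    """Similarity of two images via their aggregated descriptors, applying the
    normalization gamma(U) = 1 / ||sum_u u|| from the text to both sides."""
    su = U.reshape(-1, U.shape[-1]).sum(axis=0)
    sv = V.reshape(-1, V.shape[-1]).sum(axis=0)
    return float(su @ sv) / (np.linalg.norm(su) * np.linalg.norm(sv))
```

With this normalization, an image compared against itself scores 1, matching the positive-sample target used during training.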

Feature Baseline Extraction
For Level 1, the feature baseline is abstracted as a size feature of the target being tracked, such as an edge length or radius; the extraction method follows our previous work [20]. The following focuses on the feature baseline extraction methods in Levels 2 and 3. Figure 3 shows the extraction method of the feature baseline, as follows:

1.
Level 2 feature baseline: the dashed box shown in Figure 3a is the feature region learned by Level 2, and the dots represent the ORB feature points in the region. A point P_1 (red) is randomly selected as one end of the baseline, and the five points at the greatest 2D pixel distance from it within the region are selected as candidates (colored feature points). Then the point P_2 (yellow) whose descriptor has the greatest Hamming distance D_h from the descriptor of P_1 is selected to form the feature baseline P_1P_2. For robustness, the point P_3 (blue) with the next-greatest descriptor Hamming distance from P_1 can be selected to form a spare feature baseline P_1P_3;

2.
Level 3 feature baseline: as shown in Figure 3b, a point P_1 (red) is randomly selected as one end of the baseline. A ring area with inner radius R1 and outer radius R2 is selected as the candidate feature area, and the rest of the baseline extraction is the same as for Level 2. For R1 and R2, at a resolution of 1280 × 720, we recommend pixel radii of 10 to 30: if the distance is too small, the error of the solution is large; if the distance is too large, camera movement makes the lifetime of the baseline very short and the baseline is easily lost. It should be noted that the ORB feature points mentioned above are the feature points filtered by the RANSAC (random sample consensus) algorithm in VO, that is, static feature points in the environment. The static character of the feature baseline is a necessary condition for the feature baselines in Levels 2 and 3.
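The Level 3 selection rule can be sketched as follows. This is a minimal illustration of the rule as we read it, not the paper's implementation: `points` are assumed to be RANSAC-filtered static ORB keypoints and `descs` their 256-bit descriptors stored as 32-byte uint8 arrays.

```python
import numpy as np

def hamming(d1, d2):
    """Hamming distance between two binary ORB descriptors stored as uint8."""
    return int(np.unpackbits(np.bitwise_xor(d1, d2)).sum())

def extract_baseline(points, descs, rng, r1=10.0, r2=30.0, n_cand=5):
    """Pick a random anchor P1, restrict candidates to the ring r1 <= d <= r2
    around it, keep the n_cand farthest in pixel distance, and choose as P2 the
    candidate whose descriptor is farthest from P1's in Hamming distance.
    Returns (index of P1, index of P2), or None if the ring is empty."""
    i1 = int(rng.integers(len(points)))
    d = np.linalg.norm(points - points[i1], axis=1)
    ring = np.where((d >= r1) & (d <= r2))[0]
    if ring.size == 0:
        return None
    cand = ring[np.argsort(d[ring])[::-1][:n_cand]]
    i2 = int(max(cand, key=lambda j: hamming(descs[i1], descs[j])))
    return i1, i2
```

Maximizing descriptor distance between the two endpoints makes the baseline easy to re-identify in subsequent frames, which is what keeps its lifetime usable.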

Multi-Level Scale Stabilizer (MLSS)
In this section, we define the multi-level feature baseline in detail. While using the feature baselines to solve the scale T, we also complete the spatial positioning and tracking of moving targets. The core scale solution method is given in Section 4.2, and the solution of the equations by numerical iteration is supplemented in Section 4.6.

Multi-Level Features
As shown in Figure 4, we first define the feature baselines in the three-level scale stabilizer:

• Level 1: the moving target i to be tracked is taken as the top-level baseline, and its baseline λ_1i is known;
• Level 2: feature region j obtained by self-supervised learning, whose baseline λ_2j is obtained from the baseline λ_1i in Level 1 and the camera pose;
• Level 3: a large number of traditional ORB-class features k. The feature points in Level 3 are first used to solve the camera motion attitude with scale uncertainty, and a small number of paired feature points are selected to obtain the baseline λ_3k, which is computed from the baselines λ_1i and λ_2j of Levels 1 and 2 combined with the camera attitude.

Figure 4 shows how the multi-level feature baselines appear across four frames of a moving image sequence. Rectangles, circles, and sticks denote the features in Levels 1-3, and each feature is labeled with its level, feature number, and baseline number. Note that the Level 2 and Level 3 features in actual camera images are manual abstractions that exist in the images in large numbers; only a few are drawn as examples. Due to the motion of the camera and the targets, old features continuously leave the image while new features enter. Hence, green, blue, and gray represent, respectively, features already present in the previous frame, features newly added in the present frame, and invalid features that have left the image. Once the target disappears, MLSS continues to solve the multi-level feature baselines from the size information previously transferred from the target, continuously obtaining new reference baselines so as to update the scale in real time.

Framework and Data Pipeline
The algorithm framework is mainly divided into the following four parts:

1.
Multi-level target extractor: the Level 1 features are obtained by traditional image processing methods or from target prior information; the Level 2 features are obtained by the self-supervised learning model, from which a small number of high-confidence feature baselines between frames are selected as transfer objects; for Level 3, appropriate feature pairs are selected as baselines after extracting traditional features such as SIFT or ORB;

2.
Visual odometry: compared with traditional visual odometry, the visual odometry in this paper has the following improvements: (1) using the feature segmentation in the image extractor, the dynamic part of the image (Level 1 features) is masked to reduce the interference of dynamic features on the VO solution; (2) the target size information is used to solve the monocular scale uncertainty problem; (3) the MLSS-VO takes the scale updated in real time as the motion scale T of the current camera to reduce scale drift;

3.
Scale transmitter: the feature points in Level 3 are used to complete the traditional VO and obtain the attitude [R|T]; combining the projection of the Level 1 target onto the 2D plane and its prior size λ_1i, the feature baseline values of Levels 2 and 3 are transmitted and estimated;

4.
Multi-level scale updater: considering that Level 1 is the target of the tracking task and that Level 2 has already undergone motion matching in the initial VO to eliminate wrong solutions, we again use the RANSAC (random sample consensus) algorithm to eliminate mismatches among the Level 2 features. Finally, the scale value T is output and updated after scale weighting.

Figure 6 illustrates, at the data-flow level, how this work connects the visual odometry model and the target space tracking model. As can be seen from Figure 6, the target features and size information in Level 1 are used in the spatial positioning algorithm, and together with the pose information from the MLSS-VO they also help remove the scale uncertainty. The feature regions in Level 2 provide the Level 2 feature baselines and also help constrain the matching of feature points in Level 3. The feature points in Level 3 are mainly used in the traditional feature point method to solve the initial camera pose with ambiguous scale; at the same time, feature baselines in Level 3 can also help reduce scale drift.

Figure 7 shows the relationship between the three coordinate systems, where blue, green, and orange represent projections in the world coordinate system, camera coordinate system, and pixel coordinate system, respectively. The input of the depth estimation module is a known pinhole camera model and the pixel coordinates of the feature points, and the output is a set of depth estimation equations with the depth of each feature point as the unknown. In this paper, target features are first abstracted as geometric shapes such as parallelograms, as shown in Figure 8. The coordinates of the target in the world coordinate system are defined as P_wi = [X_wi, Y_wi, Z_wi]^T, i = 1, 2, 3, 4.
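The chain between the three coordinate systems of Figure 7 can be written compactly as below. This is a generic pinhole-camera sketch with an illustrative intrinsic matrix, not the paper's calibration values.

```python
import numpy as np

def world_to_pixel(P_w, R, T, CM):
    """World -> camera via the extrinsics [R|T], then camera -> pixel via the
    intrinsic matrix CM and perspective division.
    Returns the pixel coordinates (u, v) and the depth Z_c."""
    P_c = R @ P_w + T            # camera-frame coordinates
    uvw = CM @ P_c               # homogeneous pixel coordinates
    return uvw[:2] / uvw[2], float(P_c[2])
```

Depth estimation runs this chain in reverse: the pixel coordinates and CM are known, and the unknowns are the depths Z_c of the feature points.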
The coordinates in the camera coordinate system are defined analogously as P_ci = [X_ci, Y_ci, Z_ci]^T. Similarly, for a circular target, two projection properties are used: the projection of a circle is a circle or an ellipse, and the projection of its center is still the center of the projection. Using an image algorithm, the center of the projected figure can be obtained, and the segment from the center to the intersection of any ray with the figure corresponds to a radius R of the circle, as shown in Figure 9. By the symmetry of the circle, the shape composed of four such intersection points must be a parallelogram, so solving the depth problem essentially reduces to the parallelogram method above. It should be noted that in practical engineering the side lengths of this quadrilateral are difficult to obtain directly, whereas the center-to-vertex distance is the radius Rc. Hence, we typically combine the radius Rc with the parallelogram condition equations to form the equation set F, whose solution yields the depth and thus the spatial position of the circular target.

Figure 10 shows the schematic diagram of a typical double-motion problem. Here, we abstract the moving target as a rigid body shaped like a parallelogram. The rotation R_0 of the monocular camera between the two frames has been obtained by the traditional VO method. Because of the scale uncertainty of monocular vision, the estimated translation is only proportional to the actual spatial motion, and what we need to solve is the true T_0. At the same time, given the baseline size λ_1 of the moving rigid target, and considering that the length of P_1P_2 in the figure equals the Level 1 size λ_1, we can solve the actual T_0 by solving the motion of the rigid body. The solution is illustrated below. First, the pixel coordinates of each point in Figure 10 are defined as:

Scale Solver
The camera coordinates of each point are P_i = [X_i, Y_i, Z_i]^T. With the camera intrinsic matrix CM known, the projection relationship gives:

Z_i · p_i = CM · P_i

Then, the key information λ_1 is used to establish the baseline length constraint equation:

||P_1 − P_2|| = λ_1

Finally, using the rigid body properties, the parallel side pairs P_1P_2 ∥ P_4P_3 and P_1P_4 ∥ P_2P_3 establish the rigid body parallel constraints. Considering that p_i, λ_1, and CM are known quantities, we can solve the equations containing the points P_i and P_i′ before and after the movement, respectively. It is worth noting that a reasonable selection of equations and an appropriate numerical iterative method can improve efficiency.
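Under the parallelogram assumption, these constraints admit a compact solution: writing each corner as P_i = Z_i · CM⁻¹ · p_i, the parallel constraint P_1 − P_2 = P_4 − P_3 is linear and homogeneous in the depths, so the depth direction is the null space of a 3 × 4 matrix, and the known baseline λ_1 fixes the remaining scale. The following NumPy sketch is our own linear-algebra formulation of this step, not the paper's iterative solver.

```python
import numpy as np

def parallelogram_depths(pixels, CM, lam):
    """pixels: four (u, v) corner coordinates, ordered so that P1P2 is parallel
    to P4P3 and ||P1 - P2|| = lam.
    Returns a 3x4 array of camera-frame corner coordinates P1..P4."""
    rays = np.linalg.inv(CM) @ np.column_stack(
        [np.append(np.asarray(p, float), 1.0) for p in pixels])
    # P1 - P2 - (P4 - P3) = 0  =>  Z1*r1 - Z2*r2 + Z3*r3 - Z4*r4 = 0
    A = np.column_stack([rays[:, 0], -rays[:, 1], rays[:, 2], -rays[:, 3]])
    z = np.linalg.svd(A)[2][-1]        # null-space direction of the depths
    z = z * np.sign(z[0])              # keep depths in front of the camera
    z = z * (lam / np.linalg.norm(z[0] * rays[:, 0] - z[1] * rays[:, 1]))
    return rays * z                    # P_i = Z_i * CM^-1 * p_i
```

For the noisy, over-determined systems of the full method (e.g., with the circular target's radius constraint), a numerical iteration as in Section 4.6 remains preferable; this closed form illustrates the geometry of the exact case.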
Next, we solve the scale T_0. First, define the displacement of P_i to P_i′ relative to the camera as T_i. Then, according to the spatial relationship between camera motion and target motion, the real T_0, i.e., the scale information, can be obtained. At this point, the scale recovery of VO using the Level 1 baseline is complete. For the features of Levels 2 and 3, the problem becomes simpler, with Figure 10 still serving as an example: the target length becomes the feature baseline length extracted by the algorithm from the image, and T_0 is now known. The problem is transformed into solving the unknown baseline λ in Levels 2 and 3, and the same method yields the feature baseline values of Levels 2 and 3.

Scale Weighting and Updating
We continuously learn and extract new Level 2 and Level 3 features to join the multi-level scale stabilizer, and remove the features that leave the image from the stabilizer. Theoretically, if detection and scale calculation were absolutely accurate, any set of features in the stabilizer could be used to compute the scale T_0. In practice, considering problems such as mismatching and feature learning failure, we add the random sample consensus method to the scale updater to first remove mismatched baselines from the self-supervised learning results. Since the feature regions extracted in Level 2 depend on deep neural network learning and carry a degree of uncertainty, the RANSAC algorithm is used for screening.
Finally, the scale weight vector ψ and the camera motion scale vector Λ solved from the feature baselines are defined, where i, j, k denote the numbers of scale features at each level, and the final scale is obtained as T_0 = Λ · ψ. The assignment of ψ follows the following principles and properties:

1.
Since, ideally, all entries of Λ equal T_0, the weight vector ψ corresponding to Λ satisfies ∑ i,j,k ψ = 1;

2.
Since the scale information obtained in Level 1 is the most credible and stable prior information, when a target is being tracked in the image, priority is given to the prior scale information of Level 1, following the strategy ∑ i ψ ≥ 0.95;

3.
When the target disappears, the weight distribution over the Level 2 and Level 3 feature scales can be determined according to the actual application scenario. For example, when there are many self-supervised learning features indoors, the scales obtained in Level 2 are assigned higher weights; when the self-supervised feature regions are unstable, the scales obtained in Level 3 can be assigned higher weights.
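The weighting rules above can be sketched as follows. The fusion T_0 = Λ · ψ and the constraints ∑ψ = 1 and ∑_i ψ ≥ 0.95 come from the text; the uniform split of the remaining mass is our own illustrative policy, and the function names are hypothetical.

```python
import numpy as np

def fuse_scale(scales, weights):
    """T0 = Lambda . psi: weighted fusion of the per-baseline scale estimates,
    with the weights normalized so that sum(psi) = 1 (property 1)."""
    psi = np.asarray(weights, float)
    psi = psi / psi.sum()
    return float(np.dot(np.asarray(scales, float), psi))

def level_weights(n1, n2, n3, target_visible, level1_mass=0.95):
    """Illustrative weight assignment following properties 2 and 3: while the
    target is visible, the n1 Level 1 baselines receive at least level1_mass of
    the total weight; after the target disappears, the mass is split uniformly
    over the n2 + n3 baselines of Levels 2 and 3."""
    if target_visible and n1 > 0:
        rest = (1.0 - level1_mass) / max(n2 + n3, 1)
        return [level1_mass / n1] * n1 + [rest] * (n2 + n3)
    return [1.0 / max(n2 + n3, 1)] * (n2 + n3)
```

In a deployed system the Level 2/3 split would be tuned to the scene, as described above, rather than kept uniform.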

The Advantages and Disadvantages of MLSS-VO
The advantages of a multi-level scale stabilizer based on the target tracking problem are: (1) during the initialization of the VO, the scale uncertainty in camera estimation is solved using the size information of the target; (2) the multi-level feature baseline can be used to update the scale value T of the camera motion in real time to prevent scale drift, especially when tracking restarts after VO loss; (3) the target spatial localization algorithm is realized while solving the scale T; (4) the feature regions obtained by the self-supervised method are matched with ORB features, which reduces the possibility of false feature matching; (5) the target size is used to transfer the real size of the feature baseline in space, which can provide a real scale reference for various applications in the region of interest.
There are also some disadvantages to be improved: (1) self-supervised region extraction consumes a large amount of computation, which affects real-time performance; on platforms with limited computing resources, such as small UAVs, only the feature baselines of the first and third levels can be used for scale updating, greatly reducing the computational complexity. (2) In a looped motion environment, the self-supervised regions of the method enjoy a longer feature life cycle; for motion without loop closure, the algorithm must continuously extract new self-supervised feature regions, which leads to a large computational burden. Therefore, this method is better suited to motion scenes with loops.

Target Location and Tracking Framework
As shown in Figure 11, considering that the Level 1 feature baseline acquisition of the scale stabilizer depends on target recognition in the 3D tracking problem, we propose a 3D positioning and tracking algorithm framework for moving targets based on a monocular motion platform, combined with the target tracking algorithm. This framework mainly includes two parts:

1.
Monocular VO autonomous positioning method with the scale stabilizer (red part): this part has been elaborated in detail above;

2.
Target positioning and tracking: the targets in this article are modeled as circles or parallelograms. Section 4.2 gives the geometric constraint equations of the target relative to the camera coordinate system, and an improved high-order Newton iterative algorithm is used to solve the equations numerically. Finally, the camera pose solved by MLSS-VO and the relative coordinates solved by the target positioning algorithm are used to calculate the real trajectory of the target in space.

Improved Newton Iteration
For the scale transfer equations constructed in Section 4.2, we can synchronously solve the spatial position of the moving target relative to the camera. In the actual test, we use a numerical iteration method to solve the equations. The traditional Newton iteration method has second-order convergence and gives accurate solutions. The iterative equation is as follows:

X_{n+1} = X_n − ∇²F(X)^{−1} · ∇F(X) (18)

For our algorithm, X is the vector of unknown depth values to be solved, ∇F(X) is the gradient vector, and ∇²F(X)^{−1} is the inverse of the Hessian matrix. However, traditional Newton iteration involves a large amount of calculation when inverting the Hessian matrix, and the iteration may fall into an endless loop at inflection points where the second derivative vanishes, i.e., f″(x_n) = 0.
Here, an improved Newton-Raphson method with fifth-order convergence is adopted [23,24]. This method solves higher-order equations more stably; in its iterative equations, X_n is the vector to be solved, and Y_n and Z_n are intermediate variables of the iteration. Finally, according to the updated R and T obtained by MLSS-VO, we can further solve the spatial position, where R is the rotation matrix of the camera relative to the world coordinate system, and T is the displacement vector of the camera relative to the world coordinate system. The coordinates P_i of the target feature points in the world coordinate system can be obtained by solving Equation (22).
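For reference, a plain Newton iteration for such a system of equations F(X) = 0 looks as follows. This is a baseline sketch only: the fifth-order variant of [23,24] inserts the intermediate steps Y_n and Z_n within each iteration to raise the convergence order, which we do not reproduce here.

```python
import numpy as np

def newton_system(F, J, x0, tol=1e-10, max_iter=50):
    """Newton-Raphson for a square nonlinear system F(X) = 0 with Jacobian J:
    each step solves J(X) * step = F(X) instead of explicitly inverting J,
    which is cheaper and numerically safer than forming the inverse."""
    x = np.asarray(x0, float)
    for _ in range(max_iter):
        step = np.linalg.solve(J(x), F(x))
        x = x - step
        if np.linalg.norm(step) < tol:
            return x
    return x
```

In the depth estimation problem, X would collect the unknown depths Z_i, and F the projection, baseline, and parallel constraints of Section 4.2.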

Experiments
We designed multiple sets of experiments in an indoor environment to test and verify the algorithm. In those experiments, a visual motion capture system named ZVR was employed to calibrate the ground truth of the target's trajectory. The motion capture system (MCS) is composed of eight cameras, covers an experimental space of 4.7 m × 3.7 m × 2.6 m, and achieves pose tracking and data recording at a refresh rate of 260 Hz. All images in our experiments were taken by a fixed-focus rolling-shutter monocular pinhole camera.
Aiming at the target tracking problem, and in view of the monocular motion platform test requirements of this article, we propose a benchmark for object tracking with motion parameters (OTMP). All samples were taken by a monocular fixed-focus pinhole camera. The trajectory information of multiple sets of sample targets in space and the pose information of the camera itself were simultaneously recorded using the indoor motion capture system. In addition, the camera intrinsic matrix CM, the sample calibration set S, the sample spatial motion trajectory parameters T_s, and the camera pose [R_c|T_c] after checkerboard calibration are provided. This dataset can be used for visual research such as visual SLAM in indoor dynamic environments, spatial positioning of moving targets, and dynamic target recognition and classification. The dataset has been uploaded to GitHub: https://github.com/6wa-car/OTMP-DataSet.git (accessed on 5 December 2021). Figure 12 shows a panoramic view of the entire experimental scene. The moving targets in OTMP are shown in Figure 13.

Feature Region Extraction Based on Siamese Neural Network
Experiment 1: The purpose of this experiment is to verify the performance of Level 2 feature extraction and matching. After training the self-supervised Siamese neural network model with a small number of indoor environment images, we use a monocular motion camera to track and shoot the indoor target, and finally extract and match the static feature regions in the environment. Figures 14 and 15 show, respectively, a schematic diagram of randomly extracted ORB feature point matching in Level 3 and a schematic diagram of ORB feature point matching based on self-supervised feature region constraints in Level 2.
When extracting the self-supervised feature regions, we merge adjacent feature regions on the image to obtain a simpler feature region division. The Correct Matches column reports the number of correctly matched self-supervised regions in Level 2. Table 1 illustrates the relevant performance of the self-supervised feature matching model; in particular, we tested the algorithm on both Central Processing Unit (CPU) and Graphics Processing Unit (GPU) platforms. As Figures 14 and 15 show, our feature region-based method can prevent the false matches produced by the original ORB matching method. Compared with directly using ORB feature point matching, we can control the distribution of self-supervised feature regions, for example by selectively extracting the feature with the largest local feature weight in each region of the image. The feature point matching method based on Level 2 feature region constraints therefore makes the distribution of feature point pairs on the image more uniform, and prevents overly concentrated feature point matches from adversely affecting the motion solution in the VO problem, e.g., increased motion attitude error or systematic deviation of feature matching (erroneous estimation of the camera motion attitude caused by the unknown motion of feature points concentrated in a certain area). Furthermore, the test results in Table 1 show that, for SLAM research and applications, the improved Siamese neural network model for feature matching relies on GPU-based computing platforms to obtain acceptable real-time performance; on a CPU-based platform, this matching method is only suitable for non-real-time applications such as structure from motion (SfM).
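The region-constrained selection described above can be sketched as follows. Keypoints are modeled as plain (x, y, response) tuples (e.g., from an ORB detector) and feature regions as axis-aligned boxes; the helper name and this exact per-region policy are our illustrative assumptions, not the paper's implementation.

```python
def select_uniform_keypoints(keypoints, regions, per_region=1):
    """Keep only the highest-response keypoints inside each
    self-supervised feature region, so that matches are spread
    evenly over the image instead of clustering.

    keypoints: list of (x, y, response) tuples (e.g. from ORB)
    regions:   list of (x_min, y_min, x_max, y_max) boxes
    """
    selected = []
    for (x0, y0, x1, y1) in regions:
        inside = [kp for kp in keypoints
                  if x0 <= kp[0] <= x1 and y0 <= kp[1] <= y1]
        # strongest responses first, keep the top `per_region`
        inside.sort(key=lambda kp: kp[2], reverse=True)
        selected.extend(inside[:per_region])
    return selected

kps = [(10, 10, 0.9), (12, 11, 0.5), (80, 80, 0.7), (82, 79, 0.8)]
boxes = [(0, 0, 40, 40), (60, 60, 100, 100)]
print(select_uniform_keypoints(kps, boxes))  # one keypoint per region
```

Capping the count per region is what prevents a textured corner of the image from dominating the motion solution.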
It can be seen from Table 2 that, in practical applications of MLSS-VO, the numbers of feature matches in Levels 2 and 3 should be selected according to the computing power of the platform. On the GTX 1060 platform, when selecting five matching areas in Level 2 and 30 ORB feature points in Level 3, our algorithm runs at about 9.7 fps.
The timing performance of MLSS-VO under different numbers of feature matches in Levels 2 and 3 is as follows:

Performance of MLSS-VO
Experiment 2: The purpose of this experiment is to investigate the spatial positioning performance of MLSS-VO in an indoor environment. In addition to our open-source dataset OTMP, we also used the TUM dataset, which is commonly used in SLAM research, for verification. For a more comprehensive analysis, we compared MLSS-VO with two typical monocular visual odometry methods: ORB-SLAM2 [25] and RGBD-SLAM v2 [26].
We first verified the effectiveness of MLSS-VO using the OTMP dataset. The experimental results are shown in Figure 16, and the RMSE of MLSS-VO on the x-y-z axes is shown in Table 3.
It can be seen from the experimental results in Figure 16 and Tables 2 and 3 that: (1) Compared with the traditional feature point method (green data), the scale information in MLSS-VO effectively resolves the scale uncertainty in VO initialization and restores the real proportion of the motion trajectory. (2) Considering the real-time nature of the SLAM problem, the computing power of the platform must be taken into account when using MLSS-VO; the number of feature regions in Level 2 should not exceed 10, and the number of ORB feature points in Level 3 should not exceed 60. (3) In the experimental tests in the actual indoor environment, the motion estimate of MLSS-VO shows no scale drift, and the root mean square error of positioning is within 2.73 cm, which is sufficient for most indoor visual motion platforms such as drones and unmanned vehicles.
Furthermore, we compared the three algorithms MLSS-VO, ORB-SLAM2, and RGBD-SLAM v2 on the TUM dataset. It is worth noting that, because the monocular module of ORB-SLAM2 requires scale alignment when tested on TUM data, we used the SLAM evaluation tool evo to correct it. In addition, RGBD-SLAM v2 needs the depth map information of the dataset. For the MLSS-VO experiment in this paper, we provide the assumed target with calibration during initialization. The experimental results are shown in Figures 17 and 18, and the comparison of the RMSE of the three methods on the x-y-z axes is shown in Table 4. As can be seen from Figure 18 and Table 4, the performance of MLSS-VO lags slightly behind that of the state-of-the-art SLAM frameworks ORB-SLAM2 and RGBD-SLAM v2. However, we manually calibrated the scale of ORB-SLAM2 to make the comparison meaningful, and RGBD-SLAM v2 requires dense depth map information, whereas MLSS-VO only needs the target baseline available in monocular vision and tracking problems. From the perspective of the tracking scene, MLSS-VO therefore has a unique advantage in solving the scale alignment problem, with no additional depth map information required. In fact, our multi-level feature baselines can also be understood as solving the depth of sparse features in space, including the key point depth information of dynamic targets in Level 1, as well as the key point depth information of static features in Levels 2 and 3.
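The scale correction applied to the monocular ORB-SLAM2 trajectory amounts to fitting a similarity transform between the estimated and ground-truth trajectories before computing the RMSE (evo offers this as an alignment option). A minimal Umeyama-style sketch, with function and variable names of our own choosing:

```python
import numpy as np

def align_sim3(est, gt):
    """Fit scale s, rotation R, translation t minimizing
    ||gt - (s * R @ est + t)||^2 (Umeyama alignment), as used to
    rescale monocular trajectories before computing RMSE.
    est, gt: (N, 3) arrays of matched trajectory positions."""
    mu_e, mu_g = est.mean(0), gt.mean(0)
    E, G = est - mu_e, gt - mu_g
    # cross-covariance between the centred trajectories
    U, S, Vt = np.linalg.svd(G.T @ E / len(est))
    D = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        D[2, 2] = -1          # guard against reflections
    R = U @ D @ Vt
    s = np.trace(np.diag(S) @ D) / E.var(0).sum()
    t = mu_g - s * R @ mu_e
    return s, R, t
```

After alignment, `s * est @ R.T + t` lies in the ground-truth frame, so a per-axis RMSE against `gt` becomes meaningful for a monocular method whose raw scale is arbitrary.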

Target Tracking
Experiment 3: The purpose of this experiment is to verify the target tracking performance of a moving platform using the MLSS-VO positioning method. Since the ground truth requires the motion trajectories of both the camera and the target, we use the OTMP open-source dataset for verification. The experimental scene is shown in Figure 19, and the experimental results are shown in Figure 20. Table 5 shows the root mean square error (RMSE) of target tracking. It can be seen from Figure 20 and Table 5 that the VO with a multi-level scale stabilizer solves the pose estimation of the moving platform itself and realizes the spatial tracking task of the target. The proposed monocular moving platform target tracking framework can effectively track the target, with a tracking error of less than 4.97 cm.
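The per-axis tracking error reported in Table 5 is an RMSE over time-aligned trajectory samples; a minimal sketch of that metric (our own helper, not the paper's evaluation code):

```python
import numpy as np

def rmse_per_axis(est, gt):
    """Root mean square error along each axis between an estimated
    target trajectory and the motion-capture ground truth.
    est, gt: (N, 3) arrays of time-aligned positions, in metres."""
    err = est - gt
    return np.sqrt((err ** 2).mean(axis=0))

est = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
gt = np.array([[0.0, 0.1, 0.0], [1.0, 0.9, 0.0]])
print(rmse_per_axis(est, gt))  # ~[0., 0.1, 0.]
```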

Summary
In view of the target tracking problem encountered in monocular vision, a VO with a multi-level scale stabilizer, namely MLSS-VO, is proposed in this paper to resolve monocular scale drift and scale uncertainty. On this basis, the target positioning and tracking framework of the monocular motion platform is further described. The core idea of MLSS-VO is that the prior size information of the target and the attitude information of the original VO are used to propagate spatial size information to the feature baselines at all levels, so as to calculate the real motion scale of the camera. In addition, a feature matching model based on a Siamese neural network is proposed, which enables the extraction of self-supervised feature matches and provides a reliable reference and constraint for the selection of ORB feature points in the original VO. The proposed algorithm can be applied to various mobile platforms fitted with monocular vision sensors, such as UAVs and self-driving cars.
Indoor experiments have revealed the following points. Firstly, the self-supervised feature matching based on the Siamese neural network proposed in this study is effective in determining the matching regions between moving images. Secondly, the scale information in MLSS-VO can be used to resolve the scale uncertainty arising in VO initialization and restore the real proportion of the motion trajectory; with an appropriate number of Level 2 and Level 3 features selected, the real-time speed reaches about 9.7 fps. Next, we compared MLSS-VO with two state-of-the-art SLAM frameworks, ORB-SLAM2 and RGBD-SLAM v2, and analyzed the advantages of MLSS-VO. Lastly, the root mean square error of the motion estimation of MLSS-VO is restricted to within 3.87 cm, and the root mean square error of target localization based on this method is less than 4.97 cm.
The method proposed in this paper will be further extended to various motion platforms for different purposes, such as obstacle avoidance, trajectory monitoring, visual navigation, and the tracking of small UAVs and autonomous cars.

Data Availability Statement:
The experimental data included in this study are available upon request from the first author. Part of the code required to reproduce these findings cannot be shared at this time, as the data also forms part of an ongoing study.

Conflicts of Interest:
The authors declare no conflict of interest.