Robust and Efficient CPU-Based RGB-D Scene Reconstruction

3D scene reconstruction is an important topic in computer vision. A complete scene is reconstructed from views acquired along the camera trajectory, each view containing a small part of the scene. Tracking in textureless scenes is a well-known difficulty for camera tracking, and obtaining accurate 3D models quickly is a major challenge for existing systems. For robotics applications, we propose a robust CPU-based approach to reconstruct indoor scenes efficiently with a consumer RGB-D camera. The proposed approach bridges feature-based camera tracking and volumetric data integration and achieves good reconstruction performance in terms of both robustness and efficiency. The key points of our approach include: (i) a robust and fast camera tracking method combining points and edges, which improves tracking stability in textureless scenes; (ii) an efficient data fusion strategy to select camera views and integrate RGB-D images on multiple scales, which enhances the efficiency of volumetric integration; (iii) a novel RGB-D scene reconstruction system, which can be quickly implemented on a standard CPU. Experimental results demonstrate that our approach reconstructs scenes with higher robustness and efficiency compared to state-of-the-art reconstruction systems.


Introduction
3D scene reconstruction is an important topic in computer vision with many applications, such as robotics and augmented reality. The emergence of consumer RGB-D cameras, such as Microsoft Kinect, Asus Xtion and Structure Sensor, provides an opportunity to develop indoor scene reconstruction systems conveniently.
KinectFusion [1,2] is an outstanding method to generate photorealistic dense 3D models on a GPU. It uses a volumetric representation based on the Truncated Signed Distance Function (TSDF) [3] to represent scenes, in conjunction with fast Iterative Closest Point (ICP) [4] pose estimation, to provide a real-time fused dense model. Although KinectFusion has many advantages such as algorithmic simplicity, it also has some disadvantages in camera tracking and volumetric representation. For camera tracking, it suffers from tracking drift accumulation, and the ICP algorithm is computationally costly, as in each iteration, the nearest neighbors between two point clouds have to be determined. For volumetric representation, the TSDF is represented as a regular grid, and the memory consumption and computation time grow cubically with the resolution.
In contrast to dense ICP-based tracking methods, sparse feature-based methods extract features in RGB images and estimate the camera motion between the images. They are more efficient and widely used in sparse Simultaneous Localization and Mapping (SLAM) systems. In this paper, we present a new CPU-based RGB-D indoor scene reconstruction framework, which combines dense volumetric integration with a sparse feature-based tracking method and can be applied to indoor scene reconstruction with high robustness and efficiency. The main contributions of our work are:
1. A fast camera tracking method combining points and edges, by which the tracking stability in textureless scenes is improved;
2. An efficient data fusion strategy based on a novel camera view selection algorithm, by which the performance of volumetric integration is enhanced;
3. A novel RGB-D scene reconstruction system, which can be quickly implemented on a standard CPU.
The rest of the paper is organized as follows: Section 2 introduces the related work and motivation of our research. Section 3 gives an overview of our scene reconstruction system. The details of the proposed method are presented in Section 4. Section 5 describes experiment results and discussions, while Section 6 presents some concluding remarks.

Related Work
Many methods are designed for robust camera tracking and efficient volumetric integration. In this section, we briefly discuss the related work and then state the detailed motivations of our approach.

Camera Tracking
A remarkable feature-based camera tracking method is proposed in ORB-SLAM [17][18][19], which is an accurate and efficient system and can work in real time on standard CPUs. However, it is prone to fail when dealing with textureless images or when feature points temporarily vanish due to motion blur. Since lines are abundant in indoor environments and less sensitive to lighting variation than points, some systems [20][21][22][23][24][25][26] estimate the camera location from line features or edge information.
StructSLAM [20] extends the standard visual SLAM method to adopt building structure lines with a parametrization method that represents the structure lines in dominant directions. Lu et al. [21] extracted 3D points and lines from RGB-D data, analyzed their measurement uncertainties and computed camera motion using maximum likelihood estimation. Zhang et al. [22] presented a graph-based visual SLAM system using straight lines as features with a stereo sensor. PL-SLAM [23,24] proposes solutions that simultaneously leverage point and line information with a monocular and a stereo sensor, respectively. Those methods are less efficient because the detection and matching of line features are time consuming. Unlike those methods, Edge VO [25] develops a simple and efficient edge-based tracking method without any back-end optimization. To improve the accuracy, Edge SLAM [26] extends it with two-view initialization and local optimization, but at the cost of reduced efficiency.
Inspired by the above methods, we combine the advantages of edge tracking and feature-based SLAM technology and design a robust and fast camera tracking method combining points and edges to improve the stability of camera tracking.

Volumetric Integration
Volumetric methods provide efficient and simple ways of integrating multiple RGB-D images into a complete 3D model. The original idea of volumetric 3D reconstruction from depth images dates back to volumetric data integration [3]. Later, the advent of consumer RGB-D cameras and massively parallel processors in GPUs led to the seminal KinectFusion system and has inspired a wide range of further work. One of the major limitations of volumetric approaches is their lack of scalability due to their reliance on a uniform grid, so they can only handle small scenes. Subdivision strategies that exploit sparsity have therefore become a research focus.
Kintinuous [5] permits the area mapped by the TSDF to move over time, which allows continuously augmenting the reconstructed surface in an incremental fashion as the camera translates and rotates in the real world. Fastfusion [27,28] proposes an efficient octree data structure that allows for fast TSDF updates and incremental meshing and that runs on a standard CPU in real time. InfiniTAM [29][30][31] uses a simple spatial hashing scheme that compresses space and allows for real-time access and updates of implicit surface data, without the need for a regular or hierarchical grid data structure.
The above methods ignore the fact that volumetric integration based on TSDF is a weighted average process. If too much redundant data is fused, computing resources are wasted, and the surface mesh may be over-smoothed or polluted by unnecessary noise. In order to further improve the performance of volumetric integration, we propose a camera view selection algorithm to prune away redundant camera views and then quickly integrate the selected RGB-D images with multi-scale TSDF.

System Overview
A schematic overview of our approach is shown in Figure 1. The proposed system consists of two main stages: robust camera tracking and efficient volumetric integration. Each stage is briefly described as follows:
Camera tracking localizes the camera and contains front-end tracking and back-end optimization. We perform the tracking thread with both point and edge correspondences to ensure reliability in textureless scenes. Local mapping and loop closing are used to optimize the tracking results. The former manages the local map, and the latter detects large loops and corrects accumulated drift by pose-graph optimization.

Volumetric integration fuses RGB-D images of different camera views into a scene model. The output of camera tracking is a complete camera trajectory; however, it is unnecessary to use all camera views. We prune away redundant views with a novel camera view selection method based on camera motion and image similarity detection and then integrate the selected RGB-D images efficiently with adaptive multi-scale TSDF. The final mesh model is extracted with the marching cubes algorithm [32].

The Proposed Methods
The proposed methods consist of two key components, i.e., tracking via points and edges and efficient data fusion. The following subsections describe them separately.

Tracking via Points and Edges
The goal of camera tracking is to find the transformation T that maps the previous image into the new one. To improve the tracking reliability in textureless scenes, we implement the tracking thread with both point and edge correspondences. The transformation is estimated with two types of errors: the feature point re-projection error in Equation (1) and the geometrical distance error of the edge in Equation (3).
Points are matched by the re-projection error E_p of Equation (1), which is defined in terms of the position X of the 3D point, the position of its matched point, and the camera intrinsic matrix K (Equation (2)).
Edges are matched with a warping transformation based on geometrical distance estimation [25]. The geometrical distance error E_e of Equation (3) is defined in terms of x, a pixel on the edge map; n, the direction of the gradient; and τ, the warping transformation [7] between consecutive frames, which is constructed as follows:
• First, the 3D point p corresponding to the pixel x = (u, v)^T on the edge map is reconstructed using the inverse of the projection function, π^-1 (Equation (4)), where z(x) is the depth value of pixel x in the first depth frame.
• Second, the 3D point in the second frame is given as T(g(ξ), p), where g(ξ) represents the transformation by the Lie algebra se(3) associated with the group SE(3). When the second camera observes the transformed point q = (x_q, y_q, z_q)^T, we obtain the warped pixel coordinates by projecting q with π (Equation (5)).
• Finally, the full warping function τ (Equation (6)) is the composition of these steps.
Considering both the feature point re-projection error (E_p) and the geometrical distance error (E_e), we obtain the transformation T by minimizing the cost function in Equation (7), where X_c denotes the correspondence set of consecutive frames; ρ_p and ρ_e are Huber functions; and λ ∈ [0, 1] is the weighting coefficient. The cost function is minimized using iterations of the classical Levenberg-Marquardt algorithm. The Huber function is introduced to reduce the effect of outliers. The choice of λ depends on the richness of texture features in the scene. In our experiments, we set it according to Equation (8), where N is the number of ORB features extracted per frame, and N_min and N_max are the minimum and maximum thresholds.
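The three steps that build the warping function can be sketched in a few lines of code. The following is a minimal illustration only, assuming a standard pinhole camera model with intrinsics f_x, f_y, c_x, c_y stored in K, and assuming the rigid transformation is supplied directly as a 4×4 matrix T rather than generated from ξ through the exponential map. The function names (back_project, project, warp_pixel) are hypothetical and are not taken from the paper.

```python
import numpy as np

def back_project(x, z, K):
    """Step 1: lift pixel x = (u, v) with depth z to a 3D point p.
    Assumes a pinhole model; K is the 3x3 intrinsic matrix."""
    u, v = x
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])

def project(q, K):
    """Project a 3D point q = (x_q, y_q, z_q) to pixel coordinates."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    return np.array([fx * q[0] / q[2] + cx, fy * q[1] / q[2] + cy])

def warp_pixel(x, z, K, T):
    """Full warp: back-project, move into the second frame, re-project.
    T is an assumed 4x4 rigid transform standing in for T(g(xi), .)."""
    p = back_project(x, z, K)
    q = T[:3, :3] @ p + T[:3, 3]   # Step 2: point in the second frame
    return project(q, K)           # Step 3: warped pixel coordinates
```

With the identity transform a pixel warps onto itself, which is a convenient sanity check before plugging the warp into an edge-distance residual.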
For each input RGB image, we extract points using ORB features [33] and extract edges with the DoG-based detector [34] due to its robustness to illumination and contrast changes. Figure 2a,b shows a comparison of ORB feature extraction and edge extraction on an RGB image. It can be seen that more edges than ORB feature points are extracted in motion blur and low-texture scenes. Figure 2c shows the variation of the accuracy of the estimated camera trajectories with different numbers (100, 200, 300, 400 and 500) of ORB features in each frame on three sequences of the TUM RGB-D dataset [35]. The vertical axis indicates the accuracy of camera tracking as the Absolute Trajectory Error (ATE) Root Mean Squared Error (RMSE, in centimeters). When the number of ORB features in each frame is 100, the camera tracking is lost on the fr1_xyz and fr2_desk sequences. The trend of the line chart indicates that the accuracy of camera tracking increases with the number of extracted ORB features and then tends to be stable. Based on this experimental analysis, N_min and N_max are set to 200 and 400 in our experiments.
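To make the role of the two thresholds concrete, the sketch below shows one way the weighting coefficient could follow the feature count. It is an assumption, not the paper's formula: it uses a linear ramp from 0 at N_min = 200 to 1 at N_max = 400, and it assumes that λ weights the point term and (1 − λ) the edge term. The names texture_weight, huber and combined_cost, and the Huber threshold delta, are hypothetical.

```python
def texture_weight(n_features, n_min=200, n_max=400):
    """Weight lambda in [0, 1] from the ORB feature count.
    ASSUMPTION: linear ramp between the two thresholds."""
    if n_features <= n_min:
        return 0.0
    if n_features >= n_max:
        return 1.0
    return (n_features - n_min) / (n_max - n_min)

def huber(r, delta=1.0):
    """Huber function: quadratic near zero, linear for large residuals."""
    a = abs(r)
    return 0.5 * r * r if a <= delta else delta * (a - 0.5 * delta)

def combined_cost(point_errors, edge_errors, lam):
    """ASSUMED form: lambda weights the point term, (1 - lambda) the edge term."""
    point_term = sum(huber(e) for e in point_errors)
    edge_term = sum(huber(e) for e in edge_errors)
    return lam * point_term + (1.0 - lam) * edge_term
```

Under these assumptions, a frame with few ORB features relies mostly on edges, while a richly textured frame relies mostly on points.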
To accelerate the calculation, depth information is employed during the initialization and matching processes. Due to the limitation of structured light technology, the depth value captured by an RGB-D camera such as Microsoft Kinect on structure edges usually contains a large error even when the texture is evident. We test the depth z from the depth image by z_min ≤ z ≤ z_max to select a reliable value. The choice of z_min and z_max depends on the parameters of the RGB-D camera. If the depth value is beyond the range, it is estimated through the standard EKF proposed in Edge VO [25]. Besides, we estimate the Standard Deviation (STD) σ of the depth noise for each pixel x based on the noise model [36]. For Microsoft Kinect, we set z_min and z_max to 0.5 m and 4 m, respectively, and calculate the noise model according to Equation (9), where z(x) is the depth value and θ(x) is the angle between the surface normal and the z axis at pixel x.
We use this noise model to analyze the uncertainty of depth values on extracted edges and eliminate edges with poor depth values.
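The depth range test is simple enough to show directly. The sketch below only flags which depth values pass z_min ≤ z ≤ z_max with the Kinect values quoted above; it does not implement the EKF fallback or the noise model. The function name reliable_depth_mask is hypothetical.

```python
import numpy as np

Z_MIN, Z_MAX = 0.5, 4.0  # meters, the Kinect values quoted in the text

def reliable_depth_mask(depth):
    """Return a boolean mask that is True where z_min <= z <= z_max.
    Missing depth stored as 0 or NaN fails the test automatically."""
    depth = np.asarray(depth, dtype=float)
    return (depth >= Z_MIN) & (depth <= Z_MAX)
```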

Efficient Data Fusion
After camera tracking, we integrate the RGB-D images with camera poses into a global model by multi-scale TSDF [27,28]. The TSDF is discretized into a voxel grid to represent a physical volume of space. For a given voxel v in the fused scene model F, the corresponding signed distance value F(v) is computed from r views as the weighted average in Equation (10). The signed distance function f_i(v) is the projective distance between a voxel and the ith depth frame and is defined in Equation (11), where x = π(Kv) is the pixel into which the voxel center projects and Φ is the truncation threshold. We compute the distance along the principal (Z) axis of the camera frame using the z component, denoted as [·]_z. The weighting function w_i(v) represents the confidence in the accuracy of the distance and is assigned according to Equation (12), where δ is one tenth of the voxel resolution.
Considering the fact that the distances from the camera to different objects in the scene are different, the geometry information should be stored at different resolutions to obtain an accurate and efficient volumetric integration. Taking the consumer RGB-D camera Kinect used in this paper as an example, the measurement error increases with the distance from points to the principal axis. In order to quickly obtain scene models with sufficient geometrical details, we use a multi-level octree structure to store the multi-scale TSDF and update the TSDF at a higher resolution for points near the camera and at a lower resolution for points far away. The geometry is stored in small cubic volumes (bricks), each consisting of 8^3 voxels. Each voxel stores the truncated signed distance, the weight and the color. All the bricks in the octree have the same size, while having different scales. The brick's scale s_l is set as s_l = exp_2(log_2 max{z_i, 1}), where l is the level of the octree and z_i is the depth value (in meters). The choice of Φ depends on the noise of the camera. We set it to be twice the voxel scale of the grid resolution.
As can be seen from Equation (10), TSDF fusion is a weighted average process. Even small errors of camera pose will make the TSDF model blurry and consequently lose fine details. Fusing too much redundant data has no benefit in improving the precision of the model. If too much redundant or similar data is fused repeatedly, not only are computing resources wasted, but the surface mesh may also be polluted by unnecessary noise. Therefore, we prune away redundant camera views before volumetric integration.
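Because the fused value is a weighted average over views, it can be maintained incrementally, one depth frame at a time. The sketch below illustrates this with a running update, which gives the same result as averaging all r views at once. It is a minimal illustration: the per-view weights are passed in rather than derived from the paper's weighting function, and the names truncate and fuse are hypothetical.

```python
import numpy as np

def truncate(sdf, phi):
    """Clamp signed distances to the truncation band [-phi, phi]."""
    return np.clip(sdf, -phi, phi)

def fuse(F, W, f_i, w_i):
    """One incremental step of weighted-average fusion.
    F, W: current fused distances and accumulated weights per voxel.
    f_i, w_i: truncated distances and weights of the new view."""
    W_new = W + w_i
    # Guard the division so voxels that were never observed keep their value.
    F_new = np.where(W_new > 0, (W * F + w_i * f_i) / np.maximum(W_new, 1e-12), F)
    return F_new, W_new
```

Starting from F = 0 and W = 0 and calling fuse once per selected view reproduces the batch weighted average for every voxel.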
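The depth-dependent brick scale can be illustrated as follows. This sketch assumes that the exponent is rounded down to an integer, so that scales are powers of two as the levels of an octree suggest. The rounding is an assumption, since the formula as printed here shows none, and the function name brick_scale is hypothetical.

```python
import math

def brick_scale(z):
    """Brick scale for a measurement at depth z (meters).
    ASSUMPTION: the exponent is floored so the scale is a power of two."""
    return 2 ** math.floor(math.log2(max(z, 1.0)))
```

Under this assumption, measurements closer than 2 m are stored at the finest scale, and each doubling of depth moves the brick to the next coarser scale.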
The purpose of camera view selection is to remove redundant views caused by the camera's slow motion and repeated views. Slow motion can be detected through the relative rotation and translational velocity between consecutive frames. Repeated views are determined by loop closure and con-visibility information between non-consecutive frames. The proposed camera view selection algorithm is illustrated in Algorithm 1. The complete trajectory with n views is reduced to a new trajectory with r views. As the transformation T_i = [R_i | t_i] contains the camera motion information, the variables in this algorithm are calculated as follows:
• Three Euler angles α_i, β_i and γ_i are computed from the relative rotation between consecutive frames (Equation (13)), where α_i, β_i and γ_i represent the yaw, pitch and roll angles, respectively;
• The translational velocity v_i is computed by Equation (14);
• Loop closure key frames are detected in camera tracking;
• The similarity ratio ρ_i,j between the ith and jth frames is measured by con-visibility content information [37] and defined in Equation (15), where n_i and n_j are the numbers of available pixels in the ith and jth depth images in the ith frame coordinate system.
Loop closure [18] is in charge of detecting loops to reduce the cumulative errors in camera tracking. We take the key frames of loop closure as marks and use them as a basic condition to determine the repeated regions. For each depth image D_i, we only calculate the similarity ratios between the current ith frame and each jth loop closure frame in Algorithm 1. We assume that regions with closer content have good consistency. By measuring the con-visibility information of depth images between the ith frame and the jth frame, we estimate the similarity of visual contents between them and obtain a similarity ratio ρ_i,j through Equation (15).
The selection of motion thresholds depends on the movement of the RGB-D camera. In our experiments, the angle thresholds (Th_α, Th_β and Th_γ) are fixed to 0.005 (degree); the velocity threshold Th_v is set to 0.2 or 0.5 (centimeter); and the similarity threshold Th_ρ is fixed to 0.85.
Algorithm 1. Camera view selection. Input: the complete trajectory with n views. Output: the reduced trajectory T_k with r views.
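The selection rule described in this subsection can be sketched as follows. This is one possible reading, not the authors' Algorithm 1: it assumes that a view is dropped when all three angles and the speed are at or below their thresholds (slow motion), or when its similarity ratio to a loop closure key frame reaches Th_ρ (repeated view), and that the first view is always kept. The View record, its field names and select_views are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence

@dataclass
class View:
    index: int
    angles: Sequence[float]     # (alpha_i, beta_i, gamma_i) w.r.t. the previous frame
    speed: float                # ||v_i||
    max_loop_similarity: float  # largest rho_{i,j} over loop closure key frames, 0 if none

def select_views(views: List[View],
                 th_angles=(0.005, 0.005, 0.005),
                 th_v=0.2,
                 th_rho=0.85) -> List[int]:
    """Return the indices of the views kept for integration (ASSUMED rule)."""
    kept = []
    for v in views:
        slow = (all(abs(a) <= th for a, th in zip(v.angles, th_angles))
                and v.speed <= th_v)
        repeated = v.max_loop_similarity >= th_rho
        # Always keep the first view; otherwise drop slow or repeated views.
        if not kept or not (slow or repeated):
            kept.append(v.index)
    return kept
```

The default thresholds are the values quoted above; the velocity threshold would be switched to 0.5 for sequences where the text uses that setting.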

Experiments
To illustrate the robustness and efficiency of the proposed approach, we have carried out experiments on both synthetic and real-world scenes. The quantitative and qualitative comparisons are performed with a series of state-of-the-art systems. For all experiments, we run our system on a standard desktop PC with an Intel Core i7-4790 3.6-GHz CPU. For camera tracking, ORB-SLAM [18,19], PL-SLAM [24], Edge VO [25] and Edge SLAM [26] are run on a CPU. For 3D reconstruction, Kintinuous [6], Choi et al.'s method [12], ElasticFusion and BundleFusion are run on a GPU.

Camera Tracking
For camera tracking, we compare our method with several related systems (ORB-SLAM, Edge VO, Edge SLAM and PL-SLAM) in terms of tracking accuracy and computing speed on the TUM RGB-D dataset. Table 1 reports the accuracy of camera tracking (ATE RMSE in centimeters). Note that the results of ORB-SLAM (monocular), Edge VO, Edge SLAM and PL-SLAM are quoted from the corresponding papers. The results show that our tracking method is superior to the others in terms of accuracy and robustness on the TUM RGB-D dataset. Our method obtains good tracking accuracies and shows robustness especially on textureless scenes (fr3_snt_far and fr3_snt_near). A comparison of computing speed for camera tracking is given in Table 2. Note that the speeds of the other methods are quoted from the corresponding papers, which use the same operating environment as ours. Since edge extraction and matching are faster than line feature extraction and matching, the mean tracking speed of our method is higher than that of PL-SLAM. Besides, initial pose estimation is accelerated with the help of depth information. The total tracking speed of our method can reach 58 Hz on an Intel Core i7-4790 CPU when λ = 1.
To further compare the robustness and accuracy of camera tracking with points and edges, experiments are also conducted on eight sequences (Living Room kt0-3 and Office kt0-3) of the ICL-NUIM dataset [38] and four sequences (Living Room 1-2 and Office 1-2) of the Augmented ICL-NUIM dataset [12]. Table 3 reports the accuracy of camera tracking (ATE RMSE in centimeters) with different tracking methods: tracking via points (ORB-SLAM [19]), tracking via edges (Edge VO [25]) and tracking via points and edges (our method). Note that Edge SLAM is not open source, so we cannot compare it on the ICL-NUIM dataset and the Augmented ICL-NUIM dataset. All the results in Table 3 are provided by our experiments on an Intel Core i7-4790 CPU.
The results indicate that tracking via points and edges has higher robustness than tracking with points or edges alone.
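Since all tracking comparisons are reported as ATE RMSE, a short sketch of how such a value can be computed may be useful. It assumes that the estimated and ground-truth camera positions are already associated by timestamp, rigidly aligns the two trajectories with the SVD-based Kabsch solution, and takes the RMSE of the remaining position differences. It is an illustration of the metric, not the evaluation script used for the tables, and the function name ate_rmse is hypothetical.

```python
import numpy as np

def ate_rmse(estimated, ground_truth):
    """ATE RMSE between two (N, 3) arrays of associated camera positions."""
    P = np.asarray(estimated, dtype=float)
    Q = np.asarray(ground_truth, dtype=float)
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    # Cross-covariance of the centered trajectories.
    H = (P - cp).T @ (Q - cq)
    U, _, Vt = np.linalg.svd(H)
    # Correct a possible reflection so that R is a proper rotation.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cq - R @ cp
    aligned = P @ R.T + t
    return float(np.sqrt(np.mean(np.sum((aligned - Q) ** 2, axis=1))))
```

The result is in the units of the input positions, so trajectories given in meters must be scaled by 100 to compare with values reported in centimeters.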

Volumetric Integration
For volumetric integration, we carry out experiments to validate the camera view selection method and the efficiency of data fusion. Figure 3a shows the variation of the number of views before and after camera view selection with different velocity thresholds on the ICL-NUIM living room sequences (kt0-3), the TUM RGB-D dataset (fr3_snt_f and fr3_snt_n) and our dataset (corridor and room). Fr3_snt_f and fr3_snt_n are manually scanned, while corridor and room are scanned through a robot equipped with an RGB-D camera. The effect of camera view selection on our dataset is very obvious since the sequences contain some loop closures. Figure 3b shows a comparison of data fusion time before and after camera view selection. The average time of data fusion is reduced by 21.7% on an Intel Core i7-4790 CPU. Note that the fusion speed on the real-world scenes (TUM RGB-D and our dataset) is faster than on the synthetic scenes (ICL-NUIM) because the number of valid depth values is smaller.
To justify the reasonableness of camera view selection, we have carried out an experiment with different numbers (100, 200, 500, 1000, 2000 and 3000) of camera views on the fr2_xyz sequence of the TUM RGB-D dataset. Note that the camera pose errors are very small and the mesh models are fused by standard TSDF. The enlarged views of the reconstruction results are shown in Figure 4. The models look very similar and have some missing areas due to occlusion when r is 100, 200 and 500. The reconstructed details are best when r is 1000 and get worse when r increases to 2000 and 3000.
The results indicate that fusing too much redundant data does not improve the precision of the model.
Figure 5 demonstrates the reconstruction performance on a small indoor desk scene before and after camera view selection. Figure 5a shows a complete scene model, which is scanned through a robot equipped with Microsoft Kinect. The robot started at the origin s, moved from right to left and finally returned to the origin s. The trajectory and moving direction of the camera are marked with blue lines and arrows. The desk region A marked by the red box is a loop closure region, which has been scanned repeatedly. Figure 5b,c show the models of desk region A before and after camera view selection. As can be seen from the enlarged view of region B, the accuracy of the reconstructed surface is enhanced after camera view selection, because redundant data are removed. Note that the mesh models are fused by standard TSDF.

3D Reconstruction
To evaluate the proposed 3D reconstruction system quantitatively, we carry out experiments on four living room sequences (kt0-3) of the ICL-NUIM dataset. Table 4 reports the accuracy of camera trajectories (ATE RMSE in centimeters) and surface reconstruction (median distances in centimeters). Note that the results of Kintinuous, Choi et al.'s method [12], ElasticFusion, InfiniTAM v3, BundleFusion and DVO SLAM are quoted from the corresponding papers. The results indicate that our approach obtains comparable accuracy to the state-of-the-art methods. Figure 6 shows reconstruction results with the proposed approach, the first row for estimated camera trajectories compared with the ground truth and the second row for surface reconstruction models. From the diagram, it can be seen that the camera trajectories produced by our approach are close to the ground truth. The reconstruction results show that our system achieves good performance in indoor scene reconstruction.
Besides, we also carry out experiments on four large scene sequences (Living Room 1-2 and Office 1-2) of the Augmented ICL-NUIM dataset. The average trajectory length of those sequences is 36 meters. Table 5 reports the accuracy of camera trajectories (ATE RMSE in meters) compared with other 3D reconstruction systems. Note that the results of Kintinuous, Choi et al.'s method [12], BundleFusion and DVO SLAM are quoted from the corresponding papers. The experiments with InfiniTAM v3 on the four sequences all failed due to tracking loss in the 1418th, 1510th, 296th and 2285th frames, respectively. The results show that the average accuracy of camera trajectories with our method on the Augmented ICL-NUIM dataset is higher than that of the state-of-the-art methods.
For the qualitative evaluation, real-world scene experiments are carried out on the TUM RGB-D dataset and our dataset. Table 6 reports the accuracy of the estimated camera trajectories (RMSE in centimeters) and mean speeds (fps) of data fusion on the TUM RGB-D dataset. Note that the speeds of the other methods are estimated from the corresponding papers. The average accuracy of our approach is higher than that of the others. Figure 7 shows a comparison of four indoor scenes' reconstruction results on the TUM RGB-D dataset and our dataset. Fr3_snt_far (795 views) and fr3_snt_near (1055 views) are fully wrapped in a white plastic foil with little texture. They are manually scanned along a zig-zag structure. Their views used for integration are 651 and 783, respectively. Corridor (2047 views) and room (2215 views) are indoor scenes of our dataset and are scanned through a robot equipped with Microsoft Kinect. The sequences have some redundancy since the movement of the robot is less flexible or more repeated. The views of corridor and room used for integration are 999 and 1329, respectively. The runtime marked in the figure is the total time of 3D reconstruction, i.e., camera tracking time and data fusion time. Choi et al.'s method [12] and ElasticFusion are run on an Nvidia GeForce GTX 750 Ti 2-GB GPU, while InfiniTAM v3 and our system are run on an Intel Core i7-4790 3.6-GHz CPU. The average speed of InfiniTAM v3 is 1.25 Hz (fps) in our experiments. Compared to other methods, our approach produces better reconstruction results on the scenes containing textureless regions (fr3_snt_far, fr3_snt_near and corridor). All the results show that our system performs well and takes the shortest time even though it runs on the CPU.
Table 6. Accuracy of the estimated camera trajectories (ATE RMSE in centimeters) and mean speed (fps) of data fusion on the TUM RGB-D dataset [35]. Bold shows the best results.

Conclusions
We have presented a robust CPU-based approach to efficiently reconstruct indoor scenes scanned with a consumer RGB-D camera. The key idea is to estimate camera motion via points and edges and then integrate RGB-D images with an efficient data fusion strategy. Experimental results demonstrate the better performance of our proposed approach in terms of both robustness and efficiency. Our approach is applicable to indoor scene reconstruction on resource-constrained robots.

Figure 1. The pipeline of the proposed CPU-based 3D reconstruction system, which consists of two main stages: robust camera tracking and efficient volumetric integration. TSDF, Truncated Signed Distance Function.

Figure 2. (a,b) Comparison of ORB feature extraction and edge extraction on an RGB image. Green denotes the ORB feature points; blue denotes edges with low uncertainty; red denotes edges with high uncertainty. (c) Variations of the camera tracking accuracy (Absolute Trajectory Error (ATE) RMSE in centimeters) with different numbers of ORB features extracted per frame.

Figure 3. The variation of the number of camera views and data fusion time. (a): variation of the number of camera views before and after view selection with different velocity thresholds. (b): variation of data fusion time before and after view selection (ICL-NUIM and our datasets: v = 0.2; TUM RGB-D dataset: v = 0.5).

Figure 4. The enlarged views of reconstruction results with different numbers (100, 200, 500, 1000, 2000 and 3000) of camera views on the fr2_xyz sequence of the TUM RGB-D dataset.

Figure 5. Comparison of the reconstruction results for a desk scene before and after camera view selection. Note that the mesh model is fused by standard TSDF.

Figure 6. Reconstruction results with the proposed approach on ICL-NUIM living room sequences, the first row for estimated camera trajectories compared with the ground truth and the second row for surface reconstruction models.

Figure 7. Comparison of the reconstruction results on real-world scenes. Fr3_snt_far and fr3_snt_near are manually scanned with the Asus Xtion sensor. Corridor and room are scanned through a robot equipped with Microsoft Kinect.

Table 1. Accuracy of camera tracking (ATE RMSE in centimeters) compared to different tracking methods on the TUM RGB-D dataset (with an Intel Core i7-4790 CPU). Bold shows the best results. X denotes uninitialized or lost tracking.

Table 2. Mean computing speed of camera tracking on the TUM RGB-D dataset (with an Intel Core i7-4790 CPU). Bold shows the best results.

Table 3. Accuracy of camera tracking (ATE RMSE in centimeters) via points and edges on the ICL-NUIM [38] and Augmented ICL-NUIM [12] datasets. Bold shows the best results. X denotes uninitialized or lost tracking.

Table 4. Accuracy of the estimated camera trajectories (ATE RMSE in centimeters) and surface reconstruction (median distance in centimeters) on the ICL-NUIM living room sequences [38]. Bold shows the best results.

Table 5. Accuracy of the estimated camera trajectories (ATE RMSE in meters) on the Augmented ICL-NUIM dataset [12]. X denotes lost tracking. Bold shows the best results.