Degeneration-Aware Localization with Arbitrary Global-Local Sensor Fusion

Global localization is a fundamental ability for mobile robots. Considering the limitation of single type of sensor, fusing measurements from multiple sensors with complementary properties is a valuable task for study. In this paper, we propose a decoupled optimization-based framework for global–local sensor fusion, which fuses the intermittent 3D global positions and high-frequent 6D odometry poses to infer the 6D global localization results in real-time. The fusion process is formulated as estimating the relative transformation between global and local reference coordinates, translational extrinsic calibration, and the scale of the local pose estimator. We validate the full observability of the system under general movements, and further analyze the degenerated movement patterns where some related system state would be unobservable. A degeneration-aware sensor fusion method is designed which detects the degenerated directions before optimization, and adds constraints specifically along these directions to relieve the effect of the noise. The proposed degeneration-aware global–local sensor fusion method is validated in both simulation and real-world datasets with different sensor configurations, and shows its effectiveness in terms of accuracy and robustness compared with other decoupled sensor fusion methods for global localization.


Introduction
Precise global localization is of high importance in many applications, such as navigation and mapping. When there is no access to a pre-built environment map, the robot can only obtain the global positioning information via global-aware sensors like global positioning system (GPS) and ultra wide band (UWB). Generally these sensors can access global positional information with bounded error. However, due to the installation limitation of the ground stations and the signal block from obstacles, their measurements might be intermittent and suffer from large noise. Besides, the 3 degree-of-freedom (3D) positioning measurements are inadequate for some applications that require 6 degree-offreedom (6D) global pose estimation. On the other hand, local sensors that measure the local environmental or kinematic information can achieve consecutive 6D pose estimation in real time [1][2][3][4][5]. However, the poses are estimated relative to the origins of the trajectories, which is unobservable in the global reference frame. The estimation results would drift along with time and distance.
In this paper, we set to integrate the global and local sensors to achieve 6D global drift-free localization. To do this, the extrinsic calibration between sensors is required, and so is the mapping that transforms the local estimations to the global coordinate. Besides, the scale of the local poses is also required since some local estimators such as monocular visual odometry can not recover it. Many researches investigate the sensor-fusion problem on specific sensor configurations, and the methods can be divided into two categories.
One kind of fusing methods tightly couple the measurements from different sensors under the filtering or optimization based frameworks [6,7]. Generally, the reference frames of the local estimators are directly transformed to the global coordinates after initialization. During localization, the poses of the local estimators are propagated on the global coordinate. These methods are usually carefully designed for specific sensor configurations, and cannot be generalized to arbitrary configuration. To the best of our knowledge, none of these methods are compatible with the local estimators that cannot recover the absolute scale. Another kind of methods utilize the outputs from different pose estimation systems as observation, and loosely couple them to achieve global localization [2,8,9]. In this way, the pose estimation systems are decoupled, and the erroneous estimations or measurements from one system would not affect the others. Besides, the loosely-coupled methods are applicable to arbitrary sensor configurations due to the independency of the pose estimation front-ends [8][9][10].
However, few of these methods have analyzed the observability of the global-local sensor fusion system. As noted in [6], the sensor fusion systems would still remain globally unobservable even if the global measurements are given. Besides, the observability of the system would degenerate when the robot moves along certain motion patterns [11,12]. Many studies analyze the observability to guide the online movements or calibration design [11,13], but few of them pays attention to avoiding degeneration from worsening localization performance. One possible solution to address degeneration is to add heuristic constraints on the system, e.g., priors like constant velocity or planar motion models, which however would introduce false constraints on observable subspace if the hypothesis is violated. Some works [14,15] address the degeneration occurred in laser odometry and calibration, respectively. They first determine the unobservable directions according to the eigenvalues of systems, and truncate the state update along with these directions. These methods theoretically prevent noise from deteriorating unobservable states, but the eigenvalue threshold for observability distinction might change from case to case, and the optimizer has to be adapted for the truncated update.
In this paper, we propose a novel global-local sensor fusion framework that estimates the translational extrinsic calibration, the scale of the local poses, and the relative transformation from the reference frame of the local estimator to the global coordinate within a sliding-window, which are utilized to map the real-time local poses to the global frame for localization. We prove the full observability of this formulation under general motion patterns. To avoid the degeneration from deteriorating state estimation, we distinguish the degenerated subspace before each optimization, and add prior constraints specifically along with the unobservable directions. We validate the effectiveness of the proposed method in both simulated and real-world datasets with different sensor configurations. The experimental results also show that our method can effectively restrain the drift along degenerated directions and work out robust global localization results. To summarize, the contributions of the paper are listed as follow • A novel global-local sensor fusion framework is proposed which is applicable to arbitrary sensor configurations. The observability and degenerated motion patterns of the proposed system are detailedly analyzed; • A degeneration-aware optimization approach is designed to avoid the sensor fusion framework being deteriorated by the noise on the degenerated directions; • Both simulated and real-world datasets with multiple sensor configurations are utilized to validate the generalization and effectiveness of the degeneration-aware sensor fusion framework for global localization.

Related Works
The study in this paper is relative to multi-sensor fusion and degeneration-aware state estimation. Considering the limitation of the single type of sensor, multi-sensor fusion is a widely researched area in robotics. Generally, the sensors selected for fusion possess complementary properties. For example, the monocular visual inertial system fuses external visual information with internal inertial information to estimate 6D pose with scale. In this paper, we pay attention to the fusion between local-aware sensors (e.g., light detection and ranging (LIDAR), camera, inertial measurement unit (IMU)), and global-aware sensors (e.g., GPS, Motion Capture) for localization. The fusion frameworks can be divided into filtering and optimization-based methods. Typically, most of the filtering-based frameworks are developed based on Extended Kalman Filter (EKF) [6,8,16] and utilize global and local information to update the propagated inertial data. To reduce the first-order linearization error occurred in the EKF, some systems achieve sensor fusion taking advantages of the other filtering solutions such Partical Filter [17,18] or Sigmapoint Kalman Filter [19,20]. Lynen et al. [8] propose a general multi-sensor fusion (MSF) method that can utilize arbitrary global or local measurements to update the propagated pose, in which the scale of the local estimator and the extrinsic calibration could also be estimated during the localization. However, the inertial instrument is indispensable and the observability is not validated in this method. Lee et al. [6] tightly couple the visual, inertial, and GPS information for global localization, and simultaneously estimates the extrinsic calibration and time-offset between sensors. Optimization-based methods are proven to be more accurate than filtering-based methods [21]. During optimization, the system states are iteratively estimated based on all of the valid measurements, which however require a large number of computational resources. Many works relieve the computational burden by only maintaining a sliding-window of states to achieve efficient optimization [9,22,23]. However, few of these fusion methods address the degeneration problem during online localization.
Observability is the vital property of dynamic systems which reflects whether the inner states of the systems can be estimated from the observed measurements [24]. In the global-local sensor fusion tasks, Lee et al. [6] prove that if directly adding the relative transformation between the reference frames of the visual inertial odometry (VIO) and GPS systems into the original VIO state for estimation, there still exists four unobservable directions even though the global measurements exist. They address this problem by transforming the states of the local VIO system to the global frame to achieve full observability, which is inherently equivalent to marginalizing out the relative transformation between the reference frames of VIO and GPS systems after initialization, and directly estimating the state of the local VIO system in the global frame. This parameterization is also utilized in some of the other global-local sensor fusion methods [7,10]. Another approach decouples the local and global estimators and estimates the relative transformation between their reference frames [25]. In this way the output of the high-frequency local estimator can be directly transformed to the global frame without latency, and makes the system more robust as the local and global estimators are decoupled. In this paper, besides the relative transformation between reference frames, we further add the scale of the local estimator and the extrinsic calibration between local and global sensors into the state, which makes the system applicable to arbitrary sensor configurations. We also prove its full observability under general movements.
Even though the fusion system is proved to be fully observable in general conditions, the observability would degenerate if the robot moves with insufficiently excited motion. Ref. [11] proves in detail that there exist several motion patterns that would introduce unobservable directions even if the global measurement is provided in the inertial navigation system (INS) aided framework. However, few works propose efficient solutions towards preventing degeneration from deteriorating estimation results. Refs. [14,15] address the degeneration problem by projecting the incremental update to the observable direction, which needs to modify the update step in general optimizers. In this paper, we automatically detect the degenerated directions before optimization, and specifically add constraints based on the calculated directions to prevent the unobservable parameters from being deteriorated by noise, which can be easily plugged in many existing systems developed with general optimizers.

Method
In this section, first the notations used throughout the paper are presented. Then we introduce the formulation of the global-local sensor fusion system and its detailed implementation for batch optimization. As a special discussion, we show that some commonly-appeared motions in robotics are actually degenerated for sensor fusion. Finally, we present our degeneration-aware optimization method that is designed to address the degeneration problem occurred during online localization.

Notation
The nomenclature used throughout the paper is defined in Table 1. We consider that the local estimator outputs 6D transformation T O C from the local sensor frame C to the reference frame of the odometry O. We assume that the global estimator outputs 3D positional measurement p W G of the global sensor frame G on the robot with respect to a fixed world frame W. To fuse the outputs from the global and local estimators, we estimate the extrinsic calibration T C G between sensors, the scale s of the local estimator, and the relative transformation T W O during online localization, which constitute the system state x. T(s) denotes that the transformation T is mapped to the real scale by s. We draw the relation of the frames in Figure 1.

C
The sensor frame of the local estimator; G The sensor frame of the global estimator; O The reference frame of the local estimator; W The reference frame of the global estimator, which also is the world frame of the localization system; L The first frame in the sliding-window;

T
The transformation matrix in SE(3), where T A B represents the relative transformation from frame B to frame A defined in frame A;

R
The rotation matrix in SO(3), where R A B represents the rotation from frame B to frame A; q Unit quaternion in Hamilton notation, with q A B corresponding to R A B ; p The system state; s The scale of the local estimator in R + ; t The timestamp of the received message; n The number of global measurements in the sliding-window; The thresholds used for degeneration detection.

Global-Local Sensor Fusion System
The proposed sensor fusion system is demonstrated in Figure 2. It receives outputs from arbitrary local and global estimators, and produces the global localization results at the highest frequency of the local estimator. We estimate the transformation T W O to transform the output of the local estimator to the global frame instead of directly estimating the current pose in the global frame. To restrict the computational requirement and estimate the scale accurately, the fusion process is carried out on a sliding-window of recent n global sensor measurements and the related local sensor frames within the corresponding time duration, and we denote the first local sensor frame in the sliding-window as L. The length of the sliding-window in this work is maintained according to the distance, but can be modified by any other criterion. Upon the sliding-window based formulation, we change the estimated relative transformation from T W O to T W L and the system state is maintained accordingly as s} Generally, the global sensors possess lower frequency than the local estimators. Therefore, the system performs the fusion process each time it receives the measurement from the global estimator. As shown in Figure 2, during each fusion process, we first align the newly received global measurement with the two closest local frames to obtain the accurate local pose at the exact timestamp of the global measurement. Then, we detect whether there exists degenerated subspace according to the information within the sliding-window. If the degeneration occurs, the corresponding degenerated directions would be calculated, which further are utilized to construct prior constraints for degeneration-aware optimization.

Measurement Alignment
We first align the global and local measurements according to the timestamps. The global measurement p W G k received at timestamp t k is aligned with two closest local mea-surements at timestamp t ka and t kb as shown in Figure 1. We interpolate the corresponding local pose T O C k at the timestamp t k according to in which slerp(·) denotes the spherical linear interpolation function [26].

General Batch Optimization for Localization
After aligning the newly received global measurement, the system state x = {R W L , p W L , p C G , s} is optimized according to the time-aligned measurements within the current slidingwindow. The error function constructed by the measurements at timestamp t k is formulated as Summing up all the valid error functions in the sliding-window, the cost function for the general localization based optimization is defined as in which Ω e denotes the information matrix that can be derived as the inverse of the measurement covariance matrix.ρ(·) represents the Huber's robust kernel function [27] which can suppress the effect of outliers. In this work, we minimize this non-linear least-square cost function iteratively for system state optimization based on the Levenberg-Marquardt algorithm [28,29]. Given the latest optimized state x, we can transform the continuous output of the local estimator T L C t to the world frame for real-time 6D global localization. The 6D global localization at timestamp t can be calculated as

Observability Analysis
For the optimization formulated in Equation (4), we first analyze its observability in the general occasion. We linearize Equation (2) at the current estimate to compute the Jacobian in which Ξ k corresponds to the Jacobian of R W L on Manifold We denote [a×] as the 3 × 3 skew-symmetric for vector a = [a x , a y , a z ] T , which is formulated as As the system state x changes slowly across time, the corresponding state transition matrix Φ(t 0 , t k ) between timestamps t 0 and t k is identity. The observability matrix M can be constructed following [30] As in general cases the columns in H k are linear independent. Thus based on Equation (9) it can be noticed that in general cases the rank of M achieves 10 when n ≥ 4. Therefore, when there are more than 4 global measurements in the sliding-window, the system is observable.

Degeneration Analysis
In this subsection, we summarize the common degenerated motion patterns in the robotics area and derive their corresponding unobservable directions.

Pure Translation along One Axis
This is a common motion pattern in autonomous driving. When the robot moves along a straight line, R L C i is invariant for any i ∈ [m, m + n − 1], in which m is utilized to denote the beginning of the sliding-window. The translational measurement of the local estimator can be derived as in whichp L m 0 is the unit vector of the translational direction and α i is a scalar variable that can be derived as α i = p L C i / p L m 0 . Thus, in H i the third block is constant, and the first two blocks are also linear dependent. The right null-space matrix N of the observability matrix M can be derived as M N = 0 can be verified as H i N = 0. Therefore, in this occasion, there is a four-dimensional unobservable subspace which is constructed as span(N). The formation of N shows that the first three column vectors are related to the translational components {p W L , p C G }, while the last column vector is related to one direction of R W L which is parallel to the translational directionp L m 0 .

Random 3D Translation
This would occur when the robot performs pure translation in 2D/3D space without rotating. In this case, Equation (10) We can find that R W L is observable now, and there remain three unobservable directions that are related to the translational components {p W L , p C G }.

One-Axis Rotation
When the robot turns along a fixed rotation axis ω ∈ R 3 with non-zero translational velocity, only the translational components along ω remain unobservable. In this case, R L C i can be derived according to the Rodrigues' rotation formula in which θ i denotes the rotation angle of R C m C i . As [ω×]ω = 0 3×1 , we can derive that multiplying the third block of H i by ω leads to The result is constant for any i ∈ [m, m + n − 1]. Thus, the null-space matrix that satisfies M N = 0 can be derived as It should be noticed that when the robot performs uniform motion, more dimensions maybe unobservable. One case is that, when both the angular velocity and the translational velocity are constant, the dimension of the unobservable subspace is expanded to 2 as one direction of R W L that is along with the rotational axis would be undistinguishable. Another example is that, when the directions of the constant angular and translational velocity are perpendicular, the scale of the local estimator would be unobservable. These are likely to happen when the robot performs planar rotational movement, such as a car turns with constant velocity in autonomous driving. Detailed proof of the two claims are derived in the Appendix A.

Degeneration-Aware Batch Optimization
When there exists an unobservable subspace in the system, the related states are unconstrained and susceptible to measurement noise. To address this problem, we first detect the degenerated directions before the optimization process, then add prior constraints specifically along these directions.
Similarly to [14], we calculate the eigenvalues {λ l } and the corresponding eigenvectors {v l } by performing SVD decomposition on the Hessian matrix H T H, in which H = [H T m · · · H T m+n−1 ] T is linearized at the latest estimated state before each optimization and l = 0 · · · 9. We assume that {λ l } is sorted and λ 9 corresponds to the smallest eigenvalue. Theoretically the eigenvalues corresponding to the unobservable dimensions are precisely zero, while in practice they are not because of the system noise. It is apparent that a constant threshold is not sufficient to distinguish zero eigenvalues for all kinds of systems, calling for more flexible construction of the threshold. We also quantitatively analyze the influence of noise on eigenvalues in Section 4.1.
We propose a heuristic method to detect the unobservable subspace as shown in Algorithm 1. The absolute thresholds a and b define the coarse but valid upper and lower bounds distinguishing the unobservable and observable subspaces. For the eigenvalue that distributes between the two bounds, it is difficult to directly categorize it by a threshold due to the unrestricted range of noise. We compute its ratio with the adjacent eigenvalues to indicate whether they have the same type of observability. If the ratio is larger than threshold r , we assume that the adjacent eigenvalues have the same type of observability. As the method traverses from the smallest eigenvalue, the eigenvalue is categorized as relating to the unobservable subspace if it shares the same type of observability with its previous one. After detecting out the degenerated subspace with dimension r, we collect the corresponding eigenvectors V u = {v 0 , · · · , v r−1 } which denote the degenerated directions. Instead of projecting the incremental update to the observable directions during optimization, we restrict the update in unobservable directions by adding constraints in which j ∈ [0, r − 1]. e R j , e p W Lj , e e j and e s j denote the constraints applied on R W L , p W L , p C G and s, respectively. (·) denotes the related state estimation before optimization. v j(a:b) represents the vector that is compromised with the ath to bth elements of v j . These error terms can be considered as priors to prevent the states from drifting in the unobservable subspace, and are only added for optimization when v j(a:b) > 0. 1 in which S R , S p W L , S e and S s denote the sets of activated prior constraints on the system states, respectively. Ω p and Ω ps are the information matrices of the prior constraints.

Experimental Results
We validate the effectiveness of our degeneration-aware sensor fusion method through both simulated and real-world datasets. In simulation experiments, we first show the sensitivity of the eigenvalues with respect to the measurement noise. Then we create four sequences of simulated trajectories that correspond to the common degenerated motion patterns analyzed in Section 3.4, and utilize them to demonstrate the effect of degeneration on state estimation and the effectiveness of our proposed method in terms of preventing the state from drifting along the degenerated directions. In real-world experiments, we show the localization performance on indoor EuRoC [31] and outdoor KAIST [32] datasets. EuRoC is a visual-inertial dataset collected by a micro aerial vehicle that performs 6D movements. We utilize the 3D Leica MS50 laser tracker as the 3D global estimator and only utilize the left images to perform monocular visual odometry, which serves as the local pose estimator with no access to the absolute scale. KAIST dataset is collected in outdoor urban environment on a car and we utilize the GPS as global estimator and stereo-IMU information to perform local visual-inertial odometry. During the experiments, we set the length of the sliding-window as 10 m for indoor dataset and 50 m for outdoor datasets.
All the experiments are executed on a laptop with a 2.70 GHz Intel Core i7-7500U CPU. The codes are implemented in C++ and the optimization process is implemented with Levenberg-Marquardt algorithms in g2o [33].
In the simulation experiments, different motion patterns are controlled by setting different translational and rotational velocities for the local sensor. The minimum sampling interval is 0.1 s, and the ground truth of the trajectories can be integrated based on the velocities. The trajectories of the global sensor can be calculated based on the ground truth of the system state using Equation (3). The global measurements are simulated by corrupted the ground truth of the trajectories with Gaussian noise. To simulate the drift of the local estimator, we add Gaussian noise on the relative pose in each sampling interval, and integrate the corrupted relative pose to formulate the local measurements. As the ground truth of the system state can be accessed in the simulation experiments, the performances of different methods are evaluated by comparing the state estimation errors.
In real-world experiments, we evaluate the performances of different methods by the absolute trajectory error (ATE) and rotational error [34]. For each global pose T W C t estimated at timestamp t, we find its corresponding ground truth pose T W * C t provided by the datasets according to the timestamp. The localization error is derived as The ATE error relates to the norm of the translational component in E t , and the rotational error relates to the rotational component in E t .

Sensitivity of the Eigenvalues
To quantitatively demonstrate the influence of the noise on the eigenvalues, we create simulated trajectories that are related to two types of the degenerated motion patterns shown on the first row of Figure 3. We add Gaussian noise on the simulated measurement and show the changing of the related eigenvalues. As the derivation of eigenvalues relates to the measurements of the local estimator, we, respectively, add Gaussian noises with standard deviations from 0 to 10 cm on the translational parts of the local measurements, and add Gaussian noises with standard deviations from 0 to 6 deg on the rotational parts. To avoid the influence of randomness, for each standard deviation we generate 200 trajectories for eigenvalue evaluation. Figure 3a shows the trajectory generated when the robot moves along a straight line and the eigenvalues drawn in Figure 3c,e are influenced by translational and rotational noises on the local estimators with different standard deviations, respectively. To keep the scale of the axis for better demonstration, we do not draw the value of the 5th eigenvalue influenced by the rotational noise in Figure 3e, which are around the same values as shown in the results that influenced by the translational noise. The right column includes the results tested when the robot moves with random one-axis rotation. The trajectory is shown in Figure 3b. The eigenvalues drawn in Figure 3d  Based on our analysis in Section 3.4, when the robot moves along a straight line, there should be four unobservable directions, which means that the smallest four eigenvalues should be close to zero. When the robot moves with non-constant one-axis rotation, there should be only one unobservable direction. We, respectively, draw the smallest five and three system eigenvalues tested on the two trajectories in the two columns of Figure 3. The results show that when the noise is small, the eigenvalues corresponding to the unobservable subspace are close to zeros (i.e., the 6-9th eigenvalues on the first column and the 9th eigenvalue on the second column), and become larger when the noise grows. Although the eigenvalues corresponding to the observable subspace are generally large. Thus we could set the coarse upper bound a and lower bound b to distinguish eigenvalues. However, we can notice that the 6th eigenvalue on the left column is close to the 7th and 8th eigenvalues on the right column. If we utilize one constant threshold, it is difficult to distinguish the unobservable and observable subspaces in this situation. In our proposed detection method, the difference between the adjacent eigenvalues is taken into consideration, which contributes to handle this problem. Based on the analysis of these results, we set a = 5, b = 10 −2 , r = 0.1 throughout the left of the experiments.

Simulation Results
In this subsection, we utilize simulated data collected under degenerated movement patterns to perform ablation study on the effectiveness of the proposed degenerationaware sensor fusion. Specifically, we evaluate the accuracy of the estimated system state optimized with general optimization formulated in Equation (4) and the optimization with the degeneration-aware constraints (DC) in Equation (17), respectively. The simulated data includes four types of trajectories in which the robot moves (a) along a straight line, (b) with random 3D translation, (c) with perpendicular constant rotational and translational velocity, and (d) with random one-axis rotation. The motion patterns are controlled by the translational and rotational velocities. We set the sampling time as 0.1 s and the trajectories are integrated based on the velocity. We add zero-mean Gaussian noise on the relative pose within each sampling time to simulate the measurement noise of the local estimator and integrate the relative poses to simulate the local measurements. The 3D global measurements are collected every 1 s along the ground truth of each trajectory and are also corrupted by the noise. The standard deviations of the rotational and translational noises for the local estimator are 0.05 rad and 0.05 m. The standard deviation of the translational noise for the global estimator is 0.5 m.
The estimation errors for the system state x are drawn in Figure 4 and we represent the rotational error of R W L in Euler Angle parameterization denoting as "yaw", "pitch", and "roll" in the first three rows of the figure. Based on Equation (11), when the robot moves along a straight line, the dimension of the unobservable subspace is four, which relates to one rotational axis of R W L along the moving direction and 3D translational directions that are coupled in p W L and p C G . These can be reflected from the results in the first column of Figure 4 indicated by dashed lines which do not converge in the related dimensions. On the other hand, the results optimized with the proposed degeneration-aware constraints, though could not converge to the ground truth in the unobservable directions as no valid information is provided, do not diverge during the whole process. The results in the second to fourth columns also validate our analysis in Section 3.4. Additionally, it is interesting to notice that, as one direction of the unobservable space might be related to several state variables, r dimensions of unobservable directions would cause the divergence of more than r state variables.

Real World Experiments
We evaluate the global localization performance of our proposed method on sequences MH01-MH05 in EuRoC datasets and sequence Urban 38 in KAIST dataset.We implement the monocular visual odometry based on ORB-SLAM [1] and the stereo visual inertial odometry based on Openvins [3]. We utilize the absolute trajectory error (ATE) [34] to evaluate the global localization performance and compare our results with the opensourced multi-sensor fusion methods MSF [8] and VINS-Fusion [10].

EuRoC Dataset
In EuRoC dataset we utilize monocular vision to perform local pose estimation. The output of the monocular visual odometry does not include the absolute scale, which can not be utilized for global localization in many other global-local fusion methods [7,9,10]. Thus, to test the performance of VINS-Fusion [10], we change its local estimator into VINS-MONO [2], which is the classical monocular VIO provided by the VINS-Fusion system.
We draw ATE errors evaluated by different methods in Figure 5, and list the rotational errors in Table 2. The results of VINS-MONO is evaluated by aligning the trajectories to the ground truth. The results show that our method could successfully fuse 3D global positions and 6D scaleless local poses for global localization, and outperform the compared methods in terms of both translational and rotational localization accuracy. We can notice that in many sequences VINS-MONO also performances well as the trajectories are not long, but the fusion results in VINS-Fusion show larger errors as the extrinsic parameter is not estimated in VINS-Fusion. Besides, compared with MSF and VINS-Fusion, our method does not require IMU data to support sensor fusion, which further demonstrates its efficiency and versatility for sensor fusion based global localization. We also perform ablation study to evaluate the effectiveness of the degeneration-aware optimization. We first evaluate the localization performance tested on the general batch optimization formulated in Equation (4). The results are denoted as "no prior" in Figure 5. Besides, to validate the benefit of adding prior constraints specifically on the degenerated directions, we also test the localization performance by adding the full prior constraints for the whole system state. The results are denoted as "full priors" in Figure 5. From the results we can infer that, the degeneration constraints can successfully improve the localization accuracy as the influence from the noise in degenerated directions is restrained. Besides, adding constraints especially on degenerated directions can also improve the localization accuracy as the observable elements in the system state could be optimized directly without influenced by heuristic constraints which might be inaccurate. In KAIST dataset, we utilize the GPS as the global estimator, which is prone to drift when the car stops at the crossroads. Additionally, the GPS signal is inaccurate when the covariance is large. Therefore, during online localization, we drop the GPS messages with large covariance (the threshold is set as 60) and the drifted messages indicating that the car is moving while the local estimator denoting that the car does not move.
We list the localization performance of GPS and the stereo-IMU version of Openvins [3] in Table 3 as baseline, and also compare our results with MSF [8], VINS-Fusion [10], and full priors aided results. The translational and heading errors are listed in Tables 3 and 4. The corresponding trajectories are drawn in Figure 6. The dropped GPS signals are also marked in the trajectory.  Compared with EuRoC datasets, the robot performs planar movements in Kaist dataset, in which the degeneration would likely occur. Thus, the results of "no prior" are largely influenced by the degeneration and fail to perform the whole localization. The results also show that our method achieves better performance than the compared methods and could provide drift-free global localization results.

Computational Cost Evaluation
To validate the ability of achieving real-time localization, we evaluate the computational time cost on the multi-sensor fusion process. The computational time is defined as the duration between the time at which a new global measurement is received and the time the system state is updated by the optimization results, which is independent with the processes in the front-ends. We collect the computational time tested on the KAIST dataset and draw the results in Figure 7, in which we specifically color the the time cost on "measurement alignment" and "degeneration detection" as red. As the fusion process is taken place within a sliding-window, the computational time would not increase with time. The average processing time is 18.4 ms, which satisfies the requirement of achieving real-time global-local sensor fusion. Besides, the average time of "before optimization" is 0.45 ms, which demonstrates that our solution for the degeneration motion would not introduce large computational burden towards the fusion process.

Conclusions
To achieve drift-free 6D global localization, in this paper we propose a general optimization-based framework that fuses the 3D intermittent global positioning information and 6D local odometry for real-time global pose estimation. The system state is specifically formulated to be fully observable under general motions. To address the impact from generated motion patterns, we propose a degeneration-aware solution to robustly detect the degenerated directions that are further utilized as indicators to add prior constraints. Though extra time is cost for degeneration detection, the whole processing time could satisfy the requirement of real-time sensor fusion. The effectiveness of our proposed method is validated on both simulated and real-world dataset. In the future work, we set to extend the configuration of our framework from 3D-6D global-local combination to multiple types of pose estimators with general degree of freedom, and utilize the multi-sensor information to further detect the failure in the local pose estimators for outlier rejection.

Conflicts of Interest:
The authors declare no conflicts of interest.

Abbreviations
The following abbreviations are used in this manuscript:

Appendix A. Derivation of the Degeneration Analysis on One-Axis Rotation with Constant Velocity
As the rotational velocity ω and translational velocity v are constant in this motion pattern, the robot can be considered as moving along a spiral curve or a circle. In the following, we denote that for a 3D vector a, it can be decomposed as a = a ω× + a ω , in which a ω× represents the component that is perpendicular to ω and a ω represents the component that is parallel to a. As [a×]ω = [a ω× ×]ω, we can derive Ξ k ω as in which we set p L G k ω× = R L C k p C Gω× + sp L C k ω× . We draw the projected trajectory in Figure A1b, which should be a circle with the radius ρ. As the following derivation only focus on the variables on the projected trajectory, we reuse the notations of L, C, G to represent their projected variables. For the sake of the derivation, we define a new reference frame R. The position of R is defined as the center of the projected circle, and we set its orientation the same as the reference frame L. As p R RC k = p R RL , we can transform p R RC k to R L C k p R RL . p L G k × can be decomposed as in which p A BD denotes the vector from frame B to frame D representing in the coordinate of A. As R L R = I 3×3 , Equation (A2) equals to Thus, Equation (A1) can be derived as As R L C k rotates along ω, we can achieve that (R L C k ) T ω = ω, which leads to Thus besides Equation (15), the other vector that compromises the null-space matrix is which demonstrates that besides Equation (15), there exists one dimension of state R W L which is along with the rotational axis ω becoming unobservable. If the rotational and translational velocity is perpendicular, the trajectory is equal to a circle which we also demonstrate in Figure A1b. We can decompose p L Thus, under this circumstance, the null-space matrix is further expanded with which means the scale is degenerated to be unobservable. This completes the proof.